Sound synthesis for games

Andy James Farnell, UK

Abstract

This work presents an overview of real-time client-side synthetic sound for use in games and interactive applications. Creating and managing native sound synthesis code for immediate client-side execution is a large-scale programming and design task. We need a good understanding of sound design practice to solve the problem optimally, but this is not sound design, nor is it music. We wish to sketch out a systematic way of attacking the general programming problems of procedural sound, for the general case of sound effects. There are two commonly used sources of sound: digital samples made by recording, and synthesised sounds created from first principles. While the former is currently the preferred method for most games, our focus is the second kind of sound generation. This approach has many benefits and tradeoffs. We can demonstrate that as complexity increases it eventually becomes more effective and efficient to synthesise many sounds on the client machine. We will remain aware of the most complicated and difficult application, multi-player network games, where a proper solution must allow for object replication across multiple clients and deal with issues like consistency, latency and client CPU usage.

1 Synthesis

Traditionally, using sampling, we obtain a file containing recorded data. Collections of these files can get very large. Synthesis on the other hand uses a much smaller file. In many important regards the sound has been "compressed" to a small set of rules which define its generation procedure. A synthesised sound is no more than an equation or chunk of code which evaluates a function of time. When executed on the client the output is a sound signal which the player hears through the audio system. When wrapped in a layer of code to provide meaningful parametric access, and consistent behaviour patterns for certain events, we call this object a "synthesiser". A "DSP engine" is the part of the client software which does the actual synthesis. Services like ALSA, Csound(1), Puredata(2) and so on can provide either libraries or daemons to interpret the synthesis instructions and manage the audio output for us. Unlike the traditional design of synthesisers for musical applications we never craft a GUI for the user to play with. The sound designer may use a GUI to program new behaviours, but that is part of the development toolkit. The interface we create for our synthesisers is never seen in the game world; it is a set of events and parameters derived from the physics engine, collision detection and other object properties. Sound design then becomes a task of imbuing existing world objects with information on how to synthesise their sound, by creating methods which call on our synthesiser building blocks. In many cases, where the default parameters are taken from textures, mesh geometry and so on, the design of sounds for objects becomes automatic. Here lies the real power of the synthetic method: the sound designer need only specify rules of scale and combination to have a great many objects update their sonic behaviour.
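To make the idea of a sound as "a function of time" concrete, here is a minimal sketch in Python. The function name, the parameters (pitch, decay, noise mix) and the 44.1 kHz sample rate are illustrative assumptions, not part of any engine described in this paper.

```python
import math
import random

SAMPLE_RATE = 44100  # assumed output rate in Hz


def impact_sample(t, pitch_hz=220.0, decay=8.0, noise_mix=0.2, rng=random):
    """Value at time t (seconds) of a decaying tone mixed with noise.

    All parameter names are hypothetical; a real synthesiser would expose
    whatever controls the physics engine can sensibly drive.
    """
    envelope = math.exp(-decay * t)                 # exponential decay
    tone = math.sin(2.0 * math.pi * pitch_hz * t)   # sinusoidal component
    noise = rng.uniform(-1.0, 1.0)                  # white noise component
    return envelope * ((1.0 - noise_mix) * tone + noise_mix * noise)


def render(duration_s, **params):
    """Evaluate the function at every sample instant of the duration."""
    n = int(duration_s * SAMPLE_RATE)
    return [impact_sample(i / SAMPLE_RATE, **params) for i in range(n)]
```

Calling `render(0.5, pitch_hz=440.0)` yields half a second of signal from a few lines of code and four numbers, which is the sense in which the sound has been "compressed" to its generation procedure.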

2 Efficiency

It is possible to run many synthesisers in parallel, typically hundreds, but unlike sample playback, which has a fixed resource cost per channel, a synthesiser's resource cost is variable. A single complicated synthesiser may consume many times the resources of the simplest one. On the other hand very simple sources can have hundreds or thousands of instances created. Contrast the CPU intensive nature of synthesis with the disk intensive nature of traditional sample replay. In fact synthesis trades off time against space so dramatically that it is pretty much the converse of the current state of games sound programming, where disk IO is the bottleneck. Synthesis objects can be so small and reusable that a properly implemented synthetic sound system needs to make almost no disk access at all. The upshot is that the sound designers are now properly fighting the graphics people for CPU cycles, which the graphics people should be made to see as a wonderful opportunity to reclaim disk bandwidth. Sound scenes developed in my studio typically run on a 500MHz processor. The assignment of this low spec machine to games sound development was a deliberate, although sometimes frustrating, decision. It reflects an assumption that the typical user machine is a 1GHz device, so everything achieved can be implemented using half the CPU cycles available. This has been very important in driving research towards efficient methods.

3 Level of detail

In 3D graphics processing the rendering stage is able to optimise by not bothering to draw distant objects in so much detail as closer ones. We have a direct analogy with procedural sound synthesis. Objects that produce sounds far away, or as part of an ensemble can be tailored in complexity to become simpler as less detail is required from them. The rules for doing this are somewhat different than those for graphics, but using psychoacoustic knowledge of masking effects, spectral prominence and relevance(3) we can produce an analogous effect. As complexity increases this gains a distinct advantage over the fixed cost per channel of mixing samples, and it allows the processing resources to follow the "focus of play".

Assigning continuous control to level of detail can also produce some striking effects that sound much better than simply mixing samples. The psychoacoustic effect of an ensemble of objects can be quite different from just creating many instances of a sample and playing them together. This allows us to make spectacular optimisations in defining the behaviour of groups or collections. One LOD trick is a straightforward substitution ladder where, say for rain, every 10 raindrops are replaced by one synthesis object which simulates 10 raindrops, every 10 of those by one that handles 100, and so on. In some cases we can build those rules into the sound object so it can self optimise and create "scalable extents".
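The substitution ladder can be sketched as a simple positional decomposition. This is only an illustration of the bookkeeping, assuming a fixed group size of 10 and a hypothetical cap on the number of tiers; it says nothing about how each tier's synthesiser is built.

```python
def substitution_ladder(count, group=10, max_tier=3):
    """Split `count` sources into tiers of grouped synthesis objects.

    A tier-k object stands for group**k individual sources. Returns a list
    of (sources_per_object, number_of_objects) pairs, largest tier first.
    `max_tier` is an assumed limit on how coarse an object may be.
    """
    tiers = []
    remaining = count
    for k in range(max_tier, -1, -1):
        size = group ** k
        n, remaining = divmod(remaining, size)
        if n:
            tiers.append((size, n))
    return tiers
```

For 1234 raindrops this gives one object simulating 1000 drops, two simulating 100, three simulating 10 and four single drops: 10 running objects instead of 1234.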

4 Complexity

4.1 Synthesis complexity

The number and arrangement of basic DSP atoms used to synthesise a sound is its synthesis complexity. It can vary from one noise (random number) generator and a multiplication to hundreds of lines of signal processing code. As a rule of thumb, a synthesiser for most common sound classes occupies between 10 and 100 basic operations, with about half of these being arithmetic or linear functions and a dozen or so trigonometric or higher order functions. A moderately complex sound scene occupies around a thousand operators. The complexity can depend on how much detail is needed, so pieces of glass landing close to a player will be synthesised with more complexity than fragments falling in the distance, which may be approximated by cheap noise based effects. Just like detail in a painting or 3D graphics scene, this method works very well for sound, and my own "Vincent Van Gogh" attitude to synthetic sound design has made it perfectly clear that it's possible to substitute features of a sound scene with very broad strokes and yet obtain an unambiguous result. These decisions require human input from the sound designer and are not very amenable to computation, but some rules are emerging and happily they tally with other research in psychoacoustics.

4.2 Parametric complexity

Each of the synthesis units needs to be controlled, which is where we meet parametric complexity. Parameters are the variables we make visible; some may be global broadcast variables with a UID to denote their special name while they exist. A parameter can be a very straightforward mapping to a DSP operation like a filter cutoff frequency, but more generally we design a parametric interface to have as few high level values as possible, each human understandable for a particular task. For example we might reduce a running water model from scores of sine wave pattern matrices to just three controls, including "flux" (rate of change of volume) and "depth", or we may have a general spherical membrane model we use for balloons, bubbles and footballs which accepts parameters suited both to the physics engine output and to the human designer's understanding, in a form like "diameter", "elastic compressibility" (Young's modulus), "pressure" and "material density". The mapping between events and parameter changes can be many to one or one to many. Things become parametrically connected when they interact or form relationships. These might be collisions, proximity, or unusual physical triggers like temperature differential. In all cases the parameters of one or more synthesis objects become codependent. An exchange protocol between objects for particular events should have mirror image relationships, like "Hit" and "Was Hit By" being pointers to the same event data. Separation may be as important as impact; consider a plucked string or getting out of water. While coupled, two objects can slide to produce frictional excitation, or transmit sound from one to the other. Maintenance of relational and state information is certainly not the job of the sound engine; instead it takes parameters from existing game world objects at a low level.
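The mirror image relationship between "Hit" and "Was Hit By" can be sketched as two views onto one shared event record. The class and field names below (`CollisionEvent`, `impulse`, `SoundObject`) are hypothetical; in practice the record would come from the physics engine.

```python
from dataclasses import dataclass, field


@dataclass
class CollisionEvent:
    impulse: float  # hypothetical value supplied by the physics engine
    time: float     # game time of the contact, in seconds


@dataclass
class SoundObject:
    name: str
    hit: list = field(default_factory=list)         # events where this object struck another
    was_hit_by: list = field(default_factory=list)  # events where this object was struck


def record_collision(striker, target, impulse, time):
    """Create one event and reference it from both participants."""
    event = CollisionEvent(impulse, time)
    striker.hit.append((target.name, event))
    target.was_hit_by.append((striker.name, event))
    return event
```

Because both lists hold references to the same `CollisionEvent`, neither object carries a private copy that could drift out of step with the other.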

4.3 Organisational complexity

Many of the sounds in a synthetic system are not point sources playing a noise and attached to objects as in standard design; they are the emergent result of the interaction of objects. It must be remembered that sound is a subset of dynamics: nothing makes a sound without somebody or something doing something, moving or changing energy state. Entity relationships, particularly entity-action events, are therefore central to the sound creation process. We must remember that things like car engines, fans etc are machines and are self contained running objects because they have a supply of energy. Sticking within this thoroughly physical framework makes it easy to see the right coding decisions. But we need to allow for special cases. Although it's a noble goal to unify sound with general synthesis methods closely coupled to the physics engine, it's unreasonable to expect real systems to survive without the facility for unusual and unique sounds to be handled by special code or by samples. "Difficult" sounds like some animals or human speech are good examples. No synthesis system is going to replace voice actors anytime soon, possibly never, except as the voice of an electronic narrator. Distributed phenomena that vary massively in scale, like fire, rain, wind and running water, turn out to be surprisingly easy, but must be carefully managed for scaling and LOD or they can easily overwhelm the synthesis engine. We want our system to allow those special cases within a general framework such that defaults exist to fall back on, but these may be overridden by more specific local data. It turns out managing this is not so hard to do when synthesiser DSP chunks are a priori labelled with their execution cost, but otherwise quite difficult. Complexity may also increase into hierarchical structures, like relationships forming a tree. For example objects inside objects inside objects would have the potential to form deep and expensive structures to navigate, so we need to be able to make sensible decisions about how to cull them. A specific example might be: having put the box of matches inside the briefcase inside the cupboard, can we hear the matches shaking if we move the cupboard? Such factors may be interestingly modified by object relevance variables and ideas that lie entirely outside real world thinking.
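When each DSP chunk carries an a priori cost label, deciding what to cull reduces to budgeting. The following is a minimal sketch under stated assumptions: the cost units, the `relevance` score and the greedy policy are all illustrative choices, not a description of an existing engine.

```python
from dataclasses import dataclass


@dataclass
class SynthRequest:
    name: str
    cost: float       # a priori execution cost label, arbitrary units (assumed)
    relevance: float  # hypothetical importance score, higher is more important


def select_within_budget(requests, budget):
    """Greedily admit the most relevant requests that still fit the budget."""
    chosen, spent = [], 0.0
    for req in sorted(requests, key=lambda r: r.relevance, reverse=True):
        if spent + req.cost <= budget:
            chosen.append(req)
            spent += req.cost
    return chosen
```

With a budget of 80 units, footsteps (cost 20, relevance 0.9) and rain (cost 50, relevance 0.6) are admitted, while the matches rattling deep inside the cupboard (cost 40, relevance 0.1) are culled. A greedy pass is not optimal in general, but it is cheap and predictable.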

5 Dynamics

As a player or object moves about a games level it comes into proximity with new objects. This creates a dynamic graph of entity relationships which describes the sounds it might be possible to hear at that point in time. As an example, moving from the inside of a steel tank to a large wooden walled building changes many of the parameters to be used for environmental impact sounds, like dropping an item. We wish to construct our synthesis objects only as we need them, which means we are empowered greatly by having many small reusable chunks. A dynamically reconfigurable DSP environment like Puredata is an essential part of a realistic sound synthesis engine for games. We need to carefully manage the scope of DSP atoms, dynamically creating new ones as needed, and quickly cleaning up unused operators. This works because real world sound objects can be broken down into component signals which behave in different ways according to factors like distance (level of detail) or excitation method. For example consider a tin can, falling, bouncing, rolling or crumpling. All of these sound capabilities can be built from the same basic synthesis object, which presents the required parametric interface allowing all these different excitation sequences to happen. We would expect things as simple as "scale" to let us obtain the sounds for a large oil drum as easily as a small tin can, or to be able to use the same object with "material" set to "wood" to get the sounds of a wooden barrel. It should be noted that objects changing size are rare in reality, with a few exceptions like inflating a balloon or organic growth, and we often choose to model explosions as a change in space corresponding to a pressure wave. Objects changing material makeup are perhaps even rarer and only encompass chemical and nuclear reactions, but we should remain mindful of state changes too; for example melting and boiling are perfectly plausible phenomena to model in a game.

Reconfiguring relationships like containment and coupling is relatively easy. Moving one object inside another changes the formant filters which model environmental effects such that the sound inside takes on some of the characteristics of the containing object. Dynamically reconfigurable physical geometry has been demonstrated long ago on such systems as Cordis Anima(4), and engines like Puredata give us plenty of scope for managing this sort of thing too. It is particularly important in the special event case of fragmentation where a single object may be replaced with many smaller ones, each inheriting the material characteristics of its parent but broken into smaller sized pieces. We have been able to demonstrate good effects like glass shattering following this principle.
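Fragmentation with inheritance might be sketched as follows. The volume-conserving size rule (child size equals parent size divided by the cube root of the number of pieces) is an assumption made for illustration, as are the class and field names; real fragment sizes would come from the physics engine.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SoundingObject:
    material: str
    density: float  # kg per cubic metre (assumed unit)
    size: float     # characteristic length in metres (assumed unit)


def fragment(parent, pieces):
    """Replace one object with `pieces` equal children.

    Each child inherits every material property of its parent; only the
    size changes, scaled so that total volume is conserved.
    """
    child_size = parent.size / pieces ** (1.0 / 3.0)
    return [replace(parent, size=child_size) for _ in range(pieces)]
```

Breaking a 1 m glass object into 8 pieces yields eight glass objects of 0.5 m each, ready to be handed to the same synthesiser with a smaller "scale".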

6 Scheduling

Objects are scheduled and run just like any other process; they must be deleted when dead and garbage collected, which is handled in many sound engines already. This could present a difficult to understand, over-complicated system if there are two versions of that reality, one in the object and one in a separate sound synthesis part. To solve this we must define each game object to have an extensive and authoritative version of what sounds it would like to make. In other words the object's version of its sound is speculative, and will be delivered by the sound synthesis system if and only if it can according to its settings. This apes the behaviour of existing graphics code, where the game engine still runs regardless of the graphics capability but the user gets a graceful degradation of quality for lower machine power. This is achieved by strictly observing the simple rule that no part of the sound code in any game world object ever conditionally depends on the state of another. In other words the definition of the desired sounds and their implementation are utterly decoupled.

7 Sound rendering

Current games technology allows for the attachment of looping samples to point sources which can then be attached to objects. These can change state (and play a different sample) or play a single one-shot sample to depict a specific sound event. Such sources are usually placed around by the level designer to add ambiences, or attached by hand to specific event methods of objects, like "dropped" or "activated". The sound rendering process for synthetic sound is directly analogous to graphical rendering. Objects exist within an environment and a set of events define actions on or between those objects. The difference is that instead of calling disk based data up for replay, synthesis code is summoned to create the sound effects. The first thing to note is the huge space advantage obtained by continuous parameterisation rather than categorical sample selections. The example I often give is of "footsteps", which are an age old film and game tradition. The footstep sounds of actors are a most essential semantic binding to create the immersive experience, and a lot of work has gone into making footstep sounds good, meaning realistic and appropriate. A typical game may have 5 actors, 10 movement speeds and 20 surfaces, which needs 5 x 10 x 20 = 1000 samples, but all of these could be simulated by one small piece of code which is dedicated to the task of simulating the pressure curves of human feet on various surfaces. This exists either in the "shoe" object, which is attached to the bottom of the actor's legs, or in the surface object itself, or in some models in both. In the rendering process the player viewpoint, modelled as a binaural receiver pair for ears and linked to head direction, velocity (for Doppler relativity) and position, is the final mix destination. This is already typical in most games designs. In more advanced technologies environmental preprocessing is done taking into account local geometry. For this to happen the world geometry must be reduced to a simplified mesh matrix which is used to create an impulse response, a characteristic of the reverb appropriate to that place. The object sources and receiver (player) can then easily be localised within that room, sounding very much as they would in reality. Unfortunately this is expensive to do, and at this point a few games compute local geometry impulses natively but most rely instead on DSP capabilities built into the better soundcard hardware, which is cheating and a poor substitute for doing it properly.
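The footstep arithmetic from this section, and the idea of replacing categorical samples with continuous parameters, can be sketched as below. The mapping inside `footstep_controls` is entirely hypothetical: the constants and the three output controls are placeholders for whatever a real pressure-curve model would need.

```python
def sample_count(actors, speeds, surfaces):
    """Number of recordings needed if every combination gets its own sample."""
    return actors * speeds * surfaces


def footstep_controls(body_mass_kg, speed_m_s, surface_hardness):
    """Map continuous game state to synthesiser controls (illustrative only).

    surface_hardness is assumed to be normalised to the range 0..1.
    """
    hardness = min(max(surface_hardness, 0.0), 1.0)
    return {
        "contact_time_s": 0.3 / (1.0 + speed_m_s),       # faster steps are shorter
        "peak_force": body_mass_kg * (1.0 + speed_m_s),  # heavier, faster steps hit harder
        "brightness": hardness,                          # harder surfaces sound brighter
    }
```

`sample_count(5, 10, 20)` gives the 1000 samples of the categorical approach, whereas `footstep_controls` covers any actor, speed or surface, including ones nobody recorded, with a single function.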

Properly means taking into account a great deal more physics than is presently done by the discrete and heuristic type of coding employed in games. The natural place for sound synthesis is closely linked to the physics engine, since we are already computing eigenvalues for bouncing, rotating objects and doing calculations for material collisions this is a natural source of input to the sound synthesis system. One example to consider is "underwater effects". A model that implicitly takes into account the speed of sound in the medium (using varispeed delay buffers as couplings) can instantly adapt to having all objects suddenly be immersed in water (the speed of sound changes amongst other things), and this can even extend to explicitly modelling phenomena like refraction and per object Doppler effects. This is a very powerful position to be in and it totally changes the role of the sound designer.
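A sketch of how a medium-aware coupling might compute its delay and Doppler shift follows. The speeds of sound are approximate textbook values (about 343 m/s in air at room temperature, about 1480 m/s in fresh water); the function names and the 44.1 kHz sample rate are assumptions for illustration.

```python
SPEED_OF_SOUND = {"air": 343.0, "water": 1480.0}  # metres per second, approximate


def delay_samples(distance_m, medium, sample_rate=44100):
    """Length of the delay line coupling source and listener, in samples."""
    return distance_m / SPEED_OF_SOUND[medium] * sample_rate


def doppler_ratio(source_speed_toward_listener, medium):
    """Frequency ratio heard by a stationary listener for a moving source.

    Positive speed means the source approaches the listener.
    """
    c = SPEED_OF_SOUND[medium]
    if abs(source_speed_toward_listener) >= c:
        raise ValueError("source speed must be below the speed of sound")
    return c / (c - source_speed_toward_listener)
```

Switching the medium from "air" to "water" shortens every delay line by a factor of roughly four and reduces the Doppler shift for the same source speed, with no change to the objects themselves.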

There is a difficult research problem attached to this which is somewhat philosophical, and so I call it the "one hand clapping problem". This asks "Which object creates the sound?". In reality any physical interaction produces sound by excitation of both objects; examples to consider are a football hitting the ground and a football hitting a window. In the former case it's clear that although the ground makes the sound of itself being hit by a football, and the football makes the sound of itself hitting the ground, those two sounds might easily be subsumed into a single computation by a simple rule of precedence. In other words the football takes responsibility for synthesising both. But in the latter case we encounter subsequent events that are causally dependent on prior actions, namely the window breaking, and we have to watch out for proper causal linkage in the sound too. Fragmentation can also be modelled in a highly detailed way with continuous parameters, with each fragment, or a percentage of the fragments, becoming an object with its own sonic properties derived from its parent. Combined with careful LOD management this can create stunning sound-action correspondence without being too expensive.

8 Material modelling

Every object or piece of world material in a game or film has a sound it would make if you interacted with it. Even the air is a sound source; think of wind or a whip swoosh. Game code defines a finite set of interactions between world objects, and as it happens we need only a handful to cover a vast range of possible sounds. In scale they are singulars, composites and extents. In excitation they are various kinds of hits (unit impulse, noise impulse, chirp impulse) and frictions for scraping and sliding (noisy nn coupling, noisy n coupling), and for materials they are quite standard human categoricals like metal, wood, plastic. Materials can be arranged hierarchically according to their generality, and often a continuous parameter can be specified to replace a categorical, so for example we discover that there is a category for "hardwoods" vs "softwoods" and collapse this into the density parameter of the material. More exotic material parameters such as "thermal expansivity" or "melting point" might only be relevant on a few occasions and should default to a medium value of 0.5, which works for most world situations and is only overridden for specific cases.
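A hierarchical material table with fall-back defaults could look like the sketch below. The material names, the normalised values and the table layout are illustrative assumptions; only the idea of a default of 0.5 that specific entries may override comes from the text.

```python
# Global defaults, used when no material in the chain specifies a value.
DEFAULTS = {"density": 0.5, "thermal_expansivity": 0.5, "melting_point": 0.5}

# Hypothetical hierarchy: each entry names its parent and overrides what it needs.
MATERIALS = {
    "metal": {"parent": None},
    "wood": {"parent": None, "density": 0.4},
    "hardwood": {"parent": "wood", "density": 0.7},
    "softwood": {"parent": "wood", "density": 0.3},
}


def lookup(material, parameter):
    """Walk from the specific material up to its ancestors, then the defaults."""
    name = material
    while name is not None:
        entry = MATERIALS[name]
        if parameter in entry:
            return entry[parameter]
        name = entry["parent"]
    return DEFAULTS[parameter]
```

Here "hardwood" and "softwood" differ only in the continuous density parameter, and a query such as `lookup("hardwood", "melting_point")` falls through to the global default because nothing more specific has been supplied.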

9 Parameterisation

This is quite a difficult part of the design. In a nutshell the job is to map control level signals, messages, events and continuous controls to the DSP code. In many cases hand crafting is needed, and essentially this becomes the role of the sound designer in a purely synthetic system. Fortunately it's possible to use object hierarchies and code whole classes of objects inheriting base features. An example might be a "liquids" model that works as well for lava or boiling mud as it does for water simply by correctly mapping "fluid viscosity". Some objects may be quite complex composites of other things attached in a particular way. An example might be a car. A car engine sounds very different at various distances, not just because of environmental factors like air damping, but because it is really two loud sounds, separated in many cases by at least a metre. The exhaust and piston sounds have a phase relationship that varies with acceleration, and a moving vehicle is also producing four wheel sounds. Then we have to consider action sounds like doors closing, engine ignition etc. In fact all these sounds have their proper place within the larger car model, but the object designer and the sound designer now share the same territory in building a coherently coupled model. Once built correctly the effect will automatically morph between the sound of sitting inside the car with the engine running, to the sound of standing next to the car, to the sound of hearing it pull away into the distance, depending on player position, but all using the same model and code. These are the kind of optimisations that allow us to replace dozens of sounds occupying megabytes of space with a single compact piece of code.

10 Replication

When we play multiplayer games we work on the assumption that the other players see and hear all the same things as we do, but from their respective viewpoints. Actually this is not always the case, because for one thing different hardware specifications for each client mean some of us will see less detailed graphics than others participating in the same round. Maintaining an authoritative world view is the job of the game server, which tries to harmonise the information on all clients; this is called "replication". The replication of sounds across multiple clients has its own challenges. For the most part it is an a priori activity, or in other words most of the algorithms are predictive and preemptive, looking ahead to update resources before they are used. For reasonably small levels, where all the data is part of the level and is not updated, sound needs very little special treatment for replication. The interpolation functions in local object instances are fine to synthesise most sounds. When exploring larger worlds where each level must transfer a new batch of data, we find with sound that defaults or previous sound objects from the last level work just fine until they are updated by specific new versions. We must remember that no two clients are ever likely to hear exactly the same sounds, in fact we can even set this diversity with stochastic models, but also that we can't know the exact sequence that will emerge on a client due to network latency, nor can we know their hardware capability and hence which sounds may have been culled from the client synthesis engine. This is where we might meet serious problems with composite sounds that depend on the exact order of sequences, where we must add time-tagged event data or pass ordered event lists rather than seeing every sound as a discrete and independent thing. Data will generally be raw parameters for existing synthesis objects, but chunks of DSP code need to be encapsulated too, to send new production methods for new objects. In fact some ground was made in the 1990s in that direction. MP4, or Structured Audio(5), was the first attempt at defining an encapsulation for synthesised sound, but it does not constitute a full dynamic interactive synthesis system. The actual definition of MPEG4 is left open ended, so it means nothing more than streaming audio with extra bells, left to be defined by the implementor. In summary, MP4 Structured Audio was designed with music in mind more than general sound effects, and its limitation is that it was cast mainly as a content delivery system. The syntax of MP4SA is basically Csound wrapped as XML so that music synthesisers and MIDI performance data can be delivered as a single chunk in a stream of serialised synthesis objects. It doesn't seem very practical in this context, yet, but it contains some insights and MP4 can be viewed as a partial solution corresponding to a necessary layer in a larger model. Csound suffers from a number of limitations, some technical and some legal, which make rewriting this part of structured audio protocols for general purpose DSP nets necessary. Puredata makes a perfect candidate because rather than sending code we can send the instructions to make code, which is a subtle but powerful difference. So synthesisers are really stored and moved about in a quite abstract manner, and since clients and servers would have a protocol for negotiating missing parts, and since the pieces are quite small, very complex synthesisers could be constructed on the fly according to rules given by the server.
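One way to handle time-tagged events on the client is a small reorder buffer that holds each event briefly so that late arrivals can slot into place. This is a minimal sketch, assuming server-assigned timestamps and a fixed hold time; the class name and policy are illustrative, not a protocol proposed in this paper.

```python
import heapq


class EventReorderBuffer:
    """Release time-tagged events in timestamp order after a fixed hold time."""

    def __init__(self, hold_time):
        self.hold_time = hold_time  # seconds to wait for stragglers (assumed policy)
        self._heap = []
        self._seq = 0               # tie-breaker so equal timestamps keep arrival order

    def push(self, timestamp, event):
        heapq.heappush(self._heap, (timestamp, self._seq, event))
        self._seq += 1

    def pop_ready(self, now):
        """Return all events old enough to be played, earliest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now - self.hold_time:
            timestamp, _, event = heapq.heappop(self._heap)
            ready.append((timestamp, event))
        return ready
```

If "shatter" (tagged 10.2) arrives before "impact" (tagged 10.0), the buffer still hands them to the synthesiser in causal order once the hold time has elapsed, at the price of that much added latency.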

11 Toolchains

My own work focuses on using Puredata as a separate synthesis engine. It makes sense to have the DSP engine of a system be its own separate program or subsystem, and to deal entirely with creating code to run there. Although sound designers for synthetic games will be mixing it up a lot more with the object designers, the majority of real work should be done in a fast visual development environment, as most of the work will be listening and tweaking, and in my opinion there is no better tool for this than Puredata. The interface to this in a runtime environment will probably be via a protocol like OSC, but the nice thing about Puredata is that it easily opens any socket, even network sockets in UDP and TCP, so it can listen to just about anything, which makes it a very flexible development tool. Presently there is some activity to marry Blender with Puredata, but to what depth object properties in the game engine can be manipulated and interrogated to obtain useful parameters remains to be seen. Puredata itself is not available as a linkable library of the main engine code; it must be run as a daemon. While this is a very elegant model, the current inability to compile stand alone sound generators limits its portability. A desirable toolchain would manage a "pool" of reusable objects which are executed on an "invisible" client engine embedded into application software.
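As a sketch of how a game process might talk to a Puredata daemon over a socket, the example below formats messages in Pd's plain-text FUDI style (whitespace-separated atoms terminated by a semicolon) and sends them by UDP. Note that this uses FUDI rather than OSC. The port number 3000, the selector name "footstep" and the assumption that the patch contains a UDP [netreceive] object listening on that port are all illustrative.

```python
import socket


def fudi_message(selector, *values):
    """Format one FUDI message: atoms separated by spaces, ended by ';'."""
    atoms = [selector]
    for v in values:
        # Compact float formatting; other values are passed through str().
        atoms.append(format(v, "g") if isinstance(v, float) else str(v))
    return (" ".join(atoms) + ";\n").encode("ascii")


def send_to_pd(message, host="127.0.0.1", port=3000):
    """Send one datagram to a Pd patch assumed to be listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))
```

A call such as `send_to_pd(fudi_message("footstep", 0.8, 3))` delivers the bytes `footstep 0.8 3;` to the patch, where the message can be routed to the synthesiser's parameters.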

References