Forces driving procedural audio

Order of growth

As game worlds get bigger and the number of world objects increases, the number of interactions that require a sound explodes. Unlike textures and meshes, which have a linear O(n) relationship to world objects, sound is interactive and grows with the number of relationships between objects. In practice this is polynomial growth (pairwise interactions alone are O(n^2)), but the theoretical worst case is closer to factorial. Many game audio directors currently complain that even with a large team of sound designers they cannot generate enough content to populate the worlds. Game worlds are only going to keep growing, so procedural synthetic audio that can automatically generate sounds for all possible object-object interactions is very attractive. Instead of exhaustively considering every possible drop, scrape and impact noise, the game world is automatically populated with default interactions. The sound designer can then concentrate on using their real skills to tune the aesthetic of critical sounds, or replace them with samples, and forget about the drudge work of creating thousands of filler noises. This is very liberating for the sound designer.
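A minimal sketch of the growth argument, counting only unordered pairs of distinct objects. The function names and the figure of three interaction types per pair (drop, scrape, impact) are assumptions made for illustration, not figures from any particular game.

```python
from math import comb


def per_object_assets(n_objects: int) -> int:
    """Assets that scale one-per-object, such as meshes or textures: O(n)."""
    return n_objects


def pairwise_sounds(n_objects: int, interaction_types: int = 3) -> int:
    """Sounds needed if every unordered pair of distinct objects can
    interact in `interaction_types` ways (an assumed figure): O(n^2)."""
    return comb(n_objects, 2) * interaction_types


for n in (10, 100, 1000):
    print(f"{n:5d} objects: {per_object_assets(n):5d} meshes, "
          f"{pairwise_sounds(n):8d} interaction sounds")
```

Even this conservative count reaches roughly one and a half million sounds for a thousand objects, which is why hand-authoring every interaction does not scale.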

Asset control

In terms of project management, the data model means that large amounts of data must be taken care of, so part of the audio team's job is to keep track of assets and make sure there are no missing pieces. Asset management is now one of the biggest challenges in game audio. Each game event, whether an object pickup, a door opening or a car engine, must have its own piece of sound data kept on secondary storage, usually as .wav files, logged and cross-referenced against event lists. In contrast, procedural audio requires management of code, or rather of sound objects, which happen to be the world objects. This brings the view of all media in the project under one roof. Sound objects have enormous scope for reuse and can be arranged in a class hierarchy that tends to manage itself, with more specific objects derived from general ones. Instead of gigabytes of raw audio data, the project manager must now deal with more compact but more complex systems of parameters that define each sound object.

Data throughput

With the data model, each recorded sample must be brought from secondary storage into main RAM, or directly via a data pipeline to the sound core, during play. This means that the data model is "data bus intensive". One problem holding back current game audio is that sound data must compete for bandwidth with graphics data, which is also data intensive. In contrast, procedural audio is CPU intensive. The instructions necessary to generate procedural audio are very small, and such compact programmatic data places an almost negligible load on the data bus. Instead of large DMA transfers, the CPU works to compute the data in real time. This radical change has important effects on system architecture and program design. One immediate advantage is that it frees up a great deal of data throughput for other uses.
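A back-of-envelope comparison of bus traffic illustrates the point. The PCM rate is simple arithmetic for uncompressed 16-bit stereo at 44.1 kHz. The procedural side is an assumption made for illustration: eight 4-byte control parameters per voice, updated 60 times per second, with 64 voices playing.

```python
def pcm_bytes_per_second(sample_rate_hz: int = 44_100,
                         bytes_per_sample: int = 2,
                         channels: int = 2) -> int:
    """Data rate of one uncompressed PCM stream."""
    return sample_rate_hz * bytes_per_sample * channels


def parameter_bytes_per_second(n_parameters: int = 8,
                               update_rate_hz: int = 60,
                               bytes_per_value: int = 4) -> int:
    """Data rate of the control parameters driving one procedural voice.
    All three defaults are assumed values, not measurements."""
    return n_parameters * update_rate_hz * bytes_per_value


voices = 64  # assumed scene size
sampled_total = voices * pcm_bytes_per_second()
procedural_total = voices * parameter_bytes_per_second()

print(f"sampled:    {sampled_total:>12,d} bytes/s")
print(f"procedural: {procedural_total:>12,d} bytes/s")
```

Under these assumptions the sampled scene moves roughly ninety times more data than the procedural one. The CPU time spent synthesising each voice is not counted here; that is the other side of the trade.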

Deferred form

The data model requires that most of the work is done in advance, prior to execution on the platform. Decisions such as sound levels, event-sound mappings and choices of effects are made in advance and cast in stone. Procedural audio, on the other hand, is highly dynamic and flexible: it defers many decisions until runtime. Data-driven audio uses prior assignment of polyphony limits or priorities for masking, but dynamic procedural audio can make more flexible choices at runtime, provided we solve the problem of predicting execution cost. This means that critical aesthetic choices can be made later in the process, such as having the sound mixers work with a desk "in-world" during the final phase of production, much as a film is mixed. They can focus on important scenes and remix the music and effects for maximum impact. With runtime dynamic mixing it is possible to "set focus" on an object that the player is looking at, or on a significant actor that requires highlighting in context.
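One way to picture a runtime decision of this kind is a voice allocator that works against a predicted cost budget rather than a fixed polyphony limit. The sketch below is a simple greedy scheme chosen for illustration. The voice names, priorities, costs, budget and the size of the focus boost are all hypothetical.

```python
def choose_voices(candidates, cpu_budget, focus=None, focus_boost=10):
    """Pick voices in descending priority until the predicted cost budget
    is spent. `candidates` is a list of (name, priority, predicted_cost).
    The voice named by `focus` has its priority raised by `focus_boost`."""
    def effective_priority(voice):
        name, priority, _cost = voice
        return priority + (focus_boost if name == focus else 0)

    chosen = []
    remaining = cpu_budget
    for name, _priority, cost in sorted(candidates,
                                        key=effective_priority,
                                        reverse=True):
        if cost <= remaining:  # skip anything that would bust the budget
            chosen.append(name)
            remaining -= cost
    return chosen


# Hypothetical scene: (name, priority, predicted cost in arbitrary units)
scene = [("dialogue", 10, 3.0), ("engine", 5, 4.0),
         ("wind", 2, 1.0), ("birds", 1, 2.0)]

print(choose_voices(scene, cpu_budget=6.0))
print(choose_voices(scene, cpu_budget=6.0, focus="engine"))
```

The same scene yields a different mix once focus moves to the engine, and the budget is never exceeded. The scheme only works if the predicted cost of each voice is trustworthy, which is the prediction problem mentioned above.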

Object based

In terms of how the sound designer and programmer work, procedural audio heralds the creation of a new class of audio producer/programmer. Instead of working with collected data and a fairly naive sample replay engine, the sound designer's focus shifts from specific instances of sound to the behaviour and physics of whole classes of sounds. The designer now works more like a programmer, creating and modifying sound objects. A sound object is really a method of an existing world object, and so the false divide between visual and audio media is bridged. Simple operations such as scaling can be automatic. For example, if the sound designer works on an object that models a metal cylinder, and a tin can is scaled up to a steel barrel, the sound may scale as easily as the 3D artist changes the size of an object mesh. If the barrel is now textured with a wood material, the sound automatically changes to become that of a wooden barrel. For a great many objects the designer is concerned with physical and material properties and geometry; the actual sounds are an emergent property of the work of the sound designer and the 3D object designer, who work more closely together. In the best case this direction practically eliminates audio middleware tasks that perform explicit mapping. For example, instead of manually specifying footstep sounds for each texture, the appropriate choices are made automatically from texture material properties. After all, the sound is an attribute of the surface-foot interaction, and if the texture is correctly labelled by the texture artist and the speed and weight of the actor are known, then footfalls can be synthesised directly (often using granular methods).
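The tin can and barrel example can be sketched as a small sound object whose parameters follow geometry and material. This is a deliberately crude stand-in: it treats the cylinder as a free-free bar vibrating longitudinally, so mode k sits at k * c / (2 * L), and the material values are rough illustrative numbers. The class and field names are hypothetical, and the absolute frequencies are not meant to be realistic; only the way they respond to scaling and re-texturing matters here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Material:
    name: str
    wave_speed: float  # m/s, rough illustrative value
    decay_time: float  # seconds for a mode to ring out, illustrative


STEEL = Material("steel", wave_speed=5000.0, decay_time=1.5)
WOOD = Material("wood", wave_speed=3500.0, decay_time=0.15)


@dataclass(frozen=True)
class CylinderSound:
    length: float       # metres
    material: Material

    def mode_frequencies(self, n_modes: int = 3):
        """Longitudinal modes of a free-free bar: f_k = k * c / (2 * L)."""
        fundamental = self.material.wave_speed / (2.0 * self.length)
        return [fundamental * k for k in range(1, n_modes + 1)]

    def scaled(self, factor: float) -> "CylinderSound":
        """Same object, larger or smaller: frequencies scale by 1/factor."""
        return replace(self, length=self.length * factor)

    def with_material(self, material: Material) -> "CylinderSound":
        """Same geometry, different material."""
        return replace(self, material=material)


tin_can = CylinderSound(length=0.12, material=STEEL)
steel_barrel = tin_can.scaled(7.5)
wooden_barrel = steel_barrel.with_material(WOOD)

for label, obj in [("tin can", tin_can), ("steel barrel", steel_barrel),
                   ("wooden barrel", wooden_barrel)]:
    modes = ", ".join(f"{f:.0f} Hz" for f in obj.mode_frequencies())
    print(f"{label:14s} modes: {modes}  decay: {obj.material.decay_time} s")
```

Neither the scaling nor the material swap requires a new asset. The designer edits the model once and every derived object inherits the change.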

Variety

Further advantages of procedural audio are versatility, uniqueness, dynamic level of detail, focus control and localised intelligence. Let's consider the first of these for a moment. As we mentioned above, a recorded sound always plays precisely the same way. Procedural sound may be highly interactive, with continuous real-time parameters being applied. Generative music, for example, can change its motifs, structure and balance to reflect emotional dimensions. The sound of flying bullets or airplane propellers can adapt to velocity in ways that are impossible with current resampling or pitch shifting techniques. Synthesised crowds can burst into applause or shouting. Complex weather systems can let the wind speed affect the sound of rainfall, and rain can sound different when falling on roofs or into water. Realistic footsteps can automatically adapt to player speed, ground texture and incline. The dynamic possibilities are practically endless. We will consider dynamic level of detail shortly because it is closely tied up with computational cost models, but it is also related to dynamic mixing, which allows us to force focus in a sound mix according to game variables.

Variable cost

Playing back sample data has a fixed cost. It doesn't matter what the sound is; it always requires the same amount of computing power. Procedural sound has a variable cost: the more complex the sound, the more work it requires. What is not immediately apparent is that the dynamic cost of procedural audio is a great advantage in the limiting condition. With only a few sounds playing, sampled methods vastly outperform procedural audio in terms of cost and realism. However, as the number of sounds grows past a few dozen, the fixed cost of samples starts to work against them. Some procedural sounds are very hard to produce, for example an engine sound, while some are extremely easy and cheap to produce, for example wind or fire sounds. Because of this we reach a point in a typical sound scene where the curves cross and procedural sound starts to outperform sample data. What makes this even more attractive is the concept of dynamic level of detail (LOD). A sampled sound always has the same cost so long as it is playing, regardless of how much it is attenuated. In mixing a sound scene, LOD is applied to fade out distant or irrelevant sounds, usually by distance or fogging effects that work with a simple radius, or by zoning that attenuates sounds behind walls. Until a sampled sound drops below the hearing or masking threshold it consumes resources. Research on dynamic LOD techniques has shown how a synthetic source can gracefully blend in and out of a sound scene, producing a variable cost. We can employ psychoacoustic, perceptual methods to construct only the parts of the sound that are most relevant [13], or cull unneeded frequencies in a spectral model [25]. What this means is that for a complex sound scene the cost of peripheral sounds is reduced below that of sampled sources. The magic cutoff point where procedural sound begins to outperform sampled sources is a density of a few hundred sources.
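The crossing cost curves can be illustrated with a toy model. Every number here is invented for the sketch: a sampled voice costs one unit until it is culled, a procedural voice costs eight units at full detail, its cost falls in proportion to its gain down to a small floor, and source k sits k metres away with inverse-distance gain. The crossover this toy model produces therefore says nothing about where the real one lies.

```python
def scene_gains(n_sources: int):
    """Assumed scene: source k is k metres away, gain falls off as 1/k."""
    return [1.0 / k for k in range(1, n_sources + 1)]


def sampled_cost(gains, cost_per_voice=1.0, cull_threshold=0.001):
    """A sample costs the same however quiet it is, until it is culled."""
    return sum(cost_per_voice for g in gains if g >= cull_threshold)


def procedural_cost(gains, full_cost=8.0, floor_cost=0.05,
                    cull_threshold=0.001):
    """A procedural voice sheds detail, and so cost, as it is attenuated."""
    return sum(max(floor_cost, full_cost * g)
               for g in gains if g >= cull_threshold)


for n in (5, 50, 500):
    gains = scene_gains(n)
    print(f"{n:4d} sources: sampled {sampled_cost(gains):7.1f}, "
          f"procedural {procedural_cost(gains):7.1f}")
```

With a handful of sources the sampled scene is cheaper, because every procedural voice is close and running at high detail. As the scene fills with distant sources, the procedural total grows far more slowly than the per-voice fixed cost of samples.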

Andy Farnell
http://obiwannabe.co.uk/