If we are to coin a term to highlight the difference between the established model and the procedural model then let's call the current way of doing things the "data model". It is data intensive. Massive amounts of recorded sound are collected by sound designers, trimmed, normalised, compressed, and arranged into sequence and parameter tables in the same way that multi-sample libraries were produced for music samplers.8 They are matched to code hooks in the game logic using an event-buffer manager like the FMOD or Wwise audio system, and some real-time localisation or reverb is applied. The data model sits nicely with the old school philosophy from the film and music business, which is principally concerned with assets and ownership. The methodology is simple: collect as much data as possible in finished form and assemble it like a jigsaw puzzle, using a hammer where necessary, into a product. At each stage a particular recording can be ascribed to a game object or event that owns it.
Figure 8: Use of recorded audio in today's games
There are many drawbacks to this. One is that it forces aesthetic decisions to be made early in the production chain and makes revisions very expensive. Another, which we have already touched upon, is the colossal amount of data that must be managed. Existing game audio systems are therefore as much asset management tools as they are data delivery systems. One very important point is that this approach is incoherent. Sound has long been treated as an afterthought, added at the end of game development, so much so that this attitude is reflected in the tools and workflows that have evolved in the game development industries. In fact, for most cases other than music delivery it is completely unnatural to treat sound as separate from vision. Unlike film, where location sound must be replaced piece by piece with "Foley" and "wild effects", games are inherently coherent. They are based on object oriented programming, where it is natural to treat the sonic and visual properties of an object as attributes of the same thing. The present way of doing things forces a partition between the design of an object's visual and sonic properties, which in turn leads to more work tweaking and experimenting with sound alignment, and more again whenever assets are revised.
Let's take a quick look at some aspects of current technology and methods.
When an object comes into play, either because it comes within range of the player or acquires relevance, it must be activated. This may involve a prefetch phase where a soundbank is loaded from secondary storage. Although modern game sound systems have hundreds or even thousands of channels, it is still necessary to manage voice playback in a sensible way. Like the polyphony assignment of a traditional sampler, a game audio system prioritises sounds. Those that fall below an amplitude threshold, where they are masked by others, are dropped and the object instance containing the table replay code is destroyed. Activation may be by triggers or events within the world.
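The prioritisation described above can be sketched in a few lines. This is only an illustration of the idea, not how FMOD or Wwise actually implement it; the `Voice` class, the `cull_voices` function, the threshold and the channel limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Voice:
    name: str
    amplitude: float  # linear gain at the listener, 0.0 to 1.0 (assumed)

def cull_voices(voices, max_channels, threshold):
    """Keep the loudest voices; drop those too quiet or over the channel limit."""
    # Voices below the threshold are assumed to be masked by louder ones.
    audible = [v for v in voices if v.amplitude >= threshold]
    # Loudest first, then truncate to the number of available channels.
    audible.sort(key=lambda v: v.amplitude, reverse=True)
    return audible[:max_channels]

voices = [
    Voice("gunshot", 0.9),
    Voice("footstep", 0.2),
    Voice("distant_bird", 0.01),
    Voice("engine", 0.6),
]
playing = cull_voices(voices, max_channels=2, threshold=0.05)
print([v.name for v in playing])
```

A real system would probably weight amplitude by a designer-assigned priority as well, so that quiet but important sounds such as dialogue are not dropped.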
Composite or concatenated sounds may be constructed by ordering or randomly selecting segments. Examples are footsteps or weapon sounds that comprise many small clicks or discrete partials in combination.
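As a rough sketch of random concatenation, consider a footstep built from one heel segment and one toe segment, each picked from a small pool of variants. The function name `build_footstep` and the tiny lists standing in for recorded sample data are assumptions made for illustration.

```python
import random

def build_footstep(heel_variants, toe_variants, rng):
    """Join one randomly chosen heel segment to one randomly chosen toe segment."""
    heel = rng.choice(heel_variants)
    toe = rng.choice(toe_variants)
    return heel + toe  # sample lists joined end to end

# Placeholder "recordings": short lists of sample values.
heels = [[0.1, 0.2], [0.3, 0.4]]
toes = [[0.5], [0.6], [0.7]]

rng = random.Random(1)  # seeded so a run can be repeated
step = build_footstep(heels, toes, rng)
print(len(step))
```

With two heel and three toe variants there are six possible footsteps, which is how a small set of segments avoids the obvious repetition of a single recording.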
Crossfading and mixing of sample data works very much as it does in a normal sampler. Velocity crossfades for impact intensity are really no different from those of a multi-sampled piano.
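A velocity crossfade between a soft and a hard impact recording might look like the following sketch. The equal-power (sine/cosine) curve is one common choice rather than the only one, and `velocity_crossfade` is a hypothetical name.

```python
import math

def velocity_crossfade(velocity, soft_sample, hard_sample):
    """Blend two equal-length recordings with an equal-power crossfade.

    velocity is the impact intensity, clamped to the range 0.0 to 1.0.
    """
    v = min(max(velocity, 0.0), 1.0)
    soft_gain = math.cos(v * math.pi / 2)  # 1.0 at velocity 0
    hard_gain = math.sin(v * math.pi / 2)  # 1.0 at velocity 1
    return [soft_gain * s + hard_gain * h
            for s, h in zip(soft_sample, hard_sample)]

soft = [1.0, 1.0]
hard = [0.0, 0.0]
print(velocity_crossfade(0.0, soft, hard))
```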
Most game audio systems incorporate a mixer much like a traditional large frame multi-bus desk with groups, auxiliary sends, inserts and buses. The difference between a digital desk used for music and one used in game audio is more to do with how it is used. In traditional music the configuration of the desk stays largely the same throughout the mix of a piece of media, but in a game the entire structure can be quickly and radically changed in a very dynamic way. Reconfiguring the routing of the entire mix system at the millisecond or sample accurate level without clicking or dropouts is the strength of game audio mixers.
Continuous real-time parameterisation from arbitrary qualities can be applied to a sound source. Object speed, distance, age, rotation or even temperature are possible. Presently these are usually routed to filter cutoff or pitch controls; the range of dynamic real-time control for non-synthetic sounds is quite poor.
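Routing a game parameter to a filter cutoff usually needs a mapping curve, since frequency is perceived roughly logarithmically. The sketch below maps object speed onto cutoff frequency; the function name `map_exponential` and the ranges used are illustrative assumptions, not values from any particular engine.

```python
def map_exponential(value, in_min, in_max, out_min, out_max):
    """Map a linear input range onto an exponential output range."""
    t = (value - in_min) / (in_max - in_min)
    t = min(max(t, 0.0), 1.0)  # clamp to the input range
    return out_min * (out_max / out_min) ** t

# Assumed example: speed 0 to 50 m/s drives cutoff from 200 Hz to 8 kHz.
for speed in (0.0, 25.0, 50.0):
    print(speed, round(map_exponential(speed, 0.0, 50.0, 200.0, 8000.0), 1))
```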
Simple panning or inter-aural phase shift according to head transfer response is applied to the sound in order to place it perceptually for the player actor. Relative actor speed, orientation and the propagation medium (air, fog, water, etc.) all contribute to how the sound is received. This is closely connected to "ambiance" below.
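The simplest of these placement methods, amplitude panning, can be sketched as below. It covers only the "simple panning" case; inter-aural phase and head response are not modelled. The function name and the azimuth convention (-90 degrees hard left, +90 hard right) are assumptions for the example.

```python
import math

def equal_power_pan(azimuth_deg):
    """Return (left_gain, right_gain) for a source at the given azimuth.

    Azimuth runs from -90 (hard left) to +90 (hard right), relative to
    the direction the listener is facing.
    """
    a = min(max(azimuth_deg, -90.0), 90.0)
    theta = (a + 90.0) / 180.0 * (math.pi / 2)  # 0 to pi/2
    return math.cos(theta), math.sin(theta)

left, right = equal_power_pan(0.0)
print(round(left, 4), round(right, 4))
```

The sine/cosine law keeps the summed power of the two channels constant, so a source does not appear to get quieter as it passes through the centre.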
This is an extension of localisation which creates much more realistic sound by contextualising it. Reverb, delay, Doppler shift and filtering are applied to place point sources or extents within the environment. Echoes can be taken from the proximity of nearby large objects or world geometry so that sound sources obtain natural ambience as the player moves from outdoors, through a forest, into a cave and then into a corridor or room.
This is directly linked to distance but may also apply filters to produce fogging (absorption), or material damping caused by intervening objects that occlude sound. Localisation, ambience and attenuation are all really aspects of the same process: placing dry discrete sources or extents into a natural sounding mix.
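Distance attenuation with absorption can be approximated by a gain that falls with distance plus a low-pass filter whose cutoff drops as the source recedes. The sketch below uses an inverse-distance law and a one-pole filter; the cutoff curve in `attenuate` is an invented placeholder, not a measured absorption model, and all function names are hypothetical.

```python
import math

def distance_gain(distance, ref_distance=1.0):
    """Inverse-distance gain, held at unity inside the reference distance."""
    return ref_distance / max(distance, ref_distance)

def one_pole_lowpass(samples, cutoff_hz, sample_rate=44100):
    """Simple one-pole low-pass: y[n] = (1 - a) * x[n] + a * y[n-1]."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def attenuate(samples, distance, sample_rate=44100):
    """Apply distance gain and a distance-dependent low-pass to a dry source."""
    # Hypothetical absorption curve: cutoff falls with distance, floor at 500 Hz.
    cutoff = max(500.0, 16000.0 / max(distance, 1.0))
    g = distance_gain(distance)
    return [g * s for s in one_pole_lowpass(samples, cutoff, sample_rate)]

far = attenuate([1.0] * 2000, distance=10.0)
print(round(far[-1], 3))
```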
If we ignore Einstein for a moment and assume the existence of a synchronous global timeframe then networked clients in a multi-player game would all march like an army in lockstep. In reality clients do not follow this behaviour; they are more like a loose crowd following along asynchronously because of network latency. The server maintains an authoritative "world view" which is broadcast to all clients. This data may include new objects and their sounds as well as time-tagged packets that indicate the relative rather than absolute timing between events. It is necessary to reschedule some sound events, pushing them forwards (if possible) or backwards a few milliseconds to make them correspond to visual elements. Without this, network jitter would scramble the sequence and timing of events, so variable delays are used to align sounds back to correct object positions or states which are interpolated on the client.
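One way to picture this realignment is a playout buffer: events arrive in arbitrary order carrying a server time tag, and the client releases each one at its tag plus a fixed delay, which restores the original relative timing. This is a minimal sketch under the assumption that the client already has an estimate of server time; `PlayoutBuffer` and its methods are hypothetical names, not part of any networking library.

```python
import heapq

class PlayoutBuffer:
    """Reorders time-tagged sound events and releases them after a fixed delay."""

    def __init__(self, playout_delay_ms):
        self.delay = playout_delay_ms
        self.heap = []  # entries are (play_time_ms, sound_id)

    def receive(self, server_time_ms, sound_id):
        # Schedule relative to the server's tag, not the arrival time.
        heapq.heappush(self.heap, (server_time_ms + self.delay, sound_id))

    def due(self, client_time_ms):
        """Pop every event whose scheduled play time has been reached."""
        ready = []
        while self.heap and self.heap[0][0] <= client_time_ms:
            ready.append(heapq.heappop(self.heap))
        return ready

buf = PlayoutBuffer(playout_delay_ms=50)
buf.receive(120, "impact")   # arrives first although it happened second
buf.receive(100, "gunshot")
print(buf.due(160))
print(buf.due(170))
```

The playout delay trades latency for robustness: it must exceed the expected jitter, otherwise late packets miss their slot and must be dropped or played late.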
These are often given special treatment, having their own groups or subsystems. Dialogue is often available in several languages which can contain sentences of differing length or even an entirely different semantic structure. Where music is dynamic or interactive this is currently achieved by mixing multitrack sections according to a composition matrix that reflects emotive game states. Short musical effects or "stings" can be overlaid for punctuation and atmospheres can be slowly blended together to effect shifting moods. Menu sounds require a separate code environment because they exist outside the game and may continue to be used even when all world objects, or the level itself, have been destroyed.
How do new procedural methods fit into or modify this methodology? The difference between the foregoing and what we are presently considering is that procedural sound is now determined by the game objects themselves. Sounds heard by the player actor very closely reflect what is actually happening in the game world, being more tightly bound to the physics of the situation. Procedural audio relies much more on real-time parameters than discrete events. Instead of selecting from pre-recorded pieces we generate the audio at the point of use according to runtime parameters from the context and object behaviours. A fairly direct analogy can be drawn between the rendering and lighting of 3D scenes compared to static texture photographs, and the synthesis of procedural audio compared to using recorded samples. In each case the former is more flexible but more CPU intensive. Another difference is the granularity of construction. Instead of large audio files we tend to use either extremely small wavetables or none at all where sources are synthesised.
Take for example a gun and a car engine. Traditional methods would require at least two sounds, sampled, looped and processed, tagged and mapped to game events. In fact these are very similar physical entities. For the gun we model a small explosive detonation of the shell and apply it to the formant and radiance characteristics for the stock, magazine and barrel. These each impart particular resonances. Similarly an internal combustion engine is a repeated explosion within a massive body connected to a tubular exhaust. Standing waves and resonances within the engine are determined by the engine revs, the length and construction of the exhaust and the formants imparted to the sound by the vehicle body. A synthetic sound designer can create the sounds for most objects by assembling efficient physical models from reusable components 9.
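The shared structure of the two sounds, an explosive excitation driving a resonant body, can be hinted at with a very crude sketch: an impulse train feeding a two-pole resonator. A single impulse stands in for the gunshot and a repeating one for the engine. The frequencies, decay values and function names are invented for illustration and make no claim to sound realistic.

```python
import math

def resonator(samples, freq_hz, decay, sample_rate=44100):
    """Two-pole resonator: y[n] = x[n] + 2r cos(w) y[n-1] - r^2 y[n-2]."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    b1 = 2.0 * decay * math.cos(w)
    b2 = -decay * decay
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def explosion_train(num_samples, period):
    """Unit impulses every `period` samples, standing in for detonations."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

# One detonation into a bright, slowly decaying body (assumed values).
gunshot = resonator(explosion_train(4410, 4410), freq_hz=900.0, decay=0.995)
# Repeated detonations at 100 per second into a lower, duller body.
engine = resonator(explosion_train(4410, 441), freq_hz=120.0, decay=0.99)
print(len(gunshot), len(engine))
```

The same two components serve both objects; only the excitation rate and the resonator settings differ, which is the sense in which such models are reusable.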
Instead of simply applying filters to attenuate recorded sources we are able to rather cleverly tailor a procedural synthetic sound to use fewer resources as it fades into the distance. Think of a helicopter sound. When in the distance the only sound audible is the "chop chop" of the rotor blades. But as it approaches we hear the tail rotor and engine. Similarly the sound of running water is a very detailed pattern of sinewaves when close, but as it fades into the distance the detail can be replaced by cheaply executable noise approximations. In fact psychoacoustic models of perception and Gabor's granular theory of sound suggest this is the correct way to do level of detail; making sounds with less focus actually consume fewer resources is merely a bonus from a computational point of view. This can lead to perceptually sparser, cleaner mixes, without the "grey goo" phenomenon that comes from the superposition of an overwhelming number of channels of sampled audio.
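The helicopter example suggests a simple selection rule: decide which synthesis components to run from the distance to the listener. The sketch below shows the shape of such a rule; the distance thresholds and component names are hypothetical.

```python
def helicopter_components(distance_m):
    """Choose which synthesis components to run at a given distance."""
    components = ["main_rotor"]  # the blade "chop" is audible at any range
    if distance_m < 300.0:       # assumed range at which the engine is heard
        components.append("engine")
    if distance_m < 150.0:       # assumed range at which the tail rotor is heard
        components.append("tail_rotor")
    return components

for d in (1000.0, 200.0, 50.0):
    print(d, helicopter_components(d))
```

Components that are not selected are simply never computed, which is where the saving over filtering a full recording comes from.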
Extending the highly dynamic structure of existing mixers we now use DSP graphs that are arbitrary. A fixed architecture in which sources flow through plugins in a linear fashion is too limited. The ability to construct an FM synthesiser or formant filter bank with a variable number of parallel bands requires a very flexible DSP engine, a real SOUND ENGINE. This is much more like a sophisticated synthesiser than a sampler. Software architectures that already have this behaviour are Reaktor, Max/MSP and Pure Data. One essential feature of these tools is that they give a visual representation of the DSP graph which is easy to program, allowing sound and music designers who are not traditional programmers to speedily construct objects for real-time execution.
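To make the idea of an arbitrary graph concrete, here is a toy sketch in which nodes are connected freely and the number of parallel branches is chosen at run time. A bank of sine partials stands in for the formant bank mentioned above. The `Node` class and helper functions are hypothetical and bear no relation to the internals of Reaktor, Max/MSP or Pure Data; a real engine would process blocks of samples rather than one sample at a time.

```python
import math

SAMPLE_RATE = 44100

class Node:
    """A DSP graph node: a function of the sample index and its inputs' outputs."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def render(self, n, cache):
        # The cache ensures a node shared by several branches runs once per sample.
        if self not in cache:
            args = [node.render(n, cache) for node in self.inputs]
            cache[self] = self.func(n, *args)
        return cache[self]

def sine(freq):
    return Node(lambda n: math.sin(2.0 * math.pi * freq * n / SAMPLE_RATE))

def gain(amount, source):
    return Node(lambda n, x: amount * x, source)

def mix(*sources):
    return Node(lambda n, *xs: sum(xs), *sources)

def build_bank(num_bands, base_freq=220.0):
    """Build a mixer fed by a variable number of parallel sine branches."""
    bands = [gain(1.0 / (k + 1), sine(base_freq * (k + 1)))
             for k in range(num_bands)]
    return mix(*bands)

graph = build_bank(5)
block = [graph.render(n, {}) for n in range(64)]
print(len(block))
```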
Instead of delivering mixed tracks or multitrack stems for client-side mixing the composer now returns to the earlier tracker style philosophy (only in a more modern incarnation with XMF, DLS2, EAS type formats). Music is delivered in three parts: MIDI score components that determine the themes, melodies and transitions of a piece; a set of "meta" data or rules for assembling scores according to real-time emotive states; and a set of instruments which are either multi-sample library banks or synthetic patches. The composer is no longer concerned with a definitive form for a piece, but rather with the "shape" and "boundaries" of how the piece will perform during play. Chipset MIDI on some soundcards has improved greatly in recent years, reaching parity with professional samplers and synthesisers, but for platform independence the most promising direction is with native implementations.
Figure 9: Use of synthesis in procedural sound effects
Andy Farnell
http://obiwannabe.co.uk/