XY, MS, and First-Order Ambisonics

This chapter describes first-order Ambisonic technologies starting from classical coincident audio recording and playback principles from the 1930s until the invention of first-order Ambisonics in the 1970s. Coincident recording is based on arrangements of directional microphones at the smallest-possible spacings in between. Hereby incident sound approximately arrives with equal delay at all microphones. Intensity-based coincident stereophonic recording such as XY and MS typically yields stable directional playback on a stereophonic loudspeaker pair. While the stereo width is adjustable by MS processing, the directional mapping of first-order Ambisonics is a bit more rigid: the omnidirectional and figure-of-eight recording pickup patterns are reproduced unaltered by equivalent patterns in playback. In perfect appreciation of the benefits of coincident first-order Ambisonic recording technologies in VR and field recording, the chapter gives practical examples for encoding, headphone- and loudspeaker-based decoding. It concludes with a desire for a higher-order Ambisonics format to get a larger sweet area and accommodate first-order resolution-enhancement algorithms, the embedding of alternative, channel-based recordings, etc.

by recording and reproducing with spatially undistorted omnidirectional and figure-of-eight patterns on circularly (2D) or spherically (3D) surrounding loudspeaker layouts.

Blumlein Pair: XY Recording and Playback
The XY technique dates back to Blumlein's patent from the 1930s [1] and his patents thereafter [4]. At that time, manufacturers started producing ribbon microphones (a design that is nowadays considered outdated), which offered a means to record with figure-of-eight pickup patterns.

Blumlein Pair using 90°-angled figure-of-eight microphones (XY).
Blumlein's classic coincident microphone pair [3, Fig. 3] uses two figure-of-eight microphones pointing to ±45°, see Fig. 1.1. The directional pickup pattern of each figure-of-eight microphone is described by cos φ, where φ is the angle enclosed by the microphone aiming direction and the sound source. Using a mathematically positive coordinate definition for X (front-right) and Y (front-left), with the polar angle ϕ = 0 aiming at the front, the figure-of-eight X uses the angle φ = ϕ + 45° and Y the angle φ = ϕ − 45°, so that the pickup pattern of the microphone pair is:
$$\boldsymbol{g}_{XY}(\phi) = \begin{bmatrix} \cos(\phi + 45^\circ) \\ \cos(\phi - 45^\circ) \end{bmatrix}.$$
Assuming a signal s coming from the angle ϕ, the recorded signals are $[X, Y]^\mathsf{T} = \boldsymbol{g}_{XY}(\phi)\, s$. Sound sources from the left 45°, the front 0°, and the right −45° are received with the pairs of gains:
$$\boldsymbol{g}_{XY}(45^\circ) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad \boldsymbol{g}_{XY}(0^\circ) = \begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix}, \qquad \boldsymbol{g}_{XY}(-45^\circ) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$
A source moving from the right −45° to the left 45° therefore pans the signal from the channel X to the channel Y. This property provides a strongly perceivable lateralization of lateral sources when feeding the left and right channel of a stereophonic loudspeaker pair by Y and X, respectively. However, ideally there should not be any dominant sounds arriving from the sides, as for the source angles −135° ≤ ϕ ≤ −45° and 45° ≤ ϕ ≤ 135° the Blumlein pair produces out-of-phase signals between X and Y. The back directions are mapped with consistent sign again, however left-right reversed. The only way to avoid the out-of-phase regions is to decrease the angle between the two microphones, which, however, makes the stereo image narrower.
Therefore, coincident XY recording pairs nowadays most often use cardioid directivities 1/2 + 1/2 cos φ instead. They receive all directions without sign change and easily permit stereo width adjustments by varying the angle between the microphones.
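The sign behavior of the Blumlein pair described above can be checked numerically. The following is a minimal Python sketch using only the pickup patterns cos(ϕ + 45°) for X and cos(ϕ − 45°) for Y; the function names are hypothetical and chosen for illustration only.

```python
import math


def blumlein_xy_gains(phi_deg):
    """Gains (X, Y) of a Blumlein pair for a source at azimuth phi_deg.

    X aims at -45 deg (front-right) and Y at +45 deg (front-left); each
    figure-of-eight picks up cos(angle between aiming and source).
    """
    phi = math.radians(phi_deg)
    x = math.cos(phi + math.radians(45.0))
    y = math.cos(phi - math.radians(45.0))
    return x, y


def is_out_of_phase(phi_deg, tol=1e-9):
    """True if X and Y clearly have opposite signs for this source angle."""
    x, y = blumlein_xy_gains(phi_deg)
    return x * y < -tol


if __name__ == "__main__":
    for phi_deg in (45, 0, -45, 90, 180):
        x, y = blumlein_xy_gains(phi_deg)
        print(f"phi = {phi_deg:4d} deg: X = {x:+.3f}, Y = {y:+.3f}, "
              f"out of phase: {is_out_of_phase(phi_deg)}")
```

Sources at ±90° yield gains of opposite sign, whereas sources at 0° and 180° yield gains of equal sign, in line with the description of the lateral and back directions.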

MS Recording and Playback
Blumlein's patent [1] considers sum and difference signals between a pair of channels/microphones, yielding M-S stereophony. In M-S [8], the sum signal represents the mid (omnidirectional, sometimes cardioid-directional to front) and the difference the side signal (figure-of-eight). MS recordings can also be taken with cardioid microphones and permit manipulation of the stereo width of the recording.

MS recording by omnidirectional and figure-of-eight microphone (native MS).
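Mid-side recording can be done by using a pair of coincident microphones with an omnidirectional (mid, W) and a side-ways oriented figure-of-eight (side, Y) directivity, Fig. 1. Taking the figure-of-eight to aim at the left (90°), consistent with the cardioid variant below, the pickup patterns of the pair are 1 for W and cos(ϕ − 90°) = sin ϕ for Y.
MS recording with a pair of 180°-angled cardioids.
Two coincident cardioid microphones (cardioid directivity 1/2 + 1/2 cos φ, with φ the angle enclosed by microphone aiming and sound source) pointing to the polar angles 90° (left) and −90° (right) are also applicable to mid-side recording, Fig. 1.3. Their pickup patterns
$$\boldsymbol{g}_{C}(\phi) = \frac{1}{2} + \frac{1}{2}\begin{bmatrix} \cos(\phi - 90^\circ) \\ \cos(\phi + 90^\circ) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 + \sin\phi \\ 1 - \sin\phi \end{bmatrix}$$
are encoded into the MS pickup patterns (W, Y) by a sum-and-difference matrix:
$$\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \boldsymbol{g}_{C}(\phi) = \begin{bmatrix} 1 \\ \sin\phi \end{bmatrix}.$$
The matrix eliminates the cardioids' figure-of-eight characteristics by their sum signal, and their omnidirectional characteristics by the difference. We obtain the MS signal pair (W, Y) from the cardioid microphone signals $C_{90^\circ}$ (left) and $C_{-90^\circ}$ (right) as
$$\begin{bmatrix} W \\ Y \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} C_{90^\circ} \\ C_{-90^\circ} \end{bmatrix}. \tag{1.5}$$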

Decoding of MS signals to a stereo loudspeaker pair.
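Decoding of the mid-side signal pair to the left and right loudspeakers is done by feeding both signals to both loudspeakers, however with the side signal fed out-of-phase (sign-inverted) to the right loudspeaker, Fig. 1.4b:
$$\begin{bmatrix} L \\ R \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} W \\ Y \end{bmatrix}. \tag{1.6}$$
An interesting aspect of the 180°-angled cardioid microphone MS is that after inserting the XY-to-MS encoder Eq. (1.5) into the decoder Eq. (1.6), a brief calculation shows that the two matrices invert each other. In this case, the signals of the left and right cardioid, $C_{90^\circ}$ and $C_{-90^\circ}$, are fed directly to the loudspeakers:
$$\begin{bmatrix} L \\ R \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} C_{90^\circ} \\ C_{-90^\circ} \end{bmatrix} = \begin{bmatrix} C_{90^\circ} \\ C_{-90^\circ} \end{bmatrix}.$$
Stereo width.
Modifying the balance between mid and side signal before stereo playback, using a blending parameter α, changes the width of the stereo image from α = 0 (narrow) to α = 1 (full), Fig. 1.4a, see also [9].
In stereophonic MS playback, the playback loudspeaker directions at ±30° are not identical to the peaks of the recording pickup pattern of the side channel (Y) at ±90°. Ambisonics assumes a stricter correspondence between the directional patterns of recording and the patterns mapped onto the playback system.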
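The MS encoding, decoding, and width control discussed above can be sketched in a few lines of Python. The sketch assumes a sum/difference encoder and a decoder that is its inverse; the way α enters here (scaling only the side signal) is one possible form that matches the end points α = 0 (narrow) and α = 1 (full), and is an assumption rather than the formula of the text. All function names are hypothetical.

```python
import math


def cardioid(phi_deg, aim_deg):
    """Cardioid gain 1/2 + 1/2 cos(angle between aiming and source)."""
    return 0.5 + 0.5 * math.cos(math.radians(phi_deg - aim_deg))


def encode_ms(c_left, c_right):
    """Sum/difference encoder: W = C_left + C_right, Y = C_left - C_right."""
    return c_left + c_right, c_left - c_right


def decode_ms(w, y, alpha=1.0):
    """Decode (W, Y) to (L, R).

    alpha scales the side signal before decoding (assumed form of the
    width control): alpha = 0 gives L = R, alpha = 1 the plain decoder.
    """
    return 0.5 * (w + alpha * y), 0.5 * (w - alpha * y)


if __name__ == "__main__":
    phi_deg = 30.0  # source slightly to the left
    c_l, c_r = cardioid(phi_deg, 90.0), cardioid(phi_deg, -90.0)
    w, y = encode_ms(c_l, c_r)
    print("cardioids:", c_l, c_r)
    print("W, Y:", w, y)
    print("full width:", decode_ms(w, y, alpha=1.0))
    print("narrow:", decode_ms(w, y, alpha=0.0))
```

With α = 1 the decoder returns the two cardioid gains unchanged, which illustrates that encoder and decoder invert each other.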

First-Order Ambisonics (FOA)
After Cooper and Shiga [10] worked on expressing panning strategies for arbitrary surround loudspeaker setups in terms of a directional Fourier series, the notion and technology of Ambisonics was developed by Fellgett [11], Gerzon [7], and Craven [12]. In particular, they were also considering a suitable recording technology.
Essentially based on similar considerations as MS, one can define first-order Ambisonic recording. For 2D recordings, a Double-MS microphone arrangement is suitable and only requires one more microphone than MS recording: a front-back oriented figure-of-eight microphone. The scheme is extended to 3D first-order Ambisonics by a third figure-of-eight microphone of up-down aiming. First-order Ambisonics is still often the basis of today's virtual reality applications and 360° audio streams on the internet. In addition to potential loudspeaker playback, it permits interactive playback on head-tracked headphones to render the acoustic sound scene static to the listener.
First-order Ambisonic recording has the advantage that it can be done with only a few high-quality microphones. However, the sole distribution of first-order Ambisonic recordings to playback loudspeakers is typically not convincing without going to higher orders and directional enhancements (Sect. 5.8).

2D First-Order Ambisonic Recording and Playback
The first-order Ambisonic format in 2D consists of one signal corresponding to an omnidirectional pickup pattern (called W), and two signals corresponding to the figure-of-eight pickup patterns aligned with the Cartesian axes (X and Y).

Native 2D Ambisonic recording (Double-MS).
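To record the Ambisonic channels W, X, Y, one can use a Double-MS arrangement as shown in Fig. 1.5. With the azimuth angle ϕ, the omnidirectional pattern and the two figure-of-eight patterns aligned with the x and y axes are
$$\boldsymbol{g}_{WXY}(\phi) = \begin{bmatrix} 1 \\ \cos\phi \\ \sin\phi \end{bmatrix}.$$
2D Ambisonic recording with three 120°-angled cardioids.
Assuming 3 coincident cardioid microphones aiming at the angles 0° and ±120° in the horizontal plane, cf. Fig. 1.7, we obtain as the pickup pattern for the incoming sound
$$\boldsymbol{g}_{C}(\phi) = \frac{1}{2} + \frac{1}{2}\begin{bmatrix} \cos\phi \\ \cos(\phi - 120^\circ) \\ \cos(\phi + 120^\circ) \end{bmatrix}.$$
Combining all three microphone signals yields an omnidirectional pickup pattern, as $\sum_{k=0}^{N-1} \cos\bigl(\phi + \tfrac{2\pi}{N} k\bigr) = 0$ (here with N = 3). Moreover, introducing the differences between the front and the two back microphone signals and between the left and right microphone signals yields an encoding matrix to obtain the omnidirectional characteristic W and the two figure-of-eight characteristics X and Y from the cardioid signals $C_{0^\circ}$, $C_{120^\circ}$, $C_{-120^\circ}$:
$$\begin{bmatrix} W \\ X \\ Y \end{bmatrix} = \frac{2}{3}\begin{bmatrix} 1 & 1 & 1 \\ 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{bmatrix} \begin{bmatrix} C_{0^\circ} \\ C_{120^\circ} \\ C_{-120^\circ} \end{bmatrix}. \tag{1.8}$$
For decoding to loudspeakers, the weights of a sampling decoder discretize the directional pickup characteristics of the Ambisonic channels at the directions of the loudspeaker layout. Consequently, for a loudspeaker layout described by an arbitrary set of angles {ϕ_l}, l = 1, …, L, the sampling decoder can be given as
$$\begin{bmatrix} S_1 \\ \vdots \\ S_L \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & \cos\phi_1 & \sin\phi_1 \\ \vdots & \vdots & \vdots \\ 1 & \cos\phi_L & \sin\phi_L \end{bmatrix} \begin{bmatrix} W \\ X \\ Y \end{bmatrix}.$$
To achieve a panning-invariant and balanced mapping by this decoder, the loudspeakers should be evenly arranged. Moreover, it can be favorable to sharpen the spatial image by attenuating W by 1/√3 to map a sound by a supercardioid playback pattern.
Playback to head-tracked headphones and interactive rotation.
In headphone playback, the headphone signals are generated by convolving the signals of all four loudspeakers with the head-related impulse responses (HRIRs) from each loudspeaker to the left and to the right ear, and summing the contributions per ear. To rotate the Ambisonic input scene of the decoder, it is sufficient to obtain a new set of figure-of-eight signals by mixing the X, Y channels with the following matrix depending on the rotation angle ρ, keeping W unaltered; with the signs chosen here, a source encoded at ϕ is moved to ϕ + ρ:
$$\begin{bmatrix} \tilde{W} \\ \tilde{X} \\ \tilde{Y} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\rho & -\sin\rho \\ 0 & \sin\rho & \cos\rho \end{bmatrix} \begin{bmatrix} W \\ X \\ Y \end{bmatrix}. \tag{1.12}$$
This effect is important for head-tracked headphone playback to render the VR/360° audio scene static around the listener. A complete playback system is shown in Fig. 1.9.
Fig. 1.9 2D first-order Ambisonic decoding to head-tracked headphones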
The big advantage of such a system is that rotational updates can be done at high control rates and the HRIRs of the convolver are constant.
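As an illustration of the 2D signal chain described in this section, the following Python sketch encodes the signals of three 120°-angled cardioids to W, X, Y, rotates the scene, and decodes it with a sampling decoder. The microphone order (0°, +120°, −120°), the decoder factor 1/2, the sign convention of the rotation, and all function names are assumptions made for this sketch.

```python
import math

MIC_AIMS_DEG = (0.0, 120.0, -120.0)  # front, back-left, back-right (assumed order)


def cardioid_signals(phi_deg, s=1.0):
    """Signals of the three cardioids for a source s arriving from phi_deg."""
    return [s * (0.5 + 0.5 * math.cos(math.radians(phi_deg - aim)))
            for aim in MIC_AIMS_DEG]


def encode_wxy(c):
    """Encode the cardioid signals (front, back-left, back-right) to W, X, Y."""
    c0, cl, cr = c
    w = 2.0 / 3.0 * (c0 + cl + cr)         # sum: omnidirectional
    x = 2.0 / 3.0 * (2.0 * c0 - cl - cr)   # front minus back: cos(phi)
    y = 2.0 / math.sqrt(3.0) * (cl - cr)   # left minus right: sin(phi)
    return w, x, y


def rotate_wxy(wxy, rho_deg):
    """Rotate the scene by rho_deg; W stays unaltered."""
    w, x, y = wxy
    c, s = math.cos(math.radians(rho_deg)), math.sin(math.radians(rho_deg))
    return w, c * x - s * y, s * x + c * y


def sampling_decode(wxy, speaker_deg):
    """Sample the patterns 1, cos, sin at the loudspeaker angles."""
    w, x, y = wxy
    return [0.5 * (w + x * math.cos(math.radians(a)) + y * math.sin(math.radians(a)))
            for a in speaker_deg]


if __name__ == "__main__":
    wxy = encode_wxy(cardioid_signals(40.0))
    print("W, X, Y for a source at 40 deg:", wxy)
    rotated = rotate_wxy(wxy, 50.0)  # source now appears at 90 deg
    print("square layout gains:", sampling_decode(rotated, (0.0, 90.0, 180.0, -90.0)))
```

Because only the two figure-of-eight channels are mixed, the rotation costs four multiplications per sample regardless of the number of loudspeakers or HRIRs that follow.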

3D First-Order Ambisonic Recording and Playback
The first-order Ambisonic format in 3D consists of a signal W corresponding to an omnidirectional pickup pattern, and three signals (X, Y, and Z) corresponding to figure-of-eight pickup patterns aligned with the Cartesian coordinate axes.
In three dimensions, we can no longer work with figure-of-eight patterns described by sin ϕ or cos ϕ of the azimuth angle only. It is more convenient to describe the arbitrarily oriented figure-of-eight characteristics cos(φ) using the inner product between a variable direction vector (direction of arriving sound) and a fixed direction vector (microphone direction). Direction vectors are of unit length $\|\boldsymbol{\theta}\| = 1$ and their inner product corresponds to $\boldsymbol{\theta}_1^\mathsf{T}\boldsymbol{\theta} = \cos(\varphi)$, where φ is the angle enclosed by the direction of arrival θ and the microphone direction θ_1. Consequently, a cardioid pickup pattern aiming at θ_1 is described by $\tfrac{1}{2} + \tfrac{1}{2}\boldsymbol{\theta}_1^\mathsf{T}\boldsymbol{\theta}$.

Native 3D Ambisonic recording (Triple-MS).
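To record the Ambisonic channels W, X, Y, Z, one can use a Triple-MS scheme as shown in Fig. 1. With the unit vectors θ_X, θ_Y, θ_Z along the Cartesian axes and the corresponding figure-of-eight patterns $\boldsymbol{\theta}_X^\mathsf{T}\boldsymbol{\theta}$, $\boldsymbol{\theta}_Y^\mathsf{T}\boldsymbol{\theta}$, and $\boldsymbol{\theta}_Z^\mathsf{T}\boldsymbol{\theta}$, we can mathematically describe the pickup patterns of native 3D first-order Ambisonics as
$$\boldsymbol{g}_{WXYZ}(\boldsymbol{\theta}) = \begin{bmatrix} 1 \\ \boldsymbol{\theta}_X^\mathsf{T}\boldsymbol{\theta} \\ \boldsymbol{\theta}_Y^\mathsf{T}\boldsymbol{\theta} \\ \boldsymbol{\theta}_Z^\mathsf{T}\boldsymbol{\theta} \end{bmatrix} = \begin{bmatrix} 1 \\ \boldsymbol{\theta} \end{bmatrix}.$$
3D Ambisonic recording with a tetrahedral arrangement of cardioids.
The principle that worked for three cardioid microphones on the horizon also works for a coincident tetrahedral microphone array of cardioids with the aiming directions FLU-FRD-BLD-BRU (front-left-up, front-right-down, back-left-down, back-right-up), see Fig. 1.11, and [12],
$$\boldsymbol{g}_{C}(\boldsymbol{\theta}) = \frac{1}{2} + \frac{1}{2\sqrt{3}}\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \end{bmatrix} \boldsymbol{\theta}. \tag{1.14}$$
Encoding is achieved by the matrix that adds all microphone signals in the first line (W, omnidirectional), subtracts back from front microphone signals in the second line (X, figure-of-eight), right from left in the third line (Y), and down from up in the fourth line (Z):
$$\begin{bmatrix} W \\ X \\ Y \\ Z \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ \sqrt{3} & \sqrt{3} & -\sqrt{3} & -\sqrt{3} \\ \sqrt{3} & -\sqrt{3} & \sqrt{3} & -\sqrt{3} \\ \sqrt{3} & -\sqrt{3} & -\sqrt{3} & \sqrt{3} \end{bmatrix} \begin{bmatrix} C_\mathrm{FLU} \\ C_\mathrm{FRD} \\ C_\mathrm{BLD} \\ C_\mathrm{BRU} \end{bmatrix}.$$
As Fig. 1.12 shows, practical microphone layouts should be as closely spaced as possible. Nevertheless, at high frequencies the microphones can no longer be considered coincident, and besides a directional error, there will be a loss of presence in the diffuse field. Typically, a shelving filter is used to slightly boost high frequencies.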
Roughly, a high-shelf filter with a 3 dB boost is sufficient to correct timbral defects at frequencies above which the microphone spacing exceeds half a wavelength, e.g., 5 kHz for a 3.4 cm spacing of the microphones. More advanced strategies are found, e.g., in [7, 13-15].
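The tetrahedral encoding and the half-wavelength rule of thumb can be verified with a short Python sketch. It assumes ideal coincident cardioids aiming at the corners (±1, ±1, ±1)/√3 in the order FLU, FRD, BLD, BRU, encoder scalings chosen so that W = 1 and (X, Y, Z) equal the direction of arrival, and a speed of sound of 343 m/s; these scalings and all names are assumptions of the sketch.

```python
import math

SQRT3 = math.sqrt(3.0)
# Aiming directions before normalization by 1/sqrt(3): FLU, FRD, BLD, BRU
AIMS = ((1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1))


def tetra_cardioids(theta):
    """Gains of the four cardioids for a unit direction of arrival theta."""
    return [0.5 + 0.5 / SQRT3 * sum(a * t for a, t in zip(aim, theta))
            for aim in AIMS]


def encode_wxyz(c):
    """Sum and differences of the capsule signals, scaled to W, X, Y, Z."""
    flu, frd, bld, bru = c
    w = 0.5 * (flu + frd + bld + bru)          # sum of all capsules
    x = 0.5 * SQRT3 * (flu + frd - bld - bru)  # front minus back
    y = 0.5 * SQRT3 * (flu - frd + bld - bru)  # left minus right
    z = 0.5 * SQRT3 * (flu - frd - bld + bru)  # up minus down
    return w, x, y, z


def half_wavelength_frequency(spacing_m, c=343.0):
    """Frequency at which the spacing equals half a wavelength."""
    return c / (2.0 * spacing_m)


if __name__ == "__main__":
    theta = (0.6, 0.0, 0.8)  # unit vector, front and elevated
    print("W, X, Y, Z:", encode_wxyz(tetra_cardioids(theta)))
    print("shelf frequency for 3.4 cm spacing [Hz]:", half_wavelength_frequency(0.034))
```

For a spacing of 3.4 cm the sketch returns roughly 5 kHz, matching the example value given above.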
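3D Ambisonic decoding to loudspeakers.
As before in the 2D case, a sampling decoder can be defined that samples the continuous directivity patterns associated with the channels W, X, Y, Z to map the signals to the discrete directions of the loudspeakers. Given the set of loudspeaker directions {θ_l}, l = 1, …, L, and the unit vectors to X, Y, Z, the loudspeaker signals of the sampling decoder become
$$\begin{bmatrix} S_1 \\ \vdots \\ S_L \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 & \boldsymbol{\theta}_1^\mathsf{T} \\ \vdots & \vdots \\ 1 & \boldsymbol{\theta}_L^\mathsf{T} \end{bmatrix} \begin{bmatrix} W \\ X \\ Y \\ Z \end{bmatrix}. \tag{1.16}$$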

Equivalent panning function/virtual microphone.
The sampling decoder together with the native Ambisonic directivity patterns $\boldsymbol{g}_{WXYZ}^\mathsf{T}(\boldsymbol{\theta}) = [1, \boldsymbol{\theta}^\mathsf{T}]$ yields the mapping of a signal s from the direction θ to the loudspeakers to be
$$S_l = \frac{1}{2}\begin{bmatrix} 1 & \boldsymbol{\theta}_l^\mathsf{T} \end{bmatrix} \begin{bmatrix} 1 \\ \boldsymbol{\theta} \end{bmatrix} s = \Bigl(\frac{1}{2} + \frac{1}{2}\boldsymbol{\theta}_l^\mathsf{T}\boldsymbol{\theta}\Bigr)\, s, \qquad l = 1, \dots, L.$$
This result means that the gain of a source from θ at each loudspeaker θ_l corresponds to evaluating a cardioid pattern aligned with θ. Consequently, the Ambisonic mapping corresponds to a signal distribution to the loudspeakers using weights obtained by discretization of an Ambisonics-equivalent first-order panning function. Equivalently, Ambisonic playback using a sampling decoder is comparable to recording each loudspeaker signal with a virtual first-order cardioid microphone aligned with the loudspeaker's direction θ_l.
For a panning-independent loudness mapping and balanced performance, it is decisive that the directions of the loudspeaker layout are well chosen. Also, it can be preferable to reduce the level of the omnidirectional channel W by 1/√3 to map a sound by the narrower supercardioid playback pattern instead of a cardioid pattern, which is rather broad.
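The equivalence between sampling decoding and a cardioid panning function can be illustrated with a small Python sketch for an octahedral layout. The decoder factor 1/2, the loudspeaker order (front, left, back, right, top, bottom), and the function names are assumptions of this sketch.

```python
import math

# Octahedral layout: front, left, back, right, top, bottom (unit vectors)
OCTAHEDRON = ((1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def encode(theta, s=1.0):
    """Native first-order patterns [1, theta] applied to a signal s."""
    return [s, s * theta[0], s * theta[1], s * theta[2]]


def sampling_decode(wxyz, speakers, w_gain=1.0):
    """Sample the patterns at the loudspeaker directions.

    w_gain = 1 gives a cardioid panning function; w_gain = 1/sqrt(3)
    gives the narrower supercardioid mentioned in the text.
    """
    w, x, y, z = wxyz
    return [0.5 * (w_gain * w + d[0] * x + d[1] * y + d[2] * z) for d in speakers]


def cardioid_panning(theta, speakers):
    """Equivalent panning function 1/2 + 1/2 <theta_l, theta>."""
    return [0.5 + 0.5 * sum(a * b for a, b in zip(d, theta)) for d in speakers]


if __name__ == "__main__":
    theta = (0.6, 0.0, 0.8)  # unit vector, front and elevated
    decoded = sampling_decode(encode(theta), OCTAHEDRON)
    print("decoded gains: ", decoded)
    print("cardioid gains:", cardioid_panning(theta, OCTAHEDRON))
    print("sum of gains:  ", sum(decoded))
```

On this evenly arranged layout the sum of the loudspeaker gains is the same for every panning direction, which is one simple indicator of the panning-invariant mapping mentioned above.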
Decoder design problems were early addressed by Gerzon [16], Malham [17], and Daniel [18]. A current solution for higher-order decoding is given in Sect. 4.9.6 on All-round Ambisonic decoding.
3D Ambisonic decoding to headphones.
3D Ambisonic decoding to headphones uses the same approach as for 2D above, except that additional rotational degrees of freedom are implemented to compensate for any change in head orientation. Rotation concerns the three directional components X, Y, Z, while W remains unaltered:
$$\begin{bmatrix} \tilde{X} \\ \tilde{Y} \\ \tilde{Z} \end{bmatrix} = \boldsymbol{R}(\alpha, \beta, \gamma) \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}.$$
For the definition of the rotation matrix R(α, β, γ) and the meaning of its angles, refer to Eq. 5.5 of Sect. 5.2.2. The selection of a suitable set of HRIRs is a question of directional discretization of the 3D directions, as addressed for the decoder above. The signals obtained for the virtual loudspeakers are again convolved with the corresponding HRIRs for the left and the right ear.
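As a simplified illustration of the rotation step, the following Python sketch applies a 3×3 rotation matrix to the X, Y, Z channels and leaves W untouched. It only covers a rotation about the vertical axis (yaw) and does not reproduce the three-angle definition of R(α, β, γ) referenced above; the sign convention and all function names are assumptions of this sketch.

```python
import math


def yaw_matrix(deg):
    """Rotation about the z axis; positive angles turn towards the left."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def rotate_wxyz(wxyz, r):
    """Apply a 3x3 rotation matrix r to X, Y, Z; W is kept unaltered."""
    w, x, y, z = wxyz
    xyz = (x, y, z)
    rotated = [sum(r[i][j] * xyz[j] for j in range(3)) for i in range(3)]
    return [w] + rotated


def compensate_head_yaw(wxyz, head_yaw_deg):
    """Counter-rotate the scene so that it stays static for the listener."""
    return rotate_wxyz(wxyz, yaw_matrix(-head_yaw_deg))


if __name__ == "__main__":
    front_source = [1.0, 1.0, 0.0, 0.0]  # W, X, Y, Z of a source at the front
    # The listener turns the head 90 degrees to the left, so the source
    # should now be rendered at the listener's right (negative Y).
    print(compensate_head_yaw(front_source, 90.0))
```

Since only the scene is rotated, the virtual loudspeakers and their HRIRs remain fixed relative to the head.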

Practical Free-Software Examples
The practical examples below show first-order Ambisonic panning of a mono sound, decoded to simple loudspeaker layouts. These are either a horizontal square layout with 4 loudspeakers or an octahedral layout with 6 loudspeakers.

Pd with Iemmatrix, Iemlib, and Zexy
Pd is free and it can load and install its extensions from the internet. Required software components are:
• pure-data (free, http://puredata.info/downloads/pure-data)
• iemmatrix (free download within pure-data)
• zexy (free download within pure-data)
• iemlib (free download within pure-data)
Figure 1.13 gives an example for horizontal (2D) first-order Ambisonic panning, decoded to 4 loudspeaker and 2 headphone signals. Figure 1.14 shows the processing inside the Pd abstraction [FOA_binaural_decoder] contained in the Fig. 1.13 example, which uses SADIE database 1 subject 1 (KU100 dummy head) HRIRs to render headphone signals. Figure 1.15 sketches a first-order Ambisonic panning in 3D with decoding to an octahedral loudspeaker layout; master level [multiline~] and hardware outlets [dac~] were omitted for easier readability.

Ambix VST Plugins
This example uses a DAW and ready-to-use VST plug-ins to render first-order Ambisonics. As DAW, we recommend Reaper (reaper.fm) because it nicely facilitates higher-order Ambisonics by allowing tracks of up to 64 channels. Moreover, it is relatively low-priced and there is a fully functional free evaluation version available. You can also use any other DAW that supports VST and sufficiently many multi-track channels.
After creating the new track for the virtual source and importing a mono/stereo audio file (per drag-and-drop), the next step is the setup of the track channels. As shown in the table, the virtual source has a single-channel (mono) input and 4 output channels to send the 4 channels of first-order Ambisonics to the Master. The option to send to the Master is activated by default, cf. left in Fig. 1.16. The Master track itself requires 4 input channels and 6 output channels to feed the 6 loudspeakers (right). In Reaper, there is no separate adjustment for input and output channels, thus the Master track has to be set to 6 channels.
In the source track FX, the ambix_encoder_o1 can be used to encode the virtual source signal at an arbitrary location on a sphere by inserting the plug-in into the track of the virtual source, cf. its panning GUI in Fig. 1.17. For adding more sources, the track of the virtual source can simply be copied or duplicated. All effects and routing options are maintained for the new tracks.
In order to decode the 4 first-order Ambisonics Master channels to the loudspeakers, the ambix_decoder_o1 plug-in is added to the Master track. The plug-in requires a preset that defines the decoding matrix as well as its channel sequence and normalization. For the exemplary octahedral setup with 6 loudspeakers, the preset text is saved to a text file as a config file, e.g., "octahedral.config". After loading the preset into the decoder plug-in, the decoder can generate the loudspeaker signals as shown in Fig. 1.18. In the example, the virtual source is panned to the front, resulting in the highest level for loudspeaker 1 (front). The loudspeaker 3 (back) is about 12 dB quieter because of a side-lobe-suppressing supercardioid weighting implied by the switch /coeff_scale n3d, as a trick to keep things simple.
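The level difference between the front and back loudspeaker can be estimated with a short Python sketch. It assumes that the supercardioid weighting amounts to attenuating W by 1/√3 relative to the directional channels, as described for the sampling decoder earlier in this chapter; how the plug-in applies this internally is not modeled here, and all names are hypothetical.

```python
import math

# Octahedral layout: front, left, back, right, top, bottom (unit vectors)
SPEAKERS = ((1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def supercardioid_gains(source, w_gain=1.0 / math.sqrt(3.0)):
    """Loudspeaker gains for a unit source direction with attenuated W."""
    return [0.5 * (w_gain + sum(a * b for a, b in zip(d, source)))
            for d in SPEAKERS]


def level_difference_db(gain_a, gain_b):
    """Level of gain_a relative to gain_b in dB (signs are ignored)."""
    return 20.0 * math.log10(abs(gain_a) / abs(gain_b))


if __name__ == "__main__":
    gains = supercardioid_gains((1.0, 0.0, 0.0))  # source panned to the front
    print("gains:", [round(g, 3) for g in gains])
    print("front vs. back [dB]:", level_difference_db(gains[0], gains[2]))
```

Under this assumption the back loudspeaker comes out a little more than 11 dB below the front loudspeaker and with inverted sign, which is in the range of the level difference observed in the example.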
In addition to the virtual sources, you can also add a 4-channel recording done with a B-format microphone by placing the 4-channel file in a new track. Reaper will automatically set the number of track channels to 4 and send the channels to the Master. Note that some B-format microphones use a different order and/or weighting of the Ambisonics channels. Simple conversion to the AmbiX-format can be done by inserting the ambix_converter_o1 plug-in into the microphone track.

Motivation of Higher-Order Ambisonics
Diffuseness, spaciousness, depth? Diffuse sound fields are typically characterized by sound arriving randomly from evenly distributed directions at evenly distributed delays. It is practical knowledge that the impression of diffuseness and spaciousness benefits from decorrelated signals, which are typically obtained by large distances between the microphones rather than by coincident microphones.
Due to the evenness of diffuse sound fields, one would still hope that a low spatial resolution is sufficient to map diffuseness and spatial depth of a room, using coincident microphones or first-order Ambisonics. Nevertheless, high directional correlation during playback destroys this hope and in fact yields a perceptually impeded playback of diffuseness, spaciousness, and depth.
The technical advantages in interactivity and VR as well as the known shortcomings of first-order coincident recording techniques offer enough motivation to increase the directional resolution and go to higher-order Ambisonics, as presented in the subsequent chapters. For professional productions, it is often not sufficient to only rely on first-order coincident microphone recordings. By contrast, higher-order Ambisonics is able to drastically improve the mapping of diffuseness, spaciousness, and depth, as shown in the upcoming chapter about psychoacoustical properties of many-loudspeaker systems.
Recording with a higher-order main microphone array increases the required technological complexity. Nevertheless, digital signal processing and the theory presented in the later chapters are powerful enough nowadays to achieve this goal.
After all, it seems that delay-based stereophonic recording, such as AB, or equivalence-based recording, such as ORTF, INA5, etc., is often required and well-known in its mapping properties for spaciousness and diffuseness, correspondingly. What is nice about higher-order Ambisonics is that it can make use of these benefits by embedding such recordings appropriately, see Fig. 1.20.
Facts about higher orders: Ambisonics extended to higher orders permits a refinement of the directional resolution and hereby improves the mapping of uncorrelated sounds in playback. Figure 1.21a shows the correlation introduced in two neighboring loudspeaker signals when using Ambisonics, given their spacing of 60°. Given the just noticeable difference (JND) of the inter-aural cross correlation, the figure indicates that an Ambisonic order of ≥3 might be necessary to perceptually preserve decorrelation. For this reason, the perception of spatial depth strongly improves when increasing the Ambisonic order from 1 up to 3, Fig. 1.21b. However, this is only the case when seated at the central listening position. Outside this sweet spot, higher orders than 3, e.g., 5, additionally improve the mapping of depth [19]. Therefore, higher-order Ambisonics is important for preserving spatial impressions and when supplying a large audience. Figure 1.22 shows that the sweet area of perceptually plausible playback increases with the Ambisonic order [20]. With fifth-order Ambisonics, nearly all the area spanned by the horizontal loudspeakers at the IEM CUBE, the 12 × 10 m concert space at our lab, becomes a valid listening area.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.