Cooper [2] used higher-order angular harmonics to formulate circular panning of auditory events. Due to the work of Fellgett [3], Gerzon [4], and Craven [5], the term Ambisonics became common for technology using spherical harmonic functions. Around the early 2000s, the development of higher-order Ambisonic panning and decoding was pioneered most notably by Bamford [6], Malham [7], Poletti [8], Jot [9], and Daniel [10], as well as by Ward and Abhayapala [11], Dickins [12], and, at the lab of the authors, Sontacchi [13].

Another leap happened around 2010, when Ambisonic decoding to loudspeakers could be largely improved by considering regularization methods [14], singular-value decomposition [15], and all-round Ambisonic decoding (AllRAD) [15, 16], a combination of vector-base panning techniques with Ambisonics, yielding the most robust and flexible higher-order decoding method known today.

For headphones, after the work of Jot [9] that outlined the basic problems of binaural decoding in the 1990s, Sun, Bernschütz, Ben-Hur, and Brinkmann [17, 18, 19] made important contributions to binaural decoding, and we consider the TAC and MagLS decoders by Zaunschirm and Schörkhuber [20, 21] as the essential binaural decoders. Both remove HRTF delays or optimize HRTF phases at high frequencies to avoid spectral artifacts. By interaural covariance correction, MagLS/TAC manage to play back diffuse fields consistently, using the formalism of Vilkamo et al. [22].

4.1 Direction Spread in First-Order 2D Ambisonics

In 2D first-order Ambisonics as discussed in Chap. 1, the directional mapping of a single sound source from the angle \(\varphi _\mathrm {s}\) to the direction of each loudspeaker \(\varphi \) is described by the shape of the panning function (or direction-spread function) in Eq. (1.17). The directional spreading is not infinitely narrow, but determined by what can be represented by first-order directivity patterns. Consequently, sound from the angle \(\varphi _\mathrm {s}\) will be mapped by a dipole pattern aligned with the source and an additional omnidirectional pattern. We can introduce a spread parameter a to make the directional spread to the loudspeaker system adjustable and either cardioid-shaped (\(a=1\)), 2D-supercardioid-shaped (\(a=\sqrt{2}\)), or 2D-hypercardioid-shaped (\(a=2\)), using:

$$\begin{aligned} g(\varphi )&=1+a\,\cos (\varphi -\varphi _\mathrm {s}). \end{aligned}$$
(4.1)

This function represents how first-order Ambisonic panning would distribute a mono signal to loudspeakers. With the loudspeaker positions described by the set of angles \(\{\upvarphi _l\}\), a vector of amplitude-panning gains with an entry for each loudspeaker could be determined by sampling the direction-spread function:

$$\begin{aligned} \varvec{g}&=\begin{bmatrix}g_1\\\vdots \\g_\mathrm {L}\end{bmatrix}=1+a\,\begin{bmatrix} \cos (\upvarphi _1-\varphi _\mathrm {s})\\ \vdots \\ \cos (\upvarphi _\mathrm {L}-\varphi _\mathrm {s}) \end{bmatrix}. \end{aligned}$$
(4.2)

With these gain values, we evaluate models of perceived loudness, direction, and width, as introduced in Chap. 2, in order to enter a discussion of perceptual goals.

If the loudspeaker directions \(\{\varvec{\uptheta }_l\}\) are chosen suitably, it is possible to obtain panning-independent loudness, direction, and width measures \(E=\sum _l g_l^2\), \(\varvec{r}_\mathrm {E}=\frac{1}{E}\sum _l g_l^2\varvec{\uptheta }_l\), and \(\frac{5}{8}\frac{180^\circ }{\pi }\,2\arccos \Vert \varvec{r}_\mathrm {E}\Vert \). How is it done?

Fig. 4.1
Width-related angle \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) and angular error \(\angle \varvec{r}_\mathrm {E}-\varphi _\mathrm {s}\) for first-order Ambisonics on a ring with \(90^\circ \) spaced loudspeakers using a cardioid, or 2D super-/hyper-cardioid patterns; the measures are all panning-invariant and the super-cardioid weighting is the 2D max-\(\varvec{r}_\mathrm {E}\) weighting with \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert =45^\circ \)

For first-order 2D Ambisonics, it is theoretically optimal to use a ring of at least 4 loudspeakers with uniform angular spacing and \(a=\sqrt{2}\), which is easily checked with the aid of a computer, cf. Fig. 4.1, and explained below and in Sect. 4.4.
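The computer check mentioned here can be sketched in a few lines. The following is a minimal illustration, assuming NumPy and a hypothetical helper `re_measures` (not part of the original text), that samples Eq. (4.2) on a uniformly spaced ring and evaluates E, the spread angle \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \), and the angular error for many panning directions.

```python
import numpy as np

def re_measures(phi_s, a, L):
    """Sample g = 1 + a*cos(phi_l - phi_s) on L uniformly spaced loudspeakers.

    Returns (E, spread angle in degrees, angular error in degrees).
    """
    phi_l = 2 * np.pi * np.arange(L) / L              # loudspeaker azimuths
    g = 1 + a * np.cos(phi_l - phi_s)                 # gains, Eq. (4.2)
    E = np.sum(g**2)                                  # loudness measure
    r_E = np.sum(g**2 * np.exp(1j * phi_l)) / E       # r_E as complex x + jy
    spread = np.degrees(np.arccos(abs(r_E)))
    error = np.degrees(np.angle(r_E * np.exp(-1j * phi_s)))  # wrapped to +-180
    return E, spread, error

# Pan once around the circle in 5-degree steps on a 4-loudspeaker ring.
pan_angles = np.radians(np.arange(0, 360, 5))
E, spread, error = np.array([re_measures(p, np.sqrt(2), 4) for p in pan_angles]).T

print("E min/max:", E.min(), E.max())
print("spread min/max:", spread.min(), spread.max())
print("largest |error|:", np.abs(error).max())
```

Under these assumptions, all three measures should stay constant over the panning angle, with a spread of \(45^\circ \) and a vanishing angular error. Changing `a` to 1 or 2 gives the cardioid and 2D-hypercardioid cases of Fig. 4.1.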

Direction spread in FOA. The panning-function interpretation with its directional spread has some similarity to MDAP, with its attempt to directionally spread an amplitude-panned signal. Similar to the discrete virtual spread by \({\pm }\alpha =\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) around the panning direction, the virtual direction spread of first-order Ambisonics is described by its continuous panning function \(g(\varphi )\) in Eq. (4.1). To inspect the continuous function by the \(\varvec{r}_\mathrm {E}\) measure defined in Eq. (2.7), we may evaluate an integral over the panning function instead of the sum. Because of the symmetry around \(\varphi _\mathrm {s}\), we may set \(\varphi _\mathrm {s}=0\) for convenience, which is known to yield \(r_\mathrm {E,y}=0\), and evaluate

$$\begin{aligned} r_\mathrm {E,x}&=\frac{\int _0^{2\pi } g^2(\varphi )\,\cos \varphi \,\mathrm {d}\varphi }{\int _0^{2\pi } g^2(\varphi )\,\mathrm {d}\varphi }= \frac{\int _0^\pi [1+2a\cos \varphi +a^2\cos ^2\varphi ]\,\cos \varphi \,\mathrm {d}\varphi }{\int _0^\pi [1+2a\cos \varphi +a^2\cos ^2\varphi ]\,\mathrm {d}\varphi }\\&=\frac{a}{1+\frac{a^2}{2}}. \end{aligned}$$
(4.3)

The maximum of \(r_\mathrm {E,x}=\frac{2a}{2+a^2}\) is found by \(\frac{\mathrm {d}}{\mathrm {d}a }r_\mathrm {E,x}=\frac{4+2a^2-4a^2}{(2+a^2)^2}=\frac{4-2a^2}{(2+a^2)^2}=0\), hence at \(a=\sqrt{2}\). Consequently, the 2D max-\(\varvec{r}_\mathrm {E}\) weight \(a=\sqrt{2}\) gives \(r_\mathrm {E,x}=\frac{\sqrt{2}}{2}=\frac{1}{\sqrt{2}}\) and yields the angle \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert =45^\circ \). This resembles a 2D-MDAP-equivalent source spread to \({\pm }45^\circ \). Note that first-order Ambisonics cannot map to a smaller spread than this. Only higher orders permit reducing this spread further to a desired angle below \(45^\circ \).
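As a numerical cross-check of Eq. (4.3) and of the location of its maximum, one might evaluate the two integrals on a dense angular grid. This is only a sketch, assuming NumPy. The helper name `r_E_x` and the grid resolutions are illustrative choices, not taken from the text.

```python
import numpy as np

def r_E_x(a, num=3600):
    """Approximate Eq. (4.3) by uniform sampling of the full circle."""
    phi = 2 * np.pi * np.arange(num) / num
    g2 = (1 + a * np.cos(phi))**2                  # squared panning function
    return np.sum(g2 * np.cos(phi)) / np.sum(g2)   # ratio of the two integrals

a_grid = np.linspace(0.5, 3.0, 2501)               # candidate spread parameters
r = np.array([r_E_x(a) for a in a_grid])
a_best = a_grid[np.argmax(r)]

print("a maximizing r_E,x:", a_best)
print("maximum r_E,x:", r.max())
print("spread angle in degrees:", np.degrees(np.arccos(r.max())))
```

The sampled curve should coincide with the closed form \(\frac{2a}{2+a^2}\), and the maximizer found on the grid should lie next to \(a=\sqrt{2}\) within the grid step.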

Ideal loudspeaker layouts. Not only is the directional aiming of the virtual, continuous first-order Ambisonic panning function ideal and its width panning-invariant, also its loudness measure is panning-invariant. However, decoding to a physical loudspeaker setup can degrade the ideal behavior. For which loudspeaker layout are these properties preserved by sampling decoding?

The 2D first-order Ambisonic components (WXY) correspond to \(\{1,\,\cos \varphi ,\,\sin \varphi \}\) patterns, a first-order Fourier series in the angle. Sampling the playback directions by \(\mathrm {L}=3\) uniformly spaced loudspeakers on the horizon, the sampling theorem for this series is already fulfilled. Accordingly, Parseval’s theorem ensures panning-invariant loudness E for any panning direction.

For an ideal \(\varvec{r}_\mathrm {E}\) measure, however, one more loudspeaker is required \(\mathrm {L}\ge 4\) for a uniformly spaced horizontal ring. To explain this increase exhaustively, the concept of circular/spherical polynomials and t-designs will be introduced in this chapter. For a brief explanation, \(g^2(\varphi )\) is a second-order expression and therefore to represent the ideal constant loudness \(E=\int g^2(\varphi )\,\mathrm {d}\varphi \) of the continuous panning function consistently after discretization \(E=\frac{2\pi }{\mathrm {L}}\sum _l g_l^2\), it requires \(\mathrm {L}=3\) uniformly spaced loudspeakers, as argued before. By contrast, the expressions \(g^2(\varphi )\cos \varphi \) and \(g^2(\varphi )\sin \varphi \) are third-order and appear in \(\varvec{r}_\mathrm {E}\cdot E=\int g^2(\varphi )\,[\cos \varphi ,\,\sin \varphi ]^\mathrm {T}\mathrm {d}\varphi \). Consequently, ideal mapping of \(\varvec{r}_\mathrm {E}\) (direction and width) requires at least one more loudspeaker \(\mathrm {L}=4\) for a uniformly spaced arrangement to make the continuous and the discretized form \(\varvec{r}_\mathrm {E}\,E=\frac{2\pi }{\mathrm {L}}\sum _l g_l^2\,[\cos \upvarphi _l,\,\sin \upvarphi _l]^\mathrm {T}\) perfectly equal.
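The difference between \(\mathrm {L}=3\) and \(\mathrm {L}=4\) can be made visible numerically. The sketch below assumes NumPy and uses a hypothetical helper `sampled_measures` that evaluates the discretized forms of E and \(\Vert \varvec{r}_\mathrm {E}\Vert \) given in this paragraph for the max-\(\varvec{r}_\mathrm {E}\) pattern \(a=\sqrt{2}\).

```python
import numpy as np

def sampled_measures(phi_s, L, a=np.sqrt(2)):
    """Discretized E and ||r_E|| of g = 1 + a*cos(phi - phi_s) on L loudspeakers."""
    phi_l = 2 * np.pi * np.arange(L) / L
    g2 = (1 + a * np.cos(phi_l - phi_s))**2
    E = 2 * np.pi / L * np.sum(g2)
    r_E = 2 * np.pi / L * np.sum(g2 * np.exp(1j * phi_l)) / E   # complex x + jy
    return E, abs(r_E)

pan = np.radians(np.arange(0, 360, 5))
E3, r3 = np.array([sampled_measures(p, 3) for p in pan]).T
E4, r4 = np.array([sampled_measures(p, 4) for p in pan]).T

print("L=3: E range", E3.min(), E3.max(), "| ||r_E|| range", r3.min(), r3.max())
print("L=4: E range", E4.min(), E4.max(), "| ||r_E|| range", r4.min(), r4.max())
```

With three loudspeakers, E should already be constant while \(\Vert \varvec{r}_\mathrm {E}\Vert \) still fluctuates with the panning angle. With four loudspeakers both should be constant.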

Towards a higher-order panning function. An \(\mathrm {N{th}}\)-order cardioid pattern is obtained from the cardioid pattern by taking its \(\mathrm {N{th}}\) power

$$\begin{aligned} g_\mathrm {N}(\varphi )=\frac{1}{2^\mathrm {N}}(1+\cos \varphi )^\mathrm {N}, \end{aligned}$$

which makes it narrower. With \(\mathrm {N}=2\), this becomes, using \(\cos ^2\varphi =\frac{1}{2}(1+\cos 2\varphi )\),

$$\begin{aligned} g_2(\varphi )=\frac{1}{4}(1+2\cos \varphi +\cos ^2\varphi )=\frac{1}{8}(3+4\cos \varphi +\cos 2\varphi ). \end{aligned}$$

More generally, Chebyshev polynomials \(T_m(\cos \varphi )=\cos m\varphi \), cf. [23, Eq. 3.11.6], can be used to argue that there is always a fully equivalent cosine series describing the higher-order 2D panning function in the azimuth angle

$$\begin{aligned} g(\varphi )=\sum _{m=0}^\mathrm {N}a_m\,\cos m\varphi . \end{aligned}$$
(4.4)
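The conversion from a power series in \(\cos \varphi \) to the cosine series of Eq. (4.4) can be sketched with NumPy's polynomial module, which offers `polypow` and `poly2cheb` for this purpose. The function name `cardioid_cosine_series` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial import chebyshev as C

def cardioid_cosine_series(N):
    """Coefficients a_m of g_N(phi) = sum_m a_m cos(m*phi) for ((1 + cos(phi))/2)^N."""
    power_series = P.polypow([0.5, 0.5], N)   # ((1 + x)/2)^N in powers of x = cos(phi)
    return C.poly2cheb(power_series)          # same polynomial in the basis T_m(x)

a = cardioid_cosine_series(2)
print(a * 8)                                  # compare with (3 + 4 cos(phi) + cos(2 phi))/8

# Cross-check against the defining power of the cardioid on an angular grid.
phi = np.linspace(-np.pi, np.pi, 361)
m = np.arange(len(a))[:, None]
g_series = a @ np.cos(m * phi)
g_power = ((1 + np.cos(phi)) / 2)**2
print(np.max(np.abs(g_series - g_power)))
```

For \(\mathrm {N}=2\) this should reproduce the coefficients of \(g_2(\varphi )\) derived above. Other orders follow by changing the argument.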

Rotated panning function. In first-order Ambisonics, panning functions consist of an omnidirectional part, \(\cos (0\varphi )=1\), and a figure-of-eight to x, \(\cos \varphi \), but that was not all: Recording and playback also required a figure-of-eight pattern to y, \(\sin \varphi \).

The additional component allows expressing rotated first-order directivities by a basis set of fixed directivities. For higher orders, a panning function rotated to a non-zero aiming \(\varphi _\mathrm {s}\ne 0\)

$$\begin{aligned} g(\varphi -\varphi _s)=\sum _{m=0}^\mathrm {N}a_m\,\cos [m(\varphi -\varphi _\mathrm {s})] \end{aligned}$$
(4.5)

can be re-expressed by the addition theorem \(\cos (\alpha +\beta )=\cos \alpha \,\cos \beta -\sin \alpha \sin \beta \) into a series involving the sinusoids (odd symmetric part of a Fourier series),

$$\begin{aligned} g(\varphi -\varphi _\mathrm {s})&=\sum _{m=0}^\mathrm {N}a_m\,[\cos m\varphi _\mathrm {s}\,\cos m\varphi +\sin m\varphi _\mathrm {s}\,\sin m\varphi ]\\ {}&=\sum _{m=0}^\mathrm {N}a_m^{(c)}\,\cos m\varphi +\sum _{m=1}^\mathrm {N}a_m^{(s)}\,\sin m\varphi .\nonumber \end{aligned}$$
(4.6)
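Reading the coefficients off the first line of Eq. (4.6) gives \(a_m^{(c)}=a_m\cos m\varphi _\mathrm {s}\) and \(a_m^{(s)}=a_m\sin m\varphi _\mathrm {s}\). A minimal sketch, assuming NumPy and using the second-order cardioid coefficients from above as example data, confirms that the fixed cosine and sine basis reproduces the rotated pattern. The helper `rotate_cosine_series` is hypothetical.

```python
import numpy as np

def rotate_cosine_series(a, phi_s):
    """Split a_m cos(m(phi - phi_s)) into weights for cos(m phi) and sin(m phi)."""
    m = np.arange(len(a))
    return a * np.cos(m * phi_s), a * np.sin(m * phi_s)

a = np.array([3.0, 4.0, 1.0]) / 8              # second-order cardioid, cosine series
phi_s = np.radians(40)                         # example aiming direction
a_c, a_s = rotate_cosine_series(a, phi_s)

phi = np.linspace(-np.pi, np.pi, 361)
m = np.arange(len(a))[:, None]
g_fixed_basis = a_c @ np.cos(m * phi) + a_s @ np.sin(m * phi)
g_direct = ((1 + np.cos(phi - phi_s)) / 2)**2  # rotated pattern evaluated directly

print(np.max(np.abs(g_fixed_basis - g_direct)))
```

The sine weight for \(m=0\) is always zero, which is why \(\mathrm {N}+1\) cosines and \(\mathrm {N}\) sines suffice.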

We conclude: Higher-order Ambisonics in 2D (and the associated set of theoretical microphone directivities) is based on the Fourier series in the azimuth angle \(\varphi \).

4.2 Higher-Order Polynomials and Harmonics

The previous section required that direction and length of the \(\varvec{r}_\mathrm {E}\) vector resulting from amplitude panning on loudspeakers matched the desired auditory event direction and width. Harmonic functions with strict symmetry around a panning direction \(\varvec{\theta }_\mathrm {s}\) will help us in achieving this goal and in defining good sampling.

Regardless of the dimensions, be it in 2D or 3D, we desire to define continuous and resolution-limited axisymmetric functions around the panning direction \(\varvec{\uptheta }_\mathrm {s}\) to fulfill our perceptual goals of a panning-invariant loudness E, width \(\Vert \varvec{r}_\mathrm {E}\Vert \), and perfect alignment between panning direction \(\varvec{\uptheta }_\mathrm {s}\) and localized direction \(\varvec{r}_\mathrm {E}\). Then we hope to find suitable directional discretization schemes for ideal loudspeaker layouts, so that the measures E and \(\varvec{r}_\mathrm {E}\) are perfectly reconstructed in playback.

The projection of a variable direction vector \(\varvec{\theta }\) onto the panning direction \(\varvec{\uptheta }_\mathrm {s}\) always yields the cosine of the enclosed angle \(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta }=\cos \phi \), no matter whether it is in two or three dimensions. Hence, constructing the panning function based on this projection readily meets the desired goals. The \(m\mathrm {th}\) power thereof, \((\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^m=\cos ^m\phi \), helps to build an \(\mathrm {N{th}}\)-order power series \(g=\sum _{m=0}^\mathrm {N}a_{m}(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^m\) to describe a virtual Ambisonic panning function.

For 2D, such a circular polynomial \(g=\sum _{m=0}^\mathrm {N}a_{m}(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^m\) contains all \((\mathrm {N}+1)(\mathrm {N}+2)/2\) mixed powers by \((\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^m=(\uptheta _\mathrm {xs}\theta _\mathrm {x}+ \uptheta _\mathrm {ys}\theta _\mathrm {y})^m=\sum _{k=0}^m{m \atopwithdelims ()k}\,(\uptheta _\mathrm {xs}\theta _\mathrm {x})^k\,(\uptheta _\mathrm {ys}\theta _\mathrm {y})^{m-k}\) of the direction vectors’ entries \(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}=[\uptheta _\mathrm {xs},\,\uptheta _\mathrm {ys}]\) and \(\varvec{\theta }=[\theta _\mathrm {x},\,\theta _\mathrm {y}]^\mathrm {T}\). However, we could already recognize that it only takes \(2\mathrm {N}+1\) functions to express \(g=\sum _{m=0}^\mathrm {N}a_{m}(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^m=\sum _{m=0}^\mathrm {N}a_{m}\cos ^m\phi \). First, the polynomial in the relative azimuth \(\phi =\varphi -\varphi _\mathrm {s}\) is equivalent to a harmonic series of \(\mathrm {N}+1\) cosines or Chebyshev polynomials, \(g=\sum _{m=0}^\mathrm {N}b_m\,\cos m\phi =\sum _{m=0}^\mathrm {N}b_m\,T_m(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })\). Then, in terms of the absolute azimuth \(\varphi \), the trigonometric addition theorem re-expresses the series as one of \(\mathrm {N}+1\) cosines and \(\mathrm {N}\) sines, with \(T_m(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })=\cos [m(\varphi -\varphi _\mathrm {s})]=\cos m\varphi _\mathrm {s}\cos m\varphi +\sin m\varphi _\mathrm {s}\sin m\varphi \). As shown in the upcoming section, we can alternatively obtain such orthonormal harmonic functions by solving a second-order differential equation that is generally used to define harmonics, which has the benefit that the same approach later defines spherical harmonics in three space dimensions.

Spherical polynomials are similar, \(g=\sum _{n=0}^\mathrm {N}a_n\,(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^n\), involving the expressions \((\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })^n=(\uptheta _\mathrm {xs}\theta _\mathrm {x}+ \uptheta _\mathrm {ys}\theta _\mathrm {y}+\uptheta _\mathrm {zs}\theta _\mathrm {z})^n=\sum _{k=0}^n\sum _{l=0}^{n-k}{n \atopwithdelims ()k}{n-k \atopwithdelims ()l}\,(\uptheta _\mathrm {zs}\theta _\mathrm {z})^k(\uptheta _\mathrm {xs}\theta _\mathrm {x})^{l}(\uptheta _\mathrm {ys}\theta _\mathrm {y})^{n-k-l}\). Again, all these \((\mathrm {N}+1)(\mathrm {N}+2)(\mathrm {N}+3)/6\) combinations would be too many to form an orthogonal set of basis functions. Moreover, while the different cosine harmonics are orthogonal axisymmetric functions in 2D, they are not in 3D. On the sphere, the \(\mathrm {N}+1\) orthogonal Legendre polynomials \(P_n(\cos \phi )\) replace the cosine series as a basis for \(g=\sum _{n=0}^\mathrm {N}c_n\,P_n(\cos \phi )\), as shown below. All mathematical derivations for the sphere rely on the definition of harmonics. They result in \((\mathrm {N}+1)^2\) spherical harmonics and their addition theorem as a basis in terms of absolute directions \(\frac{2n+1}{4\pi }P_n(\varvec{\uptheta }_\mathrm {s}^\mathrm {T}\varvec{\theta })=\sum _{m=-n}^nY_n^m(\varvec{\uptheta }_\mathrm {s})Y_n^m(\varvec{\theta })\). Dickins’ thesis is interesting for further reading [12].
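The statement that cosine harmonics lose their orthogonality on the sphere while Legendre polynomials keep it can be checked numerically. For axisymmetric functions, the integral over the sphere reduces, up to a factor \(2\pi \), to an integral over \(\zeta =\cos \phi \in [-1,1]\). The sketch below assumes NumPy and its Gauss-Legendre quadrature `leggauss`. The helper `gram` and the order `N = 3` are illustrative choices.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre
from numpy.polynomial.chebyshev import Chebyshev

N = 3
zeta, w = leggauss(N + 1)        # nodes/weights, exact up to polynomial degree 2N+1

def gram(basis):
    """Matrix of integrals of p_n(zeta) p_m(zeta) over [-1, 1] for n, m = 0..N."""
    vals = np.array([basis(n)(zeta) for n in range(N + 1)])
    return (vals * w) @ vals.T

G_legendre = gram(Legendre.basis)    # P_n(cos(phi))
G_cosine = gram(Chebyshev.basis)     # T_n(cos(phi)) = cos(n*phi)

np.set_printoptions(precision=3, suppress=True)
print(G_legendre)   # diagonal with entries 2/(2n+1)
print(G_cosine)     # off-diagonal entries remain, e.g. between n=0 and n=2
```

The Legendre Gram matrix comes out diagonal, whereas the cosine harmonics show non-zero off-diagonal entries under the spherical integration weight.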

In both regimes, 2D and 3D, the concept of circular or spherical polynomials will be used to determine optimal layouts, so-called t-designs. Such t-designs are directional sampling grids that preserve the information about the constant part of any circular (2D) or spherical (3D) polynomial up to the order \(\mathrm {N}\le t\). This mathematical key property will be exploited to determine requirements for preserving the E and \(\varvec{r}_\mathrm {E}\) measures during Ambisonic playback with optimal loudspeaker setups. Moreover, t-designs simplify the numerical integration of circular or spherical harmonics needed to define state-of-the-art Ambisonic decoders or mapping effects.

4.3 Angular/Directional Harmonics in 2D and 3D

The Laplacian is defined in the \(\mathrm {D}\)-dimensional Cartesian space as

$$\begin{aligned} \bigtriangleup&=\sum _{j=1}^\mathrm {D}\frac{\partial ^2}{\partial x_j^2}, \end{aligned}$$
(4.7)

and for any function f, the Laplacian \(\bigtriangleup f\) describes the curvature. Any harmonic function is proportional to its curvature by an eigenvalue \(\lambda \),

$$\begin{aligned} \bigtriangleup f&=-\lambda \,f, \end{aligned}$$
(4.8)

and therefore is an oscillatory function. Generally, eigensolutions \(\bigtriangleup f=-\lambda \,f\) to the Laplacian are called harmonics. For suitable eigenvalues \(\lambda \), harmonics span an orthogonal set of basis functions that are typically used for Fourier expansion on a finite interval. It seems desirable to find such harmonics for functions only exhibiting directional dependencies, i.e. in the azimuth angle \(\varphi \) in 2D, and azimuth and zenith angle \(\varphi ,\vartheta \) in 3D.

4.4 Panning with Circular Harmonics in 2D

For 2 dimensions, Appendix A.3.2 uses the generalized chain rule to convert the Laplacian of a 2D coordinate system \(\bigtriangleup =\frac{\partial ^2}{\partial x^2}+\frac{\partial ^2}{\partial y^2}\) to a polar coordinate system with the radius r and the angle \(\varphi \) to the x axis, \( \bigtriangleup =\frac{1}{r}\frac{\partial }{\partial r} + \frac{\partial ^2}{\partial r^2}+\frac{1}{r^2}\frac{\partial ^2}{\partial \varphi ^2}. \) For functions \(\Phi =\Phi (\varphi )\) purely in the angle \(\varphi \), the radial derivatives of \(\bigtriangleup \Phi \) all vanish and what remains is (\(\partial \rightarrow \mathrm {d}\))

$$\begin{aligned} \frac{\mathrm {d}^2}{\mathrm {d}\varphi ^2}\Phi&=-\lambda \,r^2\,\Phi . \end{aligned}$$
(4.9)

This equation only yields useful solutions with \(\lambda \,r^2=m^2\), \(m\in \mathbb {Z}\), cf. Appendix A.3.4, Fig. 4.2,

$$\begin{aligned} \Phi _m=\frac{1}{\sqrt{2\pi }}{\left\{ \begin{array}{ll} \sqrt{2}\sin (|m|\varphi ),&{} \text {for } m<0,\\ 1, &{} \text {for } m=0,\\ \sqrt{2}\cos (m\varphi ), &{} \text {for } m>0, \end{array}\right. } \end{aligned}$$
(4.10)

which defines how to decompose panning functions of limited order \(|m|\le \mathrm {N}\). The harmonics are periodic in azimuth, orthogonal and normalized (orthonormal) on the period \(-\pi \le \varphi \le \pi \). Due to their completeness, any square-integrable function \(g(\varphi )\) can be expanded into a series of the harmonics using coefficients \(\gamma _m\)

$$\begin{aligned} g(\varphi )&=\sum _{m=-\infty }^{\infty }\gamma _m\,\Phi _m(\varphi ). \end{aligned}$$
(4.11)

For a known function \(g(\varphi )\), the coefficients \(\gamma _m\) are obtained by the transformation integral

$$\begin{aligned} \gamma _m=\int _{-\pi }^{\pi }g(\varphi )\,\Phi _m(\varphi )\,\mathrm {d}\varphi , \end{aligned}$$
(4.12)

as shown in Appendix Eq. (A.14).
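The orthonormality of Eq. (4.10) and the transform pair of Eqs. (4.11) and (4.12) can be illustrated numerically. This is a sketch only, assuming NumPy. The helper `circ_harmonics`, the grid size of 64 points, and the first-order test function are hypothetical choices, and the integral is replaced by a uniform sum, which is exact for the low-order trigonometric functions involved.

```python
import numpy as np

def circ_harmonics(N, phi):
    """Rows hold Phi_m(phi) of Eq. (4.10) for m = -N..N, columns the angles."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, N + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi) / np.sqrt(np.pi),            # m < 0
                      np.full((1, phi.size), 1 / np.sqrt(2 * np.pi)),    # m = 0
                      np.cos(m * phi) / np.sqrt(np.pi)])                 # m > 0

N, num = 3, 64
phi = -np.pi + 2 * np.pi * np.arange(num) / num    # one period, uniformly sampled
d_phi = 2 * np.pi / num
Y = circ_harmonics(N, phi)

# Orthonormality: integrals of Phi_m * Phi_m' give the identity matrix.
gram = d_phi * Y @ Y.T

# Transform, Eq. (4.12), and expansion, Eq. (4.11), of a first-order pattern.
g = 1 + np.sqrt(2) * np.cos(phi - np.radians(30))
gamma = d_phi * Y @ g
g_rebuilt = gamma @ Y

print(np.round(gram, 12))
print(np.round(gamma, 6))
```

Only the coefficients with \(|m|\le 1\) should be non-zero for this first-order test pattern.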

Fig. 4.2
Circular harmonics with \(m=-3,\dots ,3\) plotted as polar diagram using the radius \(R=20\lg |\sqrt{\pi }\Phi _m|\) and grayscale to distinguish between positive (gray) and negative (black) signs

2D panning function. An infinitely narrow angular range around a desired direction \(|\varphi -\varphi _\mathrm {s}|<\varepsilon \rightarrow 0\) is represented by the transformation integral over a Dirac delta distribution \(\delta (\varphi -\varphi _{s})\), cf. Appendix Eq. (A.16), so that the coefficients of such a panning function are

$$\begin{aligned} \gamma _m&=\Phi _m(\varphi _\mathrm {s}). \end{aligned}$$
(4.13)

As the infinite circular harmonic series is complete, the panning function is

$$\begin{aligned} g(\varphi )&=\sum _{m=-\infty }^\infty \Phi _m(\varphi _\mathrm {s})\,\Phi _m(\varphi )=\delta (\varphi -\varphi _\mathrm {s}), \end{aligned}$$
(4.14)

and in practice we resolution-limit it to the \(\mathrm {N{th}}\) Ambisonic order, \(|m|\le \mathrm {N}\), and use an additional weight \(a_{m}\) that allows us to design its side lobes

$$\begin{aligned} g_\mathrm {N}(\varphi )&=\sum _{m=-\mathrm {N}}^\mathrm {N}a_m\,\Phi _m(\varphi _\mathrm {s})\,\Phi _m(\varphi ). \end{aligned}$$
(4.15)

The max-\(\varvec{r}_\mathrm {E}\) panning function [24] uses the weights \(a_m=\cos (\frac{\pi \,m}{2(\mathrm {N}+1)})\), as derived in Appendix Eq. (A.20). The spread is now adjustable by the order to \({\pm }\frac{90^\circ }{\mathrm {N}+1}\). The result is shown in Fig. 4.3, compared with no side-lobe suppression when \(a_m=1\) (basic).

It is easy to recognize: \(\Phi _m(\varphi _{s})\) represents the recorded or encoded directions, and \(\Phi _m(\varphi )\) represents the decoded playback directions.
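Equation (4.15) with basic and max-\(\varvec{r}_\mathrm {E}\) weights can be evaluated directly. The following sketch assumes NumPy. The helpers `circ_harmonics`, `panning_function`, and `spread_deg` are hypothetical names, and the \(\varvec{r}_\mathrm {E}\) integral is approximated by a dense uniform sum over the circle.

```python
import numpy as np

def circ_harmonics(N, phi):
    """Rows hold Phi_m(phi) of Eq. (4.10) for m = -N..N, columns the angles."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, N + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi) / np.sqrt(np.pi),
                      np.full((1, phi.size), 1 / np.sqrt(2 * np.pi)),
                      np.cos(m * phi) / np.sqrt(np.pi)])

def panning_function(N, phi, phi_s, max_rE=True):
    """Order-limited panning function g_N(phi), Eq. (4.15)."""
    m_abs = np.abs(np.arange(-N, N + 1))
    a = np.cos(np.pi * m_abs / (2 * (N + 1))) if max_rE else np.ones(2 * N + 1)
    y_s = circ_harmonics(N, phi_s)[:, 0]          # encoded direction
    return (a * y_s) @ circ_harmonics(N, phi)     # evaluated at playback angles

def spread_deg(g, phi):
    """Spread angle arccos||r_E|| of a densely sampled panning function."""
    r_E = np.sum(g**2 * np.exp(1j * phi)) / np.sum(g**2)
    return np.degrees(np.arccos(abs(r_E)))

phi = 2 * np.pi * np.arange(720) / 720
phi_s = np.radians(20)
spreads = {N: spread_deg(panning_function(N, phi, phi_s), phi) for N in (1, 2, 5)}
basic = {N: spread_deg(panning_function(N, phi, phi_s, max_rE=False), phi)
         for N in (1, 2, 5)}
for N in spreads:
    print(N, "max-rE:", spreads[N], "basic:", basic[N])
```

For the max-\(\varvec{r}_\mathrm {E}\) weights, the computed spread should match \(\frac{90^\circ }{\mathrm {N}+1}\) for each order.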

Fig. 4.3
2D unweighted \(a_m=1\) basic and weighted max-\(\varvec{r}_\mathrm {E}\) Ambisonic panning functions for the orders \(\mathrm {N}=1,2,5\)

Optimal sampling of the 2D panning function. In the theory of circular/spherical polynomials in the variable \(\zeta =\cos (\varphi -\varphi _\mathrm {s})\), so-called t-designs in 2D are optimal point sets of given angles \(\{\upvarphi _l\}\) with \(l=1,\dots ,\mathrm {L}\) and size \(\mathrm {L}\). A t-design allows computing the integral (constant part) over the polynomials \(\mathcal {P}_m(\zeta )\) of limited degree \(m\le t\) exactly by discrete summation

$$\begin{aligned} \int _{-\pi }^\pi \mathcal {P}_m(\cos \phi )\,\mathrm {d}\phi =\sum _{l=1}^\mathrm {L} \mathcal {P}_m[\cos (\upvarphi _l-\varphi _\mathrm {s})]\,{\textstyle \frac{2\pi }{\mathrm {L}}}, \end{aligned}$$
(4.16)

regardless of any angular shift \(\varphi _\mathrm {s}\). In 2D, Chebyshev polynomials \(T_m(\cos \phi )=\cos (m\phi )\) are orthogonal polynomials, therefore an \(\mathrm {N{th}}\)-order panning function composed of \(\cos (m\phi )\) is always a polynomial of \(\mathrm {N{th}}\) degree. Knowing this, it is clear that the integrand \(g_\mathrm {N}^2\) required to evaluate the loudness measure E is a polynomial of degree \(2\mathrm {N}\). The integrand required to calculate \(\varvec{r}_\mathrm {E}\) is \(g_\mathrm {N}^2\,\cos (\phi )\) and thus of degree \(2\mathrm {N}+1\). In playback, to get a perfectly panning-invariant loudness measure E of the continuous panning function and also the perfectly oriented \(\varvec{r}_\mathrm {E}\) vector of constant spread \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \), the parameter t must be \(t\ge 2\mathrm {N}+1\). In 2D, all regular polygons are t-designs with \(\mathrm {L}=t+1\) points

$$\begin{aligned} \upvarphi _l&=\frac{2\pi }{t+1}\,(l-1). \end{aligned}$$
(4.17)

We can use the smallest set of \(2\mathrm {N}+2\) angles \(\upvarphi _l=\frac{180^\circ }{\mathrm {N}+1}\,(l-1)\) as optimal 2D layout.
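The t-design property of Eq. (4.16) for regular polygons can be tested on the monomials \(\cos ^m\phi \), whose integral over the circle is 0 for odd m and \(2\pi {m \atopwithdelims ()m/2}/2^m\) for even m. The sketch below assumes NumPy and the standard-library `math.comb`. The helpers `polygon_sum` and `exact_integral`, the choice \(t=3\) (first order), and the test shifts are illustrative.

```python
import math
import numpy as np

def polygon_sum(m, t, phi_s):
    """Right-hand side of Eq. (4.16) for cos^m on a regular polygon with L = t+1."""
    L = t + 1
    phi_l = 2 * np.pi * np.arange(L) / L           # Eq. (4.17)
    return 2 * np.pi / L * np.sum(np.cos(phi_l - phi_s)**m)

def exact_integral(m):
    """Integral of cos^m(phi) over one period."""
    return 0.0 if m % 2 else 2 * math.pi * math.comb(m, m // 2) / 2**m

t = 3                                              # t = 2N+1 for N = 1, i.e. L = 4
shifts = np.radians([0.0, 10.0, 33.0, 45.0])       # arbitrary panning shifts
errors = {m: max(abs(polygon_sum(m, t, s) - exact_integral(m)) for s in shifts)
          for m in range(t + 2)}
for m, err in errors.items():
    print("degree", m, "largest quadrature error", err)
```

Up to degree t the error should vanish to machine precision for every shift, while degree \(t+1\) is no longer integrated exactly.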

4.5 Ambisonics Encoding and Optimal Decoding in 2D

To encode a signal s into Ambisonic signals \(\chi _m\), we multiply the signal by the encoder weights \(\Phi _m(\varphi _\mathrm {s})\), which represent the direction of the signal at the angle \(\varphi _\mathrm {s}\),

$$\begin{aligned} \chi _m(t)&=\Phi _m(\varphi _\mathrm {s})\,s(t), \end{aligned}$$
(4.18)

or in vector notation

$$\begin{aligned} \varvec{\chi }_\mathrm {N}=\varvec{y}_\mathrm {N}(\varphi _\mathrm {s})\,s, \end{aligned}$$
(4.19)

using the column vector \(\varvec{y}_\mathrm {N}=[\Phi _{-\mathrm {N}}(\varphi _\mathrm {s}),\,\dots ,\,\Phi _{\mathrm {N}}(\varphi _\mathrm {s})]^\mathrm {T}\) of \(2\mathrm {N}+1\) components. The Ambisonic signals in \(\varvec{\chi }_\mathrm {N}\) are weighted by side-lobe suppressing weights \(\varvec{a}_\mathrm {N}=[a_{|-\mathrm {N}|},\,\dots ,\,a_\mathrm {N}]^\mathrm {T}\), expressed by the multiplication with a diagonal matrix \(\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\), and then decoded to the \(\mathrm {L}\) loudspeaker signals \(\varvec{x}\) by a sampling decoder

$$\begin{aligned} {\varvec{D}}&={\textstyle \sqrt{\frac{2\pi }{\mathrm {L}}}} \begin{bmatrix}{\varvec{y}}_\mathrm {N}(\upvarphi _1),\,\dots ,\,{\varvec{y}}_\mathrm {N}(\upvarphi _\mathrm {L})\end{bmatrix}^\mathrm {T} ={\textstyle \sqrt{\frac{2\pi }{\mathrm {L}}}}\, {\varvec{Y}}_\mathrm {N}^\mathrm {T}, \end{aligned}$$
(4.20)

using

$$\begin{aligned} {\varvec{x}}={\varvec{D}}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,{\varvec{\chi }}_\mathrm {N}. \end{aligned}$$
(4.21)

In total, the system for encoding and decoding can also be written to yield a set of loudspeaker gains for one virtual source

$$\begin{aligned} {\varvec{g}}={\varvec{D}}\,\mathrm {diag}\{{\varvec{a}}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varphi _\mathrm {s}), \end{aligned}$$
(4.22)

or in particular for the 2D sampling decoder \(\varvec{g}=\sqrt{\frac{2\pi }{\mathrm {L}}}\,\varvec{Y}_\mathrm {N}^\mathrm {T}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{ y}_\mathrm {N}(\varphi _\mathrm {s})\).
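The encoding and sampling-decoding chain of Eqs. (4.19) to (4.22) can be sketched as follows, assuming NumPy, max-\(\varvec{r}_\mathrm {E}\) weights, and the optimal layout with \(\mathrm {L}=2\mathrm {N}+2\) loudspeakers. The helpers `circ_harmonics` and `panning_gains` are hypothetical names for this illustration.

```python
import numpy as np

def circ_harmonics(N, phi):
    """Rows hold Phi_m(phi) of Eq. (4.10) for m = -N..N, columns the angles."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, N + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi) / np.sqrt(np.pi),
                      np.full((1, phi.size), 1 / np.sqrt(2 * np.pi)),
                      np.cos(m * phi) / np.sqrt(np.pi)])

def panning_gains(N, phi_s, L):
    """Loudspeaker gains g = D diag(a_N) y_N(phi_s), Eq. (4.22)."""
    phi_l = 2 * np.pi * np.arange(L) / L                         # uniform ring
    a = np.cos(np.pi * np.abs(np.arange(-N, N + 1)) / (2 * (N + 1)))
    D = np.sqrt(2 * np.pi / L) * circ_harmonics(N, phi_l).T      # Eq. (4.20)
    y_s = circ_harmonics(N, phi_s)[:, 0]                         # encoder, Eq. (4.19)
    return D @ (a * y_s), phi_l

N, L = 2, 6
rows = []
for phi_s in np.radians(np.arange(0, 360, 7.5)):
    g, phi_l = panning_gains(N, phi_s, L)
    E = np.sum(g**2)
    r_E = np.sum(g**2 * np.exp(1j * phi_l)) / E                  # complex x + jy
    rows.append((E, np.degrees(np.arccos(abs(r_E))),
                 np.degrees(np.angle(r_E * np.exp(-1j * phi_s)))))
E, spread, err = np.array(rows).T

print("E range:", E.min(), E.max())
print("spread range in degrees:", spread.min(), spread.max())
print("largest direction error in degrees:", np.abs(err).max())
```

With these settings, E and the spread should be independent of the panning angle, and the \(\varvec{r}_\mathrm {E}\) direction should coincide with \(\varphi _\mathrm {s}\).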

4.6 Listening Experiments on 2D Ambisonics

There are several listening experiments discussing the features of Ambisonics, most of which are summarized in [25]; they are discussed below, complemented with those from [26].

Fig. 4.4
\(2\mathrm {nd}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted Ambisonic panning with pink noise on horizontal rings of \(60^\circ \)-spaced loudspeakers adjusted to perceptually match reference loudspeaker directions (harmonic complex) every \(15^\circ \). Markers and whiskers indicate medians and \(95\%\) confidence intervals, the black curve the \(\varvec{r}_\mathrm {E}\) vector model

Fig. 4.5
Frank’s 2008 pointing experiments [27] at center and off-center listening seats for 3 virtual sources (A, B, C) using \(1\mathrm {st}\)-order (left) and \(5\mathrm {th}\)-order (right) Ambisonics on 12 horizontal loudspeakers (IEM CUBE) indicate a more stable localization with high orders. Moreover, for \(5\mathrm {th}\)-order, \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) weighting and omission of delay compensation were preferred. Omission of \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) weights (“basic”) or alternative “in-phase” weights that entirely suppress any side lobe yield less precise localization at off-center listening positions

Fig. 4.6
Experiments at an off-center position in (a) show that \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) outperforms the basic, rectangularly truncated Fourier series at off-center listening positions, (b), where it can avoid splitting of the auditory event. Stitt’s experiments (c) imply that localization with higher orders is more stable and that the localization deficiency at off-center listening seats seems to be proportional to the ratio of the distance to the center to the radius of the loudspeaker ring, and not to the specific time delays, which are larger for large loudspeaker rings, cf. [28]

Fig. 4.7
Predicted sweet area sizes using the \(\varvec{r}_\mathrm {E}\) model of Sect. 2.2.9 for loudspeaker layouts and playback orders used in Stitt’s experiments [28]: first order (top), third order (bottom), small (left), and large (right)

Fig. 4.8
The perceptual sweet spot size as investigated by Frank [29] nearly covers the entire area enclosed by the IEM CUBE as a playback setup (black \(=\) 5th, gray \(=\) 3rd, light gray \(=\) 1st order Ambisonics). It is smallest for 1st-order Ambisonics

Fig. 4.9
Frank’s 2013 experiments showed for \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) Ambisonics on three frontal loudspeakers that large perceived widths are modeled not entirely accurately, but reasonably well, by the optimally sampled and therefore constant \(\varvec{r}_\mathrm {E}\) vector for low orders, and significantly better for high orders/narrow widths

Fig. 4.10
Frank’s 2013 experiments on the variation of the sound coloration of virtual sources rotating at a speed of \(100^\circ /\mathrm {s}\) imply that \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) weighting outperforms the “basic” weighting with \(a_m=1\), and that adjusting the number of loudspeakers to the Ambisonic order seems to be reasonable

The perceptually adjusted panning angle of \(2\mathrm {nd}\)-order max-\(\varvec{r}_\mathrm {E}\) Ambisonics panning on 6 horizontal loudspeakers matches quite well the acoustic reference direction as shown in Fig. 4.4, similar to MDAP in Fig. 3.8, but with a slightly more accurate median by \(0.5^\circ \) on average, and in particular at side and back panning directions.

Another aspect to investigate is how stable the results are for center and off-center listening seats as shown in Fig. 4.5. It illustrates that \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) with the highest order achieves the best stability with regard to localization at off-center listening seats. Astonishingly, the delay compensation for non-uniform delay times to the center deteriorated the results, most probably because of the nearly linear frontal arrangement of loudspeakers that is more robust to lateral shifts of the listening positions than a circular arrangement.

Figure 4.6a, b shows the direction histogram for two different weightings \(a_m\), and it illustrates that proper sidelobe suppression of the panning function by using \(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\) weights is decisive at shifted listening positions to avoid splitting of the auditory image, as it appears in Fig. 4.6b without the weights (basic).

Peter Stitt’s work shows that the localization offsets at off-center listening seats do not increase with the radius of the loudspeaker arrangement as long as the off-center seat stays in proportion to the radius, Fig. 4.6c. The results are predicted by the sweet area model from Sect. 2.2.9 for the first order (top row) and third order (bottom row) in Fig. 4.7, for both the small setup (left) and the large setup (right).

Frank’s 2016 experiments [29] used scales on the floor from which listeners read off where the sweet area ends in every radial direction, cf. Fig. 4.8a. For Fig. 4.8b, the criterion for listeners to indicate leaving the sweet area was when the frontally panned sound was mapped outside the loudspeaker pairs L, C, and R. It showed that a sweet area providing perceptually plausible playback measures at least \(\frac{2}{3}\) of the radius of the loudspeaker setup if the order is high enough.

The perceived width of auditory events is investigated in the experimental results of Fig. 4.9, [25], in which pink noise was frontally panned in different orientations of the loudspeaker ring (with one loudspeaker in front, and with the front direction lying quarter- and half-spaced with respect to the loudspeaker spacing). Listeners compared the width of multiple stimuli, and the results were expected to indicate constant width for the differently rotated loudspeaker ring, as the optimal arrangement with \(\mathrm {L}=2\mathrm {N}+2\) provides constant \(\varvec{r}_\mathrm {E}\) length. The panning-invariant length is not perfectly reflected in the perceived widths with \(3\mathrm {rd}\) order on 8 loudspeakers, for which the on-loudspeaker position is perceived as being significantly wider. By contrast, the high-order experiment with \(7\mathrm {th}\) order on 16 loudspeakers would perfectly validate the model.

Figure 4.10 shows experiments investigating the time-variant change in sound coloration for a pink-noise virtual source rotating at a speed of \(100^\circ \)/s, and for different Ambisonic panning setups. There is an obvious advantage of a reduced fluctuation in coloration at both listening positions, centered and off-center, when using the side-lobe-suppressing “\(\mathrm {max}\)-\(\varvec{r}_\mathrm {E}\)” weighting instead of the “basic” rectangular truncation of the Fourier series. At the off-center listening position, \(\mathrm {max}\)-\(\varvec{ r}_\mathrm {E}\) weights achieve good results with regard to constant coloration for both \(3\mathrm {rd}\) and \(7\mathrm {th}\) order arrangements with 8 and 16 loudspeakers that were investigated.

How well are diffuse signals preserved in playback? All the above experiments deal with how non-diffuse signals are presented. To complement what is shown in Fig. 1.21 of Chap. 1 with an explanation, the relation between Ambisonic order and its ability to preserve diffuse fields is estimated here by the covariance between uncorrelated directions. Assume that a max-\(\varvec{r}_\mathrm {E}\)-weighted \(\mathrm {N{th}}\)-order Ambisonic panning function \(g(\varvec{\theta }_\mathrm {s}^\mathrm {T}\varvec{\theta })\), normalized to \(g(1)=1\), encodes two sounds \(s_{1,2}\) from two directions \(\varvec{\theta }_1\) and \(\varvec{\theta }_2\) enclosing the angle \(\phi \), with the sounds being uncorrelated and of unit variance, \(E\{s_i\,s_j\}=\delta _{ij}\). The Ambisonic representation mixes the sounds at their respective mapped directions, \(x_1=s_{1}+g_{12}\,s_2\) and \(x_2=s_{2}+g_{12}\,s_1\) with \(g_{12}=g(\cos \phi )\), which increases their correlation,

$$\begin{aligned} R\{x_1\,x_2\}&= \frac{E\{x_1\,x_2\}}{\sqrt{E\{x_1^2\}E\{x_2^2\}}}\\&= \frac{E\{(1+g_{12}^2)\,s_1\,s_2+g_{12}(s_1^2+s_2^2)\}}{ \sqrt{E\{s_{1}^2+2\,g_{12}\,s_1\,s_2+g_{12}^2s_{2}^2\}E\{s_{2}^2+2\,g_{12}\,s_1\,s_2+g_{12}^2s_{1}^2\}}}= \frac{2\,g_{12}}{1+g_{12}^2}.\nonumber \end{aligned}$$
(4.23)

This result was presented in Fig. 1.21 and was used to argue that the directional separation of first-order Ambisonics by its high crosstalk term \(g_{12}\) might be too weak. Higher-order Ambisonics decreases this directional crosstalk and therefore improves the representation of diffuse sound fields.
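
Equation (4.23) can be checked numerically. The following Python sketch compares the closed form with a Monte-Carlo estimate; the function names, the Gaussian test signals, and the crosstalk values `g12` are illustrative assumptions and are not taken from Fig. 1.21.

```python
import math
import random

def crosstalk_correlation(g12):
    """Closed form of Eq. (4.23) for x1 = s1 + g12*s2 and x2 = s2 + g12*s1."""
    return 2.0 * g12 / (1.0 + g12 ** 2)

def simulated_correlation(g12, num_samples=50000, seed=0):
    """Monte-Carlo estimate with uncorrelated, unit-variance Gaussian s1, s2 (assumed)."""
    rng = random.Random(seed)
    sum_12 = sum_11 = sum_22 = 0.0
    for _ in range(num_samples):
        s1, s2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1, x2 = s1 + g12 * s2, s2 + g12 * s1
        sum_12 += x1 * x2
        sum_11 += x1 * x1
        sum_22 += x2 * x2
    return sum_12 / math.sqrt(sum_11 * sum_22)

# Hypothetical crosstalk values: large for low orders, small for high orders.
for g12 in (0.8, 0.37, 0.1):
    print(g12, crosstalk_correlation(g12), simulated_correlation(g12))
```

Already a moderate crosstalk term yields a high correlation, which is why a small \(g_{12}\) matters for diffuse sound.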

4.7 Panning with Spherical Harmonics in 3D

In three space dimensions, the spherical coordinate system has a radius r and two angles, azimuth \(\varphi \) indicating the polar angle of the orthogonal projection to the xy plane, and the zenith angle \(\vartheta \) indicating the angle to the z axis, according to the right-handed spherical coordinate systems in ISO31-11, ISO80000-2, [30, 31], Fig. 4.11.

Fig. 4.11

The spherical coordinate system

By the generalized chain rule, Appendix A.3 re-writes the Laplacian to spherical coordinates in 3D with r signifying the radius, \(\varphi \) the azimuth angle, and the zenith angle \(\vartheta \) re-expressed as \(\zeta =\frac{z}{r}=\cos \vartheta \), yielding the operator \( \bigtriangleup = \frac{2}{r}\frac{\partial }{\partial r} + \frac{\partial ^2}{\partial r^2}+\frac{1}{r^2(1-\zeta ^2)}\frac{\partial ^2}{\partial \varphi ^2} -\frac{2}{r^2}\zeta \frac{\partial }{\partial \zeta }+ \frac{1-\zeta ^2}{r^2} \frac{\partial ^2}{\partial \zeta ^2}\). Any radius-dependent part is removed to define an eigenproblem yielding the basis for panning functions, taking only \(r^2\bigtriangleup _{\upvarphi ,\upzeta ,3\mathrm {D}}\),

$$\begin{aligned} \left[ \frac{1}{1-\zeta ^2}\frac{\partial ^2}{\partial \varphi ^2} -2\zeta \frac{\partial }{\partial \zeta }+ (1-\zeta ^2) \frac{\partial ^2}{\partial \zeta ^2}\right] Y=-\lambda \,Y \end{aligned}$$
(4.24)

whose solution with \(\lambda =n(n+1)\) defines the spherical harmonics

$$\begin{aligned} Y_n^m(\varvec{\theta })=Y_n^m(\varphi ,\vartheta )&=\Theta _n^m(\vartheta )\,\Phi _m(\varphi ). \end{aligned}$$
(4.25)

The pre-requisites are (i) periodicity in \(\varphi \) and (ii) that the function \(Y_n^m\) is finite on the sphere. In addition to the circular harmonics \(\Phi _m\) expressing the dependency on azimuth \(\varphi \) according to Eq. (4.10), the spherical harmonics contain the associated Legendre functions \(P_n^m\) and their normalization term

$$\begin{aligned} \Theta _n^m(\vartheta )&=N_n^{|m|}\,P_n^{|m|}(\cos \vartheta ) \end{aligned}$$
(4.26)

to express the dependency on the zenith angle \(\vartheta \). The index \(n\ge 0\) expresses the order and the directional resolution can be limited by requiring \(0\le n\le \mathrm {N}\). The index m is the degree and for each n it is limited by \(-n\le m \le n\).

Fig. 4.12

Spherical harmonics indexed by Ambisonic channel number \(ACN=n^2+n+m\); rows show spherical harmonics for the order \(0\le n\le 3\) with the \(2n+1\) harmonics of the degree \(-n\le m\le n\). What is plotted is a polar diagram with the radius \(R=20\lg |Y_n^m|\) normalized to the upper 30 dB of each pattern, with positive (gray) and negative (black) color indicating the sign. The order n counts the circular zero crossings, and |m| counts those running through zenith and nadir

The spherical harmonics, Fig. 4.12, are orthonormal on the sphere \(-\pi \le \varphi \le \pi \) and \(0\le \vartheta \le \pi \), and for unbounded order \(\mathrm {N}\rightarrow \infty \) they are complete; see also Appendix A.3.7.

The spherical harmonics permit a series representation of square-integrable 3D directional functions by the coefficients \(\gamma _{nm}\),

$$\begin{aligned} g(\varvec{\theta })&=\sum _{n=0}^\infty \sum _{m=-n}^n\gamma _{nm}\,Y_n^m(\varvec{\theta }). \end{aligned}$$
(4.27)

From a known function \(g(\varvec{\theta })\), the coefficients are obtained by the transformation integral over the unit sphere \(\mathbb {S}^2\), cf. appendix Eq. (A.38)

$$\begin{aligned} \gamma _{nm}&=\int _{\mathbb {S}^2}g(\varvec{\theta })\,Y_n^m(\varvec{\theta })\,\mathrm {d}\varvec{\theta }. \end{aligned}$$
(4.28)

Note that the above N3D normalization \(\int _{\varvec{\theta }\in \mathbb {S}^2}|Y_n^m(\varvec{\theta })|^2\,\mathrm {d}\varvec{\theta }=1\) defines each spherical harmonic except for an arbitrary phase factor it might be multiplied with. Legendre functions for the zenith dependency might be defined differently in literature, and for azimuth, some implementations use \(\sin (m\varphi )\) instead of \(\sin (|m|\varphi )\). In Ambisonics, real-valued functions and the SN3D normalization \(\sqrt{\frac{1}{2}\frac{(n-|m|)!}{(n+|m|)!}}\) are preferred, as are positive signs of the first-order dipole components in the directions of the respective coordinate axes, x, y, z. Depending on the implementation of the respective functions, this may require applying the Condon-Shortley phase \((-1)^m\) to correct the signs of the Legendre functions, or a factor \(-1\) for \(m<0\) to correct the sign of the azimuthal sinusoids. It is often helpful to employ converters and directional checks to ensure compatibility!
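
Such a directional check can be sketched in a few lines of Python. The implementation below is only one possible convention and rests on assumptions: Legendre functions without Condon-Shortley phase, the orthonormal normalization \(\int |Y_n^m|^2\,\mathrm {d}\varvec{\theta }=1\), \(\cos (m\varphi )\) for \(m\ge 0\) and \(\sin (|m|\varphi )\) for \(m<0\). The function names are hypothetical.

```python
import math

def assoc_legendre(n, m, x):
    """Associated Legendre function P_n^m(x), 0 <= m <= n, without Condon-Shortley phase."""
    s = math.sqrt(max(0.0, 1.0 - x * x))
    p_prev = 1.0
    for k in range(1, m + 1):          # P_m^m = (2m-1)!! (1-x^2)^(m/2)
        p_prev *= (2 * k - 1) * s
    if n == m:
        return p_prev
    p = (2 * m + 1) * x * p_prev       # P_{m+1}^m
    for k in range(m + 2, n + 1):      # upward recurrence in n
        p_prev, p = p, ((2 * k - 1) * x * p - (k + m - 1) * p_prev) / (k - m)
    return p

def real_sh(n, m, azi, zen):
    """Real-valued orthonormal spherical harmonic Y_n^m (assumed convention, see text)."""
    am = abs(m)
    norm = math.sqrt((2 * n + 1) / (4 * math.pi) * (1 if m == 0 else 2)
                     * math.factorial(n - am) / math.factorial(n + am))
    trig = math.cos(m * azi) if m >= 0 else math.sin(am * azi)
    return norm * assoc_legendre(n, am, math.cos(zen)) * trig

def acn(n, m):
    """Ambisonic channel number as used in Fig. 4.12."""
    return n * n + n + m

# Directional check: the first-order dipoles should be positive along +x, +y, +z.
half_pi = math.pi / 2
print("Y_1^1  at +x:", real_sh(1, 1, 0.0, half_pi))
print("Y_1^-1 at +y:", real_sh(1, -1, half_pi, half_pi))
print("Y_1^0  at +z:", real_sh(1, 0, 0.0, 0.0))
```

If a library returns a negative value in one of these three checks, a sign correction as described above is needed before mixing its output with other Ambisonic tools.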

3D panning function. An infinitely narrow direction range around a desired direction \(\varvec{\theta }_\mathrm {s}^\mathrm {T}\varvec{\theta }>\cos \varepsilon \rightarrow 1\) is represented by the transformation integral over the Dirac delta \(\delta (1-\varvec{\theta }_\mathrm {s}^\mathrm {T}\varvec{\theta })\), cf. Eq.(A.41), so that the coefficients of the panning function are

$$\begin{aligned} \gamma _{nm}&=Y_n^m(\varvec{\theta }_\mathrm {s}). \end{aligned}$$
(4.29)

As infinitely many spherical harmonics are complete, the panning function is

$$\begin{aligned} g(\varvec{\theta })&=\sum _{n=0}^\infty \sum _{m=-n}^n Y_n^m(\varvec{\theta }_\mathrm {s})\,Y_n^m(\varvec{\theta })=\delta (1-\varvec{\theta }_\mathrm {s}^\mathrm {T}\varvec{\theta }), \end{aligned}$$
(4.30)

and in practice, the finite-resolution \(\mathrm {N{th}}\)-order panning function with \(n\le \mathrm {N}\) employs a weight \(a_n\) to reduce side lobes and optimize the spread

$$\begin{aligned} g_\mathrm {N}(\varvec{\theta })&=\sum _{n=0}^\mathrm {N}\sum _{m=-n}^na_n\,Y_n^m(\varvec{\theta }_\mathrm {s})\,Y_n^m(\varvec{\theta }). \end{aligned}$$
(4.31)

The max-\(\varvec{r}_\mathrm {E}\) panning function uses the weights \(a_n=P_n\bigl [\cos (\frac{137.9^\circ }{\mathrm {N}+1.51})\bigr ]\), as derived in Appendix Eq. (A.46). The spread is now adjustable by the order to \({\pm }\frac{137.9^\circ }{\mathrm {N}+1.51}\). Figure 4.13 shows a comparison to the basic weighting \(a_n=1\). An alternative expression that uses Legendre polynomials \(P_n\) and only depends on the angle \(\phi \) to the panning direction \(\varvec{\theta }_\mathrm {s}\) is obtained by replacing the sum over m by the spherical harmonics addition theorem \(\sum _{m=-n}^n Y_n^m(\varvec{\theta }_\mathrm {s})\,Y_n^m(\varvec{\theta }) ={\textstyle \frac{2n+1}{4\pi }}\,P_n(\cos \phi )\),

$$\begin{aligned} g_\mathrm {N}(\phi )&=\sum _{n=0}^\mathrm {N}{\textstyle \frac{2n+1}{4\pi }}\;a_n\,P_n(\cos \phi ). \end{aligned}$$
(4.32)
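
As a minimal sketch, Eq. (4.32) can be evaluated with the three-term recurrence of the Legendre polynomials. The helper names are assumptions made for this example; the weights follow the max-\(\varvec{r}_\mathrm {E}\) expression given above.

```python
import math

def legendre(n_max, x):
    """Legendre polynomials P_0(x) ... P_{n_max}(x) via the three-term recurrence."""
    p = [1.0, x]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[:n_max + 1]

def max_re_weights(order):
    """a_n = P_n(cos(137.9 deg / (N + 1.51))) for n = 0 ... N."""
    return legendre(order, math.cos(math.radians(137.9 / (order + 1.51))))

def panning_function(phi, order, weights=None):
    """g_N(phi) of Eq. (4.32); weights=None selects the basic weighting a_n = 1."""
    a = weights if weights is not None else [1.0] * (order + 1)
    p = legendre(order, math.cos(phi))
    return sum((2 * n + 1) / (4 * math.pi) * a[n] * p[n] for n in range(order + 1))

# Rear lobe (phi = 180 deg) relative to the main lobe (phi = 0) for N = 5.
order = 5
a = max_re_weights(order)
side_basic = abs(panning_function(math.pi, order)) / panning_function(0.0, order)
side_max_re = abs(panning_function(math.pi, order, a)) / panning_function(0.0, order, a)
print("basic:", side_basic, "max-rE:", side_max_re)
```

The comparison only looks at the lobe opposite to the panning direction; a full side-lobe analysis as in Fig. 4.13 would evaluate all angles.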

Comparison to first-order Ambisonics shows: now \(Y_n^m(\varvec{\theta }_{s})\) represents the recorded or encoded directions, and \(Y_n^m(\varvec{\theta })\) represents the decoded playback directions.

Fig. 4.13

3D unweighted \(a_n=1\) basic and weighted max-\(\varvec{r}_\mathrm {E}\) Ambisonic panning functions for the orders \(\mathrm {N}=1,2,5\)

Optimal sampling of the 3D panning function. In the theory of spherical polynomials in the variable \(\zeta =\varvec{\theta }_s^\mathrm {T}\varvec{\theta }\), so-called t-designs describe point sets of given directions \(\{\varvec{\uptheta }_l\}\) with \(l=1,\dots ,\mathrm {L}\) and size \(\mathrm {L}\) that permit exact computation of the integral (constant part) over the polynomials \(\mathcal {P}_n(\zeta )\) of limited order \(n\le t\) by discrete summation

$$\begin{aligned} \int _{-\pi }^\pi \mathrm {d}\varphi \,\int _{-1}^1 \mathcal {P}_n(\zeta )\,\mathrm {d}\zeta =\sum _{l=1}^\mathrm {L} \mathcal {P}_n(\varvec{\theta }_s^\mathrm {T}\varvec{\uptheta }_l)\,{\textstyle \frac{4\pi }{\mathrm {L}}}, \end{aligned}$$
(4.33)

relative to any axis \(\varvec{\theta }_\mathrm {s}\) the point set is projected onto. In 3D, the Legendre polynomials \(P_n(\zeta )\) are orthogonal polynomials, therefore an \(\mathrm {N{th}}\)-order panning function composed thereof is a polynomial of \(\mathrm {N{th}}\) order. The loudness measure E is calculated by the integral over \(g_\mathrm {N}^2\), therefore over a polynomial of the order \(2\mathrm {N}\). The integral to calculate \(\varvec{r}_\mathrm {E}\) runs over \(g_\mathrm {N}^2\,\zeta \), therefore over a polynomial of the order \(2\mathrm {N}+1\). In playback, to get a perfectly panning-invariant loudness measure E of the continuous panning function and also the perfectly oriented \(\varvec{r}_\mathrm {E}\) vector of constant spread \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \), the parameter t must be \(t\ge 2\mathrm {N}+1\). In 3D there are only 5 geometrically regular layouts

  • the tetrahedron, \(\mathrm {L}=4\) corners, is a 2-design,

  • the octahedron, \(\mathrm {L}=6\) corners, is a 3-design,

  • the hexahedron (cube), \(\mathrm {L}=8\) corners, is a 3-design,

  • the icosahedron, \(\mathrm {L}=12\) corners, is a 5-design,

  • the dodecahedron, \(\mathrm {L}=20\) corners, is a 5-design.

For instance, for \(\mathrm {N}=1\), the octahedron is a suitable spherical design, for \(\mathrm {N}=2\), the icosahedral or dodecahedral layouts are suitable.
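
The design orders listed above can be probed numerically with Eq. (4.33): for \(n\ge 1\) the integral over \(P_n\) vanishes, so the discrete sum must vanish as well. The sketch below tests this for a small, arbitrarily chosen set of projection axes, which is a necessary but not a sufficient check; all names and the choice of axes are assumptions for illustration.

```python
import math

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

TETRAHEDRON = [unit(v) for v in ((1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1))]
OCTAHEDRON = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
CUBE = [unit((x, y, z)) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
AXES = [(0.0, 0.0, 1.0), unit((1, 1, 1)), unit((1, 2, 3))]   # test axes (assumed)

def legendre_p(n, x):
    """Legendre polynomial P_n(x) via the three-term recurrence."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def design_error(points, n, axis):
    """|right-hand side of Eq. (4.33)| for n >= 1, where the integral is zero."""
    total = sum(legendre_p(n, sum(a * b for a, b in zip(axis, pt))) for pt in points)
    return abs(total * 4 * math.pi / len(points))

def max_design_order(points, axes, n_max=8, tol=1e-9):
    """Largest t such that all orders 1..t integrate exactly for all test axes."""
    t = 0
    for n in range(1, n_max + 1):
        if any(design_error(points, n, ax) > tol for ax in axes):
            break
        t = n
    return t

for name, pts in (("tetrahedron", TETRAHEDRON), ("octahedron", OCTAHEDRON), ("cube", CUBE)):
    print(name, max_design_order(pts, AXES))
```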

Exceeding the geometrically regular layouts, there are designs found by optimization to be regular under the mathematical rule to approximate \(\int _{\mathbb {S}^2}Y_n^m(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\sqrt{4\pi }\delta _n\) accurately by \(\frac{4\pi }{\mathrm {L}}\sum _lY_n^m(\varvec{\uptheta }_l)\) for all \(n\le t\) and \(|m|\le n\). A large collection can be found by Hardin and Sloane [32], Gräf and Potts [33], and Womersley [34] available on the following websites

http://neilsloane.com/sphdesigns/dim3/

http://homepage.univie.ac.at/manuel.graef/quadrature.php

(Chebyshev-type Quadratures on \(\mathbb {S}^2\)), and

https://web.maths.unsw.edu.au/~rsw/Sphere/EffSphDes/ss.html.

Figure 4.14 gives some graphical examples.

Fig. 4.14

t-designs from Gräf’s website (Chebyshev-type quadrature)

4.8 Ambisonic Encoding and Optimal Decoding in 3D

To encode a signal s into Ambisonic signals \(\chi _{nm}\), we multiply the signal with the encoder representing the direction \(\varvec{\theta }_\mathrm {s}\) of the signal by the weights \(Y_n^m(\varvec{\theta }_\mathrm {s})\)

$$\begin{aligned} \chi _{nm}(t)&=Y_n^m(\varvec{\theta }_\mathrm {s})\,s(t), \end{aligned}$$
(4.34)

or in vector notation

$$\begin{aligned} \varvec{\chi }_\mathrm {N}=\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\,s, \end{aligned}$$
(4.35)

using the column vector \(\varvec{y}_\mathrm {N}=[Y_0^0(\varvec{\theta }_\mathrm {s}),\,Y_1^{-1}(\varvec{\theta }_\mathrm {s}),\,\dots ,\,Y_{\mathrm {N}}^\mathrm {N}(\varvec{\theta }_\mathrm {s})]^\mathrm {T}\) of \((\mathrm {N}+1)^2\) components. The Ambisonic signals in \(\varvec{\chi }_\mathrm {N}\) are weighted by side-lobe suppressing weights \(\varvec{a}_\mathrm {N}=[a_0,\,a_1,a_1,a_1,\, a_2,\dots ,a_\mathrm {N}]^\mathrm {T}\), expressed by the multiplication with a diagonal matrix \(\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\), and then decoded to the \(\mathrm {L}\) loudspeaker signals \(\varvec{x}\) by a sampling decoder

$$\begin{aligned} \varvec{D}&={\textstyle \sqrt{\frac{4\pi }{\mathrm {L}}}}\, \begin{bmatrix}\varvec{y}_\mathrm {N}(\varvec{\uptheta }_1),\,\dots ,\,\varvec{y}_\mathrm {N}(\varvec{\uptheta }_\mathrm {L})\end{bmatrix}^\mathrm {T} ={\textstyle \sqrt{\frac{4\pi }{\mathrm {L}}}}\, \varvec{Y}_\mathrm {N}^\mathrm {T} , \end{aligned}$$
(4.36)

using

$$\begin{aligned} \varvec{x}=\varvec{D}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{\chi }_\mathrm {N}. \end{aligned}$$
(4.37)

In total, the system for encoding Eq. (4.35) and decoding Eq. (4.36) can also be written to yield loudspeaker gains for one signal

$$\begin{aligned} \varvec{g}=\varvec{D}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s}), \end{aligned}$$
(4.38)

or in particular for the 3D sampling decoding \(\varvec{g}=\sqrt{\frac{4\pi }{\mathrm {L}}}\, \varvec{Y}_\mathrm {N}^\mathrm {T}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\).
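
For first order, the chain of Eqs. (4.35) to (4.38) fits into a short NumPy sketch. It assumes the orthonormal real-valued first-order harmonics in ACN order, \([1,\sqrt{3}\,y,\sqrt{3}\,z,\sqrt{3}\,x]/\sqrt{4\pi }\), and an octahedral layout as the 3-design; the function names are hypothetical.

```python
import numpy as np

def sh_first_order(direction):
    """First-order vector y_1 in ACN order for a Cartesian direction (assumed convention)."""
    x, y, z = np.asarray(direction, dtype=float) / np.linalg.norm(direction)
    return np.array([1.0, np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x]) / np.sqrt(4 * np.pi)

# Octahedron: a 3-design, i.e. t = 2N + 1 for N = 1.
speakers = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
L = speakers.shape[0]
Y = np.stack([sh_first_order(d) for d in speakers], axis=1)   # 4 x L
D = np.sqrt(4 * np.pi / L) * Y.T                               # Eq. (4.36)

a1 = np.cos(np.radians(137.9 / 2.51))                          # max-rE weight, P_1(x) = x
a = np.array([1.0, a1, a1, a1])

def encode(signal, direction):
    """Eq. (4.35): Ambisonic signals, shape 4 x samples."""
    return np.outer(sh_first_order(direction), signal)

def decode(chi):
    """Eq. (4.37): loudspeaker signals, shape L x samples."""
    return D @ (a[:, None] * chi)

def panning_gains(direction):
    """Eq. (4.38): loudspeaker gains for one panning direction."""
    return D @ (a * sh_first_order(direction))

g = panning_gains([1.0, 2.0, -0.5])
E = g @ g                              # loudness measure
r_E = (g ** 2) @ speakers / E          # localization vector
print("E =", E, " r_E =", r_E, " |r_E| =", np.linalg.norm(r_E))
```

On this layout, E does not depend on the panning direction and \(\varvec{r}_\mathrm {E}\) points towards it, as expected for \(t\ge 2\mathrm {N}+1\).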

4.9 Ambisonic Decoding to Loudspeakers

Ambisonic decoding to loudspeakers has been dealt with by numerous researchers: in the past, particularly because results are not very stable for first-order Ambisonics, and later because, for higher-order Ambisonics, they strongly depend on how uniform the loudspeaker layout is. Moreover, Solvang found that even the use of too many loudspeakers has a degrading effect [35].

For first-order decoding, the Vienna decoders by Michael Gerzon [36] are often cited, and for higher-order Ambisonic decoding, one can, e.g. find works by Daniel with max-\(\varvec{r}_\mathrm {E}\) [37] and pseudo-inverse decoding [10], also by Poletti [14, 38, 39].

The most practical solution turned out to be the All-Round Ambisonic Decoding approach (AllRAD), owing to its feature of allowing imaginary loudspeaker insertion and downmix as described in the sections above, cf. [40]. Moreover, it does not restrict the Ambisonic order; with other decoders, such restrictions often yield poor controllability of panning-dependent fluctuations in loudness and directional mapping errors.

The playable set of directions \(\varvec{\uptheta }_l\) or \(\upvarphi _l\) is usually finite and discrete, and it is represented by the surrounding loudspeakers’ directions. In general, the directional distribution of the surrounding loudspeakers is typically not a t-design (with \(t\ge 2\mathrm {N}+1\)), and for 2D in particular, it is sometimes not even a regular polygon with \(\mathrm {L}\ge 2\mathrm {N}+2\) loudspeakers. In such cases, it is extremely helpful to be aware of the properties of the various decoder design methods.

4.9.1 Sampling Ambisonic Decoder (SAD)

The sampling decoder as introduced above is the simplest decoding method. For two (\(\mathrm {D}=2\)) and three (\(\mathrm {D}=3\)) dimensions, it uses the matrix \(\varvec{Y}_\mathrm {N}=[\varvec{y}_\mathrm {N}(\varvec{\uptheta }_1),\,\dots ,\,\varvec{y}_\mathrm {N}(\varvec{\uptheta }_\mathrm {L})]\) containing the respective circular or spherical harmonics \(\varvec{y}_\mathrm {N}(\varvec{\theta })\) sampled at the loudspeaker directions \(\{\varvec{\uptheta }_l\}\),

$$\begin{aligned} \varvec{D}&={\textstyle \sqrt{\frac{S_{\mathrm {D}-1}}{\mathrm {L}}}}\,\varvec{Y}_\mathrm {N}^\mathrm {T}, \end{aligned}$$
(4.39)

with the circumference of the unit circle denoted as \(S_1=2\pi \) or the surface of the unit sphere written as \(S_2=4\pi \). The factor \(\sqrt{\frac{S_{\mathrm {D}-1}}{\mathrm {L}}}\) expresses that each loudspeaker synthesizes a fraction of the E measure on the circle or sphere of the surrounding directions. However, the sampling decoder would neither yield perfectly constant loudness and width measures, E, \(\Vert \varvec{r}_\mathrm {E}\Vert \), nor a correct aiming of the localization measure \(\varvec{r}_\mathrm {E}\) if the loudspeaker layout wasn’t optimal. For instance concerning loudness, for panning towards directional regions of poor loudspeaker coverage, sampling misses out the main lobe of the panning function, yielding a noticeably reduced loudness.

4.9.2 Mode Matching Decoder (MAD)

The mode-matching method is used in [10, 39] and yields a fundamentally different decoder design. Its concept is to re-encode the gain vector \(\varvec{g}\) of the loudspeakers for any panning direction \(\varvec{\theta }_\mathrm {s}\) by the encoding matrix \(\varvec{Y}_\mathrm {N}=[\varvec{y}_\mathrm {N}(\varvec{\uptheta }_1),\,\dots ,\,\varvec{y}_\mathrm {N}(\varvec{\uptheta }_\mathrm {L})]\) for all loudspeaker directions \(\{\varvec{\uptheta }_l\}\). Ideally, the re-encoded result should match the encoding of the panning direction with sidelobes suppressed

$$\begin{aligned} \varvec{Y}_\mathrm {N}\,\varvec{g}=\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_s). \end{aligned}$$

Using the definition \(\varvec{g}=\varvec{D}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_s)\) of the panning gains, we obtain

$$\begin{aligned} \varvec{Y}_\mathrm {N}\varvec{D}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_s)&=\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_s),\nonumber \\ \Rightarrow \varvec{D}&={\textstyle \sqrt{\frac{\mathrm {L}}{S_{\mathrm {D}-1}}}}\varvec{Y}_\mathrm {N}^\mathrm {T}(\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T})^{-1} \end{aligned}$$
(4.40)

so that, up to the scalar normalization \(\sqrt{\frac{\mathrm {L}}{S_{\mathrm {D}-1}}}\) that aligns it with the sampling decoder on optimal layouts, the decoder \(\varvec{D}\) is required to be right-inverse to the matrix \(\varvec{Y}_\mathrm {N}\), i.e. \(\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T}(\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T})^{-1}=\varvec{I}\), see Eq. (A.63) in Appendix A.4. For the inverse of \(\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T}\) to exist, it is necessary to have at least as many loudspeakers as harmonics, i.e. \(\mathrm {L}\ge (\mathrm {N}+1)^2\) with \(\mathrm {D}=3\) or \(\mathrm {L}\ge 2\mathrm {N}+1\) for \(\mathrm {D}=2\). However, this is not a sufficient criterion yet: In directions poorly covered with loudspeakers, the inversion will boost the loudness, so that the result is often numerically ill-conditioned for \((\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T})^{-1}\) unless the loudspeaker layout is at least uniformly designed. In particular, mode-matching decoding is ill-conditioned on hemispherical or semicircular loudspeaker layouts. The solution is equivalently described by the more general pseudo inverse \(\varvec{Y}_\mathrm {N}^\dagger \), which is right-inverse for fat matrices.

4.9.3 Energy Preservation on Optimal Layouts

For instance, for an order of \(\mathrm {N}=2\), 2D Ambisonics should work optimally with a ring of \(45^\circ \) spaced loudspeakers on the horizon, a circular \((2\mathrm {N}+1)\)-design, or for 3D, a spherical \((2\mathrm {N}+1)\)-design. On a t-design selected by \(t\ge 2\mathrm {N}\), the loudness measure E is panning-invariant, in general,

$$\begin{aligned} E&=\Vert \varvec{g}\Vert ^2=\varvec{y}^\mathrm {T}_\mathrm {N}(\varvec{\theta }_s)\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\underbrace{\varvec{D}^\mathrm {T} \varvec{D}}_{=\varvec{I}}\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\, \varvec{y}_\mathrm {N}(\varvec{\theta }_s) =\Vert \mathrm {diag}\{\varvec{a}_\mathrm {N}\}\, \varvec{y}_\mathrm {N}(\varvec{\theta }_s)\Vert ^2=\text {const.} \end{aligned}$$

This is because a \(t\ge 2\mathrm {N}\)-design discretization preserves orthonormality

$$\begin{aligned} \int \varvec{y}_\mathrm {N}(\varvec{\theta })\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }={\textstyle \frac{S_{\mathrm {D}-1}}{\mathrm {L}}}\sum _{l=1}^\mathrm {L} \varvec{y}_\mathrm {N}(\varvec{\uptheta }_l)\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\uptheta }_l)= {\textstyle \frac{S_{\mathrm {D}-1}}{\mathrm {L}}} \varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T}=\varvec{I}, \end{aligned}$$
(4.41)

which implies for the sampling decoder \(\varvec{D}^\mathrm {T}\varvec{D}=\frac{S_{\mathrm {D}-1}}{\mathrm {L}}\,\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T}=\varvec{I}\), and we notice the panning invariant norm of \(g(\varvec{\theta })\) within its coefficients \(\varvec{\gamma }_\mathrm {N}=\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\) by the Parseval theorem \(\int g^2(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\Vert \varvec{\gamma }_\mathrm {N}\Vert ^2\). The panning invariant E measure also holds for the mode-matching decoder using a \(t\ge 2\mathrm {N}\)-design, as it becomes equivalent to a sampling decoder \(\varvec{D}={\sqrt{\frac{\mathrm {L}}{S_{\mathrm {D}-1}}}}\varvec{Y}_\mathrm {N}^\mathrm {T}(\varvec{Y}_\mathrm {N}\varvec{Y}_\mathrm {N}^\mathrm {T})^{-1}= {\sqrt{\frac{\mathrm {L}}{S_{\mathrm {D}-1}}}}\varvec{Y}_\mathrm {N}^\mathrm {T}{\frac{S_{\mathrm {D}-1}}{\mathrm {L}}}={\sqrt{\frac{S_{\mathrm {D}-1}}{\mathrm {L}}}}\varvec{Y}_\mathrm {N}^\mathrm {T}\). Under these ideal conditions, both decoders are energy-preserving.
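
The equivalence of both decoders on optimal layouts is easy to verify in 2D. The NumPy sketch below assumes orthonormal circular harmonics ordered as \([\sin \mathrm {N}\varphi ,\dots ,\sin \varphi ,\,1/\sqrt{2},\,\cos \varphi ,\dots ,\cos \mathrm {N}\varphi ]/\sqrt{\pi }\); this ordering and the function names are assumptions of the example, not necessarily those of Eq. (4.10).

```python
import numpy as np

def circ_harmonics(phi, order):
    """Orthonormal circular harmonics, shape (2N+1) x len(phi) (assumed ordering)."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, order + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi),
                      np.full((1, phi.size), np.sqrt(0.5)),
                      np.cos(m * phi)]) / np.sqrt(np.pi)

def decoders_on_ring(num_speakers, order):
    """Sampling, Eq. (4.39), and mode-matching, Eq. (4.40), decoders for a regular ring."""
    spk = 2 * np.pi * np.arange(num_speakers) / num_speakers
    Y = circ_harmonics(spk, order)
    D_sad = np.sqrt(2 * np.pi / num_speakers) * Y.T
    D_mad = np.sqrt(num_speakers / (2 * np.pi)) * Y.T @ np.linalg.inv(Y @ Y.T)
    return D_sad, D_mad

order = 3
for L in (7, 8):   # regular rings with L = 2N+1 and L = 2N+2 loudspeakers
    D_sad, D_mad = decoders_on_ring(L, order)
    is_orthonormal = np.allclose(D_sad.T @ D_sad, np.eye(2 * order + 1))
    is_equal = np.allclose(D_sad, D_mad)
    print(L, is_orthonormal, is_equal)
```

A regular ring of \(\mathrm {L}=2\mathrm {N}+1\) loudspeakers already satisfies \(t\ge 2\mathrm {N}\), which suffices for the panning-invariant E measure considered in this section.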

4.9.4 Loudness Deficiencies on Sub-optimal Layouts

For 2D layouts, Fig. 4.15 shows what happens if a decoder is calculated for a \(t\ge 2\mathrm {N}+1\)-design with one loudspeaker removed: While, for panning across the gap, the sampling Ambisonic decoder (SAD) yields a quieter signal, moderate localization errors and width fluctuation, the mode-matching decoder (MAD) yields a strong loudness increase and severe jumps in the localization/width. MAD is therefore not very practical with sub-optimal layouts, SAD only slightly more so.
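
The loudness behaviour across the gap can be reproduced with a small NumPy sketch. It assumes the circular harmonics ordered as \([\sin \mathrm {N}\varphi ,\dots ,\sin \varphi ,\,1/\sqrt{2},\,\cos \varphi ,\dots ,\cos \mathrm {N}\varphi ]/\sqrt{\pi }\) and the 2D max-\(\varvec{r}_\mathrm {E}\) weights \(a_m=\cos \frac{m\pi }{2(\mathrm {N}+1)}\); both are assumptions of this example, and it is not the script behind Fig. 4.15.

```python
import numpy as np

def circ_harmonics(phi, order):
    """Orthonormal circular harmonics, shape (2N+1) x len(phi) (assumed ordering)."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, order + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi),
                      np.full((1, phi.size), np.sqrt(0.5)),
                      np.cos(m * phi)]) / np.sqrt(np.pi)

def max_re_weights_2d(order):
    """Assumed 2D max-rE weights, one per harmonic in the ordering above."""
    m = np.abs(np.arange(-order, order + 1))
    return np.cos(m * np.pi / (2 * (order + 1)))

order = 3
# 45-degree ring with the loudspeaker at -90 degrees removed.
spk = np.radians([-135.0, -45.0, 0.0, 45.0, 90.0, 135.0, 180.0])
L = spk.size
Y = circ_harmonics(spk, order)
a = max_re_weights_2d(order)
D_sad = np.sqrt(2 * np.pi / L) * Y.T                      # Eq. (4.39)
D_mad = np.sqrt(L / (2 * np.pi)) * np.linalg.pinv(Y)      # Eq. (4.40) via pseudo inverse

def loudness_db(D, pan_deg):
    """E in dB relative to the ideal, panning-invariant value."""
    gamma = a * circ_harmonics(np.radians(pan_deg), order)[:, 0]
    g = D @ gamma
    return 10 * np.log10((g @ g) / (gamma @ gamma))

for name, D in (("SAD", D_sad), ("MAD", D_mad)):
    print(name, [round(float(loudness_db(D, p)), 1) for p in (-90, 0, 90)])
```

Panning into the gap at \(-90^\circ \) lowers the level for SAD and raises it for MAD, while both stay close to the reference on the opposite side of the ring.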

Fig. 4.15

Analysis of loudness, localization error, and width for \(3\mathrm {rd}\)-order sampling (SAD) and mode-matching (MAD) Ambisonic decoding for all panning angles on a sub-optimal \(45^\circ \) spaced loudspeaker ring with gap at \(-90^\circ \), compared to VBIP

4.9.5 Energy-Preserving Ambisonic Decoder (EPAD)

To establish panning-invariant loudness for decoding to non-uniform surround loudspeaker layouts, one can ensure a constant loudness measure E by enforcing \(\varvec{D}^\mathrm {T}\varvec{D}=\varvec{I}\), which is otherwise only achieved on \(t\ge 2\mathrm {N}\)-designs. We may search for a decoding matrix \(\varvec{D}\) whose entries are closest to the sampling decoder under the constraint of being column-orthogonal:

$$\begin{aligned} \Vert \varvec{D}-\sqrt{\textstyle \frac{S_{\mathrm {D}-1}}{\mathrm {L}}}\,\varvec{Y}_\mathrm {N}^ \mathrm {T}\Vert ^2_\mathrm {Fro}&\rightarrow \min \\ \text {subject to }\varvec{D}^\mathrm {T}\varvec{D}&=\varvec{I}.\nonumber \end{aligned}$$
(4.42)

The singular value decomposition of

$$\begin{aligned} \varvec{Y}_\mathrm {N}^\mathrm {T}=\varvec{U}\,[\mathrm {diag}\{\varvec{s}\},\varvec{0}]^\mathrm {T}\,\varvec{V}^\mathrm {T} \end{aligned}$$
(4.43)

can be used to create

$$\begin{aligned} \varvec{D}= \varvec{U}\, \begin{bmatrix} \varvec{I},\, \varvec{0} \end{bmatrix}^\mathrm {T}\, \varvec{V}^\mathrm {T} \end{aligned}$$
(4.44)

by replacing the singular values \(\varvec{s}\) with ones. Such a decoder is column-orthogonal, as the singular-value decomposition delivers \(\varvec{U}^\mathrm {T}\varvec{U}=\varvec{I}\) and \(\varvec{V}\varvec{V}^\mathrm {T}=\varvec{I}\), and as a consequence (see Footnote 1) \(\varvec{D}^\mathrm {T}\varvec{D}=\varvec{I}\). The energy-preserving decoder in this basic version requires \(\mathrm {L}\ge 2\mathrm {N}+1\) loudspeakers in 2D or \(\mathrm {L}\ge (\mathrm {N}+1)^2\) in 3D to work.

Note that if the loudspeaker setup directions are already a \(t\ge 2\mathrm {N}\) design, the sampling, mode-matching, and energy-preserving decoders are equivalent.
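
A compact NumPy sketch of Eqs. (4.43) and (4.44) uses the economy-size singular-value decomposition, which directly delivers the first \(2\mathrm {N}+1\) columns of \(\varvec{U}\). The circular-harmonic ordering, the 2D max-\(\varvec{r}_\mathrm {E}\) weights \(a_m=\cos \frac{m\pi }{2(\mathrm {N}+1)}\), and the layout with a gap are assumptions chosen for illustration.

```python
import numpy as np

def circ_harmonics(phi, order):
    """Orthonormal circular harmonics, shape (2N+1) x len(phi) (assumed ordering)."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, order + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi),
                      np.full((1, phi.size), np.sqrt(0.5)),
                      np.cos(m * phi)]) / np.sqrt(np.pi)

def epad(Y):
    """Energy-preserving decoder from the (2N+1) x L matrix Y, requires L >= 2N+1."""
    U, s, Vt = np.linalg.svd(Y.T, full_matrices=False)   # Y^T = U diag(s) V^T
    return U @ Vt                                         # singular values replaced by ones

order = 3
spk = np.radians([-135.0, -45.0, 0.0, 45.0, 90.0, 135.0, 180.0])  # gap at -90 degrees
m = np.abs(np.arange(-order, order + 1))
a = np.cos(m * np.pi / (2 * (order + 1)))                # assumed 2D max-rE weights
D_epad = epad(circ_harmonics(spk, order))

def loudness(pan_deg):
    """Loudness measure E = ||g||^2 for one panning angle."""
    g = D_epad @ (a * circ_harmonics(np.radians(pan_deg), order)[:, 0])
    return g @ g

E = np.array([loudness(p) for p in range(-180, 180, 5)])
print("E min/max:", E.min(), E.max())
```

The loudness measure stays constant for all panning angles, whereas nothing is enforced for the direction and length of \(\varvec{r}_\mathrm {E}\).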

4.9.6 All-Round Ambisonic Decoding (AllRAD)

In Chap. 3 on vector-base amplitude panning methods, a well-balanced panning result in terms of loudness, width, and localization was achieved by MDAP that distributes a signal to an arrangement of several superimposed VBAP virtual sources. Hereby \(E=\text {const.}\), \(\varvec{r}_\mathrm {E}\approx r_\mathrm {E}\,\varvec{\theta }_\mathrm {s}\), and \(r_\mathrm {E}\approx \text {const}\). This works for nearly any loudspeaker layout.

While, to calculate loudspeaker gains, MDAP superimposes an arrangement of discrete virtual sources within a range of \({\pm }\alpha \) around the panning direction \(\varvec{\theta }_\mathrm {s}\), one could also think of superimposing a quasi-continuous distribution of virtual sources that are weighted by a continuous panning function \(g(\varvec{\theta })\).

The ideal continuous panning function \(g(\varvec{\theta })\) of axisymmetric directional spread around the panning direction \(\varvec{\theta }_\mathrm {s}\) is described by \(g(\varvec{\theta })=\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\), the Ambisonic panning function. This rotation-invariant continuous function is optimal in terms of loudness, width, and localization measures, which are all evaluated by continuous integrals: \(E=\int g^2(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\text {const.}\) expresses panning-invariant loudness, \(\varvec{r}_\mathrm {E}=\frac{1}{E}\int g^2(\varvec{\theta })\,\varvec{\theta }\,\mathrm {d}\varvec{\theta }=r_\mathrm {E}\,\varvec{\theta }_\mathrm {s}\) indicates a perfect alignment \(\varvec{r}_\mathrm {E}\parallel \varvec{\theta }_\mathrm {s}\) with the panning direction and a panning-invariant width \(r_\mathrm {E}=\text {const}\). However, the optimal values of these integrals are only preserved by discretization with optimal \(t\ge 2\mathrm {N}+1\)-design loudspeaker layouts.

All-round Ambisonic decoding (AllRAD) is preceded by the work of Batke and Keiler [16]. They describe Ambisonic panning \(\varvec{g}_\mathrm {AllRAD}(\varvec{\theta })=\varvec{D}\,\varvec{y}_\mathrm {N}(\varvec{\theta })\) by a decoder \(\varvec{D}\), whose result matches best with VBAP \(\varvec{g}_\mathrm {VBAP}(\varvec{\theta })\). Without max-\(\varvec{r}_\mathrm {E}\) weights yet, we use this here to define AllRAD as a minimum-mean-square-error problem, integrated over all panning directions \(\varvec{\theta }\)

$$\begin{aligned} \min _{\varvec{D}}\int _{\mathbb {S}^2}\big \Vert \varvec{g}_\mathrm {VBAP}(\varvec{\theta }) - \varvec{D}\,\varvec{y}_\mathrm {N}(\varvec{\theta }) \big \Vert ^2\,\mathrm {d}\varvec{\theta }. \end{aligned}$$
(4.45)

Equivalently, as described by Zotter and Frank [40] who coined the name, we may define AllRAD as VBAP synthesis on the physical loudspeakers when using as multiple-virtual-source inputs the Ambisonic panning function \(g_\mathrm {AMBI}(\varvec{\theta })=\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\,\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\,\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\) sampled at an optimal layout of virtual loudspeakers. Here, we write the synthesis as the integral over infinitely many virtual loudspeakers \(\varvec{\theta }\),

$$\begin{aligned} \varvec{g}&=\int \varvec{g}_\mathrm {VBAP}(\varvec{\theta })\, g_\mathrm {AMBI}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\int \varvec{g}_\mathrm {VBAP}(\varvec{\theta })\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\, \mathrm {diag}\{\varvec{a}_\mathrm {N}\}\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s}) \,\mathrm {d}\varvec{\theta }\\&=\underbrace{\int \varvec{g}_\mathrm {VBAP}(\varvec{\theta })\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }}_{:=\varvec{D}}\, \mathrm {diag}\{\varvec{a}_\mathrm {N}\}\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s}). \end{aligned}$$

We can obviously pull the term \(\mathrm {diag}\{\varvec{a}_\mathrm {N}\}\varvec{y}_\mathrm {N}(\varvec{\theta }_\mathrm {s})\) out of the integral. The remaining integral defines the AllRAD matrix \(\varvec{D}\). We may interpret it as a transformation of the VBAP loudspeaker gain functions \(\varvec{g}_\mathrm {VBAP}(\varvec{\theta })\) into spherical harmonic coefficients. In the original paper [40], AllRAD is evaluated by an optimal layout of discrete virtual loudspeakers

$$\begin{aligned} \varvec{D}&=\int \varvec{g}_\mathrm {VBAP}(\varvec{\theta })\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }= {\textstyle \frac{S_{\mathrm {D}-1}}{\hat{\mathrm {L}}}}\sum _{l=1}^{\hat{\mathrm {L}}} \varvec{g}_\mathrm {VBAP}(\varvec{\hat{\uptheta }}_l)\,\varvec{y}_\mathrm {N}^\mathrm {T}(\varvec{\hat{\uptheta }}_l)= {\textstyle \frac{S_{\mathrm {D}-1}}{\hat{\mathrm {L}}}}\,\varvec{\hat{ G}}\,\varvec{\hat{Y}}^{\mathrm {T}}_\mathrm {N}, \end{aligned}$$
(4.46)

using the directions \(\{\varvec{\hat{\uptheta }}_l\}\) of a t-design. As VBAP’s gain functions aren’t smooth (derivatives are non-continuous), they are order-unlimited, and a t-design of sufficiently high t should be used. In the 3D practice, the 5200 pts. Chebyshev-type design from [33] is dense enough. Note that the VBAP part permits improvements by insertion and downmix of imaginary loudspeakers to adapt to asymmetric or hemispherical layouts, as suggested in the original paper [40], cf. Sect. 3.3.

Note that the decoder needs to be scaled properly. For instance, the norm of the omnidirectional component (first column) could be equalized to one, as it would typically be with a sampling decoder; there are alternative strategies to circumvent the scaling problem [41].
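
A 2D variant of Eq. (4.46) can be sketched with NumPy as follows. The example is a simplified illustration under assumptions: pairwise amplitude panning with unit-norm gains stands in for VBAP, a dense regular ring of 360 virtual loudspeakers replaces the t-design, and the circular harmonics and 2D max-\(\varvec{r}_\mathrm {E}\) weights \(a_m=\cos \frac{m\pi }{2(\mathrm {N}+1)}\) use an assumed ordering. No imaginary loudspeakers are inserted.

```python
import numpy as np

def circ_harmonics(phi, order):
    """Orthonormal circular harmonics, shape (2N+1) x len(phi) (assumed ordering)."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    m = np.arange(1, order + 1)[:, None]
    return np.vstack([np.sin(m[::-1] * phi),
                      np.full((1, phi.size), np.sqrt(0.5)),
                      np.cos(m * phi)]) / np.sqrt(np.pi)

def vbap_2d(phi, spk):
    """Unit-norm pairwise panning gains; assumes adjacent pairs span less than 180 deg."""
    idx = np.argsort(spk)
    gains = np.zeros(spk.size)
    src = np.array([np.cos(phi), np.sin(phi)])
    for i, j in zip(idx, np.roll(idx, -1)):
        base = np.array([[np.cos(spk[i]), np.cos(spk[j])],
                         [np.sin(spk[i]), np.sin(spk[j])]])
        g = np.linalg.solve(base, src)
        if np.all(g >= -1e-9):             # source lies between loudspeakers i and j
            g = np.clip(g, 0.0, None)
            gains[i], gains[j] = g / np.linalg.norm(g)
            break
    return gains

def allrad_2d(spk, order, num_virtual=360):
    """Eq. (4.46) on a dense virtual ring, omnidirectional column scaled to unit norm."""
    phi_hat = 2 * np.pi * np.arange(num_virtual) / num_virtual
    G_hat = np.stack([vbap_2d(p, spk) for p in phi_hat], axis=1)   # L x L_hat
    D = (2 * np.pi / num_virtual) * G_hat @ circ_harmonics(phi_hat, order).T
    return D / np.linalg.norm(D[:, order])   # column `order` is the omni component here

order = 3
spk = np.radians([-135.0, -45.0, 0.0, 45.0, 90.0, 135.0, 180.0])   # gap at -90 degrees
a = np.cos(np.abs(np.arange(-order, order + 1)) * np.pi / (2 * (order + 1)))
D_allrad = allrad_2d(spk, order)
D_sad = np.sqrt(2 * np.pi / spk.size) * circ_harmonics(spk, order).T

def fluctuation(D):
    """Ratio max(E)/min(E) over panning angles in 5-degree steps."""
    E = [np.sum((D @ (a * circ_harmonics(p, order)[:, 0])) ** 2)
         for p in np.radians(np.arange(-180, 180, 5))]
    return max(E) / min(E)

print("SAD:", fluctuation(D_sad), "AllRAD:", fluctuation(D_allrad))
```

In this sketch the loudness is not exactly constant for AllRAD, but its fluctuation stays well below that of the sampling decoder on the same layout.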

4.9.7 EPAD and AllRAD on Sub-optimal Layouts

Figure 4.16 shows the improvement achieved with EPAD and AllRAD on an equi-angular arrangement that is suboptimal by the missing loudspeaker at \(-90^\circ \). Both decoders manage to handle either the loudness stabilization perfectly well (EPAD) or keep the directional and spread mapping errors small (AllRAD). We notice that for EPAD, with the constraint that \(\mathrm {L}\ge (2\mathrm {N}+1)\) just fulfilled for \(\mathrm {N}=3\) and \(\mathrm {L}=7\) of the simulation, it would not simply be possible to remove any further loudspeakers without degradation.

Fig. 4.16

Analysis of loudness, localization error and width measure for energy-preserving Ambisonic decoding (EPAD) and all-round Ambisonic decoding (AllRAD) for all panning angles on a sub-optimal \(45^\circ \) spaced loudspeaker ring with gap at \(-90^\circ \)

4.9.8 Decoding to Hemispherical 3D Loudspeaker Layouts

In typical loudspeaker playback situations for large audiences, a solid floor and no loudspeakers below ear level are considered practical for several reasons. However, this does not permit decoding by sampling with optimal t-design layouts covering all directions. As shown above, EPAD and AllRAD do not require such arrays. And yet, they still require some care when used with hemispherical loudspeaker layouts, see [15, 40] for further reading.

EPAD with hemispherical loudspeaker layouts. Even for a hemispherical layout, the energy-preserving decoding method requires \(\mathrm {L}\ge (\mathrm {N}+1)^2\) loudspeakers to achieve a perfectly panning-invariant loudness. However, this is counter-intuitive: Why should one need at least as many loudspeakers on a hemisphere as are required for same-order playback on a full sphere? Shouldn’t the number be half as many?

We can show that while the spherical harmonics are orthonormal on the sphere \(\mathbb {S}^2\), i.e. \(\int _{\mathbb {S}^2}\varvec{y}_\mathrm {N}(\varvec{\theta })\varvec{y}^\mathrm {T}_\mathrm {N}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\varvec{I}\), they aren’t orthogonal on the hemisphere \(S=\mathbb {S}^2:\vartheta \le \vartheta _\mathrm {max}\)

$$\begin{aligned} \int _{{S}}\varvec{y}_\mathrm {N}(\varvec{\theta })\varvec{y}^\mathrm {T}_\mathrm {N}(\varvec{\theta })\,\mathrm {d}\varvec{\theta }=\varvec{G}. \end{aligned}$$
(4.47)

Here, \(\varvec{G}\) is called Gram matrix, and it is evaluated by \( \frac{4\pi }{\hat{\mathrm {L}}}\sum _{l:\uptheta _{\mathrm {z},l}\ge 0}\varvec{y}_\mathrm {N}(\varvec{\uptheta }_l)\varvec{y}^\mathrm {T}_\mathrm {N}(\varvec{\uptheta }_l)\) using a high-enough t-design. By singular-value decomposition of the positive semi-definite matrix \(\varvec{G}=\varvec{Q}\,\mathrm {diag}\{\varvec{s}\}\,\varvec{Q}^\mathrm {T}\), with \(\varvec{Q}^\mathrm {T}\varvec{Q}=\varvec{Q}\varvec{Q}^\mathrm {T}=\varvec{I}\), we diagonalize \(\varvec{G}\) and find new basis functions \(\varvec{\tilde{y}}_\mathrm {N}(\varvec{\theta })\), the so-called Slepian functions [42], that are orthogonal on S

$$\begin{aligned} \varvec{Q}^\mathrm {T}\varvec{G}\varvec{Q}&= \mathrm {diag}\{\varvec{s}\}= \int _{{S}}\varvec{Q}^\mathrm {T}\varvec{y}_\mathrm {N}(\varvec{\theta })\varvec{y}^\mathrm {T}_\mathrm {N}(\varvec{\theta })\varvec{Q}\,\mathrm {d}\varvec{\theta },&\Rightarrow \varvec{\tilde{y}}_\mathrm {N}(\varvec{\theta })&=\varvec{Q}^\mathrm {T}\varvec{y}_\mathrm {N}(\varvec{\theta }). \end{aligned}$$

Typically, the singular values in \(\varvec{s}\) are sorted in descending order \(s_1\ge s_2\ge \dots \ge s_{(\mathrm {N}+1)^2}\), so that it is possible to keep only the basis functions that contribute significantly on the upper hemisphere S by

$$\begin{aligned} \varvec{\tilde{y}}_\mathrm {N}(\varvec{\theta })=[\varvec{I},\,\varvec{0}]\,\varvec{Q}^\mathrm {T}\varvec{y}_\mathrm {N}(\varvec{\theta }). \end{aligned}$$
(4.48)

Typically, the numerical integral is extended to slightly below the horizon, see Table 4.1, so that truncation to the \((\mathrm {N}+1)(\mathrm {N}+2)/2\) most significant basis functions, see Fig. 4.17, produces a minimum fluctuation in the loudness measure \(\tilde{E}=\Vert \varvec{\tilde{y}}_\mathrm {N}(\varvec{\theta })\Vert ^2\) for panning on the hemisphere.
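The construction of Eqs. (4.47) and (4.48) can be sketched numerically. The following is a minimal illustration for first order only, assuming orthonormal real spherical harmonics and replacing the t-design sum by a Gauss-Legendre quadrature in \(\cos \vartheta \) combined with equidistant azimuth sampling; the function names `real_sh_first_order` and `gram_matrix` are hypothetical and not taken from any library.

```python
import numpy as np

def real_sh_first_order(x, y, z):
    """Orthonormal real spherical harmonics up to N=1, one row per direction."""
    ones = np.ones_like(x)
    cols = [ones, np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x]
    return np.stack(cols, axis=-1) / np.sqrt(4 * np.pi)

def gram_matrix(zenith_max_deg, n_z=16, n_az=16):
    """Gram matrix G of Eq. (4.47) for the cap 0 <= zenith <= zenith_max."""
    z_min = np.cos(np.radians(zenith_max_deg))
    nodes, weights = np.polynomial.legendre.leggauss(n_z)
    # map the Gauss-Legendre rule from [-1, 1] to [z_min, 1]
    z = 0.5 * (1 - z_min) * nodes + 0.5 * (1 + z_min)
    w_z = 0.5 * (1 - z_min) * weights
    az = 2 * np.pi * np.arange(n_az) / n_az
    Z, AZ = np.meshgrid(z, az, indexing="ij")
    w = np.repeat(w_z, n_az) * (2 * np.pi / n_az)  # quadrature weight per node
    rho = np.sqrt(1 - Z**2)
    Y = real_sh_first_order((rho * np.cos(AZ)).ravel(),
                            (rho * np.sin(AZ)).ravel(), Z.ravel())
    return Y.T @ (w[:, None] * Y)

# Diagonalize G on the exact upper hemisphere (assumed zenith_max = 90 degrees)
G = gram_matrix(90.0)
s, Q = np.linalg.eigh(G)            # G is symmetric, eigh returns ascending order
order = np.argsort(s)[::-1]         # sort descendingly as in the text
s, Q = s[order], Q[:, order]

N = 1
n_keep = (N + 1) * (N + 2) // 2     # number of retained Slepian functions
A = Q[:, :n_keep].T                 # [I, 0] Q^T of Eq. (4.48)
```

A row vector of Slepian functions for any direction is then obtained as `A @ real_sh_first_order(x, y, z)`. For higher orders, only the spherical-harmonics routine would need to be exchanged.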

Table 4.1 Integration ranges \(0\le \vartheta \le \vartheta _\mathrm {max}\) to obtain \((\mathrm {N}+1)(\mathrm {N}+2)/2\) Slepian functions with minimum loudness fluctuation \(\frac{\mathrm {max}E}{\mathrm {min}E}\) for panning on the hemisphere
Fig. 4.17

Slepian basis functions for the upper hemisphere, composed of the spherical harmonics up to \(3\mathrm {rd}\) order, using \(0^\circ \le \vartheta \le 113^\circ \) as integration interval and \((\mathrm {N}+1)(\mathrm {N}+2)/2\) functions. To get these nice shapes, the Slepian functions were found separately for every degree m where they were moreover de-mixed by QR decomposition

With \(\varvec{\tilde{y}}_\mathrm {N}(\varvec{\theta })\), EPAD is calculated in the same way as for the ordinary harmonics

$$\begin{aligned} \varvec{\tilde{Y}}_\mathrm {N}^\mathrm {T}&=\varvec{\tilde{U}}[\mathrm {diag}\{\varvec{s}\},\, \varvec{0}]^\mathrm {T}\varvec{\tilde{V}}^\mathrm {T},&\varvec{\tilde{D}}&=\varvec{\tilde{U}}[\varvec{I},\, \varvec{0}]^\mathrm {T}\varvec{\tilde{V}}^\mathrm {T}, \end{aligned}$$
(4.49)

with the main difference that the lower limit for the number of loudspeakers decreases to \(\mathrm {L}\ge (\mathrm {N}+1)(\mathrm {N}+2)/2\). Interfaced to the spherical harmonics by \([\varvec{I},\,\varvec{0}]\,\varvec{Q}^\mathrm {T}\), the hemispherical energy-preserving decoder becomes

$$\begin{aligned} \varvec{D}&=\varvec{\tilde{D}}\;[\varvec{I},\,\varvec{0}]\,\varvec{Q}^\mathrm {T}. \end{aligned}$$
(4.50)
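As a rough sketch of Eqs. (4.49) and (4.50), the decoder can be written with a thin SVD. The matrix `Y` is assumed to hold the spherical harmonics sampled at the loudspeaker directions (one column per loudspeaker), and `A` stands for \([\varvec{I},\,\varvec{0}]\,\varvec{Q}^\mathrm {T}\); both function names are hypothetical.

```python
import numpy as np

def epad_decoder(Y_tilde):
    """Energy-preserving decoder for basis functions sampled at L loudspeakers.

    Y_tilde: (K, L) matrix with K <= L. Returns D_tilde with shape (L, K).
    """
    K, L = Y_tilde.shape
    if L < K:
        raise ValueError("EPAD needs at least as many loudspeakers as basis functions")
    # thin SVD of Y_tilde^T = U diag(s) V^T; dropping diag(s) gives U [I, 0]^T V^T
    U, _, Vt = np.linalg.svd(Y_tilde.T, full_matrices=False)
    return U @ Vt

def hemispherical_epad(Y, A):
    """Eq. (4.50): D = D_tilde [I, 0] Q^T, with A = [I, 0] Q^T of shape (K_tilde, K)."""
    D_tilde = epad_decoder(A @ Y)   # Slepian functions at the loudspeakers
    return D_tilde @ A
```

Because the columns of `epad_decoder`'s output are orthonormal, the loudspeaker signal energy equals the squared norm of the Slepian-domain panning vector, which is the loudness measure \(\tilde{E}\) discussed above.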

AllRAD with hemispherical loudspeaker layouts. Because of the vector-base amplitude panning involved, all-round Ambisonic decoding (AllRAD) is comparatively robust to irregular loudspeaker setups. Still, a hemispherical layout does not contain any loudspeaker direction vector pointing to the lower half space, so one could be tempted to simply omit the information of the lower half space. However, the Ambisonic panning function implies a directional spread, so that panning to exactly the horizon also produces content below, whose omission causes: (i) a loss in loudness, (ii) a slight elevation of the perceived direction, cf. Fig. 4.18.

As discussed in the section on triangulation, Sect. 3.3, the insertion of imaginary loudspeakers fixes this behavior. In the case of hemispherical loudspeaker layouts, it is not necessary to downmix the signal of the imaginary loudspeaker at nadir to stabilize both loudness and localization for panning to the horizon.

Signal contributions below but close to the horizon largely contribute to the horizontal loudspeakers, and it is therefore safe to discard the signal that would feed the imaginary loudspeaker at nadir without loss of loudness. Moreover, this contribution from below also reinforces signals on the horizontal loudspeakers so that localization is pulled back down. Both can be observed in Fig. 4.18, which shows the loudness measure E as well as mislocalization and width by the measure \(\varvec{r}_\mathrm {E}\) using max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD with \(5\mathrm {th}\)-order Ambisonics along a vertical panning circle on the IEM mobile Ambisonics Array (mAmbA). It consists of 25 loudspeakers set up in rings of 8, 8, 4, 4, and 1 loudspeakers at 0, 20, 40, 60, and 90 degrees elevation. Rings two and four start at 0 degrees; the others are rotated half-way.

Performance comparison on hemispherical layouts. Figure 4.18 shows a comparison of AllRAD and EPAD decoding to the 25-channel mAmbA hemispherical loudspeaker layout.

Fig. 4.18

Perceptual measures for \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD on the IEM mAmbA layout [43] with (black) and without insertion of the bottom imaginary loudspeaker (black dotted) whose signal is disposed, and max-\(\varvec{r}_\mathrm {E}\)-weighted EPAD (gray), for panning on a vertical circle: E in dB (top), orientation error of \(\varvec{r}_\mathrm {E}\) in degrees (middle), and width expressed by \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) in degrees (bottom). The thin dashed line shows AllRAD without imaginary loudspeakers

While (top in Fig. 4.18) AllRAD produces a loudness fluctuation roughly spanning 1 dB for panning on the hemisphere, EPAD only exhibits 0.3 dB, as specified in Table 4.1. While in monophonic playback of noise, loudness differences of less than 0.5 dB can be heard, it is safe to assume that a weak directional loudness fluctuation of less than 1 dB is normally inaudible. In this regard, loudness fluctuation should be no problem with both EPAD and AllRAD.

Concerning the directional mapping, EPAD produces a more strongly pronounced ripple, with \(\varvec{r}_\mathrm {E}\) indicating sounds on the horizon \(\vartheta _\mathrm {s}=\pm 90^\circ \) to be pulled upwards towards \(0^\circ \) more with EPAD (\(7^\circ \)) than with AllRAD (\(3^\circ \)). In terms of width, both EPAD and AllRAD exhibit the \({\approx }20^\circ \) average associated with max-\(\varvec{r}_\mathrm {E}\) weighting. However, EPAD also produces a greater fluctuation, and it widens up to about \(30^\circ \) for panning to the horizon \(\vartheta _\mathrm {s}=\pm 90^\circ \).
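The three measures compared here follow directly from the loudspeaker gains. A minimal sketch, assuming real-valued gains and unit-length loudspeaker direction vectors (the function names are hypothetical):

```python
import numpy as np

def panning_measures(gains, directions):
    """Loudness E in dB, r_E vector, and width arccos||r_E|| in degrees.

    gains: (L,) loudspeaker gains; directions: (L, 3) unit vectors.
    """
    g2 = np.asarray(gains, dtype=float) ** 2
    directions = np.asarray(directions, dtype=float)
    E = g2.sum()
    r_E = (g2[:, None] * directions).sum(axis=0) / E
    norm = np.clip(np.linalg.norm(r_E), 0.0, 1.0)   # guard against rounding
    return 10 * np.log10(E), r_E, np.degrees(np.arccos(norm))

def direction_error_deg(r_E, target):
    """Angle in degrees between the r_E vector and the panning direction."""
    cos_angle = np.dot(r_E, target) / (np.linalg.norm(r_E) * np.linalg.norm(target))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

Evaluating both functions for the gains obtained along a vertical panning circle would yield curves of the kind shown in Fig. 4.18.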

With the 9 loudspeakers of the ITU [44] \(4+5+0\) layout (horizontal ring: \(\varphi =0,\,\pm 30^\circ ,\,\pm 120^\circ \), upper ring at \(40^\circ \) elevation with \(\varphi =\pm 30^\circ ,\,\pm 120^\circ \)), it is no longer possible to use EPAD with \(5\mathrm {th}\) order, which would be the optimal resolution for the front loudspeaker triplet. EPAD only supports orders up to \(\mathrm {N}=2\), and to lose level towards below-horizon directions, we can use the reduced set of 6 Slepian functions; alternatively, all 9 spherical harmonics of \(\mathrm {N}=2\) would also be conceivable. For AllRAD, imaginary loudspeakers are inserted at the sides at azimuth/elevation \({\pm }75^\circ /27^\circ \), up at \(0^\circ /78^\circ \), back at \(180^\circ /35^\circ \), and below at \(0^\circ /-90^\circ \). It is reasonable to downmix the imaginary loudspeakers with a factor of one for up, sides, and back, and to re-normalize the VBAP gain matrix, while discarding the signal of the imaginary loudspeaker below. AllRAD permits using the order \(\mathrm {N}=5\), which resolves the frontal loudspeaker triplet much better for horizontal panning.

Fig. 4.19

Perceptual measures for Ambisonic panning on the ITU [44] \(4+5+0\) layout with insertion and downmix of imaginary loudspeakers at the sides, back, and top, and insertion and disposal of an imaginary loudspeaker below for AllRAD. Measures are evaluated for max-\(\varvec{r}_\mathrm {E}\)-weighted \(5\mathrm {th}\)-order AllRAD (black) and \(2\mathrm {nd}\)-order EPAD (gray), for panning on a vertical circle: E in dB (top), orientation error of \(\varvec{r}_\mathrm {E}\) in degrees (middle), and width expressed by \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) in degrees (bottom)

Figure 4.19 shows the result of max-\(\varvec{r}_\mathrm {E}\)-weighted \(2\mathrm {nd}\)-order EPAD and \(5\mathrm {th}\)-order AllRAD for the \(4+5+0\) layout using a vertical panning curve. While the perfectly constant loudness measure of EPAD might be favored over the almost \(+3\) dB loudness increase of front and back for AllRAD, AllRAD’s lower directional error, narrower width mapping, greater flexibility, and simplicity have often proven to be clearly superior in practice.

4.10 Practical Studio/Sound Reinforcement Application Examples

This section analyzes the application of 3D Ambisonic amplitude panning consisting of encoding and AllRAD to studio (with typical setups of 2 m radius) and sound reinforcement applications (for an audience of, e.g., 250 people). Application scenarios are sketched in [43], and various other examples are given below. Requirements of a constant loudness and width are analyzed below, and as sound reinforcement requires a particularly large sweet area, the \(\varvec{r}_\mathrm {E}\) vector model for off-center listening positions from Sect. 2.2.9 is used to depict the sweet area size.

The analysis of decoders above described loudness measures for panning on a circle. To observe them with panning across all directions in Figs. 4.20 and 4.22, world-map-like mappings using a gray-scale representation of the loudness and width measures are more reasonable. For each of several loudspeaker layouts, the map axes are azimuth (horizontal) and zenith angle (vertical), and the gray scale displays the loudness measure E in dB (left column) and the width measure \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) in degrees (right column). As \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD typically produces only minor directional mapping errors, these aren’t explicitly shown in Figs. 4.20 and 4.22. However, the mappings of the sweet area size of plausible localization in Figs. 4.21 and 4.23 illustrate the usefulness of the systems for the listening areas hosting the number of listeners targeted for either the studio or the sound reinforcement application.

Fig. 4.20

Comparison of \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD for panning across all directions on hemispherical loudspeaker layouts in studios. The left column shows the loudness measure E in dB and the right column the width measure \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) in degrees; the loudspeaker positions are marked with a white \(+\) sign

Fig. 4.21

Comparison of the calculated sweet area size for \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD for panning across all directions on hemispherical loudspeaker layouts in studios. As a plausibility definition, the directional mapping errors depending on the listening position should stay within angular bounds (e.g. \(10^\circ \))

Fig. 4.22

Comparison of \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD for panning across all directions on various hemispherical loudspeaker layouts for sound reinforcement. The left column shows the loudness measure E in dB and the right column the width measure \(\arccos \Vert \varvec{r}_\mathrm {E}\Vert \) in degrees; the loudspeaker positions are marked with a white + sign

Fig. 4.23

Comparison of the calculated sweet area size for \(5\mathrm {th}\)-order max-\(\varvec{r}_\mathrm {E}\)-weighted AllRAD for panning across all directions on various hemispherical loudspeaker layouts for sound reinforcement. As a plausibility definition, the directional mapping errors depending on the listening position should stay within angular bounds (e.g. \(10^\circ \))

Figure 4.20 illustrates AllRAD’s tendency to attenuate signals in too closely spaced loudspeaker ensembles, as in the front section of the ITU [44] \(4+5+0\). By contrast, for instance the mAmbA layout in Fig. 4.22 only has 8 loudspeakers on the horizon, and signals panned to the widely spaced below-horizon triangles tend to get louder. Moreover, it is easier for loudspeaker systems of many channels such as IEM CUBE, mAmbA, Lobby, and Ligeti Hall in Fig. 4.22 to yield smooth loudness and width mappings. Still, even with only a few loudspeakers, slight direction adjustments in the layout can fix some of the behavior, as with the IEM Production Studio, whose \({\pm }45^\circ \) loudspeakers in the elevated layer are superior to a \({\pm } 30^\circ \) spacing.

A hint for designing good decoders is sometimes idealization: often it is better to disregard the true loudspeaker locations and feed the decoder design with idealized positions instead. In this way, one can trade slight directional distortions for a more uniform loudness distribution. For instance at the IEM CUBE, loudspeaker locations of the horizontal ring could be idealized to \(30^\circ \) to get a smoother loudness mapping than the one shown in Fig. 4.22.

4.11 Ambisonic Decoding to Headphones

Typically, Ambisonic decoding to headphones can be done similarly to decoding to loudspeakers, except that the loudspeaker signals are rendered to headphones by convolution with the head-related impulse responses (HRIRs) of the corresponding playback directions. Various databases of such HRIRs can be found, e.g., on the SOFA conventions website (see footnote 2). This headphone decoding approach classically uses a small set of so-called virtual loudspeakers, as found in many places in the technical literature, e.g. in the pioneering works of Jean-Marc Jot et al. [9] or Jérôme Daniel [10]. It is relevant in many other important works [18, 45, 46] and the SADIE project (see footnote 3), and it is employed in Sect. 1.4.2 on first-order Ambisonics.

Coarse. However, as outlined in some research papers [9, 18, 46], these approaches have in common that low-order Ambisonic synthesis is problematic. On the one hand, when a dense grid of virtual-loudspeaker HRIRs is inserted, the Ambisonic smoothing attenuates high frequencies at frontal and dorsal directions. On the other hand, a coarse grid of virtual-loudspeaker HRIRs, which had been the solution for a long time, does not attenuate high frequencies, but its spatial quality strongly depends on the particular grid layout or orientation [46]. An early paper by Jot [9] proposed to remove the time delays of the HRIRs before Ambisonic decomposition and to re-insert the otherwise missing interaural time delay afterwards for any sound panned in Ambisonics, which unfortunately yields an object-based panning system rather than a scene-based Ambisonic system.

Dense. Some dense-grid approaches propose to keep the HRIR time delays, or if formulated in the frequency domain, the phases of the HRTFs (head-related transfer functions), and thereby stay in a scene-based Ambisonic format, while correcting spectral deficiencies by diffuse-field or interaural-covariance equalization [18, 47]. Finally, the most recent solutions proposed by Jin, Sun, and Epain [17, 48] or Zaunschirm, Schörkhuber, and Höldrich [20, 21] modify the HRIR time delays/HRTF phases but only above, e.g., 3 kHz, without any object-based re-insertion afterwards. The omission of high-frequency interaural time-delay/phase information is a reasonable trade-off made in favor of the more important accuracy in spectral magnitude.

Fig. 4.24

Rigid-sphere model of delay to ear depending on angle \(\phi \) (here: head rotation)

What does directional HRIR smoothing do to high frequencies? The geometrical theory of diffraction [49] suggests that HRIRs must always contain at least the delay to the ear of either the shortest direct path or the shortest indirect path via the surface of the head. For a spherical head model with the radius \(\mathrm {R}=0.0875\) m and speed of sound \(c=343\) \(\frac{\mathrm {m}}{\mathrm {s}}\), the Woodworth-Schlosberg formula [50] is built on this consideration, see Fig. 4.24. The left ear receives a distant horizontal sound from the azimuth interval \(0\le \phi \le \frac{\pi }{2}\) as direct sound advanced by \(\tau =-\frac{\mathrm {R}}{c}\sin \phi \), or for \(-\frac{\pi }{2}<\phi \le 0\) as an indirect sound delayed by \(\tau =-\frac{\mathrm {R}}{c}\,\phi \),

$$\begin{aligned} \tau (\phi )&=\textstyle -\frac{\mathrm {R}}{c}\;\min \left\{ \sin \phi ,\, \phi \right\} , \end{aligned}$$
(4.51)

as plotted in Fig. 4.25a, and recognizable from dummy-head measurements (see footnote 4) in Fig. 4.25b.
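Equation (4.51) is easy to evaluate numerically. A short sketch with the values given above as defaults (the function name `ear_delay` is hypothetical):

```python
import numpy as np

HEAD_RADIUS = 0.0875    # m, spherical head model from the text
SPEED_OF_SOUND = 343.0  # m/s

def ear_delay(phi, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Delay tau(phi) in seconds to the left ear, Eq. (4.51).

    phi in radians within -pi/2 < phi <= pi/2; negative tau means the
    sound arrives earlier than at the center of the head.
    """
    phi = np.asarray(phi, dtype=float)
    return -(radius / c) * np.minimum(np.sin(phi), phi)
```

For \(\phi =\frac{\pi }{2}\) this gives \(-\mathrm {R}/c\approx -0.26\) ms, and for \(\phi =-\frac{\pi }{2}\) it gives \(\frac{\pi }{2}\,\mathrm {R}/c\approx 0.40\) ms.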

Fig. 4.25

Time delay to the ear depending on azimuth of horizontal sound (a) and (b) \(360^\circ \)-measured KU100 dummy head HRIRs set from TH Köln (color displays dB levels)

If the HRIR is smoothed across an angular range, the time-delay curve gets spread across time as well, see Fig. 4.26. In this way, depending on whether the smoothing uses a discrete or continuous set of directions, one obtains either something like a comb filter or a sinc-shaped frequency response. This smoothing is least disturbing around the direct-ear side as shown left in Fig. 4.26, and, as the indirect ear also encounters high-frequency shadowing effects, it is most disturbing for frontal and rear sounds at \(0^\circ \) or \(180^\circ \), as shown right in Fig. 4.26. The corresponding frequency responses are roughly exemplified with what third-order Ambisonics equivalent smoothing would do to either \(45^\circ \)-spaced HRIRs in Fig. 4.27a or \(15^\circ \)-spaced ones in Fig. 4.27b.

Fig. 4.26

Directionally smoothed playback to multiple HRIRs within a window, e.g. \(30^\circ \), causes different impulse response shapes at \(90^\circ \) and \(0^\circ \)

Fig. 4.27

Differences of directionally smoothed HRTF frequency responses for a horizontal sound from either \(0^\circ \) or \(90^\circ \), smoothed within a \({\pm } 22.5^\circ \) window, which roughly corresponds to the Ambisonics order \(\mathrm {N}=3\); a and b use a grid of \(45^\circ \)- and \(15^\circ \)-spaced HRIRs, respectively. The dashed line shows the theoretical frequency limit 1.87 kHz for \(\mathrm {N}=3\)

To get an upper frequency limit, it is insightful to work in the frequency domain, where the HRIR is denoted as head-related transfer function (HRTF). A simplified linearized-phase version of Eq. (4.51) around \(\phi =0\) uses \(\tau \approx -\frac{\mathrm {R}}{c}\;\phi \), and the resulting Fourier transform with \(\omega =2\pi \,f\) is

$$\begin{aligned} H\approx e^{-\mathrm {i}\omega \,\tau (\phi )}=e^{\mathrm {i}\frac{\omega }{c}\mathrm {R}\,\phi }. \end{aligned}$$
(4.52)

To represent it by circular or spherical harmonics transformation limited to the order \(\mathrm {N}\), a maximum phase change represented by the harmonic \(e^{\mathrm {i}\mathrm {N}\phi }\) implies that we can only resolve the phase up to \(\frac{\omega }{c}\mathrm {R}\le \mathrm {N}\), hence the range of accurate operation is limited in frequency

$$\begin{aligned} \textstyle f_\mathrm {N}\le \frac{c\;\mathrm {N}}{2\pi \,\mathrm {R}}=\mathrm {N}\cdot 624\,\mathrm {Hz}. \end{aligned}$$
(4.53)

As the high-frequency HRTF phase evolves more rapidly over the angle than what the finite order can represent, this typically yields attenuation of the high frequencies when obtaining circular/spherical harmonics coefficients by transformation integral.
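The limit of Eq. (4.53) and the resulting high-frequency loss can be checked with a few lines. The sketch below assumes the linearized phase model of Eq. (4.52) and, as a simplification that is not taken from the text, a uniform average over a window \(\pm \Delta \) around the front, for which the averaged response is a sinc function. The function names are hypothetical.

```python
import numpy as np

def frequency_limit(order, radius=0.0875, c=343.0):
    """Upper frequency f_N = c N / (2 pi R) of Eq. (4.53) in Hz."""
    return c * order / (2 * np.pi * radius)

def smoothed_magnitude(f, half_window_rad, radius=0.0875, c=343.0):
    """|average of exp(i (omega/c) R phi) over -Delta <= phi <= Delta|.

    Assumes uniform weighting within the window; the closed form is
    sin(x)/x with x = omega R Delta / c.
    """
    x = 2 * np.pi * np.asarray(f, dtype=float) * radius * half_window_rad / c
    return np.abs(np.sinc(x / np.pi))   # np.sinc(u) = sin(pi u) / (pi u)
```

With the \({\pm }22.5^\circ \) window, `smoothed_magnitude` stays close to one near `frequency_limit(3)` and drops towards its first null at roughly 5 kHz, which illustrates the low-pass tendency described above under these simplifying assumptions.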

Fig. 4.28

Experiment on audibility of removal of the linear phase trend from HRTF above a varied cutoff frequency from [21] showing medians and 95% confidence intervals

Directional smoothing of the discrete directional HRTFs causes relevant spectral problems, regardless of whether directional smoothing is done by Ambisonics, VBAP, or MDAP. Mainly the geometric delay in the HRIRs is responsible for the emerging comb-filter or low-pass behavior. One could pull out the linear phase trend above the frequency limit and re-insert it, but is re-insertion necessary?

4.11.1 High-Frequency Time-Aligned Binaural Decoding (TAC)

As a prerequisite for their binaural Ambisonic decoders, Schörkhuber et al. [21] tested above which frequency the removal of the HRTF linear phase trend remains inaudible in direct HRTF-based rendering without panning or smoothing. In fact, most of their listeners could not distinguish the absence of the linear phase trend when removed above 3 kHz for various sound examples (drums, speech, pink noise, rendered at directions \(10^\circ \), \(-45^\circ \), \(80^\circ \), \(-130^\circ \)). They had their subjects compare the result to a reference with unaltered HRTFs, and the result is analyzed in Fig. 4.28.

By this finding, it is possible to split up each of the \(2\times 1\) HRIRs \(\varvec{h}(t,\varvec{\theta })\) into an unaltered low-pass band and a time-aligned high-pass band to unify the high-frequency HRIR delay

$$\begin{aligned} \varvec{\hat{h}}(t,\varvec{\theta })&= \varvec{h}_{\le 3kHz}(t,\varvec{\theta })+ \begin{bmatrix} {h}_{\mathrm {left}>3kHz}[t+\tau (\arcsin \theta _\mathrm {y}),\varvec{\theta }]\\ {h}_{\mathrm {right}>3kHz}[t+\tau (-\arcsin \theta _\mathrm {y}),\varvec{\theta }] \end{bmatrix}. \end{aligned}$$
(4.54)

The time delay model \(\tau (\phi )\) uses the angle to the left/right ear on the positive/negative y axis, so \(\arccos \pm \theta _\mathrm {y}\), but shifted by \(90^\circ \), hence \(\phi =\pm \arcsin \theta _\mathrm {y}\).
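The band-wise time alignment can be sketched in the frequency domain for a single ear. The example assumes an ideal brick-wall band split at the cutoff and a circular time shift via the DFT; neither choice is prescribed by the text, and the function name `time_align_high_band` is hypothetical. The argument `tau` is the modeled delay \(\tau (\phi )\) of that ear in seconds, which is removed from the high band only.

```python
import numpy as np

def time_align_high_band(h, fs, tau, f_cut=3000.0):
    """Remove the modeled delay tau from the band above f_cut of one HRIR.

    h: real impulse response, fs: sampling rate in Hz, tau: delay in seconds.
    The band at or below f_cut is left unaltered.
    """
    n = len(h)
    H = np.fft.rfft(h)
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    # advancing in time by tau, h(t + tau), multiplies the spectrum by exp(+i omega tau)
    advance = np.exp(1j * 2 * np.pi * f * tau)
    H_aligned = np.where(f > f_cut, H * advance, H)
    return np.fft.irfft(H_aligned, n=n)
```

Applied to a pure delay of 12 samples with `tau = 12 / fs`, the spectrum above the cutoff becomes zero-phase while the low band keeps its original delay. A practical implementation would rather use a smooth crossover.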

Fig. 4.29

Exemplary horizontal cross-sections of linear (lin), time-aligned (ta), and MagLS/magnitude-least-squares (mls) of third order \(\mathrm {N}=3\), compared to high-order \(\mathrm {N}=35\) (max) Ambisonic left-ear HRTF representations of the TH Cologne HRIR_L2702.sofa set

This removal allows using all available HRIRs of dense measurement sets for binaural synthesis of high accuracy, using a suitable linear Ambisonic decoder such as AllRAD. Assuming the resulting modified left and right HRIRs for all directions are denoted as the \(2\times \mathrm {L}\) matrix \(\varvec{\hat{H}}(t)=[\varvec{\hat{h}}(t,\varvec{\uptheta }_1),\,\dots ,\,\varvec{\hat{h}}(t,\varvec{\uptheta }_\mathrm {L})]\), the \(2\times (\mathrm {N}+1)^2\) filter set for decoding each of the Ambisonic channels to the ears becomes:

$$\begin{aligned} \varvec{\hat{H}}_\mathrm {SH}^\mathrm {T}(t)&=\varvec{\hat{H}}(t)\;\mathbf {D}\,\mathrm {diag}\{\varvec{a}\}. \end{aligned}$$
(4.55)

Results achieved by a pseudo-inverse decoding to the time-aligned HRIRs, using \(\mathrm {R}=0.085\) m and \(\mathrm {N}=3\), from the 2702-direction Cologne HRIRs (see footnote 5) are shown in Fig. 4.29. The resulting polar patterns (ta) clearly outperform the linear decomposition (lin) at frequencies above 2 kHz in representing the original HRTFs (max).

4.11.2 Magnitude Least Squares (MagLS)

As an alternative to high-frequency time delay disposal, Schörkhuber et al. present an optimum-phase approach [21] that disregards phase match in favor of an improved magnitude match above cutoff. Formulated exemplarily for the left ear, across every HRTF direction \(\varvec{\uptheta }_l\), and for every discrete frequency \(\omega _k\), with \(h_{l,k}=h(\varvec{\uptheta }_l,\omega _k)\), this becomes

$$\begin{aligned} \min _{\varvec{\hat{h}}_{\mathrm {SH},k}}\sum _{l=1}^\mathrm {L}\Big [|\varvec{y}_\mathrm {N}(\varvec{\uptheta }_l)^\mathrm {T}\,\varvec{\hat{h}}_{\mathrm {SH},k}| - |h_{l,k}|\Big ]^2. \end{aligned}$$
(4.56)

Typically, one would need to solve magnitude least squares or magnitude squares least squares tasks with semidefinite relaxation, see Kassakian [51].

In practice, however, results turn out to be perfect already with an iterative combination of the reconstructed phase \(\hat{\phi }_{l,k-1}\) from the previous frequency \(\omega _{k-1}\) with the HRTF magnitude \(|h_{l,k}|\) of the current frequency \(\omega _k\), before a linear decomposition thereof into spherical harmonic coefficients \(\varvec{\hat{h}}_{\mathrm {SH},k}\).

Every frequency below cutoff \(\omega _{k}<2\pi f_\mathrm {N}\) just uses the linear least-squares spherical harmonics decomposition with the left-inverse of the spherical harmonics \(\varvec{Y}_\mathrm {N}\) sampled at the HRTF measurement nodes,

$$\begin{aligned} \varvec{\hat{h}}_{\mathrm {SH},k}&=(\varvec{Y}_\mathrm {N}^\mathrm {T}\varvec{Y}_\mathrm {N})^{-1}\varvec{Y}_\mathrm {N}^\mathrm {T}\,\big [h_{l,k}\big ]_l\;. \end{aligned}$$
(4.57)

Continuing with the first frequency above/equal to cutoff \(\omega _k\ge 2\pi f_\mathrm {N}\), the algorithm proceeds as:

$$\begin{aligned} \hat{\phi }_{l,k-1}&=\angle \big \{\varvec{y}_\mathrm {N}(\varvec{\uptheta }_l)^\mathrm {T}\,\varvec{\hat{h}}_{\mathrm {SH},k-1}\big \}\;,\end{aligned}$$
(4.58)
$$\begin{aligned} \varvec{\hat{h}}_{\mathrm {SH},k}&=(\varvec{Y}_\mathrm {N}^\mathrm {T}\varvec{Y}_\mathrm {N})^{-1}\varvec{Y}_\mathrm {N}^\mathrm {T}\,\big [|h_{l,k}|\,e^{\mathrm {i}\hat{\phi }_{l,k-1}}\big ]_l\; , \end{aligned}$$
(4.59)

and then moves to the next frequency \(k\leftarrow k+1\). The results are typically transformed back to time domain to get a real-valued impulse response for every spherical harmonic to the regarded ear.
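The iteration of Eqs. (4.57) to (4.59) can be sketched for one ear as follows. The matrix `Y` is assumed to contain the spherical harmonics sampled at the HRTF directions (one row per direction, full column rank), `H` the complex HRTFs per direction and frequency bin in ascending frequency order, and the function name `magls` is hypothetical.

```python
import numpy as np

def magls(Y, H, freqs, f_cut):
    """Iterative MagLS decomposition for one ear.

    Y: (L, K) real basis functions at the L HRTF directions
    H: (L, F) complex HRTFs, freqs: (F,) ascending frequencies in Hz
    Returns (K, F) coefficients h_SH per frequency bin.
    """
    Y_left_inv = np.linalg.pinv(Y)   # equals (Y^T Y)^-1 Y^T for full column rank
    H_sh = np.zeros((Y.shape[1], H.shape[1]), dtype=complex)
    for k, f in enumerate(freqs):
        if k == 0 or f < f_cut:
            target = H[:, k]                                # Eq. (4.57)
        else:
            phase = np.angle(Y @ H_sh[:, k - 1])            # Eq. (4.58)
            target = np.abs(H[:, k]) * np.exp(1j * phase)   # Eq. (4.59)
        H_sh[:, k] = Y_left_inv @ target
    return H_sh
```

Transforming each row of the result back to the time domain, e.g. with an inverse real FFT if the bins cover zero to the Nyquist frequency, would give the impulse response per spherical harmonic mentioned above.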

The results of the MagLS approach (mls) outperform the time-alignment approach (ta) in the exemplary results shown for \(\mathrm {N}=3\) in Fig. 4.29, in particular at the highest frequencies, where sphere-model-based delay simplification is no longer sufficiently helpful.

4.11.3 Diffuse-Field Covariance Constraint

For both of the above approaches that modify the high-frequency phase, Zaunschirm et al. [20] also note that low-order rendering degrades envelopment in diffuse fields, so that they introduce an additional covariance constraint as defined by Vilkamo [22]. It can be implemented as a \(2\times 2\) filter matrix equalizing the resulting frequency-domain diffuse-field covariance matrix to the one of the original HRTF datasets. On the main diagonal, this covariance matrix shows the diffuse-field ear sensitivities (left and right), and off-diagonal it contains the diffuse-field interaural cross correlation.

At every frequency, the \(2\times 2\) diffuse-field covariance matrix of the original, very-high-order spherical harmonics HRTF dataset \(\varvec{H}_\mathrm {SH}^\mathrm {H}\) of the dimensions \(2\times (\mathrm {M}+1)^2\) with (\(\mathrm {M}\gg \mathrm {N}\)) is given by

$$\begin{aligned} \varvec{R}&=\varvec{H}_\mathrm {SH}^\mathrm {H}\varvec{H}_\mathrm {SH}. \end{aligned}$$
(4.60)

The derivation why this inner product of spherical harmonic coefficients represents the diffuse-field covariance is given in Appendix A.5. The low-order, high-frequency-modified HRTF coefficient set \(\varvec{\hat{H}}_\mathrm {SH}^\mathrm {H}\) of the dimensions \(2\times (\mathrm {N}+1)^2\) also has a \(2\times 2\) covariance matrix \(\varvec{\hat{R}}\) that will differ from the more accurate \(\varvec{R}\),

$$\begin{aligned} \varvec{\hat{R}}&=\varvec{\hat{H}}_\mathrm {SH}^\mathrm {H}\varvec{\hat{H}}_\mathrm {SH}. \end{aligned}$$
(4.61)

Its diffuse-field reproduction improves after equalizing \(\varvec{\hat{R}}\) to \(\varvec{R}\) by a \(2\times 2\) filter matrix \(\varvec{M}\),

$$\begin{aligned} \varvec{\hat{H}}_\mathrm {SH,corr}&=\varvec{\hat{H}}_\mathrm {SH}\varvec{M}. \end{aligned}$$
(4.62)

Appendix A.5 shows the derivation of \(\varvec{M}\) based on [20, 22]. In summary, it is composed of factors obtained by Cholesky and SVD matrix decompositions

$$\begin{aligned} \varvec{\hat{H}}_\mathrm {SH,corr}&=\varvec{\hat{H}}_\mathrm {SH}\,\varvec{\hat{X}}^{-1}\varvec{V}\varvec{U}^\mathrm {H}\varvec{X},\\ \text {Cholesky factors: }&\text {SVD:}\nonumber \\ \varvec{H}_\mathrm {SH}^\mathrm {H}\varvec{H}_\mathrm {SH}&= \varvec{X}^\mathrm {H}\varvec{X},&\varvec{\hat{X}}^\mathrm {H}\varvec{X}&=\varvec{U}\varvec{S}\varvec{V}^\mathrm {H},\nonumber \\ \varvec{\hat{H}}_\mathrm {SH}^\mathrm {H} \varvec{\hat{H}}_\mathrm {SH}&=\varvec{\hat{X}}^\mathrm {H}\varvec{\hat{X}} \nonumber . \end{aligned}$$
(4.63)
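For a single frequency bin, the factors of Eq. (4.63) can be computed with standard linear-algebra routines. In this sketch, `H_ref` and `H_low` are assumed to hold the high-order and the modified low-order coefficient sets with one column per ear, and the function name `covariance_correction` is hypothetical.

```python
import numpy as np

def covariance_correction(H_ref, H_low):
    """2x2 matrix M so that H_low @ M has the diffuse-field covariance of H_ref.

    H_ref: ((M+1)^2, 2) and H_low: ((N+1)^2, 2) complex coefficients at one frequency.
    """
    R = H_ref.conj().T @ H_ref                    # Eq. (4.60)
    R_hat = H_low.conj().T @ H_low                # Eq. (4.61)
    # numpy returns the lower factor C with R = C C^H, hence X = C^H gives R = X^H X
    X = np.linalg.cholesky(R).conj().T
    X_hat = np.linalg.cholesky(R_hat).conj().T
    U, _, Vh = np.linalg.svd(X_hat.conj().T @ X)  # X_hat^H X = U S V^H
    # M = X_hat^-1 V U^H X, solved without forming the explicit inverse
    return np.linalg.solve(X_hat, Vh.conj().T @ U.conj().T @ X)
```

Multiplying `H_low @ M` then yields the corrected set of Eq. (4.62), and its covariance `(H_low @ M).conj().T @ (H_low @ M)` matches `R` up to rounding. In a complete decoder this would be repeated for every frequency bin.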

While MagLS binaural decoding with orders higher than 2 or 3 does not require covariance correction, the correction enhances the decorrelation of the ear signals for \(1\mathrm {st}\) to \(2\mathrm {nd}\) order reproduction, as shown in Fig. 4.30.

Fig. 4.30

Covariance constraint filters enhance the binaural decorrelation of MagLS by negative crosstalk \(M_{12}\) and \(M_{21}\), under corresponding correction of the diffuse-field sensitivities \(M_{11}\) and \(M_{22}\) at playback orders \(\mathrm {N}<3\)

4.12 Practical Free-Software Examples

4.12.1 Pd and Circular/Spherical Harmonics

Similar to the example section on first-order encoding and decoding in Pure Data (Pd), Fig. 4.31 shows \(3\mathrm {rd}\)-order 2D Ambisonic encoding and decoding for an octagon loudspeaker layout. The implementation [mtx_circular_harmonics] of the circular harmonics from the iemmatrix library is used, and the numbers for \(\frac{180}{\pi }=57.29\) and \(a_m=\cos \frac{\pi m}{2\cdot (\mathrm {N}+1)}\) were pre-calculated. Note the similarity to the first-order 2D example of Fig. 1.13; the main change is the use of the circular harmonics matrix object.
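For readers without Pd at hand, the same signal flow can be sketched offline. The following is not a transcription of the Pd patch: the normalization of the circular harmonics is an assumption and may differ from what [mtx_circular_harmonics] outputs, the decoder is a plain sampling decoder, and all function names are hypothetical.

```python
import numpy as np

def circular_harmonics(phi, order):
    """Rows [1/sqrt(2 pi), cos(m phi)/sqrt(pi), sin(m phi)/sqrt(pi), ...], m = 1..N."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    cols = [np.ones_like(phi) / np.sqrt(2 * np.pi)]
    for m in range(1, order + 1):
        cols += [np.cos(m * phi) / np.sqrt(np.pi), np.sin(m * phi) / np.sqrt(np.pi)]
    return np.stack(cols, axis=-1)               # shape (len(phi), 2N+1)

def max_re_weights_2d(order):
    """Weights a_m = cos(pi m / (2 (N+1))), repeated for the cos/sin pair of each m."""
    a = np.cos(np.pi * np.arange(order + 1) / (2 * (order + 1)))
    return np.concatenate([a[:1], np.repeat(a[1:], 2)])

N, L = 3, 8                                      # 3rd order, octagon layout
spk = 2 * np.pi * np.arange(L) / L               # equidistant loudspeaker azimuths
D = circular_harmonics(spk, N) * max_re_weights_2d(N)   # weighted sampling decoder

def loudspeaker_gains(phi_s):
    """Encode a source at azimuth phi_s (radians) and decode it to the L loudspeakers."""
    return D @ circular_harmonics(phi_s, N)[0]
```

Calling `loudspeaker_gains(np.radians(20))` returns the eight gains for a source at \(20^\circ \). Since \(\mathrm {L}=8\) exceeds \(2\mathrm {N}=6\), the sum of squared gains does not depend on the panning angle for this equiangular layout.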

Fig. 4.31

2D encoding and decoding in Pd using [mtx_circular_harmonics] with \(3\mathrm {rd}\) order, 8 equidistant loudspeakers, and max-\(\varvec{r}_\mathrm {E}\) weighted decoder

For decoding to headphones, programming in Pd also looks rather similar to the first-order example in Fig. 1.14, only more HRIRs matching the respective loudspeaker positions need to be employed. To work in 3 dimensions, programming in Pd would also be similar to the corresponding first-order example of Fig. 1.15, using the matrix object [mtx_spherical_harmonics]. Typically, pre-calculated decoders including AllRAD and max-\(\varvec{r}_\mathrm {E}\) weighting are used and loaded by, e.g., [mtx D.mtx] into Pd to keep programming simple.

4.12.2 Ambix Encoder, IEM MultiEncoder, and IEM AllRADecoder

For encoding single- or multi-channel signals into Ambisonics, there are VST plugins available from Kronlachner’s ambix plugin suite, or the IEM MultiEncoder from the IEM plugin suite. As exemplarily shown in Fig. 4.32, the MultiEncoder allows encoding channel-based multi-channel audio material, where channel-based [52] typically means that each channel of the multi-channel material is meant to be played back on a separate loudspeaker of clearly defined direction, cf. [44]. Elsewhere, the embedding of virtual playback directions can also be found referred to as beds or virtual panning spots.

Fig. 4.32

MultiEncoder plug-in: encoding of a \(4+5+0\) recording

Fig. 4.33

AllRADecoder plug-in: \(5+7+0\) layout from IEM Production Studio

Fig. 4.34

AllRADecoder plug-in: \(5+7+0\) layout from IEM Production Studio

Fig. 4.35

AllRADecoder plug-in: \(5+7+0\) layout from IEM Production Studio

The IEM AllRADecoder permits manually entering or importing the loudspeaker coordinates and channel indices, with the coordinates specified by the azimuth and elevation angle in degrees, as exemplified for the IEM production studio in Fig. 4.33. The figure also shows that just entering the pure \(5+7+0\) layout would produce the error message "Point of origin not within convex hull. Try adding imaginary loudspeakers."

By adding an imaginary loudspeaker below whose signal is typically omitted, see Fig. 4.34, it becomes geometrically valid to calculate and employ the resulting decoder. However, it is better to also insert an imaginary loudspeaker at the rear whose signal is preserved by specifying the gain value 1, as shown in Fig. 4.35.

Fig. 4.36
figure 36

RoomEncoder plug-in

4.12.3 Reaper, IEM RoomEncoder, and IEM BinauralDecoder

Particularly for headphone-based listening, rendering of anechoic sounds will typically not externalize well, as it does not match the mental expectation of ordinary listening environments [53,54,55,56]. To avoid in-head localization instead of the desired external sound image, one can, e.g., use the IEM RoomEncoder plugin, see Fig. 4.36. It is based on an image-source room model and encodes first-order wall reflections involving reflection factors and propagation delays together with the desired direct sound.

The MagLS approach for Ambisonic decoding, using the KU100 measurements from Cologne Applied Science University and (optionally) their headphone equalization curves is implemented by the IEM BinauralDecoder, see Fig. 4.37.

Combining IEM RoomEncoder and IEM BinauralDecoder with an Ambisonics-encoded single-channel sound (e.g. using ambix_encoder), one can simply try to place the source and receiver together in the symmetry plane of the room, and then slightly shift one of the two sideways to hear how externalization improves by slight asymmetry in the ear signals.

Fig. 4.37

BinauralDecoder plug-in