Auditory Events of Multi-loudspeaker Playback

This chapter describes the perceptual properties of auditory events, the sound images that we localize in terms of direction and width, when distributing a signal with different amplitudes to one or several loudspeakers. These amplitude differences are what amplitude-panning methods implement, and they are also what the mapping of any coincident-microphone recording implies when it is reproduced over the directions of a loudspeaker layout. Therefore, several listening experiments on localization are described and analyzed that are essential to understanding and modeling the psychoacoustical properties of amplitude panning on multiple loudspeakers of a 3D audio system. For delay-based recordings or diffuse sounds, there is some relation; however, it is found to be less stable for the desired applications. Moreover, amplitude panning is not only about consistent directional localization: loudness, spectrum, temporal structure, and perceived width should be panning-invariant. The chapter also presents the experiments and models required to understand and provide those panning-invariant aspects, especially for moving sounds. It concludes with openly available response data of most of the presented listening experiments.

It is possible to firmly establish Gerzon's [1] $E$, $\mathbf{r}_\mathrm{E}$, and $\|\mathbf{r}_\mathrm{E}\|$ estimators for perceived loudness, direction, and width, which apply to most stationary sounds in typical studio and performance environments.

Loudness
At a measurement point in the free field, the same signal fed to equalized loudspeakers of exactly the same acoustic distance would superimpose constructively (+6 dB).
In a room with early reflections and a less strict equality of the incoming pair of sounds (typically due to slight inaccuracies in loudspeaker/listener position, different mounting situations, or different directions in the directivities of ears and loudspeakers), the superposition can be regarded as stochastically constructive (+3 dB), in particular at frequencies that are not very low.
Following the above reasoning, typical amplitude-panning rules keep the weights that distribute the signal to the loudspeakers normalized by the root of the sum of squares instead of by the linear sum, in order to obtain constant loudness ([12], VBAP):

$$g_l \leftarrow \frac{g_l}{\sqrt{\sum_{l'=1}^{L} g_{l'}^2}}. \tag{2.1}$$

Loudness Model. If all loudspeakers are equalized, located at the same distance to the listener, and fed by the same signal with different amplitude gains $g_l$, constructive interference could be expected, so that the amplitude becomes [1]

$$P = \sum_{l=1}^{L} g_l. \tag{2.2}$$

However, the interference ceases to be strictly constructive as soon as the room is not entirely anechoic, the sitting position is not exactly centered, or, even under anechoic and centered conditions, at high frequencies, when the superposition at the ears cannot be assumed to be purely constructive anymore. Then it is better to assume a less well-defined, stochastic superposition in which the squared amplitude is determined by the sum of the squared weights [1]:

$$E = \sum_{l=1}^{L} g_l^2.$$

Therefore, the most common amplitude-panning rules use root-of-squares normalization to obtain a loudness impression that is as constant as possible.
The measure $E$ seems to be most useful when designing and evaluating amplitude-panning or coincident-microphone techniques. It is not surprising that ITU-R BS.1770-4 uses the Leq(RLB) measure as a loudness model: it is essentially the RMS level after high-pass filtering, cf. [13], which is closely related to the $E$ measure detected from the loudspeaker signals.
An interesting refinement was proposed by Laitinen et al. [14], which uses a measure $\sqrt[p]{\sum_{l=1}^{L} g_l^p}$ in which the exponent $p$ is close to 1 at low frequencies under anechoic conditions and close to 2 at high frequencies or under reverberant conditions.
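The loudness measures discussed in this section can be compared numerically. The following is a minimal sketch, not taken from the cited works: the function names are hypothetical, and the exponent p is treated as a free parameter rather than the frequency-dependent value of [14].

```python
import math


def coherent_amplitude(gains):
    """Strictly constructive superposition: P = sum of the gains."""
    return sum(gains)


def energy(gains):
    """Stochastic superposition: E = sum of the squared gains."""
    return sum(g * g for g in gains)


def p_norm(gains, p):
    """Generalized measure (sum |g|^p)^(1/p); p is a free parameter here."""
    return sum(abs(g) ** p for g in gains) ** (1.0 / p)


def normalize_root_of_squares(gains):
    """Scale the gains so that their energy measure E equals 1."""
    norm = math.sqrt(energy(gains))
    if norm == 0.0:
        raise ValueError("at least one gain must be non-zero")
    return [g / norm for g in gains]


if __name__ == "__main__":
    pair = [1.0, 1.0]
    # Level increase of two equal signals relative to a single one.
    print("coherent:   %.2f dB" % (20 * math.log10(coherent_amplitude(pair))))
    print("stochastic: %.2f dB" % (10 * math.log10(energy(pair))))
    print("normalized gains:", normalize_root_of_squares(pair))
```

For two equal gains, the coherent model gives about +6 dB and the energy model about +3 dB, which mirrors the two cases described above.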

Direction
In the early years of stereophony, researchers investigated the differences in delay times and amplitudes required to control the perceived direction. Below, only experiments are considered that did not use fixation of the listener's head.

Time Differences on Frontal, Horizontal Loudspeaker Pair
The dissertation of K. Wendt from 1963 [3] presents notably accurate listening experiments on ±30° two-channel stereophony using time delays, in which listeners indicated where they heard the sounds for each of the tested time differences. H. Lee revisited these properties in 2013 [8], but with musical sound material and an experiment in which the listener adjusted the time difference until the perceived direction matched that of a corresponding fixed reference loudspeaker, Fig. 2.1. Time differences are seldom suitable for reliable angular placement of auditory events: the auditory images are strongly frequency-dependent (not shown here) and therefore unstable for narrow-band sounds. Leakey and Cherry showed in 1957 [2] that time-delay stereophony loses its effect in the presence of background noise.

Level Differences on Frontal, Horizontal Loudspeaker Pair
K. Wendt's [3] and H. Lee's [8] experiments also deliver insights into sound source positioning with ±30° two-channel stereophony, this time with level differences. As opposed to Fig. 2.1, in which auditory image panning with time differences was characterized by statistical spreads of up to 15°, level-difference-based panning exhibits a spread of perceived directions that is clearly smaller than 10°, Fig. 2.2.
Signal dependency. Wendt [3] described the signal dependency of panning curves on various transient and band-limited sounds, and Lee [8] studied musical sounds. A comprehensive investigation on frequency dependency was carried out by Helm and Kurz [9]. With level differences of {0, 3, 6, 9, 12} dB and third-octave filtered pulsed pink noise at {125, 250, 500, 1k, 2k, 4k} Hz, they showed that the perceived angle, which the listeners indicated with a motion-tracked pointer, was similar between the broadband case and the third-octave bands below 2 kHz. In bands above 2 kHz, smaller level differences cause a larger lateralization, see the interpolated curves in Fig. 2.3.

Fig. 2.2
Wendt's [3] results for crack (impulsive) signals with level differences and without head fixation (the figure shows means and standard deviations; the standard deviation was interpolated to plot this figure). In gray: results of Lee's [8] level-difference adjustment experiment with musical sounds (25, 50, 75% quartiles, symmetrized diagram)

Level Differences on Horizontally Surrounding Pairs
Successive pairwise panning on neighboring loudspeaker pairs is typically used to pan auditory events freely along the loudspeakers of a horizontally surrounding loudspeaker ring. The classical research specifically targeted at such applications was contributed by Theile and Plenge in 1977 [4]. They used a mobile reference loudspeaker with a reference sound that could be moved to match the perceived direction of a loudspeaker pair playing pink noise with level differences, at different orientations with respect to the listener's head. There is also the experiment of Pulkki [15] using a level-adjustment task, in which levels were adjusted so as to match the auditory event to that of a reference loudspeaker at three different reference directions and for different head orientations. A comprehensive experiment was done by Simon et al. [5], who used a graphical user interface displaying the floor plan of a 45°-spaced loudspeaker ring to have the listeners specify the perceived direction. Martin et al. in 1999 [16] used a graphical user interface showing the floor plan of a 5.1 ring in their experiment, and last but not least, Matthias Frank used a direct pointing method to enter the perceived direction [10] in one of his experiments.
As the experiments did not seem to yield consistent results, a comprehensive level-difference adjustment experiment with 24 loudspeakers arranged as a horizontal ring was done in [17] and partially repeated later in [11], see the results in Fig. 2.4. In the repeated experiment [11], it became clear that in the anechoic room, a large amount of the differently pronounced localization biases can be avoided by encouraging the listeners to move their heads front-back and left-right by a few centimeters whenever in doubt.

Fig. 2.4
Medians and 95% confidence intervals for adjusted level differences to align amplitude-panned pink noise with a harmonic complex tone from {±15°, 0°}, for (a) a frontal and (b) a lateral 60° stereo pair; (a) uses data from [17] with 4 responses per direction from 5 listeners; (b) uses data from [11] with 20 responses per direction. Despite the considerably different spread, frontal and lateral stereo pairs seem to yield pretty much the same tendency

Level Differences on Frontal, Horizontal to Vertical Pairs
T. Kimura investigated quite extensively the localization of auditory events between frontal, vertical ±13.5° loudspeaker pairs in 2012 [6,18]. The work of F. Wendt in 2013 [7,19] also investigated a slanted and a vertical loudspeaker pair, Fig. 2.5. Kimura used pulsed white noise, Wendt used pulsed pink noise. Evidently, the horizontal spread is always smaller than the vertical spread, and the spread does not align with the direction of the loudspeaker pair. The largest vertical spread appears for the vertical loudspeaker pair.

Vector Models for Horizontal Loudspeaker Pairs
A weighted sum of the loudspeakers' direction vectors $\boldsymbol{\theta}_1$, $\boldsymbol{\theta}_2$ can be conceived as a simple linear model of the perceived direction, using a linear blending parameter $0 \le q \le 1$,

$$\mathbf{r} = (1-q)\,\boldsymbol{\theta}_1 + q\,\boldsymbol{\theta}_2.$$

The parameter $q$ adjusts where the resulting vector $\mathbf{r}$ is located on the connecting line between $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$. On frontal loudspeaker pairs, localization curves typically run through the middle direction $q = \frac{1}{2}$ for level differences of 0 dB. If only one loudspeaker is active, the result is that loudspeaker's direction, thus the parameter is $q = 0$ or $q = 1$.

Classical definitions.
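As the simplest choice for $q$, one could insert $q = \frac{g_2}{g_1+g_2}$ or $q = \frac{g_2^2}{g_1^2+g_2^2}$ to get the vector definitions as weighted averages using either the linear or the squared gains according to [1]:

$$\mathbf{r}_\mathrm{V} = \frac{g_1\,\boldsymbol{\theta}_1 + g_2\,\boldsymbol{\theta}_2}{g_1+g_2}, \qquad \mathbf{r}_\mathrm{E} = \frac{g_1^2\,\boldsymbol{\theta}_1 + g_2^2\,\boldsymbol{\theta}_2}{g_1^2+g_2^2}. \tag{2.5}$$

For both models, equal gains $g_1 = g_2$ yield $q = \frac{1}{2}$, and the endpoints with $g_2 = 0$ or $g_1 = 0$ correspond to $q = 0$ or $q = 1$, respectively. However, the slope of the $\mathbf{r}_\mathrm{E}$ vector is steeper than that of $\mathbf{r}_\mathrm{V}$. For instance, if $g_2 = 2\,g_1$, the vector $\mathbf{r}_\mathrm{V}$ lies at $q = 2/3$ of the line between $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, while $\mathbf{r}_\mathrm{E}$ lies at $q = 4/5$ of the connecting line.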
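The numerical example for the two classical definitions can be reproduced with a few lines. This is a minimal sketch with hypothetical function names; the argument `exponent` selects linear gains (1) or squared gains (2), and the 30° half angle is an assumed example value.

```python
import math


def blend_parameter(g1, g2, exponent):
    """Blending parameter q from gains: exponent=1 (linear), exponent=2 (squared)."""
    w1, w2 = abs(g1) ** exponent, abs(g2) ** exponent
    if w1 + w2 == 0.0:
        raise ValueError("at least one gain must be non-zero")
    return w2 / (w1 + w2)


def blend(theta1, theta2, q):
    """Point on the connecting line: q=0 gives theta1, q=1 gives theta2."""
    return tuple((1.0 - q) * a + q * b for a, b in zip(theta1, theta2))


if __name__ == "__main__":
    alpha = math.radians(30.0)  # assumed half angle of the loudspeaker pair
    theta1 = (math.cos(alpha), math.sin(alpha))
    theta2 = (math.cos(alpha), -math.sin(alpha))
    g1, g2 = 1.0, 2.0  # the case g2 = 2 g1 from the text
    q_v = blend_parameter(g1, g2, 1)
    q_e = blend_parameter(g1, g2, 2)
    print("q for linear gains: ", q_v, "->", blend(theta1, theta2, q_v))
    print("q for squared gains:", q_e, "->", blend(theta1, theta2, q_e))
```

With $g_2 = 2\,g_1$, the two calls return $q = 2/3$ and $q = 4/5$, in line with the values stated above.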
The $\mathbf{r}_\mathrm{V}$ vector for the $\pm\alpha$ loudspeaker pair at the directions $\boldsymbol{\theta}_{1,2}^\mathrm{T} = (\cos\alpha, \pm\sin\alpha)$ corresponds to the tangent law [20], whose formal origin lies in a model of summing localization based on a simple model of the ear signals, cf. Appendix A.7. The equivalence of this law to the vector model follows from the tangent $\tan\varphi$ as the ratio of the $y$ component to the $x$ component of the $\mathbf{r}_\mathrm{V}$ vector,

$$\tan\varphi = \frac{g_1\sin(\alpha) + g_2\sin(-\alpha)}{g_1\cos(\alpha) + g_2\cos(\alpha)} = \frac{g_1-g_2}{g_1+g_2}\,\tan\alpha.$$

Adjusted slope. Differently steep curves were fitted by an adjustable-slope model [17],

$$\mathbf{r}_\gamma = \frac{g_1^\gamma\,\boldsymbol{\theta}_1 + g_2^\gamma\,\boldsymbol{\theta}_2}{g_1^\gamma + g_2^\gamma},$$

which uses $\gamma = 1$ for $\mathbf{r}_\mathrm{V}$ and $\gamma = 2$ for $\mathbf{r}_\mathrm{E}$. Figure 2.6 compares the predictions by $\mathbf{r}_\mathrm{V}$, $\mathbf{r}_\mathrm{E}$, and $\mathbf{r}_\gamma$ to frequency-dependently perceived directions in frontal horizontal pairs, to perceived directions in a lateral stereo pair, and to perceived directions in a frontal pair that is either horizontal or vertical, using various studies mentioned above.

Fig. 2.6
Fit of the $\mathbf{r}_\mathrm{V}$, $\mathbf{r}_\mathrm{E}$, and $\mathbf{r}_\gamma$ models for a third-octave noise on a frontal stereo pair using data from [9], and with data from [11]
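The equivalence between the tangent law and the vector model with linear gains can be checked numerically. The sketch below assumes that the adjustable-slope model weights each direction vector by the gain raised to the power γ, so that γ = 1 and γ = 2 reproduce the two classical vectors; the function names and gain values are hypothetical.

```python
import math


def vector_model_angle(g1, g2, alpha_deg, gamma=1.0):
    """Angle of the gamma-weighted sum of the +alpha and -alpha directions."""
    alpha = math.radians(alpha_deg)
    w1, w2 = abs(g1) ** gamma, abs(g2) ** gamma
    x = (w1 + w2) * math.cos(alpha)  # x components add up
    y = (w1 - w2) * math.sin(alpha)  # y components have opposite signs
    # The common normalization by (w1 + w2) does not change the angle.
    return math.degrees(math.atan2(y, x))


def tangent_law_angle(g1, g2, alpha_deg):
    """Tangent law: tan(phi) = (g1 - g2) / (g1 + g2) * tan(alpha)."""
    alpha = math.radians(alpha_deg)
    return math.degrees(math.atan((g1 - g2) / (g1 + g2) * math.tan(alpha)))


if __name__ == "__main__":
    for g1, g2 in [(1.0, 1.0), (1.0, 0.5), (1.0, 0.1)]:
        print(g1, g2,
              round(tangent_law_angle(g1, g2, 30.0), 2),
              round(vector_model_angle(g1, g2, 30.0, gamma=1.0), 2),
              round(vector_model_angle(g1, g2, 30.0, gamma=2.0), 2))
```

The first two angle columns agree for every gain pair, while γ = 2 yields the steeper curve.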
Practical choice $\mathbf{r}_\mathrm{E}$. While a specific exponent $\gamma$ closely fitting the experimental data may vary, a constant value is preferable. Figure 2.6 indicates that in most cases focusing on $\mathbf{r}_\mathrm{E}$ is reasonable and sufficiently precise, see also [11].

Level Differences on Frontal Loudspeaker Triangles
V. Pulkki [21] and F. Wendt [7,19] investigated localization properties for frontal loudspeaker triplets with level differences, see Fig. 2.7. Both used pulsed pink noise in their experiments. While V. Pulkki used an indirect adjustment task to evaluate VBAP control angles to obtain auditory events directionally matching the respective reference loudspeakers, F. Wendt used a direct pointing method. Wendt's experiments indicate that loudspeaker triplets with three different azimuthal positions yield a smaller spread in the indicated direction than those with vertical loudspeaker pairs (which is not the case in Pulkki's experiments).

Level Differences on Frontal Loudspeaker Rectangles
F. Wendt [7,19] moreover presents experiments about frontal loudspeaker rectangles, again using a pointer method and pulsed pink noise, Fig. 2.8.
Again it seems that arrangements avoiding vertical loudspeaker pairs exhibit a smaller statistical spread in the responses.

Vector Model for More than 2 Loudspeakers
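For more than two active loudspeakers and in 3D, a vector model based on the exponent $\gamma = 2$ yields the $\mathbf{r}_\mathrm{E}$ vector [1]

$$\mathbf{r}_\mathrm{E} = \frac{\sum_{l=1}^{L} g_l^2\,\boldsymbol{\theta}_l}{\sum_{l=1}^{L} g_l^2}. \tag{2.7}$$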
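The energy vector for an arbitrary number of loudspeakers in 3D can be sketched as follows. The coordinate convention (x to the front, y to the left, z up), the function names, and the example layout are assumptions made for illustration.

```python
import math


def unit_vector(azimuth_deg, elevation_deg=0.0):
    """Direction vector; assumed convention: x front, y left, z up."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(az) * math.cos(el),
            math.sin(az) * math.cos(el),
            math.sin(el))


def energy_vector(gains, directions):
    """Average of the direction vectors weighted by the squared gains."""
    total = sum(g * g for g in gains)
    if total == 0.0:
        raise ValueError("at least one gain must be non-zero")
    return tuple(sum(g * g * d[k] for g, d in zip(gains, directions)) / total
                 for k in range(3))


def azimuth_elevation_deg(vector):
    """Azimuth and elevation of a vector in degrees."""
    x, y, z = vector
    return (math.degrees(math.atan2(y, x)),
            math.degrees(math.atan2(z, math.hypot(x, y))))


def length(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in vector))


if __name__ == "__main__":
    # Hypothetical frontal triangle: two horizontal and one elevated loudspeaker.
    triangle = [unit_vector(30.0), unit_vector(-30.0), unit_vector(0.0, 45.0)]
    r_e = energy_vector([1.0, 1.0, 0.5], triangle)
    print("direction (az, el):", azimuth_elevation_deg(r_e))
    print("length:", length(r_e))
```

A single active loudspeaker returns its own direction with length 1, while several active loudspeakers shorten the vector.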

Vector Model for Off-Center Listening Positions
At off-center listening positions, the distances to the loudspeakers are no longer equal, which results in additional attenuation and delay for each loudspeaker depending on the position. For stationary sounds, this effect can be incorporated into the energy vector by additional weights $w_{r,l}$ and $w_{\tau,l}$,

$$\mathbf{r}_\mathrm{E} = \frac{\sum_{l=1}^{L} (w_{r,l}\,w_{\tau,l}\,g_l)^2\,\boldsymbol{\theta}_l}{\sum_{l=1}^{L} (w_{r,l}\,w_{\tau,l}\,g_l)^2}. \tag{2.8}$$

The weight $w_{r,l}$ models the attenuation $\frac{1}{r}$ of a point-source-like propagation. The reference distance is the distance to the closest loudspeaker at the evaluated listening position, thus the weight of each loudspeaker results in

$$w_{r,l} = \frac{\min_{l'} r_{l'}}{r_l}. \tag{2.9}$$

The incorporation of delays into the energy vector requires a transformation that yields the weights $w_{\tau,l}$ for each loudspeaker. It is reasonable that these weights attenuate the lagging signals in order to reduce their influence on the predicted direction. An attenuation of $\frac{1}{4}$ dB/ms is known from the echo threshold in [22], similarly [23], and has successfully been applied for the prediction of localization in rooms [24]. The weight of each loudspeaker is calculated as

$$w_{\tau,l} = 10^{-\frac{1}{4}\,\frac{1000\,(\tau_l - \min_{l'}\tau_{l'})}{20}}, \qquad \text{with the delay } \tau_l = \frac{r_l}{c} \text{ in seconds at the listening position under test.} \tag{2.10}$$

Further weights can be applied in order to model the precedence effect in more detail, as proposed by Stitt [25,26]. Listening test results in [27] compared the differently complex extensions of the energy vector and revealed that the simple weighting with $w_{r,l}$ and $w_{\tau,l}$ is sufficient for a rough prediction of the perceived direction in typical playback scenarios.
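The distance and delay weighting can be sketched for a horizontal layout. The following code rests on several assumptions that are not stated in the text: a speed of sound of 343 m/s, lags measured relative to the earliest-arriving loudspeaker, direction vectors taken from the listening position towards each loudspeaker, and hypothetical function names.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def off_center_energy_vector(gains, speakers, listener, db_per_ms=0.25):
    """Energy vector with 1/r and lag weights at an off-center position.

    speakers and listener are (x, y) positions in meters; directions are
    taken from the listener towards each loudspeaker (an assumption).
    """
    lx, ly = listener
    offsets = [(sx - lx, sy - ly) for sx, sy in speakers]
    distances = [math.hypot(dx, dy) for dx, dy in offsets]
    r_min = min(distances)
    if r_min == 0.0:
        raise ValueError("listener must not coincide with a loudspeaker")
    num_x = num_y = den = 0.0
    for g, (dx, dy), r in zip(gains, offsets, distances):
        w_r = r_min / r  # point-source attenuation relative to the closest one
        lag_ms = 1000.0 * (r - r_min) / SPEED_OF_SOUND
        w_tau = 10.0 ** (-db_per_ms * lag_ms / 20.0)  # dB-per-ms attenuation
        e = (w_r * w_tau * g) ** 2
        num_x += e * dx / r
        num_y += e * dy / r
        den += e
    if den == 0.0:
        raise ValueError("at least one gain must be non-zero")
    return (num_x / den, num_y / den)


def predicted_angle_deg(r_e):
    """Angle of the energy vector in degrees, positive to the left."""
    return math.degrees(math.atan2(r_e[1], r_e[0]))


if __name__ == "__main__":
    radius, alpha = 2.5, math.radians(30.0)
    stereo = [(radius * math.cos(alpha), radius * math.sin(alpha)),
              (radius * math.cos(alpha), -radius * math.sin(alpha))]
    for position in [(0.0, 0.0), (0.0, 0.5), (0.0, -0.5)]:
        r_e = off_center_energy_vector([1.0, 1.0], stereo, position)
        print(position, round(predicted_angle_deg(r_e), 2))
```

Since both weights enter numerator and denominator alike, any common reference for distance or lag cancels, so only the relative attenuation between loudspeakers affects the predicted direction.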
The left side of Fig. 2.9 shows the directions predicted by the energy vector for various listening positions when playing back the same signal on a standard stereo loudspeaker pair with a radius of 2.5 m. The absolute localization error can be calculated as the difference between the predicted direction and the desired panning direction. The right side of Fig. 2.9 shows this error. Concerning a single playback scenario, i.e. a single panning direction on a loudspeaker setup, the perceptual sweet area for plausible playback can be estimated as the area with localization errors below 30°. For the prediction of a more general sweet area, the absolute localization errors can be computed for all possible panning directions on a fine grid of 1° and averaged at each listening position, as shown in Fig. 2.10.

Width
M. Frank [10] investigated the auditory source width for frontal loudspeaker pairs with 0 dB level difference and various aperture angles, as well as the influence of an additional center loudspeaker on the auditory source width. The listeners responded by reading numbers off a left-right symmetric scale written on the loudspeaker arrangement (Fig. 2.11). Figure 2.11 (right) shows the statistical analysis of the responses. Evidently, the additional center loudspeaker decreases the auditory source width.
Auditory source width is difficult to compare across different directions, and even single loudspeakers yield auditory source widths that vary with direction. Still, a relatively constant auditory source width is desirable for moving auditory events. For static auditory events, the narrowest possible extent can be desirable.

Model of the Perceived Width
The angle $2\arccos\|\mathbf{r}_\mathrm{E}\|$ describes the aperture of a cap cut off the unit sphere perpendicular to the $\mathbf{r}_\mathrm{E}$ vector at its tip, as seen from the origin, see Fig. 2.12. As the length of the $\mathbf{r}_\mathrm{E}$ vector lies between 0 (unclear direction) and 1 (only one loudspeaker active), this angle stays between 180° and 0°. M. Frank's experiments on the auditory source width [10,28] showed that stereo pairs with larger half angles $\alpha$ were also heard as wider. The $\mathbf{r}_\mathrm{E}$ vector gets shorter as the half angle $\alpha$ grows. For a symmetrical loudspeaker pair $\boldsymbol{\theta}_{1,2}^\mathrm{T} = (\cos\alpha, \pm\sin\alpha)$ with $g_1 = g_2 = 1$, the $y$ coordinate of the $\mathbf{r}_\mathrm{E}$ vector cancels and its length is

$$\|\mathbf{r}_\mathrm{E}\| = \cos\alpha.$$

The corresponding spherical cap has the same size as the loudspeaker pair, $2\arccos\|\mathbf{r}_\mathrm{E}\| = 2\alpha$. However, only $\frac{5}{8}$ of this size was indicated by the listeners in the experiments, which yields the following estimator of the perceived width:

$$\text{perceived width} = \frac{5}{8}\cdot 2\arccos\|\mathbf{r}_\mathrm{E}\|. \tag{2.11}$$
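The width estimator can be evaluated for a stereo pair with and without a center loudspeaker. This is a minimal sketch for horizontal layouts with hypothetical function names; the 30° half angle and the equal gains are assumed example values, and the room-dependent lower limit of the width is not modeled.

```python
import math


def energy_vector_length(gains, azimuths_deg):
    """Length of the energy vector for a horizontal loudspeaker layout."""
    weights = [g * g for g in gains]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("at least one gain must be non-zero")
    x = sum(w * math.cos(math.radians(a)) for w, a in zip(weights, azimuths_deg)) / total
    y = sum(w * math.sin(math.radians(a)) for w, a in zip(weights, azimuths_deg)) / total
    return math.hypot(x, y)


def perceived_width_deg(r_e_length):
    """5/8 of the cap aperture 2*arccos(length), in degrees."""
    clipped = min(1.0, max(0.0, r_e_length))  # guard against rounding above 1
    return 5.0 / 8.0 * 2.0 * math.degrees(math.acos(clipped))


if __name__ == "__main__":
    pair = energy_vector_length([1.0, 1.0], [30.0, -30.0])
    triplet = energy_vector_length([1.0, 1.0, 1.0], [30.0, 0.0, -30.0])
    print("L+R:   length %.3f, width %.1f deg" % (pair, perceived_width_deg(pair)))
    print("L+C+R: length %.3f, width %.1f deg" % (triplet, perceived_width_deg(triplet)))
```

For the ±30° pair the cap aperture is 60°, so the estimator yields 37.5°, and adding the center loudspeaker lengthens the vector and narrows the estimate.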

Fig. 2.12
Cap size associated with the $\|\mathbf{r}_\mathrm{E}\|$ length model for L+R (left plot) and L+R+C (right plot)

Fig. 2.13
Model of the perceived width as $\frac{5}{8}$ of the half-angle $\arccos\|\mathbf{r}_\mathrm{E}\|$, which matches the half-angle of the experiment except for a lower limit that is determined by the apparent source width (ASW) due to the room-acoustical setting. With the additional center loudspeaker, the length $\|\mathbf{r}_\mathrm{E}\|$ increases, matching the experiments as $\arccos\|\mathbf{r}_\mathrm{E}\| < \alpha$, see Figs. 2.13 and 2.12.

Coloration
Although research primarily focuses on the spatial fidelity of multi-loudspeaker playback, the overall quality of surround sound playback was found to be largely determined by timbral fidelity (70%) [29]. Loudspeakers in a studio or performance space are often characterized by different colorations that are caused by different reflection patterns (most often from the wall behind the loudspeaker). When changing the active loudspeakers, or their number, these differences become audible. On the one hand, static coloration, e.g. the frequency responses of the loudspeakers, can typically be equalized. On the other hand, changes in coloration during the movement of a source cannot be equalized easily and yield annoying comb filters. Although coloration is often assessed verbally [30], we employ a simple technical predictor based on the composite loudness level (CLL) by Ono [31,32]. The CLL spectrum predicts the perceived coloration and is calculated from the sum of the loudnesses of both ears in each third-octave band. Studies on loudspeaker and headphone equalization show that differences in third-octave band levels of less than 1 dB are inaudible to most listeners [33,34]. This criterion can also be applied to the perception of coloration, i.e., differences between CLL spectra of less than 1 dB are assumed to be inaudible.
Pairwise panning between loudspeakers results in a single active loudspeaker for source directions that coincide with the direction of a loudspeaker, and in two equally loud loudspeakers for source directions exactly between two neighboring loudspeakers, cf. Fig. 2.14. In the second case, the different propagation paths from the two loudspeakers to the ears create a comb filter. This comb filter is not present for sources played from a single loudspeaker. Thus, moving a source between the two directions yields noticeable coloration. This is in contrast to static sources, for which Theile's experiments [35] indicated that they are perceived without coloration. The actual shape of the aforementioned comb filter depends on the angular distance between the loudspeakers. The frequency of the first notch and its depth both decrease as this distance grows. This implies that coloration increases for playback with higher loudspeaker densities.
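The basic mechanism of such a comb filter can be illustrated with a generic two-path superposition. This is only a sketch: the delay and relative gain below are hypothetical values, and the actual path differences at the ears depend on the head and the loudspeaker geometry, which are not modeled here.

```python
import math


def two_path_magnitude_db(frequency_hz, delay_s, relative_gain=1.0):
    """Magnitude of 1 + a*exp(-j*2*pi*f*delay) in dB."""
    phase = 2.0 * math.pi * frequency_hz * delay_s
    real = 1.0 + relative_gain * math.cos(phase)
    imag = -relative_gain * math.sin(phase)
    magnitude = math.hypot(real, imag)
    return 20.0 * math.log10(max(magnitude, 1e-12))  # avoid log of zero


def first_notch_hz(delay_s):
    """The first notch occurs where the delayed path is in antiphase."""
    return 1.0 / (2.0 * delay_s)


def notch_depth_db(relative_gain):
    """Peak-to-notch ratio (1 + a) / |1 - a| in dB; infinite for a = 1."""
    if relative_gain == 1.0:
        return math.inf
    return 20.0 * math.log10((1.0 + relative_gain) / abs(1.0 - relative_gain))


if __name__ == "__main__":
    delay = 0.00025  # hypothetical path-delay difference of 0.25 ms
    print("first notch at %.0f Hz" % first_notch_hz(delay))
    for f in (500.0, 1000.0, 2000.0, 4000.0):
        print("%6.0f Hz: %6.2f dB" % (f, two_path_magnitude_db(f, delay, 0.5)))
```

A longer delay difference moves the first notch to a lower frequency, and a weaker second path makes the notch shallower.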
A similar comb filter is created when using a triplet of loudspeakers with the same loudspeaker density as the pair, e.g. L, C, R compared to C, R. In order to avoid a strong increase in source width or annoying phasing effects, the outermost loudspeakers L and R are strongly reduced in their level, typically to around −12 dB compared to loudspeaker C. In doing so, the similarity of the comb filters yields barely any coloration when moving a source between the two directions, cf. Fig. 2.15.
Judging from what is shown above, it appears beneficial to always activate a few loudspeakers to stabilize the coloration, as opposed to using just one loudspeaker and moving the playback to another one. Keeping the number of simultaneously active loudspeakers more or less constant not only prevents coloration of source movements, it also yields a more constant source width. Because of this relation between coloration and source width, the fluctuation of $\|\mathbf{r}_\mathrm{E}\|$ is also a simple predictor of panning-dependent coloration.
In general, the strongest coloration is perceived under anechoic listening conditions. In reverberant rooms, the additional comb filters introduced by reflections help to conceal the comb filters due to multi-loudspeaker playback.

[Plot: coloration, CLL in dB; curves: on loudspeaker, between loudspeakers, difference]

Fig. 2.15
Coloration predicted by composite loudness levels for loudspeaker C with additional -12 dB from L and R (black), two equally loud loudspeakers C and R (light gray), and their difference (dashed dark gray)

Open Listening Experiment Data
Experimental data from azimuthal localization in frontal and lateral loudspeaker pairs (Figs. 2.3 and 2.4), azimuthal/elevational localization in horizontal, skew, and vertical frontal pairs (Fig. 2.5), triangles (Fig. 2.7), and quadrilaterals (Fig. 2.8), as well as the data of the width experiment in Fig. 2.11, are available online at https://opendata.iem.at in the listening experiment data project. The project contains evaluation routines to analyze the 95% confidence intervals either symmetrically, based on means, standard deviations, and the inverse Student's t-distribution (CIMEAN.m), or more robustly, based on medians, inter-quartile ranges, and Student's t-distribution (CI2.m), or for two-dimensional data analysis (robust_multivariate_confidence_region.m). The MATLAB script plot_gathered_data.m reads the formatted listening experiment data, and its exemplary code generates figures like the ones above.
In order to support others in providing their own listening experiment data, the MATLAB functions write_experimental_data.m and read_experimental_data.m are provided on the website.