1 Introduction

A room impulse response (RIR) characterizes the linear time-invariant acoustic propagation between a source and a receiver within a specific acoustic environment. RIRs are crucial for sound field auralization [1] as well as in the realm of room acoustics, where they are used for estimating acoustic properties of a room such as the reverberation time [2].

The emergence of data-driven methods in acoustics [3], particularly deep learning methods, has sparked increasing interest in the availability of rich, high-quality RIR datasets. These datasets play a pivotal role in the training of data-driven (interpolatory) sound field reconstruction methods [4,5,6,7], deep generative models [8, 9], and augmentation methods [10]. In addition, RIR datasets can be flexibly employed in order to synthesize acoustic training data for source localization and characterization [11], sound event detection, and speech separation tasks by convolving arbitrary source signals with RIRs [12,13,14]. The same synthesis procedure can be employed for data-driven acoustic parameter estimation problems, such as blind reverberation time estimation [15, 16] and others [17].

While data-driven methods often exhibit superior performance compared to conventional model-based methods, they require large amounts of realistic training data and are sensitive to variations of underlying probability distributions describing the data, also known as dataset shift [18]. Experimental data is oftentimes not available or too time-consuming to acquire. Many data-driven methods across various application areas, such as speech enhancement and recognition [19, 20], localization [11, 21], sound field reconstruction [22], room acoustic parameter estimation [23, 24], and acoustical engineering [25,26,27], are therefore trained with simulated data, whereby enhanced realism helps to improve generalization performance [28, 29]. However, without adaptation to or training with realistic data, the performance of data-driven methods can be significantly impaired [25, 30], which indicates the need for experimentally measured RIR datasets.

1.1 State of the art

In contrast to the diverse analytical and data-driven frameworks for the simulation of RIRs proposed in the past [8, 31,32,33,34], the availability of openly available experimental RIR datasets is limited. Moreover, the landscape of RIR datasets is rather heterogeneous because the experimental data is usually acquired with a certain application in mind. This manifests in application-specific environments, customized source-receiver arrangements, and non-standardized dataset annotations.

In some datasets, the focus lies on environmental variability through different room geometries and reverberation times [35,36,37,38,39,40]. The SoundCam dataset [35] poses an extreme example in which the environmental characteristics of four rooms are manipulated by placing humans at hundreds of locations in the rooms, leading to an enormous amount of environmental representations. These datasets are of particular interest in speech recognition, teleconferencing, and sound event detection applications. While in some of these datasets, the sensors are distributed over the room [35, 36], others rely on compact microphone arrangements with only a few channels and aperture sizes that are small compared to the source distance [39,40,41]. The latter is often the case in speech and sound event localization applications. Only very few datasets provide additional annotations of the environmental characteristics, such as echo annotations, which enable their use for acoustic parameter estimation problems [40].

Conversely, there are application fields where spatial diversity and accurate positional annotations are similarly or even more important than environmental variability. In acoustical engineering applications, microphone array measurements for determining source locations and individually induced sound pressure at a specific receiver position are widely used to implement noise reduction measures [42]. To accurately estimate these source characteristics, free-field conditions or laboratory environments with a short reverberation time are preferred. Usually, the utilized microphone arrangements are planar and contain many channels spread over a large area to ensure accurate spatial filtering capabilities over a wide frequency range. Typical examples include the measurement of machinery noise and wind turbines. Oftentimes, multiple closely neighbouring sound sources can be expected. There exist datasets that employed mechatronic positioning devices, which offer a higher positional accuracy over manual positioning, to achieve a dense sampling of the acoustic space under reverberant [43,44,45] or anechoic conditions [46]. For example, the Pyramic dataset [46] was acquired by rotating a tetrahedral microphone array with a turntable device using fixed loudspeaker positions. However, the accuracy of the acquired positional information is rarely assessed in the literature [46]. A limitation of these datasets is that either the number of sources or receivers can be considered as small [39, 43, 44, 46], which prohibits their use in engineering acoustic applications where a dense spatial sampling of the source and receiver space is required.

Another discipline with similar data requirements is the spatial interpolation of room impulse responses [5,6,7, 47]. Due to the large spatial variability of RIRs [47], dense sampling grids have to be employed in order to adequately capture the entire range of dynamics, especially at high frequencies. The absence of large and spatially dense RIR datasets entails that thorough assessment of interpolation methods has only been performed with synthetic data so far [5, 7].

Regardless of the application, only a few large-scale datasets encompassing several thousand RIRs, the largest of which is presented in this work. These datasets are outlined in Table 1. In addition to these larger datasets, there are several popular datasets with a smaller number of RIRs, for example [38,39,40, 48,49,50,51,52,53,54,55]. Large and realistic RIR datasets are of great interest to many scientific communities, especially those concerned with the development of novel data-driven modelling algorithms. The mathematical field of model order reduction (MOR) [56], for example, is rapidly developing data-driven methods [57,58,59]. However, in the majority of cases, new methods are validated only with synthetic data of common benchmark problems [60,61,62] and the validation of these methods in realistic scenarios with high-dimensional data is still pending. Providing well-documented high-dimensional measurement data is an essential element that can enable these communities to increase the applicability of their methods to real problems.

Table 1 Existing large-scale RIR datasets consisting of more than 2000 RIRs (A, anechoic; R, reflective; M, mechatronic positioning)

In this work, we introduce a large measured room impulse response (RIR) dataset, which we call “Microphone Array Impulse Response Dataset for Acoustical Learning” (MIRACLE). The dataset is tailored to the application field of sound source localization and characterization in acoustical engineering applications, and spatial interpolation of RIRs. The focus of the dataset lies on spatial rather than on environmental variability and accurate positional annotations.

Its key highlights are:

  1. 1.

    MIRACLE is the first RIR dataset with a dense sampling of the environment in both source and receiver space.

  2. 2.

    Particularly accurate spatial sampling is achieved by using a mechatronic positioning device to control the loudspeaker position. In contrast to most of the previously published datasets, the accuracy of the source positions (positional labels) is statistically validated. The assessment reveals outstandingly accurate and precise source locations with positional uncertainties in the order of a few millimetres.

  3. 3.

    With a total of 856,128 captured room impulse responses and dense spatial sampling of the observation area, the dataset can be well suited for machine learning and deep learning.

  4. 4.

    This paper presents two important applications of the dataset. As will be shown in Section 4.1, the dataset is well suited to construct acoustic source cases with possibly closely neighbouring sources by superimposing signals that have been convolved with the provided RIRs. In Section 4.2, reduced order state-space models are constructed from the MIRACLE dataset, demonstrating that it can serve as an excellent real-world benchmark problem for data-driven reduced order modelling methods.

2 Materials and methods

2.1 Experimental setup

The experimental setup is illustrated in Fig. 1. Details on the utilized hardware are given in Table 2.

Fig. 1
figure 1

Experimental setup for the main experiment (R2) with reflective ground plate

Table 2 Utilized hardware devices

2.1.1 Microphone array

The phased microphone setup features a planar microphone array comprising \(n_{\textrm{o}}=64\) channels mounted in a \(1.5\,\text {m} \times 1.5\,\text {m}\) aluminium plate. The microphone arrangement follows Vogel’s spiral [63]. The maximum pairwise distance between the array microphones is referred to as the aperture size \(d_{\textrm{a}} = 1.47\,\text {m}\). The microphone array data was acquired with a multichannel acquisition system (sampling rate = \(51.2\,\text {kHz}\)).

2.1.2 Sound source and excitation signal

A dynamic 2” cone loudspeaker in a cylindrical 3D-printed enclosure was employed as the sound source. An exponential sine sweep was used as the excitation signal because of its favourable properties with regard to crest factor and rejection of non-linearities [64]. It was designed according to [65,66,67] in the frequency range of the loudspeaker, namely from \(100\,\text {Hz}\) to \(16\,\text {kHz}\). Because the anechoic chamber is nearly free of reflections and has very low noise levels, it was possible to choose a relatively short sweep time of \(3\,\text {s}\) for the measurement. In order to ensure that the entire system response after excitation is captured, a safety window of \(250\,\text {ms}\) was added to the recording duration, resulting in \(n_s=166{,}400\) samples per measurement. The loudspeaker excitation signal was also fed back directly to the AD-converter and was synchronously recorded with the microphone signals as a reference signal for post-processing. It is referred to as the loopback excitation signal in the following.

2.1.3 Positioning

A high-precision motor-driven 2D positioning system was employed for loudspeaker positioning. The positioning system and the microphone array were manually aligned by using a laser distance meter and a cross-line laser, achieving only minor alignment errors of a few millimetres at worst. The loudspeaker dust cap at the membrane centre was used as reference in the manual alignment. During data post-processing, a spatial offset correction was applied based on a statistical evaluation given in Section 3.3. The corrected positions apply to the acoustical centre of the loudspeaker rather than the centre of the membrane.

2.1.4 Environment

All measurements were performed in the anechoic chamber of TU Berlin (room volume \(V=830\,\text {m}^3\), lower cut-off frequency \(f_{\textrm{c}}=63\,\text {Hz}\)). Neither heating nor air conditioning was active, and the temperature was monitored at the microphone array centre throughout the experiment. A ground plate was placed between the loudspeaker and the microphone array in one of the experimental scenarios to enable a reflective environment. The supporting grid platform and the positioning system were clad with absorptive foam to minimize reflections.

2.2 Experimental procedure

A customized and fully automated data acquisition procedure was implemented. Before each experiment, the loudspeaker was repeatedly excited with the excitation signal for a duration of 20 min (the duration was determined in a dedicated experiment). This warm-up phase accounts for the weakly non-stationary dynamics of the loudspeaker’s transfer function, e.g., changes of the properties of the loudspeaker magnet related to internal temperature fluctuations; see [68]. Subsequently, the actual measurement routine was started by positioning the loudspeaker at the desired source location and measuring the room temperature simultaneously. After positioning, two repetitions of background noise measurement (\(1\,\text {s}\) each) and loudspeaker excitation measurements (\(3\,\text {s}\) each) were performed using all \(n_{\textrm{o}}\) microphones at once. Subsequently, the cross-correlation between all \(n_{\textrm{i}}\) recorded channels was evaluated according to the rule of two [69]. Based on the measured sweep signals and the noise signal, the rule of two defines a cross-correlation threshold at which a pair of measured sweeps can be regarded free of corruption. In case of any violations, the measurement was repeated automatically.

Following the main measurement campaign, an additional measurement was conducted in the anechoic chamber to obtain the angle-dependent frequency response of the loudspeaker at discrete azimuth angles at a resolution of \(\Delta \theta = 2.5^\circ\). A microphone was placed at a distance of \(0.5\,\text {m}\) from the loudspeaker centre. The latter was mounted on a motor-driven dispersion measurement turntable. A photograph of the measurement setup can be found in Fig. 2. The same excitation signal and processing parameters as in the previous measurement campaign were used to determine the loudspeaker impulse response. Due to the cylindrical enclosure enclosing the loudspeaker, rotational symmetry around the z-axis can be assumed.

Fig. 2
figure 2

Experimental setup for the directivity measurement

2.3 Post-processing

Several post-processing steps were performed to obtain a good estimate of the system impulse response from the measurements. Firstly, the loopback excitation and microphone signals were averaged across the two measurement repetitions to obtain a single averaged excitation signal \(\tilde{u}_{i,j}\in \mathbb {R}^{n_s}\) and averaged microphone signal \(\tilde{y}_{i,j}\in \mathbb {R}^{n_s}\) at the i-th source to the j-th receiver location, respectively. According to that, all signals were resampled to a sampling rate of \(f_{\textrm{d}}=32\,\text {kHz}\) since the loudspeaker transmission capability and excitation sweep have an upper frequency limit of \(16\,\text {kHz}\). We applied the polyphase method for resampling (see [70] for details).

2.3.1 Deconvolution

In the following, let \(n_{\textrm{d}}=104{,}000\) denote the number of samples after resampling. An estimate of the frequency response was obtained by dividing the Discrete Fourier Transform (DFT) of the averaged and downsampled measurement signals \(Y_{i,j}=\textrm{DFT}(\tilde{y}_{i,j})\in \mathbb {C}^{n_{\textrm{d}}}\) by the corresponding DFT of the averaged and resampled loopback excitation signals \(U_{i,j}=\textrm{DFT}(\tilde{u}_{i,j})\in \mathbb {C}^{n_{\textrm{d}}}\), i.e.

$$\begin{aligned} H_{i,j}\left( e^{\imath \omega _k}\right) =Y_{i,j}\left( e^{\imath \omega _k}\right) U_{i,j}^{-1}\left( e^{\imath \omega _k}\right) \in \mathbb {C}, \end{aligned}$$
(1)

for the angular frequency \(\omega _k=2\pi k/n_{\textrm{d}}\) with \({k\in [-n_{\textrm{d}}/2,n_{\textrm{d}}/2]\subset \mathbb {Z}}\). The inverse spectra \(U_{i,j}^{-1}\in \mathbb {C}^{n_{\textrm{d}}}\) were obtained by regularized inversion [67, 71,72,73,74,75]

$$\begin{aligned} U_{i,j}^{-1}\left( e^{\imath \omega _k}\right) =\frac{U_{i,j}^{*}\left( e^{\imath \omega _k}\right) }{U_{i,j}^{*}\left( e^{\imath \omega _k}\right) U_{i,j}\left( e^{\imath \omega _k}\right) +M\lambda \left( e^{\imath \omega _k}\right) }, \end{aligned}$$
(2)

where \(M=\max _{k\in \{1,\,\dots ,\,n_{\textrm{d}}\}}\{|U_{i,j}\left( e^{\imath \omega _k}\right) |^2\}=1\). Regularization is necessary to avoid instabilities in the deconvolved frequency response that arise from persistently exciting only over a limited frequency range. Practical considerations for choosing the regularization term in acoustic applications can be found in [67]. The regularization term \(\lambda \in \mathbb {R}^{n_{\textrm{d}}}\) was chosen as

$$\begin{aligned} \lambda \left( e^{\imath \omega _k}\right) = \left\{ \begin{array}{ll} 1&{}\textrm{for}~~|\omega _k|\in \left[ \,0,\,\omega _{\text {fade}}\right] \\ \frac{1+\cos \left( {\frac{\omega _{\text {fade}}-|\omega _k|}{\omega _{\text {fade}}-\omega _{\text {cut}}}}\right) }{2}&{}\textrm{for}~~|\omega _k|\in \left[ \,\omega _{\text {fade}},\,\omega _{\text {cut}}\right] \\ 0&{}\textrm{for}~~|\omega _k|\in \left[ \,\omega _{\text {cut}},\,\pi \right] \end{array}\right. \end{aligned}$$
(3)

such that the regularization term \(\lambda \left( e^{\imath \omega _k}\right)\) is equal to 0 above the cutoff frequency

$$\begin{aligned} \omega _{\text {cut}}=2\pi \frac{100\,\text {Hz}}{f_{\textrm{d}}} \end{aligned}$$
(4)

which is chosen according to the lower limit of the loudspeaker’s frequency range of \(100\,\text {Hz}\) and equal to 1 below \(\omega _{\text {fade}}=\frac{\omega _{\text {cut}}}{\sqrt{2}}\). A cross-fade based on a Hann window (raised-cosine) is used to smoothly transition in between. The estimate of the frequency response \(H_{i,j}\) was then transformed back to the time domain to finally obtain the impulse response

$$\begin{aligned} h_{i,j}=\textrm{DFT}^{-1}\,(H_{i,j}). \end{aligned}$$
(5)

2.3.2 Truncation

The calculated impulse responses were subsequently truncated in order to contain the size of the final dataset. For user convenience, the impulse responses of all measurement scenarios were truncated identically. For this, the minimum cumulative energy \(e\in \mathbb {R}^{n_{\textrm{d}}}\) given by

$$\begin{aligned} e(t)=\min _{i\in n_{\textrm{i}}, j\in n_{\textrm{o}}}\sum _{\tau =1}^t|h_{i,j}(\tau )|^2,\quad t\in \{1,\,\dots ,\,n_{\textrm{d}}\}, \end{aligned}$$
(6)

was calculated for each scenario. The truncation index \(n_t\) was chosen to be the smallest power of two that is larger than the time index for which \(0.1\,\%\) of the energy is truncated at worst, namely

$$\begin{aligned} n_t=1{,}024\ge \tilde{t}=\underset{t\in \{1,\,\dots ,\,n_{\textrm{d}}\}}{\arg \,\max }\left\{ e(t)\le 0.999 \Vert {e}\Vert _{\infty }\right\} . \end{aligned}$$
(7)

3 Results and discussion

3.1 Impulse responses

A total of four different experimental scenarios were realized, which are summarized in Table 3. The acquisition time for each of the large-scale scenarios A1, A2, and R2 was about 20 h. The total number of single-channel impulse responses across all scenarios is 856,128. The scenarios differ regarding the environment as well as the spatial dimension (\(d_y = d_x\)), sampling resolution (\(\Delta d_y = \Delta d_x\)), and distance \(d_z\) of the source plane. The two large anechoic scenarios A1 and A2 each include 4096 measured source positions on an equidistantly spaced \(64\times 64\) grid at different source-plane distances \(d_z\). In addition, a densely sampled scenario D1 was acquired on a smaller \(33\times 33\) grid with a spacing of only \(5\,\text {mm}\). Scenario R2 is based on the same geometric setup as scenario A2, but an aluminium plate on the floor introduces a specular reflection.

Table 3 MIRACLE experimental scenarios

Figures 3 and 4 exemplarily show the measured impulse response and its magnitude spectrum for a single source-receiver combination for scenarios A1, A2, and R2, respectively. It can be readily verified that the doubling of the distance to the source is also reflected in a doubling of the delay shift and an attenuation of the magnitude spectrum by approximately \(-6\text {dB}\). Furthermore, the specular reflection for scenario R2 manifests in a prominent second peak in the impulse response and comb filtering in its magnitude spectrum. Additional reflections manifesting as spurious peaks in the impulse response are due to the structure of the positioning system and the supporting grid platform.

Fig. 3
figure 3

Measured impulse responses for the scenarios A1, A2, and R2 and the centremost locations in the source and receiver plane. The dash-dotted vertical lines indicate the scenarios’ index \(\tilde{t}\)

Fig. 4
figure 4

Magnitude of the frequency response of the measured transfer functions for the scenarios A1, A2, and R2 at the centremost locations in the source and receiver plane

The mean and standard deviation of temperature and the speed of sound for each of the scenarios are given in Table 4. The speed of sound has been calculated according to [76, 77]Footnote 1. It reveals that the temperature and the speed of sound are almost identical across all scenarios with an absolute difference of \(\Delta \mu <1\,^{\circ }\textrm{C}\) and \(\Delta \mu \le 0.6\,\frac{m}{s}\), respectively, which is expected due to the fairly constant environmental conditions inside the anechoic chamber.

Table 4 Mean \(\mu\) and standard deviation \(\sigma\) of the temperature and speed of sound for each experiment

3.2 Loudspeaker directivity

Figure 5 shows the directivity \(\textrm{D}\) and the directivity index \(\textrm{DI}\) of the loudspeaker measured with a dispersion measurement turntable in the azimuthal plane. In this work, the directivity is defined as the ratio between the measured squared sound pressure \(p_{\textrm{RMS}}(\theta , f)\) at an angle \(\theta\) and the maximum among all angles, i.e.

$$\begin{aligned} \textrm{D}(\theta , f) = 10\log _{10}\left( {\frac{p_{\textrm{RMS}}(\theta ,f)}{\max _{\phi \in [0,2\pi ]}p_{\textrm{RMS}}(\phi ,f)}}\right) , \end{aligned}$$
(8)
Fig. 5
figure 5

Directivity \(\textrm{D}\) and directivity index \(\textrm{DI}\) of the loudspeaker. The maximum opening angle across all experiments is denoted by \(\theta _{\textrm{max}}\)

The directivity index under the assumption of rotational symmetry is expressed as

$$\begin{aligned} \textrm{DI}(f)=10\log _{10}\left( {\frac{4 \pi p_{\textrm{RMS}}^2(0, f)}{2 \pi \int _0^\pi p_{\textrm{RMS}}^2(\phi , f) \sin (\phi ) \textrm{d}\phi }}\right) , \end{aligned}$$
(9)

where \(p_{\textrm{RMS}}^2(0, f)\) represents the squared sound pressure in front of the speaker.

It is seen that the loudspeaker exhibits a radiation pattern similar to a monopole until an upper frequency of \(2\,\text {kHz}\). Above this frequency, the directivity index increases. Still, the directivity observed by the microphone array is close to a monopole at relevant radiation angles, i.e. \(\theta \le \theta _{\textrm{max}}=67.3^{\circ }\), as indicated by the dashed line in Fig. 5.

3.3 Positional validation

Several uncertainty factors affected the spatial alignment precision regarding the microphone array centre and the centre of the observation area. These factors include measurement uncertainties with regard to the utilized cross-line laser and distance meter as well as mechanical backlash, which occurred primarily with horizontal changes of direction. Therefore, a systematic spatial offset within the range of a few millimetres can be assumed.

Due to the anechoic environment and the use of a large-scale microphone array enabling an excellent spatial resolution, Conventional Frequency Domain Beamforming [42] serves as an appropriate method to obtain an estimate of the actual source location. The large number of acoustic cases also permits a statistical approach to determine the spatial offset for a measurement scenario and to quantify the uncertainty regarding the source position information.

3.3.1 Beamforming

Let \(\omega _k=2\pi k/n_{\textrm{d}}\) with \({k\in [-n_{\textrm{d}}/2,n_{\textrm{d}}/2]\subset \mathbb {Z}}\) and let

$$\begin{aligned} H(e^{\imath \omega _k})= \left[ H_{i,1}(e^{\imath \omega _k})\quad \dots \quad H_{i,n_{\textrm{o}}}(e^{\imath \omega _k}) \right] \in \mathbb {C}^{n_{\textrm{o}}} \end{aligned}$$
(10)

denote the transfer function measurements from the i-th source at location \(x_{\textrm{s}}\) for \(i\in \{1,\,\dots ,\,n_{\textrm{i}}\}\) to each of the \(n_{\textrm{o}}\) microphones. The cross-spectral matrix induced by a sound source with unit strength is then given by

$$\begin{aligned} C(\omega _k)=H(e^{\imath \omega _k})H(e^{\imath \omega _k})^*\in \mathbb {C}^{n_{\textrm{o}} \times n_{\textrm{o}}}. \end{aligned}$$
(11)

The beamforming result for an assumed source location \(x_{\textrm{s}}\in \mathbb {R}^{3}\) is then given by the square of the C-weighted norm of the steering vector \(a(x_{\textrm{s}},\omega _k)\in \mathbb {C}^{n_{\textrm{o}}}\), i.e.

$$\begin{aligned} b(x_{\textrm{s}},\omega _k)= \Vert {a(x_{\textrm{s}},\omega _k)}\Vert _{C(\omega _k)}^2=a(x_{\textrm{s}},\omega _k)^*C(\omega _k)a(x_{\textrm{s}},\omega _k). \end{aligned}$$
(12)

Many formulations of the steering vector can be found in the literature. The formulations I and IV in [78] result in a coincidence of the beamformer’s steered response power maximum and the actual source location for a single monopole source radiating under free-field conditions. In this work, formulation IV was used, which defines the entries of a via

$$\begin{aligned} \left\{ a(x_{\textrm{s}},\omega )\right\} _j=\frac{e^{\imath \omega (r_j-r_0)/c}}{r_j\sqrt{n_{\textrm{o}}\sum _{k=1}^{n_{\textrm{o}}}r_k^{-2}}}, \end{aligned}$$
(13)

where \(r_j= \left\| x_{\textrm{s}}-x_j\right\| _2\) is the distance between the assumed source location \(x_{\textrm{s}}\) and the j-th microphone location \(x_j\), and \(r_0=\left\| x_{\textrm{s}}-x_0\right\| _2\) is the distance between \(x_{\textrm{s}}\) and the reference position, in this case the origin of the coordinate system.

Validation of each measured source position commenced with the spatial discretization of a neighbourhood around the assumed source position. A \(201\times 201\) equidistantly spaced focus-grid with a resolution of \(\Delta x=0.5\,\text {mm}\) was employed. The beamforming map was computed on the discretized region for every frequency in the range

$$\begin{aligned} \Omega =\left[ 2\pi \frac{f_{\textrm{l}}}{f_{\textrm{d}}},\,2\pi \frac{f_{\textrm{u}}}{f_{\textrm{d}}}\right] \end{aligned}$$
(14)

which was chosen such that the lower frequency limit \(f_{\textrm{l}}=2\,\text {kHz}\) enabled a sufficiently large spatial resolution in the resulting beamforming map, and the upper frequency limit \(f_{\textrm{u}}=4\,\text {kHz}\) ensures that the wavelength is larger than the loudspeaker diameter. The latter is important to ensure that the loudspeaker has a radiation pattern close to a monopole at relevant radiation angles in order to meet the monopole assumption needed for the steering vector formulation. As indicated by the dashed line in Fig. 5, the radiation angle from the loudspeaker to any microphone in the array is bounded by \(\theta _{\textrm{max}}=67.3^{\circ }\). The global spatial maximum is then determined by

$$\begin{aligned} \hat{x}_i=\underset{x_s}{\arg \,\max }\ \sum _{\omega \in \Omega } \hat{b}(x_s, \omega ), \end{aligned}$$
(15)

where \(\hat{b}(x_s, \omega )\) denotes the amplitude normalized beamforming result

$$\begin{aligned} \hat{b}(x_s, \omega ) = \frac{b(x_s, \omega )}{b(\hat{x}_s, \omega )}, \end{aligned}$$
(16)

with \(b(\hat{x}_s, \omega )\) being the beamformer’s maximum output among all source locations \(x_s\) at a given frequency \(\omega\). The evaluation was conducted for different distances within a range of up to \(\pm 12\,\text {mm}\) around the assumed source distance with a sampling interval of \(\Delta z = 1\,\text {mm}\) to account for a potential mismatch of the source plane distance. Finally, the positional offset between the beamformer’s prediction and the assumed source position is determined by \(\Delta x_{i} = \hat{x}_{i} - x_{i}\).

3.3.2 Statistical evaluation

The systematic positional offset between the centre of the observation area and the microphone array in the horizontal and vertical direction can be statistically determined by using the estimates \(\Delta x_{i} \in \mathbb {R}^2\) for each individual measured source position. Thereby, each estimated positional deviation \(\Delta x_{i}\) can be seen as a realization of the jointly distributed random variables \(R_x, R_y\) with the joint Probability Density Function (PDF) \(f_{R_x, R_y}(\Delta x_{i})\). It is assumed that the individual positional offset estimations \(\Delta x_i\) are symmetrically distributed around the true positional offset due to the approximate symmetry of the microphone array and observation plane around the origin. Then, the true positional offset corresponds to the deviation associated with the greatest probability. A simple method to determine the joint PDF of jointly distributed random variables based on a finite set of samples is the kernel density estimation [79], denoted by

$$\begin{aligned} \hat{f}_{R_x, R_y}(\Delta x_{i})=\frac{1}{N} \sum _{n=1}^N K_h (\Delta x_{i}-\Delta x_{i}^{(n)}), \end{aligned}$$
(17)

where N refers to the sample size and \(K_h\) is the so-called kernel. A bivariate Gaussian kernel with bandwidth h was used, where h was chosen according to the Silverman’s rule of thumb [80].

3.3.3 Offset correction

The correction procedure’s first step was determining the distance \(\Delta z\) between the loudspeaker and the microphone array plane for the experiments \(\{A1, D1\}\) and \(\{A2\}\). The joint PDF was estimated individually for each evaluated distance \(\Delta z\). Note that source cases from experiment R2 were excluded from the statistical evaluation since the ground plate reflections would introduce an additional disruptive factor in the positional estimation. It is assumed that the true distance minimizes the variance among any direction associated with \(\hat{f}_{R_x, R_y}(\Delta x_{i})\), i.e. the spectral norm of the covariance matrix \(\Sigma _{\Delta x_{i}}(\Delta z)\) is minimized, such that

$$\begin{aligned} \underset{\Delta z}{\arg \,\min }\ \Vert {\Sigma _{\Delta x_{i}}(\Delta z)}\Vert _2. \end{aligned}$$
(18)

Figure 6 shows the joint PDF with the smallest spectral norm for the experiments \(\{A1, D1\}\) and \(\{A2\}\). Based on the joint PDF corresponding to the optimal distance correction \(\Delta z\), the true positional offset in vertical and horizontal direction is determined from the maximum of the corresponding marginal distributions depicted in Fig. 7. Table 5 shows the positional offset correction values for each of the experiments.

Fig. 6
figure 6

Estimated joint PDF of the positional deviations between the beamforming results and the assumed source positions. The inner black circle corresponds to the outer rim of the loudspeaker and the outer black circle indicates the outer rim of the enclosure box (left: Experiments {A1, D1}, right: Experiment A2)

Fig. 7
figure 7

Marginal distribution functions characterizing the positional offset between the microphone array and the observation plane (left: Experiments {A1, D1}, right: Experiment A2). The dashed line indicates the positional offset corresponding to the maximum of the corresponding PDF. The dotted lines indicate the 2.5% and 97.5% percentiles

Table 5 Positional correction values for each experiment

With the correction offset applied, one can conclude that the positional uncertainties regarding the true source positions are in the order of a few millimetres. Given the 2.5 and 97.5 percentiles of the marginal distributions, the positional uncertainty is in the range of \([-3.6\,\text {mm}, 3.4\,\text {mm}]\) in x-direction and \([-2.1\,\text {mm}, 3.5\,\text {mm}]\) in y-direction for the experiments \(\{A1, D1\}\). Regarding the experiments \(\{A2,R2\}\), the positional uncertainty is in the range of \([-4.9\,\text {mm}, 1.4\,\text {mm}]\) in x-direction and \([-2.6\,\text {mm}, 3.7\,\text {mm}]\) in y-direction.

4 Applications

Experimental measurement of acoustical systems is rarely an end in itself. Usually, one utilizes the measurements to infer system properties or data-driven models of the system at hand. Given its substantial size and precision, the MIRACLE dataset can be applied to a variety of research applications. In the following, two timely and crucial acoustic research areas are addressed: firstly, source localization and characterization, and secondly, data-driven reduced order modelling.

4.1 Acoustic source localization and characterization

In recent years, there has been extensive investigation into data-driven methods, particularly deep learning models, to solve source localization and characterization problems [11]. Localization focuses on determining the positions of one or multiple sound sources, whereas source characterization entails quantifying additional characteristics, such as their individual strength at a specific spatial reference position. Due to the scarcity of suitable experimental data, it is common to leverage synthetically generated microphone array data, with only few studies considering real-world data already during training [25]. It remains an open research question how strongly the performance of data-driven methods is affected by this oversimplification and how this performance degradation could be mitigated.

4.1.1 Microphone array data generation

One way to overcome the lack of experimental data is to generate microphone array data via a convolution of experimental RIRs with arbitrary source signals in a Monte Carlo simulation. In order to facilitate reproducibility in machine learning, the training data has to be published alongside the source code. Providing the immutable raw microphone array data of the Monte Carlo simulation would lead to infeasibly large storage requirements. As a lightweight alternative, a code environment of the data simulation process can be published. This enables the generation of microphone array data on demand, which not only alleviates the problem of storage but can also greatly accelerate the training process. This approach is taken by the AcouPipe framework [81], a previous work of the authors.

AcouPipe is an open-source project written in Python programming language that incorporates a default dataset generation method that constructs source cases with an arbitrary number of sound sources. As of AcouPipe v24.04 [82], the AcouPipe framework includes an additional experimental dataset generation method that utilizes the MIRACLE dataset per default. Both the synthetic and the experimental data generation method rely on similar measurement setups. A Monte Carlo simulation is used to randomly generate multi-source scenarios by superposition. By default, the number of sources and their respective positions are drawn from a Poisson distribution \(I \sim \mathcal {P}(\lambda =3)\) and a bivariate normal distribution \((x_i, y_i) \sim \mathcal {N}(\mu =0, \sigma =0.1688\,d_{\textrm{a}})\) for each individual source case. Artificial white noise is used as the source signal with a signal length of five s. The signal amplitudes are chosen according to the desired squared RMS sound pressure at the reference position. The latter is drawn from a Rayleigh distribution, such that \(p_{\textrm{RMS},i}^2 \sim \mathcal {R}(\sigma _{R}=5)\). Finally, the sound pressure signals at the microphones are obtained by superposition and convolution of the respective RIRs with the source signals

$$\begin{aligned} p_j(t) = \sum _{i=1}^{I} h_{i,j}(t) * q_i(t) + n_j(t). \end{aligned}$$
(19)

Uncorrelated white noise \(n_j(t)\) is added to each microphone channel. A more detailed description of the underlying dataset generation procedure can be found in [81].

figure a

Listing 1 Python source code snippet demonstrating multi-source scenario generation based on the MIRACLE dataset

To demonstrate the ease of use, Listing 1 exhibits a Python source code example that generates multi-source scenarios based on the MIRACLE dataset. Here, a Dataset instance is constructed, utilizing the RIRs from scenario A1 for the generation of 1000 validation samples. With AcouPipe, it is not only possible to obtain the raw time data but also to extract several features. The features include a source mapping of the sources and their locations. By default, conventional beamforming is used to calculate the source mapping. Figure 8 shows the beamforming map related to the seventh sample of the validation dataset.

Fig. 8
figure 8

Source mapping at \(f=2.5\,\text {kHz}\) for the seventh sample of the experimental validation dataset. Grey circles indicate the microphone position and black crosses indicate the source positions

Theoretically, very large numbers of experimental scenarios can be created by superimposing the individual measurements of the MIRACLE dataset. For example, when considering unordered sampling without replacement

$$\begin{aligned} \left( {\begin{array}{c}n_i\\ I\end{array}}\right) = \frac{n_i!}{I!(n_i-I)!}, \end{aligned}$$
(20)

more than eight million different source cases can be constructed when considering scenario A1 and two sources. The number of possible combinations increases significantly when sources of different strengths or even more than two sources are combined.

We believe that both the synthetic and the experimental data provide an excellent basis to address and answer up-to-date research questions related to data-driven source localization and characterization tasks, such as: How strong are synthetically trained source characterization models influenced by the domain shift when evaluated with data acquired in a real environment? Which domain generalization and adaptation methods are suitable to overcome the performance degradation due to the domain shift? How do existing source localization and characterization methods generalize to real-world scenarios?

4.2 Reduced order modelling

System measurements are usually performed with a certain goal in mind, e.g. inferring system properties from the data, such as the reverberation time. In many cases, one would like to obtain a model for the dynamical system transmission, i.e. the mapping from an input signal to an output signal. A common and apparent approach to obtaining a model for the system dynamics from RIR measurements is to employ them as a convolution kernel \(h\in \ell _2^{n_{\textrm{o}}\times n_{\textrm{i}}}\) via the convolution sum

$$\begin{aligned} y(t)=\sum _{k\in \mathbb {N}}h(k-t)u(t), \end{aligned}$$
(21)

where \(u\in \ell _2^{n_{\textrm{i}}}\) and \(y\in \ell _2^{n_{\textrm{o}}}\) are the input and output signal, respectively, and \(\ell _2\) denotes the Hilbert space of square-summable sequences. This approach was also pursued in the previous section. However, it can be unfavourable with regard to computational effort and storage requirements, especially for larger systems with many inputs and/or outputs, such as densely sampled RIRs, because the computational effort scales with the product of the number of inputs and outputs [83]. Furthermore, systems whose dynamics are governed by a low number of weakly damped modes, e.g. small room acoustics, possess long impulse responses but are otherwise described by a simple model. In a convolution-based approach, there are no effective means to eliminate redundancies in the data, making the dynamical transmission of aforementioned systems unnecessarily expensive to compute.

As argued in [83, 84], state-space models can be computationally advantageous, especially in real-time scenarios, and provide access to powerful system-theoretic MOR methods [56]. MOR is a very active research field that aims to find small models that approximate the dynamics of a large and potentially infeasible so-called full order model as closely as possible. Instead of firstly identifying a large full order model from data and then applying MOR methods, a reduced order model can also be inferred directly from the data. This is oftentimes referred to as data-driven model order reduction or reduced order modelling.

In the following, discrete-time state-space models are briefly introduced, and the applicability of the Eigensystem Realization Algorithm  (ERA) [85,86,87] to the MIRACLE dataset is demonstrated.

4.2.1 State-space models

A discrete-time linear time-invariant dynamical system with \(m\in \mathbb {N}\) inputs, \(p\in \mathbb {N}\) outputs can be described by the matrix quadruple (ABCD) with \(A\in \mathbb {R}^{n \times n}\), \(B\in \mathbb {R}^{n \times m}\), \(C\in \mathbb {R}^{p \times m}\), and \(D\in \mathbb {R}^{p \times m}\), where \(n\in \mathbb {N}\) is the order of the system. Its dynamics are governed by the discrete-time state equations

$$\begin{aligned} x(t+1)&=Ax(t)+Bu(t)\nonumber \\ y(t)&=Cx(t)+Du(t), \end{aligned}$$
(22)

for input \(u\in \ell _{2}^{m}\), output \(y\in \ell _{2}^{p}\), and state \(x\in \ell _{2}^{n}\). In our concrete case, u denotes the \(n_{\textrm{i}}\)-dimensional input signal at the source locations, and y denotes the \(n_{\textrm{o}}\)-dimensional output signal at the array microphones. Being an LTI system, the input-output behaviour of the state-space system is fully described by the convolution sum (21), where the impulse response \(h\in \ell _{2}^{p\times m}\) is given by \(h(t)=CA^{t-1}B\) in discrete-time. This can be easily verified from (22) by applying an impulse input \(u(t)=\delta _{t0}\) with zero initial condition \(x(0)=0\). A frequency domain description can be obtained by applying the z-transform to (22). The z-transform of output \(Y(z)\in \mathbb {C}^{p}\) is given by a multiplication of the z-transform of the input \(U(z)\in \mathbb {C}^{m}\) with the transfer function

$$\begin{aligned} H(z)=C(zI-A)^{-1}B\in \mathbb {C}^{p \times m}. \end{aligned}$$
(23)

4.2.2 Eigensystem Realization Algorithm

The transfer function in (23) is a matrix-valued rational function that can be approximated by a Padé approximation [88]. The moments of the transfer function for expansion at infinity, i.e. \(h_i=\frac{\textrm{d}^i}{\textrm{d}z^i}H(z)\big |_{z=\infty }\), are referred to as the Markov parameters of the system. Intriguingly, the impulse response is equivalent to the sequence of Markov parameters in discrete time, i.e. \(h(t)=h_t=CA^{t-1}B\). This connection is leveraged by ERA in the following way: Given a finite sequence \(h\in \ell _{2}^{p\times m}\) of \(2s-1\), \(s\in \mathbb {N}\) Markov parameters, a matching state-space model can be identified via the Hankel matrix of Markov parameters

(24)

For discrete-time systems, the Markov parameters can be conveniently obtained by measurements of the system impulse response. The Hankel matrix can be factored into the observability and controllability matrix \(\mathcal {O}\in \mathbb {R}^{ps \times n}\) and \(\mathcal {C}\in \mathbb {R}^{n \times ms}\):

$$\begin{aligned} \mathcal {H} = \underbrace{\left[ \begin{array}{c} C\\ \vdots \\ CA^{s-1} \end{array}\right] }_{{=:} \mathcal {O}} \underbrace{\left[ B\quad \cdots \quad A^{s-1}B \right] }_{{=:} \mathcal {C}}. \end{aligned}$$
(25)

From this factorization, a realization can be constructed via

$$\begin{aligned} A=\mathcal {O}_{f}^{\dagger }\mathcal {O}_{l}, \qquad \qquad B=\mathcal {C} \left[ \begin{array}{c} I_{m}\\ 0 \end{array}\right] , \qquad \qquad C = [I_p\quad 0]\mathcal {O}, \end{aligned}$$
(26)

where \(\mathcal {O}_{f}\) and \(\mathcal {O}_{l}\) denote the first and last \(p(s-1)\) rows of the observability matrix \(\mathcal {O}\in \mathbb {R}^{ps \times n}\), respectively [86, 87]. Classically, the factorization is computed from a (truncated) singular value decomposition (SVD), i.e. for a truncation order \(r\le \min \{ms,\,ps\}\)

$$\begin{aligned} \mathcal {H} = [U_{r}\quad *] \left[ \begin{array}{cc} \Sigma _{r}&{}0\\ 0&{}*\end{array}\right] \left[ \begin{array}{c} V_{r}^{T}\\ *\end{array}\right] , \end{aligned}$$
(27)

where \(U_{r}\in \mathbb {R}^{ps \times r}\) comprises the first r left singular vectors as columns, \(\Sigma _{r}=\text {diag} \left( {\sigma _{1},\,\dots ,\,\sigma _{r}}\right)\) the first r singular values with \(\sigma _1\le \dots \le \sigma _r\) and \(V_{r}\in \mathbb {R}^{ms \times r}\) the first r right singular vectors as columns. By choosing \(\mathcal {O}=U_{r}\Sigma _{r}^{1/2}\quad \text {and}\quad \mathcal {C} = \Sigma _{r}^{1/2}V_{r}^{T}\), a reduced order realization \((A_r,B_r,C_r)\) can be constructed via (26). For a basic stability and error analysis, the reader is referred to the classical sources [85,86,87]. A recent and thorough error analysis can be found in [89].

Using above theory, ERA is applied to scenario D1 of the MIRACLE dataset with \(m=1089\) inputs, \(p=64\) outputs. In order to improve the model quality, as argued in [83], the common dead time in the measurements is removed conservatively by truncating the first 68 samples. Additionally, to avoid unnecessary computations, only the first \(20\,\text {ms}\) of the impulse response are considered. Thus, after truncation, a total of \(2s-2=572\) samples are taken as input data for ERA. It is immediately obvious from (24) that the resulting Hankel matrix of dimension \(18{,}240\times 310{,}365\) can lead to computational problems with the outlined algorithm. One would already need \(45.3\,\)GB to construct this matrix explicitly, and computing the SVD of such a large matrix is infeasible. There are several strategies [90,91,92] that can alleviate this computational burden in classical ERA by exploiting low-rank structures in the data and/or the Hankel matrix. A randomized SVD algorithm [93] was employed here as an approximate orthogonal decomposition instead of (27) by sampling the range of \(\mathcal {H}\) with 5000 normally distributed random vectors and a single subspace iteration. For details of this randomized ERA variant, see [83, 92].

It is important to emphasize that multiple-input-multiple-output reduced order models are identified for all input-output transmissions simultaneously. In order to facilitate a qualitative analysis of the model quality, Fig. 9 depicts the magnitude responses of the measurement data alongside the identified reduced order models for the single-channel transmission from the centremost source to the centremost microphone. It can be seen that the accuracy increases with increasing model order and decreases with frequency. Furthermore, we have noticed that the single-channel approximation accuracy tends to deteriorate for source or receiver locations with greater distances to the source and receiver planes, respectively. Therefore, an appropriate quantitative error criterion must encompass all transmission channels simultaneously. To this end, denote

$$\begin{aligned} h_{\textrm{avg}}(t)=\left( \frac{1}{n_{\textrm{i}}n_{\textrm{o}}}\sum _{i=1}^{n_{\textrm{i}}}\sum _{j=1}^{n_{\textrm{o}}}|h_{i,j}(t)|^2\right) ^{1/2}=\frac{\left\| h(t) \right\| _{F}}{\sqrt{n_{\textrm{i}}n_{\textrm{o}}}} \end{aligned}$$
(28)

and

$$\begin{aligned} \varepsilon _r(t)=\frac{\left\| h(t)-C_rA_r^{t-1}B_r\right\| _{F}}{\sqrt{n_{\textrm{i}}n_{\textrm{o}}}}, \end{aligned}$$
(29)

as the RMS-averaged system and error impulse response, respectively, where \(h(t)\in \ell _{2}^{64\times 1{,}089}\) denotes the measured multichannel impulse response of the D1 scenario and \(A_r,B_r,C_r\) are the reduced order system matrices obtained by ERA for model orders \(r=1000,\,2000,\,3000\). The averaged responses are shown in Fig. 10 and it can be observed that the averaged error \(\varepsilon _r\) is still substantial around \(3\,\text {ms}\) for \(r=1000\), which is due to the aforementioned deterioration of approximation quality towards the edges of the source and receiver planes. For the higher orders \(r=2000,\,3000\), this deviation is much less pronounced.

Fig. 9
figure 9

Frequency response of the measured and modelled transfer functions (solid lines) and the error \(\varepsilon _{r}\) (dash-dotted lines) of different model orders for scenario D1 at the centremost location in the source and receiver plane

Fig. 10
figure 10

RMS-averaged impulse response of the data and error systems for different model orders

The potential benefit for model compression and real-time application is showcased in Table 6 in which the required memory and computational cost in Mega Floating Point Operations (MFLOPs) is listed alongside the relative approximation error

$$\begin{aligned} \hat{\varepsilon }_r=10\log _{10}\left( {\frac{\sum _{t=1}^{n_{\textrm{d}}}\left\| h(t)-C_rA_r^{t-1}B_r\right\| _{F}^2}{\sum _{t=1}^{n_{\textrm{d}}}\left\| h(t)\right\| _{F}^{2}}}\right) . \end{aligned}$$
(30)

In order to compare the cost of solving (22) with the state-space approach over the convolution sum in (21), the memory demand and computational cost of the Uniformly Partitioned Overlap-save (UPOLS) method [94] are also listedFootnote 2. The UPOLS method and the variants described in [94] are well-established and solve (21) via spectral multiplication of overlapping blocks of signals. It can be seen that the reduced order models are more memory efficient than the UPOLS method by factors between \(5-30\). A computational speedup for real-time simulation over the UPOLS method by factors between roughly \(1.3-3.9\) can be achieved for all models if the models are transformed into so-called modal or (quasi-)diagonal form which can be achieved via an eigenvalue decomposition of \(A_r\). Details on the transformation can be found in [83].

Table 6 Memory demand in MB, computational cost in MFLOPs, and approximation error \(\hat{\varepsilon }_r\) according to (30) for the UPOLS method with block-size \(B=64\) and \(P=16\) filter partitions and the reduced order models in different state-space representations. \(\square\): standard form, \(\diagdown\): (quasi-) diagonal form

5 Conclusion

A publicly available large-scale RIR dataset of 856,128 single-channel impulse responses has been presented, which is particularly suited for applications in the field of microphone array signal processing and sound field reconstruction because of the dense spatial sampling of the receiver and source spaces. To the authors’ knowledge, the MIRACLE dataset is the largest openly available RIR dataset yet.

The measured data have been shown to be spatially accurate and therefore provide an excellent basis for the development and statistical evaluation of data-driven modelling methods in acoustical engineering. The relevance, versatility, and applicability of the MIRACLE dataset have been demonstrated by means of two timely and diverging application examples.

In contrast to existing datasets, the MIRACLE dataset exhibits a notable limitation in environmental diversity, primarily constrained to free-field sound propagation scenarios with and without specular reflection. While the characteristics of the source and receiver predominantly shape the impulse response representation, enhancing environmental variability may be feasible through a semi-synthetic approach. This involves synthesizing corresponding RIRs and integrating them with the measured RIRs. To enrich the dataset, it is envisaged that the dataset will be extended in the near future by similar measurements of reverberant rooms. A persistent challenge lies in calibrating the source positions with comparable precision, mainly due to the diminished reliability of beamforming in reverberant environments. Addressing this challenge and broadening the dataset will be the focus of a forthcoming publication.