1 Introduction

Microphone arrays are networks of sensors for the measurement of sound pressure. Processing of the collected channel data enables the separation of individual source signals and the determination of source locations relative to the array [55]. The development of powerful and reliable methods is still an active field of research for a variety of applications including robotics [43], acoustic measurement technology for noise reduction [31], speech recognition [17], autonomous driving [50] and machine fault detection [5]. Microphone setup, environment, and underlying assumptions about the sound field significantly differ between applications. For example, a typical application setup in speech recognition involves small spherical, circular or binaural microphone arrays operating in echoic environments. The sound events of interest can be regarded as temporally non-stationary. In contrast, acoustic measurement technology for noise reduction usually aims to accurately determine the noise spectra of specific source mechanisms, requiring customized array apertures with many microphone channels and nearly anechoic conditions. Quasi-stationary flow and machinery noise are typical source signals that originate from an a-priori known area of interest.

Parallel to microphone array technology, deep learning has advanced rapidly, especially in the field of image recognition [25, 45] but also in acoustics [3]. It is therefore not surprising that many recently developed microphone array methods replace physical model-based approaches with data-driven algorithms, often in the form of deep neural network (DNN) architectures [14, 43].

The existence of sufficiently large datasets that are openly available is a prerequisite for the comparability of studies and the development of DNN models. A recent survey of sound source localization methods with deep learning revealed that most data-driven models are optimized by supervised learning, requiring accurately labeled data [14]. Ideally, training data is used that stems from real recordings and reflects actual operating conditions to avoid performance degradation due to the domain shift often encountered when training with synthetic data. However, real data is scarce, time-consuming to obtain, and often suffers from limited label accuracy or even missing labels. Given these limitations, it becomes evident that the data requirements cannot be exclusively satisfied by real data. Moreover, in acoustic measurement technology, complicated scenarios exist where the labels cannot be known or determined with the required accuracy, e.g., when measuring an airfoil in the wind tunnel. For such cases, only synthetic training data is suitable.

Aggregation of large-scale microphone array data for training is cumbersome. Existing datasets vary in size, are immutable, and can often only be used for a specific intended application due to heterogeneous microphone array setups. Software for unifying existing microphone array datasets was recently proposed [44] but can only be applied if the trained model relies on the ambisonics format. Thus, it is not suitable for applications where no standardized format exists. Besides the low applicability of available datasets, publishing the raw channel data from microphone arrays is often not possible due to the storage demand. Acoustic data is high-dimensional since it is often acquired at a high sampling rate to respect the Shannon sampling theorem. Rapid development of new measurement technology also enables the processing of a growing number of sensors [12]. As Section 2.3 will show, several studies created individual datasets that are not openly available, which hampers comparability and reproducibility. Data availability problems become particularly apparent when temporally stationary sources shall be characterized and the number of microphone channels is large.

The objective of this contribution is the development of a framework for the reproducible generation of microphone array data. The simulation of virtual measurement scenarios ensures exact labels and allows an arbitrary number of dataset samples to be generated, which can be used for training and testing purposes. The simulation parameters should be adaptable to different measurement situations. A further goal is to improve the scientific discussion, comparability and reproducibility of machine learning methods by publishing the framework.

The paper is structured as follows. In Section 2, model-based and data-driven approaches to solve source localization and characterization problems are reviewed. Section 3 introduces the concept and the framework. Section 4 demonstrates an application of the framework to simulate a dataset for source localization and characterization. Section 5 concludes the paper.

2 Source localization and characterization

Different problems can be solved with microphone array methods, including

  • Acoustic Source Localization (ASL): localization (or tracking) of sound sources

  • Acoustic Source Characterization (ASC): determination of sound source characteristics, e.g. the strength and location

  • Acoustic Source Detection (ASD): activity detection and classification of sound sources, with or without determining their position

Common models assume that the source signals are uncorrelated. The sound pressure \(p_m(t)\) at the m-th sensor of an array with M microphones can be expressed as a linear superposition of J uncorrelated source signals \(q_{j}\left( t-\Delta \tau _{m j}\right) \), such that

$$\begin{aligned} p_{m}(t)=\sum _{j=1}^{J} a_{mj} q_{j}\left( t-\Delta \tau _{m j}\right) . \end{aligned}$$
(1)

The traveling time of the sound from the j-th source to the m-th receiver is expressed by \(\Delta \tau _{m j}\). An attenuation of the source signal is denoted by \(a_{mj}\). The choice of \(\Delta \tau _{m j}\) and \(a_{mj}\) depends on whether the microphone array is placed in the near- or far-field of the radiating sources. The categorization into one of the two cases depends on the aperture of the array and the distance between source and receiver. As illustrated in Fig. 1, the wavefront originating from a source in the far-field can be described as a plane wave. The angle of incidence is identical for each microphone of the array if all sensors are aligned and in the far-field. With an identical angle of incidence, only the direction-of-arrival (DOA) but not the exact location of a source can be determined. In the near-field case, the wavefront originating from a point source arrives at the receiving microphone array with a curved shape. The angle of incidence then varies for each sensor, which allows the exact location of a source to be determined.

Fig. 1

Illustration of delay-and-sum beamforming in the near- and far-field. The beamformer output \(b_j\) for the j-th location at time t is obtained from a summation of the delayed sound pressure \(p_m\) that is weighted by the steering coefficient \(h_{mj}\). The time delay between the j-th location and the m-th sensor is given by \(\Delta t_{mj}\)

2.1 Physical model-based source localization and characterization

ASL and ASC problems have traditionally been solved with model-based techniques. Beamforming is one of the simplest methods and can be performed in both the time domain and the frequency domain [55].

2.1.1 Delay-and-sum beamforming

As shown in Fig. 1, beamforming solves the inverse formulation of (1) by delaying, weighting and summing the sensor signals according to the j-th source location \(\textbf{x}_j\), with

$$\begin{aligned} b_j(t)=\sum _{m=1}^{M} h_{mj} p_{m}\left( t+\Delta t_{m j}\right) , \end{aligned}$$
(2)

where \(h_{mj}\) is a steering coefficient that describes the amplitude relation between the focused source and the microphone position. The time delays from the j-th source to the m-th microphone are expressed by \(\Delta t_{mj}=\frac{r_{mj}}{c}\), where \(r_{mj}\) is the spatial distance and c the speed of sound. The resulting beamformer output \(b_j(t)\) is a spatially filtered representation of the source signal. The squared and time averaged beamformer output \(b_{j,\text {Sq}}\), also known as the steered response power, can be obtained with

$$\begin{aligned} b_{j,\text {Sq}} = \frac{1}{T} \int _{0}^{T} \Big ( \sum _{m=1}^{M} h_{mj} p_{m}\left( t+\Delta t_{m j}\right) \Big )^{2}\, dt, \end{aligned}$$
(3)

where T is the integration time. Equivalently, (3) can be formulated in the frequency domain, where the signals are represented by their Fourier transforms. The Fourier coefficients of the M sensors are organized in \({\textbf{p}}:=[p_1(\omega ), p_2(\omega ), \cdots , p_M(\omega )]^{\top } \in \mathbb {C}^{M}\). Then, an estimate \({\hat{\textbf{C}}} \in \mathbb {C}^{M \times M}\) of the true cross-spectral matrix (CSM) \(\textbf{C}\) can be calculated by multiplying the vector \({\textbf{p}}\) with its Hermitian transpose and averaging across K different snapshots, such that

$$\begin{aligned} {\hat{\textbf{C}}} = \frac{1}{K} \sum _{k=1}^{K} {\textbf{p}}_k{\textbf{p}}_k^{\text {H}}. \end{aligned}$$
(4)

Phase and amplitude values are subsequently weighted by applying the complex-valued steering vector \(\textbf{h}:= [h_{1j}(\omega ), h_{2j}(\omega ), \cdots , h_{Mj}(\omega )]^{\top } \in \mathbb {C}^{M}\). The squared beamformer output in the frequency domain becomes

$$\begin{aligned} b_{j,\text {Sq}}(\omega )=\textbf{h}^{\textrm{H}} {\hat{\textbf{C}}}\textbf{h}. \end{aligned}$$
(5)

Usually, a domain enclosing the sources is discretized by a search grid. The desired characteristics are then calculated with respect to each point of the grid. Maxima in the acoustic source mapping are associated with the source locations. Alternatively, adaptive methods can be applied that search for the maximum beamformer output without the need for a search grid.
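For illustration, the estimation of the CSM according to (4) and the evaluation of (5) over a set of focus points can be sketched in a few lines of NumPy. The sketch assumes free-field conditions and a simple delay-only steering vector normalized by the number of microphones; the function and variable names are chosen for this example only and do not refer to a specific implementation.

```python
import numpy as np

def csm_estimate(snapshots):
    """Cross-spectral matrix estimate, cf. (4).
    snapshots: complex array of shape (K, M), one Fourier snapshot per row."""
    K = snapshots.shape[0]
    return snapshots.T @ snapshots.conj() / K

def beamformer_map(csm, mic_pos, grid_pos, omega, c=343.0):
    """Squared beamformer output, cf. (5), evaluated for every grid point.
    mic_pos: (M, 3) microphone coordinates, grid_pos: (G, 3) focus points."""
    out = np.empty(len(grid_pos))
    for j, x in enumerate(grid_pos):
        r = np.linalg.norm(mic_pos - x, axis=1)          # distances r_mj
        h = np.exp(-1j * omega * r / c) / len(mic_pos)   # simple steering vector
        out[j] = np.real(h.conj() @ csm @ h)             # h^H C h
    return out
```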

2.2 Datasets for source localization and characterization

During the past years, multiple datasets have been published to support the development of microphone array methods, summarized in Table 1.

Table 1 Existing microphone array datasets suitable for machine learning tasks

For speech applications, datasets exist that include a modest number of measurements [24, 28]. For example, the LOCATA corpus is a measured dataset that belongs to the IEEE AASP challenge on acoustic source LOCalization And TrAcking (LOCATA) [28]. It contains static and moving speech sources in form of radiating loudspeakers or human talkers, captured with four different microphone arrays. Other than that, datasets designed for data-driven speech localization can be found that include audio [51] and audio-visual recordings [9, 18, 41].

In contrast to purely speech-related datasets, datasets exist that are dedicated to the joint problem of sound event localization and detection (SELD). Of particular importance is the TAU-NIGENS spatial sound event dataset that belongs to task three of the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge (2019-2021) [2, 37, 38]. The dataset size, the complexity of the scenes, and the number of possibly overlapping sources were extended over the years. The datasets provide scenes of one-minute length captured with two different microphone array configurations. The semi-synthetic array data were created by convolving anechoic sound events with measured spatial room impulse responses. The latest dataset of the DCASE challenge is the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset [39], which exclusively consists of real scenes.

Acoustic events in office rooms are also provided by the SECL-UMons dataset [4]. In contrast to the L3DAS21 and L3DAS22 datasets, real scenes were captured by a seven-channel circular microphone array. Despite the enhanced degree of realism, the spatial variety related to the different events is small since only 30 positions per event class are included. The mentioned datasets mainly address applications with arrays capable of providing data in ambisonics format. In contrast to these applications, the Wearable SELD dataset [33] targets situations where the microphone array is a wearable device, such as earphones. The virtual scenes were created by convolving 12 anechoic sound event classes with recorded impulse responses from 108 locations.

So far, the presented publications are explicitly dedicated to solving ASL or SELD tasks. A lack of accessible development data exists for ASC tasks, which require additional labels for the source strength; here, the authors are not aware of any available datasets. The TAU-NIGENS Spatial Sound Events dataset and the LOCATA corpus are designed for solving ASL problems. Besides the source locations, no information about the strength of a source is given. In [19], a synthetic dataset was designed to compare the performance of model-based microphone array methods. All parameters that are necessary to closely reproduce the array data were documented. The drawback is that substantial work is required to reproduce the data, with no guarantee of arriving at identical data.

Although the openly available datasets provide fairly realistic scenes, their applicability is limited. Most importantly, the datasets do not reflect the variety of available array hardware. For example, most datasets rely on the first-order ambisonics (FOA) format, and all the publications considered microphone arrays with a small number of channels and a small extent compared to the source distance. This is in contrast to deep learning publications using non-standardized microphone arrangements with tens of microphones [7, 21, 27, 29, 36]. Moreover, the datasets only cover a small range of possible applications, namely the localization of speech and real-life sound events. Accordingly, all publications consider a small number of sources emitting non-stationary signals. None of the datasets provides information about the source strength, which prevents their use for the development of source characterization methods. In addition, the variety of source positions is limited. For example, in the semi-synthetic datasets, the number of possible source locations is bounded by the number of measured room impulse responses, and the available amount of real data is small. Finally, the datasets are immutable, meaning that one cannot adjust underlying properties, such as the type of source signal.

2.3 Data-driven source localization and characterization

Table 2 provides a rough overview of some recently published contributions that solve the ASL or ASC problem by supervised learning of DNNs. A comprehensive review of deep learning models for SELD, ASL, and ASC revealed that most of the studied methods relied on supervised learning [14]. In supervised learning, a DNN model learns a functional mapping from a variety of sample pairs. A sample pair consists of input features and corresponding labels. Less often, semi-supervised and weakly supervised training strategies are used, which arise from the low availability of labeled data [14].

Table 2 Examples of contributions using DNNs for solving ASC or ASL problems

In sound event localization and detection, the majority of deep learning models are based on the presented datasets from Section 2.2. A large number of methods were presented in the context of the DCASE challenges held in 2019, 2020, and 2021. These methods were exclusively trained, validated, and tested with the existing data from the corresponding TAU-NIGENS Spatial Sound Events dataset since external data was prohibited by the challenge guidelines [2, 37, 38]. Frequently, the researchers employed augmentation strategies to cope with data sparsity [30, 52, 61]. One popular approach is to create further source cases from existing data by the rotational transformation of the coordinate system spawned by the microphone array [30]. Further strategies concern the manipulation of the sensor signals, e.g., by masking or equalizing the spectro-temporal information [34, 54], and the addition of background noise [20]. In contrast to the previous DCASE challenges, the use of external data is allowed for the DCASE 2022 challenge due to the limited amount provided by the STARSS22 dataset [39]. The first-ranked contribution [57] generated additional semi-synthetic training data similar to [37, 38], whereby 184 hours were used for training and 300 hours for fine-tuning.

In ASL for speech applications, data usage is more diverse than in the SELD field. Although contributions can be found that rely on the existing datasets [42] for model training, the use of individualized development data is much more common. Individualized training data includes synthetic [53] and semi-synthetic datasets [11, 40, 56]. A common practice is to utilize widely adopted speech recordings, e.g. from the TIMIT speech database [13], and to convolve these signals with measured or simulated spatial room impulse responses. For experimental model evaluation, some contributions utilize the existing datasets, such as the LOCATA corpus [11, 53], the AV16.3 dataset [56], the SLR2019 dataset [8], or rely on specific recorded data [40].

So far, research on ASC with deep learning has focused on temporally stationary noise sources recorded by microphone arrays with a large number of channels. These conditions significantly increase the data generation effort and hamper data publication, since multiple seconds of array data are associated with a single label instance. In fact, related contributions targeting ASC exclusively rely on unpublished datasets [6, 7, 26, 27, 29, 60]. Compact features representing time-averaged signals were used in the form of the CSM [6, 7, 29, 60] and the source map obtained via beamforming [21, 26, 27, 36]. To keep the computational effort and storage demands low, analytical calculation of the CSM was used in [6, 7, 29, 60]. However, a shortcoming of this approach is that uncertainties arising from a limited number of averaged snapshots are not incorporated. A more realistic sampling approach was performed in [21, 26, 36], where the CSM and the source map were obtained on the basis of simulated array time data under free-field conditions. A considerable drawback of this approach, however, is that the computational effort complicates the generation of large-scale datasets. For example, 5000 source cases were considered in [36] using time data sampling, whereas analytic sampling enabled the use of one million source cases [7] or even a quasi-infinite dataset [60].

3 Framework concept and implementation

The findings presented in Sections 2.2 and 2.3 reveal that the generation and usage of microphone array data remains a crucial issue. Studies that can rely on existing data usually have to employ augmentation strategies to achieve considerable performance. In contrast, publications using individual data are faced with the computational burden and the non-comparability of their results. It is evident and often reported that the size of the dataset has a strong impact on the accuracy and generalization capabilities of machine learning models [9, 29, 36]. However, available frameworks for the simulation of acoustic data mainly focus on the generation of realistic scenes [46] or on improved performance on side-tasks (e.g. room impulse response generation [10]). None of these frameworks focuses on the scalability of the whole data generation process. In this section, we propose a framework that helps to resolve this issue.

3.1 Concept

In this section, a new approach to create and share large-scale microphone array data for machine learning is presented. Instead of providing a physical file containing an immutable dataset, the source code that runs the data simulation process is shared inside a containerized environment. The source code then runs on the user’s system and generates a synthetic dataset using the locally available computational resources. Instead of saving the raw channel data, only the features, labels and meta-data that are needed for machine learning are written to file.

This approach offers several advantages over retrieving the raw channel data stored in physical files from a database. Firstly, full access to the code that creates the data provides a high level of transparency and allows introspection. Secondly, the dataset can be used more flexibly. Features that are not included in the dataset can be subsequently added by modifying the source code without changing the underlying statistical properties. Thirdly, very large datasets can be easily shared since the source code that defines the simulation process requires very little storage. Finally, it is possible to create datasets of quasi-infinite size by simply increasing the number of simulated cases. The concept is based on the following important points:

  1. Sophisticated simulation methods should be involved to create realistic acoustic data.

  2. Data generation for different acoustic scenarios should be possible (e.g. different measurement environments).

  3. The simulation process must be scalable to ensure that large datasets can be created in an acceptable amount of time.

  4. The simulation process must be fully reproducible to ensure that every user can create exactly the same data.

  5. The framework should be written in a widely used and easy-to-use programming language in order to be attractive to a wide range of users.

3.1.1 Acoustic data simulation

Various open source libraries for acoustic data processing exist. Of those libraries, the Acoular framework is especially suitable, since it is explicitly designed for microphone array applications [49]. Besides numerous state-of-the-art algorithms for acoustic source mapping, Acoular implements tools to simulate synthetic microphone array data.

Acoular is written in an object-oriented style. The library includes different classes to simulate stationary and non-stationary signals. A signal object can be used to feed any source type, including monopole, dipole or line sources. Sources with arbitrary radiation patterns can be generated by the use of spherical harmonics. A trajectory class allows the creation of dynamic source scenarios. Acoular provides further classes for block-wise processing of the channel data. These classes can be used to simulate environmental conditions, e.g. by convolution with an impulse response. Time data can be saved to common file formats, including the waveform audio file format (WAV) or the hierarchical data format version 5 (HDF5).
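As an illustration of these building blocks, the following sketch, loosely based on Acoular's introductory examples, simulates three uncorrelated monopole sources emitting white noise and writes the mixed channel data to an HDF5 file. The geometry file name, source positions and signal parameters are placeholders, and trait names may differ slightly between Acoular versions.

```python
import acoular

fs = 13720                                        # sampling frequency in Hz
mg = acoular.MicGeom(from_file='array_64.xml')    # placeholder geometry file

# three uncorrelated white noise signals of 5 s length
signals = [acoular.WNoiseGenerator(sample_freq=fs, numsamples=5 * fs, seed=i + 1)
           for i in range(3)]

# monopole sources at fixed (placeholder) positions in the observation plane
locs = [(-0.1, -0.1, 0.5), (0.15, 0.0, 0.5), (0.0, 0.1, 0.5)]
sources = [acoular.PointSource(signal=sig, mics=mg, loc=loc)
           for sig, loc in zip(signals, locs)]

# superposition of all sources and block-wise export to HDF5
mix = acoular.Mixer(source=sources[0], sources=sources[1:])
acoular.WriteH5(source=mix, name='three_sources.h5').save()
```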

Many of Acoular’s software design characteristics are advantageous for the simulation of large datasets in the context of deep learning:

  1. Acoular is written in the widely used Python programming language, which is compatible with the APIs of modern machine learning frameworks (e.g. TensorFlow or PyTorch [1, 35]).

  2. The use of Numba [23] speeds up costly algorithms. JIT-compiled low-level routines help to achieve the speed of statically typed languages.

  3. Acoular follows a pipeline-based processing concept. Computational pipelines are built from various abstract objects representing the processing blocks. A lazy evaluation paradigm ensures that calculations are only triggered when needed. Computations of intermediate results are avoided until the final result is retrieved.

  4. A caching mechanism allows computations to be rerun with minimal effort since already calculated results persist between individual runs, as illustrated in the sketch below.
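A small sketch may illustrate the lazy evaluation and caching behavior (file names are placeholders, and trait names may vary between Acoular versions): constructing the processing objects triggers no computation, the CSM is only calculated when its result is first accessed, and the result is cached on disk for later runs.

```python
import acoular

# configure where cached results are stored (optional)
acoular.config.cache_dir = './cache'

ts = acoular.TimeSamples(name='three_sources.h5')            # no data is read yet
ps = acoular.PowerSpectra(time_data=ts, block_size=128,
                          window='Hanning', overlap='50%')   # nothing computed yet

csm = ps.csm   # first access triggers the FFT/averaging and fills the cache
csm = ps.csm   # further accesses (also in later runs) re-use the cached result
```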

3.1.2 Distributed data simulation

The simulation of a virtual acoustic measurement situation can be considered an independent task. Significant time savings are possible when multiple such tasks are calculated in parallel. Parallel execution of Python code within a single process is limited by the global interpreter lock (GIL). However, various freely available third-party modules exist that are capable of bypassing the GIL and can be used to realize computations on multiple CPU threads. Only a few of these implement methods for distributing computations to multiple machines without extensive source code modifications (e.g. for use with a high-performance cluster).

A suitable package that simplifies the handling of single-node or multi-node applications in Python is the Ray cluster-computing framework [32]. Ray is written in C++ with an additional Python API. In multi-node applications, Ray builds up a client-server architecture forming a Ray cluster. A Ray cluster consists of multiple compute nodes, where one node is assigned to be the head node that executes the main program in a so-called driver process [32]. The head node also hosts the Global Control Store (GCS). The GCS keeps information about the nodes of the cluster, remote tasks and objects. The remaining worker nodes host the independent worker processes. A web user interface allows the user to maintain an overview of the running processes.
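The following minimal example shows the Ray pattern that such distributed simulations build on: a function decorated with ray.remote is executed as an independent task, possibly on another node of the cluster, and its results are collected in the main program. The simulate_case function is a placeholder and not part of Ray or AcouPipe.

```python
import ray

ray.init()  # connects to a running Ray cluster or starts a local one

@ray.remote
def simulate_case(sample_id):
    # placeholder for one virtual measurement, returns (id, features)
    return sample_id, sample_id ** 2

# schedule 100 independent tasks; Ray distributes them to free workers
futures = [simulate_case.remote(i) for i in range(100)]

# block until all results are available and fetch them
results = ray.get(futures)
```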

3.2 AcouPipe Framework

A new Python framework has been designed that embeds methods from Acoular and Ray. The AcouPipe [22] module builds on Acoular’s pipeline-based processing concept and provides additional modules, including:

  1. Sampler module containing classes for the reproducible random sampling of virtual measurement scenarios

  2. Pipeline module providing classes to calculate the features (and labels) of a corresponding source case to be included in the dataset. The computations can be carried out in parallel with the aid of the Ray API

  3. Writer module with classes that allow datasets to be written to common file formats, including the HDF5 and the TensorFlow record (TFRecord) format

  4. Loader module for reading datasets

The modules can be used to build a script-like main program for the generation of a dataset.

Figure 2 illustrates a multi-node setup with AcouPipe using a Ray cluster.

Fig. 2

Exemplary structure of a distributed dataset simulation process with AcouPipe. The main program schedules multiple remote tasks with a specific ID. Each remote task represents a virtual measurement from which the features of the dataset are obtained. The calculated data is aggregated and stored into a database by the main program. Calculations are performed asynchronously, meaning that the order of the IDs can vary

The remote tasks are scheduled by AcouPipe’s pipeline object included in the main program. Each remote task calculates a sample of the dataset that has a unique sample ID. A sample is synonymous with a feature (and label) of a dataset. The tasks are serialized and sent to the worker nodes for execution. After a task has finished, the results are retrieved and de-serialized by the main program. Each remote task may require a different time to complete its computation. Hence, the initial scheduling order might not be maintained when fetching the task results. The sample ID enables the recovery of the initial sampling order. When using the hierarchical HDF5 format to store the data, the correct sampling order is always represented in the file. A task is able to write intermediate results to a separate cache file when using Acoular objects with caching capabilities to calculate the features. The cache files can be used to read already calculated results. It is worth noting that the cache files can be moved to a different device for reuse (e.g. a single workstation with lower computational resources).

A more detailed description of the steps performed by the main program is shown in the flow chart in Fig. 3. The color of each processing step indicates which object of the AcouPipe framework is responsible for its execution. Immediately after start-up of the main program, the sample ID is increased by an object of the pipeline module that supervises data generation and aggregation, indicating that a new virtual measurement is performed. The pipeline object holds a connection to the sampler objects that are responsible for sampling the underlying parameters characterizing the new virtual measurement, such as the number of sources or the source positions. To sample the parameters, each sampler object draws values from a specified random distribution. These values are then assigned to the dependent objects involved in the virtual measurement. Thanks to the underlying lazy evaluation strategy, no calculations have been performed so far. Subsequently, a remote task is scheduled to execute the desired feature extraction functions remotely on one of the free worker nodes. To schedule the task, the pipeline object uses the tools provided by the Ray package [32]. The scheduling process requires serialization and deserialization of the feature function. On the worker node, execution of the feature function starts the virtual measurement, which is mainly performed by the Acoular objects. In the main program, the pipeline object receives a future [32] that points to the result of the task, which is not yet available. In the meantime, the mentioned steps are repeated as long as free workers exist. If no free worker exists, the pipeline object waits until a remote task has finished. As soon as Ray reports that a feature calculation is done, the pipeline object fetches the result from the corresponding worker node by utilizing the future. The retrieved data is then passed to the writer object, which writes the data to a database. New virtual measurements are performed until the latest ID matches the maximum ID. The main program is finished when all virtual measurements have been completed and the data has been stored.
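The scheduling and aggregation loop described above can be approximated with Ray's primitives as in the following sketch. This is not AcouPipe code; it merely demonstrates how results arriving out of order are written back under their correct sample ID. For brevity, all tasks are submitted at once and Ray's scheduler distributes them to free workers; compute_features and write_sample are placeholders for the feature extraction and the writer object.

```python
import ray

ray.init()

@ray.remote
def compute_features(sample_id, seed):
    # placeholder for the virtual measurement performed on a worker node
    return sample_id, {"seed": seed}

def write_sample(sample_id, features):
    # placeholder for the writer object storing one dataset sample
    print(f"stored sample {sample_id}")

max_id = 1000
pending = [compute_features.remote(i, seed=i) for i in range(max_id)]

# fetch results as soon as any task finishes (completion order may differ
# from the scheduling order); the returned sample ID restores the order
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    sample_id, features = ray.get(done[0])
    write_sample(sample_id, features)
```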

Fig. 3

Flow-chart illustrating the computational steps for a distributed simulation process. The steps are performed by objects from the AcouPipe modules (red: pipeline object, green: sampler object, yellow: writer object, light green: Acoular object)

4 A synthetic dataset for source localization and characterization

A new openly available large-scale dataset for ASC and ASL is introduced in this section. The dataset comprises a training corpus with 500,000 and a validation corpus with 10,000 simulated source cases. Following the concept explained in Section 3.1, the source code necessary to produce the dataset is openly available and shared in a containerized environment [22].

4.1 Dataset characteristics

The statistical properties of the generated training and validation datasets are closely related to the work of Herold and Sarradj [19]. Figure 4 illustrates the virtual simulation setup, which is explained in the following. A microphone array with an aperture size of \(d=1\,\text {m}\) consisting of 64 sensors is used. The geometry follows a Vogel’s spiral with the parameters \(V=5.0\) and \(H=0.5\) as described in [48]. The array focuses on a planar observation area whose horizontal and vertical extent equals the aperture size. Under real measurement conditions, the accuracy with which the transfer paths between source plane and microphones can be determined is limited. Therefore, individually deviating sensor positions are used in the simulation. The deviations follow a bivariate normal distribution with a mean of \(\mu =0\) and a standard deviation of \(\sigma =1\,\text {mm}\). An anechoic environment with a resting homogeneous fluid is assumed. The speed of sound is set to \(c=343\,\text {m}/\text {s}\). In the observation plane, a varying number of monopole sources emit uncorrelated white noise with a signal length of five seconds. The total number of sources per case follows a Poisson distribution (\(\lambda =3\)), whereas the location of each individual source follows a bivariate normal distribution (\(\mu =0\), \(\sigma =0.1688\,\text {m}\)). The maximum number of simultaneously occurring sources is limited to ten. The squared sound pressure at one meter distance (\(\text {Pa}^2,d_{\text {ref}} = 1\,\text {m}\)) of each source is drawn from a Rayleigh distribution (\(\sigma _{\text {R}}=5\)). The virtual signals are sampled at a rate of \(F_{\text {s}}=13720\,\text {Hz}\). Using the aperture size \(d=1\, \text {m}\), the frequency f can also be expressed as the non-dimensional Helmholtz number:

$$\begin{aligned} He = \frac{f\cdot d}{c}, \end{aligned}$$
(6)

where the sampling rate corresponds to a Helmholtz number of \(He=40\). The simulation parameters are summarized in Table 3.
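For illustration, the random parameters of a single source case could be drawn as in the following sketch, which uses NumPy's random generator instead of AcouPipe's sampler classes. The distribution parameters follow Table 3; how cases with zero sources are handled is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

def sample_source_case(rng):
    """Draw the parameters of one virtual measurement case (cf. Table 3)."""
    # number of sources: Poisson(lambda=3), limited to at most ten sources
    # (handling of zero-source draws, here a lower clip, is an assumption)
    n_sources = int(min(max(rng.poisson(lam=3), 1), 10))
    # source locations in the observation plane: bivariate normal distribution
    locations = rng.normal(loc=0.0, scale=0.1688, size=(n_sources, 2))
    # squared sound pressure at the 1 m reference distance: Rayleigh distributed
    p_squared = rng.rayleigh(scale=5.0, size=n_sources)
    return n_sources, locations, p_squared

n, locs, p2 = sample_source_case(rng)
```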

Fig. 4

Virtual measurement setup including a 64-channel microphone array and a planar observation area. The red dot indicates the reference microphone position. The cross marks the origin of the coordinate system. The blue dots represent sound sources, which are randomly placed

Table 3 Environmental parameters used for the synthetic data generation according to [19]
Fig. 5

Histogram of the number of cases depending on the number of simultaneously active sources in the (a) training and (b) validation dataset

Figure 5 shows two histograms of the absolute number of simultaneously occurring sources in the training and validation datasets. The most probable constellation is the presence of four sources, which is already larger than the maximum number provided by most of the existing datasets. Figure 6 shows scaled histograms of the minimum and maximum spatial distance between any two sources in the validation dataset. The minimum distance \(d_{\text {min}}\) most likely lies between \(5\,\text {cm}\) and \(10\,\text {cm}\), whereas the highest probability for the maximum distance \(d_{\text {max}}\) between two sources exists for \(45\,\text {cm} \le d_{\text {max}} < 50\,\text {cm}\).

4.2 Features and labels

The current implementation allows storing the cross-spectral matrix or an acoustic source mapping as the input feature of the dataset. It is also possible to save the raw time data, but this requires a huge amount of disk space (several TB), and training with this data is only feasible if large-scale computing resources are available. Further input features can easily be added to the generation process by extending the open source code.

The user can choose between the full CSM (e.g. used in [29]) and the non-redundant CSM (e.g. used in [6]). The latter exploits the Hermitian symmetry of the CSM by omitting the lower triangular part, which contains the complex conjugates of the upper triangular elements. The CSM calculation is carried out as stated by (4) in Section 2.1.1 and involves temporal windowing into blocks of 128 samples using a Hanning window with 50\(\%\) overlap.
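As an illustration, the reduction of a full CSM to a non-redundant representation could be implemented as in the following sketch. The exact element ordering and data layout used in the published dataset are defined by its source code; the sketch only demonstrates the principle.

```python
import numpy as np

def nonredundant_csm(csm):
    """Keep only the upper triangular part (including the main diagonal) of a
    Hermitian CSM and stack its real and imaginary parts as float values."""
    m = csm.shape[0]
    iu = np.triu_indices(m)                     # indices of the upper triangle
    upper = csm[iu]                             # M*(M+1)/2 complex entries
    return np.stack([upper.real, upper.imag]).astype(np.float32)

# example: a 64-channel CSM reduces from 64*64 complex to 2*2080 real values
csm = np.zeros((64, 64), dtype=np.complex64)
features = nonredundant_csm(csm)
```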

The acoustic source mapping is calculated according to (5) via conventional frequency domain beamforming on a \(64 \times 64\) sized rectangular grid with a resolution of \(\Delta x = \Delta y \approx 1.6\,\text {cm}\). According to [47], different beamforming steering vector formulations exist in the literature. Formulation III from [47] is used for the dataset. Table 4 summarizes the processing parameters of the features.

It is assumed that only a limited number of frequency bins is needed for developing a machine learning model. Therefore, all features can be saved individually for a single frequency bin in order to reduce workload and memory demands. In the case of the non-redundant CSM in single-precision (32-bit) floating point format, an uncompressed record of the training dataset requires only 8.3 GB of disk space.

The dataset comprises additional labels needed for supervised learning of ASL or ASC models. These include the averaged squared sound pressure values \(p_{\text {ref},j}^2\) at the reference sensor (see red dot in Fig. 4) for each of the J sources. Moreover, the source locations are provided in Cartesian coordinates.

Fig. 6

Scaled histogram of the minimum (a) and the maximum (b) spatial distance between any two sources of the validation dataset

Figure 7 shows an example of the currently implemented input features and Table 5 lists the labels for a source case taken from the validation dataset. In Fig. 7, (a) shows a source map obtained with beamforming, whereas (b) and (c) show two different representations of the cross-spectral matrix. In (b), the CSM is a Hermitian complex-valued representation, which was used, for example, in [29, 60]. In (c), a compressed version is shown that uses only the real and imaginary part of the upper triangular matrix (see [7] for details).

Table 4 Feature processing parameters used for the synthetic data generation

4.3 Simulation times

In order to analyze the benefits of the simulation with AcouPipe, the computation times were measured for 1, 2, 4, 8 and 16 parallel tasks. All simulations were performed on a workstation PC with two Intel Xeon Gold 4214R CPUs, each offering 12 cores and 24 threads. The full validation dataset, including the CSM for 10,000 source cases, was simulated with and without the use of intermediate cached results.

Figure 8a indicates the absolute computation time needed to calculate the validation dataset depending on the number of parallel tasks. Note that the computation time is given on a logarithmic scale. Figure 8a shows that the total runtime required is almost halved when the number of parallel tasks is doubled. This behavior can be observed with and without the use of cached results. Further, it can be seen that the full validation dataset can be obtained in less than 20 minutes when at least 16 CPU threads are available. If intermediate results were already written to cache, the dataset can be recalculated in 11 minutes.

Fig. 7

Example of different dataset features for \(He=5.0\) and \(J=4\). (a) shows a source map obtained with beamforming. (b) shows the Hermitian complex-valued CSM and (c) is a compressed CSM comprising the real and imaginary part of the upper triangular matrix according to [7]

Table 5 Labels of each of the four sources from the validation data set used in Fig. 7 at \(He=5.0\)

Figure 8b shows the throughput of acoustic source cases depending on the number of tasks. Almost 10 cases per second can be simulated when using 16 parallel threads, whereas only one source case per second can be simulated in a single-threaded application. With already cached CSMs, the throughput can be increased by about \(30\%\). Consequently, the complete training dataset with 500,000 cases can be computed in 15 hours without cached data and in approximately 9 hours with cached data, if at least 16 threads are available.

Fig. 8

Computation time statistics of the validation dataset depending on the number of parallel tasks on a single compute node. Calculations were either performed with or without re-using cached intermediate results

5 Conclusion

This work presented a framework for creating and sharing large-scale microphone array data for machine learning. An easy-to-use library named AcouPipe has been created that provides a flexible framework for the simulation of acoustic measurement situations. An advantage compared to other frameworks capable of generating microphone array data is that the presented framework directly addresses the scalability of the simulation process. It has been shown that large-scale datasets can be created in reasonable time with the aid of parallel computing. A further advantage is that the framework can store only the features and labels needed for model optimization, which is memory-efficient, especially when the model considers time-stationary sources and many microphone channels. Caching the calculated features allows computation pipelines to be rerun in a significantly shorter time. Finally, the framework facilitates data distribution since researchers can publish their simulation pipeline instead of the raw channel data. Seeding of the randomized processes ensures the reproducibility of scientific results. Dataset publication can be considerably simplified by sharing the simulation source code in a containerized environment.

To the best of the authors’ knowledge, this work introduces the first openly available dataset for ASC. The dataset allows researchers from the field of microphone array processing to develop or benchmark their models with fully reproducible results. Moreover, the public accessibility of the source code allows the dataset to be customized to a particular application, including the use of a specific microphone geometry or propagation environment. The object-oriented code structure also allows easy implementation of new features. We hope that the framework serves as a helpful tool for creating new datasets in different application areas in the future.