1 Introduction

Microphone arrays are networks of sensors for the measurement of sound pressure. Processing of the collected channel data enables the separation of individual source signals and the determination of source locations relative to the array [55]. The development of powerful and reliable methods is still an active field of research for a variety of applications including robotics [43], acoustic measurement technology for noise reduction [31], speech recognition [17], autonomous driving [50] and machine fault detection [5]. Microphone setup, environment, and underlying assumptions about the sound field significantly differ between applications. For example, a typical application setup in speech recognition involves small spherical, circular or binaural microphone arrays operating in echoic environments. The sound events of interest can be regarded as temporally non-stationary. In contrast, acoustic measurement technology for noise reduction usually aims to accurately determine the noise spectra of specific source mechanisms, requiring customized array apertures with many microphone channels and nearly anechoic conditions. Quasi-stationary flow and machinery noise are typical source signals that originate from an a-priori known area of interest.

Parallel to microphone array technology, deep learning has advanced rapidly, especially in the field of image recognition [25, 45] but also in acoustics [3]. It is therefore not surprising that many recently developed microphone array methods replace physical model-based approaches with data-driven algorithms, often in the form of deep neural network (DNN) architectures [14, 43].

The existence of sufficiently large datasets that are openly available is a prerequisite for the comparability of studies and the development of DNN models. A recent survey of sound source localization methods with deep learning revealed that most data-driven models are optimized by supervised learning, requiring accurately labeled data [14]. Ideally, training data is used that stems from real recordings and reflects actual operating conditions to avoid performance degradation due to the domain shift often encountered when training with synthetic data. However, real data is scarce, time-consuming to obtain, and often suffers from limited label accuracy or even missing labels. Given these limitations, it becomes evident that the data requirements cannot be exclusively satisfied by real data. Moreover, in acoustic measurement technology, complicated scenarios exist where the labels cannot be known or determined with the required accuracy, e.g., when measuring an airfoil in the wind tunnel. For such cases, only synthetic training data is suitable.

Aggregation of large-scale microphone array data for training is cumbersome. Existing datasets vary in size, are immutable, and can often only be used for a specific intended application due to heterogeneous microphone array setups. Software for unifying existing microphone array datasets was recently proposed [44] but can only be applied if the trained model relies on the ambisonics format. Thus, it is not suitable for applications where no standardized format exists. Besides the low applicability of available datasets, publishing the raw channel data from microphone arrays is often not possible due to the storage demand. Acoustic data is high-dimensional since it is often acquired at a high sampling rate to respect the Shannon sampling theorem. Rapid development of new measurement technology also enables the processing of a growing number of sensors [12]. As Section 2.3 will show, several studies created individual datasets that are not openly available, which hampers comparability and reproducibility. Data availability problems become particularly apparent when temporally stationary sources shall be characterized and the number of microphone channels is large.

The objective of this contribution is the development of a framework for the reproducible generation of microphone array data. The simulation of virtual measurement scenarios ensures exact labels and allows an arbitrary number of dataset samples to be generated, which can be used for training and testing purposes. The simulation parameters should be adaptable to different measurement situations. A further goal is to improve the scientific discussion, comparability and reproducibility of machine learning methods by publishing the framework.

The paper is structured as follows. In Section 2, model-based and data-driven approaches to solve source localization and characterization problems are reviewed. Section 3 introduces the concept and the framework. Section 4 demonstrates an application of the framework to simulate a dataset for source localization and characterization. Section 5 concludes the paper.

2 Source localization and characterization

Different problems can be solved with microphone array methods, including

  • Acoustic Source Localization (ASL): localization (or tracking) of sound sources

  • Acoustic Source Characterization (ASC): determination of sound source characteristics, e.g. the strength and location

  • Acoustic Source Detection (ASD): activity detection and classification of sound sources, with or without determining their position

Common models assume that the source signals are uncorrelated. The sound pressure \(p_m(t)\) at the m-th sensor of an array with M microphones can be expressed as a linear superposition of J uncorrelated source signals \(q_{j}\left( t-\Delta \tau _{m j}\right) \), such that

$$\begin{aligned} p_{m}(t)=\sum _{j=1}^{J} a_{mj} q_{j}\left( t-\Delta \tau _{m j}\right) . \end{aligned}$$
(1)

The traveling time of the sound from the j-th source to the m-th receiver is expressed by \(\Delta \tau _{m j}\). An attenuation of the source signal is denoted by \(a_{mj}\). The choice of \(\Delta \tau _{m j}\) and \(a_{mj}\) depends on whether the microphone array is placed in the near- or far-field of the radiating sources. The categorization into one of the two cases depends on the aperture of the array and the distance between source and receiver. As illustrated in Fig. 1, the wavefront originating from a source in the far-field can be described as a plane wave. The angle of incidence is identical for each microphone of the array if all sensors are aligned and in the far-field. With an identical angle of incidence, only the direction-of-arrival (DOA) but not the exact location of a source can be determined. In the near-field case, the wavefront originating from a point source arrives at the receiving microphone array with a curved shape. The angle of incidence then varies for each sensor, which allows the exact location of a source to be determined.

Fig. 1

Illustration of delay-and-sum beamforming in the near- and far-field. The beamformer output \(b_j\) for the j-th location at time t is obtained from a summation of the delayed sound pressure \(p_m\) that is weighted by the steering coefficient \(h_{mj}\). The time delay between the j-th location and the m-th sensor is given by \(\Delta t_{mj}\)

2.1 Physical model-based source localization and characterization

ASL and ASC problems have traditionally been solved with model-based techniques. Beamforming is one of the simplest methods and can be performed in both the time domain and the frequency domain [55].

2.1.1 Delay-and-sum beamforming

As shown in Fig. 1, beamforming solves the inverse formulation of (1) by delaying, weighting and summing the sensor signals according to the j-th source location \(\textbf{x}_j\), with

$$\begin{aligned} b_j(t)=\sum _{m=1}^{M} h_{mj} p_{m}\left( t+\Delta t_{m j}\right) , \end{aligned}$$
(2)

where \(h_{mj}\) is a steering coefficient that describes the amplitude relation between the focused source and the microphone position. The time delays from the j-th source to the m-th microphone are expressed by \(\Delta t_{mj}=\frac{r_{mj}}{c}\), where \(r_{mj}\) is the spatial distance and c the speed of sound. The resulting beamformer output \(b_j(t)\) is a spatially filtered representation of the source signal. The squared and time averaged beamformer output \(b_{j,\text {Sq}}\), also known as the steered response power, can be obtained with

$$\begin{aligned} b_{j,\text {Sq}} = \frac{1}{T} \int _{0}^{T} \Big ( \sum _{m=1}^{M} h_{mj} p_{m}\left( t+\Delta t_{m j}\right) \Big )^{2}\, dt, \end{aligned}$$
(3)

where T is the integration time. Equivalently, (3) can be formulated in the frequency domain, where the signals are represented by their Fourier transforms. The Fourier coefficients of the M sensors are organized in \({\textbf{p}}:=[p_1(\omega ), p_2(\omega ), \cdots , p_M(\omega )]^{\top } \in \mathbb {C}^{M}\). Then, an estimate \({\hat{\textbf{C}}} \in \mathbb {C}^{M \times M}\) of the true cross-spectral matrix (CSM) \(\textbf{C}\) can be calculated by multiplying the vector \({\textbf{p}}\) with its Hermitian transpose and averaging across K different snapshots, such that

$$\begin{aligned} {\hat{\textbf{C}}} = \frac{1}{K} \sum _{k=1}^{K} {\textbf{p}}_k{\textbf{p}}_k^{\text {H}}. \end{aligned}$$
(4)

Phase and amplitude values are subsequently weighted by applying the complex-valued steering vector \(\textbf{h}:= [h_{1j}(\omega ), h_{2j}(\omega ), \cdots , h_{Mj}(\omega )]^{\top } \in \mathbb {C}^{M}\). The squared beamformer output in the frequency domain becomes

$$\begin{aligned} b_{j,\text {Sq}}(\omega )=\textbf{h}^{\textrm{H}} {\hat{\textbf{C}}}\textbf{h}. \end{aligned}$$
(5)

Usually, a domain enclosing the sources is discretized by a search grid. The desired characteristics are then calculated with respect to each point of the grid. Maxima in the acoustic source mapping are associated with the source locations. Alternatively, adaptive methods can be applied that search for the maximum beamformer output without the need for a search grid.
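For illustration, the estimation of the CSM according to (4) and the evaluation of (5) over a set of focus points can be sketched in a few lines of NumPy. The sketch assumes free-field conditions and a simple delay-only steering vector normalized by the number of microphones; the function and variable names are chosen for this example only and do not refer to a specific implementation.

```python
import numpy as np

def csm_estimate(snapshots):
    """Cross-spectral matrix estimate, cf. (4).
    snapshots: complex array of shape (K, M), one Fourier snapshot per row."""
    K = snapshots.shape[0]
    return snapshots.T @ snapshots.conj() / K

def beamformer_map(csm, mic_pos, grid_pos, omega, c=343.0):
    """Squared beamformer output, cf. (5), evaluated for every grid point.
    mic_pos: (M, 3) microphone coordinates, grid_pos: (G, 3) focus points."""
    out = np.empty(len(grid_pos))
    for j, x in enumerate(grid_pos):
        r = np.linalg.norm(mic_pos - x, axis=1)          # distances r_mj
        h = np.exp(-1j * omega * r / c) / len(mic_pos)   # simple steering vector
        out[j] = np.real(h.conj() @ csm @ h)             # h^H C h
    return out
```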

2.2 Datasets for source localization and characterization

During the past years, multiple datasets have been published to support the development of microphone array methods, summarized in Table 1.

Table 1 Existing microphone array datasets suitable for machine learning tasks

For speech applications, datasets exist that include a modest number of measurements [24, 28]. For example, the LOCATA corpus is a measured dataset that belongs to the IEEE AASP challenge on acoustic source LOCalization And TrAcking (LOCATA) [28]. It contains static and moving speech sources in form of radiating loudspeakers or human talkers, captured with four different microphone arrays. Other than that, datasets designed for data-driven speech localization can be found that include audio [51] and audio-visual recordings [9, 18, 41].

In contrast to purely speech-related datasets, datasets exist that are dedicated to the joint problem of sound event localization and detection (SELD). Of particular importance is the TAU-NIGENS spatial sound event dataset that belongs to task three of the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge (2019-2021) [2, 37, 38]. The dataset size, the complexity of the scenes, and the number of possibly overlapping sources were extended over the years. The datasets provide scenes of one-minute length captured with two different microphone array configurations. The semi-synthetic array data were created by convolving anechoic sound events with measured spatial room impulse responses. The latest dataset of the DCASE challenge is the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset [39], which exclusively consists of real scenes.

Acoustic events in office rooms are also provided by the SECL-UMons dataset [4]. In contrast to the L3DAS21 and L3DAS22 datasets, real scenes were captured by a seven-channel circular microphone array. Despite the enhanced degree of realism, the spatial variety related to the different events is small since only 30 positions per event class are included. The mentioned datasets mainly address applications with arrays capable of providing data in ambisonics format. In contrast to these applications, the Wearable SELD dataset [33] targets situations where the microphone array is a wearable device, such as earphones. The virtual scenes were created by convolving 12 anechoic sound event classes with recorded impulse responses from 108 locations.

So far, the presented publications are explicitly dedicated to solving ASL or SELD tasks. A lack of accessible development data exists for ASC tasks, which require additional labels for the source strength; here, the authors are not aware of any available datasets. The TAU-NIGENS Spatial Sound Events dataset and the LOCATA corpus are designed for solving ASL problems. Besides the source locations, no information about the strength of a source is given. In [19], a synthetic dataset was designed to compare the performance of model-based microphone array methods. All parameters that are necessary to closely reproduce the array data were documented. The drawback is that substantial work is required to reproduce the data, with no guarantee of arriving at identical data.

Although the openly available datasets provide fairly realistic scenes, their applicability is limited. Most importantly, the datasets do not reflect the variety of available array hardware. For example, most datasets rely on the first-order ambisonics (FOA) format, and all the publications considered microphone arrays with a small number of channels and a small extent compared to the source distance. This is in contrast to deep learning publications using non-standardized microphone arrangements with tens of microphones [7, 21, 27, 29, 36]. Moreover, the datasets only cover a small range of possible applications, namely the localization of speech and real-life sound events. Accordingly, all publications consider a small number of sources emitting non-stationary signals. None of the datasets provides information about the source strength, which prevents their use for the development of source characterization methods. In addition, the variety of source positions is limited. For example, in the semi-synthetic datasets, the number of possible source locations is bounded by the number of measured room impulse responses, and the available amount of real data is small. Finally, the datasets are immutable, meaning that one cannot adjust underlying properties, such as the type of source signal.

2.3 Data-driven source localization and characterization

Table 2 provides a rough overview of some recently published contributions that solve the ASL or ASC problem by supervised learning of DNNs. A comprehensive review of deep learning models for SELD, ASL, and ASC revealed that most of the studied methods relied on supervised learning [14]. In supervised learning, a DNN model learns a functional mapping from a variety of sample pairs. A sample pair consists of input features and corresponding labels. Less often, semi-supervised and weakly supervised training strategies are used, which arise from the low availability of labeled data [14].

Table 2 Examples of contributions using DNNs for solving ASC or ASL problems

In sound event localization and detection, the majority of deep learning models are based on the presented datasets from Section 2.2. A large number of methods were presented in the context of the DCASE challenges held in 2019, 2020, and 2021. These methods were exclusively trained, validated, and tested with the existing data from the corresponding TAU-NIGENS Spatial Sound Events dataset since external data was prohibited by the challenge guidelines [2, 37, 38]. Frequently, the researchers employed augmentation strategies to cope with data sparsity [30, 52, 61]. One popular approach is to create further source cases from existing data by the rotational transformation of the coordinate system spawned by the microphone array [30]. Further strategies concern the manipulation of the sensor signals, e.g., by masking or equalizing the spectro-temporal information [34, 54], and the addition of background noise [20]. In contrast to the previous DCASE challenges, the use of external data is allowed for the DCASE 2022 challenge due to the limited amount provided by the STARSS22 dataset [39]. The first-ranked contribution [57] generated additional semi-synthetic training data similar to [37, 38], whereby 184 hours were used for training and 300 hours for fine-tuning.

In ASL for speech applications, data usage is more diverse than in the SELD field. Although contributions can be found that rely on the existing datasets [42] for model training, the use of individualized development data is much more common. Individualized training data includes synthetic [53] and semi-synthetic datasets [11, 40, 56]. A common practice is to utilize widely adopted speech recordings, e.g. from the TIMIT speech database [13], and to convolve these signals with measured or simulated spatial room impulse responses. For experimental model evaluation, some contributions utilize the existing datasets, such as the LOCATA corpus [11, 53], the AV16.3 dataset [56], the SLR2019 dataset [8], or rely on specific recorded data [40].

So far, research on ASC with deep learning has focused on temporally stationary noise sources recorded by microphone arrays with a large number of channels. These conditions significantly increase the data generation effort and hamper data publication, since multiple seconds of array data are associated with a single label instance. In fact, related contributions targeting ASC exclusively rely on unpublished datasets [6, 7, 26, 27, 29, 60]. Compact features representing time-averaged signals were used in the form of the CSM [6, 7, 29, 60] and the source map obtained via beamforming [21, 26, 27, 36]. To keep the computational effort and storage demands low, analytical calculation of the CSM was used in [6, 7, 29, 60]. However, a shortcoming of this approach is that uncertainties arising from a limited number of averaged snapshots are not incorporated. A more realistic sampling approach was performed in [21, 26, 36], where the CSM and the source map were obtained on the basis of simulated array time data under free-field conditions. A considerable drawback of this approach, however, is that the computational effort complicates the generation of large-scale datasets. For example, 5000 source cases were considered in [36] using time data sampling, whereas analytic sampling enabled the use of one million source cases [7] or even a quasi-infinite dataset [60].

3 Framework concept and implementation

The findings presented in Sections 2.2 and 2.3 reveal that the generation and usage of microphone array data remains a crucial issue. Studies that can rely on existing data usually have to employ augmentation strategies to achieve considerable performance. In contrast, publications using individual data are faced with the computational burden and the non-comparability of their results. It is evident and often reported that the size of the dataset has a strong impact on the accuracy and generalization capabilities of machine learning models [9, 29, 36]. However, available frameworks for the simulation of acoustic data mainly focus on the generation of realistic scenes [46] or on improved performance on side-tasks (e.g. room impulse response generation [10]). None of these frameworks focuses on the scalability of the whole data generation process. In this section, we propose a framework that helps to resolve this issue.

3.1 Concept

In this section, a new approach to create and share large-scale microphone array data for machine learning is presented. Instead of providing a physical file containing an immutable dataset, the source code that runs the data simulation process is shared inside a containerized environment. The source code then runs on the user’s system and generates a synthetic dataset using the locally available computational resources. Instead of saving the raw channel data, only the features, labels and meta-data that are needed for machine learning are written to file.

This approach offers several advantages over retrieving the raw channel data stored in physical files from a database. Firstly, full access to the code that creates the data provides a high level of transparency and allows introspection. Secondly, the dataset can be used more flexibly. Features that are not included in the dataset can be subsequently added by modifying the source code without changing the underlying statistical properties. Thirdly, very large datasets can be easily shared since the source code that defines the simulation process requires very little storage. Finally, it is possible to create datasets of quasi-infinite size by simply increasing the number of simulated cases. The concept is based on the following important points:

  1. Sophisticated simulation methods should be involved to create realistic acoustic data.

  2. Data generation for different acoustic scenarios should be possible (e.g. different measurement environments).

  3. The simulation process must be scalable to ensure that large datasets can be created in an acceptable amount of time.

  4. The simulation process must be fully reproducible to ensure that every user can create exactly the same data.

  5. The framework should be written in a widely used and easy-to-use programming language in order to be attractive to a wide range of users.

3.1.1 Acoustic data simulation

Various open source libraries for acoustic data processing exist. Of those libraries, the Acoular framework is especially suitable, since it is explicitly designed for microphone array applications [49]. Besides numerous state-of-the-art algorithms for acoustic source mapping, Acoular implements tools to simulate synthetic microphone array data.

Acoular is written in an object-oriented style. The library includes different classes to simulate stationary and non-stationary signals. A signal object can be used to feed any source type, including monopole, dipole or line sources. Sources with arbitrary radiation patterns can be generated by the use of spherical harmonics. A trajectory class allows the creation of dynamic source scenarios. Acoular provides further classes for block-wise processing of the channel data. These classes can be used to simulate environmental conditions, e.g. by convolution with an impulse response. Time data can be saved to common file formats, including the waveform audio file format (WAV) or the hierarchical data format version 5 (HDF5).
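As an illustration of these building blocks, the following sketch, loosely based on Acoular's introductory examples, simulates three uncorrelated monopole sources emitting white noise and writes the mixed channel data to an HDF5 file. The geometry file name, source positions and signal parameters are placeholders, and trait names may differ slightly between Acoular versions.

```python
import acoular

fs = 13720                                        # sampling frequency in Hz
mg = acoular.MicGeom(from_file='array_64.xml')    # placeholder geometry file

# three uncorrelated white noise signals of 5 s length
signals = [acoular.WNoiseGenerator(sample_freq=fs, numsamples=5 * fs, seed=i + 1)
           for i in range(3)]

# monopole sources at fixed (placeholder) positions in the observation plane
locs = [(-0.1, -0.1, 0.5), (0.15, 0.0, 0.5), (0.0, 0.1, 0.5)]
sources = [acoular.PointSource(signal=sig, mics=mg, loc=loc)
           for sig, loc in zip(signals, locs)]

# superposition of all sources and block-wise export to HDF5
mix = acoular.Mixer(source=sources[0], sources=sources[1:])
acoular.WriteH5(source=mix, name='three_sources.h5').save()
```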

Many of Acoular’s software design characteristics are advantageous for the simulation of large datasets in the context of deep learning:

  1. Acoular is written in the widely used Python programming language, which is compatible with the APIs of modern machine learning frameworks (e.g. TensorFlow or PyTorch [1, 35]).

  2. The use of Numba [23] speeds up costly algorithms. JIT-compiled low-level routines help to achieve the speed of statically typed languages.

  3. Acoular follows a pipeline-based processing concept. Computational pipelines are built from various abstract objects representing the processing blocks. A lazy evaluation paradigm ensures that calculations are only triggered when needed. Computations of intermediate results are avoided until the final result is retrieved.

  4. A caching mechanism allows computations to be rerun with minimal effort since already calculated results persist between individual runs, as illustrated in the sketch below.
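A small sketch may illustrate the lazy evaluation and caching behavior (file names are placeholders, and trait names may vary between Acoular versions): constructing the processing objects triggers no computation, the CSM is only calculated when its result is first accessed, and the result is cached on disk for later runs.

```python
import acoular

# configure where cached results are stored (optional)
acoular.config.cache_dir = './cache'

ts = acoular.TimeSamples(name='three_sources.h5')            # no data is read yet
ps = acoular.PowerSpectra(time_data=ts, block_size=128,
                          window='Hanning', overlap='50%')   # nothing computed yet

csm = ps.csm   # first access triggers the FFT/averaging and fills the cache
csm = ps.csm   # further accesses (also in later runs) re-use the cached result
```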

3.1.2 Distributed data simulation

The simulation of a virtual acoustic measurement situation can be considered an independent task. Significant time savings are possible when multiple such tasks are calculated in parallel. Parallel execution of Python code within a single process is limited by the global interpreter lock (GIL). However, various freely available third-party modules exist that are capable of bypassing the GIL and can be used to realize computations on multiple CPU threads. Only a few of these implement methods for distributing computations to multiple machines without extensive source code modifications (e.g. for use with a high-performance cluster).

A suitable package that simplifies the handling of single-node or multi-node applications in Python is the Ray cluster-computing framework [32]. Ray is written in C++ with an additional Python API. In multi-node applications, Ray builds up a client-server architecture forming a Ray cluster. A Ray cluster consists of multiple compute nodes, where one node is assigned to be the head node that executes the main program in a so-called driver process [32]. The head node also hosts the Global Control Store (GCS). The GCS keeps information about the nodes of the cluster, remote tasks and objects. The remaining worker nodes host the independent worker processes. A web user interface allows the user to maintain an overview of the running processes.
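The following minimal example shows the Ray pattern that such distributed simulations build on: a function decorated with ray.remote is executed as an independent task, possibly on another node of the cluster, and its results are collected in the main program. The simulate_case function is a placeholder and not part of Ray or AcouPipe.

```python
import ray

ray.init()  # connects to a running Ray cluster or starts a local one

@ray.remote
def simulate_case(sample_id):
    # placeholder for one virtual measurement, returns (id, features)
    return sample_id, sample_id ** 2

# schedule 100 independent tasks; Ray distributes them to free workers
futures = [simulate_case.remote(i) for i in range(100)]

# block until all results are available and fetch them
results = ray.get(futures)
```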

3.2 AcouPipe Framework

A new Python framework has been designed that embeds methods from Acoular and Ray. The AcouPipe [22] module builds on Acoular’s pipeline-based processing concept and provides additional modules, including:

  1. Sampler module containing classes for the reproducible random sampling of virtual measurement scenarios

  2. Pipeline module providing classes to calculate the features (and labels) of a corresponding source case to be included in the dataset. The computations can be carried out in parallel with the aid of the Ray API

  3. Writer module with classes that allow datasets to be written to common file formats, including the HDF5 and the TensorFlow record (TFRecord) format

  4. Loader module for reading datasets

The modules can be used to build a script-like main program for the generation of a dataset.

Figure 2 illustrates a multi-node setup with AcouPipe using a Ray cluster.

Fig. 2

Exemplary structure of a distributed dataset simulation process with AcouPipe. The main program schedules multiple remote tasks with a specific ID. Each remote task represents a virtual measurement from which the features of the dataset are obtained. The calculated data is aggregated and stored into a database by the main program. Calculations are performed asynchronously, meaning that the order of the IDs can vary

The remote tasks are scheduled by AcouPipe’s pipeline object included in the main program. Each remote task calculates a sample of the dataset that has a unique sample ID. A sample is synonymous with a feature (and label) of a dataset. The tasks are serialized and sent to the worker nodes for execution. After a task has finished, the results are retrieved and de-serialized by the main program. Each remote task may require a different time to complete its computation. Hence, the initial scheduling order might not be maintained when fetching the task results. The sample ID enables the recovery of the initial sampling order. When using the hierarchical HDF5 format to store the data, the correct sampling order is always represented in the file. A task is able to write intermediate results to a separate cache file when using Acoular objects with caching capabilities to calculate the features. The cache files can be used to read already calculated results. It is worth noting that the cache files can be moved to a different device for reuse (e.g. a single workstation with lower computational resources).

A more detailed description of the steps performed by the main program is shown in the flow chart in Fig. 3. The color of each processing step indicates which object of the AcouPipe framework is responsible for its execution. Immediately after start-up of the main program, the sample ID is increased by an object of the pipeline module that supervises data generation and aggregation, indicating that a new virtual measurement is performed. The pipeline object holds a connection to the sampler objects that are responsible for sampling the underlying parameters characterizing the new virtual measurement, such as the number of sources or the source positions. To sample the parameters, each sampler object draws values from a specified random distribution. These values are then assigned to the dependent objects involved in the virtual measurement. Thanks to the underlying lazy evaluation strategy, no calculations have been performed so far. Subsequently, a remote task is scheduled to execute the desired feature extraction functions remotely on one of the free worker nodes. To schedule the task, the pipeline object uses the tools provided by the Ray package [32]. The scheduling process requires serialization and deserialization of the feature function. On the worker node, execution of the feature function starts the virtual measurement, which is mainly performed by the Acoular objects. In the main program, the pipeline object receives a future [32] that points to the result of the task, which is not yet available. In the meantime, the mentioned steps are repeated as long as free workers exist. If no free worker exists, the pipeline object waits until a remote task has finished. As soon as Ray reports that a feature calculation is done, the pipeline object fetches the result from the corresponding worker node by utilizing the future. The retrieved data is then passed to the writer object, which writes the data to a database. New virtual measurements are performed until the latest ID matches the maximum ID. The main program is finished when all virtual measurements have been completed and the data has been stored.
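The scheduling and aggregation loop described above can be approximated with Ray's primitives as in the following sketch. This is not AcouPipe code; it merely demonstrates how results arriving out of order are written back under their correct sample ID. For brevity, all tasks are submitted at once and Ray's scheduler distributes them to free workers; compute_features and write_sample are placeholders for the feature extraction and the writer object.

```python
import ray

ray.init()

@ray.remote
def compute_features(sample_id, seed):
    # placeholder for the virtual measurement performed on a worker node
    return sample_id, {"seed": seed}

def write_sample(sample_id, features):
    # placeholder for the writer object storing one dataset sample
    print(f"stored sample {sample_id}")

max_id = 1000
pending = [compute_features.remote(i, seed=i) for i in range(max_id)]

# fetch results as soon as any task finishes (completion order may differ
# from the scheduling order); the returned sample ID restores the order
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    sample_id, features = ray.get(done[0])
    write_sample(sample_id, features)
```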

Fig. 3

Flow-chart illustrating the computational steps for a distributed simulation process. The steps are performed by objects from the AcouPipe modules (red: pipeline object, green: sampler object, yellow: writer object, light green: Acoular object)

4 A synthetic dataset for source localization and characterization

A new openly available large-scale dataset for ASC and ASL is introduced in this section. The dataset comprises a training corpus with 500,000 and a validation corpus with 10,000 simulated source cases. Following the concept explained in Section 3.1, the source code necessary to produce the dataset is openly available and shared in a containerized environment [22].

4.1 Dataset characteristics

The statistical properties of the generated training and validation datasets are closely related to the work of Herold and Sarradj [19]. Figure 4 illustrates the virtual simulation setup, which is explained in the following. A microphone array with an aperture size of \(d=1\,\text {m}\) consisting of 64 sensors is used. The geometry follows a Vogel’s spiral with the parameters \(V=5.0\) and \(H=0.5\) as described in [48]. The array focuses on a planar observation area whose horizontal and vertical extent equals the aperture size. Under real measurement conditions, the accuracy with which the transfer paths between source plane and microphones can be determined is limited. Therefore, individually deviating sensor positions are used in the simulation. The deviations follow a bivariate normal distribution with a mean of \(\mu =0\) and a standard deviation of \(\sigma =1\,\text {mm}\). An anechoic environment with a resting homogeneous fluid is assumed. The speed of sound is set to \(c=343\,\text {m}/\text {s}\). In the observation plane, a varying number of monopole sources emit uncorrelated white noise with a signal length of five seconds. The total number of sources per case follows a Poisson distribution (\(\lambda =3\)), whereas the location of each individual source follows a bivariate normal distribution (\(\mu =0\), \(\sigma =0.1688\,\text {m}\)). The maximum number of simultaneously occurring sources is limited to ten. The squared sound pressure at one meter distance (\(\text {Pa}^2,d_{\text {ref}} = 1\,\text {m}\)) of each source is drawn from a Rayleigh distribution (\(\sigma _{\text {R}}=5\)). The virtual signals are sampled at a rate of \(F_{\text {s}}=13720\,\text {Hz}\). Using the aperture size \(d=1\, \text {m}\), the frequency f can also be expressed as the non-dimensional Helmholtz number:

$$\begin{aligned} He = \frac{f\cdot d}{c}, \end{aligned}$$
(6)

where the sampling rate corresponds to a Helmholtz number of \(He=40\). The simulation parameters are summarized in Table 3.
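For illustration, the random parameters of a single source case could be drawn as in the following sketch, which uses NumPy's random generator instead of AcouPipe's sampler classes. The distribution parameters follow Table 3; how cases with zero sources are handled is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

def sample_source_case(rng):
    """Draw the parameters of one virtual measurement case (cf. Table 3)."""
    # number of sources: Poisson(lambda=3), limited to at most ten sources
    # (handling of zero-source draws, here a lower clip, is an assumption)
    n_sources = int(min(max(rng.poisson(lam=3), 1), 10))
    # source locations in the observation plane: bivariate normal distribution
    locations = rng.normal(loc=0.0, scale=0.1688, size=(n_sources, 2))
    # squared sound pressure at the 1 m reference distance: Rayleigh distributed
    p_squared = rng.rayleigh(scale=5.0, size=n_sources)
    return n_sources, locations, p_squared

n, locs, p2 = sample_source_case(rng)
```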

Fig. 4

Virtual measurement setup including a 64-channel microphone array and a planar observation area. The red dot indicates the reference microphone position. The cross marks the origin of the coordinate system. The blue dots represent sound sources, which are randomly placed

Table 3 Environmental parameters used for the synthetic data generation according to [19]
Fig. 5

Histogram of the number of cases depending on the number of simultaneously active sources in the (a) training and (b) validation dataset

Figure 5 shows two histograms of the absolute number of simultaneously occurring sources in the training and validation datasets. The most probable constellation is the presence of four sources, which is already larger than the maximum number provided by most of the existing datasets. Figure 6 shows scaled histograms of the minimum and maximum spatial distance between any two sources in the validation dataset. The minimum distance \(d_{\text {min}}\) most likely lies between \(5\,\text {cm}\) and \(10\,\text {cm}\), whereas the highest probability for the maximum distance \(d_{\text {max}}\) between two sources exists for \(45\,\text {cm} \le d_{\text {max}} < 50\,\text {cm}\).

4.2 Features and labels

The current implementation allows storing the cross-spectral matrix or an acoustic source mapping as the input feature of the dataset. It is also possible to save the raw time data, but this requires a huge amount of disk space (several TB), and training with this data is only feasible if large-scale computing resources are available. Further input features can easily be added to the generation process by extending the open source code.

The user can choose between the full CSM (e.g. used in [29]) and the non-redundant CSM (e.g. used in [6]). The latter exploits the Hermitian symmetry of the CSM by omitting the lower triangular part, which contains the complex conjugates of the upper triangular elements. The CSM calculation is carried out as stated by (4) in Section 2.1.1 and involves temporal windowing into blocks of 128 samples using a Hanning window with 50\(\%\) overlap.
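As an illustration, the reduction of a full CSM to a non-redundant representation could be implemented as in the following sketch. The exact element ordering and data layout used in the published dataset are defined by its source code; the sketch only demonstrates the principle.

```python
import numpy as np

def nonredundant_csm(csm):
    """Keep only the upper triangular part (including the main diagonal) of a
    Hermitian CSM and stack its real and imaginary parts as float values."""
    m = csm.shape[0]
    iu = np.triu_indices(m)                     # indices of the upper triangle
    upper = csm[iu]                             # M*(M+1)/2 complex entries
    return np.stack([upper.real, upper.imag]).astype(np.float32)

# example: a 64-channel CSM reduces from 64*64 complex to 2*2080 real values
csm = np.zeros((64, 64), dtype=np.complex64)
features = nonredundant_csm(csm)
```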

The acoustic source mapping is calculated according to (5) via conventional frequency domain beamforming on a \(64 \times 64\) sized rectangular grid with a resolution of \(\Delta x = \Delta y \approx 1.6\,\text {cm}\). According to [47], different beamforming steering vector formulations exist in the literature. Formulation III from [47] is used for the dataset. Table 4 summarizes the processing parameters of the features.

It is assumed that only a limited number of frequency bins is needed for developing a machine learning model. Therefore, all features can be saved individually for a single frequency bin in order to reduce workload and memory demands. In the case of the non-redundant CSM in single-precision (32-bit) floating point format, an uncompressed record of the training dataset requires only 8.3 GB of disk space.

The dataset comprises additional labels needed for supervised learning of ASL or ASC models. These include the averaged squared sound pressure values \(p_{\text {ref},j}^2\) at the reference sensor (see red dot in Fig. 4) for each of the J sources. Moreover, the source locations are provided in Cartesian coordinates.

Fig. 6

Scaled histogram of the minimum (a) and the maximum (b) spatial distance between any two sources of the validation dataset

Figure 7 shows an example of the currently implemented input features and Table 5 lists the labels for a source case taken from the validation dataset. In Fig. 7, (a) shows a source map obtained with beamforming, whereas (b) and (c) show two different representations of the cross-spectral matrix. In (b), the CSM is a Hermitian complex-valued representation, which was used, for example, in [29, 60]. In (c), a compressed version is shown that uses only the real and imaginary part of the upper triangular matrix (see [7] for details).

Table 4 Feature processing parameters used for the synthetic data generation

4.3 Simulation times

In order to analyze the benefits of the simulation with AcouPipe, the computation times were measured for 1, 2, 4, 8 and 16 parallel tasks. All simulations were performed on a workstation PC with two Intel Xeon Gold 4214R CPUs, each offering 12 cores and 24 threads. The full validation dataset, including the CSM for 10,000 source cases, was simulated with and without the use of intermediate cached results.

Figure 8a indicates the absolute computation time needed to calculate the validation dataset depending on the number of parallel tasks. Note that the computation time is given on a logarithmic scale. Figure 8a shows that the total runtime required is almost halved when the number of parallel tasks is doubled. This behavior can be observed with and without the use of cached results. Further, it can be seen that the full validation dataset can be obtained in less than 20 minutes when at least 16 CPU threads are available. If intermediate results were already written to cache, the dataset can be recalculated in 11 minutes.

Fig. 7

Example of different dataset features for \(He=5.0\) and \(J=4\). (a) shows a source map obtained with beamforming. (b) shows the Hermitian complex-valued CSM and (c) is a compressed CSM comprising the real and imaginary part of the upper triangular matrix according to [7]

Table 5 Labels of each of the four sources from the validation data set used in Fig. 7 at \(He=5.0\)

Figure 8b shows the throughput of acoustic source cases depending on the number of tasks. Almost 10 cases per second can be simulated when using 16 parallel threads, whereas only one source case per second can be simulated in a single-threaded application. With already cached CSMs, the throughput can be increased by about \(30\%\). Consequently, the complete training dataset with 500,000 cases can be computed in 15 hours without cached data and in approximately 9 hours with cached data, if at least 16 threads are available.

Fig. 8

Computation time statistics of the validation dataset depending on the number of parallel tasks on a single compute node. Calculations were either performed with or without re-using cached intermediate results

5 Conclusion

This work presented a framework for creating and sharing large-scale microphone array data for machine learning. An easy-to-use library named AcouPipe has been created that provides a flexible framework for the simulation of acoustic measurement situations. An advantage compared to other frameworks capable of generating microphone array data is that the presented framework directly addresses the scalability of the simulation process. It has been shown that large-scale datasets can be created in reasonable time with the aid of parallel computing. A further advantage is that the framework can store only the features and labels needed for model optimization, which is memory-efficient, especially when the model considers time-stationary sources and many microphone channels. Caching the calculated features allows computation pipelines to be rerun in a significantly shorter time. Finally, the framework facilitates data distribution since researchers can publish their simulation pipeline instead of the raw channel data. Seeding of the randomized processes ensures the reproducibility of scientific results. Dataset publication can be considerably simplified by sharing the simulation source code in a containerized environment.

To the best of the authors’ knowledge, this work introduces the first openly available dataset for ASC. The dataset allows researchers from the field of microphone array processing to develop or benchmark their models with fully reproducible results. Moreover, the public accessibility of the source code allows the dataset to be customized to a particular application, including the use of a specific microphone geometry or propagation environment. The object-oriented code structure also allows easy implementation of new features. We hope that the framework serves as a helpful tool for creating new datasets in different application areas in the future.