1 Introduction

In hands-free speech communication systems, the acoustic environment can crucially affect the quality and intelligibility of the speech signal acquired by the microphone(s). In fact, the speech signal propagates through the air and is reflected by the walls, the floor, the ceiling, and any object in the room before being picked up by the microphone(s). This propagation results in signal attenuation and spectral distortion, called reverberation, which seriously degrades speech quality and intelligibility. Many dereverberation techniques have thus been proposed in the past (e.g., [1–5]). One direct way is to estimate an inverse filter of the room impulse response (RIR) [6] to deconvolve the reverberant signal. Wu and Wang [1] and Mosayyebpour et al. [2] designed inverse filters of the RIR by maximizing the kurtosis and the skewness of the linear prediction (LP) residual, respectively, to reduce early reverberation. However, a minimum-phase assumption is often needed, which is almost never satisfied in practice [6]. The RIR can also be time-varying and hard to estimate [7]. Kinoshita et al. [3] estimated the late reverberation using long-term multi-step linear prediction and then reduced its effect by spectral subtraction.

Recently, owing to their strong regression capabilities, deep neural networks (DNNs) [8, 9] have also been utilized in speech dereverberation. In [10, 11], a DNN-based single-microphone dereverberation system was proposed, adopting a sigmoid activation function at the output layer and min-max normalization of the target features. An improved DNN dereverberation system we proposed recently [12] adopted a linear output layer and globally normalized the target features to zero mean and unit variance, achieving state-of-the-art performance.

Microphone array signal processing, which utilizes spatial information, is another fundamentally important approach to enhancing speech acquisition in noisy environments [13, 14]. It has recently been shown that exploiting the time-varying nature of speech signals can achieve high-quality speech dereverberation based on multi-channel linear prediction (MCLP) [15–17]. Its efficient implementation in the time-frequency domain is often referred to as the weighted prediction error (WPE) method [15, 16, 18]. The work in [19] designed a feed-forward neural network to map a microphone array's spatial features into a T-F mask, and [20] utilized a DNN-based multichannel speech enhancement technique, where the speech and noise spectra are estimated with a DNN-based regressor and the spatial parameters are derived in an expectation-maximization (EM)-like fashion.

In this paper, we aim to provide a robust and efficient DNN-based multi-channel dereverberation framework that explicitly and fully takes advantage of the rich temporal and spatial information provided by the microphone array. A uniform linear array is considered; possible extensions to circular or even ad hoc array arrangements are left for future work. First, we propose a single DNN to simultaneously perform beamforming and dereverberation, overcoming some of the limitations of the commonly used delay-and-sum beamformers. We refer to the proposed approach as DNNSpatial, because it selectively combines LPS input features of reverberant speech obtained from multiple microphones in an array and maps them into the expected output LPS features of anechoic reference speech. The proposed data utilization strategy leverages the complementary information captured in microphone array speech to simultaneously perform beamforming and dereverberation. Its key goal is to discover rich and complex interactions in the signals from the data alone, without any ad hoc pre-processing. Different from beamforming, where expert knowledge has to be involved to reach the desired result, the DNN will eventually boost the signal in the direction of the desired source and possibly ignore or deemphasize some of the available channels, using only information available in the data. To better assess the strength of the proposed approach, we also build two standard DNN-based dereverberation systems, namely DSB-DNN and DNNs-DSB, which are direct extensions of the single-microphone case in [12] combining a beamformer with single or multiple DNN dereverberation systems. The experimental evidence confirms that the proposed DNNSpatial approach outperforms the other two more common DNN-based solutions investigated in this work, in both 2-microphone and 6-microphone settings, according to all three objective measures tested. Moreover, the 6-microphone array also outperforms the dual-microphone configurations at all reverberation times.

In the single-channel case [12], we found that RT60-dependent frame shift and acoustic context are two key environment-aware parameters in DNN training, which can boost the system's environmental robustness. In the multi-channel case, however, we pay more attention to spatial information than to frame shift, because the spatial information captured by a microphone array is fundamentally important to enhancing speech acquisition in noisy environments. We next investigate the temporal auto-correlation function of reverberant signals under different reverberant conditions, showing that an RT60-dependent feature selection is needed to achieve top performance across diverse reverberant cases. Experimental evidence confirms that in stronger reverberation, even at the cost of losing spatial contexts, larger temporal contexts combined with a larger aperture size yield higher-quality dereverberated speech. An environment-aware approach, namely RTA-DNNSpatial, is thus designed to improve performance and enhance robustness by adopting RT60-dependent temporal-spatial information. Experimental results demonstrate that RTA-DNNSpatial surpasses WPE and DNNSpatial over a wide range of RT60s. It also generalizes well to unseen room sizes, array geometries, and loudspeaker positions, and is robust to RT60 estimation errors.

The rest of the paper is organized as follows. We first describe the proposed reverberation-time-aware DNN-based multi-channel dereverberation framework in Section 2. Experimental results are provided and analyzed in Section 3. The generalization capabilities of the proposed environment-aware model are illustrated in Section 4. Finally, we summarize our findings in Section 5.

2 Multi-microphone dereverberation

The well-known delay-and-sum beamformer (DSB), a fundamental method for speech dereverberation [7], is utilized in this paper. It provides a foundation from which to explore alternative approaches and techniques, and it is also of interest because it has been used as a benchmark for several newly developed dereverberation algorithms [21–23]. Moreover, we adopt our recently proposed improved DNN dereverberation system [12], which uses a linear output layer and globally normalizes the target features to zero mean and unit variance. In addition, a uniform linear array is utilized.

2.1 DSB-DNN

Figure 1 shows a potential framework, namely DSB-DNN, consisting of a DSB followed by a single-channel dereverberation DNN model [19, 24]. Here the DNN is adopted as the mapping function from the DSB output to the reference channel's anechoic speech features. This system has a low computational complexity and is easy to implement, and is therefore commonly used in dereverberation applications. The experimental results below show, however, that it is not the optimal choice.

Fig. 1 A block diagram of the DSB-DNN system

2.2 DNNs-DSB

Another possible solution is to apply a DNN model to each channel independently and feed the resulting outputs to the DSB [20], as illustrated in Fig. 2. It has a very high computational cost, especially when the number of microphones M becomes large. \(r_{i}(t)\) (i=1,...,M) is the reverberant signal at the ith channel, with \(Y_{r_{i}}\) and \(\varPhi _{r_{i}}\) representing its magnitude and phase spectra, respectively. \(Y_{r_{i}}^{'}\) denotes the DNN-based dereverberated magnitude. The enhanced waveform \(s_{i}(t)\) is then reconstructed from the estimated spectral magnitude \(Y_{r_{i}}^{'}\) and the phase of the reverberant speech \(\varPhi _{r_{i}}\) with the overlap-add method [12]. Finally, the system output s(t) is obtained by delay-and-sum beamforming. The DSB relies on the time delay between \(s_{i}(t)\) and \(s_{j}(t)\) (i≠j) [7]. However, for the reconstructed signals, neither the magnitude nor the phase guarantees the delay assumptions across channels. For example, the phase of the reverberant speech \(\varPhi _{r_{i}}\) can differ from the phase of the reconstructed signal \(\varPhi _{s_{i}}\), because for a spectrogram-like matrix in the time-frequency domain, a time-domain signal whose STFT equals that matrix is not guaranteed to exist [25]. Thus the subsequent DSB is unreliable, and the dereverberated signal may perform relatively worse on time-domain measures, such as the frequency-weighted segmental signal-to-noise ratio (fwSegSNR) [26].
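As an illustration of this per-channel reconstruction step, the following minimal sketch (ours, not the authors' implementation; the frame parameters are assumed to match Section 2.3: 512-point FFT, 32-ms frames, 16-ms shift) recombines an enhanced magnitude with the reverberant phase and inverts it by overlap-add:

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(mag_enhanced, r_i, fs=16000):
    """Rebuild s_i(t) from the enhanced magnitude Y'_{r_i} and the
    reverberant phase Phi_{r_i}, as in the DNNs-DSB per-channel path.
    There is no guarantee that a time-domain signal with exactly this
    STFT exists [25], which is why the inter-channel delay assumptions
    of the subsequent DSB may no longer hold."""
    _, _, Z = stft(r_i, fs=fs, nperseg=512, noverlap=256)
    phase = np.angle(Z)                    # Phi_{r_i}, kept from the input
    S = mag_enhanced * np.exp(1j * phase)  # combine Y'_{r_i} with Phi_{r_i}
    _, s_i = istft(S, fs=fs, nperseg=512, noverlap=256)
    return s_i                             # enhanced waveform s_i(t)
```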

Fig. 2 A block diagram of the DNNs-DSB system

2.3 DNNSpatial

Both DSB-DNN and DNNs-DSB rely on the DSB module, but it is in fact possible to embed the beamformer in the DNN, which encourages us to design a single-DNN configuration that accomplishes simultaneous speech dereverberation and beamforming without a DSB. We propose a speech dereverberation framework, namely DNNSpatial, that selectively combines input LPS features of reverberant speech from multiple microphones in an array and maps them into the expected output LPS features of anechoic reference speech based on DNNs. The block diagram of the DNNSpatial system is illustrated in Fig. 3, with the detailed training procedure in the bottom panel.

Fig. 3 A block diagram of the proposed DNNSpatial system

A clean signal x(t) is first passed through an M-microphone array. In our experiments, the received signal at each channel, \(r_{i}(t)\) (i=1,...,M), is divided into 32-ms frames with a 16-ms frame shift. A 512-point DFT of each overlapping windowed frame is computed, and 257-dimensional LPS feature vectors [27] are used for DNN training. We use \(R_{i}(k,c)\) to denote the log magnitude of the time-frequency (T-F) unit at time frame k and frequency channel c in the ith channel. Therefore, in the LPS domain, each frame can be represented as a vector \(\boldsymbol{r}_{i}(k)\):

$$ \boldsymbol{r}_{i}(k)= [R_{i}(k,1), R_{i}(k,2),..., R_{i}(k,257)]^{T} $$
(1)

In order to incorporate temporal dynamics, we include the spectral features of neighboring frames in the feature vector. Therefore, \(\boldsymbol{r}_{i}(k)\) is extended to

$$ \boldsymbol{\widetilde{r}}_{i}(k)= [\boldsymbol{r}_{i}(k-d_{i});...; \boldsymbol{r}_{i}(k);...; \boldsymbol{r}_{i}(k+d_{i})] $$
(2)

where \(d_{i}\) denotes the number of neighboring frames on each side in the ith channel. Note that it has been shown in [11] that using frames from both sides is an optimal feature extension strategy. The input vector is a concatenation of speech frames selected spatially from different microphones and temporally from various acoustic contexts in different microphones. Therefore, the concatenated vector for the DNN feature mapping is:

$$ \boldsymbol{\hat{r}}(k)= [\boldsymbol{\widetilde{r}}_{1}(k);...; \boldsymbol{\widetilde{r}}_{i}(k);...; \boldsymbol{\widetilde{r}}_{M}(k)] $$
(3)

The desired output of the neural network is the spectrogram of the anechoic reference speech \(y(t)=x(t-t_{d})\), where \(t_{d}\) is the time delay between the loudspeaker and the reference microphone, in the current frame k, denoted by a 257-dimensional feature vector whose elements correspond to the log magnitude in each frequency bin at the kth frame. The channel between the loudspeaker and the first microphone is considered the reference channel; therefore, the target feature is denoted as \(\boldsymbol{y}_{1}(k)\).

A fixed-length context window is used in order to ensure a fair comparison and a controlled environment (i.e., \(\sum _{i=1}^{M}(2d_{i}+1)\) is a fixed value). It also avoids the brute-force approach of always selecting the same number of frames for each microphone, which may eventually lead to an excessively large DNN input size as the number of microphones increases and thus cause dramatic performance deterioration [28]. The proposed approach has a low computational complexity, since only a single DNN is used. It is also independent of the delay assumptions because no beamformer is utilized.
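As an illustration, a minimal NumPy/SciPy sketch of the feature construction in Eqs. (1)–(3) might look as follows. The function names are ours, edge frames are handled by replication (an assumption, as the paper does not specify edge handling), and a per-channel frame count of 0 drops that channel entirely, as in configurations such as DNNSpatial7-0-0-0-0-7 introduced later:

```python
import numpy as np
from scipy.signal import stft

def lps_features(r_i, fs=16000):
    """257-dim log-power-spectrum features R_i(k, c) of one channel, Eq. (1)."""
    _, _, Z = stft(r_i, fs=fs, nperseg=512, noverlap=256)  # 32-ms frames, 16-ms shift
    return np.log(np.abs(Z.T) ** 2 + 1e-12)                # (K, 257); floor is ours

def expand_context(R, d):
    """Stack d neighboring frames on each side of every frame, Eq. (2)."""
    K = R.shape[0]
    idx = np.clip(np.arange(-d, d + 1)[None, :] + np.arange(K)[:, None], 0, K - 1)
    return R[idx].reshape(K, -1)                           # (K, (2d+1)*257)

def dnn_input(channels, frames_per_ch):
    """Concatenate per-channel expanded features across microphones, Eq. (3).

    channels: list of M waveforms r_i(t).
    frames_per_ch: e.g., [5, 5] for DNNSpatial5-5; an entry of 0 excludes
    the channel entirely (as in DNNSpatial7-0-0-0-0-7).
    """
    feats = [expand_context(lps_features(r), (n - 1) // 2)
             for r, n in zip(channels, frames_per_ch) if n > 0]
    return np.concatenate(feats, axis=1)                   # rows are \hat{r}(k)
```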

2.4 Reverberation-time-aware DNNSpatial

In DNNSpatial, the input vector is a concatenation of speech frames selected temporally (d i ) and spatially (i) from various acoustic contexts in different microphones, and thus there are different possible combinations of temporal and spatial features. Therefore, the effects of temporal-spatial feature combinations on DNNSpatial-based dereverberation performance should be investigated.

2.4.1 Temporal auto-correlation function at different RT60s

In reverberant environments, the source speech, x(t), and the corresponding received signal at the ith microphone, r i (t), can be related by

$$ r_{i}(t)= x(t)*h_{i}(t), $$
(4)

where \(h_{i}(t)\) is the RIR of the ith channel, which is assumed to be a time-invariant system, and ∗ denotes convolution. The room transfer function \(H_{i}(e^{j2\pi f})\) is found by taking the Fourier transform of \(h_{i}(t)\) and can be expressed in terms of a direct and a reverberant component, i.e.,

$$ H_{i}\left(e^{j2\pi f}\right) = H_{d,i}\left(e^{j2\pi f}\right)+H_{r,i}\left(e^{j2\pi f}\right), $$
(5)

where f denotes the signal frequency.

It is further assumed that the direct and reverberant components are statistically uncorrelated, and thus, the energy spectral density can be written as

$$ {}\begin{aligned} E\left\{\left|H_{i}\left(e^{j2\pi f}\right)\right|^{2}\right\} = \left|H_{d,i}\left(e^{j2\pi f}\right)\right|^{2} + E\left\{\left|H_{r,i}\left(e^{j2\pi f}\right)\right|^{2}\right\}. \end{aligned} $$
(6)

The direct component is given by [29]

$$ H_{d,i}\left(e^{j2\pi f}\right)=\frac{e^{{jQD}_{i}}}{4\pi D_{i}}, $$
(7)

where \(D_{i}\) is the distance between the source and the ith microphone, and \(Q=2\pi f/c\) is the wave number, with c being the sound velocity in air.

From the statistical room response model, the reverberant component, \(E\{|H_{r,i}(e^{j2\pi f})|^{2}\}\), can be written as [29]

$$ E\left\{\left|H_{r,i}\left(e^{j2\pi f}\right)\right|^{2}\right\}=\frac{1-\alpha}{\pi A\alpha}, $$
(8)

where α and A are the average wall absorption coefficient and total wall surface area, respectively.

Using Eqs. (7) and (8), we obtain the energy spectral density

$$ E\{|H_{i}\left(e^{j2\pi f}\right)|^{2}\} = \frac{1}{16\pi^{2}D_{i}^{2}} + \frac{1-\alpha}{\pi A\alpha}. $$
(9)

The temporal auto-correlation function of \(r_{i}(t)\) is given as follows [30],

$$ \phi_{r_{i}r_{i}}(\tau)= \phi_{xx}(\tau)*R_{h_{i}}(\tau), $$
(10)

where \(\phi _{r_{i}r_{i}}(\tau)\) and \(\phi_{xx}(\tau)\) represent the auto-correlation functions of \(r_{i}(t)\) and x(t), respectively, and \(R_{h_{i}}(\tau)\triangleq h_{i}(\tau)*h_{i}(-\tau)\) is called the deterministic auto-correlation function of \(h_{i}(t)\) [31]. Taking the Fourier transform of Eq. (10) yields

$$ \Phi_{r_{i}r_{i}}\left(e^{j2\pi f}\right)=\Phi_{xx}\left(e^{j2\pi f}\right)\times\Phi_{h_{i}h_{i}}\left(e^{j2\pi f}\right), $$
(11)

where \(\Phi _{r_{i}r_{i}}\left (e^{j2\pi f}\right)\), \(\Phi_{xx}(e^{j2\pi f})\), and \(\Phi _{h_{i}h_{i}}\left (e^{j2\pi f}\right)\) denote the Fourier transforms of \(\phi _{r_{i}r_{i}}(\tau)\), \(\phi_{xx}(\tau)\), and \(R_{h_{i}}(\tau)\), respectively. Because the auto-correlation function and the energy spectral density form a Fourier transform pair for an energy signal [32],

$$ \Phi_{h_{i}h_{i}}\left(e^{j2\pi f}\right)=E\left\{\left|H_{i}\left(e^{j2\pi f}\right)\right|^{2}\right\}. $$
(12)

Moreover, the energy of \(\phi _{r_{i}r_{i}}(\tau)\) can be calculated by using Parseval’s theorem [33]

$$ \begin{aligned} E_{\phi_{r_{i}r_{i}}}&=\int_{-\infty}^{\infty} |\phi_{r_{i}r_{i}}(\tau)|^{2}d\tau \\ &=\int_{-\infty}^{\infty} |\Phi_{r_{i}r_{i}}\left(e^{j2\pi f}\right)|^{2}df \\ &=\int_{-\infty}^{\infty} \left|\Phi_{xx}\left(e^{j2\pi f}\right)\right|^{2}\left|\Phi_{h_{i}h_{i}}\left(e^{j2\pi f}\right)\right|^{2}df. \\ \end{aligned} $$
(13)

Substituting Eqs. (9) and (12) into Eq. (13), we obtain

$$ E_{\phi_{r_{i}r_{i}}}=\int_{-\infty}^{\infty} \left|\Phi_{xx}\left(e^{j2\pi f}\right)\right|^{2}\left(\frac{1}{16\pi^{2}D_{i}^{2}} + \frac{1-\alpha}{\pi A\alpha}\right)^{2}df. $$
(14)

Clearly, \(E_{\phi _{r_{i}r_{i}}}\) decreases as α (0<α≤1) increases, and RT60 also decreases as α increases [34]. As a result, a higher RT60 corresponds to a larger \(E_{\phi _{r_{i}r_{i}}}\). Since \(\phi _{r_{i}r_{i}}(\tau)\) generally decays with increasing τ, more energy is retained at high autocorrelation lags under more severe reverberation.
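As a quick numeric check of this trend (our own example; the source distance D and the wall area A of the 6 by 4 by 3 m room used later are illustrative assumptions), the frequency-independent factor in Eq. (14) grows sharply as α decreases:

```python
import numpy as np

def esd_factor(alpha, D=3.0, A=108.0):
    """(1/(16*pi^2*D^2) + (1-alpha)/(pi*A*alpha))^2 from Eq. (14);
    A = 108 m^2 is the total wall surface of a 6 x 4 x 3 m room."""
    return (1.0 / (16 * np.pi**2 * D**2) + (1 - alpha) / (np.pi * A * alpha)) ** 2

# Low absorption (high RT60) yields a far larger factor, hence far more
# autocorrelation energy, than high absorption (low RT60):
print(esd_factor(0.1) / esd_factor(0.8))  # >> 1
```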

Figure 4 shows the temporal auto-correlation function of an utterance corrupted by reverberation in a simulated room of dimension 6 by 4 by 3 m (length by width by height) at RT60 = 0.2 and 2 s, respectively. The positions of the loudspeaker and the microphone are (2, 3, 1.5) and (4, 1, 2) m. Since the energy of the reverberant utterance significantly affects the autocorrelation, we scale \(\phi_{rr}(0)\) to 1 so as to fairly compare the auto-correlations of the received signals in different reverberant environments. It can be observed that \(\phi_{rr}(\tau)\) at high autocorrelation lags is much stronger at RT60 = 2 s than at RT60 = 0.2 s, which is consistent with the above theoretical analysis.
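The normalization used for Fig. 4 can be sketched as follows (a minimal implementation of ours, assuming a plain sample auto-correlation):

```python
import numpy as np

def normalized_autocorr(r, max_lag):
    """Temporal auto-correlation phi_rr(tau) of a reverberant utterance r(t),
    scaled so that phi_rr(0) = 1 for fair comparison across RT60s."""
    phi = np.array([np.dot(r[:len(r) - tau], r[tau:]) for tau in range(max_lag)])
    return phi / phi[0]  # scale phi_rr(0) to 1
```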

Fig. 4 Temporal auto-correlation function of an utterance in the TIMIT dataset [51], corrupted by reverberation at RT60 = 0.2 and 2 s, respectively

Consequently, in more severe reverberation, the temporal correlation of consecutive reverberant frames becomes stronger. The temporal context thus needs to be emphasized more; however, this may cause a loss of spatial context under the fixed-length context window adopted in this paper for a fair comparison among approaches, as discussed in Section 3.2. In weaker reverberant conditions, by contrast, the spatial context can be preserved by reducing the redundant temporal context. Based on these findings, an approach exploiting different combinations of temporal and spatial information at different RT60s should be more appropriate for multi-channel dereverberation systems.

2.4.2 Channel selection with less spatial contexts

In stronger reverberant conditions, the temporal context should be emphasized, resulting in fewer spatial contexts, i.e., only a subset of microphones is available. To achieve better performance, we use the array aperture size as a measure for selecting the channels.

For example, in a fixed uniform linear array, the beam-width is [13]

$$ \theta=2\cos^{-1}\left[\frac{c}{Mdf}\right], $$
(15)

where M and d denote the number of microphones and the spacing between neighboring sensors, respectively, and L=Md is the array aperture. The spatial discrimination capability therefore depends on the aperture size, i.e., discrimination improves with a larger aperture [13]. If we assume that only two elements of the M-microphone array are available in highly reverberant cases, then the first and the Mth channels, which have the largest spacing, should be chosen in order to obtain the largest aperture size.

In addition, in order to avoid spatial aliasing, the spacing between neighboring sensors has to satisfy the spatial sampling theorem [35, 36], i.e.,

$$ d\leq \frac{\lambda}{2}=\frac{c}{2f}, $$
(16)

where λ is the wavelength of the speech signal. Since microphone signals are naturally broadband [37], one should sample at half of the smallest wavelength (i.e., the highest temporal frequency) of interest. For example, for a two-element array, the spacing must be only about 2.1 cm to prevent aliasing up to 8 kHz. Spatial aliasing is, however, somewhat of a misunderstood phenomenon [38]: the human binaural auditory system has no problem localizing broadband sounds with an average sensor spacing of about 20 cm (corresponding to aliasing above roughly 850 Hz), and it has been shown in [38] that the spatial Nyquist criterion has little importance for microphone arrays. Therefore, in array design, we can obtain a large aperture by setting a large spacing.
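To make Eqs. (15) and (16) concrete, the short sketch below (ours; c = 343 m/s assumed) evaluates the beam-width for a given array and the anti-aliasing spacing, reproducing the roughly 2.1-cm figure quoted above for 8 kHz:

```python
import numpy as np

C = 343.0  # assumed speed of sound in air, m/s

def beamwidth(M, d, f):
    """Beam-width theta = 2*arccos(c/(M*d*f)) as stated in Eq. (15);
    valid when c/(M*d*f) <= 1. A larger aperture L = M*d narrows the
    beam and improves spatial discrimination."""
    return 2.0 * np.arccos(C / (M * d * f))

def max_spacing(f):
    """Largest neighboring-sensor spacing avoiding spatial aliasing,
    d <= lambda/2 = c/(2f), Eq. (16)."""
    return C / (2.0 * f)

print(np.degrees(beamwidth(6, 0.10, 2000)))  # 6 mics, 10-cm spacing, 2 kHz
print(100 * max_spacing(8000))               # about 2.1 cm at 8 kHz
```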

To improve system performance and enhance robustness, we propose an environment-aware framework, namely RTA-DNNSpatial, by incorporating the temporal-spatial characteristics of distinct reverberant conditions together with the channel selection strategy. A block diagram of the proposed RTA-DNNSpatial system is illustrated in Fig. 5; it is an improved version of the DNNSpatial system of Fig. 3, integrating the RT60-dependent temporal and spatial contexts (red parts) into training and dereverberation. In both stages, the selection of temporal and spatial contexts depends on the utterance-level RT60, and an RT60 estimator is required in the dereverberation stage. A detailed description of how RT60 affects the combinations of temporal and spatial features is presented in Section 3.2.1.

Fig. 5 A block diagram of the proposed RTA-DNNSpatial system

3 Experiments and result analysis

3.1 DNNSpatial

The experiments were conducted in a simulated room of dimension 6 by 4 by 3 m (length by width by height). The loudspeaker was positioned at (2, 3, 1.5) m. Both 2-microphone and 6-microphone arrays were considered. For the 2-microphone array, the microphones were at (4, 1, 2) and (4, 1.2, 2) m; for the 6-microphone array, at (4, 1, 2), (4, 1.1, 2), (4, 1.2, 2), (4, 1.3, 2), (4, 1.4, 2), and (4, 1.5, 2) m. Ten RIRs were simulated using an improved image-source method (ISM) [39] with reverberation times (RT60) ranging from 0.1 to 1.0 s in increments of 0.1 s. To learn a high-quality DNN model, all 4620 training utterances from the TIMIT set were convolved with the generated RIRs to build a large multi-condition training set, resulting in about 40 h of reverberant speech at each microphone. To test the DNN's generalization capability in mismatched conditions, RIRs with RT60 from 0.1 to 1.0 s in increments of 0.05 s (rather than 0.1 s) were convolved with 100 randomly selected utterances from the TIMIT test set to construct the test set, yielding 19×100 reverberant utterances at each microphone.
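For illustration, a comparable simulation setup can be scripted with the third-party pyroomacoustics package (an assumption on our part; the paper uses the improved ISM of [39], not this library):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [6, 4, 3]                                # 6 x 4 x 3 m room
for rt60 in np.arange(0.1, 1.01, 0.1):              # RT60 = 0.1 ... 1.0 s
    # invert Sabine's formula to get absorption and ISM order for this RT60
    e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(e_absorption),
                       max_order=max_order)
    room.add_source([2, 3, 1.5])                    # loudspeaker position
    mics = np.array([[4, 1 + 0.1 * i, 2] for i in range(6)]).T  # (3, 6)
    room.add_microphone_array(pra.MicrophoneArray(mics, fs))
    room.compute_rir()  # room.rir[m][0]: RIR from the source to mic m
```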

Kaldi [40] was used to train the DNNs, each with 3 hidden layers of 2048 nodes. The number of pre-training epochs for each RBM [41] layer was 1, with a pre-training learning rate of 0.4. For fine-tuning, the learning rate and the maximum number of epochs were 0.00008 and 30, respectively, and the mini-batch size was 128. These configuration parameters were chosen according to a previous investigation on speech enhancement [42]. The input and target features of the DNN were globally normalized to zero mean and unit variance [43].
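As a purely illustrative stand-in for the Kaldi recipe (sigmoid hidden units are our assumption, and the RBM pre-training stage is omitted), the fine-tuning stage of such a regression DNN could be written in PyTorch as:

```python
import torch
import torch.nn as nn

n_frames = 11                                    # e.g., an 11-frame context window
dnn = nn.Sequential(
    nn.Linear(n_frames * 257, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 257),                        # linear output layer [12]
)
optimizer = torch.optim.SGD(dnn.parameters(), lr=8e-5)  # fine-tuning LR 0.00008
loss_fn = nn.MSELoss()

def train_step(x, y):
    """One mini-batch update; x: (128, n_frames*257), y: (128, 257)."""
    optimizer.zero_grad()
    loss = loss_fn(dnn(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```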

In addition, frequency weighted segmental SNR (fwSegSNR) [26], short-time objective intelligibility (STOI) [44], and perceptual evaluation of speech quality (PESQ) [45] were used to evaluate the system performance.

3.1.1 Single-microphone DNN-based dereverberation systems

We first show the performance of three single-channel systems: our recently proposed DNN dereverberation system "DNN-baseline" [12] with an 11-frame input feature expansion; another single-channel DNN dereverberation system "DNN-HWW" [10] without post-processing [25]; and the signal processing dereverberation method "WPE" in its single-channel mode. The WPE code is available at http://www.kecl.ntt.co.jp/icl/signal/wpe/index.html. "Rev" represents unprocessed reverberant speech.

Figure 6 shows that, compared to unprocessed reverberant speech, our proposed DNN-baseline achieved a significant fwSegSNR improvement of 3.5 dB on average across all RT60s, including mismatched RIR conditions and unseen speakers. This demonstrates that our single-channel DNN system has both powerful regression and generalization capabilities. Furthermore, it was superior to WPE and DNN-HWW at every RT60, achieving average fwSegSNR increases of 2.5 and 1.2 dB, respectively.

Fig. 6 Average single-microphone fwSegSNR of WPE, DNN-HWW, and DNN-baseline at different RT60s

3.1.2 Multi-microphone DNN-based dereverberation systems

We then show the performance of three DNN-based array dereverberation systems (i.e., DSB-DNN, DNNs-DSB, and DNNSpatial) on a 2-microphone array. For the DNN models in both DSB-DNN and DNNs-DSB, an 11-frame context window was used during DNN training. For DNNSpatial, a 10-frame context window was utilized (i.e., \(\sum _{i=1}^{2}(2d_{i}+1)=10\)) to ensure a fair comparison and a controlled environment. The possible feature selection configurations in DNNSpatial are (9, 1), (7, 3), and (5, 5); (9, 1), for example, means that 9 and 1 frames of acoustic context are taken from the first and second microphones, respectively (i.e., \(d_{1}=4\), \(d_{2}=0\)). We refer to these three configurations as DNNSpatial9-1, DNNSpatial7-3, and DNNSpatial5-5, and selected the last one as our DNNSpatial configuration without any particular preference. Note that the phase used for waveform reconstruction in the DNN-based systems was that of the delay-and-summed speech signal, so that all systems are comparable regardless of phase issues. As shown in Fig. 7, compared with DNN-baseline, DSB-DNN increased the fwSegSNR scores, but the improvements were small. DNNs-DSB improved fwSegSNR only at RT60 ≤ 0.15 s and showed decreases at all other RT60s tested, as explained in Section 2.2. Our proposed DNNSpatial, in contrast, significantly improved the speech quality and achieved the best fwSegSNR scores at all RT60s. In terms of STOI and PESQ, the proposed DNNSpatial still obtained better scores than DSB-DNN and DNNs-DSB at almost all RT60s, demonstrating that DNNSpatial performs better than the other two DNN-based array dereverberation models. Moreover, the proposed DNNSpatial framework generalizes well to unseen RT60s.

Fig. 7 Average 2-microphone fwSegSNR (a), STOI (b), and PESQ (c) of three DNN-based array dereverberation systems at different RT60s

Figure 8 shows the performance of the three DNN-based array dereverberation systems on a 6-microphone array. DNNSpatial5-1-1-1-1-1 (\(d_{1}=2\), \(d_{2}=d_{3}=d_{4}=d_{5}=d_{6}=0\), \(\sum _{i=1}^{6}(2d_{i}+1)=10\)) was adopted without any particular preference. It still obtained the best scores at almost all RT60s in terms of fwSegSNR, STOI, and PESQ, demonstrating its robustness to the array configuration. Moreover, the 6-microphone array also outperformed the dual-microphone configurations at all RT60s according to the three objective measures tested.

Fig. 8 Average 6-microphone fwSegSNR (a), STOI (b), and PESQ (c) of three DNN-based array dereverberation systems at different RT60s

3.2 Reverberation-time-aware DNNSpatial (RTA-DNNSpatial)

In Section 3.2.1, our proposed DNNSpatial models are trained with different combinations of the two contexts, i and \(d_{i}\), to find those needed to achieve top performance at each RT60. In Section 3.2.2, we further explore an RTA-DNNSpatial system that exploits these feature selection effects by adopting the "optimal" temporal and spatial contexts at each RT60, which is assumed to be known in the dereverberation stage.

3.2.1 Temporal and spatial contexts in feature selection

The experimental settings were the same as in Section 3.1. The larger microphone array, consisting of six elements, was considered, and the DNNs were trained with 3072 nodes per hidden layer. The RT60 range was also extended to 2.0 s in order to explicitly explore the effects of temporal and spatial variations in feature selection on dereverberation performance.

Table 1 presents the average fwSegSNR scores of a series of sampled DNNSpatial configurations at different RT60s. The numbers in bold denote the maximum values at each RT60 among all DNNSpatial models, and "Rev" represents unprocessed reverberant speech. The results in Table 1 show that:

Table 1 Average 6-microphone fwSegSNR of different DNNSpatial configurations at different RT60s

(i) A 14-frame context window was used to study how RT60 affects the temporal-spatial features. To ensure a controlled environment and clearly differentiated temporal-spatial contexts, three symmetric configurations, DNNSpatial3-3-1-1-3-3, DNNSpatial5-1-1-1-1-5, and DNNSpatial7-0-0-0-0-7, were investigated. DNNSpatial7-0-0-0-0-7 obviously has the longest temporal context but the least spatial context, and the largest aperture size, with only two elements used.

Clearly, the performance was greatly affected by the combination of temporal-spatial contexts, as the bold numbers show that the optimal contexts vary with RT60. In particular, for RT60 ≥ 0.9 s, it was surprising that even at the price of losing some spatial contexts, DNNSpatial7-0-0-0-0-7 still performed better than DNNSpatial3-3-1-1-3-3 and DNNSpatial5-1-1-1-1-5. This can be explained by the theoretical analysis in Section 2.4: in a stronger reverberant environment, (a) larger temporal contexts should be emphasized more because of the stronger temporal correlation of consecutive reverberant frames, and (b) the channels with the largest spacing should be selected in order to obtain the largest aperture size, which can compensate for the loss of spatial contexts. Similar results were obtained in terms of STOI and PESQ. An "optimal" RT60-dependent feature selection of temporal-spatial contexts was then obtained by extracting the best DNNSpatial configuration at each RT60 in Table 1, as listed in Table 2.

Table 2 RT60-dependent temporal-spatial contexts

(ii) The DNN-baseline with a 15-frame feature extension was also considered, which is actually a special case of DNNSpatial15-0-0-0-0-0. It was not surprising that at RT60 = 2.0 s, the DNN-baseline with the largest temporal context was superior to DNNSpatial3-3-1-1-3-3, again consistent with our findings in Section 2.4.

(iii) Compared with the state-of-the-art signal processing multi-channel dereverberation method WPE in its 6-channel mode, our proposed DNNSpatial frameworks (3-3-1-1-3-3, 5-1-1-1-1-5, 7-0-0-0-0-7) significantly improved the fwSegSNR, by about 1.6 dB on average across all RT60s tested.

3.2.2 RTA-DNNSpatial (Known RT60)

Inspired by the findings in Table 1, an RTA-DNNSpatial1 was established by incorporating RT60-dependent temporal-spatial contexts. In the training stage, the utterance-based speech frames, selected spatially from different microphones and temporally from various acoustic contexts, were concatenated according to the optimal temporal and spatial contexts shown in Table 2. In the dereverberation stage, since the RT60s were assumed to be known, the test data could be processed in the same manner as the training utterances.

The results in Fig. 9 show that, compared with the DNNSpatial systems, the RTA-DNNSpatial model achieved the best fwSegSNR, STOI, and PESQ scores at almost all known RT60s, including extremely weak (RT60 = 0.1 s) and severe (RT60 = 2.0 s) reverberant cases. This indicates that the proposed environment-aware approach is robust enough to handle both slightly and highly reverberant situations. Furthermore, RTA-DNNSpatial worked even better at low RT60s.

Fig. 9 Average 6-microphone fwSegSNR (a), STOI (b), and PESQ (c) results at different RT60s

4 Discussions on generalization capabilities

Since the DNN is a data-driven mechanism, it is important to evaluate the generalization capabilities of our proposed RTA-DNNSpatial in situations not seen in training. Therefore, we directly evaluate the RTA-DNNSpatial model obtained in Section 3.2.2, without retraining, in a series of mismatched conditions that are commonly encountered in practical applications. fwSegSNR, a speech intelligibility indicator [46], was used to evaluate system performance.

4.1 Generalization to room size

The DNN system, which was trained in a room of dimension 6 by 4 by 3 m (length by width by height), was tested in a very different room of dimension 10 by 7 by 3 m, with the positions of the loudspeaker and microphones unchanged. Figure 10 shows the generalization results at different RT60s. Clearly, RTA-DNNSpatial yielded higher fwSegSNR scores than the unprocessed reverberant speech and WPE at each RT60. The results illustrate that although our proposed DNN model was trained in a single room, it generalized well to an unseen room size, demonstrating the robustness of the proposed environment-aware approach to new room sizes.

Fig. 10 Average fwSegSNR results at different RT60s, tested in a new room

4.2 Generalization to array geometry

Next, we investigated the geometry dependence of the proposed RTA-DNNSpatial. In the training phase, a uniform linear array with a 10-cm spacing was utilized; in the dereverberation stage, a new array pattern was constructed by increasing the spacing to 15 cm. As shown in Fig. 11, the RTA-DNNSpatial model still achieved the highest scores at each RT60, illustrating the powerful generalization capability of the proposed approach to the geometrically mismatched case.

Fig. 11 Average fwSegSNR results at different RT60s, tested on a new array geometry

4.3 Generalization to loudspeaker position

In real scenarios, the speaker can be anywhere, so it is important to evaluate the generalization capabilities with respect to the loudspeaker position. In the training stage, the loudspeaker was positioned at (2, 3, 1.5) m; in the dereverberation stage, we purposely moved it to (1, 1.5, 2.5) m, this time keeping the room size and the microphone array position unchanged. Figure 12 illustrates that, compared with WPE, the proposed RTA-DNNSpatial boosted fwSegSNR scores at RT60 = 0.3, 1.5, and 2.0 s, demonstrating good generalization capabilities to the loudspeaker position.

Fig. 12 Average fwSegSNR results at different RT60s, tested at a new loudspeaker position

4.4 Robustness to RT60 estimation error

The above RTA-DNNSpatial models assumed that RT60 is known in the dereverberation stage, which is unavailable in practice. We now present experimental results that assess the performance of RTA-DNNSpatial in a practical setting by using the accurate RT60 estimator proposed in [47], marked as "RTA-DNNSpatial-nonoracle". The estimated RT60 is used to choose the optimal RT60-dependent temporal and spatial contexts from Table 2, which are then utilized in the dereverberation stage. The results of the two standard DNN-based multi-channel dereverberation configurations are also given. As shown in Fig. 13, compared with unprocessed reverberant speech, WPE, DSB-DNN, and DNNs-DSB, the non-oracle case substantially boosted fwSegSNR scores at all RT60s. Moreover, the oracle and non-oracle environment-aware DNNSpatial frameworks were comparable. This is because in our RT60-aware DNNSpatial design, the estimated utterance-level RT60 is rounded to the nearest multiple of 0.1 s, which tolerates an RT60 estimation error of ±0.05 s and thus increases the robustness of the proposed algorithm to potential estimation errors.
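A minimal sketch of this non-oracle selection logic follows (ours; the context table entries shown are hypothetical placeholders, not the actual Table 2 values):

```python
CONTEXTS = {            # RT60 (s) -> frames per channel (hypothetical values)
    0.1: (3, 3, 1, 1, 3, 3),
    0.5: (5, 1, 1, 1, 1, 5),
    1.0: (7, 0, 0, 0, 0, 7),
}

def select_contexts(rt60_estimate):
    """Round the estimated utterance-level RT60 to the nearest multiple of
    0.1 s (tolerating a +/- 0.05 s estimation error), then look up the
    RT60-dependent temporal-spatial configuration."""
    rt60 = round(rt60_estimate * 10) / 10             # nearest 0.1-s multiple
    key = min(CONTEXTS, key=lambda k: abs(k - rt60))  # nearest tabulated RT60
    return CONTEXTS[key]

print(select_contexts(0.47))  # -> (5, 1, 1, 1, 1, 5)
```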

Fig. 13 Average fwSegSNR results at different RT60s, tested with an RT60 estimator

In addition, it is difficult to evaluate the proposed approach on real RIRs. For example, the REVERB Challenge [48] assumes a scenario of capturing utterances with an 8-channel circular array in reverberant meeting rooms, whereas the selection of RT60-dependent temporal and spatial contexts here was analyzed and determined for a uniform linear array. We can, however, test our proposed DNNSpatial and RTA-DNNSpatial in single-channel mode on the REVERB Challenge 1-channel real data. As shown in [49], the proposed DNN systems outperform all other methods in all situations; the best-performing method listed in the REVERB Challenge is shown for ease of comparison. Finally, although the ACE Challenge [50] provides real recorded RIRs from an 8-channel uniform linear array, it is difficult for our RTA-DNNSpatial to handle all of its extreme mismatched conditions together (unseen room size, speaker position, microphone position, array element spacing, RT60, etc.).

5 Conclusions

In this paper, we first propose a speech dereverberation framework, namely DNNSpatial, that selectively combines input LPS features of reverberant speech from multiple microphones in an array and maps them into the expected output LPS features of anechoic reference speech based on DNNs. We compare the proposed single-DNN approach to two standard DNN-based multi-channel dereverberation configurations, namely DSB-DNN and DNNs-DSB. Experimental results demonstrate that the proposed single DNNSpatial model without a DSB performs better than the other two DNN models with a DSB in both 2-microphone and 6-microphone settings according to all three objective measures tested.

Next, we propose a reverberation-time-aware DNNSpatial framework, namely RTA-DNNSpatial, that adopts RT60-dependent temporal and spatial contexts to make the system robust over a wide range of RT60s. Experimental results indicate that it is superior to the state-of-the-art signal processing multi-channel dereverberation algorithm WPE and to the DNNSpatial models, in environments ranging from slightly to severely reverberant. It also generalizes well to mismatched room sizes, array geometries, and loudspeaker positions, and is robust to RT60 estimation errors, which will yield significant benefits in many practical applications.

In future studies, we would like to explore the use of direction of arrival (DOA) information to further improve the system performance.

6 Endnote

1 For the input concatenated feature vectors of 3-3-1-1-3-3, 5-1-1-1-1-5, and 7-0-0-0-0-7, the same frequency bin may correspond to different microphones. Based on our preliminary experiments, this inconsistency between the frequency and spatial domains results in degraded dereverberation performance. Therefore, during RTA-DNNSpatial training, the normalized input vectors of the different feature combinations were zero-padded to remove the inconsistency.
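A minimal sketch of our reading of this zero-padding (the fixed slot width and the centering are our assumptions):

```python
import numpy as np

MAX_FRAMES = 7  # widest per-channel context among 3-3-1-1-3-3, 5-1-1-1-1-5, 7-0-0-0-0-7

def pad_channel(feat, n_frames, dim=257):
    """Zero-pad one channel's (K, n_frames*dim) features to a fixed slot
    width of MAX_FRAMES*dim, keeping the center frame centered, so that a
    given input dimension always corresponds to the same microphone and
    the same frequency bin across configurations."""
    K = feat.shape[0]
    out = np.zeros((K, MAX_FRAMES * dim))
    offset = ((MAX_FRAMES - n_frames) // 2) * dim
    out[:, offset:offset + n_frames * dim] = feat
    return out
```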