Abstract
The use of machine learning in automatic speaker identification and localization systems has recently seen significant advances. However, this progress comes at the cost of complex models, heavy computations, and a growing number of microphones and amount of training data. Therefore, in this work, we propose a new end-to-end identification and localization model based on a simple fully connected deep neural network (FC-DNN) and just two input microphones. By exploiting a new data augmentation approach, this model can jointly or separately localize and identify an active speaker with high accuracy in single and multi-speaker scenarios. In this regard, we propose a novel Mel Frequency Cepstral Coefficients (MFCC) based feature called Shuffled MFCC (SHMFCC) and its variant Difference Shuffled MFCC (DSHMFCC). To test our approach, we analyzed the performance of the proposed identification and localization model on the new features under different noise and reverberation conditions for single and multi-speaker scenarios. The results show that our approach achieves high accuracy in these scenarios, outperforms baseline and conventional methods, and remains robust even with small-sized training data.
1 Introduction
The use of microphone array processing techniques plays a very important role in various applications such as video conferencing, human–computer interaction, surveillance, robotics, and several speech technologies, including speech enhancement (Geng et al., 2022; Taherian et al., 2022; Ali et al., 2021), speaker recognition (Meng et al., 2022; Mošner, 2022) and speaker localization (Grumiaux et al., 2022; He et al., 2021; Kang, 2020; Chakrabarty et al., 2019; Wang et al., 2019; Hao et al., 2020; Vera-Diaz et al., 2018).
Speaker localization relies on the frequency, time, and level differences between the signals captured by the microphones of an array. In the literature, localization algorithms fall into two main groups: machine learning algorithms that train deep neural networks to estimate a speaker's position or direction (Chakrabarty et al., 2019; Wang et al., 2019; Hao et al., 2020; Vera-Diaz et al., 2018), and parametric algorithms such as those based on MUltiple Signal Classification (MUSIC; Schmidt, 1986), the Steered Response Power (SRP; Brandstein et al., 1991) or the Time Difference Of Arrival (TDOA; Huang et al., 2001).
Automatic Speaker Identification (ASI), a part of Automatic Speaker Recognition (ASR), is the process of using a machine to determine a speaker’s identity from his/her speech utterances. ASI’s main challenge lies in extracting distinctive features from those utterances that maximize identification performance under different environmental conditions (Jahangir et al., 2021; Kabir et al., 2021; Astuti et al., 2020). Accordingly, many feature extraction techniques have been proposed, among which the Mel Frequency Cepstral Coefficients (MFCC) are the most widely used. However, the performance of these features may vary with the speaker’s health condition and emotional state (Shahin, 2013) and under different levels of noise, reverberation, interference, and multi-source conditions (Li et al., 2014; Kinoshita et al., 2020).
Machine learning algorithms and deep neural networks have been successfully utilized in speaker localization during the last decade. In Chakrabarty et al. (2019), a Convolutional Neural Network (CNN) based supervised learning method for DOA estimation of multiple speakers was presented and trained on synthesized noise signals. The network exhibits superior performance compared to signal-processing-based localization methods such as SRP-PHAT and MUSIC using multiple microphones. In Wang et al. (2019), the authors utilized a Bidirectional Long Short-Term Memory (BLSTM) neural network to identify speech-dominant time-frequency units containing a relatively clean phase for DOA estimation. However, that network cannot be used for online applications, and the study considered only one noise type and one SNR value. Moreover, Hao et al. (2020) investigated a real-time implementation of speech source localization using a relatively small three-layer CNN architecture that decomposes the complex-valued spectrograms into real and imaginary parts. Finally, in Vera-Diaz et al. (2018), an end-to-end CNN-based network is trained to map the raw waveforms of a distributed microphone network to the source position in reverberant environments.
In addition, machine learning has helped achieve state-of-the-art performance in speaker identification applications (Ashar et al., 2020; An et al., 2019; Zhang et al., 2019; Nagrani et al., 2017; Ye et al., 2021; Ravanelli et al., 2018; Ghezaiel et al., 2020; Barhoush et al., 2021). In Ashar et al. (2020), the authors applied a depth-wise separable CNN architecture to identify 251 speakers using spectrograms of four-second clean utterances. The study in Nagrani et al. (2017) used CNNs to classify spectrogram features extracted from the speakers’ utterances in the VoxCeleb1 (Nagrani et al., 2017) dataset and reached 80% accuracy. The model from An et al. (2019) attained 91% accuracy on the same dataset using deep CNNs with self-attention and MFCC features. Furthermore, the authors in Ye et al. (2021) proposed a DNN model based on a two-dimensional convolutional neural network (2-D CNN) and a gated recurrent unit (GRU) for speaker identification. In addition, the speaker identification systems in Ghezaiel et al. (2020) and Ravanelli et al. (2018) used raw waveform speech samples and CNNs along with finite impulse response filters called SincNet or a wavelet scattering transform, achieving accuracies of 67% and 80%, respectively. Although the speaker identification problem was traditionally addressed with single-channel processing techniques, using microphone arrays is proven in Mošner et al. (2018) and Xiao et al. (2016) to enhance ASR system performance, as it reduces reverberation, interference, and background noise.
Joint speaker localization and identification has been addressed in a relatively limited capacity in the literature. In Salvati et al. (2018), different processing blocks from uniform linear arrays (ULA) are used in a loop to localize speakers with the SRP-DU (SRP diagonal unloading) method, enhance the signal using a DU beamformer, and then identify the speaker. The authors in May et al. (2012) used an artificial head with a binaural scene analyzer that is able to simultaneously localize, detect, and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation. Moreover, in Salvati et al. (2020), an end-to-end scheme based on CNNs is trained to map the raw waveforms of two microphones into the speaker identity and Direction Of Arrival (DOA). Finally, the authors in Marques et al. (2022) presented a microphone array and sensor system for speaker localization and identification in shared autonomous vehicles.
Data augmentation in acoustics and sound classification has also been studied in the literature. The augmentation method in Salamon et al. (2017) was based on time stretching and pitch shifting techniques along with adding background noises. Moreover, researchers in Inoue et al. (2019), Yuji et al. (2018) and Hongyi et al. (2018) mixed two sound sources within the same class or from different classes to generate a new sound. Furthermore, in the study of Bird et al. (2020), the authors investigated the feasibility of improving speaker identification by enhancing the dataset with artificial data generated by training a character-level recurrent neural network on a short clip of five spoken sentences.
In this paper, we extend our work in Barhoush et al. (2021), where we used the SHMFCC features in a single-channel speaker identification scenario, to the scenario where we jointly or separately identify and localize speakers using a two-microphone array under various conditions.
The contributions of this work can be summarised as follows:
-
We propose a FC-DNN model that can jointly or separately localize and identify an active speaker with high accuracy.
-
We propose two new MFCC-based features as input to our model: Shuffled MFCC (SHMFCC) and its variant, Difference Shuffled MFCC (DSHMFCC).
-
We propose a new data augmentation preprocessing step in the training and validation phase and a post-processing step in the testing and deployment phase to maximize system accuracy and robustness.
-
We tested the model under adverse conditions with different SNRs and reverberation times. We also illustrated the capability of our model, compared to conventional methods, to localize and identify predetermined authenticated speakers very well in single and multi-speaker scenarios.
-
We compared our approach to traditional methods and showed its robustness even with small-sized training data.
2 Methodology
This section describes the proposed methodology in detail. Firstly, we describe the dataset used to generate the training, validation, and testing samples. Secondly, we introduce our SHuffled and Difference SHuffled MFCC (SHMFCC and DSHMFCC) features. Thirdly, we present the machine learning model and its parameter tuning. Finally, we explain the feature extraction and post-processing for the deployment phase.
2.1 Training dataset
In all experiments presented in this paper, we used the same room setup shown in Fig. 1. The simulated room has dimensions of (\(5\, {\text {m}}\times 4\, {\text {m}}\times 3.5\, {\text {m}}\)). The centre of the two-microphone array (TMA), whose microphones are separated by a distance \(d_{TMA}\), is placed at the point (\(2.5\, {\text {m}},1\, {\text {m}},1.6\, {\text {m}}\)). One speaker out of C speakers spoke 37 different 10-s-long utterances, corresponding to \(D=37\) locations, each at a distance of 2 m from the microphone array centre (half azimuth plane with \(5^\circ\) of separation). Therefore, the training dataset has a balanced class distribution for both speaker identity and position. The simulation is based on the image source method (ISM; Allen et al., 1979), and all the audio samples used are from the publicly available LibriSpeech dataset (Panayotov et al., 2015).
2.2 Feature extraction and preprocessing
Figure 2 shows the feature extraction pipeline used to jointly identify and localize speakers. All speaker utterances are divided into training, validation, and testing subsets with percentages of 60%, 20%, and 20%, respectively. The utterances go through several steps, discussed in detail in the subsequent sections, before they are fed to the model.
2.2.1 Framing and windowing
The speaker utterances at each microphone are first sampled with a sampling frequency of \(F_s = 16\) kHz and then framed in the time domain with a standard frame duration of 25 ms and a frame shift of 10 ms between overlapping frames to avoid information loss, corresponding to a frame length H of 400 samples and a frame shift Q of 160 samples, respectively. Each frame is then windowed using an H-length Hanning window to smooth discontinuities at the beginning and end of the frame.
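As a sketch, this framing and windowing step can be written in a few lines of NumPy (the function name and the test tone below are illustrative, not part of the system):

```python
import numpy as np

def frame_and_window(s, H=400, Q=160):
    """Split a sampled utterance into overlapping Hanning-windowed frames.

    H: frame length in samples (25 ms at Fs = 16 kHz).
    Q: frame shift in samples (10 ms at Fs = 16 kHz).
    Returns an (L, H) matrix with one windowed frame per row.
    """
    L = 1 + (len(s) - H) // Q                  # number of complete frames
    w = np.hanning(H)                          # H-length Hanning window
    frames = np.stack([s[l * Q : l * Q + H] for l in range(L)])
    return frames * w                          # window every frame

# Example: one second of a 100 Hz tone sampled at 16 kHz.
Fs = 16000
s = np.sin(2 * np.pi * 100 * np.arange(Fs) / Fs)
S = frame_and_window(s)
print(S.shape)  # (98, 400)
```

With a 16 000-sample signal, \(L = 1 + (16000-400)/160 = 98\) frames, matching the printed shape.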
Let \(s_i[k]\) and \(s_j[k]\) be the sampled versions of the utterances \(s_i(t)\) and \(s_j(t)\) recorded at mic i and mic j, respectively. Each contains L frames with H samples per frame. Then, we can represent all the frames in \(s_i[k]\) and \(s_j[k]\) using the matrices \({{\varvec{S}}}_i^{L\times H}\) and \({{\varvec{S}}}_j^{L\times H}\) respectively as follows:
$$\begin{aligned} {{\varvec{S}}}_i^{L\times H} = \begin{bmatrix} {{\varvec{f}}}_{\,i,1}^{\,H}\\ {{\varvec{f}}}_{\,i,2}^{\,H}\\ :\\ {{\varvec{f}}}_{\,i,L}^{\,H}\\ \end{bmatrix}, \quad {{\varvec{S}}}_j^{L\times H} = \begin{bmatrix} {{\varvec{f}}}_{\,j,1}^{\,H}\\ {{\varvec{f}}}_{\,j,2}^{\,H}\\ :\\ {{\varvec{f}}}_{\,j,L}^{\,H}\\ \end{bmatrix}, \end{aligned}$$(1)
where \({{\varvec{f}}}_{\,i,l}^{\,H}\), \(1\le l \le L\) is a vector of length H that represents the l th frame in \(s_i[k]\) written as follows:
$$\begin{aligned} {{\varvec{f}}}_{\,i,l}^{\,H} = \left[ s_i[(l-1)Q],\, s_i[(l-1)Q+1],\, \ldots ,\, s_i[(l-1)Q+H-1]\right] . \end{aligned}$$(2)
2.2.2 SHMFCC and DSHMFCC
In this work, we build on the well-known MFCC features (Apte, 2017) and propose new MFCC-based features called SHuffled MFCC features (SHMFCC) and Difference Shuffled MFCC features (DSHMFCC) to enhance the accuracy and robustness of the identification and localization system under various conditions. These proposed features are computed in four stages:
-
MFCC Extraction: here, N Mel-frequency cepstral coefficients are computed for each time frame from the two channels separately. The time frame samples are first transformed into the frequency domain using the discrete Fourier transform (DFT), and the power spectrum is calculated. The power spectrum is then passed through a series of Mel-filter banks (O’Shaughnessy, 2000), whose Mel frequencies have equal perceptual distances and thus imitate the perception of human ears. After that, the log of the power from each filter bank is computed, and finally, we obtain the MFCC coefficients by taking the discrete cosine transform (DCT) of the Mel log-powers. For mic i, let \({{\varvec{m}}}_{\,i,l}^{\,N}= \textit{mfcc}({{\varvec{f}}}_{\,i,l}^{\,H})\) be the vector that represents the N MFCC coefficients of the frame \({{\varvec{f}}}_{\,i,l}^{\,H}\), then \({{\varvec{M}}}_i^{L\times N}\) is the matrix representing the N MFCC coefficients of the L frames in \(s_i[k]\) as follows:
$$\begin{aligned} {{\varvec{M}}}_i^{L\times N} = \begin{bmatrix} \textit{mfcc}({{\varvec{f}}}_{\,i,1}^{\,H}) \\ \textit{mfcc}({{\varvec{f}}}_{\,i,2}^{\,H}) \\ : \\ \textit{mfcc}({{\varvec{f}}}_{\,i,L}^{\,H}) \\ \end{bmatrix} = \begin{bmatrix} {{\varvec{m}}}_{\,i,1}^{\,N}\\ {{\varvec{m}}}_{\,i,2}^{\,N}\\ :\\ {{\varvec{m}}}_{\,i,L}^{\,N}\\ \end{bmatrix}. \end{aligned}$$(3)The same way, we can write \({{\varvec{M}}}_j^{L\times N}\) for mic j as follows:
$$\begin{aligned} {{\varvec{M}}}_j^{L\times N} = \begin{bmatrix} \textit{mfcc}({{\varvec{f}}}_{\,j,1}^{\,H}) \\ \textit{mfcc}({{\varvec{f}}}_{\,j,2}^{\,H}) \\ : \\ \textit{mfcc}({{\varvec{f}}}_{\,j,L}^{\,H}) \\ \end{bmatrix} = \begin{bmatrix} {{\varvec{m}}}_{\,j,1}^{\,N}\\ {{\varvec{m}}}_{\,j,2}^{\,N}\\ :\\ {{\varvec{m}}}_{\,j,L}^{\,N}\\ \end{bmatrix}. \end{aligned}$$(4) -
Single-mic MFCC and Inter-mic MFCC Difference (DMFCC): the next step is to take the MFCC coefficients of one mic (mic i in this case) and concatenate them row-wise (concatR) with the difference between the MFCC coefficients from both microphones i and j to form \({{\varvec{M}}}^{L\times 2N}\), where:
$$\begin{aligned} \begin{aligned} {{\varvec{M}}}^{L\times 2N}&= \textit{concatR}\left( {{\varvec{M}}}_i^{L\times N} , {{\varvec{M}}}_i^{L\times N}-{{\varvec{M}}}_j^{L\times N}\right) \\&= \begin{bmatrix} {{\varvec{m}}}_{i,1}^{N},{{\varvec{m}}}_{i,1}^{N}-{{\varvec{m}}}_{j,1}^{N}\\ {{\varvec{m}}}_{i,2}^{N},{{\varvec{m}}}_{i,2}^{N}-{{\varvec{m}}}_{j,2}^{N}\\ :\\ {{\varvec{m}}}_{i,L}^{N},{{\varvec{m}}}_{i,L}^{N}-{{\varvec{m}}}_{j,L}^{N} \end{bmatrix}. \end{aligned} \end{aligned}$$(5) -
Frame Shuffling (Data Augmentation): we divide \({{\varvec{M}}}^{L\times 2N}\) into T blocks denoted as \({{\varvec{W}}}_t^{B\times 2N}\), where each block contains B frames. We can write Eq. 5 as follows:
$$\begin{aligned} {{\varvec{M}}}^{L\times 2N} = \begin{bmatrix} {{\varvec{W}}}_{1}^{B\times 2N}\\ {{\varvec{W}}}_{2}^{B\times 2N}\\ :\\ {{\varvec{W}}}_{T}^{B\times 2N} \end{bmatrix}. \end{aligned}$$(6)Then, we randomly shuffle the MFCC frames within each block \(R\) times, where R is a positive integer. Each block thus produces R different blocks with shuffled frames. This shuffling yields a new matrix \({{\varvec{M}}}_{sh}^{LR\times 2N}\) whose number of rows is R times that of the original matrix in Eq. 6. The new matrix is given in the following equation:
$$\begin{aligned} {{\varvec{M}}}_{sh}^{LR\times 2N} = \begin{bmatrix} \textit{shuffle}_1({{\varvec{W}}}_{1}^{B\times 2N})\\ \textit{shuffle}_2({{\varvec{W}}}_{1}^{B\times 2N})\\ :\\ \textit{shuffle}_R({{\varvec{W}}}_{1}^{B\times 2N})\\ :\\ :\\ \textit{shuffle}_1({{\varvec{W}}}_{T}^{B\times 2N})\\ \textit{shuffle}_2({{\varvec{W}}}_{T}^{B\times 2N})\\ :\\ \textit{shuffle}_R({{\varvec{W}}}_{T}^{B\times 2N})\\ \end{bmatrix}, \end{aligned}$$(7)where \(\textit{shuffle}({{\varvec{W}}}_{t}^{B\times 2N})\) changes the order of the B frames in \({{\varvec{W}}}_{t}^{B\times 2N}\) randomly and keeps the order of the 2N MFCC coefficients in each frame intact.
-
SHMFCC/DSHMFCC Features: finally, we obtain the SHMFCC/DSHMFCC features by taking each consecutive K frames from the output of Eq. 7. Since each frame now has a length of 2N, one SHMFCC/DSHMFCC feature is a vector of shuffled MFCC and DMFCC coefficients with length equal to 2KN. These SHMFCC/DSHMFCC feature vectors, along with their corresponding speaker labels, are used to train, validate and test the machine learning model.
This method increases the system’s accuracy and robustness, as demonstrated in Sect. 3. The block length B and the shuffling rate R are considered hyperparameters in the training/validation stage and denoted as \(B_{tv}\) and \(R_{tv}\), respectively.
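A minimal NumPy sketch of this block-wise shuffling, assuming the concatenated MFCC/DMFCC matrix of Eq. 5 is already available (the function name and the toy dimensions are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_augment(M, B=50, R=5):
    """Frame-shuffling data augmentation, as in Eqs. 6 and 7.

    M: (L, 2N) matrix of concatenated MFCC/DMFCC frames.
    B: block length in frames (B_tv).
    R: shuffling rate (R_tv).
    Returns an (L*R, 2N) matrix of shuffled frames.
    """
    T = M.shape[0] // B                        # number of complete blocks
    blocks = M[: T * B].reshape(T, B, -1)      # split into T blocks of B frames
    out = []
    for W in blocks:                           # each block yields R shuffled copies
        for _ in range(R):
            # reorder the B frames; coefficients within a frame stay intact
            out.append(W[rng.permutation(B)])
    return np.concatenate(out, axis=0)

M = rng.standard_normal((500, 128))            # toy data: L = 500 frames, 2N = 128
M_sh = shuffle_augment(M)
print(M_sh.shape)  # (2500, 128)
```

Each shuffled copy contains exactly the frames of its source block, only reordered, so the augmented set grows by the factor R without inventing new coefficient values.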
2.3 Our DNN model
We tested different DNN models and many traditional classification algorithms to identify/localize speakers using the SHMFCC/DSHMFCC features (see “Appendix”). However, it turned out that a simple fully connected Deep Neural Network (FC-DNN) achieves the best results in terms of classification accuracy and robustness at various environmental conditions. Details on this model are discussed in this section.
We designed different FC-DNN structures by changing the number of hidden layers, the number of neurons in each hidden layer, the dropout rate, and the activation functions to avoid overfitting and to maximize the system’s accuracy. Figure 3 shows our final tuned FC-DNN model. This model consists of one input layer, two output layers, and six hidden, fully connected dense layers (HL). After each 512-neuron dense layer, a sigmoid activation function is applied, and the outputs are normalized using batch normalization (Ioffe et al., 2015). After that, a dropout mechanism (Srivastava & Hinton, 2014) with a dropout rate of 30% is used to prevent the network from overfitting. The input layer receives an SHMFCC/DSHMFCC feature vector of length 2KN, where K is a hyperparameter that represents the number of shuffled frames used to classify the feature and N is a hyperparameter that represents the number of MFCC coefficients extracted from each time frame. Finally, the output layers are two dense layers followed by sigmoid activation functions. One dense layer with C neurons computes the output values for the C-speaker classification problem, and the other has D neurons to classify all the directions of arrival of interest. The layer weights are trained with the Adam stochastic gradient method (Kingma et al., 2015), and the loss function used for classification is binary cross-entropy.
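The layer dimensions described above can be illustrated with an untrained NumPy forward pass (random weights standing in for the trained kernels; batch normalization and dropout are omitted since they do not change the tensor shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, N, C, D = 10, 64, 40, 37        # frames per feature, MFCCs per frame, speakers, DOAs
dims = [2 * K * N] + [512] * 6     # input length 2KN followed by six hidden layers

def forward(x):
    """Shape-only sketch of the tuned FC-DNN (untrained, random weights)."""
    h = x
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = rng.standard_normal((d_in, d_out)) * 0.01
        h = sigmoid(h @ W)                       # 512-neuron dense layer + sigmoid
    W_id = rng.standard_normal((512, C)) * 0.01  # speaker-identity output head
    W_doa = rng.standard_normal((512, D)) * 0.01 # DOA output head
    return sigmoid(h @ W_id), sigmoid(h @ W_doa)

x = rng.standard_normal((8, 2 * K * N))          # a batch of 8 SHMFCC/DSHMFCC vectors
y_id, y_doa = forward(x)
print(y_id.shape, y_doa.shape)  # (8, 40) (8, 37)
```

In practice the weights would of course be learned with Adam and binary cross-entropy as stated above; the sketch only makes the 2KN → 512×6 → (C, D) dimensionality concrete.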
Some parameters remain to be tuned: the number of SHMFCC and DSHMFCC feature frames K, the number of MFCC coefficients per frame N, the block length \(B_{tv}\), and the shuffling rate \(R_{tv}\). We performed extensive hyperparameter tuning by trying different values and combinations, and the results showed that:
-
The system gave the best results when the number of SHMFCC and DSHMFCC feature frames K is between 10 and 20. We chose \(K=10\) frames to reduce the number of trainable parameters in the system model.
-
The system accuracy (especially the DOA classification accuracy) increases with a higher number of MFCC coefficients N. We took \(N=64\) coefficients.
-
We chose a block length of \(B_{tv} = 50\) frames (corresponding to 0.5 s), as it gave the best system performance.
-
As we increase the shuffling rate \(R_{tv}\), more data augmentation is applied and hence the system accuracy increases, as shown in the simulation results section.
The joint localization and identification model in Fig. 3 can be modified to work as a separate localization model by simply omitting the speaker identification part and using only the SHMFCC features (delete the DMFCC in Eq. 5). On the other hand, a separate identification model is obtained by deleting the DOA classification part and using only the DSHMFCC features (delete the single-mic MFCC in Eq. 5). In both separate models, the input layer shape will be of length KN.
In the deployment and testing phase, SHMFCC and DSHMFCC features are extracted in the same way as in the training and validation phase. However, the block length, denoted \(B_t\), now depends on the duration \(P_t\) (in seconds) of the utterance over which the joint localization and identification is estimated. \(P_t\) and \(B_t\) are interchangeable since \(B_t=(F_s \times P_t)/Q\). Their values are application dependent and can vary according to real-time constraints. Moreover, the shuffling rate in this phase, denoted \(R_t\), can take a value of one, where the features extracted from \(P_t\) are shuffled once. However, taking \(R_t \ge 2\) can slightly increase the system accuracy at the expense of more computational power and processing time, as shown in Sect. 3.
The DOA and identity classification decision is traditionally based on the class with the maximum score in the last layer of the system model, given one feature input. However, in our case, all the SHMFCC and DSHMFCC features extracted from the same block/duration (\(B_t, P_t\)) share the same DOA/identity class. Therefore, we propose a soft decision technique (SDT) in which the scores resulting from all the features extracted from one block are summed into a single net score, and the maximum net score then determines the DOA/identity class. We found that this soft technique performs slightly better than the hard decision technique (HDT), in which classification is done separately for each score corresponding to each feature extracted from one block, and the class with the most votes is chosen.
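The two decision rules can be sketched as follows; the score matrix is toy data, with one row of class scores per SHMFCC/DSHMFCC feature extracted from a single block (for reference, at \(F_s=16\) kHz, \(Q=160\) and \(P_t=0.5\) s, one block holds \(B_t=50\) frames):

```python
import numpy as np

def soft_decision(scores):
    """SDT: sum the per-feature score vectors into one net score, then arg-max."""
    return int(np.argmax(scores.sum(axis=0)))

def hard_decision(scores):
    """HDT: classify each feature separately, then take a majority vote."""
    votes = np.argmax(scores, axis=1)
    return int(np.bincount(votes).argmax())

# Toy example: 5 feature vectors from one block, 3 candidate classes.
scores = np.array([[0.1, 0.8, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.7, 0.2, 0.1],   # one outlying feature
                   [0.1, 0.7, 0.2],
                   [0.3, 0.5, 0.2]])
print(soft_decision(scores), hard_decision(scores))  # 1 1
```

Both rules agree here, but SDT keeps the full score magnitudes, so a few confident features can outweigh many weak ones, which is consistent with its slight edge over HDT reported above.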
3 Simulation results
We designed our experiments to answer the following questions: (1) How does our tuned model based on the new SHMFCC/DSHMFCC features perform compared to the classical MFCC features and other baseline algorithms? (2) What is the effect of our new data augmentation approach on the system identification and localization performance? (3) Does our new speaker identification and localization pipeline remain robust under adverse conditions, including variable reverberation times and AWGN, in single and multi-speaker scenarios? And (4) How do the separate and joint models perform under these conditions and scenarios?
In the following subsections, we present the simulation results of our proposed system model using the SHMFCC/DSHMFCC features. Moreover, we compare our results to those from the baseline features (MFCC/DMFCC) and the well-known parametric localization algorithm of Steered Response Power Phase Transform (SRP-PHAT). In addition, we show how our system performance is affected by different choices: the number of MFCC coefficients N, the distance \(d_{TMA}\) between the two microphones, the training/validation shuffle rate \(R_{tv}\), the testing shuffle rate \(R_t\), the separate versus joint model approaches, and the duration of the utterance in the testing phase \(P_t\).
We tested both single and multi-speaker scenarios. In the single-speaker scenario, we identify and localize \(C=40\) speakers (20 males and 20 females) at variable SNR values and reverberation times RT60. In the multi-speaker scenario, one of \(C=10\) (5 males and 5 females) predetermined, authorized speakers is to be identified and localized while one (\(NSP=1\)) or two (\(NSP=2\)) noisy/unauthorized speakers are actively speaking at the same time. The unauthorized speakers are randomly selected from a pool of 20 speakers (10 male and 10 female) and randomly located away from the authorized speaker with a minimum angular separation of \(20^\circ\). In both scenarios, we used a balanced testing subset of speakers’ utterances that was not used in the training/validation stage. In the following results, we use the \(F_1\) score to represent the speaker identification accuracy, and the localization accuracy is expressed by the Mean Absolute Error (MAE) of the DOA angle in degrees. A perfect model has a unity \(F_1\) score and a zero MAE value.
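For concreteness, the two metrics can be computed as in the NumPy sketch below; it assumes a macro-averaged \(F_1\) over the classes (the averaging convention is our assumption), and the label arrays are toy data:

```python
import numpy as np

def f1_macro(y_true, y_pred, n_classes):
    """Macro-averaged F1 score over the speaker classes."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0   # precision for class c
        r = tp / (tp + fn) if tp + fn else 0.0   # recall for class c
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))

def doa_mae(deg_true, deg_pred):
    """Mean absolute DOA error in degrees."""
    return float(np.mean(np.abs(np.asarray(deg_true) - np.asarray(deg_pred))))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(round(f1_macro(y_true, y_pred, 3), 3))  # 0.822
print(doa_mae([0, 5, 10], [0, 10, 10]))       # 1.6666666666666667
```

A perfect classifier gives an \(F_1\) of 1 and an MAE of 0, matching the "unity \(F_1\), zero MAE" criterion above.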
Tables 1 and 2 show the proposed separable and joint system performance for different training/validation shuffle rates \(R_{tv}\) in the single and multi-speaker scenarios, respectively. Here, we chose \(P_t =0.5\) s, \(R_t =1\), \(N=64\), and used the soft decision technique SDT. Both tables serve as the result reference for the following three subsections.
3.1 Our approach versus baseline
Regarding the identification problem, we can observe that our approach based on SHMFCC/DSHMFCC features outperforms the baseline MFCC features in all cases, with a greater advantage at higher values of the shuffle rate \(R_{tv}\) and at lower SNR values. For the localization problem, our system model based on SHMFCC/DSHMFCC features gives better results than the SRP-PHAT baseline algorithm with a proper value of the shuffling rate \(R_{tv}\), which can be clearly seen as the reverberation time increases in Table 1. Moreover, using DMFCC features without shuffling yields close and sometimes better performance than the baseline in the single-speaker scenario.
It should be noted that in the multi-speaker scenario, and in contrast to our model, the number of active speakers (\(NSP+1\)) was given as an input to the SRP-PHAT algorithm to obtain its maximum performance, and the estimate closest to the true value was selected for evaluation.
3.2 Separable or joint identification and localization systems
In the single-speaker scenario, the separate and joint models have almost the same performance in both the identification and localization problems. However, in the multi-speaker scenario, the joint localization and identification model performs better than the separate models, since the joint model learned during the training/validation phase to simultaneously and gradually focus on the authorized speaker's data and ignore the rest.
3.3 Training/validation shuffle rate \(R_{tv}\)
We used \(R_{tv} =1, 3\) and 5 in Tables 1 and 2. In general, increasing \(R_{tv}\) increases the number of training and validation shuffled features extracted from the same block \(B_{tv}\) by the same factor, serving as a data augmentation method that exploits both the statistical distribution and the cross-correlation of MFCC features over a short time \(P_{tv}\). As a consequence, the identification and localization accuracy increases gradually. Moreover, this data augmentation method indirectly enlarges the training space for each class by providing extra ground-truth sample variations for the non-shuffled MFCC features, which improves the overall training robustness. Furthermore, higher values of \(R_{tv}\) are preferable at lower SNR values, at higher reverberation times RT60, and in complex scenarios like the multi-speaker one, since this technique helps average out the short-term noise, reverberation, and interference in the microphone channel signals.
3.4 Number of MFCC coefficients N
The effect of the number of MFCC coefficients N on our joint system performance is shown in Fig. 4. Without loss of generality, we chose case 9 to represent the single-speaker scenario and case 11 from the multi-speaker scenario with \(R_{tv}=5\). The higher the number of MFCC coefficients N, the better the performance the system can achieve. However, increasing N comes at the cost of higher feature extraction processing overhead and more trainable parameters in our model. Therefore, we fixed N to 64 in the other figures and tables.
3.5 Two microphone array distance \(d_{TMA}\)
Figure 5 shows that our system performance is only slightly affected by the distance \(d_{TMA}\) between the two microphones used to capture the speaker utterances in both single and multi-speaker scenarios. The reason is that our features are based on MFCC calculations, in which one step takes the power (magnitude) of the DFT coefficients, so the phase information, which depends on the microphone locations, is discarded. In the other figures and tables, we used \(d_{TMA}=10\) cm.
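This phase-discarding argument can be checked numerically: a pure delay between the microphones, modelled here as a circular shift of one frame, changes only the phase of the DFT, leaving the magnitude, and hence the MFCCs, untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400)            # one frame of a random signal
delay = 4                               # circular shift standing in for an inter-mic delay
x_shifted = np.roll(x, delay)

mag = np.abs(np.fft.rfft(x))
mag_shifted = np.abs(np.fft.rfft(x_shifted))

# The power spectrum is identical: the delay lives only in the discarded
# phase factor e^{-j 2 pi k delay / H}.
print(np.allclose(mag, mag_shifted))  # True
```

This is only an idealized check (a true fractional, non-circular delay and room effects would perturb the magnitudes slightly), but it captures why \(d_{TMA}\) has little influence on MFCC-based features.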
3.6 Deployment duration \(P_t\)
Choosing the deployment duration \(P_t\) has a great effect on the system’s accuracy and robustness. Figure 6 shows the advantage of increasing \(P_t\) for enhancing the system identification and localization performance in both single and multi-speaker scenarios. A longer duration \(P_t\) leads to a longer block length \(B_t\), which increases the number of SHMFCC/DSHMFCC features in the block; this strengthens the single net score in the soft decision technique SDT or the number of votes in the hard decision technique HDT and therefore gives better results. In real-time applications, one should choose \(P_t \le 0.5\) s to avoid delays and keep up with speaker movements. For off-line applications, on the other hand, a longer \(P_t\) duration gives better outcomes. In the other figures and tables, we chose \(P_t=0.5\) s.
3.7 Deployment shuffle rate \(R_{t}\)
In the deployment phase, extra post-processing can be done by setting the shuffle rate \(R_t \ge 2\). Here, for each block \(B_t\), we shuffle \(R_t\) times, which increases the number of SHMFCC/DSHMFCC features by the same ratio. This gives better performance by strengthening the single net score in the soft decision technique SDT or the number of votes in the hard decision technique HDT. Table 3 shows the effect of choosing \(R_t= 1\) and \(R_t = 2\) in the single and multi-speaker scenarios. Both identification and localization results are reinforced by the higher value of \(R_t\).
3.8 Training data size
In general, any well-designed machine learning system will give better performance when the training data is large enough and representative; conversely, the system performance deteriorates with small training data sizes. Table 7 takes case #10 from the multi-speaker scenario as an example to demonstrate the effect of the training data size on our proposed approach using the shuffled features (SHMFCC/DSHMFCC) in comparison to the traditional non-shuffled features (MFCC/DMFCC). The x-axis represents the duration of each speaker’s utterance per position used to train/validate the model. In the case of the shuffled features, we compensated for the shortage of training data with the help of the new data augmentation technique by increasing the shuffling rate \(R_{tv}\) so that the data size is the same in all cases (i.e., using \(R_{tv}=5\) for the 8-s training utterance corresponds to using \(R_{tv}=20\) for the 2-s training utterance).
In Table 7, we see that reducing the training utterance duration decreases the performance of both feature types. However, the shuffled features experience less degradation than the non-shuffled features. For example, reducing the training data to only 25% of its original size decreases the identification accuracy by 17.4 percentage points (from 98.92 to 81.52%) when using the shuffled features, whereas using the non-shuffled features drops the identification accuracy by 43.76 percentage points (from 72.84 to 29.08%). The same comparison also holds for the localization problem. One can also notice that using the shuffled features with 2-s training utterances gives better performance than using the non-shuffled features with 8-s training utterances.
3.9 Speech enhancement and WER
We chose the application of speech enhancement to demonstrate the effect of the DOA estimation accuracy. Here, we took 40 s of noisy speech recordings from cases #10 and #12 in the multi-speaker scenario and used the two-microphone array to localize the authorized speakers with our approach and with the SRP-PHAT algorithm. Then, we used linear microphone arrays of \(BFM =3\) and \(BFM = 5\) microphones spaced by a distance of 5 cm to apply the Minimum Variance Distortionless Response (MVDR) beamformer to the received speech signal, by first steering the microphone array main beam to the estimated DOA and then filtering the noise from other directions given the noise profile of the unauthorized speakers. Finally, we used the Word Error Rate WER (0%: best, 100%: worst) as a metric to compare all layouts.
As shown in Table 4, our approach achieves a lower localization error even with a shorter deployment time \(P_t\) and hence a lower WER. Moreover, increasing the number of beamforming microphones BFM enhances the speech further, as the main beam becomes narrower and has greater gain. For example, in case #12 with \(P_t = 0.5\) s and \(BFM = 5\), our approach estimates the DOA with an MAE of 1.9\(^\circ\) and reduces the WER by 75%, whereas the SRP-PHAT algorithm yields an MAE of 16.1\(^\circ\) and reduces the WER by only 61.9%.
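For reference, the WER used above is the word-level Levenshtein distance normalized by the reference length; a minimal self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance. 0% is best, 100% worst
    (it can exceed 100% when the hypothesis is much longer)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the bat sat on mat")` counts one substitution and one deletion against six reference words.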
4 Conclusion
We presented an end-to-end DNN acoustic model that identifies and localizes active speakers in single- and multi-speaker scenarios under noisy and reverberant conditions. The model uses the newly proposed SHMFCC and DSHMFCC features and achieves accuracy that outperforms baseline and conventional methods in all scenarios and conditions. Moreover, we showed how to use the new features as a data augmentation technique to overcome a shortage of training data. Finally, we demonstrated the benefit of our system's accuracy, compared to the baseline, in a microphone beamforming and speech enhancement application.
References
Ali, R., van Waterschoot, T., & Moonen, M. (2021). An integrated MVDR beamformer for speech enhancement using a local microphone array and external microphones. EURASIP Journal on Audio, Speech, and Music Processing, 1, 1–20.
Allen, J. B., & Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4), 943–950.
An, N. N., Thanh, N. Q., & Liu, Y. (2019). Deep CNNs with self-attention for speaker identification. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2917470
Apte, S. (2017). Random signal processing. CRC Press.
Ashar, A., Bhatti, M. S., & Mushtaq, U. (2020). Speaker identification using a hybrid CNN-MFCC approach. In International conference on emerging trends in smart technologies (ICETST), 2020 (pp. 1–4). https://doi.org/10.1109/ICETST49965.2020.9080730.
Astuti, Y., Hidayat, R., & Bejo, A. (2020). Comparison of feature extraction for speaker identification system. In International seminar on research of information technology and intelligent systems (ISRITI), 2020 (pp. 642–645). https://doi.org/10.1109/ISRITI51436.2020.9315332.
Barhoush, M., Hallawa, A., & Schmeink, A. (2021). Robust automatic speaker identification system using shuffled MFCC features. In IEEE international conference on machine learning and applied network technologies (ICMLANT), 2021 (pp. 1–6). https://doi.org/10.1109/ICMLANT53170.2021.9690530.
Bird, J. J., Faria, D. R., Premebida, C., Ekárt, A., & Ayrosa, P. P. (2020). Overcoming data scarcity in speaker identification: Dataset augmentation with synthetic MFCCs via character-level RNN. In IEEE international conference on autonomous robot systems and competitions (ICARSC), 2020 (pp. 146–151). https://doi.org/10.1109/ICARSC49921.2020.9096166.
DiBiase, J. (1991). A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. In IEEE international conference on acoustics, speech and signal processing, 1991.
Chakrabarty, S., & Habets, E. A. (2019). Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE Journal of Selected Topics in Signal Processing, 13(1), 8–21. https://doi.org/10.1109/JSTSP.2019.2901664
Geng, Y., Zhang, T., Yaw, M. S., & Wang, H.(2022). A speech enhancement method based on the combination of microphone array and parabolic reflector. Journal of the Audio Engineering Society, 70(1/2), 5–23.
Ghezaiel, W., Luc, B. R. U. N., & Lézoray, O. (2020). Wavelet scattering transform and CNN for closed set speaker identification. In IEEE 22nd international workshop on multimedia signal processing (MMSP), 2020 (pp. 1–6). https://doi.org/10.1109/MMSP48831.2020.9287061.
Grumiaux, P. A., Kitić, S., Girin, L., & Guérin, A. (2022). A survey of sound source localization with deep learning methods. The Journal of the Acoustical Society of America, 152(1), 107–151.
Hao, Y., Küçük, A., Ganguly, A., & Panahi, I. M. (2020). Spectral flux-based convolutional neural network architecture for speech source localization and its real-time implementation. IEEE Access. https://doi.org/10.1109/ACCESS.2020.3033533
He, W., Motlicek, P., & Odobez, J. M. (2021). Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2021.3060257
Huang, Y., Benesty, J., Elko, G. W., & Mersereati, R. M. (2001). Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing, 9(8), 943–956. https://doi.org/10.1109/89.966097
Inoue, T., Vinayavekhin, P., Wang, S., Wood, D., Munawar, A., Ko, B. J., ... & Tachibana, R. (2019). Shuffling and mixing data augmentation for environmental sound classification. In Proceedings of the detection and classification of acoustic scenes and events (DCASE), 2019.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 2015 (pp. 448–456).
Jahangir, R., Teh, Y. W., Nweke, H. F., Mujtaba, G., Al-Garadi, M. A., & Ali, I. (2021). Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2021.114591
Kabir, M. M., Mridha, M. F., Shin, J., Jahan, I., & Ohi, A. Q. (2021). A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3084299
Kang, W., Roy, B. C., & Chow, W. (2020). Multimodal speaker diarization of real-world meetings using d-vectors with spatial features. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2020 (pp. 6509–6513). https://doi.org/10.1109/ICASSP40776.2020.9053122.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. CoRR abs/1412.6980.
Kinoshita, K., Ochiai, T., Delcroix, M., & Nakatani, T. (2020). Improving noise robust automatic speech recognition with single-channel time-domain enhancement network. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2020 (pp. 7009–7013). https://doi.org/10.1109/ICASSP40776.2020.9053266.
Li, J., Deng, L., Gong, Y., & Haeb-Umbach, R. (2014). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4), 745–777. https://doi.org/10.1109/TASLP.2014.2304637
Marques, I., Sousa, J., Sá, B., Costa, D., Sousa, P., Pereira, S., ... & Gomes, T. (2022). Microphone array for speaker localization and identification in shared autonomous vehicles. Electronics, 11(5), 766.
May, T., Van de Par, S., & Kohlrausch, A. (2012). A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation. IEEE Transactions on Audio, Speech, and Language Processing, 20(7), 2016–2030. https://doi.org/10.1109/TASL.2012.2193391
Meng, Y., Li, J., Pillari, M., Deopujari, A., Brennan, L., Shamsie, H., ... & Tian, Y. (2022). Your microphone array retains your identity: A robust voice liveness detection system for smart speakers. In USENIX security symposium, 2022 (pp. 1077–1094). USENIX Association.
Mošner, L., Matějka, P., Novotný, O., & Černocký, J. H. (2018). Dereverberation and beamforming in far-field speaker recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2018 (pp. 5254–5258). https://doi.org/10.1109/ICASSP.2018.8462365.
Mošner, L., Plchot, O., Burget, L., & Černocký, J. H. (2022). Multi-channel speaker verification with Conv-Tasnet based beamformer. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2022 (pp. 7982–7986). https://doi.org/10.1109/ICASSP43922.2022.9747771.
Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: A large-scale speaker identification dataset. In INTERSPEECH.
O’Shaughnessy, D. (2000). Speech communications: Human and machine. Wiley.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2015 (pp. 5206–5210). https://doi.org/10.1109/ICASSP.2015.7178964.
Ravanelli, M., & Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In Proceedings of SLT, 2018.
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283. https://doi.org/10.1109/LSP.2017.2657381
Salvati, D., Drioli, C., & Foresti, G. L. (2018). Joint identification and localization of a speaker in adverse conditions using a microphone array. In European signal processing conference (EUSIPCO), 2018 (pp. 21–25). https://doi.org/10.23919/EUSIPCO.2018.8553501.
Salvati, D., Drioli, C., & Foresti, G. L. (2020). Two-microphone end-to-end speaker joint identification and localization via convolutional neural networks. In International joint conference on neural networks (IJCNN), 2020 (pp. 1–6). https://doi.org/10.1109/IJCNN48605.2020.9206674.
Schmidt, R. (1986). Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3), 276–280. https://doi.org/10.1109/TAP.1986.1143830
Shahin, I. (2013). Speaker identification in emotional talking environments based on csphmm2s. Engineering Applications of Artificial Intelligence, 26(7), 1652–1659.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Taherian, H., Eskimez, S. E., Yoshioka, T., Wang, H., Chen, Z., & Huang, X. (2022). One model to enhance them all: Array geometry agnostic multi-channel personalized speech enhancement. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2022 (pp. 271–275).
Vera-Diaz, J. M., Pizarro, D., & Macias-Guarasa, J. (2018). Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates. Sensors, 18(10), 3418.
Wang, Z. Q., Zhang, X., & Wang, D. (2019). Robust speaker localization guided by deep learning-based time–frequency masking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1), 178–188. https://doi.org/10.1109/TASLP.2018.2876169
Xiao, X., Watanabe, S., Erdogan, H., Lu, L., Hershey, J., Seltzer, M. L., & Yu, D. (2016). Deep beamforming networks for multi-channel speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2016 (pp. 5745–5749). https://doi.org/10.1109/ICASSP.2016.7472778.
Ye, F., & Jun, Y. (2021). A deep neural network model for speaker identification. Applied Sciences, 11(8), 3603.
Yuji, T., Ushiku, Y., & Harada, T. (2018). Learning from between-class examples for deep sound recognition. In ICLR, 2018.
Zhang, C., Chen, W., & Xu, C. (2019). Depthwise separable convolutions for short utterance speaker identification. In IEEE 8th joint international information technology and artificial intelligence conference (ITAIC), 2019 (pp. 962–966). https://doi.org/10.1109/ITAIC.2019.8785780.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). Mixup: Beyond empirical risk minimization. In ICLR, 2018.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Appendix
We investigated the effect of the shuffle rate (\(R_{tv}\)) on the performance of various DNN classification models, including fully connected networks (FC-DNN), convolutional neural networks (CNN–DNN), and recurrent neural networks (RNN–DNN).
For the CNN–DNN model, we replaced the first two hidden layers in Fig. 3 with two 2D-CNN layers, each with 128 filters, a kernel size of \((3\times 3)\), and ReLU activation functions. Moreover, we used a max-pooling layer with a \(2\times 2\) pooling window between these CNN layers. The input feature of length (\(2KN=1280\)) was reshaped to a dimension of \(20\times 64\). For the RNN–DNN model, on the other hand, we used two bidirectional GRUs for the first two hidden layers; each of these layers had 128 units and was activated using the tanh activation function. All other training parameters were kept the same as in the FC-DNN case.
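The reshape and pooling arithmetic of the CNN–DNN variant can be checked with a minimal sketch (plain NumPy, since the framework used is not specified here); the \(2\times 2\) pooling halves each spatial dimension between the two CNN layers:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over a (H, W) feature map."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2   # crop odd edges
    x = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return x.max(axis=(1, 3))                          # max over each 2x2 tile

# The flattened feature vector of length 2KN = 1280 is reshaped to 20 x 64
# before the first CNN layer.
feature = np.random.randn(1280).reshape(20, 64)
pooled = max_pool_2x2(feature)   # 2x2 pooling between the two CNN layers
```

With same-padding convolutions, the \(20\times 64\) map thus becomes \(10\times 32\) after pooling, which is the spatial extent seen by the second CNN layer.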
Table 5 presents the models' results for the single-speaker case #3 and the multi-speaker case #11 in the joint speaker identification and localization setup. The models perform similarly on the non-shuffled features. However, when feature shuffling (\(R_{tv}\)) is applied, the FC-DNN model outperforms the CNN–DNN and RNN–DNN models. The latter two perform poorly on shuffled features because they depend on the spatial and temporal correlation between adjacent features, which is partially destroyed by the shuffling technique. The FC-DNN model, in contrast, can compensate for this loss by exploiting the short-time statistical distribution and the cross-correlation of these features.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Barhoush, M., Hallawa, A. & Schmeink, A. Speaker identification and localization using shuffled MFCC features and deep learning. Int J Speech Technol 26, 185–196 (2023). https://doi.org/10.1007/s10772-023-10023-2