1 Introduction

The use of microphone array processing techniques plays a very important role in various applications such as video conferencing, human–computer interaction, surveillance, robotics, and several speech technologies, including speech enhancement (Geng et al., 2022; Taherian et al., 2022; Ali et al., 2021), speaker recognition (Meng et al., 2022; Mošner, 2022) and speaker localization (Grumiaux et al., 2022; He et al., 2021; Kang, 2020; Chakrabarty et al., 2019; Wang et al., 2019; Hao et al., 2020; Vera-Diaz et al., 2018).

Speaker localization relies on estimating the frequency, time, and level differences between the signals captured by the microphones of an array. In the literature, localization algorithms fall into two main groups: machine learning algorithms that train deep neural networks to estimate a speaker's position or direction (Chakrabarty et al., 2019; Wang et al., 2019; Hao et al., 2020; Vera-Diaz et al., 2018), and parametric algorithms such as those based on MUltiple SIgnal Classification (MUSIC; Schmidt, 1986), the Steered Response Power (SRP; Brandstein et al., 1991) or the Time Difference Of Arrival (TDOA; Huang et al., 2001).

Automatic Speaker Identification (ASI), which is a part of Automatic Speaker Recognition (ASR), is the process of using a machine to determine a speaker's identity from his/her speech utterances. ASI's main challenge lies in extracting distinctive features from those utterances that maximize the identification performance under different environmental conditions (Jahangir et al., 2021; Kabir et al., 2021; Astuti et al., 2020). Accordingly, many feature extraction techniques have been proposed, among which the Mel Frequency Cepstral Coefficients (MFCC) are the most widely used. However, the performance of these features may vary with the speaker's health condition and emotional state (Shahin, 2013) and under different levels of noise, reverberation, interference, and multi-source conditions (Li et al., 2014; Kinoshita et al., 2020).

Machine learning algorithms and deep neural networks have been successfully utilized in speaker localization during the last decade. In Chakrabarty et al. (2019), a Convolutional Neural Network (CNN) based supervised learning method for Direction Of Arrival (DOA) estimation of multiple speakers was presented and trained using synthesized noise signals. The network exhibits superior performance compared to signal processing-based localization methods such as SRP-PHAT and MUSIC using multiple microphones. In Wang et al. (2019), the authors utilized a Bidirectional Long Short-Term Memory (BLSTM) neural network to identify speech-dominant time-frequency units containing a relatively clean phase for DOA estimation. However, that network cannot be used for online applications, and the study considered only one noise type and one SNR value. Moreover, Hao et al. (2020) investigated a real-time implementation of speech source localization using a relatively small three-layer CNN architecture and by decomposing the complex-valued spectrograms into real and imaginary parts. In addition, in Vera-Diaz et al. (2018), an end-to-end CNN-based network is trained to map the raw waveforms of a distributed microphone network to the source position in reverberant environments.

Machine learning has also helped achieve strong performance in speaker identification applications (Ashar et al., 2020; An et al., 2019; Zhang et al., 2019; Nagrani et al., 2017; Ye et al., 2021; Ravanelli et al., 2018; Ghezaiel et al., 2020; Barhoush et al., 2021). In Ashar et al. (2020), the authors applied a depth-wise separable CNN architecture to identify 251 speakers using spectrograms of four-second clean utterances. The study in Nagrani et al. (2017) used CNNs to classify spectrogram features extracted from the speakers' utterances in the VoxCeleb1 (Nagrani et al., 2017) dataset and reached 80% accuracy. The model from An et al. (2019) attained 91% accuracy on the same dataset using deep CNNs with self-attention and MFCC features. Furthermore, the authors in Ye et al. (2021) proposed a DNN model based on a two-dimensional convolutional neural network (2-D CNN) and a gated recurrent unit (GRU) for speaker identification. In addition, the speaker identification systems in Ghezaiel et al. (2020) and Ravanelli et al. (2018) used raw waveform speech samples and CNNs, along with finite impulse response filters called SincNet or a wavelet scattering transform, and achieved accuracies of 67% and 80%, respectively. Although the speaker identification problem has traditionally been addressed with single-channel processing techniques, using microphone arrays is shown in Mošner et al. (2018) and Xiao et al. (2016) to enhance ASR system performance, as it reduces reverberation, interference and background noise.

Joint speaker localization and identification has been addressed only to a limited extent in the literature. In Salvati et al. (2018), different processing blocks from uniform linear arrays (ULA) are used in a loop to localize speakers with the SRP-DU (SRP diagonal unloading) method, enhance the signal using a DU beamformer, and then identify the speaker. The authors in May et al. (2012) used an artificial head with a binaural scene analyzer that is able to simultaneously localize, detect and identify a known number of target speakers in the presence of spatially positioned noise sources and reverberation. Moreover, in Salvati et al. (2020), an end-to-end scheme based on CNNs is trained to map the raw waveforms of two microphones to the speaker identity and DOA. Finally, the authors in Marques et al. (2022) presented a microphone array and sensor system for speaker localization and identification in shared autonomous vehicles.

Data augmentation for acoustics and sound classification has also been studied in the literature. The augmentation method in Salamon et al. (2017) was based on time stretching and pitch shifting along with adding background noises. Moreover, researchers in Inoue et al. (2019), Yuji et al. (2018) and Hongyi et al. (2018) mixed two sound sources within the same class or from different classes to generate a new sound. Furthermore, in the study of Bird et al. (2020), the authors investigated the feasibility of improving speaker identification by enlarging the dataset with artificial data generated by training a character-level recurrent neural network on a short clip of five spoken sentences.

In this paper, we extend our work in Barhoush et al. (2021), where we used the SHMFCC features in a single-channel speaker identification scenario, to the scenario where we jointly or separately identify and localize speakers using a two-channel microphone array under various conditions.

The contribution of this work can be summarised as follows:

  • We propose an FC-DNN model that can jointly or separately localize and identify an active speaker with high accuracy.

  • We propose using two new MFCC-based features as input to our proposed model: the shuffled MFCC features (SHMFCC) and their difference variant (DSHMFCC).

  • We propose a new data augmentation preprocessing step for the training and validation phase and a post-processing step for the testing and deployment phase to maximize the system accuracy and robustness.

  • We tested the model under adverse conditions with different SNRs and reverberation times. We also illustrated the capability of our model, compared to conventional methods, to localize and identify predetermined authorized speakers in single and multi-speaker scenarios.

  • We compared our approach to traditional methods and showed its robustness even with small-sized training data.

2 Methodology

This section describes the proposed methodology in detail. Firstly, we describe the dataset used to generate the training, validation, and testing samples. Secondly, we introduce our SHuffled and Difference SHuffled MFCC (SHMFCC and DSHMFCC) features. Thirdly, we present the machine learning model and its parameter tuning. Finally, we explain the feature extraction and post-processing for the deployment phase.

2.1 Training dataset

In all experiments presented in this paper, we used the same room setup as shown in Fig. 1. The simulated room has dimensions of (\(5\, {\text {m}}\times 4\, {\text {m}}\times 3.5\, {\text {m}}\)). The centre of the two-microphone array (TMA), whose microphones are separated by a distance \(d_{TMA}\), is placed at the point (\(2.5\, {\text {m}},1\, {\text {m}},1.6\, {\text {m}}\)). Each of the C speakers spoke 37 different 10-s utterances, one at each of the \(D=37\) locations, which lie at a distance of 2 m from the microphone array centre (half azimuth plane with \(5^\circ\) of separation). Therefore, the training dataset has a balanced class distribution for both the speaker's identity and position. The simulation is based on the image source method (ISM; Allen et al., 1979), and all the audio samples used are from the publicly available LibriSpeech dataset (Panayotov et al., 2015).
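The sketch below illustrates how one such training recording could be simulated with the image source method. It assumes the pyroomacoustics toolbox (the paper does not specify a particular implementation); the RT60 value, the file name, and the chosen DOA are illustrative placeholders.

```python
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

fs = 16000
room_dim = [5.0, 4.0, 3.5]                      # 5 m x 4 m x 3.5 m room
rt60 = 0.3                                      # illustrative reverberation time

# Image source method (ISM) parameters derived from the target RT60.
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# Two-microphone array (TMA) centred at (2.5, 1.0, 1.6) m with spacing d_TMA.
mic_center = np.array([2.5, 1.0, 1.6])
d_tma = 0.10
R = np.c_[mic_center + [-d_tma / 2, 0.0, 0.0],
          mic_center + [+d_tma / 2, 0.0, 0.0]]  # shape (3, 2)
room.add_microphone_array(pra.MicrophoneArray(R, fs))

# One of the D = 37 candidate positions: 2 m from the array centre on the
# half azimuth plane (5-degree grid).
doa_deg = 45.0
src_pos = mic_center + 2.0 * np.array([np.cos(np.radians(doa_deg)),
                                       np.sin(np.radians(doa_deg)), 0.0])

# "speech.flac" stands in for a LibriSpeech utterance resampled to 16 kHz.
speech, _ = sf.read("speech.flac")
room.add_source(src_pos, signal=speech)

room.simulate()
s_i, s_j = room.mic_array.signals               # the two simulated channels
```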

Fig. 1 Top view of the simulated room, microphone array and speaker setup

2.2 Feature extraction and preprocessing

Figure 2 shows the feature extraction pipeline used to jointly identify and localize speakers. All speaker utterances are divided into training, validation, and testing subsets with percentages of 60%, 20%, and 20%, respectively. The utterances go through several processing steps before they are fed to the model; the details are discussed in the subsequent sections.

Fig. 2 SHMFCC/DSHMFCC feature extraction block diagram (left) and an equivalent, detailed four-step pipeline for SHMFCC/DSHMFCC feature extraction (right). Here, \(K=5\) is used for ease of illustration; in the simulation results, we chose \(K=10\)

2.2.1 Framing and windowing

The speaker utterances at each microphone are first sampled with a sampling frequency of \(F_s = 16\) kHz and then framed in the time domain with a standard frame duration of 25 ms and a frame shift of 10 ms, so that consecutive frames overlap and no information is lost; this corresponds to a frame length H of 400 samples and a frame shift Q of 160 samples, respectively. Each frame is then windowed using an H-length Hanning window to smooth discontinuities at the beginning and end of the frame.

Let \(s_i[k]\) and \(s_j[k]\) be the sampled versions of the utterances \(s_i(t)\) and \(s_j(t)\) recorded at mic i and mic j, respectively. Each contains L frames with H samples per frame. Then, we can represent all the frames in \(s_i[k]\) and \(s_j[k]\) using the matrices \({{\varvec{S}}}_i^{L\times H}\) and \({{\varvec{S}}}_j^{L\times H}\), respectively, as follows:

$$\begin{aligned} {{\varvec{S}}}_i^{L\times H} = \begin{bmatrix} {{\varvec{f}}}_{\,i,1}^{\,H} \\ {{\varvec{f}}}_{\,i,2}^{\,H} \\ : \\ {{\varvec{f}}}_{\,i,L}^{\,H} \\ \end{bmatrix}, \quad {{\varvec{S}}}_j^{L\times H} = \begin{bmatrix} {{\varvec{f}}}_{\,j,1}^{\,H} \\ {{\varvec{f}}}_{\,j,2}^{\,H} \\ : \\ {{\varvec{f}}}_{\,j,L}^{\,H} \\ \end{bmatrix}, \end{aligned}$$
(1)

where \({{\varvec{f}}}_{\,i,l}^{\,H}\), \(1\le l \le L\), is a vector of length H that represents the \(l\)-th frame in \(s_i[k]\), written as follows:

$$\begin{aligned} {{\varvec{f}}}_{\,i,l}^{\,H}= \begin{bmatrix} s_i[1+Q(l-1)] \\ s_i[2+Q(l-1)] \\ : \\ s_i[H+Q(l-1)] \\ \end{bmatrix}^T. \end{aligned}$$
(2)
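As a minimal sketch (assuming a plain NumPy implementation), the framing and windowing of Eqs. 1 and 2 can be written as:

```python
import numpy as np

H = 400   # frame length (25 ms at 16 kHz)
Q = 160   # frame shift (10 ms at 16 kHz)

def frame_and_window(s, H=H, Q=Q):
    """Return the L x H frame matrix of Eq. (1), with each row
    (one frame as in Eq. (2), 0-based indexing) Hanning-windowed."""
    L = 1 + (len(s) - H) // Q                         # number of complete frames
    frames = np.stack([s[l * Q:l * Q + H] for l in range(L)])
    return frames * np.hanning(H)                     # window applied per frame

# S_i = frame_and_window(s_i);  S_j = frame_and_window(s_j)
```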

2.2.2 SHMFCC and DSHMFCC

In this work, we build on the well-known MFCC features (Apte, 2017) and propose new MFCC-based features, called SHuffled MFCC (SHMFCC) and Difference SHuffled MFCC (DSHMFCC) features, to enhance the accuracy and robustness of the identification and localization system under various conditions. These proposed features are computed in four stages:

  • MFCC Extraction: here, N Mel-frequency cepstral coefficients are computed for each time frame from the two channels separately. The time frame samples are first transformed into the frequency domain using the discrete Fourier transform (DFT), and the power spectrum is calculated. The power spectrum is then passed through a Mel filter bank (O’Shaughnessy, 2000), whose filters are equally spaced on the perceptual Mel scale and thus imitate the frequency resolution of human hearing. After that, the log of the power from each filter bank is computed, and finally, we obtain the MFCC coefficients by taking the discrete cosine transform (DCT) of the Mel log-powers. For mic i, let \({{\varvec{m}}}_{\,i,l}^{\,N}= \textit{mfcc}({{\varvec{f}}}_{\,i,l}^{\,H})\) be the vector that represents the N MFCC coefficients of the frame \({{\varvec{f}}}_{\,i,l}^{\,H}\); then \({{\varvec{M}}}_i^{L\times N}\) is the matrix representing the N MFCC coefficients of the L frames in \(s_i[k]\) as follows:

    $$\begin{aligned} {{\varvec{M}}}_i^{L\times N} = \begin{bmatrix} \textit{mfcc}({{\varvec{f}}}_{\,i,1}^{\,H}) \\ \textit{mfcc}({{\varvec{f}}}_{\,i,2}^{\,H}) \\ : \\ \textit{mfcc}({{\varvec{f}}}_{\,i,L}^{\,H}) \\ \end{bmatrix} = \begin{bmatrix} {{\varvec{m}}}_{\,i,1}^{\,N}\\ {{\varvec{m}}}_{\,i,2}^{\,N}\\ :\\ {{\varvec{m}}}_{\,i,L}^{\,N}\\ \end{bmatrix}. \end{aligned}$$
    (3)

    In the same way, we can write \({{\varvec{M}}}_j^{L\times N}\) for mic j as follows:

    $$\begin{aligned} {{\varvec{M}}}_j^{L\times N} = \begin{bmatrix} \textit{mfcc}({{\varvec{f}}}_{\,j,1}^{\,H}) \\ \textit{mfcc}({{\varvec{f}}}_{\,j,2}^{\,H}) \\ : \\ \textit{mfcc}({{\varvec{f}}}_{\,j,L}^{\,H}) \\ \end{bmatrix} = \begin{bmatrix} {{\varvec{m}}}_{\,j,1}^{\,N}\\ {{\varvec{m}}}_{\,j,2}^{\,N}\\ :\\ {{\varvec{m}}}_{\,j,L}^{\,N}\\ \end{bmatrix}. \end{aligned}$$
    (4)
  • Single-mic MFCC and Inter-mic MFCC Difference (DMFCC): the next step is to take the MFCC coefficients of one mic (mic i in this case) and concatenate them row-wise (concatR) with the difference between the MFCC coefficients of microphones i and j to form \({{\varvec{M}}}^{L\times 2N}\), where:

    $$\begin{aligned} \begin{aligned} {{\varvec{M}}}^{L\times 2N}&= \textit{concatR}\left( {{\varvec{M}}}_i^{L\times N} , {{\varvec{M}}}_i^{L\times N}-{{\varvec{M}}}_j^{L\times N}\right) \\&= \begin{bmatrix} {{\varvec{m}}}_{i,1}^{N},{{\varvec{m}}}_{i,1}^{N}-{{\varvec{m}}}_{j,1}^{N}\\ {{\varvec{m}}}_{i,2}^{N},{{\varvec{m}}}_{i,2}^{N}-{{\varvec{m}}}_{j,2}^{N}\\ :\\ {{\varvec{m}}}_{i,L}^{N},{{\varvec{m}}}_{i,L}^{N}-{{\varvec{m}}}_{j,L}^{N} \end{bmatrix}. \end{aligned} \end{aligned}$$
    (5)
  • Frame Shuffling (Data Augmentation): we divide \({{\varvec{M}}}^{L\times 2N}\) into T blocks denoted as \({{\varvec{W}}}_t^{B\times 2N}\), where each block contains B frames. We can then rewrite Eq. 5 as follows:

    $$\begin{aligned} {{\varvec{M}}}^{L\times 2N} = \begin{bmatrix} {{\varvec{W}}}_{1}^{B\times 2N}\\ {{\varvec{W}}}_{2}^{B\times 2N}\\ :\\ {{\varvec{W}}}_{T}^{B\times 2N} \end{bmatrix}. \end{aligned}$$
    (6)

    Then, we randomly shuffle the MFCC frames within each block R times, where R is a positive integer. Each block thus produces R different blocks with shuffled frames. The result of this shuffling is a new matrix \({{\varvec{M}}}_{sh}^{LR\times 2N}\), whose number of rows is R times that of the original matrix in Eq. 6. The new matrix can be seen in the following equation:

    $$\begin{aligned} {{\varvec{M}}}_{sh}^{LR\times 2N} = \begin{bmatrix} \textit{shuffle}_1({{\varvec{W}}}_{1}^{B\times 2N})\\ \textit{shuffle}_2({{\varvec{W}}}_{1}^{B\times 2N})\\ :\\ \textit{shuffle}_R({{\varvec{W}}}_{1}^{B\times 2N})\\ :\\ :\\ \textit{shuffle}_1({{\varvec{W}}}_{T}^{B\times 2N})\\ \textit{shuffle}_2({{\varvec{W}}}_{T}^{B\times 2N})\\ :\\ \textit{shuffle}_R({{\varvec{W}}}_{T}^{B\times 2N})\\ \end{bmatrix}, \end{aligned}$$
    (7)

    where \(\textit{shuffle}({{\varvec{W}}}_{t}^{B\times 2N})\) changes the order of the B frames in \({{\varvec{W}}}_{t}^{B\times 2N}\) randomly and keeps the order of the 2N MFCC coefficients in each frame intact.

  • SHMFCC/DSHMFCC Features: finally, we obtain the SHMFCC/DSHMFCC features by taking every K consecutive frames from the output of Eq. 7; since each frame now has length 2N, one SHMFCC/DSHMFCC feature is a vector of shuffled MFCC and DMFCC coefficients of length 2KN. These SHMFCC/DSHMFCC feature vectors, along with their corresponding speaker and DOA labels, are used to train, validate and test the machine learning model (a code sketch of the full feature pipeline is given after this list).
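The following is a minimal sketch of the four stages for the joint feature, assuming librosa for the MFCC stage; the function name and the default hyper-parameter values (taken from Sect. 2.3) are illustrative. The separate models described later use only one half of the concatenation produced in Stage 2.

```python
import numpy as np
import librosa

def shmfcc_dshmfcc_features(s_i, s_j, fs=16000, N=64, B=50, R=5, K=10, seed=0):
    rng = np.random.default_rng(seed)

    # Stage 1: per-channel MFCCs, M_i and M_j of shape (L, N)  [Eqs. (3)-(4)].
    mfcc = lambda s: librosa.feature.mfcc(y=s, sr=fs, n_mfcc=N, n_fft=400,
                                          hop_length=160, win_length=400,
                                          window="hann").T
    M_i, M_j = mfcc(s_i), mfcc(s_j)

    # Stage 2: concatenate the single-mic MFCC with the inter-mic difference
    # (DMFCC) along each row (concatR), giving M of shape (L, 2N)  [Eq. (5)].
    M = np.concatenate([M_i, M_i - M_j], axis=1)

    # Stage 3: frame shuffling (data augmentation). Split M into T blocks of
    # B frames and shuffle the frame order inside each block R times
    # [Eqs. (6)-(7)]; the coefficient order within a frame stays intact.
    T = M.shape[0] // B
    shuffled = []
    for t in range(T):
        block = M[t * B:(t + 1) * B]
        for _ in range(R):
            shuffled.append(block[rng.permutation(B)])
    M_sh = np.concatenate(shuffled, axis=0)          # shape (T*B*R, 2N)

    # Stage 4: group every K consecutive shuffled frames into one feature
    # vector of length 2*K*N.
    n_feat = M_sh.shape[0] // K
    return M_sh[:n_feat * K].reshape(n_feat, 2 * K * N)
```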

This method increases the system’s accuracy and robustness, as demonstrated in Sect. 3. The block length B and the shuffling rate R are treated as hyper-parameters in the training/validation stage and are denoted \(B_{tv}\) and \(R_{tv}\), respectively.

2.3 Our DNN model

We tested different DNN models and many traditional classification algorithms to identify/localize speakers using the SHMFCC/DSHMFCC features (see “Appendix”). However, it turned out that a simple fully connected Deep Neural Network (FC-DNN) achieves the best results in terms of classification accuracy and robustness under various environmental conditions. Details of this model are discussed in this section.

We designed different FC-DNN structures by varying the number of hidden layers, the number of neurons in each hidden layer, the dropout rate, and the activation functions to avoid overfitting and to maximize the system’s accuracy. Figure 3 shows our final tuned FC-DNN model. This model consists of one input layer, two output layers, and six hidden fully connected dense layers (HL). After each 512-neuron dense layer, a sigmoid activation function is applied, and the outputs are then normalized using batch normalization (Ioffe et al., 2015). After that, a Dropout mechanism (Srivastava & Hinton, 2014) with a dropout rate of 30% is used to prevent the network from overfitting. The input layer receives an SHMFCC/DSHMFCC feature vector of length 2KN, where K is a hyperparameter that represents the number of shuffled frames used to classify the feature and N is a hyperparameter that represents the number of MFCC coefficients extracted from each time frame. Finally, the output layers are two dense layers followed by sigmoid activation functions. One dense layer with C neurons computes the output values for the C-speaker classification problem, and the other has D neurons to classify all the directions of arrival of interest. The layer weights are learned using the Adam stochastic gradient method (Kingma et al., 2015), and the loss function used for classification is the binary cross-entropy.

Fig. 3 Proposed deep neural network model
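As a minimal sketch, assuming a Keras/TensorFlow implementation (the framework is not named above), the model of Fig. 3 could be built as follows; the layer names and default argument values are illustrative.

```python
from tensorflow.keras import layers, Model

def build_fc_dnn(K=10, N=64, C=40, D=37, n_hidden=6):
    # Input: one SHMFCC/DSHMFCC feature vector of length 2*K*N.
    inp = layers.Input(shape=(2 * K * N,), name="shmfcc_dshmfcc")
    x = inp
    for _ in range(n_hidden):
        x = layers.Dense(512, activation="sigmoid")(x)   # 512-neuron dense + sigmoid
        x = layers.BatchNormalization()(x)               # batch normalization
        x = layers.Dropout(0.3)(x)                       # 30% dropout
    # Two output heads: C-speaker identification and D-direction DOA classification.
    speaker_out = layers.Dense(C, activation="sigmoid", name="speaker")(x)
    doa_out = layers.Dense(D, activation="sigmoid", name="doa")(x)
    model = Model(inp, [speaker_out, doa_out])
    model.compile(optimizer="adam",
                  loss={"speaker": "binary_crossentropy",
                        "doa": "binary_crossentropy"})
    return model

model = build_fc_dnn()
```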

Some parameters remain to be tuned, namely the number of SHMFCC and DSHMFCC feature frames K, the number of MFCC coefficients per frame N, the block length \(B_{tv}\), and the shuffling rate \(R_{tv}\). We performed extensive hyper-parameter tuning over different values and combinations, and the results (the selected values are summarized in the sketch after this list) showed that:

  • The system gave the best results when the number of SHMFCC and DSHMFCC feature frames K is between 10 and 20 frames. We chose \(K=10\) frames to reduce the number of trainable parameters in the system model.

  • The system accuracy (especially the DOA classification accuracy) increases with a higher number of MFCC coefficients N. We took \(N=64\) coefficients.

  • We chose a block length of \(B_{tv} = 50\) frames, corresponding to 0.5 s, as it gave the best system performance.

  • As we increase the shuffling rate \(R_{tv}\), more data augmentation is applied and hence the system accuracy increases, as shown in the simulation results section.
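For reference, the selected values can be collected in a small configuration block; this is only a summary of the choices above, not additional settings.

```python
# Hyper-parameter values selected by the tuning described above.
HPARAMS = {
    "K": 10,     # SHMFCC/DSHMFCC frames per feature vector (best range: 10-20)
    "N": 64,     # MFCC coefficients per time frame
    "B_tv": 50,  # training/validation block length in frames (0.5 s at a 10 ms shift)
    # R_tv (training/validation shuffling rate) is varied in the experiments;
    # larger values give more augmentation and higher accuracy.
}
```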

The joint localization and identification model in Fig. 3 can be modified to work as a separate localization model by simply omitting the speaker identification part and using only the SHMFCC features (i.e., dropping the DMFCC part in Eq. 5). On the other hand, a separate identification model is obtained by removing the DOA classification part and using only the DSHMFCC features (i.e., dropping the single-mic MFCC part in Eq. 5). In both separate models, the input layer has length KN.

In the deployment and testing phase, SHMFCC and DSHMFCC features are extracted in the same way as in the training and validation phase. However, the block length, denoted \(B_t\), now depends on the duration \(P_t\) (in seconds) of the utterance over which the joint localization and identification is estimated. \(P_t\) and \(B_t\) are interchangeable, as \(B_t=(F_s \times P_t)/Q\); for example, \(P_t=0.5\) s gives \(B_t=(16{,}000 \times 0.5)/160=50\) frames. Their values are application dependent and can vary according to the real-time constraints. Moreover, the shuffling rate in this phase, denoted \(R_t\), can take a value of one, in which case the features extracted from \(P_t\) are shuffled once. However, taking \(R_t \ge 2\) can slightly increase the system accuracy at the expense of more computational power and processing time, as shown in Sect. 3.

The DOA and identity classification decision is traditionally based on the class with the maximum score in the last layer of the system model for a single input feature. However, in our case, all the SHMFCC and DSHMFCC features extracted from the same block/duration (\(B_t, P_t\)) share the same DOA/identity class. Therefore, we propose a soft decision technique (SDT), where the scores of all the features extracted from one block are summed into a single net score, and the maximum net score then determines the DOA/identity class. We found that this soft technique is slightly better than the hard decision technique (HDT), where each feature extracted from the block is classified individually and the class with the most votes is chosen.
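A minimal sketch of the two decision rules is given below; `scores` is assumed to hold the model’s output scores for all features of one block, with shape (number of features in the block, number of classes).

```python
import numpy as np

def soft_decision(scores):
    # SDT: sum the scores of all features from one block into a single net
    # score per class, then pick the class with the maximum net score.
    return int(np.argmax(scores.sum(axis=0)))

def hard_decision(scores):
    # HDT: classify each feature separately, then pick the class with the
    # most votes.
    votes = np.argmax(scores, axis=1)
    return int(np.bincount(votes, minlength=scores.shape[1]).argmax())
```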

3 Simulation results

We designed our experiments to answer the following questions: (1) How does our tuned model based on the new SHMFCC/DSHMFCC features perform compared to the classical MFCC features and other baseline algorithms? (2) What is the effect of our new data augmentation approach on the identification and localization performance? (3) Is our new speaker identification and localization pipeline robust under adverse conditions, including variable reverberation times and AWGN, in single and multi-speaker scenarios? And (4) How do the separate and joint models perform under these conditions and scenarios?

In the following subsections, we present the simulation results of our proposed system model using the SHMFCC/DSHMFCC features. Moreover, we compare our results to those obtained with the baseline features (MFCC/DMFCC) and the well-known parametric localization algorithm Steered Response Power with Phase Transform (SRP-PHAT). In addition, we show how our system performance is affected by different choices, namely the number of MFCC coefficients N, the microphone spacing \(d_{TMA}\) of the two-microphone array, the training/validation shuffle rate \(R_{tv}\), the testing shuffle rate \(R_t\), the separate/joint model approaches and the utterance duration in the testing phase \(P_t\).

We tested both single and multi-speaker scenarios. In the single-speaker scenario, we identify and localize \(C=40\) speakers (20 males and 20 females) at various SNR values and reverberation times (RT60). In the multi-speaker scenario, one of \(C=10\) (5 males and 5 females) predetermined and authorized speakers is to be identified and localized while one (\(NSP=1\)) or two (\(NSP=2\)) noisy/unauthorized speakers are actively speaking at the same time. The unauthorized speakers are randomly selected from a pool of 20 speakers (10 males and 10 females) and randomly located away from the authorized speaker with a minimum angular separation of \(20^\circ\). In both scenarios, we used a balanced testing subset of speakers’ utterances which was not used in the training/validation stage. In the following results, we use the \(F_1\) score to represent the speaker identification accuracy, and the localization accuracy is expressed by the Mean Absolute Error (MAE) of the DOA angle in degrees. A perfect model has a unity \(F_1\) score and zero MAE.
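As a minimal sketch of these two metrics, assuming scikit-learn for the \(F_1\) score (the macro averaging mode and the array names are our assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

def identification_f1(true_ids, pred_ids):
    # Speaker identification accuracy as the F1 score (1.0 for a perfect model).
    return f1_score(true_ids, pred_ids, average="macro")

def doa_mae_deg(true_doa_deg, pred_doa_deg):
    # Localization accuracy as the mean absolute DOA error in degrees
    # (0 for a perfect model).
    return float(np.mean(np.abs(np.asarray(true_doa_deg) - np.asarray(pred_doa_deg))))
```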

Tables 1 and 2 show the performance of the proposed separate and joint systems for different training/validation shuffle rates \(R_{tv}\) in the single and multi-speaker scenarios, respectively. Here, we chose \(P_t =0.5\) s, \(R_t =1\), \(N=64\), and used the soft decision technique SDT. Both tables serve as the reference results for the following three subsections.

3.1 Our approach versus baseline

Regarding the identification problem, we observe that our approach based on the SHMFCC/DSHMFCC features outperforms the baseline MFCC features in all cases, with a greater advantage at higher values of the shuffle rate \(R_{tv}\) and at lower SNR values. For the localization problem, our system model based on the SHMFCC/DSHMFCC features gives better results than the SRP-PHAT baseline algorithm with a proper value of the shuffling rate \(R_{tv}\); this is clearly seen in Table 1 as the reverberation time increases. Moreover, using the DMFCC features without shuffling yields close and sometimes better performance than the baseline in the single-speaker scenario.

It should be noted that in the multi-speaker scenario, in contrast to our model, the number of active speakers (\(NSP+1\)) was given as an input to the SRP-PHAT algorithm to obtain its maximum performance, and the estimate closest to the true value was selected for evaluation.

3.2 Separable or joint identification and localization systems

In the single-speaker scenario, the separate and joint models have almost the same performance in all cases for both the identification and localization problems. However, in the multi-speaker scenario, the joint localization and identification model performs better than the separate models, since during the training/validation phase the joint model learns to simultaneously and gradually focus on the authorized speaker’s data and ignore the rest.

3.3 Training/validation shuffle rate \(R_{tv}\)

We used \(R_{tv} =1, 3\) and 5 in Tables 1 and 2. In general, increasing the value of \(R_{tv}\) increases the number of training and validation shuffled features extracted from the same block \(B_{tv}\) by the same factor, serving as a data augmentation method that exploits both the statistical distribution and the cross-correlation of the MFCC features over the short block duration corresponding to \(B_{tv}\). As a consequence, the identification and localization accuracy increases gradually. Moreover, this data augmentation method indirectly enlarges the training space for each class by providing extra ground-truth sample variations of the non-shuffled MFCC features, which improves the overall training robustness. Furthermore, higher values of \(R_{tv}\) are preferable in situations with lower SNR values, higher reverberation times RT60, and in complex scenarios such as the multi-speaker one, since this technique helps average out the short-term noise, reverberation, and interference in the microphone channel signals.

3.4 Number of MFCC coefficients N

The effect of the number of MFCC coefficients N on our joint system performance is shown in Fig. 4. Without loss of generality, we chose case 9 to represent the single-speaker scenario and case 11 for the multi-speaker scenario, with \(R_{tv}=5\). The higher the number of MFCC coefficients N, the better the performance the system can achieve. However, this gain comes at the cost of increased feature extraction overhead and more trainable parameters in our model. Therefore, we fixed N to 64 in the other figures and tables.

Fig. 4 The effect of the number of MFCC coefficients on the system accuracy. Here \(R_{tv}\) is fixed to the value of 5

3.5 Two-microphone array spacing \(d_{TMA}\)

Figure 5 shows that our system performance is only slightly affected by the distance \(d_{TMA}\) between the two microphones of the array used to capture the speaker utterances, in both single and multi-speaker scenarios. The reason is that our features are based on MFCC calculations, one step of which takes the power (magnitude) of the DFT coefficients, so the phase information, which depends on the microphone locations, is discarded. In the other figures and tables, we used \(d_{TMA}=10\) cm.

Fig. 5 The effect of the distance between the two microphones \(d_{TMA}\) on the system accuracy. Here \(R_{tv}\) is fixed to the value of 5

Table 1 Speaker identification and localization results in the single-speaker scenario with \(C=40\) speakers
Table 2 Speaker identification and localization results in the multi-speaker scenario with \(C=10\) authorized speakers

3.6 Deployment duration \(P_t\)

Choosing the duration of the deployment time \(P_t\) has a great effect on the system’s accuracy and robustness.

Figure 6 shows the advantage of increasing \(P_t\) for the identification and localization performance in both single and multi-speaker scenarios. A longer duration \(P_t\) leads to a longer block length \(B_t\), which increases the number of SHMFCC/DSHMFCC features in the block; this strengthens the single net score in the soft decision technique (SDT) or the number of votes in the hard decision technique (HDT) and therefore gives better results. In real-time applications, one should choose \(P_t \le 0.5\) s to avoid delays and keep up with speaker movements. For off-line applications, on the other hand, a longer \(P_t\) gives better outcomes. In the other figures and tables we chose \(P_t=0.5\) s.

Fig. 6 The effect of the deployment duration \(P_t\) (equivalently, the deployment block length \(B_t\)) on the system accuracy. Here \(R_{tv}\) is fixed to the value of 5

3.7 Deployment shuffle rate \(R_{t}\)

In the deployment phase, extra post-processing can be done by setting the shuffle rate \(R_t \ge 2\). Here, each block \(B_t\) is shuffled \(R_t\) times, which increases the number of SHMFCC/DSHMFCC features by the same factor. This gives better performance by strengthening the single net score in the soft decision technique (SDT) or the number of votes in the hard decision technique (HDT). Table 3 shows the effect of choosing \(R_t= 1\) and \(R_t = 2\) in the single and multi-speaker scenarios. Both the identification and localization results improve with the higher value of \(R_t\).

Table 3 Effect of \(R_t\) on system accuracy

3.8 Training data size

In general, any well-designed machine learning system performs better when the training data is large enough and representative; correspondingly, the system performance deteriorates with small training data sizes. Figure 7 takes case #10 from the multi-speaker scenario as an example to demonstrate the effect of the training data size on our proposed approach using the shuffled features (SHMFCC/DSHMFCC) in comparison to the traditional non-shuffled features (MFCC/DMFCC). The x-axis represents the duration of each speaker’s utterance per position used to train/validate the model. In the case of the shuffled features, we compensated for the shortage of training data with the help of the new data augmentation technique by increasing the shuffling rate \(R_{tv}\) so that the data size is the same in all cases (i.e., using \(R_{tv}=5\) for the 8-s training utterance corresponds to using \(R_{tv}=20\) for the 2-s training utterance).

In Fig. 7, we see that reducing the training utterance duration decreases the performance of both feature types. However, the shuffled features suffer less degradation than the non-shuffled features. For example, reducing the training data to 25% of its original size decreases the identification accuracy by 17.4% (from 98.92 to 81.52%) when using the shuffled features, whereas using the non-shuffled features drops the identification accuracy by 46.76% (from 72.84 to 29.08%). The same comparison also holds for the localization problem. One can also notice that using the shuffled features with 2-s training utterances gives better performance than using the non-shuffled features with 8-s training utterances.

Fig. 7 The effect of the training dataset size on the performance of the joint identification/localization system based on the new shuffled features (SHMFCC/DSHMFCC) and the non-shuffled features (MFCC/DMFCC)

3.9 Speech enhancement and WER

We chose the application of speech enhancement to demonstrate the effect of the DOA estimation accuracy. Here, we took 40 s of noisy speech recordings from cases #10 and #12 in the multi-speaker scenario and utilized the two-microphone array to localize the authorized speakers using our approach and the SRP-PHAT algorithm. Then, we used a linear beamforming microphone array with \(BFM =3\) and \(BFM = 5\) microphones spaced 5 cm apart to apply the Minimum Variance Distortionless Response (MVDR) beamformer to the received speech signal, first steering the array’s main beam to the estimated DOA and then filtering the noise from other directions given the noise profile of the unauthorized speakers. Finally, we used the Word Error Rate (WER; 0%: best, 100%: worst) as a metric to compare all layouts.
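As a minimal sketch of the WER metric used here, assuming the jiwer package and transcripts produced by an off-the-shelf speech recognizer (the sentences below are illustrative placeholders):

```python
import jiwer

reference  = "turn the lamp on in the living room"   # ground-truth transcript
hypothesis = "turn the lamp in the leaving room"     # recognizer output
print(f"WER = {100 * jiwer.wer(reference, hypothesis):.1f}%")  # 0% best, 100% worst
```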

As shown in Table 4, our approach achieves a lower localization error even with a shorter deployment duration \(P_t\), and hence achieves a lower WER. Moreover, increasing the number of beamforming microphones BFM enhances the speech further, as the main beam becomes narrower and has a greater gain. For example, in case #12 with \(P_t =0.5\) s and \(BFM = 5\), our approach estimates the DOA with an MAE of 1.9\(^\circ\) and reduces the WER by 75%, whereas the SRP-PHAT algorithm gives a DOA estimate with an MAE of 16.1\(^\circ\) and reduces the WER by 61.9%.

Table 4 Speech Enhancement and WER

4 Conclusion

We presented an end-to-end DNN acoustic model to identify and localize active speakers in single and multi-speaker scenarios under noisy and reverberant conditions. The model uses the newly proposed SHMFCC and DSHMFCC features and achieves accuracy that outperforms the baseline and conventional methods in all scenarios and conditions. Moreover, we showed how to use the new features as a data augmentation technique to overcome a shortage of training data. Finally, we demonstrated the effect of our system’s accuracy, compared to the baseline, in the application of microphone beamforming and speech enhancement.