1 Introduction

Speaker Recognition (SR) is the process of identifying speakers based on the vocal features (aka voice biometrics) of their speech samples, whereas speech recognition is concerned with recognizing the content of speech rather than the speaker [29]. SR tasks include speaker verification, speaker diarization [1], speaker de-identification, etc.; using SR to identify an unknown speaker from a set of stored, known speaker models is called Speaker Identification (SI). In essence, SI compares an unknown user's voice biometrics against many known biometric profiles and finds the best or exact match. Among the most prominent applications of speaker identification are mail automation tasks, automated labeling of speakers in a conversation, user identification and authorization, and acoustic forensics.

Deep learning advances have enabled Deep Neural Networks (DNNs) to produce accurate speech and speaker recognition systems. However, neural network-based models require adjusting and processing many weight and bias parameters, which makes them computationally expensive and complicated and requires high-end computing machines with powerful processing units [33]. Hence, their use on lower-capacity smart devices is limited by the run-time resources that large-scale applications demand. One solution is to implement the SI DNN engine on a powerful server and deliver its services via the cloud or over a network (i.e., online SI); another is to employ SI engines with a smaller memory and computation footprint, such as rapid i-vectors [42]. The problems with the first solution are the additional cost of the server, the availability of the network, and security and privacy concerns with respect to transmitting authentication data over the network. The second solution, in turn, does not benefit from DNNs' strong generalization capabilities. On the other hand, an offline DNN-based SI with a small memory and computation footprint that still delivers high accuracy can address these shortcomings.

Most traditional approaches to SI model all speaker voice biometrics via one classifier (aka learner), meaning that the learner is responsible for storing and providing all speaker models [17]. In active learning theory, this is known as the Single-Learner (SL) paradigm. A supervised active learning framework called Multi-Learner was proposed by [39], offering accuracy improvements for pattern recognition tasks comprising two or more views. An example of views is describing an image by its visual features (i.e., view 1) and the words surrounding it (i.e., view 2) in a web image retrieval system. Another example is web page classification, where a page can be classified by the words on the page and the words on other pages linked to it, where the former is considered view 1 and the latter view 2. If several learners are used to approximate the views, the learning theory is known as Multi-View Multi-Learner (MVML).

Multi-learner frameworks consider additional views as supplementary data to help improve the performance of the main view (i.e., the first view in the examples above). They employ these additional views as meta-data to provide more information about the main view. It is pertinent to note that the primary objective of multi-learners remains approximating the first view, while the additionally created views are only considered to improve this objective. Nevertheless, incorporating such extra views makes the approximation task more complicated as there is more information for the learner(s) to process and learn. Considering DNNs as learners means MVML requires additional DNNs to model the supplementary views and assist the main DNN in approximating the main view; this increases the required computational resources significantly, but superior approximation accuracy can be achieved in comparison to Multi-View Single-Learner (MVSL) models.

An improvement to multi-learner, Enhanced Multi-Active Learner (EMAL), redefines views and describes how they are perceived and modeled via several learners [32]. Contrary to MVML, in which the objective is to improve the approximation of the main view by considering other views and modeling them via supplementary learners, EMAL aims to distribute the complexity of the main view among several learners. Via EMAL, each learner is responsible for learning an aspect of the main view without increasing the size, number, or complexity of the view or function. EMAL achieves this by breaking down the main view into smaller, less complex views and then distributing the modeling of these new views among the learners.

In terms of DNNs, the EMAL simplification of the main view may also streamline the DNN structural and computational complexity since the overall simulation/approximation task assigned to each DNN in the network is also simplified; hence, each DNN requires fewer parameters as a result of dealing with less approximation complexity. This architectural simplification can help rectify the long training times and high computational complexity associated with DNNs. Nevertheless, EMAL applications in SI have remained unexplored, as traditional MVSL approaches dominate the SI research community.

Furthermore, the common SI approach employs a combination of acoustic features extracted from sequential sounds to represent the speakers' voice biometrics. Typically, 5 to 10 sound segments before and after the current segment are stacked to form the input [28]. Stacking sound segments is particularly important for speech recognition technologies, as words are broken down into sounds (or phonemes and phones), and successful identification of each word depends on recognizing the previous sounds. Nevertheless, the discriminative acoustic information used to identify a speaker is mostly embedded in how speech is uttered and not necessarily in the content of speech [15]. Likewise, the objective of text-independent SI is to identify the speaker and not the content of speech. Thus, our proposed system assumes the distinctive speaker features in an acoustic sound segment contain enough speaker-distinguishable information to identify the speaker regardless of the previously uttered sounds. Identifying a speaker using a single sound segment instead of a stack of sounds can further reduce the complexity of the DNN learner(s), as the input dimension is significantly reduced, which means the classifier needs to learn fewer acoustic features. While our initial study shows the single-sound SI approach has merits for smaller-scale, straightforward SI tasks [34], it is essential to investigate the effects of such a reduction of input dimensions on larger-scale, realistic, and challenging SI tasks.

The objectives of this paper are to propose an optimized system for text-independent, closed-set speaker identification based on Convolutional Neural Networks (CNNs) that leverages both the EMAL and single-sound approaches, and to benchmark its performance and DNN computational complexity. We intend to simplify the complexity of SI while utilizing the powerful knowledge discovery and generalization capabilities of CNNs, so that accurate SI is achieved and the smaller, less complex CNNs reduce the computational costs associated with deep learning-based SI systems. The proposed system applies the EMAL approach to SI to improve the efficiency of speaker identification tasks and integrates the single-sound segment SI approach to decrease the SI complexity. Its performance and DNN computational complexity were measured when it was applied to identify 630 speakers whose utterances were provided by the Texas Instruments/Massachusetts Institute of Technology (TIMIT) Acoustic-Phonetic Continuous Speech Corpus [14], and 1251 celebrities provided by VoxCeleb [25]. Here, the number of trainable deep learning parameters is considered the metric to measure the computational complexity of the proposed SI system. A comparative performance study with DNN-based SI systems that employed the same datasets is also provided.

The rest of the paper is organized as follows: Section 2 reviews the related SI works. The proposed system is explained in Section 3, and Section 4 describes the experiments. Finally, Section 5 discusses the results and presents the comparative study, while Section 6 concludes the paper.

2 Related work

A typical speaker identification system goes through an enrollment procedure, in which speaker models are created and stored by the learner, and a matching procedure that explores the modeled speakers for a match [15]. Both procedures begin by receiving the input speech signals, extracting acoustic features that represent the speech parameters in a form understandable by the system, and feeding these features to the learner/classifier. Hence, the feature extractor and the learner are the two primary components of SI systems.

As an acoustic feature extraction approach, Mel-Frequency Cepstral Coefficients (MFCCs) offer a practical way to decompose an acoustic signal into its phones or sounds and to represent them via frames of acoustic features that can be modeled by various machine learning algorithms. MFCC features have been widely applied in various speech processing tasks, such as impaired speech recognition and heart anomaly detection [19]. Regarding SI, MFCC and its variations are the most frequently applied feature extraction approach. Current speaker identification systems, especially those that leverage deep learning algorithms to store and model the speakers, employ a combination of stacked acoustic features extracted from sequential acoustic frames to represent the speakers' voice biometrics, as explained in the previous section.

2.1 i-vector based Speaker Identification

In terms of the SI learner, Gaussian Mixture Models (GMMs) were traditionally among the most popular methods in different speaker recognition tasks. The first published application of GMMs in SI is [27], in which GMMs were used to provide a multi-classification statistical model of speakers' data by modeling the distribution of each speaker's MFCCs. Recent improvements to GMM-based SI systems are i-vector based approaches that employ zeroth-, first-, and second-order statistics produced by GMMs.

Today, one of the most successful classes of learning algorithms in speech processing is deep learning. Recent deep learning implementations for SR highlight that the complexity involved in SR requires special attention compared to traditional pattern recognition problems [28]. In terms of SI, deep learning-based approaches have outperformed other traditionally popular ones, such as i-vector based approaches, on large-scale applications [40, 43]. This is because deep neural nets learn features discriminatively instead of the generative approach that GMM/i-vector frameworks apply [35]. For example, Chen et al. [8] proposed a bilevel SI framework that used sparse coding with no Gaussian assumption to improve i-vectors, and employed a softmax classifier and a linear Support Vector Machine. Evaluated on the VoxCeleb 1 dataset, the authors achieved a highest top-1 accuracy of 67.2%, which is lower than the DNN performances reported in the literature on the same dataset.

2.2 Deep learning based speaker identification

An example of employing deep learning in SI is utilizing Restricted Boltzmann Machines (RBMs) to assist feature extraction in a hybrid noise-robust SI model proposed by [46]. The study evaluated the model on speech data corrupted by factory noise, destroyer engine room noise, and speech-shaped noise. Three sets of models based on gammatone features, gammatone frequency cepstral coefficients, and MFCCs were generated for speakers selected from the NIST Speaker Recognition corpus; the MFCC-based models achieved the best benchmarks in most experiments. An investigation into the effects of depth and layer-wise training using deep autoencoders (DAEs) in SI was also conducted in [41].

Recently, deep CNN approaches to identifying speakers have proven to be very useful because of CNNs' effectiveness in modeling real-world, noisy data without requiring specific feature engineering [16, 38]. In this respect, [2] proposed an SI system using a deep CNN that was verified using connected speech samples provided by TIMIT. The network employed 32 and 64 filters for its convolutional layers, each followed by a max-pooling layer, and the outputs of these filters were fed to two dense layers. Similarly, Nagrani et al. [25] adopted the VGG-M CNN architecture [7] and designed a closed-set SI model verified on noisy speech samples collected from 1251 celebrities in the VoxCeleb 1 corpus. Another example of large-scale CNN-based SI is [4], in which VGG and Residual Neural Networks were studied.

2.3 Hybrid and ensemble-based speaker identification

Ali and his colleagues [2] presented a hybrid SI approach that employed different learners to perform different tasks in the SI pipeline. Particularly, feature extraction was done by applying a Deep Belief Network (DBN) to extract unsupervised features, which were then combined with MFCCs. Principal Component Analysis (PCA) was applied for the linear transformation of the features. Then, the features were pipelined through multiple learning algorithms, including RBMs, K-Means, and Support Vector Machines (SVMs), and finally mapped to different speaker models. Similarly, a DBN-GMM SI was proposed and verified on a custom corpus by [40]. Another hybrid approach is [36], in which a GMM-DNN SI approach was proposed to improve the performance of identifying speakers' emotions. Their solution delivered better performance than conventional perceptron neural networks when applied to a customized Arabic dataset.

Based on document classification's hierarchical attention network (HAN) [44], Shi et al. [37] proposed another hybrid SI approach called H-Vectors. This approach aimed to find which segments of an utterance contribute more towards identifying the speaker. The proposed HAN architecture was composed of three components: (1) a frame-level encoder and attention layer consisting of a CNN, a Gated Recurrent Unit (GRU) [9], and an MLP, (2) a segment-level encoder that includes another CNN and MLP, and (3) a dense, fully connected DNN with two layers. They verified this architecture with the NIST SRE 2008 part 1 (SRE08), Call Home American English Speech (CHE), and Switchboard Cellular Part 1 (SWBC) datasets, achieving accuracies of up to 98.5%, 92.8%, and 86.2%, respectively.

An ensemble neural network approach to SI using Probabilistic Neural Networks (PNNs), General Regression Neural Networks (GRNNs), and RBFs was studied by [3], in which each neural net was responsible for probing the data differently to fit the training audio features. In particular, one network was trained on all the data, another was trained only on the data showing a margin of error, and the third was trained on the data with no error margin. The model was evaluated on the GRID speech corpus and showed improvements in recognition time and accuracy over traditional approaches. Section 5 provides a performance comparison of the SI systems mentioned above.

The literature does not report any EMAL implementation of speaker identification; the single-learner concept remains the dominant approach among SI researchers. Thus, it is important to investigate whether EMAL benefits are achievable in SI and what advantages EMAL active learning offers.

It is also essential to highlight the difference between SI approaches that use different types and numbers of learners to perform different views or tasks in the SI pipeline (for example, [2, 27]), or repeat modeling the same view(s) using different variations of learners (such as Bagging ensemble learning [26]), and EMAL based SI systems. EMAL systems use a network of learners to improve the performance of learning the main view by distributing the main view’s complexity among several learners. Notably, each learner in an EMAL-based SI has a specific responsibility that is different from other learners (i.e., no redundancy of views) and contributes towards performing a different aspect of the overall task, whereas in bagging ensemble learning each view is modeled several times using a different machine learning algorithm. To put it differently, EMAL stores only one model of each view, but ensemble learning stores several models of each view. The literature usually refers to ensemble methods as the collection of learners that are variations of the same learner. Likewise, a broader category is known as multi-classification systems in which the hybridization of different learners is considered [12].

As an illustration, the example above of using ensemble learning in SI [3] applied three different types of neural networks, while each neural net represented all speaker models (i.e., all of the views); thus, there were three variations of speaker models. Given an utterance, the final SI outcome was calculated by majority voting (bagging), where the speaker model that obtained 2/3 of the votes was selected as the identified speaker. The next section explains EMAL SI in detail.

Regarding single-sound segment SI, we recently investigated whether this approach to SI is feasible and identified the best parameters and MFCC configuration [34]. We showed that speaker identification systems could operate by relying on the distinctive acoustic features that an individual segment of speech presents (such as an MFCC frame) without relying on previous speech segments. In our previous study, we conducted more than one hundred experiments in which small MVSL SI systems were created using dense, fully connected neural networks, considering different SI parameters. The initial results indicated that speaker identification using one sound segment is possible and that results are on par with traditional stack-based SI approaches when the number of speaker models is small. We applied the previous study's findings to set the acoustic feature parameters in the present study, as stated in the next section. Nevertheless, the neural network used in the previous study and MVSL active learning provided poor results when speech samples of all TIMIT speakers were given. Thus, the present study focuses on transferring the knowledge from that experience to design an EMAL single-sound segment SI system that can handle complex SI tasks while reducing DNN trainable parameters.

3 The proposed system

3.1 Formulating speaker identification

Suppose the speaker identification task is defined as a function approximating speaker models S by receiving a speech sample \(x\in X\), i.e., x is one of the input speech signals from set X. Based on the number of speakers to be identified, S is composed of multiple speaker models:

$$S=\{s_{1}, s_{2},\cdots , s_{n}\}$$
(1)

where n is the number of speakers, and \(s_i\) is the ith speaker model (i = 1 to n). Given a speech sample x obtained from one of the speakers in S, speaker identification \(S^*\) can be defined as the mapping of X to the individual speaker models in Eq. 1 by finding the most probable match, as stated by Eq. 2:

$${S}^{*}=\underset{S}{\text{argmax}} \ P\left(S|X\right)$$
(2)

3.2 Features extraction

The first step in speaker identification is to prepare the input signals X and extract their distinctive acoustic features in a feature extraction process. In either the enrollment (aka training) or matching (aka inference) phase, the speech samples must be pre-processed to remove speech frames representing silence, since such frames may confuse the learners. Silence frames tend to be similar regardless of the speaker and do not contain discriminative speaker-dependent data. The proposed system uses 20 ms segments (aka frames); thus, any silence equal to or longer than 20 ms should be removed from the input utterances. Alternatively, an additional speaker model can be created to model silence segments.

It is important to select a feature extraction method that best presents the acoustic characteristics of the signals X because \(S^*\) refers to these features to associate X with each speaker model \(s_i\). Among the different acoustic feature extraction methods, MFCCs are constructed using frequencies of the vocal tract and present acoustic signals in the cepstral domain, which employs the FFT to represent short windowed signals as the real cepstrum of X. MFCCs are inspired by our natural auditory perception mechanism; hence, MFCC frequency bands are spaced equally on the Mel scale [5]. Although MFCCs ignore some acoustic information, they still preserve sufficient distinguishable data [10]. This attribute has made MFCCs one of the most popular acoustic feature extraction methods in speech and speaker recognition tasks.

Applying MFCCs to an input speech signal x results in a 2D tensor of Acoustic Features AF where columns are frames representing sound segments, and rows are MFCC coefficients for each frame. Thus, S* becomes the mapping of AF to speaker models S:

$${S}^{*}=\underset{S}{\text{argmax}} \ P\left(S|AF\right)$$
(3)

3.3 The single-sound segment approach

Acoustic Features AF in Eq. 3 is a 2-dimensional matrix of sequential sound segments extracted from x that are represented as frames (aka segments) of MFCC features; each segment is a row in the AF matrix that may represent a sound or phone. Although this 2D representation of utterance x is vital to identify its content and requires processing multiple segments, identifying speech content is not the objective of speaker identification. Thus, the single-sound segment SI approach assumes one frame of speech features (i.e., features corresponding to an individual sound segment) contains enough information to distinguish between speaker models S [34]. Hence, in the proposed system, the feature extraction process presents the speaker's voice biometrics as a single MFCC frame AF′, which means speaker identification \(S^*\) relies only on the information provided by AF′ to approximate speaker models S, without considering the frames before or after AF′. As such, the proposed system redefines speaker identification as given by Eq. 4:

$${S}^{*}=\underset{S}{\text{argmax}} \ P\left(S|AF^{\prime}\right)$$
(4)

It is important to note that AF′ in Eq. 4 is a 1D tensor in comparison to 2D tensor AF, and its dimension is considerably smaller, as stated by Eq. 5:

$$\left|AF^{\prime}\right|=\frac{\left|AF\right|}{k}, \quad k=\text{the number of sound segments in } AF$$
(5)

In this study, each speech frame contains 60 MFCCs, meaning each AF′ tensor is composed of 60 coefficients.
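
As an illustration of this input representation, the following minimal Python/NumPy sketch shows how the 2D AF matrix extracted from one utterance can be split into independent 60-dimensional AF′ samples. The function name is hypothetical, and the assumption that segments lie along the first axis is illustrative rather than part of the implementation described above.

```python
import numpy as np

def to_single_sound_samples(af_matrix: np.ndarray, speaker_id: int):
    """Split a 2D AF matrix into individual 60-dim AF' frames (Eq. 4).

    Assumes one sound segment per row; transpose the matrix first if the
    segments are stored as columns. Each frame becomes an independent
    training sample labeled with the speaker it was extracted from.
    """
    assert af_matrix.ndim == 2 and af_matrix.shape[1] == 60
    frames = np.ascontiguousarray(af_matrix, dtype=np.float32)   # AF'_{i,j} vectors
    labels = np.full(len(frames), speaker_id, dtype=np.int32)    # all frames belong to s_i
    return frames, labels
```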

3.4 Enhanced multi-active learner

The approximation task in Eq. 4 can be done by one or many learners L:

$$L=\left({l}_{1},{l}_{2},\cdots , {l}_{m}\right)$$
(6)

where m is the number of learners. Traditional neural net-based SI approaches employ a single learner (i.e., m = 1) to approximate \(S^*\), applying the single-learner paradigm. The complexity of this task requires a neural net with many neurons and parameters to learn the acoustic features of all speaker models \(s_i\). On the other hand, by applying EMAL, the complexity of \(S^*\) can be distributed among a network of learners by setting m > 1. This distribution decreases the complexity of the required neural network learners and may also improve the classification performance, as the number of positive-response acoustic features modeled by a learner is reduced. A positive response for the ith speaker refers to any acoustic frame AF′ that was extracted from any speech signal x uttered by speaker \(s_i\).

Suppose:

$${AF}_{i,j}^{\prime} = \left( {af}_{i,j,1}^{\prime},{af}_{i,j,2}^{\prime},\dots , {af}_{i,j,60}^{\prime}\right)$$
(7)

where \({AF}_{i,j}^{\prime}\) is the jth acoustic segment for speaker model \(s_i\), extracted from a speech signal uttered by the ith speaker, and contains the 60 extracted MFCC features. AF′ contains many samples of \({AF}_{i,j}^{\prime}\) to represent how the ith speaker pronounces different sounds, as shown by Eq. 8:

$$\begin{array}{c} D({AF}_{i}^{\prime}) = D\left({af}_{i,j,1}^{\prime}\right)\times D\left({af}_{i,j,2}^{\prime}\right) \times \dots \times D\left({af}_{i,j,60}^{\prime}\right)\\ \text{for } i=1 \text{ to } n, \text{ and } j=1 \text{ to } p\end{array}$$
(8)

where p is the total number of sound samples for the ith speaker, and \(D\left({af}_{i,j,y}^{\prime}\right)\) is the number of all possible yth MFCC coefficients (y = 1 to 60) extracted from the ith speaker's jth sound segment. In simple terms, \(D({AF}_{i}^{\prime})\) is the number of all MFCC features, extracted from all sound samples, that \(S^*\) must process to approximate speaker model \(s_i\) in Eq. 4, and \(D({AF}^{\prime})\) is the same quantity for all speaker models S from Eq. 1.

In an EMAL approach to SI, a network of learners L is used to learn AF′, where each learner \(l_i\) is responsible for approximating \(s_i\). Hence, \(l_i\) only needs to learn the features presented in \({AF}_{i}^{\prime}\) for its positive responses, whereas single-learner SI systems use one learner \(l_1\) to learn AF′. It is pertinent to note that \(D({AF}_{i}^{\prime})\) for positive responses is considerably smaller than \(D({AF}^{\prime})\) because the number of learnable speech features per \(s_i\) is less than that of S, as denoted by Eq. 9:

$$\begin{array}{c} D({AF}_{i}^{\prime})=\frac{D({AF}^{\prime})}{o}\\ o=\text{the number of sound samples associated with } s_i\end{array}$$
(9)

To put it differently, each learner in an EMAL-based speaker identification system only learns the acoustic features directly associated with one speaker instead of all speaker features. Combining the reductions in the number of learnable coefficients provided by Eqs. 5 and 9, it can be concluded that each EMAL learner \(l_i\) needs to process significantly fewer speaker-dependent features belonging to its class than a single-learner SI. In particular, each neural network \(l_i\) acts as a binary classifier that decides whether a given AF′ vector is associated with the speaker \(s_i\) it is responsible for modeling, since it only needs to map the input frame vector to \(s_i\) instead of S.

3.5 The single-sound EMAL-based system

Figure 1 depicts the proposed system, in which n = m (n is the number of speaker models, and m is the number of learners), meaning each speaker is modeled by an individual learner. After silence segments are removed from X (when necessary), each utterance is represented by several sound segments of 60-dimensional MFCCs, indicated by AF′. Each \({AF}_{i,j}^{\prime}\) (Eq. 7) needs to be appropriately labeled with its associated speaker \(s_i\) and stored to be used for training. The proposed system employs a network of CNN learners L to learn the voice biometrics of the speakers, where each \(l_i\) is associated with one speaker \(s_i\); hence, for n speaker models, n learners are required.

Since each learner performs a binary classification, it only needs one sigmoid output neuron that calculates the probability of the given \({AF}_{i,j}^{\prime}\) representing speaker model \(s_i\). During the training phase, each \({AF}_{i,j}^{\prime}\) is individually given to all learners; the learner responsible for the speaker whose voice biometrics frame is given has its target output set to one (\(y_{max}\)), while the remaining learners receive a zero (\(y_{min}\)) as their target to show that the given sound segment does not belong to the speaker they represent.

For instance, let us assume i = 1 in Fig. 1, which means \({AF}_{1,j}^{\prime}\) is one of the sound segments representing the first speaker model \(s_1\). In this case, the target for \(l_1\) is set to one (positive response), while the rest of the learners \(l_i\) (i ≠ 1) receive a zero target (negative response) while speaker enrollment is being performed. Next, for i = 2, \({AF}_{2,j}^{\prime}\) is a sound associated with \(s_2\), so \(l_2\) receives a one as its target and the remaining learners receive a zero. This process continues by providing each sound sample for each i and j per training epoch.
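
A minimal sketch of this one-versus-rest target assignment is given below; learners are indexed from 0 for illustration, and the helper name is hypothetical.

```python
import numpy as np

def emal_targets(frame_speaker: int, n_learners: int) -> np.ndarray:
    """Per-learner targets for one AF'_{i,j} frame.

    The learner responsible for the frame's speaker receives y_max = 1
    (positive response); every other learner receives y_min = 0
    (negative response).
    """
    targets = np.zeros(n_learners, dtype=np.float32)
    targets[frame_speaker] = 1.0
    return targets   # targets[i] is the label fed to learner l_i for this frame

# Example: a frame uttered by speaker s_2 in a 5-speaker system
print(emal_targets(2, 5))   # -> [0. 0. 1. 0. 0.]
```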

Nevertheless, this labeling strategy results in an imbalanced training dataset for each CNN \(l_i\), because only \(\frac{1}{n}\) of the training samples are positive against \(\frac{n-1}{n}\) negative samples (n is the number of speaker models). This problem can be resolved by increasing the class weight of the positive-response training samples and assigning smaller weights to the other samples during speaker enrollment (i.e., training), as explained in [11]. This approach instructs the \(l_i\) optimizer to pay increased attention to sound segments extracted from speech samples uttered by the ith speaker.
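
One simple way to realize this re-weighting is an inverse-frequency class weight passed to the learner's training routine. The sketch below assumes roughly the same number of frames per speaker and is only one possible instantiation of the weighting strategy described above, not necessarily the exact scheme of [11].

```python
def positive_class_weight(n_speakers: int) -> dict:
    """Class weights for training learner l_i as a binary classifier.

    About 1/n of the frames are positive and (n-1)/n are negative, so the
    positive class is up-weighted by (n - 1) to roughly balance the loss.
    """
    return {0: 1.0, 1: float(n_speakers - 1)}

# Passed to tf.keras Model.fit, e.g.
#   learner_i.fit(frames, targets, class_weight=positive_class_weight(630), ...)
weights_timit = positive_class_weight(630)       # {0: 1.0, 1: 629.0}
weights_voxceleb = positive_class_weight(1251)   # {0: 1.0, 1: 1250.0}
```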

Once the training procedure is complete, given an unforeseen speech frame (with all silence traces removed, if required), the EMAL SI should be able to highlight the speaker-distinguishable acoustic features presented by the sound segment and relate them to one of the speaker models. This is done by feeding the unidentified sound segment to all learners and querying them to relate the data to the speaker model they represent. Each CNN output is the likelihood that the given sound segment was uttered by the speaker the CNN is responsible for. The learners' outputs are then given to a softmax function to be squashed into a probability distribution, and the CNN providing the highest probability determines the identified speaker.
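
A sketch of this matching procedure is shown below, assuming each element of `learners` is one of the trained binary CNNs from Section 4.4; the function and variable names are illustrative.

```python
import numpy as np

def identify(frame: np.ndarray, learners) -> int:
    """Match one unidentified 60-dim AF' frame against all speaker models.

    `learners` is the list of trained binary CNNs (one per speaker); each one
    returns the likelihood that its speaker uttered the given segment. The
    scores are squashed with a softmax and the most probable speaker wins.
    """
    x = frame.reshape(1, 60, 1)                                   # a batch of one sound segment
    scores = np.array([l.predict(x, verbose=0)[0, 0] for l in learners])
    exp = np.exp(scores - scores.max())                           # numerically stable softmax
    probs = exp / exp.sum()                                       # probability distribution over speakers
    return int(np.argmax(probs))                                  # index of the identified speaker s_i
```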

Given the proposed system is a closed-set speaker identification system, it requires all speakers/users to be enrolled in the system in advance. In case there are new speakers, new learners have to be added to L to match the number of learners with the number of speakers, and all learners need to be re-trained following the process explained here.

Fig. 1 The proposed system

4 Experimental setup and results

This section describes the experimental setup, evaluation methodology, and results.

4.1 Datasets

Our experiments were conducted over two corpora. The first corpus was TIMIT, which contains phonetically rich, clean speech samples from 630 speakers, with ten utterances per speaker. The research community has widely used this dataset in different speech processing tasks. We considered speech material from all 630 TIMIT speakers to verify the proposed system. All ten utterances per speaker were employed, of which eight were used to provide the training sound segments and the remaining two to extract the test segments. We conducted 10-fold repeated random sub-sampling validation [6], randomly changing the training and testing utterances for each fold.
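
A sketch of one fold of this split is shown below, assuming a hypothetical `utterances` dictionary that maps each of the 630 speakers to their ten TIMIT utterance paths.

```python
import random

def timit_fold_split(utterances: dict, seed: int):
    """Repeated random sub-sampling: per speaker, 8 utterances for training
    and the remaining 2 for testing, re-drawn randomly for every fold."""
    rng = random.Random(seed)
    train, test = {}, {}
    for speaker, utts in utterances.items():          # 630 speakers, 10 utterances each
        picked = set(rng.sample(utts, k=8))
        train[speaker] = [u for u in utts if u in picked]
        test[speaker] = [u for u in utts if u not in picked]
    return train, test

# Ten folds with different random train/test assignments:
#   folds = [timit_fold_split(utterances, seed=fold) for fold in range(10)]
```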

The second dataset, VoxCeleb 1 [25], contains more than 153 K gender-balanced utterances collected from 1251 celebrities captured from videos uploaded to YouTube. Overall, the dataset contains 352 h of speech. The participants in this dataset were 55% male, and speech samples were collected from speakers with a diverse range of ethnicities, professions, ages, and accents. The audio samples reflect real-world environments, including utterances with background chatter and music, room reverberations, laughter, etc. There are an average of 116 utterances per speaker, and the average length of each sample is 8.2 s.

Similar to the baseline experiment conducted on this dataset [25], we used around 145 K utterances for training and 8 K for testing, including all speakers provided in the corpus. Like the baseline system, we did not apply any cross-validation procedure, since any experiment involving the full VoxCeleb dataset is significantly resource-consuming.

Comparing TIMIT and VoxCeleb 1, the latter increases the complexity of speaker identification because (1) VoxCeleb audio samples include different types of background noise profiles, and (2) the number of speakers is almost twice that of TIMIT.

4.2 Evaluation criteria

The performance of the proposed system was measured using two criteria. The first criterion was speaker identification accuracy (aka top-1 accuracy), defined as the proportion of speakers correctly identified based on the testing sound segment data. Accuracy conveys the practicality of \(S^*\) for identifying speakers based on the given acoustic sound segment during testing.

The second criterion was Normalized Root Mean Squared Error (NRMSE):

$$NRMSE= \frac{\sqrt{MSE}}{{y}_{max}-{y}_{min}}=\frac{\sqrt{MSE}}{1-0}$$
(10)

where \(y_{max}\) and \(y_{min}\) were one and zero, respectively, as explained in Section 3. NRMSE was considered to measure the system's error rate in terms of how close the results generated by the SI were to the target results. In particular, a lower NRMSE implies \(S^*\) is more capable of making precise predictions. It is pertinent to note that NRMSE was calculated based on the results obtained from all sigmoid output neurons, before the softmax function was applied.
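
With \(y_{max}=1\) and \(y_{min}=0\), Eq. 10 reduces to the root mean squared error of the raw sigmoid outputs against their 0/1 targets, as in the sketch below (names are illustrative).

```python
import numpy as np

def nrmse(sigmoid_outputs, targets, y_max: float = 1.0, y_min: float = 0.0) -> float:
    """Normalized root mean squared error (Eq. 10), computed on the raw
    per-learner sigmoid outputs before the softmax is applied."""
    mse = np.mean((np.asarray(sigmoid_outputs) - np.asarray(targets)) ** 2)
    return float(np.sqrt(mse) / (y_max - y_min))
```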

4.3 Experimental setup

All experiments were implemented in Python. Feature extraction was done via Python Speech Feature Extraction library [24], and the CNN learners were implemented on Google’s TensorFlow framework.

For the TIMIT experiments, another Python library called Pydub [11] was used to automatically remove any trace of silence from the speech utterances before feature extraction was performed. The silence threshold was set according to the decibels relative to full scale (dBFS) of each utterance. Silence removal was not necessary for the VoxCeleb experiment, as the audio samples in that dataset do not include traces of silence.
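
The sketch below outlines this pre-processing pipeline with Pydub and the Python Speech Feature library; the silence-threshold offset, filterbank size, and FFT length are illustrative assumptions rather than values reported here.

```python
import numpy as np
from pydub import AudioSegment, silence             # Pydub for silence removal
from python_speech_features import mfcc             # Python Speech Feature library [24]

def utterance_to_frames(wav_path: str, num_coeffs: int = 60, seg_len: float = 0.02):
    """Remove silences of 20 ms or longer, then extract one 60-dimensional
    MFCC vector (AF' frame) per 20 ms segment of the remaining speech."""
    audio = AudioSegment.from_wav(wav_path).set_channels(1)
    voiced = silence.split_on_silence(audio,
                                      min_silence_len=20,              # milliseconds
                                      silence_thresh=audio.dBFS - 16)  # threshold offset: assumption
    speech = sum(voiced, AudioSegment.empty())                         # re-join the voiced chunks
    signal = np.array(speech.get_array_of_samples(), dtype=np.float32)
    return mfcc(signal, samplerate=speech.frame_rate,
                winlen=seg_len, winstep=seg_len,
                numcep=num_coeffs, nfilt=num_coeffs + 4,               # filterbank size: assumption
                nfft=1024)                                             # shape: (num_segments, 60)
```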

4.4 CNN architecture

The convolutional setup of the CNNs was inspired by [23]. Each CNN \(l_i\) comprised two convolutional layers with 32 and 64 filters, respectively, each followed by a max-pooling layer for down-sampling the feature maps. Nevertheless, we did not apply the standard 2D windows to the feature maps, since the input sound segments were 1D tensors of 60 MFCCs rather than 2D tensors of multi-frame MFCCs. In particular, both convolutional layers applied a 3⋅1 window to the feature maps, and down-sampling was done by a kernel of size 4⋅1. The convolutional layers had unit strides (i.e., 1⋅1), while the max-pooling strides were 2⋅2.

To identify the dense-layer hyperparameters, initial experiments on a small subset of the dataset (20 TIMIT speakers: ten females, ten males) were conducted in advance. The hyperparameters were then tuned and selected by a grid search algorithm [20], in which 2 to 4 dense layers with 32, 64, and 128 neurons and different activations were trialed. The EMAL CNN architecture that provided the best accuracy was selected and applied to the full datasets. As a result of the grid search, the architecture of each CNN \(l_i\) was selected as shown in Fig. 2; the remaining hyperparameters are provided in Table 1.

Fig. 2 CNN \(l_i\) architecture

Table 1 DNN architecture and hyperparameters

During the TIMIT experiment, the training data were shuffled after every run to apply cross-validation, and batch training was stopped when a loss lower than 0.05 was reached or 300 epochs were completed. The EMAL SI needed 630 of the CNNs shown in Fig. 2 to model TIMIT and 1251 CNNs to model VoxCeleb 1, one per \(s_i\), and they all followed the same architecture and hyperparameters. Accordingly, each CNN needed to adjust 22,913 trainable parameters. In order to avoid overfitting, dropout regularization with a relatively high drop rate of 50% was applied to each fully connected layer of the CNNs.
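
For reference, a tf.keras sketch of one learner \(l_i\) consistent with the convolutional setup above is given below. The dense-layer widths, activations, and optimizer are placeholders (the selected values appear in Fig. 2 and Table 1), so the parameter count of this sketch does not reproduce the reported 22,913 exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_learner(num_coeffs: int = 60) -> tf.keras.Model:
    """One binary CNN learner l_i: two 1D convolutions (32 and 64 filters,
    3x1 windows, unit stride), each followed by 4x1 max-pooling with stride 2,
    dense layers with 50% dropout, and a single sigmoid output neuron."""
    model = models.Sequential([
        tf.keras.Input(shape=(num_coeffs, 1)),            # one 60-dim AF' frame
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=4, strides=2),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=4, strides=2),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),              # width/activation: assumption (Fig. 2)
        layers.Dropout(0.5),
        layers.Dense(32, activation="relu"),              # width/activation: assumption (Fig. 2)
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),            # P(frame belongs to speaker s_i)
    ])
    model.compile(optimizer="adam",                       # optimizer: assumption (Table 1)
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```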

4.5 Results

The evaluation results of the experiments with the proposed SI system are shown in Table 2. Note that, for the TIMIT experiments, the training loss in each row is the average value obtained from all 630 CNNs in that fold.

Table 2 The proposed SI system experimental results

5 Discussion

Accuracy and NRMSE standard deviations for the TIMIT experiments were 0.26% and 0.94%, respectively. According to Table 2, the accuracies and NRMSEs obtained during these experiments followed a normal distribution, since 70% of the observations were within one standard deviation of the mean, and all observations fell within two standard deviations of the mean. This normality indicates the reliability of the results obtained.

The low NRMSEs in Table 2 show that the proposed SI's predictions were close to the targets, which implies the robustness of using a single sound segment to identify speakers. Similarly, the high accuracies indicate the applicability of using a single sound to perform complex SI tasks.

5.1 Comparative study

To highlight the advantages of the proposed system, a comparative study with state-of-the-art neural network-based speaker identification systems published in the literature is presented in Table 3. Additionally, we selected the SI systems explained in [23] and [25], highlighted in Table 3, for direct benchmarking since they both adopted a deep CNN-based approach, reported results on the same corpora, and considered the same number of speakers we used in our experiments. The remaining SI systems shown employed a significantly smaller number of speakers or were applied to different datasets. Reducing the number of speakers decreases the complexity of the SI task; hence, a direct accuracy comparison of large- and small-scale SI systems is not fair since they belong to different complexity classes.

Table 3 DNN-based speaker identification systems comparative study

It is pertinent to note that deep learning-based speaker verification (SV) systems were not included in this table for two reasons. First, the objectives of SV and SI are different. Mainly, SV's objective is to apply voice biometrics to verify a user's claimed identity, while SI intends to find the closest match of an unlabeled speech sample to a set of stored speaker models [15]. Although the enrollment process is similar in both tasks, testing is different, as explained in [42]. Second, a direct comparison of SV and SI systems is impossible as they use different performance indicators. In particular, the Equal Error Rate (EER), a metric that looks at false positive and false negative ratios, is most commonly used to measure the performance of speaker verification systems, whereas accuracy is the typical performance indicator for speaker identification systems [15].

In the first baseline system [23], TIMIT speech data were presented to a CNN as one-second spectrograms of 128⋅100 pixels. The CNN consisted of two convolutional layers with 32 and 64 filters, respectively (similar to the CNNs used in our study), each followed by a max-pooling layer (pooling size 4⋅4 and stride 2⋅2). The convolutional layers were followed by two dense layers with 6300 and 3150 neurons. The output layer consisted of 630 softmax neurons (one neuron per speaker). Similar to our study, 20% of each speaker's data was used for testing and the rest for training. The CNN delivered 97% accuracy, but no cross-validation was applied. Such a CNN requires more than 279 million trainable parameters, with a large memory and computational footprint. In comparison, as shown in Table 2, our proposed EMAL SI achieved a minimum accuracy of 97.80% and a maximum of 98.61% on TIMIT cross-validation folds 2 and 6, respectively. This is an improvement of up to 1.61% achieved via significantly less complicated CNNs, as explained in the next section.

The CNN used in the second baseline system [25] adopted a different convolutional architecture than that of [23]. VoxCeleb provides significantly more training data than TIMIT but imposes more complexity: it includes more speakers and contains different types of background noise, which resulted in the lower accuracy reported for this dataset. The VoxCeleb 1 CNN was composed of five convolutional layers, four pooling layers, and two dense layers with 4096 and 1024 neurons, respectively, and received speech spectrograms of size 512⋅300 pixels as input. This neural network architecture translates into approximately 106 million trainable parameters. Similar to our experiment on the same corpus, the authors used around 8 K of the utterances for testing and the rest for training. Our proposed EMAL SI delivered 2.43% better accuracy than this benchmark.

5.2 Optimization and parameter explosion

Using a single sound facilitates the SI task as the dimension of the data the DNN learners process decreases significantly; this means deep learning models with fewer parameters are required compared to traditional SI approaches that use a chain of acoustic sounds. Similarly, training SI systems using a single sound segment is more convenient than the traditional approaches, where a minimum of three minutes of utterances from each speaker is recommended to achieve an acceptable level of accuracy, according to [22]. In contrast, in our experiments, high accuracy was achieved based on only around 25 s of training speech samples per speaker, which means around 87% less speech data was needed.

Concerning EMAL, it can be argued that using several CNNs may result in parameter explosion, but our experiments prove otherwise. In particular, each CNN comprising the proposed SI requires significantly fewer parameters as EMAL facilitates the learning task by distributing it among several learners. In our experiments, an EMAL CNN dealt with only 22,913 trainable parameters, meaning that the proposed SI needed overall 14 M trainable parameters to model all TIMIT speakers (22,913 parameters per CNN ⋅ 630 CNNs) in comparison to the benchmark CNN with more than 279 M trainable parameters, and 28 M parameters (22,913 ⋅ 1251) compared to the VoxCeleb 1 benchmark CNN with over 106 M parameters. Consequently, the EMAL SI optimization resulted in around a 95% reduction in the number of DNN trainable parameters over the TIMIT benchmark SI and 78% over the VoxCeleb 1 benchmark SI, yet the proposed SI system improved the performance over both benchmark systems. This complexity optimization of SI, resulting from integrating the single-sound SI concept and EMAL, can potentially enable devices within the lower-end processing power spectrum to perform accurate, offline speaker identification.

To put it differently, employing a deep learning model with hundreds of millions of parameters could be very challenging for any low-end processor, but the same processor can train and apply each CNN in the proposed SI sequentially, meaning that at any point in time it deals with a considerably smaller CNN. Likewise, less memory is required. In our experiments, we did not use any computer with GPUs; instead, we trained the proposed SI using three typical laptops, each assigned a range of EMAL CNNs to train in parallel.

5.3 Summary of contributions

The contributions of this study can be summarized as follows:

  1. 1.

    Speaker Identification using a single acoustic sound frame decreases the complexity of large-scale SI and facilitates it, since the SI learners need to learn fewer acoustic features in comparison to the traditional stack-based approaches.

  2. 2.

    Single-frame SI requires shorter training speech samples per speaker.

  3. 3.

    The EMAL framework decreases the SI structural complexity due to the reduction in trainable parameters.

  4. 4.

    The optimizations offered by the proposed system make SI more affordable since less expensive hardware may be required.

  5. 5.

    The proposed SI system offers state-of-the-art accuracies with considerably smaller CNNs, even over noisy speech data.

The proposed approach is open source and available from [31].

6 Conclusion

This paper proposed an optimized speaker identification system that integrates Enhanced Multi-Active Learners and the single-sound segment approach. A text-independent speaker identification system that employs a network of CNNs to learn the speaker models and distribute the complexity of speaker identification was proposed and evaluated. The speaker models were formed according to the speakers' voice biometrics in a single sound segment presented by an acoustic frame of 60-dimensional MFCCs. Overall, we conducted experiments with 1881 CNNs, including 630 CNNs for the TIMIT speakers and 1251 CNNs for the VoxCeleb 1 speakers, with a standalone CNN considered for each speaker model. Compared with similar CNN-based speaker identification systems trained and tested on the same speakers, the proposed system delivered comparable performance but significantly reduced the number of DNN trainable parameters. In particular, the proposed speaker identification system reduced the number of trainable parameters by up to 95% while delivering a top-1 accuracy of 98.61%.

Combining the reduction in the number of trainable parameters resulting from EMAL SI with the reduction in input dimension due to using a single sound segment, it can be concluded that the proposed SI system optimizes the complexity of challenging SI tasks. In other words, in an EMAL-based speaker identification system, each learner focuses solely on the acoustic features related to one speaker rather than encompassing all speaker features. Likewise, using the single-sound approach, only one frame of acoustic features is fed to the learners, in contrast to multiple stacked sequential frames, which means the input length is considerably smaller, as shown by Eq. 5. This may enable large-scale, offline speaker identification for devices without specific neural chips or GPUs, making SI more affordable. Finally, we can highlight the novelty of the proposed approach as follows:

  1. 1.

    The applications of EMAL in SI.

  2. 2.

    Large-scale speaker identification using a single sound segment.

  3. 3.

    Implementation of a CNN-based SI system that benefits from (1) and (2).

The source code of the proposed approach is available from [31].