Introduction

A speaker recognition system can be text-dependent or text-independent. In a text-dependent system, the utterance the speaker pronounces is restricted in some way (for instance, a fixed PIN or certain words in any order), whereas in a text-independent system the speaker can say whatever they want. Text-independent speaker recognition therefore verifies the identity of the speaker without any restriction on the speech content, which makes it more flexible for the speaker than text-dependent recognition. Since all the speakers in the database used here have different utterances, the proposed work is based on text-independent speaker recognition.

Automatic speaker recognition (ASR) is an active, state-of-the-art research area [1, 2]. Feature extraction and feature matching are two important processes in SR. Feature extraction produces a set of feature vectors, called descriptors, that reduce the redundancy present in the speech signal, while feature matching compares the feature vectors extracted from the signal of an unknown speaker with those extracted from the signals of a known speaker set [3,4,5]. Mel frequency cepstral coefficients (MFCCs) were introduced in the 1980s; they use the mel frequency scale, which reflects how humans perceive speech frequencies [6]. A comparison of various features was performed in Ref. [6], which concluded that MFCC and LPCC gave better performance than the other features considered [6]. The concept of dynamic features was introduced in 1981 by Furui [7] to capture the temporal variability of feature vectors. In addition, the transitions and energy modulations of the short-term frame energy also carry useful speaker information [7]. A major problem is the deterioration of ASR performance in the presence of additive noise [8]. Researchers have tried to make recognition systems noise-robust in three main ways: (a) adapting statistical models to noise (e.g., using parallel model combination) [9], (b) reducing the noise in speech signals [10, 11], and (c) applying noise-robust features. Several techniques have been designed to address the sensitivity of cepstral features to noise, such as Wiener filtering [12], spectral subtraction [13], RASTA [14], and lin-log RASTA [15]. However, improving the cepstral features themselves has not produced satisfactory results, so more research has been carried out on new features that are more robust to noise and can complement cepstral features.

In order to build a better SR model, it is important to extract additional speaker-dependent information, such as entropy, energy, centroid and prosodic information, including root mean square (RMS) values. Prosodic features, such as pitch, energy, and RMS, are comparatively less disturbed by channel differences and noise. Although systems based on spectral features, such as MFCC, perform better than prosody-based systems, their combination can provide the robustness needed by recognition systems [16, 17]. Prosodic features are those features of speech that deal with the auditory properties of sound, such as stress and pitch [16,17,18,19,20,21,22]. One important cause of SR performance degradation is that the MFCC, LPC, PLP, centroid, entropy, and RMS feature sets contain only static features. Static features do not capture the small, rapid changes in speech signals, whose values vary from frame to frame. Therefore, delta and delta–delta feature values are used to add this information and track how the features change over short intervals [18]. Many research papers show that models with delta values substantially improve the performance of SI and SV systems [18, 23]. Moreover, the results of many published papers [20, 24, 25] and their references make it clear that prosodic information can also be used to improve SR performance. Usually, feature-level fusion carries more information than a single feature and thus improves the performance of the SR system [26, 27]. This information-theoretic view of SR systems is explained clearly in Refs. [28, 29]. The main contributions of the proposed work are as follows:

  1.

    A new methodology for feature aggregation using all 18 features is proposed to obtain the best SR results using various combinations of spectral and temporal speech features with their delta and delta–delta values.

  2.

    The most effective common feature fusion model suitable for speech datasets of various sizes is proposed, for which various experiments are conducted using speech datasets of 5 different sizes.

  3.

    A total of 315 unique feature fusion models are tested on the NIST-2008 database across 18 feature fusion steps. The best 35 models (the best 2 models at each feature fusion step) are selected and tested on the remaining 4 datasets, which keeps the computation fast while still finding the best feature fusion model.

  4.

    The factors affecting the SR performance are investigated. Feature fusion models suitable for small, medium, and large size voice datasets are proposed.

This paper is organized as follows. The related work section (Sect. “Related Work”) reviews previous research on SI and SV systems, mainly using the ELSDSR, voxforge, VCTK, NIST-2008 and voxceleb1 audio databases. The proposed work and methodology section (Sect. “Proposed Work and Methodology”) explains the theoretical and practical description of the proposed research. Sect. “Evaluation” consists of the database descriptions and a discussion of the generated results. The conclusion and future work are described in Sect. “Conclusion and Future Work”.

Related Work

Here, we give a detailed overview of the popular methods used for SI and SV systems, mainly using the ELSDSR, VCTK, voxforge, NIST-2008 and voxceleb1 speech databases. Furui first used the feature concatenation approach for the joint use of cepstral and polynomial features in the form of delta and delta–delta coefficients [7]. The concatenation of MFCC and spectral features enhances the performance of the SR system [30]. In Ref. [31], it is shown that concatenating phase information with MFCC enhances speaker recognition performance. In another work, the authors jointly used the statistical pH feature and concatenated MFCC features and achieved better performance under noisy conditions [32].

In Ref. [33], SI implementation of score-level fusion and feature-level fusion is performed with ELSDSR audio data. The score-level fusion-based system with a support vector machine (SVM) gave the best identification rate of 100% among all the systems compared [33]. Score-level fusion and feature-level fusion were used in Ref. [34] to calculate SI accuracy using ELSDSR speech data, and the SI accuracy increased to 95.22% when score-level fusion was used with random forest (RF) and multiclass SVM classification. In Ref. [35], the authors show the potential of deep belief networks (DBNs) in the extraction of short-term spectral features. An accuracy of 95% is achieved by combining MFCC and DBN features with the Gaussian mixture model-universal background model (GMM-UBM) on the ELSDSR database. In Ref. [36], a two-step approach using the gender and voice information of speakers from ELSDSR was proposed, and the model obtained an improved accuracy of 99.9% with the GMM classifier. Paper [37] presented a simulation study on a transformation-based fusion algorithm for a multimodal biometric authentication system using an ensemble classifier with face and voice recognition modules. Using score fusion, a true positive rate of 99% and an accuracy of 99.22% are achieved on the ELSDSR voice dataset. The authors in Ref. [38] used the score fusion method with SVM and linear discriminant analysis (LDA) classifiers and MFCC, delta MFCC, and delta–delta MFCC features for GMM-UBM modeling. The best EER of 0.02 is obtained with LDA and cosine distance scoring using ELSDSR data. In Ref. [39], a new type of pipeline architecture was proposed, in which Gabor filter (GF) and convolutional neural network (CNN) features are fused and classified with RF, SVM, and deep neural network (DNN) classifiers on the ELSDSR database. The best accuracy of 94.87% is obtained using the RF classifier for 22 speakers.

In Ref. [40], the authors proposed a new type of feature extraction technique called the twofold information set (TFIS) for a text-independent SR system on three voice datasets, i.e., NIST-2003, voxforge (2015), and VCTK. On the voxforge 2014 database, for the clean voice dataset, the best performance accuracy of 100% and an EER of 0.02 were achieved. On the VCTK dataset, the best SI accuracy of 98.9%, an EER of 0.05 and a genuine acceptance rate (GAR) of 0.1% were achieved using TFIS features. In Ref. [41], the authors proposed a prototypical network loss (PNL)-based speaker embedding model and compared it with popular triplet loss (TL)-based models. The best SI test accuracy and EER achieved using the VCTK database with 90 speakers are 95.63% and 4.08%, respectively, using the PNL technique. Reference [42] showed how the accuracy of SI improves when delta and delta–delta features are fused with nondelta features. The best SI accuracy of 94% was achieved after the fusion of MFCC, delta MFCC and delta–delta MFCC with 18 feature vectors using the voxforge database. Reference [43] showed how the accuracy of SI is increased by fusing models: a new type of generalized fuzzy model (GFM) was implemented and combined with GMM and the hidden Markov model (HMM), and the HMM-GFM combination achieves an accuracy of 93% using voxforge data. In Ref. [44], the authors proposed a speaker verification approach that learns speaker-discriminative information directly from the raw speech signal using CNNs in an end-to-end manner. On the voxforge corpus, the proposed approach yielded a system that outperformed systems based on state-of-the-art approaches. In Ref. [45], a three-step score fusion method was proposed, and an SI system was tested with and without added white Gaussian noise (AWGN) and nonstationary noise (NSN) using MFCC, the power normalized cepstral coefficient (PNCC) and GMM-UBM acoustic modeling. The best SI accuracy of 95.83% was obtained when testing the clean speech data in NIST-2008 [45]. In Ref. [46], a comparison was made between the i-vector model and GMM-UBM on clean and noisy speech of 120 speakers from the TIMIT and NIST-2008 speech datasets with 7 types of score fusion techniques. The highest SI accuracy of 96.67% was achieved using the i-vector approach, while an SI accuracy of 95.83% was achieved using GMM-UBM on clean NIST-2008 data. In Ref. [47], the authors proposed bottleneck (BN) features based on multilingual deep neural networks. Experiments were done on the NIST SRE 2008 female short2-short3 telephone task (multilingual) and the NIST SRE 2010 female core-extended telephone task (English) audio datasets. Tian et al. [47] show that, compared to the deep neural network (DNN)-based approach, the BN feature-based model provides better results for the speaker verification system.

Research paper [48] used the i-vector and x-vector approaches and proposed attentive pooling of deep speaker embeddings for a text-independent speaker verification (SV) system using the voxceleb1 and NIST-2012 voice datasets. Four pooling techniques were used in Ref. [48]: (i) simple average pooling, (ii) statistics pooling, (iii) attentive average pooling, and (iv) attentive statistics pooling. From the experimental results, it was observed that the best EER of 3.85% was achieved using attentive statistics pooling (x-vector), and with the i-vector, the best EER of 5.39% was achieved when the voxceleb1 dataset was used for training and evaluation. In Ref. [49], a fully automated pipeline based on computer vision techniques was used to collect the voxceleb1 dataset from open-source media. Research paper [49] showed that a CNN-based architecture obtained the best SI accuracy of 80.5% in terms of top-1 classification accuracy, and the best EER of 7.8% was obtained for SV. Table 1 summarizes the related work in ASR using mainly the ELSDSR, voxforge, VCTK, NIST-2008 and voxceleb1 databases.

Table 1 Related work on speaker recognition using mainly the ELSDSR, voxforge, VCTK, NIST-2008 and voxceleb1 databases

Proposed Work and Methodology

Motivation

An analysis of other SR approaches was performed [33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49], and these approaches are explained in Sect. “Related Work”. It is found that, although these works used various feature fusion approaches to achieve the best SR results, there is still room for improvement. The authors believe that the SR result of Ref. [31] can be further improved if more features are tested with more classification methods; hence, this is performed in the proposed work. For a better comparison of SR performance, all the feature combinations in Ref. [34] should be considered. The proposed methods are tested on 315 different feature combinations with linear discriminant (LD), K nearest neighbor (KNN) and ensemble classifiers using 18 features in total with NIST-2008 data, and the 35 best feature combinations are selected for testing on the ELSDSR, voxforge, VCTK and voxceleb1 data to achieve the best SR performance. Banerjee et al. [35] showed that the performance of an SR system can be improved if more features are fused with MFCC. The authors believe that to build a better SR model, it is essential to extract additional speaker-dependent information, such as entropy, energy, centroid and prosodic information such as RMS values, as is done in the proposed work.

Prosodic features contain useful information that is different from the information in cepstral features. Thus, an increasing number of researchers from the SR area have shown interest in prosodic features [36].

Many studies have been performed on feature-level fusion [28, 32, 40,41,42], but they mainly involve the fusion of MFCC features with other features and with the MFCC delta and delta–delta values. The proposed method instead fuses MFCC, LPC, PLP, RMS, centroid, and entropy information and combinations of their delta and delta–delta values to further improve SR performance and to find a unique feature fusion model suitable for speech databases of various sizes. The main advantage of feature-level fusion is that it recognizes correlated feature values produced by different biometric algorithms, thereby determining a compact set of relevant features that can enhance SR accuracy while removing redundant features [48].

In addition, it is observed that other research on SR systems tends to use speech datasets of a single size; for example, Refs. [31, 32, 39] included only small speech datasets; Refs. [40] and [42,43,44,45,46,47] performed SR evaluation using only medium-size speech data (more than 100 speakers); and Refs. [48, 49] used large speech datasets. The proposed method, in contrast, is tested on speech datasets of various sizes to find one common model suitable for different sizes of speech datasets. Another problem in SR systems is how different speech features should be combined, and this problem is addressed by the proposed feature aggregation methodology. Furthermore, the best results obtained by the proposed method are compared with those of well-known SR methods, such as the x-vector, i-vector, and DNN approaches.

Feature Extraction

For the proposed work, the following features and their delta and delta–delta values are used. Mirtoolbox [50] in MATLAB is used to compute the feature vectors of MFCC, centroid, RMS and entropy.

Mel Frequency Cepstral Coefficient (MFCC)

MFCCs are among the most important and effective features for speech-related applications and have been the most popular feature extraction method in the field of ASR since the mid-1980s. The benefit of using MFCC is that it gives high accuracy on relatively clean audio data, and since this work mainly focuses on improving speaker recognition accuracy using less noisy data, MFCC is a good choice. The following steps are involved in the extraction of MFCC feature vectors [26, 51,52,53,54].

  • Framing: Because speech signals are continuous in nature, they are divided into frames of 20–40 ms for detailed analysis.

  • Windowing: Since speech signals are nonstationary, their parameters change approximately every 10 ms. Hence, a Hamming window is applied to each frame, with a frame shift of about 10 ms, to reduce the discontinuities at the frame edges.

  • FFT: Used for converting a time-domain speech signal into a frequency-domain signal.

  • Mel-Filter bank transformation: The filter bank is spaced on the mel scale, which follows the roughly logarithmic frequency resolution of the human auditory system. The mel scale is calculated using Eq. 1:

    $$ {\text{Mel}}\left( f \right) = 2595\,\log_{10} \left( {1 + \frac{f}{700}} \right). $$
    (1)

where f is the actual frequency of the speech signal in Hz; for example, f = 1000 Hz corresponds to approximately 1000 mel.

  • LOG: The logarithm of the mel filter bank energies is taken. The mel scale is approximately linear up to 1 kHz and logarithmic at higher frequencies.

The relationship between the frequency of speech and the mel scale can be established as (Eq. 2):

$$ {\text{Frequency }}\left( {\text{mel scaled}} \right) = 2595\,\log_{10} \left( {1 + \frac{{f\left( {{\text{Hz}}} \right)}}{700}} \right) $$
(2)
  • DCT: The last step is to calculate the discrete cosine transform, which de-correlates the speech features and arranges them in descending order of information. Hence, the first 13 coefficients are used as MFCC features for building the model [26, 51,52,53,54].
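As a concrete illustration of the steps above, the sketch below extracts a 13-dimensional MFCC descriptor per utterance in MATLAB. It assumes the Audio Toolbox mfcc() function is available (its name-value options vary slightly across releases), the file name is hypothetical, and the default framing roughly matches the 20–40 ms frames and 10 ms shift described above rather than reproducing the authors' exact settings.

```matlab
% Minimal MFCC extraction sketch (assumes Audio Toolbox; hypothetical file name).
[x, fs] = audioread('speaker01.wav');          % hypothetical input utterance
x = mean(x, 2);                                % mix down to mono
coeffs = mfcc(x, fs, 'LogEnergy', 'Ignore');   % one 13-D cepstral vector per frame
mfccVec = mean(coeffs, 1).';                   % 13x1 descriptor for the utterance
```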

Linear Predictive Coding (LPC)

LPC is commonly used because it is fast, simple, and able to extract and store time-varying formant information. It encodes the speech signal as a compact set of prediction coefficients that can be stored or transmitted efficiently. In addition, our previous research [19, 26] showed that fusing cepstral features, including LPC features, improves SR results; hence, LPC is adopted in this work as a feature extraction technique in order to improve the recognition rate.

LPC predicts the current sample as a linear combination of past samples. Inverse filtering is performed to remove the formant structure from the speech signal, and the remaining signal is called the residue [55]. The LPC features are quantized using the VQ-LBG algorithm; to reduce the bit rate, VQ is applied to the LPC features in the line spectral frequency (LSF) domain.

It is important to understand the autoregressive (AR) model of speech in order to understand LPCs. An audio signal can be modeled as a pth-order AR process, where each sample is given by Eq. (3):

$$ x\left( n \right) = - \mathop \sum \limits_{k = 1}^{p} a_{k} x \left( {n - k} \right) + u\left( n \right) $$
(3)

Each sample at the nth instant depends on the ‘p’ previous samples plus a Gaussian noise term u(n). The LPC coefficients are the $a_{k}$.

The Yule–Walker equations are used to estimate the coefficients. The autocorrelation at lag l, denoted R(l), is given by Eq. (4):

$$ R\left( l \right) = \mathop \sum \limits_{n = l + 1}^{N} x\left( n \right)x\left( {n - l} \right) $$
(4)

The final form of Yule–Walker equations is given by Eqs. (5) and (6):

$$ \mathop \sum \limits_{k = 1}^{p} a_{k} R\left( {l - k} \right) = - R\left( l \right),\quad l = 1, \ldots ,p $$
(5)
$$ \left[ {\begin{array}{*{20}c} {R\left( 0 \right)} & \cdots & {R\left( {p - 1} \right)} \\ \vdots & \ddots & \vdots \\ {R\left( {p - 1} \right)} & \cdots & {R\left( 0 \right)} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {a_{1} } \\ \vdots \\ {a_{p} } \\ \end{array} } \right] = - \left[ {\begin{array}{*{20}c} {R\left( 1 \right)} \\ \vdots \\ {R\left( p \right)} \\ \end{array} } \right] $$
(6)

Equation 7 gives the final solution for the LPC coefficients, where R is the p × p autocorrelation matrix of Eq. (6) and $r = \left[ {R\left( 1 \right), \ldots ,R\left( p \right)} \right]^{T}$:

$$ a\, = \, - R^{ - 1} r $$
(7)

We have used only the first 13 LPC coefficients in order to reduce system complexity. Details of the proposed LPC features, the VQ-LBG algorithm, and its calculation steps can also be found in Refs. [55,56,57].
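The sketch below shows one way these coefficients can be obtained in MATLAB, either with the Signal Processing Toolbox lpc() function or by solving Eqs. (4)–(7) directly; the file name is hypothetical, and the two estimates agree only up to minor numerical differences in how the autocorrelation is computed.

```matlab
% LPC via the autocorrelation (Yule-Walker) method.
[x, fs] = audioread('speaker01.wav');    % hypothetical input utterance
x = mean(x, 2);
p = 13;                                  % model order, as in the proposed work
aFull = lpc(x, p);                       % aFull = [1, a_1, ..., a_p]
lpcVec = aFull(2:end).';                 % 13x1 LPC coefficient vector

% Equivalent "by hand" solution of Eqs. (4)-(7):
R = xcorr(x, p, 'biased');               % autocorrelation at lags -p..p
R = R(p+1:end);                          % keep lags 0..p, i.e. R(0)..R(p)
A = toeplitz(R(1:p));                    % p-by-p autocorrelation matrix of Eq. (6)
r = R(2:p+1);                            % [R(1); ...; R(p)]
a = -A \ r;                              % Eq. (7): a = -R^{-1} r
```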

Perceptual Linear Prediction (PLP)

PLP features are comparatively robust to noise, reverberation, and channel effects, which can improve performance. In addition, in our previous research papers [19, 26], the fusion of cepstral features, including the PLP feature, improved SR results; hence, PLP is used in this work to enhance the speaker recognition rate. The following steps are involved in the extraction of PLP features:

  • PLP is very similar to MFCC. It uses equal loudness pre-emphasis and cube-root compression. PLP rejects speaker-irrelevant spectral detail and thus increases the recognition rate. PLP is closely related to LPC, except that the spectrum is first transformed to match the characteristics of the human hearing system.

  • A Hamming window and the FFT are applied to the audio samples to convert them into the frequency domain, and the result is transformed into a power spectrum. This spectrum is warped onto the Bark scale using the approximation in Eq. 8:

    $$ \Omega \left( \omega \right) = 6 \cdot \ln \left( {\frac{\omega }{1200 \cdot \pi } + \sqrt {\left( {\frac{\omega }{1200 \cdot \pi }} \right)^{2} + 1} } \right) $$
    (8)

    where \(\omega\) is the angular frequency in rad/s and Ω represents the bark frequency.

  • The Bark-scaled power spectrum is then convolved with the critical-band masking curve; the frequency resolution of the ear is approximately constant on the Bark scale. The resulting samples of the critical-band power spectrum, with Ψ(Ω) approximating the critical-band curve, can be written as follows (Eq. 9):

    $$ \theta \left( {\Omega_{t} } \right) = \mathop \sum \limits_{\Omega } P\left( {\Omega - \Omega_{t} } \right)\Psi \left( \Omega \right) $$
    (9)
  • Equal loudness pre-emphasis is done in order to compensate for the non-equal perception of loudness at different frequencies using the following Eq. 10:

    $$ E\left( {\Omega \left( \omega \right)} \right) = E\left( \omega \right) \cdot \theta \left( {\Omega \left( \omega \right)} \right) $$
    (10)

Here, E(ω) is an equal-loudness curve that approximates the non-equal sensitivity of human hearing, and the left-hand side of Eq. 10 is the loudness-equalized critical-band spectrum. The perceived loudness Γ(Ω) is then calculated by taking the cube root of this intensity, which is known as the power law of hearing (Eq. 11):

$$ \Gamma \left( \Omega \right) = \sqrt[3]{E\left( \Omega \right)} $$
(11)
  • In the final step of PLP, \(\Gamma \left( \Omega \right)\) is approximated by the spectrum of an all-pole model using the autocorrelation method. The inverse DFT (IDFT) is applied to \(\Gamma \left( \Omega \right)\) to yield the autocorrelation function, from which the autoregressive coefficients are obtained; these can be further converted into other parameters of interest, such as cepstral coefficients.

The PLP feature extraction steps are clearly explained in Refs. [15, 58, 59]. Thirteen PLP coefficients are calculated per frame in MATLAB [58, 59], and to reduce system complexity and make the PLP feature dimension equal to that of the other features, the mean over all frames is taken, resulting in a 13 × 1 feature vector per audio file.
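Base MATLAB has no built-in PLP routine; a common choice is the third-party rastamat toolbox by D. Ellis. The sketch below assumes that toolbox is on the path (its rastaplp() interface, as written here, is an assumption on our part) and only illustrates the per-utterance averaging described above.

```matlab
% PLP sketch assuming the third-party rastamat toolbox (rastaplp) is on the
% MATLAB path; dorasta = 0 disables RASTA filtering, and a model order of 12
% yields 13 cepstral coefficients per frame.
[x, fs] = audioread('speaker01.wav');        % hypothetical input utterance
x = mean(x, 2);
cep = rastaplp(x, fs, 0, 12);                % 13 x numFrames PLP cepstra (assumed layout)
plpVec = mean(cep, 2);                       % 13x1 descriptor, averaged over frames
```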

Figure 1 explains the feature extraction steps for MFCC, LPC, and PLP features.

Fig. 1
figure 1

MFCC, LPC and PLP feature extraction steps

Spectral Centroid (SC)

SC defines the center of gravity of the magnitude spectrum of the short-time Fourier transform and gives a single value that represents the frequency-domain characteristic of a speech signal. A larger SC value indicates that more of the signal's energy is concentrated at higher frequencies [60]. The spectral centroid gives a noise-robust estimate of how the dominant frequency of a signal varies over time. As our previous research [60] showed that fusing the spectral centroid with MFCC enhances speaker recognition performance, it is also used in the proposed work. It is computed using Eq. (12): let $x_{i}(n)$, n = 0, 1,…, N–1, be the samples of the ith frame and $X_{i}(k)$, k = 0, 1,…, N–1, the discrete Fourier transform (DFT) coefficients of that sequence. The centroid C(i) is then calculated as follows:

$$ C\left( i \right) = \frac{{\mathop \sum \nolimits_{k = 0}^{N - 1} k\left| {X_{i} \left( k \right)} \right|}}{{\mathop \sum \nolimits_{k = 0}^{N - 1} \left| {X_{i} \left( k \right)} \right|}} $$
(12)
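A minimal frame-wise implementation of Eq. (12) with basic MATLAB/Signal Processing Toolbox functions is sketched below; the frame length and hop are illustrative values, x denotes a mono speech signal as loaded in the earlier sketches, and the centroid is returned in DFT-bin units (multiply by fs/N for Hz).

```matlab
% Spectral centroid per frame, Eq. (12).
N   = 512;                                   % illustrative frame length
hop = 256;                                   % illustrative hop size
frames = buffer(x, N, N - hop, 'nodelay');   % one frame per column
Xmag = abs(fft(frames, N));                  % magnitude spectrum of each frame
Xmag = Xmag(1:N/2+1, :);                     % keep non-negative frequencies
k = (0:N/2).';
C = sum(k .* Xmag, 1) ./ (sum(Xmag, 1) + eps);   % centroid (in bins) per frame
```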

Spectral Entropy (SE)

Spectral entropy measures the flatness of the power spectral density of a signal and is computed in the following manner. The use of spectral entropy as an additional feature improved recognition accuracy in Ref. [61].

  • For a given signal x(t), the power spectral density s(f) is computed as the Fourier transform of the autocorrelation function of x(t).

  • The power in the spectral band of interest is extracted and then normalized so that it sums to one.

  • The spectral entropy is calculated using Eq. 13 [48]:

    $$ {\text{SE}} = \mathop \sum \limits_{f} s\left( f \right)\ln \frac{1}{{s\left( f \right)}} $$
    (13)
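The normalization and Eq. (13) can be written compactly as below; frame is an assumed variable holding one windowed frame of the signal, and eps guards against taking the logarithm of zero.

```matlab
% Spectral entropy of one frame, Eq. (13).
P  = abs(fft(frame)).^2;            % power spectrum of the frame
P  = P(1:floor(end/2) + 1);         % band of interest: non-negative frequencies
s  = P / (sum(P) + eps);            % normalized spectral distribution s(f)
SE = sum(s .* log(1 ./ (s + eps))); % SE = sum of s(f) * ln(1/s(f))
```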

Root Mean Square (RMS)

Prosodic features such as pitch, energy, RMS, and duration are less affected by channel differences and noise. Although systems based on spectral features, such as MFCC, give better SR performance than prosody-based models, their combination may provide the robustness needed by recognition systems [16, 17]. RMS is a measure of the loudness of an audio signal and is found by taking the square root of the mean of the squared amplitudes of the sound samples. The RMS formula is given in Eq. 14 [62], where x1, x2,…, xn are n observations and xrms is their RMS value:

$$ x_{{{\text{rms}}}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}^{2} } $$
(14)
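Equation (14) reduces to a one-liner in MATLAB; the frame-wise variant below mirrors how the other short-time features are computed, with frames as produced by buffer() in the centroid sketch.

```matlab
% RMS level, Eq. (14): square root of the mean of the squared amplitudes.
xrms = sqrt(mean(x .^ 2));                 % one value per utterance
rmsPerFrame = sqrt(mean(frames .^ 2, 1));  % one value per frame
```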

Delta Features

Delta features measure the rate of change of the static features over time and thus capture small dynamic variations in the speech signal. Delta (Δ) and delta–delta (ΔΔ) features are therefore used to encode this short-term dynamic information [7, 42]. The performance of the system with and without delta values is observed in the proposed work. For a feature fk at frame index k, Δ (Eq. 15) and ΔΔ (Eq. 16) are calculated as

$$ \Delta_{k} = f_{k} - f_{k - 1} $$
(15)
$$ \Delta \Delta_{k} = \Delta_{k} - \Delta_{k - 1} $$
(16)
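With the static features arranged as a frames-by-dimensions matrix F (an assumed variable), Eqs. (15) and (16) are simple first differences along the frame index; note that many toolkits instead use a regression window, whereas the sketch below follows the definitions above. The zero row keeps the delta matrices the same size as F.

```matlab
% Delta and delta-delta features, Eqs. (15)-(16), for a static feature
% matrix F of size numFrames x numFeatureDims.
d  = [zeros(1, size(F, 2)); diff(F, 1, 1)];   % delta:       f_k - f_{k-1}
dd = [zeros(1, size(F, 2)); diff(d, 1, 1)];   % delta-delta: d_k - d_{k-1}
fused = [F, d, dd];                           % feature-level fusion by concatenation
```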

Classification

The LD (Fig. 2), KNN (Fig. 3), and ensemble (Fig. 4) classification techniques are used for the proposed work. All classification tasks are performed using the classification learner application in MATLAB.

Fig. 2
figure 2

Linear discriminant (LD) classification

Fig. 3
figure 3

KNN classification

Fig. 4
figure 4

Ensemble classification

For the proposed work, LD is used as a classifier. LD approaches obtain a linear combination of features to distinguish two or more classes, and the resulting combination is used as a linear classifier. All the features are used in the model for LD classification.

LD uses Bayes’ theorem to estimate class probabilities. For an output class k and input x, Bayes’ theorem estimates the probability that the data belong to each class as follows (Eqs. 17 and 18):

$$ P\left( {Y = k|X = x} \right) = \frac{{\pi_{k} \,f_{k} \left( x \right)}}{{\mathop \sum \nolimits_{l = 1}^{K} \pi_{l} \,f_{l} \left( x \right)}} $$
(17)
$$ \pi_{k} = n_{k} /n $$
(18)

In the above equations, $\pi_{k}$ is the prior probability of class k, i.e., the base rate of that class in the training data; $f_{k}(x)$ is the estimated class-conditional probability density of x, modeled with a Gaussian distribution; $n_{k}$ is the number of training instances in class k; n is the total number of instances; and K is the number of classes. Substituting the Gaussian density into the equation above and simplifying yields Eq. 19. This is a discriminant function, and the class with the greatest value is the output classification (y):

$$ D_{k} \left( x \right) = x\,\frac{{\mu_{k} }}{{\sigma^{2} }} - \frac{{\mu_{k}^{2} }}{{2\sigma^{2} }} + \ln \left( {\pi_{k} } \right) $$
(19)

$D_{k}(x)$ is the discriminant function for class k given input x; $\mu_{k}$, $\sigma^{2}$ and $\pi_{k}$ are all estimated from the training data. Detailed explanations of LD classification can be found in Refs. [63, 64].
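Outside the Classification Learner app, an equivalent linear discriminant model can be fitted programmatically with the Statistics and Machine Learning Toolbox; Xtrain, Ytrain, Xtest and Ytest are assumed variables holding the fused feature matrices and speaker labels.

```matlab
% Linear discriminant classification (programmatic equivalent of the app).
ldModel = fitcdiscr(Xtrain, Ytrain, 'DiscrimType', 'linear');
[predLabels, scores] = predict(ldModel, Xtest);   % scores can feed the ROC/EER analysis
accuracy = mean(predLabels == Ytest);             % assumes categorical speaker labels
```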

Figure 2 shows the scatter plot of predicted data points for 22 speakers using the LD classifier when tested on ELSDSR speech data for the combination of MFCC, ΔΔentropy, and ΔPLP features, which consists of a total of 13 (MFCC) + 1 (ΔΔentropy) + 13 (ΔPLP) = 27 feature dimensions. Feature 1 (x-axis) and feature 2 (y-axis) of these 27 dimensions are shown in Fig. 2 for the 22 speakers (22 different classes). Dots of the same color belong to the same speaker (class), dots of different colors belong to different speakers, and an × marks an incorrectly predicted sample.

The KNN approach classifies an unknown data point based on its similarity to its neighbors. Here, K is the number of nearest neighbors considered; for the proposed work, K is taken as 1. The KNN algorithm can be summarized in the following steps:

  • Select the number K of neighbors.

  • Calculate the Euclidean distance from the query point to the training points.

  • Select the K nearest neighbors according to the calculated Euclidean distances.

  • Count the number of data points belonging to each class among these K neighbors.

  • Assign the new data point to the class that is most frequent among its K neighbors.

The KNN algorithm is explained in Ref. [65]. Figure 3 shows the structure of KNN for two different speakers.
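Programmatically, the K = 1 setting used in the proposed work corresponds to the following call; Xtrain, Ytrain and Xtest are the same assumed variables as in the earlier sketch.

```matlab
% 1-nearest-neighbour classification with Euclidean distance (K = 1).
knnModel = fitcknn(Xtrain, Ytrain, 'NumNeighbors', 1, 'Distance', 'euclidean');
predLabels = predict(knnModel, Xtest);
```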

The ensemble classification method helps improve the SR results by combining different models and reducing the risk of overfitting [66]. The random subspace ensemble method is used with a discriminant learner (30 learners and a subspace dimension of 5) in the proposed work. The random subspace approach, also called attribute bagging or feature bagging, is a machine learning algorithm that combines the predictions of multiple base learners trained on different subsets of the feature columns of the training dataset; it decreases the correlation between the estimators in an ensemble by training them on random samples of features instead of the full feature set [67]. The fitcensemble function in MATLAB is used to train the ensemble classifier. Let X be a data matrix in which each row contains a single observation and each column contains a single predictor variable, and let Y be the vector of class labels, with one entry per row of X. Figure 4 shows the ensemble classification creation details and the information used for ensemble classification.

An ensemble of models using the random subspace method can be constructed with the following algorithm.

  • Assume that the number of training points is N and the number of features in the training data is D. Let L be the number of individual models in the ensemble.

  • For each individual model l, select nl (nl < N) to be the number of input points for l. It is usual to use the same value of nl for all the individual models.

  • For each individual model l, generate a training set by selecting dl features from D with replacement and train the model.

  • Now, to apply the ensemble model to an unseen point, combine the outputs of the L individual models by majority voting [66].
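The random subspace settings quoted above (30 discriminant learners, subspace dimension 5) map onto fitcensemble roughly as follows; this is a sketch of the configuration, not a verbatim reproduction of the authors' script.

```matlab
% Random subspace ensemble of discriminant learners (30 learners, 5 features
% drawn per learner), mirroring the settings described above.
ensModel = fitcensemble(Xtrain, Ytrain, ...
    'Method', 'Subspace', ...
    'Learners', 'discriminant', ...
    'NumLearningCycles', 30, ...
    'NPredToSample', 5);
predLabels = predict(ensModel, Xtest);
```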

Feature Fusion and Model Optimization Steps

Feature Fusion Methodology

Because testing every possible feature combination is impractical, a heuristic feature aggregation methodology is developed using one dataset (NIST-2008): at each fusion step, the 2 models with the highest SI accuracy and the lowest SV EER are retained, and only these are tested on the remaining 4 datasets, which greatly reduces the number of models to evaluate. To increase the probability of obtaining an effective model that is suitable for all the databases used, the two best models are selected at each evaluation step on the NIST-2008 database, which involves training and testing a total of 315 models. If selecting the two best models at each step did not give satisfactory results, the three best models would be selected instead, and the same feature fusion methodology would continue until a final effective model suitable for all the databases is found; this would involve training and testing approximately 475 models and make the feature fusion considerably more complex. Table 2 shows the total number of models tested when the best 1, 2 and 3 models are selected in each step of feature fusion. Figures 5 and 6 explain the feature fusion methodology and workflow, respectively. Figure 7 shows the computational steps for the SI and SV systems.

Table 2 Total number of models tested on the NIST-2008 database
Fig. 5
figure 5

Methodology for feature fusion using NIST-2008

Fig. 6
figure 6

Flow diagram for selecting the best model using NIST-2008

Fig. 7
figure 7

SI/SV computation steps

Model Optimization

The main goal of selecting the best 2 models at each step of feature fusion using NIST-2008 data and of the model optimization is to arrive at a common effective model that is suitable for all sizes of speech datasets. Table 2 shows the total number of models tested when the best 1, 2 and 3 models are selected in each step of feature fusion. At some steps, fewer models are evaluated because combinations that repeat an already selected feature set are skipped.
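As a rough check on these counts: step 1 evaluates all 18 single-feature models, and each later step evaluates (beam width) × (number of features not yet selected) candidate models, giving an upper bound of

$$ 18 + 1 \times \left( {17 + 16 + \cdots + 1} \right) = 171,\quad 18 + 2 \times 153 = 324,\quad 18 + 3 \times 153 = 477 $$

models for beams of 1, 2 and 3, respectively; skipping the candidates that repeat an already selected feature set reduces the beam-2 and beam-3 totals to the 315 and approximately 475 models actually trained.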

Optimization Algorithm

The following steps show how different features are fused and how the best models are selected. Figure 8 shows how the models are optimized to obtain an effective model suitable for various sized voice datasets. Table 3 shows the dimensions of each feature.

  1.

    MFCC, LPC, PLP, centroid, RMS, and entropy features and their delta and delta–delta feature vectors are extracted for all 5 voice datasets using MATLAB software. Table 3 shows the total number of feature vectors extracted for one audio file. The input matrix files are created in MATLAB using single features and combinations of features, and each feature vector is labeled with its speaker identity so that the MATLAB Classification Learner application can be used for training and testing (Figs. 5, 6).

  2.

    The first step of feature fusion involves training and testing all 18 features individually and then selecting the best 2 features with the highest SI accuracy and lowest average EER among all 18 features. To select the best model, the average accuracy and average EER values of the three classifiers are considered. PLP and MFCC are the first- and second-best models, respectively, because they have the highest average accuracy values of 80% and 61.1% and the lowest EER values of 5.2% and 15.4%, respectively, compared to other features. Equation 20 shows the calculation of the average accuracy and EER values using all three classifier results:

    $$ {\text{Average result}} = \frac{{{\text{LD}} + {\text{KNN}} + {\text{ensemble }}\left( {\text{accuracy or EER}} \right)}}{3} $$
    (20)
  3.

    In the second step, 2 features are fused by combining the best features, MFCC and PLP, separately with the remaining 17 features, and again, the best two models are selected from this step. The two best models from this step are the MFCC and Δentropy fusion model and the PLP and LPC fusion model.

  4.

    In the third step, 3 features are fused by combining the remaining 16 features separately with the two best models selected from step 2. Fusion of 4 to 18 features is performed in the same way, and the two best models are selected at each step. A total of 315 models are tested on the NIST-2008 data, and 35 feature models are selected for testing on the remaining 4 databases for fast computation.

  5.

    The best models that obtain the highest accuracy and lowest EER values on each speech dataset are selected, and one common effective model that obtains the best result on all 5 datasets is found.
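The selection procedure in these steps amounts to a greedy beam search over feature subsets with a beam width of 2. The sketch below captures that logic; evaluateModel is a hypothetical helper that would train the LD, KNN and ensemble classifiers on the given feature subset and return the Eq. (20) average score, and duplicate candidate sets are not pruned here for brevity.

```matlab
% Greedy feature-fusion search, keeping the 2 best models at every step.
% evaluateModel() is a hypothetical helper returning the Eq. (20) average result.
base = {'MFCC', 'LPC', 'PLP', 'Centroid', 'RMS', 'Entropy'};
featNames = [base, strcat('d', base), strcat('dd', base)];    % 18 feature blocks
beamWidth = 2;
beam = {false(1, numel(featNames))};          % feature-subset masks kept in the beam
for step = 1:numel(featNames)
    candMasks = {};
    candScores = [];
    for b = 1:numel(beam)
        for f = find(~beam{b})                % try adding each unused feature block
            m = beam{b};
            m(f) = true;
            candMasks{end+1} = m;                             %#ok<AGROW>
            candScores(end+1) = evaluateModel(featNames(m));  %#ok<AGROW> hypothetical helper
        end
    end
    [~, order] = sort(candScores, 'descend');
    beam = candMasks(order(1:min(beamWidth, numel(order))));  % keep the 2 best models
end
bestSets = cellfun(@(m) featNames(m), beam, 'UniformOutput', false);
```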

Fig. 8
figure 8

Model optimization

Table 3 Feature dimensions

Evaluation

Database Preparation

The following five voice databases are used in the proposed work; they are described below. All of the datasets are text-independent: only the speaker's voice matters for recognition, the content of the speech does not, and the speaker can speak freely in a text-independent speaker recognition system.

  1.

    ELSDSR is a small corpus dataset that was recorded at the Technical University of Denmark (DTU) by faculty, Ph.D. students, and master’s students. ELSDSR consists of voice messages from 22 speakers, 12 males and 10 females. For training, 154 voices were recorded with 7 sentences each. For the testing set, 44 utterances were provided, and 2 sentences were spoken by each speaker. The time duration for the training data is 78 s for males and 88.3 s for females. The test data duration is 16.1 s for males and 19.6 s for females [68].

  2.

    Voxforge is an open speech dataset (medium size) consisting of many speaker voices. For the proposed work, 100 English speakers are randomly selected. Each speaker spoke 10 sentences recorded at a sampling rate of 8 kHz. A total of 1000 voice files are used for 100 speakers. Out of the 1000 voices, 800 voices are used for training, and 200 voices are used for testing [40].

  3.

    The CSTR VCTK (medium size) corpus consists of speech data from 109 native English speakers with different accents. Each speaker recorded approximately 400 English sentences. For the proposed work, 5 sentences from each speaker are selected. A total of 545 voices are used. A total of 436 voices are used for training, and the remaining 109 voices are used for testing [40].

  4.

    A total of 942 h of multilingual telephone speech and English interview speech are included in the NIST-SRE-2008 database (medium size) [46, 69]. The sampling frequency was converted from the original 8 kHz to 16 kHz, and 120 English-only microphone-channel speakers were selected for better comparison with the other databases. Audacity software [70] is used to separate a single speaker from multiple speakers and segment the speaker's voice into 10 equal parts. Each speaker has 10 audio files with a fixed length of 8 s each. Six audio files are used for training, and the remaining 4 audio files are used for testing. A total of 1200 voices are used, out of which 720 voices are used for training and 480 voices are used for testing.

  5.

    Voxceleb1 (large size) contains more than 100,000 voice samples. The videos included in the database are recorded in challenging multispeaker environments, including red carpet events and outdoor stadiums. All the data are degraded with real-world noise, such as laughter, overlapping speech and room acoustics. For this paper, all 1251 speakers are used, giving a total of 153,516 utterances. To obtain a fair comparison with Refs. [48] and [49], 148,642 utterances are used for training and 4874 utterances for testing in the SV task, while 145,265 utterances are used for training and 8251 utterances for testing in the SI task. Table 4 provides the details of all the voice datasets used.

Table 4 Database details

Evaluation Using the 5 Databases

Performance Evaluation for Speaker Identification

The SI accuracy (%) is calculated using the Classification Learner application in MATLAB for all proposed models using all 3 classifiers. The proposed work reports the overall SI accuracy of the system for comparison with systems in other work. The accuracy (Eq. 21) indicates how many voice samples, out of the total number of test samples for all speakers, are correctly identified. The feature fusion models that provide better SR results are shown in the results tables.

$$ {\text{Accuracy}}\,(\% ) = \frac{{\text{Number of voices correctly identified}}}{{\text{Total number of audio files tested}}} \times 100 $$
(21)

Performance Evaluation for Speaker Verification

ROC curves of the different models are plotted using the false-positive rate (FPR) and true-positive rate (TPR) for each speaker at intervals of 0.005. The EER is the FPR value at the intersection of the ROC curve with the diagonal from (0, 1) to (1, 0). For each speaker, the selected speaker is considered the true speaker, and the rest of the speakers are collectively considered impostors. Figure 9 shows the schematic diagram of the ROC curve and how the EER value is calculated. The final EER values are shown in the result tables. A perfect ROC curve consists of a straight line from the starting point (0.0, 0.0) to the upper left corner (0.0, 1.0) and a straight line from the upper left corner to the upper right corner (1.0, 1.0) [71,72,73,74], as shown in Fig. 9 (blue line, best models), which means that the classifier gives 0 false positives and 0 false negatives and is perfectly accurate. Similar ROC curves are generated by the models that give the best results; ROC curves closer to the upper left corner indicate better results [72, 73]. Figures 10 and 11 show how the SI accuracy and SV EER change with different feature fusion models. Only a few models are included in Figs. 10 and 11 to show the variation in SR performance, because including all the models would make the graphs unreadable.
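In MATLAB, the same ROC/EER computation can be sketched with perfcurve from the Statistics and Machine Learning Toolbox; labels, scores and targetSpeaker are assumed variables holding, for one target speaker, the genuine/impostor ground truth, the classifier scores and the positive-class label, and the EER is read off where FPR equals FNR.

```matlab
% EER for one target speaker from the ROC curve.
[fpr, tpr] = perfcurve(labels, scores, targetSpeaker);  % targetSpeaker = positive class
fnr = 1 - tpr;
[~, idx] = min(abs(fpr - fnr));                         % operating point where FPR ~ FNR
eer = 100 * (fpr(idx) + fnr(idx)) / 2;                  % EER in percent
```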

Fig. 9
figure 9

Schematic diagram of ROC curves and the EER calculation

Fig. 10
figure 10

SI performance on the ELSDSR, Voxforge, VCTK, NIST-2008 and voxceleb1 audio datasets

Fig. 11
figure 11

SV performance on the ELSDSR, Voxforge, VCTK, NIST-2008 and voxceleb1 audio datasets

Two of the most popular measures in biometrics are the false positive rate (FPR) and the false negative rate (FNR). The FPR is the ratio between the number of false acceptances and the total number of impostor attempts; it measures the likelihood that the biometric model falsely accepts an impostor (see Figs. 12, 13, 14, 15, 16). The FNR measures the likelihood that the biometric model incorrectly rejects a genuine speaker and is the ratio between the number of false rejections and the total number of genuine speaker attempts [72]. Figures 12, 13, 14, 15, and 16 show the false positive rate (FPR) and false negative rate (FNR) at different thresholds and the equal error rate (EER) (%) (EER = FPR = FNR) for a few feature fusion models for all the speech databases used.

Fig. 12
figure 12

FPR, FNR and EER at different threshold values for different feature fusion models, ELSDSR database

Fig. 13
figure 13

FPR, FNR and EER at different threshold values for different feature fusion models, voxforge database

Fig. 14
figure 14

FPR, FNR and EER at different threshold values for different feature fusion models, VCTK database

Fig. 15
figure 15

FPR, FNR and EER at different threshold values for different feature fusion models, NIST-2008 database

Fig. 16
figure 16

FPR, FNR and EER at different threshold values for different feature fusion models, voxceleb1 database

Figure 12 shows the error rate graph at different threshold values with MFCC features using the KNN classifier; the best model (Table 5) gives the lowest EER of 0 and 100% accuracy for the ELSDSR speech data. Figure 13 shows the error rate graph at different threshold values with the fusion of 4 features (PLP + LPC + ΔPLP + MFCC) using the KNN classifier; the best model (Table 6), a fusion of 7 features, gives the lowest EER of 0 and 100% accuracy for the voxforge speech data. Figure 14 shows the error rate graph at different threshold values with the PLP feature using the LD classifier; the best model (Table 7) gives the lowest EER of 0 and 100% accuracy for the VCTK speech data. Figure 15 shows the error rate graph at different threshold values with the MFCC feature using the LD classifier; the best model for the NIST-2008 data is a fusion of 14 features (Table 8). Figure 16 shows the error rate graph at different threshold values with the MFCC feature using the KNN classifier; the best model for the voxceleb1 data is a fusion of 11 features with the KNN classifier (Table 9). From Figs. 12, 13, 14, 15, and 16, we can observe that the error rate decreases when more feature combinations are used. Error rate graphs for only a few combinations are shown, to illustrate the effectiveness of the proposed method and to show that SR performance improves as more features are fused.

Table 5 Best feature fusion models on the ELSDSR audio datasets (proposed best models vs. other best models)
Table 6 Best feature fusion models on the voxforge audio datasets (proposed best models vs. other best models)
Table 7 Best feature fusion models on the VCTK audio datasets (proposed best models vs. other best models)
Table 8 Best feature fusion models on the NIST-2008 audio datasets (proposed best models vs. other best models)
Table 9 Best feature fusion models on the voxceleb1 audio datasets (proposed best models vs. other best models)

Comparison of Results and the Best Models

Tables 5, 6, 7, 8, and 9 show the results of the best models, i.e., those that provide the top SR performance (highest SI accuracy and lowest SV EER values) compared to the other fusion models, together with the results reported by other works. For a fair comparison, we include only results of other models that use the same database and the same number of audio clips as the proposed model. In addition, Tables 5, 6, 7, 8, and 9 compare the proposed best model results (red font), the common effective model results on all datasets (bold red font) and the best results of other models on the ELSDSR, voxforge, NIST-2008, VCTK and voxceleb1 databases, respectively. The following points summarize the best models (those that obtain the highest SI accuracy and lowest SV EER values) obtained by the proposed work and the one effective model suitable for all the datasets:

  1.

    For the ELSDSR database, the best average SV EER of 0% and SI accuracy of 100% are achieved by the proposed work using fusions of 6, 7, 9, 10, 12, 14, and 15 features; however, little previous SV research has been performed on ELSDSR data, so previous SV results are difficult to compare. The highest previously reported SI accuracies are 95.2% with score-level fusion [34], 95% with GMM-UBM modeling [35], and 94.8% with an RF classifier [39]. As ELSDSR is a small voice dataset, the proposed models with fusions of 6, 7, 9, 10, 12, 14 and 15 features can be considered suitable for small audio databases (Table 5).

  2.

    For the voxforge speech database, the proposed models with fusions of 7 and 14 features obtain the lowest EER value of 0% and SI accuracies of 100% and 99.5% with the KNN classifier, while a previous SR model on voxforge speech data [40] achieved an EER value of 0% using the feature-level fusion method with the same number of voices for training and testing as used by the proposed model. The best SI accuracies of 94% and 93% are achieved by Refs. [42, 43] using feature fusion and model fusion, respectively (Table 6). Reference [44] reports an EER of 1.18% using a CNN-based approach on voxforge data.

  3.

    For the NIST-2008 database, the proposed models achieve the best EER of 0.2% and the best SI accuracy of 96.9% using fusions of 11, 12, and 14 features with the LD classifier, whereas the score fusion method with GMM-UBM modeling [45] achieves a best accuracy of 95.83%. The model in Ref. [46] achieved the best SI accuracy of 96.6% with an i-vector approach and a best accuracy of 95.83% with GMM-UBM modeling when tested on the same NIST-2008 data. References [45] and [46] report only SI results; therefore, only the SI results of the proposed work can be compared with Refs. [45, 46] (Table 8). In Ref. [47], DNN- and bottleneck-based techniques are used, with NIST-2008 data used only for testing. For the English NIST-2008 data, a best EER of 5.86% is achieved using the bottleneck i-vector technique and an EER of 7.26% using the DNN-based approach; only the results for the English database are used for comparison.

  4.

    The proposed models using 5, 8, 9, 10, 11, and 14 features achieve the lowest EER value of 0% and the highest SI accuracy of 100% on VCTK data, while other approaches achieved a lowest EER of 5% [40] and a highest SI accuracy of 98.9% [40] with feature-level fusion, score-level fusion, and i-vector/GMM-UBM on the VCTK voice dataset (Table 7). In Ref. [41], the best accuracy of 95.63% and an EER of 4.08% are achieved using a DNN-based method.

  5.

    Voxforge, NIST-2008 and VCTK are medium-size voice databases; therefore, the fusion of 14 features, which is the best common model among the three, can be considered appropriate for medium-size audio datasets.

  6.

    For the voxceleb1 dataset, the lowest SV EER values of 4.07% and 4.31% and SI accuracy values of 90% and 89.3% are achieved using the fusion of 14 features and 15 features, respectively, with the KNN classifier, while in Ref. [48], the best EER of 3.85% is achieved using the x-vector and time delay neural network (TDNN) approach. Nagrani et al. [49] achieved the best SV EER of 7.8% using a CNN architecture. A total of 1251 speakers were used in Refs. [48, 49], as in the proposed work. In Ref. [48], the total number of speaker voice samples is slightly smaller than that used in the proposed work; hence, the fusion of 14 features can be considered better than the results achieved by Refs. [48, 49] (Table 9).

  7.

    The voxceleb1 dataset has the largest number of speakers among all the datasets used; therefore, the 14 and 15 feature fusion models should be considered suitable for large audio datasets.

  8.

    From the results in Tables 5, 6, 7, 8, 9, it is observed that feature fusion with delta and delta–delta values generates better SR results than using single features. The fusion of PLP, LPC, ΔPLP, ΔΔLPC, RMS, MFCC, ΔΔRMS, ΔΔPLP, ΔΔentropy, ΔLPC, entropy, Δentropy, ΔMFCC, and ΔRMS (14 features) highlighted in bold red font in the results in Tables 5, 6, 7, 8, 9 is the only model that obtains effective results for SI as well as SV on all 5 voice datasets.

  9.

    Furthermore, when the performance of the three classifiers is compared, it is observed that the KNN classifier performs better on the ELSDSR, voxforge, and voxceleb1 databases (small, medium and large audio datasets), while the LD classifier gives better results on VCTK and NIST-2008 (medium audio datasets). Different classifiers generate different SR results for each voice dataset due to variation in the size of the training/testing datasets. This is why the final effective model with the fusion of 14 features is generated by different classifiers for each voice dataset.

Table 10 summarizes the results obtained by the common best model, the fusion of PLP, LPC, ΔPLP, ΔΔLPC, RMS, MFCC, ΔΔRMS, ΔΔPLP, ΔΔentropy, ΔLPC, entropy, Δentropy, ΔMFCC, and ΔRMS (14 features), for all databases, including the training/testing data division and the model computation time using a KNN classifier. From Table 10, it can be observed that the training and testing times increase as the size of the database increases. Voxceleb1 takes the longest training and testing times, 2203.8 s and 43.3 s, respectively, for speaker identification; because a different training/testing division is used for speaker verification, its SV training and testing times are 2502.1 s and 52.2 s, respectively. In addition, the results generated by the proposed models are better than the other results; hence, the selection of the 2 best models at each step can be considered an effective way to produce the best SR results.

Table 10 Summary of SR results using the fusion of PLP, LPC, ΔPLP, ΔΔLPC, RMS, MFCC, ΔΔRMS, ΔΔPLP, ΔΔentropy, ΔLPC, entropy, Δentropy, ΔMFCC, and ΔRMS (14 features) with the KNN classifier (common best model) for all databases

Factors Influencing the SR Performance

  1.

    The best models usually contain RMS features, which suggests that combining spectral features with prosodic features, such as RMS, improves SR performance.

  2.

    The fusion of many features does not necessarily produce better SR results, and sometimes small feature fusion models produce better results than models with more features.

  3.

    Training and testing times increase, and SR performance can be affected, when large datasets are used.

Conclusion and Future Work

In this paper, a new and unique feature fusion methodology is implemented to find an effective model suitable for clean speech databases of all sizes, using a total of 18 features with LD, KNN and ensemble classifiers on ELSDSR, VCTK, voxforge, NIST-2008 and voxceleb1 speech data. The experimental results show that the SI accuracy of the system increases to 100% and the EER value is reduced to 0% when multiple feature fusions are tested on the ELSDSR, voxforge, and VCTK data. For the NIST-2008 dataset, the proposed model achieves the best SI accuracy of 96.9% with fusions of 11, 12 and 14 features and the best EER of 0.2% with the fusion of 11 features using the LD classifier. For voxceleb1, the fusions of 14 and 15 features give the best SI accuracies of 90% and 89.3% and SV EER values of 4.07% and 4.31%, respectively. From the experimental results, it is observed that the fusion of PLP, LPC, ΔPLP, ΔΔLPC, RMS, MFCC, ΔΔRMS, ΔΔPLP, ΔΔentropy, ΔLPC, entropy, Δentropy, ΔMFCC, and ΔRMS (14 features) gives the best SI and SV results on all five speech datasets, from which it can be concluded that the proposed model with the fusion of 14 features is suitable for speech datasets of various sizes.

The future challenge is how to achieve faster and better ASR models. Dimension reduction techniques, such as principal component analysis (PCA) and independent component analysis (ICA), can be used to solve this problem. In addition, feature selection optimization techniques can be used in the future to reduce the computation time, and the results can be compared with the ones proposed.