1 Introduction

In the speaker verification task, a person claims an identity, and the system attempts to accept or reject the claim using features of the individual's speech. Depending on whether the text spoken during enrollment and testing is identical or different, the scenario is called text-dependent or text-independent, respectively. In the text-independent case, since the speech content can vary freely, systems face significant variation when learning and modeling a speaker's characteristics. Designing a robust and efficient text-independent speaker verification system is therefore challenging.

The general parts of a speaker verification system are pre-processing, feature extraction, acoustic modeling, and decision making. The pre-processing step brings all signals into a common format through tasks such as noise and silence removal. Because of the sampling frequency, each speech segment contains a large number of samples, most of which are not directly useful for speaker verification and only increase the computational time and the complexity of the system. The feature extraction stage therefore extracts valuable, low-dimensional coefficients that describe the speech signal. One of the most commonly used features is the Mel Frequency Cepstral Coefficient (MFCC).

Based on the extracted features, the acoustic modeling stage builds a specific model for each speaker via a modeling algorithm such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [43]. Finally, the decision is made by comparing the created model with the features of the test utterance.

The i-vector approach aims to represent speaker and channel information in a fixed-length identity vector (i-vector). Several experiments have been conducted to extract these i-vectors from GMMs [11] or deep neural networks (DNN) [3, 44, 53]. Although deep learning-based techniques such as the x-vector [12, 28, 51], in which the averaged activations of the last hidden layer of a deep neural network are selected as the identity vector, or end-to-end architectures [13, 34, 52] are used for speaker recognition, many modern speaker verification systems are still based on the i-vector [7, 8, 39].

Speaker recognition systems encounter significant challenges, notably data scarcity [2] and short-duration speech [41], which impact their design and performance. Our research focuses on developing specialized solutions that effectively address the difficulties posed by short utterances in order to improve the overall effectiveness of the speaker verification system. For long-duration speech (longer than 30 s), i-vector and PLDA-based systems perform well; however, a considerable performance drop is expected for short utterances [10]. Under identical conditions, i-vectors extracted from short utterances exhibit more intra-class variation than those extracted from long utterances [42]. Various efforts have addressed this issue. One idea was to improve the modeling of the variations in i-vectors extracted from short utterances [9]. Kanagasundaram et al. [27] proposed normalization and variance modeling of utterances at the i-vector level. Moreover, phonetic information has been associated with acoustic modeling, and several studies have attempted content matching through phonetic details [54].

Session variability vectors have also been used to estimate phonetic components instead of the i-vector extracted from an utterance. Some studies have focused on mapping short-utterance i-vectors to their long-utterance counterparts [19, 42]. Kheder et al. [33] trained a GMM with short and long utterances to perform this i-vector mapping. Instead of GMM-based mapping functions, nonlinear mappings such as convolutional neural networks (CNN) have received attention in recent years [46].

Deep neural networks have improved the performance of speech recognition and speaker recognition systems. Takamizawa et al. [50] proposed a ResNet-based speaker identification system that determines whether two speech samples of very short duration, focusing on individual phonemes, were uttered by the same speaker. In recent years, several studies have used CNNs for acoustic modeling and feature extraction [14, 29, 38], and several research works have focused on speaker spoofing challenges [25, 49, 59].

A variational autoencoder (VAE) has been used in speaker verification to reduce the environment mismatch between training and testing, such as noise and channel effects [56]. Evaluations on the NIST SRE2016 dataset showed 15.54% and 7.84% EER for the Tagalog and Cantonese languages using i-vectors and PLDA. In [55], a VAE was proposed to transform x-vectors into a regularized latent space. Experiments demonstrated that this VAE-adaptation approach transformed speaker embeddings to the target domain and achieved 12.73% EER for 77 speakers.

Other works have also used deep belief networks (DBN) in speaker recognition. Ghahabi et al. [18] proposed adapting a universal DBN, used as the background model, to each speaker. Additionally, an impostor selection method was introduced that helped the DBN outperform the cosine distance classifier. The evaluation was performed on the core test condition of the NIST SRE2006 corpora, and a 10% improvement in EER was reported. In [4], a DBN was used as a feature extractor, with the spectrogram as its input, and a performance improvement was noted.

Feature extraction considerably affects the performance of speaker verification systems. It has been argued that deep neural networks can model nonlinear functions [20], which motivates using them to extract effective speech features. The present study investigates the design of a short-utterance text-independent speaker verification system that uses a DBN in an autoencoder architecture to extract speaker features. The DBN takes regular MFCC features as input and extracts a new feature set with an unsupervised learning strategy, aiming to improve the performance of speaker verification systems. Moreover, the DBN is also used to reduce the feature vector dimension and thereby decrease the computational time.

The rest of the paper is organized as follows: Section 2 describes the GMM-UBM and i-vector-PLDA speaker verification systems. Section 3 discusses the proposed feature extractor and deep belief network theory. Section 4 presents the experimental settings and evaluation criteria. The simulation results and the analysis of the proposed systems are presented in Section 5. Finally, the conclusion is given in Section 6.

2 Proposed Speaker Verification Systems

2.1 GMM-UBM Speaker Verification

GMM-UBM was the most commonly used method in classical speaker verification systems. Figure 1 shows the framework of the proposed GMM-UBM-based system. Audio files are pre-processed and converted to efficient features in the front-end block. The output of the front-end block, the MFCC features, is the input to the proposed DBN block. The UBM plays an essential role in GMM-UBM-based systems: it is a GMM-based background model built from speech samples of non-target speakers in the development phase. The purpose of the UBM is to obtain a speaker-independent distribution that covers the whole probabilistic space, so that the feature distribution of a particular speaker can be derived from it. Greater diversity in speakers, channels, and vocabularies yields a more general UBM. The expectation–maximization (EM) algorithm is commonly utilized to construct this model [37].

Fig. 1 Framework of the proposed GMM-UBM speaker verification system with DBN block
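The sketch below illustrates EM training of a diagonal-covariance UBM. It is only an illustrative stand-in: the actual systems are implemented with the MSR Identity Toolbox in MATLAB, whereas here scikit-learn's GaussianMixture is assumed, and `development_features` is a hypothetical placeholder for the pooled development-set frames.

```python
# Minimal UBM training sketch using scikit-learn's EM-based GaussianMixture.
# `development_features` is a hypothetical array of pooled 39-dimensional
# feature frames from many non-target speakers (random data used here).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
development_features = rng.standard_normal((20000, 39))  # placeholder data

ubm = GaussianMixture(
    n_components=512,       # number of Gaussian mixtures evaluated in the paper
    covariance_type="diag"  # diagonal covariances, as is common for UBMs
)
ubm.fit(development_features)  # EM training

# The resulting means, covariances, and weights form the speaker-independent
# background model that is later MAP-adapted to each enrolled speaker.
print(ubm.means_.shape, ubm.weights_.shape)
```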

In the enrollment phase, each speaker's training utterances are applied to the UBM, and the background model parameters (means, variances, and mixture weights) are updated according to the speaker's data to build that person's acoustic model. This update is accomplished with the MAP algorithm [17]. During the verification phase, there are two hypotheses, H0 and H1, for each utterance U claiming speaker identity S:

H0: U belongs to speaker S.

H1: U does not belong to speaker S.

These two hypotheses are evaluated against the speaker-specific model and the background model. Finally, the decision is made with the log-likelihood ratio in Eq. (1):

$$ \Lambda = \frac{1}{L}\log \frac{p(U|H_{0})}{p(U|H_{1})}\;\;\begin{cases} \ge 0 & \text{accept } H_{0} \\ < 0 & \text{accept } H_{1} \end{cases} $$
(1)

where p(U|Hi), i = 0, 1, is the conditional probability of hypothesis Hi for utterance U, and L denotes the number of frames in U. Generally, the UBM is utilized as the impostor model during the testing phase.
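A minimal sketch of the scoring rule in Eq. (1) follows. It assumes the speaker model and the UBM are available as fitted scikit-learn GaussianMixture objects (the actual systems use the MSR toolbox); `llr_score` and `decide` are hypothetical helper names.

```python
# Hedged sketch of the verification score in Eq. (1): the average per-frame
# log-likelihood ratio between a MAP-adapted speaker GMM and the UBM.
# `speaker_gmm` and `ubm` are assumed to be fitted GaussianMixture objects;
# `test_frames` is an (L, 39) array of test-utterance features.
import numpy as np

def llr_score(test_frames, speaker_gmm, ubm):
    # score_samples returns per-frame log-likelihoods, so the mean of the
    # difference is (1/L) * log p(U|H0)/p(U|H1) from Eq. (1).
    return np.mean(speaker_gmm.score_samples(test_frames)
                   - ubm.score_samples(test_frames))

def decide(test_frames, speaker_gmm, ubm, threshold=0.0):
    score = llr_score(test_frames, speaker_gmm, ubm)
    return "accept H0" if score >= threshold else "accept H1"
```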

2.2 i-Vector Speaker Verification

Figure 2 shows the framework of the proposed i-vector system. The i-vector is the identity vector for each speaker. This model can process variable-length speech signals by mapping them to a fixed-length, low-dimensional vector. The i-vector extraction block tries to represent mismatches such as intra-speaker variation and session variability in the GMM. The idea behind the i-vector is that speaker-dependent and channel-dependent variations can be captured in a separate low-dimensional subspace through the joint factor analysis (JFA) technique, which eliminates or reduces intra-speaker changes and channel effects [30].

Fig. 2 Framework of the proposed i-vector-based speaker verification system with DBN block

In the development phase, this system employs the UBM. The UBM and the total variability model are trained as a space that represents these changes. In other words, each utterance can be represented as a supervector M, which concatenates the mean vectors of the GMM belonging to each speaker. This supervector is calculated separately for every speaker as follows:

$$ M = m + Vy + Ux + Dz $$
(2)

where m is a speaker- and channel-independent supervector derived from the UBM. The matrices V, U, and D represent the speaker subspace, the session (channel) subspace, and the diagonal residual, respectively. The components y and x are the speaker and channel factors, respectively, and Dz is the speaker's residual information not included in Vy. In the enrollment phase, an i-vector is extracted for each speaker. It is worth noting that the channel components also carry some speaker information, so a single total variability subspace is proposed for both [53]. This GMM-based supervector, depending on speakers and sessions, is defined as follows:

$$ M = m + Tw $$
(3)

where T is the total variability matrix, which spans the speaker and session variability, and w is the identity vector, or i-vector. The Baum–Welch statistics are used to train the total variability subspace; they are defined as follows:

$$ N_{c} = \sum\limits_{t} {P(c|y_{t} ,\Phi )} $$
(4)
$$ F_{c} = \sum\limits_{t} {P(c|y_{t} ,\Phi )} y_{t} $$
(5)

where Nc and Fc are the zero- and first-order statistics, yt is the feature sample at time t, Φ denotes the UBM, c (c = 1, …, C) is the Gaussian mixture index, and P(c|yt, Φ) is the posterior probability that mixture component c produces the vector yt. In the verification phase, these identity vectors can be used with several classifiers, including cosine similarity, LDA [36], and PLDA [31].
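The following sketch computes the statistics of Eqs. (4)–(5) and then a standard closed-form i-vector point estimate. The closed form is not written out above, so it should be read as an assumption about the usual formulation rather than the exact recipe of the toolboxes used here; `ubm` is assumed to be a fitted diagonal-covariance GaussianMixture and `T` a separately trained total variability matrix.

```python
# Sketch of the Baum-Welch statistics (Eqs. 4-5) and the standard i-vector
# posterior-mean estimate. `T` has shape (C*F, R), where C is the number of
# mixtures, F the feature dimension, and R the i-vector length (256 here).
import numpy as np

def baum_welch_stats(frames, ubm):
    post = ubm.predict_proba(frames)             # P(c | y_t, Phi), shape (L, C)
    N = post.sum(axis=0)                         # Eq. (4): zero-order stats, (C,)
    F = post.T @ frames                          # Eq. (5): first-order stats, (C, F)
    F_centered = F - N[:, None] * ubm.means_     # center around the UBM means
    return N, F_centered

def ivector(N, F_centered, T, ubm, R=256):
    C, F = F_centered.shape
    sigma_inv = 1.0 / ubm.covariances_           # diagonal precisions, (C, F)
    L_mat = np.eye(R)
    b = np.zeros(R)
    for c in range(C):
        Tc = T[c * F:(c + 1) * F, :]             # (F, R) block for mixture c
        TcS = Tc.T * sigma_inv[c]                # T_c' Sigma_c^{-1}
        L_mat += N[c] * TcS @ Tc
        b += TcS @ F_centered[c]
    return np.linalg.solve(L_mat, b)             # posterior mean w (the i-vector)
```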

3 Feature Extraction Using Deep Generative Model

The feature extractor is one of the most important parts of a speech processing system; it extracts low-dimensional, efficient information from the input speech signal. MFCC is one of the most popular features in speech processing; the MFCC algorithm extracts the envelope of the speech signal and is known as a short-term feature. On the other hand, several long-term features have been introduced for different speech recognition scenarios [58], but despite carrying contextual information, they may not properly describe speaker-specific information. Therefore, finding features that are effective specifically for speaker recognition can significantly affect the performance of these systems.

Deep learning is widely used for feature extraction from raw data or from classical engineered features [48]. Accordingly, this research applies a special type of probabilistic generative deep neural network, the deep belief network [22], in an autoencoder architecture so that it can act as an unsupervised feature extractor. Recent studies on speaker verification have exploited the benefits of DNNs in acoustic modeling and feature extraction [48]. To use DNNs in acoustic modeling, the network must be trained with the specific information of each speaker.

The deployment of DNNs as the feature extractor consists of supervised [47] and unsupervised [16] scenarios. The supervised approach needs labeled data, and the selection of the labels is important and dependent on the context (text-independent or text-dependent). In the proposed unsupervised mode, the DBN is used in the autoencoder architecture, in which the network attempts to reconstruct the information of the input layer in the output layer.

DBNs are generative models made up of stacked restricted Boltzmann machines (RBM). Each RBM is a two-layer network that models the distribution of the input (visible) layer data in the output (hidden) layer based on the connection weights [24]; there are no visible–visible or hidden–hidden connections. The RBM is an energy-based model whose joint distribution is based on its energy function as follows:

$$ P(v,h|\lambda ) = \frac{{e^{ - E(v,h)} }}{Z} $$
(6)

The energy of a configuration (v, h) in this network is presented in Eq. (7):

$$ E(v,h) = - \sum\limits_{i = 1}^{V} {\sum\limits_{j = 1}^{H} {w_{{{\text{ij}}}} v_{i} h_{j} - \sum\limits_{i = 1}^{V} {a_{i} v_{i} - \sum\limits_{j = 1}^{H} {b_{j} } h_{j} } } } $$
(7)

where vi and hj are the states of visible unit i and hidden unit j, respectively, wij is the weight between vi and hj, and ai and bj are their biases. The marginal probability that an RBM assigns to a visible vector v can then be written as

$$ P(v|\lambda ) = \frac{{\sum\nolimits_{h} {e^{ - E(v,h)} } }}{Z}. $$
(8)

where \(Z = \sum\limits_{v} {\sum\limits_{h} {e^{ - E(v,h)} } }\) is the normalizing constant. The derivative of the log probability P(v|λ) with respect to the model parameters λ is as follows:

$$ \frac{\partial \log P(v|\lambda )}{{\partial w_{{{\text{ij}}}} }} = \langle v_{i} h_{j} \rangle_{{{\text{data}}}} - \langle v_{i} h_{j} \rangle_{{{\text{model}}}} $$
(9)

where ⟨α⟩data and ⟨α⟩model are the expectations of α estimated from the data and from the model, respectively. The derivative in (9) leads to the following learning rule:

$$ \Delta w_{{{\text{ij}}}} = \varepsilon (\langle v_{i} h_{j} \rangle_{{{\text{data}}}} - \langle v_{i} h_{j} \rangle_{{{\text{model}}}} ) $$
(10)

where ε is the learning rate. The hidden neurons are conditionally independent given the visible vector, so the binary state of each hidden unit hj is set to one with the following probability:

$$ P(h_{j} = 1|v) = \psi (\sum\limits_{i} {w_{{{\text{ij}}}} v_{i} + b_{j} )} $$
(11)

where ψ(.) is the sigmoid logistic function. Likewise, Eq. (12) presents the visible binary neuron:

$$ P(v_{i} = 1|h) = \psi (\sum\limits_{j} {w_{{{\text{ij}}}} h_{j} + a_{i} )} $$
(12)

The data-dependent expectation ⟨vihj⟩data is straightforward to estimate, but approximate methods such as contrastive divergence (CD) [21] are required to estimate the model term ⟨vihj⟩model.

The one-step CD of an RBM is shown in Fig. 3. The approximation for the gradient regarding the visible to hidden weights is as follows:

$$ \begin{aligned} \Delta w_{{{\text{ij}}}} &= - \varepsilon (\langle v_{i} h_{j} \rangle_{{{\text{data}}}} - \langle v_{i} h_{j} \rangle_{\infty } ) \\ &\approx - \varepsilon (\langle v_{i} h_{j} \rangle_{{{\text{data}}}} - \langle v_{i} h_{j} \rangle_{1} ) \\ \end{aligned} $$
(13)

where ⟨·⟩∞ denotes the expectation computed with samples generated by running the Gibbs sampler for an infinite number of steps, and ⟨·⟩1 is the expectation after one step. Similarly, the learning rules for the bias parameters are as follows:

$$ \begin{gathered} \Delta a = - \varepsilon (\langle v\rangle_{{{\text{data}}}} - \langle v\rangle_{1} ) \hfill \\ \Delta b = - \varepsilon (\langle h\rangle_{{{\text{data}}}} - \langle h\rangle_{1} ). \hfill \\ \end{gathered} $$
(14)
Fig. 3 One-step contrastive divergence of an RBM
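A minimal NumPy sketch of one CD-1 update for a binary RBM is given below, following the learning rule of Eq. (10) with the one-step approximation of Fig. 3. It is illustrative only and not the DeeBNet code used in the experiments.

```python
# Minimal one-step contrastive divergence (CD-1) update for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.001):
    # positive phase: hidden probabilities given the data (Eq. 11)
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # one Gibbs step: reconstruct visibles (Eq. 12) and recompute hiddens
    p_v1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b)
    # parameter updates (Eq. 10), with <.>_1 replacing the model expectation
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (p_h0 - p_h1).mean(axis=0)
    return W, a, b

# toy usage: 195 visible units (5 stacked MFCC frames), 150 hidden units
W = 0.01 * rng.standard_normal((195, 150))
a, b = np.zeros(195), np.zeros(150)
v_batch = (rng.random((100, 195)) < 0.5).astype(float)
W, a, b = cd1_step(v_batch, W, a, b)
```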

When the visible units v are real-valued, like the MFCC vector, and the hidden units h are binary, the RBM energy function can be modified to accommodate such variables, giving the Gaussian–Bernoulli RBM (GRBM). The energy of the GRBM is defined as follows [20]:

$$ E(v,h) = \sum\limits_{i = 1}^{V} {\frac{{(v_{i} - a_{i} )^{2} }}{{2\sigma_{i}^{2} }}} - \sum\limits_{i = 1}^{V} {\sum\limits_{j = 1}^{H} {\frac{{v_{i} }}{{\sigma_{i} }}w_{{{\text{ij}}}} h_{j} - \sum\limits_{j = 1}^{H} {b_{j} h_{j} } } } $$
(15)

where the variance parameters σi2 are commonly fixed to a predetermined value instead of being learned from the training data. To train a GRBM with the CD algorithm, the two conditional distributions used for Gibbs sampling are derived as follows:

$$ P(h_{j} = 1|v) = \psi (b_{j} + \sum\limits_{i} {\frac{{v_{i} }}{{\sigma_{i} }}w_{{{\text{ij}}}} )} $$
(16)
$$ P(v_{i} |h) = N(v,\sum\limits_{j} {h_{j} w_{{{\text{ij}}}} + a_{i} ,\sigma_{i}^{2} )} $$
(17)

where N(v; μ, Σ) denotes a Gaussian distribution over v with mean vector μ and covariance matrix Σ. For the unsupervised pre-training with CD, the data is normalized so that each coefficient has zero mean and unit variance. Since CD is not exact, several other methods, such as persistent contrastive divergence (PCD), have been proposed for the RBM [5]. Unlike CD, which uses the training data as the initial value of the visible units, PCD uses the last chain state from the previous update step; in other words, PCD employs a persistent Gibbs sampling chain to estimate ⟨vihj⟩model.
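For completeness, a small sketch of the two GRBM conditionals in Eqs. (16)–(17) follows, with the visible variances fixed to one as described above; the function names are hypothetical.

```python
# Sketch of the Gibbs-sampling conditionals of a Gaussian-Bernoulli RBM.
import numpy as np

rng = np.random.default_rng(0)

def grbm_sample_h(v, W, b, sigma=1.0):
    p_h = 1.0 / (1.0 + np.exp(-(b + (v / sigma) @ W)))   # Eq. (16)
    return (rng.random(p_h.shape) < p_h).astype(float)

def grbm_sample_v(h, W, a, sigma=1.0):
    mean = a + h @ W.T                                    # Eq. (17): Gaussian mean
    return mean + sigma * rng.standard_normal(mean.shape)
```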

The proposed DBN is illustrated in Fig. 4. DBN training uses a greedy layer-wise algorithm [23]. In a three-layer encoder consisting of three RBMs, a single RBM is trained first; the second RBM is then stacked on top of it and trained using the first RBM's output as its input, and this process continues to the end of the encoder. After that, the decoder, which mirrors the encoder, is attached, and the error backpropagation algorithm is performed for the final correction of the weights.

Fig. 4 Architecture of proposed DBN to extract efficient features based on the MFCC vector

In the proposed DBN, the number of layers and the number of neurons in each layer are selected so that the middle layer converges to a compact representation of the speaker information, and the last layer can then reconstruct the input information from this converged middle-layer representation.

Across the three restricted Boltzmann machines of the encoder, the number of neurons decreases layer by layer, so the information is compressed and low-dimensional features are produced. The second half of the network, which mirrors the first part, tries to recover the input data from the low-dimensional features of the middle layer, like a decoder. Training continues until the network can accurately reconstruct the information in the output layer. If the network can properly model the input vector with a small number of middle-layer neurons and then reconstruct the input data, the middle-layer representation accurately describes the input vector, so it can be used as a low-dimensional feature containing the important information of the speech signal. This is precisely the property of a good feature extraction algorithm.
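The sketch below illustrates the unrolled autoencoder and its backpropagation fine-tuning stage, using the layer sizes and hyperparameters reported in Section 4 (195-150-100-39 encoder, mirrored decoder, MSE loss, learning rate 0.001, weight penalty 0.0002, batches of 100). PyTorch is assumed as a stand-in for the MATLAB DeeBNet toolbox, and the weights are initialized randomly instead of from the pre-trained RBMs, so it shows only the structure of the procedure, not the exact recipe.

```python
# Hedged sketch of the unrolled DBN autoencoder and its fine-tuning stage.
import torch
import torch.nn as nn

class DBNAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(195, 150), nn.Sigmoid(),
            nn.Linear(150, 100), nn.Sigmoid(),
            nn.Linear(100, 39),  nn.Sigmoid(),   # middle layer = extracted feature
        )
        self.decoder = nn.Sequential(
            nn.Linear(39, 100),  nn.Sigmoid(),
            nn.Linear(100, 150), nn.Sigmoid(),
            nn.Linear(150, 195),                 # real-valued MFCC reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# In the full recipe the Linear weights would be initialized from the greedily
# pre-trained RBMs; random initialization keeps this sketch self-contained.
model = DBNAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.0002)
loss_fn = nn.MSELoss()

batch = torch.randn(100, 195)                    # placeholder 5-frame MFCC windows
for _ in range(10):                              # the paper runs up to 200 iterations
    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    features = model.encoder(batch)              # 39-dimensional DBN features
```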

4 Experimental Setup

In this section, we explain all the experimental settings. These settings include the datasets, front-end, baseline system, proposed DBN-based system parameters, and evaluation metrics.

4.1 Datasets

The NIST SRE2004 data was used to train the UBM, the DBN, and the T matrix. This data consists of 10,743 telephone speech audio files from 480 speakers (181 males and 299 females) [1]. The NIST SRE2008 data was utilized to evaluate the systems; in SRE2008, interview speech recorded with several types of microphones is included in addition to the telephone speech [26]. The utterances of 1270 speakers from all languages in the dataset, in both the interview and telephone scenarios, were used to assess the systems. Training and testing were performed under the predefined Short2 and Short3 conditions, respectively. The telephone speech files contained two-channel conversations involving the target speaker for approximately five minutes, whereas the interview segments involved the target speaker for approximately three minutes.

4.2 Front-end

A voice activity detector (VAD) in the front-end block separated the speech and silence sections [35]. Silence can be detected with higher accuracy from an enhanced speech signal. Various methods have been proposed to enhance speech signals; among them, spectrum-based methods have received the most attention [40]. This study used an energy-based VAD called spectral subtraction voice activity detection (SSVAD) to perform speech enhancement and silence removal [6]. SSVAD is specially designed for NIST datasets. The SRE2008 dataset contains, in addition to telephone data, interview data with a lower signal-to-noise ratio, so this specialized VAD was adopted for this dataset. It has also been shown that using this VAD increases the accuracy of speaker verification systems [35]. In the next step, the first 12 MFCC coefficients, the energy coefficient, and the first and second derivatives were extracted from the speech. This process was performed on 25 ms speech frames with 10 ms intervals using the HTK toolbox [57].
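An illustrative front-end sketch follows. The paper uses HTK, so librosa is only an assumed substitute here, and the 0th cepstral coefficient stands in for the energy term; the placeholder signal replaces real speech.

```python
# Illustrative 39-dimensional front-end: 12 cepstral coefficients plus an
# energy-like term (the 0th MFCC), with first and second derivatives,
# computed on 25 ms frames every 10 ms.
import numpy as np
import librosa

sr = 8000
y = np.random.randn(3 * sr).astype(np.float32)          # placeholder audio signal
win = int(0.025 * sr)                                    # 25 ms frame length
hop = int(0.010 * sr)                                    # 10 ms frame shift

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, win_length=win, hop_length=hop)
delta  = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2]).T            # shape: (frames, 39)
```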

This work uses two feature normalization methods to remove linear channel effects: Cepstral Mean and Variance Normalization (CMVN) and Cepstral Mean and Variance Normalization over a sliding window (WCMVN) that spans 301 frames.
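A minimal sketch of the two normalizations is given below, assuming per-coefficient statistics over either the whole utterance (CMVN) or a sliding 301-frame window (WCMVN); the function names are hypothetical.

```python
# Global CMVN and sliding-window CMVN (WCMVN), applied per coefficient.
import numpy as np

def cmvn(features):
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

def wcmvn(features, window=301):
    half = window // 2
    out = np.empty_like(features)
    for t in range(len(features)):
        seg = features[max(0, t - half):t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out
```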

4.3 Baseline System

The baseline systems are the GMM-UBM and i-vector-PLDA speaker verification systems based on MFCC features. The i-vector-PLDA system uses the same UBM trained for the GMM-UBM system. The i-vector dimensionality was reduced to 150 through LDA, and PLDA scoring was utilized.

4.4 Proposed DBN-Based System

The optimal parameters of the DBN, such as the input type, the number of layers, and the number of input neurons, were determined in several experiments within our computational resources. Various inputs were considered, such as the time-domain speech signal, the Fourier transform of the signal, and the MFCC; the best speaker verification performance was obtained with the MFCC as the DBN input.

Regarding the arrangement of the DBN input, the best results were achieved when the DBN input contained five consecutive frames of MFCC. To compare the proposed DBN-based features with the MFCC features, the dimension of the feature vectors was kept at 39. The DBN was trained with half a million data samples (half of the data from NIST SRE2004 and NIST SRE2008).
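The input arrangement can be sketched as follows, where `stack_frames` is a hypothetical helper that concatenates five consecutive 39-dimensional frames into a 195-dimensional DBN input; step = 5 corresponds to the non-overlapping, dimension-reducing variant discussed later in Section 5.

```python
# Stacking consecutive MFCC frames into DBN input vectors.
import numpy as np

def stack_frames(features, context=5, step=1):
    windows = []
    for start in range(0, len(features) - context + 1, step):
        windows.append(features[start:start + context].reshape(-1))
    return np.asarray(windows)          # shape: (num_windows, context * dim)

mfcc = np.random.randn(300, 39)         # placeholder utterance features
dbn_input = stack_frames(mfcc, context=5, step=1)   # overlapping windows, (296, 195)
reduced   = stack_frames(mfcc, context=5, step=5)   # non-overlapping, 1/5 of the data
```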

The proposed DBN consists of six RBM blocks, with two GRBM layers as the input and output to model the real-valued speech data. As shown in Fig. 4, the input layer contains 195 neurons to receive five frames of the 39-dimensional MFCC feature. The middle layer consists of 39 neurons to extract the DBN feature, and two layers with 150 and 100 neurons are arranged between the input and middle layers.

The proposed DBN model was trained in two stages. First, unsupervised learning was performed for 100 epochs using the PCD method. Then, error backpropagation was conducted with a maximum of 200 iterations using the mean squared error (MSE) loss function. The training process utilized a batch size of 100, a learning rate of 0.001, and a penalty of 0.0002.

The DBN was applied to both the GMM-UBM and i-vector-PLDA systems, and the results were investigated under the same conditions with and without the DBN block. In other words, we reported the results of all four conditions to show the impact of DBN on the performance of typical speaker verification systems. The MSR and DeeBNet toolboxes have been employed in the system implementation [32, 45].

4.5 Evaluation Criteria

In this research, the detection error trade-off (DET) curve and the detection cost function (DCF) are utilized as the evaluation metrics. Generally, there are four types of decisions for a test utterance, defined as follows:

False Positive (FP): Accept fake speaker (incorrectly accept).

False Negative (FN): Reject target speaker (incorrectly reject).

True Positive (TP): Accept target speaker (correctly accept).

True Negative (TN): Reject fake speaker (correctly reject).

According to these definitions, there are two types of errors in speaker verification systems. These error coefficients are defined as follows:

$$ {\text{FPR}} = \frac{{{\text{FP}}}}{{{\text{FP}} + {\text{TN}}}} $$
(18)
$$ {\text{FNR}} = \frac{{{\text{FN}}}}{{{\text{FN}} + {\text{TP}}}} $$
(19)
$$ {\text{EER}} = \frac{{\text{FNR + FPR}}}{{2}}\,\,\,\,\,\,\,{\text{if}}\,\,\,{\text{FNR}} = {\text{FPR}} $$
(20)

The Equal Error Rate (EER) is the point where the FPR and FNR are equal, obtained by varying the decision threshold. The DET curve plots FNR versus FPR. A speaker verification system produces a matching score between the trained acoustic model and the test utterance; this score indicates the similarity between the enrolled speaker and the test speaker, and a higher score indicates greater similarity. The system requires a threshold value to make a decision: if the threshold is low, the false-positive rate increases; if it is high, the false-negative rate increases. In the DET plot, the closer the curve is to the origin, the better the system performance. The detection cost function (DCF) is defined as a weighted sum of the miss and false-alarm errors, and its minimum value is reported:

$$ \begin{aligned} {\text{DCF}} &= C_{{{\text{miss}}}} \times P_{{{\text{Miss}}|{\text{Target}}}} \, \times \,P_{{{\text{Target}}}} \\ &\quad+ C_{{{\text{FalseAlarm}}}} \times P_{{{\text{FalseAlarm}}|{\text{NonTarget}}}} \times (1 - P_{{{\text{Target}}}} ) \\ \end{aligned} $$
(21)

The DCF is calculated with the parameter values CMiss = 10, CFalseAlarm = 1, and PTarget = 0.01 for the NIST SRE2008 dataset [26].
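A sketch of how the EER of Eq. (20) and the minimum DCF of Eq. (21) can be computed from target and non-target trial scores is given below; `eer_and_min_dcf` is a hypothetical helper, not part of the evaluation toolkit used in the experiments.

```python
# EER and minimum DCF from trial scores, using the SRE2008 cost parameters.
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    fnr = np.array([(target_scores < t).mean() for t in thresholds])
    fpr = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fnr - fpr))
    eer = (fnr[idx] + fpr[idx]) / 2                                # Eq. (20)
    dcf = c_miss * fnr * p_target + c_fa * fpr * (1 - p_target)    # Eq. (21)
    return eer, dcf.min()

# toy usage with synthetic scores
rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, 1000)
non = rng.normal(-1.0, 1.0, 10000)
print(eer_and_min_dcf(tgt, non))
```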

5 Evaluations and Results

This section presents the results of the baseline and proposed systems for short-utterance text-independent speaker verification. Various parameters were examined to design these systems, including the presence or absence of the SSVAD, the number of GMM components, and the feature normalization method. The results are listed in Table 1. All systems were tested on 1270 speakers (telephone and interview) of the NIST2008 dataset. Initially, a system with 512 Gaussian mixtures was designed without any of the proposed methods and without applying the SSVAD; the MFCC features were extracted without SSVAD and used for GMM-UBM training and testing. In this condition, the EER was 33.2%. The SSVAD was then employed to remove silence and improve the result: the signals were first denoised, silent parts were removed, and MFCC extraction followed. GMMs with 512 and 1024 mixtures were evaluated to determine the optimal number of components; with SSVAD, the EER for 512 and 1024 mixtures was 25.15% and 25.19%, respectively. Thus, the SSVAD improves the system performance by about 8%, and 512 GMM components produce slightly better results with less computational cost.

Table 1 The results of the GMM-UBM systems in different steps of the design

Adding the proposed DBN feature reduced the EER to 15.50% and improved the system performance by about 10%, showing that the proposed DBN can improve the final result in a typical GMM-UBM system. Because of the benefits of feature normalization in speaker verification, CMVN and WCMVN were then employed: the MFCC features were normalized and used to train the DBN.

The EER reached 14.34% with CMVN and 14.30% with WCMVN, so WCMVN gave the lowest EER. To illustrate the impact of the DBN, Fig. 5 shows the DET curves before and after using the DBN.

Fig. 5 The DET curve to compare the two GMM-UBM based systems

The second scenario evaluates the i-vector-PLDA-based system with 512 GMM components and an i-vector length of 256. The system was designed in two modes, with and without the DBN. The results of these systems are shown in Table 2, and the DET curves are shown in Fig. 6.

Table 2 The results of the i-vector system in two scenarios with and without DBN
Fig. 6 The DET curve of the two systems based on i-vector/PLDA

The i-vector-PLDA system achieved an EER of 15.24% when the MFCC feature was used. Adding the WCMVN and DBN methods resulted in an EER of 10.97%; applying them thus improved the i-vector-based system by up to 4.27% in the EER metric. This shows the impact of the proposed generative model on popular speaker verification systems.

Evaluating systems at this scale has a high computational cost and requires extended processing time. In many applications, processing time is a priority, even at the cost of some performance. Moreover, people are less inclined to provide long training speech in real-world applications. Feature extraction can play an essential role in reducing the computational cost; therefore, besides using the DBN as a feature extractor, this scenario attempted to reduce the data dimension to decrease the processing time.

The previous experiments used five consecutive MFCC frames for training the DBN with a one-frame interval at each step, so the DBN models the middle frame. This experiment used five consecutive speech frames with a five-frame interval per step; under these conditions, the DBN models five frames per step, and there is no overlap between the MFCC windows. With this method, the DBN reduces the volume of data to one-fifth. During the evaluation phase, each audio file of the dataset contained about 3 min of speech after applying the SSVAD algorithm; with the proposed DBN dimension-reduction method, this corresponds to a representation of about 30 s. The result of this system is shown in Fig. 7. Under these circumstances, the computational time is reduced to approximately one-twelfth of the previous system, which used i-vector + WCMVN + DBN: the new speaker verification system, incorporating the DBN-based feature extraction and dimension-reduction strategy, takes around 16 min to analyze and make decisions on all the test files of the 1270 speakers. These computations were performed using a single CPU (10th-generation Intel i7) and 32 GB RAM. The EER reached 15.92%, which is 4.95% higher than the case without dimension reduction. Reducing processing time is crucial for various tasks; by employing the dimension-reduction strategy, the system significantly decreases the processing load, enabling its implementation in real-world applications with limited computational resources.

Fig. 7 The DET curves while the DBN network extracts features and reduces data dimension

6 Conclusion and Future Works

The present research showed substantially improved performance of the GMM-UBM-based and i-vector-PLDA-based speaker verification systems in the text-independent mode with the proposed DBN features. Comparing the baseline MFCC systems with the proposed DBN-based systems shows that the proposed generative network improves performance. In another scenario, the DBN was used for feature dimension reduction to decrease the computational time; the proposed dimension reduction shortened the utterance representations and yielded lightweight systems suitable for online scenarios and devices with low computational resources, such as mobile phones. Future studies may improve performance by combining the proposed feature extraction method with new acoustic modeling, datasets, and feature normalization methods. Moreover, using a convolutional neural network in an autoencoder architecture can be a future trend in speaker-specific feature extraction. The source code of this paper is available from [15].