1 Introduction

Speaker recognition is the process of automatically recognizing a person from the information carried in his or her speech signal [1]. Speaker recognition techniques therefore allow a person's voice to confirm his or her identity and to authorize access to services such as voice dialing, telebanking, teleshopping, remote access to computers, security control for confidential information, database access services, forensic analysis, information services, reservation services, and voice mail [2]. Speaker recognition comprises two tasks, speaker verification and speaker identification [3,4,5]. Speaker verification is the task of accepting or rejecting the identity claimed by a speaker. Speaker identification is the task of determining which of the registered speakers produced a given utterance.

Identifying a speaker is accomplished by comparing samples from an unknown test speaker with the models of the registered speakers; the unknown person is identified as the speaker whose model best matches the test speech. In speaker verification, when an unknown person claims an identity, his or her speech is compared with the claimed speaker's model, and the claim is accepted if the match score is above a specified threshold [6]. A fundamental difference between identification and verification lies in the number of possible decisions. In identification, the number of decision alternatives equals the number of registered models, so performance degrades as the number of models grows. In verification, the only alternatives are acceptance or rejection, so performance does not depend on the number of models [7].

Speaker recognition systems are further classified as text-dependent or text-independent. A text-dependent system relies on the utterance of a specific word or phrase, whereas a text-independent system identifies the speaker regardless of the word or phrase spoken [6, 8].

The feature extraction process obtains the parameters (characteristics) of the signal that are used for its classification. Pattern recognition problems can be solved by extracting prominent features; recognizing a speaker requires identifying the desired person from the speech signal based on a distinctive feature [6]. Speech signals carry many different kinds of features, and particular features have been found to perform better for some applications than for others; so far, no single feature has proved ideal for all applications [6].

Many feature extraction methods are based on the Karhunen–Loève transform (KLT) [9, 10], and such methods have been applied to text-independent speaker recognition with outstanding results [8]. The KLT is the optimal transform in terms of minimum mean square error (MMSE) and maximum energy packing. Moreover, the most popular identification systems use Mel-frequency cepstral coefficients (MFCC) [11] and linear prediction cepstral coefficients (LPCC) [12, 13] as features; both perform well in speaker identification. A disadvantage of MFCC is its reliance on the short-time Fourier transform (STFT), which imposes a fixed time–frequency resolution and assumes the signal is stationary within each analysis frame. These characteristics make plosive phonemes difficult to identify.

Furthermore, some researchers have studied the wavelet transform for speaker feature extraction [14,15,16]. The wavelet transform [17, 18] has been widely used across science and engineering. It analyzes a signal using dilated and translated versions of a base function called the mother wavelet. With wavelet analysis, signals of interest can be expressed by a set of wavelet coefficients, and signal processing algorithms can be implemented by adjusting these coefficients. From a mathematical standpoint, the mother wavelet scale can be any positive real value and the translation any real number [19]; in practice, however, the translation and scale parameters are usually restricted to discrete lattices to improve computational efficiency [20, 21].

The paper is organized into five sections. This introduction provides the background. Section 2 reviews the literature to establish the current state of research on the topic. Section 3 describes the methodology, and Section 4 presents and analyses the results. Finally, the results are discussed in the context of the existing literature, conclusions are drawn, and recommendations are given for the best use of a text-independent speaker identification system.

2 Literature review

The subject of speaker recognition [22] began to develop in the mid-twentieth century, and the first known papers on the topic were published in the 1950s [23, 24]. This early research was concerned with preserving speakers' personal qualities through the analysis of speech; as noted in [23], the emergence of communication networks in the early 1950s created a need for speaker identification. Most of the early studies were based on text-dependent analysis to facilitate the identification task. In 1959, [24] tried to simplify the identification process by comparing the formants of speech. Recognition of speakers by human experts was the earliest approach and is still used in forensic speaker identification [25]; legal experts have applied it in various criminal forensic analyses [26, 27]. Pruzansky et al. [28, 29] used a text-dependent approach to perform an automatic statistical comparison of 10 speakers, each uttering a few unique words. For speaker identification, at least, it became clear that a text-independent analysis method was needed [22]; for speaker verification, however, there were cases where text-dependent analysis performed better than the text-independent method [30]. The Gaussian mixture model (GMM) and support vector machine (SVM) are currently the most popular modeling techniques, and classifiers such as artificial neural networks have also been used [8, 22].

The authors of [31, 32] presented a technique for speaker identification using a frame linear predictive coding spectrum (FLPCS). The FLPCS technique reduces the size of a speaker's feature vector. For classification, the general regression neural network (GRNN) and the GMM were used; with FLPCS features, the GMM achieved a higher recognition rate in a very short time. Avci [33] presented a discrete wavelet adaptive network-based fuzzy inference system (DWANFIS) model, which consists of two stages: a discrete wavelet transform and an adaptive network-based fuzzy inference system. For the sample speakers, the classification rate was approximately 90.55%.

E. Avci and D. Avci [34] presented a genetic wavelet adaptive network-based fuzzy inference system (GWANFIS) model composed of three stages: a genetic algorithm, a wavelet transform, and an adaptive network-based fuzzy inference system (ANFIS). The classification rate was approximately 91%. As a feature selection technique, singular value decomposition (SVD) followed by QR decomposition with column pivoting (QRcp) was proposed by Chakroborty and Saha [35]; the method extracts the most salient information from the speaker's data. The proposed SVD-QRcp-based method outperformed the F-ratio-based method, and the proposed feature extraction tool outperformed the baseline MFCC and linear frequency cepstral coefficients (LFCC).

The authors of [36] presented feature analysis and compensator design for speaker recognition under stressed speech conditions. Six speech features widely used for speaker identification, MFCC, linear prediction (LP) coefficients, LPCC, reflection coefficients (RC), arc-sine reflection coefficients (ARC), and log-area ratios (LAR), were analyzed to evaluate their characteristics under stressed conditions. GMM and vector quantization (VQ) classifiers were used to evaluate speaker identification with the different features. This analysis aided the selection of the best feature set for speaker recognition in stressed speech conditions.

The authors of [37] demonstrated speaker identification using empirical mode decomposition (EMD) as the feature extraction method and an artificial neural network as the classifier. EMD is a non-linear, non-stationary data analysis technique based on adaptive multi-resolution decomposition. The proposed system's performance and training time were validated using back-propagation neural networks (BPNN) and the GRNN, and the experimental results showed that the GRNN outperformed the BPNN when features were extracted with the EMD method.

Daqrouq [38] presented a text-independent speaker identification method using the wavelet transform (WT) and neural networks. Using discrete wavelet transform approximation sub-signals of the original signal, computed over several levels, instead of the original signal gave good robustness to additive white Gaussian noise (AWGN), particularly at levels 3 and 4. The authors of [39, 40] developed a text-independent speaker identification method using a fused Mel feature set and the GMM; MFCC and inverted Mel frequency cepstral coefficient (IMFCC) features were obtained for each speaker, and the identification efficiency of this method was 93.88%. The authors of [41,42,43] presented a text-independent speaker identification system using an average framing linear prediction coding (AFLPC) technique: the distinguishing vocal tract characteristics of each speaker were extracted with AFLPC during the feature extraction stage, and the size of the feature vector was optimized. The probabilistic neural network (PNN) classifier combined with wavelet packet (WP) and AFLPC features achieved the best recognition rate, 97.36%.

3 Methodology

The discrete wavelet transform (DWT) is a mathematical technique that splits a given signal into sets of coefficients corresponding to different frequency bands and temporal scales, which makes it a powerful tool for extracting the elements of a speech signal that are relevant to speaker identification. Linear predictive coding (LPC) is a statistical technique that models the correlation between past and current values of a signal, enabling prediction of future signal values. The DWT and LPC are complementary techniques that can be combined to improve the accuracy of speaker identification systems [30].

3.1 Wavelet Speaker Identification Method

By applying the low-pass and high-pass filters \(g\) and \(h\), generated from the parent wavelets, i.e. the scaling and mother wavelet functions denoted by \(\varphi\) and \(\Psi\), respectively, the DWT yields the approximation and detail coefficients of a speech signal \(X\):

$$a_{X}\left(j+1,k\right)=\left(\left(a_{X}\left(j\right)*g\right)\downarrow 2\right)\left(k\right)=\sum_{m\in Z}g_{2k-m}\,a_{X}\left(j,m\right),$$

$$d_{X}\left(j+1,k\right)=\left(\left(a_{X}\left(j\right)*h\right)\downarrow 2\right)\left(k\right)=\sum_{m\in Z}h_{2k-m}\,a_{X}\left(j,m\right),$$
(1)

where \(j\in \left\{1,2,\dots ,J\right\}\), \(k\in \left\{0,1,\dots ,{n}_{j}-1\right\}\), \(*\) is convolution, \(\downarrow\) is decimation, and \({n}_{j}\) is the number of DWT coefficients at level \(j\). We assume that

$$D_{X}\left(1\right)=\left\{d_{X}\left(1,0\right), d_{X}\left(1,1\right),\dots , d_{X}\left(1, {n}_{1}-1 \right)\right\},$$

$$D_{X}\left(2\right)=\left\{d_{X}\left(2,0\right), d_{X}\left(2,1\right),\dots , d_{X}\left(2, {n}_{2}-1 \right)\right\},$$

$$\vdots$$

$$D_{X}\left(J\right)=\left\{d_{X}\left(J,0\right), d_{X}\left(J,1\right),\dots , d_{X}\left(J, {n}_{J}-1 \right)\right\}, \;{\text{and}}$$

$$A_{X}\left(J\right)=\left\{a_{X}\left(J,0\right), a_{X}\left(J,1\right),\dots , a_{X}\left(J, {n}_{J}-1 \right)\right\}$$
(2)

are the DWT sub-signals.
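As an illustration, this decomposition can be sketched with the PyWavelets library; the wavelet family ('db4') and the four-level depth used here are assumptions chosen for the example, not values prescribed by Eqs. (1)–(2).

```python
import numpy as np
import pywt  # PyWavelets

def dwt_subsignals(window, wavelet="db4", levels=4):
    """Decompose one speech window into the DWT sub-signals
    D_X(1), ..., D_X(J) and A_X(J) of Eqs. (1)-(2)."""
    coeffs = pywt.wavedec(window, wavelet, level=levels)
    # pywt returns [A_X(J), D_X(J), D_X(J-1), ..., D_X(1)]
    approx, details = coeffs[0], coeffs[1:]
    return approx, details[::-1]  # details reordered as D_X(1), ..., D_X(J)

# Example with a random 400-sample window standing in for real speech
a_J, d = dwt_subsignals(np.random.randn(400))
print(len(a_J), [len(x) for x in d])
```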

First, the speech signal is divided into windows, and each window is decomposed separately into DWT sub-signals. Each sub-signal \({D}_{X}\left(1\right), {D}_{X}\left(2\right), \dots , {D}_{X}\left(J\right),\) and \({A}_{X}\left(J\right)\) is then divided into \(S\) frames as follows:

$${D}_{X}\left(j\right)= \left\{{frame}_{X1},{frame}_{X2},\dots ,{frame}_{XS}\right\}$$
(3)

For each frame \({frame}_{Xs}\), LPC coefficients of a specified order are obtained as follows:

$${LPC}_{{D}_{X}(j)}= \left\{{lpc}_{{f}_{X1}},{lpc}_{{f}_{X2}},\dots ,{lpc}_{{f}_{XS}}\right\}.$$

These are then averaged over the frames as follows:

$${wlpca}_{{D}_{X}(j)}=\frac{1}{S}\sum_{s=1}^{S}{lpc}_{{f}_{Xs}}$$
(4)

Then the feature extraction vector of one window is considered as:

$$WLPCA=\left\{{wlpca}_{{D}_{X}(1)}, {wlpca}_{{D}_{X}(2)},\dots , {wlpca}_{{D}_{X}\left(J\right)}\right\}$$
(5)
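A minimal sketch of Eqs. (3)–(5) in Python is given below, assuming the sub-signals of Eq. (2) are already available; the frame count, LPC order, and helper names (lpc_coeffs, wlpca_window) are illustrative choices, not the exact implementation used in this work.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coeffs(frame, order=12):
    """Autocorrelation-method LPC coefficients for one frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

def wlpca_window(subsignals, n_frames=4, order=12):
    """Eqs. (3)-(5): frame each DWT sub-signal into S frames, compute LPC
    coefficients per frame, average over the frames, and concatenate the
    averages across sub-signals into one WLPCA vector for the window."""
    feature = []
    for sub in subsignals:                       # D_X(1), ..., D_X(J), A_X(J)
        frames = np.array_split(sub, n_frames)   # S frames per sub-signal
        lpcs = [lpc_coeffs(f, order) for f in frames]
        feature.append(np.mean(lpcs, axis=0))    # wlpca_{D_X(j)}, Eq. (4)
    return np.concatenate(feature)               # WLPCA, Eq. (5)
```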

The feature extraction matrix containing the \(WLPCA\) vectors of all windows is fed to the GMM classifier. Let \(F\) be the feature matrix extracted by \(WLPCA\) for speaker \(X\); its mixture density is defined as

$$p(F/{\lambda }_{X})= \sum_{i=1}^{M}{p}_{i}^{X}{b}_{i}^{X}(F)$$
(6)

That is, the mixture density is a weighted linear combination of \(M\) unimodal Gaussian densities \({b}_{i}^{X}(F)\), each of the form

$${b}_{i}^{X}\left(F\right)=\frac{1}{{(2\pi )}^{D/2}{\left|{\Sigma }_{i}^{X}\right|}^{1/2}}\,{\text{exp}}\left\{-\frac{1}{2}{\left(F-{\mu }_{i}^{X}\right)}^{\prime}{\left({\Sigma }_{i}^{X}\right)}^{-1}\left(F-{\mu }_{i}^{X}\right)\right\}$$
(7)

where \({\mu }_{i}^{X}\) is the mean vector and \({\Sigma }_{i}^{X}\) is the covariance matrix of the \(i\)-th component. The mixture weights \({p}_{i}^{X}\) satisfy the constraint

$$\sum_{i=1}^{M}{p}_{i}^{X}=1$$
(8)

The parameters of speaker \(X\)'s density model are collectively denoted as

$${\lambda }_{X}=\left\{{p}_{i}^{X},{\mu }_{i}^{X}, {\Sigma }_{i}^{X}\right\}, i=1,\dots , M$$
(9)

The iterative expectation–maximization (EM) algorithm is used to estimate the maximum-likelihood speaker model parameters. In this study, a simple maximum-likelihood classifier was used for identification: for a reference database of speakers \(\mathcal{l}=\{1, 2,\dots, Y\}\) represented by models \({\lambda }_{1}\), \({\lambda }_{2}\),…, \({\lambda }_{Y}\), the speaker whose model yields the maximum posterior probability for the input feature vector sequence \(F=\{{f}_{1},{f}_{2},\dots ,{f}_{T}\}\) is selected [44,45,46].
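These modeling and maximum-likelihood decision steps can be sketched with scikit-learn's GaussianMixture standing in for the EM-trained speaker models; the number of mixtures and the diagonal covariance type are assumptions made for the example.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_mixtures=10):
    """Fit one GMM lambda_X per registered speaker via the EM algorithm.
    features_by_speaker maps a speaker id to an (n_windows, dim) array."""
    models = {}
    for speaker, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag")
        models[speaker] = gmm.fit(feats)
    return models

def identify(models, test_features):
    """Return the speaker whose model maximizes the log-likelihood of F."""
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```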

4 Results and discussion

4.1 Recorded database

Speech signals with a bandwidth of 4000 Hz were recorded at a sampling frequency of 8000 Hz using a PC sound card. The recordings involved 50 people, each of whom recorded a minimum of 20 different Arabic utterances. The speakers, 28 men and 22 women, ranged in age from 20 to 45 years. The recordings were made in a typical university office setting. This database was used in the first stage of our investigation, and all experiments were conducted with the text-independent speaker identification system. The normalized, silence-removed signals were passed to the WLPCA (discrete wavelet transform with a linear prediction coding algorithm) for feature vector extraction, and the GMM was then used to model the feature vectors obtained for each speaker. Half of each speaker's signals were used for training and the other half for testing, and all speakers in the database were used for algorithm evaluation. The two main objectives of this section are to investigate the best WLPCA parameters and to compare the new WLPCA method with the well-known MFCC feature extraction method. The experiments were evaluated in terms of recognition rate.
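The exact normalization and silence-removal procedure is not detailed here; a minimal energy-based sketch of such preprocessing is shown below, with the frame length and energy threshold chosen only for illustration.

```python
import numpy as np

def preprocess(signal, frame_len=160, energy_thresh=0.01):
    """Peak-normalize the signal and drop low-energy (silence) frames."""
    x = signal / (np.max(np.abs(signal)) + 1e-12)   # amplitude normalization
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > energy_thresh]
    return np.concatenate(voiced) if voiced else x
```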

In the first experiment, the system was run for six LPC coefficient vector lengths: 5, 10, 12, 15, 20, and 30. Four DWT decomposition levels, a window length of 400 samples, and 10 Gaussian mixtures were applied. Table 1 summarizes the results of this experiment for the DWT. The results are reported in terms of recognition rate, calculated as the ratio of the number of correctly recognized test signals to the total number of test signals. The best result was observed for 12 LPC coefficients.

Table 1 The effect of the number of LPC coefficients on the recognition rate

In the next experiment, the process was repeated for different DWT decomposition levels (3, 4, and 5) with 12 LPC coefficients; the window length was 400 and the number of Gaussian mixtures was set to 10. Table 2 presents the results of this experiment. The best recognition rate of the presented method (0.939) was observed for 4 DWT decomposition levels.

Table 2 The effect of the number of DWT levels on the recognition rate

Table 3 contains the recognition rates calculated for different window lengths, with the system parameters set as follows: DWT level 4, 12 LPC coefficients, and 10 Gaussian mixtures. The results in Table 3 show that the best performance is obtained at a window length of 400 samples, a relatively small window size. The reason is that a smaller window allows the signal to be divided into more windows, so more feature vectors are obtained from the same signal length; a window smaller than 400 samples, however, would be of less benefit.

Table 3 The effect of window length on the recognition rate

In the next part of this study, we compare WLPCA with other well-known published feature extraction methods: the AFLPC method with wavelet packets (WPLPCF) at WP level 2 [41, 42], the conventional LPC [47], AFLPC [41, 42], and MFCC [48]. With 10 Gaussian mixtures, the proposed method gave the best recognition rate, as summarized in Table 4.

Table 4 Recognition rate comparisons between several feature extraction methods

The WP results were nearly as promising as those of the DWT. The limitation of WP, compared with DWT, is the length of the feature extraction vector: it is difficult to increase the WP level when the window length is fixed, whereas DWT offers more flexibility in the decomposition level. For WP at level 2, the feature extraction vector length is close to that obtained by DWT at level 4.

The proposed method produced better results than the MFCC method over all tested numbers of Gaussian mixtures. The reason is that the LPC coefficients in our method are obtained from different frequency passbands, provided by the decomposition into several DWT levels. Additionally, averaging over the frames of each sub-signal helps produce more accurate results.

4.2 TIMIT database

A standard database is an essential tool for validating a recognition algorithm, and a common standard database is required to generalize our new method. The TIMIT database is without doubt one of the most widely used standard databases [49]. It contains 630 speakers, divided into 438 males and 192 females. Ten utterances were recorded for each speaker at an 8000 Hz sampling frequency, using a wideband microphone in a clean environment.

In the next experiment, WLPCA and MFCC were tested using the TIMIT database. The signals were preprocessed with a silence-removal algorithm. For each class (speaker), the GMM was trained on 8 of the 10 utterances, and the remaining two utterances were used for testing. The TIMIT recognition rate was calculated by running experiments on sets of 50 speakers selected at random from the 630 speakers and averaging the resulting recognition rates. Figure 1 illustrates the results for the TIMIT database. The experiment was performed for 5, 10, 20, and 30 Gaussian mixtures; for each of these mixture numbers, 10 random sets were used to compute the average recognition rate. As shown in the figure, WLPCA performs better, particularly with a small number of Gaussian mixtures. A sketch of this averaging protocol is given below.
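In the following sketch, run_experiment is a hypothetical callable (it would train on 8 utterances per speaker and test on the remaining 2); the set size and number of sets mirror the values stated above.

```python
import numpy as np

def average_recognition_rate(all_speakers, run_experiment,
                             n_sets=10, set_size=50, seed=0):
    """Average the recognition rate over random subsets of speakers."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_sets):
        subset = rng.choice(all_speakers, size=set_size, replace=False)
        rates.append(run_experiment(subset))   # one train/test run on the subset
    return float(np.mean(rates))
```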

Fig. 1 The results of the TIMIT database

In contrast to WLPCA, MFCC behaved better with a large number of Gaussian mixtures. However, the running time grows as the number of Gaussian mixtures increases: MFCC with 30 Gaussian mixtures needs 7.5 min to reach its maximum recognition rate for a set of 50 speakers, whereas WLPCA needs 3.8 min. We can therefore state that WLPCA requires fewer Gaussian mixtures and less running time than MFCC.

Cross-validation estimates recognition performance by repeatedly holding out a portion of the instances (10% for tenfold cross-validation) as the test set while training on the remainder [15, 19]. We ran a tenfold cross-validation test to assess our algorithm under this validation technique, and a 20-fold cross-validation test was also carried out. Table 5 shows the results of the tenfold and 20-fold cross-validation tests; 10 random sets were used to compute each recognition rate for 5 and 30 Gaussian mixtures. As shown in the table, the proposed method is stable and not sensitive to the validation technique.

Table 5 Tenfold and 20-fold cross-validation tests results
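The k-fold protocol can be sketched with scikit-learn's StratifiedKFold; fit_and_score is a hypothetical callable that trains the speaker models on the training split and returns the recognition rate on the held-out split, and features/labels are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_rate(features, labels, fit_and_score, n_splits=10):
    """k-fold cross-validation of the identification system (10% held out
    per fold for n_splits=10; use n_splits=20 for 20-fold validation)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    rates = [fit_and_score(features[tr], labels[tr], features[te], labels[te])
             for tr, te in skf.split(features, labels)]
    return float(np.mean(rates))
```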

In the next experiment, WLPCA and MFCC were tested in a noisy environment by adding additive white Gaussian noise (AWGN) to the TIMIT signals, which have a signal-to-noise ratio (SNR) of about 59 dB. The identification experiment was conducted at 0 dB and 10 dB SNR, with 5 and 30 Gaussian mixtures; for each combination of mixture number and SNR, 10 random sets were used to compute the recognition rate. Table 6 summarizes the recognition rates for 5 and 30 Gaussian mixtures in the 0 dB and 10 dB SNR environments. The results show that our method performs slightly better than the MFCC method.
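A minimal sketch of degrading a clean signal to a target SNR with AWGN is shown below; the function name is illustrative.

```python
import numpy as np

def add_awgn(signal, snr_db):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: degrade a clean utterance to the tested SNR levels
# noisy_10db = add_awgn(clean, 10)
# noisy_0db  = add_awgn(clean, 0)
```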

Table 6 The recognition rate for 5 and 30 Gaussian mixtures in 0 dB and 10 dB SNR environments

5 Conclusion

In this research, a new speaker feature extraction method based on the discrete wavelet transform with a linear prediction coding algorithm (WLPCA) was proposed; it yields some improvement in recognition rate over MFCC. The proposed method was evaluated with respect to the LPC coefficient vector length, the discrete wavelet transform level, the window length, and the number of Gaussian mixtures. The performance of the method was assessed experimentally on two speech databases: our recorded database and the publicly available TIMIT database. The proposed method's performance was compared with that of the MFCC method for feature extraction in speaker recognition, and the speech features derived by the new method gave a more suitable representation in terms of computation time, requiring fewer Gaussian mixtures than MFCC. The new method also showed a slight improvement over MFCC in a noisy environment. The reason is that the LPC coefficients in our method are obtained from different frequency passbands, provided by the decomposition into several DWT levels; additionally, averaging over the frames of each sub-signal helps produce more accurate results.