1 Introduction

Automatic speaker verification (ASV) is a biometric authentication technique that recognizes people by analysing their speech. With its rapid development, ASV has been widely used in daily life, the judicial system, and finance. Unlike other biometric modalities such as fingerprints, irises, and faces, voiceprint authentication does not require face-to-face contact with the user. As a result, speech is more susceptible to spoofing attacks than other biometric signals [1, 2]. In addition, high-quality audio capture devices and powerful audio editing software make it easier to produce spoofed speech for attacking ASV systems.

Spoofing attacks can be categorized as impersonation, replay, speech conversion, and speech synthesis [3]. Existing ASV techniques can already resist impersonation attacks effectively. Speech conversion and speech synthesis require the attacker to have specialized technical expertise, and these attacks can be effectively countered by existing solutions [4, 5]. Replay attacks, however, are the most accessible and can be highly effective. More importantly, the popularity and portability of high-fidelity audio equipment in recent years have greatly increased the threat that replayed speech poses to ASV systems.

In the past two years, replay attacks have received extensive attention from researchers. The baseline of the ASVspoof 2017 Challenge uses Constant-Q Cepstral Coefficients (CQCC) to detect spoofing attacks and has an equal error rate (EER) of 24.55% [6]. On this database, multi-feature fusion and ensemble classifier methods have been used for replay attack detection, giving an EER of 10.8% [7]. Fusing the RFCC and LFCC features reduced the EER to 10.52% [8]. The I-MFCC feature has also been shown to be effective in detecting replayed speech [9], as have high-frequency features obtained with the constant-Q transform (CQT) [10]. Recently, Delgado et al. applied Cepstral Mean and Variance Normalization (CMVN) to CQCC features [11], and their results show that this method is very effective for detecting replay attacks. Although the above work improves significantly on the baseline, its computational complexity is relatively high because of the CQT.

Recent work has focused on finding effective features rather than on analysing the differences between replayed and genuine speech in each sub-band. In this work, we analyse these sub-band differences and, based on them, discuss feature extraction approaches.

2 Database

The ASVspoof 2017 corpus is used in our investigations. The corpus is partitioned into three subsets: training, development, and evaluation. A summary of their composition is presented in Table 1. In this paper, the training and development subsets are used to train the model, and the evaluation subset is used to test its performance.

Table 1. Statistics of the ASVspoof 2017 corpus.

3 Sub-band Analysis

First, the speech signal is transformed from the time domain to the frequency domain by a time-frequency transform. The entire frequency band is then divided into either 8 or 16 sub-bands. In each experiment, one sub-band is removed, features are extracted from the remaining sub-bands, and a GMM is trained on them; the equal error rate (EER) is used as the measure of feature performance. Finally, a classification-level measure of the discriminative ability of each sub-band is estimated using the EER ratio of this sub-band based spoofing detection system.

3.1 Sub-band Division and Analysis

The sub-band feature extraction process is shown in Fig. 1. For each frame of speech, the frequency bins are grouped into sub-bands based on DFT bin groupings. The number of DFT bins is 256, and a Hanning window is used. In each experiment, one sub-band is removed, and the DCT is applied to the log magnitude of the remaining sub-bands to obtain the remaining sub-band features. The features have 150 dimensions, comprising 50 DCT coefficients along with their deltas and delta-deltas. Cepstral mean and variance normalization (CMVN) [12] is an efficient normalization technique for removing nuisance channel effects, so CMVN is also applied to the sub-band features.

Fig. 1. Sub-band feature extraction
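
The extraction steps in Fig. 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 512-sample frame, the 160-sample hop, the reading of "256 DFT bins" as the first 256 bins of a 512-point DFT, the use of `np.gradient` for the deltas, and per-utterance CMVN are all choices made here for the example.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping the first n_coeffs outputs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]                   # output (cepstral) index
    m = np.arange(n)[None, :]                          # input (frequency bin) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))  # shape (n_coeffs, n)
    return x @ basis.T

def subband_features(signal, removed_band, n_bands=8, n_bins=256,
                     frame_len=512, hop=160, n_coeffs=50):
    """Leave-one-sub-band-out features: (n_frames, 3 * n_coeffs) array."""
    # Frame the signal and apply a Hanning window (frame_len and hop are assumed).
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Log magnitude of the first n_bins DFT bins of each frame.
    spectrum = np.abs(np.fft.rfft(frames, n=2 * n_bins, axis=1))[:, :n_bins]
    log_mag = np.log(spectrum + 1e-10)

    # Drop the bins that belong to the removed sub-band (0-based index).
    width = n_bins // n_bands
    keep = np.ones(n_bins, dtype=bool)
    keep[removed_band * width:(removed_band + 1) * width] = False

    # DCT of the remaining log magnitudes, then deltas and delta-deltas over time.
    static = dct_ii(log_mag[:, keep], n_coeffs)
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    feats = np.hstack([static, delta, delta2])

    # CMVN: zero mean and unit variance per dimension over the utterance.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

# Usage on one second of synthetic noise (16 kHz is implied by the 8 kHz bandwidth).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
feats = subband_features(signal, removed_band=0)
print(feats.shape)  # (97, 150)
```

Calling the function once per value of `removed_band` gives the feature sets for the leave-one-out experiments; setting `n_bands=16` gives the 0.5 kHz division.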

Let \( EER \) denote the equal error rate obtained with all sub-bands, and \( EER_{i} \) the equal error rate obtained with the remaining sub-bands after removing the i-th sub-band. The ratio \( r_{i} \) of \( EER_{i} \) to \( EER \) represents the contribution of the i-th sub-band: a value of \( r_{i} \) greater than 1 means that removing the i-th sub-band increases the error, so that sub-band carries discriminative information. The ratio is defined as follows:

$$ r_{i} = \frac{EER_{i}}{EER} \tag{1} $$
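
As a small worked illustration of Eq. (1), the sketch below computes the ratios and lists the sub-bands whose removal raises the error. The function name and the EER values are hypothetical and chosen only for the example; they are not results from this paper.

```python
def eer_ratios(eer_all, eer_without):
    """Return r_i = EER_i / EER for every removed sub-band, as in Eq. (1)."""
    if eer_all <= 0:
        raise ValueError("the EER of the all-sub-band system must be positive")
    return [eer_i / eer_all for eer_i in eer_without]

# Hypothetical EERs in percent, NOT values taken from the tables of this paper.
eer_all = 10.0                         # system using all sub-bands
eer_without = [15.0, 10.0, 9.5, 18.0]  # system with sub-band 1, 2, 3, 4 removed

ratios = eer_ratios(eer_all, eer_without)

# Sub-bands (1-based) whose removal increases the EER, i.e. r_i > 1.
discriminative = [i + 1 for i, r in enumerate(ratios) if r > 1.0]
print(ratios)
print(discriminative)
```

In this toy case the first and last sub-bands have ratios above 1, so they would be the ones considered discriminative.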

Two sub-band divisions were used. The first divides the speech bandwidth into uniform 1 kHz wide sub-bands, and the second into uniform 0.5 kHz wide sub-bands. They are referred to as the 8-band and 16-band divisions in the rest of the paper.

3.2 GMM Models and Performance Indicators

In Sect. 3.1, one sub-band was removed at a time. A 256-component GMM system trained on the features of the remaining sub-bands is used to determine the discriminative ability of the removed sub-band. The GMM training and identification process is shown in Fig. 2. The primary metric is the EER [13].

Fig. 2. GMM training process
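
Since the EER is the primary metric, a short sketch of how it can be estimated from detection scores may be useful. The code assumes that higher scores indicate genuine speech (for example, log-likelihood ratios from a two-class GMM back end, which is an assumption here) and approximates the EER by the threshold at which the false acceptance and false rejection rates are closest. The function name and the scores are hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER: sweep thresholds and return the point where FAR and FRR meet."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical scores, for illustration only.
genuine = [2.0, 1.5, 0.5, 3.0]
spoof = [-1.0, 0.0, 1.0, -2.0]
eer = equal_error_rate(genuine, spoof)
print(eer)  # 0.25
```

With only a few trials the estimate is coarse; on a full evaluation set the threshold sweep gives a much finer approximation.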

3.3 Sub-band Analysis Results

Table 2 shows \( EER_{i} \) and \( r_{i} \) for the 8-band division. The results show that \( r_{i} \) for the 1st and 8th sub-bands is clearly greater than 1. That is, the 0–1 kHz and 7–8 kHz sub-bands are identified as the most discriminative frequency regions.

Table 2. Experimental results for the 8-band division

Table 3 shows \( EER_{i} \) and \( r_{i} \) for the 16-band division. The results show that, at low frequencies, the 0–0.5 kHz sub-band contains more discriminative information than the 0.5–1 kHz sub-band. In the high-frequency region, the 7.5–8 kHz sub-band also contains more discriminative information.

Table 3. Experimental results for the 16-band division

As can be seen from Tables 2 and 3, the 0–0.5 kHz and 7–8 kHz sub-bands are identified as the most discriminative frequency regions. Moreover, the high frequencies contain more discriminative information than the low frequencies.

4 Filter Banks Design

To make better use of the discriminative information in the 0–1 kHz and 7–8 kHz sub-bands, we propose two filter design approaches. The basic idea behind both is to allocate a greater number of filters to the discriminative sub-bands [3].

Two different filter bank design approaches are presented in this paper. Both assign the center frequencies of triangular filters across the speech bandwidth. The first approach allocates more linear filters to the discriminative frequency bands, based on the \( r_{i} \) values from Sect. 3. The second approach, also based on \( r_{i} \), allocates Mel filter banks to the low-frequency bands, linear filter banks to the intermediate-frequency bands, and I-Mel filter banks to the high-frequency bands. Cepstral coefficients are computed from the filter bank outputs; they have 46 dimensions, comprising 15 DCT coefficients along with their deltas, delta-deltas, and log-energy. The feature extraction process is shown in Fig. 3.

Fig. 3. Feature extraction
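
To make the 46-dimensional layout concrete, the sketch below assembles cepstral features from a given filter bank. It reads the dimension count as 15 static DCT coefficients, 15 deltas, 15 delta-deltas, and one log-energy term (3 × 15 + 1 = 46), which is one plausible interpretation rather than a confirmed detail. The function name, the `np.gradient` deltas, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def filterbank_cepstra(power_spec, bank, n_ceps=15):
    """power_spec: (frames, bins); bank: (filters, bins) non-negative weights."""
    # Filter bank energies and their logarithm.
    energies = power_spec @ bank.T                      # (frames, filters)
    log_energies = np.log(energies + 1e-10)

    # DCT-II over the filter axis, keeping the first n_ceps coefficients.
    n_filters = bank.shape[0]
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    static = log_energies @ basis.T                     # (frames, n_ceps)

    # Deltas and delta-deltas along time (np.gradient is an assumed choice).
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)

    # Frame log-energy, appended as a single extra dimension.
    frame_log_energy = np.log(power_spec.sum(axis=1, keepdims=True) + 1e-10)
    return np.hstack([static, delta, delta2, frame_log_energy])

# Usage with synthetic inputs: 20 frames, 256 bins, 37 filters.
rng = np.random.default_rng(0)
power_spec = rng.random((20, 256)) + 0.1
bank = rng.random((37, 256))
cepstra = filterbank_cepstra(power_spec, bank)
print(cepstra.shape)  # (20, 46)
```

Any filter bank with one row of weights per filter can be passed as `bank`, so the same routine serves for linear, Mel, I-Mel, or combined designs.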

4.1 Linear Filter Design

The idea of this approach is to allocate a greater number of filters to the discriminative sub-bands. The number of linear filters allocated to each band is related to \( r_{i} \). For example, in the 8-band experiment, \( r_{i} \) is 1.5 for 0–1 kHz, around 1.0 for 1–7 kHz, and around 1.8 for 7–8 kHz. The 8-band filter design therefore allocates 6 linear filters to the 0–1 kHz band, 4 linear filters per 1 kHz to the 1–7 kHz band, and 7 linear filters to the 7–8 kHz band. The shape of the filter bank is shown in Fig. 4.

Fig. 4. 8 sub-band linear filter design

Following the same idea, the 16-band filter bank allocates 3 linear filters to 0–0.5 kHz, 26 linear filters to 0.5–7 kHz, and 7 linear filters to 7–8 kHz. The shape of the filter bank is shown in Fig. 5.

Fig. 5. 16 sub-band linear filter design
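
The two linear allocations can be written down as a short sketch. The filter counts per region come from the text (6 + 24 + 7 = 37 filters for the 8-band design and 3 + 26 + 7 = 36 for the 16-band design). How the center frequencies are placed inside each region is not specified, so the even spacing used here, the choice of neighbouring centers as triangle edges, and all function names are assumptions.

```python
def linear_centres(low_hz, high_hz, n_filters):
    """n_filters centre frequencies evenly spaced inside [low_hz, high_hz]."""
    step = (high_hz - low_hz) / n_filters
    return [low_hz + step * (k + 0.5) for k in range(n_filters)]

def allocate(regions):
    """regions: list of (low_hz, high_hz, n_filters) covering the bandwidth."""
    centres = []
    for low_hz, high_hz, n_filters in regions:
        centres.extend(linear_centres(low_hz, high_hz, n_filters))
    return centres

def triangular_bank(centres, f_min=0.0, f_max=8000.0, n_bins=256):
    """Triangular weights on a uniform grid; each filter spans its two neighbours."""
    grid = [f_min + (f_max - f_min) * b / n_bins for b in range(n_bins)]
    edges = [f_min] + list(centres) + [f_max]
    bank = []
    for j in range(1, len(edges) - 1):
        left, mid, right = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in grid:
            if left <= f <= mid:
                row.append((f - left) / (mid - left))    # rising slope
            elif mid < f <= right:
                row.append((right - f) / (right - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

# Filter counts per region as stated in the text.
eight_band = allocate([(0, 1000, 6), (1000, 7000, 24), (7000, 8000, 7)])
sixteen_band = allocate([(0, 500, 3), (500, 7000, 26), (7000, 8000, 7)])
bank8 = triangular_bank(eight_band)
print(len(eight_band), len(sixteen_band))  # 37 36
```

The resulting center spacing is about 167 Hz below 1 kHz, 250 Hz between 1 and 7 kHz, and about 143 Hz above 7 kHz for the 8-band design, which reflects the denser coverage of the discriminative regions.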

4.2 Mel, Linear, and I-Mel Filter Design

The idea of this approach is not only to allocate a greater number of filters to the discriminative sub-bands but also to assign a more appropriate filter type to each sub-band. At low frequencies, we use Mel filters to enhance low-frequency detail. At high frequencies, we use I-Mel filters (which invert the Mel scale so that it runs from high frequency to low frequency) to enhance high-frequency detail, while the intermediate frequencies use linear filters. Accordingly, the 8-band design allocates 6 Mel filters to the 0–1 kHz band, 4 linear filters per 1 kHz to the 1–7 kHz band, and 7 I-Mel filters to the 7–8 kHz band. The shape of the filter bank is shown in Fig. 6.

Fig. 6. 8 sub-band combination filter design

Following the 8-band design, the 16-band filter bank allocates 3 Mel filters to the 0–0.5 kHz band, 26 linear filters to the 0.5–7 kHz band, and 7 I-Mel filters to the 7–8 kHz band. The filter bank is shown in Fig. 7.

Fig. 7. 16 sub-band Mel, Linear, and I-Mel filter design
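
The combined design can be illustrated by computing center frequencies for the three regions. The sketch uses the common Mel formula m = 2595 log10(1 + f/700) and treats the I-Mel scale as the Mel scale mirrored about the middle of the 0–8 kHz band; both are assumptions, since the exact warping functions are not given here, and the function names and the even spacing in the warped domain are likewise illustrative.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def linear_centres(low, high, n_filters):
    """n_filters evenly spaced centres inside [low, high]."""
    step = (high - low) / n_filters
    return [low + step * (k + 0.5) for k in range(n_filters)]

def mel_centres(low_hz, high_hz, n_filters):
    """Centres evenly spaced on the Mel scale, so denser towards low_hz."""
    mels = linear_centres(hz_to_mel(low_hz), hz_to_mel(high_hz), n_filters)
    return [mel_to_hz(m) for m in mels]

def imel_centres(low_hz, high_hz, n_filters, f_max=8000.0):
    """Mirror image of the Mel spacing, so denser towards high_hz."""
    mirrored = mel_centres(f_max - high_hz, f_max - low_hz, n_filters)
    return sorted(f_max - c for c in mirrored)

# Filter counts per region as stated in the text.
eight_band = (mel_centres(0, 1000, 6)
              + linear_centres(1000, 7000, 24)
              + imel_centres(7000, 8000, 7))
sixteen_band = (mel_centres(0, 500, 3)
                + linear_centres(500, 7000, 26)
                + imel_centres(7000, 8000, 7))
print(len(eight_band), len(sixteen_band))  # 37 36
```

With this construction the spacing between Mel centers grows with frequency below 1 kHz, while the spacing between I-Mel centers shrinks as the frequency approaches 8 kHz, so the finest resolution sits at the two ends of the band.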

5 Results and Discussion

This paper proposes a new filter design method that uses the EER ratio of each sub-band to determine the number and shape of the filters in that sub-band. To verify the validity of the proposed filter banks, we compare the cepstral coefficients extracted with them against those extracted with traditional filter banks, namely LFCC, MFCC, and I-MFCC. The extraction process is as shown in Fig. 3. The cepstral coefficients have 46 dimensions, comprising 15 DCT coefficients along with their deltas, delta-deltas, and log-energy. In addition, we compare the proposed algorithm with algorithms proposed by other researchers. The experimental results show that our algorithm outperforms the other published methods to varying degrees (Table 4).

Table 4. Experimental results

6 Conclusions

In this paper, we have used the EER ratio to identify sub-bands that contain discriminative information between genuine and replayed speech. Two such discriminative sub-bands were identified: 0–0.5 kHz and 7–8 kHz. We then proposed two approaches to designing banks of triangular filters that allocate a greater number of filters to the more discriminative sub-bands. The two approaches were experimentally validated on the ASVspoof 2017 corpus and outperform approaches proposed by other researchers. The number of filters in the filter bank is a key parameter that may have a significant effect on system performance, so future work will pay more attention to the choice of filters for each sub-band.