Increasing anti-spoofing protection in speaker verification using linear prediction
This article addresses the problem of anti-spoofing protection in an automatic speaker verification (ASV) system. An improved version of a previously proposed spoofing countermeasure is presented. The method is based on the analysis of the linear prediction error resulting from both short- and long-term prediction of the input speech signal. It was observed that non-natural speech signals, i.e., synthetic or converted speech, were predicted differently than genuine speech. Therefore, in contrast to classical linear prediction analysis, where usually only the prediction coefficients are analyzed, the proposed approach examines the residual (error) signals. During this analysis, 23 prediction parameters were extracted, such as the energy of the prediction error, prediction gains and temporal parameters related to the prediction error signals. Various binary classifiers were evaluated to separate the human and spoof classes; the support vector machine with a radial basis function kernel (SVM-RBF) yielded the best results. When tested on the corpora provided for the ASVspoof 2015 Challenge, the proposed countermeasure returned better results than the previous version of the algorithm and, in most cases, than the baseline spoofing detector based on local binary patterns (LBP). It is hoped that the proposed method can become part of a generalized spoofing countermeasure helping to increase the security of ASV systems.
Keywords: Speaker verification · Spoofing · Linear prediction · Local binary patterns · SVM-RBF
User authorization based on a human voice, often referred to as speaker verification, is becoming more and more popular. Automatic speaker verification (ASV) systems are used to authorize access, e.g., to a mobile phone, or to authenticate customers over a phone line. The latter application is already used, e.g., by banks such as Bank Smart in Poland, or by public institutions, such as the Australian Taxation Office.
Similar to other biometric modalities, systems based on voice are also prone to spoofing, i.e., attacks to get illegitimate access. In the context of ASV systems, spoofing is realized by presenting artificial or manipulated speech, which can be generated using, e.g., speech synthesis or voice conversion algorithms. Since these techniques are becoming more easily available and are constantly improving their quality, they have begun to pose a major threat to ASV systems. Quite recently, researchers have started to investigate how much ASV systems are prone to spoofing using various methods. Various researchers have worked on assessing the threat caused by imitators [13, 23], speech synthesis [26, 35], converted speech [22, 39] or replay of previously acquired recordings [3, 24]. In parallel, much effort has been invested in work on various spoofing countermeasures. These algorithms can be either dedicated to a given attack or can be generally applicable. A thorough review of spoofing methods and their countermeasures can be found in .
In  a novel spoofing countermeasure was proposed. It was based on analysis of linear prediction error and used a logistic classifier as a detection algorithm. The idea of this algorithm was inspired by the fact that synthetic or converted voice is quite likely to be either very easily predicted, if generated with a simplified acoustic model, or very difficult to predict, if any artifacts in the signal are present. Experiments conducted using the datasets provided by the organizers of the first ASV Spoofing and Countermeasures Challenge ASVspoof 2015  showed that the proposed method was able to detect spoofing effectuated with voice conversion and speech synthesis. When testing spoofing detection independently from an ASV system, it yielded an equal error rate (EER) of less than 9 % for the Development corpus, while the baseline method based on local binary patterns (LBP) resulted in an EER higher than 14 %.
The work described in this article is the continuation of the author’s research aiming to contribute to the speech community’s efforts to find efficient anti-spoofing methods for ASV systems. In the current study, a new version of the spoofing detection algorithm was proposed. It uses an extended set of parameters and employs a new, more powerful classifier. The performance of this anti-spoofing protection is compared with the previous version of this algorithm  as well as with the results of the detector based on LBPs, which has been efficient in other studies .
In this article, first the state-of-the-art in spoofing countermeasures is summarized and the basics of linear prediction theory are recalled. Then, in Section 3 the proposed countermeasure is presented and compared with its previous version. In Section 4 the experimental set-up is described. In Section 5 the results are presented and compared with the results from the previous version of the algorithm and the LBP-based countermeasure. Finally, Section 6 concludes the paper and presents perspectives for future work.
2 Previous work
2.1 Spoofing countermeasures for ASV systems
One group of countermeasures exploits prior knowledge about the origin of the spoofing attack. For example, some algorithms detect artifacts typical of speech synthesis, such as simplification of F0 contours . In  the authors proposed an algorithm that was based on measuring the pair-wise distance (PWD) between spectral parameters (such as linear prediction coefficients or Mel-cepstral coefficients) in consecutive frames. The authors claimed that voice conversion decreases PWD values and, as a consequence, changes PWD distributions. They compared speaker-dependent PWD distributions between genuine and converted speech using speech data from the NIST’06 database and the NIST SRE protocol. The authors showed that the proposed countermeasure was able to lower the EER from more than 30 % to below 3 %.
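The pair-wise distance idea can be illustrated with a short sketch. The helper below is hypothetical: it computes the mean Euclidean distance between spectral parameter vectors of consecutive frames, which is the statistic the cited work claims voice conversion compresses; the exact distance measure and normalization used there may differ.

```python
import numpy as np

def mean_pairwise_distance(frames):
    """Mean Euclidean distance between spectral parameter vectors
    (e.g. LPC or Mel-cepstral coefficients) of consecutive frames.

    A hypothetical helper illustrating the PWD statistic: converted
    speech is claimed to yield smaller frame-to-frame distances than
    genuine speech.
    """
    diffs = np.diff(frames, axis=0)   # differences between consecutive frames
    return float(np.mean(np.linalg.norm(diffs, axis=1)))
```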
Countermeasures dedicated to detecting replay attacks often try to identify unexpected channel artifacts indicative of recording and replaying. Such algorithms were reported in , for which the EER of a baseline GMM-UBM system was shown to decrease from 40 % to 10 % with active countermeasures. Another replay countermeasure aimed at detecting far-field recordings, which are unlikely in natural access scenarios where the speaker is usually close to the microphone .
Not many algorithms have so far been reported that claim to be more generally applicable and less dependent on prior knowledge of the attack. For example, one group of such methods exploits the fact that many speech synthesis and voice conversion algorithms disturb the natural phase of the speech signal. In  the authors challenged GMM-UBM and SVM-GMM speaker verification systems with genuine and synthesized speech originating from the WSJ corpus. They showed that by using relative phase shift (RPS) features it was possible to decrease the EER from over 81 % to less than 3 %. Unfortunately, the method proposed was vocoder-dependent. Similarly, phase information was successfully used in detecting converted speech in .
Another generalized method was presented in . It was based on the LBP analysis of speech cepstrograms and was inspired by an original application to image texture analysis . In this approach LBP analysis was applied to a Mel-scaled cepstrogram with appended dynamic features. The authors claimed that modifications made through spoofing disturb the natural ‘texture’ of the speech signal. Experimental results showed that the LBP-based textrogram analysis was very effective in detecting spoofing trials generated using speech synthesis (EERs of below 1 %), but it was less effective in detecting those originating from voice conversion (EER in the order of 7 %).
2.2 Linear prediction theory
The linear predictive coding (LPC) technique is widely used in speech coding, e.g., in GSM 06.10  or in narrow-band and wide-band adaptive multi-rate (AMR) coders . It can also be used to parametrize signals in speech or speaker recognition. Linear prediction can also be applied to vectors; in such a case, a vector of samples is predicted using another vector of samples from the signal's history. This method is called long-term prediction (LTP), in contrast to LPC, which is sometimes referred to as short-term prediction. LTP is often used on top of LPC, i.e., the LPC error is further processed by LTP; this approach is used in , for example. LTP works especially well for voiced speech, where the signal is quasi-periodic, making it easier to find a matching vector in the signal's history. The prediction error and prediction gain for LTP are defined analogously to those for LPC.
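The LPC/LTP cascade recalled above can be sketched as follows. This is a minimal illustration of how the two residuals are obtained, not the configuration used in the experiments: the prediction order, frame length and lag range are assumed values.

```python
import numpy as np

def lpc_residual(x, order=10):
    """Short-term (LPC) residual via the autocorrelation method.

    Solves the normal equations R a = r for the predictor coefficients
    and computes e(n) = x(n) - sum_i a_i * x(n - i).
    """
    # Autocorrelation sequence r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    e = x.astype(float).copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * x[:-i]
    return e

def ltp_residual(e, lag_min=40, lag_max=160):
    """Long-term (LTP) residual: for each frame, find the lag whose
    delayed vector best matches the current one (normalized
    correlation) and subtract the optimally scaled past vector."""
    out = e.copy()
    frame = lag_min
    for start in range(lag_max, len(e) - frame, frame):
        seg = e[start:start + frame]
        best_gain, best_lag, best_corr = 0.0, lag_min, -np.inf
        for lag in range(lag_min, lag_max + 1):
            past = e[start - lag:start - lag + frame]
            denom = np.dot(past, past)
            if denom <= 0:
                continue
            corr = np.dot(seg, past) ** 2 / denom
            if corr > best_corr:
                best_corr, best_lag = corr, lag
                best_gain = np.dot(seg, past) / denom
        past = e[start - best_lag:start - best_lag + frame]
        out[start:start + frame] = seg - best_gain * past
    return out
```

For quasi-periodic (voiced) input, both stages reduce the residual energy, which is exactly the effect the prediction gain measures.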
3 Proposed spoofing countermeasure
The proposed ASV spoofing countermeasure is based on analysis of the prediction error in the speech signal at the ASV input. One may expect that if a non-natural speech signal undergoes the prediction process, it may be either “too well” predicted (i.e., with a high prediction gain) or ineffectively predicted (i.e., with a prediction gain lower than usual).
It must be stressed that even though the linear prediction technique described above is well known, the way it is used in the proposed method is significantly novel. In the classical application of linear prediction (e.g., in GSM 06.10 speech coding), the main effort is invested in minimizing the prediction error, so that the majority (in this case: 2/3) of the residual (error) samples can be zeroed. The prediction coefficients a_i, meanwhile, are transmitted further and well protected against transmission errors, because they convey the majority of the acoustic information. In the proposed countermeasure, by contrast, the prediction coefficients are completely disregarded and the utmost attention is paid to the prediction error, as it is claimed that these data carry information about the nature of the speech signal, i.e., whether it is genuine or artificial. The following 23 parameters were extracted from the prediction error signals:
- Eight parameters related to the LPC error:
MeanLPCerrAll – mean energy of the LPC error for the whole signal, i.e., mean energy of x′(n);
MaxLPCerrAll – maximum energy of the LPC error for the whole signal;
MeanLPCerrV – mean energy of the LPC error narrowed down to the voiced regions;
MaxLPCerrV – maximum energy of the LPC error for the voiced speech;
MeanLPCgainAll – mean LPC gain for the whole signal (i.e., mean ratio between the energies of the input signal and the LPC error, mean Gp as defined in (2));
MaxLPCgainAll – maximum LPC gain for the whole signal;
MeanLPCgainV – mean LPC gain narrowed down to the voiced regions;
MaxLPCgainV – maximum LPC gain narrowed down to the voiced regions.
- Ten energy-related parameters of the LTP error:
MeanLTPerrAll – mean energy of the LTP error for the whole signal, i.e., mean energy of x″(n);
MaxLTPerrAll – maximum energy of the LTP error for the whole signal;
MeanLTPerrV – mean energy of the LTP error narrowed to voiced regions;
MaxLTPerrV – maximum energy of the LTP error for voiced regions;
MeanLTPgainAll – mean LTP gain (i.e., mean ratio between energies of the LPC and LTP errors);
MaxLTPgainAll – maximum of the LTP gain;
MeanLTPgainV – mean LTP gain for voiced speech;
MaxLTPgainV – maximum LTP gain for voiced speech;
MeanLTPvar – mean variance of the LTP gain error for the whole signal;
MaxLTPvar – maximum variance of the LTP gain error for the whole signal.
- Five time-related parameters of the LTP error:
MeanErrLen – mean length of segments with LTP error above threshold 𝜃;
MaxErrLen – maximum length of segments with LTP error above threshold 𝜃;
MeanNoErrLen – mean length of segments with LTP error equal to or below threshold 𝜃;
MaxNoErrLen – maximum length of segments with LTP error equal to or below threshold 𝜃;
ErrChangeRate – LTP threshold crossing rate (counted per 20ms frame).
Time-related parameters of the LTP error were calculated for the whole signal (including the regions with no speech activity), as it was suspected that various speech synthesis artifacts can be visible throughout the whole signal. The initial version of this algorithm, proposed by , used only a 10-element subset of the parameters presented above, mainly the ones calculated for voiced speech. This version of the anti-spoofing algorithm will hereinafter be referred to as LPAv1 (linear prediction analysis version 1).
Compared with LPAv1, the new version adds the parameters calculated for the whole signal, the parameters related to LPC gain, and the parameters based on the variance of the LTP error. The spoofing countermeasure using the full set of 23 parameters proposed in this study will hereinafter be referred to as LPAv2 (linear prediction analysis version 2).
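As a sketch, a subset of the 23 parameters listed above could be computed from frame energies of the input signal and the two residuals. Only the parameter names come from the article; the frame length, the threshold θ and the per-frame energy definition are illustrative assumptions.

```python
import numpy as np

def lp_features(x, e_lpc, e_ltp, frame=160, theta=0.01):
    """Sketch of a subset of the LP-based parameters.

    x: input signal; e_lpc / e_ltp: short- and long-term residuals.
    """
    n_frames = len(x) // frame
    ex = np.array([np.sum(x[i*frame:(i+1)*frame] ** 2) for i in range(n_frames)])
    e1 = np.array([np.sum(e_lpc[i*frame:(i+1)*frame] ** 2) for i in range(n_frames)])
    e2 = np.array([np.sum(e_ltp[i*frame:(i+1)*frame] ** 2) for i in range(n_frames)])
    feats = {}
    feats['MeanLPCerrAll'], feats['MaxLPCerrAll'] = e1.mean(), e1.max()
    feats['MeanLTPerrAll'], feats['MaxLTPerrAll'] = e2.mean(), e2.max()
    # Prediction gains as energy ratios: input/LPC error and LPC/LTP error
    gp = ex / np.maximum(e1, 1e-12)
    gl = e1 / np.maximum(e2, 1e-12)
    feats['MeanLPCgainAll'], feats['MaxLPCgainAll'] = gp.mean(), gp.max()
    feats['MeanLTPgainAll'], feats['MaxLTPgainAll'] = gl.mean(), gl.max()
    # Time-related parameters: run lengths of frames above/below theta
    above = e2 > theta
    runs_hi, runs_lo, run, state = [], [], 0, above[0]
    for flag in above:
        if flag == state:
            run += 1
        else:
            (runs_hi if state else runs_lo).append(run)
            run, state = 1, flag
    (runs_hi if state else runs_lo).append(run)
    feats['MeanErrLen'] = float(np.mean(runs_hi)) if runs_hi else 0.0
    feats['MaxErrLen'] = float(max(runs_hi)) if runs_hi else 0.0
    feats['MeanNoErrLen'] = float(np.mean(runs_lo)) if runs_lo else 0.0
    feats['ErrChangeRate'] = float(np.sum(above[1:] != above[:-1])) / n_frames
    return feats
```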
4 Experimental set-up
The experiments were conducted on the corpus provided by the ASVspoof 2015 Challenge organizers . The corpus, originating from the English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, contained recordings of various access trials, annotated either as human voice or as a spoof trial. The recordings were divided into three parts: Training, Development and Evaluation, and consisted of 16,375, 53,372 and 193,404 recordings, respectively. The spoof trials were generated using 10 different spoofing algorithms (S1..S10), based either on speech synthesis (S3, S4, and S10) or on voice conversion (the remaining ones). Their spoofing efficiency ranged from 25.42 % to 45.79 % EER, with the exception of S2, which yielded a very low spoofing efficiency equal to 0.87 % EER. The baseline EER value achieved with the iVectors-PLDA system equaled 0.42 % .
The proposed spoof detection system and baseline LBP-based detector were trained using the Training database. Experiments with spoofing detection, including parameter tuning, were run using the Development corpus. Since the spoofing efficiency caused by method S2 was very low, some of the measurements were done on a subset of recordings without S2 trials. These recordings were also excluded from the training set because preliminary experiments showed that they had a negative impact on the spoofing detection. The Evaluation corpus was tested to show the performance both for already known and for previously unseen spoofing algorithms.
Contrary to previous research , the analysis of the prediction error was run not only for voiced regions, but for the whole signal. Wherever the analysis was narrowed to voiced regions, voicing detection was realized using the SWIPE pitch detector .
Feature extraction of the baseline LBP-based countermeasure was set up according to the description in . Each signal was analyzed forming a feature matrix consisting of 16 cepstral coefficients plus energy, their deltas and delta-delta coefficients, which was further analyzed using 58 possible uniform LBP patterns. As a result, 2842 features were generated for every recording.
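For reference, the 58 uniform patterns mentioned above are exactly the 8-neighbour LBP codes with at most two 0/1 transitions in the circular bit string (2 constant patterns plus 8·7 = 56 two-transition ones). The following is a plain-Python sketch of a uniform-LBP histogram over a feature matrix, not the optimized implementation used in the baseline.

```python
import numpy as np

def uniform_lbp_histogram(M):
    """58-bin uniform-LBP histogram over a feature matrix M
    (e.g. a cepstrogram with appended dynamic features).

    Each interior cell is compared with its 8 neighbours; only
    'uniform' codes (at most two transitions in the circular
    bit pattern) get their own histogram bin.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]

    def transitions(code):
        bits = [(code >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

    uniform_codes = [c for c in range(256) if transitions(c) <= 2]
    assert len(uniform_codes) == 58   # the 58 uniform 8-bit patterns
    bin_of = {c: i for i, c in enumerate(uniform_codes)}

    hist = np.zeros(58)
    rows, cols = M.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            code = 0
            for b, (dr, dc) in enumerate(offsets):
                if M[r + dr, c + dc] >= M[r, c]:
                    code |= 1 << b
            if code in bin_of:
                hist[bin_of[code]] += 1
    return hist
```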
Data analysis showed that the training datasets, obtained using either the LBP or LPA-based classifiers, contained a substantial number (ca. 10 %) of outliers. Initial experiments revealed that their presence had a negative impact on the spoofing detection, therefore it was decided to remove them. A filter based on the interquartile range with outlier factor equal to 3.0 was applied to detect the outliers. All the parameters, before feeding classifiers, were normalized.
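The outlier filter and normalization step might look as follows. The outlier factor 3.0 is taken from the text; applying the interquartile criterion per feature column, and the zero-mean unit-variance normalization, are assumptions about the exact set-up.

```python
import numpy as np

def iqr_outlier_mask(X, factor=3.0):
    """Flag rows of feature matrix X falling outside
    [Q1 - factor*IQR, Q3 + factor*IQR] in any feature dimension."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return np.any((X < lo) | (X > hi), axis=1)   # True = outlier row

def normalize(X):
    """Zero-mean, unit-variance scaling applied before classification."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)
```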
A range of binary classifiers was tested with the aim of achieving the largest area under the receiver operating characteristic (ROC) curve. Based on these results, three classifiers were initially selected: Logistic, Bayesian Networks and AdaBoost. The Logistic classifier used logistic regression with the ridge penalty to classify data . The Bayesian Networks method used a network of probabilistic dependencies between random variables . The AdaBoost algorithm is a meta-classifier that iteratively combines other "weak" classifiers, such as decision trees, boosting their performance by focusing them on previously misclassified items .
Since Bayesian Networks yielded the worst results in the previous study , they were subsequently replaced by support vector machines (SVM) with a radial basis function (RBF) kernel. SVMs, originally proposed in , have been used with various kernels (e.g., linear, polynomial and radial) for data classification in a wide range of studies, e.g., in food categorization , speaker recognition [10, 20] or personality trait recognition based on handwriting . The RBF kernel  is well known for its high classification power. Since the SVM-RBF classifier requires careful adjustment of the C and σ parameters for optimal performance, these were set experimentally using a grid search, similar to that in , separately for each tested configuration of the spoof detector.
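Such a grid search can be sketched with scikit-learn, where the RBF kernel width is parametrized as gamma = 1/(2σ²) rather than σ directly. The grid values below are illustrative, not the tuned settings used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative grid over C and the RBF width (gamma = 1 / (2 * sigma**2))
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1e-3, 1e-2, 1e-1, 1],
}

def tune_svm_rbf(X, y, cv=3):
    """Cross-validated grid search for an SVM-RBF detector
    separating human (y=0) and spoof (y=1) classes."""
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=cv)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```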
The assessment of anti-spoofing protection was realized independently from any ASV system, i.e., each recording underwent a binary decision process to decide, whether a recording was real human speech or a spoof trial. The actual speaker verification process can take place either before or after the spoofing detection process and is beyond the scope of this research.
Similar to many other studies [12, 22], spoofing detection was evaluated by measuring the EER values. In addition, detection error trade-off (DET) curves were plotted to show the relation between false alarm and miss probabilities. The EER values and DET plots were obtained using the Bosaris toolkit.
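The EER can be computed from detector scores with a simple threshold sweep. The Bosaris toolkit uses a more refined ROC-convex-hull procedure; the sketch below only illustrates the metric, assuming higher scores mean "more human".

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the operating point where the false-alarm
    rate (spoof accepted as human) equals the miss rate (human
    rejected). Thresholds are swept over the sorted scores."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    n_gen, n_spf = len(genuine_scores), len(spoof_scores)
    miss = np.cumsum(labels) / n_gen           # genuine at/below threshold
    fa = 1.0 - np.cumsum(1 - labels) / n_spf   # spoof above threshold
    idx = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[idx] + fa[idx])
```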
5 Results
Table 1 EER results (in percentages) of the spoof detection for various parameter groups and various classifiers, for the Development set without the S2 algorithm (table not reproduced)
Table 2 EER results (in percentages) of the spoof detection, for the Development and Evaluation datasets, for various classifiers and the tested countermeasures (table not reproduced)
Table 3 Spoofing and countermeasure performance (EER in percentages) for the Evaluation dataset, for the LBP, LPAv1 and LPAv2 countermeasures (table not reproduced)
We observed no correlation between the detection results for the spoofing signals generated using different vocoders, i.e., for S5 (using the MLSA vocoder) and, e.g., S4 and S8 (using the STRAIGHT vocoder), which returned EER values of 1.4 %, 1.0 % and 2.0 %, respectively. This suggests that the proposed solution is vocoder-independent.
The anti-spoofing protection proposed in this article outperformed the previous version  for most of the spoofing algorithms, apart from S3 and S4, where the results were in fact the same. The difference in performance is especially visible for the voice conversion-based spoofing methods (see Table 3). The average spoofing detection efficacy for speech synthesis attacks (algorithms S3, S4 and S10) was almost the same for LPAv1 and LPAv2.
When comparing LPAv2 with the LBP-based spoofing countermeasure, the LPAv2 results were better both for known and for previously unseen spoofing algorithms (EERs of 1.2 % and 10.2 % vs. 1.9 % and 15.6 % for the LPAv2 and LBP algorithms, respectively). However, for three individual spoofing algorithms (S6, S7 and S9) the LBP-based countermeasure returned better results; for example, for algorithm S6 the EERs were 10.2 % and 13.3 % for LBP and LPAv2, respectively. Notably, none of these three algorithms was present in the training set. In contrast, the LPAv2 anti-spoofing method returned much better results for other unseen algorithms (S2, S8, S10), so at this stage it would be difficult to definitively judge which of the two algorithms generalizes better.
6 Conclusions and future work
This article presented a new version of the algorithm that aimed to increase the anti-spoofing protection of speaker verification systems against unauthorized access using speech synthesis, voice conversion and, potentially, other attacks. The results shown in this paper were compared with the previous version of this algorithm (LPAv1) as well as with the LBP-based anti-spoofing method.
The proposed spoofing countermeasure is based on analysis of the prediction error resulting from cascaded LPC and LTP blocks. As an LP-based vocoder is often part of speech synthesis or voice conversion systems, applying linear prediction analysis performs, in a sense, the reverse operation, verifying whether vocoding has taken place.
The described new version of the countermeasure, called LPAv2, uses an extended set of parameters – 23 instead of 10, including LPC gain and variance of LTP error, as well as the parameters extracted from the whole speech signal (and not only from the voiced speech as in LPAv1). The new version employs a more powerful classifier (SVM-RBF), in which the parameters were tuned to maximize the detection efficiency.
It turned out that the proposed version, with an enlarged number of parameters and the SVM-RBF classifier, was able to improve the EER results by lowering them from 3 % for LPAv1 down to 1.3 %, when using the Development corpus without S2. The analyses shown in this study suggest that the LTP-based parameters contributed most to successful spoofing detection.
In addition, the results for the Evaluation dataset were better: 8 % for LPAv2 vs. 11.6 % for LPAv1. Some scientists have reported better results here, e.g.,  used a classical GMM-based system that returned an EER of 3 %. However, a huge difference between the performance for the known attacks (1.2 % EER) and the unknown ones (10.2 % EER) may imply that the proposed algorithm requires further parameter tuning to be able to better detect previously unseen spoofing algorithms.
In most cases, the proposed method performed better than the baseline LBP-based detector. It is likely that the LBP detector required longer speech data, as in  it was tested on 5-minute recordings, while the recordings tested in the current study were no longer than several seconds. For three out of the 10 spoofing algorithms present in the Evaluation dataset, the LBP-based features with the SVM-RBF classifier outperformed the proposed method. It would be interesting to know whether fusing the scores returned by the LBP and LPAv2 countermeasures would improve spoofing detection – this may be a subject for future work.
It is hoped that the improved version of the spoofing countermeasure, based on an analysis of prediction error, as well as the analyses presented in this article, will help to increase the security of speaker verification systems. Future work can focus on elaborating a generalized countermeasure able to precisely detect a wide range of spoofing attacks against ASV systems, in which the proposed algorithm can be one of the contributing elements.
- 1.Alegre F., Amehraye A., Evans N. (2013) Spoofing countermeasures to protect automatic speaker verification from voice conversion. In: Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP)
- 2.Alegre F., Vipperla R., Amehraye A., Evans N. (2013) A new speaker verification spoofing countermeasure based on local binary patterns. In: Proc. Interspeech 2013, Lyon, France
- 3.Alegre F., Janicki A., Evans N. (2014) Re-assessing the threat of replay spoofing attacks against automatic speaker verification. In: Proc. 13th International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, pp 157–168
- 4.Australian Government (2014) ATO launches voice authentication. https://www.ato.gov.au/media-centre/media-releases/ato-launches-voice-authentication/
- 5.Bank Smart (2015) Bank SMART - biometria. http://www.banksmart.pl/aplikacja/biometria/
- 7.Camacho A. (2007) SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. Ph.D. thesis, Gainesville, FL, USA
- 9.De Leon P. L., Hernaez I., Saratxaga I., Pucher M., Yamagishi J. (2011) Detection of synthetic speech for the problem of imposture. In: Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), pp 4844–4847
- 11.ETSI (1999) Digital cellular telecommunications system (Phase 2+) (GSM); Full rate speech; Transcoding (GSM 06.10)
- 12.Evans N., Kinnunen T., Yamagishi J. (2013) Spoofing and countermeasures for automatic speaker verification. In: Proc. Interspeech 2013, Lyon, France
- 13.Farrús M., Wagner M., Anguita J., Hernando J. (2008) How vulnerable are prosodic features to professional imitators? In: Proc. Odyssey. ISCA, Stellenbosch, South Africa
- 14.Freund Y., Schapire R. E. (1999) A short introduction to boosting. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp 1401–1406. Morgan Kaufmann
- 18.Hanilçi C., Kinnunen T., Sahidullah M., Sizov A. (2015) Classifiers for synthetic speech detection: A comparison. In: Proc. Interspeech 2015, Dresden, Germany, pp 2057–2061
- 19.Janicki A. (2015) Spoofing countermeasure based on analysis of linear prediction error. In: Proc. Interspeech 2015, Dresden, Germany, pp 2077–2081
- 20.Janicki A., Staroszczyk T. (2011) Speaker recognition from coded speech using support vector machines. In: Proceedings of the 14th International Conference on Text, Speech and Dialogue, TSD'11. Springer, Berlin, pp 291–298
- 22.Kinnunen T., Wu Z., Lee K. A., Sedlak F., Chng E. S., Li H. (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech. In: Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), pp 4401–4404
- 23.Lau Y. W., Wagner M., Tran D. (2004) Vulnerability of speaker verification to voice mimicking. In: Proc. International Symposium on Intelligent Multimedia, Video and Speech Processing. IEEE, Hong Kong, China, pp 145–148
- 24.Lindberg J., Blomberg M. (1999) Vulnerability in speaker verification - a study of technical impostor techniques. In: European Conference on Speech Communication and Technology, pp 1211–1214
- 26.Masuko T., Hitotsumatsu T., Tokuda K., Kobayashi T. (1999) On the security of HMM-based speaker verification systems against imposture using synthetic speech. In: Proc. EUROSPEECH
- 31.Rybka J., Janicki A. (2013) Comparison of speaker dependent and speaker independent emotion recognition. Appl Math Comput Sci 23(4):797–808
- 33.Vaidyanathan P. P. (2007) The theory of linear prediction. Synthesis Lectures on Signal Processing. Morgan & Claypool Publishers
- 34.Vapnik V. N. (1995) The nature of statistical learning theory. Springer
- 35.Villalba J., Lleida E. (2010) Speaker verification performance degradation against spoofing and tampering attacks. In: FALA Workshop, pp 131–134
- 36.Villalba J., Lleida E. (2011) Preventing replay attacks on speaker verification systems. In: Security Technology (ICCST), 2011 IEEE International Carnahan Conference on, pp 1–8. doi:10.1109/CCST.2011.6095943
- 37.Wang Z.F., Wei G., He Q.H. (2011) Channel pattern noise based playback attack detection algorithm for speaker recognition. In: Machine Learning and Cybernetics (ICMLC), 2011 International Conference on, vol 4, pp 1708–1713. doi:10.1109/ICMLC.2011.6016982
- 38.Wu Z., Chng E., Li H. (2013) Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In: Proc. Interspeech 2013, Lyon
- 39.Wu Z., Larcher A., Lee K. A., Chng E., Kinnunen T., Li H. (2013) Vulnerability evaluation of speaker verification under voice conversion spoofing: the effect of text constraints. In: Bimbot F., Cerisara C., Fougeron C., Gravier G., Lamel L., Pellegrino F., Perrier P. (eds) INTERSPEECH, pp 950–954. ISCA
- 41.Wu Z., Kinnunen T., Evans N., Yamagishi J., Hanilçi C., Sahidullah M., Sizov A. (2015) ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In: Proc. Interspeech 2015, Dresden, Germany
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.