Abstract
Automatic speaker verification (ASV) systems have matured to the point where industry is adopting them in practical security systems. However, their vulnerability to various direct and indirect access attacks weakens the ASV authentication mechanism. Growing research on spoofing and anti-spoofing technologies is helping to strengthen these systems. The objective of this paper is to review and analyze the important advances proposed by different researchers. It discusses the feature extraction techniques chosen to design the frontend of these systems, including classical, autoregressive, cepstral, and modern deep learning based techniques. The extracted features are learned and classified in the backend of an ASV system, which can use classical machine learning or deep learning models; these models are also a main focus of the review. Since practical work in this area emerged, experimental studies have used constantly revised datasets and evaluation measures to develop robust systems. This paper analyzes most of the contributing spoofed speech datasets and evaluation protocols. Speech synthesis (SS), voice conversion (VC), replay, mimicry, and twins are the potential spoofing attacks on ASV systems. This work describes the generation techniques behind these attacks in order to strengthen the defence mechanisms of ASV. The survey marks the start of a new era in ASV system development and highlights the start of a new generation (G4) in SS attack development methods. With deep learning techniques advancing, the paper aims to give newcomers to this area a complete picture of ASV and also sheds light on some of the spoofing attacks that can be targeted when implementing future ASV systems.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Mittal, A., Dua, M. Automatic speaker verification systems and spoof detection techniques: review and analysis. Int J Speech Technol 25, 105–134 (2022). https://doi.org/10.1007/s10772-021-09876-2
DOI: https://doi.org/10.1007/s10772-021-09876-2