
Automatic speaker verification systems and spoof detection techniques: review and analysis

International Journal of Speech Technology

Abstract

Automatic speaker verification (ASV) systems have matured to the point that industry is attracted to deploying them in practical security systems. However, the vulnerability of these systems to various direct and indirect access attacks weakens the ASV authentication mechanism. Growing research in spoofing and anti-spoofing technologies is contributing to the enhancement of these systems. The objective of this paper is to review and analyse the important advancements proposed by different researchers and scientists. It discusses the classical, autoregressive and cepstral feature extraction techniques, as well as modern deep learning based ones, that are chosen to design the frontend of these systems. The extracted features are learned and classified in the backend of an ASV system by classical machine learning or deep learning models, which are also a main focus of the presented review. Since the emergence of practical work in this area, experimental studies have relied on constantly evolving datasets and evaluation measures to develop robust systems; this paper analyses most of the contributing spoofed-speech datasets and evaluation protocols. Speech synthesis (SS), voice conversion (VC), replay, mimicry and twins are the principal spoofing attacks on ASV systems, and this work describes how these attacks are generated in order to strengthen the defence mechanisms of ASV. The survey marks the start of a new era in ASV system development and highlights the start of a new generation (G4) of SS attack development methods. Given the rapid advancement of deep learning techniques, the paper makes its best effort to give newcomers a complete picture of ASV, and also sheds light on spoofing attacks that future ASV systems will need to address during implementation.
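As an illustration of one of the standard evaluation measures the abstract alludes to (not code from the paper itself), the equal error rate (EER) is the operating point at which an ASV system's false acceptance rate equals its false rejection rate. A minimal sketch, assuming genuine and impostor trials are given as lists of detection scores:

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) for a given decision threshold."""
    # False rejection: genuine trials scored below the threshold.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    # False acceptance: impostor trials scored at or above the threshold.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a candidate threshold and return
    the error rate at the point where FAR and FRR are closest."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

For perfectly separable score distributions the EER is zero, e.g. `equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])` returns `0.0`; overlapping distributions yield a higher EER. The ASVspoof challenges additionally use the tandem detection cost function (t-DCF), which weighs the countermeasure and the ASV system jointly.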





Author information


Corresponding author

Correspondence to Mohit Dua.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mittal, A., Dua, M. Automatic speaker verification systems and spoof detection techniques: review and analysis. Int J Speech Technol 25, 105–134 (2022). https://doi.org/10.1007/s10772-021-09876-2

