Automatic speaker verification systems and spoof detection techniques: review and analysis

Mittal, Aakshi; Dua, Mohit

doi:10.1007/s10772-021-09876-2

Published: 16 August 2021

Volume 25, pages 105–134 (2022)

International Journal of Speech Technology

Abstract

Automatic speaker verification (ASV) systems have matured enough that industry is adopting them in practical security systems. However, the vulnerability of these systems to various direct and indirect access attacks weakens the ASV authentication mechanism. Growing research on spoofing and anti-spoofing technologies is helping to strengthen these systems. The objective of this paper is to review and analyze the important advances proposed by different researchers. Various feature extraction techniques used to design the frontend of these systems are discussed, including classical, autoregressive, and cepstral techniques as well as modern deep learning based ones. The extracted features are learned and classified in the backend of an ASV system, which can be a classical machine learning model or a deep learning model; these backend models are also a main focus of the review. Since practical work in this area began, experimental studies have relied on continually revised datasets and evaluation measures to develop robust systems. This paper analyzes most of the relevant spoofed speech datasets and evaluation protocols. Speech synthesis (SS), voice conversion (VC), replay, mimicry, and twins are the potential spoofing attacks on ASV systems. This work describes how these attacks are generated in order to strengthen the defence mechanisms of ASV. The survey marks the start of a new era in ASV system development and highlights the start of a new generation (G₄) of SS attack development methods. Given the rapid advancement of deep learning techniques, the paper aims to give newcomers to the area a complete picture of ASV, and it also sheds light on some of the spoofing attacks that future ASV systems can target during implementation.

Author information

Authors and Affiliations

Department of Computer Engineering, National Institute of Technology, Kurukshetra, Kurukshetra, India
Aakshi Mittal & Mohit Dua

Corresponding author

Correspondence to Mohit Dua.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mittal, A., Dua, M. Automatic speaker verification systems and spoof detection techniques: review and analysis. Int J Speech Technol 25, 105–134 (2022). https://doi.org/10.1007/s10772-021-09876-2

Received: 18 September 2020
Accepted: 27 July 2021
Published: 16 August 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10772-021-09876-2
