Voice Presentation Attack Detection Using Convolutional Neural Networks

Himawan, Ivan; Madikeri, Srikanth; Motlicek, Petr; Cernak, Milos; Sridharan, Sridha; Fookes, Clinton

doi:10.1007/978-3-319-92627-8_17

Part of the book series: Advances in Computer Vision and Pattern Recognition (ACVPR)

Abstract

Current state-of-the-art automatic speaker verification (ASV) systems are prone to spoofing. The security and reliability of ASV systems can be threatened by different types of spoofing attacks using voice conversion, synthetic speech, or a recorded passphrase. It is therefore essential to develop countermeasure techniques that can detect such spoofed speech. Inspired by the success of deep learning approaches in various classification tasks, this work presents an in-depth study of convolutional neural networks (CNNs) for spoofing detection in ASV systems. Specifically, we have compared three CNN architectures for developing spoofing countermeasures: AlexNet, CNNs with max-feature-map activation, and an ensemble of standard CNNs. We have also discussed their potential to avoid overfitting given the small amount of training data usually available for this task. We used popular deep learning toolkits for the system implementation and have released the implementation code of our methods publicly. We have evaluated the proposed countermeasure systems for detecting replay attacks on the recently released ASVspoof 2017 spoofing corpus, and also provided in-depth visual analyses of CNNs to aid future research in this area.
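
As an illustration of the max-feature-map activation named in the abstract, the following is a minimal NumPy sketch, not the chapter's released implementation. It assumes the common formulation in which the channel axis is split into two halves and the element-wise maximum is kept, which halves the number of feature maps. The function name `max_feature_map`, the channels-first layout, and the toy input are assumptions made for this example only.

```python
import numpy as np


def max_feature_map(x, axis=1):
    """Split `x` into two halves along `axis` and keep the element-wise maximum.

    The output has half as many channels as the input. The channels-first
    layout (batch, channels, height, width) is an assumption of this sketch.
    """
    channels = x.shape[axis]
    if channels % 2 != 0:
        raise ValueError("max-feature-map needs an even number of channels")
    first, second = np.split(x, 2, axis=axis)
    return np.maximum(first, second)


# Toy input: one example, four 2x2 feature maps holding the values 0..15.
x = np.arange(16, dtype=np.float32).reshape(1, 4, 2, 2)
x[0, 0] += 100.0  # make channel 0 larger than its partner, channel 2

y = max_feature_map(x)
print(y.shape)  # (1, 2, 2, 2)
```

Here channel 0 is paired with channel 2 and channel 1 with channel 3, so the first output map comes from channel 0 (after the offset) and the second from channel 3. Unlike ReLU, the activation selects between competing feature maps rather than thresholding each one.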

Acknowledgements

Computational (and/or data visualization) resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia. This project was supported in part by an Australian Research Council Linkage grant LP 130100110.

Author information

Authors and Affiliations

Queensland University of Technology, Brisbane, Australia
Ivan Himawan, Sridha Sridharan & Clinton Fookes
Idiap Research Institute, Martigny, Switzerland
Srikanth Madikeri & Petr Motlicek
Logitech, Lausanne, Switzerland
Milos Cernak

Corresponding author

Correspondence to Ivan Himawan.

Editor information

Editors and Affiliations

Idiap Research Institute, Martigny, Switzerland
Sébastien Marcel
University of Southampton, Southampton, UK
Mark S. Nixon
Universidad Autonoma de Madrid, Madrid, Spain
Julian Fierrez
EURECOM, Biot Sophia Antipolis, France
Nicholas Evans

About this chapter

Cite this chapter

Himawan, I., Madikeri, S., Motlicek, P., Cernak, M., Sridharan, S., Fookes, C. (2019). Voice Presentation Attack Detection Using Convolutional Neural Networks. In: Marcel, S., Nixon, M., Fierrez, J., Evans, N. (eds) Handbook of Biometric Anti-Spoofing. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-92627-8_17

DOI: https://doi.org/10.1007/978-3-319-92627-8_17
Published: 02 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92626-1
Online ISBN: 978-3-319-92627-8
eBook Packages: Computer Science; Computer Science (R0)
