Voice Presentation Attack Detection Using Convolutional Neural Networks

Handbook of Biometric Anti-Spoofing

Abstract

Current state-of-the-art automatic speaker verification (ASV) systems are prone to spoofing. Their security and reliability can be threatened by different types of spoofing attacks that use voice conversion, synthetic speech, or recorded passphrases. It is therefore essential to develop countermeasures that can detect such spoofed speech. Inspired by the success of deep learning approaches in various classification tasks, this work presents an in-depth study of convolutional neural networks (CNNs) for spoofing detection in ASV systems. Specifically, we compare three CNN architectures for developing spoofing countermeasures: AlexNet, CNNs with max-feature-map activation, and an ensemble of standard CNNs. We also discuss their potential to avoid overfitting given the small amount of training data usually available for this task. We used popular deep learning toolkits for the system implementation and have publicly released the implementation code of our methods. We evaluated the proposed countermeasure systems for detecting replay attacks on the recently released ASVspoof 2017 spoofing corpus, and we provide in-depth visual analyses of the CNNs to aid future research in this area.
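
As an illustration of the max-feature-map (MFM) activation named in the abstract, the following is a minimal NumPy sketch of the general technique: the channels of a feature tensor are split into two halves and the element-wise maximum is kept, which halves the channel count. The function name `max_feature_map`, the (channels, height, width) layout, and the toy values are assumptions made for this sketch; the chapter's own network configuration is not shown on this page.

```python
import numpy as np


def max_feature_map(x: np.ndarray) -> np.ndarray:
    """Max-feature-map (MFM) activation over the channel axis.

    x is assumed to have shape (channels, height, width) with an even
    number of channels. The channels are split into two halves and the
    element-wise maximum of the halves is returned, so the output has
    channels // 2 channels.
    """
    channels = x.shape[0]
    if channels % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    first_half = x[: channels // 2]
    second_half = x[channels // 2:]
    return np.maximum(first_half, second_half)


# Toy input: 4 channels of 1x2 feature maps (illustrative values only).
features = np.array([
    [[1.0, -2.0]],
    [[0.5, 4.0]],
    [[-3.0, 5.0]],
    [[2.0, 1.0]],
])
activated = max_feature_map(features)  # shape (2, 1, 2)
```

Unlike ReLU, which thresholds each unit at zero, this activation selects between paired feature maps, so negative responses can survive when both candidates are negative.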

Notes

  1. http://www.asvspoof.org/

  2. https://au.mathworks.com/help/nnet/ref/alexnet.html

  3. https://www.tensorflow.org

  4. https://www.pytorch.org

  5. https://anaconda.org/anaconda/python

  6. https://github.com/insikk/Grad-CAM-tensorflow

Acknowledgements

Computational (and/or data visualization) resources and services used in this work were provided by the HPC and Research Support Group, Queensland University of Technology, Brisbane, Australia. This project was supported in part by an Australian Research Council Linkage grant LP 130100110.

Author information

Corresponding author

Correspondence to Ivan Himawan.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Himawan, I., Madikeri, S., Motlicek, P., Cernak, M., Sridharan, S., Fookes, C. (2019). Voice Presentation Attack Detection Using Convolutional Neural Networks. In: Marcel, S., Nixon, M., Fierrez, J., Evans, N. (eds) Handbook of Biometric Anti-Spoofing. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-92627-8_17

  • DOI: https://doi.org/10.1007/978-3-319-92627-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92626-1

  • Online ISBN: 978-3-319-92627-8

  • eBook Packages: Computer Science, Computer Science (R0)