A Cross-Database Study of Voice Presentation Attack Detection

  • Pavel Korshunov
  • Sébastien Marcel
Chapter
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Despite increasing interest in speaker recognition technologies, a significant obstacle still hinders their wide deployment: their high vulnerability to spoofing, or presentation attacks. These attacks can be easy to perform. For instance, if an attacker has access to a speech sample from a target user, he/she can replay it to the recognition system using a loudspeaker or a smartphone during the authentication process. The ease of executing presentation attacks and the fact that no technical knowledge of the biometric system is required make these attacks especially threatening in practical applications. Therefore, recent research focuses on collecting databases with such attacks and on developing presentation attack detection (PAD) systems. In this chapter, we present an overview of the latest databases and the techniques to detect presentation attacks. We consider several prominent databases that contain bona fide and attack data, including ASVspoof 2015, ASVspoof 2017, AVspoof, voicePA, and BioCPqD-PA (the only proprietary database). Using these databases, we focus on the performance of PAD systems in the cross-database scenario or in the presence of “unknown” (not available during training) attacks, as these scenarios are closer to practice, where pretrained systems need to detect attacks in unforeseen conditions. We first present and discuss the performance of PAD systems based on handcrafted features and traditional Gaussian mixture model (GMM) classifiers. We then examine whether score fusion techniques can improve the performance of PAD systems. We also present some of the latest results of using neural networks for presentation attack detection. The experiments show that PAD systems struggle to generalize across databases and are mostly unable to detect unknown attacks, with systems based on neural networks demonstrating better performance than systems based on handcrafted features.

Notes

Acknowledgements

This work has been supported by the European H2020-ICT project TeSLA (grant agreement no. 688520), the project on Secure Access Control over Wide Area Networks (SWAN) funded by the Research Council of Norway (grant no. IKTPLUSS 248030/O70), and by the Swiss Center for Biometrics Research and Testing.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Idiap Research Institute, Martigny, Switzerland
