Introduction to Voice Presentation Attack Detection and Recent Advances

Chapter in: Handbook of Biometric Anti-Spoofing

Abstract

Over the past few years, significant progress has been made in presentation attack detection (PAD) for automatic speaker verification (ASV). This includes the development of new speech corpora and standard evaluation protocols, as well as advances in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has enabled, for the first time, the meaningful benchmarking of different PAD solutions. This chapter summarises that progress, with a focus on studies completed in the last three years. It presents a summary of findings and lessons learned from three ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem and that further attention is required to develop generalised PAD solutions with the potential to detect diverse and previously unseen spoofing attacks.

Notes

  1. http://www.asvspoof.org/
  2. https://sites.google.com/site/bosaristoolkit/
  3. http://www.festvox.org/
  4. http://mary.dfki.de/
  5. https://sites.google.com/site/thereddotsproject/
  6. https://www.octave-project.eu/
  7. A replay configuration refers to a unique combination of room, replay device and recording device while a session refers to a set of source files, which share the same replay configuration.
  8. See Appendix A.2. Software packages.
  9. https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2
  10. https://www.asvspoof.org/asvspoof2019/ASVspoof_2019_baseline_CM_v1.zip
  11. https://github.com/marytts/
  12. https://github.com/Microsoft/CNTK
  13. https://www.idiap.ch/software/bob/docs/bob/bob.bio.spear/stable/index.html

References

  1. Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52(1):12–40 https://doi.org/10.1016/j.specom.2009.08.009. www.sciencedirect.com/science/article/pii/S0167639309001289

  2. Hansen J, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99

    Article  Google Scholar 

  3. ISO/IEC 30107 (2016) Information technology—biometric presentation attack detection. International Organization for Standardization

    Google Scholar 

  4. Kinnunen T, Sahidullah M, Kukanov I, Delgado H, Todisco M, Sarkar A, Thomsen N, Hautamäki V, Evans N, Tan ZH (2016) Utterance verification for text-dependent speaker recognition: a comparative assessment using the reddots corpus. In: Proceedings of the interspeech, pp 430–434

    Google Scholar 

  5. Shang W, Stevenson M (2010) Score normalization in playback attack detection. In: Proceedings of the ICASSP. IEEE, pp 1678–1681

    Google Scholar 

  6. Wu Z, Evans N, Kinnunen T, Yamagishi J, Alegre F, Li H (2015) Spoofing and countermeasures for speaker verification: a survey. Speech Commun 66:130–153

    Article  Google Scholar 

  7. Korshunov P, Marcel S, Muckenhirn H, Gonçalves A, Mello A, Violato R, Simoes F, Neto M, de Angeloni AM, Stuchi J, Dinkel H, Chen N, Qian Y, Paul D, Saha G, Sahidullah M (2016) Overview of BTAS 2016 speaker anti-spoofing competition. In: 2016 IEEE 8th international conference on biometrics theory, applications and systems (BTAS), pp 1–6

    Google Scholar 

  8. Evans N, Kinnunen T, Yamagishi J, Wu Z, Alegre F, DeLeon P (2014) Speaker recognition anti-spoofing. In: Marcel S, Li SZ, Nixon M (eds) Handbook of biometric anti-spoofing. Springer

    Google Scholar 

  9. Marcel S, Li SZ, Nixon M (eds) (2014) Handbook of biometric anti-spoofing: trusted biometrics under spoofing attacks. Springer

    Google Scholar 

  10. Farrús Cabeceran M, Wagner M, Erro D, Pericás H (2010) Automatic speaker recognition as a measurement of voice imitation and conversion. Int J Speech Lang Law 1(17):119–142

    Google Scholar 

  11. Perrot P, Aversano G, Chollet G (2007) Voice disguise and automatic detection: review and perspectives. Progress in nonlinear speech processing, pp 101–117

    Google Scholar 

  12. Zetterholm E (2007) Detection of speaker characteristics using voice imitation. In: Speaker classification II. Springer, pp 192–205

    Google Scholar 

  13. Lau Y, Wagner M, Tran D (2004) Vulnerability of speaker verification to voice mimicking. In: Proceedings of 2004 international symposium on intelligent multimedia, video and speech processing, 2004. IEEE, pp 145–148

    Google Scholar 

  14. Lau Y, Tran D, Wagner M (2005) Testing voice mimicry with the YOHO speaker verification corpus. In: International conference on knowledge-based and intelligent information and engineering systems. Springer, pp 15–21

    Google Scholar 

  15. Mariéthoz J, Bengio S (2005) Can a professional imitator fool a GMM-based speaker verification system? Technical report, IDIAP

    Google Scholar 

  16. Panjwani S, Prakash A (2014) Crowdsourcing attacks on biometric systems. In: Symposium on usable privacy and security (SOUPS 2014), pp 257–269

    Google Scholar 

  17. Hautamäki R, Kinnunen T, Hautamäki V, Laukkanen AM (2015) Automatic versus human speaker verification: the case of voice mimicry. Speech Commun 72:13–31

    Article  Google Scholar 

  18. Ergunay S, Khoury E, Lazaridis A, Marcel S (2015) On the vulnerability of speaker verification to realistic voice spoofing. In: IEEE international conference on biometrics: theory, applications and systems, pp 1–8

    Google Scholar 

  19. Lindberg J, Blomberg M (1999) Vulnerability in speaker verification-a study of technical impostor techniques. In: Proceedings of the European conference on speech communication and technology, vol 3, pp 1211–1214

    Google Scholar 

  20. Villalba J, Lleida E (2010) Speaker verification performance degradation against spoofing and tampering attacks. In: FALA 10 workshop, pp 131–134

    Google Scholar 

  21. Wang ZF, Wei G, He QH (2011) Channel pattern noise based playback attack detection algorithm for speaker recognition. In: 2011 international conference on machine learning and cybernetics, vol 4, pp 1708–1713

    Google Scholar 

  22. Villalba J, Lleida E (2011) Preventing replay attacks on speaker verification systems. In: 2011 IEEE international Carnahan conference on security technology (ICCST). IEEE, pp 1–8

    Google Scholar 

  23. Gałka J, Grzywacz M, Samborski R (2015) Playback attack detection for text-dependent speaker verification over telephone channels. Speech Commun 67:143–153

    Article  Google Scholar 

  24. Taylor P (2009) Text-to-Speech synthesis. Cambridge University Press

    Google Scholar 

  25. Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67:971–995

    Article  Google Scholar 

  26. Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9:453–467

    Article  Google Scholar 

  27. Hunt A, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of the ICASSP, pp 373–376

    Google Scholar 

  28. Breen A, Jackson P (1998) A phonologically motivated method of selecting nonuniform units. In: Proceedings of the ICSLP, pp 2735–2738

    Google Scholar 

  29. Donovan RE, Eide EM (1998) The IBM trainable speech synthesis system. In: Proceedings of the ICSLP, pp 1703–1706

    Google Scholar 

  30. Beutnagel B, Conkie A, Schroeter J, Stylianou Y, Syrdal A (1999) The AT &T Next-Gen TTS system. In: Proceedings of the joint ASA, EAA and DAEA meeting, pp 15–19

    Google Scholar 

  31. Coorman G, Fackrell J, Rutten P, Coile B (2000) Segment selection in the L & H realspeak laboratory TTS system. In: Proceedings of the ICSLP, pp 395–398

    Google Scholar 

  32. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of the Eurospeech, pp 2347–2350

    Google Scholar 

  33. Ling ZH, Wu YJ, Wang YP, Qin L, Wang RH (2006) USTC system for Blizzard Challenge 2006 an improved HMM-based speech synthesis method. In: Proceedings of the Blizzard challenge workshop

    Google Scholar 

  34. Black A (2006) CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling. In: Proceedings of the Interspeech, pp 1762–1765

    Google Scholar 

  35. Zen H, Toda T, Nakamura M, Tokuda K (2007) Details of the Nitech HMM-based speech synthesis system for the Blizzard challenge 2005. IEICE Trans Inf Syst E90-D(1):325–333

    Google Scholar 

  36. Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51(11):1039–1064

    Article  Google Scholar 

  37. Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J (2009) Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans Speech Audio Lang Process 17(1):66–83

    Google Scholar 

  38. Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9:171–185

    Article  Google Scholar 

  39. Woodland PC (2001) Speaker adaptation for continuous density HMMs: a review. In: Proceedings of the ISCA workshop on adaptation methods for speech recognition, p 119

    Google Scholar 

  40. Ze H, Senior A, Schuster M (2013) Statistical parametric speech synthesis using deep neural networks. In: Proceedings of the ICASSP, pp 7962–7966

    Google Scholar 

  41. Ling ZH, Deng L, Yu D (2013) Modeling spectral envelopes using restricted boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans Audio Speech Lang Proc 21(10):2129–2139

    Article  Google Scholar 

  42. Fan Y, Qian Y, Xie FL, Soong F (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Proceedings of the interspeech, pp 1964–1968

    Google Scholar 

  43. Zen H, Sak H (2015) Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In: Proceedings of the ICASSP, pp 4470–4474

    Google Scholar 

  44. Wu Z, King S (2016) Investigating gated recurrent networks for speech synthesis. In: Proceedings of the ICASSP, pp 5140–5144

    Google Scholar 

  45. Wang X, Takaki S, Yamagishi J (2016) Investigating very deep highway networks for parametric speech synthesis. In: 9th ISCA speech synthesis workshop, pp 166–171

    Google Scholar 

  46. Wang X, Takaki S, Yamagishi J (2018) Investigating very deep highway networks for parametric speech synthesis. Speech Commun 96:1–9

    Article  Google Scholar 

  47. Wang X, Takaki S, Yamagishi J An autoregressive recurrent mixture density network for parametric speech synthesis. In: Proceedings of the ICASSP, pp 4895–4899

    Google Scholar 

  48. Wang X, Takaki S, Yamagishi J (2017) An RNN-based quantized F0 model with multi-tier feedback links for text-to-speech synthesis. In: Proceedings of the interspeech, pp 1059–1063

    Google Scholar 

  49. Saito Y, Takamichi S, Saruwatari H (2017) Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In: Proceedings of the ICASSP, pp 4900–4904

    Google Scholar 

  50. Saito Y, Takamichi S, Saruwatari H (2018) Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Trans Audio Speech Lang Proc 26(1):84–96

    Article  Google Scholar 

  51. Kaneko T, Kameoka H, Hojo N, Ijima Y, Hiramatsu K, Kashino K (2017) Generative adversarial network-based postfilter for statistical parametric speech synthesis. In: Proceedings of the ICASSP, pp 4910–4914

    Google Scholar 

  52. Van Oord DA, Dieleman, S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: a generative model for raw audio. arXiv:1609.03499

  53. Mehri S, Kumar K, Gulrajani I, Kumar R, Jain S, Sotelo J, Courville A, Bengio Y (2016) Samplernn: an unconditional end-to-end neural audio generation model. arXiv:1612.07837

  54. Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss R, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, Le Q, Agiomyrgiannakis Y, Clark R, Saurous R (2017) Tacotron: towards end-to-end speech synthesis. In: Proceedings of the interspeech, pp 4006–4010

    Google Scholar 

  55. Gibiansky A, Arik S, Diamos G, Miller J, Peng K, Ping W, Raiman J, Zhou Y (2017) Deep voice 2: multi-speaker neural text-to-speech. In: Advances in neural information processing systems, pp 2966–2974

    Google Scholar 

  56. Shen J, Schuster M, Jaitly N, Skerry-Ryan R, Saurous R, Weiss R, Pang R, Agiomyrgiannakis Y, Wu Y, Zhang Y, Wang Y, Chen Z, Yang Z (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In: Proceedings of the ICASSP

    Google Scholar 

  57. King S (2014) Measuring a decade of progress in text-to-speech. Loquens 1(1):006

    Article  Google Scholar 

  58. King S, Wihlborg L, Guo W (2017) The blizzard challenge 2017. In: Proceedings of the Blizzard Challenge Workshop, Stockholm, Sweden

    Google Scholar 

  59. Foomany F, Hirschfield A, Ingleby M (2009) Toward a dynamic framework for security evaluation of voice verification systems. In: IEEE Toronto international conference on science and technology for humanity (TIC-STH), 2009, pp 22–27

    Google Scholar 

  60. Masuko T, Hitotsumatsu T, Tokuda K, Kobayashi T (1999) On the security of HMM-based speaker verification systems against imposture using synthetic speech. In: Proceedings of the EUROSPEECH

    Google Scholar 

  61. Matsui T, Furui S (1995) Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Commun 17(1–2):109–116

    Article  Google Scholar 

  62. Masuko T, Tokuda K, Kobayashi T, Imai S (1996) Speech synthesis using HMMs with dynamic features. In: Proceedings of the ICASSP

    Google Scholar 

  63. Masuko T, Tokuda K, Kobayashi T, Imai S (1997) Voice characteristics conversion for HMM-based speech synthesis system. In: Proceedings of the ICASSP

    Google Scholar 

  64. De Leon PL, Pucher M, Yamagishi J, Hernaez I, Saratxaga I (2012) Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Trans Audio Speech Lang Process 20(8):2280–2290

    Article  Google Scholar 

  65. Galou G (2011) Synthetic voice forgery in the forensic context: a short tutorial. In: Forensic speech and audio analysis working group (ENFSI-FSAAWG), pp 1–3

    Google Scholar 

  66. Cai W, Doshi A, Valle R (2018) Attacking speaker recognition with deep generative models. arXiv:1801.02384

  67. Satoh T, Masuko T, Kobayashi T, Tokuda K (2001) A robust speaker verification system against imposture using an HMM-based speech synthesis system. In: Proceedings of the Eurospeech

    Google Scholar 

  68. Chen LW, Guo W, Dai LR (2010) Speaker verification against synthetic speech. In: 2010 7th international symposium on Chinese spoken language processing (ISCSLP), pp 309–312

    Google Scholar 

  69. Quatieri TF (2002) Discrete-Time speech signal processing: principles and practice. Prentice-Hall, Inc

    Google Scholar 

  70. Wu Z, Chng E, Li H (212) Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In: Proceedings of the interspeech

    Google Scholar 

  71. Ogihara A, Unno H, Shiozakai A (2005) Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification. IEICE Trans Fund Electron Commun Comput Sci 88(1):280–286

    Article  Google Scholar 

  72. De Leon P, Stewart B, Yamagishi J (2012) Synthetic speech discrimination using pitch pattern statistics derived from image analysis. In: Proceedings of the interspeech 2012, Portland, Oregon, USA

    Google Scholar 

  73. Stylianou Y (2009) Voice transformation: a survey. In: Proceedings of the ICASSP, pp 3585–3588

    Google Scholar 

  74. Pellom B, Hansen J (1999) An experimental study of speaker verification sensitivity to computer voice-altered imposters. In: Proceedings of the ICASSP, vol 2, pp 837–840

    Google Scholar 

  75. Mohammadi S, Kain A (2017) An overview of voice conversion systems. Speech Commun 88:65–82

    Article  Google Scholar 

  76. Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: Proceedings of the ICASSP, pp 655–658

    Google Scholar 

  77. Arslan L (1999) Speaker transformation algorithm using segmental codebooks (STASC). Speech Commun 28(3):211–226

    Article  Google Scholar 

  78. Kain A, Macon M (1998) Spectral voice conversion for text-to-speech synthesis. In: Proceedings of the ICASSP, vol 1, pp 285–288

    Google Scholar 

  79. Stylianou Y, Cappé O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 6(2):131–142

    Article  Google Scholar 

  80. Toda T, Black A, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235

    Article  Google Scholar 

  81. Kobayashi K, Toda T, Neubig G, Sakti S, Nakamura S (2014) Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In: Proceedings of the interspeech

    Google Scholar 

  82. Popa V, Silen H, Nurminen J, Gabbouj M (2012) Local linear transformation for voice conversion. In: Proceedings of the ICASSP. IEEE, pp 4517–4520

    Google Scholar 

  83. Chen Y, Chu M, Chang E, Liu J, Liu R (2003) Voice conversion with smoothed GMM and MAP adaptation. In: Proceedings of the EUROSPEECH, pp 2413–2416

    Google Scholar 

  84. Hwang HT, Tsao Y, Wang HM, Wang YR, Chen SH (2012) A study of mutual information for GMM-based spectral conversion. In: Proceedings of the interspeech

    Google Scholar 

  85. Helander E, Virtanen T, Nurminen J, Gabbouj M (2010) Voice conversion using partial least squares regression. IEEE Trans Audio Speech Lang Process 18(5):912–921

    Article  Google Scholar 

  86. Pilkington N, Zen H, Gales M (2011) Gaussian process experts for voice conversion. In: Proceedings of the interspeech

    Google Scholar 

  87. Saito D, Yamamoto K, Minematsu N, Hirose K (2011) One-to-many voice conversion based on tensor representation of speaker space. In: Proceedings of the interspeech, pp 653–656

    Google Scholar 

  88. Zen H, Nankaku Y, Tokuda K (2011) Continuous stochastic feature mapping based on trajectory HMMs. IEEE Trans Audio Speech Lang Process 19(2):417–430

    Article  Google Scholar 

  89. Wu Z, Kinnunen T, Chng E, Li H (2012) Mixture of factor analyzers using priors from non-parallel speech for voice conversion. IEEE Signal Process Lett 19(12)

    Google Scholar 

  90. Saito D, Watanabe S, Nakamura A, Minematsu N (2012) Statistical voice conversion based on noisy channel model. IEEE Trans Audio Speech Lang Process 20(6):1784–1794

    Article  Google Scholar 

  91. Song P, Bao Y, Zhao L, Zou C (2011) Voice conversion using support vector regression. Electron Lett 47(18):1045–1046

    Article  Google Scholar 

  92. Helander E, Silén H, Virtanen T, Gabbouj M (2012) Voice conversion using dynamic kernel partial least squares regression. IEEE Trans Audio Speech Lang Process 20(3):806–817

    Google Scholar 

  93. Wu Z, Chng E, Li H (2013) Conditional restricted boltzmann machine for voice conversion. In: The first IEEE China summit and international conference on signal and information processing (ChinaSIP). IEEE

    Google Scholar 

  94. Narendranath M, Murthy H, Rajendran S, Yegnanarayana B (1995) Transformation of formants for voice conversion using artificial neural networks. Speech Commun 16(2):207–216

    Google Scholar 

  95. Desai S, Raghavendra E, Yegnanarayana B, Black A, Prahallad K (2009) Voice conversion using artificial neural networks. In: Proceedings of the ICASSP. IEEE, pp 3893–3896

    Google Scholar 

  96. Saito Y, Takamichi S, Saruwatari H (2017) Voice conversion using input-to-output highway networks. IEICE Trans Inf Syst E10(08):1925–1928

    Google Scholar 

  97. Nakashika T, Takiguchi T, Ariki Y (2015) Voice conversion using RNN pre-trained by recurrent temporal restricted boltzmann machines. IEEE/ACM Trans Audio Speech Lang Process (TASLP) 23(3):580–587

    Google Scholar 

  98. Sun L, Kang S, Li K, Meng H (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In: Proceedings of the ICASSP, pp 4869–4873

    Google Scholar 

  99. Sundermann D, Ney H (2003) VTLN-based voice conversion. In: Proceedings of the 3rd IEEE international symposium on signal processing and information technology, 2003. ISSPIT 2003. IEEE, pp 556–559

    Google Scholar 

  100. Erro D, Moreno A, Bonafonte A (2010) Voice conversion based on weighted frequency warping. IEEE Trans Audio Speech Lang Process 18(5):922–931

    Google Scholar 

  101. Erro D, Navas E, Hernaez I (2013) Parametric voice conversion based on bilinear frequency warping plus amplitude scaling. IEEE Trans Audio Speech Lang Process 21(3):556–566

    Article  Google Scholar 

  102. Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM (2017) Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. In: Proceedings of the interspeech 2017, pp 3364–3368

    Google Scholar 

  103. Miyoshi H, Saito Y, Takamichi S, Saruwatari H (2017) Voice conversion using sequence-to-sequence learning of context posterior probabilities. In: Proceedings of the interspeech 2017, pp 1268–1272

    Google Scholar 

  104. Fang F, Yamagishi J, Echizen I, Lorenzo-Trueba J (2018) High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In: Proceedings of the ICASSP 2018

    Google Scholar 

  105. Kobayashi K, Hayashi T, Tamamori A, Toda T (2017) Statistical voice conversion with wavenet-based waveform generation. In: Proceedings of the interspeech, pp 1138–1142

    Google Scholar 

  106. Gillet B, King S (2003) Transforming F0 contours. In: Proceedings of the EUROSPEECH, pp 101–104

    Google Scholar 

  107. Wu CH, Hsia CC, Liu TH, Wang JF (2006) Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis. IEEE Trans Audio Speech Lang Process 14(4):1109–1116

    Article  Google Scholar 

  108. Helander E, Nurminen J (2007) A novel method for prosody prediction in voice conversion. In: Proceedings of the ICASSP, vol 4. IEEE, pp IV–509

    Google Scholar 

  109. Wu Z, Kinnunen T, Chng E, Li H (2010) Text-independent F0 transformation with non-parallel data for voice conversion. In: Proceedings of the interspeech

    Google Scholar 

  110. Lolive D, Barbot N, Boeffard O (2008) Pitch and duration transformation with non-parallel data. Speech Prosody 2008:111–114

    Google Scholar 

  111. Toda T, Chen LH, Saito D, Villavicencio F, Wester M, Wu Z, Yamagishi J (2016) The voice conversion challenge 2016. In: Proceedings of the interspeech, pp 1632–1636

    Google Scholar 

  112. Wester M, Wu Z, Yamagishi J (2016) Analysis of the voice conversion challenge 2016 evaluation results. In: Proceedings of the interspeech, pp 1637–1641

    Google Scholar 

  113. Perrot P, Aversano G, Blouet R, Charbit M, Chollet G (2005) Voice forgery using ALISP: indexation in a client memory. In: Proceedings of the ICASSP, vol 1. IEEE, pp 17–20

    Google Scholar 

  114. Matrouf D, Bonastre JF, Fredouille C (2006) Effect of speech transformation on impostor acceptance. In: Proceedings of the ICASSP, vol 1. IEEE, pp I–I

    Google Scholar 

  115. Kinnunen T, Wu Z, Lee K, Sedlak F, Chng E, Li H (2012) Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech. In: Proceedings of the ICASSP. IEEE, pp 4401–4404

    Google Scholar 

  116. Sundermann D, Hoge H, Bonafonte A, Ney H, Black A, Narayanan S (2006) Text-independent voice conversion based on unit selection. In: Proceedings of the ICASSP, vol 1, pp I–I

    Google Scholar 

  117. Wu Z, Larcher A, Lee K, Chng E, Kinnunen T, Li H (2013) Vulnerability evaluation of speaker verification under voice conversion spoofing: the effect of text constraints. In: Proceedings of the interspeech, Lyon, France

    Google Scholar 

  118. Alegre F, Vipperla R, Evans N, Fauve B (2012) On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals. In: 2012 EURASIP conference on European conference on signal processing (EUSIPCO)

    Google Scholar 

  119. De Leon PL, Hernaez I, Saratxaga I, Pucher M, Yamagishi J (2011) Detection of synthetic speech for the problem of imposture. In: Proceedings of the ICASSP, Dallas, USA, pp 4844–4847

    Google Scholar 

  120. Wu Z, Kinnunen T, Chng E, Li H, Ambikairajah E (2012) A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In: Proceedings of the Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC), pp 1–5. IEEE

    Google Scholar 

  121. Alegre F, Vipperla R, Evans N (2012) Spoofing countermeasures for the protection of automatic speaker recognition systems against attacks with artificial signals. In: Proceedings of the interspeech

    Google Scholar 

  122. Alegre F, Amehraye A, Evans N (2013) Spoofing countermeasures to protect automatic speaker verification from voice conversion. In: Proceedings of the ICASSP

    Google Scholar 

  123. Wu Z, Xiao X, Chng E, Li H (2013) Synthetic speech detection using temporal modulation feature. In: Proceedings of the ICASSP

    Google Scholar 

  124. Alegre F, Vipperla R, Amehraye A, Evans N (2013) A new speaker verification spoofing countermeasure based on local binary patterns. In: Proceedings of the interspeech, Lyon, France

    Google Scholar 

  125. Kinnunen T, Lee K, Delgado H, Evans N, Todisco M, Sahidullah M, Yamagishi J, Reynolds DA (2018) t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. In: Proceedings of the Odyssey, Les Sables d’Olonne, France

    Google Scholar 

  126. Kinnunen T, Delgado H, Evans N, Lee KA, Vestman V, Nautsch A, Todisco M, Wang X, Sahidullah M, Yamagishi J et al (2020) Tandem assessment of spoofing countermeasures and automatic speaker verification: fundamentals. IEEE/ACM Trans Audio Speech Lang Process 28:2195–2210

    Article  Google Scholar 

  127. Wu Z, Kinnunen T, Evans N, Yamagishi J, Hanilçi C, Sahidullah M, Sizov A (2015) ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In: Proceedings of the interspeech

    Google Scholar 

  128. Kinnunen T, Sahidullah M, Delgado H, Todisco M, Evans N, Yamagishi J, Lee K (2017) The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection. In: Interspeech

    Google Scholar 

  129. Nautsch A, Wang X, Evans N, Kinnunen TH, Vestman V, Todisco M, Delgado H, Sahidullah M, Yamagishi J, Lee KA (2021) Asvspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech. IEEE Trans Biomet Behav Identity Sci 3(2):252–265

    Article  Google Scholar 

  130. Wang X, Yamagishi J, Todisco M, Delgado H, Nautsch A, Evans N, Sahidullah M, Vestman V, Kinnunen T, Lee KA, Juvela L, Alku P, Peng YH, Hwang HT, Tsao Y, Wang HM, Maguer SL, Becker M, Henderson F, Clark R, Zhang Y, Wang Q, Jia Y, Onuma K, Mushika K, Kaneda T, Jiang Y, Liu LJ, Wu YC, Huang WC, Toda T, Tanaka K, Kameoka H, Steiner I, Matrouf D, Bonastre JF, Govender A, Ronanki S, Zhang JX, Ling ZH (2020) Asvspoof 2019: a large-scale public database of synthesized, converted and replayed speech. Comput Speech Lang 64:101–114

    Google Scholar 

  131. Wu Z, Khodabakhsh A, Demiroglu C, Yamagishi J, Saito D, Toda T, King S (2015) SAS: a speaker verification spoofing database containing diverse attacks. In: Proceedings of the IEEE international conferences on acoustics, speech, and signal processing (ICASSP)

    Google Scholar 

  132. Wu Z, Kinnunen T, Evans N, Yamagishi J (2014) ASVspoof 2015: automatic speaker verification spoofing and countermeasures challenge evaluation plan. http://www.spoofingchallenge.org/asvSpoof.pdf

  133. Patel T, Patil H (2015) Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural versus spoofed speech. In: Proceedings of the interspeech

    Google Scholar 

  134. Novoselov S, Kozlov A, Lavrentyeva G, Simonchik K, Shchemelinin V (2016) STC anti-spoofing systems for the ASVspoof 2015 challenge. In: Proceedings of the IEEE international conferences on acoustics, speech, and signal processing (ICASSP), pp 5475–5479

    Google Scholar 

  135. Chen N, Qian Y, Dinkel H, Chen B, Yu K (2015) Robust deep feature for spoofing detection-the SJTU system for ASVspoof 2015 challenge. In: Proceedings of the interspeech

    Google Scholar 

  136. Xiao X, Tian X, Du S, Xu H, Chng E, Li H (2015) Spoofing speech detection using high dimensional magnitude and phase features: the NTU approach for ASVspoof 2015 challenge. In: Proceedings of the Interspeech (2015)

    Google Scholar 

  137. Alam M, Kenny P, Bhattacharya G, Stafylakis T (2015) Development of CRIM system for the automatic speaker verification spoofing and countermeasures challenge 2015. In: Proceedings of the interspeech

    Google Scholar 

  138. Wu Z, Yamagishi J, Kinnunen T, Hanilçi C, Sahidullah M, Sizov A, Evans N, Todisco M, Delgado H (2017) Asvspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J Sel Top Signal Process 11(4):588–604

    Article  Google Scholar 

  139. Delgado H, Todisco M, Sahidullah M, Evans N, Kinnunen T, Lee K, Yamagishi J (2018) ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements. In: Proceedings of the Odyssey 2018 the speaker and language recognition workshop, pp 296–303

    Google Scholar 

  140. Todisco M, Delgado H, Evans N (2016) A new feature for automatic speaker verification anti-spoofing: constant Q cepstral coefficients. In: Proceedings of the Odyssey: the speaker and language recognition workshop, Bilbao, Spain, pp 283–290

    Google Scholar 

  141. Todisco M, Delgado H, Evans N (2017) Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification. Comput Speech Lang 45:516–535

    Article  Google Scholar 

  142. Lavrentyeva G, Novoselov S, Malykh E, Kozlov A, Kudashev O, Shchemelinin V (2017) Audio replay attack detection with deep learning frameworks. In: Proceedings of the interspeech, pp 82–86

    Google Scholar 

  143. Ji Z, Li Z, Li P, An M, Gao S, Wu D, Zhao F (2017) Ensemble learning for countermeasure of audio replay spoofing attack in ASVspoof2017. In: Proceedings interspeech, pp 87–91

  144. Li L, Chen Y, Wang D, Zheng T (2017) A study on replay attack and anti-spoofing for automatic speaker verification. In: Proceedings of the interspeech, pp 92–96

  145. Patil H, Kamble M, Patel T, Soni M (2017) Novel variable length teager energy separation based instantaneous frequency features for replay detection. In: Proceedings of the interspeech, pp 12–16

  146. Chen Z, Xie Z, Zhang W, Xu X (2017) ResNet and model fusion for automatic spoofing detection. In: Proceedings of the interspeech, pp 102–106

  147. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. https://doi.org/10.7488/ds/1994. Accessed 3 Sept 2019

  148. Yamagishi J, Todisco M, Sahidullah M, Delgado H, Wang X, Evans N, Kinnunen T, Lee KA, Vestman V, Nautsch A (2019) ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. Technical report, ASVspoof Consortium

  149. Zen H, Senior A, Schuster M (2013) Statistical parametric speech synthesis using deep neural networks. In: Proceedings of the ICASSP, pp 7962–7966

  150. Morise M, Yokomori F, Ozawa K (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans Inf Syst 99(7):1877–1884

  151. Wu Z, Watts O, King S (2016) Merlin: An open source neural network speech synthesis system. In: Speech synthesis workshop SSW 2016

  152. Schröder M, Charfuelan M, Pammi S, Steiner I (2011) Open source voice creation toolkit for the MARY TTS platform. In: Proceedings of the interspeech, pp 3253–3256

  153. Steiner I, Le Maguer S (2018) Creating new language and voice components for the updated MaryTTS text-to-speech synthesis platform. In: 11th language resources and evaluation conference (LREC), Miyazaki, Japan, pp 3171–3175

  154. Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM (2016) Voice conversion from non-parallel corpora using variational auto-encoder. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA). IEEE, pp 1–6

  155. Matrouf D, Bonastre J, Fredouille C (2006) Effect of speech transformation on impostor acceptance. In: 2006 IEEE international conference on acoustics speech and signal processing proceedings, vol 1, pp I–I

  156. Wang X, Takaki S, Yamagishi J (2019) Neural source-filter-based waveform model for statistical parametric speech synthesis. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5916–5920

  157. Zen H, Agiomyrgiannakis Y, Egberts N, Henderson F, Szczepaniak P (2016) Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In: Proceedings of the interspeech, pp 2273–2277

  158. Jia Y, Zhang Y, Weiss R, Wang Q, Shen J, Ren F, Nguyen P, Pang R, Moreno IL, Wu Y, et al (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In: Advances in neural information processing systems, pp 4480–4490

  159. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerry-Ryan R et al (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4779–4783

  160. Griffin DW, Lim JS (1984) Signal estimation from modified short-time Fourier transform. IEEE Trans Acoust Speech Signal Process 32(2):236–243

  161. Huang WC, Wu YC, Kobayashi K, Peng YH, Hwang HT, Lumban Tobing P, Tsao Y, Wang HM, Toda T (2019) Generalization of spectrum differential based direct waveform modification for voice conversion. In: Proceedings of the SSW10

  162. Kobayashi K, Toda T, Nakamura S (2018) Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential. Speech Commun 99:211–220

  163. Kinnunen T, Juvela L, Alku P, Yamagishi J (2017) Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5535–5539

  164. Campbell DR, Palomäki KJ, Brown G (2005) A MATLAB simulation of “shoebox” room acoustics for use in research and teaching. Comput Inf Syst J 9(3). ISSN 1352-9404

  165. Vincent E (2008) Roomsimove. http://homepages.loria.fr/evincent/software/Roomsimove_1.4.zip

  166. Novak A, Lotton P, Simon L (2015) Synchronized swept-sine: theory, application, and implementation. J Audio Eng Soc 63(10):786–798. http://www.aes.org/e-lib/browse.cfm?elib=18042

  167. Todisco M, Wang X, Vestman V, Sahidullah M, Delgado H, Nautsch A, Yamagishi J, Evans N, Kinnunen TH, Lee KA (2019) ASVspoof 2019: future horizons in spoofed and fake audio detection. In: Proceedings of the interspeech, pp 1008–1012

  168. Wu Z, Gao S, Chng E, Li H (2014) A study on replay attack and anti-spoofing for text-dependent speaker verification. In: Proceedings of the Asia-Pacific signal information processing association annual summit and conference (APSIPA ASC). IEEE, pp 1–5

  169. Li Q (2009) An auditory-based transform for audio signal processing. In: 2009 IEEE workshop on applications of signal processing to audio and acoustics. IEEE, pp 181–184

  170. Davis S, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28(4):357–366

  171. Sahidullah M, Kinnunen T, Hanilçi C (2015) A comparison of features for synthetic speech detection. In: Proceedings of the interspeech. ISCA, pp 2087–2091

  172. Brown J (1991) Calculation of a constant Q spectral transform. J Acoust Soc Am 89(1):425–434

  173. Alam M, Kenny P (2017) Spoofing detection employing infinite impulse response - constant Q transform-based feature representations. In: Proceedings of the European signal processing conference (EUSIPCO)

  174. Cancela P, Rocamora M, López E (2009) An efficient multi-resolution spectral transform for music analysis. In: Proceedings of the international society for music information retrieval conference, pp 309–314

  175. Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127

  176. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  177. Tian Y, Cai M, He L, Liu J (2015) Investigation of bottleneck features and multilingual deep neural networks for speaker verification. In: Proceedings of the interspeech, pp 1151–1155

  178. Richardson F, Reynolds D, Dehak N (2015) Deep neural network approaches to speaker and language recognition. IEEE Signal Process Lett 22(10):1671–1675

  179. Hinton G, Deng L, Yu D, Dahl GE, Mohamed RA, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN, Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97

  180. Alam M, Kenny P, Gupta V, Stafylakis T (2016) Spoofing detection on the ASVspoof2015 challenge corpus employing deep neural networks. In: Proceedings of the Odyssey: the speaker and language recognition workshop, Bilbao, Spain, pp 270–276

  181. Qian Y, Chen N, Yu K (2016) Deep features for automatic spoofing detection. Speech Commun 85:43–52

  182. Yu H, Tan ZH, Zhang Y, Ma Z, Guo J (2017) DNN filter bank cepstral coefficients for spoofing detection. IEEE Access 5:4779–4787

  183. Sriskandaraja K, Sethu V, Ambikairajah E, Li H (2017) Front-end for antispoofing countermeasures in speaker verification: scattering spectral decomposition. IEEE J Sel Top Signal Process 11(4):632–643. https://doi.org/10.1109/JSTSP.2016.2647202

  184. Andén J, Mallat S (2014) Deep scattering spectrum. IEEE Trans Signal Process 62(16):4114–4128

  185. Mallat S (2012) Group invariant scattering. Commun Pure Appl Math 65:1331–1398

  186. Pal M, Paul D, Saha G (2018) Synthetic speech detection using fundamental frequency variation and spectral features. Comput Speech Lang 48:31–50

  187. Laskowski K, Heldner M, Edlund J (2008) The fundamental frequency variation spectrum. In: Proceedings of fonetik, vol 2008, pp 29–32

  188. Saratxaga I, Sanchez J, Wu Z, Hernaez I, Navas E (2016) Synthetic speech detection using phase information. Speech Commun 81:30–41

  189. Wang L, Nakagawa S, Zhang Z, Yoshida Y, Kawakami Y (2017) Spoofing speech detection using modified relative phase information. IEEE J Sel Top Signal Process 11(4):660–670

  190. Chakroborty S, Saha G (2009) Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter. Int J Signal Process 5(1):11–19

  191. Wu X, He R, Sun Z, Tan T (2018) A light CNN for deep face representation with noisy labels. IEEE Trans Inf Forens Secur 13(11):2884–2896

  192. Goncalves AR, Violato RPV, Korshunov P, Marcel S, Simoes FO (2017) On the generalization of fused systems in voice presentation attack detection. In: 2017 international conference of the biometrics special interest group (BIOSIG), pp 1–5. https://doi.org/10.23919/BIOSIG.2017.8053516

  193. Paul D, Pal M, Saha G (2016) Novel speech features for improved detection of spoofing attacks. In: Proceedings of the annual IEEE India conference (INDICON)

  194. Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798

  195. Khoury E, Kinnunen T, Sizov A, Wu Z, Marcel S (2014) Introducing i-vectors for joint anti-spoofing and speaker verification. In: Proceedings of the interspeech

  196. Sizov A, Khoury E, Kinnunen T, Wu Z, Marcel S (2015) Joint speaker verification and antispoofing in the i-vector space. IEEE Trans Inf Forens Secur 10(4):821–832

  197. Hanilçi C (2018) Data selection for i-vector based automatic speaker verification anti-spoofing. Digit Signal Process 72:171–180

  198. Tian X, Wu Z, Xiao X, Chng E, Li H (2016) Spoofing detection from a feature representation perspective. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 2119–2123

  199. Yu H, Tan ZH, Ma Z, Martin R, Guo J (2018) Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features. IEEE Trans Neural Netw Learn Syst PP(99):1–12

  200. Dinkel H, Chen N, Qian Y, Yu K (2017) End-to-end spoofing detection with raw waveform CLDNNs. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 4860–4864

  201. Sainath T, Weiss R, Senior A, Wilson K, Vinyals O (2015) Learning the speech front-end with raw waveform CLDNNs. In: Proceedings of the interspeech

  202. Zhang C, Yu C, Hansen JHL (2017) An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J Sel Top Signal Process 11(4):684–694

  203. Muckenhirn H, Magimai-Doss M, Marcel S (2017) End-to-end convolutional neural network-based voice presentation attack detection. In: 2017 IEEE international joint conference on biometrics (IJCB), pp 335–341

  204. Chen S, Ren K, Piao S, Wang C, Wang Q, Weng J, Su L, Mohaisen A (2017) You can hear but you cannot steal: defending against voice impersonation attacks on smartphones. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS), pp 183–195. IEEE

  205. Shiota S, Villavicencio F, Yamagishi J, Ono N, Echizen I, Matsui T (2015) Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification. In: Proceedings of the interspeech

  206. Shiota S, Villavicencio F, Yamagishi J, Ono N, Echizen I, Matsui T (2016) Voice liveness detection for speaker verification based on a tandem single/double-channel pop noise detector. In: Odyssey

  207. Sahidullah M, Thomsen D, Hautamäki R, Kinnunen T, Tan ZH, Parts R, Pitkänen M (2018) Robust voice liveness detection and speaker verification using throat microphones. IEEE/ACM Trans Audio Speech Lang Process 26(1):44–56

  208. Elko G, Meyer J, Backer S, Peissig J (2007) Electronic pop protection for microphones. In: 2007 IEEE workshop on applications of signal processing to audio and acoustics, pp 46–49. IEEE

  209. Zhang L, Tan S, Yang J, Chen Y (2016) Voicelive: a phoneme localization based liveness detection for voice authentication on smartphones. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp 1080–1091. ACM

  210. Zhang L, Tan S, Yang J (2017) Hearing your voice is not enough: an articulatory gesture based liveness detection for voice authentication. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. ACM, pp 57–71

  211. Hanilçi C, Kinnunen T, Sahidullah M, Sizov A (2016) Spoofing detection goes noisy: an analysis of synthetic speech detection in the presence of additive noise. Speech Commun 85:83–97

  212. Yu H, Sarkar A, Thomsen D, Tan ZH, Ma Z, Guo J (2016) Effect of multi-condition training and speech enhancement methods on spoofing detection. In: Proceedings of the international workshop on sensing, processing and learning for intelligent machines (SPLINE)

  213. Tian X, Wu Z, Xiao X, Chng E, Li H (2016) An investigation of spoofing speech detection under additive noise and reverberant conditions. In: Proceedings of the interspeech

  214. Delgado H, Todisco M, Evans N, Sahidullah M, Liu W, Alegre F, Kinnunen T, Fauve B (2017) Impact of bandwidth and channel variation on presentation attack detection for speaker verification. In: 2017 international conference of the biometrics special interest group (BIOSIG), pp 1–6

  215. Qian Y, Chen N, Dinkel H, Wu Z (2017) Deep feature engineering for noise robust spoofing detection. IEEE/ACM Trans Audio Speech Lang Process 25(10):1942–1955

  216. Korshunov P, Marcel S (2016) Cross-database evaluation of audio-based spoofing detection systems. In: Proceedings of the interspeech

  217. Paul D, Sahidullah M, Saha G (2017) Generalization of spoofing countermeasures: a case study with ASVspoof 2015 and BTAS 2016 corpora. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP). IEEE, pp 2047–2051

  218. Lorenzo-Trueba J, Fang F, Wang X, Echizen I, Yamagishi J, Kinnunen T (2018) Can we steal your vocal identity from the Internet?: initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data. In: Proceedings of the Odyssey: the speaker and language recognition workshop

  219. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680

  220. Kreuk F, Adi Y, Cisse M, Keshet J (2018) Fooling end-to-end speaker verification by adversarial examples. arXiv:1801.03339

  221. Sahidullah M, Delgado H, Todisco M, Yu H, Kinnunen T, Evans N, Tan ZH (2016) Integrated spoofing countermeasures and automatic speaker verification: an evaluation on ASVspoof 2015. In: Proceedings of the interspeech

  222. Muckenhirn H, Korshunov P, Magimai-Doss M, Marcel S (2017) Long-term spectral statistics for voice presentation attack detection. IEEE/ACM Trans Audio Speech Lang Process 25(11):2098–2111

  223. Sarkar A, Sahidullah M, Tan ZH, Kinnunen T (2017) Improving speaker verification performance in presence of spoofing attacks using out-of-domain spoofed data. In: Proceedings of the interspeech

  224. Kinnunen T, Lee K, Delgado H, Evans N, Todisco M, Sahidullah M, Yamagishi J, Reynolds D (2018) t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. In: Proceedings of the Odyssey: the speaker and language recognition workshop

  225. Todisco M, Delgado H, Lee K, Sahidullah M, Evans N, Kinnunen T, Yamagishi J (2018) Integrated presentation attack detection and automatic speaker verification: common features and Gaussian back-end fusion. In: Proceedings of the interspeech

  226. Wu Z, De Leon P, Demiroglu C, Khodabakhsh A, King S, Ling ZH, Saito D, Stewart B, Toda T, Wester M, Yamagishi J (2016) Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance. IEEE/ACM Trans Audio Speech Lang Process 24(4):768–783

Author information

Corresponding authors

Correspondence to Md Sahidullah, Héctor Delgado or Nicholas Evans.

Appendix A. Action Towards Reproducible Research

A.1 Speech Corpora

  1. Spoofing and Anti-Spoofing (SAS) database v1.0: This is the first version of a speaker verification spoofing and anti-spoofing database, named the SAS corpus [226]. The corpus includes nine spoofing techniques: two based on speech synthesis and seven on voice conversion.

    Download link: http://dx.doi.org/10.7488/ds/252

  2. ASVspoof 2015 database: This database was used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). Genuine speech was collected from 106 speakers (45 male, 61 female) with no significant channel or background noise effects. Spoofed speech was generated from the genuine data using a number of different spoofing algorithms. The full dataset is partitioned into three subsets: the first for training, the second for development and the third for evaluation.

    Download link: http://dx.doi.org/10.7488/ds/298

  3. ASVspoof 2017 database: This database was used in the second Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017). It makes extensive use of the recent text-dependent RedDots corpus, as well as a replayed version of the same data. It contains a large amount of speech data from 42 speakers collected from 179 replay sessions in 62 unique replay configurations.

    Download link: http://dx.doi.org/10.7488/ds/2313

  4. ASVspoof 2019 database: This database was used in the third Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019). It has two independent subsets: logical access (LA) and physical access (PA). Both subsets contain speech data from 107 speakers (46 male, 61 female) and, as in ASVspoof 2015, have no significant background noise effects.

    Download link: https://datashare.ed.ac.uk/handle/10283/3336

A.2 Software Packages

  1. Feature extraction techniques for anti-spoofing: This package contains the MATLAB implementation of different acoustic feature extraction schemes as evaluated in [171].

    Download link: http://cs.joensuu.fi/~sahid/codes/AntiSpoofing_Features.zip

  2. Baseline spoofing detection package for ASVspoof 2017 corpus: This package contains the MATLAB implementations of the two spoofing detectors employed as baselines in the official ASVspoof 2017 evaluation. They use linear-frequency cepstral coefficient (LFCC) and constant Q cepstral coefficient (CQCC) [141] features with Gaussian mixture model (GMM) classifiers.

    Download link: http://audio.eurecom.fr/software/ASVspoof2017_baseline_countermeasures.zip

  3. Baseline spoofing detection package for ASVspoof 2019: This package contains the MATLAB implementation of the two official baseline spoofing detectors. As in ASVspoof 2017, they are based on CQCC and LFCC features with a GMM classifier back-end.

    Download link: https://www.asvspoof.org/asvspoof2019/ASVspoof_2019_baseline_CM_v1.zip

  4. Software package for t-DCF metric computation: This package contains implementations of the t-DCF metric. It also computes the EER.

    Download link: MATLAB: https://www.asvspoof.org/asvspoof2019/tDCF_matlab_v1.zip

    Python: https://www.asvspoof.org/asvspoof2019/tDCF_python_v1.zip
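As a rough illustration of what the EER reported by such a package measures, the following is a minimal sketch only, not the official ASVspoof implementation. The function name `compute_eer` and the toy scores are hypothetical, and the sketch assumes that higher countermeasure scores indicate bona fide speech. It sweeps the decision threshold over the observed scores and returns the operating point where the miss rate and the false alarm rate are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) from a threshold sweep over all observed scores.

    Assumption (hypothetical convention): a higher score means
    the trial looks more like bona fide speech.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer, best_thr = float("inf"), None, None
    for thr in thresholds:
        # Miss rate: bona fide trials rejected (score below the threshold).
        miss = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False alarm rate: spoofed trials accepted (score at or above it).
        fa = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            # With finitely many trials the two rates rarely cross exactly,
            # so report their average at the closest operating point.
            best_gap, eer, best_thr = gap, (miss + fa) / 2, thr
    return eer, best_thr


if __name__ == "__main__":
    # Toy scores, made up purely for illustration.
    bonafide = [2.0, 1.5, 0.3, 1.8]
    spoof = [-1.0, 0.5, -0.2, 0.1]
    eer, thr = compute_eer(bonafide, spoof)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

The quadratic-time sweep keeps the logic explicit. For large trial lists, sorting the scores once and accumulating the error counts is a more efficient design.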

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sahidullah, M. et al. (2023). Introduction to Voice Presentation Attack Detection and Recent Advances. In: Marcel, S., Fierrez, J., Evans, N. (eds) Handbook of Biometric Anti-Spoofing. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-19-5288-3_13

  • DOI: https://doi.org/10.1007/978-981-19-5288-3_13

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-5287-6

  • Online ISBN: 978-981-19-5288-3

  • eBook Packages: Computer Science, Computer Science (R0)
