Audio Spoofing Detection Using Constant-Q Spectral Sketches and Parallel-Attention SE-ResNet

  • Conference paper
Computer Security – ESORICS 2022 (ESORICS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13556)

Abstract

Automatic speaker verification (ASV) systems are widely used in voice-based applications, but they are highly vulnerable to spoofing attacks such as text-to-speech synthesis and voice conversion. Effectively detecting spoofed audio is the main way to protect ASV systems. However, new spoofing technologies are emerging rapidly, and existing methods generalize poorly and show low robustness to unknown attacks. In this paper, an audio spoofing detection method is proposed based on Constant-Q Spectral Sketches (CQSS) and a parallel-attention SE-ResNet. Specifically, CQSS features are first extracted using the constant-Q transform and represented as a matrix and as a spectrogram, each of which is fed into a different detection model. Then, a new deep neural network architecture based on SE-ResNet is proposed, in which parallel attention is designed to improve generalization ability and training efficiency. Finally, the scores produced by the different models are fused using an averaging strategy. The experimental results show that, for unknown attacks, the proposed fusion method achieves a tandem detection cost function (t-DCF) of 0.0307 and an equal error rate (EER) of 0.96%, a better verification performance than state-of-the-art methods.
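The last two steps of the abstract, average score fusion and evaluation by equal error rate, can be illustrated with a small sketch. It is not the authors' implementation: the function names `fuse_average` and `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration. The EER here is approximated by sweeping the observed scores as thresholds, and the t-DCF metric is not covered.

```python
from typing import List, Sequence, Tuple


def fuse_average(score_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-utterance scores of several models with equal weights."""
    n = len(score_sets[0])
    if any(len(scores) != n for scores in score_sets):
        raise ValueError("all models must score the same utterances")
    return [sum(column) / len(score_sets) for column in zip(*score_sets)]


def equal_error_rate(
    bonafide: Sequence[float], spoof: Sequence[float]
) -> Tuple[float, float]:
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes higher scores mean "more likely bona fide".
    Returns (eer, threshold) at the point where the false acceptance
    and false rejection rates are closest.
    """
    best = None
    for t in sorted(set(bonafide) | set(spoof)):
        far = sum(s >= t for s in spoof) / len(spoof)        # spoof accepted
        frr = sum(b < t for b in bonafide) / len(bonafide)   # bona fide rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical scores from two models for four utterances:
    # the first two are bona fide, the last two are spoofed.
    model_a = [1.0, 0.5, 0.25, 0.0]
    model_b = [0.5, 0.5, 0.25, 0.0]
    fused = fuse_average([model_a, model_b])
    eer, threshold = equal_error_rate(fused[:2], fused[2:])
    print(f"fused={fused} eer={eer:.2%} threshold={threshold}")
```

Equal-weight averaging needs no tuning data, but it implicitly assumes the models' scores are on comparable scales; if they are not, some normalization would be needed before fusing.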

Supported by Anhui Provincial Key Research and Development Program (202004d07020011, 202104d07020001); Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation (GBL202117); Fundamental Research Funds for the Central Universities (PA2021GDSK0073, PA2021GDSK0074, PA2022GDSK0037).

Corresponding author

Correspondence to Zhaopin Su.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yue, F., Chen, J., Su, Z., Wang, N., Zhang, G. (2022). Audio Spoofing Detection Using Constant-Q Spectral Sketches and Parallel-Attention SE-ResNet. In: Atluri, V., Di Pietro, R., Jensen, C.D., Meng, W. (eds) Computer Security – ESORICS 2022. ESORICS 2022. Lecture Notes in Computer Science, vol 13556. Springer, Cham. https://doi.org/10.1007/978-3-031-17143-7_38

  • DOI: https://doi.org/10.1007/978-3-031-17143-7_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17142-0

  • Online ISBN: 978-3-031-17143-7

  • eBook Packages: Computer Science, Computer Science (R0)
