Abstract
Automatic speaker verification (ASV) systems are widely used in voice-based applications but are highly vulnerable to spoofing attacks such as text-to-speech synthesis and voice conversion. Detecting spoofed audio effectively is the main way to protect ASV systems. However, new spoofing technologies are emerging rapidly, and existing methods generalize poorly and lack robustness to unknown attacks. In this paper, an audio spoofing detection method is proposed based on Constant-Q Spectral Sketches (CQSS) and a parallel-attention SE-ResNet. Specifically, CQSS features are first extracted using the constant-Q transform and represented in matrix form and in spectrogram form, each of which is fed into a different detection model. Then, a new deep neural network architecture based on SE-ResNet is proposed, with a parallel attention design to improve generalization ability and training efficiency. Finally, the scores produced by the different models are fused using an averaging strategy. Experimental results show that the proposed fusion method achieves a tandem detection cost function (t-DCF) of 0.0307 and an equal error rate (EER) of 0.96% for unknown attacks, outperforming state-of-the-art methods.
Supported by Anhui Provincial Key Research and Development Program (202004d07020011, 202104d07020001); Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation (GBL202117); Fundamental Research Funds for the Central Universities (PA2021GDSK0073, PA2021GDSK0074, PA2022GDSK0037).
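The averaging fusion and the equal error rate named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fuse_scores` and `equal_error_rate`, the equal weighting, the label convention (1 = bona fide, 0 = spoof), and the assumption that a higher score means "more likely bona fide" are all hypothetical choices made for illustration, and the toy scores are invented.

```python
import numpy as np


def fuse_scores(*model_scores):
    """Average per-utterance scores from several models (equal weights assumed)."""
    stacked = np.stack([np.asarray(s, dtype=float) for s in model_scores])
    return stacked.mean(axis=0)


def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumed convention: labels are 1 for bona fide and 0 for spoof, and an
    utterance is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # spoofed audio accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide audio rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where FAR and FRR are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores from two hypothetical models for four utterances.
labels = [1, 1, 0, 0]
matrix_model = [0.9, 0.8, 0.2, 0.1]
spectrogram_model = [0.7, 0.6, 0.4, 0.3]

fused = fuse_scores(matrix_model, spectrogram_model)
eer = equal_error_rate(fused, labels)
```

Averaging assumes that the two models emit scores on comparable scales; if they do not, some normalization before fusion would be needed, which this sketch leaves out.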
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yue, F., Chen, J., Su, Z., Wang, N., Zhang, G. (2022). Audio Spoofing Detection Using Constant-Q Spectral Sketches and Parallel-Attention SE-ResNet. In: Atluri, V., Di Pietro, R., Jensen, C.D., Meng, W. (eds) Computer Security – ESORICS 2022. ESORICS 2022. Lecture Notes in Computer Science, vol 13556. Springer, Cham. https://doi.org/10.1007/978-3-031-17143-7_38
DOI: https://doi.org/10.1007/978-3-031-17143-7_38
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17142-0
Online ISBN: 978-3-031-17143-7
eBook Packages: Computer Science, Computer Science (R0)