Audio Spoofing Detection Using Constant-Q Spectral Sketches and Parallel-Attention SE-ResNet

Yue, Feng; Chen, Jiale; Su, Zhaopin; Wang, Niansong; Zhang, Guofu

doi:10.1007/978-3-031-17143-7_38

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13556)

Included in the following conference series:

European Symposium on Research in Computer Security

Abstract

Automatic speaker verification (ASV) systems are widely used in voice-based applications, but they are highly vulnerable to spoofing attacks such as text-to-speech synthesis and voice conversion. Effectively detecting spoofed audio is the main way to protect ASV systems. However, new spoofing technologies are emerging rapidly, and existing methods generalize poorly and show low robustness to unknown attacks. In this paper, an audio spoofing detection method is proposed based on Constant-Q Spectral Sketches (CQSS) and a parallel-attention SE-ResNet. Specifically, CQSS features are first extracted using the constant-Q transform and represented in two forms, a matrix and a spectrogram, each of which is fed into a separate detection model. Then, a new deep neural network architecture based on SE-ResNet is proposed, in which parallel attention is designed to improve generalization ability and training efficiency. Finally, the scores produced by the different models are fused using an averaging strategy. Experimental results show that the proposed fusion method achieves a tandem detection cost function (t-DCF) of 0.0307 and an equal error rate (EER) of 0.96% for unknown attacks, outperforming state-of-the-art methods.
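
The score-averaging and equal-error-rate steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes each detection model outputs one score per utterance, with higher scores meaning "more likely bona fide". The function names fuse_scores and equal_error_rate, the simple threshold sweep, and the toy scores are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def fuse_scores(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-utterance scores across models (one inner sequence per model)."""
    if not model_scores:
        raise ValueError("need scores from at least one model")
    n_utts = len(model_scores[0])
    if any(len(s) != n_utts for s in model_scores):
        raise ValueError("all models must score the same utterances")
    # zip(*...) groups the scores that different models gave to the same utterance.
    return [sum(per_utt) / len(model_scores) for per_utt in zip(*model_scores)]


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: label 1 = bona fide, label 0 = spoofed,
    and an utterance is accepted when its score >= threshold.
    """
    n_bona = sum(1 for y in labels if y == 1)
    n_spoof = sum(1 for y in labels if y == 0)
    if n_bona == 0 or n_spoof == 0:
        raise ValueError("need at least one bona fide and one spoofed utterance")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: each distinct score, plus one that rejects everything.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        # False acceptance rate: spoofed utterances that pass the threshold.
        far = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / n_spoof
        # False rejection rate: bona fide utterances that fail the threshold.
        frr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t) / n_bona
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where FAR and FRR are closest to each other.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores (made up for illustration) from two hypothetical models.
    matrix_model = [0.9, 0.8, 0.3, 0.1]
    spectrogram_model = [0.7, 0.4, 0.5, 0.2]
    labels = [1, 1, 0, 0]
    fused = fuse_scores([matrix_model, spectrogram_model])
    print("fused scores:", fused)
    print("EER of fused system:", equal_error_rate(fused, labels))
```

In this toy case the second model alone confuses one bona fide and one spoofed utterance, while the averaged scores separate the two classes. This only illustrates why fusion can help and is not evidence about the reported results.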

Supported by Anhui Provincial Key Research and Development Program (202004d07020011, 202104d07020001); Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation (GBL202117); Fundamental Research Funds for the Central Universities (PA2021GDSK0073, PA2021GDSK0074, PA2022GDSK0037).

Author information

Authors and Affiliations

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230601, China
Feng Yue, Jiale Chen, Zhaopin Su & Guofu Zhang
Province Key Laboratory of Industry Safety and Emergency Technology, Hefei University of Technology, Hefei, 230601, China
Zhaopin Su & Guofu Zhang
Intelligent Interconnected Systems Laboratory of Anhui Province, Hefei University of Technology, Hefei, 230009, China
Guofu Zhang
Institute of Forensic Science, Department of Public Security of Anhui Province, Hefei, 230000, China
Niansong Wang

Corresponding author

Correspondence to Zhaopin Su.

Editor information

Editors and Affiliations

Rutgers University, Newark, NJ, USA
Vijayalakshmi Atluri
Hamad Bin Khalifa University, Doha, Qatar
Roberto Di Pietro
Technical University of Denmark, Kongens Lyngby, Denmark
Christian D. Jensen
Technical University of Denmark, Kongens Lyngby, Denmark
Weizhi Meng

About this paper

Cite this paper

Yue, F., Chen, J., Su, Z., Wang, N., Zhang, G. (2022). Audio Spoofing Detection Using Constant-Q Spectral Sketches and Parallel-Attention SE-ResNet. In: Atluri, V., Di Pietro, R., Jensen, C.D., Meng, W. (eds) Computer Security – ESORICS 2022. ESORICS 2022. Lecture Notes in Computer Science, vol 13556. Springer, Cham. https://doi.org/10.1007/978-3-031-17143-7_38

DOI: https://doi.org/10.1007/978-3-031-17143-7_38
Published: 24 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17142-0
Online ISBN: 978-3-031-17143-7
eBook Packages: Computer Science, Computer Science (R0)
