Abstract
Biometric speaker recognition systems are vulnerable to spoofing attacks, most commonly speech synthesis and voice conversion. Such attacks can trick the system into accepting a spoofed utterance, compromising its security. Existing countermeasures rely largely on the physical features of speech to identify spoofing attacks. However, recent studies have shown that speech also carries many physiological features related to the human face: a speaker's gender, age, mouth shape, and other attributes can be inferred from the voice alone. Inspired by this line of research, we propose a spoofing attack detection method based on the fusion of physiological and physical features. The method comprises feature extraction, a densely connected convolutional neural network with squeeze-and-excitation blocks (SE-DenseNet), and a feature fusion strategy. We first extract physiological features from audio with a pre-trained convolutional network, then use SE-DenseNet to extract physical features; the dense connection pattern is parameter-efficient, and the squeeze-and-excitation blocks strengthen feature propagation. Finally, we feed the fused features into a classification network to identify spoofing attacks. Experimental results on the ASVspoof 2019 dataset show that our model is effective for voice spoofing detection. In the logical access scenario, it improves the tandem detection cost function and equal error rate by 5% and 7%, respectively, over existing methods.
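The two building blocks named in the abstract can be sketched roughly as follows: squeeze-and-excitation channel reweighting and late fusion of the physiological and physical embeddings by concatenation. The array shapes, weight names, and the concatenate-then-classify flow are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def squeeze_excitation(feature_map, w1, w2):
    """Squeeze-and-excitation gating: globally pool each channel into a
    descriptor, pass it through a small bottleneck, and rescale channels."""
    # feature_map: (channels, height, width)
    squeezed = feature_map.mean(axis=(1, 2))         # squeeze: (C,)
    hidden = np.maximum(0, w1 @ squeezed)            # ReLU bottleneck: (C/r,)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates: (C,)
    return feature_map * scale[:, None, None]        # excitation: reweight channels

def fuse_features(physiological, physical):
    """Late fusion: concatenate the two embeddings for the classifier."""
    return np.concatenate([physiological, physical])

rng = np.random.default_rng(0)
C, r = 8, 2                                  # channels, reduction ratio
fmap = rng.standard_normal((C, 4, 4))        # toy SE-DenseNet feature map
w1 = rng.standard_normal((C // r, C))        # bottleneck reduction weights
w2 = rng.standard_normal((C, C // r))        # bottleneck expansion weights

reweighted = squeeze_excitation(fmap, w1, w2)
physical_emb = reweighted.mean(axis=(1, 2))              # pooled physical features
physiological_emb = rng.standard_normal(256)             # stand-in for pre-trained-CNN output
fused = fuse_features(physiological_emb, physical_emb)
print(reweighted.shape, fused.shape)                     # (8, 4, 4) (264,)
```

In this sketch the gating leaves the feature map's shape unchanged while scaling each channel by a learned weight in (0, 1), and the fused vector is simply the two embeddings stacked end to end before classification.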
Acknowledgements
This work was supported by Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness (HNTS2022035), the National Natural Science Foundation of China (Grant Nos. 62036010 and 61972362), and Young Backbone Teachers in Henan Province (22020GGJS014).
Author information
Junxiao Xue is currently an Associate Professor in the School of Cyber Science and Engineering, Zhengzhou University, China. He has published more than 50 papers in international conferences and journals, including the IEEE Transactions on Knowledge and Data Engineering (TKDE), the IEEE Transactions on Computational Social Systems (TCSS), Computers & Graphics, Information Processing & Management, and Science China Information Sciences. His current research interests include computer graphics, virtual reality, and multimedia.
Hao Zhou received his BS degree in Software Engineering from Zhongyuan University of Technology in 2020 and is currently pursuing his MS degree at Zhengzhou University, China. His research focuses on fake audio detection, speech recognition, and voice synthesis.
Cite this article
Xue, J., Zhou, H. Physiological-physical feature fusion for automatic voice spoofing detection. Front. Comput. Sci. 17, 172318 (2023). https://doi.org/10.1007/s11704-022-2121-6