
Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection

Published in: International Journal of Speech Technology

Abstract

With the increasing adoption of voice-based authentication systems, audio spoofing attacks have become a significant concern. These attacks aim to deceive voice authentication systems by manipulating or impersonating audio signals. To improve audio security, we introduce a spectrogram-based solution. Spectrograms, known for their effectiveness in audio analysis and feature extraction, offer valuable insight for combating audio spoofing. The proposed model is divided into two parts: a frontend and a backend. For the frontend, the model investigates the utility of the Mel spectrogram, the Gammatone Cepstral Coefficients (GTCC) spectrogram, the Acoustic Ternary Pattern (ATP) spectrogram, and the Mel-Frequency Cepstral Coefficients (MFCC) spectrogram. For the backend, two deep learning models, a Convolutional Neural Network (CNN) and a Residual Network (ResNet50), are each paired with these four spectrograms. The effectiveness of the proposed system is validated through experiments on the ASVspoof 2019 Logical Access (LA) and Physical Access (PA) evaluation datasets and on our own Voice Impersonation Corpus in Hindi Language (VIHL) dataset. The results show that the combination of GTCC spectrograms and ResNet50 outperforms all other proposed combinations, achieving Equal Error Rates (EER) of 0.6%, 1.15%, and 4.3% on LA, PA, and VIHL, respectively.
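The Equal Error Rates quoted above refer to the operating point at which the false-acceptance rate (spoofed audio accepted as genuine) equals the false-rejection rate (genuine audio rejected). A minimal NumPy sketch of this metric, using a simple threshold sweep rather than the authors' evaluation code, is:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: the threshold at which the false-acceptance
    rate (FAR) and false-rejection rate (FRR) coincide; when they do
    not cross exactly, return their mean at the closest point."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed audio accepted
        frr = np.mean(bonafide_scores < t)  # genuine audio rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Perfectly separated scores give EER = 0.0
print(compute_eer(np.array([0.9, 0.8]), np.array([0.1, 0.2])))  # 0.0
```

Here `bonafide_scores` and `spoof_scores` are hypothetical per-utterance classifier outputs (higher meaning "more genuine"); standard toolkits compute the same quantity from the ROC curve.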


Data availability

All datasets used in this work are publicly available and have been properly referenced in the text.


Author information


Corresponding author

Correspondence to Nidhi Chakravarty.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chakravarty, N., Dua, M. Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection. Int J Speech Technol (2024). https://doi.org/10.1007/s10772-024-10093-w

