An improved feature extraction for Hindi language audio impersonation attack detection

Multimedia Tools and Applications

Abstract

Audio impersonation attacks pose a substantial risk to voice-based authentication systems and other speech recognition applications, so robust detection methods are required to ensure system security and dependability. This paper presents a new approach to improving the front-end feature extraction of an audio impersonation attack detection system, particularly in the context of the Hindi language. The proposed model is implemented in three main steps. First, Gammatone, Mel, and Acoustic Ternary Pattern Audio Features (TPAF) spectrograms are generated from the recorded audio samples. Second, an optimized Residual Network (ResNet27) is employed to capture distinctive characteristics from these spectrograms. Last, four binary classifier algorithms, namely eXtreme Gradient Boosting (XGBoost), Random Forest (RF), K-Nearest Neighbor (KNN), and Naïve Bayes (NB), are applied individually to the three feature combinations, resulting in twelve distinct systems. All of these systems have been evaluated for the audio impersonation attack on our own dataset, the Voice Impersonation Corpus in Hindi Language (VIHL). The proposed models have also been evaluated on the ASVspoof 2019 and ASVspoof 2021 datasets for spoof, impersonation, replay, and deepfake attacks. The results show that the Gammatone spectrogram-ResNet27 combination with the XGBoost classifier achieves a 0.9% Equal Error Rate (EER) for the impersonation attack, surpassing existing techniques in accurately identifying such attacks.
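To make the pipeline above concrete, the following is a minimal sketch, not the authors' implementation. It assumes librosa for the Mel spectrogram front end (the Gammatone and TPAF spectrograms used in the paper are omitted here), replaces the custom ResNet27 embedding with simple per-band mean/std pooling, uses XGBoost as one of the four binary classifiers, and computes the Equal Error Rate from the ROC curve; the audio file names are hypothetical placeholders.

    # Minimal, illustrative sketch of the spectrogram -> embedding -> classifier -> EER pipeline.
    import numpy as np
    import librosa
    from sklearn.metrics import roc_curve
    from xgboost import XGBClassifier

    def mel_features(path, sr=16000, n_mels=128):
        """Load an utterance and pool its log-Mel spectrogram into a fixed-length vector."""
        y, _ = librosa.load(path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)
        # Stand-in for the ResNet27 embedding: per-band mean and standard deviation.
        return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

    def equal_error_rate(labels, scores):
        """EER: the operating point where false-acceptance and false-rejection rates meet."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2

    # Hypothetical file lists: label 1 = bonafide speech, 0 = impersonated speech.
    train_files, train_labels = ["bonafide_01.wav", "impersonated_01.wav"], [1, 0]
    test_files, test_labels = ["bonafide_02.wav", "impersonated_02.wav"], [1, 0]

    X_train = np.stack([mel_features(f) for f in train_files])
    X_test = np.stack([mel_features(f) for f in test_files])

    clf = XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(X_train, train_labels)
    scores = clf.predict_proba(X_test)[:, 1]  # score for the bonafide class
    print("EER:", equal_error_rate(test_labels, scores))

In the paper, this structure is repeated for each of the three spectrogram types and each of the four classifiers, yielding the twelve systems compared in the evaluation.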


Data Availability

All data generated or analyzed during this study are included in this published article.


Author information


Corresponding author

Correspondence to Nidhi Chakravarty.

Ethics declarations

I, Nidhi Chakravarty, on behalf of all the authors, declare that:

• This study did not receive funding from any source.

• The authors declare that they have no conflict of interest concerning the submitted manuscript.

• This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chakravarty, N., Dua, M. An improved feature extraction for Hindi language audio impersonation attack detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18104-9


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11042-023-18104-9

