Abstract
Audio impersonation attacks pose a substantial risk to voice-based authentication systems and speech recognition applications, so robust detection methods are required to keep such systems secure and dependable. This paper presents a new approach to improving front-end feature extraction for an audio impersonation attack detection system, specifically for the Hindi language. The proposed model is implemented in three main steps. First, a Gammatone spectrogram, a Mel spectrogram, and an Acoustic Ternary Pattern Audio Features (TPAF) spectrogram are generated from the recorded audio samples. Second, an optimized Residual Network (ResNet27) is employed to capture distinctive characteristics from these spectrograms. Lastly, four binary classifiers, eXtreme Gradient Boosting (XGBoost), Random Forest (RF), K-nearest neighbor (KNN), and Naïve Bayes (NB), are applied individually to each of the three spectrogram-ResNet27 feature sets, resulting in a total of twelve distinct systems. All of these systems have been evaluated for audio impersonation attacks on a self-created dataset, the Voice Impersonation Corpus in Hindi Language (VIHL). The proposed models have also been evaluated on the ASVspoof 2019 and ASVspoof 2021 datasets for spoof, impersonation, replay, and deepfake attacks. The results show that the Gammatone spectrogram-ResNet27 combination with the XGBoost classifier achieves a 0.9% Equal Error Rate (EER) for impersonation attacks, which surpasses existing techniques in accurately identifying such attacks.
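To make the experimental grid and the reported metric concrete, the following is a minimal sketch in Python. It enumerates the twelve feature-classifier systems described above and computes an Equal Error Rate with a simple threshold sweep. The function names (`enumerate_systems`, `equal_error_rate`), the score convention (a higher score means "more likely genuine"), and the toy scores are assumptions made for illustration. They are not taken from the paper, and the authors' actual evaluation procedure may differ.

```python
from itertools import product

# The three spectrogram front ends and four classifiers named in the abstract.
FEATURES = ["Gammatone", "Mel", "TPAF"]
CLASSIFIERS = ["XGBoost", "RF", "KNN", "NB"]


def enumerate_systems():
    """Return one label per feature-classifier pairing (3 x 4 = 12 systems)."""
    return [
        f"{feat} spectrogram + ResNet27 + {clf}"
        for feat, clf in product(FEATURES, CLASSIFIERS)
    ]


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    scores: detector outputs, higher = more likely genuine (assumed convention).
    labels: 1 for genuine trials, 0 for impersonated trials.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not genuine or not spoof:
        raise ValueError("need at least one genuine and one impersonated trial")

    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: impersonated trials scored at or above the threshold.
        far = sum(s >= t for s in spoof) / len(spoof)
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    for name in enumerate_systems():
        print(name)
    # Toy scores for illustration only; these are not results from the paper.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
    print(equal_error_rate(toy_scores, toy_labels))
```

The EER is the operating point at which the false acceptance rate equals the false rejection rate, so a lower value indicates a better detector. With finite data the two rates rarely coincide exactly, which is why the sketch reports their midpoint at the closest threshold.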
Data Availability
All data generated or analyzed during this study are included in this published article.
Ethics declarations
I, Nidhi Chakravarty, on behalf of all the authors, declare that:
• This study did not receive funding from any source.
• Neither the authors nor the submitted manuscript has any conflict of interest.
• This article contains no studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chakravarty, N., Dua, M. An improved feature extraction for Hindi language audio impersonation attack detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18104-9
DOI: https://doi.org/10.1007/s11042-023-18104-9