An improved feature extraction for Hindi language audio impersonation attack detection

Chakravarty, Nidhi; Dua, Mohit

doi:10.1007/s11042-023-18104-9

Published: 24 January 2024

Multimedia Tools and Applications

Abstract

Audio impersonation attacks pose a substantial risk to voice-based authentication systems and various speech recognition applications. Robust detection methods are therefore required to ensure system security and dependability. This paper presents a new approach to improving the front-end feature extraction of an audio impersonation attack detection system, specifically in the context of the Hindi language. The proposed model is implemented in three main steps. First, a Gammatone spectrogram, a Mel spectrogram, and an Acoustic Ternary Pattern Audio Features (TPAF) spectrogram are generated from the recorded audio samples. Second, an optimized Residual Network (ResNet27) is employed to capture distinctive characteristics from these spectrograms. Finally, four binary classifiers, eXtreme Gradient Boosting (XGBoost), Random Forest (RF), K-nearest neighbor (KNN), and Naïve Bayes (NB), are individually applied to each of the three resulting feature combinations, yielding a total of twelve distinct systems. All of these systems have been evaluated for audio impersonation attacks on a self-created dataset, the Voice Impersonation Corpus in Hindi Language (VIHL). The proposed models have also been evaluated on the ASVspoof 2019 and ASVspoof 2021 datasets for spoof, impersonation, replay, and deepfake attacks. The results show that the Gammatone spectrogram-ResNet27 combination with the XGBoost classifier achieves an Equal Error Rate (EER) of 0.9% for impersonation attacks, surpassing existing techniques in accurately identifying such attacks.
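
As a rough illustration of two ideas in the abstract, the sketch below enumerates the twelve front-end/classifier pairings and shows one common way to compute an Equal Error Rate from detection scores. It is not the authors' implementation: the names `FRONT_ENDS`, `CLASSIFIERS`, and `equal_error_rate` are hypothetical, and the score convention (higher score means more likely genuine, label 1 for genuine and 0 for attack) is an assumption made for this example.

```python
from itertools import product

import numpy as np

# Hypothetical labels for the three spectrogram front-ends and four classifiers
# named in the abstract; each front-end feeds ResNet27 before classification.
FRONT_ENDS = ["Gammatone", "Mel", "TPAF"]
CLASSIFIERS = ["XGBoost", "RF", "KNN", "NB"]

# Every front-end paired with every classifier gives 3 x 4 = 12 systems.
SYSTEMS = list(product(FRONT_ENDS, CLASSIFIERS))


def equal_error_rate(scores, labels):
    """Approximate the EER from scores and binary labels.

    Assumption: higher scores indicate genuine speech (label 1),
    lower scores indicate an attack (label 0).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    attack = scores[labels == 0]

    # Use every distinct score as a candidate decision threshold.
    thresholds = np.unique(scores)
    # False rejection rate: genuine samples scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: attack samples scoring at or above the threshold.
    far = np.array([(attack >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)
```

Under this convention, perfectly separated scores give an EER of 0, and an EER of 0.9% would mean that false acceptances and false rejections each occur for roughly 9 in 1000 trials at the operating threshold.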

Data Availability

All data generated or analyzed during this study are included in this published article.

Author information

Authors and Affiliations

Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
Nidhi Chakravarty & Mohit Dua

Corresponding author

Correspondence to Nidhi Chakravarty.

Ethics declarations

I, Nidhi Chakravarty, on behalf of all the authors, declare that:

• This study did not receive funding from any source.

• Neither the authors nor the submitted manuscript have any conflict of interest.

• This article contains no studies with human participants or animals performed by any authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chakravarty, N., Dua, M. An improved feature extraction for Hindi language audio impersonation attack detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18104-9

Received: 14 July 2023
Revised: 08 September 2023
Accepted: 29 December 2023
Published: 24 January 2024
DOI: https://doi.org/10.1007/s11042-023-18104-9
