Abstract
The vulnerability of Automatic Speaker Verification (ASV) technology to various presentation attacks has drawn much interest in designing anti-spoofing countermeasures. Following their success in face anti-spoofing, several texture-analysis techniques have recently been transferred to the voice anti-spoofing domain with very promising results. Pursuing this research direction, and motivated by the fact that the local gradient amplitude of the spectrogram image is sensitive to the image distortions introduced by spoofing attacks, this paper devises a novel voice anti-spoofing countermeasure based on texture analysis of the spectrogram image. The main novelty of the proposed approach lies in applying edge detection techniques to the spectrogram image. The recorded speech is first converted into a 2-D spectrogram image. The spectrogram is then mapped using an edge detection method. Finally, different texture descriptors are used to extract visual features from the combination of the original and edge-mapped spectrograms. Experimental results on the ASVspoof 2017 v2.0 dataset show that the proposed countermeasure clearly outperforms several well-known techniques, achieving an Equal Error Rate (EER) as low as 3.52% and 22.86% on the development and evaluation sets, respectively. On the ASVspoof 2019 dataset, the proposed method achieves up to 27.62% and 29.15% relative EER improvement over the baseline CQCC (constant-Q cepstral coefficient) features in the Logical Access and Physical Access scenarios, respectively. In addition, the results show that the proposed approach, based on hand-crafted features and an SVM classifier, can outperform more sophisticated solutions that rely on complex neural network architectures.
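The pipeline outlined in the abstract (speech to spectrogram image, edge mapping, texture description of both images, then scoring with EER) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the STFT settings (n_fft=512, hop=160), the Sobel operator as the edge detector, basic 8-neighbour LBP as the texture descriptor, and plain concatenation as the "combination" are all assumptions, and every function name is hypothetical. The SVM stage is omitted; equal_error_rate only shows how the reported EER metric can be computed from detection scores, and relative_improvement how a relative EER gain over a baseline is expressed.

```python
import numpy as np

# Illustrative sketch only: the STFT settings, the Sobel edge operator, the basic
# LBP descriptor and the concatenation step are assumptions, not the paper's setup.


def log_spectrogram(x, n_fft=512, hop=160):
    """Log-magnitude STFT of a 1-D signal as a (frequency, time) image."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop  # requires len(x) >= n_fft
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T  # shape (n_fft // 2 + 1, n_frames)
    return np.log(mag + 1e-10)


def sobel_magnitude(img):
    """Edge map: gradient magnitude from 3x3 Sobel kernels, edge-replicated borders."""
    p = np.pad(img, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)


def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 8-neighbour local binary pattern codes."""
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    centre = p[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        codes |= (neighbour >= centre).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()


def edge_texture_features(x):
    """Concatenate texture histograms of the original and edge-mapped spectrograms."""
    spec = log_spectrogram(x)
    edges = sobel_magnitude(spec)
    return np.concatenate([lbp_histogram(spec), lbp_histogram(edges)])


def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds; higher score = more likely bona fide."""
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    frr = np.array([(bona < t).mean() for t in thresholds])   # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoofs accepted
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2


def relative_improvement(eer_baseline, eer_proposed):
    """Relative EER reduction with respect to a baseline, as a fraction."""
    return (eer_baseline - eer_proposed) / eer_baseline


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = edge_texture_features(rng.standard_normal(16000))  # 1 s at 16 kHz
    print(features.shape)  # (512,)
    print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0
```

Because the edge map amplifies local intensity changes in the spectrogram, describing its texture alongside that of the original image is one plausible way to capture the gradient distortions the abstract attributes to spoofed speech; in practice the resulting 512-dimensional vectors would be passed to a classifier such as an SVM.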
Change history
30 November 2023
Author name "Yahya-zoubir" has been changed to "Yahya-Zoubir".
Acknowledgment
This work is financially supported by the DGRSDT (Direction Générale de la Recherche Scientifique et du Développement Technologique) and CDTA (Centre de Développement des Technologies Avancées) Algeria, as part of the CDTA’s triennial research protocol 2019-2021. (Project code: N004-1/BIOSMC/DTELECOM/CDTA/PT 19-21)
Ethics declarations
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Meriem, F., Messaoud, B. & Bahia, YZ. Texture analysis of edge mapped audio spectrogram for spoofing attack detection. Multimed Tools Appl 83, 15915–15937 (2024). https://doi.org/10.1007/s11042-023-15329-6
DOI: https://doi.org/10.1007/s11042-023-15329-6