Replay spoof detection for speaker verification system using magnitude-phase-instantaneous frequency and energy features

Multimedia Tools and Applications

Abstract

Spoofing attack detection is an essential component of automatic speaker verification (ASV) systems. The success of ASV-2015 showed considerable promise in detecting voice conversion and speech synthesis spoofs. However, fewer studies address replay attack detection, even though replay is the attack that non-professional impersonators are most likely to use. This paper detects replay attacks on ASV systems using the ASVspoof-2017-v2.0 corpus. The work has two parts. The first part shows the significance of Empirical Mode Decomposition (EMD) and the Hilbert Spectrum (HS) for replay attack detection: the instantaneous frequency (IF) and instantaneous energy (IE) are extracted from the frequency components of the speech signal to differentiate the characteristics of genuine and spoofed speech, and are then passed to rectangular filter cepstral coefficients (RFCC) to obtain the feature set used to decide whether a given speech sample is genuine or spoofed. In the second part, a new score-level fusion system is proposed to increase system performance. Along with the proposed stand-alone method, Constant-Q cepstral coefficients (CQCC) and the All-Pole Group Delay Function (APGDF) are used to extract magnitude and phase feature sets, respectively. The proposed stand-alone and score-level fusion methods improve on the performance of other state-of-the-art techniques.
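The two ideas named in the abstract, instantaneous frequency and energy from the Hilbert transform and score-level fusion, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single narrowband component is already available (in the paper this would come from EMD; a synthetic tone stands in here), and it assumes fusion is a normalized weighted sum of per-system scores, which the abstract does not specify. The function names `analytic_signal`, `instantaneous_frequency_energy`, and `fuse_scores` are hypothetical.

```python
import numpy as np


def analytic_signal(x):
    """Build the analytic signal with an FFT (the usual Hilbert-transform construction)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies, zero negative ones.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


def instantaneous_frequency_energy(component, fs):
    """Return (IF in Hz, IE) for one narrowband component sampled at fs Hz."""
    z = analytic_signal(component)
    phase = np.unwrap(np.angle(z))
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)  # one sample shorter than the input
    inst_energy = np.abs(z) ** 2                     # squared envelope
    return inst_freq, inst_energy


def fuse_scores(scores, weights):
    """Assumed fusion rule: weighted sum over systems, weights normalized to sum to 1.

    scores has shape (n_systems, n_trials); the weights are illustrative only.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ scores


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = 0.5 * np.cos(2.0 * np.pi * 440.0 * t)  # stand-in for one EMD component
    inst_freq, inst_energy = instantaneous_frequency_energy(tone, fs)
    print(np.median(inst_freq), np.median(inst_energy))
    print(fuse_scores([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
```

For real speech, each component produced by EMD would be passed through `instantaneous_frequency_energy` separately, since the Hilbert-based estimate is only meaningful for narrowband signals.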

Acknowledgments

First author Bharath K P (CSIR-Senior Research Fellow) would like to thank the Council of Scientific & Industrial Research (CSIR), Human Resource Development Group (HRDG), Govt. of India, for financial assistance (CSIR-SRF, Ack. No.: 143672/2 k18/1, File No.: 09/844(0084)/2019 EMR-I).

Author information

Corresponding author

Correspondence to M. Rajesh Kumar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Bharath, K.P., Kumar, M.R. Replay spoof detection for speaker verification system using magnitude-phase-instantaneous frequency and energy features. Multimed Tools Appl 81, 39343–39366 (2022). https://doi.org/10.1007/s11042-022-12380-7

  • DOI: https://doi.org/10.1007/s11042-022-12380-7
