
Voice conversion spoofing detection by exploring artifacts estimates

  • 1163: Large-scale multimedia signal processing for security and digital forensics
Published in Multimedia Tools and Applications

Abstract

Automatic speaker verification, or voice biometrics, verifies a person's claimed identity through his or her voice, and finds application in mobile banking and forensics. With the increased usage of speaker verification systems, studying spoofing threats to these systems and building proper countermeasures is gaining attention. Spoofing is a genuine challenge because it increases the false alarm rate, i.e. an impostor is incorrectly accepted as a genuine speaker; to make voice biometrics viable for practical applications, spoofing attacks must be detected. Voice conversion spoofing is a technique in which the impostor speaker's speech is converted to the desired speaker's speech using signal processing approaches. Studies show that voice conversion introduces artifacts into the resulting speech; hence, this paper proposes a novel approach to detect voice conversion spoofing attacks by computing artifact estimates from the given speech signal. The artifact estimates are obtained using a source separation technique based on non-negative matrix factorization (NMF). A Convolutional Neural Network based binary classifier is then built to classify the artifact estimates of the input speech as natural or synthetic. Experiments are conducted on the Voice Conversion Challenge 2016 and Voice Conversion Challenge 2018 databases. Results show that the proposed technique performs well, detecting a wide range of unknown attacks. Compared with state-of-the-art spoof detection systems based on Constant Q Cepstral Coefficients and Linear Frequency Cepstral Coefficients, the proposed system gives equivalent or better performance. Validation under various noise conditions using the NOIZEUS database further demonstrates the efficiency of the proposed system in noisy environments.
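The abstract does not specify the paper's exact NMF configuration (number of bases, divergence, STFT parameters), so the following is only a minimal sketch of how NMF-based source separation can yield component estimates from a magnitude spectrogram. The function name `artifact_estimate`, the choice of `n_components=2`, and the soft-mask reconstruction are illustrative assumptions, not the authors' method.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def artifact_estimate(speech, fs=16000, n_components=2, n_fft=512):
    """Separate a speech signal into NMF component estimates (illustrative)."""
    # Magnitude/phase of the short-time Fourier transform
    _, _, Z = stft(speech, fs=fs, nperseg=n_fft)
    mag, phase = np.abs(Z), np.angle(Z)
    # NMF factorizes the magnitude spectrogram into bases W and activations H
    model = NMF(n_components=n_components, init='random',
                max_iter=300, random_state=0)
    W = model.fit_transform(mag)          # (freq_bins, n_components)
    H = model.components_                 # (n_components, time_frames)
    recon = W @ H + 1e-12                 # full reconstruction (avoid div by 0)
    estimates = []
    for k in range(n_components):
        # Wiener-style soft mask for component k, applied to the mixture
        mask = np.outer(W[:, k], H[k]) / recon
        _, x_k = istft(mask * mag * np.exp(1j * phase), fs=fs, nperseg=n_fft)
        estimates.append(x_k)
    return estimates
```

Each returned time-domain estimate can then be passed on to a downstream classifier; in a supervised variant, the bases W would instead be learned from training material.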
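The abstract likewise leaves the CNN topology unstated (filter sizes, depth, training details), so the forward pass below is only a toy numpy illustration of a convolutional binary classifier over a spectrogram patch: one convolution layer with ReLU, global max pooling, and a sigmoid output giving a natural-vs-synthetic score. All shapes and the randomly initialised weights are assumptions for illustration.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """'Valid' 2-D correlation of a single-channel input with a filter bank."""
    kh, kw = kernels.shape[1:]
    H, W = x.shape
    out = np.empty((kernels.shape[0], H - kh + 1, W - kw + 1))
    for k, ker in enumerate(kernels):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * ker)
    return out

def cnn_score(spec, kernels, w_out, b_out):
    feat = np.maximum(conv2d_valid(spec, kernels), 0.0)  # convolution + ReLU
    pooled = feat.max(axis=(1, 2))                       # global max pooling
    logit = float(pooled @ w_out + b_out)                # dense output layer
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid: P(synthetic)

# Randomly initialised weights, purely for illustration (untrained network)
rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 3, 3)) * 0.1           # 4 filters of size 3x3
w_out = rng.standard_normal(4) * 0.1
b_out = 0.0
spec = rng.random((20, 30))                              # toy spectrogram patch
score = cnn_score(spec, kernels, w_out, b_out)           # probability in (0, 1)
```

In practice such a network would be trained with a binary cross-entropy loss on artifact estimates labelled as natural or synthetic, and a decision threshold on the score would separate the two classes.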



Acknowledgments

The first author would like to thank the Women Scientist Scheme-A, Department of Science and Technology, Government of India, for providing financial assistance vide reference number SR/WOS-A/ET-69/2016.

Author information

Correspondence to R. Hemavathi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hemavathi, R., Kumaraswamy, R. Voice conversion spoofing detection by exploring artifacts estimates. Multimed Tools Appl 80, 23561–23580 (2021). https://doi.org/10.1007/s11042-020-10212-0

