Abstract
Automatic speaker verification, or voice biometrics, verifies a person's claimed identity through his or her voice, with applications in mobile banking and forensics. With the increasing deployment of speaker verification systems, studying spoofing threats and building proper countermeasures is gaining attention. Spoofing is a genuine challenge because it raises the false alarm rate, i.e. an impostor is incorrectly accepted as a genuine speaker, so spoofing attacks must be detected to make voice biometrics viable for practical applications. Voice conversion spoofing is a technique in which an impostor speaker's speech is converted into the desired target speaker's speech using signal processing approaches. Since studies show that voice conversion introduces artifacts into the resulting speech, this paper proposes a novel approach to detect voice conversion spoofing by estimating these artifacts from the given speech signal. Artifact estimates are obtained from the speech signal using a non-negative matrix factorization (NMF) based source separation technique. A convolutional neural network (CNN) based binary classifier is then built to classify the artifact estimates of the input speech as natural or synthetic. Experiments are conducted on the Voice Conversion Challenge 2016 and Voice Conversion Challenge 2018 databases. Results show that the proposed technique performs well, detecting a wide range of unknown attacks. The proposed systems are compared with state-of-the-art spoof detection systems based on Constant Q Cepstral Coefficients (CQCC) and Linear Frequency Cepstral Coefficients (LFCC), and the results show that the proposed system gives equivalent or better performance. Robustness to various noises is validated using the NOIZEUS database, and the results demonstrate the efficiency of the proposed system in noisy environments.
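The NMF-based artifact estimation described above can be illustrated with a minimal sketch: a non-negative magnitude spectrogram is factorized into a low-rank product of basis spectra and activations, and the residual that the factorization cannot explain is treated as a crude artifact estimate. The rank, the `artifact_estimate` helper, and the synthetic input below are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of NMF-based artifact estimation: the residual between a
# magnitude spectrogram and its low-rank NMF reconstruction serves as a
# rough artifact estimate. Rank and data are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

def artifact_estimate(spectrogram, rank=8, seed=0):
    """Return the residual left after a rank-constrained NMF fit.

    spectrogram : non-negative (freq_bins, frames) magnitude array.
    The low-rank product W @ H captures the dominant speech structure;
    the residual is what the factorization could not explain.
    """
    model = NMF(n_components=rank, init="random", random_state=seed,
                max_iter=500)
    W = model.fit_transform(spectrogram)  # (freq_bins, rank) basis spectra
    H = model.components_                 # (rank, frames) activations
    residual = spectrogram - W @ H        # unexplained component
    return residual

# Toy usage on a synthetic non-negative "spectrogram"
rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(64, 100)))
R = artifact_estimate(S)
print(R.shape)  # same shape as the input spectrogram
```

In a full pipeline, frame-level features of this residual (rather than the raw spectrogram) would feed the binary classifier that separates natural from converted speech.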
Acknowledgments
The first author would like to thank the Women Scientist Scheme-A, Department of Science and Technology, Government of India, for financial assistance under reference number SR/WOS-A/ET-69/2016.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hemavathi, R., Kumaraswamy, R. Voice conversion spoofing detection by exploring artifacts estimates. Multimed Tools Appl 80, 23561–23580 (2021). https://doi.org/10.1007/s11042-020-10212-0