Abstract
Automatic speaker verification, or voice biometrics, verifies a person's claimed identity through his or her voice, with applications in mobile banking and forensics. With the increasing deployment of speaker verification systems, studying spoofing threats and building proper countermeasures is gaining attention. Spoofing is a genuine challenge because it raises the false alarm rate, i.e. an impostor is incorrectly accepted as a genuine speaker, so spoofing attacks must be detected to make voice biometrics viable for practical applications. Voice conversion spoofing is a technique in which an impostor speaker's speech is converted into the desired target speaker's speech using signal processing approaches. Since studies show that voice conversion introduces artifacts into the resulting speech, this paper proposes a novel approach to detect voice conversion spoofing by estimating these artifacts from the given speech signal. Artifact estimates are obtained from the speech signal using a non-negative matrix factorization (NMF) based source separation technique. A convolutional neural network (CNN) based binary classifier is then built to classify the artifact estimates of the input speech as natural or synthetic. Experiments are conducted on the Voice Conversion Challenge 2016 and Voice Conversion Challenge 2018 databases. Results show that the proposed technique performs well, detecting a wide range of unknown attacks. The proposed systems are compared with state-of-the-art spoof detection systems based on Constant Q Cepstral Coefficients (CQCC) and Linear Frequency Cepstral Coefficients (LFCC), and the results show that the proposed system gives equivalent or better performance. Robustness to various noises is validated using the NOIZEUS database, and the results demonstrate the efficiency of the proposed system in noisy environments.
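The NMF-based artifact estimation described above can be illustrated with a minimal sketch: a non-negative magnitude spectrogram is factorized into a low-rank product of basis spectra and activations, and the residual that the factorization cannot explain is treated as a crude artifact estimate. The rank, the `artifact_estimate` helper, and the synthetic input below are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of NMF-based artifact estimation: the residual between a
# magnitude spectrogram and its low-rank NMF reconstruction serves as a
# rough artifact estimate. Rank and data are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

def artifact_estimate(spectrogram, rank=8, seed=0):
    """Return the residual left after a rank-constrained NMF fit.

    spectrogram : non-negative (freq_bins, frames) magnitude array.
    The low-rank product W @ H captures the dominant speech structure;
    the residual is what the factorization could not explain.
    """
    model = NMF(n_components=rank, init="random", random_state=seed,
                max_iter=500)
    W = model.fit_transform(spectrogram)  # (freq_bins, rank) basis spectra
    H = model.components_                 # (rank, frames) activations
    residual = spectrogram - W @ H        # unexplained component
    return residual

# Toy usage on a synthetic non-negative "spectrogram"
rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(64, 100)))
R = artifact_estimate(S)
print(R.shape)  # same shape as the input spectrogram
```

In a full pipeline, frame-level features of this residual (rather than the raw spectrogram) would feed the binary classifier that separates natural from converted speech.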
Acknowledgments
The first author would like to thank the Women Scientist Scheme-A, Department of Science and Technology, Government of India, for financial assistance under reference number SR/WOS-A/ET-69/2016.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hemavathi, R., Kumaraswamy, R. Voice conversion spoofing detection by exploring artifacts estimates. Multimed Tools Appl 80, 23561–23580 (2021). https://doi.org/10.1007/s11042-020-10212-0