Abstract
Classifying utterances as genuine or spoofed is the basis of most countermeasures that detect spoof attacks on automatic speaker verification systems. The choice of a strongly discriminating feature and a complementary classifier adds to the robustness of the countermeasure. Cepstral coefficients derived from linear sub-band energy analysis have proved their worth in countering unknown attacks, as reported in the literature. This work assesses the behaviour of a spoof detection countermeasure that uses linear frequency cepstral coefficients with both generative and discriminative classifiers; these systems serve as baselines for further analysis. In parallel, the paper proposes a modification to the traditional weighting function used to extract energy sub-bands on the linear scale, in order to leverage its full potential in spoof detection. The weighting function used is Gaussian, and hence the modified feature is referred to as GaussFCC. The analysis is carried out on non-pre-emphasised utterances. The classifiers used are a Gaussian mixture model (GMM, generative) and a bidirectional long short-term memory network (BiLSTM, discriminative). The empirical results show that the generative classifier performs strongly in detecting spoof attacks under the logical access (LA) condition, while the discriminative classifier shows a marked improvement over the generative model under the physical access (PA) condition. The tandem detection cost function (t-DCF) in the LA scenario using the GMM classifier is 0.000 for development data and 0.113 for evaluation data; in the PA scenario using the BiLSTM classifier, it is 0.030 for development data and 0.044 for evaluation data. A detailed comparative analysis of the countermeasure's performance is carried out across different types of attacks, features, classifiers, and utterances from female and male speakers.
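To make the idea of a Gaussian weighting function on linear-scale sub-bands concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes Gaussian weights centred on linearly spaced FFT bins, followed by log sub-band energies and a DCT. The function names (`gaussian_filterbank`, `gauss_fcc`), the frame length, hop size, number of filters, and the `sigma_scale` width parameter are all hypothetical choices, since the abstract does not state them.

```python
import numpy as np


def gaussian_filterbank(n_filters, n_fft, sigma_scale=0.5):
    """Gaussian weights centred on linearly spaced FFT bins.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    sigma_scale is an assumed width parameter, not taken from the paper.
    """
    n_bins = n_fft // 2 + 1
    bins = np.arange(n_bins)
    # Centres spaced uniformly on the linear frequency axis,
    # excluding the two end points of the band.
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    centres = edges[1:-1]
    sigma = sigma_scale * (edges[1] - edges[0])
    return np.exp(-0.5 * ((bins[None, :] - centres[:, None]) / sigma) ** 2)


def gauss_fcc(signal, n_fft=512, hop=160, n_filters=20, n_ceps=20):
    """Cepstral coefficients from Gaussian-weighted linear sub-band energies."""
    # Frame the signal without pre-emphasis (as the abstract describes)
    # and apply a Hamming window to each frame.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)

    # Power spectrum of each frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # Log energy in each Gaussian-weighted sub-band.
    fbank = gaussian_filterbank(n_filters, n_fft)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # Unnormalised DCT-II across the sub-band axis gives the cepstrum.
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    return log_energy @ dct_basis.T
```

Calling `gauss_fcc(x)` on a one-dimensional NumPy array `x` of audio samples returns one row of `n_ceps` coefficients per frame. Replacing the Gaussian weights with triangular ones in `gaussian_filterbank` would give a conventional linear frequency cepstral coefficient front end for comparison.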
Acknowledgements
We extend our thanks to SSN College of Engineering for providing us with the required infrastructure to carry out our research work.
Cite this article
Rupesh Kumar, S., Bharathi, B. Generative and Discriminative Modelling of Linear Energy Sub-bands for Spoof Detection in Speaker Verification Systems. Circuits Syst Signal Process 41, 3811–3831 (2022). https://doi.org/10.1007/s00034-022-01957-0
DOI: https://doi.org/10.1007/s00034-022-01957-0