Speech frame selection for spoofing detection with an application to partially spoofed audio-data

International Journal of Speech Technology

Abstract

In this paper, we introduce a frame selection strategy for improved detection of spoofed speech. A countermeasure (CM) system typically uses a Gaussian mixture model (GMM) based classifier to compute log-likelihood scores, and the average log-likelihood ratio over all speech frames of a test utterance serves as the score for decision making. In contrast to this standard approach, we propose scoring with only selected speech frames of the test utterance. We present two simple and computationally efficient frame selection strategies based on the log-likelihood ratios of the individual frames. Performance is evaluated with constant-Q cepstral coefficients as the front-end features and a two-class GMM as the back-end classifier. We conduct experiments on the speech corpora from the ASVspoof 2015, 2017, and 2019 challenges. The results show that the proposed scoring techniques substantially outperform conventional scoring on both the development and evaluation sets of the ASVspoof 2015 corpus. We did not observe a noticeable performance gain on the ASVspoof 2017 and ASVspoof 2019 corpora. We further conducted experiments with partially spoofed data, where the spoofed data is created by augmenting natural and spoofed speech. In this scenario, the proposed methods demonstrate considerable performance improvement over the baseline.
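The difference between conventional scoring and frame-selective scoring can be illustrated with a minimal sketch. The abstract does not spell out the two selection rules, so the rule below (keep the fraction of frames with the lowest log-likelihood ratios) is purely an assumption for illustration, not the authors' method. The function names, the `keep_fraction` parameter, and the toy LLR values are likewise hypothetical. Per-frame LLRs are taken as given, with positive values favouring natural speech.

```python
from statistics import fmean


def average_llr(frame_llrs):
    """Conventional score: mean of the per-frame log-likelihood ratios."""
    return fmean(frame_llrs)


def selected_frame_score(frame_llrs, keep_fraction=0.5):
    """Score using only a subset of frames.

    Assumed rule (illustrative only): keep the `keep_fraction` of frames
    with the lowest LLRs, i.e. the frames that look most spoof-like,
    and average over those.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(keep_fraction * len(frame_llrs)))
    lowest = sorted(frame_llrs)[:n_keep]
    return fmean(lowest)


# Toy partially spoofed utterance: 80 natural-looking frames (LLR = +1.0)
# followed by 20 spoof-looking frames (LLR = -3.0). Values are made up.
toy_llrs = [1.0] * 80 + [-3.0] * 20

print("average over all frames:", average_llr(toy_llrs))
print("average over selected frames:", selected_frame_score(toy_llrs))
```

In this toy case the all-frame average stays positive because the natural frames dominate, while the selected-frame score turns negative, which is the kind of effect that could matter for partially spoofed audio.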

Notes

  1. The natural speech files in this version of the dataset contain some zero-sequence artifacts at the beginning, which might help the detection process when an attention model is used.

  2. Beyond voice biometrics, this situation may also arise when someone creates fake speech by combining segments from multiple sources.

  3. http://dx.doi.org/10.7488/ds/1994

  4. https://sites.google.com/site/bosaristoolkit/

Acknowledgements

The work of Mr. Dipjyoti Paul is funded by the EU's H2020 research and innovation programme under the MSCA GA 67532 (the ENRICH network: www.enrich-etn.eu). Dr. Md Sahidullah’s work is supported by Region Grand Est. We would like to thank Dr. Shefali Waldekar for her assistance in language editing and proofreading.

Author information

Corresponding author

Correspondence to A Kishore Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kumar, A.K., Paul, D., Pal, M. et al. Speech frame selection for spoofing detection with an application to partially spoofed audio-data. Int J Speech Technol 24, 193–203 (2021). https://doi.org/10.1007/s10772-020-09785-w

  • DOI: https://doi.org/10.1007/s10772-020-09785-w
