
A Deep Learning Framework for Audio Deepfake Detection

  • Research Article – Electrical Engineering
  • Published in: Arabian Journal for Science and Engineering

Abstract

Audio deepfakes have emerged as a potent source of deception, driven by avant-garde methods of synthetic speech generation. Distinguishing fake audio from real audio is becoming ever more difficult as text-to-speech models grow more accurate, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, most experiments have been based on the ASVspoof or AVspoof datasets using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated by some of the best text-to-speech models. Two approaches were adopted to solve the problem: a feature-based approach and an image-based approach. In the feature-based approach, the audio data are converted into a dataset of spectral features, which is fed to machine learning algorithms that classify each sample as fake or real. In the image-based approach, audio samples are converted into mel-spectrograms, which are input to deep learning models, namely a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN). The TCN was chosen because it is a sequential model that has been shown to perform well on sequential data. A comparison of the two approaches shows that the deep learning models, particularly the TCN, outperform the machine learning algorithms by a significant margin, reaching a test accuracy of 92 percent. The resulting audio deepfake classifier achieves accuracy comparable to that of traditional CNN models such as VGG16 and XceptionNet.
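To make the feature-based approach concrete, the following is a minimal sketch rather than the authors' exact pipeline: it assumes librosa for spectral feature extraction and scikit-learn for classification, and the file paths, feature set, and random-forest configuration are illustrative placeholders.

```python
# Minimal sketch of the feature-based approach (illustrative, not the
# paper's exact feature set or classifier): summarize each clip with
# spectral statistics, then train a conventional classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def spectral_features(path, sr=16000):
    """Return a fixed-length spectral feature vector for one audio clip."""
    y, sr = librosa.load(path, sr=sr)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Collapse the frame axis with a mean so every clip yields 16 numbers.
    return np.hstack([centroid.mean(), rolloff.mean(), zcr.mean(),
                      mfcc.mean(axis=1)])

# Hypothetical (path, label) pairs; label 1 = fake, 0 = real.
dataset = [("for_dataset/fake/clip_0001.wav", 1),
           ("for_dataset/real/clip_0001.wav", 0)]  # ... one entry per clip
X = np.stack([spectral_features(p) for p, _ in dataset])
y = np.array([label for _, label in dataset])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Any of the conventional classifiers one might compare in this setting (e.g., an SVM or k-NN) could be substituted for the random forest in the last two lines without changing the feature extraction.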
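For the image-based approach, each clip is first rendered as a mel-spectrogram. Below is a minimal sketch of that conversion, again assuming librosa; the sampling rate, mel-band count, and min-max normalization are illustrative choices, not parameters taken from the paper.

```python
import numpy as np
import librosa

def melspectrogram_image(path, sr=16000, n_mels=128):
    """Convert one audio clip into a normalized mel-spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)  # log-magnitude scale
    # Scale to [0, 1] so the (n_mels x frames) array can be treated as a
    # single-channel image for a deep network.
    return (S_db - S_db.min()) / (S_db.max() - S_db.min())
```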
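Such mel-spectrograms can then be fed to a TCN, whose dilated causal convolutions give an exponentially growing receptive field along the time axis. The PyTorch sketch below shows a generic TCN-style binary classifier in this spirit; the residual block structure, channel widths, and mean-over-time pooling are assumptions for illustration, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One TCN residual block: two left-padded (causal) dilated 1-D convs."""
    def __init__(self, c_in, c_out, k=3, dilation=1):
        super().__init__()
        pad = (k - 1) * dilation  # pad on the left only: no future leakage
        self.net = nn.Sequential(
            nn.ConstantPad1d((pad, 0), 0.0),
            nn.Conv1d(c_in, c_out, k, dilation=dilation), nn.ReLU(),
            nn.ConstantPad1d((pad, 0), 0.0),
            nn.Conv1d(c_out, c_out, k, dilation=dilation), nn.ReLU(),
        )
        # 1x1 conv matches channel counts for the residual connection.
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return self.net(x) + self.skip(x)

class TCNClassifier(nn.Module):
    """Dilations double per block, so the receptive field grows exponentially."""
    def __init__(self, n_mels=128, channels=(64, 64, 64)):
        super().__init__()
        blocks, c_prev = [], n_mels  # mel bands act as input channels
        for i, c in enumerate(channels):
            blocks.append(TemporalBlock(c_prev, c, dilation=2 ** i))
            c_prev = c
        self.tcn = nn.Sequential(*blocks)
        self.head = nn.Linear(c_prev, 2)  # logits for real vs. fake

    def forward(self, x):                # x: (batch, n_mels, frames)
        h = self.tcn(x)                  # (batch, channels, frames)
        return self.head(h.mean(dim=2))  # average over time, then classify

# Example: classify one 128-band spectrogram with 300 frames.
model = TCNClassifier()
logits = model(torch.randn(1, 128, 300))  # shape (1, 2)
```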




Acknowledgements

The authors acknowledge the Centre of Excellence in Complex and Nonlinear Dynamical Systems (CoE-CNDS) laboratory for providing support and a platform for this research.

Author information

Corresponding author

Correspondence to Shraddha Suratkar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article


Cite this article

Khochare, J., Joshi, C., Yenarkar, B. et al. A Deep Learning Framework for Audio Deepfake Detection. Arab J Sci Eng 47, 3447–3458 (2022). https://doi.org/10.1007/s13369-021-06297-w

