Abstract
Audio deepfakes have increasingly emerged as a potential source of deceit alongside the development of state-of-the-art methods of synthetic speech generation. Distinguishing fake audio from real audio is becoming ever more difficult owing to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, most experiments have been based on the ASVspoof or the AVspoof dataset using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated by some of the best text-to-speech models. Two approaches have been adopted to solve the problem: a feature-based approach and an image-based approach. The feature-based approach converts the audio data into a dataset of spectral features of the audio samples, which are fed to machine learning algorithms to classify each sample as fake or real. In the image-based approach, audio samples are converted into melspectrograms, which are input to deep learning models, namely a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN). The TCN was chosen because it is a sequential model and has been shown to perform well on sequential data. A comparison of the two approaches shows that the deep learning models, particularly the TCN, outperform the machine learning algorithms by a significant margin, reaching 92 percent test accuracy. This solution presents a model for audio deepfake classification whose accuracy is comparable to that of traditional CNN models such as VGG16 and XceptionNet.
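The feature-based approach described above converts each audio sample into a vector of spectral features before classification. As a minimal sketch of what such feature extraction involves, the snippet below computes two common spectral descriptors (spectral centroid and roll-off) from framed FFT magnitudes using only numpy; the frame length, hop size, and 85% roll-off threshold are illustrative assumptions, not the exact feature set or parameters used in the paper.

```python
import numpy as np

def spectral_features(y, sr, n_fft=1024, hop=512):
    """Mean spectral centroid and 85% roll-off over Hann-windowed frames."""
    win = np.hanning(n_fft)
    # Slice the mono signal into overlapping windowed frames
    frames = np.stack([y[i:i + n_fft] * win
                       for i in range(0, len(y) - n_fft, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_bins)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    power = mags ** 2
    # Spectral centroid: magnitude-weighted mean frequency per frame
    centroid = (mags * freqs).sum(axis=1) / (mags.sum(axis=1) + 1e-12)
    # Roll-off: lowest frequency below which 85% of spectral energy lies
    cum = np.cumsum(power, axis=1)
    rolloff = freqs[(cum >= 0.85 * cum[:, -1:]).argmax(axis=1)]
    return {"centroid": float(centroid.mean()),
            "rolloff": float(rolloff.mean())}

# Sanity check on a pure 440 Hz tone: both features should sit near 440 Hz
sr = 16000
t = np.arange(sr) / sr
feats = spectral_features(np.sin(2 * np.pi * 440 * t), sr)
```

In the full pipeline such per-sample feature vectors (typically extended with MFCCs and other descriptors) would be stacked into a table and passed to classifiers such as random forests or gradient-boosted trees.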
Acknowledgements
The authors acknowledge the Centre of Excellence in Complex and Nonlinear Dynamical Systems (CoE-CNDS) laboratory for providing support and a platform for this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Khochare, J., Joshi, C., Yenarkar, B. et al. A Deep Learning Framework for Audio Deepfake Detection. Arab J Sci Eng 47, 3447–3458 (2022). https://doi.org/10.1007/s13369-021-06297-w