Abstract
Audio deepfakes have increasingly emerged as a potential source of deceit alongside the development of state-of-the-art methods of synthetic speech generation. Distinguishing fake audio from real audio is becoming ever more difficult owing to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, most experiments have been based on the ASVspoof or the AVspoof dataset using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated by some of the best text-to-speech models. Two approaches have been adopted to solve the problem: a feature-based approach and an image-based approach. The feature-based approach converts the audio data into a dataset of spectral features of the audio samples, which are fed to machine learning algorithms to classify each sample as fake or real. In the image-based approach, audio samples are converted into melspectrograms, which are input to deep learning models, namely a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN). The TCN was chosen because it is a sequential model and has been shown to perform well on sequential data. A comparison of the two approaches shows that the deep learning models, particularly the TCN, outperform the machine learning algorithms by a significant margin, reaching 92 percent test accuracy. This solution presents a model for audio deepfake classification whose accuracy is comparable to that of traditional CNN models such as VGG16 and XceptionNet.
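The feature-based approach described above converts each audio sample into a vector of spectral features before classification. As a minimal sketch of what such feature extraction involves, the snippet below computes two common spectral descriptors (spectral centroid and roll-off) from framed FFT magnitudes using only numpy; the frame length, hop size, and 85% roll-off threshold are illustrative assumptions, not the exact feature set or parameters used in the paper.

```python
import numpy as np

def spectral_features(y, sr, n_fft=1024, hop=512):
    """Mean spectral centroid and 85% roll-off over Hann-windowed frames."""
    win = np.hanning(n_fft)
    # Slice the mono signal into overlapping windowed frames
    frames = np.stack([y[i:i + n_fft] * win
                       for i in range(0, len(y) - n_fft, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_bins)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    power = mags ** 2
    # Spectral centroid: magnitude-weighted mean frequency per frame
    centroid = (mags * freqs).sum(axis=1) / (mags.sum(axis=1) + 1e-12)
    # Roll-off: lowest frequency below which 85% of spectral energy lies
    cum = np.cumsum(power, axis=1)
    rolloff = freqs[(cum >= 0.85 * cum[:, -1:]).argmax(axis=1)]
    return {"centroid": float(centroid.mean()),
            "rolloff": float(rolloff.mean())}

# Sanity check on a pure 440 Hz tone: both features should sit near 440 Hz
sr = 16000
t = np.arange(sr) / sr
feats = spectral_features(np.sin(2 * np.pi * 440 * t), sr)
```

In the full pipeline such per-sample feature vectors (typically extended with MFCCs and other descriptors) would be stacked into a table and passed to classifiers such as random forests or gradient-boosted trees.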
Acknowledgements
The authors acknowledge the Centre of Excellence in Complex and Nonlinear Dynamical Systems (CoE-CNDS) laboratory for providing support and a platform for this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Khochare, J., Joshi, C., Yenarkar, B. et al. A Deep Learning Framework for Audio Deepfake Detection. Arab J Sci Eng 47, 3447–3458 (2022). https://doi.org/10.1007/s13369-021-06297-w