
A defensive attention mechanism to detect deepfake content across multiple modalities

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Recently, the realistic nature of multi-modal deepfake content has attracted considerable attention from researchers, who have employed a wide range of handcrafted features, learned features, and deep learning techniques to achieve promising performance in recognizing facial deepfakes. However, attackers continue to produce deepfakes that surpass earlier ones by manipulating several modalities at once, making deepfake identification across multiple modalities difficult. To exploit the merits of attention-based network architectures, we propose a novel cross-modal attention architecture built on a bi-directional recurrent convolutional network to capture fake content in audio and video. For effective deepfake detection, the system captures the spatio-temporal deformations of audio-video sequences and investigates the correlation between these modalities. We propose a self-attentive VGG16 deep model for extracting visual features for facial fake recognition. In addition, the system incorporates a recurrent neural network with self-attention to effectively extract forged audio cues. The cross-modal attention mechanism learns the divergence between the two modalities. Furthermore, we include multi-modal fake examples to create a well-balanced bespoke dataset, addressing the drawbacks of small and unbalanced training sets. We evaluate the effectiveness of the proposed multi-modal deepfake detection strategy against state-of-the-art methods on a variety of existing datasets.
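
To make the architecture described in the abstract concrete, the sketch below shows, in PyTorch, how a cross-modal attention block over per-frame visual features and per-segment audio features could feed a bi-directional recurrent layer that produces a real/fake decision. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the feature dimensions, the multi-head attention modules, the pooling used to align sequence lengths, and the GRU/classifier heads are all assumptions made for the example; in the paper the visual features come from a self-attentive VGG16 and the audio features from an attention-based recurrent network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionDetector(nn.Module):
    """Toy audio-visual deepfake detector with cross-modal attention (illustrative only)."""
    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # project per-frame visual features
        self.aud_proj = nn.Linear(aud_dim, d_model)  # project per-segment audio features
        # cross-modal attention: each modality queries the other
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # bi-directional recurrence over the fused audio-visual sequence
        self.bigru = nn.GRU(2 * d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, 1)  # real-vs-fake logit

    def forward(self, vis_seq, aud_seq):
        # vis_seq: (B, Tv, vis_dim) frame features; aud_seq: (B, Ta, aud_dim) audio features
        v = self.vis_proj(vis_seq)
        a = self.aud_proj(aud_seq)
        v_ctx, _ = self.v2a(query=v, key=a, value=a)   # audio-informed visual stream
        a_ctx, _ = self.a2v(query=a, key=v, value=v)   # visual-informed audio stream
        # crude length alignment: average-pool the audio stream to the video frame count
        a_ctx = F.adaptive_avg_pool1d(a_ctx.transpose(1, 2), v.size(1)).transpose(1, 2)
        fused = torch.cat([v_ctx, a_ctx], dim=-1)      # (B, Tv, 2*d_model)
        _, h = self.bigru(fused)                       # h: (2, B, d_model), one per direction
        h = torch.cat([h[0], h[1]], dim=-1)            # concatenate both directions
        return self.classifier(h).squeeze(-1)          # (B,) fake-content logits

# toy usage: batch of 2 clips, 16 frames of 512-d visual features, 40 audio steps of 128-d features
model = CrossModalAttentionDetector()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 40, 128))

The pooling step is only one way to reconcile the different temporal resolutions of audio and video; interpolation or frame-synchronous audio segmentation would serve the same illustrative purpose.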


Data availability

The data that support the findings of this study are openly available: DFDC at https://paperswithcode.com/dataset/dfdc, MMDFD at https://dl.acm.org/doi/10.1145/3607947.3608013, and FakeAVCeleb at https://paperswithcode.com/dataset/fakeavceleb.

Author information

Authors and Affiliations

Authors

Contributions

AS: conceptualization of the research, including formulation of the research questions and objectives; design of the research methodology and data collection; development and implementation of the software used in the experiments; analysis of the data with the proposed model; drafting of the initial manuscript and review and editing for clarity and coherence. PV: conceptualization of the research, including formulation of the research questions and objectives; collection and organization of the research data; data analysis and interpretation of results; design of the experimental setup; review and editing of the manuscript for clarity and coherence. VGM: data analysis and interpretation of results; design of the experimental setup; manuscript review and editing; overall supervision and guidance throughout the research project.

Corresponding author

Correspondence to S. Asha.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by I. Ide.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Asha, S., Vinod, P. & Menon, V.G. A defensive attention mechanism to detect deepfake content across multiple modalities. Multimedia Systems 30, 56 (2024). https://doi.org/10.1007/s00530-023-01248-x


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-023-01248-x
