
A defensive attention mechanism to detect deepfake content across multiple modalities

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Recently, the realistic nature of multi-modal deepfake content has attracted considerable attention from researchers, who have employed a wide range of handcrafted features, learned features, and deep learning techniques to achieve promising performance in recognizing facial deepfakes. However, attackers continue to produce deepfakes that surpass earlier ones by manipulating several modalities at once, making deepfake identification across multiple modalities difficult. To exploit the merits of attention-based network architectures, we propose a novel cross-modal attention architecture built on a bi-directional recurrent convolutional network to capture fake content in audio and video. For effective deepfake detection, the system captures the spatio-temporal deformations of audio-video sequences and investigates the correlation between these modalities. We propose a self-attentive VGG16 deep model for extracting visual features for facial fake recognition. In addition, the system incorporates a recurrent neural network with self-attention to effectively extract forged audio cues. The cross-modal attention mechanism learns the divergence between the two modalities. Furthermore, we include multi-modal fake examples to create a well-balanced bespoke dataset, addressing the drawbacks of small and unbalanced training sets. We evaluate the effectiveness of the proposed multi-modal deepfake detection strategy against state-of-the-art methods on a variety of existing datasets.
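
To make the architecture described in the abstract concrete, the sketch below shows, in PyTorch, how a cross-modal attention block over per-frame visual features and per-segment audio features could feed a bi-directional recurrent layer that produces a real/fake decision. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the feature dimensions, the multi-head attention modules, the pooling used to align sequence lengths, and the GRU/classifier heads are all assumptions made for the example; in the paper the visual features come from a self-attentive VGG16 and the audio features from an attention-based recurrent network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionDetector(nn.Module):
    """Toy audio-visual deepfake detector with cross-modal attention (illustrative only)."""
    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # project per-frame visual features
        self.aud_proj = nn.Linear(aud_dim, d_model)  # project per-segment audio features
        # cross-modal attention: each modality queries the other
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # bi-directional recurrence over the fused audio-visual sequence
        self.bigru = nn.GRU(2 * d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, 1)  # real-vs-fake logit

    def forward(self, vis_seq, aud_seq):
        # vis_seq: (B, Tv, vis_dim) frame features; aud_seq: (B, Ta, aud_dim) audio features
        v = self.vis_proj(vis_seq)
        a = self.aud_proj(aud_seq)
        v_ctx, _ = self.v2a(query=v, key=a, value=a)   # audio-informed visual stream
        a_ctx, _ = self.a2v(query=a, key=v, value=v)   # visual-informed audio stream
        # crude length alignment: average-pool the audio stream to the video frame count
        a_ctx = F.adaptive_avg_pool1d(a_ctx.transpose(1, 2), v.size(1)).transpose(1, 2)
        fused = torch.cat([v_ctx, a_ctx], dim=-1)      # (B, Tv, 2*d_model)
        _, h = self.bigru(fused)                       # h: (2, B, d_model), one per direction
        h = torch.cat([h[0], h[1]], dim=-1)            # concatenate both directions
        return self.classifier(h).squeeze(-1)          # (B,) fake-content logits

# toy usage: batch of 2 clips, 16 frames of 512-d visual features, 40 audio steps of 128-d features
model = CrossModalAttentionDetector()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 40, 128))

The pooling step is only one way to reconcile the different temporal resolutions of audio and video; interpolation or frame-synchronous audio segmentation would serve the same illustrative purpose.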


Data availability

The data that support the findings of this study are openly available: DFDC at https://paperswithcode.com/dataset/dfdc, MMDFD at https://dl.acm.org/doi/10.1145/3607947.3608013, and FakeAVCeleb at https://paperswithcode.com/dataset/fakeavceleb.

Author information

Authors and Affiliations

Authors

Contributions

AS: conceptualization of the research, including formulation of the research questions and objectives; design of the research methodology and data collection; development and implementation of the software used in the experiments; analysis of the data with the proposed model; drafting of the initial manuscript and review and editing for clarity and coherence. PV: conceptualization of the research, including formulation of the research questions and objectives; collection and organization of the research data; data analysis and interpretation of results; design of the experimental setup; review and editing of the manuscript for clarity and coherence. VGM: data analysis and interpretation of results; design of the experimental setup; manuscript review and editing; overall supervision and guidance throughout the research project.

Corresponding author

Correspondence to S. Asha.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by I. Ide.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Asha, S., Vinod, P. & Menon, V.G. A defensive attention mechanism to detect deepfake content across multiple modalities. Multimedia Systems 30, 56 (2024). https://doi.org/10.1007/s00530-023-01248-x


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-023-01248-x
