Multi-modal Semantic Inconsistency Detection in Social Media News Posts

  • Conference paper in MultiMedia Modeling (MMM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13142)

Abstract

As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. While similar systems exist for text and images, we aim to detect inconsistencies in a more ambiguous setting: videos can be long, may contain several distinct scenes, and introduce audio as an extra modality. We develop a multi-modal fusion framework that identifies mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.
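The abstract describes an ensemble that fuses six per-modality signals into a single consistency decision. The paper's actual fusion architecture is not reproduced on this page, so the sketch below is only a minimal illustration of one way such late fusion could look in PyTorch: each branch is assumed to emit a fixed-size embedding, and a small MLP over the concatenation predicts match versus mismatch. All branch names and embedding dimensions are hypothetical, not the authors' configuration.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # Hypothetical late-fusion head: concatenate one fixed-size embedding
    # per modality branch and classify the post as consistent or mismatched.
    def __init__(self, dims, hidden=256):
        super().__init__()
        self.keys = sorted(dims)  # fixed ordering for concatenation
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims.values()), hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 2),  # logits for {consistent, mismatched}
        )

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) embedding
        fused = torch.cat([feats[k] for k in self.keys], dim=-1)
        return self.mlp(fused)

# Illustrative embedding sizes for the six signals named in the abstract;
# these values are assumptions for the sketch.
dims = {
    "caption_text": 768,       # e.g. a BERT-style caption encoder
    "audio_transcript": 768,   # speech-to-text output fed to a text encoder
    "video_semantic": 512,     # clip-level video features
    "object_detection": 256,   # pooled detector outputs
    "named_entities": 32,      # entity-consistency features
    "face_verification": 128,  # face-match embeddings/scores
}
model = LateFusionClassifier(dims)
batch = {k: torch.randn(4, d) for k, d in dims.items()}
logits = model(batch)          # shape: (4, 2)

In a full system, the per-branch encoders would be trained or fine-tuned jointly with this head, with randomly mismatched caption-video pairs providing the negative examples described in the abstract.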

Author information

Corresponding author

Correspondence to Scott McCrae.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

McCrae, S., Wang, K., Zakhor, A. (2022). Multi-modal Semantic Inconsistency Detection in Social Media News Posts. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_28

  • DOI: https://doi.org/10.1007/978-3-030-98355-0_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98354-3

  • Online ISBN: 978-3-030-98355-0

  • eBook Packages: Computer Science, Computer Science (R0)
