Multi-modal Semantic Inconsistency Detection in Social Media News Posts

  • Conference paper in MultiMedia Modeling (MMM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13142)

Abstract

As computer-generated content and deepfakes make steady improvements, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. While similar systems exist for text and images, we aim to detect inconsistencies in a more ambiguous setting: videos can be long, may contain several distinct scenes, and introduce audio as an extra modality. We develop a multi-modal fusion framework that identifies mismatches between videos and captions in social media posts by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.
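The abstract describes an ensemble that fuses six per-modality signals into a single consistency decision. The paper's actual fusion architecture is not reproduced on this page, so the sketch below is only a minimal illustration of one way such late fusion could look in PyTorch: each branch is assumed to emit a fixed-size embedding, and a small MLP over the concatenation predicts match versus mismatch. All branch names and embedding dimensions are hypothetical, not the authors' configuration.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # Hypothetical late-fusion head: concatenate one fixed-size embedding
    # per modality branch and classify the post as consistent or mismatched.
    def __init__(self, dims, hidden=256):
        super().__init__()
        self.keys = sorted(dims)  # fixed ordering for concatenation
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims.values()), hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 2),  # logits for {consistent, mismatched}
        )

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) embedding
        fused = torch.cat([feats[k] for k in self.keys], dim=-1)
        return self.mlp(fused)

# Illustrative embedding sizes for the six signals named in the abstract;
# these values are assumptions for the sketch.
dims = {
    "caption_text": 768,       # e.g. a BERT-style caption encoder
    "audio_transcript": 768,   # speech-to-text output fed to a text encoder
    "video_semantic": 512,     # clip-level video features
    "object_detection": 256,   # pooled detector outputs
    "named_entities": 32,      # entity-consistency features
    "face_verification": 128,  # face-match embeddings/scores
}
model = LateFusionClassifier(dims)
batch = {k: torch.randn(4, d) for k, d in dims.items()}
logits = model(batch)          # shape: (4, 2)

In a full system, the per-branch encoders would be trained or fine-tuned jointly with this head, with randomly mismatched caption-video pairs providing the negative examples described in the abstract.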

Author information

Corresponding author

Correspondence to Scott McCrae.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

McCrae, S., Wang, K., Zakhor, A. (2022). Multi-modal Semantic Inconsistency Detection in Social Media News Posts. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_28

  • DOI: https://doi.org/10.1007/978-3-030-98355-0_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98354-3

  • Online ISBN: 978-3-030-98355-0

  • eBook Packages: Computer Science, Computer Science (R0)
