Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection

  • Research Paper
Science China Information Sciences

Abstract

With the development of the Internet, users can freely publish posts on social media platforms, which makes it convenient to keep abreast of world events. However, many posts carry rumors, and monitoring them manually requires considerable manpower. Owing to the success of modern machine learning techniques, especially deep learning models, we treat rumor detection as an automatic classification problem. Early attempts focused on building classifiers that rely on a single modality of a post, i.e., either image or text information. Subsequent multimodal detection approaches employ an early or late fusion operator to aggregate information from multiple sources. Nevertheless, they use multimodal embeddings only for fusion and ignore another important detection factor, i.e., the inconsistency between modalities. To solve this problem, we develop a novel deep visual-linguistic fusion network (DVLFN) considering cross-modal inconsistency, which detects rumors by jointly considering modal aggregation and contrast information. Specifically, the DVLFN first uses deep visual and textual encoders, i.e., Faster R-CNN and bidirectional encoder representations from transformers (BERT), to extract global and regional embeddings for the image and text modalities. Then, it predicts a post’s authenticity from two aspects: (1) intermodal inconsistency, which employs the Wasserstein distance to efficiently measure the similarity between the regional embeddings of the two modalities, and (2) modal aggregation, which empirically employs early fusion to aggregate the two modal embeddings for prediction. Consequently, the DVLFN composes the final prediction from both the modal fusion and the inconsistency measure. Experiments are conducted on three real-world multimedia rumor detection datasets collected from Reddit, GoodNews, and Weibo. The results validate the superior performance of the proposed DVLFN.
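The two prediction aspects named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ground cost, the region weights, or the solver, so the cosine cost, the uniform weights, the entropy-regularized (Sinkhorn) approximation of the Wasserstein distance, and the function names `cosine_cost`, `sinkhorn_distance`, and `early_fusion` are all assumptions made for illustration. Random arrays stand in for the Faster R-CNN region embeddings and the BERT token embeddings.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and y (m, d)."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_n = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x_n @ y_n.T


def sinkhorn_distance(x, y, reg=0.1, n_iters=200):
    """Entropy-regularized approximation of the Wasserstein distance
    between two sets of regional embeddings.

    Assumptions (not taken from the paper): uniform weights over regions
    and a cosine ground cost.
    """
    cost = cosine_cost(x, y)
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on image regions
    b = np.full(m, 1.0 / m)          # uniform mass on text tokens
    kernel = np.exp(-cost / reg)     # Gibbs kernel of the cost matrix
    u = np.ones(n)
    for _ in range(n_iters):         # alternating marginal scaling
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))


def early_fusion(image_global, text_global):
    """Early fusion by concatenating the two global embeddings."""
    return np.concatenate([image_global, text_global])


# Stand-ins for encoder outputs: 5 image regions and 7 text tokens, 16-d each.
rng = np.random.default_rng(0)
image_regions = rng.normal(size=(5, 16))
text_tokens = rng.normal(size=(7, 16))

inconsistency = sinkhorn_distance(image_regions, text_tokens)
fused = early_fusion(image_regions.mean(axis=0), text_tokens.mean(axis=0))
```

In a full model, a classifier would take `fused` as input and its output would be combined with the `inconsistency` score. How the DVLFN combines the two is described in the full paper, not in the abstract, so it is left out here.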

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62006118, 61906092, 61773198, 91746301), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20200460, BK20190441), the Jiangsu Shuangchuang (Mass Innovation and Entrepreneurship) Talent Program, and the CAAI-Huawei MindSpore Open Fund (Grant No. CAAIXSJLJJ-2021-014B).

Author information

Corresponding author

Correspondence to Yang Yang.

Cite this article

Yang, Y., Bao, R., Guo, W. et al. Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection. Sci. China Inf. Sci. 66, 222102 (2023). https://doi.org/10.1007/s11432-021-3530-7

  • DOI: https://doi.org/10.1007/s11432-021-3530-7
