Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection

Yang, Yang; Bao, Ran; Guo, Weili; Zhan, De-Chuan; Yin, Yilong; Yang, Jian

doi:10.1007/s11432-021-3530-7

Research Paper
Published: 30 October 2023

Volume 66, article number 222102, (2023)

Science China Information Sciences

Yang Yang¹,
Ran Bao²,
Weili Guo¹,
De-Chuan Zhan²,
Yilong Yin³ &
…
Jian Yang¹


Abstract

With the development of the Internet, users can freely publish posts on various social media platforms, which makes it convenient to keep abreast of the world. However, such posts often carry rumors, and monitoring them manually requires substantial manpower. Building on the success of modern machine learning techniques, especially deep learning models, we treat rumor detection as an automatic classification problem. Early attempts focused on building classifiers that rely on a single modality of a post, i.e., either image or text information. Subsequently, several multimodal detection approaches employed an early or late fusion operator to aggregate information from multiple sources. Nevertheless, they use multimodal embeddings only for fusion and ignore another important detection factor, i.e., the inconsistency between modalities. To solve this problem, we develop a novel deep visual-linguistic fusion network (DVLFN) considering cross-modal inconsistency, which detects rumors by jointly considering modal aggregation and contrast information. Specifically, the DVLFN first utilizes visual and textual deep encoders, i.e., Faster R-CNN and bidirectional encoder representations from transformers (BERT), to extract global and regional embeddings for the image and text modalities. Then, it predicts a post's authenticity from two aspects: (1) intermodal inconsistency, which employs the Wasserstein distance to efficiently measure the similarity between regional embeddings of different modalities, and (2) modal aggregation, which empirically adopts early fusion to aggregate the two modal embeddings for prediction. Consequently, the DVLFN composes the final prediction from both the modal fusion and the inconsistency measure. Experiments are conducted on three real-world multimedia rumor detection datasets collected from Reddit, GoodNews, and Weibo. The results validate the superior performance of the proposed DVLFN.
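The two prediction aspects named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the cost function, the solver, or the way the two branches are combined, so the following are all assumptions made for illustration: a cosine cost, uniform weights over regions, an entropy-regularized (Sinkhorn) approximation of the Wasserstein distance, concatenation as the early-fusion operator, and a single logistic unit as the final predictor. All function and parameter names are hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    xn = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    yn = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), 1e-12)
    return 1.0 - xn @ yn.T


def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Entropy-regularized approximation of the Wasserstein distance between
    two sets of regional embeddings (assumed: uniform weights, cosine cost)."""
    n, m = x.shape[0], y.shape[0]
    a = np.full(n, 1.0 / n)          # uniform mass on image regions
    b = np.full(m, 1.0 / m)          # uniform mass on text regions
    cost = cosine_cost(x, y)
    kernel = np.exp(-cost / eps)     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):         # alternating marginal scaling
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))


def early_fusion(img_global, txt_global):
    """Early fusion, assumed here to be plain concatenation."""
    return np.concatenate([img_global, txt_global], axis=-1)


def rumor_probability(img_regions, txt_regions, img_global, txt_global,
                      w_fuse, w_inc, bias):
    """Hypothetical predictor: a logistic unit over the fused embedding plus
    a scalar weight on the cross-modal inconsistency score."""
    fused = early_fusion(img_global, txt_global)
    inconsistency = sinkhorn_distance(img_regions, txt_regions)
    logit = fused @ w_fuse + w_inc * inconsistency + bias
    return float(1.0 / (1.0 + np.exp(-logit)))
```

In this sketch, a larger transport cost means the image regions and text regions are harder to match, which is one way to read "intermodal inconsistency". In practice the regional embeddings would come from the visual and textual encoders, and the weights would be learned rather than passed in by hand.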

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62006118, 61906092, 61773198, 91746301), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20200460, BK20190441), the Jiangsu Shuangchuang (Mass Innovation and Entrepreneurship) Talent Program, and the CAAI-Huawei MindSpore Open Fund (Grant No. CAAIXSJLJJ-2021-014B).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
Yang Yang, Weili Guo & Jian Yang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Ran Bao & De-Chuan Zhan
School of Software, Shandong University, Shandong, 250101, China
Yilong Yin

Corresponding author

Correspondence to Yang Yang.

About this article

Cite this article

Yang, Y., Bao, R., Guo, W. et al. Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection. Sci. China Inf. Sci. 66, 222102 (2023). https://doi.org/10.1007/s11432-021-3530-7

Received: 02 June 2021
Revised: 01 January 2022
Accepted: 31 May 2022
Published: 30 October 2023
DOI: https://doi.org/10.1007/s11432-021-3530-7
