Abstract
With the development of the Internet, users can freely publish posts on various social media platforms, which makes it convenient to keep abreast of world events. However, posts often carry rumors, and monitoring them manually requires considerable manpower. Owing to the success of modern machine learning techniques, especially deep learning models, we treat rumor detection as an automatic classification problem. Early attempts focused on building classifiers that rely on either image or text information, i.e., a single modality of a post. Subsequently, several multimodal detection approaches employed an early or late fusion operator to aggregate information from multiple sources. Nevertheless, they only take advantage of multimodal embeddings for fusion and ignore another important detection factor, i.e., the inconsistency between modalities. To solve this problem, we develop a novel deep visual-linguistic fusion network (DVLFN) considering cross-modal inconsistency, which detects rumors by jointly considering modal aggregation and contrast information. Specifically, the DVLFN first utilizes visual and textual deep encoders, i.e., Faster R-CNN and bidirectional encoder representations from transformers (BERT), to extract global and regional embeddings for the image and text modalities. Then, it predicts a post's authenticity from two aspects: (1) intermodal inconsistency, which employs the Wasserstein distance to efficiently measure the similarity between regional embeddings of different modalities, and (2) modal aggregation, which employs early fusion, chosen experimentally, to aggregate the two modal embeddings for prediction. Consequently, the DVLFN composes the final prediction from both the modal fusion and the inconsistency measure. Experiments are conducted on three real-world multimedia rumor detection datasets collected from Reddit, GoodNews, and Weibo. The results validate the superior performance of the proposed DVLFN.
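The two prediction aspects named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_cost`, `sinkhorn_distance`, `early_fusion`), the cosine-based cost, the uniform weights over regions, and the entropy-regularized Sinkhorn iteration used here to approximate the Wasserstein distance are all assumptions made for illustration, and the toy vectors stand in for the Faster R-CNN and BERT regional embeddings.

```python
import math


def cosine_cost(xs, ys):
    """Pairwise cost matrix: 1 - cosine similarity between rows of xs and ys."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors

    cost = []
    for x in xs:
        row = []
        for y in ys:
            dot = sum(a * b for a, b in zip(x, y))
            row.append(1.0 - dot / (norm(x) * norm(y)))
        cost.append(row)
    return cost


def sinkhorn_distance(cost, eps=0.1, n_iters=200):
    """Entropy-regularized approximation of the Wasserstein distance.

    Assumes uniform mass on every image region and every text token;
    eps and n_iters are illustrative values, not taken from the paper.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each image region
    b = [1.0 / m] * m  # mass on each text token
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; the distance is <P, cost>.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


def early_fusion(image_vec, text_vec):
    """Early fusion as plain concatenation of the two global embeddings."""
    return list(image_vec) + list(text_vec)


# Toy regional embeddings (hypothetical 2-d vectors).
image_regions = [[1.0, 0.0], [0.0, 1.0]]
matching_tokens = [[0.9, 0.1], [0.1, 0.9]]
mismatched_tokens = [[-1.0, 0.0], [0.0, -1.0]]

consistent = sinkhorn_distance(cosine_cost(image_regions, matching_tokens))
inconsistent = sinkhorn_distance(cosine_cost(image_regions, mismatched_tokens))
fused = early_fusion([0.2, 0.8], [0.5, 0.5])  # input to a downstream classifier
```

In this sketch the mismatched image-text pair yields the larger transport distance, which is the kind of inconsistency signal that could be combined with the prediction made from the fused embedding.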
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62006118, 61906092, 61773198, 91746301), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20200460, BK20190441), the Jiangsu Shuangchuang (Mass Innovation and Entrepreneurship) Talent Program, and the CAAI-Huawei MindSpore Open Fund (Grant No. CAAIXSJLJJ-2021-014B).
Cite this article
Yang, Y., Bao, R., Guo, W. et al. Deep visual-linguistic fusion network considering cross-modal inconsistency for rumor detection. Sci. China Inf. Sci. 66, 222102 (2023). https://doi.org/10.1007/s11432-021-3530-7
DOI: https://doi.org/10.1007/s11432-021-3530-7