Abstract
Online misinformation, commonly termed “fake news”, has escalated sharply in recent years. Such deceptive information, often rich in multimodal content, can easily mislead individuals into spreading it across social media platforms, which has made the automatic detection of multimodal fake news a hot research topic. Existing works have made great progress on inter-modality feature fusion and semantic interaction, yet they largely ignore intra-modality entities and feature aggregation. This imbalance causes them to perform erratically on data with different emphases. In authentic news, the intra-modality contents and the inter-modality relationships should be mutually supportive. Inspired by this idea, we propose an approach to multimodal fake news detection (IFIS) that incorporates both intra-modality feature aggregation and inter-modality semantic fusion. Specifically, the proposed model implements an entity detection module and uses attention mechanisms for intra-modality feature aggregation, whereas inter-modality semantic fusion is accomplished via two parallel Co-attention blocks. IFIS is extensively tested on two datasets, Weibo and Twitter, and demonstrates superior performance, surpassing various advanced methods by 0.6%. The experimental results validate that the proposed approach offers the most balanced performance for multimodal fake news detection tasks.
Introduction
As information technology continues to evolve at a rapid pace, online social media has become the most important platform for daily information exchange [1]. Everyone in the current era enjoys the convenience of online social platforms. However, the exponential development of social media has provided fertile ground for the creation and wanton dissemination of fake news [2, 3]. The unrestricted spread of misinformation on social platforms has undermined not only the public opinion environment in cyberspace, but also political stability [4], social order and economic activities in reality [5]. Effective detection is essential for preventing the propagation of fake news on the internet.
Multimodal news, especially news with visual information, has become popular in recent years. Compared with text-only news, it attracts more attention from readers on social media. Benefiting from this feature, multimodal fake news gets more clicks and retweets to expand its reach [6]. Hence, fake news detection has focused on the multimodal content of social media. Figure 1 shows three samples from the Twitter dataset with the “fake” label. In Fig. 1a, both the text and the image indicate that this news should not be trusted. In Fig. 1b, the text provides nothing to prove its authenticity, but the image is apparently faked or fabricated. In Fig. 1c, the image seems reasonable, while the text indicates that it is possibly not real. A hypothesis emerging from these examples is that multimodal approaches are more conducive to detecting fake news.
Recently, many works have addressed fake news detection from a multimodal perspective [7, 8]. Benefiting from advances in pre-training, they usually adopt pre-trained models to extract features from different modalities. However, some early works naively concatenate multimodal features and ignore the complex interactions among them. Some studies have investigated learning joint text and image representations based on adversarial networks [9] and variational autoencoders [10], but they do not treat fake news detection as a single task; instead, it is solved together with event classification or original sample reconstruction. Later, some researchers attempted to extract more visual information by using an image description model [11] or a fake image detection algorithm [12], but they made no progress in combining inter-modality features. Wu et al. then fused textual and pictorial features several times, following the real reading habits of human beings [13]. This method focuses on multimodal fusion mechanisms but neglects the importance of multimodal representation. Some studies attempt to implement semantic alignment by using ambiguity learning [14] or entity detection [15]. However, experiments show that the performance of these models is unsatisfactory because it is imbalanced across datasets. For real multimodal news, the text, the image content and the cross-modal relation all ought to be flawless. Nevertheless, most existing methods do not consider single-modal feature judgement and cross-modal semantic fusion simultaneously, which leads to the loss of potential information and imbalanced model performance in fake news detection.
MCAN [13], which only considers the cross-modal relationship of news, performs well on the Weibo dataset but poorly on the Twitter dataset, with an accuracy difference of nearly 8%. MPFN [8], proposed in 2023, deeply integrates cross-modal features and balances the accuracy on the Weibo and Twitter datasets; however, its overall performance is not good enough. This suggests that new approaches need to be explored that consider both the inter-modal and intra-modal features of news.
Motivated by this, we propose a novel multimodal fake news detection method with Intra-modality Feature aggregation and Inter-modality Semantic fusion (IFIS). Specifically, to improve the detection accuracy, we extract entity features from images to reduce noisy and redundant visual features. Next, based on the detected entities, we design an intra-modality attention mechanism for aggregating the complex feature relations. In addition, we use a semantic fusion module to capture the inter-modality relationships among the features from different modalities. The semantic fusion module is built from two parallel Co-attention blocks to establish accurate and reliable relationships between modalities. The contributions of our work can be summarized as:
-
We propose a novel multimodal fake news detection method that considers both single-modal feature judgment and cross-modal semantic fusion. The method demonstrates excellent and balanced performance on datasets with different attributes.
-
We integrate an object detection block, Faster R-CNN, into the entity feature extraction module and aggregate the relational features through an attention mechanism.
-
We develop a semantic fusion module to capture the inter-modality relationship between news text and corresponding entities from images. The semantic fusion module is made up of two parallel Co-attention blocks to obtain stable connections between modalities.
The rest of this work is organized as follows. Section “Related works” summarizes previous works on fake news detection, especially the frameworks that use multimodal data. Section “Problem statement” provides a detailed definition of the fake news detection task. In Section “Proposed method”, details of the newly proposed multimodal framework are provided. Section “Experiments and analysis” then describes the adopted datasets and parameter settings, reports extensive experiments and provides the corresponding analysis. Finally, Section “Conclusion” gives the conclusion.
Related works
Traditional methods are single-modality recognition ones that focus only on the text and are mainly based on simple classifiers. Text-based detection methods usually use various neural networks to extract and classify text information, including Convolutional Neural Network (CNN) [16], Recurrent Neural Network (RNN) [17], Long Short-Term Memory network (LSTM) [18], and so on. Later, some studies tried to utilize visual information in the news to detect its authenticity [19, 20]. Visual information-based methods usually use image recognition or classification models to classify images. However, the use of insufficient information makes it difficult to achieve satisfactory results with the above single-modal methods.
In recent times, the news has evolved from a pure text format to a multimedia one that consists of visual information such as images [21] or videos [22]. As a result, image classification models [23] and optimization algorithms [24] have been widely used in fake news detection. This allows multimodal detection methods to show superior performance [25].
To explore the information beyond text, Jin et al. present the att-RNN model [6] to integrate features from different modalities. They introduce an attention mechanism that combines word embeddings with visual features and social association features. Later, Wang et al. build a multimodal feature encoder. In addition, to distinguish events and detect fake news, they also build an Event Adversarial Neural Network (EANN) [9]. This model is essentially composed of two modules. For text processing, it inputs word embedding vectors into a CNN to derive text representations. For image processing, it uses a pre-trained VGG-19 [26] to extract image representations. Then, the two representations are combined and fed to two identical neural network classifiers. One classifier discriminates between events, while the other detects fake news. Later, Khattar et al. design a Multimodal Variational AutoEncoder (MVAE) [10] motivated by [9]. In this framework, a bidirectional LSTM and VGG-19 extract the text and image representations independently. The two representations are concatenated into new vectors, which are sent to a decoder to restore the initial sample. At the same time, the detection of fake news is a secondary task in this framework [27]. However, this detection classifier has to be learned simultaneously with another one, which is bound to increase the complexity and the instability of the model. In addition, the whole framework is sometimes hampered by the unavailability of labeled data for the detection task. The methods mentioned above are all reasonable attempts to detect fake news in a multimodal way. However, the interactive features of multimodality are not well exploited.
To deal with the above-mentioned problem, a number of multimodal frameworks have been presented that attempt to interactively fuse multimodal features [28]. Singhal et al. develop SpotFake [29], a multimodal framework that addresses fake news detection specifically. Bidirectional Encoder Representations from Transformers (BERT) [30] and VGG-19 are applied to learn text features and image features separately. In this framework, the authors only take into account the characteristics of the text and images, and the authenticity of the news item is determined accordingly. The removal of interference from other sub-tasks ensures that the framework is well suited to the detection of fake news. However, this model simply connects two pre-trained models and does not take inter-modality features into account. Zhou et al. also present the Similarity-Aware FakE (SAFE) [11] news detection method, which uses cross-modal information from a different viewpoint. SAFE adopts a description model to convert images into text descriptions. Then, a text-based classifier is trained with news texts and image descriptions. In this process, the authors argue that SAFE takes into account the relationship between the features extracted from different modalities. The inability to effectively fuse multimodal features is a shortcoming of the approaches mentioned above. To better fuse textual and visual features, Wu et al. propose Multimodal Co-Attention Networks (MCAN) [13], inspired by real human reading habits. This model fuses image features from different domains with text features several times through Co-attention layers. However, it does not take into account the role of single-modal features.
Since then, many researchers have explored multimodal fake news detection from a variety of perspectives. Considering image tampering, Xue et al. present a Multimodal Consistency Neural Network (MCNN) [12], which uses the Error Level Analysis algorithm [31] to detect forged images. However, it does not improve the performance of feature fusion. Wang et al. explore several semantic associations between images and text and propose an instance-driven multimodal graph fusion method [15] focusing on the implications of multimodal presentation. To address the intrinsic ambiguity between the contents of different modalities, which degrades multimodal fake news detection, Chen et al. present a Cross-modal Ambiguity-aware FakE news detection method (CAFE) [14], a meaningful attempt to take an information-theoretic perspective. To take advantage of valid information at the shallow level, Jing et al. propose a Multimodal Progressive Fusion Network (MPFN) [8], which performs several intermediate fusions that effectively improve performance.
Problem statement
In essence, fake news is a distortion of information by those who create it. According to prior work on media bias theory [32], distortion bias is generally modelled as a binary classification challenge. For this reason, fake news detection has typically been formulated as a binary classification problem.
A social media news post is defined as \({\mathcal {N}}\). Depending on whether \({\mathcal {N}}\) is true or false, the label y is denoted by ‘0’ or ‘1’ accordingly. Based on the evaluation of the news \({\mathcal {N}}\), the prediction label \({\hat{y}}\) is classified as ‘0’ or ‘1’ by model \({\mathcal {M}}\).
Fake News Detection: The aim of this task is to assess whether the news article \({\mathcal {N}}\) is forged, given the original post of news \({\mathcal {N}}\) and related information, i.e., \({\mathcal {M}}: {\mathcal {N}} \rightarrow \{1,0\}\) such that,
$$\begin{aligned} {\mathcal {M}}({\mathcal {N}})={\left\{ \begin{array}{ll} 1, &{} \text {if } {\mathcal {N}} \text { is a piece of fake news} \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$(1)
where \({\mathcal {M}}\) represents the assessment method that the researchers are working on.
Proposed method
In this section, we discuss the detailed structure and motivation of the proposed multimodal method.
Model overview
In this manuscript, we present a new multimodal fake news detection approach (IFIS) based on intra-modality feature aggregation and inter-modality semantic fusion. In essence, the proposed approach is made up of four modules, i.e., text embedding, entity feature embedding, feature aggregation and semantic fusion. In order to eliminate redundant features, IFIS extracts entity features from images. In addition, to comprehensively use the intra-modal and inter-modal features, we design the feature aggregation and semantic fusion modules. The feature aggregation module uses the attention mechanism to strengthen single-modal features, while the semantic fusion module adopts co-attention to carry out cross-modal semantic interaction. For illustration, we present an overview of IFIS in Fig. 2. The main construction and functions of the various modules are described in the following subsections. For visualization, we use different colors to represent features of different modalities. As in Fig. 2, orange and blue represent visual and textual features, respectively.
Text embedding
To obtain a high-quality corpus, we first clean the original news articles. Due to the casual nature of human tweeting, we have to remove strange symbols and emoji signs from the text. The cleaned text is concatenated into a single paragraph as the text input. Text paragraphs are then split into sentences. A sentence s is transformed into a succession of tokens \(\left\{ w_1, w_2, \ldots , w_k\right\} \), where \(w_k\) is the aggregation of the position and token embeddings for the k-th token in the sentence. Inspired by [30], the produced input is passed to a Transformer [33]. To confine the learnt information of the input, the Transformer encoder is adopted. The input sequence of tokens \(\left\{ w_1, w_2, \ldots , w_k\right\} \) is then mapped into an abstract continuous representation \(\left\{ z_1, z_2, \ldots , z_k\right\} \).
Then, we adopt BERT, an excellent language model, to process the abstract continuous representation. It views words through the shared conditioning of their immediate context, which is deeply bidirectional. The BERT module is pre-trained with Next Sentence Prediction and Masked Language Modelling, two unsupervised predictive tasks. The goal of Next Sentence Prediction is to predict whether a sentence is adjacent to the target sentence, while the goal of Masked Language Modelling is to predict the input tokens in a paragraph that have been masked at random with some percentage. In this manuscript, we adopt two datasets to validate the superiority of the presented approach, i.e., Weibo and Twitter, which are in Chinese and English, respectively; thus, two independent versions of BERT are used. Although the two versions of BERT are trained with datasets in different languages, they have no structural differences.
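The cleaning step described at the start of this subsection can be illustrated with a minimal sketch. The exact rules used for IFIS are not specified, so the function name `clean_post` and the choice to filter by Unicode category (keeping letters, digits and punctuation, dropping symbols such as emoji) are assumptions made purely for illustration.

```python
import unicodedata


def clean_post(text: str) -> str:
    """Drop emoji and other symbol characters, then collapse whitespace.

    Hypothetical helper: the filtering rule is an assumption, not the
    authors' preprocessing code.
    """
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major in "LNP":
            # Letters, digits and punctuation are kept unchanged.
            kept.append(ch)
        elif major == "Z" or ch.isspace():
            # Any separator or whitespace character becomes a plain space.
            kept.append(" ")
        # Symbols (S*, which covers most emoji), marks (M*) and control or
        # format characters (C*) are dropped.
    return " ".join("".join(kept).split())
```

Because the rule relies only on Unicode categories, the same function applies to both Chinese (Weibo) and English (Twitter) posts before they are passed to the respective BERT tokenizer.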
Entity feature embedding
Inspired by [34], we use a pre-trained Faster R-CNN [35] to segment the salient regions containing entities from the images. We abandon the classical methods of image feature extraction for the following two reasons. Firstly, the semantic relationship may not be captured by the embedding representations obtained from CNN or VGG [36], even though they can preserve the spatial information. Secondly, classical approaches divide images equally at the spatial level [37], resulting in unnecessary background fragments and broken entities. It takes extra work to pick out the necessary fragments, and the broken entities degrade the performance of the model.
As in Fig. 3, we use a pre-trained Faster R-CNN to identify the image patches containing entity targets. Faster R-CNN is an object recognition framework that uses bounding boxes to identify and localize areas of the image that belong to particular classes. It consists of two main steps. Firstly, a Region Proposal Network predicts the boundaries and scores of objects. Then, region of interest pooling is adopted to acquire the feature maps of each bounding box and classify the fragments within the proposed region. To detect fake news, we remove some tiny and dependent entities from the object detection categories, including “eyes", “eyebrow", “necklace", and so on. For each news item, we select the k highest scoring entities in the image. Next, a pre-trained ResNet-50 [38] is adopted to transform the entity regions into vectors \(e_t^i \in R_t^i\), where i and t represent the dimension and sequence of the image vector, respectively. Then, the visual entity vectors are arranged as the entity feature embedding in order of scores.
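The entity selection rule (discard excluded categories, then keep the k highest scoring detections in score order) can be sketched as follows. The detector output format (a list of dictionaries with `label` and `score` keys) and the names `select_entities` and `EXCLUDED_LABELS` are hypothetical; only the three excluded labels named in the text are listed.

```python
# Labels named in the text as examples of tiny, dependent entities.
EXCLUDED_LABELS = {"eyes", "eyebrow", "necklace"}


def select_entities(detections, k=5, excluded=EXCLUDED_LABELS):
    """Return the k highest-scoring detections whose label is not excluded.

    `detections` is assumed to be a list of dicts with "label" and "score"
    keys; this layout is an illustrative assumption.
    """
    kept = [d for d in detections if d["label"] not in excluded]
    # sorted() is stable, so ties keep their original detection order.
    return sorted(kept, key=lambda d: d["score"], reverse=True)[:k]
```

The returned list is already ordered by score, which matches the order in which the entity vectors are arranged in the entity feature embedding.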
Feature aggregation module
In this section, we highlight how the self-attention mechanism is used to model the intra-modality relationships of image vectors and text tokens. Attention modules have been used extensively in Visual Question Answering (VQA) and Natural Language Processing (NLP) tasks in recent years. The attention mechanism can address the long-distance dependence problem in text and the averaged features of images. It learns new weight distributions in a targeted manner and applies them to important features. The attention mechanism used in this module can effectively highlight single-modal features to improve the accuracy of the method. Let us review how attention works. The attention module is a mapping function that is capable of capturing the global constraints of all the items in a sequence. It receives a variable number of inputs and returns the same quantity of outputs. Each input consists of three representations: query, key and value, which are packed into matrices Q, K, V independently. They interact and decide where to focus the attention.
A self-attention block is incorporated into the feature aggregation module because the image vectors and text embeddings are processed independently. As in Fig. 4, self-attention is a particular type of attention mechanism. Its purpose is to encode the interactions between fragments of images or text. In self-attention, the matrices Q, K, V are the same, i.e., the three input representations are identical. First, (\(d\times 1\))-dimensional queries, keys, and values are computed from the input of the multi-head self-attention. Then, they are packed into Q, K and V, respectively. The dot product between Q and K describes the attention allocation on V. In the multi-head attention submodule, the attention mechanism is run several times in parallel. Every head pays unique attention to one piece of the sequence. Finally, the outputs of all attention heads are combined and linearly rescaled to obtain the required dimension of the projection.
For the i-th head, Q, K, and V are transformed into the head inputs as follows:
$$\begin{aligned} Q_i=QW_i^Q,\quad K_i=KW_i^K,\quad V_i=VW_i^V \end{aligned}$$(2)
where \(W_i^Q, W_i^K, W_i^V \in {\mathbb {R}}^{1 \times d_h}\) are the projection matrices for the i-th head, \(d_h=d / m\) is the dimensionality of the output features from the heads and m is the total number of heads.
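Following the scaled dot-product formulation of [33], the multi-head self-attention function can be represented as:
$$\begin{aligned} h_i&=softmax\left( \frac{Q_iK_i^{T}}{\sqrt{d_h}}\right) V_i, \\ MultiHead(Q,K,V)&=\left( h_1\oplus \cdots \oplus h_m\right) W^O \end{aligned}$$(3)
where \(h_i\) is the output of the i-th head, \(W^O\in {\mathbb {R}}^{m d_h\times 1}\) represents a learnable parameter matrix, and \(\oplus \) denotes the combination of vectors.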
Next, a fully connected feed-forward network is applied to each fragment independently and reshapes the outcomes linearly:
$$\begin{aligned} FeedForward(x)=W_2\,ReLU\left( W_1x+b_1\right) +b_2 \end{aligned}$$(4)
where \(\left( x, b_1, b_2\right) \in {R}^{d \times 1}\), \(\left( W_1, W_2\right) \in {R}^{d \times d}\), \(W_1, b_1\) and \(W_2, b_2\) are the learnable parameters trained in the first and the second fully connected layers, respectively. Finally, residual connections and layer normalisation are placed around the two sub-layers to carry the positioning information to superior layers.
For the visual features, the representations obtained from the entity patches \(Y=\left[ e_1; \ldots ; e_k\right] \in {\mathbb {R}}^{k \times d}\) are fed into a self-attention layer to capture the interaction between visual entities. The result of the multi-head self-attention module is \(O=\left[ o_1; \ldots ; o_k\right] \in {\mathbb {R}}^{k \times d}\), where \(O=L\left( Y+MultiHead(Y)\right) \) and L denotes layer normalisation. A set of continuous representations is then obtained by applying position-wise feed-forward and layer normalisation, \(X=\left\{ X_i\right\} _{i=1}^k\), where \(X_i=L\left( o_i+FeedForward\left( o_i\right) \right) \). Finally, average pooling followed by L2 normalisation compresses the resulting image vectors into a dense representation.
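To make the visual aggregation pipeline concrete, the following pure-Python sketch chains multi-head self-attention, the residual connection with layer normalisation, the position-wise feed-forward layer, average pooling and L2 normalisation. It is a toy illustration under stated assumptions, not the IFIS implementation: weights are random rather than trained, vectors are stored as rows, the projections use the standard \(d \times d_h\) shapes of [33], biases and learned normalisation gains are omitted, and all function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def rand_matrix(rows, cols, rng):
    """Random stand-in for a trained weight matrix (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(y, m, rng):
    """Self-attention: queries, keys and values are all projections of y."""
    d = len(y[0])
    d_h = d // m  # per-head width, d_h = d / m (d must be divisible by m)
    heads = []
    for _ in range(m):
        q = matmul(y, rand_matrix(d, d_h, rng))
        k = matmul(y, rand_matrix(d, d_h, rng))
        v = matmul(y, rand_matrix(d, d_h, rng))
        scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d_h) for kr in k]
                  for qr in q]
        heads.append(matmul([softmax(r) for r in scores], v))
    # Concatenate the heads row by row, then apply the output projection.
    concat = [[v for head in heads for v in head[i]] for i in range(len(y))]
    return matmul(concat, rand_matrix(m * d_h, d, rng))


def aggregate_entities(y, m=2, seed=0):
    """Map k entity vectors (k x d) to one L2-normalised d-dimensional vector."""
    rng = random.Random(seed)
    d = len(y[0])
    attn = multi_head_self_attention(y, m, rng)
    # O = L(Y + MultiHead(Y))
    o = [layer_norm([a + b for a, b in zip(yr, ar)]) for yr, ar in zip(y, attn)]
    # Position-wise feed-forward with ReLU; biases are treated as zero here.
    hidden = [[max(0.0, h) for h in row] for row in matmul(o, rand_matrix(d, d, rng))]
    ffn = matmul(hidden, rand_matrix(d, d, rng))
    # X_i = L(o_i + FeedForward(o_i))
    x = [layer_norm([a + b for a, b in zip(orow, frow)]) for orow, frow in zip(o, ffn)]
    pooled = [sum(col) / len(x) for col in zip(*x)]  # average pooling over entities
    norm = math.sqrt(sum(p * p for p in pooled))
    return [p / norm for p in pooled]  # L2 normalisation
```

Calling `aggregate_entities` on k = 5 entity vectors returns a single unit-length vector of dimension d, which corresponds to the dense visual representation described above.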
For the textual features, the representations obtained from the text content \(Z=\left[ z_1, z_2,\ldots , z_k\right] \) are fed into a 1-dimensional convolutional neural network, which is useful for capturing the hidden sequencing features. The convolutional layer maps the input sequences \(\left\{ z_{i:(i+h-1)}\right\} _{i=1}^{k-h+1}\) to the feature map \(F=\left\{ f_i\right\} _{i=1}^{k-h+1}\). Each input is a set of h contiguous words, and the corresponding feature is computed as,
$$\begin{aligned} f_i=ReLU\left( w\cdot z_{i:(i+h-1)}+b\right) \end{aligned}$$(5)
where \(w, z_{i:(i+h-1)} \in {\mathbb {R}}^{hd}\), ReLU is the ReLU activation function, b is a bias and w holds the trainable parameters. Then, a max-pooling operation is applied to the resulting feature map for dimensionality reduction, \({\hat{f}}_s=max\{f_i\}_{i=1}^{k-h+1}\). The text representations are generated by \(r=W{\hat{f}}+b\). Finally, a fully connected layer and an \({l_2}\) normalisation are applied to r to produce the feature vector of the text.
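The windowed convolution and max-pooling over the token sequence can be sketched for a single filter as below. The function name `conv_max_pool` and the toy inputs are hypothetical; a real model would use many filters with trained weights.

```python
def conv_max_pool(z, w, b, h):
    """Apply one 1-D convolution filter over windows of h tokens, then max-pool.

    z: list of k token vectors, each of dimension d
    w: filter weights of length h * d
    b: scalar bias
    Returns (pooled_value, feature_map) where the feature map has k - h + 1 entries.
    """
    k = len(z)
    if k < h:
        raise ValueError("sequence is shorter than the window size h")
    features = []
    for i in range(k - h + 1):
        # Flatten the window z_{i:(i+h-1)} into one vector of length h * d.
        window = [v for token in z[i:i + h] for v in token]
        # f_i = ReLU(w . window + b)
        features.append(max(0.0, sum(wi * xi for wi, xi in zip(w, window)) + b))
    return max(features), features
```

With k = 3 tokens and a window of h = 2, the feature map has k - h + 1 = 2 entries, and max-pooling keeps the strongest response of the filter over the whole sentence.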
Semantic fusion module
Semantic feature fusion is implemented through the semantic fusion module, whose detailed architecture is provided in Fig. 5. Firstly, we introduce the Co-Attention block, which is the basic unit of the semantic fusion module.
The Co-Attention block is a variant of the multi-head self-attention block. As in Fig. 5, in a Co-Attention block, queries use data from one modality, while keys and values use data from the other modality. In addition, the query matrices are used as residuals to keep the original semantic features. The remaining architecture is the same as that used in multi-head self-attention. The Co-Attention block thus produces attention features for one modality conditioned on another. For instance, if matrix Q is computed from the textual features and matrices K and V are computed from the visual features, the attention matrix calculated from Q and K measures the similarity between the text and the image. After that, the attention matrix is used to weight V, i.e., the visual features. This resembles the real habit of humans reading news: after looking at the images, we pay more attention to the text sequences that relate to them. Co-attention can effectively simulate this process and learn the semantic fusion relationship between features of different modalities.
As in Fig. 5, a fusion module is obtained by connecting two Co-Attention blocks in parallel. The orange squares represent features extracted from one modality, and the blue squares represent features extracted from the other modality. As previously explained, one Co-Attention block applies the attention matrix to the features of one modality, and the other Co-Attention block handles the opposite situation. Then, the results of the two Co-Attention blocks are concatenated and passed through a fully connected layer to produce the final representations. The semantic fusion module models cross-modal interactions by swapping the modality of the input features to simulate the real reading habits of humans.
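The two parallel Co-Attention blocks can be illustrated with a simplified sketch. To keep it short, it uses a single head without learned projections, mean-pools each block's output before concatenation and omits the final fully connected layer; these simplifications and the function names are assumptions for illustration, not the IFIS architecture itself.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(query_feats, context_feats):
    """Queries come from one modality; keys and values come from the other.

    The query is added back as a residual to keep its original semantics.
    """
    d = len(query_feats[0])
    out = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context_feats]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context_feats)) for j in range(d)]
        out.append([a + b for a, b in zip(q, attended)])
    return out


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def semantic_fusion(text_feats, entity_feats):
    """Run two co-attention blocks in parallel and concatenate their pooled outputs."""
    text_to_image = co_attention(text_feats, entity_feats)  # Q: text, K/V: image
    image_to_text = co_attention(entity_feats, text_feats)  # Q: image, K/V: text
    return mean_pool(text_to_image) + mean_pool(image_to_text)
```

Swapping the arguments of `co_attention` is all that distinguishes the two blocks, which reflects the description of the module as changing the modality of the input features.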
Experiments and analysis
Data descriptions
To demonstrate the effectiveness of the proposed model, extensive experiments are carried out on two public datasets, Weibo and Twitter. Table 1 shows the detailed statistics for these two datasets.
-
Weibo [6]: This dataset was constructed by Jin et al. All posts were made between May 2012 and January 2016 in Weibo.
All fake news is validated by the official rumour debunking system of Weibo, while all real news is checked by an official news organisation, Xinhua News Agency. This dataset contains 4,779 real posts and 4,749 fake ones. The tweets consist of articles, additional images and social context. As we are primarily interested in identifying news with text and images, we have removed tweets without relevant images.
-
Twitter [39]: This dataset was first released to support the fake content detection on Twitter in the “Verifying Multimedia Use" task. A development set and a test set are included in the raw data set, and both contain tweets about different events. To ensure the fairness of performance comparison, we retain the news containing both text and images while the others are filtered out. It is notable that the quantity of images is much smaller than that of samples in this dataset. This means that an image might be shared by several samples.
Evaluation metrics
To measure the performance of the various models, several metrics are adopted, with the corresponding definitions provided as follows:
-
(1)
Precision: It represents the proportion of true positives among the positive examples identified by the model.
It is abbreviated as Pre.;
-
(2)
Recall: It indicates the proportion of correctly identified positive cases out of the total number of positive cases. It is abbreviated as Rec.;
-
(3)
\({F_{Score}}\): Precision and Recall are sometimes contradictory, especially when used to measure unbalanced datasets. Therefore, \(F_{Score}\) is proposed as an indicator to measure the comprehensive performance of a model. It is calculated by weighting Precision and Recall, which can be represented as:
$$\begin{aligned} {\textrm{F}}_{{\textrm{Score}}}=\left( 1+\tau ^{2}\right) \frac{Pre.\times Rec.}{\tau ^{2} \cdot Pre.+Rec.} \end{aligned}$$(6)
where \(\tau \) represents a weighted parameter. When \(\tau \) is assigned to 1, \(F_{Score}\) is written as \(F_1\). In general, a larger \({F_1}\) indicates better performance;
-
(4)
Accuracy: It represents the ability of the classifier to determine all samples within the dataset.
It is an ideal metric when the sample proportion of each category in the dataset is balanced. In general, a larger value of \({\textrm{Accuracy}}\) usually indicates better overall performance.
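The four metrics can be computed from the confusion counts as in the following sketch. The function name `classification_metrics` is hypothetical, the positive class is assumed to be the label 1, and undefined ratios are returned as 0.0 by convention.

```python
def classification_metrics(y_true, y_pred, tau=1.0, positive=1):
    """Return precision, recall, F-score (weighted by tau) and accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F_score = (1 + tau^2) * Pre * Rec / (tau^2 * Pre + Rec)
    denom = tau ** 2 * precision + recall
    f_score = (1 + tau ** 2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score, "accuracy": accuracy}
```

With the default `tau=1.0` the returned `f_score` is \(F_1\); to report per-category scores, the function can be called once with `positive=1` and once with `positive=0`.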
Parameter settings
During the training phases, the maximum text lengths are set to 200 and 30 for Weibo and Twitter, respectively. The size of the hidden nodes and the dimensionality d are both set to 256, and the total number of heads m is fixed at 8. We fix the number of visual entities k to 5 (the reason for adopting this value will be discussed later). When training on the Twitter dataset, the parameters of BERT and Faster R-CNN are frozen to avoid overfitting; on the Weibo dataset, they are not frozen. For the Twitter and Weibo datasets, the adopted BERT models are BERT-base-multilingual-cased and BERT-base-Chinese, respectively. IFIS is trained for 50 epochs with a learning rate of 5e-5. The dropout rate is fixed at 0.5 and the batch size is set to 128. We use the Adam optimizer and the categorical cross-entropy loss function to optimize the model. The above hyperparameters are determined using grid search, with accuracy as the selection criterion. Furthermore, the model is implemented in the PyTorch framework and all experiments are run on an Nvidia GeForce RTX 3090Ti graphics card. Besides, the hyperparameters for the baselines are the same as in their respective original papers.
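For reference, the settings above can be collected in a small configuration object. The class name `IFISConfig` and its field names are hypothetical and chosen only for illustration; the values are those stated in this subsection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFISConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    dataset: str
    max_text_length: int
    bert_name: str
    freeze_backbones: bool  # whether BERT and Faster R-CNN parameters are frozen
    hidden_dim: int = 256   # hidden size and dimensionality d
    num_heads: int = 8      # total number of heads m
    num_entities: int = 5   # number of visual entities k
    epochs: int = 50
    learning_rate: float = 5e-5
    dropout: float = 0.5
    batch_size: int = 128


WEIBO = IFISConfig("weibo", 200, "BERT-base-Chinese", False)
TWITTER = IFISConfig("twitter", 30, "BERT-base-multilingual-cased", True)
```

With d = 256 and m = 8, each attention head works with \(d_h = d/m = 32\) dimensions.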
Baselines
To illustrate the superiority of the proposed model, several SOTA approaches (both single-modal and multimodal) are selected for performance comparison. They are listed as follows:
Single-modal methods
Here, the considered single-modal methods contain Textual and Visual.
Textual [10]: Each word is mapped to a 32-dimensional vector, then a bidirectional LSTM is applied to extract the sequential features and produce the prediction results.
Visual [6]: The VGG-Net is utilized to extract 4096 dimensional features from images and a classifier is then trained to infer the labels.
Multimodal methods
As to multimodal methods, they mainly focus on developing approaches to efficiently extract and fuse features. The considered multimodal methods are provided as:
VQA [40]: The aim is to address the Visual Question Answering through the concatenation of textual and visual features. To make fair comparisons, the concatenated features are fed into a single-layer LSTM.
att-RNN [6]: This approach utilizes an LSTM to obtain features. Then, the attention mechanism is adopted for the fusion of textual, visual and social features. For fair comparisons, social features are removed.
EANN [9]: It develops a CNN for the generation of text representation, and utilizes a pre-trained VGG-19 to abstract the image representations. After that, the text and image representations are merged and sent to a network classifier.
MVAE [10]: It extracts text and image representations using Bi-LSTMs and VGG-19, respectively. The two representations are then interlinked and sent to the decoder to reconstitute the source samples.
SAFE [11]: It uses a descriptive model to paraphrase images into text. The textual description is then combined with the original text to train a text classifier.
SpotFake+ [7]: SpotFake applies BERT and VGG-19 to extract textual and visual features separately. These features are interlinked together for classification training. On the basis of SpotFake, SpotFake+ is presented replacing BERT with a pre-trained XLNET.
MCAN [13]: It uses VGG-19, CNNs and BERT to extract visual features from different domains and text features, respectively. Then, it fuses the features several times using Co-attention layers. For fair comparisons, the frequency-domain feature is removed from the model.
MCNN [12]: It adopts BERT and ResNet-50 to capture textual and visual features. Meanwhile, ELA algorithm is applied to detect forged images. In addition, it assigns weights to the image and text using the attention mechanism.
DIIF [15]: It applies BERT and Mask R-CNN [41] to derive textual and image features. Then, a unifying graph is constructed to combine the textual and visual instances. It adopts attention mechanism and gating mechanism to generate the contextual representation and gather the semantic interactions.
CAFE [14]: It uses BERT and ResNet-34 to learn textual and image features from news. A ambiguity learning approach is used to extract ambiguity across modalities. Besides, a fusion module is adopted to collect the interactions of modalities.
MPFN [8]: Here, traditional CNN is used to process the sequence of news words and the image patches. Finally, it merges the resulting representations and then classifies them using a soft-max layer.
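Several of the baselines above, and IFIS itself, rely on some form of co-attention between text and image features. The following is a minimal sketch of that general idea using plain scaled dot-product attention in both directions. It is not the implementation of MCAN or of IFIS: there are no learned projections, and the names `cross_attention` and `co_attention` as well as the toy feature values are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Each query row attends over the rows of keys_values.

    The rows of keys_values serve as both keys and values (no learned
    projections), and both inputs are assumed to share one feature size.
    """
    d = len(queries[0])
    value_dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(value_dim)]
        )
    return attended


def co_attention(text: Matrix, image: Matrix) -> Tuple[Matrix, Matrix]:
    """Two parallel blocks: text attends to image, and image attends to text."""
    text_attended = cross_attention(text, image)
    image_attended = cross_attention(image, text)
    return text_attended, image_attended


# Toy example: two token vectors and one image-entity vector (made-up values).
text_features = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[1.0, 0.0]]
text_out, image_out = co_attention(text_features, image_features)
```

In a trained model the queries, keys and values would each pass through learned linear layers, and the two attended outputs would be pooled before classification.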
Result and discussion
In this subsection, we report the Precision, Recall and \(F_1\) scores for each news category, obtained for all methods on the considered datasets.
Table 2 shows the overall performance comparison between the considered approaches and the proposed model on Weibo and Twitter. IFIS outperforms the previous models, with accuracies of 0.896 and 0.838 on Weibo and Twitter, respectively. Besides, several observations are worth discussing.
Texts in Weibo are much longer than those in Twitter, which is why the maximum text length is limited to 200 for Weibo and to 30 for Twitter. In Weibo, images and news items are in one-to-one correspondence, which means the visual information is only weakly correlated with the authenticity of the sample. In Twitter, by contrast, the number of images is much smaller than the number of samples, so the same image is shared by several samples. Even worse, all samples that share an image carry the same label. This leads to a strong association between the visual feature and the authenticity of the sample, which is the opposite of what is observed in Weibo. Nevertheless, on both Weibo and Twitter, the Visual baseline lags behind VQA and att-RNN according to Table 2. This indicates that, in most cases, using multiple modalities for fake news detection is superior to relying on a single modality. Besides, MCAN, MCNN and DIIF generally perform better than SAFE and SpotFake+, which confirms the contribution of cross-modal interaction to model performance.
Compared to att-RNN, EANN and MVAE introduce auxiliary tasks and show significant performance advantages. This verifies that the most significant features of each modality can be captured through coarse-grained multimodal fusion. Following this idea, SpotFake+ replaces word embeddings with XLNet token representations and outperforms the above methods, which demonstrates the substantial benefit of pre-trained models. Beyond that, the success of MCAN, MCNN and DIIF shows that a well-designed feature fusion module can further improve detection performance.
Overall, the results in Table 2 confirm the effectiveness of our method. In combination with the above analysis, the advantages are attributed to the following factors. (1) The proposed model makes use of as much information about the news content as possible. It cleans the data reasonably and integrates textual and visual information. (2) It adopts powerful pre-trained models, including ResNet and BERT, to extract multimodal features. (3) It uses Faster R-CNN for entity acquisition, which eliminates the interference of redundant features as far as possible during feature extraction. (4) It incorporates intra-modality feature aggregation and inter-modality semantic fusion simultaneously, and thus combines the advantages of single-modal feature judgment and cross-modal interaction.
Finally, the proposed model achieves higher accuracy on both the Weibo and Twitter datasets than the state-of-the-art methods included for comparison. As shown in Table 2, the best method on one dataset usually performs only moderately on the other, while the proposed method improves the accuracy by 0.6 More importantly, the proposed method performs in a balanced way across datasets and attains the best accuracy on both. The performance improvement is attributed to better feature aggregation and feature interaction capability. Meanwhile, the performance of the proposed method also varies between the considered datasets, which is mainly caused by differences between their samples, especially in the visual information.
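For reference, the per-category metrics can be computed from predicted and true labels as in the following sketch. The function names and the toy labels are assumptions for illustration and are not taken from the paper; label 1 is assumed to mean fake and 0 to mean real.

```python
from typing import Dict, List


def per_class_metrics(y_true: List[int], y_pred: List[int], positive: int) -> Dict[str, float]:
    """Precision, recall and F1 when `positive` is treated as the target class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (made up): 1 = fake news, 0 = real news.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
fake_scores = per_class_metrics(y_true, y_pred, positive=1)
real_scores = per_class_metrics(y_true, y_pred, positive=0)
overall_accuracy = accuracy(y_true, y_pred)
```

Calling `per_class_metrics` once per category gives the separate fake-news and real-news columns that such comparison tables typically report next to the overall accuracy.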
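The difference in image sharing between the two datasets can be quantified with a simple count. The sketch below assumes each sample is available as an `(image_id, label)` pair. This layout, the function name `image_label_stats` and the toy samples are all hypothetical and do not reflect the actual dataset files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def image_label_stats(samples: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Summarise how often images are reused and whether reused images share one label."""
    labels_by_image: Dict[str, Set[int]] = defaultdict(set)
    count_by_image: Dict[str, int] = defaultdict(int)
    n_samples = 0
    for image_id, label in samples:
        labels_by_image[image_id].add(label)
        count_by_image[image_id] += 1
        n_samples += 1

    n_images = len(count_by_image)
    shared = [img for img, count in count_by_image.items() if count > 1]
    consistent = [img for img in shared if len(labels_by_image[img]) == 1]
    return {
        # Average number of samples that point to the same image.
        "samples_per_image": n_samples / n_images if n_images else 0.0,
        # Fraction of images used by more than one sample.
        "shared_image_fraction": len(shared) / n_images if n_images else 0.0,
        # Among shared images, fraction whose samples all carry the same label.
        "label_consistent_fraction": len(consistent) / len(shared) if shared else 0.0,
    }


# Toy data (made up): one-to-one pairing versus heavy image reuse.
one_to_one = [("a", 1), ("b", 0), ("c", 1)]
reused = [("x", 1), ("x", 1), ("x", 1), ("y", 0), ("y", 0), ("z", 1)]
one_to_one_stats = image_label_stats(one_to_one)
reused_stats = image_label_stats(reused)
```

A `samples_per_image` value well above 1 together with a `label_consistent_fraction` near 1 would correspond to the situation described for Twitter, where the image alone is highly predictive of the label.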
Ablation study
Table 3 reports the performance of several variants of the proposed model, which allows the influence of each component to be discussed explicitly.
- Entity: To evaluate the impact of entity extraction, this variant no longer adopts Faster R-CNN to derive the visual entities. Instead, ResNet-50 is employed to obtain visual features from evenly partitioned image regions.
- Aggregation: To evaluate the impact of feature aggregation, this variant removes the feature aggregation module from IFIS. The features of images and text are concatenated with the outputs of the semantic fusion module.
- Fusion: To evaluate the impact of semantic fusion, this variant removes the semantic fusion module from IFIS. The features from the two feature aggregation modules are concatenated to derive the final prediction.
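The three variants can be viewed as switches on the full model. The sketch below only illustrates which feature groups reach the classifier under each switch. The class `AblationConfig`, the helper functions and the assumption that the full model concatenates the aggregated and fused features are illustrative choices, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass(frozen=True)
class AblationConfig:
    use_entity_detector: bool = True  # False: evenly partitioned regions instead of detected entities
    use_aggregation: bool = True      # False: raw features bypass the aggregation module
    use_fusion: bool = True           # False: the semantic fusion output is dropped


VARIANTS: Dict[str, AblationConfig] = {
    "IFIS": AblationConfig(),
    "Entity": AblationConfig(use_entity_detector=False),
    "Aggregation": AblationConfig(use_aggregation=False),
    "Fusion": AblationConfig(use_fusion=False),
}


def visual_source(cfg: AblationConfig) -> str:
    """Name of the (hypothetical) visual front end selected by the config."""
    return "detected_entities" if cfg.use_entity_detector else "uniform_grid_regions"


def classifier_input(
    cfg: AblationConfig,
    text_raw: Vector,
    image_raw: Vector,
    text_agg: Vector,
    image_agg: Vector,
    fused: Vector,
) -> Vector:
    """Concatenate the feature groups that remain active under `cfg`."""
    parts = [text_agg, image_agg] if cfg.use_aggregation else [text_raw, image_raw]
    if cfg.use_fusion:
        parts = parts + [fused]
    return [value for part in parts for value in part]


# Toy one-dimensional features (made up) to show which groups are kept.
toy = dict(text_raw=[1.0], image_raw=[2.0], text_agg=[3.0], image_agg=[4.0], fused=[5.0])
inputs_by_variant = {name: classifier_input(cfg, **toy) for name, cfg in VARIANTS.items()}
```

Keeping the variants in one frozen configuration object makes it harder for an ablation run to silently differ from the full model in more than the one component under study.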
To present the results intuitively, they are also displayed in Fig. 6. All components of the proposed model contribute significantly to the results on both datasets. Specifically, compared with the Entity variant, the full model uses entity detection and achieves better detection performance. This suggests that using visual entities as input helps to improve performance. Besides, the full model outperforms the Aggregation variant, which lacks the intra-modal feature interactions. This indicates the effectiveness of modelling intra-modal correlations for both visual entity features and textual semantic features. In addition, when the semantic fusion module is removed (the Fusion variant), the method suffers a significant performance degradation, which demonstrates the importance of exploiting the semantic interactions between different modalities.
Furthermore, removing any component degrades performance more on Twitter than on Weibo. Considering the differences between the datasets, this phenomenon is attributed to the balanced data distribution in Weibo and the robustness of fine-tuned BERT.
Parameter analysis
As stated above, we use the top k visual entities of each image, ranked by detection probability. We further study the effect of varying k by conducting experiments with different values of k; the results are provided in Fig. 7.
As shown in Fig. 7, varying k has a similar effect on detection performance on both datasets: accuracy first increases and then declines, which indicates that an optimal k exists. Overall, the maximum detection accuracy is obtained when k equals 5, so the optimal value of k is 5. In addition, selecting entities effectively reduces the number of features that the proposed framework needs to process. As a result, the method does not increase the complexity of fake news detection, and the computing power and time required are similar to those of existing methods. Furthermore, varying k affects the overall performance much more on the Twitter dataset than on the Weibo dataset, as indicated by the larger variation. We suppose that this is related to the strong association between the visual feature and the authenticity of the sample in the Twitter dataset, whereas this association is extremely weak in the Weibo dataset. This phenomenon also supports the conclusions drawn in the above subsections.
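The top-k selection and the choice of k can be sketched as follows. The detection tuple layout, the function names and the accuracy values in the sweep are placeholders chosen for illustration. In particular, the numbers are not the results shown in Fig. 7.

```python
from typing import Dict, List, Sequence, Tuple

# A detection is (confidence score, feature vector); this layout is an assumption.
Detection = Tuple[float, List[float]]


def select_top_k(detections: Sequence[Detection], k: int) -> List[Detection]:
    """Keep the k detections with the highest confidence (fewer if not enough exist)."""
    if k <= 0:
        return []
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    return ranked[:k]


def best_k(accuracy_by_k: Dict[int, float]) -> int:
    """Return the k with the highest validation accuracy; ties go to the smaller k."""
    return max(sorted(accuracy_by_k), key=lambda k: accuracy_by_k[k])


# Toy detections for one image (made-up scores and features).
detections = [(0.2, [1.0]), (0.9, [2.0]), (0.5, [3.0])]
kept = select_top_k(detections, k=2)

# Placeholder sweep with a rise-then-fall shape; NOT the paper's measurements.
placeholder_sweep = {1: 0.1, 3: 0.2, 5: 0.3, 7: 0.2}
chosen_k = best_k(placeholder_sweep)
```

Because only k feature vectors per image are passed on, the cost of the later attention steps grows with k rather than with the total number of detected regions.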
Conclusion
To detect multimodal fake news effectively, we present a multimodal detection method (IFIS) that combines intra-modality feature aggregation with inter-modality semantic fusion. Specifically, the proposed method extracts entity features from an image to avoid acquiring too many noisy visual features. Intra-modality features are aggregated based on entity detection methods and attention mechanisms. In addition, we use a semantic fusion module to capture the inter-modality relationships between textual content and visual entities. The semantic fusion module is made up of two parallel Co-attention blocks, which establish stable connections between modalities. The simultaneous integration of single-modal features and cross-modal semantics allows the model to determine the authenticity of samples in different datasets.
The superiority of the proposed approach is demonstrated by experimental results on different datasets. Nonetheless, this work has some limitations that could be addressed in the future: (1) the modalities considered are not sufficient, and features from further modalities, such as video, audio and propagation structure, are not used; (2) the parameter setting process can be improved, for instance, through the adoption of cooperative-competitive neural networks [42], reaction–diffusion neural networks [43] or fault tolerant iterative learning control [44].
In future work, we will delve into the role of entity features in feature representation and feature interaction. Moreover, we will also consider leveraging prior knowledge and dissemination structures in such tasks.
Data availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.
References
Mitra T, Wright GP, Gilbert E (2017) A parsimonious language model of social media credibility across disparate events. In: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp 126–145. https://doi.org/10.1145/2998181.2998351
Zhao Z, Resnick P, Mei Q (2015) Enquiring minds: early detection of rumors in social media from enquiry posts. In: Proceedings of the 24th International Conference on World Wide Web, pp 1395–1405. https://doi.org/10.1145/2736277.2741637
Zhu P, Cheng L, Gao C, Wang Z, Li X (2022) Locating multi-sources in social networks with a low infection rate. IEEE Trans Netw Sci Eng 9(3):1853–1865. https://doi.org/10.1109/TNSE.2022.3153968
Shu K, Sliva A, Wang S, Tang J, Liu H (2017) Fake news detection on social media: a data mining perspective. ACM SIGKDD Explor Newslett 19(1):22–36. https://doi.org/10.1145/3137597.3137600
Xia H, Wang Y, Zhang JZ, Zheng LJ, Kamal MM, Arya V (2023) Covid-19 fake news detection: a hybrid cnn-bilstm-am model. Technol Forecast Soc Change 195:122746. https://doi.org/10.1016/j.techfore.2023.122746
Jin Z, Cao J, Guo H, Zhang Y, Luo J (2017) Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In: Proceedings of the 25th ACM International Conference on Multimedia, pp 795–816. https://doi.org/10.1145/3123266.3123454
Singhal S, Kabra A, Sharma M, Shah RR, Chakraborty T, Kumaraguru P (2020) SpotFake+: a multimodal framework for fake news detection via transfer learning (student abstract). In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 13915–13916. https://doi.org/10.1609/aaai.v34i10.7230
Jing J, Wu H, Sun J, Fang X, Zhang H (2023) Multimodal fake news detection via progressive fusion networks. Inform Process Manag 60(1):103120. https://doi.org/10.1016/j.ipm.2022.103120
Wang Y, Ma F, Jin Z, Yuan Y, Xun G, Jha K, Su L, Gao J (2018) EANN: event adversarial neural networks for multi-modal fake news detection. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 849–857. https://doi.org/10.1145/3219819.3219903
Khattar D, Goud JS, Gupta M, Varma V (2019) MVAE: Multimodal variational autoencoder for fake news detection. In: Proceedings of the 28th International Conference on World Wide Web, pp 2915–2921. https://doi.org/10.1145/3308558.3313552
Zhou X, Wu J, Zafarani R (2020) SAFE: similarity-aware multi-modal fake news detection. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 354–367. https://doi.org/10.1007/978-3-030-47436-2_27
Xue J, Wang Y, Tian Y, Li Y, Shi L, Wei L (2021) Detecting fake news by exploring the consistency of multimodal data. Inform Process Manag 58(5):102610. https://doi.org/10.1016/j.ipm.2021.102610
Wu Y, Zhan P, Zhang Y, Wang L, Xu Z (2021) Multimodal fusion with co-attention networks for fake news detection. In: Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp 2560–2569. https://doi.org/10.18653/v1/2021.findings-acl.226
Chen Y, Li D, Zhang P, Sui J, Lv Q, Tun L, Shang L (2022) Cross-modal ambiguity learning for multimodal fake news detection. In: Proceedings of the ACM Web Conference 2022, pp 2897–2905. https://doi.org/10.1145/3485447.3511968
Wang J, Yang Y, Liu K, Xie P, Liu X (2022) Instance-guided multi-modal fake news detection with dynamic intra-and inter-modality fusion. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 510–521. https://doi.org/10.1007/978-3-031-05933-9_40
Bacanin N, Zivkovic M, Al-Turjman F, Venkatachalam K, Trojovskỳ P, Strumberger I, Bezdan T (2022) Hybridized sine cosine algorithm with convolutional neural networks dropout regularization application. Sci Rep 12(1):6302. https://doi.org/10.1038/s41598-022-09744-2
Zivkovic M, Bacanin N, Antonijevic M, Nikolic B, Kvascev G, Marjanovic M, Savanovic N (2022) Hybrid cnn and xgboost model tuned by modified arithmetic optimization algorithm for covid-19 early diagnostics from x-ray images. Electronics 11(22):3798. https://doi.org/10.3390/electronics11223798
Liu P, Qiu X, Chen X, Wu S, Huang X (2015) Multi-timescale long short-term memory neural network for modelling sentences and documents. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp 2326–2335. https://doi.org/10.18653/v1/d15-1280
Guo Z, Zhang Q, Ding F, Zhu X, Yu K (2023) A novel fake news detection model for context of mixed languages through multiscale transformer. IEEE Trans Comput Soc Syst. https://doi.org/10.1109/tcss.2023.3298480
Zhu P, Pan Z, Liu Y, Tian J, Tang K, Wang Z (2024) A general black-box adversarial attack on graph-based fake news detectors. arXiv preprint arXiv:2404.15744
Yin S, Zhu P, Wu L, Gao C, Wang Z (2024) GAMC: an unsupervised method for fake news detection using graph autoencoder with masking. In: Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v38i1.27788
Qi P, Bu Y, Cao J, Ji W, Shui R, Xiao J, Wang D, Chua T-S (2023) Fakesv: a multimodal benchmark with rich social context for fake news detection on short video platforms. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 37, pp 14444–14452. https://doi.org/10.1609/aaai.v37i12.26689
Tang K, Ma Y, Miao D, Song P, Gu Z, Tian Z, Wang W (2022) Decision fusion networks for image classification. IEEE Trans Neural Netw Learn Syst:1–14. https://doi.org/10.1109/tnnls.2022.3196129
Zivkovic M, Stoean C, Petrovic A, Bacanin N, Strumberger I, Zivkovic T (2021) A novel method for covid-19 pandemic information fake news detection based on the arithmetic optimization algorithm, pp 259–266. https://doi.org/10.1109/synasc54541.2021.00051
Hua J, Cui X, Li X, Tang K, Zhu P (2023) Multimodal fake news detection through data augmentation-based contrastive learning. Appl Soft Comput 136:110125. https://doi.org/10.1016/j.asoc.2023.110125
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Cheng L, Zhu P, Tang K, Gao C, Wang Z (2024) GIN-SD: source detection in graphs with incomplete nodes via positional encoding and attentive fusion. In: Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v38i1.27755
Zhu P, Wang B, Tang K, Zhang H, Cui X, Wang Z (2024) A knowledge-guided graph attention network for emotion-cause pair extraction. Knowl Based Syst 286:111342. https://doi.org/10.1016/j.knosys.2023.111342
Singhal S, Shah RR, Chakraborty T, Kumaraguru P, Satoh S (2019) SpotFake: a multi-modal framework for fake news detection. In: Proceedings of the 5th IEEE International Conference on Multimedia Big Data, pp 39–47. https://doi.org/10.1109/bigmm.2019.00-44
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 4171–4186
Sudiatmika IBK, Rahman F, Trisno T, Suyoto S (2019) Image forgery detection using error level analysis and deep learning. Telecommun Comput Electron Control 17(2):653–659. https://doi.org/10.12928/telkomnika.v17i2.8976
Gentzkow M, Shapiro JM, Stone DF (2015) Media bias in the marketplace: theory. In: Handbook of Media Economics, pp 623–645. https://doi.org/10.3386/w19880
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inform Process Syst 30:5998–6008
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp 201–216. https://doi.org/10.1007/978-3-030-01225-0_13
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28. https://doi.org/10.1109/tpami.2016.2577031
Wang S, Chen Y, Zhuo J, Huang Q, Tian Q (2018) Joint global and co-attentive representation learning for image-sentence retrieval. In: Proceedings of the 26th ACM International Conference on Multimedia, pp 1398–1406. https://doi.org/10.1145/3240508.3240535
Zhang W, Gui L, He Y (2021) Supervised contrastive learning for multimodal unreliable news detection in covid-19 pandemic. In: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, pp 3637–3641. https://doi.org/10.1145/3459637.3482196
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778. https://doi.org/10.1109/cvpr.2016.90
Boididou C, Andreadou K, Papadopoulos S, Dang-Nguyen D-T, Boato G, Riegler M, Kompatsiaris Y (2015) Verifying multimedia use at mediaeval 2015. MediaEval 3(3):7
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2425–2433. https://doi.org/10.1109/iccv.2015.279
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2961–2969
Song X, Wu N, Song S, Zhang Y, Stojanovic V (2023) Bipartite synchronization for cooperative-competitive neural networks with reaction-diffusion terms via dual event-triggered mechanism. Neurocomputing 550:126498. https://doi.org/10.1016/j.neucom.2023.126498
Song X, Wu N, Song S, Stojanovic V (2023) Switching-like event-triggered state estimation for reaction-diffusion neural networks against dos attacks. Neural Process Lett 55(7):8997–9018. https://doi.org/10.1007/s11063-023-11189-1
Wang R, Zhuang Z, Tao H, Paszke W, Stojanovic V (2023) Q-learning based fault estimation and fault tolerant iterative learning control for mimo systems. ISA Trans 142:123–135. https://doi.org/10.1016/j.isatra.2023.07.043
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant nos. 62073263, 62102105), the Natural Science Basic Research Program of Shaanxi (Grant no. 2023-JC-YB-575), the Open Research Subject of State Key Laboratory of Intelligent Game (Grant no. ZBKF-24-02), the Fundamental Research Funds for the Central Universities (Grant no. D5000230112), the Young Talent Fund of Association for Science and Technology in Shaanxi (Grant no. 20240105), and the Shaanxi Provincial Natural Science Foundation (Grant no. 2024JC-YBQN-0620).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhu, P., Hua, J., Tang, K. et al. Multimodal fake news detection through intra-modality feature aggregation and inter-modality semantic fusion. Complex Intell. Syst. 10, 5851–5863 (2024). https://doi.org/10.1007/s40747-024-01473-5
DOI: https://doi.org/10.1007/s40747-024-01473-5