1 Introduction

The rapid expansion of the web and the adoption of social media platforms have prompted people to create and share more misleading, fake, and irrelevant information, negatively impacting individuals and society. Fake news or misinformation can be defined as false news stories of a sensational nature, which are created and widely spread for various purposes such as generating revenue or discrediting or promoting a public figure or political movement. For example, after the World Health Organization (WHO) declared COVID-19 a pandemic, misinformation about symptoms and cures created various health hazards (Zhang et al., 2021a). News articles with visual information (images and video) attract more readers than text-only articles. Fake news creators take advantage of this tendency to create and disseminate fake and misleading multimedia information.

In recent years, multimedia news that pairs a previously published (non-novel) image with different (novel) text has been used on social media to mislead people in targeted or general communities. Such fake news is produced by taking an image published in an earlier post and changing its associated text to fit the current setting. Since the image looks authentic and aligns with the new text, detecting this category of fake news is very challenging. Figure 1 shows two examples of this category of misleading news. The first image was circulated with the claim that China has launched an artificial Sun. Later, fact-checking websites investigated and reported that a similar image had been circulated previously with the caption "Visitors gather along the beach to witness the maiden blast-off of China’s new Long March-8 carrier rocket in Wenchang, south China’s Hainan Province, December 22, 2020" (Footnote 1). The second image went viral on social media platforms during the Russia-Ukraine conflict in 2022. The image was accompanied by the text "this photo brought tears to my eyes. Two young Ukrainian children sending off soldiers to fight the Russians. The kids holding hands, the girl holds a stuffed animal and the boy (her brother?) salutes. Look what he has on his back. This picture speaks a thousand words." Later, www.indiatoday.in fact-checked the post and reported that the image had been taken from the Children of War album and had been published in 2016 (Footnote 2). These examples suggest that this kind of fake news or misinformation cannot be detected without context, i.e., background knowledge or prior information about the actual event during which the image was captured.

Fig. 1 The first image was circulated on Twitter and Facebook with the claim "China has launched an artificial Sun." The second image went viral during the Russia-Ukraine conflict in 2022 with the text "this photo brought tears to my eyes. Two young Ukrainian children sending off soldiers to fight the Russians. The kids holding hands, the girl holds a stuffed animal and the boy (her brother?) salutes. Look what he has on his back. This picture speaks a thousand words."

In recent years, image repurposing, which publishes untampered images with modified metadata to create and disseminate rumors, has become a widely used method for spreading misinformation on social media (Gunawan et al., 2023). Our work is closely related to misinformation detection using Image Repurposing Detection (IRD) (Jaiswal et al., 2019); however, it has the following advantages: i) IRD extracts the semantic relationship between an image and its caption based on the objects in the image and the keywords in the caption. If the keywords in the caption do not match the objects in the image, it becomes difficult to find the semantic relationship. In contrast, we consider the entire background information (when and where the image was published before), with all the metadata it contains, along with the image claim, which yields a better semantic relationship. ii) In IRD, the caption is very short (one short sentence in almost all the examples) and may not describe the image properly. In our work, the background information is long enough (one paragraph in almost all the examples) to justify the image in the context of the claim (caption), which results in better multimodal feature extraction. iii) IRD does not explore semantic similarity based on the emotion-aware and novelty features present in the image and text data, whereas our work also uses these features for multimodal feature extraction.

Although the news contents (text, image, video, etc.) are essential for misinformation detection, implicit characteristics such as novelty (an element of surprise or an uncommon phenomenon) and affect information (emotion and sentiment) also play an influential role. A study on Twitter (Vosoughi et al., 2018) showed that false news is more novel than true news and that users are more likely to retweet novel content. Apart from novelty, emotion and sentiment are also driving factors for fake news virality (Zhang et al., 2021b). Various studies on human psychology (Footnote 3) argue that the emotional conviction and novelty of fake news are the salient aspects that accelerate its dissemination and lead readers to believe it. These studies show that fake news inspires emotions such as fear, disgust, and surprise, whereas true news inspires emotions such as anticipation, sadness, joy, and trust. Numerous unimodal, multimodal, novelty-aware, and affect-aware mechanisms are available for misinformation detection; however, existing approaches do not adequately explore the importance of background information along with novelty and emotion in a multimodal setup for misleading fake news detection. There is also no dataset available that could help to design misleading news detection methods. To address these limitations, we first develop a multimodal fake news corpus, NovEmoFake, and then design a deep learning-based model that uses novelty detection with supervised contrastive learning and emotion prediction to classify the news as fake or real.

The contrastive learning framework tries to bring the representations of a single class closer together while simultaneously pushing away the representations of the other class. This separates the classes in the latent space, making one class different, or novel, with respect to the other. In addition, the relationship of the anchor (in our case, the multimodal target data) with the augmented data (in our case, the multimodal source data) differs for the real and fake classes. The target and the multimodal source data semantically agree for the real class, whereas they convey opposite meanings for the fake class. Hence, by using supervised contrastive learning as a proxy, we attempt to capture this difference and enforce the concept of multimodal novelty detection. Regarding emotion, many existing works have explored textual emotion, but none have investigated visual emotion for fake news detection. Prior works (Kumari et al., 2021a; Kumari et al., 2021b; Kumari et al., 2022) have investigated the role of textual novelty and emotion in fake news detection and obtained state-of-the-art (SOTA) performance on three well-known text-based fake news datasets. These results have motivated us to investigate novelty and emotion in multimodal fake news detection. The contributions of our work are summarized as follows:

  • We create and publicize a novel multimodal fake news detection dataset NovEmoFake, which is, to the best of our knowledge, the very first attempt toward creating the corpus for multimodal misleading information detection where the same image is used in a different context to convey false information.

  • We propose a multimodal framework using novelty and emotion prediction tasks for multimodal fake news detection, where the main task is to check whether the same visual information has been published earlier in a different context or not.

We organize the remainder of this paper as follows. Section 2 presents a brief survey of prior work. Section 3 describes the NovEmoFake dataset in detail. Section 4 presents the methodology. Section 5 reports the experiments, results, and error analysis, and also shows that the improvement of the proposed model over the baseline models is significant. Section 6 concludes our work and outlines directions for future research.

2 Related work

Classifying a news post as fake or real based on its content and context information is crucial. There has been rising interest in building robust algorithms and systems to detect misinformation and fake news and to stop their dissemination on social media platforms. Prior studies on misinformation detection have mainly focused on the textual content and social context of the news post. The study in (Galli et al., 2022) introduces the most frequently used deep learning-based mechanisms for fake news detection. The work presented in (Jlifi et al., 2022) explores various machine-learning algorithms for COVID-19 fake news detection. With the significant growth in multimedia news, people are creating and spreading fake news in different modalities (text, image, audio, and video). Methods based only on textual information do not give a robust solution for multimedia news posts. To overcome this limitation, many researchers have recently turned to multimodal fake news and misinformation detection. The discussion presented in (Hangloo and Arora, 2011) highlights various data collection strategies, open areas, and challenges in the multimodal misinformation detection domain. The work in (Jin et al., 2017) combines textual, visual, and social context features using an attention mechanism for fake news prediction. Building on these works, (Wang et al., 2018) and (Khattar et al., 2019) introduced deep learning-based models and showed that they handle newly emerged events better than the prior methods. Later, (Singhal et al., 2019) presented a BERT-based (Devlin et al., 2019) model for multimodal fake news detection. The work presented in (Zhang et al., 2023) identifies four types of image-text similarity, namely, semantic similarity, textual similarity, contextual similarity, and post-training similarity. Based on these similarities, this work showed that image-text similarity is higher for fake news than for real news in most cases.

To develop better feature extraction mechanisms, researchers have introduced different variations of graph neural network-based architectures for misinformation detection. The studies in (Wang et al., 2020) and (Qian et al., 2021) propose the Knowledge-driven Multimodal Graph Convolutional Network (KMGCN), which extracts semantic representations by learning textual information, visual information, and knowledge concepts in a unified framework. Because new nodes and edges continuously emerge in real-world information diffusion networks, the study in (Song et al., 2021) proposes a dynamic propagation graph-based framework for misinformation detection.

Earlier studies focused on finding the best feature extraction. In contrast, later investigations (Kumari and Ekbal, 2021; Wu et al., 2021; Uppada and Patel, 2022) have paid attention to feature fusion along with feature extraction and showed that the model’s performance also depends on the semantic interaction between the different modalities. The works presented in (Song et al., 2021) and (Jing et al., 2021) introduced a multitask learning framework to extract hidden relationships across different modalities and showed that joint learning of related tasks improves the performance of fake news prediction. The demand for automatically extracting relevant features from news content has recently led to the use of Contrastive Learning (CL). The authors of (Zhang et al., 2021a) and (Hua et al., 2023) investigated the credibility of previously published news articles on the same events as background knowledge by applying contrastive learning between previously published and the latest news on similar events.

Although different dimensions of misinformation detection have been investigated extensively, few mechanisms have focused on novelty-aware and emotion-aware misinformation detection. The study presented in (Qin et al., 2016) is the first contribution to rumor detection using a novelty detection task. In this study, the authors first compute the similarity between a viral news article and existing rumors and use this similarity as a novelty feature for rumor detection. Motivated by this idea, the works proposed in (Kumari et al., 2021a; Kumari et al., 2021b; Zhang et al., 2022; Kumari et al., 2022) have also explored the novelty between a pair of news articles or between a news title and body. These models approximate the textual entailment task for novelty prediction, which further helps in misinformation detection. In recent works, affective information mining has attracted researchers, based on the assumption that fake news content and context probably carry evident emotion and sentiment biases. Several works (Giachanou et al., 2021; Ghanem et al., 2021) have investigated the role of textual emotion, but none of them has explored visual emotion for fake news detection. As discussed earlier in this section, multimedia news is becoming more popular than textual news, which also increases the possibility of emotional appeal in visual content. Therefore, visual emotion could be an important feature for misinformation detection. Most prior models fail to detect fake news when authentic images are used in a different context, because background information about the particular event or activity during which the image was captured is insufficient or absent. Recently, the Contrastive Language-Image Pretraining (CLIP) model (Radford et al., 2021) has been introduced for out-of-context image prediction, and it is frequently used for misinformation detection. The studies presented in (Zhou et al., 2022) and (Choi et al., 2022) have proposed multimodal fake news detection models based on CLIP, which map the semantic similarity between the image thumbnail and the caption for fake news detection. Table 1 describes the limitations of existing works.

Table 1 Limitations of existing works compared to our proposed work
Fig. 2 Two examples from our dataset. The source text provides additional information about the event, which is essential for fake news detection

Although these works are very similar to our line of research, they only assess the credibility of the news based on the image-caption pair. They do not consider the entire news article or any prior information, such as events, activities, location, and other contextual information, about the particular image. In light of the limitations described in Table 1, a comprehensive large-scale multimodal dataset with background information is essential to detect fake news containing a non-novel image and novel text. However, the publicly available datasets do not include background knowledge, which limits their usefulness for this task. To fill these gaps, we first develop a novel multimodal corpus with the help of three existing multimodal fake news datasets, viz. Fauxtography (Zlatkova et al., 2019), ReCOVery (Zhou et al., 2020a), and TI-CNN (Yang et al., 2018). To design our dataset, we collect the background knowledge corresponding to each instance of these three datasets. Second, we design a robust multimodal framework using SCL-based novelty detection and emotion prediction tasks for fake news detection.

3 Data description and analysis

Various multimodal resources are now available for multimodal misinformation detection. The existing datasets are of good quality; however, none includes the background information (where and in which context the news was first published) of the news articles, which is crucial for this task. Therefore, we prepare a novel multimodal fake news dataset, NovEmoFake, which includes the source information along with the context information. Figure 2 shows two examples from our dataset. Each instance of the dataset is a source-target pair. The target is a multimodal instance taken from one of the three existing multimodal fake news datasets, viz. Fauxtography, ReCOVery, and TI-CNN. The source is the target-related background information extracted from different websites. We can see that the information provided in the source text is critical for detecting whether the news is fake or real. The dataset consists of 6816 real and 4950 fake samples. Table 2 gives brief statistics and the complete distribution of the dataset. It also shows the number of instances taken from each existing dataset to prepare our proposed dataset.

Table 2 Statistics of Fauxtography, ReCOVery, TI-CNN, and NovEmoFake datasets

3.1 Data collection

We consider Fauxtography, TI-CNN, and ReCOVery to prepare our dataset because they are balanced and include good-quality images. We form a set of target samples by combining the multimodal instances of each dataset with their class labels. Each element of this set pairs a text and an image URL with a fake or real label. We collect the background information for each multimodal instance of the target sample set in the following steps.

Source information extraction: To extract the source information, we perform Google Reverse Image Search (GRIS) (Footnote 4) using the target image URL and extract the URLs of all the sources containing text or image information related to the target image. After that, we download the text and images present on each source using the extracted URLs. We remove target instances that have no source information or only unimodal source information. In this stage, approximately 3.2% of the instances of the target set are removed. To obtain authentic background knowledge, the source websites (the websites from which the source information is extracted) must be highly credible. So, we compute the websites’ credibility in the next step.

Credible source websites selection: We check the credibility of the source websites using MediaBias (Footnote 5), a website credibility checker. MediaBias assigns each website one of six classes, viz. very high, high, primarily factual, mixed, low, and very low. We keep source information only from the very high, high, and primarily factual classes, in that order, up to a maximum of four sources per target. The four sources may come from different websites or from the same website. We keep the data extracted from reliable websites and discard the data collected from low-credibility websites. From each credible source, we extract the textual information and save all the images present. In this way, each target instance has up to four sources, where each source has a piece of text and a list of images. We consider the piece of text as the source text. Although the primary purpose of this step is to limit the background information to at most four sources, approximately 110 target instances are also removed due to the low credibility of their sources.

Source image selection: Initially, we remove the images with dimensions smaller than 50x50 from the list of images corresponding to each source and then again remove the sources left with unimodal information. Approximately 2% of the target instances are removed in this stage. Since this work deals with detecting fake news that contains non-novel images and novel text, we extract only one source image, which is identical or almost identical to the target image. We compute a 4096-dimensional vector representation for all the source and target images using the pre-trained VGG16 (Tammina, 2019) model. After that, we calculate the cosine similarity between the vector representation of the target image and those of all the source images. Finally, we keep only one image per source, namely the one most similar to the target image. This highly similar image is the final source image.
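The selection step above can be illustrated with a minimal sketch. It assumes the image features have already been extracted (here, toy 4-dimensional vectors stand in for the 4096-dimensional VGG16 representations), and the function names `cosine_similarity` and `select_source_image` are hypothetical helpers introduced only for illustration, not part of the original pipeline.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D feature vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def select_source_image(target_vec, source_vecs):
    """Return (index, similarity) of the source image most similar to the target."""
    sims = [cosine_similarity(target_vec, s) for s in source_vecs]
    best = int(np.argmax(sims))
    return best, sims[best]

# Toy stand-ins for the image feature vectors (assumption: real ones are 4096-d).
target = np.array([1.0, 0.0, 2.0, 0.0])
sources = [
    np.array([0.0, 1.0, 0.0, 1.0]),  # unrelated image
    np.array([2.0, 0.0, 4.0, 0.0]),  # same direction as the target
    np.array([1.0, 1.0, 1.0, 1.0]),  # partially similar image
]
best_index, best_sim = select_source_image(target, sources)
```

Because cosine similarity ignores vector magnitude, a rescaled copy of the target features is still ranked as the closest match, which suits the goal of finding an identical or almost identical image.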

3.2 Data annotation

Since we create the NovEmoFake dataset from existing datasets, the instances already have fake or real labels. In this work, we annotate the data instances to determine whether the background knowledge extracted from different sources is relevant to the corresponding target instance. The aim of this annotation is to keep an example if the extracted metadata is relevant to the target and to discard it otherwise. In this way, we assign yes or no as the label for each instance. This annotation is based only on the textual content of the source and target data instances. We perform automatic annotation for all the instances and human annotation for 200 instances to check the quality of the automatic annotation. Figure 3 shows the complete data annotation steps followed during data preparation, and the subsequent paragraphs briefly describe the automatic and human annotation schemes.

Fig. 3 Overall workflow to develop the NovEmoFake corpus. Here, T_T, T_I, S_T and S_I represent the Target_Text, Target_Image, Source_Text and Source_Image, respectively. CS and CST represent the cosine similarity for image and text, respectively. STE and TTE denote the Source Text Entity and Target Text Entity, respectively. Th is a threshold

Automatic annotation: We consider two types of annotations for each source-target pair of the NovEmoFake dataset: (i) In the first annotation type, we assign the label "yes" if the source is relevant to the target; otherwise, we assign the label "no". We obtain these labels based on \(R_{value}\). To compute \(R_{value}\), we extract the named entities from the source and target texts and take the ratio of the number of entities common to the source and target texts to the number of entities present in the target. Formally, let S and T be the sets of entities present in the source and target, respectively. The ratio (\(R_{value}\)) is defined in Eq. 1.

$$\begin{aligned} R_{value} = \frac{|S \cap T|}{|T|} \end{aligned}$$
(1)

We assign the label "yes" to the instance with the maximum \(R_{value}\) and the label "no" to the remaining instances; (ii) In the second annotation type, we assign the label fake or real. As discussed in the preceding section, the label of a NovEmoFake data instance (source-target pair) is the same as the label of its target. More specifically, if the target label is fake, then the label of the source-target pair is fake. Hence, the annotation of our dataset is entirely automatic and based on the label of the target.
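The relevance score in Eq. 1 and the labelling rule can be sketched as follows. This is a minimal illustration that assumes the named entities have already been extracted by some NER tool (the source does not name one), and lowercasing entities before matching is an assumption of this sketch. The helpers `r_value` and `label_sources` are hypothetical names.

```python
def r_value(source_entities, target_entities):
    """R_value = |S ∩ T| / |T| for two collections of entity strings."""
    S = {e.lower() for e in source_entities}  # assumption: case-insensitive match
    T = {e.lower() for e in target_entities}
    if not T:
        return 0.0  # no target entities: treat the source as not relevant
    return len(S & T) / len(T)

def label_sources(sources_entities, target_entities):
    """Label the source with the maximum R_value "yes" and all others "no"."""
    scores = [r_value(s, target_entities) for s in sources_entities]
    if not scores:
        return [], []
    best = max(range(len(scores)), key=scores.__getitem__)
    labels = ["yes" if i == best else "no" for i in range(len(scores))]
    return labels, scores

# Toy example with hand-written entity lists.
target_entities = ["China", "Wenchang", "Hainan", "Long March-8"]
sources_entities = [
    ["China", "Hainan", "Wenchang"],  # shares 3 of the 4 target entities
    ["Russia", "China"],              # shares 1 of the 4 target entities
]
labels, scores = label_sources(sources_entities, target_entities)
```

In this toy example the first source obtains a score of 0.75 and the second 0.25, so only the first is labelled "yes". A threshold on the score, as suggested by Th in Fig. 3, could replace the maximum rule; the exact threshold is not stated in the text.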

Human annotation: We cross-verify the quality of the automatic annotation by performing human annotation on 200 randomly selected instances. We select 100 fake and 100 real samples for human annotation, covering instances from all three datasets in equal proportion. Each instance contains a target_image_URL, target_text, and information about its sources. The source information includes source_URL, source_text, source_image_URL, and source_reliability. We provide the selected instances to two well-qualified human annotators. One holds a postgraduate degree in English literature, and the other is pursuing a doctoral degree in natural language processing. We also ensure that the annotators are proficient in reading, writing, and speaking English. The annotators were provided with the following guidelines: (i) Go through every source text corresponding to a target instance, and if the source text is in a language other than English, assign the label "no"; (ii) Otherwise, read some parts of the source text and verify the following key points: (a) the source text is relevant to the image; (b) the source text gives an accurate description of the image; (c) the source text provides background knowledge about the image. If any of the above points is true, assign the label "yes" to that particular source. To obtain information that is relevant and highly correlated to the target instance, we exclude the non-English source information, which is very small in number and does not affect the model performance. We compute Cohen’s Kappa coefficient (Cohen, 1960) on the 200 instances between (i) the automatic annotation and the first human annotator, (ii) the automatic annotation and the second human annotator, and (iii) the first and second human annotators. We obtain agreement scores of 89.08%, 89.34%, and 87.15% for these three pairs, respectively, which indicates that the automatically annotated dataset is of good quality.
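For reference, Cohen’s Kappa between two label sequences can be computed as in the following minimal sketch. The label lists are toy data, not the 200 annotated instances, and `cohen_kappa` is a hypothetical helper written for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two equally long label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Toy example: the two annotators disagree on one of ten items.
auto = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
human = ["yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes"]
kappa = cohen_kappa(auto, human)
```

Unlike raw percentage agreement, Kappa discounts the agreement expected by chance, which matters when one label (for example "yes") dominates.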

4 Methodology

Prior works on multimedia misinformation detection explore various feature extraction and feature fusion techniques to gather useful information from the news. However, how the background information of news affects decision-making is still an open question. To address the issues of existing approaches, in this section, we propose a multimodal fake news detection framework, which consists of three components: the novelty detection module using SCL, the image emotion prediction module, and the fake news detection module. The novelty detection module assesses the credibility of the incoming news (target) with respect to prior verified news (source). If the source supports the target news, this indicates that the target does not carry any novel information with respect to the source. Also, if the source is fake and it supports the target, the target news is also fake, and vice versa. In this way, the novelty model extracts novelty-aware multimodal feature representations from the news pair. The image emotion prediction module extracts emotion-aware visual feature representations. Finally, the fake news detection module classifies the news as fake or real. Figure 4 illustrates the overall model structure.

Fig. 4 The proposed model consists of three components. The novelty model finds the multimodal semantic representations of the source and target pair using SCL. The emotion model extracts the emotion-aware visual representation, and finally the fake news detection model classifies the news as fake or real

4.1 The novelty detection using SCL

We perform a novelty detection task using SCL to find high-level semantic relationships within target and source multimodal news pairs and extract the novelty-aware multimodal feature representations from these news pairs. As discussed below, we give the multimodal source and target as input to the multimodal encoder.

Multimodal encoder: We develop the encoder in two different ways. (i) We encode the text data using the pre-trained BERT-Base architecture with 12 layers, 768 hidden nodes, 12 attention heads, and 110M parameters. It extracts 768-dimensional textual feature representations. The input to the BERT model is the word sequence \(w = (w_1, \ldots , w_n)\), where n is the sequence length. To encode the visual data, we use a pre-trained ResNet18 model, which has approximately 23 million trainable parameters and consists of 5 stages. Each stage includes a convolution block and an identity block. Each convolution block and each identity block is made up of 3 convolution layers. After extracting the features, we concatenate the textual and visual features to obtain the multimodal feature representations. (ii) We directly encode the multimodal data using the pre-trained VisualBERT model with 768 hidden nodes, 12 layers, 12 attention heads, and a visual embedding dimension of 512 (Li et al., 2019) to obtain multimodal feature representations. We employ two fully connected layers over the encoded source and target representations to project them into a 128-dimensional latent space. The projected source and target features are denoted by MS and MT, respectively. We then train the model using contrastive learning so that the target representation attracts the source representation if both are of the same class (or both support each other); otherwise, the target repels the source representation. This separates the classes in the latent space, making one class different, or novel, with respect to the other. In this way, learning novelty through contrastive learning captures the complex semantic relations between the source and target and also finds better semantic interactions between the different modalities.

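We train the novelty model by optimizing a contrastive loss function similar to that of (Khosla et al., 2020), shown in Eq. 2. Here, I is the set of indices of the target (anchor) samples; P(i) is the set of positive samples for anchor i (samples of the same class as the anchor); A(i) is the set of all samples contrasted with anchor i, as in (Khosla et al., 2020); \(z\) denotes a projected representation; and \(\tau \) is a scalar temperature parameter.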

$$\begin{aligned} \small L_{SCL} = \sum _{i \in I}\frac{-1}{|P(i)|}\sum _{p\in P(i)}\log \frac{\exp \left( \frac{z_i \cdot z_p}{\tau }\right) }{\sum _{a\in A(i)} \exp \left( \frac{z_i \cdot z_a}{\tau }\right) } \end{aligned}$$
(2)
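To make the loss concrete, the following is a minimal NumPy sketch of a supervised contrastive loss in the generic batch form of Khosla et al. (2020), where every sample acts as an anchor and A(i) contains all other samples in the batch. It is not the authors' implementation: the L2 normalization of the embeddings, the temperature value of 0.1, and the function name are assumptions made for illustration.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, tau=0.1):
    """Supervised contrastive loss summed over all anchors with at least one positive.

    z:      (n, d) array of projected representations, n >= 2
    labels: (n,) array of class labels
    tau:    temperature (assumed value)
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # assumption: unit-norm embeddings
    n = z.shape[0]
    logits = z @ z.T / tau                             # pairwise z_i . z_a / tau
    not_self = ~np.eye(n, dtype=bool)                  # A(i): every sample except i
    # Numerically stable log of the denominator, sum over a in A(i) of exp(logit).
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    # P(i): samples with the same label as anchor i, excluding i itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(summed[valid] / n_pos[valid]).sum())

# Toy check: embeddings that cluster by class give a lower loss than mixed ones.
z = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_separated = supervised_contrastive_loss(z, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(z, [0, 1, 0, 1])
```

The loss is small when same-class representations lie close together and large when they are interleaved with the other class, which is the behaviour the novelty model relies on.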

4.2 Image emotion prediction

We design a neural network-based emotion classification model to obtain the emotion-aware visual feature representation. To pre-train this network, we combine the UnbiasedEmo (Panda et al., 2018) and ArtPhoto (Machajdik and Hanbury, 2010) datasets. These two datasets are collections of general-domain images with various emotion labels, which fit the images of the NovEmoFake dataset well. Each image in the combined dataset carries one label among joy, love, sadness, fear, surprise, and anger. For our experiments, we follow (Kumari et al., 2021a) and consider two emotion labels: emotion true, formed by combining the joy, love, and sadness labels, and emotion false, formed by combining fear, surprise, and anger. Given a set of n images I = \((I_1, \ldots , I_n)\) and their emotion labels EL = \((EL_1, \ldots , EL_n)\), we encode each image \(I_i\) using ResNet18. We then pass this encoded image representation to a Multilayer Perceptron (MLP) network consisting of two hidden layers with 1024 and 512 dimensions, one output layer with two neurons, and a softmax classifier. Since the number of instances in each emotion class is not balanced, we optimize the weighted cross-entropy loss during training. After training this emotion model, we predict the emotion labels of the images present in the NovEmoFake dataset. Equation 3 gives the mathematical description of the image emotion model. The encoded representations (\(IR_i\)) obtained from ResNet18 are projected into 1024-dimensional and then 512-dimensional latent representations.

$$\begin{aligned} \begin{aligned} IR_i = ResNet18(I_i); IR1_i = tanh(W_{IR_i}*IR_i + b_{IR_i})\\ IR2_i = tanh(W_{IR1_i}*IR1_i + b_{IR1_i})\\ I_{out} = Softmax(W_{IR2_i}*IR2_i + b_{IR2_i}) \end{aligned} \end{aligned}$$
(3)
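The forward pass in Eq. 3 and a weighted cross-entropy loss can be sketched in NumPy as below. This is an illustrative sketch with randomly initialized weights and random vectors standing in for the ResNet18 outputs \(IR_i\). The 512-dimensional input size, the initialization scale, the class weights, and all function names are assumptions, not values taken from the paper.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def init_params(in_dim, rng):
    """Random weights for the layers in_dim -> 1024 -> 512 -> 2."""
    dims = [in_dim, 1024, 512, 2]
    params = {}
    for k in range(3):
        params[f"W{k + 1}"] = rng.normal(0.0, 0.02, size=(dims[k], dims[k + 1]))
        params[f"b{k + 1}"] = np.zeros(dims[k + 1])
    return params

def emotion_forward(ir, params):
    """Two tanh hidden layers followed by a softmax output, as in Eq. 3."""
    ir1 = np.tanh(ir @ params["W1"] + params["b1"])
    ir2 = np.tanh(ir1 @ params["W2"] + params["b2"])
    probs = softmax(ir2 @ params["W3"] + params["b3"])
    return probs, ir2  # ir2 is the 512-d emotion-aware representation

def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy where each sample is weighted by the weight of its class."""
    labels = np.asarray(labels)
    weights = np.asarray(class_weights, dtype=float)[labels]
    picked = probs[np.arange(len(labels)), labels]
    return float(-(weights * np.log(picked + 1e-12)).sum() / weights.sum())

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 512))  # stand-in for encoded images (assumed 512-d)
params = init_params(512, rng)
probs, ir2 = emotion_forward(features, params)
loss = weighted_cross_entropy(probs, [0, 1, 1, 0], class_weights=[1.0, 2.0])
```

Giving the rarer class a larger weight increases its contribution to the loss, which is one common way to compensate for class imbalance.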

4.3 Fake news detection

After pre-training the novelty model, we extract the 512-dimensional feature representations for the source (\(N_{SR}\)) and target (\(N_{TR}\)) and fuse them to obtain multimodal representation (\(N_{MR}\)) following (4).

$$\begin{aligned} \begin{aligned} a = N_{TR} + N_{SR}; b = N_{TR} - N_{SR}; c = N_{TR} * N_{SR}\\ N_{MR} = Concat(a,|b|,c) \end{aligned} \end{aligned}$$
(4)
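The fusion in Eq. 4 can be written directly in NumPy, as in the short sketch below. The function name `fuse` and the toy 2-dimensional inputs are illustrative assumptions; with 512-dimensional source and target representations, the fused vector has 3 x 512 = 1536 dimensions.

```python
import numpy as np

def fuse(n_tr, n_sr):
    """Concatenate the sum, absolute difference, and element-wise product (Eq. 4)."""
    a = n_tr + n_sr
    b = n_tr - n_sr
    c = n_tr * n_sr
    return np.concatenate([a, np.abs(b), c], axis=-1)

# Toy 2-dimensional target and source representations.
n_tr = np.array([1.0, -2.0])
n_sr = np.array([3.0, 1.0])
n_mr = fuse(n_tr, n_sr)
```

Because the sum, the absolute difference, and the product are all symmetric in their arguments, swapping the source and target yields the same fused vector.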

Finally, we project this fused representation into a 512-dimensional feature space and use it as the novelty-aware multimodal feature representation in our fake news detection model. Equation 5 describes the final novelty-aware feature representation, where FN is the final novelty representation, and \(W_{N_{MR}}\) and \(b_{N_{MR}}\) are the weight and bias applied to \(N_{MR}\), respectively. We extract 512-dimensional emotion-aware visual feature representations from the pre-trained image emotion model and perform scaffolding by concatenating a 200-dimensional representation of the predicted emotion labels. This gives 712 (512+200)-dimensional emotion-aware visual feature representations. We then apply Principal Component Analysis (PCA) to obtain 128-dimensional emotion-aware feature representations of the images. Here, FE is the final emotion-aware feature representation, and SL and FE1 are intermediate layers.

$$\begin{aligned} \begin{aligned} FN = tanh((W_{N_{MR}}*N_{MR})+b_{N_{MR}})\\ SL = Scaffolding(emotionlabels)\\ FE1 = Concat(IR2, SL); FE = PCA(FE1) \end{aligned} \end{aligned}$$
(5)

After obtaining the novelty-aware multimodal representation and the emotion-aware visual representation, we concatenate them and pass the result to an MLP that contains two hidden layers and an output layer with a softmax function to classify the news as fake or real. Equation 6 depicts the final classification, where FND1, FND2, and FND3 are the intermediate layers and FND is the final layer that classifies the news as fake or real. We optimize the cross-entropy loss shown in Equation 7 to train our fake news detection model, where y is the original class and p is the model's prediction for the news post.

$$\begin{aligned} \begin{aligned} FND1 = Concat(FN, FE)\\ FND2 = tanh(W_{FND1}*FND1 + b_{FND1})\\ FND3 = tanh(W_{FND2}*FND2 + b_{FND2})\\ FND = Softmax(W_{FND3}*FND3 + b_{FND3}) \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} L = -\sum \left[ y \log p + (1-y) \log (1-p) \right] \end{aligned}$$
(7)
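The loss in Equation 7 can be written out as a short sketch. It assumes that y is encoded as 0 or 1 and that p is the predicted probability of class 1; the clipping constant `eps` is an added numerical safeguard and is not part of the paper.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Sum of -[y log p + (1 - y) log(1 - p)] over all news posts."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # ASSUMED safeguard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total


# Confident correct predictions give a lower loss than confident wrong ones.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

The sketch sums over posts, matching the unnormalized form of Equation 7; dividing by the number of posts would give the mean loss instead.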

5 Experiments and results

This section discusses the experimental setup, the performance of the baselines, and the experimental results in detail. It also analyses errors on examples from the NovEmoFake dataset for which our proposed model does not give the correct prediction.

5.1 Experimental setup

To pre-process the text, we first remove punctuation and convert the text to lowercase. We limit the text length to 300 words. For the images in our dataset, we load each image at 224×224 dimensions. We use the PyTorch library for all our experiments with a single Nvidia GeForce RTX GPU with 10 GB of RAM. We evaluate the performance of the system in terms of accuracy, Micro Average (MA), and Weighted Average (WA) of the F1 score. We train our baseline models for 100 epochs using the Adam optimizer. We pre-train our contrastive learning framework for 1000 epochs using the Layer-wise Adaptive Rate Scaling (LARS) optimizer for Stochastic Gradient Descent (SGD) with a batch size of 512, which takes ten minutes. We pre-train the emotion model for 100 epochs using the Adam optimizer with a batch size of 128, which also takes ten minutes. We train the final proposed model using different variations of hyper-parameters. Table 3 shows the details of the hyper-parameters. Across all the experiments, we observe that our proposed model gives the best result for 100 epochs with a batch size of 128. It optimizes the loss best using the Adam optimizer with a learning rate of 0.01 and takes fifteen minutes to run. The proposed model performs best when the dataset is split in an 80:20 ratio into train and test sets. The train set is further divided in a 90:10 ratio into train and validation sets. We report the results on the test set. The total number of parameters of the proposed model is 17,533,985.
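The text pre-processing and the two-stage split described above can be sketched as follows. The function names, the fixed seed, and the use of ASCII-only `string.punctuation` are assumptions made for illustration; the paper does not specify them.

```python
import random
import string


def preprocess(text, max_words=300):
    """Remove ASCII punctuation, lowercase, and keep at most max_words words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split()[:max_words])


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """80:20 train/test split, then 90:10 train/validation split of the train part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # ASSUMED: random shuffling with a fixed seed
    n_test = round(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(1000)
```

For 1000 instances this gives 720 training, 80 validation, and 200 test instances, that is, 72%, 8%, and 20% of the data overall.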

Table 3 Variations of hyper-parameters for the final proposed model

5.2 Baselines and comparing systems

We implement the following baseline models to compare against our proposed model and show all the results in Table 4.

Table 4 Evaluation results with/without background knowledge. Here, WBG: With Background Knowledge, W/O BG: Without Background Knowledge, Acc: Accuracy, MA: Micro Average, WA: Weighted Average

BERT + ResNet18: We use the pre-trained BERT and the pre-trained ResNet18 model to encode textual and visual information, respectively. We concatenate the textual and visual representations to obtain multimodal representations for both the source and target. We pass this multimodal representation to an MLP network having two hidden layers and one output layer with a softmax function. We implement two variations of this model, one without background knowledge (using only target data) and the other with background knowledge (using the source-target pair).

BERT (text only): In this baseline model, we provide only the target text and its corresponding source text as input. We obtain the 768-dimensional source and target text representations using the pre-trained BERT model. We concatenate the target and source features and then employ a simple feed-forward neural network with two feed-forward layers with 512 and 128 units and a tanh activation function. We finally pass the 128-dimensional representation to the output layer with two neurons and a softmax activation function.

ResNet18 (image only): We use only the target images for fake news detection as one of our ablation studies. We obtain the 512-dimensional representation of each target image using a pre-trained ResNet18 model. For the final fake news classification, we employ the same feed-forward neural network as in the text-only model.

VisualBERT: Instead of encoding the textual and visual information separately, we encode the multimodal information using VisualBERT (Li et al , 2019). After obtaining the representation, we employ an MLP model similar to the previous baseline model. We also implement two different variations, one with background knowledge and the other without background knowledge.

SAFE: We implement the SAFE (Zhou et al , 2020b) model as a baseline. This model uses the news headline, body, and image to find their similarity for fake news detection. We use the target text as the headline, the source text as the body, and the target image as image input for this model. Since our dataset assigns real and fake labels based on the agreement of the target and the source text, we believe that SAFE will serve as one of its strong baselines as it detects the mutual similarity between the news headline, body, and image.

EANN: We also implement Event Adversarial Neural Networks (Wang et al , 2018) as one of our baselines. The EANN model learns event-invariant features through an adversarial setup for multimodal fake news detection, which allows it to perform well on unseen events. As our dataset does not come with explicit event labels, we devise a mechanism to obtain them. We first extract the multimodal features of our target data (image and text) using the CLIP model. We then apply K-Means++ clustering (Arthur and Vassilvitskii , 2007) to the extracted features with 10 clusters, which is the default number of clusters used in the EANN framework. We treat the cluster labels of the samples as event labels. We then use these event labels and the target instances to train the EANN framework.

KMGCN: We implement a graph-based model following the study presented in (Wang et al , 2020) as a baseline model. That study introduces a Knowledge-driven Graph Convolutional Network (KMGCN), which models semantic representations by combining knowledge concepts, textual information, and visual information in a single framework for misinformation detection. Following this idea, we extract the source and target multimodal features using a graph convolutional neural network. We then concatenate both features and pass them to an MLP model with two hidden layers for the final classification.
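The event-label step described for the EANN baseline can be illustrated with a simplified sketch. It shows only the K-Means++ seeding rule (each new center is sampled with probability proportional to the squared distance to the nearest existing center) followed by a single nearest-center assignment; a full K-Means run would also iterate the center updates. The function names and the toy 2-dimensional points are hypothetical stand-ins for the CLIP features.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_pp_seeds(points, k, seed=0):
    """Choose k initial centers with the K-Means++ D^2 sampling rule."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def assign_event_labels(points, centers):
    """Use the index of the nearest center as the pseudo event label."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centers = kmeans_pp_seeds(points, k=2)
event_labels = assign_event_labels(points, centers)
```

In the setting described above, `points` would hold one feature vector per target instance and `k` would be 10.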

5.3 Results and discussion

Table 4 compares the baseline models with and without background knowledge. We report the results for both the BERT+ResNet and VisualBERT models, and also on each base dataset that makes up our NovEmoFake data. The results in Table 4 (in bold font) show that both models perform significantly better with background knowledge on all four datasets. On the NovEmoFake data, we note a 16.06% increase in accuracy using BERT+ResNet and a 9.9% increase with VisualBERT. These results indicate that background knowledge helps in detecting fake news effectively. We also see that the SAFE model with background knowledge yields the best performance among all the baselines. The BERT+ResNet model performs better than the EANN model because it obtains richer textual embeddings using BERT. EANN outperforms VisualBERT because some images in our dataset do not contain the discrete objects that VisualBERT relies on. For the same reason, the KMGCN model does not perform well, as the object described in the text is not directly present in the image for most instances. As expected, the multimodal models outperform the text-only model because they use multimodal information instead of text alone. The target image-only model yields the lowest accuracy among all the experiments because it is hard to categorize news as real or fake from images alone, without any additional information. This observation indicates that background knowledge plays a major role in misinformation detection, which justifies our primary research objective. In this table, we also observe that all the models perform better on the individual datasets than on the NovEmoFake dataset. This is because the datasets have different characteristics, and the baseline models cannot extract sufficiently generalized features from the merged dataset. Our proposed model also overcomes this limitation.

Table 5 Results of the proposed model only with novelty and only with emotion. PMN: Proposed Model with Novelty, PME: Proposed Model with Emotion
Table 6 Results of the proposed model without novelty and emotion and with novelty and emotion. We observe that the model performs better when we use both novelty and emotion along with background knowledge. PM: Proposed Model without Novelty and Emotion, PMNE: Proposed Model with Novelty and Emotion

We present the results of our proposed model using only novelty and only emotion in Table 5. We show the results for BERT+ResNet and VisualBERT with all four data split configurations. We also show the results for the text-only and image-only models on the NovEmoFake dataset. We observe that the proposed model works better with novelty than with emotion in all cases. This is because all four split datasets are collections of news articles that may not have strong emotional content. However, the results in Table 6 show that emotion and novelty each improve the model performance on their own, and when we consider these two factors together, the model’s performance increases by a large margin. This observation supports the claim that novelty detection and emotion recognition play major roles in fake news detection.

We present the results of our proposed model without novelty and emotion and with both novelty and emotion in Table 6. We show the results for both BERT+ResNet and VisualBERT with all four data split configurations mentioned above. We also show the results for the text-only and image-only models on the NovEmoFake dataset. For the best-performing model (BERT+ResNet), using novelty and emotion yields an 11.58% accuracy improvement over the model without novelty and emotion. We observe 7.10% and 9.48% accuracy improvements of the proposed model over the model with novelty only and the model with emotion only, respectively. We also observe a 10.11% accuracy improvement of our proposed model over a simple BERT+ResNet model with background knowledge.

We also notice accuracy improvements of 1.92% and 1.75% over the SAFE model using BERT+ResNet and VisualBERT, respectively. Hence, we conclude that background knowledge, emotion, novelty, and contrastive learning all contribute effectively, and that our final proposed architecture using novelty and emotion outperforms all the baselines. The results also indicate that our proposed model is capable of generalizing across datasets with different characteristics.

5.4 Case studies

Figure 5 shows a sample from the ReCOVery dataset showing the U.K.’s Prime Minister, which is originally labeled as fake. Our final proposed model with BERT+ResNet predicts it as fake, whereas our novelty-only model predicts it as real. Here, the target image’s emotions play a significant role in detecting the news as fake. The picture conveys surprise, anger, and disgust, emotions that belong to the fake emotion class. Hence, our final model using novelty and emotion performs better than the novelty-only model.

Fig. 5 This figure shows an example that is correctly classified by our proposed model. Here, GTL: Ground Truth Label, MME: Multimodal Encoder

5.5 Error analysis

Figure 6 shows three samples taken from the NovEmoFake dataset. For the first example, the complete novelty+emotion model performs better because the emotions conveyed in the picture are fear and disgust, which belong to the fake class. Also, the source and target texts come from different contexts, although both mention the names of President Trump and Israel’s Prime Minister. Our contrastive learning framework captures this better than the SAFE framework. In the second example, the novelty-only framework performs better than novelty and emotion together. The source text clearly states that the picture is photoshopped and that the original picture was taken in the 1980s. Hence, the novelty detection model can flag this news as fake. However, the photo conveys sadness, which is, in general, an attribute of true news. As a result, the combined novelty and emotion model is confused and predicts this news as real. For the third example, the model with background knowledge does not perform as well as the model without background knowledge. Without background knowledge, the target image and text are in sync with each other, i.e., both convey that tanks are present at a location. With background knowledge, the additional information about Putin and Trump confuses the model and draws it away from the main subject, the presence of tanks. Hence, the model without background knowledge correctly classifies this news as real, whereas the model with background knowledge classifies it as fake.

Fig. 6 Error analysis: examples for which our proposed model does not give a correct prediction

5.6 Significance test

To determine whether the results we obtain using our models are statistically significant, we use the McNemar significance test (Pembury Smith and Ruxton , 2020). We set the threshold p-value for rejecting the null hypothesis to 0.01. We report the p-values for different configurations of the runs with both base models, BERT+ResNet and VisualBERT, in Table 7. From the table, we can see that the p-values are less than 0.01 for every comparison and for both base models: with versus without background knowledge, the proposed model without novelty and emotion versus the proposed model with novelty and emotion, and SAFE versus the proposed model with novelty and emotion. Hence, we conclude that the results obtained are statistically significant.
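As an illustration of how such a comparison can be computed, the sketch below implements the chi-square form of the McNemar test on the two discordant counts of a pair of models. The paper does not say whether the exact or the chi-square variant was used, nor whether a continuity correction was applied, so both choices here are assumptions; the function name and the toy predictions are hypothetical.

```python
import math


def mcnemar(y_true, pred_a, pred_b, continuity=True):
    """Return (statistic, p_value) of the chi-square McNemar test (1 d.o.f.)."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a == y and m != y)
    c = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a != y and m == y)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    diff = abs(b - c) - (1 if continuity else 0)  # ASSUMED continuity correction
    stat = max(diff, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: model B fixes 20 of model A's errors and introduces 2 new ones.
y_true = [1] * 30
pred_a = [1] * 2 + [0] * 20 + [1] * 8
pred_b = [0] * 2 + [1] * 20 + [1] * 8
stat, p_value = mcnemar(y_true, pred_a, pred_b)
```

Only the instances on which the two models disagree enter the statistic, which is why the test suits paired predictions on the same test set.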

Table 7 P-values obtained using McNemar statistical significance test. Here WBG - With Background Knowledge. W/O BG - Without Background Knowledge. PM - Proposed model without novelty and emotion. PMNE - Proposed Model with Novelty and Emotion

6 Conclusion

In this work, we have defined a new problem and proposed an effective solution for fake news detection in which the fake multimedia news contains non-novel images but novel text. This form of fake news detection demands a new multimodal fake news dataset that includes background knowledge for each instance. We first create the NovEmoFake dataset from the Fauxtography, ReCOVery, and TICNN datasets by augmenting them with background knowledge retrieved from the internet. We argue through examples and experiments that background knowledge is critical for effective fake news detection, which justifies our effort in forming a new dataset. Also, since novelty and emotion are key elements in the spread of fake news, we investigate their effects by designing a novelty- and emotion-aware multimodal misinformation detection system. We pre-train the novelty model using SCL and pre-train a visual emotion prediction model to extract the novelty- and emotion-aware multimodal feature representations, and then pass their concatenation to an MLP network for fake news detection. Experiments show that our model outperforms the baselines and previous SOTA models. We achieve a 14.91% accuracy gain over our best baseline model and 1.92% over the SOTA model. We believe that our work advances the field of multimodal fake news detection by defining a novel problem and proposing an effective solution. In future work, the information-retrieval prototype described in this paper can be used to create datasets that include additional information such as audio, video, and user comments related to the target news across different regional and low-resource languages.