VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias

Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation (MM) and the urgent need for effective strategies to detect and prevent its spread. In recent years, the challenge of multimodal misinformation detection (MMD) has garnered significant attention from researchers and has mainly involved the creation of annotated, weakly annotated, or synthetically generated training datasets, along with the development of various deep learning MMD models. However, the problem of unimodal bias has been overlooked, where specific patterns and biases in MMD benchmarks can result in biased or unimodal models outperforming their multimodal counterparts on an inherently multimodal task, making it difficult to assess progress. In this study, we systematically investigate and identify the presence of unimodal bias in widely used MMD benchmarks, namely VMU-Twitter and COSMOS. To address this issue, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark for MMD, which incorporates real-world data, excludes "asymmetric multimodal misinformation" and utilizes "modality balancing". We conduct an extensive comparative study with a Transformer-based architecture that shows the ability of VERITE to effectively address unimodal bias, rendering it a robust evaluation framework for MMD. Furthermore, we introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data that preserve crossmodal relations between legitimate images and false human-written captions. By leveraging CHASMA in the training process, we observe consistent and notable improvements in predictive performance on VERITE, with a 9.2% increase in accuracy. We release our code at: https://github.com/stevejpapad/image-text-verification


Introduction
The proliferation of misinformation poses a significant societal challenge with potential negative impacts on democratic processes [1], social cohesion [2], public health [3], and political and religious persecution [4], among others. The widespread usage of digital media platforms in recent years has only exacerbated the problem [5]. In the context of social media platforms, multimedia content has been shown to often be more attention-grabbing and widely disseminated than plain text [6], while the presence of an image can significantly enhance the persuasiveness of a false statement [7]. Against this backdrop, the work of fact-checkers becomes increasingly important but also increasingly difficult, considering the scale of content produced and shared daily on social media. In response, researchers have been investigating a range of AI-based methods for detecting misinformation, e.g. detecting inaccurate claims with natural language processing [8], detecting synthetic images, such as DeepFakes, with deep learning [9], or detecting multimodal misinformation with multimodal deep learning [10]. Multimodal misinformation (MM) typically refers to false or misleading information that is spread using multiple modes of communication, such as text, images, audio and video [11]. Here, we focus on image-caption pairs that collaboratively contribute to the dissemination of misinformation. For instance, in Fig. 1a, an image depicts the grounds of a music festival covered in garbage, accompanied by the false claim that it was taken in June 2022 "after Greta Thunberg's environmentalist speech", while the image was actually taken in 2015.
Previous studies on automated multimodal misinformation detection (MMD) have predominantly explored three approaches in terms of training datasets: annotated [12,13], weakly annotated [14,15,16] and synthetically generated datasets [17,18,19]. These distinct routes facilitated the development and evaluation of multimodal models designed to detect and combat misinformation effectively [20,21,22,23]. However, previous studies have overlooked the investigation of unimodal bias. Training datasets exhibiting certain patterns and biases (asymmetries and imbalances) towards one modality can lead to biased models, or to unimodal methods capable of outperforming their multimodal counterparts in a purportedly multimodal task. If these patterns persist within the evaluation benchmarks, they can obscure the impact of unimodal bias, hindering our ability to effectively assess progress in the field of MMD. In our investigation, we uncover that the widely used VMU-Twitter dataset [12] exhibits an image-side unimodal bias while the COSMOS evaluation benchmark [24] exhibits a text-side unimodal bias, raising questions about their reliability as evaluation benchmarks for MMD.
Against this backdrop, the primary aim of this study is to create a robust evaluation framework that accounts for unimodal bias. To this end, we have created the "VERification of Image-TExt pairs" (VERITE) evaluation benchmark, which accounts for unimodal bias by (1) consisting of real-world data, (2) excluding "asymmetric multimodal misinformation" (Asymmetric-MM) and (3) employing "modality balancing". We introduce the term Asymmetric-MM, and contrast it with MM, to highlight cases where one dominant modality is responsible for propagating misinformation while other modalities have little or no influence. An example of Asymmetric-MM can be seen in Fig. 1c, where a claim pertains to "deceased people turning up to vote" and an image, merely thematically related to the claim, is added primarily for cosmetic enhancement. Focusing on the dominant modality, a robust text-only (or, in other scenarios, image-only) detector would suffice for detecting misinformation, rendering the other modality inconsequential in the detection process. We hypothesize that this asymmetry can exacerbate unimodal bias. Furthermore, we introduce the concept of "modality balancing", which ensures that all images and captions are presented twice during evaluation, once in their truthful pair and once in their misleading pair, thus compelling a model to consider both modalities and their relation when discerning between truth and misinformation. We conduct a comprehensive comparative analysis where we train a Transformer-based architecture using different datasets, including VMU-Twitter, Fakeddit, and various synthetically generated datasets. Our empirical results demonstrate that VERITE effectively mitigates and prevents the occurrence of unimodal bias.
Our second contribution is the introduction of "Crossmodal HArd Synthetic MisAlignment" (CHASMA), a new method for generating synthetic training datasets that aims to maintain crossmodal relations between legitimate images and misleading human-written texts in order to create plausible misleading pairs. More specifically, CHASMA utilizes a large pre-trained crossmodal alignment model (CLIP [25]) to pair legitimate images (from VisualNews [26]) with contextually relevant but misleading texts (from Fakeddit). CHASMA maintains the sophisticated linguistic patterns (e.g. exaggeration, irony, emotions) that are often found in human-written texts, unlike methods that rely on Named Entity Inconsistencies (NEI) for generating MM [19]. The inclusion of CHASMA in the training process consistently enhances predictive performance on the VERITE benchmark, particularly in aggregated datasets, resulting in a notable 9.2% increase in accuracy.
The main contributions of our work can be summarised as follows:
• Systematically investigate the issue of unimodal bias within widely used MMD evaluation benchmarks (VMU-Twitter and COSMOS).
• Create the VERITE benchmark, which effectively mitigates the problem of unimodal bias and provides a more robust and reliable evaluation framework for MMD.
• Develop CHASMA, a novel approach for creating synthetic training data for MMD, that consistently leads to improved detection accuracy on the VERITE benchmark.

Related Work
The automated detection of misinformation is a challenging task that has garnered increasing attention from researchers in recent years. A range of methods is being explored to identify misinformation in text [8] and images [9]. Consequently, multiple datasets have been created for fake news detection [27] and manipulated images [28]. These challenges involve unimodal settings. However, there is a need for MMD models that can handle cases where the combination of an image and its caption leads to misinformation. Given the complexity of this task, large training datasets are required to train robust MMD models. In this section, we review the available research on existing datasets, both annotated and synthetically generated, as well as available evaluation benchmarks for MMD.

Annotated multimodal misinformation datasets
The "image verification corpus", often referred to as the "Twitter" dataset ("VMU-Twitter" from now on), was used in the MediaEval 2016 Verifying Multimedia Use (VMU) challenge [12] and comprises 16,440 tweets regarding 410 images for training and 1,090 tweets regarding 104 images for evaluation. Since the images are accompanied by tweets, the dataset has been widely used for MMD [20,21,22,23]. In addition, the Fauxtography dataset comprises manually fact-checked image-caption pairs sourced from Snopes and Reuters, with a total of 1,233 pairs, of which 592 are classified as truthful and 641 as misleading [13]. However, the very limited size of these datasets raises doubts about the effectiveness and generalizability of deep neural networks trained on them.
To address the challenges of collecting and annotating large-scale datasets, researchers have also explored weakly annotated datasets. The MuMiN dataset, for instance, consists of 21 million tweets on Twitter, linked to 13,000 fact-checked claims, with a total of 6,573 images [14]. While this dataset provides rich social information such as user information, articles, and hashtags, its limited number of images may also be insufficient for MMD. NewsBag is another large-scale multimodal dataset, created by scraping the Wall Street Journal and Real News for truthful pairs and The Onion and The Poke for misleading pairs [15]. However, the latter sites publish humorous and satirical articles which may not reflect real-world misinformation [29].
Fakeddit is a large weakly labeled dataset consisting of 1,063,106 instances collected from various subreddits and grouped into two, three, or six classes based on their content [16]. The instances are classified as either Truthful or Misleading and further separated into six classes: true, satire, misleading content, manipulated content, false connection, and impostor content. Of the total instances, 680,798 have both an image and a caption, with 413,197 of them being Misleading and 267,601 being Truthful. Despite being weakly labeled, Fakeddit provides a large-scale resource for training machine learning models to detect misleading multimodal content.

Synthetic multimodal misinformation datasets
Due to the need for large-scale datasets, the labor-intensive nature of manual annotation and the potential for weak labeling to introduce noise, researchers have also been exploring the use of synthetically generated training data for MMD. These methods can be categorized into two groups based on the type of misinformation they generate: out-of-context (OOC) pairs or named entity inconsistencies (NEI).
OOC-based datasets can be created through random-sampling techniques, as in the case of the MAIM [17] and COSMOS [24] datasets. However, these methods tend to produce easily detectable, non-realistic pairs, making them unsuitable for training effective misinformation detection models [30]. An alternative approach is to use feature-based sampling to retrieve harder pairs that more closely resemble real-world multimodal misinformation. The NewsCLIPings dataset [18] was created using scene learning, person matching and CLIP to retrieve images from within the VisualNews dataset and form OOC samples. Similarly, the Twitter-COMMs dataset was created via CLIP-based sampling on Twitter data related to climate, COVID, and military vehicles [31].
On the other hand, NEI-based methods rely on substituting named entities in the caption -such as people, locations, and dates -with other entities of the same type, resulting in misleading inconsistencies between the image and caption.
Since random retrieval and replacement of entities may be easily detectable [30], several methods have been proposed to retrieve relevant entities: cluster-based retrieval for MEIR [19], rule-based retrieval for TamperedNews [32], and CLIP-based retrieval for CLIP-NESt [30]. Finally, aggregating synthetically generated datasets, combining both OOC and NEI, has been shown to further improve performance [30].

Unimodal Bias and Evaluation Benchmarks
Unimodal bias has mainly been observed and investigated in the domain of visual question answering (VQA), wherein biased models rely on surface-level statistical patterns within one modality (usually the textual modality) while disregarding the information present in the other modality (usually the visual modality) [33]. Evaluation benchmarks have been devised to enhance the fairness and robustness of evaluating VQA models [34], and various methods have been proposed for counteracting unimodal bias during training [35]. However, comparable efforts to address unimodal bias have not been explored within the context of MMD.
Currently, there is no widely accepted benchmark for evaluating MMD models. Most studies assess their approaches on a split of their weakly annotated [16,14] or synthetically generated datasets [17,18,19,32], which may not provide a realistic estimate of how these methods will perform when confronted with real-world misinformation.
The COSMOS benchmark is one of the few works that collect an evaluation set consisting of real-world multimodal misinformation and make it publicly available [24]. It consists of 1,700 pairs, balanced between truthful and misleading pairs collected from credible news sources and Snopes.com, respectively, and has been used in two challenges for "CheapFakes detection" [36,37]. Nevertheless, in [30] it was found that text-only methods, especially NEI-based ones, can outperform their multimodal counterparts on COSMOS, raising questions about its reliability as an MMD benchmark. Another widely used dataset for MMD is the VMU-Twitter dataset [12], despite consisting mainly of manipulated and digitally created images. In this paper, we systematically investigate the factors behind unimodal bias in MMD and create a new evaluation benchmark that accounts for it.

Problem Definition
In this study, we focus on the challenge of multimodal misinformation detection (MMD), and specifically on image-caption pairs that collaboratively contribute to the propagation of misinformation. Typically, MMD can be defined as follows: given a dataset D = {(x_i, y_i)}, where x_i = (I_i, C_i) represents an image-caption pair and y_i ∈ {0, 1} denotes the ground truth label indicating the presence or absence of misinformation, the objective is to learn a mapping function f : x → y that accurately predicts the presence of misinformation in a given image-caption pair. However, instead of addressing MMD as a binary classification problem [13,24,30,17,18,19,32], we introduce a new taxonomy that includes three classes:

1. Truthful (True): an image-caption pair (I_i^t, C_i^t) is considered True when the origin, content, and context of an image are accurately described in the accompanying caption.

2. Out-Of-Context (OOC) image-text pairs: involve a deceptive combination of a truthful caption C_i^t with an out-of-context image I_i^x, or a legitimate image I_i^t with an out-of-context caption C_i^x; with "x" denoting a different context but otherwise truthful information.

3. MisCaptioned images (MC): involve an image I_i^t being paired with a misleading caption C_i^f that misrepresents the origin, content, and/or meaning of the image; with "f" denoting falsehood or manipulation.
We consider the structural differences between OOC and MC to warrant separate classification, since MC cases predominantly involve the introduction of falsehoods within the textual modality that are linked to the image, whereas OOC scenarios involve the juxtaposition of otherwise truthful text with a legitimate yet decontextualized image, resulting in the propagation of misinformation. Furthermore, we investigate the problem of unimodal bias in the context of MMD: the phenomenon of unimodal models, or models biased towards one modality, outperforming their unbiased multimodal counterparts on an inherently multimodal task. Unimodal bias can emerge during the training process as a consequence of certain patterns and biases, wherein models tend to emphasize superficial statistical correlations within a single modality. If these patterns persist within the evaluation benchmarks, they have the potential to obscure the presence of unimodal biases within the results. We hypothesize that one such problematic pattern is "asymmetric multimodal misinformation" (Asymmetric-MM), which we contrast against MM, where false claims are accompanied by a loosely connected image (associative imagery) or manipulated images are accompanied by captions that simply reinforce the misleading content of the image (reinforcing captions). Examples are provided in Figures 1c and 1d. Both scenarios create an asymmetry between the two modalities, rendering one modality the dominant source of misinformation while the second modality has little or no influence. It is important to note that instances of MC images (including NEI) may exhibit a certain degree of 'asymmetry', in that misinformation is primarily propagated through the textual modality. Nevertheless, we do not consider them to be Asymmetric-MM because the text in MC pairs remains connected to and misrepresents some aspect of the image, such as depicted entities or events. Previous studies did not make a distinction between MM and Asymmetric-MM while collecting or annotating their datasets. Given 200 random samples from COSMOS and following the classification taxonomy of Snopes, we found that 48% of COSMOS pairs are "false claims" (41% associative imagery and 7% reinforcing captions) while 52% were classified as "miscaptioned", which we consider to be MM because it implies a relationship between the two modalities. After de-duplicating the images of the COSMOS benchmark, the rates were 41% miscaptioned, 35% associative imagery, 4% reinforcing captions and 20% duplicates. On Fakeddit, given 300 random samples, roughly 45% of pairs were considered Asymmetric-MM, with 41% being manipulated images and 4% associative imagery. Moreover, we consider that roughly 14% of Fakeddit's samples are MM, since the remaining 40% were mostly funny memes, visual jokes, pareidolia imagery and other content that is not generally considered to be misinformation.

Creating the VERITE evaluation benchmark
Due to the lack of a robust evaluation benchmark for MMD that accounts for unimodal bias, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark. VERITE comprises three classes: True, OOC and MC pairs. The data collection process is illustrated in Fig. 2 and involves the following steps:

1. Define inclusion criteria
• Consider fact-checked articles from Snopes and Reuters that are classified as "MisCaptioned" (MC).
• Exclude articles classified as "false claim", "legend", "satire", "scam", "misattributed" and other categories that do not adhere to our definition of MM.
• Exclude articles regarding video footage or animated content and keep image-related cases, unless a screenshot of the video is provided that clearly captures the content and claim of the caption.
• Include manipulated images (digital art, AI-generated imagery, etc.) only if they were not created with the intention to misinform, and their initial origin, content, context, or meaning has been misrepresented within the claim.

2. Select images and captions
• Review the article and collect the misleading claim C_i^f and the corresponding image I_i^t.
• Verify that C_i^f is linked to I_i^t and misrepresents some aspect of it (e.g. origin, content, context, depicted entities, etc.). If not, exclude the article for being Asymmetric-MM.

3. Refine captions and images
• Remove "giveaway" words such as "supposedly", "allegedly", "however" or phrases like "this is not the case" that negate the false claim. Such words and phrases, if learned during the training process, could be used as "shortcuts" by MMD models.
• Rephrase C_i^f to mimic the syntactic and grammatical structure of C_i^t in order to avoid potential linguistic biases.
• Rephrase both C_i^t and C_i^f to follow the format "An image shows ..." or "Photograph showing ..." in order to create a direct link between the two modalities.
• Examine both C_i^t and C_i^f for spelling and grammatical errors using Google Docs spelling and grammar check.
• Verify that the images are of reasonable quality and do not have any watermarks. If needed, use reverse image search to find the exact same image in better quality.
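Screening for such giveaway phrases can be partly automated with a simple lexical filter; a minimal sketch, where the word list is illustrative rather than the exact one used for VERITE:

```python
import re

# Illustrative hedging/negation cues that can leak label information;
# the actual phrases screened for VERITE may differ.
GIVEAWAY_PATTERNS = [
    r"\bsupposedly\b",
    r"\ballegedly\b",
    r"\bhowever\b",
    r"this is not the case",
]

def find_giveaways(caption: str) -> list:
    """Return the giveaway patterns found in a candidate caption."""
    caption = caption.lower()
    return [p for p in GIVEAWAY_PATTERNS if re.search(p, caption)]

print(find_giveaways("The photo supposedly shows a 2022 protest."))
```

Captions flagged by such a filter would still be rephrased manually, as the list above cannot cover every phrasing that negates a false claim.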

4. OOC image retrieval
• Extract relevant keywords, or their synonyms, from C_i^t to create a query Q.
• Use Google image search to retrieve one OOC image I_i^x based on Q.
• Ensure that C_i^t and I_i^x share a discernible and meaningful connection (identical or similar origin, content, context or depicted entities) and that their alignment is deceptive.
To illustrate the aforementioned process in practice, let us consider the example shown in Fig. 2. Starting with a fact-checked article, we collect I_i^t showing a damaged railway that has collapsed into a body of water and C_i^f falsely claiming that the event occurred in "2022 during the Russia-Ukraine war". We also collect the truthful C_i^t, which is provided by professional fact-checkers. C_i^t clarifies that the event took place in "June 2020 in Murmansk, Russia" and is thus unrelated to the 2022 Russia-Ukraine war. Afterwards, we extract keywords from C_i^t, use Q = "collapsed railway bridge" as the query and retrieve I_i^x from Google Images. Similar to I_i^t, I_i^x also depicts a collapsed railway bridge, but it was captured in Chile, not Russia; thus misaligning the "location" entity.
We collected 260 articles from Snopes and 78 from Reuters that met our criteria, which translates to 338 (I_i^t, C_i^t), 338 (I_i^t, C_i^f) and 324 (I_i^x, C_i^t) pairs for True, MC and OOC, respectively. The collected Snopes articles date as far back as January 2001 up to January 2023, whereas the Reuters articles, since the site only allows searches up to two years in the past, date from January 2021 to January 2023. The collected data cover a wide and diverse array of topics and cases including world news (29.04%), politics (27.94%), culture and arts (8.82%), entertainment (7.72%), sports (3.67%), the environment (3.66%), religion (2.94%), travel (2.57%), business (2.20%), science and technology (2.19%), health and wellness (1.46%) and others.
We introduce the term "modality balancing" to denote that I_i^t and C_i^t are included twice in the dataset: once with a truthful label and once with a misleading label, as seen in Fig. 2. More specifically, each image is present once in its truthful pair and once in an MC pair, while each caption is present once in its truthful pair and once in an OOC pair. This approach ensures that the model has to attend to both modalities in order to consistently discern between factual and misleading image-caption pairs.
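The assembly of a modality-balanced evaluation set from the collected items can be sketched as follows; the field names are hypothetical, but the pairing logic follows the scheme described above:

```python
def modality_balance(triplets):
    """Build a modality-balanced evaluation set.

    Each input record holds a truthful image I_t, its truthful caption C_t,
    a misleading caption C_f (MC) and an out-of-context image I_x (OOC).
    In the output, every image appears exactly twice (truthful pair + MC pair)
    and every truthful caption appears exactly twice (truthful pair + OOC pair).
    """
    pairs = []
    for t in triplets:
        pairs.append((t["I_t"], t["C_t"], "True"))  # truthful pair
        pairs.append((t["I_t"], t["C_f"], "MC"))    # same image, misleading caption
        pairs.append((t["I_x"], t["C_t"], "OOC"))   # same caption, out-of-context image
    return pairs

sample = [{"I_t": "img_1", "C_t": "cap_1", "C_f": "fake_cap_1", "I_x": "ooc_img_1"}]
print(modality_balance(sample))
```

Note that OOC images and MC captions appear only once each, which matches the slightly smaller OOC count (324 vs. 338) reported for VERITE, where a few OOC retrievals were not possible.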

Crossmodal Hard Synthetic Misalignment
Previous studies on synthetic training data for MMD have primarily relied on OOC pairs or NEI. These methods create formulaic manipulations, either by re-sampling existing pairs or substituting named entities, and therefore lack the imaginative or expressive characteristics of human-produced misinformation, such as emotion or irony. Conversely, large weakly annotated datasets may contain noisy labels and high rates of Asymmetric-MM.
To address these issues, we propose a new method for generating MM, termed Crossmodal HArd Synthetic MisAlignment (CHASMA). Given a truthful pair (I_i^t, C_i^t) and its visual and textual embeddings V_{I_i^t}, T_{C_i^t} extracted from CLIP, we retrieve the most plausible misleading caption C_j^f from a collection of misleading captions C^F with textual embeddings T_{C^F}, in order to produce a miscaptioned pair (I_i^t, C_j^f) with:

C_j^f = argmax_{C_j ∈ C^F} sim(T_{C_i^t}, T_{C_j})  if p < 0.5,  else  argmax_{C_j ∈ C^F} sim(V_{I_i^t}, T_{C_j})

where p ∈ [0, 1] is a uniformly sampled number that determines whether the cosine similarity (sim) is calculated between text-to-text or image-to-text pairs.
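The retrieval step can be sketched with toy embeddings as follows; this is a minimal NumPy illustration, assuming L2-normalized CLIP features and an even split between the two similarity modes, not the exact released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def chasma_retrieve(v_img, t_cap, t_false_pool, p=None):
    """Retrieve the index of the most plausible misleading caption C_f_j.

    v_img: (m,) image embedding V_{I_t}; t_cap: (m,) caption embedding T_{C_t};
    t_false_pool: (K, m) embeddings T_{C^F} of the misleading caption pool.
    A uniformly sampled p chooses between text-to-text and image-to-text
    similarity; all embeddings are assumed to be L2-normalized.
    """
    if p is None:
        p = rng.uniform()                 # p ~ U[0, 1]
    query = t_cap if p < 0.5 else v_img   # similarity mode
    sims = t_false_pool @ query           # cosine similarities against the pool
    return int(np.argmax(sims))           # index j of the retrieved C_f_j

# Toy example: a pool of 4 candidate misleading captions in an 8-dim space.
v = l2norm(rng.normal(size=8))
c = l2norm(rng.normal(size=8))
pool = l2norm(rng.normal(size=(4, 8)))
print(chasma_retrieve(v, c, pool, p=0.3))  # forces text-to-text retrieval
```

At scale, the argmax over a pool of hundreds of thousands of captions would be batched or served by an approximate nearest-neighbour index rather than a dense matrix product.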
We apply crossmodal hard synthetic misalignment between the VisualNews dataset [26], consisting of 1,259,732 (I_i^t, C_i^t) pairs, and the Fakeddit dataset (I_j^f, C_j^f) [16]. Out of the 400K misleading captions in C^F from Fakeddit, the misalignment process only retains 145,891. The resulting generated dataset, termed CHASMA, is balanced between 1.2M (I_i^t, C_i^t) truthful and 1.2M (I_i^t, C_j^f) miscaptioned pairs. Since a C_j^f from Fakeddit may have been aligned with more than one image from VisualNews, we also create CHASMA-D by removing duplicate instances of C_j^f. We balance the classes of CHASMA-D through random down-sampling; the resulting dataset consists of 145,891 (I_i^t, C_i^t) and an equal number of (I_i^t, C_j^f) pairs. We randomly sampled 100 instances from the generated data and determined that approximately 73% of the generated (I_i^t, C_j^f) pairs can be considered MM, while 12% are Asymmetric-MM. Moreover, 6% of the pairs in the dataset are accidentally correct, for instance, an image of firefighters near a fire being paired with the caption "Firemen battling a blaze". Finally, 9% of pairs are unclear, containing click-bait captions such as "You'll never guess how far new home prices have dropped", which are paired with a weakly relevant image and cannot necessarily be considered misinformation. Naturally, the proposed method is not perfect, with approximately 27% of its samples not aligning with our definition of MM. Nevertheless, it provides a significant improvement over the original Fakeddit dataset, where roughly 45% of samples are Asymmetric-MM and only 14% are MM.
As seen in the examples of Fig. 3 (bottom), misleading captions C_j^f from Fakeddit can contain humor and irony and be more imaginative than named entity substitutions. However, their connections with the images I_j^f are often Asymmetric-MM or can be easy to detect (e.g. an illustrated image being humorously paired with a real demonstration). Conversely, CHASMA maintains the 'desired' aspects of C_j^f (e.g. sarcasm, emotions, etc.) but pairs them with more relevant imagery, thus creating "hard" samples and, by extension, more robust training data. For example, consider the case shown in Fig. 3, where an illustrated image is humorously paired with a caption about a demonstration; the caption is subsequently 'misaligned' with an image of a real protest, thus creating a more realistic misleading pair.
In contrast to NEI-based methods, our generated samples consist of human-written misinformation rather than simple named entity manipulations. Finally, unlike NewsCLIPings, CHASMA utilizes CLIP-based retrieval to generate MC rather than OOC pairs and employs both intra-modal and cross-modal similarity to create synthetic samples.

Detection model
In our experiments, we encode all image-caption pairs (I, C) using the pretrained CLIP ViT-L/14 [25] both as the image encoder E_I(•) and the textual encoder E_C(•), producing the corresponding vector representations V_I ∈ R^{m×1} and T_C ∈ R^{m×1}, respectively, where m = 768 is the encoder's embedding dimension. CLIP is an open and widely used model for multimodal feature extraction in numerous multimedia analysis and retrieval tasks [38,39,40], including multimedia verification, where it has yielded promising results [30,18,41,42,43].
We concatenate the extracted features across the first or 'token' axis as [V_I, T_C] ∈ R^{m×2} (the 'batch dimension' is omitted for clarity). As the "detector" D(•) we use the Transformer encoder [44], but exclude positional encoding and use average pooling instead of a CLS token. D(•) comprises L layers of h attention heads and a feed-forward network of dimension f, and outputs y:

y = W_1 ( GELU ( W_0 ( LN ( pool ( D([V_I, T_C]) ) ) ) ) )

where pool denotes average pooling over the token axis, LN stands for Layer Normalization, W_0 ∈ R^{m/2×m} is a GELU-activated fully connected layer and W_1 ∈ R^{n×m/2} is the final classification layer, with n = 1 for binary and n = 3 for multiclass tasks (learnable bias terms are considered but omitted here for clarity). The network is optimized with the categorical cross-entropy or the binary cross-entropy loss function for multiclass or binary tasks, respectively.
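The detector's forward pass can be illustrated with a minimal NumPy sketch (a single layer, a single attention head, and randomly initialized weights drawn per call); this is a schematic of the computation over the two-token input, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 768  # CLIP ViT-L/14 embedding dimension

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def self_attention(X):
    """Scaled dot-product self-attention over the two tokens [V_I, T_C]."""
    Wq, Wk, Wv = (rng.normal(scale=m ** -0.5, size=(m, m)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(m)                 # (2, 2) attention logits
    A = np.exp(A - A.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)            # softmax attention weights
    return A @ V

def detector(v_img, t_cap, n_classes=3):
    X = np.stack([v_img, t_cap])             # token axis: [V_I, T_C], shape (2, m)
    X = X + self_attention(X)                # one Transformer layer with residual
    pooled = X.mean(axis=0)                  # average pooling instead of a CLS token
    W0 = rng.normal(scale=m ** -0.5, size=(m // 2, m))       # m -> m/2, GELU
    W1 = rng.normal(scale=(m // 2) ** -0.5, size=(n_classes, m // 2))
    h = gelu(W0 @ layer_norm(pooled))
    return W1 @ h                            # logits, shape (n_classes,)

logits = detector(rng.normal(size=m), rng.normal(size=m))
print(logits.shape)
```

The full model stacks L such layers with h heads and a feed-forward sub-layer per layer, and its weights are of course learned rather than sampled.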
For unimodal experiments, we only pass V_I or T_C through D(•), so the input consists of a single token in R^{m×1}. In these cases, the attention scores of D(•) are uniformly assigned a value of 1, resulting in an absence of distinct attention weights. We denote this "Transformer" detector as D^-(•): D(•) minus multi-head self-attention, since the latter has no contributing role. Moreover, in order to investigate the role that multi-head self-attention plays in unimodal bias, we conduct additional experiments using the variant D^-(I;C), where the two modalities are concatenated along the second or "dimensional" axis, resulting in a single joint token [V_I; T_C] ∈ R^{2m×1}.

Experimental Setup

Figure 4: High-level overview of the employed pipeline.
Crossmodal Fusion (BCMF) network using DeiT and BERT [23] and a transformer-based architecture employing Faster-RCNN and BERT to capture intra-modal relations and a multiplicative multimodal method to capture inter-modal relations (Intra+Inter) [45].
Afterwards, we compare D(•) when trained on the original Fakeddit [16], our CHASMA and CHASMA-D datasets, as well as numerous synthetically generated datasets, including OOC: NewsCLIPings text-text (NC-t2t) [18] and random sampling by topic (RSt) [30], as well as NEI: MEIR [19], random named entity swapping by topic (R-NESt) and CLIP-based named entity swapping by topic (CLIP-NESt) [30]. The number of samples per class for each dataset can be seen in Table 1.
Furthermore, we experiment with dataset aggregation, i.e. the combination of various generated datasets. Aggregated datasets are denoted with a plus sign, for instance R-NESt + NC-t2t. For the multiclass task, we combine one OOC dataset and at least one MC dataset to represent the OOC and MC classes, respectively. To evaluate the contribution of CHASMA (or CHASMA-D) to MMD, we perform an ablation experiment where they are either integrated into or excluded from aggregated datasets. Note that, during training, we apply random down-sampling to address any class imbalance.
Figure 4 presents a high-level overview of our pipeline. We incorporate truthful image-caption pairs from the VisualNews dataset and employ an OOC-based (e.g. NewsCLIPings) and an MC-based generation method (e.g. CHASMA) to create false OOC and MC pairs, respectively. Subsequently, we utilize CLIP to extract the visual and textual features from the image-caption pairs and then train the multiclass Transformer detector D(•), before ultimately assessing its performance on the VERITE benchmark.

Evaluation protocol
Considering the distribution shift between the training (generated) and test sets (real-world), utilizing an "out-of-distribution" validation set could potentially result in slightly better test accuracy [46]. However, due to the relatively small sizes of both the COSMOS and VERITE datasets, we decided to avoid this approach. Instead, after training, we retrieve the best performing hyper-parameter combination based on the "in-distribution" validation set (generated) and evaluate it on the test set.
Prior works using the VMU-Twitter dataset do not specify the validation set used for hyperparameter tuning [21,22,23]. By inspecting their code, we can deduce that the test set was used for this purpose, which is problematic. We follow this protocol only for comparability and also train D(•) using a corrected protocol, where the development set is randomly split into training (90%) and validation (10%).
To evaluate the presence and magnitude of unimodal bias, we employ two metrics: the percentage increase in accuracy (∆%) between a unimodal model and its multimodal counterpart, and Cohen's d (d) effect size. Negative ∆% and positive d values serve as indicators of the presence of unimodal bias. For experiments on the VMU-Twitter dataset, we reduce the batch size to 16 and define lr ∈ {5e-5, 1e-5}, since it is a much smaller dataset. We set a constant random seed (0) for Torch, Python random and NumPy to ensure the reproducibility of our experiments. We conducted the experiments on a computer equipped with an AMD Ryzen 3960X 24-core CPU, 128GB of RAM, and a single GeForce RTX 3060 GPU.
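The two bias metrics can be computed as follows; the accuracy figures in the test below are illustrative numbers, not values from the paper's tables.

```python
from statistics import mean, stdev

def delta_pct(unimodal_acc, multimodal_acc):
    """Percentage increase (Delta %) of the multimodal model's accuracy
    over the unimodal one. Negative values indicate unimodal bias."""
    return 100.0 * (multimodal_acc - unimodal_acc) / unimodal_acc

def cohens_d(unimodal_accs, multimodal_accs):
    """Cohen's d effect size between accuracy samples of the unimodal and
    multimodal models, using the pooled standard deviation. Positive d
    (unimodal mean above multimodal mean) indicates unimodal bias."""
    n1, n2 = len(unimodal_accs), len(multimodal_accs)
    s1, s2 = stdev(unimodal_accs), stdev(multimodal_accs)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(unimodal_accs) - mean(multimodal_accs)) / pooled
```

For example, a unimodal model at 80.0% accuracy against a multimodal one at 76.0% gives ∆% = -5.0, flagging unimodal bias.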

Experimental Results
Image-side unimodal bias on VMU-Twitter: We begin by comparing the performance of D(•) with various models trained and evaluated on the VMU-Twitter dataset. In Table 2, we observe that, among multimodal models, D-(I; C) achieves the third-highest result (80.5%), after Intra+Inter (83.1%) and BCMF (81.5%). However, it is noteworthy that the image-only model D-(I) achieves the highest overall accuracy (83.7%). This finding indicates the presence of image-side unimodal bias within models trained and evaluated on VMU-Twitter. Table 7 also demonstrates that D(I, C) displays a greater percentage decrease (-4.78%) than D-(I; C) (-3.92%); thus VMU-Twitter does not seem to allow the full utilization of multi-head self-attention.
Fig. 5 demonstrates that the multimodal model D(I, C) produces the same outputs regardless of whether the image is paired with its corresponding caption or two randomly selected captions. D(I, C) predicts that all pairs are "true" regardless of the accompanying text. This example visually highlights the presence of image-side unimodal bias within the model's inference process.

The occurrence of image-side unimodal bias can be attributed to two primary factors. Firstly, VMU-Twitter was originally designed as an image verification corpus, comprising a substantial number of manipulated or edited images. Consequently, the significance of the accompanying text diminishes, as the primary source of misinformation lies within the image itself, what we term Asymmetric-MM. Secondly, VMU-Twitter exhibits an imbalance between the number of texts and images used for training and testing. With only 410 images available for training and 104 images for testing, compared to approximately 17k and 1k tweets respectively, each image appears multiple times in the dataset, albeit with different texts. This discrepancy can lead to the model disregarding the textual modality, further reinforcing the image-side bias. Considering these factors, it appears that VMU-Twitter may not be an optimal choice for training and evaluating models for the task of MMD and might be better suited for its original purpose, namely image verification.
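A caption-swap probe of this kind can be sketched as below. The `(img, txt) -> logits` model interface and the embedding dimensionality are assumptions for illustration, not the paper's code.

```python
import torch

@torch.no_grad()
def caption_swap_probe(model, img_emb, caption_embs):
    """Probe for image-side unimodal bias: pair one image embedding with
    several caption embeddings and check whether the prediction changes.
    Returns True if the model outputs the same class for every caption,
    i.e. it effectively ignores the text."""
    preds = []
    for txt in caption_embs:
        logits = model(img_emb.unsqueeze(0), txt.unsqueeze(0))
        preds.append(int(logits.argmax(dim=-1)))
    return len(set(preds)) == 1
```

Run against a model whose output genuinely depends only on the image, the probe returns True for any set of captions, mirroring the behavior observed in Fig. 5.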
As discussed in Section 4.2, it is worth highlighting that the evaluation protocol employed in [21,22,23] is problematic, using the test set during the validation and/or hyper-parameter tuning process. Under the corrected evaluation protocol, D-(I) achieves 81.0% accuracy, D-(I; C) achieves 77.3% (-4.56%), and D(I, C) achieves 76.66% (-5.35%). The aforementioned conclusions regarding image-side bias remain consistent even under the corrected evaluation protocol.
Finally, note that a direct comparison between the models in Table 2 is not possible, as they employ different image and text encoders. Consequently, we refrain from asserting that we have attained "state-of-the-art" performance on VMU-Twitter. Instead, the results showcase that D(•) can provide competitive and reasonably strong performance, while being a relatively simple architecture, and it will be leveraged in all subsequent experiments.
Text-side unimodal bias on COSMOS: We proceed by training D(•) on various datasets for binary classification and evaluating on the COSMOS benchmark, as illustrated in Table 3. We observe that the text-only D-(C) trained on CHASMA-D achieves 72.6% accuracy, the highest score on COSMOS. However, this translates into the text-only model outperforming its multimodal counterparts, D-(I; C) and D(I, C), with ∆% of -7.85% and -14.88% respectively. As seen in Table 7, on average, D-(C) outperforms D-(I; C) by 2.34% and D(I, C) by 3.47%, with a d of 0.25 and 0.4 respectively, highlighting that COSMOS does not seem to allow the full utilization of multi-head self-attention. We also observe in Table 3 that both D(I, C) and D-(I; C) suffer from text-side unimodal bias on COSMOS only when trained with NEI-based datasets (CLIP-NESt and R-NESt) or datasets relying on human-written misinformation (Fakeddit and CHASMA). Text manipulations and human-written texts may display certain linguistic patterns that the models inadvertently learn to attend to, while reducing attention towards the visual modality.
Fig. 6 provides a visual representation of the behavior of the multimodal model D(I, C) when trained on CHASMA-D and evaluated on COSMOS. It showcases that the model can generate different outputs when applied to near-duplicate image-caption pairs, where the textual content exhibits only very minor differences that do not alter the fact that it represents misinformation. Considering these results, we can conclude that COSMOS is not an ideal choice when it comes to evaluating models for the task of multimodal misinformation detection, as the dataset's characteristics enable text-side unimodal bias.

Unimodal bias is not (entirely) algorithmic: Table 4 presents the performance of D(•) when trained on various datasets and evaluated on their respective test sets. When trained on OOC-based datasets (RSt, NC-t2t, and CSt), D(•) performs poorly in both the image- and text-only settings, with an average of 53.6% and 52.5% respectively, while achieving high multimodal accuracy. Expectedly, as both the image and the caption in OOC samples are factually accurate, and only their relation is corrupted, it is not possible to determine the existence of misinformation by solely examining one modality.
In contrast, D-(C) trained on the NEI datasets (MEIR, R-NESt, CLIP-NESt) and CHASMA performs much closer to the multimodal models, with D-(C) scoring 81.4%, compared to 86.6% by D-(I; C) and 85.9% by D(I, C). At the same time, the image-only setting yields significantly lower performance for the NEI methods and CHASMA, the only exception being Fakeddit, which comprises a higher percentage of manipulated images. Once again, these results suggest that methods relying on text manipulation or human-written misinformation may introduce linguistic patterns and biases that render the image less important.
However, unlike the COSMOS benchmark, no unimodal method surpasses its multimodal counterparts on the test sets. This is also demonstrated in Table 7, where neither ∆% nor d indicates the presence of any unimodal bias. We can deduce that unimodal bias is partially algorithmic (an MMD model may rely on certain superficial unimodal patterns during training) but, more importantly, these biases are significantly exacerbated by certain characteristics of VMU-Twitter and COSMOS, one of which is the high prevalence of Asymmetric-MM instances, thus raising concerns about their reliability as evaluation benchmarks.

Additionally, we train D(•) for binary classification and evaluate its performance on VERITE-B, the binary version of VERITE. The primary aim of these experiments is to investigate the implications of removing "modality balancing" from VERITE in relation to unimodal bias. This entails that each image no longer appears twice in VERITE, once in the "True" class and once in the "Miscaptioned" class, and each caption no longer appears twice, once in the "True" class and once in the "Out-of-context" class, since they are split into two separate evaluations. In Table 6, we observe that D-(I; C) trained on R-NESt or CHASMA-D exhibits minor instances of unimodal bias in the "True vs MC" evaluation. The scale of this bias becomes more pronounced when multi-head self-attention is employed in D(I, C). Additionally, when trained with Fakeddit, D(I, C) showcases unimodal bias within the "True vs OOC" metric. These findings bear similarities to the patterns identified within the COSMOS benchmark, albeit at a smaller scale, presumably due to the lack of Asymmetric-MM in VERITE. Based on these results, we can infer that "modality balancing" plays a crucial role in mitigating the manifestation of unimodal bias within VERITE. Hence, we advise against employing VERITE-B as an evaluation benchmark for multimodal misinformation detection, especially of MC pairs. Instead, we recommend utilizing the original VERITE benchmark, as it has demonstrated its robustness as a comprehensive evaluation framework.
On the performance of CHASMA: The aforementioned improvements are also reproduced while using D-(I; C). These findings highlight the effectiveness of the proposed methodology. By producing "harder" training samples and reducing the rate of Asymmetric-MM, CHASMA can significantly improve predictive performance on real-world data. Finally, it is worth noting that while D(•) trained on CHASMA displayed a high rate of text-side unimodal bias on COSMOS, this phenomenon is not present in the VERITE evaluation benchmark.

Conclusions
In this study, we address the task of multimodal misinformation detection (MMD), where an image and its accompanying caption collaborate to spread misleading or false information. Our primary focus lies in addressing the issue of unimodal bias, which arises in datasets that exhibit distinct patterns and biases towards one modality, allowing unimodal methods to outperform their multimodal counterparts in an inherently multimodal task. Our systematic investigation found that datasets widely used for MMD, namely VMU-Twitter and COSMOS, can enable image-side and text-side unimodal bias, respectively, raising questions about their reliability as benchmarks for MMD.
To address the aforementioned concerns, we introduce the VERITE evaluation benchmark, designed to provide a comprehensive and robust framework for multimodal misinformation detection. VERITE encompasses a diverse array of real-world data, excludes "asymmetric multimodal misinformation" (Asymmetric-MM), where one modality plays a dominant role in propagating misinformation while the others have little or no influence, and implements "modality balancing", where each image and caption appear twice in the dataset, once within a truthful and once within a misleading pair. We conduct an extensive comparative study with a Transformer-based architecture which demonstrates that VERITE effectively mitigates and prevents the manifestation of unimodal bias, offering an improved evaluation framework for MMD.
In addition, we introduce CHASMA, a novel method for generating synthetic training data. CHASMA employs a large pre-trained crossmodal alignment model to generate hard examples that preserve the crossmodal relations between legitimate images and misleading human-written captions. Empirical results show that using CHASMA in the training process consistently improves detection accuracy, achieving the highest performance on VERITE.
The proposed approach achieved 52.1% accuracy for multiclass MMD. Nevertheless, we are optimistic that CHASMA and VERITE can serve as a foundation for future research, leading to further advancements in this area. For instance, future works could experiment with improved multimodal encoders [47,48], news- or event-aware encoders [40], advanced modality fusion techniques [23,49,50], utilize external evidence [51] or explore new methods for generating training data [52]. As future research unfolds, VERITE could be expanded to include additional types of MM (e.g. AI-generated content) or additional modalities (e.g. videos), or be repurposed for other relevant tasks (e.g. fact-checked article retrieval [53]). Moreover, since MMD is only one part of multimedia verification [54], "claim detection" and "check-worthiness" [55] could be employed to distinguish between Asymmetric-MM and MM and to determine whether to use a unimodal detector (e.g. for a false claim or a manipulated image) or a multimodal misinformation detector in each scenario. Finally, while our focus has been on alleviating unimodal bias at the evaluation level, it may be worth exploring methods for reducing unimodal bias from an algorithmic perspective [35]. In all these endeavors, VERITE can serve as a robust evaluation benchmark.

Figure 2: Data collection, filtering and refinement process for creating VERITE.

Figure 3: Training samples from CHASMA when applied across the VisualNews and Fakeddit datasets.

4.3 Implementation details
D(•) is trained for a maximum of 30 epochs (early stopping at 10 epochs) with the Adam optimizer and a learning rate of lr = 5e-5. For tuning the hyper-parameters of D(•), we consider the following values: L ∈ {1, 4} Transformer layers, f ∈ {128, 1024} for the dimension of the feed-forward network, and h ∈ {2, 8} attention heads. The dropout rate is constant at 0.1 and the batch size at 512. This grid search results in a total of 8 experiments per modality (image-only, text-only, multimodal), thus 24 per dataset.
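The 2x2x2 grid search described above can be sketched as follows; `train_eval_fn` stands in for training D(•) and returning the in-distribution validation accuracy, and is an assumed interface rather than the released code.

```python
from itertools import product

# Hyper-parameter grid from the implementation details above.
GRID = {
    "layers": [1, 4],
    "ff_dim": [128, 1024],
    "heads": [2, 8],
}

def grid_search(train_eval_fn):
    """Run the 2x2x2 grid (8 runs per modality) and return the
    configuration with the best validation accuracy."""
    best_cfg, best_acc = None, float("-inf")
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        acc = train_eval_fn(cfg)  # train D(.) and score on validation
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```

The selected configuration is the one carried forward to the COSMOS and VERITE evaluations.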

"Facial recognition nabs ISIS fighter waiting to cross the Greek border as a refugee. What more to say?" "Lemmy and Bowie together. This is my absolute favorite picture I've seen in a long time. #RIPDavidBowie #RIPLemmy"

Figure 5: Inference by D(I, C) on three samples from VMU-Twitter. Moreover, we examine the model's image-side unimodal bias by inputting the middle image along with each of the three captions. D(I, C) predicts "true" with all three captions, which means that the model does not take the caption into consideration. Red underlines denote mistaken predictions.

Figure 6: Inference on two misleading samples from COSMOS with near-duplicate texts by D(I, C) trained on CHASMA-D. Red underlines denote mistaken predictions.

Table 1: Number of samples per class in each training and testing dataset. "*" denotes datasets whose "false" pairs exhibit more similarities to, but may not entirely align with, our definition of miscaptioned (MC) images. Validation sets are used but omitted here.

Table 2: Performance of the Transformer D(I, C) and D-(•) for caption-only (C), image-only (I) or multimodal inputs (I; C) when trained and evaluated on the VMU-Twitter dataset. Bold denotes the highest binary accuracy.

For evaluation, we report the accuracy score (image-only, text-only or multimodal) for binary classification on COSMOS and multiclass accuracy on VERITE. Moreover, we experiment with a binary version of VERITE (VERITE-B) where both "OOC" and "MC" pairs are combined into a single class denoting misinformation. Here, we report the accuracy for each pair of classes, namely "True vs OOC" and "True vs MC". The number of samples per class for each evaluation dataset can be seen in Table 1.

Table 3: Results on the COSMOS benchmark. We report the performance of the Transformer D(I, C) and D-(•) for caption-only (C), image-only (I) or multimodal inputs (I; C). Bold denotes the highest binary accuracy.
"Some 53,000 dead people were found to be included in Florida's voter rolls in November 2018." "53,000 dead people turned up on the state's voter rolls in November 2018."

Table 4: Binary classification results on the test set of each dataset.

Table 5: Multiclass classification results on the VERITE dataset with different training MC data. For OOC data, NC-t2t is used in all experiments.
VERITE alleviates unimodal bias: The analysis of Table 7 reveals that both ∆% and Cohen's d effect sizes indicate the absence of any unimodal bias on the VERITE benchmark. Notably, D(I, C) displays an average 27.94% increase in accuracy when compared to the text-only D-(C) and 43.27% when compared to the image-only D-(I). These results emphasize that a model biased towards one modality cannot achieve satisfactory performance on VERITE. Furthermore, it is worth noting that D(I, C) consistently outperforms D-(I; C), demonstrating that VERITE effectively allows the power of multi-head self-attention to be leveraged, unlike COSMOS and VMU-Twitter.

Table 5 provides a detailed overview of the results obtained on the VERITE evaluation benchmark. In our training process for multiclass misinformation detection, we employ D(•) using one OOC dataset and at least one MC dataset.

Table 6: Results on VERITE-B by D(•) trained on different datasets for binary classification. The objective of these experiments is to investigate the impact on unimodal bias of eliminating "modality balancing" from VERITE. Evaluation metrics used include "True vs OOC" and "True vs MC" accuracy. In parentheses, we report the percentage improvement (∆%) of each multimodal model compared to the text-only model. Bold denotes the best performance per evaluation metric.

Table 7: Examination of unimodal bias on different evaluation datasets. We report the average percentage increase in accuracy (∆%) and the average effect size measured by Cohen's d (d). Negative ∆% and positive d values indicate the presence and magnitude of unimodal bias (denoted in bold).