1 Introduction

The 2020 coronavirus pandemic has seen a rampant spread of misinformation, resulting in an “infodemic” concurrent with the real-world disease. Innuendo and flawed logic are often used to spread inaccurate ideas, which makes algorithmic fact-checking difficult. Fact-checking sites therefore perform a crucial step in social cybersecurity by relying on human-in-the-loop techniques, such as correlating information from available databases or seeking out expert perspectives. During the coronavirus pandemic, major fact-checking groups such as PolitiFact, Poynter and Snopes have devoted considerable effort to verifying misinformation on the Internet. For example, PolitiFact uses the “Pants on Fire” rating on its Truth-O-Meter to denote fake news, while Poynter uses the “Four Pinocchios” rating for the same purpose. These networks are important in reducing the spread of misinformation (Ünal and Çiçeklioğlu 2019).

This paper examines a corpus of coronavirus-related fact checks collected from the three major fact-checking groups. It characterizes the stories the groups choose to fact-check through clusters of story narratives, and examines the consistency of human fact-checking through the agreement between fact-checking sites in classifying these stories. Additionally, we develop a pipeline to characterize stories into more granular story types, and extend this pipeline to a corpus of COVID-related misinformation tweets.

2 Related work

Since the coronavirus pandemic broke out, multi-faceted analyses of coronavirus-related information on social media have emerged (Ng et al. 2020; Lwin et al. 2020; van Loon et al. 2020; Medina Serrano et al. 2020) to understand the sentiment, emotions and topics surrounding the coronavirus discussion. In particular, misinformation surrounding the pandemic has been examined (McQuillan et al. 2020; Ng and Yuan 2020). Several coronavirus-related conspiracies have appeared and gained traction on social media, perpetuated by topic-oriented communities of conspiracy theorists, bots and trolls (Carley 2020). Misinformation diffusion has also been aptly compared to a virus epidemic model (Cinelli et al. 2020).

Rumour identification and verification on social media (Kochkina et al. 2018; Shu et al. 2017) are essential tasks during an infodemic. Fact-checking is crucial for informing the public about rumours, disinformation and misinformation because of their influence on citizens’ reactions to information (Fridkin et al. 2015; Kouzy et al. 2020).

In coronavirus-related fact-checking work, Marcoux et al. (2020) collected misinformation stories from publicly available aggregators and characterised temporal narratives across topic streams. Studies comparing election-related misinformation across fact-checking sites report a generally high level of agreement between the sites (Amazeen 2016), but also caution that agreement is rare on ambiguous statements (Lim 2018). Hassan et al. (2015) built a fact-checking classifier on the 2015 Republican primary debate and obtained an accuracy of 0.457 against facts checked by the news network CNN.

Classifying social media health-related data has been studied by Liu et al. (2017), who classified behavioural stages through Twitter. For the classification of coronavirus-related social media posts, prior work constructed classifiers using Support Vector Machines (Mircea 2020), Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa word embeddings (Hossain et al. 2020), and Long Short-Term Memory neural networks (LSTMs) (Jelodar et al. 2020). Attempts have also been made at document classification of coronavirus-related literature (Jiménez Gutiérrez et al. 2020). These works seek to classify texts that report coronavirus symptoms (Al-garadi et al. 2020) and to retrieve coronavirus-related scientific and clinical literature (Das et al. 2020; Huang et al. 2020).

This paper classifies coronavirus-related fact checks by three major fact-checking groups. We empirically derive clusters of these stories and analyse cluster characteristics across time, originating medium (the platform where the story first appeared, e.g. news article, social media) and validity. We train a story validity classifier on the corpus, presenting an automated misinformation verification classifier. We also propose an automated method to characterize stories into more granular story types that requires human annotations for only one-third of the data, and extend this classifier to the story types of misinformation tweets. We believe this work is useful in characterizing fact-checking sites through the story clusters they report on and in understanding how much these sites agree with each other. In addition, we propose a semi-supervised way of identifying story types in diverse media that requires minimal human annotation.

3 Data and methodology

This section describes the collection and pre-processing of stories from the three major fact-checking sites, and the methodology used to analyse them.

3.1 Data collection

We collected 6731 fact-checked stories from three well-known fact-checking websites: Poynter, Snopes and PolitiFact, over the period January 14, 2020 to June 5, 2020. All collected stories are in English. Poynter is part of the International Fact-Checking Network and hosts a coronavirus fact-checking section with over 7000 stories specific to the pandemic; we collected our Poynter stories from this coronavirus-specific section. PolitiFact is a US-based independent fact-checking agency with a primary focus on politicians’ claims; it was acquired by Poynter in 2018 (Poynter 2018). Snopes is an independent publication focused on urban legends, hoaxes and folklore. Tables 1 and 2 describe the dataset.

Table 1 Summary of stories
Table 2 Data fields

3.2 Data preprocessing

Harmonising originating medium Each story is tagged with an originating medium, the platform where the post was first submitted to the fact-checking site. We first identified top-level domains such as .net and .com, and labelled the originators of these claims as “Website”. For the remaining stories, we performed entity extraction on the originating field using the StanfordNLP Named-Entity Recognition package (Finkel et al. 2005) and labelled positive results as “Person”. Finally, we parsed the social media platforms listed in the originating field and tagged the stories accordingly. We harmonise the originating media across the sites. A story may have multiple originators, e.g. a story may appear on both Twitter and Facebook.
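As an illustration, the sketch below mirrors this harmonisation logic with simple pattern matching. The field name, platform list and domain pattern are assumptions made for the example, and the `is_person` flag stands in for the Stanford NER check used in the paper.

```python
import re

# Illustrative sketch of medium harmonisation; platform list, domain pattern
# and fallback rule are assumptions, not the paper's exact rules.
SOCIAL_PLATFORMS = ["facebook", "twitter", "whatsapp", "instagram", "youtube"]
DOMAIN_PATTERN = re.compile(r"\.(com|net|org|info)\b", re.IGNORECASE)

def harmonise_medium(originating_field: str, is_person: bool = False) -> list:
    """Map the free-text originating field to one or more harmonised media."""
    media = []
    if DOMAIN_PATTERN.search(originating_field):
        media.append("Website")
    lowered = originating_field.lower()
    media.extend(p.capitalize() for p in SOCIAL_PLATFORMS if p in lowered)
    if not media and is_person:   # the paper uses Stanford NER to detect a PERSON entity
        media.append("Person")
    return media or ["Unknown"]

print(harmonise_medium("Facebook and WhatsApp posts"))   # ['Facebook', 'Whatsapp']
print(harmonise_medium("healthnews-example.com"))        # ['Website']
```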

Harmonising validity Given that each website expresses the validity of stories in different ways, we pre-processed the stories’ validity ratings, summarising them into five categories: True, Partially True, Partially False, False and Unknown. Table 3 shows the harmonisation metric used.

Table 3 Harmonisation metric for story validity
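A minimal sketch of such a mapping is shown below; the raw rating strings are illustrative examples only, and the full metric is given in Table 3.

```python
# Illustrative mapping from raw site-specific ratings to the five harmonised
# validity labels; the raw rating strings are examples, not the full metric.
VALIDITY_MAP = {
    "pants on fire": "False",
    "false": "False",
    "mostly false": "Partially False",
    "half true": "Partially True",
    "mostly true": "Partially True",
    "true": "True",
    "unproven": "Unknown",
}

def harmonise_validity(raw_rating: str) -> str:
    """Return the harmonised label, defaulting to 'Unknown' for unseen ratings."""
    return VALIDITY_MAP.get(raw_rating.strip().lower(), "Unknown")

print(harmonise_validity("Pants on Fire"))   # -> 'False'
```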

Word representations We first apply text pre-processing to the story text, such as special-character removal, stemming and lemmatization. We then construct word representations of each story in two ways: (1) a Bag-of-Words (BOW) static vector representation using word tokens from the Sklearn Python package, and (2) a BERT vector representation for contextualised word embeddings using the pre-trained uncased English model from the HuggingFace SentenceTransformer library (Reimers and Gurevych 2020).

The BOW representation creates, for each sentence, a vector of word-occurrence counts. It can be enhanced with the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme to reflect how important a word is to the corpus of sentences. The BERT representation is based on a transformer language model built on the idea that similar words appear in similar contexts, which is reflected in their vectors being closer to each other.
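The sketch below shows how both representations could be constructed with the libraries named above. The SentenceTransformer model name is an assumption for illustration; the paper only specifies a pre-trained uncased English model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sentence_transformers import SentenceTransformer

stories = [
    "Video of man eating bat soup in restaurant in China",
    "Grape vinegar is the antidote to the coronavirus",
]

# (1) Bag-of-Words counts, optionally weighted with TF-IDF
bow_vectors = CountVectorizer().fit_transform(stories)      # sparse count matrix
tfidf_vectors = TfidfVectorizer().fit_transform(stories)    # TF-IDF weighted variant

# (2) Contextualised sentence embeddings from a pre-trained BERT model
#     ("bert-base-nli-mean-tokens" is an assumed model name for this sketch)
bert_model = SentenceTransformer("bert-base-nli-mean-tokens")
bert_vectors = bert_model.encode(stories)                    # dense sentence vectors
```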

3.3 Cluster analysis on stories

Automatic clustering of stories is used to discover hidden groupings of stories. We reduce the dimensions of the constructed story embeddings using Principal Component Analysis (PCA) before performing k-means clustering to obtain an automatic grouping of stories. For the rest of our analysis, we segment the stories into these clusters, providing an understanding of each story cluster.
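A minimal sketch of this reduction-and-clustering step with scikit-learn is shown below, using stand-in embeddings; the 100 components and six clusters reflect the settings reported in Sect. 4.1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# bert_vectors: one BERT embedding per story (random stand-in data for the sketch)
bert_vectors = np.random.rand(500, 768)

reduced = PCA(n_components=100).fit_transform(bert_vectors)   # dimensionality reduction
kmeans = KMeans(n_clusters=6, random_state=0).fit(reduced)    # six clusters (Sect. 4.1)
cluster_of_story = kmeans.labels_                             # one cluster id per story
```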

Classification of story validity For each cluster, we split the stories in an 80–20 train-test ratio and construct a series of machine learning models to predict story validity. For each story, we construct two word representations: a BOW representation and a BERT representation (elaborated in Sect. 3.2). We compare the classification performance of both representations using Naive Bayes and logistic regression classifiers.
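The sketch below illustrates this per-cluster setup with the BOW representation; the texts and labels are hypothetical stand-ins, and the BERT variant follows the same pattern with sentence embeddings as features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in data: texts and harmonised validity labels for one cluster
texts = ["claim about a fake cure", "claim about a vaccine", "official health advisory",
         "claim about bat soup", "statement by a health ministry"]
labels = ["False", "False", "True", "False", "True"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels,
                                                    test_size=0.2, random_state=0)
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(CountVectorizer(), clf)              # BOW representation
    model.fit(X_train, y_train)
    print(type(clf).__name__,
          f1_score(y_test, model.predict(X_test), average="weighted"))
```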

Level of agreement across fact-checking sites A single story may be classified with slightly different validity on multiple sites. We seek to understand how similarly the sites report on stories, and which types of stories are most reported. For each cluster, we compare stories across sites through the cosine similarity of their BERT embeddings. We find the five closest embeddings above a similarity threshold of 70% and take the mode of their reported story validity. If this matches the story’s own validity, we consider the two sites to agree on the story.
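A sketch of this agreement check is given below; the function and argument names are introduced only for illustration and are not from the paper.

```python
from collections import Counter

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sites_agree(story_vec, story_validity, other_vecs, other_validities,
                threshold=0.7, k=5):
    """Check whether another site's closest stories share this story's validity."""
    sims = cosine_similarity(story_vec.reshape(1, -1), other_vecs)[0]
    closest = np.argsort(sims)[::-1][:k]                     # five closest embeddings
    closest = [i for i in closest if sims[i] >= threshold]   # keep those above threshold
    if not closest:
        return False                                         # no sufficiently similar story
    mode_validity = Counter(other_validities[i] for i in closest).most_common(1)[0][0]
    return mode_validity == story_validity
```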

3.4 Story type categorization

Automatic clustering of the stories in Sect. 3.3 reveals that several story types can be grouped into a single cluster, and that several clusters may contain the same story type. As such, we also categorized stories via manual annotation. We enlisted three annotators who have had exposure to online misinformation about the coronavirus and speak English as their first language; inter-annotator disagreements are resolved by taking the mode of the annotations. These annotators categorized 2000 (about one-third) of the collected stories into the taxonomy developed by Memon and Carley (2020): Case Occurrences, Commercial Activity/Promotion, Conspiracy, Correction/Calling Out, Emergency Responses, Fake Cures, Fake/True Fact or Prevention, Fake/True Public Health Responses and Public Figures.

We test three categorization techniques with text pre-processed as described in Sect. 3.2: (1) a Bag-of-Words (BOW) classifier, (2) a BOW-enhanced classifier that uses salient entities, and (3) a BERT classifier. Figure 1 provides a pictorial overview of the three classifiers.

In the first technique, we construct a BOW classifier from word-token representations of the sentences. The target story is annotated with the story type of the closest annotated story’s word-token vector, measured by cosine distance.

In the second technique, we enhance the BOW classifier with salient entities for each category. We perform Named-Entity Recognition to extract person names (Finkel et al. 2005). Using the extracted names, we query Wikipedia through the MediaWiki API and classify the story as “Public Figures” if the person has a dedicated page. For stories without public figures, we check whether they contain words from a predefined list for each story type; for example, the “Conspiracy” story type typically contains words like “bioweapon” or “5G”. If the story matches none of these lists, the BOW classification process from the first technique is used to annotate the story.
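The sketch below illustrates this enhancement. The keyword lists are illustrative, the `classify_enhanced` and `has_wikipedia_page` helpers are our own stand-ins, and the person names are assumed to come from the NER step described above.

```python
import requests

# Illustrative keyword lists; the paper's full per-category lists are not shown.
KEYWORDS = {
    "Conspiracy": ["bioweapon", "5g", "engineered"],
    "Fake Cures": ["cure", "vaccine", "vitamin"],
}

def has_wikipedia_page(name: str) -> bool:
    """True if the MediaWiki API reports a dedicated page for this name."""
    resp = requests.get("https://en.wikipedia.org/w/api.php",
                        params={"action": "query", "titles": name, "format": "json"})
    return "-1" not in resp.json()["query"]["pages"]   # '-1' marks a missing page

def classify_enhanced(story: str, person_names, bow_classifier):
    # person_names come from the NER step; bow_classifier is the
    # nearest-neighbour BOW classifier from the first technique.
    if any(has_wikipedia_page(name) for name in person_names):
        return "Public Figures"
    lowered = story.lower()
    for story_type, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return story_type
    return bow_classifier(story)                       # fall back to the BOW classifier
```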

In the third technique, we construct the BERT classifier by matching each story’s embedding against the embeddings of the manually annotated stories. The target story is annotated with the story type of the closest embedding, found through the smallest cosine distance.
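A sketch of this nearest-neighbour BERT classifier is shown below; the model name and the two annotated examples are stand-ins for the one-third of manually labelled stories.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")   # assumed model name

# A handful of manually annotated stories stand in for the labelled set
annotated_stories = ["Did Nostradamus predict the COVID-19 pandemic",
                     "Vitamin C with zinc can prevent and treat the infection"]
annotated_types = ["Conspiracy", "Fake Cures"]
annotated_vecs = model.encode(annotated_stories)

def classify_bert(story: str) -> str:
    """Label the story with the type of its closest annotated story (cosine similarity)."""
    sims = cosine_similarity(model.encode([story]), annotated_vecs)[0]
    return annotated_types[int(np.argmax(sims))]

print(classify_bert("Studies show the coronavirus was engineered to be a bioweapon"))
```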

Fig. 1 Three story type categorization process flows

To validate our pipeline, we extend this process to classify 4573 hand-annotated tweets containing misinformation. These tweets were collected by Memon and Carley (2020) over three weeks beginning on 29 March 2020, 15 June 2020 and 24 June 2020, using #covid19 and related hashtags. The tweets were annotated with the same categories as the stories by a total of seven annotators. We use these tweets to perform a cross-comparison against the stories.

4 Results and discussion

Our findings characterize story clusters on fact-checking sites surrounding the 2020 coronavirus pandemic. In the following sections, we analyse the story clusters in terms of story validity and storyline duration, and describe the level of agreement between fact-checking sites. We also compare the automated grouping of stories against manual annotations.

4.1 Story clusters

Each story is represented as a vector using BERT embeddings, then reduced to 100 principal components using Principal Component Analysis, capturing 95% of the variance. Six clusters were chosen for k-means clustering based on the elbow rule applied to the Within-Cluster Sum of Squared Errors (WSS). Each story was assigned to a cluster based on its Euclidean distance to the cluster centres in the projected space, and the clusters were then manually interpreted. We note that some clusters remain internally mixed and most clusters contain multiple story types; we address this in Sect. 4.4.
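For illustration, the elbow-rule computation could look like the sketch below on stand-in data; only the WSS values are computed, and the plot used to locate the elbow is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# reduced: the 100-dimensional PCA projection of the BERT story embeddings
reduced = np.random.rand(500, 100)        # stand-in data for the sketch

wss = []
for k in range(2, 12):
    km = KMeans(n_clusters=k, random_state=0).fit(reduced)
    wss.append(km.inertia_)               # within-cluster sum of squared errors
# Plotting wss against k and locating the "elbow" motivated the choice of k = 6.
```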

The story clusters generated from clustering the BERT story embeddings mirror the human-curated storylines from Carnegie Mellon University’s CASOS coronavirus website (IDeaS 2020), which we reference for manual interpretation of the clusters. The story clusters also mirror the six misinformation categories manually curated by the CoronaVirusFacts Alliance, indicating that misinformation around the coronavirus revolves around the discovered story clusters (Nature 2020). Stories are fairly evenly distributed across the story clusters.

Story Cluster 1: Photos/Videos, Calling Out/Correction Accounting for about 23% of the stories, this first cluster generally comprises stories containing photos and videos, and stories answering questions about the coronavirus. The cluster has been active since January 30, which coincides with the initial phase of the pandemic; Poynter also formed its coronavirus fact-checking alliance on January 24 (Tardáguila and Mantas 2020). Sample stories include: “Video of man eating bat soup in restaurant in China” and “Scientists and experts answer questions and rumors about the coronavirus”.

Story Cluster 2: Public Figures, Conspiracy/Prediction Accounting for around 20% of the stories, the second cluster was active as early as January 29. It mentions public figures such as celebrities and politicians, conspiracy theories about the source of the coronavirus, and past predictions of a global pandemic. Sample stories include: “Did Kim Jong Un Order North Korea First Coronavirus Patient To Be Executed”, “Did Nostradamus Predict the COVID-19 Pandemic”, and “Studies show the coronavirus was engineered to be a bioweapon”.

Story Cluster 3: False Public Health Responses, Natural Cures/Prevention Around 12% of the stories fell into the third cluster. These stories began to appear on January 31 but dwindled by April. Sample stories include: “The Canadian Department of Health issued an emergency notification recommending that people keep their throats moist to protect from the coronavirus”, “Grape vinegar is the antidote to the coronavirus”, and “Vitamin C with zinc can prevent and treat the infection”.

Story Cluster 4: Social Incidents, Commercial Activity/Promotion, Emergency Responses, False Public Health Responses The fourth cluster accounts for 12% of the stories, beginning on January 29 and ending on April 6. Sample stories include: “Kuwait boycotted the products of the Saudi Almari Company”, “20 million Chinese convert to Islam, and the coronavirus does not affect Muslims”, “No, Red Cross is not Offering Coronavirus Home Tests”, and “If you are refused service at a store for not wearing a mask, call the department of health and report the store”.

Story Cluster 5: Fake Cures/Vaccines, Fake Facts Around 17% of the stories fall into the fifth cluster, active from March 16 to April 9, discussing cures, vaccines and other false facts about the coronavirus. Sample stories include: “There is magically already a vaccine available” and “COVID-19 comes from rhino horns”.

Story Cluster 6: Public Health Responses Finally, about 16% of the stories fall into the final cluster, which contains stories on public health responses from February 3 to May 14. Sample stories include: “Google has donated 59 billion (5900 crores) rupees to fight coronavirus to India” and “China built a hospital for 1000 people in 10 days and everyone cheered”.

In Fig. 2a, we observe that Snopes has a large proportion of stories in clusters 1 and 2. This is consistent with Snopes’ stated focus on folklore and hoaxes, many of which are presented as photos, videos, conspiracy theories and prediction stories. PolitiFact fact-checks heavily in cluster 6, looking into claims about public health responses made by governments, consistent with its mission to fact-check political claims. The distribution of Poynter’s stories across clusters is fairly even, likely due to its large network of fact-checkers across many countries. Facebook and WhatsApp are the most common originating media of stories across all story clusters (Fig. 2b). True stories generally involve public health responses (Fig. 2d), while a large proportion of partially true stories mention public figures.

The time series chart in Fig. 2c shows that the number of stories increased steadily through February and peaked at the end of March. In March, the World Health Organisation declared a global pandemic and many cities and states issued lockdown orders. As the coronavirus was a new virus at the time, people seeking explanations, coupled with global authorities implementing measures, may have contributed to the sharp increase in stories. The subsequent decrease may be attributed to the many statements and infographics released by governments around the world to educate people about the coronavirus, thereby dispelling myths and fake news.

Fig. 2 Story clusters

4.2 Classification of story validity

In classifying story validity, we enhanced the BOW representation with TF-IDF weighting and trained classifiers using Naive Bayes, Support Vector Machines (SVM) and logistic regression. We compared this technique against constructing BERT vector embeddings of the stories and classifying them with SVM and logistic regression, using the F1 score to evaluate the classifiers. Table 4 details the performance of each classifier variant. There is no significant difference in accuracy between the bag-of-words and the vector-based models, with a good average F1 score of 87%. In general, stories in clusters 1 (photos/videos, calling out/correction) and 5 (fake cures/vaccines, fake facts) perform better in the classification models, which could be attributed to the presence of unique words, e.g. stories on fake cures tend to contain the words “cure” and “vaccine”. Stories in clusters 3 (false public health responses, natural cures/prevention) and 4 (social incidents, commercial activity, false public health responses) perform the worst, because these clusters contain a variety of stories with differing validity.
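For illustration, the TF-IDF-weighted BOW variant with an SVM and F1 evaluation could look like the sketch below; the texts and labels are hypothetical stand-ins for one cluster.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in data for one cluster
texts = ["magic cure claim", "government aid story", "vaccine hoax claim",
         "bat soup video", "hospital built in ten days", "bioweapon theory"]
labels = ["False", "True", "False", "False", "True", "True"]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=0)
model = make_pipeline(TfidfVectorizer(), LinearSVC())       # TF-IDF weighted BOW + SVM
model.fit(X_tr, y_tr)
print(f1_score(y_te, model.predict(X_te), average="weighted"))
```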

Table 4 Performance of story validity classifier variant (F1 score)

4.3 Level of agreement across fact checking sites

The levels of agreement across the three sites are cross-tabulated in Table 5. In particular, we note that the story matches for Story Clusters 4 and 5 are close to zero, and that PolitiFact and Poynter have the highest level of agreement, averaging 78% across their stories. We postulate that the larger proportion of similar stories and the higher agreement could be due to the overlapping resources of the two sites since Poynter’s acquisition of PolitiFact in 2018 (Poynter 2018).

Table 5 Level of agreement across fact checking sites

4.4 Story type categorization

We propose a pipeline to further classify the story clusters into more granular story types, and validate the pipeline on tweets containing misinformation. One-third of the story dataset is manually annotated as a ground truth for comparison. Due to the different nature of misinformation in stories and tweets, the human annotators identified 14 classification types for stories and 16 types for tweets (i.e. two of the tweet classification types had no corresponding stories).

In comparing BOW against BERT word embeddings for the classifiers, we find that the BERT classifiers outperform the BOW classifiers. This indicates that contextualized word vectors perform better than matching individual words, as individual words can be used in a variety of contexts across stories.

In the BOW-enhanced classifier, we extract salient entities from the sentences to perform story type categorization before falling back to the BOW comparison described in Sect. 3.4. This enhanced classifier consistently performs worse than the BERT classifier, but better than the naive BOW classifier, with the exception of stories trained on stories. This suggests that contextualization of word vectors in a sentence outperforms manual selection of specific entities. The full results are presented in Table 6, and samples of categories and stories/tweets are provided in Table 7.

With the BERT classifier, the best-performing classes are: case occurrences and public figures for stories trained on stories; conspiracy and fake cures for stories trained on tweets; conspiracy and public figures for tweets trained on stories; and conspiracy and panic buying for tweets trained on tweets. We observe that the BERT classifier performs better than the BOW-enhanced classifier, implying that augmenting the stories with additional information, such as the presence of a dedicated Wikipedia page, does not improve accuracy. We also note that the classifiers perform best when classifying the same medium of story types, i.e. stories trained on stories and tweets trained on tweets; in fact, the classification framework performs worse than the random baseline when trained on a different medium of data. This is likely due to the differences in text structure between the media.

Our experiments demonstrate that the same BERT-embedding-based algorithm can be used to categorise stories in diverse media. We performed training by manually annotating 33% of the story types, then performed classification on the same medium type. In all variations of story/tweet categorization, when trained on the same medium of data (i.e. classifying stories with embeddings trained on stories and tweets with embeddings trained on tweets), our framework correctly classified an average of 59% of stories and 43% of tweets, which is 4.5 and 2.7 times more accurate than the random baseline, respectively. Classifying tweets based on story embeddings performed the worst overall because some story types annotated in the tweets do not appear in the stories. These results demonstrate that story type classification is a difficult task and that this accuracy is an acceptable improvement over the random baseline.

Table 6 Performance of story type classification
Table 7 Sampling of story type categories and examples

4.5 Limitations and future work

Several challenges were encountered in our analysis. The dataset required painstaking pre-processing for textual analysis because each fact-checking site has its own rating scale for story validity. Even within the same site, because posts are written by a variety of authors, authors have their own creative ways of expressing story validity; for example, Poynter authors may denote a false claim as “Pants on fire” or “Two Pinocchios”. Given the nature of fact-checking sites, which seek to debunk false claims, the collected data have an overwhelming percentage of False stories, which results in high recall for the classifiers constructed in Sect. 4.2. Future work may involve making use of the fact-check explanations as true facts to balance the dataset.

Human annotators classify story types based on their inherent knowledge of the situation. In this work, we enhanced the story information for our BOW classifier by searching Wikipedia for extracted person names and by using predefined lists of words for each story type. Since contextualised BERT vector representations outperform the BOW classifiers, a promising direction is to further enhance the story information with verified information.

5 Conclusion

In this paper, we examined coronavirus-related fact-checked stories from three well-known fact-checking websites and automatically characterised them into six clusters. We obtained an average accuracy of 87% in supervised classification of story validity. Comparing BERT embeddings of the stories across sites, PolitiFact and Poynter have the highest similarity in stories. We further characterised story clusters into more granular story types determined by human annotators, and extended the classification technique to tweets containing misinformation, demonstrating an approach in which the same algorithm can be used to classify different media. Story type classification performs best when trained on the same medium, for which at least one-third of the data were manually annotated. Contextualised BERT vector representations outperform a classifier that augments stories with additional information. Our framework correctly classified an average of 59% of stories and 43% of tweets, which is 4.5 and 2.7 times more accurate than the random baseline, respectively.