1 Introduction

The first cases of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, were reported in December 2019 in Wuhan, China. The World Health Organization (WHO) coronavirus dashboard (WHO 2022) reported that by March 9, 2022, there were almost 450 million confirmed cases across the globe and over 6 million deaths. For the United States, WHO reported over 78 million confirmed cases and over 950,000 deaths. In addition to the deaths and damage to public health, the COVID-19 pandemic unleashed major disruptions around the globe in terms of economy, education, and society in general (Alenezi and Alqenaei 2021).

Although other important factors have contributed to the COVID-19 pandemic, there is strong consensus among researchers and public-health experts that the spread of COVID-19-related misinformation and disinformation on social- and digital-media platforms is a major contributor (Tasnim et al. 2020; Roozenbeek et al. 2020; Vériter et al. 2020; Horawalavithana et al. 2021; Kricorian et al. 2021; Neely et al. 2022). As Tasnim et al. (2020:171) stressed, COVID-19 misinformation “is masking healthy behaviors and promoting erroneous practices that increase the spread of the virus and ultimately result in poor physical and mental health outcomes among individuals.”

Leaders of various international organizations, including the United Nations and WHO, have called for special attention to be directed to the problem of misinformation and other types of falsehoods regarding the COVID-19 pandemic, calling it an “infodemic” (WHO 2021). WHO defined an infodemic as “too much information including false or misleading information in digital and physical environments during a disease outbreak,” stressing that it can cause confusion and risk-taking behaviors that can harm public health.

Roozenbeek et al. (2020) found that although belief in COVID-19 misinformation might not be prevalent in several countries, including the United Kingdom, the United States, Ireland, Mexico, and Spain, a substantial proportion of respondents in each country viewed COVID-19 misinformation as highly reliable. In the practical realm, that study also found that respondents who believed COVID-19 misinformation were less likely to comply with public-health guidance.

1.1 COVID-19 disinformation and misinformation

In addition to the spread of misinformation, there have been numerous confirmed reports of disinformation campaigns directed and implemented by several state actors, including Russia, China, and Iran (Gradon 2020; Bright et al. 2020; Dubowitz and Ghasseminejad 2020; Hotez 2021; Horawalavithana et al. 2021). Although there are various ways to distinguish between misinformation and disinformation, we follow Jack (2017) in defining misinformation as information whose inaccuracy is unintentional and disinformation as information that is deliberately false or misleading (O'Brien and Alsmadi 2021). This distinction, based on intent, is also evident in the definitions provided by the Centers for Disease Control and Prevention (CDC), which defines misinformation as “false information shared by people who do not intend to mislead others” and disinformation as “false information deliberately created and disseminated with malicious intent” (CDC 2021).

In some cases, there is evidence that disinformation campaigns are coordinated. For example, in 2020 European researchers confirmed that Russia and China were coordinating and synchronizing their efforts by producing similar messages and narratives and boosting and spreading each other’s messages (Vériter et al. 2020). The topic of COVID-19 vaccinations has also been used in state-run information-warfare efforts, with researchers uncovering Russian efforts to boost the popularity and sales of the Russian-produced Sputnik V vaccine by spreading disinformation and undermining public trust in vaccines produced in Western countries (Hotez 2021; Horawalavithana et al. 2021; U.S. Agency for Global Media 2021).

An additional difficulty in dealing with state-sponsored disinformation campaigns is that ongoing efforts by state actors to influence democratic processes or shape public opinion through social and digital media appear to become more sophisticated and successful over time. For example, analysis of social-media posts produced by the Internet Research Agency, which represented part of Russian efforts to influence U.S. elections, has shown that techniques and messages evolved over time in order to make them more effective (Ruck et al. 2019). Analysis of bot activity conducted by actors affiliated with the Russian government has also demonstrated that those techniques are evolving and becoming more sophisticated (Alsmadi and O’Brien 2020). This indicates that detecting and countering state-sponsored disinformation campaigns related to COVID-19 presents additional challenges compared to other types of misinformation.

1.2 COVID-19 misinformation, disinformation, and vaccine hesitancy

Since the development and public rollout of COVID-19 vaccines, the spread of false information on social media and other digital platforms has been described by public-health officials and researchers as directly contributing to the high number of unvaccinated individuals in the United States and abroad (Dror et al. 2020; Kricorian 2021; Puri et al. 2020). Roozenbeek et al. (2020) found that those respondents who believed COVID-19 misinformation claims and questioned valid science-based claims were also less likely to get vaccinated or to recommend vaccinations to their friends and family, indicating a direct link between susceptibility to misinformation and the reduced likelihood of vaccination and adherence to health standards. A study of U.S. respondents by Kricorian et al. (2021) confirmed that misinformation about COVID-19 and vaccines was prevalent among those who refused to be vaccinated.

Further, analysis of survey data by Neely et al. (2022) confirmed the previously detected link between misinformation and hesitancy among U.S. respondents. According to their analysis, not only were high levels of exposure to misinformation detected among respondents, but exposure to false information was also directly correlated with vaccine hesitancy, with politicization as a major contributing factor.

According to the New York Times COVID-19 Vaccination Tracker, by February 2022, 63% of the global population had received at least one dose of a COVID-19 vaccine (New York Times 2022), whereas in the United States the share of those partially vaccinated had reached 75%; fully vaccinated, 64%; and those who had received booster shots, 27%. With the level of vaccination in the United States and around the globe remaining lower than what is needed to achieve herd immunity, proliferation of misinformation and disinformation remains “a significant impediment to the attainment of herd immunity and the end of the COVID-19 pandemic” (Neely et al. 2021:179).

Since the beginning of the pandemic, there have been growing calls by experts and stakeholders to monitor and combat the spread of COVID-19 and vaccination-related misinformation and disinformation on social media, digital media, and other platforms. U.N. Secretary-General Antonio Guterres called for additional efforts to stop the spread of false information and conspiracy theories that have a direct negative impact on efforts to curtail the pandemic, directly calling on social-media companies to flag harmful content and to “remove racist, misogynist and other harmful content” (CBS News 2020). WHO partnered with a number of major tech companies, including Facebook, Twitter, and YouTube, to detect and delete COVID-19-related misinformation and to promote legitimate updates from healthcare organizations (Statt 2020). However, recent results show that tech companies often fail to adequately monitor COVID-19 falsehoods and to delete such content in a timely manner (Brindha et al. 2020; Wardle and Singerman 2021).

1.3 COVID-19 and machine-learning efforts

As a result of the high volume of misinformation, disinformation, and other types of falsehoods on various social- and digital-media platforms, automated detection of such false or inaccurate information has gained importance and has become a primary approach (Tacchini et al. 2017; Thota et al. 2018; Ruchansky et al. 2017). Several recent studies have highlighted the need to apply machine learning and other techniques to the problem of widespread COVID-19 misinformation and disinformation. For example, Tasnim et al. (2020) called for using advanced technology such as natural language processing (NLP) and data-mining approaches to detect and remove misinformation and other types of falsehoods with no basis in science from digital platforms.

Employing advanced technologies to detect misinformation on social media and other digital sources is a robust and developing field that has produced successful results (Shu et al. 2017). Although algorithms have been successfully employed to identify false information, this work has its own set of unique challenges (Shu et al. 2017). Tasnim et al. (2020), as well as Alenezi and Alqenaei (2021), argued that despite these challenges, applying the same principles to identifying and removing false COVID-19 information is both feasible and highly desirable. Given that research has shown that misinformation spreads faster on social-media platforms than information from legitimate news sources (Tasnim et al. 2020), applying machine learning to the detection of falsehoods could be a major tool in fighting the global pandemic.

Several studies have focused on developing tools for the automatic detection of COVID-19-related misinformation using NLP approaches, including detecting misinformation videos on YouTube by leveraging user comments (Serrano et al. 2020) and classifying social-media posts containing misinformation according to the health risks associated with them (Dharawat et al. 2020).

Hossain et al. (2020) pointed out that existing misinformation-detection datasets were not effective for evaluating systems designed to detect COVID-19 misinformation because of the novel language involved and the rapid changes in information. They also released the COVIDLies dataset and evaluated existing NLP systems on it.

Alenezi and Alqenaei (2021) proposed building machine-learning misinformation-detection models that target COVID-19 misinformation in social media. Specifically, they tested three detection models on Twitter data (long short-term memory [LSTM] networks, a multichannel convolutional neural network [MC-CNN], and k-nearest neighbors [kNN]) and obtained results superior to those from previous studies.

The Bidirectional Encoder Representations from Transformers (BERT) language-representation model initially developed by Devlin et al. (2018) was successfully used by several research teams —for example, Wani et al. (2021) and Glazkova et al. (2021)—to evaluate social-media posts related to COVID-19 misinformation. Hamid et al. (2020) examined both falsehoods and 5G conspiracy theories, and Wahle et al. (2021) focused on using transformer-based models on five COVID-19 misinformation datasets that included a variety of sources such as social-media posts, news articles, and scientific papers.

For this paper, we evaluated the impact of word- and sentence-embedding models and transformers on classification models. Our effort was motivated by several recent publications that indicate the advantage of using embedding models in general, and sentence transformers (e.g., BERT) in particular, to improve the predictions of classification models (e.g., Ling et al. 2017; Liu et al. 2019; Hao et al. 2019; Ruas et al. 2020). Another motivation for using embedding models is related to our integration of different datasets related to COVID-19. We believe that integrating text from different datasets, while simultaneously employing word-embedding models, can help generalize the results of classification models and reduce possible bias in their predictions. The complexity of using several embedding models lies in the amount of time and resources needed to pre-train large datasets in each of them: pre-trained representations cannot be reused from one embedding model to another, although, within the same embedding model, a pre-trained model can be used to encode and test new data.

2 Research questions and methods

The following two questions guided our research:

  1. Why and how to integrate different text-based datasets? In the research-methods section below, we discuss some of the reasons for integrating different text-based datasets from the same domain. Here, the question is how best to integrate datasets when their target-column labels differ, at least in part. Although most COVID-19 misinformation datasets have a binary target column identifying whether information is fake or not, we noticed that how they identify and describe what is fake and what is not fake varies. Additionally, the term misinformation itself has different meanings and interpretations.

  2. Which embedding types or models improve COVID-19 misinformation-classification models? We evaluated several publicly available models (W2V, GloVe, Google, Paragram, Wiki, and BERT) on the integrated COVID-19 dataset. The use of these embedding models and our dataset integration share a similar goal: producing pre-trained models that can be used beyond a single dataset.

Figures 1 and 2 summarize the two experimental tracks we followed. The first track focused on data-analytics models per the different individual datasets, whereas the second track focused on first combining those datasets before performing data-analytics tasks.

Feature extraction in text-based datasets is based primarily on the text column, and so we believe that text datasets from the same domain should be combined to improve the quality of classification models.

Fig. 1 Research methods: Individual datasets

Fig. 2 Research methods: Combined dataset

We think there is value in combining datasets:

  1. Text-based datasets have similar feature-based extractions, i.e., from the single text column, unlike nontext-based datasets, where features can be very different from one dataset to another.

  2. In the same or similar domains, corpus-based features are expected to be similar. In other words, for machine-learning models with the same domain and target, popular text-based features should be similar.

  3. Combining reduces bias in input datasets. Here, bias refers to machine-learning bias, particularly when models perform well because of narrow or certain scopes. There are several aspects of and reasons for bias in machine-learning models; one possible source is the input dataset. Integrating different datasets for the same domain is expected to reduce such bias.

We began the experiments with basic models on the individual datasets. We then reused those models in the integrated datasets and introduced new models applied on the integrated dataset.

As implied in Fig. 2, our goal is to produce a reference model that can be applied to datasets beyond those used here.

3 Data, experiments, and analysis

We used three public datasets related to COVID-19 misinformation:

  1. FakeCOVID (Shahi and Nandini 2020). FakeCOVID is a multilingual dataset of 7,623 news articles related to COVID-19. In addition to not fake/fake,1 some articles are labeled partially not fake or partially fake. We used the 6,286 articles that carried either a fake or a not-fake label. The dataset is imbalanced: only 34 articles are labeled not fake, and the rest are labeled fake.

  2. CoAID (COVID-19 heAlthcare mIsinformation Dataset) (Cui and Lee 2020). CoAID is a diverse COVID-19 healthcare-misinformation dataset that includes fake news from websites and social networks as well as users’ responses to that news. It contains 3,235 news stories, 294,692 related user engagements, 851 social-platform posts about COVID-19, and ground-truth labels.

  3. COVIEWED (COVIEWED 2020). The COVIEWED project is an effort to combat COVID-19 misinformation; through its public website, the project invites data-science researchers to contribute ideas toward that goal. We used one subset of the COVIEWED submissions, which includes a known list of COVID-19 misinformation collected from IDeas (2020). That list is labeled as fake claims. The definition of not-fake claims, however, is more generic, and the dataset includes many posts and comments from different websites that mention COVID-19.

We noticed that the datasets have different interpretations of what is fake and what is not fake. Although we acknowledge that classifier accuracy would be affected by such integration, we nonetheless believe that combining the datasets achieves two main goals:

  1. To produce models that generalize to multiple disinformation topics.

  2. To reduce possible bias in proposed models. Bias can exist or be introduced to machine-learning models in several different ways. For example, classification models that are generated based on specific datasets can be biased or overfitted as a result of issues in the input data (e.g., the features, how data are collected, and how target columns are interpreted). They might work well in the evaluated datasets but poorly in any other dataset.

3.1 Results from the FakeCOVID dataset

We initially evaluated three single classifiers: logistic regression (LR), support vector classifier (SVC), and naive Bayes (NB). As expected, because of the imbalance in the dataset, the class label with a relatively large number of instances reported very good performance whereas the other did not. Table 1 summarizes the performance metrics for all three classifiers.

Table 1 Classification metrics for the FakeCOVID dataset

To summarize:

  • Performance metrics for fake claims are very high, as there are enough data to help machine-learning algorithms learn and predict.

  • On the other hand, those metrics are very low for the not-fake claims, as there are not enough data to build a reliable classification model.

  • Unlike the SVC and NB classifiers, which reported the same results for both CV and TF/IDF feature extraction, the LR classifier produced different results for the two approaches.

  • Overall accuracy is high, as the majority of the dataset is on the fake-claims side. In comparison with the original paper that produced this dataset (Shahi and Nandini 2020), we found better results in terms of all reported metrics: accuracy, precision, recall, and F-score.
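To make the setup above concrete, the following is a minimal sketch of the single-classifier experiments. The file name and the text and label column names are hypothetical placeholders rather than the actual FakeCOVID schema, and the vectorizer and classifier settings are illustrative defaults, not our exact configuration.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical file and column names; the real dataset has additional columns.
df = pd.read_csv("fakecovid_claims.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

vectorizers = {"CV": CountVectorizer(max_features=100),
               "TF/IDF": TfidfVectorizer(max_features=100)}
classifiers = {"LR": LogisticRegression(max_iter=1000),
               "SVC": SVC(),
               "NB": MultinomialNB()}

for v_name, vectorizer in vectorizers.items():
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    for c_name, clf in classifiers.items():
        clf.fit(X_tr, y_train)
        # Per-class precision/recall/F1 make the effect of class imbalance visible.
        print(v_name, c_name)
        print(classification_report(y_test, clf.predict(X_te), zero_division=0))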

3.2 Results from the CoAID dataset

Using the same three single classifiers (LR, SVC, and NB), we found that the CoAID dataset showed variation in performance results (Table 2). The sample dataset had balanced numbers of both label types. All three classifiers showed different results between CV and TF/IDF, with CV producing better accuracy for all three classifiers.

Table 2 Classification metrics for CoAID dataset

3.3 Results from the COVIEWED dataset

The COVIEWED dataset had more not-fake claims than fake claims, and thus performance metrics are high on not-fake claims and very low on fake claims. Accuracy is similar for all classifier models (Table 3).

Table 3 Classification metrics for COVIEWED dataset

3.4 Combination of COVID-19 misinformation datasets

As mentioned earlier, datasets related to misinformation are inconsistent in terms of how they label information/misinformation. As a result, it is difficult to (1) integrate different datasets with each other and/or (2) transfer models and knowledge from one dataset to another. Our goal here is to show different ways of dealing with such issues.

We focused on combining only two columns, the text column and the label column, and ignored all other columns that can be different among datasets.

We generalized the terminology across the different datasets. Our combined dataset uses a more generic binary label, fake versus not fake, to accommodate the less-generic labels used by the individual datasets. For example, a label of true claim indicates that a story is told accurately but not necessarily that the information in it is correct; in our combined labeling this falls under not fake. This approach could be extended by combining misinformation datasets from different categories (e.g., false claims, fake news, hoaxes, spam, and insincere questions) and creating a broad binary label, such as fake/not fake or yes/no.
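As an illustration of this label generalization, the sketch below maps each source dataset's original labels onto the shared binary fake/not-fake scheme and concatenates the text and label columns. The column names, the label vocabulary in the map, and the already-loaded DataFrames (fakecovid, coaid, coviewed) are hypothetical placeholders, not the real schemas.

import pandas as pd

# Illustrative mapping from source-specific labels to the shared binary label.
LABEL_MAP = {
    "false": "fake", "fake": "fake", "misleading": "fake", "hoax": "fake",
    "true": "not fake", "real": "not fake", "correct": "not fake",
}

def harmonize(df: pd.DataFrame, text_col: str, label_col: str) -> pd.DataFrame:
    out = df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"})
    out["label"] = out["label"].astype(str).str.strip().str.lower().map(LABEL_MAP)
    # Rows whose labels fall outside the map (e.g., "partially false") are dropped.
    return out.dropna(subset=["label"])

# fakecovid, coaid, and coviewed are assumed to be DataFrames loaded elsewhere;
# the column names passed here are placeholders.
combined = pd.concat([harmonize(fakecovid, "title", "class"),
                      harmonize(coaid, "content", "label"),
                      harmonize(coviewed, "claim", "veracity")],
                     ignore_index=True)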

We aggregated data about COVID-19 misinformation from different sources for several reasons, one related to bias and overfitting issues. Bias refers to models that can be highly accurate in terms of performance metrics but which represent only a subset of reality due to their focus on some data points while ignoring others. Overfitting in data analytics refers to a problem when models work well in one dataset or a subset of a dataset but poorly when applied to different datasets that were not part of model learning or testing.

We aggregated the three datasets and used instances from all three in both training and testing. The combined dataset has a total of 20,563 claims, 7,905 of which are fake and 12,658 of which are not fake. Preliminary text analysis of the fake versus not-fake claims showed some differences; the first feature we evaluated was the word count.
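As a quick illustration of the word-count comparison, assuming the combined DataFrame from the previous sketch:

# Word count per claim, summarized separately for fake and not-fake claims.
combined["word_count"] = combined["text"].astype(str).str.split().str.len()
print(combined.groupby("label")["word_count"].describe())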

We created word clouds for the combined fake and not-fake text, as shown in Fig. 3. The word clouds show the top words in each group and highlight the different focuses of the fake and not-fake discussions.

Fig. 3 Word cloud for the title column (fake text, right, versus not fake, left)
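Word clouds such as those in Figs. 3 and 4 can be generated along the following lines with the wordcloud package; the preprocessing and stop-word list here are illustrative assumptions and may differ from what produced the figures.

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, label in zip(axes, ["not fake", "fake"]):
    # Concatenate all claims of one class into a single string.
    text = " ".join(combined.loc[combined["label"] == label, "text"].astype(str))
    cloud = WordCloud(stopwords=STOPWORDS, background_color="white",
                      max_words=100).generate(text)
    ax.imshow(cloud, interpolation="bilinear")
    ax.set_title(label)
    ax.axis("off")
plt.show()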

The top words in the two word clouds can be summarized as follows:

  1. Top words in both clouds: COVID-19, coronavirus, virus, people, hospital, say/said, novel, China;

  2. Top words in not-fake content: outbreak, patient, New, New York; and

  3. Top words in fake content: lockdown, vaccine, India, Chinese, video.

Then we did the same using the content column, as shown in Fig. 4.

Fig. 4 Word cloud for the content column (fake text, right, versus not fake, left)

4 Features and classification models

One important step in text analysis is to evaluate features that can produce classification or prediction models with high accuracy. We evaluated two popular approaches: count vectors (CV) and term frequency/inverse document frequency (TF/IDF). Figure 5, left, shows an assessment of using CV with several classifiers. For the four performance metrics we evaluated (precision, recall, F1-score, and accuracy), most classifiers except kNN had values between 70% and 80% on all metrics. Again, except for kNN, all classifiers showed similar metric values for fake and not-fake claims. Figure 5, right, shows similar results for TF/IDF.

Fig. 5 Classifiers’ performance metrics (CV, left; TF/IDF, right), 100 terms, no embedding

The previous experiments used an initial fixed set of 100 terms/features. Our next goal was to evaluate the impact of increasing the number of input terms/features on the performance of the classification models. We also wanted to assess the impact of increasing the input dataset from 2,000 to 15,000 claims while ensuring that the same number of fake and not-fake claims was used in each model. Each classifier also had a few input variables of its own. Because previous results were similar for CV and TF/IDF, and to reduce redundancy, we report results from only one approach, CV text-feature extraction. Table 4 summarizes the results of evaluating the effect of the number of terms on the classifiers’ performance metrics.

Table 4 Classification accuracy versus the number of model input terms

The table shows that of all evaluated classifiers, the decision tree (DT) shows the lowest accuracy in both evaluated settings. The DT classifier was also insensitive to increasing the number of terms in the model: its best performance metrics were achieved with a relatively small number of terms, and adding more terms did not improve the metrics but rather had the opposite effect in some cases.

All other evaluated classifiers were sensitive to increasing the number of terms used as input features to the classification model. As a cost, increasing the number of terms increases model complexity and reduces efficiency.
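A minimal sketch of the term-count sweep behind Table 4 follows. The classifier list, the term counts, and the train/test split are illustrative assumptions rather than the exact experimental configuration; the combined DataFrame comes from the earlier sketch.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    combined["text"], combined["label"], test_size=0.2,
    random_state=42, stratify=combined["label"])

# Accuracy as a function of the number of CV terms (max_features).
for n_terms in [100, 500, 1000, 5000]:
    vectorizer = CountVectorizer(max_features=n_terms)
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    for clf in [LogisticRegression(max_iter=1000), SGDClassifier(),
                KNeighborsClassifier(), DecisionTreeClassifier()]:
        acc = accuracy_score(y_test, clf.fit(X_tr, y_train).predict(X_te))
        print(f"{n_terms:>5} terms  {type(clf).__name__:<22} {acc:.3f}")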

5 Learning with word- and sentence-embedding models

Word and sentence embedding is a method for obtaining a context-dependent vectorized representation of every word or sentence in a text corpus. This representation allows words to be compared in embedding space: words that lie closer together have similar meanings and/or connotations, whereas words that lie far apart are very dissimilar. Word-embedding data are existing, pre-trained distributed word representations, and the main task is to select the highest-quality embeddings. In this process, distributional models are trained over different sources such as Wikinews, news articles, and Google News, as well as contextual models such as BERT.

Recent state-of-the-art embedding models such as BERT have proven successful at producing relevant word embeddings for practical applications such as language translation. All terms in the corpus are embedded. The sentence-transformers repository provides training code and transformer models for generating sentence and text embeddings (Reimers 2019). Sentence-BERT uses a Siamese-like network architecture that takes two sentences as input; the sentences are passed through BERT models and a pooling layer to generate their embeddings (Huilgol 2020).
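A minimal sketch of obtaining sentence embeddings with the sentence-transformers library is shown below; the checkpoint name and the example sentences are illustrative choices, not ones prescribed by this paper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed pre-trained checkpoint
sentences = ["COVID-19 vaccines were tested in clinical trials.",
             "5G towers spread the coronavirus."]
embeddings = model.encode(sentences)              # one dense vector per sentence
# Proximity in embedding space reflects semantic similarity between sentences.
print(util.cos_sim(embeddings[0], embeddings[1]))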

To evaluate the impact of using word embeddings, we used the same classification settings as in the previous experiment, with the addition of word embeddings. Before being passed to the classifiers, both the training and testing data from the COVID-19 claims were encoded with the BERT embedding model, and the resulting embeddings were used as input for all classifiers. Figure 6 summarizes the accuracy metric for all classifiers. Except for decision-tree models, the classification models showed improvement in all performance metrics when using embedding models. Unlike in previous experiments, no classifier showed sensitivity to an increase in the model’s number of terms.
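The sketch below illustrates that pipeline: the train and test text (reusing the split from the earlier sweep sketch) is encoded with a sentence-transformer model, and the resulting vectors are fed to two of the classifiers. The checkpoint and classifier choices are assumptions for illustration.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")     # assumed checkpoint
X_tr_emb = encoder.encode(list(X_train), show_progress_bar=False)
X_te_emb = encoder.encode(list(X_test), show_progress_bar=False)

# Dense sentence embeddings replace the sparse CV / TF/IDF features.
for clf in [LogisticRegression(max_iter=1000), SGDClassifier()]:
    clf.fit(X_tr_emb, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_te_emb)))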

Fig. 6 Classifiers’ accuracy versus the number of terms, with embedding

5.1 Most-informative features

Classification and prediction models are based on input features. In text analytics, those features can be extracted either from text statistics or from the text corpus. In corpus-based approaches, the default is to use tokens (words, phrases, n-grams, and the like) as features. The process starts with the full text; preprocessing steps such as stop-word removal and stemming are applied to produce a preprocessed corpus, and the analysis then focuses on producing the most-informative features for predicting the classification target class. We evaluated three approaches to extracting the most-informative single-word features; due to space limitations, we show results from only one experiment, TF/IDF most-informative features (Table 5).

Table 5 Most informative features: TFIDF.

Looking at the most-informative text-based terms, we can see that they may reveal more about the particularities and properties of the datasets used than about any objective truth regarding which words are good indicators of fake news. A large majority of the top terms (not shown in the previous tables) are simply words that point to a specific domain or publisher. Similar to findings in other research (e.g., Fairbanks et al. 2018), informative terms in the not-fake-claims section include terms that typically appear in news articles, whereas informative terms in the fake-claims section include highly specialized terms, indicating that they refer to specific conspiracy theories. Table 6 shows a sample from another approach, which integrates the logistic-regression (LR) classifier with the chi-square feature-selection method. Chi-square values show the significance of a term for LR classification, i.e., for predicting an instance’s target label.

Table 6 Top terms using LR and chi-square
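A sketch of the LR-plus-chi-square approach behind Table 6 follows: TF/IDF terms are ranked by their chi-square statistic against the fake/not-fake target, and an LR classifier is trained on the top-ranked features. The vectorizer settings and the value of k are illustrative assumptions, and the combined DataFrame comes from the earlier sketch.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(combined["text"])
y = combined["label"]

# Rank individual terms by chi-square significance with respect to the target.
scores, _ = chi2(X, y)
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(scores)[::-1][:20]
for term, score in zip(terms[top], scores[top]):
    print(f"{term:<20} {score:8.1f}")

# Train an LR classifier on the k highest-scoring terms.
lr_chi2 = make_pipeline(SelectKBest(chi2, k=1000),
                        LogisticRegression(max_iter=1000))
lr_chi2.fit(X, y)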

5.2 Comparison of word-embedding models

We used several word-embedding models under the same experimental settings to extend our assessment of word-embedding models for COVID-19 fake-news detection. The specific models we used were W2V, GloVe, Google, Paragram, Wiki, and BERT. Overall, the SGD and logistic-regression classifiers scored the highest accuracy values, between 86% and 87% in most embedding models, whereas the MLP classifier scored the lowest in most experiments (Fig. 7).

Fig. 7 Classifiers versus word-embedding models
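As an example of how one of the static word-embedding models can be plugged into the same classifiers, the sketch below averages pre-trained GloVe vectors over each claim's tokens to build document vectors; the specific GloVe checkpoint and the whitespace tokenization are illustrative assumptions.

import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")     # 100-dimensional GloVe vectors

def doc_vector(text: str) -> np.ndarray:
    # Average the vectors of in-vocabulary tokens; zero vector if none match.
    tokens = [t for t in str(text).lower().split() if t in glove]
    if not tokens:
        return np.zeros(glove.vector_size)
    return np.mean([glove[t] for t in tokens], axis=0)

X_glove = np.vstack([doc_vector(t) for t in combined["text"]])
# X_glove can replace the CV / TF/IDF features in the earlier classifier sketches.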

6 Conclusion

As Tasnim et al. (2020) stated, providing a variety of stakeholders, including social-media companies, healthcare professionals, mass media, and other actors, with the latest results of research related to battling the spread of misinformation, disinformation, and other untrustworthy online information related to COVID-19 is an important step in bringing closer the end of the devastating global pandemic. Given that previous studies have demonstrated a direct link between COVID-19 misinformation and an unwillingness to follow public-health measures, effective application of machine-learning techniques to detect misinformation and disinformation in social and digital platforms is becoming an increasingly important tool in the global fight against the deadly disease.

We evaluated some of the challenges related to using public misinformation datasets to extract relevant knowledge. Misinformation today covers a spectrum of terminologies and concepts that can differ from one another in many respects. As a result, analytic models produced from those datasets can be biased and may not work well on different datasets. We combined several misinformation-related datasets that address different aspects of misinformation related to COVID-19. Because the three datasets we used have imbalanced class labels, one advantage of the integration was reducing that imbalance.

The combination process was simple, as the text-based datasets centered on two main columns, text and label. The major challenge in such integration arises when class labels differ among datasets. With misinformation datasets, we found the best approach is to broadly group all misinformation labels under one category and similarly combine the opposite class labels. This makes models more general and less biased, although prediction accuracy may be affected.

We focused our analysis on how some recent text-analysis feature-extraction and prediction techniques can affect prediction models’ performance. We observed that some classifiers are more sensitive than others to the number of input terms. We also observed that whereas word-embedding methods improved all evaluated classification models, the level of improvement varied among classifiers. Compared with other word- and sentence-embedding models, recent sentence transformers such as BERT showed greater improvements for most classifiers.

For machine-learning models and tools to be accurate in terms of classification/prediction and to be transferable from one model or dataset to another, there is a need for common and unified terminologies for both information and misinformation. Again, although our work was focused on COVID-19-related information, a similar approach of combining datasets to improve performance in learning models could be used with a variety of other topics that are prone to high saturation of misinformation and disinformation, including political messages on social and digital media. For example, recent Russian aggression against Ukraine that employs a variety of information-warfare techniques targeting multiple audiences outside of Russia shows a growing need for using machine-learning techniques to identify such disinformation campaigns.

Note

1 For consistency, we used the terms “fake” and “not fake,” whereas other authors or datasets sometimes use terms such as “true” and “false.”