Introduction

“Anybody that wants a test can get a test. That’s what the bottom line is.” [1] was a sentence uttered by the president of the United States at a time when the daily testing capacity was about 75,000 nationally. Misinformation ranging from harmless rumors to extremely complex and dangerous conspiracy theories has been one of the most defining characteristics of social media use in recent years. The intensification of the use of social media as an information-sharing tool, coupled with a directly and indirectly forced lockdown during the COVID-19 pandemic, has multiplied the production of misinformation.

As the example above illustrates, people from different walks of life, including world leaders, Instagram celebrities, troll factories, and many others, have been intentionally or unintentionally spreading misinformation. Assuming that these entities act rationally, people and institutions that use their own resources to produce those falsehoods must have a purpose. Inspired by the national-level analysis and empirical approach of the book Varieties of Capitalism: The Institutional Foundations of Comparative Advantage [2] and other articles and books that use ‘Varieties of…’ in their titles,Footnote 1 this article studies the relationship between country-specific variables and the topics and content of misinformation. The study specifically looks at the variation in topic ratios and the creativity in misinformation content over time by providing extensive data analysis and using the most extensive datasets on misleading and false news created during the global pandemic.Footnote 2

Findings indicate that the ‘Varieties of…’ literature can be applied only to certain cases in this context. Specifically, countries struggling in a range of areas including press freedom and human development tend to produce news content similar to each other. In addition, news on animals, predictions and symptoms are the three biggest differentiators between countries, whereas news on conspiracies, medical equipment and risk factors does the least to differentiate them. Based on those findings, we discuss some distinct public health and communication strategies to dispel misinformation in countries with particular characteristics, such as efforts to repair the tarnished reputation of Western governments and pharmaceutical companies in low human development countries. We also emphasize that a global action plan against misinformation is needed given the highly globalized nature of the online media environment and the ubiquity of the major conspiracy theories around COVID-19. The dataset used in this study is the largest dataset of online misinformation about COVID-19 that can be found in the literature, as it comprises over 10 thousand falsehoods from 129 countries. Our study is also unique in the number of variables used: the countries are compared based on 14 variables aiming to measure economic, informational, political, and socio-cultural environments in each, such as their level of democracy, trust in science, income inequality, and healthcare strength.

Misinformation during a pandemic

First, a clarification on the terms we use in this article. In parallel to the rise of social media and other online platforms over the past 15 years, there has been a proliferation of studies looking at the emergence, spread, consumption, and effects of misleading and false information by analyzing the phenomenon under various terms: how rumors spread on Twitter [9], real-world impacts of hoaxes on Wikipedia [10], how individuals consumed fake news prior to the 2016 US presidential election [11], or how regular people, and not just state-supported media, actively participate in generating disinformation in Russia [12]. In this paper, we chose to use the term misinformation over the aforementioned alternatives. As defined in Merriam-Webster’s dictionary [13], misinformation refers to “incorrect or misleading information”. According to this definition, any piece of information that is partly or fully false can be labeled as misinformation, irrespective of an intent to deceive on the part of those who produce or diffuse the information. Thus, misinformation is a broader term, which comprises disinformation, i.e., false information that is “deliberately” spread “in order to influence public opinion or obscure the truth” [13], as well as false information shared inadvertently or unintentionally. As discussed in the “Data” section below, this term befits the dataset we use. In line with some other scholars [14, 15], we also avoid the popular term fake news, both because it is polarizing and politically charged and because it is largely limited to forms of misinformation that are deliberately designed to mimic news from established and mainstream news organizations [16].

Misinformation during the COVID-19 pandemic has reached such high levels that the World Health Organization (WHO) and other United Nations bodies recognized the need to fight against false and misleading information as a critical part of the global pandemic strategy [17]. Many of the falsehoods regarding COVID-19 have been inspired by, and have interacted with, conspiracy theories that predated the pandemic, including those involving “Big Pharma”, GMOs, Bill Gates, “deep state”, or a ring of satanic pedophiles [18,19,20]. The outcomes have been tangible: misinformation has significantly contributed to the spread of the illness and preventable deaths by promoting ineffective and harmful treatments and discouraging people from basic prevention such as wearing a mask or maintaining social distance. For instance, a study shows that over just a few months in the first half of 2020, approximately 800 people died and many more were hospitalized after drinking methanol as a cure for coronavirus [21]. More recently, vaccine-related misinformation has curtailed the vaccination efforts of many countries around the world, including the US where as of September 2021 the rate of vaccinated individuals fell short of the Biden administration’s original ambitions [22]. This was particularly worrying as hospitalization and death rates were much higher among the unvaccinated [23].

Unsurprisingly, there is a growing literature around misinformation regarding COVID-19, which can arguably be grouped into three main streams of research. A first stream strives to understand different types of misinformation in terms of their sources, spread patterns, and influence, and analyzes how misinformation differs from more accurate and factual information in those respects. The main findings indicate that most news shared online is accurate; however, accurate news is less likely to be shared than inaccurate news [24, 25], which is produced and spread by denser and more organized communities [26, 27]. In addition, we learn that more misinformation circulates on social media platforms than on traditional news media [28], and misinformation coming from public figures such as celebrities or politicians, although making up a relatively small part of online misinformation about COVID-19, generates higher levels of engagement and support than other types of falsehoods [29, 30].

A second group of research looks at what specific factors push individuals to believe in or share misinformation. These studies show that certain psychological predispositions, political leanings, and daily habits such as a tendency to reject expert information [31, 32], political conservatism [33], and right-leaning media consumption [34] are positively correlated with expressing and propagating misinformed views about COVID-19, whereas others such as greater science knowledge and cognitive reflection [35, 36], and trust in science and scientists [37] are negatively associated with those. Furthermore, it is interesting that an individual’s worry about personal health does not seem to have an effect on their propensity to share COVID-19 related misinformation [38].

A third category of research articles focuses on how exposure to misinformation affects health behavior during the COVID-19 pandemic. The findings are clear: consuming and/or believing in misinformation has a significant negative effect on willingness to take preventive measures such as wearing a face mask [39], maintaining physical distancing [40, 41], or getting vaccinated [42, 43]. In addition, belief in COVID-19 related falsehoods, including conspiracies, predicts greater use of pseudoscientific practices such as consuming garlic [44] or hydroxychloroquine [45]. Taken together with the research streams discussed previously, these studies shed light on the level of threat that misinformation poses to public health.

In parallel with these three streams, a relatively small but emerging group of studies analyzes how economic, political, and socio-cultural differences among countries impact misinformation during the pandemic. This body of work illustrates the significant effect of a number of country-level variables on the sources and types of misinformation as well as on exposure and susceptibility to it. Hence, countries’ levels of uncertainty avoidance [46]; political and media freedom, and mobile connectivity [47]; human development [48]; political conservatism and political control over media [49]; Gross Domestic Product [50]; media fragmentation and partisanship [51] have been found to shape the misinformation environment beyond the individual-level factors.

This study contributes to the existing literature by using the largest dataset of online misinformation to our knowledge: over 10 thousand falsehoods produced and shared across 129 countries.Footnote 3 It is worth noting that, unlike most other studies focusing on online misinformation [14, 25, 52, 53], our study does not solely rely on Twitter, but uses false and misleading information from various sources, including social media platforms such as Facebook, WhatsApp, Instagram, and Telegram. In fact, Facebook leads the list by making up 4347 of those pieces of misinformation. We find this important, because Facebook is not only the biggest social media platform globally with its 2.8 billion users [54], it is also the most popular one for COVID-19-related misinformation [24]. For comparison, Twitter had slightly less than 400 million active users as of July 2021 [54]. The prominent role of Facebook in the spread of Covid-19-related misinformation has been documented in a number of settings including the USA (2020), and Egypt (2021) [55,56,57].

Another unique aspect of this study is the wide range of variables it uses to compare COVID-19 misinformation across countries. As the “Data” section below discusses, the countries are compared based on 14 variables aiming to measure political, economic, informational, and socio-cultural environments in each, such as the levels of corruption perception, human development, press freedom, and healthcare strength. The Health Belief Model posits that when facing a health threat like the COVID-19 pandemic, individuals’ likelihood of engaging in preventive health behaviors depends on the perceived level of threat and on perceptions regarding the potential benefits of, and barriers to, participating in such behaviors [58]. The literature laid out above shows how misinformation affects those perceptions by downplaying the threat, distorting the facts around the origin and spread of the pandemic, and offering ineffective or harmful treatments. Therefore, we believe that our study can help public health authorities better leverage their country-level resources while fighting against online misinformation, for example by strengthening media freedom or improving scientific literacy.

Data

The data for this study include 10,131 falsehoods about the COVID-19 pandemic with varying intensities of misinformation.Footnote 4 Observations have been collected from the Poynter CoronaVirus Facts/DatosCoronaVirus Alliance Database provided by the Poynter Institute [1]. The Poynter Institute is a non-profit organization focused on journalism and research and based in St. Petersburg, Florida (https://www.poynter.org/). The dataset is updated daily and provides a comprehensive understanding of the evolution of misinformation during the progression of the pandemic. The dataset covers the period between January 2020 and February 2021. For each observation, the Poynter Institute provides the fact-checker that reported the intensity of misinformation to the institute, the date on which the story was published, the origin country/countries/continents, the intensity of misinformation, a brief summary of the story, the original text, and the link to the story. Most of the misinformation was initially published on social media outlets, such as Facebook, Instagram or YouTube. In most cases, it is hard to know the true origin of a piece of misinformation, since the same item is posted on different social media outlets on the same day. We should note that our study is not the only one that uses the Poynter Institute’s dataset. A few other articles [14, 29, 47] also rely on the Poynter CoronaVirus Facts/DatosCoronaVirus Alliance Database. However, our study uses the highest number of falsehoods (over 10,000), not just among the studies that use this particular dataset, but among all published research articles that focus on misinformation around COVID-19.

This study stands apart from other work in that it uses a carefully controlled hand-labeling of the topics of falsehoods. More specifically, to extract more information from the dataset, we manually labeled the observations with the 28 topic categories that the Poynter Institute mentions in association with COVID-19. (The topics that have been identified by the Poynter Institute are aid, animals, conspiracies, crime, cures, detection, food, governments, hospitals, individuals, insurance, laws, lockdown, medical equipment, medicine, origins, other diseases, predictions, prevention, religion, risk factors, spread, symptoms, travel, vaccines, videos, technology, and NGOs.) Manual labeling is a common practice in NLP research and has been used in other studies as well.Footnote 5 In contrast to the suggestion by the Poynter Institute that each news item belongs to exactly one topic, with the help of excellent research assistance, we identified the news items belonging to more than one topic and marked them accordingly. We used stratified sampling to check the quality and consistency of the data collection process. The original dataset was later enriched using NLP techniques: removing stop words, applying contraction mapping, removing links, emojis and hashtags, POS-tagging the words, and lemmatizing the news content. Brief and long summaries were then merged to provide more information. The descriptive table below provides a few interesting facts about the dataset (Table 1).
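The cleaning steps listed above can be sketched as follows. This is only a minimal illustration using the Python standard library: the contraction map, the stop-word list, and the function name `clean_text` are hypothetical placeholders rather than the resources used in the study, and POS tagging and lemmatization are left out because they require an external NLP library.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "doesn't": "does not", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "is", "it", "that", "of", "to", "and", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not a lowercase ASCII letter, digit, apostrophe, or whitespace
# (punctuation, emojis, other symbols) is replaced by a space.
NON_TEXT_RE = re.compile(r"[^a-z0-9'\s]")


def clean_text(text):
    """Return a list of cleaned tokens for one falsehood summary."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = URL_RE.sub(" ", text)                # remove links
    text = HASHTAG_RE.sub(" ", text)            # remove hashtags
    text = NON_TEXT_RE.sub(" ", text)           # remove emojis and punctuation
    tokens = []
    for token in text.split():
        # Contraction mapping may expand one token into several words.
        tokens.extend(CONTRACTIONS.get(token, token).split())
    return [t for t in tokens if t not in STOP_WORDS]


example = clean_text("Garlic can't cure COVID-19! \U0001F637 #fake https://example.com/x")
# example == ['garlic', 'cannot', 'cure', 'covid', '19']
```

Note that this simplified pattern also drops non-ASCII letters, so it would need adjusting for summaries that are not in English.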

Table 1 Descriptive table for the falsehood dataset

The map below shows the aggregate count of misinforming and misleading stories coming from each country. Most of the falsehoods were produced in India, the USA, Spain, Brazil and a few other European countries. As indicated by the map in Fig. 1, there is some correlation between the intensity of the pandemic and the amount of misinformation production. For example, as of September 2021, the USA, India, and Brazil were the top three countries in the world in terms of both the number of confirmed COVID-19 cases and deaths [65]. Also, as shown below, the period between March 2020 and May 2020 was the peak of misinformation production worldwide, and during that time, Spain had a higher number of cases than other European nations such as Italy, France, or Germany, despite having a significantly smaller population size than them [17].

Fig. 1 Geographic distribution of the falsehood counts

The stacked bar graph below shows the distribution of topics for each day, and the red line plot shows the number of stories published per day. As expected, there is a strong correlation between the two; nevertheless, some news items could not be assigned to any topic and some others were assigned to more than one topic. As one can observe below, the dataset shows signs of seasonality and trend. Specifically, the number of falsehoods reached its peak in the period March 2020-May 2020, at a time when there were significant uncertainties about the definition and implications of the SARS-CoV-2 virus. It is worth noting that starting from June 2020, significantly fewer falsehoods have been published worldwide, arguably thanks to clarifications and new information regarding the origin and spread of the virus as well as methods of treatment and prevention. Three key moments in the pandemic may have played important roles in reducing the amount of misinformation regarding COVID-19:

  1. On May 27, 2020, Dr. Anthony Fauci, the director of the National Institute of Allergy and Infectious Diseases, announced that a vaccine would be ready by December 2020 [66], which offered a promise of an end to the pandemic and lockdowns.

  2. On June 17, 2020, the WHO announced that it was stopping its trial of the hyped anti-malaria drug hydroxychloroquine after new data suggested that the drug was not effective for COVID-19. This helped dispel the myths regarding its benefits that were spread by individuals and organizations, including the influential French physician Didier Raoult, Donald Trump, and Russian state-owned media [67].

  3. On July 7, 2020, the WHO announced that COVID-19 may be an airborne disease mainly transmitted through respiratory droplets [68]. This not only reinforced the message around the importance of mask-wearing, but also helped limit the spread of misinformation concerning transmission through food or packaging (Figs. 2, 3).

Fig. 2 News count per day

Fig. 3 Topics and co-occurrences

As noted above, a significant contribution of this study is the assignment of the falsehoods to topics. Similarly, the exploration of co-occurrences between topics is of great importance to understand how groups of countries with varying characteristics produce misinformation. The graph below shows the aggregated results for topic co-occurrences. In the graph, the diagonal values show the counts for topics in the dataset, and the non-diagonal values indicate the number of co-occurrences between topics. Most falsehoods refer to an individual creating a story, and therefore have been classified as ‘Individuals’. Other topics with a significant amount of co-occurrence are ‘Governments’, ‘Conspiracies’, ‘Cures’, ‘Food’, ‘Prevention’, ‘Spread’, and ‘Medicine’. We explain these co-occurrences as follows: a large percentage of falsehoods were created between January 2020 and May 2020, a period in which there were still many unknowns about the origin, spread, cure, and prevention of the disease. This created a fertile ground for fully or partially misleading stories (e.g., benefits of eating a particular food as a protective measure) as well as outright conspiracy theories (e.g., the SARS-CoV-2 virus being a biological weapon). Starting from June 2020, the aforementioned announcements by leading figures and organizations seem to have helped reduce the misinformation around those subjects. However, despite a significant reduction in its amount, misinformation continued to be produced, and started to focus on other issues such as vaccines, politicians, and the impact of COVID-19 on healthcare systems. The count values in the graph have been colored using a logarithmic scale, and there are a few instances where no co-occurrence was observed (those cells are colored gray). On average, each falsehood has 2.165 co-occurring topics, and the maximum number of topics found in a single observation is 8.
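The structure of such a co-occurrence matrix can be illustrated with a short sketch. The function name `cooccurrence_matrix` and the four toy observations are hypothetical and serve only to show how diagonal counts and off-diagonal co-occurrences are obtained from multi-label data.

```python
from itertools import combinations


def cooccurrence_matrix(labels, topics):
    """Build a symmetric matrix from per-observation topic sets.

    Diagonal cells count how often each topic appears; off-diagonal
    cells count how often two topics appear in the same observation.
    """
    index = {topic: i for i, topic in enumerate(topics)}
    size = len(topics)
    matrix = [[0] * size for _ in range(size)]
    for topic_set in labels:
        present = sorted({index[t] for t in topic_set})
        for i in present:
            matrix[i][i] += 1
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


# Toy data (not taken from the dataset): each set is one labeled falsehood.
topics = ["Individuals", "Cures", "Food"]
labels = [
    {"Individuals", "Cures"},
    {"Cures", "Food"},
    {"Individuals"},
    {"Individuals", "Cures", "Food"},
]
matrix = cooccurrence_matrix(labels, topics)
# matrix == [[3, 2, 1], [2, 3, 2], [1, 2, 2]]
mean_topics = sum(len(s) for s in labels) / len(labels)  # 2.0 topics per item
```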

During the collection and processing of the dataset, we were pleasantly surprised to see news with varying degrees of credibility and fantasy. Some stories diverged completely from reality and nevertheless sounded credible; others gave the impression that they were produced in distant corners of a world of fantasy. Overall, we can divide the stories into two main categories. On the one hand, there were those that had some ground in reality but still distorted the facts and misled the reader, whether intentionally or not. One example was a story published in Brazil that claimed the FDA had warned the public that COVID-19 vaccines cause stroke. In reality, the FDA had prepared a table listing all possible side effects that must be monitored. That did not mean those side effects were known to happen after immunization with vaccines. On the other hand, some stories did not have any basis in facts and deserved to be labeled conspiracy theories. For instance, a story published in Georgia used a fake quote attributed to an American virologist and claimed that 5G high-frequency towers were installed to control humans implanted with microchips through vaccines, and that the Rockefeller and Mason families were the financiers of this project. A range of examples showing this variation and the diversity of topics is given in Table 2.

Table 2 Examples of falsehoods

As mentioned above, we compared the 129 countries in the dataset based on 14 variables aiming to measure economic (e.g. income inequality), political (e.g. trust in government), informational (e.g. press freedom), and socio-cultural (e.g. trust in science) environments in each. While selecting those variables, our goal was to draw a thorough picture of countries to understand what factors impact the production of COVID-19-related misinformation in different settings. A detailed discussion on the social, economic, and political variables we have used can be found in “Part I” of the Appendix.

Research questions

The rich dataset on misinformation provides opportunities to make statistical comparisons between countries (e.g., their social and political characteristics), topics of the news, and the content of their text. In addition, since the evolution of the pandemic was a socially dynamic phenomenon, the fourth aspect is time. The examination of the dataset shows that there is considerably high variation between individual observations over time, and this study aims to find if these micro-variations can lead to meaningful macro-level comparisons. The richness of detail and the opportunities offered by the variables guided us to construct a methodological framework to find the larger patterns in the data.

By attempting to cover a large ground, this study aims to contribute conceptually and empirically to the literature by grouping the countries according to the predominant types of falsehoods they produce. In order to cover the dynamic evolution of misinformation over the course of 13 months during the pandemic, the paper looks at four pillars of analysis that can be grouped under topic analysis and content analysis. With this background in mind, the paper aims to be one of the pioneering contributions to the literature. Despite the early optimism stemming from vaccine rollouts, as of September 2021, COVID-19 was still a major health threat around the world due to factors including new mutations and many countries missing their vaccination targets. In particular, we can predict that a sizable number of low- and middle-income countries will be fighting against it for a long time due to unacceptably low vaccination rates (e.g., only around 2% of people in low-income countries had received at least one dose of a vaccine [83]). In addition, socially and biologically, the field is still characterized by known unknowns and unknown unknowns; therefore, conceptual contributions may provide helpful guidance to researchers and policymakers. Theoretically, the paper also aims to give comparative politics and sociology scholars opportunities to look deeper into the reasons why different countries produce different types of falsehoods and to analyze which socio-cultural, economic, and political variables affect the misinformation environment more than others. Methodologically, the paper takes advantage of a variety of statistical techniques, including a selection of network similarity algorithms. The use of network similarity algorithms to compare texts has largely been neglected in the computational social science literature.

It is also important to mention that before conducting this study, we considered a different strategy as well. In fact, initially, we extracted hundreds of millions of tweets from more than ten countries around the world and calculated the similarity between those tweets and the dataset on falsehoods. However, this approach resulted in no findings, and using different text similarity algorithms, we were not able to identify any matches. This led us to believe that a targeted approach to analyze misinformation could be more effective than trying to discover patterns in large but random samples.

The research questions are elaborated below.

Topic analysis

The first two sets of research questions analyze the causal mechanism of topic selection by different groups of countries. The countries have been grouped by using the economic, informational, political, and socio-cultural variables that we introduced in the Data section. The set of questions below helps to understand the macro patterns in the dataset by minimizing the errors associated with labeling through manual classification.

RQ1) Divisive and connective topics

  • RQ1a) In terms of topic creation, what are the topics that two groups of countries utilize in the most comparable amount vis-a-vis each other? In other words, what are the topics that are the most connective?

  • RQ1b) What are the topics that are the most divisive?

RQ2) Topic co-occurrences

  • RQ2a) What are the topics that co-occur the most?

  • RQ2b) Are some groups of countries statistically significantly different from others in terms of topic co-occurrence?

  • RQ2c) Is there a time frame in the evolution of COVID-19 in which topics were more similar to each other?

Content analysis

The second group of questions looks more deeply at the content of the news by calculating the similarity between and across news associated with different topics, and also analyzes the news from the perspective of ‘unusualness’. (This will be explained in greater detail in the Methods section.) By doing this, we aim to understand if there is any association between topic correlation or content similarity across different groups of countries. The goal is to find out if countries have been inspired by each other in terms of content creation and how this relates to the variables collected.

RQ3) Content similarity

  • RQ3a) Are there groups of countries that produce news that are significantly more similar to each other?

  • RQ3b) How does the similarity between news change over time?

RQ4) Misinformation unusualness/creativity

  • RQ4a) Are there groups of countries that are more creative than others in content formation?

  • RQ4b) How does creativity evolve over time?

As the research questions suggest, the paper aims to offer a descriptive perspective into the creation of a framework on misinformation production. The statistical tools used in the paper are elaborated closely in the next section.

Methods

The methodological tools used in the paper have been chosen to find similarities and differences between individual observations and between groups of observations. To answer the four sets of research questions indicated above, we employed a variety of tools, including t tests, the calculation of entropy and the GINI index as measures of information gain, k-means++ clustering, network similarity algorithms, and content comparison algorithms (NLP). (For the analysis, the Python programming language and associated libraries were used.)

The data on falsehoods were collected from the Poynter Institute using web scraping techniques. (The Poynter Institute allows the use of their data for research purposes.) The data were later manually processed to associate each observation with at least one topic from a collection of 28 different topics (topics were identified through the examples provided on the Poynter Institute’s website). Around 500 observations could not be associated with any topic and were therefore discarded. For the classification of topics, a few unsupervised clustering options were tested, such as latent Dirichlet allocation [84] and non-negative matrix factorization [85]; nevertheless, the most coherent results were obtained through manual labeling.

As previously mentioned, the starting point of this paper is the assumption that the nature of misinformation production is highly dependent on the personalities of countries, which can be associated with certain socio-cultural, political, informational, and economic characteristics. In that sense, our study follows previous work such as Sauvy’s “Three Worlds, One Planet” (1952) [86], which coined the term Third World; Huntington’s “The Clash of Civilizations?” (2000) [87]; Hall and Soskice’s Varieties of Capitalism: The Institutional Foundations of Comparative Advantage (2001) [2]; and Wallerstein’s The Capitalist World Economy (1979) [88] with its core versus periphery distinction. However, unlike those works, we do not prioritize a specific dimension (e.g., economic systems or “culture”) as the primary distinguishing variable; instead, we aim to draw a more comprehensive picture of countries by using 14 variables ranging from income inequality to trust in science and scientists to colonization history. In order to simplify and automate the classification of countries, a generally accepted and useful clustering algorithm, k-means++ [89], was used. To pre-process the data for clustering, categorical variables were converted into a 5-point Likert scale, and 0–1 normalization was applied to all variables. The optimal number of clusters was determined using the within-cluster variation (SSE) and the ‘elbow method’. Two clusters came out as the optimal number, and six as the second optimal choice; to better represent the variation among countries, we chose six.
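The pre-processing and clustering steps described above can be sketched in a few lines. This is a simplified, standard-library illustration and not the implementation used in the study: the function names, the toy country profiles, and the fixed random seed are all assumptions, and the seeding step assumes there are at least k distinct points.

```python
import random


def min_max_normalize(rows):
    """Rescale every column to the 0-1 range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick new centers with probability proportional
    to the squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, seed=0, iters=100):
    """Lloyd's algorithm with k-means++ seeding; returns (labels, SSE)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda i: sq_dist(p, centers[i])) for p in points]
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, sse


# Toy country profiles with two variables each (made-up numbers).
raw = [[1, 10], [2, 12], [1.2, 11], [9, 50], [10, 52], [9.5, 51]]
points = min_max_normalize(raw)
# SSE per candidate k; the 'elbow' is where the decrease levels off.
sse_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
```

Plotting `sse_by_k` against k gives the curve that the elbow method inspects; on this toy data the drop from one to two clusters dominates.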

To handle the missing data in the datasets, different imputation methods were used. For the missing social, economic and political observations, a technique called “multivariate feature imputation” was implemented [90]. This technique takes a two-dimensional matrix as the input and models each feature with missing values as a function of the other features in an iterated round-robin fashion. This is suitable for our case, since the variables at hand possibly have causal connections. To fill in the missing values in the time series datasets (for similarity and unusualness), the K-nearest neighbors (KNN) algorithm [91,92,93,94] was used. The assumption behind this algorithm is that missing observations can be approximated by the values of the closest points, most frequently by taking the average of the ‘k’ points around the missing observation. As argued in a multitude of works using KNN for imputation, this approach is believed to work well in time series data with missing observations, for which the best predictors of the missing points are the values temporally closest to them.
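The idea behind the time series imputation can be illustrated with a small sketch. It is a simplified stand-in rather than the study's implementation: the function name `knn_impute_series` is hypothetical, distance is measured purely in time steps, and ties are resolved in favor of the earlier observation.

```python
def knn_impute_series(values, k=2):
    """Fill None entries with the mean of the k temporally nearest observed values."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            # Stable sort: among equally distant neighbors, earlier ones come first.
            nearest = sorted(observed, key=lambda item: abs(item[0] - i))[:k]
            filled[i] = sum(val for _, val in nearest) / len(nearest)
    return filled


series = [1.0, None, 3.0, 4.0, None, None, 7.0]
imputed = knn_impute_series(series, k=2)
# imputed == [1.0, 2.0, 3.0, 4.0, 3.5, 5.5, 7.0]
```

Only originally observed values are used as neighbors, so an imputed point never feeds into another imputed point.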

In addition, the same set of social, political, cultural, and economic variables were used to reduce dimensionality using principal components analysis. For the PCA, two dimensions came out to be the optimal choice based on the scree plot. Two other dimension reduction techniques, namely t-SNE [95] and spectral embedding [96], were considered; however, PCA was preferred as a traditional method to obtain two variables that are not correlated. The correlation map between the variables used and the reduced dimensions can be seen in the plot below (significant correlation values are marked with a black box). As Fig. 4 shows, most of the social, cultural, and political variation can be explained by four variables: corruption perceptions index and health coverage (PCA—Dimension 1), GINI Index, and trust in government (PCA—Dimension 2). Figure 5 is a representation of the variables after dimension reduction (PCA). As evidenced by it, countries in the dataset can be successfully grouped into six clusters (by using k-means++) with the help of the variables listed below. Findings have been consistent and robust after running the clustering algorithm ten times.

Fig. 4 Correlations between variables

Fig. 5 Countries in different social, economic and political clusters

In order to compare the use of topics across groups of countries, an idea borrowed from decision trees was used. When decision trees are applied to classification, entropy and the Gini index are the two most frequently used cost functions for calculating the information gain, or purity, of the classes obtained by splitting the data. Using these measures, we sought to find, given two groups of countries that differ by a single feature (for example, two groups of countries with different levels of democracy), which topic (in terms of its frequency) is the most different and which topic is the most similar between the two. Across the social, cultural, economic, and political variables, hundreds of comparisons between groups of countries were made, and the number of times each topic emerged as the most divisive (most different) or the most connective (most similar) was counted. Finally, these count values were inversely weighted by the count of the associated topic in the dataset to obtain a ranking of the most divisive and connective topics.
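The purity idea can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that each topic's frequency is split at its median and that the weighted impurity of the resulting partitions measures how well the topic separates two groups of countries. The function `split_impurity`, the group labels, and the frequencies are hypothetical.

```python
import math
import statistics
from collections import Counter
from typing import Callable, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(freqs: Sequence[float], labels: Sequence[str], threshold: float,
                   cost: Callable[[Sequence[str]], float] = gini) -> float:
    """Weighted impurity of the group labels after splitting on a topic frequency."""
    left = [lab for f, lab in zip(freqs, labels) if f <= threshold]
    right = [lab for f, lab in zip(freqs, labels) if f > threshold]
    n = len(labels)
    return sum(len(part) / n * cost(part) for part in (left, right) if part)


# Hypothetical data: four countries in two groups, topic frequencies per country.
labels = ["high", "high", "low", "low"]
topic_freqs = {
    "animals": [0.30, 0.28, 0.05, 0.04],
    "conspiracies": [0.20, 0.10, 0.21, 0.11],
}
impurity = {
    topic: split_impurity(freqs, labels, statistics.median(freqs))
    for topic, freqs in topic_freqs.items()
}
most_divisive = min(impurity, key=impurity.get)    # purest split
most_connective = max(impurity, key=impurity.get)  # least informative split
```

In this toy case the "animals" frequencies separate the two groups perfectly (impurity 0), while the "conspiracies" frequencies do not separate them at all (impurity 0.5). Passing `cost=entropy` instead of the default gives the same ordering here.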

The paper assumes that the diversity of word usage in news reflects creativity; thus, more ‘unusually’ worded news items are considered more creative. To measure the unusualness of the observations, the 3-grams and 4-grams of the cleaned and lemmatized news were extracted. These n-grams were then used to calculate the TF-IDF score of each observation, defined as the sum of the TF-IDF scores of the n-grams associated with that observation. Observations with higher TF-IDF scores are believed to be more unusual and more creative; those with lower scores are considered less unusual. This information was then used to compute how unusualness changes over time and across groups of countries.
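A minimal sketch of this unusualness score is given below. It assumes a plain, unsmoothed TF-IDF definition (term frequency times log of inverse document frequency) and documents that are already cleaned, lemmatized, and tokenized. The weighting details, the helper names, and the toy documents are assumptions; the paper does not specify them.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], sizes: Sequence[int] = (3, 4)) -> List[str]:
    """All contiguous 3-grams and 4-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for n in sizes for i in range(len(tokens) - n + 1)]


def unusualness_scores(docs: List[List[str]]) -> List[float]:
    """Sum of TF-IDF weights over the n-grams of each document."""
    grams = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each n-gram appears at least once.
    doc_freq = Counter(g for counts in grams for g in counts)
    scores = []
    for counts in grams:
        total = sum(counts.values())
        score = 0.0
        for gram, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[gram])  # unsmoothed IDF (assumption)
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical lemmatized falsehoods: two identical claims and one distinct claim.
docs = [
    "garlic water cure virus fast".split(),
    "garlic water cure virus fast".split(),
    "secret lab create virus weapon".split(),
]
scores = unusualness_scores(docs)
```

The distinct third document receives the highest score because none of its n-grams occur elsewhere, which is the sense in which a higher score is read as more unusual wording.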

To calculate the document similarity between different observations, several approaches were considered, ranging from earlier and more generally accepted algorithms to more advanced techniques. Specifically, Word Mover’s Distance [97], the Universal Sentence Encoder provided by Google [98], BERT embeddings [99], and knowledge-based measures [100] were explored in the earlier phases of the analysis. All of these models turned out to be computationally too expensive for finding the cross-similarities between over 10,000 documents. To solve this problem, TF-IDF (term frequency-inverse document frequency) scores were calculated for all documents following the cleaning and lemmatization process [101]. Cosine similarities were then computed for over 50 million cross-comparisons. These similarities were finally aggregated to make cross-country comparisons using t tests.
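The pairwise step can be sketched in a few lines. The example below builds sparse TF-IDF vectors and computes the cosine similarity of every unordered document pair; n documents give n(n-1)/2 such pairs, which for roughly 10,000 documents is about 50 million. The smoothed IDF formula, the function names, and the toy documents are assumptions chosen for illustration, not details taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors (term -> weight) using a smoothed IDF (assumption)."""
    n_docs = len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical lemmatized falsehoods.
docs = [
    ["garlic", "cure", "virus"],
    ["garlic", "tea", "cure", "virus"],
    ["lab", "weapon", "origin"],
]
vecs = tfidf_vectors(docs)
pairs = {(i, j): cosine(vecs[i], vecs[j]) for i, j in combinations(range(len(docs)), 2)}
```

Documents that share vocabulary (the first two) obtain a high similarity, while documents with no terms in common obtain exactly zero. Aggregating such pairwise values by country group yields the samples that are then compared statistically.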

Finally, network similarity algorithms were applied to compare adjacency matrices composed of bi-weekly aggregated topic correlations between documents. Topic similarities can be represented as graph data since one document can only have a limited number of topics, and more than one topic presented in a single document can change the impact of the misinformation dramatically (holistic assumption). To calculate the similarities between topic correlation matrices, two graph similarity measures were used: Frobenius distance [102] and quantum-JSD distance [103]. The aggregated relative similarity matrix between topic ratios is provided as an example in Fig. 6. Topics with yellow-to-red colored cross-similarities are more closely related, and topics with yellow-to-blue colored cross-similarities are rarely mentioned together. This relationship is elaborated further in the “Results” section. For more information on how network similarities were calculated, please refer to the “Appendix” section. In a similar approach, the PERMANOVA [104] and ANOSIM [105] techniques, which allow the comparison of n × n-dimensional topic correlation matrices, were also tested; ultimately, however, these tests were not reported because of the impact of data size on the results.
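Of the two measures, the Frobenius distance is the simpler one: for two adjacency matrices A and B of the same shape, it is the square root of the sum of squared entry-wise differences. The sketch below illustrates it on two hypothetical 2 × 2 topic matrices; the function name and the values are assumptions, and the quantum-JSD distance is not covered here.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_distance(a: Matrix, b: Matrix) -> float:
    """Square root of the sum of squared entry-wise differences between two matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Hypothetical topic-by-topic relative similarity matrices for two bi-weekly periods.
period_1 = [[0.0, 0.5],
            [0.5, 0.0]]
period_2 = [[0.0, 0.1],
            [0.1, 0.0]]
distance = frobenius_distance(period_1, period_2)
```

A distance of zero means the two periods have identical topic structure; larger values indicate that the pattern of topic co-occurrence has shifted between the periods.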

Fig. 6 Relative similarities between falsehoods of all topics

Empirical results

Topic analysis

As previously mentioned, the topic analysis focuses on two questions: (i) divisive and connective topics and (ii) topic co-occurrences. The results for the first question indicate significant variance in the power of topics to differentiate groups of countries from each other. Thus, clusters of countries can be strongly associated with topics and vice versa. Topics were used to separate countries into clusters, and these clusters were compared with the groups generated using the social, economic, cultural, and political variables. The results show that some topics lead to much greater cluster purity when used for generating groups. The purity analysis was repeated with two cost functions, entropy and the Gini index, and the results were the same. In each comparison, the most connective and the most divisive topics were identified, and the number of times each topic appeared as the most connective or the most divisive was recorded. Finally, these counts were inversely weighted by the falsehood count associated with that particular topic. Table 3 shows the resulting ranking of the most connective and most divisive topics together with these values. The inversely weighted values indicate how strongly connective or divisive a topic is compared to the others.

Table 3 Ranking of most connective and most divisive topics

As seen above, “conspiracies” is the most shared topic category across groups of countries. We explain this finding as follows: the major conspiracy theories, including those pointing to Bill Gates-led plots to implant digital microchips to control people, those marking the virus as a biological weapon created by Chinese or American scientists, and those demonizing pharmaceutical companies as agents that worsen the pandemic and conceal the effective treatments, are produced by a small number of individuals and organizations with political and financial goals. These are then shared globally in the form of news stories, occasionally through media outlets but primarily via social media posts. In fact, a recent investigation conducted by the Associated Press and the Atlantic Council’s Digital Forensic Research Lab found that a few “superspreaders” (people and organizations such as Kevin Barrett, an anti-Semitic former lecturer on Islam, and the Montreal-based “Centre for Research on Globalization”) were responsible for a great percentage of the conspiracies on the origin of COVID-19 circulating online [106]. Similarly, a study published right before the COVID-19 pandemic found that 54% of all anti-vaccine ads on Facebook were funded by two organizations, even though most of the ads appeared to be grassroots discussions by concerned parents and neighborhood groups [107]. Thus, in addition to raising concerns around the use of Facebook and similar platforms to spread misinformation, this finding indicates that conspiracy theories regarding COVID-19 have a global appeal cutting across the socio-cultural, economic, informational, and political variables that divide the countries.

Secondly, we looked at the co-occurrence dynamics of the topics. A one-to-one match between each topic gives close to 400 possible topic pairs. Among those, co-occurring topics with an aggregated relative similarity of more than 0.1 were selected, and their number of co-occurrences was inversely weighted by the total count of both topics (comparable to Jaccard similarity) in periods of two weeks. In other words, a matrix similar to the one in Fig. 6 was produced for every two-week period, and the topic pairs with high relative similarity were tracked over time. This gave us Fig. 7. The high-similarity co-occurring topics are food-cures, individuals-governments, lockdown-governments, lockdown-individuals, medicine-cures, origins-conspiracies, other diseases-medical equipment, prevention-cures, prevention-food, spread-detection, spread-individuals, videos-individuals, and videos-religion. Figure 7 suggests that there is a pattern in the co-occurrence of the topics and that the time series dataset can be clustered into two groups: before April 2020 and after. Higher values correspond to greater weighted topic co-occurrence, and lower values indicate weaker co-occurrence.
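One way to read the weighting described above is sketched below for a single two-week period. The co-occurrence count of each topic pair is divided by the summed counts of the two topics. This exact formula is an assumption based on the description; a strict Jaccard similarity would instead subtract the co-occurrence count from the denominator. The function name and the toy topic assignments are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def weighted_cooccurrence(doc_topics: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Co-occurrence count of each topic pair divided by the summed topic counts."""
    topic_counts = Counter(t for topics in doc_topics for t in set(topics))
    pair_counts = Counter(
        pair
        for topics in doc_topics
        for pair in combinations(sorted(set(topics)), 2)  # alphabetical pair keys
    )
    return {
        pair: n / (topic_counts[pair[0]] + topic_counts[pair[1]])
        for pair, n in pair_counts.items()
    }


# Hypothetical topic labels of four falsehoods published in one bi-weekly period.
period_docs = [
    ["food", "cures"],
    ["food", "cures"],
    ["food"],
    ["lockdown", "governments"],
]
weights = weighted_cooccurrence(period_docs)
```

Repeating this for every period produces one weight per topic pair per period, which is the kind of series a plot of co-occurrence over time would display.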

Fig. 7 Relative similarity between high-correlation topics over time

Lastly, the bi-weekly relative similarity matrices were treated as networks and compared to each other using network similarity algorithms. This provided a systematic way to compare the dynamics of topic ratios in the misinformation dataset over time. To validate the results, two different algorithms (Frobenius distance and quantum-JSD distance) were used and their results were evaluated against each other. The results indicate that topic ratios were comparable to each other in the first few months of the pandemic; specifically, from February 2020 until the end of June 2020, there was not much variation. This finding is also reinforced, somewhat more conservatively, by the results in Fig. 7: highly correlated topics formed a pattern until the end of May 2020. This suggests intense cross-country exchanges and learning from each other in the first few months of the pandemic. Figure 8 shows the similarities (or distances) between the bi-weekly relative similarity graphs. The rectangle at the intersection of two time points shows the distance between the two corresponding topic ratio graphs. Red values are associated with greater similarity and blue values correspond to lower similarity scores. In addition, to have complete data for the bi-weekly periods, the first and the last time series observations were truncated.

Fig. 8 Frobenius and quantum-JSD distances between the bi-weekly relative similarity graphs of topic ratios

We interpret those results in line with the discussion in the “Data” section above. As mentioned there, the period until May/June 2020, that is, the first few months of the pandemic, was characterized by uncertainties about the definition and implications of the SARS-CoV-2 virus and by the highest intensity of misinformation production. More specifically, there were still many unknowns about the origin, spread, cure, and prevention of the disease, each of which is among the topics with a significant amount of co-occurrence. Starting from June 2020, with key announcements by leading figures and organizations regarding the origin and spread of the virus as well as methods of treatment and prevention (e.g., the WHO’s announcement that COVID-19 may be an airborne disease mainly transmitted through respiratory droplets), we saw a decline in the number of falsehoods related to those popular and highly co-occurring topics, while misinformation started to focus on other issues such as vaccines, politicians, and the impact of COVID-19 on healthcare systems.

Content analysis

Content similarity

In this section, we tried to understand whether countries with similar social, economic, and political backgrounds produce news with similar content, or whether there is a statistically significant difference in content creation between countries with different social, economic, and political endowments. The assumption was that the behavior of people is strongly related to national variables [108, 109] and that this ultimately translates into writing. In fact, there is an extensive literature showing that individuals’ everyday behaviors, such as financial decisions [110], consumption habits [111], and health behaviors [112], are associated with national variables, including cultural values, human development levels, or business systems. Similarly, scholars point to how national-level factors such as political systems, economic indicators, or press freedom have determining impacts on “journalistic cultures” [113], which, in turn, shape how different topics, including climate change [114] and international migration [115], are covered.

We broke the countries down into groups and compared the aggregated mean of pairwise similarity between the news for bi-weekly periods and for the whole dataset. We then identified the instances in which a comparison yields statistically significantly higher similarity than the other sets of comparisons. These instances can be observed in Figs. 9, 10, 11, 12, and 13; the remainder of the comparisons can be found in the “Appendix”. On the whole, countries in the fourth cluster, countries in West and South Asia, socialist/Arab-oil-based/advanced city economies, countries with low HDI (human development index), and countries with very serious press freedom problems produce news that are more similar to each other.
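The group comparison can be illustrated with a two-sample t statistic. The sketch below uses Welch's unequal-variance form, which is an assumption, since the paper does not state which variant of the t test was applied. The function name and the similarity values are hypothetical, and converting the statistic into a p-value would additionally require the t distribution, which is not implemented here.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof


# Hypothetical bi-weekly mean pairwise similarities for two groups of countries.
low_hdi = [0.30, 0.32, 0.31, 0.29]
other = [0.20, 0.22, 0.21, 0.19]
t_stat, dof = welch_t(low_hdi, other)
```

A large positive statistic, as in this toy case, would indicate that news within the first group is on average more similar than news within the second group.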

Fig. 9 Similarity between news over time, cluster comparison

Fig. 10 Similarity between news over time, HDI comparison

Fig. 11 Similarity between news over time, culture comparison

Fig. 12 Similarity between news over time, business systems comparison

Fig. 13 Similarity between news over time, press freedom comparison

We believe some of those findings are particularly worth discussing here. First, it should be noted that an overwhelming percentage of countries classified as having low HDI are located in sub-Saharan Africa. A number of studies indicate that the COVID-19-related misinformation in the African continent has a few distinct characteristics, and we speculate that this might explain why the news are more similar to each other. Specifically, falsehoods related to unproven local remedies [116] and those stemming from religious beliefs [117] are found to be particularly common in Africa. In addition, distrust towards international bodies [118] and the history of unethical Western medical practices in the continent [119] are some of the other factors fuelling misinformation. In fact, our dataset offers some interesting examples. In Ivory Coast, stories claiming that neem leaf works against COVID-19 were posted thousands of times on social media despite no evidence. Similarly, falsehoods claiming that the Rwandan president Paul Kagame censured the WHO for rejecting a herbal tonic were widely shared on Facebook and Twitter across African nations, including Nigeria.

Countries classified as having “very serious” press freedom issues by Reporters Without Borders also produced more similar news. In line with our discussion on “journalistic cultures” above, we believe that this might reflect the effects of government control and censorship of the media, which largely shape both the content and tone of the coverage of the pandemic. In fact, in this group of countries (including Egypt, Iran, Saudi Arabia, China, and Vietnam, among others), not just the traditional media sources but also the social media networks are subject to heavy government control [72]. For instance, a report by the Social Media Exchange, an NGO working to advance digital rights in the Arabic-speaking region, shows how Egyptian authorities prosecuted a number of journalists, doctors, and activists who circulated news on social media about the COVID-19 outbreak (e.g., the number of infections or deaths) that did not match the official discourse and numbers [120].

A third interesting finding is that the countries in the fourth cluster (the light pink colored cluster in Fig. 3 above) generated more similar news in terms of content. Some of the countries in this group are Iraq, the Democratic Republic of Congo, Venezuela, Honduras, Kenya, Bolivia, Uganda, and Yemen. A few of the members of this cluster also have low HDI and/or very serious press freedom issues; therefore, the explanations above can partially apply to those countries. However, four distinct characteristics identify the countries in this cluster: a high perception of corruption in the public sector, a high degree of mistrust towards the government, a high level of economic inequality, and largely ineffective health service provision. Taken together, these factors point to an environment of weak state capacity and a low level of trust in public institutions. In fact, studies show that there are remarkable correlations among those variables. For example, while one study finds a strong relationship between high levels of economic inequality and low levels of trust in national institutions across the EU member countries [121], another one conducted in post-Soviet nations shows that there is a negative association between perception of corruption and trust in public institutions such as police, national and regional governments, and courts [122]. Given that state capacity and trust in public institutions are integral to an effective pandemic strategy (affecting, for example, people’s compliance with restrictions and willingness to get vaccinated, and governments’ success in enforcing lockdowns and other isolation practices), it is not surprising that those countries produced more similar news. Our dataset includes several fascinating falsehoods particularly common to the countries in this cluster. Reflecting the mistrust towards the government and its capacity to supervise the pandemic efforts, a news story in Zimbabwe alleged that a medical laboratory conducted clinical trials for a possible vaccine that led to the death of 68 out of 80 volunteers in total. Similarly, mistrust towards the government and a high level of political polarization undoubtedly fostered misinforming news such as the one in Bolivia that was published in July 2020 and claimed that President Maduro was extending the full lockdown until January 2021.

Misinformation unusualness/creativity

In the last part of the paper, groups of countries were compared against each other in terms of the creative word usage in the news they published. The average unusualness of one group was compared with the average unusualness of the other group. Since the difference between the two averages is what is tested, a statistically significant result requires this difference to be clearly distinguishable from zero. The average aggregated unusualness score is provided in Fig. 14. The figure shows that an initial lack of creativity in the first few months of the pandemic was followed by an increase and relative stability throughout 2020.

Fig. 14 Aggregated average unusualness scores over time

Breaking down the countries into groups and comparing the levels of creativity between them did not yield any results that are statistically significantly different from zero. Thus, our expectation that countries of different backgrounds would choose word-groups according to their taste was not borne out. A comparison between the clusters of countries created using the social, cultural, economic, and political variables in the dataset is provided in Fig. 15. We believe that this lack of meaningful difference across clusters of countries in terms of creative word usage can be explained by three main factors. First, it points to a highly globalized media environment in the sense that media outlets across nations share vocabulary and discourses to a great extent. The digital media, and the Internet more broadly, have created “a new global language” [123] with specific neologisms and novel syntactic, orthographic, and lexical commonalities among world languages, such as heavy use of emojis and emoticons, abbreviations, and acronyms [124]. Second, research shows significant differences between truthful news and falsehoods regarding their linguistic characteristics, as the latter use more words related to anxiety, more superlatives, sensationalistic writing, and overly emotional language [125, 126]. Third, as previously mentioned, a great percentage of falsehoods in our dataset are from Facebook and a few other social media outlets. Given the studies showing that misinformation spreads very fast on social media platforms [24, 127] and that most COVID-19-related falsehoods were produced by a very small number of individuals [128], the lack of statistically significant difference among the groups of countries is not too surprising.

Fig. 15 Comparison of unusualness scores between clusters of countries

Discussion and conclusion

The “Varieties of…” literature has influenced more than a generation of scholars and practitioners worldwide. Some politicians have used the arguments first offered by Hall and Soskice to transform the institutional structure of their countries (such as the British Labour Party politician Ed Miliband, when he was Leader of the Opposition), and many scholars have expanded the typologies first proposed by Hall and Soskice. Academically, we hope that our paper will provide a strong comparative perspective to an emerging literature. We also agree that the “Varieties of…” conceptualization is deterministic in nature; however, as recent media-viewer experience suggests, local and global media, policymakers, and transnational institutions have also been looking at pandemic-related policy success and failure from a cross-national, and mostly deterministic, perspective. Thus, in an environment where local political leaders are looking for the best non-local practices, many are wondering why some countries have been more successful than others in mitigating the human costs of the pandemic. From a practical standpoint, we believe that the arguments and facts laid out here may contribute to the public health efforts to fight against misinformation, which continues to take lives in a myriad of ways, such as by discouraging people from getting vaccinated or promoting fraudulent and dangerous products.

To conclude, we want to reiterate four of the key contributions that this paper provides to the literature, and particularly to the tools to be used by global public health circles. First, our study is truly unique in terms of its data and methodology—it comprises over 10 thousand falsehoods from 129 countries; its data come from a variety of sources, including the most widely used social media platform globally, i.e., Facebook; and it uses 14 different variables aiming to measure political, economic, informational, and socio-cultural environments in each country in order to compare COVID-19 misinformation across them. We believe that the resulting clustering of countries into groups offers avenues for developing distinct public health and communication strategies to dispel misinformation in countries with particular characteristics.

Second, and relatedly, the findings give clues on what those strategies should be. For instance, our analysis suggests that countries with low HDI (mainly located in sub-Saharan Africa) produce misinformation related to unproven local remedies and those stemming from certain religious beliefs as well as from distrust of international organizations and Western medical practices. This shows the importance of working with local religious leaders and healers and repairing Western governments and pharmaceutical companies’ tarnished reputation. Likewise, given that countries with severe press freedom issues (e.g., those implementing outright censorship of news and bans on social media platforms) generate similar news, global public health circles should design an anti-misinformation strategy specific to those nations, which necessitates going beyond using online platforms that are at risk of being censored.

Third, our study indicates that there have been successful anti-misinformation efforts throughout the pandemic, but significant challenges persist. More specifically, we found that the types of falsehoods that were particularly common in the first few months of the pandemic and were widely shared across countries (mainly those about the origin, spread, cure, and prevention of the disease) were effectively addressed by announcements from leading figures and organizations such as the WHO or Anthony Fauci, resulting in a decline in the number of falsehoods related to those topics. However, the findings also reveal two worrying trends, among others: (1) conspiracy theories are common among all groups of countries, which can be explained by the fact that they originate from a small number of individuals and organizations (so-called misinformation “superspreaders”) but are effectively disseminated across the globe; (2) in countries with weak state capacity and a low level of trust in public institutions, misinformation creates a particularly dangerous vicious circle: distrust of government fosters the production of falsehoods, which in turn further weakens governments’ ability to supervise the pandemic efforts. Accordingly, we argue that international organizations and leading figures in global health should strengthen their efforts to reach out to those populations and develop effective strategies against the dissemination of the major conspiracies.

Fourth, we found that while the most prominent misinformation topics vary across groups of countries, the word-groups used in misinforming news stories are remarkably similar. In line with the “glocalization” literature [129, 130], we interpret this as the coexistence of globalizing and localizing processes: on the one hand, the socio-economic, cultural, political, and informational characteristics of countries clearly affect the types of falsehoods Internet users are exposed to; on the other hand, the tone and structure of falsehoods do not show much variance, which points to a highly globalized online media environment. Given those findings, we argue that even though implementing country-specific strategies (e.g., improving scientific literacy in a country where it is currently weak) is crucial, a global action plan against misinformation is also very much needed.