1 Introduction

The COVID-19 pandemic has become an unparalleled public health crisis worldwide since its outbreak. At the same time, spreading just as widely and traveling even faster, misinformation regarding the etiology, prevention, diagnosis, morbidity, mortality, and cure rate of the disease has inundated the world [1]. Examples include ‘Coronavirus was found in horses,’ ‘Vitamin C is a miracle cure for the novel coronavirus,’ ‘Tom Cotton claimed that COVID-19 was manufactured in a Chinese bio-laboratory,’ and so on.Footnote 1 Such misinformation has jeopardized society’s information ecosystems by eroding public trust and misleading people’s decision-making at critical times, and may even lead to social disruption [2].

In mid-February of 2020, the World Health Organization issued a global warning about the health information crisis caused by the COVID-19 ‘infodemic’.Footnote 2 Since then, concerted efforts have been convened across all sectors to mitigate its negative effects on society [4,5,6,7,8,9]. For instance, a range of fact-checking algorithms, tools, and websites have been developed based on big data analytics and language processing technologies [10,11,12]. A core approach in natural language processing treats misinformation detection as a multi-class labeling task [13,14,15,16], with state-of-the-art models mainly following the pre-training and fine-tuning paradigm [17,18,19]. Such methods achieve strong task performance, but are often opaque as to how and why certain representations optimize that performance.

The literature has suggested the effectiveness and significance of linguistic attributes in accounting for people’s behavior during information communication [20, 21]. This paper aims to probe into the distinct linguistic characteristics and to account for the ‘Pathogenicity’Footnote 3 of the COVID-19 infodemic with a data-driven analytical approach. We focus on measuring culturally shared ‘fundamentals’ in affect control theory in terms of Evaluation, Potency, and Activity (EPA). Specifically, we consider three essential aspects of work regarding (1) the characterization of the salient linguistic features employed in the language through data observation, (2) the interpretation of the key psychological triggers (psycho-triggers) essential for accounting for people’s social behavior, and (3) the prediction of information credibility as a classification task by building linguistically enhanced statistical models.Footnote 4 Such a three-dimensional paradigm represents a mutually defining triple chain for combating an infodemic, as conceptualized in Fig. 1.

Fig. 1

The triple-dimension paradigm of understanding and combating infodemic

Theoretically, we can leverage the metaphor of an infodemic as a kind of epidemic: identifying the virus (misinformation) itself does not cure the disease (the infodemic). It is therefore intriguing to uncover the deep encoding system of the infodemic in (psycho-)linguistic devices. Practically, finding the distinct linguistic patterns and underlying psychological mechanisms can help pinpoint the essential factors causing the infodemic. This can inform recommendations on how to prevent or rectify transferred information, as well as identify discriminant features for anti-infodemic technologies.

In the following sections, we first review the related works and highlight the innovation of the current study in Sect. 2; we then introduce the dataset and method in Sect. 3; third, we conduct data-driven linguistic analysis with theory-grounded explanation in Sect. 4; fourth, we model on the proposed features for misinformation detection in Sect. 5; finally, we discuss the results with further implications and draw concluding remarks in Sect. 6.

2 Related work

Infodemic during the COVID-19 pandemic has been studied from various perspectives with different methodologies and technologies, including health communication control, social network analysis, automatic misinformation detection, linguistic investigations, as well as data science applications. Discussions mainly focus on epidemiological issues or social behavior studies based on survey data or textual data mined from a multitude of social media platforms. In what follows, we provide a comprehensive review of these studies and then highlight the innovation of the current work.

2.1 Health information control

Fig. 2

An example of COVID-19 myth (with the debunked fact) from MythBusters WHO

Many health information controlling organizations, such as Mythbusters, PolitiFact, Mayo Clinic, Avert, and NewsGuard, have constantly posted alarming messages urging the public to be aware of popular medical myths about COVID-19, as exemplified in Fig. 2. Building on such information, many researchers report on issues of the COVID-19 infodemic from the viewpoints of pharmacy, nursing, and medical experts, covering topics such as dietary supplements, disease prevention, control, and treatment [1, 2, 22]. These actions play a critical role in enhancing people’s awareness of data legitimacy and information credibility. However, this mode of data demonstration with a few selected examples has the disadvantage of not examining the COVID-19 infodemic globally or statistically. Hence, these efforts are limited in terms of understanding and combating the infodemic.

2.2 Existing datasets

A myriad of misinformation datasets emerged before the COVID-19 pandemic, mainly concerning fake news in political discourse, such as LIAR, FEVER, and CREDBANK [23,24,25]. Since the outbreak of COVID-19, a growing number of COVID-19 misinformation datasets have been compiled for research on combating the proliferation of COVID-19 misinformation online, e.g., CoAID, ReCOVery, COVID Fake News Dataset, and so on [21, 26,27,28,29,30,31,32,33,34,35]. We review the datasets most relevant to this work in Table 1. As demonstrated, there is a rather diversified archive of datasets on COVID-19 misinformation in terms of data source, size, language, modality, and truth classes. This work is interested in deploying a wide range of existing published data (with verified truth labels) to conduct empirical analysis and statistical modeling on the proposed linguistic features in order to draw a generalized conclusion.

Table 1 Summary on truth-labeled datasets related to COVID-19

2.3 Social behavior studies

The outbreak of COVID-19 and the rapid transmission of the infodemic have drawn primary attention from social scientists studying the social behavior of information generators, consumers, and spreaders. They attempt to model the social network and information community of the COVID-19 infodemic so as to take appropriate actions in social monitoring of information quality.

For instance, Pennycook et al. [9] conduct a survey on human subjects studying why people believe and share misinformation related to COVID-19 and find a strong association between users’ behavior and their cognition and knowledge capacity. Pulido et al. [36] study the retweeting behavior of users and find that false information is tweeted more but retweeted less than science-based evidence or fact-checking tweets, while science-based evidence and fact-checking tweets capture more engagement than mere facts. Memon and Carley [28] study the characteristics of an information community (network) and observe that: (1) misinformed communities are denser than informed communities; (2) informed users use many more narratives than misinformed users; and (3) misinformation communities are much more complex, as they are highly organized and tend to be highly analytical.

These studies unveil many interesting social behavior patterns pertaining to the COVID-19 infodemic, yet leave unexamined the language devices used for constructing and construing misinformation.

2.4 Linguistic approach

The linguistic literature has attempted to answer the key questions of whether misinformation is rooted in language and how distinctive misinformation is in linguistic terms. For example, Newman et al. [20] showed that low-credibility text tends to use more pronouns, conjunctions, and exclusive words (e.g., without, except, but) or motion words (e.g., walk, move, go). Su [43] observed that a liar tends to employ epistemic and stance markers or involve impersonal views. Rafi [44] conducted dialogic content analysis and found that the language used in most posts concerning COVID-19 misinformation is deterministic, imperative, and declarative; Medford et al. [45] studied misinformation in tweets and found that tweets with negative sentiment and emotion parallel the incidence of cases in the COVID-19 outbreak; Kapusta et al. [46] studied grammatical-semantic classes and observed significant differences in certain lexical category preferences. Many other evidenced works include analyses of parts of speech [47], syntactic structure [48], measures of syntactic complexity and semantically related keyword lists [49], discourse structure [50], and named entities [51]. However, such linguistic patterns have to be further attested for the COVID-19 infodemic.

2.5 Data analytics, computation and applications

Data scientists and computational linguists seek practical solutions to combating the infodemic by addressing real-world applications. Much success has been witnessed with the advent of big data and state-of-the-art (SOTA) deep neural networks, as well as, more recently, the pre-training and fine-tuning paradigm. Here we review a few representative works.

Cinelli et al. [37] analyze massive data from Twitter, Instagram, YouTube, Reddit, and Gab to address the diffusion of the COVID-19 infodemic in social media. They use epidemiology models to characterize the basic reproduction number \(R_{0}\) and provide platform-dependent numerical estimates of rumors’ amplification. Hang et al. [38] study a graph-based framework for infodemiology using joint hierarchical clustering and cloud computing, which is key to designing scalable data analytics for infodemic control. In addition, they use statistical machine learning to exploit the statistics of data to accelerate computation. Olaleye et al. [39] conduct predictive analytics of the COVID-19 infodemic on tweets, deploying Vote ensembles formed by the base classifiers SMO, Voted Perceptron, Liblinear, Reptree, and Decision Stump. Ceron et al. [41] introduce a Markov-inspired computational method for topic modeling of the infodemic in order to identify ‘fake news’ trends in Twitter accounts. Chen et al. [42] propose a transformer-based language model for fake news detection using RoBERTa and the domain-specific model CT-BERT, fused by a multi-layer perceptron to integrate fine-grained and high-level specific representations.

2.6 Specificity of our work

The above advances leverage big data analytics and automatic detection methods for investigating and controlling the infodemic, demonstrating ground-breaking success in this information and digital age. However, such methods may fall short in the semantic understanding of why certain features and techniques work.

In addition, current fact-checking technologies show limitations in controlling misinformation exposure, as several factors drive misinformation sharing and acceptance in the context of the COVID-19 pandemic, such as emotions, distrust, cognitive biases, racism, and xenophobia [40]. These factors both make individuals more vulnerable to certain types of misinformation and make them impervious to future correction attempts. There are several additional measures beyond fact-checking that may help further mitigate the effects of misinformation in the current pandemic.

Therefore, we take a further step and leverage data-driven analytics to study the lexico-syntactic–semantic features pertaining to COVID-19 misinformation. In addition, we provide theory-grounded accounts for understanding the pathogenicity of the infodemic, and build logistic regression and machine learning models to test the performance of linguistic features for misinformation prediction.

Table 2 Basic statistics of the CovMythFact dataset
Fig. 3

Sentence length distribution

Fig. 4

Word average length distribution

3 Data and method

3.1 The CovMythFact dataset

To gather a balanced dataset on COVID-19 misinformation, we curate a large collection of COVID-19 myths from a multitude of existing data resources published in the infodemic community. We focus on COVID-19 myths because they are regarded as the most contagious misinformation existing and expanding on the internet [52]. We integrate a wide range of news headlines and claims from all the English datasets in Table 1, focusing on two truth classes, i.e., TRUE versus FALSE, so as to minimize disagreement on truth-labeling. For example, LIAR [23] defines six classes to label various degrees of truthiness in news, i.e., True, Mostly True, Half True, Mostly False, False, and Pants on Fire, whereas CREDBANK [25] defines five classes, and many other fake news datasets define only 2-4 classes, such as FEVER [24], BUZZFACE [53], PHEME [54] and FA-KES [55]. We then de-duplicate repeated myths and delete all question titles (such as ‘How Long Does It Take for COVID-19 to Stop Being Contagious?’). This yields around 8000 false headlines and 5000 true headlines. In order to balance the two sub-corpora for a comparative study, we randomly sample 5000 false headlines from the 8000 and finally obtain a balanced dataset—CovMythFact.Footnote 5 It contains 5000 headlines for each truth class (132,244 tokens in total). We provide the basic statistics about the dataset in Table 2.

Statistics of ‘Token’, ‘Sentence’, and ‘Lemma’ in Table 2 are calculated by Sketch Engine [56] based on the CovMythFact dataset. In addition, we calculate TTR (type-token ratio) [57] to measure the lexical diversity of the two codes of statements. The result shows that myths are longer in terms of sentence length but are not as diversified as facts in vocabulary. To give a basic description of the data, we provide density plots of the distributions of sentence length and word average length for myths and facts, with t-tests of the distribution differences, as given in Figs. 3 and 4.

The result shows that myths are significantly longer in terms of sentence length (fake: mean = 14.6012, sd = 7.42; true: mean = 9.7628, sd = 4.21, p value < 2.2e−16), while facts have significantly longer word average length (fake: mean = 5.427400, sd = 0.81; true: mean = 5.902078, sd = 1.09, p value < 2.2e−16), indicating an inverse relation between the sentence length and word length distributions for the two codes of statements. The true code employs longer words while the sentences are shorter, presenting a more contracted structure in lexical semantics; in contrast, the fake code employs longer sentences with shorter words, which is more unfolded in terms of information packaging.
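The TTR and length statistics above can be sketched in a few lines of Python. The headlines below are invented stand-ins, not items from CovMythFact:

```python
# Toy two-code comparison: type-token ratio (TTR) and mean sentence length.
# The headline lists are hypothetical illustrations, not dataset excerpts.
myths = ["coronavirus was found in horses",
         "vitamin c is a miracle cure for the novel coronavirus"]
facts = ["handwashing reduces transmission",
         "vaccines are tested in clinical trials"]

def ttr(headlines):
    """Type-token ratio: unique tokens over total tokens."""
    tokens = [tok for h in headlines for tok in h.split()]
    return len(set(tokens)) / len(tokens)

def mean_len(headlines):
    """Mean sentence length in tokens."""
    return sum(len(h.split()) for h in headlines) / len(headlines)

print(ttr(myths), mean_len(myths), mean_len(facts))
```

A lower TTR with a higher mean length would mirror the paper's finding that myths are longer but less lexically diversified.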

3.2 Methodology

This work adopts a data-driven linguistic approach to examining the unique and distinct linguistic patterns of COVID-19 myths through the lexical and grammatical interfaces. We first process the unstructured dataset with linguistic annotations, currently focusing on lemmatization and POS tagging using NLTK (Natural Language Toolkit).Footnote 6 To facilitate linguistic inquiries, Sketch Engine is utilized for keyword extraction, concordancing, and word sketch difference searching. In addition, statistical analysis and visualization are implemented in R with RStudio and R Markdown. Other methods are described in the following subsections.

3.2.1 Lexical dispersion measure

The distinctive linguistic patterns are measured with the normalized deviation of proportions (DP) measure as in [58]. DP is based on the difference between observed and expected relative frequencies. Let \(v_{1}, \ldots ,v_{n}\) be the relative frequencies of a word observed in texts \(S_{1}, \ldots ,S_{n}\), and let \(s_{1}, \ldots ,s_{n}\) be the relative sizes of the texts. DP is defined as:

$$\begin{aligned} \mathrm{DP} = \left( \sum _{i=1}^{n}|s_{i}-v_{i} |\right) /2 \end{aligned}$$
(1)

We adopt \(\mathrm{DP}_{\mathrm{norm}}\) to measure the distinct words for the two subcorpora as formulated in:

$$\begin{aligned} \mathrm{DP}_{\mathrm{norm}} = \mathrm{DP}/\left( 1-\min _{i}(s_{i})\right) \end{aligned}$$
(2)

The normalized measure, as presented by Lijffijt and Gries [59], has a minimum value of 0 and a maximum value of 1, regardless of the corpus structure, whereas DP also has a minimum of 0, but its maximum depends on the corpus structure. Because the dispersion is quantified as the difference between the expected and observed frequencies, a dispersion of 0 indicates that a word is dispersed as expected, whereas a dispersion of 1 indicates that the word is minimally dispersed.
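Formulas (1) and (2) translate directly into code. A minimal sketch of the measure (the two-part corpus here is hypothetical):

```python
def dp_norm(freqs, sizes):
    """Gries' deviation of proportions (DP), normalized per Lijffijt & Gries.

    freqs: raw frequency of the word in each corpus part
    sizes: token count of each corpus part
    """
    total_f = sum(freqs)
    total_s = sum(sizes)
    v = [f / total_f for f in freqs]    # observed relative frequencies
    s = [sz / total_s for sz in sizes]  # expected: relative part sizes
    dp = sum(abs(si - vi) for si, vi in zip(s, v)) / 2   # formula (1)
    return dp / (1 - min(s))                             # formula (2)

# A word confined to one of two equal-sized parts is maximally concentrated
# (DP_norm = 1); a word spread exactly as expected gives 0.
print(dp_norm([10, 0], [1000, 1000]), dp_norm([5, 5], [1000, 1000]))
```

Running the function with Myths as the observed group gives DP1, and with Facts as the observed group gives DP2, as used in Tables 3 and 4.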

3.2.2 EPA_Grounded account

To consolidate the lexical observations in Sect. 4.1, we adopt an EPA_Grounded approach to account for the pathogenicity of COVID-19 myths. We map the distinct words to the EPA lexicon and obtain their Evaluation, Potency, and Activity scores in order to verify our interpretations of the distinct words in sociopsychological terms. The EPA collection, provided by Heise [60], consists of the 5000 most commonly used English sentiment words.

The EPA scores have been rated by human knowledge experts based on affect control theory (ACT) [61]. ACT is a social psychological theory based on the assumption that people tend to maintain culturally shared perceptions of identities and behaviors in transient impressions during observation of and participation in social events [62]. In this theory, culturally shared ‘fundamental’ sentiments about each of these elements are measured in three dimensions: Evaluation, Potency, and Activity (EPA).

We use the EPA score of the word ‘mother’ [2.74, 2.04, 0.67] for concept illustration. It corresponds to ‘quite good,’ ‘quite powerful,’ and ‘slightly active’ in the three aspects.Footnote 7 The scores of the three dimensions for each word provide direct links to the social perceptions, actions, and emotional experiences of people for the social events. Such indexes have been proven effective for sentiment analysis [63]. By employing such indexes for the distinct words in COVID-19 myths, we are able to probe into the respective sociopsychological dimensions of the salient lexicon and to account for the social behavior of people in disseminating the COVID-19 infodemic.
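The lexicon lookup described above can be sketched as follows. The ‘mother’ entry is the one quoted in the text; the other entries and the averaging helper are hypothetical illustrations, not values from Heise's collection:

```python
# Minimal sketch of mapping words to EPA scores. Only 'mother' uses the
# scores quoted in the text; 'kill' and 'help' carry made-up illustrative
# values, not entries from the actual lexicon.
EPA = {
    "mother": (2.74, 2.04, 0.67),
    "kill":   (-4.26, 1.95, 0.64),   # hypothetical
    "help":   (2.90, 1.70, 1.06),    # hypothetical
}

def avg_epa(words):
    """Average E, P, A over the words found in the lexicon."""
    scored = [EPA[w] for w in words if w in EPA]
    if not scored:
        return (0.0, 0.0, 0.0)
    n = len(scored)
    return tuple(sum(dim) / n for dim in zip(*scored))

e, p, a = avg_epa(["mother", "help", "unlisted-word"])
print(round(e, 2), round(p, 2), round(a, 2))
```

Averaged E, P, and A scores per claim are what feed the regression variables E_avg, P_avg, and A_avg in Sect. 5.1.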

3.2.3 Feature regression

Logistic regression [64] is conducted to model the relation between the linguistic features and the truthiness of the claims. We regard the truthiness of the headlines as the dependent variable and all the linguistic variables as independent variables. The truthiness falls into one of two categories, true or false, so we use a logistic regression model to estimate the probability that truthiness belongs to a particular category.

Given X as the explanatory variable, we model p(X) with the logistic function, which gives outputs between 0 and 1 for all values of X:

$$\begin{aligned} p(X) = \left( e^{\beta _{0}+\beta _{1}X}\right) /\left( 1+e^{\beta _{0}+\beta _{1}X}\right) \end{aligned}$$
(3)

The logistic function always produces an S-shaped curve, so regardless of the value of X, we obtain a sensible prediction between 0 and 1. The above equation can also be reframed as:

$$\begin{aligned} p(X)/\left( 1-p(X)\right) =e^{\beta _{0}+\beta _{1}X} \end{aligned}$$
(4)

The quantity \(p(X)/(1-p(X))\) is called the odds, and can take any value between 0 and \(\infty \). Odds close to 0 and \(\infty \) indicate very low and very high probabilities of p(X), respectively.

By taking the logarithm of both sides from the equation above, we obtain:

$$\begin{aligned} \log \left( p(X)/\left( 1-p(X)\right) \right) =\beta _{0}+\beta _{1}X \end{aligned}$$
(5)

The left-hand side is called the logit. In a logistic regression model, increasing X by one unit changes the logit by \(\beta _1\). The amount that p(X) changes due to a one-unit change in X depends on the current value of X. But regardless of the value of X, if \(\beta _1\) is positive then increasing X will be associated with increasing p(X), and if \(\beta _1\) is negative then increasing X will be associated with decreasing p(X).

Table 3 Distinct lemmas for myths
Table 4 Distinct lemmas for facts

The coefficients \(\beta _0\) and \(\beta _1\) are unknown and must be estimated from the available training data. We seek estimates for \(\beta _0\) and \(\beta _1\) such that plugging these estimates into the model for p(X) yields a number close to 1 for all headlines labeled true, and a number close to 0 for all headlines labeled false. To implement the logistic regression model, we use the glm() function in R, following ISLR.
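As an illustration of how \(\beta _0\) and \(\beta _1\) are estimated, the following is a minimal gradient-ascent fit in Python on toy one-dimensional data. The x values are hypothetical average E scores, not data from this study:

```python
import math

# Toy 1-D logistic regression fit by gradient ascent on the log-likelihood.
# x stands in for a claim-level feature (e.g., E_avg); y is the truth label.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # gradient w.r.t. beta_0
        g1 += (y - p) * x    # gradient w.r.t. beta_1
    b0 += lr * g0
    b1 += lr * g1

# Positive beta_1: a higher x raises the predicted probability of "true".
p_high = 1 / (1 + math.exp(-(b0 + b1 * 2.0)))
print(b1 > 0, p_high > 0.9)
```

In practice, R's glm() (or an off-the-shelf solver) performs this maximum-likelihood estimation; the sketch only shows the mechanics.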

3.2.4 Machine learning models

We adopt machine learning models to conduct automatic prediction of information credibility (a binary classification task) so as to test the usefulness of the proposed features. Three traditional classifiers are used, including logistic regression (LR), support vector machine (SVM), and a random forest classifier (RFC). The machine learning experiments are run through utilities in scikit-learn (sklearn).Footnote 8 In terms of feature sets, we use the bag-of-words representation (BOW) as the baseline and test grammatical features (POS), word2vec representationsFootnote 9 (W2V), and affective features (EPA), respectively, for performance comparison. For parameter tuning, we use grid search to find optimal parameters for the classifiers.Footnote 10
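The setup above can be sketched with scikit-learn. The mini-corpus, the grid over C, and the choice of LinearSVC are illustrative stand-ins; the paper does not publish its exact grids or feature pipelines:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus standing in for CovMythFact headlines (1 = true).
texts = ["vitamin c cures covid", "virus made in a lab",
         "vaccines reduce severe illness", "masks lower transmission risk",
         "garlic kills the virus", "trials show the vaccine is safe"]
labels = [0, 0, 1, 1, 0, 1]

# BOW baseline feeding an SVM, tuned by grid search (cf. Sect. 3.2.4).
pipe = Pipeline([("bow", CountVectorizer()), ("svm", LinearSVC())])
search = GridSearchCV(pipe, {"svm__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(texts, labels)
print(search.best_params_, search.predict(["vaccine trials show safety"]))
```

Swapping CountVectorizer for averaged word2vec or EPA features, and LinearSVC for LogisticRegression or RandomForestClassifier, reproduces the other feature/classifier combinations reported in Sect. 5.2.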

4 Results and analyses

This section presents the results of the proposed linguistic analyses of COVID-19 myths focusing on the lexical and grammatical features, as provided in the following subsections.

4.1 Lexical analysis

We conduct lexical analysis on the words in myths and facts and study their distributions, trying to identify the distinct word choices preferred by the two codes of statements. We also provide further analysis of the dispersion distributions with EPA_grounded interpretations.

4.1.1 Distinct lemmas

We retrieve the distinct words for each group using DP_norm (cf. formula 2). Tables 3 and 4 display all the distinct words for the two groups, respectively, with DP_norm larger than 0.2. DP1 is the DP_norm with Myths as the observed group and Facts as the expected, while DP2 is the DP_norm with Facts as the observed group and Myths as the expected. Note that the listed words are not exclusively used by either group; they are simply used more in one code relative to the other. In fact, most of the words occur in both sub-datasets.

Fig. 5

EPA indexes of distinct words in the two groups. Words are displayed in descending order of the major dimension. We highlight words with negative scores in bold. Words that are dominant in the fact group are in the italic form. The gradient colors refer to different degrees of weights in each dimension of the words

Fig. 6

Word sketch difference analysis of ‘COVID-19’ versus ‘SARS-CoV-2’

There are 80 distinct words preferred by myths, while only 60 words are distinctly used by facts. Several categories of lexical contrasts can be observed from the two word lists, as summarized below.

  1. 1.

    The two codes favor different names for the coronavirus: ‘COVID-19’ (the disease) is predominantly used by myths, while ‘SARS-CoV-2’ (the virus) is predominantly used by facts. To account for such differences, we further study the collocations of this pair of words in Sect. 4.2.

  2. 2.

    Many personal entities are distinctly used by the myth group, including ‘minister,’ ‘president,’ ‘police,’ ‘doctor,’ ‘man,’ ‘victim,’ ‘patient,’ ‘kid,’ ‘woman,’ ‘citizen,’ and ‘people,’ while the true code mentions people who usually have expertise or special skills, such as ‘worker,’ ‘researcher,’ and ‘expert.’ Evaluating these words in terms of ‘power’ is evidenced by the P dimension of the EPA scores, as shown in Fig. 5. Note that the myth group pays special attention to less powerful people, i.e., ‘woman,’ ‘kid,’ and ‘victim,’ demonstrating a sympathetic emotion toward vulnerable persons.

  3. 3.

    Many proper nouns, especially names of severely affected places, such as ‘China,’ ‘India,’ ‘Italy,’ ‘French,’ ‘Brazil,’ and ‘Spanish,’ are predominantly used in the myth group, while the fact group focuses mainly on America (‘U.S.,’ ‘American,’ ‘Chicago’)—also severely affected but more powerful. Evaluating these words in terms of ‘power’ is evidenced by the P dimension of the EPA scores, as shown in Fig. 5. Out of sympathy and worry for these severely affected (and less powerful) countries, people are more likely to believe the relevant information even when its reliability is unverified, which explains why myths mentioning vulnerable persons and places spread so widely and quickly across the world.

  4. 4.

    Many social media platforms, including ‘Facebook,’ ‘Twitter,’ and ‘Whatsapp,’ are frequently mentioned in the myth group, while the fact group only mentions ‘Google.’ Evaluating these words in terms of ‘activity’ is evidenced by the A dimension of the EPA scores, as shown in Fig. 5. The higher activity scores in the myth group suggest that active social events are more likely to get people engaged; as a result, they tend to believe what they see on social media platforms.

  5. 5.

    In line with the above observation on activity on social media platforms, words such as ‘video,’ ‘picture,’ ‘image,’ and ‘photo’ are frequently used. This shows the prevalence of COVID-19 myths mainly on social media platforms, as well as an effective way of spreading such myths through various kinds of multimedia devices. That is, people tend to place more trust in information accompanied by pictures, videos, and so on.

  6. 6.

    Many words showing strongly negative sentiment such as ‘kill,’ ‘die,’ and ‘dead’ are found predominantly used in the myth group. Interestingly, the fact group uses many words of positive sentiment such as ‘guidance,’ ‘healthcare,’ ‘care,’ ‘help,’ ‘tip,’ ‘nursing,’ and ‘support.’ Evaluating these words in terms of ‘sentiment’ is evidenced based on the E dimension of the EPA scores, as shown in Fig. 5.

The above analysis highlights several interesting word pairs of meaning contrast, i.e., ‘COVID-19’ versus ‘SARS-CoV-2’; ‘China’ versus ‘U.S.’; ‘kid’ versus ‘adult’; ‘Facebook’ versus ‘Google’; ‘lockdown’ versus ‘reopen,’ etc. The apparent differences in E, P, and A of these words, in addition to their meaning contrast, indicate the effectiveness of leveraging negative sentiment toward vulnerable groups with active social interactions in arousing people’s sympathetic responses to disseminate such information.

Fig. 7

Distributions of parts of speech between myths and facts

4.2 Case study of ‘COVID-19’ versus ‘SARS-CoV-2’

We conduct a case study of the collocational tendencies of the word pair ‘COVID-19’ versus ‘SARS-CoV-2’ using the Word Sketch Difference.Footnote 11 Four major syntactic collocations are extracted, as displayed in Fig. 6. The collocation frequency is indicated by the size of the circles, and the distance to the two words shows the strength of collocation. We observe the following patterns:

  1. 1.

    For verbs taking ‘COVID-19/SARS-CoV-2’ as the object, the majority of collocations with ‘COVID-19’ denote anti-virus meanings, such as ‘fight,’ ‘cure,’ ‘treat,’ and ‘prevent.’ These verbs, together with ‘COVID-19,’ are more frequently used in the myth group, suggesting people’s strong willingness to control the virus. In contrast, verbs that collocate highly with ‘SARS-CoV-2’ (e.g., ‘generate,’ ‘neutralize,’ ‘differentiate’) are more neutral and tend to occur in the fact group. This shows that myths are inherently distinct from facts in terms of sentiment, coherent with the findings of the lexical analysis.

  2. 2.

    For verbs with ‘COVID-19/SARS-CoV-2’ as the subject, ‘COVID-19’ collocates mostly with verbs such as ‘hit,’ ‘cause,’ and ‘affect,’ denoting a causative ‘impact’ of the virus on victims. In contrast, ‘SARS-CoV-2’ as a subject collocates mostly with neutral or positive words such as ‘involve’ and ‘bode,’ which conforms to the finding in Sect. 4.1.1.

  3. 3.

    For modifiers of ‘COVID-19/SARS-CoV-2,’ ‘COVID-19’ collocates with negative words such as ‘severe.’ In contrast, collocations with ‘SARS-CoV-2’ are rather sparse, and no obvious patterns can be observed.

  4. 4.

    For nouns modified by ‘COVID-19/SARS-CoV-2,’ ‘COVID-19’ has more collocations than ‘SARS-CoV-2.’ Besides, the negative sentiment is consistently observed in the collocations of ‘COVID-19,’ such as ‘pandemic.’

The above collocational study basically conforms to the lexical observation that misinformation carries a stronger negative sentiment toward the vulnerable community. Both the lexical and syntactic connotations imply an effective linguistic strategy of employing sympathetic devices to convince people to spread the misinformation.

4.3 POS-based analysis

The current section focuses on analyzing the grammatical distribution discrepancies of facts and myths using the 36 part-of-speech labelsFootnote 12 produced by the NLTK POS tagger. We use two pirate plots in Fig. 7 to display the distributions of the four major lexical categories (Verb, Noun, Adj, Adv), as well as their sum (Content) and Function words, representing the lexical classes of the two codes of statements. The y-axis corresponds to the normalized frequency of each POS tag in each claim.

The pirate plots of the POS distributions of myths and facts show significant differences (p value < 2.2e−16) in the use of Nouns and Verbs in the two sub-datasets. The fact group consists of 60% Nouns, almost 15% more than the myth group. Verbs, however, occur more in the myth group, 8% more than in the fact group, suggesting a tendency toward dynamic structures in misinformation.
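The normalized per-claim POS frequencies behind Fig. 7 can be sketched as follows. The tagged claim is a hypothetical example; the tags follow the Penn Treebank set that NLTK's tagger emits:

```python
from collections import Counter

# Hypothetical POS-tag sequence for one claim (Penn Treebank tags).
claim_tags = ["NNP", "VBZ", "DT", "JJ", "NN", "IN", "NNP"]

NOUN = {"NN", "NNS", "NNP", "NNPS"}
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

counts = Counter(claim_tags)
n = len(claim_tags)
noun_freq = sum(counts[t] for t in NOUN) / n   # normalized Noun frequency
verb_freq = sum(counts[t] for t in VERB) / n   # normalized Verb frequency
print(noun_freq, verb_freq)
```

Averaging these per-claim frequencies over each sub-corpus yields the distributions compared in the pirate plots.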

To probe further into the dominant verbal expressions in the myth group, we extract the bigram concept pairs in the myth group. The top 14 concept pairs together with their occurrences are provided in Fig. 8. We find that most of these concept pairs show people’s strong willingness to control the virus, such as ‘kill Coronavirus’ and ‘prevent COVID-19,’ reflecting people’s fear of and anxiety toward the pandemic.
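A minimal sketch of the bigram extraction, using adjacent-token bigrams over hypothetical headlines (the paper's concept-pair extraction may be richer than plain adjacency):

```python
from collections import Counter

# Hypothetical myth headlines; counting adjacent-token bigrams as a crude
# stand-in for concept-pair extraction.
headlines = ["garlic can kill coronavirus",
             "hand dryers kill coronavirus",
             "vitamin c can prevent covid-19"]

pairs = Counter()
for h in headlines:
    toks = h.split()
    pairs.update(zip(toks, toks[1:]))  # adjacent bigrams

print(pairs.most_common(3))
```

Sorting the counter yields the most frequent pairs, analogous to the top pairs shown in Fig. 8.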

Fig. 8

Top concept pairs in myths

Moreover, in terms of Nouns, both myths and facts employ NNP (singular proper noun) with the highest frequency. As for Verbs, myths and facts also demonstrate significant differences in all subcategories, with myths predominantly favoring VBZ (3rd person singular present verbs) and facts VB (base form verbs). In addition, we calculate the number of function and content words for each of the two sub-datasets and obtain the respective LD (lexical density)Footnote 13 [65]. The LD of myths compared to facts suggests that low-credibility information has a lower lexical density.
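Lexical density as used above can be computed as the proportion of content-word tags among all tokens. A minimal sketch over a hypothetical tagged claim:

```python
# Lexical density (LD): proportion of content words among all tokens,
# inferred from Penn Treebank POS tags (noun/verb/adjective/adverb prefixes).
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def lexical_density(tagged):
    content = sum(1 for _, tag in tagged if tag.startswith(CONTENT_PREFIXES))
    return content / len(tagged)

# Hypothetical tagged claim, not an item from the dataset.
claim = [("garlic", "NN"), ("cures", "VBZ"), ("the", "DT"),
         ("new", "JJ"), ("coronavirus", "NN")]
print(lexical_density(claim))  # 4 content words out of 5 tokens
```

Computing this per sub-corpus gives the LD comparison between myths and facts reported above.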

5 Feature discrimination

This section aims to identify the prominent linguistic features for predicting information credibility. We provide the following two subsections of experiments to study the interactions between the investigated features (independent variables) and information credibility (dependent variable).

5.1 Logistic regression

We first conduct logistic regression to model the relation between the investigated linguistic features and the credibility of the claims. The ‘Truth’ variable is taken as the binomial dependent variable, and the linguistic features are the independent variables. We model TTR, sentence length (s_len), word average length (w_len), average E, P, A scores (E_avg, P_avg, A_avg), as well as the frequency of the six general POS tags (cf. Sect. 4.3) for each claim as the linguistic variables. In addition, we provide a Null model of random values generated in the range of (− 5, 5) to serve as a baseline. The logistic model is built in RStudio with R Markdown, and the glm() function is adopted for model fitting with the family parameter set to binomial. The results are displayed in Table 5.

Table 5 Binomial logistic regression results for predicting information credibility

The results in Table 5 include the coefficients, their standard errors, the z-statistics, and the associated p values. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable. Among all the variables, E_avg is the most significant predictor, with a p value below 0.001; P_avg is also significant at the 0.1% level; Noun, Verb, sentence length, average word length, and A_avg are significant predictors at the 1% level. The standard interpretation of the binomial logit is that, for a unit change in a predictor variable, the logit of the outcome is expected to change by the respective parameter estimate, given that the other variables in the model are held constant. For example, for every one-unit change in E (Evaluation), the log odds of true (versus untrue) increases by 0.43 with significance. The regression result shows that the evaluative score is the strongest feature for predicting information credibility, and other linguistic features such as nouns, verbs, and TTR also show significance. The regression result largely conforms to the findings in Sect. 4.1 that sentiment and other linguistic devices are important factors in constructing the language of misinformation.
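The log-odds interpretation can be made concrete with a short numeric sketch (the 0.43 coefficient is the reported estimate for E_avg; the baseline log odds of 0, i.e., P = 0.5, is an illustrative assumption):

```python
import math

beta_E = 0.43  # reported logistic regression coefficient for E_avg

def prob_from_logit(logit):
    """Invert the logit: P = 1 / (1 + exp(-logit))."""
    return 1 / (1 + math.exp(-logit))

# A one-unit increase in E_avg adds beta_E to the log odds,
# i.e., multiplies the odds of 'true' by exp(beta_E)
odds_ratio = math.exp(beta_E)  # ≈ 1.54

# Illustrative baseline: log odds of 0 (P = 0.5); after +1 in E_avg:
p_after = prob_from_logit(0 + beta_E)  # ≈ 0.606
```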

5.2 Machine learning performance

This subsection evaluates the effectiveness of the investigated features for the automatic detection of information credibility using customized machine learning classifiers. Experiment settings are deployed as in Sect. 3.2.4. We divide the dataset into a training set and a test set at a ratio of 7:3. Evaluation metrics in the codeFootnote 14 include Accuracy, Precision, Recall, and F1-score. The results in terms of F1 are summarized in Table 6.
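For reference, the reported metrics follow their standard definitions, sketched here in plain Python with toy labels (not results from the paper's experiments):

```python
def binary_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = factual claim)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold labels and predictions for illustration only
p, r, f1 = binary_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```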

Table 6 Evaluation results on machine learning classifiers

In Table 6, the first four rows of results are from the four individual feature sets: BOW, POS, W2V, and EPA. Among the four, W2V shows the best performance for all three classifiers, followed by the EPA feature set. Note that both POS and EPA outperform the baseline feature, indicating the usefulness of the proposed features for truth detection in certain scenarios. In addition, we concatenate the affective values of each word in E, P, and A to the word vectors and test their effectiveness, respectively. The E(valuative) affix shows the greatest improvement, compared to P(otency) and A(ctivity). Finally, the combined vectors of E, P, A with W2V demonstrate the best performance for all classifiers. Overall, the proposed affective features are effective for the task of truth detection, and the SVM classifier demonstrates superior performance to the other two classifiers.
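One way such a concatenation might look, assuming 300-dimensional word2vec embeddings and a hypothetical EPA lexicon (the scores and words below are made up for illustration, not the lexicon used in the paper):

```python
import numpy as np

# Hypothetical (E, P, A) lexicon; real scores would come from an
# affective lexicon such as the one used in the paper
epa_lexicon = {"kill": (-3.2, 1.1, 1.4), "cure": (2.5, 1.3, 0.8)}

def concat_epa(word, w2v_vec, lexicon, default=(0.0, 0.0, 0.0)):
    """Append a word's E, P, A scores to its embedding vector."""
    return np.concatenate([w2v_vec, np.asarray(lexicon.get(word, default))])

# A 300-dim embedding (zeros as a stand-in) becomes 303-dim after concatenation
vec = concat_epa("kill", np.zeros(300), epa_lexicon)
```

Appending a single affective dimension (only E, only P, or only A) instead of all three yields the per-dimension variants compared in Table 6.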

6 Conclusions and future work

This work describes an empirical analysis of the COVID-19 infodemic in terms of its distinct (psycho-)linguistic characteristics, focusing on a balanced dataset of COVID-19 myths and facts—CovMythFact. In addition, we provide an in-depth analysis of the three-dimensional affective values (EPA) of the seed words and their collocations to account for the pathogenicity of the infodemic based on affect control theory. Basic machine learning models are tested by utilizing the proposed features for the task of truth detection.

The results show that the COVID-19 infodemic is characterized by a patterned language that prefers several salient groups of words and lexical categories. That is, COVID-19 myths manifest a linguistically distinct code that is more unfolded,Footnote 15 dynamic,Footnote 16 negative,Footnote 17 sympathetic,Footnote 18 and active,Footnote 19 as evidenced by the analytical and predictive results. Since the identification of the virus (myths) and its properties does not by itself cure the disease (infodemic), it is intriguing and vital to know why such distinct linguistic patterns contribute to the persuasion of information receivers. The EPA-grounded evidence based on affect control theory provides a sound explanation of people’s social perceptions, actions, and emotional experiences from a psychological point of view. These factors (e.g., negativity, sympathy, emotionality) are governed by people’s psychological intention to minimize deflections between fundamental sentiments and transient impressions, which can affect their social interactive behaviors. As indicated by Drif et al. [66], information consumers tend to believe their own perception of reality as the only facts. Such psychological factors are essential for arousing people’s collective memories in their cognition, sentiment, and knowledge systems, which in turn mobilizes them to engage in the transmission of misinformation, hence the infodemic.

Building on the respective descriptions and interpretations, we have attempted to pinpoint the essential components of the language that lead people to believe and share misinformation. In addition to the linguistic significance of such investigations, this work also provides practical recommendations on how to curb low-credibility information by indicating discriminative features (such as nouns, verbs, word length, word sentiment, and affect) that can benefit automatic anti-infodemic systems. To further verify the effectiveness of the investigated features, we will customize them with sophisticated machine learning models to measure the possible performance enhancement in future work.

Beyond the above objectives, another research interest is to study the persuasive powerFootnote 20 of the language in an infodemic regardless of the credibility of the information. That is, other than identifying the truth, what are the key arguments essential for persuasion? We will further address this research question with reference to theories of Persuasive Arguments and Fallacy Arguments [67, 68] and seek experimental verification for further evidence.