Introduction

Former President Trump is a controversial figure in the media and in political commentary. He made heavy use of his former Twitter account and was always ready to call out ’fake news’. According to Wikipedia, Donald Trump’s presence on social media has attracted attention worldwide since he joined Twitter in 2009 with the handle @realDonaldTrump. He had over 88.9 million followers by 2021, and his use of the platform culminated in his being banned from Twitter for at least two years on 8 January 2021.

The authors have previously analysed his tweets on climate change (Allen and McAleer 2018a), on nuclear weapons and Kim Jong-Un (Allen and McAleer 2018b), and have contrasted his first State of the Union Address (SOU) with the previous one by President Obama (Allen et al. 2018). We have also compared some of his speeches with those of Obama and Hitler (Allen et al. 2019a, 2019b).

This paper features an analysis of former President Trump’s recent tweets on COVID-19. The tweets are analysed by means of various data mining techniques, including sentiment analysis. The intention is to explore the contents and sentiments of the messages contained, the degree to which they differ, and their potential implications for the national reaction to COVID-19. The data set or corpus includes 159 tweets on the coronavirus that are sourced from the Trump Twitter Archive running from 24 January 2020 to 2 April 2020.

The analysis is performed via the application of a variety of R packages. These include ’tm’, a text mining package created by Feinerer and Hornik (2019), ’Textmining’ by Eder and Melcer (2016), ’tidytext’ by Silge and Robinson (2016), and ’stringi’ by Gagolewski (2020). We also used ’syuzhet’, a sentiment extraction tool incorporated into an R package by Jockers (2015), ’wordcloud’ by Fellows (2018), and ’twitteR’ by Gentry (2015).

Data mining methods are drawn from statistics, machine learning, and database systems and are applied to the analysis of textual data and the exploration of patterns within it. Sentiment analysis features the exploration of the nature of the emotions contained in a text. Differences in sentiment can be viewed in terms of binary distinctions (positive versus negative). Alternatively, different types of emotions can be explored. We used the R packages ’tidytext’ and ’syuzhet’, which distinguish between eight different emotions, namely trust, anticipation, fear, joy, anger, sadness, disgust, and surprise.

Sentiment analysis has its limitations, given the difficulties of interpreting various common language usages, such as sarcasm, negations, and so forth.

It is possible to analyse the sentiments of news feeds using these techniques. Allen et al. (2015) and Allen et al. (2017) analyse the influence of the Thomson Reuters News Analytics (TRNA) sentiment series. The first of these papers explored the influence of the sentiment measure as an asset-pricing factor in pricing DJIA constituent company stocks. The second used an aggregated DJIA market sentiment score and entropy measures to assess the impact of scores on DJIA returns. Allen et al. (2018) use the TRNA data to successfully augment the Fama–French three-factor model.

Allen and McAleer (2019) undertake an analysis of then President Trump’s two State of the Union addresses and also apply Zipf and Mandelbrot’s power law to assess the degree to which they differ from common language patterns. In order to provide a contrast and some parallel context, analyses are also undertaken of President Obama’s last State of the Union address and Hitler’s 1933 Berlin Proclamation.

The structure of these four political addresses is remarkably similar. The three US Presidential speeches are more positive emotionally than is Hitler’s relatively shorter address, which is characterized by a prevalence of negative emotions. Hitler’s speech deviates most from common speech, but all three appear to target their audiences effectively by use of non-complex speech.

Various papers have explored the influence of news items on asset prices and volatility. Examples of this are given by Tetlock et al. (2008), Da et al. (2011), Barber and Odean (2008), diBartolomeo and Warrick (2005), Mitra et al. (2009), and Dzielinski et al. (2011). Cahan et al. (2009) and Hafez and Xie (2012) used RavenPack data to examine diversification benefits.

A number of papers provide surveys of this literature. Loughran and McDonald (2014) survey applications of textual analysis and sentiment analysis to the accounting, finance, and economics literature. Kearney and Lui (2014) review sentiment analysis and discuss applications in the related literature.

In this paper, we concentrate on the content of then President Trump’s recent tweets on the topic of COVID-19 and their implications for the management of the pandemic in the early days up to the beginning of April 2020. Does President Trump try to present the pandemic in a positive light for the benefit of his voter base? Does he try to play down both the virus and its attendant risks? Could the tweets on these topics be interpreted seriously as attempts to avoid scare-mongering?

To provide context, the tweets are presented against a background of what Dr. Fauci was saying in public statements at the time and privately via his recently revealed private emails, as obtained by Buzzfeed on 1 June 2021 (https://www.buzzfeednews.com/article/nataliebettendorf/fauci-emails-covid-response). It appears, with the benefit of hindsight, that there was an inherent contradiction between his public statements and his private email correspondence. It may be the case that President Trump’s instincts were more accurate than was thought at the time, and that his response to the pandemic was not as xenophobic, or ’contrary to the science’, as was originally claimed. Dr Fauci’s dissembling possibly gives the lie to more extreme interpretations of President Trump’s pronouncements at the time.

Might President Trump’s constant tweeting be fairly described as constituting ’propaganda’? This might be defined as information presented in a biased or misleading manner, commonly to promote a political cause or point of view.

Sentiment analysis will not give a clear answer to this question per se, but it should reveal patterns in the sentiment displayed within the tweets, correlations and associations in the use of words, and patterns displayed over time in the messaging embedded in the tweets.

Zipf (1932, p. 1) suggested an alternative approach to the analysis of language as a whole based on relative frequency, suggesting: “the accent or degree of conspicuousness of any word, syllable, or sound is inversely proportionate to the relative frequency of that word, syllable, or sound, among its fellow words, syllables, or sounds in the stream of spoken language. As any element’s usage becomes more frequent, its form tends to become less accented, or more easily pronounceable, and vice versa”.

Zipf (1932) described four important features recognisable in words: ’meaning’; ’quality’, referring to either positive or negative qualities, which is the focus of sentiment analysis in this paper; ’emotional intensity’, which can also be related to the espousal of sentiment; and what he referred to as ’order’, a concept related to semantic change and the relative frequency of usage of different words.

Zipf suggested a suitable formula for capturing the relative frequency of words is \(P_{n}\,\sim 1/n^{\alpha },\) where \(P_{n}\) is the frequency of a word ranked nth, and the exponent \(\alpha \) is close to 1. This means that the second most frequently observed word occurs approximately 1/2 as often as the first, and the third word 1/3 as often as the first, and so on.
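As a brief illustration of the rank-frequency relationship just described, the following Python sketch generates the relative frequencies implied by \(P_{n}\sim 1/n^{\alpha }\). It is not part of the authors' R workflow; the function name and the choice of five ranks are assumptions made purely for illustration.

```python
def zipf_relative_frequencies(num_words, alpha=1.0):
    """Return P_n proportional to 1 / n**alpha for ranks 1..num_words,
    normalised so that the frequencies sum to one."""
    weights = [1.0 / rank ** alpha for rank in range(1, num_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Frequencies of the five most common words under an idealised Zipf law.
probs = zipf_relative_frequencies(num_words=5, alpha=1.0)

# Express each frequency relative to the most frequent word.
ratios = [p / probs[0] for p in probs]
print([round(r, 3) for r in ratios])  # [1.0, 0.5, 0.333, 0.25, 0.2]
```

With \(\alpha =1\) the ratios reproduce the 1/2, 1/3, ... pattern; a smaller exponent gives a flatter profile.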

Mandelbrot (1965) expanded and refined Zipf’s theory, concentrating on the view that human languages evolved to optimize the conveyance of information. He drew on Shannon’s (1948) ’information theory’. Mandelbrot wrote the formula \(i(r,k)/k\), where \(i(r,k)/k\) is defined as the relative number of repetitions of the word \(W(r)\) in a sample of length k. This is proposed as being inversely proportional to 10 times r: \(i(r,k)/k=1/(10r).\)

Shannon (1948, p. 6) showed how artificial languages can be used to approximate natural languages. If all letters are given the same probability and chosen independently, this would be a zero-order approximation. In a first-order approximation, letters would be chosen independently, but their probability of occurrence would match that in the relevant natural language. A trigram structure would be adopted in a third-order approximation. In this case, the probability of each letter would be dependent on the preceding two letters.

Shannon (1948) writes, let \(p(B_{i})\) be the probability of a sequence \(B_{i}\) of symbols from a source text. Let:

$$\begin{aligned} G_{N}=-\frac{1}{N}\sum _{i}p(B_{i})\log p(B_{i}), \end{aligned}$$
(1)

where the sum is over all sequences \(B_{i}\) containing N symbols. The implication is that \(G_{N}\), the per-symbol entropy of blocks of N symbols computed from their probability mass function, is a monotonically decreasing function of N, and that:

$$\begin{aligned} \lim _{N\rightarrow \infty }G_{N}=H. \end{aligned}$$

Shannon lets \(p(B_{i},S_{j})\) be the probability of sequence \(B_{i}\) being followed by symbol \(S_{j}\) and \(p_{B_{i}}(S_{j})=p(B_{i},S_{j})/p(B_{i})\) be the conditional probability of \(S_{j}\) after \(B_{i},\) then let:

$$\begin{aligned} F_{N}=-\sum _{i,j}p(B_{i},S_{j})\log p_{B_{i}}(S_{j}), \end{aligned}$$
(2)

where the summation is over all blocks \(B_{i}\) of \(N-1\) symbols and over all symbols \(S_{j}\), then \(F_{N}\) is a monotonically decreasing function of N:

$$\begin{aligned}&F_{N}=NG_{N}-(N-1)G_{N-1}, \\&G_{N}=\frac{1}{N}\sum _{n=1}^{N}F_{n}, \\&F_{N}\le G_{N}, \end{aligned}$$

and \(\lim _{N\rightarrow \infty }F_{N}=H.\)

Shannon (1948) suggests that \(F_{N}\) is the entropy of the Nth-order approximation. Mandelbrot (1965) interprets this derivation of the law of word frequencies as being consistent with maximising Shannon’s “quantity of information” under certain constraints.
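To make the quantities \(G_{N}\) and \(F_{N}\) concrete, the following Python sketch estimates them from the empirical block frequencies of a short string. It is an illustration only, not the authors' code: the sample sentence is invented, logarithms are taken to base 2, and the text is treated as circular (an assumption introduced here) so that every block of \(N-1\) symbols has exactly one successor and the identity \(F_{N}=NG_{N}-(N-1)G_{N-1}\) holds exactly for the estimates.

```python
from collections import Counter
from math import log2


def block_counts(text, n):
    """Count every block of n consecutive symbols, wrapping around the end
    of the text (circular assumption, made for this illustration)."""
    wrapped = text + text[:max(n - 1, 0)]
    return Counter(wrapped[i:i + n] for i in range(len(text)))


def block_entropy_per_symbol(text, n):
    """Empirical G_N: entropy of n-symbol blocks divided by n."""
    total = len(text)
    counts = block_counts(text, n)
    return -sum((c / total) * log2(c / total) for c in counts.values()) / n


def conditional_entropy(text, n):
    """Empirical F_N: entropy of a symbol given the preceding n-1 symbols."""
    total = len(text)
    joint = block_counts(text, n)        # blocks B_i followed by S_j
    prefix = block_counts(text, n - 1)   # blocks B_i alone
    return -sum(
        (c / total) * log2(c / prefix[block[:-1]])
        for block, c in joint.items()
    )


sample = "the cat sat on the mat and the cat ate the rat"  # invented text
for n in (1, 2, 3):
    g_n = block_entropy_per_symbol(sample, n)
    f_n = conditional_entropy(sample, n)
    print(n, round(g_n, 3), round(f_n, 3))
```

On such a short string the estimates fall quickly with N simply because few distinct blocks are observed, so they should not be read as estimates of the limiting entropy H.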

Ficcadenti et al. (2019) represent an application of this type of approach and also review some of the relevant literature. We apply the framework outlined above to analyse President Trump’s tweets on COVID-19 and use them to explore the degree to which the language in them is removed from standard patterns of speech.

The remainder of the paper is divided into three sections. An explanation of the research method is given in Sect. 2, Sect. 3 presents the results, and Sect. 4 provides some concluding comments.

Research method

We use a number of R libraries in our data mining and sentiment analysis. These include wordcloud, tm, textmining, textreg, and syuzhet, plus a variety of graphics packages. The R package tm provides the basic infrastructure required to organize, transform, and analyze textual data. The process involves importing the body of tweets into a ’corpus’. The corpus, in turn, has to be transformed by various manipulations into a suitable form for analysis. This creates a term-document matrix, which can be used for analysis.
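The corpus-to-matrix step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tm pipeline used in the paper: the three example texts are invented rather than taken from the Trump Twitter Archive, and the stop-word list and cleaning rules are simplifying assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list only; real analyses use much longer lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "on", "in", "is", "with"}


def tokenize(text):
    """Lower-case the text, drop URLs, keep alphabetic tokens,
    and remove stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [tok for tok in re.findall(r"[a-z]+", text) if tok not in STOPWORDS]


def term_document_matrix(documents):
    """Map each term to a list of its counts, one entry per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in terms}


# Invented example texts standing in for a corpus of tweets.
tweets = [
    "The coronavirus task force is working hard",
    "Great briefing on the coronavirus response https://example.com/x",
    "Working with the task force",
]

tdm = term_document_matrix(tweets)

# Total frequency of each term across the corpus, most frequent first.
totals = sorted(((sum(row), term) for term, row in tdm.items()), reverse=True)
print(totals[:3])
```

Each row of the matrix is a term and each column a document, so row sums give the overall word frequencies that feed a bar chart or word cloud.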

When we have the text in matrix form, a large number of R functions (such as clustering and classification) can be used. Associations between words and their correlations can be examined, and text can be filtered to reveal frequently occurring words. Once we know the frequently occurring words, we can create a word cloud, as described by Feinerer and Hornik (2019). Another R library package that is useful for creating and analysing word clouds is ’wordcloud’ by Fellows (2018).
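One common way to quantify word associations is to correlate the rows of the term-document matrix. The Python sketch below does this with a plain Pearson correlation. It is offered as an assumption about how such associations can be computed, not as a description of the exact routine used in the paper; the small matrix of counts is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors
    (returns 0.0 when either vector is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def find_associations(matrix, term, min_corr):
    """Return the terms whose document counts correlate with `term`
    at or above `min_corr`, strongest first."""
    target = matrix[term]
    scored = {
        other: pearson(target, row)
        for other, row in matrix.items()
        if other != term
    }
    ranked = sorted(scored.items(), key=lambda item: -item[1])
    return {word: round(r, 2) for word, r in ranked if r >= min_corr}


# Hypothetical term-document counts (four documents), for illustration only.
matrix = {
    "sick": [1, 0, 1, 0],
    "leave": [1, 0, 1, 0],
    "paid": [1, 0, 0, 0],
    "china": [0, 1, 0, 1],
}

print(find_associations(matrix, "sick", min_corr=0.5))
# {'leave': 1.0, 'paid': 0.58}
```

A threshold such as 0.70 then selects only the most strongly associated words.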

For sentiment analysis, we use the R package ’syuzhet’ and apply the default syuzhet lexicon, which was developed in the Nebraska Literary Lab under the direction of Jockers (2015). The term ’Syuzhet’ comes from the Russian Formalists Shklovsky (1917) and Propp (1928), who divided narrative into two components, the ’fabula’ and the ’syuzhet’. The fabula refers to the chronological order of events, and the syuzhet to the ’device’ or technique of a narrative. The R package decomposes global measures of sentiment into eight constituent emotional categories, namely trust, anticipation, fear, joy, anger, sadness, disgust, and surprise.
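The mechanics of lexicon-based scoring can be illustrated with a short Python sketch. The five-word lexicon below is entirely hypothetical and is not the syuzhet lexicon; it only shows how word-level category labels are aggregated into emotion counts and a net positive-minus-negative score.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: each word maps to the categories it signals.
TOY_LEXICON = {
    "great": {"positive", "joy", "trust"},
    "help": {"positive", "trust"},
    "working": {"positive", "anticipation"},
    "virus": {"negative", "fear"},
    "hoax": {"negative", "anger", "disgust"},
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count how many times each category is triggered by words in the text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for category in lexicon.get(token, ()):
            counts[category] += 1
    return counts


def valence(text, lexicon=TOY_LEXICON):
    """Net sentiment: positive hits minus negative hits."""
    counts = emotion_counts(text, lexicon)
    return counts["positive"] - counts["negative"]


example = "Great help against the virus"  # invented sentence
print(dict(emotion_counts(example)))
print(valence(example))  # 1
```

Because each word is scored in isolation, a sketch like this shares the limitations noted earlier, such as insensitivity to negation and sarcasm.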

To explore how the narrative is constructed, and how the positive or negative sentiment revealed changes over time, we plot the values in a graph, where the x-axis represents the passage of time from the beginning to the end of the text, and the y-axis measures the degrees of positive and negative sentiment.
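A sequence of per-tweet sentiment scores is usually noisy, so some smoothing helps when plotting it against time. The Python sketch below applies a simple trailing moving average. This is one possible choice made for illustration, not necessarily the transformation applied in the paper, and the scores are hypothetical.

```python
def rolling_mean(values, window):
    """Trailing moving average; early positions average the values
    available so far, so the output has the same length as the input."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical net sentiment scores for six tweets in chronological order.
scores = [1, -1, 2, 0, -2, 3]
smoothed = rolling_mean(scores, window=3)

# x-axis: position in the sequence; y-axis: smoothed sentiment.
for position, value in enumerate(smoothed, start=1):
    print(position, round(value, 2))
```

A wider window gives a smoother curve at the cost of blurring short-lived swings in sentiment.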

We develop the appropriate R code to undertake the Zipf and Mandelbrot power law distribution analysis, to assess the degree to which the tweets on COVID-19 deviate from common language, and draw on the R package ’tm’.
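The core of a Zipf-type analysis is a regression of log relative frequency on log rank. The following Python sketch fits that line by ordinary least squares. It is a minimal illustration under stated assumptions, not the authors' R code: the token list is synthetic and constructed to follow an exact 1/rank pattern, so the fitted slope is -1 by design.

```python
from collections import Counter
from math import log


def zipf_slope(tokens):
    """Fit log(relative frequency) = intercept + slope * log(rank) by OLS
    and return (slope, intercept)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    xs = [log(rank) for rank in range(1, len(counts) + 1)]
    ys = [log(count / total) for count in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic tokens: word k appears 60 / k times (60, 30, 20, 15, 12).
tokens = [f"w{k}" for k in range(1, 6) for _ in range(60 // k)]

slope, intercept = zipf_slope(tokens)
print(round(slope, 3))  # -1.0
```

On real text the fitted slope is compared with the theoretical value of -1 to judge how far the corpus departs from the idealised law.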

The paper features an analysis of then President Trump’s public statements via his Twitter account and the application of machine learning techniques. The recent release of Dr. Fauci’s emails on the topic gives a more detailed context to President Trump’s tweets, both in terms of what Dr Fauci was saying in public, and privately in his email correspondence.

Results and analysis

President Trump’s tweets

Figure 1 presents a word cloud analysis of President Trump’s tweets on COVID-19. It is no surprise that the word cloud suggests that the most frequently occurring word was ’coronavirus’, because the tweets in the ’Trump Twitter Archive’ (see: http://www.trumptwitterarchive.com/archive) were screened on the word ’coronavirus’, so it occurs at least once in every tweet sampled, and 178 times in total. It was followed in frequency by ’will’, which occurred 35 times, ’covid’ 31 times, ’president’ 28 times, and ’realdonaldtrump’, one of his Twitter handles, 22 times, together with ’response’, which also appeared 22 times. A bar chart of the 20 most frequently occurring words is shown in Fig. 2.

The most frequently occurring words in Fig. 2 have predominantly positive interpretations: ’will’ and ’response’ were mentioned previously, and ’great’ appears 21 times, with ’just’, ’task’, and ’force’ 20 times each. ’Working’ appears 18 times, while ’trump’, ’briefing’, and ’act’ appear 17 times each. China, the ’villain’ of the piece, appears 15 times, as do ’job’, ’american’, ’help’, and ’families’. Figure 3 provides a histogram of the relative proportions of positive and negative sentiments in these tweets.

Fig. 1 President Trump’s tweets on COVID-19

Fig. 2 Most frequent words bar chart

Table 1 shows the specific words that are associated with the most frequently occurring words. Somewhat surprisingly, neither ’Wuhan’ nor ’Wuhan Institute of Virology’ seems to rate a mention. ’Coronavirus’ seems to be associated with words aimed at mitigating the economic effects of shutdowns and enforced social distancing. Words with high correlations include ’free’, ’leave’, ’paid’, and ’sick’.

The correlations with ’china’ are surprisingly positive or neutral, and include ’closely’, ’agencies’, ’conversation’, ’anywhere’, ’good’, ’monitor’, ’ongoing’, ’received’, ’top’, ’detail’, ’developed’, ’discussed’, ’experts’, ’respect’, ’best’, and ’leading’, to mention a few. The only possibly pejorative terms appear to be ’ravaging’ and ’virus’.

The word ’president’ is also associated with the ’vice’ president, Mike Pence, and words such as ’airlines’, ’ceos’, ’corona’, ’impact’, ’met’, and so on. Another frequently occurring word in this set of tweets is ’busy’, which seems to be highly associated with ’battle’, ’calling’, ’flight’, ’republican’, ’wasting’, ’ahead’, ’anything’, ’bad’, ’closing’, ’hoax’, ’putting’, ’wrong’, ’immigration’, ’border’, ’impeachment’, and ’dems’, to mention just those words with correlations in excess of 0.70.

With respect to the previously mentioned ’democrats’, there seems to be a much greater prevalence of negative associations than those mentioned with respect to the previous words considered. Some of the words with negative connotations include ’battle’, ’wasting’, ’bad’, ’closings’, ’hoax’, ’wrong’, ’nothing’, ’blamed’, ’fault’, ’impeachment’, ’scam’, ’incite’, ’harm’, ’disinformation’, and so forth. The use of President Trump’s twitter feed as a political weapon seems to be apparent in relation to the word ’democrats’.

Table 1 Most frequent word associations

Fig. 3 Sentiment analysis of President Trump’s tweets on COVID-19

The sentiment analysis of these tweets shown in Fig. 3 reveals that overall they are predominantly positive, as revealed by the two central columns in the figure. The predominant emotion conveyed is trust, followed by the negative emotion fear. These are then followed by anticipation, joy, anger, surprise, sadness and disgust, in order of their relative predominance.

Figure 4 plots the ’emotional valence’ of this series of tweets. This refers to the pattern of sequential positive and negative emotions displayed as the tweets on coronavirus unfold through time. The plot shown in Fig. 4 does not reveal a particular pattern in the occurrence of positive and negative emotions.

Figure 5 shows a theoretical application of Zipf’s law to the set of tweets. A full confirmation of Zipf’s law would show a line with a slope of negative one in the plot in Fig. 5, running from the top left to the bottom right. In the diagram, the y-axis depicts the logarithm of relative frequency and the x-axis the logarithm of the index.

The regression model which produced the diagram in Fig. 5 is shown in Table 2. The plot deviates from a theoretical plot of a line with a slope of negative 1. A flatter Zipf slope can indicate a more random signal, but it can also indicate a broader vocabulary that conveys a more precisely worded message. Zipf suggests that attempts to remove ambiguities should produce a flatter slope that favours the recipient. The estimated slope coefficient is -0.70, which is highly significant at the 1% level, with a t-statistic of -138 and an adjusted R-squared value of 0.94.

The slope estimate suggests that President Trump’s tweets on COVID-19 are designed to favour the recipient, and deviate from standard language patterns. This makes perfect sense, given the sparse nature of tweets and the fact that they are frequently designed to convey a simple message.

Fig. 4 Emotional valence of tweets on COVID-19

Fig. 5 Estimation of Zipf law relationship

Table 2 Zipf regression (Model 3: OLS, using observations 1–1296; dependent variable: l_RELFRE)

Dr Fauci’s public statements and private emails

In an email response to Danelle Steinberg on 3 February 2020, Dr Fauci replied: “You ask that there have been animal markets for a long time, and so why now. The fact is that this is likely pure chance +/− more interactions in the human-animal interface. Animal viruses mutate and most of the time the mutations have no significant impact on virus transmission to humans. Sometimes they mutate and allow single “dead end” transmissions to individual humans with no efficiency in going human to human and so we get individual infections and no outbreak as we have seen with H5N1 and H7N9 influenzas that jump from chickens to humans but do not go from human to human. Then rarely, animal viruses mutate and the mutation allows them not only to jump species to humans, but to also efficiently spread from human to human”. Dr Fauci appeared to be very reluctant to consider that the virus may have escaped from a laboratory in Wuhan.

Dr. Fauci has also been a proponent of the continuation of gain of function research and in a piece written in (2012), stated that: “Scientists working in this field might say—as indeed I have said—that the benefits of such experiments and the resulting knowledge outweigh the risks. It is more likely that a pandemic would occur in nature, and the need to stay ahead of such a threat is a primary reason for performing an experiment that might appear to be risky.”

Buzzfeed notes that Dr. Fauci responded to Sylvia Burwell, who had emailed him a query on 5 February 2020 about the efficacy of wearing a mask, with the comment: “Masks are really for infected people to prevent them from spreading infection to people who are not infected rather than protecting uninfected people from acquiring infection.” It does appear that Dr. Fauci’s advice about masks did change, with his endorsement of the wearing of one or two masks increasing over time.

The National Institutes of Health (NIH) has come under significant criticism in recent weeks over funding research relating to gain of function at the Wuhan Institute of Virology (WIV), while Dr Fauci has denied that the NIH has funded such research at the WIV. He told a US Senate hearing that the NIH “has not ever and does not now fund gain-of-function research in the WIV”. Yet researchers at WIV, including its prominent virologist Dr. Shi Zhengli, have disclosed that work on coronaviruses had been funded by NIH grants.

It does appear that Dr. Fauci had a vested interest in steering attention towards the hypothesis that COVID-19 jumped from animals to humans in the Wuhan wet market, given that the NIH had actually been funding gain of function research via Dr. Peter Daszak and the EcoHealth Alliance, a US-based organization that conducts research and outreach programs on global health, conservation and international development.

Dr Daszak complained in Nature on 21 August 2020 about ’unfounded rumours’ that the COVID-19 pandemic was caused by a coronavirus released from its laboratory. The NIH cancelled EcoHealth Alliance’s grant in April 2020, days after US President Donald Trump told a reporter that the United States would stop funding work at the WIV. The actual origins of COVID-19 remain uncertain, but the lab leak hypothesis has been gaining traction at the time of writing.

Conclusion

In this paper we have analysed a set of 159 tweets on COVID-19 that are sourced from the Trump Twitter Archive, running from 24 January 2020 to 2 April 2020. We have used a variety of R library packages to analyse the tweets using different data and text mining routines, with a focus on sentiment analysis and applications of the Zipf law.

The analysis reveals that former President Trump used the Twitter feed effectively to deliver a simplified set of repeated messages. Given the nature of the topic, it is perhaps surprising that sentiment analysis reveals that the general tenor of the tweets is largely positive. This confirms previous findings in analyses of his tweets on climate change (Allen and McAleer 2018a), and on nuclear weapons and Kim Jong-Un (Allen and McAleer 2018b). A positive note was also evident in the contrasts we undertook of then President Trump’s first State of the Union Address (SOU) with the previous one by President Obama (Allen et al. 2018), and of some of his speeches with those of Obama and Hitler (Allen et al. 2019a, 2019b).

The current analysis reveals that former President Trump’s tweets do not appear to be highly critical of China, but show strong evidence of a political theme, and are highly critical of the Democrats. The Zipf analysis suggests that the tweets contain simplified language to convey a simple and effective political message to the target audience. Then President Trump continued to make effective use of Twitter to target his political enemies and deliver questionable information, while avoiding the filters provided by conventional TV and news media. This avenue of effective communication ceased when he was banned for at least two years from Twitter and Facebook in early 2021.

The recent controversy about the nature of Dr. Fauci’s private emails and the purported contrast between his private and public statements suggests that then President Trump’s criticism of Dr Fauci may have had more justification than was realized at the time, and that his decision to stop gain of function research funding may have been a wise move.