1 Introduction

In recent years, data science has emerged as a new and important discipline that can be viewed as an amalgamation of traditional disciplines such as statistics, data mining and distributed systems [1]. Data-driven decision making has become ubiquitous in almost all aspects of society. With the Internet of Things, huge volumes of a wide variety of data are generated at high velocity. Real-time decision making is central to the Internet of Things [2].

In a world where a lot is bound to happen, what the masses think or feel about these happenings is a concern for governments, businesses and even individuals. Governments want to know how their policies and interventions are received or perceived by the masses; politicians want to know whether they have a favorable rating and how their policies are received and implemented; businesses want to understand the reputation of their brands. Social media holds great promise for answering these questions through the analysis of posts, product reviews, customer feedback and the like. The advent of social media has made available a platform where individuals can freely express their opinions, feelings or judgments. Careful data mining can unravel valuable information and draw out insights that may be hidden in these expressions.

On 31st December 2019, a cluster of pneumonia cases of unknown etiology was reported in Wuhan, Hubei Province, China [3, 4]. About a week later, on 9th January 2020, the Chinese Center for Disease Control (CDC) reported a novel coronavirus as the causative agent of this outbreak of coronavirus disease 2019 (COVID-19). COVID-19 spreads from person to person through respiratory droplets when an infected person sneezes, coughs or talks [5]. One can also contract COVID-19 by touching a surface or object that has the virus on it and then touching one's nose, mouth or eyes.

As of 29th April 2020, there were 2,995,758 confirmed cases and 204,987 deaths in 213 countries, areas or territories [6]. The nature of the disease (highly transmissible even while an infected individual is still asymptomatic) has seen many governments put in place a raft of measures in a bid to curb its spread. These measures include total and partial lockdowns that have seen businesses closed, curfews, advocacy for staying at home, social distancing, wearing a cloth face covering over the nose and mouth in public places, regular washing of hands for at least 20 s or use of alcohol-based hand sanitizers that contain at least 60% alcohol, and quarantine for infected individuals [5].

The measures put in place as a result of the COVID-19 pandemic have affected the way people do things, and it is of interest to know and understand the feelings, opinions or judgments of the masses on various issues. Several studies have used varied approaches and datasets to try to explain the dynamics of COVID-19. Kumar [7] employed cluster analysis in monitoring COVID-19 infections in India. The approach identified areas/clusters that needed more medical facilities (ventilators, testing kits, masks etc.) and those that needed optimization of monitoring techniques (screening, lockdowns, closedowns, curfews etc.). Khakharia et al. [8] used machine learning techniques to predict the outbreak of COVID-19 for 10 densely populated countries. In particular, they compared the performance of 9 machine learning models in predicting the outbreak. The highest prediction accuracy was achieved for Ethiopia using the Autoregressive Moving Average Model. Social contact based analysis has also been employed to study the underlying disease transmission patterns. Using this approach, Liu et al. [9] showed that the age groups with relatively intensive contacts in households and in public/communities were dispersedly distributed, which explains why the transmission of COVID-19 in the early stage mainly took place in public places and families in Wuhan. Other data mining techniques that can be employed to study the dynamics of COVID-19 are available in [10, 11].

Sentiment analysis, or opinion mining, can be defined as the process of identifying and extracting the subjective information that underlies a text [12]. This information can be an opinion, a judgment or a feeling about a particular topic or subject matter. Sentiment analysis is becoming a field of interest that cannot be ignored. Nguyen et al. [13] employed sentiment analysis on social media to predict stock movement. Their method of incorporating social media data achieved 2.07% better performance than the model using historical prices only. Vincenza et al. demonstrated that Twitter data and sentiment analysis can be used to study disease dynamics [14]. Twitter is a microblogging and social networking service on which users post and interact with messages known as tweets [15]. Twitter’s 321 million active users provide a rich source of data through the tweets they post. In this study we seek to mine opinions and sentiments on the COVID-19 pandemic from Twitter users.

2 Methodology

2.1 Data

This study seeks to provide a framework for real-time social media data analysis for actionable intelligence. The data used in this study were tweets relating to the COVID-19 pandemic. These were streamed live from Twitter using the streamR package [16] in two sessions: from 16:43:09 on 14th April 2020 to 23:50:53 on 15th April 2020, and from 18:24:25 on 17th April 2020 to 16:41:16 the following day. All times are in East Africa Time. Only tweets bearing words such as corona, covid-19, sanitizer, virus, lockdown, quarantine and social distance were of interest and thus streamed.
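The keyword criterion above can be illustrated with a minimal Python sketch. It is not the streamR call used in the study; the case-insensitive substring rule, the function name `is_relevant` and the sample tweets are assumptions made for illustration only.

```python
import re

# Tracking terms listed in the text; the matching rule below is an assumption.
KEYWORDS = ["corona", "covid-19", "sanitizer", "virus", "lockdown",
            "quarantine", "social distance"]

# One case-insensitive pattern that matches any keyword as a substring.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_relevant(text: str) -> bool:
    """Return True if the tweet text mentions at least one tracking term."""
    return PATTERN.search(text) is not None


# Made-up tweets, used only to exercise the filter.
tweets = [
    "Stay home during the lockdown!",
    "Lovely weather in Nairobi today",
    "Ran out of hand SANITIZER again",
]
relevant = [t for t in tweets if is_relevant(t)]
```

Substring matching means that "coronavirus" is picked up by both "corona" and "virus"; a stricter word-boundary rule would be an equally plausible choice.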

The streaming was broken into intervals of two to three hours, with a break of about 2 s between intervals, in order to keep each file of streamed tweets small. The tweet files were then parsed and compiled into a single Excel file. We obtained more than 20 million tweets, from which 91,784 geo-tagged tweets from all over the world were derived.
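As a rough illustration of how the geo-tagged subset might be derived from parsed tweets, the following Python sketch keeps only records that carry coordinates and tallies them by language. The record layout (keys `text`, `lang`, `lat`, `lon`) is hypothetical and is not taken from the study's files.

```python
from collections import Counter

# Hypothetical parsed tweets; lat/lon are None when a tweet is not geo-tagged.
parsed = [
    {"text": "Lockdown day 20", "lang": "en", "lat": -1.29, "lon": 36.82},
    {"text": "Quarentena em casa", "lang": "pt", "lat": -23.55, "lon": -46.63},
    {"text": "Wash your hands", "lang": "en", "lat": None, "lon": None},
]


def geo_tagged(tweets):
    """Keep only tweets that have both a latitude and a longitude."""
    return [t for t in tweets if t["lat"] is not None and t["lon"] is not None]


geo = geo_tagged(parsed)

# Number of geo-tagged tweets per language code.
by_language = Counter(t["lang"] for t in geo)
```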

2.2 Exploratory Data Analysis

Figure 1 was obtained using data from Johns Hopkins University and functions from the tidycovid19 package [17]. The United States of America, some parts of Europe, Asia and Russia had the highest numbers of active cases per 100,000 inhabitants.

Fig. 1 COVID-19 cumulative active cases

2.3 Tweets Location

Figure 2 shows the locations of COVID-19 related tweets. Tweets were concentrated in Europe, South America (in particular Brazil), Asia (in particular India) and Western and Southern Africa. Countries with high numbers of COVID-19 cases posted more tweets.

Fig. 2 Location of COVID-19 related tweets

Figure 3 displays tweet locations by language. English, denoted by red dots, was the dominant language in our data set, accounting for 63,056 tweets.

Fig. 3 Tweets by language

2.4 Methods

There exist several methods of sentiment analysis. Sentiment analysis can be done on three levels, namely document level, sentence level and aspect level [18]. Document-level sentiment analysis considers the whole document as a basic information unit (talking about one topic) and classifies it as expressing a negative, positive or neutral sentiment. Sentence-level sentiment analysis classifies the sentiment expressed in each sentence [18]. Sentiment classification techniques can be divided into the machine learning approach, the lexicon-based approach and a hybrid approach that combines the two [19]. The machine learning approach relies on algorithms such as naïve Bayes, support vector machines and neural networks, among others, together with linguistic features. In the lexicon-based approach, a collection of known and precompiled sentiment terms, known as a sentiment lexicon, is used. This approach can be divided into the dictionary-based approach and the corpus-based approach, which employs statistical or semantic methods to find sentiment polarity [18].
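A dictionary-based lookup of the kind described above can be sketched in a few lines of Python. The lexicon entries, their weights and the helper names (`tokenize`, `lexicon_score`, `label`) are illustrative assumptions and do not correspond to any published sentiment dictionary.

```python
import re

# Tiny illustrative lexicon; entries and weights are hypothetical.
LEXICON = {"good": 1, "like": 1, "great": 1, "bad": -1, "fear": -1, "panic": -1}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text):
    """Sum the lexicon weights of all words; unknown words contribute 0."""
    return sum(LEXICON.get(word, 0) for word in tokenize(text))


def label(score):
    """Map a numeric score to a sentiment class by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


plain = lexicon_score("I like it")
negated = lexicon_score("I do not like it")
```

Both `plain` and `negated` evaluate to 1, since a plain lookup sees only the word "like" and ignores the negator "not".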

In communication, one listens to an entire sentence and derives a meaning that is greater than the sum of the individual words. Calculating polarity or sentiment by matching words with those in a dictionary of words classified as positive, negative or neutral leaves out useful information. In many cases valence shifters (negators, amplifiers/intensifiers, de-amplifiers/downtoners and adversative conjunctions) are not taken into account. A negator flips the sign of a polarized word (e.g., “I do not like it.”). An amplifier (intensifier) increases the impact of a polarized word (e.g., “I really like it.”). A de-amplifier (downtoner) reduces the impact of a polarized word (e.g., “I hardly like it.”). An adversative conjunction overrules the previous clause containing a polarized word (e.g., “I like it but it’s not worth it.”) [20].

Valence shifters affect polarized words, and if they occur frequently a simple dictionary lookup may not model the sentiment appropriately. In the case of negators and adversative conjunctions, the sentiment of the entire sentence may be reversed or overruled [20].

Following Rinker’s methodology [20], each tweet \(s_{i}\) is treated as a sentence composed of words \(w_{i1}, w_{i2}, \dots, w_{in}\), where \(w_{ij}\) denotes the \(j\)th word in the \(i\)th tweet. Each tweet is broken down into an ordered bag of words. Pause punctuation (commas, colons and semicolons) is treated as words within the sentence; all other punctuation is removed. The words in each tweet are compared to a dictionary of polarized words, with positive words \(w_{ij}^{+}\) assigned \(+1\) and negative words \(w_{ij}^{-}\) assigned \(-1\), or other positive and negative weights depending on the sentiment dictionary used.

Denote a polarized word by \(pw_{ij}\). Each polarized word forms a polarized context cluster \(c_{il}\), which is a subset of the tweet, i.e. \(c_{il} \subseteq s_{i}\). The cluster is pulled from around the polarized word and by default comprises the four words before and the two words after \(pw_{ij}\), which are considered as possible valence shifters. The cluster is represented as \(c_{il} = \{pw_{i,j-nb}, \dots, pw_{i,j}, \dots, pw_{i,j+na}\}\), where \(nb\) and \(na\) are the parameters n-before and n-after set by the user. The words in \(c_{il}\) are labeled neutral \(w_{ij}^{0}\), negator \(w_{ij}^{n}\), amplifier (intensifier) \(w_{ij}^{a}\) or de-amplifier (downtoner) \(w_{ij}^{d}\). Neutral words contribute only to the word count in the equation. Each polarized word is then weighted according to the function and number of the valence shifters surrounding it. Pause locations, denoted by \(cw\) (punctuation that denotes a pause, including commas, colons and semicolons), are indexed and used in calculating the upper and lower bounds of the polarized context cluster.

The polarized word in the cluster is acted upon by the valence shifters. Amplifiers scale the polarity by a factor of 1.8 (that is, \(1+z\), where \(z = 0.8\) is the default weight), and they become de-amplifiers if the context cluster contains an odd number of negators (two negatives equal a positive and three negatives equal a negative). De-amplifiers decrease the polarity. Adversative conjunctions (ACs), e.g. but, however and although, also weight the cluster. An AC before a polarized word up-weights the cluster by \(1 + z_{2}\, n_{BAC}\), where \(n_{BAC}\) is the number of ACs before the polarized word and \(z_{2}\) is a weight with default value 0.85. An AC after the polarized word down-weights the cluster by \(1 + (n_{AAC} \times -1)\, z_{2} = 1 - z_{2}\, n_{AAC}\), where \(n_{AAC}\) is the number of ACs after the polarized word. The weight \(z\) may be provided by the user, the default being 0.8.

Lastly, the weighted context clusters \(c_{il}\) are summed and divided by the square root of the word count \(n_{i}\) of the tweet, yielding the polarity score \(\delta_{i}\) for each tweet, i.e. \(\delta_{i} = \sum_{l} c_{il} / \sqrt{n_{i}}\).
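The scoring scheme described above can be approximated with a short Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in the study: the word lists are hypothetical, adversative conjunctions are left out, and the de-amplifier rule (subtracting the weight for each de-amplifier, floored at zero) is an assumed simplification.

```python
import math
import re

# Hypothetical mini-lexicons, for illustration only.
POLARITY = {"like": 1.0, "good": 1.0, "bad": -1.0, "hate": -1.0}
NEGATORS = {"not", "never", "no"}
AMPLIFIERS = {"really", "very"}
DEAMPLIFIERS = {"hardly", "barely"}
PAUSES = {",", ":", ";"}


def tokenize(text):
    """Keep words and pause punctuation as tokens; drop other punctuation."""
    return re.findall(r"[a-z']+|[,:;]", text.lower())


def polarity(text, n_before=4, n_after=2, z=0.8):
    """Sum of weighted context clusters divided by sqrt(word count)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = 0.0
    for k, tok in enumerate(tokens):
        if tok not in POLARITY:
            continue
        # Context window around the polarized word, cut at the nearest pause.
        lo = max(0, k - n_before)
        hi = min(len(tokens), k + n_after + 1)
        for p in range(k - 1, lo - 1, -1):
            if tokens[p] in PAUSES:
                lo = p + 1
                break
        for p in range(k + 1, hi):
            if tokens[p] in PAUSES:
                hi = p
                break
        context = tokens[lo:k] + tokens[k + 1:hi]
        n_neg = sum(t in NEGATORS for t in context)
        n_amp = sum(t in AMPLIFIERS for t in context)
        n_deamp = sum(t in DEAMPLIFIERS for t in context)
        if n_neg % 2 == 1:
            # With an odd number of negators, amplifiers act as de-amplifiers.
            n_deamp += n_amp
            n_amp = 0
        # Assumed simplification: linear up/down weighting, never below zero.
        weight = max(1.0 + z * n_amp - z * n_deamp, 0.0)
        sign = -1.0 if n_neg % 2 == 1 else 1.0
        total += sign * weight * POLARITY[tok]
    # Pause tokens are counted as words, following the description in the text.
    return total / math.sqrt(len(tokens))
```

With these toy lexicons, "I do not like it" receives a negative score, "I really like it" scores higher than "I like it", and "I hardly like it" scores lower, which is the behavior a plain dictionary lookup cannot reproduce.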

For the co-occurrence of words, the study used the udpipe package [21]. Only the 63,056 tweets written in English were considered.
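To make the notion of word co-occurrence concrete, the following Python sketch counts how often two words appear in the same tweet. It does not reproduce the udpipe workflow used in the study; the stop-word list, the function name `cooccurrence` and the sample tweets are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Small hypothetical stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "and", "of", "is", "to", "in", "my", "for"}


def cooccurrence(tweets):
    """Count unordered pairs of distinct words that share a tweet."""
    counts = Counter()
    for text in tweets:
        words = {w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOPWORDS}
        # Sorting makes every pair appear in a single canonical order.
        counts.update(combinations(sorted(words), 2))
    return counts


# Made-up tweets, used only to exercise the counter.
sample = [
    "Mental health is suffering in lockdown",
    "Protect your mental health",
    "Wear a face mask",
]
pairs = cooccurrence(sample)
strongest = pairs.most_common(1)[0]
```

In a network plot such as those in the results, each counted pair would become an edge whose thickness reflects its count.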

3 Results

3.1 Tweet Polarity

Figure 4 shows the spatial location of all positive tweets. The USA, Europe, Western and Southern Africa and India had high numbers of positive tweets.

Fig. 4 Location of positive tweets

Regions with high numbers of positive tweets also posted high numbers of negative tweets. This could indicate that individuals held opposing views on different issues (Figs. 5 and 6).

Fig. 5 Location of negative tweets

Figure 6 shows the locations of positive, negative and neutral tweets.

Fig. 6 Location of positive (blue), negative (red) and neutral (yellow) tweets. (Color figure online)

3.2 Word Co-occurrence

The spatial distribution of negative, positive and neutral tweets is not, on its own, very informative. A look at word co-occurrence may supply more insight into why tweets were negative, positive or neutral.

Figure 7 shows which words co-occurred with negative words. The thicker the path, the more frequent the co-occurrence. The blue dots depict the negative words and the red dots the words they occurred with. The strongest co-occurrence was mental and health. With many individuals’ routine lives altered, there is a risk of mental health problems. Qiu et al. [22], in their nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic, capture the aftermath of the COVID-19 pandemic: “The implementation of unprecedented strict quarantine measures in China has kept a large number of people in isolation and affected many aspects of people’s lives. It has also triggered a wide variety of psychological problems, such as panic disorder, anxiety and depression”. These aspects are supported by Fig. 7.

Fig. 7 Negative

Another strong co-occurrence was small and business; small businesses have been adversely affected by the lockdown. In their survey on the effects of COVID-19 on small businesses, Bartik et al. [23] report that 43 percent of businesses are temporarily closed, and that businesses have, on average, reduced their employee counts by 40 percent relative to January 2020.

Fig. 8 Positive word co-occurrence

Among positive words, the strongest co-occurrence was face and mask, followed by hand and sanitizer (Fig. 8). These are some of the WHO-recommended steps to curb the spread of COVID-19 [24], which suggests that the campaign to curb the spread was effective. Other concerns included food supplies, insecurity and school.

From Fig. 9, most tweets conveyed negative sentiment, understandably so because of the pandemic.

Fig. 9 General feel of the tweets
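A summary such as the one in Fig. 9 can be produced by classifying each tweet's polarity score by its sign and counting the classes. The Python sketch below assumes that a score of exactly zero is neutral; the tolerance parameter `tol` and the scores are made up for illustration and are not results from the study.

```python
from collections import Counter


def classify(score, tol=0.0):
    """Label a polarity score; scores within +/- tol of zero are neutral."""
    if score > tol:
        return "positive"
    if score < -tol:
        return "negative"
    return "neutral"


# Made-up polarity scores standing in for per-tweet results.
scores = [-0.45, 0.9, 0.0, -0.2, 0.1, -0.6]

summary = Counter(classify(s) for s in scores)
shares = {label: count / len(scores) for label, count in summary.items()}
```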

4 Discussions

This paper has demonstrated the wealth of information contained in sentiments expressed on social media, in this case Twitter. The direct effect of a major pandemic such as the coronavirus is death, which can easily be measured. The indirect effects, which range from loss of jobs [25] and mental health issues [22] to the closing down of countries, need other methods of quantification. Sentiment analysis is particularly useful in gauging the uptake of directives and in detecting emerging issues relating to the topic of interest, among them fake news and misinformation that may lead to fear and panic.

Results from sentiment analysis may help governments or relevant authorities decide whether to relax or tighten measures, or to change approach altogether. Mental health was among the key concerns of individuals, and fear and panic were also evident (Figs. 7 and 9). Studies [26, 27] have indicated that domestic violence is on the rise during the coronavirus pandemic, an indication of the mental anguish faced by the masses. Face-mask and hand-sanitizer also had high numbers of co-occurrences, suggesting that the sensitization efforts were working.

As with all studies, this one has some shortfalls and limitations. First, global Twitter data comes in various languages, and methodologies for multilingual sentiment analysis are still in development; our study therefore focused on tweets written in English. Second, the 2019 global multidimensional poverty index report indicates that 1.3 billion people, or 23.1%, are multidimensionally poor (in terms of health, education and standards of living) [28]. Twitter is not a good platform for obtaining insights from this group of people, which is another downside of this study. Third, analysis of Twitter data over a long period is computationally expensive, as a single day’s tweets may amount to a few hundred gigabytes. The study results relied on data from a two-day live stream; further work could be dedicated to live streaming for a longer period or to using historical data covering a longer period. These limitations, however, do not invalidate the results.

The results from this study may help governments combat the consequences of COVID-19, such as mental health issues and lack of supplies (e.g. food), and also gauge the effectiveness or reach of their guidelines. Li et al. [29] motivate the need for a multifaceted approach to combating the COVID-19 pandemic and stress that more global collaboration is needed to combat it effectively. They outline five pillars for achieving this: cross-cultural collaboration and communication; strengthening of data and information sharing systems; adoption of early experiences learned in other countries; evaluation and strengthening of public health systems; and promotion of virtual communities to help improve mental health and well-being.