Sentiment analysis, also referred to as opinion mining, is a branch of Natural Language Processing that aims to identify either the polarity or the emotions expressed in a text (B. Liu 2012), although the term emotion recognition is sometimes used for this specific task, and usually appears linked to the field of affective computing. The main objective of sentiment analysis is to recognize subjective data, such as judgments, opinions, and feelings towards people, things, and their characteristics (Pang and Lee 2008).

Sentiment analysis has applications in many different industries. It is used for brand monitoring and product analytics in business, and for tracking public opinion and social media analysis in politics. It also has a big impact on customer service, where it aids in comprehending client feedback and enhancing offerings (Cambria et al. 2017). The range of applications is as varied as the range of texts that sentiment analysis can be applied to: from movie and book reviews, e.g. Kennedy and Inkpen (2006), Carretero and Taboada (2014), to hotel reviews, e.g. Moreno-Ortiz et al. (2011), online news, e.g. Soo-Guan Khoo et al. (2012), and political debate on social media, e.g. Wang et al. (2012).

The methods used in sentiment analysis are also varied, ranging from lexicon-based to machine learning and hybrid techniques. Many different machine learning techniques have been applied, including Support Vector Machines (SVM), Naïve Bayes, and deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) (Medhat et al. 2014). Lexicon-based approaches, on the other hand, rely on a sentiment lexicon, i.e. a list of lexical features that are labelled as either positive or negative according to their semantic orientation (Taboada et al. 2011).

6.1 Sentiment Analysis Methods

The field has advanced over time to take on more challenging tasks like aspect-based sentiment analysis and emotion detection (Zhang et al. 2018). These authors also define sentiment analysis as the task whose goal is to identify “people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes”. Thus, sentiment analysis is often reduced to a text classification task, which is in fact one of the most basic NLP tasks, whereby a document is classified as belonging to one of two or more classes. This is accomplished by using a classifier, i.e. a predictive model that reads the input text and outputs a certain class, sometimes with a confidence score (i.e. how confident the classifier is that the document belongs in that class).

The classification techniques, like other processes that attempt to emulate intelligent behaviour, can be implemented in many ways. The traditional approach is a series of if–then statements (or production rules), which together form a rule-based system. Rule-based systems have been employed since the beginnings of computing, as they form the basis of most programming languages, and they are sometimes referred to as “the simplest form of artificial intelligence” (Grosan and Abraham 2011, 149). A rule-based system contains a set of production rules, a set of facts, and an interpreter that controls the application of the rules given the facts. These systems therefore require expert knowledge of the domain at hand, as well as engineering skills to encode this knowledge as a set of facts and rules. In the case of sentiment analysis, this type of system corresponds to lexicon-based approaches, where the facts specify which words and expressions are positive and negative, and the rules define how the proportion of positive versus negative words is to be measured to produce a global sentiment score for the document. Context can also be accounted for by a set of such rules (e.g. “if a negative particle precedes a sentiment adjective, then its polarity is inverted”). Lexicon-based sentiment analysis systems are, for the most part, rule-based systems, where the required static facts (e.g. the sentiment lexicon) and procedural knowledge (e.g. context rules) have been obtained from certain knowledge sources—corpora, dictionaries—and encoded by a knowledge engineer.
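A minimal sketch of such a system might look as follows (the toy lexicon and the proportion rule are illustrative assumptions, not the contents of any published resource):

```python
# Toy rule-based sentiment classifier: the lexicon is the set of facts,
# and the proportion rule plays the role of the interpreter.
LEXICON = {"good": 1, "happy": 1, "great": 1,
           "bad": -1, "sad": -1, "terrible": -1}

def classify(text: str) -> str:
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if LEXICON.get(t, 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t, 0) < 0)
    # Rule: the class is decided by the proportion of positive
    # versus negative words found in the document
    if pos > neg:
        return "positive"
    return "negative" if neg > pos else "neutral"

print(classify("The food was great but the service was terrible and sad"))
# -> "negative" (one positive fact vs. two negative facts)
```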

In contrast to these deterministic systems, machine learning simulates intelligent behaviour using probabilistic (or stochastic) techniques. In lieu of relying on expertly encoded and distilled knowledge, the learning algorithms can acquire this knowledge from vast quantities of data, in this case text. Corpus-based (i.e. machine learning) approaches are prevalent in both industry and research, as they have demonstrated superior classification performance.

The current state of the art in sentiment classification consists of machine learning approaches in the form of neural networks that employ transformers, i.e. deep learning models built on attention mechanisms that handle long-range dependencies with ease and can be applied to a range of text-related tasks (bi-directional language modelling, word and sentence prediction, sequence-to-sequence tasks). Language models based on the transformer architecture include two of the most successful ones to date: Google’s BERT (Devlin et al. 2019) and OpenAI’s GPT (Brown et al. 2020), which have been shown to improve on previous top benchmark scores across numerous NLP tasks, in both natural language understanding and generation, including sentiment analysis (Wolf et al. 2020).

6.1.1 Deterministic Methods

Lexicon-based methods of sentiment analysis can be referred to as deterministic because they employ deterministic data, i.e. a set of words that comprise a lexicon in which sentiment information about those words is stored and, in some cases, a set of rules that can contextualize the semantic orientation of those words in actual usage. Examples of sentiment dictionaries include The Harvard General Inquirer (Stone and Hunt 1963), Bing Liu’s Opinion Lexicon (Hu and Liu 2004), MPQA (Wilson et al. 2005), SentiWordNet (Baccianella et al. 2010), SO-CAL (Taboada et al. 2011), EmoLex (Mohammad and Turney 2010), VADER (Hutto and Gilbert 2014), Lingmotif-Lex (Moreno-Ortiz and Pérez-Hernández 2018), and SenticNet (Cambria et al. 2020). These resources generally consist of word lists with varying degrees of sentiment information, from simple polarity to emotion classification.

However, the context in which individual words and phrases appear can alter their semantics (including polarity), sometimes to the point where they mean the exact opposite of what they initially denote; this is especially true of sentiment words. A negative adverb, such as “not” or “never”, can invert the polarity of the adjective “happy”, for instance. It is therefore difficult for a lexicon-based sentiment analysis system to account for all such context-shifting words. For example, we can implement a rule that inverts the sentiment of “happy” when it is preceded by “never” within a span of three words. This rule would correctly classify as negative expressions such as “I was never truly happy there”, but would incorrectly classify cases such as “I've never been so happy before”. In the field of sentiment analysis, numerous systems handling such contextual valence shifters have been developed, e.g. Kennedy and Inkpen (2006), Moreno-Ortiz and Pérez-Hernández (2018), Polanyi and Zaenen (2006), Taboada et al. (2011). Nonetheless, the difficulty posed by sentence-level context handling pales in comparison to that of higher-order linguistic levels of analysis; discourse-related phenomena, such as the metaphorical use of words, irony, sarcasm, understatements, or humblebragging—all of which are pervasive in social media—are a serious problem for which there are no immediate solutions.
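The three-word negation rule just described can be sketched in a few lines, and running it on the two examples above shows both its success and its failure case (a toy implementation, for illustration only):

```python
NEGATORS = {"not", "never", "no"}

def polarity_of_happy(text: str) -> str:
    # Rule: invert the polarity of "happy" when a negator occurs
    # within the three preceding words
    tokens = text.lower().replace("'", " ").split()
    for i, tok in enumerate(tokens):
        if tok == "happy":
            window = tokens[max(0, i - 3):i]
            return "negative" if NEGATORS & set(window) else "positive"
    return "neutral"

print(polarity_of_happy("I was never truly happy there"))    # negative (correct)
print(polarity_of_happy("I've never been so happy before"))  # negative (wrong)
```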

These knowledge sources are also deterministic because they have been compiled and curated by humans and are therefore known, or at least assumed, to be true; consequently, the performance of these systems is entirely dependent on the data upon which they are based. Their underlying model is also deterministic: if a text contains more positive words than negative words, it is predicted to be positive. When analysis errors occur, they are attributed to faulty or insufficient data: a particular sentiment word is missing, a valence shifter was incorrectly applied, pragmatic features were not taken into account, or additional world or common-sense knowledge is required. The underlying assumption is that it is possible to collect all of the facts and rules required for optimal model performance. This applies not only to lexicon-based sentiment analysis systems, but also to all formal grammars and computational implementations of linguistic theories. However, it has been repeatedly demonstrated that the facts and rules of language are far too elusive and organic to be constrained by this deterministic straitjacket. Otherwise, after seven decades of implementations of linguistic theories, at least one would have emerged as a viable framework for developing real-world language applications, which, arguably, has not occurred.

6.1.2 Probabilistic Methods

Since the 1960s, machine learning (ML) algorithms have been used in a variety of research fields. However, it is only in the last two decades that we have witnessed their widespread use in real-world applications. In conventional programming, we tell the computer exactly what steps to take in order to solve a problem, which works well for many situations, such as solving an equation; however, other tasks do not lend themselves to this approach: how can we break down the process of identifying a specific object in a picture, or of understanding a text, into minute, step-by-step detail? The analysis process I described in the previous section, which is utilized by lexicon-based SA tools, is merely an extreme procedural simplification of much more complex cognitive processes that our brains are able to handle effortlessly.

The goal of machine learning is to teach computers to solve these complex problems by providing them with examples of the problem and allowing them to figure out how to solve it on their own. Despite the fact that “classical” ML algorithms (Naïve Bayes, decision trees, Support Vector Machines, etc.) have been (and continue to be) successfully used to solve practical NLP problems, including sentiment analysis, deep learning and neural networks have revolutionized the field. As mentioned in the previous section, the current state-of-the-art performance in all language-related tasks is offered by the transformer architecture (Vaswani et al. 2017), and therefore it has rapidly become the dominant architecture for NLP (Wolf et al. 2020). It is based on the concept and practice of “pretraining”, i.e. creating a language model from a very large corpus in an unsupervised manner that can then be repurposed for different specific applications by “tuning” it on smaller, labelled (i.e. annotated) corpora.
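The pretrain-then-tune workflow can be sketched with the Transformers library; the checkpoint name and the three-example “corpus” below are placeholder assumptions, the point being only the shape of the process:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Step 1: load a model pretrained (unsupervised) on a very large corpus
name = "distilbert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

# Step 2: "tune" it on a small labelled (annotated) corpus; toy data here
texts = ["great product, love it", "awful service", "arrived on tuesday"]
labels = [2, 0, 1]  # 0 = negative, 1 = neutral, 2 = positive
enc = tokenizer(texts, truncation=True, padding=True)
train_set = [{"input_ids": enc["input_ids"][i],
              "attention_mask": enc["attention_mask"][i],
              "labels": labels[i]} for i in range(len(texts))]

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="out", num_train_epochs=1),
                  train_dataset=train_set)
trainer.train()  # the tuned model can now be used for sentiment classification
```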

Probabilistic methods based on the transformer architecture have been repeatedly shown in the literature to outperform all other approaches to sentiment classification, which obviously includes lexicon-based systems. However, lexicon-based systems do provide a very useful capability that pure classifiers do not possess: the ability to point out the words and expressions that justify their classification results. Conversely, ML systems, especially neural networks, exhibit the well-known “explainability” issue. Indeed, these algorithms excel at discovering correlations in massive datasets, but offer little to nothing in the way of causation. Ultimately, the researcher is left to come up with likely interpretations of the results. Important steps are being taken towards explainable AI (Barredo Arrieta et al. 2020), but current technology simply cannot offer “explanations” of its own predictions; these models act as black boxes that take an input and produce an output based on their probabilistic model.

6.2 Experiment: Sentiment Analysis of the CCTC by Country

This experiment is intended to showcase the capabilities of both state-of-the-art, transformer-based sentiment classification systems and an advanced lexicon-based sentiment analysis system. Thus, it consists of two parts; in the first one I use a script that employs the HuggingFace Transformers library (Wolf et al. 2020) together with TweetNLP (Camacho-Collados et al. 2022), a state-of-the-art model for Twitter sentiment classification trained on 124 million tweets and based on RoBERTa (Y. Liu et al. 2019).

In the second part I use Lingmotif (Moreno-Ortiz 2017, 2023), an advanced lexicon-based sentiment analysis system, to analyse the same corpus and obtain frequency lists of sentiment-related lexical items that can help us understand not just the overall semantic orientation of the corpus, but also the nature and type of that sentiment by exposing the actual words and phrases that materialize it.

For this study, I will be using the top six subcorpora by volume in the geotagged section of the CCTC. Table 6.1 describes the subcorpora quantitatively. As in the experiments in the previous chapters, I use a proportional part of each subcorpus where possible (United States, United Kingdom, and India). For the other three countries, the full subcorpus was used.

Table 6.1 Corpus used in the experiments in this chapter

6.2.1 Tweet Classification and Sentiment Over Time

The HuggingFace library makes classification very simple, as it takes care of every stage of the process by means of an integrated pipeline, thus hiding the complexity that working with transformer-based models entails. Every file in the corpus, where each line is a tweet, is read line by line, and the full list of documents is passed to the “sentiment-analysis” pipeline together with the tokenizer and language model. The pipeline returns a list of results where each document is classified as belonging to one of three classes—“positive”, “neutral”, or “negative”—with a confidence score in the range 0–1. Table 6.2 shows the global results of the analysis.
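In outline, the classification step looks like this (the checkpoint name is the publicly released TweetNLP sentiment model on the HuggingFace Hub, and the file name is a placeholder; both are assumptions about the actual script):

```python
from transformers import pipeline

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"  # TweetNLP model
classifier = pipeline("sentiment-analysis", model=MODEL, tokenizer=MODEL)

# One tweet per line, as in the corpus files
with open("us_week_01.txt", encoding="utf-8") as f:
    tweets = [line.strip() for line in f if line.strip()]

results = classifier(tweets, truncation=True)
for tweet, res in zip(tweets[:3], results[:3]):
    print(res["label"], round(res["score"], 3), tweet[:60])
```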

Table 6.2 Sentiment classification with transformer-based classifier

The most obvious fact the data tell us is that the general sentiment is negative, as the proportion of negative tweets is the largest across all countries. However, there are important differences among them: the United States dataset has the most negative results, with over 53% of the tweets being negative, significantly higher than the average (47.02% including the USA, 45.81% excluding it). This is surprising considering that it is the country with the highest GDP per capita in the group, and is perhaps a reflection of the Trump administration’s poor management of the pandemic. Conversely, India, which has the lowest GDP per capita, has the lowest percentage of negative tweets.

These global results, however, are aggregated (averaged) data, as the actual classification task was performed on weekly samples. This organization allows us to look at the evolution of sentiment over the two years that the samples span. Figure 6.1 is a visualization of the sentiment timeline corresponding to the United States using the raw data returned by the classifier.

Fig. 6.1 Sentiment timeline (U.S.)

The timeline reflects some of the most relevant events of the pandemic. After the initial alarm caused by the cases reported in China, positive sentiment rises during the early spring of 2020, and negativity then grows as lockdowns are ordered in some states. Similarly, the significant surge in negative sentiment during the summer of 2021 correlates with the beginning of a third wave of infections caused by the Delta variant of the virus.

In order to more easily compare the sentiment timeline of different countries, we can merge these polarity proportions into a single sentiment score using the following equation:

$$\text{SentScore} = \frac{\text{neg}\% \times (-1) + \text{neu}\% \times 0 + \text{pos}\% \times 1}{100}$$

This will give us a score in the range −1 to 1, which can easily be converted to a more readable 0–100 range. Figures 6.2 and 6.3 use this unified sentiment score to visually compare sentiment evolution in the six countries. Three countries are shown in each graph to facilitate the interpretation of the data.
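In code, both steps are one-liners (the weekly percentages below are made-up inputs):

```python
def sent_score(neg_pct: float, neu_pct: float, pos_pct: float) -> float:
    # Weighted average of the class proportions: -1, 0 and +1
    return (neg_pct * -1 + neu_pct * 0 + pos_pct * 1) / 100

def to_0_100(score: float) -> float:
    # Map the [-1, 1] score onto the more readable 0-100 scale
    return (score + 1) / 2 * 100

week = {"neg": 53.2, "neu": 28.9, "pos": 17.9}  # made-up weekly percentages
s = sent_score(week["neg"], week["neu"], week["pos"])
print(round(s, 3), round(to_0_100(s), 2))  # -0.353 32.35
```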

Fig. 6.2 Sentiment evolution in the U.S., U.K., and India

Fig. 6.3 Sentiment evolution in Canada, Australia, and South Africa

These data visualizations make it apparent that some countries follow a more similar evolution of sentiment than others. Just by looking at the graphs, India’s sentiment evolution seems to be the one that deviates the most from the rest of the countries. However, in order to properly quantify the correlation between the different time series, we can compute the Pearson correlation coefficient for each country pair. Table 6.3 shows the list of correlations between country pairs in descending order.
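Such a table of pairwise correlations can be produced directly from the weekly score series (a sketch with placeholder values for three countries):

```python
from itertools import combinations
import pandas as pd

# Weekly unified sentiment scores per country (placeholder values;
# the actual series span roughly two years of weeks)
scores = pd.DataFrame({
    "US":    [31.2, 30.5, 28.9, 33.1, 29.7],
    "UK":    [34.0, 33.2, 31.8, 35.6, 32.4],
    "India": [45.1, 47.3, 44.0, 46.2, 48.0],
})

# Pearson is pandas' default correlation method
pairs = [(a, b, scores[a].corr(scores[b]))
         for a, b in combinations(scores.columns, 2)]
for a, b, r in sorted(pairs, key=lambda t: -t[2]):
    print(f"{a}-{b}: r = {r:.3f}")
```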

Table 6.3 Sentiment timeline correlations between country pairs

This list of correlations tells us that countries that share more in terms of geographical proximity, culture, or economy tend to correlate more strongly. We can now say with certainty that India displays the most deviation from the rest, followed by South Africa.

India’s divergent sentiment evolution may be due to many factors, but it probably has to do with its different vaccination process and the timing of its two major waves of COVID-19 cases, which differed from most other countries. India started its vaccination programme in January 2021 and initially managed to control the number of new cases; however, a major second wave started in April 2021, which made new cases spike from 9,000 per day to over 400,000, with 3,500 deaths per day, by the end of April. The reason for this massive increase in daily cases was the incipient Delta variant, which emerged in India during this time and would later spread to the rest of the world. This clearly correlates with the sentiment timeline during this period, when negative sentiment visibly increases.

Obviously, looking at the changes in sentiment as represented by the peaks and troughs in the graphs and correlating them with real-world events is not an easy task, as there are multiple factors that may cause those changes. However, with sentiment classification of tweets, that is all we have. We can only browse through the—very large—set of classified tweets and attempt to see what causes the sentiment. Examples (42) through (47) are tweets from this period. In them, people complain about the bleak situation and the poor management of the pandemic by the government. Examples (43) and (46) are interesting because they illustrate the trouble that state-of-the-art sentiment classifiers run into when faced with sarcasm, as both are clearly negative but are classified as positive and neutral, respectively.

(42) ‘A person cannot live peacefully in Delhi, a person cannot even die peacefully in Delhi’. India overwhelmed by world's worst Covid crisis—BBC News. [negative, 0.896]

(43) Half of the world's total covid cases are now from India!! What an achievement.. #IndiaFightsCOVID19. [positive, 0.811]

(44) What to do brother, our government is not listening to us right now. There no use of these types of requests [negative, 0.926]

(45) #Karnatakagovernment Please consider the necessary requirements/decision towards raising COVID-19 death's before it gets out of control. We can afford the raising cases not the raising death's. [negative, 0.632]

(46) When coronavirus cases went down, Govt declared victory, PM took all credit as always; Now they're blaming states: Rahul Gandhi [neutral, 0.614]

(47) Mismanagement and lack of planning in production and distribution has killed more than the #virus. [negative, 0.898]

6.2.2 The Sentiment Lexicon of the Pandemic on Twitter

Sentiment classification of tweets is obviously useful, but it falls short of telling us about the nature of the sentiment. All we have is the classification data, either as individual or aggregated results by time span, and the classified tweets themselves, which is too much data to make sense of manually. For instance, examples (42) to (47) above were selected from the set of tweets in the week April 26 to May 2, 2021, but that week alone contains 27,902 tweets, so it is very hard to draw any conclusions regarding the content, and the examples are nothing more than anecdotal evidence.

Lexicon-based sentiment analysis systems can be very useful when it comes to obtaining more clues as to the nature of the sentiment, as they can provide frequency lists of the words and expressions that motivate the sentiment. For example, Table 6.4 shows the list of the most frequent negative words during this time period in India.
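Lingmotif’s internals are not public, but the kind of frequency list shown in Table 6.4 can be approximated with a simple counter over a sentiment lexicon (the toy lexicon and file name below are assumptions):

```python
from collections import Counter
import re

NEGATIVE_LEXICON = {"pandemic", "death", "crisis", "virus", "fail"}  # toy subset

def negative_word_frequencies(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for tweet in f:  # one tweet per line
            for token in re.findall(r"[a-z']+", tweet.lower()):
                if token in NEGATIVE_LEXICON:
                    counts[token] += 1
    return counts

for word, n in negative_word_frequencies("india_2021-04-26.txt").most_common(50):
    print(word, n)
```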

Table 6.4 Top 50 negative words for the week 26-04-2021 in India

From this set of negative words and expressions, we can see that many refer to the disease itself (‘pandemic’, ‘epidemic’, ‘virus’, ‘disease’, ‘infect’, ‘test positive’, ‘fever’, ‘risk’), others to the deaths caused by the disease (‘death’, ‘dying’, ‘dead’, ‘rest in peace’, ‘rip’, ‘deadly’, ‘kill’, ‘condolence’, ‘loss’), others to the social and economic difficulties (‘crisis’, ‘emergency’, ‘shortage’, ‘poor’, ‘needy’, ‘lack’, ‘struggle’), and finally some refer to the management of the pandemic by the government (‘fail’, ‘blame’, ‘shame’, ‘impose’, ‘failure’, ‘wrong’, ‘fake’, ‘unable’, ‘quarantine’, ‘complete lockdown’, ‘lack’).

These words provide a more complete picture of the reasons that motivate the negativity at this particular point in time. Looking at positive and negative words can also help us identify what causes the unexpected positive peak in India during the week of October 18, 2021, which, with an all-time high sentiment score of 58.87—up from 45.53 the previous week and down to 40.49 the next—is also an anomaly compared to the rest of the countries. It is also interesting to contrast these results with the topics that we saw in the previous chapter. Figure 6.4 shows the topics over time for India, where a surge of the vaccines topic is quite apparent.

Fig. 6.4 Topics over time for India showing the vaccination topic

Finally, looking at the tweets in this week, there is a very large number of tweets celebrating the progress of the vaccination process. Examples (48) to (52) illustrate this.

(48) World Bank Prez Congratulates India on Successful Covid-19 Vaccination Campaign. NaMo App. [positive, 0.922]

(49) PM congratulates people of Devbhoomi for 100% first dose of Covid vaccination. [positive, 0.871]

(50) 98 crores done. India is quickly making its way to #COVID19 vaccine century! Just two more steps to go . ji. [positive, 0.914]

(51) 2nd Dose Done . Fully vaccinated #corona #vaccinationdone #vaccine #sainisurinder Anandpur Sahib. [positive, 0.823]

(52) #India crosses 98 crore vaccine doses. And the roses are increasing fast. Seems 20 Oct is going to be the day when India will cross #100crore doses. Salute to all health care workers, Salute to spirit of India. #TogetherWeWin #COVID19 [positive, 0.946]

Comparing the negative words across different countries can also shed light on the particular circumstances and contexts. Table 6.5 contains the top 25 negative words for each of the countries in this study.

Table 6.5 Top 25 negative lexical items in the 6 top countries by volume

The top few words are mostly the same across all countries (‘pandemic’, ‘lockdown’, ‘virus’, ‘death’). Upon investigation, the third position of the word ‘stigma’ in Canada’s list turns out to be due to a specific and very active Twitter account, “Fighting Stigma”, which preceded its many tweets with these two words. It is interesting, however, that the word ‘lockdown’ is in either first or second position in all countries except the United States, where it ranks fifth; this is most probably due to the fact that lockdown measures there were fewer and more relaxed than in other countries and therefore had less impact on the population. The lemma ‘lockdown’ has 12,111 occurrences in the US subcorpus, whereas the U.K. subcorpus (which has a similar number of words) contains 146,388 occurrences.

The lists also offer insights into the particular problems that the countries had to face. For example, South Africa’s list contains the words ‘HIV’, ‘arrest’, and ‘corruption’, which refer to issues that are not present in other countries. The lemma ‘poor’ is also present, as it is in India’s list, the only one to contain the word ‘struggle’. These differences do suggest a more difficult economic situation for the people of these countries, which was made worse by the hardships brought about by the pandemic.

On the other hand, all word lists except India’s contain insults and profanity (U.S.: ‘fuck’, ‘shit’, ‘stupid’, ‘idiot’; U.K.: ‘fuck’, ‘shit’, ‘idiot’; Canada: ‘fuck’, ‘shit’; Australia: ‘fuck’, ‘shit’, ‘idiot’; South Africa: ‘shit’), which is telling of the cultural differences. The phrase ‘please help’ is also only found in the India list; in fact, the lemma ‘help’ is far more frequent in the India subcorpus.

Finally, the United States list is the only one that contains the word ‘hate’ (in 17th position), which is probably a reflection of the political atmosphere at the time, as examples (53) to (58) illustrate.

(53) On coronavirus, Trump needs the ones he hates: Experts and journalists—The Washington Post

(54) I fucking hate it here

(55) I hate the healthcare process in this country

(56) CNN loves China and hates America

(57) Pence has his beliefs that many disagree w/ and hate him for it. We need to come together as patriots against those who openly or secretly hate us. #Corona #Coronavirus #MikePence

(58) Republicans hate government until an enormous problem made by the private sector (2008 crash) or not solvable by the private sector (Coronavirus) emerges.

As for the positive words, they are very similar across all countries, although of course the frequencies change. Table 6.6 shows the top 25 positive sentiment words and expressions for each of the subcorpora. Lingmotif treats emojis as regular lexical items, which is why they are listed and ranked along with the rest of the words.

Table 6.6 Top 25 positive lexical items in the 6 top countries by volume

As with the negative words, most of the words in this list are positive in general, but some are specific to the pandemic subject domain, such as ‘protect’, ‘recovery’, ‘immunity’, ‘volunteer’ or ‘save lives’. There are not many differences between the countries. The primary themes that the lexical items refer to are good wishes, positive advice, and congratulations (on fighting the pandemic). The only country that shows a different theme is, again, India, with the word ‘donate’, in consonance with the recurrent “request-for-help” topic identified before.

We can also track the frequency of positive and negative words and phrases over time. To do this, we need to calculate the frequencies of all positive and negative lexical items over the whole period for each country, producing a ranked list of the most frequent sentiment items, which we can then track over time by looking at their frequency in each time period (weeks in this case). To account for the different sizes of the subcorpora, relative frequencies were calculated per 1,000 words for each of the lexical items.
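The normalization step is straightforward (a sketch; the weekly counts below are made-up):

```python
def rel_freq_per_1000(count: int, total_words: int) -> float:
    # Normalize raw counts so that subcorpora (and weeks) of
    # different sizes are directly comparable
    return count / total_words * 1000

# e.g. weekly counts of the item "help" in one country's subcorpus
weekly = [("2021-04-19", 812, 390_000), ("2021-04-26", 1905, 410_000)]
timeline = [(week, round(rel_freq_per_1000(n, total), 2))
            for week, n, total in weekly]
print(timeline)  # [('2021-04-19', 2.08), ('2021-04-26', 4.65)]
```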

There are two ways in which we can track sentiment words over time: we can either look at the evolution of the top n words for one specific country, or track one specific word across several countries. The latter offers more interesting results, as focusing on specific words and comparing their frequency among different countries can provide useful insights. For example, Fig. 6.5 displays the frequency of the word ‘help’ over time, where India is clearly the most prominent and the peaks correspond to the particularly hard periods mentioned before.

Fig. 6.5 Sentiment over time—‘help’

Plotting the frequency of specific sentiment words can provide evidence of real-world events. Figure 6.6 plots the timeline of the word ‘protest’ for all countries in the corpus. It evidences the periods when protests became an issue: June 2020 witnessed demonstrations in most countries after months of lockdowns and stay-at-home orders.

Fig. 6.6 Sentiment over time—‘protest’

Australia is the country that shows the most spikes for this word, surpassing all other countries in June 2020, but also showing many other peaks due to different events. For example, several anti-lockdown protests were organized in the country during September 2020, and again from July through November 2021. India, once more, is the country that deviates the most from the rest, with a rather flat line except for two obvious spikes in December 2020 and December 2021. These demonstrations, however, were most probably related not to the pandemic but to the farmers’ protests against the laws passed by the Indian government in September 2020.

6.3 The Role of Emojis in the Expression of Sentiment

There is little doubt that emojis contribute substantially to the emotional content of social media messages. They rank high in the lists of negative (Table 6.5) and positive (Table 6.6) terms for every country, and they are present in a large number of the examples provided in this chapter, which demonstrates the significant role they play. Emojis facilitate the communication of subtle emotional cues by condensing ideas and emotions into a single icon or pictograph (Bai et al. 2019) and have a ubiquitous presence in social media all around the world (Ljubešić and Fišer 2016).

Emojis, like their less sophisticated text-based counterparts, emoticons, are said to fill the role of human facial expressions and gestures, absent in text-based communication; indeed, emojis that display facial expressions have been shown to elicit a similar neural time-course to actual human faces, although with a lower attentional orientation response (Gantiva et al. 2020). In fact, research has shown that reaction times are slower when humans are confronted with messages in which the text and the accompanying emoticon or emoji express conflicting valences, and such messages also tend to be interpreted as negative more often (Aldunate et al. 2018).

From a cultural perspective, however, the use of emojis is not homogeneous across nations and languages. Kejriwal et al. (2021), in a large-scale study of tens of millions of tweets covering 30 countries and as many languages, concluded that emoji usage is not only strongly dependent on cultural and geographical variables, but that its diversity is also much more constrained in some languages and countries than in others.

These conclusions are clearly supported by the data in this study. From a purely quantitative perspective, we can see in Table 6.5 that four negative emojis are included in the top 25 negative items for South Africa, whereas only one is present for each of the other countries, which suggests that South African Twitter users tend to use a higher proportion of emojis in their tweets. However, an actual count of emojis in each of the corpora is necessary to confirm this hypothesis. The script used for this task detects emojis with the emoji Python library and produces frequency counts per 1,000 words to account for the differences in corpus size across countries.
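The core of such a script can be sketched as follows (the file-handling details are assumptions; emoji_count is the emoji library’s built-in counter):

```python
import emoji

def emojis_per_1000_words(path: str) -> float:
    n_emojis = n_words = 0
    with open(path, encoding="utf-8") as f:
        for tweet in f:  # one tweet per line
            n_emojis += emoji.emoji_count(tweet)  # detected emojis
            n_words += len(tweet.split())         # crude word count
    return n_emojis / n_words * 1000

print(round(emojis_per_1000_words("south_africa.txt"), 2))
```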

Figure 6.7 shows a visualization of the results, which clearly show that the use of emojis is much more frequent in the South African subcorpus, where 35.84 are found per 1,000 words, versus 18.33 on average (including South Africa; 14.83 excluding it).

Fig. 6.7 Frequency of emojis (per 1,000 words) across countries: U.S. 14.52, U.K. 18.65, India 14.11, Canada 11.53, Australia 15.35, South Africa 35.84

This huge difference is also apparent when we plot the presence of specific emojis over time to make comparisons across countries. Figure 6.8 visualizes the relative frequency of the loudly crying emoji (😭), which can be used to unequivocally measure the level of sadness, unlike other emojis whose interpretation may be more ambiguous and more dependent on cultural factors (Godard and Holtzman 2022). The overall frequency of this emoji in the South African subcorpus is so much higher than in the rest of the subcorpora that it dwarfs all the other timelines.

Fig. 6.8 Sentiment over time—‘😭’ (loudly crying emoji)

Specifically, the frequency of the loudly crying emoji in the South African subcorpus was 12.28 times greater than the average of the other countries. The United States had the closest frequency to South Africa’s, albeit still 5.80 times lower. The most substantial discrepancy was observed in the Indian subcorpus, where the emoji’s frequency was a striking 19.25 times lower than that of South Africa.

In order to offer a more complete overview of the use of emojis, Table 6.7 shows ranked lists of emojis, including their frequency (per 1,000 words) in each of the subcorpora.

Table 6.7 Top 25 emojis per country ranked by frequency (per 1,000 words)

The data in this table show some interesting differences among countries. For example, the virus emoji (🦠) is present in all of the subcorpora’s top-25 lists except, precisely, South Africa’s; in fact, it ranks very low in the South African list of emojis (58th position). The same is true of the syringe emoji (💉), which ranks in 48th position. To easily check all differences in emoji use across countries, Table 6.8 summarizes them.
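A summary of this kind can be derived mechanically with set operations over the per-country top-25 lists (the truncated lists below are placeholders):

```python
# Top-25 emoji sets per country (placeholders, heavily truncated)
top25 = {
    "US":           {"😂", "🙏", "😷", "🦠"},
    "India":        {"😂", "🙏", "🦠", "💉"},
    "South Africa": {"😂", "🙏", "😭"},
}

# An emoji is "idiosyncratic" if no other country's list contains it
for country, emojis in top25.items():
    others = set().union(*(v for k, v in top25.items() if k != country))
    print(country, "->", emojis - others or "no idiosyncratic emojis")
```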

Table 6.8 Idiosyncratic uses of emojis by country

We can now see clearly that South Africa (8 differences with the rest) and India (5 differences) are the two countries that deviate the most, both from the rest and from each other, whereas the United States (3 differences), the United Kingdom (2 differences), and Canada (0 differences) share the most emojis with the other countries, especially among themselves.

The order in which the emojis rank in each country is also very telling. For example, the praying hands emoji is present in all lists, but it ranks at the very top in the case of India, a more spiritual country than the rest. Again, we find evidence that social media language, in terms of both linguistic and paralinguistic elements, is a good reflection of the cultural, economic, and social differences that exist between societies when it comes to expressing their emotions in written text.