This book set out to pursue two objectives, one methodological (to describe available methods for extracting information from large social media corpora) and the other content-related: to distil specific information about the perceptions, attitudes, and concerns regarding the COVID-19 pandemic expressed by English speakers around the world on social networks.

In each of the preceding chapters, I have presented and discussed a number of notable computational methods for extracting the concepts, ideas, topics, and opinions of Twitter users from large social media corpora. In the remainder of this section, I summarize the most important conclusions reached in those chapters.

The first key takeaway, amply discussed in Chapter 3, is that large social media corpora are, unsurprisingly, too large to process in full, and sampling strategies are necessary. However, using samples should not be considered a poor substitute for processing the whole corpus: in most cases, adequate sampling produces very similar results while optimizing time and resources. This is especially true for social media corpora on a single major topic, as is the case here.
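As a minimal illustration of the point about sampling, a simple random sample drawn with Python's standard library is often all that is needed; the corpus and sample sizes below are invented for the example:

```python
import random

# Toy corpus: in practice this would be millions of tweets.
corpus = [f"tweet {i}" for i in range(1_000_000)]

random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(corpus, k=10_000)  # simple random sample, no replacement

print(len(sample))       # 10000
print(len(set(sample)))  # also 10000: no tweet is drawn twice
```

For a corpus on a single dominant topic, a sample of this kind tends to preserve the distribution of themes well enough for keyword and topic extraction.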

As for keyword extraction, it is quite apparent that different methods offer advantages and disadvantages, and many factors need to be considered when selecting one. Arguably, the most important criterion is the quality of results, which, as we have seen, is not easy to assess and is inherently subjective: when offered two competing lists of keywords that are meant to summarize the themes and topics a text contains, all we can do is “guess” what the actual topics are, and then assess the accuracy of the keyword lists against our mental image of those topics.

There are some more formal criteria that we can apply. For example, repetition in the form of synonyms or morphological variations is an undesirable feature, as is the inclusion of arbitrary n-grams instead of syntactically and semantically meaningful groupings for multi-word keywords.

Of all the keyword extraction methods surveyed, the best results for social media corpora—in terms of quality—are offered by the traditional Corpus Linguistics method, based on the use of a reference corpus, and graph-based approaches, which seem to complement each other. Unsupervised methods may be adequate in other scenarios, such as extracting keywords from single documents, which is in fact the type of application that they were created for. Finally, novel unsupervised methods based on large language models are promising, but at least in the case of the implementation that we have tested (KeyBERT), they still need considerable improvement and refinement to be successfully applied to large social media corpora.
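The reference-corpus method mentioned above is typically implemented with a keyness statistic; one common choice is Dunning's log-likelihood. A minimal standard-library sketch, with invented toy word lists standing in for the study and reference corpora, might look like this:

```python
from collections import Counter
from math import log

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood keyness score.
    a: word frequency in target corpus, b: frequency in reference corpus,
    c: target corpus size, d: reference corpus size."""
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a:
        ll += a * log(a / e1)
    if b:
        ll += b * log(b / e2)
    return 2 * ll

# Toy corpora standing in for the study and reference corpora.
target = "lockdown lockdown vaccine masks lockdown vaccine".split()
reference = "weather football vaccine weather holidays music".split()

tf, rf = Counter(target), Counter(reference)
c, d = sum(tf.values()), sum(rf.values())

keywords = sorted(tf, key=lambda w: log_likelihood(tf[w], rf[w], c, d),
                  reverse=True)
print(keywords[0])  # 'lockdown': frequent in target, absent from reference
```

Words that are frequent in the study corpus but rare in the reference corpus rise to the top of the list, which is precisely the notion of keyness used in traditional Corpus Linguistics.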

Using basic set theory to compare sets of keywords can provide a clarifying picture of the similarities and differences between them, and is therefore useful for evaluating extraction methods. This approach also allows for the automatic generation of tables and graphs (Venn-type diagrams) that make results easier to understand and share.
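For instance, the output of two extraction methods (the keyword lists below are invented) can be compared with ordinary Python sets; the same sets can then be handed to a plotting library such as matplotlib-venn to draw the Venn-type diagrams mentioned above:

```python
# Hypothetical keyword lists produced by two extraction methods.
method_a = {"vaccine", "lockdown", "masks", "testing", "quarantine"}
method_b = {"vaccine", "lockdown", "schools", "testing", "economy"}

shared = method_a & method_b   # keywords both methods agree on
only_a = method_a - method_b   # keywords unique to method A
jaccard = len(method_a & method_b) / len(method_a | method_b)

print(sorted(shared))     # ['lockdown', 'testing', 'vaccine']
print(round(jaccard, 2))  # 3 shared out of 7 distinct keywords: 0.43
```

The Jaccard index gives a single overlap figure per pair of methods, while the difference sets show exactly which keywords each method contributes on its own.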

The potential of LLMs is much higher for topic extraction, as they can successfully fill a part of the process that has traditionally been left to the end user: labelling the extracted topics. Using the word embeddings contained in LLMs to extract topics also offers superior performance compared to “traditional” topic modelling algorithms (LDA and NMF). Together, these two features result in high-quality topic lists that accurately extract the essential themes contained in a large corpus. Furthermore, software libraries such as BERTopic make it extremely easy to use these advanced features, as they provide an abstraction layer over the complex lower-level processes (tokenization, embeddings creation, dimensionality reduction, clustering, etc.). If we bear in mind that the availability of LLMs is very recent, and that more powerful, sophisticated models will almost certainly become available in the future, it seems that this method of topic modelling will soon become the logical choice for most researchers. The one drawback they present is that using them requires specific hardware capable of parallel processing (GPUs or TPUs), which may not be available to some researchers, although cloud-based alternatives are available.
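The core idea behind embeddings-based topic modelling, clustering words by semantic similarity, can be sketched without any special hardware. The hand-made vectors and the similarity threshold below are invented for illustration; a real pipeline such as BERTopic would use transformer embeddings, dimensionality reduction, and a proper clustering algorithm:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hand-made 3-d "embeddings" standing in for transformer vectors:
# semantically related words get nearby vectors.
embeddings = {
    "vaccine":  (0.9, 0.1, 0.0),
    "booster":  (0.8, 0.2, 0.1),
    "lockdown": (0.1, 0.9, 0.1),
    "curfew":   (0.0, 0.8, 0.2),
}

# Greedy clustering: join the first cluster whose seed word is similar enough.
clusters = []
for word, vec in embeddings.items():
    for cluster in clusters:
        if cosine(vec, embeddings[cluster[0]]) > 0.8:
            cluster.append(word)
            break
    else:
        clusters.append([word])

print(clusters)  # [['vaccine', 'booster'], ['lockdown', 'curfew']]
```

Each resulting cluster corresponds to a "topic"; in the full pipeline an LLM can then be asked to produce a human-readable label for each cluster, the step that used to fall to the researcher.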

In terms of applicability and overall usefulness, topic modelling is probably a more versatile and powerful tool than keyword extraction, since it regards a topic as a set of related keywords, which are clustered using various techniques (co-occurrence in the case of traditional topic modelling, semantic similarity in the case of embeddings-based tools). Thus, an advanced topic modelling tool, such as BERTopic, coupled with LLM-based topic labelling, is capable of automatically distilling the relevant themes of a corpus, ranking them by relevance, and listing the keywords associated with each topic. This is not to say that keyword extraction tools are not useful: they focus on the actual words and score them individually, regardless of the topic they refer to, which is extremely valuable for terminology extraction and other linguistic applications. Extracting keywords is also computationally less demanding, and there are several user-friendly tools available, whereas topic modelling tools, especially advanced ones, require considerable computational resources and technical know-how. Therefore, these techniques are complementary rather than antagonistic, and the choice between them may ultimately depend on the research objectives and the available resources.

The lists of topics extracted from the different subcorpora of the geotagged portion of the CCTC suggest that, while a core set of topics (prevention, testing, treatments, vaccination, safe practices, social impact, etc.) is present in all societies, there are significant differences between them that reflect and reveal their respective cultures.

Sentiment analysis is indispensable for analysing social media corpora, as well as any other emotionally charged text. While keywords and topics provide information as to what is being discussed, sentiment analysis tells us about the emotional perspectives and opinions on those topics, which is essential for understanding the speakers’ feelings and attitudes, a crucial aspect of social communication in general, and computer-mediated communication in particular. Without some type of sentiment analysis, this key communicative dimension would remain concealed beneath the surface of individual words and topics.

The two types of approaches used in this field, machine learning and lexicon-based methods, produce valuable results and complement one another well, facilitating quantitative and qualitative research by extracting insights from large amounts of text. Although lexicon-based tools can classify documents with acceptable accuracy, machine learning classifiers, specifically neural networks based on transformers, currently offer the state of the art and are particularly well suited to analysing Twitter/X data, as specific models exist that have been trained on vast amounts of tweets and offer excellent performance.
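A lexicon-based classifier can be reduced to a few lines. The tiny lexicon, its scores, and the naive negation rule below are invented for illustration; real tools such as VADER ship thousands of curated, empirically weighted entries:

```python
# Invented toy lexicon: positive scores > 0, negative scores < 0.
lexicon = {"love": 2.0, "great": 1.5, "safe": 1.0,
           "sick": -1.5, "afraid": -1.5, "terrible": -2.0}
negators = {"not", "never", "no"}

def sentiment(text):
    """Sum lexicon scores over the tokens, flipping the sign of the
    word that immediately follows a simple negator."""
    score, negate = 0.0, False
    for token in text.lower().split():
        if token in negators:
            negate = True
            continue
        value = lexicon.get(token, 0.0)
        score += -value if negate else value
        negate = False
    return score

print(sentiment("i love this great vaccine"))  # 3.5
print(sentiment("not safe and terrible"))      # -3.0
```

Because every score is traceable to specific words, this family of methods directly supports the qualitative analysis described below: the classifier's verdict comes with the exact expressions that produced it.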

Comparing sentiment classification across countries yields unexpected results: poorer economies that were hit harder by the pandemic tend to show higher sentiment scores, suggesting that those societies, being accustomed to harsher conditions and standards of living, may be better equipped to deal with a sudden deterioration in living conditions.

On the other hand, lexicon-based tools excel at extracting the specific linguistic expressions that determine the semantic orientation of texts, which is critical to understand people’s motivations and the nature of that sentiment. The negative words used to describe the pandemic’s effects also indicate significant differences between nations, revealing the precise causes of human suffering, poverty being more prominent in less developed countries. There are also differences in the use of profanity and taboo words, with such countries using these terms significantly less frequently than more developed nations.

Emojis have become an essential tool for expressing emotions and attitudes in social media text, and as such they deserve special consideration. They often serve to make the intended sentiment explicit and to guide the proper interpretation of the text.

Significant differences were also found in the use of emojis across countries. South African social media users employ this expressive resource much more frequently than users in other countries, and the particular emojis they favour also differ from the rest. India also displayed some notable differences: for example, it was found to be the country where the “praying hands” emoji was used most frequently, suggesting a more spiritual character among its population.
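Counting emojis per country is straightforward once they are identified. The sketch below uses invented per-country samples and a rough detection heuristic (characters outside the Basic Multilingual Plane), which catches most pictographic emoji but misses BMP symbols such as ❤:

```python
from collections import Counter

# Invented per-country tweet samples, purely for illustration.
tweets = {
    "ZA": ["Stay safe 🙏🙏😷", "We got this 💪🔥", "😂😂"],
    "IN": ["🙏 stay home", "Prayers 🙏🙏"],
}

def emoji_counts(texts):
    """Count characters outside the BMP (a rough emoji test)."""
    return Counter(ch for text in texts for ch in text if ord(ch) > 0xFFFF)

za, india = emoji_counts(tweets["ZA"]), emoji_counts(tweets["IN"])
print(sum(za.values()), sum(india.values()))  # 7 3
print(india.most_common(1))                   # [('🙏', 3)]
```

Normalizing such counts by corpus size gives the per-country usage rates on which comparisons like the ones above can be based; a production pipeline would use a proper emoji inventory rather than a code-point heuristic.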

Finally, hashtags can reveal important insights and are easy to identify and count. For example, the frequency of racism-related hashtags clearly correlates with real-world events, as in the case of the killing of George Floyd by the police in the United States. Hashtag frequencies can also be used to track the development of political campaigns, as other authors have shown in previous research. Furthermore, the simple frequency-based statistical study described in the previous chapter provides further evidence that some issues are more pressing than others in different cultures and economies. For example, hashtags related to mental health problems were considerably less frequent in less developed countries, which is probably an indication that the well-known phrase and meme “first world problems” has some justification.
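Because hashtags follow a fixed surface pattern, a frequency study of the kind described needs little more than a regular expression and a counter; the tweets below are invented for the example:

```python
import re
from collections import Counter

tweets = [
    "Protests growing #BlackLivesMatter #JusticeForFloyd",
    "Another day inside #lockdown #MentalHealth",
    "#BlackLivesMatter marches in every city",
]

# Hashtags are easy to identify: '#' followed by word characters.
# Lowercasing merges casing variants of the same tag.
hashtags = Counter(tag.lower() for tweet in tweets
                   for tag in re.findall(r"#\w+", tweet))

print(hashtags.most_common(1))  # [('#blacklivesmatter', 2)]
```

Aggregating such counts per subcorpus (per country, per time window) yields the kind of frequency series that can be matched against real-world events.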

Ultimately, the experiments and studies conducted in this book have generated vast quantities of data, of which only a small subset has been described in some depth. Readers are encouraged to take advantage of all the datasets generated, as they are sure to find insights beyond those explicitly discussed in the text.