
A Twitter corpus differs considerably in format from other text corpora. A regular corpus is usually distributed either as a collection of plain text files or, if metadata is included, as a set of XML files. Corpus creators decide what metadata (if any) is added to the actual data. In a corpus of literary texts, for example, this may include information such as author, publication date, genre, edition, etc. These data categories are additions that describe the text, which is the actual data to be explored, and their primary function is to organize, catalogue, and serve as search criteria. They usually need to be added manually, although sometimes data categories may be inferred or extracted from the original texts (chapters, page numbers, etc.); regular expressions (i.e. advanced text pattern matching) are very helpful for automatically detecting such elements in the text and encoding them as usable metadata. In terms of size, the balance between data (the actual text) and metadata (data about the text) falls heavily on the former: for each document in a corpus, the bulk of the data is the document itself, and the metadata usually account for only a very small proportion.
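
As a brief illustration of the latter point, the short sketch below uses a regular expression to pull chapter headings out of a plain text file and record them as metadata; the file name and heading pattern are invented for the example.

    import re

    # Hypothetical pattern: lines such as "CHAPTER 12" mark chapter boundaries
    CHAPTER_RE = re.compile(r"^CHAPTER\s+(\d+)\s*$", re.MULTILINE)

    with open("novel.txt", encoding="utf-8") as fh:
        raw = fh.read()

    # Record each chapter number and its character offset as metadata,
    # then strip the headings from the text itself
    chapters = [(m.group(1), m.start()) for m in CHAPTER_RE.finditer(raw)]
    clean_text = CHAPTER_RE.sub("", raw)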

With Twitter data, the situation is reversed in both respects. Although not apparent to users, each tweet consists of a short text (280 characters maximum) and a slew of metadata fields that provide additional information about that tweet (user, date, number of retweets, etc.); in fact, there is far more content in a tweet’s metadata than in the tweet’s text. This is why the term dataset is often used to refer to Twitter corpora: a dataset is any structured collection of data of any kind (numerical, textual, multimodal), whereas a text corpus is any collection of texts which minimally contains plain text and may or may not contain further metadata. Thus, in the context of this book, both terms (corpus and dataset) are generally treated as synonyms, as we are dealing with a Twitter dataset/corpus, which, by definition, contains structured text and metadata.

Conciseness is probably the most distinctive feature of tweets, as it gives rise to a very particular form of communication that differs from traditional “compressed” language genres, such as telegrams or newspaper headlines. Optional elements, such as multimedia objects, hypertext links, user mentions, and hashtags, provide the means to expand the message in ways previously not available.

Some of these features (multimedia objects, hypertext links) are common to most—if not all—modern social networking sites. However, Twitter/X has several differentiating features. The first one is related to the aforementioned size restriction. Facebook allows up to 63,206 characters in regular posts, while Instagram limits text length to 2,200 characters.Footnote 1 But it is the social aspect of the social networking site (SNS) that truly distinguishes Twitter/X from others. By default, a user’s tweets are public, and any other user can “follow” them to automatically receive every new tweet in their feed and, optionally, be notified of new posts. Twitter users can block specific users; however, blocking does not prevent the blocked user from accessing the tweets, as these can still be viewed through third-party apps and websites. Blocking therefore really works in the opposite direction: the user who blocks stops receiving tweets from the blocked user, who, in turn, cannot reply to their tweets. In contrast, both Facebook and Instagram users need to approve a follower’s request before that follower can view their content (in the case of Instagram this is only true of private accounts, as public ones require no approval to follow).

Thus, Twitter’s “openness” of content has largely determined its success as a data source for researchers. Twitter has, since the beginning, offered an API (Application Programming Interface) to allow developers to access its data. This favourable scenario, however, changed in March 2023 with Twitter’s new API policy. Although free access still exists, it considerably limits the number of tweets that can be downloaded (1,500 per month); there are also two paid licences: a “basic” one, with a 10,000-tweet download limit, and an “enterprise” licence that can be tailored to specific needs. An academic research licence is also available; it must be applied for, requires meeting several eligibility criteria, and is subject to approval by Twitter/X on an individual basis.Footnote 2

Whatever the current (or future) limitations, existing Twitter datasets will remain available, and there is little doubt that new ones will continue to be created and shared.

3.1 Twitter Content

Although some of the characteristics of this SNS have been mentioned in passing in previous sections, it is important to understand how Twitter data is obtained, structured, and processed, to be aware of the possibilities and limitations that existing Twitter corpora present.

Since its inception, Twitter has evolved significantly, adding new features and implementing modifications to enhance the user experience and the quality of public discourse on the platform. The first tweet was published in March 2006 by Jack Dorsey, one of the platform’s creators; it read “just setting up my twttr.” Initially, tweets were limited to 140 text characters and multimedia elements were not allowed, a feature that was added in 2012. This is why Twitter was dubbed a microblogging site: the idea was for users to share brief ideas or status updates by publishing a number of short daily posts. The next year, the company went public on the New York Stock Exchange under the ticker symbol “TWTR”. In 2017 Twitter doubled the number of characters allowed in a tweet, which remains limited to 280 as of June 2023, except for Chinese, Japanese, and Korean, for which the original limit of 140 characters was kept, as these languages can convey more content in fewer characters than Western languages.

Along with the ability for developers to access content via an API, this strict length limit is probably the most defining characteristic of Twitter, as it encourages users to be concise and to the point. This characteristic has in fact shaped a unique style of communication on the platform. For example, the expression of sarcasm is sometimes not easy to identify, and users recurrently need to resort to paratextual methods, such as the use of the hashtag #sarcasm to make their intentions explicit (Bamman and Smith 2015).

Irony, sarcasm, and other figurative language types are known to be pervasive on Twitter (Sulis et al. 2016), which poses a serious challenge to sentiment analysis and related natural language processing tasks, such as emotion detection. In fact, sarcasm detection is an active NLP area of research itself, and researchers dedicate entire datasets—e.g. Khodak et al. (2018)—and shared tasks—e.g. Ghosh et al. (2020)—to this particular topic.

The character limit is by no means the only problem that makes this task difficult; the lack of acoustic markers is probably the most limiting factor for achieving good results in these tasks (Woodland and Voyer 2011). Although researchers have employed a number of strategies to overcome this problem, the state of the art in sarcasm detection is far from optimal. Plepi and Flek (2021), for example, achieved state-of-the-art performance by using graph attention networks (GAT) to leverage both a user’s historical tweets and social information from their conversational neighbourhood in order to contextualize the interpretation of a post. Detecting sarcasm on Twitter therefore requires sophisticated strategies that take into account not just the tweet’s content but also the user’s profile. Since this approach is not easily implemented on an isolated Twitter dataset, irony and sarcasm detection remains an open issue, affecting a proportion of tweets that has been estimated at around 10% (Moreno-Ortiz and García-Gámez 2022).

Another issue that has been well documented is the presence of potentially misleading information, which became a noticeable problem during the COVID-19 pandemic and the US elections, leading Twitter to add warning labels to suspicious tweets in 2020. Several specific corpora have been compiled to deal with this issue in dedicated shared tasks. For example, FEVER: Fact Extraction and VERification (Thorne et al. 2018) is a manually annotated dataset that consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. More relevant to this book is COVIDLies (Hossain et al. 2020), a dataset of 6,761 expert-annotated tweets to evaluate the performance of misinformation detection systems on 86 different pieces of COVID-19-related misinformation.

Evidently, issues such as the use of figurative language and, especially, misinformation are aspects that need to be taken into account when carrying out any analysis of Twitter data, but they do not invalidate results, as they should be considered the exception rather than the norm. Irony and sarcasm are often used as devices to create humour, as the entertainment aspect of social networks is clearly an important motivation for users. In fact, Tkáčová et al. (2021) mention that social networks were a useful source of entertainment for teens during COVID-19 lockdowns. Also, although the presence of sarcasm affects the performance of sentiment classifiers, it is rather irrelevant for keyword and topic extraction, as is the presence of misinformation tweets, since the objective is not to identify the user’s stance on the topic, but the topic itself.

3.2 Downloading and Managing a Large Twitter Corpus

3.2.1 Anatomy of a Tweet

Processing a “raw” Twitter corpus involves dealing with each tweet individually, using a loop to read them sequentially and extract the actual data that we need. Each tweet contains a large number of data fields, most of which may be irrelevant. Figure 3.1 shows a screenshot displaying part of the hierarchical data structure of a tweet.

Fig. 3.1

Data structure of a tweet (partial view of the hierarchy, showing fields such as created_at, id, id_str, full_text, truncated, display_text_range, entities (hashtags, symbols, user_mentions, urls), source, user, geo, place, and coordinates)

In all, a tweet contains 141 data fields (attribute-value pairs), many of which are themselves nested data structures, such as arrays. For example, in the figure’s tweet, the entities.hashtags field holds a list with a single element, which in turn contains two attributes, text and indices; the latter is a list of two numerical values (the start and end positions of the hashtag within the text).

This complex, hierarchical data structure is obviously not easy to manage, and most data are irrelevant or missing. Missing information is a major problem, as some potentially useful data regarding the geographical location of the user or place of publication recurrently fall into this category. In fact, researchers have developed strategies to overcome this problem (Qazi et al. 2020).

Therefore, the first step in processing a Twitter corpus consists of selecting the data fields that are relevant to our research and then saving the simplified data structure in a suitable format. Each tweet is downloaded as a JSON object, a data exchange format that has gained popularity due to its expressive power (as opposed to CSV) and simplicity (as opposed to XML). However, CSV (comma/tab-separated values) may be preferable for our simplified version, as it is more readily usable with certain data processing libraries, such as Pandas. Alternatively, XML is necessary if we plan to use the resulting corpus with XML-aware corpus processing tools, such as Sketch Engine (see Sect. 1.5). Finally, JSON is probably the best option if we need to store hierarchical data structures; converting JSON to XML is also rather straightforward.
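
As a minimal sketch of this step (the field names follow the Twitter API v1.1 payload and the simplified records shown later in this chapter; the actual script may differ), the selection and conversion might look like this:

    import csv
    import json

    def simplify(tweet):
        """Keep only the fields relevant to this study."""
        return {"id": tweet["id_str"],
                "user": tweet["user"]["screen_name"],
                "date": tweet["created_at"],
                "text": tweet.get("full_text") or tweet.get("text", "")}

    def save(records, jsonl_path, tsv_path):
        """Write the reduced records both as JSON Lines and as TSV."""
        with open(jsonl_path, "w", encoding="utf-8") as jf:
            for rec in records:
                jf.write(json.dumps(rec, ensure_ascii=False) + "\n")
        with open(tsv_path, "w", encoding="utf-8", newline="") as tf:
            writer = csv.DictWriter(tf, fieldnames=["id", "user", "date", "text"],
                                    delimiter="\t", extrasaction="ignore")
            writer.writeheader()
            writer.writerows(records)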

3.2.2 Downloading and Extracting Data

Due to copyright issues, Twitter corpora cannot be distributed directly, that is, including the original tweet content and metadata. Instead, a Twitter corpus is usually made publicly available as lists of “tweet IDs”, strings of numerical characters (19 digits in the tweets used in this book) that uniquely identify each Twitter object.Footnote 3 This means that accessing a publicly available Twitter corpus involves downloading the original content from Twitter, using its API, by way of each individual tweet ID contained in the distribution, a process known as tweet hydration. In the case of the CCTC, the corpus is distributed as a set of gzipped text files containing the IDs of each tweet. A Python script (“hydrate.py”) is included that downloads the tweets using Twitter’s streaming API.
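
A minimal sketch of the first half of that process is given below: it reads the gzipped ID lists and groups the IDs into request-sized batches. The hydration call itself (handled by hydrate.py or a client library) is deliberately omitted here.

    import gzip

    def read_tweet_ids(id_files):
        """Yield tweet IDs from the gzipped ID lists in the distribution."""
        for path in id_files:
            with gzip.open(path, "rt", encoding="utf-8") as fh:
                for line in fh:
                    tweet_id = line.strip()
                    if tweet_id:
                        yield tweet_id

    def batched(ids, size=100):
        """Group IDs into batches; lookup endpoints typically accept up to 100."""
        batch = []
        for tweet_id in ids:
            batch.append(tweet_id)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch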

An additional hurdle is Twitter’s bandwidth limitations: the download will stop if these limits are exceeded. In order to circumvent this, the download process must be paused at regular intervals. As a result, it takes an average of 12 days to download each month of the original CCTC corpus (the initial months are faster, but downloading takes longer as the process moves forward in time). The corpus is then obtained as a collection of gzipped JSON Lines files, where each JSON Line contains a complete tweet.

These compressed JSON Lines files contain all tweets in all languages. Therefore, the first step is to extract the tweets from the original files.

For this study, a custom Python script was used to extract only the English tweets, keeping specific information for each tweet (tweet ID, user, date, and text). This script uses several parameters that can be customized to change its behaviour (language to extract, time period, minimum number of words, etc.).

Another important aspect of Twitter/X corpora is the high proportion of duplicates, either retweets or copy-pasted tweets. One data field in the tweet’s structure that can be used to deal with this situation is “retweeted_status”. However, this is not so straightforward in practice, because there is no certainty that the original tweet is included in the dataset. The method used by the extraction script instead adds each tweet to a daily Python dictionary using the tweet’s text as the key, which makes it impossible to store two identical tweets. Thus, we avoid saving retweets or repeated tweets; instead, only one instance of each tweet per day is saved, along with a counter indicating the number of times that tweet occurs during that day.

Additionally, the script applies a number of pre-processing operations on the original text in order to remove hyperlinks and problematic characters such as newlines, tabs, and certain Unicode characters (e.g. typographic quotes). It also ignores tweets with fewer than the minimum number of words specified (3 by default).
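
The core of this deduplication and cleaning logic might look as follows; this is a simplified sketch in which the field names assume the Twitter API v1.1 payload, and the actual script adds further options as well as the logging described next.

    import gzip
    import json
    import re

    URL_RE = re.compile(r"https?://\S+")

    def extract_day(hourly_files, lang="en", min_words=3):
        """Collapse one day's tweets: the cleaned text is the dictionary key,
        and the value keeps a counter of how many times it occurred."""
        unique = {}
        for path in hourly_files:                     # hourly JSONL.gz files
            with gzip.open(path, "rt", encoding="utf-8") as fh:
                for line in fh:
                    tweet = json.loads(line)
                    if tweet.get("lang") != lang:
                        continue
                    text = tweet.get("full_text") or tweet.get("text", "")
                    text = URL_RE.sub("", text)       # remove hyperlinks
                    text = " ".join(text.split())     # remove newlines and tabs
                    if len(text.split()) < min_words:
                        continue
                    if text in unique:
                        unique[text]["n"] += 1        # retweet or copy-pasted
                    else:
                        unique[text] = {"text": text,
                                        "user": tweet["user"]["screen_name"],
                                        "date": tweet["created_at"],
                                        "id": tweet["id_str"],
                                        "n": 1}
        return list(unique.values())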

Finally, the script generates a log file that includes details about the extraction process along with important statistics:

  • Processed tweets by day.

  • Saved tweets by day.

  • Processed words by day.

  • Saved words by day.

  • Repeated tweets by day.

  • Totals.

These data are printed to the console for each day at runtime and saved as a text file at the end of the extraction process. Since the data are saved in tab-separated format, they can be copy-pasted into a spreadsheet to generate data visualizations, such as the one in Fig. 3.2, where the daily data have been aggregated by week using a pivot table.
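
The same aggregation can also be scripted; the sketch below assumes a hypothetical log file name and column names based on the statistics listed above.

    import pandas as pd

    # "extraction_log.tsv" and its column names are assumptions for this example
    log = pd.read_csv("extraction_log.tsv", sep="\t", parse_dates=["date"])
    weekly = (log.set_index("date")[["processed_tweets", "saved_tweets"]]
                 .resample("W").sum())
    weekly.plot(title="Tweets per week")   # plotting requires matplotlib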

Fig. 3.2

Total English tweets over time (aggregated by week)

Table 3.1 summarizes the data in absolute figures.

Table 3.1 Corpus extraction statistics

In summary, the English portion of the CCTC for the years 2020 and 2021 consists of nearly 1.12 billion tweets and over 32 billion tokens. The “compressed” form used to store it, however, offers considerable savings. If the number of tweets were used as an estimator of the size of the corpus, the method employed (saving one instance of each unique tweet per day) offers a space saving of 68.45%. The advantage is not simply a considerably reduced storage size, but, more importantly, reduced processing time for any operation subsequently performed on the data, an aspect that becomes critical when dealing with large corpora.Footnote 5
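
As a rough check of these figures (assuming, as stated above, that the saving is computed over tweet counts), the number of unique tweet records actually stored can be estimated as:

    N_{\text{stored}} \approx (1 - 0.6845) \times N_{\text{total}} \approx 0.3155 \times 1.12 \times 10^{9} \approx 3.5 \times 10^{8}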

3.2.3 Data Organization and File Format Selection

The original distribution of the CCTC stores each hour’s worth of data in one file, that is, 24 files for each day, for a total of 17,040 files for the years 2020 and 2021 (710 days in total, starting February 21). During the extraction by language, the unique tweets for a given day were stored in a single file (710 files in total). The result is as many files as there are days in the corpus, with each line representing a single tweet containing a reduced set of data fields. JSON Lines was chosen as the storage format, although CSV or TSV is also a good choice. All files were compressed with gzip, as this format allows fast, on-the-fly decompression when opening. A few examples of a data line are given in (1) to (6) below. All the data are extracted from the original tweet except the retweet counter (“n”).

(1)

    {“text”: “A man who lives in Snohomish County, Washington, is confirmed to have the first US case of Wuhan coronavirus”, “user”: “cnnbrk”, “date”: “Tue Jan 21 19:42:55+0000 2020”, “id”: “1219706962851569665”, “n”: 75}

(2)

    {“text”: “BREAKING: First confirmed case of the new coronavirus has been reported in Washington state, CDC says.”, “user”: “ABC”, “date”: “Tue Jan 21 19:14:03+0000 2020”, “id”: “1219699699520876544”, “n”: 6}

(3)

    {“text”: “Dear friends, please spare a few minutes, and read about the #NovelCoronaVirus, and the ongoing epidemic in #Wuhan China…and now being recorded in other cities and countries. Do not spread fear. Spread the right information. And protect yourself and others.”, “user”: “Fredros_Inc”, “date”: “Tue Jan 21 21:41:31+0000 2020”, “id”: “1219736807832682498”, “n”: 12}

(4)

    {“text”: “ . Remember when Ford cut a billion dollars from Toronto's public health ? . A good portion of that was infectious and communicable disease surveillance and treatment programs. Wuhan virus ain't nuthin’ ta f’ wit. . #cdnpoli #onpoli.”, “user”: “StephenPunwasi”, “date”: “Tue Jan 21 23:01:00+0000 2020”, “id”: “1219756813140275200”, “n”: 66}

(5)

    {“text”: “PLEASE SHARE. First Case of Mystery Coronavirus Found In Washington State CDC via @YouTube”, “user”: “PeaMyrtle”, “date”: “Wed Jan 22 00:30:58+0000 2020”, “id”: “1219779453901058049”, “n”: 1}

(6)

    {“text”: “ This Isn't True Killer Chinese virus comes to the US, CDC says via @MailOnline”, “user”: “amandadonnell14”, “date”: “Tue Jan 21 19:01:15+0000 2020”, “id”: “1219696475858440197”, “n”: 3}
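
A minimal sketch of how such a daily file might be read back for processing (gzip decompresses on the fly, as noted above; the file name is hypothetical):

    import gzip
    import json

    def read_day(path):
        """Yield the simplified tweet records stored in one daily JSONL.gz file."""
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)

    for tweet in read_day("2020-01-21.jsonl.gz"):
        print(tweet["n"], tweet["text"][:60])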

The online repository for this bookFootnote 6 contains the extracted corpus in distributable form, that is, as a collection of tweet IDs (dehydrated). The files, one for each day, have the extension “.tsv” (tab-separated values) and contain two data fields: “tweet_id” and “n”, where “n” is the number of times the tweet occurs in the original corpus on that particular day.

For some of the exploration tasks presented in the following chapters, the geotagged subset of the corpus will be used; its extraction process and statistics are described in Sect. 3.4.

3.3 Data Sampling

Given the size and organization of the corpus (large daily collections of tweets), sampling becomes extremely important. Although sequential, unindexed processing of each and every tweet in the corpus is possible (whether for keyword extraction, topic modelling, or sentiment analysis), it would be extremely impractical, as the processing time may extend to days or even weeks. It may also be unnecessary altogether, as a properly extracted sample may return the same or very similar results. This is true of all large corpora, but especially of social media corpora, as Twitter data consist primarily of short texts, many of which are merely repetitions of one another (retweets). Preparing the data and employing a consistent sampling method, as well as a representative sample size, is crucial for optimizing the storage and processing of data.

The importance of choosing the appropriate data sampling technique cannot be overstated. According to Boyd and Crawford (2012), “just because Big Data presents us with large quantities of data does not mean that methodological issues are no longer relevant. Understanding sample, for example, is more important now than ever” (p. 668).

Data sampling refers to the set of methods used to select a subset of units from the target population. Although many definitions of sampling exist, the one by Brown (2012) is particularly suited to our context:

Sampling is the act of choosing a smaller, more manageable subset of the objects or members of a population to include in an investigation in order to study with greater ease something about that population. In other words, sampling allows researchers to select a subset of the objects or members of a population to represent the total population. Sampling is used in language research when the objects or members (hereafter simply objects or members, but not both) of a population are so numerous that investigating all of them would be unwieldy. Such objects of study might include the total populations of all ESL learners, TOEFL examinees, essay raters, words, cohesive devices, and so on. (p. 1)

Our “objects of study” are tweets and the words that they contain, and the population is the full corpus. This creates an interesting paradox, as a corpus is itself usually defined as a sample of a language (Sinclair 2004), so the concept of representativeness comes into play. The notion of a subcorpus is also relevant in this context. A subcorpus is a part or section of a larger corpus, but it is usually selected according to one or more predefined criteria that define the content of that section, such as date, genre, or medium.Footnote 7 Sampling, however, attempts to extract a representative, usually random, subset that can be used with the statistical certainty that the results do not differ significantly from those that would have been obtained from the population (i.e., the entire corpus).

There are numerous sampling methods, which are typically divided into two categories: probability and non-probability sampling. The primary distinction between these is that the latter selects units using a non-random, and therefore subjective or intentional, method, such as applying one of the abovementioned criteria for subcorpus creation. In the following sections of this chapter, I discuss the creation of time- and location-based subcorpora, a good example of non-probability sampling.

Probability sampling, on the other hand, is based on the randomization principle, which is the best way to obtain statistical representativeness. There are, however, several methods to implement probability sampling (Beliga et al. 2015; Siddiqi and Sharan 2015): (i) simple random sampling, (ii) systematic sampling, (iii) stratified sampling, (iv) cluster sampling, (v) multistage sampling, (vi) multiphase sampling, and (vii) proportional-to-size sampling. Two of these methods are especially relevant to our objective: simple random sampling, which is the most commonly used due to its simplicity, and proportional-to-size (PPS) sampling.

Simple random sampling basically requires a list of all the units in the target population, and all population members have the same probability of being selected for the sample. A drawback of this method is that the random drawing may lead to the over- or underrepresentation of small segments of the population: since all the members of the sampling frame can be randomly drawn, it leaves to chance the extent to which a particular group will be represented in the sample, if at all (Kamakura 2010).

Consequently, ensuring representation may require more sophisticated sampling techniques, such as proportional-to-size sampling. This method requires a finite population of units, in which a size measure “is available for each population unit before sampling and where the probability of selecting a unit is proportional to its size” (Skinner 2016, 1). Therefore, the likelihood of being included in the sample increases as the unit size grows.

Systematic sampling selects units at fixed intervals, where the sampling interval is determined by dividing the number of units in the population by the desired sample size. Although this scheme is frequently preferred due to its simplicity and convenience, it runs the risk of not being representative of the population, for instance if there is a periodic feature in the population’s arrangement that coincides with the chosen sampling interval. Moreover, this method does not permit an unbiased estimator of the sampling design variance (Bellhouse 2014).

Stratified sampling is based on the division of a population into strata. This ensures that each stratum is appropriately represented in the same proportion in the sample as in the sampling frame. This process improves the efficiency of sample designs in terms of estimator precision, as it allows the division of a heterogeneous population into internally homogeneous subpopulations (strata) whose sampling variability is smaller than that for the whole population (Parsons 2017).

Cluster sampling divides the population into groups, which are subsequently selected at random in order to represent the total population. All the units found in the selected clusters are then included in the sample (Levy 2014). This method is especially useful when the clusters constitute what Kamakura (2010) defines as “mini-populations”, each having its own features and characteristics.

Multistage sampling involves the selection of a sample within each of the selected clusters (Shimizu 2014) and requires, at least, two stages: (i) selection and identification of large clusters (primary sampling units), and (ii) selection of units from within the selected clusters (secondary sampling units). A third optional stage is formed by tertiary sampling units, which are selected within the secondary sampling units.

Multiphase sampling is based on (i) the collection of basic information from a large sample of units, and (ii) the collection of more detailed information from a subsample of those units. It must be distinguished from multistage sampling: in multiphase sampling, “the different phases of observation relate to sample units of the same type, while in multistage sampling, the sample units are of different types at different stages” (Lesser 2014, 1).

Since our corpus consists of a daily set of unique tweets, each of which has a frequency indicator recording the number of times it was retweeted, we can use this counter to apply proportional-to-size sampling. Thus, the probability of a tweet being included in the sample grows proportionally with the number of times it was retweeted.
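
One simple way to implement this is Poisson-type PPS sampling, in which each unique tweet is kept with a probability proportional to its duplicate count. The sketch below is one such variant and not necessarily the exact method used by the sampling script described later.

    import gzip
    import json
    import random

    def pps_sample(day_file, sample_frac=0.01, seed=42):
        """Keep each unique tweet with probability proportional to its
        daily frequency "n" (Poisson PPS sampling)."""
        random.seed(seed)
        with gzip.open(day_file, "rt", encoding="utf-8") as fh:
            tweets = [json.loads(line) for line in fh]
        target = sample_frac * len(tweets)            # desired sample size
        total_n = sum(t["n"] for t in tweets)
        return [t for t in tweets
                if random.random() < min(1.0, target * t["n"] / total_n)]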

To describe the statistical distribution of the number of daily retweets (in fact, duplicate tweets, whether retweeted or not), Table 3.2 shows the descriptive statistics of the number of duplicates of a random day (June 20, 2021).

Table 3.2 Central tendency measures of daily number of retweets

On this particular day there are 318,926 unique tweets, with an average of 3.28 duplicates per tweet, but with a very large range, standard deviation, and variance, which indicates that the distribution is highly dispersed and skewed. The median and mode of 1 suggest that the vast majority of daily tweets are unique. To provide a more accurate picture of these numbers, Table 3.3 provides counts of daily retweets by ranges.

Table 3.3 Daily retweets in the corpus by range

Using proportional-to-size sampling, several samples were extracted from the full corpus to be used in the experiments described in the following chapters, the assumption being that working with smaller, fixed-interval samples is more practical and efficient than working with the rather unwieldy numbers of the full corpus. Table 3.4 summarizes the number of tweets and tokens contained in the full corpus and in each of the extracted samples.

Table 3.4 Corpus samples used in the study

Along with the full corpus, the three samples are included in the book’s repository as collections of tweet IDs.

To extract these samples, the corpus is taken as a time series of day intervals. The sample extraction script takes several parameters, including sample percentage and time period in number of days. All the samples in this study used daily time periods and the proportional-to-size (“pps”) sampling method, but the script can use any number of days as a time period and two alternative sampling methods: “random”, which retrieves a simple random sample of the desired percentage of tweets, and “top”, which extracts the top retweeted tweets. The PPS and top methods use the frequency information obtained during the tweet extraction process.

As with the full corpus, samples are stored as gzipped JSONL files (one file per day, one JSONL document per tweet), with the text, date, and frequency of each tweet included in each JSONL document. With this system, considerable processing time is saved. Thus, instead of processing the actual number of tweets (many of which are the same text because they are retweeted or copied and pasted), we can simply multiply results by the tweet’s frequency. To give an idea of how this system optimizes processing, Table 3.5 provides a summary of the processing times of some operations, such as sample and keyword extraction.
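
For instance, word frequencies over a sample can be computed by weighting each unique tweet by its stored frequency, as in this sketch (tokenization is deliberately naive here):

    import gzip
    import json
    from collections import Counter

    def weighted_word_counts(day_file):
        """Count tokens once per unique tweet and multiply by its frequency,
        instead of processing every duplicate separately."""
        counts = Counter()
        with gzip.open(day_file, "rt", encoding="utf-8") as fh:
            for line in fh:
                tweet = json.loads(line)
                for token in tweet["text"].lower().split():
                    counts[token] += tweet["n"]
        return counts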

Table 3.5 Processing times of some operations

All times are given in hh:mm format. All tasks were run on an Intel Core i7-7400 3.0 GHz CPU (4 cores) on Ubuntu Linux 20.04 Server 64-bit. During the keyword extraction process other text items, such as entities, mentions, hashtags, and emojis, were also extracted, thus adding considerable overhead processing time.

These processing times indicate that even though the sample extraction time is comparable for the 0.1%, 0.5%, and 1% samples, sample size becomes an important factor in the keyword extraction task: in the case of the 1% sample, this task alone took over 48 hours, compared to the almost 6 hours needed for the 0.1% sample.

3.4 Extracting Geotagged Tweets

The creators of GeoCov19 (Qazi et al. 2020), one of the few geotagged COVID-19 Twitter corpora available (described in Sect. 2.5), mention that only 1% of tweets contain actual latitude/longitude coordinates. In their dataset, however, the figure is much smaller: only 378,772 tweets out of 452 million were actually geotagged (i.e. 0.084%). This is in fact very similar to what we find in the CCTC corpus.

In order to extract the geotagged portion of the English corpus, a script was created that extracted only tweets whose language was English and whose place.country_code data field was not empty. This returned a total of 8.2 million tweets distributed across 242 different countries. As with the full English dataset, the tweets were saved in daily files along with their date information. The timeline, shown in Fig. 3.3, has a very similar profile to that of the overall English corpus (see Fig. 3.2), which indicates that the time distribution of geotagged tweets is almost identical.
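
The filtering condition itself is straightforward; a minimal sketch (field names again assume the v1.1 payload) is:

    def is_english_geotagged(tweet):
        """True for English tweets whose place object carries a country code."""
        place = tweet.get("place") or {}
        return tweet.get("lang") == "en" and bool(place.get("country_code"))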

Fig. 3.3

English country-geotagged tweets aggregated by week

An additional script was used to obtain statistics by country. Table 3.6 summarizes the data, and Fig. 3.4 visually displays the top ten countries by number of tweets.

Table 3.6 Distribution of geotagged tweets by country
Fig. 3.4

Number of geotagged tweets by country (USA: 3,984,700; UK: 1,418,550; India: 684,902 are the top three)

The United States alone generated almost 4 million tweets, that is, almost half of all the geotagged tweets (8.2 million). It must be remembered that this distribution may or may not be representative of all English tweets: the high counts are probably due to these countries generating most of the tweets about the pandemic, but they may also reflect device configurations that allow the client application to read and post the country of origin. It does mean, however, that any study of English tweets will be skewed towards the most prolific countries, particularly the United States, the United Kingdom, and India, which account for 74.2% of the total volume.

The geotagged corpus obviously requires a different data structure, which includes the country code. (7) to (12) below are consecutive JSON Lines taken at random from the file corresponding to January 17, 2021.

(7)

    {“country_code”: “CA”, “timestamp”: “Sun Jan 17 00:02:51+0000 2021”, “user”: “RunnertheFirst”, “id”: “1350594395431706627”, “text”: “Has the reporter been arrested?”}

(8)

    {“country_code”: “US”, “timestamp”: “Sun Jan 17 00:03:01+0000 2021”, “user”: “cbwebster”, “id”: “1350594435931889669”, “text”: “If this doesn't make you think. #CNN #COVID19 #coronavirus #CovidDeaths #CoronaVirusUpdates #planecrash”}

(9)

    {“country_code”: “US”, “timestamp”: “Sun Jan 17 00:03:19+0000 2021”, “user”: “trenttarbutton”, “id”: “1350594515518824448”, “text”: “COVID finally got me ”}

(10)

    {“country_code”: “US”, “timestamp”: “Sun Jan 17 00:03:25+0000 2021”, “user”: “Chrissy287”, “id”: “1350594538319065093”, “text”: “Poor guy these people are just trying to make a living there is nothing worse than someone who refuses to wear a mask in a pandemic”}

(11)

    {“country_code”: “GB”, “timestamp”: “Sun Jan 17 00:03:25+0000 2021”, “user”: “Gerfome”, “id”: “1350594540630138882”, “text”: “Not hearing any world news now on BBC, or other media outlets. Don't hear about what's happening in the EU. Proper mushroomed we are now, but I bet we all know the latest UK covid statistics !”}

(12)

    {“country_code”: “US”, “timestamp”: “Sun Jan 17 00:03:32+0000 2021”, “user”: “dago_deportes”, “id”: “1350594569516187648”, “text”: “@LeviHayes21 Haha I know there's a game but didn't know if covid restrictions applied”}

As for the number of words, the entire geotagged corpus consists of nearly 198 million words (counted using the abovementioned split() method). Although this is a much more manageable figure, it may still be too large for some methods that require intensive computing, such as embeddings-based topic modelling, which is explored in Sect. 5.2. Thus, a script was created that extracts a daily random sample by country, proportional to the number of tweets of that country on that day. The script takes several parameters, including the list of country codes whose tweets are to be sampled and the percentage of the desired daily sample. The list of country codes can be left empty to sample all countries in the corpus, and if 100 is selected as the percentage, all tweets for the specified country or countries will be extracted. The script also generates a log file with statistics on the data read and written, including the number of tweets and the number of words for each of the sampled countries.
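
A simplified sketch of the per-country sampling step for a single day, under the same assumptions as the earlier sketches:

    import random
    from collections import defaultdict

    def sample_by_country(tweets, percentage=10, countries=None, seed=42):
        """Draw a random sample of the given percentage from each country's
        tweets for one day; an empty country list means all countries."""
        random.seed(seed)
        by_country = defaultdict(list)
        for t in tweets:
            if not countries or t["country_code"] in countries:
                by_country[t["country_code"]].append(t)
        sample = []
        for code, group in by_country.items():
            k = round(len(group) * percentage / 100)
            sample.extend(random.sample(group, k))
        return sample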

Table 3.7 shows the statistics of the samples used in this book: 10%, 25%, and 50% of the top ten English-speaking countries.

Table 3.7 Tweet and word counts of the geotagged corpus samples by country

Along with the full geotagged corpus, the three samples are included in the book’s repository as collections of tweet IDs in TSV format with two data columns: “tweet_id” and “country_code”.

Although all these countries have English as a first language, not all countries included in the corpus do. In fact, Germany is in 14th position by number of tweets published in English in the Geotagged section of the CCTC, after Kenya. Table 3.8 offers the ranked list of the top 50 countries present in the corpus, including the exact number of tweets and the percentage of the whole corpus. The top 10 countries selected for the samples make up 92.31% of the entire geotagged corpus.

Table 3.8 Top 50 countries by volume in the geotagged corpus

3.5 Subcorpora. Using Metadata with XML-Aware Corpus Tools

The JSONL format described above, chosen to store the corpus, is suitable for processing the data with the custom tools that we will be using in this book, but other formats are required to use the data effectively with other tools. XML (Extensible Markup Language), in particular, is a standard text exchange format that is used by many text processing tools.

Like JSON, XML is capable of encoding metadata together with the text. (13) to (17) are examples of XML-encoded tweets from the geotagged corpus.

(13)

    <doc date=“2021-05-01” country=“US” id=“1388488549922598915”>Planned Parenthood? We're a pro-life institution. Vaccinations? We're pro-choice.</doc>

(14)

    <doc date=“2021-05-01” country=“US” id=“1388283084072787971”> I literally have covid for the 3rd time….how in the fuck???</doc>

(15)

    <doc date=“2021-05-01” country=“US” id=“1388521299698495491”>May gone be the month I stay my ass home I been away from home like weeks out out of April smh</doc>

(16)

    <doc date=“2021-05-01” country=“GB” id=“1388381954714832899”>Koreans are immune to Covid—fact. (The Japanese call them the “Garlic Eaters”).</doc>

(17)

    <doc date=“2021-05-01” country=“IN” id=“1388539225767772161”>Sir i need an oxygen bed for a corona positive relative in dehradun. Plz help sir. Regards</doc>
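
A sketch of how such <doc> elements might be generated from the geotagged JSONL records; the date parsing assumes Twitter’s standard created_at format and may need adjusting to match the stored timestamps.

    import json
    from datetime import datetime
    from xml.sax.saxutils import escape, quoteattr

    def jsonl_to_xml(lines):
        """Convert geotagged JSONL records into <doc> elements for corpus tools."""
        for line in lines:
            t = json.loads(line)
            # e.g. "Sun Jan 17 00:02:51 +0000 2021" -> "2021-01-17"
            date = datetime.strptime(t["timestamp"],
                                     "%a %b %d %H:%M:%S %z %Y").date().isoformat()
            yield (f"<doc date={quoteattr(date)} "
                   f"country={quoteattr(t['country_code'])} "
                   f"id={quoteattr(t['id'])}>{escape(t['text'])}</doc>")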

XML-aware corpus tools, such as the web-based corpus suite Sketch Engine (Kilgarriff et al. 2014), are able to read the metadata and offer certain extra functionalities, such as the creation of subcorpora that can be searched individually by the different tools. Furthermore, some of the tools in this suite depend on the availability of time metadata in order to work at all. Such is the case of the Trends tool, which keeps track of the diachronic frequency of words in the corpus.Footnote 11

Unlike other tools, such as Google Trends or the dynamic topic modelling tools we explore in Sect. 5.3, Sketch Engine’s Trends does not show the usage of a specific word over time; instead, it offers a useful list of words whose frequency shows a significant change (upwards or downwards) over time, computed using a user-selected statistic (either linear regression or Mann–Kendall with Theil–Sen). To illustrate what this tool achieves, Figs. 3.5 and 3.6 show the results obtained from the India 2020 and 2021 subcorpora, respectively.

Fig. 3.5

Word usage trends of the India 2020 subcorpus

Fig. 3.6

Word usage trends of the India 2021 subcorpus (words such as “pandemic” and “wave” show a significant trend)

In order to obtain these results, specific subcorpora need to be created combining location and time data, which, as mentioned above, requires that these attributes be encoded in the XML metadata before uploading the corpus. Both charts were computed with the tool’s default settings: attribute = lemma, minimum frequency = 69, maximum p-value = 0.01, method = Mann–Kendall, Theil–Sen (all).