
2.1 CORD-19: The COVID-19 Open Research Dataset

The CORD-19 corpus (Wang et al. 2020) is the result of a collaboration among scientists from several institutions and companies, including The White House Office of Science and Technology Policy (OSTP), the National Library of Medicine (NLM), the Chan Zuckerberg Initiative (CZI), Microsoft Research, and Kaggle, coordinated by Georgetown University’s Center for Security and Emerging Technology (CSET). The final version of CORD-19 was released on June 2, 2022, and contains over 1 million scholarly articles, of which over a third are full texts, totalling about 1.5 billion words.

The dataset was started in 2020 as an urgent initiative meant to facilitate the application of Natural Language Processing and other AI techniques to generate new insights in support of the ongoing effort to combat the disease. The dataset consists predominantly of papers in medicine (55%), biology (31%), and chemistry (3%), which together constitute almost 90% of the corpus.

It was created by applying a pipeline of machine learning and NLP tools to convert scientific articles into a structured format that can be readily consumed by downstream applications. The pipeline includes document parsing, named entity recognition, coreference resolution, and relation extraction. The original PDF documents were parsed using the Grobid tool to generate the JSON-based S2ORC (Lo et al. 2020) final distribution format. Coreference resolution and relation extraction were performed using a combination of rule-based and machine learning methods.
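By way of illustration, the following minimal Python sketch iterates over the full-text JSON files in a CORD-19 release. The directory name and field names (‘metadata’, ‘body_text’, ‘text’, ‘section’) follow the published CORD-19/S2ORC schema, but they should be checked against the documentation of the specific version downloaded.

```python
import json
from pathlib import Path

# Minimal sketch of iterating over CORD-19 full-text JSON files.
# Field names ("metadata", "body_text", "text", "section") follow the
# published CORD-19/S2ORC schema; verify them against the release notes
# of the specific version you download.
def iter_paragraphs(json_dir):
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as fh:
            doc = json.load(fh)
        title = doc.get("metadata", {}).get("title", "")
        for block in doc.get("body_text", []):
            yield title, block.get("section", ""), block["text"]

# "document_parses/pdf_json" is the typical layout of a CORD-19 release.
for title, section, text in iter_paragraphs("document_parses/pdf_json"):
    print(title, section, text[:80])
    break
```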

The relevance of this knowledge resource can hardly be overstated: it has been used by clinical researchers, clinicians, and the text mining and NLP research community, who have generated a considerable body of research, including work on information extraction, text classification, pretrained language models, and knowledge graphs. It has also been used in numerous NLP shared tasks and analysis tools.

2.2 COVID-19 Twitter Chatter Dataset for Open Scientific Research

This dataset, created by Banda et al. (2021), consists of over 1.12 billion tweets (at the time of publishing the paper)Footnote 1 related to COVID-19 chatter generated from January 1, 2020, to June 27, 2021. The term chatter in the context of social media data refers to the ongoing conversation or discourse happening on the platform. As with most Twitter corpora, the authors used Twitter’s streaming API with the Tweepy Python library to identify and download COVID-19-related tweets using a set of keywords (‘coronavirus’, ‘wuhan’, ‘pneumonia’, ‘pneumonie’, ‘neumonia’, ‘lungenentzündung’, ‘covid19’). They also relied on a number of collaborators to expand the tweet collection. For pre-processing they used the Social Media Mining Toolkit (SMMT) (Tekumalla and Banda 2020a) and decided to keep the retweets, as the intention of this corpus is to make it possible to trace the interactions between Twitter users, although a clean version with no retweets, intended for NLP researchers, is also available. The authors also included, together with the corpus, a number of Python scripts to read the files and generate n-grams from the text. This dataset is intended to be instrumental in advancing research in various fields, including epidemiology, social sciences, and NLP.
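As an illustration of this kind of keyword-based collection, the sketch below uses Tweepy’s streaming client with the keyword list reported above. It is not the authors’ actual collection code: the bearer token is a placeholder, and Twitter/X API access has changed considerably since the dataset was built, so the sketch is purely illustrative.

```python
import tweepy

# Hedged sketch of keyword-based tweet collection with Tweepy's streaming
# client (Tweepy >= 4). Not the authors' collection code; BEARER_TOKEN is a
# placeholder, and current Twitter/X API access rules may not permit this.
KEYWORDS = ["coronavirus", "wuhan", "pneumonia", "pneumonie",
            "neumonia", "lungenentzündung", "covid19"]

class CovidStream(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Minimal handler: print the tweet ID and the start of its text.
        print(tweet.id, tweet.text[:80])

stream = CovidStream("BEARER_TOKEN")  # placeholder credential
stream.add_rules(tweepy.StreamRule(" OR ".join(KEYWORDS)))
stream.filter()
```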

The corpus has been well received by the research community, with 211,773 downloads as of June 2023, and has been used in a good number of publications. For example, Tekumalla and Banda (2020b) attempted to identify discourse related to potential drug treatments available for COVID-19 patients from Twitter data. They highlight the difficulties caused by the high number of misspellings of drug names (e.g. “hydroxychloroquine”). To deal with this issue they combined several methodologies to acquire additional data. Firstly, a machine learning approach called QMisSpell, which relies on a dense vector model learned from large amounts of unlabelled text. Secondly, a keyboard layout distance approach for generating misspellings. Thirdly, a spelling correction module called SymSpell, which corrects spelling errors at the text level before text tagging. The authors demonstrate the importance of dealing with the constant misspellings found in Twitter data and show that, by combining methods, around 15% more terms can be identified that would otherwise have been lost.
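For illustration, the sketch below corrects noisy drug mentions with the symspellpy package, one of the components mentioned above. The tiny dictionary, the frequency counts, and the misspelled inputs are invented, and the sketch does not reproduce the authors’ full multi-method pipeline.

```python
from symspellpy import SymSpell, Verbosity

# Hedged sketch of SymSpell-style correction of noisy drug mentions.
# The dictionary entries and counts are invented for illustration; real
# counts would come from a large reference corpus.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.create_dictionary_entry("hydroxychloroquine", 10000)
sym_spell.create_dictionary_entry("remdesivir", 8000)

for noisy in ["hydroxicloroquine", "remdesvir"]:  # hypothetical misspellings
    suggestions = sym_spell.lookup(noisy, Verbosity.CLOSEST,
                                   max_edit_distance=2)
    if suggestions:
        print(noisy, "->", suggestions[0].term)
```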

2.3 The Coronavirus Corpus

The Coronavirus Corpus (Davies 2021) contains approximately 1.5 billion words of data in about 1.9 million texts from January 2020 to December 2022.Footnote 2 The corpus is in fact derived from the NOW (News on the Web) Corpus, which currently contains 17.4 billion words.Footnote 3 Initially, the NOW Corpus was based on links from Google News; every hour of every day, Google News was queried to find online newspaper and magazine articles published within the preceding 60 minutes. This search would be repeated for each of the twenty English-speaking countries considered by the author, and the URLs from Google News were stored in a relational database, along with all of the pertinent metadata (country, source, URL, etc.). Every night, scripts would download 15,000–20,000 articles, clean and tag them, and remove duplicates before merging them into the existing NOW Corpus. Due to changes in Google News, the procedure was modified in the middle of 2019 to collect URLs using Microsoft Azure Cognitive Services. New magazine and newspaper articles from the previous 24 hours are retrieved daily for each of the twenty English-speaking countries. In addition, Bing is queried daily to find new articles published within the previous 24 hours for 1,000 distinct websites.
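The ingestion scripts themselves are not distributed with the corpus, but the deduplication step can be illustrated with a minimal sketch that removes exact duplicates by hashing normalised article text; this is an assumption-laden simplification, not the corpus’s actual pipeline.

```python
import hashlib

# Minimal sketch of duplicate removal by hashing normalised article text.
# Purely an illustration of the deduplication step described above, not the
# corpus's actual nightly ingestion scripts.
def deduplicate(articles):
    """Return articles with case- and whitespace-insensitive duplicates removed."""
    seen, unique = set(), []
    for text in articles:
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(deduplicate(["Breaking  news today.",
                       "breaking news today.",
                       "Other article."])))  # prints 2
```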

The Coronavirus Corpus provides the same query tools available for all corpora on English-Corpora.org, such as advanced searching, concordancing, and views of the frequency of words and phrases over time. Users can also browse the collocates of words and phrases and compare the collocates to see how particular topics have been discussed over time.

This corpus has been used in several studies, especially those with a linguistic perspective. One such example is the study by Dong et al. (2021) described in Sect. 1.2 above; another is the study by Montkhongtham (2021), which examined the use of if-conditionals expressing options and possibilities during the pandemic. The extracted if-conditionals were classified according to Puente-Castelo and Monaco’s (2013) framework of if-typology, and the grammatical features of all verb strings were examined in terms of tense, aspect, sentential modality, and voice. The study concluded that speech act conditionals were most frequently used to offer specific recommendations for combating the pandemic.

2.4 Parallel Corpora

Roussis et al.’s (2022) dataset is the best example of the few existing parallel corpora on COVID-19. It is a collection of parallel corpora with English as the main language, as all of them are EN-X language pairs. The primary data source was the COVID-19 metadata dataset created with the Europe Media Monitor (EMM)/Medical Information System (MedISys) processing chain of news articles. The MedISys metadata were parsed to select datasets spanning ten months (December 2019 to September 2020) and to locate articles in several languages, amounting to about 57 million URLs. The source HTML content was retrieved and processed to obtain the raw text. All documents were merged into a single file per language and period and subsequently tokenized into sentences using NLTK (Bird et al. 2009). In total they obtained 150 million sentences in 29 languages. They then applied the LASER toolkit to each document pair to mine sentence alignments for each EN-X pair and, finally, the parallel data for each period were concatenated to form a single bilingual corpus per language pair.
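The sentence-splitting step can be illustrated with NLTK’s Punkt tokenizer, the toolkit the authors cite; the sample text below is invented.

```python
import nltk
from nltk.tokenize import sent_tokenize

# Minimal sketch of the sentence-splitting step described above.
# Recent NLTK versions may require the "punkt_tab" resource instead of "punkt".
nltk.download("punkt", quiet=True)

raw_text = ("The first cases were reported in December 2019. "
            "Health authorities issued new guidance the following month.")
sentences = sent_tokenize(raw_text)
print(sentences)  # ['The first cases were reported...', 'Health authorities...']
```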

Overall, the final dataset comprises over 11.2 million sentence pairs in 26 EN-X language pairs. It is offered both in TMX (Translation Memory Exchange) and CSV formats. It covers 22 of the 24 official EU languages, as well as Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Unsurprisingly, there are great differences between low-resource languages and those with a large speaker base, with Icelandic having just a few sentence alignments, in contrast to 1.5 million for Spanish.
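Because the corpus is distributed in TMX, a standard XML format, the sentence pairs can be read with Python’s standard library alone. In the sketch below the element names (tu, tuv, seg) follow the TMX specification, whereas the file name and the language codes are assumptions that may differ in the actual release.

```python
import xml.etree.ElementTree as ET

# Hedged sketch of reading EN-X sentence pairs from a TMX file.
# Element names follow the TMX standard; "covid_en_es.tmx" is a hypothetical
# file name, and the xml:lang values may be cased differently in practice.
def read_tmx_pairs(path, src="en", tgt="es"):
    lang_attr = "{http://www.w3.org/XML/1998/namespace}lang"
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {tuv.get(lang_attr, "").lower(): tuv.findtext("seg", default="")
                for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]

for en, es in read_tmx_pairs("covid_en_es.tmx"):
    print(en, "|||", es)
```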

2.5 GeoCoV19

A well-known problem with Twitter/X datasets is that only a tiny proportion of tweets are geotagged. The lack of geolocation information may or may not be an issue for researchers, depending on their objectives. Location information, however, may be inferred from other data. Qazi et al. (2020) used a variety of strategies to geotag a large number of tweets downloaded over 90 days starting February 1, 2020. The dataset, dubbed GeoCoV19 by the authors, contains over 424 million geolocated tweets. The authors’ objective was to create a resource that allows researchers to study the impact of COVID-19 in different countries and societies. They used four types of data from a tweet: geo-coordinates (if present), place, user location, and tweet content. They adopted a gazetteer-based approach and used Nominatim, a search engine for OpenStreetMap data, to perform geocoding and reverse geocoding. They set up six local Nominatim servers on their infrastructure and tuned each server to handle 4,000 calls per second. In the absence of coordinates, toponyms were extracted from the remaining data fields and the tweet’s text. They followed a five-step process for toponym extraction: pre-processing, candidate generation, non-toponym pruning, Nominatim search, and majority voting.
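The gazetteer lookups can be illustrated with the geopy package, which wraps Nominatim. The authors ran six local Nominatim servers, whereas this sketch queries the public endpoint, so the user_agent string and the example location are purely illustrative (and subject to the public service’s rate limits).

```python
from geopy.geocoders import Nominatim

# Hedged sketch of geocoding and reverse geocoding with Nominatim via geopy.
# This uses the public endpoint, not the authors' local servers; the
# user_agent string and example query are illustrative only.
geolocator = Nominatim(user_agent="covid-corpus-demo")

location = geolocator.geocode("Wuhan, China")           # forward geocoding
print(location.latitude, location.longitude)

address = geolocator.reverse((location.latitude, location.longitude))
print(address.address)                                   # reverse geocoding
```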

The evaluation of their toponym extraction approach showed that the coarser the granularity (i.e. the higher the administrative level), the better the accuracy. At the country level, the accuracy was 0.86 for the user location field and 0.75 for the tweet text. By far, the tweet’s text was the field that yielded the most locations, followed by user location, place, and geo-coordinates.

The final tweets dataset covers 218 countries and 47,328 unique cities worldwide, and several types of locations, such as hospitals, parks, and schools. In terms of languages, the corpus contains tweets in 62 languages, English being clearly the top one, with 348 million tweets, followed by Spanish and French.

2.6 Chen et al.’s Coronavirus Twitter Corpus (CCTC)

This is the corpus used in this study. The authors did not give their corpus a particular name, so, in order to avoid confusion, I will refer to it as CCTC (Chen’s Coronavirus Twitter Corpus). Chen et al. (2020) compiled this corpus as an ongoing collection of COVID-19-related tweets. Twitter’s API and the Tweepy Python library were used to collect tweets from January 21, 2020 onwards. The searches were conducted using trending accounts and keywords such as ‘coronavirus’, ‘corona’, and ‘COVID-19’. While the dataset contains tweets in over 67 languages, Chen et al. (2020) concede that there is a significant bias in favour of English tweets. The dataset is available on GitHub as a collection of text files containing just the tweet IDs.Footnote 4 The repository also includes a Python script (‘hydrate.py’) that facilitates downloading the actual tweets via the Twitter API; this is necessary because Twitter specifically forbids the distribution of Twitter/X data by third parties. In Sect. 3.2, we provide more details regarding the “tweet hydration” process.
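The hydration step can be sketched with Tweepy’s client interface. The repository’s own hydrate.py script differs from this simplified illustration: the bearer token is a placeholder and the tweet IDs shown are invented.

```python
import tweepy

# Hedged sketch of "hydrating" tweet IDs (not the repository's hydrate.py).
# The bearer token is a placeholder and the IDs are invented; Tweepy's
# Client.get_tweets accepts at most 100 IDs per call.
client = tweepy.Client(bearer_token="BEARER_TOKEN")

tweet_ids = ["1222269144236253184", "1222269146795184129"]  # hypothetical IDs
response = client.get_tweets(ids=tweet_ids,
                             tweet_fields=["created_at", "lang", "geo"])
for tweet in response.data or []:
    print(tweet.id, tweet.lang, tweet.created_at)
```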

This corpus has also been extensively used in previous research, with 320 citations as of June 2023,Footnote 5 in a wide variety of research fields, including medicine, sociology, linguistics, and engineering.

For example, Bahja and Safdar’s (2020) study analysed the spread of misinformation through social media platforms concerning the effects of 5G radiation and its alleged link to the pandemic, a conspiracy theory which at some point led to attacks on 5G towers. The authors applied Social Network Analysis (SNA), topic modelling (specifically LDA), and sentiment analysis to identify topics and understand the nature of the information being spread, as well as the inter-relationships between topics and the geographical occurrence of the tweets. They found that the majority of the topics concern the alleged conspiracy behind the pandemic, and that the sources of the misinformative tweets can be tracked using SNA.
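The LDA component of such an analysis can be illustrated with gensim; the tokenised tweets, the number of topics, and the other parameters below are invented for the example and do not reproduce Bahja and Safdar’s setup.

```python
from gensim import corpora, models

# Hedged sketch of LDA topic modelling with gensim (the general technique,
# not the study's exact configuration). The tokenised tweets are invented.
tokenised_tweets = [
    ["5g", "towers", "coronavirus", "conspiracy"],
    ["vaccine", "trial", "results", "published"],
    ["lockdown", "extended", "schools", "closed"],
]

dictionary = corpora.Dictionary(tokenised_tweets)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_tweets]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```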

An interesting study with important social implications is the one by Bracci et al. (2021). The authors sought to understand how the pandemic reshaped the demand for goods and services in the shadow economy worldwide, particularly on Dark Web Marketplaces (DWMs). They analysed 851,199 listings from 30 DWMs, focusing on those directly related to COVID-19 medical products, and monitored the temporal evolution of product categories including Personal Protective Equipment, medicines (e.g. hydroxychloroquine), medical frauds, tests, fake medical records, and even ventilators. They also compared temporal trends in the listings with variations in public attention, as measured by tweets about and Wikipedia page visits to the products advertised in the listings. They found that listing prices correlated both with variations in public attention and with the individual choices of a few vendors, experiencing sharp increases at key points in the timeline.

In psychology, the study by Aiello et al. (2021) aimed to identify and understand the psychological responses of the population, thus contributing to the research line of measuring the effects of epidemics on societal dynamics and mental health; the paper also aimed to provide a starting point for developing more sophisticated tools for monitoring psycho-social epidemics. To identify medical entities and symptoms, the authors used GloVe (Pennington et al. 2014) and RoBERTa (Liu et al. 2019) embeddings in a Bi-LSTM neural network architecture, training the model on manually labelled entities from the Micromed database. The thematic analysis of tweets identified recurring themes in the three phases of epidemic psychology: denial, they-focus, and business-as-usual in the refusal phase; anger vs. political opponents, anger vs. each other, science, and religion in the anger phase; and we-focus, authority, and resuming work in the acceptance phase. They also tested Strong’s (1990) model of epidemic psychology.
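A Bi-LSTM tagger of the kind described can be sketched with Keras; the dimensions, the label set, and the randomly generated data below are placeholders rather than the authors’ actual model or the Micromed data.

```python
import numpy as np
from tensorflow.keras import layers, models

# Hedged sketch of a Bi-LSTM sequence tagger over pre-computed embeddings.
# All dimensions, labels, and data are illustrative placeholders.
max_len, embedding_dim, n_labels = 50, 100, 3  # e.g. O / B-SYMPTOM / I-SYMPTOM

model = models.Sequential([
    layers.Input(shape=(max_len, embedding_dim)),   # pre-computed embeddings
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(n_labels, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

X = np.random.rand(8, max_len, embedding_dim).astype("float32")  # fake batch
y = np.random.randint(0, n_labels, size=(8, max_len))            # fake labels
model.fit(X, y, epochs=1, verbose=0)
```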

In politics, Jiang et al. (2021) used an early version of the CCTC (up to July 2020) to study the polarization of discourse regarding the pandemic and to identify and describe the structure of partisan echo chambers on Twitter in the United States, in an effort to understand the relationship between information dissemination and political preference, a crucial aspect of effective public health communication. To achieve these objectives they created a novel language model, which they dubbed Retweet-BERT, a sentence embedding model that incorporates the retweet network, inspired by Sentence Transformers (S-BERT) (Reimers and Gurevych 2019). The model is based on the assumption that users who retweet each other are more likely to share similar ideologies; it was evaluated thoroughly, achieving strong performance (96% cross-validated AUC). They identified three different Twitter user roles: information creators, information broadcasters, and information distributors. Right-leaning users were found to be more likely to act as broadcasters and distributors than left-leaning users, and were therefore noticeably more vocal and active in the production and consumption of COVID-19 information. As for echo chambers, they were found to be present at both ends of the political spectrum, but they are especially intense in the right-leaning community, whose members almost exclusively retweeted like-minded users. In contrast, far-left and nonpartisan users were significantly more receptive to each other’s information.
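The S-BERT building block underlying this approach can be illustrated with the sentence-transformers package; the sketch below is not Retweet-BERT itself, and the model name and the two user profile descriptions are assumptions made for the example.

```python
from sentence_transformers import SentenceTransformer, util

# Hedged sketch of sentence embeddings with S-BERT (not Retweet-BERT).
# The model name and the two user "profiles" are illustrative only.
model = SentenceTransformer("all-MiniLM-L6-v2")

profiles = [
    "Epidemiologist. Tweets about vaccines, public health and data.",
    "Patriot. No masks, no lockdowns, reopen the economy now.",
]
embeddings = model.encode(profiles, convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))  # cosine similarity between the two user profiles
```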

Li et al. (2021) used Chen et al.’s corpus to extract tweets produced by non-governmental organizations, which use Twitter to form communities and address social issues. They analysed a total of 2,558 US-based NGOs, which published 8,281,600 tweets. They focused on the NGOs’ distinctive networked communities via features such as retweets and mentions, and how the discourse evolves as new social issues appear. The authors found that, over time, as NGOs discussed the COVID-19 crisis and its social repercussions, distinct organizational communities arose around various topics. In addition, the use of social media helped eliminate geographical and specialization barriers, allowing NGOs with diverse identities and backgrounds to collaborate. They also observed that the patterns of tie formation in NGO communities largely mirrored the predictions of Issue Niche Theory.

The current version of the corpus at the time of writing (version 2.106, July 2023) contains over 2.77 billion tweets. English is the top language (64.3%), followed by Spanish (11.09%), Portuguese (3.78%), and French (3.7%).

The CCTC corpus is not without flaws: the authors acknowledge that there are some known gaps in the dataset due to Twitter API restrictions on data access and to the use of Twitter’s streaming API, which returns only about 1% of the total tweet volume, so the number of collected tweets also depends on the network connection and the filter endpoint. Additionally, the list of keywords used by the streaming API was modified and expanded as related terms (such as “lockdown” and “quarantine”) emerged, which explains the sudden increases in the number of tweets at specific points in time (see Fig. 3.2).

Despite these shortcomings, it is quite possibly the most valuable resource available for studying the worldwide impact of the pandemic through the voices of social media users. Its sheer size compensates for some of the limitations: for example, even though only a tiny proportion of tweets are geotagged (less than 0.01%), the absolute number of geotagged tweets is large enough to undertake contrastive studies that require this information.

The corpus is described in more detail in the following chapter, where specific figures are provided, along with the strategies and techniques followed to manage such a large dataset.