Our dataset is based on the IDs of roughly 23 million tweets containing the string “Brexit”, collected between May 5 and August 24, 2016 (Footnote 3). We downloaded all available tweets via the official Twitter API (roughly 20 million tweets) and processed the dataset as follows.
Firstly, we only consider original tweets (no retweets), which amount to approximately half of all successfully downloaded tweets; the distribution of originals is shown in Fig. 1. Secondly, we only keep tweets identified as English by Twitter, which leaves more than 7 million of the original tweets; the most frequent other languages were Spanish (655,799), French (314,609), German (200,066), Italian (190,530), and undefined (160,756). Thirdly, for the development phase, we restricted the dataset to tweets posted before the referendum on June 23, 2016, aiming at a relatively consistent dataset and expecting substantial differences between arguments presented before and after the referendum. Our final implementation will later be applied to the full dataset, enabling the comparison of argumentation patterns over time.
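The three filtering steps can be sketched as a single predicate. This is a minimal illustration, not the pipeline actually used: the field names (`is_retweet`, `lang`, `created_at`) are hypothetical, and treating midnight UTC on June 23 as the cutoff is an assumption.

```python
from datetime import datetime, timezone

# Referendum date from the text; using midnight UTC as the cutoff is an assumption.
REFERENDUM = datetime(2016, 6, 23, tzinfo=timezone.utc)


def keep_for_development(tweet: dict) -> bool:
    """Keep a tweet only if it is original, English, and pre-referendum."""
    if tweet["is_retweet"]:          # step 1: originals only
        return False
    if tweet["lang"] != "en":        # step 2: language label assigned by Twitter
        return False
    return tweet["created_at"] < REFERENDUM  # step 3: development subset


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Toy records with hypothetical field names.
tweets = [
    {"id": 1, "is_retweet": False, "lang": "en", "created_at": utc(2016, 6, 1)},
    {"id": 2, "is_retweet": True, "lang": "en", "created_at": utc(2016, 6, 1)},
    {"id": 3, "is_retweet": False, "lang": "es", "created_at": utc(2016, 6, 1)},
    {"id": 4, "is_retweet": False, "lang": "en", "created_at": utc(2016, 7, 1)},
]

kept_ids = [t["id"] for t in tweets if keep_for_development(t)]
print(kept_ids)
```

Dropping the third condition would correspond to applying the final implementation to the full dataset.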
In addition, we resolved reply threads by retrieving all available tweets for which there is a reply in our dataset, excluding non-English tweets. In this way, we can access dialogues between users, which are more likely to contain arguments. One example of such a reply thread is shown in Fig. 2. Note that our final database therefore also contains tweets sent before May 5, 2016 and tweets that do not contain the search string “Brexit”. There are 215,744 reply threads involving 688,905 tweets in our dataset; 85% of them (183,188 threads) could be resolved to their root.
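Resolving a thread to its root amounts to following reply links upward until a tweet without a parent is reached. The sketch below illustrates the idea under stated assumptions: `parent_of` is a hypothetical mapping from each retrieved tweet ID to the ID it replies to (or `None`), and a thread counts as unresolved when a parent could not be retrieved.

```python
def resolve_root(tweet_id, parent_of):
    """Follow reply links upward.

    Returns (top_id, resolved): resolved is True if top_id is a real root,
    False if the chain breaks at a tweet whose parent was not retrieved.
    """
    seen = {tweet_id}
    current = tweet_id
    while True:
        parent = parent_of.get(current)
        if parent is None:
            return current, True       # no parent: this is the root
        if parent not in parent_of or parent in seen:
            return current, False      # parent unavailable (or a cycle)
        seen.add(parent)
        current = parent


# Hypothetical data: 12 replies to 11, 11 replies to 10; 20 replies to a
# tweet (99) that could not be retrieved.
parent_of = {10: None, 11: 10, 12: 11, 20: 99}

print(resolve_root(12, parent_of))
print(resolve_root(20, parent_of))
```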
Finally, we excluded near-duplicates (most of them likely generated by social bots – see  for details on our deduplication heuristics) as long as they are not part of a reply thread. This way, we only consider genuine original content. Our final corpus consists of approximately 2.4 million tweets.
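The deduplication heuristics themselves are not described in this section. Purely as an illustration of the general idea, the following sketch treats two tweets as near-duplicates when their text is identical after lowercasing and removing URLs, user mentions, and extra whitespace; this normalisation is an assumption and not the authors' method. Tweets that belong to a reply thread are always kept, as stated above.

```python
import re

# Assumed normalisation: drop URLs and @mentions before comparing texts.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def normalise(text: str) -> str:
    text = URL_OR_MENTION.sub(" ", text.lower())
    return " ".join(text.split())  # collapse whitespace


def deduplicate(tweets, in_thread):
    """Keep the first tweet per normalised text; never drop thread members."""
    seen = set()
    kept = []
    for tweet_id, text in tweets:
        if tweet_id in in_thread:
            kept.append(tweet_id)
            continue
        key = normalise(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(tweet_id)
    return kept


tweets = [
    (1, "Vote Leave! https://t.co/abc"),
    (2, "vote leave!  https://t.co/xyz"),  # near-duplicate of 1
    (3, "Vote Leave! @bob"),               # duplicate text, but in a thread
    (4, "Brexit means Brexit"),
]
kept = deduplicate(tweets, in_thread={3})
print(kept)
```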
We used off-the-shelf software tools for tokenization and coarse part-of-speech (POS) tagging (Footnote 4) and a custom lemmatizer based on work by Minnen et al. POS taggers categorize words according to their parts of speech, e.g. verb or noun. Lemmatizers group together inflected forms of the same stem (e.g. takes, took, taken are all mapped to the lemma take). We additionally ran a tool for phrase chunking and named entity recognition (NER) [22, 23], which also tags tokens with around 50 fine-grained POS tags in Penn Treebank style, and combined the different tokenization layers in a post-processing step. A linguistically enriched corpus is an important prerequisite for formulating precise queries, cf. Sec. 2.3.
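To make the idea of combined annotation layers concrete, the sketch below writes a tweet in the verticalized text format that CWB reads: one token per line with tab-separated attributes, and region boundaries as XML tags on lines of their own. The attribute inventory (word, coarse POS, lemma), the tag letters, and the element names `tweet` and `np` are assumptions chosen for illustration; chunks are assumed not to overlap.

```python
def to_vertical(tweet_id, tokens, chunks):
    """Serialise one tweet as verticalized text.

    tokens: list of (word, pos, lemma) tuples
    chunks: list of (start, end, tag) with end exclusive, non-overlapping
    """
    opens, closes = {}, {}
    for start, end, tag in chunks:
        opens.setdefault(start, []).append(tag)
        closes.setdefault(end, []).append(tag)

    lines = [f'<tweet id="{tweet_id}">']
    for i, (word, pos, lemma) in enumerate(tokens):
        for tag in opens.get(i, []):
            lines.append(f"<{tag}>")
        lines.append("\t".join((word, pos, lemma)))
        for tag in closes.get(i + 1, []):
            lines.append(f"</{tag}>")
    lines.append("</tweet>")
    return "\n".join(lines)


# Hypothetical annotation with illustrative coarse tags.
tokens = [
    ("Brexit", "Z", "brexit"),
    ("is", "V", "be"),
    ("a", "D", "a"),
    ("mistake", "N", "mistake"),
]
chunks = [(0, 1, "np"), (2, 4, "np")]

vertical = to_vertical(1, tokens, chunks)
print(vertical)
```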
The annotated corpus has a total size of 32 million tokens. Unsurprisingly, the most frequent word forms are Brexit and the corresponding hashtag #Brexit, which together make up around 4% of all tokens, followed by punctuation marks and function words such as determiners and prepositions. The most frequent content lemmas (after brexit and the auxiliary be) are vote and eu (both about 0.8% relative frequency; Footnote 5). The coarse-grained POS system tagged approximately one third of all tokens as verb or noun, followed by proper nouns, prepositions, punctuation marks, determiners, adjectives, hashtags, URLs, pronouns, and adverbs. The NER system detected around 10 million noun phrases, 4 million verb phrases, 3 million prepositional phrases, and 2 million named entities. The annotation of a typical tweet is shown in Table 1; this tweet contains a match of the query presented in Sec. 4 (Footnote 6).
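The relative frequencies reported here are token counts divided by corpus size. A minimal sketch of that computation on a toy lemma sequence (the data is invented and only shows the arithmetic):

```python
from collections import Counter


def relative_frequencies(lemmas):
    """Map each lemma to count / total, ordered from most to least frequent."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    return {lemma: n / total for lemma, n in counts.most_common()}


# Invented toy sequence of ten lemmas.
toy = ["brexit", "be", "vote", "eu", "brexit",
       "the", "vote", "brexit", "be", "the"]

freqs = relative_frequencies(toy)
for lemma, share in freqs.items():
    print(f"{lemma}\t{share:.1%}")
```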
Our corpus queries serve to extract argumentation from the data (cf. Sec. 4). Having queries as the central element of extraction allows us to combine lexical and grammatical patterns with word lists. At the same time, the formulation of explicit queries incorporating a fixed linguistic structure allows us to handle the noisy data prevalent in social media, as the queries can capture typical phenomena on the level of syntax, vocabulary and phraseology.
Our query architecture builds on the IMS Corpus Workbench (CWB), a system designed for enabling complex linguistic searches on large corpora. The query language is based on regular expressions and allows for the incorporation of various levels of annotation. All grammatical information added to the corpus during pre-processing (cf. Sec. 2.2) can be accessed for each individual word or region – for instance, [pos="N"] will retrieve any word identified as a noun by the POS tagger, while [lemma="have"] finds all forms of have (have, has, had, having). Similarly, phrase chunks like <np>…</np> specify a sequence tagged as a noun phrase. These elements can be freely combined: <np> [pos="N"]+ </np> matches a noun phrase consisting only of one or more nouns. For initial query development, we used the web-based concordancing front-end CQPweb, allowing us to browse query results, view and sort context displays and perform statistical analyses.
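The meaning of these query elements can be illustrated with a toy re-implementation in plain Python. This is not CWB and does not cover the full query language; it only mimics the two examples from the text as they are described there, on a hypothetical annotated sentence with illustrative tags.

```python
def forms_of(tokens, lemma):
    """Mimic [lemma="..."]: all word forms annotated with the given lemma."""
    return [tok["word"] for tok in tokens if tok["lemma"] == lemma]


def noun_only_nps(tokens, np_spans):
    """Mimic <np> [pos="N"]+ </np>: noun phrases consisting only of nouns."""
    matches = []
    for start, end in np_spans:  # end is exclusive
        span = tokens[start:end]
        if span and all(tok["pos"] == "N" for tok in span):
            matches.append([tok["word"] for tok in span])
    return matches


# Hypothetical annotation: (word, coarse POS, lemma).
rows = [
    ("the", "D", "the"), ("voters", "N", "voter"),
    ("have", "V", "have"), ("had", "V", "have"),
    ("enough", "A", "enough"), ("of", "P", "of"),
    ("project", "N", "project"), ("fear", "N", "fear"),
]
tokens = [{"word": w, "pos": p, "lemma": l} for w, p, l in rows]
np_spans = [(0, 2), (6, 8)]  # "the voters", "project fear"

print(forms_of(tokens, "have"))
print(noun_only_nps(tokens, np_spans))
```

The first noun phrase is rejected because it contains a determiner, while the second consists of nouns only.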
To support the particular needs of RANT, we implemented our own Python wrapper around CWB (Footnote 7) and developed a bespoke web application to manage word lists and queries, and to display results in a way tailored to the needs of argument extraction. In comparison to CQPweb, our app places its central focus on enabling the management of multiple query patterns rather than on individual queries and their statistical properties. This is achieved in particular by supporting complex macros and allowing the user to build and semi-automatically expand word lists.
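How a managed word list might feed into a query can be sketched with plain string handling. The functions below are hypothetical and do not reproduce the API of the wrapper or the web application; they only show one way to turn a word list into a token expression with a regular-expression disjunction, and to expand a list with candidates that a user confirms.

```python
import re


def wordlist_to_cqp(attribute, words):
    """Build a token expression such as [lemma="a|b|c"] from a word list."""
    alternatives = "|".join(re.escape(w) for w in sorted(set(words)))
    return f'[{attribute}="{alternatives}"]'


def expand(wordlist, candidates, accept):
    """Semi-automatic expansion: add only the candidates the user accepts."""
    return sorted(set(wordlist) | {c for c in candidates if accept(c)})


# Invented example list; `accept` stands in for a manual review step.
stance_verbs = ["think", "believe"]
stance_verbs = expand(stance_verbs, ["reckon", "banana"],
                      accept=lambda w: w != "banana")

query = wordlist_to_cqp("lemma", stance_verbs) + ' "that"'
print(query)
```

Keeping the list separate from the query string means that a single change to the list carries over to every pattern that uses it, which is the kind of reuse that motivates managing multiple query patterns together.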