1 Introduction

The development of Internet and mobile technologies has dramatically changed the way individuals communicate and acquire information. Consequently, social media have gained a central role for research in the field of computational social science that investigates questions using quantitative techniques and big data (Lazer et al. 2009; Cioffi-Revilla 2010).

Social media can be defined as any web-based and mobile-based Internet applications that provide real time communication and information, allowing the creation, access and exchange of user-generated content that is ubiquitously accessible. A taxonomy of social media can be found in Kaplan and Haenlein (2010) and a survey of techniques, tools and platforms exploited in social media analytics is provided by Batrinca and Treleaven (2015). The application of quantitative techniques to social media data enables a deeper comprehension of social, political, and economic phenomena. In particular, the impact of digital platforms on the production, distribution, and consumption of political information has been extensively analysed in the literature. For instance, Ceron et al. (2017) provide a review of different techniques using social media to nowcast and forecast elections. A comprehensive overview of the main transformations induced by social media on the information process is presented in Casero-Ripollés (2018), while Jungherr and Theocharis (2017) discuss the opportunities along with the pitfalls of the continuously growing use of digital data in political science. A survey of research on political events on social media can be found in Korakakis et al. (2017). The authors focus mainly on Twitter and identify three main areas of interest: prediction of electoral results, sentiment analysis in political topics and opinion polls, and, finally, social analysis of human behaviour related to the interaction between politicians and citizens.

Social media have also acquired a central role during the most significant political event of the last fifty years in the UK as testified by the huge volume of information and discussion on Brexit on online platforms. The Brexit process, in spite of its recent date, has been the subject of several social media studies which concentrate mainly on the prediction of the results of the referendum and on the influence of social media in shaping the vote. Focusing on the information shared on Twitter, Khatua and Khatua (2016) investigated the 2016 Brexit referendum analysing an exhaustive set of hashtags, selected by considering the lack of ambiguity of their political leaning. Hänska-Ahy and Bauchowitz (2017) analysed 7.5 million tweets and found how the predominance of Euroscepticism on social media mirrored its dominance in the press. Howard and Kollanyi (2016) carried out a preliminary study on the use of political bots during the Brexit campaign. Grčar et al. (2017) addressed the stance of the Twitter users in relation to the referendum outcome polls and identified the influential users on both sides of the Brexit debate. An opinion analysis on the British EU membership referendum within Twitter is reported in the study of Llewellyn and Cram (2016). By adopting three different search strategies, the authors collected tweets from specific groups to explore how topics and language differ among groups and how those groups influence each other. Lansdall-Welfare et al. (2016), using simultaneous multiple change-point analysis, tried to capture changes in public mood in the days before and after the Brexit referendum. Moving along these lines of research, Hürlimann et al. (2016) present a dataset of sentiment-annotated tweets targeting the historical event of Brexit to categorise the social and discourse dynamics behind this political event as well as the strength of the sentiment.

In this paper, we explore Twitter conversations collected around the time of the UK’s planned withdrawal from the EU. Specifically, we queried for the tweets containing the keyword Brexit and posted between the end of December 2018 and the first week of February 2020. By exploiting approximately 30 million Brexit-related tweets, our overall goal is to explore the prominent themes discussed, and in particular to identify the topics that have attracted the most attention from users and how they evolve over time. We address these research questions through topic modelling techniques (Blei 2012). Probabilistic topic modelling consists of a collection of methods which specify a probabilistic generative model for the documents with the purpose of discovering and annotating large archives of textual documents with thematic information. In our study, we implement both standard (Blei et al. 2003) and dynamic (Blei and Lafferty 2006) Latent Dirichlet Allocation (LDA) models. The general idea behind LDA is that documents with the same topic will use similar words, and the key assumption is that documents are mixtures of topics, so that the central interest lies in discovering a topic distribution for each document and a word distribution for each topic. In addition, the dynamic version of LDA makes it possible to analyse topic distributions over time and to gain insights into their changes and evolution.

The remainder of this paper is organized as follows. Section 2 introduces probabilistic topic models, focusing in particular on LDA and its dynamic version, which relaxes the assumption that all documents are generated in the same time step. Section 3 describes the large collection of tweets related to Brexit, posted between 31 December 2018 and 9 February 2020, and provides results of the exploratory analysis of the Twitter activity to uncover temporal patterns. Results of the hashtag analysis are also presented. Section 4 discusses the main findings of the topic analyses. Finally, Sect. 5 concludes the paper and considers possible future work.

2 Probabilistic Topic Models

Topic modelling provides a powerful method for projecting text documents into topic space and it has been widely applied in many fields, ranging from information retrieval (Boyd-Graber et al. 2017; Wei and Croft 2006), to information visualization (Wang et al. 2016), and to recommendation systems (Wang and Blei 2011).

The application of automatic topic mining techniques to large electronic document archives obtained from social media channels constitutes an important tool in computational social science for detecting hidden topics in online discussions. Among several application fields, researchers have also introduced the topic modelling approach into political science studies focusing, in particular, on content shared on Twitter (see, for example, Karami et al. 2018; Fang et al. 2019 and references therein).

Depending on the problem at hand, there are many approaches and techniques one can use to extract and manage large volumes of data. The two foundational probabilistic topic models are the probabilistic latent semantic analysis (pLSA, Hofmann 1999) and the Latent Dirichlet Allocation (Blei et al. 2003). The pLSA is a probabilistic variant of the Latent Semantic Analysis introduced by Deerwester et al. (1990) to capture the meaning or semantic information embedded in large textual corpora without human supervision. In the pLSA approach, each word in a document is modelled as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics”. The pLSA model allows multiple topics in each document, and the possible topic proportions are learned from the document collection. Blei et al. (2003) introduced the LDA model, which offers greater modelling flexibility than pLSA by assuming a fully probabilistic generative model in which each document is represented as a random mixture over latent topics and each topic is characterized by a distribution over words. According to the generative model, each document in the corpus is generated in a two-stage process (Blei 2012; Steyvers and Griffiths 2006). First, a distribution over topics is randomly chosen; then each word in the document is generated by first sampling a topic from this topic distribution, and then choosing a word from the corresponding topic-word distribution over the vocabulary. A detailed derivation of the LDA can be found in Blei et al. (2003).
Standard statistical techniques can be used to invert the generative process in order to infer the set of topics that were responsible for generating the collection of observed documents. A wide variety of approximate inference algorithms can be considered, such as sampling-based algorithms (see, for example, Steyvers and Griffiths 2006, for a detailed derivation of a Gibbs sampler for LDA) and variational algorithms (Blei et al. 2003).
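The two-stage generative process described above can be illustrated with a minimal Python sketch. It relies only on the standard library; the toy topic-word distributions, the Dirichlet parameter `alpha` and the function names are hypothetical choices made for illustration and do not come from the analysis reported in this paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document following the two-stage LDA generative process."""
    # Stage 1: draw the document-specific distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    topics = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        # Stage 2a: sample a topic from the document's topic distribution.
        z = rng.choices(topics, weights=theta, k=1)[0]
        # Stage 2b: sample a word from the word distribution of that topic.
        vocab = list(topic_word[z].keys())
        probs = list(topic_word[z].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Toy topic-word distributions (hypothetical, for illustration only).
TOPIC_WORD = [
    {"deal": 0.5, "vote": 0.3, "parliament": 0.2},
    {"trade": 0.6, "border": 0.3, "economy": 0.1},
]

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORD, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Inference runs in the opposite direction: given only the observed words, it recovers plausible values for `theta` and for the topic-word distributions.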

The statistical assumptions behind standard LDA include that both words and documents are exchangeable (i.e. the order does not matter) and that all documents are generated in the same time step. The latter assumption is relaxed in dynamic topic models (Blei and Lafferty 2006; Wang and McCallum 2006; Wang et al. 2008; Iwata et al. 2010), which respect the ordering of the documents and give a richer posterior topical structure than LDA. In this case, rather than a single distribution over words, a topic is a sequence of distributions over words, and it is possible to find an underlying theme of the collection and track how it has changed over time. In this work, we refer to the dynamic version of topic models (DTM) proposed by Blei and Lafferty (2006), which specifies a statistical model of topic evolution. In this approach, documents are divided into a set of sequential non-overlapping time slices and the basic assumption is that the topics associated with slice t evolve from the topics associated with slice \(t-1 \). Therefore, the documents of each slice are modelled through a K-component topic model, and both the natural parameters of the underlying topic distribution and the natural parameters of the distributions for the document-specific topic proportions, associated with slice t, are chained in a state space model. Blei and Lafferty (2006) propose to use Variational Kalman Filtering or Variational Wavelet Regression to estimate the model parameters.
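To make the chaining of natural parameters more concrete, the following Python sketch simulates how the word distribution of a single topic might drift across slices. It assumes the Gaussian random-walk form used by Blei and Lafferty (2006), in which the natural parameters of slice t are centred on those of slice \(t-1 \) and mapped to word probabilities through a softmax; the initial parameters, the value of `sigma` and the function names are illustrative assumptions, and the code simulates the prior only (it performs no inference).

```python
import math
import random


def softmax(values):
    """Map natural parameters to a probability distribution over words."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(beta0, n_slices, sigma, rng):
    """Chain the natural parameters of one topic through a Gaussian random walk."""
    betas = [list(beta0)]
    for _ in range(1, n_slices):
        previous = betas[-1]
        # Parameters of slice t are centred on those of slice t-1.
        betas.append([rng.gauss(b, sigma) for b in previous])
    # One word distribution per time slice.
    return [softmax(b) for b in betas]


rng = random.Random(1)
# Hypothetical three-word vocabulary and starting parameters.
trajectory = evolve_topic(beta0=[1.0, 0.0, -1.0], n_slices=5, sigma=0.1, rng=rng)
```

A small `sigma` keeps consecutive word distributions close to each other, which is what allows a topic to be tracked as a coherent theme over time.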

3 Data Collection and Twitter Traffic Temporal Evolution

For our study we use a Twitter dataset, collected for 58 weeks, spanning from 31/12/2018 to 09/02/2020. The data were extracted by using Twitter’s Streaming Application Programming Interface (API) and we searched for tweets, written in English, containing the term Brexit. Our sample includes 135,607,216 tweets, of which 102,176,840 are retweets.

There are 6,811,652 tweets with at least one retweet. Of these, approximately 90% were retweeted fewer than 15 times. In our research, we focus on pure tweets (i.e., tweets that are not retweets).

We developed a Python script to perform the screening and cleaning process (tokenization; lowercase conversion; removal of special characters, URLs, mentions and stop words) of text documents in order to extract the relevant content and remove any unwanted nuisance terms. The nltk.word_tokenize function in the NLTK Python package (Bird et al. 2009) has been employed for tokenization. Some tokens have been merged according to a multiword list, generated ad hoc for this dataset. The dataset contains 2,148,421 unique tokens; about 65% of these distinct tokens are hapax legomena (i.e. words that occur only once within the entire corpus) and around 90% are used fewer than 10 times in the entire corpus. For the analyses, the tokens with fewer than 4000 occurrences in the corpus have been pruned and the stop words removed. The stop word list contains a set of words that carry no inherently useful information, either because they do not have any meaning on their own (prepositions, coordinating conjunctions, determiners, etc.) or because they are too frequent. If they are not removed, their overwhelming presence hampers the identification of key concepts from textual sources. The stop word list for our analysis also contains the word Brexit, which was used as the search term and is thus present in all the tweets. The final vocabulary consists of 6181 tokens and the most frequent words are visualised through a wordcloud in Fig. 1, where the size of each word is proportional to how frequently the term is used in the corpus. The words that appear most frequently are: “eu”, “party”, “uk”, “vote”, “people”, “nodeal”, “out”, “deal”, “labour”, “may”, “leave”, “boris”, “voted”, “johnson”, “country”, “tory”, “remain”. These terms recorded more than one million occurrences.
Political words such as “government”, “parliament”, party denominations and politician names are also evident in the cloud, highlighting that the actions taken by Government, Parliament and political actors are at the center of the discussion, whereas terms related to economy seem to be less recurrent.
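The cleaning steps listed above can be sketched in Python as follows. This is a simplified illustration rather than the script used for the study: a regular-expression tokenizer stands in for nltk.word_tokenize so that the example has no external dependencies, and the stop word list, the multiword list and the function names are small hypothetical placeholders.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9]+")  # stand-in for nltk.word_tokenize

# Hypothetical, heavily abridged lists (the real ones are dataset-specific).
STOP_WORDS = {"the", "a", "of", "to", "and", "is", "brexit"}
MULTIWORDS = {("no", "deal"): "nodeal"}


def clean_tweet(text):
    """Lowercase, strip URLs and mentions, tokenize, merge multiwords, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)  # also discards special characters
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            merged.append(MULTIWORDS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return [t for t in merged if t not in STOP_WORDS]


def prune_vocabulary(token_lists, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {t for t, c in counts.items() if c >= min_count}


example = clean_tweet("No deal #Brexit is a disaster @user https://t.co/x")
vocabulary = prune_vocabulary([["eu", "deal"], ["eu", "vote"]], min_count=2)
```

With the threshold described in the text, `min_count` would be set to 4000; the value 2 is used here only because the toy corpus is tiny.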

Fig. 1 Wordcloud of the most frequent words

Twitter activity can be tracked by inspecting users’ posting activity (total number of tweets) and their ability to engage their followers for support (i.e., through retweets of their posts). The daily volume of tweets is represented in Fig. 2, along with some of the most relevant dates on the Brexit timeline. In Table 1, we indicate some specific dates on which we observe a relevant variation in Twitter activity, broken down by total number of tweets, pure tweets and retweets. We also report the ratio between retweets and pure tweets.

Fig. 2 Brexit-related tweet distributions over days, from 31/12/2018 to 09/02/2020

Table 1 Peak dates in the temporal distribution of Brexit-related tweets

The peak dates are linked to important events in the Brexit process. The highest peaks in the Twitter traffic are clearly associated with 13 December 2019, the day of the General election results (when the Conservative party secured a huge majority) and with 27 May 2019, when the results of the UK’s European Parliament elections were announced: the Brexit Party was the clear winner, the pro-EU Liberal Democrats came second, and the Conservative and Labour Parties suffered heavy losses. Starting from the beginning of 2019, important peaks coincide with the dates of the Brexit meaningful votes and indicative votes. The meaningful votes on the Withdrawal Agreement that the Conservative government had reached with the European Union took place in the House of Commons between January and March 2019: the bill was decisively defeated three times, following a major revolt amongst Conservative backbenchers. In the two rounds of indicative votes, held on a series of non-binding resolutions on alternative Brexit options, all the proposals were rejected. A sudden increase in Brexit tweets is evident on 28 August, the date when the Queen granted Prime Minister Boris Johnson’s request to suspend Parliament from 10 September until 14 October. Following this, a massive escalation in the number of Twitter users joining in the Brexit debate is apparent on 3 and 4 September. This coincides with the emergency debate on the so-called Benn Bill, or Benn Act, that was aimed at ruling out a unilateral no-deal Brexit by forcing the Government either to reach an Agreement or to get parliamentary approval for a no-deal Brexit, or else (if neither condition was fulfilled by 19 October) extend the deadline to 31 January 2020. High Twitter traffic was recorded on 19 October 2019, when MPs, instead of backing Johnson’s agreement in a meaningful vote, passed the second Letwin amendment, which forced the Government to request from the EU a delay to Brexit until 31 January 2020. An inspection of Fig. 2 reveals a final sharp increase in the number of tweets (more than 870,000 in total) on 31 January 2020, Brexit day.

An in-depth analysis can be obtained by tracking the top hashtags, which are a well-established means of categorising tweets by content and are included for emphasis. The number of pure tweets containing at least one hashtag is 7,994,833. Apart from #Brexit, which appears in 5,135,423 pure tweets, according to our data the hashtags contained in more than 100,000 pure tweets are: #peoplesvote, #eu, #uk, #stopbrexit, #remain, #brexitshambles, #borisjohnson, #revokearticle50, #fbpe, #labour, and #nodeal. In the contest between the two opposing positions of leavers and remainers, Fig. 3 shows that #remain (153,945 pure tweets) greatly exceeds #leave (79,109), with the biggest gap evident in May 2019. There was a reduction in the occurrences of both hashtags over the following months and the difference between their frequency fell sharply, starting from the second half of December 2019 when some key events related to Brexit occurred. In particular, on 19 December 2019 the New Withdrawal Agreement Bill was introduced in Parliament. The hashtags #remainers and #brexiteers show analogous popularity and similar temporal patterns.
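Counts such as those reported above refer to the number of pure tweets containing a given hashtag, so a hashtag repeated within the same tweet should be counted only once. A minimal Python sketch of this kind of count is given below; the regular expression, the case folding and the function name are assumptions made for illustration and are not taken from the original analysis.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_tweet_counts(tweets):
    """Count, for each hashtag, the number of tweets in which it appears."""
    counts = Counter()
    for text in tweets:
        # A set ensures each hashtag is counted at most once per tweet.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        counts.update(tags)
    return counts


# Toy tweets, invented for illustration.
toy_tweets = [
    "#Remain #remain now",
    "#Leave means leave",
    "no tags here",
    "#remain #FBPE",
]
toy_counts = hashtag_tweet_counts(toy_tweets)
```

Applying the same function to the tweets of each week separately would give weekly series of the kind compared for #remain and #leave.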

Fig. 3 Temporal distribution of “Leave” and “Remain” hashtags over weeks, from 31/12/2018 to 09/02/2020

Table 2 Occurrences of the most popular hashtags for the period ranging from 31/12/2018 to 09/02/2020

Attitudes critical of the Brexit process are confirmed by several hashtags which reflect the viewpoint of Twitter users: as shown in Table 2, among the most widely used hashtags are #peoplesvote, #revokearticle50 and #fbpe (i.e. #FollowBackProEU), which are significantly more widespread than #leavemeansleave, #brexitbetrayal and #britishindependence. Figures 4 and 5 reveal how most of the pro- and anti-Brexit hashtags had a wider diffusion during the first months of 2019; some of them (e.g. #revokearticle50, #peoplesvotemarch, #peoplesvotenow, #putittothepeople, #leavemeansleave, #brexitbetrayal) reached their highest levels of popularity during March 2019. Hashtags calling for a stop to the Brexit process (#stopbrexit, #stopbrexitsavebritain, #indyref2) show a longer persistence, beginning to tail off in the second half of October 2019. September 2019 saw the rise of #britishindependence, followed by #getbrexitdone, which persisted until December 2019.

Fig. 4 Heatmap of anti-Brexit hashtags. The color represents biweekly occurrences from 31/12/2018 to 09/02/2020, normalised with respect to the total occurrences of each hashtag. (Color figure online)

Fig. 5 Heatmap of pro-Brexit hashtags. The color represents biweekly occurrences from 31/12/2018 to 09/02/2020, normalised with respect to the total occurrences of each hashtag. (Color figure online)

A number of hashtags indicate the difficulties encountered during the Brexit process (e.g. #brexitshambles, #brexitchaos, #brexitmayhem) and their temporal dynamic shows that they were mostly in use just before the first Brexit withdrawal deadline (29 March 2019) and after the suspension of Parliament on 10 September 2019 (Fig. 6).

Fig. 6 Heatmap of hashtags critical of the Brexit process. The color represents biweekly occurrences from 31/12/2018 to 09/02/2020, normalised with respect to the total occurrences of each hashtag. (Color figure online)
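The normalisation used in the hashtag heatmaps (biweekly occurrences divided by the total occurrences of each hashtag) can be sketched in a few lines of Python. The counts below are invented and the function name is hypothetical; the sketch only illustrates that each row of the heatmap is rescaled so that its values sum to one, which makes hashtags with very different overall popularity comparable.

```python
def normalise_by_total(counts_by_hashtag):
    """Divide each biweekly count by the total occurrences of the same hashtag."""
    normalised = {}
    for tag, counts in counts_by_hashtag.items():
        total = sum(counts)
        # Guard against hashtags that never occur in the period.
        normalised[tag] = [c / total if total else 0.0 for c in counts]
    return normalised


# Invented biweekly counts for two hashtags.
toy_biweekly = {"#brexitshambles": [10, 30, 60], "#brexitchaos": [1, 1, 2]}
toy_heatmap = normalise_by_total(toy_biweekly)
```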

Details of the most widely used hashtags relating to politicians and political parties, and of their temporal dynamic, can be found in the online Supplementary Material.

4 Results

In order to investigate recurrent themes emerging from the Brexit debate, probabilistic topic models were used which allow the extraction of coherent topics hidden within a huge volume of text. For this purpose both the standard LDA and its dynamic version were applied; the results are provided in the following sections.

4.1 Discovering Topics Associated with Brexit Tweets Through LDA

To perform LDA, we consider a corpus of \(N=9744\) documents, where each document consists of all the tweets posted within a one-hour time span, and we set the input parameter related to the number of desired topics (K), in turn, equal to 10, 15, 20, 25 and 30. The analysis was performed through the fitlda Matlab routine available in the Text Analytics Toolbox (MATLAB 2018). In particular, we used the collapsed variational Bayes algorithm (Teh et al. 2006).
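The number of documents follows directly from the observation window: 58 weeks of 7 days and 24 hours give \(58 \times 7 \times 24 = 9744\) hourly slots. The Python sketch below illustrates one possible way of aggregating tokenized tweets into hourly documents; the input format (pairs of timestamp and token list) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

N_HOURLY_SLOTS = 58 * 7 * 24  # 58 weeks of hourly windows


def build_hourly_documents(tweets):
    """Pool the tokens of all tweets posted within the same clock hour.

    `tweets` is an iterable of (timestamp, tokens) pairs.
    """
    docs = defaultdict(list)
    for timestamp, tokens in tweets:
        # Truncate the timestamp to the start of its hour.
        hour_key = timestamp.replace(minute=0, second=0, microsecond=0)
        docs[hour_key].extend(tokens)
    return dict(sorted(docs.items()))


# Toy input, invented for illustration.
toy_stream = [
    (datetime(2019, 1, 15, 19, 5), ["vote", "deal"]),
    (datetime(2019, 1, 15, 19, 40), ["defeat"]),
    (datetime(2019, 1, 15, 20, 1), ["confidence"]),
]
hourly_docs = build_hourly_documents(toy_stream)
```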

To select the number of topics, we considered the UMass coherence measure (Mimno et al. 2011) which, for each topic k, is defined as

$$\begin{aligned} C(k,W^{(k)})=\sum _{m=2}^M\sum _{l=1}^{m-1}\log \frac{D\left( w_m^{(k)},w_l^{(k)}\right) +1}{D\left( w_l^{(k)}\right) } \end{aligned}$$

where \(W^{(k)} =(w_1^{(k)},\ldots ,w_M^{(k)})\) is the list of the top M words for topic k, D(w) is the document frequency of word type w (i.e., the number of documents with at least one token of type w) and \(D(w, w')\) is the co-document frequency of word types w and \(w'\) (i.e., the number of documents containing at least one token of each type). The coherence measure, computed considering the top 25 words for each topic, suggested a twenty-topic solution, as shown in Table 3.
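For illustration, the coherence measure defined above can be computed with the short Python function below. It is a direct transcription of the formula under two assumptions: documents are given as lists of tokens, and every top word occurs in at least one document, so that \(D(w_l^{(k)})\) is never zero. The toy corpus and the function name are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic, given its ordered list of top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given word types.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


# Toy corpus of four documents, invented for illustration.
toy_docs = [["eu", "deal"], ["eu"], ["eu"], ["vote"]]
toy_score = umass_coherence(["eu", "deal"], toy_docs)
```

In the toy corpus, "eu" appears in three documents and co-occurs with "deal" in one of them, so the score reduces to \(\log (2/3)\). Values closer to zero indicate that the top words of a topic tend to appear in the same documents.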

Table 3 Coherence measures for the choice of the number of topics in the LDA

The sensitivity analysis performed by varying the number of top words in the range 10–30 confirmed the choice of 20 topics, as shown in the Supplementary Material. From an interpretative point of view, the model estimated with \(K=20\) strikes a good balance between disclosing relevant information and keeping the topics uncluttered.

In LDA, the topics are assumed to be latent variables, which need to be intuitively interpreted and, as pointed out by Steyvers and Griffiths (2006), the topics are individually interpretable, providing a probability distribution over words that picks out coherent clusters of correlated terms. All the estimated topics, along with the top ten relevant terms, are listed in Table 4. Words are ordered according to their relevance, obtained by normalising the posterior word probabilities per topic by the geometric mean of the posterior probabilities for the word across all topics.
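One possible reading of the relevance score described above is the plain ratio between the probability of a word within a topic and the geometric mean of its probabilities across all topics. The Python sketch below implements this reading; whether the plain ratio or a log-scaled variant was actually used is not stated in the text, so the exact form, like the toy probabilities and the function name, should be regarded as an assumption.

```python
import math


def relevance_scores(topic_word_probs):
    """Divide each p(word | topic) by the geometric mean of p(word | .) over topics.

    `topic_word_probs[k][w]` is the (strictly positive) probability of word w in topic k.
    """
    n_topics = len(topic_word_probs)
    n_words = len(topic_word_probs[0])
    geo_means = [
        math.exp(sum(math.log(topic_word_probs[k][w]) for k in range(n_topics)) / n_topics)
        for w in range(n_words)
    ]
    return [
        [topic_word_probs[k][w] / geo_means[w] for w in range(n_words)]
        for k in range(n_topics)
    ]


# Two toy topics over a two-word vocabulary (invented values).
toy_probs = [[0.8, 0.2], [0.2, 0.8]]
toy_relevance = relevance_scores(toy_probs)
```

Under this reading, a word that is equally probable in every topic receives a score of one everywhere, so generic terms are not promoted to the top of any topic.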

In Fig. 7, for each topic, we represent the document-topic probabilities. As previously stated, each document consists of tweets posted over a time span of an hour, and, therefore, this representation allows us to follow topics’ temporal evolution. A more detailed representation of each topic, coupling wordclouds with temporal dynamic, is provided in the Supplementary Material.

Table 4 LDA analysis: top ten terms within the 20 topics sorted according to their relevance scores

Fig. 7 Trajectories of topic-document probabilities in the LDA analysis

As can be seen from Table 4, Topics 1 and 2 relate to fears for the economic and social consequences of a “no-deal” Brexit scenario. Specifically, Topic 1 contains references to contracts awarded to three ferry companies to handle the potential additional need for roll-on roll-off lorry freight capacity in the event of a no-deal Brexit. It was later discovered that one of these companies, Seaborne Freight, did not, in fact, have any ferries. The Transport Secretary, Chris Grayling, came under considerable pressure and later decided to cancel the ferry contracts. Topic 2 focuses on alarm over possible shortages of food and medical supplies, and on the debate over the compromise on the Irish border backstop included in Theresa May’s Brexit deal and aimed at ensuring that no “hard border” (physical checks and infrastructure) should be reintroduced in Ireland. It is evident from Fig. 7 that Topics 1 and 2 were very prominent in the Brexit debate from December 2018 to March 2019, while Topic 2 also dominated the discussions during the months preceding the vote held in October 2019. The words retrieved under Topic 3 clearly refer to the car industry crisis of February 2019, which concerned Japanese automobile and insurance companies. On 19 February, Honda announced its intention to close its Swindon manufacturing plant in 2021. At the same time, Nissan decided to withdraw investment from its Sunderland plant, while Jaguar Land Rover and Ford announced job cuts. Discussions concerning the meaningful votes and the defeat of the government are captured in Topic 4. This topic appears to be highly localised in time and the peaks correspond to the days in which the votes were held. Topic 5 captures Article 50 extensions and the protest movements organised especially around the first Brexit withdrawal deadline (29 March 2019). In March 2019, the Speaker of the House of Commons, John Bercow, ruled out a third meaningful vote on Theresa May’s Brexit deal. 
This event is found in Topic 6, along with discussion of the indicative votes. Topic 7 focuses on the pro-Brexit politician Nigel Farage, and especially refers to an incident which saw Farage being assaulted by an opponent of Brexit during an electoral campaign event held in Newcastle before the European elections. Topic 8 characterises the Brexit debate during the month preceding the May 2019 European Parliament election. It reflects the electoral debate featuring the UK’s major political parties. Topic 9 was evidently highly relevant during June and July 2019. The words retrieved relate to concerns expressed by the most important members of the Conservative and Brexit parties after the Labour party victory in the Peterborough by-election of 6 June 2019. The top scoring words for Topic 10 clearly refer to the attempt by Prime Minister Boris Johnson to suspend Parliament’s activities. Chaos broke out in Parliament after the suspension, and impromptu protests were held in major cities across the country to “stop the coup”. Indignation was also expressed by John Bercow, the House of Commons Speaker, who described the initiative as a “constitutional outrage”. This topic was prominent during September 2019. Figure 7 shows that, soon after this event, the Brexit debate focused on the Supreme Court judgement over the attempt to prorogue Parliament: the Court ruled that the prorogation of Parliament was unlawful and in breach of Britain’s constitution. These events feature in Topic 11. Discussions over Brexit intensified as the October deadline drew closer, and concerned the UK and EU positions over the agreed deal (Topic 12). After he had reached an agreement with the EU leaders, Boris Johnson sought Parliamentary approval. While European Commission president Jean-Claude Juncker appeared to rule out an extension to Brexit, Northern Ireland’s Democratic Unionist party rejected the deal. The UK general election held in December 2019 characterises Topic 13.
Topic 14 captures discussions of the negotiations involving the National Health Service as part of the USA-UK deal. The most relevant terms highlight the strong position of the Labour leader, Jeremy Corbyn, who raised concerns over the implications of giving US companies access to the British health service. Reflections on the defeat of the Labour party in the UK general election, which are captured in Topic 15, dominated the discussion in mid to late December. There was widespread speculation that responsibility for the defeat could be ascribed only to Jeremy Corbyn, and in particular to his reputation for anti-Semitism. He appeared to be widely mistrusted by the British electorate. Words in Topic 16 are linked to discussions trending in the days preceding the Brexit deadline on 31 January 2020: there were references to a crowdfunding campaign, run by the StandUp4Brexit group, to raise money to pay for making the bell of Big Ben ‘bong for Brexit’ (which never in fact happened); to the issue of a commemorative 50p Brexit coin; and to plans for, and concerns about, the post-Brexit transition period.

Topic 17 clearly refers to celebrations by Brexit supporters. On 31 January, after three and a half years of negotiations, the UK became the first country to leave the EU, and celebrations were held all over the country.

The discussion in Topic 18 is not clearly localised in time, and the principal terms do not help to identify a clear theme. Finally, the words in the last two topics refer to the two UK Prime Ministers who were protagonists of the Brexit negotiation, and Fig. 7 clearly highlights the transition between Theresa May’s and Boris Johnson’s leadership. On 24 May PM Theresa May bowed to intense pressure from her own party and named 7 June as the day she would resign as Conservative leader. At the end of July 2019 Boris Johnson won the Conservative leadership contest and took over as the UK’s prime minister.

4.2 Dynamic Structure of Topics Associated with Brexit Tweets

To detect topics that remain stable over time, we applied dynamic topic modelling, as described in Sect. 2. After separating the documents into time periods, DTM applies a state space model that handles the transition of topics from one period to the next.

In this analysis, we consider 58 temporal slices, spanning the period from 31 December 2018 to 09 February 2020, so that each slice corresponds to a weekly window. The corpus consists of 33,430,376 pure tweets. The analysis was performed using the Gensim Python library (https://radimrehurek.com/gensim; Rehurek and Sojka 2011).
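Dynamic topic model implementations typically require the number of documents falling in each time slice; in Gensim, for instance, the LdaSeqModel class accepts this information through its `time_slice` argument, although the text does not state which Gensim class was used here. The Python sketch below shows one way of deriving the 58 weekly slice sizes from tweet dates. The start date follows the collection period reported above, while the function name and the toy dates are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

START = date(2018, 12, 31)  # first day of the collection period
N_SLICES = 58               # weekly windows up to 9 February 2020


def weekly_slice_sizes(tweet_dates, start=START, n_slices=N_SLICES):
    """Return the number of documents falling in each weekly slice."""
    # Integer division of the day offset by 7 gives the index of the week.
    counts = Counter((d - start).days // 7 for d in tweet_dates)
    return [counts.get(i, 0) for i in range(n_slices)]


# Toy dates, invented for illustration.
toy_dates = [
    date(2018, 12, 31),
    date(2019, 1, 6),
    date(2019, 1, 7),
    date(2020, 2, 9),
]
slice_sizes = weekly_slice_sizes(toy_dates)
```

The documents must be sorted chronologically before being passed to the model, so that the first `slice_sizes[0]` documents belong to the first week, the next `slice_sizes[1]` to the second, and so on.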

To explore the resulting corpus and its themes, we estimated a 20-component dynamic topic model. In this analysis, in order to address word relevance, we take into account their topic-specific probability, and in particular, the top terms for each topic were selected using functional boxplots. Sun and Genton (2011) propose an extension of the classic boxplot to the functional data analysis framework by defining the descriptive statistics of a functional boxplot as: the envelope of the 50% central region, the median curve, and the maximum non-outlying envelope. They further develop the model to allow for detection of outliers. In this analysis, we make use of this model to identify superior outliers in the relevance trends. We are implicitly assuming that the superior outliers correspond to the terms having the highest relevance over time, thereby representing good candidates to characterise the topic dynamics. The functional boxplots are provided in the Supplementary Material and their visual inspection helps in detecting those topics whose dynamic shows a more stable structure over time.

Tracking the temporal trends of the themes, we see that a stable topic is related to the juxtaposition, in the online debate, of the Leave and Remain stances (Topic 5). It is worth noticing how the most characteristic words for this topic (Fig. 8) are strictly linked to the leavers’ argument that the result of the public vote in the referendum held in June 2016, when 17.4 million people opted for Brexit, must be respected: this gave the Leave side 51.9%, compared with 48.1% for Remain. This argument is also discussed in Topic 12 (Fig. 9), whose narrative concerns the negotiations for the UK’s exit from the EU and the subsequent relationship. It is apparent from Fig. 10 how the Leavers’ argument is stressed, in particular starting from the first extension of Article 50 until the European Election and then again from the beginning of August until 9 September, when the Benn Bill was approved. In the early months of 2019 and after the December 2019 general election there is a major discussion on freedom of movement and citizenship rights.

Fig. 8 DTM analysis: wordcloud of the most relevant words of Topic 5

Fig. 9 DTM analysis: wordcloud of the most relevant words of Topic 12

Fig. 10 DTM analysis: heatmap of the most relevant words of Topic 12. The color represents word-document probabilities, normalised with respect to the maximum for each word. (Color figure online)

Widely debated, persistent themes also include those linked to the major social, economic, and political consequences of Brexit (see wordcloud in Fig. 11). In particular, all keywords of Topic 1 are related to the implications for health, social care and education, and principally reflect concerns about possible shortages in the supply of vital medicines, price increases for some medications, and greater difficulty in recruiting international staff. The discussion associated with Topic 3 undoubtedly refers to the negative impact that the exit of the UK from the EU may have on the economy in general and on trade in particular. The top scoring words reported for Topic 9 are highly coherent upon inspection and clearly suggest that the underlying discussion is linked to relations between England, Wales, and Scotland, and specifically to the debate over Scottish independence. Another steadily disputed theme, retrieved under Topic 18, refers to the impact of Brexit on the Irish border, which was due to become the only land border between the United Kingdom and the European Union. The Irish border issue is clearly one of the key arguments also captured by Topic 20, along with the challenges that Northern Ireland has to face, being more exposed to the impact of any trade barriers that might emerge as a consequence of Brexit. Exploring the most frequent terms associated with this topic, we find that the Twitter public expresses uncertainties about changes to trade, customs, investments, local economies, services, and other matters as a result of Brexit.

Fig. 11 Wordcloud for Topics related to socio-economic issues

Finally, it is worth noting that DTM was able to identify a topic that melds together offensive and pejorative terms (see Topic 7 in Fig. 12) and which has remained stable over time.

Fig. 12 DTM analysis: wordcloud for Topic 7

5 Discussion and Future Research

Social media have emerged as a critical factor in information dissemination and, potentially, an important tool for mobilizing people. Social scientists have long recognized the importance of social networks in the spread of information (Granovetter 1973), and modern communications technologies have enhanced this role, making social networks ubiquitous. Social media are therefore an important channel through which people share information, and they also make it simpler for media organizations to share news.

This role is notably played by Twitter, which is used as a hybrid between a communication medium and an online social network and hosts real-time discussion of current topics of popular interest (Kwak et al. 2010).

Focusing on political communication, Casero-Ripollés (2018) describes how social media in general, and Twitter in particular, widen the number and type of actors that interact. Communication is consequently moving from a scenario characterised by relations between journalists and politicians to a more scattered panorama in which, thanks to digital platforms, a greater number of actors participate in the exchanges that contribute to defining the public sphere. Besides the spreading of information, social influence can also play a pivotal role in the adoption of political opinions. Studies of political conversations have shown that both mass media and individuals can deliver political information that citizens employ in their voting decisions (see Vaccari et al. 2013, and references therein).

In this study, we have provided a comprehensive analysis of the Brexit debate on the Twitter platform. Britain's split from the European Union is one of the most important political happenings of recent times, and it has been one of the most debated events almost from the moment the idea was broached, as testified by the large volume of tweets containing the term Brexit. Compared to previous studies, we explored an extensive and up-to-date record of Brexit-related Twitter activity, covering the period between December 2018 and February 2020. Our investigation aimed at building an understanding of the online debate around Brexit and its dynamics over time.

To map the Twitter info-sphere, we collected more than 135 million Brexit-related tweets. The temporal analysis of Twitter traffic allows us to clearly identify the key dates in the UK's divorce from the EU, supporting the role of Twitter as a communication channel exploited by political and social actors as well as individuals to spread information and news. The hashtag dynamics also closely follow the Brexit timeline and reveal two opposite viewpoints: hard "brexiteers", who would exit the EU with no deal, and, on the other side, remainers, who support the People's Vote campaign and call for a second referendum on the final Brexit deal. Apart from some hashtags that are popular only over a limited time frame, the hashtag evolution shows high permanence and stability, confirming the results of Romero et al. (2011) on the persistence of hashtags on politically controversial topics. According to the authors, this is an example of the "complex contagion" principle, which states that repeated exposures to an idea are particularly crucial when the idea is in some way controversial or contentious.

The clear temporal evolution of the debate and its links to the most relevant events in the Brexit process are also reflected in the latent topics retrieved by the probabilistic topic models, which attests to Twitter's role in the information dissemination process. The use of LDA models enabled us to gain valuable insights into various aspects of the debate. The static model revealed transient topics (having a significant localisation in time) inspired by the general sequence of events, while the dynamic model captured the main underlying topics that characterised the debate from start to finish: these are not localised in time, but represent the fundamental elements of the Brexit debate.

An interesting aspect to investigate further, and a possible dimension for our future research on intermedia agenda setting in the spread of information on Brexit, is the mechanism of content transfer between traditional mass media and social media. Questions remain over the extent and the direction of the interaction between these two categories of media (Rogstad 2016; Harder et al. 2017). Su and Borah (2019), comparing the agendas on Twitter and in newspapers through rank-order and cross-lagged correlations between the two platforms, found that Twitter is more likely to influence the newspapers' agenda in terms of breaking news, whereas newspapers are more likely to lead Twitter's agenda in terms of ongoing discussions during non-breaking news periods. The analysis of intermedia agenda setting can be accomplished by using time series models to identify influence in media networks as well as convergence behaviour in the topics being discussed across sources (Meraz 2011).
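As a simplified illustration of the cross-lagged logic mentioned above, the sketch below compares how strongly one medium's attention to a topic at time t correlates with the other medium's attention at time t + 1, in both directions. It is only a sketch under stated assumptions: the two series are hypothetical daily attention scores invented for the example, the helper names `pearson` and `cross_lagged` are ours, and a plain Pearson correlation is used in place of the rank-order procedure of Su and Borah (2019).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def cross_lagged(x, y, lag=1):
    """Return (r_xy, r_yx) for a given positive lag.

    r_xy correlates x at time t with y at time t + lag;
    r_yx correlates y at time t with x at time t + lag.
    """
    if not 1 <= lag <= len(x) - 2:
        raise ValueError("lag must leave at least two paired observations")
    r_xy = pearson(x[:-lag], y[lag:])
    r_yx = pearson(y[:-lag], x[lag:])
    return r_xy, r_yx


# Hypothetical daily attention to one topic (illustrative values only).
# By construction, the newspaper series repeats the Twitter series one day later.
twitter = [1, 3, 2, 5, 4, 6, 3, 7]
newspapers = [2, 1, 3, 2, 5, 4, 6, 3]

r_tw_to_np, r_np_to_tw = cross_lagged(twitter, newspapers, lag=1)
print("Twitter(t) vs newspapers(t+1):", round(r_tw_to_np, 3))
print("Newspapers(t) vs Twitter(t+1):", round(r_np_to_tw, 3))
```

A larger correlation in one direction than in the other is suggestive of, but does not establish, a leading role for that medium; a full analysis would also need to account for autocorrelation within each series and for the baseline correlation between the two agendas.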