1 Introduction

The advent of the digital information age – and, in particular, the stratospheric rise in popularity of social media platforms such as Facebook, Instagram, Twitter, YouTube, and TikTok – has led to unprecedented opportunities for people to share information and content with one another in a far less mediated fashion than was ever possible previously. These opportunities, however, have been accompanied by a myriad of new concerns and challenges at both the individual and societal levels, including threats to systems of democratic governance (Tucker et al., 2017). Chief among these are the rise of hateful and abusive forms of communication on these platforms, the seemingly unchecked spread of mis- and disinformation,Footnote 1 and the ability of malicious political actors, including, perhaps most notably, foreign adversaries, to launch coordinated influence attacks in an attempt to hijack public opinion.

Concurrently, the rise of computing power and the astonishing developments in the fields of information storage and retrieval, text-as-data, and machine learning have given rise to a whole new set of tools – collectively known as Computational Social Science – that allow scholars to study the digital trace data left behind by online activity in previously unimaginable ways. These Computational Social Science tools enable scholars not only to characterize and describe the newly emerging phenomena of the digital information era but also, in the case of the more malicious of these phenomena, to test ways to mitigate their prevalence and impact. Accordingly, this chapter of the handbook summarizes what we have learned about the potential for Computational Social Science tools to be used to address the three threats identified above: hate speech, mis-/disinformation, and foreign coordinated influence campaigns. As these topics are set against the backdrop of influencing public opinion, I begin with an overview of how Computational Social Science techniques can be harnessed to measure public opinion. Finally, the chapter concludes with a discussion of the paramount importance, for all of these efforts, of ensuring that independent researchers – that is, researchers not employed by the platforms themselves – have access to the data necessary to continue and build upon the research described in the chapter, as well as to inform, and ultimately facilitate, public regulatory policy.

All of these areas – using Computational Social Science to measure public opinion, and to detect, respond to, and possibly even remove hate speech, misinformation, and foreign influence campaigns – have important public policy implications. Using social media to measure public opinion offers policy makers additional tools for gauging the opinions regarding, and salience of, issues among the general public, ideally helping to make governments more responsive to the public. Hate speech and misinformation together form the crux of the debate over “content moderation” on platforms, and Computational Social Science can provide the tools necessary not only to implement policy makers’ prescriptions for addressing these potential harms but also, equally importantly, to understand the actual nature of the problems they are trying to address. Finally, foreign coordinated influence campaigns, regardless of the extent to which they actually influence politics in other countries, can rightly be conceived of as national security threats when foreign powers attempt to undermine the quality and functioning of democratic institutions. Here again, Computational Social Science has an important role to play, both in identifying such campaigns and in measuring their goals, strategies, reach, and ultimate impact.Footnote 2

In the review that follows, I focus almost exclusively on publications and papers from the last 3–4 years. To be clear, this research all builds on very important prior work that will not be covered in the review.Footnote 3 In addition, in the time it has taken to bring this piece to publication, there have undoubtedly been many new and important contributions to the field that will not be addressed here. But hopefully the review is able to provide readers with a fairly up-to-date sense of the promises of – and challenges facing – new approaches from Computational Social Science to the study of democracy and its challenges.

2 Computational Social Science and Measuring Public Opinion

One of the great lures of social media was that it would lead to new ways to analyse and measure public opinion (Barberá & Steinert-Threlkeld, 2020; Klašnja et al., 2017). Traditional survey-based methods of measuring public opinion of course have all sorts of important advantages, to say nothing of a 70-year pedigree of developing appropriate methods around sampling and estimation. There are, however, drawbacks too: surveys are expensive; there are limits to how many anyone can run; they are dependent on appropriate sampling frames; they rely on an “artificial” environment for measuring opinion and are correspondingly subject to social desirability bias; and, perhaps most importantly, they can only measure opinions for the questions pollsters decide to ask. Social media, on the other hand, holds open the promise of inexpensive, real-time, finely grained time-series measurement of people’s opinions in a non-artificial environment where there is no sense of being observed for a study or needing to respond to a pollster (Beauchamp, 2017). Moreover, analysis can also be retrospective, going back in time to study the evolution of opinion on a topic for which one might not previously have thought to ask questions in public opinion surveys.Footnote 4

The field has not, however, developed in a way that uses social media to mimic the traditional public opinion polling approach of an omnibus survey, which tracks attitudes among the public across a large number of topics on a regular basis. Instead, we have seen two types of Computational Social Science studies take centre stage: studies that examine attitudes over time related to one particular issue or topic and studies that attempt to use social media data to assess the popularity of political parties and politicians, often in an attempt to predict election outcomes.Footnote 5

The issue-based studies generally involve collecting a corpus of social media posts (usually tweets) around a series of keywords related to the issue in question and then measuring sentiment (usually positive or negative sentiment towards the issue) over a period of time. Studies of this nature have examined attitudes towards topics such as Brexit (Georgiadou et al., 2020), immigration (Freire-Vidal & Graells-Garrido, 2019), refugees (Barisione et al., 2019), austerity (Barisione & Ceron, 2017), COVID-19 (Dai et al., 2021; Gilardi et al., 2021; Lu et al., 2021), the police (Oh et al., 2021), gay rights (Adams-Cohen, 2020), and climate change (Chen et al., 2021b). Studies of political parties and candidates follow similar patterns, although sometimes using engagement metrics such as “likes” to measure popularity instead of sentiment analysis. Recent examples include studies conducted in Finland (Vepsäläinen et al., 2017), Spain (Bansal & Srivastava, 2019; Grimaldi et al., 2020), and Greece (Tsakalidis et al., 2018).Footnote 6
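
To make this general workflow concrete, here is a minimal sketch, assuming a small, hypothetical dataframe of posts and an invented keyword list; NLTK’s VADER scorer stands in for whichever sentiment model a given study actually used.

```python
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

# Hypothetical corpus of posts with dates (in practice collected via an API).
posts = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "text": [
        "Vaccine rollout is going well, great news",
        "Worried that the vaccine rollout is far too slow",
        "Relieved to finally get my vaccine appointment",
    ],
})

# Keep only posts mentioning at least one issue keyword (hypothetical list).
keywords = ["vaccine", "rollout"]
on_topic = posts[posts["text"].str.lower().apply(
    lambda text: any(k in text for k in keywords)
)].copy()

# Score each post, then average by day to produce an opinion time series.
analyzer = SentimentIntensityAnalyzer()
on_topic["sentiment"] = on_topic["text"].apply(
    lambda text: analyzer.polarity_scores(text)["compound"]
)
print(on_topic.groupby("date")["sentiment"].mean())
```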

Of course, studying public opinion using Computational Social Science methods and social media data is not without its challenges. First and foremost is the question of representativeness: whose opinions are being measured when we analyse social media data? There are two layers of concern here: whether the people whose posts are being analysed are representative of the platform’s users overall, and whether those users are in turn representative of the population of interest (Klašnja et al., 2017). If the goal is simply to ascertain the opinions of those using the platform, then the latter question is less problematic. Of course, the “people” part of the question can also be problematic, as social media accounts can be “bots”, that is, accounts automated to produce content algorithmically rather than corresponding one-to-one to a human being, although the prevalence of such accounts varies by platform (Grimaldi et al., 2020; Sanovich et al., 2018; Yang et al., 2020). Another problem for representativeness can arise when significant portions of the population lack internet access, or when people are afraid to voice their opinions online due to fear of state repression (Isani, 2021).

Even if the question of representativeness can be solved and/or an appropriate population of interest identified, the original question of how to extract opinions out of unstructured text data still remains. Here, however, we have seen great strides by computational social scientists in developing innovative methods. Loosely speaking, we can identify two basic approaches. The first set of methods is characterized by identifying, a priori, text that is positively or negatively associated with a certain topic and then simply tracking the prevalence (e.g. counts, ratios) of these words over time (Barisione et al., 2019; Georgiadou et al., 2020; Gilardi et al., 2022). For example, in Siegel and Tucker (2018), we took advantage of the fact that when discussing ISIS in Arabic, the term “Islamic State” suggests support for the organization, while the derogatory term “Daesh” is used by those opposed to ISIS. Slight variations on this approach can involve including emojis as well as words (Bansal & Srivastava, 2019) or focusing on likes instead of text (Vepsäläinen et al., 2017).
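
As an illustration of this kind of a priori keyword tracking, the sketch below counts how often a supportive versus a derogatory term appears each day and computes the supportive share over time. The posts and (English transliterations of the) terms are placeholders; this is not the original study’s Arabic-language pipeline.

```python
import pandas as pd

# Hypothetical posts; in practice this would be a large collected corpus.
posts = pd.DataFrame({
    "date": ["2015-06-01", "2015-06-01", "2015-06-02", "2015-06-02"],
    "text": [
        "the islamic state announced ...",
        "daesh attacked the town of ...",
        "more daesh propaganda online",
        "supporters of the islamic state claim ...",
    ],
})

supportive_terms = ["islamic state"]   # placeholder for supportive usage
opposing_terms = ["daesh"]             # placeholder for derogatory usage

def contains_any(text, terms):
    return any(term in text.lower() for term in terms)

posts["supportive"] = posts["text"].apply(contains_any, terms=supportive_terms)
posts["opposing"] = posts["text"].apply(contains_any, terms=opposing_terms)

# Daily counts of each term type and the share of "supportive" usage.
daily = posts.groupby("date")[["supportive", "opposing"]].sum()
daily["support_share"] = daily["supportive"] / (daily["supportive"] + daily["opposing"])
print(daily)
```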

The more popular approach, however, is to rely on one of the many different machine learning approaches to classify sentiment. These approaches include nonnegative matrix factorization (Freire-Vidal & Graells-Garrido, 2019), deep learning (Dai et al., 2021), convolutional and recurrent neural nets (Wood-Doughty et al., 2018), and pre-trained language transformer models (Lu et al., 2021; Terechshenko et al., 2020); many papers also compare a number of different supervised machine learning models and select the one that performs best (Adams-Cohen, 2020; Grimaldi et al., 2020; Tsakalidis et al., 2018). While less common, some studies use unsupervised approaches to stance detection, relying on networks and activity to cluster accounts (Darwish et al., 2019). Closely related to these latter approaches are network-based models that are not focused on positive or negative sentiment towards a particular topic, but rather attempt to place different users along a latent dimension of opinion, such as partisanship (Barberá, 2015; Barberá et al., 2015) or attitudes towards climate change (Chen et al., 2021b).
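
A minimal sketch of that model-comparison workflow is below, using scikit-learn on a tiny invented set of hand-labelled posts; real studies use far larger labelled corpora and frequently transformer-based models rather than the bag-of-words features shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical hand-coded training posts (1 = positive, 0 = negative).
texts = [
    "love this policy, long overdue",
    "this decision is a disaster",
    "great news for the country",
    "absolutely terrible reporting",
    "what a wonderful outcome",
    "worst announcement all year",
]
labels = [1, 0, 1, 0, 1, 0]

# Candidate supervised models, each wrapped in a TF-IDF pipeline.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
}

# Cross-validate each candidate and keep the best-performing one.
scores = {}
for name, model in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=3).mean()

best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```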

With this basic background on the ways in which Computational Social Science can be utilized to measure public opinion using social media data, in the remainder of this chapter, I examine the potential of Computational Social Science to address three pernicious forms of online behaviour that have been identified as threats to the quality of democracy: hate speech, misinformation, and foreign influence campaigns.

3 Computational Social Science and Hate Speech

The rise of Web 2.0 brought with it the promise of a more interactive internet, where ordinary users could contribute content in near real time (Ackland, 2013). Social media in many ways represented the apex of this trend, with the most dominant tech companies becoming those that did not actually produce content, but instead provided platforms on which everyone could create content. While removing the gatekeepers from the content production process has many attractive features from the perspective of democratic participation and accountability, it also has its downsides – perhaps none more obvious than the fact that gatekeepers had also played a role in policing online hate. As that downside has become increasingly apparent, a wave of scholarship has developed utilizing Computational Social Science tools to attempt to characterize the extent of the problem, measure its impact, and assess the effectiveness of various countermeasures (Siegel, 2020).

Attempts to measure the prevalence and diffusion of hate speech have been at the forefront of this work, including studies that take place on a single platform (Gallacher & Bright, 2021; He et al., 2021; Mathew et al., 2018) and those spanning multiple platforms (Gallacher, 2021; Velásquez et al., 2021), with the latter including studies of what happens to users’ hate speech on one platform when they are banned from another (Ali et al., 2021; Mitts, 2021). Other studies have focused on more specific topics, such as the amount of hate speech produced by bots as opposed to humans (Albadi et al., 2019), whether there are serial producers of hate in Italy (Cinelli et al., 2021), and hate speech targeted at elected officials and politicians (Greenwood et al., 2019; Rheault et al., 2019; Theocharis et al., 2020).

A second line of research has involved attempting to ascertain both the causes and effects of hate speech and in particular the relationship between offline violence, including hate crimes, and online hate speech. For example, a number of papers have examined the rise in online anti-Muslim hate speech on Twitter and Reddit following terrorist attacks in Paris (Fischer-Preßler et al., 2019; Olteanu et al., 2018) and Berlin (Kaakinen et al., 2018). Conversely, other studies have examined the relationship between hate speech on social media and hate crimes (Müller & Schwarz, 2021; Williams et al., 2020). Other work examines the relationship between political developments and the rise of hate speech, such as the arrival of a boat of refugees in Spain (Arcila-Calderón et al., 2021). Closely related are studies, primarily of an experimental nature, that attempt to measure the impact of being exposed to incivility (Kosmidis & Theocharis, 2020) or hate speech on outcomes such as prejudice (Soral et al., 2018) or fear (Oksanen et al., 2020).

A third line of research has focused on attempts not just to detect but also to counter hate speech online. The main approach here has been field experiments, in which researchers identify users who post hate speech on Twitter, use “sock puppet” accounts to deliver some sort of message designed to reduce the use of hate speech, and then monitor those users’ future behaviour. Treatments tested have included varying the popularity, race, and partisanship of the account delivering the message (Munger, 2017, 2021), embedding the exhortation in religious (Islamic) references (Siegel & Badaan, 2020), and threatening suspension from the platform (Yildirim et al., 2021). Researchers have also employed survey experiments to measure the impact of counter-hate speech (Sim et al., 2020) as well as observational studies, such as Garland et al. (2022)’s study of 180,000 conversations on German political Twitter.

Computational Social Science sits squarely at the root of all of this research, as any study that involves detecting hate speech at scale needs to rely on automated methods.Footnote 7 Researchers have essentially employed two different strategies. The first is to utilize dictionary methods – drawing on lists of hateful terms that are either available in existing databases or compiled by the researchers conducting the study, and then collecting posts that contain those terms (Arcila-Calderón et al., 2021; Greenwood et al., 2019; Mathew et al., 2018; Mitts, 2021; Olteanu et al., 2018).

The second option is to rely on supervised machine learning. As with the study of opinions and sentiment generally, we can see a wide range of supervised ML methods employed, including pre-trained language models based on the BERT architecture (Cinelli et al., 2021; Gallacher, 2021; Gallacher & Bright, 2021; He et al., 2021), SVM models (Rheault et al., 2019; Williams et al., 2019), random forests (Albadi et al., 2019), doc2vec (Garland et al., 2022), and logistic regression with L1 regularization (Theocharis et al., 2020). Siegel et al. (2021) combine dictionary methods with supervised machine learning, using a naive Bayes classifier to screen out false positives from the dictionary matches; in a warning sign for dictionary methods, they find that large numbers of the tweets identified by the dictionaries (in many cases approximately half) are removed by the classifier as false positives.
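
The sketch below illustrates the general logic of pairing a dictionary filter with a supervised screen for false positives, assuming a placeholder lexicon and a handful of invented hand-labelled posts; it is a simplified illustration of the idea, not the classifier from the study itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

lexicon = ["slur_a", "slur_b"]  # placeholder dictionary terms

posts = [
    "that slur_a comment was vile",         # mentions a term, not hateful use
    "you are a slur_a and should leave",    # hateful use
    "reporting on slur_b incidents today",  # news reporting, not hateful
    "all of them are slur_b, disgusting",   # hateful use
]

# Step 1: dictionary filter keeps only posts containing a lexicon term.
matches = [p for p in posts if any(term in p for term in lexicon)]

# Step 2: screen the matches with a classifier trained on hand-labelled
# examples (1 = genuinely hateful, 0 = false positive from the dictionary).
train_texts = matches
train_labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X, train_labels)

new_posts = ["another slur_b attack reported by police"]
predictions = classifier.predict(vectorizer.transform(new_posts))
print(predictions)  # 0 would indicate the dictionary match is a false positive
```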

Unsupervised machine learning is less prevalent in this research – other than for identifying subtopics in a general area in which to look for the relative prevalence of hate speech (e.g. Arcila-Calderón et al. 2021 (refugees), Velásquez et al. 2021 (COVID-19), Fischer-Preßler et al. 2019 (terrorist attacks)) – although Rasmussen et al. (2021) propose what they call a “super-unsupervised” method for hate speech detection that relies on word embeddings and does not require human-coded training data.
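
Purely as an illustration of what an embedding-based, dictionary-light approach can look like (and not the authors’ actual method), the sketch below trains word embeddings on a toy corpus and scores each post by its similarity to a couple of seed terms; the corpus and seed words are placeholders.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus; a real study would use millions of posts.
corpus = [
    "i hate those people they are awful".split(),
    "lovely weather for a walk today".split(),
    "they are awful and should go away".split(),
    "enjoyed a great game of football".split(),
]

model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, seed=1)

seed_terms = ["hate", "awful"]  # placeholder seed terms
seed_vector = np.mean([model.wv[w] for w in seed_terms], axis=0)

def post_score(tokens):
    """Cosine similarity between a post's mean embedding and the seed vector."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return 0.0
    mean_vec = np.mean(vectors, axis=0)
    return float(np.dot(mean_vec, seed_vector) /
                 (np.linalg.norm(mean_vec) * np.linalg.norm(seed_vector)))

for tokens in corpus:
    print(round(post_score(tokens), 3), " ".join(tokens))
```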

One important development of note is that in recent years it has become increasingly common to find studies of hate speech involving languages other than English, including Spanish (Arcila-Calderón et al., 2021), Italian (Cinelli et al., 2021), German (Garland et al., 2022), and Arabic (Albadi et al., 2019; Siegel & Badaan, 2020). Other important Computational Social Science innovations in the field include matching accounts across platforms to observe how the same people behave in different online venues, including how content moderation actions on one platform can impact hate speech on another (Mitts, 2021), and network analyses of the spread of hateful content (Velásquez et al., 2021). Finally, it is important to remember that any form of hate speech identification that relies on humans to classify speech as hateful or not is subject to whatever biases underlie human coding (Ross et al., 2017); this includes all supervised machine learning methods. One warning here can be found in Davidson et al. (2019), who demonstrate that a number of hate speech classifiers are more likely to classify tweets written in what the authors call “African-American English” as hate speech than tweets written in standard English.

4 Computational Social Science and Misinformation

In the past 6 years or so, we have witnessed a very significant increase in research related to misinformation online.Footnote 8 One can conceive of this field as attempting to answer six closely related questions, listed here roughly in temporal order:

  1. Who produces misinformation?

  2. Who is exposed to misinformation?

  3. Conditional on exposure, who believes misinformation?

  4. Conditional on belief, is it possible to correct misinformation?

  5. Conditional on exposure, who shares misinformation?

  6. Through production and sharing, how much misinformation exists online/on platforms?

Computational Social Science can be used to shed light on any of these questions but is particularly important for questions 2, 5, and 6: who is exposed, who shares, and how much misinformation exists online?Footnote 9

To answer these questions, Computational Social Science is employed in one of two ways: to trace the spread of misinformation or to identify misinformation. The former is generally an easier task than the latter, and studies that employ Computational Social Science in this way typically follow a common pattern. First, a set of domains or news articles is identified as false. In the case of news articles, researchers generally turn to fact checking organizations such as Snopes or PolitiFact for lists of articles that have previously been identified as false (Allcott et al., 2019; Allcott & Gentzkow, 2017; Shao et al., 2018). Two points are worth noting here. First, this means that such studies are limited to countries in which fact checking organizations exist. Second, such studies are also limited to articles that fact checking organizations have chosen to check (which might be subject to their own organizational biases).Footnote 10 For news domains, researchers generally rely either on an outside organization that rates the quality of news domains, such as NewsGuard (Aslett et al., 2022), or on lists of suspect news sites published by journalists or other scholars (Grinberg et al., 2019; Guess et al., 2019). Scholars have also found other creative ways to find sources of suspect information, such as public pages on Facebook associated with conspiracy theories (Del Vicario et al., 2016) or videos that were removed from YouTube (Knuutila et al., 2020). Once the list of suspect domains or articles is identified, the Computational Social Science component of researching the spread comes from interacting with and/or scraping online information to track where these links are found. This can be as simple as querying an API, or as complicated as developing methods to track the spread of information.Footnote 11
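
A minimal sketch of the link-tracing step is below, assuming an invented list of flagged domains and a few hypothetical posts; in practice the post collection would come from a platform API or scrape, and the domain list from a ratings organization or published list.

```python
from urllib.parse import urlparse
from collections import Counter

flagged_domains = {"fakenews-example.com", "hoaxsite-example.org"}  # placeholders

# Hypothetical posts containing shared links.
posts = [
    {"date": "2020-10-01", "url": "https://fakenews-example.com/story123"},
    {"date": "2020-10-01", "url": "https://reputable-example.org/report"},
    {"date": "2020-10-02", "url": "http://hoaxsite-example.org/claim?x=1"},
]

def domain_of(url):
    """Extract the host of a URL, dropping any 'www.' prefix."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Count shares of flagged domains per day to trace spread over time.
flagged_by_day = Counter(
    post["date"] for post in posts if domain_of(post["url"]) in flagged_domains
)
print(flagged_by_day)  # e.g. Counter({'2020-10-01': 1, '2020-10-02': 1})
```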

The second – and primary – use of Computational Social Science techniques in the study of misinformation is the arguably more difficult task of using Computational Social Science to identify content as misinformation. As might be expected, using dictionary methods to do so is much more difficult than for tasks such as identifying hate speech or finding posts about a particular topic or issue. Accordingly, when we do see dictionary methods in the study of misinformation, they are generally employed in order to identify posts about a specific topic (e.g. Facebook ads related to a Spanish general election in Cano-Orón et al., 2021) that are then coded by hand; Gorwa (2017) and Oehmichen et al. (2019) follow similar procedures of hand labelling small numbers of posts/accounts as examples of misinformation in Poland and the United States, respectively.

Although identifying misinformation remains a very challenging computational task, recent research has begun to use machine learning to build supervised classifiers to identify misinformation on Twitter using SVMs (Bojjireddy et al., 2021), BERT embeddings (Micallef et al., 2020), and ensemble methods (Al-Rakhami & Al-Amri, 2020). Jagtap et al. (2021) comparatively test a variety of different supervised classifiers to identify misinformation in YouTube comments. Jachim et al. (2021) have built a tool based on unsupervised machine learning, called “Troll Hunter”, that, while not identifying misinformation per se, can be used to surface narratives across multiple posts online that might form the basis of a disinformation campaign. Karduni et al. (2019) also incorporate images into their classifier.

Closely related, other studies have sought to harness network analysis to identify misinformation online. For example, working with leaked documents that identify actors paid by the South Korean government, Keller et al. (2020) show how retweet and co-tweet networks can be used to identify possible purveyors of misinformation. Zhu et al. (2020) utilize a “heuristic greedy algorithm” to attempt to identify nodes in networks that, if removed, would greatly reduce the spread of misinformation. Sharma et al. (2021) train a network-based model on data from the Russian Internet Research Agency (IRA) troll datasets released by Twitter and use it to identify coordinated groups spreading anti-vaccination and anti-mask conspiracies.
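
As a simplified illustration of this network-based logic (not any one paper’s model), the sketch below builds a co-retweet network in which accounts are linked by how many of the same items they retweeted and flags pairs with high overlap; the accounts, items, and threshold are all invented, and real studies add timing information, null models, and much more.

```python
from itertools import combinations
import networkx as nx

# Hypothetical data: account -> set of item ids that the account retweeted.
retweets = {
    "acct_a": {"t1", "t2", "t3", "t4"},
    "acct_b": {"t1", "t2", "t3", "t5"},
    "acct_c": {"t9"},
}

# Build a co-retweet network: edge weight = number of shared retweeted items.
graph = nx.Graph()
for u, v in combinations(retweets, 2):
    overlap = len(retweets[u] & retweets[v])
    if overlap > 0:
        graph.add_edge(u, v, weight=overlap)

# Flag account pairs whose overlap exceeds an illustrative threshold.
suspicious = [
    (u, v, d["weight"]) for u, v, d in graph.edges(data=True) if d["weight"] >= 3
]
print(suspicious)  # [('acct_a', 'acct_b', 3)]
```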

A different use of machine learning to identify misinformation – in this case, false news articles – can be found in Godel et al. (2021). Here we assess the possibility of crowdsourcing the fact checking of news articles by testing a wide range of rules for how decisions could be made by crowds. Compared with intuitively simple rules such as “take the mode of the crowd”, we find that machine learning methods that draw upon a richer set of features – particularly when analysed using convolutional neural nets – far outperform simple aggregation rules at matching the crowd’s judgment to the assessments of a set of professional fact checkers.
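
The sketch below illustrates the basic contrast with invented crowd ratings and fact-checker labels: a simple majority (modal) rule versus a learned rule fit on features summarizing the crowd response. The actual study used much richer features and convolutional neural nets rather than the toy logistic regression shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = articles, columns = crowd raters (1 = rated "true", 0 = rated "false").
crowd_ratings = np.array([
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
])
fact_checker_labels = np.array([1, 0, 1, 0])  # professional fact-checker verdicts

# Baseline rule: the modal (majority) crowd response for each article.
majority_rule = (crowd_ratings.mean(axis=1) > 0.5).astype(int)

# Learned rule: summarize the crowd response as features and fit a model
# against the professional labels (fit on the same toy data purely to
# illustrate the pipeline; a real study would evaluate on held-out articles).
features = np.column_stack([
    crowd_ratings.mean(axis=1),  # share of raters saying "true"
    crowd_ratings.std(axis=1),   # disagreement among raters
])
learned_rule = LogisticRegression().fit(features, fact_checker_labels)

print("majority rule:", majority_rule)
print("learned rule:", learned_rule.predict(features))
```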

Given the scale at which misinformation spreads, it is clear that any content moderation policy related to misinformation will need to rely on machine learning to at least some extent. From this vantage point, the progress the field has made in recent years must be seen as encouraging; still, important challenges remain. First, the necessary data to train models is not always available, either because platforms do not make it available to researchers due to privacy or commercial concerns or because it has, ironically, been deleted as part of the process of content moderation.Footnote 12 In some cases, platforms have released data from deleted accounts for scholarly research, but even here the method by which these accounts were identified generally remains a black box. Second, for any supervised learning method, the question of whether a classifier designed to identify misinformation in one context is robust enough to detect it in another context (a different language, a different country, or even a different setting in the same country and language) remains paramount. While this is a problem for measuring sentiment on policy issues or hate speech as well, we have reason to suspect that the contextual nature of misinformation makes it even more challenging here, which suggests the potential value of unsupervised and/or network-based models. Third, many of the methods to date rely on training classifiers on news that has existed in the information ecosystem for extended periods of time, while the challenge for content moderation is to identify misinformation in near real time, before it spreads widely (Godel et al., 2021). Finally, false positives can have negative consequences as well, if the reaction to identifying misinformation is to suppress its spread. While reducing the spread of misinformation receives the most attention, it is important to remember that reducing the amount of true news in circulation is also costly, so future studies should try to explicitly address this trade-off, perhaps by attempting to assess the impact of methods of identifying misinformation on the overall makeup of the information ecosystem.

5 Computational Social Science and Coordinated Foreign Influence Operations

A third area in which Computational Social Science plays an important role in protecting democratic integrity is in the study of foreign influence operations. Here, I define foreign influence operations as coordinated attempts online by one state to influence the attitudes and behaviours of citizens of another state.Footnote 13 While foreign propaganda efforts of course precede the advent of the modern digital information age, the cost of mounting coordinated foreign influence operations has significantly dropped in the digital information era, especially due to the rise of social media platforms.Footnote 14

Research on coordinated foreign influence operations (hereafter CFIOs) can loosely be described as falling into one of two categories: attempts to describe what actually happened as part of previously identified CFIOs and attempts to develop methods to identify new CFIOs. Notably, the scholarly literature on the former is much larger (although one would guess that research on the latter is being conducted by the social media platforms themselves). Crucially, almost all of this literature is dependent on having a list of identified accounts and/or posts that are part of CFIOs – by definition if the goal is to describe what happened in a CFIO, and for use as training data if the goal is to develop methods to identify new CFIOs. Accordingly, the primary sources of data for the studies described in the remainder of this section are collections of posts from (or lists of accounts involved with) CFIOs released by social media platforms. After having turned over lists of CFIO accounts to the US government as part of congressional testimony, Twitter has emerged as a leader in this regard; however, other platforms, including Reddit and Facebook, have made CFIO data available for external research as well.Footnote 15

By far the most studied subject of CFIOs is the activities of the Russian IRA in the United States (Bail et al., 2020; Bastos & Farkas, 2019) and in particular in the period of time surrounding the 2016 US presidential election (Arif et al., 2018; Boyd et al., 2018; DiResta et al., 2022; Golovchenko et al., 2020; Kim et al., 2018; Linvill & Warren, 2020; Lukito, 2020; Yin et al., 2018; Zannettou et al., 2020).

Studies of CFIOs in other countries include Russian influence attempts in Germany (Dawson & Innes, 2019), across 12 European countries (Innes et al., 2021), and in Syria (Metzger & Siegel, 2019), Libya, Sudan, Madagascar, the Central African Republic, and Mozambique (Grossman et al., 2019, 2020); Chinese influence attempts in the United Kingdom (Schliebs et al., 2021), Hong Kong and Taiwan (Wallis et al., 2020), and the United States (Molter & DiResta, 2020); and Iranian influence attempts in the Middle East (Elswah et al., 2019).

The methods employed in these studies vary, but many involve a role for Computational Social Science. In Yin et al. (2018) and Golovchenko et al. (2020), we extract hyperlinks shared by Russian IRA trolls using a custom-built Computational Social Science tool; in the latter study, we also utilize methods described earlier in this review, in the section on measuring public opinion, to automate the estimation of the ideological placement of the shared links. Zannettou et al. (2020) extract and analyse the images shared by Russian IRA accounts. Innes et al. (2021), Dawson and Innes (2019), and Arif et al. (2018) all rely on various forms of network analysis to track the spread of IRA content in Germany, Europe, and the United States, respectively. Two studies of Chinese influence operations use sentiment analysis – again, in a manner similar to the one described earlier in the measuring public opinion section – to measure whether influence operations rely on positive or negative messages (Molter & DiResta, 2020; Wallis et al., 2020). In a similar vein, Boyd et al. (2018) use NLP tools to chart the stylistic evolution of Russian IRA posts over time. DiResta et al. (2022) and Metzger and Siegel (2019) use structural topic models to dig deeper into the topics discussed by Russian influence operations in the United States and in tweets by Russian state media about Syria, respectively. Lukito (2020) employs a method similar to the one discussed earlier regarding whether elites or masses drive the discussion of political topics to argue that the Russian IRA was trying out topics on Reddit before purchasing ads on those subjects on Facebook. Other papers combine digital trace data from social media platforms, such as Facebook ads (Kim et al., 2018) or exposure to IRA tweets (Bail et al., 2020; Eady et al., 2022), with survey data.

A number of studies rely on qualitative analyses based on human annotation of CFIO account activity (e.g. Innes et al. (2021) include a case study of Russian influence in Estonia to supplement a network-based study of Russian influence in 12 European countries; see also Bastos and Farkas, 2019; Dawson and Innes, 2019; DiResta et al., 2022; and Linvill and Warren, 2020), but even in these cases, Computational Social Science plays a role in allowing scholars to extract the relevant posts and accounts for analysis.

There is, however, much less research on the actual influence of exposure to CFIOs, a direction in which the literature should expand in the future. Two exceptions are Bail et al. (2020) and Eady et al. (2022), both of which rely on panel survey data combined with data on exposure to tweets by Russian trolls that took place between waves of the panel survey.

A second strand of the Computational Social Science literature involves trying to use machine learning to identify CFIOs.Footnote 16 One approach has been to use the releases of posts from CFIOs by social media platforms as training data for supervised models to identify new CFIOs (or at least new CFIOs that are unknown to the models); both Alizadeh et al. (2020) and Marcellino et al. (2020) report promising findings using this approach. Innes et al. (2021) filter on keywords and then attempt to identify influence campaigns through network analysis; this approach has the advantage of not needing to use training data, although the ultimate findings will of course be a function of the original keyword search. Schliebs et al. (2021) use NLP techniques to look for common phrases or patterns across the posts from Chinese diplomats, thus suggesting evidence of a coordinated campaign. This method also does not require training data, but, unlike either of the previous approaches, does require identifying the potential actors involved in the CFIO as a precursor to the analysis.
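
As a stylized illustration of looking for shared phrasing across accounts (a simplification, and not the method from any one of the studies above), the sketch below flags pairs of posts from different accounts whose texts are nearly identical under TF-IDF cosine similarity; the posts, accounts, and threshold are all invented placeholders.

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (account, text) pairs; a real study would use a large collection.
posts = [
    ("acct_1", "Country X stands for peace and global cooperation."),
    ("acct_2", "Country X stands for peace and global cooperation!"),
    ("acct_3", "Looking forward to the football match this weekend."),
]

texts = [text for _, text in posts]
tfidf = TfidfVectorizer().fit_transform(texts)
similarity = cosine_similarity(tfidf)

THRESHOLD = 0.9  # illustrative cut-off for "near-duplicate" text
for i, j in combinations(range(len(posts)), 2):
    if posts[i][0] != posts[j][0] and similarity[i, j] >= THRESHOLD:
        print("possible coordination:", posts[i][0], posts[j][0],
              round(similarity[i, j], 2))
```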

Taken together, these studies make clear that a great deal has been learned in a short period of time about the ways in which CFIOs operate in the modern digital era. That being said, a strikingly large proportion of recent research has focused on the activities of Russian CFIOs around the 2016 US elections; future research should continue to look at influence operations run by other countries with other targets.Footnote 17 There is also clearly a lot more work to be done in terms of understanding the impact of CFIOs, as well as in developing methods for identifying these campaigns. This latter point reflects a fundamental reality of the field: its development has occurred largely because the platforms chose (or were compelled) to release data, and it is to this topic that I turn in some brief concluding remarks in the following section.

6 The Importance of External Data Access

Online hate, disinformation, and coordinated influence operations all pose potential threats to the quality of democracy, to say nothing of the threats to people whose personal lives may be impacted by being attacked online or being exposed to dangerous misinformation. Computational Social Science – and in particular tools that facilitate working with large collections of (digital trace) data, together with innovations in machine learning – has important roles to play in helping society understand the nature of these threats, as well as potential mitigation strategies. Indeed, social scientists are getting better and better at incorporating the newest developments in machine learning (e.g. neural networks, pre-trained transformer models) into their research. Many of the results laid out in the previous sections are incredibly impressive and represent research we would not even have conceived of being able to do a decade ago.

That being said, the field as a whole remains dependent on the availability of data. And here, social scientists find themselves in a different position than in years past. Previously, most quantitative social research was conducted either with administrative data (e.g. election results, unemployment data, test scores) or with data – usually survey or experimental – that we could collect ourselves. As Nathaniel Persily and I have noted in much greater detail elsewhere (Persily & Tucker, 2020a, b, 2021), we now find ourselves in a world where the data we need to do our research on the kinds of topics surveyed in this handbook chapter are “owned” by a handful of very large private companies. Thus, the key to advancing our knowledge of all of the topics discussed in this review, as well as to the continued development of related methods and tools, is a legal and regulatory framework that ensures that outside researchers who are not employees of the platforms, and who are committed to sharing the results of their research with both the mass public and policy makers, are able to continue to access the data necessary for this research.Footnote 18

Let me give just two examples. First, almost none of the work surveyed in the previous section on CFIOs would have been possible had Twitter not decided to release its collections of tweets produced by CFIOs after they were taken off the platform. Yes, it is fantastic that Twitter did (and has continued to) release these data, but we as a society do not want to be at the mercy of platforms’ decisions to release data for matters as crucial as understanding whether foreign countries are interfering in democratic processes. And just because Twitter has chosen to do this in the past does not mean that it will continue to do so in the future. Second, even with all the data that Twitter releases publicly through its researcher API, external researchers still do not have access to impressions data (e.g. how many times tweets were seen and by whom). While some have come up with creative ways to try to estimate impressions, this means that any research built around impressions carries unnecessary noise in its estimates, and a decision by Twitter tomorrow could change this reality. For all of the topics in this review – hate speech, misinformation, foreign influence campaigns – impressions are crucially important pieces of the puzzle that we are currently missing.

As of the final editing of this essay, though, important steps are being taken on both sides of the Atlantic to try to address this question of data access for external academic researchers. In the United States, a number of bills have recently been introduced in the national legislature that include components aimed at making social media data available to external researchers for public-facing analysis.Footnote 19 While such bills are still a long way from being made into law, the fact that multiple lawmakers are taking the matter seriously is a positive step forward. Perhaps more importantly in terms of immediate impact, the European Union’s Digital Services Act (DSA) has provisions allowing “vetted researchers” access to data from key platforms, in order for researchers to evaluate how platforms work and how online risk evolves and to support transparency, accountability, and compliance with the new laws and regulations.Footnote 20

Computational Social Science has a huge role to play in helping us understand some of the most important challenges faced by democratic societies today. The scholarship that is being produced is incredibly inspiring, and the methodological leaps that are occurring in such short periods of time were perhaps previously unimaginable. But at the end of the day, the ultimate quality of the work we are able to do will depend on the data to which we have access. Thus data access needs to be a fundamental part of any forward-facing research plan for improving what Computational Social Science can teach us about threats to democracy.