1 Introduction

We live in a digital world, where virtually every realm of our existence has been transformed by a rapid and ongoing process of datafication and computation. Travel, retail, entertainment, finance, and medicine: to these areas of life, all grown virtually unrecognizable in recent years, we must also add the social sciences. The burgeoning field of Computational Social Science (CSS) has begun changing the way sociologists, anthropologists, economists, political scientists, and others interpret human behaviour and motivations, in the process leading to new insights into human society. Some have gone so far as to herald a “social research revolution” or a “paradigm shift” in the social sciences (Chang et al., 2014; Porter et al., 2020). The Economist, for instance, recently proclaimed an era of “third-wave economics”, transformed by the availability of massive amounts of real-time data (Kansas, 2021).

Of course social scientists have always used data to interpret and analyse human beings and the social structures they create. CSS as a concept first emerged in the latter half of the twentieth century, across the social sciences and STEM fields (Edelmann et al., 2020). Earlier generations of researchers were well-versed in quantitative methods, as well as in the use of a variety of computational and statistical tools, ranging from SPSS to Excel. What has changed is the sheer quantity of data now available, as well as the easy (and often free) access to sophisticated computational tools to process and analyse that data. To the extent there is indeed a revolution underway in the social sciences, then, it stems in large part from the field's intersection with the equally heralded Big Data Revolution (McAfee & Brynjolfsson, 2019).

CSS offers some very real opportunities. It enables new forms of research (e.g., large-scale simulations and more accurate predictions), allows social scientists to model and derive findings from a much larger empirical base, and offers the potential for new, cross-disciplinary insights that could lead to innovative and more effective social or economic policy interventions. In recent years, CSS has allowed researchers to better understand, among other phenomena, the roots and patterns of socioeconomic inequalities, how infectious diseases spread, trends in crime and other factors contributing to social malaise, and much more.

As with many technological innovations, however, the rhetoric—and hype—surrounding CSS can sometimes overtake reality (Blosch & Fenn, 2018). For all the undeniable opportunities, there remains a chasm between potential and what CSS is actually doing and revealing. Bridging this chasm could unlock new social insights and also, through more targeted and responsive policy interventions, lead to greater opportunities to enhance public good.

Fig. 2.1 Taxonomy of governance and policy challenges

This chapter seeks to take stock of and categorize a variety of governance and policy hurdles that continue to hold back the potential of CSS. In what follows, we outline 20 challenges that limit how data is accessed and analysed in the social sciences. We categorize these into six areas: challenges associated with the Data Ecosystem, Data Governance, Research Design, Computational Structures and Processes, the Scientific Ecosystem, and those concerned with Societal Impact (Fig. 2.1). Albert Einstein once said, “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and five minutes thinking about solutions”. In the spirit of Einstein’s maxim, we do not seek to provide detailed solutions to the identified challenges. Instead, our goal is to design a taxonomy of challenges and issues that require further exploration, in the hope of setting a research, funding, and governance agenda that could advance the field of CSS and help unleash its full potential.

2 Data Ecosystem Challenges

2.1 Data Accessibility: Paucity and Asymmetries

Although CSS is enabled by the massive explosion in data availability, in truth access to data remains a serious bottleneck. Accessibility problems take many forms. In some cases, the needed data simply do not exist. Such data paucity problems were more common in the early days of CSS but remain a challenge in particular areas of social science research, for example, in the study of certain disaster events (Burger et al., 2019). The challenges posed by data paucity are not limited to an inability to conduct research; the risk of wrong or inappropriate conclusions, built upon shaky empirical foundations, must equally be considered. Such limitations can to an extent be overcome by reliance on new and innovative forms of data, for example, those collected by social media companies or through sensors and other devices on the rapidly growing Internet of Things (IoT) (Hernandez-Suarez et al., 2019).

Even when sufficient data exists, however, accessibility can remain a problem due to asymmetries and inequalities in patterns of data ownership, as well as due to regulatory or policy bottlenecks (OECD, 2019). Recent attention on corporate concentration in the technology industry has shed light on related issues, including the vast stores of siloed data held by private sector entities that remain inaccessible to researchers and others (The World Wide Web Foundation, 2016). The European Union, for example, is working to address this challenge through policies like the Data Act, which attempts to bridge existing inequalities in access to and use of data (Bahrke & Manoury, 2022). While the open data movement and other efforts to spur data collaboratives (and similar entities) Footnote 1 have made strides in opening up some of these silos, a range of obstacles—reluctance to share data perceived as having competitive value, apprehension about inadvertently violating privacy-protective laws—means that considerable amounts of private sector data with potential public good applications remain inaccessible (Verhulst et al., 2020b). Access to such datasets could enable more effective decision-making in both the corporate and policymaking worlds, as well as stronger transparency and accountability measures across sectors (Russo & Feng, 2021). Indeed, concerns about the heightened public scrutiny and regulatory exposure that such transparency and accountability would bring are part of why larger corporations may resist open data policies.

2.2 Misaligned or Negative Incentives for Collaborating

Misaligned incentives are a common and well-understood problem in the worlds of business and social science, arising when the incentives of particular individuals or groups are not aligned with the broader common goal of a collaboration. These misalignments can stem from specific parties' interests, as well as from differences between long-term and short-term priorities (Novak, 2011). In a business supply chain, for example, misaligned incentives can cause a number of issues, ranging from operational inefficiency to higher production costs to weak market visibility (Narayanan & Raman, 2004). In order for supply chain relationships to function optimally, incentives must be realigned through trust, transparency, stronger communication, regulation, and clear contracts.

Many of these same concepts apply to the data sharing and data collaboration ecologies and thus to how data is used for CSS. Misaligned incentives can take a number of forms but commonly refer to conflicts or differences between data owners (frequently in the private sector) and those who stand to benefit or derive insights from access to data (frequently academic researchers, policy analysts, or members of civil society). Data owners may fear that sharing data with social scientists will create competitive threats or regulatory risk; social scientists, for their part, see data collaboration as a path to new insights that can enhance the public good. There are no easy solutions to such misalignments, and alleviating them will rely on a complex interplay of regulation, awareness-raising, and efforts to increase transparency and trust. For the moment, misaligned incentives remain a serious impediment to CSS research.

2.3 Poorly Understood (and Studied) Value Proposition, Benefits, and Risks

Misaligned incentives often arise when data owners and social scientists (or others who may benefit from data sharing) have different perceptions about the benefits or risks of sharing. David Lazer et al. note, for instance, that the incidence of data sharing and the opening of data may have declined in the wake of laws designed to protect privacy (e.g., the GDPR) (Lazer et al., 2020). This suggests that companies may overestimate the regulatory and other risks of making data available to researchers, while undervaluing the possible benefits. Companies may also face real concerns about data protection and data privacy that existing laws do not effectively address. Likewise, companies may be reluctant to share data, fearing that doing so will erode a competitive advantage or otherwise harm the bottom line. As our research has shown, this is often a misperception (Dahmm, 2020). Data sharing does not operate in a zero-sum ecosystem, and companies willing to open their data to external researchers may ultimately reap the benefits of new insights and new uses for their otherwise siloed datasets.

3 Data Governance Challenges

3.1 Data Reuse, Purpose Specification, and Minimization

A spate of privacy scandals has led to renewed regulatory oversight of data, data sharing, and data reuse. Such oversight is often justified and necessary. At the same time, an exclusive focus on privacy risks undermining some of the societal benefits of sharing; we need a more calibrated and nuanced understanding of risk (Verhulst, 2021). Purpose specification and minimization mandates, which seek to narrowly limit the scope of how data may be reused, pose particular challenges to CSS. Such laws or guidelines do offer consumers greater control over their data and can thus be trust-enhancing. At the same time, serious consideration must be given to the specific circumstances under which it is acceptable to reuse data and the best way to balance potential risk and reward.

Absent such consideration and clear guidelines, a secondary use—for social science research or other purposes—runs the risk of violating regulations, jeopardizing privacy, and de-legitimizing data initiatives by undermining citizen trust. Among the questions that need to be asked: What types of secondary use should be allowed (e.g., only those with a clear public benefit)? Who is permitted to reuse data? Are there types of data that should never be reused (e.g., medical data)? And what framework can allow us to weigh the potential benefits of unlocking data against the costs or risks (Verhulst et al., 2020a)? The 2019 Finnish Act on the Secondary Use of Health and Social Data is one policy model that effectively addresses these questions (Ministry of Social Affairs and Health, 2019).
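
To illustrate how such questions might be operationalized, the sketch below encodes them as a machine-checkable reuse policy. All category names, roles, and rules are illustrative assumptions, not provisions of the Finnish Act or of any actual regulation.

```python
from dataclasses import dataclass

# Illustrative rules only; real secondary-use frameworks are far richer.
PROHIBITED_CATEGORIES = {"medical", "biometric"}            # never reusable (assumption)
APPROVED_REQUESTERS = {"accredited_researcher", "public_agency"}

@dataclass
class ReuseRequest:
    data_category: str      # e.g., "mobility", "medical"
    requester_role: str     # e.g., "accredited_researcher"
    public_benefit: bool    # does the stated purpose serve a clear public good?
    risk_score: float       # 0.0 (negligible) to 1.0 (severe), assessed upstream
    benefit_score: float    # same scale, assessed upstream

def is_reuse_permitted(req: ReuseRequest) -> bool:
    """Encode the four questions above as sequential checks."""
    if req.data_category in PROHIBITED_CATEGORIES:    # which data may never be reused?
        return False
    if req.requester_role not in APPROVED_REQUESTERS: # who may reuse data?
        return False
    if not req.public_benefit:                        # which uses should be allowed?
        return False
    return req.benefit_score > req.risk_score         # weigh benefit against risk

print(is_reuse_permitted(ReuseRequest("mobility", "accredited_researcher", True, 0.2, 0.7)))  # True
```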

To tackle the challenge of purpose specification in data reuse, new processes and notions of stakeholdership must be considered. For example, one emerging vehicle for balancing risk and opportunity is the use of working groups or symposia where thought leaders, public decision-makers, representatives of industry and civil society, and citizens come together to assess and help improve existing approaches and methodologies for data collaboration. Footnote 2

3.2 Data Anonymization and Re-identification

Data anonymization and/or de-identification refers to the process by which a dataset is sanitized to remove or hide personally identifiable information with the goal of protecting individual privacy (OmniSci, n.d.-a). This process is key to maintaining personal privacy while also empowering actors to expand the ways in which data can be used without violating privacy and data protection laws. As anonymized data becomes more readily accessible and freely available, social scientists are working with large anonymized datasets to answer previously unanswerable questions. In the context of the COVID-19 pandemic, for example, social scientists used mobile phone records and anonymized credit card purchases to understand how people’s movement and spending habits shifted in response to the pandemic across numerous sectors of the economy (“The Powers and Perils of Using Digital Data to Understand Human Behaviour”, 2021).

In contrast to data anonymization, data re-identification involves matching previously anonymized data with its original owners. The general ease of re-identification means that the promised privacy of data anonymization is a weak commitment and that data privacy laws must also be applied to anonymized data (Ghinita et al., 2009; Ohm, 2010; Rubinstein & Hartzog, 2015). A particular re-identification risk is the so-called mosaic effect (Czajka et al., 2014), whereby anonymized data is re-identified by combining multiple datasets containing similar or complementary information. The mosaic effect can pose a threat both to individual and to group privacy (e.g., in the case of a small minority demographic group). Groups are frequently established through data analytics and segmentation choices (Mittelstadt, 2017). Under such conditions, individuals are often unaware that their data are being included in the context of a particular group, and decisions made on behalf of a group can limit data holders' control and agency (Radaelli et al., 2018). Children's data and humanitarian data are particularly susceptible to the challenges of group data (Berens et al., 2016; Young, 2020). Mitigation strategies include considering all possible points of intrusion, limiting analysis output details to what is truly needed, and releasing aggregated information or graphs rather than granular data. In addition, limited access conditions can be established to protect datasets that could potentially be combined (Green et al., 2017).
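
To make the mosaic effect and its mitigation concrete, the sketch below computes the k-anonymity of a toy dataset over a set of quasi-identifiers; a k of 1 means at least one record is unique and thus linkable to outside datasets. The records, fields, and release threshold are illustrative assumptions.

```python
from collections import Counter

# Toy records: "anonymized" (no names), but ZIP code, birth year, and gender
# are quasi-identifiers another dataset could share, enabling the mosaic effect.
records = [
    {"zip": "10001", "birth_year": 1984, "gender": "F", "diagnosis": "A"},
    {"zip": "10001", "birth_year": 1984, "gender": "F", "diagnosis": "B"},
    {"zip": "10002", "birth_year": 1991, "gender": "M", "diagnosis": "A"},  # unique!
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def k_anonymity(rows, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifiers.
    k == 1 means at least one person is uniquely identifiable by linkage."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(counts.values())

k = k_anonymity(records, QUASI_IDENTIFIERS)
print(f"k = {k}")   # k = 1 -> unsafe to release at this granularity
if k < 5:           # threshold is an illustrative assumption
    print("Generalize fields (e.g., 5-year age bands) or aggregate before release.")
```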

3.3 Data Rights (Co-generated Data) and Sovereignty

CSS research often leads not only to new data but also to new forms of data. In particular, the collaborative process involved in CSS often produces co-generated or co-created data, a process that raises thorny questions about data rights, data sovereignty, and the very notion of “ownership” (Ducuing, 2020a, 2020b). Without a clear owner, traditional intellectual property laws are difficult and often impossible to apply, which means that CSS may require new models of ownership and governance that promote data sharing and collaborative research while also protecting property rights (Micheli et al., 2020).

In order to tackle the challenge of ownership and governance, stakeholders in the data space have proposed a number of potential models to replace traditional norms of ownership and property. These include adopting a more collective, rights-based approach to data ownership, creating public data repositories, and establishing private data cooperatives, data trusts, or data collaboratives. Footnote 3 Each of these methods has advantages and certain disadvantages, but they all go beyond the notion of co-ownership towards concepts of co-governance or co-regulation (Richet, 2021; Rubinstein, 2018). Such shared governance models could play a critical role in removing barriers to data and enabling the research potential of CSS.

3.4 Barriers to Data Portability, Interoperability, and Platform Portability

Data portability and data interoperability approach the same concept from two different actor perspectives. Data portability refers to the ability of individuals to reuse their personal data by moving it across different service platforms in a secure way (Information Commissioner’s Office, n.d.). Data interoperability, on the other hand, allows systems to share and use data across platforms free of any restrictions (OmniSci, n.d.-b). More recently, certain observers have begun to point to the limitations of both these concepts, arguing instead for platform portability, which would, for example, allow consumers to transfer not only their personal data from one social media platform to another but also a broader set of data, including contact lists and other “rich” information (Hesse, 2021).
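
A minimal sketch of what data (and platform) portability implies technically: mapping a platform's internal records into a neutral, documented export format that another service could ingest. The internal fields and the portable schema below are hypothetical.

```python
import json

# Hypothetical internal record from "platform A"; all field names are assumptions.
internal_record = {
    "uid": "u-123",
    "display_name": "sofia",
    "contacts": ["u-456", "u-789"],
    "posts": [{"ts": "2021-03-01T12:00:00Z", "body": "hello"}],
}

def to_portable(record: dict) -> str:
    """Map an internal schema to a neutral export format so a receiving
    platform can import it without knowing platform A's internals."""
    portable = {
        "schema": "portable-profile/0.1",        # hypothetical schema identifier
        "subject_id": record["uid"],
        "profile": {"name": record["display_name"]},
        "social_graph": record["contacts"],      # the "rich" data that platform
        "content": record["posts"],              # portability aims to include
    }
    return json.dumps(portable, indent=2)

print(to_portable(internal_record))
```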

Such concepts offer great potential for data sharing and more generally for the collaboration and access that are critical to enabling CSS. Yet a series of barriers exists, ranging from the technical to the regulatory to a general lack of trust among the public (De Hert et al., 2018; Vanberg & Ünver, 2017). Technical barriers are generally surmountable (Kadadi et al., 2014). Regulatory concerns, however, are thornier, with some scholars pointing out that provisions such as Article 20 of the GDPR, the right to data portability, could be interpreted to hamper cross-platform portability and create obstacles to building such partnerships (Hesse, 2021). There are also arguments that applying the new GDPR principles may prove more challenging for small and medium-sized enterprises, which may lack the resources and technology required to comply effectively (European Commission, 2020). Such restrictions are linked to a broader set of concerns over privacy and consent. Designed to protect consumer rights, they also have the inadvertent effect of restricting the potential of sharing and collaboration. Once again, they illustrate the difficult challenge of balancing the desire to minimize risk against the desire to maximize potential in the data ecosystem.

3.5 Data Ownership and Licensing

As noted above, existing notions of data ownership and licensing pose a challenge due to the complex nature of ownership in the data ecosystem (Van Asbroeck, 2019). Traditional notions of ownership (and related concepts of copyright or IP licensing) convey a sense of exclusive control over physical or virtual property. Yet data is more complicated as an entity; data about an individual is often not “owned” or controlled by that individual but rather by an entity (a company, a government organization) that has collected the data and that is now responsible for storing it, ensuring its quality and accuracy, and protecting it from potential violations. Questions about ownership become even more complicated when we consider the nature of co-creation or co-generation (cf. above) or the data value chain, by which data is repurposed and mingled with other data to generate new insights and forms of information (Van Asbroeck, 2019). For all these reasons, there have been calls for “more holistic” models and for a recognition of the “intersecting interests” that may define data ownership, particularly of personal information (Kerry & Morris, 2019; Nelson, 2017).

The lack of conceptual and regulatory clarity over data ownership poses serious obstacles to the project of CSS (Balahur et al., 2010). It hinders data collaboration and sharing and prevents the inter-sectoral pooling of data and expertise that are so critical to conducting social science or other forms of research. In the absence of a more robust governance framework, research must often take place on the strength of ad hoc or trust-based relationships between parties—hardly a solid foundation upon which to scale CSS or harness its full potential.

4 Research Design Challenges

4.1 Injustice and Bias in Data and Algorithms

Datafication—like technology in general—is often accompanied by claims of neutrality. Yet as society becomes increasingly datafied, various forms of bias have emerged more clearly (Baeza-Yates, 2016). Bias can take many forms and present itself at various stages of the data value chain. There can be bias during the process of data collection or processing, as well as in the models or algorithms used to glean insights from datasets. Often, bias replicates existing social or political forms of exclusion. With the rise to prominence of Artificial Intelligence (AI), considerable attention has recently been paid to the issue of algorithmic bias and bias in machine learning models (Krishnamurthy, 2019; Lu et al., 2019; Turner Lee et al., 2019). Bias can also arise from incomplete data that does not necessarily replicate societal patterns but that is nonetheless unrepresentative and leads to flawed or discriminatory outcomes. Moreover, biases are not limited to the data itself but extend to interpretation, affecting frames of reference, underlying assumptions, and models of analysis (Jünger et al., 2022).
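
As an illustration of what a simple algorithmic-bias audit can look like, the sketch below computes a demographic parity gap, the difference in positive-decision rates between two groups. The data and the 0.1 tolerance are illustrative assumptions; appropriate fairness metrics and thresholds are context-specific.

```python
# Minimal bias audit: demographic parity difference between two groups.
predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # a model's positive decisions (toy data)
groups      = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

def positive_rate(group: str) -> float:
    """Share of positive decisions received by members of one group."""
    decisions = [p for p, g in zip(predictions, groups) if g == group]
    return sum(decisions) / len(decisions)

gap = abs(positive_rate("a") - positive_rate("b"))
print(f"positive rate a={positive_rate('a'):.2f}, b={positive_rate('b'):.2f}, gap={gap:.2f}")
if gap > 0.1:   # tolerance is an assumption; acceptable gaps vary by domain
    print("Warning: decisions differ substantially across groups; investigate.")
```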

Bias, in whatever form, poses serious challenges to CSS. One meta-analysis estimates that up to a third of studies using a method known as Qualitative Comparative Analysis (QCA) may be afflicted by bias, one in ten “severely so” (Thiem et al., 2020). Such problems lead to unsupported or incorrect conclusions; when translated into policy, they may result in harmful steps that perpetuate or amplify existing racial, gender, socioeconomic, and other forms of exclusion. The issues posed by bias are thus deeply tied to questions of power and justice in society and represent some of the more serious challenges to effective, fair, and responsible CSS.

4.2 Data Accuracy and Quality

Bias is also one of the main contributors to problems of data quality and accuracy. But these problems are multidimensional—i.e., they are caused by many factors—and inevitably represent a serious challenge to any project involving computational or data-led social research. Exacerbating matters, the very notions of accuracy and, especially, quality are contested, with definitions and standards varying widely across projects, geographies, and legal jurisdictions.

To an extent, the concept of accuracy can be simplified to a question of whether data is factually correct (facts, of course, are themselves contested). Quality is, however, a more nebulous concept, extending not only to the data itself but to various links in the data chain, including how the data was collected, stored, and processed (Dimitrova, 2021; Herrera & Kapur, 2007). In order to advance the field of CSS, clearer definitions and standards will be required. In developing them, it will be critical to bring data subjects themselves into the conversation, both to provide a measure of participatory validation and to ensure that any adopted standards have widespread buy-in.
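
As a concrete illustration, a lightweight data-quality screen might check completeness, plausible value ranges, and duplicates before analysis. The sketch below is a minimal example; field names and thresholds are illustrative assumptions.

```python
# Lightweight data-quality screen: completeness, plausible ranges, duplicates.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},   # missing value
    {"id": 3, "age": 212, "income": 48000},    # implausible age
    {"id": 3, "age": 29, "income": 48000},     # duplicate id
]

def quality_report(rows):
    """Tally three simple quality problems in one pass over the data."""
    report = {"missing_age": 0, "age_out_of_range": 0, "duplicate_ids": 0}
    seen = set()
    for r in rows:
        if r["age"] is None:
            report["missing_age"] += 1
        elif not (0 <= r["age"] <= 120):        # plausibility bound (assumption)
            report["age_out_of_range"] += 1
        if r["id"] in seen:
            report["duplicate_ids"] += 1
        seen.add(r["id"])
    return report

print(quality_report(rows))   # {'missing_age': 1, 'age_out_of_range': 1, 'duplicate_ids': 1}
```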

4.3 Data Invisibles and Systemic Inequalities

The concept of “data invisibles” refers to individuals who are outside the formal or digital economy and thus systematically excluded from its benefits (Shuman & Paramita, 2016). Because many of these individuals are located in developing countries, many datasets, and the algorithmic models trained on them, systematically exclude non-Western citizens, gender invisibles, and countless other disadvantaged populations and minority groups, posing further challenges to the accuracy of CSS and its findings (D’Ignazio & Klein, 2018; Fisher & Streinz, 2021; Naudts, 2019; Neumayer et al., 2021).

The problem of data invisibility is exacerbated by data governance practices that fail to proactively take into account the need for inclusion (D’Ignazio & Klein, 2018; Fisher & Streinz, 2021; Naudts, 2019; Neumayer et al., 2021). Such practices include insufficient or non-existent guidelines or standards on data quality and representativeness; a lack of robust accountability and auditing mechanisms Footnote 4 for algorithms or machine learning models; and the demographic composition of research teams, which often lack diversity or representation of those studied. In order to strengthen the practice of CSS, then, it will be necessary to address the wider ecosystem of data governance.
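
One hedged sketch of what such an auditing mechanism might look like in practice: comparing a dataset's group shares against an external benchmark (such as census figures) to flag underrepresented groups. All figures and the flagging threshold below are hypothetical.

```python
# Compare group shares in a dataset with external benchmarks (e.g., census
# figures) to flag underrepresentation. All numbers here are hypothetical.
benchmark_shares = {"group_a": 0.48, "group_b": 0.40, "group_c": 0.12}
sample_counts    = {"group_a": 560,  "group_b": 410,  "group_c": 30}

total = sum(sample_counts.values())
for group, expected in benchmark_shares.items():
    observed = sample_counts[group] / total
    ratio = observed / expected
    flag = "  <-- underrepresented" if ratio < 0.5 else ""   # threshold assumed
    print(f"{group}: observed {observed:.2%} vs benchmark {expected:.2%}{flag}")
```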

5 Computational Structures and Processes Challenges

5.1 Human Computation, Collective Intelligence, and Exploitation

Collective intelligence refers to the shared reasoning and insights that arise from our collective participation (both collaborative and competitive) in the data ecosystem (Figueroa & Pérez, 2018; Lévy, 2010). Collective intelligence has emerged as a potentially powerful tool for understanding our societies and for producing more effective policies, and it offers tremendous potential for CSS. However, collective intelligence also faces a number of limitations that compromise the quality of its insights. These include bureaucratization, which prevents lower-level actors from sharing their insights or expertise; the so-called “common knowledge” effect, whereby participants do not strive to go beyond conventional wisdom; and informational pressures, which limit independent thought and action.

All of these challenges negatively impact collective intelligence and, indirectly, CSS. A further challenge emerging in this space, especially as collective intelligence intersects with AI, relates to the exploitation of machines, which may be co-participants in the process of collectively generated intelligence (Caverlee, 2013; Melo et al., 2016). Although this challenge remains more hypothetical than actual at present, it raises complex ethical questions that could ultimately impact how research is conducted and who has the right to take credit (or blame) for its findings.

5.2 Need for Increased Computational Processing Power and Tackling Related Environmental Challenges

The massive amounts of data available for social science research require equally massive amounts of computational processing power. This raises important questions about equity and inclusion and also poses serious environmental challenges (Lazer et al., 2020). According to a recent study by Harvard's John A. Paulson School of Engineering and Applied Sciences, modern data centres already account for 1% of global energy consumption, a number that is rapidly increasing (Harvard John A. Paulson School of Engineering and Applied Sciences, 2021). The study points out that in addition to energy use, our data economy also contributes indirectly to pollution, for example, through e-waste. Such problems are only likely to increase with the growing prominence of blockchain and the so-called Web3, which are already making their impact felt in the social sciences (Hurt, 2018). According to the Bitcoin Energy Consumption Index, Bitcoin alone generates as much electronic waste annually as the entire country of the Netherlands. A single Bitcoin transaction consumes roughly as much energy as an average US home uses in 64.61 days (“Bitcoin Energy Consumption Index”, n.d.).
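
A back-of-envelope check of what that per-transaction comparison implies, using an assumed average US household consumption (EIA estimates put it at roughly 10,500 to 10,700 kWh per year):

```python
# Back-of-envelope check of the per-transaction comparison cited above.
# The household figure is an assumption (EIA estimates ~10,500-10,700 kWh/yr);
# the 64.61-day figure comes from the cited index.
US_HOME_KWH_PER_YEAR = 10_600        # assumption
DAYS_PER_TRANSACTION = 64.61         # from the Bitcoin Energy Consumption Index

kwh_per_day = US_HOME_KWH_PER_YEAR / 365
implied_kwh_per_tx = kwh_per_day * DAYS_PER_TRANSACTION
print(f"~{implied_kwh_per_tx:,.0f} kWh per transaction")   # roughly 1,900 kWh
```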

Computational processing requirements also pose serious obstacles to participation by less developed countries and by marginalized groups within developed countries, both of which may lack the necessary financial and technical resources (Johnson, 2020). Such exclusion may lead, in turn, to unrepresentative or biased social science research and conclusions. One possible solution lies in developing new, less computationally demanding models to analyse data. Solutions of this nature have been developed, for instance, to analyse data from Instagram to monitor social media trends and for natural language processing algorithms that make it easier to process and derive insights from social media data (Pryzant et al., 2018; Riis et al., 2021). Another potential strategy is volunteer computing, wherein a problem that would ordinarily require the computing power of a supercomputer is broken down and solved by thousands of volunteers using their personal computers (Toth et al., 2011); a conceptual sketch follows below. For volunteer computing to remain viable in the long run, however, the number of volunteers must keep pace with growing demand. These developments are just a start, but they represent efforts to address current limitations in processing power and to help achieve more robust and equitable insights from CSS analyses.
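
The sketch below illustrates the core idea of volunteer computing: a coordinator splits a large job into independent work units, each computed separately and then aggregated. The task (a Monte Carlo estimate of pi) and all parameters are illustrative; real systems such as BOINC add scheduling, validation, and redundancy.

```python
import random

def make_work_units(n_samples: int, n_volunteers: int) -> list[int]:
    """Coordinator: split a Monte Carlo estimate of pi into independent chunks."""
    chunk = n_samples // n_volunteers
    return [chunk] * n_volunteers

def volunteer_compute(n: int, seed: int) -> int:
    """Work one volunteer performs: count random points inside the unit circle."""
    rng = random.Random(seed)
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

# Each work unit is independent, so thousands of volunteers could run these
# in parallel; here we simulate them sequentially.
units = make_work_units(n_samples=100_000, n_volunteers=10)
hits = sum(volunteer_compute(n, seed=i) for i, n in enumerate(units))
print(f"pi ~= {4 * hits / sum(units):.3f}")   # aggregated from all volunteers
```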

6 Scientific Ecosystem Challenges

6.1 Domain, Computational, and Data Expertise: The Need for Interdisciplinary Collaboration Networks

As the field of CSS develops, the divide between domain, computational, and data expertise is emerging as a limiting factor. There is a pressing need for interdisciplinary collaboration networks to help bridge this divide and achieve more accurate insights and findings. For example, in order to effectively use large anonymized datasets on credit card purchases to understand shifts in spending patterns, a research team would need the combined expertise of data scientists, economists, sociologists, and anthropologists.

One possible way to bridge this gap in CSS applications is by relying on “bilinguals” Footnote 5—scholars and professionals who possess expertise across domains and sectors (Porway, 2019). Such individuals can bring the requisite understanding of the social sciences alongside the strong data know-how required for CSS research. The valuable contribution of bilinguals is evident in the GovLab’s 100 Questions initiative, which seeks to identify the most pressing questions facing the world that can be answered by leveraging datasets in a responsible manner (“The 100 Questions Initiative—About”, n.d.). Each bilingual brings specific sector expertise coupled with a strong foundation in data science, enabling them not only to draw out the most critical questions facing a domain but also to identify those that can be answered with the data currently available (“The 100 Questions Initiative—About”, n.d.). In this way, interdisciplinary collaboration networks and bilinguals can help bridge the knowledge gap that exists in the field of CSS and unlock more insightful outcomes with potentially greater public impact.

6.2 Conflict of Interests, Corporate Funding, Data Donation Dependencies, and Other Ethical Limitations

Conflicts of interest—real or perceived—are a major concern in all social science research. Such conflicts can skew research results even when they are declared (Friedman & Richter, 2004). Many long-standing ethical concerns are relevant within the field of CSS, including issues related to funding, conflicts of interest (which may not be limited to financial interests), and the scope or type of work. Yet the use of data and emerging computational methods, for which ethical boundaries are often blurred, complicates matters and introduces new concerns. One recent study, for example, points to the difficulties in defence-sector work involving technology, highlighting “the code of ethics of social science organizations and their limits in dealing with ethical problems of new technologies” and “the need to develop an ethical imagination about technological advances and research and develop an appropriately supportive environment for promoting ethical behavior in the scientific community” (Goolsby, 2005). Such recommendations point to the shifting boundaries of ethics in a nascent and rapidly evolving field.

In addition to standard concerns over financial conflicts of interest, CSS practitioners must also consider ethical concerns arising from non-financial contributions, especially shared data. Data donations, for instance, can pose challenges of quality and transparency, creating dependencies and vulnerabilities for the researchers using the data in their work, as was seen in Facebook’s Social Science One project (Timberg, 2021). In a collaborative landscape characterized by significant reliance on corporate data, the sources of such data, as well as the motivations involved in sharing it, must be acknowledged, and their potential impact on research thoroughly considered.

6.3 The Failure of Reproducibility

Reproducibility is a critical part of the scientific process, as it enables other researchers to verify or challenge the veracity of a study’s findings (Coveney et al., 2021). This ensures that high standards of research are maintained and that findings can be corroborated by multiple actors, strengthening their credibility. While the concept has long been embraced by the scientific community, it has only recently entered the work of social scientists and computational social scientists. Reproducibility has generally been difficult to achieve in CSS due to the many difficulties, outlined above, surrounding data sharing and open software agreements. A lack of transparency in computational research further aggravates the challenge, making reproducible practices extremely difficult to implement.

In order to address this challenge, scholars have suggested the use of open trusted repositories as a potential solution (Stodden et al., 2016). Such repositories would enable researchers to share their data, software, and other details of their work in a secure manner to encourage collaboration and reproducibility without compromising the integrity of the original researcher’s work. More generally, a stronger culture of collaboration in the ecosystem would also help increase the adoption of reproducibility, which would be to the benefit of computational sciences as a whole (Kedron et al., 2021).
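
One concrete practice such repositories could support is a reproducibility manifest that pins cryptographic hashes of a study's inputs, letting others verify that a replication uses identical data and code. The sketch below is a minimal illustration; the file paths are placeholders.

```python
import hashlib
import json
import sys

def sha256_of(path: str) -> str:
    """Hash a file in blocks so even large datasets can be fingerprinted."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(files: list[str], out: str = "manifest.json") -> None:
    """Record the runtime version and exact input hashes for later verification."""
    manifest = {
        "python": sys.version.split()[0],               # pin the runtime version
        "inputs": {p: sha256_of(p) for p in files},     # pin data and code exactly
    }
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)

# Example usage (placeholder paths): write_manifest(["data.csv", "analysis.py"])
```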

7 Societal Impact Challenges

7.1 Need for Citizen/Community Engagement and Acquiring a Social License

Trust has emerged as a major issue in the data ecosystem. In order for CSS research to be successful, it requires buy-in from citizens and communities. This is particularly true given the heavy reliance on data sharing, which requires trust and a trust-building culture to sustain the required inter-sectoral collaboration. For instance, a 2012 “Manifesto of Computational Social Science”, published in the European Physical Journal, emphasizes the importance of involving citizens in gathering data and of “enhancing citizen participation in [the] decision process” (Conte et al., 2012).

In pursuit of such goals, CSS can borrow from the existing methodology of “Citizen Science”, which highlights the role of community participation in various stages of social sciences research (Albert et al., 2021). Citizen Science methods can be adapted for—and in some cases strengthened by—the era of big data. New and emerging methods include crowdsourcing through citizen involvement in data gathering (e.g., through the IoT and other sensors); collaborative decision-making processes facilitated by technology that involve a greater range of stakeholders; and technologies to harness the distributed intelligence or expertise of citizens. Recently, some social scientists have also relied on so-called pop-up experiments (or PUEs), defined by one set of Spanish researchers as “physical, light, very flexible, highly adaptable, reproducible, transportable, tuneable, collective, participatory and public experimental set-up for urban contexts” (Sagarra et al., 2016). Indeed, urban settings have proven particularly fertile ground for such methodological innovations, given the density of citizens and data-generating devices.

7.2 Lack of Data Literacy and Agency

A lack of public understanding of data and data governance means that the public faces considerable risk associated with the mismanagement of their data and exploitative data practices. This is particularly the case given that the current data ecosystem is largely dominated by corporate actors, who control access to large amounts of personal data and may use that data for their own gain (Micheli et al., 2020). In order to address the associated inequalities and power asymmetries and to begin democratizing the data ecosystem, data governance methods must improve. Legislation such as the European Union's General Data Protection Regulation (GDPR) is a step in the right direction. In addition to legislative change, the development of data sharing infrastructures and the involvement of government and third sector actors in data collaborations with private actors will help mitigate the challenges of weak data literacy and agency among the public.

A lack of data literacy and agency has both ethical and practical implications for CSS (Chen et al., 2021; Pryzant et al., 2018; Sokol & Flach, 2020). In the context of data, agency refers to the power to make decisions about where and how one's data is used. Without sufficient awareness and agency, it is hard not only for individuals to meaningfully consent to their data being used but also for researchers to effectively and responsibly collect and use data for their studies. Moreover, a lack of data literacy and agency makes it difficult for citizens and others to interpret the results of a study or to implement policy and other concrete steps informed by CSS research. For CSS to achieve its potential, a stronger foundation of data literacy and an understanding of agency will be crucial both among the general public and among key decision-makers.

7.3 Computational Solutionism and Determinism

Determinism has a long and problematic history in the social sciences, with concerns historically raised about overly prescriptive or simplistic explanatory frameworks and models for human and social behaviour (Richardson & Bishop, 2002). CSS holds the potential both to improve upon such difficulties and to exacerbate them. The intersection of “technological determinism” and the social sciences gives particular grounds for wariness; any attempt to derive social explanations from technical phenomena must resist the temptation to construct overly deterministic or linear explanations. Models based on unrepresentative or otherwise flawed datasets (as described above) similarly risk producing flawed solutions and policy interventions.

At the same time, Big Data offers at least the theoretical potential for richer and more complete empirical frameworks. Some have gone so far as to suggest that the interaction of Big Data and the social sciences could spell the “end of theory”, offering social scientists a less deterministic, less hypothesis-bound framework through which to approach the world (Kitchin, 2014). CSS also offers the potential of more realistic and complex simulations that can help social scientists and policymakers understand phenomena as well as the potential outcomes of interventions (Tolk et al., 2018). For such visions to become a reality, however, the challenges posed to collaboration and sharing—many discussed in this paper—need to be mitigated.
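
To make the kind of simulation referenced above concrete, the sketch below implements a compact Schelling-style segregation model, a canonical CSS simulation in which mildly biased individual preferences generate strong aggregate segregation. The grid size, tolerance, and step count are arbitrary illustrative choices; this is a toy, not a research-grade model.

```python
import random

# Compact Schelling-style segregation model on a wrap-around grid.
SIZE, EMPTY_FRAC, TOLERANCE, STEPS = 20, 0.2, 0.3, 20000
random.seed(0)

# Populate the grid with two agent types ("A", "B") and some empty cells (None).
cells = (["A", "B"] * (SIZE * SIZE))[: int(SIZE * SIZE * (1 - EMPTY_FRAC))]
cells += [None] * (SIZE * SIZE - len(cells))
random.shuffle(cells)
grid = [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def unhappy(r, c):
    """An agent is unhappy if too few of its occupied neighbours share its type."""
    me = grid[r][c]
    neigh = [grid[(r + dr) % SIZE][(c + dc) % SIZE]
             for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    occupied = [n for n in neigh if n is not None]
    return occupied and sum(n == me for n in occupied) / len(occupied) < TOLERANCE

# Repeatedly let a random unhappy agent move to a random empty cell.
for _ in range(STEPS):
    r, c = random.randrange(SIZE), random.randrange(SIZE)
    if grid[r][c] is not None and unhappy(r, c):
        er, ec = random.randrange(SIZE), random.randrange(SIZE)
        if grid[er][ec] is None:
            grid[er][ec], grid[r][c] = grid[r][c], None

# Clusters of A and B emerge despite each agent tolerating 70% unlike neighbours.
print("\n".join("".join(cell or "." for cell in row) for row in grid))
```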

7.4 Computational/Data Surveillance and the Risk of Exploitation

The final societal impact challenge associated with CSS pertains to the risk of computational and data surveillance (Tufekci, 2014). Considerable concern already exists over the data insights that drive targeted advertising, personalized social media content and disinformation, and more. We live, as Shoshana Zuboff has famously observed, in an age of “surveillance capitalism” (Zuboff, 2019).

This economy creates challenges related to misinformation and polarization, and it is a direct result of companies' ability to exploit the wealth of data they hold on their users. While the potential benefits of CSS are manifold, there is also a risk of new forms of exploitation and manipulation, based on new insights and new forms of data (Caled & Silva, 2021). Each case of exploitation causes direct harm and further erodes trust in the broader ecosystem. Tackling this challenge will require a series of actions, legislative and otherwise, aimed at encouraging responsible data-driven research and CSS; new legislation addressing the uses of data and Computational Social Science analyses will be particularly critical. Many potential actions are outlined in this paper; further research is needed to flesh out some of these proposals and to develop new ones.

8 Reflections and Conclusion

The intersection of big data, advanced computational tools, and the social sciences is now well established among researchers and policymakers around the world. The potential for dramatic, perhaps even revolutionary, insights and impact is clear. But as this paper—and others in this volume—shows, many hurdles remain to achieving that potential. The priority, therefore, is not simply to find ways to leverage data in the pursuit of research but, equally or more importantly, to innovate in how we govern the use of data for the social sciences.

An effective governance framework needs to be multipronged. It would cover the broader ecologies of data, technology, science, and social science. It would address how data is collected and shared, as well as how research is conducted and translated into insights and, ultimately, impact. It would also seek to promote the adoption of more robust data literacy and skills standards and programs. The above touches upon a number of specific suggestions, some of which we hope to expand upon in future research or writing projects. Elements of a responsible governance framework include the need to foster interdisciplinary collaboration; more fairly distribute computational power and technical and financial resources; rethink our notions of ownership and data rights; address misaligned incentives and misunderstood aspects of data reuse and collaboration; and ensure better quality data and representation. Last but not least, a responsible governance framework ought to develop a new research agenda in alignment with emerging concepts and concerns from the data ecosystem.

Perhaps the most urgent priority is the need to gain (or regain) a social license for the reuse of data in the pursuit of social and scientific knowledge. A social license to operate refers to the public acceptance of the business practices or operating procedures of a specific organization or industry (Kenton, 2021). In recent years the tremendous potential of data sharing and collaboration has been somewhat clouded by rising anxiety over misuses of data and the resulting violations of privacy and growth of surveillance. These risks are very real, as are the resulting harms to individual and community rights. They have eroded the trust of the public and policymakers in data and data collaboration and undermined the possibilities offered by data sharing and CSS.

The solution, however, is not to pull away. Rather, we must strengthen the governance framework—and wider norms—within which data reuse and data-driven research take place. This paper represents an initial gesture in that direction. By identifying problems, we hope to take steps towards solutions.