1 Introduction

From the beginnings of epidemiology, data has been central to the field. John Graunt and John Snow are often considered its fathers: Graunt analysed London’s bills of mortality to measure the mortality of certain diseases in 1663, and in 1854 Snow mapped cholera cases to identify the disease’s sources. Although the medical and mathematical understanding of disease has greatly advanced since those early days in London, one of the primary roles of the epidemiologist is still to prepare and organize the collection of relevant and useful data and to use it to model disease (Obi et al., 2020). This data includes the fundamental Ws that are necessary to understand disease: health event (what), people involved (who), place (where), time (when), and causes, risk factors, and modes of transmission (why/how) (Dicker et al., 2006). Thus, some of the main tasks of an epidemiologist are disease surveillance, field investigation, contact tracing, evaluation of interventions, and public communication, all of which have been transformed by the digital and computing revolutions.

Scientifically, the field is highly multidisciplinary, first measuring the basics of the Ws—identifying the people, places, and time frames of the health events—and then introducing higher-order considerations, the biology of disease, behaviour of its carriers, and ecological influences on the transmission. By building models around this knowledge, it attempts to recommend possible interventions, which then require additional measurement and modelling of complex feedback effects and the psychological and behavioural factors. Advances in disparate fields like genetics, behavioural economics, and ecology on the one hand and more recent strides in computing methods and digitization on the other are making it possible for epidemiology to develop a systems conceptualization of the fields it connects. Computational social science (CSS) in particular adds new tools via large-scale detection, tracking, and contextualizing of disease. As we will see below, digital traces such as mobility and cellphone data have been used to better understand human networks, user-generated content on social media and the web has been employed to now-cast symptoms and disease, and social interactions have been monitored to understand the impact of social contact and new information on health-related behaviour change. Capturing the latest modelling and computing techniques, the umbrella terms of digital or computational epidemiology encompass these new methodological developments.

A string of epidemics in the early twenty-first century—H1N1 (swine flu) in 2009, Ebola in 2014 and 2019, and Zika in 2016—has brought epidemiology to the forefront of public awareness, culminating in the COVID-19 pandemic (at the time of this writing, in any case). Meanwhile, public health policy and interventions are being increasingly informed by telecommunications and other digital data (Budd et al., 2020; Oliver et al., 2020; Rich & Miah, 2017). Governments are collaborating with major cellphone companies to perform privacy-preserving contact tracing, internet companies are releasing aggregated mobility data for contagion modelling, and social media giants are partnering with public health organizations to tackle health misinformation and to support public health messaging campaigns. Throughout, a constant negotiation is at play between the needs of public health researchers and the release of commercially valuable information by the companies. Moreover, a less publicized, but nevertheless critical, battle is being waged against non-communicable diseases including cardiovascular diseases, cancer, diabetes, and mental health disorders. Daily digital traces, such as social media posts and location check-ins, are being used to understand the lifestyle choices of large cohorts, as an alternative to surveys and diaries. Discussions around mental health, disordered eating, illicit drugs, and other topics that are difficult to capture using traditional surveillance methods are now presenting a window into vulnerable populations, even before they register in medical records.

Despite the great promise of new data sources and methodologies, big data approaches are subject to a slew of challenges that the field needs to overcome in order to establish fruitful collaboration with policymakers. Although big, the datasets often present a biased view of the population, over-representing more tech-savvy and affluent users while excluding those who may have a more urgent need of monitoring and assistance. However, integrating this new data with existing datasets allows for the reduction of overall bias and helps in extending analyses performed on traditional data sources. Encompassing many disciplines that have their own organization, research frameworks, peculiar jargon, and publication venues, digital epidemiology is still in the process of bridging the silos to encourage truly multidisciplinary insight. Standardizing reporting and transparency among these disciplines aims to reduce the number of isolated studies that may suffer from a lack of reproducibility due to the peculiar nature of the available data, the application domain, or poorly documented methodology. The legal and ethical standards of using digital data are still being decided through a dialogue between data owners, public health researchers, academics in various disciplines, and representatives of the users of the digital platforms. Thus, the field is still building the structures of cooperation, trust, and legitimacy that are necessary to provide impactful insights for policymakers. Nevertheless, COVID-19 has accelerated the integration of digital epidemiology into the policy decision-making process. Below, we outline the major accomplishments in the application of computational social science to epidemiology, the accompanying challenges, and the possible ways forward to greater legitimacy and impact.

2 Existing Literature

The explosion in the use of computational methods for epidemiology has been spurred by the combination of new computational techniques and the availability of new sources of data. The immense volume of available data has encouraged the further development of distributed computing frameworks and data-intensive deep learning algorithms, and their integration into the scientific toolkit, with frameworks such as Apache Spark and TensorFlow allowing the ingestion and processing of terabytes of data (Kleppmann, 2017; Weidman, 2019). The rise of the infrastructure as a service (IaaS) business model, offered by industry giants including Amazon Web Services and Microsoft Azure as well as cloud services from Oracle, Google, and IBM, has allowed researchers to access sophisticated infrastructure without purchasing hardware or hiring support staff within their institutions (on a similar topic, see also Fontana & Guerzoni, 2023).

Much of the data that has accompanied these developments in the computing field has been put to use by epidemiologists, opening new scientific ground. The ongoing digitization of medical records, insurance claims, and governmental public health data continues to provide a large-scale, high-quality view of individuals within the medical system. Ongoing efforts, such as the European Health Data Space, aggregate such datasets, handle privacy concerns, and make them available for research and policymaking (European Commission, 2021). Moreover, the communication revolution has enabled researchers to better understand these individuals even before they enter the public health system. Digital traces of people’s daily activities, including the apps they use, web searches they make, social media posts they publish, as well as the signals from the wearables they keep on their bodies, can help create a view of health-related activities with an unprecedented resolution and reach. One of the earliest attempts to track influenza-like illness (ILI) using user-generated data was proposed by a team of Google researchers who tracked the occurrence of specific keywords in the company’s search query logs (Ginsberg et al., 2009). Although this approach was highly criticized by subsequent researchers (Lazer et al., 2014; we will discuss these concerns below), research on web logs continues to produce encouraging results, including detecting adverse reactions (Yom-Tov & Lev-Ran, 2017), predicting diagnosis of diabetes (Hochberg et al., 2019), and understanding the information needs around medical topics (Rosenblum & Yom-Tov, 2017). Data from specialized applications has been used to understand the effects of gamification (Althoff et al., 2016) and social contagion (Aral & Nicolaides, 2017) on exercise and the characteristics of (un-)successful diets (Weber & Achananuparp, 2016).
The text posted by thousands of users on social media platforms has been used to identify and track depression (De Choudhury et al., 2013), eating disorders (Stewart et al., 2017), attitudes toward vaccination (Cossard et al., 2020), and other health interventions. The networked nature of the data often allows the study of the way in which information (Johnson et al., 2020), behaviours, and diseases propagate. Finally, anonymized mobility data, often coming from telephone and transportation companies, has allowed a more fine-grained transmission modelling of the disease (Vespe et al., 2021), as well as the impact of mobility-related interventions (Jeffrey et al., 2020). These data sources add immense value to the traditional ones by increasing the population coverage (some into millions of people), temporal resolution (allowing “now-casting”), and qualitative depth that are impossible or prohibitively expensive to reach outside the digital domain.

One of the earliest examples of the application of computational models to infectious diseases was human influenza, which is an ongoing public health battle. It is continuously analysed via viral phylodynamics in order to better understand its transmission dynamics. Computational phylogenetics methods are applied to datasets of genetic sequences sampled over time and sub-populations in order to assemble a phylogenetic tree and estimate various dynamics of the process (Volz et al., 2013). Fitness models also help in selecting the vaccines year over year (Łuksza & Lässig, 2014). Beyond the study of the virus itself, CSS has introduced several behavioural aspects to the models, many of which have been used during the COVID-19 epidemic. Mobility data (including that provided publicly by large corporations during the pandemic) has been used to monitor the compliance with interventions, such as the stay-at-home orders during COVID-19, revealing the role of awareness and fatigue in modelling risky behaviours (Weitz et al., 2020). Large-scale online surveys and crowdsourcing have been used to gauge psychological and behavioural responses to the pandemic around the world (Yamada et al., 2021). Even larger efforts, such as InfluenzaNet, recruit thousands of volunteers across Europe to regularly report ILI symptoms, allowing researchers to identify risk factors and gauge influenza vaccine effectiveness (Koppeschaar et al., 2017). Travel records have been used to track the international transmission of disease (Azad & Devi, 2020), whereas a machine-learned anonymized smartphone mobility map has been used to forecast influenza within and across countries (Venkatramanan et al., 2021). 
For instance, the Global Epidemic and Mobility (GLEaM) framework uses local and international mobility data to build epidemic models, allowing for the simulation of worldwide pandemics, including estimating the impact of interventions during the COVID-19 epidemic (Chinazzi et al., 2020; Van den Broeck et al., 2011). To better understand the reasons behind risky behaviours and non-compliance with public health advice, researchers utilized discussions on social media, often finding misunderstandings and downright misinformation (Betti et al., 2021; Keller et al., 2021). Finally, public health communication campaigns have been evaluated using outreach online by influencers (Bonnevie et al., 2020) as well as news websites and popular social media sites (Carlson et al., 2020).

Unlike in the early days of epidemiology’s development as a science, infectious diseases have now given way to non-communicable diseases as the main cause of illness and death, especially in developed countries. The daily behaviours captured in digital trace data, especially social media, have been extensively used to study non-communicable diseases including obesity and type 2 diabetes, mental illness, and even suicide. At the population level, diabetes has been tracked using store purchase data (Aiello et al., 2019), as well as social media posts (Abbar et al., 2015), and some environmental causes have been tracked in the USA, with a focus on “food deserts” where access to healthy food is limited (De Choudhury, Sharma, et al., 2016). Attempts to inform potential interventions have been made by measuring the importance of community support during a weight loss journey (Cunha et al., 2016) and the effect of intervention messaging on those affected by anorexia (Yom-Tov et al., 2012). Observational studies of exercise, in particular through specialized exercise applications, have shown that information about other people’s routine may affect one’s own (Aral & Nicolaides, 2017) and that gender plays an important role in the continued use of such apps (Mejova & Kalimeri, 2019). Further, a combination of web search and wearables data has been used to show the health impact of applications not necessarily meant for exercise, such as Pokémon Go, which potentially added years’ worth of life span for the fans of the game (Althoff et al., 2016). The anonymous and connected nature of social media and specialized forums has also allowed a better understanding of depression, anxiety, eating disorders, and other mental health issues (for an overview, see Chancellor & De Choudhury, 2020). The text of the posts has been used to predict suicidal ideation (Cheng et al., 2017), psychotic relapses (Birnbaum et al., 2019), and PTSD (Coppersmith et al., 2014).
More specialized data sources have been used to track recreational drug use (Deluca et al., 2012), as well as the use of the “dark web” as a marketplace for such activities (Aldridge & Décary-Hétu, 2016). In combination with screening questionnaires that use validated scales such as the Center for Epidemiologic Studies Depression Scale (CES-D) and the Beck Depression Inventory (BDI), the daily self-expressions of those dealing with mental health issues provide an unobtrusive record of the condition’s progression and reactions to potential interventions.

These encouraging developments have been accompanied by a vigorous discussion of their limitations. The privacy concerns regarding secondary use of personal data, even if originally posted on public platforms, demand a critical evaluation of the balance between the potential benefits of public health research and the privacy risks to the individuals captured in the data (see, e.g. Taylor, 2023). Other critiques are more specific to the field of epidemiology. For instance, the machine learning framework of classification, as well as most deterministic compartmental models (such as Susceptible-Infected-Recovered (SIR), more on which later), necessarily makes simplifying assumptions about the natural progression of a disease, its behaviour, and the pharmaceutical and non-pharmaceutical interventions introduced to slow its spread, although more sophisticated models with more complex representations are continuously being proposed.

The separation between traditional epidemiology and computing disciplines in the research teams often results in the failure to take into consideration the established theories in clinical science, using operationalization that is most convenient technically, but not as well matched to the medical condition tracked, while a vague communication of the technical aspects of computing pipelines makes it difficult to integrate the results into clinical practice (Chancellor & De Choudhury, 2020). Observational studies have also lacked the rigor of causal analysis, often stopping at correlational observations. Despite capturing multitudes of people, each data source has substantial biases that must be not only acknowledged by the researchers but accounted for in the analytical pipeline (Yom-Tov, 2019). Finally, data ownership, global justice, and ethical oversight are all important problems that need to be addressed for digital epidemiology to gain legitimacy on the scientific and policy stage (Vayena et al., 2015). We will touch on these and other peculiarities of using computational social science for epidemiology in the next section.

3 Computational Guidelines

The abovementioned literature not only pushes the boundaries of traditional epidemiology and the purview of computing but also addresses multiple important policy questions regarding public health. The third goal of the UN Sustainable Development Goals (SDGs) is to “Ensure healthy lives and promote well-being for all at all ages”. For instance, the goal encompasses the work on alleviating communicable and non-communicable diseases, prevention and treatment of substance abuse, ensuring access to sexual and reproductive services, and increasing the healthcare capacity in all countries, but especially in the developing ones. Although CSS cannot build the necessary infrastructure, it can measure, on both community and individual scale, the utilization of healthcare services, the barriers experienced by the populace, and the expression of unfulfilled needs. Furthermore, it can help in tracking and forecasting disease, again at scales down to the individual, thus measuring the impact of potential ongoing interventions. In fact, CSS can help to craft, deploy, and monitor epidemiological interventions by providing detailed profiling of the target audience, individualized message delivery, and fine-grained behavioural feedback. In order to bring these promises to fruition, a slew of challenges remain to be fully addressed by the research and policy community, including data access and privacy, construct validity, methodological transparency, sampling bias, accounting for confounders, and finally sufficiently clear communication to ensure real-world application. Below, we discuss several policy questions that CSS may address and outline technical and organizational best practices.

3.1 Infectious Diseases

The modelling and prediction of infectious diseases is perhaps the best-known purview of digital epidemiology. Some of the simplest models of disease spread use a system of states as a basis, such as the Susceptible-Infected-Recovered (SIR) model, wherein each member of the population is placed into one of these three states (Bjørnstad et al., 2020). Other compartmental models exist which describe the progression of disease with more states (“compartments”), including Asymptomatic infectious, Hospitalized, etc. (Blackwood & Childs, 2018). Such states may also include behaviours of the population segments, including those produced via interventions such as quarantining (Maier & Brockmann, 2020) and wearing masks (Ngonghala et al., 2020). The SIR model has also been extended to incorporate the age structure in the contact matrices (Walker et al., 2020). Compartmental models are popular because they can be designed to frame the essential parts of a question and to work with reduced amounts of data for calibration. By varying parameters such as the time between cases, the average rate at which an individual infects others, and the time an infected individual takes to recover, researchers can estimate the increase in cases, as well as other properties of the epidemic. For instance, during the COVID-19 epidemic, the effective reproduction number R, or average number of secondary cases per infectious case in a population made up of both susceptible and non-susceptible hosts, has been closely watched and estimated in different affected countries, providing an important characterization of the disease’s spread (D’Arienzo & Coniglio, 2020). This classic model has recently been challenged, and improvements have been proposed.
For instance, the assumption that any individual may contact and thus infect any other in a population (homogeneous mixing) has been shown to be an oversimplification of the way people interact in reality; instead, considering other information, such as differential susceptibility by age, may improve the models (Q.-H. Liu et al., 2018).
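
As a minimal sketch of the compartmental approach described above, the following Python code integrates the deterministic SIR equations with simple forward Euler steps and computes the effective reproduction number as the basic reproduction number scaled by the susceptible fraction. The function names, the step size, and the parameter values (beta = 0.3 and gamma = 0.1 per day in a population of 10,000) are illustrative assumptions, not estimates from any of the cited studies, and homogeneous mixing is assumed throughout.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the deterministic SIR equations with forward Euler steps.

    beta  : transmission rate per day (illustrative value chosen by the caller)
    gamma : recovery rate per day (1 / mean infectious period)
    Returns a list of (t, S, I, R) tuples, one per time step.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        # Homogeneous mixing: every susceptible meets every infected equally.
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((k * dt, s, i, r))
    return trajectory


def effective_r(beta, gamma, s, n):
    """Effective reproduction number: R0 = beta / gamma scaled by S / N."""
    return (beta / gamma) * (s / n)


if __name__ == "__main__":
    traj = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=160)
    peak_t, peak_s, peak_i, _ = max(traj, key=lambda row: row[2])
    print(f"peak of about {peak_i:.0f} infected on day {peak_t:.0f}")
    print(f"effective R at the peak: {effective_r(0.3, 0.1, peak_s, 10000):.2f}")
```

In this sketch the number of infected individuals stops growing once the effective reproduction number falls to about 1, which is why that quantity is watched so closely. Extra compartments or age-structured contact matrices would replace the single mixing term with one term per pair of groups.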

Further, the availability of large-scale data has allowed scholars to model real-world networks more accurately. The effect of network structure has been studied in the context of epidemic spreading velocity (Cui et al., 2014), size (Y. Liu et al., 2016; Wu et al., 2015), and thresholds (Silva et al., 2019). Pandemic outbreaks have been found to be supported in networks with high assortativity (Moreno et al., 2003) and those having community structures (Z. Liu & Hu, 2005). The plethora of data has also allowed the application of agent-based models (ABMs), which attempt to capture empirical socio-demographic characteristics such as household sizes and compositions, albeit at a larger computational cost. Such models have been used to incorporate empirical knowledge about contact rates within and between age groups (Ogden et al., 2020) and comorbidities (Wilder et al., 2020). Most such models are built using known population statistics, such as the ABM built to simulate disease evolution in France in order to evaluate the effectiveness of COVID-19 lockdowns, physical distancing, and mask-wearing (Hoertel et al., 2020). Alternatively, contact tracing data has been used to build detailed community network approximations, such as one built for Boston, by considering anonymized GDPR-compliant mobile location data in combination with 83,000 places from Foursquare (Aleta et al., 2020). To make sure data sparsity did not result in individual privacy violations, the authors used a probabilistic approach to measure co-presence. Thus, ABMs have been useful in furthering our understanding of the changes to contact networks and their impact on disease transmission.
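
To illustrate how network structure enters such models, the sketch below runs a discrete-time stochastic SIR process on a contact network, so that infection can only pass along existing contacts rather than between any two individuals. The random graph generator is a toy stand-in for an empirical contact network, and all function names and probabilities are hypothetical; the ABMs cited above are considerably richer.

```python
import random


def random_contact_network(n, avg_degree, rng):
    """Build a toy random graph as {node: set(neighbours)}.

    Stands in for an empirical contact network: each pair of nodes is linked
    independently so that the expected degree is avg_degree.
    """
    p_edge = avg_degree / (n - 1)
    neighbours = {node: set() for node in range(n)}
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p_edge:
                neighbours[a].add(b)
                neighbours[b].add(a)
    return neighbours


def run_network_sir(neighbours, seeds, p_transmit, p_recover, rng, max_steps=1000):
    """Simulate SIR on the network; return a list of (S, I, R) counts per step."""
    state = {node: "S" for node in neighbours}
    for node in seeds:
        state[node] = "I"
    history = []
    for _ in range(max_steps):
        infected = [node for node, s in state.items() if s == "I"]
        n_recovered = sum(1 for s in state.values() if s == "R")
        n_susceptible = len(state) - len(infected) - n_recovered
        history.append((n_susceptible, len(infected), n_recovered))
        if not infected:
            break
        # Infection can only travel along edges of the contact network.
        newly_infected = {
            other
            for node in infected
            for other in neighbours[node]
            if state[other] == "S" and rng.random() < p_transmit
        }
        newly_recovered = [node for node in infected if rng.random() < p_recover]
        for node in newly_infected:
            state[node] = "I"
        for node in newly_recovered:
            state[node] = "R"
    return history


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    network = random_contact_network(n=500, avg_degree=8, rng=rng)
    history = run_network_sir(network, seeds=[0], p_transmit=0.05,
                              p_recover=0.2, rng=rng)
    print("final outbreak size:", history[-1][2], "of", len(network))
```

Swapping the toy generator for an adjacency list built from measured contacts would let the same loop probe how assortativity or community structure changes outbreak size, at the cost of tracking every individual explicitly.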

Fine-grained mobile phone data was used to estimate population movements affecting the spread of influenza-like illness (ILI) even before COVID-19. In Tizzoni et al. (2014), the data comes as a set of phone calls georeferenced to the cellphone tower. The authors assume that a user’s most frequent location in the data is their residence and that the second-most frequent is their place of employment. Usually obtained via extensive (and expensive) surveys, such information is revolutionizing disease modelling on both local and global scales. Beyond phone records, internet data has also been used to monitor mobility, showing that large corporations can surface anonymized, aggregated, and differentially private data in order to assist public health researchers and decision-makers. Such resources include Google COVID-19 Community Mobility Reports (Google, 2021a), Apple Mobility Trends Reports (Apple, 2021), and Facebook Disease Prevention Maps (Facebook, 2021b), all of which aggregate the massive amounts of information their platforms collect about the location of their users. All three resources have been used to gauge the changes in mobility during the COVID-19 lockdowns (Mejova & Kourtellis, 2021; Shepherd et al., 2021; Woskie et al., 2021). However, if one wants to obtain a more nuanced understanding of contact networks, wearable technologies can be used to detect face-to-face interactions within, say, an organization or a building. Unobtrusive sensors have been used to detect close proximity interactions at 1.5 m in order to reveal the interaction patterns among healthcare workers and patients in a hospital (Vanhems et al., 2013), as well as at an academic conference (Smieszek et al., 2016) and within several households in Kenya (Kiti et al., 2016).
Large-scale proximity sensing was later used by many governments during the COVID-19 epidemic through passive contact tracing apps, which use anonymous identifiers to remember devices that were in close proximity to a person and which can notify their users in case somebody within their contact history has been found to be COVID-positive (Barrat et al., 2020).
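
The residence and workplace heuristic described earlier in this section for georeferenced call records can be sketched in a few lines of Python. The record format (pairs of user and tower identifiers), the function names, and the aggregation into home-to-work flows are assumptions made for illustration and are not taken from Tizzoni et al. (2014).

```python
from collections import Counter, defaultdict


def infer_home_and_work(call_records):
    """Guess each user's home and work tower from call frequencies.

    call_records: iterable of (user_id, tower_id) pairs, one per call.
    Returns {user_id: (home_tower, work_tower)}; work_tower is None when the
    user was only ever seen at a single tower.
    """
    towers_per_user = defaultdict(Counter)
    for user_id, tower_id in call_records:
        towers_per_user[user_id][tower_id] += 1

    locations = {}
    for user_id, counts in towers_per_user.items():
        # Most frequent tower is treated as residence, second as workplace.
        ranked = [tower for tower, _ in counts.most_common(2)]
        home = ranked[0]
        work = ranked[1] if len(ranked) > 1 else None
        locations[user_id] = (home, work)
    return locations


def commuting_flows(locations):
    """Aggregate individual guesses into counts per (home, work) tower pair."""
    flows = Counter()
    for home, work in locations.values():
        if work is not None:
            flows[(home, work)] += 1
    return flows


if __name__ == "__main__":
    # Hypothetical records: user u1 calls mostly from tower A, then tower B.
    records = [("u1", "A")] * 5 + [("u1", "B")] * 3 + [("u2", "B")] * 4 + [("u2", "A")] * 2
    print(commuting_flows(infer_home_and_work(records)))
```

Only the aggregated flows, not the per-user guesses, would normally leave the data owner. A real pipeline would also need rules for ties, for users with very few calls, and for towers whose coverage areas overlap.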

But before the disease can be tracked, its very presence needs to be detected. Computational social science presents several unprecedented data sources that enable researchers to “now-cast” disease as it moves through the population. As mentioned, web search data has been used to monitor ILI symptoms (Ginsberg et al., 2009) and is still used for many others. However, one does not need to be a Google employee to perform such research, as aggregated search data is surfaced by the company via Google Search Trends (Google, 2021b), which has been used to track anything from Lyme disease (Kapitány-Fövény et al., 2019) to type 2 diabetes (Tkachenko et al., 2017). Of course, other dynamic social media have been used to track disease, including Twitter, Reddit, and Sina Weibo, all of which have been used to track non-communicable diseases as well. Beyond observation, self-reported data can be obtained from participatory surveillance systems, such as InfluenzaNet (Koppeschaar et al., 2017), which collects influenza-related information from thousands of volunteers from countries around the EU.

Both algorithmic and data advances described above come with many caveats which both the scientific and policy communities are yet to tackle effectively. As machine learning and other modelling algorithms become more complex, difficulties in communicating their benefits and, more importantly, their limitations to those outside the circle of trained practitioners result in misunderstandings about the certainty of the predictions and the limits of their applications, leading to limited deployment in the field. However, the solution may not lie in a more detailed description of the algorithms, but in the clarification of their merits, such that it can be determined whether their performance warrants their integration in the decision-making process of policymaking. One could take a page from the social science “reproducibility crisis” (Camerer et al., 2018), which illustrated the bias toward significant, positive, and theoretically neat results at the cost of valid, generalizable insights. Several initiatives, including the Social Sciences Replication Project (SSRP), the Reproducibility Project: Psychology (RPP), and the Experimental Economics Replication Project (EERP), have been organized to provide increased rigor to the insights on important theories and results in each field. Beyond reproducibility, new methodologies should be tested in prediction competitions, such as CDC’s FluSight, a competition that brings together researchers and industry leaders to forecast the timing, peak, and intensity of the flu season (Centers for Disease Control and Prevention, 2021). Another ongoing effort is the ECDC’s European Covid-19 Forecast Hub, which collates and combines short-term forecasts of COVID-19 generated by different independent modelling teams across Europe and makes available a near-term future trajectory of the pandemic (European Centre for Disease Control and Prevention (ECDC), 2021). The legitimacy afforded by such efforts would encourage the data owners (e.g. internet/technology companies including social media websites and phone companies) to contribute datasets that would level the playing field between well-funded and smaller players. It is especially important to solicit both algorithmic and expert (human) predictions in order to provide a baseline for comparison, as it has been shown that people tend to distrust algorithms faster when they make mistakes, compared to when humans do the same (Dietvorst et al., 2015). Increased transparency in the way epidemiological studies are designed, the kind of data they use, and, crucially, their predictions ahead of the target date are all likely not only to clarify the potential impact of the new methods on public health but also to unify the field under a set of common goals (Miguel et al., 2014).
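
As a rough illustration of how independently produced forecasts might be combined and then judged against a baseline once the target dates have passed, the sketch below takes the per-date median of several point forecasts and compares its mean absolute error with that of a naive persistence forecast. The median rule, the error measure, the model names, and all numbers are assumptions chosen for simplicity; they are not a description of how FluSight or the European Covid-19 Forecast Hub actually score or combine submissions.

```python
from statistics import median


def median_ensemble(forecasts):
    """Combine point forecasts from several models, one median per target date.

    forecasts maps a model name to a list of predictions; all lists must
    cover the same target dates in the same order.
    """
    lengths = {len(values) for values in forecasts.values()}
    if len(lengths) != 1:
        raise ValueError("all models must forecast the same target dates")
    return [median(per_date) for per_date in zip(*forecasts.values())]


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predictions and later observations."""
    if len(predicted) != len(observed):
        raise ValueError("predictions and observations must align")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


if __name__ == "__main__":
    # Hypothetical weekly case forecasts, registered before the target dates.
    forecasts = {
        "team_a": [100, 120, 150],
        "team_b": [90, 130, 170],
        "team_c": [110, 125, 140],
    }
    observed = [105, 128, 160]   # hypothetical counts seen afterwards
    baseline = [95, 95, 95]      # persistence: repeat the last known value
    ensemble = median_ensemble(forecasts)
    print("ensemble error:", mean_absolute_error(ensemble, observed))
    print("baseline error:", mean_absolute_error(baseline, observed))
```

Registering the forecasts before the target dates and always reporting the baseline error alongside them is what makes such a comparison informative to someone who cannot inspect the models themselves.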

This proposal will hopefully address several other critiques. Legitimizing and clearly describing the uses of data would give greater transparency to the secondary use of data, greater oversight over anonymization standards, and aggregate statistics of its biases. Biases in data collection have been a constant critique of scientific endeavours. It may be even easier to gloss over biases in big datasets, yet it has been shown that even large datasets of internet or technology users have substantial biases in terms of demographics, wealth, and technological access (Hargittai, 2020; Yom-Tov, 2019). Sampling biases limit the generalizability of scientific studies. As such biases tend to underrepresent those coming from more disadvantaged backgrounds and locales, systematic testing of the algorithms on different populations would provide a quantifiable measure of the change in performance across groups of interest (Olteanu et al., 2019). The peculiarities of the digital platforms provide another constraint, including the affordances provided by each website, as well as the peculiar user base and culture. For instance, the privacy and identification limitations on Facebook distinguish it from more open platforms, like Twitter, or community-oriented ones, like Reddit, resulting in differences in information disclosure and propagation. The very timing of the studies imposes biases specific to the time period selected for the analysis (for instance, 2020 will likely be a special year in many datasets), making some observations unique to the contemporary societal, technological, and public health situation. To address some of these problems, scientists must be encouraged to publish replication studies, as well as to extend them into long-term projects, in order to test the initially proposed models on different data and time spans.
Further, establishing data partnerships that address important public health concerns will ensure the infrastructure is in place in case a crisis, such as the COVID-19 epidemic, strikes.

3.2 Non-communicable Diseases

As medicine advanced against infectious diseases, non-communicable diseases have become the leading causes of death and illness throughout the developed and developing world. Many such conditions, including obesity and overweight, diabetes, and cardiovascular complications, have a strong “lifestyle” component, wherein the daily activities of the population accumulate to contribute to worsening outcomes. CSS provides a unique view of such behaviours, using the digital traces left through these daily activities such as social media posts, business check-ins, web searches, use of applications, and many others. Behaviours around food consumption and nutrition have been studied using Twitter (Abbar et al., 2015), Instagram (Mejova et al., 2015), as well as large datasets of grocery purchases (Aiello et al., 2019). Often, natural language processing (NLP) tools are used to process the text obtained from many internet users, or deep machine learning (ML) models are used to “recognize” relevant objects in the shared images, in order to understand the daily behaviours of the internet users. Crucially, these activities can be put into a cultural context to better understand the societal, economic, and psychological forces shaping these daily decisions, much as proposed by Weiss as “cultural epidemiology” (Weiss, 2001) that combines quantitative and qualitative methodologies. For instance, large datasets of recipes have been examined in order to establish a network of flavours and ingredients across countries and relate it to the health outcomes of different locales (Sajadmanesh et al., 2017). A study of the relationship between economic deprivation and diet in the USA has shown that those living in “food deserts” mention food that is higher in fat, cholesterol, and sugar than those living elsewhere (De Choudhury, Sharma, et al., 2016). Further, specialized apps and wearables are used to monitor physical activity.
For example, a study of running tracking app data (Aral & Nicolaides, 2017) aimed to understand the role of social interaction and comparison on the duration of one’s run. However, some researchers aim to go beyond behavioural profiling and use internet search data to detect those potentially having serious illness. A team used search query logs to first identify users who mentioned having a diabetes diagnosis and compare them to a control group (Hochberg et al., 2019). Researchers were able to predict whether a user will be searching for diabetes-related words from their previous queries with a positive predictive value of 56% at a false-positive rate of 1% at up to 240 days before they mention the diagnosis. In general, it was found that people tend to search about symptoms some time before they are diagnosed with the underlying condition (Hochberg et al., 2020), especially if the symptoms are serious. Yet more data is available to monitor disease on a population level via information surfaced by the advertising systems of large social media platforms. For instance, Facebook allows potential advertisers to run detailed queries on their target audience, specifying their demographics, precise location, language, and interests (which span health concerns, activities, hobbies, worldviews, and many more categories) (Facebook, 2021a). These can then be used as a kind of “digital census” to quantify awareness of health-related topics and behaviours related to non-communicable diseases within well-defined demographic groups across fine and broad geographies (Mejova, Weber, et al., 2018). Compared to traditional survey-based monitoring, the above studies provide unobtrusive, real-time, and extremely rich sources of behavioural observation. Especially on social media, the users are self-motivated to share their meals and activities, to annotate them with geographic and other metadata, and to interact with other posts. 
Although they suffer from social desirability bias, social media and app use data, in combination with other consumption statistics, provide important signals about the social and psychological context of health-related behaviours.
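The positive predictive value (PPV) and false-positive rate quoted above for search-based screening are linked through the prevalence of the condition among the users being screened. The following is a minimal sketch of that relationship using Bayes' rule. The function name and all input values are hypothetical illustrations and are not taken from the cited studies.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Return P(condition | flagged) using Bayes' rule.

    sensitivity:          P(flagged | condition)
    false_positive_rate:  P(flagged | no condition)
    prevalence:           P(condition) in the screened population
    """
    for name, value in (("sensitivity", sensitivity),
                        ("false_positive_rate", false_positive_rate),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    flagged = true_positives + false_positives
    if flagged == 0.0:
        raise ValueError("no users would be flagged with these inputs")
    return true_positives / flagged


if __name__ == "__main__":
    # Hypothetical inputs: the same classifier applied to populations
    # in which the condition is increasingly common.
    for prevalence in (0.001, 0.01, 0.05):
        ppv = positive_predictive_value(
            sensitivity=0.5, false_positive_rate=0.01, prevalence=prevalence
        )
        print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}")
```

The sketch illustrates why a low false-positive rate alone does not guarantee a high PPV: when the condition is rare among the screened users, even a small share of false alarms can outnumber the true detections.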

Further, non-communicable disease interventions can be studied on a personal level when delivered through a myriad of technologies. Integration of smartphones with user-generated content is leading to sophisticated personalized interventions aiming at motivating the users to increase their physical activity level (Harrington et al., 2018; op den Akker et al., 2014). Different messaging strategies have been explored, including personalized exercise recommendation (Tseng et al., 2015), some employing machine learning via supervised learning (Hales et al., 2016; Marsaux et al., 2016) and reinforcement learning (Rabbi et al., 2015; Yom-Tov et al., 2017). Others help users find exercise partners (Hales et al., 2016) and provide educational materials (Short et al., 2017) and emotional support (Vandelanotte et al., 2015). The applications have been embraced by governments and businesses worldwide. For instance, the UK’s National Health Service promotes the Active 10 app, which encourages everyone to take a brisk walk, and, for those ready for a bigger challenge, the Couch to 5K app for beginner runners (National Health Service., 2021). India’s Ministry of Youth Affairs and Sports launched its Fit India app to help its populace keep track of their fitness goals, water intake, and sleep (Play Store., 2021). Social media is, of course, another popular outlet for public health outreach. Many associations, such as the National Eating Disorders Association in the USA, run annual health awareness campaigns on different social media channels, making it possible to measure the impact of their campaigns on the sustainability of the attention to the topic and other subsequent behaviours expressed by their audience (Mejova & Suarez-Lledó, 2020). 
To assist in the efforts, some researchers focus on which influencers and content (especially contagious “memes”) are particularly successful in attracting an audience (Kostygina et al., 2020) or how to better identify the relevant users to target (Chu et al., 2019).

Although the above studies provide a valuable context to the ongoing epidemics of non-communicable diseases, and potential avenues to communicate about them, these mostly observational studies usually fail to reach the threshold for causal insight. Often large datasets lack the information on important confounders that may affect the outcome of the study. For instance, while comparing health-related interests expressed by Facebook users to rates of obesity, diabetes, and alcoholism, researchers have found that unrelated (or “placebo”) interests, such as those in entertainment or technology, also had substantial correlation with the rates of disease (Mejova, Weber, et al., 2018). Some attempt to improve the quality of their models by employing instrumental variables, especially when the explanatory variable of interest is correlated with the error term. Weather is a popular instrumental variable, as it often has no direct relationship with the dependent variable but may have some relationship with the independent ones. In their study of social contagion in a community of runners, the authors used the weather at one person’s location as an instrumental variable when modelling the running behaviour of another (Aral & Nicolaides, 2017). They show that without the corrections, the effect would have been overestimated by 71–82%. The inability to acquire multi-dimensional data that has important confounders (which are often demographics, protected by numerous privacy regulations) has an additional effect of hiding the unequal relevance of the ongoing work to those less represented in these datasets. Inferring sensitive information, including age, gender, and location, may be possible from some sources of data, but such activity may both breach the privacy terms of the platform and violate the protections imposed by the EU General Data Protection Regulation (GDPR). 
It is thus imperative to engage legitimate stakeholders who will negotiate controlled releases of highly detailed data for research on pressing topics and especially provide input during policy changes when a “natural experiment” may take place. Policymakers may also want to explicitly outline the under-served populations they would like to focus on, thus encouraging the creation of datasets around groups that are not yet captured in currently available data. For instance, India’s efforts in the National Mission for Empowerment of Women (NMEW) may be augmented by encouraging the monitoring of technology use through available data (Mejova, Gandhi et al., 2018). Alternatively, access to care can be monitored using online tools, such as those for women’s health services (Dodge et al., 2018) across the USA.
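The instrumental variable idea discussed above can be illustrated with a small simulation. The sketch below is not the model used by Aral and Nicolaides (2017). It is a generic two-stage least squares estimate on synthetic data, in which every variable name, coefficient, and sample size is a hypothetical assumption. An unobserved confounder drives both a friend's running and one's own running, while the weather at the friend's location affects only the friend's running.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of y on x from ordinary least squares with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def two_stage_least_squares(z, x, y):
    """Effect of x on y estimated with a single instrument z."""
    # Stage 1: predict the explanatory variable from the instrument.
    stage1 = np.column_stack([np.ones_like(z), z])
    gamma, *_ = np.linalg.lstsq(stage1, x, rcond=None)
    x_hat = stage1 @ gamma
    # Stage 2: regress the outcome on the predicted explanatory variable.
    return ols_slope(x_hat, y)


def simulate(n=20000, true_effect=0.3, seed=0):
    """Synthetic data; all coefficients are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    weather = rng.normal(size=n)      # instrument: weather at the friend's location
    confounder = rng.normal(size=n)   # unobserved, e.g. shared motivation
    friend_run = weather + confounder + rng.normal(size=n)
    own_run = true_effect * friend_run + confounder + rng.normal(size=n)
    return weather, friend_run, own_run


if __name__ == "__main__":
    weather, friend_run, own_run = simulate()
    print("naive OLS estimate:", round(ols_slope(friend_run, own_run), 3))
    print("2SLS estimate:     ",
          round(two_stage_least_squares(weather, friend_run, own_run), 3))
```

In this synthetic setting the naive regression absorbs the confounder and overstates the peer effect, whereas the two-stage estimate stays close to the assumed true effect. The correction is only valid if the instrument influences the outcome solely through the explanatory variable.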

3.3 Mental Illness and Suicide

An especially vulnerable population that has been extensively studied by CSS in the context of epidemiology is people with diagnosed mental illness, or those simply expressing mental distress, alongside those who vocally contemplate suicide. The anonymity and social support provided by internet forums and websites allow many to express feelings and thoughts which may be difficult to elicit using standard public health methods like surveys and medical records. The pervasive use of social media, including on mobile devices, allows users to post instantly during moments of mental distress and for some to integrate digital platforms into their coping mechanisms. Communities around eating disorders (anorexia, bulimia, etc.) (Stewart et al., 2017; Yom-Tov et al., 2012), depression (De Choudhury et al., 2013; Reece & Danforth, 2017), and drug abuse (Kazemi et al., 2017) and recovery (Chancellor et al., 2019) are providing valuable insights into the way people experience these conditions, seek and provide support, and even share practical advice. For instance, by combining automated machine learning classification and text processing techniques with clinical expertise, researchers have used the Reddit opioid addiction recovery forums to discover alternative treatments that the users share and discuss (Chancellor et al., 2019). It is also possible to monitor the progression of mental illness to serious suicide ideation by examining suicide prevention forums (De Choudhury, Kiciman et al., 2016), as well as studying web search patterns (Adler et al., 2019). Studies of search engine usage have been able to confirm behavioural signs of people with autism, for instance, finding that users who have self-stated that they have autism spend less time examining image results (Yechiam, Yom-Tov et al., 2021). Whereas most studies rely on self-declaration of diagnosis, some studies use social media to better understand those who have been confirmed to be clinically diagnosed. 
Facebook posts of patients diagnosed with a primary psychotic disorder have been analysed to find predictors of a future psychotic relapse (Birnbaum et al., 2019).

However, the very fact that self-expression of mental distress may come before official diagnosis makes such research struggle with construct validity, that is, what exactly is being measured, and how robust it is in clinical terms. Reviews of literature on mental health status on social media show that few studies use the definitions and theories developed in the clinical setting to define, for instance, the conditions of “anxiety” or “depression” that are being tracked (Chancellor & De Choudhury, 2020). Whether mentions of disorders on social media capture users who are struggling with them, merely interested in the topic, or even misusing the terms is an important question to answer before these methods can be applied to the clinical setting. It is imperative to foster a closer collaboration between the medical establishment and researchers attempting to contribute to the epidemiology of conditions possibly discussed in user-generated data. From the CSS research community’s side, it is important to rigorously define the cohorts of interest and follow clinically validated diagnostic procedures (Ernala et al., 2019) when studying new sources of data and methods for identifying those potentially struggling with mental illness. However, it is also desirable for the medical community to acknowledge these new sources of information as an additional signal that should be clinically studied and which may play a role in official diagnostic (and possibly treatment) frameworks. As mentioned earlier, methods based on alternative data sources may play a role in the profiling of future recruits for studies, potentially expanding their reach beyond those already in the medical system.

3.4 Beliefs, Information, and Misinformation

User-generated data provides yet another unique context around health and disease: the dynamics of individuals’ knowledge, opinion, and belief and their interactions with various information sources that shape these important precursors to behaviour. The quality of medical information available to people on social media and through web search can be evaluated using big data NLP tools and in collaboration with area experts. YouTube videos have been found to be some of the worst offenders in terms of advocating methods proven to be harmful or having no scientific basis (Madathil et al., 2015). Twitter (Rosenberg et al., 2020), Reddit (Jang et al., 2019), and Pinterest (Guidry & Messner, 2017) all have been examined for links to potentially harmful health advice. One of the most serious problems is the anti-vaccination movement that has been strengthening in both developed and developing countries. Twitter data has proven to be useful in explaining some variation in the vaccine coverage rates, as reported by the immunization monitoring system of WHO (Bello-Orgaz et al., 2017). Classifying whether social media users support or oppose vaccinations has been shown to be feasible, both using deep learning on the posted text and images (Wang et al., 2020) and using network algorithms on the conversation network (Cossard et al., 2020). However, it is the more specialized websites, such as the discussion forums for parents, that give space to those who are hesitant and are in the process of making healthcare decisions for themselves or their family. There, researchers can find lists of concerns, previous experiences, and information seeking, as well as testimonials about the experiences with the medical establishment (Betti et al., 2021).

Further, the internet captures myriad interactions with medical services and consequences of health interventions. Social media has been used extensively for pharmacovigilance, discovering drug side effects (Alvaro et al., 2015), drug interactions (Correia et al., 2016), and recreational drug use (Deluca et al., 2012) and even uncovering illicit online pharmacies (Katsuki et al., 2015). Patient experiences can be found on business review websites (Rastegar-Mojarad et al., 2015), as well as general-purpose social media, where communities can discuss their perceptions of treatment (Booth et al., 2019; Hswen et al., 2020). Super-utilizers of healthcare services have also been studied on social media in order to inform online social support interventions and complement offline community care services (Guntuku et al., 2021), and efforts have been made to integrate patient experiences in online discussions into customer satisfaction and service quality measures (Albarrak & Li, 2018).

As more and more people use the internet and social media as a source of medical knowledge and advice, as well as social support, understanding how this information is translated into behaviours and life choices is an increasingly urgent research direction. Although the detection of cyberbullying and other negative speech on social media is an active research direction (Chatzakou et al., 2019), ethical concerns prevent the integration of user profiling and targeting in mental health interventions. However, health misinformation has been acknowledged to be a parallel pandemic in the COVID-19 era, and concerted efforts are ongoing in monitoring and tackling potentially harmful information (World Health Organization, 2021a). In this sphere, CSS will continue to play an important role by providing the tools for the analysis of new social and information sharing platforms that are increasingly permeating the information landscape.

4 The Way Forward

Epidemiology was one of the first of the sciences to use large datasets, and thus, it is in a natural position to take advantage of the latest developments in digitization, big data, and computing methods. The year 2020 has forced the field to mobilize its best resources to address the COVID-19 pandemic and put in stark light the challenges facing the field. The silver lining of this dark cloud could be an understanding of the necessary steps in bringing digital epidemiology into the policy sphere, making it agile and relevant in a fast-moving globalized world.

The COVID-19 pandemic has imparted an important component to the epidemiological field: a clarity of vision. It has shown in stark contrast the cost of indecision and its global repercussions for lives and economies, and it has forced the realities of a global pandemic onto public and governmental attention. It has also revealed weaknesses in the current health policy structures, the slow response of governments to the WHO’s messaging, and disarray in case tracking and reporting standards. Already, actions are in place to remedy these weaknesses. Attempts are being made to formalize the government responses through treaties and international agreements (though enforcing such agreements remains a struggle) (Maxmen, 2021). Partnerships are being forged, and large companies have released detailed datasets of user activity and mobility to aid in monitoring and modelling (National Institutes of Health., 2021).

Such clarity of vision is necessary to improve the impact of digital epidemiology also in other spheres. The UN Sustainable Development Goals (SDGs; footnote 3) provide a general prioritization for the health and well-being challenges, but these must be defined clearly in order to encourage the building of tools and partnerships. One such effort is the European Data Space, which aims to legitimize and operationalize data usage across the member states while complying with its established privacy regulations (footnote 4). Another is the WHO Hub for Pandemic and Epidemic Intelligence, which aims to build a “global trust architecture” that will encourage greater sharing of data through addressing numerous aspects: “governance, legal frameworks and data-sharing agreements; data solidarity, fairness and benefits sharing; transparency about how pandemic and epidemic intelligence outputs are used; openness of technology solutions and artificial intelligence applications; security of data; combating misinformation and addressing infodemics; privacy by design principles; and public participation and people’s data literacy” (World Health Organization, 2021b,c). Additionally, the One Health movement, supported by the WHO, emphasizes the collaboration between disparate domains to accomplish a systems-level perspective on problems such as antibiotic resistance (World Health Organization., 2017). These ambitious projects are a response to a complex problem that involves many parties, some of which only recently began weighing the benefits and dangers of massive surveillance for the greater good.

Several important steps need to be taken in order to engage all major parties involved. First, civil society must be educated in the basics of digital literacy, data privacy, and its governance in order to ensure the users of technologies contributing to the big data revolution provide truly informed consent. For instance, the EU has proposed the Digital Competence Framework that comprises not only information literacy but also skills in communication, digital content creation, safety, and problem-solving (EU Science Hub., 2021). Second, the professionals coming from different civic, academic, and policy silos must be brought together and upskilled to communicate clearly about the role of data in public health. For instance, the Lagrange Fellowships in Italy (Fondazione CRT., 2019); the Data Science for Social Good Fellowship in Chicago, USA (University of Chicago., 2021); and the Data Fellowship at the OCHA Centre for Humanitarian Data (Centre for Humanitarian Data, 2021) are excellent efforts to impart data science skills to the next generation of humanitarians, epidemiologists, and academics. Institutionally, the normalization of building teams that incorporate data literacy (and analytics skills, if possible) is an ongoing process that has only recently begun to be supported by educational resources. Third, the governance of technology giants that own much of the data necessary for monitoring and modelling disease must be kept clear and up-to-date considering the latest technological developments. Interestingly, during the COVID-19 efforts to build contact tracing apps, it was the corporations (Apple and Google) that refused to implement features that would threaten the privacy of their users (privacy being an important feature of their services) (Meyer, 2021). However, one must not rely on businesses to maintain ethical standards of data use, which must be carefully negotiated before the next disaster strikes.

Much of this chapter describes the impressive accomplishments by academic researchers in the fields of disease monitoring, modelling, prediction, and contextualization. However, to bring these tools to the policymakers’ table, they must be robust, vetted, and available on demand. Additional organization is necessary to establish a well-defined set of problems for the community to tackle and to provide legitimacy in order to foster data exchange to support research. Standardizing the tasks (such as flu season prediction), metrics, available data, and benchmarks will allow for increased accountability and reproducibility of academic endeavours that go beyond publication peer review. Such tasks should be defined in collaboration with the policymakers in order to align the priorities with the societal needs and system outputs with the information needs. Just as the Netflix Prize invigorated the recommender systems community (Netflix, 2009) and Google Flu Trends spurred interest in digital disease tracking (Google., 2014), ambitious competitions would not only provide clarity of vision for the field but would also be able to direct the research agenda to under-served areas and communities. It would be beneficial if the collaborative efforts described above included a space for academics and researchers to tackle specific problems within an evaluation framework that produces benchmark datasets and reproducible methods, beyond scientific publications.

Finally, technological development will continue to revolutionize the field, spurring debate on additional policy considerations. Advances in deep machine learning are making it possible to process speech, images, and video at scale and are already being used for plant (Ferentinos, 2018) and human (Li et al., 2020) disease detection. The rise of confidential computing, wherein user data is isolated and protected on the user’s device, and only trusted operations can be run on it, eliminates the need to transfer the data for processing elsewhere (Rashid, 2020). The negotiation between the new potential insights and the cost to society will require thoughtful, informed, and urgent consideration.