The methodologies to assess the effectiveness of non-pharmaceutical interventions during COVID-19: a systematic review

Non-pharmaceutical interventions, such as school closures and stay-at-home orders, have been implemented around the world to control the spread of SARS-CoV-2. Their effectiveness in improving health-related outcomes has been the subject of numerous empirical studies. However, these studies show fairly large variation among methodologies in use, reflecting the absence of an established methodological framework. On the one hand, variation in methodologies may be desirable to assess the robustness of results; on the other hand, a lack of common standards can impede comparability among studies. To establish a comprehensive overview of the methodologies in use, we conducted a systematic review of studies assessing the effectiveness of non-pharmaceutical interventions between January 1, 2020 and January 12, 2021 (n = 248). We identified substantial variation in methodologies with respect to study setting, outcome, intervention, methodological approach, and effectiveness assessment. On this basis, we point to shortcomings of existing studies and make recommendations for the design of future studies.

Supplementary Information: The online version contains supplementary material available at 10.1007/s10654-022-00908-y.


Introduction
In response to the COVID-19 pandemic, countries around the world have implemented non-pharmaceutical interventions. These include a variety of public health measures implemented by governments at the population level to decrease the number of person-to-person contacts with the intention of controlling, preventing, and mitigating transmission, e.g. school closures, stay-at-home orders, and mandates for compulsory wearing of masks in public places [1][2][3][4]. The widespread use of these interventions has raised interest in empirically studying their effects on health-related outcomes reflecting disease dynamics, e. g. the number of new cases or infection rates [5][6][7][8][9][10]. Such studies can play an important role in informing the discussion about the effectiveness of interventions. In particular, insights from the COVID-19 pandemic may contribute to an evidence-based public-health response in subsequent COVID-19 waves or future pandemics.
Accordingly, a plethora of studies assessing the effectiveness of non-pharmaceutical interventions during the COVID-19 pandemic have been published. Their findings have been summarized by several meta-analyses [11][12][13][14][15]; nonetheless, each meta-analysis considered a different subset of studies. We argue that this divergence is due to substantial variation in the methodologies used to conduct empirical studies on the effectiveness of non-pharmaceutical interventions. The resulting lack of similarity constrains meta-analyses to a comparably small and specific subset of the overall evidence.

(Nicolas Banholzer and Adrian Lison contributed equally to this work.)
There are different reasons to expect variation in methodologies in the studies on the effectiveness of non-pharmaceutical interventions for controlling a pandemic. One possibility is the lack of empirical data before the COVID-19 pandemic, such that early studies were largely theoretical [16]. Empirically assessing the effectiveness of non-pharmaceutical interventions is therefore a relatively new subject, and corresponding studies do not build on an established scientific framework. Another possibility is that empirical assessments have been approached with different methods and domain knowledge by researchers from various fields, e. g. computational biology, infectious disease epidemiology, public health, economics, and statistical modeling.
Variation in methodologies can be manifold. Different study settings, outcomes, interventions, methodological approaches, and ways to assess effects may be used. On the one hand, such variation may be desirable, as it allows assessing the robustness of results against individual assumptions and methodologies. On the other hand, variation in methodologies can impede comparability among studies, which is necessary to arrive at conclusive evidence regarding the effectiveness of non-pharmaceutical interventions.
Here, we systematically review the methodologies for studying the associations of non-pharmaceutical interventions with health-related outcomes. In doing so, we aim to provide an overview of the different methodologies used by previous studies and to identify opportunities to improve the robustness and comparability of future studies. In particular, we explore shortcomings of common approaches and provide recommendations for subsequent studies on the effectiveness of non-pharmaceutical interventions.

Methods
We tailored our review to the challenge of mapping a potentially diverse set of methodologies from a large number of studies. To ensure rigour and consistency, we preregistered the procedures for all stages of the review process, following common guidelines for systematic literature reviews [17]. Certain guidelines were not applicable to a methodology review such as ours. In particular, our eligibility criteria and risk of bias assessment reflect the objective of this review, which was not to evaluate the evidence regarding the effectiveness of non-pharmaceutical interventions, but to map the variation in methodologies used. The preregistered methodology was documented in a review protocol at PROSPERO [18].
We report our review according to the PRISMA 2020 statement [19]. A completed PRISMA 2020 checklist is provided in online Appendix G. The search strategy was developed jointly and executed by an experienced information consultant. Then, two authors (NB and AL) performed study selection, data extraction, and synthesis, while having regular meetings with the complete author team.

Eligibility criteria for studies
In the following, we describe our eligibility criteria, which informed our search strategy and were systematically applied during study selection. Importantly, if a study contained multiple analyses of which only some fulfilled our eligibility criteria, we included the study but extracted only the eligible analyses. This may sometimes not correspond to the main analysis of a paper or may include more than one analysis per study.

Study design
In this review, we considered observational studies assessing the associations of non-pharmaceutical interventions with outcomes related to the COVID-19 disease. We focused on analyses that used real-world observational data to assess the effectiveness of non-pharmaceutical interventions. Specifically, we excluded modeling studies that predominantly worked with synthetic data or projected future transmission dynamics based on hypothetical scenarios without assessing the effectiveness of interventions empirically.
We considered all types of observational study designs, i. e. cross-sectional, case-control, retrospective and prospective cohort, etc. However, note that these study designs often relate to the analysis of individuals or cohorts. This is different from our setting, where non-pharmaceutical interventions were implemented at the population level, such that typically all individuals within a population were exposed to non-pharmaceutical interventions at the same time. As a result, changes in outcomes following interventions were also usually evaluated at the population rather than the individual level. Nevertheless, our review sample includes studies that used individual-level epidemiological data, e. g. data on cases and transmission chains, which were typically used to compute population-level outcomes quantifying transmission, such as the reproduction number.

Population
We considered studies assessing the effectiveness of non-pharmaceutical interventions on the population in one or several geographic regions. Our review was not limited to a specific geographic region, i. e. all national and subnational regions worldwide were considered. We furthermore included studies analyzing specific subpopulations in a certain region (e. g. certain age groups). We also considered analyses using individual-level data, as long as the effectiveness of non-pharmaceutical interventions was assessed on a population level.

Outcome
The main outcomes considered were health-related outcomes at population level that are associated with COVID-19 (e. g. confirmed cases, hospitalizations, and deaths), and epidemiological outcomes characterizing infection dynamics such as reproduction numbers or transmission rates. We also considered similar outcomes associated with other infectious diseases (e. g. influenza), if used as a surrogate for COVID-19. Moreover, behavioral outcomes potentially mediating the effects of non-pharmaceutical interventions were included (e. g. human mobility). In contrast, we excluded analyses assessing the effects of non-pharmaceutical interventions solely on other outcomes not directly related to infectious diseases (e. g. psychological well-being or economic activities).

Intervention
As non-pharmaceutical interventions, we considered the implementation of health policy measures in the context of the COVID-19 pandemic. Specifically, we included any intervention related to social distancing (e. g. school closures, venue closures, workplace closures), containment (e. g. contact tracing, quarantining), population flow (e. g. border closures), or personal protection (e. g. facial mask mandates). Analyses were considered regardless of whether they assessed the effectiveness of a single intervention, the effectiveness of multiple interventions separately, or the effectiveness of a combination of interventions. For simplicity, we refer to these as non-pharmaceutical interventions throughout the review, while recognizing that also other terms have been used in the literature. Importantly, we accounted for various alternative terms in our literature search (see Search strategy below). We excluded interventions not directly related to disease control (e. g. economic measures like social benefits).

Search strategy
We searched for peer-reviewed original research articles in English language that were accepted, published, or in press between January 1, 2020 and January 12, 2021 (2929 unique records). In our review protocol, we specified that we would also include preprints in our search. However, due to their enormous volume, we eventually decided not to consider gray literature or preprints in our review. Our results therefore only cover methodologies used by articles peer-reviewed at the time of search, among which we already found considerable variation. To account for potentially new methodologies in articles published after the time of search, we also considered further recent studies on the effectiveness of non-pharmaceutical interventions in our discussion and put them into the context of our review findings.
We searched the databases Embase, PubMed, Scopus, and Web of Science. These databases include, among others, MEDLINE, Biological Abstracts, CAB Abstracts, and Global Health. We composed our search query of four components to be contained in the publication title or abstract: (1) a synonym for "non-pharmaceutical intervention", (2) a synonym for "estimation" or "assessment", (3) a synonym for "effect", and (4) a synonym for "COVID-19". Starting from a precompiled list of 18 references based on our primary research on the effectiveness of non-pharmaceutical interventions, we created and repeatedly extended a collection of synonyms for each of the above components, thereby achieving a broad search while keeping the number of selected studies manageable. The strings for our search queries are provided in online Appendix B. Importantly, we decided not to include search terms for single interventions such as face masks or travel restrictions, as this would have resulted in an unmanageable number of studies that were not concerned with the population-level impact of the non-pharmaceutical intervention. Nevertheless, our search query found studies on single interventions through other terms describing non-pharmaceutical interventions.

Study selection
As a first step, we screened the titles of the studies retrieved from the database search for keywords clearly suggesting that the study would not meet our predefined eligibility criteria (e. g. "mental health" or "air quality"). The compiled set of keywords (see online Appendix B) was used to automatically identify cases for exclusion via the publication title. For all remaining studies, two authors (AL and NB) checked the eligibility criteria and individually decided on inclusion or exclusion. Each of the two authors checked the eligibility for one half of the studies via the following process: First, studies were checked by title and, if in doubt, by abstract. Then, if still in doubt, studies were checked by full text and discussed by both authors. Any disagreements were resolved with involvement of a third author (WV). Generally, we followed an inclusive approach by keeping all studies that could not be excluded with high confidence. At each stage, all decisions were recorded in a spreadsheet.
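The automatic keyword-based exclusion step can be sketched as follows (a minimal illustration; the keywords and titles below are hypothetical, and the actual exclusion list is provided in online Appendix B):

```python
# Illustrative sketch of automated title screening. The keyword set and
# example titles are hypothetical, not the actual list from Appendix B.

EXCLUSION_KEYWORDS = {"mental health", "air quality"}

def passes_title_screen(title: str) -> bool:
    """Return False if the title contains a clear exclusion keyword."""
    t = title.lower()
    return not any(kw in t for kw in EXCLUSION_KEYWORDS)

titles = [
    "Effect of school closures on SARS-CoV-2 transmission",
    "Impact of lockdown on urban air quality",
]
# Only studies passing the automatic screen proceed to manual checks.
retained = [t for t in titles if passes_title_screen(t)]
```

Studies excluded by such a filter were not checked manually; all remaining studies went through the title, abstract, and full-text checks described above.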

Data extraction
We extracted data from all included studies (n = 248) in a spreadsheet. Our extraction strategy reflected the exploratory nature of our analysis and thus allowed for new data items to be added throughout the process. Therefore, we maintained a detailed manual with all data items and the potential values for each item (see online Appendix D). Before extraction, a preliminary version of the data extraction form was created based on reporting items from checklists for observational studies (STROBE [20], RECORD [21]), a template for public health policy interventions (TIDieR-PHP [22]) and an initial set of studies. Aside from bibliographic information, the data to be extracted consisted of information on the study setting, outcome, intervention, methodological approach, and effectiveness assessment.
The extraction process was structured in four rounds. During the first round, two authors (AL and NB) extracted data from an initial set of 20 publications, blinded to each other's coding. The coding was then compared, and any differences were discussed to resolve ambiguities. Corresponding changes were recorded by updating the extraction form and manual, and applied subsequently. In the second round, the two reviewers each extracted data from one half of the remaining publications and checked the other half coded by their colleague. Color-coding was used to highlight uncertain or ambiguous entries for the other reviewer or to mark such entries for further discussion. Regular meetings were held between the two authors (AL and NB) to discuss these uncertainties and ambiguities. All disagreements were resolved through discussion, if needed by involving a third author (WV). Thereby, the data extraction manual and form were continuously refined and kept up-to-date. In particular, the list of values that could be potentially assigned to each data item was continuously extended and harmonized as new studies were extracted. In the third round, the data extraction form and manual were simplified by merging data items or categories that, retrospectively, were found redundant, or by relabeling items and categories to define them more precisely. This was done with particular attention to enable comparability among the extracted analyses as well as readability of the results. In the fourth round, the final scheme was applied to all studies.

Quality assessment
The goal of this systematic review was not to perform a meta-analysis or narrative synthesis of the evidence regarding the effectiveness of non-pharmaceutical interventions, but to compare the included studies along methodological dimensions and to analyze the variation in study setting, outcome, intervention, methodological approach, and effectiveness assessment. Therefore, no risk of bias assessment with regard to the study results was conducted. Our minimum requirement for quality was that most information on the aforementioned dimensions could be extracted from the manuscript and/or supplementary material. This minimum requirement was not met by four studies, which were thus excluded. For other studies where only some methodological information was missing, we noted in the data extraction sheet that this information "could not be evaluated".

Data synthesis
The results of the data extraction were synthesized in tabular form by recording the frequency of categories per item. We reported the frequency for each item of the main dimensions (study setting, outcome, intervention, methodological approach, and effectiveness assessment) individually. For some items, we conducted further specialized analyses, for example by computing the frequencies of categories for different methodologies separately, or by qualitatively evaluating the supplementary information added to certain entries during extraction. Insights from these additional analyses were reported textually. Furthermore, we synthesized common analysis types based on patterns identified in the methodological approaches. Lastly, based on our findings, we derived specific recommendations for future studies with regard to scope, robustness, and comparability, and put them into the context of more recent studies that were not part of our review sample.
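The per-item frequency counts described above amount to a simple tabulation, which can be sketched as follows (the item values are hypothetical, for illustration only; the actual extraction report is in online Appendix E):

```python
from collections import Counter

# Hypothetical extracted values for one data item
# (e. g. the methodological approach of each analysis).
approaches = [
    "regression", "transmission model", "regression",
    "descriptive", "regression",
]

# Frequency of each category for this item.
freq = Counter(approaches)
# e. g. freq["regression"] == 3
```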

Results
We conducted a systematic database search for peer-reviewed research articles from January 1, 2020 up to January 12, 2021 (see "Methods" section). Figure 1 shows the PRISMA flow diagram of our identification process. The search yielded 2,929 unique records of studies for screening. Through title and abstract screening, we identified 411 studies as potentially relevant and evaluated their full texts. Of these, we excluded 163 studies that did not meet the eligibility criteria. The most frequent reasons for exclusion were that (1) studies primarily simulated the effects of interventions in hypothetical scenarios rather than making inferences from observational data; (2) studies had a different objective than assessing the effectiveness of interventions; and (3) studies only assessed the association of health-related outcomes with population behavior (most often mobility), but not with interventions. The remaining n = 248 studies met our eligibility criteria and were included for subsequent data extraction. Importantly, 35 studies in our review sample contained multiple (i. e. up to three) analyses, e. g. with different methodological approaches, leading to 285 different analyses included. If not indicated otherwise, our results are presented at the level of individual analyses (and not at the level of studies).
We characterized the analyses along five dimensions (online Appendix D): study setting (D.1), outcome (D.2), intervention (D.3), methodological approach (D.4), and effectiveness assessment (D.5). In the Results section, if not stated otherwise, we use the term interventions to refer to non-pharmaceutical interventions. Where appropriate, we also point to exemplary studies of specific characteristics. Due to the large size of our review sample, however, we refrain from referencing all studies in the main manuscript and instead refer to our complete data extraction report in online Appendix E.

Study setting
The analyses vary in their scope across populations, geographic areas, and study periods. A systematic classification of the study setting is shown in Table 1.

Population
More than half of the analyses studied multiple populations, i. e. multiple countries or subnational regions (e. g. states or cities). The remainder focused on a single population, i. e. a single country or subnational region. The analyses were performed at the national level, the subnational level, or both. If both levels were studied, the country and all its subnational regions were oftentimes considered, e. g. all states of the United States. Geographically, some regions and countries were more frequently studied than others, probably due to an earlier start of the pandemic or particularly high incidence and mortality during the first epidemic wave.

Study period
Typically, the study period covered both a rise and decline in new cases of the first epidemic wave in the analyzed population, and started before and ended after the analyzed interventions were implemented. However, many analyses also deviated from this pattern in one or several aspects (Table 1, D.1.5). Notably, in some analyses, the end date of the study period was still within the epidemic growth phase for some populations but already in the control phase for other populations.

Outcome
The studies in our review sample used different types of health-related outcomes or surrogates. For every analysis, we identified the "raw outcome", i. e. the outcome data which were self-collected or obtained from external sources and used as input for the analysis. In around half of the analyses, the raw outcome was analyzed directly to assess the effectiveness of interventions. The other half of analyses, however, involved an intermediate step, in which another outcome was computed from the raw outcome. This "computed outcome" was then analyzed instead of the raw outcome, or sometimes in addition to it. A systematic classification of the outcomes is shown in Table 2.

Raw outcome
We identified three main types of raw outcome data used, namely (1) epidemiological population-level data, (2) epidemiological individual-level data, and (3) behavioral data.
(1) Epidemiological population-level data The majority of analyses used population-level data on epidemiological outcomes, e. g. confirmed cases and deaths. The most frequent types were surveillance data, mainly the number of confirmed cases, but also deaths, hospitalizations, recovered cases, and, less frequently, intensive care unit (ICU) admissions. Importantly, some outcomes, such as recovered cases, were predominantly used to fit transmission models, in which case the effectiveness of interventions was rather measured in terms of a different latent outcome (see D.4 Methodological approach, Sect. 3.4). Frequently, authors also included several types of data (e. g. both cases and deaths), either to perform a separate analysis for each (e. g. as a robustness check) or to combine them in a joint model (e. g. a transmission model). Some analyses used surveillance data on diseases other than COVID-19, with influenza being the most popular choice. Such surrogate diseases have often been monitored over an extended period of time, which allows comparing their spread during the COVID-19 pandemic to earlier years. Notably, we found only three analyses that used external data on latent epidemiological population-level outcomes (e. g. the reproduction number). All other analyses using a latent outcome computed it themselves from raw data in an intermediate step (see D.2.2 Computed outcome, Sect. 3.2.2).

(2) Epidemiological individual-level data
Instead of population-level data, some analyses also used individual-level epidemiological data. These were in particular data about individual cases with case ID, demographics, and epidemiological characteristics (e. g. the date of symptom onset or travel history). In some instances, this included contact tracing data with links between index and secondary cases, allowing the reconstruction of transmission chains. Two analyses also used genome sequence data of clinical SARS-CoV-2 samples [23,24].

(3) Behavioral data
In addition to epidemiological data, a relevant share of analyses employed data on population behavior, mainly mobility data. These data were usually obtained through tracking of mobile phone movements and provided as aggregates at the population level, based on summary statistics such as the daily number of trips made, time spent at certain locations, or population flow between regions. Another, less frequently used source of information on human behavior was surveys regarding social distancing practices, such as adherence to interventions, face mask usage, or daily face-to-face contacts.

Computed outcome
Around half of the analyses involved an intermediate step in which the raw outcome was used to compute another outcome, and the effectiveness of interventions was subsequently assessed using this "computed outcome". Typically, the rationale of such analyses was to conduct the assessment on an outcome with clearer epidemiological interpretation and relevance to policy-making. At the same time, the methodological separation of outcome computation and effectiveness assessment into distinct steps allowed the authors to limit the complexity of their models. In our review sample, we identified four main types of computed outcomes: (1) Measures of epidemic trend were computed to describe the overall trend of the epidemic, e. g. through the growth rate or doubling time of confirmed cases or hospitalizations. These measures were often interpreted as crude estimates of the infection dynamics in a population, and authors used them to achieve better comparability of outcomes across different populations. (2) Epidemiological parameters were computed to measure specific infection dynamics, most often in terms of the reproduction number. That is, studies typically used the time series of confirmed cases in a population to compute the effective reproduction number over time and then assessed whether it decreased during interventions. A few analyses also used individual-level epidemiological data to compute and assess changes in epidemiologically relevant time spans such as the serial interval or the time from symptom onset to isolation. (3) Summary statistics were typically used to aggregate the raw outcome of a population over time into a single figure describing the progression of the epidemic in the population. For example, authors computed the time until a certain number of documented cumulative cases was reached, or the time until the reproduction number first fell below one.
(4) Change points in the outcome were computed with the aim to find time points of presumably structural changes in epidemic dynamics and compare them with implementation dates of interventions in the subsequent analysis [10,25,26]. Typically, change points were computed for the time series of confirmed cases or mobility. Of note, the raw outcome was not always used only for obtaining the computed outcome, e. g. changes both in the number of new confirmed cases (raw outcome) and in the reproduction number (computed outcome) were sometimes analyzed.
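To illustrate measure type (1), a daily growth rate and the implied doubling time can be derived from a case time series roughly as follows (a minimal sketch assuming smooth counts and constant exponential growth; real analyses additionally handle noise and reporting delays):

```python
import math

def growth_rate(cases_today: float, cases_yesterday: float) -> float:
    """Daily growth rate as the relative change in new confirmed cases."""
    return cases_today / cases_yesterday - 1.0

def doubling_time(rate: float) -> float:
    """Days until case counts double, assuming constant exponential
    growth at the given daily rate."""
    return math.log(2) / math.log(1.0 + rate)

# Hypothetical counts: 100 new cases yesterday, 120 today.
r = growth_rate(120, 100)   # 0.2, i. e. 20% daily growth
t = doubling_time(r)        # roughly 3.8 days
```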

Method to obtain the computed outcome
(1) Measures of epidemic trend were often obtained through simple computation (e. g. growth rate as percentage change in confirmed cases). Other analyses used simple modeling approaches, e. g. fitting an exponential growth model to the time series and extracting the exponential growth rate or doubling time from the estimated parameters. (2) Epidemiological parameters were mostly estimated from confirmed cases or deaths. Some approaches fitted a compartmental transmission model to the raw epidemiological outcome. For this, the parameter of interest was either allowed to vary over time, or the model was fitted independently on different time periods. Other approaches employed a statistical method to directly estimate reproduction numbers from the observed outcome. Here, the method by Cori et al. [27] as implemented in the popular software package "EpiEstim" [28] for estimation of the instantaneous effective reproduction number was used in a large number of analyses. However, we found that statistical methods were not always applied correctly, which could have led to bias in the inferred transmission dynamics (see online Appendix A). Sometimes, authors also used methods to estimate reproduction numbers from contact matrices [29] (derived from surveys on personal contacts) or from transmission chains [30,31] (derived from contact tracing data). (3) Summary statistics were typically obtained through simple computation. (4) Change points in the outcome were obtained by fitting a compartmental transmission model with special parameters representing points in time when the transmission rate changes [26]. Other analyses used special change point detection algorithms [25].
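The core idea behind instantaneous reproduction number estimators such as the method by Cori et al. [27] can be sketched via the renewal equation, where R at time t is the ratio of current incidence to total infectiousness from past cases weighted by the serial interval distribution. The sketch below is a deliberately simplified point estimate; EpiEstim itself uses a Bayesian formulation with a sliding window, which is omitted here, and the serial interval values are hypothetical:

```python
def instantaneous_Rt(incidence, si_pmf):
    """Simplified point estimate of the instantaneous reproduction number
    via the renewal equation: R_t = I_t / sum_s I_{t-s} * w_s,
    where w_s is the serial interval probability mass at lag s days."""
    Rt = []
    for t in range(len(incidence)):
        total_infectiousness = sum(
            incidence[t - s] * si_pmf[s - 1]
            for s in range(1, min(t, len(si_pmf)) + 1)
        )
        # Undefined at the start of the series, when no past cases
        # contribute infectiousness.
        Rt.append(incidence[t] / total_infectiousness
                  if total_infectiousness > 0 else float("nan"))
    return Rt

w = [0.2, 0.5, 0.3]          # hypothetical serial interval pmf, lags 1-3 days
cases = [10, 20, 30, 45]     # hypothetical daily incidence
rt = instantaneous_Rt(cases, w)
```

Incorrect handling of such details, e. g. confusing the serial interval with other delay distributions or ignoring the undefined early values, is one way the misapplications noted in online Appendix A can arise.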

Data source
The majority of authors directly accessed surveillance data from national health authorities or other governmental bodies.
In the case of individual-level data, which may be subject to privacy regulations, authors were often themselves affiliated to the relevant health authority. To obtain population-level data, a considerable share of analyses also used publicly available data from cross-country selections, e. g. the European Centre for Disease Prevention and Control (ECDC) [32], the Johns Hopkins University (JHU) [33], or Worldometer [34], which offer aggregated surveillance data internationally from various sources for the pandemic. Mobile phone tracking data were usually provided by corporate organizations such as Google [35], Apple [36], or Baidu [37]. A few analyses were also based on data collected by the authors, e. g. survey data on behavioral outcomes, seroprevalence studies, or data collected at a local facility such as a hospital.

Data availability
Data for the raw outcome were usually publicly available, in particular for epidemiological population-level outcomes such as cases and deaths, because such data could oftentimes be accessed via the source documented in the manuscript. In several cases, the data were made publicly available by the study authors, e. g. by depositing the analyzed data in a public repository. For a small, yet considerable number of analyses, the data were not accessible, as they were neither made publicly available nor could their source be identified. Of note, epidemiological individual-level data were typically not available due to privacy concerns. Furthermore, corporate mobility data were widely available in the past, but access has recently been restricted by many providers.

Intervention
The analyses vary in the types of exposures and non-pharmaceutical interventions. A systematic classification is shown in Table 3.

Terminology for non-pharmaceutical interventions
Varying terminology has been used in the literature to refer to non-pharmaceutical interventions. This is reflected in our search string, where we used a large set of terms in order to capture a broad range of relevant studies. While terminology sometimes reflected the specific types of non-pharmaceutical interventions that were analyzed, differences in terminology may also be the result of different research backgrounds of the study authors.

Exposure types and types of single interventions
A considerable number of analyses examined one single [5,38] or multiple interventions separately [2,7]. Among these analyses, school closures and stay-at-home orders were examined most frequently, which may be due to these interventions being particularly controversial in the public discourse [39,40]. The majority of analyses, however, did not examine multiple interventions separately but rather analyzed a combination of multiple interventions jointly [41][42][43], which was often the case when multiple interventions were implemented on the same day and the separate associations of the outcome with individual interventions could thus not be disentangled. A considerable number of analyses were even less specific, only analyzing whether interventions were altogether effective, without attributing changes in the outcome to specific interventions [44,45]. Other ways to assess the effectiveness of interventions were: examining the start time of interventions [26,46], e. g. to compare different delays with which governments responded to the pandemic [46]; dividing the public health response into different periods [47,48]; dividing interventions into different categories [49,50]; or summarizing the stringency of interventions to a numerical index at a specific time point [51,52]. Of note, analyses that examined a combination of interventions often referred to this combination as "lockdown". In the underlying analyses, such lockdowns typically included multiple interventions implemented on the same day [42,53]. However, the specific interventions included in lockdowns varied considerably between populations. We therefore considered "lockdown" as an umbrella term for different combinations of interventions rather than as a specific type of intervention. Furthermore, some studies did not only assess the relationship between mobility and non-pharmaceutical interventions, but also between changes in mobility and population-level epidemiological outcomes.
In these analyses, human mobility was typically defined as a continuous exposure. We extracted information on such complementary analyses of mobility as an addendum to the main review (see online Appendix E).
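As a minimal illustration of the index-based exposure definition, the following sketch averages ordinal intervention levels into a single stringency score. The coding scheme and values are invented for illustration; the indices actually used (e. g. the Oxford tracker) employ more elaborate formulas.

```python
# Toy illustration of summarizing interventions into a stringency index.
# Each intervention is coded on an ordinal scale (0 = no measure); the index
# rescales each level to [0, 100] and averages. This is a deliberately
# simplified scheme, not the exact formula of any published index.

def stringency_index(levels, max_levels):
    """Average of per-intervention levels, each rescaled to 0-100."""
    assert len(levels) == len(max_levels)
    subindices = [100 * l / m for l, m in zip(levels, max_levels)]
    return sum(subindices) / len(subindices)

# Hypothetical coding: school closures (0-3), workplace closures (0-3),
# stay-at-home orders (0-2), mask mandates (0-2).
levels     = [3, 2, 1, 2]   # observed levels on one day
max_levels = [3, 3, 2, 2]
index = stringency_index(levels, max_levels)
```

Such an index collapses the type of intervention, so associations with the outcome can no longer be attributed to single interventions.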

Coding of interventions
When multiple populations were jointly analyzed, coding of interventions may have been necessary in order to reconcile differences in the definitions of interventions between populations. For instance, the term "school closures" could refer to the closure of primary or secondary schools or universities. Differences across populations are thus reconciled during coding by deciding upon the type of intervention and providing a common name and definition that is then applied to all populations. As a result, such coding can be subjective and thus needs to be carefully documented and evaluated (see online Appendix A). Coding of interventions was necessary in around a quarter of analyses.

Source of intervention data and availability of data on exposure
If coding of interventions was not necessary, authors often obtained intervention data (i. e. the date of interventions) from a government or news website. However, the data source was often not provided by the authors and could thus not be evaluated. If coding of interventions was necessary, study authors either coded the data themselves, i. e. collected the data from government or news websites and systematically categorized them, or, more frequently, used externally coded data. The most popular choices for externally coded data were the Oxford Government Response Tracker [1] and, for the United States, the New York Times [54].

Methodological approach
A variety of methodological approaches were used to assess the effectiveness of interventions. The methodological approaches extracted here describe the actual stage of estimating the associations of health-related outcomes with interventions. A systematic classification of the methodological approaches is shown in Table 4. An additional analysis of the average citation count per category is presented in online Appendix C.

Empirical approach
We distinguished three general empirical approaches for assessing the effectiveness of interventions, namely (D) descriptive, (P) parametric, and (C) counterfactual approaches.
• (D) Descriptive approaches were used by the majority of analyses: These approaches provided descriptive summaries of the outcome over time or between populations, and related variation in these summaries to the presence or absence of different interventions. For example, some analyses compared changes in the growth rate of observed cases before and after interventions were implemented [55,56]. Of note, descriptive approaches could involve modeling as part of an intermediate step, where a latent outcome was computed from the raw outcome (see Computed outcome), while, afterward, a descriptive approach was used to link the latent outcome to interventions. For example, some analyses used a single-population compartmental transmission model to estimate the time-varying reproduction number and then compared the reproduction number before and after interventions were implemented [57][58][59].
• (P) Parametric approaches were used by a third of analyses: These approaches formulated an explicit link between intervention and outcome, where the association was quantified via a parameter in a model. Most frequently these were regression-like links between interventions and the reproduction number [2,8].
• (C) Counterfactual approaches were least frequently used: These approaches assessed the effectiveness of interventions by comparing the observed outcome with a counterfactual outcome based on an explicit scenario in which the interventions were not implemented. For example, the observed number of cases was compared with the number of cases that would have been observed if the exponential growth in cases had continued as before the implementation of interventions [60,61].

Use of exposure variation
The effectiveness of interventions was assessed by exploiting variation in the exposure to the intervention over time, between populations, or both. Assessments exploiting exposure variation over time contrasted the outcome in time periods when specific measures were in place with the outcome in time periods when they were not. In contrast, assessments exploiting exposure variation between populations compared the outcome between populations that were subject to specific measures and populations that were not. Only a small share of analyses exploited variation between populations or both between populations and over time.

Method
We grouped the different methods used into (1) description of change over time, (2) comparison of populations, (3) comparison of change points with intervention dates, (4) non-mechanistic model, (5) mechanistic model, and (6) synthetic controls. We review these in the following.
(1) Description of change over time The large majority of analyses following a descriptive approach examined the change of the outcome over time to assess the effectiveness of interventions. In some of these analyses, the focus was on the course of the outcome over time, typically by attributing the observed change (e. g. a reduction in new cases over time) to the analyzed interventions. For example, the outcome was assessed at regular or irregular intervals, which were not necessarily aligned with the implementation dates of interventions [44,45,62]. The majority of analyses, however, followed the logic of an interrupted time series analysis, i. e. the outcome was explicitly compared between time periods before and after interventions [63][64][65][66].
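The logic of such a before/after comparison can be sketched as follows. The case counts and intervention day are invented, and real analyses additionally account for reporting delays between infection and case confirmation.

```python
import math

# Sketch of the before/after logic of an interrupted time series: compare the
# mean daily growth rate of observed cases in the pre- and post-intervention
# periods. Data and the intervention day are hypothetical.

cases = [100, 120, 144, 173, 208, 218, 229, 240, 252]  # daily case counts
t_intervention = 5  # index of the first post-intervention day

log_cases = [math.log(c) for c in cases]
growth = [log_cases[i + 1] - log_cases[i] for i in range(len(cases) - 1)]

pre  = growth[:t_intervention - 1]   # growth rates fully before the intervention
post = growth[t_intervention - 1:]   # transition into the post period counted as post
mean_pre  = sum(pre) / len(pre)
mean_post = sum(post) / len(post)
reduction = mean_pre - mean_post     # > 0 suggests slower growth afterwards
```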
(2) Comparison of populations A few descriptive analyses compared outcomes via summary statistics only between populations (i. e. without considering variation over time) to assess the effectiveness of interventions. In such analyses, the outcomes were compared between populations that were stratified by different exposure to interventions (e. g. populations that implemented a certain intervention and populations that did not) [67][68][69].

(3) Comparison of change points with intervention dates
Some descriptive analyses checked whether the dates of estimated change points in outcomes and the implementation dates of interventions coincided [10,25,70]. If both dates were more or less in agreement, this was taken as evidence confirming the effectiveness of the intervention. However, change point detection methods could also yield change points prior to the implementation of interventions, which was sometimes interpreted as a sign of additional factors influencing the outcome (e. g. proactive social distancing) [25].
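A minimal sketch of this logic, using a deliberately simple mean-shift criterion on an invented growth-rate series; published analyses used dedicated change point methods (e. g. Bayesian or kernel-based models).

```python
# Minimal single-change-point detection on a growth-rate series: choose the
# split that minimizes the summed within-segment squared error, then compare
# the detected day with the (hypothetical) intervention day.

def sse(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def change_point(series):
    """Index t that best splits series into [0, t) and [t, n)."""
    n = len(series)
    return min(range(1, n), key=lambda t: sse(series[:t]) + sse(series[t:]))

growth = [0.18, 0.19, 0.18, 0.17, 0.05, 0.04, 0.05, 0.04]  # toy growth rates
t_detected = change_point(growth)
t_intervention = 4
agreement = abs(t_detected - t_intervention) <= 1
```

Agreement of the detected change point with the intervention date supports, but does not quantify, an association.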

(4) Non-mechanistic model
Non-mechanistic models are statistical models that typically make no explicit assumptions about the mechanisms that drive infection dynamics. Such models were used in both parametric and counterfactual approaches by a quarter of analyses.
In parametric approaches, non-mechanistic models, almost always (generalized) linear regression models, were used to model a direct link between interventions and outcome. Typically, dummy variables were used to indicate when (variation over time) [9,71,72] or where (variation between populations) [73][74][75] interventions were implemented. Analyses exploiting both variation over time and between populations typically used panel regression methods [5,76,77].
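A minimal sketch of such a dummy-variable panel regression. The data are invented, generated for two hypothetical populations so that the intervention coefficient is exactly recoverable; real analyses would add further controls and report uncertainty.

```python
import numpy as np

# Sketch of a dummy-variable panel regression: log growth rates from two
# hypothetical populations, with a population fixed effect, a linear time
# trend, and a binary intervention indicator. The coefficient on the
# indicator estimates the associated change in growth. Data are invented.

# (population, day, intervention in place?, log growth rate)
rows = [
    (0, d, int(d >= 4), 0.18 - 0.13 * (d >= 4) + 0.001 * d) for d in range(8)
] + [
    (1, d, int(d >= 6), 0.20 - 0.13 * (d >= 6) + 0.001 * d) for d in range(8)
]

pop = np.array([r[0] for r in rows], dtype=float)
day = np.array([r[1] for r in rows], dtype=float)
npi = np.array([r[2] for r in rows], dtype=float)
y   = np.array([r[3] for r in rows])

# Design matrix: intercept, population dummy, time trend, intervention dummy.
X = np.column_stack([np.ones(len(y)), pop, day, npi])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
npi_effect = beta[3]   # should recover roughly -0.13 by construction
```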
In counterfactual approaches, the non-mechanistic models used were mostly exponential growth models, and sometimes time series models (e. g. AR(I)MA or exponential smoothing) [41,47,61]. These models were fitted using data prior to when an intervention was implemented and then extrapolated the outcome afterwards.
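The extrapolation logic can be sketched as follows, with invented case counts: exponential growth is fitted in log space on the pre-intervention days and projected forward as the counterfactual.

```python
import math

# Counterfactual sketch: fit exponential growth (linear in log space) to the
# pre-intervention days, extrapolate past the intervention, and compare with
# the observed series. Numbers are hypothetical.

cases = [100, 121, 148, 181, 220, 231, 242, 254, 267]
t_int = 5  # first post-intervention day

# Least-squares fit of log(cases) = a + b * t on pre-intervention days.
ts = list(range(t_int))
ys = [math.log(cases[t]) for t in ts]
tbar, ybar = sum(ts) / len(ts), sum(ys) / len(ys)
b = sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys)) / \
    sum((t - tbar) ** 2 for t in ts)
a = ybar - b * tbar

counterfactual = [math.exp(a + b * t) for t in range(len(cases))]
# Cases "averted" relative to continued exponential growth.
averted = sum(cf - obs for cf, obs in zip(counterfactual[t_int:], cases[t_int:]))
```

The comparison attributes the entire deviation from continued exponential growth to the intervention, which is exactly the assumption such counterfactual analyses must defend.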

(5) Mechanistic model
Mechanistic models have a structure that makes, to some extent, explicit assumptions about the mechanisms that drive infection dynamics. They were used in both parametric and counterfactual approaches by slightly more than ten percent of analyses.
In parametric approaches, the association of an outcome with an intervention was represented via a parameter that was functionally linked to the disease dynamics (i. e. via a latent variable) of the model. This was typically achieved by parameterizing the transmission rate or reproduction number as a function of binary variables, indicating whether interventions were implemented or not [2,[78][79][80]. Others linked interventions to the contact rate, the transmission probability upon contact, or to entries in the contact matrix [81][82][83]. A few modeling approaches also represented the intervention via an explicit structure or dynamic in the model, e. g. a compartment for quarantined individuals with a quarantine rate [50,84] or an exponential decay of the susceptible population [49,50,85].
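A minimal sketch of this parameterization, using an SIR model whose transmission rate is scaled down once a binary intervention indicator switches on. All parameter values are illustrative rather than estimates; in a real analysis, `effect` would be fitted to observed time series.

```python
# Sketch of a single-population SIR model in which the transmission rate is
# parameterized as a function of a binary intervention indicator: beta is
# multiplied by (1 - effect) once the intervention is in place.

def simulate_sir(beta0, gamma, effect, t_int, days, n=1_000_000, i0=100):
    s, i, r = n - i0, i0, 0
    infections = []
    for t in range(days):
        beta = beta0 * (1 - effect) if t >= t_int else beta0
        new_inf = beta * s * i / n      # daily Euler step
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infections.append(new_inf)
    return infections

with_npi    = simulate_sir(beta0=0.4, gamma=0.2, effect=0.5, t_int=20, days=60)
without_npi = simulate_sir(beta0=0.4, gamma=0.2, effect=0.0, t_int=20, days=60)
```

With beta0 = 0.4 and gamma = 0.2 (basic reproduction number 2), halving transmission brings the effective reproduction number to about 1, which is visible as a plateau in the simulated incidence.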
The most popular mechanistic models used in parametric approaches were compartmental transmission models. These models were fitted to the time series of cases, hospitalizations, recovered cases, deaths, or several simultaneously. With the exception of one meta-population model [86], all compartmental models used in analyses following a parametric approach were single-population models. If multiple populations were analyzed, each population was modeled separately. A few parametric analyses also used a semi-mechanistic Bayesian transmission model with a time-discrete renewal process, similar to the one in an early influential paper by Flaxman et al. [8]. These analyses fitted a Bayesian hierarchical model with stochastic elements for disease transmission and ascertainment on observed time series for cases, deaths, or both [2,8,87]. The model was usually fitted to data from several populations, modeling separately the time course in each population but estimating the parameters for the associations of outcome with interventions jointly across populations. Rarely, analyses used highly complex models such as individual-based transmission models simulating the behavior of individual agents, or phylodynamic models inferring both virus phylogenies and transmission dynamics from genome sequence data.
In counterfactual approaches, mechanistic models were, similar to non-mechanistic models, calibrated to data before the implementation of an intervention and then projected the outcome for the time after the intervention, while keeping the model parameters fixed [88][89][90]. Thus, no relationship between intervention and outcome was explicitly modeled. Often, these analyses used meta-population or individual-based models that incorporated migration dynamics through mobility data and a network between individuals or populations [90][91][92].

(6) Synthetic controls
Some counterfactual approaches used synthetic control methods. Here, a counterfactual scenario was constructed by computing the counterfactual outcome as a weighted combination of observations from a pool of "control" populations in which the intervention was not implemented [46,93,94]. Weights were fitted so as to give more importance to control populations similar to the intervention population. In these analyses, the course of the outcome before intervention was often used as the primary measure of similarity [6,93,94]. Sometimes, further factors such as geographic proximity or population characteristics were also considered [93,95].
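A minimal sketch of the synthetic control logic with two invented control populations; real implementations optimize weights over many controls and additional covariates, typically with dedicated software.

```python
# Sketch of a synthetic control: the counterfactual for the treated population
# is a convex combination of two control populations, with the weight chosen
# to minimize the pre-intervention discrepancy. Series are hypothetical.

treated   = [10, 12, 15, 19, 24, 25, 26, 27]   # outcome, intervention at t=5
control_a = [11, 13, 16, 20, 25, 31, 39, 49]
control_b = [8, 10, 13, 17, 22, 27, 34, 43]
t_int = 5

def pre_error(w):
    synth = [w * a + (1 - w) * b for a, b in zip(control_a, control_b)]
    return sum((s - y) ** 2 for s, y in zip(synth[:t_int], treated[:t_int]))

# Grid search over the single weight (simplex constraint: w in [0, 1]).
w_best = min((k / 100 for k in range(101)), key=pre_error)
synthetic = [w_best * a + (1 - w_best) * b
             for a, b in zip(control_a, control_b)]
gap = [y - s for y, s in zip(treated, synthetic)]  # post-period gap ~ effect
```

A close pre-intervention fit (small gap before `t_int`) is what lends the post-intervention gap its interpretation as an estimated effect.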

Code availability
For around one in four analyses, a link to a publicly accessible repository containing the computer code implemented for a specific analysis was provided. Overall, code availability was comparatively higher for parametric approaches, where one in three analyses provided a link.

Effectiveness assessment
The analyses in our review sample varied in their form of effectiveness assessment, i. e. how the associations of outcomes with interventions were quantified, whether they were interpreted causally, whether uncertainty was reported, and whether sensitivity analyses or subgroup assessments were conducted. A systematic classification of the effectiveness assessment is shown in Table 5.

Reporting of effectiveness, measure of effectiveness, and reporting of uncertainty
Around one in five analyses qualitatively described the change in the outcome over time following the implementation of interventions. More frequently the outcome values before an intervention were compared with the outcome values after an intervention. Around half of the analyses reported a quantitative change in outcome values, e. g. by computing the difference in the outcome values before and after an intervention, or estimating the difference via a parameter in a statistical model. The effectiveness was oftentimes measured in terms of a change in the reproduction number, in confirmed cases, or in mobility, but many other measures of effectiveness were also common. Uncertainty was reported in around one half of the analyses, e. g. via standard error, confidence intervals, and credible intervals.
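As a sketch of one common way to quantify uncertainty for such a before/after difference, the following computes a percentile bootstrap confidence interval on invented growth-rate data; reported analyses used standard errors, confidence intervals, or credible intervals depending on the method.

```python
import random

# Percentile bootstrap confidence interval for the before/after difference
# in mean growth rates. Data are hypothetical.

random.seed(1)
pre  = [0.18, 0.20, 0.17, 0.19, 0.18, 0.21]
post = [0.05, 0.04, 0.06, 0.05, 0.03, 0.05]

def mean(xs):
    return sum(xs) / len(xs)

diffs = []
for _ in range(2000):
    pre_b  = [random.choice(pre) for _ in pre]    # resample with replacement
    post_b = [random.choice(post) for _ in post]
    diffs.append(mean(pre_b) - mean(post_b))

diffs.sort()
ci_low  = diffs[int(0.025 * len(diffs))]
ci_high = diffs[int(0.975 * len(diffs))]
point_estimate = mean(pre) - mean(post)
```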

Interpretation of results
Some analyses, in particular those describing the change of the outcome over time, interpreted their results only as associative [63,77], i. e. a statistical or temporal association between interventions and the measure of effectiveness was noted without a causal implication. In the majority of analyses, however, a causal conclusion was implicitly drawn from the results through the use of causal language (e. g. it was concluded that "interventions reduced transmission") [8,44]. Only rarely, and mostly in analyses using non-mechanistic econometric models [96,97] or synthetic controls, results were explicitly described as estimates of the causal effects of interventions.

Sensitivity analyses
We checked all works for sensitivity analyses that were specifically conducted to examine the robustness of the reported effectiveness. Many studies conducted sensitivity analyses only related to the predicted outcome or model fit, but not to the effectiveness of interventions. Overall, the vast majority of analyses did not conduct sensitivity analyses with regard to the effectiveness.
Of those that did, sensitivity analyses focused on model extensions or adjustments in which the model specification was varied, e. g. by changing the structure of a transmission model or by adjusting the estimated effects of interventions for additional variables in a regression model. Others analyzed sensitivity with respect to variations in epidemiological parameters, e. g. by assuming a different basic reproduction number, generation or serial interval, infectious period, or reporting delay distribution. Only a few analyses tested sensitivity with regard to data: i. e. using different or modified outcomes [72,98]; using a different coding of interventions [3,7]; or repeating the same analysis but excluding (sub)populations [2].

Subgroup assessment
The effectiveness of interventions was rarely assessed within subgroups of the population. Two thirds of such assessments were within subgroups created based on socioeconomic indicators, e. g. by age and gender [44] or by regions with different income levels [6]. Less frequent were subgroups based on epidemiological indicators [5], the public health response [99], or geographic areas [100].

Discussion
Our systematic review covers over 240 studies published between January 2020 and January 2021. Insights from this review can inform different types of future studies: (1) studies using data from the same period that extend our knowledge on aspects that have so far been rarely investigated; (2) studies using data from subsequent periods that generate new insights or corroborate existing ones; and (3) studies using data from a future pandemic caused by another virus.
Although the preconditions to conduct these studies differ, they share the goals and challenges of the studies in our review sample. Accordingly, the results from our systematic review allow us to discuss implications for future work and make recommendations for improving methodologies and comparability across studies.

Implications for future work
During the COVID-19 pandemic, both surveillance data on confirmed cases, hospitalizations, or deaths [32,33], and mobility data from mobile phones [35,36] have become publicly available at scale. This has enabled a large number of studies assessing the associations of population-level epidemiological outcomes and human mobility with non-pharmaceutical interventions (Table 2, D.2.1). However, considerable potential remains in the exploration of outcomes and analyses that have so far been rarely employed. First, the population-level data used by the majority of studies in our review sample can be subject to systematic differences in ascertainment between populations and over time. For example, epidemiological analyses have discussed the influence of testing procedures and intensity on the number of confirmed cases [2,14,42]. Due to limited availability of metadata from health authorities, it is oftentimes difficult to account for such factors. In this context, smaller-scale surveys with precisely defined outcomes and controlled sampling schemes (e. g. representative community sampling [101]) could provide a complementary source of data for future studies. Similarly, as has been demonstrated by studies in our review sample, the use of individual-level data could allow for more detailed analyses, e. g. by relating non-pharmaceutical interventions to changes in the serial interval using symptom onset data [44], to transmission chains using contact tracing data [90], or to virus migration rates using genome sequence data [23]. We hope to see more such analyses as more individual-level data becomes available.
Second, there can be great merit in analyses advancing our understanding of the mechanisms by which non-pharmaceutical interventions work. For example, interventions may influence behavior and transmission through factors not captured by previous studies using mobility data from mobile phones, and, moreover, the relationship between population behavior and disease transmission may change over time [102,103]. Additional insights can be gained from analyses using behavioral data from other sources, e. g. surveys evaluating compliance with mask mandates [65] or the number of daily contacts [104]. Moreover, we see value in analyzing interventions, behavior, and epidemiological outcomes jointly, i. e. in the form of a mediation analysis [96,105], allowing one to differentiate the direct and indirect effects of non-pharmaceutical interventions.
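The mediation logic can be sketched as follows, using simulated data in which an intervention affects a growth rate both directly and indirectly via mobility. The product-of-coefficients decomposition shown here is one simple variant of mediation analysis; the data-generating values are invented.

```python
import numpy as np

# Sketch of a mediation analysis decomposing the association of an
# intervention with a growth rate into an indirect path via mobility and a
# direct path (product-of-coefficients approach). Data are simulated.

rng = np.random.default_rng(0)
n = 500
npi = rng.integers(0, 2, n).astype(float)        # intervention in place?
mobility = 1.0 - 0.4 * npi + 0.05 * rng.standard_normal(n)
growth = 0.05 + 0.3 * mobility - 0.02 * npi + 0.01 * rng.standard_normal(n)

def ols(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

a = ols(npi, mobility)[1]                        # intervention -> mediator
b, direct = ols(np.column_stack([mobility, npi]), growth)[1:3]
indirect = a * b                                 # effect routed via mobility
total = direct + indirect
```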
Third, only one in ten analyses in our review sample examined variation in the effectiveness of interventions across subgroups or populations ( Table 4, D.4.2). Estimating and explaining such variation could help understand the conditions under which interventions are more or less effective for a specific subgroup, potentially allowing policy makers to tailor interventions to a specific subgroup or setting. Our review points out two approaches to analyze such variation: (1) comparing the effectiveness of interventions between subgroups of the same population (e. g. between the young and elderly population) [44]; and (2) comparing the effectiveness of interventions between different populations and relating differences to population-specific characteristics (e. g. population density) [106].
Last, while many analyses in our review sample used terminology related to a causal interpretation of the provided evidence (Table 5, D.5.3), it is difficult in general to estimate causal effects based on population-level observational data and to rule out unobserved confounding, e. g. from voluntary behavioral changes [107][108][109][110]. Hence, the evidence from the studies in our review sample should generally be interpreted as associative rather than causal estimates. A causal interpretation may be justified when additional criteria are met (e. g. the "no unmeasured confounding" assumption [111] or the Bradford Hill criteria [112] and extensions thereof [113][114][115]), but until now a rigorous discussion of such assumptions when estimating the effects of non-pharmaceutical interventions is lacking. Apart from that, we caution against emphasizing results from single studies. Rather, we recommend evaluating evidence from multiple observational studies jointly and, more importantly, in combination with other types of evidence. For example, evidence on the infectiousness of school children and parental strategies to fill the care gap can produce independent predictions about the effectiveness of school closures. Similarly, laboratory evidence regarding the effectiveness of masks together with evidence on compliance with masks can produce independent predictions of the effectiveness of mask mandates.

Recommendations for improving methodologies
Variation in the exposure to interventions (i. e. when, where and which interventions were implemented) is required in order to empirically assess their effectiveness. However, changes in the outcome over time may falsely be attributed to non-pharmaceutical interventions if they are subject to confounding by concurrent time trends. We thus recommend also exploiting exposure variation between populations, i. e. with respect to the timing and the types of single interventions that were implemented. This was done by only one in five analyses in our review sample (Table 4, D.4.2), although we found that on average these studies had more citations than other studies (see online Appendix C). Given that the types and timing of interventions varied considerably between populations, a valuable source of variation remains largely untapped by most analyses.
Evidence-based decision making requires empirical estimates for the effects of single non-pharmaceutical interventions (e. g. school closures or stay-at-home orders). However, the majority of analyses assessed the effectiveness of population-specific combinations of interventions such as lockdowns (Table 3, D.3.3). The underlying analyses typically studied only a single population (or multiple populations separately) where multiple interventions were implemented on the same day, and, as a result, the separate associations of outcomes with interventions cannot be disentangled. For future work, we recommend more effort to conduct analyses across multiple populations, so that the separate associations of outcomes with single interventions can be dissected.

Recommendations for improving comparability across studies
During a pandemic, public health policy has a strong focus on the number of confirmed cases, hospitalizations, and deaths, making them obvious outcomes to evaluate the effectiveness of interventions. However, non-pharmaceutical interventions act only indirectly and with a certain delay on these observable outcomes. Typically, non-pharmaceutical interventions should influence the behaviour of the population, which should reduce transmission (e. g. by limiting the contact rate), which in turn should affect the number of new infections and, subsequently, observed outcomes like confirmed cases, hospitalizations, or deaths. The question of how to assess the effectiveness of interventions along this path has been answered differently by the studies in our review sample. We identified four main types of analyses; see (1)-(4) in Box 1. In the following, we discuss the different types with regard to their ability to enable a comparison of results between studies. Box 1. Different types of analyses to assess the effects of non-pharmaceutical interventions (1) Observed outcome directly linked to interventions A raw, observed outcome is analyzed directly by evaluating differences (1) over time with an interrupted time-series analysis comparing the outcome before vs. after an intervention, (2) between populations with a cross-sectional analysis comparing populations exposed vs. not exposed to an intervention, or (3) both over time and between populations with a panel data analysis. Mechanistic modeling is typically not involved in this type of analysis, with one exception, namely counterfactual approaches using a transmission model to project the observed outcome after intervention. (2) Computed, unobserved outcome linked to interventions In contrast to type (1), the intervention effect is measured in terms of an unobserved outcome. 
This is computed from the raw outcome and then analyzed in a similar manner as in (1). (3) Computed, unobserved outcome linked to interventions within a joint model In contrast to type (2), the latent outcome and the intervention effect are estimated jointly within a single model. (4) Comparison of change points with intervention dates Estimated change points in the outcome are compared with the implementation dates of interventions.

Analyses of type (1) can avoid mechanistic modeling by directly analyzing an observed outcome such as cases or deaths. Here, a central challenge is to take into account the uncertain delay between the implementation of non-pharmaceutical interventions and their effects on the observable outcome. The fact that infections and subsequent outcomes such as confirmed cases follow exponential dynamics during an epidemic wave makes it difficult to compare estimates measured by observable outcomes across different epidemic phases. In contrast, analyses following type (2) or (3) employ mechanistic modeling, making it possible to link latent, unobservable outcomes (e. g. the transmission rate or the reproduction number) to interventions. Since these latent outcomes can be inferred from different observed outcomes like cases or deaths, it becomes possible to compare analyses that use different raw data. The difference between type (2) and (3) is that for (2) the estimation of the latent outcome is separated from the effectiveness assessment. Such separation reduces model complexity, however, often at the expense of incomplete uncertainty assessments (if uncertainty regarding the computed outcome is left out). Finally, analyses following type (4) take a very different approach that shares few assumptions with the other approaches: a comparison of change points can verify the presence of an association, yet without quantifying its size. As a result, such findings are best complemented with an analysis of type (1), (2), or (3). Notably, we found that studies which received many citations often used analyses of type (2) or (3) (see online Appendix C).
While variation in methodologies can complicate the comparison of studies, it may help to identify the influence of certain methodological choices on the results. Here, the public availability of data for outcomes and interventions holds potential for sensitivity analyses within studies as well as comparisons between studies. Specifically, the same analysis could be repeated with different sets of publicly available data as part of the same study. This way, sensitivity of the findings with respect to the choice of outcome and intervention data could be assessed within studies, reducing the risk of bias from specific outcome data (e. g. incomplete case ascertainment due to limited testing capacity etc.) or the specific coding of interventions. For example, the number of new cases, deaths, or both could be used as the raw outcome in mechanistic models with a comparable latent outcome [2]. However, other aspects, in particular the specific setting and methodologies used, are presumably more difficult to vary as part of a sensitivity analysis, and may therefore need to be compared between different studies. Important for such comparisons is giving access to the preprocessed data even if the raw data was retrieved from public sources (see online Appendix A).

Limitations of our review
Our systematic review has limitations. First, it covers studies published between January 1, 2020 and January 12, 2021. While comprising a large review sample of more than 240 publications, it is therefore limited to the first year of the COVID-19 pandemic. A future review may examine to what extent methodologies have changed over time or what novelties were introduced when analyzing data from later waves, although we have referred to some recent contributions in our discussion. Second, although our review process aimed to ensure a representative sample of studies in the field, certain biases cannot be ruled out. Specifically, our search queries focused on general terms describing non-pharmaceutical interventions (see Methods), so that studies using only terminology related to specific interventions may not have been found. Third, our data extraction form comprises items that were widely applicable over a diverse set of studies. While providing a consistent framework to compare different methodologies, this naturally limits in-depth analyses of specific methodologies, which could complement our review.

Comparison with related work
Our review provides the first large, systematic categorization of existing methodologies to assess the effectiveness of non-pharmaceutical interventions. An early review on the subject was written by Perra [16], which, however, is not systematic and has a different objective than our methodology review (i. e. the majority of studies in its review sample cover aspects other than the effectiveness of non-pharmaceutical interventions). A more recent methodology review by Garin et al. [116] focuses on epidemic models during the COVID-19 pandemic. It thus complements our review by considering a subset of the methodologies in our review sample and studying them beyond their use in assessing the effectiveness of non-pharmaceutical interventions. Finally, besides methodology reviews, several meta-analyses have attempted to summarize the effectiveness of non-pharmaceutical interventions [11][12][13][14][15].

Conclusions
Our review of more than 240 studies on the effectiveness of non-pharmaceutical interventions revealed substantial variation in methodologies. Until specific best practices emerge, further heterogeneity in studies is inevitable and can also be beneficial, e. g. for assessing robustness of the results with respect to method and input data. Nevertheless, some standardization is required in order to synthesize evidence on the effectiveness of non-pharmaceutical interventions from multiple studies. So far, a lack of common standards and substantial variation in the methodologies used have created a challenge for meta-analyses to summarize and compare the reported evidence from existing studies [11][12][13][14][15]. Here, our methodology review can serve as a basis for subsequent meta-analyses to factor in the variety of existing methodologies when pooling and comparing the reported evidence. Moreover, the systematic categorization of methodologies developed in this review may also serve as a basis for designing a risk of bias assessment tool specific to studies on the effectiveness of non-pharmaceutical interventions. Most importantly though, our recommendations for the design of future studies aim to extend the scope of existing analyses and reduce methodological barriers to comparability across studies.
During the COVID-19 pandemic, a tremendous amount of publicly available epidemiological data has been generated. The ease of access to this data allowed many researchers to contribute work, using a variety of methodologies to assess the effectiveness of non-pharmaceutical interventions on health-related outcomes. With researchers from diverse fields contributing, there is a unique opportunity to benefit from the various inputs in developing a methodological foundation for timely and robust assessments during future pandemics. This will however require a thorough examination of the present methodologies in order to share lessons learnt and develop best practices. Our systematic review can be viewed as a first such attempt.
Author contributions NB and AL contributed to conceptualization, literature search, study selection, data extraction, data synthesis, and manuscript writing, review & editing. DO contributed to literature search and manuscript review & and editing. TS contributed to manuscript review & editing. SF contributed to manuscript writing, review & editing. WV contributed to conceptualization, study selection, data extraction, data synthesis, manuscript writing, review & editing.
Funding Open access funding provided by Swiss Federal Institute of Technology Zurich. NB and SF acknowledge funding from the Swiss National Science Foundation (SNSF) as part of the Eccellenza grant 186932 on "Data-driven health management". The funding bodies had no control over design, conduct, data, analysis, review, reporting, or interpretation of the research conducted.

Data availability
The full data extracted from the studies in this review is available in machine-readable format at https://github.com/adrianlison/methodologies-npi-effects.

Declarations
Competing interests All authors declare no competing interests.
Ethics approval Ethics approval was not required for this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.