Background

In medical research, randomised controlled trials (RCTs) are considered the gold-standard study design for evaluating the effectiveness of a treatment [1]. However, RCTs are sometimes not feasible, for example because of their high cost, and even when viable they can take too long to inform pressing clinical and health policy decisions. In such scenarios, careful analysis of observational data may provide an alternative source of evidence to guide those decisions [2,3,4].

Observational data is a broad term covering any patient health and care information collected in non-experimental settings, i.e. outside experimental studies such as RCTs [5, 6]. In this paper, we make the distinction between two types of observational data: research-generated data and non-research-generated data (Table 1).

Table 1 Sources of two different types of observational data

Accurate estimation of treatment effects from observational data is challenging. The main reason is the possibility of confounding of the effect of treatment on the clinical outcome(s). Unlike in RCTs, in observational studies patients are not randomly assigned to treatment groups at baseline. Instead, each patient is prescribed a treatment by a clinician according to their demographic and clinical characteristics (e.g. gender, age, severity of illness), which is likely to result in an unequal distribution of these characteristics across treatment groups. If these characteristics are also prognostic factors for the outcome(s), and hence confounders, they must be accounted for; otherwise, confounding bias will result [13, 14].
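
To make the mechanism concrete, the following minimal Python sketch simulates this situation with entirely hypothetical quantities: the confounder name (‘severity’), the true treatment effect of 1.0, and all coefficients are our own illustrative assumptions, not values from any study. It shows how a characteristic driving both treatment allocation and outcome biases a naive comparison, and how conditioning on it removes the bias.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20_000
severity = rng.normal(size=n)                      # prognostic baseline characteristic
# Clinicians treat sicker patients more often -> unequal distribution across groups
treated = rng.binomial(1, 1 / (1 + np.exp(-severity)))
# Outcome depends on severity and on treatment (assumed true effect = 1.0)
outcome = 1.0 * treated + 2.0 * severity + rng.normal(size=n)

# Naive comparison: confounded, estimate is far from 1.0
naive = sm.OLS(outcome, sm.add_constant(treated)).fit()
# Conditioning on the confounder recovers approximately 1.0
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treated, severity]))).fit()
print(naive.params[1], adjusted.params[1])
```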

Moreover, poorly designed or ill-thought-out observational studies can suffer from additional problems arising from misalignment of treatment initiation, eligibility, and the start of follow-up, as well as from loss to follow-up [4, 13, 15]. In a well-designed prospective trial, baseline assessment is carried out just before random allocation to treatment, and participant follow-up starts at randomisation. In contrast, in an observational study of treatment initiation versus no initiation, there can be a delay between the start of follow-up (i.e. when the eligibility criteria are met and the study outcome(s) begin to be considered) and treatment initiation. This creates a period of follow-up time, commonly referred to as ‘immortal time’, during which participants in the treated group cannot, by definition, have died or experienced the outcome(s). These participants are not truly ‘immortal’ during this period; rather, they must have survived it (i.e. remained alive and event-free) in order to initiate treatment [13, 14, 16,17,18,19]. Inadequate handling of this unexposed period in the design or analysis of the observational study results in ‘immortal time bias’ [18]. Loss to follow-up in observational studies can in turn lead to selection bias, since participants lost to follow-up may systematically differ from those retained in terms of both treatment status and prognostic variables. If this is not accounted for appropriately in the study’s analysis, its validity may be compromised [3, 20].
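
Returning to immortal time, a small simulation makes the bias tangible. In this hypothetical sketch every rate and parameter is an assumption of ours: treatment has no effect at all, yet a naive ‘ever treated versus never treated’ comparison appears strongly protective because the waiting time before initiation is credited to the treated group; reallocating that unexposed person-time removes the artefact.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
death = rng.exponential(scale=5.0, size=n)   # survival time (years); constant hazard 0.2
init = rng.exponential(scale=2.0, size=n)    # intended treatment initiation time
treated = init < death                       # only those surviving long enough initiate

# Naive 'ever treated' analysis: all follow-up of the treated, including the
# pre-initiation 'immortal' period, is counted as treated person-time.
naive_rr = (treated.sum() / death[treated].sum()) / (
    (~treated).sum() / death[~treated].sum()
)

# Correct handling: person-time before initiation counts as unexposed.
rate_tx = treated.sum() / (death[treated] - init[treated]).sum()
rate_untx = (~treated).sum() / (death[~treated].sum() + init[treated].sum())
print(naive_rr, rate_tx / rate_untx)         # ~0.2 vs ~1.0 despite a true null effect
```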

Additional complexity arises in observational studies that aim to evaluate the causal effect of a sustained treatment strategy or treatment regimen rather than a ‘point treatment’. Treatment regimens often consist of a number of treatments sustained over time, such as repeat prescriptions of medication for human immunodeficiency virus (HIV) [21]. When evaluating the causal effect of a particular treatment regimen, e.g. the causal contrast between being continuously prescribed HIV medication versus no prescription at all, the observed treatment histories may depart from these regimens, because clinical decisions to re-prescribe drugs may depend on previous drug responses or side effects. In such studies there may therefore be (observable) variables, such as intermediate treatment response or side effects, that are (i) affected by past treatments and (ii) drive both future treatment allocations and the long-term outcome. Such variables are known as ‘time-varying confounders’, to distinguish them from ‘baseline/pre-treatment confounders’. This statistical issue is often overlooked, as more complex analysis methods are needed to avoid the bias arising from these confounders [21, 22].
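
As an illustration, the following hypothetical two-period data-generating process (all coefficients are assumed for the sketch) produces exactly such a variable: the marker L1 is affected by the first treatment A0 and in turn drives both the next treatment decision A1 and the outcome Y.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
expit = lambda x: 1 / (1 + np.exp(-x))

L0 = rng.normal(size=n)                           # baseline marker
A0 = rng.binomial(1, expit(L0))                   # first prescription depends on L0
L1 = 0.5 * L0 + 0.5 * A0 + rng.normal(size=n)     # (i) marker affected by past treatment
A1 = rng.binomial(1, expit(L1))                   # (ii) marker drives re-prescription...
Y = 1.0 * A0 + 1.0 * A1 + 1.0 * L1 + rng.normal(size=n)  # ...and the outcome
# L1 is a time-varying confounder: naively conditioning on it would block the
# A0 -> L1 -> Y pathway and bias the total effect of the sustained regimen.
```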

In 2016, Hernán and Robins put forward a solution that averts most of these biases: the ‘target trial’ framework. The framework consists of three steps. First, clearly define a causal question about a treatment. Second, specify the protocol of the ‘target trial’, i.e. the eligibility criteria, the treatment strategies being compared (including their start and end times), the assignment procedures, the follow-up period, the outcome(s) of interest, the causal contrast(s) of interest, and a plan to estimate them without bias; in other words, the protocol of the RCT one would like to perform but cannot, due to impracticality. Last, explain how the observational data will be used to explicitly emulate that protocol. Meticulously following this structured process when planning observational studies can help prevent biases such as immortal time bias and selection bias. Avoiding confounding bias tends to be more difficult in practice. To emulate randomisation, all baseline (and, where relevant, time-varying) confounders must be measured. However, there is no guarantee that the observational database contains sufficient information on the confounders. Furthermore, there might be confounders the study investigator is not aware of and therefore does not attempt to measure or control for (i.e. unobserved confounders). Hence, successful emulation of randomisation is never guaranteed, and there is no certainty that residual confounding is absent [3]. Nonetheless, the ‘target trial’ framework is a rigorous approach for evaluating treatment effects from observational data.
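
To illustrate the second step, the protocol components can be written down explicitly before any data are analysed. The sketch below is a purely hypothetical specification: every entry is a placeholder of ours, not a protocol from any study reviewed here.

```python
# Hypothetical 'target trial' protocol specification; all entries are placeholders.
target_trial_protocol = {
    "causal_question": "Does initiating drug X reduce 5-year mortality?",
    "eligibility": "adults with condition Y, no use of drug X in the prior year",
    "treatment_strategies": ["initiate drug X within 30 days", "never initiate"],
    "assignment": "emulated randomisation: adjust for measured baseline confounders",
    "follow_up": "from assignment until death, loss to follow-up, or 5 years",
    "outcome": "all-cause mortality",
    "causal_contrasts": ["intention-to-treat analogue", "per-protocol analogue"],
    "analysis_plan": "pooled logistic regression with inverse probability weights",
    "emulation_notes": "map each component to fields in the observational database",
}
```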

The aim of this scoping review is to identify and review all explicit attempts at trial emulation across all medical fields. This work provides an overview of the medical fields covered, the types of observational data most frequently used, and the statistical methods employed to address the following biases: (A) confounding bias, (B) immortal time bias, and (C) potential selection bias due to loss to follow-up, henceforth simply referred to as selection bias.

Methods

Search strategy and selection criteria

Three bibliographic databases (Embase (Ovid), Medline (Ovid) and Web of Science) were searched for studies published in English from database inception (Embase (Ovid): 1974, Medline (Ovid): 1946 and Web of Science: 1900) to February 25, 2021, using predefined search terms related to concepts such as trial emulation and observational data (see Additional file 1).

The study selection process consisted of two key steps. First, all duplicates were identified and removed; this was done automatically in EndNote X9 [23] and then manually checked and completed by one reviewer (GS). Next, eligible studies were identified based on their titles, abstracts and/or keywords. To be considered eligible, a study had to explicitly state in its title, abstract or keywords that it emulated a trial using observational data. One reviewer (GS) systematically checked each study’s title, abstract and keywords.

Data extraction

One reviewer (GS) extracted the data from the studies. The studies’ supplementary materials were checked only when further methodological details were needed. A custom Excel spreadsheet was used to record specific information, such as each study’s subject area, the type of observational data used, the causal contrast(s) of interest, and the statistical methods used for analysing the primary outcome(s) and for addressing the following biases: (A) confounding bias, (B) immortal time bias and (C) selection bias (see Table 2).

Table 2 Data extraction form

Quality check

A second reviewer (AC) re-screened 100 articles (16%) and extracted data from eight out of the 38 eligible articles (21%) to assess the reliability of study selection and data extraction. There were no disagreements between the first and the second reviewer (GS and AC).

Results

The literature search yielded 617 studies. After removing duplicates and excluding studies based on title, abstract and keywords, 38 studies were identified as eligible for review (Fig. 1). Most of those 38 studies concerned cardiology (N = 11, 26%), infectious diseases (N = 9, 21%) or oncology (N = 8, 19%) (Fig. 2). Five studies (9, 23, 31, 35 and 36 in Table 3) covered more than one medical field, so the percentages were calculated out of 43 study–field entries rather than 38 studies.

Fig. 1 Study selection flow chart

Fig. 2 Medical fields most covered. Note: studies were classified based on their outcomes, whenever possible

Table 3 Types of observational data used and subject area

Observational data sources

Out of the 38 studies we reviewed, most used electronic health record (EHR)/electronic medical record (EMR) data (N = 12, 29%) or cohort study data (N = 12, 29%) (see Table 3). Among those that used EHR/EMR data, only Keyhani and colleagues mentioned using a natural language processing (NLP) algorithm to retrieve and extract unstructured data, i.e. ‘carotid imaging results showing stenosis of less than 50% or hemodynamically insignificant stenosis’ [57]. Three studies (2, 3 and 36 in Table 3) used more than one observational data source, so the percentages were calculated out of 41 study–source entries rather than 38 studies.

Causal contrast of interest

Most of the trial emulation studies we reviewed aimed to assess the causal effect of treatment initiation, the observational analogue of the intention-to-treat (ITT) effect in trials (25 of the 38 studies reviewed, 21 of which considered initiation of a treatment regimen rather than a point treatment). Seven studies assessed the causal effect of receiving a point treatment, and 15 studies compared the effects of two or more alternative sustained treatment regimens (including no treatment), the observational analogue of a per-protocol (PP) effect. Nine studies (1, 4, 6, 13, 17, 18, 26, 28 and 31 in Table 4) assessed both types of causal contrast.

Table 4 Causal contrast of interest and methods used to address different biases

Most of the primary outcomes of the reviewed studies were measured on a time-to-event scale (N = 34/38, 89%). Accordingly, the most common effect size measure was the hazard ratio (N = 22, 65%), estimated by fitting a Cox proportional hazards model (N = 14, 61%), a pooled logistic regression (N = 8, 35%) or a Fine and Gray regression model (N = 1, 4%). One study (17 in Table 4) used both a Cox proportional hazards model and a pooled logistic regression, so these percentages were calculated out of 23 entries rather than 22 studies.
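
To illustrate why pooled logistic regression can stand in for a Cox model, the sketch below (simulated data; all parameter values, the 24-month censoring, and the quadratic time term are our own assumptions) expands each participant into person-period rows and fits a discrete-time logistic model. When intervals are short and the per-interval event risk is small, the exponentiated treatment coefficient approximates the hazard ratio.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2_000
treated = rng.binomial(1, 0.5, n)
# Exponential survival times (months); assumed true hazard ratio = (1/12)/(1/8) ~ 0.67
time = rng.exponential(scale=np.where(treated == 1, 12.0, 8.0))
event = (time < 24).astype(int)
time = np.minimum(time, 24.0)                 # administrative censoring at 24 months

# Expand to one row per person-month ('person-period' format)
rows = []
for i in range(n):
    months = int(np.ceil(time[i]))
    for t in range(months):
        rows.append({"t": t, "treated": treated[i],
                     "event": int(event[i] and t == months - 1)})
pp = pd.DataFrame(rows)

# Discrete-time hazard model with a flexible function of time
pp["t2"] = pp["t"] ** 2
X = sm.add_constant(pp[["treated", "t", "t2"]].astype(float))
fit = sm.Logit(pp["event"], X).fit(disp=0)
print(np.exp(fit.params["treated"]))          # ~0.67 when per-month risk is small
```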

Handling of confounding

When estimating the observational analogue of an ITT effect, the trial emulation studies used various statistical methods to adjust for baseline confounders: conditioning on the confounders (N = 18, 37%); propensity score methods (propensity score matching, stratification on the propensity score, adjustment for the propensity score, etc.; N = 10, 20%); and g-methods, namely inverse probability of treatment weighting (IPTW; N = 10, 20%), the parametric g-formula (N = 3, 6%) and a doubly robust method, targeted maximum likelihood estimation (TMLE; N = 1, 2%). Six studies (12%) used the cloning approach in combination with inverse probability of censoring weighting (IPCW), as suggested by Hernán within the context of the ‘target trial’ framework (3, 8, 10, 19, 29 and 38 in Table 4). Of these six studies, four additionally conditioned on confounders in their analyses (3, 8, 19 and 29 in Table 4). One study (2%) adjusted for confounders at the design stage but still relied on conditioning on those confounders in its analyses (20 in Table 4). Ten studies used more than one method, so the percentages were calculated out of 49 entries rather than 38 studies (3, 8, 17, 19, 20, 22, 26, 29, 30, and 33 in Table 4).
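
As a concrete sketch of the most common g-method here, IPTW, the following snippet (simulated data with assumed effect sizes, mirroring the earlier confounding example) fits a propensity score model, forms stabilised weights, and estimates the treatment effect in the weighted pseudo-population.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 20_000
L = rng.normal(size=n)                          # measured baseline confounder
A = rng.binomial(1, 1 / (1 + np.exp(-L)))       # treatment depends on L
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)      # assumed true effect = 1.0

# Step 1: propensity score model P(A = 1 | L)
design = sm.add_constant(L)
ps = sm.Logit(A, design).fit(disp=0).predict(design)

# Step 2: stabilised inverse probability of treatment weights
sw = np.where(A == 1, A.mean() / ps, (1 - A.mean()) / (1 - ps))

# Step 3: outcome model in the weighted pseudo-population, where L is
# (approximately) balanced across treatment groups
msm = sm.WLS(Y, sm.add_constant(A), weights=sw).fit()
print(msm.params[1])                            # ~1.0
```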

Out of the 15 studies that reported the observational analogue of the PP effect for sustained treatment strategies, most used g-methods to adjust for time-varying confounding. Specifically, nine studies (60%) used IPTW, two (13%) used the cloning approach combined with IPCW, and three (20%) used the parametric g-formula. For one study (7%) it was unclear which statistical method had been used (13 in Table 4).
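
Re-simulating the hypothetical two-period process sketched in the Background (all models and coefficients assumed), the snippet below shows how stabilised time-varying weights are built as a product over time and used in a marginal structural model; conditioning directly on L1 would instead block part of A0’s effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
expit = lambda x: 1 / (1 + np.exp(-x))
L0 = rng.normal(size=n)
A0 = rng.binomial(1, expit(L0))
L1 = 0.5 * L0 + 0.5 * A0 + rng.normal(size=n)
A1 = rng.binomial(1, expit(L1))
Y = 1.0 * A0 + 1.0 * A1 + 1.0 * L1 + rng.normal(size=n)
# Assumed true joint effects: 1.5 for A0 (1.0 direct + 0.5 via L1) and 1.0 for A1

def p_obs(a, X):
    """Probability of the observed treatment value from a logistic model."""
    X = sm.add_constant(X)
    p = sm.Logit(a, X).fit(disp=0).predict(X)
    return np.where(a == 1, p, 1 - p)

# Stabilised weights: product over time of
# P(A_t | treatment history) / P(A_t | treatment history and confounder history)
num = np.where(A0 == 1, A0.mean(), 1 - A0.mean()) * p_obs(A1, A0)
den = p_obs(A0, L0) * p_obs(A1, np.column_stack([L0, A0, L1]))
sw = num / den

# Marginal structural model for the joint effect of the treatment sequence
msm = sm.WLS(Y, sm.add_constant(np.column_stack([A0, A1])), weights=sw).fit()
print(msm.params[1:])                           # ~(1.5, 1.0)
```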

Immortal time bias

All the studies reviewed attempted to address immortal time bias in one of three ways: (1) by designing the study so that participants were assigned to treatment strategies at the start of follow-up based on the data available at that time (N = 21, 55%); (2) by using the cloning approach (N = 6, 16%); or (3) by using the sequential trial emulation approach (N = 11, 29%) (Table 4).

Selection bias

Out of the 38 reviewed studies, only 15 (39%) explicitly addressed the possibility of selection bias resulting from loss to follow-up. These studies used various methods, including IPCW (N = 7, 35%), the parametric g-formula (N = 3, 15%), TMLE (N = 1, 5%), multiple imputation (N = 2, 10%), last observation carried forward (N = 1, 5%), non-responder imputation (N = 1, 5%), and complete case analysis (N = 5, 25%). Two studies (24 and 26 in Table 4) used multiple methods, so the percentages were calculated out of 20 entries rather than 15 studies. For the remaining 25 studies (61%) it was unclear whether and how they adjusted for selection bias (see Table 4).
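
As a sketch of the most frequently used method, IPCW, the snippet below (simulated data; the dropout mechanism and all coefficients are assumptions) up-weights participants who remain under follow-up by the inverse of their modelled probability of remaining, so that the complete cases again represent the full cohort.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 50_000
expit = lambda x: 1 / (1 + np.exp(-x))
L = rng.normal(size=n)                          # prognostic variable
A = rng.binomial(1, 0.5, n)                     # treatment (randomised for simplicity)
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)      # assumed true effect = 1.0
# Loss to follow-up depends on both treatment and prognosis (informative dropout)
stay = rng.binomial(1, expit(1.0 - 1.5 * L + 1.0 * A))

# Model the probability of remaining in follow-up, then weight the complete cases
X = sm.add_constant(np.column_stack([A, L]))
p_stay = sm.Logit(stay, X).fit(disp=0).predict(X)
w = 1.0 / p_stay

kept = stay == 1
naive = sm.OLS(Y[kept], sm.add_constant(A[kept])).fit()
ipcw = sm.WLS(Y[kept], sm.add_constant(A[kept]), weights=w[kept]).fit()
print(naive.params[1], ipcw.params[1])          # biased vs ~1.0
```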

Discussion

Out of the 38 trial emulation studies we reviewed, most concerned cardiology, infectious diseases, or oncology. These studies leveraged different types of observational data, predominantly EHR/EMR data and cohort study data. It is worth noting that among the studies using EHR/EMR data, only one mentioned using unstructured data. However, we cannot exclude the possibility that some EHR/EMR databases had already pre-processed and converted unstructured data into a structured, tabular format.

The reviewed trial emulation studies used conventional or more advanced statistical methods to adjust for baseline confounders when estimating the observational analogue of an ITT effect. Conventional methods include conditioning on the putative confounders (i.e. including the confounding variables in the statistical model), whereas more advanced methods include propensity score methods and g-methods (IPTW, the parametric g-formula and TMLE).

Conversely, when estimating the observational analogue of the PP effect of sustained treatment strategies, the reviewed studies used g-methods, specifically IPTW and the parametric g-formula, to account for time-varying confounders. These more advanced methods were needed because time-varying confounders can themselves be affected by prior treatment, so adjusting for them with conventional statistical or propensity score methods would block part of the treatment’s effect and prevent identification of its total causal effect.

In summary, both conventional and more advanced statistical methods can be used to adjust for confounding at baseline. To properly account for time-varying confounding, however, specific statistical methods, such as the parametric g-formula and IPTW, must be used.

Different approaches can be used to address immortal time bias. One common approach is to assign individuals to treatment strategies at the start of follow-up based on the data available at that time. Alternatively, the sequential trial emulation approach or the cloning approach can be used.

The start of follow-up is the time at which an individual meets the eligibility criteria and is assigned a treatment strategy. In some instances, however, an individual might meet the eligibility criteria at multiple times. For example, when comparing initiators and non-initiators of treatment, a non-initiator at one point in time might be an initiator at a subsequent point in time and meet the eligibility criteria at both. In that case, there are two unbiased options for choosing the start of follow-up. One is to consider a single eligible time point. The other is to consider both time points and use the sequential trial emulation approach. This consists of emulating a sequence of trials with different starts of follow-up, making it possible for a non-initiator to enter a subsequent trial as an initiator if they meet all the eligibility criteria at the start of that subsequent trial (see the sketch below). Note, however, that since the same individuals might contribute to multiple emulated trials, the variance estimators must be adjusted appropriately. Emulating a sequence of trials is expected to yield more precise results than emulating a single trial, given the additional data available for analysis [3, 60].
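
A minimal sketch of the data expansion step follows. The column names (`id`, `t`, `eligible`, `initiated_now`) are hypothetical placeholders, and censoring and weighting details are deliberately omitted.

```python
import pandas as pd

def expand_sequential_trials(df: pd.DataFrame, baseline_times) -> pd.DataFrame:
    """Stack one emulated trial per baseline time.

    df: one row per person-period, with hypothetical columns
        'id', 't', 'eligible' (meets criteria at t) and
        'initiated_now' (first initiates treatment at t).
    """
    trials = []
    for t0 in baseline_times:
        base = df[(df["t"] == t0) & (df["eligible"] == 1)].copy()
        base["trial"] = t0
        # Arm at this trial's baseline: initiator vs non-initiator at t0.
        # A non-initiator here may re-enter a later trial as an initiator.
        base["arm"] = base["initiated_now"]
        trials.append(base)
    # The same person can appear in several trials, so downstream analyses
    # need variance estimators that allow for this (e.g. bootstrap by person).
    return pd.concat(trials, ignore_index=True)
```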

The cloning approach, in turn, is used when the treatment strategies of the individuals are unknown at baseline. It consists of three key steps. First, in a trial emulation study with two treatment groups, if individuals cannot yet be assigned to a specific treatment strategy at baseline, two exact copies (clones) of each individual are created, and one clone is assigned to each treatment group. Next, clones are followed over time and are censored when they deviate from their assigned treatment strategy. Last, IPCW is used to account for the selection bias resulting from this censoring [14, 60]. Given that only clones who comply with their assigned treatment strategy remain under study, the cloning approach only allows estimation of the observational analogue of the PP effect, whether for point treatments or sustained treatment strategies. Furthermore, the cloning approach can be combined with a grace period: a predefined window of follow-up during which treatment initiation may occur, whose length is chosen to reflect real-world clinical scenarios (e.g. hospital delays before surgery). Using a grace period better reflects clinical reality and can increase the number of eligible individuals in the observational database [3, 14, 61]. With regard to confounding, cloning removes confounding at baseline; however, artificially censoring clones introduces selection bias, which is accounted for using IPCW [14, 60]. Nonetheless, most of the studies using the cloning approach still adjusted for confounders at baseline.
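
A schematic sketch of the first two steps (cloning and artificial censoring) for two strategies, ‘initiate within a grace period’ versus ‘never initiate’, is given below. The column names and grace-period handling are illustrative assumptions, and the IPCW step would follow as in the selection-bias sketch above.

```python
import numpy as np
import pandas as pd

def clone_and_censor(df: pd.DataFrame, grace: float) -> pd.DataFrame:
    """Cloning sketch. df has one row per person with hypothetical columns:
    'id', 'init_time' (np.inf if treatment is never initiated) and
    'followup' (time to event or end of follow-up)."""
    clones = []
    for strategy in ("initiate_within_grace", "never_initiate"):
        c = df.copy()
        c["strategy"] = strategy
        if strategy == "initiate_within_grace":
            # Censor clones still untreated when the grace period ends
            c["censor_time"] = np.where(c["init_time"] <= grace, np.inf, grace)
        else:
            # Censor clones at treatment initiation, if it ever occurs
            c["censor_time"] = c["init_time"]
        clones.append(c)
    out = pd.concat(clones, ignore_index=True)
    out["time"] = np.minimum(out["followup"], out["censor_time"])
    out["artificially_censored"] = out["censor_time"] < out["followup"]
    # Final step (omitted): weight uncensored clones by IPCW to undo the
    # selection introduced by this artificial censoring.
    return out
```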

In summary, different strategies can be used to address immortal time bias: assigning individuals to treatment strategies at baseline based on the data available at that time, using the sequential trial emulation approach, or using the cloning approach.

Potential selection bias resulting from loss to follow-up was primarily accounted for using IPCW. Other methods included complete case analysis, the parametric g-formula, TMLE, multiple imputation, last observation carried forward, and non-responder imputation.

As a general remark, not all the trial emulation studies we reviewed explicitly mentioned using the ‘target trial’ framework, and some that did use it did not report its use clearly. Those that used the framework tended to follow its reporting conventions: they usually provided a table outlining the protocol of the ‘target trial’ and explicitly specified how each component of that protocol was emulated using observational data. Reporting these details is crucial, and is advised going forward, as it allows readers to readily understand the aim of the study and the statistical methods used to address confounding bias, immortal time bias and selection bias.

Limitations

This scoping review has one main limitation: our search strategy almost certainly did not identify all trial emulation studies published by February 25, 2021. This is a result of varying nomenclature, as not every trial emulation study refers to itself as such. For instance, to our knowledge the first published trial emulation study described itself as an ‘observational study analysed like a randomised experiment’ [2]. We refrained from using search terms like ‘randomised experiment’ and/or ‘randomised clinical trial’ because, combined with terms such as ‘observational study’ and/or ‘observational data’, they would have yielded thousands of studies, most of them likely irrelevant. Instead, we used search terms such as ‘trial emulation’ and ‘target trial’, coined by Hernán and Robins, who in 2016 were the first to formalise the idea of using observational data to emulate a randomised trial. This, however, could have led us to omit some trial emulation studies, as not every researcher or research group refers to trial emulation in these terms. Future trial emulation studies should clearly label themselves as such, both in their abstracts and throughout their papers.

Future directions

There is currently much interest in the suitability of EHR/EMR data for trial emulation, given the increased availability of big electronic healthcare databases. The main concern is the quality of EHR/EMR data: they should be free from errors, inconsistencies and inaccuracies, and provide all the information required to answer the causal research question under study, including data on exposure, outcome, baseline confounders, time-varying confounders (if applicable), eligibility criteria and missingness predictors. Furthermore, the data should be available in a standardised format, trustworthy, and up-to-date [3, 4, 62].

The trial emulation studies that used EHR/EMR data extracted those data from multiple sources. For instance, The Health Improvement Network database, used in some studies, consists of EHR/EMR data from over 500 primary care practices in the United Kingdom (UK) [63]. This type of EHR/EMR database has proved useful for research purposes. It remains to be determined, however, whether EHR/EMR data from a single healthcare facility can be used successfully to emulate trials, inform clinical decisions, and ultimately contribute to improving patient care at the facility itself. In England specifically, large National Health Service (NHS) Trusts, such as King’s College Hospital, University College London Hospitals, and University Hospitals Birmingham NHS Foundation Trusts, store large amounts of EHR/EMR data. It would be worth evaluating the feasibility of emulating trials using these EHR/EMR data specifically, especially given recent advances in health informatics (e.g. NLP) that enable quick access to and full use of such data. If these trial emulations prove feasible and provide valid findings, the approach could then be applied on a wider scale to gain scientific insights faster and at lower cost.

Conclusions

This study reviewed explicit attempts at trial emulation across all medical fields and provides a comprehensive overview of the types of observational data leveraged and the statistical methods used to address the following biases: (A) confounding bias, (B) immortal time bias and (C) selection bias. Different methods can be used to address those biases. Future trial emulation studies should clearly define the causal question of interest, specify the protocol of the ‘target trial’, explain how the observational data were used to explicitly emulate that trial, and include this information in the paper. Doing so will improve the reporting of trial emulation studies. When working with observational data, the ‘target trial’ framework should be used where possible, as it provides a structured conceptual approach to observational research.

Although EHR/EMR databases have been used successfully for trial emulation, these consist of EHR/EMR data extracted from multiple sources and tend to rely on structured data. It remains to be determined whether EHR/EMR data from a single healthcare facility contain sufficient, and sufficiently accurate, information to emulate trials successfully. If so, EHR/EMR data could be leveraged to improve patient care at the facility itself.