Background

Scientists may decide to use historical control data to fully or partially replace a concurrent control. First, when there are ethical concerns about recruiting patients to control arms in life-threatening diseases with no credible treatment options, an alternative source of control data is essential. Second, the challenges of developing drugs for underserved indications [1] may be partially ameliorated by historical controls: programs become somewhat more cost-effective, making drug developers more likely to invest. In particular, by reducing the required number of patients, historical controls may make enrollment in rare disease trials more feasible.

Historical data, including but not limited to historical controls, can also be used either at the study design phase to refine parameter estimates or for adaptation within a study [2]. An interesting example of the latter is the use of maturing phase II data to govern adaptations at the interim analysis of a phase III study [3]. In a phase III study with time-to-event (TTE) endpoints that take time to develop, interim analyses are usually performed using data from within the phase III study itself. In this conventional approach, one must either base the interim adaptation on a very small amount of immature data for the slowly developing definitive endpoint, or rely on a rapidly developing surrogate endpoint that correlates imperfectly with the definitive endpoint.

Alternatively, one may use historical data from a previously conducted phase II study to govern the adaptation within the phase III study, which allows the adaptation to be governed by a larger amount of mature definitive endpoint data [2]. This is one of many examples of using historical data for study design or adaptation. It differs from the other examples in this paper in two important respects: first, it involves historical data on the treatment effect, not just a historical control; second, the historical data are not part of the final confirmatory evidence package submitted for approval, which is limited to data from within the phase III study only.

As with any approach, the use of historical controls (HCs) is not always best; there are pros and cons. Researchers should therefore plan for simulation and thorough sensitivity analyses, and should interpret the results with caution. When HCs are part of the final confirmatory package, this paradigm should be reserved for carefully selected products and clinical conditions in which randomized controlled trials (RCTs) are not practical [1, 4, 5]. Using HCs in non-RCT phase III clinical trials requires robust justification and the early involvement of regulatory bodies [6].

Sources of HCs

In general, one searches for HC data on patients as similar as possible to those being enrolled in the study of interest. Similarity should go beyond standard baseline characteristics to include other considerations such as the healthcare environment, background therapy, changes in the standard of care, psychological effects, etc. Data may come from different sources of varying structure and quality, and each type of HC therefore carries its own biases and concerns. These must be accounted for when deciding to use HCs in a clinical trial [3, 7, 8]. Here we list the main sources of HCs in current practice, discuss the pros and cons of each, and give our recommendations.

Real world data (RWD)

Real world data can exist across a wide spectrum, ranging from observational studies within an existing database to studies that incorporate planned interventions, with or without randomization, at the point of care [9]. Medical charts, published data on off-label use, registries, and natural history studies are all examples of RWD. More broadly, RWD includes any data relating to patient health status and/or the delivery of health care that are routinely collected from a variety of sources [10]. Studies leveraging RWD can potentially provide information on a wider patient population than can be obtained through classic clinical trials.

An existing RWD source, however, may have inherent biases that limit its value for drawing causal inferences between drug exposure and outcomes. To mitigate potential bias, a study protocol and analysis plan should be created prior to accessing, retrieving, and analyzing RWD, regardless of whether the RWD have already been collected retrospectively or are to be collected prospectively. Protocols and analysis plans for RWD should address the same elements that a classic clinical trial protocol and statistical analysis plan would cover. When considering a prospective study design, one should assess whether the RWD collection instruments and analysis infrastructure are sufficient to serve as the mechanism for conducting the study, or whether they can be modified for such a purpose. Ultimately, if the sources of bias can be mitigated, RWD collected under a prospective study design may be used to generate or contribute to the totality of evidence regarding the control arm.

The increased use of electronic data systems in the healthcare setting has the potential to generate substantial amounts of RWD. By its nature, the quality of RWD can vary greatly across data types and sources. For the relevance and reliability of RWD, RWD sources, and the resulting analyses, see Appendix 1. Some of these sources are discussed further below.

Medical chart

It is a complete record of a single patient’s medical history, clinical data, and health care information at a single institution, possibly supplemented by records from prior institutions, although such supplementation is usually incomplete. It contains a variety of medical notes made by a physician, nurse, lab technician, or any other member of the patient’s healthcare team, as well as data on laboratory, diagnostic, and therapeutic procedures.

Even though the majority of older paper charts have been replaced by digital electronic health records (EHRs), the structure and quality of the data remain challenging to work with due to frequent missing data, inaccuracies, and the difficulty of extracting unstructured free-text information from medical notes, often the most informative component.

Even though concurrent patients can be selected, there is a possibility of introducing selection bias: several studies have shown that patients in clinical trials tend to have better outcomes than those seen in routine practice, possibly due to variability in the quality of routine practice across regions [11,12,13]. Regardless of the matching techniques available, it must be borne in mind that the characteristics clinicians use to select patients for clinical trials are subtle and difficult to quantify. To avoid this bias, it may be advisable to consider a pragmatic clinical trial, i.e. a trial in which the active arm data are also collected in a real-world practice setting. This increases the generalizability of the results at the expense of decreased accuracy, increased missing data, and the need to strictly minimize study complexity.

For ultra-rare diseases, medical charts have been used as the basis for drug approval, e.g. CARBAGLU (carglumic acid) for the treatment of deficiency of the hepatic enzyme N-acetylglutamate synthase (NAGS), the rarest of the urea cycle disorders (UCDs), affecting fewer than 10 patients in the U.S. at any given time and fewer than 50 patients worldwide. This drug was approved in March 2010 based on a medical chart case series of fewer than 20 patients and comparison to a historical control.

Patient registry

It is an organized system that uses observational methods to collect uniform data on specified outcomes in a population defined by a particular disease, condition, or exposure. At their core, registries are data collection tools created for one or more predetermined scientific, clinical, or policy purposes. Data entered into a registry are generally categorized either by diagnosis of a disease (disease registry) or by drug, device, or other treatment (exposure registry). There is also the option of using only data entered into the registry concurrently with the trial. This “concurrent registry” option minimizes the impact of time-dependent covariates, such as improvements in supportive care for the condition of interest, as well as stage migration, the process by which the increasing sensitivity of diagnostic techniques leads to less severely affected patients being classified into a given category over time [14, 15].

The interoperability of registries depends on the use of data standards. The quest for registry standards is complicated by the number of different registries, the variety of purposes they serve, and the lack of a single governing body for registries. Data standards are consensus specifications for the representation of data from different sources or settings; standardized data include specifications for data fields (variables) and value sets (codes) that encode the data within those fields. Part of the challenge for standards observance is that any given organization or registry project often perceives little immediate benefit or incentive to implement data standards: standards become vitally important when data are exchanged or shared, often benefiting a secondary user.

Even with standardization, missing data remain frequent in registries, as the culture of recording evidence and information in routine practice is not comparable to what is mandated in clinical trials. Appendix 2 summarizes the recommendations of the Clinical Trials Transformation Initiative (CTTI) for evaluating an existing registry [15,16,17].

Natural history (NH) trial

NH trials track the natural course of a disease to identify demographic, genetic, environmental, and other variables that correlate with the disease and its outcomes in the absence of treatment [18, 19]. In practice, however, these studies often include patients receiving the current standard of care, which may slow disease progression. NH trials differ from registries in that they can be designed specifically to collect comprehensive and granular data in an attempt to describe the disease, data that may be present only to varying degrees in a registry. These trials are usually designed to address the special problems of rare disease clinical trial design, where there may be little pre-existing information about natural history and preferred endpoints. The trials are uncontrolled and non-interventional. As such, existing NH trials are a valuable source of HCs.

If no pre-existing NH studies are available, initiating such studies has been recommended in parallel with the early stages of drug development, including the preclinical stage, prior to the initiation of interventional trials [20]. Inclusion criteria of NH studies should be broad, to allow characterization of the heterogeneity of the disease and the effect of covariates on outcome. In many cases, however, rare diseases are rapidly progressive or fatal and affect children. The ethical justification for conducting non-interventional trials prior to interventional trials under such circumstances may be questionable if it delays the interventional trials.

Instead, alternative approaches that involve adaptive clinical trial designs are recommended [21, 22]. In some cases, sponsors may have more incentive to address the rare disease directly than to conduct a non-interventional study; for instance, it may be difficult to convince a sponsor to fund an NH study without proof of concept for the therapy it is developing. Despite these issues, NH studies have been used successfully in some development programs to support approval of therapies (Appendix 3).

Completed clinical trials

Completed clinical trials of drugs with the same mechanism of action (MOA), or of the same drug in previous phases of development, are a great source of high-quality data, since the data were generated in a controlled environment. The control arm (placebo or standard of care, SOC) of a completed clinical trial can be treated as a source of control data for an indication [23, 24].

Current use of HCs in pivotal clinical trials and regulatory outcomes

Even though our focus is not limited to pivotal studies, we illustrate the current use of HCs for approvals by regulatory agencies (Table 1). The majority of confirmatory clinical trials using historical controls have indications in rare diseases, where there is no approved therapy to serve as SOC. Other common applications of HCs in the confirmatory setting are medical devices [9], label expansion, pediatric indications, and small populations such as biomarker subgroups. The ICH E10 guidance, adopted by FDA, accepts the use of external controls as a credible approach only in exceptional situations, in which the effect of treatment is dramatic, the usual course of the disease is highly predictable, the endpoints are objective, and the impact of baseline and treatment variables on the endpoint is well characterized. In the majority of the cases listed in Table 1, the treatment had a large effect size and/or addressed a life-threatening disease with no treatment options. The second example, however, was not an exceptional case like the others: it was a phase III trial that used data from phase II studies to determine the null hypothesis. The approvals of these drugs have been based on the totality of benefit-risk assessments, involving a qualitative judgment about whether the expected benefits of a product outweigh its potential risks. For example, EXONDYS 51 (eteplirsen), which treats Duchenne muscular dystrophy, was approved even though the agency initially raised doubts about results from a small sample of 12 patients and the advisory panel voted against approval.

Table 1 Examples of HCs used in Clinical Trials for Approvals by Regulatory Agencies

Minimizing the disadvantages of using HCs in clinical trials [25, 26]

As discussed in the Sources of HCs section, there may be potential biases in using HCs that have to be accounted for. In classical designs, even though randomization and blinding do not guarantee the complete elimination of unknown confounders, they reduce the chance of bias and dissimilarity among the arms of a clinical trial. In a non-randomized study, an external control group is identified retrospectively, which could lead to selection bias or a systematic difference among groups that affects the final outcome. These biases can come from dissimilarity in a wide range of factors: patient demographics may vary across populations; time-trend bias may occur when a control arm is chosen from patients observed at some time in the past, or for whom data are available through medical records; the SOC and concomitant medications may have improved over the years; the severity of disease associated with a particular stage designation may have decreased due to more sensitive diagnostic technologies (stage migration); and medical devices and the accuracy and precision of measurements may have improved. It has been shown that untreated patients in HCs have worse outcomes than a current control group in a randomized trial [27, 28], possibly due to more stringent inclusion and exclusion criteria in the trial, subtle selection bias, improvements in medical care, and potentially increased access to care. Moreover, assessment bias arising from the lack of blinding of the investigator, patients, or healthcare personnel may also affect the evaluation of subjective endpoints.

The literature mostly focuses on choosing an HC that matches the currently designed clinical trial and on reducing bias in the analysis. We recommend a different and more fundamental solution: if possible, start by choosing a high-quality HC for the indication and design the current study as closely as possible to the selected HC. This recommendation is consistent with Pocock’s criteria published in 1976 [29]: choose an HC from a recent previous study with the same a) inclusion/exclusion criteria, b) type of study design, c) well-known prognostic factors, d) study quality, and e) treatment for the control group; and finally, either use a concurrent HC or adjust for biases such as time-dependent biases at the analysis step to the extent feasible. For example, placebo control groups of concurrent clinical trials are preferred where possible. If a placebo group is unethical, consider selecting trials with active control groups that received the same SOC and design the trial as an add-on design using the same SOC, provided there is no pharmacological antagonism between the experimental therapy and the SOC and the combination has an acceptable toxicity profile. Otherwise, the rest of this paper provides guidance on selecting HCs to replace or augment a control arm in an existing clinical trial design, in case the trial cannot be redesigned to match the selected HCs closely.

It is of primary importance to improve the quality and usability of the data and thereby increase the feasibility of using HCs in clinical trials. We therefore recommend using standards as much as possible in HC data collection. A standard has been developed by the Clinical Data Interchange Standards Consortium (CDISC) [30] for regulated drug development submissions and drug safety. However, some challenges remain.

Missing data can have a substantial impact on registry findings, as on any clinical or observational study. It is important to take steps throughout the design and operational phases to avoid or minimize missing data; understanding the types of and reasons for missing data can help guide the selection of the most appropriate analytical strategy for handling them, and the assessment of the potential bias that such missing data may introduce.

Analysis methods can also help to minimize the disadvantages of using HCs in clinical trials. Several meta-analysis methods have been developed to analyze historical data for epidemiology research, benchmarking and comparison of drug efficacy for a specific indication, developing the safety profile of a drug or device, marketing, or hypothesis generation. Some of these methods are applied to summary data, and some require individual data. Traditionally, however, this wealth of information has not been used directly in the analysis of the current data, even though doing so could increase the accuracy and precision of the estimates of relevant endpoints. The challenges in doing so are to determine how similar the HCs are to each other and to the current data; what the optimal weight of the given HCs in the current analysis should be, in case of conflict between HCs or with the concurrent control data; and how to incorporate the HCs into the analysis with minimal bias.

Overall, these analysis methods can be classified as frequentist or Bayesian. While Bayesian methods seem flexible, in that the amount of historical information combined with the study information can be adjusted based on the similarity among the available datasets (HCs and concurrent data), they have their own challenges. It is noteworthy that combining information is not the same as “borrowing information”, since the inference is based on the combined information and not only on the study information with some borrowed information from HCs; nevertheless, all the listed methods use the term “borrowing”. There is no clear guidance on the selection of specific priors or on how dynamic borrowing can be applied while controlling the type I error rate (i.e. the probability of making a type I error) [31]. No one solution works in all situations. Our recommendation is to perform simulations and to create a benchmark (see Appendix 4 and the simulation section) that characterizes each method for comparison, to help with the decision-making process and, if using Bayesian methods, with the choice of a proper prior [32, 33].

Generally, the historical information is discounted in recognition of the enhanced uncertainty when combining information from the past. Dynamic borrowing controls the level of borrowing based on the similarity of the historical control to the concurrent data: it borrows the most when the past and current data are consistent and the least otherwise. If there are multiple HCs, the inter-trial variability is a measure of the suitability of the datasets for borrowing [34,35,36,37,38,39,40]. Static borrowing, on the other hand, sets a fixed level of borrowing that is not determined by the level of consistency among the past and current datasets. Control of the type I error rate when borrowing can be addressed by simulation, as discussed in the following sections. Recommendations for the use of simulations are given in the FDA draft guidance on adaptive designs released in 2018 [33].

Frequentist approaches

Frequentist approaches typically involve analyzing historical controls to estimate a threshold or parameter required for simulating a hypothetical scenario, or balancing the populations of different arms for a fair comparison. For details on frequentist approaches, please refer to Appendix 4. The approaches can be classified into two subgroups depending on whether individual data are available or only summary data. Mixed Treatment Comparison (MTC, or network analysis), Simulated Treatment Comparison (STC), and Matching-Adjusted Indirect Comparison (MAIC) work with summary data [41]. If individual data are available, the Propensity Score (PS) method is commonly used to match or stratify patients, to weight the observations by the inverse probability of treatment, or to adjust for covariates [42,43,44]. However, King et al. argue that matching based on the propensity score often increases imbalance, inefficiency, model dependence, and bias [45]. They claim that “The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods”.
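To make the PS idea concrete, the following is a minimal sketch in Python, using synthetic data and hypothetical covariates, of estimating a propensity score and reweighting a historical cohort toward the current trial population; it illustrates the general technique only, not the method of any particular cited study.

```python
# Minimal sketch: propensity-score weighting of a historical control
# cohort toward a current trial population (synthetic data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical baseline covariates (e.g. age, a biomarker) for the
# current trial (source = 1) and a historical cohort (source = 0).
n_cur, n_hist = 100, 300
X_cur = rng.normal([60.0, 1.0], [8.0, 0.3], size=(n_cur, 2))
X_hist = rng.normal([64.0, 1.2], [9.0, 0.4], size=(n_hist, 2))

X = np.vstack([X_cur, X_hist])
src = np.concatenate([np.ones(n_cur), np.zeros(n_hist)])

# Propensity score: probability of belonging to the current trial,
# given baseline covariates.
ps = LogisticRegression().fit(X, src).predict_proba(X)[:, 1]

# Odds weights reweight historical patients toward the covariate
# distribution of the current trial population.
w_hist = ps[src == 0] / (1.0 - ps[src == 0])
ess = w_hist.sum() ** 2 / (w_hist ** 2).sum()  # Kish effective size
print(f"effective historical sample size: {ess:.1f} of {n_hist}")
```

The effective sample size gives a quick check of how much historical information actually survives the reweighting.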

Threshold Crossing introduces a new framework for evidence generation [1]: the HCs are used to estimate futility and success thresholds, and a single-arm trial is then conducted using these estimates. The Bias-variance model suggested by Pocock [29] assumes a bias parameter with a specified distribution to represent the difference between past and present. Test-then-pool [37] compares the historical and concurrent control data for similarity at significance level α; if they are similar, it pools the HC with the concurrent control data, and otherwise it discards the HC. In other words, it borrows either none or all of the information from the HCs.
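A minimal sketch of the test-then-pool rule for a binary endpoint, assuming illustrative counts and a pre-specified similarity level α; the specific test used here (a chi-squared test of the 2×2 table) is our own choice for illustration.

```python
# Minimal sketch of test-then-pool for a binary control endpoint
# (illustrative counts; the test and alpha are assumptions).
import numpy as np
from scipy.stats import chi2_contingency

x_hist, n_hist = 42, 150   # historical control: responders / total
x_cur, n_cur = 12, 50      # concurrent control: responders / total

table = np.array([[x_hist, n_hist - x_hist],
                  [x_cur, n_cur - x_cur]])
_, p_value, _, _ = chi2_contingency(table)

alpha = 0.10  # pre-specified similarity level
if p_value > alpha:
    # Similar enough: pool the controls ("borrow all").
    rate = (x_hist + x_cur) / (n_hist + n_cur)
else:
    # Dissimilar: discard the historical control ("borrow nothing").
    rate = x_cur / n_cur
print(f"p = {p_value:.3f}; control rate estimate = {rate:.3f}")
```

The all-or-nothing behavior of this rule is exactly what the dynamic borrowing methods discussed below aim to smooth out.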

The Meta-analytic approach is a hierarchical modeling approach that uses a data model and a parameter model to infer the parameter of interest (e.g. the treatment effect). It is normally mentioned in the context of Bayesian approaches, but it can also be performed in a non-Bayesian way (see Appendix 4). Adaptive Design allows adaptations or modifications to different aspects of a trial after its initiation without undermining the validity and integrity of the trial [46]. This topic is beyond the scope of this paper; however, one flavor of such designs is Adaptive Borrowing, which adjusts the recruitment of a concurrent control based on evaluating the similarity of the HCs and the concurrent control at several interim analyses. The analysis can be performed using Bayesian or non-Bayesian approaches.

Bayesian approaches

Most Bayesian methods aim to discount or down-weight the historical control [34]. The power prior method constructs an informative prior for the current study: the product of an initial (non-informative) prior and the likelihood of the HC data under a given model, raised to a power α0 between 0 and 1. The resulting ‘posterior’ is used as an informative prior for the current study. α0 can be set to a fixed value (static borrowing) or based on the heterogeneity between the HCs and the current data (dynamic borrowing). As mentioned before, Adaptive Designs can also use Bayesian approaches for the analysis of the data.
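For a binomial control rate with a Beta initial prior, the power prior remains conjugate, so static borrowing reduces to adding α0-weighted historical counts to the Beta parameters. A minimal sketch, with all counts and the discount factor chosen purely for illustration:

```python
# Minimal sketch of a static power prior for a binomial control
# rate via beta-binomial conjugacy (illustrative numbers only).
from scipy.stats import beta

x_h, n_h = 42, 150   # historical control: responders / total
x_c, n_c = 12, 50    # current control: responders / total

a0, b0 = 1.0, 1.0    # vague initial prior Beta(1, 1)
alpha0 = 0.5         # fixed discount: historical patients count half

# Power prior = initial prior * (historical likelihood) ** alpha0,
# which is again a Beta distribution under conjugacy.
a_pp = a0 + alpha0 * x_h
b_pp = b0 + alpha0 * (n_h - x_h)

# Posterior for the control rate after observing the current data.
post = beta(a_pp + x_c, b_pp + (n_c - x_c))
lo, hi = post.interval(0.95)
print(f"posterior mean {post.mean():.3f}, 95% CrI ({lo:.3f}, {hi:.3f})")
```

With α0 = 0 this reduces to ignoring the historical data, and with α0 = 1 to full pooling.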

Meta-analytic or hierarchical modeling applies dynamic borrowing by placing a distribution on the degree of borrowing across the current and historical controls, with its variation controlled by the level of heterogeneity in the data. Meta-analytic Combined (MAC) and Meta-analytic Predictive (MAP) are two such approaches. MAC performs a meta-analysis of the historical data and the current trial data together and infers the parameter of interest at the end of the new trial; it can be either non-Bayesian or Bayesian, using a non-informative or vague prior to estimate the MAC model. MAP is a fully Bayesian approach: it derives a “MAP prior” from the historical data, i.e. a predictive distribution that expresses the information in all HCs, and then combines the MAP prior with the current trial data to obtain a posterior distribution of the parameter of interest (such as the treatment effect) [see Appendix 4] [34]. The Offset approach, by contrast, recommends using simulation and graphical tools to identify a range of plausible values for the true mean difference between historical and current control data; it argues that an estimate of bias based on a comparison between the historical and new data gives little guidance on the value to be chosen for discounting the historical data, and claims that without a strong assumption regarding its relevance, the HC will be almost completely discounted [47]. It should be pointed out that differences between the current trial and the HCs could be due to pure sampling chance, especially for rare diseases with small sample sizes, and ignoring the HC information might itself lead to bias; thus, additional scientific input is warranted.
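To illustrate the MAP idea, the following is a minimal sketch of deriving a MAP prior by Gibbs sampling in a normal-normal hierarchical model; the historical estimates, the inverse-gamma hyperprior on the between-trial variance (chosen here for conjugacy; half-normal priors are a common alternative), and all numbers are illustrative assumptions.

```python
# Minimal sketch: MAP prior from a normal-normal hierarchical model
# via Gibbs sampling (synthetic historical estimates; the inverse-
# gamma hyperprior on tau^2 is an assumption made for conjugacy).
import numpy as np

rng = np.random.default_rng(1)

# Historical control estimates (e.g. log-odds) and standard errors.
y = np.array([-1.1, -0.9, -1.3, -1.0])
s = np.array([0.20, 0.25, 0.15, 0.30])
k = len(y)

a_t, b_t = 1.0, 0.05          # inverse-gamma hyperprior on tau^2
n_iter, burn = 10000, 2000
mu, tau2 = y.mean(), 0.1      # starting values
map_draws = np.empty(n_iter - burn)

for it in range(n_iter):
    # Study-level means theta_i | mu, tau2, data (conjugate normal).
    prec = 1.0 / s**2 + 1.0 / tau2
    theta = rng.normal((y / s**2 + mu / tau2) / prec, np.sqrt(1.0 / prec))
    # Overall mean mu | theta, tau2 (flat prior on mu).
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / k))
    # Between-trial variance tau2 | theta, mu (conjugate update).
    tau2 = 1.0 / rng.gamma(a_t + k / 2.0,
                           1.0 / (b_t + 0.5 * ((theta - mu) ** 2).sum()))
    if it >= burn:
        # MAP prior = predictive distribution of a NEW control mean.
        map_draws[it - burn] = rng.normal(mu, np.sqrt(tau2))

print(f"MAP prior mean {map_draws.mean():.3f}, sd {map_draws.std():.3f}")
```

The draws of a new study's control mean form the MAP prior, which can then be combined with the current trial's likelihood (in practice, often after approximating the draws by a mixture distribution).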

Sensitivity analysis

One crucial step in the proper use of HCs is performing sensitivity analyses, beyond what FDA has proposed for the primary analyses of randomized clinical trials [48], to assess the robustness of the results. These analyses should interrogate any assumption or criterion that could cause variation in the final result when HCs are a component of a clinical trial: the impact of cohort selection (to address selection bias); the impact of missing data on key results; the method of matching patients with the historical controls; the level of borrowing under different priors and methods (e.g. power prior vs. mixture prior); the alteration of weighting factors under any weighting scheme; the impact of excluding some historical controls when more than one is used; any unmet components of Pocock’s requirements for including a historical control; and the heterogeneity of the historical controls with each other and with the concurrent control. Sensitivity analysis provides the full spectrum of potential truth in the presence of biases [49].

Decision making

Quantitative methods for extrapolating from existing information to support decision-making are addressed in a recent EMA draft guidance [50]. Here we propose a decision-making tool (Fig. 1: Decision Making Diagram for using HCs in Clinical Trials): a procedure for planning, conducting, analyzing, and reporting studies that use HCs as a replacement for, or supplement to, a concurrent control arm, while maintaining scientific validity.

Fig. 1 Decision Making Diagram for using HCs in Clinical Trials

Simulation role

Clinical trial simulation plays a critical role in the drug development process by quantifying and evaluating design operating characteristics and possible decisions in the face of uncertainty [51]. It is also an important tool for selecting methods for bias control and for performing sensitivity analyses when an HC is used in a trial. The operating characteristics of the design are of interest for both Bayesian and frequentist approaches in order to assess the sensitivity of the decision-making process; for Bayesian approaches, a sensitivity analysis of mismatch with the historical control is of particular interest. How to design, perform, and report such simulation studies deserves further attention and standardization [52].

Dejardin et al. provide an example of clinical trial simulation applied to compare the performance of different Bayesian approaches to borrowing information from HC data [53]. The case study was a phase III comparative study of a new antibacterial therapy against Pseudomonas aeruginosa. Since the target population, patients with ventilator-associated and hospital-acquired pneumonia, was small, an HC became an attractive option. After identifying a non-inferiority combination design, mortality at 14 days as the primary endpoint, and a maximum of 300 subjects in total, the authors simulated trial data based on specific knowledge, e.g. mortality rates in randomized controlled studies. To reduce the impact of uncontrolled factors on HC validity, dynamic Bayesian borrowing was implemented to control the weights given to compatible and incompatible HC data. Bayesian approaches with three different power priors were compared with regard to two operating characteristics: limited type I error inflation and increased power. The simulation predicted comparable performance among the three methods, so the authors recommended the one that was easiest to implement.

This example illustrates best simulation practices and the benefits of applying them in the context of HCs. Best practices for simulation and reporting have recently been addressed by the DIA ADSWG [52]. The recent FDA draft guidance on adaptive designs also gives recommendations on clinical trial simulations [33], noting that simulations can be used to estimate not only basic trial operating characteristics, e.g. type I error probability and power, but also other relevant characteristics of more complex adaptive designs, such as the expected sample size, the expected calendar time to market or to study completion, and the bias in treatment effect estimates. The whole process is standardized in the diagram below (Fig. 2: Clinical Trial using HCs Simulation Process), which can be generalized to any clinical trial simulation.

Fig. 2 Clinical Trial using HCs Simulation Process

Simulation can also be used to compare different methods and choose the “best” model for a specific design. That is, one should compare the methods’ most effective regions, or “sweet spots”: the ranges in which borrowing from historical controls (using frequentist or Bayesian approaches) yields lower MSE, lower type I error, and higher power than the alternatives [54]. Borrowing reduces the type I error and increases the power of the study when the true current control rate is close to the historical observed rate; this is intuitive, as we are then borrowing information nearly identical to the true current value. As the true current control rate diverges from the observed historical data, we incur reduced power (in one direction) and inflated type I error (in the other). Assessing the magnitude and relative likelihood of these costs against the possible benefits is the key issue in determining whether historical borrowing is appropriate in a given setting. In assessing a borrowing method in terms of MSE, type I error, and power, we can answer several questions. First, how broad is the sweet spot and how much does power increase? The larger the region where borrowing dominates a separate analysis, the more appealing the method. Second (and most important for potential confirmatory trials), where does type I error inflation occur and how large is it? Third, how much power is lost when the true control rate is far below the historical data? These three regions (reduced power, sweet spot, and inflated type I error) frame the decision of how and whether to adopt borrowing.
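A minimal sketch of how these operating characteristics can be computed by simulation for fixed power-prior borrowing on a binary endpoint, sweeping the true control rate across a grid; the design constants, decision threshold, and discount factor are illustrative assumptions, not a validated benchmark.

```python
# Minimal sketch: type I error and power of fixed power-prior
# borrowing across a grid of true control rates (all design values
# are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(2)
x_h, n_h = 60, 200           # historical control: observed rate 0.30
n_c, n_t = 50, 50            # current control / treatment arm sizes
alpha0, delta = 0.5, 0.20    # discount factor and target effect size
n_sim, n_post = 1000, 2000   # trial replicates / posterior draws

def success(x_t, x_c):
    """Declare success if P(p_trt > p_ctl | data) > 0.975, with a
    power prior on the control arm and flat priors elsewhere."""
    p_t = rng.beta(1 + x_t, 1 + n_t - x_t, n_post)
    p_c = rng.beta(1 + alpha0 * x_h + x_c,
                   1 + alpha0 * (n_h - x_h) + n_c - x_c, n_post)
    return (p_t > p_c).mean() > 0.975

for p_true in [0.20, 0.25, 0.30, 0.35, 0.40]:
    t1e = power = 0
    for _ in range(n_sim):
        x_c = rng.binomial(n_c, p_true)
        # Null: treatment rate equals the true control rate.
        t1e += success(rng.binomial(n_t, p_true), x_c)
        # Alternative: treatment improves the rate by delta.
        power += success(rng.binomial(n_t, p_true + delta), x_c)
    print(f"true control {p_true:.2f}: type I error {t1e / n_sim:.3f}, "
          f"power {power / n_sim:.3f}")
```

Plotting these two curves against the true control rate reproduces the kind of sweet-spot visualization described below, and repeating the sweep for other methods lets the regions of reduced power and inflated type I error be compared directly.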

Fig. 3 shows a useful plot that can be generated via simulation to compare the sweet spots of different methods. This visualization shows type I error (lower curves, right axis) and power (upper curves, left axis) as a function of the closeness of the event rate in the historical data to the true control event rate (x-axis). Different line styles show different methods. Borrowing behavior tends to be ‘flatter’ for hierarchical models (blue curves), which borrow moderately over a long range while still displaying dynamic borrowing (borrowing is reduced when the true control rate is far from the historical data). This moderate, long-range borrowing is also reflected in a type I error inflation with a lower slope than other methods (although it still reaches reasonably high values). Generally, the sweet spot of improved type I error and higher power extends farther down (to values under 0.65) than for other methods.

Fig. 3 Visualization of Comparison of the “Sweet Spots” of Methods via Simulation

Finally, documented software (code) for all simulations should be made accessible to all stakeholders for reproducibility. However, if simulations are computationally intensive or the code is complex, it will be challenging to independently verify the results [52].

Summary of recommendations

In summary, to choose a proper, high-quality HC that yields credible results in a non-randomized clinical trial, the following steps are necessary to reduce selection bias when the study design cannot be altered to mimic the selected HCs with regard to time trends, population, changes in SOC, logistics, and other risk factors. The important prerequisite for choosing a proper HC is a clear understanding of the disease, detailed characteristics of the affected population, a precise definition of the endpoint, and a clear definition of the diagnosis in terms of what has to be measured and how. It is also very important to know how to compare the measurements between the treatment groups while addressing intercurrent events such as rescue medication, drop-outs, death, non-adherence, and non-compliance. In other words, there has to be a well-defined and precise scientific question or objective and an analytic plan up-front, before picking a historical control, so that selection bias and ad-hoc analyses are minimized.

In general, choosing a simple endpoint reduces the complexity and subjectivity of the assessment and increases the precision of measurement and the clarity of treatment efficacy. First, systematically review the literature for well-organized and well-maintained completed RCTs, natural history studies, RWD, and registries, to assess the reproducibility of the existing evidence; guidance from systematic reviews and meta-analyses can be helpful. Select an appropriately similar HC or HCs for the experimental setting, and explore and seek to understand the similarities and differences among the sources. It is recommended to choose more than one HC, from different sources, to capture the heterogeneity of the treatment and the disease in different populations, which can give an estimate of the treatment effect in the real world.

Second, select an applicable approach to account for biases in the study design. Even with a well-thought-out design and proper selection of HCs, we may not be able to eliminate all the biases arising from differences between the concurrent treatment group and the HCs. Thus, we still need to consider adjusting for these biases by varying the level of information borrowing using Bayesian or frequentist methodologies (see the Frequentist and Bayesian approaches sections and Appendix 4); if there are several HCs, for example, one might give different weights to the HCs based on the variability or quality of the data. Adjust for selection bias by adjusting for patient-level covariates if detailed data are available [55,56,57,58,59]; otherwise, use summaries of baseline covariates. Use matching based on distance metrics (e.g. Euclidean distance) for direct standardization, or use the propensity score for stratification or covariate adjustment for indirect standardization in observational studies [60, 61]; a matching sketch is given below. Refer to the Encyclopedia of Biostatistics for specialized methods for time-to-event endpoints [62]. Address assessment bias due to the lack of blinding and randomization by recruiting a much smaller concurrent group of patients receiving placebo or SOC when using HCs, if possible, and/or by having several blinded assessors evaluate the endpoint to estimate the difference in treatment effect between the arms.
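As a concrete illustration of distance-based matching, the following is a minimal sketch (with synthetic covariates) of 1-nearest-neighbor matching of historical controls to current patients by Euclidean distance on standardized covariates; a real application would add calipers and rules for ties and for reuse of controls.

```python
# Minimal sketch: 1-nearest-neighbor matching of historical controls
# to current patients on standardized covariates (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
X_cur = rng.normal([60.0, 1.0], [8.0, 0.3], size=(40, 2))    # current
X_hist = rng.normal([64.0, 1.2], [9.0, 0.4], size=(200, 2))  # historical

# Standardize so each covariate contributes comparably to distance.
mu, sd = X_hist.mean(axis=0), X_hist.std(axis=0)
Z_cur, Z_hist = (X_cur - mu) / sd, (X_hist - mu) / sd

# Euclidean distance from every current patient to every historical
# control; match each current patient to the closest control.
d = np.linalg.norm(Z_cur[:, None, :] - Z_hist[None, :, :], axis=2)
matches = d.argmin(axis=1)
print("first ten matched historical indices:", matches[:10])
```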

Third, using simulation, plan for extensive sensitivity analyses to demonstrate the robustness of the results and to determine the weaknesses and strengths of the analyses by varying the assumptions, especially if there is no concurrent control group and the comparison is done solely against HCs [1]. Fourth, for any innovative approach, it is highly recommended to include regulatory bodies at the design stage to discuss and explore potential concerns and issues together; this increases the quality of the evidence for a potential future submission, as regulators' familiarity with the approach helps them guide the scientists in designing the study properly. Fifth, document the nature of the historical and/or concurrent placebo control groups used in the study: explain why and where these control groups are needed; explain how the external control groups were chosen and what the similarities and differences are with the current study design and population; specify how these differences have been addressed in the design and the analyses; and report the results of the primary and sensitivity analyses with caution if there is no concurrent control group and the comparison is done solely against HCs.

However, we acknowledge the difficulty of addressing the above points for rare diseases. An alternative option in a rare disease for which none of these resources are available is to prospectively design and implement a well-thought-out registry or to conduct a natural history study, which may require collaboration with external entities on a larger scale and may delay therapeutic studies. In such cases, it may be preferable to run a randomized therapeutic study rather than a natural history study in which no one receives therapy. Adaptive designs, such as the “informational design” that allows adaptation at the end of the study, may be useful for rare diseases where the information needed to design a therapeutic study is sparse [19,20,21].

Conclusion

FDA is embracing the use of RWD to open up opportunities for more resourceful and innovative approaches, especially in orphan and rare diseases [63]. Consequently, pharmaceutical companies are changing their traditional mindset by developing new hybrid models for integrated decision making. In this evolving era, pharmaceutical and regulatory bodies may be more open to the use of HCs as a replacement for, or supplement to, a concurrent control when a concurrent control is unethical or impossible, or when the condition under study makes enrollment difficult. HCs introduce multiple biases compared with a concurrent randomized control; however, these biases can be minimized by the careful choice of HCs, by designing the study to mimic the HC conditions, and by numerous other methods for the design, conduct, and analysis of the studies. With careful attention to these approaches, we can apply HCs when they are needed while maximizing scientific validity to the extent feasible.