Background

The decision to start a clinical trial to investigate a new drug or medical device is informed by preclinical studies to evaluate efficacy and safety. Depending on the medicinal product, some types of testing like toxicology studies are regulated and mandatory before moving from bench to bedside; others are specific to the disease, drug, and/or (animal) model. Here, we focus on preclinical efficacy studies where fewer regulatory prescriptions apply. The ultimate goal of such studies is to make knowledge claims [1]. Articulated on different effect levels, these include for example the claim of a specific role for a protein in a physiological process, or that an intervention will cure or slow the progression of a disease. To arrive at a knowledge claim, preclinical studies are performed in a stepwise approach. Hypothesis-generating exploratory studies evolve along a continuum through within-lab replications to knowledge-claiming confirmation. During this process, investigators need to continuously re-evaluate premises and refine study designs to increase validity and reliability. This includes defining Go/No-Go criteria for further studies already in the early stages [2]. When it comes to detailed guidance for this transition process, information on planning, conducting, analyzing, and evaluating confirmatory studies in preclinical research is scarce. The need for such guidance is emphasized by recent initiatives investigating evidence from single studies, for example in cancer biology, that find a substantial number of experiments that do not replicate. That is, effect sizes are substantially lower than in the original study and results are no longer significant [3]. Whereas this is not unexpected, and science has the potential to self-correct, efficient strategies need to be devised to foster translation into the clinic and generate patient benefit. This includes the essential questions of when and how to conduct a confirmatory study.

To close this gap, biostatisticians, preclinical scientists, clinicians, and meta-researchers held a workshop to discuss the aforementioned issues for preclinical multicenter confirmatory studies (see Figure S1 for the composition of workshop participants). Whereas the collaborative conduct of a study by more than one independent study site using shared protocols is common practice in clinical trials, this is a rather recent approach in the preclinical context [4]. Herein, most participating researchers currently conduct confirmatory studies funded by the German Federal Ministry of Education and Research [5]. Importantly, investigators aim to confirm their own previous exploratory research findings and underlying knowledge claims in a preclinical multicenter setting. Generated evidence should inform decisions to start a clinical trial. To develop guidance for conducting confirmatory studies, we have reviewed and discussed current approaches to identify what strength of evidence is needed before engaging in a confirmatory study and how evidence generation can be optimized in a confirmatory study concerning the knowledge claim. In this report, we will present suggestions from a transdisciplinary perspective and highlight open questions and opportunities for further research.

Main text

Towards robust evidence

For the decision to proceed to confirmatory experiments, criteria need to be defined a priori. These criteria reflect the evidence gathered so far and address the necessarily high uncertainty and possible bias of exploratory experiments. To evaluate robustness of evidence, two factors are of main importance: reliability and validity. Reliability refers to the characteristics of a result that reflect the level of replicability measured for example by effect size precision or statistical significance. Importantly, a reliable experiment is not necessarily valid as results might be replicable and still not reflect the underlying postulated mechanism. For this, experiments also need sufficient validity to substantiate the knowledge claim. Here, we recommend minimum criteria for validity and reliability to support the decision to conduct a confirmatory study.

Minimum reliability and validity criteria

In exploratory studies, low sample sizes often threaten the reliability of results. Two factors contribute to this. First, significant results do not necessarily reflect the existence of a biologically relevant effect. Second, even if they do, the estimated effect size will tend to overestimate the actual effect. To understand the first issue, consider a set of scientific hypotheses that are experimentally tested. Some of these will reflect an underlying biologically relevant effect whereas others will not. The probability of detecting a relevant effect depends strongly on the sample size. Low sample sizes, as frequently seen in preclinical experiments, and the resulting low statistical power decrease detection rates for these relevant effects [6, 7]. Additionally, and inherent to statistical test procedures, experiments also produce false positives, usually in 5% of all cases in which a biologically relevant effect does not exist. This results in a dilution of the small number of identified relevant effects by several false positives. That is, a significant finding derived from a low sample size experiment is at an increased risk of not reflecting a true cause-effect relationship. The second effect caused by low sample sizes is inflation of effect sizes for significant results. This so-called winner’s curse is elicited by the applied p-value filter, wherein only large experimental effect sizes yield significant results in low-powered experiments [8]. That is, even if experiments detect relevant effects, the effect estimate carries a risk of inflation.
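Both issues can be illustrated with a small simulation. The following is a minimal sketch, not part of the workshop material: it assumes a two-group design with a known standard deviation of 1 (so a z-test can be used), a true standardized effect of 0.5, and 10 animals per group. The function names and all numeric defaults are hypothetical choices for illustration. The second function uses the standard relation between power, false positive rate, and the assumed share of true hypotheses to show the dilution by false positives.

```python
import random
from statistics import NormalDist


def simulate_winners_curse(true_d=0.5, n_per_group=10, n_sim=5000,
                           alpha=0.05, seed=1):
    """Simulate two-group experiments (assumed known SD = 1) and compare the
    mean effect estimate of all runs with that of the significant runs."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided z-test
    se = (2 / n_per_group) ** 0.5                  # SE of the mean difference
    all_estimates, significant = [], []
    for _ in range(n_sim):
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        all_estimates.append(diff)
        if abs(diff) / se > z_crit:                # passes the p-value filter
            significant.append(diff)
    mean_sig = sum(significant) / len(significant) if significant else None
    return {
        "power": len(significant) / n_sim,
        "mean_all": sum(all_estimates) / n_sim,
        "mean_significant": mean_sig,
    }


def positive_predictive_value(power, alpha, prior):
    """Share of significant results that reflect a real effect, given the
    assumed share `prior` of tested hypotheses that are actually true."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)


result = simulate_winners_curse()
print(result)
print(positive_predictive_value(power=0.2, alpha=0.05, prior=0.1))
print(positive_predictive_value(power=0.8, alpha=0.05, prior=0.1))
```

Under these assumptions, the average estimate across all simulated experiments stays close to the true effect, whereas the average among significant experiments is clearly larger. The prior of 0.1 is purely illustrative; the qualitative pattern (lower power, lower share of true positives among significant findings) holds for other values as well.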

Consequently, when deciding whether to conduct a confirmatory study, the inflation of effect sizes and limitations of the p-value [9] need to be considered. If uncertainty about effect estimates is still high, within-lab replications could be a viable way to substantiate exploratory findings (see section Within-lab replications as a road to rigorous evidence). Alternatively, and similar to clinical trials, investigators can a priori determine a smallest effect size of interest that reflects biological or clinical relevance to argue for a specific mechanism of action or to predict the efficacy of an intervention, respectively. Such a lower bound could be informed by published effect size distributions, discussion with clinicians about viable clinical effects, and/or available resources that will only allow for a certain minimal effect size to be detected [10]. This discussion should involve biostatisticians and biomedical researchers who need to set decision-critical a priori criteria (e.g. smallest effect within confidence interval (CI) of exploratory study estimate) for progression to the next phase of experiments.
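As an illustration of the example criterion mentioned above (smallest effect size of interest within the CI of the exploratory estimate), the following minimal sketch computes an approximate CI for a standardized mean difference and checks whether a chosen threshold falls inside it. It assumes the common large-sample approximation for the standard error of a standardized mean difference and a normal-based interval. The function names, the example estimate, and the thresholds are hypothetical; the actual decision rule has to be agreed on a priori by the investigators.

```python
from statistics import NormalDist


def cohen_d_ci(d, n1, n2, level=0.95):
    """Approximate CI for a standardized mean difference d
    (large-sample standard error, normal quantiles)."""
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def sesoi_within_ci(d, n1, n2, sesoi, level=0.95):
    """True if the smallest effect size of interest lies inside the CI."""
    lower, upper = cohen_d_ci(d, n1, n2, level)
    return lower <= sesoi <= upper


# Hypothetical exploratory result: d = 1.2 with 8 animals per group.
print(cohen_d_ci(1.2, 8, 8))
print(sesoi_within_ci(1.2, 8, 8, sesoi=0.5))
```

With such small groups the interval is wide, which reflects how little a single exploratory estimate constrains the plausible effect.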

Regarding validity, the minimum set of criteria [11, 12] spans mainly three domains: internal, external, and translational validity. A high degree of internal validity is necessary already in the early stages. This not only includes measures to reduce the risk of bias such as randomization [13] and blinding [14], but also the use of validated methods that measure outcomes with low bias and high accuracy [15] (Table 1). To promote generalizability of results beyond the single experiment, external validity needs to be increased, for example by investigating or systematically introducing sources of variation through systematic heterogenization. This can be achieved by varying genetic and/or environmental conditions, for example, by testing immune-competent animal models instead of specific-pathogen-free (SPF) immunocompromised strains [16, 17] or by the introduction of environmental variation in a multicenter approach. To what extent this is necessary and feasible already in exploratory stages is an open question. Another powerful tool that adds to external validity is triangulation, where different methods and approaches are combined to support the same claim. If different methods yield converging evidence, validity of generated evidence increases at the potential cost of adding complexity to a study design [18]. Additionally, within-lab replications potentially increase external validity (see section Within-lab replications as a road to rigorous evidence). As the ultimate goal of these experiments is clinical translation, factors that are diagnostic for the human case need to be considered and outcomes defined to facilitate interpretation in the clinical context (translational validity). In particular, (animal) models should reflect targeted aspects of human disease and converging evidence from different methods and contexts.
We also recommend investigating the bioavailability of the drug before or very early in the confirmatory stage, which ideally includes pharmacokinetics. Here, dose-finding experiments should be performed before a large multicenter confirmation to either start with a predefined dose or at least narrow the dose down to a small range. Other factors are less critical for the decision to continue with a confirmatory study. For example, testing clinically relevant biomarkers and route of administration can be part of complementary experiments in the confirmatory phase. Those complementary experiments might be exploratory or considered flanking experiments to strengthen the evidence.

Table 1 Minimum criteria that need to be fulfilled/considered before starting a preclinical confirmatory multicenter trial. Best practices are based on existing (reporting) guidelines and sketch the ideal situation. However, there can be practical limitations that hinder e.g., blinding or randomization

Within-lab replications as a road to rigorous evidence

If the minimum criteria (as presented in Table 1) are not met with the first exploratory study, replication experiments potentially serve as a powerful validation tool before conducting a larger (multicenter) study. In this context, within-lab replications or mini-experiments [23] with refined experimental design and improved internal as well as external validity (the latter by considering batch effects) will be valuable. Moreover, refined animal models generate evidence to assess translational potential in this early-stage replication, e.g., by moving from a low-complexity cell line-based xenograft cancer mouse model to a patient-derived xenograft model [24]. Using material from varying donors (if available), patient-derived models might enable the evaluation of different responses by better mimicking the clinical heterogeneity of the disease [25, 26]. In this context, companion diagnostics might be used to quantify the accumulation of a molecule at the target site or assess a hormone or receptor status to predict treatment outcomes or stratify subjects [27, 28].

Exact within-lab replications might also be used to increase the reliability of the results via increased sample size and/or increasing the number of (smaller) batches [29]. This will decrease outcome uncertainty and aid in sample size planning for confirmatory studies. Ethical constraints, e.g. regarding studies including large animals, potentially prohibit stand-alone exact replication experiments. However, a replication study might be integrated as a positive or negative control group into the experimental design of a new exploratory study.

Ideally, exploration and within-lab replication studies have the potential to reveal effect modifiers, confounders, and colliders. This may require adjustment of experimental design, for example by including an estimate of drop-out rate either due to the animal model or due to the intervention that affects sample size planning. Information on such covariates can then lead to a refinement of e.g. the randomization scheme if body weight is affecting the outcome of a study. In this example, to control for the variation in body weight, the experiment could be split up into smaller blocks and interventions would be randomized to experimental units within each weight block. It can also support the selection of Go/No-Go decision points before confirmation. Finally, the decision about the transition from exploration to confirmation needs to include all stakeholders including preclinical and clinical researchers as well as biostatisticians.
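The body weight example can be sketched as a randomized block design. The following is a minimal illustration, not a prescribed procedure: animals are sorted by weight, cut into blocks whose size equals the number of treatments, and treatments are shuffled within each block. The function name, the animal identifiers, the weights, and the treatment labels are hypothetical.

```python
import random


def block_randomize(animals, treatments, seed=None):
    """Randomize treatments within body-weight blocks.

    animals: list of (animal_id, body_weight) tuples.
    Returns a dict mapping animal_id -> (block_index, treatment).
    """
    k = len(treatments)
    if len(animals) % k != 0:
        raise ValueError("number of animals must be a multiple of the "
                         "number of treatments")
    rng = random.Random(seed)
    ordered = sorted(animals, key=lambda a: a[1])   # lightest first
    allocation = {}
    for start in range(0, len(ordered), k):
        block = ordered[start:start + k]            # animals of similar weight
        order = list(treatments)
        rng.shuffle(order)                          # random order per block
        for (animal_id, _weight), treatment in zip(block, order):
            allocation[animal_id] = (start // k, treatment)
    return allocation


weights = [22.1, 25.3, 19.8, 24.0, 21.5, 26.2, 20.4, 23.7]
animals = [(f"m{i}", w) for i, w in enumerate(weights)]
for animal_id, (block, treatment) in sorted(
        block_randomize(animals, ["vehicle", "drug"], seed=42).items()):
    print(animal_id, block, treatment)
```

Because every block contains each treatment exactly once, the treatment groups end up with similar weight distributions, and the block can later be included as a factor in the analysis.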

Engaging in a confirmatory multicenter study: reality check

Irrespective of the generated evidence from an exploratory study, feasibility needs to be evaluated to decide whether a multicenter decision-enabling experiment should be conducted. This evaluation includes practical constraints such as available resources (can increased animal numbers be handled?) or ethical approval (replication experiments as an area of tension [30, 31]), and medical need. According to the animal welfare act and Directive 2010/63/EU of the European Parliament [32], an animal experiment can only be justified if it generates new knowledge and if that knowledge outweighs the harm for the animals [33]. Thus, confirmatory studies need to go beyond exact replications and generate diagnostic (= decision enabling) evidence about a knowledge claim [30, 34, 35]. In general, exploratory studies provide only preliminary evidence. Building on such initial findings, confirmatory studies allow generalization beyond specific experiments, gathering support for the underlying knowledge claim. For this, investigators need to ensure that validity and scientific rigor are preserved at a high level throughout the preclinical research trajectory (Fig. 1).

Fig. 1

Simplified illustration of a preclinical research trajectory, starting from exploration towards confirmation, considering robust study design and minimum validity criteria to finally engage in a confirmatory multicenter study. At some steps a decision is required (loupe) whether to proceed with the study (check mark, blue), whether refinement of e.g., experimental design or (animal) model (question mark) is needed, or whether (a priori defined) No-Go criteria were met to stop an experiment completely (red X). A robustness check after exploration should be used to decide if a within-lab replication is required before the multicenter confirmation. If minimum validity and reliability criteria are already met during exploration, a multicenter study might be planned without further in-house replication. Icons in dark blue (Generated evidence, within-lab replication, and study plan of the multicenter confirmatory study) highlight the focus areas of this review

Optimization of evidence generation during confirmation

The goal of the (multicenter) confirmatory study is to support a knowledge claim and potentially inform the decision to move to the clinic. Again, a clear a priori definition of Go/No-Go decision points and clearly defined primary and secondary outcomes are indispensable. Other parts of the planning process are less generalizable (Fig. 2). Some of these aspects are beyond the scope of this manuscript and we will solely focus on biometry-related issues or practical constraints/aspects (v-vii) (Fig. 2).

Fig. 2

Steps involved in planning a confirmatory study. Highlighted (pink) aspects (v – vii) are discussed in more detail. Whereas other aspects (grey, i – iii and viii – ix) are equally important, they are beyond the scope of this manuscript and are solely mentioned in the context of experimental design, standardization and/or sample size calculation. Highlighted in purple (iv) is the choice of participating study centers. Here, skills and expertise as well as training requirements are important selection criteria. Blue notes cannot be assigned to only one aspect, but need to be considered at different stages, e.g., sample size calculation as well as the analysis should be based on the experimental unit

Protocols, standardization and systematic heterogenization

One important step in conducting multicenter studies is harmonization of protocols (Fig. 2 (i, v)). In this process, involved laboratories need to decide on which aspects of the experimental protocols need standardization and which will systematically vary between centers. Important aspects that need to be standardized and quality controlled include the treatment scheme to ensure comparable dosage and the same quality of the drug. Additionally, quality control measures identified through initial baseline studies are recommended. A comparison of outcomes from control groups for example can identify potential problems between centers early on. Knowledge about center variability and information on factors that influence variance of results can be gained by introducing systematic heterogenization. This includes comorbidities and the use of both sexes [36, 37]. The latter is considered a minimum requirement in a confirmatory approach except for sex-related diseases like prostate cancer or in case of well-grounded arguments.

Heterogeneity will also be introduced by each study center. One naturally occurring source of variation is the different experimenters themselves. However, the latest literature indicates that this is less of an issue, particularly if all involved parties are well trained [38, 39]. To assess replicability of results across centers, a low number of centers is already sufficient: a minimum of two participating laboratories may suffice, and the added value of additional laboratories decreases rapidly [37]. A small number of centers precludes, however, estimation of between-center heterogeneity. Here, strategies need to ensure that centers actually can be jointly analyzed. Concerning animal experiments, husbandry conditions including food, temperature and cage mates will most likely vary between centers and laboratories and need to be considered if those affect the outcome [20].

Primary outcomes should be complemented by evidence from other sources. Here, selection of partner laboratories can also be based on such complementary methods and approaches. When developing drugs to treat a symptom associated with multiple diseases, additional animal models can increase the external validity and predictability of translational success. Examples include, but are not limited to, the numerous existing animal models of neuropathic pain, which can e.g., be chemotherapy-induced, emerge from cancer pain, or be mechanically induced (sciatic nerve injury) [40, 41]. Another example is patient-derived 3D cell cultures, which help to gain a deeper understanding of underlying mechanisms and to capture effects only seen in human cells. By increasing the number of donors or models to support a research claim, the validity of an observed effect can be increased (triangulation [42]). For studies that aim at clinical translation, translational validity should be improved by including (several) biomarkers or other diagnostic tools [43, 44] in the analysis and/or experimental design. For drug efficacy testing, control groups in the confirmatory study should include a competitor drug, i.e., the clinical standard treatment, and/or other negative and/or positive control groups. Researchers should be in close contact with regulatory authorities early on to ensure that experiments already incorporate requirements for approval. To avoid increasing the sample size by additional positive and negative control groups, it can be feasible to consider historical cohorts [45, 46] or an unbalanced design [47, 48] with smaller but more control groups (multi-arm design) that can be pooled. The latter two points led to extensive discussions between the authors and should thus be viewed as controversial [49].

Sample size calculation for confirmatory studies

The basis for sample size calculation is the anticipated effect size, which can be defined in various ways [50, 51]. Herein, we refer to effect size as a mean difference divided by a measure of spread. In a typical preclinical efficacy study, that could be the difference between the mean of the primary outcome measure of an intervention group and that of the control group, divided by the pooled standard deviation [52]. As mentioned earlier, the effect size estimate from exploratory studies tends to be inflated (“winner’s curse”) [8]. Basing a sample size calculation of a confirmatory study on such an inflated effect size results in an underpowered study that risks missing an existing effect. This is aggravated in experiments with low internal validity [8, 53]. Sample size calculations for confirmatory studies should take this potential effect inflation into account and apply shrinkage to exploratory effect size estimates to avoid underpowered studies. This also applies to effect sizes from published studies that are exploratory. The exploratory nature need not be stated in the published study, and we recommend treating all research that does not explicitly state its confirmatory nature as exploratory. In case several prior studies are available (pilot, exploration, mini-experiments), effect sizes can be pooled via meta-analyses if heterogeneity between experiments is limited. Moreover, effect sizes do not typically extrapolate from animals to humans and are potentially smaller in humans [54]. It is thus necessary to apply shrinkage to effect sizes from exploratory studies; the exact magnitude, however, is still a matter of debate.
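To make the consequence of shrinkage concrete, the following minimal sketch computes the effect size as defined here (mean difference divided by the pooled standard deviation), applies a shrinkage factor, and derives a per-group sample size. Two assumptions should be noted: the sample size uses the normal approximation for a two-sided two-group comparison (an exact t-test calculation yields slightly larger numbers), and the shrinkage factor of 0.5 is purely hypothetical, since the appropriate magnitude is still debated. All function names and data values are illustrative.

```python
from math import ceil
from statistics import NormalDist, mean, variance


def cohens_d(treated, control):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated) +
                  (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / pooled_var ** 0.5


def shrink(d, factor):
    """Scale an exploratory effect size by an assumed factor in (0, 1]."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    return d * factor


def n_per_group(d, alpha=0.05, power=0.8):
    """Per-group sample size, two-sided test, normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical exploratory data (arbitrary units).
d_raw = cohens_d([12, 14, 15, 13, 16], [10, 11, 9, 12, 10])
d_planned = shrink(d_raw, factor=0.5)   # assumed shrinkage, not a recommendation
print(round(d_raw, 2), n_per_group(d_raw))
print(round(d_planned, 2), n_per_group(d_planned))
```

Because the required sample size scales with the inverse square of the effect size, halving the anticipated effect roughly quadruples the number of animals per group.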

An alternative approach is to define a smallest effect size of interest as outlined above. This will set a lower bound under which results are no longer considered worthwhile exploring. Choosing such a threshold needs to reflect knowledge of the human disease, biology, effect size distribution in previous studies using similar model systems, available resources, and feasibility considerations [10]. That is, if the smallest effect size of interest is set too high, the experiment will not be able to detect an actually existing effect. Conversely, an unnecessarily low smallest effect size of interest potentially requires a substantial amount of resources and animals, threatening the reduction principle of the 3R. In this context, in progressive diseases, clinicians can inform early treatment time points and evaluate how closely models reflect disease progression.

Once an effect size is chosen, this has implications for the statistical power. Given ongoing discussions on the utility of p-values and the standard threshold of p < 0.05, a confirmatory trial can be planned with a stricter bound such as a threshold of p < 0.005 or an increased power of, for example, 0.9 [55, 56, 57]. Again, this has to be weighed against the increased effort, and cost–benefit calculations are necessary to avoid spending resources that could be used for other complementary studies [57, 58]. In confirmatory studies, strict correction for multiple comparisons should be applied to preserve the pre-specified false positive rate. As there is considerable uncertainty about the true effect, power could be calculated across a range of plausible effect sizes [59] instead of a point estimate to illustrate limitations for investigators. Particularly when confirmatory studies are conducted in a sequential manner [60], this may increase efficiency. Moreover, as the exploratory study has already registered the direction of the effect, sample size calculations and subsequent analysis can be based on one-sided tests. However, in case of an underpowered exploratory study aiming at mechanistic understanding and confirming a prior knowledge claim, a sign error (type-S error) can occur, where the replication detects an effect estimate in the opposite direction of the initial experiment or the actual effect size [61].
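Calculating power across a range of plausible effect sizes can be sketched as follows. This is a minimal illustration under stated assumptions: a two-group comparison, the normal approximation to the test statistic, and hypothetical values for the group size and the effect size grid. The function name and its parameters are not taken from any cited method.

```python
from statistics import NormalDist


def power_two_group(d, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power to detect a standardized effect d
    (two groups of equal size, normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5      # expected test statistic
    power = 1 - nd.cdf(z_crit - shift)
    if not one_sided:
        power += nd.cdf(-z_crit - shift)      # rejection in the wrong tail
    return power


# Hypothetical planning grid: 63 animals per group.
n = 63
print("d     p<0.05  p<0.005  one-sided p<0.05")
for d in (0.2, 0.3, 0.4, 0.5, 0.6, 0.8):
    print(f"{d:.1f}   "
          f"{power_two_group(d, n):.2f}    "
          f"{power_two_group(d, n, alpha=0.005):.2f}     "
          f"{power_two_group(d, n, one_sided=True):.2f}")
```

A table of this kind shows at a glance how quickly power drops when the true effect is smaller than anticipated, and how much power a stricter threshold costs or a one-sided test gains at a fixed sample size.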

Multicenter considerations

A balanced design, where each center is allocated the same number of animals, is considered ideal as it increases the precision of estimates under between-center heterogeneity. One advantage over clinical trials here is that recruitment differences can be held to a minimum. Heterogeneity between centers is not due to different patient populations with different comorbidities; rather, as outlined above, most of the heterogeneity is systematically implemented in advance. The randomization to centers should take these previously planned factors into account in a block randomization scheme across centers. That is, factors need to be stratified: centers should, for example, test equal numbers of male and female animals, or animals from similar weight categories should be allocated to treatments similarly across centers. For this, a small number of additional animals may be needed to ensure a balanced design over all centers. Notably, the impact of unequal or equal numbers of subjects in different centers on statistical efficiency also depends on the type of estimator used (e.g., fixed vs random effects). Finally, unbalanced numbers are not necessarily a sign of poor planning but can be a consequence of varying capacities or breeding of animals [62].
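A stratified, balanced allocation across centers can be sketched as below. This is a minimal illustration assuming sex as the only stratification factor and permuted blocks that contain each treatment exactly once; center names, treatment labels, and the function name are hypothetical.

```python
import random
from collections import Counter


def allocate_multicenter(centers, sexes, treatments, n_per_cell, seed=None):
    """Build an allocation schedule with permuted blocks inside every
    center-by-sex stratum, so each stratum holds n_per_cell animals
    per treatment."""
    rng = random.Random(seed)
    schedule = []
    for center in centers:
        for sex in sexes:
            for block in range(n_per_cell):
                order = list(treatments)
                rng.shuffle(order)            # random order within the block
                for treatment in order:
                    schedule.append({"center": center, "sex": sex,
                                     "block": block, "treatment": treatment})
    return schedule


schedule = allocate_multicenter(
    centers=["center_A", "center_B", "center_C"],
    sexes=["female", "male"],
    treatments=["vehicle", "drug"],
    n_per_cell=3,
    seed=7,
)
counts = Counter((row["center"], row["sex"], row["treatment"])
                 for row in schedule)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Every center receives the same number of animals, and both sexes and treatments are equally represented within each center, which keeps the planned sources of heterogeneity separable from the treatment effect in the analysis.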

It is important to consider which experiments need to be performed by the initiating institute and which experiments by the partner laboratories. If a within-lab replication already indicated within-lab replicability of a result within the initiating institute, then this lab potentially does not need to perform the analogous experiment, but instead proceeds with triangulating evidence, a different strain, a different (large) animal model or flanking ex vivo experiments. In agreement with the initiating lab, partner labs can consider only selectively replicating core results to save on resources. Core results refer to assessment of the primary and important secondary outcome variables. If a costly method like single cell sequencing has been conducted in the initiating lab, a replication across all labs could lead to an undue increase in costs with little generation of additional insights. With respect to the animal model, subsequent designs are recommended (rodents -> non-rodents -> non-human primates). As sample sizes in large mammals including non-human primates typically need to be smaller due to ethical constraints, a smaller number of centers may be acceptable. It is an open question to which extent evidence from rodent experiments can be extrapolated to large animals and inform sample size planning. The effect size magnitude in rodents may translate neither to larger animals nor to the human case.

Reporting of confirmatory multicenter studies

Next to standard guidelines in preclinical research like ARRIVE [11], there are a few points that are especially relevant when reporting a confirmatory study. This includes the provision of raw data to enable meta-analysis. Meta-analysis can help to cumulate evidence, find commonality, and develop guidance for best practices. In this context, it is crucial to transparently include and report outliers (data) as well as dropout rates (i.e., animal attrition). Standardization (or normalization) of all data to one control group should be avoided. For better and more transparent visualization of data, forest plots, for example, are suitable to show center-specific data. With the a priori definition of No-Go decision points and potential failure of confirmation studies, it should be common practice to also publish null results.

Conclusion

Summarizing remarks and limitations

Even though confirmation studies are seen as an essential part of preclinical research, so far little guidance exists on how to conduct such a confirmation. Here, we mapped out strategies to conduct such studies (see Table 2 for summary points and open questions). We acknowledge that no one size fits all; rather, a broad set of recommendations applies that needs to be adjusted to individual research fields and specific questions. Importantly, our recommendations are based on a scenario where an initial finding or exploratory study prompts the very same investigators to initiate a replication. This deviates from recent attempts where initial findings from other researchers were replicated on a larger scale [3, 63, 64]. These studies revealed that in many cases a replication could not even be attempted due to missing protocols or other aspects of reporting. In contrast, here we explore the scenario where researchers team up to confirm a knowledge claim. That is, confirmation in this case is not about an exact replication but rather about efficiently generating evidence to substantiate the knowledge claim and enable a decision to start a clinical trial.

Table 2 Summary points and recommendations for the conduct of a confirmatory multicenter study including open questions that require further discussion and will be subject matter of future research

Towards this goal, we described criteria to decide when to start a confirmation study, how to use within lab replications to arrive at or reinforce such evidence, and how to plan a multicenter study. This guidance is, however, based on the authors' and workshop participants' experiences and fields of expertise and is consequently focused on drug development and efficacy studies. While some of the ideas might be applicable for diagnostic and biomarker development, this is beyond the scope of this manuscript and requires further consideration.

Moreover, we have not addressed one important aspect in confirmatory multicenter studies, namely how to judge whether the confirmation has been successful or not. Previous replication projects have shown there are numerous ways to define replication success [55, 63, 65, 66]. It is, however, unclear which of these criteria apply to confirmations and how they can guide decisions towards clinical trials.

As an additional limitation, we foremost see that confirmatory projects are resource-intensive, and funders are less inclined to fund confirmatory research. Current developments, with a funding line particularly for confirmatory studies in Germany and NIH initiatives [67], show that funding opportunities exist and will probably arise more frequently in the future. With such funding, recognition for confirmatory research will grow in a similar way. Broader funding of specific confirmatory projects will open opportunities to reevaluate and refine the presented recommendations. Moreover, field-specific strategies may evolve that will ultimately contribute to translation as a science with strong theory building at its core.