1 Introduction

The testing of new medicines in humans, particularly in patients, requires a sound and robust body of nonclinical evidence for both ethical and economic reasons (Wieschowski et al. 2018). Ethically, participation of humans in a clinical study is based on the premise that a new drug may provide more efficacious and/or safer treatment of a disease. Economically, it requires an investment of around 20 million € to bring a drug candidate to clinical proof-of-concept evaluation, particularly for a first-in-class medicine. The supporting evidence for the ethical and the economical proposition is typically based on animal studies. Animal models can become particularly important if human material is difficult to obtain (Alini et al. 2008) or the condition to be treated is too complex or too poorly understood to solely extrapolate from in vitro studies (Michel and Korstanje 2016). While existing treatments show that this can be done successfully, numerous examples exist where experimental treatments looked promising in animal models but have failed in clinical studies due to lack of efficacy. Prominent examples of failed clinical programs include amyotrophic lateral sclerosis (Perrin 2014), anti-angiogenic treatment in oncology (Martić-Kehl et al. 2015), several cardiovascular diseases (Vatner 2016), sepsis (Shukla et al. 2014), and stroke-associated neuroprotection (Davis 2006). Therefore, the idea of enhancing robustness of nonclinical studies is not new and has been advocated for more than 20 years (Hsu 1993; Stroke Therapy Academic Industry Roundtable (STAIR) 1999). Nonetheless, poor technical quality and reporting issues remain abundant (Chang et al. 2015; Kilkenny et al. 2009), and clinical development programs continue to fail due to lack of efficacy despite promising findings in animals.

Generalizability describes how applicable the results from one model are to others. In the context of translational research, this translates into the question whether findings from experimental models are likely to also occur in patients. Generalizability of preclinical animal studies is possible only if the studies are reproducible, replicable, and robust. This chapter discusses causes contributing to lack of robustness of translational studies and the cost/benefit in addressing them. In this context we define robustness as an outcome that can be confirmed in principle despite some modifications of experimental approach, e.g., different strains or species. Only robust findings in the nonclinical models are likely to predict those in clinical proof-of-concept studies. For obvious reasons, a translational study can only be robust if it is reproducible, i.e., if another investigator doing everything exactly as the original researchers did will obtain a comparable result. General factors enhancing reproducibility such as randomization, blinding, choice of appropriate sample sizes and analytical techniques, and avoiding bias due to selective reporting of findings (Lapchak et al. 2013; Snyder et al. 2016) will not be covered here because they are discussed in depth in other chapters of this book. However, it should be noted that generally accepted measures to enhance reproducibility have not been adhered to in most studies intended to have translational value (Kilkenny et al. 2009) and reporting standards have often been poor (Chang et al. 2015).

Against this background, communities interested in various diseases have developed specific recommendations for the design, conduct, analysis, and reporting of animal studies in their field, e.g., Alzheimer’s disease (Snyder et al. 2016), atherosclerosis (Daugherty et al. 2017), lung fibrosis (Jenkins et al. 2017), multiple sclerosis (Amor and Baker 2012; Baker and Amor 2012), rheumatology (Christensen et al. 2013), stroke (Lapchak et al. 2013; Stroke Therapy Academic Industry Roundtable (STAIR) 1999), or type 1 diabetes (Atkinson 2011; Graham and Schuurman 2015); disease-overarching guidelines for animal studies with greater translational value have also been proposed (Anders and Vielhauer 2007). We will also discuss these disease-overarching approaches.

2 Homogeneous vs. Heterogeneous Models

Homogeneous models, e.g., inbred strains or single sex of experimental animals, intrinsically exhibit less variability and, accordingly, have greater statistical power to find a difference with a given number of animals (sample size). In contrast, human populations to be treated tend to be more heterogeneous, e.g., regarding gender, ethnicity, age, comorbidities, and comedications. While heterogeneity often remains limited in phase III clinical studies due to strict inclusion and exclusion criteria, marketed drugs are used in even more heterogeneous populations. This creates a fundamental challenge for translational studies. More homogeneous models tend to need fewer animals to have statistical power but may have a smaller chance to reflect the broad patient population intended to use a drug. In contrast, more heterogeneous translational programs are likely to be costlier but, if consistently showing efficacy, should have a greater chance to predict efficacy in patients. The following discusses some frequent sources of heterogeneity. However, these are just examples, and investigators are well advised to systematically consider the costs and opportunities implied in selection of models and experimental conditions (Fig. 1).

Fig. 1 Degree of heterogeneity has effects on program/study costs and on translational robustness. An appropriate balance must be defined on a project-specific basis.
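The relation between variability and sample size can be made concrete with the standard normal-approximation formula for comparing two group means, in which the number of animals per group scales with the square of the standard deviation divided by the expected difference. The following is a minimal Python sketch of that approximation only; the function name, the standard deviations, and the effect size are illustrative assumptions and are not taken from any study cited here.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(sd, delta, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison.

    Normal approximation:
    n = 2 * (z_(1-alpha/2) + z_power)^2 * (sd / delta)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (sd / delta) ** 2)


# Hypothetical values: same expected effect, different between-animal variability.
n_homogeneous = animals_per_group(sd=1.0, delta=1.0)    # e.g., inbred, single sex
n_heterogeneous = animals_per_group(sd=1.5, delta=1.0)  # e.g., outbred, mixed
print(n_homogeneous, n_heterogeneous)
```

Under these assumed values, a 1.5-fold larger standard deviation raises the requirement from 16 to 36 animals per group. The normal approximation slightly underestimates sample sizes for small groups, so a formal power analysis based on the t distribution would be preferable when planning an actual study.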

2.1 Animal Species and Strain

While mammals share many regulatory systems, individual species may differ regarding the functional role of a certain pathway. For instance, β3-adrenoceptors are a major regulator of lipolysis in rodents, particularly in brown adipose tissue; this was the basis of clinical development of β3-adrenoceptor agonists for the treatment of obesity and type 2 diabetes. However, metabolic β3-adrenoceptor agonist programs of several pharmaceutical companies have failed in phase II trials (Michel and Korstanje 2016) because adult humans have little brown adipose tissue and lipolysis in human white adipose tissues is primarily driven by β1-adrenoceptors (Barbe et al. 1996). Similarly, α1-adrenoceptors are major regulators of inotropy in rat heart but have a limited role, if any, in the human heart (Brodde and Michel 1999). Thus, the targeted mechanism should be operative in the animal species used as preclinical model and in humans. When corresponding human data are lacking, multiple animal strains and species should be compared that are phylogenetically sufficiently distinct, i.e., the confirmation of rat studies should not be in mice but perhaps in dogs or nonhuman primates. This approach has long been standard practice in toxicology, where regulatory agencies require data in at least two species, one of which must be a nonrodent.

Differences between strains within a species are a variation of this theme. For instance, rat strains can differ in the thymic atrophy in response to induction of experimental autoimmune encephalomyelitis (Nacka-Aleksić et al. 2018) or in the degree of urinary bladder hypertrophy in streptozotocin-induced type 1 diabetes (Arioglu Inan et al. 2018). Similarly, inbred strains may yield more homogeneous responses than outbred strains and, accordingly, may require smaller sample sizes; however, the traits selected in an inbred strain may be less informative. For example, inbred Wistar-Kyoto rats are frequently used as normotensive control in studies with spontaneously hypertensive rats. However, Wistar-Kyoto rats may share some phenotypes with spontaneously hypertensive rats, such as increased frequency of micturition and amplitude of urinary bladder detrusor activity, that are not observed in other normotensive strains (Jin et al. 2010).

2.2 Sex of Animals

Except for a small number of gender-specific conditions such as endometriosis or benign prostatic hyperplasia, diseases typically affect men and women – although often with different prevalence. Thus, most drugs must work in both genders. Many drug classes are similarly effective in both genders, for instance, muscarinic antagonists in the treatment of overactive bladder syndrome (Witte et al. 2009), and preclinical data directly comparing both sexes had predicted this (Kories et al. 2003). On the other hand, men and women may exhibit differential responsiveness to a given drug, at least at the quantitative level; for instance, the vasopressin receptor agonist desmopressin reduced nocturia to a greater extent in women than in men (Weiss et al. 2012). Such findings can lead to failed studies in a mixed gender population. Therefore, robust preclinical data should demonstrate efficacy in both sexes. However, most preclinical studies do not account for sex as a variable and have largely been limited to male animals (Tierney et al. 2017; Pitkänen et al. 2014). For instance, only 12 out of 71 group comparisons of urinary bladder hypertrophy in the streptozotocin model of type 1 diabetes were reported for female rats (Arioglu Inan et al. 2018). In reaction, the NIH have published guidance on the consideration of sex as a biological variable (National Institutes of Health 2015). It requires the use of both sexes in grant applications unless the target disease predominantly affects one gender. For a more detailed discussion of the role of sex differences, see chap. 9 by Rizzo et al.

Generally performing preclinical studies in both sexes comes at a price. A study designed to look at drug effects vs. vehicle in male and female rats and compare the effect between sexes needs not only twice as many groups but also a greater number of animals per group to maintain statistical power when adjusting for multiple comparisons. This makes a given study more expensive (Tannenbaum and Day 2017), and lack of funding is seen as a main reason not to incorporate both sexes in study design (McCarthy 2015). An alternative approach could be to do a single study based on mixed sexes. This may be more robust to detect an efficacious treatment but also may have more false negatives if a drug is considerably less effective in one of the two sexes. As it may be useful to study multiple animal models for a given condition (see below), a third option could be to use males in one and females in the other model, targeting a balanced representation of sexes across a program. This works well if the two studies yield similar results. However, if they show different results, one does not know whether such difference comes from that in sex of the experimental animal or from that of model, necessitating additional studies.
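The cost of including both sexes can be sketched with the same normal-approximation sample size formula combined with a Bonferroni adjustment of the significance level. The Python example below is a simplified illustration under stated assumptions: the effect size and standard deviation are hypothetical, three comparisons are assumed (drug vs. vehicle in males, drug vs. vehicle in females, and the effect in males vs. that in females), and all three are sized as simple two-group contrasts.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(sd, delta, alpha, power=0.80):
    """Normal-approximation sample size per group, two-sided two-group test."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 * (sd / delta) ** 2)


SD, DELTA, ALPHA = 1.0, 1.0, 0.05  # hypothetical values

# Single-sex design: drug vs. vehicle, one comparison, two groups.
n_single = animals_per_group(SD, DELTA, ALPHA)
total_single = 2 * n_single

# Two-sex design: four groups, Bonferroni adjustment for three assumed comparisons.
n_comparisons = 3
n_both = animals_per_group(SD, DELTA, ALPHA / n_comparisons)
total_both = 4 * n_both

print(total_single, total_both)
```

In this sketch the two-sex design needs more than twice the total number of animals because both the number of groups and the group size increase. The calculation is likely to be optimistic: a between-sex comparison of drug effects is a difference of differences with a larger variance than a simple two-group contrast, so detecting a sex difference of the same magnitude would require even larger groups.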

2.3 Age

Studies in various organ systems and pathologies show that older animals react differently than adolescent ones, for instance, in the brain (Scerri et al. 2012), blood vessels (Mukai et al. 2002), or urinary bladder (Frazier et al. 2006). Nonetheless, the most frequently used age group in rat experiments is about 12 weeks old at the start of the experiment, i.e., adolescent, whereas most diseases preferentially affect patients at a much higher age. Moreover, the elderly may be more sensitive to side effects, for instance, because they exhibit a leakier blood-brain barrier (Farrall and Wardlaw 2009) or are more prone to orthostasis (Mets 1995), thereby shifting the efficacy/tolerability balance in an unfavorable way. The same applies to the reduced renal function in the elderly, which can affect pharmacokinetics of an investigational drug in a way not predicted from studies in adolescent animals. While experiments in old animals are more expensive than those in adolescent ones (a 2-year-old rat may be ten times as expensive as a 12-week-old animal), it cannot necessarily be expected that young animals are highly predictive for conditions predominantly affecting the elderly. The conflict between a need for preclinical data in an age group comparable with that of the target population and the considerably higher cost of aged animals could be resolved by performing at least one key preclinical study in old animals.

2.4 Comorbidities

A condition targeted by a new medication frequently is associated with other diseases. This can reflect that two conditions with high prevalence in an age group often coincide based on chance, e.g., because both increase in prevalence with age. However, it can also result from two conditions sharing a root cause. An example of the former is that patients seeking treatment for the overactive bladder syndrome were reported to concomitantly suffer from arterial hypertension (37.8%), diabetes (15.4%), benign prostatic hyperplasia (37.3% of male patients), and depression and congestive heart failure (6.5% each) (Schneider et al. 2014), all of which are age-related conditions. An example of the latter is the group of diseases involving atherosclerosis as a common risk factor, including hypertension, heart failure, and stroke. Regardless of the cause of association between two disease states, comorbidity may affect drug effects in the target disease. Nevertheless, studies in experimental stroke treatment have largely been performed in otherwise healthy animals (Sena et al. 2007). That this may yield misleading data is highlighted by a presumed anti-stroke drug, the free radical scavenger NXY-059. While this drug had been tested in nine preclinical studies, it failed in a clinical proof-of-concept study; however, the preclinical package included only a single study involving animals with relevant comorbidity, and that study showed a considerably smaller efficacy than the others (MacLeod et al. 2008). Greater reliance on animal models with relevant comorbidities may have prevented this drug candidate from advancing to clinical studies, thereby sparing study participants from an ineffective treatment and saving the sponsoring company a considerable amount of resources. Again, the balance between greater robustness and greater cost may be achieved by doing some studies in animal models with and some without relevant comorbidity.

In a more general vein, lack of consistency across species, sexes, age groups, or comorbidities is not only a hurdle but can also be informative as diverging findings in one group (if consistent within this group) may point to important new avenues of research (Bespalov et al. 2016a).

3 Translational Bias

Bias leading to poor reproducibility can occur at many levels, including selection of model and samples, execution of study, detection of findings, unequal attrition across experimental groups, and reporting (Green 2015; Hooijmans et al. 2014). Most of these are generic issues of poor reproducibility and are covered elsewhere in this book. We will focus on aspects with particular relevance for translational studies.

3.1 Single Versus Multiple Pathophysiologies

Choice of animal model is easier if the primary defect is known and can be recreated in an animal, e.g., mutation of a specific gene. However, even then it is not fully clear whether physiological response to mutation is the same in animals and humans. Many syndromes can be the result of multiple underlying pathophysiologies. For instance, arterial hypertension can be due to excessive catecholamine release as in pheochromocytoma, to increased glucocorticoid levels as in Cushing’s disease, to an increased activity of the renin-angiotensin system such as in patients with renal disease, or to increased mineralocorticoid levels such as in patients with Conn’s syndrome. Accordingly, angiotensin receptor antagonists effectively lower blood pressure in most animal models but have little effect in normotensive animals or animals hypertensive due to high salt intake or the diabetic Goto-Kakizaki rat (Michel et al. 2016). As a counterexample, β-adrenoceptor antagonists lower blood pressure in many hypertensive patients but may increase it in pheochromocytoma. Thus, reliance on a limited panel of animal models can yield misleading results if those models only reflect pathophysiologies relevant to a minor fraction of the patient population to be treated. While irrelevant models often yield positive findings (MacLeod et al. 2008), they do not advance candidates likely to become approved treatments. The choice of relevant animal models should involve careful consideration of the pathophysiology underlying the human disease to be treated (Green 2002). However, this may be easier said than done, particularly in diseases that are multifactorial and/or for which limited insight into the underlying pathophysiology exists.

3.2 Timing of Intervention

Activation or inhibition of a given mechanism is not necessarily equally effective in the prevention and treatment of a disease or in early and late phases of it (Green 2002). An example of this is sepsis. A frequently used model of sepsis is the administration of lipopolysaccharide. Many agents found active in this model when used as pre-treatment or co-administered with the causative agent have failed to be of benefit in patients, presumably at least partly because the early pathophysiological cascade initiated by lipopolysaccharide differs from the mechanisms in later phases of sepsis. Patients typically need treatment for symptoms of sepsis when the condition has fully developed; therefore, a higher translational value is expected from studies with treatment starting several hours after onset of septic symptoms (Wang et al. 2015). Similarly, many conditions often are diagnosed in patients at an advanced stage, where partly irreversible processes may have taken place. One example of this is tissue fibrosis, which is more difficult to resolve than to prevent (Michel et al. 2016). Another example is oncology where growth of the primary tumor may involve different mechanisms than metastasis (Heger et al. 2014). Moreover, the assessment of outcomes must match a clinically relevant time point (Lapchak et al. 2013); for instance, β-adrenoceptor agonists acutely improve cardiac function in patients with heart failure, but their chronic use increased mortality (The German Austrian Xamoterol-Study Group 1988). Therefore, animal models can only be expected to be predictive if they reflect the clinical setting in which treatment is intended to be used.

3.3 Pharmacokinetics and Dosage Choice

Most drugs have only limited selectivity for their molecular target. If a drug is underdosed relative to what is needed to engage that target, false negative effects may occur (Green 2002), and a promising drug candidate may be wrongly abandoned. More often an experimental treatment is given to animals in doses higher than required, which may lead to off-target effects that can yield false positive data. Testing of multiple doses, preferably accompanied by pharmacokinetics of each dose, may help to avoid the false negative and false positive conclusions based on under- and overdosing, respectively. Moreover, a too high dose may cause adverse effects which shift the efficacy/tolerability ratio in an unfavorable way, potentially leading to unjustified termination of a program. Careful comparison of pharmacokinetic and pharmacodynamic effects can improve interpretation of data from heterogeneous models (Snyder et al. 2016), as has been shown in QTc prolongation (Gotta et al. 2015). QTc prolongation can happen as a consequence of alterations of ion channel function, which can impair ventricular repolarization in the heart and may predispose to polymorphic ventricular tachycardia (torsade de pointes), which in turn may cause syncope and sudden death. This may be further aided by the use of biomarkers, particularly target engagement markers (Bespalov et al. 2016b). It is generally advisable to search for species-specific pharmacokinetic data to identify the most suitable doses prior to finalizing a study design (Kleiman and Ehlers 2016).
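The consequences of under- and overdosing for a drug with limited selectivity can be illustrated with the law of mass action for a simple one-site binding model, in which fractional receptor occupancy equals C / (C + Kd). The Python sketch below is illustrative only: the affinities, the 100-fold selectivity, and the concentrations are hypothetical, and the model assumes equilibrium binding to the free drug concentration at the site of action, ignoring protein binding, active metabolites, and tissue distribution.

```python
def fractional_occupancy(concentration_nm, kd_nm):
    """Fraction of receptors occupied at equilibrium (one-site binding)."""
    return concentration_nm / (concentration_nm + kd_nm)


KD_TARGET_NM = 1.0        # hypothetical affinity for the intended target
KD_OFF_TARGET_NM = 100.0  # hypothetical 100-fold lower affinity off-target

for conc in (0.1, 10.0, 1000.0):  # hypothetical free concentrations in nM
    on = fractional_occupancy(conc, KD_TARGET_NM)
    off = fractional_occupancy(conc, KD_OFF_TARGET_NM)
    print(f"{conc:>7.1f} nM  target {on:.0%}  off-target {off:.0%}")
```

With these assumed values, the lowest concentration engages only a small fraction of the target (risk of a false negative), the intermediate concentration occupies most of the target with little off-target binding, and the highest concentration occupies most of the off-target site as well (risk of a false positive or of adverse effects). Measuring exposure at each tested dose indicates which of these regimes an experiment was actually performed in.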

Moreover, desired and adverse drug effects may exhibit differential time profiles. An example is provided by α1-adrenoceptor antagonists intended for the treatment of symptoms of benign prostatic hyperplasia. Concomitant lowering of blood pressure may be an adverse effect in their use. Original animal studies have typically compared drug effects on intraurethral pressure as a proxy for efficacy and those on blood pressure as a proxy for tolerability (Witte et al. 2002). However, it became clear that some α1-adrenoceptor antagonists reach higher concentrations in target tissue than in plasma at late time points; this allows dosing with smaller peak plasma levels and blood pressure effects while maintaining therapeutic efficacy, thereby providing a basis for improving tolerability (Franco-Salinas et al. 2010).

4 Conclusions

Bias at all levels of planning, execution, analysis, interpretation, and reporting of studies is a general source of poor reproducibility (Green 2015; Hooijmans et al. 2014). Additional layers such as choice of animal model, stage of condition, and tested doses add to the potential bias in translational research. A major additional hurdle for translational research is related to the balancing between homo- and heterogeneity of models. While greater heterogeneity is more likely to be representative for patient groups to be treated, it also requires larger studies. These are not only more expensive but also have ethical implications balancing number of animals being used against statistical power and robustness of results (Hawkins et al. 2013). Conclusions on the appropriate trade-off may not only be specific for a disease or a treatment but also depend on the setting in which the research is performed, e.g., academia, biotech, and big pharma (Frye et al. 2015). As there is no universally applicable recipe to strike the optimal balance, translational investigators should carefully consider the implications of a chosen study design and apply the limitations implied in their interpretation of the resulting findings.

While considerations will need to be adapted to the disease and intervention of interest, some more general recommendations emerge. We often have a good understanding of which degree of symptom improvement in a patient is clinically meaningful, but a similar understanding in animal models is mostly missing. Nonetheless, we consider it advisable that investigators critically consider whether the effects they observe in an experimental model are of a magnitude that can be deemed to be clinically meaningful if confirmed in patients. It has been argued that the nonclinical studies building the final justification to take a new treatment into patients “need to apply the same attention to detail, experimental rigour and statistical power … as in the clinical trials themselves” (Howells et al. 2014). In this respect, it has long been the gold standard that pivotal clinical trials should be based on a multicenter design to limit site-specific biases, largely reflecting unknown unknowns. While most nonclinical studies are run within a single lab, examples of preclinical multicenter studies are emerging (Llovera et al. 2015; Maysami et al. 2016). Preclinical multicenter studies improve robustness, but they are only one component of improved quality and do not substitute for critical thinking.