Improving basic and translational science by accounting for litter-to-litter variation in animal models
- Cite this article as:
- Lazic, S.E. & Essioux, L. BMC Neurosci (2013) 14: 37. doi:10.1186/1471-2202-14-37
Animals from the same litter are often more alike than animals from different litters. This litter-to-litter variation, or “litter effects”, can influence the results in addition to the experimental factors of interest. Furthermore, sometimes an experimental treatment can only be applied to whole litters rather than to individual offspring. An example is the valproic acid (VPA) model of autism, where VPA is administered to pregnant females, thereby inducing the disease phenotype in the offspring. With this type of experiment the sample size is the number of litters and not the total number of offspring. If such experiments are not appropriately designed and analysed, the results can be severely biased as well as extremely underpowered.
A review of the VPA literature showed that only 9% (3/34) of studies correctly determined that the experimental unit (n) was the litter and therefore made valid statistical inferences. In addition, litter effects accounted for up to 61% (p < 0.001) of the variation in behavioural outcomes, which was larger than the treatment effects. Furthermore, few studies reported using randomisation (12%) or blinding (18%), and none indicated that a sample size calculation or power analysis had been conducted.
Litter effects are common and large, and ignoring them can make replication of findings difficult and can contribute to the low rate of translating preclinical in vivo studies into successful therapies. Only a minority of studies reported using rigorous experimental methods, which is consistent with much of the preclinical in vivo literature.
Keywords: Autism; Experimental design; Litter-effects; Mixed-effects model; Multiparous; Nested model; Valproic acid
Numerous animal models (lesion, transgenic, knockout, selective breeding, etc.) have been developed for a variety of psychiatric, neurodegenerative, and neurodevelopmental disorders. While many of these models have been helpful for understanding disease pathology, they have been less useful for discovering potential therapies or predicting clinical efficacy. Translation from in vivo animal models (typically rodent) has been poor, despite many years of research and effort. There are many reasons for this, including the inherent difference in biology between rodents and humans, particularly relating to higher cognitive functions. In addition, there is the ever-present question of whether a particular animal model is even suitable: whether it recapitulates the disease process of interest or faithfully mimics key aspects of the human condition. While important, these two considerations will be put aside and the focus will be on the design and analysis of preclinical studies using multiparous species, and how this affects the validity and reproducibility of results. Two issues will be discussed. The first deals with designs where an experimental treatment is applied to whole litters rather than to the individual animals, usually because the treatment is applied to pregnant females and therefore to all of the offspring. The second is the natural litter-to-litter variation that is often present, which means that the value of a measured experimental outcome can potentially be influenced by the litter that the animal came from.
Applying treatments to whole litters
A related design issue is that greater statistical power can be achieved when litter-mates are used to test a therapeutic compound versus a placebo. If the therapeutic treatment is applied to the individual animals postnatally, then the individual animal is the experimental unit for this comparison. This is referred to as a split-plot or split-unit design and has more than one type of experimental unit: litters for some comparisons and individual animals for others. These studies therefore require careful planning and analysis, but during their training biologists are rarely introduced to these designs or to the appropriate analysis of data derived from them.
Litter effects are ubiquitous, large, and important
It is known that for many measurable characteristics across many species, monozygotic twins are more similar than dizygotic twins, which are more similar than non-twin siblings, and which in turn are more similar than two unrelated individuals. What has not been fully appreciated is that all of the standard statistical methods (e.g. t-test, ANOVA, regression, non-parametric methods) assume that the data come from unrelated individuals. However, rodents from the same litter are effectively dizygotic twins; they are genetically very similar and share prenatal and early postnatal environments. Therefore, studies need to be designed and analysed in such a way that if differences between litters exist, they do not bias or confound the results [2–12]. More specifically, this relates to the assumption of independence of observations. For example, measuring blood pressure (BP) from the left and right arm of ten unrelated people only provides ten independent measurements of BP, not twenty. This is because the left and right BP values will be highly correlated—if the BP value measured from a person’s left arm is high, then so will the value measured from their right arm. Similarly, two animals from the same litter will tend to have values that are more alike (i.e. correlated) than two animals from different litters. One can think of it as within-litter homogeneity and between-litter heterogeneity. This lack of independence needs to be handled appropriately in the analysis and the three strategies outlined in the previous section can be used. Many animal models are derived from highly inbred strains, and this results in reduced genotypic and phenotypic variation. This is a different issue and unrelated to lack of independence. It does not mean that animals “are all the same” and that differences between litters do not exist.
Litter effects are not a minor issue of interest only to statistical pedants and of little practical importance to scientists. Using actual body weight data from their experiment, Holson and Pearce showed that if three “treated” and three “control” litters are used, with two offspring per litter (total number of offspring = 12), then the false positive rate (Type I error) is 20% rather than the assumed 5%. The false positive rate was determined as the proportion of p-values that fell below 0.05. Since there was no actual treatment applied, random sampling should produce only a 5% error rate. Furthermore, they showed that the false positive rate increases with the number of offspring per litter; if the number of offspring per litter is 12 (total number of offspring = 72) then the false positive rate is 80%. The error rate is also influenced by the relative variability between and within litters and will therefore vary for each experimental outcome. Given that papers report the results of multiple tests (multiple outcome variables and multiple comparisons), we can expect the literature to contain many false positive results. It may seem paradoxical, but in addition to an increased false positive rate, ignoring litter-to-litter variation can also lead to low power (too many false negatives) when true effects exist [4, 5]. This occurs because litter-to-litter variation is unexplained variation, and thus the “noise” in the data is increased, potentially masking true treatment effects. A subsequent study using forty litters found “significant litter effects… in varying degrees, for almost every behavioural, morphologic, and neuroendocrine measure; they were evident across indices of neural, adrenal, thyroid, and immunologic functioning in adulthood” (and see references therein for further studies supporting this conclusion).
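The inflation of the false positive rate can be illustrated with a small simulation. The sketch below is not Holson and Pearce's analysis and will not reproduce their exact percentages. It assumes normally distributed data, no treatment effect, and between-litter and within-litter standard deviations chosen purely for illustration (here the locomotor activity estimates quoted in this paper). All function names are hypothetical. It compares a t-test that wrongly treats every offspring as independent with a t-test on litter means.

```python
import math
import random
import statistics


def t_two_sided_p(t_stat, df, steps=200):
    """Two-sided p-value for Student's t via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # approximates P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)


def pooled_t_test(a, b):
    """Equal-variance two-sample t-test; returns the two-sided p-value."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * statistics.variance(a)
                  + (len(b) - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return t_two_sided_p((statistics.mean(a) - statistics.mean(b)) / se, df)


def simulate_group(rng, n_litters, n_pups, sd_between, sd_within):
    """One group of litters with no treatment effect (true mean 0)."""
    group = []
    for _ in range(n_litters):
        litter_mean = rng.gauss(0.0, sd_between)  # shared by litter-mates
        group.append([rng.gauss(litter_mean, sd_within) for _ in range(n_pups)])
    return group


def false_positive_rates(n_litters, n_pups, sd_between, sd_within,
                         n_sim=2000, alpha=0.05, seed=1):
    """Return (rate ignoring litter, rate using litter means) under the null."""
    rng = random.Random(seed)
    hits_ignoring = hits_means = 0
    for _ in range(n_sim):
        g1 = simulate_group(rng, n_litters, n_pups, sd_between, sd_within)
        g2 = simulate_group(rng, n_litters, n_pups, sd_between, sd_within)
        # Incorrect analysis: n = number of offspring
        flat1 = [v for litter in g1 for v in litter]
        flat2 = [v for litter in g2 for v in litter]
        if pooled_t_test(flat1, flat2) < alpha:
            hits_ignoring += 1
        # Valid analysis: n = number of litters
        means1 = [statistics.mean(litter) for litter in g1]
        means2 = [statistics.mean(litter) for litter in g2]
        if pooled_t_test(means1, means2) < alpha:
            hits_means += 1
    return hits_ignoring / n_sim, hits_means / n_sim


if __name__ == "__main__":
    for pups in (2, 12):
        print(pups, false_positive_rates(3, pups, 0.8803, 0.8142))
```

Under these assumptions the analysis on litter means should stay close to the nominal 5%, while the analysis that ignores litter should exceed it, increasingly so as more offspring per litter are added.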
Holson and Pearce reported that only 30% of papers in the behavioural neurotoxicology literature correctly accounted for litter effects, and Zorrilla noted that 34% of papers in Developmental Psychobiology and 15% of papers in related journals correctly accounted for litter effects. This issue has been discussed repeatedly for almost forty years, but experimental biologists seem unaware of (or choose to ignore) the importance of dealing with litter effects. One can only speculate on the number of erroneous conclusions that have been reached and the resources that have been wasted.
One might argue that when many studies are conducted, including replications within and between labs, the evidence will eventually converge to the “truth”, and therefore these considerations are only of minor interest. Unfortunately, there is no guarantee of such convergence, as the literature on the superoxide dismutase (SOD1) transgenic mouse model of amyotrophic lateral sclerosis (ALS) demonstrates. Several treatments showed efficacy in this model and were advanced to clinical trials, where they proved to be ineffective. A subsequent large-scale and properly executed replication study did not support the previous findings. This study also identified litter as an important variable that affected survival (the main outcome) and which was not taken into account in the earlier studies. The authors also demonstrated how false positive results can arise with inappropriate experimental designs and analyses. Litter effects were not the only contributing factor; a meta-analysis of the preclinical SOD1 literature revealed that only 31% of studies reported randomly assigning animals to treatment conditions and even fewer reported blind assessment of outcomes. Lack of randomisation and blinding are known to overstate the size of treatment effects [19–23]. In addition, there was evidence of publication bias, where studies with positive results were more likely to be published. Thus, the combination of poor experimental design, analysis, and publication bias contributed to numerous incorrect decisions regarding treatment efficacy.
General quality of preclinical animal studies
Previous studies have shown that the general quality of the design, analysis, and interpretation of preclinical animal experiments is low [20, 21, 23–30]. For example, Nieuwenhuis et al. recently reported that 50% of papers in the neuroscience literature misinterpret interaction effects. In addition, the issue of “inflated n”, or pseudoreplication, shows up in other guises [12, 32], and whole fields can misattribute cause-and-effect relationships [33, 34]. There is also the concept of “researcher degrees of freedom”, which refers to the post hoc flexibility in choosing the main outcome variables, statistical models, data transformations, how outliers are handled, when to stop collecting data, and what is reported in the final paper. Trying various permutations of these options greatly increases the chance that something statistically significant will be found, which is then reported as the sole analysis that was conducted. Given the above concerns, it is not surprising that the pharmaceutical industry has difficulty reproducing many published results [36–39].
Ninety-five papers were identified on PubMed using the search term “(VPA [tiab] OR ‘valproic acid’ [tiab]) AND autism [tiab]” (up to the end of 2011). Reference lists from these articles were then examined for further relevant studies, and one was found. Only primary research articles that injected pregnant dams with VPA and subsequently analysed the effects in the offspring were selected (in vitro studies were excluded). A total of thirty-five studies were found, and one was excluded because key information was located in supplementary material that was not available online. Two key pieces of information were extracted: (1) whether the analysis correctly identified the experimental unit as the litter and (2) whether important features of good experimental design were mentioned, including randomisation, blinding, sample size calculation, and whether the total sample size (i.e. number of pregnant dams) was indicated or could be determined.
Estimating the importance of litter-to-litter variation
Data from Mehta et al. were used to estimate the magnitude of differences between litters on a number of outcome variables. This study was chosen because it included animals from fourteen litters (five saline, nine VPA) and therefore it was possible to get a reasonable estimate of the litter-to-litter variation. In addition, the study mentioned using randomisation and blind assessment of outcomes. Half of the animals in each condition were also given MPEP (2-methyl-6-(phenylethynyl)pyridine), a metabotropic glutamate receptor 5 antagonist. To assess the magnitude of the litter effects, the effects of VPA, MPEP, and sex (if relevant) were removed, and the remaining variability in the data that could be attributed to differences between litters was estimated. More specifically, models with and without a random effect of litter were compared with a likelihood ratio test. This analysis tests whether the variance between litters is zero. It is known that the resulting p-values will be too large because of “testing on the boundary”, so the simple method of dividing the p-values by two was used, as recommended by Zuur et al. The exact specification of the models is provided as R code in Additional file 1 and the data are provided in Additional file 2.
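The arithmetic of the boundary correction can be sketched as follows. This is not the authors' R code (which is in Additional file 1). It assumes that the two log-likelihoods have already been obtained from models fitted with and without the random litter effect, that the models differ by a single variance component (so the reference distribution has one degree of freedom), and that the function names and the log-likelihood values in the usage example are hypothetical.

```python
import math


def chi2_1df_sf(x):
    """P(X > x) for a chi-squared variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    # For 1 df, X = Z**2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x)).
    return math.erfc(math.sqrt(x / 2.0))


def litter_variance_test(loglik_without_litter, loglik_with_litter):
    """Likelihood ratio test for one litter variance component.

    Returns (LRT statistic, naive p-value, p-value halved for the boundary).
    """
    lrt = max(0.0, 2.0 * (loglik_with_litter - loglik_without_litter))
    naive_p = chi2_1df_sf(lrt)
    return lrt, naive_p, naive_p / 2.0


def litter_variance_fraction(sd_between, sd_within):
    """Share of the residual variance attributable to litter."""
    return sd_between ** 2 / (sd_between ** 2 + sd_within ** 2)


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only
    print(litter_variance_test(-105.0, -100.0))
    # SDs reported for the locomotor activity data
    print(litter_variance_fraction(0.8803, 0.8142))
```

With the locomotor standard deviations reported in this paper (0.8803 between litters, 0.8142 within litters), the second function gives a litter share of roughly 0.54 of the variance remaining after the fixed effects are removed.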
In these types of designs, power (the ability to detect an effect that is actually present) is influenced by the (1) number of litters, (2) variability between litters, (3) number of animals within litters, (4) variability of animals within litters, (5) difference between the means of the treatment groups (effect size), (6) significance cutoff (traditionally α = 0.05), and (7) statistical test used. In order to illustrate the importance of the number of litters relative to the number of animals within litters and how an inappropriate analysis can lead to p-values that are too small, a power analysis was conducted with the number of litters per group varying from three to ten, and the number of animals per litter varying from one to ten. The other factors were held constant. Variability between litters (SD = 0.8803) and the variability of animals within litters (SD = 0.8142) were estimated from the locomotor activity data from Mehta et al. For each combination of litters and animals, 5000 simulated datasets were created with a mean difference between groups of 0.15. Once the datasets were generated, the power for three types of analyses was calculated. The first analysis averaged the values of the animals within each litter and then groups were compared with a t-test. The second analysis used a mixed-effects model, and the third ignored litter and compared all of the values with a t-test. The last analysis is incorrect and only presented to demonstrate how artificially inflating sample size affects power. The power for each analysis was determined as the proportion of tests that had p < 0.05. The R code is provided in Additional file 1 and is adapted from Gelman and Hill.
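Why the number of litters matters more than the number of animals per litter can also be seen without simulation. The sketch below is an analytical illustration rather than the power analysis described above. It assumes a balanced design (equal litter sizes) and the standard variance-components result that the variance of a group mean is the between-litter variance divided by the number of litters plus the within-litter variance divided by the total number of animals. The function names are hypothetical.

```python
def variance_of_group_mean(n_litters, n_per_litter, sd_between, sd_within):
    """Variance of a treatment-group mean in a balanced nested design."""
    return (sd_between ** 2 / n_litters
            + sd_within ** 2 / (n_litters * n_per_litter))


def effective_sample_size(n_litters, n_per_litter, sd_between, sd_within):
    """Number of independent animals that would give the same precision."""
    icc = sd_between ** 2 / (sd_between ** 2 + sd_within ** 2)
    design_effect = 1 + (n_per_litter - 1) * icc
    return n_litters * n_per_litter / design_effect


if __name__ == "__main__":
    sd_b, sd_w = 0.8803, 0.8142  # locomotor activity estimates
    for litters in (3, 5, 10):
        for animals in (1, 3, 10):
            print(litters, animals,
                  round(variance_of_group_mean(litters, animals, sd_b, sd_w), 4),
                  round(effective_sample_size(litters, animals, sd_b, sd_w), 2))
```

Under these assumed standard deviations, ten litters with one animal each give a smaller variance of the group mean than five litters with ten animals each, even though the second design uses five times as many animals. Adding animals to a litter only shrinks the within-litter term, so the effective sample size can never exceed the number of litters divided by the intraclass correlation.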
Results and discussion
Low quality of the published literature
The VPA model of autism is relatively new and potential therapeutic compounds tested in this model have not yet advanced to human trials. The opportunity therefore exists to clean up the literature and prevent a repeat of the SOD1 story. The main finding is that only 9% (3/34) of studies correctly identified the experimental unit and thus made valid inferences from the data. One study used a nested design, the second mentioned that litter was the experimental unit, and the third used one animal from each litter, thus bypassing the issue. In fourteen studies (41%) it was not possible to determine the number of dams that were used (i.e. the sample size) and in four studies (12%) the number of offspring used was not indicated. In addition, only four (12%) reported randomly assigning pregnant females to the VPA or control group. Many studies also used only a subset of the offspring from each litter, but often it was not mentioned how the offspring were selected. Only six studies (18%) reported that the investigator was blind to the experimental condition when collecting the data. Ten studies (29%) did not indicate whether both male and female offspring were used. No study mentioned performing a power analysis to determine a suitable sample size to detect effects of a given magnitude, but this is probably fortuitous, given that only three studies correctly identified the experimental unit. It is possible that many studies did use randomisation and assess outcomes blindly, but simply did not report it. However, randomisation and blinding are crucial for the validity of the results and their omission from manuscripts suggests that they were not used. This is further supported by studies showing that when manuscripts do not mention using randomisation or blinding, the estimated effect sizes are larger than in studies that do mention using these methods, which is suggestive of bias [19–23, 29].
A number of papers had additional statistical or experimental design issues, ranging from trivial (e.g. reporting total degrees of freedom rather than residual degrees of freedom for an F-statistic) to serious. These include treating individual neurons as the experimental unit, which is common in electrophysiological studies, but just as inappropriate as treating blood pressure values taken from left and right arms as n = 2, or dissecting a single liver sample into ten pieces and treating the expression of a gene measured in each piece as n = 10. If it were that easy, clinical trials could be conducted with tens of patients rather than hundreds or thousands. Regulatory authorities are not fooled by such stratagems, but it seems many journal editors and peer-reviewers are. A list of studies can be found in Additional file 3.
Estimating the magnitude of litter effects
Table: Importance of litter effects on body weight and behavioural tests (e.g. anxiety in the open field).
How power is affected by the number of litters and animals
Some may object on ethical grounds to using so many litters and then selecting only one or a few animals from each, as there will be many additional animals that will not be used and presumably culled. Certainly all of the animals could be used, but there is almost no increase in power after three animals per litter (at least for the locomotor data) and therefore it is a poor use of time and resources to include all of the animals. One could argue therefore that it is unethical to submit a greater number of animals to the experimental procedure if they contribute little or nothing to the result. One could also argue that it is even more unethical to use any animals for a severely underpowered (or flawed) study in the first place and then to clutter the scientific literature with the results. One way to deal with the excess animals is to use them for other experiments. This requires greater planning, organisation, and coordination, but it is possible. Another option is to purchase animals from a commercial supplier and request that the animals come from different litters rather than have an in-house colony. As a side note, suppliers do not routinely provide information on the litters that the animals come from and thus an important variable is not under the experimenter’s control and cannot even be checked whether it is influencing the results.
How does litter-to-litter variation arise?
Differences between litters could exist for a variety of reasons, including shared genes and shared prenatal and early postnatal environments, but also due to age differences (it is difficult to control the time of mating), and because litters are convenient units to work with. For example, it is not unusual for litter-mates to be housed in the same cage, which means that animals within a litter also share not just their early, but also their adult environment. It is also often administratively easier to apply experimental treatments on a per cage (and thus per litter) basis rather than per animal basis. For example, animals in cage A and C are treated while cage B and D are controls. Animals may also undergo behavioural testing on a per cage basis; for example, animals are taken from the housing room to the testing room one cage at a time, tested, and then returned. Larger experiments may need to be conducted over several days and it is often easier to test all the animals in a subset of cages on each day, rather than a subset of animals from all of the cages. At the end of the experiment animals may also be killed on a per cage basis. Given that it may take many hours to kill the animals, remove the brains, collect blood, etc., the values of many outcomes such as gene expression, hormone and metabolite concentrations, and physiological parameters may change due to circadian rhythms. All of these can lead to systematic differences between litters and can thus bias results and/or add noise to the data.
There is an important distinction to be made between applying treatments to whole litters versus “natural” variation between litters. When a treatment is applied to a whole litter, as in the VPA model of autism or maternal stress models, the litter is the experimental unit and the sample size is the number of litters. Therefore, by definition, litter needs to be included in the analysis if more than one animal per litter is used (or the values within a litter can be averaged). However, if multiple litters are used but the treatment(s) are applied to the individual animals, experiments should be designed so that if litter effects exist, then valid inferences can still be made. In other words, litters should not be confounded with other experimental variables because it would be difficult or impossible to detect their influence and remove their effects. Whether litter is an important factor for any particular outcome is then an empirical question, and if it is not important then it need not be included in the analysis. However, the power to detect differences between litters will be low if only a few litters are used in the experiment and therefore a non-significant test for litter effects should not be interpreted as the absence of such effects. Analysing the data with and without litter and choosing the analysis that gives the “right” answer should of course be avoided. Flood et al. provide a nice example in the autism literature of an appropriate design followed by a check for litter effects, and then the results for the experimental effect were reported when litter was both included and excluded. Consistent with other studies demonstrating litter-effects, this paper found a strong effect of litter on brain mass.
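When treatments are applied to individual animals, one way to keep litter from being confounded with treatment is to randomise within each litter, so that every litter contributes animals to every treatment group. The sketch below is one possible allocation scheme, not a procedure taken from any of the reviewed studies. The function name, the treatment labels, and the animal and litter identifiers are all hypothetical.

```python
import random


def randomise_within_litters(litters, treatments=("vehicle", "drug"), seed=None):
    """Assign treatments to animals so that each litter is balanced.

    litters: dict mapping a litter ID to a list of animal IDs.
    Returns a dict mapping each animal ID to a treatment label.
    """
    rng = random.Random(seed)  # a recorded seed makes the allocation auditable
    allocation = {}
    for litter_id in sorted(litters):
        animals = list(litters[litter_id])
        rng.shuffle(animals)
        # Random starting point so that, in litters whose size is not a
        # multiple of the number of treatments, the extra animal does not
        # always go to the first treatment.
        offset = rng.randrange(len(treatments))
        for i, animal in enumerate(animals):
            allocation[animal] = treatments[(i + offset) % len(treatments)]
    return allocation


if __name__ == "__main__":
    example = {"litter1": ["a1", "a2", "a3", "a4"],
               "litter2": ["b1", "b2", "b3"]}
    for animal, treatment in sorted(randomise_within_litters(example, seed=42).items()):
        print(animal, treatment)
```

Because each litter then contains both treated and control animals, litter can be included in the analysis as a blocking factor. This scheme does not apply when the treatment is given to the dam, in which case whole litters must be randomised instead.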
Four ways to improve basic and translational research
Better training for biologists
Most experimental biologists are not provided with sufficient training in experimental design and data analysis to be able to plan, conduct, and interpret the results of scientific investigations at the level required to consistently obtain valid results. The solution is straightforward, but requires major changes in the education and training of biologists and it will take many years to implement. Nevertheless, this should be a long-term goal for the biomedical research community.
Make better use of statistical expertise
A second solution is to have statisticians play a greater role in preclinical studies, including peer reviewing grant applications and manuscripts, as well as being part of scientific teams. However, there are not enough statisticians with the appropriate subject matter knowledge to fully meet this demand: just as it is difficult to do good science without a knowledge of statistics, it is difficult to perform a good analysis without knowledge of the science. In addition, this type of “project support” is often viewed by academic statisticians as a secondary activity. Despite this, there is still scope for improving the quality of studies by making better use of statistical expertise.
More detailed reporting of experimental methods
Detailed reporting of how experiments were conducted, how data were analysed, how outliers were handled, whether all animals that entered the study completed it, and how the sample size was determined is required to assess whether the results of a study are valid. A number of guidelines covering these points have been proposed, including the National Institute of Neurological Disorders and Stroke (NINDS) guidelines, the Gold Standard Publication Checklist, and the ARRIVE (Animals in Research: Reporting In Vivo Experiments) guidelines. For example, ARRIVE items 6 (Study design), 10 (Sample size), 11 (Allocating animals to experimental groups), and 13 (Statistical methods) should be mandatory for all publications involving animals and could be included as a separate checklist that is submitted along with the manuscript, much like a conflict of interest or a transfer of copyright form. Something similar has recently been introduced by Nature Neuroscience. This would make it easier for reviewers, editors, and other readers to spot design and analysis issues. In addition, and more importantly, if scientists are required to comment on how they randomised treatment allocation, or how they ensured that assessment of outcomes was blinded, then they will conduct their experiments accordingly if they plan on submitting to a journal with these reporting requirements. Similarly, if researchers are required to state what the experimental unit is (e.g. litter, cage, individual animal, etc.), then they will be prompted to think hard about the issue and design better experiments, or seek advice. This recommendation will not only improve the quality of reporting, but it will also improve the quality of experiments, which is the real benefit. A final advantage is that it will make quantitative reviews/meta-analyses easier because much of the key information will be on a single page.
Make raw data available
Another solution is to make the provision of raw data a requirement for acceptance of a manuscript; not “to make it available if someone asks for it”, which is the current requirement for many journals, but uploaded as supplementary material or hosted by a third party data repository. None of the VPA studies provided the data that the conclusions were based on, making reanalysis impossible. Remarkably, of the thirty-five studies published, only one provided the necessary information to conduct a power analysis to plan a future study, and this was only because one animal per litter was used and the necessary values could be extracted from the figures. Datasets used in preclinical animal studies are typically small, do not have confidentiality issues associated with them, are unlikely to be used for further analyses by the original authors, and have no additional intellectual property issues associated with them given that the manuscript itself has been published. It is noteworthy that many journals require microarray data to be uploaded to a publicly available repository (e.g. Gene Expression Omnibus or ArrayExpress), but not the corresponding behavioural or histological data. It is perhaps not surprising that there is a relationship between study quality and the willingness to share data. Publishing raw data can be taken as a signal that researchers stand behind their data, analysis, and conclusions. Funding bodies should encourage this by requiring that data arising from the grant are made publicly available (with penalties for non-adherence).
The above suggestions would help ensure that appropriate design and analyses are used, and to make it easy to verify claims or to reanalyse data. Currently, it is often difficult to establish the former and almost impossible to perform the latter. Moreover, it is clear that appropriate designs and analyses are often not used, making it difficult to give the benefit of the doubt to those studies with incomplete reporting of how experiments were conducted and data analysed.
While it is difficult to quantify the extent to which poor statistical practices hinder basic and translational research, it is clear that a large inflation of false positive and false negative rates will only slow progress. In addition, coupled with researcher degrees of freedom and publication bias, it is possible for a field to converge to the wrong answer. Experimental design and statistical issues are, in principle, fixable. Improving these will allow scientists to focus on creating and assessing the suitability of disease models and the efficacy of therapeutic interventions, which is challenging enough.
The authors would like to thank the Siegel lab at the University of Pennsylvania for kindly sharing their data.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.