Enhancing the interpretation of statistical P values in toxicology studies: implementation of linear mixed models (LMMs) and standardized effect sizes (SESs)

In this paper, we compare the traditional ANOVA approach to analysing data from 90-day toxicity studies with a more modern LMM approach, and we investigate the use of standardized effect sizes. The LMM approach is used to analyse weight or feed consumption data. When compared to the week-by-week ANOVA with multiple test results per week, this approach results in only one statement on differences in weight development between groups. Standardized effect sizes are calculated for the endpoints: weight, relative organ weights, haematology and clinical biochemistry. The endpoints are standardized, allowing different endpoints of the same study to be compared and providing an overall picture of group differences at a glance. Furthermore, in terms of standardized effect sizes, statistical significance and biological relevance are displayed simultaneously in a graph.


90-day feeding studies
OECD has developed standard procedures employing animal models to assess the toxicity of chemical compounds to humans. In this context, repeated-dose 90-day oral (subchronic) toxicity studies are usually carried out to evaluate the toxic potential of a chemical in more detail after initial information on its toxicity has been obtained from acute or repeated-dose 28-day toxicity tests. At least three dose levels of a test substance and a concurrent control are administered daily per os for a period of 90 days to groups of animals (OECD/OCDE 2014).
This general OECD test approach has been applied to the testing of whole food/feed derived from genetically modified organisms (GMOs) in order to consider toxic effects holistically rather than for a single compound. Toxicity studies are now a mandatory part of the risk assessment of genetically modified (GM) food and feed in Europe. Although there is a fundamental difference (dosing range) between testing chemicals and whole food/feed, repeated-dose 90-day oral toxicity studies nevertheless have been included in the integrated approach of assessing the potential toxicity of GM plants (EFSA Scientific Committee 2011). The idea is to administer diets containing the plant under study as a component: in treatment groups, this component consists of GM plant material (high and low doses), and in a control group this component consists of conventional plant material. Several observation and examination data are recorded and compared between the treatment and control groups.
In this paper, we describe the statistical methods used for analysing the data from the GRACE 90-day studies (Zeljenková et al. 2014). We compare the traditional ANOVA approach with a more modern LMM approach, and we investigate the use of standardized effect sizes as proposed by EFSA (2011).

Statistical significance and biological relevance
There are several guidelines and publications dealing with the statistical treatment of toxicity study data (e.g. Anses  lications 2012). OECD in its guidance document No. 116 mentions that there is no single approach to the statistical analysis of data and that statistical methods continue to develop so that new and modified approaches may continue to be proposed (OECD Environment, Health and Safety Publications 2012). Most of the guidelines favour a traditional approach (i.e. hypothesis testing, P value), which simply asks 'Is there an effect?', while other more recently published papers promote the reporting of effect sizes and confidence intervals and to ask 'How much of an effect is there?' (Ellis 2010;Nuzzo 2014).
Most importantly, biological relevance should always be preferred over statistical significance in any evidence-based decision-making. Statistical analysis is a (undoubtedly very useful) tool for extracting information from data and helping scientists blend data and background knowledge to derive scientific conclusions-no more and no less. Denoting something as statistically significant does not mean it is biologically relevant. Statistical significance is determined by the precision of the measurements, and as such is not connected to the biological relevance of observed differences. Therefore, another element has to enter the discussion if biological relevance is of prime importance, as it is for decision-making in risk management. This element is the setting of limits for relevance, called 'equivalence limits' (European Commission 2013) or alternatively 'limits of concern' (EFSA Panel on Genetically Modified Organisms 2010). Statistical measures like 'significant' test results and P values always need interpretation, when one considers what they really mean: the chance of observing data under the assumption of a null hypothesis (of no correlation or no effect); therefore, they only reflect the likelihood that the null hypothesis is true. When the UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definite decision basis. However, this was the beginning of a movement towards rigorous and objective decision-making based on P values, statistical power, false positives and false negatives-and disregarding the biological interpretation by simply classifying results as significant or not significant (Nuzzo 2014).
The discussion on the different number of significant differences reported by Lemen et al. (2002) and Séralini et al. (2007) when analysing the same MON863 90-day feeding study very nicely demonstrates this dilemma. EFSA (2007) summarized that both studies reported significant differences for the same 25 endpoints. Moreover, Lemen et al. (2002) described a further 10 significant differences not reported by Séralini et al. (2007), while Séralini et al. (2007) pointed out a further 13 significant differences not reported by Lemen et al. (2002). Furthermore, Séralini et al. (2007) found significant differences in 40 out of 494 tests and claimed that only 25 would be expected by chance alone. Such counting only causes confusion and uncertainty. As EFSA emphasizes in its study, statistically significant differences must be evaluated with respect to their biological relevance. This is equally true for non-significant differences as it would be unacceptable if biologically relevant effects went unnoticed for the lack of statistical power. For this reason, a prospective power analysis has been made mandatory in GMO risk assessment (EFSA Scientific Committee 2011). In summary, the relevance of statistical significance is limited.
Traditional P value approach versus LMM and SES When performing 90-day toxicity feeding studies, two types of endpoints are usually analysed: weight and feed consumption are recorded weekly ('weight and feed consumption data'). Organ weights, haematology and clinical biochemistry, as well as gross necropsy and histopathology parameters, are surveyed once at the end of the study ('other endpoints'). All these endpoints are compared between the groups and tested with relevant baseline values to identify any test substance-and dose-dependent toxic responses.
The traditional approach (i.e. hypothesis testing, P value) focuses on the analysis of variance (ANOVA) followed by post hoc tests. ANOVA compares group (treatments, control) means separately for each factor level (e.g. gender: male/ female) and separately for each endpoint (i.e. weight data are Fig. 2 Simplified version of a graph allowing visual assessment of statistical significance and biological relevance of group comparisons. The SES point estimate (circle) and the 95 % confidence limits (whiskers, bars showing confidence interval) illustrate the (standardized) effect size between two groups. The vertical black line indicates no effect (zero difference), and vertical grey lines indicate biological relevance limits (here 1.0 SD, according to the study design). If the confidence interval bars cross the zero line but not the grey lines, therefore lie within the ±1.0 limits, there is evidence for no statistical significance as well as no biological relevance (case a). Two groups are significantly different when the confidence interval bars do not cross the black vertical line (cases b, c). The effect size between two groups is supposed to be biologically (toxically) relevant, when the confidence interval bars lie outside the ±1.0 limits (case c). Case b indicates statistical significance, but no clear biological relevance. Case d indicates no statistical significance, but no clear negation of biological relevance [reproduced from Zeljenková et al. (2014)] also independently analysed week-by-week). The choice of statistical method depends on whether the data are qualitative or quantitative and whether the generic assumptions underlying the specific test are met (OECD Environment, Health and Safety Publications 2012). Figure 1 presents a typical decision tree for the choice of statistical tests when analysing toxicity studies. Following the logic of this decision tree, ANOVA is applied for quantitative data, independent observations with normally distributed residuals and with equal variances in the groups, whereas nonparametric tests are applied for qualitative data and when the assumption of normality and/or variance homogeneity are not met (according to the normality and variance homogeneity tests indicated in Fig. 1). Nonparametric tests are usually limited to these cases, since they have lower power compared to their parametric counterparts when the corresponding assumptions are met. Most of the endpoints in 90-day toxicity studies are quantitative: body and organ weights, haematology and clinical biochemistry data are continuous data, and numbers of blood cells are discrete counts. Nevertheless, the assumptions of normal distribution and variance homogeneity are often not met. In this case, data may be transformed (logarithmic, logit, square root transformation) to improve the normality or variance homogeneity. ANOVA is initially applied to test the overall hypothesis that there are no differences among the group means. In the event that ANOVA delivers a significant result, certain differences between two groups are examined case-by-case applying either post hoc tests or orthogonal contrasts. The most frequently used post hoc tests are Dunnett's test to compare each treatment group with the control and Tukey's test to compare groups pairwise. Gross necropsy and histopathology data are qualitative data (categorical, binary or ordinal). For qualitative data and quantitative data in which the ANOVA assumptions are not met, the Kruskal-Wallis test is applied as an overall test of significant differences, and the Wilcoxon test is applied to individually compare two groups. Note that these nonparametric tests assume equal variances as well, and in case of heteroscedasticity, the Kruskal-Wallis test is not better than an ANOVA. Nevertheless, in view of the lack of any alternative for a nonparametric test, the Kruskal-Wallis is named by the OECD Guidance Document 116 (OECD Environment, Health and Safety Publications 2012). Since ANOVA tolerates deviations from the assumptions and parametric tests are usually more powerful and versatile, it is sometimes applied to all variables. A more conservative approach is to apply only nonparametric tests to all variables.
Finally, all results are presented in tables (group means and standard deviations per factor level) and bar or line graphs with asterisks marking significant differences.
Linear mixed models (LMMs) allow weight development in growth curves to be analysed as repeated measurements over time. They are robust with respect to the assumptions of normal distribution, homogeneity of variance and error independence. Moreover, they allow a comprehensive analysis of weight data of a study, thereby incorporating all model factors with interactions such as group, dose and gender plus the development over time. When compared to the week-by-week ANOVA with multiple test results (for all group comparisons) per week, this approach results in only one statement on differences in weight development between groups. Taking time as a fixed factor to indicate repeated measurements allows modelling of time and interactions as well as taking account of serial correlations and reducing residual variance.
An effect size in a toxicology study endpoint is the difference (e.g. treatment vs. control) of means per group. Whether the size of an effect is biologically relevant has to be assessed by comparing it to an equivalence limit or limit of concern set by a toxicologist or other expert. A standardized effect size (SES) is the difference between two group means divided by a standardizing factor, for which EFSA (2011) has proposed the pooled standard deviation (SD). With this standardization, all endpoints are transformed and expressed in SD units, allowing comparison of different endpoints (organ weights, haematology and clinical biochemistry parameters) of the same study (Festing 2014). Therefore, an overall picture of group differences is provided at a glance. Furthermore, SES enables statistical significance and biological relevance (in SD units) to be illustrated simultaneously when the equivalence limits are also indicated in the display. EFSA (2011) gives the example where differences of one unit of SD are considered of little toxicological relevance. The equivalence limits can then be set at 1 for the SES, and in this work, we will follow this example.

Materials and methods
We used data from two 90-day feeding trials with two different GM maize MON810 varieties performed within the GRACE project (GMO Risk Assessment and Communication Each animal was weighed on the first day of the feeding trial, once weekly during the feeding trial and at the end of the feeding trial. Feed consumption was determined once weekly and reported as the total amount of feed consumed by two animals in one cage per week. White blood cell count (WBC), red blood cell count (RBC), haemoglobin concentration (HGB), haematocrit (HCT), mean cell volume (MCV), mean corpuscular haemoglobin (MCH), mean corpuscular haemoglobin concentration (MCHC), platelet count (PLT), lymphocyte count (LYM) and differential leucocyte count parameters were measured for the haematology analyses. For the clinical biochemistry analyses, the parameters alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), albumin (ALB), total protein (TP), glucose (GLU), creatinine (CREA), urea (U), cholesterol (CHOL), triglycerides (TRG), calcium (Ca), chloride (Cl), potassium (K), sodium (Na) and phosphorus (P) were measured. Moreover, the wet weight of the kidneys, spleen, liver, adrenal glands, pancreas, lung, heart, thymus, testes, epididymides, uterus, ovaries and brain of all animals was recorded (The collated primary data are available through the website http://www. cadima.info.). Two animals were housed per cage. Consequently, the cage was considered the experimental unit and means per cage were calculated for all measurements prior to the statistical analysis.
Data check and quality control Raw data from both trials were screened for their structure and data, and variable definitions were determined. Based on these definitions, an SAS analysis data set was created. Mean values per cage (experimental unit) were calculated for all endpoints except feed consumption. Secondary variables like weight gain per week or organ/body weights were re-computed. All variables were formatted and labelled. The SAS data set was locked to exclude further modifications. An SPSS data file and an Excel file were exported from this SAS data set.
Data were screened for outliers and extreme values. Box and whisker plots were created for each gender-group factor level combination and all variables to identify extreme values (variable values within the 1.5* and 3* interquartile range and variable values outside the 3* interquartile range). Extreme values were recorded in an Excel sheet for easier identification of irregular patterns or abnormal animals. Growth curves for all animals were plotted (scatter plots, weight against study day) and visually inspected for irregular patterns.
To describe the data, summary statistics including means, standard deviations, 95 % confidence intervals, medians, number of valid values, minima and maxima were calculated and tabulated. In addition to the box and whisker plots, plots of means and 95 % confidence intervals were drawn. Descriptive analysis was performed separately for each gender and group.

3
To check the normality of the data, Kolmogorov-Smirnov tests (with Lilliefors correction) and Shapiro-Wilk tests were performed. When significances were identified, the corresponding normal Q-Q plots were displayed.

Analysis of weight data
Firstly, a traditional analysis with ANOVA was carried out for weight and feed consumption data, separately for each gender and each week. For four comparisons of particular interest (control-GMO33 %, control-GMO11 %, control-conventional 1, control-conventional 2), post hoc Dunnett's tests were performed. There were no missing data, and the data set was fully balanced in each week; therefore, the default type III sum of square procedure was used for the ANOVA. Levene's test to check homogeneity of variance was applied. Test results were presented in tables of means and standard deviations, where all means of groups GMO11 %, GMO33 %, conventional 1 and conventional 2 differing significantly from control group means were marked with asterisks.
Secondly, weight and feed consumption data were analysed with mixed models, using the restricted maximum likelihood (REML) algorithm with Toeplitz covariance structure. Group (five levels) was considered a fixed factor. The factor week (time in weeks from the start of the experiment) or day (time in days from the start of the experiment) was considered a continuous fixed factor. For the resulting least square means, standardized effect sizes as well as their 95 % confidence intervals were calculated according to Nakagawa and Cuthill (2007).

Analysis of all other endpoints
Firstly, a traditional frequentist analysis with ANOVA and N-sample nonparametric tests was carried out for all other endpoints separately for each gender, and post hoc Dunnett's tests and two-sample nonparametric Wilcoxon tests were performed for four comparisons of particular interest (control-GMO33 %, control-GMO11 %, control-conventional 1, control-conventional 2).
Secondly, for all other endpoints, standardized effect sizes as well as their 95 % confidence intervals were calculated according to (Nakagawa and Cuthill 2007). The same four group pairs were compared with each other: control-GMO11 %, control-GMO33 %, controlconventional 1 and control-conventional 2. A bootstrap test was applied to compare the variability within paired sets of SES (Festing 2014). The idea of this test is to investigate whether variation among the SES in the control versus GM is greater than in the control versus conventional groups (and thus indicating that the GM food is toxic).

Graphical presentation of all results
All SES estimates were illustrated graphically, displaying both statistical significance and biological relevance for each of the endpoint comparison results (Fig. 2). Biological relevance here was supposed to be defined by equivalence limits of ±1.0 SD, as proposed by EFSA (2011). Body weight plus all other endpoints were shown on the same graph (separately for male and female), thereby forming an overall pattern and allowing the assessment of group comparisons at a glance.
For all analyses, we used SAS (SAS Software, version 9.4. Copyright, SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA). The growth curves were also created with SAS, while the SES graphs were created with SPSS (SPSS for Windows, version 12.0. Chicago, SPSS Inc.).

Results
In this paper, we show only the results for male rats in feeding trial B of the GRACE study (Zeljenková et al. 2014) to compare the traditional and the enhanced approach. The full statistical report by Schmidt and Schmidtke (2014) is available under www.cadima.info.

Data quality and distribution check
The plotted growth curves did not show any irregular pattern over time. The box plot inspections identified some extreme values, mainly in the haematology and clinical biochemistry data. Most data were confirmed by the study director as not being erroneous. Two biochemistry results were excluded due to the fact that the measured values were outside the dynamic range of the analyser (animal ID    Table 2 shows the means and standard deviations for each group and each week. The weekly weight development of all groups is displayed in a line graph (Fig. 3).
The results of the LMM analysis are shown in Table 3(a). Significant effects of intercept were expected from the model choice. Since growth rates differ over time, a significant week/day effect was also expected. There is no group effect for weight development, and neither is there an interaction between group and week/day. Table 3(b) shows the least square weight means (i.e. the mean weights over time) for the five groups. Table 3(c) shows the differences between least square means, indicating that no difference is significant.
The SES and confidence intervals for the least square means are shown in Table 3(d).

Other endpoints
Results of normality and variance homogeneity testing for all other endpoints are presented in Table 4. Significant test results, indicating that data are not normally distributed or variances are not homogeneous, are italicized. Consequently, column 10 states whether parametric or nonparametric tests should be applied.
The results of ANOVA and N-sample nonparametric tests for all other endpoints are shown in Table 5. Significant test results are italicized. Column 3 lists the ANOVA test results (P values) for the overall test hypothesis that there are no differences between the five groups. Column 9 includes the test results (P values) of the nonparametric counterpart (Kruskal-Wallis test). Columns 4-7 show the Dunnett's test results (P values) post hoc to ANOVA for the four pairwise group comparisons of interest, while columns 9-12 show the corresponding nonparametric test results (Wilcoxon test post hoc to Kruskal-Wallis).
As is usual, all test results are presented in the form of tables with means and standard deviations for each group, and significant differences are marked with asterisks (Table 6).
Standardized effects sizes (SES) and confidence intervals are shown in Table 7. Confidence intervals not including the zero value, therefore indicating a significant difference between the groups, are italicized.
Results of ANOVA/Dunnett's and Kruskal-Wallis/Wilcoxon test, respectively, can be directly compared with SES and their confidence intervals aligning Tables 6 and 7. It is obvious that the patterns created by highlighting significances in italics are the same and that both approaches identified the same significances.
The results of the bootstrap test (Table 8) indicate that there are no overall differences between the groups for all comparisons of interest; i.e. that variation among the SESs does not differ between the control versus GM and the control versus conventional groups.

Graphical presentation of all results
The SES of all endpoints (body weight, organ weights, haematology and clinical biochemistry) with their confidence intervals is graphically displayed in Fig. 4. Each graph displays all information extricable from a group comparison. The graphs illustrate the pattern of significances better than the asterisk-marked tables. Moreover, not only the 'yes' (= italicized or marked with asterisks)/'no' significances but also the sizes of the effects are visualized. Additionally, the biological relevance of the effects (defined here by equivalence limits of ±1.0 SD-dotted lines) can be directly assessed.
The four graphs in Fig. 4 display the four comparisons of interest: control-GMO33 %, control-GMO11 %, control-conventional 1, control-conventional 2. Placing these graphs next to one another allows a direct visual comparison of all comparison patterns. This is the most effective way to assess the outcome of a feeding study at a glance.

Discussion
The availability of software for fitting LMMs has facilitated their application in biological sciences. Applying linear mixed models to assess developing endpoints like weight allows these data to be analysed in a more comprehensive and consolidated way and facilitates interpretation. First of all, these models enable the complete weight or feed consumption trend to be evaluated and compared, instead of individual points in time. They provide a global statement on group/treatment differences, which is much easier to interpret than a diverse set of single significances between different groups at various points in time of the study. Furthermore, by considering time dependency and averaging over time points, LMMs are more robust against certain deviations from the assumptions on data distribution and therefore model such data more precisely.
The traditional approach (OECD Environment, Health and Safety Publications 2012) applies one-way ANOVA in case of normally distributed variables and equal variation within the treatment groups, and Kruskal-Wallis ANOVA if these assumptions are not met. However, in case of heteroscedasticity also the Kruskal-Wallis test may give inaccurate results; therefore, both approaches are incorrect and will not help. An alternative approach is to apply Welch's ANOVA test (Kohr and Games 1974), which in turn has been criticized to be unable to handle skewed distributions (Skovlund 2010). Neuhäuser (2010 proposes to apply the generalized Wilcoxon test by Brunner and Munzel (2000). Nevertheless, in simulation studies it has been shown that non-robustness remains a serious problem with all tests, if assumptions of normality and variance homogeneity are not met and a final advice is yet to be given (Skovlund 2010). However-in the absence of an appealing alternative-for our comparison of the LMM and SES approach with the traditional one, we followed the OECD Guidance Document 116 and applied the Kruskal-Wallis test, but it is obvious that the flexible modelling of the variances is a further advantage of the LMM approach.
Reporting and graphically displaying effect sizes and confidence intervals can help to avoid the yes/no decision trap of statistical tests and to illustrate the size of effects in the context of biological relevance. This is supported by several publications in the area of toxicology, particularly by Festing (2014), who demonstrated the use of SES as a data transformation, which can be used in addition to existing techniques to clarify the results of toxicity tests. OECD (2012) states that emphasizing the size of effects and the confidence in them avoids the problem of a small, biologically unimportant effect being declared statistically significant and the artificiality of trying to dichotomize a result into a positive or negative finding on the basis of a P value. Furthermore, owing to standardization, all endpoints might be displayed in one graph, allowing a pattern of effects to be assessed instead of single means and significant differences.
In principle, SES might support the assessment of statistical significances with respect to their biological relevance. Since they consider the effects, i.e. the differences in endpoints between treatments, they allow an assessment of the toxicological relevance of the sizes of these effects provided that limits for effect sizes of biological/toxicological relevance are also expressed on the same scale. For our study, we followed EFSA (2011) and applied a rough setting of the equivalence limits of ±1.0 SD by assuming that an SES of 1.0 SD or less is unlikely to be of toxicological importance.
There are several issues about standardization that are open to discussion and could be chosen differently. First, the standardization and setting of equivalence limits on a dimensionless scale (as multiples of standard deviation) might be too abstract for interpretation. Toxicologists might prefer to think in the original scales of the various endpoints. Consequently, they might prefer to set equivalence limits or limits of concern individually for each endpoint and each scale. Second, the pooled standard deviation of individual observations SD is determined by both natural variation and measurement uncertainty, and is a priori not expected to be directly related to biological relevance. If external equivalence limits were available, it would be preferable to use these for standardization. Moreover, to assess the relevance of the data of a feeding study, toxicologists compare correlated parameters (like: liver weight, liver necropsy and certain blood values).
The effect size presentation, either supplementing or replacing the traditional P value approach, enhances transparency and delivers a more comprehensive overall picture of the information derived from the data, which might support consensus in a decision-making process between all actors involved, namely toxicologists, statisticians and regulators. Furthermore, it helps communicate study results to the public in a more easily understood way.