Variability of control data and relevance of observed group differences in five oral toxicity studies with genetically modified maize MON810 in rats

The data of four 90-day feeding trials and a 1-year feeding trial with the genetically modified (GM) maize MON810 in Wistar Han RCC rats performed in the frame of the EU-funded project GRACE were analysed. Firstly, the data obtained from the groups fed the non-GM maize diets were combined to establish a historical control data set for Wistar Han RCC rats at the animal housing facility (Slovak Medical University, Bratislava, Slovakia). The variability of all parameters is described, and the reference values and ranges have been derived. Secondly, the consistency of statistically significant differences found in the five studies was analysed. In order to do so, the body weight development, organ weight, haematology and clinical biochemistry data were compared between the studies. Based on the historical control data, equivalence ranges for these parameters were defined, and the values measured in the GM maize-fed groups were compared with these equivalence ranges. Thirdly, the (statistical) power of these feeding studies with whole food/feed was assessed and detectable toxicologically relevant group differences were derived. Linear mixed models (LMM) were applied, and standardized effect sizes (SES) were calculated in order to compare different parameters as well as to provide an overall picture of group and study differences at a glance. The comparison of the five feeding trials showed a clear study effect in the control data. It also showed inconsistency both in the frequency of statistically significant differences and in the difference values between control and test groups.


Introduction
Feeding studies with whole food/feed to assess the safety of genetically modified crops: the EU-funded project GRACE
The safety of GM plants is a subject of intense political and societal debate, characterized by diverging positions in different EU Member States. For example, there is an ongoing debate regarding the necessity of 90-day animal feeding studies to assess the safety of genetically modified (GM) plants. In 2013, a new Commission Implementing Regulation (European Commission 2013) came into force, which made 90-day feeding studies with whole food/feed mandatory in the frame of the safety assessment process of GM plants and derived products. This move was the result of a long-lasting discussion between the Member States and the European Commission aiming to incorporate the European Food Safety Authority (EFSA) Guidance for the GMO risk assessment into a legal text.
Before the Implementing Regulation became effective, 90-day animal feeding studies were only requested by EFSA in indicated cases (EFSA 2011a). Given the now mandatory but untargeted nature of the 90-day feeding trials, the challenge was to determine the scientific value and limitations of such studies and how they should be interpreted within the risk assessment process. In this context, the EU-funded project GRACE comparatively evaluated the use of 90-day animal feeding trials, animal studies with an extended time frame as well as analytical, in vitro and in silico studies on GM maize in the GMO risk assessment process.
The performance of subchronic toxicity studies in the form of 90-day feeding trials and of chronic toxicity studies in the form of 1-year feeding trials using whole food or feed as test material is challenging. This is due to the fact that the available internationally accepted test guidelines such as the OECD Test Guidelines 408 (OECD 1998) and 452 (OECD 2009) were originally developed to test the potential toxicity of chemicals.
Whole food or feed differs substantially from chemicals as test material for this kind of study. Chemicals can be administered at dose levels that are much higher than those to which human beings are exposed. Such a testing approach is not possible with whole food or feed, since the incorporation of high amounts of a crop into whole food or feed might lead to a nutritional imbalance. Therefore, to test the safety of GM crops, the OECD Test Guidelines used to test the potential toxicity of chemicals need to be adapted to meet the constraints of whole food or feed testing.
Therefore, EFSA developed a guidance document/recommendations for the performance of feeding studies on whole food and feed in rodents based on the OECD Test Guidelines 408 and 453 (EFSA 2011b, 2013, 2014), but these documents do not provide prescriptive test protocols to carry out such experiments. Hence, four 90-day feeding trials and a 1-year feeding trial with whole GM and non-GM plant material to validate and refine the suggested approaches were performed in the frame of the GRACE project (see below).

Specific challenges of feeding trials with whole food/ feed
From the point of view of the frequency with which parameters are assessed, two types of parameters are analysed in feeding studies. On the one hand, body weight and feed consumption are recorded once per week ("weight and feed consumption data"). On the other hand, organ weights, haematology, clinical biochemistry, urinalysis as well as gross necropsy and histopathology parameters are determined once at the end of the study ("other parameters"). All these parameters are compared between the groups and tested with respect to relevant baseline values to identify any test substance- and dose-dependent toxic responses.
The first critical issue to be pointed out is that the different parameters are measured in different scales and in different units. The toxicological relevance of deviations differs from parameter to parameter. Nevertheless, for obvious reasons, all parameters are measured for the same fixed number of animals, as proposed, e.g. by OECD and EFSA. This approach results in a broad range of (possibly insufficient) statistical power values.
The second issue is that historical data (from the respective lab) are very important when assessing the toxicological relevance of group differences measured in a feeding study. This is due to the fact that a natural variation among rats of an outbred strain such as the Wistar Han RCC strain used in all feeding trials performed in the course of the GRACE project does in fact exist. As stated, for example, in the OECD Test Guideline 452 (OECD 2009), historical control data may be valuable in the interpretation of the results of the study, e.g. in the case when there are indications that the data provided by the concurrent controls are substantially out of line when compared to recent data from control animals from the same test facility/colony. Historical control data, if evaluated, should be submitted from the same laboratory and relate to animals of the same age and strain generated during the 5 years preceding the study in question (OECD 2009). Unfortunately, no historical data regarding feeding trials with diets containing up to 33 % maize were available at the animal housing facility of the Slovak Medical University in Bratislava, Slovakia, in which all feeding trials performed in the frame of the GRACE project were run. Consequently, minimum differences of toxicological relevance could not be set for a prospective power analysis.
The third issue is related to the limited possibility of adding the GM crop to the feed at an inclusion rate that eventually will induce signs of toxicity and, thus, will allow a dose-response relationship to be observed. In the frame of toxicity studies, chemicals are usually administered at dose levels that are much higher than the probable human exposure levels, thereby leading to toxicologically relevant results in the high-dose groups. Such an approach is not always possible with whole food or feed, since high inclusion rates might result in nutritionally imbalanced diets (EFSA 2011b). EFSA (2014) suggested a 50 % incorporation rate as a reference value for a high dose of genetically modified maize in 90-day feeding trials in rodents, based on a study by Zhu et al. (2013) in Sprague-Dawley rats fed the glyphosate-resistant G2-aroA maize. At present, it remains unknown whether an incorporation rate of 50 % maize will lead to a nutritional imbalance in long-term feeding trials with a duration ≥1 year.
To handle the issue of statistical power in the case of whole food/feed studies, EFSA proposed to base the sample size on a pre-specified effect size defined in standard deviation (SD) units, thereby providing an example in which, based on data from previous toxicity tests and for all parameters, a difference of one SD or less is of little toxicological relevance (EFSA 2011b). EFSA also recommends the use of fewer dose levels but more animals in the control and high-dose groups to maximize the power of the study. To handle the issue of toxicological relevance and to identify the range of the natural (non-positive) variability of parameters, EFSA recommends considering historical background data available in the actual testing facility and, if such data are not available, including further reference groups.
Two different MON810 varieties with their corresponding near-isogenic controls and four conventional maize varieties were used as plant material in the studies described here. The test approaches comprised the following four 90-day feeding trials as well as a 1-year feeding trial:
• The rat feeding trials A, B and C were conducted by taking into account the EFSA Guidance on conducting repeated-dose 90-day oral toxicity study in rodents on whole food/feed (EFSA 2011b) and the OECD Test Guidelines 408 and 452 (OECD 1998, 2009), and the results of these feeding trials have recently been published (Zeljenková et al. 2014, 2016).
• The longitudinal and metabolomics 90-day studies, as they were originally named by the GRACE consortium, refer to 90-day feeding trials performed in the same way as were the studies A and B with one exception: In the studies D and E, blood and urine were collected from the tail vein of the animals at day 7 and after 1, 2 and 3 months for immunological and metabolomics analyses, while in the studies A and B blood (no urine) was collected once at the end of the feeding trial.

Aims of the present study
In a first step, the data sets obtained in the feeding trials D and E, including the daily clinical observations, the ophthalmological findings, the body weight of the animals, the relative organ weights, the haematology and clinical biochemistry parameters as well as the gross necropsy and histological findings, are reported and discussed.
In a second step, the data sets of the feeding trials D and E together with those of the recently published studies A, B and C (Zeljenková et al. 2014, 2016) were used to address the specific issues described above. Specifically,
• "historical" control data of Wistar Han RCC rats fed whole food/feed diets containing 33 % non-GM maize for the animal housing facility at the Slovak Medical University (Bratislava, Slovakia) were compiled;
• the homogeneity of the historical control data was assessed;
• the toxicological relevance of any observed statistically significant differences between the control and test groups was analysed by assessing the consistency of such differences between GM and non-GM diet groups across the five studies and by comparing the data obtained with the GM-containing diets to the historical data;
• the (statistical) power of feeding studies with whole food/feed to detect toxicologically relevant group differences was assessed.

Materials and methods
Data from five 90-day feeding trials (studies A-E) performed in the frame of the GRACE project were available (Table 1). The data of the feeding trials A and B were analysed by Zeljenková et al. (2014; for details see Schmidt and Schmidtke 2014), those of the feeding trial C by Zeljenková et al. (2016; for details see Schmidt et al. 2015a) and those of the feeding trials D and E in this report (for details also see Schmidt et al. 2015b). The design of the 90-day studies A, B, D and E was the same with one exception: Blood and urine were collected at day 7 and after 1, 2 and 3 months for immunological and metabolomics analyses in the studies D and E, while blood, but no urine, was collected once at the end of the feeding trials A and B. For all trials, data quality and distribution checks and thereafter the above-mentioned individual analyses were performed (as described in Schmidt et al. 2016). Raw data and statistical analysis reports of all studies are available at the CADIMA website. The open-access database CADIMA is a non-profit internet portal aiming to increase the transparency and traceability of information associated with the impact/risk assessment of plant genetic improvement technologies. Among other things, it grants access to raw data generated by associated research activities (see www.cadima.info).

Maize varieties and diets
Studies A and B used maize material harvested in the 2012 season in Pla de Foixa (Girona, Catalonia, Spain, 42°05′N, 3°E). Studies D and E used the same batch of diets as studies A and B, respectively, but the diets were stored further on at −21 °C for 10 months. Study C used maize material from the 2013 harvest from the same region, but from different sites than the other studies. The maize varieties and the diets used are listed in Table 1.

Data structure
In all trials, the two treatment groups (
Body weight and feed consumption data: Each rat was weighed 14 times, starting on the first day of the trial and then proceeding on a weekly basis until the last weighing 13 weeks later. Feed consumption was also determined once per week and reported as the total amount of feed consumed by two animals in one cage per week.
All other parameters: The following haematology parameters were determined: the white blood cell count (WBC), red blood cell count (RBC), haemoglobin concentration (HGB), haematocrit (HCT), mean cell volume (MCV), mean corpuscular haemoglobin (MCH), mean corpuscular haemoglobin concentration (MCHC), platelet count (PLT) and lymphocyte count (LYM) as well as the differential leukocyte count. The following clinical biochemistry parameters were measured: alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), albumin (ALB), total protein (TP), glucose (GLU), creatinine (CREA), urea (U), cholesterol (CHOL), triglycerides (TRG), calcium (Ca), chloride (Cl), potassium (K), sodium (Na) and phosphorus (P). Moreover, except for trial C (since this was a 1-year feeding trial, rats were not sacrificed at t = 90 days), the weight of the kidneys, spleen, liver, adrenal glands, pancreas, lung, heart, thymus, testes, epididymides, uterus, ovaries and brain was recorded (the collated primary data are available on the website http://www.cadima.info). An overview of the data sets determined in all five studies is shown in Fig. 1.
In all five feeding trials, two animals were held per cage. According to the EFSA Guidance Document (EFSA 2011b), the cage was defined as the experimental unit and means per cage were calculated for all parameters prior to the statistical analysis. All animals survived until t = 90 days, i.e. there were no missing data.
It has to be pointed out that statistical analyses were only performed on the quantitative parameters, i.e. the body weight, haematology, clinical biochemistry and relative organ weight data. Clinical findings were only, if at all, sporadically observed and, therefore, did not undergo any sort of statistical analysis (for details see Schmidt and Schmidtke 2014; Zeljenková et al. 2014, 2016; Schmidt et al. 2015a, b).

Individual study analyses
For each individual study, body weight and feed consumption data were analysed by applying mixed models, separately for each gender, and by using the restricted maximum likelihood (REML) algorithm with Toeplitz/AR(1) covariance structure. This approach considers the changes in both repeatedly measured parameters over all points in time and is a much more generalized model than nonlinear models or growth curves. The group (five/four/three levels) was considered a fixed factor. The factor week (time in weeks from the start of the experiment) or day (time in days from the start of the experiment) was considered a quantitative fixed factor (repeated measurements). For the resulting least square means, standardized effect sizes with their 95 % confidence intervals were calculated for the comparisons of particular interest (mainly 11 % GMO vs. control and 33 % GMO vs. control, in some studies also conventional vs. control) according to Nakagawa and Cuthill (2007). Moreover, a "classical" analysis of variance (ANOVA) was carried out, separately for each gender and each week. For the comparisons of particular interest (11 % GMO vs. control, 33 % GMO vs. control and conventional vs. control), post hoc Dunnett's tests were performed after the ANOVA.
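As an illustration of what a standardized effect size with its confidence interval expresses, the following minimal sketch computes a two-group SES from raw values. It is not the procedure used in the studies, which derived SES from mixed-model least square means in SAS. The function name, the pooled-SD standardization and the large-sample standard error approximation are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def standardized_effect_size(test, control, level=0.95):
    """Return (d, lower, upper) for the difference test minus control.

    d is the mean difference divided by the pooled SD. The confidence
    interval uses a large-sample normal approximation (an assumption of
    this sketch, not necessarily the formula used in the studies).
    """
    n1, n2 = len(test), len(control)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = ((n1 - 1) * variance(test) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    d = (mean(test) - mean(control)) / sqrt(pooled_var)
    # Approximate standard error of d for two independent groups.
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, d - z * se, d + z * se


# Hypothetical cage means, for illustration only.
d, lower, upper = standardized_effect_size([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(round(d, 3), round(lower, 3), round(upper, 3))
```

Because the result is expressed in SD units, parameters measured on different scales can be placed on one common axis.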
In the case of all other parameters, standardized effect sizes as well as their 95 % confidence intervals were calculated for the same group pairs and separately for each gender (Festing 2014; Schmidt et al. 2016). All standardized effect size (SES) estimates are graphically shown, thereby displaying both statistical significance and presumptive decision thresholds for the toxicological assessment of each of the parameter comparison results (Fig. 2). Setting equivalence limits based on toxicological relevance is not easy and remains the subject of continued debate. In this paper, a working value for toxicological relevance is pragmatically set at equivalence limits of ±1.0 SD, as previously used in a simple example by EFSA (EFSA 2011b). It should therefore be kept in mind that future decisions on relevant equivalence limits may influence the equivalence results as presented in the current paper. The body weight as well as all other parameters are shown in the same graph (separately for male and female rats), thereby forming an overall pattern and allowing the assessment of group comparisons at a glance. Again, an additional "classical" analysis with an ANOVA/post hoc Dunnett's tests or nonparametric Kruskal-Wallis/Wilcoxon tests was carried out for each gender.
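The reading of an SES confidence interval against the zero line and the ±1.0 SD limits (cases a to d in Fig. 2) can be written down as a small decision rule. The sketch below is one possible encoding. The function name and the treatment of intervals that exactly touch a limit are assumptions.

```python
def classify_ses_interval(lower, upper, limit=1.0):
    """Map a confidence interval for an SES to case 'a', 'b', 'c' or 'd'.

    a: not significant, interval entirely within +/- limit
    b: significant, but interval not entirely beyond a limit
    c: significant and interval entirely beyond a limit
    d: not significant, but interval extends beyond a limit
    """
    significant = lower > 0 or upper < 0          # interval excludes zero
    within_limits = -limit < lower and upper < limit
    beyond_limit = lower > limit or upper < -limit
    if not significant:
        return "a" if within_limits else "d"
    return "c" if beyond_limit else "b"


# Hypothetical intervals, for illustration only.
for interval in [(-0.5, 0.5), (0.2, 0.8), (1.2, 2.0), (-0.3, 1.4)]:
    print(interval, classify_ses_interval(*interval))
```

With the working limit of ±1.0 SD used in this paper, only case c would point to a difference that is both statistically significant and presumptively toxicologically relevant.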

Compilation of historical control data from the groups fed 33 % non-GM maize-containing diets
In the five feeding trials, two control and four other conventional diets (hereafter collectively named "non-GM diets") were used (Table 2). The parameters measured in these groups were combined to compile historical non-GM control data (point and interval estimates) for Wistar Han RCC rats fed a diet containing 33 % conventional maize in the animal housing facility at the Slovak Medical University (Bratislava, Slovakia). Reference statistics (mean, SD, minimum, maximum, percentiles: median, 5 and 95 % percentiles) were calculated for the weight and laboratory parameters from all non-GM groups in the five feeding trials. To describe the variability, two measures were used: the coefficient of variation (relative SD, i.e. ratio of the SD to the mean) and the relative width of the 90 % central interpercentile range (5-95 % percentile) to the mean.
The distribution of the non-GM diet-related data was graphically displayed in boxplots, separately for male and female animals. The boxplots show the median, the interquartile range (= box length) and extreme cases of individual variables in two categories, namely values between 1.5 and 3 box lengths from the upper or lower edge of the box and values more than 3 box lengths from the upper or lower edge of the box (shown in the statistical analysis reports: Schmidt and Schmidtke 2014; Schmidt et al. 2015a, b). All calculations and graphs were based on the values per cage (experimental unit).
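The reference statistics and the two variability measures described above can be sketched for a single parameter as follows. The function name is hypothetical. The percentile definition (linear interpolation, "inclusive" method) is an assumption and may differ from the one used by SAS, and the text does not state whether the full or the half width of the 5-95 % range was related to the mean, so the full width is used here.

```python
from statistics import mean, median, quantiles, stdev


def reference_statistics(cage_values):
    """Summarize one parameter over all non-GM cages (one value per cage)."""
    m = mean(cage_values)
    sd = stdev(cage_values)
    # 19 cut points at 5 %, 10 %, ..., 95 %; first and last are P5 and P95.
    cuts = quantiles(cage_values, n=20, method="inclusive")
    p5, p95 = cuts[0], cuts[-1]
    return {
        "mean": m,
        "median": median(cage_values),
        "sd": sd,
        "min": min(cage_values),
        "max": max(cage_values),
        "p5": p5,
        "p95": p95,
        "cv": sd / m,                       # coefficient of variation
        "rel_range_90": (p95 - p5) / m,     # relative 90 % interpercentile range
    }


# Hypothetical cage means, for illustration only.
stats = reference_statistics([float(v) for v in range(1, 22)])
print(stats["p5"], stats["p95"], round(stats["cv"], 3))
```

Both variability measures are unit-free, which allows parameters with different units to be compared directly.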

Assessment of the homogeneity of the historical control data set
Firstly, to evaluate the reproducibility of the parameter measurements, an ANOVA was applied to compare the means of the individual studies. Secondly, to evaluate the magnitude of study-specific influences on the variance of the historical data set, a principal component analysis (PCA) followed by a cluster analysis (CA) was applied. The PCA was applied to combined male-female data to assess its suitability to separate expected gender differences (e.g. in the body and organ weights). The corresponding values were converted into two principal components, groupwise for body weight, haematology, clinical biochemistry and organ weight. Then, these components were grouped into five clusters. The numbers of components and clusters were chosen both for practical and technical reasons. For PCA, varimax rotation with Kaiser normalization was used, and a hierarchical cluster analysis was applied to form clusters based on their squared Euclidean distance. The resulting components were displayed graphically in a scatter plot.
Fig. 2 Simplified version of a graph allowing the visual assessment of statistical significance and toxicological relevance of group comparisons. The SES point estimate (circle) and the 95 % confidence limits (whiskers, bars showing confidence interval) illustrate the (standardized) effect size between two groups. The vertical black line indicates zero difference; vertical dashed grey lines indicate toxicological relevance limits (here ±1.0 SD, according to the study design). If the confidence interval bars cross the zero line but not the grey dashed lines, i.e. they lie within the ±1.0 SD limits, there is evidence for no statistical significance as well as no toxicological relevance (case a). Two groups are significantly different when the confidence interval bars do not cross the black vertical line (cases b, c). The effect size between two groups is supposed to be toxicologically relevant when the confidence interval bars lie outside the ±1.0 SD limits (case c). Case b indicates statistical significance, but no clear toxicological relevance. Case d indicates no statistical significance, but no clear negation of toxicological relevance (reproduced from Zeljenková et al. 2014)
Evaluation of the toxicological relevance of statistically significant differences
In its Scientific Opinion on statistical significance and biological relevance (EFSA 2011c), EFSA stated: "The objective of carrying out an empirical study is usually to identify the existence of relevant biological effects at the population level using statistical tools to detect them. Therefore, the identification of statistical significance is only part of the evaluation of the biological relevance." As a basic principle, in each study, it was evaluated whether the observed statistically significant differences between the control and the test groups were in line with an effect pattern indicative of potential toxicity (e.g. liver toxicity).
To support the evaluation of the toxicological relevance of statistically significant differences identified in all five studies, the following approach was applied: (i) a consistency check of differences across studies. A statistically significant difference may be considered "consistent" and might be an indicator of a toxicologically relevant (i.e. adverse) alteration if it is reproduced across studies. (ii) a comparison of the values observed in the GM maize-fed groups with the historical control data set compiled from the non-GM groups of the five studies. A value outside of the historical control data range might be an indicator of a toxicologically relevant (i.e. adverse) alteration.

Consistency of statistically significant differences between GM and non-GM diet groups
Firstly, in order to visually evaluate the consistency of the statistically significant differences, weight development curves as well as SES graphs of all five studies were combined in one graph, allowing a direct visual comparison of differences observed in the individual studies. Secondly, to evaluate the consistency of the frequencies of statistically significant differences, total and relative numbers of such differences observed in haematology, clinical biochemistry and organ weight measurements of all five feeding trials were compared.
Thirdly, to evaluate the consistency of the difference values, the absolute differences between GM and non-GM diet groups were listed in tables for all studies, separately for male and female animals. Statistically significant differences (positive or negative) were highlighted so that their absolute values and their incidence could be compared easily.

Comparison of the values observed in the GM maize-fed groups with the historical control data
The historical control data compiled from the non-GM groups of the five studies (Table 3) were used to set up simplified equivalence ranges (baselines) on the basis of the 1-SD approach. The mean and SD calculated for all non-GM (control and conventional) groups of the five studies were taken to set a lower equivalence limit (= mean − SD) and an upper equivalence limit (= mean + SD). Thereafter, it was checked whether the means of the individual parameters measured in the 11 % GMO- and 33 % GMO-fed groups fell within or outside these limits.
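The 1-SD equivalence check described above amounts to a simple interval test per parameter. The following minimal sketch shows it for one parameter. The function names and the numbers are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev


def equivalence_limits(historical_values):
    """Return (mean - SD, mean + SD) of the pooled non-GM values."""
    m = mean(historical_values)
    sd = stdev(historical_values)
    return m - sd, m + sd


def within_equivalence_range(group_mean, limits):
    """True if a GM group mean falls inside the equivalence limits."""
    lower, upper = limits
    return lower <= group_mean <= upper


# Hypothetical values for one parameter, for illustration only.
limits = equivalence_limits([8.0, 10.0, 12.0])
print(limits)                                    # (8.0, 12.0)
print(within_equivalence_range(11.0, limits))    # True
print(within_equivalence_range(12.5, limits))    # False
```

Because roughly one third of normally distributed control values themselves lie outside mean ± SD, a group mean outside these limits is only an indicator that calls for closer inspection.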

Assessment of the statistical power of the studies
The power or sensitivity of a statistical test (here: of a comparison of the results from a GM with a control group) is the probability that the test correctly identifies an actually existing effect. "Actually existing" refers to the effect of toxicological relevance that is defined in advance of the study and thus co-determines the sample size. EFSA (2011b) described an example in which differences of one SD unit were considered of little toxicological relevance, resulting in a sample size of 16 animals per group and sex (i.e. 8 cages). Consequently, feeding studies designed according to this example aim to detect group effects of one SD value for all parameters measured.
On the basis of the historical control data compiled from the non-GM groups of the five studies, in a first approach without factoring the study effect, absolute values of effect sizes detectable with a power of 0.8 or 0.9 and sample sizes of (i) 10 animals per group and sex (OECD 2009), (ii) 16 animals per group and sex (EFSA 2011b) or (iii) 20 animals per group and sex (OECD 1998) were determined for each single parameter. For these calculations, the experimental unit is the cage with two animals.
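To make the relation between sample size, power and detectable effect size concrete, the following sketch computes the smallest difference detectable in a two-sided comparison of two equally sized groups. It uses a normal approximation, which slightly understates the value obtained from an exact t-test calculation, and is not the SAS procedure used for the analyses reported here. The function names and the significance level of 0.05 are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def detectable_ses(units_per_group, power=0.8, alpha=0.05):
    """Smallest standardized effect size (in SD units) detectable with the
    given power in a two-sided two-group comparison (normal approximation)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(2 / units_per_group)


def detectable_difference(sd, units_per_group, power=0.8, alpha=0.05):
    """Same quantity in the original unit of the parameter. The SD must
    refer to the same experimental unit as units_per_group."""
    return sd * detectable_ses(units_per_group, power, alpha)


# Experimental units per group and sex, for illustration only.
for n in (8, 10, 16, 20):
    print(n, round(detectable_ses(n, 0.8), 2), round(detectable_ses(n, 0.9), 2))
```

Under this approximation, 16 experimental units per group yield a detectable effect size of about one SD at a power of 0.8, whereas fewer units or a higher power requirement shift the detectable difference upwards.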
For all analyses and graphs, the SAS Software, version 9.4, from the SAS Institute Inc. (Cary, NC, USA) was used. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of the SAS Institute Inc.

90-day feeding trials D and E
No signs of morbidity and mortality were observed throughout the 90-day feeding period, and the daily clinical observations did not reveal any signs of functional deficits in the feeding trials D and E. Furthermore, no ophthalmological alterations were visible in male and female rats fed the control, 11 % GMO and 33 % GMO diets in both trials. Body weight, relative organ weight as well as haematology and clinical biochemistry data were analysed as described in the "Materials and methods, Individual study analyses" section and are shown for study D in Table 1 and for study E in Table 2 of the Electronic Supplementary Material.
Table 3 Compilation of historical control data for male (A) and female (B) rats fed 33 % non-GM maize-containing diets in the GRACE project 90-day feeding trials A, B, C, D and E
For all the haematology and clinical biochemistry parameters, SES were calculated. In this context, a statistically significant difference means that the confidence interval of the SES to the control does not include the zero value, while "similar" means that the confidence interval of the SES to the control includes the zero value (Fig. 2).
The haematology parameters and the differential leukocyte counts were similar between the groups fed GM maize and the control groups of male and female rats in both trials, with the exceptions described below. In trial D (Electronic Supplementary Material, Table 1), MCV, MCH and the percentage of eosinophils were significantly increased in male rats fed the 11 % GMO diet when compared to the control group, while the percentage of neutrophils was significantly decreased in male rats fed the 33 % GMO diet when compared to the control group. Furthermore, WBC were increased in female rats fed the 11 % GMO diet when compared to the control group, whereas WBC, MCHC and the number of lymphocytes per microlitre were significantly increased in female rats fed the 33 % GMO diet when compared to the control group. In trial E (Electronic Supplementary Material, Table 2), the percentage of neutrophils was significantly decreased in male rats fed the 11 % GMO diet, while the percentage of lymphocytes was significantly increased and the percentage of neutrophils was significantly decreased in male rats fed the 33 % GMO diet when compared to the control group. Moreover, in female rats fed the 11 % GMO diet, the percentage of lymphocytes was significantly decreased and the percentage of neutrophils was increased when compared to the control group, whereas WBC and the percentage of monocytes were decreased in the 33 % GMO group when compared to the control group.
Clinical biochemistry parameters in both trials were similar between the groups fed GM maize and the control groups of male and female rats, with the exceptions described below. In trial D (Electronic Supplementary Material, Table 1), ALP activity was significantly decreased in male rats fed the 11 % GMO diet, and U was significantly decreased in the 33 % GMO group when compared to the corresponding control group. ALT, AST and U were significantly increased in female rats fed the 11 % GMO diet, while ALT, AST and TRG were significantly increased in female rats fed the 33 % GMO diet when compared to the corresponding control group. In trial E, no significant differences between rats fed the control diet and those fed the GMO diets were observed.
The relative weights of all organs were similar between the groups fed GM maize and the control groups of male and female rats, with one exception: in trial D, the weight of the left kidney was slightly lower in female rats fed the 11 % GMO diet than in the control rats (Electronic Supplementary Material, Table 1).
All parameters measured in the urine in both trials were similar with one exception: Osmolality was significantly increased in female rats fed the 33 % GMO diet when compared to the control group.
No macroscopically visible alterations were observed in male and female rats fed the control, 11 % GMO and 33 % GMO diets in both trials with the exception of a male rat (No. 32) fed the 33 % GMO diet in trial E, in which the right seminal vesicle and coagulating gland had a reduced size but without accompanying histopathological alterations. A low number of histological changes were observed in the control and 33 % GMO groups in trials D and E; these were mostly inflammatory reactions (Electronic Supplementary Material, Table 3). No pre-neoplastic and/or neoplastic lesions were observed. Since no treatment-related changes were observed between the control and high-dose groups, no further tissue analyses were carried out.

Compilation of historical control data from groups fed 33 % non-GM maize-containing diets
The 90-day body weight, haematology, clinical biochemistry as well as relative organ weight data from all non-GM groups of the five studies (i.e. those fed a diet containing either a control, i.e. near-isogenic, non-GM, or another conventional maize variety) were transferred to a meta-data file. As in all analyses of previous GRACE feeding trials, values were excluded from the analysis if they showed distinct haemolysis or were outside the dynamic range of the analyser, but no other statistical outliers or extreme values were removed from the data set (see Schmidt and Schmidtke 2014; Schmidt et al. 2015a, b).
The compiled information on historical control values is shown in Table 3, in which the reference statistics (mean, median, SD, minimum, maximum, 5 and 95 % percentiles) for body weight per week, haematology and clinical biochemistry parameters as well as relative organ weights of all non-GM groups in the five studies are listed.
Means and medians of the non-GM groups for 93 % of the parameters differed by less than 5 %, thereby indicating symmetric distributions (Table 4, columns 2 and 5). Only a few extreme values were observed (not shown).
The corresponding natural variation of the data was described by two parameters of variability (coefficient of variation and the relative width of the 90 % central interpercentile range), which are listed for each parameter in Table 4, columns 3-4 and 6-7.
The coefficient of variation in body weight at the beginning of the study was 4.3 % in males and 4.8 % in females, increasing over time up to 5.6 % in males and 7.3 % in females (week 13). The relative 90 % interpercentile range of the body weight measurements at the beginning of the study was 7.5 % of the mean in males and 8.2 % of the mean in females, also slightly increasing over time up to 9.1 % of the mean in males and 12.2 % of the mean in females (week 13).
In the case of the haematology parameters, the coefficient of variation ranged between 2 and 45 %, whereas the 90 % interpercentile interval varied between 3 and 78 % of the mean.
Clinical biochemistry parameters showed the highest variability: the coefficients of variation ranged from about 5 to about 106 %, and the 90 % interpercentile intervals ranged from 4 to 86.7 % of the means. Relative organ weights had coefficients of variation of about 5-15 %, and 90 % interpercentile intervals ranged from 8 to 32 % of the means (Table 4).
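The two variability measures used above can be illustrated with a minimal sketch. The exact definition of the relative 90 % interpercentile range is not restated in this section; the sketch assumes the full width (95th minus 5th percentile) divided by the mean, and the function name and example values are hypothetical.

```python
from statistics import mean, stdev, quantiles


def variability_summary(values):
    """Return CV and relative 90 % interpercentile width, both in % of the mean."""
    m = mean(values)
    cv_percent = 100 * stdev(values) / m  # sample SD relative to the mean
    # n=20 yields 19 cut points at 5 %, 10 %, ..., 95 %
    cuts = quantiles(values, n=20, method="inclusive")
    p5, p95 = cuts[0], cuts[-1]
    # Assumption: "relative width" = (P95 - P5) / mean
    rel_width_percent = 100 * (p95 - p5) / m
    return {"mean": m, "cv_percent": cv_percent,
            "p5": p5, "p95": p95, "rel_width_percent": rel_width_percent}


# Hypothetical body weights (g), not taken from the studies
example = variability_summary([300, 310, 320, 330, 340])
print(example)
```

Other percentile conventions (for example the "exclusive" method) give slightly different cut points for small samples, so the convention should be fixed before values from different data sets are compared.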

Assessment of the homogeneity of the historical control data set
In 94 % of the ANOVAs applied to all parameters (to body weight per week, to haematology and clinical biochemistry parameters as well as to relative organ weights of male and female rats), the hypothesis that the five study means were equal was rejected, i.e. for these parameters the mean in at least one study was different from the mean of at least one other study. The hypothesis of equal study means could not be rejected for MCV, eosinophils, ALT and epididymis (right) in males as well as for basophils and adrenal (right and left) in females. PCA followed by CA for body and relative organ weights showed a clear separation of male and female rats into two clusters (Fig. 3, top left and bottom right). Additionally, in the case of the body weight, but not in the case of the relative organ weights, studies A, B and C were grouped in one cluster, separated from a second cluster including studies D and E. Furthermore, the clinical biochemistry parameters of studies A and B formed one cluster, separated from a second cluster including the parameters of studies C, D and E (Fig. 3, bottom left). Regarding the haematology parameters, there was no discrimination between the studies (Fig. 3, top right).
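As an illustration of the homogeneity test described above, the following sketch computes the one-way ANOVA F statistic for one parameter across several studies. It is a simplified stand-in, not the analysis code of the GRACE project; the function name and data are hypothetical, and the p-value (which requires the F distribution from a statistics library) is deliberately left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Variation of the study means around the grand mean
    ss_between = sum(len(g) * (gm - grand_mean) ** 2
                     for g, gm in zip(groups, group_means))
    # Variation of the animals around their own study mean
    ss_within = sum((v - gm) ** 2
                    for g, gm in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical control values of one parameter in three studies
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom leads to rejection of the hypothesis of equal study means.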

Consistency of statistically significant differences between GM and non-GM diet groups
Visual evaluation of the consistency of body weight data
The weight development in each study is displayed in Fig. 4 (control, 11 % GMO and 33 % GMO groups). The curves run parallel to each other, thus indicating a comparable weight development in all studies. As already seen in the PCA graphs, all groups in studies D and E showed a slower body weight growth than in studies A, B and C.

Visual evaluation of the consistency of the other parameters
... Table 5). Single SES confidence interval bars moved to the right or to the left; such shifts were replicated once or twice for less than 1 % of haematology parameters, about 2 % of clinical chemistry parameters and less than 5 % of organ weights, but no confidence interval bar shift was reproduced in all five studies.
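To make the idea of an SES confidence interval bar concrete, the sketch below computes a plain standardized mean difference (Cohen's d) between a test and a control group with an approximate 95 % confidence interval. This is a swapped-in simplification: the SES in the studies were derived from linear mixed models, which is not reproduced here. The large-sample standard error formula, the function name and the data are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def standardized_difference(test, control, level=0.95):
    """Return (d, lower, upper): Cohen's d with an approximate confidence interval."""
    n1, n2 = len(test), len(control)
    # Pooled SD of the two groups
    pooled_var = ((n1 - 1) * variance(test) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    d = (mean(test) - mean(control)) / sqrt(pooled_var)
    # Large-sample approximation of the standard error of d
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, d - z * se, d + z * se


# Hypothetical values of one parameter, not taken from the studies
d, lower, upper = standardized_difference([12, 14, 16, 18, 20], [10, 12, 14, 16, 18])
print(round(d, 3), round(lower, 3), round(upper, 3))
```

An interval that lies entirely to the right or to the left of zero corresponds to a bar that has "moved", i.e. a statistically significant difference at the chosen level.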
Evaluation of the consistency of the numbers of statistically significant differences
The total and relative numbers of statistically significant differences between the control and the test groups (based on SES estimates) in haematology, clinical biochemistry and organ weight measures (altogether 45 parameters for females and 44 for males) are shown in Table 5. Up to 20 % of the differences were statistically significant in studies A and B, about 10 % in study D and below 5 % in studies C and E. Therefore, the number of observed statistically significant differences was not consistent across the studies.

Evaluation of the consistency of the difference values
The absolute 11 % GMO-control and 33 % GMO-control group differences for all parameters are shown in Table 6. Statistically significant negative differences are highlighted in italics, while statistically significant positive differences are highlighted in bold. Differences were not replicated across the studies (i.e. the columns), and in the case of some parameters even opposite differences (italics/bold) were observed. The absolute values of statistically significant differences often deviated markedly from the differences in the other studies. Taken together, there was no single difference between a control and a GMO group that was consistent across all five studies.

Comparison of the values observed in GM maize-fed groups with the historical control data baseline
The means and SD of the historical control data, as shown in Table 3, are included in columns 2-3 of Table 7, and the derived simplified equivalence ranges (mean + SD, mean − SD) are listed in columns 4-5. Additionally, the means of the GM group parameters of all five studies are shown, and mean values lying within the simplified equivalence limits are highlighted in italics. In male rats, 81.5 % of the parameters measured in the 11 % GMO group and 81.8 % of the parameters measured in the 33 % GMO group were within the simplified equivalence interval, whereas in female rats 77.9 % of the parameters measured in the 11 % GMO group and 82.3 % of the parameters measured in the 33 % GMO group were within the simplified equivalence interval. No dose-response relationship could be observed. Moreover, 11.3 % of all parameters measured in study A, 7.8 % of all parameters measured in study B, 25.0 % of all parameters measured in study C, 26.9 % of all parameters measured in study D and 26.0 % of all parameters measured in study E were outside the equivalence limits.

[Figure: combined SES graphs for studies A-E (left: 11 % GMO-control, right: 33 % GMO-control) in male (top) and female rats (bottom); a haematology parameters, b clinical biochemistry parameters, c body weight and relative organ weights]

Assessment of the statistical power of the studies
Table 8, part a, specifies the achievable statistical power to detect effect sizes of one SD (corresponding to √2 SD with two animals per cage) for the sample sizes proposed by several internationally recognized test guidelines. Table 8, part b, specifies the relative effect sizes (in SD units) detectable with a power of 0.9 for the same sample sizes. Combining the relative effect sizes of Table 8, part b, with the historical control data, absolute effect sizes that are relevant from a practical point of view were calculated. The values for sample sizes of (i) 10 animals per group and sex (OECD 2009b), (ii) 16 animals per group and sex (EFSA 2011b) and (iii) 20 animals per group and sex (OECD 1998), with two animals per cage, are listed in Table 9. For the testing facility with the historical data background, these values represent the effect sizes that might be detected using statistical tools.
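The relation between sample size and achievable power can be sketched with a normal approximation for a two-sided comparison of two groups. This is not the procedure used to produce Table 8; it is a simplified illustration that assumes the cage is the experimental unit, that animals within a cage are independent (so an effect of one animal SD equals √2 cage-mean SD with two animals per cage) and that α = 0.05. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(cages_per_group, effect_in_animal_sd=1.0,
                 animals_per_cage=2, alpha=0.05):
    """Approximate two-sided power for comparing cage means of two groups."""
    z = NormalDist()
    # Assumption: cage-mean SD = animal SD / sqrt(animals per cage)
    effect_in_cage_sd = effect_in_animal_sd * sqrt(animals_per_cage)
    # Expected test statistic under the alternative (equal group sizes)
    noncentrality = effect_in_cage_sd * sqrt(cages_per_group / 2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


# 5, 8 and 10 cages of two animals, i.e. 10, 16 and 20 animals per group and sex
for cages in (5, 8, 10):
    print(cages, round(approx_power(cages), 2))
```

Because the normal approximation ignores the uncertainty in the estimated SD, it tends to be optimistic for small numbers of cages; an exact calculation based on the t distribution or on the mixed model would give somewhat lower values.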

90-day feeding trials D and E
The feeding trials D and E were performed in parallel (i.e. all experimental conditions were strictly the same), while the studies A, B and C were performed at different points in time than the studies D and E in the same animal housing facility. Therefore, when discussing the relevance of statistically significant differences regarding a particular parameter between the control and the GMO-fed rats in the feeding studies D and E, it was considered pertinent to only compare the outcome of these two studies among themselves and not to include observations on alterations of the particular parameter in the feeding trials A, B and C.
There were no statistically significant differences between the mean body weights of the three experimental groups at any point in time of the feeding trials D and E in the case of male as well as female rats. Regarding the haematology parameters, three out of 14 parameters in male rats fed the 11 % GMO diet and one out of 14 parameters in male rats fed the 33 % GMO diet were significantly different from the control male animals in trial D, while one out of 14 parameters in female rats fed the 11 % GMO diet and three out of 14 parameters in female rats fed the 33 % GMO diet were significantly different from the control female animals in trial D. In the case of study E, one out of 14 parameters in male rats fed the 11 % GMO diet and two out of 14 parameters in male rats fed the 33 % GMO diet were significantly different from the control male animals, whereas two out of 14 parameters in female rats fed the 11 % GMO diet and the 33 % GMO diet were significantly different from the control female animals. It is important to note that the statistically significant changes in the haematology parameters in trial D were different from those observed in trial E with only one exception: The percentage of neutrophils decreased in male rats fed the 33 % GMO diet in both studies. In this context, the only haematological alterations that showed a tentative dose-effect relationship were the decrease in the percentage of neutrophils in male rats as well as the increase in the MCHC and LYM in female rats in trial D and the decrease in WBC in female rats in trial E, whereby the values in all four cases were within or close to the value range of the control rats. Thus, the described alterations in the haematology parameters are not considered to be relevant from a toxicological point of view. 
This conclusion is supported by the fact that after feeding the rats for 1 year with the diet containing 33 % MON810 maize no changes were observed in the four above-mentioned parameters in male and female rats (Zeljenková et al. 2016).
None of the clinical biochemistry parameters showed statistically significant differences in the three experimental groups of feeding trial E, while one out of 15 clinical biochemistry parameters in male rats and three out of 15 clinical biochemistry parameters in female rats were significantly different when the data from control and GMO-fed rats in feeding trial D were compared. The increased ALT and AST activities in female rats were not accompanied by changes in the ALB and TP levels and/or by histopathological alterations in the liver and were not observed in female rats in feeding trial E. An increased U level was observed in female rats fed the 11 % GMO diet but not the 33 % GMO diet, and the increased TRG level in the female rats fed the 33 % GMO diet was not accompanied by lipid accumulation in any of the histologically examined organs and was not observed in trial E. Based on the above-mentioned observations, the described alterations in the clinical biochemistry parameters are not considered to be toxicologically relevant and are not related to the diets supplemented with the MON810 maize varieties. A low number of histological changes, mostly inflammatory reactions and no pre-neoplastic/neoplastic lesions, were observed in the control and 33 % GMO groups in trials D and E, which is in accordance with the findings in the previously published 90-day feeding trials A and B (Zeljenková et al. 2014).

Compilation of historical control data from groups fed 33 % non-GM maize-containing diets
In this study, historical control data regarding body weight development, haematology and clinical biochemistry parameters as well as relative organ weights of male and female Wistar Han RCC rats having been fed a diet containing 33 % non-GM maize at the animal housing facility of the Slovak Medical University (Bratislava, Slovakia) are presented. These data constitute the basis for future study designs, power analyses and study result assessments at this particular animal housing facility. The data refer to the six conventional varieties used in the five feeding studies (Table 2) and deliver useful information on the magnitude and variability of the measured parameters in male and female Wistar Han RCC rats being fed a diet containing 33 % non-GM maize for 90 days.
The ANOVA assessing the homogeneity of the historical control data showed a clear study effect. Moreover, in the case of body weight development and clinical biochemistry parameters, the data measured for diets containing the same maize variety are clearly clustered according to the studies (Fig. 3). This underlines the importance of comparing treatment groups (test and control groups) within studies. In the case of future feeding trials to be performed at the animal housing facility of the Slovak Medical University, the historical control data will help in determining whether the values of individual parameters measured in maize-fed rats show deviations from the corresponding normal range for rats held at the above-mentioned institution.
When analysing the historical control data, no statistically extreme values were excluded, in line with the approach in single study analyses where there were no technical reasons to do this. Consequently, the variability of some parameter values is high. In this context, the historical control data showed a reduced variability in studies C, D and E compared to studies A and B, which contained more extreme values.

Consistency of statistically significant differences
The methods used to evaluate the consistency of statistically significant differences between control and test groups showed inconsistency both in the frequency of statistically significant differences and in the difference values. Statistically significant differences in one study were not reproduced in other studies, except in the following seven cases: (1) male, 33 % GMO-control, neutrophils: studies D and E; (2) female, 33 % GMO-control, WBC: studies B and D; (3) female, 33 % GMO-control, LYM: studies B and D; (4) female, 11 % GMO-control, ALT: studies A and D; (5) female, 33 % GMO-control, ALT: studies B and D; (6) female, 33 % GMO-control, AST: studies B and D; (7) female, 11 % GMO-control, U: studies B and D. Therefore, the great majority of the differences were found in a single study only, and the toxicological relevance of these statistically significant differences between control and test groups is questionable. A higher percentage of statistically significant differences between rats fed the control diets and those fed the diets containing the MON810 maize was observed in studies A and B than in studies C, D and E. In this context, it has to be pointed out that the GRACE project partners did not find any evidence that this was due to major differences in the composition of the diets, to technical defects of the laboratory equipment used or to mistakes in the handling of the blood samples by the laboratory staff.

Comparison of the data sets from GM maize-fed rats with the historical control data
About 80 % of the individual parameter measurements in the GM maize-fed groups were within the simplified equivalence limits defined by the historical control data. If the parameter measurements fell outside the equivalence limits, the corresponding differences were statistically significant in 30 to 80 % of the cases.
The equivalence limits were calculated in a simplified way and had to be based on study-internal data; therefore, they are only rough estimates. A more refined equivalence testing procedure will be elaborated in the European Commission-funded project G-TwYST (GM Plants 2 Year Safety Testing, www.g-twyst.eu).
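The simplified equivalence check described in this section can be sketched as follows: the limits are taken as mean ± one SD of the historical control values, and each GM group mean is tested for lying inside them. Function names and all numbers are hypothetical and serve only to show the mechanics.

```python
from statistics import mean, stdev


def equivalence_limits(historical_values):
    """Simplified equivalence range: (mean - SD, mean + SD) of the historical controls."""
    m, s = mean(historical_values), stdev(historical_values)
    return m - s, m + s


def fraction_within(gm_group_means, limits):
    """Share of GM group means that fall inside the equivalence limits."""
    lower, upper = limits
    inside = [lower <= value <= upper for value in gm_group_means]
    return sum(inside) / len(inside)


# Hypothetical historical control values and GM group means for one parameter
limits = equivalence_limits([10, 12, 14, 16, 18])
print(limits)
print(fraction_within([11.0, 15.0, 18.0, 13.5], limits))
```

Because the limits span only one SD on either side of the mean, a sizeable share of unremarkable group means is expected to fall outside them by chance, which is one reason why such limits can only be rough estimates.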

Assessment of the statistical power of the studies
The post hoc power analysis revealed a power between 0.50 and 0.85 to detect an effect size of one SD in studies designed according to international guidelines (EFSA 2011b; OECD 1998, 2009) with sample sizes per group and sex of 10, 16 or 20, respectively (the experimental unit is the cage with two animals, i.e. 5, 8 or 10 cages à 2 animals). An effect size of one SD is not necessarily linked to a truly toxicologically relevant effect. The size of a toxicologically relevant effect in absolute or SD units should be considered separately for each parameter by toxicologists, largely based on previous experience. Therefore, based on the historical background data of the animal housing facility and its associated laboratories at the Slovak Medical University, the corresponding absolute effect sizes (in original units) were calculated (Table 9). They will constitute the basis for future study designs and power analyses. This is why they should be critically examined by toxicologists regarding their toxicological relevance.

Table 9 Required effect sizes detectable by study designs with sample sizes of 5, 8 or 10 cages à 2 animals per group and sex, power = 0.8 and α = 0.05, based on historical control data (columns for male and female rats: one SD, n = 5 × 2, n = 8 × 2, n = 10 × 2)
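The conversion from a relative detectable effect size (in SD units) to an absolute effect size (in original units) can be sketched as below. The sketch uses a normal approximation for a two-group comparison with the cage as experimental unit and independent animals within a cage; these are simplifying assumptions, not the method behind Table 9. The function names and the historical SD used in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_effect_sd(cages_per_group, power=0.8, alpha=0.05, animals_per_cage=2):
    """Smallest detectable difference in units of the animal-level SD (approximation)."""
    z = NormalDist()
    # Detectable difference in units of the cage-mean SD
    in_cage_sd = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(2 / cages_per_group)
    # Assumption: cage-mean SD = animal SD / sqrt(animals per cage)
    return in_cage_sd / sqrt(animals_per_cage)


def detectable_effect_abs(historical_sd, cages_per_group, **kwargs):
    """Same quantity in original units, scaled by the historical control SD."""
    return detectable_effect_sd(cages_per_group, **kwargs) * historical_sd


# Hypothetical historical SD of 20 g for body weight
for cages in (5, 8, 10):
    print(cages, round(detectable_effect_abs(20.0, cages), 1))
```

Whether an absolute difference of the resulting size is toxicologically relevant remains a judgment for toxicologists on a parameter-by-parameter basis.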