Calibration and validation of predicted genomic breeding values in an advanced cycle maize population

Key message Model training on data from all selection cycles yielded the highest prediction accuracy by attenuating specific effects of individual cycles. Expected reliability was a robust predictor of accuracies obtained with different calibration sets. Abstract The transition from phenotypic to genome-based selection requires a profound understanding of factors that determine genomic prediction accuracy. We analysed experimental data from a commercial maize breeding programme to investigate if genomic measures can assist in identifying optimal calibration sets for model training. The data set consisted of six contiguous selection cycles comprising testcrosses of 5968 doubled haploid lines genotyped with a minimum of 12,000 SNP markers. We evaluated genomic prediction accuracies in two independent prediction sets in combination with calibration sets differing in sample size and genomic measures (effective sample size, average maximum kinship, expected reliability, number of common polymorphic SNPs and linkage phase similarity). Our results indicate that across selection cycles prediction accuracies were as high as 0.57 for grain dry matter yield and 0.76 for grain dry matter content. Including data from all selection cycles in model training yielded the best results because interactions between calibration and prediction sets as well as the effects of different testers and specific years were attenuated. Among genomic measures, the expected reliability of genomic breeding values was the best predictor of empirical accuracies obtained with different calibration sets. For grain yield, a large difference between expected and empirical reliability was observed in one prediction set. We propose to use this difference as guidance for determining the weight phenotypic data of a given selection cycle should receive in model retraining and for selection when both genomic breeding values and phenotypes are available. Supplementary Information The online version contains supplementary material available at 10.1007/s00122-021-03880-5.


Introduction
The prediction of breeding values from molecular data has become a key component of many plant breeding programmes. In breeding hybrid crops such as maize, genomic prediction can be applied at different stages of the breeding scheme. When beginning a new selection cycle, genomebased prediction can assist in the choice of crosses that warrant both high mean performance and high genetic variance for target traits (Lehermeier et al. 2017;Allier et al. 2019). The next step is to identify selection candidates with the highest combining ability in a large sample of testcrosses within the same heterotic group (Albrecht et al. 2011;Riedelsheimer et al. 2012;Jacobson et al. 2014) or to predict the performance of potential hybrid combinations directly (Massman et al. 2013;Technow et al. 2014;Seye et al. 2020). For each of these prediction steps, a statistical model must be trained on experimental calibration data comprising highquality phenotypes and genotypes. Deterministic formulas forecasting prediction accuracy suggest a strong influence of the sample size, the heritability, the genetic architecture of the target trait and the genome structure of the species under study (Daetwyler et al. 2010;Schopp et al. 2017). Simulation studies have shown that the mating design and family structure of the calibration set also have a strong influence on prediction accuracy (Hickey et al. 2014). Results from experimental studies corroborate these findings irrespective of whether the studied populations were designed for research purposes (Lehermeier et al. 2014) or originated from commercial breeding programmes (Albrecht et al. 2014;Krchov et al. 2015;Auinger et al. 2016).
In addition to the specific properties of the calibration set, the relatedness between the calibration and prediction set plays an important role (Habier et al. 2007;Saatchi et al. 2011;Clark et al. 2012;Lorenz and Nice 2017). In animal breeding, selection candidates are direct descendants of the previous selection cycle; thus, relatedness between calibration and prediction sets is given. In many plant breeding programmes, however, the situation is different. Depending on the generation interval of the crop, several years might elapse between the evaluation of parental lines and their progenies. Furthermore, plant breeders enrich the genetic diversity of advanced cycle breeding populations through crossings with unrelated or distantly related genetic material. Depending on the mating design, this practice can alter haplotype structure and linkage disequilibrium of the selection candidates dramatically compared to the population on which the prediction model was trained. While selection on phenotypes is not impaired by these proceedings, the success of genome-based selection might be jeopardised. In the worst case, the predictive power of the calibration set breaks down despite continuous retraining of the model over the years.
In breeding programmes where genome-based prediction is applied on a routine basis, a large body of data becomes available for model training. Several authors have found that prediction accuracy was impacted adversely when the calibration and the prediction sets were distantly related, and that removal of some genotypes from the calibration set improved prediction accuracy (Albrecht et al. 2014;Michel et al. 2016;Pembleton et al. 2018). Brandariz and Bernardo (2019) demonstrated that, for populations derived from biparental crosses of maize, utilising ad hoc training populations produced better results than did training the prediction model on all available data, despite a substantially smaller sample size. Their experimental data set comprised a high number of large half-sib families, and prediction accuracies were highest when the calibration set comprised families having one parent in common with the family to be predicted. In these cases, relatedness was high, and changes in linkage phase across families were negligible, as shown theoretically and empirically by Lehermeier et al. (2014). If the calibration and prediction sets do not comprise large biparental families, the creation of ad hoc calibration sets is complicated, and drivers of prediction accuracy other than sample size and relatedness are largely unknown. We, therefore, investigated several genomic measures contributing to prediction accuracy. Building on a unique data set from an advanced cycle maize breeding programme comprising high-precision phenotypic and high-density genotypic data and representing six interconnected breeding cycles, our main objectives were to (1) assess the accuracy of genomic best linear unbiased prediction (GBLUP) of different calibration sets from up to five previous selection cycles in two prediction sets, (2) investigate the impact of the genetic diversity of the calibration set, (3) examine differences between sample size and effective sample size in the calibration set and (4) investigate how variation in marker polymorphism, linkage disequilibrium and the degree of relatedness between calibration and prediction sets affect prediction accuracy.

Plant material
The experimental data presented in this study consist of genetic material from six contiguous selection cycles of a commercial maize breeding programme (Table 1). The six data sets (S1-S6) comprise 5968 doubled haploid (DH) lines from the Dent heterotic group crossed to one or several Flint testers. Data sets were disconnected with respect to selection candidates but connected through 11 commercial check hybrids. Data sets S1 to S6 varied in size from 551 to 1545 DH lines. Individual sets were generated by crossing between 36 and 148 parents, which resulted in 130 to 607 crosses with 1 to 455 progenies per parent (Table 1). In S1 and S2, each line was crossed to one of two testers. In S3, four testers were used with 193 lines crossed to more than one tester. In S4, S5 and S6, one tester was used for each of the sets. S1 to S5 are connected by one single-cross tester, while the tester in S6 was a double cross with the common tester of S1 to S5 as a parent. Data sets S1 and S2 were part of the study of Albrecht et al. (2014). Plant materials described in this study are proprietary to KWS SAAT SE & Co. KGaA.

Genotypic and phenotypic data
The DH lines of sets S1 to S4 were genotyped with 56,110 SNP markers using the Illumina® MaizeSNP50 BeadChip 1 3 (Ganal et al. 2011). DH lines of sets S5 and S6 were genotyped with custom-made chips comprising subsets of 12,062 and 22,359 SNPs of the Illumina® MaizeSNP50 BeadChip, respectively. Genotypic data from sets S1 to S6 were merged, and only SNPs with a GTScore ≥ 0.7 or call frequency ≥ 0.9 were retained for further analyses. Monomorphic SNPs and SNPs with the alternative allele appearing only once were removed. If two SNPs were fully collinear, one was discarded at random. In the full dataset (S all , N = 5968), 9742 informative SNPs remained. Missing genotypic information was imputed using the function 'codeGeno' from the R-package 'synbreed' version 0.12-14 with option 'impute.type = "beagle"' (Wimmer et al. 2012;Browning and Browning 2009) within the GNU-R environment (R Core Team 2020).
Data sets S1 to S6 were evaluated phenotypically in multi-location trials over 6 years (2010)(2011)(2012)(2013)(2014)(2015). Within locations, testcrosses were allocated to a series of trials laid out as 10 × 10 lattices comprising additional DH lines from the same breeding programme as well as five to seven common commercial checks. The full set of genotypes was replicated twice in S1 and S2, partially in S3 and not at all in S4, S5 and S6. Testcross performance was evaluated for grain dry matter yield (GDY, dt/ha) and grain dry matter content (GDC, %). Best linear unbiased estimates (BLUEs), variance components and heritabilities within data sets S1 to S6 were estimated as described by Albrecht et al. (2014).

Prediction of genomic breeding values
Data sets S1 to S5 served as calibration sets (CS) for model training individually and in all possible combinations resulting in 31 calibration sets ranging from N = 842 (S2) to N = 5417 (S1_2_3_4_5 = combined S1, S2, S3, S4 and S5). Data set S6 served as prediction set. Furthermore, for the 15 calibration sets involving all possible combinations of S1 to S4, data set S5 served as prediction set. In the respective calibration sets, BLUEs of DH lines averaged across locations were used as a response variable in the GBLUP model with the following form: where is a vector of the BLUEs of GDY or GDC, respectively, µ is the population mean and u is a vector of random genomic breeding values (GBV) with the distribution ∼ N 0; 2 u . Z is the corresponding incidence matrix, is a vector of residuals, which, for simplicity, is assumed to be normally distributed with a mean of zero and equal variance ∼ N(0; 2 ). The genomic kinship matrix of the genotyped DH lines was calculated according to VanRaden (2008) with allele frequencies estimated from S all . Variance components 2 u and 2 are the testcross and residual variances pertaining to the GBLUP model, respectively.
Prediction accuracies (r) were calculated using Pearson's correlation coefficients between GBVs predicted based on the model trained with the respective calibration set and BLUEs averaged across locations of the respective prediction set, divided by the corresponding square root of the trait's heritability. Empirical reliabilities were obtained from the squared accuracies (r 2 ).

Genomic measures for characterising calibration and prediction sets
The molecular diversity of data sets S1 to S6 was assessed by calculating the proportion of polymorphic markers, nucleotide diversity (Nei and Li 1979) and haplotype heterozygosity (Nei and Tajima 1981). Haplotype heterozygosity was calculated for sliding windows (Conrad et al. 2006) of 0.5 Mb, with steps of 1 SNP and a minimum number of 5 SNPs per window. If causal variants contribute to trait variation in the prediction set but are not captured by SNPs in the calibration set or vice versa, prediction accuracy is likely to deteriorate. We, therefore, calculated the parameter nPoly, which reflects the number of common polymorphic SNPs, for all 46 combinations of calibration and prediction sets (15 with S5, 31 with S6).
A principal coordinate analysis (PCoA, Gower 1966) was conducted on data set S all (S1-S6, N = 5968 and 9742 SNPs) 1 3 based on the realised kinship matrix using the R-package 'ape' version 5.3 (Paradis and Schliep 2019). Variation within and between data sets was assessed by partitioning the molecular variance of S all in an analysis of molecular variance (AMOVA) according to Excoffier et al. (1992). For data sets S1 to S6, the linkage disequilibrium (LD) measure r 2 (Hill and Robertson 1968) was calculated for pairs of SNPs located on the same chromosome, and the average LD decay distance for r 2 < 0.1 was estimated using nonlinear regression (Hill and Weir 1988). We calculated linkage phase similarities (LPS) for all combinations of calibration and prediction sets according to Schopp et al. (2017): where k is the index for the p marker pairs. The sign of r k CS was inferred from calculating D = p AB − p A p B of marker pair k, where p AB denotes the frequency of haplotype AB, p A the frequency of allele A at one marker locus and p B the frequency of allele B at the other locus in the calibration set.
The influence of the sample size N on prediction accuracy was determined by sampling at random DH lines from the combined set S1_2_3_4_5 for N = 100 to N = 5400 in increments of 100. However, in advanced cycle breeding populations, relatedness between genotypes might be highly unbalanced. Therefore, we introduced the concept of effective sample size 1 ≤ N eff ≤ N given by the following formula: Here N denotes the size of the calibration set under study and var(λ) the estimated variance of the eigenvalues of the corresponding genomic kinship matrix U. With N independent genotypes, the expected variance of the eigenvalues of is zero, and N eff is equal to N. As the pattern of relatedness becomes increasingly unbalanced, one would expect var(λ) to increase, leading to a reduction of N eff compared to N.
To assess the degree of the relatedness between the calibration and prediction sets, we adopted the approach of Saatchi et al. (2011). For all possible combinations, we calculated the average maximum realised kinship coefficient (u max ) based on the genomic kinship matrix U. The maximum kinship of line i of the respective prediction set (u max,i ) was derived as max(U ij ) where U ij are the realised kinship coefficients between line i and the lines j of the respective calibration set. Averaging over DH lines in the prediction set produced the u max value for each combination.
For each combination of the calibration and prediction sets, the average expected reliability can be calculated from theory. Following Clark et al. (2012), we calculated the reliability of line i in the prediction set from its prediction error ) derived from the GBLUP model employing a specific calibration set as follows: where 2 u is the genomic variance pertaining to this model and U ii is the diagonal element of matrix referring to line i. By averaging over all DH lines in the prediction set, we can obtain the expected reliability estimate for each trait and each combination of the calibration and prediction sets.
To analyse the relative importance and interdependencies of sample size N and genomic measures N eff , nPoly, u max , LPS and trait-specific reliabilities ρ 2 for differentiating the 46 combinations of the calibration and prediction sets, we conducted a principal component analysis (Jolliffe and Cadima 2016) on the centered and standardised estimates using the GNU-R environment (R Core Team 2020).
The significance of genomic measures for predicting accuracies was assessed with multiple linear regression using empirical prediction accuracies for GDY and GDC of the 46 combinations of the calibration and prediction sets as response variables, respectively. Model selection was performed using stepwise model selection in the GNU-R environment (R Core Team 2020), which involved adding and removing covariates in each step. In addition, the significance of sample size N and of individual genomic measures N eff , nPoly, u max , LPS and ρ 2 was tested in a linear model. To account for the effect of the prediction set, a categorical covariate indicating whether prediction accuracies were estimated in S5 or S6 was included in all models. Table 2 presents testcross means, variance components and heritabilities for traits GDY and GDC in data sets S1 to S6. Genotypic variance components were highly significant (p < 0.01) in all sets. Trait heritabilities (h 2 ) on a progenymean basis were high for both traits in most data sets with the exception of GDY in S4 and S6.

Phenotypic and molecular characterisation of individual data sets
The principal coordinate analysis indicated substantial overlap of the six data sets with the first three coordinates explaining 14.9% of the molecular variance (Fig. 1). In the AMOVA (Suppl . Table S1), only approximately 5% of the total molecular variance was due to variation among data sets. Within data sets S1, S3 and S4, family substructures were visible in the heatmap of pairwise realised kinship coefficients between DH lines (Suppl. Figure S1). The measure of the proportion of polymorphic markers, nucleotide ii 2 u 1 3 diversity and haplotype heterozygosity identified S4 and S6 as the data sets with the lowest diversity (Suppl. Tables S1 and S2). Data set S6 showed twice the range of LD noted in all other data sets. Table 3 and Suppl. Table S3 offer estimates of N eff , nPoly, u max , LPS and trait-specific reliabilities ρ 2 calculated for all possible combinations of the calibration and prediction sets (15 for S5, 31 for S6). Data set S5 and all its combinations exhibited substantially higher N eff than all other calibration sets. The number of polymorphic SNPs shared by the calibration and prediction set was higher for combinations with S5 compared to those with S6 due to the low number of polymorphic markers in data set S6. Mean estimates for u max , LPS and ρ 2 were similar for S5 and S6, but the range across calibration sets was larger for u max and LPS when predicting in S6. This larger range derived mainly from the low values of u max (0.26) and LPS (0.59) for the combination S1/S6 (Suppl . Table S3). Correlations between estimates of genomic measures ranged from 0.17 ( N eff , u max for prediction in S6) to 0.93 (u max , ρ 2 for prediction in S5; Table 4).

Relatedness, reliability and linkage phase similarity of calibration and prediction sets
Correlations of u max with N and N eff differed strongly for the two prediction sets because in combinations with S6, u max values formed two distinct groups depending on whether S3 was included or not (Suppl. Figure S2). Table 3 presents mean, minimum and maximum prediction accuracies in S5 and S6, and Fig. 2 and Suppl. Table S3 present prediction accuracies obtained for individual combinations of calibration and prediction sets. For GDY, the mean accuracy over all calibration sets was 0.50 for S5 and 0.31 for S6. While prediction accuracy in S5 was rather stable, varying from 0.41 (S4) to 0.57 (S1_2_3), accuracies obtained with S6 were lower and more variable (0.03 for S2 to 0.43 for S3_4_5). Compared to the accuracies for GDY, accuracies for GDC were much higher and similar for the two prediction sets. On average, prediction accuracy increased when we employed more data sets for calibration. When averaging accuracies for groups of two, three or four data sets, accuracy increased monotonically and was always highest for the combination including the largest possible number (Suppl . Table S3). Prediction accuracy increased with N when sampling from the combined set S1_2_3_4_5 with strongly diminishing returns for N > 3000 (Fig. 3). For N > 500, none of the samples yielded negative accuracies.

Prediction accuracy as a function of genomic measures
The relationship of prediction accuracy for GDY with sample size N and each of the genome-based measures is shown in Fig. 4 and Table 4 (for GDC in Suppl. Figures S3 and Table 4). In combination with both prediction sets, S1, S2 and S4 exhibited the lowest values for most measures and returned the lowest prediction accuracy for GDY.
To test the effect of the various measures on prediction accuracy, we fitted a linear regression model with prediction set (PS) coded as a categorical covariate. For both traits, all measures were significantly associated with prediction accuracy ( Table 5). The models explained up to 80% of the variance of prediction accuracy for GDY and 75% for GDC with ρ 2 as the second covariate. In the stepwise regression, only PS, nPoly and ρ 2 (GDY), which explained 81% of the variance of prediction accuracy for GDY (Table 5), were retained in the model. With GDC as a response variable, covariates PS, u max , N eff , nPoly and ρ 2 (GDC) were retained in the model and explained 84% of the variance.
To obtain a clearer picture of the interdependencies of the seven measures describing the 46 possible combinations of the calibration and prediction sets, we performed a principal component analysis on variables PS, sample size N and the five genomic measures. Together, the first and second principal components explained 85% of the total variation (Suppl . Tables S4 and S5). Parameters N , ρ 2 and u max dominated the first linear component while parameters PS, nPoly and LPS dominated the second component. Table 3 Mean and range of prediction accuracy (r), effective sample size (N eff ) of calibration sets, number of polymorphic SNPs shared by the calibration and prediction set (nPoly), average maximum kinship (u max ), linkage phase similarity (LPS) and trait-specific reliability (ρ 2 ) for prediction sets S5 and S6 in combination with all possible calibration sets (15 for

Discussion
The sample size of the calibration set, the relatedness of individuals in the prediction and calibration sets and the LD between markers and causal variants are major factors contributing to prediction accuracy in random mating populations (de los Campos et al. 2013). In plant breeding populations, however, the relationship between prediction accuracy and these factors is not as straightforward. Therefore, we investigated how merging data sets across several years affects prediction accuracy and related genomic measures. GDY GDC S1 S2 S3 S4 S1_2 S1_3 S1_4 S2_3 S2_4 S3_4 S1_2_3 S1_2_4 S1_3_4 S2_3_4 S1_2_3_4 Order of bars GDY GDC S1 S2 S3 S4 S5 S1_2 S1_3 S1_4 S1_5 S2_3 S2_4 S2_5 S3_4 S3_5 S4_5 S1_2_3 S1_2_4 S1_2_5 S1_3_4 S1_3_5 S1_4_5 S2_3_4 S2_3_5 S2_4_5 S3_4_5 S1_2_3_4 S1_2_3_5 S1_2_4_5 S1_3_4_5 S2_3_4_5 S1_2_3_4_5 1 3

The calibration set
We analysed experimental data from a medium-sized commercial maize hybrid breeding programme. Experiments were designed for phenotypic selection. Consequently, data sets S1 to S6 were highly heterogeneous with respect to sample size, mating design, number of crosses, progenies per cross and relatedness of DH lines within and between data sets. The six data sets were connected genetically as they were derived from the same heterotic group and were adapted to the same maturity zone, but each set of DH lines experienced environmental conditions of a specific year. Because the direct progenies of crosses with selected DH lines were tested several years later (e.g. parents in S1, progenies in S4), it was important to analyse a minimum of four selection cycles. Theoretically, prediction accuracy increases with sample size, so merging as many data sets as possible for model training seems intuitively reasonable. On the other hand, it has been shown that increasing the size of the calibration set is not always beneficial (Albrecht et al. 2014;Pembleton et al. 2018;Brandariz and Bernardo 2019). Inbreeding and strong familial relationships induce cosegregation of markers and quantitative trait loci (QTL), which differ for contiguous selection cycles. Unrelated genetic material may create population substructures. Resulting phase changes between markers and QTL can have detrimental effects on prediction when the level of relatedness between the calibration and the prediction set is low (de los Campos et al. 2013). In our study, merging data sets across years significantly increased prediction accuracy in both prediction sets and for both traits, but some calibration sets comprising only three (four) data sets yielded slightly higher accuracies than did the full set for GDY and GDC (Fig. 2, Suppl. Table S3). Thus, it might be possible to create optimised calibration sets from existing genotypic and phenotypic data. In the literature, different optimisation criteria have been suggested (Rincent et al. 2012;Isidro et al. 2015;Mangin et al. 2019; Lopez-Cruz and de los Campos 2021). However, creating bespoke calibration sets is not a simple task. A profound knowledge of the genetic makeup of the population under study is required to model systemic effects, such as genetic groups, testers, trials, locations and years, correctly (Albrecht et al. 2014). Calibration sets thus need to be optimised for each trait under study. In addition, optimisation yields the highest gains in prediction accuracy when sample sizes are small, but it is expected to show diminishing returns as the size of the calibration set increases. Considering the decreasing variation in prediction accuracy among random samples when sample size N increased (Fig. 3), choosing DH lines from S all based on a trait-specific optimisation criterion is unlikely to be rewarding. Prediction based on the socalled general combining ability (GCA) model suggested by Brandariz and Bernardo (2019), where only crosses that share a common parent with the prediction set are used for model training, is feasible only if both the biparental families to be predicted and the breeding programme are quite large. In medium-sized breeding populations such as ours, where the number of progenies per cross and the number of crosses per parent are much lower than in their study (Table 1), the GCA model is not applicable. We, therefore, investigated if genomic measures can assist in evaluating the predictive performance of entire data sets generated in different selection cycles and tested in different years.
The results revealed that the combination of the calibration and prediction sets impacted prediction accuracies. Including S2 in model training did not negatively affect prediction accuracy in S5; when predicting GDY in S6, however, accuracies were close to zero (Fig. 2, Suppl. Table S3). Excluding S2 from model training increased prediction accuracy (e.g. prediction accuracy S1_3 > S1_2_3), despite a reduction in sample size of 842 DH lines in the calibration sets (Suppl. Figure S4). The effect of S2 was attenuated in the larger calibration sets and those including S5. Specific interactions between data sets are difficult to predict, and in fact, none of the genomic measures suggested that S2 and its combinations would exhibit such poor prediction accuracy. Thus, merging data from several selection cycles and evaluation years for model training increased the robustness of S5 S1_2 S1_3 S1_4 S1_5 S2_3 S2_4 S2_5 S3_4 S3_5 S4_5 S1_2_3 S1_2_4 S1_2_5 S1_3_4 S1_3_5 S1_4_5 S2_3_4 S2_3_5 S2_4_5 S3_4_5 S1_2_3_4 S1_2_3_5 S1_2_4_5 S1_3_4_5 S2_3_4_5 S1_2_3_4_5 1 3 predicted GBVs in a given prediction set. When averaging accuracies over all combinations of three (four) calibration sets, predictions with the full set were always more accurate for both traits and both prediction sets (Suppl . Table S3).
In addition, despite the small effective population size of this advanced cycle breeding population, prediction accuracy still increased for both traits even when sample sizes exceeded N = 4000.

The prediction set
In breeding schemes with a genome-based selection step, phenotypic data collected in a given year serve several purposes. The data are used to validate GBVs of selection candidates from the previous cycle and to retrain the prediction model. Furthermore, the empirical reliabilities serve as the basis for evaluating the efficiency of genome-based selection compared to phenotypic selection and for optimising breeding schemes. A comparison of expected and empirical reliabilities in a given data set assists in evaluating the usefulness of the data. Expected reliabilities (ρ 2 ) were similar in magnitude and highly correlated for both prediction sets (r = 0.97). While empirical reliabilities for GDC were consistent with expectations, those for GDY were lower than expected, especially for the smaller calibration sets in combination with S6 (Suppl. Figure S5). Several factors might have contributed to these low empirical reliabilities. Testers in S5 and S6 were related (coefficient of coancestry = 0.5) but not identical. Empirical reliabilities are expected to decrease when a different tester is used (Schopp et al. 2015), but in a commercial maize breeding programme, correlations between testers can easily exceed 0.6, especially if testers are related (Melchinger et al. 1998). Therefore, the change in testers can partly but not fully explain the difference between expected and empirical reliability. Another factor decreasing prediction reliability might have been the specific weather conditions of the year in which data set S6 was evaluated. Two of the five locations in 2015 suffered suboptimal growing conditions, leading to low yields and location specific reliabilities close to zero (data not shown). When omitting these two locations from the analysis, the empirical reliability averaged over all calibration sets increased from 0.10 to 0.14 (Suppl . Table S6). Thus, environmental effects of the prediction year contributed to differences in expected and empirical reliabilities, but as with the change in testers, these environmental factors could not fully explain the differences.
A third factor specific to S6 was its low level of molecular and phenotypic variability. The genotypic variance component for GDY in S6 was only half that in S5, explaining the difference in heritability between the two prediction sets (Table 2). Data set S6 also shared a low number of polymorphic markers with the calibration sets and exhibited a small effective sample size (N eff = 39.1). In addition, the range of LD blocks in S6 was approximately twice as large as in the other data sets (Suppl . Table S2). Both simulated and experimental data demonstrate that model training is ineffective when the length of haplotype blocks in the prediction set is not well represented in the calibration set (Hickey et al. 2014;Brandariz and Bernardo 2019). This might have contributed to the lower reliabilities in S6. With its high diversity, small LD blocks and large effective sample size, data set S5 was effective for both validating GBVs derived from S1 to S4 and retraining the prediction model to predict S6 (Fig. 4).

Genomic measures
Estimates of diversity such as the proportion of polymorphic markers, nucleotide diversity or haplotype heterozygosity Table 5 Regression analysis of prediction accuracy for grain dry matter yield (GDY) and grain dry matter content (GDC) on genomic measures characterising the 46 possible combinations of calibration and prediction sets. Significance (p-value), Akaike information criterion (AIC) and explained variance (R adj 2 ) are given for models fitting sample size (N), effective sample size (N eff ), number of poly-morphic SNPs shared by the calibration and prediction set (nPoly), average maximum kinship (u max ), linkage phase similarity (LPS) and trait-specific reliability (ρ 2 ) in combination with the affiliation to the prediction set (PS) as covariates. The last row presents results from the best model selected by stepwise regression varied little over calibration sets S1 to S5 and did not facilitate choosing between calibration sets. With array-based data in an advanced cycle breeding population, this might have resulted from an ascertainment bias of the chip towards medium allele frequencies. All other genomic measures were highly correlated with sample size N and showed high mutual correlations (Table 4). Interdependencies between measures were expected; for example, the level of relatedness of a DH line with the calibration set affected u max , the reliability of its estimated breeding value and LPS. On the other hand, markers that were monomorphic in the prediction set differentiated between calibration sets with respect to relatedness but did not affect values of nPoly and LPS. To interpret dependencies between parameters, we performed a principal component analysis. The loads of the measures on the first and second component suggested two groups (Group 1: N, N eff , u max and ρ 2 ; Group 2: nPoly and LPS). Performing stepwise regression of all possible combinations of the calibration and prediction sets with prediction accuracy of GDY as the response variable confirmed this result. However, the results from the stepwise regression must be interpreted with caution because differences in Akaike information criterion (AIC) were small across the best models (Suppl . Tables S7 and S8). Only the trait-specific expected reliability ρ 2 entered into all of the 10 best models for both traits. Thus, we conclude that compared with sample size N or u max , the average expected reliability has a higher predictive value to rank calibration sets with respect to their predictive performance in combination with a given prediction set (Table 5, Fig. 4). The effective sample size (N eff ) is a function of sample size N and the distribution of kinship coefficients between pairs of DH lines. Data set S5 had an extremely large effective sample size (N eff = 82.7), which suggests a more balanced distribution of the relatedness of DH lines than in the other data sets. Because GBVs of the prediction set are weighted averages of phenotypes in the calibration set, a more uniform distribution of relatedness might result in higher prediction accuracy with an independent prediction set (de los Campos et al. 2013). The large N eff of S5 relative to the other data sets was confirmed when drawing 1000 random samples of size N = 500 from each of the data sets (74.3 for S5 vs 31.2-43.5 for the others). The parameter separated S5 and all its combinations from other calibration sets (Fig. 4), but within the group containing S5, it was negatively associated with prediction accuracy for both traits (Suppl. Figure S6). Nevertheless, we conjecture that N eff may be useful for differentiating between data sets with respect to the distribution of kinship coefficients. Its relationship with prediction accuracy warrants further research as parameter estimates vary depending on the underlying kinship matrix (U calculated individually for each data set vs S all ).
We conclude that parameter ρ 2 (expected reliability) is a robust predictor of the prediction accuracy obtained with different calibration sets. How much can be gained from this information in practice, though, remains to be seen. It is possible that our study has overstated the strength of association between prediction accuracies and genomic measures. Only data sets S1 to S6 are independent samples from the breeding population; because the other calibration sets consist of overlapping data sets (e.g. S1_2 and S1_3 share the DH lines of S1), they are not independent. In addition, data sets S1 and S2 had extreme values for most measures, which inflated the pairwise correlations. Nevertheless, we conjecture that strong differences between expected and empirical reliabilities of predicted GBVs are informative. The magnitude of the difference may guide decisions regarding the weight that the phenotypes of a prediction set should receive in model training for future predictions and for constructing selection indices comprising genomic breeding values and phenotypic data (Lande and Thompson 1990).

Conclusions
Our results are relevant for integrating phenotypic and genome-based selection in hybrid breeding programmes. To obtain high prediction accuracies, the management of the population with respect to unrelated germplasm and mating design requires greater attention in genome-based than in phenotypic selection. Including data from additional cycles in model training attenuates the effects of different testers, individual years and genotype × year interactions, which represents an advantage of genome-based selection over phenotypic selection where for most crops highest selection intensities are applied to data from a single year. Because the results varied across traits and prediction sets, genomic parameters investigated in this study provided little guidance in choosing specific calibration sets for prediction. For optimal integration of phenotypic and genomic information, we, therefore, recommend including a substantial overlap of common entries between selection cycles to disentangle confounded factors contributing to the difference between expected and empirical reliabilities.