
1 Introduction

Several factors affect the accuracy of genomic prediction, including (1) trait-specific characteristics such as the heritability and the genetic architecture of the trait, (2) population-specific characteristics such as the level of linkage disequilibrium (LD) between markers and quantitative trait loci (QTLs) and the number of effective chromosome segments (Me) segregating in the population, (3) the statistical method used to make predictions, and (4) experiment-specific characteristics such as the marker density, the size of the calibration set (CS), and the degree of genetic relationship between the CS and the predicted set (PS). The choice of the CS, i.e., the reference individuals and their genotypic and phenotypic data used to calibrate the prediction model, is therefore crucial, especially when the predicted traits are difficult or expensive to phenotype.

In animal breeding, the pedigree-based BLUP model has been used routinely for several generations to predict the genetic value of candidates since the pioneering work of Henderson [1]. The use of genomic selection has modified the way the relationship between individuals is estimated by adding marker genotypes to pedigree information. In dairy cattle, it had little impact on the phenotypic data used to calibrate predictions for traits already evaluated in routine, but it has opened the way to considering many new traits, such as disease resistance, which cannot be evaluated directly for all animals [2]. In major crop species, the main focus is to select the best inbred lines that can be produced after selfing generations within large biparental families, either for their direct use as varieties or as parents of single-cross hybrid varieties. The number of candidate lines per family is often larger than the phenotyping capacity, and only a small set of them can be evaluated in different environments to assess their adaptation to various field conditions. In this context, pedigree information is not useful to identify the best individuals within a given family [3], and pedigree-BLUP based on historical data has therefore not been broadly used in crop breeding. With genomic selection, the differences in genetic covariance between pairs of individuals from a biparental family can be accounted for in the model, unlike with pedigree data. Phenotypes are no longer used exclusively as proxies of the genetic values of candidate individuals but also to train a predictive model involving molecular markers as predictors, which potentially modifies phenotyping strategies.

The advent of genomic selection clearly opens new possibilities for improving the breeding efficiency of both animal and plant species but raises the key question of how to define the best CS, especially in plants. In the first part of this chapter, we provide some general guidelines for this purpose; these guidelines are illustrated with examples and their biological bases are discussed. In the second part, we present the different approaches that have been proposed to optimize the reference population. In the last part, we show some applications of reference population optimization, depending on the prediction objective. Although genomic predictions are also used in human genetics, this chapter focuses on their application to breeding objectives in animal and plant species, with more emphasis on plants, where the issue of the optimization of the reference population has been the most extensively considered.

2 Impact of the Composition of the Calibration Set on the Accuracy of Genomic Prediction

2.1 Calibration and Predicted Individuals Ideally Originate from the Same Population

Most genomic prediction models, such as GBLUP or the various Bayesian models [4,5,6], assume that CS and PS individuals are drawn from the same population. As described in the examples below, this assumption is often violated, which comes at the cost of reduced accuracy.

The first case is the application of genomic prediction in a population stratified into genetic groups. This scenario has been the subject of several studies investigating (1) to what extent a generic CS can efficiently predict over a wide range of genetic diversity, or (2) to what extent genetic groups with limited resources can benefit from data originating from other genetic groups with more resources. In general, when a genomic prediction model is trained on one genetic group to perform predictions in another genetic group, the accuracy tends to be lower than what can be achieved within each group. This phenomenon has been illustrated in many animal and plant species including dairy cattle [7,8,9], sheep [10], maize [11, 12], soybean [13], barley [14], oat [15], and rice [16]. This case can also be extended to the prediction between families, as shown in mice [17], maize [18, 19], wheat [20], barley [21], and triticale [22]. In the worst case, adding individuals to the CS that are genetically distant from the PS can even deteriorate the accuracy, as shown for instance in barley [23] and maize [24].

Beyond genetic groups and families, differences in the type of genetic material (e.g., purebred and crossbred) between the CS and PS can lead to reduced accuracies compared to what can be achieved when the CS and the PS are of the same type of genetic material. Examples are the prediction of crossbred individuals using a purebred parental CS, as shown in pig [25], the prediction of a maize population admixed between two heterotic groups using one of the parental populations as CS [26], or the prediction of interspecific hybrids, as shown in Miscanthus [27].

Finally, the CS and PS may be drawn from the same population but from different breeding generations. This scenario is very common when cycles of selection are based solely on predicted genetic values to shorten breeding cycles, or when candidates are preselected to reduce their number and limit phenotyping costs. A decrease in accuracy is generally observed over cycles when prediction models are not updated with data from the selected generations, as shown using simulations [28] or real data in pig [29], sugar beet [30], alfalfa [31], maize [32], wheat [33], barley [34], and rye [35]. Note that genotype-by-environment interactions can also contribute to the drop in accuracy when the CS and PS are not evaluated in the same environments.

As illustrated by these examples, a decrease in accuracy can be expected when CS and PS individuals come from different populations or breeding generations, and this decrease may result from different factors that are presented in the following subsections.

2.1.1 LD Between Markers and QTLs Can Be Different Between Populations

Since the early developments of genomic prediction, the LD between molecular markers and QTLs has been identified as a major factor affecting accuracy [4]. LD can be defined as the nonindependence between alleles at different loci on the same gamete. In general, LD between markers and QTLs is assumed to be homogeneous within a population, but it may vary when the population is stratified, which affects the success of across-breed genomic prediction [7, 36]. Differences in LD between markers and QTLs may indeed lead to differences in the effects estimated at markers, impacting the accuracy when one population is used to predict another. LD between two loci is a function of the recombination rate, the minor allele frequency (MAF) at both loci, and the effective population size [37]. Differences in MAF and effective population size are very common whenever a population is stratified into groups [38], but differences in recombination rate can also be observed between genetic groups, as shown in maize [39]. Differences in LD extent estimated with markers have been observed among populations in dairy and beef cattle [40, 41], pig [42], chicken [43], maize [44, 45], and wheat [46]. Differences in the sign of the correlation between the allelic states of pairs of loci can also be found and are referred to as differences in linkage phase [40, 45]. With dense genotyping, the effect of differences in LD between populations on the accuracy is expected to be minimized [40, 47], as most QTLs are then expected to be in high LD with at least one marker shared by both populations. The ideal situation would be one in which the genotyping captures the causal loci themselves.
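As an illustration of the marker-based LD measures mentioned above, the short sketch below computes pairwise r² directly from a matrix of phased haplotypes. It is a minimal sketch, assuming a 0/1 allele coding and gametes in rows; the function name is ours and purely illustrative.

```python
import numpy as np

def ld_r2(haplotypes: np.ndarray) -> np.ndarray:
    """Pairwise LD (r^2) between loci from a haplotype matrix (gametes x loci, coded 0/1).

    r^2 is the squared Pearson correlation between allele indicators across gametes,
    a standard marker-based measure of the nonindependence between loci described above.
    """
    r = np.corrcoef(haplotypes, rowvar=False)  # correlation between locus columns
    return r ** 2

# Toy example: 200 gametes, 5 loci
rng = np.random.default_rng(42)
hap = rng.integers(0, 2, size=(200, 5))
print(np.round(ld_r2(hap), 2))
```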

2.1.2 QTL Allele Frequencies Can Be Different Between Populations

In addition to differences in LD, genomic prediction accuracy across populations can be affected by differences in QTL allele frequencies. The most extreme scenario is a QTL for which an allele is fixed in the CS but segregating in the PS. The effect of such a QTL cannot be estimated using the CS, and the genetic variance that it explains will not be accounted for in the prediction [48, 49]. In the context of prediction across biparental populations, Schopp et al. [50] proposed to adapt the formula of Daetwyler et al. [48], used to forecast the accuracy, by including a new term: the proportion of markers that segregate in both the CS and the PS relative to the total number of markers segregating in the PS. Based on simulations, they showed that this criterion computed using markers is a good approximation of the equivalent criterion based on QTLs when the marker density is sufficiently high, and is critical for the accuracy of predictions across families. In more complex populations, one can estimate the genetic differentiation (FST) using markers as an indication of how QTL allele frequencies differ between the CS and PS, which was shown to be negatively related to the accuracy of genomic prediction [51].
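A minimal sketch of such a shared-segregation criterion is given below; it is our own illustrative code (not the exact implementation of Schopp et al.), assuming 0/1/2 genotype matrices with matching marker order in the CS and PS.

```python
import numpy as np

def shared_segregation_proportion(geno_cs: np.ndarray, geno_ps: np.ndarray) -> float:
    """Proportion of markers segregating in the PS that also segregate in the CS.

    geno_cs, geno_ps: genotype matrices (individuals x markers) coded 0/1/2,
    with the same marker order. A marker "segregates" in a set if its genotypes
    are not all identical within that set.
    """
    seg_cs = np.ptp(geno_cs, axis=0) > 0   # marker varies within the CS
    seg_ps = np.ptp(geno_ps, axis=0) > 0   # marker varies within the PS
    n_seg_ps = seg_ps.sum()
    if n_seg_ps == 0:
        return float("nan")
    return float((seg_cs & seg_ps).sum() / n_seg_ps)

# Example with random genotypes (500 markers)
rng = np.random.default_rng(0)
cs = rng.integers(0, 3, size=(100, 500))
ps = rng.integers(0, 3, size=(50, 500))
print(shared_segregation_proportion(cs, ps))
```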

The consistency of allele frequencies between the CS and PS can be extended to the consistency of the frequencies of genotypic states at QTLs (i.e., the two homozygous states and the heterozygous state for a biallelic QTL). One example concerns dominance effects at QTLs, which can only be accounted for if some individuals carry the heterozygous state in the CS [52]. This phenomenon can explain the decrease in accuracy observed when predicting crossbred individuals using purebred individuals for traits with substantial dominance effects [25].

2.1.3 QTL Allele Effects Can Be Different Between Populations

In most genomic prediction models, the effects of QTLs are assumed to be consistent between the CS and PS. This assumption can be violated when the CS and the PS are drawn from different populations. “Statistical” additive effects reflect the average effect of substituting an allele with the alternative allele in the population and are implicitly or explicitly taken into account in most models, including GBLUP. “Functional” dominance and epistatic effects at QTLs contribute to the statistical additive effect along with the functional additive effect, but, unlike the latter, their contributions depend on allele frequencies [52,53,54]. From this phenomenon emerges the concept of genetic correlation between populations, which aims at quantifying this difference in statistical additive effects [55, 56]. In practice, note that genetic correlations are often estimated using markers and therefore also include the heterogeneity generated by differences in LD between markers and QTLs.

2.2 Genetic Relationships Between Calibration and Predicted Individuals Are Needed

In the present chapter, we define genetic relationships as standardized covariances between individuals relative to the genetic components of traits. In this context, they are defined at the QTL (i.e., causal locus) level and reflect the sharing of alleles at these loci. Genetic relationships notably include additive genetic relationships (AGRs), which describe relationships between individuals for additive allele effects. As causal loci are generally unknown, AGRs must be estimated based on the pedigree or by using markers. From a pedigree perspective, the sharing of alleles at QTLs is considered to result from their inheritance from a common ancestor. Such alleles are characterized as identical-by-descent (IBD, see Thompson [57] for a review), with IBD being defined relative to a founder population taken as the reference starting point of the pedigree. In this context, the coefficients of the pedigree relationship matrix (PRM) are expected AGRs conditional on pedigree information. Since the advent of molecular markers, AGRs can be estimated using the genomic relationship matrix (GRM) in GBLUP, often allowing better estimates of additive genetic variances than those obtained using the PRM (see Speed and Balding [58] for a review).
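For concreteness, the sketch below shows one common way of estimating such a GRM from markers, using a VanRaden-type estimator (genotypes centered by twice the allele frequency and scaled by 2Σp(1−p)). It is an illustrative sketch with our own coding choices; it is not claimed to be the exact estimator used in the studies cited in this chapter.

```python
import numpy as np

def grm_vanraden(geno: np.ndarray) -> np.ndarray:
    """Genomic relationship matrix (GRM) from a 0/1/2 genotype matrix.

    geno: individuals x markers, coded as counts of the alternate allele.
    Columns are centered by 2p and the cross-product is scaled by 2*sum(p*(1-p)),
    with allele frequencies p estimated from the data; monomorphic markers are dropped.
    """
    p = geno.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)
    w = geno[:, keep] - 2.0 * p[keep]
    denom = 2.0 * np.sum(p[keep] * (1.0 - p[keep]))
    return w @ w.T / denom

# Toy example under Hardy-Weinberg proportions: 20 individuals, 1000 markers
rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 0.9, size=1000)
M = rng.binomial(2, freqs, size=(20, 1000))
G = grm_vanraden(M)
print(G.shape, np.round(np.diag(G).mean(), 2))  # average diagonal is close to 1 for unrelated individuals
```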

In the early developments of genomic prediction, the ancestral LD between markers and QTLs was suspected to be the sole contributor to genomic prediction accuracy [4]. Ancestral LD can be defined as statistical dependencies between loci that already existed within the founders of the pedigree, generated by ancestral evolutionary forces. It therefore does not include the additional dependencies between loci that arise from the pedigree relationships between individuals. When individuals are not related by pedigree, the GRM can only capture AGRs through ancestral LD, as illustrated in Fig. 1A. However, when CS and PS individuals are related by pedigree, the GRM captures AGRs even in the absence of ancestral LD between markers and QTLs [5, 59,60,61], as illustrated in Fig. 1B. In this scenario, the GRM describes IBD at markers and can be considered as a proxy for the PRM. This explains why nonnull accuracies can be obtained when applying genomic prediction with only a few markers. The contribution of ancestral LD and pedigree relationships to the accuracy depends on the genomic prediction model. Variable selection approaches like the LASSO, or Bayesian approaches like Bayes-B, tend to exploit ancestral LD better than GBLUP to make predictions [60, 62, 63]. In addition, for a given pedigree structure, the relative contribution of ancestral LD and pedigree relationships to the prediction accuracy depends on population size: pedigree relationships tend to have a greater effect than ancestral LD on the accuracy for small CS, and conversely for large CS [63,64,65]. This contribution of ancestral LD relative to pedigree relationships is an important parameter to consider when applying genomic prediction between genetically distant CS and PS, as the accuracy due to pedigree relationships drops more quickly than that due to ancestral LD with decreasing relatedness [59, 62].

Fig. 1

Hypothetical scenarios adapted from Habier et al. [61] illustrating the different types of information captured by the genomic relationship matrix (GRM): (A) ancestral LD only, (B) pedigree only, and (C) pedigree + cosegregation (CoS). For each scenario, 50 QTLs and 50 markers are considered, with minor allele frequencies of 0.5. In scenario (A), QTLs and markers are assigned in pairs to chromosomes (a single pair per chromosome, with the two loci of each pair being in LD = 0.8). In scenario (B), QTLs and markers are assigned to different chromosomes (a single locus per chromosome). In scenario (C), QTLs and markers are assigned in pairs to chromosomes (a single pair per chromosome, with the two loci of each pair separated by 5 cM but not in LD). In scenario (A), gametes are generated independently from different founders of the population, while in scenarios (B) and (C) they are generated from individuals resulting from the crossing of founders. The additive genetic relationship (AGR) between gametes is computed by applying the formula of [124] to QTLs using the simulated allele frequencies. The pedigree relationship matrix (PRM) between gametes is constructed by assigning a coefficient of 0.5 between gametes originating from the same individuals, and 0 otherwise. The GRM is calculated like the AGR but using markers. In scenario (A), the GRM can estimate the AGR using ancestral LD, even in the absence of pedigree relationships between gametes. In scenario (B), the GRM can estimate the AGR by tracing pedigree relationships between gametes (like the PRM), even in the absence of ancestral LD between markers and QTLs. In scenario (C), the GRM can estimate the AGR within a family of gametes better than the GRM in scenario (A) and the PRM, thanks to cosegregation between QTL and marker alleles that are physically linked on chromosomes. Note that we considered haploid gametes to simplify the schematic representation, but these concepts can be generalized to relationships between diploid individuals

In addition to pedigree relationships and ancestral LD between markers and QTLs, Habier et al. [61] also demonstrated that the GRM captures cosegregation between markers and QTLs. Cosegregation characterizes the nonrandom association of alleles at linked loci that can be observed within the individuals of a given family of the pedigree. It can be distinguished from ancestral LD, which characterizes the nonrandom association between alleles at different loci that was already established in the founders of the pedigree. In the absence of ancestral LD between markers and QTLs, marker alleles will nevertheless cosegregate with the QTL alleles to which they are physically linked when new individuals are generated, as illustrated in Fig. 1C. This information is accounted for in the GRM and contributes to the genomic prediction accuracy. Note that cosegregation helps describe genetic covariances between individuals of the same family. When several families are pooled into a common CS, differences in linkage phase can be observed and may considerably limit the contribution of cosegregation to the accuracy [24].

Several studies have shown that the accuracy of genomic prediction is linked to the AGRs between CS and PS individuals. Based on simulations, Pszczola et al. [66] established the link between the deterministic reliability of genomic prediction and the average squared genomic relationship coefficient between CS and PS individuals. Habier et al. [60] illustrated in dairy cattle that the accuracy of genomic prediction increased with increasing a-max, defined as the maximum pedigree relationship coefficient between CS and PS individuals. This result was confirmed in maize [67] and oil palm [68]. More generally, the need for close pedigree relationships between CS and PS individuals has been illustrated in several species, such as mice [17].

In addition to AGRs, other types of genetic relationships, such as dominance and epistatic genetic relationships, can be modeled to improve genomic prediction accuracies. Like AGRs, these other relationships directly reflect the sharing of alleles at QTLs and can be estimated using the pedigree [57] or using markers [52, 69]. However, they are often not accounted for in genomic prediction models, as they generally have a limited contribution to the overall genetic variance, except for specific applications such as the prediction of hybrids (see Subheading 4.3).

2.3 Calibration Set Should Be As Large as Possible

When building a CS, increasing the number of individuals is generally beneficial. The importance of CS size has been shown theoretically using deterministic equations for the accuracy of genomic prediction [48, 70,71,72,73,74]. These equations show that the CS should be large enough to properly estimate the effect of each of the effective chromosome segments segregating in the population (quantified by their number Me), in particular for low-heritability traits. The effect of CS size on genomic prediction accuracy has been illustrated experimentally in plant [62, 75,76,77] and animal species [78, 79], as well as in humans [80]. However, one should keep in mind that increasing the number of individuals should be done with caution if the additional individuals are genetically distant from the PS individuals, as mentioned in the previous subsection. There is also a compromise to be found between the number of phenotyped individuals and the accuracy of phenotyping, which can be increased in plants by increasing the number of observations per individual (see discussion in Subheading 4.4).
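As a reminder of the typical form of such deterministic equations, one widely used approximation (in the spirit of Daetwyler et al. [48]) expresses the expected accuracy as a function of the CS size NCS, the trait heritability h2, and Me:

$$ r\approx \sqrt{\frac{N_{CS}\,{h}^2}{N_{CS}\,{h}^2+{M}_e}} $$

For a fixed Me, the expected accuracy thus increases with CS size but with diminishing returns, and more slowly for low-heritability traits.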

2.4 Genetic Relationships Between CS Individuals Should Be Limited

Finally, it is generally accepted that genetic relationships among individuals within the CS should be limited. This idea is related to the common assumption that, in genomic prediction, experimental designs should aim at replicating alleles rather than individuals [81]. Because individuals with high degrees of genetic relationship can be considered as partial replicates and therefore somewhat redundant, including them all may not be the best allocation of resources with respect to genomic prediction accuracy. Based on simulations, Pszczola et al. [28, 66] showed that the average reliability of genomic prediction decreased with increasing genetic relationships within the CS. These results were confirmed in dairy cattle [82]. However, limiting the genetic relationships between CS individuals is not sufficient to maximize the accuracy, as maximizing the genetic relationships between CS and PS individuals is also important.

3 Methods to Optimize the Composition of the Calibration Set

Considering all the factors affecting predictive ability mentioned above, optimizing the composition of the CS is not a simple task. We can, however, expect that when the CS and PS cover the same genetic space, they will have similar LD patterns, the same segregating QTLs, and a high genetic relationship, which are the main drivers of predictability. Numerous criteria have been proposed in the literature to optimize the composition of the CS. They can be grouped into two classes. The first class of approaches consists in identifying optimized CS based on model-free relatedness criteria. The second class of approaches is directly based on the genomic selection (GS) statistical model. They mostly rely on GBLUP, which is one of the reference GS models, and consist in defining the CS by optimizing criteria derived from the linear mixed model: the (generalized) Prediction Error Variance (PEV) and Coefficient of Determination (CD), or the expected Pearson correlation between predicted and observed values (r). Each criterion has advantages and drawbacks related to its efficiency at maximizing predictive ability, its computational demand, and its ability to optimize the CS prior to phenotyping. A brief description of the methods and criteria (including references and, where available, scripts or tools to implement them) is presented in Table 1. In this part, we review the two classes of criteria. The specific questions of predicting biparental families, optimizing or updating the CS when phenotypes are available, optimizing the CS for hybrids, and optimizing experimental designs are reviewed in Subheading 4.

Table 1 Main methods and criteria to optimize the composition of the calibration set in genomic selection. Methods are grouped into two categories (Type), depending on whether they rely on a statistical model (“model based”) or not (“model free”), and according to the target objective: partitioning a genotyped population into CS/PS (A), optimizing a CS for a given PS (B), or subsampling historical data to predict a given PS (C)

3.1 Model-Free Optimization Criteria Based on Genetic Distances Between Individuals

As mentioned above, one of the main objectives when designing a CS is to ensure that the genetic space covered by the PS is well captured by the CS. Relatedness being a key contributor to prediction accuracy, different relatedness-based criteria were proposed to optimize the composition of the CS.

3.1.1 Optimization Based on Genetic Diversity Within the CS

A first way of optimizing the composition of the CS using relatedness is to minimize genetic similarity within the CS. The underlying idea is that genetic similarity within the CS can be seen as partial redundancy, so that a CS including related individuals would be less informative than a more diverse CS (see Subheading 1). This can, for instance, be done by minimizing the average or the maximum of the relationship coefficients within the CS [83]. A similar approach was proposed by Bustos-Korts et al. [84] to design a CS providing a uniform coverage of the target genetic space; it is based on a geometric approach that ensures that no close relatives are included in the CS. Guo et al. [85] proposed a method called Partitioning Around Medoids (PAM), in which individuals are grouped into clusters and representatives of each cluster are identified. They also proposed Fast and Unique Representative Subset Selection (FURS), a sampling method based on graphical networks whose principle is to identify the nodes (individuals) with the highest degree of centrality. These criteria are best adapted to identifying a CS among a set of candidates that will be used to predict the performance of the remaining candidates (like the scenario in Fig. 2A). They cannot be used to optimize a CS for an independent PS.

Fig. 2

Comparison of two standard scenarios that can be considered for the optimization of the calibration set (CS) based on PEVmean and CDmean. In (A), a single population is split into a CS and a predicted set (PS), with the objective of optimizing the CS to best predict the non phenotyped individuals (PS); in (B), the set of CS candidates and the PS (both genotyped) are distinct. In both scenarios, the PEVmean and CDmean criteria are computed directly for the PS individuals

CS optimization based on these criteria has led to higher prediction accuracy than random sampling [84, 85], but they do not directly consider the genetic relatedness between the CS and the PS. If most of the PS individuals are concentrated in a small part of the genetic space, it is important to have many CS individuals in this part, even if this results in a low diversity within the CS. In other words, it is important to weight the different parts of the genetic space according to the distribution of the PS individuals, the optimal CS not necessarily being the one with the highest genetic diversity.

3.1.2 Optimization Based on Genetic Relatedness Between the CS and the PS

The genetic relatedness between the CS and the PS is taken into account by the criteria Gmean [23], Avg_GRM [86], and Crit_Kin [87]. For Gmean and Avg_GRM, candidates to the CS are ranked according to their average genetic relatedness to the PS, and the individuals with the highest averages are included in the CS. Roth et al. [88] and Berro et al. [89] proposed similar approaches in which individuals are ranked according to their maximum or median genetic relatedness to the PS. For Crit_Kin, the average of the relationship coefficients between the CS and the PS is maximized. These criteria generally resulted in higher predictive ability than random sampling [23, 86,87,88, 90]. Contrary to the previous criteria (Subheading 3.1.1), Gmean, Avg_GRM, and Crit_Kin take into account the genetic relatedness between the CS and PS but do not consider redundancy within the CS. To design an optimal CS, it seems important to balance the two aspects.

Another criterion that has, to our knowledge, not yet been tested in the literature and that could help reach this balance is Gmax, the maximum relatedness coefficient between a given PS individual and the CS individuals. As prediction accuracy decreases with the genetic distance between the CS and the PS (see above), it seems interesting to ensure that each PS individual has a close relative in the CS, as illustrated in Clark et al. [63] and Habier et al. [59]. One option would be to identify the CS maximizing the average Gmax over all PS individuals. This criterion seems promising, as its maximization would result in a similar genetic dispersion for the CS and the PS. We can indeed consider that the optimal CS is the PS itself, and Gmax would help identify a CS as close as possible to this optimum.
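The sketch below illustrates these relatedness-based ideas: ranking CS candidates by their mean relationship with the PS (in the spirit of Gmean/Avg_GRM) and scoring a candidate CS by the average Gmax of the PS individuals. It is a minimal sketch assuming a single relationship matrix G covering both candidates and PS; function names are ours.

```python
import numpy as np

def select_by_mean_relatedness(G: np.ndarray, cand_idx, ps_idx, n_cs: int):
    """Rank CS candidates by their average genomic relationship with the PS
    and return the n_cs candidates with the highest averages.

    G: relationship matrix covering all individuals (candidates and PS);
    cand_idx, ps_idx: index lists of the candidates and of the PS individuals.
    """
    mean_rel = G[np.ix_(cand_idx, ps_idx)].mean(axis=1)
    order = np.argsort(mean_rel)[::-1]
    return [cand_idx[i] for i in order[:n_cs]]

def average_gmax(G: np.ndarray, cs_idx, ps_idx) -> float:
    """Average, over PS individuals, of their maximum relationship with the CS
    (the 'Gmax' idea discussed above)."""
    return float(G[np.ix_(ps_idx, cs_idx)].max(axis=1).mean())
```

Both functions only require a relationship matrix covering the candidates and the PS; whether an average-Gmax objective actually balances within-CS redundancy and CS–PS relatedness remains, as noted above, to be tested.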

3.1.3 Taking Population Structure into Account

In the case of population structure, all the abovementioned criteria derived from relatedness coefficients can still be useful, as population structure is partially captured by genetic relatedness. In the case of strong structure, however, it may be necessary to take it into account directly. Numerous studies have proposed to define the CS by stratified sampling. These algorithms ensure that each group is well represented in the CS, possibly taking the size of each group into account [84, 87, 90,91,92,93]. The efficiency of these approaches at increasing predictive ability was disappointing, as they were often no better (and sometimes worse) than random sampling, and most of the time not as good as relatedness-based criteria that ignore structure. This is probably because they rely only on population structure and do not consider relatedness within groups. Their efficiency, however, increases with the strength of the structure [91]. Optimizing the CS by combining information on population structure and relatedness therefore seems an interesting alternative strategy to achieve higher accuracy. The specific and extreme case of predicting a population stratified into biparental populations is discussed below.

3.2 Optimization Using “Model-Based” Criteria Derived from the Mixed Model Theory (PEV, CD, r)

The abovementioned criteria are model-free in the sense that they do not rely on any genomic prediction model. We can, nevertheless, expect that if a GS model produces accurate predictions, working with theoretical criteria derived from the same model could be valuable. One of the reference and most efficient GS models is the GBLUP mixed model, which is particularly adapted to polygenic traits because it relies on the infinitesimal model. Analytical developments within the mixed model theory have provided criteria related to the expected predictive ability of the model before any phenotyping. In this section, we introduce these criteria and how they have been used to optimize the CS.

3.2.1 CS Optimization Using the Prediction Error Variance (PEV) or the Coefficient of Determination (CD)

Rincent et al. [83] proposed to use the generalized Prediction Error Variance (PEV(c)) or the generalized expected reliability (generalized Coefficient of Determination, CD(c)) of a contrast c to optimize the composition of the CS in genomic prediction.

In animal breeding, PEV(c) and CD(c) were first proposed to track disconnectedness in experimental designs [94, 95]. The contrast c indicates which comparison we are interested in. If one wants to compare the prediction of individual 1 with the prediction of individual 2 in a set of four individuals, then c = [1 −1 0 0]. If one wants to compare a group of individuals (1 and 2) with another group (3 and 4), then c = [1 1 −1 −1]. The sum of the contrast elements must always be zero. Contrary to plants, animals cannot be replicated in different environments, so comparing animals from different years or different herds can be a problem. The genetic relatedness between individuals obtained from the pedigree can be used to connect the different management units, and taking this connectivity into account is important to ensure that the comparison between animals is reliable. PEV(c) and CD(c) were initially used to optimize experimental designs (allocation of animals to different herds) so as to make such comparisons reliable [94, 95]. More recently, they have been applied to models relying on realized relationship matrices based on marker information [96, 97], possibly in the presence of nonadditive effects [98].

The generalized PEV and CD are derived from the GBLUP model: y = Xβ + Zu + e, where y is a vector of phenotypes, β is a vector of fixed effects, u is a vector of random genetic values (polygenic effect), and e is the vector of errors. X and Z are design matrices. The variance of the random effect u is: \( \operatorname{var}\left(\boldsymbol{u}\right)=\boldsymbol{A}{\sigma}_g^2 \), where A is the relationship matrix (realized relationship matrix in the context of genomic prediction) and \( {\sigma}_g^2 \) is the additive genetic variance in the population. The variance of the errors e is: \( \operatorname{var}\left(\boldsymbol{e}\right)=\boldsymbol{I}{\sigma}_e^2 \), where I is the identity matrix.

The PEV of any contrast c of predicted genetic values can be equivalently calculated as:

$$ \boldsymbol{PEV}\left(\boldsymbol{c}\right)=\frac{\operatorname{var}\left({\boldsymbol{c}}^{\prime}\hat{\boldsymbol{u}}-{\boldsymbol{c}}^{\prime}\boldsymbol{u}\right)}{{\boldsymbol{c}}^{\prime}\boldsymbol{c}}, $$
$$ \boldsymbol{PEV}\left(\boldsymbol{c}\right)=\frac{\boldsymbol{c}^{\prime }{\left({\boldsymbol{Z}}^{\prime}\boldsymbol{MZ}+\lambda {\boldsymbol{A}}^{-\mathbf{1}}\right)}^{-1}\boldsymbol{c}\ast {\sigma}_e^2}{{\boldsymbol{c}}^{\prime}\boldsymbol{c}}, $$
$$ \boldsymbol{PEV}\left(\boldsymbol{c}\right)=\frac{{\boldsymbol{c}}^{\prime}\left(\boldsymbol{A}-{\boldsymbol{AZ}}^{\prime }{\boldsymbol{M}}_{\mathbf{2}}\boldsymbol{ZA}\right)\boldsymbol{c}\ast {\sigma}_g^2}{{\boldsymbol{c}}^{\prime}\boldsymbol{c}}, $$

where c is a contrast, \( \hat{\boldsymbol{u}} \) is the BLUP of u, M is the orthogonal projector onto the subspace orthogonal to the columns of X: \( \boldsymbol{M}=\boldsymbol{I}-\boldsymbol{X}{\left({\boldsymbol{X}}^{\prime}\boldsymbol{X}\right)}^{-}{\boldsymbol{X}}^{\prime} \), where \( {\left({\boldsymbol{X}}^{\prime}\boldsymbol{X}\right)}^{-} \) is a generalized inverse of \( {\boldsymbol{X}}^{\prime}\boldsymbol{X} \) [94], \( {\boldsymbol{M}}_{\mathbf{2}}={\overset{\sim }{\boldsymbol{\Sigma}}}^{-\mathbf{1}}-{\overset{\sim }{\boldsymbol{\Sigma}}}^{-\mathbf{1}}\boldsymbol{X}{\left({\boldsymbol{X}}^{\prime }{\overset{\sim }{\boldsymbol{\Sigma}}}^{-\mathbf{1}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\prime }{\overset{\sim }{\boldsymbol{\Sigma}}}^{-\mathbf{1}} \), and \( \overset{\sim }{\boldsymbol{\Sigma}}={\boldsymbol{ZAZ}}^{\prime }+\lambda \boldsymbol{I} \) is the phenotypic covariance matrix divided by \( {\sigma}_g^2 \), with \( \lambda ={\sigma}_e^2/{\sigma}_g^2 \). The last expression of PEV(c) has the advantage of being computationally much more efficient when the size of the CS is small in comparison to the total number of individuals considered. PEV(c) is influenced by the genetic distance between the compared individuals and by the expected amount of information brought by the experiment on the compared individuals. A low PEV for a contrast between two individuals can be due to their close genetic similarity, or to the large amount of information brought by the experiment on the given comparison (e.g., the two individuals are related to many CS individuals, so that their predictions will be precise).

The generalized CD [94] is defined as the squared correlation between the true and the predicted contrast of genetic values, and is computed as:

$$ \boldsymbol{CD}\left(\boldsymbol{c}\right)=\mathrm{cor}{\left({\boldsymbol{c}}^{\prime}\hat{\boldsymbol{u}},{\boldsymbol{c}}^{\prime}\boldsymbol{u}\right)}^2, $$
$$ \boldsymbol{CD}\left(\boldsymbol{c}\right)=\frac{{\boldsymbol{c}}^{\prime}\left(\boldsymbol{A}-\lambda {\left({\boldsymbol{Z}}^{\prime}\boldsymbol{MZ}+\lambda {\boldsymbol{A}}^{-\mathbf{1}}\right)}^{-1}\right)\boldsymbol{c}}{{\boldsymbol{c}}^{\prime}\boldsymbol{Ac}}, $$
$$ \boldsymbol{CD}\left(\boldsymbol{c}\right)=\frac{{\boldsymbol{c}}^{\prime}\left({\boldsymbol{AZ}}^{\prime }{\boldsymbol{M}}_{\mathbf{2}}\boldsymbol{ZA}\right)\boldsymbol{c}}{{\boldsymbol{c}}^{\prime}\boldsymbol{Ac}}. $$

As for PEV(c), the last expression of CD(c) is computationally more efficient, because of the reduced size of the matrix to be inverted when the number of observations is smaller than the total number of individuals. CD(c) is equivalent to the expected reliability of the contrast. It takes values between 0 and 1: a CD(c) close to 0 means that the prediction of the contrast is not reliable, whereas a CD(c) close to 1 means that the prediction is highly reliable. With the definitions above, the generalized CD(c) can be written as \( \boldsymbol{CD}\left(\boldsymbol{c}\right)=1-\frac{{\boldsymbol{c}}^{\prime}\boldsymbol{c}\,\boldsymbol{PEV}\left(\boldsymbol{c}\right)}{{\boldsymbol{c}}^{\prime}\boldsymbol{Ac}\,{\sigma}_g^2} \). As a result, CD(c) increases with diminishing PEV(c) and with increasing genetic distance between the individuals involved in the targeted contrast, an increase in genetic distance increasing the genetic variance of the contrast. Note that if c is replaced by a vector of zeros with a single 1, the resulting CD is no longer the generalized CD of a contrast but the individual CD of the corresponding individual.
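For concreteness, the minimal sketch below computes PEV(c) and CD(c) directly from the second expressions given above; it is our own illustrative code, not a published implementation, and assumes the variance ratio λ is known.

```python
import numpy as np

def cd_pev(A, Z, X, c, lam, sigma2_e):
    """Generalized CD(c) and PEV(c) of a contrast c under the GBLUP model above
    (y = X beta + Z u + e, var(u) = A sigma_g^2, var(e) = I sigma_e^2).

    A: relationship matrix over all individuals (CS and PS); Z: incidence matrix
    linking phenotypic records to individuals (zero columns for unphenotyped PS
    individuals); X: fixed-effect design matrix; c: contrast vector summing to
    zero; lam: sigma_e^2 / sigma_g^2.
    """
    n = X.shape[0]
    # Absorption of fixed effects: M = I - X (X'X)^- X'
    M = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    C22 = np.linalg.inv(Z.T @ M @ Z + lam * np.linalg.inv(A))
    pev = (c @ C22 @ c) * sigma2_e / (c @ c)
    cd = (c @ (A - lam * C22) @ c) / (c @ A @ c)
    return cd, pev
```

In practice, the third expressions above (based on M2) are preferred for large problems, since they only require inverting a matrix of the size of the CS rather than of the whole population.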

Rincent et al. [83] first proposed to use PEV(c) and CD(c) to optimize the composition of the CS in genomic prediction. As CD(c) is the expected reliability of a given contrast c, it is a criterion of choice for maximizing prediction accuracy by optimizing the composition of the CS; the main aim of genomic selection is indeed to discriminate between individuals based on their predicted breeding values. As shown above, the computation of these criteria only requires the kinship matrix and the ratio of error to genetic variance (λ), which can be chosen based on prior knowledge. No phenotypic information is required, so the optimization of the CS can be done prior to any phenotyping. The optimization was only marginally affected by λ in Rincent et al. [83] and Akdemir et al. [99], which suggests that a CS optimizing PEV(c) or CD(c) should be efficient for any polygenic trait.

Rincent et al. [83] proposed the criteria \( PEVmean=\frac{1}{N_{PS}}{\sum}_{i=1}^{N_{PS}} PEV\left({c}_i\right) \) and \( CDmean=\frac{1}{N_{PS}}{\sum}_{i=1}^{N_{PS}} CD\left({c}_i\right) \), where ci is the contrast between PS individual i and the mean of the population, and NPS is the size of the PS. CDmean (PEVmean) is thus the average CD(c) (PEV(c)) of the PS individuals for a given CS. CDmean is expected to be better than PEVmean for improving GS accuracy, as illustrated in Rincent et al. [83] and Isidro et al. [91], since CD(c) is related to the ability to discriminate individuals. By maximizing the CDmean of the PS, we define a CS able to discriminate each predicted individual from the population average, so that the best (or worst) individuals can be reliably identified. Using two maize diversity panels, Rincent et al. [83] considered a case where only part of a population could be phenotyped, so that the CS was optimized to predict the non phenotyped individuals (PS), and a case where the CS was optimized to predict a predetermined PS (Fig. 2). They showed that a considerable increase in prediction accuracy could be reached by optimizing the CS with PEVmean, and even more with CDmean, in comparison to randomly sampled CS. Put differently, CS optimized with PEVmean or CDmean reached the same prediction accuracy as random CS with half as many phenotyped individuals. One key point with these criteria is that they take into account the kinship between all individuals (CS and PS), and therefore result in the sampling of an optimized CS specific to a given PS. As a result, it is highly recommended to optimize the PEVmean or CDmean of the predicted individuals [83, 87, 99, 100] rather than those of the individuals composing the CS [91, 101].

These criteria have been tested and validated in different species such as maize [83, 86, 87, 93], oil palm [68], wheat [102,103,104], barley [90], oat [15], cassava [105, 106], miscanthus [27], Arabidopsis [99], apple [88], and pea [107], in populations with various levels of relatedness. CDmean led to prediction accuracies at least as good as those obtained with model-free criteria [83, 86, 87, 91, 93], with some exceptions [88,89,90, 108]. Note that the contrasts are flexible and can be adapted to address specific prediction objectives. For instance, in the context of biparental families, different contrasts have to be defined depending on whether one is interested in comparing families or individuals within families (see criterion CDpop below). In the case of strong population structure, it can be necessary to adapt these criteria [87, 91, 101]. Isidro et al. [91] proposed the stratified CDmean, which maximizes the CDmean within each group. This criterion did not improve prediction accuracy in comparison to CDmean, which may be explained by the fact that CDmean takes population structure into account as long as it is captured by the kinship matrix. One of the strengths of PEV(c) and CD(c) is that they can be adapted to address specific prediction objectives (e.g., scenarios A and B in Fig. 2) by adapting the contrasts. They can be used to optimize a CS for a given, distinct PS (Fig. 2B), or to select the best CS within a population that can only be partially phenotyped, the remaining individuals being predicted (Fig. 2A). Rincent et al. [87] proposed to adapt the contrasts to take population structure into account.
In this study based on connected biparental populations, new criteria were proposed to maximize prediction accuracy within each population (CDpop), or the global accuracy not taking population structure into account (CDmean). They showed that the definition of the contrasts could be adapted to specifically address each prediction objective (see below). Examples of CS optimized with CDmean or CDpop are presented in Fig. 3.
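As an illustration, the sketch below computes CDmean for a candidate CS and a given PS using the contrasts defined above (each PS individual against the population mean). It is a self-contained, minimal sketch assuming one record per CS individual, an intercept as the only fixed effect, and a known variance ratio λ; the function name is ours.

```python
import numpy as np

def cdmean(A, cs_idx, ps_idx, lam):
    """CDmean of a candidate CS for a given PS: average CD of the contrasts
    between each PS individual and the population mean (see definitions above).

    A: relationship matrix over all N individuals; cs_idx / ps_idx: indices of
    the CS and PS individuals; lam: sigma_e^2 / sigma_g^2.
    """
    N, n = A.shape[0], len(cs_idx)
    Z = np.zeros((n, N))
    Z[np.arange(n), cs_idx] = 1.0                 # one phenotypic record per CS individual
    X = np.ones((n, 1))                           # intercept as the only fixed effect
    M = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    C22 = np.linalg.inv(Z.T @ M @ Z + lam * np.linalg.inv(A))
    cds = []
    for i in ps_idx:
        c = np.full(N, -1.0 / N)
        c[i] += 1.0                               # contrast: PS individual i vs population mean
        cds.append((c @ (A - lam * C22) @ c) / (c @ A @ c))
    return float(np.mean(cds))
```

PEVmean is obtained in the same way by averaging PEV(ci) instead of CD(ci) over the PS individuals.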

Fig. 3

Networks representing examples of calibration sets (CS) optimized with generalized CD criteria (A: CDmean and B: CDpop). The green dots indicate the individuals to be predicted (PS); the red squares indicate the 30 individuals composing the CS optimized with the CD criteria. Individuals are connected by an edge when their genetic relationship is above a given threshold. In (A), we considered a highly diverse panel in which the objective was to sample a CS optimal for the prediction of the remaining individuals. In (B), we considered different biparental populations of a Nested Association Mapping (NAM) design, with the objective of predicting one given biparental family by sampling an optimal CS from the other biparental families. The contrasts were adapted to these two prediction objectives and correspond to the criteria CDmean (A) and CDpop (B), see Rincent et al. [83, 87]. In (A), the network indicates that CDmean selects key individuals related to many others. In (B), the network illustrates that CDpop samples the individuals most representative of the PS, mostly belonging to biparental populations strongly related to the PS

3.2.2 Multitrait CS Optimization with CDmulti

Genomic prediction models can be adapted to handle multiple traits and multiple environments in the same statistical model. This was shown to increase prediction accuracy, in particular when a low-cost secondary trait is measured on the PS, i.e., trait-assisted prediction [109,110,111], or when all PS individuals are phenotyped in at least one environment of a multienvironment trial, i.e., sparse testing [112,113,114,115,116]. In these situations, the partition between CS and PS is not as clear-cut as in the previous paragraphs, as some of the PS individuals are partially observed (phenotyped for a secondary trait and/or in some of the environments). The optimization is also more complex, as the experimental design involves more than one trait or environment. The underlying model is y = Xβ + Zu + e, in which y is a vector of phenotypes concatenating the different traits, u is the corresponding vector of multitrait polygenic effects, and e is the vector of errors, with var(u) = Σa ⊗ A and var(e) = Σε ⊗ I, where Σa is the matrix of genetic variances/covariances between traits and Σε the matrix of error variances/covariances between traits. A generalized CD can be derived from this model [117] to compute the expected reliability for each individual–trait combination. This is a generalization of the single-trait CD, in which the genetic and error covariances are adapted to the multitrait context. The computation of this criterion (CDmulti) is as follows.

\( \mathrm{CDmulti}\left(\boldsymbol{c}\right)=\frac{{\boldsymbol{c}}^{\prime}\left(\left({\boldsymbol{\Sigma}}_{\boldsymbol{a}}\otimes \boldsymbol{A}\right)-{\left({\boldsymbol{Z}}^{\prime }{\boldsymbol{M}}_{\mathbf{3}}\boldsymbol{Z}+\left({\boldsymbol{\Sigma}}_{\boldsymbol{a}}^{-\mathbf{1}}\otimes {\boldsymbol{A}}^{-\mathbf{1}}\right)\right)}^{-1}\right)\boldsymbol{c}}{{\boldsymbol{c}}^{\prime}\left({\boldsymbol{\Sigma}}_{\boldsymbol{a}}\otimes \boldsymbol{A}\right)\boldsymbol{c}} \), with ⊗ the Kronecker product,

$$ {\boldsymbol{M}}_3=\left({\boldsymbol{\Sigma}}_{\upvarepsilon}^{-\mathbf{1}}\otimes \boldsymbol{I}\right)-\left({\boldsymbol{\Sigma}}_{\upvarepsilon}^{-\mathbf{1}}\otimes \boldsymbol{I}\right)\boldsymbol{X}{\left({\boldsymbol{X}}^{\prime}\left({\boldsymbol{\Sigma}}_{\upvarepsilon}^{-\mathbf{1}}\otimes \boldsymbol{I}\right)\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\prime}\left({\boldsymbol{\Sigma}}_{\upvarepsilon}^{-\mathbf{1}}\otimes \boldsymbol{I}\right). $$

Computing CDmulti requires prior knowledge of the genetic and error covariance matrices between traits, so the optimized multitrait design is specific to a set of traits or environments. In CDmulti, each individual–trait combination is characterized by a CD value (using the corresponding contrast). Ben-Sadoun et al. [117] considered a trait-assisted prediction scenario with a target trait and a secondary trait that is easy and inexpensive to phenotype and correlated with the target trait. The goal was to identify which individuals should be phenotyped for the target trait, for the secondary trait, or for both, to maximize the prediction accuracy of the PS for the target trait under budget constraints. They showed that phenotyping strategies optimized with CDmulti resulted in a slight but systematic increase in prediction accuracy in comparison to random sampling. In a multienvironment context, one can expect some level of GxE and different phenotyping costs associated with each environment; in this situation, CDmulti could help determine which individuals should be phenotyped in each environment.
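A minimal sketch of such a computation is given below. It assumes complete multitrait records on the CS, trait-major ordering of records and effects (consistent with var(u) = Σa ⊗ A), one intercept per trait as the only fixed effects, and known covariance matrices; it returns the average individual CDmulti of the PS for one target trait. The function name and interface are ours and purely illustrative.

```python
import numpy as np

def cdmulti_mean(A, Sigma_a, Sigma_e, cs_idx, ps_idx, target_trait=0):
    """Average individual CDmulti over PS individuals for one target trait,
    following the multitrait GBLUP described above.

    A: N x N relationship matrix; Sigma_a / Sigma_e: T x T genetic and error
    (co)variance matrices between traits; cs_idx / ps_idx: indices of the CS
    and PS individuals.
    """
    N, T, n = A.shape[0], Sigma_a.shape[0], len(cs_idx)
    Z0 = np.zeros((n, N))
    Z0[np.arange(n), cs_idx] = 1.0
    Z = np.kron(np.eye(T), Z0)                      # multitrait incidence matrix
    X = np.kron(np.eye(T), np.ones((n, 1)))         # one intercept per trait
    Se_inv = np.kron(np.linalg.inv(Sigma_e), np.eye(n))
    M3 = Se_inv - Se_inv @ X @ np.linalg.inv(X.T @ Se_inv @ X) @ X.T @ Se_inv
    V = np.kron(Sigma_a, A)                         # variance of the stacked genetic effects
    C = V - np.linalg.inv(Z.T @ M3 @ Z + np.kron(np.linalg.inv(Sigma_a), np.linalg.inv(A)))
    cds = []
    for i in ps_idx:
        c = np.zeros(T * N)
        c[target_trait * N + i] = 1.0               # individual CD for the target trait
        cds.append((c @ C @ c) / (c @ V @ c))
    return float(np.mean(cds))
```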

3.2.3 CS Optimization Using the Expected Predictive Ability Or Accuracy (r)

More recently, Ou and Liao [101] proposed to derive the expected Pearson correlation between phenotypes and predicted breeding values (r) in the PS, often referred to as predictive ability. This optimization criterion is also derived from the GBLUP model and can be computed without any phenotypic data. It is interesting as it directly targets the predictive ability, which is related to genetic progress. The authors showed that it resulted in higher predictive ability than other criteria derived from the GBLUP model (PEV, CD) and than stratified sampling. This conclusion could, however, be partly due to the fact that the CD criteria were computed within the CS (the genotypes of the PS individuals were not considered).

The main limitation common to all the aforementioned criteria is that they rely on genome-wide relatedness through the use of a GBLUP model, which means that they are only adapted to polygenic traits. This is not a problem for most productivity traits, but they are not adapted to traits influenced by major genes, such as some disease resistances or phenology. Theoretical developments could be proposed in the future to adapt these criteria to trait-specific genetic architectures, in particular to the presence of major genes. A new criterion (EthAcc), targeting the expected prediction accuracy (\( r\left(\hat{u},u\right) \)), was proposed to better take genetic architecture into account, using the results of genome-wide association studies obtained with historical data and the genotypic information of the PS [118, 119]. The objective here is to determine an optimal CS from existing phenotyped and genotyped individuals, a common situation in plant breeding, as breeders accumulate such data year after year. This criterion was efficient at determining the optimal size and composition of the CS, but the search algorithms were unable to identify the optimal CS without using phenotypic information from the PS. This approach implies that the CS is specific to a given trait and requires the identification of QTLs prior to CS optimization.

3.3 Search Algorithms for Optimal CS and Corresponding Packages

For most of the abovementioned criteria, it is not possible to determine analytically the CS with the optimal value of the chosen criterion: there is, for instance, no analytical way to determine the CS with the best CD, PEV, or r value. Different iterative optimization algorithms have therefore been proposed, based on exchanges of individuals between the CS and the remaining candidates, to improve step by step the criterion computed for a given CS/PS combination. These algorithms can be simple exchange algorithms [83, 87, 117], genetic algorithms [99, 101, 120, 121], or differential exchange algorithms [121] (see Table 1 for the list of available scripts implementing the different algorithms). Such iterative algorithms do not guarantee convergence toward the global optimum and have to be run with different starting values and enough iterations to reach a better CS than the initial one. One of the main limitations of these criteria is that the search algorithm is computationally demanding for large datasets composed of thousands of individuals or more [99]. Approaches based on approximations of the PEV [123,124,125], including principal component analysis of the genotypic data [99], can reduce the computational time. It would be interesting to include contrasts in this approach to optimize more specific prediction objectives.
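The sketch below illustrates the simplest of these strategies, a single-exchange hill climb around any CS criterion (for instance the cdmean sketch given earlier). It is our own illustrative code, not one of the published implementations listed in Table 1.

```python
import numpy as np

def exchange_search(criterion, candidates, n_cs, n_iter=2000, seed=0):
    """Simple exchange algorithm: start from a random CS and accept any single
    swap (one individual out, one in) that improves the criterion.

    criterion: function mapping a list of CS indices to a score to maximize.
    No global optimum is guaranteed; several random starts are advisable.
    """
    rng = np.random.default_rng(seed)
    candidates = list(candidates)
    cs = list(rng.choice(candidates, size=n_cs, replace=False))
    best = criterion(cs)
    for _ in range(n_iter):
        pool = [c for c in candidates if c not in cs]
        if not pool:
            break
        new_cs = cs.copy()
        new_cs[rng.integers(n_cs)] = pool[rng.integers(len(pool))]
        score = criterion(new_cs)
        if score > best:                      # keep the swap only if it improves the criterion
            cs, best = new_cs, score
    return cs, best
```

Genetic algorithms follow the same logic but maintain a population of candidate CS and recombine them, which can explore the search space more efficiently for large problems.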

4 Focus on Some Specific Applications of CS Optimization

4.1 CS Optimization for Predicting Biparental Populations

Plant breeders mainly work with full-sib families, which constitutes a specific case of population structure. Optimizing a CS is particularly challenging in this context because of the different LD phases and QTLs segregating across families. When considering a single family, the optimization of the CS can be done with the criteria based on genetic relatedness presented above. However, in Marulanda et al. [126], where CS optimization was applied within each family, all the tested criteria failed to optimize the CS. In this scenario, due to the strong relatedness between full sibs, the improvement associated with CS optimization is expected to be limited in comparison to what can be observed with more diverse material. Apart from these simple within-family scenarios, or situations in which the parents involved in the different crosses are genetically close, the identification of families that are highly predictive of a target family is challenging [87, 127], even when the phenotypic variance and heritability of each family are known [128]. It is common for unrelated families to result in negative prediction accuracies [19, 127], so it is important to remove such families from the calibration set.

To identify the most predictive families, Schopp et al. [50] proposed criteria such as the proportion of SNPs segregating in both the CS and the PS families (θ), the linkage phase similarity [40], or the simple matching coefficient [129]. θ was efficient at predicting the accuracy when averaged over many traits, but much less so when considering a given trait, because of trait-specific genetic architecture. Brauner et al. [127] concluded that adding unrelated families to the CS was too risky with regard to the potential gain in predictive ability, and therefore recommended including only full and half sibs.

Rincent et al. [87] proposed a criterion (CDpop) derived from the generalized CD to predict the prediction accuracy within a given family when the CS is composed of individuals sampled from one or several other families. This criterion predicted the observed prediction accuracy quite accurately and was efficient for optimizing CS specifically designed to predict a given family. The prediction accuracies were on average much higher with CDpop than with random sampling. However, this study was based on families of half sibs (NAM, [39]), and CDpop has not yet been tested on unrelated families.

4.2 CS Optimization or Update When Phenotypes Are Already Available

The criteria introduced in the previous parts were mostly proposed to optimize the composition of the CS prior to any phenotyping. Breeders, however, face situations in which some individuals have already been phenotyped, for instance when the CS has to be selected from previous breeding cycles. In these situations, the information provided by the phenotypes may be used to improve the composition of the CS. This would be valuable in two situations: the regular update of the CS along breeding cycles, and the selection of phenotypes from historical data.

4.2.1 Updating the CS

Prediction accuracy decreases over successive breeding cycles because of the lower genetic similarity and the increasing discrepancy in segregating QTLs between the CS and the PS [28,29,30,31,32,33,34,35]. This makes it necessary to regularly update the CS by phenotyping additional individuals. The selection of the new individuals to include in the CS can be done with the abovementioned criteria, but the phenotypic data collected in the previous cycles could also help update the CS. Neyhart et al. [130] and Brandariz and Bernardo [131] proposed to update the CS with the individuals having the best and worst GEBVs in the previous generation(s). Simulations showed that this resulted in higher prediction accuracy than random sampling, PEVmean, or CDmean. The efficiency of this approach has been illustrated in various experimental studies [132,133,134,135]. We can suppose that the efficiency of this strategy is due to the maximization of the number of QTLs segregating in the CS.
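A minimal sketch of this "best and worst" sampling rule is given below; it is illustrative only, assuming that predicted values (GEBVs) are available for all candidates of the previous generation, and it ignores any additional constraints (family structure, budget) handled in the published strategies.

```python
import numpy as np

def best_and_worst(gebv: np.ndarray, n_add: int) -> np.ndarray:
    """Indices of the n_add candidates to add to the CS: half with the lowest
    and half with the highest predicted values (GEBVs), in the spirit of the
    'best and worst' update strategy discussed above."""
    order = np.argsort(gebv)
    k = n_add // 2
    return np.concatenate([order[:k], order[-(n_add - k):]])

# Example: add 10 individuals out of 200 candidates
rng = np.random.default_rng(3)
print(best_and_worst(rng.normal(size=200), 10))
```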

4.2.2 Subsampling Historical Phenotypic Records

Breeders have access to large amounts of phenotypic data collected year after year that can be used to calibrate the GS model. It has, however, been shown that subsampling part of the available phenotypic data can improve the predictive ability compared to using the full dataset, as the presence of genetically distant individuals can decrease predictive ability [23]. This subsampling can be done with classical criteria such as PEVmean, CDmean, or r derived from the GBLUP model, but these cannot be used to determine the optimal CS size, as they always improve when additional individuals are added. They can, however, be used to determine the composition and the size of the CS beyond which the criterion only marginally improves [101]. A criterion such as EthAcc [119] does not have this limitation, but its use in practice is hindered by the poor ability of the search algorithm to identify the optimal CS without including the PS phenotypes. Another option is to determine a CS specific to each predicted individual, by selecting its most related individuals [23] or by optimizing criteria based on PEV(c) or CD(c) (PEVmean1, [103]). With PEVmean1, a CS is specifically designed for each PS individual by minimizing its individual PEV. Predictive abilities obtained with PEVmean1 were generally similar to those obtained with PEVmean, but higher for small CS. De los Campos and Lopez-Cruz [136] formalized an approach in which a penalty is used to set to zero the contribution of some individuals to the prediction; they showed that it could significantly increase predictive ability when the penalty coefficient is well chosen.

4.2.3 Optimizing the Choice of Individuals to Be Genotyped

In all the optimization approaches presented above, it was assumed that genotypic information was available for all CS individuals. It can, however, happen that only part of the individuals with historical phenotypic data have been genotyped, in which case it could be valuable to genotype some additional key individuals to improve the predictions. This selection can be guided by the phenotypic data or the pedigree. Boligon et al. [133] and Michel et al. [134] proposed to apply the “best and worst individuals” sampling strategy to identify the individuals that should preferentially be genotyped. Maenhout et al. [137] used the generalized CD (computed with pedigree information) to improve the subsampling of historical data by taking into account the balance (number of replicates of each variety) and the connectedness between individuals (disconnectedness can occur when unrelated individuals are evaluated in distinct trials). Bartholomé et al. [138] proposed a two-step strategy involving pedigree information and simulations.

4.3 Optimization of the Calibration Set in the Context of Hybrid Breeding

For many plant and animal species, commercial products are hybrids between individuals from different genetic groups (different breeds or heterotic groups). In animal species such as pigs or poultry, even if the commercial products are hybrids, conventional selection is often done at the purebred level and hybrid performances are seldom considered. With the advent of GS, several studies have investigated the value of including crossbred performances in the CS, in addition to or instead of purebred performances. Recently, Wientjes et al. [139] explored how to optimize the CS in this context using simulations, but focused mainly on the crossing design used to generate the crossbred individuals from the purebreds rather than on the composition of the crossbred CS itself. For allogamous plant species such as maize or sunflower, the breeding objective is to produce single-cross hybrid varieties from two inbred lines, each selected in complementary groups. In this context, the total number of potential single-cross hybrids is very large (N1 × N2, if N1 and N2 are the numbers of inbred lines in groups 1 and 2, respectively) and not all of them can be evaluated. Classically, the genetic value of a hybrid is decomposed as the sum of the general combining abilities (GCAs) of its two parental lines (i.e., the average performance of the hybrid progeny generated by crossing one parental line to the lines of the other group) and the specific combining ability (SCA) of the cross (i.e., the complementarity between the two parental lines). In 1994, Bernardo [140] proposed to use molecular markers to compute covariances between the GCAs of parental lines in each group and between the SCAs of intergroup hybrids, in order to predict the performances of nonphenotyped hybrids from phenotyped ones; this was the first application of GS in plants. Genomic selection is particularly interesting in this context since the genotypes of all potential hybrids can be derived from the genotypes of the inbred lines. This offers the possibility to use genotypes of inbred lines to (1) predict the GCA of each candidate line, whether or not it has been evaluated as a hybrid, and (2) directly predict all potential single-cross hybrid values (GCAs + SCA) to identify the most promising varieties.

The first CS optimization approaches based on empirical data highlighted that the quality of prediction of new hybrids was higher when the CS and PS hybrids shared common parental lines, that is, when the new hybrids derived from parental lines that had contributed to the CS hybrids [141,142,143]. However, there is a trade-off between the number of hybrid contributions per candidate line and the total number of lines that can contribute to the CS [142]. This trade-off depends on the proportion of SCA variance relative to GCA variance, on the total number of hybrids that can be evaluated, and on the prediction objective: the prediction of new hybrid combinations between new lines (T0 hybrids) or the prediction of new hybrids between lines that contributed to the CS (T1 or T2 hybrids when, respectively, one or two of the parental lines are parents of some CS hybrids) [144]. Studies based on real [142] and simulated [144] data showed that increasing the number of lines contributing to the CS at the expense of the number of hybrids evaluated per line is beneficial for better predicting T0 hybrids. However, doing so decreases the total number of T0 hybrids among the whole set of potential hybrids, so the optimal solution over all categories of hybrids depends on the percentage of hybrids that can be phenotyped. This advantage is also reduced when the percentage of SCA variance is high, since the accuracy of SCA prediction decreases when inbred lines are evaluated in only a single CS hybrid. When the objective is to predict hybrid values in the next generations, increasing the number of lines in the CS at the expense of their individual contributions is generally the best solution (unless a large percentage of the variance is due to SCA). Recently, Guo et al. [85] proposed a strategy called MaxCD (Maximization of Connectedness and Diversity). In this strategy, a representative subset of parental lines is first selected from patterns detected in the inbred line genomic relationship matrix. From these lines, a set of hybrids with nonoverlapping parental lines is defined and combined with a set of hybrids derived from the pairs of inbred lines most distant from each other. The idea is to represent in the CS the expected diversity of the whole set of hybrids.
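As a short illustration of the hybrid categories mentioned above, the sketch below (hypothetical names) classifies a candidate hybrid according to how many of its parental lines are already parents of CS hybrids:

```python
def hybrid_category(parent1, parent2, cs_parents_group1, cs_parents_group2):
    """T2 if both parents are already parents of some CS hybrids, T1 if only one
    of them is, T0 if neither is (cs_parents_group* are sets of parental line ids)."""
    n = int(parent1 in cs_parents_group1) + int(parent2 in cs_parents_group2)
    return {0: "T0", 1: "T1", 2: "T2"}[n]
```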

Besides these empirical optimizations, criteria based on PEV and CD have also been proposed recently. Momen and Morota [98] extended the CD and the PEV to include nonadditive effects. In a model including additive and dominance effects, they proposed to use a multikernel approach for the predictions and, as the K matrix in the CD and PEV, a linear combination of the additive and dominance relationship matrices (G and D), each weighted by the proportion of variance associated with the corresponding variance component, that is,

$$ \boldsymbol{K}=\frac{\sigma_A^2}{\sigma_A^2+{\sigma}_D^2}\boldsymbol{G}+\frac{\sigma_D^2}{\sigma_A^2+{\sigma}_D^2}\boldsymbol{D} $$

They evaluated the link between the CD and genomic prediction accuracies in an animal breeding context using simulations and real pig data. Based on their results, they proposed to use the CD for optimizing the CS. Note, however, that they did not consider a hybrid design between unrelated populations and therefore assumed in their prediction model a single additive and a single dominance variance component, which does not correspond to the decomposition of the hybrid value in terms of GCA and SCA commonly used for factorial designs. Fritsche Neto et al. [145] used the same formalism to evaluate the interest of genomic selection in different maize hybrid designs and optimized the CS using PEV. They used historical variance component estimates to weight the proportions of additive and dominance variance in the PEV computation and also considered, as a benchmark, a PEV based on additivity only. Their results showed the interest of using PEV to optimize the hybrid CS, but no benefit of considering dominance in its computation. In agreement with the empirical optimizations, they found that an optimal hybrid CS should involve as many parental lines as possible. More recently, Heslot and Feoktistov [122] also confirmed on sunflower data the interest of optimizing the hybrid CS using a PEV based on a single additive variance. Kadam et al. [93] used an individual CD to identify, among all potential hybrids that could be produced from segregating families, those to be phenotyped and included in the CS. They confirmed the interest of using these criteria (individual CD or PEV) for optimizing the CS compared to stratified sampling. Akdemir et al. [121] proposed to choose the wheat hybrids to be included in the CS so as to best predict the remaining hybrids, by maximizing the worst individual CD of the PS (CDmin), and showed its advantage over random sampling. To our knowledge, no optimization study has so far been based on the CD or PEV of contrasts (CDmean and PEVmean), and questions remain on the extension of these criteria to a GCA/SCA formalism.
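As an illustration, a minimal sketch of the combined K matrix defined above, assuming the additive and dominance variance components have already been estimated (hypothetical function name):

```python
import numpy as np

def combined_kernel(G, D, var_add, var_dom):
    """Linear combination of the additive (G) and dominance (D) relationship
    matrices, each weighted by the proportion of the corresponding variance, to be
    used as the K matrix in the usual CD or PEV expressions."""
    total = var_add + var_dom
    return (var_add / total) * np.asarray(G) + (var_dom / total) * np.asarray(D)
```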

4.4 Optimization of the Phenotypic Evaluation of the Calibration Set

Beyond the composition of the CS, a key question is the optimization of the experimental design for its evaluation in upcoming experiments or, if the CS is based on historical data, the choice of the phenotypic data to include in the model calibration process.

Optimization of the phenotyping design is a classical question in plant breeding, as a compromise must be found between the number of individuals to phenotype, which has a direct impact on the selection intensity, and the phenotyping effort: the number of traits measured, the number of replicates within each field trial, and the number of field trials [146]. Marker-assisted selection (MAS), which allows selecting nonphenotyped individuals using marker-based predictions, leads to a different optimal resource allocation compared to phenotypic selection. In MAS, phenotypes are mostly used to estimate marker effects and detect QTLs. As population size plays a major role in determining the power of QTL detection, optimal resource allocation strategies for QTL-based MAS consist in phenotyping a larger number of individuals with fewer replications per individual than in phenotypic selection [147].

The first attempts to optimize the experimental design for phenotyping the CS focused on selection within a given biparental population. These approaches were based on simulations [81] and/or on a deterministic formula of the expected accuracy of GS adapted from Daetwyler et al. [48]:

$$ r\left(g,\hat{g}\right)=\sqrt{\frac{N{h}^2}{N{h}^2+{M}_e}} $$

where N is the size of the CS, h2 is the trait heritability at the design level (which depends on the individual plot heritability, the number of plots, and the GxE variance component) and Me corresponds to the number of independent loci segregating in the population. This formula assumes that the accuracy of prediction does not depend on the CS composition. When considering a segregating population where the LD is only due to cosegregation, Me can be approximated from the number of chromosomes and the expected number of recombination events along chromosomes. Both Lorenz [81] and Riedelsheimer and Melchinger [148] therefore considered an Me value around 30 for a single biparental segregating family of maize. Endelman et al. [149] estimated Me on two real data sets of barley and maize and used these estimates to derive expected accuracies for the optimization process. In GS, phenotypic data are used to calibrate prediction equations with less concern for the accuracy of each individual marker effect estimate than in MAS. So even if the prediction accuracy of untested individuals increases with the CS size, it plateaus more quickly than for MAS, giving more flexibility in the design trade-off between the number of individuals evaluated and the number of replicates. Riedelsheimer and Melchinger [148] extended the approach by (1) considering the prediction accuracy not only of untested individuals but also of the tested individuals included in the CS to predict the genetic gain, and (2) taking into account GxE when optimizing the number of environments in which the CS is evaluated. Endelman et al. [149] showed that an efficient strategy is to combine GS with sparse designs in which different subsets of CS individuals are phenotyped in each trial, reducing the total number of plots needed without reducing the number of phenotyped individuals or the number of locations. Other optimization approaches [150, 151] also studied optimal resource allocation for the phenotyping of the CS using deterministic simulations but, instead of studying the impact of resource allocation on GS accuracy, they considered the accuracy as an input parameter. Jarquin et al. [115], using maize experimental data, confirmed the interest of using genomic prediction models including GxE effects with sparse designs in which most genotypes are evaluated in only one trial. They nevertheless recommended keeping a small percentage of individuals common to the different trials.
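For illustration, a minimal sketch of this deterministic formula (hypothetical function name), using the Me value of about 30 mentioned above for a single biparental maize family:

```python
import numpy as np

def expected_accuracy(n_cs, h2, m_e):
    """Deterministic expected accuracy of genomic prediction,
    sqrt(N h2 / (N h2 + Me)), as in the formula above."""
    return np.sqrt(n_cs * h2 / (n_cs * h2 + m_e))

# With Me around 30 the accuracy plateaus quickly as the CS grows, e.g.,
# expected_accuracy(100, 0.5, 30) is about 0.79 and expected_accuracy(400, 0.5, 30) about 0.93.
```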

All the abovementioned approaches aim at optimizing the phenotyping of an upcoming population of candidates, considering that part of them will be phenotyped to predict the remaining ones. They did not consider the genotypic information of the candidates when choosing which of them should be included in the CS at a fixed CS size. More recently, Atanda et al. [86] extended the use of the CDmean proposed by Rincent et al. [83] to this purpose in a maize data set composed of segregating families. They considered two different phenotypic designs: a sparse testing (ST) design in which all candidates of the targeted family are evaluated, but each in only one trial, and another strategy in which only half of the candidates of the targeted family (HFS) are evaluated in all field trials. They showed that CDmean efficiently selects both the subset of individuals to be evaluated in each trial in the ST design and the individuals of the targeted family to be evaluated to predict the remaining ones in the HFS design. Extensions of this approach, considering phenotypes in different trials as correlated traits, showed the interest of using a multitrait CD to optimize the allocation of CS individuals to the different field trials [116, 121]. This opens the way to combining CS optimization with optimal resource allocation.

A step forward would be to fully integrate the optimization of the CS with the optimization of the experimental design, down to the allocation of individuals to plots in each field trial. Recently, Cullis et al. [152] showed by simulation that partially replicated field trial designs, optimized using “model-based design” and accounting for genetic relatedness between genotypes based on pedigree, increased the prediction accuracy of their genetic values. The optimization was based on a sum of the PEV of all pairwise contrasts between the genetic values of the individuals, which ensured an efficient comparison between all of them. Ideally, this approach would be extended to jointly optimize the experimental design and the CS composition for the prediction of the individuals in the PS, which would require efficient optimization processes to address these two issues simultaneously.
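Under our notation, one way to formalize such a criterion (an illustrative formalization consistent with the description above, not necessarily the exact expression used in [152]) is the average PEV over all pairwise contrasts between the n genetic values to be compared:

$$ \mathrm{PEV}_{\mathrm{pairwise}}=\frac{2}{n\left(n-1\right)}\sum_{i<j}\mathrm{PEV}\left({\hat{g}}_i-{\hat{g}}_j\right) $$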

5 Conclusion and Prospects

The practical implementation of a new tool in breeding mainly depends on the balance between costs and benefits. In this regard, the optimization of experimental designs, and in particular of the calibration set in genomic prediction, is essential because it can reduce costs and increase benefits [153]. CS compositions optimized with the criteria presented here resulted most of the time in higher prediction accuracy than random CS. The choice of the appropriate criterion depends on many factors, including the prediction objectives, the population structure, the genetic architecture of the trait, and the type of data available (e.g., PS individuals genotyped or not). In any case, there is no universal CS that would be optimal for any genetic material and any trait. We emphasize that it is fundamental to take the genotypic information of the PS into account, when available, to optimize the CS.

Criteria such as CD, PEV, or r should be further investigated to address other questions, such as the optimization of the CS for predicting hybrids or crosses that have not been produced yet [93, 122]. Another application in a plant breeding context would be to jointly optimize the CS size, its composition, and the phenotypic design for each individual (it might, for instance, be beneficial to phenotype key individuals more intensively).

Another issue that should be considered is the effect of the composition of the CS on the loss of diversity in the breeding population. Eynard et al. [154] have indeed shown that the way the CS is updated affected the genetic diversity of the breeding population over cycles, possibly because reducing the diversity within the CS can result in the fixation of some QTLs. The effect of CSs optimized with the abovementioned criteria on this potential loss of diversity has not yet been studied. A constrained CS optimization procedure combining both objectives, maximizing predictive ability while controlling the loss of diversity, would be valuable but has not yet been addressed in the literature.