1 Introduction

The challenge in agriculture today is to produce enough food for an increasing population, using less land, water, fertilizer, and pesticides to limit ecological impacts. Global environmental changes, rainfall variability, nitrogen cycle alteration, higher temperature, and atmospheric CO2 concentration strongly impact crop plants phenology (Jagadish et al. 2016), resistance to pathogens/insects outbreaks (Deutsch et al. 2018), and yield (Brisson et al. 2010). As a consequence, genetic gain for stress tolerance has become one of the most important targets in plant breeding. In this context, genetic variants present in modern varieties, traditional local varieties (i.e., landraces), and wild relatives may be of interest for crop plant breeders (McCouch et al. 2013).

For about 10,000 years humans have been exerting selection pressure, both consciously (by selecting the “best” seeds or animals to contribute to the next generation) and involuntarily (through farming practices and expansion of the natural distribution range), gradually changing domesticated plants and animals to suit their needs. This piecemeal process of selection gradually morphed into breeding, and the first commercially successful plant breeding emerged at the end of the nineteenth century. The definition of the “best” has covered many different criteria (Allard 1999) across species, times, countries, and environments, and now depends on the end-users/markets targeted. Multi-trait indexes have been built empirically or economically from phenotypic observations and expert opinion. This is generally called phenotypic selection (PS). Breeders have secured rapid genetic gain by selecting only the highest-performing parental individuals in order to ensure high mean performance of the progeny. The classical strategy in crop plants is to cross lines that perform well for different traits, aiming to obtain recombinant lines in the progeny that combine as many of the chosen criteria as possible. However, continuous application of truncation selection (selection of the best individuals) without regular re-introduction of new alleles into the germplasm leads to a rapid loss of genetic diversity around loci under selection by hitch-hiking effects and all along the genome by drift. This can have negative consequences for loci not monitored in the process and reduce the long-term potential of the program. Additionally, truncation selection overlooks favorable alleles that only occur in lines that are not highly performing. The most famous examples of the negative impact of reduced diversity in cultivated plants concern disease resistance. All potato (Solanum tuberosum) varieties cultivated in Ireland were susceptible to late blight, leading to the Great Famine in 1845–1849 (Mizubuti and Fry 2006). Similarly, maize (Zea mays) varieties that all shared a common male-sterile genetic background were susceptible to southern leaf blight, which caused 15% losses in 1970–1971 in the USA (Ullstrup 1972).

Over time, cultivated plants have experienced various genetic bottlenecks through selection and drift that accompanied domestication, migrations, and subsequent local adaptation (Spillane and Gepts 2001). These bottlenecks explain the reduction of genetic diversity compared to wild relatives or local traditional varieties referred to as landraces. Only a few studies have focused on long-term changes in genetic diversity in breeding programs, for instance in maize, Zea mays (Labate et al. 1999; Feng et al. 2006; van Inghelandt et al. 2010; Gerke et al. 2015; Allier et al. 2019d) and soybean, Glycine max (Bruce et al. 2019). There is evidence that modern breeding further reduced genetic diversity (Simmonds 1962; Cooper et al. 2001; Fu 2006, 2015) and changed its geographical distribution because of large open breeding systems. Such impacts of modern breeding are dramatic in the case of bread wheat, Triticum aestivum. Although landrace diversity is composed of two major genetic groups, Europe and Asia, Asian alleles are almost absent in worldwide modern lines. Note, however, that breeders introgressed some extrinsic DNA segments (from related species) into elite lines, creating neo-diversity in bread wheat (Balfourier et al. 2019), maize (Hufford et al. 2012), barley, Hordeum vulgare (Brown and Clegg 1983), soybean (Doyle 1988; Hyten et al. 2006; Han et al. 2016; Sedivy et al. 2017), and peanut, Arachis hypogaea (Fonceka et al. 2012), to list a few examples. Fu (2006) showed that the genome-wide reduction of crop genetic diversity was minor but that allelic reduction at some major QTLs was important. Directional selection tends to fix favorable alleles at some QTLs and neighboring regions by linkage drag (Maynard-Smith and Haigh 1974). There is therefore an urgent need for efficient methodologies to monitor local and global diversity in breeding programs in order to maintain short-term and long-term genetic gain. It has indeed been shown that a broad genetic base of elite germplasm, not only at known QTLs of agronomic interest but also all along the genome, would ensure long-term genetic gain and increase resilience of crop plants to biotic and abiotic stresses in unpredictable environmental conditions (Malézieux et al. 2009).

The way breeders rank selection candidates has changed over time. Genome-wide molecular markers and derived tools can now guide their decisions and complement phenotypic information. With exponentially growing genotyping and sequencing capacities, improvements in computing and data storage, and methodological and statistical developments, genomic selection (GS, Meuwissen et al. 2001) is becoming an essential tool not only to improve the accuracy of selection, accelerate genetic gain using rapid cycles, and optimize resource allocation, but also to better manage and introduce genetic diversity in breeding programs by optimizing parental contribution and cross design. Recurrent selection schemes (Hallauer and Sears 1972) for population improvement can also be revisited with the help of GS to accumulate as many favorable alleles as possible in pre-breeding lines that can be integrated into breeding programs. Moreover, any useful information about loci controlling the variation of traits of agronomic interest (allele effects, genomic annotation) or subjected to historical evolutionary constraints (selection by environment or humans), or about parental lines (genetic group, passport, and environmental data), can be used individually or as a covariate in prediction models to optimize the selection process or cross designs. Therefore, population genomics in combination with quantitative genetics can provide relevant tools to evaluate, manage, and introduce genetic resources (GR) in (pre)breeding programs.

In this chapter, we first discuss how population genomics helps to assess genetic diversity, identify genes under selection, select candidate genes, and manage genetic diversity in crop plants. Then, we review the methodologies developed in genomic prediction and quantitative genetics to manage long-term genetic gain and genetic diversity in breeding programs. Finally, we present some future perspectives to optimize diversity valorization.

2 Population Genomics of Crop Plants Genetic Resources

2.1 Genetic Diversity in Crop Fields and Genebanks

Only a few crop species are widely cultivated around the world. Four crop species (wheat, maize, rice, and soybean) cover half of all land harvested worldwide (http://www.fao.org/faostat/en/#data/QC). Moreover, most of the widely-grown species are represented by very few varieties in the fields. Such limited genetic variation in elite germplasm increases vulnerability to market and environmental changes. Mitigation of this situation through introduction of genetic innovations relies on GR that are maintained in around 1700 genebanks worldwide. However, only a small proportion of available GR has been explored and used so far, and it is believed that their comprehensive genomic characterization could enhance their utilization in breeding.

It is estimated that 24% of allelic diversity has been lost in maize compared to teosinte (Vigouroux et al. 2005), 70% in wheat compared to wild emmer (Haudry et al. 2007), and 30% in yam (Dioscorea alata) compared to its wild relatives (Akakpo et al. 2017). This reduction of genetic diversity commonly observed in most crops is attributed to domestication and selection. Domestication corresponds to subsampling of wild progenitor species and results in what is called “the domestication bottleneck” (Goodman 1999, 2005; Meyer and Purugganan 2013; Allaby et al. 2019). Some observations suggest that the domestication bottleneck was not a rapid process associated with the dawn of agriculture, but rather a gradual genetic diversity loss that occurred during millennia (Allaby et al. 2019). The recent selection associated with modern plant breeding had a comparatively smaller impact on genetic diversity (van Heerwaarden et al. 2012), which has been well documented in maize and wheat (Reif et al. 2005; Glémin and Bataillon 2009; Meyer and Purugganan 2013). However, even the fraction of genetic diversity that has been retained in modern crops may not be effectively utilized today (Tenaillon and Charcosset 2011; Balfourier et al. 2019).

While breeding schemes rarely include diverse Genetic Resources (GR), the importance of collecting and characterizing genetic resources is widely recognized (McCouch et al. 2012). Genebank collections are an invaluable reservoir of favorable alleles that are not present in the cultivated gene pool. Examples of traits that have been successfully introgressed from GR into elite cultivars and had significant impact on crop production are numerous. In wheat for instance, dwarfing genes (reduced height loci Rht-B1 and Rht-D1) and genes conferring durable resistance against a wide spectrum of insects and diseases were introgressed by Norman Borlaug during the Green Revolution. The Sorghum Conversion Program in the USA introgressed dwarf and photoperiod-insensitive alleles into African sorghum landraces to adapt them to temperate environments (Klein et al. 2008). The Germplasm Enhancement of Maize project (GEM) (Goodman et al. 2000) enabled massive introgression of GR alleles into the elite germplasm. Introgression lines have been massively produced for peanut as well, using wild relatives (Foncéka et al. 2009). Apart from the genes that control phenology (dwarfing genes, photoperiod insensitivity), great achievements include major genes of disease resistance, such as a resistance gene against grassy stunt virus introgressed from wild rice Oryza nivara (Plucknett 1987), leaf rust resistance genes in bread wheat introgressed from Aegilops (Kuraparthy et al. 2007) or other relatives (Steffenson et al. 2007; Ellis et al. 2014), and other genes providing resistance to biotic and abiotic stresses (Huang et al. 2016). There are also a few examples proving that wild gene pools contain genetic variants that can improve quality and yield, e.g. in tomato, Lycopersicum esculentum (Gur et al. 2004), wheat, Triticum aestivum (Uauy et al. 2006), maize, Zea mays (Ribaut and Ragot 2007) and rice, Oryza sativa (Imai et al. 2013).

Since comprehensive phenotyping and genomic characterization of all the GR is beyond the capacities of genebanks or other interested parties, “core collections” are often assembled with the objective of representing most of the genetic diversity according to available information (passport and/or genotypes). These core collections are being intensively phenotyped at the national level (e.g., French initiatives Breedwheat https://breedwheat.fr and Amaizing https://amaizing.fr), or within international initiatives, such as the Seeds of Discovery platform (https://seedsofdiscovery.org) for wheat and maize.

2.2 Detection of Selection Signatures

From the breeders’ perspective, the value of genetic resources is given by the presence of agronomically interesting phenotypes that can be introgressed into elite germplasm. However, given the large number of genebank accessions multiplied by the number of potentially useful traits, phenotypic information is rarely available for genetic resources. Moreover, genetic determinants of many important traits are still poorly characterized. Given these knowledge gaps, genomicists explore crop genomes with the “bottom-up” approach (Ross-Ibarra et al. 2007), aiming to identify gene variants beneficial for crop production without phenotyping. This approach assumes that positive selection is the central force in the process of domestication and adaptation, and that it is therefore possible to identify domestication and adaptive genes by screening signatures of selection along the genome. The general methodology is based on comparing local genomic diversity measures in the target population to some reference values, which can be modeled under the assumption of neutrality, or estimated from a genome-wide average in the population or from orthologous regions of distinct populations. Although results are usually evaluated under some statistical framework to distinguish effects of selection from other evolutionary forces, the specificity and sensitivity of these tests remain problematic. Nevertheless, identification of genomic regions with limited genetic variability has a double utility in the context of breeding. On the one hand, it helps to discover genes responsible for domestication and adaptation traits, and on the other hand, it points to loci where re-introduction of lost diversity can boost resilience to environmental challenges.

Strong positive selection on a genetic variant with low initial population frequency results in a “selective sweep” (Maynard-Smith and Haigh 1974), the fixation of one haplotype around the selected allele. The following sections contain a brief description of the major genomic signatures of selection, together with a non-exhaustive list of available software tools. It needs to be emphasized that all these tools suffer to varying extents from imperfect power (the ability to find real selection signals), specificity (the ability to filter out false positives), and resolution (the ability to identify the causal loci within long sweeps), as observed in association studies. Our ability to detect signatures of positive selection in a sample of genomes depends on the time elapsed since the selection episode, its strength and duration, the mutation and recombination rates that break up haplotypes, as well as demographic events (intensity of bottlenecks, migration, differentiation, expansion, etc.), which can themselves create diversity patterns that resemble selection signatures. The resolution of the methods mostly depends on the extent of LD.

These factors need to be considered when interpreting genome-wide signatures of selection, and robust statistical thresholds are important to identify outlier loci, either with respect to the rest of the genome or to another real or simulated population that was not under selection. In practice, multiple statistics need to be collected, and the more tests converge on the same result, the higher is the confidence that the identified locus is truly under selection. As in association studies, the identified loci need to be treated as “candidates” until phenotypes are established and the role of the genes is confirmed experimentally to avoid false positives (Pavlidis and Alachiotis 2017).

2.2.1 Decrease of Genetic Diversity

The most prominent signature of positive selection is the decrease of genetic diversity. As the frequency of the selected variant increases in the population, linked variation diminishes due to the genetic hitch-hiking effect (Maynard-Smith and Haigh 1974). This decrease of variation is easily detectable by comparing the nucleotide diversity (Pi; He) of the studied population (e.g., a population from a specific environment, or a crop as a whole) to a reference population (a population from a different environment, or a wild progenitor). A major difficulty is to distinguish selection-related decrease of diversity from stochastic variation resulting from demographic processes. In practice, several-fold decrease of genetic diversity with respect to the reference population is regarded as a sign of selection. Outlier loci are identified based on a distribution of the values across the genome.
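The comparison described above can be illustrated with a minimal sketch. It assumes phased, aligned haplotypes of equal length given as strings; the function names and the toy sequences are hypothetical and are not taken from any of the cited tools. In a real scan the ratio would be computed in sliding windows and outliers identified from its genome-wide distribution.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Mean number of pairwise differences per site (pi) among aligned haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    differences = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs
    )
    return differences / (len(pairs) * len(haplotypes[0]))


def reduction_of_diversity(reference, target):
    """Ratio pi(reference) / pi(target); large values suggest a loss of diversity."""
    pi_target = nucleotide_diversity(target)
    if pi_target == 0:
        return float("inf")  # target population is monomorphic in this window
    return nucleotide_diversity(reference) / pi_target


# Toy window (hypothetical data): a wild reference and a cultivated target sample.
wild = ["ACGTACGT", "ACGAACGT", "TCGTACGA", "ACGTTCGT"]
crop = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGAACGT"]

print(nucleotide_diversity(wild))          # 0.25
print(nucleotide_diversity(crop))          # 0.0625
print(reduction_of_diversity(wild, crop))  # 4.0
```

A fourfold reduction in a single window, as in this toy case, would only be a candidate signal: the same pattern can arise from a bottleneck, which is why the ratio is interpreted against the distribution across the genome.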

The decrease of nucleotide diversity in the vicinity of a selected variant is mainly due to a change of allelic frequencies at linked sites, rather than to a decrease of the total number of polymorphic sites. This shift in the Site Frequency Spectrum (SFS) toward high- and low-frequency derived variants is another signature of selection (Braverman et al. 1995) and is attributed to the fact that neutral variants that are initially linked with the beneficial allele increase in frequency, while newly-emerging neutral variants hitchhike with the selected allele, and therefore remain in the population. This shift in the SFS can be measured by the summary statistic Tajima’s D (Tajima 1989), where the average number of nucleotide differences between pairs of sequences (Pi; π) is compared to the total number of polymorphic sites scaled by the sample size (Watterson’s Theta; θW). The lack of medium-frequency variants in the vicinity of the selected allele causes a decrease in π while the total number of polymorphic sites may remain unaffected, and this pattern is reflected by negative values of Tajima’s D.
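As a worked illustration of how π and θW are contrasted, the following sketch computes Tajima’s D with the standard variance constants of Tajima (1989). The input format (haplotypes as equal-length strings) and the two toy samples are assumptions made for the example only.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for aligned haplotypes; returns None when D is undefined."""
    n = len(haplotypes)
    # S: number of segregating (polymorphic) sites in the sample.
    s = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    if s == 0:
        return None
    # k: mean number of pairwise differences (pi, not divided by length).
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None  # happens for very small samples, e.g. n = 2
    # Numerator: pi minus Watterson's estimator (S / a1).
    return (k - s / a1) / sqrt(variance)


# Hypothetical samples: only rare variants (sweep-like) vs. only
# intermediate-frequency variants.
sweep_like = ["0000", "1000", "0100", "0010", "0001", "0000"]
balanced = ["1111", "1111", "1111", "0000", "0000", "0000"]

print(tajimas_d(sweep_like))  # negative: excess of low-frequency variants
print(tajimas_d(balanced))    # positive: excess of intermediate-frequency variants
```

The sign convention matches the text: a deficit of medium-frequency variants lowers π relative to θW and drives D below zero.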

The statistical basis for the identification of the SFS shifts as signatures of selection was improved by the introduction of a Composite Likelihood Ratio (CLR) test (Kim and Stephan 2002). The CLR test compares the probability of the observed polymorphism data emerging under a standard neutral model with the probability of the data emerging under a selective sweep model. Nielsen (2005) introduced SweepFinder, a modification of the CLR test where the standard neutral model is replaced with an empirical SFS of the entire data set, which increases the robustness of the test under different demographic scenarios (e.g., mild bottlenecks). SweeD (Pavlidis et al. 2013) is another implementation of the CLR test that is numerically more stable and faster when analyzing large numbers of genomes.

2.2.2 Increase of LD

2.2.2.1 Local Increase of LD

A variety of tools that detect signatures of selection rely on the observation that haplotypes (stretches of DNA sequence uninterrupted by recombination) of recently selected genes extend much further than expected under neutrality. Extended Haplotype Homozygosity (EHH) (Sabeti et al. 2002), Integrated Haplotype Score (iHS) (Voight et al. 2006), Cross-population Extended Haplotype Homozygosity (XPEHH) which measures the reduction in haplotype diversity in cross-population comparisons (Sabeti et al. 2007), and nSL (Ferrer-Admetlla et al. 2014) are all based on the model of a hard selective sweep, where a de novo adaptive mutation arises on a haplotype that quickly sweeps toward fixation, reducing genetic diversity around the locus. If selection is strong enough, this occurs faster than recombination or mutation can act to break up the haplotype, and thus a signal of high haplotype homozygosity can be observed extending from an adaptive locus. These statistics, nSL in particular, retain some power to detect soft sweeps as well. They are implemented in Selscan that has been optimized for large datasets (Szpiech and Hernandez 2014).
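The quantity underlying these statistics can be sketched as follows. EHH is taken here as the probability that two randomly chosen haplotypes carrying a given core allele are identical over the whole interval between the core site and a distant site. This is a simplified illustration on hypothetical phased haplotypes (strings of 0/1) and does not reproduce the normalization or integration steps of iHS, XPEHH, or nSL as implemented in Selscan.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, core_allele, end):
    """EHH from site `core` to site `end` (inclusive) among carriers of `core_allele`."""
    carriers = [h for h in haplotypes if h[core] == core_allele]
    if len(carriers) < 2:
        return None
    lo, hi = min(core, end), max(core, end)
    # Group carriers by their haplotype over the interval and count identical pairs.
    groups = Counter(h[lo:hi + 1] for h in carriers)
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(len(carriers), 2)


# Hypothetical phased haplotypes; site 0 is the core SNP.
haps = [
    "11010", "11010", "11010", "11011",  # carriers of allele 1: long shared haplotype
    "00101", "01001", "00110", "01100",  # carriers of allele 0: rapidly decaying
]

print(ehh(haps, 0, "1", 3))  # 1.0
print(ehh(haps, 0, "1", 4))  # 0.5
print(ehh(haps, 0, "0", 4))  # 0.0
```

In this toy case, homozygosity around allele 1 persists over a longer distance than around allele 0, which is the contrast that iHS-type statistics summarize by integrating EHH over distance for each allele.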

Apart from the extended haplotypes, positive selection creates another specific pattern of LD. As the frequency of a beneficial mutation increases, together with the frequency of linked neutral variants, recombinations sometimes occur on either side of the selected mutation. Since recombinations on the two sides are independent, and double recombinations are much less likely, pairs of variants on the same side of the beneficial mutation show elevated LD, but pairs of variants compared across the beneficial mutation show lower LD. This pattern can be measured by the ω-statistic (Kim and Nielsen 2004), which has been implemented in OmegaPlus (Alachiotis et al. 2012). Since the ω-statistic can be assessed at each interval between two SNPs, this method has the potential, at least in theory, to identify the locus under selection very precisely. However, it should also be noted that ω is only applicable when haplotypes are known, either from phased data or from inbred lines (e.g., in self-pollinating species).
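To make the LD pattern concrete, the sketch below evaluates ω at a given split of a window: the mean r² among SNP pairs on the same side of the split divided by the mean r² among pairs spanning it, following the definition of Kim and Nielsen (2004). The haplotype encoding (strings of 0/1), the helper names, and the toy data are assumptions for illustration; OmegaPlus additionally optimizes the window borders, which is not shown here.

```python
from itertools import combinations
from math import comb


def r_squared(haps, i, j):
    """Squared allelic correlation between biallelic sites i and j."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # monomorphic site: LD undefined, treated as 0 here
    return (p_ij - p_i * p_j) ** 2 / denom


def omega(haps, split):
    """Omega for a window whose first `split` SNPs lie left of the tested position."""
    n_sites = len(haps[0])
    left, right = range(split), range(split, n_sites)
    within = sum(r_squared(haps, i, j) for i, j in combinations(left, 2))
    within += sum(r_squared(haps, i, j) for i, j in combinations(right, 2))
    n_within = comb(len(left), 2) + comb(len(right), 2)
    between = sum(r_squared(haps, i, j) for i in left for j in right)
    n_between = len(left) * len(right)
    if n_within == 0 or between == 0:
        return None
    return (within / n_within) / (between / n_between)


# Hypothetical window of four SNPs: sites 0-1 and sites 2-3 are each in
# complete LD, while LD across the two blocks is weak.
haps = ["1111", "1111", "1100", "0011", "0000", "0000"]

print(omega(haps, 2))  # about 9: the split falls between the two LD blocks
print(omega(haps, 1))  # about 1: the split cuts through an LD block
```

Scanning all possible splits and keeping the maximum is what gives the statistic its potential resolution: ω peaks where the split coincides with the boundary between the two LD blocks.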

All three aforementioned signatures of selection – decrease in nucleotide diversity, shift in the SFS, and a specific LD pattern – can be assessed simultaneously by RAiSD, a tool introduced by Alachiotis and Pavlidis (2018). On modeled data, this composite evaluation test outperforms tools that measure those signatures of selection individually. However, unlike other methods, RAiSD assumes that polymorphisms are sampled evenly across the genome, and this assumption may be severely violated (e.g., in exome data).

2.2.2.2 Global Increase of LD

In breeding programs, the detection of inbreeding is also relevant. Runs of homozygosity (ROH) are contiguous homozygous segments that arise when both parents transmit identical haplotypes to their offspring. The proportion of the genome covered by ROH estimates the fraction of the genome that is identical by descent. Individuals that have undergone recent inbreeding will exhibit long runs of homozygosity (MacLeod et al. 2009; Peripolli et al. 2017). ROH was adapted by Allier et al. (2019d) for inbreds and named ROHe (Runs of Expected Homozygosity).
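A minimal sketch of ROH detection in a single diploid individual is given below. It assumes genotypes coded as 0, 1, or 2 copies of the alternative allele (1 being heterozygous), no missing data, and no tolerance for heterozygous calls within a run; the thresholds `min_snps` and `min_length` are arbitrary illustrative values. Production tools relax these assumptions, and the ROHe adaptation for inbreds is not reproduced here.

```python
def runs_of_homozygosity(genotypes, positions, min_snps=3, min_length=1000):
    """Return (start_bp, end_bp) for each run of consecutive homozygous SNPs."""
    runs = []
    start = None
    for index, genotype in enumerate(genotypes):
        if genotype != 1:            # homozygous call: open or extend a run
            if start is None:
                start = index
        elif start is not None:      # heterozygous call: close the current run
            runs.append((start, index - 1))
            start = None
    if start is not None:
        runs.append((start, len(genotypes) - 1))
    # Keep only runs that are long enough in SNP count and physical length.
    return [
        (positions[first], positions[last])
        for first, last in runs
        if last - first + 1 >= min_snps
        and positions[last] - positions[first] >= min_length
    ]


def f_roh(runs, genome_length):
    """Fraction of the genome covered by ROH (an estimate of inbreeding)."""
    return sum(end - begin for begin, end in runs) / genome_length


# Hypothetical individual: 14 SNPs spaced 1 kb apart.
genotypes = [0, 0, 2, 2, 0, 1, 0, 1, 2, 2, 2, 2, 0, 0]
positions = list(range(0, 14000, 1000))

runs = runs_of_homozygosity(genotypes, positions)
print(runs)                # [(0, 4000), (8000, 13000)]
print(f_roh(runs, 14000))  # about 0.64
```

The isolated homozygous call at 6 kb is discarded by the `min_snps` filter, which illustrates why thresholds matter: short runs are common by chance and carry little information about recent inbreeding.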

Some of the causal factors behind the occurrence of ROH are population phenomena, such as genetic drift, population bottlenecks, inbreeding, and intensive artificial selection. Consequently, the identification and characterization of ROH can provide insights into how a population has evolved over time in the past, and additionally, into how a population has to be managed in the future in long-term breeding programs. Intense selection regimes in livestock populations have already alerted the scientific community about the need for strategies to preserve populations, characterize and monitor inbreeding, and manage the genetic diversity by optimizing genetic contributions and mating (See Sects. 3.3.7 and 3.3.8).

2.2.3 Extreme Differentiation

Local adaptation can also be indicated by extreme differentiation of allelic frequencies between genetic groups, especially in contrasted environments. As allelic differentiation can also result from demographic events, it is important to interpret outlier loci cautiously, especially when hierarchical population structure is detected. A relevant strategy to identify significant outliers relies on building a distribution of expected values of the tested statistics in the absence of selection, using neutral coalescent simulations (Bellucci et al. 2014).

A number of statistics are available (Cruickshank and Hahn 2014), with FST (single site divergence index) (Wright 1931) being the most common. Outlier differentiation methods rely on a hypothesis that under certain conditions (migration-drift equilibrium under a neutral island model with spatially uniform migration and gene flow), population differentiation of allele frequencies across a large number of loci can be used to infer the process of selection acting on a subset of loci (Lewontin and Krakauer 1973). Loci with FST values (or other genetic distance measures) significantly greater than the genome-wide distribution of the statistic are presumed to be under diversifying selection or linked to those under selection. FDIST(2) (Beaumont and Nichols 1996) implemented in LOSITAN (Antao et al. 2008) assumes a finite island model to generate a null FST distribution and can deal with heterozygosity. ARLEQUIN (Excoffier and Lischer 2009) adds hierarchical genetic structure to the inference. BayeScan (Foll and Gaggiotti 2008) uses a Bayesian method to estimate the relative probability that each locus is under selection. FLK (Bonhomme et al. 2010) uses a modified version of the Lewontin and Krakauer (1973) test for selection by comparing allele frequencies of different populations in a neighbor-joining tree constructed using a matrix of Reynolds’ genetic distances (Reynolds et al. 1983). Its extension hapFLK (Fariello et al. 2013) calculates a haplotype-based frequency differentiation index among hierarchically structured populations. It is robust with respect to bottlenecks and migration and can detect incomplete sweeps. OutFLANK (Whitlock and Lotterhos 2015) does not invoke any specific demographic model and uses a modified version of the Lewontin and Krakauer method to infer a null FST distribution. XTX (Günther and Coop 2013), implemented in Baypass or Bayenv2 (Coop et al. 2010), employs a Bayesian hierarchical model to test individual SNPs against a null model generated by the covariance in allele frequencies between populations from the entire set of SNPs.
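The outlier logic shared by these methods can be sketched in its simplest form: compute a per-locus FST and flag the loci in the upper tail of the genome-wide distribution. The example assumes two populations of equal weight, biallelic loci, and known allele frequencies, and it uses the basic definition FST = (HT − HS) / HT. The empirical ranking used here is only a stand-in for the explicit null models (island model, population tree, Bayesian hierarchical model) that the cited tools rely on; function names and frequencies are hypothetical.

```python
def fst(p1, p2):
    """Single-locus FST = (HT - HS) / HT for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # locus fixed for the same allele in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def outlier_loci(freqs_pop1, freqs_pop2, top_fraction=0.1):
    """Indices of the loci with the highest FST (empirical upper tail)."""
    values = [fst(p1, p2) for p1, p2 in zip(freqs_pop1, freqs_pop2)]
    n_top = max(1, int(len(values) * top_fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(ranked[:n_top])


# Hypothetical allele frequencies at 10 loci in two populations.
pop1 = [0.5, 0.4, 0.6, 0.5, 0.9, 0.3, 0.5, 0.45, 0.55, 0.5]
pop2 = [0.5, 0.45, 0.55, 0.4, 0.1, 0.35, 0.6, 0.5, 0.5, 0.45]

print(fst(0.9, 0.1))             # about 0.64
print(outlier_loci(pop1, pop2))  # [4]
```

An empirical tail always contains loci, even in the absence of selection, so such a ranking only nominates candidates; controlling false positives requires a null distribution that accounts for demography and population structure.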

PCAdapt (Luu et al. 2017) assumes that genes under selection are outliers with respect to the prevalent population structure. It calculates z-scores that measure the relatedness of each SNP to the first K principal components of genome-wide variation in a population. The computation uses Mahalanobis distance, which is robust in the presence of hierarchical population structure. Comparisons on simulated data revealed that the false discovery rate of PCAdapt is around 10%, similar to that of hapFLK and OutFLANK (Whitlock and Lotterhos 2015), but much lower than that of BayeScan (40%), which is negatively impacted by admixture. PCAdapt and hapFLK are the most powerful tools in scenarios of population divergence and range expansion.

Differentiation among populations can also be detected by comparing site frequency spectra. Selection does not only shift the SFS in the vicinity of the beneficial mutation toward high- and low-frequency variants (see Sect. 2.2.1), but it also causes multilocus allele frequency spectra to differ between two populations. A CLR test can be used to assess whether such local genetic differentiation departs from the expectation under neutrality, as implemented in XP-CLR (Chen et al. 2010) (https://reich.hms.harvard.edu/software).

Additional methods for identification of loci involved in local adaptation exist, but may not be applicable to large data sets (Hoban et al. 2016).

2.2.4 Specific Cases of Genetic Differentiation

2.2.4.1 Domestication/Selection

Genome scans have been performed in order to detect signatures of selection in most major crops where whole or partial genome sequence data is available for at least 100 accessions. The scans usually aim to distinguish domestication signatures (obtained by comparing traditional landraces to wild progenitors) from genetic improvement signatures (obtained by comparing landraces and modern cultivars).

Hufford et al. (2012) detected 484 loci showing signatures of domestication and 695 loci showing signatures of genetic improvement (with 23% overlap) in maize, using differentiation indexes (FST, XP-CLR), diversity indexes (π, ρ, Tajima’s D and normalized Fay and Wu’s H), and haplotype lengths in each genetic group. These results suggest that some traits are of continuous agronomic importance since domestication and additional traits became of interest during the breeding era. In total, 6–11% of the identified loci had no annotation and could correspond to regulatory regions. Some identified candidates of domestication genes controlled flowering time, nitrogen metabolism, thousand kernel weight, phyllotaxy, and seed germination. Some genetic improvement candidates were involved in the biosynthesis of a plant growth hormone gibberellin, or in drought tolerance pathway. According to a gene expression survey, the greatest changes in expression (presence or absence of expression) were observed in the domestication genes, i.e. between wild and cultivated lineages. Expression of the candidate targets of selection in cultivated lines was more homogeneous, with subtle variations, perhaps indicating the importance of fine-tuned expression, as opposed to “on and off” variability. This observation suggests that while the domestication period mostly selected particular gene variants, selection during the improvement period acted predominantly on cis-acting regulatory elements. This information is of interest for private breeding programs (Allier et al. 2019d) that intend to monitor global and local losses and gains of diversity over time in their germplasm using genetic and genomic indicators.

In bread wheat, regions that lost diversity during domestication (Haudry et al. 2007), improvement (Reif et al. 2005; Cavanagh et al. 2013), or both (Pont et al. 2019) have also been investigated. By examining genetic differentiation (PCAdapt) and diversity patterns (Reduction Of Diversity π Index, Tajima’s D), Pont et al. (2019) confirmed selection signatures for domestication genes conferring brittle rachis (Brt), tenacious glume (Tg), homoeologous pairing (Ph) and non-free-threshing character (Q); improvement genes controlling photoperiod sensitivity (Ppd), vernalization (VRN), reduced height (Rht), glutenins (Glu) and gliadins (Gli), frizzy panicle (FSP), grain number (GNS), waxy proteins (Wx), and plant architecture (uniCULm). Through FST screening among landraces and cultivars, Cavanagh et al. (2013) found introgression patterns surrounding phenology genes (Rht-B1: dwarfing, Ppd-B1: photoperiod insensitivity, Vrn1: flowering time) and the Sr36 gene involved in resistance to stem rust.

In tetraploid wheat, Maccaferri et al. (2019) used genetic diversity differentiation indexes (FST, hapFLK, XP-CLR, and XP-EHH) between wild and domesticated emmer (T. turgidum ssp. dicoccoides, T.t. ssp. dicoccum, respectively), durum landraces, and cultivars. They confirmed selection on domestication genes (two brittle rachis regions, a glume QTL controlling threshability) and improvement genes, some of which are associated with disease resistance (e.g., Sr13 and Lr14) and grain yellow pigment content loci (e.g., Psy-B1). They also identified TdHMA3-B1 as the best candidate involved in phenotypic variation of Cd accumulation in the grain. The non-functional TdHMA3-B1b allele could be responsible for a reduction in root vacuolar sequestration of Cd and Zn. It was suggested that under Zn-limiting conditions, this allele increases the pool of Zn available for transport to the shoot, thereby sustaining shoot growth.

Genetic diversity and differentiation indexes (Tajima’s D and FST, respectively) screened on wild and domesticated yam aided the detection of root development (SCARECROW-LIKE gene), starch biosynthesis (Sucrose Synthase 4 and Sucrose Phosphate Synthase 1), and photosynthesis related genes (Akakpo et al. 2017) that likely facilitated habitat change during domestication (from cultivation under trees to open field cultivation).

2.2.4.2 Adaptive Introgression

Screening for past introgressions, i.e., DNA fragments that were absent from the cultivated gene pool until some point in time and appeared in recent material through crosses with related species, is another way to identify candidates of agronomic interest (Hufford et al. 2013; Racimo et al. 2015; Schaefer et al. 2016; Rochus et al. 2018). From a fundamental perspective, detecting gene flow or gene introgression from a distinct population or a different species can help to understand adaptation to various environments and evolution (Anderson 1953; Rieseberg and Wendel 1993). When introgression increases fitness, it is referred to as “adaptive introgression.” It can also help to reconstruct speciation processes.

In principle, genetic introgression can be inferred when the genealogy of a locus in a population or species (i.e., “local ancestry”) does not match the “global ancestry” estimated from genome-wide variation. This approach has been used, for example, to reveal the portion of loci introgressed from japonica rice into the indica cultivar 93–11 (Yang et al. 2011). Since it is impractical to quantify alternative gene tree topologies on a genome-wide scale using more than a few individuals (the number of possible rooted trees grows faster than exponentially with the number of tips), other methods are necessary to study introgressions on a population level.
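The scale of the problem can be made explicit with a short calculation. For n labeled tips, the number of distinct rooted binary tree topologies is the double factorial (2n − 3)!! = 1 × 3 × 5 × … × (2n − 3), a standard combinatorial result; the function name below is chosen for the example.

```python
def rooted_binary_trees(n_tips):
    """Number of rooted, binary, labeled tree topologies: (2n - 3)!!"""
    count = 1
    for factor in range(3, 2 * n_tips - 2, 2):
        count *= factor
    return count


for n in (3, 4, 5, 10, 20):
    print(n, rooted_binary_trees(n))
# 3 tips give 3 topologies, 4 give 15, 5 give 105, and 10 already give 34,459,425.
```

With only a few dozen individuals the number of topologies exceeds anything that could be enumerated per locus, which is why population-level studies of introgression rely on summary or model-based approaches instead.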

Several approaches can be employed to detect past admixtures that concern the whole genome without resolving the ancestry of individual loci. They quantify fractions of genomes associated with distinct populations. These include multivariate analyses, such as PCA (Patterson et al. 2006; Jombart et al. 2009), or model-based clustering algorithms like STRUCTURE, NewHybrids, ADMIXTURE, and sNMF (Pritchard et al. 2000; Anderson and Thompson 2002; Alexander et al. 2009; Frichot et al. 2014).

Different algorithms have been proposed to detect individual introgressed fragments. A model-based inference implemented in HAPMIX (Price et al. 2009) uses phased data (i.e., known haplotypes) and known ancestral populations (assuming only two contributing populations). The central idea of the method is to view haplotypes of each admixed individual as being sampled from the reference populations. At each position in the genome, HAPMIX estimates the likelihood that a haplotype from an admixed individual originated from one reference population or the other. A Hidden Markov Model (HMM) is used to combine these likelihoods with information from neighboring loci, to provide a probabilistic estimate of ancestry at each locus (Fig. 1).

Fig. 1

Schematic representation of the Markov model used for ancestry inference. The black lower line in this figure represents a chromosomal segment from an admixed individual, carrying a number of typed mutations (black circles). The underlying ancestry is shown in the bottom color bar and reveals an ancestry change from the first population (red) to the second population (blue). The admixed chromosome is modeled as a mosaic of segments of DNA from two sets of individuals drawn from different reference populations (red and blue horizontal lines, respectively) closely related to the donors in the admixture event. The yellow line shows how the admixed chromosome is reconstructed with respect to this mosaic. The dotted line above the bottom color bar shows the reference population being copied from along the chromosome – note that at most positions, this is identical to the true underlying ancestry, but with occasional “miscopying” from the other population (blue dotted segment occurring within red ancestry segment). Reproduced from Price et al. (2009)
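The idea of combining per-locus likelihoods with information from neighboring loci can be sketched with a two-state forward-backward pass. This is a deliberately simplified illustration and not the HAPMIX model itself: it assumes that per-locus likelihoods under each ancestry are already available, and it uses a single hypothetical switch probability between adjacent loci instead of modeling haplotype copying and recombination maps.

```python
import numpy as np


def ancestry_posterior(emission, switch_prob=0.01, prior=(0.5, 0.5)):
    """Posterior ancestry probability per locus (forward-backward, scaled).

    emission[t, k] is the likelihood of the data at locus t under ancestry k
    (assumed to be given). switch_prob is a hypothetical, constant probability
    of an ancestry switch between adjacent loci.
    """
    emission = np.asarray(emission, dtype=float)
    n_loci, n_states = emission.shape
    trans = np.full((n_states, n_states), switch_prob / (n_states - 1))
    np.fill_diagonal(trans, 1.0 - switch_prob)

    fwd = np.zeros_like(emission)
    bwd = np.ones_like(emission)
    f = np.asarray(prior, dtype=float) * emission[0]
    fwd[0] = f / f.sum()
    for t in range(1, n_loci):
        f = (fwd[t - 1] @ trans) * emission[t]
        fwd[t] = f / f.sum()  # rescale to avoid numerical underflow
    for t in range(n_loci - 2, -1, -1):
        b = trans @ (emission[t + 1] * bwd[t + 1])
        bwd[t] = b / b.sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)


# Toy data: six loci favor ancestry 0 (one of them is noisy), six favor ancestry 1.
toy = np.array([[0.9, 0.1]] * 2 + [[0.3, 0.7]] + [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 6)
posterior = ancestry_posterior(toy)
print(posterior.argmax(axis=1))
```

The noisy third locus shows the role of the neighboring loci: taken alone it favors the second ancestry, but the low switch probability keeps it assigned to the flanking ancestry block.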

HAPMIX, LAMP-LD, and RFMix packages for local ancestry inference were developed to provide accurate results on human data and recent admixture events but may be difficult to parameterize for crop species. A recently published open-source software Loter does not require any biological parameters and can be applied to a wide range of species (Dias-Alves et al. 2018). Performance testing on simulated datasets revealed that HAPMIX is severely impacted by imperfect haplotype reconstruction, and Loter is the least impacted by increased time since admixture. Loter was used to infer local ancestries in aromatic rices that originated millennia ago through an admixture event between japonica rice and Indian aus-like rice (Civáň et al. 2019). However, the authors noted that the local ancestry inference was affected by sample size and missing data in simulations.

Another methodological approach mostly used to provide global estimates of gene flow is based on quantification of shared derived variation among non-sister taxa or populations (Kulathinal et al. 2009; Patterson et al. 2012; Peter 2016). In the absence of gene flow, a correct and rooted four-taxon tree will have the two most-recently diverged taxa sharing statistically equal amounts of derived variants with their non-sister taxon. Deviations from this expectation indicate gene flow. Popular implementations of this concept (often called ABBA-BABA) include the D-statistics (Green et al. 2010; Durand et al. 2011) and f-statistics (Reich et al. 2009, 2012). However, ABBA-BABA is based on a neutral evolution model, and its robustness in the presence of selection has not been tested. Since selection can increase local similarity among non-sister groups similarly to introgression, Civáň and Brown (2018) argue that only variants demonstrably absent in their ancestral population should be counted toward the introgression signal.
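As a minimal sketch of the ABBA-BABA logic, the function below computes a D-statistic for a four-taxon tree (((P1, P2), P3), outgroup) from derived-allele frequencies. With 0/1 values it reduces to counting ABBA and BABA site patterns. The function and argument names are hypothetical, significance testing (for example by block jackknife) is omitted, and the outgroup is assumed to define the ancestral state.

```python
import numpy as np


def d_statistic(p1, p2, p3, p_out):
    """D = (sum ABBA - sum BABA) / (sum ABBA + sum BABA).

    Inputs are per-site derived-allele frequencies (or 0/1 alleles) for the
    three ingroup taxa and the outgroup.
    """
    p1, p2, p3, p_out = (np.asarray(x, dtype=float) for x in (p1, p2, p3, p_out))
    abba = (1 - p1) * p2 * p3 * (1 - p_out)
    baba = p1 * (1 - p2) * p3 * (1 - p_out)
    denom = abba.sum() + baba.sum()
    if denom == 0:
        return float("nan")  # no informative sites
    return float((abba.sum() - baba.sum()) / denom)


# Toy example: two ABBA sites, one BABA site, one uninformative site.
d_toy = d_statistic(
    p1=[0, 0, 1, 0],
    p2=[1, 1, 0, 0],
    p3=[1, 1, 1, 1],
    p_out=[0, 0, 0, 0],
)
print(d_toy)
```

A value near zero is expected without gene flow; an excess of ABBA patterns (positive D) suggests that P2 shares more derived variants with P3 than P1 does.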

Numerous cases of spontaneous introgression from wild relatives to cultivated species have been described (Hajjar and Hodgkin 2007; Guarino and Lobell 2011; Dempewolf et al. 2017; Burgarella et al. 2019). In maize, adaptive mexicana alleles were incorporated into the cultivated gene pool during the expansion of maize agriculture to the highlands of central Mexico (Matsuoka et al. 2002). Some of these introgressed alleles have been functionally validated and shown to provide adaptations to altitude, biotic and abiotic stresses (Hufford et al. 2013; Fustier et al. 2019). In potato (Solanum tuberosum), the origin of tuberization under long days was traced to an introgression event from Solanum microdontum (Hardigan et al. 2017). Introgression from Populus balsamifera (balsam poplar) into P. trichocarpa (black cottonwood) was detected through Tajima’s D, FST and LD scans of local admixture, and complemented by analyses of gene expression (Suarez-Gonzalez et al. 2016). The team identified the Populus PSEUDORESPONSE REGULATOR5 (PRR5) as a strong candidate for improving biomass, as well as cold, drought, and salinity tolerance. This gene was shown to work as a transcriptional regulator important for the circadian clock mechanism in Arabidopsis (Nakamichi et al. 2010, 2016, 2020). In poplar, it is upregulated at the onset of short days and it may play a crucial role in the timing of the onset of bud dormancy (Ruttink et al. 2007; Ko et al. 2011). A second candidate gene identified by Suarez-Gonzalez et al. (2016) is COMT1 (CAFFEIC ACID 3-O-METHYLTRANSFERASE 1), which could be involved in lignification and/or pathogen defense (Barakat et al. 2011). In recent polyploids, such as bread wheat, authors consider PAV (presence-absence variation) or CNVs (copy number variation) identified from re-sequencing data as signals of putative introgressions (Balfourier et al. 2019; Cheng et al. 2019). Cheng et al. (2019) measured FST and π ratio between wild and cultivated lines and found 79 segments supposedly introgressed from wild relatives, co-localizing with 124 QTLs (grain yield, disease resistance, plant height).

The case of aromatic rice offers an example of how disentangling local ancestry and introgression history could aid the breeding process. It has been revealed that Basmati-like aromatic varieties of rice (Glaszmann 1987) originated from hybridizations between cultivated japonica rice (29–47% of the genome) introduced to the Indian subcontinent millennia ago, and local wild lineages of the Himalayan foothills similar to the present-day aus varieties (Civáň et al. 2019). They possess some characteristics highly valued by consumers (fragrance, texture of cooked rice, grain elongation after cooking). Rice stickiness and texture after cooking is mainly determined by starch synthesis pathways, and particularly, the ratio of amylose and amylopectin. While japonica rice is often sticky (or glutinous) due to low amylose content, aus and indica varieties are generally nonglutinous. Aromatic varieties have intermediate amylose content and medium gel consistency and Civáň et al. (2019) showed that they have mixed ancestry at the two genes, Waxy (Olsen et al. 2006) and ALK (Gao et al. 2011), responsible for these characteristics. Many aromatic landraces produce grain of superior quality in terms of fragrance and cooking properties, but are tall, lodging susceptible and low-yielding. Unfortunately, breeding efforts to cross aromatic landraces with high-yielding elite cultivars or introduce dwarfing genes were met with limited success. This is mainly attributed to cross incompatibility between aromatic and indica varieties, and high sterility in crosses (Singh et al. 2000). Local ancestry inference (Civáň et al. 2019) revealed that most, but not all aromatic varieties carry a japonica-derived variant of the S5 gene responsible for japonica-indica hybrid sterility (Chen et al. 2008). Identification of high-quality aromatic landraces carrying non-japonica variants of S5 could therefore be the first step in successful breeding of elite aromatic cultivars.

2.2.4.3 Environmental Differentiation: Landscape Genomics

Landscape genomics investigates associations of genetic variants with environmental variables, such as temperature, precipitation, altitude, and latitude gradients (Balkenhol et al. 2019). The goal is to identify candidate genes under selection, possibly indicating local adaptation, using outlier differentiation methods (see Sect. 2.2.3) and genetic-environment association (GEA) tests. Landscape genomics should not be confused with landscape genetics (Manel et al. 2003), which focuses on the effects of landscape variables on gene flow and population structure.

GEAs require environmental data such as WorldClim (http://www.worldclim.org, 2015). Bayenv2 tests for large allele frequency differences across environmental gradients by comparing observed allele frequency differences to transformed normal distribution of underlying population allele frequencies. Latent factor mixed models (LFMM) (Frichot et al. 2013) include population structure as latent (or hidden) variables to limit false positive signals. Spatial generalized linear mixed models (SGLMMs) (Guillot et al. 2014) are a computationally more efficient extension. Samβada (Stucki et al. 2017) is a multivariate analysis method that accounts for population structure with estimates of spatial autocorrelation in the data. When the trait of interest or the climatic gradient is correlated to the population structure, PCAdapt can also be used. Some other methods exist and are summarized in Rellstab et al. (2015).
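The general logic of a GEA test can be sketched as a per-SNP linear regression of genotype on an environmental variable, with optional covariates standing in for population structure. This is only an illustrative baseline and does not reproduce Bayenv2, LFMM, SGLMM, or Samβada; the function name, the simulated data, and the choice of ordinary least squares are all assumptions made for the example.

```python
import numpy as np


def gea_scan(genotypes, env, covariates=None):
    """t-statistic of the environment effect for every SNP.

    genotypes: (n_individuals, n_snps) allele counts; env: (n_individuals,).
    covariates: optional (n_individuals, k) matrix, e.g. principal components
    used as a crude proxy for population structure (assumption).
    Monomorphic SNPs should be filtered out beforehand.
    """
    G = np.asarray(genotypes, dtype=float)
    n = G.shape[0]
    cols = [np.ones(n), np.asarray(env, dtype=float)]
    if covariates is not None:
        cols.extend(np.asarray(covariates, dtype=float).reshape(n, -1).T)
    X = np.column_stack(cols)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ G                  # one column of coefficients per SNP
    resid = G - X @ beta
    sigma2 = (resid ** 2).sum(axis=0) / (n - X.shape[1])
    return beta[1] / np.sqrt(sigma2 * xtx_inv[1, 1])


# Hypothetical simulated data: 200 individuals, 50 SNPs, SNP 7 tracks the gradient.
rng = np.random.default_rng(0)
env = rng.normal(size=200)
genotypes = rng.binomial(2, 0.5, size=(200, 50)).astype(float)
genotypes[:, 7] = rng.binomial(2, 1.0 / (1.0 + np.exp(-2.0 * env)))
t_stats = gea_scan(genotypes, env)
print(int(np.argmax(np.abs(t_stats))))
```

In real data, omitting the structure covariates would typically inflate the number of outliers, which is the problem that latent-factor and spatial models are designed to limit.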

Although most landscape genomics studies concern non-cultivated species, e.g., Arabidopsis (Hancock et al. 2011), associations with climatic data have also been investigated in forest trees (e.g., Sork et al. 2016; Rajora et al. 2016; Collevatti et al. 2020). Landscape genomics has been studied extensively in poplar (Suarez-Gonzalez et al. 2018), and it has been shown that introgressed genomic regions are enriched for disease resistance genes (TIR and LRR domains gene ontology terms) (Suarez-Gonzalez et al. 2016). In common bean (Phaseolus vulgaris), 26 loci with selection signatures were found (Rodriguez et al. 2016), some of them involved in responses to environmental stress, such as drought response, cold acclimation or chilling susceptibility, and adaptation to different conditions of light and temperature. Four of these loci were also found to be under selection during domestication.

For sorghum, Lasky et al. (2015) showed that genome-environment associations can predict adaptive traits. Bellis et al. (2020) looked at correlation between allelic frequencies and Striga pressure. They demonstrated that local adaptation to this parasitic plant was partially controlled by the LOW GERMINATION STIMULANT 1 (LGS1) gene. Wang et al. (2020) found some loci that may control seed mass adaptation to precipitation gradients.

In wild pearl millet (Pennisetum glaucum), Berthouly-Salazar et al. (2016) focused on climate gradients in Mali and Niger and collected genotype data from 11 populations, together with RNAseq data from a subset of four populations. Looking at the genetic diversity patterns within populations (Tajima’s D), differentiation among populations (FST, Bayescan), and correlation with environmental variables (centered loadings outliers using a PCA approach for each gradient), they found contigs displaying consistent signatures of selection among populations. Two of these contigs were associated with abiotic and biotic stress responses.

Time series data (that track samples over time) can be very informative for detecting genomic regions under selection, even when the factors driving selection are unknown.

Variety names of pearl millet, their phenotypes, and climate data for the period from 1976 to 2003 were collected from a region of Niger that suffered from recurrent drought (Vigouroux et al. 2011). The research showed that an allele of the PHYC locus responsible for earliness increased in frequency over time at a rate that exceeds the possible effects of genetic drift and sampling. A correlation between phenology and rainfall suggested that selection on this gene had a direct effect on earliness under shorter rainy seasons. It is noteworthy that this short-term adaptation was not due to the introduction of new varieties, but to within-variety selection, highlighting the importance of conserving within-variety diversity in genebanks.
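One simple way to ask whether a frequency change exceeds what drift and sampling can produce is to simulate neutral Wright-Fisher trajectories and compare them with the observed final frequency. The sketch below is an illustration of that reasoning, not the test used by Vigouroux et al. (2011); the effective population size, sample size, and frequencies are hypothetical values, and one generation per year is assumed.

```python
import numpy as np


def drift_p_value(p0, p_obs, generations, ne, n_sampled, n_sim=10000, seed=0):
    """Empirical probability that drift plus sampling alone yields a final
    sample frequency at least as high as p_obs.

    ne: assumed diploid effective population size.
    n_sampled: number of allele copies sampled at the final time point.
    """
    rng = np.random.default_rng(seed)
    freq = np.full(n_sim, p0, dtype=float)
    for _ in range(generations):
        # Binomial resampling of 2*Ne allele copies each generation.
        freq = rng.binomial(2 * ne, freq) / (2 * ne)
    sampled = rng.binomial(n_sampled, freq) / n_sampled
    return float(np.mean(sampled >= p_obs))


# Hypothetical numbers: 27 generations (one per year), frequency 0.10 -> 0.90.
p_shift = drift_p_value(p0=0.10, p_obs=0.90, generations=27, ne=1000, n_sampled=100)
p_flat = drift_p_value(p0=0.10, p_obs=0.10, generations=27, ne=1000, n_sampled=100)
print(p_shift, p_flat)
```

A very small empirical probability indicates that the change is hard to explain by drift and sampling under the assumed effective size; the conclusion is sensitive to that assumption, so a range of plausible values would normally be explored.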

Time series of a private breeding program (Allier et al. 2019d) can also be of interest to investigate genomic regions under selection or drift. Note that underlying population genetic structure and demographic history, when not properly accounted for, can generate many false positives. For instance, serial population bottlenecks occurring during founder effects of small populations migrating to new areas can result in fixed allelic differences due to genetic drift (Excoffier and Lischer 2009). Also, recent population range expansions from refugia can generate correlations between allele frequencies and environmental variables that are not due to selection.

Genome scan analyses are biased to detect loci with large effects, because power to detect small-effect loci is generally low. Since most phenotypic traits are likely to be polygenic, and thus governed by many loci of small effect, genome scans probably miss most of the loci involved in local adaptation (Stephan 2016; Rajora et al. 2016). Recently, Bayesian and other multilocus approaches have been developed (Rajora et al. 2016; Gompert et al. 2017). Some nonlinear functions have also been proposed to model the importance of environmental variables in explaining turnover of allele frequencies (Fitzpatrick and Keller 2015).

In reality, very few candidates for agronomically important genes revealed through genomic scans have been experimentally validated. Such validation usually requires phenotyping in controlled experiments, an association study demonstrating a link between genotypes and phenotypes (Saïdou et al. 2014) on a large panel of individuals, and a transformation experiment proving the function of the identified gene. All these experiments are costly, laborious, and technically difficult, but essential to convince breeders to monitor those genes in their germplasm. When validated, beneficial alleles can be introgressed into elite germplasm through backcrossing using diagnostic markers (Dempewolf et al. 2017), flanking markers, and sometimes genome-wide markers to minimize linkage drag and the introduction of undesirable traits.

3 Population Genomics and Quantitative Genetics Assisted Infusion of Genetic Diversity in Breeding and Pre-breeding Programs

In elite germplasm, genetic diversity is generally limited compared to ancestral diversity. In that context, genebanks are a reservoir of underexploited favorable alleles. When a single favorable allele is to be introgressed from genebank material into elites, flanking molecular markers can accelerate the process. However, it takes a long time to validate QTLs and allele effects in different genetic backgrounds and to design diagnostic markers for monitoring the favorable allele in breeding programs. For example, it took 50 years (1960–2010) from the discovery of submergence-tolerant rice landraces to the successful release of submergence-tolerant rice varieties. This required fine mapping and molecular characterization of the SUBMERGENCE 1 (SUB1) locus and an introgression process (Bailey-Serres et al. 2010). This explains why little use has been made of GR (Goodman 1999; Glaszmann et al. 2010; Wang et al. 2017).

The first reason why the introgression process is so long is that elite lines generally contain many more favorable alleles than GR. It takes several generations of recurrent backcrosses with elites and selection to close this performance gap between GR and elites. The challenge is to break only unfavorable allelic associations while preserving the favorable ones when crossing elites and GR. We may find co-adapted alleles in a cluster of genes (tall plant and late flowering alleles, high yield potential, and low protein content for instance) that have been shaped by natural selection, creating local epistasis. Recombinants may not be found even in experimental populations when the genes are too tightly linked. The recombination that would be desirable for agronomic purposes (small plant and late flowering, high yield potential and high protein content) may be difficult to obtain for ecophysiological or mechanical (low recombining regions, no diversity) reasons (Mayr 1954). The major unfavorable alleles of plant GR to eliminate are involved in phenology and local adaptation (e.g., flowering time, photoperiod sensitivity, height) because they may not suit the targeted environment.

Genomic predictions (see Sect. 3.3) could help diversity infusion. Predicting GR genetic values using models that are trained on GR core collections is feasible when core collections are phenotyped in (and adapted to) targeted environments. But predicting which elite by GR crosses are of interest remains a statistical challenge because marker effects may depend on genetic backgrounds (Rio et al. 2019). We may need to first produce and evaluate recombinant lines between different genetic groups we want to cross to get an accurate marker effect estimation (GR alleles in elite genetic background in our case).

3.1 Production and Evaluation of Elite × GR Recombinants

Multi-parental crosses between elites and GR may be a good option to combine QTL detection for multiple traits, identification of favorable GR alleles, selection of pre-breeding lines that could be introduced in breeding programs, and training of prediction models. Multi-parental Advanced Backcross (AB-QTL) populations (Narasimhamoorthy et al. 2006), pyramidal Multiparent Advanced Generation InterCross (MAGIC) populations (Cavanagh et al. 2008; Leung et al. 2015), Nested Association Mapping (NAM) populations connected by one common parent for US maize (Buckler et al. 2009), European maize (Bauer et al. 2013), and US sorghum (Bouchet et al. 2017), and Back-Cross-NAM populations for sorghum (Jordan et al. 2011) have been developed for that purpose. It has been shown that the connection between populations by one or several common parents improves the power of QTL detection. Based on simulations, Stich (2009) proposed the triple round robin design connected by donors as the most efficient multi-parental design to maximize both the power of QTL detection and genetic gain. However, the production of this type of population remains long and laborious, and the optimal connection design is not straightforward to predict from a statistical point of view. The choice of parents is often based on empirical information from different sources, the recipient parents being chosen for performance and the GR for specific criteria that breeders want to improve, such as disease resistance.

3.2 Marker Assisted Selection

Marker Assisted Selection (MAS) is promising to accelerate and optimize introgression process (Charmet et al. 1999; Servin et al. 2004). It has been successful for the introgression of maize earliness (Simmonds 1979; Smith and Beavis 1996), flowering time and yield under drought (Ribaut and Ragot 2007) as well as many disease resistance genes (Sanz-Alferez et al. 1995; Thabuis et al. 2004). But it becomes very demanding when multiple genes need to be pyramided simultaneously. Very large population sizes are actually necessary to get a reasonable certainty that an individual with the target genotype can be identified. Gene pyramiding strategies using marker-assisted introgression have been proposed (Hospital and Charcosset 1997; Servin et al. 2004; Canzar and El-Kebir 2011; Xu et al. 2011; Beukelaer et al. 2015). If all genes cannot be fixed in a single step of selection, it is necessary to cross again selected individuals with individuals having the favorable alleles that are missing using a marker-based recurrent selection (Charmet et al. 1999; Bernardo and Charcosset 2006). To cumulate more loci in a single genotype, Hospital et al. (2000) proposed a marker-based recurrent selection (MBRS) method using a QTL complementation strategy in a randomly mating population, which is feasible only in open-pollinated species. More recently, Valente et al. (2013) developed the software Optimas and Han et al. (2017) proposed the Predicted Cross Value (PCV) algorithm to select at each generation crosses that maximize the likelihood of pyramiding desirable alleles in their progeny. A forward variable selection model can be used to select QTLs that explain significant genetic variance (Jansen 1993; Segura et al. 2012) instead of using arbitrary statistical thresholds. 
Note that Hospital and Charcosset (1997) advised that all QTLs should be given the same weight in the cross molecular score estimation to avoid rapid fixation of main QTLs and loss of small-effect alleles in the process. Control of genetic background with a few molecular markers was proposed in plants by Hospital and Charcosset (1997). With the same idea, the Genotype-Assisted Selection (GAS) concept was introduced by Meuwissen and Sonesson (2004) in animals to control polygenic background genes while selecting favorable alleles at QTLs. They proposed a multi-generation optimization of optimum contribution selection (GAOC: Genotype-Assisted Optimum Contribution) (see Sects. 3.3.7 and 3.3.8) while increasing the frequency of the positive QTL allele to increase genetic gain.
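The population sizes needed for pyramiding can be illustrated with a short calculation. If the target genotype occurs with frequency f, the smallest population size N giving a probability of at least c of observing it at least once satisfies N ≥ ln(1 − c) / ln(1 − f). The sketch below applies this to a hypothetical F2 population in which the target is homozygous for the favorable allele at k unlinked loci (frequency (1/4)^k); the function names and the F2 scenario are assumptions chosen for illustration.

```python
import math


def min_population_size(target_freq: float, certainty: float = 0.95) -> int:
    """Smallest N such that P(at least one target genotype) >= certainty."""
    if not 0 < target_freq < 1 or not 0 < certainty < 1:
        raise ValueError("target_freq and certainty must lie strictly in (0, 1)")
    return math.ceil(math.log(1 - certainty) / math.log(1 - target_freq))


def f2_target_freq(n_loci: int) -> float:
    """Frequency of the fully homozygous favorable genotype at n unlinked
    loci in an F2 population (assumption: independent segregation)."""
    return 0.25 ** n_loci


for k in (1, 2, 3, 5, 8):
    print(k, min_population_size(f2_target_freq(k)))
```

Under these assumptions the required size grows roughly fourfold with each additional locus, which is why multi-step pyramiding schemes are preferred when many genes are targeted.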

For complex traits that are controlled by a large number of genes, such as yield, MAS is often associated with substantial linkage drag, i.e., introduction of linked unfavorable alleles along with the target favorable allele (Peng et al. 2014), and has often failed (Simmonds 1993; Hospital and Charcosset 1997; Ribaut and Ragot 2007). An approach using Genomic Selection (GS) to address this problem in introgression schemes was proposed by Ødegaard et al. (2009), who demonstrated in fish that backcrossing assisted by genomic selection outperforms synthetic production or phenotypic selection for simultaneously selecting, for instance, elite productivity and donor disease resistance. In wheat, Heffner et al. (2010, 2011a, b) came to a similar conclusion by comparing a breeding scheme including MAS with 20 QTLs or MAS followed by GS. Using analytical simulations of rapid cycles that skip some phenotyping steps (Bernardo 2009), Heffner et al. (2010) showed that the expected annual genetic gain from GS exceeded that of Marker Assisted Recurrent Selection (MARS) for complex traits by about threefold for maize and twofold for winter wheat, in a pre-breeding process in particular.

3.3 Genomic Predictions

The first predictions in animals, humans, and plants were based on pedigrees (Henderson 1975; Falconer et al. 1996; Bijma and Woolliams 1999). Then Lande and Thompson (1990) proposed to estimate the molecular score of an individual by summing its allelic effects at QTLs involved in trait variation. It was later shown that allele effects were overestimated in QTL detection (Beavis et al. 1994; Beavis 1998) and that the significance threshold used to select the list of QTLs could be questionable. As the infinitesimal model (Fisher 1918), which considers that traits are controlled by many loci of small effect, was the best model to explain the variation of many complex traits such as yield, Whittaker et al. (2000) and Meuwissen et al. (2001) proposed to use all available independent markers (hundreds to millions) to build a prediction model that estimates the genetic value of unphenotyped candidates based on a related training population that is phenotyped and genotyped. Provided that genotyping is dense enough, each QTL should be in linkage disequilibrium with at least one marker. The model is thus able to capture more genetic variance than a model including significant associations only. The principle is to regress phenotypic values on all markers, considered as random effects, using a linear model. The critical difference with the Lande and Thompson (1990) approach is that no significance threshold is set for the loci used in trait prediction; all of them are used. This molecular score is called Genomic Estimated Breeding Value (GEBV) or genetic value and is an estimate of the additive effects of all loci.
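Regressing phenotypes on all markers as random effects can be sketched with ridge regression, which coincides with ridge-regression BLUP when the penalty equals the ratio of residual to marker-effect variance. In the minimal example below that ratio is assumed known rather than estimated (for instance by REML), and the function names and simulated data are hypothetical.

```python
import numpy as np


def rrblup_fit(markers, phenotypes, lam):
    """Shrunken marker effects from all markers at once (no significance threshold).

    lam is the ridge penalty, assumed here to equal sigma2_e / sigma2_marker.
    Solved in its n x n dual form, convenient when markers outnumber individuals.
    """
    Z = np.asarray(markers, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    z_mean, mu = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(Zc.shape[0]), y - mu)
    return mu, z_mean, Zc.T @ alpha


def gebv(markers, mu, z_mean, effects):
    """GEBV of (possibly unphenotyped) candidates: sum of estimated marker effects."""
    return mu + (np.asarray(markers, dtype=float) - z_mean) @ effects


# Hypothetical simulation: 300 training lines, 100 candidates, 200 markers.
rng = np.random.default_rng(1)
n_train, n_cand, n_markers = 300, 100, 200
M = rng.binomial(2, 0.5, size=(n_train + n_cand, n_markers)).astype(float)
true_effects = rng.normal(size=n_markers)
g = M @ true_effects                                  # true genetic values
y = g + rng.normal(scale=5.0, size=g.size)            # phenotypes with noise
mu, z_mean, effects = rrblup_fit(M[:n_train], y[:n_train], lam=25.0)
pred = gebv(M[n_train:], mu, z_mean, effects)
accuracy = float(np.corrcoef(pred, g[n_train:])[0, 1])
print(round(accuracy, 2))
```

The candidates are predicted from genotypes alone, and the correlation with their simulated true genetic values plays the role of prediction accuracy; in practice this quantity is estimated by cross-validation.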

The first to implement GS was the US dairy industry (VanRaden et al. 2007; VanRaden 2008). It doubled genetic gain (Schaeffer 2006; García-Ruiz et al. 2016) for this species particularly well suited for the implementation of GS. It is now applied to many other animal species (Hayes et al. 2009). Daetwyler et al. (2008) showed how to use genomic prediction for analyzing the genetic risk of human diseases. In plants, Bernardo and Yu (2007) and Heffner et al. (2011a) showed the first promising results using simulations, and Lorenzana and Bernardo (2009) using empirical bi-parental data.

More details on genomic selection and prediction models are presented in another chapter of this book (Andres et al. 2020). Here we discuss how genomic predictions could be used to optimize the re-introduction of genetic diversity in plant breeding and pre-breeding programs.

3.3.1 Selection of a Relevant Training Set

Assuming that the number of markers and the training population size is optimal, the accuracy of the calibration model strongly depends on congruence between the allelic composition of the training population (to build the prediction model) and the allelic composition of the candidates whose performance is to be predicted (Habier et al. 2007). When the prediction uses unrelated populations to train the prediction equations, prediction accuracy actually becomes negligible (Crossa et al. 2014). Different ways of estimating prediction accuracy of a training population were developed and have been reviewed (Brard and Ricard 2015). Methods to optimize the composition of the calibration set prior to phenotyping have been proposed based on the Prediction Error Variance or on the Coefficient of Determination (Laloë 1993) of contrasts in unstructured (Rincent et al. 2012) or in structured populations (Rincent et al. 2017b). The algorithm of Rincent et al. (2012) has also been extended to optimize the training population for multiple correlated traits using a criterion called CDmulti (Ben-Sadoun et al. 2020). Other approaches based on spatial sampling (Bustos-Korts et al. 2016), or kinship coefficients (Rincent et al. 2012, 2017b) potentially taking genetic architecture into account (Mangin et al. 2019) were also developed.

3.3.2 Genomic Predictions Assisted Introgression

Using simulations, genomic predictions were shown to be efficient for rapid introgression of GR alleles when implementing 3 cycles per year in maize (Bernardo 2009; Combs and Bernardo 2013). Among 100 simulated QTLs, the adapted inbreds had the favorable allele at 50 or more QTLs and the GR at 50 or fewer QTLs. They compared 1 year of phenotypic selection versus 3 cycles of genomic selection. The results indicated that a useful strategy should involve 7–8 cycles of genomic selection (2–3 years). They showed that genetic gain was higher when starting from an F2 population rather than a backcross population, even when the number of favorable alleles was substantially larger in the adapted parent than in the GR parent. Note that they used random mating in their simulations. This procedure would require only 3 years to obtain progenies that could be integrated into the breeding program. Allier et al. (2019b) showed, using simulated data and the optimal parental contribution method (see Sects. 3.3.6 and 3.3.7), that in a context of multiple allele introgression from a donor into one or several elites, three-way crosses and backcrosses were better suited than two-way crosses when donors underperform the elite population. They demonstrated that three-way crosses should be preferred because they produce more progeny variance and combine alleles from more parents. This supports the strategy adopted in the Germplasm Enhancement of Maize project (Goodman et al. 2000). Two-way crosses were better suited when donors outperform the elite population for the targeted trait.

3.3.3 Predictions of Accessions’ Genetic Values Conserved in Genebanks

Using genotypes and phenotypes of a representative set of genebank accessions, we can build a model to predict the GEBV of the rest of the collection (Yu et al. 2016). As genotyping is less expensive than phenotyping, additional GR of interest can be identified in this way (Crossa et al. 2016; Brauner et al. 2018, 2019). For instance, in maize, Allier et al. (2019c) calibrated a prediction model on a population assembling a continuum from old dent accessions to elite iodent material, including founders of breeding pools, elite material released into the public domain, and elite material from different private breeders. Yield predictive ability between the calibration population and RAGT2n company germplasm was 0.404, which made it possible to detect landraces of agronomic interest to be introduced in the breeding program. But this strategy is possible only if the divergence between landraces and elite material is not too large and the predictive ability is sufficient. It is also necessary that the traits can be evaluated for the landraces in targeted environments. It turned out to be an appropriate approach for biomass sorghum (Yu et al. 2016) and dent maize. But for many species, the presence of some major genes involved in phenology may hinder good quality phenotyping of landraces, because of failure to flower or to mature on time, or because of lodging. In that case, unadapted accessions may carry interesting favorable alleles but cannot reveal their potential in the targeted environment. Good quality phenotypes may require first converting GR by eliminating major unfavorable phenology alleles, or phenotyping elite × GR hybrids instead of GR (Longin and Reif 2014). If we consider dominance effects of favorable over deleterious alleles for those major genes involved in phenology, heterozygous hybrids between elites and GR are expected to get favorable major alleles from elites that cancel or at least reduce the effects of deleterious alleles from GR. However, this assumes that hybrids are technically easy to produce, which may be challenging, or at least laborious and expensive, in autogamous species.

3.3.4 Optimization of the Allocation of Resources

Thanks to resource allocation optimizations some budget can be saved in evaluation of major traits (yield in general) and be transferred to

  1. increase the size of the germplasm (the number of progenitors, crosses, and progenies per cross), leading to an increased genetic variance and a higher chance of creating outstanding individuals.
  2. evaluate new traits (such as quality).

Different strategies have been proposed:

  1. skip some field evaluation steps, which is relevant in long-lived species such as trees, or when trait values are expensive and/or become available late in the cycle (Hayes et al. 2009),
  2. optimize the experimental design, i.e., minimize the number of lines or replicates evaluated in each environment,

Lorenz (2013) showed that it was advantageous to maximize population size at the expense of replication in a breeding program. Endelman et al. (2014), Heslot and Feoktistov (2017), and Akdemir and Isidro-Sánchez (2019) proposed efficient strategies to optimize field evaluation (sparse design) using genomic predictions. Ben-Sadoun et al. (2020) showed that it was possible to reduce the budget by 25% for a fixed accuracy of the French Bread Making Score by phenotyping it in a reduced number of environments. The idea is to evaluate all alleles in all environments, not all individuals.

  3. accelerate cycles: speed (pre)breeding (2 cycles per year for winter wheat, 3 cycles for maize, up to 6 cycles for spring wheat) using adequate growth chambers and greenhouses protocols (Christopher et al. 2015; Hickey et al. 2017; Ghosh et al. 2018) to increase the rate of development,

  4. diminish the cost of evaluation of an expensive trait using indirect measurements and optimize phenotyping of both correlated traits (Ben-Sadoun et al. 2020). The strategy is called Trait-Assisted genomic selection (TA) (Fernandes et al. 2018).

It is possible to predict two correlated traits simultaneously using multivariate best linear unbiased prediction (BLUP) (Henderson and Quaas 1976). Those models benefit from information contained in both genetic correlation between traits and genetic relationship among individuals (Calus and Veerkamp 2011). The training population is genotyped and phenotyped for both traits. Each training individual is phenotyped for at least one trait. If the candidate population is genotyped but not phenotyped for any of the traits, the strategy is called Multi-Trait genomic prediction (MT). If some of the candidates are phenotyped for the secondary trait, the strategy is called Trait-Assisted genomic selection (TA) (Fernandes et al. 2018).
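The difference between MT and TA can be illustrated with a minimal sketch of multivariate BLUP. It assumes that the genetic (G0) and residual (R0) covariance matrices between traits are known and that phenotypes are already centered; the function name, the toy values, and the identity relationship matrix are hypothetical choices made only to isolate the effect of the candidate's own secondary-trait record.

```python
import numpy as np

def multitrait_blup(K, G0, R0, y, observed):
    """BLUP of genetic values for all traits and individuals (known variances).

    K: (n, n) relationship matrix; G0, R0: (t, t) genetic and residual
    covariances between traits; y: (t, n) centered phenotypes (NaN allowed
    where unobserved); observed: (t, n) boolean mask of available records.
    """
    t, n = y.shape
    G = np.kron(G0, K)               # covariance of stacked genetic values (trait-major)
    R = np.kron(R0, np.eye(n))       # residuals correlated within an individual only
    obs = observed.ravel()
    V = (G + R)[np.ix_(obs, obs)]    # covariance of the observed records
    C = G[:, obs]                    # covariance between genetic values and records
    u_hat = C @ np.linalg.solve(V, y.ravel()[obs])
    return u_hat.reshape(t, n)

# Toy data: 2 training individuals and 1 candidate (last column), unrelated (K = I)
K = np.eye(3)
G0 = np.array([[1.0, 0.8], [0.8, 1.0]])   # target and secondary trait genetically correlated
R0 = np.diag([1.0, 0.25])
y = np.array([[0.5, -0.3, np.nan],        # target trait
              [0.4, -0.2, 1.0]])          # secondary trait
mt_mask = np.array([[True, True, False], [True, True, False]])  # MT: candidate not phenotyped
ta_mask = np.array([[True, True, False], [True, True, True]])   # TA: secondary trait observed

mt = multitrait_blup(K, G0, R0, y, mt_mask)[0, 2]
ta = multitrait_blup(K, G0, R0, y, ta_mask)[0, 2]
print(mt, ta)
```

With unrelated individuals, the MT prediction of the candidate carries no information, whereas the TA prediction borrows information from its secondary-trait record through the genetic correlation. In practice K would be a genomic relationship matrix and the variance components would be estimated.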

As for single trait prediction, under a major QTL genetic architecture, Jia and Jannink (2012) found that Bayesian multivariate models (BayesA or BayesCπ) performed better than multi-trait GBLUP model. But for polygenic genetic architecture, multi-trait GBLUP model was equal to the Bayesian multivariate models. Note that Jiang et al. (2015) developed Bayesian multivariate models that consider correlated SNP effects. Montesinos-López et al. (2016) extended the model to a Bayesian multi-trait and multi-environment genomic prediction model (BMTME) that takes into account the correlation between traits and the three-way interaction term (Trait × Genotype × Environment). More recently, multi-trait deep learning (MTDL) models have been developed to reduce the computational resources (Montesinos-López et al. 2018, 2019). MT models can actually suffer from a high computational demand, time, and some convergence problems (Michel et al. 2018). Obviously, genetic correlation between traits is a key factor determining the MT advantage over single trait (ST) methods (Calus and Veerkamp 2011; Jia and Jannink 2012; Hayashi and Iwata 2013; Guo et al. 2014). Although MT models improve the predictive ability when the targeted trait has a low heritability and the secondary trait has higher heritability, the advantage of MT models to predict high heritability traits is low (Jia and Jannink 2012; Hayashi and Iwata 2013; Iwata et al. 2013; Guo et al. 2014). Studies using experimental data demonstrated that advantage of MT models to predict individuals which have not been phenotyped either for the trait of interest or the correlated trait was small or null in pine tree, Pinus taeda (Jia and Jannink 2012), soybean, Glycine max (Bao et al. 2015), rye, Secale cereale (Schulthess et al. 2016), maize, Zea mays (dos Santos et al. 2016), bread wheat, Triticum aestivum (Michel et al. 2018; Schulthess et al. 2018; Lado et al. 2018), and sorghum, Sorghum bicolor (Fernandes et al. 2018). 
Several studies using experimental data demonstrated that TA models perform better than ST and MT models in terms of accuracy. The TA models using high-throughput phenotyping, for instance, improved the prediction accuracy of bread wheat grain yield by up to 70% (Rutkoski et al. 2016; Sun et al. 2017; Crain et al. 2018). TA models also improved the prediction of bread wheat baking quality-related parameters using protein content (Michel et al. 2018) or dough rheological traits (Lado et al. 2018) as correlated traits. Measuring dough strength (W) instead of French Bread Making Score for 75% of the population maintains accuracy while reducing the phenotyping budget by up to 65% (Ben-Sadoun et al. 2020). For a fixed budget, it can increase predictive ability by up to 0.14. Predictive ability of Fusarium head blight severity in hybrid bread wheat was improved using plant height and heading date as correlated traits (Schulthess et al. 2018). Fernandes et al. (2018) showed that TA models increased prediction accuracy by up to 50% when using plant height as a correlated trait to predict yield in sorghum. Robert et al. (2020) proposed a new TA approach, in which the secondary trait is not phenotyped for the selection candidates, but predicted with crop-growth models. The advantage is that it is not necessary to sow the selection candidates, as only the genotypic information is used.

3.3.5 Mating Optimization

The breeder’s goal is to obtain “transgressive” individuals (with extreme genetic values) for at least one trait, cumulating as many favorable alleles as possible, putatively coming from different parents. While animal breeders optimize the choice of males, plant breeders may want to optimize mating between two or more parents. Cross design is essential, but without accurate tools to guarantee its performance, breeders often select the highest-performing parents to ensure high mean performance of progeny, and may focus on one or two traits. The problem is that the highest-performing individuals may carry similar sets of alleles and may actually produce less genetic variance in progeny than parents that have fewer but complementary favorable alleles. Because it is not feasible to evaluate all possible crosses in the field, it would be valuable to predict the value of a cross or a global cross design before it is made. Instead of focusing on the performance of parents, the idea is to estimate a proxy of the value of top progenies, i.e., the predicted mean and variance of the progeny. Attempts have been made using distances between parents based on phenotypes (Souza and Sorrells 1991a, b; Utz et al. 2001), genetic distance based on molecular markers (Bohn et al. 1999; Hung et al. 2012), molecular scores (summing QTL effects), or GEBV (summing marker effects estimated by ridge regression) (Tiede et al. 2015), but they were not really successful.

In a pre-breeding program context, it is even more obvious that the interest of a donor for a recipient elite individual depends on its genetic value (which can predict mean performance of progeny) but also its originality at QTLs (which will contribute to increase genetic variance and long-term genetic progress). A first approach was to count the proportion of favorable alleles and complementarity of parents at QTLs (Dudley 1984, 1988; Bernardo 2014). Van Berloo and Stam (1998) discriminated among crosses using a marker score from QTL flanking marker genotypes weighted by their effects.

The idea of Genomic Mating (GM) strategies is to use genomic predictions to optimize the complementation of parents to be mated (Akdemir and Isidro-Sánchez 2016). Progeny genetic variance is generated by the random sampling of parental chromosomes during meiotic division and by recombination between those chromosomes. If we can accurately estimate marker effects as well as recombination rates between markers, we can therefore optimize mating so as to maximize the probability of getting individuals that cumulate a maximum of favorable alleles. In theory, the value of a cross, or the Usefulness Criterion (UC) of a cross (Schnell and Utz 1975), is the expected genetic value of the selected fraction (the best individuals) of the progeny

$$ \mathrm{UC}=\mu + ih{\sigma}_A $$

with μ the population mean, i the selection intensity, h the square root of heritability of the trait, and σA the additive genetic standard deviation among progeny.
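As a worked example of this formula, the sketch below compares two hypothetical crosses. It assumes normally distributed genetic values and truncation selection, so that the selection intensity can be derived from the selected fraction p; the function names and all numerical values are illustrative assumptions.

```python
from statistics import NormalDist

def selection_intensity(p):
    """Standardized selection intensity i for a selected fraction p,
    assuming a normal distribution and truncation selection."""
    z = NormalDist().inv_cdf(1.0 - p)     # truncation point on the standard normal scale
    return NormalDist().pdf(z) / p

def usefulness(mu, sigma_a, p, h=1.0):
    """UC = mu + i * h * sigma_A (h = 1 when selection acts on true genetic values)."""
    return mu + selection_intensity(p) * h * sigma_a

# Two hypothetical crosses, 5% of the progeny selected
uc_a = usefulness(mu=10.0, sigma_a=1.0, p=0.05)   # higher mean, lower variance: about 12.06
uc_b = usefulness(mu=9.5, sigma_a=1.5, p=0.05)    # lower mean, higher variance: about 12.59
print(round(uc_a, 2), round(uc_b, 2))
```

In this example the cross with the lower progeny mean has the higher UC, because its larger progeny standard deviation more than compensates for the difference in means.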

3.3.5.1 Between Two Parents for Biparental Populations

To calculate the UC of crosses, we need estimates of marker effects and recombination rates. In plants, meiotic recombination maps are usually estimated from bi-parental populations. Note that we can use other types of populations (F2, BC, DH), using an adapted transformation to get the meiotic recombination rate. Several unconnected or connected populations can also be analyzed together to build consensus or composite maps cumulating more recombination information. A higher resolution method is to infer historical recombination maps from landraces or wild populations (Choi and Henderson 2015; Petit et al. 2017; Danguy des Deserts et al. 2021).

A first strategy is to simulate progeny in silico (stochastic simulations) by randomly producing crossing-overs along parental gametes according to a recombination map (Bernardo and Charcosset 2006). The value of a cross is the mean of the top progeny genetic values, the number of individuals belonging to this top group depending on the intensity of selection (Iwata et al. 2013; Bernardo 2014; Lian et al. 2015; Mohammadi et al. 2015). Note that in plants, we observe a significant negative relationship between parental mean and progeny genetic variance (Mohammadi et al. 2015). This study also showed that mid-parent value explains 99.99% of mid-progeny value and 82–88% of top-progeny value. Mid-parent value and estimated progeny genetic variance together explained 99.5% of top-progeny value. This demonstrates the usefulness of estimating the genetic variance, and not only the mean, of the progeny to estimate cross value. The problem with large breeding programs is that stochastic simulations are compute-intensive. Attempts have therefore been made to predict the variance analytically, using mathematical formulas. The mean of a cross is predicted by the mid-parent value in self-pollinated species or the mean of testcross performance in a hybrid crop. Several formulas have been proposed to predict the progeny variance. A first way is to estimate the value of the best possible progeny. We can determine historical haplotype blocks along the genome based on linkage disequilibrium and consider that recombination occurs only between those blocks. The effect of one haplotype block is the sum of its individual allele effects. Daetwyler et al. (2015) defined the Optimal Haploid Value (OHV) of a heterozygous individual as the sum of the effects of the best allele at each haplotypic block, corresponding to the genetic value of the best theoretical gamete to pass on to the next generation. 
They demonstrated that for a wheat program using DH technology (i.e., getting homozygous lines by gamete cultivation and chromosome doubling using colchicine treatment), genetic gain was improved (up to 0.6 standard deviations) when estimating the value of a cross as the OHV of the corresponding F1 heterozygous individual compared to standard GS. It also preserved a substantially greater amount of genetic diversity in the population. Müller et al. (2018) proposed the Expected Maximum haploid Breeding Value (EMBV) (Fig. 2). This is the expected GEBV of the best out of N DH lines produced by an F1 using haplotypic blocks. Compared to OHV, EMBV takes into account the fact that the number of progenies produced is limited, the best theoretical progeny being impossible to reach. It can be estimated by stochastic simulations or using an analytical formula. Another analytical way to estimate the value of a bi-parental population explicitly includes the vectors of recombination rate between markers that are polymorphic between parents and marker effects (Zhong and Jannink 2007). This analytical formula estimates the probabilities of transmission of alleles at all QTLs from an F1 individual (obtained by crossing two parents) to its gametes. In other words, the probability of getting an outstanding progeny depends on the distribution of favorable alleles between parents and on the probability of breaking linkage between loci in repulsion phase and of not breaking linkage between loci in coupling phase. Consider two genes that are genetically close to each other. If alleles are in repulsion phase in the parental genotypes (neither parent has both favorable alleles), recombination widens the variance of the cross by providing extreme genotypes (some progeny can get both favorable alleles, and others both unfavorable alleles). 
Conversely, if alleles are in coupling phase in the parents (the best and worst combinations are already observed), recombination reduces the variance (Zhong and Jannink 2007; Tiede et al. 2015) by producing combinations of intermediate effects. Formulas considering recombination rate between polymorphic markers were derived to calculate cross values for RILs and DH at generation k (Lehermeier et al. 2017b). The authors confirmed that predicting genetic variance in cross prediction increases genetic gain by 18% in maize compared to predicting the mean only. Formulas were also derived to optimize three- and four-way cross designs (Allier et al. 2019b). The implementation is much faster than in-silico simulations. But note that in practice, analytical formulas can only be used to predict the variance of the next generation, not several generations ahead: we need to recover the parent genotypes at each cycle to estimate the variance of the following generations.
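The effect of linkage phase on progeny variance can be checked with a small analytical sketch. It covers only a simplified case, stated here as assumptions: DH lines derived directly from the F1 of two inbred parents, purely additive effects, genotypes coded +1/-1, and known recombination rates. Under these assumptions the covariance between two polymorphic loci is (1 - 2r) times the product of the parental phase signs. The function names are hypothetical, and the code is not the implementation of any of the cited papers.

```python
import numpy as np

def dh_progeny_variance(effects, parent1_sign, rec):
    """Genetic variance among DH lines derived from the F1 of two inbred parents.

    effects: additive effect a_k of each polymorphic locus (value = sum of a_k * x_k, x_k = +1 or -1)
    parent1_sign: +1 if parent 1 carries the +1 allele at locus k, else -1
    rec: (m, m) matrix of recombination rates between loci (0 on the diagonal)
    """
    a = np.asarray(effects, float) * np.asarray(parent1_sign, float)
    d = 1.0 - 2.0 * np.asarray(rec, float)   # Cov(x_k, x_l) up to the parental phase signs
    return float(a @ d @ a)

def two_locus_rec(r):
    """Recombination matrix for two loci separated by recombination rate r."""
    return np.array([[0.0, r], [r, 0.0]])

effects = [1.0, 1.0]                          # the +1 allele is favorable at both loci
for r in (0.1, 0.5):
    coupling = dh_progeny_variance(effects, [1, 1], two_locus_rec(r))    # one parent has both
    repulsion = dh_progeny_variance(effects, [1, -1], two_locus_rec(r))  # one each
    print(r, round(coupling, 2), round(repulsion, 2))
# r = 0.1: coupling about 3.6, repulsion about 0.4; r = 0.5: both about 2.0
```

Going from tight linkage (r = 0.1) to independence (r = 0.5) increases the variance in repulsion phase and decreases it in coupling phase, as stated in the text.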

Fig. 2

Different indexes to estimate cross values, indicated along the distribution of a simulated progeny from one cross. μ: the mean of the simulated progeny, q: the 10% quantile of the top progenies, UC: the mean of the 10% top progenies, EMBV: the mean of the best progeny for 1,000 different simulations, OHV: the best possible progeny (if all recombinations are possible and population size is unlimited)
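The indexes compared in Fig. 2 can be reproduced on simulated data with the following minimal sketch. It works at the haplotype-block level and on the haploid (gamete) scale, with recombination allowed only between adjacent blocks. The number of blocks, the recombination rate, the progeny sizes, and the random block effects are arbitrary assumptions, not values taken from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_blocks, r = 20, 0.3                       # hypothetical: 20 blocks, recombination rate between neighbors
hap = rng.normal(size=(2, n_blocks))        # haplotype-block values of the two inbred parents

def simulate_gametes(n):
    """Values of n F1 gametes; parental origin switches between adjacent blocks with probability r."""
    start = rng.integers(0, 2, size=(n, 1))                 # parent providing the first block
    switch = rng.random((n, n_blocks - 1)) < r              # crossing-overs between blocks
    n_switch = np.concatenate([np.zeros((n, 1), int), np.cumsum(switch, axis=1)], axis=1)
    origin = (start + n_switch) % 2                         # parental origin of each block
    return hap[origin, np.arange(n_blocks)].sum(axis=1)

progeny = simulate_gametes(10_000)
mu = progeny.mean()                          # mean of the simulated progeny
q = np.quantile(progeny, 0.90)               # threshold defining the top 10%
uc = progeny[progeny >= q].mean()            # mean of the top 10% (selection on true values)
embv = np.mean([simulate_gametes(50).max() for _ in range(1000)])  # expected best of N = 50
ohv = hap.max(axis=0).sum()                  # best block from either parent at every position
print(mu, q, uc, embv, ohv)
```

By construction, OHV is an upper bound for every simulated gamete, whereas EMBV depends on the number N of progenies that can actually be produced.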

Although Uemoto et al. (2015) suggested to filter MAF (≥5%) to improve the prediction of GEBV, markers with low MAF should be kept for cross value prediction as they may be in greater linkage disequilibrium with low MAF QTLs and provide better predictions of the gametic variance (Santos et al. 2019).

Although the superiority of predictions of GEBV using haplotypes instead of single markers has not been demonstrated, Cole and VanRaden (2011) and Bonk et al. (2016) recommended the use of haplotypes to predict cross values in order to limit sampling errors when estimating individual marker effects. Another way to take into account local LD and uncertainty of markers estimates is to use Bayesian estimates of single marker effects (Sorensen et al. 2001; Lehermeier et al. 2017a). The idea is that combinations of alleles in haplotypic blocks may be better estimated (if present in the training population) than individual SNPs.

In a maize pre-breeding context, Allier et al. (2019c) compared different indexes to estimate cross values: the Modified Rogers’ Distance (MRD) between parents, the proportion of favorable alleles in donors (K) and recipients (J) (Bernardo 2014), OHV, genetic variance VarG in progeny, and Lehermeier’s UC (Lehermeier et al. 2017b). They considered different selection rates in the progeny to calculate UC, 5% (UC1) and 10−8% (UC2). The main conclusion was that one might consider UC1 or OHV with large haplotypes for short-term genetic gain prediction, and OHV with small haplotypes or UC2 with stringent selection for long-term genetic gain prediction. In other words, complementarity between parents is more important to consider for long-term genetic gain. Another conclusion was that in genetic diversity conservation programs, one might just want to maximize progeny variance (VarG) for the trait of interest, or the MRD between donor and recipient in the absence of trait-specific considerations.

3.3.5.2 At the Population Level

The long-term potential of a breeding program relies on the efficiency to combine favorable alleles scattered within many individuals (Goddard 2009; Jannink 2010). In a pre-breeding program where we want to increase the number of favorable alleles in a population, this can be optimized using Genotype-Building (GB) strategies. It is the founder population as a whole and not individuals or parents which must cumulate favorable alleles at a maximum number of haplotypic blocks. A parent (founder) is chosen for its complementarity with others. It may carry only a few rare but very favorable alleles and have a low individual genetic value. Considering that the best allele combination (ideotype) is known, there may be many possible cross designs to get there. Because we cannot test all founder populations and cross combinations, the challenge is to build algorithms and solvers so that calculations are feasible and solutions are realistic.

The first proposed strategy was to select a subset of founders that possess altogether the best possible combination of haplotypic alleles along the genome. The Genotype-Building (GB) value of a subpopulation (Kemper et al. 2012) measures the GEBV of an ideal heterozygous progeny that would get the two best haplotype segments from two founders for each block. The Optimal Population Value (OPV) (Goiffon et al. 2017) is an extension of GB for inbreds. It measures the GEBV of the best possible homozygous progeny that can be produced, i.e., the value of the progeny that would get the best allele for each haplotypic segment in the founders. Note that it supposes an unlimited number of generations. The second extension is to consider time and resource constraints. Moeinizade et al. (2019) proposed the LAS (Look-Ahead Selection) algorithm where they improve the population for a few generations, starting with a subset of founders that maximizes OPV, and finally select for the best individuals in the last generation. They also consider a limited budget and vary the numbers of progenies produced from different crosses based on the genetic diversity of the parents: they spend more resources on those crosses that have wider predicted phenotypic distributions and thus higher probabilities of producing outstanding progenies. As for OHV, for GB, OPV, and LAS we assume adjacent markers are likely to segregate together and are grouped into representative haplotype blocks, recombination events occurring only between haplotype blocks.
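The idea of selecting founders for their complementarity can be illustrated with a brute-force sketch. It assumes inbred founders described by one haplotype value per block, expresses the population value on the haploid scale (sum of the best block values available in the subset), and enumerates all subsets, which is only feasible for very small examples. The function names and the toy values are hypothetical.

```python
from itertools import combinations
import numpy as np

def opv(hap_values, subset):
    """Value of the best progeny obtainable from a subset of inbred founders:
    sum over blocks of the best haplotype value present in the subset."""
    return float(hap_values[list(subset)].max(axis=0).sum())

def best_subset(hap_values, k):
    """Exhaustive search for the k founders maximizing the population value."""
    n = hap_values.shape[0]
    return max(combinations(range(n), k), key=lambda s: opv(hap_values, s))

# 4 founders x 3 haplotype blocks (toy values)
hap_values = np.array([[3.0, 3.0, 0.0],    # founder 0: own value 6
                       [3.0, 2.0, 1.0],    # founder 1: own value 6
                       [0.0, 0.0, 4.0],    # founder 2: own value 4, but best at block 3
                       [1.0, 1.0, 1.0]])   # founder 3: own value 3

print(opv(hap_values, (0, 1)))             # the two best founders: 7.0
chosen = best_subset(hap_values, 2)
print(chosen, opv(hap_values, chosen))     # (0, 2) 10.0
```

Founder 2 has a low individual value but is retained because it carries the best haplotype at a block where the other founders are poor. Realistic problem sizes require heuristic solvers rather than exhaustive enumeration.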

3.3.6 The Theory of Contributions

According to the “breeder’s equation” (Lush 1937), genetic gain is limited by the initial additive genetic variance in breeder’s germplasm

$$ \Delta \mu = ih{\sigma}_A $$

with Δμ (genetic gain) the expected change in mean performance per generation, i the selection intensity, h the square root of heritability of the trait, and σA the additive genetic standard deviation among progeny.

The level of diversity depends on the effective population size Ne (Fisher 1930; Wright 1931), which refers to the number of breeding individuals in an idealized panmictic population without selection that would show the same amount of genetic diversity as the real population. Genetic diversity is generally measured by the expected heterozygosity He (Nei 1973). While the expected response to selection is proportional to the selection intensity, the number of parents and the corresponding effective population size are inversely proportional to the square of the selection intensity on major QTLs (Sanchez et al. 2006). Consequently, maximizing selection intensity (using GS for instance) to maximize short-term genetic gain reduces the effective population size and long-term genetic gain.

The genetic gain is also proportional to the product of individuals’ contributions (i.e., the number of offspring of each cross) and deviations from population mean (Woolliams and Thompson 1994; Woolliams et al. 1999). The rate of inbreeding, i.e., the loss of diversity, is proportional to the sum of squared individuals’ contributions (Robertson 1961; Wray and Thompson 1990).

3.3.7 Optimization of Contributions with Diversity Constraints

Based on the theory of contributions, the optimum contribution concept has been developed in animal breeding programs (James and McBride 1958; Wray et al. 1990; Wray and Goddard 1994; Brisbane and Gibson 1995; Meuwissen 1997; Woolliams et al. 2015) and tree breeding (Kerr et al. 1998; Hallander and Waldmann 2009a, b) to limit inbreeding. These methods have been recently adapted in crop breeding (Akdemir and Isidro-Sánchez 2016; Lin et al. 2016; Cowling et al. 2017; De Beukelaer et al. 2017; Akdemir et al. 2018; Gorjanc et al. 2018; Allier et al. 2019a, b, 2020).

The vector of parental contributions to the next generation is chosen at a predefined rate of population inbreeding (Wray and Goddard 1994; Meuwissen 1997), thereby penalizing individuals that are too closely related and maintaining genetic diversity. The solution is a compromise between short- and long-term genetic gain. It is obtained heuristically and can be optimized using different types of algorithms such as evolutionary algorithms, genetic algorithms in particular (Holland 1962; Goldberg and Holland 1988). When there is no explicit solution for complex problems (some objectives are not independent), simulated annealing and genetic algorithms are efficient to explore the solution space, obtain a pseudo-optimal solution, and limit the risk of being trapped in local optima. Simulated annealing (Metropolis et al. 1953) uses a Monte Carlo criterion, i.e., a probability of acceptance of a solution. New solutions are proposed until the algorithm converges, i.e., no new solution improves the objective functions. The number of iterations with different starting points for the decision variables and the choice of convergence criteria to decide when to stop the algorithm are essential. Genetic algorithms (Holland 1962) work on a population of solutions instead of individual solutions: (1) The algorithm generates a population of possible solutions, each one with defined values for the decision variables. (2) The values defined for decision variables are considered as alleles at different loci (one locus = one decision variable and one allele = one value for the decision variable). (3) The algorithm creates new solutions from existing solutions by “reproduction” with mechanisms similar to genetic evolution (genetic recombination, mutation, selection). Those transition rules are probabilistic. Different reproduction (one point, uniform) and selection (roulette wheel, tournament, rank) operators exist to propose and choose solutions.
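A minimal simulated annealing sketch for parental contributions is given below. As a simplification, the constraint on the rate of inbreeding is replaced by a penalty: the objective is the expected genetic value c'g minus lam times the group coancestry proxy c'Kc, with contributions c that are non-negative and sum to one. The weight lam, the cooling schedule, the step size, and the toy data are arbitrary assumptions.

```python
import numpy as np

def objective(c, gebv, K, lam):
    """Expected genetic value of the progeny minus a penalty on group coancestry."""
    return float(c @ gebv - lam * c @ K @ c)

def anneal_contributions(gebv, K, lam=1.0, n_iter=5000, t0=1.0,
                         cooling=0.999, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gebv)
    c = np.full(n, 1.0 / n)                  # start from equal contributions
    f = objective(c, gebv, K, lam)
    best_c, best_f, temp = c.copy(), f, t0
    for _ in range(n_iter):
        i, j = rng.choice(n, size=2, replace=False)
        delta = min(step, c[i])              # move some contribution from parent i to parent j
        cand = c.copy()
        cand[i] -= delta
        cand[j] += delta
        f_cand = objective(cand, gebv, K, lam)
        # Metropolis criterion: always accept improvements, sometimes accept worse moves
        if f_cand >= f or rng.random() < np.exp((f_cand - f) / temp):
            c, f = cand, f_cand
            if f > best_f:
                best_c, best_f = c.copy(), f
        temp *= cooling                      # geometric cooling
    return best_c, best_f

gebv = np.array([2.0, 1.9, 1.0, 0.5])        # toy GEBV of four candidate parents
K = np.array([[1.0, 0.9, 0.0, 0.0],          # parents 0 and 1 are closely related
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
c_opt, f_opt = anneal_contributions(gebv, K, lam=1.0)
print(np.round(c_opt, 2), round(f_opt, 3))
```

Because worse solutions are accepted with a probability that decreases with the temperature, the search can leave local optima early on and becomes greedy toward the end. Running it from several seeds is a simple way to check the stability of the solution.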

Allier et al. (2019b) combined optimum contribution with Usefulness Criterion (UC) (Lehermeier et al. 2017b) strategies in maize. They evaluated the interest of a multi-parental cross implying a donor and one or several elite recipients using the UCPC (Usefulness Criterion Parental Contribution) criterion. They simultaneously predicted the full multivariate progeny distribution (mean, variance, and pairwise covariances) for the agronomic trait, genome-wide contribution of parents, and contributions at favorable alleles. They showed using this strategy that three-way crosses were more efficient for long-term genetic gain when donors are less performant than elites.

In animals, to maintain diversity, Bijma et al. (2020) proposed to produce a number of offspring that is proportional to the gametic variance of the reproductor to accelerate response to recurrent selection.

In addition to contributions, we can optimize mating in Optimal Cross Selection (OCS) approaches. It aims at identifying the optimal set of crosses maximizing the expected genetic value in the progeny under a constraint on genetic diversity in plants. It combines optimal contribution with optimal mating in a multi-objective problem that can be also optimized by heuristic algorithms. The classical OCS approach controls for genetic diversity in the total progeny. Allier et al. (2019a) applied OCS under a constraint on genetic diversity in the selected fraction of the progeny that is used as parents of the next generation accounting for within-family variance. They applied UCPC-based (Usefulness Criterion Parental Contribution) OCS in maize using a differential evolution algorithm (Storn and Price 1997; Kinghorn et al. 2009; Kinghorn 2011). They showed that OCS with constraints on UCPC and He was more efficient than classical OCS for long-term genetic gain with limited reduction of short-term genetic gain. Akdemir et al. (2018) maximized within-cross variance (Shepherd and Kinghorn 1998) and mating for multiple traits. It gives the list of parent mates that maximize gain, maximize cross variance, and minimize inbreeding. It is called Multi-Objective Optimized Breeding (MOOB). Compared to standard multi-trait breeding, the gains from multi-objective optimized parental proportions approaches were about 20–30% higher at the end of long-term simulations of breeding cycles.

The budget and the technical solutions are so different between species, private/public sector, that it is difficult to propose one algorithm that would handle conflicting objectives and satisfy the whole community (Wellmann 2019; Wellmann and Bennewitz 2019).

GS being more efficient at fixing major QTLs, it accelerates the loss of genetic diversity at QTLs according to simulation studies (Jannink 2010; Lin et al. 2016; Ben-Sadoun et al. 2020). Moreover, using RR-BLUP, the rare allele effects are shrunk toward zero, which increases the risk to lose individuals with rare favorable alleles and decreases long-term genetic gain (Goddard 2009; Jannink 2010; Habier et al. 2010; Pszczola et al. 2012). Several authors suggested to up-weight rare favorable alleles (Goddard 2009; Jannink 2010; Sun et al. 2014; Liu et al. 2015a) to select individuals for the next generation. They obtained encouraging results by simulation but did not propose stabilized rules to assign relevant weights to markers.

Considering computation time, several papers concluded that, for elite material, it is possible to pre-select the population of eligible crosses based on parental mean genetic values before optimizing the cross design according to progeny genetic variance estimations (Zhong and Jannink 2007; Lehermeier et al. 2017b). They showed that the genetic gain in the following generation is similar when considering all possible crosses or when removing couples with lower mean genetic values. But the conclusion may be different in more diverse materials. In that case, crosses that have high variance but low mean may be interesting for long-term genetic gain, provided that we wait a sufficient number of generations to give rare favorable alleles a chance to be selected, in a pre-breeding population for instance.

3.3.8 Multiple Traits Optimization

The performance of new varieties often depends on multiple traits and/or constraints. The targeted ideotype can be a compromise between yield and quality for instance, with specific molecule concentrations for the industry. Breeding for multiple traits simultaneously is challenging because some traits are uncorrelated or unfavorably correlated due to linkage or pleiotropy. The Bulmer effect (Bulmer 1971) mechanically creates negative correlations between traits under selection, such as yield and protein content in wheat. Moreover, the economic value of different traits may not be equally important.

In classical multi-trait selection where traits are not correlated or negatively correlated, we have several strategies: (1) tandem selection: we select each trait singly (at different steps or generations), (2) independent culling: we reject individuals that are not meeting required standards for all traits, (3) index selection: traits are combined, using different weights corresponding to economic value, into a score that is considered as a single trait. The problem is that it may exclude the best individuals for each trait and some beneficial alleles. And it does not control for inbreeding.

Although single-objective optimization problems may have a unique optimal solution, the chance of finding a single best solution to a multi-objective problem is very low. The solution will be a compromise, especially when traits are antagonistic, and there may be several interesting solutions depending on the ranking of objectives. Algorithms propose a multiplicity of compromise solutions, called Pareto optimal solutions, after judiciously scanning the decision space, i.e., different combinations of equality and inequality constraints. Populations of solutions are classified into successive boundaries (fronts) according to their level of dominance (see more explanations in Figs. 3 and 4 for two traits, Fig. 5 for three traits).
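The classification of solutions into levels of dominance can be sketched as follows for objectives that are all maximized. A solution is dominated if another one is at least as good for every objective and strictly better for at least one. The function name and the toy genetic values are hypothetical, and this simple quadratic-time loop is only meant for small examples.

```python
import numpy as np

def pareto_levels(values):
    """Assign each solution (row) a dominance level: 1 = non-dominated,
    2 = non-dominated once level 1 is removed, and so on."""
    values = np.asarray(values, float)
    level = np.zeros(len(values), dtype=int)
    remaining = set(range(len(values)))
    current = 1
    while remaining:
        front = [i for i in remaining
                 if not any(np.all(values[j] >= values[i]) and np.any(values[j] > values[i])
                            for j in remaining if j != i)]
        for i in front:
            level[i] = current
        remaining -= set(front)
        current += 1
    return level

# Toy genetic values (grain yield, grain protein content) for five genotypes
values = [(8.0, 11.0), (7.0, 13.0), (6.0, 12.0), (9.0, 10.0), (5.0, 10.0)]
levels = pareto_levels(values)
print(levels.tolist())   # [1, 1, 2, 1, 3]
```

The three genotypes of level 1 form the Pareto frontier: none of them can be improved for one trait without losing on the other.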

Fig. 3

Genetic values of grain yield and grain protein content for 15 genotypes. Different colors represent different levels of solutions (green: first level non-dominated, red: second level dominated by first level, blue: third level dominated by first and second level, orange: fourth level). Reproduced from Akdemir et al. (2018)

Fig. 4

Optimization of grain yield and grain protein content: Pareto frontier curve. Red points indicate the individuals that are selected by the algorithm. The size of the points is proportional to their contribution to the next generation. Reproduced from Akdemir et al. (2018)

Fig. 5

Pareto optimal solutions for parental contributions (wheat data) obtained by solving the optimization problem for three parameters (three dimensions). The objective was to improve grain yield (GY) and grain protein content (GPC) while controlling group coancestry, i.e., maximizing GY, GPC, and the negative of inbreeding. The redness of the points indicates closeness to ideal solutions. Pareto optimal solutions are represented by a knee. Reproduced from Akdemir et al. (2018)

Note that at the end of a multi-objective optimization, the decision maker still has to select the preferred solution from the Pareto frontier using their own decision rules, i.e., ranking or weighting objective functions as in index strategies.

3.3.9 Production of Varieties Adapted to Local Constraints

The objective of plant breeders is to produce new varieties well adapted to target environments. For this purpose, they evaluate candidate lines for several years in multi-environment trials. Because phenotyping is expensive, only a limited number of lines are evaluated each year in a small number of environments.

Using genomic predictions accounting for Genotype by Environment Interactions (GEI), we can explore more combinations of genotypes and environments that we cannot afford observing in the field. We can use historical breeding databases including numerous years and environment observations to calibrate those models. Different approaches have been proposed in the last decade.

3.3.9.1 Genotype by Environment Interactions (GEI) Predictions

While classical GS models rely on main effects and are not able to predict GEI (Crossa et al. 2010; Ly et al. 2013), they were adapted to predict environment-specific effects (Schulz-Streeck et al. 2013; Lopez-Cruz et al. 2015; Crossa et al. 2016; Bandeira e Sousa et al. 2017), with possibly a genetic covariance between environments (Burgueno et al. 2012; Lado et al. 2016; Cuevas et al. 2017, 2018). These approaches, similar to multi-trait models, can increase prediction accuracy and can predict missing phenotypes of observed varieties (sparse testing) or unobserved varieties. They are more efficient in the sparse testing scenario in which information on a given variety can be shared between similar environments. Specific R packages were developed to fit simple GEI models with optimized computational properties (De Coninck et al. 2016; Granato et al. 2018). But these models cannot be used to predict the performance of a genotype in new environments, as they rely on the phenotypic data to estimate the covariance between environments.

To extend the predictions to new environments, Heslot et al. (2014), Jarquín et al. (2014), Malosetti et al. (2016), Millet et al. (2019), and Rincent et al. (2019) proposed to characterize environments with environmental covariates, just as molecular markers are used to characterize varieties. These covariates are pedoclimatic characteristics expected to affect the plants (precipitation, extreme temperature, radiation deficit) at the different developmental stages (Brancourt-Hulmel 1999). A crop model can be used to estimate the timing of the developmental stages, so that each covariate is computed over the period during which it is expected to affect the plants. This work is inspired by the factorial regression methodology (Brancourt-Hulmel et al. 2000), in which a regression on a covariate explains the variability of the trait in the presence of GEI. A generalization of factorial regression on a given covariate to the GBLUP mixed modeling context was proposed (Ly et al. 2017, 2018). In these studies, the covariate has a variety-specific random effect with a variance/covariance matrix structured by the kinship. This allows predicting the sensitivity of new varieties to this covariate. It is important to note that the QTLs affecting main effects are not necessarily the same as the QTLs affecting GEI, and this can be taken into account in the statistical models at the marker level (Heslot et al. 2014) or at the kinship level (Rincent et al. 2019). These models involving environmental covariates are particularly useful in the context of climate change, because they can predict the behavior of various varieties in virtual prospective scenarios. If a relevant database exists to calibrate the GS model, it could be used to identify in silico interesting combinations of alleles to face given environmental conditions. If we consider that the genetic diversity available in the elite pool is not sufficient, the prediction models can also be used to screen genebanks for valuable GEI (Crossa et al. 2016; Yu et al. 2016).
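The idea of describing environments with covariates in the same way as varieties are described with markers can be sketched as follows. This is a minimal illustration, not the implementation of any cited study: the covariate names, the data, and the helper functions are hypothetical, and the G x E covariance is assumed to be the product of the kinship between varieties and a covariate-based similarity between environments.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center and scale one covariate across environments (fails if it is constant)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def environment_kernel(covariates):
    """Omega = W W' / q, with W the standardized (environments x covariates) matrix."""
    columns = [standardize(col) for col in covariates.values()]
    n_env, q = len(columns[0]), len(columns)
    return [[sum(c[j] * c[k] for c in columns) / q for k in range(n_env)]
            for j in range(n_env)]

def gxe_kernel(kinship, omega, records):
    """Covariance between (variety, environment) records: K[i][i'] * Omega[j][j']."""
    return [[kinship[i][i2] * omega[j][j2] for (i2, j2) in records]
            for (i, j) in records]

# Hypothetical data: 3 environments described by 2 covariates, 2 varieties.
covariates = {"rain_mm": [100.0, 200.0, 300.0], "tmax_c": [30.0, 25.0, 20.0]}
kinship = [[1.0, 0.5], [0.5, 1.0]]
records = [(0, 0), (0, 2), (1, 0)]  # (variety index, environment index)

omega = environment_kernel(covariates)
ge = gxe_kernel(kinship, omega, records)
```

Because the environmental similarity is computed from covariates only, an environment without phenotypic records (for example a prospective climate scenario) still receives a row in the environment kernel, which is what makes prediction in new environments possible in this kind of model.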

3.3.9.2 Ecophysiological Modeling

The adaptation of plants to their environment has long been studied by ecophysiologists. Their research has led to the development of Crop Growth Models (CGM), which describe plant development using mechanistic relationships with physiological parameters and environmental covariates as inputs. In other words, the CGM simulates GEI by taking into account the specificities of the varieties (genetic parameters) and of the environments (environmental variables). Different ways of using CGM to predict GEI have been proposed.

The first application is to predict the developmental stages of the plants in order to determine whether stress occurred at critical stages. This strategy was applied in wheat and maize (Heslot et al. 2014; Jarquín et al. 2014; Malosetti et al. 2016; Ly et al. 2017; Millet et al. 2019; Rincent et al. 2019). Numerous studies indeed revealed that CGM predict phenology efficiently, even for new varieties (White and Hoogenboom 1996; Nakagawa et al. 2005; Yin et al. 2005; Messina et al. 2006). CGM can also be used to directly derive environmental covariates (Ly et al. 2017; Rincent et al. 2019). In Rincent et al. (2019), the CGM SiriusQuality (Martre et al. 2006) was used to estimate a dry matter stress index (DMSI) that directly relates the impacts of temperature, drought, and N deficit, alone or in combination, to daily biomass loss. The idea is to produce stress indexes as close as possible to what the crop experienced in the field. Such variables directly simulated by the CGM were shown to better capture GEI than basic pedoclimatic covariates.
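The principle of computing covariates within a developmental window can be illustrated with a deliberately simple thermal-time model. This sketch is not SiriusQuality and does not compute DMSI: the thermal-time thresholds delimiting the stage, the heat threshold, and the weather series are all hypothetical values chosen for illustration.

```python
def cumulative_thermal_time(tmean, base=0.0):
    """Running sum of daily thermal time (degree-days above `base`)."""
    total, out = 0.0, []
    for t in tmean:
        total += max(0.0, t - base)
        out.append(total)
    return out

def stage_days(cum_tt, start_tt, end_tt):
    """Indices of days whose cumulative thermal time falls in [start_tt, end_tt)."""
    return [d for d, tt in enumerate(cum_tt) if start_tt <= tt < end_tt]

def stage_covariates(days, tmax, rain, heat_threshold=25.0):
    """Stress covariates restricted to one developmental window."""
    return {
        "hot_days": sum(1 for d in days if tmax[d] > heat_threshold),
        "rain_mm": sum(rain[d] for d in days),
    }

# Hypothetical weather series for one trial (20 days).
tmean = [10.0] * 10 + [20.0] * 10
tmax = [t + 8.0 for t in tmean]
rain = [2.0] * 20

cum_tt = cumulative_thermal_time(tmean)
# Assumed stage boundaries, expressed in degree-days since sowing.
window = stage_days(cum_tt, start_tt=100.0, end_tt=200.0)
covs = stage_covariates(window, tmax, rain)
```

The same calendar period can therefore correspond to different stages in different trials, and the covariates entering the GS model are aligned on plant development rather than on dates.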

The second application is much more ambitious: the genetic model and the CGM are fully integrated within the Gene-Based Modeling approach (GBM). In GBM, the CGM simulates the development of each variety by using variety-specific genetic parameters as input. These genetic parameters (phyllochron, sensitivity to photoperiod) characterize the varieties independently from the environment and are thus supposed to be stable across environments. Once the genetic parameters are estimated for the calibration set, a GS model can be calibrated to predict the genetic parameters of new varieties. These predictions can then be used as input of the CGM to predict the target trait of the new varieties in various environments. The interest and feasibility of this approach coupling CGM and genetics have been validated for leaf elongation rate in maize (Reymond et al. 2003; Chenu et al. 2008), fruit quality (Quilot et al. 2005; Prudent et al. 2011), and phenology of various species (White and Hoogenboom 1996; Nakagawa et al. 2005; Yin et al. 2005; Messina et al. 2006; White et al. 2008; Uptmoor et al. 2012; Zheng et al. 2013; Bogard et al. 2014; Onogi et al. 2016; Rincent et al. 2017a). Recently, Technow et al. (2015), Cooper et al. (2016), and Messina et al. (2018) have illustrated the possibility of coupling CGM and GS models for predicting highly integrated traits such as grain yield. One major advantage of their approach and that of the work of Onogi et al. (2016) is that the genetic parameters and the marker effects are jointly estimated, so that information can be shared between individuals thanks to genotypic data. However, using GBM to predict such complex traits remains challenging, as numerous genetic parameters have to be phenotyped or estimated on the training population. More recently, Robert et al. (2020) proposed to combine GBM with a trait-assisted prediction approach. The GBM is used to predict a secondary trait (heading date) for the test set in all environments. This secondary trait is easy to predict, and its relationship to the target trait (yield) is environment specific and thus allows predicting environment-specific effects in bread wheat.
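The two-step logic of GBM (predict a genetic parameter from markers, then feed it to the CGM) can be sketched with a toy example. Everything below is an assumption made for illustration: the GS model is a plain ridge regression, the "CGM" is a one-line thermal-time rule in which heading occurs once thermal time reaches phyllochron times an assumed final leaf number, and the marker and phyllochron values are invented. It does not reproduce the joint estimation used by Technow et al. (2015) or Onogi et al. (2016).

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(markers, y, lam=1.0):
    """Step 1: ridge regression of a genetic parameter on markers (a simple GS model)."""
    mu = sum(y) / len(y)
    p = len(markers[0])
    xtx = [[sum(r[j] * r[k] for r in markers) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * (yi - mu) for r, yi in zip(markers, y)) for j in range(p)]
    return mu, solve(xtx, xty)

def ridge_predict(model, genotype):
    """Predicted genetic parameter of a variety from its marker genotype."""
    mu, effects = model
    return mu + sum(e * g for e, g in zip(effects, genotype))

def days_to_heading(phyllochron, tmean, n_leaves=11, base=0.0):
    """Step 2, toy CGM: heading once thermal time reaches phyllochron * n_leaves."""
    target, total = phyllochron * n_leaves, 0.0
    for day, t in enumerate(tmean, start=1):
        total += max(0.0, t - base)
        if total >= target:
            return day
    return None  # heading not reached within the weather series

# Hypothetical calibration set: markers coded -1/1, phyllochron in degree-days per leaf.
train_markers = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
train_phyllochron = [102.0, 108.0, 92.0, 98.0]

model = ridge_fit(train_markers, train_phyllochron, lam=1.0)
new_phyllochron = ridge_predict(model, [1, -1, 1])  # unobserved variety

# Hypothetical environments described by constant daily mean temperatures.
environments = {"warm": [15.0] * 200, "cool": [10.0] * 200}
heading = {env: days_to_heading(new_phyllochron, t) for env, t in environments.items()}
```

The environment-specific predictions come entirely from the crop model: the marker-based model only predicts a parameter that is assumed to be stable across environments.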

A last application of CGM is to help cluster environments with similar properties. The objective is to use the CGM to characterize the stress conditions experienced by the plants in each environment, and then to group environments with similar scenarios. Taking pedoclimatic data and variety characteristics as input, CGM can indeed produce daily stress indexes from sowing to maturity. It has been shown that clustering based on stress scenarios identified by CGM was more relevant than clustering based on the experimental protocols (e.g., non-irrigated vs irrigated) and that it was efficient to capture GEI (Chenu et al. 2011; Touzy et al. 2019). For example, it can happen that in a multi-environment trial, an irrigated trial is more exposed to drought than a non-irrigated trial at another location. In contrast to a protocol-based classification, the CGM is able to finely characterize each environment by taking into account the environmental conditions and the plant development. Once the CGM-based clustering is obtained, reference GS models (or GWAS) can be applied within each cluster, GEI being taken into account by the clustering.
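Grouping environments by stress scenario can be sketched with a small k-means clustering. The stress indexes below are invented (one value per developmental stage, as if produced by a CGM), the trial names are hypothetical, and k-means with a deterministic farthest-first initialization is only one possible clustering choice, not necessarily the one used in the cited studies.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two stress trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialization: first point, then the farthest remaining ones."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, n_iter=50):
    """Plain k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical drought stress index (0 = none, 1 = severe) at three stages.
trials = {
    "A_irrigated": [0.1, 0.1, 0.2],
    "B_rainfed": [0.1, 0.2, 0.1],
    "C_irrigated": [0.2, 0.7, 0.8],
    "D_rainfed": [0.1, 0.8, 0.9],
}
labels = kmeans(list(trials.values()), k=2)
clusters = dict(zip(trials, labels))
```

In this invented example the irrigated trial C falls in the same cluster as the rainfed trial D, while the rainfed trial B groups with the irrigated trial A, which is the situation described above where the experimental protocol is a poor proxy for the stress actually experienced.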

3.3.9.3 Perspectives in the Field of GEI Prediction

Phenotyping is one of the main bottlenecks in plant breeding. GS models allow predicting new varieties in observed environments or new environments for observed varieties, but large phenotype databases are necessary to calibrate the GS models accurately. High-throughput phenotyping platforms and tools which allow phenotyping at the organ level, at the plant level, or at the plot/field level (Tardieu et al. 2017) constitute a great opportunity to calibrate GEI models. These observations can be used to calibrate CGM (Reymond et al. 2003) or as environment-specific proxies of the target trait (Amani et al. 1996). The systematic and wide use of sensors in breeding programs will probably allow using deep learning approaches, which are expected to be the most efficient when such large datasets are available. Note that in all the approaches described in this section, there were only two kinds of data involved in the model: genomic and phenotypic data. The introduction of other omics data such as transcriptomics, proteomics, and metabolomics in the models will probably allow a better understanding of how a given variety grows in various environments (see Sect. 4.2 below). The introduction of this information in “phenomic” prediction models or in Genomic-like Omics Based prediction models (GLOB) was shown to improve accuracy (Fu et al. 2012; Riedelsheimer et al. 2012; Rincent et al. 2018; Schrag et al. 2018). Phenomics and genomics are used in combination in pre-breeding for yield potential in stressed environments under the International Wheat Yield Partnership (IWYP, https://iwyp.org/) (Reynolds et al. 2021). Once those tools are cost effective, they could be integrated routinely in breeding programs.

3.3.10 Application to Pre-breeding

When the performance gap between donors and elites is too large, it may be judicious to improve a pre-breeding population before introducing GR in a breeding program. For a few generations, starting from relevant founders that bring complementary alleles and using mating optimization, we can gradually increase the number of favorable alleles in the population. It is only after a sufficient number of generations that we start selecting individuals based on their genetic value to cross them with elites. Gorjanc et al. (2016) provided guidelines based on stochastic simulations. Starting from 3,000 genotyped maize landraces, they evaluated different pre-breeding programs that differed according to the population used to initiate crosses: (1) the best landraces, (2) the best testcrosses, or (3) the best DH seeds derived from testcrosses. They tested different (1) sizes for the pre-breeding program, (2) levels of diversity within the 3,000 landraces, (3) trait heritabilities, (4) numbers of markers, (5) numbers of crosses and progeny sizes per cross, and (6) numbers of phenotypic observations. The highest genetic gain was achieved by initiation with testcrosses. However, this mainly reconstructed the elite genome rather than exploiting the favorable landrace alleles. The best compromise to start a pre-breeding program was to start from landraces. This process can be accelerated by using existing composite or recurrent selection populations or inbred lines derived from local landraces. A recent initiative to characterize and use a part of the untapped variation in maize landraces is the Seeds of Discovery project (SeeD: http://seedofdiscovery.org). SeeD develops germplasm with 75% or more elite and 25% or less landrace genome to provide donors carrying new alleles.

Two-step breeding programs with an integrated pre-breeding program using rapid cycles (recurrent selection) (Gaynor et al. 2017; Gorjanc et al. 2018) are an efficient way to improve long-term genetic gain according to simulations (Fig. 6). An improvement population is produced by recurrent genomic selection with several cycles per year to increase the mean value of the GR population in the pre-breeding program. A development population is produced using standard methods to develop new lines in the breeding program. In the simulations of Gaynor et al. (2017), this two-part strategy delivered about 2.5 times larger genetic gain than a conventional program for the same investment. OCS increased long-term genetic gain by 15–78% depending on the number of parents.
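The trade-off between short-term gain and diversity that contribution-based selection addresses can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the algorithm of the cited papers or of any software: parental contributions c are searched exhaustively on a coarse grid, the expected gain is c'g, and diversity is controlled by capping c'Kc, where K is a hypothetical kinship matrix in which the two best candidates are close relatives.

```python
from itertools import product

def gain(c, gebv):
    """Expected genetic value of the next generation: c' g."""
    return sum(ci * gi for ci, gi in zip(c, gebv))

def coancestry(c, kinship):
    """Diversity penalty c' K c (higher means a more related set of parents)."""
    n = len(c)
    return sum(c[i] * kinship[i][j] * c[j] for i in range(n) for j in range(n))

def best_contributions(gebv, kinship, max_coancestry, units=4):
    """Exhaustive search over contributions in steps of 1/units summing to 1."""
    best, best_gain = None, float("-inf")
    for counts in product(range(units + 1), repeat=len(gebv)):
        if sum(counts) != units:
            continue
        c = [k / units for k in counts]
        if coancestry(c, kinship) <= max_coancestry and gain(c, gebv) > best_gain:
            best, best_gain = c, gain(c, gebv)
    return best, best_gain

# Hypothetical candidates: the two best lines (0 and 1) are close relatives.
gebv = [10.0, 9.5, 7.0, 6.0]
kinship = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

truncation = [0.5, 0.5, 0.0, 0.0]  # use only the two best candidates
constrained, constrained_gain = best_contributions(gebv, kinship, max_coancestry=0.4)
```

With these invented values, truncation selection gives the highest immediate gain but a much higher coancestry, whereas the constrained solution drops one of the two related candidates and spreads contributions over unrelated ones. Real implementations solve this problem with continuous optimization rather than a grid search.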

Fig. 6 Two-part breeding program. The population improvement component is based on recurrent selection that brings new genitors to the breeding program. The conventional strategy is based on variety development from elite parents. Reproduced from Gorjanc et al. (2018)

Allier et al. (2020) proposed a three-step strategy for cases where the gap between elites and GR is very large. The first step, which they called the base broadening phase (pre-breeding), is the recurrent improvement of GR to decrease the performance gap with elites. It is kept independent from breeding programs until performance is satisfactory. The best progenies are then crossed with elites to produce a bridging population, and the best bridging progenies can become parents in standard breeding programs. Allier et al. (2020) compared simulated breeding programs introducing donors with different performance levels. They observed that with recurrent introductions of improved donors, it is possible to maintain the genetic diversity and increase mid- and long-term performance with only a limited short-term penalty. When donors are already high-yielding, the bridging step could be skipped (Fig. 7).

Fig. 7 Diagram illustrating the respective positioning of pre-breeding, bridging and breeding from genetic resources to variety release. Reproduced from Allier et al. (2020)

From a practical point of view, several open-source software tools have been proposed. The R packages Rqtl (Broman et al. 2003) and Popvar (Mohammadi et al. 2015) and the software Alphasim (Faux et al. 2016) simulate bi-parental populations. The R package Breeding Scheme Language (Yabe et al. 2017) simulates breeding programs. Multi-stage breeding schemes for hybrids using economic constraints are implemented in the R package Selectiongain (Mi et al. 2014, 2016).

To optimize mating for multiple traits, the R Package Genomic mating (Akdemir and Isidro-Sánchez 2016) and the software Alphamate (Gorjanc and Hickey 2018) have been proposed.

Forward stochastic simulations are available in the Python software SeqBreed (Pérez-Enciso et al. 2020) and in MoBPS (Pook et al. 2020), the latter implementing the optimum contribution method in an R environment.

To estimate the probability of getting the best progeny out of N with a specific cross, we can use the R package EMBV (Müller 2017). For qualitative traits controlled by major genes, the probability to cumulate a maximum of favorable alleles can be optimized using the software Optimas (Valente et al. 2013) or PCV (Han et al. 2017).
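The kind of quantity involved when comparing crosses on their best progeny can be sketched under simple assumptions. The code below is not the method of the packages cited above: it assumes that progeny genetic values within a cross are independent and normally distributed with a known mean and standard deviation (both hypothetical here), and computes the probability that at least one of N progeny exceeds a target, as well as the expected value of the best of N progeny.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def prob_at_least_one_above(target, mu, sigma, n):
    """P(best of n progeny > target) for independent N(mu, sigma^2) progeny values."""
    p_below = NormalDist(mu, sigma).cdf(target)
    return 1.0 - p_below ** n

def expected_best(mu, sigma, n, step=0.001, bound=8.0):
    """E[max of n progeny], by trapezoidal integration of z * n * pdf(z) * cdf(z)^(n-1)."""
    def f(z):
        return z * n * STD.pdf(z) * STD.cdf(z) ** (n - 1)
    n_steps = int(round(2 * bound / step))
    total = 0.5 * (f(-bound) + f(bound))
    for i in range(1, n_steps):
        total += f(-bound + i * step)
    return mu + sigma * total * step

# Hypothetical crosses: A has the higher mean, B the larger progeny variance.
best_a = expected_best(mu=100.0, sigma=2.0, n=50)
best_b = expected_best(mu=99.0, sigma=5.0, n=50)
```

With these invented values the cross with the lower mean but larger variance has the higher expected best progeny, which illustrates why progeny variance matters when ranking crosses.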

4 Future Perspectives

4.1 Improvement of Databases

We discussed above how diagnostic markers and genomic predictions can help the introduction of beneficial GR alleles from landraces or wild relatives into breeding populations. Operating procedures for conservation of those accessions have been in place for decades in genebanks, but there is a lack of means and methodological results to optimize the discovery and transfer of beneficial alleles into modern varieties, especially for quantitative traits or multi-trait improvement (Mascher et al. 2019). What is essential to valorize those accessions is the existence of international databases with curated and standardized information (e.g., passport, curated phenotypes, validated GEBV, alleles at validated QTLs, introgressions, cloned genes, and sites under ancient or recent selection pressure). There is no doubt that the better the database, the better the predictions and the integration of useful information for users. Many initiatives emerged to build national databases (e.g., GRIN-Global in the USA, https://www.ars-grin.gov). Some national genebanks connect their database to regional networks (the European Search Catalogue for Plant Genetic Resources, EURISCO, https://eurisco.ipk-gatersleben.de) and international networks, such as the Global Gateway to Genetic Resources (Genesys, https://www.genesys-pgr.org). However, little information is shared besides the passport data. Standardizing experimental protocols and file formats and merging different databases is not straightforward, but this effort would facilitate integration of information and exchange of seeds among genebanks, plant geneticists, and breeders.

For plant phenotypic data management, national initiatives are multiplying for many species (Adam-Blondon et al. 2016), in particular in the phenomics context (Neveu et al. 2019). We can also cite the dataverse phenotypic database for CIMMYT wheat and maize trials (www.cimmyt.org/resources/data/). For instance, a multi-species integrative information system dedicated to plants and their fungal pests, called GNPIS, has been developed in France (Pommier et al. 2019). It bridges genetic and genomic data, giving researchers access to both genetic information (e.g., genetic maps, quantitative trait loci, association genetics, markers, polymorphisms, germplasms, phenotypes and genotypes) and genomic data (e.g., genome sequences, physical maps, genome annotation and expression data). For genomic data and genome sequences in particular, transPLANT is an EU-funded project aiming at building hardware, software, and data infrastructure (Spannagl et al. 2016).

On the plant pathogen side, monitoring is generally organized at the national scale. For instance, the Australian cereal rust control program is estimated to save the industry $289 million per year through resistance breeding. The European project Rustwatch (H2020 Sustainable Food Security-2017) aims to gather information about wheat cultivation surfaces, rust pressure, pathogen races, allelic composition of varieties and their bypass dates in a standardized database, to better understand the dynamics of resistance bypass.

On the breeder side, using a pedigree and phenotype database in the UK, Fradgley et al. (2019) evaluated historical parental contributions in wheat and detected adaptation and selection signatures by comparing genetic diversity levels with or without selection (experimental data vs data simulated by gene dropping, respectively). Similar databases exist for oats, Avena sativa (Tinker and Deyl 2005), and rice (Bruskiewich et al. 2003), for instance.
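Gene dropping simulates the transmission of founder alleles through a known pedigree in the absence of selection, which provides a neutral expectation against which observed data can be compared. The sketch below is a simplified illustration, not the procedure of Fradgley et al. (2019): the pedigree is hypothetical, lines are assumed fully inbred, loci are treated as independent, and each line inherits the allele of either parent with probability 0.5.

```python
import random
from collections import Counter

def gene_drop(pedigree, founders, target, n_loci=10000, seed=1):
    """Fraction of loci in `target` tracing back to each founder, without selection."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_loci):
        allele = {f: f for f in founders}        # each founder carries its own label
        for line, (p1, p2) in pedigree.items():  # parents must be listed before offspring
            allele[line] = allele[rng.choice((p1, p2))]
        counts[allele[target]] += 1
    return {f: counts[f] / n_loci for f in founders}

# Hypothetical pedigree: X = A x B, then Y = X x C.
pedigree = {"X": ("A", "B"), "Y": ("X", "C")}
contrib = gene_drop(pedigree, founders=["A", "B", "C"], target="Y")
```

Under these assumptions the simulated contributions to Y are close to 0.25, 0.25, and 0.5 for A, B, and C. Regions where the observed founder contribution departs strongly from the simulated neutral distribution are candidates for selection signatures.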

An interesting initiative from NIAB is to propose a Toolbox to wheat breeders including evaluated wheat material introgressed with wild relatives (synthetic lines) (https://triticeaetoolbox.org).

The University of California, Davis (UC Davis) provides a list of public wheat diagnostic markers online (MASwheat, https://maswheat.ucdavis.edu).

For genomic selection, the Genomic Open-source Breeding informatics initiative (GOBii: http://gobiiproject.org/), funded by the Bill & Melinda Gates Foundation, has been launched. The objective is to develop open-source data management and marker- and genomic-assisted breeding tools (BrAPI), for under-resourced breeding programs in particular, including trainings and workshops around the world (Selby et al. 2019).

The DivSeek project in the USA aims to bridge the gap between the information requirements of genebank curators, plant breeders, and more targeted upstream biological researchers. It built a cooperative information platform for phenomics and genomics and gathers a collaborative network of genebanks, breeders, scientists, and database and computational experts for metadata curation. The objective is to share methodologies, open-source software and best practices related to genetic resources. For maize, the SeeD project established a breeder’s core of 4,000 landrace accessions that were genotyped and phenotyped, including testcross performance (http://seedsofdiscovery.org). For wheat, the Heat and Drought Wheat Improvement Consortium (HeDWIC, http://www.hedwic.org/), coordinated by CIMMYT, aims at boosting heat and drought breeding using genomic and phenomic tools.

Converting information from population genomics and quantitative genetics into useful material for breeders is thus a long-term joint research goal. Public research may play an essential role in this activity, provided that means and funding are sufficient.

4.2 Integration of Omics to Better Decipher Genome/Phenome Relationship

Elite varieties have mainly been selected for production and post-harvest qualities, with less attention to other features such as drought tolerance, nutrient use efficiency or durable pest and disease resistance. The effects of these factors have been mitigated by the use of treatments such as irrigation, fertilizers, and pesticides. Now that governments promote a more sustainable agriculture, breeding for stress tolerance may become the rule once the tools and methodologies are available. A better understanding of ecophysiological and expression determinants is essential to breed for stress tolerance. However, large-scale phenotyping of physiological traits and generating data for population genomics and other “omics” aspects, for many varieties in different conditions with biological replicates, is still not affordable. But costs are likely to drop soon (Zivy et al. 2015).

4.2.1 Sequencing Fragments with Known DNA Patterns (Target Candidates)

Instead of sequencing the whole genome of accessions, we can target the exome or specific domains such as LRR that are typical of resistance genes. Jupe et al. (2013), using Resistance gene enrichment Sequencing (RenSeq), reannotated the NB-LRR gene family and rapidly mapped resistance loci in segregating populations from hexaploid bread wheat. Arora et al. (2019) combined R gene enrichment sequencing, using a sequence capture bait library optimized for Ae. tauschii NLR domains, with k-mer based association genetics (AgRenSeq) on a diverse panel (195 Ae. tauschii accessions) and rapidly cloned four stem rust resistance genes (Sr33, Sr46, Sr45, SrTA1662). Using mutagenesis coupled with exome capture and NLR-baits (MutRenSeq), Steuernagel et al. (2016) rapidly cloned the Sr22 and Sr45 genes.

4.2.2 Population Transcriptomics

With the availability of Next Generation Sequencing (NGS) technologies, it has become possible to sequence mRNA directly at relatively low cost.

Genomic predictions using whole-genome SNPs or GWAS are limited in capturing epistasis. Because mRNA, small RNA (sRNA) sequences and metabolic data are involved in transcriptional, translational, and post-translational processes, we expect them to provide such information. For instance, GWAS on transcripts allowed detecting candidate genes controlling oil content in maize, which were then sequenced to detect polymorphisms and favorable alleles (Li et al. 2013). In grain maize, Schrag et al. (2018) evaluated the ability of this kind of data, collected on parental lines, to predict the performance of untested hybrids. They found that mRNA data were the best predictor for grain yield and whole-genome SNP data for grain dry matter content, while sRNA performed relatively poorly for both traits. Combining mRNA and genomic data as predictors resulted in high predictive abilities across both traits and could contribute to more efficient selection of hybrid candidates in maize.

RNA sequences can differentiate between isoforms and members of a gene family, a widespread situation in complex crop genomes, which is difficult using DNA sequences. For example, in wheat, Oono et al. (2013) discovered phosphate starvation-responsive genes in this way. Ramírez-González et al. (2018) showed differential expression of homoeologous genes due to epigenetic modifications and variation in transposable elements within promoters. Measuring tissue- and stress-specific co-expression networks throughout development allows reconstructing regulatory networks. Some kernel component candidates were found using this strategy (Wen et al. 2016).

4.2.3 Population Proteomics

Carpentier et al. (2011) identified protein polymorphisms correlated with drought tolerance using shotgun approaches in banana, and Grimaud et al. (2013) found cold-acclimation-related proteins in pea. Virlouvet et al. (2011) identified the ZmASR1 gene under a protein abundance QTL (pQTL) as a candidate for drought tolerance in maize. The same gene was also implicated in tomato, grape, lily, and banana (Maskin et al. 2001; Çakir et al. 2003; Wang et al. 2005; Henry et al. 2011).

4.2.4 Population Metabolomics: Phenotypes Targeting Candidate Metabolic Pathways

Metabolomics can detect targeted primary (sugars, organic and amino acids, etc.) and secondary metabolites (photosynthates necessary to biomass formation, flavonoids, sugar-phosphates, phytohormones, phytoalexins) without genome sequence information. But it is not yet possible to work on the entire metabolome. Doerfler et al. (2014) detected 15 metabolite QTLs (mQTLs) of the flavonoid pathway for cold and light stress in Arabidopsis thaliana. Pathogen-induced markers were identified for Rhizoctonia solani in potatoes (Aliferis and Jabaji 2012), fungal pathogens in soybean (Aliferis et al. 2014), and bacterial blight resistance in rice (Wu et al. 2012). An aroma (mesifurane) candidate gene was detected in strawberry, Fragaria x ananassa (Zorrilla-Fontanesi et al. 2012). The use of metabolomics in breeding has been reviewed in Fernandez et al. (2021).

4.2.5 Population Epigenomics

Epigenomic variations are involved in the control of plant developmental processes and in shaping phenotypic plasticity in response to the environment (Gallusci et al. 2017; Moler et al. 2019). The elucidation of epigenetic regulatory networks using DNA methylation information should improve crop models. For instance, we can predict lycopene accumulation during tomato fruit ripening (Liu et al. 2015b), anthocyanin accumulation in apple (El-Sharkawy et al. 2015), and energy-use efficiency in canola lines (Hauben et al. 2009).

Concerning histone marks, as they are likely to be erased following meiosis, they are of little interest to breeding applications in sexually propagated crops. But they can be relevant for clonally propagated crops, for pathogen resistance, for instance (Jaskiewicz et al. 2011).

It is well known that DNA mutations, copy number variants or methylation in genes, promoters or regulatory regions can affect gene expression, which modifies phenotypes in different environmental contexts. Many studies also showed that rearrangements of loci on chromosomes, inversions, insertions of transposable elements, and deletions can lead to gene silencing. All those types of polymorphisms and annotations could help improve genomic prediction models. Molecular markers in the vicinity of genes actually tend to be more closely linked to causal variants in maize (reference). QTL effects are higher in genic regions (Wallace et al. 2014), which is consistent with the fact that a large portion of the variability of gene expression is attributed to cis polymorphisms in maize (Schadt et al. 2003). Taking into account the proximity of molecular markers to genes actually improves prediction of agronomic traits in diverse populations of hybrid maize (Ramstein et al. 2020).

To facilitate and optimize those models, we still need the development of generalized methods that integrate multiple data types.

4.2.6 Integration of Different Population “Omics” Information

The long-term objective is to be able to integrate all possible “omics” information on the same samples. We will be able to detect eQTLs, pQTLs, and mQTLs and look for co-localization with molecular marker-based QTLs, giving direct access to the genes and favorable alleles (cis-QTLs) and to regulatory factors outside of the gene (trans-QTLs). As skills are spread in different groups, a European network named COST project was organized to help build regulation networks from integrated databases. To make it useful to breeders, the first objective is to define traits of interest for specific climatic zones or constraints.
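The distinction between cis and trans effects and the search for co-localization can be sketched with simple positional rules. All names, positions, and the 1 Mb window below are hypothetical choices made for illustration; in practice the window depends on the extent of linkage disequilibrium in the population.

```python
def classify_eqtl(eqtl, genes, window_bp=1_000_000):
    """Label an eQTL 'cis' if it lies within `window_bp` of the regulated gene, else 'trans'."""
    chrom, start, end = genes[eqtl["gene"]]
    if eqtl["chrom"] != chrom:
        return "trans"
    distance = max(start - eqtl["pos"], eqtl["pos"] - end, 0)
    return "cis" if distance <= window_bp else "trans"

def colocalized(eqtl, trait_qtls):
    """Names of trait QTL intervals (chrom, start, end, name) containing the eQTL position."""
    return [name for chrom, start, end, name in trait_qtls
            if chrom == eqtl["chrom"] and start <= eqtl["pos"] <= end]

# Hypothetical gene, eQTLs for its expression, and one marker-based trait QTL.
genes = {"geneA": ("1", 5_000_000, 5_004_000)}
eqtls = [
    {"gene": "geneA", "chrom": "1", "pos": 5_200_000},
    {"gene": "geneA", "chrom": "3", "pos": 12_000_000},
]
trait_qtls = [("1", 4_500_000, 6_000_000, "yield_QTL_1")]

labels = [classify_eqtl(e, genes) for e in eqtls]
overlaps = [colocalized(e, trait_qtls) for e in eqtls]
```

An eQTL that is both cis-acting and co-localized with a trait QTL points to the gene itself as a candidate, whereas a trans-acting eQTL points to a regulatory factor located elsewhere in the genome.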

Then, cellular phenotyping (transcriptome/proteome/metabolome) will help building more realistic models to predict phenome in the field. Models taking into account non-additive effects, nonlinear relationships between enzyme concentrations and metabolic fluxes (Fiévet et al. 2010; Vacher and Small 2019) could actually explain even more genetic variance and improve predictions.

5 Conclusion

Integration of concepts and tools of population genomics and quantitative genetics can lead to a better valorization of genetic diversity in crop (pre)breeding programs.

Advances in population genomics offer a new dimension to quantitative genetics in the form of increasing data on genetic diversity and structure, identification of new candidate genes of agronomic interest associated with signatures of selection, associations with environmental covariates and phenotypes, and prediction of genetic values of various plant genetic resources.

Genomic predictions can detect germplasm of interest in genebanks without the need of phenotyping if the calibration population is relevant and the quality of phenotyping is satisfactory. Good quality phenotyping will actually always be a cornerstone to efficient plant breeding and predictions. Genomic predictions can help to optimize the time and cost of the breeding process, allowing a transfer of budget to test a larger number of genitors and crosses. It can accelerate recurrent selection to produce pre-breeding and breeding lines that contain new favorable alleles. It can predict optimum parental contribution and mating in (pre)breeding programs to optimize short-term genetic gain but also assure long-term genetic gain by constraining germplasm diversity. Currently, the main methodological challenge here is a good estimation of marker effects and progeny variance.

Increasingly detailed multi-omic characterization of genetic resources (through genomics, transcriptomics, methylomics, proteomics) is expected to help understand and predict the genome-phenome relationship, and ultimately design ideotypes for particular growth conditions and uses. The hope is that additional layers of omics data will improve estimation of marker by environment effect. Currently, several technical hurdles are preventing industrial implementation of multi-omics approaches in the breeding process. On the fundamental level, effects of epigenetic variation on gene expression – on the background of nucleotide variation – are still difficult to detect, quantify, and generalize. Also, it remains to be seen whether genotype is a good predictor of methylome, transcriptome, and metabolome, i.e. whether training sets characterized with multi-omic data can improve genomic prediction of candidates that have been genotyped with SNPs, giving higher weights in prediction models to QTLs. Moreover, multi-omics approaches in the next generation of genomic prediction can only come with increased analytical complexity and cost. Nonetheless, recent years have witnessed an emergence and proliferation of methods designed for multi-omic data integration and analysis, and with the continuous drop of sequencing costs, multi-omics crop research will attract significant efforts in the immediate future. With a combination of multi-omic, agronomic, phenological and physiological data, supplemented with precise environment characterization (weather, soils, crop management) and targeted trialing, we are set on the path to decipher the complex GxE interactions and predict the performance of existing and new varieties in current and future environments.

For practical applications, it is necessary to integrate population genomics and other “omics” information with phenotypes in common public databases, so that robust methodologies and decision tools could be developed to convert this information into feasible protocols. In that context, one role of public research could be to develop and disseminate databases, new methodologies, and produce decision tools that could be validated by breeders in interactive projects. Public research could also coordinate the design, production, and evaluation of ready-to-use crop plant resources, pre-breeding genitors in particular.