Once a breeder is confident both in the ability of the QTL to deliver sufficient phenotypic variance in the breeding program and in the ability of the marker system to accurately track the QTL in breeding populations, the actual implementation of MAS needs to take place. Modern breeding has steadily moved away from the artistic whims of data-poor decisions to an evidence-based, data-driven process, and genotypic information can play a central role in enabling this process. Key issues in the application of MAS revolve around at what stage(s) in the breeding cycle MAS should be applied, what population sizes are required to ensure other breeding strategies can be executed effectively, and how MAS integrates with phenotypic and genomic selection methods. For the purposes of this discussion, it will be assumed that the breeding program is using a rapid single seed descent-based line fixation procedure like rapid generation advance (RGA; Collard et al. 2017) through to the F5 generation, and that field testing is done on fixed lines. However, similar strategies could be applied if using a pedigree selection approach. Given this approach, marker-assisted selection can be applied at many points within, upstream of, and parallel to the forward breeding activities (Fig. 6).
MAS in forward breeding
If markers of sufficient accuracy are designed, these can be used to derive a fingerprint, or ‘QTL profile’, that articulates the valuable genes and QTLs present in prospective parental lines. This could be used in two ways: to select one parent to contribute an especially valuable gene, or to add value to existing crosses. In a fully modernised breeding program, parents are chosen primarily for their breeding value for quantitative traits, but in many cases, two parents of high potential breeding value in a cross both lack the major QTL of interest. This is common, especially when the QTL is coming from diverse or wild sources and is thus absent or rare in the breeding program. QTL profiling of all potential parents allows for a more informed selection decision to be made between two lines with similar breeding values but differing for the QTL of interest. Similarly, knowing the complete QTL profile of two parents allows for more careful and cross-specific population planning (see below). For example, a cross between two elite parents is not chosen to bring in a particular QTL, but nonetheless each parent may contribute one or more QTLs the other parent lacks. Knowledge of these QTL segregating in a planned population allows the breeder to add value to progeny selected out of the population. If successful, this allows enrichment of the more expensive genomic selection activities and yield trials with material that is known to be QTL positive, thus increasing the value proposition of the yield trials.
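As a minimal sketch of how QTL profiles could inform cross planning, the snippet below represents each parent's profile as the set of favourable QTL alleles it carries and reports which QTLs would be fixed and which would segregate in a cross. The function name `segregating_qtls`, the parent variables, and the QTL labels are hypothetical and serve only as illustration.

```python
def segregating_qtls(profile_a: set[str], profile_b: set[str]) -> dict[str, list[str]]:
    """Compare two parental QTL profiles (sets of favourable QTL names)."""
    return {
        # Carried by both parents: every progeny line inherits the favourable allele.
        "fixed_in_cross": sorted(profile_a & profile_b),
        # Carried by only one parent: these will segregate in the population.
        "only_parent_a": sorted(profile_a - profile_b),
        "only_parent_b": sorted(profile_b - profile_a),
    }


# Hypothetical profiles for two elite parents.
parent_a = {"QTL_1", "QTL_2"}
parent_b = {"QTL_2", "QTL_3"}

plan = segregating_qtls(parent_a, parent_b)
# Number of segregating QTLs that could be placed under selection in this cross.
q = len(plan["only_parent_a"]) + len(plan["only_parent_b"])
print(plan, q)
```

The count of segregating QTLs is the quantity that drives the population size needed for the cross.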
This is of course the classical interpretation of marker-assisted selection, and yet despite its familiarity, significant variation exists as to its application in the forward breeding process. It could be applied at early generations (e.g. F2) or at later fixed generations (F5 +), and both have distinct advantages depending on the number of genes under selection, the objectives of the cross, and the cost of genotyping. In early generations such as the F2, the proportion of the population fixed for a gene is only 25%, but a further 50% of the population are heterozygous, allowing for enrichment of a population with a certain favourable allele without constraining all progeny to be homozygous positive. In fixed populations, a much larger proportion is homozygous for the desired allele (46.875% in the F5, or theoretically 50% for doubled haploids or advanced generation inbreds) allowing for smaller population sizes necessary to arrive at the same number of lines homozygous for the locus under selection. Selection in the fixed (F5) generation is easily incorporated into any strategy involving RGA, thereby fixing required major loci prior to phenotypic evaluation, and thus field space for phenotypic selection is only used for lines known to possess the required disease resistances and/or quality requirements.
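The segregation ratios quoted above follow from the halving of heterozygosity with each generation of selfing. The sketch below computes the expected fraction of lines homozygous for the favourable allele at a single locus, assuming an F1 heterozygous at that locus, strict selfing, and no selection or segregation distortion; the function name is illustrative.

```python
from fractions import Fraction


def homozygous_positive_fraction(generation: int) -> Fraction:
    """Expected fraction of F_generation lines homozygous for the favourable allele.

    Assumes the F1 is heterozygous at the locus and every later generation
    is produced by selfing without selection.
    """
    if generation < 2:
        raise ValueError("generation must be 2 (F2) or later")
    # Heterozygosity halves with each selfing: 1/2 in the F2, 1/4 in the F3, ...
    heterozygous = Fraction(1, 2) ** (generation - 1)
    # The remaining lines are split equally between the two homozygotes.
    return (1 - heterozygous) / 2


for gen in (2, 5):
    frac = homozygous_positive_fraction(gen)
    print(f"F{gen}: {frac} = {float(frac)}")
```

Under these assumptions the function returns 1/4 for the F2 and 15/32 (0.46875) for the F5, approaching 1/2 as lines become fully fixed.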
The most critical factor for success of this endeavour (besides accurate marker systems) is to use population sizes sufficient to reliably identify lines positive for the number of genes/QTLs under selection. The population size required can be approximated from the Mendelian segregation ratios, but this will consistently underestimate the required population numbers. For example, assume a breeder wants to select for two unlinked loci and end up with 100 QTL-positive lines available for the stage 1 yield trial. Given that the Mendelian segregation ratio of a single gene in the F5 is 0.46875, it is tempting to say that the required genotypic frequency is thus (0.46875)^2, or 0.2197. Thus, for 100 QTL-positive progeny, the population size n is given by 0.2197n = 100, and thus (rounding to the nearest integer) n = 455. However, it must be recalled that under this scenario, 100 positive progeny is the mean number that would be observed, not the number that would be observed in any specific population. The actual number observed would follow a distribution centred on this mean (a binomial distribution, since it is modelling a variable, genotype, with two discrete outcomes, desired and non-desired). Thus, in any given population, there is roughly a 50% chance that the observed number of positive progeny will be below this number. To control for this, an additional term must be introduced: the acceptable risk of failure F, which in the simplest case is the probability that a population will not yield any positives at all. Then, to calculate the required population size (for a single positive genotype), the following formula can be derived:
Definitions:
P = probability of required genotype = segregation ratio
Q = number of genes/QTLs under selection
F = Failure risk
R = required number of positive progeny
n(1) = population size required to identify at least one individual with the required genotype
Derivation:
$$ {\text{Probability}}\;{\text{of}}\;{\text{positive}}\;{\text{genotype}}\; = \;P^{Q} \tag{1} $$
$$ {\text{Probability}}\;{\text{of}}\;{\text{not}}\;{\text{positive}}\;{\text{genotype}}\; = \;1 - P^{Q} \tag{2} $$
Probability of all individuals in a population of size n(1) having the non-positive genotype
$$ = \left( {1 - P^{Q} } \right)^{n\left( 1 \right)} \tag{3} $$
This probability is equivalent to the probability of failure, i.e.
$$ F = \left( {1 - P^{Q} } \right)^{n\left( 1 \right)} \tag{4} $$
Rearranging this to solve for the population size,
$$ n\left( 1 \right) = \log_{{\left( {1 - P^{Q} } \right)}} F = \frac{\ln F}{\ln \left( {1 - P^{Q} } \right)} \tag{5} $$
This then can be used to calculate the population size required to have less than F probability of finding no positive genotypes. It should be noted that Eq. (5) is conceptually similar to Eq. (4) of Hospital and Charcosset (1997) with selection in a single generation (t = 1 in their formula). However, while the formula from Hospital and Charcosset (1997) is derived for a single QTL genotyped with multiple (flanking) markers over multiple generations, this is derived for multiple unlinked QTLs each genotyped with a single diagnostic marker at a single generation, consistent with the RGA workflow.
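Equation (5) can be evaluated directly by a change of logarithm base. The sketch below does this and rounds up to a whole plant; the function name `n_one` and the example parameter values (F5 segregation ratio, F = 0.05) are illustrative assumptions, not values prescribed by the derivation.

```python
import math


def n_one(p: float, q: int, f: float) -> int:
    """Smallest population size with at most probability f of containing
    no individual positive at all q unlinked loci (Eq. 5, rounded up).

    p: segregation ratio per locus, q: number of loci, f: failure risk.
    """
    if not (0 < p < 1 and 0 < f < 1 and q >= 1):
        raise ValueError("require 0 < p < 1, 0 < f < 1 and q >= 1")
    success = p ** q  # probability that one individual is positive at every locus
    # log base (1 - success) of f, via natural logarithms.
    return math.ceil(math.log(f) / math.log1p(-success))


# Illustrative values: F5 segregation ratio, 5% accepted risk of failure.
for loci in (1, 2, 3):
    print(loci, n_one(0.46875, loci, 0.05))
```

Rounding up guarantees that the failure probability at the returned size does not exceed F, while one plant fewer would exceed it.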
Equation (5) gives the minimum population size to have less than F probability of not finding any individuals with the required genotype. For forward breeding purposes, a single positive individual is not sufficient; each population must give a certain required number of progeny (here designated R) for field evaluation. The population size n(R) required to generate R positive progeny could be estimated by assuming n(R) = R.n(1), i.e. it is simply n(1) multiplied by R. This is, however, an overestimate of the required size; for a population of size n, there is also a nonzero probability of finding 2, 3, etc., positive progeny. Taking this into account is more involved and requires the use of the binomial cumulative distribution function, where the number of successes (individuals with the required genotype) is at most R − 1, the number of trials is n (the population size), and the probability of success is the combined segregation ratio (P^Q). This then gives the probability of not finding R or more progeny with the required genotype in a population of size n. The population size n can then be estimated iteratively, increasing n until this probability falls to the acceptable probability of failure F.
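The iterative procedure can be sketched as follows. This is one possible implementation rather than the authors' own: it evaluates the binomial cumulative distribution in log space (to avoid overflow for large populations) and uses a doubling-then-bisection search instead of incrementing n one plant at a time. The function names and the example value F = 0.05 are assumptions.

```python
import math


def failure_probability(n: int, r: int, success: float) -> float:
    """P(fewer than r positives among n plants), i.e. the binomial CDF at r - 1."""
    log_s, log_not_s = math.log(success), math.log1p(-success)
    total = 0.0
    for k in range(min(r, n + 1)):
        # log of C(n, k) * success^k * (1 - success)^(n - k)
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * log_s + (n - k) * log_not_s)
        total += math.exp(log_term)
    return min(total, 1.0)


def population_size(r: int, p: float, q: int, f: float) -> int:
    """Smallest n giving at least r positive progeny with failure risk <= f."""
    if not (0 < p < 1 and 0 < f < 1 and q >= 1 and r >= 1):
        raise ValueError("require 0 < p < 1, 0 < f < 1, q >= 1 and r >= 1")
    success = p ** q
    low, high = r - 1, r  # r - 1 plants can never yield r positives
    while failure_probability(high, r, success) > f:
        low, high = high, high * 2  # grow until the risk is acceptable
    while high - low > 1:  # bisect: risk(low) > f >= risk(high)
        mid = (low + high) // 2
        if failure_probability(mid, r, success) > f:
            low = mid
        else:
            high = mid
    return high


# Worked example from the text: two unlinked loci in the F5, 100 positive lines.
print(population_size(100, 0.46875, 2, 0.05))
```

For R = 1 the search reduces to Eq. (5). For the two-locus example, the result lies above the Mendelian estimate of 455 and well below R.n(1).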
This binomial extension of Eq. (5) can then be easily used to calculate the required RGA population sizes for varying numbers of QTLs and required numbers of positive progeny lines for field testing. Examples for up to 5 QTLs and 50, 100, or 200 positive lines for field testing are given in Table 1. It should be noted that, as expected, the required population size for RGA is somewhat larger than if estimated based on the Mendelian calculation, but significantly smaller than if the binomial cumulative distribution is not used in the calculation and it is assumed that n(R) = R.n(1). It is evident that for up to three QTLs, the population sizes required are relatively modest. However, each additional QTL just over doubles the required population size, so beyond three, the required numbers of plants in RGA escalate quickly. If selection for a larger number of genes is required, a modified two-stage selection strategy could be applied. In this strategy, selection is initially applied at the F2 generation. This would be a disadvantage if selecting for homozygous positive progeny, as the segregation ratio is much smaller (0.25 vs. 0.46875), but if instead heterozygous progeny are included (i.e. eliminating only progeny negative for any of the target QTLs), the segregation ratio is actually more favourable (0.75). Subsequent generations will then of course be segregating again for the target QTLs, but the elimination of lines lacking any of the targets in the F2 generation skews the segregation ratio in subsequent generations. Even if no further selection is applied, approximately 2/3 of the F5 generation will be homozygous positive for each target QTL. This means the second round of selection in the fixed generation now has a more favourable segregation ratio (approx. 0.667 vs. 0.46875), and so for a given population size, more QTLs can be fixed with this two-stage selection process.
Calculations show that for a single locus, this F2 enrichment strategy is less efficient due to the need to sample twice. For two loci, the number of plants/datapoints is almost equivalent, but for three or more loci, enrichment at the F2 and subsequent fixation at the F5 are more efficient; beyond three loci, the savings are substantial.
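The effect of F2 enrichment on the later segregation ratio can be checked with a short calculation. The sketch below assumes a single locus, removal of all negative homozygotes in the F2, equal contribution of every retained F2 plant, and selfing without further selection; the function name is illustrative.

```python
from fractions import Fraction


def homozygous_after_f2_enrichment(generation: int) -> Fraction:
    """Expected fraction homozygous positive in F_generation after removing
    negative homozygotes in the F2 (single locus, selfing, no further selection)."""
    if generation < 2:
        raise ValueError("generation must be 2 (F2) or later")
    # Retained F2 plants: 1/3 homozygous positive, 2/3 heterozygous.
    homozygous_f2, heterozygous_f2 = Fraction(1, 3), Fraction(2, 3)
    # Each further selfing halves the heterozygosity of the heterozygous families.
    still_heterozygous = Fraction(1, 2) ** (generation - 2)
    # Half of the lines that become fixed are fixed for the favourable allele.
    return homozygous_f2 + heterozygous_f2 * (1 - still_heterozygous) / 2


print(homozygous_after_f2_enrichment(5))   # F5 after F2 enrichment
print(homozygous_after_f2_enrichment(30))  # effectively fully fixed lines
```

Under this simple model the ratio is 5/8 at the F5 and approaches 2/3 as lines become fully fixed, in either case more favourable than the corresponding ratio without F2 enrichment (15/32 at the F5, 1/2 at full fixation).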
Table 1 Required RGA population sizes for marker-assisted forward breeding
QTL deployment
Utilisation of genes in the forward breeding process requires the availability of these genes in elite germplasm, but as shown in Fig. 1, for many genes/QTLs, this is not the case. The desired alleles/haplotypes are often only found in landraces and other germplasm with highly undesirable characteristics. This can be illustrated from the breeding value of several varieties commonly used as donors for disease resistance in rice: the IRBB lines as donors for Xanthomonas oryzae resistance genes and IRBL9-w as the donor for the highly effective blast resistance gene Pi9 (Fig. 7). Clearly, these donor lines could not be used in a breeding program without severely reducing the average performance of resulting progeny. The standard answer to this difficulty is to repeatedly cross the gene(s) of interest into an elite genomic background via marker-assisted backcrossing (MABC), thereby diluting the poor-quality background until its effects are negligible.
The process of taking a gene from a poor-quality donor line and making it available to breeding programs as a high-quality introgression in an elite genomic background is here called QTL deployment. The focus of QTL deployment is on quality of introgressions rather than quantity, with the objective of producing high-quality parents for breeding programs, not necessarily varieties per se. To achieve this, QTL deployment must produce introgressions designed to minimise both genomic penalties of the donor genome and linkage drag around the target gene. These introgressions must be into a recipient that will be relevant to the breeding programs for at least the next 5 years, to enable its use as a parent to introduce the new gene/QTL into the breeding program. Significant value can be added to QTL deployment products if these also take into account haplotypic diversity at the locus of interest to avoid selective sweeps (i.e. embed target genes in multiple elite haplotypes), pyramid multiple genes for a trait, and develop coupling-phase linkages of multiple genes in a genomic region.
Key to producing high-quality introgressions is to reduce risks of linkage drag associated with the target gene or QTL. Backcrossing of the gene of interest into an elite background will dilute the effects of the poor-quality donor genome but, without the use of recombinant selection, will not remove any negative effects from tightly linked genes. For example, Pi9 has been observed to cosegregate with unfavourable grain hull colour (Scheuermann and Jia 2016) and has shown a persistent segregation distortion and high levels of floret sterility through multiple heterozygous generations (personal observations). These effects could be due to effects of the Pi9 gene itself, but are more likely due to linkages with unfavourable genes from the wild progenitor (Amante-Bordeos et al. 1992). Selection of recombinant genotypes, possessing the donor allele at the locus of interest but containing the recipient genotype flanking this region, minimises the risk of linkage drag negatively affecting the quality of the resulting progeny. This recombinant selection requires the screening of a large number of progeny in order to be successful. Population sizes required vary according to the genetic distance D between the peak and recombinant markers. The size required can be easily calculated by incorporating D into the segregation ratio of Eq. (5), such that the segregation ratio becomes (P.D)^Q, where D is expressed in Morgans (so a recombinant marker at 1 cM corresponds to D = 0.01):
$$ n\left( 1 \right) = \log_{{\left( {1 - \left( {P.D} \right)^{Q} } \right)}} F \tag{6} $$
Population sizes obtained from Eq. (6) are quantitatively equivalent to those presented in Table 3 of Hospital (2001) for identical parameters (F = 0.01, t2 = 2, assuming equal population sizes across generations), although the equations presented in the latter publication are unable to solve for a required number of positive progeny (see below).
During backcrossing, the required population size n to identify at least one recombinant for a marker at 1 cM with a failure risk of 0.05 is 602, whereas the more favourable segregation ratio in a selfing generation (F2) means this can be done with only 400 plants. However, since deployment of a gene to the breeding program requires both backcrossing and breaking of linkage drag, conducting recombinant selection during selfing generations adds to the time taken to make a gene useable. Therefore, if it is feasible to generate the population sizes required (as would be possible in a prolific cereal species like rice), conducting recombinant selection during the backcrossing process saves 2 generations of time in the deployment process. Similarly, as with marker-assisted forward breeding, it is usually desirable that > 1 positive segregant is identified—in this case to mitigate risks due to mortality of the identified individual. Thus, incorporating Eq. (6) into a binomial distribution as described previously allows the calculation of population sizes for the desired number of positive progeny R (for example, if R = 2, D = 1 cM, F = 0.05, the population size required is n = 947 BC-F1 plants).
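The recombinant-selection calculation can be sketched in the same way, with (P.D)^Q as the probability of success in the binomial distribution. The code below is an illustrative implementation under stated assumptions: D is used directly as the recombination fraction (no mapping function is applied), P = 0.5 represents a backcross generation, and the function names are hypothetical.

```python
import math


def risk_of_fewer_than(r: int, n: int, success: float) -> float:
    """Binomial probability of observing fewer than r successes in n trials."""
    log_s, log_not_s = math.log(success), math.log1p(-success)
    total = 0.0
    for k in range(min(r, n + 1)):
        total += math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                          - math.lgamma(n - k + 1)
                          + k * log_s + (n - k) * log_not_s)
    return min(total, 1.0)


def recombinant_population_size(r: int, p: float, d: float, f: float, q: int = 1) -> int:
    """Smallest n giving at least r recombinant positives with failure risk <= f.

    p: segregation ratio of the target allele, d: distance to the recombinant
    marker in Morgans (assumed equal to the recombination fraction),
    q: number of loci, f: accepted failure risk.
    """
    success = (p * d) ** q  # success probability as in Eq. (6)
    n = r
    # Linear search is adequate here; each step adds one plant.
    while risk_of_fewer_than(r, n, success) > f:
        n += 1
    return n


# Backcross generation, recombinant marker at 1 cM, 5% failure risk.
for required in (1, 2):
    print(required, recombinant_population_size(required, 0.5, 0.01, 0.05))
```

With R = 2 this sketch returns 947, in agreement with the figure quoted above. With R = 1 it returns 598, a few plants below the quoted 602; the text does not indicate the source of that small difference.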
This first-stage deployment will produce a single, high-quality introgression in one genetic background, and therefore, the newly deployed gene will be available in only one haplotype. This was of course the objective, but if this were used as the sole source of an essential gene in a breeding program, eventually all resulting lines would possess the same haplotype in the region of that gene. The resulting selective sweep would then make a large portion of the surrounding genome unavailable to recombination. If this were only observed at a single locus, its effects would be marginal, but with the number of genes that could be used in rice breeding, it would quickly result in fixation of a substantial fraction of the genome and therefore a reduced genetic diversity in the breeding program and reduced potential for genetic gain.
To avoid this situation, the gene must be deployed into multiple haplotypes. With the cost of sequencing consistently becoming more economical, any modern breeding program should make the effort to evaluate the breeding germplasm for haplotypic diversity. This information permits the breeder to choose three or four elite recipients for this embedding process, each representing a haplotype of some appreciable frequency in the breeding program. If resources permitted, deployment could commence into all of these recipients concurrently. Alternatively, the elite line produced from the primary deployment could be used as the donor for the embedding, in which case only two rounds of MABC-plus-recombinant selection may be required to embed the gene in the second, third, etc., haplotype, since the donor is already an elite background. These haplotype-embedded donors would then be a tremendous resource for quickly rolling out the new gene into an entire breeding program with minimal disruption to the surrounding genetic diversity.
Initial deployment of genes is required for any locus of sufficient rarity among elite lines. However, many valuable loci are not completely unlinked, and so even with proper QTL deployment, favourable alleles of most genes would often be found in different varieties, a phenomenon known as repulsion-phase linkage. This can create an ‘either/or’ scenario for the breeder who would need to create inordinately large forward breeding populations to break the repulsion-phase linkage. If careful planning is done up front to account for regions of the genome enriched for potential MAS targets, the QTL deployment process itself can be used to bring these favourable alleles from diverse sources into coupling-phase linkage through harnessing the recombinant selection that is inherent in this process. Coupling of genes in a genomic region adds value to that whole region—selecting for one favourable gene will then typically carry along the other gene(s) simplifying and adding substantial value to breeding selections. Likewise, pyramiding of genes for a trait is well recognised to provide superior phenotypic benefits and stability compared to a single gene; deployment of a new gene can thus be carried out into a variety already possessing one or more other genes for the same trait, thus creating a pyramid as part of the deployment process. The large number of alleles characterised with beneficial effects in rice provides substantial opportunities for these, a few of which are illustrated in Fig. 8.
Line augmentation
QTL deployment is a laborious and painstaking procedure, requiring detailed oversight of large populations to identify the required recombinants and ensure the quality of products produced. Due to this, it is not well suited to high-throughput applications, and indeed, it only needs to be achieved once into the target elite material—or perhaps a few times, to conduct haplotype embedding. However, this leaves a gap in the utilisation of the target gene in the breeding program; it is now available to crossing programs, but even if all crosses made use of the QTL deployment products as parents, it would still take several years (one complete breeding cycle) for the progeny of these crosses to work their way through to release as varieties or used as a new generation of parents.
Line augmentation is the process to address this lag; it is designed to rapidly introgress genes from QTL deployment products into existing parents, breeding lines, and varieties. These augmented lines can then be used as parents in the current crossing schedule, thereby enabling the deployed gene(s) to have rapid impact and increase in frequency much more quickly. To achieve this, line augmentation involves rapidly backcrossing genes from an elite donor into many recipients, utilising only foreground selection. Since only small numbers of backcross progeny are required each generation (Table 2), positive progeny may be maintained through an RGA-like procedure to reduce generation times. As with QTL deployment, the recipient parent is maintained in a staggered planting to ensure synchronised flowering whenever selected progeny mature. To achieve the speed required of augmentation, deployed genes can only be backcrossed two or possibly three times into the recipients, and the volume of lines to be augmented precludes recombinant selection. There is thus no opportunity to correct any linkage drag or poor-quality genomic background in the donor line—the focus of line augmentation therefore is quantity not quality. As such, line augmentation requires high-quality introgressions from QTL deployment to act as donors, ideally as pyramided loci and embedded in the same haplotype as found in the recipient line.
Table 2 Required RGA population sizes for line augmentation
Line augmentation can operate in one of two manners. It can be used to deploy new genes into parents of a breeding program, rapidly increasing the frequency of the target genes in many elite backgrounds and thus allowing the benefit of newly identified or previously unutilised loci to be rolled out quickly in new crosses. However, by this method, it would still take one full breeding cycle before the benefits of new genes could be deployed in varieties. Line augmentation can also be used to ‘upgrade’ progeny of existing crosses, perhaps after initial stages of field testing; in parallel with the field testing process, promising lines identified at early stages of testing can be rapidly upgraded within a year with genes they are lacking and upgraded versions inserted back into later stages of field testing the following year. Thus, this ‘upgrading’ mode allows benefits of new genes to be realised even within existing populations and programs, with minimal time lost. Operating together, these two modes (focused on augmenting new crosses and existing populations) could quickly roll out a gene/QTL throughout an entire breeding program.
Since line augmentation requires only foreground selection, it is much less resource-intensive than QTL deployment, yet has a greater impact in terms of lines/breeding material produced (Fig. 9). Therefore, it may be expected that not every breeding program may need to implement QTL deployment activities; it would be more efficient if most could leverage QTL deployment products from a small number of dedicated pre-breeding programs. On the other hand, most breeding programs will benefit from implementing line augmentation at some level to enable rapid conversion of the entire breeding program to new genes.
Information management
Marker-assisted selection is intimately tied in with management of breeding populations, and ideally, selection decisions based on marker data should be integrated with management of germplasm. Some freely available platforms exist to handle this process, such as the Integrated Breeding Platform (IBP; www.integratedbreeding.net) and KDDart (www.kddart.org); many others are reviewed in Rathore et al. (2018). Often the handling of genotypic data is at a much earlier stage of development than that of handling pedigree, population and phenotypic data, etc., and is typically focused on either graphical overviews of the data (e.g. Flapjack, Milne et al. 2010), calculating recurrent parent recovery rates in MABC programs, or enabling genomic selection, rather than enabling selection of major-effect loci in forward breeding programs. Nonetheless, in many of the preceding use cases, this is adequate for an informed user, particularly if reliable marker systems are available for target genes/QTLs. Where these are available, selection is usually a simple matter of identifying which individuals are positive (or heterozygous) at the target locus/loci and negative at any recombinant or background loci. This process can even be done by simple filter options in standard spreadsheet software, though integration with germplasm management databases is clearly preferable. More advanced usage such as QTL profiling of parental lines and calculation of required population sizes as described do not yet appear to be supported in existing platforms. Proper incorporation of these analytical routines into germplasm management tools is an area that needs further development.
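The simple filtering step described above can be sketched in a few lines. The genotype encoding ("+/+", "+/-", "-/-" for the number of donor alleles), the line and marker names, and the function `select_lines` are all hypothetical; a real workflow would read calls from the genotyping platform or germplasm database in use.

```python
def select_lines(calls: dict[str, dict[str, str]],
                 target_loci: list[str],
                 recipient_loci: list[str],
                 allow_heterozygous: bool = False) -> list[str]:
    """Return lines positive at every target locus and carrying the
    recipient genotype at every recombinant/background locus."""
    accepted = {"+/+", "+/-"} if allow_heterozygous else {"+/+"}
    selected = []
    for line, genotype in calls.items():
        # Missing calls are treated conservatively: the line is not selected.
        has_targets = all(genotype.get(locus) in accepted for locus in target_loci)
        clean_flanks = all(genotype.get(locus) == "-/-" for locus in recipient_loci)
        if has_targets and clean_flanks:
            selected.append(line)
    return sorted(selected)


# Hypothetical calls: one target marker and one flanking recombinant marker.
calls = {
    "line_01": {"target": "+/+", "flank": "-/-"},
    "line_02": {"target": "+/-", "flank": "-/-"},
    "line_03": {"target": "+/+", "flank": "+/-"},
    "line_04": {"target": "-/-", "flank": "-/-"},
}

print(select_lines(calls, ["target"], ["flank"]))
print(select_lines(calls, ["target"], ["flank"], allow_heterozygous=True))
```

Setting `allow_heterozygous=True` corresponds to enrichment or backcross generations, where heterozygous carriers are retained; the default corresponds to selection among fixed lines.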