Evaluation of genomic selection and marker-assisted selection in Miscanthus and energycane

Although energycane (Saccharum spp. hybrids) is widely used as a source of lignocellulosic biomass for bioethanol, breeding this crop for disease resistance is challenging due to its narrow genetic base. Therefore, efforts are underway to introgress novel sources of genetic resistance from Miscanthus into energycane. Given that disease resistance in energycane could be either qualitative or quantitative in nature, careful examination of a wide variety of genomic-enabled breeding approaches will be crucial to the success of such an undertaking. Here we examined the efficiency of both genomic selection (GS) and marker-assisted selection (MAS) for traits simulated under different genetic architectures in F1 and BC1 populations of Miscanthus × Miscanthus and sugarcane × sugarcane crosses. We observed that the performance of MAS was comparable and sometimes superior to GS for traits simulated with four quantitative trait nucleotides (QTNs). In contrast, as the number of simulated QTN increased, all four GS models that were evaluated tended to outperform MAS, select more phenotypically optimal F1 individuals, and accurately predict simulated trait values in subsequent BC1 generations. We therefore conclude that GS is preferable to MAS for introgressing genetic sources of horizontal disease resistance from Miscanthus to energycane, while MAS remains a suitable option for introgressing vertical disease resistance.


Introduction
Dedicated perennial lignocellulosic biomass crops can contribute substantially to meeting society's energy needs while sequestering carbon, building soils, mitigating climate change, and improving livelihoods in rural communities (Saha et al. 2013). Energycane (Saccharum spp. hybrids), a type of sugarcane that has been bred for use as a bioenergy feedstock, has great potential as a source of lignocellulose and/or sugar for sustainably producing biofuels (Kandel et al. 2018). Nevertheless, the cultivation and productivity of energycane has been hampered by its narrow genetic base and disease susceptibility because most modern energycane varieties were derived from few interspecific hybrids between S. officinarum L. and S. spontaneum L. (Wang et al. 2013). To improve disease resistance in energycane, two other genera (Miscanthus and Erianthus) that share a close phylogenetic relationship with energycane have received appreciable attention (Tew and Cobill 2008). For instance, intergeneric crosses between S. officinarum 'Ludashi' and Erianthus rockii have been used to introgress brown rust resistance into energycane (Wang et al. 2013). Miscane, a hybrid of Saccharum × Miscanthus, has been noted to exhibit increased cold tolerance and disease resistance compared with ordinary Saccharum and therefore can potentially facilitate disease resistance introgression back into energycane through backcrossing (Chen and Lo 1989). However, breeding processes can be challenging in energycane. The polyploid genome of energycane complicates trait inheritance analysis (Gouy et al. 2013), and thus energycane breeding is still largely dependent upon phenotypic selection. Additionally, large trials and multiple selection cycles (7-10 years) are needed for improvement of traits such as yield and disease resistance (Gouy et al. 2013;Kandel et al. 2018). Therefore, the development and refinement of genomicenabled breeding methods specifically for energycane has potential to substantially improve the efficiency of breeding for this bioenergy crop.
Markers tagging quantitative trait loci (QTL) and other genomic signals exhibiting statistically significant associations with traits have been identified in energycane , which can potentially facilitate marker-assisted selection (MAS) in energycane breeding programs. Given the success of MAS breeding programs in other crops (Okogbenin et al. 2007;Xu and Crouch 2008;Jiang 2013;Yohannes et al. 2015), the potential of MAS to develop disease-resistant energycane crops is promising. Nevertheless, MAS-based breeding approaches are undermined by several factors. Most importantly, QTL analyses often miss small-effect loci, and thus MAS is typically based on only a few large-effect loci that may not capture all genetic variation responsible for the trait (Jannink et al. 2010;Ben-Ari and Lavi 2012). Additionally, because mapping populations employed in identifying QTLs used for MAS typically consist of a relatively small number of individuals (n < 250), these estimated QTL effects are usually inflated and potentially not reflective of their true genetic effects (Beavis 1995(Beavis , 1998Utz et al. 2000;Xu 2003). Given these setbacks, it is not surprising that MAS has been most successfully applied to traits controlled by few genomic loci (Bernardo 2001;Zhao et al. 2014). Currently, QTL in sugarcane and Miscanthus has only been able to tag relatively few large-effect genes for a given trait (Costet et al. 2012;Dong et al. 2018). Thus, the contributions of many small-effect loci underlying horizontal disease-resistance may be unaccounted for in MASbased breeding programs.
Another approach that is arguably better suited for breeding complex traits such as horizontal resistance is genomic selection (GS), which uses genome-wide marker sets to predict breeding values (Meuwissen et al. 2001). The potential of GS to accelerate the breeding cycle, increase genetic gain, and maintain genetic diversity above what can be achieved through MAS has been demonstrated (Meuwissen et al. 2001;Heslot et al. 2012;Annicchiarico et al. 2015). Previous investigations into the prospects of GS in energycane for traits related to sugar and bagasse contents, plant morphology, and disease resistance have produced promising results, with prediction accuracies of up to 0.62 being reported (Gouy et al. 2013;de Almeida Costa 2015). Moreover, GS has been successfully used for the improvement of complex traits in other crops Heffner et al. 2011;Heslot et al. 2012;Resende et al. 2012;Beyene et al. 2015;Gezan et al. 2017), including quantitative disease resistance in maize (Technow et al. 2013) and wheat (Rutkoski et al. 2012(Rutkoski et al. , 2015Mirdita et al. 2015). Therefore, GS has the potential to substantially improve qualitative and quantitative resistance in energycane, and evaluation of its ability to accelerate the development of genetically diverse, diseaseresistant cultivars is needed.
Given recent reductions in genotyping costs (Elshire et al. 2011;Poland 2015), the adoption of GS into crop breeding programs is becoming increasingly possible. Although a benefit of these cost reductions is an increased number of available genome-wide markers, several studies have shown that randomly selected subsets of genome-wide markers are capable of achieving similar prediction accuracies relative to the full marker sets (Arruda et al. 2015;Zhang et al. 2015;Spindel et al. 2015). Thus, to ensure that resources for GS-based breeding programs are allocated as effectively as possible, evaluation of the impact of the number of markers (i.e., marker density) on GS prediction accuracy is crucial.
The comparison of GS to MAS has been the subject of previous studies in various crop species, including maize (Massman et al. 2013;Owens et al. 2014;Cao et al. 2017;Cerrudo et al. 2018), wheat (Arruda et al. 2015), rye (Wang et al. 2014), and cowpea (Olatoye et al. 2019). The results of this prior work was consistent with the simulation study conducted by Bernardo and Yu (2007) in that the number of underlying causal mutations, their effect sizes, and the heritability of the studied traits have a major influence on MAS and GS performance. In general, the concensus of these studies was that GS is better suited for polygenic traits, but, in practice, MAS could be used in conjunction with GS so that both large-and small-effect loci underlying trait variability could be captured. Although similar findings on the relative performance of MAS and GS are expected to be observed in Miscanthus × energycane hybrids, to the best of our knowledge, no studies have studied these two approaches in this system.
Given the inherent challenges with introgressing disease resistance from one perennial grass species to another, it is critical that the performance of GS and MAS is evaluated in the targeted species of our Miscanthus × energycane breeding program. Failure to do so could result in an inefficient breeding program that introduces an inadequate amount of disease resistance into energycane. Thus, the purpose of this study was to compare the ability of GS and MAS to predict traits simulated under different genetic architectures and marker densities. Due to the current lack of genomic data for Miscanthus × energycane hybrids, we conducted this analysis independently in Miscanthus × Miscanthus and sugarcane × sugarcane F 1 and BC 1 populations. Because these BC 1 populations are simulated, real phenotypic disease resistance data do not exist; thus, this study exclusively used simulated phenotypic data. We hypothesized that GS would outperform MAS for traits controlled by many genomic loci and that prediction accuracy would increase with marker density.

Plant materials and genomic resources
This study evaluated the performance of GS and MAS in F 1 and subsequent simulated BC 1 populations in Miscanthus and sugarcane. The Miscanthus F 1 population was derived from a cross between M. sinensis 'Kaskade' and M. sinensis 'Strictus'. Miscanthus is self-incompatible, and, consequently, the parents and F 1 progeny are highly heterozygous. Publicly available genomic resources for this population include a set of Goldengate SNPs that were designed from variants discovered in RNA-seq data (Swaminathan et al. 2012), a set of SNPs that were discovered and called from RADseq data (Lu et al. 2013), and a composite genetic map (Liu et al. 2016). The markers used in this analysis comprised of 3044 RAD-Seq SNPs and 136 Goldengate SNPs (3180 SNPs in total) in 85 individuals. The genetic map was composed of 19 linkage groups.
The sugarcane F 1 population was composed of 173 individuals derived from a cross between sugarcane 'CP95-1039' × sugarcane 'CP88-1762' (Yang et al. 2017) and was genotyped with 21,895 single-dose SNP markers obtained using genotyping-bysequencing (Yang et al. 2017;Elshire et al. 2011;Glaubitz et al. 2014). The resulting genomic data consisted of two parental genetic maps and their associated markers. The first parental genetic map (CP95-1039) consisted of 2453 markers spanning 162 LGs with a total map length of 4224.4 cm, while the second parental genetic map (CP88-1762) consisted of 2154 markers across 142 LGs with a total map length of 4373.2 cm. A composite genetic map was constructed using the LPmerge R package (Endelman and Plomion 2014), consisting of 162 linkage groups and 4607 markers (Yang et al. 2017).

Simulations
Using marker data from each F 1 population, we developed a simulation pipeline that allowed us to simulate phenotypes for each F 1 individual, compare the performance of GS to MAS, cross select F 1 individuals with a recurrent parent, and, finally, evaluate the performance of GS and MAS in the resulting BC 1 individuals. A schematic of this simulation pipeline is provided in Fig. 1, and critical relevant aspects are highlighted in the text below. The R scripts used for this simulation pipeline are available on GitHub at: https://github.com/marcbios/Evaluation-of-Genomic-Selection-and-Marker-Assisted-Selection-in-Miscanthus-and-energycane.

Phenotypic trait simulation
The same procedure was used to simulate traits in both F 1 and subsequent BC 1 populations. First, markers were randomly selected to be the quantitative trait nucleotides (QTNs), which constitute the genomic sources of variability for each trait. To assess the situation where causal mutations are not included in marker sets, QTNs in the M. sinensis populations were randomly selected from the 136 GoldenGate SNPs, while GS and MAS were conducted using the RAD-seq markers (3044 SNPs). Because only one marker set was available in the sugarcane populations, QTNs were randomly selected from all available 4607 markers.
The specific manner in which each randomly selected set of QTNs contributed to trait variability, which we defined as the genetic architecture of the trait, varied according the number of markers selected to be QTNs, the genetic mechanism of the QTN effects (i.e. additive, dominant, and epistatic), the size of the QTN effects, and heritability. A procedure similar to the ones described in Lande and Thompson (1990), Yu et al. (2008), and Bouchet et al. (2017) was used to simulate the effect sizes within each class of genetic effects so that they each followed a geometric series. Briefly, for a given class of p additive, dominant, or epistatic QTNs, the effect size was initiated as ψ = p−1 pþ1 , and then the effect size of the i th QTN was ψ i . Thus the contribution of a set of p QTN to the phenotype was ∑ p i¼1 ψ i x ij , where x ij denotes the observed allele of the j th individual at the i th QTN, enumerated appropriately for quantifying additive, dominant, or epistatic effects. The numerical value of ∑ p i¼1 ψ i x ij constituted the baseline trait value for the j th individual. The resulting genetic variance component of the trait σ 2 G was calculated as the variance of the baseline trait. Finally, the non-genetic component of each trait was generated by simulating a normal random variable with mean 0 and variance σ 2 ε = [ where H 2 denotes the heritability. Such normal random variables were added to the baseline trait value for each individual. The specific values considered for the number of markers p selected to be QTNs, the genetic mechanism of the QTN effects, the size of the QTN effects, and heritability are summarized in Table 1. At each of the resulting 24 genetic architectures that were considered, a total of 100 replicate traits were simulated.

GS models
Genomic selection was performed using four GS models that were representative of the diversity of such models that have been explored in previous studies. These models were evaluated using the sommer (Covarrubias-Pazaran 2016), BGLR (Perez 2014), and kernlab (Karatzoglou et al. 2004) packages in the R programming language.
Additive, dominance, and epistasis models The sommer R package was used to implement a GS model that simultaneously captures additive, dominance, and epistatic genomic signals. To achieve this, sommer fits mixed linear models that solve for multiple random effects with specific variance-covariance structures (Covarrubias-Pazaran 2016). The resulting "Addi-tive+Dominance+Epistasis" (ADE) model we considered is written as follows: where Y is a vector of observed phenotypic values; X is a design matrix; β is the vector of fixed effects; u Ã MVN(0, Aσ 2 A ), u D~M VN(0, Dσ 2 D ), and u EM VN(0, E aa σ 2 E ) are vectors of individual random effects; A, D, and E are additive, dominance, and epistasis relationship matrices (VanRaden 2008;Su et al. 2012;Endelman 2013); and ε is the random error vector~MVN(0, Iσ 2 ε ). The additive relationship matrix was estimated as follows: where Z A is an incidence matrix relating the additive contribution of each marker of each individual; the dominance relationship matrix was estimated as follows: where Z D is an incidence matrix relating the dominance contribution of each marker of each individual, and the epistasis relationship matrix was estimated as follows: where p j and q j are the respective allelic frequencies for the j th marker. These matrices were specified in sommer using the A.mat, D.mat, and E.mat functions.

BayesA
We used the BGLR package in R to implement BayesA. This model takes on the following form: where Y is the vector of simulated phenotypic data; 1 is a vector of 1's; μ is the grand mean; s is the number of markers (s > n); Z m is the observed genotypes at the m th marker (coded as e.g. − 1, 0, and 1, where 1 and − 1 were the homozygous classes and 0 the heterozygous class); u m is the random effect of the m th marker; and ε is the random error vector~MVN(0, Iσ 2 ε ). The BayesA model assigns a scaled-t density prior to each of the random marker effects. All analyses conducted for this model used the default settings of 1000 burns and 2500 iterations.

RKHS
We also explored the reproducing kernel Hilbert space (RKHS) to determine if it yielded higher prediction accuracy in the presence of non-additive genomic signals. Specifically, Bayesian RKHS was analyzed in BLGR, which is described as follows: where Y is the vector of simulated phenotypic data; 1 is a vector of 1's; μ is the grand mean; u is vector of individual random effects~MVN(0, K h σ 2 u ); and ε is the random error vector~MVN(0, Iσ 2 ε ). For the Bayesian implementation of RKHS in BLGR, the priors p(μ, u, ε) are proportional to the product of the MVN(0, K h σ 2 u ) and MVN(0, Iσ 2 ε ) density functions. The matrix K h is the kernel entries matrix with a Gaussian kernel, which uses the squared Euclidean distance between marker genotypes to quantify degree of relatedness between individuals, and a smoothing parameter h that multiplies each entry in K h by a constant. Similar to the implementation of BayesA, the RKHS was conducted using the default 1000 burns, 2500 iterations, and the smoothing parameter h set to 0.5.

SVMR
Based on the support vector machine method developed by Vapnik (1995), support vector machine regression (SVMR) was initially applied to plant breeding by Maenhout et al. (2007). The main objective of SVMR is to minimize prediction error (Long et al. 2011). A detailed description of SVMR applied to GS is provided in Howard et al. (2014). In brief, SVMR is a kernel approach with a loss-function referred to as "ε-intensive" meaning it is an approach that favors models that minimizes large residuals (Lorenz et al. 2011). In this analysis, we used the normal radial function kernel (rbfdot) implemented in the ksvm function of kernlab R package (Karatzoglou et al. 2004).

Five-fold cross validation
In this study, we used a five-fold cross validation approach to assess the ability of the tested GS models to accurately predict trait values in each of the F 1 populations. This approach has been previously described (Resende et al. 2012). In brief, both genomic and phenotypic information of all the individuals were divided into five approximately equally-sized groups. A set of four groups were used as a training set to fit the GS model, while the remaining one group was used as the evaluation or test set. Prediction accuracy was quantified as the Pearson correlation between the simulated trait values and the genomic estimated breeding values (GEBVs) predicted from a given GS model evaluated in the test set. The process was repeated five times to ensure that each of the five groups was used as a test set. All four GS models were evaluated in the same folds to facilitate comparison.

Further evaluation of GS in the F 1 populations
Several other metrics besides the average Pearson's correlation coefficient across folds were used to assess the performance of four GS models in the two F 1 populations. To evaluate the impact of marker density on GS performance, all analyses were conducted using the full complement of markers, as well as on random samples of 0.25, 0.50, and 0.75 of markers. In addition, a regression model was fitted with the observed values as the response variable and the GEBVs as the explanatory variable to assess the bias of the GEBVs, as described in Daetwyler et al. (2013). Finally, we calculated the coincidence index (Hamblin and Zimmermann 2011;Fernandes et al. 2018) to evaluate the proportion of individuals that overlap between individuals with highest trait values (10%) in simulated phenotypes and predicted phenotypic trait values for the validation population.

MAS in the F 1 populations
We carried out MAS in the two F 1 populations (M. sinensis and sugarcane) by using the unified mixed linear model (MLM; Yu et al. 2006) in the genome association and prediction integrated tool (GAPIT; Lipka et al. 2012) R package to perform a genomewide association study (GWAS) within each training set generated from the five-fold cross validation approach. Here, the same folds used to evaluate the performance of GS were used. The top four GWAS signals with the strongest marker-trait associations from each training set were selected and used to obtain GEBVs of F 1 individuals in the test set, and subsequently perform MAS. The same previously described approaches used to evaluate prediction accuracy, bias, and coincidence were then implemented to evaluate the performance of MAS in these F 1 populations.

Simulation of BC 1 populations
At this stage of the simulation pipeline, we backcrossed all F 1 individuals with the top 10% highest GEBVs from each of the four GS models and MAS. Specifically, each of these selected F 1 individuals from the M. sinensis 'Kaskade' × M. sinensis 'Strictus' F 1 population was backcrossed to M. sinensis 'Kaskade' to create M. sinensis 'Kaskade'-BC 1 individuals. Likewise, each of the selected F 1 individuals from the sugarcane 'CP95-1039' × sugarcane 'CP88-1762' F 1 population was backcrossed to sugarcane 'CP95-1039' to create sugarcane 'CP95-1039'-BC 1 individuals. Simulations were based on the assumption of normal pairing of homologous chromosomes (normal meiotic segregation) in M. sinensis and sugarcane since the genomic data were made up of mainly single dose markers (Mollinari and Garcia 2018). To simulate new genomes for each BC 1 individual, we first used a custom Haldane's mapping function in R to simulate crossover events along the genomes of the selected F 1 individuals and recurrent parent. Next, we subdivided each chromosome of each selected F 1 individual and the recurrent parent into two gametes. One gamete was then randomly selected from each parent (i.e., a selected F 1 individual and the recurrent parent) to form the genome of the corresponding BC 1 progeny. Thus, the genome of this BC 1 individual had genomic information contributed by both the recurrent parent and the selected F 1 individual.
Using this approach, we ended up with four GS derived BC 1 populations (i.e. one BC 1 population for each GS model) and one MAS derived BC 1 population for each genetic architecture simulated in each species.

Phenotypic evaluation of GS and MAS in the BC 1 population
In each BC 1 population, we simulated traits with the exact same genetic architecture (i.e., with the same markers selected to be QTNs) used for simulating the F 1 populations (summarized in Table 1). To assess the capability of an entire F 1 population to predict breeding values in each GS-derived BC 1 population, the corresponding GS model was fitted to one of the 100 replicate traits simulated at each genetic architecture in the F 1 population. Similarly, a GWAS for the same replicate traits was conducted in the F 1 population using the unified MLM (Yu et al. 2006). We then conducted MAS in the MAS-derived BC 1 population using the top four peak-associated GWAS markers. Prediction accuracy was estimated as the Pearson correlation between the trait values simulated in the BC 1 individuals and the corresponding GEBVs.

Results
In general, consistent results were obtained between high and low heritability traits, although the low heritability traits yielded lower prediction accuracies. Therefore, for brevity, the results for high heritability traits are presented below unless otherwise noted.
Effect of marker densities on GS in F 1 populations (M. sinensis and sugarcane) To investigate the effect of marker density on genomic prediction, different proportions of genome-wide markers were used for prediction in both M. sinensis and sugarcane F 1 populations. In M. sinensis, the only genetic architectures where a discernable impact of marker density was observed on prediction accuracies of several GS models was for traits with either 100 additive or 100 epistatic QTN (Fig. 2). For these genetic architectures, the prediction accuracies of all models except for SVR tended to increase as the marker density increased. In contrast, there were no differences in prediction accuracies across marker densities for all traits simulated with four QTNs under high heritability in the additive mechanism (Fig. 2a). For traits simulated with four epistatic QTNs, the impact of marker densities on prediction accuracies varied among GS models (Fig.  2b). For instance, in the ADE, RKHS, and SVR models, prediction accuracies increased with greater marker density but decreased for the Bayes A GS model (Fig. 2b). In both the additive and epistasis mechanisms, there were no differences in prediction accuracies among marker densities for traits simulated with 4 and 100 QTNs under low heritability (Online Resource 1). In sugarcane, a discernable pattern was only observed in traits simulated with four additive QTN under high heritability using the ADE, RKHS, and SVR GS models. Here prediction accuracy increased as marker density increased from 0.25 to 0.75 of the genetic markers, while prediction accuracy decreased to approximately the same accuracy that was observed with 0.50 of the genetic markers when all genetic markers were considered (Online Resource 1).

Comparison of GS and MAS models in the M. sinensis F 1 population
For each cross-validation fold across the 100 replicates of each trait simulated in the M. sinensis F 1 population, we compared the prediction accuracy between the four GS models and MAS. In general, MAS yielded equivalent or higher prediction accuracies compared with the GS models for the smallest number of simulated QTN, but, as the number QTN underlying the simulated traits increased, the prediction accuracy of MAS decreased relative to the GS models ( Fig. 3a and Online Resource 2: Tables S1 and S2 showing the results for the corresponding traits simulated with additive, dominance, and epistatic QTNs). We next evaluated the bias and coincidence index of each GS and MAS model. For all high heritability traits, ADE, BayesA, and MAS had the least bias. Moreover, we observed that prediction accuracies were less biased under high heritability compared to low heritability (Online Resource  2: Tables S3, S4, S5, and S6). Interestingly, the ADE GS models infrequently gave extremely biased GEBVs, especially for low-heritability traits (Online Resource 2: Tables S4 and S6). The analysis of coincidence indices indicated that the ability of the GS and MAS models to identify the individuals with optimal simulated trait values (i.e., observed trait values generated from the simulations) depended mostly upon the number of simulated QTN (Online Resource 2: Tables S7 and S8). However, for high heritability genetic architectures the ADE and BayesA GS models tended to consistently have higher coincidence indices, indicating that the individuals with the top 10% GEBVs from these two approaches were the most consistent with individuals in the top 10% of simulated trait values. Additionally, we observed that for traits with 100 QTN, MAS had noticeably lower coincidence indices than the GS models.

Comparison between GS and MAS in the sugarcane F 1 population
In the sugarcane F 1 population, prediction accuracies across 100 replicates were compared between the tested GS and MAS approaches. Similar to the results in the M. sinensis F 1 population, the performance of MAS tended to get worse as the number of simulated QTN increased (Fig. 3b, Online Resource 2: Tables S9 and S10). Unlike the results from the similar analysis conducted in the M. sinensis F 1 population, BayesA consistently yielded among the highest prediction accuracies compared to other GS models and MAS, even for the genetic architectures with four QTN. Moreover, the phenomenon of the ADE GS model infrequently producing extremely biased GEBVs in the M. sinensis F 1 population (Online Resource 2: Table S3-S6) was notably absent in the sugarcane F 1 population (Online Resource 2: Table S11-S14). The analysis of coincidence indices showed that MAS tended to select individuals that were the least consistent with those that had the top 10% simulated trait values, while BayesA tended to select individuals that were the most consistent (Online Resource 2: Table S15-16).

Comparison between MAS and GS models in MAS and GS derived BC 1 populations
Within each evaluated species, five backcross populations (ADE BC 1 , BayesA BC 1 , RKHS BC 1 , SVR BC 1 and MAS BC 1 ) were generated by crossing genotypes selected using each GS model and MAS in the F 1 to the recurrent parent. In general, these results suggest that GEBVs obtained from either training a GS model or identifying peak-associated markers in an F 1 population can accurately predict values of traits controlled by additive, epistatic, and a small number of dominance QTN in BC 1 progeny (Fig. 4). However, the results also suggest that such an approach cannot accurately predict trait values of BC 1 individuals when the underlying genetic architecture consists of a large number of dominance QTN (Fig. 4). For both species, no discernable pattern was observed among the performance of GS models in the BC 1 populations. Interestingly, contrasting results on the performance of MAS were obtained between the two species. In the M. sinensis BC 1 populations, MAS typically yielded among the lowest prediction accuracies across the tested approaches (Fig. 4a, Online Resource 2: Table S17-S18). In contrast the results in the sugarcane BC 1 populations suggest that MAS may be well-suited for genetic architectures consisting of a relatively small number of additive QTNs (Fig. 4b, Online Resource 2: Table S19-S20).

Discussion
To ascertain the potential of GS and MAS to facilitate introgression of disease resistance into energycane, we used marker data from F 1 populations in M. sinensis and sugarcane to simulate traits under a wide variety of genetic architectures. Although there were no GS models that consistently outperformed others, we observed that GS tended to yield higher prediction accuracies, select more individuals with the peak-performing simulated trait values, and more accurately predict trait values of BC 1 progeny relative to MAS. Coupled with the finding that at best the prediction accuracy of MAS declined as the number of simulated QTNs underlying a trait increased, our results suggest that GS is better suited for introgressing genetic sources of horizontal disease resistance into energycane. That is, our study demonstrates that when many loci underlie the genetic sources of variation, accounting for both large and small effect loci are critical for maximizing prediction accuracy and selecting optimal F 1 individuals to backcross. However, when disease resistance is controlled by only a small number of genes, our results suggest that MAS performs adequately.

Impact of data characteristics on GS and MAS performance
One important characteristic that sets the data we analyzed apart from those assessed in prior work in other species is the small number of individuals (n = 85 in the M. sinensis F 1 population and n = 173 in the sugarcane F 1 population). Given that these sizes are substantially smaller than data sets evaluated in other studies (Spindel et al. 2015;Battenfield et al. 2016;Fernandes et al. 2018), the data we analyzed provided a unique opportunity to explore the effectiveness of GS and MAS under such constraints. Such constraints are likely to be more realistic of the introgression of genes from wide crosses, where interspecific incompatibilities can result in low numbers of early generation progeny. In this regard, the ability to demonstrate the superior performance of GS over MAS was illuminating. Undoubtedly, the feasibility of obtaining high-throughput genotype and phenotype data afforded by advances in nextgeneration sequencing, unmanned aerial systems, and other high-throughput phenotyping platforms (Huang et al. 2009;Rife et al. 2011;Elshire et al. 2011;Haghighattalab et al. 2016;Wang et al. 2018) will result in larger quantities of data becoming available in energycane and M. sinensis, and our expectations would be that GS will show the same advantages over MAS that have been demonstrated in this work and in other species. Nevertheless, these results aid current breeding efforts because we show that GS appears to be superior to MAS, even when small data sets are used. Given that two marker sets were available in the M. sinensis F 1 population, we were able to simulate QTN using one marker set and evaluate the performance of GS and MAS in the other. This is beneficial because it likely reflects the reality that many causal mutations underlying traits are not included in marker data sets, and these markers are typically not in complete linkage disequilibrium with them (Korte and Farlow 2013). Coupled with the markers in the M. sinensis population used to evaluate GS and MAS being generated from RAD-seq technology, this assessment is directly applicable to our anticipated follow-up studies where we will strictly use sequence-based markers.
One drawback of this work is that only simulated phenotypic data were used. No actual disease resistence traits were considered because no such data exist for all of the popluations we evaulated. That is, the BC 1 populations we evaluated were simulated; no actual BC 1 populations generated in this manner exist in real life. Thus, it is currently impossible to confirm the results of our simulation study using real disease resistance traits. This lack of data was one of the major factors underlying our decision to simulate traits with a wide variety of contrasting genetic architectures, thereby providing guidance in the absence of empirical results from these populations and new hypotheses to test in future experiments.

Performance of GS versus MAS
Our results demonstrated that the prediction accuracy of MAS tended to decline as the number of simulated QTN increased. This finding is similar to the simulation conducted in Bernardo and Yu (2007), which showed that the response to GS was higher than MAS for traits controlled by 20-100 loci. Morover, previous studies using real phenotypic crop data agree with these results: GS outperformed MAS for horizontal disease resistance in wheat (Arruda et al. 2015) and polygenic nutrient quanlity traits in rye (Wang et al. 2014), while Owens et al. (2014) showed that GS and MAS could perform equally well for the relatively simpler carotenoid traits in maize. Combined with our finding that the GS models tended to give higher conincidence indices than the MAS models, we conclude that, while MAS may perform adequately for certain genetic archicteutres, the tested GS models should consistently perform optimally across a wider range of traits. Nonetheless, for breeding programs focused specifically on traits controlled by very few, well-definied loci with large effect sizes (such as some disease resistance traits), the lower cost and technical overhead of MAS, combined with similar performance, is likely to remain the preferred option. This recommendation is consistent with the findings of Olatoye et al. (2019), which used real cowpea phenotypic data to show that GS offered no advantages in prediction accuracy over MAS for traits where the strongest associated marker explained between 10 and 21% of the total phenotypic variation.
Across all of the simulation settings that were explored, none of the GS models decisively outperformed the others. That is, for a given genetic architecture, the prediction accuracies, bias, and coincidence indices of the four studied GS models were generally similar to each other. In general, this result is consistent with previous studies conducted in other species (e.g., Heslot et al. 2012). However, we noticed several genetic architectures where certain GS models were either the best-or worst-performing models. For example, our finding that the ADE model occasionally gave extremely biased GEBVs in the M. sinensis F 1 population for low-heritable simulated traits may suggest that this model should not be used in practice for such traits. On the other hand, our results show that BayesA often yielded the highest prediction accuracies (especially in the sugarcane F 1 population), the highest coincidence indices, and the lowest biases. Therefore, our recommendation is to run several GS models and use the collective results to identify individuals in which to backcross. Given that all four of the GS models we explored can be fitted to data and tested using publicly available software, undertaking such a multifaceted analysis should be feasible.
To summarize, the main findings from these simulation studies on the performance of GS and MAS in M. sinensis and sugarcane are consistent with similar work conducted in other species. Within the context of our specific M.sinensis × energycane breeding program, our simulation studies help underscore that either GS, MAS, or both should be successful in expediting the introgression of disease resistance from M. sinensis into energycane. Because we simulated traits with various sizes of additive, dominance, and two-way epistatic QTN across two different heritabilities, we expect these studies to foreshadow the performance of MAS and GS in our breeding material, particularly with respect to which genetic architctures they are expected to perform optimally. In these regards, the findings from the simulation studies presented here are crucial for the future success of this M.sinensis × energycane breeding program.

Conclusions and recommendations
Marker-assisted breeding efforts for introgressing disease resistance into energycane need to use statistical models that yield the highest prediction accuracies for both horizontal and vertical disease resistance. In this regard, we used genomic marker data currently available in both sugarcane and M. sinensis to demonstrate that the performance of GS is superior to MAS in the great majority of simulated genetic architectures. That is, while MAS performed reasonably well for traits controlled by a smaller number of QTN, the prediction accuracy of MAS dropped dramatically compared with the four tested GS models as the number of underlying QTN increased. We therefore conclude that MAS is a reasonable option for advancing vertical disease resistance. In contrast, we recommend that GS be used at the forefront of breeding efforts for horizontal disease resistance in energycane. As more attention and resources for genotyping, phenotyping, and sampling more individuals gets directed towards energycane and related species like M. sinensis, we hypothesize that the increase in number of markers and precision in quantifying traits will reveal novel genomic regions with intricate contributions to trait variability. Thus, we would anticipate that observed differences between the performance of GS models and MAS to increase, as marker and phenotypic data for M.sinensis × energycane populations becomes more affordable and extensive. Acknowledgments This study is made possible by the support of DOE Office of Science, Office of Biological and Environmental Research (BER), grant no. DE-SC0016264. The contents of this manuscript are the sole responsibility of the authors. We acknowledge the efforts of Dr. Matt Hudson, Dr. Samuel Fernandes, and Brian Rice for reading through the manuscript and providing critical feedback.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.