Prediction of maize single-cross hybrid performance: support vector machine regression versus best linear prediction

  • Original Paper
  • Published in Theoretical and Applied Genetics

Abstract

Accurate prediction of the phenotypic performance of a hybrid plant based on the molecular fingerprints of its parents should lead to a more cost-effective breeding programme, as it reduces the number of expensive field evaluations required. The construction of a reliable prediction model requires a representative sample of hybrids for which both molecular and phenotypic information are accessible. This phenotypic information is usually readily available, as typical breeding programmes test numerous new hybrids in multi-location field trials on a yearly basis. Earlier studies indicated that a linear mixed model analysis of these typically unbalanced phenotypic data makes it possible to construct ɛ-insensitive support vector machine regression and best linear prediction models for predicting the performance of single-cross maize hybrids. We compare these prediction methods using different subsets of the phenotypic and marker data of a commercial maize breeding programme and evaluate the resulting prediction accuracies by means of a specifically designed field experiment. This balanced field trial allows us to assess the reliability of the cross-validation prediction accuracies reported here and in earlier studies. The limits of the predictive capabilities of both prediction methods are further examined by reducing the number of training hybrids and the size of the molecular fingerprints. The results indicate a considerable discrepancy between prediction accuracies obtained by cross-validation procedures and those obtained by correlating the predictions with the results of a validation field trial. The prediction accuracy of best linear prediction was less sensitive to a reduction in the number of training examples than that of support vector machine regression. The latter was, however, better at predicting hybrid performance when the size of the molecular fingerprints was reduced, especially if the initial set of markers had a low information content.

References

  • Bernardo R (1993) Estimation of coefficient of coancestry using molecular markers in maize. Theor Appl Genet 85:1055–1062

  • Bernardo R (1994) Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci 34:20–25

  • Bernardo R (1995) Genetic models for predicting maize single-cross performance in unbalanced yield trial data. Crop Sci 35:141–147

  • Bernardo R (1996a) Best linear unbiased prediction of the performance of crosses between untested maize inbreds. Crop Sci 36:50–56

  • Bernardo R (1996b) Best linear unbiased prediction of maize single-cross performance. Crop Sci 36:872–876

  • Bernardo R (2008) Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci 48:1649–1664

  • Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2:121–167

  • Charcosset A, Bonnisseau B, Touchebeuf O, Burstin J, Dubreuil P, Barrière Y, Gallais A, Denis JB (1998) Prediction of maize hybrid silage performance using marker data: comparison of several models for specific combining ability. Crop Sci 38:38–44

  • Cullis B, Gogel B, Verbyla A, Thompson R (1998) Spatial analysis of multi-environment early generation trials. Biometrics 54:1–18

  • Frisch M, Thiemann A, Fu J, Schrag TA, Scholten S, Melchinger AE (2009) Transcriptome-based distance measures for grouping of germplasm and prediction of hybrid performance in maize. Theor Appl Genet (in press)

  • Gilmour AR, Cullis BR, Verbyla AP (1997) Accounting for natural and extraneous variation in the analysis of field experiments. J Agric Biol Environ Stat 2:269–293

  • Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J Global Optim 13:455–492

  • Laloë D (1993) Precision and information in linear models of genetic evaluation. Genet Sel Evol 25:557–576

  • Maenhout S, De Baets B, Haesaert G, Van Bockstaele E (2007) Support vector machine regression for the prediction of maize hybrid performance. Theor Appl Genet 115:1003–1013

  • Maenhout S, De Baets B, Haesaert G, Van Bockstaele E (2008) Marker-based screening of maize inbred lines using support vector machine regression. Euphytica 161:123–131

  • Maenhout S, De Baets B, Haesaert G (2009) Marker-based estimation of the coefficient of coancestry in hybrid breeding programmes. Theor Appl Genet 118:1181–1192

  • Oakey H, Verbyla AP, Cullis BR, Wei X, Pitchford WS (2007) Joint modeling of additive and non-additive (genetic line) effects in multi-environment trials. Theor Appl Genet 114:1319–1332

  • Schrag TA, Maurer HP, Melchinger AE, Piepho HP, Peleman J, Frisch M (2007) Prediction of single-cross hybrid performance in maize using haplotype blocks associated with QTL for grain yield. Theor Appl Genet 114:1345–1355

  • Schrag TA, Möhring J, Maurer HP, Dhillon BS, Melchinger AE, Piepho HP, Sorensen AP, Frisch M (2009) Molecular marker-based prediction of hybrid performance in maize using unbalanced data from multiple experiments with factorial crosses. Theor Appl Genet 118:741–751

  • Schrag TA, Möhring J, Kusterer B, Dhillon BS, Melchinger AE, Piepho HP, Frisch M (2009) Hybrid performance prediction in maize using molecular markers and joint analyses of hybrids and parental inbreds. Theor Appl Genet (in press)

  • Smith A, Cullis B, Thompson R (2001) Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics 57:1138–1147

  • Smola A, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14:199–222

  • Stuber C, Cockerham C (1966) Gene effects and variances in hybrid populations. Genetics 54:1279–1286

  • Vapnik V (1995) The nature of statistical learning theory. Springer, New York

  • Welham SJ, Cullis BR, Gogel BJ, Gilmour AR, Thompson R (2004) Prediction in linear mixed models. Aust NZ J Stat 46:325–347

Acknowledgments

The authors would like to thank the people from RAGT R2n for their unreserved and open-minded scientific contribution to this research. We also gratefully acknowledge the helpful comments and suggestions of two anonymous referees.

Author information

Corresponding author

Correspondence to Steven Maenhout.

Additional information

Communicated by M. Cooper.

Contribution to the special issue “Heterosis in Plants”.

Appendix: Variance structure of the linear mixed models fitted to the phenotypic data of the validation field trial

The four random vectors \(\boldsymbol{c}\), \(\boldsymbol{a}_{\mathrm{s}}\), \(\boldsymbol{a}_{\mathrm{o}}\) and \(\boldsymbol{d}\) of Eq. 4 are assumed to be mutually independent. Furthermore, for each of these vectors \(\boldsymbol{h} \in \{\boldsymbol{c}, \boldsymbol{a}_{\mathrm{s}}, \boldsymbol{a}_{\mathrm{o}}, \boldsymbol{d}\}\) we assume that the variance has the separable form

$$ \operatorname{Var}(\boldsymbol{h}) = \boldsymbol{g}_e \otimes \boldsymbol{g}_{\mathrm{v}}, $$
(7)

where \(\otimes\) denotes the Kronecker product. \(\boldsymbol{g}_e\) represents a 3 × 3 symmetric matrix containing the covariances between environments, while \(\boldsymbol{g}_{\mathrm{v}}\) represents the covariance between the specified genetic components of the validation trial entries. We start by fitting a completely unstructured variance matrix for \(\boldsymbol{g}_e\) while assuming an identity matrix for \(\boldsymbol{g}_{\mathrm{v}}\). In subsequent steps, the number of REML-estimated variance components is reduced by fitting more parsimonious variance models for \(\boldsymbol{g}_e\). Nested models are compared by means of restricted maximum likelihood ratio tests, and non-nested models by means of Akaike’s information criterion (AIC). We first attempt to fit a first-order factor analytic variance model such that \(\boldsymbol{g}_e = \boldsymbol{\lambda}\boldsymbol{\lambda}^{\prime} + \boldsymbol{\Psi}\), where \(\boldsymbol{\lambda}\) is a vector of factor loadings and \(\boldsymbol{\Psi}\) is a diagonal matrix containing three location-specific variances (Smith et al. 2001). To obtain a more parsimonious model, the specific variances are sometimes made equal or set to zero (giving perfect correlation), and/or the loadings are made equal (giving a common covariance; Cullis et al. 1998). In a subsequent reduction, the variances on the diagonal are set equal, which results in a compound symmetry model. The simplest model for \(\boldsymbol{g}_e\) assumes zero covariance and equal variances.
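The sequence of nested covariance models for the environments can be illustrated with a minimal NumPy sketch. The function name `fa1_covariance` and all numerical values are hypothetical and chosen for illustration only; in the analysis described here, the loadings and specific variances are REML estimates rather than fixed inputs.

```python
import numpy as np

def fa1_covariance(loadings, specific_variances):
    """First-order factor analytic covariance: lambda lambda' + Psi."""
    lam = np.asarray(loadings, dtype=float).reshape(-1, 1)
    psi = np.diag(np.asarray(specific_variances, dtype=float))
    return lam @ lam.T + psi

# Hypothetical loadings and specific variances for three environments.
g_e_fa1 = fa1_covariance([1.0, 0.8, 1.2], [0.30, 0.10, 0.20])

# Specific variances set to zero: all between-environment correlations equal 1.
g_e_rank1 = fa1_covariance([1.0, 0.8, 1.2], [0.0, 0.0, 0.0])
sd = np.sqrt(np.diag(g_e_rank1))
corr_rank1 = g_e_rank1 / np.outer(sd, sd)

# Equal loadings and equal specific variances: compound symmetry
# (one common variance on the diagonal, one common covariance elsewhere).
g_e_cs = fa1_covariance([0.9, 0.9, 0.9], [0.25, 0.25, 0.25])

# Zero loadings and equal specific variances: zero covariance, equal variances.
g_e_simple = fa1_covariance([0.0, 0.0, 0.0], [0.25, 0.25, 0.25])
```

Each reduced model is a constrained version of the factor analytic form, which is why such reductions can be assessed with likelihood ratio tests for nested models.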

Once the most parsimonious model for \(\boldsymbol{g}_e\) is determined, we try different formulations for \(\boldsymbol{g}_{\mathrm{v}}\). We fit an identity matrix for the variance model of the six check varieties in vector \(\boldsymbol{c}\), as no molecular marker or pedigree information is available for these varieties. For the vectors \(\boldsymbol{a}_{\mathrm{s}}\) and \(\boldsymbol{a}_{\mathrm{o}}\), containing the general combining ability (GCA) effects of the inbred lines, we try to fit either one of the coefficient of coancestry-derived matrices \(\boldsymbol{A}\) described by Maenhout et al. (2009) or an identity matrix. In a similar way, we compare the different coefficient of fraternity-based matrices \(\boldsymbol{D}\) for the variance matrix \(\boldsymbol{g}_{\mathrm{v}}\) pertaining to the vector \(\boldsymbol{d}\). Sometimes, the most parsimonious model is obtained by not using the separable form of Eq. 7 but by directly fitting a common GCA or specific combining ability (SCA) effect for all three locations.
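The effect of choosing a relationship matrix instead of an identity matrix in the separable form of Eq. 7 can be sketched as follows. The matrices below are hypothetical illustrations, not estimates from the validation trial, and the helper names are assumptions; effects are assumed to be ordered by environment first and by inbred line within environment.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix between environments.
g_env = np.array([[1.00, 0.60, 0.50],
                  [0.60, 1.20, 0.70],
                  [0.50, 0.70, 0.90]])

# Hypothetical coancestry-derived matrix for four inbred lines.
coancestry = np.array([[1.00, 0.25, 0.00, 0.00],
                       [0.25, 1.00, 0.00, 0.00],
                       [0.00, 0.00, 1.00, 0.50],
                       [0.00, 0.00, 0.50, 1.00]])

def separable_variance(g_environment, g_variety):
    """Separable variance: Kronecker product of the two covariance matrices."""
    return np.kron(g_environment, g_variety)

def covariance(var, env_i, line_i, env_j, line_j, n_lines):
    """Covariance between the effect of line_i in env_i and line_j in env_j."""
    return var[env_i * n_lines + line_i, env_j * n_lines + line_j]

var_related = separable_variance(g_env, coancestry)
var_unrelated = separable_variance(g_env, np.eye(4))

# Related lines 0 and 1 observed in environments 0 and 1: 0.60 * 0.25.
cov_related = covariance(var_related, 0, 0, 1, 1, n_lines=4)
# With an identity matrix, different lines are uncorrelated.
cov_unrelated = covariance(var_unrelated, 0, 0, 1, 1, n_lines=4)
```

Under the identity matrix, information is shared only between records of the same line across environments, whereas a coancestry-derived matrix also lets related lines inform each other.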

The variance of each of the residual vectors \(\boldsymbol{e}_i\) that make up vector \(\boldsymbol{e}\) in Eq. 3 is modeled as a separable process in the direction of rows and columns, so we can write \(\operatorname{Var}(\boldsymbol{e}_i) = \boldsymbol{\Sigma}_{ic} \otimes \boldsymbol{\Sigma}_{ir}\). The matrices \(\boldsymbol{\Sigma}_{ic}\) and \(\boldsymbol{\Sigma}_{ir}\) are either identity matrices or contain first-order autoregressive correlations to account for spatial variation, as described in Gilmour et al. (1997), Smith et al. (2001) and Oakey et al. (2007). Table 4 gives an overview of the final model for the variance structure of vectors \(\boldsymbol{g}\) and \(\boldsymbol{e}\) for each trait.
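A separable first-order autoregressive residual structure of this kind can be sketched in a few lines of NumPy. The field dimensions, the autocorrelation parameters and the explicit scalar residual variance are assumptions made for illustration; plots are assumed to be ordered by column first and by row within column.

```python
import numpy as np

def ar1_correlation(n, rho):
    """First-order autoregressive correlation matrix: entry (i, j) = rho^|i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical field layout and autocorrelation parameters for one location.
n_cols, n_rows = 3, 4
sigma_c = ar1_correlation(n_cols, 0.4)   # correlation between columns
sigma_r = ar1_correlation(n_rows, 0.6)   # correlation between rows

# Assumed scalar residual variance scaling the separable correlation structure.
residual_variance = 1.5
var_e = residual_variance * np.kron(sigma_c, sigma_r)

# Setting rho to zero recovers the identity matrix, i.e. independent residuals.
independent = ar1_correlation(n_rows, 0.0)
```

For example, the residuals of the plots at (column 0, row 0) and (column 1, row 1) have covariance 1.5 × 0.4 × 0.6 in this sketch, so correlation decays with distance in both field directions.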

Table 4 Summary of the variance structures fitted on the measurements of the validation data set for the traits grain yield, grain moisture content and days until flowering

About this article

Cite this article

Maenhout, S., De Baets, B. & Haesaert, G. Prediction of maize single-cross hybrid performance: support vector machine regression versus best linear prediction. Theor Appl Genet 120, 415–427 (2010). https://doi.org/10.1007/s00122-009-1200-5

  • DOI: https://doi.org/10.1007/s00122-009-1200-5
