Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials

  • Original Article
Theoretical and Applied Genetics

Abstract

Key message

A method based on a multi-task Gaussian process using self-measuring similarity improved the accuracy of imputing missing phenotypic data in multi-trait and multi-environment trials.

Abstract

Multi-environment trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to interpret. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally conducted for multiple traits that are correlated with each other. Correlation among traits is useful information for imputation, but single-trait methods cannot exploit the information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels that reflect relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data with marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for generating missing data were considered at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the ‘uniform’ scenario, in which missing data arise randomly, including marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the ‘fiber’ scenario, in which missing data arise in all traits for some combinations of genotypes and environments, including marker genotype data decreased the imputation accuracy for most traits while remarkably increasing the accuracy for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
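The two missing-data scenarios and the general idea of kernel-based imputation can be illustrated with a minimal sketch. It is not the authors' implementation: the self-measuring similarity kernels are not reproduced here, and the genotype, environment, and trait kernels below are hypothetical placeholders with fixed values rather than estimated ones. The array sizes, the 30% missing rate, the noise variance, and the function names `uniform_mask`, `fiber_mask`, and `gp_impute` are all assumptions made for illustration, and the data are simulated.

```python
import numpy as np

def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: every genotype-environment-trait cell is missing independently."""
    return rng.random(shape) < rate

def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: a missing genotype-environment combination loses all of its traits."""
    n_geno, n_env, n_trait = shape
    cells = rng.random((n_geno, n_env)) < rate
    return np.repeat(cells[:, :, None], n_trait, axis=2)

def gp_impute(y, missing, K, noise=0.1):
    """Fill missing cells with the GP posterior mean given the observed cells.

    K is a covariance over all cells of y flattened in C order; the noise
    variance is fixed here (an assumption), not estimated from the data.
    """
    obs = ~missing.ravel()
    y_flat = y.ravel()
    mean = y_flat[obs].mean()
    K_oo = K[np.ix_(obs, obs)] + noise * np.eye(obs.sum())
    alpha = np.linalg.solve(K_oo, y_flat[obs] - mean)
    filled = y_flat.copy()
    filled[~obs] = mean + K[np.ix_(~obs, obs)] @ alpha
    return filled.reshape(y.shape)

rng = np.random.default_rng(0)
shape = (8, 3, 4)  # hypothetical: 8 genotypes, 3 environments, 4 traits
n_geno, n_env, n_trait = shape

# Placeholder kernels (assumptions): a marker-based genotype kernel and
# simple compound-symmetry kernels for environments and traits.
markers = rng.choice([-1.0, 1.0], size=(n_geno, 20))
K_geno = markers @ markers.T / markers.shape[1]
K_env = 0.5 * np.eye(n_env) + 0.5
K_trait = 0.6 * np.eye(n_trait) + 0.4

# The Kronecker order matches the C-order flattening of a (genotype, environment, trait) array.
K = np.kron(K_geno, np.kron(K_env, K_trait))

# Simulated phenotypes drawn from the same covariance plus noise.
n_cells = K.shape[0]
y = rng.multivariate_normal(np.zeros(n_cells), K + 0.1 * np.eye(n_cells)).reshape(shape)

masks = {
    "uniform": uniform_mask(shape, 0.3, rng),
    "fiber": fiber_mask(shape, 0.3, rng),
}
imputed = {}
for name, mask in masks.items():
    imputed[name] = gp_impute(y, mask, K)
    rmse = np.sqrt(np.mean((imputed[name][mask] - y[mask]) ** 2))
    print(f"{name}: {mask.mean():.0%} missing, RMSE on masked cells = {rmse:.3f}")
```

Under the fiber mask, a masked genotype-environment combination has no observed trait of its own, so its imputed values can only borrow information from other genotypes and environments. Under the uniform mask, the remaining traits of the same combination are usually still observed and contribute through the trait kernel.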

Acknowledgments

This work is supported by JSPS KAKENHI Grant Number 25252002 and by the Ministry of Foreign Affairs, Japan.

Author information

Corresponding author

Correspondence to Hiroyoshi Iwata.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. The experiments comply with the current laws of the countries in which they were performed.

Additional information

Communicated by J. Crossa.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 27935 kb)

Cite this article

Hori, T., Montcho, D., Agbangla, C. et al. Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials. Theor Appl Genet 129, 2101–2115 (2016). https://doi.org/10.1007/s00122-016-2760-9
