Abstract
Key message
A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials.
Abstract
Multi-environment trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to interpret. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated with each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot utilize the information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels reflecting relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data and marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data were considered, each at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the ‘uniform’ scenario, in which missing data arise randomly, inclusion of marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the ‘fiber’ scenario, in which missing data arise in all traits for some combinations of genotypes and environments, the inclusion of marker genotype data decreased the imputation accuracy for most traits while remarkably increasing the accuracy for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
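The two missing-data scenarios can be illustrated with a minimal sketch. It assumes the MET data are stored as a genotype × environment × trait array; the array shape, the missing rate, and the function names `uniform_mask` and `fiber_mask` are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: individual cells are set missing at random."""
    n_total = int(np.prod(shape))
    n_missing = int(round(rate * n_total))
    flat = np.zeros(n_total, dtype=bool)
    flat[rng.choice(n_total, size=n_missing, replace=False)] = True
    return flat.reshape(shape)

def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: all traits are missing for randomly chosen
    genotype-environment combinations."""
    n_geno, n_env, n_trait = shape
    n_pairs = n_geno * n_env
    n_missing = int(round(rate * n_pairs))
    pairs = np.zeros(n_pairs, dtype=bool)
    pairs[rng.choice(n_pairs, size=n_missing, replace=False)] = True
    # Repeat each genotype-environment flag along the trait axis.
    return np.repeat(pairs.reshape(n_geno, n_env)[:, :, None], n_trait, axis=2)

# Hypothetical toy data: 20 genotypes, 5 environments, 4 traits.
rng = np.random.default_rng(0)
shape = (20, 5, 4)
data = rng.normal(size=shape)

m_uniform = uniform_mask(shape, 0.3, rng)
m_fiber = fiber_mask(shape, 0.3, rng)

# Masked copies in which missing cells are NaN, ready for an imputation method.
data_uniform = np.where(m_uniform, np.nan, data)
data_fiber = np.where(m_fiber, np.nan, data)
```

Both masks remove the same fraction of cells, but under the fiber mask a masked genotype-environment combination has no observed trait left, so an imputation method cannot borrow information from other traits measured in the same plot.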
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number 25252002 and by the Ministry of Foreign Affairs, Japan.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest. The experiments comply with the current laws of the countries in which they were performed.
Additional information
Communicated by J. Crossa.
About this article
Cite this article
Hori, T., Montcho, D., Agbangla, C. et al. Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials. Theor Appl Genet 129, 2101–2115 (2016). https://doi.org/10.1007/s00122-016-2760-9
DOI: https://doi.org/10.1007/s00122-016-2760-9