Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials

Hori, Tomoaki; Montcho, David; Agbangla, Clement; Ebana, Kaworu; Futakuchi, Koichi; Iwata, Hiroyoshi

doi:10.1007/s00122-016-2760-9

Original Article
Published: 19 August 2016

Volume 129, pages 2101–2115, (2016)

Theoretical and Applied Genetics

Tomoaki Hori¹,
David Montcho²,
Clement Agbangla³,
Kaworu Ebana⁴,
Koichi Futakuchi² &
Hiroyoshi Iwata ORCID: orcid.org/0000-0002-6747-7036¹

Abstract

Key message

A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials.

Abstract

Multi-environment trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to interpret. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated with each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot use the information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels that reflect relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data and marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data were considered, each at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the ‘uniform’ scenario, in which missing data arise randomly, including marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the ‘fiber’ scenario, in which missing data arise in all traits for some combinations of genotypes and environments, including marker genotype data decreased the imputation accuracy for most traits while markedly increasing the accuracy for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
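The two missing-data scenarios described in the abstract can be illustrated with a small sketch. This is not the authors' code: the genotype × environment × trait array layout, the array dimensions, the missing rate, and the function names `uniform_mask`, `fiber_mask`, and `apply_mask` are all assumptions made here for illustration, and the data are simulated rather than taken from the rice trials.

```python
import numpy as np


def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: each cell is marked missing independently at random."""
    return rng.random(shape) < rate


def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: for randomly chosen genotype-environment
    combinations, all traits are marked missing together."""
    n_geno, n_env, n_trait = shape
    missing_pairs = rng.random((n_geno, n_env)) < rate
    # Copy the genotype-environment decision along the trait axis.
    return np.repeat(missing_pairs[:, :, None], n_trait, axis=2)


def apply_mask(data, mask):
    """Return a copy of the data with masked cells set to NaN."""
    out = data.astype(float).copy()
    out[mask] = np.nan
    return out


# Hypothetical dimensions: 50 genotypes, 4 environments, 6 traits.
rng = np.random.default_rng(0)
shape = (50, 4, 6)
data = rng.normal(size=shape)  # simulated phenotypes, assumed layout

uniform_data = apply_mask(data, uniform_mask(shape, 0.3, rng))
fiber_data = apply_mask(data, fiber_mask(shape, 0.3, rng))
```

An imputation method could then be scored by comparing its predictions for the masked cells with the held-out values in `data`. Under the fiber mask, a genotype-environment combination that is missing has no observed trait at all, so correlations among traits within that combination cannot help and any information must come from other genotypes, other environments, or marker data.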

Acknowledgments

This work is supported by JSPS KAKENHI Grant Number 25252002 and by the Ministry of Foreign Affairs, Japan.

Author information

Authors and Affiliations

Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
Tomoaki Hori & Hiroyoshi Iwata
Africa Rice Center, 01 B.P. 2031, Cotonou, Benin
David Montcho & Koichi Futakuchi
Laboratory of Genetic and Biotechnologies, Faculty of Sciences and Techniques, University of Abomey-Calavi, 01 B.P. 526, Cotonou, Benin
Clement Agbangla
Genetic Resources Center, National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
Kaworu Ebana

Corresponding author

Correspondence to Hiroyoshi Iwata.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. The experiments comply with the current laws of the countries in which they were performed.

Additional information

Communicated by J. Crossa.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 27935 kb)

About this article

Cite this article

Hori, T., Montcho, D., Agbangla, C. et al. Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials. Theor Appl Genet 129, 2101–2115 (2016). https://doi.org/10.1007/s00122-016-2760-9

Received: 26 March 2016
Accepted: 02 August 2016
Published: 19 August 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s00122-016-2760-9
