We review and propose several methods for identifying possible outliers and evaluate their properties. The methods are applied to a genomic prediction program in hybrid rye.
Many plant breeders use ANOVA-based software for routine analysis of field trials. These programs may offer specific built-in options for residual analysis that are lacking in current REML software. With the advance of molecular technologies, there is a need to switch to REML-based approaches, but without losing the good features of outlier detection methods that have proven useful in the past. Our aims were to compare the variance component estimates between ANOVA and REML approaches, to scrutinize the outlier detection method of the ANOVA-based package PlabStat and to propose and evaluate alternative procedures for outlier detection. We compared the outputs of the ANOVA and REML approaches for four published datasets from generalized lattice designs. Five outlier detection methods are explained step by step. Their performance was evaluated by measuring the true positive rate and the false positive rate in a dataset with artificial outliers simulated under several scenarios. A genomic prediction application based on an empirical rye multi-environment trial was used to assess the outlier detection methods by comparing the predictive ability of a mixed model under each method. We provide a detailed explanation of how the PlabStat outlier detection methodology can be translated to REML-based software, together with an evaluation of alternative methods to identify outliers. The method combining the Bonferroni–Holm test to judge each residual and the residual standardization strategy of PlabStat exhibited good ability to detect outliers in small and large datasets and in a genomic prediction application. We recommend the use of outlier detection methods as decision support in the routine data analysis of plant breeding experiments.
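To make the idea of judging each residual with a Bonferroni–Holm test more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes residuals are standardized by a robust scale of 1.4826 × MAD and compared against a standard normal reference distribution; both choices, and the function names `robust_standardize`, `holm_outliers` and `detection_rates`, are illustrative assumptions and may differ from the PlabStat standardization described in the full article. The last function computes the true positive rate and false positive rate used to evaluate detection against known artificial outliers.

```python
from statistics import NormalDist, median


def robust_standardize(residuals):
    """Center by the median and scale by 1.4826 * MAD.

    Assumption: this robust scale is a stand-in, not necessarily
    the standardization used by PlabStat.
    """
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        raise ValueError("MAD is zero; residuals cannot be standardized")
    scale = 1.4826 * mad
    return [(r - med) / scale for r in residuals]


def holm_outliers(residuals, alpha=0.05):
    """Return the indices flagged by the Holm step-down procedure.

    Assumption: two-sided p-values come from a standard normal reference.
    """
    z = robust_standardize(residuals)
    norm = NormalDist()
    pvals = [2.0 * (1.0 - norm.cdf(abs(v))) for v in z]
    m = len(pvals)
    # Visit p-values from smallest to largest.
    order = sorted(range(m), key=pvals.__getitem__)
    flagged = set()
    for rank, idx in enumerate(order):
        # Holm threshold for the (rank + 1)-th smallest p-value.
        if pvals[idx] <= alpha / (m - rank):
            flagged.add(idx)
        else:
            break  # stop at the first non-rejection
    return flagged


def detection_rates(flagged, true_outliers, n):
    """True positive rate and false positive rate for n observations."""
    tp = len(flagged & true_outliers)
    fp = len(flagged - true_outliers)
    return tp / len(true_outliers), fp / (n - len(true_outliers))


# Hypothetical residuals with one artificial outlier at index 9.
residuals = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.0, 0.15, -0.25, 8.0]
flagged = holm_outliers(residuals)
tpr, fpr = detection_rates(flagged, {9}, len(residuals))
print(sorted(flagged), tpr, fpr)
```

Because the Holm procedure stops at the first p-value that fails its threshold, it controls the family-wise error rate across all residuals while being less conservative than a plain Bonferroni correction.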
This research was funded by KWS-LOCHOW GMBH and the German Federal Ministry of Education and Research (Bonn, Germany) within the AgroClusterEr “Rye-Select: Genome-based precision breeding strategies for rye” (Grant ID: 0315946A). We thank Vanda Lourenço for commenting on the manuscript and Steffen Hadasch for helping with the R code. We are grateful to KWS-LOCHOW for providing the datasets used in this study and the technical support to run the analyses. We thank the Synbreed project members for their helpful and constructive comments during the discussion sessions and also the anonymous reviewers for suggestions and comments that led to improvements in the clarity of the manuscript.
Conflict of interest
The authors declare that they have no conflicts of interest.
The authors declare that ethical standards are met, and all the experiments comply with the current laws of the country in which they were performed.
Communicated by M. J. Sillanpaa.
Cite this article
Bernal-Vasquez, AM., Utz, HF. & Piepho, HP. Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML. Theor Appl Genet 129, 787–804 (2016). https://doi.org/10.1007/s00122-016-2666-6