Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML

Abstract

Key message

We review and propose several methods for identifying possible outliers and evaluate their properties. The methods are applied to a genomic prediction program in hybrid rye.

Abstract

Many plant breeders use ANOVA-based software for routine analysis of field trials. These programs may offer built-in options for residual analysis that current REML software lacks. With the advance of molecular technologies, there is a need to switch to REML-based approaches without losing the outlier detection features that have proven useful in the past. Our aims were (1) to compare the variance component estimates of the ANOVA and REML approaches, (2) to scrutinize the outlier detection method of the ANOVA-based package PlabStat, and (3) to propose and evaluate alternative procedures for outlier detection. We compared the outputs of the ANOVA and REML approaches for four published datasets from generalized lattice designs. Five outlier detection methods are explained step by step. Their performance was evaluated by measuring the true positive rate and the false positive rate in a dataset with artificial outliers simulated under several scenarios. A genomic prediction application using an empirical rye multi-environment trial was used to assess each outlier detection method with respect to the predictive ability of a mixed model. We provide a detailed explanation of how the PlabStat outlier detection methodology can be translated to REML-based software, together with an evaluation of alternative methods to identify outliers. The method combining the Bonferroni–Holm test to judge each residual with the residual standardization strategy of PlabStat showed good ability to detect outliers in small and large datasets and in a genomic prediction application. We recommend the use of outlier detection methods as decision support in the routine data analysis of plant breeding experiments.
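To make the ideas named in the abstract concrete, the following is a minimal Python sketch of a Bonferroni–Holm step-down test applied to robustly standardized residuals, followed by the true and false positive rates against known artificial outliers. It is an illustration under stated assumptions, not the procedure of the article or of PlabStat: the scale estimate (1.4826 times the median absolute deviation), the standard normal reference distribution, the function names, and the example residuals are all hypothetical choices made here.

```python
import math
from statistics import median


def mad_standardize(residuals):
    """Standardize residuals with a robust scale (assumption: 1.4826 * MAD)."""
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        raise ValueError("MAD is zero; robust scale is undefined")
    scale = 1.4826 * mad
    return [(r - med) / scale for r in residuals]


def holm_outliers(z, alpha=0.05):
    """Flag outliers with the Bonferroni-Holm step-down procedure.

    Two-sided p-values assume standardized residuals are approximately
    standard normal (an assumption of this sketch).
    """
    n = len(z)
    pvals = [math.erfc(abs(v) / math.sqrt(2)) for v in z]
    order = sorted(range(n), key=lambda i: pvals[i])
    flagged = [False] * n
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (n - rank).
        if pvals[i] <= alpha / (n - rank):
            flagged[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return flagged


def detection_rates(flagged, truth):
    """Return (true positive rate, false positive rate).

    Assumes `truth` contains at least one outlier and one non-outlier.
    """
    tp = sum(f and t for f, t in zip(flagged, truth))
    fp = sum(f and not t for f, t in zip(flagged, truth))
    positives = sum(truth)
    negatives = len(truth) - positives
    return tp / positives, fp / negatives


# Hypothetical residuals with one artificial outlier at index 7.
residuals = [0.1, -0.2, 0.05, 0.3, -0.15, 0.2, -0.1, 8.0, 0.0, -0.25]
truth = [False] * 7 + [True] + [False] * 2

flagged = holm_outliers(mad_standardize(residuals))
tpr, fpr = detection_rates(flagged, truth)
```

The step-down loop stops at the first p-value that fails its threshold, which is what distinguishes Holm's procedure from a plain Bonferroni correction that tests every residual against alpha / n.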

Acknowledgments

This research was funded by KWS-LOCHOW GMBH and the German Federal Ministry of Education and Research (Bonn, Germany) within the AgroClusterEr “Rye-Select: Genome-based precision breeding strategies for rye” (Grant ID: 0315946A). We thank Vanda Lourenço for commenting on the manuscript and Steffen Hadasch for helping with the R codes. We are grateful to KWS-LOCHOW for providing the datasets used in this study and the technical support to run the analyses. We thank the Synbreed project members for their helpful and constructive comments during the discussion sessions and also the anonymous reviewers for suggestions and comments that led to improvements in the clarity of the manuscript.

Author information

Corresponding author

Correspondence to Angela-Maria Bernal-Vasquez.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical standards

The authors declare that ethical standards are met, and all the experiments comply with the current laws of the country in which they were performed.

Additional information

Communicated by M. J. Sillanpaa.

About this article

Cite this article

Bernal-Vasquez, AM., Utz, HF. & Piepho, HP. Outlier detection methods for generalized lattices: a case study on the transition from ANOVA to REML. Theor Appl Genet 129, 787–804 (2016). https://doi.org/10.1007/s00122-016-2666-6

Keywords

  • Outlier Detection
  • Genomic Prediction
  • Incomplete Block
  • Variance Component Estimate
  • Best Linear Unbiased Prediction