Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse
 Wolfgang Forstmeier,
 Holger Schielzeth
Abstract
Fitting generalised linear models (GLMs) with more than one predictor has become the standard method of analysis in evolutionary and behavioural research. Often, GLMs are used for exploratory data analysis, where one starts with a complex full model including interaction terms and then simplifies by removing non-significant terms. While this approach can be useful, it is problematic if significant effects are interpreted as if they arose from a single a priori hypothesis test. This is because model selection involves cryptic multiple hypothesis testing, a fact that has only rarely been acknowledged or quantified. We show that the probability of finding at least one ‘significant’ effect is high, even if all null hypotheses are true (e.g. 40% when starting with four predictors and their two-way interactions). This probability is close to theoretical expectations when the sample size (N) is large relative to the number of predictors including interactions (k). In contrast, type I error rates strongly exceed even those expectations when model simplification is applied to models that are over-fitted before simplification (low N/k ratio). The increase in false-positive results arises primarily from an overestimation of effect sizes among significant predictors, leading to upward-biased effect sizes that often cannot be reproduced in follow-up studies (‘the winner's curse’). Despite having their own problems, full model tests and P value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone. We favour the presentation of full models, since they best reflect the range of predictors investigated and ensure a balanced representation also of non-significant results.
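The 40% figure quoted in the abstract can be checked with a short calculation. The sketch below assumes that the "theoretical expectation" is the usual family-wise error rate for k independent tests at a nominal alpha of 0.05, that is 1 - (1 - alpha)^k; the abstract does not state the formula, so this is an assumption, and the function names are illustrative only.

```python
def n_terms(n_predictors: int) -> int:
    """Count main effects plus all two-way interactions."""
    return n_predictors + n_predictors * (n_predictors - 1) // 2


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests tests.

    Assumes all null hypotheses are true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


k = n_terms(4)  # 4 main effects + 6 two-way interactions = 10 terms
print(k, round(familywise_error_rate(k), 3))  # prints: 10 0.401
```

With four predictors and their six two-way interactions there are ten terms in the full model, and the expected chance of at least one nominally significant term is about 40%, which agrees with the value given in the abstract for the case of large N relative to k.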
 Open Access
 This content is freely available online to anyone, anywhere at any time.
 Journal

Behavioral Ecology and Sociobiology
Volume 65, Issue 1, pp 47-55
 Cover Date
 2011-01-01
 DOI
 10.1007/s00265-010-1038-5
 Print ISSN
 0340-5443
 Online ISSN
 1432-0762
 Publisher
 Springer-Verlag
 Keywords

 Bonferroni correction
 Effect size estimation
 Generalised linear models
 Model selection
 Multiple regression
 Multiple testing
 Parameter estimation
 Publication bias
 Authors

 Wolfgang Forstmeier (1)
 Holger Schielzeth (1, 2)
 Author Affiliations

 1. Max Planck Institute for Ornithology, Eberhard-Gwinner-Str., 82319, Seewiesen, Germany
 2. Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, SE-752 36, Uppsala, Sweden