
Model selection bias and Freedman’s paradox

  • Paul M. Lukacs
  • Kenneth P. Burnham
  • David R. Anderson

Abstract

In situations where limited knowledge of a system exists and the ratio of data points to variables is small, variable selection methods can often be misleading. Freedman (Am Stat 37:152–155, 1983) demonstrated how common it is to select completely unrelated variables as highly “significant” when the number of data points is similar in magnitude to the number of variables. A new type of model averaging estimator based on model selection with Akaike’s AIC is used with linear regression to investigate two problems: the likely inclusion of spurious effects, and model selection bias, which is the bias introduced by using the data to select a single seemingly “best” model from an (often large) set of models employing many predictor variables. The new model averaging estimator helps reduce these problems and provides confidence interval coverage at the nominal level, whereas traditional stepwise selection has poor inferential properties.
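The screening effect described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: the sample size, number of predictors, and the |t| cutoffs (rough stand-ins for a 25% screening level and a 5% final level) are assumptions chosen for illustration, and the function names are hypothetical. The response is generated independently of every predictor, so any predictor that ends up "significant" is spurious.

```python
import numpy as np


def ols_t_stats(X, y):
    """Return OLS t-statistics, intercept first, for y regressed on X."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    return beta / np.sqrt(np.diag(cov))


def screen_and_refit(n=100, p=50, t_screen=1.15, t_final=2.0, seed=0):
    """Screen pure-noise predictors, refit, and count 'significant' ones.

    Assumed cutoffs: |t| > 1.15 approximates a 25% two-sided level and
    |t| > 2.0 approximates a 5% level for these degrees of freedom.
    Returns (number kept after screening, number significant after refit).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                     # independent of X by design
    t_full = ols_t_stats(X, y)[1:]                 # drop the intercept
    keep = np.flatnonzero(np.abs(t_full) > t_screen)
    if keep.size == 0:
        return 0, 0
    t_refit = ols_t_stats(X[:, keep], y)[1:]
    return int(keep.size), int(np.sum(np.abs(t_refit) > t_final))


if __name__ == "__main__":
    results = [screen_and_refit(seed=s) for s in range(200)]
    mean_kept = np.mean([k for k, _ in results])
    mean_sig = np.mean([s for _, s in results])
    print(f"mean predictors kept after screening: {mean_kept:.1f}")
    print(f"mean spurious 'significant' predictors after refit: {mean_sig:.1f}")
```

Because the refit treats the screened predictors as if they had been chosen in advance, its t-statistics ignore the selection step, which is the source of the model selection bias discussed in the abstract.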

Keywords

Akaike’s information criterion; Confidence interval coverage; Freedman’s paradox; Model averaging; Model selection bias; Model selection uncertainty; Multimodel inference; Stepwise selection

References

  1. Akaike H. (1973) Information theory as an extension of the maximum likelihood principle. In: Petrov B.N., Csaki F. (eds) Second international symposium on information theory. Budapest, Akademiai Kiado, pp 267–281
  2. Akaike H. (1978) On the likelihood of a time series model. The Statistician 27: 217–235
  3. Akaike H. (1979) A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika 66: 237–242
  4. Anderson D.R. (2008) Model based inference in the life sciences: primer on evidence. Springer, New York
  5. Buckland S.T., Burnham K.P., Augustin N.H. (1997) Model selection: An integral part of inference. Biometrics 53: 603–618
  6. Burnham K.P., Anderson D.R. (2002) Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer, New York
  7. Burnham K.P., Anderson D.R. (2004) Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods and Research 33: 261–304
  8. Claeskens G., Hjort N.L. (2008) Model selection and model averaging. Cambridge University Press, New York
  9. Freedman D.A. (1983) A note on screening regression equations. The American Statistician 37: 152–155
  10. George E.I., McCulloch R.E. (1993) Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889
  11. Hoeting J.A., Madigan D., Raftery A.E., Volinsky C.T. (1999) Bayesian model averaging: a tutorial (with discussion). Statistical Science 14: 382–417
  12. Hurvich C.M., Tsai C.-L. (1989) Regression and time series model selection in small samples. Biometrika 76: 297–307
  13. Hurvich C.M., Tsai C.-L. (1990) The impact of model selection on inference in linear regression. The American Statistician 44: 214–217
  14. Massart P. (2007) Concentration inequalities and model selection. Springer, Berlin
  15. McQuarrie A.D.R., Tsai C.-L. (1998) Regression and time series model selection. World Scientific Publishing Co., Singapore
  16. Miller A.J. (2002) Subset selection in regression (2nd ed.). Chapman and Hall, New York
  17. Rawlings J.O. (1988) Applied regression analysis: a research tool. Wadsworth, Inc, Belmont
  18. Rencher A.C., Pun F.C. (1980) Inflation of R² in best subset regression. Technometrics 22: 49–53
  19. SAS Institute, Inc. (2001). SAS version 8.02, Cary, NC.
  20. Sugiura N. (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics, Theory and Methods A7: 13–26
  21. Wheeler M.W. (2009) Comparing model averaging with other model selection strategies for benchmark dose estimation. Environmetrics and Ecological Statistics 16: 37–51
  22. Wheeler M.W., Bailer A.J. (2007) Properties of model-averaged BMDLs: a study of model averaging in dichotomous response risk estimation. Risk Analysis 27: 659–670
  23. Yang Y. (2007) Prediction/estimation with simple linear models: Is it really simple? Econometric Theory 23: 1–36

Copyright information

© The Institute of Statistical Mathematics, Tokyo 2009

Authors and Affiliations

  • Paul M. Lukacs (1)
  • Kenneth P. Burnham (2)
  • David R. Anderson (2)

  1. Colorado Division of Wildlife, Fort Collins, USA
  2. U.S. Geological Survey, Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, Fort Collins, USA
