Model selection bias and Freedman’s paradox

Annals of the Institute of Statistical Mathematics

Abstract

In situations where limited knowledge of a system exists and the ratio of data points to variables is small, variable selection methods can often be misleading. Freedman (Am Stat 37:152–155, 1983) demonstrated how common it is to select completely unrelated variables as highly “significant” when the number of data points is similar in magnitude to the number of variables. A new type of model averaging estimator based on model selection with Akaike’s AIC is used with linear regression to investigate two problems: the likely inclusion of spurious effects, and model selection bias, that is, the bias introduced by using the data to select a single seemingly “best” model from an (often large) set of models employing many predictor variables. The new model averaging estimator helps reduce these problems and provides confidence interval coverage at the nominal level, whereas traditional stepwise selection has poor inferential properties.
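The screening effect attributed to Freedman above can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used in the article: the function names (`invert`, `ols_t_stats`, `freedman_screen`), the default sizes and thresholds, the omission of an intercept, and the use of a normal approximation in place of the t reference distribution are all assumptions made to keep the example dependency-free. The response and every predictor are independent noise, so any predictor flagged as significant after screening is spurious by construction.

```python
import random
from statistics import NormalDist


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(A)
    # Augment A with the identity matrix: [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(p)]
         for i, row in enumerate(A)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(p):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[p:] for row in M]


def ols_t_stats(X, y):
    """t-statistics of each coefficient in a least-squares fit of y on X.

    No intercept is fitted (an assumption made for brevity).
    """
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[j][k] * xty[k] for k in range(p)) for j in range(p)]
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
              for i in range(n))
    sigma2 = rss / (n - p)  # residual variance estimate
    return [beta[j] / (sigma2 * xtx_inv[j][j]) ** 0.5 for j in range(p)]


def freedman_screen(n=100, p=50, screen_alpha=0.25, final_alpha=0.05, seed=0):
    """Screen pure-noise predictors, refit on the survivors, and report both sets."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unrelated to X by construction
    # Two-sided critical values from the normal approximation (an assumption).
    z_screen = NormalDist().inv_cdf(1.0 - screen_alpha / 2.0)
    z_final = NormalDist().inv_cdf(1.0 - final_alpha / 2.0)
    # Pass 1: fit all predictors and keep those passing the loose threshold.
    kept = [j for j, t in enumerate(ols_t_stats(X, y)) if abs(t) > z_screen]
    if not kept:
        return kept, []
    # Pass 2: refit on the kept predictors only and apply the strict threshold.
    X_kept = [[row[j] for j in kept] for row in X]
    t_refit = ols_t_stats(X_kept, y)
    significant = [j for j, t in zip(kept, t_refit) if abs(t) > z_final]
    return kept, significant


if __name__ == "__main__":
    kept, significant = freedman_screen()
    print(len(kept), "predictors survive screening;",
          len(significant), "appear significant after refitting")
```

Running the sketch with several seeds gives a feel for how often unrelated predictors survive the second pass. Because the refit ignores the screening step, its nominal significance levels no longer hold, which is the model selection bias the abstract describes.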

References

  • Akaike H. (1973) Information theory as an extension of the maximum likelihood principle. In: Petrov B.N., Csaki F. (eds) Second international symposium on information theory. Budapest, Akademiai Kiado, pp 267–281

  • Akaike H. (1978) On the likelihood of a time series model. The Statistician 27: 217–235

  • Akaike H. (1979) A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika 66: 237–242

  • Anderson D.R. (2008) Model based inference in the life sciences: primer on evidence. Springer, New York

  • Buckland S.T., Burnham K.P., Augustin N.H. (1997) Model selection: An integral part of inference. Biometrics 53: 603–618

  • Burnham K.P., Anderson D.R. (2002) Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer, New York

  • Burnham K.P., Anderson D.R. (2004) Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods and Research 33: 261–304

  • Claeskens G., Hjort N.L. (2008) Model selection and model averaging. Cambridge University Press, New York

  • Freedman D.A. (1983) A note on screening regression equations. The American Statistician 37: 152–155

  • George E.I., McCulloch R.E. (1993) Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889

  • Hoeting J.A., Madigan D., Raftery A.E., Volinsky C.T. (1999) Bayesian model averaging: a tutorial (with discussion). Statistical Science 14: 382–417

  • Hurvich C.M., Tsai C.-L. (1989) Regression and time series model selection in small samples. Biometrika 76: 297–307

  • Hurvich C.M., Tsai C.-L. (1990) The impact of model selection on inference in linear regression. The American Statistician 44: 214–217

  • Massart P. (2007) Concentration inequalities and model selection. Springer, Berlin

  • McQuarrie A.D.R., Tsai C.-L. (1998) Regression and time series model selection. World Scientific Publishing Co., Singapore

  • Miller A.J. (2002) Subset selection in regression (2nd ed.). Chapman and Hall, New York

  • Rawlings J.O. (1988) Applied regression analysis: a research tool. Wadsworth, Inc, Belmont

  • Rencher A.C., Pun F.C. (1980) Inflation of R² in best subset regression. Technometrics 22: 49–53

  • SAS Institute, Inc. (2001) SAS version 8.02. Cary, NC

  • Sugiura N. (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics, Theory and Methods A7: 13–26

  • Wheeler M.W. (2009) Comparing model averaging with other model selection strategies for benchmark dose estimation. Environmental and Ecological Statistics 16: 37–51

  • Wheeler M.W., Bailer A.J. (2007) Properties of model-averaged BMDLs: a study of model averaging in dichotomous response risk estimation. Risk Analysis 27: 659–670

  • Yang Y. (2007) Prediction/estimation with simple linear models: Is it really simple? Econometric Theory 23: 1–36


Author information

Corresponding author

Correspondence to Paul M. Lukacs.

About this article

Cite this article

Lukacs, P.M., Burnham, K.P. & Anderson, D.R. Model selection bias and Freedman’s paradox. Ann Inst Stat Math 62, 117–125 (2010). https://doi.org/10.1007/s10463-009-0234-4

  • DOI: https://doi.org/10.1007/s10463-009-0234-4
