Model selection bias and Freedman’s paradox

Lukacs, Paul M.; Burnham, Kenneth P.; Anderson, David R.

doi:10.1007/s10463-009-0234-4

Paul M. Lukacs¹,
Kenneth P. Burnham² &
David R. Anderson²


Abstract

In situations where limited knowledge of a system exists and the ratio of data points to variables is small, variable selection methods can often be misleading. Freedman (Am Stat 37:152–155, 1983) demonstrated how common it is to select completely unrelated variables as highly “significant” when the number of data points is similar in magnitude to the number of variables. A new type of model averaging estimator based on model selection with Akaike’s AIC is used with linear regression to investigate two problems: the likely inclusion of spurious effects, and model selection bias, that is, the bias introduced when the data are used to select a single seemingly “best” model from an often large set of models employing many predictor variables. The new model averaging estimator helps reduce these problems and provides confidence interval coverage at the nominal level, while traditional stepwise selection has poor inferential properties.
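To make the idea of AIC-based model averaging concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Akaike weights, w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2) with Δ_i = AIC_i − AIC_min, and assumes that a model omitting a predictor contributes an estimate of zero for that predictor, so that averaging over all models shrinks the coefficient toward zero. The function names, AIC scores, and slope estimates are hypothetical and chosen only for illustration.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC scores into Akaike weights that sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model: exp(-delta_i / 2),
    # where delta_i is the AIC difference from the best model.
    rel = [math.exp(-(a - best) / 2.0) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


def model_averaged_coefficient(estimates, weights):
    """Weighted average of one coefficient over ALL candidate models.

    Assumption: a model that excludes the predictor is represented
    by an estimate of 0.0, which shrinks the average toward zero.
    """
    return sum(w * b for w, b in zip(weights, estimates))


# Hypothetical AIC scores for three candidate regressions.
aic = [100.0, 102.0, 110.0]
# Hypothetical estimates of one slope; the second model omits it.
beta = [0.80, 0.0, 0.75]

weights = akaike_weights(aic)
beta_avg = model_averaged_coefficient(beta, weights)
print([round(w, 3) for w in weights], round(beta_avg, 3))
```

Because the second model carries appreciable weight and contributes zero, the averaged slope falls below the estimate from the single best model, which is the kind of shrinkage that counteracts selecting one "best" model and reporting its coefficients as if no selection had occurred.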


References

Akaike H. (1973) Information theory as an extension of the maximum likelihood principle. In: Petrov B.N., Csaki F. (eds) Second international symposium on information theory. Budapest, Akademiai Kiado, pp 267–281
Akaike H. (1978) On the likelihood of a time series model. The Statistician 27: 217–235
Akaike H. (1979) A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika 66: 237–242
Anderson D.R. (2008) Model based inference in the life sciences: primer on evidence. Springer, New York
Buckland S.T., Burnham K.P., Augustin N.H. (1997) Model selection: An integral part of inference. Biometrics 53: 603–618
Burnham K.P., Anderson D.R. (2002) Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer, New York
Burnham K.P., Anderson D.R. (2004) Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods and Research 33: 261–304
Claeskens G., Hjort N.L. (2008) Model selection and model averaging. Cambridge University Press, New York
Freedman D.A. (1983) A note on screening regression equations. The American Statistician 37: 152–155
George E.I., McCulloch R.E. (1993) Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889
Hoeting J.A., Madigan D., Raftery A.E., Volinsky C.T. (1999) Bayesian model averaging: a tutorial (with discussion). Statistical Science 14: 382–417
Hurvich C.M., Tsai C.-L. (1989) Regression and time series model selection in small samples. Biometrika 76: 297–307
Hurvich C.M., Tsai C.-L. (1990) The impact of model selection on inference in linear regression. The American Statistician 44: 214–217
Massart P. (2007) Concentration inequalities and model selection. Springer, Berlin
McQuarrie A.D.R., Tsai C.-L. (1998) Regression and time series model selection. World Scientific Publishing Co., Singapore
Miller A.J. (2002) Subset selection in regression (2nd ed.). Chapman and Hall, New York
Rawlings J.O. (1988) Applied regression analysis: a research tool. Wadsworth, Inc, Belmont
Rencher A.C., Pun F.C. (1980) Inflation of R² in best subset regression. Technometrics 22: 49–53
SAS Institute, Inc. (2001) SAS version 8.02. Cary, NC
Sugiura N. (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics, Theory and Methods A7: 13–26
Wheeler M.W. (2009) Comparing model averaging with other model selection strategies for benchmark dose estimation. Environmental and Ecological Statistics 16: 37–51
Wheeler M.W., Bailer A.J. (2007) Properties of model-averaged BMDLs: a study of model averaging in dichotomous response risk estimation. Risk Analysis 27: 659–670
Yang Y. (2007) Prediction/estimation with simple linear models: Is it really simple? Econometric Theory 23: 1–36

Author information

Authors and Affiliations

Colorado Division of Wildlife, 317 W. Prospect Road, Fort Collins, CO, 80526, USA
Paul M. Lukacs
U.S. Geological Survey, Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, 1484 Campus Delivery, Fort Collins, CO, 80523, USA
Kenneth P. Burnham & David R. Anderson

Authors

Paul M. Lukacs
Kenneth P. Burnham
David R. Anderson

Corresponding author

Correspondence to Paul M. Lukacs.

About this article

Cite this article

Lukacs, P.M., Burnham, K.P. & Anderson, D.R. Model selection bias and Freedman’s paradox. Ann Inst Stat Math 62, 117–125 (2010). https://doi.org/10.1007/s10463-009-0234-4

Received: 16 October 2008
Revised: 10 February 2009
Published: 26 May 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s10463-009-0234-4
