Using AIC in multiple linear regression framework with multiply imputed data

Health Services and Outcomes Research Methodology

Abstract

Many model selection criteria proposed over the years have become common procedures in applied research. However, these procedures were designed for complete data, which are rare in applied statistics, particularly in medical, public health and health policy settings. Incomplete data introduce their own complications, which make the task of model selection considerably harder. Recently, a few authors have suggested model selection procedures for incomplete data, with varying degrees of success. In this paper we explore model selection by the Akaike Information Criterion (AIC) in the multivariate regression setting, with ignorable missing data accounted for via multiple imputation.

Acknowledgments

We would like to thank the referees for providing comments that helped in improving the quality of our paper. This project was partially supported by Award Number K01MH087219 from the National Institute of Mental Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.

Author information

Corresponding author

Correspondence to Ofer Harel.

Appendix

In order to interpret the simulation results that follow, we use the following labels to indicate the variables (model) selected as “best” by each AIC type:

  • \(\mathcal{C}\) indicates the variables in the true data generating model, i.e. \(\{\mathbf{x}_1, \mathbf{x}_2\}\);
  • \(\mathcal{I}\) indicates the variables in the true imputation model, i.e. \(\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}\) when \(p = 3\) and \(\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4\}\) when \(p = 4\);
  • \(\mathcal{I}_{\mathbf{x}_1}\) indicates the model with \(\mathbf{x}_1\) and variables other than \(\mathbf{x}_2\);
  • \(\mathcal{I}_{\mathbf{x}_2}\) indicates the model with \(\mathbf{x}_2\) and variables other than \(\mathbf{x}_1\);
  • \(\mathcal{O}\) indicates over-fitted models that contain the true data generating variables and some other variables (for \(p = 3\), \(\mathcal{I} = \mathcal{O}\));
  • \(\mathcal{W}\) indicates models not containing any of the true data generating variables.

The selection rates are reported in Tables 3–10.
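The labelling scheme above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `classify_selection`, the string variable names `"x1"`, `"x2"`, ... and the order in which the labels are checked are all assumptions. In particular, the sketch assumes that a model containing x1 but not x2 (including x1 alone) is labelled I_x1, and that when p = 3 the full model is reported under the label I, since the appendix states that I and O coincide in that case.

```python
from typing import Iterable

# True data generating variables, as stated in the appendix.
TRUE_VARS = frozenset({"x1", "x2"})


def classify_selection(selected: Iterable[str], p: int) -> str:
    """Map a selected variable set to an appendix label (hypothetical helper).

    `p` is the number of candidate variables x1, ..., xp, so the true
    imputation model is assumed to be the full set {x1, ..., xp}.
    """
    chosen = frozenset(selected)
    full = frozenset(f"x{j}" for j in range(1, p + 1))
    if not chosen <= full:
        raise ValueError(f"unknown variables: {sorted(chosen - full)}")

    if chosen == TRUE_VARS:
        return "C"      # exactly the true data generating model
    if chosen == full:
        return "I"      # true imputation model (equals O when p = 3)
    if TRUE_VARS < chosen:
        return "O"      # over-fitted: true variables plus some others
    if "x1" in chosen and "x2" not in chosen:
        return "I_x1"   # x1 with variables other than x2 (assumed rule)
    if "x2" in chosen and "x1" not in chosen:
        return "I_x2"   # x2 with variables other than x1 (assumed rule)
    return "W"          # none of the true data generating variables


if __name__ == "__main__":
    for model in (["x1", "x2"], ["x1", "x2", "x3"], ["x1", "x4"], ["x3", "x4"]):
        print(model, "->", classify_selection(model, p=4))
```

Tallying these labels over the simulated data sets would give selection rates of the kind the table captions describe, although the exact tabulation used in the paper is not shown here.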

Table 3 Model selection rates of each AIC type in data sets with n = 50, k = 3 and γ = 25 %
Table 4 Model selection rates of each AIC type in data sets with n = 100, k = 3 and γ = 25 %
Table 5 Model selection rates of each AIC type in data sets with n = 50, k = 4 and γ = 25 %
Table 6 Model selection rates of each AIC type in data sets with n = 100, k = 4 and γ = 25 %
Table 7 Model selection rates of each AIC type in data sets with n = 50, k = 3 and γ = 50 %
Table 8 Model selection rates of each AIC type in data sets with n = 100, k = 3 and γ = 50 %
Table 9 Model selection rates of each AIC type in data sets with n = 50, k = 4 and γ = 50 %
Table 10 Model selection rates of each AIC type in data sets with n = 100, k = 4 and γ = 50 %

Cite this article

Chaurasia, A., Harel, O. Using AIC in multiple linear regression framework with multiply imputed data. Health Serv Outcomes Res Method 12, 219–233 (2012). https://doi.org/10.1007/s10742-012-0088-8

  • DOI: https://doi.org/10.1007/s10742-012-0088-8
