Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates

Annals of the Institute of Statistical Mathematics

Abstract

Missing data are a critical issue in observational and experimental research. For datasets with mixed continuous–discrete variables, multiple imputation by chained equations (MICE) has recently been widely used, although MICE may yield severely biased estimates. We propose a new semiparametric Bayesian multiple imputation approach that can handle both continuous and discrete variables. The approach overcomes a shortcoming of MICE: its conditional models must satisfy strong conditions (known as compatibility) to guarantee that the resulting estimators are consistent. Our simulation studies show that the coverage probability of the 95% interval calculated using MICE can be less than 1%, while the MSE of the proposed method can be less than one-fiftieth of that of MICE. We applied our method to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset; the results are consistent with those of previous studies that used panel data other than the ADNI database, whereas existing methods such as MICE produced inconsistent results.
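As a rough illustration of the quantities reported in the abstract (a 95% interval, its coverage probability, and the MSE), the following sketch pools estimates from several imputed datasets with Rubin's rules and then tallies coverage and MSE over simulation replicates. It is not the authors' method or code. The function names, the normal critical value 1.96 (Rubin's rules properly use a t reference distribution), and the toy numbers are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def interval_95(q_bar, total, z=1.96):
    """Approximate 95% interval; z = 1.96 is a normal approximation (assumption)."""
    half = z * math.sqrt(total)
    return q_bar - half, q_bar + half


def coverage_and_mse(pooled, truth):
    """Coverage probability and MSE over simulation replicates.

    pooled: list of (q_bar, total) pairs, one per simulated dataset.
    """
    hits = 0
    for q_bar, total in pooled:
        lo, hi = interval_95(q_bar, total)
        hits += lo <= truth <= hi
    mse = mean((q_bar - truth) ** 2 for q_bar, _ in pooled)
    return hits / len(pooled), mse


# Toy numbers (hypothetical): three imputations in each of two replicates.
rep1 = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
rep2 = pool_rubin([2.0, 2.1, 1.9], [0.01, 0.01, 0.01])
coverage, mse = coverage_and_mse([rep1, rep2], truth=1.0)
```

In a real simulation study the list passed to `coverage_and_mse` would hold one pooled result per replicate for each imputation method, so that coverage and MSE can be compared across methods on the same simulated data.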



Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers JP26285151, 18H03209, 16H02013, 16H06323. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Author information

Corresponding author

Correspondence to Takahiro Hoshino.

Electronic supplementary material

Supplementary material 1 (PDF, 4273 KB) accompanies the online version of the article.

About this article

Cite this article

Kato, R., Hoshino, T. Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates. Ann Inst Stat Math 72, 803–825 (2020). https://doi.org/10.1007/s10463-019-00710-w

  • DOI: https://doi.org/10.1007/s10463-019-00710-w
