Abstract
Missing data are a critical issue in observational and experimental research. For datasets with mixed continuous–discrete variables, multiple imputation by chained equations (MICE) has recently been widely used, although MICE may yield severely biased estimates. We propose a new semiparametric Bayes multiple imputation approach that can handle both continuous and discrete variables. This approach overcomes a shortcoming of MICE: its conditional models must satisfy a strong condition (known as compatibility) to guarantee that the resulting estimators are consistent. Our simulation studies show that the coverage probability of the 95% interval calculated using MICE can be less than 1%, while the MSE of the proposed method can be less than one-fiftieth of that of MICE. We applied our method to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, and the results agree with those of previous studies that used panel data other than the ADNI database, whereas existing methods such as MICE produced results that were inconsistent with those studies.
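For readers unfamiliar with how multiply imputed analyses are combined into a single estimate and interval, the following is a minimal sketch of the standard pooling step commonly known as Rubin's rules. It is not the semiparametric Bayes procedure proposed in the article, and the function name `pool_rubin` and the input values are hypothetical, chosen only for illustration.

```python
from math import inf
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M completed-data estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset (hypothetical inputs)
    variances: their estimated sampling variances (squared standard errors)
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor M - 1)
    total = u_bar + (1 + 1 / m) * b

    # Classical degrees of freedom for the t reference distribution;
    # infinite when the imputations agree exactly (b == 0).
    if b == 0:
        df = inf
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


# Illustrative values only; these are not results from the article.
q, t, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A 95% interval would then be the pooled estimate plus or minus the appropriate t quantile (with `df` degrees of freedom) times the square root of the total variance. Coverage probability in a simulation study is the fraction of such intervals that contain the true parameter value.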
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers JP26285151, 18H03209, 16H02013, 16H06323. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
About this article
Cite this article
Kato, R., Hoshino, T. Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates. Ann Inst Stat Math 72, 803–825 (2020). https://doi.org/10.1007/s10463-019-00710-w
DOI: https://doi.org/10.1007/s10463-019-00710-w