Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates

Kato, Ryo; Hoshino, Takahiro

doi:10.1007/s10463-019-00710-w

Published: 11 March 2019

Volume 72, pages 803–825, (2020)

Annals of the Institute of Statistical Mathematics

Ryo Kato¹ & Takahiro Hoshino²,³


Abstract

Missing data are a critical issue in observational and experimental research. For datasets with mixed continuous–discrete variables, multiple imputation by chained equations (MICE) has recently been widely used, although MICE may yield severely biased estimates. We propose a new semiparametric Bayes multiple imputation approach that can handle both continuous and discrete variables. This approach overcomes a shortcoming of MICE: its conditional models must satisfy strong conditions (known as compatibility) to guarantee that the resulting estimators are consistent. Our simulation studies show that the coverage probability of the 95% interval calculated using MICE can be less than 1%, while the MSE of the proposed method can be less than one-fiftieth of that of MICE. We applied our method to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset; the results are consistent with those of previous works that used panel data other than the ADNI database, whereas existing methods, such as MICE, yielded inconsistent results.
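The two simulation metrics named in the abstract, the coverage probability of an interval and the MSE of an estimator, can be illustrated with a minimal sketch. The function names, the true parameter value, and the toy replicates below are all hypothetical and chosen only for illustration; they are not taken from the article's simulation study.

```python
from statistics import mean


def coverage_probability(intervals, true_value):
    """Fraction of (lower, upper) intervals that contain true_value."""
    hits = [lower <= true_value <= upper for lower, upper in intervals]
    return sum(hits) / len(hits)


def mean_squared_error(estimates, true_value):
    """Average squared deviation of the estimates from true_value."""
    return mean((est - true_value) ** 2 for est in estimates)


# Hypothetical toy replicates (assumed values, not results from the article):
# one point estimate and one 95% interval per simulated dataset.
true_beta = 1.0
estimates = [0.9, 1.1, 1.4, 1.0]
intervals = [(0.7, 1.1), (0.9, 1.3), (1.2, 1.6), (0.8, 1.2)]

coverage = coverage_probability(intervals, true_beta)
mse = mean_squared_error(estimates, true_beta)
print(f"coverage = {coverage:.2f}, MSE = {mse:.3f}")
```

In a simulation study of this kind, a well-calibrated 95% interval should cover the true value in roughly 95% of replicates, so a coverage far below that level signals biased estimates or understated uncertainty.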

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers JP26285151, 18H03209, 16H02013, 16H06323. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Author information

Authors and Affiliations

Research Institute for Economics and Business Administration, Kobe University, 2-1 Rokkodai-cho, Nada-ku, Kobe, Japan
Ryo Kato
Department of Economics, Keio University, 2-15-45 Mita, Minato-ku, Tokyo, Japan
Takahiro Hoshino
RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, Japan
Takahiro Hoshino

Corresponding author

Correspondence to Takahiro Hoshino.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4273 KB)

About this article

Cite this article

Kato, R., Hoshino, T. Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates. Ann Inst Stat Math 72, 803–825 (2020). https://doi.org/10.1007/s10463-019-00710-w

Received: 17 July 2018
Revised: 10 December 2018
Published: 11 March 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10463-019-00710-w
