
Leveraging Variational Autoencoders for Multiple Data Imputation

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14169)

Abstract

Missing data remains a major barrier to data analysis across numerous applications. Recently, deep generative models have been applied to missing-data imputation, motivated by their ability to learn complex, non-linear relationships. In this work, we investigate the ability of variational autoencoders (VAEs) to account for uncertainty in missing data through multiple imputation. We find that VAEs provide poor empirical coverage of missing data, underestimating uncertainty and producing overconfident imputations. To overcome this, we employ \(\beta \)-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification. Choosing a good value of \(\beta \) is critical for uncertainty calibration, and we demonstrate how this can be achieved using cross-validation. We assess three alternative methods for sampling from the posterior distribution of missing values and apply the approach to transcriptomics datasets under various simulated missingness scenarios. Finally, we show that single imputation in transcriptomic data can cause false discoveries in downstream tasks, and that multiple imputation with \(\beta \)-VAEs effectively mitigates these inaccuracies.
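The two ingredients named in the abstract can be sketched in a few lines of code: the \(\beta \)-weighted ELBO, which down- or up-weights the KL term relative to a standard VAE, and multiple imputation by repeatedly resampling the latent posterior and decoding. This is a minimal illustrative sketch, not the authors' implementation (their code is in the linked repository); `encode` and `decode` are placeholders for a trained model, and `multiple_impute` uses a simple pseudo-Gibbs-style resampling loop as one of several possible posterior sampling schemes.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_elbo(recon_loglik, mu, logvar, beta):
    """Per-sample beta-weighted ELBO; beta = 1 recovers the standard VAE."""
    return recon_loglik - beta * gaussian_kl(mu, logvar)

def multiple_impute(x, miss_mask, encode, decode,
                    n_imputations=10, n_steps=20, rng=None):
    """Draw several completions of x by iterating encode -> sample z -> decode,
    replacing only the missing entries each step. miss_mask is True where
    values are missing; encode(x) must return (mu, logvar)."""
    rng = np.random.default_rng(rng)
    imputations = []
    for _ in range(n_imputations):
        x_hat = np.where(miss_mask, 0.0, x)  # initialise missing entries
        for _ in range(n_steps):
            mu, logvar = encode(x_hat)
            # reparameterised sample from the approximate posterior
            z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
            x_dec = decode(z)
            # keep observed values fixed; update only the missing ones
            x_hat = np.where(miss_mask, x_dec, x)
        imputations.append(x_hat)
    return np.stack(imputations)  # shape: (n_imputations, *x.shape)
```

Downstream analyses would then be run on each of the returned completions and the results pooled, rather than on a single imputed dataset.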




Acknowledgements

SW was supported by the Royal Society of Edinburgh (RSE) (grant number 69938). BRH and JW acknowledge the receipt of studentship awards from the Health Data Research UK & The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z).

Code is available at https://github.com/roskamsh/BetaVAEMultImpute along with the appendix for this paper, including supplementary figures.

Author information


Corresponding author

Correspondence to Breeshey Roskams-Hieter.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Roskams-Hieter, B., Wells, J., Wade, S. (2023). Leveraging Variational Autoencoders for Multiple Data Imputation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_29


  • DOI: https://doi.org/10.1007/978-3-031-43412-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43411-2

  • Online ISBN: 978-3-031-43412-9

  • eBook Packages: Computer Science (R0)
