
Leveraging Variational Autoencoders for Multiple Data Imputation

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14169)

Abstract

Missing data remains a major barrier to data analysis across numerous applications. Recently, deep generative models have been applied to missing-data imputation, motivated by their ability to learn complex, non-linear relationships. In this work, we investigate the ability of variational autoencoders (VAEs) to account for uncertainty in missing data through multiple imputation. We find that VAEs provide poor empirical coverage of missing data, underestimating uncertainty and producing overconfident imputations. To overcome this, we employ \(\beta \)-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification. Choosing a good value of \(\beta \) is critical for uncertainty calibration, and we demonstrate how this can be achieved using cross-validation. We assess three alternative methods for sampling from the posterior distribution of missing values and apply the approach to transcriptomics datasets under various simulated missingness scenarios. Finally, we show that single imputation in transcriptomic data can cause false discoveries in downstream tasks, and that multiple imputation with \(\beta \)-VAEs effectively mitigates these inaccuracies.
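The two ingredients named in the abstract can be sketched in a few lines of code: the \(\beta \)-weighted ELBO, which down- or up-weights the KL term relative to a standard VAE, and multiple imputation by repeatedly resampling the latent posterior and decoding. This is a minimal illustrative sketch, not the authors' implementation (their code is in the linked repository); `encode` and `decode` are placeholders for a trained model, and `multiple_impute` uses a simple pseudo-Gibbs-style resampling loop as one of several possible posterior sampling schemes.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_elbo(recon_loglik, mu, logvar, beta):
    """Per-sample beta-weighted ELBO; beta = 1 recovers the standard VAE."""
    return recon_loglik - beta * gaussian_kl(mu, logvar)

def multiple_impute(x, miss_mask, encode, decode,
                    n_imputations=10, n_steps=20, rng=None):
    """Draw several completions of x by iterating encode -> sample z -> decode,
    replacing only the missing entries each step. miss_mask is True where
    values are missing; encode(x) must return (mu, logvar)."""
    rng = np.random.default_rng(rng)
    imputations = []
    for _ in range(n_imputations):
        x_hat = np.where(miss_mask, 0.0, x)  # initialise missing entries
        for _ in range(n_steps):
            mu, logvar = encode(x_hat)
            # reparameterised sample from the approximate posterior
            z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
            x_dec = decode(z)
            # keep observed values fixed; update only the missing ones
            x_hat = np.where(miss_mask, x_dec, x)
        imputations.append(x_hat)
    return np.stack(imputations)  # shape: (n_imputations, *x.shape)
```

Downstream analyses would then be run on each of the returned completions and the results pooled, rather than on a single imputed dataset.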




Acknowledgements

SW was supported by the Royal Society of Edinburgh (RSE) (grant number 69938). BRH and JW acknowledge the receipt of studentship awards from the Health Data Research UK & The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z).

Code is available at https://github.com/roskamsh/BetaVAEMultImpute along with the appendix for this paper, including supplementary figures.

Author information


Corresponding author

Correspondence to Breeshey Roskams-Hieter.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Roskams-Hieter, B., Wells, J., Wade, S. (2023). Leveraging Variational Autoencoders for Multiple Data Imputation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_29


  • DOI: https://doi.org/10.1007/978-3-031-43412-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43411-2

  • Online ISBN: 978-3-031-43412-9

  • eBook Packages: Computer Science (R0)
