Abstract
Missing data persists as a major barrier to data analysis across numerous applications. Recently, deep generative models have been used to impute missing data, motivated by their ability to learn complex and non-linear relationships. In this work, we investigate the ability of variational autoencoders (VAEs) to account for uncertainty in missing data through multiple imputation. We find that VAEs provide poor empirical coverage of missing data, with underestimated uncertainty and overconfident imputations. To overcome this, we employ \(\beta \)-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification. Assigning a good value of \(\beta \) is critical for uncertainty calibration, and we demonstrate how this can be achieved using cross-validation. We assess three alternative methods for sampling from the posterior distribution of missing values and apply the approach to transcriptomics datasets with various simulated missingness scenarios. Finally, we show that single imputation in transcriptomic data can cause false discoveries in downstream tasks and that multiple imputation with \(\beta \)-VAEs can effectively mitigate these inaccuracies.
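The role of \(\beta \) in the objective can be sketched numerically. Below is a minimal illustration of the \(\beta \)-weighted evidence lower bound for a Gaussian VAE, assuming a diagonal Gaussian encoder and a fixed-variance Gaussian decoder; the function names and the unit-variance likelihood are illustrative assumptions, not the paper's implementation:

```python
import math

def gaussian_kl(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def gaussian_log_lik(x, x_hat, sigma=1.0):
    # log N(x | x_hat, sigma^2 I), summed over (observed) features
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (xi - xh) ** 2 / (2 * sigma ** 2)
               for xi, xh in zip(x, x_hat))

def beta_elbo(x, x_hat, mu, log_var, beta=1.0):
    # beta-VAE objective: reconstruction term minus a beta-weighted KL penalty;
    # beta = 1 recovers the standard VAE ELBO
    return gaussian_log_lik(x, x_hat) - beta * gaussian_kl(mu, log_var)
```

In the generalized Bayes view adopted here, scaling the KL term by \(\beta \) corresponds to tempering the likelihood, which can counteract the overconfidence of a misspecified model; the value of \(\beta \) itself must still be chosen, e.g. by cross-validation as in the paper.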
Acknowledgements
SW was supported by the Royal Society of Edinburgh (RSE) (grant number 69938). BRH and JW acknowledge the receipt of studentship awards from the Health Data Research UK & The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z).
Code is available at https://github.com/roskamsh/BetaVAEMultImpute along with the appendix for this paper, including supplementary figures.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Roskams-Hieter, B., Wells, J., Wade, S. (2023). Leveraging Variational Autoencoders for Multiple Data Imputation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_29
Print ISBN: 978-3-031-43411-2
Online ISBN: 978-3-031-43412-9