Abstract
Missing data poses a ubiquitous challenge across a wide range of applications, stemming from a multitude of causes that are both diverse and context-dependent. The prevailing issue is that most advanced data analysis techniques are primarily tailored for complete datasets, thereby underscoring the indispensable need for effective imputation methods. In this chapter, we embark on an extensive exploration of missing data from a statistical perspective, offering a holistic review of its intricate nature. Our investigation encompasses a deep dive into the various mechanisms underlying missing data, shedding light on their ignorability and identifiability, fundamental concepts essential for understanding and addressing this pervasive issue. Moreover, we present a succinct yet comprehensive overview of influential classical imputation methods, showcasing their contributions to the field. Building upon this foundation, we delve into the latest advancements in generative models, a burgeoning area that holds great promise for learning from and imputing missing data. By harnessing the power of generative models, we aim to unlock novel insights and methodologies that can tackle the challenges posed by missing data. Furthermore, we introduce an approach that specifically addresses the critical problem of nonparametric identifiability in nonignorable missing data through the innovative use of generative models. This novel approach aims to overcome the limitations associated with alternative generative models and provides a potential solution to this challenging issue. To enhance the clarity of our proposed method, we supplement our discourse with curated numerical examples that distinguish its effectiveness from other baselines in specific scenarios. Through this exploration, we hope to pave the way for further research and advancements in this critical domain, ultimately leading to more accurate and reliable analyses and interpretations of incomplete datasets.
References
Li, Y., Miao, W., Shpitser, I., & Tchetgen, E. J. T. (2022). A self-censoring model for multivariate nonignorable nonmonotone missing data. arXiv preprint arXiv:2207.08535.
Malinsky, D., Shpitser, I., & Tchetgen Tchetgen, E. J. (2021). Semiparametric inference for nonmonotone missing-not-at-random data: The no self-censoring model. Journal of the American Statistical Association, pp. 1–9.
Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2018). The deconfounded recommender: A causal inference approach to recommendation. arXiv preprint arXiv:1808.06581.
Marlin, B. M., & Zemel, R. S. (2009). Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems (pp. 5–12).
Ghalebikesabi, S., Cornish, R., Holmes, C., & Kelly, L. (2021). Deep generative missingness pattern-set mixture models. In International conference on artificial intelligence and statistics (pp. 3727–3735). PMLR.
Xue, F., & Qu, A. (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, 116(536), 1914–1927.
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., & Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(1), 1–37.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In Drawing inferences from self-selected samples (pp. 115–142). Springer.
Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421), 125–134.
Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). Wiley.
Shrive, F. M., Stuart, H., Quan, H., & Ghali, W. A. (2006). Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Medical Research Methodology, 6, 1–10.
Jakobsen, J. C., Gluud, C., Wetterslev, J., & Winkel, P. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts. BMC Medical Research Methodology, 17(1), 1–10.
Hernández-Lobato, J. M., Houlsby, N., & Ghahramani, Z. (2014). Probabilistic matrix factorization with non-random missing data. In International conference on machine learning (pp. 1512–1520). PMLR.
Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: An introduction. Cambridge University Press.
Ma, C., & Zhang, C. (2021). Identifiable generative models for missing not at random data imputation. Advances in Neural Information Processing Systems, 34, 27645–27658.
Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359), 538–543.
Robins, J. M. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1), 21–37.
Vansteelandt, S., Goetghebeur, E., Kenward, M. G., & Molenberghs, G. (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica, pp. 953–979.
Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman and Hall/CRC.
Sadinle, M., & Reiter, J. P. (2018). Sequential identification of nonignorable missing data mechanisms. Statistica Sinica, 28(4), 1741–1759.
Gill, R. D., Laan, M. J., & Robins, J. M. (1997). Coarsening at random: Characterizations, conjectures, counter-examples. In Proceedings of the first Seattle symposium in biostatistics (pp. 255–294). Springer.
Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, pp. 1097–1116.
Miao, W., Ding, P., & Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516), 1673–1683.
Miao, W., & Tchetgen, E. J. T. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2), 475–482.
d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1), 1–15.
Liu, L., Miao, W., Sun, B., Robins, J., & Tchetgen, E. T. (2020). Identification and inference for marginal average treatment effect on the treated with an instrumental variable. Statistica Sinica, 30(3), 1517.
Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen, E. J. T. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4), 1965.
Tchetgen, E. J. T., Wang, L., & Sun, B. (2018). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica, 28(4), 2069.
Linero, A. R. (2017). Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness. Biometrika, 104(2), 327–341.
Fay, R. E. (1986). Causal models for patterns of nonresponse. Journal of the American Statistical Association, 81(394), 354–365.
Ma, W.-Q., Geng, Z., & Hu, Y.-H. (2003). Identification of graphical models for nonignorable nonresponse of binary outcomes in longitudinal studies. Journal of Multivariate Analysis, 87(1), 24–45.
Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534), 1023–1037.
Nabi, R., Bhattacharya, R., & Shpitser, I. (2020). Full law identification in graphical models of missing data: Completeness results. In International conference on machine learning (pp. 7153–7163). PMLR.
Shpitser, I. (2016). Consistent estimation of functions of data missing non-monotonically and not at random. Advances in Neural Information Processing Systems, 29.
Sadinle, M., & Reiter, J. P. (2017). Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika, 104(1), 207–220.
Kim, K.-Y., Kim, B.-J., & Yi, G.-S. (2004). Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5(1), 1–9.
Stekhoven, D. J., & Bühlmann, P. (2012). Missforest-non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for dna microarrays. Bioinformatics, 17(6), 520–525.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, 45, 1–67.
Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). Wiley.
Allison, P. D. (2001). Missing data. Sage Publications.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Audigier, V., Husson, F., & Josse, J. (2016). Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 86(11), 2140–2156.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., & Joachims, T. (2016). Recommendations as treatments: Debiasing learning and evaluation. In International conference on machine learning (pp. 1670–1679). PMLR.
Wang, Y., & Blei, D. M. (2019). The blessings of multiple causes. Journal of the American Statistical Association, 114(528), 1574–1596.
Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2020). Causal inference for recommender systems. In Fourteenth ACM conference on recommender systems (pp. 426–431).
Wang, X., Zhang, R., Sun, Y., & Qi, J. (2019). Doubly robust joint learning for recommendation on data missing not at random. In International conference on machine learning (pp. 6638–6647). PMLR.
Wang, Z., Akande, O., Poulos, J., & Li, F. (2021). Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison. arXiv preprint arXiv:2103.09316.
Yoon, J., Jordon, J., & Schaar, M. (2018). Gain: Missing data imputation using generative adversarial nets. In International conference on machine learning (pp. 5689–5698). PMLR.
Li, S. C.-X., Jiang, B., & Marlin, B. (2019). Misgan: Learning from incomplete data with generative adversarial networks. arXiv preprint arXiv:1902.09599.
Richardson, T. W., Wu, W., Lin, L., Xu, B., & Bernal, E. A. (2020). Mcflow: Monte carlo flow models for data imputation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14205–14214).
Nazabal, A., Olmos, P. M., Ghahramani, Z., & Valera, I. (2020). Handling incomplete heterogeneous data using vaes. Pattern Recognition, 107, 107501.
Ma, C., Tschiatschek, S., Palla, K., Hernández-Lobato, J. M., Nowozin, S., & Zhang, C. (2018). Eddi: Efficient dynamic discovery of high-value information with partial vae. arXiv preprint arXiv:1809.11142.
Mattei, P.-A., & Frellsen, J. (2019). Miwae: Deep generative modelling and imputation of incomplete data sets. In International conference on machine learning (pp. 4413–4423). PMLR.
Ipsen, N. B., Mattei, P.-A., & Frellsen, J. (2020). not-miwae: Deep generative modelling with missing not at random data. arXiv preprint arXiv:2006.12871.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). PMLR.
Wei, G. C., & Tanner, M. A. (1990). A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411), 699–704.
Neath, R. C., et al. (2013). On convergence properties of the monte carlo em algorithm. Advances in modern statistical theory and applications: a Festschrift in Honor of Morris L. Eaton (pp. 43–62).
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., & Smola, A. J. (2017). Deep sets. Advances in Neural Information Processing Systems, 30.
Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC press.
Khemakhem, I., Kingma, D., Monti, R., & Hyvarinen, A. (2020). Variational autoencoders and nonlinear ica: A unifying framework. In International conference on artificial intelligence and statistics (pp. 2207–2217). PMLR.
Dai, B., & Wipf, D. (2019). Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789.
Asuncion, A., & Newman, D. (2007). Uci machine learning repository.
Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2), 413–421.
Chen, H. Y. (2010). Compatibility of conditionally specified models. Statistics and Probability Letters, 80(7–8), 670–677.
7 Appendix
Summary of Notations
See Table 2.
Proof of Theorem 1
Proof
Under Condition (9),
Thus, p(r|x) is a function of the observed data only.
Proof of Corollary 1
Proof
Using the odds ratio parametrization [71, 72] with odds ratio function
the full-data distribution [1] can be written as
which is also a function of observed data only.
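For orientation, the general odds ratio factorization of [71] can be sketched as follows. The reference values \(x_0\) and \(r_0\) and the primed summation and integration variables are notation assumed here for illustration; this is the generic identity behind the parametrization and not necessarily the exact display used in the proof.

\[
\mathrm{OR}(r, x) = \frac{p(r \mid x)\, p(r_0 \mid x_0)}{p(r_0 \mid x)\, p(r \mid x_0)},
\qquad
p(x, r) = \frac{\mathrm{OR}(r, x)\, p(r \mid x_0)\, p(x \mid r_0)}{\sum_{r'} \int \mathrm{OR}(r', x')\, p(r' \mid x_0)\, p(x' \mid r_0)\, dx'} .
\]

The identity follows because \(\mathrm{OR}(r, x)\, p(r \mid x_0)\, p(x \mid r_0)\) equals \(p(x, r)\, p(r_0 \mid x_0) / p(r_0)\), which is proportional to \(p(x, r)\). A common choice in missing-data settings is to take \(r_0\) as the fully observed pattern, in which case \(p(x \mid r_0)\) is the complete-case distribution.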
Implementation Details
In the experiments in Sect. 5, the dimension of the latent space is set to \(p-1\) for all three generative models on both datasets, and the number of importance samples K is set to 20. All three models share the same nonlinear structure for the decoder of the missingness indicators. The encoder and the decoders each use one hidden layer of dimension 128. Rather than being fixed, the observational noise for the continuous data variables in the decoder is a learnable parameter, as proposed in [69]. All methods are trained with the Adam optimizer, with batch size 16 and learning rate 0.001, for 30000 epochs. During imputation, the number of importance samples L is set to 10000. The two classical imputation methods are implemented with the IterativeImputer from the sklearn package in Python: MICE uses the default estimator with a maximum of 100 iterations, and missForest uses a RandomForestRegressor with 100 estimators.
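The configuration of the two classical baselines might look like the following minimal sketch. The toy data, the missing-completely-at-random mask, the random seeds, and the helper names `make_toy_data`, `build_imputers`, and `impute_all` are assumptions made for illustration and do not come from the chapter; only `max_iter=100` for MICE and `n_estimators=100` for missForest are taken from the text. The iteration count for the missForest-style imputer is not stated, so it is left at the sklearn default.

```python
import numpy as np
# IterativeImputer is experimental in sklearn; this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def make_toy_data(n=200, seed=0):
    """Hypothetical correlated 3-column dataset with ~20% entries masked."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=n)
    x1 = 0.8 * x0 + 0.3 * rng.normal(size=n)
    x2 = -0.5 * x0 + 0.3 * rng.normal(size=n)
    X = np.column_stack([x0, x1, x2])
    # Assumption: an MCAR mask, chosen only to keep the sketch simple.
    mask = rng.random(X.shape) < 0.2  # True marks a missing entry
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X, X_miss, mask


def build_imputers(seed=0):
    """MICE-style and missForest-style imputers built on IterativeImputer."""
    # Default estimator (BayesianRidge) with at most 100 iterations.
    mice = IterativeImputer(max_iter=100, random_state=seed)
    # Random forest with 100 trees as the per-column regressor.
    miss_forest = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
        random_state=seed,
    )
    return {"MICE": mice, "missForest": miss_forest}


def impute_all(X_miss, seed=0):
    """Fit each imputer on the incomplete matrix and return completed copies."""
    return {name: imp.fit_transform(X_miss)
            for name, imp in build_imputers(seed).items()}


X_full, X_miss, mask = make_toy_data()
imputed = impute_all(X_miss)
for name, X_hat in imputed.items():
    rmse = np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2))
    print(f"{name}: RMSE on masked entries = {rmse:.3f}")
```

With its default `sample_posterior=False`, `IterativeImputer` returns a single completed dataset, so this sketch corresponds to single imputation by chained equations rather than to multiple imputation.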
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Xie, H., Xue, F., Wang, X. (2024). Generative Models for Missing Data. In: Lyu, Z. (eds) Applications of Generative AI. Springer, Cham. https://doi.org/10.1007/978-3-031-46238-2_27
DOI: https://doi.org/10.1007/978-3-031-46238-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46237-5
Online ISBN: 978-3-031-46238-2
eBook Packages: Computer Science, Computer Science (R0)