Generative Models for Missing Data

Xie, Huiming; Xue, Fei; Wang, Xiao

doi:10.1007/978-3-031-46238-2_27

Abstract

Missing data poses a ubiquitous challenge across a wide range of applications, stemming from causes that are both diverse and context-dependent. Most advanced data analysis techniques are tailored to complete datasets, which underscores the need for effective imputation methods. In this chapter, we explore missing data from a statistical perspective and offer a holistic review of its intricate nature. We examine the various mechanisms underlying missing data, shedding light on their ignorability and identifiability, two fundamental concepts for understanding and addressing this pervasive issue. We also present a succinct overview of influential classical imputation methods. Building upon this foundation, we review the latest advancements in generative models, a burgeoning area that holds great promise for learning from and imputing missing data. Furthermore, we introduce an approach that uses generative models to address the critical problem of nonparametric identifiability in nonignorable missing data. This approach aims to overcome the limitations of alternative generative models and provides a potential solution to this challenging issue. To clarify the proposed method, we supplement the discussion with curated numerical examples that distinguish its effectiveness from other baselines in specific scenarios. Through this exploration, we hope to pave the way for further research in this domain, ultimately leading to more accurate and reliable analyses and interpretations of incomplete datasets.

References

Li, Y., Miao, W., Shpitser, I., & Tchetgen, E. J. T. (2022). A self-censoring model for multivariate nonignorable nonmonotone missing data. arXiv preprint arXiv:2207.08535.
Malinsky, D., Shpitser, I., & Tchetgen Tchetgen, E. J. (2021). Semiparametric inference for nonmonotone missing-not-at-random data: The no self-censoring model. Journal of the American Statistical Association, pp. 1–9.
Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2018). The deconfounded recommender: A causal inference approach to recommendation. arXiv preprint arXiv:1808.06581.
Marlin, B. M., & Zemel, R. S. (2009). Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems (pp. 5–12).
Ghalebikesabi, S., Cornish, R., Holmes, C., & Kelly, L. (2021). Deep generative missingness pattern-set mixture models. In International conference on artificial intelligence and statistics (pp. 3727–3735). PMLR.
Xue, F., & Qu, A. (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, 116(536), 1914–1927.
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., & Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(1), 1–37.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In Drawing inferences from self-selected samples (pp. 115–142). Springer.
Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421), 125–134.
Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). Wiley.
Shrive, F. M., Stuart, H., Quan, H., & Ghali, W. A. (2006). Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Medical Research Methodology, 6, 1–10.
Jakobsen, J. C., Gluud, C., Wetterslev, J., & Winkel, P. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Medical Research Methodology, 17(1), 1–10.
Hernández-Lobato, J. M., Houlsby, N., & Ghahramani, Z. (2014). Probabilistic matrix factorization with non-random missing data. In International conference on machine learning (pp. 1512–1520). PMLR.
Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: An introduction. Cambridge University Press.
Ma, C., & Zhang, C. (2021). Identifiable generative models for missing not at random data imputation. Advances in Neural Information Processing Systems, 34, 27645–27658.
Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359), 538–543.
Robins, J. M. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1), 21–37.
Vansteelandt, S., Goetghebeur, E., Kenward, M. G., & Molenberghs, G. (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica, pp. 953–979.
Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman and Hall/CRC.
Sadinle, M., & Reiter, J. P. (2018). Sequential identification of nonignorable missing data mechanisms. Statistica Sinica, 28(4), 1741–1759.
Gill, R. D., van der Laan, M. J., & Robins, J. M. (1997). Coarsening at random: Characterizations, conjectures, counter-examples. In Proceedings of the first Seattle symposium in biostatistics (pp. 255–294). Springer.
Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, pp. 1097–1116.
Miao, W., Ding, P., & Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516), 1673–1683.
Miao, W., & Tchetgen, E. J. T. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2), 475–482.
d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1), 1–15.
Liu, L., Miao, W., Sun, B., Robins, J., & Tchetgen, E. T. (2020). Identification and inference for marginal average treatment effect on the treated with an instrumental variable. Statistica Sinica, 30(3), 1517.
Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen, E. J. T. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4), 1965.
Tchetgen, E. J. T., Wang, L., & Sun, B. (2018). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica, 28(4), 2069.
Linero, A. R. (2017). Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness. Biometrika, 104(2), 327–341.
Fay, R. E. (1986). Causal models for patterns of nonresponse. Journal of the American Statistical Association, 81(394), 354–365.
Ma, W.-Q., Geng, Z., & Hu, Y.-H. (2003). Identification of graphical models for nonignorable nonresponse of binary outcomes in longitudinal studies. Journal of Multivariate Analysis, 87(1), 24–45.
Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534), 1023–1037.
Nabi, R., Bhattacharya, R., & Shpitser, I. (2020). Full law identification in graphical models of missing data: Completeness results. In International conference on machine learning (pp. 7153–7163). PMLR.
Shpitser, I. (2016). Consistent estimation of functions of data missing non-monotonically and not at random. Advances in Neural Information Processing Systems, 29.
Sadinle, M., & Reiter, J. P. (2017). Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika, 104(1), 207–220.
Kim, K.-Y., Kim, B.-J., & Yi, G.-S. (2004). Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5(1), 1–9.
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67.
Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). Wiley.
Allison, P. D. (2001). Missing data. Sage Publications.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Audigier, V., Husson, F., & Josse, J. (2016). Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 86(11), 2140–2156.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., & Joachims, T. (2016). Recommendations as treatments: Debiasing learning and evaluation. In International conference on machine learning (pp. 1670–1679). PMLR.
Wang, Y., & Blei, D. M. (2019). The blessings of multiple causes. Journal of the American Statistical Association, 114(528), 1574–1596.
Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2020). Causal inference for recommender systems. In Fourteenth ACM conference on recommender systems (pp. 426–431).
Wang, X., Zhang, R., Sun, Y., & Qi, J. (2019). Doubly robust joint learning for recommendation on data missing not at random. In International conference on machine learning (pp. 6638–6647). PMLR.
Wang, Z., Akande, O., Poulos, J., & Li, F. (2021). Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison. arXiv preprint arXiv:2103.09316.
Yoon, J., Jordon, J., & van der Schaar, M. (2018). GAIN: Missing data imputation using generative adversarial nets. In International conference on machine learning (pp. 5689–5698). PMLR.
Li, S. C.-X., Jiang, B., & Marlin, B. (2019). MisGAN: Learning from incomplete data with generative adversarial networks. arXiv preprint arXiv:1902.09599.
Richardson, T. W., Wu, W., Lin, L., Xu, B., & Bernal, E. A. (2020). MCFlow: Monte Carlo flow models for data imputation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14205–14214).
Nazabal, A., Olmos, P. M., Ghahramani, Z., & Valera, I. (2020). Handling incomplete heterogeneous data using VAEs. Pattern Recognition, 107, 107501.
Ma, C., Tschiatschek, S., Palla, K., Hernández-Lobato, J. M., Nowozin, S., & Zhang, C. (2018). EDDI: Efficient dynamic discovery of high-value information with partial VAE. arXiv preprint arXiv:1809.11142.
Mattei, P.-A., & Frellsen, J. (2019). MIWAE: Deep generative modelling and imputation of incomplete data sets. In International conference on machine learning (pp. 4413–4423). PMLR.
Ipsen, N. B., Mattei, P.-A., & Frellsen, J. (2020). not-MIWAE: Deep generative modelling with missing not at random data. arXiv preprint arXiv:2006.12871.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). PMLR.
Wei, G. C., & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411), 699–704.
Neath, R. C., et al. (2013). On convergence properties of the Monte Carlo EM algorithm. In Advances in modern statistical theory and applications: A Festschrift in honor of Morris L. Eaton (pp. 43–62).
Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., & Smola, A. J. (2017). Deep sets. Advances in Neural Information Processing Systems, 30.
Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.
Khemakhem, I., Kingma, D., Monti, R., & Hyvarinen, A. (2020). Variational autoencoders and nonlinear ICA: A unifying framework. In International conference on artificial intelligence and statistics (pp. 2207–2217). PMLR.
Dai, B., & Wipf, D. (2019). Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789.
Asuncion, A., & Newman, D. (2007). UCI machine learning repository.
Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2), 413–421.
Chen, H. Y. (2010). Compatibility of conditionally specified models. Statistics and Probability Letters, 80(7–8), 670–677.

Author information

Authors and Affiliations

Purdue University, West Lafayette, IN, USA
Huiming Xie, Fei Xue & Xiao Wang

Corresponding author

Correspondence to Xiao Wang.

Editor information

Editors and Affiliations

Department of Game Design, Uppsala Universitet, Visby, Sweden
Zhihan Lyu

7 Appendix

Summary of Notations

See Table 2.

Table 2 Summary of Basic Notations for Missing Data

Proof of Theorem 1

Proof

Under Condition (9),

$$\begin{aligned} p(r|x) & = \int p(r|x, \tilde{z}) p(\tilde{z}) d\tilde{z}\\ & = \int \prod ^p_{j=1} p(r_j|x_{-j}, \tilde{z}) p(\tilde{z}) d\tilde{z}\\ & = \int \prod ^p_{j=1} p(r_j|x_{-j}, R_{-j} = \boldsymbol{1}, \tilde{z}) p(\tilde{z}) d\tilde{z}. \end{aligned}$$

Thus, $p(r|x)$ is a function of the observed data only.
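
As an illustration that is not part of the original proof, take $p = 2$, so that $x_{-1} = x_2$ and $R_{-1} = R_2$. Under this assumption the last line of the display above specializes to

$$\begin{aligned} p(r|x) = \int p(r_1|x_2, R_2 = 1, \tilde{z})\, p(r_2|x_1, R_1 = 1, \tilde{z})\, p(\tilde{z}) d\tilde{z}. \end{aligned}$$

The factor for $r_1$ conditions on $x_2$ only within the stratum $R_2 = 1$, where $x_2$ is observed, and the factor for $r_2$ conditions on $x_1$ only within the stratum $R_1 = 1$, where $x_1$ is observed.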

Proof of Corollary 1

Proof

Using the odds ratio parametrization [71, 72] with odds ratio function

$$\begin{aligned} \text {OR}(r,x) \equiv \text {OR}(r,x;r_0 = \boldsymbol{1}, x_0 = \boldsymbol{0}) = \frac{p (r|x)}{p (R=\boldsymbol{1}|x)}\frac{p (R=\boldsymbol{1}|X=\boldsymbol{0})}{p (r|X=\boldsymbol{0})}, \end{aligned}$$

the full-data distribution [1] can be written as

$$\begin{aligned} p (x,r) = \frac{\text {OR}(r,x)p (x|R=\boldsymbol{1})p (r|X = \boldsymbol{0})}{\sum _{r'}\mathbb {E}[\text {OR}(r',X)|R=\boldsymbol{1}]p (r'|X=\boldsymbol{0})}, \end{aligned}$$

which is also a function of observed data only.
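
The factorization above can be checked numerically on a small discrete example. The sketch below is only an illustration under simplifying assumptions: a scalar $x$ taking values in {0, 1, 2}, a scalar indicator $r$ in {0, 1}, reference levels $x_0 = 0$ and $r_0 = 1$, and a made-up joint table `p_true`. The function name `reconstruct_joint` is hypothetical. It verifies the algebraic identity starting from a known joint distribution; it does not by itself show identification from incomplete data.

```python
def reconstruct_joint(p_xr):
    """Rebuild p(x, r) from OR(r, x), p(x | R=1) and p(r | X=0).

    p_xr[x][r] is a joint probability table with x in {0, ..., m-1} and
    r in {0, 1}; the reference levels are x0 = 0 and r0 = 1.
    """
    n_x = len(p_xr)
    # Conditional p(r | x) for every x.
    p_r_given_x = [[p_xr[x][r] / sum(p_xr[x]) for r in (0, 1)] for x in range(n_x)]
    # Complete-case distribution p(x | R = 1).
    p_r1 = sum(p_xr[x][1] for x in range(n_x))
    p_x_given_r1 = [p_xr[x][1] / p_r1 for x in range(n_x)]
    # Baseline missingness distribution p(r | X = 0).
    p_r_given_x0 = p_r_given_x[0]

    def odds_ratio(r, x):
        # OR(r, x) = [p(r|x) / p(R=1|x)] * [p(R=1|X=0) / p(r|X=0)]
        return (p_r_given_x[x][r] / p_r_given_x[x][1]) * (
            p_r_given_x0[1] / p_r_given_x0[r]
        )

    numer = [
        [odds_ratio(r, x) * p_x_given_r1[x] * p_r_given_x0[r] for r in (0, 1)]
        for x in range(n_x)
    ]
    # Summing the numerator over x and r gives
    # sum_{r'} E[OR(r', X) | R=1] p(r' | X=0), i.e. the denominator.
    denom = sum(sum(row) for row in numer)
    return [[v / denom for v in row] for row in numer]


# Made-up joint table p(x, r); rows index x, columns index r. Entries sum to 1.
p_true = [[0.10, 0.25], [0.05, 0.30], [0.20, 0.10]]
p_rec = reconstruct_joint(p_true)
max_abs_error = max(
    abs(p_rec[x][r] - p_true[x][r]) for x in range(3) for r in (0, 1)
)
```

Here `max_abs_error` measures how far the reconstructed table is from `p_true`; up to floating-point rounding the two tables should coincide.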

Implementation Details

In the experiments in Sect. 5, for all three generative models, we set the dimension of the latent space to $p-1$ for both datasets. The number of importance samples $K$ is set to 20. All three models have the same nonlinear structure for the decoder of missingness indicators. The encoder and the decoders each use one hidden layer of dimension 128. Instead of taking a fixed value, the observational noise for the continuous data variables in the decoder is specified as a learnable parameter, as proposed in [69]. All methods are trained with the Adam optimizer, with batch size 16 and learning rate 0.001, for 30000 epochs. During imputation, the number of importance samples $L$ is set to 10000. For the two classical imputation methods, we use the IterativeImputer from the sklearn package in Python: the default estimator with a maximum of 100 iterations for MICE, and RandomForestRegressor with 100 estimators for missForest.
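
The two classical baselines described above could be set up roughly as follows. This is a minimal sketch, not the authors' code: the helper name `build_baseline_imputers`, the random seed, and the toy data are all hypothetical, and applying the same `max_iter` to both imputers is an assumption. Note that scikit-learn still marks `IterativeImputer` as experimental, so it has to be enabled explicitly before import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def build_baseline_imputers(max_iter=100, n_estimators=100, seed=0):
    """Return MICE-style and missForest-style imputers built on IterativeImputer."""
    # MICE-style baseline: IterativeImputer with its default estimator (BayesianRidge).
    mice = IterativeImputer(max_iter=max_iter, random_state=seed)
    # missForest-style baseline: same iterative scheme with a random forest regressor.
    # Assumption: the same max_iter is used here as for the MICE-style imputer.
    miss_forest = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=n_estimators, random_state=seed),
        max_iter=max_iter,
        random_state=seed,
    )
    return {"MICE": mice, "missForest": miss_forest}


# Hypothetical toy data: 200 rows, 4 correlated columns, roughly 20% of the
# entries masked completely at random (not the chapter's datasets or mechanism).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_full = latent + 0.3 * rng.normal(size=(200, 4))
mask = rng.random(X_full.shape) < 0.2
X_missing = np.where(mask, np.nan, X_full)

# Smaller settings than the defaults above (100 iterations, 100 trees) so the
# demo runs quickly; scikit-learn may emit a convergence warning here.
imputers = build_baseline_imputers(max_iter=3, n_estimators=10)
imputed = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}

# Root-mean-squared error on the masked entries only.
rmse = {
    name: float(np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2)))
    for name, X_hat in imputed.items()
}
```

Calling `build_baseline_imputers()` with no arguments gives the 100-iteration, 100-tree configuration; the reduced values in the demo only keep the run time short.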

About this chapter

Cite this chapter

Xie, H., Xue, F., Wang, X. (2024). Generative Models for Missing Data. In: Lyu, Z. (eds) Applications of Generative AI. Springer, Cham. https://doi.org/10.1007/978-3-031-46238-2_27

DOI: https://doi.org/10.1007/978-3-031-46238-2_27
Published: 06 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46237-5
Online ISBN: 978-3-031-46238-2
eBook Packages: Computer Science, Computer Science (R0)
