
Generative Models for Missing Data

Chapter in Applications of Generative AI

Abstract

Missing data poses a ubiquitous challenge across a wide range of applications, arising from causes that are both diverse and context-dependent. Because most advanced data analysis techniques are tailored to complete datasets, effective imputation methods are indispensable. In this chapter, we examine missing data from a statistical perspective, offering a holistic review of its intricate nature. We investigate the mechanisms underlying missing data, shedding light on their ignorability and identifiability, two concepts fundamental to understanding and addressing this pervasive issue. We then give a succinct overview of influential classical imputation methods and their contributions to the field. Building on this foundation, we survey recent advances in generative models, a burgeoning area that holds great promise for learning from and imputing missing data. Furthermore, we introduce an approach that addresses the critical problem of nonparametric identifiability for nonignorable missing data through the use of generative models, aiming to overcome the limitations associated with alternative generative models. To illustrate the proposed method, we present curated numerical examples that distinguish its effectiveness from other baselines in specific scenarios. Through this exploration, we hope to pave the way for further research in this critical domain, ultimately leading to more accurate and reliable analyses and interpretations of incomplete datasets.


References

  1. Li, Y., Miao, W., Shpitser, I., & Tchetgen, E. J. T. (2022). A self-censoring model for multivariate nonignorable nonmonotone missing data. arXiv preprint arXiv:2207.08535.

  2. Malinsky, D., Shpitser, I., & Tchetgen Tchetgen, E. J. (2021). Semiparametric inference for nonmonotone missing-not-at-random data: The no self-censoring model. Journal of the American Statistical Association, 1–9.

  3. Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2018). The deconfounded recommender: A causal inference approach to recommendation. arXiv preprint arXiv:1808.06581.

  4. Marlin, B. M., & Zemel, R. S. (2009). Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on recommender systems (pp. 5–12).

  5. Ghalebikesabi, S., Cornish, R., Holmes, C., & Kelly, L. (2021). Deep generative missingness pattern-set mixture models. In International conference on artificial intelligence and statistics (pp. 3727–3735). PMLR.

  6. Xue, F., & Qu, A. (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, 116(536), 1914–1927.

  7. Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., & Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(1), 1–37.

  8. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

  9. Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modeling versus mixture modeling with nonignorable nonresponse. In Drawing inferences from self-selected samples (pp. 115–142). Springer.

  10. Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421), 125–134.

  11. Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). Wiley.

  12. Shrive, F. M., Stuart, H., Quan, H., & Ghali, W. A. (2006). Dealing with missing data in a multi-question depression scale: A comparison of imputation methods. BMC Medical Research Methodology, 6, 1–10.

  13. Jakobsen, J. C., Gluud, C., Wetterslev, J., & Winkel, P. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials: A practical guide with flowcharts. BMC Medical Research Methodology, 17(1), 1–10.

  14. Hernández-Lobato, J. M., Houlsby, N., & Ghahramani, Z. (2014). Probabilistic matrix factorization with non-random missing data. In International conference on machine learning (pp. 1512–1520). PMLR.

  15. Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: An introduction. Cambridge University Press.

  16. Ma, C., & Zhang, C. (2021). Identifiable generative models for missing not at random data imputation. Advances in Neural Information Processing Systems, 34, 27645–27658.

  17. Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359), 538–543.

  18. Robins, J. M. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1), 21–37.

  19. Vansteelandt, S., Goetghebeur, E., Kenward, M. G., & Molenberghs, G. (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica, 953–979.

  20. Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman and Hall/CRC.

  21. Sadinle, M., & Reiter, J. P. (2018). Sequential identification of nonignorable missing data mechanisms. Statistica Sinica, 28(4), 1741–1759.

  22. Gill, R. D., Laan, M. J., & Robins, J. M. (1997). Coarsening at random: Characterizations, conjectures, counter-examples. In Proceedings of the First Seattle Symposium in Biostatistics (pp. 255–294). Springer.

  23. Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 1097–1116.

  24. Miao, W., Ding, P., & Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516), 1673–1683.

  25. Miao, W., & Tchetgen, E. J. T. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2), 475–482.

  26. d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1), 1–15.

  27. Liu, L., Miao, W., Sun, B., Robins, J., & Tchetgen, E. T. (2020). Identification and inference for marginal average treatment effect on the treated with an instrumental variable. Statistica Sinica, 30(3), 1517.

  28. Sun, B., Liu, L., Miao, W., Wirth, K., Robins, J., & Tchetgen, E. J. T. (2018). Semiparametric estimation with data missing not at random using an instrumental variable. Statistica Sinica, 28(4), 1965.

  29. Tchetgen, E. J. T., Wang, L., & Sun, B. (2018). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica, 28(4), 2069.

  30. Linero, A. R. (2017). Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness. Biometrika, 104(2), 327–341.

  31. Fay, R. E. (1986). Causal models for patterns of nonresponse. Journal of the American Statistical Association, 81(394), 354–365.

  32. Ma, W.-Q., Geng, Z., & Hu, Y.-H. (2003). Identification of graphical models for nonignorable nonresponse of binary outcomes in longitudinal studies. Journal of Multivariate Analysis, 87(1), 24–45.

  33. Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, 116(534), 1023–1037.

  34. Nabi, R., Bhattacharya, R., & Shpitser, I. (2020). Full law identification in graphical models of missing data: Completeness results. In International conference on machine learning (pp. 7153–7163). PMLR.

  35. Shpitser, I. (2016). Consistent estimation of functions of data missing non-monotonically and not at random. Advances in Neural Information Processing Systems, 29.

  36. Sadinle, M., & Reiter, J. P. (2017). Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika, 104(1), 207–220.

  37. Kim, K.-Y., Kim, B.-J., & Yi, G.-S. (2004). Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5(1), 1–9.

  38. Stekhoven, D. J., & Bühlmann, P. (2012). MissForest: Non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.

  39. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.

  40. Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67.

  41. Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.

  42. Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). Wiley.

  43. Allison, P. D. (2001). Missing data. Sage Publications.

  44. Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

  45. Audigier, V., Husson, F., & Josse, J. (2016). Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 86(11), 2140–2156.

  46. Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.

  47. Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., & Joachims, T. (2016). Recommendations as treatments: Debiasing learning and evaluation. In International conference on machine learning (pp. 1670–1679). PMLR.

  48. Wang, Y., & Blei, D. M. (2019). The blessings of multiple causes. Journal of the American Statistical Association, 114(528), 1574–1596.

  49. Wang, Y., Liang, D., Charlin, L., & Blei, D. M. (2020). Causal inference for recommender systems. In Fourteenth ACM conference on recommender systems (pp. 426–431).

  50. Wang, X., Zhang, R., Sun, Y., & Qi, J. (2019). Doubly robust joint learning for recommendation on data missing not at random. In International conference on machine learning (pp. 6638–6647). PMLR.

  51. Wang, Z., Akande, O., Poulos, J., & Li, F. (2021). Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison. arXiv preprint arXiv:2103.09316.

  52. Yoon, J., Jordon, J., & van der Schaar, M. (2018). GAIN: Missing data imputation using generative adversarial nets. In International conference on machine learning (pp. 5689–5698). PMLR.

  53. Li, S. C.-X., Jiang, B., & Marlin, B. (2019). MisGAN: Learning from incomplete data with generative adversarial networks. arXiv preprint arXiv:1902.09599.

  54. Richardson, T. W., Wu, W., Lin, L., Xu, B., & Bernal, E. A. (2020). MCFlow: Monte Carlo flow models for data imputation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14205–14214).

  55. Nazabal, A., Olmos, P. M., Ghahramani, Z., & Valera, I. (2020). Handling incomplete heterogeneous data using VAEs. Pattern Recognition, 107, 107501.

  56. Ma, C., Tschiatschek, S., Palla, K., Hernández-Lobato, J. M., Nowozin, S., & Zhang, C. (2018). EDDI: Efficient dynamic discovery of high-value information with partial VAE. arXiv preprint arXiv:1809.11142.

  57. Mattei, P.-A., & Frellsen, J. (2019). MIWAE: Deep generative modelling and imputation of incomplete data sets. In International conference on machine learning (pp. 4413–4423). PMLR.

  58. Ipsen, N. B., Mattei, P.-A., & Frellsen, J. (2020). not-MIWAE: Deep generative modelling with missing not at random data. arXiv preprint arXiv:2006.12871.

  59. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

  60. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.

  61. Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). PMLR.

  62. Wei, G. C., & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411), 699–704.

  63. Neath, R. C., et al. (2013). On convergence properties of the Monte Carlo EM algorithm. In Advances in modern statistical theory and applications: A Festschrift in honor of Morris L. Eaton (pp. 43–62).

  64. Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).

  65. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., & Smola, A. J. (2017). Deep sets. Advances in Neural Information Processing Systems, 30.

  66. Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

  67. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.

  68. Khemakhem, I., Kingma, D., Monti, R., & Hyvarinen, A. (2020). Variational autoencoders and nonlinear ICA: A unifying framework. In International conference on artificial intelligence and statistics (pp. 2207–2217). PMLR.

  69. Dai, B., & Wipf, D. (2019). Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789.

  70. Asuncion, A., & Newman, D. (2007). UCI machine learning repository.

  71. Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2), 413–421.

  72. Chen, H. Y. (2010). Compatibility of conditionally specified models. Statistics and Probability Letters, 80(7–8), 670–677.


Author information

Corresponding author: Xiao Wang.

7 Appendix

Summary of Notations

See Table 2.

Table 2 Summary of Basic Notations for Missing Data

Proof of Theorem 1

Proof

Under Condition (9),

$$\begin{aligned} p(r|x) & = \int p(r|x, \tilde{z}) p(\tilde{z}) d\tilde{z}\\ & = \int \prod ^p_{j=1} p(r_j|x_{-j}, \tilde{z}) p(\tilde{z}) d\tilde{z}\\ & = \int \prod ^p_{j=1} p(r_j|x_{-j}, R_{-j} = \boldsymbol{1}, \tilde{z}) p(\tilde{z}) d\tilde{z}. \end{aligned}$$

Since each factor in the last integrand conditions on \(R_{-j} = \boldsymbol{1}\), i.e., on records in which the remaining components are observed, p(r|x) is a function of the observed data only.

Proof of Corollary 1

Proof

Using the odds ratio parametrization [71, 72] with odds ratio function

$$\begin{aligned} \text {OR}(r,x) \equiv \text {OR}(r,x;r_0 = \boldsymbol{1}, x_0 = \boldsymbol{0}) = \frac{p (r|x)}{p (R=\boldsymbol{1}|x)}\frac{p (R=\boldsymbol{1}|X=\boldsymbol{0})}{p (r|X=\boldsymbol{0})}, \end{aligned}$$

the full-data distribution [1] can be written as

$$\begin{aligned} p (x,r) = \frac{\text {OR}(r,x)p (x|R=\boldsymbol{1})p (r|X = \boldsymbol{0})}{\sum _{r'}\mathbb {E}[\text {OR}(r',X)|R=\boldsymbol{1}]p (r'|X=\boldsymbol{0})}, \end{aligned}$$

which is also a function of observed data only.
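As a sanity check (our own verification, not part of the original proof), the denominator above is precisely the normalizing constant of the numerator: summing the numerator over \(r\) and integrating over \(x\) gives

$$\begin{aligned} \sum _{r}\int \text {OR}(r,x)\, p(x|R=\boldsymbol{1})\, p(r|X=\boldsymbol{0})\, dx = \sum _{r}\mathbb {E}[\text {OR}(r,X)\,|\,R=\boldsymbol{1}]\, p(r|X=\boldsymbol{0}), \end{aligned}$$

so \(p(x,r)\) sums and integrates to one, and every factor on the right-hand side involves only quantities estimable from the observed data, consistent with the claim above.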

Implementation Details

In the experiments in Sect. 5, for all three generative models, the dimension of the latent space is set to \(p-1\) for both datasets, and the number of importance samples K is set to 20 during training. All three models share the same nonlinear structure for the decoder of the missingness indicators, with one hidden layer of dimension 128 for both the encoder and the decoders. Rather than being fixed, the observational noise for the continuous data variables in the decoder is treated as a learnable parameter, as proposed in [69]. All methods are trained with the Adam optimizer using batch size 16 and learning rate 0.001 for 30,000 epochs. During imputation, the number of importance samples L is set to 10,000. For the two classical imputation baselines, the IterativeImputer from the sklearn package in Python is used: the default estimator with a maximum of 100 iterations for MICE, and RandomForestRegressor with 100 estimators for missForest.
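The two classical baselines can be reproduced with a short, self-contained sketch using scikit-learn's IterativeImputer, configured as described above (maximum of 100 iterations with the default estimator for MICE-style imputation, RandomForestRegressor with 100 trees for a missForest-style variant). The toy data below is illustrative only and is not taken from the chapter's experiments.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.5 * X[:, 0]           # correlation the chained regressions can exploit
X_miss = X.copy()
mask = rng.random(X.shape) < 0.2   # 20% of entries missing completely at random
X_miss[mask] = np.nan

# MICE-style: default (Bayesian ridge) estimator, chained equations, max 100 iterations
mice = IterativeImputer(max_iter=100, random_state=0)
X_mice = mice.fit_transform(X_miss)

# missForest-style: random forest with 100 trees as the conditional model
forest = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_forest = forest.fit_transform(X_miss)

assert not np.isnan(X_mice).any() and not np.isnan(X_forest).any()
```

Both imputers return a complete matrix of the original shape; the forest variant is slower per iteration, so a smaller `max_iter` is used here as a practical choice, not a setting from the chapter.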


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Xie, H., Xue, F., Wang, X. (2024). Generative Models for Missing Data. In: Lyu, Z. (eds) Applications of Generative AI. Springer, Cham. https://doi.org/10.1007/978-3-031-46238-2_27


  • DOI: https://doi.org/10.1007/978-3-031-46238-2_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46237-5

  • Online ISBN: 978-3-031-46238-2

