Towards a Methodology for Addressing Missingness in Datasets, with an Application to Demographic Health Datasets

  • Conference paper
Artificial Intelligence Research (SACAIR 2022)

Abstract

Missing data is a common concern in health datasets, and its impact on good decision-making processes is well documented. Our study’s contribution is a methodology for tackling missing data problems using a combination of synthetic dataset generation, missing data imputation and deep learning methods. Specifically, we conducted a series of experiments with these objectives: a) generating a realistic synthetic dataset, b) simulating data missingness, c) recovering the missing data, and d) analyzing imputation performance. Our methodology used a Gaussian mixture model, whose parameters were learned from a cleaned subset of a real demographic and health dataset, to generate the synthetic data. We simulated missingness degrees of \(10\%\), \(20\%\), \(30\%\), and \(40\%\) under the missing completely at random (MCAR) scheme. We used an integrated performance analysis framework involving clustering, classification and direct imputation analysis. Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of \(83\%\) and \(80\%\) on a) an unseen real dataset and b) an unseen reserved synthetic test dataset, respectively. Moreover, the models that used the DAE method for imputation yielded the lowest log loss, an indication of good performance, even though their accuracy was slightly lower. In conclusion, our work demonstrates that, using our methodology, one can reverse-engineer a solution for resolving missingness in an unseen dataset. Although we used a health dataset, our methodology can be applied in other contexts.
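The four experimental steps in the abstract (generate, mask, impute, evaluate) can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration: the one-dimensional two-component mixture and its parameters are hypothetical (the paper learns its mixture parameters from real data), mean imputation stands in for the imputation methods actually studied (it is not the DAE method), and RMSE on the masked entries is only one possible direct imputation measure.

```python
import math
import random


def sample_mixture(n, weights, means, stds, rng):
    """Draw n values from a 1-D Gaussian mixture with the given parameters."""
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[c], stds[c]) for c in components]


def mask_mcar(values, rate, rng):
    """Set each value to None independently with probability `rate` (MCAR)."""
    return [None if rng.random() < rate else v for v in values]


def mean_impute(values):
    """Fill None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the entries that were masked out."""
    errors = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


rng = random.Random(0)

# Hypothetical mixture parameters, chosen only for this illustration.
synthetic = sample_mixture(2000, [0.6, 0.4], [0.0, 5.0], [1.0, 1.5], rng)

results = {}
for rate in (0.10, 0.20, 0.30, 0.40):
    masked = mask_mcar(synthetic, rate, rng)
    imputed = mean_impute(masked)  # stand-in imputer, not the paper's method
    observed_rate = sum(v is None for v in masked) / len(masked)
    results[rate] = (observed_rate, rmse_on_missing(synthetic, masked, imputed))

for rate, (observed_rate, error) in results.items():
    print(f"target={rate:.2f} observed_missing={observed_rate:.3f} rmse={error:.3f}")
```

Because the synthetic ground truth is known before masking, the imputation error can be measured directly, which is what makes the synthetic-data route useful for comparing imputers before applying one to a real dataset with genuine gaps.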

Corresponding author

Correspondence to Gift Khangamwa.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Khangamwa, G., van Zyl, T., van Alten, C.J. (2022). Towards a Methodology for Addressing Missingness in Datasets, with an Application to Demographic Health Datasets. In: Pillay, A., Jembere, E., Gerber, A. (eds) Artificial Intelligence Research. SACAIR 2022. Communications in Computer and Information Science, vol 1734. Springer, Cham. https://doi.org/10.1007/978-3-031-22321-1_12

  • DOI: https://doi.org/10.1007/978-3-031-22321-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22320-4

  • Online ISBN: 978-3-031-22321-1

  • eBook Packages: Computer Science, Computer Science (R0)
