Towards a Methodology for Addressing Missingness in Datasets, with an Application to Demographic Health Datasets

Khangamwa, Gift; van Zyl, Terence; van Alten, Clint J.

doi:10.1007/978-3-031-22321-1_12

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1734)

Included in the following conference series:

Southern African Conference for Artificial Intelligence Research

Abstract

Missing data is a common concern in health datasets, and its impact on good decision-making processes is well documented. Our study’s contribution is a methodology that combines synthetic dataset generation, missing data imputation, and deep learning methods to resolve missing data challenges. Specifically, we conducted a series of experiments with four objectives: a) generating a realistic synthetic dataset, b) simulating data missingness, c) recovering the missing data, and d) analyzing imputation performance. To generate the synthetic data, our methodology used a Gaussian mixture model whose parameters were learned from a cleaned subset of a real demographic and health dataset. We simulated missingness degrees of \(10\%\), \(20\%\), \(30\%\), and \(40\%\) under the missing completely at random (MCAR) scheme. We used an integrated performance analysis framework involving clustering, classification, and direct imputation analysis. Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of \(83\%\) and \(80\%\) on a) an unseen real dataset and b) an unseen reserved synthetic test dataset, respectively. Moreover, the models that used the DAE method for imputation yielded the lowest log loss, an indication of good performance, even though their accuracy measures were slightly lower. In conclusion, our work demonstrates that, using our methodology, one can reverse engineer a solution to resolve missingness on an unseen dataset. Although we used a health dataset, our methodology can be applied in other contexts.
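The four experimental steps named in the abstract (generate, mask, recover, evaluate) can be illustrated with a minimal sketch. This is not the authors' implementation. The mixture parameters below are hypothetical values chosen for illustration, whereas the paper learns them from a cleaned subset of real data (for instance, one could fit them with scikit-learn's GaussianMixture). A simple column-mean imputer stands in for the DAE method, and the function names are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture parameters (illustrative, not from the paper).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]])
covs = np.array([np.eye(3), 0.5 * np.eye(3)])


def sample_gmm(n, weights, means, covs, rng):
    """Draw n rows from a Gaussian mixture: pick a component, then sample from it."""
    comps = rng.choice(len(weights), size=n, p=weights)
    X = np.empty((n, means.shape[1]))
    for k in range(len(weights)):
        idx = comps == k
        X[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))
    return X


def apply_mcar(X, rate, rng):
    """Hide each cell independently with probability `rate` (MCAR)."""
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Fill NaNs with the observed column mean (a baseline stand-in, not the DAE)."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def imputation_rmse(X_true, X_imputed, mask):
    """Root mean squared error computed over the masked cells only."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


X = sample_gmm(2000, weights, means, covs, rng)   # a) synthetic data
results = {}
for rate in (0.1, 0.2, 0.3, 0.4):
    X_missing, mask = apply_mcar(X, rate, rng)    # b) simulate missingness
    X_imputed = mean_impute(X_missing)            # c) recover missing values
    results[rate] = imputation_rmse(X, X_imputed, mask)  # d) direct analysis
    print(f"missing rate {rate:.0%}: RMSE on masked cells = {results[rate]:.3f}")
```

Because the complete synthetic data is known before masking, the imputation error can be measured directly on the hidden cells, which is not possible on a real dataset whose missing values were never observed.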

Author information

Authors and Affiliations

School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
Gift Khangamwa & Clint J. van Alten
Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa
Terence van Zyl

Authors

Gift Khangamwa
Terence van Zyl
Clint J. van Alten

Corresponding author

Correspondence to Gift Khangamwa.

Editor information

Editors and Affiliations

University of KwaZulu-Natal, Durban, South Africa
Anban Pillay
University of KwaZulu-Natal, Durban, South Africa
Edgar Jembere
University of Pretoria, Pretoria, South Africa
Aurona Gerber

Cite this paper

Khangamwa, G., van Zyl, T., van Alten, C.J. (2022). Towards a Methodology for Addressing Missingness in Datasets, with an Application to Demographic Health Datasets. In: Pillay, A., Jembere, E., Gerber, A. (eds) Artificial Intelligence Research. SACAIR 2022. Communications in Computer and Information Science, vol 1734. Springer, Cham. https://doi.org/10.1007/978-3-031-22321-1_12

DOI: https://doi.org/10.1007/978-3-031-22321-1_12
Published: 28 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22320-4
Online ISBN: 978-3-031-22321-1
eBook Packages: Computer Science, Computer Science (R0)
