Abstract
Missing data is a common concern in health datasets, and its impact on sound decision-making is well documented. Our study contributes a methodology that combines synthetic dataset generation, missing data imputation, and deep learning methods to resolve missing data challenges. Specifically, we conducted a series of experiments with four objectives: a) generating a realistic synthetic dataset, b) simulating data missingness, c) recovering the missing data, and d) analyzing imputation performance. To generate the synthetic data, our methodology used a Gaussian mixture model whose parameters were learned from a cleaned subset of a real demographic and health dataset. We simulated missingness degrees of \(10\%\), \(20\%\), \(30\%\), and \(40\%\) under the missing completely at random (MCAR) scheme. We used an integrated performance analysis framework involving clustering, classification, and direct imputation analysis. Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of \(83\%\) and \(80\%\) on a) an unseen real dataset and b) an unseen reserved synthetic test dataset, respectively. Moreover, the models that used the DAE method for imputation yielded the lowest log loss, an indication of good performance, even though their accuracy was slightly lower. In conclusion, our work demonstrates that, using our methodology, one can reverse engineer a solution to resolve missingness on an unseen dataset with missingness. Moreover, though we used a health dataset, our methodology can be utilized in other contexts.
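As a rough illustration of the generate, mask, impute, and evaluate loop summarized in the abstract, the following minimal sketch uses only the Python standard library. It rests on several assumptions that are not from the paper: the two-component mixture parameters (`weights`, `means`, `stds`) are invented rather than learned from a real dataset, the covariance is taken to be diagonal, and column-mean imputation stands in for the imputation methods the study actually compares. All function names are hypothetical.

```python
import math
import random


def sample_mixture(n, weights, means, stds, rng):
    """Draw n rows from a diagonal-covariance Gaussian mixture."""
    rows = []
    for _ in range(n):
        # Pick a component according to the mixture weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        rows.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return rows


def apply_mcar(rows, rate, rng):
    """Blank each cell independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Fill each missing cell with the mean of its column's observed values."""
    col_means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the cells that were blanked out."""
    errs = [(t - i) ** 2
            for t_row, m_row, i_row in zip(truth, masked, imputed)
            for t, m, i in zip(t_row, m_row, i_row) if m is None]
    return math.sqrt(sum(errs) / len(errs))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed mixture parameters, for illustration only (not learned from data).
weights = [0.6, 0.4]
means = [[0.0, 5.0], [4.0, -1.0]]
stds = [[1.0, 0.5], [0.8, 1.2]]
synthetic = sample_mixture(2000, weights, means, stds, rng)

results = {}
for rate in (0.1, 0.2, 0.3, 0.4):  # the missingness degrees in the abstract
    masked = apply_mcar(synthetic, rate, rng)
    imputed = mean_impute(masked)
    results[rate] = rmse_on_missing(synthetic, masked, imputed)
    print(f"missing rate {rate:.0%}: RMSE on imputed cells = {results[rate]:.3f}")
```

Because the synthetic ground truth is known, the error on the masked cells can be measured directly, which is what makes direct imputation analysis possible; a stronger imputer could be dropped in place of `mean_impute` without changing the rest of the loop.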
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Khangamwa, G., van Zyl, T., van Alten, C.J. (2022). Towards a Methodology for Addressing Missingness in Datasets, with an Application to Demographic Health Datasets. In: Pillay, A., Jembere, E., Gerber, A. (eds) Artificial Intelligence Research. SACAIR 2022. Communications in Computer and Information Science, vol 1734. Springer, Cham. https://doi.org/10.1007/978-3-031-22321-1_12
DOI: https://doi.org/10.1007/978-3-031-22321-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22320-4
Online ISBN: 978-3-031-22321-1
eBook Packages: Computer Science, Computer Science (R0)