Missing Data Analysis Using Statistical and Machine Learning Methods in Facility-Based Maternal Health Records

  • Original Research
SN Computer Science

Abstract

Missing data are the rule rather than the exception in quantitative research. What remains unclear, however, is the extent, pattern, mechanism, and appropriate treatment of missingness in facility-based paper maternal health records. We used data from maternal health records at Kawempe National Referral Hospital, Uganda, restricted to women who gave birth at the hospital between January 2017 and January 2021. The analysis was carried out in RStudio using frequency distributions and the Pearson χ² test. Missingness was treated using listwise deletion (LD), mode imputation, multiple imputation by chained equations (MICE), K-nearest neighbors (KNN) imputation, and random forest (RF) imputation. The methods were compared on prediction accuracy and with a Kruskal–Wallis test on the standard errors (SEs) obtained from a logistic regression fitted after each treatment. Overall, 5% of the data values were missing, and the proportion missing per variable ranged from 1.4% to 20.7%. Case-wise, 2498 of the 4626 cases (54%) had at least one variable with a missing value. The pattern of missingness was arbitrary, and the data suggest a mechanism of either missing at random or missing completely at random. With the exception of LD, the SEs from the logistic regression did not differ across the imputation methods (p > 0.05). LD also yielded the lowest prediction accuracy, whereas prediction accuracy varied little across MICE, mode imputation, KNN, and RF. Missingness in facility-based health records should not be ignored, and researchers need to pay attention to both overall and case-wise missingness.
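The distinction between overall and case-wise missingness, and the two simplest treatments named in the abstract (listwise deletion and mode imputation), can be illustrated with a minimal sketch. It is written in plain Python rather than the R environment used in the study, and the toy records, variable names, and function names are hypothetical and are not the study's data or code.

```python
from collections import Counter

# Toy records (hypothetical values, NOT the study's data); None marks a missing entry.
records = [
    {"age_group": "20-24", "parity": "1", "delivery_mode": "vaginal"},
    {"age_group": None, "parity": "2", "delivery_mode": "caesarean"},
    {"age_group": "25-29", "parity": None, "delivery_mode": "vaginal"},
    {"age_group": "20-24", "parity": "2", "delivery_mode": None},
    {"age_group": "20-24", "parity": "2", "delivery_mode": "vaginal"},
]
variables = list(records[0])


def overall_missing(rows):
    """Share of all cells (rows x variables) that are missing."""
    cells = [row[v] for row in rows for v in variables]
    return sum(c is None for c in cells) / len(cells)


def variable_missing(rows):
    """Share of missing values within each variable."""
    return {v: sum(r[v] is None for r in rows) / len(rows) for v in variables}


def casewise_missing(rows):
    """Share of cases (rows) with at least one missing variable."""
    return sum(any(r[v] is None for v in variables) for r in rows) / len(rows)


def listwise_deletion(rows):
    """Keep only the cases that are complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def mode_imputation(rows):
    """Replace each missing value with the most frequent observed value of its variable."""
    modes = {}
    for v in variables:
        observed = [r[v] for r in rows if r[v] is not None]
        modes[v] = Counter(observed).most_common(1)[0][0]
    return [
        {v: (r[v] if r[v] is not None else modes[v]) for v in variables}
        for r in rows
    ]


print("overall:", overall_missing(records))
print("per variable:", variable_missing(records))
print("case-wise:", casewise_missing(records))
print("cases kept by listwise deletion:", len(listwise_deletion(records)))
print("cases kept by mode imputation:", len(mode_imputation(records)))
```

In this toy example only 3 of 15 cells are missing, yet 3 of 5 cases are incomplete, so listwise deletion discards most of the sample while mode imputation retains every case. The abstract reports the same effect at scale: 5% overall missingness against 2498/4626 ≈ 54% of cases with at least one missing value.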

Data Availability

The data are available upon reasonable request.

Acknowledgements

We thank the African Centre of Excellence in Data Science, University of Rwanda, for the financial support that made data collection possible. We are grateful to Kawempe National Referral Hospital for permission to collect the data used in this study. We are also indebted to the Department of Statistical Methods and Actuarial Science and the School of Statistics and Planning, Makerere University, for their input, technical guidance, and support.

Funding

Partial funding to support data collection was obtained from the African Centre of Excellence in Data Science, University of Rwanda.

Author information

Corresponding author

Correspondence to Shaheen M. Z. Memon.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical Approval

Ethical approval for the study was obtained from the Uganda National Council for Science and Technology (Ref: HS977ES) and the Mulago Hospital Research and Ethics Committee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Appendix Tables 5, 6 and 7.

Table 5 Variance inflation factors (VIFs) of the independent variables in the logistic regression
Table 6 Prediction accuracy of the logistic regression for different values of K in KNN imputation
Table 7 Standard errors from the logistic regression after imputation

Cite this article

Memon, S.M.Z., Wamala, R. & Kabano, I.H. Missing Data Analysis Using Statistical and Machine Learning Methods in Facility-Based Maternal Health Records. SN COMPUT. SCI. 3, 355 (2022). https://doi.org/10.1007/s42979-022-01249-z

  • DOI: https://doi.org/10.1007/s42979-022-01249-z
