Missing Data Imputation Through the Use of the Random Forest Algorithm

  • Conference paper
Advances in Computational Intelligence

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 116)

Abstract

This paper compares several paradigms for missing data imputation. The data set is HIV seroprevalence data from an antenatal clinic survey conducted in 2001. Imputation is performed with five methods: Random Forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two hybrids based on random forests and neural networks. Results indicate that Random Forests are superior for imputing missing data in the given data set, in both accuracy and computation time, with accuracy increases of up to 32% on average for certain variables when compared with auto-associative networks. While the concept of hybrid systems has promise, the presented systems appear to be hindered by their auto-associative neural network components.
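The abstract does not describe the implementation, so the following is only an illustrative sketch of the general idea of random forest imputation: train a bagged ensemble of regression trees on the records where a variable is observed, then predict that variable for the records where it is missing. Everything here is an assumption made for illustration: the function names (`grow_tree`, `fit_forest`, `impute_column`), the tree settings, and the synthetic three-column data set, which stands in for the survey data and is not taken from the paper.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean of `values`."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, rng, depth=0, max_depth=5, min_leaf=3):
    """Grow one regression tree; each split considers a random feature subset."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return avg(y)  # leaf: predict the mean target
    n_features = len(X[0])
    features = rng.sample(range(n_features), max(1, n_features // 2))
    best = None  # (error, feature index, threshold)
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, threshold)
    if best is None:
        return avg(y)
    _, f, threshold = best
    lo = [i for i, row in enumerate(X) if row[f] <= threshold]
    hi = [i for i, row in enumerate(X) if row[f] > threshold]
    return (f, threshold,
            grow_tree([X[i] for i in lo], [y[i] for i in lo], rng,
                      depth + 1, max_depth, min_leaf),
            grow_tree([X[i] for i in hi], [y[i] for i in hi], rng,
                      depth + 1, max_depth, min_leaf))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are numbers."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(X, y, n_trees=20, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    return avg([predict_tree(tree, row) for tree in forest])

def impute_column(rows, target):
    """Return a copy of `rows` with None entries in column `target` filled in.

    Assumes all other columns are fully observed (a simplification).
    """
    others = [c for c in range(len(rows[0])) if c != target]
    known = [r for r in rows if r[target] is not None]
    forest = fit_forest([[r[c] for c in others] for r in known],
                        [r[target] for r in known])
    filled = []
    for r in rows:
        r = list(r)
        if r[target] is None:
            r[target] = predict_forest(forest, [r[c] for c in others])
        filled.append(r)
    return filled

# Hypothetical synthetic data: the third column depends on the first two.
data_rng = random.Random(42)
truth = []
for _ in range(120):
    a, b = data_rng.uniform(0, 10), data_rng.uniform(0, 10)
    truth.append([a, b, 2 * a + b + data_rng.gauss(0, 0.5)])

# Hide the third column in every fifth record, then impute it.
observed = [row[:2] + [None] if i % 5 == 0 else list(row)
            for i, row in enumerate(truth)]
completed = impute_column(observed, target=2)
```

Comparing `completed` with `truth` on the hidden entries gives a simple accuracy check for this toy setting. It says nothing about the accuracy figures reported in the paper, which were obtained on the survey data.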

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pantanowitz, A., Marwala, T. (2009). Missing Data Imputation Through the Use of the Random Forest Algorithm. In: Yu, W., Sanchez, E.N. (eds) Advances in Computational Intelligence. Advances in Intelligent and Soft Computing, vol 116. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03156-4_6

  • DOI: https://doi.org/10.1007/978-3-642-03156-4_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03155-7

  • Online ISBN: 978-3-642-03156-4

  • eBook Packages: Engineering, Engineering (R0)
