Missing Data Imputation Through the Use of the Random Forest Algorithm

  • Adam Pantanowitz
  • Tshilidzi Marwala
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 116)


This paper compares several paradigms for missing data imputation. The data set is HIV seroprevalence data from an antenatal clinic survey conducted in 2001. Imputation is performed with five methods: Random Forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two hybrids based on random forests and neural networks. The results indicate that Random Forests are superior for imputing missing data in this data set, in both accuracy and computation time, with average accuracy gains of up to 32% for certain variables over auto-associative networks. While hybrid systems show promise in concept, the presented hybrids appear to be hindered by their auto-associative neural network components.
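As a rough illustration of the random forest approach to imputation, the sketch below fills missing entries in one column by training bagged regression trees on the rows where that column is observed. It is a simplified, standard-library-only sketch and not the authors' implementation: the function names (`build_tree`, `predict_tree`, `impute_column`), the tree depth, the number of trees, and the toy data are all assumptions made for illustration, and it assumes numeric features with missing values confined to a single column.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, depth=0, max_depth=4, min_leaf=2):
    """Grow a regression tree, trying a random feature subset at each split."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return mean(y)  # leaf: predict the mean target
    n_features = len(X[0])
    k = max(1, int(n_features ** 0.5))  # features considered per split
    best = None  # (error, feature index, threshold)
    for f in rng.sample(range(n_features), k):
        # Candidate thresholds: every observed value except the largest.
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    if best is None:
        return mean(y)
    _, f, thr = best
    lo = [i for i, row in enumerate(X) if row[f] <= thr]
    hi = [i for i, row in enumerate(X) if row[f] > thr]
    return (
        f,
        thr,
        build_tree([X[i] for i in lo], [y[i] for i in lo], rng, depth + 1, max_depth, min_leaf),
        build_tree([X[i] for i in hi], [y[i] for i in hi], rng, depth + 1, max_depth, min_leaf),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def impute_column(rows, col, n_trees=25, seed=0):
    """Return a copy of rows with None entries in column `col` filled in."""
    rng = random.Random(seed)

    def features(r):
        return [v for j, v in enumerate(r) if j != col]

    known = [r for r in rows if r[col] is not None]
    X, y = [features(r) for r in known], [r[col] for r in known]
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(known)), k=len(known))  # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    filled = []
    for r in rows:
        r = list(r)  # copy so the input is not modified
        if r[col] is None:
            # Forest prediction: average of the individual tree predictions.
            r[col] = mean(predict_tree(t, features(r)) for t in forest)
        filled.append(r)
    return filled


# Toy data (synthetic, not the survey data): the third column is x0 + x1.
data_rng = random.Random(1)
rows = []
for _ in range(60):
    a, b = data_rng.random(), data_rng.random()
    rows.append([a, b, a + b])
for i in range(0, 60, 6):
    rows[i][2] = None  # hide every sixth value

completed = impute_column(rows, col=2)
```

Each imputed value is an average of leaf means, so it always lies within the range of the observed values in that column. Measuring accuracy as the paper does would additionally require comparing imputed entries against held-out true values.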


Keywords: auto-associative, imputation, missing data, neural network, random forest





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Adam Pantanowitz (1)
  • Tshilidzi Marwala (1)
  1. School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, Wits, South Africa
