Missing Data Imputation Through the Use of the Random Forest Algorithm

Pantanowitz, Adam; Marwala, Tshilidzi

doi:10.1007/978-3-642-03156-4_6

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 116)

Abstract

This paper compares several paradigms for missing data imputation. The data set is HIV seroprevalence data from an antenatal clinic survey conducted in 2001. Imputation is performed with five methods: Random Forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two hybrids that combine random forests with neural networks. Results indicate that Random Forests are superior for the given data set in both accuracy and computation time, with accuracy increases of up to 32% on average for certain variables when compared with auto-associative networks. While hybrid systems show promise, the presented systems appear to be hindered by their auto-associative neural network components.
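
To make the idea of random-forest imputation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single numeric column with missing entries (marked `None`) and complete numeric values in all other columns. The function names (`build_tree`, `fit_forest`, `impute_column`) and the synthetic data are hypothetical, chosen only for illustration; the HIV seroprevalence data described above is not used.

```python
import random
from statistics import mean


def build_tree(rows, targets, rng, depth=0, max_depth=4, min_size=2):
    """Grow a regression tree; each split looks at a random feature subset."""
    if depth >= max_depth or len(rows) < 2 * min_size or len(set(targets)) == 1:
        return mean(targets)  # leaf: average target value
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # features tried per split
    best = None
    for f in rng.sample(range(n_features), k):
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            ml, mr = mean(left), mean(right)
            # Sum of squared errors of each side around its own mean
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold)
    if best is None:
        return mean(targets)
    _, f, threshold = best
    li = [i for i, r in enumerate(rows) if r[f] < threshold]
    ri = [i for i, r in enumerate(rows) if r[f] >= threshold]
    return (
        f,
        threshold,
        build_tree([rows[i] for i in li], [targets[i] for i in li],
                   rng, depth + 1, max_depth, min_size),
        build_tree([rows[i] for i in ri], [targets[i] for i in ri],
                   rng, depth + 1, max_depth, min_size),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] < threshold else right
    return node


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train each tree on a bootstrap sample of the complete rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], rng))
    return forest


def impute_column(data, col, n_trees=50, seed=0):
    """Return a copy of data with None entries in `col` replaced by forest predictions."""
    def features(r):
        return [v for j, v in enumerate(r) if j != col]

    known = [r for r in data if r[col] is not None]
    forest = fit_forest([features(r) for r in known],
                        [r[col] for r in known], n_trees, seed)
    filled = []
    for r in data:
        r = list(r)
        if r[col] is None:
            r[col] = mean([predict_tree(t, features(r)) for t in forest])
        filled.append(r)
    return filled


# Synthetic illustration (assumed data): y = 2*x1 + x2, with every sixth y hidden.
gen = random.Random(1)
data = []
for _ in range(60):
    x1, x2 = gen.uniform(0, 10), gen.uniform(0, 10)
    data.append([x1, x2, 2 * x1 + x2])
for i in range(0, 60, 6):
    data[i][2] = None

filled = impute_column(data, col=2)
```

Because each prediction is an average of leaf means, an imputed value always lies within the range of the observed values in that column. Extending the sketch to several incomplete columns, or to categorical variables, would need further design choices that the abstract does not specify.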

Author information

Authors and Affiliations

School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, Wits, 2050, South Africa
Adam Pantanowitz & Tshilidzi Marwala

Editor information

Editors and Affiliations

CINVESTAV, Mexico City, Mexico
Wen Yu
CINVESTAV, Guadalajara, Mexico
Edgar N. Sanchez

About this paper

Cite this paper

Pantanowitz, A., Marwala, T. (2009). Missing Data Imputation Through the Use of the Random Forest Algorithm. In: Yu, W., Sanchez, E.N. (eds) Advances in Computational Intelligence. Advances in Intelligent and Soft Computing, vol 116. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03156-4_6

DOI: https://doi.org/10.1007/978-3-642-03156-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03155-7
Online ISBN: 978-3-642-03156-4
eBook Packages: Engineering, Engineering (R0)
