Abstract
This paper compares several paradigms for missing data imputation. The data set is HIV seroprevalence data from an antenatal clinic survey conducted in 2001. Imputation is performed with five methods: Random Forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two hybrids based on random forests and neural networks. The results indicate that Random Forests are superior in imputing missing data for this data set in both accuracy and computation time, with average accuracy increases of up to 32% for certain variables compared with auto-associative networks. Although hybrid systems show promise, the systems presented appear to be hindered by their auto-associative neural network components.
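The abstract does not describe how the random forest imputation is implemented, so the following is only a minimal sketch of the general idea, not the authors' method: train a small forest of bootstrapped regression trees on the complete records and use the averaged tree predictions to fill the missing entries of one variable. All function names, parameters (`n_trees`, `max_depth`, `min_leaf`) and the toy data are hypothetical and chosen for illustration.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (split-quality criterion)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, depth=0, max_depth=4, min_leaf=2):
    """Grow one regression tree; each node tries a random subset of features."""
    if depth >= max_depth or len(set(y)) == 1:
        return mean(y)  # leaf: average target of the samples that reached it
    n_features = len(X[0])
    best = None
    for f in rng.sample(range(n_features), max(1, n_features // 2)):
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return mean(y)  # no admissible split: fall back to a leaf

    _, f, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, depth + 1, max_depth, min_leaf)

    return (f, t, grow(left), grow(right))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap resample of the complete records."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def impute_column(records, col, n_trees=25, seed=0):
    """Fill None entries in column `col` with the forest's averaged prediction."""
    def features(r):
        return [v for j, v in enumerate(r) if j != col]

    complete = [r for r in records if r[col] is not None]
    forest = fit_forest([features(r) for r in complete],
                        [r[col] for r in complete], n_trees, seed)
    filled = []
    for r in records:
        r = list(r)
        if r[col] is None:
            r[col] = mean([predict_tree(tree, features(r)) for tree in forest])
        filled.append(r)
    return filled


# Toy data (hypothetical): column 2 is 10 when column 0 exceeds 10, else 0.
records = [[i, (i * 7) % 5, 10 if i > 10 else 0] for i in range(1, 21)]
records.append([15, 3, None])  # missing value, expected to be imputed near 10
records.append([3, 1, None])   # missing value, expected to be imputed near 0
filled = impute_column(records, col=2)
```

This sketch handles a single numeric variable with missing values and assumes the remaining variables are fully observed; categorical variables, several incomplete columns and the hybrid configurations named in the abstract would need additional handling.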
Keywords
- auto-associative
- imputation
- missing data
- neural network
- random forest
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pantanowitz, A., Marwala, T. (2009). Missing Data Imputation Through the Use of the Random Forest Algorithm. In: Yu, W., Sanchez, E.N. (eds) Advances in Computational Intelligence. Advances in Intelligent and Soft Computing, vol 116. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03156-4_6
DOI: https://doi.org/10.1007/978-3-642-03156-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03155-7
Online ISBN: 978-3-642-03156-4
eBook Packages: Engineering, Engineering (R0)