An efficient random forests algorithm for high dimensional data classification
In this paper, we propose a new random forest (RF) algorithm for classifying high dimensional data, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time needed to build the RF model. Extensive experiments were conducted on high dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrated that the proposed approach for learning RFs significantly reduced prediction errors and outperformed most existing RFs when dealing with high dimensional data.
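To make the idea of greedy node splitting on a categorical feature concrete, the following is a minimal Python sketch. It is not the algorithm from the paper, whose details are not given here. It assumes Gini impurity as the split criterion and a simple forward-selection rule; the function name `greedy_categorical_split` and its helpers are hypothetical. The point it illustrates is that a greedy search needs on the order of k² impurity evaluations for k categories, rather than enumerating all 2^(k-1) - 1 two-way partitions.

```python
from collections import Counter


def gini(counts, n):
    """Gini impurity of a node given per-class counts and node size n."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n


def greedy_categorical_split(values, labels):
    """Greedily choose the categories sent to the left branch.

    Illustrative assumption, not the paper's method: start with every
    category on the right, then repeatedly move the single category that
    lowers the weighted Gini impurity the most, and stop when no move helps.
    """
    # Class counts for each category of the feature.
    per_cat = {}
    for v, y in zip(values, labels):
        per_cat.setdefault(v, Counter())[y] += 1

    left, right = Counter(), Counter(labels)
    left_cats, remaining = set(), set(per_cat)
    best = gini(right, len(labels))  # impurity before splitting

    while len(remaining) > 1:  # keep at least one category on the right
        candidate, candidate_score = None, best
        for cat in sorted(remaining):  # sorted for deterministic tie-breaking
            score = weighted_gini(left + per_cat[cat], right - per_cat[cat])
            if score < candidate_score:
                candidate, candidate_score = cat, score
        if candidate is None:
            break  # no single move improves the split
        left += per_cat[candidate]
        right -= per_cat[candidate]
        left_cats.add(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return left_cats, best
```

For example, calling `greedy_categorical_split(["a", "a", "b", "b", "c", "c"], [0, 0, 1, 1, 0, 0])` sends category `"b"` to the left branch and reaches an impurity of 0.0, since `"b"` alone holds every sample of class 1. A greedy search of this kind can miss the best partition, which is the price paid for the lower cost.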
Keywords: Classification; Image classification; High dimensional data; Random forests; Data mining
Mathematics Subject Classification: 68T01
Part of this work was done while the author Thanh-Tung Nguyen was visiting the Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, and the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.