An efficient random forests algorithm for high dimensional data classification

  • Qiang Wang
  • Thanh-Tung Nguyen
  • Joshua Z. Huang
  • Thuy Thi Nguyen
Regular Article


In this paper, we propose a new random forest (RF) algorithm for the classification of high-dimensional data, based on a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables the generation of trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building the decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational cost of building the RF model. Extensive experiments have been conducted on high-dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrate that the proposed approach significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
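The abstract refers to a greedy technique for splitting on high-cardinality categorical features. The paper's exact procedure is not reproduced here, but a standard greedy sketch for binary classification is the ordered-levels trick (described in Breiman et al.'s CART book): sort the category levels by their positive-class proportion, then scan only the k-1 ordered split points instead of all 2^(k-1) subset splits. A minimal illustration, with all function and variable names chosen for this sketch:

```python
import numpy as np

def greedy_categorical_split(cats, y):
    """Greedy binary split for a high-cardinality categorical feature.

    Sorts category levels by their positive-class proportion, then scans
    the k-1 ordered split points for the one minimizing weighted Gini
    impurity -- O(k log k) work instead of an exhaustive 2^(k-1) search.
    """
    cats = np.asarray(cats)
    y = np.asarray(y)
    levels = np.unique(cats)
    # Order levels by their fraction of positive labels.
    props = [y[cats == lv].mean() for lv in levels]
    order = [lv for _, lv in sorted(zip(props, levels), key=lambda t: t[0])]

    def gini(labels):
        # Gini impurity of a binary label vector: 2*p*(1-p).
        if len(labels) == 0:
            return 0.0
        p = labels.mean()
        return 2.0 * p * (1.0 - p)

    best_score, best_left = float("inf"), None
    for i in range(1, len(order)):
        left_set = set(order[:i])          # levels routed to the left child
        mask = np.isin(cats, list(left_set))
        yl, yr = y[mask], y[~mask]
        score = (len(yl) * gini(yl) + len(yr) * gini(yr)) / len(y)
        if score < best_score:
            best_score, best_left = score, left_set
    return best_score, best_left
```

For two-class problems this scan is known to recover the optimal binary partition of the levels; it is shown here only as a sketch of the general idea, not as the authors' specific algorithm.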


Classification · Image classification · High dimensional data · Random forests · Data mining


Part of this work was done while the author Thanh-Tung Nguyen was visiting the Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, and the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Qiang Wang (1)
  • Thanh-Tung Nguyen (2, 4)
  • Joshua Z. Huang (1)
  • Thuy Thi Nguyen (3)
  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
  2. Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
  3. Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
  4. Sorbonne Université, IRD, JEAI WARM, Unité de Modélisation Mathématiques et Informatique des Systèmes Complexes, Bondy, France
