
Feature-ranked self-growing forest: a tree ensemble based on structure diversity for classification and regression

  • S.I.: Latin American Computational Intelligence

Neural Computing and Applications

Abstract

Tree ensemble algorithms, such as random forest (RF), are among the most widely applied methods in machine learning. However, an important hyperparameter, the number of classification or regression trees in the ensemble, must be specified for these algorithms. The number of trees can adversely affect bias or computational cost and should ideally be adapted to each task. For this reason, a novel tree ensemble is described, the feature-ranked self-growing forest (FSF), which grows a tree ensemble automatically based on the structural diversity of the first two levels of the trees' nodes. The algorithm's performance was tested on 30 classification and 30 regression datasets and compared with RF. Its computational complexity was also analyzed theoretically and experimentally. Compared with RF, FSF achieved significantly higher performance on 57% and equivalent performance on 27% of the classification datasets, and higher performance on 70% and equivalent performance on 7% of the regression datasets. The computational complexity of FSF was competitive with that of other tree ensembles, depending mainly on the number of observations in the dataset. These results suggest that FSF is a suitable out-of-the-box approach, with potential as a tool for feature ranking and for analyzing a dataset's complexity through the number of trees computed for a particular task. MATLAB and Python implementations of the algorithm, together with working examples for classification and regression, are provided for academic use.
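The growth criterion described in the abstract can be pictured with a short sketch. The Python code below is a minimal illustration, not the authors' published algorithm: it assumes that a tree's "structure" is summarized by the split features at its root and the root's two children, that candidate trees are trained RF-style (bootstrap samples, random feature subsets), and that growth stops after a number of consecutive candidates (`patience`) add no unseen structure. The helper `top_two_level_signature`, the `patience` parameter, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def top_two_level_signature(tree):
    """Split features at the root and its two children (-2 marks a leaf)."""
    t = tree.tree_
    left, right = t.children_left[0], t.children_right[0]
    children = tuple(t.feature[c] if c != -1 else -2 for c in (left, right))
    return (t.feature[0],) + children

def grow_forest(X, y, patience=10, seed=0):
    """Grow trees until `patience` consecutive candidates add no new structure."""
    rng = np.random.default_rng(seed)
    forest, signatures, stale = [], set(), 0
    while stale < patience:
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features="sqrt",                          # RF-style feature subsets
            random_state=int(rng.integers(1 << 31)),
        ).fit(X[idx], y[idx])
        sig = top_two_level_signature(tree)
        if sig in signatures:
            stale += 1                                    # nothing structurally new
        else:
            signatures.add(sig)                           # novel top-two-level structure
            forest.append(tree)
            stale = 0
    return forest
```

Under this reading, a dataset with few distinct informative splits would stop growing early, which is consistent with the abstract's suggestion that the number of trees computed for a task can serve as a proxy for the dataset's complexity.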


Acknowledgments

The authors would like to acknowledge the Consejo Nacional de Ciencia y Tecnología (CONACYT) for supporting this work under grant number SALUD-2018-02-B-S-45803. The authors would also like to acknowledge the UC Irvine Machine Learning Repository for hosting the datasets used in this work.

Author information

Corresponding author

Correspondence to Jessica Cantillo-Negrete.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Availability of data and material

The datasets analyzed during the current study are available in the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php. The FSF algorithm can be downloaded from https://github.com/RubenICarinoEscobar/Feature-Ranked-Self-Growing-Forest.
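The repository above ships its own working classification and regression examples. Independently of those, the toy run below exercises the `grow_forest` sketch from the abstract section on the Iris dataset (which is also hosted by the UCI repository); the majority-voting aggregation is an assumed choice for illustration, not necessarily the rule FSF itself uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy run of the grow_forest() sketch shown earlier; majority voting over the
# grown trees is an assumed aggregation rule for this illustration.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = grow_forest(X_tr, y_tr)
votes = np.stack([t.predict(X_te) for t in forest])       # shape (n_trees, n_samples)
pred = np.apply_along_axis(
    lambda v: np.bincount(v.astype(int)).argmax(), 0, votes
)
print(f"{len(forest)} trees grown, test accuracy = {np.mean(pred == y_te):.2f}")
```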

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 772 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Carino-Escobar, R.I., Alonso-Silverio, G.A., Alarcón-Paredes, A. et al. Feature-ranked self-growing forest: a tree ensemble based on structure diversity for classification and regression. Neural Comput & Applic 35, 9285–9298 (2023). https://doi.org/10.1007/s00521-023-08202-y

