Performance Evaluation of Tree Ensemble Classification Models Towards Challenges of Big Data Analytics

  • Hanuman Godara
  • M. C. Govil
  • E. S. Pilli
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 985)


Big Data Analytics poses challenges such as effective and accurate real-time data mining, a lack of suitable tools and techniques, and in-memory processing limitations. Tree-based ensemble methods (machine learning models) can perform this kind of large-scale analytical processing when combined with high-performance cluster computing (a form of distributed computing) through parallel processing. The Random Forest algorithm (a forest of randomized trees, i.e., a tree ensemble) is considered for the performance evaluation because the tree model supports concurrency: all trees in the forest can be grown simultaneously. This makes it a suitable parallel approach with good accuracy and the ability to handle noisy and imbalanced datasets, and on large datasets it is far less prone to overfitting than a single-tree model. Although notable improvements over the original approach are available, limitations remain regarding performance and streaming datasets: the performance rate decreases as compute nodes are added, owing to redundant allocation of feature subsets in the hybrid task- and data-parallel approach, and the algorithm cannot handle stream data. These performance issues are identified, and a problem statement is formulated with the objective of achieving linearly scalable speedup and incremental processing capability for the random forest algorithm, so that it can perform predictive analytics over massive datasets in a cluster environment.
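
The property the abstract relies on, that every tree in the ensemble can be grown independently, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`train_stump`, `train_forest`, `predict`, `speedup_and_efficiency`) are hypothetical, depth-1 trees stand in for full decision trees, and a local thread pool stands in for cluster compute nodes (CPython threads show the task structure, not real CPU speedup). The last helper uses the standard definitions of speedup (serial time divided by parallel time) and efficiency (speedup divided by node count); linear scalability corresponds to an efficiency close to 1.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(task):
    """Fit one depth-1 tree on a bootstrap sample and a random feature subset."""
    X, y, seed, n_sub = task
    rng = random.Random(seed)                              # per-tree seed keeps runs reproducible
    idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap sample (with replacement)
    feats = rng.sample(range(len(X[0])), n_sub)            # random feature subset for this tree
    best = None
    for f in feats:
        for i in idx:
            t = X[i][f]                                    # candidate threshold
            left = [y[j] for j in idx if X[j][f] <= t]
            right = [y[j] for j in idx if X[j][f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]                                        # (feature, threshold, left label, right label)


def train_forest(X, y, n_trees=15, n_sub=1, workers=4):
    """Grow all trees concurrently; no tree depends on any other tree."""
    tasks = [(X, y, seed, n_sub) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:  # assumption: threads stand in for nodes
        return list(pool.map(train_stump, tasks))


def predict(forest, row):
    """Majority vote over the individual tree predictions."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]


def speedup_and_efficiency(t_serial, t_parallel, n_nodes):
    """Speedup S = T_serial / T_parallel; efficiency E = S / n_nodes."""
    s = t_serial / t_parallel
    return s, s / n_nodes


# Toy data (made up for illustration): class 0 has small values, class 1 large ones.
X = [[0.1, 1.0], [0.2, 1.2], [0.3, 0.9], [0.9, 3.0], [0.8, 3.1], [0.7, 2.8]]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y)
print(predict(forest, [0.0, 0.0]), predict(forest, [1.0, 4.0]))
print(speedup_and_efficiency(100.0, 25.0, 8))  # (4.0, 0.5)
```

In the last line, a job that takes 100 s serially and 25 s on 8 nodes has a speedup of 4 but an efficiency of only 0.5, which is the kind of sub-linear scaling the abstract identifies as a limitation.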


Keywords: Tree ensemble models; Big Data Analytics; HPC



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. National Institute of Technology Sikkim, Ravangla, India
  2. Malaviya National Institute of Technology Jaipur, Jaipur, India
