Evolutionary Induction of Classification Trees on Spark

  • Daniel Reska
  • Krzysztof Jurczuk
  • Marek Kretowski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10841)

Abstract

Evolutionary approaches are increasingly proposed for data mining tasks, but their practical applicability depends on their efficiency and scalability on large-scale data, which makes support for parallel and distributed processing indispensable. Apache Spark is one of the most promising cluster-computing engines for Big Data. In this paper, we investigate the application of Spark to speed up the evolutionary induction of classification trees in the Global Decision Tree (GDT) system. Thanks to specialized genetic operators, the system searches simultaneously for the tree structure and the tests in non-terminal nodes. As the original GDT system is implemented in C++, a Java-based module was developed to accelerate fitness evaluation, the most computationally demanding step, with Spark. The training dataset is transformed into a Resilient Distributed Dataset (RDD), which enables in-memory processing of the dataset's partitions on the workers. Preliminary experimental validation on large-scale artificial and real-life datasets shows that the proposed solution is efficient and scales well.
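
The idea of evaluating fitness over dataset parts held on separate workers can be illustrated with a minimal sketch. It is not the authors' Java/Spark module: it uses plain Python lists as a stand-in for the partitions of a Resilient Distributed Dataset, and the `Node` class, the helper names, and the use of plain accuracy as the fitness component are all assumptions made for illustration. Each partition yields partial counts (the "map" step), and the partial results are then summed (the "reduce" step).

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


@dataclass
class Node:
    """Hypothetical tree node: internal nodes test x[feature] <= threshold."""
    feature: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: int = 0  # predicted class, used only in leaves

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(tree: Node, x: Sequence[float]) -> int:
    """Route a sample from the root to a leaf and return the leaf's class."""
    node = tree
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def split_into_partitions(data: List[Sample], n: int) -> List[List[Sample]]:
    """Stand-in for distributing data: each chunk plays one worker's partition."""
    return [data[i::n] for i in range(n)]


def partial_counts(tree: Node, partition: List[Sample]) -> Tuple[int, int]:
    """'Map' step: count correct predictions and samples within one partition."""
    correct = sum(1 for x, y in partition if predict(tree, x) == y)
    return correct, len(partition)


def accuracy(tree: Node, partitions: List[List[Sample]]) -> float:
    """'Reduce' step: sum the per-partition counts and compute accuracy."""
    partials = (partial_counts(tree, p) for p in partitions)
    correct, total = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), partials, (0, 0)
    )
    return correct / total if total else 0.0


# Toy individual: a single test on feature 0 with two leaves (illustrative only).
tree = Node(feature=0, threshold=0.5, left=Node(label=0), right=Node(label=1))
data = [([0.1], 0), ([0.4], 0), ([0.6], 1), ([0.9], 1), ([0.7], 0)]
partitions = split_into_partitions(data, 2)
acc = accuracy(tree, partitions)
print(acc)  # 0.8: four of the five samples are classified correctly
```

In this sketch only the small per-partition counts are combined, while the samples stay in their partitions. That property is what makes the scheme a plausible fit for in-memory processing on workers.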

Keywords

  • Decision tree
  • Evolutionary algorithms
  • Spark
  • Distributed computing
  • Data mining
  • Large-scale data

Notes

Acknowledgments

This work was supported by grant S/WI/2/18 from Bialystok University of Technology, funded by the Polish Ministry of Science and Higher Education.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Daniel Reska (1)
  • Krzysztof Jurczuk (1), corresponding author
  • Marek Kretowski (1)
  1. Faculty of Computer Science, Bialystok University of Technology, Bialystok, Poland
