Genetic Programming over Spark for Higgs Boson Classification

  • Hmida Hmida
  • Sana Ben Hamida
  • Amel Borgi
  • Marta Rukoz
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 353)


With the growing number of available databases containing very large numbers of records, existing knowledge discovery tools need to be adapted and new tools need to be created. Genetic Programming (GP) has proven to be an efficient algorithm, in particular for classification problems. However, GP is hampered by its computing cost, which becomes more acute with large datasets. This paper presents how an existing GP implementation (DEAP) can be adapted by distributing evaluations on a Spark cluster. An additional sampling step is then applied to fit tiny clusters. Experiments are carried out on Higgs Boson classification with different settings. They show the benefits of using Spark as a parallelization technology for GP.
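The two ideas summarized above, distributing fitness evaluation over data partitions and sampling the training set first, can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a thread pool stands in for the Spark cluster, plain functions stand in for DEAP individuals, and every name in it (`partition`, `count_correct`, `accuracy`, `sample`) is hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts chunks (a stand-in for cluster data partitions)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def count_correct(classifier, chunk):
    """Map step: count the rows of one chunk that the classifier labels correctly."""
    return sum(1 for features, label in chunk if classifier(features) == label)


def accuracy(classifier, chunks, pool):
    """Reduce step: sum the per-chunk counts and divide by the number of rows."""
    total = sum(len(chunk) for chunk in chunks)
    correct = sum(pool.map(lambda chunk: count_correct(classifier, chunk), chunks))
    return correct / total


def sample(rows, size, rng):
    """Sampling step (assumed here to be uniform, without replacement)."""
    return rng.sample(rows, size)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data, not the Higgs dataset: the label is 1 when feature 0 exceeds feature 1.
    points = [(rng.random(), rng.random()) for _ in range(1000)]
    data = [((a, b), int(a > b)) for a, b in points]

    # Hypothetical "individuals": plain functions standing in for evolved GP trees.
    population = {
        "x0 > x1": lambda f: int(f[0] > f[1]),
        "always 1": lambda f: 1,
    }

    subset = sample(data, 200, rng)   # reduce the training set before evaluation
    chunks = partition(subset, 4)     # spread the subset over four "workers"
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name, individual in population.items():
            print(name, accuracy(individual, chunks, pool))
```

On a real cluster the map and reduce steps would presumably be expressed with Spark's own operations over a distributed dataset rather than a local thread pool; the sketch only shows the shape of the computation.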


Genetic Programming, Machine learning, Spark, Large dataset, Higgs Boson classification



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Hmida Hmida (1, 2), corresponding author
  • Sana Ben Hamida (2)
  • Amel Borgi (1)
  • Marta Rukoz (2)

  1. Faculté des Sciences de Tunis, LR11ES14 LIPAH, Université de Tunis El Manar, Tunis, Tunisia
  2. Université Paris Dauphine, PSL Research University, CNRS, UMR[7243], LAMSADE, Paris, France
