Trends of Evolutionary Machine Learning to Address Big Data Mining

  • Conference paper
Information and Knowledge Systems. Digital Technologies, Artificial Intelligence and Decision Making (ICIKS 2021)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 425))

Abstract

Improving decisions by better mining the data available in an information system is a common goal in many decision-making environments. However, the complexity and sheer size of the data collected in modern systems make this goal a challenge for mining methods. Evolutionary Data Mining Algorithms (EDMA), such as Genetic Programming (GP), are powerful meta-heuristics with empirically proven efficiency on complex machine learning problems. They are expected to be applied to real-world big data tasks and applications in daily life. Thus, like all machine learning techniques, they need to be scaled to Big Data. This paper reviews some solutions that could help EDMA deal with Big Data challenges. Two solutions are then selected and explained. The first is an algorithmic manipulation: it introduces the active learning paradigm through active data sampling. The second is a processing manipulation: it relies on horizontal scaling, distributing the processing over networked nodes. This work explains how each solution is introduced into GP. In preliminary experiments, the extended GP is applied to two complex machine learning problems: the Higgs Boson classification problem and the Pulsar detection problem. The experimental results are then discussed and compared to assess the efficiency of each solution.
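The two ideas named in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. It assumes the simplest form of data sampling (a fresh random subset drawn for each generation) and only mimics horizontal scaling locally, by computing partial hit counts per data partition and then summing them. All function names (`draw_sample`, `evaluate_generation`, `partitioned_accuracy`) are hypothetical, and candidate programs are stood in for by plain Python callables.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]        # (features, class label)
Classifier = Callable[[Sequence[float]], int]  # stand-in for an evolved GP program


def draw_sample(data: Sequence[Example], size: int, rng: random.Random) -> List[Example]:
    """Draw a random training subset without replacement (assumed sampling rule)."""
    return rng.sample(list(data), min(size, len(data)))


def count_hits(model: Classifier, examples: Sequence[Example]) -> int:
    """Number of examples the candidate classifies correctly."""
    return sum(1 for features, label in examples if model(features) == label)


def evaluate_generation(
    population: Sequence[Classifier],
    data: Sequence[Example],
    sample_size: int,
    rng: random.Random,
) -> List[float]:
    """Score every candidate on one freshly drawn subset instead of the full data."""
    sample = draw_sample(data, sample_size, rng)
    if not sample:
        return [0.0 for _ in population]
    return [count_hits(model, sample) / len(sample) for model in population]


def split_partitions(data: Sequence[Example], n_parts: int) -> List[List[Example]]:
    """Split the data into n_parts interleaved partitions (n_parts >= 1)."""
    return [list(data[i::n_parts]) for i in range(n_parts)]


def partitioned_accuracy(model: Classifier, data: Sequence[Example], n_parts: int) -> float:
    """Map: partial (hits, size) per partition. Reduce: sum and divide.

    On a cluster each partition would live on a different node; here the
    'map' step simply runs in a local loop.
    """
    partials = [(count_hits(model, part), len(part)) for part in split_partitions(data, n_parts)]
    hits = sum(h for h, _ in partials)
    total = sum(n for _, n in partials)
    return hits / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: one feature, label is 1 when the feature is at least 50.
    toy = [((float(i),), int(i >= 50)) for i in range(100)]
    threshold_rule = lambda x: int(x[0] >= 50)
    constant_rule = lambda x: 0
    print(evaluate_generation([threshold_rule, constant_rule], toy, 20, rng))
    print(partitioned_accuracy(constant_rule, toy, 4))
```

Sampling reduces the cost of each fitness evaluation by shrinking the data each candidate sees, whereas partitioning keeps the full data and spreads the evaluation work. The two are independent and could in principle be combined.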

Notes

  1. https://spark.apache.org
  2. https://github.com/hhmida/gp-spark
  3. https://hadoop.apache.org
  4. https://www.kaggle.com/
  5. See UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/HIGGS
  6. https://archive.ics.uci.edu/ml/datasets/HTRU2
  7. https://github.com/deap/deap

Author information

Corresponding author

Correspondence to Sana Ben Hamida.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ben Hamida, S., Benjelloun, G., Hmida, H. (2021). Trends of Evolutionary Machine Learning to Address Big Data Mining. In: Saad, I., Rosenthal-Sabroux, C., Gargouri, F., Arduin, PE. (eds) Information and Knowledge Systems. Digital Technologies, Artificial Intelligence and Decision Making. ICIKS 2021. Lecture Notes in Business Information Processing, vol 425. Springer, Cham. https://doi.org/10.1007/978-3-030-85977-0_7

  • DOI: https://doi.org/10.1007/978-3-030-85977-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85976-3

  • Online ISBN: 978-3-030-85977-0

  • eBook Packages: Computer Science (R0)
