Abstract
Improving decisions by better mining the data available in an information system is a common goal in many decision-making environments. However, the complexity and sheer size of the data collected by modern systems make this goal a challenge for mining methods. Evolutionary Data Mining Algorithms (EDMA), such as Genetic Programming (GP), are powerful meta-heuristics with empirically proven efficiency on complex machine learning problems. They are expected to be applied to real-world big data tasks and to everyday applications. Thus, like all machine learning techniques, they need to be scaled to Big Data. This paper reviews some solutions that could help EDMA deal with Big Data challenges. Two solutions are then selected and explained. The first is an algorithmic manipulation that introduces the active learning paradigm through active data sampling. The second is a processing manipulation that achieves horizontal scaling by distributing the processing over networked nodes. This work explains how each solution is introduced into GP. As preliminary experiments, the extended GP is applied to two complex machine learning problems: the Higgs Boson classification problem and the Pulsar detection problem. Experimental results are then discussed and compared to assess the efficiency of each solution.
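The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the threshold classifiers stand in for GP individuals, the toy dataset is synthetic, a local thread pool stands in for networked nodes, and every name (sample_cases, count_correct, distributed_accuracy) is hypothetical. The sampling shown is plain random resampling per generation, one simple form of active data sampling.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_cases(dataset, size, rng):
    """Active sampling (assumed variant): draw a fresh random subset."""
    return rng.sample(dataset, size)


def count_correct(classifier, cases):
    """Number of cases the classifier labels correctly."""
    return sum(1 for features, label in cases if classifier(features) == label)


def distributed_accuracy(classifier, dataset, n_partitions=4):
    """Horizontal-scaling sketch: score each data partition independently
    (map), then sum the per-partition counts (reduce). Threads are only a
    local stand-in for networked worker nodes."""
    partitions = [dataset[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        counts = list(pool.map(lambda part: count_correct(classifier, part),
                               partitions))
    return sum(counts) / len(dataset)


# Synthetic toy data (assumption): the label is 1 when x0 + x1 > 1.
rng = random.Random(0)
dataset = []
for _ in range(1000):
    x = (rng.random(), rng.random())
    dataset.append((x, int(x[0] + x[1] > 1.0)))

# Stand-ins for GP individuals: fixed threshold classifiers.
candidates = [lambda f, t=t: int(f[0] + f[1] > t) for t in (0.5, 1.0, 1.5)]

# Each generation evaluates fitness on a newly sampled subset only,
# so the cost per generation depends on the subset size, not the full data.
for generation in range(3):
    subset = sample_cases(dataset, 100, rng)
    fitness = [count_correct(c, subset) / len(subset) for c in candidates]

# The selected individual is finally scored on the full data, partition by partition.
best = candidates[fitness.index(max(fitness))]
full_accuracy = distributed_accuracy(best, dataset)
print(f"accuracy of selected individual on full data: {full_accuracy:.3f}")
```

Sampling reduces the number of fitness cases evaluated per generation, while partitioning spreads the remaining evaluation work across workers. The two manipulations are independent and can be combined.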
Notes
- 5. See UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/HIGGS.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ben Hamida, S., Benjelloun, G., Hmida, H. (2021). Trends of Evolutionary Machine Learning to Address Big Data Mining. In: Saad, I., Rosenthal-Sabroux, C., Gargouri, F., Arduin, PE. (eds) Information and Knowledge Systems. Digital Technologies, Artificial Intelligence and Decision Making. ICIKS 2021. Lecture Notes in Business Information Processing, vol 425. Springer, Cham. https://doi.org/10.1007/978-3-030-85977-0_7
DOI: https://doi.org/10.1007/978-3-030-85977-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85976-3
Online ISBN: 978-3-030-85977-0
eBook Packages: Computer Science, Computer Science (R0)