Abstract
Solving problems of high dimensionality (and complexity) usually requires intensive use of technologies such as parallelism, advanced computers, and new types of algorithms. MapReduce (MR) is a long-established computing paradigm in computer science that has been proposed in recent years for big data applications, though it can also be used for many other tasks. In this article, we address big optimization: solving large instances of combinatorial optimization problems by using MR as the paradigm to design solvers that run transparently on a varying number of computers that collaborate to find the solution. We study and analyze MR technology, focusing on Hadoop, Spark, and MPI as the middleware platforms for developing genetic algorithms (GAs). From this, MRGA solvers arise that use a programming paradigm different from the usual imperative transformational programming. Our objective is to confirm the expected benefits of these systems, namely file, memory, and communication management, for the resulting algorithms. We analyze our MRGA solvers from relevant points of view such as scalability, speedup, and communication vs. computation time in big optimization. The results for high-dimensional datasets show that the MRGA over Hadoop outperforms the implementations on the Spark and MPI frameworks. For the smallest datasets, the MRGA on MPI is always faster than the other MRGAs. Finally, the MRGA over Spark presents the lowest communication times. We report numerical and timing insights to ease future comparisons of new algorithms over these three popular technologies.
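To make the idea of a GA expressed in MR terms concrete, the following is a minimal single-process sketch of one possible decomposition: the map phase evaluates fitness and assigns each individual to a reducer, and the reduce phase applies selection, crossover, and mutation. This decomposition, the toy 0-1 knapsack instance, and all names (`ITEMS`, `CAPACITY`, `map_phase`, `shuffle`, `reduce_phase`, `run`) are illustrative assumptions, not the MRGA design evaluated in the article, and no Hadoop, Spark, or MPI API is used.

```python
import random
from collections import defaultdict

# Hypothetical toy 0-1 knapsack instance: (weight, value) pairs and a capacity.
ITEMS = [(12, 4), (2, 2), (1, 1), (4, 10), (1, 2)]
CAPACITY = 15


def fitness(bits):
    """Total value of the selected items, or 0 if the capacity is exceeded."""
    weight = sum(w for (w, _), b in zip(ITEMS, bits) if b)
    value = sum(v for (_, v), b in zip(ITEMS, bits) if b)
    return value if weight <= CAPACITY else 0


def map_phase(population, n_reducers, rng):
    """Map: evaluate each individual and emit (reducer_key, (individual, fitness))."""
    for individual in population:
        yield rng.randrange(n_reducers), (individual, fitness(individual))


def shuffle(pairs):
    """Group mapped pairs by key, standing in for the MR runtime's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(evaluated, rng, mutation_rate=0.1):
    """Reduce: binary tournament, one-point crossover, bit-flip mutation."""
    def tournament():
        a, b = rng.choice(evaluated), rng.choice(evaluated)
        return a[0] if a[1] >= b[1] else b[0]

    offspring = []
    for _ in evaluated:  # one child per received individual keeps the size constant
        p1, p2 = tournament(), tournament()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        child = tuple(1 - g if rng.random() < mutation_rate else g for g in child)
        offspring.append(child)
    return offspring


def run(generations=20, pop_size=16, n_reducers=2, seed=0):
    """Iterate map -> shuffle -> reduce and return the best final individual."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in ITEMS) for _ in range(pop_size)]
    for _ in range(generations):
        groups = shuffle(map_phase(population, n_reducers, rng))
        population = [c for group in groups.values() for c in reduce_phase(group, rng)]
    return max(population, key=fitness)
```

Calling `run()` returns a tuple of 0/1 genes, one per item. In a real deployment the `shuffle` step would be carried out by the framework, which is where the communication costs discussed in the abstract arise.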
Data availability
Enquiries about data availability should be directed to the authors.
Funding
This research received financial support from the Universidad Nacional de La Pampa and the Incentive Program from MINCyT (Argentina). It was also partially funded by the Universidad de Malaga; by grant PID 2020-116727RB-I00 (HUmove), funded by MCIN/AEI/10.13039/501100011033; and by the TAILOR ICT-48 Network (No 952215), funded by the EU Horizon 2020 research and innovation programme.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This article does not contain any studies with animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
About this article
Cite this article
Salto, C., Minetti, G., Alba, E. et al. Big optimization with genetic algorithms: Hadoop, Spark, and MPI. Soft Comput 27, 11469–11484 (2023). https://doi.org/10.1007/s00500-023-08301-x
DOI: https://doi.org/10.1007/s00500-023-08301-x