Big optimization with genetic algorithms: Hadoop, Spark, and MPI

  • Optimization
  • Soft Computing

Abstract

Solving problems of high dimensionality (and complexity) usually requires intensive use of technologies such as parallelism, advanced computers, and new types of algorithms. MapReduce (MR) is a long-established computing paradigm in computer science that has been proposed in recent years for big data applications, though it can also be used for many other tasks. In this article, we address big optimization: solving large instances of combinatorial optimization problems by using MR as the paradigm to design solvers that run transparently on a varying number of computers, which collaborate to find the solution. We study and analyze MR technology, focusing on Hadoop, Spark, and MPI as the middleware platforms on which to develop genetic algorithms (GAs). The resulting MRGA solvers follow a programming paradigm different from the usual imperative transformational programming. Our objective is to confirm the expected benefits of these systems, namely file, memory, and communication management, for the resulting algorithms. We analyze our MRGA solvers in terms of scalability, speedup, and communication versus computation time in big optimization. For high-dimensional datasets, the MRGA over Hadoop outperforms the Spark and MPI implementations. For the smallest datasets, the MRGA on MPI is always faster than the other MRGAs. Finally, the MRGA over Spark has the lowest communication times. We report numerical and timing results to ease future comparisons of new algorithms on these three popular technologies.
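To make the idea of a GA expressed in MapReduce terms concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes one common decomposition: the map step evaluates each individual independently, and the reduce step performs selection, crossover, and mutation on the scored individuals. The toy 0–1 knapsack instance, the function names (`map_phase`, `reduce_phase`, `run_mrga`), and all parameter values are hypothetical and chosen only for illustration; on Hadoop, Spark, or MPI the map step would be distributed across workers rather than run as a plain list comprehension.

```python
import random

# Hypothetical toy 0-1 knapsack instance (assumption, not taken from the article).
WEIGHTS = [12, 7, 11, 8, 9, 6, 5, 14]
VALUES = [24, 13, 23, 15, 16, 11, 9, 28]
CAPACITY = 40


def fitness(bits):
    """Total value of the selected items, or 0 if the capacity is exceeded."""
    weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
    value = sum(v for v, b in zip(VALUES, bits) if b)
    return value if weight <= CAPACITY else 0


def map_phase(population):
    """Map: score every individual independently, giving (individual, fitness) pairs."""
    return [(ind, fitness(ind)) for ind in population]


def tournament(scored, rng):
    """Binary tournament: pick two scored individuals, keep the fitter one."""
    a, b = rng.sample(scored, 2)
    return a[0] if a[1] >= b[1] else b[0]


def reduce_phase(scored, rng, p_mut=0.05):
    """Reduce: build the next population from the scored pairs."""
    best = max(scored, key=lambda pair: pair[1])[0]
    offspring = [best]  # elitism: the best individual always survives
    while len(offspring) < len(scored):
        p1, p2 = tournament(scored, rng), tournament(scored, rng)
        cut = rng.randrange(1, len(p1))  # one-point crossover
        child = p1[:cut] + p2[cut:]
        # Bit-flip mutation applied independently to each gene.
        child = tuple(b ^ 1 if rng.random() < p_mut else b for b in child)
        offspring.append(child)
    return offspring


def run_mrga(pop_size=20, generations=30, seed=0):
    """Iterate map and reduce; return the best (individual, fitness) pair found."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in WEIGHTS) for _ in range(pop_size)
    ]
    for _ in range(generations):
        population = reduce_phase(map_phase(population), rng)
    return max(map_phase(population), key=lambda pair: pair[1])


if __name__ == "__main__":
    best, best_fitness = run_mrga()
    print(best, best_fitness)
```

Because fitness evaluation in the map step touches one individual at a time, it is the part that parallelizes naturally; the reduce step needs the whole scored population, which is where communication cost would arise in a distributed setting.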

Data availability

Enquiries about data availability should be directed to the authors.

Funding

This research received financial support from the Universidad Nacional de La Pampa and the Incentive Program from MINCyT (Argentina). Moreover, this research is partially funded by the Universidad de Malaga; by grant PID 2020-116727RB-I00 (HUmove), funded by MCIN/AEI/10.13039/501100011033; and by the TAILOR ICT-48 Network (No 952215), funded by the EU Horizon 2020 research and innovation programme.

Author information

Corresponding author

Correspondence to Gabriela Minetti.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Salto, C., Minetti, G., Alba, E. et al. Big optimization with genetic algorithms: Hadoop, Spark, and MPI. Soft Comput 27, 11469–11484 (2023). https://doi.org/10.1007/s00500-023-08301-x

  • DOI: https://doi.org/10.1007/s00500-023-08301-x
