When Huge Is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing

  • Xavier Llorà
  • Abhishek Verma
  • Roy H. Campbell
  • David E. Goldberg
Part of the Studies in Computational Intelligence book series (SCI, volume 269)


Data-intensive computing has emerged as a key approach to processing large volumes of data by exploiting massive parallelism. Data-intensive computing frameworks have shown that terabytes and petabytes of data can be processed routinely. However, there has been little effort to explore how data-intensive computing can help scale evolutionary computation. In this book chapter we explore how evolutionary computation algorithms can be modeled using two different data-intensive frameworks: Yahoo!'s Hadoop and NCSA's Meandre. We present a detailed, step-by-step description of how three evolutionary computation algorithms with different execution profiles can be translated into the data-intensive computing paradigm. Results show that (1) Hadoop is an excellent choice for pushing evolutionary computation boundaries on very large problems, and (2) Meandre achieves transparent linear speedups without changes to the underlying data-intensive flow, thanks to its inherent parallel processing.
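The core idea of mapping a genetic algorithm onto a data-intensive framework can be illustrated with a minimal sketch: fitness evaluation is embarrassingly parallel and fits a map phase, while selection and recombination aggregate the evaluated population in a reduce phase. The sketch below is illustrative only (function names, the OneMax fitness, and the tournament/uniform-crossover choices are assumptions, not the chapter's actual Hadoop implementation):

```python
import random

def map_phase(population, fitness):
    # Mapper: evaluate each individual independently; in Hadoop each
    # mapper would emit (individual, fitness) pairs in parallel.
    return [(ind, fitness(ind)) for ind in population]

def reduce_phase(evaluated, rng):
    # Reducer: tournament selection plus uniform crossover builds the
    # next generation from the evaluated (individual, fitness) pairs.
    def tournament():
        a, b = rng.sample(evaluated, 2)
        return a[0] if a[1] >= b[1] else b[0]
    next_gen = []
    for _ in range(len(evaluated)):
        p1, p2 = tournament(), tournament()
        # Uniform crossover: each gene drawn from either parent.
        child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
        next_gen.append(child)
    return next_gen

def onemax(ind):
    # Toy fitness: count of 1-bits.
    return sum(ind)

rng = random.Random(42)
pop = [[rng.randint(0, 1) for _ in range(20)] for _ in range(50)]
for _ in range(30):
    pop = reduce_phase(map_phase(pop, onemax), rng)
best = max(onemax(ind) for ind in pop)
```

In a real Hadoop job the map and reduce phases would run as distributed tasks over a serialized population rather than in-process lists; the point here is only the decomposition of one generation into the two phases.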


Keywords: Genetic Algorithm, Probability Vector, Minimum Description Length, Uniform Crossover, Parallel Genetic Algorithm





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Xavier Llorà (1)
  • Abhishek Verma (2)
  • Roy H. Campbell (2)
  • David E. Goldberg (3)
  1. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana
  2. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana
  3. Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana
