Journal of Grid Computing, Volume 13, Issue 3, pp 391–407


Cloud-Based Ensemble Learning with Genetic Programming for Large Regression Problems
  • Kalyan Veeramachaneni
  • Ignacio Arnaldo
  • Owen Derby
  • Una-May O’Reilly


We describe FlexGP, the first Genetic Programming system to perform symbolic regression on large-scale datasets on the cloud via massive data-parallel ensemble learning. FlexGP provides a decentralized, fault-tolerant parallelization framework that runs many copies of Multiple Regression Genetic Programming, a sophisticated symbolic regression algorithm, on the cloud. Each copy runs on a different sample of the data and with different parameters. The framework can create a fused model, or ensemble, on demand while the individual GP learners are still evolving. We demonstrate the framework by deploying 100 independent GP instances in a massively data-parallel manner to learn from a dataset of 515K exemplars and 90 features, and by generating a competitive fused model in less than 10 minutes.
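The data-parallel ensemble idea in the abstract can be illustrated with a minimal sketch. It is not FlexGP itself: a one-variable least-squares fit stands in for each Multiple Regression Genetic Programming learner, the per-learner sample fraction stands in for "different parameters", and plain averaging is assumed as the fusion rule, since the abstract does not say how models are fused. All function names here are hypothetical.

```python
import random
from statistics import mean


def make_data(n, seed=1):
    """Synthetic data: y = 3x + 2 plus Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 0.5)) for x in xs]


def fit_line(points):
    """Ordinary least squares for y = a*x + b; a stand-in for one GP learner."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def train_learners(data, n_learners, seed=0):
    """Train independent learners, each on its own random sample of the data.

    Each learner also draws its own sample fraction, mimicking the idea that
    every copy runs with different parameters (an assumption of this sketch).
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        fraction = rng.uniform(0.05, 0.2)
        k = max(2, int(len(data) * fraction))
        models.append(fit_line(rng.sample(data, k)))
    return models


def fused_predict(models, x):
    """Fuse the learners by averaging their predictions (assumed fusion rule)."""
    return mean(a * x + b for a, b in models)


if __name__ == "__main__":
    data = make_data(2000)
    models = train_learners(data, n_learners=10)
    print(fused_predict(models, 5.0))
```

Because the learners never communicate during training, each call to `fit_line` could run on a separate machine, and `fused_predict` can be evaluated at any time over whichever models exist so far.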


Keywords: Cloud computing, Ensemble learning, Genetic programming, Symbolic regression





Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  • Kalyan Veeramachaneni (1)
  • Ignacio Arnaldo (1), corresponding author
  • Owen Derby (1)
  • Una-May O’Reilly (1)

  1. Massachusetts Institute of Technology, Cambridge, USA
