Journal of Grid Computing, Volume 12, Issue 2, pp 245–264

Science in the Cloud: Allocation and Execution of Data-Intensive Scientific Workflows

  • Claudia Szabo
  • Quan Z. Sheng
  • Trent Kroeger
  • Yihong Zhang
  • Jian Yu


An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows to reduce execution time and the size of transferred data. The transferred data overhead is becoming significant with emerging scientific workflows that have input/output files and intermediate data products ranging in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proved to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds considering data transfer and execution time. In our framework, a solution is represented using an allocation chromosome that encodes the allocation of tasks to nodes, and an ordering chromosome that defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we employ crossover and mutation operators on both chromosomes, aiming to optimize the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swap parts of the same chromosome according to pre-defined rules. Our experimental study compares the proposed approach with current state-of-the-art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows improvements of up to 80 % for larger synthetic workflows. To further validate our approach, we compare the allocation and scheduling obtained by our approach with those obtained by popular scientific workflow managers when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10 % improvement in runtime over existing schedulers, caused by an 80 % reduction in transferred data and by optimized allocation and ordering of tasks. This improved data locality has a broader impact, as it can be employed to study and improve data provenance and to facilitate data persistence for scientific workflows.
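The two-chromosome representation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the four-task workflow, the runtimes, the uniform `BANDWIDTH`, the simple cost model in `evaluate`, and the generic swap and one-point operators stand in for the paper's cloud cost model and tailored operators, which the abstract does not specify.

```python
import random
from dataclasses import dataclass

# Hypothetical toy workflow: task -> list of (parent, data_size_gb) edges.
# All names and numbers are illustrative, not taken from the paper.
PARENTS = {
    "a": [],
    "b": [("a", 10.0)],
    "c": [("a", 20.0)],
    "d": [("b", 5.0), ("c", 5.0)],
}
RUNTIME = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}  # assumed task runtimes
BANDWIDTH = 10.0  # assumed GB per time unit between any two nodes


@dataclass
class Solution:
    allocation: dict  # allocation chromosome: task -> node index
    ordering: list    # ordering chromosome: execution order of tasks


def is_valid_order(ordering):
    """True if every task appears after all of its parents."""
    pos = {task: i for i, task in enumerate(ordering)}
    return all(pos[p] < pos[t] for t in ordering for p, _ in PARENTS[t])


def evaluate(sol):
    """Return (transferred_data, makespan); both objectives are minimised."""
    transferred, node_free, finish = 0.0, {}, {}
    for task in sol.ordering:
        node = sol.allocation[task]
        ready = 0.0
        for parent, size in PARENTS[task]:
            arrival = finish[parent]
            if sol.allocation[parent] != node:  # data must cross nodes
                transferred += size
                arrival += size / BANDWIDTH
            ready = max(ready, arrival)
        start = max(ready, node_free.get(node, 0.0))
        finish[task] = start + RUNTIME[task]
        node_free[node] = finish[task]
    return transferred, max(finish.values())


def mutate_allocation(sol, rng):
    """Swap the node assignments of two randomly chosen tasks."""
    allocation = dict(sol.allocation)
    t1, t2 = rng.sample(list(allocation), 2)
    allocation[t1], allocation[t2] = allocation[t2], allocation[t1]
    return Solution(allocation, list(sol.ordering))


def mutate_ordering(sol, rng):
    """Swap two adjacent tasks; revert if precedence would be violated."""
    ordering = list(sol.ordering)
    i = rng.randrange(len(ordering) - 1)
    ordering[i], ordering[i + 1] = ordering[i + 1], ordering[i]
    if not is_valid_order(ordering):
        ordering = list(sol.ordering)
    return Solution(dict(sol.allocation), ordering)


def crossover_allocation(p1, p2, rng):
    """Generic one-point crossover on the allocation chromosome."""
    tasks = list(p1.ordering)
    cut = rng.randrange(1, len(tasks))
    allocation = {t: (p1 if i < cut else p2).allocation[t]
                  for i, t in enumerate(tasks)}
    return Solution(allocation, list(p1.ordering))


def dominates(f, g):
    """Pareto dominance for minimisation of all objectives."""
    return all(x <= y for x, y in zip(f, g)) and f != g


if __name__ == "__main__":
    order = ["a", "b", "c", "d"]
    single = Solution({t: 0 for t in order}, order)
    split = Solution({"a": 0, "b": 0, "c": 1, "d": 0}, order)
    print(evaluate(single))  # (0.0, 10.0): no transfers, fully serial
    print(evaluate(split))   # (25.0, 9.5): shorter runtime, more data moved
```

In this toy instance neither solution dominates the other: placing every task on one node transfers nothing but serialises the work, while moving task `c` to a second node shortens the runtime at the cost of 25 GB of transfers. A multi-objective search of the kind the abstract describes would keep both as candidates on the Pareto front.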


Data-intensive workflows · Cloud computing · Scheduling · Allocation · Evolutionary computation




Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Claudia Szabo (1)
  • Quan Z. Sheng (1)
  • Trent Kroeger (1)
  • Yihong Zhang (1)
  • Jian Yu (2)

  1. School of Computer Science, The University of Adelaide, Adelaide, Australia
  2. Faculty of Information and Communication Technologies, Swinburne University of Technology, Melbourne, Australia
