Journal of Grid Computing, Volume 10, Issue 3, pp 521–552

A Provenance-based Adaptive Scheduling Heuristic for Parallel Scientific Workflows in Clouds

  • Daniel de Oliveira
  • Kary A. C. S. Ocaña
  • Fernanda Baião
  • Marta Mattoso


In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm in which resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many important open challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a complex task, since the scheduler has to take many different criteria into account and exploit the elasticity of the cloud to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.


Keywords: Cloud computing, Scientific workflow, Scientific experiment, Provenance





Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Daniel de Oliveira (1)
  • Kary A. C. S. Ocaña (1)
  • Fernanda Baião (2)
  • Marta Mattoso (1)

  1. Federal University of Rio de Janeiro - COPPE/UFRJ, Rio de Janeiro, Brazil
  2. Federal University of the State of Rio de Janeiro – UNIRIO, Rio de Janeiro, Brazil
