A Provenance-based Adaptive Scheduling Heuristic for Parallel Scientific Workflows in Clouds

Journal of Grid Computing

Abstract

In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. They are becoming increasingly complex and more demanding in terms of computational resources, and therefore require parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm in which resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was high throughput computing, clouds are already being used to provide an HPC environment in which elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many important open challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since it must take many different criteria into account and exploit the elasticity of the cloud to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for the parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability, and financial cost. Besides scheduling workflow activities according to a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
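
To make the idea of a three-criteria cost model concrete, the following is a minimal illustrative sketch and not the cost model defined in the paper, whose formulation is not given in this abstract. It assumes that each candidate virtual machine comes with an estimated execution time, failure probability, and monetary cost for an activity (for example, values derived from provenance data), and that the scientist supplies one weight per criterion. The names `VM`, `normalise`, and `pick_vm`, and the min-max normalisation, are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical per-VM estimates for one workflow activity."""
    name: str
    est_time: float      # estimated execution time (seconds)
    failure_prob: float  # estimated probability that the activity fails
    price: float         # estimated monetary cost of the run


def normalise(values):
    """Min-max scale values to [0, 1]; all zeros if they are identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_vm(vms, w_time, w_rel, w_cost):
    """Return the VM with the lowest weighted sum of the three criteria.

    Lower is better for all three terms: time, failure probability, price.
    The weights are assumed to be chosen by the scientist before execution.
    """
    times = normalise([vm.est_time for vm in vms])
    fails = normalise([vm.failure_prob for vm in vms])
    prices = normalise([vm.price for vm in vms])
    scores = [
        w_time * t + w_rel * f + w_cost * p
        for t, f, p in zip(times, fails, prices)
    ]
    best = min(range(len(vms)), key=scores.__getitem__)
    return vms[best]


# Example with made-up estimates for two VM types.
candidates = [
    VM("small", est_time=100.0, failure_prob=0.05, price=1.0),
    VM("large", est_time=40.0, failure_prob=0.10, price=4.0),
]
fastest = pick_vm(candidates, w_time=1.0, w_rel=0.0, w_cost=0.0)
cheapest = pick_vm(candidates, w_time=0.0, w_rel=0.0, w_cost=1.0)
```

In this sketch, shifting weight toward execution time favours the faster but more expensive VM, while shifting it toward cost favours the cheaper one. A weighted sum is only one way to combine objectives and is used here purely for illustration.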


Author information

Corresponding author

Correspondence to Daniel de Oliveira.

About this article

Cite this article

de Oliveira, D., Ocaña, K.A.C.S., Baião, F. et al. A Provenance-based Adaptive Scheduling Heuristic for Parallel Scientific Workflows in Clouds. J Grid Computing 10, 521–552 (2012). https://doi.org/10.1007/s10723-012-9227-2

  • DOI: https://doi.org/10.1007/s10723-012-9227-2
