
A Survey of Data-Intensive Scientific Workflow Management

Journal of Grid Computing

Abstract

A growing number of computer-based scientific experiments need to handle massive amounts of data. Their data processing consists of multiple computational steps and the dependencies among them. A data-intensive scientific workflow is useful for modeling such a process. Since the sequential execution of data-intensive scientific workflows may take a long time, Scientific Workflow Management Systems (SWfMSs) should enable the parallel execution of these workflows and exploit resources distributed across different infrastructures, such as grid and cloud. This paper provides a survey of data-intensive scientific workflow management in SWfMSs and of their parallelization techniques. Based on a SWfMS functional architecture, we give a comparative analysis of the existing solutions. Finally, we identify research issues for improving the execution of data-intensive scientific workflows in a multisite cloud.
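As an illustration of why dependencies matter for parallel execution, the following minimal sketch models a workflow as a directed acyclic graph and groups its steps into stages whose members could run concurrently. It is not taken from the paper or from any surveyed SWfMS: the step names, the `workflow` dictionary, and the `parallel_stages` helper are hypothetical, and the grouping uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "extract": set(),
    "clean_a": {"extract"},
    "clean_b": {"extract"},
    "merge": {"clean_a", "clean_b"},
}


def parallel_stages(dependencies):
    """Group steps into stages; steps in the same stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        # Every step returned here has all of its dependencies already completed.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


print(parallel_stages(workflow))
```

In this toy case the two cleaning steps land in the same stage, so a scheduler would be free to run them on different resources. Real systems must additionally account for data placement and transfer costs, which this sketch ignores.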



Corresponding author

Correspondence to Ji Liu.

Cite this article

Liu, J., Pacitti, E., Valduriez, P. et al. A Survey of Data-Intensive Scientific Workflow Management. J Grid Computing 13, 457–493 (2015). https://doi.org/10.1007/s10723-015-9329-8


  • DOI: https://doi.org/10.1007/s10723-015-9329-8
