Encyclopedia of Big Data Technologies

2019 Edition
| Editors: Sherif Sakr, Albert Y. Zomaya

Workflow Systems for Big Data Analysis

  • Loris BelcastroEmail author
  • Fabrizio Marozzo
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_137


A workflow is a well-defined, and possibly repeatable, pattern or systematic organization of activities designed to achieve a certain transformation of data (Talia et al. 2015).

A Workflow Management System (WMS) is a software environment providing tools to define, compose, map, and execute workflows.


The wide availability of high-performance computing systems has allowed scientists and engineers to implement more and more complex applications for accessing and analyzing huge amounts of data (Big Data) on distributed and high-performance computing platforms. Given the variety of Big Data applications and types of users (from end users to skilled programmers), there is a need for scalable programming models with different levels of abstractions and design formalisms. The programming models should adapt to user needs by allowing (i) ease in defining data analysis applications, (ii) effectiveness in the analysis of large datasets, (iii) and efficiency of executing...

This is a preview of subscription content, log in to check access.


  1. Agapito G, Cannataro M, Guzzi PH, Marozzo F, Talia D, Trunfio P (2013) Cloud4snp: distributed analysis of snp microarray data on the cloud. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedical informatics 2013 (ACM BCB 2013). ACM, Washington, DC, p 468. ISBN:978-1-4503-2434-2Google Scholar
  2. Altomare A, Cesario E, Comito C, Marozzo F, Talia D (2017) Trajectory pattern mining for urban computing in the cloud. Trans Parallel Distrib Syst 28(2):586–599. ISSN:1045-9219Google Scholar
  3. Atay M, Chebotko A, Liu D, Lu S, Fotouhi F (2007) Efficient schema-based XML-to-relational data mapping. Inf Syst 32(3):458–476CrossRefGoogle Scholar
  4. Belcastro L, Marozzo F, Talia D, Trunfio P (2015) Programming visual and script-based big data analytics workflows on clouds. In: Grandinetti L, Joubert G, Kunze M, Pascucci V (eds) Post-proceedings of the high performance computing workshop 2014. Advances in parallel computing, vol 26. IOS Press, Cetraro, pp 18–31. ISBN:978-1-61499-582-1Google Scholar
  5. Belcastro L, Marozzo F, Talia D, Trunfio P (2016) Using scalable data mining for predicting flight delays. ACM Trans Intell Syst Technol 8(1):1–20CrossRefGoogle Scholar
  6. Bowers S, Ludascher B, Ngu AHH, Critchlow T (2006) Enabling scientific workflow reuse through structured composition of dataflow and control-flow. In: 22nd international conference on data engineering workshops (ICDEW’06), pp 70–70.  https://doi.org/10.1109/ICDEW.2006.55
  7. Brown DA, Brady PR, Dietz A, Cao J, Johnson B, McNabb J (2007) A case study on the use of workflow technologies for scientific analysis: gravitational wave data analysis. Workflows for e-Science, pp 39–59Google Scholar
  8. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRefGoogle Scholar
  9. Deelman E, Gannon D, Shields M, Taylor I (2009) Workflows and e-science: an overview of workflow system features and capabilities. Futur Gener Comput Syst 25(5):528–540CrossRefGoogle Scholar
  10. Deelman E, Vahi K, Juve G, Rynge M, Callaghan S, Maechling PJ, Mayani R, Chen W, da Silva RF, Livny M et al (2015) Pegasus, a workflow management system for science automation. Futur Gener Comput Syst 46:17–35CrossRefGoogle Scholar
  11. Georgakopoulos D, Hornick M, Sheth A (1995) An overview of workflow management: from process modeling to workflow automation infrastructure. Distrib Parallel Databases 3(2):119–153CrossRefGoogle Scholar
  12. Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-passing interface, vol 1. MIT press, CambridgezbMATHCrossRefGoogle Scholar
  13. Guan Z, Hernandez F, Bangalore P, Gray J, Skjellum A, Velusamy V, Liu Y (2006) Grid-flow: a grid-enabled scientific workflow system with a petri-net-based interface. Concurr Comput Pract Exp 18(10):1115–1140CrossRefGoogle Scholar
  14. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS operating systems review, vol 41. ACM, pp 59–72Google Scholar
  15. Juric MB, Mathew B, Sarang PG (2006) Business process execution language for web services: an architect and developer’s guide to orchestrating web services using BPEL4WS. Packt Publishing Ltd, BirminghamGoogle Scholar
  16. Juve G, Deelman E, Vahi K, Mehta G, Berriman B, Berman BP, Maechling P (2009) Scientific workflow applications on Amazon EC2. In: 2009 5th IEEE international conference on E-science workshops. IEEE, pp 59–66Google Scholar
  17. Kranjc J, Podpečan V, Lavrač N (2012) Clowdflows: a cloud based scientific workflow platform. In: Machine learning and knowledge discovery in databases. Springer, pp 816–819Google Scholar
  18. Lee S, Park H, Shin Y (2012) Cloud computing availability: multi-clouds for big data service. In: Convergence and hybrid information technology. Springer, Heidelberg, pp 799–806CrossRefGoogle Scholar
  19. Liu L, Pu C, Ruiz DD (2004) A systematic approach to flexible specification, composition, and restructuring of workflow activities. J Database Manag 15(1):1CrossRefGoogle Scholar
  20. Lordan F, Tejedor E, Ejarque J, Rafanell R, Álvarez J, Marozzo F, Lezzi D, Sirvent R, Talia D, Badia R (2014) Servicess: an interoperable programming framework for the cloud. J Grid Comput 12(1):67–91CrossRefGoogle Scholar
  21. Lu Q, Hao P, Curcin V, He W, Li YY, Luo QM, Guo YK, Li YX (2006) KDE bioscience: platform for bioinformatics analysis workflows. J Biomed Inf 39(4):440–450CrossRefGoogle Scholar
  22. Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y (2006) Scientific workflow management and the kepler system. Concurr Comput Pract Exp 18(10):1039–1065CrossRefGoogle Scholar
  23. Maheshwari K, Rodriguez A, Kelly D, Madduri R, Wozniak J, Wilde M, Foster I (2013) Enabling multi-task computation on galaxy-based gateways using swift. In: 2013 IEEE international conference on cluster computing (CLUSTER). IEEE, pp 1–3Google Scholar
  24. Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146Google Scholar
  25. Marin A, Wellman B (2011) Social network analysis: an introduction. The SAGE handbook of social network analysis, p 11. Sage Publications, Thousand OaksGoogle Scholar
  26. Margolis B (2007). SOA for the business developer: concepts, BPEL, and SCA. Mc Press, LewisvilleGoogle Scholar
  27. Marozzo F, Talia D, Trunfio P (2011) A cloud framework for parameter sweeping data mining applications. In: Proceedings of the 3rd IEEE international conference on cloud computing technology and science (CloudCom’11). IEEE Computer Society Press, Athens, pp 367–374. ISBN:978-0-7695-4622-3CrossRefGoogle Scholar
  28. Marozzo F, Talia D, Trunfio P (2015) Js4cloud: script-based workflow programming for scalable data analysis on cloud platforms. Concurr Comput Pract Exp 27(17):5214–5237CrossRefGoogle Scholar
  29. Marozzo F, Talia D, Trunfio P (2016) A workflow management system for scalable data mining on clouds. IEEE Trans Serv Comput PP(99):1–1Google Scholar
  30. Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud. Elsevier. ISBN:978-0-12-802881-0Google Scholar
  31. Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111CrossRefGoogle Scholar
  32. WFMC T (1999) Glossary, document number WFMC, issue 3.0. TC 1011Google Scholar
  33. Wilde M, Hategan M, Wozniak JM, Clifford B, Katz DS, Foster I (2011) Swift: a language for distributed parallel scripting. Parallel Comput 37(9):633–652CrossRefGoogle Scholar
  34. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P et al (2013) The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res 41(W1):W557–W561CrossRefGoogle Scholar
  35. Wozniak JM, Wilde M, Foster IT (2014) Language features for scalable distributed-memory dataflow computing. In: 2014 fourth workshop on data-flow execution models for extreme scale computing (DFM). IEEE, pp 50–53Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.DIMESUniversity of CalabriaRendeItaly