The Flowbster Cloud-Oriented Workflow System to Process Large Scientific Data Sets

Abstract

This paper describes Flowbster, a new cloud-oriented workflow system designed to create efficient data pipelines in clouds for processing large, compute-intensive data sets. A Flowbster workflow is deployed in the target cloud as a virtual infrastructure; data flows through this infrastructure and is transformed along the way as the business logic of the workflow defines. Instead of the enactor-based workflow concept, Flowbster applies the service choreography concept, in which the workflow nodes communicate directly with each other. Workflow nodes can recognize whether they can be activated with a given data set without the involvement of a central control service such as the enactor in service orchestration workflows. As a result, Flowbster workflows implement a much more efficient data path than service orchestration workflows. A Flowbster workflow works as a data pipeline, enabling the exploitation of pipeline parallelism, parallel-branch parallelism, and node-scalability parallelism. The workflow can be deployed in the target cloud on demand using the underlying Occopus cloud deployment and orchestration tool. Occopus ensures that the workflow can be deployed in several major types of IaaS clouds (OpenStack, OpenNebula, Amazon, CloudSigma). It not only deploys the nodes of the workflow but also maintains their health using various health-checking options. Flowbster also provides an intuitive graphical user interface for end-user scientists. This interface hides the low-level cloud-oriented layers, so users can concentrate on the business logic of their data processing applications without detailed knowledge of the underlying cloud infrastructure.
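The choreography idea in the abstract can be illustrated with a minimal in-process sketch. It is not Flowbster's implementation or API: the `Node` class, its `receive` and `connect` methods, and the port names are hypothetical, and direct method calls stand in for whatever network communication real nodes use. The sketch only shows the activation rule: each node decides locally, per data-set index, whether all of its inputs have arrived, and then forwards its result directly to the downstream nodes without consulting a central enactor.

```python
from collections import defaultdict


class Node:
    """Hypothetical workflow node that fires when all inputs of one data set are present."""

    def __init__(self, name, input_ports, func):
        self.name = name
        self.input_ports = list(input_ports)
        self.func = func
        self.targets = []                 # (downstream node, downstream port) pairs
        self.pending = defaultdict(dict)  # data-set index -> {port: value}
        self.outputs = {}                 # data-set index -> result (kept for inspection)

    def connect(self, node, port):
        """Send this node's results directly to `port` of `node`."""
        self.targets.append((node, port))

    def receive(self, index, port, value):
        """Accept one input item; run the node if the data set is now complete."""
        slot = self.pending[index]
        slot[port] = value
        # Local activation check: no central control service is involved.
        if all(p in slot for p in self.input_ports):
            args = [slot[p] for p in self.input_ports]
            del self.pending[index]
            result = self.func(*args)
            self.outputs[index] = result
            for node, target_port in self.targets:
                node.receive(index, target_port, result)


# Two parallel branches feeding one joining node.
double = Node("double", ["x"], lambda x: 2 * x)
square = Node("square", ["x"], lambda x: x * x)
add = Node("add", ["a", "b"], lambda a, b: a + b)
double.connect(add, "a")
square.connect(add, "b")

# Items of two data sets (index 0 and 1) arrive interleaved.
double.receive(0, "x", 3)
double.receive(1, "x", 4)
square.receive(1, "x", 4)  # completes data set 1 at "add"
square.receive(0, "x", 3)  # completes data set 0 at "add"

print(add.outputs)
```

Because inputs are matched by data-set index, the joining node processes data set 1 before data set 0 here, which mirrors how a pipeline can keep several data sets in flight at once.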

Acknowledgements

This work is partially funded by the European CloudiFacturing (Cloudification of Production Engineering for Predictive Digital Manufacturing) project under grant No. 768892 (H2020-FoF-2017), and by the International Science & Technology Cooperation Program of China under grant No. 2015DFE12860. On behalf of the Flowbster project, we thank MTA Cloud (https://cloud.mta.hu/) for the use of its resources, which significantly helped us achieve the results published in this paper.

Author information

Correspondence to Peter Kacsuk.

About this article

Cite this article

Kacsuk, P., Kovács, J. & Farkas, Z. The Flowbster Cloud-Oriented Workflow System to Process Large Scientific Data Sets. J Grid Computing 16, 55–83 (2018). https://doi.org/10.1007/s10723-017-9420-4

Keywords

  • Workflow
  • Data stream
  • Scientific data
  • Cloud orchestration
  • Multi-cloud