Abstract
Cloud computing offers the massive scalability and elasticity required by many scientific and commercial applications. Combining the computational and data-handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make them available to end users through simple graphical user interfaces. Integrating such frameworks with Big Data processing tools in the cloud opens new opportunities for application developers. This paper investigates how workflow systems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure-aware workflows is proposed, and a proof of concept is implemented by integrating the WS-PGRADE/gUSE science gateway framework with Hadoop, a parallel data processing solution based on the MapReduce paradigm, in the cloud. The analysis demonstrates that the described methods for integrating Big Data processing with workflows and science gateways work well across different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for the scientific analysis of Big Data.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Gugnani, S., Blanco, C., Kiss, T. et al. Extending Science Gateway Frameworks to Support Big Data Applications in the Cloud. J Grid Computing 14, 589–601 (2016). https://doi.org/10.1007/s10723-016-9369-8
DOI: https://doi.org/10.1007/s10723-016-9369-8