Abstract
Many scientific experiments are performed using scientific workflows, which are becoming more and more data-intensive. We consider the efficient execution of such workflows in the cloud, leveraging the heterogeneous resources available at multiple cloud sites (geo-distributed data centers). Since it is common for workflow users to reuse code or data from other workflows, a promising approach for efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. In this paper, we propose a solution for distributed caching of scientific workflows in a multisite cloud. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation on a three-site cloud with a data-intensive application in plant phenotyping shows that our solution can yield major performance gains, reducing total time up to 42% with 60% of same input data for each new execution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Altintas, I., Barney, O., Jaeger-Frank, E.: Provenance collection support in the Kepler scientific workflow system. In: Moreau, L., Foster, I. (eds.) IPAW 2006. LNCS, vol. 4145, pp. 118–132. Springer, Heidelberg (2006). https://doi.org/10.1007/11890850_14
Artzet, S., Brichet, N., Chopard, J., Mielewczik, M., Fournier, C., Pradal, C.: OpenAlea.phenomenal: a workflow for plant phenotyping, September 2018. https://doi.org/10.5281/zenodo.1436634
Callahan, S.P., Freire, J., Santos, E., Scheidegger, C.E., Silva, C.T., Vo, H.T.: VisTrails: visualization meets data management. In: ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 745–747 (2006)
Crago, S., et al.: Heterogeneous cloud computing. In: 2011 IEEE International Conference on Cluster Computing, pp. 378–385. IEEE (2011)
Garijo, D., Alper, P., Belhajjame, K., Corcho, O., Gil, Y., Goble, C.: Common motifs in scientific workflows: an empirical analysis. Future Gener. Comput. Syst. (FGCS) 36, 338–351 (2014)
Heidsieck, G., de Oliveira, D., Pacitti, E., Pradal, C., Tardieu, F., Valduriez, P.: Adaptive caching for data-intensive scientific workflows in the cloud. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) DEXA 2019. LNCS, vol. 11707, pp. 452–466. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27618-8_33
Kelling, S., et al.: Data-intensive science: a new paradigm for biodiversity studies. Bioscience 59(7), 613–620 (2009)
Liu, J., et al.: Efficient scheduling of scientific workflows using hot metadata in a multisite cloud. IEEE Trans. Knowl. Data Eng. 31(10), 1–20 (2018)
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A survey of data-intensive scientific workflow management. J. Grid Comput. 13(4), 457–493 (2015). https://doi.org/10.1007/s10723-015-9329-8
Liu, J., Pacitti, E., Valduriez, P., de Oliveira, D., Mattoso, M.: Multi-objective scheduling of scientific workflows in multisite clouds. Future Gener. Comput. Syst. (FGCS) 63, 76–95 (2016)
Maheshwari, K., Jung, E., Meng, J., Vishwanath, V., Kettimuthu, R.: Improving multisite workflow performance using model-based scheduling. In: IEEE International Conference on Parallel Processing (ICPP), pp. 131–140 (2014)
de Oliveira, D., Baião, F.A., Mattoso, M.: Towards a taxonomy for cloud computing from an e-science perspective. In: Antonopoulos, N., Gillam, L. (eds.) Cloud Computing. CCN, pp. 47–62. Springer, London (2010). https://doi.org/10.1007/978-1-84996-241-4_3
Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 4th edn. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-26253-2
Pradal, C., Fournier, C., Valduriez, P., Cohen-Boulakia, S.: OpenAlea: scientific workflows combining data analysis and simulation. In: International Conference on Scientific and Statistical Database Management (SSDBM), pp. 11:1–11:6 (2015)
Tardieu, F., Cabrera-Bosquet, L., Pridmore, T., Bennett, M.: Plant phenomics, from sensors to knowledge. Curr. Biol. 27(15), R770–R783 (2017)
Yuan, D., et al.: A highly practical approach toward achieving minimum data sets storage cost in the cloud. IEEE Trans. Parallel Distrib. Syst. 24(6), 1234–1244 (2013)
Zhang, J., Luo, J., Dong, F.: Scheduling of scientific workflow in non-dedicated heterogeneous multicluster platform. J. Syst. Softw. 86(7), 1806–1818 (2013)
Acknowledgments
This work was supported by the #DigitAg French initiative (www.hdigitag.fr), the SciDISC and HPDaSc Inria associated teams with Brazil, the Phenome-Emphasis project (ANR-11-INBS-0012) and IFB (ANR-11-INBS-0013) from the Agence Nationale de la Recherche and the France Grille Scientific Interest Group.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Heidsieck, G., de Oliveira, D., Pacitti, E., Pradal, C., Tardieu, F., Valduriez, P. (2020). Distributed Caching of Scientific Workflows in Multisite Cloud. In: Hartmann, S., Küng, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2020. Lecture Notes in Computer Science(), vol 12392. Springer, Cham. https://doi.org/10.1007/978-3-030-59051-2_4
Download citation
DOI: https://doi.org/10.1007/978-3-030-59051-2_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59050-5
Online ISBN: 978-3-030-59051-2
eBook Packages: Computer ScienceComputer Science (R0)