Distributed Caching of Scientific Workflows in Multisite Cloud

  • Conference paper
Database and Expert Systems Applications (DEXA 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12392)

Abstract

Many scientific experiments are performed using scientific workflows, which are becoming increasingly data-intensive. We consider the efficient execution of such workflows in the cloud, leveraging the heterogeneous resources available at multiple cloud sites (geo-distributed data centers). Since workflow users commonly reuse code or data from other workflows, a promising approach to efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. In this paper, we propose a solution for distributed caching of scientific workflows in a multisite cloud. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation on a three-site cloud with a data-intensive application in plant phenotyping shows that our solution can yield major performance gains, reducing total time by up to 42% when 60% of the input data is the same for each new execution.

Notes

  1. https://www.phenome-emphasis.fr/phenome_eng/.

  2. https://cassandra.apache.org.

Acknowledgments

This work was supported by the #DigitAg French initiative (www.hdigitag.fr), the SciDISC and HPDaSc Inria associated teams with Brazil, the Phenome-Emphasis project (ANR-11-INBS-0012) and IFB (ANR-11-INBS-0013) from the Agence Nationale de la Recherche and the France Grille Scientific Interest Group.

Author information

Corresponding author

Correspondence to Gaëtan Heidsieck.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Heidsieck, G., de Oliveira, D., Pacitti, E., Pradal, C., Tardieu, F., Valduriez, P. (2020). Distributed Caching of Scientific Workflows in Multisite Cloud. In: Hartmann, S., Küng, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2020. Lecture Notes in Computer Science, vol 12392. Springer, Cham. https://doi.org/10.1007/978-3-030-59051-2_4

  • DOI: https://doi.org/10.1007/978-3-030-59051-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59050-5

  • Online ISBN: 978-3-030-59051-2

  • eBook Packages: Computer Science, Computer Science (R0)
