Toward efficient execution of data-intensive workflows

Abstract

Workflows that consume and produce large amounts of data are widely used in modern scientific computing and data processing pipelines. Scheduling data-intensive workflows requires careful management of data transfers between tasks, since network contention can significantly increase workflow execution time. The paper presents and evaluates several scheduling algorithms, data transfer strategies and optimizations aimed at efficient execution of data-intensive workflows. The studied approaches reduce or completely avoid network contention by explicitly scheduling data transfers, and they incorporate several optimizations, such as data caching and chunked and peer-to-peer data transfers. The results of an experimental study demonstrate that the relative performance of the different approaches depends on the workflow properties, the data staging strategy and the system configuration. The proposed CAS-L1 heuristic with additional data transfer optimizations achieves the best results.
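To make the effect of network contention concrete, the following minimal sketch compares two ways of sending several files over one link. It is not the paper's CAS-L1 heuristic or any algorithm from the study. It assumes an idealized model in which all transfers start at time zero and either share the link bandwidth equally or run one after another. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def fair_share_finish_times(sizes, bandwidth):
    """Finish times when all transfers start at t=0 and split one link equally.

    Assumption: idealized equal bandwidth sharing, no latency or protocol overhead.
    """
    order = sorted((size, idx) for idx, size in enumerate(sizes))
    finish = [0.0] * len(sizes)
    now = 0.0
    sent = 0.0              # amount every still-active transfer has sent so far
    active = len(order)     # number of transfers currently sharing the link
    for size, idx in order:
        # Each active transfer progresses at bandwidth / active.
        now += (size - sent) * active / bandwidth
        sent = size
        finish[idx] = now
        active -= 1
    return finish


def serialized_finish_times(sizes, bandwidth):
    """Finish times when transfers use the full link one at a time, in list order."""
    finish = []
    now = 0.0
    for size in sizes:
        now += size / bandwidth
        finish.append(now)
    return finish


if __name__ == "__main__":
    sizes = [100.0, 100.0]   # hypothetical file sizes
    bandwidth = 10.0         # hypothetical link bandwidth (size units per second)
    print("shared:    ", fair_share_finish_times(sizes, bandwidth))
    print("serialized:", serialized_finish_times(sizes, bandwidth))
```

In this toy model the last transfer finishes at the same time in both cases, but with serialized transfers the first file arrives earlier, so a task that needs only that file could start sooner. This is one intuition for why explicitly ordering transfers can shorten execution time. Real networks and the simulation models used in such studies are more complex.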

Notes

  1. https://github.com/alexmnazarenko/pysimgrid

Acknowledgements

This work is supported by the Russian Science Foundation (Project 16-11-10352).

Author information

Corresponding author

Correspondence to Oleg Sukhoroslov.

About this article

Cite this article

Sukhoroslov, O. Toward efficient execution of data-intensive workflows. J Supercomput 77, 7989–8012 (2021). https://doi.org/10.1007/s11227-020-03612-4

Keywords

  • Workflows
  • Data-intensive computing
  • Task scheduling
  • Data management
  • Simulation