Workflows that consume and produce large amounts of data are widely used in modern scientific computing and data processing pipelines. Scheduling data-intensive workflows requires careful management of data transfers between tasks, since network contention can significantly impact workflow execution time. This paper presents and evaluates several scheduling algorithms, data transfer strategies and optimizations aimed at efficient execution of data-intensive workflows. The studied approaches reduce or completely avoid network contention by explicitly scheduling data transfers, and they incorporate several optimizations, such as data caching and chunked and peer-to-peer data transfers. The results of the experimental study demonstrate that the relative performance of the approaches depends on the workflow properties, data staging strategy and system configuration. The proposed CAS-L1 heuristic with additional data transfer optimizations achieves the best results.
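To illustrate why explicit scheduling of data transfers can matter, the following minimal sketch compares two ways of moving several files over one shared link. It is not the paper's CAS-L1 heuristic or any of its algorithms. It assumes a simplified model in which concurrent transfers split the link bandwidth equally; the function names, sizes (MB) and bandwidth (MB/s) are hypothetical.

```python
def fair_share_finish_times(sizes, bandwidth):
    """Finish time of each transfer when all start at t=0 and the
    active transfers share the link bandwidth equally (assumed model)."""
    remaining = {i: float(s) for i, s in enumerate(sizes)}
    finish = [0.0] * len(sizes)
    now = 0.0
    while remaining:
        rate = bandwidth / len(remaining)          # per-transfer rate
        nxt = min(remaining, key=remaining.get)    # next transfer to complete
        dt = remaining[nxt] / rate
        now += dt
        for i in remaining:
            # clamp to zero to absorb floating-point rounding
            remaining[i] = max(0.0, remaining[i] - rate * dt)
        finish[nxt] = now
        del remaining[nxt]
    return finish


def serialized_finish_times(sizes, bandwidth):
    """Finish time of each transfer when transfers run one after another,
    each using the full link bandwidth (no contention)."""
    finish = []
    now = 0.0
    for s in sizes:
        now += s / bandwidth
        finish.append(now)
    return finish


# Two 100 MB inputs over a hypothetical 10 MB/s link.
print(fair_share_finish_times([100, 100], 10))   # [20.0, 20.0]
print(serialized_finish_times([100, 100], 10))   # [10.0, 20.0]
```

Under this assumed model the last transfer completes at the same time in both cases, but with serialized transfers the first input is available after 10 s instead of 20 s, so a task that depends only on that input could start earlier.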
This work is supported by the Russian Science Foundation (Project 16-11-10352).
Sukhoroslov, O. Toward efficient execution of data-intensive workflows. J Supercomput 77, 7989–8012 (2021). https://doi.org/10.1007/s11227-020-03612-4
- Data-intensive computing
- Task scheduling
- Data management