Abstract
With the increasing demand for safe and (semi-)automated parallelization of software, the scheduling of automatically generated task graphs is becoming ever more important. Previous static scheduling algorithms assume that the run-time overhead of spawning and joining tasks is negligible. We show that this overhead is significant for the small- to medium-sized tasks that are common both in automatically generated task graphs and in existing parallel applications.
By comparing the real-world execution times of schedules with the predicted static schedule lengths, we show that the static schedule lengths are uncorrelated with the measured execution times and underestimate the execution times of task graphs containing small tasks by factors of up to one thousand. The static schedules are realistic only in the limiting case where all tasks are vastly larger than the scheduling overhead. Thus, for tasks that are not large, the real-world speedup achieved with these algorithms may be arbitrarily poor, possibly occupying many cores while realizing a speedup of less than one, irrespective of any theoretical guarantees given for these algorithms. This is especially harmful on battery-driven devices that would otherwise shut down unused cores.
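The spawn/join overhead referred to above can be observed directly. The following micro-benchmark is an illustrative sketch, not the paper's measurement setup: it pays a full thread spawn and join for every tiny task and compares this against running the same tasks serially, yielding a rough per-task overhead estimate.

```python
import threading
import time

def tiny_task():
    # A deliberately small task body: a few arithmetic operations.
    return sum(i * i for i in range(100))

N = 200

# Serial baseline: run the tiny tasks without any scheduling machinery.
t0 = time.perf_counter()
for _ in range(N):
    tiny_task()
serial = time.perf_counter() - t0

# Version that pays a full spawn/join per task. The threads run one at
# a time (start() is immediately followed by join()), so the difference
# to the baseline is pure scheduling overhead, not parallel work.
t0 = time.perf_counter()
for _ in range(N):
    t = threading.Thread(target=tiny_task)
    t.start()
    t.join()
spawned = time.perf_counter() - t0

overhead_per_task = (spawned - serial) / N
print(f"serial: {serial:.4f}s, spawn/join: {spawned:.4f}s, "
      f"overhead per task: {overhead_per_task * 1e6:.1f} us")
```

On typical hardware the per-task overhead is tens of microseconds, i.e. far larger than the task body itself, which is exactly the regime in which static schedule lengths become unrealistic.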
We derive a model that predicts parallel task execution times on symmetric schedulers, i.e., schedulers whose run-time scheduling overhead is homogeneous. The soundness of the model is verified by comparing the static and real-world overhead of different run-time schedulers. Finally, we present the first clustering algorithm that guarantees a real-world speedup, by clustering all parallel tasks in the task graph that cannot be executed efficiently in parallel. Our algorithm considers both the specific target hardware and the scheduler implementation, and runs in time cubic in the size of the task graph.
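As a toy illustration of such an overhead-aware model (the function names, the overhead symbol `sigma`, and the amortization threshold below are illustrative assumptions, not the paper's notation or algorithm), one can charge every scheduled unit its work plus one spawn/join overhead, and then merge small tasks into sequential clusters until each cluster amortizes that overhead:

```python
# Toy overhead-aware execution-time model for independent tasks.
# sigma is the per-spawn/join overhead; tasks and sigma share one
# arbitrary time unit. Illustrative sketch only.

def predicted_time(tasks, p, sigma):
    """Greedy list-schedule estimate on p cores: each scheduled unit
    costs its own work plus one spawn/join overhead sigma."""
    loads = [0.0] * p
    for t in sorted(tasks, reverse=True):
        i = loads.index(min(loads))  # place on least-loaded core
        loads[i] += t + sigma
    return max(loads)

def cluster(tasks, sigma, factor=10.0):
    """Merge tasks into sequential clusters until each cluster is at
    least `factor` times the overhead, amortizing the spawn cost."""
    clusters, current = [], 0.0
    for t in tasks:
        current += t
        if current >= factor * sigma:
            clusters.append(current)
            current = 0.0
    if current > 0.0:
        clusters.append(current)
    return clusters

tasks = [0.001] * 1000  # 1000 tiny tasks of 1 ms each
sigma = 0.005           # 5 ms spawn/join overhead per task
serial = sum(tasks)

naive = predicted_time(tasks, p=4, sigma=sigma)
clustered = predicted_time(cluster(tasks, sigma), p=4, sigma=sigma)

print(f"serial {serial:.3f}s, naive parallel {naive:.3f}s, "
      f"clustered parallel {clustered:.3f}s")
```

With these numbers the naive parallel estimate is slower than serial execution (a real-world speedup below one), while clustering restores a genuine speedup, mirroring the effect the clustering algorithm exploits.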
Our results are confirmed by applying our algorithm to a large set of randomly generated benchmark task graphs.
Notes
1. Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz, SMP x86_64, GNU/Linux 3.5.0-37-generic.
2. Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz, SMP x86_64, GNU/Linux 3.5.0-17-generic.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Herz, A., Pinkau, C. (2015). Real-World Clustering for Task Graphs on Shared Memory Systems. In: Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2014. Lecture Notes in Computer Science(), vol 8828. Springer, Cham. https://doi.org/10.1007/978-3-319-15789-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-15788-7
Online ISBN: 978-3-319-15789-4
eBook Packages: Computer Science (R0)