
Real-World Clustering for Task Graphs on Shared Memory Systems

  • Conference paper

Job Scheduling Strategies for Parallel Processing (JSSPP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8828)


Abstract

Due to the increasing desire for safe and (semi-)automated parallelization of software, the scheduling of automatically generated task graphs is becoming increasingly important. Previous static scheduling algorithms assume that the run-time overhead of spawning and joining tasks is negligible. We show that this overhead is significant for the small- to medium-sized tasks that are common in automatically generated task graphs and in existing parallel applications.
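To make the overhead concrete, the following sketch (plain C++11 with std::async rather than any of the run-time schedulers measured in the paper, and with an arbitrary task count) times a batch of tiny tasks once sequentially and once with one spawn and one join per task. On typical desktop hardware the spawned variant is orders of magnitude slower, because creating and joining a worker costs far more than the few-nanosecond task body.

    #include <chrono>
    #include <future>
    #include <iostream>

    // Hypothetical tiny task body (a few arithmetic operations), standing in
    // for the small tasks produced by automatic parallelization.
    static long tiny_task(long x) { return x * x + 1; }

    int main() {
        constexpr long N = 10000;
        using clk = std::chrono::steady_clock;
        using ms  = std::chrono::milliseconds;
        long sink = 0;

        // Sequential baseline: only the task bodies themselves.
        auto t0 = clk::now();
        for (long i = 0; i < N; ++i) sink += tiny_task(i);
        auto seq = std::chrono::duration_cast<ms>(clk::now() - t0).count();

        // One spawn and one join per task: the measured time is dominated by
        // scheduling overhead, not by useful work.
        auto t1 = clk::now();
        for (long i = 0; i < N; ++i) {
            auto f = std::async(std::launch::async, tiny_task, i);
            sink += f.get();
        }
        auto spawned = std::chrono::duration_cast<ms>(clk::now() - t1).count();

        std::cout << "sequential: " << seq << " ms, spawn/join per task: "
                  << spawned << " ms (sink = " << sink << ")\n";
        return 0;
    }

Compile with g++ -std=c++11 -pthread. The exact ratio depends on the hardware and the scheduler implementation, which is precisely the dependence captured by the model described below.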

By comparing real-world execution times of a schedule to the predicted static schedule lengths, we show that the static schedule lengths are uncorrelated with the measured execution times and underestimate the execution times of task graphs by factors of up to a thousand if the task graph contains small tasks. The static schedules are realistic only in the limiting case where all tasks are vastly larger than the scheduling overhead. Thus, unless all tasks are large, the real-world speedup achieved with these algorithms may be arbitrarily bad, possibly occupying many cores only to realize a speedup smaller than one, irrespective of any theoretical guarantees given for these algorithms. This is especially harmful on battery-driven devices that would otherwise shut down unused cores.

We derive a model to predict parallel task execution times on symmetric schedulers, i.e. schedulers whose run-time scheduling overhead is homogeneous. The soundness of the model is verified by comparing the static and the real-world overhead of different run-time schedulers. Finally, we present the first clustering algorithm that guarantees a real-world speedup by clustering all parallel tasks in the task graph that cannot be efficiently executed in parallel. Our algorithm takes both the specific target hardware and the scheduler implementation into account, and is cubic in the size of the task graph.
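As a hedged illustration of the kind of decision an overhead-aware clustering must make (a sketch under simplified assumptions, not the paper's algorithm), consider two independent tasks with work w1 and w2 on a symmetric scheduler whose measured per-spawn/join overhead is o: executing them in parallel costs roughly max(w1, w2) + o, while executing them clustered costs w1 + w2, so clustering pays off whenever the smaller task does not amortize the overhead.

    #include <algorithm>
    #include <iostream>

    // Assumed per-spawn/join overhead of the target scheduler, in nanoseconds.
    // The value is purely illustrative; it has to be measured for the concrete
    // hardware/scheduler combination.
    constexpr double kOverheadNs = 20000.0;

    // Predicted wall-clock time when two independent tasks run in parallel on a
    // symmetric scheduler: the longer task plus one spawn/join overhead.
    double parallel_time(double w1, double w2) {
        return std::max(w1, w2) + kOverheadNs;
    }

    // Predicted time when the two tasks are clustered into one sequential task.
    double clustered_time(double w1, double w2) {
        return w1 + w2;
    }

    // Cluster whenever sequential execution is predicted to be at least as fast,
    // i.e. whenever the smaller task is no larger than the overhead.
    bool should_cluster(double w1, double w2) {
        return clustered_time(w1, w2) <= parallel_time(w1, w2);
    }

    int main() {
        std::cout << std::boolalpha
                  << should_cluster(5000.0, 8000.0) << "\n"  // tiny tasks: cluster
                  << should_cluster(1e6, 2e6) << "\n";       // large tasks: keep parallel
        return 0;
    }

The algorithm presented in the paper applies an overhead-aware criterion of this flavor across the entire task graph, which is why the measured overhead of the concrete hardware and scheduler enters as an input.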

Our results are confirmed by applying our algorithm to a large set of randomly generated benchmark task graphs.


Notes

  1. Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz, SMP x86_64, GNU/Linux 3.5.0-37-generic.

  2. Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz, SMP x86_64, GNU/Linux 3.5.0-17-generic.


Author information


Correspondence to Alexander Herz.



Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Herz, A., Pinkau, C. (2015). Real-World Clustering for Task Graphs on Shared Memory Systems. In: Cirne, W., Desai, N. (eds.) Job Scheduling Strategies for Parallel Processing. JSSPP 2014. Lecture Notes in Computer Science, vol 8828. Springer, Cham. https://doi.org/10.1007/978-3-319-15789-4_2


  • DOI: https://doi.org/10.1007/978-3-319-15789-4_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-15788-7

  • Online ISBN: 978-3-319-15789-4

  • eBook Packages: Computer Science (R0)
