Optimizing Task Parallelism with Library-Semantics-Aware Compilation

  • Peter Thoman
  • Stefan Moosbrugger
  • Thomas Fahringer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9233)

Abstract

With the spread of parallel architectures throughout all areas of computing, task-based parallelism is an increasingly commonly employed programming paradigm, due to its ease of use and potential scalability. Since C++11, the ISO C++ language standard library includes support for task parallelism. However, existing research and implementation work in task parallelism relies almost exclusively on runtime systems for achieving performance and scalability. We propose a combined compiler and runtime system approach that is aware of the parallel semantics of the C++ standard library functions, and is therefore capable of statically analyzing and optimizing their implementation, as well as automatically providing scheduling hints to the runtime system.
We have implemented this approach in an existing compiler and demonstrate its effectiveness by carrying out an empirical study across 9 task-parallel benchmarks. On a 32-core system, our method is, on average, 11.7 times faster than the best result for the Clang and GCC C++ standard library implementations, and 4.1 times faster than an OpenMP baseline.
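For context, task parallelism entered the ISO C++ standard library with C++11, chiefly through std::async and std::future. The minimal sketch below is our illustration rather than code from the paper; it shows the fine-grained recursive pattern typical of such task-parallel programs, with a hand-written cutoff (the value 20 is arbitrary) standing in for the kind of granularity control that a library-semantics-aware compiler and runtime can provide automatically.

    #include <future>
    #include <iostream>

    // Recursive Fibonacci with one branch spawned as a C++11 task.
    long fib(int n) {
        if (n < 2) return n;
        // Manual sequential cutoff: below this, per-task overhead in
        // typical standard library implementations exceeds the work done.
        if (n < 20) return fib(n - 1) + fib(n - 2);
        // Default launch policy: the library may run the task on another
        // thread or defer it; the call site itself carries no scheduling
        // or granularity information for the runtime to exploit.
        std::future<long> left = std::async(fib, n - 1);
        long right = fib(n - 2);    // compute the other branch inline
        return left.get() + right;  // join the spawned task
    }

    int main() {
        std::cout << fib(30) << std::endl;  // prints 832040
    }

Compiled with, e.g., g++ -std=c++11 -pthread, this runs under both the GCC and Clang standard libraries; how calls like std::async are implemented and scheduled is precisely what separates the implementations compared in the evaluation.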


Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Peter Thoman, University of Innsbruck, Innsbruck, Austria
  • Stefan Moosbrugger, University of Innsbruck, Innsbruck, Austria
  • Thomas Fahringer, University of Innsbruck, Innsbruck, Austria
