A Context-Aware Primitive for Nested Recursive Parallelism

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10104)


Nested recursive parallel applications constitute an important super-class of conventional, flat parallel codes. For this class, parallel libraries built on the concept of tasks have been widely adopted. However, the abstract task creation and synchronization interfaces they provide force implementations to treat each task creation and synchronization point in isolation, unaware of how these points relate to each other, and thereby lose optimization potential.

In this paper, we present a novel interface for task-level parallelism that enables implementations to grasp and manipulate the context of task creation and synchronization points, in particular for nested recursive parallelism. As a concrete application, we demonstrate the interface's capability to reduce parallel overhead within applications, based on a reference implementation that uses C++14 template metaprogramming techniques to synthesize multiple versions of a parallel task during compilation.
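The idea of synthesizing several versions of one task from a single description can be illustrated with a minimal C++14 sketch. It is not the paper's reference implementation: the names `RecursiveTask`, `make_recursive_task`, `Ready`, and `make_fib`, as well as the fixed depth cut-off, are assumptions made for illustration. The recursion is described once as a base-case test, a base case, and a step case, and the template instantiates both a sequential version and a task-spawning version from that description.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Stand-in for a future whose value is already known (used by the sequential version).
template <typename T>
struct Ready {
    T value;
    T get() { return std::move(value); }
};

// Hypothetical primitive; all names here are illustrative assumptions.
// It sees the whole recursion (test, base case, step case) instead of
// isolated spawn and wait calls.
template <typename Arg, typename Test, typename Base, typename Step>
class RecursiveTask {
    using Result = decltype(std::declval<const Base&>()(std::declval<const Arg&>()));
    Test is_base_;
    Base base_;
    Step step_;

public:
    RecursiveTask(Test t, Base b, Step s)
        : is_base_(std::move(t)), base_(std::move(b)), step_(std::move(s)) {}

    // Version 1: plain recursion, no task creation at all.
    Result sequential(const Arg& arg) const {
        if (is_base_(arg)) return base_(arg);
        return step_(arg, [this](const Arg& sub) {
            return Ready<Result>{this->sequential(sub)};
        });
    }

    // Version 2: spawns tasks for the top `depth` levels, then switches
    // to the sequential version (a simple, assumed cut-off policy).
    Result parallel(const Arg& arg, int depth) const {
        assert(depth >= 0);
        if (depth == 0) return sequential(arg);
        if (is_base_(arg)) return base_(arg);
        return step_(arg, [this, depth](const Arg& sub) {
            return std::async(std::launch::async, [this, depth, sub] {
                return this->parallel(sub, depth - 1);
            });
        });
    }
};

template <typename Arg, typename Test, typename Base, typename Step>
RecursiveTask<Arg, Test, Base, Step> make_recursive_task(Test t, Base b, Step s) {
    return {std::move(t), std::move(b), std::move(s)};
}

// Example description: naive Fibonacci. The generic lambda for the step
// case is instantiated once per version, with a different `rec` each time.
inline auto make_fib() {
    return make_recursive_task<long>(
        [](long n) { return n < 2; },   // base-case test
        [](long n) { return n; },       // base case
        [](long n, auto rec) {          // step case
            auto a = rec(n - 1);
            auto b = rec(n - 2);
            return a.get() + b.get();
        });
}
```

With `auto fib = make_fib();`, the calls `fib.sequential(20)` and `fib.parallel(20, 3)` compute the same value; only the second creates tasks, and only near the root of the recursion. Because the step case is a generic lambda, the compiler generates a separate body for each version, which is the kind of compile-time specialization the abstract alludes to.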

To demonstrate its effectiveness, we evaluate the impact of our approach on the performance of a series of eight task-parallel benchmarks. For those, our approach achieves substantial speed-ups over state-of-the-art solutions, in particular for use cases exhibiting fine-grained tasks.


Keywords (machine-generated, not supplied by the authors): Recursive Function; Runtime System; Parallel Code; Reference Implementation; Parallel Loop



This project has received funding from the European Union’s Horizon 2020 research and innovation programme as part of the FETHPC AllScale project under grant agreement No. 671603.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. University of Innsbruck, Innsbruck, Austria
  2. Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
