A Context-Aware Primitive for Nested Recursive Parallelism
Nested recursive parallel applications constitute an important super-class of conventional, flat parallel codes. For this class, parallel libraries utilizing the concept of tasks have been widely adapted. However, the provided abstract task creation and synchronization interfaces force corresponding implementations to focus their attention to individual task creation and synchronization points – unaware of their relation to each other – thereby losing optimization potential.
Within this paper, we present a novel interface for task level parallelism, enabling implementations to grasp and manipulate the context of task creation and synchronization points – in particular for nested recursive parallelism. Furthermore, as a concrete application, we demonstrate the interface’s capability to reduce parallel overhead within applications based on a reference implementation utilizing C++14 template meta programming techniques to synthesize multiple versions of a parallel task during the compilation process.
To demonstrate its effectiveness, we evaluate the impact of our approach on the performance of a series of eight task parallel benchmarks. For those, our approach achieves substantial speed-ups over state of the art solutions, in particular for use cases exhibiting fine grained tasks.
KeywordsRecursive Function Runtime System Parallel Code Reference Implementation Parallel Loop
This project has received funding from the European Union’s Horizon 2020 research and innovation programme as part of the FETHPC AllScale project under grant agreement No. 671603.
- 4.Batty, M., Memarian, K., Owens, S., Sarkar, S., Sewell, P.: Clarifying and compiling C/C++ concurrency: from C++11 to power. In: ACM SIGPLAN Notices, vol. 47, pp. 509–520. ACM (2012)Google Scholar
- 5.Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, pp. 1–11. IEEE (2008)Google Scholar
- 7.Guo, Y., Zhao, J., Cave, V., Sarkar, V.: SLAW: a scalable locality-aware adaptive work-stealing scheduler. In: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12. IEEE (2010)Google Scholar
- 8.Jones, S.: Introduction to dynamic parallelism. In: GPU Technology Conference Presentation, vol. 338 (2012)Google Scholar
- 9.Lakshmanan, K., Kato, S., Rajkumar, R.: Scheduling parallel real-time tasks on multi-core processors. In: 2010 IEEE 31st Real-Time Systems Symposium (RTSS), pp. 259–268. IEEE (2010)Google Scholar
- 11.Reinders, J.: Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media Inc., Sebastopol (2007)Google Scholar
- 12.Thoman, P., Gschwandtner, P., Fahringer, T.: On the quality of implementation of the C++11 thread support library. In: 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 94–98. IEEE (2015)Google Scholar