Efficient runtime thread management for the nano-threads programming model
The nano-threads programming model was proposed to integrate multiprogramming on shared-memory multiprocessors effectively with the exploitation of fine-grain parallelism in standard applications. A prerequisite for the applicability of the nano-threads programming model is the ability of the runtime environment to manage parallelism at any level of granularity with minimal overhead. In this paper, we introduce runtime techniques for efficient memory management and user-level scheduling in an experimental runtime system designed to support the nano-threads programming model. We evaluate the exploitation of processor affinity for the management of nano-thread contexts, and the use of hierarchical queues to implement user-level scheduling strategies for applications with inherent multilevel parallelism. The proposed mechanisms attempt to obtain maximum benefit from data locality on cache-coherent NUMA multiprocessors. Using synthetic benchmarks, we find that our mechanism for memory management in the runtime system reduces overheads by 52% on average compared to other known mechanisms. The use of hierarchical queues yields performance improvements of between 17% and 40% compared to scheduling strategies that use local queues.
Keywords: Task Graph, Runtime System, Parallel Loop, Schedule Loop, Local Pool
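To make the idea of hierarchical queues more concrete, the following is a minimal, single-threaded sketch of how ready queues might be layered on a ccNUMA machine. It is not the paper's implementation: the three-level layout (per-processor, per-node, global), the lookup order, and all names (`HierarchicalQueues`, `enqueue`, `dequeue`, `node_of`) are assumptions made for illustration, and a real runtime system would also need synchronization on each queue.

```python
from collections import deque


class HierarchicalQueues:
    """Hypothetical three-level ready queues: per-CPU, per-node, global."""

    def __init__(self, num_nodes, cpus_per_node):
        self.cpus_per_node = cpus_per_node
        # One local queue per processor (innermost level of parallelism).
        self.local = [deque() for _ in range(num_nodes * cpus_per_node)]
        # One queue per NUMA node, shared by the processors of that node.
        self.node = [deque() for _ in range(num_nodes)]
        # One global queue visible to every processor (outermost level).
        self.global_q = deque()

    def node_of(self, cpu):
        # Assumption: processors are numbered contiguously within a node.
        return cpu // self.cpus_per_node

    def enqueue(self, task, cpu=None, node=None):
        """Place a task at the level matching the locality it should keep."""
        if cpu is not None:
            self.local[cpu].append(task)
        elif node is not None:
            self.node[node].append(task)
        else:
            self.global_q.append(task)

    def dequeue(self, cpu):
        """Search from the closest queue outward; return None if all are empty."""
        for queue in (self.local[cpu], self.node[self.node_of(cpu)], self.global_q):
            if queue:
                return queue.popleft()
        return None


# Processor 3 belongs to node 1 in a 2-node, 2-CPUs-per-node layout.
queues = HierarchicalQueues(num_nodes=2, cpus_per_node=2)
queues.enqueue("inner-loop-chunk", cpu=3)
queues.enqueue("node1-task", node=1)
queues.enqueue("outer-task")
order = [queues.dequeue(3) for _ in range(4)]
print(order)
```

Searching the closest queue first is one way to favor data locality: work created near a processor tends to run there, while the outer levels let idle processors pick up coarser-grain work from elsewhere in the machine.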