Abstract
Modern high-performance server systems are typically built from several multi-core chips combined in a single machine. Each chip is connected to its local memory through an integrated memory controller (IMC) and behaves as a node, so the machine as a whole behaves as a non-uniform memory architecture (NUMA). Various user-level run-time systems adopt the work stealing load-balancing technique on multi-core processors. Such run-times have to be aware of the topology of the processor on which they run. They typically rely on lock-based synchronization to guarantee the coherency of shared mutable state, and synchronization constructs such as mutex locks, condition variables, and barriers are used extensively in their implementation. The locality of these lock variables on multi-socket NUMA processors has a considerable impact on the performance of these run-time systems. This paper studies the effect of the locality of these synchronization constructs and proposes making them NUMA-aware. The proposed methodology is implemented using a source-to-source translator of the OpenMP run-time and evaluated using OpenMP microbenchmark programs.
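To make the notion of lock locality concrete, the following is a minimal sketch in C of one common arrangement: one mutex per NUMA node, padded to a cache line so that locks belonging to different nodes never share a line. It is an illustration under stated assumptions, not the method proposed in the paper. The names `node_slot_t`, `slots_init`, `node_of_core`, and `node_local_increment`, the 64-byte cache-line size, the bound of 8 nodes, and the consecutive core-to-node numbering are all hypothetical. The sketch also does not place memory on a specific node; on Linux that would additionally need a facility such as libnuma's `numa_alloc_onnode` or first-touch initialization by a thread bound to that node.

```c
#include <assert.h>
#include <pthread.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */
#define MAX_NODES  8  /* assumed upper bound on NUMA nodes */

/* One slot per NUMA node. The alignment pads each slot to a full cache
 * line, so a lock used by one node does not share a line with another. */
typedef struct {
    _Alignas(CACHE_LINE) pthread_mutex_t lock;
    long counter; /* example of shared state protected by the lock */
} node_slot_t;

node_slot_t slots[MAX_NODES];

/* Hypothetical topology: cores are numbered consecutively within a node. */
int node_of_core(int core, int cores_per_node)
{
    assert(core >= 0 && cores_per_node > 0);
    return (core / cores_per_node) % MAX_NODES;
}

/* Initialize every per-node lock and reset its counter. */
void slots_init(void)
{
    for (int i = 0; i < MAX_NODES; i++) {
        pthread_mutex_init(&slots[i].lock, NULL);
        slots[i].counter = 0;
    }
}

/* A thread running on `core` synchronizes only on its own node's lock. */
void node_local_increment(int core, int cores_per_node)
{
    node_slot_t *s = &slots[node_of_core(core, cores_per_node)];
    pthread_mutex_lock(&s->lock);
    s->counter++;
    pthread_mutex_unlock(&s->lock);
}
```

With this layout, threads on different nodes contend on different locks, and cross-node cache-line traffic for the lock word itself is avoided; whether that matches the gains reported in the paper is not something this sketch can show.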
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Vikranth, B., Wankar, R., Raghavendra Rao, C. (2019). Affinity-Aware Synchronization in Work Stealing Run-Times for NUMA Multi-core Processors. In: Mandal, J., Bhattacharyya, D., Auluck, N. (eds) Advanced Computing and Communication Technologies. Advances in Intelligent Systems and Computing, vol 702. Springer, Singapore. https://doi.org/10.1007/978-981-13-0680-8_12
DOI: https://doi.org/10.1007/978-981-13-0680-8_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0679-2
Online ISBN: 978-981-13-0680-8
eBook Packages: Intelligent Technologies and Robotics (R0)