Abstract
Modern parallel computer systems exhibit Non-Uniform Memory Access (NUMA) behavior. For best performance, a parallel program must therefore match data allocation and the scheduling of computations to the memory architecture of the machine. Doing this manually is tedious, and because each system has its own peculiarities, it also leads to programs that are not performance-portable.
We propose a data distribution scheme in which NUMA hardware peculiarities are abstracted away from the programmer and data distribution is delegated to a runtime system that is generated once for each machine. In addition, we propose using the task data dependence information made available by the OpenMP 4.0 RC2 proposal to guide the scheduling of OpenMP tasks and further reduce data stall times.
We demonstrate the viability and performance of our proposals on a four-socket AMD Opteron machine with eight NUMA nodes. We find that both data distribution and locality-aware task scheduling improve performance compared to default policies while still providing an architecture-oblivious approach for the programmer.
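To make the two ideas concrete, the following is a minimal illustrative sketch and not the runtime system described in the paper. It assumes a simple round-robin page distribution and a scheduling heuristic that places a task on the NUMA node holding most of the bytes named in its data dependences. The names `PAGE_SIZE`, `distribute_pages`, and `preferred_node` are hypothetical and chosen only for this example.

```python
from collections import Counter

PAGE_SIZE = 4096  # bytes; an assumed page size for this sketch


def distribute_pages(num_pages, num_nodes):
    """Assign pages to NUMA nodes round-robin (one possible policy, assumed here)."""
    return [page % num_nodes for page in range(num_pages)]


def preferred_node(dependences, page_to_node):
    """Return the node holding the most bytes of a task's dependences.

    dependences: list of (start_byte, length_bytes) ranges the task accesses.
    Returns None when the task declares no dependences.
    """
    bytes_per_node = Counter()
    for start, length in dependences:
        end = start + length
        offset = start
        while offset < end:
            page = offset // PAGE_SIZE
            page_end = (page + 1) * PAGE_SIZE
            # Count only the part of the range that lies on this page.
            chunk = min(end, page_end) - offset
            bytes_per_node[page_to_node[page]] += chunk
            offset += chunk
    if not bytes_per_node:
        return None
    # Ties are broken in favour of the lowest node id.
    return min(bytes_per_node, key=lambda node: (-bytes_per_node[node], node))
```

A scheduler built along these lines would enqueue each task on a queue belonging to the node returned by `preferred_node`, and could fall back to a default policy when the result is `None`. The programmer only declares the data ranges and never names a node, which is the architecture-oblivious property the abstract refers to.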
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Muddukrishna, A., Jonsson, P.A., Vlassov, V., Brorsson, M. (2013). Locality-Aware Task Scheduling and Data Distribution on NUMA Systems. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds) OpenMP in the Era of Low Power Devices and Accelerators. IWOMP 2013. Lecture Notes in Computer Science, vol 8122. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40698-0_12
DOI: https://doi.org/10.1007/978-3-642-40698-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40697-3
Online ISBN: 978-3-642-40698-0
eBook Packages: Computer Science, Computer Science (R0)