Abstract
Multicore NUMA systems present on-board memory hierarchies and communication networks that influence performance when executing shared memory parallel codes. Characterising this influence is complex, and understanding the effect of particular hardware configurations on different codes is of paramount importance. In this paper, monitoring information extracted from hardware counters at runtime is used to characterise the behaviour of each thread in the processes running in the system. This characterisation is given in terms of number of instructions per second, operational intensity, and latency of memory access. We propose to use all this information to guide a thread migration strategy that improves execution efficiency by increasing locality and affinity. Different configurations of NAS Parallel OpenMP benchmarks running concurrently on multicore systems were used to validate the benefits of the proposed thread migration strategy. Our proposal produces up to 25% improvement over the OS for heterogeneous workloads, under different and realistic locality and affinity scenarios.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Adhianto, L., Banerjee, S., Fagan, M., et al.: HPCToolkit: tools for performance analysis of optimized parallel programs. Concurr. Comput.: Pract. Exp. 22(6), 685–701 (2010). https://doi.org/10.1002/cpe.1553
Akiyama, S., Hirofuchi, T.: Quantitative evaluation of intel PEBS overhead for online system-noise analysis. In: Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017, ROSS 2017, pp. 3:1–3:8. ACM, New York (2017). https://doi.org/10.1145/3095770.3095773
Chasparis, G.C., Rossbory, M.: Efficient dynamic pinning of parallelized applications by distributed reinforcement learning. Int. J. Parallel Program. 47(1), 24–38 (2017). https://doi.org/10.1007/s10766-017-0541-y
Cheung, A., Madden, S.: Performance profiling with EndoScope, an acquisitional software monitoring framework. Proc. VLDB Endow. 1(1), 42–53 (2008). https://doi.org/10.14778/1453856.1453866
Cho, J.H., Wang, Y., Chen, R., Chan, K.S., Swami, A.: A survey on modeling and optimizing multi-objective systems. IEEE Commun. Surv. Tutor. 19, 1867–1901 (2017). https://doi.org/10.1109/COMST.2017.2698366
Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurr. Comput.: Pract. Exp. 22(6), 702–719 (2010). https://doi.org/10.1002/cpe.1556
Goumas, G., Kourtis, K., Anastopoulos, N., Karakasis, V., Koziris, N.: Performance evaluation of the sparse matrix-vector multiplication on modern architectures. J. Supercomput. 50(1), 36–77 (2009). https://doi.org/10.1007/s11227-008-0251-8
Intel Corp.: Intel 64 and IA-32 Architectures Software Developer Manuals (2017). https://software.intel.com/articles/intel-sdm. Accessed Nov 2019
Intel Developer Zone: Fluctuating FLOP count on Sandy Bridge (2013). http://software.intel.com/en-us/forums/topic/375320. Accessed Nov 2019
Jin, H., Frumkin, M., Yan, J.: The OpenMP implementation of NAS parallel benchmarks and its performance. Technical report, Technical Report NAS-99-011, NASA Ames Research Center (1999)
Ju, M., Jung, H., Che, H.: A performance analysis methodology for multicore, multithreaded processors. IEEE Trans. Comput. 63(2), 276–289 (2014). https://doi.org/10.1109/TC.2012.223
Kleen, A.: A NUMA API for Linux. Novel Inc. (2005)
Lameter, C., et al.: NUMA (non-uniform memory access): an overview. ACM Queue 11(7), 40 (2013). https://queue.acm.org/detail.cfm?id=2513149
Lorenzo, O.G., Pena, T.F., Cabaleiro, J.C., Pichel, J.C., Rivera, F.F.: 3DyRM: a dynamic roofline model including memory latency information. J. Supercomput. 70(2), 696–708 (2014). https://doi.org/10.1007/s11227-014-1163-4
Lorenzo, O.G., Pena, T.F., Cabaleiro, J.C., Pichel, J.C., Rivera, F.F.: Multiobjective optimization technique based on monitoring information to increase the performance of thread migration on multicores. In: 2014 IEEE International Conference on Cluster Computing (CLUSTER), pp. 416–423. IEEE (2014). https://doi.org/10.1109/CLUSTER.2014.6968733
Rane, A., Stanzione, D.: Experiences in tuning performance of hybrid MPI/OpenMP applications on quad-core systems. In: Proceedings of 10th LCI International Conference on High-Performance Clustered Computing (2009)
Schulz, M., de Supinski, B.R.: PNMPI tools: a whole lot greater than the sum of their parts. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (2007). https://doi.org/10.1145/1362622.1362663
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). https://doi.org/10.1145/1498765.1498785
Acknowledgements
This work has received financial support from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08 and reference competitive group 2019-2021, ED431C 2018/19) and the European Regional Development Fund (ERDF). It was also funded by the Ministerio de Economía, Industria y Competitividad within the project TIN2016-76373-P.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
García Lorenzo, O., Laso Rodríguez, R., Fernández Pena, T., Cabaleiro Domínguez, J.C., Fernández Rivera, F., Lorenzo del Castillo, J.Á. (2020). A New Hardware Counters Based Thread Migration Strategy for NUMA Systems. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science(), vol 12044. Springer, Cham. https://doi.org/10.1007/978-3-030-43222-5_18
Download citation
DOI: https://doi.org/10.1007/978-3-030-43222-5_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43221-8
Online ISBN: 978-3-030-43222-5
eBook Packages: Computer ScienceComputer Science (R0)