Using Data Dependencies to Improve Task-Based Scheduling Strategies on NUMA Architectures
The recent addition of data dependencies to the OpenMP 4.0 standard provides the application programmer with a more flexible way of synchronizing tasks. With this approach, both the compiler and the runtime system know exactly which data are read or written by a given task, and how these data will be used throughout the program's lifetime. Data placement and task scheduling strategies have a significant impact on performance on NUMA architectures. While numerous papers focus on these topics, none of them has made extensive use of the information available through dependencies. This information can be used to modify the behavior of the application at several levels: during initialization, to control data placement, and during execution, to dynamically control both task placement and the task-stealing strategy according to the topology. This paper introduces several heuristics for these strategies and their implementations in our OpenMP runtime, Xkaapi. We also evaluate their performance on linear algebra applications executed on a 192-core NUMA machine, and report noticeable performance improvements when both the architecture topology and the tasks' data dependencies are taken into account. Finally, we compare them to strategies presented previously in related work.
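To make the notion of task data dependencies concrete, the following minimal sketch uses only the standard OpenMP 4.0 `depend` clause on a small blocked pipeline. It is not taken from the paper and does not use any Xkaapi-specific interface: the function name `blocked_pipeline`, the constants `MAX_BLOCKS` and `BS`, and the three-stage fill/scale/reduce structure are assumptions chosen purely for illustration.

```c
#include <assert.h>

#define MAX_BLOCKS 16 /* illustrative upper bound on the number of blocks */
#define BS 8          /* illustrative number of elements per block */

/* Hypothetical example: fill, scale, then reduce each block using
 * dependent tasks. The depend clauses tell the runtime which block
 * each task reads or writes, so tasks on the same block are ordered
 * while tasks on different blocks may run concurrently. */
double blocked_pipeline(int nb)
{
    assert(nb > 0 && nb <= MAX_BLOCKS);

    double a[MAX_BLOCKS][BS];
    double partial[MAX_BLOCKS];
    double total = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nb; b++) {
            double *blk = a[b];      /* block written and read by the tasks */
            double *p = &partial[b]; /* per-block partial sum */

            /* Stage 1: produce the block (write-only access). */
            #pragma omp task depend(out: blk[0:BS])
            for (int i = 0; i < BS; i++)
                blk[i] = (double)(b + 1);

            /* Stage 2: update the block in place (read-write access). */
            #pragma omp task depend(inout: blk[0:BS])
            for (int i = 0; i < BS; i++)
                blk[i] *= 2.0;

            /* Stage 3: read the block and write its partial sum. */
            #pragma omp task depend(in: blk[0:BS]) depend(out: p[0:1])
            {
                double s = 0.0;
                for (int i = 0; i < BS; i++)
                    s += blk[i];
                *p = s;
            }
        }
        /* Wait for every task created above before leaving the region. */
        #pragma omp taskwait
    }

    /* Sequential final reduction over the per-block results. */
    for (int b = 0; b < nb; b++)
        total += partial[b];
    return total;
}
```

Because every task declares the array section it touches, a runtime could in principle use the same information to decide where a block should be allocated and which core should run the tasks that access it, which is the kind of decision the heuristics in this paper address. Compiled without OpenMP support, the pragmas are ignored and the function computes the same result sequentially.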
Keywords: OpenMP, Task dependencies, Benchmark, Runtime systems, NUMA, Xkaapi, Scheduling, Work-stealing
This work is part of and supported by the ELCI project, a French FSN (“Fond pour la Société Numérique”) project that brings together academic and industrial partners to design and provide a software environment for very high performance computing.