Using Data Dependencies to Improve Task-Based Scheduling Strategies on NUMA Architectures

  • Philippe Virouleau
  • François Broquedis
  • Thierry Gautier
  • Fabrice Rastello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


The recent addition of data dependencies to the OpenMP 4.0 standard provides the application programmer with a more flexible way of synchronizing tasks. This approach lets both the compiler and the runtime system know exactly which data are read or written by a given task, and how these data are used throughout the program's lifetime. Data placement and task scheduling strategies have a significant impact on performance on NUMA architectures. While numerous papers focus on these topics, none of them makes extensive use of the information available through dependencies. This information can be used to modify the behavior of the application at several levels: during initialization, to control data placement, and during execution, to dynamically control both task placement and the task-stealing strategy, depending on the machine topology. This paper introduces several heuristics for these strategies and their implementation in our OpenMP runtime Xkaapi. We also evaluate their performance on linear algebra applications executed on a 192-core NUMA machine, reporting noticeable improvements when both the architecture topology and the tasks' data dependencies are taken into account. We finally compare them to strategies presented in related work.


Keywords: OpenMP · Task dependencies · Benchmark · Runtime systems · NUMA · Xkaapi · Scheduling · Work-stealing



This work is integrated in and supported by the ELCI project, a French FSN (“Fonds pour la Société Numérique”) project that associates academic and industrial partners to design and provide a software environment for very high performance computing.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Philippe Virouleau (1, 2)
  • François Broquedis (1)
  • Thierry Gautier (2)
  • Fabrice Rastello (1)
  1. Inria, Univ. Grenoble Alpes, CNRS, Grenoble Institute of Technology, LIG, Grenoble, France
  2. LIP, ENS de Lyon, Lyon, France
