ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

  • François Broquedis
  • Nathalie Furmento
  • Brice Goglin
  • Pierre-André Wacrenier
  • Raymond Namyst


Exploiting the full computational power of current hierarchical multiprocessor machines requires a careful distribution of threads and data across the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by giving programmers an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints about thread-memory affinity. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform both work-stealing-based balancing strategies and next-touch-based data distribution policies. These techniques provide insights about additional optimizations.
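To make the thread-memory affinity problem concrete, here is a minimal C sketch of the widely used first-touch idiom in plain OpenMP. It is not ForestGOMP's interface or method. The helper names `alloc_first_touch` and `sum_static` are hypothetical, and the sketch assumes the operating system places each page on the NUMA node of the thread that first writes it (commonly the default policy on Linux).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialise them in parallel.
 * Assumption: under a first-touch page placement policy, each page lands on
 * the NUMA node of the thread that writes it first. */
double *alloc_first_touch(size_t n)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* A static schedule gives each thread one contiguous chunk of indices. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 1.0;

    return a;
}

/* Hypothetical helper: sum the array with the same static schedule, so each
 * thread mostly reads the chunk it initialised (and that is local to it,
 * provided threads have not moved to another node in between). */
double sum_static(const double *a, size_t n)
{
    double total = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += a[i];

    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and both functions run serially with the same result. The idiom only works while the thread-to-chunk mapping stays fixed. That limitation is what motivates a runtime that can migrate threads and data together when the load becomes irregular.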


OpenMP, Memory, NUMA, Hierarchical Thread Scheduling, Multi-Core





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • François Broquedis (1)
  • Nathalie Furmento (1)
  • Brice Goglin (1)
  • Pierre-André Wacrenier (1)
  • Raymond Namyst (1)

  1. LaBRI, INRIA Bordeaux-Sud-Ouest, University of Bordeaux, Bordeaux, Talence, France
