The Cooperative Parallel: A Discussion About Run-Time Schedulers for Nested Parallelism

  • Sara Royuela
  • Maria A. Serrano
  • Marta Garcia-Gasulla
  • Sergi Mateo Bellido
  • Jesús Labarta
  • Eduardo Quiñones
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11718)


Nested parallelism is a well-known parallelization strategy to exploit irregular parallelism in HPC applications. This strategy also fits critical real-time embedded systems, which are composed of a set of concurrent functionalities. In this case, nested parallelism can be used to further exploit the parallelism of each functionality. However, current run-time implementations of nested parallelism can produce inefficiencies and load imbalance. Moreover, in critical real-time embedded systems, it may lead to incorrect executions due to, for instance, a non-work-conserving scheduler. In both cases, the reason is that the teams of OpenMP threads are a black box for the scheduler, i.e., the scheduler that assigns OpenMP threads and tasks to the set of available computing resources is agnostic to the internal execution of each team.

This paper proposes a new run-time scheduler that considers dynamic information about the OpenMP threads and tasks running within several concurrent teams, i.e., concurrent parallel regions. This information may include the existence of OpenMP threads waiting in a barrier and the priority of tasks ready to execute. By making the concurrent parallel regions cooperate, the shared computing resources can be better controlled, and a work-conserving, priority-driven scheduler can be guaranteed.


Keywords: Resource allocation · Concurrency · Run-time scheduler



This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780622.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Barcelona Supercomputing Center, Barcelona, Spain
