The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9903)


Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program, and developers still lack powerful tools for understanding how well their accelerated codes perform and how to improve them.

Because different types of hardware accelerators are available, each with its own low-level programming API, there is not yet a clear consensus on a standard way to retrieve information about an accelerator’s performance. To improve this situation, OMPT is a novel performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows analysis tools to monitor the execution of parallel OpenMP applications by providing detailed information about the activity of the runtime through a standard API. For accelerated devices, OMPT also facilitates the exchange of performance information between the runtime and the analysis tool. We implement the part of the OMPT specification that concerns accelerators in both the Nanos++ parallel runtime system and the Extrae tracing framework, obtaining detailed performance information about the execution of the tasks issued to the accelerated devices for later analysis.
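As a rough illustration of the callback-style monitoring described above, the following minimal sketch shows a runtime notifying a registered tool at the start and end of each task, so the tool can build per-device trace records. Everything in it is hypothetical: `ToyRuntime`, `ToyTracer`, `set_callback`, and the event names are invented for this example and are not the OMPT, Nanos++, or Extrae APIs, and the clock is simulated to keep the output deterministic.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical event names; the real OMPT interface defines its own callbacks.
TASK_BEGIN = "task_begin"
TASK_END = "task_end"

# A callback receives (device name, task id, timestamp).
Callback = Callable[[str, int, float], None]


class ToyRuntime:
    """Stand-in for a parallel runtime that reports its activity to a tool."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = defaultdict(list)
        self._clock = 0.0  # simulated time, not a real timer

    def set_callback(self, event: str, fn: Callback) -> None:
        """Let a tool subscribe to one kind of runtime event."""
        self._callbacks[event].append(fn)

    def _notify(self, event: str, device: str, task_id: int) -> None:
        for fn in self._callbacks[event]:
            fn(device, task_id, self._clock)

    def run_task(self, device: str, task_id: int, duration: float) -> None:
        """Pretend to run a task on a device, notifying the tool around it."""
        self._notify(TASK_BEGIN, device, task_id)
        self._clock += duration  # the "work" only advances simulated time
        self._notify(TASK_END, device, task_id)


class ToyTracer:
    """Stand-in for a tracing tool that pairs begin/end events into records."""

    def __init__(self, runtime: ToyRuntime) -> None:
        self.records: List[Tuple[str, int, float, float]] = []
        self._open: Dict[Tuple[str, int], float] = {}
        runtime.set_callback(TASK_BEGIN, self._on_begin)
        runtime.set_callback(TASK_END, self._on_end)

    def _on_begin(self, device: str, task_id: int, t: float) -> None:
        self._open[(device, task_id)] = t

    def _on_end(self, device: str, task_id: int, t: float) -> None:
        start = self._open.pop((device, task_id))
        self.records.append((device, task_id, start, t))


runtime = ToyRuntime()
tracer = ToyTracer(runtime)
runtime.run_task("fpga", 1, 2.0)
runtime.run_task("host", 2, 0.5)

for device, task_id, start, end in tracer.records:
    print(f"task {task_id} on {device}: {start:.1f} -> {end:.1f}")
```

The point of the design is that the tool never inspects the runtime's internals: it only sees the events the runtime chooses to publish, which is what allows one tool to work with any runtime that offers the same interface.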

Our work extends previous efforts in the field to expose detailed information from the OpenMP and OmpSs runtimes regarding the activity and performance of task-based parallel applications. In this paper, we focus on the evaluation of FPGA devices, studying the performance of two kernels common in scientific algorithms: matrix multiplication and Cholesky decomposition. Furthermore, this development is seamlessly applicable to the analysis of GPGPU accelerators and Intel® Xeon Phi™ co-processors operating under the OmpSs programming model.


Cholesky Decomposition, Memory Transfer, Hardware Accelerator, Parallel Programming Model, Master Thread
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and the Mont-Blanc 2 project; by the Ministerio de Economía y Competitividad, under contract Computación de Altas Prestaciones VII (TIN2015-65316-P); by the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under projects MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051) and 2009-SGR-980; by the BSC-CNS Severo Ochoa program (SEV-2011-00067); by the Intel-BSC Exascale Laboratory project; and by the OMPT Working Group.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona, Spain
  2. Department of Computer Architecture, Polytechnic University of Catalonia-BarcelonaTech, Barcelona, Spain
  3. Intel Corporation Iberia, Madrid, Spain
