Toward a BLAS library truly portable across different accelerator types

  • Eduardo Rodriguez-Gutiez
  • Ana Moreton-Fernandez
  • Arturo Gonzalez-Escribano
  • Diego R. Llanos


Scientific applications are among the most computationally demanding pieces of software. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries address this problem by exposing a set of highly optimized, reusable routines. There are several implementations specifically tuned for different types of computing platforms, including coprocessors. Examples include the implementation bundled with the Intel MKL library, which targets Intel CPUs and Xeon Phi coprocessors, and the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, computing nodes in many supercomputing clusters include one or more coprocessors of different types. Fully exploiting these platforms requires programs that can adapt at run-time to the selected device type, which traditionally means hardwiring into the program the code needed to use a different library for each device type that may be selected. It also forces the programmer to deal with the interface particularities of each library and with the mechanisms that manage memory transfers for the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. This interface has been integrated into a heterogeneous programming model (Controllers) that supports groups of CPU cores, Xeon Phi accelerators, or NVIDIA GPUs in a transparent way. The contributions of this paper include: an abstraction layer that hides the programming differences between diverse BLAS libraries; new types of kernel classes that support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer-written kernels and different external libraries; and a complete new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter.
Our software internally exploits different preexisting and widely known BLAS library implementations, such as cuBLAS, MAGMA, or the one found in Intel MKL. It transparently uses the most appropriate library for the selected device. Our experimental results show that our abstraction does not introduce significant performance penalties, while achieving the desired portability.


BLAS · Parallel programming · Scientific libraries · Heterogeneous programming · Accelerators · Coprocessors · GPU · Xeon Phi · MIC · CUDA



This research was supported by an FPI Grant (Formación de Personal Investigador) from the Spanish Ministry of Science and Innovation (MCINN) to E.R-G. It has been partially funded by the Spanish Ministerio de Economía, Industria y Competitividad and by the ERDF program of the European Union: PCAS Project (TIN2017-88614-R), CAPAP-H6 (TIN2016-81840-REDT), and Junta de Castilla y Leon—FEDER Grant VA082P17 (PROPHET Project). We used the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain. Part of this work has been performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, the author gratefully acknowledges the support of Dr. Christophe Dubach, the School of Informatics of the University of Edinburgh, and the computer resources and technical support provided by EPCC. The authors want to thank Dr. Ingo Wald for giving us the possibility of getting a KNL coprocessor.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Dpto. de Informática, Universidad de Valladolid, Valladolid, Spain
