Integration and exploitation of intra-routine malleability in BLIS

  • Rafael Rodríguez-Sánchez
  • Francisco D. Igual
  • Enrique S. Quintana-Ortí


Malleability is a property of certain applications (or tasks) that, in response to an external request or autonomously, can accommodate a dynamic modification of the degree of parallelism being exploited at runtime. Malleability improves resource usage (core occupation) on modern multicore architectures for applications that exhibit irregular and divergent execution paths and depend heavily on the performance of the underlying library. Malleability has not yet been integrated into high-performance instances of the Basic Linear Algebra Subprograms (BLAS), and it is difficult to attain given the rigidity of current application programming interfaces (APIs). In this paper, we overcome these issues by presenting the integration of a malleability mechanism within BLIS, a high-performance and portable framework to implement BLAS-like operations. For this purpose, we leverage low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and we demonstrate the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.
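As an illustration of the general idea only, the following minimal sketch shows a blocked matrix product that re-reads its allowed degree of parallelism before each block of rows, so an external request can take effect while the routine is still running. It is not the BLIS mechanism or API: the names `MalleableTeam`, `malleable_gemm`, and the `on_block` callback are hypothetical, and the thread pool stands in for whatever threading runtime a real library would use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MalleableTeam:
    """Hypothetical holder for the number of workers a routine may use."""

    def __init__(self, nthreads):
        self._lock = threading.Lock()
        self._nthreads = max(1, nthreads)

    def request(self, nthreads):
        # An external agent (or the routine itself) asks for a new team size.
        with self._lock:
            self._nthreads = max(1, nthreads)

    def current(self):
        with self._lock:
            return self._nthreads


def malleable_gemm(A, B, team, block=2, on_block=None):
    """Compute C = A * B by row blocks, re-reading the team size per block.

    Returns C and the list of team sizes actually used for each block.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    history = []

    def row_task(i):
        # Each task owns one row of C, so no two tasks write the same entry.
        for j in range(n):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

    for start in range(0, m, block):
        nth = team.current()  # malleability point: pick up any pending request
        history.append(nth)
        rows = range(start, min(start + block, m))
        with ThreadPoolExecutor(max_workers=nth) as pool:
            list(pool.map(row_task, rows))
        if on_block is not None:
            on_block(start, team)  # hook used here to simulate an external request

    return C, history


if __name__ == "__main__":
    A = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    B = [[1, 2], [3, 4], [5, 6], [7, 8]]
    team = MalleableTeam(1)

    def grow(start, t):
        # Simulate cores becoming available after the first block completes.
        if start == 0:
            t.request(3)

    C, used = malleable_gemm(A, B, team, block=2, on_block=grow)
    print(C, used)
```

The design choice in this sketch is that the team size is consulted only at block boundaries, which keeps each block's work partition fixed while it runs; how finely a real library places such points is a separate trade-off not covered here.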


Malleability; Linear algebra; BLAS; Multicore architectures



The researchers from Universidad Complutense de Madrid were supported by the EU (FEDER) and Spanish MINECO (TIN2015-65277-R, RTI2018-093684-B-I00), and by Spanish CM (S2018/TCS-4423). The researcher from Universitat Politècnica de València was supported by the Spanish MINECO (TIN2017-82972-R).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Departamento de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
  2. Departamento de Informática de Sistemas y Computadores, Universitat Politècnica de València, Valencia, Spain
