Programming parallel dense matrix factorizations with look-ahead and OpenMP

  • Sandra Catalán
  • Adrián Castelló
  • Francisco D. Igual
  • Rafael Rodríguez-Sánchez
  • Enrique S. Quintana-Ortí


We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of basic linear algebra subroutines (BLAS). The proposed approach is also different from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a high level of abstraction, the actual implementation can be easily derived from them, paving the road to deriving a high performance implementation of a considerable fraction of linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
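The static look-ahead idea described above can be sketched in a few lines of C with OpenMP. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it uses an LU factorization without pivoting (so the input should be, for example, diagonally dominant), plain loops stand in for the multi-threaded BLAS calls, two OpenMP sections stand in for the two groups of threads, and the names `panel_factor`, `update_cols` and `lu_lookahead` are hypothetical. At each iteration, one section updates and factorizes the next panel while the other section applies the current panel to the rest of the trailing matrix, so the panel factorization is taken off the critical path.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major access to an n x n matrix stored in a flat array. */
#define AT(A, n, i, j) ((A)[(size_t)(i) * (size_t)(n) + (size_t)(j)])

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked LU (no pivoting, an assumption of this sketch) of the panel
 * made of rows [k, n) and columns [k, k+kb). */
static void panel_factor(double *A, int n, int k, int kb)
{
    for (int j = k; j < k + kb; ++j) {
        double pivot = AT(A, n, j, j);
        for (int i = j + 1; i < n; ++i) {
            AT(A, n, i, j) /= pivot;
            for (int c = j + 1; c < k + kb; ++c)
                AT(A, n, i, c) -= AT(A, n, i, j) * AT(A, n, j, c);
        }
    }
}

/* Apply panel k to columns [c0, c1): a unit-lower triangular solve on
 * rows [k, k+kb), then a rank-kb update of rows [k+kb, n).
 * Plain loops stand in for the BLAS kernels. */
static void update_cols(double *A, int n, int k, int kb, int c0, int c1)
{
    for (int c = c0; c < c1; ++c)
        for (int j = k; j < k + kb; ++j)
            for (int i = j + 1; i < k + kb; ++i)
                AT(A, n, i, c) -= AT(A, n, i, j) * AT(A, n, j, c);

    for (int i = k + kb; i < n; ++i)
        for (int c = c0; c < c1; ++c)
            for (int j = k; j < k + kb; ++j)
                AT(A, n, i, c) -= AT(A, n, i, j) * AT(A, n, j, c);
}

/* Blocked right-looking LU with static look-ahead and block size b.
 * On exit, A holds the unit-lower factor L below the diagonal and the
 * upper factor U on and above it. */
void lu_lookahead(double *A, int n, int b)
{
    assert(n >= 0 && b > 0);

    /* The first panel has nothing to overlap with. */
    panel_factor(A, n, 0, min_int(b, n));

    for (int k = 0; k < n; k += b) {
        int kb   = min_int(b, n - k);     /* width of the current panel */
        int next = k + kb;                /* first column of the next panel */
        int nb   = min_int(b, n - next);  /* width of the next panel (may be 0) */

        /* The two sections write disjoint column ranges and only read
         * the already factorized panel k, so they can run concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                /* Look-ahead branch: update the next panel, then factorize it. */
                update_cols(A, n, k, kb, next, next + nb);
                panel_factor(A, n, next, nb);
            }
            #pragma omp section
            {
                /* Remainder branch: update the rest of the trailing matrix. */
                update_cols(A, n, k, kb, next + nb, n);
            }
        }
    }
}
```

Calling `lu_lookahead(A, n, b)` on a row-major `n`×`n` array overwrites it with the two factors. With GCC the pragmas take effect when compiling with `-fopenmp`; without that flag they are ignored and the same code runs sequentially with identical results.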


Matrix factorizations, Look-ahead, Multi-threading, OpenMP, Lightweight threads, High performance computing



Application programming interface.


Basic linear algebra subroutines.


BLAS-like library instantiation software framework.


Dense linear algebra.


Dense matrix factorization.


General matrix-matrix multiplication.


Unified API for lightweight thread libraries.


OpenMP implementation of GLT.


DMF algorithm linked with GNU's runtime.


DMF algorithm linked with Intel's runtime.


Linear algebra package.


Lightweight threads library.


Intel math kernel library.


Multi-threaded BLAS parallelism exploitation.


Multi-threaded BLAS implementation.


Runtime task parallelism exploitation.



The researchers from Universidad Jaume I were supported by the CICYT Projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the H2020 EU FETHPC Project 671602 “INTERTWinE”. The researchers from Universidad Complutense de Madrid were supported by the CICYT Project TIN2015-65277-R of the MINECO and FEDER. Sandra Catalán was supported during part of this time by the FPU program of the Ministerio de Educación, Cultura y Deporte. Adrián Castelló was supported by the ValI+D 2015 FPI program of the Generalitat Valenciana.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón de la Plana, Spain
  2. Depto. Informática de Sistemas y Computadores, Universitat Politècnica de València, Valencia, Spain
  3. Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
