Programming parallel dense matrix factorizations with look-ahead and OpenMP

Catalán, Sandra; Castelló, Adrián; Igual, Francisco D.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

doi:10.1007/s10586-019-02927-z

Published: 02 April 2019

Cluster Computing, volume 23, pages 359–375 (2020)

Sandra Catalán ORCID: orcid.org/0000-0002-9321-2728¹,
Adrián Castelló¹,
Francisco D. Igual³,
Rafael Rodríguez-Sánchez³ &
Enrique S. Quintana-Ortí²

Abstract

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of basic linear algebra subroutines (BLAS). The proposed approach is also different from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a high level of abstraction, the actual implementation can be easily derived from them, paving the road to deriving a high performance implementation of a considerable fraction of linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
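
The static look-ahead idea can be illustrated with a small sketch. This is not the authors' code, and it rests on several assumptions made here for brevity: it is written in Python rather than C with OpenMP, it factors without pivoting (so the input is assumed to be diagonally dominant), plain nested loops stand in for the BLAS, and two thread-pool tasks stand in for two OpenMP thread teams. All function names (`factor_panel`, `update_columns`, `panel_branch`, `lu_lookahead`, `multiply_lu`) are hypothetical. The point is the loop structure: at each iteration one branch updates and factors only the next panel, while the other branch applies the trailing update to the remaining columns.

```python
from concurrent.futures import ThreadPoolExecutor


def factor_panel(A, k, kb):
    """Unblocked LU without pivoting of the panel A[k:, k:k+kb], in place."""
    n = len(A)
    for j in range(k, k + kb):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                      # multiplier l_ij
            for c in range(j + 1, k + kb):
                A[i][c] -= A[i][j] * A[j][c]


def update_columns(A, k, kb, c0, c1):
    """Apply the factored panel k to columns c0..c1-1 (triangular solve plus GEMM-like update)."""
    n = len(A)
    for j in range(k, k + kb):
        for i in range(j + 1, n):
            lij = A[i][j]
            for c in range(c0, c1):
                A[i][c] -= lij * A[j][c]


def panel_branch(A, k, kb, nxt, nb):
    """Look-ahead branch: update only the next panel, then factor it."""
    update_columns(A, k, kb, nxt, nxt + nb)
    factor_panel(A, nxt, nb)


def lu_lookahead(A, b):
    """Blocked right-looking LU (no pivoting) with a static look-ahead of depth 1."""
    n = len(A)
    factor_panel(A, 0, min(b, n))                   # first panel: nothing to overlap with
    with ThreadPoolExecutor(max_workers=2) as pool:
        for k in range(0, n, b):
            kb = min(b, n - k)
            nxt = k + kb                            # first column of the next panel
            nb = min(b, n - nxt)                    # its width (0 after the last panel)
            # The two branches write disjoint column ranges and only read panel k.
            ahead = pool.submit(panel_branch, A, k, kb, nxt, nb)
            rest = pool.submit(update_columns, A, k, kb, nxt + nb, n)
            ahead.result()
            rest.result()                           # synchronize before the next iteration
    return A


def multiply_lu(LU):
    """Rebuild L*U from the packed factors (the unit diagonal of L is implicit)."""
    n = len(LU)
    return [[sum((1.0 if p == i else LU[i][p]) * LU[p][j]
                 for p in range(min(i, j) + 1))
             for j in range(n)] for i in range(n)]
```

Calling `lu_lookahead(A, b)` on a list-of-lists matrix overwrites it with the packed L and U factors, and `multiply_lu` can be used to check that their product reproduces the input. Because of Python's global interpreter lock the two branches do not actually run faster here; the sketch only shows where the overlap between panel factorization and trailing update would occur.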

Notes

http://www.netlib.org/lapack.
Version from October 2017. Available online at http://www.argobots.org.
https://www.openmprtl.org/.
https://gcc.gnu.org/projects/gomp/.

Abbreviations

API: Application programming interface.
BLAS: Basic linear algebra subroutines.
BLIS: BLAS-like library instantiation software framework.
DLA: Dense linear algebra.
DMF: Dense matrix factorization.
GEMM: General matrix-matrix multiplication.
GLT: Unified API for lightweight thread libraries.
GLTO: OpenMP implementation of GLT.
LA MB G: DMF algorithm linked with GNU's runtime.
LA MB S: DMF algorithm linked with Intel's runtime.
LAPACK: Linear algebra package.
LWT: Lightweight threads library.
MKL: Intel math kernel library.
MTB: Multi-threaded BLAS parallelism exploitation.
MTL: Multi-threaded BLAS implementation.
RTM: Runtime task parallelism exploitation.

Acknowledgements

The researchers from Universidad Jaume I were supported by the CICYT Projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the H2020 EU FETHPC Project 671602 “INTERTWinE”. The researchers from Universidad Complutense de Madrid were supported by the CICYT Project TIN2015-65277-R of the MINECO and FEDER. Sandra Catalán was supported during part of this time by the FPU program of the Ministerio de Educación, Cultura y Deporte. Adrián Castelló was supported by the ValI+D 2015 FPI program of the Generalitat Valenciana.

Author information

Authors and Affiliations

Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Avda. Sos Baynat, s/n, 12071, Castellón de la Plana, Spain
Sandra Catalán & Adrián Castelló
Depto. Informática de Sistemas y Computadores, Universitat Politècnica de València, Camino de Vera, s/n, 46071, Valencia, Spain
Enrique S. Quintana-Ortí
Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
Francisco D. Igual & Rafael Rodríguez-Sánchez

Corresponding author

Correspondence to Sandra Catalán.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Catalán, S., Castelló, A., Igual, F.D. et al. Programming parallel dense matrix factorizations with look-ahead and OpenMP. Cluster Comput 23, 359–375 (2020). https://doi.org/10.1007/s10586-019-02927-z

Received: 23 April 2018
Revised: 27 February 2019
Accepted: 25 March 2019
Published: 02 April 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10586-019-02927-z
