Abstract
This paper presents an efficient algorithmic approach to the GPU-based parallel resolution of dense linear systems of extremely large size. A formal transformation of the code of the Gauss method allows us to develop, for matrix calculations, the concept of a stripe algorithm, as opposed to that of a tile algorithm. Our stripe algorithm is based on partitioning the linear system's matrix into stripes of rows and is well suited to efficient implementation on a GPU, using the cublasDgemm function of the CUBLAS library as its main building block. It is also well adapted to storing the linear system on an array of solid state devices, with the PC's main memory serving as a cache between the SSDs and the GPU memory. We demonstrate experimentally that our code efficiently solves dense linear systems of size up to 400,000 (160 billion matrix elements) using an NVIDIA C2050 and six 240 GB SSDs.
Abbreviations
- GPU: Graphical processing unit
- CUDA: Compute unified device architecture
- CUBLAS: CUDA basic linear algebra subroutines
- SSD: Solid state device
- L\(_\mathrm{SSD}\): 1st memory level, the array of SSDs
- L\(_\mathrm{PC}\): 2nd memory level, the main memory of the PC
- L\(_\mathrm{GPU}\): 3rd memory level, the global memory of the GPU
Cite this article
Carcenac, M. From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices. J Supercomput 68, 365–413 (2014). https://doi.org/10.1007/s11227-013-1043-3