Abstract
This paper presents an efficient algorithmic approach to the GPU-based parallel resolution of dense linear systems of extremely large size. A formal transformation of the code of the Gauss method allows us to develop, for matrix calculations, the concept of a stripe algorithm, as opposed to that of a tile algorithm. Our stripe algorithm is based on partitioning the linear system's matrix into stripes of rows and is well suited to efficient implementation on a GPU, using the cublasDgemm function of the CUBLAS library as its main building block. It is also well adapted to storing the linear system on an array of solid state devices, with the PC's main memory serving as a cache between the SSDs and the GPU memory. We demonstrate experimentally that our code efficiently solves dense linear systems of size up to 400,000 (160 billion matrix elements) using an NVIDIA C2050 and six 240 GB SSDs.
Abbreviations
- GPU: Graphical processing unit
- CUDA: Compute unified device architecture
- CUBLAS: CUDA basic linear algebra subroutines
- SSD: Solid state device
- L\(_\mathrm{SSD}\): 1st memory level, the array of SSDs
- L\(_\mathrm{PC}\): 2nd memory level, the main memory of the PC
- L\(_\mathrm{GPU}\): 3rd memory level, the global memory of the GPU
Cite this article
Carcenac, M. From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices. J Supercomput 68, 365–413 (2014). https://doi.org/10.1007/s11227-013-1043-3