Sequential performance versus scalability: Optimizing parallel LU-decomposition
Highly efficient implementations of parallel algorithms require highly efficient sequential kernels, which is why libraries such as BLAS are used successfully in many numerical applications. In this paper we show the tradeoff between the performance of these kernels and the scalability of parallel applications. It turns out that the fastest routine on a single node does not necessarily lead to the fastest parallel program, and that the structure of the kernels has to be adapted to the communication parameters of the machine. As an example application we present an optimized parallel LU decomposition for dense systems on a distributed-memory machine. Here, the size of the submatrices of the blocked algorithm determines the performance of the matrix-matrix multiplication and, with an opposing effect, the scalability behavior.
Keywords: Block Size, Memory Hierarchy, Parallelization Technique, RISC Processor, Large Block Size
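To illustrate the block-size tradeoff the abstract describes, the following is a minimal sketch of a blocked (right-looking) LU factorization in NumPy. The function name `blocked_lu` and the parameter `nb` are illustrative, not taken from the paper; pivoting is omitted for brevity, so the example assumes a diagonally dominant matrix. The rank-`nb` trailing update is the matrix-matrix multiplication whose sequential performance grows with the block size, while in a distributed setting a smaller `nb` shortens the critical path and can improve scalability.

```python
import numpy as np

def blocked_lu(A, nb):
    """Blocked right-looking LU factorization without pivoting (sketch).

    Returns a matrix holding L (unit lower triangle, implicit ones)
    and U (upper triangle) packed together. `nb` is the block size:
    larger blocks favor the BLAS-3 trailing update, smaller blocks
    tend to favor parallel scalability.
    """
    A = A.astype(float, copy=True)
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # 1) Factor the diagonal block with an unblocked LU.
        for j in range(k, k + b):
            A[j+1:k+b, j] /= A[j, j]
            A[j+1:k+b, j+1:k+b] -= np.outer(A[j+1:k+b, j], A[j, j+1:k+b])
        if k + b < n:
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            U11 = np.triu(A[k:k+b, k:k+b])
            # 2) Triangular solves: U12 = L11^{-1} A12, L21 = A21 U11^{-1}.
            A[k:k+b, k+b:] = np.linalg.solve(L11, A[k:k+b, k+b:])
            A[k+b:, k:k+b] = np.linalg.solve(U11.T, A[k+b:, k:k+b].T).T
            # 3) Rank-b trailing update: the matrix-matrix multiply
            #    whose efficiency depends on the block size.
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A

# Verify on a diagonally dominant matrix (no pivoting needed).
rng = np.random.default_rng(0)
M = rng.random((8, 8)) + 8.0 * np.eye(8)
F = blocked_lu(M, nb=3)
L = np.tril(F, -1) + np.eye(8)
U = np.triu(F)
assert np.allclose(L @ U, M)
```

Changing `nb` does not change the factors, only how the arithmetic is grouped; choosing it is purely a performance decision balancing kernel speed against communication, which is the tradeoff the paper studies.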