Sequential performance versus scalability: Optimizing parallel LU-decomposition

  • Jens Simon
  • Jens-Michael Wierum
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1067)

Abstract

Highly efficient implementations of parallel algorithms require highly efficient sequential kernels; this is why libraries such as BLAS are used successfully in many numerical applications. In this paper we show the tradeoff between the performance of these kernels and the scalability of parallel applications. It turns out that the fastest routine on a single node does not necessarily lead to the fastest parallel program, and that the structure of the kernels has to be adapted to the communication parameters of the machine. As an example application, we present an optimized parallel LU-decomposition for dense systems on a distributed-memory machine. Here, the size of the submatrices of the blocked algorithm determines the performance of the matrix-matrix multiplication and, with a contrary effect, the scalability behavior.
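
To make the block-size tradeoff concrete, the following minimal sketch (not the authors' implementation, and written in Python/NumPy purely for illustration) shows a right-looking blocked LU factorization without pivoting. The block size nb controls how much of the work lands in the matrix-matrix trailing update, the BLAS-3-style kernel that benefits from larger blocks on a single node, whereas in a distributed parallel algorithm larger blocks coarsen the data distribution and can reduce scalability.

import numpy as np

def blocked_lu(A, nb):
    """In-place right-looking blocked LU factorization without pivoting.

    Assumes A admits an LU factorization without pivoting
    (e.g. A is diagonally dominant). The block size nb steers the
    tradeoff described above: larger nb means a bigger, more efficient
    trailing-matrix update, but coarser granularity in a parallel code.
    """
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # 1. Unblocked LU of the current panel A[k:, k:k+b] (BLAS-2 level work).
        for j in range(k, k + b):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + b] -= np.outer(A[j + 1:, j], A[j, j + 1:k + b])
        if k + b < n:
            # 2. Solve for the block row of U: U12 = L11^{-1} * A12
            #    (a triangular solve in an optimized code; a general solve here).
            L11 = np.tril(A[k:k + b, k:k + b], -1) + np.eye(b)
            A[k:k + b, k + b:] = np.linalg.solve(L11, A[k:k + b, k + b:])
            # 3. Trailing submatrix update A22 -= L21 * U12: the matrix-matrix
            #    multiplication whose efficiency grows with nb.
            A[k + b:, k + b:] -= A[k + b:, k:k + b] @ A[k:k + b, k + b:]
    return A

# Quick check: the packed factors reproduce the original matrix.
n = 256
A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant, no pivoting needed
LU = blocked_lu(A.copy(), nb=64)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
assert np.allclose(L @ U, A)

In a distributed-memory implementation such as the one described here, the same block size also sets the granularity of data distribution and communication, which is where the contrary effect on scalability comes from.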

Keywords

Block Size · Memory Hierarchy · Parallelization Technique · RISC Processor · Large Block Size

Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Jens Simon (1)
  • Jens-Michael Wierum (1)
  1. Paderborn Center for Parallel Computing (PC2), Paderborn, Germany
