An adaptive blocking strategy for matrix factorizations
On most high-performance architectures, data movement is slow relative to floating-point (in particular, vector) performance. On such architectures, block algorithms have been successful for matrix computations: by viewing a matrix as a collection of submatrices (the so-called blocks), one naturally arrives at algorithms that require little data movement. The optimal blocking strategy, however, depends on both the computing environment and the problem parameters, and current approaches rely on fixed-width blocking strategies that are not optimal across this range. This paper presents an “adaptive blocking” methodology for determining an optimal blocking strategy on a uniprocessor machine in a systematic manner, demonstrated on a block QR factorization routine. After generating timing models for the algorithm's high-level kernels, we express the optimal blocking strategy as a recurrence relation that can be solved inexpensively by dynamic programming. Experiments on one processor of a CRAY-2 show that the resulting variable-width blocking strategy is as good as any fixed-width strategy. Whereas the optimal fixed width can only be found by re-running the same problem several times, adaptive blocking delivers optimal performance on the very first run.
Keywords: block algorithm, adaptive blocking, performance evaluation, performance portability, QR factorization
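The recurrence itself is not reproduced on this preview page. As a rough illustration of the approach the abstract describes, the sketch below solves a recurrence of the form T(j) = min_b [ K(j, b) + T(j + b) ], T(n) = 0, by dynamic programming, where j counts columns already factored and K(j, b) is the modeled time to factor a width-b block at column j and update the trailing matrix. The cost model kernel_time is a placeholder assumption, not the paper's calibrated CRAY-2 timing model.

    # Hypothetical sketch of the adaptive-blocking dynamic program described
    # in the abstract. kernel_time() stands in for the paper's timing models
    # of the high-level kernels; its particular form here is an assumption.

    def kernel_time(cols_done: int, block_width: int, m: int, n: int) -> float:
        """Modeled time to factor a block of block_width columns starting at
        column cols_done of an m-by-n matrix, and to apply the resulting
        block transformation to the trailing columns. Placeholder model."""
        rows = m - cols_done                     # rows left in the block column
        trailing = n - cols_done - block_width   # columns touched by the update
        factor = rows * block_width ** 2         # panel factorization term
        update = rows * block_width * max(trailing, 0)  # trailing-matrix update
        startup = 50.0 * block_width             # per-call overhead (vector startup)
        return factor + 2.0 * update + startup

    def adaptive_schedule(m: int, n: int, max_width: int):
        """Return (total modeled time, list of block widths) minimizing
        T(j) = min_b [ K(j, b) + T(j + b) ],  with T(n) = 0."""
        INF = float("inf")
        best = [INF] * (n + 1)   # best[j]: minimal modeled time from column j on
        choice = [0] * (n + 1)   # block width chosen at column j
        best[n] = 0.0
        for j in range(n - 1, -1, -1):           # solve the recurrence right to left
            for b in range(1, min(max_width, n - j) + 1):
                t = kernel_time(j, b, m, n) + best[j + b]
                if t < best[j]:
                    best[j], choice[j] = t, b
        widths, j = [], 0
        while j < n:                             # recover the optimal width sequence
            widths.append(choice[j])
            j += choice[j]
        return best[0], widths

    if __name__ == "__main__":
        total, widths = adaptive_schedule(m=512, n=256, max_width=32)
        print(f"modeled time {total:.0f}, block widths {widths}")

Because each column position considers at most max_width candidate widths, the table fills in O(n * max_width) time, which is negligible next to the factorization itself; this is what makes the dynamic-programming solution inexpensive.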