Performance Study of LU Decomposition on the Programmable GPU

  • Fumihiko Ino
  • Manabu Matsui
  • Keigo Goda
  • Kenichi Hagihara
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3769)


With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To this end, we have developed and evaluated several methods that differ in how they implement (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four main findings: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be handled more efficiently by the CPU than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those obtained by CPUs, mainly due to floating-point division error, which increases the numerical error as the decomposition progresses.
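
For readers unfamiliar with the algorithm, the following is a minimal CPU-side sketch of LU decomposition in Python. It is not the authors' GPU implementation. The function names (`to_f32`, `lu_decompose`, `reconstruct`, `max_abs_diff`) and the sample matrix are hypothetical, and rounding every intermediate result to IEEE 754 single precision is only an assumed stand-in for reduced-precision arithmetic; the hardware studied in the paper may round differently. The sketch shows why the outer loop is a dependent loop (each step reads the submatrix written by the previous one) and how a lower-precision run can be compared against a double-precision run.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_decompose(a, rnd=lambda x: x):
    """Right-looking LU decomposition without pivoting.

    Returns one matrix holding U on and above the diagonal and the
    multipliers of the unit lower-triangular L below it. `rnd` is applied
    after every arithmetic operation (assumption: this emulates a
    lower-precision arithmetic unit).
    """
    n = len(a)
    lu = [[rnd(v) for v in row] for row in a]
    for k in range(n):              # dependent loop: step k needs step k-1
        pivot = lu[k][k]            # assumed nonzero (no pivoting here)
        for i in range(k + 1, n):
            lu[i][k] = rnd(lu[i][k] / pivot)   # the division step
            for j in range(k + 1, n):
                lu[i][j] = rnd(lu[i][j] - rnd(lu[i][k] * lu[k][j]))
    return lu


def reconstruct(lu):
    """Multiply L and U back together in double precision."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


if __name__ == "__main__":
    # Hypothetical diagonally dominant test matrix, so no zero pivots occur.
    n = 32
    a = [[float(n) if i == j else 1.0 / (1 + i + j) for j in range(n)]
         for i in range(n)]
    err64 = max_abs_diff(reconstruct(lu_decompose(a)), a)
    err32 = max_abs_diff(reconstruct(lu_decompose(a, to_f32)), a)
    print("double-precision residual:", err64)
    print("single-precision residual:", err32)
```

Running the script prints the reconstruction residual for both precisions; the exact values depend on the matrix chosen, so none are quoted here.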




  1. Fernando, R. (ed.): GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Addison-Wesley, Reading (2004)
  2. Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2004), pp. 133–137 (2004)
  3. Thompson, C.J., Hahn, S., Oskin, M.: Using modern graphics architectures for general-purpose computing: A framework and analysis. In: Proc. 35th IEEE/ACM Int’l Symp. Microarchitecture (MICRO 2002), pp. 306–317 (2002)
  4. Larsen, E.S., McAllister, D.: Fast matrix multiplies using graphics hardware. In: Proc. High Performance Networking and Computing Conf., SC 2001 (2001)
  5. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 3–35 (2001)
  6. Hall, J.D., Carr, N.A., Hart, J.C.: Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois (2003)
  7. Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graphics 22, 908–916 (2003)
  8. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)
  9. Moravánszky, A.: Dense Matrix Algebra on the GPU (2003),
  10. Moreland, K., Angel, E.: The FFT on a GPU. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2003), pp. 112–119 (2003)
  11. Fernando, R., Harris, M., Wloka, M., Zeller, C.: Programming graphics hardware. In: EUROGRAPHICS 2004 Tutorial Note (2004),
  12. Pharr, M., Fernando, R. (eds.): GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Reading (2005)
  13. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003)
  14. Shreiner, D., Woo, M., Neider, J., Davis, T. (eds.): OpenGL Programming Guide, 4th edn. Addison-Wesley, Reading (2003)
  15. Microsoft Corporation: DirectX (2005),
  16. Stevenson, D.: A proposed standard for binary floating-point arithmetic. IEEE Computer 14, 51–62 (1981)
  17. Dongarra, J.J., Duff, I.S., Sorensen, D.C., Vorst, H.V.D. (eds.): Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991)
  18. Mark, W.R., Glanville, R.S., Akeley, K., Kilgard, M.J.: Cg: A system for programming graphics hardware in a C-like language. ACM Trans. Graphics 22, 896–897 (2003)
  19. Naruse, A., Sumimoto, S., Kumon, K.: Optimization and evaluation of linpack benchmark for Xeon processor. IPSJ Trans. Advanced Computing Systems 45, 62–70 (2004) (in Japanese)
  20. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report CS-TR-02-55, The University of Texas at Austin (2002)
  21. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience 15, 803–820 (2003)
  22. Hillesland, K.E., Lastra, A.: GPU floating point paranoia. In: Proc. 1st ACM Workshop General-Purpose Computing on Graphics Processors (GP2 2004), vol. C–8 (2004),
  23. Moore, G.E.: Cramming more components onto integrated circuits. Electronics 38, 114–117 (1965)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Fumihiko Ino (1)
  • Manabu Matsui (1)
  • Keigo Goda (1)
  • Kenichi Hagihara (1)
  1. Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
