A Framework for Out of Memory SVD Algorithms

  • Khairul Kabir
  • Azzam Haidar
  • Stanimire Tomov
  • Aurelien Bouteiller
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10266)


Many important applications, from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction, require computing the singular value decomposition (SVD) of large dense matrices. In many cases the problems are too large to fit into the computer’s main memory and therefore require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the communication costs of the SVD with respect to hierarchical memories and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but also applies between other consecutive levels of the memory hierarchy, e.g., a GPU SVD that uses CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communication of both the classical one-stage blocked SVD and the two-stage tiled SVD. We then present the theoretical analysis and the strategies used to design and implement these communication-avoiding OOM SVD algorithms. We show performance results on a multicore architecture that illustrate our theoretical findings and match our performance models.
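
To make the out-of-memory idea concrete, the following is a minimal Python sketch of a computation that streams a disk-resident matrix through fast memory one row block at a time. It is not the one-stage or two-stage algorithm analyzed in the paper; it swaps in a simpler, well-known incremental QR approach that yields only the singular values of a tall matrix. The function name `oom_singular_values`, the raw binary file layout, and the `block_rows` parameter are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def oom_singular_values(path, shape, block_rows, dtype=np.float64):
    """Singular values of a row-major matrix on disk, read one row block at a time.

    Illustrative sketch only (incremental QR), not the paper's OOM SVD.
    """
    m, n = shape
    # Memory-map the file so that only the slices we touch are read from disk.
    a = np.memmap(path, dtype=dtype, mode="r", shape=(m, n))
    # Running triangular factor; it never grows beyond n x n.
    r = np.empty((0, n), dtype=dtype)
    for start in range(0, m, block_rows):
        # One transfer from slow memory (disk) to fast memory (RAM).
        block = np.asarray(a[start:start + block_rows])
        # Stack the previous R on top of the new block and re-factorize.
        # This preserves the invariant R^T R = A[:end]^T A[:end].
        r = np.linalg.qr(np.vstack([r, block]), mode="r")
    # A^T A = R^T R, so A and R have the same singular values.
    return np.linalg.svd(r, compute_uv=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 2000, 50
    full = rng.standard_normal((m, n))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "a.bin")
        full.tofile(path)  # hypothetical on-disk layout: raw row-major float64
        s_oom = oom_singular_values(path, (m, n), block_rows=256)
    # Compare against an in-memory SVD of the same matrix.
    print(np.allclose(s_oom, np.linalg.svd(full, compute_uv=False)))
```

In this sketch every matrix entry is read from disk exactly once, and fast memory holds only one row block plus a factor of at most n × n entries. A full SVD with singular vectors, as targeted by the paper, needs more data movement, and that is the cost the paper's communication analysis addresses.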


Keywords: Singular value decomposition, Singular vector, Solid state drive, Tile size, Band matrix



This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Khairul Kabir (4)
  • Azzam Haidar (1), corresponding author
  • Stanimire Tomov (1)
  • Aurelien Bouteiller (1)
  • Jack Dongarra (1, 2, 3)

  1. University of Tennessee, Knoxville, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. University of Manchester, Manchester, UK
  4. Nvidia, Santa Clara, USA
