Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations

  • Azzam Haidar
  • Raffaele Solcà
  • Mark Gates
  • Stanimire Tomov
  • Thomas Schulthess
  • Jack Dongarra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7905)


The high computational demands of engineering applications, together with increasingly complex hardware, make it necessary to develop and optimize new algorithms that achieve high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that scales across multiple GPUs extremely challenging. We describe and analyze a successful methodology for addressing these challenges, from algorithm design through kernel optimization and tuning to the programming model, in the development of a scalable high-performance generalized eigenvalue solver for electronic structure calculations in materials science applications. We developed a set of leading-edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine-grained memory-aware kernels, a task-based approach, and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. Using a single node with multiple GPUs, the algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Azzam Haidar (1)
  • Raffaele Solcà (4)
  • Mark Gates (1)
  • Stanimire Tomov (1)
  • Thomas Schulthess (4, 5)
  • Jack Dongarra (1, 2, 3)

  1. University of Tennessee, Knoxville, USA
  2. Oak Ridge National Laboratory, USA
  3. University of Manchester, UK
  4. Institute for Theoretical Physics, ETH Zurich, Switzerland
  5. Swiss National Supercomputer Center, Switzerland
