High Performance Polar Decomposition on Distributed Memory Systems

  • Dalal Sukkari
  • Hatem Ltaief
  • David Keyes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


The polar decomposition of a dense matrix is an important operation in linear algebra. It can be computed directly through the singular value decomposition (SVD) or iteratively using the QR-based dynamically weighted Halley algorithm (QDWH). The former is difficult to parallelize because of the preponderance of memory-bound operations during the bidiagonal reduction. We investigate the latter approach, which performs more floating-point operations but at the same time exposes more parallelism and therefore, thanks to its more compute-bound matrix operations, runs closer to the theoretical peak performance of the system. Profiling results show the performance scalability of QDWH for computing the polar decomposition using around 9200 MPI processes on well- and ill-conditioned matrices of size 100K × 100K. We then study the performance impact of the QDWH-based polar decomposition as a pre-processing step toward computing the SVD itself. The new distributed-memory implementation of the QDWH-SVD solver achieves up to a five-fold speedup over current state-of-the-art vendor SVD implementations.
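To make the two routes concrete, here is a minimal single-node NumPy sketch. It is not the authors' distributed-memory implementation. It computes the polar factorization A = U_p H directly from an SVD and also with a QR-based dynamically weighted Halley iteration, following the formulation published by Nakatsukasa, Bai and Gygi (2010). It then recovers the SVD from the polar factors through a symmetric eigendecomposition of H (H = V Σ V^T, U = U_p V). Several choices here are illustrative assumptions: the function names `polar_via_svd`, `qdwh_polar` and `qdwh_svd`, the Frobenius-norm scaling, the explicit inverse used to bound the smallest singular value (a stand-in for a condition estimator), and the stopping tolerances. The input is also assumed to be real, square and nonsingular.

```python
import numpy as np


def polar_via_svd(A):
    """Direct route: A = W S V^T gives U_p = W V^T and H = V S V^T."""
    W, s, Vt = np.linalg.svd(A)
    return W @ Vt, (Vt.T * s) @ Vt


def qdwh_polar(A, tol=None, max_iter=20):
    """Iterative route: QR-based dynamically weighted Halley (QDWH) iteration.

    Sketch assumptions: A is real, square and nonsingular.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    eps = np.finfo(float).eps
    if tol is None:
        tol = np.cbrt(5.0 * eps)
    # Scale so all singular values of X lie in (0, 1]; ||A||_F >= ||A||_2.
    alpha = np.linalg.norm(A, "fro")
    X = A / alpha
    # Lower bound l <= sigma_min(X), using ||A^{-1}||_2 <= sqrt(n) ||A^{-1}||_1.
    # Assumption: the explicit inverse stands in for a cheap condition estimator.
    l = 1.0 / (np.sqrt(n) * np.linalg.norm(np.linalg.inv(A), 1) * alpha)
    I = np.eye(n)
    for _ in range(max_iter):
        l2 = l * l
        # Dynamically chosen weights (a, b, c) of the Halley-type update.
        gamma = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        a = np.sqrt(1.0 + gamma) + 0.5 * np.sqrt(
            8.0 - 4.0 * gamma + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + gamma)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR of the stacked matrix [sqrt(c) X; I] avoids forming X^T X.
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:n], Q[n:]
        # Equivalent to X (a I + b X^T X)(I + c X^T X)^{-1}.
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        # Updated lower bound on sigma_min(X_new); clamp rounding above 1.
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))
        converged = (np.linalg.norm(X_new - X, "fro") <= tol
                     and 1.0 - l <= 10.0 * eps)
        X = X_new
        if converged:
            break
    H = X.T @ A
    return X, 0.5 * (H + H.T)  # symmetrize the Hermitian factor


def qdwh_svd(A):
    """SVD from the polar factors: H = V diag(sigma) V^T, then U = U_p V."""
    Up, H = qdwh_polar(A)
    w, V = np.linalg.eigh(H)       # eigenvalues in ascending order
    order = np.argsort(w)[::-1]    # singular values in descending order
    sigma, V = w[order], V[:, order]
    return Up @ V, sigma, V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 200))
    Up, H = qdwh_polar(A)
    print("orthogonality error:", np.linalg.norm(Up.T @ Up - np.eye(200)))
    print("relative backward error:",
          np.linalg.norm(Up @ H - A) / np.linalg.norm(A))
```

In this sketch, each iteration is dominated by a QR factorization of a stacked 2n × n matrix and a matrix-matrix product. These are the kind of compute-bound kernels that the abstract contrasts with the memory-bound bidiagonal reduction of the direct SVD route.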





For computer time, this research used the resources of the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Extreme Computing Research Center, Division of Computer, Electrical, and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
