Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

  • Alex Druinsky
  • Pieter Ghysels
  • Xiaoye S. Li
  • Osni Marques
  • Samuel Williams
  • Andrew Barker
  • Delyan Kalchev
  • Panayot Vassilevski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9573)


We study the performance of a two-level algebraic-multigrid algorithm, focusing on how the choice of coarse-grid solver affects it. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with its performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly at high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which guided the optimization effort, and we tuned performance over the solver’s large parameter space. As a result, we obtained significant speedups on both machines.


Keywords: Algebraic multigrid, HSS matrices, Manycore machines



We thank the anonymous referees for their many comments that greatly helped to improve the paper.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Alex Druinsky (1)
  • Pieter Ghysels (1)
  • Xiaoye S. Li (1)
  • Osni Marques (1)
  • Samuel Williams (1)
  • Andrew Barker (2)
  • Delyan Kalchev (2)
  • Panayot Vassilevski (2)

  1. Lawrence Berkeley National Laboratory, Berkeley, USA
  2. Lawrence Livermore National Laboratory, Livermore, USA
