Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

  • Alex Druinsky
  • Pieter Ghysels
  • Xiaoye S. Li
  • Osni Marques
  • Samuel Williams
  • Andrew Barker
  • Delyan Kalchev
  • Panayot Vassilevski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9573)

Abstract

We study the performance of a two-level algebraic-multigrid algorithm, focusing on how the choice of coarse-grid solver affects it. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with its performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly at high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which guided the optimization effort, and we tuned performance across the solver’s large parameter space. As a result, we obtained significant speedups on both machines.
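The abstract does not spell out the bounds-and-bottlenecks model itself. As a rough illustration of the general idea, the sketch below assumes a roofline-style bound, in which attainable performance is the smaller of the peak compute rate and memory bandwidth times arithmetic intensity. It is applied to a CSR sparse matrix-vector product, the kind of kernel that typically dominates conjugate-gradient iterations. The function names, the idealized traffic estimate, and the machine numbers are hypothetical and are not taken from the paper.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Return (attainable GFLOP/s, limiting resource) under a roofline bound."""
    intensity = flops / bytes_moved            # arithmetic intensity in flop/byte
    memory_roof = bandwidth_gbs * intensity    # GFLOP/s if memory-bound
    if memory_roof < peak_gflops:
        return memory_roof, "memory"
    return peak_gflops, "compute"


def spmv_cost(nnz, nrows, value_bytes=8, index_bytes=4):
    """Flops and idealized memory traffic of one CSR y = A*x product.

    Assumes (hypothetically) that x and y each move through memory once.
    """
    flops = 2 * nnz                                   # one multiply and one add per nonzero
    bytes_moved = (nnz * (value_bytes + index_bytes)  # matrix values and column indices
                   + (nrows + 1) * index_bytes        # row pointers
                   + 2 * nrows * value_bytes)         # read x, write y
    return flops, bytes_moved


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s sustained bandwidth.
flops, traffic = spmv_cost(nnz=7_000_000, nrows=1_000_000)
bound, limiter = roofline_bound(100.0, 50.0, flops, traffic)
print(f"SpMV bound: {bound:.2f} GFLOP/s, limited by {limiter}")
```

Under these assumed numbers the kernel is limited by memory bandwidth rather than by compute, which is the kind of conclusion such a model is used to draw before optimizing.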

Keywords

Algebraic multigrid, HSS matrices, Manycore machines

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Alex Druinsky (1)
  • Pieter Ghysels (1)
  • Xiaoye S. Li (1)
  • Osni Marques (1)
  • Samuel Williams (1)
  • Andrew Barker (2)
  • Delyan Kalchev (2)
  • Panayot Vassilevski (2)

  1. Lawrence Berkeley National Laboratory, Berkeley, USA
  2. Lawrence Livermore National Laboratory, Livermore, USA
