International Journal of Parallel Programming, Volume 45, Issue 5, pp 1046–1072

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

  • Moritz Kreutzer
  • Jonas Thies
  • Melven Röhrig-Zöllner
  • Andreas Pieper
  • Faisal Shahzad
  • Martin Galgon
  • Achim Basermann
  • Holger Fehske
  • Georg Hager
  • Gerhard Wellein


While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.


Keywords: Sparse linear algebra · Heterogeneous computing · Software library · Task parallelism · Large scale computing · Data parallelism



This work was supported by the German Research Foundation (DFG) through the Priority Program 1648 “Software for Exascale Computing” (SPPEXA) under project ESSEX (“Equipping Sparse Solvers for Exascale”). Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing and everyone else who contributed to GHOST, directly or indirectly.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Moritz Kreutzer (1)
  • Jonas Thies (2)
  • Melven Röhrig-Zöllner (2)
  • Andreas Pieper (3)
  • Faisal Shahzad (1)
  • Martin Galgon (4)
  • Achim Basermann (2)
  • Holger Fehske (3)
  • Georg Hager (1)
  • Gerhard Wellein (5)

  1. Erlangen Regional Computing Center, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  2. Simulation and Software Technology, German Aerospace Center (DLR), Köln, Germany
  3. Institute of Physics, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany
  4. Bergische Universität Wuppertal, Wuppertal, Germany
  5. Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
