
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

International Journal of Parallel Programming

Abstract

While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.
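
As a concrete illustration of the "MPI+X" paradigm mentioned above, the sketch below shows a hybrid-parallel sparse matrix-vector multiply (SpMV) over a row-distributed CRS matrix, with MPI across processes and OpenMP as the "X" within each process. All names here (crs_local, spmv_local) are hypothetical and chosen for this example only; this is not the GHOST API, and the MPI halo exchange of remote vector entries is only indicated by a comment.

```c
/* Hypothetical MPI+OpenMP sketch of a hybrid-parallel SpMV on a
 * row-distributed CRS matrix. Illustrates the "MPI+X" paradigm only;
 * it is NOT the GHOST API. Compile with, e.g., mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>

/* Locally owned row block of a distributed CRS matrix. */
typedef struct {
    int     nrows;   /* number of locally owned rows               */
    int    *rowptr;  /* offsets into col[]/val[], length nrows + 1 */
    int    *col;     /* locally remapped column indices            */
    double *val;     /* nonzero values                             */
} crs_local;

/* y = A*x for the local row block. The input vector x must already
 * contain the remote entries gathered from neighboring processes. */
static void spmv_local(const crs_local *A, const double *x, double *y)
{
    /* "X" level: OpenMP threading over the locally owned rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->nrows; i++) {
        double tmp = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            tmp += A->val[j] * x[A->col[j]];
        y[i] = tmp;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* MPI level: build the local row block, exchange the halo entries
     * of x with neighboring processes, then call spmv_local(). */
    MPI_Finalize();
    return 0;
}
```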


Acknowledgments

This work was supported by the German Research Foundation (DFG) through the Priority Program 1648 “Software for Exascale Computing” (SPPEXA) under project ESSEX (“Equipping Sparse Solvers for Exascale”). Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing, and to everyone else who contributed to GHOST, directly or indirectly.

Author information

Correspondence to Moritz Kreutzer.

Cite this article

Kreutzer, M., Thies, J., Röhrig-Zöllner, M. et al. GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems. Int J Parallel Prog 45, 1046–1072 (2017). https://doi.org/10.1007/s10766-016-0464-z
