
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

International Journal of Parallel Programming

Abstract

While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring “standard” as well as “accelerated” resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.
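
As a concrete illustration of the "MPI+X" paradigm mentioned above, the sketch below shows a hybrid-parallel sparse matrix-vector multiply (SpMV) over a row-distributed CRS matrix, with MPI across processes and OpenMP as the "X" within each process. All names here (crs_local, spmv_local) are hypothetical and chosen for this example only; this is not the GHOST API, and the MPI halo exchange of remote vector entries is only indicated by a comment.

```c
/* Hypothetical MPI+OpenMP sketch of a hybrid-parallel SpMV on a
 * row-distributed CRS matrix. Illustrates the "MPI+X" paradigm only;
 * it is NOT the GHOST API. Compile with, e.g., mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>

/* Locally owned row block of a distributed CRS matrix. */
typedef struct {
    int     nrows;   /* number of locally owned rows               */
    int    *rowptr;  /* offsets into col[]/val[], length nrows + 1 */
    int    *col;     /* locally remapped column indices            */
    double *val;     /* nonzero values                             */
} crs_local;

/* y = A*x for the local row block. The input vector x must already
 * contain the remote entries gathered from neighboring processes. */
static void spmv_local(const crs_local *A, const double *x, double *y)
{
    /* "X" level: OpenMP threading over the locally owned rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->nrows; i++) {
        double tmp = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            tmp += A->val[j] * x[A->col[j]];
        y[i] = tmp;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* MPI level: build the local row block, exchange the halo entries
     * of x with neighboring processes, then call spmv_local(). */
    MPI_Finalize();
    return 0;
}
```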


Acknowledgments

This work was supported by the German Research Foundation (DFG) through the Priority Program 1648 “Software for Exascale Computing” (SPPEXA) under project ESSEX (“Equipping Sparse Solvers for Exascale”). Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing, and to everyone else who contributed to GHOST, directly or indirectly.

Author information

Correspondence to Moritz Kreutzer.

Cite this article

Kreutzer, M., Thies, J., Röhrig-Zöllner, M. et al. GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems. Int J Parallel Prog 45, 1046–1072 (2017). https://doi.org/10.1007/s10766-016-0464-z
