Parallelization of Sparse Matrix Kernels for Big Data Applications

Part of the Computer Communications and Networks book series (CCN)


Analyzing big data on large-scale distributed systems often requires efficient parallel graph algorithms to explore the relationships between individual components. Graph algorithms commonly use the basic adjacency list representation for graphs, which can also be viewed as a sparse matrix. This correspondence between graphs and sparse matrices makes it possible to express many important graph algorithms in terms of basic sparse matrix operations, for which the optimization literature is more mature. For example, graph analytics libraries such as Pegasus and Combinatorial BLAS use sparse matrix kernels for a wide variety of operations on graphs. In this work, we focus on two such important sparse matrix kernels: sparse matrix–sparse matrix multiplication (SpGEMM) and sparse matrix–dense matrix multiplication (SpMM). We propose partitioning models for efficient parallelization of these kernels on large-scale distributed systems. Our models aim to reduce communication volume while balancing computational load, two vital performance metrics on distributed systems. We show that by exploiting the sparsity patterns of the matrices through our models, the parallel performance of SpGEMM and SpMM operations can be significantly improved.
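The correspondence between adjacency lists and sparse matrices, and the two kernels named above, can be illustrated with a small pure-Python sketch. Everything in it is an assumption made for illustration: the function names (`adjacency_to_sparse`, `spgemm`, `spmm`, `rowwise_cost`), the dict-of-dicts storage, and the toy graph are hypothetical and are not taken from Pegasus, Combinatorial BLAS, or the chapter. The cost function only counts load and communication for a plain row-wise partition; it is not the partitioning model proposed in this work.

```python
from collections import defaultdict

def adjacency_to_sparse(adj):
    """View an adjacency list {u: [v, ...]} as a row-wise sparse matrix
    {u: {v: 1.0}}: entry (u, v) is nonzero iff the edge u -> v exists."""
    return {u: {v: 1.0 for v in nbrs} for u, nbrs in adj.items()}

def spgemm(A, B):
    """SpGEMM sketch: C = A * B with both operands sparse, built row by row."""
    C = {}
    for i, row in A.items():
        acc = defaultdict(float)
        for k, a_ik in row.items():
            # Row i of C accumulates a_ik times row k of B.
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        C[i] = dict(acc)
    return C

def spmm(A, X):
    """SpMM sketch: Y = A * X with A sparse (assumed square, n x n) and
    X dense, given as a list of n rows."""
    width = len(X[0])
    Y = [[0.0] * width for _ in range(len(X))]
    for i, row in A.items():
        for k, a_ik in row.items():
            for j in range(width):
                Y[i][j] += a_ik * X[k][j]
    return Y

def rowwise_cost(A, part, n_parts):
    """Assumed row-wise partition: part[i] owns row i of A and row i of the
    right-hand operand. Returns (nonzeros per part, number of remote rows
    that must be received in total)."""
    loads = [0] * n_parts
    needed = [set() for _ in range(n_parts)]
    for i, row in A.items():
        p = part[i]
        loads[p] += len(row)
        # A nonzero a_ik requires row k of the right-hand operand.
        needed[p].update(k for k in row if part[k] != p)
    volume = sum(len(rows) for rows in needed)
    return loads, volume

# Toy directed graph on vertices 0..3 (hypothetical data).
adj = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
A = adjacency_to_sparse(adj)
two_step = spgemm(A, A)          # entry (i, j) counts 2-step walks i -> j
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
Y = spmm(A, X)                   # each row sums the X rows of its out-neighbors
blocked = rowwise_cost(A, [0, 0, 1, 1], 2)
striped = rowwise_cost(A, [0, 1, 0, 1], 2)
```

On this toy graph the two partitions give the same per-part load, but the blocked assignment needs fewer remote rows than the striped one, which is the kind of difference a sparsity-aware partitioning is meant to exploit.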


Big data · Graph analytics · Sparse matrices · Parallel computing · High performance computing · Combinatorial scientific computing



This work was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under Grant EEEAG-115E212. This article is also based upon work from COST Action IC1406 (cHiPSet).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Computer Engineering, Bilkent University, Ankara, Turkey
