Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

  • Donglin Chen
  • Jianbin Fang
  • Chuanfu Xu
  • Shizhao Chen
  • Zheng Wang


Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores, which makes scalability analysis ever more important. This paper presents a quantitative study characterizing the scalability of sparse matrix–vector multiplication (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV because it is a common operation in scientific and HPC applications. Because ARM-based many-core architectures are new, there is little prior work on understanding SpMV scalability on such hardware. To close this gap, we carry out a large-scale empirical evaluation involving over 1000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To understand which software and hardware parameters are most important in determining the scalability of a given SpMV kernel, we develop an analytical performance model based on regression trees. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights that help application developers better optimize SpMV on an emerging HPC architecture.
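To make the setting concrete, the sketch below pairs a textbook CSR SpMV kernel with a scikit-learn regression tree that maps simple structural matrix features to a speedup estimate. This is a minimal, hypothetical illustration of the methodology described in the abstract: the feature set, the synthetic speedup labels, and hyperparameters such as max_depth are assumptions made for exposition, not the authors' exact experimental setup.

# Minimal sketch: a CSR SpMV kernel plus a regression tree relating simple
# structural matrix features to a speedup estimate. Features, labels, and
# hyperparameters are illustrative assumptions, not the paper's exact setup.
import numpy as np
import scipy.sparse as sp
from sklearn.tree import DecisionTreeRegressor

def csr_spmv(A, x):
    """Textbook CSR sparse matrix-vector product y = A @ x."""
    y = np.zeros(A.shape[0])
    for i in range(A.shape[0]):                       # one output row at a time
        for j in range(A.indptr[i], A.indptr[i + 1]):
            y[i] += A.data[j] * x[A.indices[j]]
    return y

def matrix_features(A):
    """Structural features a scalability model might use (assumed set)."""
    row_nnz = np.diff(A.indptr)                       # non-zeros per row
    return [A.shape[0], A.nnz, row_nnz.mean(), row_nnz.std()]

# Toy training set: random CSR matrices with placeholder speedup labels.
# In the paper's setting, labels would come from timing the kernel at
# different thread counts on the FT-2000+.
X, y = [], []
for seed in range(50):
    rng = np.random.default_rng(seed)
    A = sp.random(2000, 2000, density=rng.uniform(0.001, 0.01),
                  format="csr", random_state=seed)
    X.append(matrix_features(A))
    y.append(rng.uniform(1.0, 64.0))                  # placeholder speedup

model = DecisionTreeRegressor(max_depth=4).fit(X, y)

# Sanity-check the kernel and query the fitted model on an unseen matrix.
A_new = sp.random(2000, 2000, density=0.005, format="csr", random_state=99)
x = np.ones(A_new.shape[1])
assert np.allclose(csr_spmv(A_new, x), A_new @ x)
print("predicted speedup:", model.predict([matrix_features(A_new)])[0])

One appeal of a regression tree in this role, compared with an opaque model, is that its learned splits can be read off directly, which is what lets the model surface which software and hardware parameters matter most for scalability.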


Keywords: SpMV · Many-core · Scalability · Performance modeling



This work was partially funded by the National Key R&D Program of China under Grant No. 2017YFB0202003; the National Science Foundation of China under Grant Agreements 61602501, 61772542, and 61872294; and the Royal Society International Collaboration Grant (IE161012).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. College of Computer Science, National University of Defense Technology, Changsha, China
  2. University of Leeds, Leeds, UK
  3. Xi’an University of Posts and Telecommunications, Xi’an, China
