Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

  • Donglin Chen
  • Jianbin Fang
  • Shizhao Chen
  • Chuanfu Xu
  • Zheng Wang

Abstract

Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-core architectures. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, the Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and its parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address this problem by proposing a machine-learning-based model that predicts the best storage format and parameters from input matrix features. The model automatically specializes to the many-core architectures we consider. Experimental results show that our approach achieves, on average, 93% of the best-available performance without incurring runtime profiling overhead.
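To make the approach concrete, the sketch below illustrates one way such a feature-based format predictor could be built. It is a minimal sketch only: the feature set, the candidate format list, and the use of a scikit-learn random forest are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a storage-format predictor in the spirit of the
# abstract. ASSUMPTIONS: the features, the candidate format list, and
# the random-forest classifier are illustrative choices, not the
# paper's exact model.
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

FORMATS = ["COO", "CSR", "ELL", "SELL", "HYB"]  # assumed candidate formats

def matrix_features(A: sp.csr_matrix) -> np.ndarray:
    """Extract cheap structural features from a CSR matrix."""
    nnz_per_row = np.diff(A.indptr)          # non-zeros in each row
    return np.array([
        A.shape[0],                          # number of rows
        A.shape[1],                          # number of columns
        A.nnz,                               # total non-zeros
        nnz_per_row.mean(),                  # average row length
        nnz_per_row.std(),                   # row-length irregularity
        nnz_per_row.max(),                   # longest row
    ], dtype=np.float64)

def train_predictor(matrices, best_format_ids):
    """Fit a classifier on training matrices whose fastest format was
    measured offline (best_format_ids holds indices into FORMATS)."""
    X = np.vstack([matrix_features(A) for A in matrices])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, best_format_ids)
    return model

def predict_format(model, A) -> str:
    """At load time, pick a storage format without runtime profiling."""
    x = matrix_features(sp.csr_matrix(A)).reshape(1, -1)
    return FORMATS[int(model.predict(x)[0])]
```

Because the features are computed once when a matrix is loaded, the format can be chosen before the first SpMV call, which is how a trained predictor avoids the overhead of runtime profiling.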

Keywords

SpMV · Sparse matrix format · Many-core · Performance tuning

Acknowledgements

This work was partially funded by the National Key R&D Program of China under Grant No. 2017YFB0202003; the National Natural Science Foundation of China under Grant Agreements 61602501, 11502296, 61772542, 61561146395, and 61872294; the Open Research Program of the China State Key Laboratory of Aerodynamics under Grant Agreement SKLA20160104; the UK Engineering and Physical Sciences Research Council under Grants EP/M01567X/1 (SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal Society International Collaboration Grant (IE161012).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. College of Computer Science, National University of Defense Technology, Changsha, China
  2. School of Computing and Communications, Lancaster University, Lancaster, UK
