Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Customizing the HPL for China accelerator


HPL is a Linpack benchmark package widely used in high-performance computing tests. Customizing the HPL is crucial for a heterogeneous system equipped with CPU and the China accelerator because of the complexity of the China accelerator and the specified interface on matrix multiplication built in the China accelerator. Therefore, it is advisable to use delicate partition and encapsulation on matrix (DPEM) to expose a friendly testing configuration. More importantly, we propose the orchestrating algorithm for matrix multiplication (OAMM) to enhance the efficiency of the heterogeneous system composed of CPU and China accelerator. Furthermore, optimization at vectorization (OPTVEC) is applied to shield the architectural details of the vector processing element (VPE) equipped in the China accelerator. The experimental results validate DPEM, OPTVEC and OAMM. OPTVEC optimizations would speed up matrix multiplication more than twofold, moreover OAMM would improve productivity by up to 10% compared to the traditional HPL tested in a heterogeneous system.

This is a preview of subscription content, log in to check access.


  1. 1

    Lu Y T. The applications leveraging supercomputing systems. In: International Supercomputing Conference, Frankfurt, 2015

  2. 2

    Dongarra J J, Luszczek P, Petitet A. The LINPACK benchmark: past, present and future. Concurr Computat-Pract Exper, 2003, 15: 803–820

  3. 3

    Shi R, Potluri S, Hamidouche K, et al. A scalable and portable approach to accelerate hybrid the HPL on heterogeneous CPU-GPU clusters. In: Proceedings of IEEE International Conference on Cluster Computing (CLUSTER). Indianapolis: IEEE, 2014. 1–8

  4. 4

    Wang Q, Ohmura J, Axida S, et al. Parallel matrix-matrix multiplication based on the HPL with a GPU-accelerated PC cluster. In: Proceedings of the International Conference on Networking and Computing. Higashi-Hiroshima: IEEE, 2010. 243–248

  5. 5

    Yang X J, Liao X, Lu K, et al. The TianHe 1 a supercomputer, its hardware and software. J Comput Sci Tech, 2011, 26: 344–351

  6. 6

    Du Y F, Yang C Q, Wang F, et al. Analysis and evaluation method for the Linpack benchmark. J Northeast Univ Nat Sci, 2014, 35: 102–107

  7. 7

    Liu J, Gan X B, Chi L H, et al. A peak performance model for matrix multiplication on general-purpose DSP (in Chinese). J Hunan Univ Nat Sci, 2013, 40: 148–152

  8. 8

    Chi L H, Liu J, Yan Y H, et al. FitenBLAS: high-performance BLAS for a massively multithreaded FT1000 processor (in Chinese). J Hunan Univ Nat Sci, 2015, 42: 100–106

  9. 9

    Gong C Y, Bao W M, Tang G J, et al. An efficient parallel solution for Caputo fractional reaction-diffusion equation. J Supercomputing, 2014, 68: 1521–1537

  10. 10

    Gong C, Bao W, Tang G. A parallel algorithm for the Riesz fractional reaction-diffusion equation with explicit finite difference method. Fract Calc Appl Anal, 2013, 16: 654–669

  11. 11

    Gong C Y, Liu J, Chi L H, et al. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method. J Comput Phys, 2011, 230: 6010–6022

  12. 12

    Zhao X, Chen Y, Zhang H, et al. A new decomposition solver for complex electromagnetic problems. IEEE Antenn Propag Mag, 2017, 59: 131–140

  13. 13

    Xie X L, Liang Y, Li X H, et al. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In: Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). New York: ACM, 2015. 395–406

  14. 14

    Liang Y, Huynh H P, Rupnow K, et al. Efficient GPU spatial-temporal multitasking. IEEE Trans Parallel Distrib Syst, 2015, 26: 748–760

  15. 15

    Chen C, Du Y F, Jiang H, et al. HPCG: preliminary evaluation and optimization on Tianhe-2 CPU-only nodes. In: Proceedings of Symposium on Computer Architecture and high-performance Computing. Jussieu: IEEE, 2014. 41–48

  16. 16

    Ao Y L, Liu Y Q, Yang C, et al. Performance evaluation of HPGMG on tianhe-2: early experience. In: Proceedings of International Conference on Algorithms and Architectures for Parallel Processing. New York: Springer, 2015. 230–243

  17. 17

    Liu Y Q, Yang C, Liu F F, et al. 623 Tflop/s HPCG run on Tianhe-2: leveraging millions of hybrid cores. Internat J High Perform Comput Appl, 2016, 30: 39–54

  18. 18

    Li D, Xu C, Wang Y, et al. Parallelizing and optimizing large-scale 3D multi-phase flow simulations on the Tianhe-2 supercomputer. Concurr Computat-Pract Exper, 2016, 28: 1678–1692

  19. 19

    Wei S, Zhao R C, Yao Y. Loop-nest auto-vectorizat ion based on SLP (in Chinese). J Softw, 2012, 23: 1717–1728

  20. 20

    Zhao J, Zhao R C, Ding R, et al. Parallelism recognition technology based on nested loops classifying (in Chinese). J Softw, 2012, 23: 2695–2704

  21. 21

    Gao W, Zhao R C, Han L, et al. Research on SIMD auto-vectorization compiling optimization (in Chinese). J Softw, 2015, 26: 1265–1284

  22. 22

    Zhao J, Zhao R C, Han L, et al. An MPI backend for open64 compiler (in Chinese). J Softw, 2012, 23: 2695–2704

Download references


This work was partly supported by National Natural Science Foundation of China (Grant Nos. 61602495, 61402039, 91430218, 9130324, 11401580), Key Research and Development Program (Grant Nos. 2017YFB0202104, 2016YFB200401), Innovation Program from the National University of Defense Technology (Grant No. ZK16-03-06), partly supported by Specialized Research Fund for State Key Laboratories of Space Weather, Chinese Academy of Sciences, and partly supported by Open Research Fund of Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences (Grant No. LIST201602D).

Author information

Correspondence to Xinbiao Gan.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Gan, X., Hu, Y., Liu, J. et al. Customizing the HPL for China accelerator. Sci. China Inf. Sci. 61, 042102 (2018).

Download citation


  • HPL
  • China accelerator
  • DPEM
  • OAMM