Customizing the HPL for China accelerator

  • Xinbiao Gan
  • Yikun Hu
  • Jie Liu
  • Lihua Chi
  • Han Xu
  • Chunye Gong
  • Shengguo Li
  • Yihui Yan
Research Paper


HPL is a Linpack benchmark package widely used in high-performance computing tests. Customizing the HPL is crucial for a heterogeneous system equipped with a CPU and the China accelerator, both because of the accelerator's complexity and because of the specific matrix multiplication interface built into it. We therefore use delicate partition and encapsulation on matrix (DPEM) to expose a friendly testing configuration. More importantly, we propose the orchestrating algorithm for matrix multiplication (OAMM) to enhance the efficiency of the heterogeneous system composed of the CPU and the China accelerator. Furthermore, optimization at vectorization (OPTVEC) is applied to shield the architectural details of the vector processing element (VPE) in the China accelerator. The experimental results validate DPEM, OPTVEC and OAMM: OPTVEC speeds up matrix multiplication more than twofold, and OAMM improves productivity by up to 10% compared with the traditional HPL tested on a heterogeneous system.


HPL, China accelerator, DPEM, OAMM, OPTVEC



This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61602495, 61402039, 91430218, 9130324, 11401580), the Key Research and Development Program (Grant Nos. 2017YFB0202104, 2016YFB200401), the Innovation Program from the National University of Defense Technology (Grant No. ZK16-03-06), the Specialized Research Fund for State Key Laboratories of Space Weather, Chinese Academy of Sciences, and the Open Research Fund of Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences (Grant No. LIST201602D).



Copyright information

© Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Xinbiao Gan (1, 2, 3)
  • Yikun Hu (4)
  • Jie Liu (1)
  • Lihua Chi (5)
  • Han Xu (1)
  • Chunye Gong (1)
  • Shengguo Li (1)
  • Yihui Yan (1)

  1. School of Computer, National University of Defense Technology, Changsha, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  3. State Key Laboratory of Space Weather, Chinese Academy of Sciences, Beijing, China
  4. College of Information Science and Engineering, Hunan University, Changsha, China
  5. Institutes of Advanced Science and Technology, Hunan Institute of Traffic Engineering, Hengyang, China
