HPL is a Linpack benchmark package widely used to test high-performance computing systems. Customizing HPL is crucial for a heterogeneous system equipped with a CPU and the China accelerator, because of the accelerator's complexity and the specific matrix multiplication interface built into it. We therefore apply delicate partition and encapsulation on matrix (DPEM) to expose a friendly testing configuration. More importantly, we propose the orchestrating algorithm for matrix multiplication (OAMM) to enhance the efficiency of the heterogeneous system composed of the CPU and the China accelerator. Furthermore, optimization at vectorization (OPTVEC) is applied to shield the architectural details of the vector processing element (VPE) in the China accelerator. The experimental results validate DPEM, OPTVEC, and OAMM: OPTVEC speeds up matrix multiplication more than twofold, and OAMM improves productivity by up to 10% compared with the traditional HPL tested on a heterogeneous system.
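The abstract does not describe how OAMM divides work between the CPU and the accelerator. As a rough illustration of the general idea of orchestrating one matrix multiplication across two devices, the sketch below splits the columns of B in proportion to assumed device throughputs and multiplies each part separately. This is a generic static split, not the authors' algorithm; the function names, the throughput parameters, and the pure-Python `matmul` standing in for both device kernels are all hypothetical.

```python
def split_columns(n_cols, cpu_gflops, acc_gflops):
    """Divide n_cols between CPU and accelerator by relative throughput.

    Hypothetical helper: the throughput figures are assumed inputs,
    not values taken from the paper.
    """
    share = acc_gflops / (cpu_gflops + acc_gflops)
    acc_cols = int(round(n_cols * share))
    acc_cols = max(0, min(n_cols, acc_cols))
    return n_cols - acc_cols, acc_cols


def matmul(a, b):
    """Plain triple-loop product; stands in for a device-specific kernel."""
    inner = len(b)
    cols = len(b[0]) if b else 0
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


def hybrid_matmul(a, b, cpu_gflops, acc_gflops):
    """Compute A*B by giving each device a contiguous column panel of B."""
    cpu_cols, _ = split_columns(len(b[0]), cpu_gflops, acc_gflops)
    b_cpu = [row[:cpu_cols] for row in b]   # panel assigned to the CPU
    b_acc = [row[cpu_cols:] for row in b]   # panel assigned to the accelerator
    c_cpu = matmul(a, b_cpu)                # would run on the host
    c_acc = matmul(a, b_acc)                # would run on the accelerator
    # Stitch the two column panels of C back together row by row.
    return [row_cpu + row_acc for row_cpu, row_acc in zip(c_cpu, c_acc)]
```

In a real heterogeneous run the two panel products would execute concurrently, and the split ratio would be tuned so that both devices finish at about the same time.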
This work was partly supported by National Natural Science Foundation of China (Grant Nos. 61602495, 61402039, 91430218, 9130324, 11401580), Key Research and Development Program (Grant Nos. 2017YFB0202104, 2016YFB200401), Innovation Program from the National University of Defense Technology (Grant No. ZK16-03-06), partly supported by Specialized Research Fund for State Key Laboratories of Space Weather, Chinese Academy of Sciences, and partly supported by Open Research Fund of Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences (Grant No. LIST201602D).
Gan, X., Hu, Y., Liu, J. et al. Customizing the HPL for China accelerator. Sci. China Inf. Sci. 61, 042102 (2018). https://doi.org/10.1007/s11432-017-9221-0
- China accelerator