Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight

  • Wenjing Ma
  • Yulong Ao
  • Chao Yang
  • Samuel Williams


Benchmarks for supercomputers are important tools, not only for evaluating and ranking modern supercomputers, but also for informing future architecture design. As a relatively new benchmark, HPGMG (High Performance Geometric Multigrid) solves a linear system with a full geometric multigrid algorithm. It involves computation at multiple scales, data movement of varying volumes, and both global and neighbor communication with large and small messages, and it correlates more closely with real-world applications than traditional benchmarks such as LINPACK. It is therefore desirable to examine how well HPGMG performs on leadership supercomputers such as Sunway TaihuLight. Sunway TaihuLight, ranked No. 1 on the Top500 list from June 2016 to June 2018 and built on the specially designed SW26010 many-core processor, is of great interest to the high-performance computing community. With careful analysis and code design, we developed an efficient implementation of HPGMG on SW26010 processors. We not only employed established optimization techniques such as 2.5D partitioning, double buffering, and collective data loading, but also introduced a micro-benchmark to guide the choice of optimization strategy and parameter tuning. As a further contribution, we propose a new procedure for the major operations: by granulating and reordering the smooth function and the ghost exchange operation, we reduce memory copies and accelerate communication. Our optimized implementation of HPGMG on Sunway TaihuLight achieved a ground-breaking performance of \(1.036\times 10^{12}\) degrees of freedom per second at the finest level, ranking No. 1 on the HPGMG list of November 2017.
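The full geometric multigrid cycle that HPGMG times can be illustrated with a deliberately minimal sketch: a 1-D Poisson V-cycle in Python with weighted-Jacobi smoothing. All function names and parameters here are illustrative, assumed for this example only, and are unrelated to the HPGMG source code or to the SW26010 implementation described in the paper; the sketch shows only the smooth/restrict/prolong structure a geometric multigrid solver exercises.

```python
import numpy as np

def smooth(u, f, h, iters=2):
    # Weighted-Jacobi smoother (omega = 2/3); RHS is evaluated before the
    # in-place update, so this is true Jacobi, not Gauss-Seidel.
    omega = 2.0 / 3.0
    for _ in range(iters):
        u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1] - 2.0 * u[1:-1])
    return u

def residual(u, f, h):
    # r = f - A u for the standard 3-point discretization of -u''
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    # Full weighting: each coarse point averages its fine-grid neighbors
    rc = np.zeros((r.size - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong(ec, n_fine):
    # Linear interpolation of the coarse correction back to the fine grid
    e = np.zeros(n_fine)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if u.size <= 3:                       # coarsest level: solve exactly
        u[1] = 0.5 * h * h * f[1] + 0.5 * (u[0] + u[2])
        return u
    u = smooth(u, f, h)                   # pre-smooth
    r = residual(u, f, h)
    ec = v_cycle(np.zeros((r.size - 1) // 2 + 1), restrict(r), 2.0 * h)
    u += prolong(ec, u.size)              # coarse-grid correction
    return smooth(u, f, h)                # post-smooth

# Solve -u'' = pi^2 sin(pi x) on [0, 1], exact solution u = sin(pi x)
n = 65
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
f = np.pi ** 2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, h)
err = np.max(np.abs(u - np.sin(np.pi * x)))
```

After a handful of V-cycles the algebraic error falls below the discretization error of the 3-point stencil (here about \(10^{-4}\)), which is the behavior a full multigrid benchmark relies on: each cycle costs only O(N) work. In HPGMG, the analogous smooth and ghost-exchange steps are the operations the paper granulates and reorders.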


Keywords: HPGMG · Sunway TaihuLight · Performance benchmark and optimization · Many-core computing



Acknowledgements

The authors would like to thank the anonymous reviewers for their help in improving the quality of the paper. This work was supported in part by the National Key R&D Plan of China (Grant No. 2016YFB0200603) and the Beijing Natural Science Foundation (Grant No. JQ18001). Dr. Williams was supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Institute of Software & State Key Lab of Computer Science, Chinese Academy of Sciences, Beijing, China
  2. CAPT and CCSE, School of Mathematical Sciences & Center for Data Science, Peking University, Beijing, China
  3. Peng Cheng Laboratory, Shenzhen, China
  4. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, USA
