Stencil Computations on HPC-oriented ARMv8 64-Bit Multi-Core Processor

  • Chunjiang Li
  • Yushan Dong
  • Kuan Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9530)


The ARMv8 64-bit platform has been considered an alternative for high-performance computing (HPC). Stencil computations are a class of iterative kernels that update array elements according to a fixed neighborhood pattern, called a stencil. In this paper, we evaluate the performance and scalability of an ARMv8 64-bit multi-core processor on a 7-point 3D stencil code, and devise a series of optimizations for it. The optimizations focus on parallelizing the kernel and exploiting data locality through loop tiling; we also improve the calculation of the block size used in tiling. The achieved performance varies with the stencil grid size: the best result is 24.4 % of peak double-precision Flops at a grid size of \(64^{3}\). Compared with an Intel Xeon processor, the ARMv8 64-bit processor reaches about 40 % of the Sandy Bridge performance on the stencil code at a grid size of \(512^{3}\), but it shows better scalability.


Keywords: Stencil computation · ARMv8 64-bit multi-core processor · Parallelization · Loop tiling



The work in this paper is partially supported by the National Natural Science Foundation of China under grant No. 61170046, and by the National High Technology Research and Development Program of China (863 Program) under grant No. 2012AA010903.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. School of Computer, National University of Defence Technology, Changsha, China
