Using Arm’s scalable vector extension on stencil codes

Abstract

Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is sensitive to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the scalable vector extension (SVE), which is vector-length agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and provide insight useful for compiler optimizers.




Acknowledgements

This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under Grant Agreements Nos. 671697 and 779877. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.

Author information

Correspondence to Adrià Armejach.


Cite this article

Armejach, A., Caminal, H., Cebrian, J.M. et al. Using Arm’s scalable vector extension on stencil codes. J Supercomput 76, 2039–2062 (2020). https://doi.org/10.1007/s11227-019-02842-5


Keywords

  • Data-level parallelism
  • Scalable vector extension
  • Vector-length agnostic
  • Stencil computations