
Using Arm’s scalable vector extension on stencil codes

  • Adrià Armejach
  • Helena Caminal
  • Juan M. Cebrian
  • Rubén Langarita
  • Rekai González-Alberquilla
  • Chris Adeniyi-Jones
  • Mateo Valero
  • Marc Casas
  • Miquel Moretó

Abstract

Data-level parallelism, achieved through vector/SIMD capabilities, is frequently ignored or underutilized, even though it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that must be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is sensitive to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is vector-length agnostic (VLA): it enables the generation of binaries that run correctly regardless of the physical vector register length. In this paper, we leverage the main characteristics of SVE to implement and optimize stencil computations, which are ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations such as loop unrolling, loop fusion, load trading and data reuse. Our detailed simulations with vector lengths ranging from 128 to 2048 bits show that these optimizations can yield performance improvements over straightforward vectorized code of up to 1.57×. In addition, we show that certain optimizations can hurt performance due to reduced arithmetic intensity and instruction overheads, and we provide insights useful for compiler optimizers.

Keywords

Data-level parallelism · Scalable vector extension · Vector-length agnostic · Stencil computations

Acknowledgements

This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under Grant Agreements Nos. 671697 and 779877. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2017-23269. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Adrià Armejach (1, 2), corresponding author
  • Helena Caminal (3)
  • Juan M. Cebrian (1)
  • Rubén Langarita (1)
  • Rekai González-Alberquilla (4)
  • Chris Adeniyi-Jones (4)
  • Mateo Valero (1)
  • Marc Casas (1)
  • Miquel Moretó (1, 2)

  1. Barcelona Supercomputing Center, Barcelona, Spain
  2. Universitat Politècnica de Catalunya, Barcelona, Spain
  3. Cornell University, Ithaca, USA
  4. Arm, Cambridge, UK