Vectorizing divergent control flow with active-lane consolidation on long-vector architectures

Praharenka, Wyatt; Pankratz, David; De Carvalho, João P. L.; Amiri, Ehsan; Amaral, José Nelson

doi:10.1007/s11227-022-04359-w

Published: 07 March 2022

Volume 78, pages 12553–12588, (2022)

The Journal of Supercomputing

Wyatt Praharenka ORCID: orcid.org/0000-0001-5064-1293¹,
David Pankratz¹,
João P. L. De Carvalho¹,
Ehsan Amiri² &
…
José Nelson Amaral¹

Abstract

Control-flow divergence limits the applicability of loop vectorization, an important code transformation that accelerates data-parallel loops. Control-flow divergence is commonly handled using an IF-conversion transformation combined with vector predication. However, the resulting vector instructions execute inefficiently, with many inactive lanes. Branch-on-superword-condition-code (BOSCC) instructions are used to skip over some vector instructions, but their effectiveness decreases as vector length increases. This paper presents a novel vector permutation, active-lane consolidation (ALC), that enables efficient execution of control-divergent loops by consolidating the active lanes of two vectors. This paper demonstrates the use of ALC with two loop transformations and applies them to kernels extracted from the SPEC CPU 2017 benchmark suite, yielding up to a 30.9% reduction in dynamic instruction count compared to optimization using only BOSCCs. Motivated by ALC, this paper also proposes design changes to the Arm Scalable Vector Extension (SVE) to improve vectorization of control-divergent loops.
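The idea of consolidating the active lanes of two vectors can be illustrated with a small scalar model. The following Python sketch is only one plausible reading of the abstract, not the permutation defined in the paper: the function name `consolidate`, the list-based vectors, and the choice to pack active lanes first are all assumptions made for illustration, and no SVE instruction is being modeled.

```python
def consolidate(vec_a, mask_a, vec_b, mask_b):
    """Pack the active lanes of two equal-length vectors into the first output.

    Hypothetical scalar model: vectors are Python lists, masks are lists of
    booleans (True marks an active lane). Returns two (vector, mask) pairs.
    """
    if not (len(vec_a) == len(mask_a) == len(vec_b) == len(mask_b)):
        raise ValueError("vectors and masks must have the same length")
    n = len(vec_a)

    # Gather active lanes from both inputs, preserving their relative order.
    active = [x for x, m in zip(vec_a, mask_a) if m]
    active += [x for x, m in zip(vec_b, mask_b) if m]

    # Inactive lanes follow, so no element is lost by the permutation.
    inactive = [x for x, m in zip(vec_a, mask_a) if not m]
    inactive += [x for x, m in zip(vec_b, mask_b) if not m]

    packed = active + inactive      # 2n lanes, active ones first
    k = len(active)                 # total number of active lanes

    out0, out1 = packed[:n], packed[n:]
    mask0 = [i < k for i in range(n)]
    mask1 = [n + i < k for i in range(n)]
    return (out0, mask0), (out1, mask1)


# Two 4-lane vectors with three active lanes in total.
(first, first_mask), (second, second_mask) = consolidate(
    [0, 1, 2, 3], [True, False, False, True],
    [4, 5, 6, 7], [False, True, False, False],
)
```

In this example the three active lanes all end up in the first output, so the second output has no active lanes. Under the assumptions above, that is the situation in which a branch such as a BOSCC could skip the predicated work for the second vector entirely.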

Notes

For details of the experiment refer to Sect. 6.1.
In this paper, vector registers are referred to simply as vectors.

References

Monroe D (2020) Fugaku takes the lead. Commun ACM 64(1):16–18
Allen JR, Kennedy K, Porterfield C, Warren J (1983) Conversion of control dependence to data dependence. In: Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on principles of programming languages, pp 177–189
Barredo A, Cebrian JM, Moretó M, Casas M, Valero M (2020) Improving predication efficiency through compaction/restoration of SIMD instructions. In: 2020 IEEE international symposium on high performance computer architecture (HPCA), pp 717–728
Shin J (2007) Introducing control flow into vectorized code. In: 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), pp 280–291. IEEE
Shin J, Hall MW, Chame J (2009) Evaluating compiler technology for control-flow optimizations for multimedia extension architectures. Microprocess Microsyst 33(4):235–243
Flynn MJ (1972) Some computer organizations and their effectiveness. IEEE Trans Comput C-21(9):948–960
Intel Corporation (2021) Intel AVX-512. https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html
ARM Corporation (2021) ARM Advanced SIMD. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
Arm Limited (2021) Arm® Architecture Reference Manual Armv8, for Armv8-A Architecture Profile
Russell RM (1978) The CRAY-1 computer system. Commun ACM 21(1):63–72
Patterson D (2017) SIMD Instructions Considered Harmful. https://www.sigarch.org/simd-instructions-considered-harmful
Arm Limited (2021) Arm® Architecture Reference Manual Supplement The Scalable Vector Extension (SVE), for Armv8-A
RISC-V® International Members (2021) The RISC-V “V” vector extension. version 0.10 (Visited on April 26, 2021). https://github.com/riscv/riscv-v-spec/releases/download/v0.10/riscv-v-spec-0.10.pdf
Sreraman N, Govindarajan R (2000) A vectorizing compiler for multimedia extensions. Int J Parallel Prog 28(4):363–400
Kennedy K, Allen JR (2001) Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., Massachusetts
Wolfe MJ (1995) High performance compilers for parallel computing. Addison-Wesley Longman Publishing Co. Inc, New York
Moll S, Hack S (2018) Partial control-flow linearization. ACM SIGPLAN Notices 53(4):543–556
Allen F, Cocke J (1971) A catalogue of optimizing transformations. Prentice-Hall, New Jersey
Anantpur J, Govindarajan R (2014) Taming control divergence in GPUs through control flow linearization. In: Albert C (ed) Compiler construction. Springer, Berlin Heidelberg, pp 133–153
Sun H, Gorlatch S, Zhao R (2018) Refactoring loops with nested ifs for SIMD extensions without masked instructions. In: European Conference on Parallel Processing, pp 769–781. Springer
Sun H, Fey F, Zhao J, Gorlatch S (2019) WCCV: improving the vectorization of IF-statements with warp-coherent conditions. In: Proceedings of the ACM International Conference on Supercomputing, pp 319–329
ARM (2020) The Arm C language extensions. https://developer.arm.com/architectures/system-architectures/software-standards/acle
Fujitsu Limited (2021) A64FX® Microarchitecture Manual. Version 1.4
ARM (2020) The ARM instruction emulator. https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator
Bruening D, Amarasinghe S (2004) Efficient, transparent, and comprehensive runtime code manipulation. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering
SPEC (2021) SPEC2017 Benchmark overview. https://www.spec.org/cpu2017/Docs/overview.html
Coutinho B, Sampaio D, Pereira FMQ, Meira Jr W (2011) Divergence analysis and optimizations. In: 2011 International Conference on Parallel Architectures and Compilation Techniques, pp 320–329. IEEE
Lang H, Passing L, Kipf A, Boncz P, Neumann T, Kemper A (2020) Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines. VLDB J 29(2):757–774
Fung WWL, Sham I, Yuan G, Aamodt TM (2007) Dynamic warp formation and scheduling for efficient GPU control flow. In: 40th annual IEEE/ACM international symposium on microarchitecture (MICRO 2007), pp 407–420. IEEE
Fung WWL, Aamodt TM (2011) Thread block compaction for efficient SIMT control flow. In: 2011 IEEE 17th international symposium on high performance computer architecture, pp 25–36. IEEE
Khorasani F, Gupta R, Bhuyan LN (2015) Efficient warp execution in presence of divergence with collaborative context collection. In: Proceedings of the 48th international symposium on microarchitecture, MICRO-48, pp 204–215
Stephens N, Biles S, Boettcher M, Eapen J, Eyole M, Gabrielli G, Horsnell M, Magklis G, Martinez A, Premillieu N et al (2017) The ARM scalable vector extension. IEEE Micro 37(2):26–39
Sato M, Ishikawa Y, Tomita H, Kodama Y, Odajima T, Tsuji M, Yashiro H, Aoki M, Shida N, Miyoshi I, et al (2020) Co-design for A64FX manycore processor and “Fugaku”. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–15. IEEE
Lovett (2021) SVE in LLVM. https://hps.vi4io.org/_media/events/2020/llvm-cth20_lovett.pdf
Armejach A, Caminal H, Cebrian JM, Langarita R, González-Alberquilla R, Adeniyi-Jones C, Valero M, Casas M, Moretó M (2020) Using Arm® scalable vector extension on stencil codes. J Supercomput 76(3):2039–2062
Cococcioni M, Rossi F, Ruffaldi E, Saponara S (2020) Fast deep neural networks for image processing using posits and Arm scalable vector extension. J Real-Time Image Process 17:759–771
Chen C, Xiang X, Liu C, Shang Y, Guo R, Liu D, Lu Y, Hao Z, Luo J, Chen Z, et al (2020) Xuantie-910: a commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension: industrial product. In: 2020 ACM/IEEE 47th annual international symposium on computer architecture (ISCA), pp 52–64. IEEE

Acknowledgements

This research was funded by the University of Alberta Huawei Joint Innovation Collaboration (UAHJIC) and by the Natural Sciences and Engineering Research Council (NSERC) of Canada. We thank Giancarlo Pernudi Segura for his assistance in creating some of the assembly-level code for the case studies.

Author information

Authors and Affiliations

University of Alberta, Edmonton, Canada
Wyatt Praharenka, David Pankratz, João P. L. De Carvalho & José Nelson Amaral
Huawei Technologies Canada, Markham, Canada
Ehsan Amiri

Corresponding author

Correspondence to Wyatt Praharenka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Praharenka, W., Pankratz, D., De Carvalho, J.P.L. et al. Vectorizing divergent control flow with active-lane consolidation on long-vector architectures. J Supercomput 78, 12553–12588 (2022). https://doi.org/10.1007/s11227-022-04359-w

Accepted: 05 February 2022
Published: 07 March 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s11227-022-04359-w
