Vectorizing divergent control flow with active-lane consolidation on long-vector architectures

The Journal of Supercomputing

Abstract

Control-flow divergence limits the applicability of loop vectorization, an important code transformation that accelerates data-parallel loops. Divergence is commonly handled by combining an IF-conversion transformation with vector predication. However, the resulting vector instructions execute inefficiently because many of their lanes are inactive. Branch-on-superword-condition-code (BOSCC) instructions can skip over some vector instructions, but their effectiveness decreases as vector length increases. This paper presents a novel vector permutation, active-lane consolidation (ALC), that enables efficient execution of control-divergent loops by consolidating the active lanes of two vectors. The paper demonstrates the use of ALC with two loop transformations and applies them to kernels extracted from the SPEC CPU 2017 benchmark suite, achieving up to a 30.9% reduction in dynamic instruction count compared to optimization using only BOSCCs. Motivated by ALC, the paper also proposes design changes to the Arm Scalable Vector Extension (SVE) to improve the vectorization of control-divergent loops.
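The idea of consolidating the active lanes of two vectors can be illustrated with a small scalar simulation. This is only a sketch under stated assumptions, not the permutation as defined in the paper: the function name `consolidate`, the 4-lane vector width, the zero padding of inactive lanes, and the front-packing order are all hypothetical choices made for illustration.

```python
def consolidate(idx_a, mask_a, idx_b, mask_b):
    """Pack the active lanes of two equal-length vectors to the front.

    Hypothetical model: returns (first, first_mask, second, second_mask).
    `first` is filled before any active lane spills into `second`, and
    inactive lanes are padded with 0.
    """
    n = len(idx_a)
    # Gather active lanes in order: those of the first vector, then the second.
    active = [v for v, m in zip(idx_a, mask_a) if m]
    active += [v for v, m in zip(idx_b, mask_b) if m]

    def pack(values):
        # Active values go to the low lanes; the mask marks them as active.
        mask = [True] * len(values) + [False] * (n - len(values))
        return values + [0] * (n - len(values)), mask

    first, first_mask = pack(active[:n])
    second, second_mask = pack(active[n:])
    return first, first_mask, second, second_mask


# Loop indices 0..7 split across two 4-lane vectors; each mask marks the
# iterations whose branch condition holds.
a, a_mask = [0, 1, 2, 3], [True, False, True, False]
b, b_mask = [4, 5, 6, 7], [False, True, True, False]
first, first_mask, second, second_mask = consolidate(a, a_mask, b, b_mask)
```

In this example, predication alone would execute the conditional block twice with half of the lanes active each time. After consolidation, `first` holds four active lanes and `second` holds none, so in this model the block would run once on a full vector and the all-inactive vector could be skipped.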

Notes

  1. For details of the experiment refer to Sect. 6.1.

  2. In this paper, vector registers are referred to simply as vectors.

Acknowledgements

This research was funded by the University of Alberta Huawei Joint Innovation Collaboration (UAHJIC) and by the Natural Sciences and Engineering Research Council (NSERC) of Canada. We thank Giancarlo Pernudi Segura for his great assistance in creating some of the assembly-level code for the case studies.

Corresponding author

Correspondence to Wyatt Praharenka.

Cite this article

Praharenka, W., Pankratz, D., De Carvalho, J.P.L. et al. Vectorizing divergent control flow with active-lane consolidation on long-vector architectures. J Supercomput 78, 12553–12588 (2022). https://doi.org/10.1007/s11227-022-04359-w
