Abstract
Modern optimizing compilers rely on auto-vectorization to generate high-performance code. Both loop and straight-line vectorization algorithms turn scalar code into SIMD vector instructions with no intervention from the programmer.
In this work, we show that existing auto-vectorization algorithms operate on restricted code regions and therefore miss vectorization opportunities: they either generate narrower vectors than the target architecture supports, or fail entirely and leave parts of the code in scalar form. We show the need for a specialized post-processing re-vectorization pass, called PostSLP, that can span multiple regions and generate more effective vector code. PostSLP is designed to convert already-vectorized or partially vectorized code into wider forms that perform better on the target architecture. We implemented PostSLP in LLVM, and our evaluation shows significant performance improvements on SPEC CPU2006.
V. Porpodas—Currently at Google.
Notes
- 1. The shuffle instructions in these examples are similar to LLVM's shufflevector instruction.
- 2. In LLVM, we use a shufflevector instruction when the output is a vector instruction, or extractelement when the output is scalar.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Porpodas, V., Ratnalikar, P. (2021). PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. In: Pande, S., Sarkar, V. (eds.) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science, vol. 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72788-8
Online ISBN: 978-3-030-72789-5
eBook Packages: Computer Science, Computer Science (R0)