
PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11998)


Abstract

Modern optimizing compilers rely on auto-vectorization algorithms for generating high-performance code. Both loop and straight-line code vectorization algorithms generate SIMD vector instructions out of scalar code, with no intervention from the programmer.

In this work, we show that existing auto-vectorization algorithms operate on restricted code regions and therefore miss vectorization opportunities: they either generate narrower vectors than the target architecture supports, or fail entirely and leave some of the code in scalar form. We demonstrate the need for a specialized post-processing re-vectorization pass, called PostSLP, which can span multiple regions and generate more effective vector code. PostSLP is designed to convert already-vectorized or partially vectorized code into wider forms that perform better on the target architecture. We implemented PostSLP in LLVM, and our evaluation shows significant performance improvements on SPEC CPU2006.
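To make the widening idea concrete, here is a minimal sketch (not taken from the paper; the `vec_add` helper and the data are hypothetical) of the transformation PostSLP targets: two independent narrow vector operations over adjacent lanes are merged into one wide operation, assuming the target machine supports the wider width.

```python
# Hypothetical model of PostSLP-style widening: two 2-wide SIMD adds over
# adjacent data are merged into a single 4-wide add.

def vec_add(a, b):
    """Model a SIMD add of two equal-width 'vectors' (Python lists)."""
    return [x + y for x, y in zip(a, b)]

a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]

# Before: partially vectorized code, two narrow (2-wide) operations.
lo = vec_add(a[0:2], b[0:2])   # models a <2 x float> add
hi = vec_add(a[2:4], b[2:4])   # models a <2 x float> add
narrow_result = lo + hi

# After: one wide (4-wide) operation over the concatenated lanes.
wide_result = vec_add(a, b)    # models a <4 x float> add

assert narrow_result == wide_result
```

The two versions compute identical lane values; the payoff of the real transformation is fewer, wider machine instructions, not a different result.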

V. Porpodas—Currently at Google.

Notes

  1. The shuffle instructions in these examples are similar to LLVM's shufflevector instruction.

  2. In LLVM, we use a shufflevector instruction when the output is a vector, or an extractelement instruction when the output is a scalar.
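Since the notes lean on LLVM's shuffle semantics, a small model may help: `shufflevector` selects lanes by index from the concatenation of its two input vectors, and `extractelement` reads a single scalar lane. The Python functions below are illustrative stand-ins, not LLVM's actual API.

```python
def shufflevector(v1, v2, mask):
    """Model LLVM's shufflevector: each index in `mask` selects a lane from
    the concatenation of v1 and v2 (indices 0..len(v1)-1 come from v1,
    the remaining indices from v2)."""
    concat = list(v1) + list(v2)
    return [concat[i] for i in mask]

def extractelement(v, idx):
    """Model LLVM's extractelement: read one scalar lane of a vector."""
    return v[idx]

# Concatenate two 2-wide vectors into one 4-wide vector:
wide = shufflevector([1, 2], [3, 4], [0, 1, 2, 3])   # -> [1, 2, 3, 4]

# Extract a single lane when the consumer is scalar code:
lane = extractelement(wide, 2)                        # -> 3
```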


Author information

Corresponding author: Vasileios Porpodas.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Porpodas, V., Ratnalikar, P. (2021). PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. In: Pande, S., Sarkar, V. (eds.) Languages and Compilers for Parallel Computing. LCPC 2019. Lecture Notes in Computer Science, vol. 11998. Springer, Cham. https://doi.org/10.1007/978-3-030-72789-5_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-72789-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72788-8

  • Online ISBN: 978-3-030-72789-5

  • eBook Packages: Computer Science, Computer Science (R0)
