Abstract
GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when optimizing GPU programs is labor-intensive and error-prone.
This paper introduces Alpinist, an annotation-aware GPU program optimizer. It applies frequently-used GPU optimizations, but besides transforming code, it also transforms the annotations. We evaluate Alpinist, in combination with the VerCors program verifier, to automatically optimize a collection of verified programs and reverify them.
This work is supported by NWO grant 639.023.710 for the Mercedes project and by NWO TTW grant 17249 for the ChEOPS project.
Chapter PDF
Similar content being viewed by others
References
Allen, R., Kennedy, K.: Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems (TOPLAS) 9(4), 491–542 (1987)
Ashari, A., Tatikonda, S., Boehm, M., Reinwald, B., Campbell, K., Keenleyside, J., Sadayappan, P.: On optimizing machine learning workloads via kernel fusion. ACM SIGPLAN Notices 50(8), 173–182 (2015)
Ashouri, A., Killian, W., Cavazos, J., Palermo, G., Silvano, C.: A Survey on Compiler Autotuning using Machine Learning. ACM Computing Surveys 51(5), 96:1–96:42 (2018)
Ayers, G., Litz, H., Kozyrakis, C., Ranganathan, P.: Classifying memory access patterns for prefetching. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. pp. 513–526 (2020)
Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. Tech. rep., Citeseer (2008)
Berdine, J., Calcagno, C., O’Hearn, P.: Smallfoot: Modular Automatic Assertion Checking with Separation Logic. In: de Boer, F., Bonsangue, M., Graf, S., de Roever, W. (eds.) FMCO. LNCS, vol. 4111, pp. 115–137. Springer (2005)
Bertolli, C., Betts, A., Mudalige, G., Giles, M., Kelly, P.: Design and Performance of the OP2 Library for Unstructured Mesh Applications. In: Proceedings of the 1st Workshop on Grids, Clouds and P2P Programming (CGWS). Lecture Notes in Computer Science, vol. 7155, pp. 191–200. Springer (2011). https://doi.org/10.1007/978-3-642-29737-3_22
Betts, A., Chong, N., Donaldson, A., Qadeer, S., Thomson, P.: GPUVerify: a verifier for GPU kernels. In: OOPSLA. pp. 113–132. ACM (2012)
Blom, S., Darabi, S., Huisman, M., Oortwijn, W.: The VerCors Tool Set: Verification of Parallel and Concurrent Software. In: iFM. LNCS, vol. 10510, pp. 102 – 110. Springer (2017)
Blom, S., Huisman, M., Mihelčić, M.: Specification and Verification of GPGPU programs. Science of Computer Programming 95, 376–388 (2014)
Bornat, R., Calcagno, C., O’Hearn, P., Parkinson, M.: Permission accounting in separation logic. In: Proceedings of the 32nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL). pp. 259–270 (2005)
Boyland, J.: Checking Interference with Fractional Permissions. In: SAS. LNCS, vol. 2694, pp. 55–72. Springer (2003)
Catanzaro, B., Keller, A., Garland, M.: A decomposition for in-place matrix transposition. ACM SIGPLAN Notices 49(8), 193–206 (2014)
Collingbourne, P., Cadar, C., Kelly, P.H.: Symbolic testing of OpenCL code. In: Haifa Verification Conference. pp. 203–218. Springer (2011)
Şakar, O., Safari, M., Huisman, M., Wijs, A.: The repository for the examples used in Alpinist, https://github.com/OmerSakar/Alpinist-Examples.git
Şakar, O., Safari, M., Huisman, M., Wijs, A.: The repository for the implementations of Alpinist, https://github.com/utwente-fmt/vercors/tree/gpgpu-optimizations/src/main/java/vct/col/rewrite/gpgpuoptimizations
DeFrancisco, R., Cho, S., Ferdman, M., Smolka, S.: Swarm Model Checking on the GPU. International Journal on Software Tools for Technology Transfer 22, 583–599 (2020). https://doi.org/10.1007/s10009-020-00576-x
Dross, C., Furia, C.A., Huisman, M., Monahan, R., Müller, P.: Verifythis 2019: a program verification competition. International Journal on Software Tools for Technology Transfer pp. 1–11 (2021)
Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing 71(10), 3934–3957 (2015)
Gjomemo, R., Namjoshi, K.S., Phung, P.H., Venkatakrishnan, V., Zuck, L.D.: From verification to optimizations. In: International Workshop on Verification, Model Checking, and Abstract Interpretation. pp. 300–317. Springer (2015)
Grauer-Gray, S., Xu, L., Searles, R., Ayalasomayajula, S., Cavazos, J.: Auto-tuning a High-Level Language Targeted to GPU Codes. In: Proc. 2012 Innovative Parallel Computing (InPar). pp. 1–10. IEEE (2012). https://doi.org/10.1109/InPar.2012.6339595
van den Haak, L., Wijs, A., M.G.J. van den Brand, Huisman, M.: Formal Methods for GPGPU Programming: Is The Demand Met? In: Proceedings of the 16th International Conference on Integrated Formal Methods (IFM 2020). Lecture Notes in Computer Science, vol. 12546, pp. 160–177. Springer (2020). https://doi.org/10.1007/978-3-030-63461-2_9
Hamers, R., Jongmans, S.S.: Safe sessions of channel actions in Clojure: a tour of the discourje project. In: International Symposium on Leveraging Applications of Formal Methods. pp. 489–508. Springer (2020)
Herrmann, F., Silberholz, J., Tiglio, M.: Black Hole Simulations with CUDA. In: GPU Computing Gems Emerald Edition, chap. 8, pp. 103–111. Morgan Kaufmann (2011)
Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. pp. 300–314 (2019)
Huisman, M., Blom, S., Darabi, S., Safari, M.: Program correctness by transformation. In: 8th International Symposium On Leveraging Applications of Formal Methods, Verification and Validation (ISoLA). LNCS, vol. 11244. Springer (2018)
Huisman, M., Joosten, S.: A solution to VerifyThis 2019 challenge 1, https://github.com/utwente-fmt/vercors/blob/97c49d6dc1097ded47a5ed53143695ace6904865/examples/verifythis/2019/challenge1.pvl
Konstantinidis, A., Kelly, P.H., Ramanujam, J., Sadayappan, P.: Parametric GPU code generation for affine loop programs. In: International Workshop on Languages and Compilers for Parallel Computing. pp. 136–151. Springer (2013)
Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.: On Optimization Methods for Deep Learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML). pp. 265–272. Omnipress (2011)
Leroy, X.: Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In: Conference record of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. pp. 42–54 (2006)
Leroy, X.: A formally verified compiler back-end. Journal of Automated Reasoning 43(4), 363–446 (2009)
Li, G., Gopalakrishnan, G.: Scalable SMT-based verification of GPU kernel functions. In: SIGSOFT FSE 2010, Santa Fe, NM, USA. pp. 187–196. ACM (2010)
Li, G., Li, P., Sawaya, G., Gopalakrishnan, G., Ghosh, I., Rajan, S.P.: GKLEE: concolic verification and test generation for GPUs. In: ACM SIGPLAN Notices. vol. 47, pp. 215–224. ACM (2012)
Lindholm, L., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28(2), 39–55 (2008). https://doi.org/10.1109/MM.2008.31
Liu, X., Tan, S., Wang, H.: Parallel Statistical Analysis of Analog Circuits by GPU-Accelerated Graph-Based Approach. In: Proceedings of the 2012 Conference and Exhibition on Design, Automation & Test in Europe (DATE). pp. 852–857. IEEE Computer Society (2012). https://doi.org/10.1109/DATE.2012.6176615
de Moura, L.M., Bjørner, N.: Z3: An efficient SMT solver. In: Ramakrishnan, C., Rehof, J. (eds.) TACAS. LNCS, vol. 4963, pp. 337–340. Springer (2008)
Müller, P., Schwerhoff, M., Summers, A.: Viper - a verification infrastructure for permission-based reasoning. In: VMCAI (2016)
Murthy, G.S., Ravishankar, M., Baskaran, M.M., Sadayappan, P.: Optimal loop unrolling for GPGPU programs. In: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). pp. 1–11. IEEE (2010)
Namjoshi, K.S., Pavlinovic, Z.: The impact of program transformations on static program analysis. In: International Static Analysis Symposium. pp. 306–325. Springer (2018)
Namjoshi, K.S., Singhania, N.: Loopy: Programmable and formally verified loop transformations. In: International Static Analysis Symposium. pp. 383–402. Springer (2016)
Namjoshi, K.S., Xue, A.: A Self-certifying Compilation Framework for WebAssembly. In: International Conference on Verification, Model Checking, and Abstract Interpretation. pp. 127–148. Springer (2021)
The OpenCL 1.2 specification (2011)
Osama, M., Wijs, A.: Parallel SAT Simplification on GPU Architectures. In: TACAS, Part I. LNCS, vol. 11427, pp. 21–40. Springer (2019)
Osama, M., Wijs, A., Biere, A.: SAT Solving with GPU Accelerated Inprocessing. In: Proceedings of the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Part I. Lecture Notes in Computer Science, vol. 12651, pp. 133–151. Springer (2021). https://doi.org/10.1007/978-3-030-72016-2_8
de Putter, S., Wijs, A.: Verifying a verifier: on the formal correctness of an LTS transformation verification technique. In: International Conference on Fundamental Approaches to Software Engineering. pp. 383–400. Springer (2016)
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices 48(6), 519–530 (2013)
Rocha, R.C., Pereira, A.D., Ramos, L., Góes, L.F.: Toast: Automatic tiling for iterative stencil computations on GPUs. Concurrency and Computation: Practice and Experience 29(8), e4053 (2017)
Safari, M., Huisman, M.: Formal verification of parallel stream compaction and summed-area table algorithms. In: International Colloquium on Theoretical Aspects of Computing. pp. 181–199. Springer (2020)
Safari, M., Huisman, M.: A generic approach to the verification of the permutation property of sequential and parallel swap-based sorting algorithms. In: International Conference on Integrated Formal Methods. pp. 257–275. Springer (2020)
Safari, M., Oortwijn, W., Huisman, M.: Automated verification of the parallel Bellman–Ford algorithm. In: Drăgoi, C., Mukherjee, S., Namjoshi, K. (eds.) Static Analysis. pp. 346–358. Springer International Publishing, Cham (2021)
Safari, M., Oortwijn, W., Joosten, S., Huisman, M.: Formal verification of parallel prefix sum. In: NASA Formal Methods Symposium. pp. 170–186. Springer (2020)
Şakar, O.: Extending support for axiomatic data types in vercors (April 2020), http://essay.utwente.nl/80892/
Shimobaba, T., Ito, T., Masuda, N., Ichihashi, Y., Takada, N.: Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL. Optics express 18(10), 9955–9960 (2010)
Sundfeld, D., Havgaard, J.H., Gorodkin, J., De Melo, A.C.: CUDA-Sankoff: using GPU to accelerate the pairwise structural RNA alignment. In: 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). pp. 295–302. IEEE (2017)
The CUDA team: Documentation of the CUDA unroll pragma (Accessed Oct 6, 2021), https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#pragma-unroll
The Halide team: Documentation of the Halide unroll function (Accessed Oct 6, 2021), https://halide-lang.org/docs/class_halide_1_1_func.html#a05935caceb6efb8badd85f306dd33034
The verification of tictactoe program, https://github.com/utwente-fmt/vercors/blob/0a2fdc24419466c2d3b7a853a2908c37e7a8daa7/examples/session-generate/MatrixGrid.pvl
Unkule, S., Shaltz, C., Qasem, A.: Automatic restructuring of GPU kernels for exploiting inter-thread data locality. In: International Conference on Compiler Construction. pp. 21–40. Springer (2012)
Van Werkhoven, B., Maassen, J., Bal, H.E., Seinstra, F.J.: Optimizing convolution operations on GPUs using adaptive tiling. Future Generation Computer Systems 30, 14–26 (2014)
Viper project website: (2016), http://www.pm.inf.ethz.ch/research/viper
Wahib, M., Maruyama, N.: Scalable kernel fusion for memory-bound GPU applications. In: SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 191–202. IEEE (2014)
Wang, G., Lin, Y., Yi, W.: Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In: 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int’l Conference on Cyber, Physical and Social Computing. pp. 344–350. IEEE (2010)
Werkhoven, B.v.: Kernel Tuner: A search-optimizing GPU code auto-tuner. Future Generation Computer Systems 90, 347–358 (2019)
Wienke, S., Springer, P., Terboven, C., Mey, D.: OpenACC - First Experiences with Real-World Applications. In: Proceedings of the 18th European Conference on Parallel and Distributed Computing (EuroPar). Lecture Notes in Computer Science, vol. 7484, pp. 859–870. Springer (2012). https://doi.org/10.1007/978-3-642-32820-6_85
Wijs, A.: BFS-Based Model Checking of Linear-Time Properties With An Application on GPUs. In: CAV, Part II. LNCS, vol. 9780, pp. 472–493. Springer (2016)
Wijs, A., Engelen, L.: REFINER: Towards Formal Verification of Model Transformations. In: NFM. LNCS, vol. 8430, pp. 258–263. Springer (2014)
Wijs, A., Neele, T., Bošnački, D.: GPUexplore 2.0: Unleashing GPU Explicit-State Model Checking. In: Proceedings of the 21st International Symposium on Formal Methods. Lecture Notes in Computer Science, vol. 9995, pp. 694–701. Springer (2016). https://doi.org/10.1007/978-3-319-48989-6_42
Wu, H., Diamos, G., Wang, J., Cadambi, S., Yalamanchili, S., Chakradhar, S.: Optimizing data warehousing applications for GPUs using kernel fusion/fission. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. pp. 2433–2442. IEEE (2012)
Xu, C., Kirk, S.R., Jenkins, S.: Tiling for performance tuning on different models of GPUs. In: 2009 Second International Symposium on Information Science and Engineering. pp. 500–504. IEEE (2009)
Yang, Y., Xiang, P., Kong, J., Zhou, H.: A GPGPU compiler for memory optimization and parallelism management. ACM Sigplan Notices 45(6), 86–97 (2010)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this paper
Cite this paper
Şakar, Ö., Safari, M., Huisman, M., Wijs, A. (2022). Alpinist: An Annotation-Aware GPU Program Optimizer. In: Fisman, D., Rosu, G. (eds) Tools and Algorithms for the Construction and Analysis of Systems. TACAS 2022. Lecture Notes in Computer Science, vol 13244. Springer, Cham. https://doi.org/10.1007/978-3-030-99527-0_18
Download citation
DOI: https://doi.org/10.1007/978-3-030-99527-0_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99526-3
Online ISBN: 978-3-030-99527-0
eBook Packages: Computer ScienceComputer Science (R0)