Alpinist: an Annotation-Aware GPU Program Optimizer

Abstract. GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when optimizing GPU programs is labor-intensive and error-prone. This paper introduces Alpinist, an annotation-aware GPU program optimizer. It applies frequently-used GPU optimizations, but besides transforming code, it also transforms the annotations. We evaluate Alpinist, in combination with the VerCors program verifier, to automatically optimize a collection of verified programs and reverify them.


1 Introduction
Over the course of roughly a decade, graphics processing units (GPUs) have been pushing the computational limits in fields as diverse as computational biology [64], statistics [35], physics [7], astronomy [24], deep learning [29], and formal methods [17,43,44,65,67]. Dedicated programming languages such as CUDA [34] and OpenCL [42] can be used to write GPU source code. To get the best performance out of GPUs, developers should apply incremental optimizations, tailored to the GPU architecture. Unfortunately, this is to a large extent a manual activity. The fact that, for different GPU devices, the same code tends to require a different sequence of transformations [21] makes this procedure even more time-consuming and error-prone. Recently, automating this has received some attention, for instance by applying machine learning [3].

Reasoning about the correctness of GPU software is hard, but necessary. Multiple verification techniques and tools, aimed at detecting data races, have been developed to aid in this task [8,10,14,32,33]; for a recent overview, see [22]. Some of these techniques apply deductive program verification, which requires a program to be manually augmented with pre- and postcondition annotations. However, annotating a program is time-consuming, and the more complex a program is, the more challenging it becomes to annotate it. In particular, as a program is being optimized repeatedly, its annotations tend to change frequently.

This paper presents Alpinist, a tool that can apply annotation-aware transformations [26] on annotated GPU programs. It can be used with the deductive program verifier VerCors [9], which can verify the functional correctness of GPU programs [10] and allows the verification of many typical GPU computations, see e.g., [48,50,51]. The purpose of Alpinist is twofold (see Fig. 1). First, it automates the optimization of GPU code, to the extent that the developer only needs to indicate which optimization should be applied where, and the tool performs the transformation. Interestingly, Alpinist exploits the presence of annotations to determine whether an optimization is actually applicable, and in doing so, it can sometimes apply an optimization where a compiler cannot. Second, as it applies a code transformation, it also transforms the related annotations. This means that once the developer has annotated the unoptimized, simpler code, any further optimized version of that code is automatically annotated with updated pre- and postconditions, making it reverifiable. This avoids having to re-annotate the program every time it is optimized for a specific GPU device.
Alpinist supports GPU code optimizations that are used frequently in practice, namely loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching. In the current paper, we discuss how Alpinist has been implemented, how it can be applied on annotated GPU code, and how some of the more complex optimizations work. In addition, we evaluate the effect of applying several of these optimizations, both in terms of annotation size and time needed to verify a program, to a collection of examples including the verified case studies in [48,49,51].
Outline. Section 2 demonstrates how Alpinist optimizes a verified GPU program while preserving its provability. Section 3 discusses the architecture of Alpinist. Section 4 discusses the most complex optimizations supported by Alpinist in detail, namely loop unrolling, tiling and kernel fusion, and briefly discusses the remaining three. Section 5 presents the results of experiments in which the tool has been applied on a collection of programs. Section 6 discusses related work, and Section 7 concludes the paper and discusses future work.

2 Annotation-Aware Optimization using Alpinist
This section illustrates how Alpinist can optimize a verified GPU program while preserving its provability. Fig. 2 shows a GPU program with annotations [10] that is verified by VerCors. The example is written in a simplified version of VerCors' own language PVL. The program initializes an array a, and subsequently updates the values in a, N times. The general workflow of a GPU program is that the host (i.e., the CPU) invokes a kernel, i.e., a GPU function, which is executed by a specified number of GPU threads. These threads are organized in one or more thread blocks. In this program, there are two kernels, both executed by one thread block of a.length threads (l.8, l.12). Each thread has a unique identifier, in the example called tid. In the first kernel (l.8-l.11), each thread initializes a[tid] to 0. In the second kernel (l.12-l.22), each thread updates a[tid+1] (modulo a.length) N times, by adding tid to it. In the main Host function, Kernel1 is called, followed by Kernel2.
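Since Fig. 2 is not reproduced here, the following PVL-style sketch (annotations omitted; kernel bodies reconstructed from the description above, so the exact syntax is simplified) captures the structure of the program:

```
// Sketch of Fig. 2's structure: each kernel is executed by one thread
// block of a.length threads; tid is the executing thread's identifier.
void Kernel1(int[] a) {
  a[tid] = 0;                                   // l.8-l.11
}

void Kernel2(int[] a, int N) {
  for (int k = 0; k < N; k++) {                 // l.19-l.22
    a[(tid+1) % a.length] = a[(tid+1) % a.length] + tid;
  }
}

void Host(int[] a, int N) {
  Kernel1(a);    // implicit global synchronisation point
  Kernel2(a, N); // between the two kernel launches
}
```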
The kernels, the for-loop and the host function are annotated for verification (in blue in Fig. 2), using permission-based separation logic [6,11,12]. Permissions capture which memory locations may be accessed by which threads; they are fractional values in the interval (0, 1] (cf. Boyland [12]): any fraction in the interval (0, 1) indicates a read permission, while 1 indicates a write permission. A write permission can be split into multiple read permissions, and read permissions can be added together and transformed back into a write permission if they add up to 1. The soundness of the logic ensures that, for each memory location, the total amount of permissions among all threads does not exceed 1.
To specify permissions, predicates of the form Perm(L, π) are used, where L is a heap location and π a fractional value in the interval (0, 1] (e.g., 1\3). Pre- and postconditions, denoted by the keywords req and ens, should hold at the beginning and the end of an annotated function, respectively. The keyword context abbreviates both req and ens (l.9, l.13), and context everywhere is used to specify a property that must hold throughout the function (l.1). Note that \forall* expresses a universal separating conjunction over permission predicates (l.2-l.4), while \forall is the standard universal quantifier over logical predicates (l.5). For logical conjunction, && is used, and ** is the separating conjunction of separation logic.
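As a small illustration of this syntax, consider the following hypothetical contract (not the one of Fig. 2):

```
req N > 0;                                              // logical precondition
context (\forall* int i; 0 <= i && i < a.length;        // permissions for all of a,
         Perm(a[i], 1));                                // joined by \forall*
ens (\forall int i; 0 <= i && i < a.length; a[i] == 0); // logical postcondition
ens Perm(b[tid], 1\2) ** Perm(b[tid+1], 1\2);           // ** separates permissions
```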
In the example, write permissions are required for all locations in a (l.2). The pre- and postconditions of the first kernel specify that each thread needs write permission for a[tid] (l.9). The postcondition states that a[tid] is set to 0 (l.10). In the second kernel, all threads have write permission for a[tid+1], except thread a.length-1, which has write permission for a[0] (l.13). Moreover, it is required that a[tid+1] (modulo a.length) is 0 (l.14). For the for-loop (l.19-l.22), loop invariants are specified: k is in the range [0, N] (l.16), each thread has write permission for a[tid+1] (modulo a.length) (l.17), and this location always has the value k*tid (l.18). The postconditions of the second kernel and the host function are similar to this latter invariant.

Fig. 3 shows an optimized version of the program, with updated annotations to make it verifiable. Alpinist has applied three optimizations:

1. Fusing the two kernels: in GPU programs, the only global synchronisation points (used, for instance, to avoid data races) exist implicitly between kernel launches. However, if such a global synchronisation point is not really needed between two specific kernels, then fusing them gives several benefits, in particular the ability to store intermediate results in (fast) thread-local register memory as opposed to (slow) GPU global memory; it also has a positive effect on power consumption [62]. In the example, the kernels are combined into Fused_Kernel, and a thread block-local barrier is introduced (l.18) to avoid data races within the single thread block executing the code.
2. Using register memory: register variables can be used to reduce the number of global memory accesses. Here, the use of a_reg_0 and a_reg_1 has been enabled by the kernel fusion.
3. Unrolling the for-loop: the for-loop has been unrolled once here (l.20-l.25).
Since GPU threads are very lightweight compared to CPU threads, any checking of conditions that can be avoided benefits performance. When unrolling a loop, fewer checks of the loop condition are needed. Note that here, Alpinist benefits from the knowledge that N > 0 (l.1), so it knows that the for-loop can be unrolled at least once.
To preserve the provability of the optimized program, Alpinist changed the annotations, in particular the pre- and postcondition of the fused kernel and the loop invariants (highlighted in Fig. 3). Moreover, Alpinist introduced an annotated barrier (l.14-l.18). Since threads synchronize at a barrier, it is possible to redistribute the permissions there. In the rest of the paper, we discuss how Alpinist performs these annotation-aware transformations.
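For instance, a barrier that redistributes permissions can be written along the following lines (a PVL-style sketch; the concrete barrier syntax in VerCors differs slightly):

```
// Before the barrier, each thread holds write permission for a[tid];
// afterwards, write permission for a[(tid+1) % a.length], as needed
// by the remainder of the fused kernel body.
barrier {
  req Perm(a[tid], 1);
  ens Perm(a[(tid+1) % a.length], 1);
}
```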

3 The Design of Alpinist
This section gives a high-level overview of the design of Alpinist. The optimizations supported by Alpinist are discussed in Section 4. To understand the design of Alpinist, we first explain the architecture of the VerCors verifier.

3.1 VerCors' Architecture
VerCors is a deductive program verifier, which is designed to work for different input languages (e.g., Java and OpenCL). It takes as input an annotated program, which is then transformed in several steps into an annotated Silver program. Silver is an intermediate verification language, used as input for Viper [37,60]. Viper then generates proof obligations, which can be discharged by an automated theorem prover, such as Z3 [36].
The internal transformations in VerCors are defined over our internal AST representation (written in the Common Object Language, or COL [52]), which captures the features of all input languages. Some of the transformations are generic (e.g., splitting composite variable declarations) and others are specific to verification (e.g., transforming contracts). The transformations implemented as part of Alpinist are also applied on the COL AST, but they are developed with a different goal in mind; in particular, several of the transformations are specific to the supported optimizations.
Using VerCors and its architecture to implement Alpinist gives us several benefits. First, existing helper functions can be reused, which simplifies tasks such as gathering information about specific AST nodes. Second, some generic transformations of VerCors can be reused, such as splitting composite variable declarations or simplifying expressions. This simplifies the implementation of the optimizations. Third, using the architecture of VerCors allows us to prove generated assertions relatively easily, by invoking VerCors internally.

3.2 Alpinist's Architecture
Alpinist takes a verified file as its input, annotated with special optimization annotations that indicate where specific optimizations should be applied. Alpinist is written in Java and Scala and runs on Windows, Linux and macOS. Fig. 4 gives a high-level overview of the internal design of Alpinist. The input program goes through four phases: the parsing phase, the applicability checking phase, the transformation phase and the output phase.
The parsing phase transforms the input file into a COL AST, after which the applicability checking phase checks whether the optimization can be applied. Some optimizations, such as tiling (see Section 4.2), are always applicable, hence their applicability check always passes. For other optimizations, prerequisites must be established. Sometimes, a syntactic analysis of the AST suffices, e.g., for kernel fusion (see Section 4.3), where it must be determined whether there is any data dependency between the two selected kernels. When analysis of the AST is not enough, VerCors can be used to perform more complex reasoning. An example of this is loop unrolling (see Section 4.1): its prerequisite is that, for the loop to be unrollable k times, it must be guaranteed that the loop executes at least k times. This prerequisite is encoded as an assertion to be proven by VerCors.
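For example, to unroll a loop for (int i = 0; i < N; i++) twice, an assertion of roughly the following shape can be generated and discharged by VerCors (a sketch; the actual encoding in Alpinist may differ):

```
// Given a contract containing req N >= 2, the prerequisite that the
// loop executes at least twice becomes a provable assertion:
assert N >= 2;   // proven by VerCors before the transformation is applied
```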
The applicability checking phase is one of the strengths of Alpinist. It exploits the fact that the input program is annotated to determine whether an optimization is applicable, and it relies on the fact that VerCors can perform complex reasoning. Moreover, this approach makes it possible to distinguish failures due to unsatisfied prerequisites from failures due to mistakes in the transformation procedure. If the applicability check passes (i.e., the optimization is applicable), the transformation phase is next; otherwise, a message is generated that the prerequisites could not be proven.
The transformation phase applies the optimizations to the input AST. The output phase either prints the optimized program in the same language as the input program, or prints a message signifying either a failure in optimizing or a verification failure in the applicability checking phase.

4 GPU Optimizations
Alpinist supports six frequently-used GPU optimizations, namely loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching. This section discusses loop unrolling, tiling and kernel fusion in detail. The other three optimizations follow the same approach in spirit and are only discussed briefly; their full implementations can be found in the Alpinist repository [16]. Each optimization is introduced in the context of GPU programs, after which we discuss how to apply it. Interesting insights are discussed where relevant.

4.1 Loop Unrolling
Loop unrolling is a frequently-used optimization technique that is applicable to both GPU and CPU programs. It unrolls some iterations of a loop, which increases the code size, but can have a positive impact on program performance; e.g., see [21,38,46,59,63] for its impact, specifically on GPU programs. Fig. 5 shows an example of unrolling an (annotated) loop twice: the body of the loop is duplicated twice before the loop. This has the following effect on the annotations: the loop invariant bounding the loop variable (l.5) changes in the optimized program (l.14). Note that the other loop invariants (i.e., Inv(i)) remain the same. Moreover, after each unrolling part, we add all invariants as assertions (l.8-l.10) except after the last unroll. This captures that the code produced by unrolling the loop should still satisfy the original loop invariants.
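Since Fig. 5 is not reproduced here, the following sketch illustrates the transformation for a hypothetical invariant Inv(i) and a loop unrolled twice (using the keyword loop_invariant for loop invariants):

```
// Before:
loop_invariant a <= i && i <= b;   // bound on the loop variable
loop_invariant Inv(i);
for (int i = a; i < b; i += c) { body(i); }

// After: two copies of the body precede the loop; the original
// invariants are asserted after each copy except the last, and the
// bound on the loop variable is updated.
body(a);
assert a <= a + c && a + c <= b;
assert Inv(a + c);
body(a + c);
loop_invariant a + 2*c <= i && i <= b;   // updated bound
loop_invariant Inv(i);
for (int i = a + 2*c; i < b; i += c) { body(i); }
```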
Our approach to loop unrolling is more general than optimization techniques applied during compilation. For instance, the unroll pragma in CUDA [55] and the unroll function in Halide [56] unroll loops by calculating the number of iterations to check whether unrolling is possible, i.e., the iteration count must be computable at compile time. This difference is illustrated in Fig. 5, where N (i.e., the number of iterations) is unknown at compile time. Their approach cannot automatically handle this case, while our approach can automatically unroll the loop, since annotations (l.1, l.6) specify the lower bound of N (provided by the programmer, who knows that this is a valid lower bound). VerCors verifies that the unrolling is valid.

Fig. 6 shows a loop template in a verified GPU program. We would like to automatically unroll the loop k times and preserve the provability of the program. To accomplish this, we follow a procedure consisting of three parts: the main, checking and updating parts. In the main part, an annotated (verified) GPU program and a positive integer k are given as input. Next, we go to the checking part, to see whether it is possible to unroll the loop k times; this part corresponds to the applicability checking phase. We statically calculate the number of loop iterations by counting how many times the condition cond(i) holds, starting from either a (the lower bound of i) or b (the upper bound of i), depending on the operation in upd(i). If k is greater than the total number of loop iterations at the end of the checking part, then we report an error. Otherwise, we go to the updating part, in which we update either a or b according to the operation in upd(i). If the operation is addition or multiplication, then the loop variable i (in the unoptimized program) goes from a to b; hence, after unrolling, a should be updated according to the constant c from the update expression and k. If the operation is subtraction or division, i goes from b to a, so after unrolling, b should be updated. After the updating part, we return to the main part to unroll the loop k times.
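Schematically, the updating part behaves as follows, with c the constant from upd(i) and k the number of unrolled iterations:

```
// upd(i) = i + c :  i runs from a to b;  new lower bound a' = a + k*c
// upd(i) = i * c :  i runs from a to b;  new lower bound a' = a * c^k
// upd(i) = i - c :  i runs from b to a;  new upper bound b' = b - k*c
// upd(i) = i / c :  i runs from b to a;  new upper bound b' = b / c^k
```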

4.2 Tiling
Tiling is another well-known optimization technique for GPU programs. It increases the workload of the threads to fully utilize GPU resources by assigning more data to each thread. Concretely, we assume there are T threads and a one-dimensional array of size T in the unoptimized GPU program, where each thread is responsible for one location in that array (Fig. 8). To apply the optimization, we first divide the array into T/N chunks, each of size N (1 ≤ N ≤ T; since N need not evenly divide T, the last chunk might have fewer cells). There are two different ways to create and assign threads to array cells (as in Fig. 7):

- Inter-tiling: we define N threads and assign them to one specific location in each chunk. That means each thread serially iterates over all chunks, being responsible for a specific location in each chunk.
- Intra-tiling: we define T/N threads and assign one thread to one chunk (i.e., a 1-to-1 mapping), which serially iterates over all cells in that chunk.

Both forms of tiling can have a positive impact on GPU program performance; e.g., see [25,28,47,69] for the impact of this optimization. Fig. 9 shows the optimized version of Fig. 8, obtained by applying inter-tiling. Regarding the program itself, two major changes happen: 1) the total number of threads is reduced (l.2), and 2) the body is encapsulated inside a loop (l.16-l.18). As mentioned, in inter-tiling we define N threads instead of T. The number of chunks is given by the function ceiling(T, N). In the newly added loop, each thread iterates over all chunks (in the range 0 to ceiling(T, N)-1), being responsible for one specific location per chunk. This happens via the loop variable j and the loop condition tid+j×N < T: each thread tid accesses its own location at index tid in each chunk. To preserve verifiability, we add invariants to the loop (l.9-l.17). Specifically, we specify:

- the boundaries of the loop variable j, which iterates over all chunks;
- a permission-related invariant for each thread in each chunk (l.10); this comes from the permission precondition of the kernel and is quantified over all chunks;
- an invariant indicating functional properties of the locations that have not yet been updated by threads in the body of the loop (l.12); this comes from the functional precondition of the kernel and is quantified over all chunks;
- an invariant specifying how each thread updates the array in each chunk (l.14); this comes from the functional property in the postcondition of the kernel and is quantified over all chunks.
Moreover, we modify the specification of the kernel (l.3-l.6). Note that the condition tid+j×N < T occurs in all universally quantified invariants, because the last chunk might have fewer cells than N. The pre- and postconditions of the kernel are quantified over the chunks in the same way as the invariants.

Intra-tiling is in essence similar to inter-tiling, with two major differences: 1) the total number of threads is ceiling(T, N), and 2) each thread in the loop iterates over the cells within its own chunk. Therefore, the conditions in the loop and the quantified invariants differ. Alpinist also supports this form of tiling.
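Since Fig. 9 is not reproduced here, the following sketch shows the shape of an inter-tiled kernel body, where f(tid, j) stands for the original per-cell computation and the invariant for not-yet-updated locations is omitted:

```
loop_invariant 0 <= j && j <= ceiling(T, N);              // bounds on j
loop_invariant (\forall* int j2;                          // permissions per chunk
    0 <= j2 && j2 < ceiling(T, N) && tid + j2*N < T;
    Perm(a[tid + j2*N], 1));
loop_invariant (\forall int j2;                           // already-updated cells
    0 <= j2 && j2 < j && tid + j2*N < T;
    a[tid + j2*N] == f(tid, j2));
for (int j = 0; tid + j*N < T; j++) {
  a[tid + j*N] = f(tid, j);   // original kernel body, once per chunk
}
```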
Above, each thread is assigned to one cell. This can easily be generalized to have each thread assigned to one or more consecutive cells (i.e., a task). A similar procedure can be applied as long as the tasks do not overlap, i.e., each cell is assigned to at most one thread.

4.3 Kernel Fusion
Kernel fusion is a GPU optimization in which we merge two or more consecutive kernels into one. It increases the potential to use thread-local registers to store intermediate results (see Section 2) and can lead to lower power consumption. See [2,19,61,62,68] for the impact of kernel fusion on GPU programs. We provide a generalized procedure to fuse an arbitrary number of consecutive kernels while taking the data dependencies between them into account. The idea is to fuse them by repeatedly fusing the first two kernels (i.e., kernel reduction). In each iteration, if there is no data dependency between the two kernels, we safely fuse them. Otherwise, if there is only one thread block, we fuse the two kernels by inserting a barrier between their bodies; else, fusion fails.
A benefit of this approach is that it only considers two kernels at a time. In this way, it can be determined whether a barrier is necessary between two specific kernels, and we do not miss any possible fusion optimization. Another benefit of this approach is that when a data dependency between two kernels P and P+1 (1 < P < #kernels−1) is detected, the output of the approach is the fusion of the first P kernels, followed by the remaining unfused kernels after P. This allows the user not only to find out that there is a data dependency between P and P+1, but also to obtain fused kernels where possible.
There are multiple challenges in this transformation: (1) how to detect data dependencies between two kernels, (2) how to collect the pre- and postconditions for the fused kernel, and (3) how to deal with permissions so that, in the fused kernel, the permission for a location does not exceed 1. The main difficulty in addressing these challenges is that we have to consider many different possible scenarios. Fortunately, we can use the information from the contracts of the two kernels: the permission patterns in a contract indicate, for each thread, which locations it reads from and writes to. We provide procedures to separately collect the pre- and postconditions related to permissions and those related to functional correctness. Due to space limitations, we only discuss the essential steps of Alg. 1, which collects the permission-related precondition for array accesses of the fused kernel. Collecting the rest of the contract uses a similar procedure.
Alg. 1 requires kernels k1 and k2 to not lose any permissions, only possibly redistribute them (using a barrier). Furthermore, for ease of presentation, we assume that in both k1 and k2, each thread accesses at most one cell of array a, and that the expressions used to compute array indices only combine constants and thread ID variables, using standard arithmetic operators.
We compare the postcondition of k1 and the precondition of k2 (l.2) to understand how to add the permissions from the preconditions of k1 and k2 to the precondition of the fused kernel. Note that prePerm and postPerm correspond to a permission-related pre- and postcondition, respectively. We use the postcondition of k1 for this comparison, since the permissions at the end of k1 need to be sufficient to satisfy the precondition of k2. If the index expressions e1 and e2 used to access an array a are syntactically the same, then they refer to the same array cell. In that case, we first add to the precondition of the fused kernel the original permission from the precondition of k1 that corresponds to the permission for a[e1] in the postcondition of k1 (remember that the latter permission may have been obtained in k1 after permission redistribution). Second, if p1 is not sufficient for the precondition of k2 (l.5), we add additional permission to the precondition of the fused kernel to satisfy the precondition of k2 (l.6).

[Alg. 1, the kernel fusion procedure for collecting precondition permissions, is not reproduced here; its line 17 adds prePerm(a[e2], 1-p1) as a precondition to the fused kernel kf.]
The remaining cases in the algorithm correspond to the edge cases that must be considered when e1 and e2 are not syntactically the same. In particular, a data dependency occurs when the accumulated permission (over both kernels) for one location is greater than 1 and at least one write permission is involved. Therefore, we distinguish multiple cases: 1) p1 + p2 does not exceed 1 (l.8); 2) p1 + p2 exceeds 1, but no write permission is involved (l.10); or 3) and 4) at least one write is involved (l.13 and l.15). In the latter two cases, a barrier must be introduced to take care of redistributing permissions from the access in k1 to the access in k2, and possibly additional permission for the latter must be added to the precondition of the fused kernel (l.17). After constructing the contract of the fused kernel, we check for data dependencies.

Fig. 10 shows an example of fusing two kernels. We only present the permission precondition expressions, which are collected with Alg. 1. There are two shared arrays a and b. To collect the permission preconditions of the fused kernel, we follow steps {l.2→l.3→l.4} for array a and steps {l.2→l.3→l.4→l.5→l.6} for array b. As there is no data dependency, we can safely fuse the two kernels.
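As Fig. 10 is not reproduced here, the following hypothetical contracts illustrate the two paths through the algorithm:

```
// k1:  req Perm(a[tid], 1)   ** Perm(b[tid], 1\2);
//      ens Perm(a[tid], 1)   ** Perm(b[tid], 1\2);
// k2:  req Perm(a[tid], 1\2) ** Perm(b[tid], 1);
//
// Array a: the index expressions coincide and k1's post-permission (1)
// covers k2's requirement (steps l.2 -> l.3 -> l.4), so kf obtains:
//      req Perm(a[tid], 1);
// Array b: k1's post-permission (1\2) is insufficient for k2's write
// permission (steps l.2 -> l.3 -> l.4 -> l.5 -> l.6), so the missing
// 1 - 1\2 is added as well:
//      req Perm(b[tid], 1\2) ** Perm(b[tid], 1\2);
```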
Implementing Data Dependency Detection. One of the implementation challenges of kernel fusion is checking for data dependencies in the applicability checking phase. Our approach to detecting kernel dependencies is similar to detecting loop iteration dependencies [1]. To detect data dependencies for a specific shared array, the function SV is used. Fig. 11 shows an example of the output of SV. The kernel has 1\2 permission for a[tid+1], and 1\3 permission for a[0] if tid+1 is out of bounds. SV takes an array name and the pre- and postconditions of a kernel (of the form cond(tid) ==> Perm(a[patt(tid)], p)) on l.3-l.6, and returns a mapping from the indices patt(tid) to the permissions p (Fig. 11, right). If the function SV is executed for two kernels to be fused with the same shared array a, the results SV1(a) and SV2(a) can be compared to determine whether there is a data dependency between the two kernels. This comparison is described in general terms at l.8-l.16 of Alg. 1. For each corresponding location in SV1(a) and SV2(a), we can determine, for example, whether both permissions combined do not exceed 1 (l.8) or whether the location in k1 has write permission (l.12).
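Concretely, for the contract fragment of Fig. 11 described above, SV behaves as follows (a sketch of its output):

```
// Input (permission-related conditions of the kernel):
//   tid+1 <  a.length ==> Perm(a[tid+1], 1\2)
//   tid+1 >= a.length ==> Perm(a[0], 1\3)
// Output (mapping from index patterns to permissions):
//   SV(a) = { tid+1 |-> 1\2,  0 |-> 1\3 }
// Comparing SV1(a) and SV2(a) entry-wise then realizes the case
// analysis of l.8-l.16 in Alg. 1.
```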

4.4 Other Optimizations
We briefly discuss the three remaining optimizations supported by Alpinist. Iteration merging is an optimization technique related to loop unrolling, applicable to both GPU and CPU programs. It reduces the number of loop iterations by extending the loop body with multiple copies of it, as opposed to creating copies outside the loop, as is done in loop unrolling. Iteration merging can have a positive performance impact; see [38,46,53] for the effectiveness of this optimization on GPU programs.
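A sketch of the transformation, merging two iterations and assuming for simplicity that the iteration count N is even:

```
// Before:
for (int i = 0; i < N; i++) { body(i); }

// After: the body is copied inside the loop and the step is doubled,
// so the loop condition is checked half as often (invariants are
// updated accordingly).
for (int i = 0; i < N; i += 2) {
  body(i);
  body(i + 1);
}
```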
Matrix linearization is an optimization in which we transform two-dimensional arrays into one-dimensional ones. This can result in better memory access patterns, thereby improving caching. See [5,13,54] for the impact of matrix linearization on GPU programs.
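Schematically, for an M x K matrix (a sketch; quantified annotations are rewritten analogously):

```
// Before: two-dimensional permissions and accesses
//   req (\forall* int i; 0 <= i && i < M;
//        (\forall* int j; 0 <= j && j < K; Perm(a[i][j], 1)));
//   ... a[i][j] ...
// After: one linearized dimension of size M*K
//   req (\forall* int i; 0 <= i && i < M*K; Perm(a[i], 1));
//   ... a[i*K + j] ...
```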
The last optimization implemented in Alpinist is data prefetching. Suppose there is a verified GPU program in which each thread accesses an array location in global memory multiple times. In this optimization, we prefetch the values of those global-memory locations into registers that are local to each thread. A similar optimization, in which intermediate results are stored in register memory, was applied in Section 2. Thus, instead of multiple accesses to high-latency global memory, we benefit from low-latency registers. Data prefetching can have a positive performance impact; see [4,58,70].
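A sketch of this transformation inside a kernel body (the register variable name is hypothetical):

```
// Before: two reads of the same global-memory location
//   x = a[tid] + 1;
//   y = a[tid] * 2;
// After: a single read into a thread-local register
//   int a_reg = a[tid];
//   x = a_reg + 1;
//   y = a_reg * 2;
// Annotations about the value of a[tid] then also constrain a_reg.
```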

5 Evaluation
This section describes the evaluation of Alpinist. The goals are to:

Q1 test whether Alpinist works on GPU programs;
Q2 investigate how long it takes for Alpinist to transform GPU programs, and how this affects the verification time;
Q3 investigate the usability of Alpinist on real-world, complex examples.

5.1 Experiment Setup
Alpinist is evaluated on examples from three different sources. The first source consists of hand-made examples that cover different scenarios for each optimization. The second source is a collection of verified programs from VerCors' example repository. The third source consists of complex case studies that have already been verified with VerCors: two parallel prefix sum algorithms [51], parallel stream compaction and summed-area table algorithms [48], a variety of sorting algorithms [49], a solution [27] to challenge 1 of VerifyThis 2019 [18], and a Tic-Tac-Toe example [57] based on [23]. In total, we applied the optimizations 30 times in the first category, 23 times in the second and 17 times in the third (70 experiments in total). All examples are annotated with special optimization annotations, such that Alpinist can apply the optimizations automatically, and all are publicly available at [15]. All experiments were conducted on a MacBook Pro 2020 (macOS 11.3.1) with a 2.0GHz Intel Core i5 CPU. Each experiment was performed ten times, after which the average optimization and verification times of those executions were recorded.

Q1
To test whether Alpinist works on GPU programs, we applied the six optimizations in all 70 experiments and used VerCors to reverify all the resulting programs. All these tests were successful.

Q2

To investigate how long it takes for Alpinist to transform GPU programs, we recorded the transformation time for each optimization applied to all the examples. Table 2 shows the optimization and verification times of applying loop unrolling, iteration merging, matrix linearization and data prefetching to the case studies; note that only these four optimizations could be applied to the case studies. In the table, N/A indicates that the optimization is not applicable to the example.

6 Related Work
To the best of our knowledge, this is the first paper to showcase a tool that implements annotation-aware transformations. We categorize the related work into three parts, covering both tools and optimizations.
Automatic Optimizations without Correctness. There is a large body of related work, see e.g., [2,4,19,25,28,47,61,62,68,69,70], that shows the impact of automated optimizations on GPU programs, but does not consider correctness or its preservation. Our tool can potentially complement these approaches by preserving the provability of the optimized programs.
Correctness Proofs for Transformations. Another body of related work focuses on different approaches to preserve provability, not specific to GPU programs. CompCert [30,31] is a formally verified C compiler that preserves semantic equivalence between the source and the compiled program, by proving the correctness of each transformation in the compilation process. Wijs and Engelen [66] and De Putter and Wijs [45] prove the preservation of functional properties over transformations on models of concurrent systems, showing the preservation of model-independent properties. This approach differs from ours, as they work on models instead of concrete programs.
Compiler Optimization Correctness. Finally, there is related work that focuses on the compilation of sequential programs, performing transformations from high-level source code to lower-level machine code while preserving the semantics. These approaches neither consider parallelization nor target different architectures; in GPU programming, optimizations often need to be applied manually rather than during compilation. Namjoshi and Xu [41] use a proof checker to show equivalence between an original WebAssembly program and its optimized version; an equivalence proof is generated based on the transformations. Namjoshi and Singhania [40] created a semi-automatic loop optimizer driven by user directives, in which loops are verified during compilation; for each transformation, semantics are defined to guarantee semantic equivalence with the original program. Namjoshi and Pavlinovic [39] focus on recovering from precision loss due to semantics-preserving program transformations and propose systematic approaches to simplify the analysis of the transformed program. Finally, Gjomemo et al. [20] assist compiler optimizations by supplying high-level information gathered by external static analysis tools (e.g., Frama-C); this information allows the compiler to reason more precisely.

7 Conclusion
In this paper, we presented Alpinist, an annotation-aware GPU program optimizer. Given an unoptimized, annotated GPU program, we showed how Alpinist transforms both the code and the annotations, with the goal of preserving the provability of the optimized GPU program. Alpinist supports loop unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data prefetching, of which the first three were discussed in detail. We discussed the design and implementation of Alpinist, and validated it by verifying a set of examples and reverifying their optimized counterparts.
For future work, other optimizations could be supported, such as data prefetching for all memory patterns, as mentioned by Ayers et al. [4]. Another open question is whether and how this approach can be used during program compilation. We also plan to extend the approach to preserve the provability of transpiled code, e.g., CUDA-to-OpenCL conversions. Moreover, we plan to investigate how Alpinist can be combined with techniques such as auto-tuning, which automatically detect the potential for applying specific optimizations and identify optimal parameter configurations [3,63].