1 Introduction

As an increasing number of supercomputing platforms leverage hybrid system architectures and general-purpose graphics processing units (GPUs) to achieve high-performance computing goals, heterogeneous programming for GPUs is becoming a trend in program development. Many programming models have emerged alongside GPUs, such as CUDA [1] and OpenCL [2]; however, using these models requires complex programming. In contrast, OpenMP has long been widely used for multithreaded parallel programming on central processing units (CPUs) because of its ease of programming [3]. To support GPU programming, the target offloading feature was introduced in OpenMP 4.0, providing a new option for heterogeneous programming.

OpenMP provides a set of generic instruction constructs to declare compute kernels (i.e. regions of code that need to be executed on the accelerator) and data transfers in the program [4]. The programmer either selects the appropriate compiler directives to specify explicit data transfers during offloading or lets the compiler perform the data transfers implicitly. However, neither the programmer's manual specification nor the compiler's automatic compilation can take the specific use of the data into account, which leads to the transfer of unused arrays and unused data segments. When using OpenMP target offloading, the time overhead of program execution falls into two main types: data transfer overhead and kernel computation overhead on the GPU side; the former accounts for 20–25% of the overall time overhead [5], so optimizing data transfer is necessary.

In this paper, we divide the unused data transmission problems that occur during data transfer into two types: unused array transmission and unused data segment transmission. For the problem of unused array transmission, the abstract syntax tree (AST) of the source program is analysed to obtain the data transmission and usage information, and the unused data transmissions are filtered out and eliminated to improve data transfer efficiency. For the problem of unused data segment transmission, the length of the data segment actually used in the computation is obtained, and the length of the transmitted data is adjusted during data transfer to reduce the impact of unused data segment transmission on program performance.

The rest of the paper is organized as follows: in Sect. 2, the execution model and instruction constructs for OpenMP target offloading and the support for target offloading in LLVM are described. In Sect. 3, the principles and implementation of our identification and elimination of unused array mappings are described. In Sect. 4, the design and implementation that solve the problem of transferring unused data segments are described. In Sect. 5, our experimental environment and the overall performance of the implemented functionality are shown. In Sect. 6, related work is presented, and in Sect. 7, a summary and an outlook on future work are provided.

2 Background

In this section, we will briefly explain the OpenMP target offload instruction constructs, the support for implementing target offload in LLVM according to OpenMP version 4.5, and information about the AST corresponding to the target constructs. In the last part of this section, we will introduce the data transfer principles of OpenMP target offload.

2.1 Target offload command constructs

Commonly used target offload command constructs in OpenMP 4.5 have the following forms [6]:

  • target: the target directive instructs the compiler to generate a target region, map the variables to the device-side data environment and execute the code contained in the target region on the device side.

  • target data: variables are mapped to the device-side data environment for the extent of the associated region; using this construct reduces the number of times data are copied to the device side.

  • target exit data: this unmaps the data on the device side; it is often used in combination with target enter data.

  • target update: this keeps the data environments on the host side and the device side consistent and exchanges data if they are inconsistent.

OpenMP also introduces data mappings, hardware-independent abstractions for data transfer between the host side and the device side, when implementing target offloading. A data mapping is specified by the map clause of the instruction construct and consists of two main parameters. The types of parameters that can be selected and their meanings are shown in Fig. 1 [7]:

  • map-list: the variables or arrays to be mapped;

  • map-type: the direction of the data transfer, e.g. from the host side to the device side (to), from the device side to the host side (from) or in both directions (tofrom).

Fig. 1 The map type and its corresponding data flow direction, which can be selected when using the map clause construct

The types of maps that can be used in common command constructs are not identical. For example, target and target data can use all map types, but the latter can have at most one map clause. Target enter data and target exit data have opposite roles, so they can use opposite map types.
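
As a brief illustration (this fragment is ours and not one of the paper's test programs), the following code shows how complementary map types are typically paired across these constructs; the array name and length are placeholders:

```cpp
#include <cstdio>

int main() {
  const int N = 1024;
  double buf[N];                      // placeholder array, initialized on the host
  for (int i = 0; i < N; ++i) buf[i] = i;

  // "to" copies the host values to the device once and keeps them mapped.
  #pragma omp target enter data map(to: buf[0:N])

  // Several kernels can now reuse the device copy without re-transferring it.
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < N; ++i)
    buf[i] *= 2.0;

  // The opposite construct uses the opposite map type: "from" copies the
  // results back to the host and then removes the mapping.
  #pragma omp target exit data map(from: buf[0:N])

  printf("%f\n", buf[N - 1]);
  return 0;
}
```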

The execution model for the target offload instruction construct consists of two basic parts [8]: (1) the data mapping between the host and the device and (2) the offloaded kernel computation. Figure 2 illustrates a simple execution model of the OpenMP target offload instruction construct, assuming one-dimensional arrays a, b and c, where c is produced by element-wise addition of a and b. Initially, all the data reside in CPU memory, and for the GPU to access these data and complete the computation, the target offload instruction must be used in the program. The corresponding example code is given in Table 1.

Table 1 OpenMP target offloading code example

The program in Table 1 combines the target directive with map clause constructs: the map clauses indicate how the values of a, b and c are mapped between the CPU side and the GPU side, while the target directive specifies the computation kernel.
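
Table 1 itself is not reproduced here; the following is a minimal sketch, consistent with the description above, of what such a program looks like (the array length N is a placeholder we chose for illustration):

```cpp
#include <cstdio>

int main() {
  const int N = 1000;
  double a[N], b[N], c[N];
  for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

  // map(to: ...) copies a and b to the device before the region runs;
  // map(from: ...) copies c back to the host after the region finishes.
  #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N])
  for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

  printf("%f\n", c[N - 1]);
  return 0;
}
```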

Fig. 2 Implementation of the OpenMP target offload instruction construct

After the data mapping is complete, the compiler wraps the compute kernel in a function and transfers it to the GPU specified by the target directive, thereby offloading the compute kernel.

2.2 LLVM support for target offload implementations

The implementation of the OpenMP target instruction construct in LLVM consists of three main parts [9]: host-side code compilation, linking the runtime library and device-side code compilation. The process of linking the runtime library is included in the process of compiling the device-side code. This is shown in Fig. 3.

  • Host-side code compilation. This stage is mainly handled by the LLVM compiler front end, Clang, which performs the regular compilation of the code and identifies the OpenMP target offload code. For each offload region, the corresponding runtime library function is called according to the instruction constructs, and the matching parameters, such as the function identifier and the number of arguments, are passed to it. In addition, if the target offload fails to start (e.g. a device request is unsuccessful), the compiler generates a corresponding host version of the kernel function.

  • Linking the runtime library. To support OpenMP target offloading, LLVM adds the dynamic runtime library [10], whose main functions are to obtain the compilation results of the host-side code, convert the parameter information, function identifiers and other data into formats recognizable on the device side, and load the available target offloading plugins to prepare for the compilation and execution of the device-side code.

  • Device-side code compilation. This process takes the identified OpenMP target offload regions, their associated functions and the captured variables, and outputs the target-specific device code, including the entry kernel functions, global variables, etc., for each target region.

Fig. 3 Compilation flow of the program when using OpenMP target offload

In this section, we describe the target offloading instruction construct and the support for target offloading under the LLVM compiler. In Sects. 3 and 4, we analyse the problems in executing target offloading instructions and give our optimization solutions, respectively.

3 Transfer optimization of unused arrays

The OpenMP 4.5 specification [11] allows explicit mapping of unused variables in the target area [12]. These unused data mappings are also passed to the device side during the compilation and execution of the program, resulting in unnecessary data transfer overhead. To analyse the problem in detail, a typical example is first constructed; see Table 2 for sample code.

Table 2 Explicitly specifying the transfer of unused arrays

In Table 2, the data used by the computation kernel (lines 10–13) are only the three arrays a, b and c. However, the map clause of the target instruction construct (lines 6–8) also explicitly specifies the mapping of the three arrays d, e and f. These three arrays play no role in the computation on the device side, and passing their mapping relationships to the device side incurs the overhead of transferring unused arrays.
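
Table 2 is likewise not reproduced here; the sketch below mirrors its description (the line numbers cited above refer to the original table, and the array length N is a placeholder): six arrays are explicitly mapped, but only a, b and c are touched by the kernel.

```cpp
#include <cstdio>

int main() {
  const int N = 1000;
  double a[N], b[N], c[N];
  double d[N], e[N], f[N];   // mapped below but never used on the device
  for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

  // d, e and f are explicitly mapped even though the kernel never reads or
  // writes them, so their transfer is pure overhead.
  #pragma omp target map(to: a[0:N], b[0:N], d[0:N], e[0:N]) map(from: c[0:N], f[0:N])
  for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];

  printf("%f\n", c[N - 1]);
  return 0;
}
```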

In this section, we design and implement a filter and insert it into Clang. We identify unused array mappings with the help of the filter and eliminate the screened-out unused array mappings during the code generation process of Clang to optimize the unused array transfers. This task consists of two main parts: filtering out the information of unused data mappings from the information passed by the target offload instruction and eliminating the data transfers corresponding to these data mappings.

3.1 Filtering of unused array mappings

The design idea of the filter is as follows: from the source program's AST, first obtain the subtree corresponding to the target offload instruction construct, identify in it the data mappings specified by the map clause and those actually used on the device side, and pass the result to Clang's own code generation module to complete the subsequent conversion from the AST to IR. The filter needs two basic functions, namely (1) collecting the array mapping information specified in the instruction construct and (2) identifying the array mapping information used in the computational kernel.

3.1.1 Explicit data mapping collection

When using target offloading, there are two main data transfer strategies: the first is fully explicit, where the programmer specifies the data mapping information in the map clause, such as the array to be transferred, its size and the direction of data transfer. The second is implicit, where only the target offloading instruction construct is used explicitly and either the map clause is omitted or only the name of the array to be transferred is specified in the map clause; the data mapping is left for the compiler to perform implicitly. Regardless of the data transfer strategy used, the data mapping information can be obtained from the corresponding AST of the program. Figure 4 shows an example of the AST subtree for the OpenMP target offload command in Table 2. The root of this subtree represents the target instruction, and its children are the clauses and parameters of the instruction.

The data mapping information is stored in the AST in the OMPMapClause subnode [13]. The array mapping, specified by the map clause, is stored in the OMPArraySectionExpr node in the subtree structure, so explicit data mapping information can be obtained directly from this node.
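
The collection step can be sketched with Clang's RecursiveASTVisitor; the class and member names below are ours, and the fragment is a simplified illustration of the idea rather than the exact code inserted into Clang:

```cpp
// Sketch of the first half of the filter: collect every array that a map
// clause of a "#pragma omp target" directive explicitly asks to transfer.
// Error handling and the handling of other directive kinds are omitted.
#include "clang/AST/ExprOpenMP.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/AST/StmtOpenMP.h"
#include "llvm/ADT/SmallVector.h"

using namespace clang;

class MapClauseCollector : public RecursiveASTVisitor<MapClauseCollector> {
public:
  llvm::SmallVector<const ValueDecl *, 8> MappedArrays;

  bool VisitOMPTargetDirective(OMPTargetDirective *Dir) {
    // A directive may carry several map clauses; getClausesOfKind walks all of
    // them, which corresponds to the recursive OMPMapClause search described above.
    for (const OMPMapClause *MC : Dir->getClausesOfKind<OMPMapClause>()) {
      for (const Expr *E : MC->varlists()) {
        const Expr *Stripped = E->IgnoreParenImpCasts();
        // An array section such as a[0:N] appears as an OMPArraySectionExpr
        // whose base expression names the mapped array.
        if (const auto *Sec = dyn_cast<OMPArraySectionExpr>(Stripped))
          Stripped = Sec->getBase()->IgnoreParenImpCasts();
        if (const auto *Ref = dyn_cast<DeclRefExpr>(Stripped))
          MappedArrays.push_back(Ref->getDecl());
      }
    }
    return true; // continue traversing the rest of the AST
  }
};
```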

Fig. 4 Construction of the corresponding AST from the OpenMP target offload command, printed by clang -Xclang -ast-dump; [...] indicates that part of the raw output is omitted

Taking Table 2 as an example, there may be more than one map clause in a target offload directive construct, and these map clauses are converted into OMPMapClause nodes at different levels of the AST, so the filter requires a recursive identification process to ensure that all OMPMapClause nodes under the same OMPTargetDirective subtree are ultimately obtained.

3.1.2 Obtaining the data mapping from the computing kernel

As shown in Fig. 4, the computational kernel information specified by the target instruction in Table 2 is stored in the ForStmt subtree in the AST structure.

Fig. 5 AST subtree hierarchy with ForStmt as the root

As shown in Fig. 5, node \(\textcircled {1}\) represents the declaration and initialization of the for loop iteration variable; node \(\textcircled {2}\) represents the loop condition; node \(\textcircled {3}\) represents the increment of the loop variable; and node \(\textcircled {4}\) represents the body of the for loop. The code associated with the loop is encapsulated in a CapturedStmt, which exists as a child node of \(\textcircled {4}\). Clang keeps track of which variables are used inside the CapturedStmt, so the data mapping information actually used by the computational kernel can be obtained from this child node.

Of course, there can be multiple computation kernels in the target region during the construction of the target offload instruction (e.g. there may be multiple non-nested for loops), in which case the CapturedStmt corresponding to each segment of the code is a different child of the same node. For the simple example and AST structure shown in Fig. 6, our filter will fetch the two ForStmt nodes in turn, ensuring that the information in all nodes is eventually fetched.
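
The second half of the filter can be sketched as follows; it relies on the fact that Clang records the variables captured by each CapturedStmt, and the helper name is ours:

```cpp
// Sketch: record every variable that a captured compute kernel actually uses.
// The routine is called once per CapturedStmt found under the target directive,
// so multiple non-nested loops (as in Fig. 6) are handled by calling it for each.
#include "clang/AST/Decl.h"
#include "clang/AST/Stmt.h"
#include "llvm/ADT/SmallVector.h"

using namespace clang;

static void collectUsedVars(const CapturedStmt *Kernel,
                            llvm::SmallVectorImpl<const ValueDecl *> &Used) {
  for (const CapturedStmt::Capture &Cap : Kernel->captures()) {
    if (Cap.capturesVariable() || Cap.capturesVariableByCopy())
      Used.push_back(Cap.getCapturedVar()); // variable referenced inside the kernel
  }
}
```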

Fig. 6 Two non-nested for loops and their corresponding AST structures within a target instruction construct

We also need to choose a reasonable data structure to store the above information. The LLVM architecture makes extensive use of the SmallVector data structure, which is essentially a dynamic variable-length array. For better compiler performance, we also choose to use SmallVector to store the information, and the filter is completed.

The workflow of the filter is shown in Fig. 7. The filter takes the AST of the source program as input and uses the FunctionDecl node corresponding to the function declaration as its entry point, which makes it easy to quickly locate the AST subtree corresponding to the target offload instruction construct.

Fig. 7 Workflow sketch of the filter

The main workflow of filtering is divided into two parts. The first part collects the explicitly specified data mapping information, where the first judgement handles the case in which multiple array transfers are specified in one map clause and the second judgement handles the case in which multiple map clauses exist within one target construct. The second part collects the data mapping information from the computational kernel; the third judgement handles multiple non-nested loops (as depicted in Fig. 6) to ensure that all the array mapping information in the computational kernel is collected. Finally, the two parts are combined to obtain the unused data mapping information for the subsequent optimization of unused array transfers.

3.2 Deletion of unused arrays

When using target offload, the program manages the data on the host side until the code in the compute kernel is executed. The array mapping specified for transfer by the target offload instruction construct appears in the map clause in the format Base[LowerBound:Length] and is converted in the AST to the node structure shown in Fig. 8. When Clang's code generation module analyses the AST to generate the corresponding intermediate representation (IR), the target offload instruction construct generates the corresponding runtime library function calls (e.g. __tgt_target for the target instruction [9]), and the data map specified by the map clause is replaced with the arguments of these functions.

Fig. 8 The arrays transferred in the map clause in AST form

In the resulting runtime library function call, the information in the map clause is parsed into five arrays: \(\textcircled {1}\) offload_sizes, \(\textcircled {2}\) offload_maptypes, \(\textcircled {3}\) offload_baseptrs, \(\textcircled {4}\) offload_ptrs and \(\textcircled {5}\) offload_mappers. \(\textcircled {1}\) records the number and lengths of the arrays to be transferred; \(\textcircled {2}\) records the type of each data transfer, using unsigned integer values to represent the different mapping types; \(\textcircled {3}\) records the base address of each array on the host side, i.e. the first address of the array; \(\textcircled {4}\) records the location of each array section mapped to the device side and corresponds to \(\textcircled {3}\); and \(\textcircled {5}\) records the data mapping method (mapper) defined by the program developer, which is generally empty by default.
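
For the mapping in Table 2, the generated host code conceptually amounts to something like the hand-written C++ fragment below; Clang actually emits LLVM IR, and the exact runtime entry point and flag set vary between LLVM versions, so this is only an illustration of how the five arrays are populated:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative subset of the real map-type flags used by libomptarget.
enum MapTypeFlag : int64_t { OMP_MAP_TO = 0x1, OMP_MAP_FROM = 0x2 };

void describe_offload(double *a, double *b, double *c,
                      double *d, double *e, double *f, std::size_t N) {
  void *offload_baseptrs[6] = { a, b, c, d, e, f };      // (3) host base addresses
  void *offload_ptrs[6]     = { a, b, c, d, e, f };      // (4) start of each mapped section
  int64_t offload_sizes[6];                               // (1) bytes to transfer per entry
  for (int i = 0; i < 6; ++i)
    offload_sizes[i] = static_cast<int64_t>(N * sizeof(double));
  int64_t offload_maptypes[6] = { OMP_MAP_TO, OMP_MAP_TO, OMP_MAP_FROM,
                                  OMP_MAP_TO, OMP_MAP_TO, OMP_MAP_FROM };  // (2)
  void *offload_mappers[6] = { nullptr, nullptr, nullptr,
                               nullptr, nullptr, nullptr };                // (5) none defined
  // A runtime call such as __tgt_target(...) would then receive these arrays
  // together with the identifier of the outlined kernel function.
  (void)offload_baseptrs; (void)offload_ptrs; (void)offload_sizes;
  (void)offload_maptypes; (void)offload_mappers;
}
```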

With the method in the previous subsection, the unused array mapping information, which is in fact a node or part of a subtree of the AST, has already been identified by the filter. By adjusting the compiler's code generation strategy so that this unused array mapping information is not encapsulated with the compute kernel, we ensure that it is not subsequently transferred to the device side along with the compute kernel.

The pseudocode for the optimization process of unused array transfer is given below in Algorithm 1.

Algorithm 1

Algorithm 1 generates IR code by building a filter and traversing the Abstract Syntax Tree (AST). It first establishes a filter and calculates the number of OMPMapClause nodes in the AST. Then, during the traversal of the AST, it retrieves information from the map clauses and adjusts the length of data transfers and the values of base pointers accordingly. Finally, it generates IR code based on the information in the filter, taking into account the rules specified in the map clauses and default mapping strategies.

The core idea of the algorithm is to eliminate the transfer of unused arrays specified in the map clauses. It achieves this by constructing a filter to select the arrays for which IR code generation is necessary and by adjusting the length of data transfers and base pointer values based on the information in the map clauses. Through this optimization, the algorithm reduces unnecessary data transfers and improves the program's execution efficiency. Additionally, the algorithm considers modifying the rules specified in the map clauses and the default mapping strategies to accommodate different programming requirements.
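
Since the pseudocode of Algorithm 1 is available only as a figure, the following simplified C++ fragment restates its core filtering step with stand-in types of our own (MapEntry and the function name are ours, not Clang's code generation interface):

```cpp
#include <set>
#include <string>
#include <vector>

// Stand-in for one entry of the explicitly specified map information; the real
// implementation works on Clang's internal lists (e.g. the CombinedInfo container).
struct MapEntry {
  std::string array;   // name of the mapped array
  long lengthInBytes;  // requested transfer length
};

// Algorithm 1 (sketch): keep only the map entries whose arrays the kernel uses,
// so that code generation never emits transfers for the filtered-out arrays.
std::vector<MapEntry> filterUnusedArrays(const std::vector<MapEntry> &explicitMaps,
                                         const std::set<std::string> &usedInKernel) {
  std::vector<MapEntry> toEmit;
  for (const MapEntry &M : explicitMaps) {
    if (usedInKernel.count(M.array))
      toEmit.push_back(M);  // used on the device: generate runtime-call arguments for it
    // otherwise the entry is dropped and no data transfer is generated for it
  }
  return toEmit;
}
```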

4 Optimization of the transmission of unused data segments

In actual program development, there are cases in which the data length specified by the map clause is not fully used in the computational kernel, so the program developer has to tune the program's performance manually. Considering factors such as the efficiency and correctness of manual tuning, this part of the optimization should be left to the compiler. To facilitate the analysis of the problem, the example in Table 2 is first adapted (see Table 3).

Table 3 Example of an explicitly specified unused data segment transfer

In Table 3, the data transfer length explicitly specified by the map clause is LEN, but only one percent of LEN is actually used in the calculation kernel, and the remaining part of the data does not need to be transferred to the device side; this part is called the unused data segment.
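
Table 3 is also not reproduced here; the sketch below matches its description, with the map clause requesting LEN elements per array while the loop touches only LEN/100 of them:

```cpp
#include <cstdio>

#define LEN 1000000

int main() {
  static double a[LEN], b[LEN], c[LEN];
  for (int i = 0; i < LEN; ++i) { a[i] = i; b[i] = 2 * i; }

  // The map clause transfers LEN elements of each array, but the kernel only
  // touches the first LEN/100 elements; the rest is an unused data segment.
  #pragma omp target map(to: a[0:LEN], b[0:LEN]) map(from: c[0:LEN])
  for (int i = 0; i < LEN / 100; ++i)
    c[i] = a[i] + b[i];

  printf("%f\n", c[0]);
  return 0;
}
```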

For unused data segment transfer, we can first use the filter described in the previous section to quickly locate the computation kernel specified by the target offload instruction construct and obtain the data segment usage, including the length and starting address, and then optimize the transfer of unused data segments by adjusting the actual length of the transferred data segments. This procedure is divided into two parts: first, the data length used in the computation kernel is obtained, and then this length is passed to the code generation module at the right time and set accordingly.

4.1 Obtaining the actual length of the data segment used

With the help of the filter described in the previous section, the array mapping information used in the calculation kernel can be determined. The next step is to obtain the length of the data segments actually used from the calculation kernel.

In the example in Table 3, the length of the data segment used in the kernel is determined by the iteration length of the for loop, so the corresponding AST subtree is first obtained for subsequent analysis; its abstract structure is shown in Fig. 9.

Fig. 9 Construction of the corresponding AST from the OpenMP target offload command, printed by clang -Xclang -ast-dump; [...] indicates that part of the raw output is omitted

The iteration of the for loop in the example in Table 3 starts at 0, which means that the array in the compute kernel is accessed starting at the first address, so it can be assumed that the length of the iteration is the actual length of the data segment to be transferred. If the loop iteration does not start at 0, then the actual length of the data segment to be transferred should be the difference between the upper and lower bounds of the iteration of the loop.

The iterative upper bound of the loop is obtained from the AST subtree, and this is the length of the data segment used for the actual computation. The next step is to obtain the data length explicitly specified in the program by the map clause, which can be done with the help of the second part of the filter in the previous subsection. If there is a difference between the explicitly specified data length and the actual length of the data segment used in the computation kernel, optimization of the data segment transfer is needed.
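
A simplified sketch of how the iteration bound can be read from such a loop is given below; it handles only the common pattern for (i = lb; i < ub; ++i) with a compile-time-constant bound, and the function name is ours:

```cpp
#include <cstdint>

#include "clang/AST/ASTContext.h"
#include "clang/AST/Expr.h"
#include "clang/AST/Stmt.h"
#include "llvm/ADT/Optional.h"

using namespace clang;

// Returns the constant upper bound of "for (i = lb; i < ub; ++i)" if it can be
// evaluated at compile time; the actual implementation also records lb so that
// (ub - lb) can be used when the loop does not start at zero.
static llvm::Optional<int64_t> getLoopUpperBound(const ForStmt *Loop,
                                                 const ASTContext &Ctx) {
  const auto *Cond = dyn_cast_or_null<BinaryOperator>(Loop->getCond());
  if (!Cond || Cond->getOpcode() != BO_LT)
    return llvm::None;                        // only the "i < ub" pattern is handled
  Expr::EvalResult Result;
  if (!Cond->getRHS()->EvaluateAsInt(Result, Ctx))
    return llvm::None;                        // the bound is not a compile-time constant
  return Result.Val.getInt().getSExtValue();  // equals the used length when lb == 0
}
```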

4.2 Inserting information about the length of the acquired data segment

Clang’s code generation module generates information, such as the data length, the data map type, a base pointer to the first address of the data on the host side, and a user-defined mapper, to be passed to the device side based on the data map explicitly specified in the map clause. This information is first encapsulated by the Clang front-end and stored in the CombinedInfo intermediate container, from which the subsequent encapsulation process of the computing kernel reads the information. Therefore, our insertion operation should be performed in the packaging process of this container.

The mapped data length information in the above containers is stored in the offload_sizes array, where the array lengths to be mapped are pointed to by corresponding pointers. The data segment length in the offload_sizes array can be modified by adjusting the pointer to point to the data segment length used by the computational kernel.

When the lower_bound of the loop iteration in the calculation kernel is not 0, the position of the base pointer must also be adjusted. The adjustment takes into account the data offset, which depends on the data type and is explicitly handled in Algorithm 2. Specifically, it involves adjusting the offload_baseptrs and offload_ptrs arrays and keeping this adjustment synchronized with the modification of offload_sizes. This is crucial for maintaining consistency between the base pointer and the corresponding data and thus the proper execution of the program. The operation is always bound to the DeclRefExpr node corresponding to each variable, which is a reference to the corresponding declared variable and guarantees the correctness of our adjustment. In addition, since the types of the variables representing the lengths of the data segments are converted throughout the adjustment process, the final result type must be checked at the end of the adjustment and converted if needed.

The pseudocode for the overall implementation is given below (see Algorithm 2).

Algorithm 2

The core idea of Algorithm 2 is to optimize the transfer of unused data segments in the map clause. It traverses the Abstract Syntax Tree (AST) to check the transfer length specified in the map clause and the upper and lower bounds of loop iterations. If the transfer length does not match the loop iteration range, the algorithm adjusts the length of offload sizes accordingly and modifies the value of the base pointer to ensure the correctness of data transfer. If the transfer length matches the loop iteration range, the algorithm executes the default mapping strategy. Through this approach, the algorithm optimizes the transfer of unused data segments, improving the program’s execution efficiency.
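
As with Algorithm 1, the pseudocode of Algorithm 2 is available only as a figure; the C++ fragment below sketches its core adjustment for a single mapped array with our own variable names, showing how the transferred size and the pointers are shrunk to the segment the kernel actually uses:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of Algorithm 2's core adjustment for one mapped array. elemSize is
// sizeof(element type); lb and ub are the loop's lower and upper iteration bounds.
// The real code performs the equivalent update on Clang's CombinedInfo container
// before the offload_* arrays are emitted.
void shrinkMappedSegment(void *&offload_baseptr, void *&offload_ptr,
                         int64_t &offload_size,
                         int64_t lb, int64_t ub, std::size_t elemSize) {
  const int64_t usedBytes = (ub - lb) * static_cast<int64_t>(elemSize);
  if (usedBytes >= offload_size)
    return;                                   // nothing unused: keep the default mapping
  // Shift both pointers by the byte offset of the first element actually used...
  char *shifted = static_cast<char *>(offload_ptr) + lb * static_cast<int64_t>(elemSize);
  offload_ptr = shifted;
  offload_baseptr = shifted;
  // ...and transfer only the bytes the kernel will touch.
  offload_size = usedBytes;
}
```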

5 Test and analysis

The main focus in this section is on verifying the correctness of the functionality we implement and evaluating its performance.

5.1 Experimental environment

To evaluate the correctness and performance improvement of our work, we use a Deep Computing Unit (DCU) as the accelerator. The DCU is a GPU running in the Radeon Open Compute (ROCm) environment [14] that supports OpenCL and is compatible with major heterogeneous programming standards such as CUDA. The CPU is used as the host side, the DCU is the device side, and the two are connected through a PCIe bus. For the experiments in this section, the operating system used was CentOS-7.6, with support for the Haiguang 1 DCU accelerator running ROCm 4.2.2. The OpenMP target offloading programs were compiled using LLVM-12.0.0. The CPU used was the Hygon C86 multicore processor, with 32 physical cores, 64 threads, a 32 KB L1 cache and 128 GB of global memory. The Haiguang 1 DCU accelerator has a GPU-like architecture, with 60 compute units (CUs), a core clock frequency of 1.7 GHz, 16 GB of global memory and 64 KB of shared memory per CU.

5.2 Typical example test analysis

The first test is performed for the example used in this paper.

5.2.1 Test for unused array elimination

To test the transmission optimization of unused arrays, three sets of experiments, A, B and C, are set up for comparison. The experimental settings are shown in Fig. 10, and the test example used is shown in Table 2.

Fig. 10 The setup of each group of experiments, where “\(\checkmark\)” indicates that this situation occurs and “\(\times\)” indicates that it does not

The purpose of Experiment A is to establish a baseline performance metric by measuring the execution time of a test program that does not involve any unused data transfer optimization. This provides a reference point for comparison with the other experiments. In Experiment B, we introduced the problem of unused data transfer, but did not optimize it using our technique. This experiment serves as a control to isolate the impact of our optimization approach. Experiment C tests the functionality of our implementation by applying our technique to optimize unused data transfer.

The test results are shown in Fig. 12a. Our implementation (corresponding to experiment C) exhibits approximately a 3.01% performance overhead compared to the baseline experiment A. This is because there are still optimization opportunities in the code implementing the functionality, which will be addressed in future work. However, for the scenario involving explicitly specified unused data transfers (corresponding to experiment B), there is a performance improvement of 9.36%. This demonstrates that optimizing for the issue of unused data transfers can enhance overall program performance.

5.2.2 Test for unused data segment elimination

To test the optimization of unused data segment transmission, four sets of experiments, A, B, C and D, are set up for comparison; the experimental settings are shown in Fig. 11. The test example used is shown in Table 3.

Fig. 11 The setup of each group of experiments, where “\(\checkmark\)” indicates that this situation occurs and “\(\times\)” indicates that it does not

Experiments A and B serve as benchmarks for the whole experiment: in both, the array size specified for transfer is consistent with the computed length, and the compiler does not optimize the transfer of unused data segments. In contrast, Experiment C is designed as a control group against which the performance of our optimization in Experiment D is compared. In Experiment C, the size of the array specified for transfer is inconsistent with the length used in the computational kernel, and the compiler does not optimize the unused data segments.

The test results are depicted in Fig. 12b. Comparing experiments A and C shows that when the device-side computed lengths are the same but the lengths of the transmitted data segments differ, the program's performance varies; these discrepancies arise primarily from the data transmission process. Comparing experiments A and D shows that our implementation introduces some performance overhead; however, this overhead can be eliminated by further optimizing the code. Overall, optimizing the transfer of unused data segments proves beneficial in enhancing program performance.

Fig. 12 a Test results for the optimization of unused array transfers; b test results for the optimization of unused data segment transfers

5.3 Polybench benchmark

Polybench is a benchmark suite for numerical computation with static control flow, extracted from operations in various application areas (linear algebraic computation, image processing, physical simulation, dynamic programming, statistics, etc.).

5.3.1 Test analysis of unused array elimination

This group of experiments analyses the performance of a subset of the test cases from the Polybench suite, compiled with the -O3 optimization option, using the DCU accelerator and the LLVM compiler with the Clang 12.0.0 front end. The compiler directive is “#pragma omp target teams distribute parallel for” with an explicitly specified map clause, and the unused data transfer problem is introduced manually in the map clause. The test results are shown in Fig. 13.

Fig. 13 Test results of unused array elimination

From the experimental results, it can be observed that, except for ludcmp, all test cases exhibit positive acceleration. This can be attributed to the following two factors: (a) Data access patterns: The data access patterns in these test cases align well with the characteristics of the DCU, allowing for effective utilization of its parallel computing capabilities and reducing the memory access bottleneck. For example, the gramschmidt test case involves extensive matrix multiplications and transpositions, which can be efficiently parallelized on the DCU, fully exploiting its computational power.

(b) Data dependencies: The test cases exhibit relatively simple data dependencies, without intricate dependencies across multiple data points, enabling effective data parallelization. For instance, the computation in Jacobi-1d only involves calculations between adjacent data points, without complex dependencies spanning multiple points, thus facilitating parallel computations.

Among all the test cases showing positive acceleration, the Floyd–Warshall test case demonstrates the smallest performance improvement. This is due to its high computational complexity, where the optimization of unused array transfers is just one aspect, and its impact is not significant. Additionally, the data size in this test case is relatively small, resulting in less noticeable optimization effects. On the other hand, the ludcmp test case exhibits reverse acceleration, primarily because of its complex data dependencies and the optimization of unused array transfers, which introduces more random data access patterns, deteriorating memory locality.

In conclusion, these test cases cover basic algorithmic operations in various domains and applications, showcasing different computational and data characteristics. The experimental results indicate that implementing optimization for unused data transfers contributes to improving the performance of OpenMP heterogeneous programs.

5.3.2 Test analysis of unused data segment elimination

The test cases, compiler optimization options, test environment, compiler version and directive format in this set of experiments are consistent with those in the previous set. However, the explicitly specified unused data segment transfer problem is no longer introduced in the compiler directives; instead, the unused data segment transfer problem is introduced by modifying the number of loop iterations. The final test results are shown in Fig. 14.

Fig. 14 Optimization test results of unused data segment transmission

The experimental results demonstrate a positive acceleration effect in most of the tested cases. Specifically, the Cholesky and Lu tests benefit from complex matrix operations and data access patterns, showing improved performance with the optimization approach. However, the ludcmp test exhibits lower optimization effectiveness due to its complex data dependencies and special computations. In the Deriche and Jacobi-1d tests, the optimization of unused data segment transfers further enhances cache utilization and data transfer efficiency, thanks to their smaller data transfer sizes and good locality. On the other hand, the Floyd–Warshall test shows a relatively modest acceleration effect, likely due to its high computational complexity and smaller dataset. Overall, these findings support the effectiveness of optimizing unused data segment transfers as a means to enhance the performance of OpenMP heterogeneous programs.

6 Related works

OpenMP 4.0 provides a mechanism for offloading code regions to accelerators. Antao et al. [15] introduced an OpenMP offload implementation in LLVM for Nvidia GPUs; this implementation was later extended to support AMD GPUs and serves as the experimental foundation for our study. Budiardja and Cardall [5] verified that data transfer between a GPU and host memory in OpenMP applications takes considerable time, which prompted our study to explore methods for optimizing OpenMP accelerator offloading implementations. Yu et al. [16] proposed ARBALEST to dynamically detect data movement when using OpenMP offloading, enabling data transfer with OpenMP target offloading to be “visualized”; this tool provides support for the experiments in this paper. Yan et al. [7] implemented reading and writing data directly between storage and device memory by extending the map clause, providing a new direction for data transfer optimization. This serves as an inspiration for our future work, in which we will reference their methods and further optimize our approach.

7 Summary and outlook

This paper introduces the problem of unused data transfer between the host side and the device side when using the LLVM compiler to implement OpenMP target offloading on DCUs. Unused array transfer and unused data segment transfer are optimized by inserting filters and by optimizing the code generation strategy of Clang, respectively. The experimental results show that both of these optimizations improve the program performance.

In future work, we will continue to improve the performance of the filter, such as by replacing the currently implemented heuristic algorithm with an optimization algorithm.