Automatic Data Layout Optimizations for GPUs
Abstract
Memory optimizations have become increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-Structures (AoS) to Structure-of-Arrays (SoA).
This paper presents an optimization infrastructure to automatically determine an improved data layout for OpenCL programs written in AoS layout. Our framework consists of two separate algorithms: the first one constructs a graph-based model, which is used to split the AoS input struct into several clusters of fields, based on hardware-dependent parameters. The second algorithm selects a good per-cluster data layout (e.g., SoA, AoS or an intermediate layout) using a decision tree. Results show that the combination of both algorithms delivers higher performance than each algorithm individually. The layouts proposed by our framework result in speedups of up to 2.22, 1.89 and 2.83 on an AMD FirePro S9000, an NVIDIA GeForce GTX 480 and an NVIDIA Tesla K20m, respectively, over different AoS sample programs, and of up to 1.18 over a manually optimized program.
Keywords
Global Memory · Training Pattern · Data Layout · Graph Based Model · OpenCL Kernel

1 Introduction
With the advent of new massively parallel architectures such as GPUs, many research projects focus on memory optimizations. In order to exploit the properties of the memory hierarchy, a key aspect is to maximize the reuse of data.
In this context, data layout transformation represents a very interesting class of optimizations. Two typical examples are: organizing data with similar access patterns in structures, or rearranging an array of structures (AoS) as a structure of arrays (SoA). Recent work extends the classical SoA layout by introducing AoSoA (Array of Structures of Arrays) [16], also called ASA [14]. In this paper we prefer the expression tiled-AoS, but we remark that all approaches exploit the same idea: mixing AoS and SoA in a single data layout.
1.1 Motivation
In this work, we investigate an automatic memory optimization method that can be easily ported to different GPU architectures, using OpenCL as programming model. We combine two optimization strategies: grouping together data fields with similar access patterns into clusters, and finding the best data layout for each of these clusters.
The exploration of the whole search space, including both fields’ clustering and data tiling (i.e., finding the best data layout for each of these clusters) would take more than 400 years.
Figure 1 shows a subset of the optimization space for SAMPO. The heatmap on top depicts all possible data tilings for the one-cluster grouping of all twelve data fields. For this partition, the untiled AoS layout is slow (blue); increasing the data tile-size decreases the runtime (shown in red), and with tile-sizes larger than \(12\,\mathrm{K}\) it also outperforms the SoA layout. The lower heatmap shows the performance results when applying the specific data tiling suggested by our algorithm (Sect. 3.1). The fastest version in the shown optimization subspace is achieved with a tile-size of 16 for the smaller struct containing two fields, a tile-size of 24 for the bigger struct with six fields, and the remaining fields in a SoA layout. This example program also shows that the best tile-size can differ between clusters within the same code: when using only one cluster, the highest performance is achieved with large data tiles, whereas a different clustering delivers better performance with smaller tile-sizes. This suggests that the optimal data tile-size highly depends on the size of the individual cluster.

A Kernel Data Layout Graph (\({ KDLG }\)) model extracted from an input OpenCL kernel; each vertex weight represents a structure field's size and each edge weight expresses the memory distance between two data fields.

A two-phase algorithm: first, a \({ KDLG }\) partitioning algorithm, driven by a device-dependent graph model, splits the original graph into partitions with similar data access patterns; second, for each partition a data layout selection method, driven by device-dependent layout calculations, selects the most suitable layout from AoS, SoA and tiled-AoS.

An evaluation of five OpenCL applications on three GPUs showing a speedup of up to 2.83.
2 Related Work
The problem of finding an optimal layout is not only NP-hard, but also hard to approximate [11]. Raman et al. [9] introduced a graph-based model to optimize structure layout for multithreaded programs. They developed a semi-automatic tool which produces layout transformations optimized for both false sharing and data locality. Our work uses a different graph-based model encoding the variables' memory distances and data structure sizes, in order to provide a completely automatic approach; we also support AoS, SoA and tiled-AoS layouts. Kandemir et al. [5] introduced an interprocedural optimization framework using both loop optimizations and data layout transformations; our method is likewise not limited to a single function, but can span multiple functions.
Data layout transformations such as SoA conversion have been described as core optimization techniques for scaling to massively threaded systems such as GPUs [13]. DL [15] presents data layout transformations for heterogeneous computing; it supports AoS, SoA and ASTA and implements an automatic data marshaling framework to easily change data layout arrangements. Our work supports similar data layouts, but we provide an automatic approach for the layout selection. MATOG [16] introduces a DSL-like, library-based approach which optimizes GPU codes using static or empirical profiling to adjust parameters or to change the kernel implementation. MATOG supports AoS, SoA and AoSoA with 32 threads (to match the warp size on CUDA) on multidimensional layouts and builds an application-dependent decision tree to select the best layout. Dymaxion [4] is an API that allows programmers to optimize memory mapping on heterogeneous platforms. It extends NVIDIA's CUDA API with a data index transformation and a latency hiding mechanism based on CUDA streams. Dymaxion++ [3] further extends this work; however, it does not relieve the programmer from selecting a good data layout.
3 Method
Our approach tries to answer two complex questions: (1) What is the best way to group data fields? (2) For each field cluster, what is the best data layout?
Once clusters have been identified, we try to find the best possible layout within each cluster (i.e., a homogeneous layout). Our model supports AoS, SoA, as well as tiled-AoS with different tile-sizes.
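To make the three layout options concrete, the following sketch (our illustration, not code from the paper) computes the flat array index of one 4-byte field element under each layout; `t` denotes the tile-size:

```python
# Sketch (not from the paper): flat-index computation for one field of
# element i under the three layouts the framework considers.
# n: number of elements, m: fields per struct, f: field index, t: tile-size.

def aos_index(i, f, m):
    # AoS: all fields of element i are stored contiguously.
    return i * m + f

def soa_index(i, f, n):
    # SoA: each field is stored in its own array of length n.
    return f * n + i

def tiled_aos_index(i, f, m, t):
    # tiled-AoS: SoA layout inside tiles of t elements, tiles stored like AoS.
    tile, offset = divmod(i, t)
    return tile * (m * t) + f * t + offset

# With t == 1 tiled-AoS degenerates to AoS; with t == n it equals SoA.
print(aos_index(5, 2, 4))            # 22
print(soa_index(5, 2, 8))            # 21
print(tiled_aos_index(5, 2, 4, 4))   # 25
```

The two boundary cases of the tile-size illustrate why tiled-AoS can be seen as an intermediate layout between AoS and SoA.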
In the next section we introduce a novel graph-based model, in which we encode data layout, the fields' sizes and field locality information. The presented two-step approach (1) identifies field partitions (i.e., clusters of fields) with high locality between intra-partition fields and (2) determines an efficient data layout for each partition.
3.1 Kernel Data Layout Graph Model
We borrow the idea of memory distance from [9] and extend it with the actual data type size, which is important to distinguish different memory behaviors.
Figure 2b displays the \({ KDLG }\) generated from the code shown in Fig. 2a: The fields a and b are always accessed consecutively, therefore \(\delta (a,b)\) is 4 bytes. c is accessed after the for loop with 32 iterations, therefore \(\delta (c,b)=252\) and \(\delta (c,a)=256\) bytes, resulting from the 32 iterations that access \(2 \cdot 4\) bytes in each iteration. d is never accessed, therefore its distance from other fields is \(\infty \).
Our graph-based model unrolls all loops before starting the analysis; it therefore assumes that loop bounds are known at compile time. If they are not known, we use an OpenCL-kernel-specific loop size inference heuristic to obtain a good approximation (see Sect. 3.1). Our analysis focuses on global memory operations, as they are considerably slower than local and private memory operations.
Let \( MI (f)\) define the set of all global memory instructions (loads and stores) involving the data field f. Our distance function \(\delta \) between two fields \(f_1\) and \(f_2\) is defined by taking into account the maximum-memory-distance path between the accessing instructions \(i_1 \in MI (f_1)\) and \(i_2 \in MI (f_2)\).
We conservatively use the maximum, which leads to higher weights on the \({ KDLG }\)'s edges and hence to more clusters; more clusters carry a lower risk of performance loss on our target architectures.
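As an illustration of the distance function, the following Python sketch approximates \(\delta \) from a fully unrolled access trace. This is our simplified reading of the analysis (the paper reasons over static instructions and paths, not dynamic traces), but it reproduces the numbers of the Fig. 2 example:

```python
# Simplified sketch (not the paper's exact analysis): approximate the memory
# distance delta between struct fields from a fully unrolled trace of
# global-memory accesses. Each trace entry is (field_name, access_size_bytes).

def field_distances(trace):
    """Approximate delta(f1, f2) as the distance in accessed bytes between
    the first accesses of the two fields in the unrolled trace."""
    offset = 0
    first = {}                       # field -> byte offset of its first access
    for field, size in trace:
        first.setdefault(field, offset)
        offset += size
    delta = {}
    fields = sorted(first)
    for i, f1 in enumerate(fields):
        for f2 in fields[i + 1:]:
            delta[(f1, f2)] = abs(first[f1] - first[f2])
    return delta

# Fig. 2a of the paper: a and b are read in a loop of 32 iterations
# (4 bytes each), c is read once after the loop, d is never accessed
# (its distance to all other fields is infinite, so it is simply absent).
trace = [("a", 4), ("b", 4)] * 32 + [("c", 4)]
d = field_distances(trace)
print(d[("a", "b")], d[("b", "c")], d[("a", "c")])   # 4 252 256
```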
KDLG Partitioning. The first step of our algorithm identifies which fields of the input data structure should be grouped together. Formally, we assume that a field partitioning C of the \({ KDLG }\) (i.e., a set of field clusters) is good if \(\forall e \in C: \delta (e) < \epsilon \), where \(\epsilon \) is a device-dependent threshold. We define \(\epsilon \) as the L1 cache line size of the individual GPU; the values of \(\epsilon \) are listed in Table 1. We use this value as it is the smallest entity that can be loaded from the L1 cache and therefore should be loaded at once.
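One plausible realization of this partitioning step is a Kruskal-style single-linkage clustering (the paper cites Kruskal [8] but does not spell out the exact procedure, so the following is an assumption): merge field clusters along edges of increasing memory distance as long as \(\delta < \epsilon \).

```python
# Sketch of KDLG partitioning (an assumption, in the spirit of the
# Kruskal-style clustering suggested by the paper's reference [8]).

def kdlg_partition(fields, edges, epsilon):
    """fields: iterable of field names; edges: dict (f1, f2) -> delta in
    bytes; epsilon: device-dependent threshold (the L1 cache line size)."""
    parent = {f: f for f in fields}

    def find(f):                      # union-find with path compression
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for (f1, f2), delta in sorted(edges.items(), key=lambda e: e[1]):
        if delta < epsilon:           # only merge fields accessed closely together
            parent[find(f1)] = find(f2)

    clusters = {}
    for f in fields:
        clusters.setdefault(find(f), set()).add(f)
    return sorted(map(sorted, clusters.values()))

# Fig. 2 example with epsilon = 64 (e.g. a 64-byte L1 cache line); the
# never-accessed field d has infinite distance and stays alone.
edges = {("a", "b"): 4, ("b", "c"): 252, ("a", "c"): 256}
clusters = kdlg_partition("abcd", edges, 64)
print(clusters)   # [['a', 'b'], ['c'], ['d']]
```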
Loop Bounds Approximation. When generating the test data to select \(\epsilon \) we use loops with a fixed number of iterations, in order to accurately understand the memory distance between two memory accesses. In real-world codes, the actual number of iterations is often not known at compile time. Therefore we use a heuristic that is specifically designed for OpenCL kernel codes. If the number of loop iterations is determined by compile-time constants, we use the actual number of iterations. If not, we apply a heuristic to approximate the number of iterations: when a loop performs one iteration for each OpenCL work-item [6] of the work-group [6], we estimate it has 256 iterations, as the work-group size is usually in this range. When a loop performs one iteration for each work-item of the NDRange [6], we assume it has \(1\cdot 10^6\) iterations. If the number of iterations is neither constant nor linked to the work-group size or NDRange, we estimate it to have \(512\cdot 10^3\) iterations. The estimation of loop bounds is not very sensitive: we only need to distinguish short loops, which may not completely flush the L1 cache, from long ones.
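The heuristic above can be summarized as follows; the constants come from the text, while the classification of a loop into these categories is assumed to be provided by the compiler front end:

```python
# Sketch of the loop-bound approximation heuristic described above.
# loop_kind is assumed to be derived from the OpenCL kernel's loop condition.

def estimate_iterations(loop_kind, constant_bound=None):
    if loop_kind == "constant":              # bound known at compile time
        return constant_bound
    if loop_kind == "per_workitem_workgroup":
        return 256                           # one iteration per work-item of the work-group
    if loop_kind == "per_workitem_ndrange":
        return 1_000_000                     # one iteration per work-item of the NDRange
    return 512_000                           # unknown: assume a long, cache-flushing loop

print(estimate_iterations("constant", 32))            # 32
print(estimate_iterations("per_workitem_workgroup"))  # 256
```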
3.2 PerCluster Layout Selection
After the KDLG partitioning, we assume that all fields in the same cluster have similar memory behavior. Therefore, all fields within a cluster should use the same data layout arrangement, e.g., tiled-AoS with a specific tile-size.
To determine which layout is best for a given cluster, we generate different kernels corresponding to a simple one-cluster \({ KDLG }\) where \(\delta \) is roughly the same for each pair of fields. Each kernel consists of a single for-loop with a constant number of iterations n, where n ranges over all powers of two from 128 to 16384. We evaluated the performance of these kernels with different combinations of loop size n, number of structure fields m, and tile-size t.
From the results we derive a device-dependent function SelectTilesize\((\sigma (c))\), which returns the suggested layout for a cluster c, where \(\sigma (c) = \sum _{f \in c}{\sigma (f)}\) and \(\sigma (f)\) is the size of the field f in bytes. SelectTilesize is implemented as a decision tree constructed by the C5.0 algorithm [12]; \(\sigma (c)\) is the only attribute the decision tree depends on. The potential target classes are AoS, SoA and all powers of two from \(2^1\) to \(2^{15}\). The performance measurements of the aforementioned kernels are used to generate the training data: for each kernel we create a training pattern for the fastest tile-size as well as for all other tile-sizes that are less than \(1\,\%\) slower than the fastest one. Each training pattern consists of the structure size \(\sigma (c)\) as the only feature, with the used tile-size as the target value. Generating training patterns not only for the fastest tile-size but for all tile-sizes that achieve at least \(99\,\%\) of its performance, as well as several training patterns for different structures of the same size, may lead to contradicting training patterns. However, our experiments demonstrated that the resulting decision tree is more accurate and less prone to overfitting. C5.0 was used with default settings; its runtime was about 1 ms, depending on the input.
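The generation of training patterns can be sketched as follows; the timings in the example are illustrative placeholders, not measurements from the paper:

```python
# Sketch of the training-pattern generation described above: for one
# micro-benchmark (identified by its struct size sigma in bytes) keep every
# layout whose runtime is within 1% of the fastest one.

def training_patterns(struct_size, timings):
    """timings: dict layout -> runtime; a layout is 'AoS', 'SoA' or a
    tile-size. Returns (sigma, layout) pairs for training the decision tree."""
    best = min(timings.values())
    return [(struct_size, layout)
            for layout, t in sorted(timings.items(), key=lambda kv: kv[1])
            if t <= best * 1.01]

# Illustrative runtimes for a 48-byte struct: two tile-sizes are within 1%
# of the fastest run, so both become training patterns.
timings = {"AoS": 3.10, "SoA": 1.32, 64: 1.30, 128: 1.305, 256: 1.41}
patterns = training_patterns(48, timings)
print(patterns)   # [(48, 64), (48, 128)]
```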
3.3 Final Algorithm
Line 2 calls the KDLG partitioning algorithm and returns the set of clusters C into which the corresponding structure should be split. Then the decision tree determines an efficient tiling factor for each of these clusters and stores the resulting pair (cluster, tile-size) (lines 3–5).
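Putting both phases together, the final algorithm can be sketched as follows; `kdlg_partitioning` and `select_tilesize` stand in for the two components described above, and the toy stand-ins at the bottom are purely illustrative:

```python
# Sketch of the final algorithm: partition the KDLG, then pick a per-cluster
# layout via the decision tree. Both components are assumed to be given.

def optimize_layout(kdlg, epsilon, select_tilesize, kdlg_partitioning):
    layout = []
    clusters = kdlg_partitioning(kdlg, epsilon)        # line 2 of the algorithm
    for cluster in clusters:                           # lines 3-5
        sigma = sum(field_size for _, field_size in cluster)
        layout.append((cluster, select_tilesize(sigma)))
    return layout

# Toy stand-ins, for illustration only: an identity "partitioning" on an
# already-clustered field list and a trivial threshold "decision tree".
partition = lambda kdlg, eps: kdlg
select = lambda sigma: "SoA" if sigma > 32 else 16
kdlg = [[("a", 4), ("b", 4)], [("x", 24), ("y", 24)]]
plan = optimize_layout(kdlg, 64, select, partition)
print(plan)
```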
4 Experimental Results
Properties determined using our algorithms and speedup over AoS on the three test GPUs:

Test program     | Struct size (bytes) | Fields | Affected kernels | Loop bound approx.\(^{a}\) | Speedup (FirePro / GeForce / Tesla)
Nbody            | 32                  | 2      | 1                | n                          | 1.01 / 1.06 / 1.01
BlackScholes [2] | 28                  | 7      | 1                | –                          | 1.00 / 1.43 / 2.83
Bitonic sorting  | 16                  | 4      | 1                | u                          | 1.47 / 1.50 / 1.38
LavaMD [10]      | 36                  | 3      | 1                | c,u                        | 2.22 / 1.89 / 2.07
SAMPO            | 48                  | 12     | 9                | w,u                        | 2.19 / 1.59 / 1.96
Bitonic Sort [1] is a sorting algorithm optimized for massively parallel hardware such as GPUs. The implementation that we use sorts a struct of four elements, where the first element acts as the sorting key. As all elements are always moved together, the \({ KDLG }\)-based algorithm results in one single cluster for any \(\epsilon \ge 4\); the version generated by the \({ KDLG }\)-based algorithm is therefore identical to AoS. The decision-tree-based tiling algorithm converts this layout into a tiled-AoS layout with a tile-size of 512 bytes on the FirePro and GeForce, while it suggests SoA on the Tesla. The results can be found in Fig. 5b. They clearly show that, although the \({ KDLG }\)-based algorithm fails to gain any improvement over AoS, the decision-tree-based algorithm as well as the combination of both algorithms exceed the performance of the AoS-based implementation by a factor of 1.38 to 1.5. Furthermore, the resulting performance is comparable or superior to that of a SoA implementation.
SAMPO [7] is an agent-based mosquito point model in OpenCL, designed to simulate mosquito populations in order to understand how vector-borne illnesses (e.g., malaria) may spread. The version available online is already manually optimized for AMD GPUs; we therefore ported this version to a pure AoS layout, where each agent is represented by a single struct with twelve fields. The measurements are displayed in Fig. 5c. The results clearly show that SoA yields much better performance than AoS on all tested GPUs. Applying the \({ KDLG }\)-based algorithm already results in a speedup between 1.54 and 2.18 on the three tested GPUs, which is within \(\pm 10\,\%\) of the SoA version. Applying tiling to the AoS implementation shows good results on the NVIDIA GPUs. The AMD GPU also benefits from tiling, but does not reach the performance of the SoA version or of the version optimized with the \({ KDLG }\)-based algorithm. Applying tiling to the latter version further increases the performance on all evaluated GPUs and leads to speedups over the AoS version of 2.19, 1.59 and 1.96 on the FirePro, GeForce and Tesla, respectively, outperforming any other version we tested. Even the manually optimized implementation is outperformed by \(7\,\%\), \(10\,\%\) and \(18\,\%\) on the FirePro, GeForce and Tesla, respectively.
5 Discussion
The results clearly show that programs with an AoS data layout are not well suited for GPUs: SoA delivers much higher performance on all GPU/program combinations we tested. However, even SoA fails to achieve the maximum performance in some cases. We observed that a tiled-AoS layout usually achieves results that are equal to or better than those of a SoA layout, but tiled-AoS is not suited for all programs. Similarly, splitting structures into several smaller ones based on a \({ KDLG }\) is beneficial for most programs, yet this technique also fails to improve the performance of some applications. Therefore, combining the two algorithms leads to much better overall results than each of them can achieve individually. This is underlined by the results of SAMPO, where the combination of both algorithms not only outperforms each algorithm applied individually, but also leads to higher performance than both a SoA layout and the manually optimized layout.
6 Conclusions
We presented a system to automatically determine an improved data layout for OpenCL programs. Our framework consists of two separate algorithms: the first one constructs a \({ KDLG }\), which is used to split a given struct into several clusters based on the hardware-dependent parameter \(\epsilon \). The second algorithm constructs a decision tree, which is used to determine the tile-size for a certain structure when converting it from AoS to tiled-AoS or SoA layouts.
The combination of both algorithms is crucial, as using only one of them often leads to no improvement over AoS. The layouts proposed by our framework result in speedups of up to 2.22, 1.89 and 2.83 on an AMD FirePro S9000, an NVIDIA GeForce GTX 480 and an NVIDIA Tesla K20m, respectively.
Acknowledgments
This project was funded by the FWF Austrian Science Fund as part of project I 1523 “EnergyAware Autotuning for Scientific Applications” and by the Interreg IV ItalyAustria 5962273 ENACT funded by ERDF and the province of Tirol.
References
1. Batcher, K.E.: Sorting networks and their applications. In: Proceedings of the Spring Joint Computer Conference, AFIPS 1968 (Spring), 30 April–2 May 1968, pp. 307–314. ACM, New York (1968)
2. Black, F., Scholes, M.: The pricing of options and corporate liabilities. J. Polit. Econ. 81, 637–654 (1973)
3. Che, S., Meng, J., Skadron, K.: Dymaxion++: a directive-based API to optimize data layout and memory mapping for heterogeneous systems. In: AsHES 2014 (2014)
4. Che, S., Sheaffer, J.W., Skadron, K.: Dymaxion: optimizing memory access patterns for heterogeneous systems. In: SC 2011, pp. 13:1–13:11. ACM, New York (2011)
5. Kandemir, M., Choudhary, A., Ramanujam, J., Banerjee, P.: A framework for interprocedural locality optimization using both loop and data layout transformations. In: Proceedings of the International Conference on Parallel Processing, pp. 95–102 (1999)
6. Khronos Group: OpenCL 1.2 Specification, April 2012
7. Kofler, K., Davis, G., Gesing, S.: SAMPO: an agent-based mosquito point model in OpenCL. In: ADS 2014, pp. 5:1–5:10. Society for Computer Simulation International, San Diego (2014)
8. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
9. Raman, E., Hundt, R., Mannarswamy, S.: Structure layout optimization for multithreaded programs. In: CGO 2007, pp. 271–282. IEEE Computer Society, Washington (2007)
10. Rodinia: LavaMD, November 2014. http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/LavaMD
11. Rubin, S., Bodík, R., Chilimbi, T.: An efficient profile-analysis framework for data-layout optimizations. In: POPL 2002, pp. 140–153. ACM, New York (2002)
12. RuleQuest Research: Data mining tools See5 and C5.0, October 2014. https://www.rulequest.com/see5info.html
13. Stratton, J.A., Rodrigues, C.I., Sung, I.J., Chang, L.W., Anssari, N., Liu, G.D., Hwu, W.W., Obeid, N.: Algorithm and data optimization techniques for scaling to massively threaded systems. IEEE Comput. 45(8), 26–32 (2012)
14. Strzodka, R.: Data layout optimization for multi-valued containers in OpenCL. J. Parallel Distrib. Comput. 72(9), 1073–1082 (2012)
15. Sung, I.J., Anssari, N., Stratton, J.A., Hwu, W.W.: Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. Int. J. Parallel Program. 40(1), 4–24 (2012)
16. Weber, N., Goesele, M.: Auto-tuning complex array layouts for GPUs. In: Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization, EGPGV 2014. Eurographics Association (2014)