Automatic Data Layout Optimizations for GPUs

  • Klaus Kofler
  • Biagio Cosenza
  • Thomas Fahringer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9233)


Memory optimizations have become increasingly important for fully exploiting the computational power of modern GPUs. The data arrangement has a large impact on performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-Structures (AoS) to Structure-of-Arrays (SoA).
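To make the AoS/SoA distinction concrete, the following minimal sketch computes where a given field of a given element lands in a flat buffer under each layout. It is only an illustration of the two classical layouts; the function names `aos_index`, `soa_index` and `stride_between_neighbors` are hypothetical and do not come from the paper.

```python
# Illustrative only: flat-buffer index arithmetic for the two classical layouts.
# All names here are hypothetical and not taken from the paper.

def aos_index(elem: int, field: int, num_fields: int) -> int:
    """Array-of-Structures: all fields of one element are adjacent."""
    return elem * num_fields + field


def soa_index(elem: int, field: int, num_elems: int) -> int:
    """Structure-of-Arrays: all elements of one field are adjacent."""
    return field * num_elems + elem


def stride_between_neighbors(index_fn, field: int, size_arg: int) -> int:
    """Buffer distance between the same field of elements 0 and 1.

    size_arg is num_fields for aos_index and num_elems for soa_index.
    """
    return index_fn(1, field, size_arg) - index_fn(0, field, size_arg)
```

For a struct with 4 fields and 1024 elements, `stride_between_neighbors(aos_index, 0, 4)` is 4 while `stride_between_neighbors(soa_index, 0, 1024)` is 1. If neighboring work-items read the same field of neighboring elements, the SoA accesses are therefore contiguous, which is one common reason the layout choice affects GPU memory performance.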

This paper presents an optimization infrastructure to automatically determine an improved data layout for OpenCL programs written in AoS layout. Our framework consists of two separate algorithms. The first one constructs a graph-based model, which is used to split the AoS input struct into several clusters of fields, based on hardware-dependent parameters. The second algorithm selects a good per-cluster data layout (e.g., SoA, AoS or an intermediate layout) using a decision tree. Results show that the combination of both algorithms delivers higher performance than either algorithm alone. The layouts proposed by our framework result in speedups of up to 2.22, 1.89 and 2.83 on an AMD FirePro S9000, NVIDIA GeForce GTX 480 and NVIDIA Tesla K20m, respectively, over different AoS sample programs, and up to 1.18 over a manually optimized program.
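The two steps described above (splitting the struct into clusters of fields, then choosing a layout per cluster) can be sketched roughly as follows. This is not the paper's method: the abstract does not specify the graph model or the decision tree, so the sketch swaps in a plain union-find grouping over made-up co-access weights with a fixed threshold, and a tiled index function as one possible form of an intermediate layout. The names `cluster_fields`, `tiled_index`, `co_access` and `threshold` are all assumptions.

```python
# Hypothetical sketch, NOT the paper's algorithms. The co-access weights,
# the threshold and all names are invented for illustration.

def cluster_fields(fields, co_access, threshold):
    """Group fields linked by co-access weights >= threshold (union-find)."""
    parent = {f: f for f in fields}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    for (a, b), weight in co_access.items():
        if weight >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for f in fields:
        clusters.setdefault(find(f), []).append(f)
    return sorted(clusters.values())


def tiled_index(elem, field, num_fields, tile):
    """Intermediate layout: tiles of `tile` elements, SoA inside each tile.

    tile == 1 reproduces AoS; tile == number of elements reproduces SoA.
    """
    tile_id, lane = divmod(elem, tile)
    return tile_id * num_fields * tile + field * tile + lane
```

With fields `["x", "y", "z", "mass"]`, weights `{("x", "y"): 0.9, ("y", "z"): 0.8, ("z", "mass"): 0.1}` and a threshold of 0.5, `cluster_fields` returns `[["mass"], ["x", "y", "z"]]`. Each resulting cluster could then be given its own layout, for example by picking a different `tile` per cluster.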


Keywords: Global Memory, Training Pattern, Data Layout, Graph Base Model, OpenCL Kernel
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This project was funded by the FWF Austrian Science Fund as part of project I 1523 “Energy-Aware Autotuning for Scientific Applications” and by the Interreg IV Italy-Austria 5962-273 EN-ACT funded by ERDF and the province of Tirol.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Klaus Kofler (1)
  • Biagio Cosenza (1, 2)
  • Thomas Fahringer (1)

  1. DPS, University of Innsbruck, Innsbruck, Austria
  2. AES, TU Berlin, Berlin, Germany
