Advertisement

Memory-Access-Pattern Analysis Techniques for OpenCL Kernels

  • Gangwon JoEmail author
  • Jaehoon Jung
  • Jiyoung Park
  • Jaejin Lee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11403)

Abstract

Previous pattern-by-pattern approaches for OpenCL/CUDA memory optimization require explicit user interventions to extract the kernel memory access patterns. This paper presents an automatic memory-access-pattern analysis framework called MAPA. It is based on a source-level analysis technique derived from traditional symbolic analyses and a run-time pattern selection technique. We propose formal notations of the memory access patterns, analysis algorithms based on the SSA form, and the integration method of MAPA with auto-tuners. The experimental results indicate that MAPA properly analyzes 116 real-world OpenCL kernels from Rodinia and Parboil benchmark suites. We also show an auto-tuner case study, Auto-Dymaxion, which exploits MAPA to automate a memory-access-pattern-based optimization approach.

References

  1. 1.
  2. 2.
    Ballance, R.A., Maccabe, A.B., Ottenstein, K.J.: The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. In: Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, pp. 257–271 (1990)Google Scholar
  3. 3.
    Bauer, M., Cook, H., Khailany, B.: CudaDMA. http://lightsighter.github.io/CudaDMA/
  4. 4.
    Bauer, M., Cook, H., Khailany, B.: CudaDMA: optimizing GPU memory bandwidth via warp specialization. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)Google Scholar
  5. 5.
    Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. In: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 101–113 (2008)Google Scholar
  6. 6.
    Brown, W.M., Wang, P., Plimpton, S.J., Tharrington, A.N.: Implementing molecular dynamics on hybrid high performance computers - short range forces. Comput. Phys. Commun. 182(4), 898–911 (2011)CrossRefGoogle Scholar
  7. 7.
    Che, S., et al.: Rodinia: a benchmark suite for heterogeneous computing. In: Proceedings of 2009 IEEE International Symposium on Workload Characterization, pp. 44–54 (2009)Google Scholar
  8. 8.
    Che, S., Sheaffer, J.W., Skadron, K.: Dymaxion: optimizing memory access patterns for heterogeneous systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)Google Scholar
  9. 9.
    Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13(4), 451–490 (1991)CrossRefGoogle Scholar
  10. 10.
    Eklund, A., Dufort, P., Forsberg, D., LaConte, S.M.: Medical image processing on the GPU - past, present and future. Med. Image Anal. 17(8), 1073–1094 (2013)Google Scholar
  11. 11.
    Götz, A.W., Williamson, M.J., Xu, D., Poole, D., Le Grand, S., Walker, R.C.: Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. generalized born. J. Chem. Theory Comput. 8(5), 1542–1555 (2012)CrossRefGoogle Scholar
  12. 12.
    Grosser, T., Groesslinger, A., Lengauer, C.: Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Process. Lett. 22(4), 1250010 (2012)MathSciNetCrossRefGoogle Scholar
  13. 13.
    Haghighat, M.R., Polychronopoulos, C.D.: Symbolic analysis for parallelizing compilers. ACM Trans. Program. Lang. Syst. 18, 477–518 (1996)CrossRefGoogle Scholar
  14. 14.
    Jang, B., Schaa, D., Mistry, P., Kaeli, D.: Exploiting memory access patterns to improve memory performance in data parallel architectures. IEEE Trans. Parallel Distrib. Syst. 22(1), 105–118 (2011)CrossRefGoogle Scholar
  15. 15.
    Khronos Group: SPIR generator/Clang. https://github.com/KhronosGroup/SPIR
  16. 16.
    Kim, J., Kim, H., Lee, J.H., Lee, J.: Achieving a single compute device image in OpenCL for multiple GPUs. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, pp. 277–288 (2011)Google Scholar
  17. 17.
  18. 18.
    NVIDIA: CUDA C best practices guide (2015). http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
  19. 19.
    Pop, S., Cohen, A., Bastoul, C., Girbal, S., Silber, G.A., Vasilache, N.: GRAPHITE: polyhedral analyses and optimizations for GCC. In: Proceedings of the 2006 GCC Developers Summit (2006)Google Scholar
  20. 20.
    Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A.: High-throughput sequence alignment using graphics processing units. BMC Bioinform. 8(1), 1–10 (2007)CrossRefGoogle Scholar
  21. 21.
    Seo, S., Lee, J., Jo, G., Lee, J.: Automatic OpenCL work-group size selection for multicore CPUs. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 387–397 (2013)Google Scholar
  22. 22.
    Steensgaard, B.: Points-to analysis in almost linear time. In: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 32–41 (1996)Google Scholar
  23. 23.
    Stratton, J.A., et al.: Optimization and architecture effects on GPU computing workload performance. In: Proceedings of Innovative Parallel Computing (InPar) (2012)Google Scholar
  24. 24.
    Stratton, J.A., et al.: Parboil: a revised benchmark suite for scientific and commercial throughput computing. Technical report, IMPACT-12-01, IMPACT, University of Illinois at Urbana-Champaign (2012)Google Scholar
  25. 25.
    Tal, B.N., Levy, E., Barak, A., Rubin, E.: Memory access patterns: the missing piece of the multi-GPU puzzle. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2015)Google Scholar
  26. 26.
    Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. 36(5–6), 232–240 (2010)CrossRefGoogle Scholar
  27. 27.
    Tu, P., Padua, D.: Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In: Proceedings of the 9th International Conference on Supercomputing, pp. 414–423 (1995)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Gangwon Jo
    • 1
    Email author
  • Jaehoon Jung
    • 1
  • Jiyoung Park
    • 1
  • Jaejin Lee
    • 1
  1. 1.Center for Manycore Programming, Department of Computer Science and EngineeringSeoul National UniversitySeoulKorea

Personalised recommendations