Identifying Optimization Opportunities Within Kernel Execution in GPU Codes

  • Robert Lim
  • Allen Malony
  • Boyana Norris
  • Nick Chaimov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9523)

Abstract

Tuning codes for GPGPU architectures is challenging because few performance tools can pinpoint the exact causes of execution bottlenecks. While profiling an application can reveal its execution behavior on a particular architecture, the abundance of collected information can also overwhelm the user. Moreover, performance counters provide cumulative values but do not attribute events to code regions, which makes identifying performance hot spots difficult. This research focuses on characterizing the behavior of GPU application kernels and their performance at the node level by providing a visualization and metrics display that indicates the behavior of the application with respect to the underlying architecture. We demonstrate the effectiveness of our techniques with LAMMPS and LULESH application case studies on a variety of GPU architectures. By sampling instruction mixes for kernel execution runs, we reveal a variety of intrinsic program characteristics relating to computation, memory, and control flow.
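To make the idea of instruction-mix characterization concrete, the sketch below classifies the opcodes of a disassembled PTX listing into computation, memory, and control-flow groups. It is an illustrative assumption, not the paper's toolchain: the opcode sets, category boundaries, and the input file name `kernel.ptx` are placeholders.

```python
# Minimal sketch: approximate an instruction mix by classifying PTX opcodes
# into computation, memory, and control-flow categories.
# The opcode sets and the input file name are illustrative assumptions,
# not the categorization used by the paper's tools.
import re
from collections import Counter

MEMORY = {"ld", "st", "ldu", "atom", "red", "tex", "surf", "cvta"}
CONTROL = {"bra", "call", "ret", "exit", "bar"}

def instruction_mix(ptx_path):
    counts = Counter()
    with open(ptx_path) as f:
        for line in f:
            line = line.strip()
            # Skip directives, comments, braces, labels, and blank lines.
            if not line or line.startswith((".", "//", "{", "}")) or line.endswith(":"):
                continue
            # The opcode root is the first token before any type suffix,
            # e.g. "ld.global.f32" -> "ld", "fma.rn.f32" -> "fma".
            opcode = re.split(r"[.\s]", line, 1)[0]
            if opcode in MEMORY:
                counts["memory"] += 1
            elif opcode in CONTROL:
                counts["control"] += 1
            else:
                counts["computation"] += 1
    total = sum(counts.values()) or 1
    return {category: n / total for category, n in counts.items()}

if __name__ == "__main__":
    # A PTX listing can be produced with, e.g., `nvcc -ptx kernel.cu`.
    print(instruction_mix("kernel.ptx"))
```

A static count such as this gives per-kernel fractions of compute, memory, and control-flow instructions; the paper's approach additionally samples these mixes over kernel execution runs to relate them to observed performance.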


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Robert Lim
  • Allen Malony
  • Boyana Norris
  • Nick Chaimov
  1. Performance Research Laboratory, High-Performance Computing Laboratory, University of Oregon, Eugene, USA
