Kernel concurrency opportunities based on GPU benchmarks characterization

  • Pablo Carvalho
  • Rommel Cruz
  • Lucia M. A. Drummond
  • Cristiana Bentes
  • Esteban Clua
  • Edson Cataldo
  • Leandro A. J. Marzulo


Graphics Processing Units (GPUs) have become an important platform for general-purpose computing, thanks to their high performance and low cost when compared to CPUs. Modern GPU architectures evolve constantly and offer ever more hardware resources. To take advantage of all the available resources and increase GPU efficiency, new-generation GPUs support concurrent kernel execution: different kernels can execute at the same time and share the GPU resources. Benchmark suites developed to evaluate GPU performance and scalability should therefore take this aspect into account, which can make them quite different from traditional CPU benchmarks. Currently, SHOC, Parboil, and Rodinia are the main benchmark suites for evaluating GPUs. This work analyzes these benchmark suites in a novel way. We propose to categorize the kernels of each application in these suites by multiple criteria based on their behavior: computation type (integer or floating point), usage of the memory hierarchy, efficiency, and hardware occupancy. Based on the characterization results, we analyze kernel concurrency opportunities. The focus is on disclosing the resource requirements of the kernels in these benchmarks and on explaining their behavior when executed concurrently.
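To make the link between per-kernel resource requirements and concurrency concrete, the following is a minimal sketch, not the authors' method. It assumes a simplified model in which two kernels can run concurrently on one streaming multiprocessor (SM) only if one thread block of each fits within the SM's thread, register, and shared-memory budgets. The names `KernelProfile`, `SMLimits`, `blocks_per_sm`, and `can_share_sm` are hypothetical, and the limit values are illustrative rather than taken from the paper or from a specific GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelProfile:
    """Hypothetical per-kernel resource requirements (not from the paper)."""
    name: str
    threads_per_block: int
    regs_per_thread: int
    shared_mem_per_block: int  # bytes


@dataclass(frozen=True)
class SMLimits:
    """Illustrative per-SM budgets; real values depend on the GPU model."""
    max_threads: int = 2048
    max_registers: int = 65536
    max_shared_mem: int = 49152  # bytes


def blocks_per_sm(k: KernelProfile, sm: SMLimits) -> int:
    """Resident blocks of one kernel per SM under this simplified model.

    Ignores allocation granularity and the hardware cap on blocks per SM.
    """
    by_threads = sm.max_threads // k.threads_per_block
    by_regs = sm.max_registers // (k.regs_per_thread * k.threads_per_block)
    limits = [by_threads, by_regs]
    if k.shared_mem_per_block > 0:
        limits.append(sm.max_shared_mem // k.shared_mem_per_block)
    return min(limits)


def can_share_sm(a: KernelProfile, b: KernelProfile, sm: SMLimits) -> bool:
    """True if one block of each kernel fits on the same SM at once."""
    threads = a.threads_per_block + b.threads_per_block
    regs = (a.regs_per_thread * a.threads_per_block
            + b.regs_per_thread * b.threads_per_block)
    smem = a.shared_mem_per_block + b.shared_mem_per_block
    return (threads <= sm.max_threads
            and regs <= sm.max_registers
            and smem <= sm.max_shared_mem)


if __name__ == "__main__":
    sm = SMLimits()
    light = KernelProfile("light", threads_per_block=256,
                          regs_per_thread=32, shared_mem_per_block=0)
    heavy = KernelProfile("heavy", threads_per_block=1024,
                          regs_per_thread=64, shared_mem_per_block=40960)
    for k in (light, heavy):
        print(k.name, "blocks per SM:", blocks_per_sm(k, sm))
    print("light + light share an SM:", can_share_sm(light, light, sm))
    print("heavy + light share an SM:", can_share_sm(heavy, light, sm))
```

In this toy model, a kernel whose single block already consumes a whole register file leaves no room for a second kernel, whereas two low-footprint kernels can coexist. Real schedulers and hardware are more involved, so the sketch only illustrates why resource characterization matters for concurrency.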


GPU, Benchmark characterization, Concurrent kernel execution



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Instituto de Computação, Universidade Federal Fluminense, Niterói, Brazil
  2. Engenharia de Sistemas e Computação, Universidade do Estado do Rio de Janeiro, Rio de Janeiro, Brazil
  3. Programa de Pós-graduação em Engenharia Elétrica e de Telecomunicações, Universidade Federal Fluminense, Niterói, Brazil
  4. Google, Sunnyvale, USA
