An Empirical Evaluation of GPGPU Performance Models

  • Souley Madougou
  • Ana Lucia Varbanescu
  • Cees de Laat
  • Rob van Nieuwpoort
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8805)


Computing systems today rely on massively parallel and heterogeneous architectures that promise very high peak performance. Yet most applications achieve only small fractions of this performance. While both programmers and architects have clear opinions about the causes of this performance gap, finding and quantifying the real problems remains a task for performance modeling tools. In this paper, we sketch the landscape of modern GPUs’ performance limiters and optimization opportunities, and examine in detail the modeling attempts for GPU-based systems. We highlight the specific features of the relevant contributions in this field, along with the optimization and design spaces they explore. We further use a typical kernel example (tiled dense matrix multiplication) to assess the efficacy and usability of a set of promising approaches. We conclude that the available GPU performance modeling solutions are very sensitive to application and platform changes, and require significant effort for tuning and calibration whenever new analyses are required.
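The paper's running example is tiled dense matrix multiplication, the canonical kernel for studying GPU performance models. As a rough illustration of the tiling idea only (not the paper's GPU kernel), the sketch below blocks the computation into TILE × TILE sub-matrices on the CPU; on a GPU, each tile pair would be staged in shared memory by a thread block. The tile size and function name are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of tiled (blocked) dense matrix multiplication.
# Tiling mirrors how a GPU kernel stages sub-blocks of A and B in
# shared memory; here the blocking is shown with plain Python lists.
# TILE is an illustrative block size, not a value from the paper.

TILE = 2

def tiled_matmul(A, B, n):
    """Multiply two n x n matrices, processing TILE x TILE blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, TILE):            # block row of C
        for jj in range(0, n, TILE):        # block column of C
            for kk in range(0, n, TILE):    # block along the shared dimension
                # Multiply one pair of tiles; on a GPU this inner loop nest
                # would read tiles loaded into shared memory by the block.
                for i in range(ii, min(ii + TILE, n)):
                    for j in range(jj, min(jj + TILE, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + TILE, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(tiled_matmul(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

Performance models such as those surveyed in the paper reason about exactly this loop structure: how tile size trades off data reuse against occupancy and memory traffic.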


Keywords: performance modeling, performance analysis and prediction, GPU architectures





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Souley Madougou¹
  • Ana Lucia Varbanescu¹
  • Cees de Laat¹
  • Rob van Nieuwpoort²

  1. University of Amsterdam, Amsterdam, The Netherlands
  2. Netherlands eScience Center, Amsterdam, The Netherlands
