Benchmarking the Memory Hierarchy of Modern GPUs

  • Xinxin Mei
  • Kaiyong Zhao
  • Chengjian Liu
  • Xiaowen Chu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8707)


Memory access efficiency is a key factor in fully exploiting the computational power of Graphics Processing Units (GPUs). However, vendors do not release many details of the GPU memory hierarchy. We propose a novel fine-grained benchmarking approach and apply it to two popular GPU architectures, Fermi and Kepler, to expose previously unknown characteristics of their memory hierarchies. Specifically, we investigate the structures of different cache systems, such as the data cache, the texture cache, and the translation lookaside buffer (TLB). We also investigate the impact of bank conflicts on shared memory access latency. Our benchmarking results offer a better understanding of the poorly documented GPU memory hierarchy, which can help in software optimization and in the modelling of GPU architectures. Our source code and experimental results are publicly available.
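
As a rough illustration of why bank conflicts matter for shared memory latency, the following minimal sketch counts how many threads of a warp fall on the same bank for a given access stride. It is not the benchmark code from the paper. The function name `bank_conflict_degree`, the 32-thread warp, the 32 banks, and the mapping of consecutive 4-byte words to consecutive banks are all assumptions chosen for illustration; real devices may use other bank widths or mappings.

```python
from collections import Counter


def bank_conflict_degree(stride_words, num_threads=32, num_banks=32):
    """Return the largest number of threads that hit the same bank.

    Illustrative assumptions (not taken from the paper):
    - thread i reads the 4-byte word at index i * stride_words;
    - word w lives in bank w % num_banks;
    - stride_words >= 1, so every thread reads a distinct word.
    """
    if stride_words < 1:
        raise ValueError("stride_words must be >= 1")
    # Count how many threads map to each bank.
    hits = Counter((i * stride_words) % num_banks for i in range(num_threads))
    return max(hits.values())


if __name__ == "__main__":
    # A degree of 1 means conflict-free; a degree of n means an n-way conflict.
    for stride in (1, 2, 3, 4, 8, 16, 32):
        print(f"stride {stride:2d}: {bank_conflict_degree(stride)}-way")
```

Under these assumptions, odd strides are conflict-free, while a stride of 32 words sends every thread of the warp to the same bank, so the accesses would have to be serialized.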



Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Xinxin Mei (1)
  • Kaiyong Zhao (1)
  • Chengjian Liu (1)
  • Xiaowen Chu (1)

  1. Department of Computer Science, Hong Kong Baptist University, Hong Kong
