
Cluster Computing, Volume 22, Supplement 1, pp. 871–883

Memory-aware TLP throttling and cache bypassing for GPUs

  • Jun Zhang (Email author)
  • Yanxiang He
  • Fanfan Shen
  • Qing’an Li
  • Hai Tan

Abstract

The general-purpose graphics processing unit (GPGPU) has become one of the most important high-performance platforms for high-throughput applications. However, because a GPGPU runs large numbers of threads concurrently, contention for on-chip resources occurs frequently and has become an important factor limiting GPGPU performance. We propose a memory-aware TLP throttling and cache bypassing (MATB) mechanism that exploits data temporal locality and memory bandwidth. MATB aims to keep cache blocks with good data locality in the L1D cache longer while maintaining on-chip resource utilization. On one hand, when cache contention and on-chip network congestion occur, it alleviates cache contention by preventing memory warps with poor data reuse from being scheduled. On the other hand, it uses cache bypassing to utilize memory bandwidth more effectively. Experimental results show that MATB achieves average performance improvements of 26.6% and 14.2% over GTO and DYNCTA, respectively, with low hardware cost.
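
The abstract describes two coordinated decisions: withholding memory warps with poor data reuse from the scheduler when contention and congestion are detected, and bypassing the L1D cache for requests with no expected reuse. The C++ sketch below illustrates only that control flow; it is not the paper's implementation, and every name, counter, and threshold in it (WarpState, reuse_counter, the 0.6/0.7 contention cutoffs) is an illustrative assumption.

    // Illustrative sketch of the two MATB decisions summarized above.
    // Every name, counter, and threshold here is an assumption made for
    // illustration; the paper's hardware mechanism is not reproduced.
    #include <cstdint>
    #include <vector>

    struct WarpState {
        int      id;
        bool     issues_memory_ops;  // warp is in a memory-access phase
        uint32_t reuse_counter;      // recent L1D hits on this warp's blocks
        bool     schedulable;
    };

    struct ChipTelemetry {
        double l1d_miss_rate;  // proxy for cache contention
        double noc_occupancy;  // proxy for on-chip network congestion
    };

    // TLP throttling: when contention and congestion are both detected,
    // memory warps whose blocks show poor temporal locality are withheld
    // from the scheduler so blocks with good locality stay in L1D longer.
    void throttle(std::vector<WarpState>& warps, const ChipTelemetry& t) {
        const bool contended = t.l1d_miss_rate > 0.6 && t.noc_occupancy > 0.7;
        for (auto& w : warps) {
            const bool poor_reuse = w.issues_memory_ops && w.reuse_counter < 2;
            w.schedulable = !(contended && poor_reuse);
        }
    }

    // Cache bypassing: a request predicted to have no reuse skips L1D,
    // preserving cache capacity and spending memory bandwidth instead.
    bool should_bypass_l1d(uint32_t predicted_reuse) {
        return predicted_reuse == 0;
    }

In real hardware, signals like these would more plausibly be small saturating counters sampled per scheduling epoch than floating-point rates; the types above are chosen only for readability.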

Keywords

GPGPU · TLP throttling · Cache bypassing · Resource contention · On-chip network congestion

Notes

Acknowledgements

The authors would like to thank the reviewers for their valuable suggestions, which helped to improve this work greatly. This work was partially supported by the National Natural Science Foundation of China [Project Nos. 61373039, 61662002, 61462004], the Natural Science Foundation of Jiangxi Province, China [Project Nos. 20151BAB207042, 20161BAB212056], the Key Research and Development Plan of the Scientific Department in Jiangxi Province, China [Project No. 20161BBE50063], and the Science and Technology Project of the Education Department in Jiangxi Province, China [Project No. GJJ150605]. Yanxiang He is the corresponding author.

References

  1. NVIDIA Corporation: NVIDIA's next generation CUDA compute architecture: Fermi. Comput. Syst. 26, 63–72 (2009)
  2. NVIDIA Corporation: NVIDIA's next generation CUDA compute architecture: Kepler GK110. Whitepaper (2012)
  3. Luebke, D., Humphreys, G.: How GPUs work. IEEE Comput. 40(2), 96–100 (2007)
  4. Montrym, J., Moreton, H.: The GeForce 6800. IEEE Micro 25(2), 41–51 (2005)
  5. Nickolls, J., Dally, W.J.: The GPU computing era. IEEE Micro 30(2), 56–69 (2010)
  6. Lindholm, E., Nickolls, J., Oberman, S., et al.: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)
  7. He, Y., Zhang, J., Shen, F., et al.: Thread scheduling optimization of general purpose graphics processing unit: a survey. Chinese J. Comput. 39(9), 1–17 (2016)
  8. AMD Corporation: ATI Stream Computing OpenCL Programming Guide (2010)
  9. Xiang, P., Yang, Y., Zhou, H.: Warp-level divergence in GPUs: characterization, impact, and mitigation. In: Proceedings of the 20th International Symposium on High Performance Computer Architecture, USA, pp. 284–295 (2014)
  10. Li, D.: Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors. Ph.D. dissertation, The University of Texas at Austin, USA, pp. 4–6 (2014)
  11. Rogers, T.G., O'Connor, M., Aamodt, T.M.: Cache-conscious wavefront scheduling. In: Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Canada, pp. 72–83 (2012)
  12. Rogers, T.G., O'Connor, M., Aamodt, T.M.: Divergence-aware warp scheduling. In: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, USA, pp. 99–110 (2013)
  13. Kayiran, O., Jog, A., Kandemir, M.T., et al.: Neither more nor less: optimizing thread-level parallelism for GPGPUs. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, United Kingdom, pp. 157–166 (2013)
  14. Li, C., Yang, Y., Dai, H., et al.: Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In: Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software, USA, pp. 231–242 (2014)
  15. Kim, K., Lee, S., Yoon, M.K., et al.: Warped-preexecution: a GPU pre-execution approach for improving latency hiding. In: Proceedings of the 22nd International Symposium on High Performance Computer Architecture, Spain, pp. 163–165 (2016)
  16. Che, S., Boyer, M., Meng, J., et al.: Rodinia: a benchmark suite for heterogeneous computing. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization, USA, pp. 44–54 (2009)
  17. Bakhoda, A., Yuan, G., Fung, W.L., et al.: Analyzing CUDA workloads using a detailed GPU simulator. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, USA, pp. 163–174 (2009)
  18. NVIDIA Corporation: CUDA C Programming Guide: Design Guide. PG-02829-001 v6.5, NVIDIA, Santa Clara, CA, USA (2014)
  19. NVIDIA CUDA Team: NVIDIA Compute PTX: Parallel Thread Execution. ISA version (2009)
  20. Johnson, T.L., Hwu, W.M.W.: Run-time adaptive cache hierarchy management via reference analysis. In: Proceedings of the 24th International Symposium on Computer Architecture, Denver, CO, USA, pp. 315–326 (1997)
  21. Kharbutli, M., Solihin, Y.: Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 57(4), 433–447 (2008)
  22. Duong, N., Zhao, D., Kim, T., et al.: Improving cache management policies using dynamic reuse distances. In: Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Canada, pp. 389–400 (2012)
  23. Gaur, J., Chaudhuri, M., Subramoney, S.: Bypass and insertion algorithms for exclusive last-level caches. In: Proceedings of the 38th International Symposium on Computer Architecture, USA, pp. 81–92 (2011)
  24. Chen, X., Chang, L.W., Rodrigues, C.I., et al.: Adaptive cache management for energy-efficient GPU computing. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, United Kingdom, pp. 343–355 (2014)
  25. Tian, Y., Puthoor, S., Greathouse, J.L., et al.: Adaptive GPU cache bypassing. In: Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, USA, pp. 25–35 (2015)
  26. Li, C., Song, S.L., Dai, H., et al.: Locality-driven dynamic GPU cache bypassing. In: Proceedings of the 29th ACM International Conference on Supercomputing, USA, pp. 67–77 (2015)
  27. Xie, X., Liang, Y., Wang, Y., et al.: Coordinated static and dynamic cache bypassing for GPUs. In: Proceedings of the 21st International Symposium on High Performance Computer Architecture, USA, pp. 76–88 (2015)
  28. Lee, S.Y., Wu, C.J.: Ctrl-C: instruction-aware control loop based adaptive cache bypassing for GPUs. In: Proceedings of the 34th International Conference on Computer Design, USA, pp. 133–140 (2016)
  29. Zheng, Z., Wang, Z., Lipasti, M.: Adaptive cache and concurrency allocation on GPGPUs. IEEE Comput. Archit. Lett. 14(2), 90–93 (2015)
  30. Lee, M., Song, S., Moon, J., et al.: Improving GPGPU resource utilization through alternative thread block scheduling. In: Proceedings of the 20th International Symposium on High Performance Computer Architecture, USA, pp. 260–271 (2014)
  31. Yoon, M.K., Kim, K., Lee, S., et al.: Virtual thread: maximizing thread-level parallelism beyond GPU scheduling limit. In: Proceedings of the 43rd Annual International Symposium on Computer Architecture, South Korea, pp. 609–621 (2016)
  32. Xie, X., Liang, Y., Li, X., et al.: Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In: Proceedings of the 48th International Symposium on Microarchitecture, USA, pp. 395–406 (2015)
  33. Adriaens, J.T., Compton, K., Kim, N.S., et al.: The case for GPGPU spatial multitasking. In: Proceedings of the 18th International Symposium on High Performance Computer Architecture, USA, pp. 1–12 (2012)
  34. Zhong, J., He, B.: Kernelet: high-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Trans. Parallel Distrib. Syst. 25(6), 1522–1532 (2014)
  35. Xu, Q., Jeon, H., Kim, K., et al.: Warped-slicer: efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming. In: Proceedings of the 43rd International Symposium on Computer Architecture, South Korea, pp. 230–242 (2016)
  36. Park, J.J.K., Park, Y., Mahlke, S.: Dynamic resource management for efficient utilization of multitasking GPUs. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, China, pp. 527–540 (2017)
  37. Park, J.J.K., Park, Y., Mahlke, S.: Chimera: collaborative preemption for multitasking on a shared GPU. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Turkey, pp. 593–606 (2015)
  38. Tanasic, I., Gelado, I., Cabezas, J., et al.: Enabling preemptive multiprogramming on GPUs. In: Proceedings of the 41st International Symposium on Computer Architecture, USA, pp. 193–204 (2014)
  39. Wu, B., Liu, X., Zhou, X., et al.: FLEP: enabling flexible and efficient preemption on GPUs. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, China, pp. 483–496 (2017)
  40. Wang, Z., Yang, J., Melhem, R., et al.: Simultaneous multikernel GPU: multi-tasking throughput processors via fine-grained sharing. In: Proceedings of the International Symposium on High Performance Computer Architecture, Spain, pp. 358–369 (2016)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  • Jun Zhang (1, 2), Email author
  • Yanxiang He (1, 3)
  • Fanfan Shen (1)
  • Qing’an Li (1, 3)
  • Hai Tan (2)

  1. Computer School, Wuhan University, Wuhan, China
  2. School of Software, East China University of Technology, Nanchang, China
  3. State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
