
GPU thread throttling for page-level thrashing reduction via static analysis


Abstract

Unified virtual memory (UVM) was introduced in modern GPUs to provide programmers with a simpler programming model. It migrates memory pages between the GPU and the CPU automatically, reducing the complexity of data management for programmers. However, when a GPU program generates a memory footprint that exceeds the GPU memory capacity, thrashing can occur, leading to significant performance degradation. To address this issue, this paper proposes a thread-throttling technique that restricts the number of active thread groups, thereby alleviating memory oversubscription and improving performance. The proposed method adjusts the number of active thread groups at compile time to ensure that their combined memory footprint fits within the available memory capacity. The effectiveness of the proposed method was evaluated using GPU programs that experience memory oversubscription. The results show that our approach improves the performance of the original programs by 3.44× on average, a 1.53× improvement over static thread throttling.
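To make the throttling idea concrete, the sketch below shows the principle on the host side under unified memory: cap the number of thread blocks in flight so that their estimated footprint fits in free device memory. This is only an illustration of the concept, not the paper's compile-time LLVM pass; the per-block footprint estimate (BYTES_PER_BLOCK), the 50% headroom factor, and the chunked-launch loop are all assumptions made for this example.

```cuda
// Minimal sketch: throttle thread blocks so their estimated footprint
// fits in free GPU memory. Illustrative only; BYTES_PER_BLOCK and the
// headroom factor are assumptions, not values from the paper.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, size_t offset, size_t n) {
    size_t i = offset + (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const size_t N = 1ull << 28;                          // two 1 GiB managed arrays
    const int THREADS = 256;
    // Assumed static estimate: each block touches one tile of x and y.
    const size_t BYTES_PER_BLOCK = 2ull * THREADS * sizeof(float);

    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));             // UVM allocations
    cudaMallocManaged(&y, N * sizeof(float));
    // (initialization of x and y omitted for brevity)

    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);
    // Throttle: cap in-flight blocks so their combined footprint stays
    // within half of the currently free device memory (assumed headroom).
    size_t maxBlocks = (freeB / 2) / BYTES_PER_BLOCK;
    if (maxBlocks == 0) maxBlocks = 1;
    size_t totalBlocks = (N + THREADS - 1) / THREADS;

    for (size_t b = 0; b < totalBlocks; b += maxBlocks) {
        size_t blocks = (totalBlocks - b < maxBlocks) ? (totalBlocks - b) : maxBlocks;
        axpy<<<(unsigned)blocks, THREADS>>>(2.0f, x, y, b * (size_t)THREADS, N);
        cudaDeviceSynchronize();  // serialize chunks so footprints do not accumulate
    }

    cudaFree(x);
    cudaFree(y);
    printf("done\n");
    return 0;
}
```

Chunking the launch and synchronizing between chunks is the simplest way to bound the set of pages live at any one time; the paper instead derives such a bound statically at compile time, but the goal is the same: keep the active footprint within GPU memory capacity.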




Availability of data and materials

The datasets generated and/or analyzed during the current study are available from Hyunjun Kim on reasonable request.


Funding

This work was supported by an NRF grant (2021R1A2C2008877) and an IITP grant (2021000773) funded by the Korea government (MSIT).

Author information


Contributions

HK conceived the presented idea, developed it, and evaluated it. HK wrote the initial manuscript, and HH helped improve the writing and the structure of the manuscript. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Hyunjun Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kim, H., Han, H. GPU thread throttling for page-level thrashing reduction via static analysis. J Supercomput 80, 9829–9847 (2024). https://doi.org/10.1007/s11227-023-05787-y

