
CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

Frontiers of Information Technology & Electronic Engineering

Abstract

As we approach the exascale era in supercomputing, designing a balanced computer system that combines powerful computing ability with low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in recent supercomputers. It adopts a large number of threads to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long access latencies because of their poor warp scheduling methods. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) that combines the data reuse information with time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
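The eviction policy outlined in the abstract — a PC-based locality detector that tracks per-PC reuse, with the PCAU combining reuse counts and time stamps to evict the line least likely to be reused — can be sketched as a toy fully associative cache. This is an illustrative model only, not the paper's hardware design; all class and variable names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int       # block address tag
    pc: int        # PC of the load instruction that allocated this line
    last_use: int  # logical time stamp of the most recent access

class LPCCache:
    """Toy fully associative cache sketching PC-based
    locality-protected eviction (illustrative, not the paper's RTL)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}   # tag -> Line
        self.reuse = {}   # pc -> observed reuse (hit) count for that PC
        self.clock = 0

    def access(self, tag, pc):
        """Returns True on a hit, False on a miss."""
        self.clock += 1
        if tag in self.lines:
            # Hit: credit the allocating PC with one more reuse.
            line = self.lines[tag]
            self.reuse[line.pc] = self.reuse.get(line.pc, 0) + 1
            line.last_use = self.clock
            return True
        if len(self.lines) >= self.num_lines:
            # Miss with a full cache: evict the line whose PC shows the
            # least reuse, breaking ties by oldest time stamp (LRU).
            victim = min(self.lines.values(),
                         key=lambda l: (self.reuse.get(l.pc, 0), l.last_use))
            del self.lines[victim.tag]
        self.lines[tag] = Line(tag, pc, self.clock)
        self.reuse.setdefault(pc, 0)
        return False
```

With a two-line cache, a line whose PC has already demonstrated reuse is protected over a more recently installed line with no observed reuse, which is the locality-protecting behaviour (as opposed to pure LRU, which would evict the older line).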




Author information


Correspondence to Yang Zhang.

Additional information

Project supported by the National Natural Science Foundation of China (No. 61170083) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20114307110001).


About this article


Cite this article

Zhang, Y., Xing, Zc., Liu, C. et al. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs. Frontiers Inf Technol Electronic Eng 19, 206–220 (2018). https://doi.org/10.1631/FITEE.1700059

