A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU

International Journal of Parallel Programming

Abstract

GPGPUs achieve high computing performance through massive parallelism. The cooperative-thread-array (CTA) schedulers in current GPGPUs greedily issue CTAs to GPU cores as soon as resources become available, in order to maximize thread-level parallelism. Because the memory controller favors locality, CTA execution time varies across cores, which leads to an imbalance in CTA issuance among the cores. This load imbalance leaves computing resources under-utilized and creates an opportunity for further performance improvement, yet existing warp and CTA scheduling policies do not take load balance into account. We propose CLASO, a credit-based load-balance-aware CTA scheduling optimization scheme that piggybacks on a standard GPGPU scheduling system. CLASO uses credits to limit the number of CTAs issued to each core, preventing both greedy issuance to faster-executing cores and starvation of the remaining cores. In addition, CLASO employs global credits and two tuning parameters, active levels and loose levels, to further improve load balance and robustness. Rather than being a standalone scheduling policy, CLASO is compatible with existing CTA and warp schedulers. Experiments on several representative benchmarks show that CLASO effectively improves load balance, reducing idle cycles by 52.4% on average, and achieves up to 26.6% speedup over the baseline GPGPU scheduling policy.
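
The abstract describes the credit mechanism only at a high level. The following is a minimal, simulator-style sketch of how a credit-limited CTA issue loop might look, assuming per-core credit counters that are decremented on each issue and replenished from a global pool between rounds; the names (Core, CreditScheduler, select_core, maybe_refill) and the round-based refill policy are illustrative assumptions rather than details taken from the paper, and the active/loose level tuning parameters are omitted.

    #include <algorithm>
    #include <vector>

    // Hypothetical per-core state; field names are illustrative only.
    struct Core {
        int credits;         // CTAs this core may still receive in the current round
        bool has_free_slot;  // hardware resources available for one more CTA
    };

    class CreditScheduler {
    public:
        CreditScheduler(int num_cores, int credits_per_core, int global_credits)
            : cores_(num_cores, Core{credits_per_core, true}),
              credits_per_core_(credits_per_core),
              global_credits_(global_credits) {}

        // Pick a core for the next pending CTA. A core is eligible only if it
        // has both a free hardware slot and a credit left, so a fast-running
        // core cannot greedily absorb every pending CTA while others starve.
        int select_core() {
            for (int i = 0; i < static_cast<int>(cores_.size()); ++i) {
                if (cores_[i].has_free_slot && cores_[i].credits > 0) {
                    --cores_[i].credits;
                    return i;   // issue the CTA to core i
                }
            }
            return -1;          // no eligible core; the CTA waits for the next round
        }

        // Once every core has spent its credits, start a new round, drawing the
        // refill from a global pool (a stand-in for the paper's global credits).
        void maybe_refill() {
            for (const Core& c : cores_)
                if (c.credits > 0) return;      // current round not finished
            for (Core& c : cores_) {
                int grant = std::min(credits_per_core_, global_credits_);
                c.credits += grant;
                global_credits_ -= grant;
            }
        }

    private:
        std::vector<Core> cores_;
        int credits_per_core_;
        int global_credits_;
    };

In this sketch the per-core credit bound is what prevents greedy issuance, while the global pool caps how many CTAs a refill round can hand out; the paper's active and loose levels would additionally tune how strictly that bound is enforced.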

Acknowledgments

This research is partially sponsored by the U.S. National Science Foundation (NSF) Grants CCF-1102624 and CNS-1218960, and by the National Natural Science Foundation of China Grants 61033012 and 11372067. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to He Guo.

About this article

Cite this article

Yu, Y., He, X., Guo, H. et al. A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU. Int J Parallel Prog 44, 109–129 (2016). https://doi.org/10.1007/s10766-014-0318-5
