A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU

International Journal of Parallel Programming

Abstract

GPGPUs achieve high computing performance through massive parallelism. The cooperative-thread-array (CTA) schedulers in current GPGPUs greedily issue CTAs to GPU cores as soon as resources become available, in order to maximize thread-level parallelism. Because the memory controller favors locality, CTA execution time varies across cores, which leads to an imbalance in CTA issuance among the cores. This load imbalance leaves computing resources under-utilized and creates an opportunity for further performance improvement, yet existing warp and CTA scheduling policies do not take load balance into account. We propose CLASO, a credit-based load-balance-aware CTA scheduling optimization scheme that piggybacks on a standard GPGPU scheduling system. CLASO uses credits to limit the number of CTAs issued to each core, preventing both greedy issuance to faster-executing cores and starvation of the remaining cores. In addition, CLASO employs global credits and two tuning parameters, active levels and loose levels, to further improve load balance and robustness. Rather than being a standalone scheduling policy, CLASO is compatible with existing CTA and warp schedulers. Experiments on several representative benchmarks show that CLASO effectively improves load balance, reducing idle cycles by 52.4% on average, and achieves up to 26.6% speedup over the baseline GPGPU scheduling policy.
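
The abstract describes the credit mechanism only at a high level. The following is a minimal, simulator-style sketch of how a credit-limited CTA issue loop might look, assuming per-core credit counters that are decremented on each issue and replenished from a global pool between rounds; the names (Core, CreditScheduler, select_core, maybe_refill) and the round-based refill policy are illustrative assumptions rather than details taken from the paper, and the active/loose level tuning parameters are omitted.

    #include <algorithm>
    #include <vector>

    // Hypothetical per-core state; field names are illustrative only.
    struct Core {
        int credits;         // CTAs this core may still receive in the current round
        bool has_free_slot;  // hardware resources available for one more CTA
    };

    class CreditScheduler {
    public:
        CreditScheduler(int num_cores, int credits_per_core, int global_credits)
            : cores_(num_cores, Core{credits_per_core, true}),
              credits_per_core_(credits_per_core),
              global_credits_(global_credits) {}

        // Pick a core for the next pending CTA. A core is eligible only if it
        // has both a free hardware slot and a credit left, so a fast-running
        // core cannot greedily absorb every pending CTA while others starve.
        int select_core() {
            for (int i = 0; i < static_cast<int>(cores_.size()); ++i) {
                if (cores_[i].has_free_slot && cores_[i].credits > 0) {
                    --cores_[i].credits;
                    return i;   // issue the CTA to core i
                }
            }
            return -1;          // no eligible core; the CTA waits for the next round
        }

        // Once every core has spent its credits, start a new round, drawing the
        // refill from a global pool (a stand-in for the paper's global credits).
        void maybe_refill() {
            for (const Core& c : cores_)
                if (c.credits > 0) return;      // current round not finished
            for (Core& c : cores_) {
                int grant = std::min(credits_per_core_, global_credits_);
                c.credits += grant;
                global_credits_ -= grant;
            }
        }

    private:
        std::vector<Core> cores_;
        int credits_per_core_;
        int global_credits_;
    };

In this sketch the per-core credit bound is what prevents greedy issuance, while the global pool caps how many CTAs a refill round can hand out; the paper's active and loose levels would additionally tune how strictly that bound is enforced.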

Acknowledgments

This research is partially sponsored by the U.S. National Science Foundation (NSF) Grants CCF-1102624 and CNS-1218960, and by the National Natural Science Foundation of China Grants 61033012 and 11372067. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to He Guo.

About this article

Cite this article

Yu, Y., He, X., Guo, H. et al. A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU. Int J Parallel Prog 44, 109–129 (2016). https://doi.org/10.1007/s10766-014-0318-5
