A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

  • Cedric Nugteren
  • Gert-Jan van den Braak
  • Henk Corporaal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8806)


Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the data-locality potential enabled by this independence. Programmers are therefore required to perform data-locality optimisations such as memory coalescing or loop tiling manually. This work makes a case for locality-aware thread scheduling: automatically re-ordering threads for better locality in order to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering, among other factors, cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies its potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. For example, a case study of a naive matrix multiplication shows an 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
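The effect of thread order on cache behaviour can be illustrated with a small toy model. The sketch below is not the methodology of the paper: it assumes a fully associative LRU cache with one matrix element per line, runs the threads of a naive matrix multiplication strictly one after another (a real GPU interleaves many threads), and uses hypothetical names (`count_misses`, `row_major_order`, `tiled_order`) and illustrative parameters. It only shows that the same set of independent threads can produce fewer cache misses when issued in a tiled order rather than in row-major order.

```python
from collections import OrderedDict


def count_misses(thread_order, n, capacity):
    """Run threads sequentially through an LRU cache and count misses.

    Thread (i, j) computes C[i][j] and reads A[i][k] and B[k][j] for all k.
    Assumption: one element per cache line, fully associative LRU.
    """
    cache = OrderedDict()
    misses = 0
    for i, j in thread_order:
        for k in range(n):
            for addr in (("A", i, k), ("B", k, j)):
                if addr in cache:
                    cache.move_to_end(addr)        # hit: mark as most recent
                else:
                    misses += 1
                    cache[addr] = True
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
    return misses


def row_major_order(n):
    """Default order: thread (i, j) for i = 0..n-1, j = 0..n-1."""
    return [(i, j) for i in range(n) for j in range(n)]


def tiled_order(n, tile):
    """Same threads, grouped into tile x tile blocks of the output matrix."""
    return [(i, j)
            for ti in range(0, n, tile)
            for tj in range(0, n, tile)
            for i in range(ti, min(ti + tile, n))
            for j in range(tj, min(tj + tile, n))]


# Illustrative parameters (assumptions, not taken from the paper).
N, TILE, CAPACITY = 16, 4, 256
baseline = count_misses(row_major_order(N), N, CAPACITY)
reordered = count_misses(tiled_order(N, TILE), N, CAPACITY)
print(f"row-major misses: {baseline}, tiled misses: {reordered}")
```

The kernel body is identical in both runs; only the order in which threads are issued changes, which is the degree of freedom a locality-aware scheduler would exploit.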


Integral Image, Active Thread, Reuse Distance, Thread Schedule, Loop Tiling





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Cedric Nugteren (1)
  • Gert-Jan van den Braak (1)
  • Henk Corporaal (1)
  1. Eindhoven University of Technology, Eindhoven, The Netherlands
