
Criticality-aware priority to accelerate GPU memory access

The Journal of Supercomputing

Abstract

The graphics processing unit (GPU), combined with the CUDA and OpenCL programming models, offers new opportunities to reduce the latency and power consumption of throughput-oriented workloads. A GPU can execute thousands of parallel threads to hide memory access latency. For some memory-intensive workloads, however, there are intervals in which every thread of a core is stalled waiting for data from main memory. In this research, we aim to shorten GPU memory access latency in order to increase thread activity time and decrease core underutilization. To cut the time cores spend idle, we focus on the memory buffer and the interconnection network, prioritizing the packets of the cores with the greatest number of stalled threads. As a result, the most critical packets receive higher priority in arbitration and resource allocation, their memory requests are handled faster, and overall core stall time is reduced. Across the benchmarks used, we observe a maximum speed-up of 28% and an average of 12.5%, with no significant effect on system area or power consumption.
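To make the selection rule concrete, the following C++ sketch models a single arbitration decision; it is our illustration, not the authors' hardware design. All names are hypothetical: Packet carries the id of the core that issued the request, update_stall_count stands in for whatever mechanism reports per-core stalled-thread counts, and arbitrate grants the packet whose source core currently has the most stalled threads.

// Minimal sketch of criticality-aware packet arbitration (hypothetical
// names throughout; assumes stalled-thread counts are reported per core).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Packet {
    std::uint32_t source_core;  // core that issued the memory request
    std::uint64_t address;      // requested memory address
};

class CriticalityAwareArbiter {
public:
    explicit CriticalityAwareArbiter(std::size_t num_cores)
        : stalled_threads_(num_cores, 0) {}

    // The core model calls this whenever its stalled-thread count changes.
    void update_stall_count(std::uint32_t core, std::uint32_t stalled) {
        stalled_threads_[core] = stalled;
    }

    // Among packets competing for an output port, grant the one whose
    // source core has the most stalled threads; ties go to the oldest
    // (lowest-index) packet so no core is starved indefinitely.
    std::size_t arbitrate(const std::vector<Packet>& candidates) const {
        std::size_t winner = 0;
        for (std::size_t i = 1; i < candidates.size(); ++i) {
            if (stalled_threads_[candidates[i].source_core] >
                stalled_threads_[candidates[winner].source_core]) {
                winner = i;
            }
        }
        return winner;
    }

private:
    std::vector<std::uint32_t> stalled_threads_;  // indexed by core id
};

int main() {
    CriticalityAwareArbiter arbiter(4);
    arbiter.update_stall_count(0, 8);   // core 0: 8 threads stalled
    arbiter.update_stall_count(2, 32);  // core 2: 32 threads stalled
    std::vector<Packet> port = {{0, 0x1000}, {2, 0x2000}, {1, 0x3000}};
    // Core 2's packet is the most critical, so it wins arbitration.
    std::printf("granted packet %zu\n", arbiter.arbitrate(port));
    return 0;
}

In real hardware the same policy would sit in the router's arbitration and resource-allocation stages, with stall counts conveyed alongside request packets; the sketch only captures the selection rule itself.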



Author information


Corresponding author

Correspondence to Farshad Safaei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bitalebi, H., Safaei, F. Criticality-aware priority to accelerate GPU memory access. J Supercomput 79, 188–213 (2023). https://doi.org/10.1007/s11227-022-04657-3

