
Criticality-aware priority to accelerate GPU memory access

The Journal of Supercomputing

Abstract

The graphics processing unit (GPU), combined with the CUDA and OpenCL programming models, offers new opportunities to reduce the latency and power consumption of throughput-oriented workloads. A GPU can execute thousands of parallel threads to hide memory access latency. For some memory-intensive workloads, however, there are intervals in which every thread of a core is stalled waiting for data from main memory. In this research, we aim to shorten GPU memory access latency in order to increase thread activity time and decrease core underutilization. To cut the time cores spend idle, we focus on the memory buffer and the interconnection network, prioritizing the packets of the cores with the greatest number of stalled threads. As a result, the most critical packets receive higher priority in arbitration and resource allocation, their memory requests are handled faster, and overall core stall time is reduced. Across the benchmarks used, we observe a maximum speed-up of 28% and an average of 12.5%, with no significant effect on system area or power consumption.
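To make the selection rule concrete, the following C++ sketch models a single arbitration decision; it is our illustration, not the authors' hardware design. All names are hypothetical: Packet carries the id of the core that issued the request, update_stall_count stands in for whatever mechanism reports per-core stalled-thread counts, and arbitrate grants the packet whose source core currently has the most stalled threads.

// Minimal sketch of criticality-aware packet arbitration (hypothetical
// names throughout; assumes stalled-thread counts are reported per core).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Packet {
    std::uint32_t source_core;  // core that issued the memory request
    std::uint64_t address;      // requested memory address
};

class CriticalityAwareArbiter {
public:
    explicit CriticalityAwareArbiter(std::size_t num_cores)
        : stalled_threads_(num_cores, 0) {}

    // The core model calls this whenever its stalled-thread count changes.
    void update_stall_count(std::uint32_t core, std::uint32_t stalled) {
        stalled_threads_[core] = stalled;
    }

    // Among packets competing for an output port, grant the one whose
    // source core has the most stalled threads; ties go to the oldest
    // (lowest-index) packet so no core is starved indefinitely.
    std::size_t arbitrate(const std::vector<Packet>& candidates) const {
        std::size_t winner = 0;
        for (std::size_t i = 1; i < candidates.size(); ++i) {
            if (stalled_threads_[candidates[i].source_core] >
                stalled_threads_[candidates[winner].source_core]) {
                winner = i;
            }
        }
        return winner;
    }

private:
    std::vector<std::uint32_t> stalled_threads_;  // indexed by core id
};

int main() {
    CriticalityAwareArbiter arbiter(4);
    arbiter.update_stall_count(0, 8);   // core 0: 8 threads stalled
    arbiter.update_stall_count(2, 32);  // core 2: 32 threads stalled
    std::vector<Packet> port = {{0, 0x1000}, {2, 0x2000}, {1, 0x3000}};
    // Core 2's packet is the most critical, so it wins arbitration.
    std::printf("granted packet %zu\n", arbiter.arbitrate(port));
    return 0;
}

In real hardware the same policy would sit in the router's arbitration and resource-allocation stages, with stall counts conveyed alongside request packets; the sketch only captures the selection rule itself.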



Author information


Corresponding author

Correspondence to Farshad Safaei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bitalebi, H., Safaei, F. Criticality-aware priority to accelerate GPU memory access. J Supercomput 79, 188–213 (2023). https://doi.org/10.1007/s11227-022-04657-3

