LARA: Locality-aware resource allocation to improve GPU memory-access time

Published in The Journal of Supercomputing

Abstract

Memory access, a primary performance bottleneck of any processing unit, also plays a significant role in GPU performance. Beyond the already challenging stages of the GPU's memory access path, low locality among requests considerably increases memory access delay. Despite their immense processing power, GPUs cannot reach their maximum throughput because of memory access bottlenecks. Memory divergence and poor miss locality among L1-missed requests impose significant Last-Level Cache (LLC) contention and main-memory row-switching overheads. Moreover, the interconnection network routes request packets without regard to their locality properties, and such routing considerably disrupts the locality among requests.
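To make the row-switching cost concrete, the short C++ sketch below counts DRAM row-buffer switches for the same six requests arriving interleaved across two thread blocks versus grouped by thread block. The row size and the address streams are illustrative assumptions, not values from the paper.

```cpp
// Illustrative only: counts DRAM row-buffer switches for a request stream.
// The row size and example address streams are assumptions for this sketch.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t kRowSize = 2048;  // assumed row-buffer size in bytes

// Each request to a different row than the currently open one forces a
// precharge/activate pair (a "row switch").
int countRowSwitches(const std::vector<uint64_t>& addrs) {
    int switches = 0;
    uint64_t openRow = UINT64_MAX;
    for (uint64_t a : addrs) {
        uint64_t row = a / kRowSize;
        if (row != openRow) { ++switches; openRow = row; }
    }
    return switches;
}

int main() {
    // Two thread blocks touching different rows: interleaved arrival order
    // (low locality) vs. grouped order (locality preserved).
    std::vector<uint64_t> interleaved = {0, 4096, 64, 4160, 128, 4224};
    std::vector<uint64_t> grouped     = {0, 64, 128, 4096, 4160, 4224};
    std::cout << countRowSwitches(interleaved) << " vs "
              << countRowSwitches(grouped) << " row switches\n";  // 6 vs 2
}
```

Grouping the requests cuts the switch count from 6 to 2 in this toy example, which is precisely the kind of locality that the allocation scheme described below tries to preserve.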

In this paper, we propose Locality-Aware Resource Allocation (LARA), which reduces Streaming Multiprocessor stall time by arbitrating among memory request packets in favor of locality preservation in the GPU's interconnection network. In addition, before memory requests are injected into the interconnection network, they are reordered in the injection-port buffer so that requests from the same thread block are grouped together. Since memory divergence and miss locality among requests are the two main factors driving LLC contention and main-memory row switching, we take a comprehensive approach to improving GPU performance by decreasing the average memory access delay, focusing on request locality to reduce LLC contention overheads and the row-switching rate. Across the evaluated benchmarks, LARA achieves a maximum speedup of 33% and an average speedup of 17%, with no significant impact on chip area or power consumption.
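As a rough illustration of the injection-port reordering idea, the C++ sketch below prefers a pending packet from the same thread block (CTA) as the last injected one and otherwise falls back to the oldest packet so that no request starves. The Packet fields and the arbitration policy are our own assumptions for illustration, not the authors' implementation.

```cpp
// A minimal sketch, not the paper's implementation: reorder pending request
// packets at an SM's injection port so packets from the same thread block
// (CTA) leave back-to-back. Field names here are assumptions.
#include <algorithm>
#include <cstdint>
#include <deque>

struct Packet {
    uint32_t ctaId;  // issuing thread block
    uint64_t addr;   // target memory address
};

class InjectionPort {
    std::deque<Packet> buf_;  // pending packets in arrival order
public:
    void enqueue(const Packet& p) { buf_.push_back(p); }

    // Pick the next packet to inject: prefer one whose CTA matches the
    // previously injected packet; otherwise take the oldest packet.
    Packet dequeue(uint32_t lastCta) {
        auto it = std::find_if(buf_.begin(), buf_.end(),
            [&](const Packet& p) { return p.ctaId == lastCta; });
        if (it == buf_.end()) it = buf_.begin();  // fall back to oldest
        Packet p = *it;
        buf_.erase(it);
        return p;
    }
    bool empty() const { return buf_.empty(); }
};

int main() {
    InjectionPort port;
    port.enqueue({0, 0x0000});
    port.enqueue({1, 0x2000});
    port.enqueue({0, 0x0040});
    uint32_t lastCta = 0;
    while (!port.empty()) {
        Packet p = port.dequeue(lastCta);
        lastCta = p.ctaId;  // CTA 0's two packets leave back-to-back
    }
}
```

A real arbiter would presumably also bound how long a packet can be bypassed (an aging threshold) and match CTA IDs with a small associative lookup rather than the linear scan used here for clarity.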

Author information

Corresponding author

Correspondence to Farshad Safaei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

BiTalebi, H., Safaei, F. LARA: Locality-aware resource allocation to improve GPU memory-access time. J Supercomput 77, 14438–14460 (2021). https://doi.org/10.1007/s11227-021-03854-w
