LARA: Locality-aware resource allocation to improve GPU memory-access time

Published in The Journal of Supercomputing

Abstract

Memory access, a primary performance bottleneck of any processing unit, also plays a significant role in GPU performance. Beyond the already challenging stages of the GPU's memory access path, low locality among requests considerably increases memory access delay. Despite their immense processing power, GPUs cannot reach their maximum throughput because of memory access bottlenecks. Memory divergence and poor miss locality among L1-missed requests impose significant Last-Level Cache (LLC) contention and main-memory row-switching overheads. Moreover, the interconnection network routes request packets without regard to their locality properties, and such routing considerably disrupts the locality among requests.
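To make the row-switching cost concrete, the short C++ sketch below counts DRAM row-buffer switches for the same six requests arriving interleaved across two thread blocks versus grouped by thread block. The row size and the address streams are illustrative assumptions, not values from the paper.

```cpp
// Illustrative only: counts DRAM row-buffer switches for a request stream.
// The row size and example address streams are assumptions for this sketch.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t kRowSize = 2048;  // assumed row-buffer size in bytes

// Each request to a different row than the currently open one forces a
// precharge/activate pair (a "row switch").
int countRowSwitches(const std::vector<uint64_t>& addrs) {
    int switches = 0;
    uint64_t openRow = UINT64_MAX;
    for (uint64_t a : addrs) {
        uint64_t row = a / kRowSize;
        if (row != openRow) { ++switches; openRow = row; }
    }
    return switches;
}

int main() {
    // Two thread blocks touching different rows: interleaved arrival order
    // (low locality) vs. grouped order (locality preserved).
    std::vector<uint64_t> interleaved = {0, 4096, 64, 4160, 128, 4224};
    std::vector<uint64_t> grouped     = {0, 64, 128, 4096, 4160, 4224};
    std::cout << countRowSwitches(interleaved) << " vs "
              << countRowSwitches(grouped) << " row switches\n";  // 6 vs 2
}
```

Grouping the requests cuts the switch count from 6 to 2 in this toy example, which is precisely the kind of locality that the allocation scheme described below tries to preserve.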

In this paper, we propose Locality-Aware Resource Allocation (LARA), which reduces Streaming Multiprocessor stall time by arbitrating among memory request packets in favor of locality preservation in the GPU's interconnection network. In addition, before memory requests are injected into the interconnection network, they are reordered in the injection-port buffer so that requests from the same thread block are grouped together. Since memory divergence and miss locality among requests are the two main factors driving LLC contention and main-memory row switching, we take a comprehensive approach to improving GPU performance by decreasing the average memory access delay, focusing on request locality to reduce LLC contention overheads and the row-switching rate. Across the evaluated benchmarks, LARA achieves a maximum speedup of 33% and an average speedup of 17%, with no significant impact on chip area or power consumption.
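As a rough illustration of the injection-port reordering idea, the C++ sketch below prefers a pending packet from the same thread block (CTA) as the last injected one and otherwise falls back to the oldest packet so that no request starves. The Packet fields and the arbitration policy are our own assumptions for illustration, not the authors' implementation.

```cpp
// A minimal sketch, not the paper's implementation: reorder pending request
// packets at an SM's injection port so packets from the same thread block
// (CTA) leave back-to-back. Field names here are assumptions.
#include <algorithm>
#include <cstdint>
#include <deque>

struct Packet {
    uint32_t ctaId;  // issuing thread block
    uint64_t addr;   // target memory address
};

class InjectionPort {
    std::deque<Packet> buf_;  // pending packets in arrival order
public:
    void enqueue(const Packet& p) { buf_.push_back(p); }

    // Pick the next packet to inject: prefer one whose CTA matches the
    // previously injected packet; otherwise take the oldest packet.
    Packet dequeue(uint32_t lastCta) {
        auto it = std::find_if(buf_.begin(), buf_.end(),
            [&](const Packet& p) { return p.ctaId == lastCta; });
        if (it == buf_.end()) it = buf_.begin();  // fall back to oldest
        Packet p = *it;
        buf_.erase(it);
        return p;
    }
    bool empty() const { return buf_.empty(); }
};

int main() {
    InjectionPort port;
    port.enqueue({0, 0x0000});
    port.enqueue({1, 0x2000});
    port.enqueue({0, 0x0040});
    uint32_t lastCta = 0;
    while (!port.empty()) {
        Packet p = port.dequeue(lastCta);
        lastCta = p.ctaId;  // CTA 0's two packets leave back-to-back
    }
}
```

A real arbiter would presumably also bound how long a packet can be bypassed (an aging threshold) and match CTA IDs with a small associative lookup rather than the linear scan used here for clarity.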

Author information

Corresponding author

Correspondence to Farshad Safaei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

BiTalebi, H., Safaei, F. LARA: Locality-aware resource allocation to improve GPU memory-access time. J Supercomput 77, 14438–14460 (2021). https://doi.org/10.1007/s11227-021-03854-w
