
A quantitative evaluation of unified memory in GPUs

Published in The Journal of Supercomputing.

Abstract

The introduction of unified memory and demand paging has simplified programming graphics processing units (GPUs) and has enabled oversubscribing GPU memory. However, the overhead of page management makes page faults a performance bottleneck, and the page eviction policy cannot always mitigate the slowdown caused by page faults and memory oversubscription. On average, eviction policies such as Random and CAR are not competitive with a traditional least recently used (LRU) policy. Other policies, such as CLOCK-Pro, are designed to overcome the weaknesses of LRU, yet they achieve only limited speedup. Even enhancing LRU with page walk hit information does not yield notable performance improvement. Based on these observations, we propose optimization opportunities to mitigate the performance degradation caused by page faults and memory oversubscription: an effective page eviction policy that retains LRU's advantages while addressing its inability to handle thrashing access patterns, page prefetch and pre-eviction, memory-aware throttling, and capacity compression.
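To illustrate the thrashing weakness of LRU that the abstract refers to, the following is a minimal, hypothetical sketch (not the paper's simulator) of an LRU-managed pool of GPU page frames; the class name, capacity, and access pattern are illustrative assumptions. With a cyclic working set one page larger than device capacity, every access faults under LRU:

```python
from collections import OrderedDict

class LRUPagePool:
    """Toy model of GPU device memory with LRU page eviction.

    `capacity` is the number of resident page frames. Accessing a
    non-resident page is a page fault; when the pool is full, the
    least recently used page is evicted to make room.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page_id -> True, ordered by recency
        self.faults = 0
        self.evictions = 0

    def access(self, page_id):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)  # mark most recently used
            return
        self.faults += 1  # fault: page must be migrated to the GPU
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the LRU page
            self.evictions += 1
        self.resident[page_id] = True

# A streaming (thrashing) pattern defeats LRU: with 4 frames and a
# cyclic working set of 5 pages, every single access is a miss.
pool = LRUPagePool(capacity=4)
for _ in range(3):
    for page in range(5):
        pool.access(page)
print(pool.faults)  # 15 accesses, 15 faults
```

Under this cyclic pattern, the page evicted is always the one needed next, which is exactly the behavior a thrashing-aware eviction policy aims to avoid.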



Acknowledgements

We thank the anonymous reviewers for their feedback. This work was supported in part by National Natural Science Foundation of China (Grants 61433019, 61872374, and 61572508).

Author information

Correspondence to Libo Huang.



Cite this article

Yu, Q., Childers, B., Huang, L. et al. A quantitative evaluation of unified memory in GPUs. J Supercomput 76, 2958–2985 (2020). https://doi.org/10.1007/s11227-019-03079-y
