
An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory

Published in: Journal of Grid Computing

Abstract

Unified virtual memory (UVM) improves GPU programmability by enabling on-demand data movement between CPU memory and GPU memory. However, due to the limited capacity of GPU device memory, oversubscription overhead becomes a major performance bottleneck for data-intensive workloads running on GPUs with UVM. This paper proposes a novel framework for UVM oversubscription management in discrete CPU-GPU systems. It consists of an access pattern classifier followed by a pattern-specific transformer-based model trained with a novel loss function that aims to reduce page thrashing. A policy engine leverages the model's predictions to perform accurate page prefetching and eviction. Our evaluation shows that the proposed framework significantly outperforms state-of-the-art (SOTA) methods on a set of 11 memory-intensive benchmarks: under 125% memory oversubscription, it reduces the number of pages thrashed by 64.4% relative to the baseline, whereas the SOTA method reduces it by 17.3%. Compared to the SOTA method, our solution achieves average IPC improvements of 1.52X and 3.66X under 125% and 150% memory oversubscription, respectively.
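The pipeline the abstract describes — classify the access pattern, predict upcoming pages with a pattern-specific model, then let a policy engine prefetch and evict — can be sketched in miniature. The sketch below is purely illustrative: `classify_pattern`, `predict_next_pages`, and `PolicyEngine` are hypothetical names, the classifier is a trivial stride heuristic standing in for the paper's learned classifier, the predictor is a stride extrapolator standing in for the transformer-based model, and eviction is plain LRU rather than the paper's thrashing-aware policy.

```python
from collections import OrderedDict

def classify_pattern(page_trace, window=8):
    """Toy stand-in for the access-pattern classifier: label a trace
    'streaming' if its recent strides are constant, else 'irregular'."""
    recent = page_trace[-window:]
    strides = {b - a for a, b in zip(recent, recent[1:])}
    return "streaming" if len(strides) == 1 else "irregular"

def predict_next_pages(page_trace, pattern, k=4):
    """Placeholder for the pattern-specific predictor: extrapolate the
    constant stride for streaming traces; for irregular traces, fall
    back to re-predicting the most recently touched pages."""
    if pattern == "streaming":
        stride = page_trace[-1] - page_trace[-2]
        return [page_trace[-1] + stride * i for i in range(1, k + 1)]
    return list(dict.fromkeys(reversed(page_trace)))[:k]

class PolicyEngine:
    """Sketch of a policy engine: prefetch predicted pages into a
    fixed-capacity device memory, evicting least-recently-used pages
    on overflow (LRU here, not the paper's eviction policy)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page -> None, in LRU order

    def access(self, page):
        hit = page in self.resident
        self._insert(page)
        return hit  # a miss models a far-fault / page migration

    def prefetch(self, pages):
        for p in pages:
            self._insert(p)

    def _insert(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict LRU page
        self.resident[page] = None
```

A driver would interleave these pieces per fault batch: classify the recent trace, ask the predictor for the next pages, and hand them to `PolicyEngine.prefetch` so subsequent accesses hit in device memory.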


Availability of data and materials

Data sharing is not applicable.


Acknowledgments

This work was supported in part by the Industrial Internet Innovation and Development Action Plan Project (No. TC210A02K) and the National Natural Science Foundation of China (No.61802022 and No.61802027).


Author information

Contributions

Xinjian Long: Conceptualization of this study, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - original draft. Xiangyang Gong: Methodology, Project administration, Resources, Supervision, Writing - review & editing. Bo Zhang: Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing. Huiyang Zhou: Conceptualization, Methodology, Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing.

Corresponding author

Correspondence to Xiangyang Gong.

Ethics declarations

Competing interests

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Consent for Publication

We confirm that we understand that the Journal of Grid Computing is a transformative journal. When research is accepted for publication, authors may choose to publish via either immediate gold open access or the traditional publishing route.

Ethics approval and consent to participate

The results, data, and figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher. We have read the Springer journal policies on author responsibilities and submit this manuscript in accordance with those policies. All of the material is owned by the authors, and/or no permissions are required.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Long, X., Gong, X., Zhang, B. et al. An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory. J Grid Computing 21, 11 (2023). https://doi.org/10.1007/s10723-023-09646-1

