Dynamic GMMU Bypass for Address Translation in Multi-GPU Systems

  • Conference paper
  • Network and Parallel Computing (NPC 2020)
  • Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12639)

Abstract

The ever-increasing application footprint raises challenges for GPUs. As Moore's Law reaches its limit, improving single-GPU performance further is difficult; instead, multi-GPU systems have emerged as a promising solution due to their GPU-level parallelism. In addition, memory virtualization in recent GPUs simplifies multi-GPU programming. Memory virtualization requires support for address translation, and the overhead of address translation has a significant impact on system performance. Two address translation architectures are common in multi-GPU systems: distributed and centralized. We find that both architectures suffer performance loss in certain cases. To address this issue, we propose GMMU Bypass, a technique that allows address translation requests to dynamically bypass the GMMU in order to reduce translation overhead. Simulation results show that our technique outperforms the distributed address translation architecture by 6% and the centralized address translation architecture by 106% on average.
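The abstract describes a dynamic decision: each translation request either goes through the GMMU or bypasses it, depending on runtime conditions. The paper's actual policy is not given in the abstract, so the following is only a hypothetical sketch of one plausible mechanism: a per-GPU controller that tracks how often recent translations resolved to locally mapped pages and bypasses the GMMU when locality is high. All names and thresholds here are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch of a dynamic GMMU-bypass decision.
# Assumption: bypass pays off when most recent translations hit
# locally mapped pages, so the shared GMMU hop can be skipped and
# the local page table walked directly. Window size and threshold
# are illustrative, not taken from the paper.

from collections import deque


class BypassController:
    def __init__(self, window=256, threshold=0.9):
        # Sliding window of recent outcomes: 1 = local page, 0 = remote.
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def record(self, was_local: bool):
        self.history.append(1 if was_local else 0)

    def should_bypass(self) -> bool:
        # With no history yet, stay on the conservative GMMU path.
        if not self.history:
            return False
        locality = sum(self.history) / len(self.history)
        return locality >= self.threshold


# Toy usage: 3 of the last 4 translations were local, which meets
# the 0.75 threshold, so the next request bypasses the GMMU.
ctrl = BypassController(window=4, threshold=0.75)
for was_local in (True, True, True, False):
    ctrl.record(was_local)
print(ctrl.should_bypass())  # prints: True
```

A real implementation would live in hardware alongside the per-GPU TLBs and would also need a correctness path for the bypassed case (e.g., falling back to the GMMU on a local page-walk miss); this sketch only captures the dynamic-decision idea.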


Acknowledgement

This work is partially supported by the Research Project of NUDT ZK20-04, PDL Foundation 6142110180102, the Science and Technology Innovation Project of Hunan Province 2018XK2102, and the Advanced Research Program 31513010602-1.

Author information

Corresponding author: Chen Li.


Copyright information

© 2021 IFIP International Federation for Information Processing

About this paper


Cite this paper

Wei, J., Lu, J., Yu, Q., Li, C., Zhao, Y. (2021). Dynamic GMMU Bypass for Address Translation in Multi-GPU Systems. In: He, X., Shao, E., Tan, G. (eds) Network and Parallel Computing. NPC 2020. Lecture Notes in Computer Science, vol. 12639. Springer, Cham. https://doi.org/10.1007/978-3-030-79478-1_13


  • DOI: https://doi.org/10.1007/978-3-030-79478-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-79477-4

  • Online ISBN: 978-3-030-79478-1

  • eBook Packages: Computer Science (R0)
