
Compressed page walk cache

  • Research Article
  • Frontiers of Computer Science

Abstract

GPUs are widely used in modern high-performance computing systems. To reduce the burden on GPU programmers, operating systems and GPU hardware support shared virtual memory, which lets the GPU and CPU share the same virtual address space. Unfortunately, the SIMT execution model of current GPUs poses great challenges for virtual-to-physical address translation on the GPU side, mainly because a huge number of virtual addresses are generated simultaneously and these addresses exhibit poor locality. As a result, excessive TLB accesses increase the TLB miss ratio. As an attractive solution, the Page Walk Cache (PWC) has received wide attention for its ability to reduce the memory accesses caused by TLB misses. However, the current PWC mechanism suffers from heavy redundancy, which significantly limits its efficiency. In this paper, we first investigate the factors leading to this issue by evaluating the performance of PWC with typical GPU benchmarks. We find that repeated L4 and L3 indices of virtual addresses increase the redundancy in PWC, and that the low locality of L2 indices causes a low PWC hit ratio. Based on these observations, we propose a new PWC structure, the Compressed Page Walk Cache (CPWC), to resolve the redundancy burden in current PWC. CPWC can be organized in either direct-mapped or set-associative mode. Experimental results show that CPWC increases the number of page table entries by 3 times over TPC, improves the L2 index hit ratio by 38.3% over PWC, and reduces page table memory accesses by 26.9%. The average number of memory accesses caused by each TLB miss is reduced to 1.13. Overall, the average IPC improves by 25.3%.
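
To make the redundancy observation concrete, the following minimal sketch splits a virtual address into its L4, L3, L2 and L1 indices and compares the tag storage needed when every (L4, L3, L2) triple is kept in full against storing each shared (L4, L3) prefix once. It assumes x86-64 4-level paging with 4 KiB pages and 9-bit indices. The function names, the sample addresses and the bit-counting model are hypothetical, and the sketch is not the CPWC organization described in the paper.

```python
# Illustrative sketch only. Assumes x86-64 4-level paging with 4 KiB pages.
# It shows why repeated L4/L3 indices are redundant; it is NOT the paper's CPWC design.

OFFSET_BITS = 12   # 4 KiB page offset (assumption)
INDEX_BITS = 9     # 512 entries per page-table level (assumption)


def split_va(va):
    """Return (l4, l3, l2, l1, offset) for a 48-bit virtual address."""
    mask = (1 << INDEX_BITS) - 1
    offset = va & ((1 << OFFSET_BITS) - 1)
    l1 = (va >> OFFSET_BITS) & mask
    l2 = (va >> (OFFSET_BITS + INDEX_BITS)) & mask
    l3 = (va >> (OFFSET_BITS + 2 * INDEX_BITS)) & mask
    l4 = (va >> (OFFSET_BITS + 3 * INDEX_BITS)) & mask
    return l4, l3, l2, l1, offset


def tag_bits_uncompressed(vas):
    """Tag bits when every distinct (L4, L3, L2) triple is stored in full."""
    tags = {split_va(va)[:3] for va in vas}
    return len(tags) * 3 * INDEX_BITS


def tag_bits_shared_prefix(vas):
    """Tag bits when each distinct (L4, L3) prefix is stored once,
    followed by the L2 indices that appear under it."""
    groups = {}
    for va in vas:
        l4, l3, l2, _, _ = split_va(va)
        groups.setdefault((l4, l3), set()).add(l2)
    return sum(2 * INDEX_BITS + len(l2s) * INDEX_BITS for l2s in groups.values())


# Hypothetical addresses: 8 pages spaced 2 MiB apart inside one 1 GiB region,
# so they share the same L4 and L3 indices and differ only in L2.
BASE = 0x7F00_0000_0000
addresses = [BASE + i * (1 << 21) for i in range(8)]

if __name__ == "__main__":
    print(split_va(BASE))                      # (254, 0, 0, 0, 0)
    print(tag_bits_uncompressed(addresses))    # 216
    print(tag_bits_shared_prefix(addresses))   # 90
```

In this toy model the eight entries need 216 tag bits when stored in full and 90 when the common prefix is stored once. The saved space is the kind of redundancy the abstract attributes to repeated L4 and L3 indices.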



Acknowledgements

This paper was supported by the National Natural Science Foundation of China (Grant No. 61972407).

Author information


Corresponding author

Correspondence to Li Shen.

Additional information

Dunbo Zhang is a postgraduate student at the School of Computer, National University of Defense Technology, China. He obtained his bachelor’s degree from National University of Defense Technology, China. His research interests include GPU architecture, address translation and virtualization.

Chaoyang Jia is a postgraduate student at the School of Computer, National University of Defense Technology, China. He obtained his bachelor’s degree from University of Electronic Science and Technology of China. His research interests include GPU architecture, address translation and virtualization.

Li Shen received the BS, MS, and PhD degrees in computer science & technology from National University of Defense Technology (NUDT), China. He is currently a professor at the Department of Computing, NUDT, China. His research interests include high performance processor architecture, parallel programming, and performance optimization techniques.


Cite this article

Zhang, D., Jia, C. & Shen, L. Compressed page walk cache. Front. Comput. Sci. 16, 163104 (2022). https://doi.org/10.1007/s11704-020-9485-2


  • DOI: https://doi.org/10.1007/s11704-020-9485-2
