Abstract
Data caches are often unable to cope efficiently with the massive, simultaneous requests imposed by the SIMT execution model of modern GPUs. While software-aided cache management techniques and scheduling approaches were considered early on, efficient prefetching schemes are regarded as the most viable solution to improve the efficiency of the GPU memory subsystem. Accordingly, a new GPU prefetching mechanism is proposed that extends the stream computing model beyond the GPU processing core itself, broadening it toward the memory interface. The proposed prefetcher takes advantage of the available cache management resources and combines a low-profile architecture with a dedicated pattern descriptor specification, which is used to explicitly encode each kernel's memory access pattern. The obtained results show that the proposed mechanism increases the L1 data cache hit rate by an average of 61%, resulting in performance speedups as high as 9.2\(\times\) and consequent energy efficiency improvements as high as 11\(\times\).
Acknowledgements
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project UID/CEC/50021/2013 and research grant SFRH/BD/100697/2014.
Cite this article
Neves, N., Tomás, P. & Roma, N. Stream data prefetcher for the GPU memory interface. J Supercomput 74, 2314–2328 (2018). https://doi.org/10.1007/s11227-018-2260-6
DOI: https://doi.org/10.1007/s11227-018-2260-6