The Journal of Supercomputing, Volume 74, Issue 6, pp 2314–2328

Stream data prefetcher for the GPU memory interface

  • Nuno Neves
  • Pedro Tomás
  • Nuno Roma


Abstract

Data caches are often unable to efficiently cope with the massive and simultaneous requests imposed by the SIMT execution model of modern GPUs. While software-aided cache management and scheduling approaches were considered early on, efficient prefetching schemes are regarded as the most viable way to improve the GPU memory subsystem. Accordingly, a new GPU prefetching mechanism is proposed that extends the stream computing model beyond the GPU processing core, broadening it toward the memory interface. The proposed prefetcher takes advantage of the available cache management resources and combines a low-profile architecture with a dedicated pattern descriptor specification, used to explicitly encode each kernel's memory access pattern. The obtained results show that the proposed mechanism increases the L1 data cache hit rate by an average of 61%, resulting in performance speedups as high as 9.2× and consequent energy efficiency improvements as high as 11×.
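To illustrate the general idea of a pattern descriptor, the following C sketch encodes a two-level strided access pattern (e.g., a row-major tile walk) and expands it into the stream of addresses a prefetcher could issue ahead of the kernel. The struct layout, field names, and expansion routine are hypothetical illustrations of the concept, not the paper's actual descriptor specification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern descriptor: a two-level (inner/outer) strided
 * access pattern, explicitly encoding a kernel's memory accesses so a
 * prefetcher can generate the address stream without observing misses. */
typedef struct {
    uint64_t base;         /* starting byte address of the stream      */
    uint32_t inner_count;  /* accesses per row                         */
    uint32_t inner_stride; /* byte distance between inner accesses     */
    uint32_t outer_count;  /* number of rows                           */
    uint32_t outer_stride; /* byte distance between row start addresses */
} pattern_desc_t;

/* Expand the descriptor into its address sequence, writing at most
 * max_out entries; returns the number of addresses produced. */
size_t expand_pattern(const pattern_desc_t *d, uint64_t *out, size_t max_out)
{
    size_t n = 0;
    for (uint32_t i = 0; i < d->outer_count; i++) {
        uint64_t row = d->base + (uint64_t)i * d->outer_stride;
        for (uint32_t j = 0; j < d->inner_count && n < max_out; j++)
            out[n++] = row + (uint64_t)j * d->inner_stride;
    }
    return n;
}
```

Because the pattern is stated explicitly rather than inferred from the miss history, a mechanism of this kind can, in principle, run arbitrarily far ahead of the executing warps.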


Keywords: GPU prefetching · Stream-based prefetching · Data-pattern encoding · Assisted memory access



This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project UID/CEC/50021/2013 and research grant SFRH/BD/100697/2014.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
