
Taxonomy of Data Prefetching for Multicore Processors

  • Survey
  • Published in: Journal of Computer Science and Technology

Abstract

Data prefetching is an effective latency-hiding technique that masks the CPU stalls caused by cache misses and bridges the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to the processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some single-core prefetching techniques apply directly to multicore processors, numerous novel strategies have been proposed in recent years to take advantage of multiple cores. This paper provides a comprehensive review of state-of-the-art prefetching techniques and proposes a taxonomy that classifies the design concerns in developing a prefetching strategy, especially for multicore processors. We also compare existing methods through analysis.
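To make the basic idea concrete, the following C sketch illustrates software-controlled prefetching of the kind the survey classifies; it is not code from the paper. It uses GCC's __builtin_prefetch to request data a fixed number of iterations ahead of its use, and the prefetch distance of 16 elements is an arbitrary assumption that real implementations tune to memory latency and loop cost.

    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n)
    {
        const size_t dist = 16;  /* assumed prefetch distance, in elements */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Issue a non-binding prefetch for the element needed `dist`
               iterations from now: 0 = read access, 3 = high temporal locality. */
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0, 3);
            sum += a[i];
        }
        return sum;
    }

This is the pattern that compiler-inserted prefetching automates for regular array accesses; hardware prefetchers achieve a similar effect transparently by detecting the sequential stride.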



Author information

Corresponding author

Correspondence to Surendra Byna.

Additional information

This research was supported in part by the U.S. National Science Foundation under Grant Nos. EIA-0224377, CNS-0406328, CNS-0509118, and CCF-0621435.


About this article

Cite this article

Byna, S., Chen, Y. & Sun, XH. Taxonomy of Data Prefetching for Multicore Processors. J. Comput. Sci. Technol. 24, 405–417 (2009). https://doi.org/10.1007/s11390-009-9233-4

