A Hybrid Hardware/Software Generated Prefetching Thread Mechanism on Chip Multiprocessors

  • Hou Rui
  • Longbing Zhang
  • Weiwu Hu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4128)


This paper proposes a hybrid hardware/software generated prefetching thread mechanism for chip multiprocessors (CMPs). The mechanism uses two kinds of prefetching threads. Most are Dynamic Prefetching Threads, which are automatically generated, triggered, spawned, and managed by hardware. The others are Static Prefetching Threads, which target the critical delinquent loads that Dynamic Prefetching Threads cannot predict accurately or in time. Static Prefetching Threads are generated statically by a binary-level optimization tool guided by profiling information. Several aggressive thread construction policies are also proposed, and the hardware infrastructure a CMP needs to support the hybrid mechanism is described. For a set of memory-limited benchmarks with complicated access patterns, basic hardware-generated prefetching threads achieve an average speedup of 3.1% on a dual-core CMP, and this gain grows to 31% with the hybrid mechanism.
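As a rough illustration of the idea behind a prefetching thread for a delinquent load, the C sketch below separates the address computation of a pointer-chasing loop from the work done on each node. It is not the paper's mechanism: the `node` type and the `prefetch_slice` and `sum_list` functions are hypothetical names, the run-ahead bound `max_ahead` is an assumed parameter, and `__builtin_prefetch` is a GCC/Clang builtin standing in for whatever the hardware or binary tool would emit. In the paper's setting the slice would run on the second core ahead of the main thread; here it is only an ordinary function.

```c
#include <stddef.h>

/* Hypothetical list node; the paper's benchmarks are not reproduced here. */
struct node {
    struct node *next;
    long payload;
};

/*
 * Prefetching slice (illustrative): keeps only the address computation,
 * i.e. the pointer chase, and drops the work done on each node.
 * Visits at most max_ahead nodes and returns how many were visited,
 * so a caller could bound how far the slice runs ahead.
 */
size_t prefetch_slice(const struct node *start, size_t max_ahead)
{
    size_t touched = 0;
    for (const struct node *p = start; p != NULL && touched < max_ahead;
         p = p->next) {
        /* Hint only: read access, low temporal locality. */
        __builtin_prefetch(p->next, 0, 1);
        touched++;
    }
    return touched;
}

/* Main-thread work whose load of each node is the delinquent load. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->payload;
    return total;
}
```

The slice contains no stores and produces no program-visible result, so running it speculatively cannot change the outcome of `sum_list`; it can only warm the cache.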


Keywords (machine-generated, not supplied by the authors): Shadow Register; Basic Policy; Performance Counter; Performance Speedup; Address Computation





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hou Rui (1)
  • Longbing Zhang (1)
  • Weiwu Hu (1)
  1. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
