An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

International Journal of Parallel Programming

Abstract

Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.
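
The division of work described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' compile-time algorithm: the `Access` record, the `classify` and `runs` helpers, and the `max_stride` and `max_intervening` thresholds are all hypothetical. It labels each access in one stream as a candidate for the hardware mechanism when it lies close in memory to the previous access and few data references intervene, and leaves the rest to software prefetching.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Access:
    address: int      # byte address touched by this access to the stream
    intervening: int  # data references issued since the previous access to the stream


def classify(accesses: List[Access],
             max_stride: int = 64,
             max_intervening: int = 8) -> List[str]:
    """Label each access 'hardware' or 'software'.

    Both thresholds are hypothetical stand-ins for whatever limits a simple
    hardware prefetcher would impose; the paper's real criteria may differ.
    """
    labels = []
    for i, acc in enumerate(accesses):
        if i == 0:
            # Assumption: the first access has no predecessor to predict from.
            labels.append("software")
            continue
        delta = abs(acc.address - accesses[i - 1].address)
        near = 0 < delta <= max_stride
        soon = acc.intervening <= max_intervening
        labels.append("hardware" if near and soon else "software")
    return labels


def runs(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group equal neighbouring labels into (first_index, last_index, label) runs."""
    out: List[Tuple[int, int, str]] = []
    for i, label in enumerate(labels):
        if out and out[-1][2] == label:
            out[-1] = (out[-1][0], i, label)
        else:
            out.append((i, i, label))
    return out
```

For a stream that walks an array in 8-byte steps and then jumps to a distant address, `runs(classify(stream))` would report one run of hardware-prefetchable accesses bracketed by accesses left to software. A per-stream lookahead, which the abstract also mentions, is not modelled here.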

About this article

Cite this article

Gornish, E.H., Veidenbaum, A. An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors. International Journal of Parallel Programming 27, 35–70 (1999). https://doi.org/10.1023/A:1018792002672

  • DOI: https://doi.org/10.1023/A:1018792002672
