Improving Software Pipelining by Hiding Memory Latency with Combined Loads and Prefetches

Part of The Springer International Series in Engineering and Computer Science book series (SECS, volume 613)

Abstract

Modern processors and compilers hide long memory latencies through non-blocking loads or explicit software prefetching instructions. Unfortunately, each mechanism has potential drawbacks. Non-blocking loads can significantly increase register pressure by extending the lifetimes of loads. Software prefetching increases the number of memory instructions in the loop body. For a loop whose execution time is bound by the number of loads/stores that can be issued per cycle, software prefetching exacerbates this problem and increases the number of idle computational cycles in loops.

In this paper, we show how compiler and architecture support for combining a load and a prefetch into one instruction, called a prefetching load, can give lower register pressure like software prefetching and lower load/store-unit requirements like non-blocking loads. On a set of 106 Fortran loops, we show that prefetching loads obtain a speedup of 1.07–1.53 over using just non-blocking loads and a speedup of 1.04–1.08 over using software prefetching. In addition, prefetching loads reduced floating-point register pressure by as much as a factor of 0.4 and integer register pressure by as much as a factor of 0.8 over non-blocking loads. Integer register pressure was also reduced by a factor of 0.97 over software prefetching, while floating-point register pressure was increased by a factor of 1.02 versus software prefetching in the worst case.
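
The load/store-unit argument above can be illustrated with the standard resource-constrained lower bound on a software pipeline's initiation interval: the number of memory operations per iteration divided by the number of load/store units, rounded up. The sketch below is not taken from the chapter. The loop shape (4 loads, 2 stores), the 2 load/store units, and the 2 explicit prefetches are hypothetical values chosen only to show why extra prefetch instructions can lengthen a memory-bound loop, while a combined prefetching load leaves the memory-operation count unchanged.

```python
from math import ceil


def res_mii(memory_ops: int, memory_units: int) -> int:
    """Lower bound on the initiation interval imposed by the load/store units."""
    return ceil(memory_ops / memory_units)


# Hypothetical loop body and machine; none of these numbers come from the chapter.
LOADS, STORES, UNITS = 4, 2, 2
PREFETCHES = 2  # assumed number of explicit prefetch instructions per iteration

# Memory operations issued per iteration under each scheme.
schemes = {
    "non-blocking loads": LOADS + STORES,
    "software prefetching": LOADS + STORES + PREFETCHES,  # prefetches occupy issue slots
    "prefetching loads": LOADS + STORES,  # the prefetch rides along with a load
}

bounds = {name: res_mii(ops, UNITS) for name, ops in schemes.items()}

for name, bound in bounds.items():
    print(f"{name}: at least {bound} cycles per iteration (memory-unit bound)")
```

This models only the issue-slot side of the trade-off. Register pressure and the latency actually hidden by each scheme are outside the sketch.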

Keywords

  • Cache
  • Software Prefetching
  • Nonblocking Loads

Copyright information

© 2001 Springer Science+Business Media New York

About this chapter

Cite this chapter

Bedy, M., Carr, S., Önder, S., Sweany, P. (2001). Improving Software Pipelining by Hiding Memory Latency with Combined Loads and Prefetches. In: Lee, G., Yew, PC. (eds) Interaction between Compilers and Computer Architectures. The Springer International Series in Engineering and Computer Science, vol 613. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3337-2_4

  • DOI: https://doi.org/10.1007/978-1-4757-3337-2_4

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-4896-0

  • Online ISBN: 978-1-4757-3337-2

  • eBook Packages: Springer Book Archive