
Optimized Unrolling of Nested Loops

International Journal of Parallel Programming

Abstract

Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
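For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of the classical unroll-and-jam transformation applied to matrix multiply. It is not the paper's code generation algorithm or cost model; the function names and the fixed unroll vector (2, 2) on the two outer loops are assumptions chosen only for illustration.

```python
def matmul_baseline(a, b, n):
    """Reference triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_and_jam_2x2(a, b, n):
    """Hypothetical illustration: unroll the i and j loops by 2 and
    fuse (jam) the four copies of the body into a single k loop."""
    c = [[0.0] * n for _ in range(n)]
    n2 = n - n % 2  # largest even bound; leftover iterations handled below
    for i in range(0, n2, 2):
        for j in range(0, n2, 2):
            # Four accumulators stand in for values kept in registers.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Four loads feed four multiply-adds, versus two loads
                # per multiply-add in the baseline loop.
                a0, a1 = a[i][k], a[i + 1][k]
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    # Remainder: the last row and/or column when n is odd.
    for i in range(n):
        for j in range(n):
            if i >= n2 or j >= n2:
                for k in range(n):
                    c[i][j] += a[i][k] * b[k][j]
    return c
```

The sketch shows why unroll factors interact with register pressure: each extra unrolled copy adds accumulators and reused operands, so larger unroll vectors improve reuse only while the values still fit in registers. The separate remainder loop also hints at the code-size cost that a more compact code generation scheme would aim to reduce.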



About this article

Cite this article

Sarkar, V. Optimized Unrolling of Nested Loops. International Journal of Parallel Programming 29, 545–581 (2001). https://doi.org/10.1023/A:1012246031671


  • DOI: https://doi.org/10.1023/A:1012246031671
