International Journal of Parallel Programming, Volume 29, Issue 5, pp 545–581

Optimized Unrolling of Nested Loops

  • Vivek Sarkar


Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
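For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of classical unroll-and-jam applied to matrix multiply. It is not the code generation algorithm of the paper, and the function names and the unroll vector (2, 2, 1) are assumptions chosen purely for illustration. The outer `i` and `j` loops are each unrolled by 2 and the copies are fused ("jammed") into a single inner `k` loop, so each inner iteration performs four multiply-adds from four loads rather than one multiply-add from two loads.

```python
def matmul_reference(a, b, n):
    """Plain triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_and_jam(a, b, n):
    """Same computation with the i and j loops unrolled by 2 and jammed.

    The unroll vector (2, 2, 1) is an illustrative assumption, not a
    factor selected by any cost model.
    """
    c = [[0.0] * n for _ in range(n)]
    n_even = n - (n % 2)  # largest even bound covered by the unrolled loops

    for i in range(0, n_even, 2):
        for j in range(0, n_even, 2):
            # Four scalar accumulators stand in for registers.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Each loaded value is reused by two multiply-adds.
                a0, a1 = a[i][k], a[i + 1][k]
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11

    # Remainder code: the last row and column when n is odd.
    for i in range(n):
        for j in range(n):
            if i < n_even and j < n_even:
                continue  # already computed by the unrolled loops
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

Note that the remainder loops duplicate the loop body; reducing this kind of code growth is one of the motivations the abstract gives for a more compact code generation scheme.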

loop transformations, loop unrolling, unroll-and-jam, unroll factors





Copyright information

© Plenum Publishing Corporation 2001

Authors and Affiliations

  • Vivek Sarkar
    • IBM T. J. Watson Research Center, New York
