A Framework for Loop Distribution on Limited On-Chip Memory Processors

  • Lei Wang
  • Waibhav Tembe
  • Santosh Pande
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1781)

Abstract

This work proposes a framework for analyzing the flow of values and their reuse in loop nests, with the goal of minimizing data traffic under the constraints of limited on-chip memory capacity and dependences. Our analysis first fuses loop nests intra-procedurally wherever possible and then performs loop distribution. The analysis computes the closeness factor of two statements, a quantitative measure of the data traffic saved per unit of memory occupied when the statements are placed under the same loop nest rather than under different loop nests. We then develop a greedy algorithm that traverses the program dependence graph (PDG) to legally group statements under the same loop nest. The main idea of this greedy algorithm is to transitively build a group of statements that can legally execute under a given loop nest and that leads to minimum data traffic. We implemented our framework in Petit [2], a tool for dependence analysis and loop transformations. Our approach eliminates as much as 30% of the data traffic in some cases, improving overall completion time by 23.33% on processors such as TI’s TMS320C5x.
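The grouping idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the cost model (array sizes as both traffic and memory), the `forbidden` pair set standing in for the PDG legality test, and the names `closeness` and `greedy_groups` are all assumptions made for illustration.

```python
from itertools import combinations


def closeness(a, b, sizes):
    """Toy closeness factor: traffic saved per unit of memory occupied.

    Assumption: sharing an array saves one load of it, and placing two
    statements together occupies memory for the union of their arrays.
    """
    saved = sum(sizes[x] for x in a & b)
    occupied = sum(sizes[x] for x in a | b)
    return saved / occupied if occupied else 0.0


def greedy_groups(refs, sizes, capacity, forbidden=frozenset()):
    """Greedily group statements that may share a loop nest.

    refs:      statement name -> set of array names it references
    forbidden: hypothetical stand-in for the PDG legality check; a set of
               frozenset pairs that must not be placed in the same nest
    """
    groups = {s: {s} for s in refs}  # statement -> its current group

    def footprint(group):
        arrays = set().union(*(refs[s] for s in group))
        return sum(sizes[x] for x in arrays)

    # Visit statement pairs from highest to lowest closeness.
    pairs = sorted(
        combinations(sorted(refs), 2),
        key=lambda p: closeness(refs[p[0]], refs[p[1]], sizes),
        reverse=True,
    )
    for s, t in pairs:
        if closeness(refs[s], refs[t], sizes) == 0:
            break  # remaining pairs share no data
        gs, gt = groups[s], groups[t]
        if gs is gt:
            continue  # already grouped transitively
        if any(frozenset((u, v)) in forbidden for u in gs for v in gt):
            continue  # illegal under the assumed dependence constraints
        merged = gs | gt
        if footprint(merged) > capacity:
            continue  # would not fit in on-chip memory
        for u in merged:
            groups[u] = merged

    result = []
    for s in sorted(refs):
        if groups[s] not in result:
            result.append(groups[s])
    return [sorted(g) for g in result]


refs = {"S1": {"A", "B"}, "S2": {"A", "C"}, "S3": {"D"}, "S4": {"C", "D"}}
sizes = {"A": 4, "B": 2, "C": 2, "D": 4}
print(greedy_groups(refs, sizes, capacity=8))
```

With a capacity of 8 units the sketch keeps two groups, since merging all four statements would need 12 units. Raising the capacity to 12 lets the groups merge transitively through the pair S2 and S4.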

Keywords

  • Unit Memory
  • Data Reuse
  • Loop Transformation
  • Loop Fusion
  • Program Dependence Graph

References

  1. J. Eyre and J. Bier, “DSP Processors Hit the Mainstream”, Computer, 31(8):51–59, August 1998.
  2. Petit, Uniform Library, Omega Library, Omega Calculator. http://www.cs.umd.edu/projects/omega/index.html
  3. Texas Instruments, TMS 320C5x User’s Guide.
  4.
  5. A. Sundaram and S. Pande, “Compiler Optimizations for Real Time Execution of Loops on Limited Memory Embedded Systems”, Proceedings of IEEE International Real Time Systems Symposium, Madrid, Spain, pp. 154–164.
  6. A. Sundaram and S. Pande, “An Efficient Data Partitioning Method for Limited Memory Embedded Systems”, 1998 ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (in conjunction with PLDI’98), Montreal, Canada, Springer-Verlag, pp. 205–218.
  7. F. Irigoin and R. Triolet, “Supernode Partitioning”, in 15th Symposium on Principles of Programming Languages (POPL XV), pp. 319–329, 1988.
  8. J. Ramanujam and P. Sadayappan, “Tiling Multidimensional Iteration Spaces for Multicomputers”, Journal of Parallel and Distributed Computing, 16:108–120, 1992.
  9. U. Banerjee, “Loop Transformations for Restructuring Compilers”, Boston: Kluwer Academic, 1994.
  10. W. Li, “Compiling for NUMA Parallel Machines”, Ph.D. Thesis, Cornell University, Ithaca, NY, 1993.
  11. M. Wolfe, High Performance Compilers for Parallel Computing, Addison Wesley, 1996.
  12. M. Wolfe, “Iteration Space Tiling for Memory Hierarchies”, in Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.
  13. R. Schreiber and J. Dongarra, “Automatic Blocking of Nested Loops”, Technical report, RIACS, NASA Ames Research Center, and Oak Ridge National Laboratory, May 1990.
  14. I. Kodukula, N. Ahmed and K. Pingali, “Data Centric Multi-level Blocking”, in ACM Programming Language Design and Implementation 1997 (PLDI’97), pp. 346–357.
  15. P. Panda, A. Nicolau and N. Dutt, “Memory Organization for Improved Data Cache Performance in Embedded Processors”, Proceedings of 1996 International Symposium on System Synthesis.
  16. N. Mitchell, K. Hogstedt, L. Carter and J. Ferrante, “Quantifying the Multi-level Nature of Tiling Interactions”, International Journal of Parallel Programming, Vol. 26, No. 6, 1998, pp. 641–670.
  17. K. Cooper and T. Harvey, “Compiler Controlled Memory”, Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 3–7, 1998, San Jose, CA.
  18. K. McKinley, S. Carr and C.-W. Tseng, “Improving Data Locality with Loop Transformations”, ACM Transactions on Programming Languages and Systems (TOPLAS), 18(4):424–453, July 1996.
  19. M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm”, in Proceedings of ACM Special Interest Group on Programming Languages (SIGPLAN) ’91 Conference on Programming Language Design and Implementation (PLDI’91), pp. 30–44, Toronto, Canada, June 1991.
  20. K. Kennedy and K. McKinley, “Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution”, in Languages and Compilers for Parallel Computing (LCPC) 1993.
  21. G. Gao, R. Olsen, V. Sarkar and R. Thekkath, “Collective Loop Fusion for Array Contraction”, in Languages and Compilers for Parallel Computing (LCPC) 1992.
  22. E. Duesterwald, R. Gupta and M. Soffa, “A Practical Data-flow Framework for Array Reference Analysis and its Application in Optimization”, in ACM Programming Language Design and Implementation (PLDI) 1993, pp. 68–77.
  23. R. Gupta and R. Bodik, “Array Data-Flow Analysis for Load-Store Optimizations in Superscalar Architectures”, in Eighth Annual Workshop on Languages and Compilers for Parallel Computing (LCPC) 1995. Also published in International Journal of Parallel Programming, Vol. 24, No. 6, pp. 481–512, 1996.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Lei Wang (1)
  • Waibhav Tembe (1)
  • Santosh Pande (1)

  1. Compiler Research Laboratory, Department of ECECS, ML 0030, University of Cincinnati, Cincinnati
