Improving Data Locality by Chunking

  • Cédric Bastoul
  • Paul Feautrier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2622)

Abstract

Cache memories were invented to decouple fast processors from slow memories. However, this decoupling is only partial, and many researchers have attempted to improve cache use through program optimization. The potential benefits are significant, since both energy dissipation and performance depend heavily on the traffic between memory levels. But modeling this traffic is difficult, an observation that has led to the use of heuristic methods for steering program transformations. In this paper, we propose another approach: we simplify the cache model and organize the target program in such a way that an asymptotic evaluation of the memory traffic is possible. Our optimization algorithm uses this information to find the best reordering of the program operations, at least in an asymptotic sense. Our method optimizes both temporal and spatial locality. It can be applied to any static control program with arbitrary dependences. The optimizer has been partially implemented and applied to non-trivial programs. We present experimental evidence that the number of cache misses is drastically reduced, with corresponding performance improvements.
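
As a rough illustration of why reordering program operations can reduce memory traffic, the sketch below counts misses in a toy cache model. It is not the algorithm of the paper: the fully associative LRU cache, the two-pass access pattern, the names `count_misses`, `original_order`, and `chunked_order`, and all parameter values are assumptions chosen for illustration. The operations are also assumed to be independent, so that any reordering is legal.

```python
from collections import OrderedDict


def count_misses(trace, cache_lines, line_size):
    """Count misses of a fully associative LRU cache (an assumed toy model).

    trace       -- sequence of element indices accessed by the program
    cache_lines -- number of lines the cache can hold
    line_size   -- number of consecutive elements per cache line
    """
    cache = OrderedDict()  # keys are line numbers, ordered from LRU to MRU
    misses = 0
    for addr in trace:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark line as most recently used
        else:
            misses += 1                    # miss: load the line
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def original_order(n):
    """Two full passes over the same n-element array, one after the other."""
    return list(range(n)) + list(range(n))


def chunked_order(n, chunk):
    """Same accesses, reordered so both passes run on one chunk at a time."""
    trace = []
    for start in range(0, n, chunk):
        block = list(range(start, min(start + chunk, n)))
        trace += block + block             # reuse the chunk while it is cached
    return trace


# Hypothetical parameters: the cache holds 16 lines of 4 elements (64 elements),
# the array has 1024 elements, and each chunk of 32 elements fits in the cache.
N, CACHE_LINES, LINE_SIZE, CHUNK = 1024, 16, 4, 32

misses_original = count_misses(original_order(N), CACHE_LINES, LINE_SIZE)
misses_chunked = count_misses(chunked_order(N, CHUNK), CACHE_LINES, LINE_SIZE)
print(misses_original, misses_chunked)
```

With these assumed parameters the original order incurs 512 misses and the chunked order 256: once each chunk fits in the cache, the second use of every element becomes a hit, and only the unavoidable first loads remain.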

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Cédric Bastoul (1)
  • Paul Feautrier (2)

  1. Laboratoire PRiSM, Université de Versailles Saint Quentin, Versailles Cedex, France
  2. École Normale Supérieure de Lyon, Lyon, France
