Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT Architectures

  • Evangelia Athanasaki
  • Kornilios Kourtis
  • Nikos Anastopoulos
  • Nectarios Koziris
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3746)


Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, has been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost cache performance. The stability of the performance improvements achieved is heavily dependent on the appropriate selection of tile sizes.
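To make the two ideas above concrete, the following is a minimal sketch, not the authors' implementation, of a tiled matrix multiplication operating on arrays stored in a blocked (tile-major) layout. The function names `blocked_index`, `to_blocked` and `tiled_matmul_blocked` are hypothetical, and the sketch assumes square n x n matrices whose dimension is a multiple of the tile size t.

```python
def blocked_index(i, j, n, t):
    """Offset of element (i, j) of an n x n matrix in a blocked layout.

    Assumption: t divides n. Tiles are stored one after another in
    row-major tile order, and elements inside a tile are row-major too.
    """
    tiles_per_row = n // t
    tile_id = (i // t) * tiles_per_row + (j // t)
    return tile_id * t * t + (i % t) * t + (j % t)


def to_blocked(a, n, t):
    """Copy a row-major n x n matrix (flat list) into the blocked layout."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[blocked_index(i, j, n, t)] = a[i * n + j]
    return out


def tiled_matmul_blocked(a, b, n, t):
    """Return C = A * B, with A, B and C all stored in the blocked layout."""
    c = [0.0] * (n * n)
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile of each array, which is contiguous in memory.
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, ii + t):
                    for k in range(kk, kk + t):
                        a_ik = a[blocked_index(i, k, n, t)]
                        for j in range(jj, jj + t):
                            c[blocked_index(i, j, n, t)] += (
                                a_ik * b[blocked_index(k, j, n, t)]
                            )
    return c
```

In this layout each t x t tile occupies a contiguous run of t*t elements, so the inner loops touch a compact region of memory rather than t widely separated rows.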

In this paper, we investigate the memory performance of blocked data layouts and provide a theoretical analysis for the multiple levels of the memory hierarchy when they are organized in a set-associative fashion. According to this analysis, the optimal tile size, which maximizes L1 cache utilization, should completely fit in the L1 cache, even for loop bodies that access more than one array. The increased self- and/or cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the CPU cycles lost to them. Results are validated through actual benchmarks on an SMT platform.
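As a rough illustration of the tile-size rule stated above, the sketch below computes the largest square tile that fits entirely in an L1 data cache of a given capacity. The helper names and the 16 KiB cache size are assumptions chosen for the example, not values taken from the paper, and rounding down to a power of two is only a common convenience for index arithmetic rather than something the abstract prescribes.

```python
import math


def max_tile_size(l1_bytes, elem_bytes):
    """Largest T such that one T x T tile of elem_bytes-sized elements
    fits in a cache of l1_bytes (capacity only; associativity ignored)."""
    return math.isqrt(l1_bytes // elem_bytes)


def round_down_pow2(x):
    """Largest power of two that is <= x (for x >= 1)."""
    return 1 << (x.bit_length() - 1)


# Hypothetical example: 16 KiB L1 data cache, 8-byte double elements.
tile = max_tile_size(16 * 1024, 8)   # 45, since 45 * 45 * 8 = 16200 bytes
tile_pow2 = round_down_pow2(tile)    # 32
```

This considers capacity alone; the interference misses that the analysis accounts for depend on associativity and on the other arrays in the loop body, which the sketch does not model.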


Keywords: Clock Cycle, Memory Hierarchy, Tile Size, Cache Performance, Cache Capacity





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Evangelia Athanasaki (1)
  • Kornilios Kourtis (1)
  • Nikos Anastopoulos (1)
  • Nectarios Koziris (1)

  1. School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens
