Compiler-Directed Cache Assist Adaptivity

  • Xiaomei Ji
  • Dan Nicolaescu
  • Alexander Veidenbaum
  • Alexandru Nicolau
  • Rajesh Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1940)


The performance of a traditional cache memory hierarchy can be improved by mechanisms such as a victim cache or a stream buffer, collectively called cache assists. The amount of on-chip memory available for cache assists is typically limited for technological reasons, and their size is further limited to keep access time short. The performance gain from a stream buffer, a victim cache, or a combination of the two varies from program to program as well as within a single program. Therefore, given a fixed amount of cache assist memory, there is both a need and an opportunity for “adaptivity” of the cache assists, i.e., the ability to vary their relative sizes within the bounds of the total cache assist memory. We propose and study a compiler-driven adaptive cache assist organization and its effect on system performance. Several adaptivity mechanisms are proposed and investigated. The results show that a cache assist that adapts at loop level clearly improves cache memory performance, has low overhead, and is easy to implement.
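The idea of one fixed assist budget whose split between a victim cache and a stream buffer changes per loop can be illustrated with a toy simulator. This is a minimal sketch, not the authors' mechanism, which the abstract does not detail. It assumes a direct-mapped L1 indexed by cache-line address, an LRU victim cache, a single stream buffer that is checked only at its FIFO head, and a hypothetical `configure(victim_entries)` call standing in for a hint a compiler might insert at loop entry. The names `CacheAssist`, `configure`, and `run_program` are invented for illustration.

```python
from collections import Counter, OrderedDict, deque


class CacheAssist:
    """Hypothetical model: direct-mapped L1 plus a fixed pool of assist
    entries shared between an LRU victim cache and one stream buffer."""

    def __init__(self, num_sets: int, total_entries: int, victim_entries: int):
        self.num_sets = num_sets
        self.total_entries = total_entries
        self.l1 = [None] * num_sets          # one line address per set
        self.victim = OrderedDict()          # LRU order: oldest entry first
        self.stream = deque()                # FIFO of prefetched line addresses
        self._next = 0                       # next line the stream buffer fetches
        self.stats = Counter()               # where each reference was served
        self.configure(victim_entries)

    def configure(self, victim_entries: int) -> None:
        """Repartition the pool (stands in for a compiler-inserted hint)."""
        if not 0 <= victim_entries <= self.total_entries:
            raise ValueError("victim_entries out of range")
        self.victim_entries = victim_entries
        self.stream_entries = self.total_entries - victim_entries
        while len(self.victim) > victim_entries:
            self.victim.popitem(last=False)  # drop least recently used lines
        self.stream.clear()                  # flush stale prefetches

    def _fill_stream(self) -> None:
        # Prefetch sequential lines until the buffer reaches its current size.
        while len(self.stream) < self.stream_entries:
            self.stream.append(self._next)
            self._next += 1

    def _insert_victim(self, line: int) -> None:
        if self.victim_entries == 0:
            return
        self.victim[line] = True
        self.victim.move_to_end(line)        # mark as most recently used
        while len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)

    def access(self, line: int) -> str:
        """Reference one cache-line address and record where it was found."""
        idx = line % self.num_sets
        if self.l1[idx] == line:
            source = "l1"
        else:
            evicted, self.l1[idx] = self.l1[idx], line
            if line in self.victim:          # victim hit: swap with the L1 line
                del self.victim[line]
                source = "victim"
            elif self.stream and self.stream[0] == line:
                self.stream.popleft()        # stream hit at the FIFO head
                self._fill_stream()
                source = "stream"
            else:                            # true miss: restart the stream
                self.stream.clear()
                self._next = line + 1
                self._fill_stream()
                source = "memory"
            if evicted is not None:
                self._insert_victim(evicted)
        self.stats[source] += 1
        return source


def run_program(assist: CacheAssist, program, adaptive: bool = True) -> Counter:
    """program: list of (victim_hint, trace) pairs, one pair per loop."""
    for victim_hint, trace in program:
        if adaptive:
            assist.configure(victim_hint)    # hypothetical loop-entry directive
        for line in trace:
            assist.access(line)
    return assist.stats


if __name__ == "__main__":
    seq = list(range(64))                # loop 1: unit-stride sweep
    conflict = [1000, 1008] * 20         # loop 2: two lines that map to set 0
    program = [(0, seq), (4, conflict)]  # hints: all-stream, then all-victim
    for name, victim, adaptive in [("adaptive", 0, True),
                                   ("stream-only", 0, False),
                                   ("victim-only", 4, False)]:
        stats = run_program(CacheAssist(8, 4, victim), program, adaptive)
        print(name, "memory misses:", stats["memory"])
```

With 8 L1 sets and a 4-entry pool, this toy program reports 3 memory misses when adapted per loop, 41 with a fixed all-stream configuration, and 66 with a fixed all-victim configuration. The sweep favors the stream buffer and the ping-pong conflict favors the victim cache. The trace is deliberately simple, and a fixed 2/2 split happens to handle it as well. The example only shows why the best split can differ between loops; it does not reproduce the paper's measurements.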


Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Xiaomei Ji, Dan Nicolaescu, Alexander Veidenbaum, Alexandru Nicolau, Rajesh Gupta: Department of Information and Computer Science, University of California Irvine, Irvine