Filtering Directory Lookups in CMPs with Write-Through Caches

  • Ana Bosque
  • Victor Viñals
  • Pablo Ibañez
  • Jose Maria Llaberia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6852)


In chip multiprocessors (CMPs), coherence protocols maintain data coherence among the multiple local caches. In this paper, we focus on CMPs that use write-through local caches and a directory-based coherence protocol implemented as a duplicate of the local cache tags. A large fraction of directory lookups is caused by stores to private data, that is, data local to the processor performing the store.

We propose to add a filter before the directory in order to either reduce the associativity of the lookups or even eliminate those that are unnecessary. When a block from the shared cache has only one copy in the local caches, the filter identifies the processor and allows for reducing the number of comparisons performed in the corresponding directory lookup. When that is not possible, the filter bits are used to code other situations that can also reduce the number of directory lookups or their associativity.
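The idea can be illustrated with a minimal Python sketch. It is not the authors' design: the class names, the three filter states, and the set/tag indexing are assumptions made for illustration, and only the "exactly one copy, owner known" case comes directly from the abstract. The sketch counts how many tag comparisons a lookup costs in a duplicate-tag directory with and without a filter hint.

```python
from dataclasses import dataclass, field
from enum import Enum


class FilterState(Enum):
    NO_COPY = 0   # hypothetical encoding: no local cache holds the block
    ONE_COPY = 1  # exactly one local copy; FilterEntry.owner names the processor
    UNKNOWN = 2   # hypothetical fallback: the filter cannot narrow the lookup


@dataclass
class FilterEntry:
    state: FilterState = FilterState.UNKNOWN
    owner: int = -1  # meaningful only when state is ONE_COPY


@dataclass
class DuplicateTagDirectory:
    num_procs: int
    ways: int
    num_sets: int
    tags: list = field(init=False)

    def __post_init__(self):
        # tags[p][s][w] mirrors the tag held in way w of set s of processor p's local cache.
        self.tags = [[[None] * self.ways for _ in range(self.num_sets)]
                     for _ in range(self.num_procs)]

    def install(self, proc, block, way):
        # Record that `proc` now caches `block` in the given way (assumed indexing scheme).
        self.tags[proc][block % self.num_sets][way] = block // self.num_sets

    def lookup(self, block, procs):
        """Return (processors holding the block, number of tag comparisons)."""
        set_idx, tag = block % self.num_sets, block // self.num_sets
        sharers = [p for p in procs if tag in self.tags[p][set_idx]]
        # One comparison per way of every processor that is searched.
        return sharers, len(procs) * self.ways


def filtered_lookup(directory, entry, block):
    """Use the filter entry to skip the lookup or restrict it to one processor."""
    if entry.state is FilterState.NO_COPY:
        return [], 0  # lookup eliminated
    if entry.state is FilterState.ONE_COPY:
        return directory.lookup(block, [entry.owner])  # reduced associativity
    return directory.lookup(block, range(directory.num_procs))  # full lookup
```

Under these assumptions, a directory for 8 processors with 4-way local caches compares 32 tags on an unfiltered lookup, 4 tags when the filter names the single owner, and none when the filter reports that no local copy exists.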

We evaluate the filter in a CMP with 8 in-order processors, each running 4 threads, and a memory hierarchy with local caches and a shared cache. We show that a filter representing 0.7% of the size of the shared cache can avoid, on average, 97% and 93% of all comparisons performed by directory lookups for SPLASH2 and Specweb2005, respectively. A small performance loss of 0.3% occurs only for SPLASH2. As a result, directory power is reduced on average by 30.8% and 22.4% for SPLASH2 and Specweb2005, respectively.


Keywords: Memory Hierarchy, Leakage Power, Local Cache, Cache Coherence, Cache Block





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ana Bosque (1)
  • Victor Viñals (2)
  • Pablo Ibañez (2)
  • Jose Maria Llaberia (1)

  1. DAC, UPC, Barcelona, Spain
  2. DIIS, University of Zaragoza, Zaragoza, Spain
