Filtering Directory Lookups in CMPs with Write-Through Caches
Abstract
In chip multiprocessors (CMPs), coherence protocols maintain data coherence among the multiple local caches. In this paper, we focus on CMPs with write-through local caches and a directory-based coherence protocol whose directory is implemented as a duplicate of the local cache tags. A large fraction of directory lookups is caused by stores to private data, that is, data local to the processor performing the store.
We propose adding a filter in front of the directory in order to either reduce the associativity of the lookups or eliminate the unnecessary ones altogether. When a block in the shared cache has only one copy in the local caches, the filter identifies the processor holding it, which reduces the number of comparisons performed in the corresponding directory lookup. When that is not possible, the filter bits are used to encode other situations that can also reduce the number of directory lookups or their associativity.
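To make the idea concrete, the following minimal sketch counts the tag comparisons that a store would trigger with and without filter information. It is an illustration under stated assumptions, not the encoding used in the paper: the state names (`UNKNOWN`, `NO_COPY`, `SINGLE`), the local-cache associativity `LOCAL_WAYS`, and the rule that a store by the sole holder needs no lookup are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum

NUM_PROCESSORS = 8  # matches the evaluated CMP
LOCAL_WAYS = 4      # assumed local-cache associativity (not given in the abstract)


class FilterState(Enum):
    """Hypothetical filter encodings; the paper's actual codes may differ."""
    UNKNOWN = 0  # no usable information: perform the full directory lookup
    NO_COPY = 1  # assumed "other situation": no local cache holds the block
    SINGLE = 2   # exactly one local cache holds the block


@dataclass
class FilterEntry:
    state: FilterState = FilterState.UNKNOWN
    owner: int = -1  # processor id, meaningful only when state is SINGLE


def tag_comparisons(entry: FilterEntry, storing_cpu: int) -> int:
    """Return the number of directory tag comparisons for one store."""
    if entry.state is FilterState.NO_COPY:
        return 0  # nothing to invalidate, so the lookup is skipped
    if entry.state is FilterState.SINGLE:
        if entry.owner == storing_cpu:
            # Assumption: the only copy belongs to the writer (private data),
            # so no other cache needs an invalidation and no lookup is needed.
            return 0
        # Only the owner's duplicate tags are searched (reduced associativity).
        return LOCAL_WAYS
    # Unfiltered case: search the duplicate tags of every processor.
    return NUM_PROCESSORS * LOCAL_WAYS
```

Under these assumptions, an unfiltered lookup costs `NUM_PROCESSORS * LOCAL_WAYS` comparisons, a lookup directed at a single remote holder costs `LOCAL_WAYS`, and a store to private data costs none, which is the kind of saving the filter targets.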
We evaluate the filter in a CMP with 8 in-order processors, each with 4 threads, and a memory hierarchy with local caches and a shared cache. We show that a filter representing 0.7% of the size of the shared cache can avoid, on average, 97% and 93% of all comparisons performed by directory lookups for SPLASH2 and Specweb2005, respectively. A small performance loss of 0.3% occurs only for SPLASH2. As a result, directory power is reduced on average by 30.8% and 22.4% for SPLASH2 and Specweb2005, respectively.
Keywords
Memory Hierarchy, Leakage Power, Local Cache, Cache Coherence, Cache Block