
Replacement techniques for dynamic NUCA cache designs on CMPs

The Journal of Supercomputing

Abstract

The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory. A bank replacement policy that manages the NUCA cache efficiently is therefore desirable. However, the decentralized nature of NUCA has undermined the effectiveness of replacement policies: banks operate independently of each other, so their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.




Notes

  1. The experimental methodology is described in Sect. 3.


Author information

Corresponding author

Correspondence to Javier Lira.


About this article

Cite this article

Lira, J., Molina, C., Rakvic, R.N. et al. Replacement techniques for dynamic NUCA cache designs on CMPs. J Supercomput 64, 548–579 (2013). https://doi.org/10.1007/s11227-012-0859-6

  • DOI: https://doi.org/10.1007/s11227-012-0859-6
