
The Journal of Supercomputing, Volume 72, Issue 2, pp 718–752

Locality-aware data replication in the last-level cache for large scale multicores

  • Farrukh Hijaz
  • Qingchuan Shi
  • George Kurian
  • Srinivas Devadas
  • Omer Khan
Article

Abstract

Next-generation large single-chip multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of on-chip cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). The goal is to lower memory access latency and energy by replicating only cache lines with high reuse in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on a low-overhead yet highly accurate in-hardware runtime classifier that operates at cache line granularity and allows replication only of cache lines with high reuse. Furthermore, the classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. On a set of parallel benchmarks, the proposed protocol reduces overall energy by 14.7, 10.7, 10.5, and 16.7% and completion time by 2.5, 6.5, 4.5, and 9.5% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes, respectively. An efficient classifier implementation is evaluated with an overhead of 5.44 KB, which translates to only 1.58% on top of the Static-NUCA baseline’s cache-related per-core storage.
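
The idea of replicating only lines with high reuse can be illustrated with a minimal sketch. This is not the protocol evaluated in the article: the class name `ReuseClassifier`, the `replication_threshold` parameter, its default of 3, and the reset-on-eviction policy are all assumptions made for illustration, and the LLC-pressure adaptation mentioned in the abstract is not modeled.

```python
from collections import defaultdict


class ReuseClassifier:
    """Toy per-(core, line) reuse tracker.

    All names and the threshold value are illustrative assumptions,
    not details taken from the article.
    """

    def __init__(self, replication_threshold=3):
        self.replication_threshold = replication_threshold
        # (core, line) -> number of accesses served by the home LLC slice
        self.home_reuse = defaultdict(int)
        # (core, line) pairs that currently hold a replica in the local slice
        self.replicated = set()

    def access(self, core, line):
        """Return 'replica' if served from the local slice, else 'home'."""
        key = (core, line)
        if key in self.replicated:
            return "replica"
        self.home_reuse[key] += 1
        # Once reuse at the home slice reaches the threshold, allow a replica
        # so that later accesses by this core are served locally.
        if self.home_reuse[key] >= self.replication_threshold:
            self.replicated.add(key)
        return "home"

    def evict_replica(self, core, line):
        """Drop the replica and restart reuse tracking for this core and line."""
        key = (core, line)
        self.replicated.discard(key)
        self.home_reuse[key] = 0


clf = ReuseClassifier(replication_threshold=3)
trace = [clf.access(core=0, line=0x40) for _ in range(5)]
print(trace)
```

In this sketch a line that a core touches only once or twice never occupies space in that core's local slice, which is the intuition behind keeping the off-chip miss rate low while still shortening access latency for heavily reused lines.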

Keywords

Multicore; Cache hierarchy; Data management; Energy efficiency

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Farrukh Hijaz (1)
  • Qingchuan Shi (1)
  • George Kurian (2, 3)
  • Srinivas Devadas (2)
  • Omer Khan (1, corresponding author)

  1. University of Connecticut, Storrs, USA
  2. Massachusetts Institute of Technology, Cambridge, USA
  3. Google, Mountain View, USA
