Exploring grouped coherence for clustered hierarchical cache

The Journal of Supercomputing

Abstract

The industry trend in processor design is toward integrating an ever larger number of cores on a single chip. As a result, researchers must deal with frequent data migration across the network-on-chip and with growing on-chip traffic. Moving from a flat to a hierarchical organization is probably a natural design methodology for scalable systems (Martin et al. in Commun ACM, 55(7):78–89, 2012. doi:10.1145/2209249.2209269). Unfortunately, a hierarchical directory protocol inevitably incurs overhead in on-chip traffic, protocol complexity, and access latency. In this paper, we target the hierarchical cache coherence protocol to reduce the potentially high cost of maintaining cache coherence in current multicore processors. We propose a novel vertical caching protocol combined with grouped coherence, in which the coherence domain expands on demand. More specifically, its design philosophy is to provide a ‘best-effort’ single-copy delivery, which keeps shared data only at the first common shared level. Compared with the previous hierarchical protocol, our proposal improves performance by 9.9% in the 16-core system and 13.4% in the 64-core system, and reduces on-chip traffic by about 10.8% in the 16-core system and 15.9% in the 64-core system.
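The idea of placing shared data at the first common shared level can be illustrated with a small sketch. This is not the protocol from the paper: it assumes a uniform cluster tree with a hypothetical fan-out of 4, cores numbered consecutively, and level 0 meaning a core's private cache. The names `first_common_level` and `Block` are illustrative only, and coherence states, messages, and evictions are ignored.

```python
def first_common_level(cores, fanout=4):
    """Return (level, cluster_id) of the smallest cluster holding all cores.

    Assumption: level k groups fanout**k consecutive cores, and level 0
    is a single core's private cache.
    """
    ids = set(cores)
    if not ids:
        raise ValueError("need at least one sharer")
    level = 0
    # Climb the hierarchy until every sharer maps to the same cluster.
    while len(ids) > 1:
        ids = {i // fanout for i in ids}
        level += 1
    return level, next(iter(ids))


class Block:
    """Tracks the sharers of one cache block (hypothetical helper)."""

    def __init__(self, fanout=4):
        self.fanout = fanout
        self.sharers = set()

    def access(self, core):
        # The coherence domain grows only when a new sharer falls outside
        # the cluster that currently covers all existing sharers.
        self.sharers.add(core)
        return first_common_level(self.sharers, self.fanout)


block = Block()
print(block.access(5))   # only core 5: stays private
print(block.access(6))   # cores 5 and 6 share a level-1 cluster
print(block.access(12))  # core 12 is in another level-1 cluster
```

Under these assumptions, a block touched by one core stays at level 0, and the level at which the single copy would live rises only as sharers from more distant clusters appear.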

Notes

  1. Note that the term cluster is defined recursively in this article: a larger cluster may contain several sub-clusters.

  2. The triplet-based hierarchical interconnection network (THIN), which belongs to the WK\(_{(3,3)}\) networks [17, 20], is a novel NoC whose number of processing nodes grows as a power of three with each stage.

References

  1. Acacio ME, Gonzalez J, Garcia JM, Duato J (2004) An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration. IEEE Trans Parallel Distrib Syst 15(8):755–768. doi:10.1109/TPDS.2004.27

  2. Balasubramonian R, Jouppi NP, Muralimanohar N (2011) Multi-core cache hierarchies. Morgan & Claypool. doi:10.2200/S00365ED1V01Y201105CAC017

  3. Beckmann BM, Marty MR, Wood DA (2006) ASR: adaptive selective replication for CMP caches. In: 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), pp 443–454. doi:10.1109/MICRO.2006.10

  4. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi:10.1145/2024716.2024718

  5. Chang J, Sohi GS (2006) Cooperative caching for chip multiprocessors. In: 33rd International Symposium on Computer Architecture (ISCA’06), pp 264–276. doi:10.1109/ISCA.2006.17

  6. Chishti Z, Powell MD, Vijaykumar TN (2005) Optimizing replication, communication, and capacity allocation in CMPs. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp 357–368. doi:10.1109/ISCA.2005.39

  7. Della Vecchia G, Sanges C (1988) A recursively scalable network VLSI implementation. Fut Gener Comput Syst 4(3):235–243

  8. Demetriades S, Cho S (2014) Stash directory: a scalable directory for many-core coherence. In: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp 177–188. doi:10.1109/HPCA.2014.6835928

  9. Fu Y, Nguyen TM, Wentzlaff D (2015) Coherence domain restriction on large scale systems. In: Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48. ACM, New York, NY, USA, pp 686–698. doi:10.1145/2830772.2830832

  10. Guo SL, Wang HX, Xue YB, Li CM, Wang DS (2010) Hierarchical cache directory for CMP. J Comput Sci Technol 25(2):246–256. doi:10.1007/s11390-010-9321-5

  11. Jerger NE, Peh LS, Lipasti M (2008) Virtual circuit tree multicasting: a case for on-chip hardware multicast support. In: 2008 International Symposium on Computer Architecture, pp 229–240. doi:10.1109/ISCA.2008.12

  12. Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. SIGARCH Comput Archit News 30(5):211–222. doi:10.1145/635506.605420

  13. Lotfi-Kamran P, Grot B, Ferdman M, Volos S, Kocberber O, Picorel J, Adileh A, Jevdjic D, Idgunji S, Ozer E, Falsafi B (2012) Scale-out processors. In: 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp 500–511. doi:10.1109/ISCA.2012.6237043

  14. Martin MMK, Hill MD, Sorin DJ (2012) Why on-chip cache coherence is here to stay. Commun ACM 55(7):78–89. doi:10.1145/2209249.2209269

  15. Nilsson H, Stenstrom P (1992) The scalable tree protocol-a cache coherence approach for large-scale multiprocessors. In: [1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, pp 498–506. doi:10.1109/SPDP.1992.242703

  16. Pugsley SH, Spjut JB, Nellans DW, Balasubramonian R (2010) SWEL: hardware cache coherence protocols to map shared data onto shared caches. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, New York, NY, USA, pp 465–476. doi:10.1145/1854273.1854331

  17. Rashid KHU, Shi F, Ji W, Jing Y, Wang Y, Liu C, Deng N, Li J (2010) Computationally efficient locality-aware interconnection topology for multi-processor system-on-chip (MP-SoC). Chin Sci Bull 55(29):3363–3371

  18. Ros A, Davari M, Kaxiras S (2015) Hierarchical private/shared classification: the key to simple and efficient coherence for clustered cache hierarchies. In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp 186–197. doi:10.1109/HPCA.2015.7056032

  19. Sodani A, Gramunt R, Corbal J, Kim HS, Vinod K, Chinthamani S, Hutsell S, Agarwal R, Liu YC (2016) Knights Landing: second-generation Intel Xeon Phi product. IEEE Micro 36(2):34–46. doi:10.1109/MM.2016.25

  20. Wang YC, Juan ST (2015) Hamiltonicity of the basic WK-recursive pyramid with and without faulty nodes. Theor Comput Sci 562(C):542–556

  21. Wentzlaff D, Griffin P, Hoffmann H, Bao L, Edwards B, Ramey C, Mattina M, Miao CC, Brown JF III, Agarwal A (2007) On-chip interconnection architecture of the tile processor. IEEE Micro 27(5):15–31. doi:10.1109/MM.2007.4378780

  22. Wilson AW Jr. (1987) Hierarchical cache/bus architecture for shared memory multiprocessors. In: Proceedings of the 14th Annual International Symposium on Computer Architecture, ISCA ’87. ACM, New York, NY, USA, pp 244–252. doi:10.1145/30350.30378

  23. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. ISCA ’95. ACM, New York, NY, USA, pp 24–36. doi:10.1145/223982.223990

  24. Yan S, Zhou X, Gao Y, Chen H, Luo S, Zhang P, Cherukuri N, Ronen R, Saha B (2009) Terascale chip multiprocessor memory hierarchy and programming model. In: 2009 International Conference on High Performance Computing (HiPC), pp 150–159. doi:10.1109/HIPC.2009.5433215

  25. Zhang M, Asanovic K (2005) Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp 336–345. doi:10.1109/ISCA.2005.53

  26. Zhao H, Shriraman A, Kumar S, Dwarkadas S (2013) Protozoa: adaptive granularity cache coherence. In: Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13. ACM, New York, NY, USA, pp 547–558. doi:10.1145/2485922.2485969

  27. Zuo W, Feng S, Qi Z, Weixing J, Jiaxin L, Ning D, Licheng X, Yuan T, Baojun Q (2009) Group-caching for noc based multicore cache coherent systems. In: 2009 Design, Automation Test in Europe Conference Exhibition, pp 755–760. doi:10.1109/DATE.2009.5090765

Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions.

Author information

Corresponding author

Correspondence to Sensen Hu.

About this article

Cite this article

Hu, S., Shi, F., Ji, W. et al. Exploring grouped coherence for clustered hierarchical cache. J Supercomput 73, 4137–4157 (2017). https://doi.org/10.1007/s11227-017-2024-8

  • DOI: https://doi.org/10.1007/s11227-017-2024-8
