Exploring grouped coherence for clustered hierarchical cache

Hu, Sensen; Shi, Feng; Ji, Weixing; Chen, Xu; Talpur, Shahnawaz

doi:10.1007/s11227-017-2024-8

Published: 28 March 2017

Volume 73, pages 4137–4157 (2017)

The Journal of Supercomputing

Sensen Hu¹,
Feng Shi¹,
Weixing Ji¹,
Xu Chen¹ &
Shahnawaz Talpur²

Abstract

The industry trend in processor design is toward integrating an increasing number of cores on a single chip. As a result, researchers have to deal with frequent data migration across the network-on-chip and with growing on-chip traffic. Moving from a flat to a hierarchical organization is probably a natural design methodology for scalable systems (Martin et al. in Commun ACM, 55(7):78–89, 2012. doi:10.1145/2209249.2209269). Unfortunately, a hierarchical directory protocol inevitably adds on-chip traffic overhead, protocol complexity and access latency. In this paper, we target the hierarchical cache coherence protocol to overcome the potentially high cost of maintaining cache coherence in current multicore processors. We propose a novel vertical caching protocol combined with grouped coherence, in which the coherence domain expands on demand. More specifically, its design philosophy is to provide a ‘best-effort’ single-copy delivery, which allows shared data to reside only in the first common shared level. Compared to the previous hierarchical protocol, our proposal improves performance by 9.9% in the 16-core system and 13.4% in the 64-core system, and reduces on-chip traffic by about 10.8% in the 16-core system and 15.9% in the 64-core system.
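As a rough illustration of what "the first common shared level" could mean, the sketch below finds the lowest cache level that a set of sharing cores has in common. It is not the protocol from the paper: the function name `first_common_level`, the uniform fan-out of 4 and the contiguous core numbering are all assumptions made only for this example.

```python
def first_common_level(core_ids, fanout=4):
    """Return the lowest cache level shared by every core in core_ids.

    Level 0 stands for a core's private cache; level k is assumed to be
    shared by fanout**k consecutively numbered cores. Both the uniform
    fan-out and the numbering scheme are hypothetical.
    """
    if not core_ids:
        raise ValueError("need at least one core")
    level = 0
    groups = set(core_ids)
    # Climb one level at a time until all sharers fall into one cluster.
    while len(groups) > 1:
        groups = {g // fanout for g in groups}
        level += 1
    return level


if __name__ == "__main__":
    # Cores 0 and 1 sit in the same lowest-level cluster of four cores.
    print(first_common_level([0, 1]))
    # Cores 0 and 5 only meet one level higher in a 16-core hierarchy.
    print(first_common_level([0, 5]))
```

Under these assumptions, a block shared by cores 0 and 1 would be kept at level 1, while a block shared by cores 0 and 5 would be kept at level 2, so the coherence domain grows only as far as the sharers require.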

Notes

Note that the term "cluster" in this article is defined recursively: a larger cluster may contain several sub-clusters.
The triplet-based hierarchical interconnection network (THIN), which belongs to the WK\(_{(3,3)}\) family of networks [17, 20], is a novel NoC whose number of processing nodes increases by a power of three at each stage.
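The two notes above can be illustrated together with a small sketch. It assumes that a stage-0 cluster is a single node and that a stage-k cluster consists of three stage-(k-1) sub-clusters, so the node count is 3^k; this reading of "stage", and the helper names `build_cluster` and `count_nodes`, are assumptions for illustration rather than definitions taken from the article.

```python
import itertools


def build_cluster(stage):
    """Build a nested list that models a recursively defined cluster.

    Assumption: a stage-0 cluster is one node (an integer id), and a
    stage-k cluster is a list of three stage-(k-1) sub-clusters.
    """
    ids = itertools.count()

    def build(k):
        if k == 0:
            return next(ids)
        return [build(k - 1) for _ in range(3)]

    return build(stage)


def count_nodes(cluster):
    """Count the leaf nodes of a nested cluster."""
    if isinstance(cluster, int):
        return 1
    return sum(count_nodes(sub) for sub in cluster)


if __name__ == "__main__":
    for stage in range(4):
        # Under the stated assumption the count equals 3 ** stage.
        print(stage, count_nodes(build_cluster(stage)))
```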

References

Acacio ME, Gonzalez J, Garcia JM, Duato J (2004) An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration. IEEE Trans Parallel Distrib Syst 15(8):755–768. doi:10.1109/TPDS.2004.27
Balasubramonian R, Jouppi NP, Muralimanohar N (2011) Multi-core cache hierarchies. Morgan Claypool. doi:10.2200/S00365ED1V01Y201105CAC017
Beckmann BM, Marty MR, Wood DA (2006) Asr: adaptive selective replication for cmp caches. In: 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), pp 443–454. doi:10.1109/MICRO.2006.10
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi:10.1145/2024716.2024718
Chang J, Sohi GS (2006) Cooperative caching for chip multiprocessors. In: 33rd International Symposium on Computer Architecture (ISCA’06), pp 264–276. doi:10.1109/ISCA.2006.17
Chishti Z, Powell MD, Vijaykumar TN (2005) Optimizing replication, communication, and capacity allocation in cmps. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp 357–368. doi:10.1109/ISCA.2005.39
Della Vecchia G, Sanges C (1988) A recursively scalable network VLSI implementation. Fut Gener Comput Syst 4(3):235–243
Demetriades S, Cho S (2014) Stash directory: a scalable directory for many-core coherence. In: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp 177–188. doi:10.1109/HPCA.2014.6835928
Fu Y, Nguyen TM, Wentzlaff D (2015) Coherence domain restriction on large scale systems. In: Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48. ACM, New York, NY, USA, pp 686–698. doi:10.1145/2830772.2830832
Guo SL, Wang HX, Xue YB, Li CM, Wang DS (2010) Hierarchical cache directory for cmp. J Comput Sci Technol 25(2):246–256. doi:10.1007/s11390-010-9321-5
Jerger NE, Peh LS, Lipasti M (2008) Virtual circuit tree multicasting: a case for on-chip hardware multicast support. In: 2008 International Symposium on Computer Architecture, pp 229–240. doi:10.1109/ISCA.2008.12
Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. SIGARCH Comput Archit News 30(5):211–222. doi:10.1145/635506.605420
Lotfi-Kamran P, Grot B, Ferdman M, Volos S, Kocberber O, Picorel J, Adileh A, Jevdjic D, Idgunji S, Ozer E, Falsafi B (2012) Scale-out processors. In: 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp 500–511. doi:10.1109/ISCA.2012.6237043
Martin MMK, Hill MD, Sorin DJ (2012) Why on-chip cache coherence is here to stay. Commun ACM 55(7):78–89. doi:10.1145/2209249.2209269
Nilsson H, Stenstrom P (1992) The scalable tree protocol-a cache coherence approach for large-scale multiprocessors. In: [1992] Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, pp 498–506. doi:10.1109/SPDP.1992.242703
Pugsley SH, Spjut JB, Nellans DW, Balasubramonian R (2010) Swel: Hardware cache coherence protocols to map shared data onto shared caches. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, New York, NY, USA, pp 465–476. doi:10.1145/1854273.1854331
Rashid KHU, Shi F, Ji W, Jing Y, Wang Y, Liu C, Deng N, Li J (2010) Computationally efficient locality-aware interconnection topology for multi-processor system-on-chip(mp-soc). Chin Sci Bull 55(29):3363–3371
Ros A, Davari M, Kaxiras S (2015) Hierarchical private/shared classification: the key to simple and efficient coherence for clustered cache hierarchies. In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp 186–197. doi:10.1109/HPCA.2015.7056032
Sodani A, Gramunt R, Corbal J, Kim HS, Vinod K, Chinthamani S, Hutsell S, Agarwal R, Liu YC (2016) Knights landing: second-generation intel xeon phi product. IEEE Micro 36(2):34–46. doi:10.1109/MM.2016.25
Wang YC, Juan ST (2015) Hamiltonicity of the basic wk-recursive pyramid with and without faulty nodes. Theor Comput Sci 562(C):542–556
Wentzlaff D, Griffin P, Hoffmann H, Bao L, Edwards B, Ramey C, Mattina M, Miao CC, Brown JF III, Agarwal A (2007) On-chip interconnection architecture of the tile processor. IEEE Micro 27(5):15–31. doi:10.1109/MM.2007.4378780
Wilson AW Jr. (1987) Hierarchical cache/bus architecture for shared memory multiprocessors. In: Proceedings of the 14th Annual International Symposium on Computer Architecture, ISCA ’87. ACM, New York, NY, USA, pp 244–252. doi:10.1145/30350.30378
Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The splash-2 programs: characterization and methodological considerations. ISCA ’95. ACM, New York, NY, USA, pp 24–36. doi:10.1145/223982.223990
Yan S, Zhou X, Gao Y, Chen H, Luo S, Zhang P, Cherukuri N, Ronen R, Saha B (2009) Terascale chip multiprocessor memory hierarchy and programming model. In: 2009 International Conference on High Performance Computing (HiPC), pp 150–159. doi:10.1109/HIPC.2009.5433215
Zhang M, Asanovic K (2005) Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors. In: 32nd International Symposium on Computer Architecture (ISCA’05), pp 336–345. doi:10.1109/ISCA.2005.53
Zhao H, Shriraman A, Kumar S, Dwarkadas S (2013) Protozoa: adaptive granularity cache coherence. In: Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13. ACM, New York, NY, USA, pp 547–558. doi:10.1145/2485922.2485969
Zuo W, Feng S, Qi Z, Weixing J, Jiaxin L, Ning D, Licheng X, Yuan T, Baojun Q (2009) Group-caching for noc based multicore cache coherent systems. In: 2009 Design, Automation Test in Europe Conference Exhibition, pp 755–760. doi:10.1109/DATE.2009.5090765

Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions.

Author information

Authors and Affiliations

Beijing Institute of Technology, Beijing, China
Sensen Hu, Feng Shi, Weixing Ji & Xu Chen
Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan
Shahnawaz Talpur

Corresponding author

Correspondence to Sensen Hu.

About this article

Cite this article

Hu, S., Shi, F., Ji, W. et al. Exploring grouped coherence for clustered hierarchical cache. J Supercomput 73, 4137–4157 (2017). https://doi.org/10.1007/s11227-017-2024-8

Published: 28 March 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11227-017-2024-8
