
Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms


The significant speed gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. In NUCA, the cache is partitioned into sub-banks, each of which is an independently accessible entity, and the sub-banks are connected by a fast on-chip network (NoC). This paper presents two NoC-assisted mechanisms to improve the performance and power consumption of NUCA coherence. The first mechanism provides priority-based communication, built on the wormhole routing architecture, to support NUCA coherence: high-priority coherence packets are transmitted first so that they are delivered sooner. The second mechanism offers multicasting communication on top of the proposed priority-based NoC to provide efficient cache coherency for NUCA. Coherence packets are dispatched and collected at collecting nodes (CN) to further decrease the number of coherence messages flowing in the NoC. Experimental results show that priority-based transmission can improve performance by approximately 10 %. The proposed multicasting mechanism can further improve performance and decrease the power consumption of the NoC in NUCA by approximately 15 %. Together, the two proposed mechanisms can enhance performance by 25 % on average.
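As a rough illustration of the first mechanism, the sketch below models a single router output port that always forwards coherence packets ahead of ordinary data packets. It is a toy model, not the paper's design: the names `OutputPortArbiter`, `COHERENCE` and `DATA` are hypothetical, there are only two priority levels, and arbitration is shown per packet, whereas a wormhole router works on flits.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority levels (assumption): a lower value is served first.
COHERENCE = 0
DATA = 1


@dataclass(order=True)
class Packet:
    priority: int                    # primary sort key
    seq: int                         # arrival order, breaks ties (FIFO within a level)
    name: str = field(compare=False)


class OutputPortArbiter:
    """Toy arbiter for one router output port (illustrative only)."""

    def __init__(self):
        self._queue = []
        self._seq = count()

    def enqueue(self, name, priority):
        # Tag each packet with an arrival number so equal priorities stay FIFO.
        heapq.heappush(self._queue, Packet(priority, next(self._seq), name))

    def drain(self):
        # Pop packets in the order the port would transmit them.
        order = []
        while self._queue:
            order.append(heapq.heappop(self._queue).name)
        return order


def demo():
    arb = OutputPortArbiter()
    arb.enqueue("data-A", DATA)
    arb.enqueue("inv-1", COHERENCE)
    arb.enqueue("data-B", DATA)
    arb.enqueue("ack-1", COHERENCE)
    return arb.drain()


if __name__ == "__main__":
    print(demo())  # ['inv-1', 'ack-1', 'data-A', 'data-B']
```

Both coherence packets leave the port before either data packet, even though a data packet arrived first; within each priority level the arrival order is preserved.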





Author information

Correspondence to Kuei-Chung Chang.


About this article

Cite this article

Chang, K., Liao, I. & Liao, C. Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms. J Supercomput 62, 1318–1337 (2012).



  • Many-core SoC
  • Non-uniform cache architecture
  • Network-on-chip