The Journal of Supercomputing, Volume 62, Issue 3, pp 1318–1337

Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms

  • Kuei-Chung Chang
  • Ing-Ming Liao
  • Chiu-Han Liao


The significant speed gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. The cache is partitioned into sub-banks, each of which is an independently accessible entity, and the sub-banks are connected by a fast network-on-chip (NoC). This paper presents two NoC-assisted mechanisms that improve the performance and reduce the power consumption of NUCA coherence. The first mechanism provides priority-based communication on top of a wormhole routing architecture to support NUCA coherence: high-priority coherence packets are transmitted first to save time. The second mechanism adds multicast communication to the proposed priority-based NoC to provide efficient cache coherency for NUCA. Coherence packets are dispatched and collected at collecting nodes (CN), which further decreases the number of coherence messages flowing in the NoC. Experimental results show that priority-based transmission improves performance by approximately 10%. The proposed multicasting mechanism further improves performance and decreases the power consumption of the NoC in NUCA by approximately 15%. Together, the two mechanisms improve performance by 25% on average.
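The priority-based idea can be illustrated with a minimal Python sketch. It is not the authors' router design: it models a single output port at packet granularity and ignores wormhole flits, virtual channels, and the collecting nodes. The names `Packet`, `PriorityPort`, `COHERENCE`, and `DATA`, and the convention that a lower number means higher priority, are assumptions made for illustration only.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority levels: a lower value is served first (assumed convention).
COHERENCE, DATA = 0, 1

_arrival = count()  # arrival counter, used to keep equal-priority packets in FIFO order


@dataclass(order=True)
class Packet:
    priority: int
    seq: int = field(default_factory=lambda: next(_arrival))
    kind: str = field(default="data", compare=False)  # label only, not compared


class PriorityPort:
    """One output port that always forwards the most urgent waiting packet."""

    def __init__(self):
        self._queue = []

    def enqueue(self, packet):
        heapq.heappush(self._queue, packet)

    def forward(self):
        # Return the next packet to send, or None when nothing is waiting.
        return heapq.heappop(self._queue) if self._queue else None


port = PriorityPort()
port.enqueue(Packet(DATA, kind="data-A"))
port.enqueue(Packet(COHERENCE, kind="invalidate"))
port.enqueue(Packet(DATA, kind="data-B"))

order = []
while (packet := port.forward()) is not None:
    order.append(packet.kind)

print(order)
```

In this sketch the coherence packet overtakes the data packet that arrived before it, while the two data packets keep their arrival order.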


Keywords: Many-core SoC · Non-uniform cache architecture · Network-on-chip



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Kuei-Chung Chang (1)
  • Ing-Ming Liao (1)
  • Chiu-Han Liao (1)
  1. Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan