Centaur: a hybrid network-on-chip architecture utilizing micro-network fusion

Abstract

The escalating proliferation of multicore chips has accentuated the criticality of the on-chip network. Packet-based networks-on-chip (NoCs) have emerged as the de facto interconnect of future chip multi-processors (CMPs). On-chip traffic comprises a mixture of data and control messages from the cache coherence protocol. Given the latency-criticality of control messages, in this paper we aim to optimize their delivery times. Instead of treating the on-chip router as a monolithic component, we advocate the introduction of an ultra-low-latency ring-inspired (i.e., utilizing ring primitive building blocks) support micro-network that is optimized for control messages. This \(\upmu \)NoC is fused with a throughput-driven conventional NoC router to form a hybrid architecture, called Centaur, which maintains separate data paths and control logic for the two fused networks. Full-system simulation results from a 64-core CMP indicate that the proposed fused Centaur router improves overall system performance by up to 26%, as compared to a state-of-the-art router implementation. Furthermore, hardware synthesis results using commercial 65 nm libraries indicate that Centaur’s area and power overheads are 9% and 3%, respectively, as compared to a baseline router design. More importantly, the new design does not affect the router’s critical path.

Notes

  1. In Greek mythology, Centaur was a hybrid creature that was part human and part horse. Much like this mythical creature, our proposed router architecture fuses two distinct networks into one entity.

  2. Wind River Systems: http://www.windriver.com/

References

  1. Abad P, Puente V, Gregorio JA (2013) LIGERO: a light but efficient router conceived for cache-coherent chip multiprocessors. ACM Trans Archit Code Optim 9(4):37:1–37:21

  2. Abad P, Puente V, Gregorio JA, Prieto P (2007) Rotary router: an efficient architecture for CMP interconnection networks. In: Proceedings of the 34th annual international symposium on computer architecture, ISCA ’07, pp 116–125

  3. Abousamra A, Melhem R, Jones A (2012) Deja vu switching for multiplane NoCs. In: Sixth IEEE/ACM international symposium on networks on chip (NoCS), pp 11–18

  4. Agarwal N, Krishna T, Peh LS, Jha N (2009) GARNET: a detailed on-chip network model inside a full-system simulator. In: IEEE international symposium on performance analysis of systems and software

  5. Agarwal N, Peh LS, Jha N (2009) In-network snoop ordering (INSO): snoopy coherence on unordered interconnects. In: Proceedings of the 15th international symposium on high-performance computer architecture, pp 67–78

  6. Anjan K, Pinkston T, Duato J (1996) Generalized theory for deadlock-free adaptive wormhole routing and its application to Disha Concurrent. In: Proceedings of IPPS ’96. The 10th international parallel processing symposium, pp 815–821

  7. Balfour J, Dally WJ (2006) Design tradeoffs for tiled CMP on-chip networks. In: Proceedings of the 20th annual international conference on supercomputing, pp 187–198

  8. Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. Thesis, Princeton University

  9. Bolotin E, Guz Z, Cidon I, Ginosar R, Kolodny A (2007) The power of priority: NoC based distributed cache coherency. In: Proceedings of the first international symposium on networks-on-chip

  10. Bourduas S, Zilic Z (2007) A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing. In: Proceedings of the first international symposium on networks-on-chip, pp 195–204

  11. Chuang JH, Chao WC (1994) Torus with slotted rings architecture for a cache-coherent multiprocessor. In: Proceedings of the 1994 international conference on parallel and distributed systems, pp 76–81

  12. Das R, Eachempati S, Mishra A, Narayanan V, Das C (2009) Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. In: Proceedings of the 15th international symposium on high-performance computer architecture, pp 175–186

  13. Das R, Mutlu O, Moscibroda T, Das C (2009) Application-aware prioritization mechanisms for on-chip networks. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, pp 280–291

  14. Duato J, Yalamanchili S, Ni L (2003) Interconnection networks. Morgan Kaufmann, San Francisco

  15. Flores A, Aragon J, Acacio M (2010) Heterogeneous interconnects for energy-efficient message management in CMPs. IEEE Trans Comput 59(1):16–28

  16. Gratz P, Kim C, McDonald R, Keckler S, Burger D (2006) Implementation and evaluation of on-chip network architectures. In: Proceedings of international conference on computer design

  17. Hayenga M, Jerger NE, Lipasti M (2009) SCARAB: a single cycle adaptive routing and bufferless network. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture

  18. Holliday M, Stumm M (1994) Performance evaluation of hierarchical ring-based shared memory multiprocessors. IEEE Trans Comput 43:52–67

  19. Jerger NDE, Peh LS, Lipasti MH (2008) Circuit-switched coherence. In: Proceedings of the second ACM/IEEE international symposium on networks-on-chip, pp 193–202

  20. Kim J (2009) Low-cost router microarchitecture for on-chip networks. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, pp 255–266

  21. Kim J, Nicopoulos C, Park D (2006) A gracefully degrading and energy-efficient modular router architecture for on-chip networks. SIGARCH Comput Archit News 34(2):4–15

  22. Kumar A, Peh LS, Kundu P, Jha NK (2007) Express virtual channels: towards the ideal interconnection fabric. In: Proceedings of the 34th annual international symposium on computer architecture

  23. Kumar A, Kundu P, Singh A, Peh LS, Jha N (2007) A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65 nm CMOS. In: Proceedings of the 25th international conference on computer design, pp 63–70

  24. Martin MMK, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput Archit News 33:2005

  25. Matsutani H, Koibuchi M, Amano H, Yoshinaga T (2009) Prediction router: yet another low latency on-chip router architecture. In: Proceedings of the IEEE 15th international symposium on high performance computer architecture, pp 367–378

  26. Mullins R, West A, Moore S (2004) Low-latency virtual-channel routers for on-chip networks. In: Proceedings of the 31st annual international symposium on computer architecture, p 188

  27. Mullins R, West A, Moore S (2006) The design and implementation of a low-latency on-chip network. In: Proceedings of Asia and South Pacific conference on design automation, p 6

  28. Nicopoulos C, Park D, Kim J, Vijaykrishnan N, Yousif M, Das C (2006) ViChaR: a dynamic virtual channel regulator for network-on-chip routers. In: 39th annual IEEE/ACM international symposium on microarchitecture, pp 333–346

  29. Park C, Badeau R, Biro L, Chang J, Singh T, Vash J, Wang B, Wang T (2010) A 1.2 TB/s on-chip ring interconnect for 45 nm 8-core enterprise Xeon processor. In: Proceedings of IEEE international solid-state circuits conference digest of technical papers, pp 180–181

  30. Peh LS, Dally WJ (2001) A delay model and speculative architecture for pipelined routers. In: Proceedings of the 7th international symposium on high-performance computer architecture, p 255

  31. Pinkston T (1999) Flexible and efficient routing based on progressive deadlock recovery. IEEE Trans Comput 48(7):649–669

  32. Sibai F (2008) Adapting the hyper-ring interconnect for many-core processors. In: International symposium on parallel and distributed processing with applications, pp 649–654

  33. Singh A, Dally W, Towles B, Gupta A (2004) Globally adaptive load-balanced routing on tori. Comput Archit Lett 3(1):2

  34. Song YH, Pinkston T (2003) A progressive approach to handling message-dependent deadlock in parallel computer systems. IEEE Trans Parallel Distrib Syst 14(3):259–275

  35. Volos S, Seiculescu C, Grot B, Pour N, Falsafi B, De Micheli G (2012) CCNoC: specializing on-chip interconnects for energy efficiency in cache-coherent servers. In: Sixth IEEE/ACM international symposium on networks on chip (NoCS), pp 67–74

Author information

Corresponding author

Correspondence to Junghee Lee.

Appendix I: Formal proof of protocol-level deadlock avoidance

As stated in Sect. 4.2, progressive recovery mechanisms resolve all types of deadlocks when the following two conditions are met [6, 31]: (1) the recovery network is free from deadlocks, and (2) in every deadlock situation, at least one packet is granted access to the recovery network. The following theorems prove that both conditions are satisfied in the Centaur architecture and, thus, that any protocol-level deadlock is guaranteed to be broken by the proposed time-out mechanism.

Theorem 1

The deadlock freedom of the DNoC is not affected by the \(\upmu \)NoC.

Proof

Although the inter-router links are shared between the DNoC and the \(\upmu \)NoC, the latter has its own buffers (\(\upmu \)NoC Buffer and Intermediate Buffer). Therefore, the \(\upmu \)NoC does not affect the deadlock freedom of the DNoC, because the packets in the \(\upmu \)NoC do not block any packets in the DNoC. \(\square \)

Theorem 2

The time-out mechanism enables every packet in a deadlock situation to have the opportunity to access the recovery network.

Proof

When a protocol-level deadlock occurs (or is suspected), all the packets in the \(\upmu \)NoC buffers (\(\upmu \)NoC Buffer and Intermediate Buffer) are given the chance to escape from the deadlock. As mentioned in Sect. 4.2, a packet that is not at the head of its buffer is given the same opportunity. \(\square \)

Corollary 1

All dependencies among the various message classes involved in deadlock situations are broken by the time-out mechanism.

Proof

Dependencies among message classes can, indeed, be created in the \(\upmu \)NoC buffers, whereas there are no such dependencies in the DNoC. In the \(\upmu \)NoC, a packet may be blocked by a preceding packet that belongs to a different message class. When the time-out mechanism is triggered, packets are no longer blocked by preceding packets in the \(\upmu \)NoC, because they are forwarded to the DNoC, and the head packet does not block the following packet(s) in the same buffer. \(\square \)
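The behaviour described in the proofs of Theorem 2 and Corollary 1 can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the router's actual logic: the class name `MicroNocBuffer`, the `timeout_cycles` threshold, the stall-counting trigger, and the choice to divert all buffered packets in a single cycle are hypothetical and serve only to show that, once the time-out fires, non-head packets are no longer held behind the head packet.

```python
from collections import deque


class MicroNocBuffer:
    """Toy FIFO standing in for a micro-NoC Buffer or Intermediate Buffer."""

    def __init__(self, timeout_cycles):
        # Hypothetical threshold: consecutive stalled cycles before time-out.
        self.timeout_cycles = timeout_cycles
        self.packets = deque()
        self.stalled_cycles = 0

    def push(self, packet):
        self.packets.append(packet)

    def tick(self, head_can_advance):
        """Model one cycle; return (sent_on_micro_noc, diverted_to_dnoc)."""
        if not self.packets:
            self.stalled_cycles = 0
            return [], []
        if head_can_advance:
            # Normal operation: only the head packet moves, in FIFO order.
            self.stalled_cycles = 0
            return [self.packets.popleft()], []
        self.stalled_cycles += 1
        if self.stalled_cycles < self.timeout_cycles:
            return [], []
        # Time-out (assumed trigger): every buffered packet, not only the
        # head, is handed to the DNoC, so the head blocks nothing behind it.
        diverted = list(self.packets)
        self.packets.clear()
        self.stalled_cycles = 0
        return [], diverted
```

With a threshold of three cycles, a buffer holding a blocked request followed by a response of a different message class releases both packets on the third stalled cycle, which is the situation the corollary relies on.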

One drawback of this mechanism is that packets may leave a router out of order. To preserve packet order, control packets should leave each router in the order of arrival, regardless of which network (\(\upmu \)NoC or DNoC) they arrive through. Ordering is required among packets with the same input-output port mapping. To enforce this ordering in hardware, we introduce a sequence numbering mechanism within the router. Note that this mechanism is accounted for in the area/power/timing evaluation of Sect. 5.3.

For every pair of input and output ports, two counters are maintained for each VC (message class) in the DNoC that is used by control packets. The ‘Head’ counter tracks the order of arriving packets and the ‘Tail’ counter tracks the order of departing packets. When a control packet arrives at a router, it is stored in the \(\upmu \)NoC Buffer, the Intermediate Buffer, or a VC buffer within the DNoC. Irrespective of the buffer in which it is stored, the packet is assigned a sequence number equal to the current value of ‘Head’, and the ‘Head’ counter is then increased by one. The packet can leave only when its sequence number matches the ‘Tail’ counter. After the packet departs, ‘Tail’ is increased by one.
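The Head/Tail bookkeeping can be sketched in software as follows. This is an illustrative model only: the class and method names are hypothetical, the counters are unbounded Python integers (a hardware implementation would use fixed-width counters, whose wrap-around handling is not described here), and port and VC identifiers are arbitrary hashable values.

```python
from collections import defaultdict


class SequenceOrderer:
    """Per-(input port, output port, VC) Head/Tail counters of one router."""

    def __init__(self):
        self.head = defaultdict(int)  # next sequence number to hand out
        self.tail = defaultdict(int)  # sequence number allowed to depart

    def on_arrival(self, in_port, out_port, vc):
        """Assign the current 'Head' value to the packet, then increment."""
        key = (in_port, out_port, vc)
        seq = self.head[key]
        self.head[key] += 1
        return seq

    def may_depart(self, in_port, out_port, vc, seq):
        """A packet may leave only when its number matches 'Tail'."""
        return seq == self.tail[(in_port, out_port, vc)]

    def on_departure(self, in_port, out_port, vc, seq):
        """Record a departure and advance 'Tail' by one."""
        key = (in_port, out_port, vc)
        if seq != self.tail[key]:
            raise ValueError("packet departed out of order")
        self.tail[key] += 1
```

In this model, the buffer that holds a packet plays no role: two packets of the same VC travelling from the same input port to the same output port depart in arrival order even if one sits in a \(\upmu \)NoC buffer and the other in a DNoC VC buffer, while packets with a different port mapping are tracked by separate counters.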

However, the sequence numbering introduces additional dependencies within a router, which could, in principle, cause deadlocks. The following theorem proves that they do not.

Theorem 3

Deadlock freedom is not affected by the additional dependencies created by the sequence numbering mechanism.

Proof

Corollary 1 proves that there are no dependencies among message classes when the time-out mechanism is triggered. The additional dependencies caused by the sequence numbering mechanism are only within the same message class (VC). Therefore, the additional dependencies do not cause any packet to be blocked by other message classes. In addition, since the dependencies are created based on the order of arrival (i.e., the dependencies are, essentially, ordered), they do not form any cycles. \(\square \)
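The acyclicity argument in the proof of Theorem 3 can be checked on small examples. The sketch below is an illustration under assumptions, not part of the proof: packets are modelled as hypothetical `(flow, seq)` pairs, where `flow` stands for an (input port, output port, VC) triple, each packet waits only for the packet with the preceding sequence number in the same flow, and sequence numbers are assumed not to wrap around. Python's standard `graphlib` module is used to test the resulting wait-for graph for cycles.

```python
from graphlib import CycleError, TopologicalSorter


def ordering_dependencies(packets):
    """Map each (flow, seq) packet to the set of packets it must wait for."""
    present = set(packets)
    graph = {}
    for flow, seq in present:
        predecessor = (flow, seq - 1)
        # A packet depends only on its immediate predecessor in the same
        # flow, and only if that predecessor is still in the router.
        graph[(flow, seq)] = {predecessor} if predecessor in present else set()
    return graph


def is_acyclic(graph):
    """Return True if the dependency graph admits a topological order."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False
```

Because every edge points from a higher sequence number to a lower one within a single flow, any set of packets built this way passes the check, whereas a hand-made graph in which two packets wait on each other fails it.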

About this article

Cite this article

Lee, J., Nicopoulos, C., Lee, H.G. et al. Centaur: a hybrid network-on-chip architecture utilizing micro-network fusion. Des Autom Embed Syst 18, 121–139 (2014). https://doi.org/10.1007/s10617-014-9131-z

Keywords

  • Networks-on-chip
  • Interconnection networks
  • Segregated/separated networks