Abstract
Networks-on-chip (NOCs) are becoming the de facto communication fabric to connect cores and cache banks in chip multiprocessors (CMPs). Routing algorithms, as one of the key components that influence NOC latency, are the subject of extensive research. Static routing algorithms have low cost but unlike adaptive routing algorithms, do not perform well under non-uniform or bursty traffic. Adaptive routing algorithms estimate congestion levels of output ports to avoid routing traffic over congested ports. As global adaptive routing algorithms are not restricted to local information for congestion estimation, they are the prime candidates for balancing traffic in NOCs. Unfortunately, destinations of packets are not considered for congestion estimation in existing global adaptive routing algorithms. We will show that having identical congestion estimates for packets with different destinations prevents global adaptive routing algorithms from reaching their peak potential. In this work, we introduce Fast, a low-cost global adaptive routing algorithm that estimates congestion levels of output ports on a per-packet basis. The simulation results reveal that Fast achieves lower latency and higher throughput as compared to those of other adaptive routing algorithms across all workloads examined. Fast increases the throughput of an \(8 \times 8\) network by 54, 30, and 16 % as compared to DOR, Local, and RCA on a synthetic traffic profile. On realistic benchmarks, Fast achieves 5 % average, and 12 % maximum latency reduction on SPLASH-2 benchmarks running on a 49-core CMP as compared to the state of the art.
Similar content being viewed by others
References
Bakhoda A, Kim J, Aamodt TM (2010) Throughput-effective on-chip networks for Manycore accelerators. In: Proceedings of the 43rd annual IEEE/ACM international symposium on microarchitecture, USA, NY, NY, pp 421–432
Balfour JD, Dally WJ (2006) Design tradeoffs for tiled CMP on-chip networks. In: Proceedings of the 20th annual ACM international conference on supercomputing, Cairns, Queensland, Australia, pp 187–198
Barroso LA, Gharachorloo K, McNamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In: Proceedings of the 27th annual international symposium on computer architecture, Vancouver, British Columbia, Canada, pp 282–293
Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the 17th international conference on parallel architectures and compilation techniques, New York, New York, USA, pp 72–81. doi:10.1145/1454115.1454128
Chiu GM (2000) The odd-even turn model for adaptive routing. IEEE Trans Parallel Distrib Syst 11(7):729–738
Council TPP. http://www.tpc.org/default.asp
Dally WJ, Aoki H (1993) Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE Trans Parallel Distrib Syst 4(4):466–475
Duato J, Yalamanchili S, Lionel N (2002) Interconnection networks: an engineering approach, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco
Dumitras T, Marculescu R (2003) On-chip stochastic communication. In: Proceedings of the conference on design, automation and test in Europe, vol 1, p 10790
Ebrahimi M, Daneshtalab M, Farahnakian F, Plosila J, Liljeberg P, Palesi M, Tenhunen H (2012) HARAQ: congestion-aware learning model for highly adaptive routing algorithm in on-chip networks. In: Proceedings of the 6th IEEE/ACM international symposium on networks-on-chip, pp 19–26
Feige U, Raghavan P (1992) Exact analysis of hot-potato routing. In: Proceedings of the 33rd annual symposium on Foundations of Computer Science, pp 553–562
Ferdman M, Adileh A, Kocberber O, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu AD, Ailamaki A, Falsafi B (2012) Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Proceedings of the 17th international conference on architectural support for programming languages and operating systems, England, UK, London, pp 37–48
Galles M (1997) Spider: a high-speed network interconnect. IEEE Micro 17(1):34–39
Glass CJ, Ni LM (1992) The turn model for adaptive routing. In: Proceedings of the 19th annual international symposium on computer architecture, Queensland, Australia, pp 278–287
Gratz P, Grot B, Keckler SW (2008) Regional congestion awareness for load balance in networks-on-chip. In: Proceedings of the 14th international symposium on high-performance computer architecture, Salt Lake City, UT, USA, pp 203–214
Grot B, Hardy D, Lotfi-Kamran P, Falsafi B, Nicopoulos C, Sazeides Y (2012) Optimizing data-center TCO with scale-out processors. IEEE Micro 32(5):52–63
Hu J, Marculescu R (2004) DyAD: smart routing for networks-on-chip. In: Proceedings of the 41st annual design automation conference, San Diego, CA, USA, pp 260–263
Intel. Intel Xeon Processor X5670. http://ark.intel.com/products/47920/
Intel (1991) A touchstone DELTA system description. In: Technical report. Supercomputer Systems Division, Intel Corporation
International Technology Roadmap for Semiconductors (ITRS) 2011 Edition. URL http://www.itrs.net/Links/2011ITRS/Home2011.htm
Kahng AB, Li B, Peh LS, Samadi K (2009) ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. In: Proceedings of the conference on design, automation, and test in Europe, Nice, France, pp 423–428
Kim J, Dally WJ, Abts D (2007) Flattened butterfly: a cost-efficient topology for high-radix networks. In: Proceedings of the 34th annual international symposium on computer architecture, San Diego, California, USA, pp 126–137
Kim J, Park D, Theocharides T, Vijaykrishnan N, Das CR (2005) A low latency router supporting adaptivity for on-chip interconnects. In: Proceedings of the 42nd annual design automation conference, Anaheim, California, USA, pp 559–564
Kumar A, Kundu P, Singh AP, Peh LS, Jha NK (2007) A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In: Proceedings of the 25th international conference on computer design, pp 63–70
Kumar A, Peh LS, Kundu P, Jha NK (2007) Express virtual channels: towards the ideal interconnection fabric. In: Proceedings of the international symposium on computer architecture, San Diego, California, USA, pp 150–161
Li M, Zeng QA, Jone WB (2006) DyXY: a proximity congestion-aware deadlock-free dynamic routing method for network on chip. In: Proceedings of the 43rd annual design automation conference, CA, USA, San Francisco, pp 849–852
Lin X, Ni L (1993) Multicast communication in multicomputer networks. IEEE Trans Parallel Distrib Syst 4(10):1105–1117
Lotfi-Kamran P, Daneshtalab M, Lucas C, Navabi Z (2008) BARP—a dynamic routing protocol for balanced distribution of traffic in NoCs. In: Proceedings of the conference on design. Automation and test in Europe, Munich, Germany, pp 1408–1413
Lotfi-Kamran P, Grot B, Falsafi B (2012) NOC-Out: microarchitecting a scale-out processor. In: Proceedings of the 45th annual IEEE/ACM international symposium on microarchitecture, Vancouver, BC, Canada, pp 177–187
Lotfi-Kamran P, Grot B, Ferdman M, Volos S, Kocberber O, Picorel J, Adileh A, Jevdjic D, Idgunji S, Ozer E, Falsafi B (2012) Scale-out processors. In: Proceedings of the 39th annual international symposium on computer architecture, Portland, Oregon, USA, pp 500–511
Lotfi-Kamran P, Rahmani AM, Daneshtalab M, Afzali-Kusha A, Navabi Z (2010) EDXY—a low cost congestion-aware routing algorithm for network-on-chips. J Syst Archit 56(7):256–264
Ma S, Enright Jerger N, Wang Z (2011) DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip. In: Proceedings of the 38th annual international symposium on computer architecture, pp 413–424
Marculescu R, Ogras UY, Peh LS, Jerger NE, Hoskote Y (2009) Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Trans Comput-Aided Des Integr Circuits Syst 28(1):3–21
Michelogiannakis G, Balfour J, Dally WJ (2009) Elastic-buffer flow control for on-chip networks. In: Proceedings of the 15th IEEE international symposium on high-performance computer architecture, Raleigh, NC, USA, pp 151–162
Moscibroda T, Mutlu O (2009) A case for bufferless routing in on-chip networks. In: Proceedings of the 36th annual international symposium on computer architecture, pp 196–207
Ni LM, McKinley PK (1993) A survey of wormhole routing techniques in direct networks. Computer 26(2):62–76
Nilsson E, Millberg M, Oberg J, Jantsch A (2003) Load distribution with the proximity congestion awareness in a network on chip. In: Proceedings of the conference on design, automation and test in Europe, vol 1, p 11126
Ogras UY, Hu J, Marculescu R (2005) Key research problems in NoC design: a holistic perspective. In: Proceedings of the 3rd international conference on hardware/software codesign and system synthesis, Jersey City, NJ, USA, pp 69–74
Ozer E, Flautner K, Idgunji S, Saidi A, Sazeides Y, Ahsan B, Ladas N, Nicopoulos C, Sideris I, Falsafi B, Adileh A, Ferdman M, Lotfi-Kamran P, Kuulusa M, Marchal P, Minas N (2010) EuroCloud: energy-conscious 3D server-on-chip for green cloud services. In: Proceedings of the workshop on architectural concerns in large datacenters in conjunction with ISCA
Ramanujam RS, Lin B (2010) Destination-based adaptive routing on 2D mesh networks. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems, pp 19:1–19:12
Ramanujam RS, Lin B (2013) Destination-based congestion awareness for adaptive routing in 2D mesh networks. ACM Trans Des Autom Electron Syst 18(4):60:1–60:27
Schonwald T, Zimmermann J, Bringmann O, Rosenstiel W (2007) Fully adaptive fault-tolerant routing algorithm for network-on-chip architectures. In: Proceedings of the 10th Euromicro conference on digital system design architectures. Methods and tools, Lubeck, Germany, pp 527–534
Shin JL, Tam K, Huang D, Petrick B, Pham H, Hwang C, Li H, Smith A, Johnson T, Schumacher F, Greenhill D, Leon AS, Strong A (2010) A 40nm 16-Core 128-Thread CMT SPARC SoC processor. In: Proceedings of the IEEE international solid-state circuits conference, CA, USA, San Francisco, pp 98–99
Singh A, Dally WJ, Gupta AK, Towles B (2003) GOAL: a load-balanced adaptive routing algorithm for torus networks. In: Proceedings of the 30th annual international symposium on computer architecture, Tel-Aviv, Israel, pp 194–205
Vangal SR, Howard J, Ruhl G, Dighe S, Wilson H, Tschanz J, Finan D, Singh A, Jacob T, Jain S, Erraguntla V, Roberts C, Hoskote Y, Borkar N, Borkar S (2008) An 80-Tile Sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J Solid-State Circuits 43(1):29–41
Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the 22nd international symposium on computer architecture, S. Margherita Ligure, Italy, pp 24–36. doi:10.1145/223982.223990
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Lotfi-Kamran, P. Per-packet global congestion estimation for fast packet delivery in networks-on-chip. J Supercomput 71, 3419–3439 (2015). https://doi.org/10.1007/s11227-015-1439-3
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-015-1439-3