Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data are frequently transferred among processing elements through the network-on-chip (NoC). Thus the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architecture and we find they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, high injection rate, and performance sensitive to delay. Based on the three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination; thus it can transfer data with multiple destinations in a single transfer. Moreover, the router adopts output buffer to maximize throughput and adopts non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.
This is a preview of subscription content, log in to check access.
Buy single article
Instant access to the full article PDF.
Price includes VAT for USA
Subscribe to journal
Immediate online access to all issues from 2019. Subscription will auto renew annually.
This is the net price. Taxes to be calculated in checkout.
Chen T S, Du Z D, Sun N H, Wang J, Wu C Y, Chen Y J, Temam O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp.269-284.
Liu D F, Chen T S, Liu S L, Zhou J H, Zhou S Y, Temam O, Feng X B, Zhou X H, Chen Y J. PuDianNao: A polyvalent machine learning accelerator. In Proc. the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp.369-381.
Voitsechov D, Etsion Y. Single-graph multiple flows: Energy efficient design alternative for GPGPUs. In Proc. the 41st Int. Symp. Computer Architecture, Jun. 2014, pp.205-216.
Oriato D, Tilbury S, Marrocu M, Pusceddu G. Acceleration of a meteorological limited area model with dataflow engines. In Proc. the Symp. Application Accelerators in High Performance Computing, Jul. 2012, pp.129-132.
Pratas F, Oriato D, Pell O, Mata R A, Sousa L. Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines. In Proc. the 21st Annual Int. Symp. Field-Programmable Custom Computing Machines, Apr. 2013, pp.177-180.
Fu H H, Gan L, Clapp R G, Ruan H B, Pell O, Mencer O, Flynn M, Huang X M, Yang G W. Scaling reverse time migration performance through reconfigurable dataflow engines. IEEE Micro, 2014, 34(1): 30-40.
Theobald K B. EARTH: An efficient architecture for running threads [Ph.D. Thesis]. McGill University, Montreal, Que., Canada, 1999.
Milutinovic V, Salom J, Trifunovic N, Giorgi R. Guide to Dataflow Supercomputing (1st edition). Springer International Publishing, 2015.
Sankaralingam K, Nagarajan R, McDonald R, Desikan R, Drolia S, Govindan M S, Gratz P, Gulati D, Hanson H, Kim C, Liu H M, Ranganathan N, Sethumadhavan S, Sharif S, Shivakumar P, Keckler S W, Burger D. Distributed microarchitectural protocols in the TRIPS prototype processor. In Proc. the 39th Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2006, pp.480-491.
Burger D, Keckler S W, McKinley K S, Dahlin M, John L K, Lin C, Moore C R, Burrill J, McDonald R G, Yoder W. Scaling to the end of silicon with EDGE architectures. Computer, 2004, 37(7): 44-55.
Swanson S, Schwerin A, Mercaldi M, Petersen A, Putnam A, Michelson K, Oskin M, Eggers S J. The WaveScalar architecture. ACM Transactions on Computer Systems, 2007, 25(2): Article No.4.
Roca A, Flich J, Silla F, Duato J. A latency-efficient router architecture for CMP systems. In Proc. the 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, Sept. 2010, pp.165-172.
Michelogiannakis G, Dally W J. Router designs for elastic buffer on-chip networks. In Proc. the Conference on High Performance Computing Networking, Storage and Analysis, Nov. 2009.
Chang Y Y, Huang Y S, Poremba M, Narayanan V K, Xie Y, King C T. TS-Router: On maximizing the quality-of-allocation in the on-chip network. In Proc. the 19th Int. Symp. High Performance Computer Architecture, Feb. 2013, pp.390-399.
Tran A T, Baas B M. Achieving high-performance on-chip networks with shared-buffer routers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2014, 22(6): 1391-1403.
Poluri P, Louri A. An improved router design for reliable on-chip networks. In Proc. the 28th Int. Parallel and Distributed Processing Symp., May 2014, pp.283-292.
Ben-Itzhak Y, Cidon I, Kolodny A, Shabun M, Shmuel N. Heterogeneous NoC router architecture. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(6): 2479-2492.
Zoni D, Flich J, Fornaciari W. CUTBUF: Buffer management and router design for traffic mixing in VNET-based NoCs. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(6): 1603-1616.
Singh W, Deb S. Energy efficient and congestion-aware router design for future NoCs. In Proc. the 29th Int. Conference on VLSI Design, Jan. 2016, pp.81-85.
Yan P Z, Jiang S X, Sridhar R. A high throughput router with a novel switch allocator for network on chip. In Proc. the 28th International System-on-Chip Conference, Sept. 2015, pp.160-163.
Xu Y, Zhao B, Zhang Y T, Yang J. Simple virtual channel allocation for high throughput and high frequency on-chip routers. In Proc. the 16th Int. Symp. High Performance Computer Architecture, Jan. 2010, pp.1-11.
Soteriou V, Ramanujam R S, Lin B, Peh L S. A high-throughput distributed shared-buffer NoC router. IEEE Computer Architecture Letters, 2009, 8(1): 21-24.
Gu L, Li M, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ACM International Conference on Supercomputing, Jun. 2010, pp.305-314.
Zhang Y P, Mueller F. Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters. IEEE Transactions on Parallel and Distributed Systems, 2013, 24(3): 417-427.
Kuzak J, Tomov S, Dongarra J. Autotuning GEMM kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(11): 2045-2057.
Hesse R, Nicholls J, Jerger N E. Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels. In Proc. the 6th IEEE/ACM Int. Symp. Networks-on-Chip, May 2012, pp.132-141.
Ye X C, Fan D R, Sun N H, Tang S B, Zhang M Z, Zhang H. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. In Proc. the Int. Symp. Low Power Electronics and Design, Sept. 2013, pp.273-278.
Li S, Ahn J H, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp.469-480.
Solinas M, Badia R M, Bodin F, Cohen A, Evripidou P, Faraboschi P, Fenchner B, Gao G R, Garbade A, Girbal S, Goodman D, Khan B, Koliai S, Li F, Luján M, Morin L, Mendelson A, Navarro N, Pop A, Trancoso P, Ungerer T, Valero M, Weis S, Watson I, Zuckermann S, Giorgi R. The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices. In Proc. the Euromicro Conference on Digital System Design, Sept. 2013, pp.272-279.
Carter N P, Agrawal A, Borkar S, Cledat R, David H, Dunning D, Fryman J, Ganev I, Golliver R A, Knauerhase R, Lethin R, Meister B, Mishra A K, Pinfold W R, Teller J, Torrellas J, Vasilache N, Venkatesh G, Xu J P. Runnemede: An architecture for ubiquitous high-performance computing. In Proc. the 19th Int. Symp. High Performance Computer Architecture, Feb. 2013, pp.198-209.
Wei L, Zhou L. An equilibrium partitioning method for multicast traffic in 3D NoC architecture. In Proc. the IFIP/IEEE International Conference on Very Large Scale Integration, Oct. 2015, pp.128-133.
Agrawal M, Chakrabarty K. Test-time optimization in NOC-based manycore SOCs using multicast routing. In Proc. the 32nd IEEE VLSI Test Symposium, Apr. 2014.
Kamali M, Petre L, Sere K, Daneshtalab M. Formal modeling of multicast communication in 3D NoCs. In Proc. the 14th Euromicro Conference on Digital System Design, Aug. 31-Sept. 2, 2011, pp.634-642.
Zhan J, Ouyang J, Ge F, Zhao J S, Xie Y. Hybrid drowsy SRAM and STT-RAM buffer designs for dark-silicon-aware NoC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016, 24(10): 3041-3054.
Zhan J, Ouyang J, Ge F, Zhao S, Xie Y. DimNoC: A dim silicon approach towards power-efficient on-chip network. In Proc. the 52nd ACM/EDAC/IEEE Design Automation Conference, Jun. 2015. J. Comput. Sci. & Technol., Jan. 2017, Vol.32, No.1
Zhang L K, Strukov D, Saadeldeen H, Fan D R, Zhang M Z, Franklin D. SpongeDirectory: Flexible sparse directories utilizing multi-level memristors. In Proc. the 23rd Int. Conf. Parallel Architectures and Compilation, Aug. 2014, pp.61-74.
Deng Z X, Zhang L K, Franklin D, Chong F T. Herniated hash tables: Exploiting multi-level phase change memory for in-place data expansion. In Proc. the Int. Symp. Memory Systems, Oct. 2015, pp.247-257.
Zhang M Z, Zhang L K, Jiang L, Liu Z Y, Chong F T. Balancing performance and lifetime of MLC PCM by using a regionretention monitor. In Proc. the 23rd Int. Symp. High Performance Computer Architecture, Feb. 2017. (to be appeared)
LeeH H S, Tyson G S, Farrens M K. Eager writeback-a technique for improving bandwidth utilization. In Proc. the 33rd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2000, pp.11-21.
Zhang L K, Neely B, Franklin D, Strukov D, Xie Y, Chong F T. Mellow writes: Extending lifetime in resistive memories through selective slow write backs. In Proc. the 43rd Int. Symp. Computer Architecture, Jun. 2016, pp.519-531.
Special Section on Dataflow Architecture
This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.
About this article
Cite this article
Shen, X., Ye, X., Tan, X. et al. An Efficient Network-on-Chip Router for Dataflow Architecture. J. Comput. Sci. Technol. 32, 11–25 (2017). https://doi.org/10.1007/s11390-017-1703-5
- dataflow architecture
- high-performance computing