Journal of Computer Science and Technology, Volume 32, Issue 1, pp 11–25

An Efficient Network-on-Chip Router for Dataflow Architecture

  • Xiao-Wei Shen
  • Xiao-Chun Ye
  • Xu Tan
  • Da Wang
  • Lunkai Zhang
  • Wen-Ming Li
  • Zhi-Min Zhang
  • Dong-Rui Fan
  • Ning-Hui Sun
Regular paper

Abstract

Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, large amounts of data are frequently transferred among processing elements through the network-on-chip (NoC), so the router design has a significant impact on the performance of a dataflow architecture. Common routers are designed for control-flow multi-core architectures, and we find that they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in the NoCs of dataflow architectures: multiple destinations, a high injection rate, and performance that is sensitive to delay. Based on these three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination transfers; thus it can deliver data with multiple destinations in a single transfer. Moreover, the router adopts output buffering to maximize throughput and non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.
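
To illustrate why sending data with multiple destinations in a single transfer can reduce network load, the sketch below counts link traversals for one source and several destinations. It is only an illustration under stated assumptions, not the router proposed in the paper: the 2D mesh topology, the dimension-ordered (XY) routing, and all function names are hypothetical choices that the abstract does not specify.

```python
# Illustrative sketch only. Assumptions not taken from the abstract:
# a 2D mesh, dimension-ordered (XY) routing, and hypothetical function names.

def xy_path_links(src, dst):
    """Return the directed links ((x, y), (x', y')) on the XY route from src to dst."""
    links = []
    x, y = src
    dst_x, dst_y = dst
    # Travel along the X dimension first.
    while x != dst_x:
        next_x = x + (1 if dst_x > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    # Then travel along the Y dimension.
    while y != dst_y:
        next_y = y + (1 if dst_y > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def unicast_link_traversals(src, dests):
    """Link traversals when the source injects one separate packet per destination."""
    return sum(len(xy_path_links(src, d)) for d in dests)


def multidest_link_traversals(src, dests):
    """Link traversals when one packet is injected and copied only where routes diverge.

    Each link shared by several routes is crossed once, so the count is the
    number of distinct links in the union of the XY routes.
    """
    distinct_links = set()
    for d in dests:
        distinct_links.update(xy_path_links(src, d))
    return len(distinct_links)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 1), (3, 2)]
    print("unicast traversals:", unicast_link_traversals(source, destinations))
    print("multi-destination traversals:", multidest_link_traversals(source, destinations))
```

In this toy case the source injects three packets under repeated unicast but only one under the multi-destination scheme, and the shared prefix of the routes is traversed once instead of three times. Real savings depend on the topology, routing algorithm, and traffic pattern.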

Keywords

multi-destination router, network-on-chip, dataflow architecture, high-performance computing

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Xiao-Wei Shen (1, 2)
  • Xiao-Chun Ye (1)
  • Xu Tan (1, 2)
  • Da Wang (1)
  • Lunkai Zhang (3)
  • Wen-Ming Li (1)
  • Zhi-Min Zhang (1)
  • Dong-Rui Fan (1)
  • Ning-Hui Sun (1)

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
  3. Department of Computer Science, The University of Chicago, Chicago, USA
