Godiva: green on-chip interconnection for DNNs

Abstract

The benefits of deep neural networks (DNNs) and other big-data algorithms have led to their use in almost every modern application. The rising use of DNNs in diverse domains, including computer vision, speech recognition, image classification, and prediction, has increased the demand for energy-efficient hardware architectures. The massive parallel processing in large-scale DNN algorithms has made communication and storage a major bottleneck for DNN power and performance. DNNs currently owe much of their success to the inherent parallelism of GPU architectures. However, recent research shows that integrating CPUs and GPUs offers a more efficient platform for running the next generation of machine learning (ML) chips. Designing the interconnection network for such a heterogeneous CPU-GPU platform is challenging, especially for DNN workloads, because the network must be both scalable and efficient. The analysis in this work shows that the majority of traffic in DNN workloads is associated with the last-level caches (LLCs). There is therefore a need for a low-overhead interconnect fabric that minimizes the energy and access time of the LLC banks. To address this issue, Godiva, a low-overhead on-chip interconnection for running DNNs energy-efficiently, is proposed. Godiva delivers low LLC access latency using low-overhead, low-cost hardware in a heterogeneous CPU-GPU platform. An experimental evaluation targeting a 16-CPU/48-GPU system and a set of popular DNN workloads reveals that the proposed heterogeneous architecture improves system energy by about 21.7× and reduces interconnection network area by about 51% when compared to a mesh-based CPU design.
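The headline numbers above can be put in concrete terms with a back-of-the-envelope comparison. This is an illustrative sketch only: the baseline energy and area figures below are hypothetical placeholders, and only the improvement factors (about 21.7× in system energy, about 51% in interconnect area) come from the abstract.

```python
# Illustrative comparison of the reported Godiva improvements against a
# mesh-based CPU baseline. Baseline magnitudes are hypothetical; only the
# improvement factors are taken from the paper's abstract.

ENERGY_IMPROVEMENT = 21.7   # reported system-energy improvement factor (x)
AREA_REDUCTION = 0.51       # reported interconnection-network area reduction

energy_baseline_j = 100.0   # hypothetical baseline system energy (J)
area_baseline_mm2 = 10.0    # hypothetical baseline NoC area (mm^2)

# A 21.7x energy improvement means the baseline energy divided by 21.7.
energy_godiva_j = energy_baseline_j / ENERGY_IMPROVEMENT

# A 51% area reduction means Godiva keeps 49% of the baseline NoC area.
area_godiva_mm2 = area_baseline_mm2 * (1.0 - AREA_REDUCTION)

print(f"Godiva energy:  {energy_godiva_j:.2f} J  (baseline {energy_baseline_j} J)")
print(f"Godiva NoC area: {area_godiva_mm2:.2f} mm^2 (baseline {area_baseline_mm2} mm^2)")
```

Under these assumed baselines, a 100 J workload would drop to roughly 4.6 J, and a 10 mm² interconnect would shrink to 4.9 mm².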

Data availability

We confirm that all relevant data and results are included within the article.

Author information

Corresponding author

Correspondence to Arghavan Asad.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Asad, A., Mohammadi, F. Godiva: green on-chip interconnection for DNNs. J Supercomput (2022). https://doi.org/10.1007/s11227-022-04749-0


Keywords

  • Heterogeneous architecture
  • Last level cache (LLC)
  • CPU-GPU
  • Deep learning
  • Network-on-chip (NoC)
  • Manycore
  • Deep neural networks (DNN)