
PRTSM: Hardware Data Arrangement Mechanisms for Convolutional Layer Computation on the Systolic Array

  • Shuquan Wang (Email author)
  • Lei Wang
  • Shiming Li
  • Tian Shuo
  • Shasha Guo
  • Ziyang Kang
  • Shuzheng Zhang
  • Weixia Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11783)

Abstract

The systolic array is an array of processing elements that share an internal data flow. Since the 2D systolic array naturally fits the multiply-accumulate (MAC) operation, many groups use it to accelerate the computation of DNNs (Deep Neural Networks). However, the performance of the systolic array is limited by data bandwidth. Some groups address this problem with loop tiling but pay little attention to the pixel reuse potential of the convolutional layer. In this paper, we propose PRTSM (Pixels Reuse with Time and Spatial Multiplexing), a novel method that reuses the pixels of the input feature map through time and spatial multiplexing. With it, we significantly reduce bandwidth pressure and save data preparation time for convolutional layers on the systolic array. We propose three algorithms for this method and implement the corresponding hardware mechanisms on a Xilinx FPGA XCVU440. Experiments show that our hardware mechanisms reduce off-chip traffic by at least 72.03%. The proposed mechanisms reach a peak performance of 64.034 GOPS at a frequency of 167 MHz.
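The reuse argument can be made concrete with a small counting sketch: in a stride-1 K×K convolution, every interior pixel of the input feature map is read by up to K×K overlapping windows, so fetching each pixel from off-chip memory once and multiplexing it on chip removes most of the redundant traffic. The Python snippet below illustrates only this counting argument, not the paper's PRTSM algorithms or hardware mechanisms; the function reuse_map and its parameters are our own illustrative names.

    import numpy as np

    # Illustrative sketch only (not the paper's PRTSM hardware): count how
    # many stride-1 KxK sliding windows read each pixel of an HxW input.
    def reuse_map(h, w, k, stride=1):
        reuse = np.zeros((h, w), dtype=int)
        out_h = (h - k) // stride + 1
        out_w = (w - k) // stride + 1
        for oy in range(out_h):
            for ox in range(out_w):
                # Every pixel covered by this window is read once by it.
                reuse[oy * stride:oy * stride + k,
                      ox * stride:ox * stride + k] += 1
        return reuse

    r = reuse_map(8, 8, 3)
    naive_reads = int(r.sum())  # reads if each window refetches its pixels
    unique = r.size             # reads if each pixel is fetched exactly once
    print(f"naive reads: {naive_reads}, unique pixels: {unique}, "
          f"traffic saved: {1 - unique / naive_reads:.2%}")
    # For an 8x8 input with a 3x3 kernel this prints about 80% traffic saved,
    # consistent in spirit with the >= 72.03% off-chip traffic reduction the
    # paper reports for its hardware mechanisms.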

Keywords

DNN · FPGA · Systolic array · Hardware data arrangement

Copyright information

© IFIP International Federation for Information Processing 2019

Authors and Affiliations

  • Shuquan Wang¹ (Email author)
  • Lei Wang¹
  • Shiming Li¹
  • Tian Shuo¹
  • Shasha Guo¹
  • Ziyang Kang¹
  • Shuzheng Zhang¹
  • Weixia Xu¹

  1. National University of Defense Technology, Changsha, China