A survey of FPGA-based accelerators for convolutional neural networks

  • Review
  • Published in: Neural Computing and Applications

Abstract

Deep convolutional neural networks (CNNs) have recently shown very high accuracy on a wide range of cognitive tasks and, as a result, have attracted significant research interest. Given the high computational demands of CNNs, custom hardware accelerators are vital for boosting their performance. The high energy efficiency, computing capability and reconfigurability of FPGAs make them a promising platform for hardware acceleration of CNNs. In this paper, we present a survey of techniques for implementing and optimizing CNN algorithms on FPGAs. We organize the works into several categories to bring out their similarities and differences. This paper is expected to be useful for researchers in the areas of artificial intelligence, hardware architecture and system design.
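To make the computational pattern concrete, the following sketch (our illustration, not taken from any surveyed work, with hypothetical toy dimensions) shows the core loop nest of a CONV layer. The accelerators surveyed in this paper optimize essentially this MAC-dominated kernel, typically through loop tiling, unrolling and pipelining of these six loops.

```c
/* Illustrative direct-convolution kernel: six nested loops, where the
 * inner three form one multiply-accumulate (MAC) chain per output pixel.
 * Dimensions are toy values chosen only for readability. */
#define M 2   /* number of output feature maps */
#define N 2   /* number of input feature maps */
#define R 3   /* output rows */
#define C 3   /* output columns */
#define K 2   /* kernel size (K x K) */

void conv_layer(float out[M][R][C],
                float in[N][R + K - 1][C + K - 1],
                float w[M][N][K][K]) {
    for (int m = 0; m < M; m++)               /* output fmap */
        for (int r = 0; r < R; r++)           /* output row */
            for (int c = 0; c < C; c++) {     /* output column */
                float acc = 0.0f;
                for (int n = 0; n < N; n++)           /* input fmap */
                    for (int i = 0; i < K; i++)       /* kernel row */
                        for (int j = 0; j < K; j++)   /* kernel col */
                            acc += w[m][n][i][j] * in[n][r + i][c + j];
                out[m][r][c] = acc;  /* N*K*K MACs per output pixel */
            }
}
```

Each output pixel costs N×K×K MACs, so a full layer performs M×R×C×N×K×K multiply-accumulates; for real networks this reaches billions of operations per frame, which is why the loop ordering, tiling factors and on-chip buffering of this nest dominate accelerator design.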


Notes

  1. The following acronyms are used frequently in this paper: bandwidth (BW), batch normalization (B-NORM), binarized CNN (BNN), block RAM (BRAM), convolution (CONV), digital signal processing units (DSPs), directed acyclic graph (DAG), design space exploration (DSE), fast Fourier transform (FFT), feature map (fmap), fixed point (FxP), floating point (FP), frequency-domain CONV (FDC), fully connected (FC), hardware (HW), high-level synthesis (HLS), inverse FFT (IFFT), local response normalization (LRN), lookup tables (LUTs), matrix multiplication (MM), matrix–vector multiplication (MVM), multiply–add–accumulate (MAC), processing engine/unit (PE/PU), register transfer level (RTL), single instruction multiple data (SIMD).


Acknowledgements

Support for this work was provided by Science and Engineering Research Board (SERB), India, Award Number ECR/2017/000622.

Author information

Corresponding author

Correspondence to Sparsh Mittal.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.


About this article

Cite this article

Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput & Applic 32, 1109–1139 (2020). https://doi.org/10.1007/s00521-018-3761-1
