Optimizing Neural Networks for Efficient FPGA Implementation: A Survey

  • Original Paper
  • Archives of Computational Methods in Engineering

Abstract

Deep learning has become the key enabler of artificial intelligence applications and has been used successfully to solve computer vision tasks. However, deep learning algorithms rely on Deep Neural Networks (DNNs) with many hidden layers, which demand large amounts of computation and storage. General-purpose graphics processing units (GPGPUs) are therefore the usual candidates for DNN development and inference, thanks to their large number of processing cores and large integrated memory, but they suffer from high power consumption. In real-world deployments, the processing unit is often an embedded system with limited power and computation budgets. In recent years, the Field Programmable Gate Array (FPGA) has emerged as a serious alternative that can outperform the GPGPU owing to its flexible architecture and low power consumption; on the other hand, an FPGA offers only a small amount of integrated memory and limited bandwidth. Fitting DNNs onto FPGAs therefore requires optimization techniques at several levels, such as the network level, the hardware level, and the implementation-tools level. In this paper, we review the existing optimization techniques and evaluate them to provide a complete overview of FPGA-based DNN accelerators.
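To make the network-level techniques concrete, the sketch below illustrates two optimizations of the kind this survey covers: magnitude-based weight pruning and symmetric 8-bit fixed-point quantization. It is a minimal NumPy illustration under simple assumptions (per-tensor scaling, a hypothetical 70% sparsity target), not code from the paper or from any surveyed accelerator; the function names are invented for this example.

```python
# Minimal sketch (illustrative only) of two network-level optimizations
# discussed in the FPGA-acceleration literature: magnitude pruning and
# symmetric per-tensor int8 quantization. Names and thresholds are hypothetical.
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) <= threshold, np.float32(0.0), weights)

def quantize_symmetric_int8(weights: np.ndarray):
    """Map float weights to int8 using a single per-tensor scale factor."""
    scale = np.max(np.abs(weights)) / 127.0   # symmetric full-range mapping
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)  # toy layer weights
    w_sparse = prune_by_magnitude(w, sparsity=0.7)    # 70% of weights set to 0
    q, scale = quantize_symmetric_int8(w_sparse)
    w_hat = q.astype(np.float32) * scale              # dequantize to check error
    print("mean abs reconstruction error:", np.mean(np.abs(w_hat - w_sparse)))
```

The motivation on FPGA hardware is straightforward: int8 weights occupy a quarter of the on-chip buffer space of float32 weights, and pruned (zero) weights can be skipped entirely, which eases exactly the memory and bandwidth limits described above.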



Author information

Corresponding author

Correspondence to Riadh Ayachi.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ayachi, R., Said, Y. & Ben Abdelali, A. Optimizing Neural Networks for Efficient FPGA Implementation: A Survey. Arch Computat Methods Eng 28, 4537–4547 (2021). https://doi.org/10.1007/s11831-021-09530-9

