Efficient Design of Pruned Convolutional Neural Networks on FPGA

Abstract

Convolutional Neural Networks (CNNs) have improved several computer vision applications, such as object detection and classification, compared to other machine learning algorithms. Running these models on edge computing devices close to the data sources is attracting the attention of the community, since it avoids high-latency communication of private data for cloud processing and permits real-time decisions, turning these systems into smart embedded devices. However, running these models is computationally very demanding and requires a large amount of memory, and both resources are scarce in edge devices compared to a cloud center. In this paper, we propose an architecture for the inference of pruned convolutional neural networks on FPGAs of any density. A configurable block pruning method is proposed together with an architecture that supports the efficient execution of pruned networks. Pruning and batching are also studied together to determine how they influence each other. With the proposed architecture, we run the inference of a CNN with an average performance of 322 GOPs for 8-bit data on an XC7Z020 FPGA. The proposed architecture running AlexNet processes 240 images/s on a ZYNQ7020 and 775 images/s on a ZYNQ7045 with only 1.2% accuracy degradation.
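
The abstract does not spell out how the block pruning method works, so the following is only a minimal sketch of the general idea of block pruning, not the method of the paper. It assumes that weights are grouped into consecutive fixed-size blocks, that each block is scored by its L1 norm, and that a chosen fraction of the lowest-scoring blocks is set to zero. The function name `block_prune` and the parameters `block_size` and `prune_fraction` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple


def block_prune(weights: List[float], block_size: int,
                prune_fraction: float) -> Tuple[List[float], List[bool]]:
    """Zero out the lowest-magnitude blocks of a flat weight vector.

    Illustrative helper only (an assumption, not the paper's implementation).
    Returns the pruned weights and a per-block keep mask.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0.0 <= prune_fraction <= 1.0:
        raise ValueError("prune_fraction must be in [0, 1]")

    # Split the vector into consecutive blocks (the last one may be shorter).
    blocks = [weights[i:i + block_size]
              for i in range(0, len(weights), block_size)]

    # Score each block by its L1 norm (sum of absolute values).
    scores = [sum(abs(w) for w in block) for block in blocks]

    # Select the blocks with the smallest scores for removal.
    n_prune = int(prune_fraction * len(blocks))
    order = sorted(range(len(blocks)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])

    # A whole block is either kept or zeroed, so one flag per block suffices.
    keep_mask = [i not in pruned for i in range(len(blocks))]
    pruned_weights = [0.0 if i in pruned else w
                      for i, block in enumerate(blocks) for w in block]
    return pruned_weights, keep_mask


if __name__ == "__main__":
    w = [0.9, -0.8, 0.01, 0.02, 0.5, 0.4, -0.03, 0.05]
    print(block_prune(w, block_size=2, prune_fraction=0.5))
```

Because entire blocks are removed, hardware only needs one flag or index per block rather than per weight to skip the zeroed computations, which is the usual motivation for pruning at block granularity.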

Acknowledgments

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UIDB/50021/2020 and was also supported by project IPL/IDI&CA/2020/TRAINEE/ISEL through Instituto Politécnico de Lisboa.

Author information

Corresponding author

Correspondence to Mário Véstias.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

About this article

Cite this article

Véstias, M. Efficient Design of Pruned Convolutional Neural Networks on FPGA. J Sign Process Syst 93, 531–544 (2021). https://doi.org/10.1007/s11265-020-01606-2

Keywords

  • Deep learning
  • Convolutional neural network
  • FPGA
  • Block pruning
  • Edge computing