Recent advances in efficient computation of deep convolutional neural networks

  • Jian Cheng
  • Pei-song Wang
  • Gang Li
  • Qing-hao Hu
  • Han-qing Lu
Review

Abstract

Deep neural networks have evolved remarkably over the past few years and are now fundamental tools in many intelligent systems. At the same time, their computational complexity and resource consumption continue to grow, which poses a significant challenge to deployment, especially in real-time applications or on resource-limited devices. Network acceleration has therefore become a hot topic within the deep learning community. On the hardware side, a number of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher–student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
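Two of the topics surveyed, network pruning and network quantization, are easy to illustrate in isolation. The following minimal NumPy sketch (illustrative only, not code from the paper; the function names, the 90% sparsity target, and the 4-bit setting are assumptions made for this example) shows post-training magnitude-based weight pruning followed by uniform low-bit weight quantization:

```python
# Illustrative sketch only: not code from the surveyed paper.
# Shows two ideas named in the abstract: magnitude-based weight pruning
# and k-bit uniform (linear) weight quantization. Function names are hypothetical.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def uniform_quantize(weights: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Map weights to 2**num_bits evenly spaced levels, then de-quantize."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (2 ** num_bits - 1)
    q = np.round((weights - w_min) / scale)   # integer codes in [0, 2^b - 1]
    return q * scale + w_min                  # reconstructed (de-quantized) weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64)).astype(np.float32)  # stand-in for a conv/fc weight matrix
    w_pruned = magnitude_prune(w, sparsity=0.9)       # keep only the largest 10% of weights
    w_quant = uniform_quantize(w_pruned, num_bits=4)
    print("zeros after pruning:", np.mean(w_pruned == 0))
    print("mean abs quantization error:", np.mean(np.abs(w_quant - w_pruned)))
```

In practice, as the survey discusses, such post-training schemes are typically combined with fine-tuning or quantization-aware training to recover accuracy, and structured (filter- or channel-level) pruning is preferred when hardware-friendly speedups are required.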

Keywords

Deep neural networks; Acceleration; Compression; Hardware accelerator

CLC number

TP3 

Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
