Machine Learning in VLSI Computer-Aided Design, pp. 647–678

# Energy-Efficient Design of Advanced Machine Learning Hardware

## Abstract

The exponentially growing rates of data production in the current era of the internet of things (IoT), cyber-physical systems (CPS), and big data pose ever-increasing demands for massive data processing, storage, and transmission. Such systems are required to be robust, intelligent, and self-learning while remaining high-performance and power-/energy-efficient. As a result, a surge of artificial intelligence and machine learning research has emerged across numerous communities (e.g., deep learning and hardware architecture).

This chapter first provides a brief overview of machine learning and neural networks, followed by a few of the most prominent techniques that have been used so far for designing energy-efficient accelerators for machine learning algorithms, particularly deep neural networks. Inspired by the scalable-effort principles of the human brain (i.e., scaling computing effort to the precision required by a task, or to the recurrent execution of the same or similar tasks), we focus on the (re-)emerging area of approximate computing (aka inexact computing), which aims at relaxing the bounds of precise/exact computing to provide new opportunities for improving the area, power/energy, and performance efficiency of systems by orders of magnitude at the cost of reduced output quality. We also walk through a holistic methodology that encompasses the complete design phase, i.e., from algorithms to architectures. Finally, we summarize the challenges and the associated research roadmap that can aid in developing energy-efficient and adaptable hardware accelerators for machine learning.
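To make the accuracy–efficiency trade-off of approximate computing concrete, the following is a minimal software sketch of a lower-part OR adder (LOA), one classic approximate adder design from the literature: the `k` least-significant bits skip the carry chain entirely (a bitwise OR replaces the addition), which shortens the critical path in a hardware realization, at the cost of a bounded output error below `2^k`. The function name and parameters here are illustrative, not taken from any specific library.

```python
def loa_add(a: int, b: int, k: int = 4, width: int = 16) -> int:
    """Approximate addition via a Lower-part OR Adder (LOA).

    The k least-significant bits are approximated with a bitwise OR
    (no carry propagation); the upper bits are added exactly, with a
    carry-in predicted from the AND of the lower parts' MSBs.
    The absolute error is bounded by 2**k.
    """
    mask = (1 << k) - 1
    # Approximate lower part: OR instead of add, so no carry chain.
    lower = (a & mask) | (b & mask)
    # Predict the carry into the upper part from the lower parts' MSBs.
    carry_in = ((a >> (k - 1)) & 1) & ((b >> (k - 1)) & 1)
    # Exact addition of the upper parts.
    upper = ((a >> k) + (b >> k) + carry_in) << k
    return (upper | lower) & ((1 << width) - 1)
```

For instance, `loa_add(10, 5, k=2)` happens to return the exact sum 15, while `loa_add(3, 3, k=2)` returns 7 instead of 6; the error never exceeds `2**k`, which is why such adders suit error-resilient workloads like neural-network inference.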
