
Efficient Hardware Acceleration of Emerging Neural Networks for Embedded Machine Learning: An Industry Perspective

  • Chapter
  • First Online:
Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing

Abstract

As neural networks become more complex, the energy required for training and inference has driven a noticeable shift toward specialized accelerators that can meet the strict latency and energy constraints prevalent in both edge and cloud deployments. These accelerators achieve high performance through parallelism across hundreds of processing elements, and energy efficiency by reducing data movement and maximizing resource utilization through data reuse. After providing a brief summary of the problems that neural networks solve in the domains of Computer Vision, Natural Language Processing, Recommendation Systems, and Graph Processing, we discuss how individual layers from each of these networks can be accelerated in an energy-efficient manner. In particular, we focus on design considerations and trade-offs for mapping CNNs, Transformers, and GNNs onto AI accelerators that attempt to maximize compute efficiency and minimize energy consumption by reducing the number of memory accesses through efficient data reuse.
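To make the data-reuse point concrete, the short sketch below (not taken from the chapter; the layer dimensions and the simple access-counting model are illustrative assumptions) estimates how a weight-stationary dataflow reduces off-chip memory accesses for a single convolutional layer compared with fetching every operand from DRAM.

# Illustrative sketch: rough DRAM-access model for one conv layer.
# Layer-shape names (K, C, R, S, P, Q) follow common accelerator notation;
# the concrete numbers are hypothetical.

K, C, R, S = 64, 64, 3, 3        # output channels, input channels, kernel height/width
P, Q = 56, 56                    # output feature-map height/width

macs = K * C * R * S * P * Q     # total multiply-accumulates in the layer

# No reuse: every MAC fetches its weight and its input activation from DRAM.
naive_dram_accesses = 2 * macs

# Weight-stationary reuse: each weight is fetched from DRAM once and held in a
# local register/buffer while it is reused across all P*Q output positions;
# activations are still streamed once per MAC in this simple model.
weights = K * C * R * S
ws_dram_accesses = weights + macs

print(f"MACs:                              {macs:,}")
print(f"DRAM accesses, no reuse:           {naive_dram_accesses:,}")
print(f"DRAM accesses, weight-stationary:  {ws_dram_accesses:,}")
print(f"Reduction:                         {naive_dram_accesses / ws_dram_accesses:.2f}x")

Under these assumptions each weight is loaded once and reused across all 56x56 output positions, roughly halving DRAM traffic; adding activation or partial-sum reuse in on-chip buffers would reduce it further, which is the kind of trade-off the chapter examines.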



Author information


Corresponding author

Correspondence to Arnab Raha.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Raha, A. et al. (2024). Efficient Hardware Acceleration of Emerging Neural Networks for Embedded Machine Learning: An Industry Perspective. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-19568-6_5


  • DOI: https://doi.org/10.1007/978-3-031-19568-6_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19567-9

  • Online ISBN: 978-3-031-19568-6

  • eBook Packages: Engineering, Engineering (R0)
