Designing a Performance-Centric MAC Unit with Pipelined Architecture for DNN Accelerators

Abstract

To improve the performance of deep neural network (DNN) accelerators, both compute efficiency and operating frequency must be optimized. However, contemporary DNN implementations often require excessive resources due to their heavy multiply-and-accumulate (MAC) workloads. This work proposes a MAC unit built on a COordinate Rotation DIgital Computer (CORDIC)-based architecture, which is both power- and area-efficient at 8-bit and higher precision. Because CORDIC-based designs are typically associated with low throughput, a performance-centric pipelined architecture is investigated to increase throughput. The study conducts a detailed Pareto analysis of accuracy variation across precision levels and of the pipeline stages required to achieve high performance. Post-synthesis results for the proposed MAC unit at the 45 nm technology node are provided, and performance is evaluated on a deep neural network using a Virtex-7 FPGA board. The proposed fixed-point MAC architecture is scalable to any bit precision and flexible in the placement of the implied radix point. The proposed Fixed Q\(_{3.5}\)-precision MAC with five pipeline stages shows better performance metrics than the recursive CORDIC-based MAC design: it achieves a 1.13\(\times \) lower area-delay product (ADP) and 2.73\(\times \) higher throughput. Evaluated on a fully connected neural network for the MNIST dataset, the proposed MAC unit delivers 1.89\(\times \) higher throughput than a conventional MAC-based design.
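
The two ideas the abstract combines, a multiplier-less CORDIC-based MAC and Q\(_{3.5}\) fixed-point operands, can be illustrated with a short model. The following is a minimal Python sketch assuming standard linear-mode (rotation) CORDIC; the constants, helper names, and iteration count are illustrative assumptions, not details taken from the paper or its RTL.

```python
# Illustrative sketch (not the authors' design): Q3.5 fixed point plus a
# linear-mode CORDIC multiply-accumulate, which uses only shifts and adds.

FRAC_BITS = 5                     # Q3.5: 3 integer bits (incl. sign), 5 fractional
WORD_MIN, WORD_MAX = -128, 127    # 8-bit two's-complement range


def to_q35(x: float) -> int:
    """Quantize a real number to a Q3.5 fixed-point integer (saturating)."""
    q = round(x * (1 << FRAC_BITS))
    return max(WORD_MIN, min(WORD_MAX, q))


def from_q35(q: int) -> float:
    """Interpret a Q3.5 integer as a real number."""
    return q / (1 << FRAC_BITS)


def cordic_mac(acc: float, a: float, b: float, iterations: int = 8) -> float:
    """Compute acc + a*b via linear-mode CORDIC rotation:

        x_{i+1} = x_i
        y_{i+1} = y_i + d_i * x_i * 2^-i
        z_{i+1} = z_i - d_i * 2^-i,   d_i = sign(z_i)

    After n iterations y converges to acc + a*b (requires |b| < 2).
    Each iteration is a shift and an add, so no hardware multiplier is needed.
    """
    x, y, z = a, acc, b
    for i in range(iterations):
        d = 1 if z >= 0 else -1
        y += d * x * 2 ** -i
        z -= d * 2 ** -i
    return y


if __name__ == "__main__":
    # MAC on Q3.5-quantized operands; more iterations shrink the residual z.
    a = from_q35(to_q35(1.25))
    b = from_q35(to_q35(0.75))
    acc = 0.5
    print(cordic_mac(acc, a, b), "vs exact", acc + a * b)
```

Because each iteration depends only on the previous one, iterations of this kind can be unrolled into pipeline stages, which is how a pipelined CORDIC MAC recovers the throughput that a recursive (iteratively reused) implementation gives up.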

Data Availability

Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study, and detailed circuit simulation results are given in the manuscript.

Acknowledgements

The authors would like to thank the University Grants Commission (UGC), New Delhi, Government of India, for financial support under the SRF scheme (award no. 22745/(NET-DEC. 2015)), and the Special Manpower Development Program Chip to System Design (SMDP), Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India, for providing the necessary research facilities.

Author information

Corresponding author

Correspondence to Santosh Kumar Vishvakarma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Raut, G., Mukala, J., Sharma, V. et al. Designing a Performance-Centric MAC Unit with Pipelined Architecture for DNN Accelerators. Circuits Syst Signal Process 42, 6089–6115 (2023). https://doi.org/10.1007/s00034-023-02387-2
