Designing a Performance-Centric MAC Unit with Pipelined Architecture for DNN Accelerators

Abstract

To improve the performance of deep neural network (DNN) accelerators, both compute efficiency and operating frequency must be optimized. However, contemporary DNN implementations often require excessive resources due to their heavy multiply-and-accumulate (MAC) workloads. This work proposes a MAC unit built on a COordinate Rotation DIgital Computer (CORDIC)-based architecture, which is both power- and area-efficient at 8-bit and higher precision. Because CORDIC-based designs are typically associated with low throughput, a performance-centric pipelined architecture is investigated to increase throughput. The study conducts a detailed Pareto analysis of accuracy variation across precision levels and of the pipeline stages required to achieve high performance. Post-synthesis results for the proposed MAC unit at the 45 nm technology node are provided, and performance is evaluated on a deep neural network using a Virtex-7 FPGA board. The proposed fixed-point MAC architecture is scalable to any bit precision and flexible in the placement of the implied radix point. The proposed Fixed Q\(_{3.5}\)-precision MAC with five pipeline stages shows better performance metrics than the recursive CORDIC-based MAC design: it achieves a 1.13\(\times \) lower area-delay product (ADP) and 2.73\(\times \) higher throughput. Evaluated on a fully connected neural network for the MNIST dataset, the proposed MAC unit delivers 1.89\(\times \) higher throughput than a conventional MAC-based design.
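
The two ideas the abstract combines, a multiplier-less CORDIC-based MAC and Q\(_{3.5}\) fixed-point operands, can be illustrated with a short model. The following is a minimal Python sketch assuming standard linear-mode (rotation) CORDIC; the constants, helper names, and iteration count are illustrative assumptions, not details taken from the paper or its RTL.

```python
# Illustrative sketch (not the authors' design): Q3.5 fixed point plus a
# linear-mode CORDIC multiply-accumulate, which uses only shifts and adds.

FRAC_BITS = 5                     # Q3.5: 3 integer bits (incl. sign), 5 fractional
WORD_MIN, WORD_MAX = -128, 127    # 8-bit two's-complement range


def to_q35(x: float) -> int:
    """Quantize a real number to a Q3.5 fixed-point integer (saturating)."""
    q = round(x * (1 << FRAC_BITS))
    return max(WORD_MIN, min(WORD_MAX, q))


def from_q35(q: int) -> float:
    """Interpret a Q3.5 integer as a real number."""
    return q / (1 << FRAC_BITS)


def cordic_mac(acc: float, a: float, b: float, iterations: int = 8) -> float:
    """Compute acc + a*b via linear-mode CORDIC rotation:

        x_{i+1} = x_i
        y_{i+1} = y_i + d_i * x_i * 2^-i
        z_{i+1} = z_i - d_i * 2^-i,   d_i = sign(z_i)

    After n iterations y converges to acc + a*b (requires |b| < 2).
    Each iteration is a shift and an add, so no hardware multiplier is needed.
    """
    x, y, z = a, acc, b
    for i in range(iterations):
        d = 1 if z >= 0 else -1
        y += d * x * 2 ** -i
        z -= d * 2 ** -i
    return y


if __name__ == "__main__":
    # MAC on Q3.5-quantized operands; more iterations shrink the residual z.
    a = from_q35(to_q35(1.25))
    b = from_q35(to_q35(0.75))
    acc = 0.5
    print(cordic_mac(acc, a, b), "vs exact", acc + a * b)
```

Because each iteration depends only on the previous one, iterations of this kind can be unrolled into pipeline stages, which is how a pipelined CORDIC MAC recovers the throughput that a recursive (iteratively reused) implementation gives up.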

Data Availability

Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study, and detailed circuit simulation results are given in the manuscript.

Acknowledgements

The authors would like to thank the University Grants Commission (UGC), New Delhi, Government of India, for financial support under the SRF scheme (award no. 22745/(NET-DEC. 2015)), and the Special Manpower Development Program Chip to System Design (SMDP), Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India, for providing the necessary research facilities.

Author information

Corresponding author

Correspondence to Santosh Kumar Vishvakarma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Raut, G., Mukala, J., Sharma, V. et al. Designing a Performance-Centric MAC Unit with Pipelined Architecture for DNN Accelerators. Circuits Syst Signal Process 42, 6089–6115 (2023). https://doi.org/10.1007/s00034-023-02387-2
