A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture

Kuang, Shiann-Rong; Liang, Chih-Yuan; Chang, Ming-Fong

doi:10.1007/s13369-016-2117-3

Research Article - Computer Engineering and Computer Science
Published: 30 March 2016

Volume 41, pages 3139–3151 (2016)

Arabian Journal for Science and Engineering

Abstract

The floating-point (FP) four-dimensional vector inner product (4D dot product; DP4) is one of the most frequently performed operations in 3D graphics applications, so modern graphics processing units (GPUs) can implement an FP DP4 unit in hardware to improve performance. Unfortunately, the FP DP4 unit is power hungry, and reducing its power consumption is critical for mobile GPUs. This paper proposes a multi-functional, multi-precision DP4 unit with a single instruction multiple data (SIMD) architecture. Rather than adding discrete FP multipliers, adders, and multiply-add-fused units, the proposed architecture performs not only one-way DP4 but also four-way multiplication, addition, and multiply-add-fused operations, which decreases the hardware area. In addition, it can perform these FP operations in four precision modes (23-, 18-, 13-, and 7-bit modes) to reduce power and energy consumption when slight image distortion is tolerable. The proposed design is fully pipelined with a latency of three cycles, a throughput of one result per cycle, and a cycle time of 2.8 ns in 90 nm CMOS technology. Compared with a DP4 unit that supports only one precision, the proposed multi-precision DP4 unit saves on average about 7.2%, 18.5%, 32.2%, and 49.6% of power consumption in the 23-, 18-, 13-, and 7-bit precision modes, respectively, at the expense of 3.7% more area and a 7.7% longer delay.
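
The numerical effect of the precision modes can be illustrated in software. The sketch below is not the paper's hardware design: it assumes that the 23-, 18-, 13-, and 7-bit modes refer to the number of single-precision fraction bits retained, and it models a reduced mode by simply truncating the low fraction bits. The function names (truncate_mantissa, dp4, mul4, add4, maf4) are hypothetical, the accumulation is done in Python's double precision with one final truncation, and special values such as NaN are not handled.

```python
import struct

# Assumption: each mode keeps this many IEEE 754 single-precision fraction bits.
PRECISION_MODES = (23, 18, 13, 7)


def truncate_mantissa(x: float, bits: int) -> float:
    """Round x to single precision, then keep only the top `bits` fraction bits."""
    if not 0 <= bits <= 23:
        raise ValueError("bits must be between 0 and 23")
    (word,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << (23 - bits)) & 0xFFFFFFFF  # clears the low fraction bits
    (y,) = struct.unpack("<f", struct.pack("<I", word & mask))
    return y


def dp4(a, b, bits=23):
    """One-way DP4: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]."""
    acc = sum(truncate_mantissa(x, bits) * truncate_mantissa(y, bits)
              for x, y in zip(a, b))
    return truncate_mantissa(acc, bits)


def mul4(a, b, bits=23):
    """Four-way multiplication: one independent product per lane."""
    return [truncate_mantissa(truncate_mantissa(x, bits) * truncate_mantissa(y, bits), bits)
            for x, y in zip(a, b)]


def add4(a, b, bits=23):
    """Four-way addition: one independent sum per lane."""
    return [truncate_mantissa(truncate_mantissa(x, bits) + truncate_mantissa(y, bits), bits)
            for x, y in zip(a, b)]


def maf4(a, b, c, bits=23):
    """Four-way multiply-add: a[i]*b[i] + c[i] per lane, truncated once at the end."""
    return [truncate_mantissa(truncate_mantissa(x, bits) * truncate_mantissa(y, bits)
                              + truncate_mantissa(z, bits), bits)
            for x, y, z in zip(a, b, c)]


if __name__ == "__main__":
    normal = [0.267, 0.534, 0.802, 0.0]   # illustrative surface normal
    light = [0.577, 0.577, 0.577, 0.0]    # illustrative light direction
    for mode in PRECISION_MODES:
        print(mode, dp4(normal, light, bits=mode))
```

Running the script prints the same dot product in each of the four modes, which gives a rough feel for how much accuracy is traded away as fraction bits are dropped. It says nothing about power, area, or delay, which depend on the circuit implementation.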

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, 804, Taiwan
Shiann-Rong Kuang, Chih-Yuan Liang & Ming-Fong Chang

Corresponding author

Correspondence to Shiann-Rong Kuang.

About this article

Cite this article

Kuang, SR., Liang, CY. & Chang, MF. A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture. Arab J Sci Eng 41, 3139–3151 (2016). https://doi.org/10.1007/s13369-016-2117-3

Received: 15 October 2015
Accepted: 15 March 2016
Published: 30 March 2016
Issue Date: August 2016
DOI: https://doi.org/10.1007/s13369-016-2117-3
