A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture

  • Research Article - Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

The floating-point (FP) four-dimensional vector inner product (4D dot product; DP4) is one of the most frequently performed operations in 3D graphics applications, so a hardware FP DP4 unit can be used in modern graphics processing units (GPUs) to improve performance. However, an FP DP4 unit is power hungry, and reducing its power consumption is critical for mobile GPUs. In this paper, a multi-functional multi-precision DP4 unit with a single instruction multiple data (SIMD) architecture is proposed. Rather than relying on additional discrete FP multipliers, adders, and multiply-add-fused units, the proposed architecture performs not only one-way DP4 but also four-way multiplication, addition, and multiply-add-fused operations, which reduces the hardware area. In addition, the proposed architecture supports these FP operations in four precision modes (23-, 18-, 13-, and 7-bit) to reduce power and energy consumption when a small amount of image distortion is acceptable. The proposed design is fully pipelined, with a latency of three cycles, a throughput of one result per cycle, and a cycle time of 2.8 ns in 90 nm CMOS technology. Compared with the one-precision DP4 unit, the proposed multi-precision DP4 unit saves about 7.2%, 18.5%, 32.2%, and 49.6% of power consumption on average in the 23-, 18-, 13-, and 7-bit precision modes, respectively, at the expense of 3.7% more area and 7.7% longer delay.
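To make the precision modes concrete, the following is a minimal software sketch of a reduced-precision DP4, not the hardware datapath proposed in the paper. It assumes that the 23-, 18-, 13-, and 7-bit modes refer to the number of retained fraction bits of an IEEE 754 single-precision value, and that the dropped bits are simply truncated; the abstract specifies neither, and the real unit may round differently or reduce precision inside the multiplier array. The names `truncate_mantissa`, `dp4`, and `mul4` are hypothetical and chosen only for illustration.

```python
import struct

# Fraction widths named in the abstract. Treating them as retained FP32
# fraction bits is an assumption of this sketch.
PRECISION_MODES = (23, 18, 13, 7)


def truncate_mantissa(x: float, bits: int) -> float:
    """Round x to FP32, then zero the low (23 - bits) fraction bits.

    Truncation toward zero is an assumed model, not the paper's circuit.
    """
    if bits not in PRECISION_MODES:
        raise ValueError(f"unsupported precision mode: {bits}")
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << (23 - bits)) & 0xFFFFFFFF
    (out,) = struct.unpack("<f", struct.pack("<I", raw & mask))
    return out


def dp4(a, b, bits=23):
    """One-way DP4: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4 expects two 4-element vectors")
    qa = [truncate_mantissa(v, bits) for v in a]
    qb = [truncate_mantissa(v, bits) for v in b]
    # Products and sum are formed in double precision, then the single
    # result is reduced to the selected mode.
    acc = sum(x * y for x, y in zip(qa, qb))
    return truncate_mantissa(acc, bits)


def mul4(a, b, bits=23):
    """Four-way multiplication: four independent lane-wise products."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("mul4 expects two 4-element vectors")
    return [
        truncate_mantissa(
            truncate_mantissa(x, bits) * truncate_mantissa(y, bits), bits
        )
        for x, y in zip(a, b)
    ]


if __name__ == "__main__":
    a = [0.1, 0.2, 0.3, 0.4]
    b = [0.5, 0.6, 0.7, 0.8]
    for mode in PRECISION_MODES:
        print(mode, dp4(a, b, mode))
```

Running the script prints the DP4 result for each mode, so the growth in numerical error as fraction bits are dropped can be inspected directly. The sketch models only numerical behavior and says nothing about the power, area, or delay figures reported above.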

Author information

Corresponding author

Correspondence to Shiann-Rong Kuang.

About this article

Cite this article

Kuang, SR., Liang, CY. & Chang, MF. A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture. Arab J Sci Eng 41, 3139–3151 (2016). https://doi.org/10.1007/s13369-016-2117-3

  • DOI: https://doi.org/10.1007/s13369-016-2117-3
