Abstract
This chapter studies the optimizations that become possible when floating-point operators are specialized or fused. It builds upon the specialized fixed-point operators of previous chapters and focuses on the exponent management and rounding issues specific to floating point.
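As a rough illustration of why rounding matters when operators are fused, the sketch below compares a multiplication followed by an addition (two roundings) with a fused multiply-add emulated in exact rational arithmetic (one rounding). The choice of fused multiply-add as the example and the helper names `separate_multiply_add` and `fused_multiply_add` are assumptions made for illustration; the abstract does not single out a particular operator.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Two operators, two roundings: the product is rounded, then the sum."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Hypothetical software model of a fused operator: one final rounding.

    The product and sum are computed exactly with rationals, then
    converted to the nearest double in a single step.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product a*b = 1 - 2**-54 falls halfway
# between two doubles and rounds up to 1.0 when rounded on its own.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

two_roundings = separate_multiply_add(a, b, c)
one_rounding = fused_multiply_add(a, b, c)

print("separate:", two_roundings)
print("fused:   ", one_rounding)
```

With these inputs the separate version loses the small term entirely, while the single-rounding version keeps it, which is the kind of accuracy difference a fused operator can offer.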
Notes
1. Strictly speaking, as the exponent is constant, the point does not actually float here.
2. Of course, trusting the compiler to respect language standards is necessary but not sufficient to ensure safe applications.
Copyright information
© 2024 Springer Nature Switzerland AG
About this chapter
Cite this chapter
de Dinechin, F., Kumm, M. (2024). Specialization and Fusion of Floating-Point Operators. In: Application-Specific Arithmetic. Springer, Cham. https://doi.org/10.1007/978-3-031-42808-1_15
DOI: https://doi.org/10.1007/978-3-031-42808-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42807-4
Online ISBN: 978-3-031-42808-1
eBook Packages: Engineering, Engineering (R0)