Extending Summation Precision for Network Reduction Operations
- First Online:
- Cite this article as:
- Michelogiannakis, G., Li, X.S., Bailey, D.H. et al. Int J Parallel Prog (2015) 43: 1218. doi:10.1007/s10766-014-0326-5
- 100 Downloads
Double precision summation is at the core of numerous important algorithms such as Newton–Krylov methods and other operations involving inner products, such as matrix multiplication and dot products. However, the effectiveness of summation is limited by the accumulation of rounding errors due to compressed representations, which are an increasing problem with the scaling of modern HPC systems and data sets that can easily perform summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly, and do not always guarantee an exact result. In this article, we propose fixed-point representations of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. We call this format big integer (BigInt). Even though such formats have been studied for local processor computations, we make the case that using fixed-point representation for distributed computation over a system-wide network is feasible with performance comparable to that of double-precision floating point summation. This is possible by the inclusion of simple and inexpensive logic into modern NICs, or by using the programmable logic found in many modern NICs, in order to accelerate performance on large-scale systems in order to avoid waking up processors.