Abstract
Double-precision summation is at the core of numerous important algorithms, including Newton–Krylov methods and other operations built on inner products, such as matrix multiplication and dot products. However, the effectiveness of summation is limited by the accumulation of rounding errors inherent in finite-precision floating-point representations, a growing problem as modern HPC systems and data sets scale to summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible results or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly, and do not always guarantee an exact result. In this article, we propose a fixed-point representation of double-precision variables that enables arbitrarily large summations without error and provides exact, reproducible results. We call this format big integer (BigInt). Even though such formats have been studied for local processor computations, we make the case that using fixed-point representation for distributed computation over a system-wide network is feasible, with performance comparable to that of double-precision floating-point summation. This is made possible by adding simple, inexpensive logic to modern NICs, or by using the programmable logic many modern NICs already provide, to accelerate reductions on large-scale systems while avoiding waking up processors.
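To make the fixed-point idea concrete, here is a minimal software sketch (not the paper's NIC-based implementation): every finite IEEE-754 double is an integer multiple of 2^-1074, so an arbitrary-precision integer scaled by 2^-1074 can accumulate any sequence of doubles exactly, with a single rounding at the end. Python's built-in big integers stand in for the BigInt accumulator; the function names are illustrative.

```python
import math

# Every finite IEEE-754 double is an integer multiple of 2**-1074, so a
# single arbitrary-precision integer scaled by 2**-1074 can hold any sum
# of doubles exactly (the "long accumulator" / BigInt idea).
SCALE = 1074

def to_fixed(x: float) -> int:
    """Convert a finite double to its exact integer multiple of 2**-1074."""
    m, e = math.frexp(x)            # x == m * 2**e, with 0.5 <= |m| < 1
    mant = int(m * (1 << 53))       # exact 53-bit integer significand
    shift = SCALE + e - 53
    # shift is negative only for subnormals, whose low significand bits
    # are zero, so the right shift below is still exact
    return mant << shift if shift >= 0 else mant >> -shift

def exact_sum(values) -> float:
    """Sum doubles without intermediate rounding; round once at the end."""
    acc = 0                          # fixed-point accumulator (exact)
    for v in values:
        acc += to_fixed(v)
    return acc / (1 << SCALE)        # one correctly rounded division

# Ten copies of 0.1 sum exactly to 1 + 2**-54, which rounds to 1.0;
# a naive left-to-right float sum returns 0.9999999999999999 instead.
print(exact_sum([0.1] * 10))         # 1.0
print(sum([0.1] * 10))               # 0.9999999999999999
```

Because the accumulator is exact, the result is also independent of operand order, which is what makes the format attractive for network reductions whose reduction-tree shape varies from run to run.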
Acknowledgments
This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Disclaimer: This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or the Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or the Regents of the University of California.
Copyright Notice: This manuscript has been authored by an author at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
Cite this article
Michelogiannakis, G., Li, X.S., Bailey, D.H. et al. Extending Summation Precision for Network Reduction Operations. Int J Parallel Prog 43, 1218–1243 (2015). https://doi.org/10.1007/s10766-014-0326-5