Extending Summation Precision for Network Reduction Operations

Abstract

Double-precision summation lies at the core of many important algorithms, such as Newton–Krylov methods, and of operations built on inner products, such as dot products and matrix multiplication. However, the effectiveness of summation is limited by the accumulation of rounding errors inherent to finite-precision representations, a growing problem as modern HPC systems and data sets scale to summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly, and they do not always guarantee an exact result. In this article, we propose a fixed-point representation of double-precision variables that enables arbitrarily large summations without error and yields exact, reproducible results. We call this format big integer (BigInt). Although such formats have been studied for local processor computations, we make the case that using a fixed-point representation for distributed computation over a system-wide network is feasible, with performance comparable to that of double-precision floating-point summation. This can be achieved by adding simple, inexpensive logic to modern NICs, or by using the programmable logic already found in many of them, to accelerate reductions on large-scale systems while avoiding waking up processors.
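The core idea above can be illustrated with a minimal Python sketch (not the paper's NIC-resident implementation): every finite IEEE 754 double is an integer multiple of the smallest subnormal, 2^-1074, so scaling each operand by 2^1074 maps it to an exact integer. Summing those integers with arbitrary-precision arithmetic is exact and order-independent, and only the final conversion back to a double rounds.

```python
from fractions import Fraction

# Every finite double equals m * 2**e with e >= -1074, so scaling by
# 2**1074 produces an exact integer (a "BigInt" fixed-point image).
SCALE = 1074

def to_bigint(x: float) -> int:
    """Exact fixed-point image of a finite double (Fraction(x) is exact)."""
    return int(Fraction(x) * 2**SCALE)

def bigint_sum(values) -> float:
    """Sum doubles exactly; round to double only once, at the end."""
    acc = sum(to_bigint(v) for v in values)          # exact integer sum
    return float(Fraction(acc, 2**SCALE))            # single correct rounding

# Naive float summation loses the 1.0 (it is below the spacing of
# doubles near 1e16); the fixed-point sum recovers it exactly.
vals = [1e16, 1.0, -1e16]
print(sum(vals))         # 0.0 (rounding error)
print(bigint_sum(vals))  # 1.0 (exact)
```

Because the integer sum is associative, the result is identical for any reduction order, which is what makes the format attractive for network reductions where operand arrival order is nondeterministic.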




Acknowledgments

This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information


Corresponding author

Correspondence to George Michelogiannakis.

Additional information

Disclaimer: This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or the Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or the Regents of the University of California.

Copyright Notice: This manuscript has been authored by an author at Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.


About this article

Cite this article

Michelogiannakis, G., Li, X.S., Bailey, D.H. et al. Extending Summation Precision for Network Reduction Operations. Int J Parallel Prog 43, 1218–1243 (2015). https://doi.org/10.1007/s10766-014-0326-5

Keywords

  • Computation precision
  • Double-precision
  • Distributed summation