International Journal of Parallel Programming

Volume 43, Issue 6, pp 1218–1243

Extending Summation Precision for Network Reduction Operations

  • George Michelogiannakis
  • Xiaoye S. Li
  • David H. Bailey
  • John Shalf

Abstract

Double precision summation is at the core of numerous important algorithms, such as Newton–Krylov methods and other operations involving inner products, including matrix multiplication and dot products. However, the effectiveness of summation is limited by the accumulation of rounding errors inherent in compressed floating-point representations, an increasing problem as modern HPC systems and data sets scale to summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly, and do not always guarantee an exact result. In this article, we propose a fixed-point representation of double precision variables that enables arbitrarily large summations without error and provides exact and reproducible results. We call this format big integer (BigInt). Even though such formats have been studied for local processor computations, we make the case that using a fixed-point representation for distributed summation over a system-wide network is feasible, with performance comparable to that of double-precision floating-point summation. This can be achieved by adding simple and inexpensive logic to modern NICs, or by using the programmable logic already found in many of them, to accelerate reductions on large-scale systems while avoiding waking up processors.
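As a rough illustration of the fixed-point idea (a sketch, not the paper's BigInt implementation), every finite double is an integer multiple of 2^-1074, the smallest subnormal. Scaling each operand by 2^1074 therefore maps it losslessly to an integer; the integers can be accumulated exactly, and a single rounding step converts the total back to a double. The sketch below uses Python's arbitrary-precision integers to stand in for the wide fixed-point accumulator; the names `to_fixed` and `fixed_sum` are our own for illustration:

```python
from fractions import Fraction

# Every finite double is an integer multiple of 2**-1074 (the smallest
# subnormal), so scaling by 2**1074 maps it to an exact integer.
SCALE = 2 ** 1074

def to_fixed(x: float) -> int:
    """Convert a finite double to its exact fixed-point integer form."""
    f = Fraction(x) * SCALE   # Fraction(x) is exact for any finite double
    assert f.denominator == 1
    return f.numerator

def fixed_sum(values) -> float:
    """Sum doubles with no intermediate rounding; round once at the end."""
    total = sum(to_fixed(v) for v in values)  # exact integer accumulation
    return float(Fraction(total, SCALE))      # single correctly rounded step

# 1.0 is lost when naively added to 1e16 (the ulp of 1e16 is 2.0), but the
# fixed-point accumulator preserves it.
data = [1e16, 1.0, -1e16]
print(sum(data))        # naive left-to-right sum: 0.0
print(fixed_sum(data))  # exact sum: 1.0
```

The accumulator is order-independent, which is what makes a network reduction over it reproducible regardless of how the interconnect combines partial sums.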

Keywords

Computation precision · Double-precision · Distributed summation

Notes

Acknowledgments

This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.


Copyright information

© Springer Science+Business Media New York (outside the USA) 2014

Authors and Affiliations

  • George Michelogiannakis (1)
  • Xiaoye S. Li (1)
  • David H. Bailey (1)
  • John Shalf (1)

  1. Lawrence Berkeley National Laboratory, Berkeley, USA
