
Reproducible, Accurately Rounded and Efficient BLAS

  • Chemseddine Chohra
  • Philippe Langlois
  • David Parello
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10104)

Abstract

Numerical reproducibility failures arise in parallel computation because floating-point summation is non-associative: massively parallel and optimized executions dynamically change the order of floating-point operations, so numerical results may differ from one run to another. We propose to ensure reproducibility by extending the IEEE-754 correct-rounding property to larger operation sequences as far as possible. We introduce RARE-BLAS (Reproducible, Accurately Rounded and Efficient BLAS), which benefits from recent accurate and efficient summation algorithms. Solutions for level 1 (asum, dot and nrm2) and level 2 (gemv) routines are presented, and their performance is compared to the Intel MKL library and to other existing reproducible algorithms. On both shared- and distributed-memory parallel systems we measure an extra cost of at most 2× in the worst case, which is satisfactory for a wide range of applications. On the Intel Xeon Phi accelerator a larger extra cost (4× to 6×) is observed, which remains useful at least for debugging and validation steps.
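
To make the core difficulty and the classical building blocks concrete, here is a minimal C sketch (an illustration of the error-free transformations that the accurate summation algorithms mentioned in the abstract rely on, not the RARE-BLAS implementation itself; the function names are ours). It shows why floating-point summation is non-associative, and how Knuth's TwoSum and an FMA-based TwoProd recover the exact rounding errors that a compensated dot product then reaccumulates:

    #include <math.h>
    #include <stdio.h>

    /* Knuth's TwoSum: s = fl(a+b) and t such that a + b = s + t exactly. */
    static void two_sum(double a, double b, double *s, double *t) {
        *s = a + b;
        double bv = *s - a;
        *t = (a - (*s - bv)) + (b - bv);
    }

    /* FMA-based TwoProd: p = fl(a*b) and e such that a*b = p + e exactly. */
    static void two_prod(double a, double b, double *p, double *e) {
        *p = a * b;
        *e = fma(a, b, -*p);
    }

    /* Compensated dot product in the style of Ogita, Rump and Oishi's Dot2:
       the exact multiplication and addition errors are accumulated in c
       and folded back into the result at the end. */
    static double dot2(const double *x, const double *y, int n) {
        double s = 0.0, c = 0.0, p, ep, es;
        for (int i = 0; i < n; i++) {
            two_prod(x[i], y[i], &p, &ep);
            two_sum(s, p, &s, &es);
            c += ep + es;                    /* exact error terms */
        }
        return s + c;
    }

    int main(void) {
        /* Non-associativity: reordering changes the rounded result. */
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a+b)+c = %g\n", (a + b) + c);   /* prints 1 */
        printf("a+(b+c) = %g\n", a + (b + c));   /* prints 0 */

        double x[3] = {1e16, 1.0, -1e16}, y[3] = {1.0, 1.0, 1.0};
        printf("dot2    = %g\n", dot2(x, y, 3)); /* prints 1 */
        return 0;
    }

Note that a compensated dot product alone does not guarantee reproducibility: the accumulation of the error term c is itself order-dependent. Reproducible results additionally require an order-independent accumulation step, which the recent summation algorithms underlying RARE-BLAS provide.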

Keywords

Vector size · Sequential case · Small vector · Thread-level parallelism · High memory bandwidth


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Chemseddine Chohra (1, 2, 3)
  • Philippe Langlois (1, 2, 3)
  • David Parello (1, 2, 3)

  1. Univ. Perpignan Via Domitia, Digits, Architectures et Logiciels Informatiques, Perpignan, France
  2. Univ. Montpellier II, Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, UMR 5506, Montpellier, France
  3. CNRS, Paris, France
