Reproducible and Accurate Matrix Multiplication

  • Roman Iakymchuk
  • David Defour
  • Sylvain Collange
  • Stef Graillat
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9553)

Abstract

Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, obtaining a bit-wise reproducible floating-point result for multiple executions of the same code on different, or even similar, parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields both reproducible and accurate results. This algorithm is composed of two main stages: a filtering stage that uses fast vectorized floating-point expansions in conjunction with error-free transformations, and an accumulation stage based on Kulisch long accumulators in a high-radix carry-save representation. Finally, we provide implementations and performance results in parallel environments such as GPUs.
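
The two stages can be illustrated with a small, self-contained sketch. The code below is a minimal CPU illustration under stated assumptions, not the authors' GPU implementation: two_prod and two_sum are the standard error-free transformations (the FMA-based product split and Knuth's TwoSum), a fixed-size floating-point expansion stands in for the filtering stage, and the flush of any residual error into a Kulisch long accumulator is only indicated by a comment. The expansion size and the example dot product are arbitrary choices for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Error-free transformation of a product: p + e == a * b exactly,
// assuming a hardware FMA is available (as on recent CPUs and GPUs).
static void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);   // exact rounding error of the product
}

// Error-free transformation of a sum (Knuth's TwoSum):
// s + e == a + b exactly, regardless of the magnitudes of a and b.
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double z = s - a;
    e = (a - (s - z)) + (b - z);
}

// Accumulate a value into a fixed-size floating-point expansion,
// propagating the rounding error to the next component.
template <int N>
static void expansion_accumulate(double (&acc)[N], double x) {
    for (int i = 0; i < N; ++i) {
        double e;
        two_sum(acc[i], x, acc[i], e);
        x = e;                 // carry the error to the next component
        if (x == 0.0) return;
    }
    // If x is still non-zero here, a full implementation would flush it
    // into a Kulisch long accumulator; this sketch simply drops it.
}

int main() {
    // Toy dot product sum_k a[k] * b[k], with the partial result held
    // in a 4-term expansion. The exact value is 1 + 1e-16.
    std::vector<double> a = {1e16, 1.0, -1e16, 1e-16};
    std::vector<double> b = {1.0, 1.0, 1.0, 1.0};

    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    for (size_t k = 0; k < a.size(); ++k) {
        double p, e;
        two_prod(a[k], b[k], p, e);   // exact product split into p + e
        expansion_accumulate(acc, p);
        expansion_accumulate(acc, e);
    }

    double result = acc[0] + acc[1] + acc[2] + acc[3];
    std::printf("dot product = %.17g\n", result);  // prints 1, the correctly rounded value
    return 0;
}
```

For this example a naive left-to-right summation returns 1e-16 due to catastrophic cancellation, whereas the sketch recovers the correctly rounded result 1. In the paper's full scheme, it is the combination of such exact transformations with the long accumulator that makes the result independent of summation order, and hence reproducible.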

Keywords

Matrix multiplication · Reproducibility · Accuracy · Kulisch long accumulator · Error-free transformation · Floating-point expansion · Rounding-to-nearest · GPUs

Notes

Acknowledgement

This work, undertaken (partially) in the framework of CALSIMLAB, is supported by the public grant ANR-11-LABX-0037-01 overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program (reference: ANR-11-IDEX-0004-02). This work was also (partially) supported by the FastRelax project through the ANR public grant (reference: ANR-14-CE25-0018-01).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Roman Iakymchuk (1, 2, 3)
  • David Defour (4)
  • Sylvain Collange (5)
  • Stef Graillat (1, 2)

  1. Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, Paris, France
  2. CNRS, UMR 7606, LIP6, Paris, France
  3. Sorbonne Universités, UPMC Univ Paris 06, ICS, Paris, France
  4. DALI–LIRMM, Université de Perpignan, Perpignan, France
  5. INRIA – Centre de Recherche Rennes – Bretagne Atlantique, Rennes Cedex, France