On the Performance Prediction of BLAS-based Tensor Contractions

  • Elmar Peise
  • Diego Fabregat-Traver
  • Paolo Bientinesi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8966)

Abstract

Tensor operations are emerging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly optimized libraries (BLAS), for higher-dimensional tensors neither standards nor highly tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in this family without executing them. Instead, we rely on a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
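To make the decomposition idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one particular contraction, C[a][b][c] = sum over i of A[a][i] * B[i][b][c], and computes it in two alternative ways: as a sequence of matrix-matrix products and as a sequence of matrix-vector products. The helpers `gemm` and `gemv` are hypothetical pure-Python stand-ins for the corresponding BLAS kernels, and all function names are chosen here for illustration only.

```python
def gemm(M, N):
    """Matrix-matrix product of nested lists (stand-in for a BLAS gemm call)."""
    cols = list(zip(*N))
    return [[sum(m * n for m, n in zip(row, col)) for col in cols] for row in M]


def gemv(M, x):
    """Matrix-vector product of nested lists (stand-in for a BLAS gemv call)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def shape(A, B):
    """Return (na, ni, nb, nc) for A of size na x ni and B of size ni x nb x nc."""
    return len(A), len(B), len(B[0]), len(B[0][0])


def contract_reference(A, B):
    """Direct evaluation of C[a][b][c] = sum_i A[a][i] * B[i][b][c]."""
    na, ni, nb, nc = shape(A, B)
    return [[[sum(A[a][i] * B[i][b][c] for i in range(ni))
              for c in range(nc)] for b in range(nb)] for a in range(na)]


def contract_gemm(A, B):
    """One gemm per value of c: the slice C[:, :, c] equals A times B[:, :, c]."""
    na, ni, nb, nc = shape(A, B)
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        B_c = [[B[i][b][c] for b in range(nb)] for i in range(ni)]  # ni x nb slice
        C_c = gemm(A, B_c)                                          # na x nb result
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = C_c[a][b]
    return C


def contract_gemv(A, B):
    """One gemv per pair (b, c): the fiber C[:, b, c] equals A times B[:, b, c]."""
    na, ni, nb, nc = shape(A, B)
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for b in range(nb):
        for c in range(nc):
            x = [B[i][b][c] for i in range(ni)]  # fiber of length ni
            y = gemv(A, x)                       # vector of length na
            for a in range(na):
                C[a][b][c] = y[a]
    return C


# Small integer example so that results can be compared exactly.
A = [[1, 2], [3, 4], [5, 6]]                  # 3 x 2
B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]      # 2 x 2 x 2
C_ref = contract_reference(A, B)
```

Both variants produce the same tensor as `contract_reference(A, B)`, but they differ in which kernel they call and how often: the gemm-based variant issues `nc` kernel calls, while the gemv-based variant issues `nb * nc`. Choosing among such alternatives without running them is the problem the abstract describes.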

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Elmar Peise
    • 1
  • Diego Fabregat-Traver
    • 1
  • Paolo Bientinesi
    • 1
  1. AICES, RWTH Aachen, Aachen, Germany
