Cluster Computing

, Volume 16, Issue 1, pp 131–155 | Cite as

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

  • Wenjing Ma
  • Sriram Krishnamoorthy
  • Oreste Villa
  • Karol Kowalski
  • Gagan Agrawal
Article

Abstract

Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU as compared to one CPU core and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same computation of CCSD(T), on a cluster with Fermi GPUs, we achieve a speedup of 3.4 over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43 over the sequential CPU version.

Keywords

Hybrid CPU+GPU execution CUDA Tensor Contraction Expressions 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Anzt, H., Hahn, T., Heuveline, V., Rocker, B.: GPU accelerated scientific computing: evaluation of the NVIDIA Fermi architecture; elementary kernels and linear solvers (2010). http://www.emcl.kit.edu/preprints/emcl-preprint-2010-04.pdf
  2. 2.
    Aprà, E., Rendell, A.P., Harrison, R.J., Tipparaju, V., deJong, W.A., Xantheas, S.S.: Liquid water: obtaining the right answer for the right reasons. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–7 (2009). doi: 10.1145/1654059.1654127 CrossRefGoogle Scholar
  3. 3.
    Auer, A., Baumgartner, G., Bernholdt, D., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Krishnamoorthy, S., Krishnan, S., Lam, C., Lu, Q., Nooijen, M., Pitzer, R., Ramanujam, J., Sadayappan, P., Sibiryakov, A.: Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Mol. Phys. 2, 211 (2006) CrossRefGoogle Scholar
  4. 4.
    Baghsorkhi, S.S., Delahaye, M., Patel, S.J., Gropp, W.D., Hwu, W.M.: An adaptive performance modeling tool for GPU architectures. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 105–114 (2010). doi: 10.1145/1693453.1693470 Google Scholar
  5. 5.
    Bartlett, R.J., Musiał, M.: Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79(1), 291–352 (2007). doi: 10.1103/RevModPhys.79.291 CrossRefGoogle Scholar
  6. 6.
    Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sadayappan, P.: A compiler framework for optimization of affine loop nests for GPGPUs. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 225–234 (2008). doi: 10.1145/1375527.1375562 Google Scholar
  7. 7.
    Baumgartner, G., Auer, A., Bernholdt, D., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Hirata, S., Krishnamoorthy, S., et al.: Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proc. IEEE 93(2), 276–292 (2005) CrossRefGoogle Scholar
  8. 8.
    Boyer, M., Tarjan, D., Acton, S.T., Skadron, K.: Accelerating leukocyte tracking using CUDA: a case study in leveraging manycore coprocessors. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–12 (2009). doi: 10.1109/IPDPS.2009.5160984 Google Scholar
  9. 9.
    Che, S., Meng, J., Sheaffer, J.W., Skadron, K.: A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput. 68(10), 1370–1380 (2008). doi: 10.1016/j.jpdc.2008.05.014 CrossRefGoogle Scholar
  10. 10.
    Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on GPUs. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 115–126 (2010). doi: 10.1145/1693453.1693471 Google Scholar
  11. 11.
    Čižek, J.: On correlation problem in atomic and molecular systems. Calculation of wavefunction components in ursell-type expansion using quantum-field theoretical methods. J. Chem. Phys. 45(11), 4256–4266 (1966) CrossRefGoogle Scholar
  12. 12.
    Consortium, H.T.: PCI Express 3.0 specification. http://www.hypertransport.org/docs/twgdocs/HTC20051222-00046-0028.pdf (2011)
  13. 13.
    DePrince, A.E., Hammond, J.R.: Coupled cluster theory on graphics processing units I. The coupled cluster doubles method. J. Chem. Theory Comput. 7(5), 1287–1295 (2011). doi: 10.1021/ct100584w. http://pubs.acs.org/doi/abs/10.1021/ct100584w CrossRefGoogle Scholar
  14. 14.
    Dotsenko, Y., Baghsorkhi, S.S., Lloyd, B., Govindaraju, N.K.: Auto-tuning of fast Fourier transform on graphics processors. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11, pp. 257–266. ACM Press, New York (2011). doi: 10.1145/1941553.1941589. URL http://doi.acm.org/10.1145/1941553.1941589 CrossRefGoogle Scholar
  15. 15.
    Dunning, T.: Gaussian basis sets for use in correlated molecular calculations I. The atoms boron through neon and hydrogen. J. Chem. Phys. 90, 1007–1023 (1989) CrossRefGoogle Scholar
  16. 16.
    Filippi, C., Zaccheddu, M., Buda, F.: Absorption spectrum of the green fluorescent protein chromophore: a difficult case for ab initio methods? J. Chem. Theory Comput. 5, 2074–2087 (2009) CrossRefGoogle Scholar
  17. 17.
    Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. Oper. Syst. Rev. 40(5), 151–162 (2006). doi: 10.1145/1168917.1168877 CrossRefGoogle Scholar
  18. 18.
    Hammond, J.R., De Prince, III, A.E.: Evaluating one-sided programming models for gpu cluster computations. http://saahpc.ncsa.illinois.edu/papers/paper_43.pdf (2011)
  19. 19.
    Harish, P., Narayanan, P.: Accelerating large graph algorithms on the GPU using CUDA. In: Proceedings of the International Conference on High Performance Computing (HiPC), pp. 197–208 (2007) Google Scholar
  20. 20.
    Hirata, S.: Tensor contraction engine: abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. J. Phys. Chem. 107(46), 9887–9897 (2003) CrossRefGoogle Scholar
  21. 21.
    Hong, S., Kim, H.: An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In: ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pp. 152–163. ACM Press, New York (2009). doi: 10.1145/1555754.1555775 CrossRefGoogle Scholar
  22. 22.
    Intel: An introduction to the Intel QuickPath Interconnect. Document Number: 320412, January 2009, http://www.intel.com/technology/quickpath/introduction.pdf
  23. 23.
    Kowalski, K., Krishnamoorthy, S., Olson, R.M., Tipparaju, V., Apra, E.: Scalable implementations of accurate excited-state coupled cluster theories: application of high-level methods to porphyrin-based systems. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing (2011). doi: 10.1145/2063384.2063481 Google Scholar
  24. 24.
    Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Proceedings of the International Conference on Computational Science (ICCS), pp. 884–892 (2009). doi: 10.1007/978-3-642-01970-8-89 Google Scholar
  25. 25.
    Lu, Q., Krishnamoorthy, S., Sadayappan, P.: Combining analytical and empirical approaches in tuning matrix transposition. In: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 233–242 (2006). doi: 10.1145/1152154.1152190 CrossRefGoogle Scholar
  26. 26.
    Ma, W., Agrawal, G.: A translation system for enabling data mining applications on GPUs. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 400–409 (2009). doi: 10.1145/1542275.1542331 Google Scholar
  27. 27.
    Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K.: GPU-based implementations of the noniterative regularized-CCSD(T) corrections: applications to strongly correlated systems. J. Chem. Theory Comput. 7(5), 1316–1327 (2011). doi: 10.1021/ct1007247. URL http://pubs.acs.org/doi/abs/10.1021/ct1007247 CrossRefGoogle Scholar
  28. 28.
    Molka, D., Hackenberg, D., Schone, R., Muller, M.S.: Memory performance and cache coherency effects on an intel nehalem multiprocessor system. In: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 261–270 (2009). doi: 10.1109/PACT.2009.22 Google Scholar
  29. 29.
    Murthy, S.G.: Optimal loop unrolling for GPGPU programs. Master’s thesis, The Ohio State University (2009) Google Scholar
  30. 30.
    Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for fermi GPUs. http://icl.cs.utk.edu/projectsfiles/magma/pubs/fermi_gemm.pdf (2010)
  31. 31.
    Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. ACM Queue 6(2), 40–53 (2008). doi: 10.1145/1365490.1365500 CrossRefGoogle Scholar
  32. 32.
    Nieplocha, J., Tipparaju, V., Krishnan, M., Panda, D.: High performance remote memory access communication: the armci approach. Int. J. High Perform. Comput. Appl. 20(2), 233 (2006) CrossRefGoogle Scholar
  33. 33.
    Nukada, A., Ogata, Y., Endo, T., Matsuoka, S.: Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–11 (2008) Google Scholar
  34. 34.
    Nvidia: NVIDIA’s next generation CUDA compute architecture: Fermi. http://www.nvidia.com/object/fermi_architecture.html
  35. 35.
    NVIDIA: NVIDIA CUDA Programming guide, version 3.0 (2010) Google Scholar
  36. 36.
    Paldus, J., Li, X.: A critical assessment of coupled cluster method in quantum chemistry. Adv. Chem. Phys. 110, 1–175 (1999) CrossRefGoogle Scholar
  37. 37.
    Raghavachari, K., Trucks, G.W., Pople, J.A., Head-Gordon, M.: A 5th-order perturbation comparison of electron correlation theories. Chem. Phys. Lett. 157(6), 479–483 (1989) CrossRefGoogle Scholar
  38. 38.
    Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 73–82 (2008). doi: 10.1145/1345206.1345220 CrossRefGoogle Scholar
  39. 39.
    Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.Z., Stratton, J.A., Hwu, W.M.W.: Program optimization space pruning for a multithreaded GPU. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 195–204 (2008). doi: 10.1145/1356058.1356084 Google Scholar
  40. 40.
    Schatz, M., Trapnell, C., Delcher, A., Varshney, A.: High-throughput sequence alignment using graphics processing units. BMC Bioinform. 8(1), 474 (2007). doi: 10.1186/1471-2105-8-474 CrossRefGoogle Scholar
  41. 41.
    TOP500: http://www.top500.org (2011)
  42. 42.
    Udupa, A., Govindarajan, R., Thazhuthaveetil, M.J.: Software pipelined execution of stream programs on GPUs. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 200–209 (2009). doi: 10.1109/CGO.2009.20 CrossRefGoogle Scholar
  43. 43.
    Valiev, M., Bylaska, E., Govind, N., Kowalski, K., Straatsma, T., Dam, H.V., Wang, D., Nieplocha, J., Apra, E., Windus, T., de Jong, W.: NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 181(9), 1477–1489 (2010). doi: 10.1016/j.cpc.2010.04.018. URL http://www.sciencedirect.com/science/article/pii/S0010465510001438 MATHCrossRefGoogle Scholar
  44. 44.
    Volkov, V., Demmel, J.: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs. Tech. Rep. UCB/EECS-2008-49, EECS Department. University of California, Berkeley (2008). URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html
  45. 45.
    Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–11 (2008) Google Scholar

Copyright information

© Springer Science + Business Media, LLC (outside the USA) 2011

Authors and Affiliations

  • Wenjing Ma
    • 1
  • Sriram Krishnamoorthy
    • 1
  • Oreste Villa
    • 1
  • Karol Kowalski
    • 1
    • 3
  • Gagan Agrawal
    • 2
  1. 1.Computational Sciences and Mathematics DivisionPacific Northwest National LaboratoryRichlandUSA
  2. 2.Department of Computer Sciences and EngineeringThe Ohio State UniversityColumbusUSA
  3. 3.Environmental Molecular Sciences LaboratoryPacific Northwest National LaboratoryRichlandUSA

Personalised recommendations