Abstract
We present a GPU algorithm for performing convolution with decomposed tensor products. We experimentally find execution times up to 4.85x faster than Nvidia's cuDNN for some tensors. This is achieved by extending recent advances in the compression of CNNs that apply tensor decomposition methods to weight tensors. Progress there has been limited by the lack of fast operations for computing the decomposed variants of critical functions such as 2D convolution. We interpret this and related operations as a network of compound convolutions and tensor contractions on the decomposed factors (i.e., generalized tensor operations). The prior approach evaluates such networks pairwise, composing functions from existing libraries such as cuDNN, until the output is recovered. The computational cost of such an evaluation depends on the order in which the index sums are evaluated and varies between networks. The sequence of pairwise generalized tensor operations that minimizes the number of computations often produces large intermediate products, which become a performance bottleneck when communicated through the scarce global memory of modern GPUs. Our solution is a GPU-parallel algorithm that performs 2D convolution with filter tensors obtained through CP decomposition, with minimal memory overhead. We benchmark the run-time performance of our algorithm for filter sizes common in neural networks at multiple decomposition ranks. Compared with cuDNN's traditional convolutions, our implementation is superior at lower ranks. We also propose a method for determining optimal sequences of pairwise tensor operations, achieving the minimal number of operations under memory constraints.
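To make the decomposition concrete: in CP form, a 4D convolution kernel W[t, s, y, x] is approximated by a sum of R rank-one terms Kt[t, r] · Ks[s, r] · Ky[y, r] · Kx[x, r], so the full convolution splits into a channel contraction, two one-dimensional spatial convolutions per rank component, and a final channel contraction. The sketch below illustrates this structure in NumPy on the CPU; it is not the paper's CUDA kernel, and the factor names, shapes, and "valid" padding are assumptions made for illustration.

```python
# Minimal NumPy sketch (not the authors' CUDA implementation) of 2D
# convolution with a CP-decomposed filter. Assumes a rank-R factorization
# of the kernel W[t, s, y, x] into Kt[T, R], Ks[S, R], Ky[Y, R], Kx[X, R];
# all names and shapes here are illustrative.
import numpy as np

def cp_conv2d(img, Kt, Ks, Ky, Kx):
    """img: [S, H, W] input; returns [T, H-Y+1, W-X+1] ('valid' conv)."""
    S, H, W = img.shape
    (T, R), (Y, _), (X, _) = Kt.shape, Ky.shape, Kx.shape
    # 1) Contract input channels into the rank dimension: z[r, h, w].
    z = np.einsum('sr,shw->rhw', Ks, img)
    # 2) Separable 1D spatial convolutions, one rank component at a time.
    zy = np.empty((R, H - Y + 1, W))
    for r in range(R):
        for h in range(H - Y + 1):
            zy[r, h] = Ky[:, r] @ z[r, h:h + Y]          # convolve along y
    zx = np.empty((R, H - Y + 1, W - X + 1))
    for r in range(R):
        for w in range(W - X + 1):
            zx[r, :, w] = zy[r, :, w:w + X] @ Kx[:, r]   # convolve along x
    # 3) Contract the rank dimension into output channels: out[t, h, w].
    return np.einsum('tr,rhw->thw', Kt, zx)
```

The abstract's point about evaluation order can also be seen directly: scheduling the same index sums in different pairwise orders changes both the FLOP count and the size of the intermediates. NumPy's `einsum_path` reports the pairwise order it would choose together with the predicted intermediates; the operand shapes below are made up for illustration.

```python
# Inspect how contraction order affects cost for a small tensor network.
A = np.random.rand(64, 8)
B = np.random.rand(8, 512)
C = np.random.rand(512, 8)
path, report = np.einsum_path('ir,rj,jk->ik', A, B, C, optimize='optimal')
print(report)  # per-step intermediates and the predicted speedup
```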
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Reustle, A., Rabbani, T., Huang, F. (2021). Fast GPU Convolution for CP-Decomposed Tensorial Neural Networks. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. https://doi.org/10.1007/978-3-030-55180-3_35
DOI: https://doi.org/10.1007/978-3-030-55180-3_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55179-7
Online ISBN: 978-3-030-55180-3
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)