Abstract
We present a GPU algorithm for performing convolution with decomposed tensor products. We experimentally find execution times up to 4.85x faster than Nvidia's cuDNN for some tensors. This is achieved by extending recent advances in the compression of CNNs that apply tensor decomposition methods to weight tensors. Progress there has been limited by the lack of fast operations for computing the decomposed variants of critical functions such as 2D convolution. We interpret this and related operations as a network of compound convolutions and tensor contractions on the decomposed factors (i.e., generalized tensor operations). The prior approach evaluates such networks pairwise, composing functions from existing libraries such as cuDNN, until the output is recovered. The computational cost of such an evaluation depends on the order in which the index sums are evaluated and varies between networks. The sequence of pairwise generalized tensor operations that minimizes the number of computations often produces large intermediate products, which become a performance bottleneck when communicated through the scarce global memory of modern GPUs. Our solution is a GPU-parallel algorithm that performs 2D convolution with filter tensors obtained through CP decomposition, with minimal memory overhead. We benchmark the run-time performance of our algorithm for filter sizes common in neural networks at multiple decomposition ranks. Compared with cuDNN's traditional convolutions, our implementation is superior at lower ranks. We also propose a method for determining optimal sequences of pairwise tensor operations, achieving the minimal number of operations under memory constraints.
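To make the decomposition concrete: in CP form, a 4D convolution kernel W[t, s, y, x] is approximated by a sum of R rank-one terms Kt[t, r] · Ks[s, r] · Ky[y, r] · Kx[x, r], so the full convolution splits into a channel contraction, two one-dimensional spatial convolutions per rank component, and a final channel contraction. The sketch below illustrates this structure in NumPy on the CPU; it is not the paper's CUDA kernel, and the factor names, shapes, and "valid" padding are assumptions made for illustration.

```python
# Minimal NumPy sketch (not the authors' CUDA implementation) of 2D
# convolution with a CP-decomposed filter. Assumes a rank-R factorization
# of the kernel W[t, s, y, x] into Kt[T, R], Ks[S, R], Ky[Y, R], Kx[X, R];
# all names and shapes here are illustrative.
import numpy as np

def cp_conv2d(img, Kt, Ks, Ky, Kx):
    """img: [S, H, W] input; returns [T, H-Y+1, W-X+1] ('valid' conv)."""
    S, H, W = img.shape
    (T, R), (Y, _), (X, _) = Kt.shape, Ky.shape, Kx.shape
    # 1) Contract input channels into the rank dimension: z[r, h, w].
    z = np.einsum('sr,shw->rhw', Ks, img)
    # 2) Separable 1D spatial convolutions, one rank component at a time.
    zy = np.empty((R, H - Y + 1, W))
    for r in range(R):
        for h in range(H - Y + 1):
            zy[r, h] = Ky[:, r] @ z[r, h:h + Y]          # convolve along y
    zx = np.empty((R, H - Y + 1, W - X + 1))
    for r in range(R):
        for w in range(W - X + 1):
            zx[r, :, w] = zy[r, :, w:w + X] @ Kx[:, r]   # convolve along x
    # 3) Contract the rank dimension into output channels: out[t, h, w].
    return np.einsum('tr,rhw->thw', Kt, zx)
```

The abstract's point about evaluation order can also be seen directly: scheduling the same index sums in different pairwise orders changes both the FLOP count and the size of the intermediates. NumPy's `einsum_path` reports the pairwise order it would choose together with the predicted intermediates; the operand shapes below are made up for illustration.

```python
# Inspect how contraction order affects cost for a small tensor network.
A = np.random.rand(64, 8)
B = np.random.rand(8, 512)
C = np.random.rand(512, 8)
path, report = np.einsum_path('ir,rj,jk->ik', A, B, C, optimize='optimal')
print(report)  # per-step intermediates and the predicted speedup
```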
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Reustle, A., Rabbani, T., Huang, F. (2021). Fast GPU Convolution for CP-Decomposed Tensorial Neural Networks. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. https://doi.org/10.1007/978-3-030-55180-3_35
DOI: https://doi.org/10.1007/978-3-030-55180-3_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55179-7
Online ISBN: 978-3-030-55180-3
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)