Efficient cuDNN-Compatible Convolution-Pooling on the GPU

  • Shunsuke Suita
  • Takahiro Nishimura
  • Hiroki Tokura
  • Koji Nakano
  • Yasuaki Ito
  • Akihiko Kasagi
  • Tsuguchika Tabaru
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12044)


The main contribution of this paper is to show efficient GPU implementations of convolution-pooling, in which a pooling operation follows a multiple convolution. Because multiple convolution and pooling are performed alternately in the early stages of many Convolutional Neural Networks (CNNs), accelerating convolution-pooling is very important. Our new GPU implementation uses two techniques: (1) convolution interchange with direct sum, and (2) conversion to matrix multiplication. These techniques reduce both the computational cost and the memory access cost. Furthermore, the interchanged convolution is converted to a matrix multiplication, which cuBLAS can compute very efficiently. Experimental results on a Tesla V100 GPU show that our cuDNN-compatible implementation of convolution-pooling is at least 1.34 times faster than performing the multiple convolution followed by the pooling with cuDNN, the most popular library of primitives for implementing CNNs on the GPU.
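
The abstract does not spell out the two techniques, so the following is only a minimal CPU sketch of one plausible reading, not the authors' GPU implementation. It assumes average pooling (as the keywords suggest), a single channel, a single square filter, stride-1 "valid" convolution, and non-overlapping pooling windows. All function names are hypothetical. Because convolution and average pooling are both linear, the p x p window sum can be taken on the input first (the "direct sum"), after which a stride-p convolution yields the pooled result directly. The last function restates that strided convolution as a product of a patch matrix and a flattened filter, which is the kind of operation a GEMM routine such as those in cuBLAS could compute.

```python
def conv2d_valid(x, w):
    """Stride-1 'valid' convolution (cross-correlation form) of image x with a K x K filter w."""
    H, W, K = len(x), len(x[0]), len(w)
    return [[sum(w[u][v] * x[i + u][j + v] for u in range(K) for v in range(K))
             for j in range(W - K + 1)]
            for i in range(H - K + 1)]

def avg_pool(y, p):
    """Non-overlapping p x p average pooling."""
    return [[sum(y[p * i + a][p * j + b] for a in range(p) for b in range(p)) / (p * p)
             for j in range(len(y[0]) // p)]
            for i in range(len(y) // p)]

def box_sum(x, p):
    """s[m][n] = sum of the p x p window of x whose top-left corner is (m, n)."""
    H, W = len(x), len(x[0])
    return [[sum(x[m + a][n + b] for a in range(p) for b in range(p))
             for n in range(W - p + 1)]
            for m in range(H - p + 1)]

def output_shape(x, w, p):
    """Size of the pooled output for image x, filter w, and pooling size p."""
    K = len(w)
    return (len(x) - K + 1) // p, (len(x[0]) - K + 1) // p

def conv_pool_interchanged(x, w, p):
    """Window sum on the input first, then a stride-p convolution scaled by 1/p^2."""
    s, K = box_sum(x, p), len(w)
    out_h, out_w = output_shape(x, w, p)
    return [[sum(w[u][v] * s[p * i + u][p * j + v]
                 for u in range(K) for v in range(K)) / (p * p)
             for j in range(out_w)]
            for i in range(out_h)]

def conv_pool_as_matmul(x, w, p):
    """Same computation written as (patch matrix) x (flattened filter)."""
    s, K = box_sum(x, p), len(w)
    out_h, out_w = output_shape(x, w, p)
    # Each row is one stride-p K x K patch of the window-summed input, flattened.
    patches = [[s[p * i + u][p * j + v] for u in range(K) for v in range(K)]
               for i in range(out_h) for j in range(out_w)]
    w_flat = [w[u][v] for u in range(K) for v in range(K)]
    flat = [sum(a * b for a, b in zip(row, w_flat)) / (p * p) for row in patches]
    return [flat[i * out_w:(i + 1) * out_w] for i in range(out_h)]

# Small integer example: all three routes give the same pooled output.
x = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
reference = avg_pool(conv2d_valid(x, w), 2)
assert reference == conv_pool_interchanged(x, w, 2)
assert reference == conv_pool_as_matmul(x, w, 2)
```

In this sketch the interchanged route evaluates the filter only at the pooled output positions, so it performs roughly 1/p^2 as many filter multiplications as convolving first and pooling afterwards. With several filters, `w_flat` would become a matrix with one column per filter and the last step a matrix-matrix product. How the paper lays out these matrices for cuBLAS is not stated in the abstract.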


Deep learning · Neural networks · Convolution · Average pooling · GPU



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Shunsuke Suita (1)
  • Takahiro Nishimura (1)
  • Hiroki Tokura (1)
  • Koji Nakano (1), email author
  • Yasuaki Ito (1)
  • Akihiko Kasagi (2)
  • Tsuguchika Tabaru (2)

  1. Department of Information Engineering, Hiroshima University, Higashi-Hiroshima, Japan
  2. Fujitsu Laboratories Ltd., Kawasaki, Japan
