Optimizing Performance of Image Processing Algorithms on GPUs

Zhou, Honghui; Qin, Ruyi; Liu, Zihan; Qian, Ying; Ju, Xiaoming

doi:10.1007/978-981-19-2456-9_95

Part of the book series: Lecture Notes in Electrical Engineering (LNEE)

Included in the following conference series:

INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND APPLICATIONS

Abstract

The application of machine learning algorithms in the power grid improves the service level of power enterprises and promotes the development of the grid. NVIDIA Volta and Turing GPUs equipped with Tensor Cores can accelerate the training of these algorithms. With Tensor Cores enabled, mixed-precision (FP16/FP32) matrix multiplication substantially increases throughput and reduces AI training time. To explore the cause of this speed-up, we take the convolutional neural network (CNN), which is widely used in computer vision, as an example and characterize Tensor Core performance using general matrix multiplication (GEMM) and convolution as benchmarks. We then build a CNN on cuDNN and on TensorFlow, analyze its performance from several aspects, and optimize it by changing the shape of the convolution kernel, using texture memory, and so on. The experimental results demonstrate the effectiveness of these methods.

1 Introduction

Electricity has become an indispensable part of people’s lives. The application of artificial intelligence technology in the power grid improves the service level of power enterprises and promotes the development of the grid. As intelligent technology is applied more deeply in the power grid, a large amount of image data is produced. Big data image processing technology helps enterprises process and store this massive data. It can reduce the enterprise’s workload, improve the efficiency and accuracy of staff, and strengthen the enterprise’s core competitiveness. Among artificial intelligence technologies, machine learning is a research hot spot in many research organizations. Machine learning techniques, especially deep learning models such as recurrent neural networks and convolutional neural networks, have been applied to fields including computer vision [1], speech recognition [2], natural language processing [3] and drug discovery [4]. Deep learning requires substantial computing power, and graphics processing units (GPUs) can accelerate this computation.

Recently, NVIDIA published the Turing architecture [5] as the successor to the Volta architecture [6]. Both include Tensor Cores [7], which accelerate general matrix multiplication (GEMM). GEMM is at the heart of deep learning. A diagram from [8] shows where the time goes for a typical deep convolutional neural network performing image recognition with Alex Krizhevsky’s ImageNet architecture [1]. All of the layers whose names start with fc (fully connected) or conv (convolution) are implemented using GEMM, and almost all of the time (95% for the GPU version and 89% for the CPU version) is spent in those layers.

To make machine learning models convenient to construct, various high-performance open-source deep learning frameworks have emerged in recent years, such as TensorFlow [9] and Caffe [10]. These frameworks support running computations on a variety of devices, including CPUs and GPUs (Fig. 1).

In image processing, CNNs can be applied to tasks such as image recognition, classification and enhancement. A CNN uses a special structure for image recognition and can be trained quickly. To explore the reasons for the large performance difference that Tensor Cores make, we implement a typical CNN, LeNet-5 [23], which is commonly used in deep learning.

2 Related Work

AI computing has become a driving force for the NVIDIA GPU, which, as a computing accelerator, integrates built-in hardware and software for machine learning. Some studies have investigated the Tensor Core through programming [11, 12, 13]. Sorna et al. proposed a method that exploits the computational capability of the Tensor Core without degrading the precision of the Fourier transform result [14]. Carrasco et al. applied a reduction strategy based on matrix multiply-accumulate with the Tensor Core; their findings showed that the Tensor Core can accelerate arithmetic reductions [15]. Markidis et al. evaluated the performance of NVIDIA Tensor Cores on a Tesla V100 using GEMM operations [16]. They tested Tensor Core capability using a naive implementation with CUDA 9 WMMA, CUTLASS and cuBLAS. Martineau et al. analyzed and evaluated the Tensor Core by optimizing a GEMM benchmark [11], reaching conclusions about the V100 GPU similar to those presented in [14]. Unlike previous studies, we use a neural network parallel library to further evaluate GPU performance on top of these benchmarks.

In deep learning, the CNN is a class of artificial neural network that has emerged in recent years. A representative CNN consists of convolutional layers, pooling layers and fully connected layers. The convolutional layer extracts features by convolving the input with a group of kernel filters, which involves many matrix operations. The pooling layer, which may use average, max or stochastic pooling, contributes invariance to data variation and perturbation. The fully connected layer combines the results of the convolutions: it multiplies its input by weights that represent the relationship between input and output, and generates the output.
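
The pooling and fully connected layers described above can be sketched in a few lines of plain Python. This is only an illustration of the layer definitions, not the implementation used in the experiments; the function names `pool2d` and `fully_connected`, the non-overlapping window, and the bias term `b` are assumptions made for the example.

```python
def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows (assumed stride = size)."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    return [
        [
            reduce([x[i + u][j + v] for u in range(size) for v in range(size)])
            for j in range(0, len(x[0]) - size + 1, size)
        ]
        for i in range(0, len(x) - size + 1, size)
    ]


def fully_connected(x, W, b):
    """y[j] = sum_i x[i] * W[i][j] + b[j]: a vector-matrix product plus bias."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_max = pool2d(feature_map, mode="max")
pooled_avg = pool2d(feature_map, mode="avg")
```

When a batch of inputs is stacked into a matrix, the fully connected layer becomes a matrix-matrix product, which is why it maps directly onto GEMM.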

3 Experiment

The experimental environment is as follows: AMD Ryzen CPU, NVIDIA GeForce RTX 2080 Ti (Turing) GPU, Microsoft Windows 10 64-bit, CUDA SDK 10.0 and CUTLASS 1.3. nvprof is used for profiling, covering metrics from instruction running time to the number of calls. Performance is reported in TFLOPS, computed as the number of floating-point operations divided by the execution time.
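
As a sketch of how such a throughput figure can be derived for a GEMM, the helper below assumes the common convention of counting each multiply-add as two floating-point operations, giving 2·M·N·K operations per GEMM (the cost of the α and β scaling is ignored). The function names and the timing value in the usage line are hypothetical and are not measurements from this paper.

```python
def gemm_flops(m, n, k):
    """Operation count for an M x K by K x N product, 2 ops per multiply-add (assumed convention)."""
    return 2 * m * n * k


def tflops(m, n, k, seconds):
    """Throughput in TFLOPS: operation count divided by elapsed time."""
    return gemm_flops(m, n, k) / seconds / 1e12


# Hypothetical example: a 4096 x 4096 x 4096 GEMM that took 10 ms.
example = tflops(4096, 4096, 4096, 0.010)
```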

General matrix multiplication (GEMM), defined in BLAS [18] and cuBLAS [19], is a matrix multiply-accumulate routine of the following form:

$$\mathrm{C}\leftarrow \upalpha \mathrm{A}\times \mathrm{B} + \upbeta \mathrm{C} \tag{1}$$

where $\mathrm{A}\in {\mathrm{R}}^{M\times K}$, $\mathrm{B}\in {\mathrm{R}}^{K\times N}$, and $\mathrm{C}\in {\mathrm{R}}^{M\times N}$ are matrices, and ${\upalpha }$ and $\upbeta$ are scalars. GEMM is at the heart of deep learning and is mainly used in neural networks with specific structures such as CNNs and RNNs. The main purpose of the Tensor Core in the Volta and Turing architectures is to accelerate GEMM. Many optimization efforts have also been incorporated into the widely used GEMM libraries MAGMA [20], CUTLASS [21] and cuBLAS.
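
For reference, the GEMM definition above can be written out as a naive triple loop. This is a minimal sketch for checking shapes and semantics, not an optimized routine; the function name `gemm` and the list-of-lists matrix representation are choices made for this example.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for A (M x K), B (K x N), C (M x N)."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    assert len(B) == k, "inner dimensions of A and B must match"
    assert len(C) == m and len(C[0]) == n, "C must be M x N"
    return [
        [
            alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
result = gemm(2, A, B, 3, C)
```

Each output element is an independent length-K dot product, which is what makes the routine well suited to massively parallel hardware.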

3.1 Performance of GEMM with Matrix Dimension

When computing GEMM, the matrix dimensions are M, N and K as in (1). Each element of C is the product of a 1 × K row of A and a K × 1 column of B; with Tensor Cores enabled, this operation is split up and distributed to the Tensor Cores. We investigate the effect of M, N and K on the speed-up ratio and find that the shared dimension K has the greater impact on performance. To find the optimal K, GEMM is performed in half precision (Fig. 3). The speed-up ratio of test samples whose K is not divisible by 8 is relatively low, close to 1, whereas most samples whose K is divisible by 8 are effectively accelerated by the Tensor Cores. As K increases, the speed-up ratio also trends upward, indicating that the Tensor Cores are more sensitive to the value of K (Fig. 2).
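
One common way to act on the divisibility observation is to zero-pad the shared dimension K up to the next multiple of 8 before calling GEMM. The sketch below shows that the padding leaves the product unchanged; it is an illustration under that assumption, not a step described in the paper, and whether it recovers the speed-up on a given GPU and library version would need to be measured. The helper names are hypothetical.

```python
def round_up(x, multiple=8):
    """Smallest multiple of `multiple` that is >= x."""
    return ((x + multiple - 1) // multiple) * multiple


def pad_k(A, B, multiple=8):
    """Zero-pad the columns of A and the rows of B so that K is a multiple of `multiple`."""
    k = len(B)
    extra = round_up(k, multiple) - k
    A_pad = [row + [0.0] * extra for row in A]
    B_pad = B + [[0.0] * len(B[0]) for _ in range(extra)]
    return A_pad, B_pad


def matmul(A, B):
    """Plain matrix product used to check that padding does not change the result."""
    return [
        [sum(A[i][p] * B[p][j] for p in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


A = [[1.0, 2.0, 3.0]]          # M = 1, K = 3
B = [[1.0], [2.0], [3.0]]      # K = 3, N = 1
A_pad, B_pad = pad_k(A, B)     # K becomes 8
```

The extra terms contribute only zeros to each dot product, so the padded and unpadded products are identical.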

3.2 Performance Analysis of GEMM with Tensor Core on and off

A series of self-written test cases, supplemented by the deep learning benchmark suite DeepBench [22], are used to test performance with Tensor Cores on and off on the new architecture. Table 1 shows the nvprof results of running GEMM with Tensor Cores turned on and off, including the number of calls and the running time of each API.

With Tensor Cores on, a matrix multiplication that originally required multiple dot-product instructions is replaced by a single wmma instruction. The computation becomes denser and less time is spent on device synchronization, so performance improves significantly.
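
To illustrate the idea of replacing many scalar dot-product steps with one tile-level multiply-accumulate, the following CPU-side sketch emulates a tiled product in which each `mma_tile` call stands in for one tile operation with FP16 inputs and FP32 accumulation. It is a conceptual model only: the function names are hypothetical, the rounding is emulated with Python's `struct` module, and the real hardware's internal rounding and scheduling may differ.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_tile(a, b, c):
    """One tile-level multiply-accumulate D = A*B + C (FP16 inputs, FP32 running sum)."""
    d = [row[:] for row in c]
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = c[i][j]
            for p in range(len(b)):
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


def tiled_matmul(A, B, tile=16):
    """Compute A*B with one mma_tile call per tile x tile x tile block; also count the calls."""
    m, k, n = len(A), len(B), len(B[0])
    assert m % tile == 0 and n % tile == 0 and k % tile == 0, "dims must be tile multiples"
    C = [[0.0] * n for _ in range(m)]
    calls = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            acc = [[0.0] * tile for _ in range(tile)]
            for p0 in range(0, k, tile):
                a = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                b = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                acc = mma_tile(a, b, acc)
                calls += 1
            for i in range(tile):
                C[i0 + i][j0:j0 + tile] = acc[i]
    return C, calls


A = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
C, calls = tiled_matmul(A, B, tile=2)
```

In this model the number of tile operations is (M/t)·(N/t)·(K/t) for tile size t, compared with M·N·K scalar multiply-adds, which gives a rough sense of why the instruction stream becomes much denser.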

Table 1. Performance analysis of GEMM with Tensor Core on and off (API Calls)

3.3 Convolution Calculation

In a CNN model, the fully connected layer often serves as the last layer, and the body of the network is composed of convolutional layers. Speeding up the convolution is therefore critical to the performance of the entire network.

Besides computing the convolution directly (direct convolution), several methods have been developed to implement the convolution operation efficiently. One, FFT convolution, is based on the Fast Fourier Transform and reduces computational complexity by computing the convolution in the frequency domain. Another is based on matrix multiplication (e.g., GEMM) and is one of the most widely used algorithms for convolution. Figure 4 shows the performance of each method when the image size is less than 128 × 128. As the input image becomes smaller, the performance of the FFT-based and GEMM-based methods drops sharply, while the performance of the direct method remains stable. For the direct method using texture memory, row and column convolutions perform about the same.
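
The relationship between the direct method and the GEMM-based method can be sketched as follows. The example assumes a single-channel image, stride 1, no padding, and the CNN convention of not flipping the kernel (cross-correlation); the lowering shown is the widely used im2col approach, which may differ from what a particular library does internally. All function names are chosen for this illustration.

```python
def conv2d_direct(img, ker):
    """Direct convolution: slide the kernel and sum the element-wise products."""
    h, w = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    return [
        [
            sum(img[i + u][j + v] * ker[u][v] for u in range(kh) for v in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


def im2col(img, kh, kw):
    """Unroll every kh x kw patch into one row: shape (out_h * out_w) x (kh * kw)."""
    h, w = len(img), len(img[0])
    return [
        [img[i + u][j + v] for u in range(kh) for v in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def conv2d_gemm(img, ker):
    """GEMM-based convolution: (patch matrix) x (flattened kernel), then reshape."""
    kh, kw = len(ker), len(ker[0])
    cols = im2col(img, kh, kw)
    flat = [v for row in ker for v in row]
    out_w = len(img[0]) - kw + 1
    vals = [sum(a * b for a, b in zip(row, flat)) for row in cols]
    return [vals[r:r + out_w] for r in range(0, len(vals), out_w)]


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ker = [[1, 0], [0, -1]]
out_direct = conv2d_direct(img, ker)
out_gemm = conv2d_gemm(img, ker)
```

Both functions compute the same output; the GEMM variant first materializes the patch matrix, an overhead that weighs more heavily when the image is small.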

3.4 Convolutional Neural Networks (CNN) Based on cuDNN

The CNN is built with reference to LeNet-5 [23]. Because GEMM is concentrated in the fully connected and convolutional layers, the pooling layer is omitted, leaving only the input/output layers, the convolutional layer and the fully connected layer.

The results are shown in Fig. 5. In the forward pass of the convolutional neural network, the weighted forward propagation is a major computation in addition to the convolution. The performance advantage with Tensor Cores on remains clear, except when the image size is very small (such as ${10}^{1}$), which matches the behavior observed in the convolution experiment.

3.5 Convolutional Neural Networks (CNN) Based on TensorFlow

Table 2. Optimization result in CNN based on TensorFlow.

We use the TensorFlow framework to build a CNN based on LeNet-5, with CIFAR-10 as the dataset; its training set contains 50,000 images of 32 × 32 pixels divided into ten categories. The latest version of TensorFlow enables Tensor Cores by default. We change the size of the convolution kernel and the convolution calculation method in TensorFlow. The results are shown in Table 2. When the size of the convolution kernel is changed to 8 × 8, performance improves significantly, supporting the conclusion from the GEMM experiment that the Tensor Core is sensitive to the value of K.
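
One possible way to connect the kernel size to the GEMM dimension K is through the im2col lowering, where K equals the number of input channels times the kernel height times the kernel width. The sketch below works this out for a first convolutional layer on 32 × 32 RGB input. It rests on several assumptions that the text does not confirm: that the convolution is lowered to GEMM in this way, that stride is 1 with no padding, and that the layer has 6 output channels as in the classic LeNet-5.

```python
def conv_gemm_shape(in_h, in_w, in_c, out_c, kh, kw):
    """(M, N, K) of the GEMM for a stride-1, unpadded convolution under im2col lowering."""
    m = (in_h - kh + 1) * (in_w - kw + 1)   # one row per output position
    n = out_c                               # one column per filter
    k = in_c * kh * kw                      # length of each unrolled patch
    return m, n, k


# Assumed layer: 32 x 32 x 3 input, 6 filters (hypothetical, following classic LeNet-5).
shape_5x5 = conv_gemm_shape(32, 32, 3, 6, 5, 5)   # K = 3 * 5 * 5 = 75
shape_8x8 = conv_gemm_shape(32, 32, 3, 6, 8, 8)   # K = 3 * 8 * 8 = 192

k5_aligned = shape_5x5[2] % 8 == 0   # False: 75 is not a multiple of 8
k8_aligned = shape_8x8[2] % 8 == 0   # True: 192 is a multiple of 8
```

Under these assumptions, an 8 × 8 kernel yields a K that is a multiple of 8 while a 5 × 5 kernel does not, which is consistent with the divisibility effect seen in the GEMM experiment.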

4 Conclusion

We conduct a series of experiments on GEMM, convolution and CNNs and analyze the performance improvement brought by Tensor Cores. The results show that, under certain circumstances, the new architecture can indeed bring significant performance improvements to the large number of GEMM operations in machine learning and thereby improve the performance of machine learning applications overall. However, in some cases the improvement is limited by the shape of the matrices and by operations other than GEMM, and traditional calculation methods still perform better.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
2. Abdel-Hamid, O., Mohamed, A., Jiang, H., et al.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014)
3. Conneau, A., Schwenk, H., Barrault, L., et al.: Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781, February 2016
4. Segler, M.H.S., Kogej, T., Tyrchan, C., et al.: Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4(1), 120–131 (2017)
5. NVIDIA: Nvidia turing architecture whitepaper. Technical report, NVIDIA Corp., August 2018. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
6. NVIDIA Volta GPU Architecture (2017). https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
7. NVIDIA: NVIDIA TENSOR CORES, The Next Generation of Deep Learning (2019)
8. Jia, Y.: Learning semantic image representations at a large scale. UC Berkeley (2014)
9. Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pp. 265–283 (2016)
10. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
11. Martineau, M., Atkinson, P., McIntosh-Smith, S.: Benchmarking the NVIDIA V100 GPU and tensor cores. In: Mencagli, G., et al. (eds.) Euro-Par 2018. LNCS, vol. 11339, pp. 444–455. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10549-5_35
12. Jia, Z., Maggioni, M., Staiger, B., Scarpazza, D.P.: Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826
13. Jia, Z., Maggioni, M., Staiger, B., Scarpazza, D.P.: Dissecting the NVidia Turing T4 GPU via microbenchmarking. arXiv preprint arXiv:1903.07486 (2019)
14. Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), Bengaluru, India, pp. 3–7 (2018)
15. Carrasco, R., Vega, R., Navarro, C.A.: Analyzing GPU tensor core potential for fast reductions. In: 2018 37th International Conference of the Chilean Computer Science Society (SCCC), Santiago, Chile, pp. 1–6 (2018)
16. Markidis, S., Chien, S.W.D., Laure, E., Peng, I.B., Vetter, J.S.: NVIDIA tensor core programmability, performance & precision. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, pp. 522–531 (2018)
17. Chetlur, S., Woolley, C., Vandermersch, P., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)
18. Lawson, C.L., Hanson, R.J., Kincaid, D.R., et al.: Basic linear algebra subprograms for Fortran usage (1977)
19. Nvidia, C.: Cublas library. NVIDIA Corporation, Santa Clara, California, 15(27): 31 (2008)
20. Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010)
21. NVIDIA. CUTLASS: Fast Linear Algebra in CUDA C++ (2018). https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/
22. Narang, S., Diamos, G.: Baidu DeepBench (2017)
23. LeCun, Y.: LeNet-5, convolutional neural networks (2015). http://yann.lecun.com/exdb/lenet. 20: 5

Acknowledgement

This work is supported by the State Grid Zhejiang Electric Power Co., Ltd. science and technology project (5211nb200139), "Key technology and terminal development of lightweight image elastic sensing and recognition based on AI chip".

Author information

Authors and Affiliations

Ningbo Power Supply Company of State Grid, Zhejiang Electric Power Co., Ltd., Ningbo, China
Honghui Zhou & Ruyi Qin
Zhejiang Jierui Electric Power Technology Co., Ltd., Ningbo, China
Zihan Liu, Ying Qian & Xiaoming Ju

Corresponding author

Correspondence to Xiaoming Ju.

Editor information

Editors and Affiliations

College of Communication Engineering, Jilin University, Jilin, Jilin, China
Zhihong Qian
Department of AI & ML, Vardhaman College of Engineering, Hyderabad, Telangana, India
M.A. Jabbar
College of Technology, Indiana State University, Terre Haute, IN, USA
Xiaolong Li

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Zhou, H., Qin, R., Liu, Z., Qian, Y., Ju, X. (2022). Optimizing Performance of Image Processing Algorithms on GPUs. In: Qian, Z., Jabbar, M., Li, X. (eds) Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications. WCNA 2021. Lecture Notes in Electrical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-19-2456-9_95

DOI: https://doi.org/10.1007/978-981-19-2456-9_95
Published: 13 July 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2455-2
Online ISBN: 978-981-19-2456-9
eBook Packages: Engineering, Engineering (R0)
