
1 Introduction

Electricity has become an indispensable part of people's lives. The application of Artificial Intelligence technology in the power grid improves the service level of power enterprises and promotes the development of the grid. With the in-depth application of intelligent technology in the power grid, a large amount of image data is produced. With the help of big data image processing technology, enterprises can solve the problem of processing and storing this massive data, reducing staff workload, improving efficiency and accuracy, and enhancing the enterprise's core competitiveness. Among Artificial Intelligence technologies, machine learning is a research hot spot in many research organizations. Machine learning techniques, especially deep learning methods such as recurrent neural networks and convolutional neural networks, have been applied to fields including computer vision [1], speech recognition [2], natural language processing [3] and drug discovery [4]. Deep learning requires substantial computing power, and Graphics Processing Units (GPUs) can accelerate this computation.

Recently, NVIDIA released the Turing architecture [5] as the successor to the Volta architecture [6], both featuring tensor cores [7] that accelerate general matrix multiplication (GEMM). GEMM is at the heart of deep learning. A profile from [8] shows where the time goes for a typical deep convolutional neural network doing image recognition using Alex Krizhevsky's ImageNet architecture [1]: all of the layers whose names start with fc (fully connected) or conv (convolution) are implemented using GEMM, and almost all of the time (95% of the GPU version, and 89% on CPU) is spent in those layers.

To construct machine learning models conveniently, various high-performance open-source deep learning frameworks have emerged in recent years, such as TensorFlow [9] and Caffe [10]. These frameworks support running computations on a variety of devices, including CPUs and GPUs (Fig. 1).

Fig. 1. Performance improvement in GEMM given by the official white paper and in practical application.

In image processing tasks, CNNs can be applied to image recognition, classification, enhancement, and so on. A CNN uses a special structure for image recognition and can be trained quickly. To explore the reasons for the large gap between the advertised and the practical performance shown in Fig. 1, we implement a typical CNN, LeNet-5 [23], which is commonly used in deep learning.

2 Related Work

AI computing has become a driving force of NVIDIA GPUs: as computing accelerators, they integrate built-in hardware and software for machine learning. Several studies have investigated the tensor core through programming [11,12,13]. Sorna et al. proposed a method that uses the computational capability of the tensor core without degrading the precision of the Fourier Transform result [14]. Carrasco et al. applied a reduction strategy based on matrix multiply-accumulate with the tensor core; their findings showed that the tensor core can accelerate arithmetic reductions [15]. Markidis et al. evaluated the performance of the NVIDIA tensor core on the Tesla V100 using GEMM operations [16], testing its capability with a naive CUDA 9 WMMA implementation, CUTLASS, and cuBLAS. Martineau et al. analyzed and evaluated the tensor core by optimizing a GEMM benchmark [11], reaching conclusions for the V100 GPU similar to those presented in [14]. Unlike previous studies, we make use of a neural network parallel library to further evaluate GPU performance on top of the benchmarks.

In deep learning, the CNN is a class of artificial neural network structure that has gradually emerged in recent years. A representative CNN involves convolutional layers, pooling layers and fully connected layers. The convolutional layer extracts features by convolving the input with a group of kernel filters, which involves plenty of matrix operations. The pooling layer (average, max or stochastic pooling) contributes invariance to data variation and perturbation. The fully connected layer combines the results of the convolutions: it multiplies the input by weights that represent the relationship between input and output, and generates the output.
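As a brief illustration (the standard formulation, not specific to our implementation), a fully connected layer computes

$$\mathrm{y} = W\mathrm{x} + \mathrm{b},$$

which is exactly a matrix multiplication over a batch of inputs, while a convolutional layer computes sliding dot products,

$$(\mathrm{I} * \mathrm{K})_{x,y} = \sum_{m}\sum_{n} \mathrm{I}_{x+m,\,y+n}\,\mathrm{K}_{m,n},$$

which can likewise be lowered to GEMM (e.g., via im2col). This is why GEMM performance dominates both layer types.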

3 Experiment

The experiment environment is as follows: AMD Ryzen CPU, NVIDIA GeForce RTX 2080 Ti (Turing) GPU, Microsoft Windows 10 64-bit, CUDA SDK 10.0, CUTLASS 1.3. Nvprof is used to collect statistics ranging from instruction running time to number of API calls. Performance is reported in TFLOPS, computed as the number of operations divided by the operation time.

General Matrix Multiplication (GEMM), defined in BLAS [18] and cuBLAS [19], is a matrix multiplication and accumulation routine as follows:

$$\mathrm{C} \leftarrow \alpha \mathrm{A} \times \mathrm{B} + \beta \mathrm{C}$$

where \(\mathrm{A} \in \mathbb{R}^{M \times K}\), \(\mathrm{B} \in \mathbb{R}^{K \times N}\), and \(\mathrm{C} \in \mathbb{R}^{M \times N}\) are matrices, and \(\alpha\) and \(\beta\) are scalars. GEMM is at the heart of deep learning and is mainly used in neural networks of specific structures such as CNNs/RNNs. The main purpose of the tensor core in the Volta and Turing architectures is to accelerate the calculation of GEMM. Many optimization efforts have also been incorporated into the widely used GEMM libraries: MAGMA [20], CUTLASS [21] and cuBLAS.
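As a minimal sketch of how such a tensor-core GEMM is invoked (assuming cuBLAS on CUDA 10, with device buffers d_A, d_B, d_C already allocated and filled; error handling omitted and the helper name ours):

```c
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal sketch: C = alpha * A * B + beta * C with tensor cores enabled.
// d_A (M x K) and d_B (K x N) are half precision, d_C (M x N) is float,
// all in the column-major layout cuBLAS expects.
void gemm_tensor_core(cublasHandle_t handle,
                      int M, int N, int K,
                      const __half *d_A, const __half *d_B, float *d_C) {
  const float alpha = 1.0f, beta = 0.0f;

  // Opt in to tensor-core (mixed-precision) math on Volta/Turing.
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

  // FP16 inputs with FP32 accumulation; the *_TENSOR_OP algorithm lets
  // cuBLAS dispatch to tensor cores when the shape allows it.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               M, N, K,
               &alpha,
               d_A, CUDA_R_16F, M,
               d_B, CUDA_R_16F, K,
               &beta,
               d_C, CUDA_R_32F, M,
               CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```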

3.1 Performance of GEMM with Matrix Dimension

Fig. 2. Performance of GEMM at half precision as a function of k.

When calculating GEMM as defined above, the dimensions of the matrices are m, n and k. Each output cell is the product of a 1 × k row and a k × 1 column; with the tensor core on, this operation is split and distributed to the tensor cores. We investigate the effect of the m, n and k dimensions on the speed-up ratio and find that the shared dimension k has the greatest impact on performance. To find the optimal size of k, GEMM is performed at half precision; the results are shown in Fig. 2. The speed-up ratio of the test samples whose k is not divisible by 8 is relatively low, close to 1; most samples whose k is divisible by 8 are effectively accelerated by the tensor core; and as the value of k increases, the speed-up ratio also trends upward, indicating that the tensor core is particularly sensitive to the value of k.
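A practical consequence (our own illustrative sketch, not part of the benchmark code) is that zero-padding k up to the next multiple of 8 before calling the tensor-core GEMM can recover the speed-up without changing the result:

```c
#include <cstring>
#include <cuda_fp16.h>

// Round a dimension up to the next multiple of 8, the granularity the
// tensor-core path prefers for half-precision GEMM on Volta/Turing.
static int pad8(int x) { return (x + 7) / 8 * 8; }

// Hypothetical host-side helper: copy an M x K half-precision matrix
// (row-major) into a zero-filled M x pad8(K) buffer so that k becomes
// divisible by 8. Padding A with zero columns and B with the matching
// zero rows leaves the product A * B unchanged.
void pad_k(const __half *src, __half *dst, int M, int K) {
  int Kp = pad8(K);
  std::memset(dst, 0, sizeof(__half) * M * Kp);
  for (int i = 0; i < M; ++i)
    std::memcpy(dst + i * Kp, src + i * K, sizeof(__half) * K);
}
```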

3.2 Performance Analysis of GEMM with Tensor Core on and off

A series of self-written cases, supplemented by the deep learning benchmark suite DeepBench [22], is used to test performance with the tensor core turned on and off on the new architecture. Table 1 shows the results of running GEMM under Nvprof with the tensor core turned on and off, including the number of calls and running time of each API.
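In cuBLAS, for example, the on/off switch amounts to selecting the math mode (a minimal sketch; the helper name is ours):

```c
#include <cublas_v2.h>

// Toggle the tensor-core path for subsequent cuBLAS calls.
// CUBLAS_TENSOR_OP_MATH permits tensor cores; CUBLAS_DEFAULT_MATH
// forces the ordinary FMA pipeline, giving the "off" baseline.
void set_tensor_core(cublasHandle_t handle, bool on) {
  cublasSetMathMode(handle, on ? CUBLAS_TENSOR_OP_MATH
                               : CUBLAS_DEFAULT_MATH);
}
```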

With the tensor core on, the matrix multiplication that originally required multiple dot-product instructions is replaced by a single wmma instruction; the computation becomes denser and less time is spent on device synchronization, so performance improves significantly.
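For reference, a minimal sketch of how such a wmma instruction is issued through the CUDA C++ WMMA API (one warp computing one 16 × 16 × 16 tile; an illustration under our own assumptions, not the benchmark code):

```c
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile of C = A * B.
// A and B are half precision, the accumulator is float, as on
// Volta/Turing tensor cores. Matrices are row-major with leading
// dimensions lda, ldb, ldc.
__global__ void wmma_16x16x16(const half *A, const half *B, float *C,
                              int lda, int ldb, int ldc) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

  wmma::fill_fragment(c, 0.0f);      // zero the accumulator tile
  wmma::load_matrix_sync(a, A, lda); // load a 16x16 tile of A
  wmma::load_matrix_sync(b, B, ldb); // load a 16x16 tile of B
  wmma::mma_sync(c, a, b, c);        // one warp-wide tensor-core MMA
  wmma::store_matrix_sync(C, c, ldc, wmma::mem_row_major);
}
```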

Table 1. Performance analysis of GEMM with Tensor Core on and off (API Calls)

3.3 Convolution Calculation

In CNN models, the fully connected layer often serves as the last layer, and the body of the network is composed of convolutional layers. It is therefore critical for the performance of the entire network to speed up the convolution calculation.

Fig. 3. Performance of convolution based on different algorithms (small images).

Several methods have been developed to implement the convolution operation efficiently besides computing it directly (direct convolution). One is based on the Fast Fourier Transform (FFT convolution), which reduces computational complexity by computing the convolution in the frequency domain. Another is based on matrix multiplication (e.g., GEMM), one of the most widely used algorithms for convolution. Figure 3 shows the performance of each method when the image size is less than 128 × 128. As the input image becomes smaller, the performance of the two methods above drops sharply, while the performance of direct convolution remains stable. For the direct method using texture memory, row and column convolutions differ little.
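As a sketch of how such an algorithm is selected in cuDNN (a minimal example assuming cuDNN 7, with descriptors for input x, filter w and output y already created; error handling omitted, and the helper name and explicit algorithm choice are ours, not the paper's benchmark code):

```c
#include <cudnn.h>

// Minimal sketch: run a forward convolution with an explicitly chosen
// algorithm. Swapping the algo enum (e.g., _GEMM, _FFT, _DIRECT) is
// one way to reproduce comparisons like those in Fig. 3 and Fig. 4.
void conv_forward(cudnnHandle_t h,
                  cudnnTensorDescriptor_t xDesc, const void *x,
                  cudnnFilterDescriptor_t wDesc, const void *w,
                  cudnnConvolutionDescriptor_t convDesc,
                  cudnnTensorDescriptor_t yDesc, void *y,
                  void *workspace, size_t workspaceBytes) {
  const float alpha = 1.0f, beta = 0.0f;

  // Allow tensor-core math for this convolution (cuDNN 7+).
  cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);

  // Pick one algorithm explicitly rather than letting cuDNN choose.
  cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;

  cudnnConvolutionForward(h, &alpha, xDesc, x, wDesc, w, convDesc,
                          algo, workspace, workspaceBytes,
                          &beta, yDesc, y);
}
```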

3.4 Convolutional Neural Networks (CNN) Based on cuDNN

Fig. 4. Performance of the DIRECT, FFT, and GEMM algorithms in cuDNN.

The CNN is constructed with reference to LeNet-5 [23]. Since GEMM is concentrated in the fully connected and convolutional layers, the pooling layers are omitted, leaving only the input/output, convolutional, and fully connected layers.

The results are shown in Fig. 4. In the forward pass of the convolutional neural network, besides the convolution calculation, forward propagation through the weights is also a main computation. The performance advantage with the tensor core on is still obvious, except when the image size is very small (such as \({10}^{1}\)), which corresponds to the phenomenon observed in the convolution calculation.

3.5 Convolutional Neural Networks (CNN) Based on TensorFlow

Table 2. Optimization results for the CNN based on TensorFlow.

We use the TensorFlow framework to build a CNN based on LeNet-5, with CIFAR-10 as the dataset, which contains 50,000 32 × 32 pixel images divided into ten categories. The latest version of TensorFlow enables the tensor core by default. We change the size of the convolution kernel and the convolution calculation method in TensorFlow; the results are shown in Table 2. When the size of the convolution kernel is changed to 8 × 8, performance improves significantly, confirming the conclusion from the GEMM experiment that the tensor core is sensitive to the value of k.

4 Conclusion

We conducted a series of experiments on GEMM, convolution calculations and CNNs, and analyzed the performance improvement brought by the tensor core. Based on the experimental results, we conclude that the new architecture can indeed bring significant performance improvements to the large amount of GEMM computation in machine learning under certain circumstances, improving the performance of overall machine learning applications. However, in some cases the improvement is limited by the shape of the matrices and by operations other than GEMM, and traditional calculation methods still achieve higher performance.