
Applying CNN on a scientific application accelerator based on dataflow architecture

  • Xiaochun Ye
  • Taoran Xiang
  • Xu Tan
  • Yujing Feng
  • Haibin Wu
  • Meng Wu
  • Dongrui Fan
Regular Paper

Abstract

Convolutional neural networks (CNNs) are widely used in applications such as face recognition, intelligent monitoring, image recognition and text recognition. Because of their high computational complexity, many efficient hardware accelerators have been proposed to exploit a high degree of parallel processing for CNN. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption. Other accelerators, such as GPUs, are general enough, but they lead to higher power consumption. Fine-grained dataflow architectures, which break away from the conventional Von Neumann architecture, show natural advantages in processing scientific applications. Meanwhile, the CNN algorithm shares many vital characteristics with scientific applications, including high parallelism, simple loop structure and regular memory access patterns. In this paper, we propose a scheme for implementing and optimizing CNN on a fine-grained dataflow architecture designed for scientific applications, namely the Scientific Processing Unit (SPU). The experimental results reveal that with our scheme, the performance of AlexNet and VGG-19 running on SPU is on average \(2.29\,\times\) higher than on an NVIDIA Titan Xp, and the energy consumption of our hardware is on average \(5.76\,\times\) lower than that of the Titan Xp.

Keywords

Fine-grained dataflow · Convolutional neural network · Parallel computing · Accelerator

1 Introduction

Deep convolutional neural networks (CNNs) have developed rapidly in recent years and have achieved unprecedented accuracy for tasks including face recognition, intelligent monitoring, image recognition and text recognition. Notably, in 2012, Alex Krizhevsky’s team at the University of Toronto won the ImageNet Challenge by virtue of their neural network classification model AlexNet (Krizhevsky et al. 2012). They reduced the classification error rate from 26 to 15%, shocking the world at the time. Since then, a large number of companies and researchers have invested in the study of deep learning.

CNN’s ultra-high accuracy comes with massive computational complexity (Sze et al. 2017) as well as ever-growing requirements on hardware performance and energy. Thus, besides the common implementations using general-purpose GPUs (Coates et al. 2013; Chetlur et al. 2014), many specific hardware schemes have appeared for CNN acceleration, following either the ASIC route (Albericio et al. 2016; Chen et al. 2014b) or the FPGA route (Zhang et al. 2015; Lu and Liang 2018; Liang et al. 2019). For example, DianNao (Chen et al. 2014a), an accelerator made up of neural functional units (NFUs), processes feature maps by organizing data to stream through the NFUs. Chen et al. (2016) proposed a novel dataflow model that can minimize the energy consumption of data movement. These specific accelerators are designed for specific applications, so complex control and decode logic can be eliminated, and thus they usually achieve better performance and energy efficiency than general schemes (GPU and CPU).

However, those specific accelerators fail to meet today’s computation demands because their range of applications is limited. They can only be applied to CNN computation, and some of them cannot even support all layers of a CNN. Facing different applications, users would need to purchase a variety of dedicated accelerators and learn how to fully utilize each of them. Besides, inflexible hardware makes it difficult to implement algorithmic innovations. Moreover, there is currently a trend that more and more use cases require both HPC and AI support, such as autonomous driving and life science. If we keep these two types of computation apart, for example by using divergent platforms or accelerators for HPC and AI respectively, extra effort and expense will be required to design specific accelerators and unite the whole heterogeneous system. Therefore, it is necessary to converge these two scenarios and achieve efficient execution of both HPC and AI applications on one unified accelerator.

The fine-grained dataflow architecture was proposed by Dennis (1974). It is completely different from the conventional control-flow architecture. It achieves high performance with low power, and remains broadly applicable and adaptable. The fine-grained dataflow architecture is characterized by the following features. Firstly, instructions in a dataflow architecture can be executed as soon as their operands are ready; the execution of instructions is not limited by a program counter or an instruction window. Thus, the dataflow architecture can fully exploit instruction-level parallelism and data-level parallelism. Secondly, data are directly transferred among processing nodes, which avoids frequent data accesses to memory. Thirdly, processing nodes do not require redundant control logic such as out-of-order execution and branch prediction, which simplifies the design of the processing nodes.

The fine-grained dataflow architecture is very suitable for applications characterized by high computational intensity and straightforward memory access and reuse patterns. Prior work demonstrates that the dataflow architecture is well suited for acceleration in the field of scientific computing (Xiao-Wei et al. 2017; Verdoscia et al. 2015; Oriato et al. 2012; Pratas et al. 2013; Fu et al. 2014). Its combination of high performance and generality addresses users’ requirements.

To illustrate the versatility of the dataflow architecture, we analyzed two typical scientific applications, namely Stencil (Nguyen et al. 2010) and FFT (Gu et al. 2010), in our prior work (Tan et al. 2018). These applications have sufficient parallelism to be exploited, but current accelerators based on control-flow architecture cannot take full advantage of these properties. Table 1 shows that these applications achieve higher utilization of computing resources and higher energy efficiency when executed on the fine-grained dataflow accelerator. Based on this analysis, we found that the layers in CNN share similar characteristics of high parallelism, massive data reuse, and a huge amount of computation. Thus, we have good reason to believe that higher utilization of computing resources and energy efficiency can be achieved by using the fine-grained dataflow architecture.
Table 1

Computing resource utilization and performance/Watt comparison between GPU and SPU

| Application | Comparative item | GPU | SPU |
| --- | --- | --- | --- |
| Stencil | Efficiency | 8.91% | 48.43% |
| Stencil | Performance/Watt | 0.39 | 5.28 |
| FFT | Efficiency | 6.82% | 47.81% |
| FFT | Performance/Watt | 1.56 | 5.21 |

In this paper, we apply CNN to a dataflow accelerator designed for scientific applications, namely the Scientific Processing Unit (SPU). Our contributions are summarized as follows:
  1. We analyzed the similarity between CNN and scientific applications and discussed the practicability of applying CNN to SPU.

  2. We proposed optimization schemes for data preprocessing, which is the first step of the decomposed convolution. A data-transforming kernel with fine granularity is designed so that contexts can fully stream on the accelerator.

  3. We proposed optimization schemes for matrix multiplication, which is the key operation of convolution. In these schemes, efficient data tiling and reuse are taken into consideration to support high-throughput dataflow.

  4. We presented the experimental results of SPU in comparison with CPU, GPU and a specific dataflow accelerator. With similar peak performance, execution on SPU is sped up by \(2.29\,\times\) over GPU and \(3.18\,\times\) over DianNao on average. SPU also reduces energy consumption by \(5.76\,\times\) on average compared with GPU. In addition, we analyzed the bandwidth requirements of SPU.

The rest of this paper is organized as follows. Section 2 introduces the background of CNN. In Sect. 3, the architecture of SPU is introduced. Section 4 explains the detailed acceleration scheme of CNN on the fine-grained dataflow architecture. In Sect. 5, we introduce the experimental methodology. The experimental results are presented in Sect. 6. Section 7 discusses related work. Finally, in Sect. 8, we draw our conclusion.

2 CNN Background

CNNs come in a wide variety of shapes and sizes across different fields and applications. Well-known examples of CNN models are AlexNet (Krizhevsky et al. 2012) and VGGNet (Simonyan and Zisserman 2014), among others. The unique shape and size of these models determine the accuracy and efficiency they can achieve. CNNs are generally composed of five types of layers: convolutional layers (CONV), pooling layers (POOL), fully-connected layers (FC), activation layers (ACT) and local response normalization layers (LRN). The inputs of a CNN are usually the pixels of images. After these inputs are inferred by the CNN, the corresponding analysis results are obtained. The accuracy of a CNN is also related to the weights of its layers, which can be adjusted by training.

2.1 Convolutional Layers

CONV layers are the major components of a CNN; they utilize the convolution operation to extract features from the input data. The inputs of the CONV layers are composed of a set of 2-D input feature maps (ifmaps) that are extracted from multiple input images. The group of 2-D ifmaps of one input image forms its channels, and the group of images that enters the CNN each time is called a batch. The input data of the CONV layers are represented by a four-dimensional tensor in the form of \(\varvec{I_0}\in R^{NCHW}\), ranging over N images in a batch, C input feature maps per image, H rows per image, and W columns per image. The weights of the CONV layers are called filters and are represented by a four-dimensional tensor in the form of \(\varvec{F_0}\in R^{KCRS}\), ranging over K output feature maps (ofmaps), C input channels, R rows per 2-D filter, and S columns per 2-D filter. As illustrated in Fig. 1, a number of 2-D filters form a filter channel. The filters in a channel respectively convolve with the 2-D ifmaps of an image, and the results are accumulated to obtain a 2-D ofmap. The input channels convolve with K filter channels to obtain the four-dimensional tensor of the ofmaps, represented in the form of \({\varvec{O}} \in R^{NKPQ}\) in this paper (P represents the height of the ofmaps, and Q represents the width of the ofmaps). Parameter u defines the vertical stride to access the ifmaps, and parameter v defines the horizontal stride.
Fig. 1

Convolution process

The final convolution formula is shown as follows:
$$\begin{aligned} \begin{aligned} {\varvec{O}}[n][k][p][q] =&\sum _{c=0}^{C-1}\sum _{r=0}^{R-1}\sum _{s=0}^{S-1} \varvec{I_0} [n][c][p \times u+r][q \times v+s] \\&\times \varvec{F_0}[k][c][r][s]. \end{aligned} \end{aligned}$$
(1)
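
For reference, the following NumPy sketch evaluates Eq. (1) directly with nested loops over the output tensor; the function name conv_forward and the tensor shapes used in the example are our own illustrative choices, not part of any SPU toolchain.

```python
import numpy as np

def conv_forward(I0, F0, u=1, v=1):
    """Direct evaluation of Eq. (1): O[n,k,p,q] = sum_{c,r,s} I0[n,c,p*u+r,q*v+s] * F0[k,c,r,s]."""
    N, C, H, W = I0.shape
    K, _, R, S = F0.shape
    P = (H - R) // u + 1                      # output height
    Q = (W - S) // v + 1                      # output width
    O = np.zeros((N, K, P, Q), dtype=I0.dtype)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    # accumulate over the C input channels and the R x S filter window
                    window = I0[n, :, p * u:p * u + R, q * v:q * v + S]
                    O[n, k, p, q] = np.sum(window * F0[k])
    return O

# Example: a batch of 2 images, 3 channels of 8 x 8 pixels, 4 filters of size 3 x 3
I0 = np.random.rand(2, 3, 8, 8)
F0 = np.random.rand(4, 3, 3, 3)
print(conv_forward(I0, F0).shape)             # (2, 4, 6, 6)
```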

2.2 Fully-connection layers

Compared with other layers in a CNN, the FC layer requires more memory space to store the large quantity of filter weights. An FC layer can be regarded as a special CONV layer in which the size of the filters equals the size of the ifmaps. Therefore, the same formula is used in FC layers as in CONV layers, where \(H = R\), \(W = S\), \(P = Q = 1\).

2.3 Pooling layers

In pooling layers, the maximum or the average value among neighboring pixels is computed. The pooling layer provides strong robustness as well as reducing the dimensions of the ifmaps. In addition, it allows coarse-grained features to be extracted by the following layers and prevents over-fitting. Pooling layers generally do not utilize any filter weights.
$$\begin{aligned} {\varvec{O}}[n][k][p][q] = \max _{0\le r<R,\,0\le s<S} \varvec{I_0} [n][c][p \times u+r][q \times v+s] \end{aligned}$$
(2)
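
A minimal NumPy sketch of max pooling as defined by Eq. (2) is shown below; the window size and strides are illustrative defaults.

```python
import numpy as np

def max_pool(I0, R=2, S=2, u=2, v=2):
    """Max pooling per Eq. (2): each output pixel is the maximum over an R x S window."""
    N, C, H, W = I0.shape
    P = (H - R) // u + 1
    Q = (W - S) // v + 1
    O = np.zeros((N, C, P, Q), dtype=I0.dtype)
    for p in range(P):
        for q in range(Q):
            # no filter weights are involved, only a reduction over the window
            O[:, :, p, q] = I0[:, :, p * u:p * u + R, q * v:q * v + S].max(axis=(2, 3))
    return O
```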

2.4 Activation layers

Two reasons lead to the result that data cannot be well classified by CONV, FC and POOL layers alone. Firstly, the data processed by a CNN are often not linearly separable. Secondly, CONV, FC and POOL layers are linear functions. Therefore, nonlinear functions called activation functions are introduced to solve the problem of data classification. The result of a CONV, FC or POOL layer is fed to one or more activation layers. Each activation layer computes the result of its activation function for each input pixel.
$$\begin{aligned} {\varvec{O}}[n][k][p][q] = t({\varvec{I}}[n][c][p \times u+r][q \times v+s]) \end{aligned}$$
(3)
where t() is a transfer function, e.g., \(\frac{1}{1+e^{-x}}\) for Sigmoid, tanh(x) for TanH, max(0, x) for ReLU, etc.
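
The three transfer functions mentioned above can be written element-wise as in the short sketch below (illustrative NumPy code, not SPU kernels).

```python
import numpy as np

# Element-wise transfer functions t() from Eq. (3)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)
```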

2.5 Local response normalization layers

The LRN layer implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels (Krizhevsky et al. 2012). The LRN layer enhances the generalization ability of the model.
$$\begin{aligned} {\varvec{O}}[n][k][p][q] = \frac{{\varvec{I}}[n][c][p][q]}{(1+(\frac{\alpha }{T}) \sum _{i=\max (0,c-T/2)}^{\min (C-1,c+T/2)} {\varvec{I}}[n][i][p][q]^2 )^\beta } \end{aligned}$$
(4)
where T, \(\alpha\) and \(\beta\) are constants.
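
A direct NumPy sketch of Eq. (4) is given below; the window size T and the constants alpha and beta are placeholders chosen only for illustration.

```python
import numpy as np

def lrn(I, T=5, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization per Eq. (4)."""
    N, C, H, W = I.shape
    O = np.empty_like(I)
    for c in range(C):
        lo, hi = max(0, c - T // 2), min(C - 1, c + T // 2)
        # sum of squares over the T neighbouring channels around channel c
        sq_sum = np.sum(I[:, lo:hi + 1] ** 2, axis=1)
        O[:, c] = I[:, c] / (1.0 + (alpha / T) * sq_sum) ** beta
    return O
```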

3 SPU architecture

3.1 SPU structure

SPU is a fine-grained dataflow accelerator for scientific applications that are characterized by high computational intensity and straightforward memory access and reuse patterns. The overall structure of the SPU is shown in Fig. 2. The SPU processor consists of processing elements (PEs), data buffers (Dbufs), instruction buffers (Cbufs), a micro-controller unit (MicC), and a DMA controller. The structure of SPU is similar to traditional many-core architectures (Fan et al. 2012, 2018), but every PE inside the array is a fine-grained dataflow processing element rather than a control-flow core. The Dbufs and Cbufs are implemented using Scratch-Pad Memory (SPM). The Dbufs are distributed around the PE array. The Cbufs are located on the west side of the PE array. The MicC controls the execution of dataflow graphs on the PEs. PEs, Dbufs, Cbufs, and the MicC communicate through a two-dimensional mesh network.

PE: Internal structure of the PE is shown in the right side of Fig. 2. Each PE contains pipelined execution units, a buffer for storing instructions, a buffer for storing operands of each instruction, an instruction fire controller, and some local registers.

Dbuf: Stores data, and is shared by all PEs.

Cbuf: Stores the PEs’ local register values and the instructions waiting to be mapped onto the PEs.

MicC: Controls the execution process of the accelerator. At the beginning of the execution, the CPU sends a request to the MicC to start the accelerator. The MicC then maps the instructions and local register values stored in Cbuf through the NoC to the PEs. After the PEs are initialized, the MicC sends the contexts in a pipelined manner to the dataflow graph and collects the context-termination messages from the PEs. When the PE array finishes executing all the contexts, the MicC sends an end message to the CPU.

NoC: The NoC uses a static XY (X-direction priority) deterministic routing strategy. It is implemented with both a forward network and a backward network. The forward network is used for transferring operands and accessing memory, while the backward network is used for transferring control signals for instructions.

DMA: The DMA is the key component that moves data between off-chip memory and the two on-chip buffers (Dbuf and Cbuf). Before a kernel execution on the PE array gets started, the CPU has to configure the DMA, which then starts to fetch data into the Dbuf and instructions/local register values into the Cbuf. After the execution of the kernel is done, the DMA transfers the result data from the Dbuf back to off-chip memory.
Fig. 2

SPU architecture diagram

3.2 Parallel model

In traditional dataflow architectures (Govindan et al. 2007; Swanson et al. 2007), a context corresponds to a program block, and the program blocks are switched sequentially, one after another. Only after a block has been executed can a new program block be allocated to the PEs for execution. The functional units are under-utilized in this case. Different from traditional dataflow architectures, SPU not only adopts the dataflow execution model but also improves the utilization of the functional units by implementing the loop-in-pipeline (Xiao-Wei et al. 2017) optimization method.

In the SPU, the dataflow model is applied to the PE array. Under this model, data are generated by instructions and directly transferred from one PE to another without being written back to a global register or memory location. A complete dataflow program/graph is called a kernel, and one complete execution of a kernel is called a context. At the top level, multiple contexts run on the accelerator using the loop-in-pipeline model. Each dataflow instruction can be considered a pipeline stage: as soon as the previous context has been executed, the dataflow instruction can calculate the data of the next context.

Dataflow model The dataflow model is a calculation model completely different from traditional control flow. In traditional control-flow processors, instructions are executed under the control of the Program Counter (PC). In the dataflow model, however, an instruction can be fired as soon as all of its source operands are ready.

In dataflow computation, a program is represented by a dataflow diagram. The execution result of each instruction is passed directly to other instructions, and the dependencies established between instructions form the dataflow graph. Each node in a dataflow diagram represents an instruction, and each edge represents the dependency between two instructions. As shown in Fig. 3a, we feed the input data into the dataflow diagram and the instructions are fired. For example, inst3 can be fired once the results of both inst0 and inst1 arrive.
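
To make the firing rule concrete, the toy interpreter below (an illustrative sketch, not the SPU microarchitecture) fires an instruction as soon as all of its source operands have arrived and forwards the result along the outgoing edges, reproducing the inst0/inst1/inst3 example of Fig. 3a.

```python
# Toy dataflow-graph interpreter: an instruction fires once all its operands arrive,
# and its result is forwarded directly to the consumers listed on its outgoing edges.
class Inst:
    def __init__(self, op, n_operands, dests):
        self.op = op                  # callable, e.g. lambda a, b: a + b
        self.n_operands = n_operands  # number of source operands to wait for
        self.dests = dests            # list of (inst_id, operand_slot) destinations
        self.operands = {}

def run(graph, external_inputs):
    # external_inputs: list of (inst_id, slot, value) injected into the graph
    ready = []
    def deliver(inst_id, slot, value):
        inst = graph[inst_id]
        inst.operands[slot] = value
        if len(inst.operands) == inst.n_operands:   # firing condition
            ready.append(inst_id)
    for inst_id, slot, value in external_inputs:
        deliver(inst_id, slot, value)
    results = {}
    while ready:
        inst_id = ready.pop(0)
        inst = graph[inst_id]
        value = inst.op(*[inst.operands[s] for s in sorted(inst.operands)])
        results[inst_id] = value
        for dst, slot in inst.dests:                # forward result along graph edges
            deliver(dst, slot, value)
    return results

# inst0 and inst1 produce values that both feed inst3 (as in Fig. 3a)
graph = {
    0: Inst(lambda a: a * 2, 1, [(3, 0)]),
    1: Inst(lambda a: a + 1, 1, [(3, 1)]),
    3: Inst(lambda a, b: a + b, 2, []),
}
print(run(graph, [(0, 0, 5), (1, 0, 7)]))   # {0: 10, 1: 8, 3: 18}
```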

In a PE, each entry of the operand buffer stores the operands of one instruction. The fire controller monitors whether the operands are all in place. When an instruction satisfies the firing condition, the fire controller sends the operands and their position (indicating which instruction they belong to) to the decoder. After decoding, the opcode and operands are sent to the pipelined execution units. Finally, the result of the instruction is sent through the NoC to the instructions that depend on it.

To execute a program on SPU, the instructions in the diagram need to be assigned to the PEs by the MicC, as shown in Fig. 3b. The number of instructions in a kernel is limited by the size of the instruction buffer of the PEs. Data within a context are transferred directly among PEs, enabling data reuse. Data belonging to different contexts cannot communicate. Better performance can be achieved by choosing an appropriate size for the context data.
Fig. 3

Process of dataflow executing: a dataflow diagram, b instructions mapping, c loop-in-pipeline

Loop-in-pipeline model In traditional dataflow architectures, inefficient execution is observed when some instructions are executed while others are still waiting for the arrival of their operands. The loop-in-pipeline optimization solves this inefficiency by allowing more than one context to run simultaneously on the dataflow architecture, under the condition that no data dependency exists among these contexts, as shown in Fig. 3c. Without data dependencies between contexts, multiple contexts in loop-in-pipeline mode execute on the PEs in a pipelined and parallel manner. Each instruction in the dataflow diagram immediately receives the operands of the next context after the previous execution is done, resulting in a significant improvement in the utilization of the execution units.

3.3 Dataflow instruction set

The instruction format of SPU is shown in Fig. 4. Each instruction is composed of an operation code, the number of dependent operands and destination addresses. The destination addresses point to locations in the context memory of the PE.

In the instruction set of the SPU, operations for basic arithmetic, memory access, loops, and branches are supported to ensure flexibility. In order to better implement the CONV and FC layers, we added an accumulate instruction that keeps the partial sum in the context memory.

Figure 5 shows the compilation process for SPU. To get an executable binary kernel for SPU, we first compile the high-level kernel code into dataflow instructions. After compilation, manual fine-tuning of the output sequence is needed to balance the dataflow graph. The second step is mapping the dataflow instructions to the PE array statically (Xiao-Wei et al. 2017) according to the specification of the PE array, which indicates the scale of the PE array and the maximum number of instructions a single PE can hold. Each instruction in the generated binary code follows the instruction format in Fig. 4. The binary code is transferred to Cbuf and waits to be mapped to the PEs.
Fig. 4

Instruction format of SPU

Fig. 5

The process of compiling binary code from high level language

3.4 Accelerating scientific application on SPU

The basic design concept of SPU is accelerating loops in scientific applications. The loops in scientific applications are always regular, which means that (1) there is no complex dependency between iterations (except accumulation), (2) there is no random memory access and (3) the number of iterations to be executed is known statically. In consequence, we are able to utilize the loop-in-pipeline model described above. Moreover, the characteristics of scientific applications also allow us to partition a big problem into multiple small problems.

Figure 6 shows an example of accelerating a 2D stencil on SPU. The original program is a nested loop computing a \(256\times 256\) 2D5P stencil. We partition the whole task into \(32\times 32\) smaller sub-tasks, each of which computes \(8\times 8\) data points; a code sketch of this partitioning is given after Fig. 6. The granularity of each sub-task is essential to the execution efficiency of SPU. The instructions of a sub-task should fill the PE array as much as possible to provide more instruction-level parallelism and enable more data reuse. The sub-task is translated into a dataflow graph and mapped to the PE array. During the execution, \(32\times 32\) contexts flow through the dataflow graph, so context-level parallelism can be exploited.
Fig. 6

Example of accelerating 2D stencil on SPU. a Original 2D5P stencil program. b Program after partition. c Partition of 2D5P stencil. d Dataflow graph of sub-task
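
The sketch below illustrates this partitioning in plain Python under our own assumptions (a 5-point averaging stencil with coefficient 0.2 and a one-cell halo); each 8 x 8 sub-task plays the role of one context, and the reference sweep checks that the partitioned version computes the same result.

```python
import numpy as np

def stencil_2d5p(A):
    """One reference sweep of a 5-point averaging stencil over the interior of A."""
    B = A.copy()
    B[1:-1, 1:-1] = 0.2 * (A[1:-1, 1:-1] + A[:-2, 1:-1] + A[2:, 1:-1]
                           + A[1:-1, :-2] + A[1:-1, 2:])
    return B

def stencil_2d5p_partitioned(A, tile=8):
    """Same sweep, partitioned into tile x tile sub-tasks; each sub-task is one context."""
    n = A.shape[0]
    B = A.copy()
    for bi in range(1, n - 1, tile):          # about 32 x 32 sub-tasks for n = 256, tile = 8
        for bj in range(1, n - 1, tile):      # (boundary sub-tasks cover fewer points)
            for i in range(bi, min(bi + tile, n - 1)):
                for j in range(bj, min(bj + tile, n - 1)):
                    B[i, j] = 0.2 * (A[i, j] + A[i - 1, j] + A[i + 1, j]
                                     + A[i, j - 1] + A[i, j + 1])
    return B

A = np.random.rand(256, 256)
assert np.allclose(stencil_2d5p(A), stencil_2d5p_partitioned(A))
```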

We have applied several typical scientific applications to SPU. Figure 7 illustrates the utilization of the floating-point units and the floating-point efficiency. Since SPU exploits sufficient parallelism in these scientific applications, it achieves considerable execution efficiency.
Fig. 7

The utilization of floating-point unit and floating-point efficiency for typical scientific applications

4 Acceleration scheme on fine-grained dataflow architecture

4.1 Basic concept of applying CNN on SPU

Although SPU is not designed for AI algorithms, many AI algorithms including CNN have the potential to be accelerated by SPU. Taking the CNN algorithm as an example, it has the following features, which are similar to scientific applications and necessary for SPU:
  1. High parallelism. Each layer in CNN provides sufficient parallelism to be exploited by SPU. The computation of an output point is usually independent of the others, so SPU is able to exploit parallelism both inside the dataflow graph and across pipelined contexts.

  2. Simple loop pattern. Every layer in CNN can be written as a simple nested for-loop. All the dependencies in the program are statically decided, so the program can easily be translated into a dataflow graph representing these static dependencies. Moreover, except for accumulation, there is no complex dependency between iterations. Thus, iterations (contexts) are able to flow through the dataflow graph fluently, providing sufficient firable instructions.

  3. Regular memory access pattern. There is no random memory access in any layer of CNN. All the computation and related data can be easily partitioned, so we can make full use of the on-chip buffer (Dbuf) instead of the complicated cache hierarchy of traditional control-flow architectures. Furthermore, due to the regular access pattern, it is flexible to design a dataflow graph with an optimum scale.

Figure 8 takes the fully connected layer as an example to illustrate how we apply a CNN layer on SPU. Figure 8a is the original FC-layer program, in which the whole problem is a matrix-vector product (a \(K\times (C\times H\times W)\) matrix times a \((C\times H\times W)\) vector). Similar to the scientific application shown in Fig. 6, we partition the whole layer into sub-tasks (an \(N_y \times N_x\) matrix times an \(N_x\) vector); a code sketch of this partitioning follows Fig. 8. The scale of \(N_x\) and \(N_y\) should be chosen carefully to ensure that the dataflow graph is big enough to provide sufficient instruction-level parallelism and that the number of contexts is large enough to provide sufficient context-level parallelism. By comparing Figs. 8 and 6, we can see that the process of applying CNN on SPU is similar to that of applying scientific applications on SPU.
A reasonable partition and dataflow graph are the keys to maximizing data reuse and exploiting maximum parallelism. The computational characteristics of each layer in CNN are not the same. Thus, each layer is optimized differently according to its unique characteristics so that it executes more efficiently on the SPU. We propose different methods of partitioning the input feature maps and filters into multiple contexts for each kind of layer. In addition, dataflow graphs that maximize data reuse and computing resource utilization are designed for executing the contexts of each layer. In the following, we present the detailed design for each layer in CNN.
Fig. 8

Example of accelerating FC layer on SPU. a Original FC program. b Program after partition. c Partition of FC layer
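
As a rough sketch of the partitioning in Fig. 8b, c, the code below tiles the matrix-vector product into \(N_y \times N_x\) sub-tasks; the function name and tile sizes are our own illustrative choices.

```python
import numpy as np

def fc_partitioned(F, x, Ny=64, Nx=64):
    """Matrix-vector product y = F @ x split into sub-tasks, one per context:
    each sub-task multiplies an Ny x Nx tile of the filter matrix with an Nx
    slice of the input vector and accumulates partial sums of Ny output pixels."""
    K, CHW = F.shape
    y = np.zeros(K, dtype=F.dtype)
    for i in range(0, K, Ny):
        for j in range(0, CHW, Nx):
            y[i:i + Ny] += F[i:i + Ny, j:j + Nx] @ x[j:j + Nx]
    return y

# Example with an AlexNet-like FC layer: K = 4096 outputs, C*H*W = 9216 inputs
F = np.random.rand(4096, 9216)
x = np.random.rand(9216)
assert np.allclose(fc_partitioned(F, x), F @ x)
```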

4.2 CONV

Different optimization ideas are adopted by existing platforms (CPUs, GPUs and specific dataflow accelerators). As shown in Fig. 9, unrolling convolution, which is commonly used on CPUs and GPUs, converts convolution into matrix multiplication (Chetlur et al. 2014; Chellapilla et al. 2006). The data of the ifmaps are rearranged and all the multi-level loops in the convolution operation are expanded. The convolution then runs efficiently thanks to the full acceleration of matrix multiplication by the hardware platform. In contrast, specific dataflow accelerators define dataflows of the ifmaps and filters on the accelerator. These dataflows maximize data reuse (Chen et al. 2016) in the convolution, as a result of which the number of memory accesses is reduced and the operational efficiency is improved.

We analyzed the advantages and disadvantages of these two optimization methods. Convolution is characterized by overlapping pixels of the ifmaps being required by the calculation of adjacent ofmap pixels. After the data of the ifmaps are rearranged by unrolling convolution, the data of the input feature map cannot be reused by the accelerator. Thus, unrolling convolution requires more memory accesses than the optimization methods of the specific accelerators.

The optimization methods of specific accelerators can minimize the number of memory accesses. However, the parameters of different CONV layers are not the same, resulting in different amounts of computation, and it is difficult to distribute the tasks to the PE array in a balanced manner. For example, consider two schemes implemented on a \(3 \times 3\) PE array. As shown in Fig. 9, a \(3 \times 3\) input image is convolved with a \(2 \times 2\) filter and a total of 16 multiply-accumulate operations are performed. Seven PEs perform 2 multiply-accumulate operations, while the other 2 PEs perform only one. Because the operations are not evenly distributed over the PE array, some PEs are busy while the others are idle. In unrolling convolution, we divide the rearranged new matrix into small blocks, as shown on the right side of Fig. 9. We calculate a \(2 \times 3\) filter block and a \(3 \times 3\) matrix multiplication each time. There are 18 multiply-accumulate operations that can be evenly distributed to 9 PEs. Compared with the first method, this achieves higher utilization of the computing components and better adaptability to CONV layers with different parameters.

Although the number of memory accesses is increased a little by the unrolling convolution, it does not become the bottleneck on SPU. In Sect. 6.4, we analyze the bandwidth requirements of this convolution optimization method; the requirement is far lower than the peak bandwidth provided by the hardware. On the other hand, the even distribution of tasks greatly impacts the performance of the fine-grained dataflow structure (Xiao-Wei et al. 2017). Thus, we chose to implement the unrolling optimization method on SPU.

We divide the unrolling process into two parts: one is the data rearrangement process, namely data preprocessing; the other is matrix multiplication. Data preprocessing does not perform the multiplications and additions in formula (1), which reduces the component utilization of the whole execution process. Therefore, we reduce the percentage of data preprocessing in the total execution time by running the data preprocessing of each data block only once. The results of this data preprocessing are written from the Dbufs back to memory. In actual operation, it occupies less than 5.5% of the total execution time of the CONV layers.

The data blocks are divided into squares so that the data reuse of the matrix multiplication is maximized. Data preprocessing and matrix multiplication with different block shapes are also supported. When a partitioned block is smaller than the specified size (shown as the gray part of the block in the figure), only the effective part is calculated, so the number of invalid calculations can be greatly reduced. In the following sections, we describe in detail how these two parts internally divide the contexts and the dataflow diagram within a context.
Fig. 9

Blocking unrolling convolution

Fig. 10

Data preprocessing execution on fine-grained dataflow architecture

4.2.1 Data preprocessing

The data preprocessing step of unrolling convolution, which splits the ifmaps into a large matrix \({\varvec{I}}\), is shown in Fig. 9. The matrix \({\varvec{I}}\) is obtained by unfolding the convolutional windows of the input channels in Fig. 1, so the height of \({\varvec{I}}\) is CRS. A convolutional window is translated PQ times over an input channel, and each batch has N images, so the width of the matrix \({\varvec{I}}\) is NPQ. The filters are also tiled into a \(K \times CRS\) matrix. The ofmaps, which constitute a \(K \times NPQ\) matrix, are obtained by multiplying the filter matrix \({\varvec{F}}\) with \({\varvec{I}}\). The following formula illustrates the conversion between the ifmaps \(\varvec{I_0}\) and the matrix \({\varvec{I}}\):
$$\begin{aligned} \begin{aligned} {\varvec{I}}[x][y] =&{\varvec{I}}[c\times RS + r\times S + s][n \times PQ + p \times Q + q] \\ =&\varvec{I_0} [n][c][p \times u+r][q \times v+s] \end{aligned} \end{aligned}$$
(5)
where x is the row index of the matrix \({\varvec{I}}\) and y is the column index of the matrix \({\varvec{I}}\).

The data volume of the whole convolution is too large to be loaded into SPU because of the limited on-chip memory of the accelerator. Therefore, the matrix \({\varvec{I}}\) is divided into smaller blocks. Each time, data preprocessing produces a block of \({\varvec{I}}\) of size \(K_x \times K_y\). As shown in Fig. 9, \({\varvec{I}}\) is divided into 12 blocks, each of size \(3 \times 3\). Only a part of the 4-D ifmaps is transferred to the Dbufs instead of the entire ifmaps. Obviously, the data in \(\varvec{I_b} (1,0)\) come from \(\varvec{I_0}(0,0,,)\) and \(\varvec{I_0}(0,1,,)\), both of which are transferred to the on-chip memory. Then, the data preprocessing kernel is executed. Finally, we obtain \(\varvec{I_b} (1,0)\), which is one block of the matrix \({\varvec{I}}\).

SPU transforms data at a granularity smaller than a block, dividing the calculation of \(\varvec{I_b}(i, j,,)\) into multiple contexts and running these contexts concurrently. Each context corresponds to an \(N_x \times N_y\) block. In the example in Fig. 10, we assume that the size of a block is \(16 \times 16\) and the size of a context is \(4 \times 4\), so the entire block is divided into 16 contexts. We can deduce from formula (5) that when the position of a pixel in the matrix \({\varvec{I}}\) is known, the parameters to access the pixel value in \(\varvec{I_0}\) can be calculated, where \(n=y/PQ\), \(p=y\%PQ/Q\), \(q=y\%PQ\%Q\), \(c=x/RS\), \(r=x\%RS/S\), \(s=x\%RS\%S\) (P, Q, R, S are constants, so these division and remainder operations can be converted to multiplications and subtractions). Because the three variables c, r, s only depend on the pixel’s row index, the pixels of the same row can share c, r, s. The other three variables q, p, n only depend on the pixel’s column index, so the pixels of the same column can share q, p, n. As shown in Fig. 10, \(c_3\), \(r_3\), \(s_3\) are transferred to (3, 0), (3, 1), (3, 2), (3, 3), and \(q_0\), \(p_0\), \(n_0\) are transferred to (0, 0), (1, 0), (2, 0), (3, 0). Node (3, 0) receives all the variables \(c_3\), \(r_3\), \(s_3\), \(q_0\), \(p_0\), \(n_0\) and calculates the index of the pixel in the ifmaps by using formula (5). Then node (3, 0) loads the pixel from the ifmaps \(\varvec{I_0}(,,,)\) and stores it in \(\varvec{I_b}(0,0,3,0)\). Other nodes perform similar operations. In this way, the computational complexity of the data indexing in a context is reduced from \({\varvec{O}}(N^2)\) to \({\varvec{O}}(N)\). No data dependency exists among the contexts, so the contexts can fully stream on the accelerator.
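
The index arithmetic of Eq. (5) and the sharing of (c, r, s) per row and (n, p, q) per column can be sketched as follows; preprocess_block and its argument layout are hypothetical names used only for illustration.

```python
import numpy as np

def preprocess_block(I0, x0, y0, Nx, Ny, R, S, P, Q, u, v):
    """Build one Nx x Ny context of the unrolled matrix I (Eq. (5)).
    The row indices are decoded into (c, r, s) and the column indices into
    (n, p, q) only once, mirroring how the variables are shared in Fig. 10."""
    Ib = np.zeros((Nx, Ny), dtype=I0.dtype)
    rows = [(x // (R * S), (x % (R * S)) // S, x % (R * S) % S) for x in range(x0, x0 + Nx)]
    cols = [(y // (P * Q), (y % (P * Q)) // Q, y % (P * Q) % Q) for y in range(y0, y0 + Ny)]
    for i, (c, r, s) in enumerate(rows):
        for j, (n, p, q) in enumerate(cols):
            Ib[i, j] = I0[n, c, p * u + r, q * v + s]   # load one ifmap pixel
    return Ib

# Example: unroll a tiny layer completely (block size = full matrix I of size CRS x NPQ)
N, C, H, W, R, S, u, v = 1, 2, 5, 5, 3, 3, 1, 1
P, Q = (H - R) // u + 1, (W - S) // v + 1
I0 = np.random.rand(N, C, H, W)
I_full = preprocess_block(I0, 0, 0, C * R * S, N * P * Q, R, S, P, Q, u, v)
```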

4.2.2 Matrix multiplication

After data preprocessing, a block of the matrix \({\varvec{I}}\) has been computed and stored in the Dbufs. The matrix \({\varvec{F}}\) is also divided into blocks of the same size as those of \({\varvec{I}}\). Executing the matrix multiplication only requires transferring the corresponding block of matrix \({\varvec{F}}\) to the Dbufs. As shown in Fig. 9, \(\varvec{I_b}(1,0,,)\) is obtained by data preprocessing after storing \(\varvec{I_0}(0,0,,)\) and \(\varvec{I_0}(0,1,,)\) in the Dbufs. Since \(\varvec{F_b}(0,1)\) has been transferred to on-chip storage, we can multiply \(\varvec{I_b}(1,0,,)\) by \(\varvec{F_b}(0,1)\) and produce a partial sum of \({\varvec{O}}(0,0)\). We denote \(\varvec{F_b}(,)\) as matrix \({\varvec{A}}\), \(\varvec{I_b}(,,,)\) obtained by data preprocessing as matrix \({\varvec{B}}\), and the result of the multiplication as matrix \({\varvec{C}}\).

The optimization of the matrix multiplication is similar to that of the data preprocessing: the calculation is divided into multiple contexts that are processed in parallel. In a naive scheme, each context fetches a row of \({\varvec{A}}\) and a column of \({\varvec{B}}\), which are multiplied and added to obtain one value of \({\varvec{C}}\), as follows:
$$\begin{aligned} \begin{aligned} \varvec{C_{ij}}&= \sum _{k=0}^{D-1} \varvec{a_{ik} b_{kj} }\\&= \varvec{a_{i0} b_{0j}} + \varvec{a_{i1} b_{1j}} + \cdots +\varvec{a_{i(D-1)} b_{(D-1)j}}. \end{aligned} \end{aligned}$$
(6)
However, in this scheme, the data in \({\varvec{A}}\) and \({\varvec{B}}\) must be accessed \({\varvec{O}}(N^2)\) times. In order to increase the degree of data reuse and reduce the number of memory loads, the optimization scheme shown in Fig. 11 is used. Firstly, \({\varvec{A}}\) is divided into small pieces of size \(K_x \times N_y\), as shown in the bottom left corner of Fig. 11, and \({\varvec{B}}\) is divided into small pieces of size \(N_x \times K_y\), as shown in the upper left corner. In each context, a piece of data fetched from \({\varvec{A}}\) is multiplied with a piece from \({\varvec{B}}\) to generate a small \(N_x \times N_y\) block of \({\varvec{C}}\). The results of all contexts in a block constitute the entire matrix \({\varvec{C}}\). Data reuse within a context is achieved because each row of \({\varvec{A}}\)’s context can be multiplied by multiple columns in \({\varvec{B}}\)’s context, and each column in \({\varvec{B}}\)’s context can be multiplied by multiple rows in \({\varvec{A}}\)’s context, so these data need to be loaded from memory only once. Figure 11 presents an example in which the size of \({\varvec{A}}\)’s context is four rows and the size of \({\varvec{B}}\)’s context is four columns. Four pixels of one row from \({\varvec{A}}\)’s context are taken every time and spread to the \(4 \times 4\) nodes from west to east. Similarly, four pixels of one column from \({\varvec{B}}\)’s context are taken and spread from north to south. These two values are multiplied and accumulated with the partial sum. By repeating this process 4 times, a \(4 \times 4\) piece of matrix \({\varvec{C}}\) is obtained. The entire \({\varvec{C}}\) is obtained by executing the 16 contexts contained in the \(16 \times 16\) block; a code sketch of this blocking is given after Fig. 11.
Fig. 11

Execution of matrix multiplication on fine-grained dataflow architecture
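
A compact sketch of this context-level blocking is given below: each context produces one \(N_x \times N_y\) block of \({\varvec{C}}\) by streaming one row of \({\varvec{A}}\)'s piece and one column of \({\varvec{B}}\)'s piece per step and accumulating partial sums. The function name and block sizes are assumptions for illustration.

```python
import numpy as np

def matmul_contexts(A, B, Nx=4, Ny=4):
    """Blocked matrix multiplication C = A @ B; each (i, j) block of C is one context."""
    M, D = A.shape
    _, L = B.shape
    C = np.zeros((M, L), dtype=A.dtype)
    for i in range(0, M, Nx):
        for j in range(0, L, Ny):
            acc = np.zeros((Nx, Ny), dtype=A.dtype)       # partial sums held in the PE array
            for k in range(D):
                # stream one row fragment of A and one column fragment of B per step
                acc += np.outer(A[i:i + Nx, k], B[k, j:j + Ny])
            C[i:i + Nx, j:j + Ny] = acc
    return C

A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
assert np.allclose(matmul_contexts(A, B), A @ B)
```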

4.3 FC

As described in Sect. 2.2, the operation of the FC layers is similar to that of the CONV layers. However, FC layers do not need data preprocessing because they do not share weights among the pixels of one feature map. FC can be implemented with matrix-vector multiplication.

Figure 8c shows the problem partition of FC. Each time, SPU calculates an \(N_x \times K_x\)-sized filter block that is multiplied by \(N_x\)-sized ifmaps to obtain \(K_x\) pixels of the ofmaps. These data are split into \(K_x / N_x\) contexts that are executed in parallel. For each context, a filter block of \(N_x \times N_y\) is multiplied by \(N_x\)-sized ifmaps.

If the batch size is big enough, FC layers can be implemented by matrix multiplication to achieve higher performance.

4.4 POOL

Figure 12 shows the problem partition of the POOL layer. In POOL layers, SPU loads \(N_x \times N_y \times K_x\)-sized ifmaps each time, and these ifmaps are divided into \(K_x\) contexts. In each context, the downsampling of \(N_x \times N_y\)-sized data is calculated according to formula (2).

If either u or v is smaller than the size of the filter, the data used by neighboring computations in POOL layers will overlap. The overlapping data can then be kept in the PE and reused to reduce the number of memory accesses.
Fig. 12

Problem partition of pooling layers

4.5 ReLU

As illustrated in Fig. 13, the problem partition of ReLU is similar to the partition of POOL layers. Each time, \(N_x \times N_y \times K_x\)-sized ifmaps are loaded into SPU and are divided into \(K_x\) contexts. In each context, a piece of \(N_x \times N_y\) ifmaps is used for calculation with formula (3). After the calculation is done, the results are written back to memory.
Fig. 13

Problem partition of ReLU layers

4.6 LRN

As shown in Fig. 14, SPU loads \(N_x \times K_x \times K_y\) ifmap pixels into the Dbuf each time. When calculating a pixel of the ofmaps, we need to load the ifmap pixels at the same location in the feature maps from different channels. Each context takes one pixel from each channel, taking a total of \(N_x\) pixels to update \(N_x-T\) pixels of the ofmaps. This is naturally divided into \(K_x \times K_y\) contexts.
Fig. 14

Problem partition of LRN layers

5 Experimental methodology

Implementation: We use the large-scale parallel simulation framework SimICT (Ye et al. 2013) as a platform to implement a cycle-accurate simulator in C. Our simulator has been verified against the real RTL front-end design and the deviation is under 3%. We chose to use the simulator model rather than RTL emulation to save experiment time under this tolerable deviation. The structure of the SPU simulator is shown in Fig. 2, and it includes all the simulation components such as SPU, CPU, DMA controller, and memory.

The SPU processor consists of \(8\times 8\) PEs, 32 Dbufs, 8 Cbufs, a micro-controller unit (MicC), and a DMA controller. The size of all Dbufs is 2 MB, and the size of the operand buffer in each PE is 6 KB. The pixels of the feature maps use a 16-bit fixed-point data format. Each PE is equipped with two 32-bit fixed-point multiply-add units to calculate load/store addresses and a four-way 16-bit subword-SIMD multiply-add unit. Eight PEs are additionally equipped with a four-way 16-bit fixed-point divider for LRN layers. For 16-bit operations (multiply and add), the peak performance of a single SPU tile is 512 GOPS at a clock frequency of 1 GHz.

Workloads: We choose two typical CNNs, namely AlexNet (Krizhevsky et al. 2012) and VGG-19 (Simonyan and Zisserman 2014), from the Caffe2 model zoo as the workloads. These two CNNs consist of various kinds of layers with different parameters for each layer. For example, the filter sizes in AlexNet cover 3, 5 and 11. By testing these two CNNs, we can obtain the performance of layers of different shapes and types and verify the adaptability and flexibility of SPU. For VGG-19, we only report the results of a subset of representative layers, since there are duplicated layers in the network; we still implemented and tested the entire network to make a fair comparison with the other platforms. The parameters of the selected layers are listed in Tables 2 and 3. The batch size of these two CNNs is fixed at 16. In these two tables, \(T_n\) represents the number of tasks the layer is divided into, and isize / T and fsize / T represent the size of the input feature map and filter for one task, respectively. Those parameters are set according to the specification of the layer and the size of the on-chip Dbuf.
Table 2

Parameters of selected layers in AlexNet

| Layer | H / W | R / S | P / Q | C | K | u / v | \(T_n\) | isize / T | fsize / T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CONV1 | 227 | 11 | 55 | 3 | 96 | 4 | 285 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV2 | 27 | 5 | 27 | 48 | 256 | 1 | 460 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV3 | 13 | 3 | 13 | 256 | 384 | 1 | 324 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV4 | 13 | 3 | 13 | 192 | 384 | 1 | 252 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV5 | 13 | 3 | 13 | 192 | 256 | 1 | 168 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| FC1 | 6 | 6 | 1 | 256 | 4096 | 1 | 576 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| FC2 | 1 | 1 | 1 | 4096 | 4096 | 1 | 256 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| FC3 | 1 | 1 | 1 | 4096 | 1000 | 1 | 64 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| POOL1 | 55 | 3 | 27 | 96 | 96 | 2 | 24 | 55 \(\times\) 55 \(\times\) 16 | - |
| POOL2 | 27 | 3 | 13 | 256 | 256 | 2 | 16 | 27 \(\times\) 27 \(\times\) 64 | - |
| LRN1 | 27 | 5 | - | 96 | 96 | - | 12 | 27 \(\times\) 27 \(\times\) 36 | - |
| LRN2 | 27 | 5 | - | 256 | 256 | - | 16 | 13 \(\times\) 13 \(\times\) 68 | - |
| ReLU1 | 55 | - | - | 96 | 96 | - | 380 | 128 \(\times\) 128 | - |
| ReLU2 | 27 | - | - | 256 | 256 | - | 184 | 128 \(\times\) 128 | - |

Number of layers of each type in AlexNet:

| Layer | CONV | FC | POOL | LRN | ACT |
| --- | --- | --- | --- | --- | --- |
| AlexNet | 5 | 3 | 3 | 2 | 7 |
Table 3

Parameters of selected layers in VGG-19

| Layer | H / W | R / S | P / Q | C | K | u / v | \(T_n\) | isize / T | fsize / T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CONV2 | 224 | 3 | 222 | 3 | 64 | 1 | 7700 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV4 | 112 | 3 | 110 | 128 | 128 | 1 | 3409 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV5 | 56 | 3 | 54 | 128 | 256 | 1 | 1656 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV9 | 28 | 3 | 26 | 256 | 512 | 1 | 1584 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| CONV13 | 14 | 3 | 12 | 512 | 512 | 1 | 3168 | 128 \(\times\) 128 | 128 \(\times\) 128 |
| FC1 | 7 | 7 | 1 | 512 | 4096 | 1 | 1568 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| FC2 | 1 | 1 | 1 | 4096 | 4096 | 1 | 256 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| FC3 | 1 | 1 | 1 | 4096 | 1000 | 1 | 64 | 256 \(\times\) 256 | 256 \(\times\) 4 |
| POOL1 | 224 | 2 | 112 | 64 | 64 | 2 | 256 | 56 \(\times\) 56 \(\times\) 16 | - |
| POOL2 | 112 | 2 | 56 | 128 | 128 | 2 | 128 | 56 \(\times\) 56 \(\times\) 16 | - |
| POOL3 | 56 | 2 | 28 | 256 | 256 | 2 | 64 | 56 \(\times\) 56 \(\times\) 16 | - |
| ReLU1 | 222 | - | - | 64 | 64 | - | 6164 | 128 \(\times\) 128 | - |
| ReLU3 | 110 | - | - | 128 | 128 | - | 1516 | 128 \(\times\) 128 | - |
| ReLU5 | 54 | - | - | 256 | 256 | - | 736 | 128 \(\times\) 128 | - |

Number of layers of each type in VGG-19:

| Layer | CONV | FC | POOL | ACT |
| --- | --- | --- | --- | --- |
| VGG-19 | 16 | 3 | 5 | 19 |

Power & area: We used Verilog to implement the SPU design and synthesized SPU with a 45 nm technology library using Synopsys Design Compiler (DC). The power is calculated as the static power reported by Design Compiler plus the dynamic power derived from the average flip rate in emulation. The measured area and power of a single SPU tile are approximately 44.71 mm\(^2\) and 3.27 W.

Comparison methodology: We first make a comparison with a CPU implementation running on an Intel Core i7-4790K. We also compare against an NVIDIA TITAN Xp (12.15 Teraflops peak, 12 GB GDDR5, 547.6 GB/s total memory bandwidth). The CUDA version of CNN is compiled with CUDA 9.0 and accelerated with cuDNN 7.0. We configure 24 SPU tiles with 480 GB/s memory bandwidth, which enables SPU to reach a peak performance (12.288 Tops) similar to that of the GPU. In order to match the test environment of the GPU, the CPU in the SPU simulator is only responsible for assigning workloads. Besides, the clock frequency of the CPU is set to the same frequency as the real Intel processor and is normalized for calculating the total execution time of the CNN layers. We evaluate the performance, computational resource utilization, and energy consumption of CNN running on GPU and SPU. The computing resource utilization is defined as
$$\begin{aligned} \frac{Total\ number\ of\ FP\ instructions}{Number\ of\ execution\ cycles\times N}\,\times \,100\% \end{aligned}$$
(7)
where N represents maximum number of instructions the architecture can execute per cycle.

Based on the performance results, we analyzed the SPU bandwidth requirements. To further illustrate our advantages over other specific accelerators, we compare the performance of one SPU tile with the performance of DianNao (452 Gops) running the same workloads.

6 Experimental results

6.1 Performance

The results of the performance comparison between CPU, GPU and the 24-tile SPU for the layers in AlexNet and VGG-19 are shown in Figs. 15 and 16. Among them, the execution on SPU is on average \(6.4\times \sim 33.4\times\) faster than that on GPU for POOL and ReLU layers, which is a much larger speedup than for CONV layers and FC layers. This is because these layers contain fewer computational operations than CONV layers and FC layers. GPUs usually allot all the instructions computing a pixel (including load/store and calculations) to one core. If the amount of calculation is small, it is difficult to make full use of the large number of computing components. In contrast, SPU uses a fine-grained dataflow architecture that differs from the GPU and can fully exploit instruction-level parallelism. The instructions required for the same data calculation can be distributed to different PEs, making full use of the SPU’s computing resources. Besides, SPU reuses many of the partial sums in neighboring computations in POOL layers and LRN layers, rather than re-fetching them from memory. The speedups of SPU over GPU for LRN1 and LRN2 in AlexNet are \(7.2\times\) and \(13.6\times\), respectively (the feature maps in LRN2 are much smaller than in LRN1). They are smaller than the speedups for POOL layers because of the limited computing ability and huge delay of the divider. Although CONV layers and FC layers have already been well accelerated on the GPU, we make better use of the computing components on the SPU, achieving a speedup of \(1.66\times \sim 5.29\times\) over GPU. In addition, the executions of the entire AlexNet and VGG-19 on SPU only require 25.8% and 36.7% of the time needed by GPU, respectively.
Fig. 15

Performance comparison of CPU, GPU, and SPU for layers in AlexNet

Fig. 16

Performance comparison of CPU, GPU, and SPU for layers in VGG-19

6.2 Computing resource utilization

Figures 17 and 18 present the utilization of the computational resources of the CONV layers and FC layers on GPU and SPU, respectively. The performance gaps of the GPU when executing different shapes of CONV layers are obvious. Taking the layers in AlexNet as examples, the utilization of computing resources in CONV1 is 18.12%, much lower than that of the other CONV layers. The efficiency of CONV3, CONV4, and CONV5 also shows a large gap in resource utilization compared with CONV2. In contrast, the acceleration scheme of SPU adapts well to different shapes of convolution. In all CONV layers, over 40% utilization of the computational resources is achieved on SPU, far beyond the GPU. SPU achieves 16.23% utilization of computing resources on average when executing FC layers, also outperforming the GPU. With the complete implementation of AlexNet and VGG-19, SPU achieves 61.7% and 51.1% utilization of computing resources, respectively.
Fig. 17

Utilization of computational resources of GPU and SPU for layers in AlexNet

Fig. 18

Utilization of computational resources of GPU and SPU for layers in VGG-19

6.3 Energy reduction

Figures 19 and 20 show the energy reduction of SPU over GPU. The power consumption of the GPU corresponding to different workload utilizations is quoted from Jouppi et al. (2017). SPU reduces the energy by \(2.92\times \sim 99.92\times\) compared to the GPU baseline. The CONV and FC layers achieve a smaller energy improvement of \(2.92\times \sim 7.54\times\), compared with the \(8.81\times \sim 99.02\times\) energy improvement achieved by the POOL and ReLU layers. With the complete implementation of AlexNet and VGG-19, SPU achieves \(6.53\times\) and \(4.99\times\) energy improvement compared to GPU, respectively.
Fig. 19

Energy reduction of SPU over GPU

Fig. 20

Energy reduction of SPU over GPU

6.4 Bandwidth requirement

Figure 21 lists the peak bandwidth required by each layer of CNN when there is no bandwidth limitation for the 24-tile SPU. For CONV, FC, and LRN layers, the peak bandwidth required by the SPU is less than 480 GB/s, so the performance is not limited by bandwidth. The peak bandwidth required by POOL and ReLU layers is larger than 480 GB/s. Because the calculation in ReLU layers is simple, we can reuse the data stored on-chip after the previous layers finish computing; thus, ReLU layers are not affected by the bandwidth during the actual execution. The performance of POOL layers declines because of the bandwidth limitation.
Fig. 21

Peak bandwidth required by each type of layers of CNN on SPU

6.5 Comparison with other accelerators

Figure 22 presents the acceleration ratio of one SPU tile relative to DianNao for AlexNet. The performance of the SPU is better than that of DianNao for most layers, with the exception of the CONV3 layer. CONV2, CONV5 and POOL1 achieve better speedups than the other layers. This is due to the smaller size of their input/output feature maps combined with the smaller number of operations, which DianNao does not accelerate well (as mentioned by Chen et al. 2014a). We take into account the influence of different feature map parameters on performance and implement the unrolling convolution optimization method on SPU to reduce the impact of smaller feature maps on performance. The POOL layers benefit from the instruction-level parallelism of SPU: even when the amount of computation is small, the computation can still be distributed as evenly as possible over the PE array.
Fig. 22

Performance comparison of SPU over DianNao for AlexNet

Table 4 illustrates the flexibility of the SPU in implementing a wider variety of layers than other dataflow accelerators. In particular, for new layers that may appear in the future, other accelerators might need to change their hardware configuration, whereas SPU only needs a new fine-grained dataflow program for those layers and does not need a redesign of the hardware structure.

Recent accelerators for neural networks [such as the TPU (Jouppi et al. 2017) and Tesla FSD (Venkataramanan and Sarma 2019)] apply systolic arrays to accelerate the compute-bound layers such as convolutional or fully-connected layers. TPU and Tesla FSD are able to offer peak throughputs of 92 Tops (8-bit integer) and 73 Tops (8-bit integer), respectively, while the 24-tile SPU SoC offers a peak throughput of 12.288 Tops but with floating-point units. Although TPU and Tesla FSD provide considerable performance for specific NN layers, the other parts of the application in real scenarios, such as the preprocessing of input images including resizing and filtering, have to rely on other on-chip resources such as a DSP, an ISP or the main core. The reason is that the data flows of those non-NN operations differ from those of NN layers, which are always based on matrix multiplications, and some of those operations require floating-point support. SPU was originally designed for scientific applications, so compared with those accelerators, SPU provides good configurability towards those non-NN algorithms in real scenarios, while achieving high throughput and efficiency for NN layers as discussed above. Therefore, SPU is a good alternative to support both the NN operations and the non-NN parts of the whole application.
Table 4

Type of layers implemented by different dataflow accelerators

| Layer | CONV | FC | POOL | ACT | LRN |
| --- | --- | --- | --- | --- | --- |
| DianNao | Yes | Yes | Yes | Yes | No |
| Eyeriss | Yes | Yes | Yes | No | No |
| FlexFlow | Yes | Yes | Yes | No | No |
| SPU | Yes | Yes | Yes | Yes | Yes |

7 Related work

Some classical fine-grained dataflow architectures and some dataflow ASICs for CNN and DNN are described as follows.

Fine-grained dataflow architecture: TRIPS (Govindan et al. 2007) is a fine-grained dataflow architecture proposed by Doug Burger et al. at the University of Texas. TRIPS can support eight frames at the same time to hide the latency of transferring operands between instructions. WaveScalar (Swanson et al. 2007) is a cluster-based extensible dataflow architecture proposed by Steven Swanson et al. at the University of Washington. Each cluster contains four domains, and each domain contains eight PEs. Each instruction is mapped to a PE. During program execution, WaveScalar constantly replaces useless instructions and loads new unexecuted instructions into the PEs. Both dataflow architectures are general-purpose processors that cannot take full advantage of the parallelism in CNN to mask the latency of operand transfer, resulting in low utilization of the functional units.

ASIC: DianNao (Chen et al. 2014a) is composed of buffers for input neurons (NBin), output neurons (NBout), and synaptic weights (SB), connected to a Neural Functional Unit (NFU), and uses loop tiling to optimize the computation. Chen et al. (2016) proposed a row-stationary (RS) dataflow, which maps each row of the output feature map to one PE. FlexFlow (Lu et al. 2017) proposed a complementary parallelism principle, reaping multiple types of parallelism to improve computing resource utilization. Besides, it can adapt to multiple mixtures of parallelism to serve different layers efficiently. However, these specific accelerators sacrifice hardware flexibility in order to achieve high efficiency, which makes it difficult to implement new layers of CNN that may emerge in the future.

8 Conclusion

In this paper, we proposed a method of accelerating CNN using a dataflow accelerator designed for scientific applications. We first analyzed the similarity between CNN and scientific applications. According to the characteristics of this accelerator, the context partition method and the dataflow graph for each layer of CNN are proposed, respectively. As the experimental results show, the performance of SPU is improved over GPU and the energy consumption of SPU is reduced compared to GPU. The execution of the whole AlexNet and VGG-19 on SPU is \(2.29\times\) faster on average than on GPU, and the energy consumption of SPU is reduced by \(5.76\times\) on average compared to GPU. SPU achieves good performance while maintaining good flexibility when running various types of layers in CNN.


Acknowledgements

This work was supported by the National Key Research and Development Plan of China under Grant no. 2017YFC0803401, the National Natural Science Foundation of China under Grant nos. 61872335 and 61732018, the International Partnership Program of Chinese Academy of Sciences under Grant no. 171111KYSB20170032.

References

  1. Albericio, J., Judd, P., Hetherington, T., et al.: Cnvlutin: ineffectual-neuron-free deep neural network computing. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 1–13 (2016).  https://doi.org/10.1109/ISCA.2016.11
  2. Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition, pp. 1–6 (2006)
  3. Chen, T., Du, Z., Sun, N., et al.: DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pp. 269–284. ACM, New York (2014).  https://doi.org/10.1145/2541940.2541967
  4. Chen, Y., Luo, T., Liu, S., et al.: Dadiannao: a machine-learning supercomputer. In: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622 (2014).  https://doi.org/10.1109/MICRO.2014.58
  5. Chen, Y.H., Emer, J., Sze, V.: Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379 (2016).  https://doi.org/10.1109/ISCA.2016.40
  6. Chetlur, S., Woolley, C., Vandermersch, P., et al.: cuDNN: efficient primitives for deep learning. CoRR arxiv: abs/1410.0759 (2014)
  7. Coates, A., Huval, B., Wang, T., et al.: Deep learning with COTS HPC systems. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28. ICML’13, pp. III-1337–III-1345. JMLR.org (2013). http://dl.acm.org/citation.cfm?id=3042817.3043086
  8. Dennis, J.B.: First version of a data flow procedure language. In: Programming Symposium, Proceedings Colloque Sur La Programmation, pp. 362–376. Springer, London (1974). http://dl.acm.org/citation.cfm?id=647323.721501 Google Scholar
  9. Fan, D., Zhang, H., Wang, D., Ye, X., Song, F., Li, G., Sun, N.: Godson-T: an efficient many-core processor exploring thread-level parallelism. IEEE Micro 32(2), 38–47 (2012)CrossRefGoogle Scholar
  10. Fan, D., Li, W., Ye, X., Wang, D., Zhang, H., Tang, Z., Sun, N.: SmarCO: an efficient many-core processor for high-throughput applications in datacenters. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 596–607. IEEE, New York (2018)Google Scholar
  11. Fu, H., Gan, L., Clapp, R.G., Ruan, H., Pell, O., Mencer, O., Flynn, M., Huang, X., Yang, G.: Scaling reverse time migration performance through reconfigurable dataflow engines. IEEE Micro 34(1), 30–40 (2014)CrossRefGoogle Scholar
  12. Govindan, M.S.S., Burger, D., Keckler, S.: Trips: a distributed explicit data graph execution (edge) microprocessor. In: 2007 IEEE Hot Chips 19 Symposium (HCS), pp. 1–13 (2007).  https://doi.org/10.1109/HOTCHIPS.2007.7482519
  13. Gu, L., Li, X., Siegel, J.: An empirically tuned 2D and 3D FFT library on CUDA GPU. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, pp. 305–314. ACM, New York (2010).  https://doi.org/10.1145/1810085.1810127
  14. Jouppi, N.P., Young, C., Patil, N., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pp. 1–12. ACM, New York (2017).  https://doi.org/10.1145/3079856.3080246
  15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc., Red Hook (2012)Google Scholar
  16. Liang, Y., Lu, L., Xiao, Q., Yan, S.: Evaluating fast algorithms for convolutional neural networks on FPGAS. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1 (2019).  https://doi.org/10.1109/TCAD.2019.2897701
  17. Lu, L., Liang, Y.: SpWA: an efficient sparse Winograd convolutional neural networks accelerator on FPGAS. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6 (2018).  https://doi.org/10.1109/DAC.2018.8465842
  18. Lu, W., Yan, G., Li, J., et al.: FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 553–564 (2017).  https://doi.org/10.1109/HPCA.2017.29
  19. Nguyen, A., Satish, N., Chhugani, J., et al.: 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13 (2010).  https://doi.org/10.1109/SC.2010.2
  20. Oriato, D., Tilbury, S., Marrocu, M., Pusceddu, G.: Acceleration of a meteorological limited area model with dataflow engines. In: 2012 Symposium on Application Accelerators in High Performance Computing (SAAHPC), pp. 129–132. IEEE, New York (2012)Google Scholar
  21. Pratas, F., Oriato, D., Pell, O., Mata, R.A., Sousa, L.: Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines. In: IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 177–180. IEEE, New York (2013)Google Scholar
  22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  23. Swanson, S., Schwerin, A., Mercaldi, M., et al.: The wavescalar architecture. ACM Trans. Comput. Syst. 25(2), 4:1–4:54 (2007).  https://doi.org/10.1145/1233307.1233308 CrossRefGoogle Scholar
  24. Sze, V., Chen, Y., Yang, T., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. CoRR arxiv: abs/1703.09039 (2017)
  25. Tan, X., Ye, X.C., Shen, X.W., Xu, Y.C., Wang, D., Zhang, L., Li, W.M., Fan, D.R., Tang, Z.M.: A pipelining loop optimization method for dataflow architecture. J. Comput. Sci. Technol. 33(1), 116–130 (2018).  https://doi.org/10.1007/s11390-017-1748-5 CrossRefGoogle Scholar
  26. Venkataramanan, G., Sarma, D.D.: Compute and redundancy solution for Tesla’s full self driving computer. In: Hotchips 2019 (2019)Google Scholar
  27. Verdoscia, L., Vaccaro, R., Giorgi, R.: A matrix multiplier case study for an evaluation of a configurable dataflow-machine. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, CF ’15, pp. 63:1–63:6. ACM, New York (2015).  https://doi.org/10.1145/2742854.2747287
  28. Xiao-Wei, S., Xiao-Chun, Y., Da, W., et al.: Optimizing dataflow architecture for scientific applications. Chin. J. Comput. 9, 2181–2196 (2017)Google Scholar
  29. Ye, X., Fan, D., Sun, N., et al.: SimICT: a fast and flexible framework for performance and power evaluation of large-scale architecture. In: International Symposium on Low Power Electronics and Design (ISLPED), pp. 273–278 (2013).  https://doi.org/10.1109/ISLPED.2013.6629308
  30. Zhang, C., Li, P., Sun, G., et al.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, pp. 161–170. ACM, New York (2015).  https://doi.org/10.1145/2684746.2689060

Copyright information

© China Computer Federation (CCF) 2019

Authors and Affiliations

  • Xiaochun Ye (1, 2)
  • Taoran Xiang (1, 2)
  • Xu Tan (1, 2)
  • Yujing Feng (1, 2)
  • Haibin Wu (1)
  • Meng Wu (1)
  • Dongrui Fan (1, 2) (corresponding author)

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
