
ReBNN: in-situ acceleration of binarized neural networks in ReRAM using complementary resistive cell

  • Linghao Song
  • You Wu
  • Xuehai Qian
  • Hai Li
  • Yiran Chen
Regular Paper

Abstract

Resistive random access memory (ReRAM) has been proven capable of efficiently performing in-situ matrix-vector computations in convolutional neural network (CNN) processing. These computations are often conducted on multi-level cells (MLCs), which have limited precision and hence show significant vulnerability to noise. The binarized neural network (BNN) is a hardware-friendly model that can dramatically reduce computation and storage overheads. However, XNOR, the key operation in BNNs, cannot be directly computed in-situ in ReRAM because of its nonlinear behavior. To enable real in-situ processing of BNNs in ReRAM, we modify the BNN algorithm to enable direct computation of XNOR, POPCOUNT and POOL with ReRAM cells. We also propose the complementary resistive cell (CRC) design to efficiently conduct XNOR operations, and optimize the pipeline design with decoupled buffer and computation stages. Our results show that our scheme, namely ReBNN, improves the system performance by \(25.36\times\) and the energy efficiency by \(4.26\times\) compared to a conventional ReRAM-based accelerator, and ensures a throughput higher than that of state-of-the-art BNN accelerators. The correctness of the modified algorithm is also validated.

Keywords

BNNs · ReRAM · Accelerator

1 Introduction

Since winning the 2012 ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), convolutional neural networks (CNNs) (Krizhevsky et al. 2012) have become the main drivers of revolutions in various application domains, such as computer vision (Liu et al. 2016, 2018d; Ren et al. 2015), healthcare (Ching et al. 2018; Faust et al. 2018; Miotto et al. 2017) and scientific computing (Baldi et al. 2014; Goh et al. 2017; Song et al. 2019a). To accelerate CNNs, various hardware acceleration solutions have been proposed, focusing on computation efficiency improvement (Esmaeilzadeh et al. 2012), data flow optimization (Chen et al. 2016), or both (Chen et al. 2014b). Many acceleration designs based on ASICs (Akhlaghi et al. 2018; Chen et al. 2014a, b; Du et al. 2015; Esmaeilzadeh et al. 2012; Jouppi et al. 2017; Kwon et al. 2018; Parashar et al. 2017; Sharma et al. 2018; Song et al. 2018b, 2019b; Yazdanbakhsh et al. 2018; Yu et al. 2017) and FPGAs (Guan et al. 2017a, b; Han et al. 2017; Mahajan et al. 2016; Qiu et al. 2016; Wang et al. 2016; Zhang et al. 2015, 2018, 2016) have been proposed.

Because of the large memory capacity required by CNN computations and the heavy data traffic between computing units and memories, processing-in-memory (PIM) has become an attractive approach for CNN acceleration. Emerging resistive random access memory (ReRAM) (Mao et al. 2015, 2017, 2018b; Niu et al. 2010, 2012; Wong et al. 2012; Xu et al. 2013, 2015) is one of the promising technologies to enable PIM. In particular, ReRAM can be organized into an array of weights (synapses) (Kuzum et al. 2013; Woo et al. 2018; Yu et al. 2015, 2011a, 2013) that can efficiently perform matrix-vector multiplications. Prime (Chi et al. 2016), ISAAC (Shafiee et al. 2016) and PipeLayer (Song et al. 2017) are three representative ReRAM-based neural network accelerators, and other ReRAM-based accelerators for neural networks include (Chen and Li 2018a; Chen et al. 2018b, 2019a, b; Cheng et al. 2017; Ji et al. 2018a, 2019; Li et al. 2018; Lin et al. 2019a, b; Liu et al. 2018c, 2015; Mao et al. 2019, 2018a; Qiao et al. 2018; Sun et al. 2018a, b; Tang et al. 2017; Wang et al. 2018). Ji et al. (2016, 2018b) integrated ReRAM-based accelerators into system-level solutions. Besides neural networks, processing in ReRAM is also a potential solution for graph processing (Dai et al. 2019; Song et al. 2018a) and genome sequencing (Chen et al. 2020; Huangfu et al. 2018; Zokaee et al. 2019).

Although ReRAM has demonstrated great potential in accelerating neural networks, there are still many obstacles on the way to its commercialization. For example, ReRAM-based PIM often conducts its operations on multi-level cells (MLCs), which have limited precision and hence show significant vulnerability to noise; fault-tolerant or low-precision designs are required to compensate for the loss of computation accuracy (Liu et al. 2018a, b, 2019a, b; Mohanty et al. 2017; Wang et al. 2017). As ReRAM devices continue to scale, this reliability issue will become even more prominent. It is therefore necessary to explore new neural network models that are friendly to hardware implementations of neural network accelerators with high computing efficiency and robustness.

The binarized neural network (BNN) (Hubara et al. 2016; Rastegari et al. 2016) is a promising solution to reduce the high computational complexity and data storage requirements incurred by conventional full-precision CNNs. In a BNN, feature maps and kernel weights are all binary, i.e., \(\{+1,-1\}\) in the number domain or \(\{1,0\}\) in the logic domain. BNNs are hardware-friendly because they occupy much smaller memory and require much simpler operations than conventional CNNs. BNNs are naturally suitable for ReRAM-based PIM because the data precision needed by BNNs is low (i.e., binary) and the corresponding binary operations are very resilient to noise. Recently, Tang et al. (2017) proposed a ReRAM-based BNN accelerator. However, that work simply treats a BNN as a low-precision CNN and develops only a pseudo in-situ acceleration that does not fully utilize the ReRAM arrays. Apart from matrix-vector multiplications, other computations such as normalization and pooling still have to be implemented with peripheral circuits rather than in the ReRAM array.

In this work, we propose ReBNN—a software-hardware co-design for real in-situ acceleration of BNNs in ReRAM. The main contributions of our work are:
  • We modify the original algorithm to allow highly efficient hardware implementations. Specifically, we reformulate the original MAX-NORM-SIGN operation flow to NORM/COMP-OR so that the MAX operation can directly work on binaries and can be simply implemented using OR gates. The MAX implementation in ReRAM also becomes more noise resistant. This algorithm change is the key enabler to support BNNs in ReRAM.

  • We design a novel complementary resistive cell (CRC) as the basic building block for the processing unit design. The XNOR-POPCOUNT operation is performed in the CRC array, and the sense amplifier is also redesigned to perform the normalization in BNNs.

  • We propose a decoupled buffer-computation execution to support multiple instances of layer-wise pipelined execution of BNNs. Both peak computation performance and throughput are improved.

  • We evaluate our design on three datasets, MNIST (LeCun 1998), CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011). The results show that ReBNN improves the performance by \(25.36\times\) and the energy efficiency by \(4.26\times\), that its throughput is higher than that of state-of-the-art BNN accelerators, and that the correctness of the modified algorithm is preserved.

Our paper is organized as follows: Sect. 2 introduces the background about neural network (including both CNNs and BNNs) acceleration, and the motivation for BNN acceleration; Sect. 3 proposes a modified BNN operation flow for hardware efficiency enhancement; Sect. 4 describes the complementary resistive cell design for real in-situ processing of BNNs; Sect. 5 discusses the architecture and pipeline design for ReBNN; Sect. 6 presents our experimental results; Sect. 7 concludes this work.

2 Background and motivation

2.1 Neural networks

2.1.1 Convolutional neural networks

In essence, a CNN can be considered as a directed acyclic graph (DAG) of computation layers. For the convolutional (CONV) layers, the computation can be defined as:
$$\begin{aligned} \begin{aligned}&{\mathbf {O}}[u] = f\left( \sum \limits _{v=0}^{N_i-1} {\mathbf {I}}[v]*{\mathbf {W}}[u][v] + {\mathbf {b}}[u] \right) , \\&\forall u\in \{u | 0 \leqslant u < N_o\}, \end{aligned} \end{aligned}$$
(1)
where \({{\mathbf {O}}}\), \({\mathbf {W}}\), \({\mathbf {b}}\) and \({\mathbf {I}}\) are the output feature maps, kernels, bias and input feature maps, respectively. \({\mathbf {I}}\) has three dimensions, \(N_i\times (H_i\times L_i)\), where \(N_i\), \(H_i\) and \(L_i\) are the number of channels, the height and the length of the input feature maps, respectively. Similarly, \({{\mathbf {O}}}\) has three dimensions, \(N_o\times (H_o\times L_o)\), where \(N_o\), \(H_o\) and \(L_o\) are the number of channels, the height and the length of the output feature maps, respectively. \({\mathbf {W}}\) has four dimensions, \(N_o\times N_i\times (K\times K)\), where K is the kernel size. \(f(\cdot )\) is a nonlinear activation function and \(*\) denotes a 2D convolution (\((H_i\times L_i)*(K\times K)\)). The fully-connected (FC), or inner product, layer is another category of weighted layer; its core operation is matrix multiplication. FC can also be viewed as a special CONV in which the second and third dimensions of the feature maps are 1, i.e., the dimensions of \({\mathbf {I}}\) and \({{\mathbf {O}}}\) are \(N_i\times (1\times 1)\) and \(N_o\times (1\times 1)\), respectively.
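
As a concrete reference for Eq. (1), a minimal NumPy sketch is given below; the no-padding, stride-1 layout and the choice of ReLU as \(f(\cdot )\) are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def conv_layer(I, W, b, f=lambda x: np.maximum(x, 0)):
    """Reference implementation of Eq. (1).
    I: input feature maps,  shape (Ni, Hi, Li)
    W: kernels,             shape (No, Ni, K, K)
    b: bias,                shape (No,)
    Returns O with shape (No, Hi-K+1, Li-K+1) (no padding, stride 1)."""
    Ni, Hi, Li = I.shape
    No, _, K, _ = W.shape
    Ho, Lo = Hi - K + 1, Li - K + 1
    O = np.zeros((No, Ho, Lo))
    for u in range(No):                      # each output channel
        for i in range(Ho):
            for j in range(Lo):
                window = I[:, i:i+K, j:j+K]  # (Ni, K, K) receptive field
                O[u, i, j] = np.sum(window * W[u]) + b[u]
    return f(O)                              # nonlinear activation f(.)
```
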
Fig. 1

Data flows in CNN and BNN layers. a In CNNs, all tensors (input feature maps, output feature maps, kernels and intermediate feature maps) are floating-point. b In BNNs, input feature maps, output feature maps and kernels are binarized. The intermediate feature maps are integers that count the '1's in a binary stream, and they are then converted back to the binary format

Pooling (POOL) is another common layer type in neural networks. A pooling layer performs down-sampling. A max POOL passes the maximum element of a pooling window in each channel. Assuming the window size is \(K\times K\), max_pool(\(\cdot\)) is defined as
$$\begin{aligned} \begin{aligned}&y = \max _{ii,jj} \{{\mathbf {I}}[v][K\cdot i+ii][K\cdot j+jj]\}, \\&\forall v\in \{v | 0\leqslant v< N_i\},\\&\forall i\in \{i | 0\leqslant i< H_o\}, ~\forall j\in \{j | 0\leqslant j< L_o\}, \\&\forall ii\in \{ii | 0\leqslant ii< K\}, ~\forall jj\in \{jj | 0\leqslant jj < K\}. \end{aligned} \end{aligned}$$
(2)
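
A matching sketch of the non-overlapping \(K\times K\) max pooling in Eq. (2) (window size equal to stride, as the index arithmetic above assumes; purely illustrative):

```python
import numpy as np

def max_pool(I, K):
    """Non-overlapping KxK max pooling per Eq. (2).
    I: feature maps of shape (Ni, Hi, Li); Hi and Li are assumed divisible by K."""
    Ni, Hi, Li = I.shape
    Ho, Lo = Hi // K, Li // K
    O = np.zeros((Ni, Ho, Lo))
    for v in range(Ni):
        for i in range(Ho):
            for j in range(Lo):
                O[v, i, j] = I[v, K*i:K*i+K, K*j:K*j+K].max()
    return O
```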

2.1.2 Binarized neural networks

In binarized neural networks, feature maps and kernel weights are binary, i.e. \(\{+1,-1\}\) in number domain or \(\{1,0\}\) in logic domain. BNNs are hardware-friendly as they require smaller memory footprint and simpler operations compared to CNNs. XNOR-Net (Rastegari et al. 2016) and BinaryNet (Hubara et al. 2016) are two promising BNNs. Compared with BinaryNet, XNOR-Net requires extra non-binary operations [i.e., scaling factors (Rastegari et al. 2016) related transformations] in each layer. Thus, we focus on BinaryNet in this paper.

If we encode \(\{+1,-1\}\) with \(\{1,0\}\), logic XNOR can be used to perform multiplication. This is illustrated in Table 1, where A (\(\in \{+1,-1\}\)) is encoded with B (\(\in \{1,0\}\)); then (\(B_1\) XNOR \(B_2\)) is identical to (\(A_1 \times A_2\)). Based on this property, BNNs can be significantly accelerated using XNOR, as multiplications constitute the largest share of operations in BNNs. In this encoding scheme, an addition is performed by a POPCOUNT operation, which counts the population (number) of '1's in a binary stream. Thus the convolutions in BNNs can be realized with XNOR and POPCOUNT.
Table 1

XNOR operation to perform multiplication

\(A_1\)   \(A_2\)   \(B_1\)   \(B_2\)   \(A_1 \times A_2\)   \(B_1\) XNOR \(B_2\)
\(+1\)    \(+1\)    1         1         \(+1\)               1
\(+1\)    \(-1\)    1         0         \(-1\)               0
\(-1\)    \(+1\)    0         1         \(-1\)               0
\(-1\)    \(-1\)    0         0         \(+1\)               1
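
As a sanity check on this encoding, the following NumPy sketch (illustrative only, not the accelerator implementation) confirms that XNOR-POPCOUNT over \(\{1,0\}\) vectors reproduces the \(\{+1,-1\}\) dot product: for vectors of length N, the dot product equals \(2\cdot\)POPCOUNT\(-N\).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.choice([+1, -1], size=1024)        # activations in the number domain
w = rng.choice([+1, -1], size=1024)        # weights in the number domain
b_a = (a + 1) // 2                         # encode {+1,-1} -> {1,0}
b_w = (w + 1) // 2

xnor = np.logical_not(np.logical_xor(b_a, b_w))   # elementwise XNOR
popcount = int(xnor.sum())                        # number of '1's
N = a.size

# Each XNOR '1' contributes +1 and each '0' contributes -1 to the dot product.
assert int(np.dot(a, w)) == 2 * popcount - N
```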

Normalization (NORM) is a necessary operation in BNNs, which scales and normalizes data. Here NORM is an element-wise operation, which can be defined as:
$$\begin{aligned} y=\frac{x-\mu }{\sqrt{\sigma ^2+\epsilon }}\gamma +\beta , \end{aligned}$$
(3)
where x and y are the input and output values, and \(\mu\), \(\sigma\), \(\epsilon\), \(\gamma\), \(\beta\) are constant parameters (\(\mu\) and \(\sigma\) are the statistical mean and standard deviation, respectively; \(\gamma\) and \(\beta\) are trained parameters; and \(\epsilon\) is a small floating-point term that avoids division by zero). Feature maps are linearly shifted and scaled before they are binarized by their signs, i.e., operated on by the function Sign\((\cdot )\),
$$\begin{aligned} y= \left\{ \begin{array}{ll} +1, &{}\quad x > 0\\ -1, &{}\quad \text {else} \end{array} \right. \end{aligned}$$
(4)
which binarizes the intermediate result to \(+1\) or \(-1\) (i.e., '1' or '0' in the logic domain).

Figure 1 compares the data flows of a CNN layer and a BNN layer. In the CNN layer, the cascaded CONVOLUTION-ACTIVATION operations perform the computation in Eq. (1), and POOLING is the last operation that generates the output feature maps. In a BNN layer, CONVOLUTION is replaced by XNOR-POPCOUNT. Another noticeable difference is that POOLING is no longer the last operation before the output feature maps. The order of POOLING, NORM and SIGN is subtle and has implications for hardware implementations; the details are discussed in Sect. 3.

2.2 Neural network acceleration in ReRAM

Resistive random access memory (ReRAM) is an emerging non-volatile memory with the appealing properties of high density, fast read access and low leakage power. ReRAM has been considered a promising candidate for main memory (Xu et al. 2015), where a ReRAM cell switches between a high resistance state ('0') and a low resistance state ('1'). The crossbar architecture (Hu et al. 2016) and the multi-level cell (MLC) (Yu et al. 2011b) were originally proposed to improve the density and reduce the cost of ReRAM. However, these two techniques also inspired storing matrix weights in ReRAM cells and performing in-situ matrix-vector multiplications. When voltages are applied to the wordlines of a weighted ReRAM array (\({\mathbf {W}}\)), the bitlines accumulate currents in parallel according to Kirchhoff's current law (KCL), i.e., \(I={\mathbf {W}}U\): the weights are represented by the resistance states of the ReRAM cells in the array, and the bitline current accumulation performs the matrix-vector multiplication in an analog manner. Many ReRAM-based neural network accelerators have been proposed in recent years (Chi et al. 2016; Hu et al. 2016; Liu et al. 2015; Shafiee et al. 2016; Song et al. 2017).
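
As a minimal illustration of this in-situ matrix-vector multiplication (a sketch with arbitrary conductance and voltage values, ignoring device non-idealities such as wire resistance and variation):

```python
import numpy as np

# Conductance matrix W (one row per bitline, one column per wordline) and
# input voltages U applied to the wordlines; the values are arbitrary examples.
W = np.array([[1.0e-6, 0.2e-6, 0.5e-6],
              [0.8e-6, 1.0e-6, 0.1e-6]])   # siemens
U = np.array([0.4, 0.2, 0.1])              # volts

# Kirchhoff's current law sums the per-cell currents W[i, j] * U[j] on each
# bitline, so the bitline current vector is I = W @ U.
I = W @ U
```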

However, the current ReRAM-based accelerators (Chi et al. 2016; Shafiee et al. 2016; Song et al. 2017) are customized for CNNs and cannot directly run BNNs because of the differences between BNNs and conventional CNNs. We argue that BNNs are a perfect fit for ReRAM. First, BNNs by nature require much lower precision. The current designs for CNNs rely on multi-bit weights, but realizing multi-bit precision in ReRAM cells is difficult even with MLC because of large device variations; complex iterative tuning schemes (Alibart et al. 2012) or conversion algorithms (Hu et al. 2016) must be used. This is why ReRAM only provides limited precision. Moreover, the analog computations on ReRAM are vulnerable to various noises (Lee et al. 2011) and variations (Chang et al. 2012). BNNs, however, are in general not affected by these factors. Second, BNNs avoid analog/digital conversions, which incur high energy consumption and large chip area. In BNNs, the ReRAM cells used for computation are the same as the cells used for storage (i.e., with two states '0' and '1'), which are easy to program and immune to noises and variations. Nonetheless, supporting BNNs in ReRAM is non-trivial because a ReRAM array cannot intrinsically implement XNOR operations.

3 Hardware oriented BNN flow

In the default BNN operation flow (Hubara et al. 2016), to obtain the binary output feature maps, the intermediate integer feature maps are processed by POOLING(MAX)-NORM-SIGN, as shown in Fig. 2a. A direct implementation of this operation sequence requires integer comparators and buffers for MAX and a floating-point multiply-accumulate (MAC) unit for NORM. The associated data format conversions incur considerable hardware cost. In ReRAM implementations, this sequence becomes problematic because the floating-point MAC operations of NORM cannot be directly mapped to ReRAM. Even if we use integers for normalization, it is still challenging to perform MAX in ReRAM: current buffers and analog comparators are still needed, and they are vulnerable to circuit noise.
Fig. 2

Two equivalent operation flows: a MAX-NORM-SIGN and b NORM/COMP-OR

To resolve these challenges, we modify the original algorithm to allow more efficient hardware implementations. Consider Eq. (3) discussed in Sect. 2.1.2; combined with the Sign function in Eq. (4), it can be reformulated as:
$$\begin{aligned} y={\mathbf {1}}\cdot \text {COMP}\left( x, \tau \right) , \end{aligned}$$
(5)
where \(\tau\) is a threshold value and \(\tau =\mu -\frac{\beta }{\gamma }\sqrt{\sigma ^2+\epsilon }\). From Eq. (5), we see that NORM-SIGN can be combined as NORM/COMP, where the comparison of x and \(\tau\) is the only needed computation.
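
For completeness, the threshold follows from Eqs. (3) and (4) as below; this sketch assumes \(\gamma >0\), and a negative \(\gamma\) would flip the direction of the comparison.
$$\begin{aligned} \frac{x-\mu }{\sqrt{\sigma ^2+\epsilon }}\,\gamma +\beta>0 \;\Leftrightarrow \; x-\mu >-\frac{\beta }{\gamma }\sqrt{\sigma ^2+\epsilon } \;\Leftrightarrow \; x>\mu -\frac{\beta }{\gamma }\sqrt{\sigma ^2+\epsilon }=\tau . \end{aligned}$$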

With this modification, the operation flow is simplified to NORM/COMP, which can be implemented as a simple comparator. More importantly, the re-organization of the operation flow introduces an additional optimization opportunity. Specifically, if we place MAX after NORM/COMP in BNNs, MAX works directly on binaries and can be implemented by an OR gate. Eventually, the original operation flow (i.e., MAX-NORM/COMP) is changed to NORM/COMP-OR, as shown in Fig. 2b.

While the new operation flow leads to a more efficient hardware implementation, it is critical to show that the function of the new implementation is identical to that of the original implementation. To this end, we propose Lemma 1.

Lemma 1

\(\max (a_i)>\tau \Leftrightarrow \vee (a_i>\tau )\).

Proof

  1. Necessity: \(\max (a_i)>\tau\) \(\Rightarrow\) \(\exists \ i=i_1\) such that \(a_{i_1}=\max (a_i)>\tau\) \(\Rightarrow\) \(\vee (a_i>\tau )\).
  2. Sufficiency: \(\vee (a_i>\tau )\) \(\Rightarrow\) \(\exists \ i=i_2\) such that \(a_{i_2}>\tau\) \(\Rightarrow\) \(\max (a_i)\geqslant a_{i_2}>\tau\). \(\square\)

In summary, the operation flow modification is crucial when mapping BNNs to ReRAM. Although the default flow can be implemented in a CMOS architecture, integer comparators are less efficient than OR gates, and the default MAX implementation in ReRAM requires dynamic analog comparison, which unfortunately is not noise resistant. We will show the effect of the operation flow modification in Sect. 6.3.
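
A small numerical sketch of the equivalence stated in Lemma 1 (the threshold and window values below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 3.0                                    # an arbitrary NORM/COMP threshold
window = rng.integers(-8, 9, size=(2, 2))    # integer pre-activations in a 2x2 pooling window

# Default flow: MAX, then compare against tau (NORM-SIGN folded into one comparison).
default_out = window.max() > tau

# Modified flow: compare every element against tau, then OR (max pooling on binaries).
modified_out = np.any(window > tau)

assert default_out == modified_out
```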

4 Complementary resistive cell (CRC)

4.1 Complementary resistive cell (CRC)

Although ReRAM arrays are efficient at performing vector-matrix multiplications thanks to bitline current summation, it is non-trivial to perform XNOR in the arrays. To support XNOR operations in ReRAM arrays, we propose the complementary resistive cell (CRC). A CRC consists of two resistive cells that are always complementary, i.e., one in the high conductance (low resistance) state and the other in the low conductance (high resistance) state. Using two cells to store one weight is not an over-design: in MLC designs such as Prime (Chi et al. 2016) and PipeLayer (Song et al. 2017), four cells are used to store one weight.
Fig. 3

XNOR operation in the complementary resistive cell (CRC)

Figure 3 shows the structure and functionality of XNOR operations in a CRC. The conductances of the two cells in a CRC are complementary, and the input voltages to the two cells, generated by a buffer and an inverter, are also complementary. In this way, only one cell in a CRC is active at a time. To store a weight \(W=1\) in a CRC, the upper cell is set to the high (H) conductance state while the lower one is set to the low (L) conductance state, as shown in Fig. 3a, c. Similarly, an upper L and a lower H state represent \(W=0\), as shown in Fig. 3b, d. To compute \((X=0)\) XNOR \((W=0)\), the lower cell connected to the inverter is active, so a high current (\(I_\mathrm{H}\)) is generated to represent '1'. By contrast, a low current (\(I_\mathrm{L}\)) represents '0', as shown in Fig. 3d. The other cases can be derived similarly.
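
The behavior in Fig. 3 can be summarized with a small conductance model; the sketch below uses the cell parameters of Table 2 (1 M\(\Omega\)/10 M\(\Omega\) on/off resistance, 0.4 V read voltage) and a mid-point current threshold, which are illustrative choices rather than circuit details from the paper.

```python
G_H, G_L = 1e-6, 1e-7      # high/low conductances (siemens), from the 1 MOhm / 10 MOhm cells
V_READ = 0.4               # read voltage (volts), matching Table 2

def crc_current(x_bit, w_bit):
    """Current drawn by one CRC for input bit x and stored weight bit w.
    The upper cell sees V_READ when x=1 and the lower cell sees V_READ when x=0
    (complementary driver); the two cells store complementary conductances."""
    g_upper = G_H if w_bit == 1 else G_L
    g_lower = G_L if w_bit == 1 else G_H
    v_upper = V_READ if x_bit == 1 else 0.0
    v_lower = 0.0 if x_bit == 1 else V_READ
    return v_upper * g_upper + v_lower * g_lower

# The high-current cases are exactly the XNOR-true cases of Table 1.
for x in (0, 1):
    for w in (0, 1):
        is_high = crc_current(x, w) > V_READ * (G_H + G_L) / 2
        assert is_high == (x == w)
```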

4.2 Fused activation with CRC array

The fused activation with the CRC array finally enables real in-situ computation of BNNs in ReRAM. The proposed architecture is shown in Fig. 4. The key component is the CRC array, which performs the XNOR-POPCOUNT operation. The structure of a CRC array is similar to that of a conventional ReRAM array, except that each cell in a CRC array is a CRC rather than a single resistive cell. As discussed in Sect. 4.1, a CRC can perform XNOR. Since POPCOUNT is the summation of '1's in a bit stream, we can construct a bitline of CRCs to perform XNOR-POPCOUNT in a CRC array. Assume the total number of CRCs on a bitline is N and the number of '1's generated by the CRCs is n; the current on the bitline can then be calculated as:
$$\begin{aligned} I_\text {bit}=N {I}_{\mathrm{L}} + n ({I}_{\mathrm{H}}-{I}_{\mathrm{L}}). \end{aligned}$$
(6)
We design a current-sensing sense amplifier, as shown in Fig. 4d. A similar current sensing scheme was used in a memristor-based threshold logic circuit (Rajendran et al. 2012). Combining Eqs. (3) and (6), we can derive the reference current of the sense amplifier as
$$\begin{aligned} I_\text {ref}=\tau + N {I}_{\mathrm{L}}. \end{aligned}$$
(7)
Note that the reference current \(I_\text {ref}\) does not change during inference, so it is preset in the sense amplifier. An OR gate is placed at the output of each sense amplifier to perform the MAX operation. The driver is a complementary pair of a buffer and an inverter, and the CRCs on the same wordline pair share one driver.
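
A bitline-level sketch tying Eqs. (6) and (7) together is shown below. It is illustrative only: the popcount threshold n_tau assumes that \(\tau\) has already been mapped from the feature domain into an equivalent count when presetting \(I_\text {ref}\), and the per-CRC currents reuse the values from the cell model above.

```python
import numpy as np

I_H, I_L = 400e-9, 40e-9     # per-CRC currents for XNOR=1 / XNOR=0 (0.4 V across 1 MOhm / 10 MOhm)

def bitline_output(x_bits, w_bits, n_tau):
    """One bitline of CRCs: XNOR-POPCOUNT via current summation, then NORM/COMP in the sense amp.
    x_bits, w_bits: equal-length 0/1 arrays; n_tau: popcount threshold derived from tau."""
    xnor = np.logical_not(np.logical_xor(x_bits, w_bits)).astype(int)
    N, n = xnor.size, int(xnor.sum())
    I_bit = N * I_L + n * (I_H - I_L)          # Eq. (6)
    I_ref = N * I_L + n_tau * (I_H - I_L)      # Eq. (7), with tau expressed as a count
    return int(I_bit > I_ref)                  # sense amplifier output bit

# Max pooling after NORM/COMP is just an OR over the bits of a pooling window.
rng = np.random.default_rng(2)
w = rng.integers(0, 2, size=128)
window_bits = [bitline_output(rng.integers(0, 2, size=128), w, n_tau=64) for _ in range(4)]
pooled = int(any(window_bits))                 # OR gate at the sense-amplifier outputs
```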

Previous ReRAM-based accelerators for BNNs, such as Sun et al. (2018b) and Tang et al. (2017), failed to perform real in-situ computation in ReRAM. In those designs, ReRAM arrays are only used as processing engines for low-precision multiplication, and extra logic components are deployed for the nonlinear activation and binary quantization. In this work, we enable real in-situ computation in ReRAM for BNN processing through a software/hardware co-design approach, including (1) the modified hardware-oriented operation flow and (2) the fused activation with the CRC array.

5 Architecture and pipeline design

5.1 Architecture

The architecture of ReBNN is shown in Fig. 4. The overall ReBNN accelerator is a processing-in-memory (PIM) architecture. As shown in Fig. 4a, each ReRAM array is configured with a buffer for input data, complementary wordline drivers, bitline sense amplifiers and an output data buffer. The complementary ReRAM cell (CRC) array is the key component for binarized computation. The complementary drivers apply complementary activations on the two wordlines corresponding to one CRC. As Fig. 4 shows, in a CRC array a bit stream is fed to the wordlines and XNORed in parallel. Moreover, the bitlines of a CRC array simultaneously generate results for different channels of the output feature maps. In the current sense amplifier, shown in Fig. 4d, a reference current \(I_{\text {ref}}\) according to Eq. (7) is preset, so the normalization and binarization in BNNs are performed as part of a read on the CRC array. Down-sampling, i.e., max pooling, is performed as an OR operation on the output bit stream, thanks to the modified operation flow. Thus, a CRC array can both store the trained weights and perform the computations of BNNs. The buffers between CRC arrays coordinate the movement of input and output feature maps.
Fig. 4

The architecture design of ReBNN. a Overall architecture, b the complementary ReRAM cell (CRC) array, c the complementary driver and d the sense amplifier with a preset reference current \(I_{\text {ref}}\)

5.2 Pipeline

While the layer-wise pipeline in Song et al. (2017) helps to improve the throughput of neural network training, it buffers the intermediate feature maps of a whole batch, which is not necessary in BNNs because of their smaller memory footprint. In our design, the feature maps between layers are cached in buffers rather than stored back to main memory. To further improve the throughput, we pipeline the instances in a batch with decoupled buffer and computation stages, so that different inputs can enter the model consecutively.
Fig. 5

Pipeline design in ReBNN. Buffers between layers are used to coordinate the on-chip movement of input and output feature maps. The computation for one layer is a read operation on the CRC array with wordlines activated complementarily

We design a pipeline to further boost the performance of BNN processing. Figure 5 shows the pipeline design. The processing of one layer is decoupled into two stages: buffer and computation. In the pipeline, a layer i only needs to communicate with the previous layer \(i-1\) and the next layer \(i+1\). The data movement is coordinated by the buffers.

In the buffer stage, as shown in Fig. 5 ❶, the output feature map of layer \(i-1\) in the output buffer (colored yellow) is sent to the input buffer of layer i (colored yellow). Note that for the processing of non-binarized neural networks, the intermediate results of a whole layer cannot be transferred directly on-chip because of their memory footprint and must be staged through extra main memory space, as in Song et al. (2017). The output feature map of layer i is cached in the output buffer (colored red) and sent to the input buffer (colored red) of layer \(i+1\).

In the computation stage, as shown in Fig. 5 ❷, data from the input buffer are applied to the wordlines complementarily by the drivers, and the sense amplifiers perform NORM/COMP on the accumulated XNOR values from the bitlines. Thus, the computation is simply a read from the CRC array with multiple wordlines activated.

During the execution of the whole binarized neural network, a static and regular communication graph is formed between the CRC arrays based on the order of the layers. Note that this pipeline design also benefits the processing of the first layer in BNNs. The feature maps in BNNs are all in binary format except the input feature maps to the first layer, which are images in an 8-bit fixed-point format. With this pipeline design, the first layer can also utilize the binary drivers, which feed the bits from the MSB to the LSB of the fixed-point inputs and accumulate the shifted partial sums cached in the buffer, as sketched below.
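
A sketch of this bit-serial handling of the first layer is given below; it assumes unsigned 8-bit pixel inputs and \(\{+1,-1\}\) first-layer weights, and models each per-bit pass as the same binary-input dot product used elsewhere (illustrative, not the exact driver circuitry).

```python
import numpy as np

rng = np.random.default_rng(3)
pixels = rng.integers(0, 256, size=64)        # unsigned 8-bit inputs to the first layer
weights = rng.choice([+1, -1], size=64)       # binary first-layer weights

acc = 0
for bit in range(7, -1, -1):                  # feed bit-planes from MSB to LSB
    x_bits = (pixels >> bit) & 1              # one binary input vector per bit position
    partial = int(np.dot(x_bits, weights))    # binary-input dot product (one CRC-array pass)
    acc = (acc << 1) + partial                # shift-and-accumulate the partial sums

assert acc == int(np.dot(pixels, weights))    # matches the full-precision result
```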

6 Evaluation

6.1 Evaluation setup

In the evaluation, we use three classic datasets: MNIST (LeCun 1998), CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011). MNIST is a popular dataset of handwritten digits that contains a total of 70,000 samples. CIFAR-10 is a dataset of tiny color images for advanced recognition system design; it contains 10 classes with 6000 images in each class. SVHN is a dataset of real-world street-view house numbers with more than 100,000 samples. On these three datasets, three binarized networks were trained.

The neural networks in this work run on the machine learning toolkit Torch (Collobert et al. 2011) with NVIDIA cuDNN (Chetlur et al. 2014) enabled. The computing system has two Intel Xeon E5-2630 CPUs with 128 GB of main memory, and the GPU used in the evaluation is an NVIDIA GTX TITAN X with 12 GB of GDDR5 graphics memory. For the ReRAM-based accelerator evaluation, we use 128-wordline-by-128-bitline ReRAM arrays at a 30 nm process node, and we modified NVSim (Dong et al. 2012) and NeuroSim (Chen et al. 2018c) for the performance and energy evaluation. The parameters of the ReBNN accelerator are listed in Table 2.
Table 2

ReBNN parameters

ReRAM cell
   Cell on resistance    1.000 M\(\Omega\)
   Cell off resistance   10.00 M\(\Omega\)
   Write voltage         1.000 V
   Read voltage          0.400 V
ReRAM array
   Array size            \(128\times 128\)
   Write latency         23.93 ns
   Read latency          5.472 ns
   Write energy          52.23 nJ
   Read energy           25.15 nJ
   Leakage power         32.75 mW

Table 3 shows the architectures of the three BNN models, along with the weight size and output feature map size of each layer. The BNN trained on MNIST is a 4-layer fully-connected network, while the BNNs trained on CIFAR-10 and SVHN are networks with 9 weighted layers, where the first 6 are convolutional layers and the last 3 are fully-connected layers. A max pooling follows the 2nd, 4th and 6th convolutional layers (conv2, conv4 and conv6). Note that the memory footprints in this table are in bits, and the largest weight size of a single layer is 8.0 Mbits (FC1, CIFAR-10). For the BNN on CIFAR-10, the total weight is only 13.4 Mbits and the total output feature maps occupy only 282 Kbits.
Table 3

Architecture and memory footprint of three binarized neural networks

MNIST
   Layer   Kernel       Weight bits   OFmap bits
   FC1     2048,784     1.5 M         2.0 K
   FC2     2048,2048    4.0 M         2.0 K
   FC3     2048,2048    4.0 M         2.0 K
   FC4     10,2048      20 K          10

CIFAR-10
   Layer          Kernel                   Weight bits   OFmap bits
   conv1          128,3, \(3\times 3\)     3.4 K         128 K
   conv2 (pool)   128,128, \(3\times 3\)   144 K         32 K
   conv3          256,128, \(3\times 3\)   288 K         64 K
   conv4 (pool)   256,256, \(3\times 3\)   576 K         16 K
   conv5          512,256, \(3\times 3\)   1.1 M         32 K
   conv6 (pool)   512,512, \(3\times 3\)   2.3 M         8 K
   FC1            1024,8192                8.0 M         1 K
   FC2            1024,1024                1.0 M         1 K
   FC3            10,1024                  10 K          10

SVHN
   Layer          Kernel                   Weight bits   OFmap bits
   conv1          64,3, \(3\times 3\)      1.7 K         64 K
   conv2 (pool)   64,64, \(3\times 3\)     36 K          16 K
   conv3          128,64, \(3\times 3\)    72 K          32 K
   conv4 (pool)   128,128, \(3\times 3\)   144 K         8 K
   conv5          256,128, \(3\times 3\)   288 K         16 K
   conv6 (pool)   256,256, \(3\times 3\)   576 K         4 K
   FC1            1024,4096                4.0 M         1 K
   FC2            1024,1024                1.0 M         1 K
   FC3            10,1024                  10 K          10

6.2 Performance and energy efficiency

We compare the performance of ReBNN to an ISAAC baseline, a state-of-the-art multi-level ReRAM-based accelerator for deep neural networks. As shown in Fig. 6a, ReBNN w/o pipeline is the ReRAM-based accelerator with the hardware-oriented BNN flow (discussed in Sect. 3) and the CRC design (discussed in Sect. 4), and ReBNN w/ pipeline is the design further configured with the pipeline (discussed in Sect. 5). The performance is normalized to the ISAAC baseline; the average performance improvement of our design is \(11.42\times\) without the pipeline and \(25.36\times\) with the pipeline. Figure 6b shows the energy efficiency improvement: ReBNN improves the energy efficiency by \(4.26\times\) on average. The improvements come from the binarized computation and data format, and our software-hardware co-design is what enables the in-situ execution of BNNs in ReRAM. In addition, the pipeline further improves performance, because the buffers for pipelining data in BNNs are much smaller than those in full-precision or fixed-point neural networks, so memory access and computation can be finely decoupled in the layer-wise pipeline design.
Fig. 6

a Normalized performance of ReBNN on MNIST, CIFAR-10 and SVHN, b normalized energy efficiency and c the comparison of frame-per-second (FPS) of state-of-the-art accelerators and ReBNN

We also compare the throughput, i.e., frames per second (FPS), of ReBNN with popular state-of-the-art neural network accelerators: ShiDianNao (Du et al. 2015), YodaNN (Andri et al. 2016), Eyeriss (Chen et al. 2016), ISAAC (Shafiee et al. 2016), Neurocube (Kim et al. 2016), BNN-FPGA (Li et al. 2017) and XNOR-POP (Jiang et al. 2017), as shown in Fig. 6c. BNN-FPGA and XNOR-POP are two accelerators for BNNs; the throughput of ReBNN is around \(2\times\) that of XNOR-POP. Moreover, as a stand-alone accelerator, ReBNN can be easily scaled to a larger design by duplicating computation units, whereas XNOR-POP, which modifies DRAM peripherals, is not easy to scale up.

6.3 Accuracy of modified operation flow

We study the accuracy of the modified operation flow (i.e., the default operation flow MAX-NORM-SIGN is changed to the modified NORM/COMP-OR) shown in Fig. 2. We retrain the BNNs using the new operation flow.
Fig. 7

Error rate curves on SVHN

Fig. 8

Error rate curves on CIFAR-10

Table 4

Accuracies of BNNs on SVHN and CIFAR-10

             Validation            Training
             SVHN      CIFAR-10    SVHN      CIFAR-10
Original     97.33%    88.42%      99.58%    98.42%
Modified     97.35%    87.02%      99.46%    95.98%

Figure 7 shows the error rate curves for validation and training of the original and modified algorithms on SVHN. The modified algorithm has almost the same curves as the original one, especially for the validation (inference) error. In fact, the modified algorithm reaches a slightly smaller validation error rate, 2.65%, than the original one, 2.67%. The error rate curves of the original and modified BNNs on CIFAR-10 are shown in Fig. 8. While the modified BNN on CIFAR-10 performs slightly worse than the original, it converges along the same trend: the validation error rate of the modified algorithm is 12.98%, while the original algorithm achieves 11.58%. We speculate that the modified operation flow loses the indication of which element in a pooling window is the maximum, which the original flow keeps, and this leads to the error rate gap. However, the validation error rate gap is only 1.40%, which is acceptable and shows that the modified operation flow is practical. The accuracies of the original and modified BNNs are listed in Table 4; the modified hardware-friendly operation flow works properly.

7 Conclusion

Binarized neural networks are more hardware efficient than convolutional neural networks, but BNNs cannot be directly processed in-situ in ReRAM. We modify the original algorithm to fuse the normalization, comparison and pooling in BNNs to allow more efficient hardware implementations. We design a new complementary resistive cell (CRC) as the processing unit to perform XNOR and POPCOUNT in ReRAM. To improve the performance and throughput, we propose a decoupled buffer-computation pipelined execution. In our evaluation, ReBNN improves the performance by \(25.36\times\) and the energy efficiency by \(4.26\times\), and achieves a higher throughput than state-of-the-art BNN accelerators. The correctness of the modified algorithm, which is the software foundation of the ReBNN hardware design, is also validated.


Acknowledgements

This work was supported in part by NSF 1910299, 1717657, DOE DE-SC0018064, AFRL FA8750-18-2-0057, NSF CCF-1657333, CCF-1717754, CNS-1717984, CCF-1750656, and CCF-1919289.

Compliance with ethical standards

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

  1. Akhlaghi, V., Yazdanbakhsh, A., Samadi, K., Gupta, R.K., Esmaeilzadeh, H.: Snapea: predictive early activation for reducing computation in deep convolutional neural networks. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 662–673. IEEE (2018)
  2. Alibart, F., Gao, L., Hoskins, B.D., Strukov, D.B.: High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23(7), 075201 (2012)
  3. Andri, R., Cavigelli, L., Rossi, D., Benini, L.: Yodann: an ultra-low power convolutional neural network accelerator based on binary weights. In: 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 236–241. IEEE (2016)
  4. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)
  5. Chang, M.F., Sheu, S.S., Lin, K.F., Wu, C.W., Kuo, C.C., Chiu, P.F., Yang, Y.S., Chen, Y.S., Lee, H.Y., Lien, C.H., et al.: A high-speed 7.2-ns read-write random access 4-mb embedded resistive ram (ReRAM) macro using process-variation-tolerant current-mode read schemes. IEEE J. Solid State Circuits 48(3), 878–891 (2012)
  6. Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., Temam, O.: Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In: ACM Sigplan Notices, vol. 49, pp. 269–284. ACM (2014a)
  7. Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al.: Dadiannao: a machine-learning supercomputer. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. IEEE Computer Society (2014b)
  8. Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52(1), 127–138 (2016)
  9. Chen, F., Li, H.: Emat: an efficient multi-task architecture for transfer learning using ReRAM. In: Proceedings of the International Conference on Computer-Aided Design, p. 33. ACM (2018a)
  10. Chen, F., Song, L., Chen, Y.: Regan: a pipelined ReRAM-based accelerator for generative adversarial networks. In: 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 178–183. IEEE (2018b)
  11. Chen, P.Y., Peng, X., Yu, S.: Neurosim: a circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37(12), 3067–3080 (2018c)
  12. Chen, F., Song, L., Li, H.: Efficient process-in-memory architecture design for unsupervised GAN-based deep learning using ReRAM. In: Proceedings of the 2019 on Great Lakes Symposium on VLSI, pp. 423–428. ACM (2019a)
  13. Chen, F., Song, L., Li, H.H., Chen, Y.: Zara: a novel zero-free dataflow accelerator for generative adversarial networks in 3d ReRAM. In: Proceedings of the 56th Annual Design Automation Conference 2019, p. 133. ACM (2019b)
  14. Chen, F., Song, L., Li, H., Chen, Y.: Parc: a processing-in-cam architecture for genomic long read pairwise alignment using ReRAM. In: 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC). ACM (2020)
  15. Cheng, M., Xia, L., Zhu, Z., Cai, Y., Xie, Y., Wang, Y., Yang, H.: Time: a training-in-memory architecture for memristor-based deep neural networks. In: Proceedings of the 54th Annual Design Automation Conference 2017, p. 26. ACM (2017)
  16. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., Shelhamer, E.: CUDNN: efficient primitives for deep learning. arXiv:1410.0759 (2014)
  17. Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y.: Prime: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: ACM SIGARCH Computer Architecture News, vol. 44, pp. 27–39. IEEE Press (2016)
  18. Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K., Kalinin, A.A., Do, B.T., Way, G.P., Ferrero, E., Agapow, P.M., Zietz, M., Hoffman, M.M., et al.: Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15(141), 20170387 (2018)
  19. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a matlab-like environment for machine learning. Tech. rep. (2011)
  20. Dai, G., Huang, T., Wang, Y., Yang, H., Wawrzynek, J.: Graphsar: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 120–126. ACM (2019)
  21. Dong, X., Xu, C., Xie, Y., Jouppi, N.P.: Nvsim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 31(7), 994–1007 (2012)
  22. Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y., Temam, O.: Shidiannao: shifting vision processing closer to the sensor. In: ACM SIGARCH Computer Architecture News, vol. 43, pp. 92–104. ACM (2015)
  23. Esmaeilzadeh, H., Sampson, A., Ceze, L., Burger, D.: Neural acceleration for general-purpose approximate programs. In: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 449–460. IEEE Computer Society (2012)
  24. Faust, O., Hagiwara, Y., Hong, T.J., Lih, O.S., Acharya, U.R.: Deep learning for healthcare applications based on physiological signals: a review. Comput. Methods Programs Biomed. 161, 1–13 (2018)
  25. Goh, G.B., Hodas, N.O., Vishnu, A.: Deep learning for computational chemistry. J. Comput. Chem. 38(16), 1291–1307 (2017)
  26. Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., Cong, J.: FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In: 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–159. IEEE (2017a)
  27. Guan, Y., Yuan, Z., Sun, G., Cong, J.: FPGA-based accelerator for long short-term memory recurrent neural networks. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634. IEEE (2017b)
  28. Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)
  29. Hu, M., Strachan, J.P., Li, Z., Grafals, E.M., Davila, N., Graves, C., Lam, S., Ge, N., Yang, J.J., Williams, R.S.: Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd Annual Design Automation Conference, p. 19. ACM (2016)
  30. Huangfu, W., Li, S., Hu, X., Xie, Y.: Radar: a 3D-ReRAM based DNA alignment accelerator architecture. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2018)
  31. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems, pp. 4107–4115 (2016)
  32. Ji, Y., Zhang, Y., Li, S., Chi, P., Jiang, C., Qu, P., Xie, Y., Chen, W.: Neutrams: neural network transformation and co-design under neuromorphic hardware constraints. In: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 21. IEEE Press (2016)
  33. Ji, H., Song, L., Jiang, L., Li, H.H., Chen, Y.: ReCom: an efficient resistive accelerator for compressed deep neural networks. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 237–240. IEEE (2018a)
  34. Ji, Y., Zhang, Y., Chen, W., Xie, Y.: Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler. In: ACM SIGPLAN Notices, vol. 53, pp. 448–460. ACM (2018b)
  35. Ji, H., Jiang, L., Li, T., Jing, N., Ke, J., Liang, X.: HUBPA: high utilization bidirectional pipeline architecture for neuromorphic computing. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 249–254. ACM (2019)
  36. Jiang, L., Kim, M., Wen, W., Wang, D.: XNOR-pop: a processing-in-memory architecture for binary convolutional neural networks in wide-IO2 drams. In: 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6. IEEE (2017)
  37. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)
  38. Kim, D., Kung, J., Chai, S., Yalamanchili, S., Mukhopadhyay, S.: Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 380–392. IEEE (2016)
  39. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
  40. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
  41. Kuzum, D., Yu, S., Wong, H.P.: Synaptic electronics: materials, devices and applications. Nanotechnology 24(38), 382001 (2013)
  42. Kwon, H., Samajdar, A., Krishna, T.: MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In: ACM SIGPLAN Notices, vol. 53, pp. 461–475. ACM (2018)
  43. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
  44. Lee, D., Lee, J., Jo, M., Park, J., Siddik, M., Hwang, H.: Noise-analysis-based model of filamentary switching ReRAM with \(\text{ ZrO }_{x}/\text{ HfO }_{x}\) stacks. IEEE Electron Device Lett. 32(7), 964–966 (2011)
  45. Li, Y., Liu, Z., Xu, K., Yu, H., Ren, F.: A 7.663-tops 8.2-w energy-efficient FPGA accelerator for binary convolutional neural networks. In: FPGA, pp. 290–291 (2017)
  46. Li, B., Song, L., Chen, F., Qian, X., Chen, Y., Li, H.H.: ReRAM-based accelerator for deep learning. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 815–820. IEEE (2018)
  47. Lin, J., Li, S., Hu, X., Deng, L., Xie, Y.: CNNWIRE: boosting convolutional neural network with winograd on ReRAM based accelerators. In: Proceedings of the 2019 on Great Lakes Symposium on VLSI, pp. 283–286. ACM (2019a)
  48. Lin, J., Zhu, Z., Wang, Y., Xie, Y.: Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 639–644. ACM (2019b)
  49. Liu, X., Mao, M., Liu, B., Li, H., Chen, Y., Li, B., Wang, Y., Jiang, H., Barnell, M., Wu, Q., et al.: Reno: a high-efficient reconfigurable neuromorphic computing accelerator design. In: 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2015)
  50. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016, pp. 21–37. Springer, Cham (2016)
  51. Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Design of fault-tolerant neuromorphic computing systems. In: 2018 IEEE 23rd European Test Symposium (ETS), pp. 1–9. IEEE (2018a)
  52. Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Fault tolerance for RRAM-based matrix operations. In: 2018 IEEE International Test Conference (ITC), pp. 1–10. IEEE (2018b)
  53. Liu, R., Peng, X., Sun, X., Khwa, W.S., Si, X., Chen, J.J., Li, J.F., Chang, M.F., Yu, S.: Parallelizing SRAM arrays with customized bit-cell for binary neural networks. In: Proceedings of the 55th Annual Design Automation Conference, p. 21. ACM (2018c)
  54. Liu, X., Yang, H., Liu, Z., Song, L., Li, H., Chen, Y.: DPATCH: an adversarial patch attack on object detectors. arXiv:1806.02299 (2018d)
  55. Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Fault tolerance in neuromorphic computing systems. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 216–223. ACM (2019a)
  56. Liu, T., Wen, W., Jiang, L., Wang, Y., Yang, C., Quan, G.: A fault-tolerant neural network architecture. In: Proceedings of the 56th Annual Design Automation Conference 2019, DAC '19, pp. 55:1–55:6. ACM, New York (2019b). https://doi.org/10.1145/3316781.3317742
  57. Mahajan, D., Park, J., Amaro, E., Sharma, H., Yazdanbakhsh, A., Kim, J.K., Esmaeilzadeh, H.: Tabla: a unified template-based framework for accelerating statistical machine learning. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 14–26. IEEE (2016)
  58. Mao, M., Cao, Y., Yu, S., Chakrabarti, C.: Optimizing latency, energy, and reliability of 1T1R ReRAM through appropriate voltage settings. In: 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 359–366. IEEE (2015)
  59. Mao, M., Chen, P.Y., Yu, S., Chakrabarti, C.: A multilayer approach to designing energy-efficient and reliable ReRAM cross-point array system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(5), 1611–1621 (2017)
  60. Mao, M., Sun, X., Peng, X., Yu, S., Chakrabarti, C.: A versatile ReRAM-based accelerator for convolutional neural networks. In: 2018 IEEE International Workshop on Signal Processing Systems (SiPS), pp. 211–216. IEEE (2018a)
  61. Mao, M., Yu, S., Chakrabarti, C.: Design and analysis of energy-efficient and reliable 3-d ReRAM cross-point array system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(7), 1290–1300 (2018b)
  62. Mao, M., Peng, X., Liu, R., Li, J., Yu, S., Chakrabarti, C.: Max2: an ReRAM-based neural network accelerator that maximizes data reuse and area utilization. IEEE J. Emerg. Sel. Top. Circuits Syst. (2019)
  63. Miotto, R., Wang, F., Wang, S., Jiang, X., Dudley, J.T.: Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform. 19(6), 1236–1246 (2017)
  64. Mohanty, A., Du, X., Chen, P.Y., Seo, J.s., Yu, S., Cao, Y.: Random sparse adaptation for accurate inference with inaccurate multi-level RRAM arrays. In: 2017 IEEE International Electron Devices Meeting (IEDM), pp. 6–3. IEEE (2017)
  65. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)
  66. Niu, D., Chen, Y., Xu, C., Xie, Y.: Impact of process variations on emerging memristor. In: Proceedings of the 47th Design Automation Conference, pp. 877–882. ACM (2010)
  67. Niu, D., Xu, C., Muralimanohar, N., Jouppi, N.P., Xie, Y.: Design trade-offs for high density cross-point resistive memory. In: Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 209–214. ACM (2012)
  68. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: SCNN: an accelerator for compressed-sparse convolutional neural networks. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. IEEE (2017)
  69. Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. ACM (2016)
  70. Qiao, X., Cao, X., Yang, H., Song, L., Li, H.: Atomlayer: a universal ReRAM-based CNN accelerator with atomic layer computation. In: Proceedings of the 55th Annual Design Automation Conference, p. 103. ACM (2018)
  71. Rajendran, J., Manem, H., Karri, R., Rose, G.S.: An energy-efficient memristive threshold logic circuit. IEEE Trans. Comput. 61(4), 474–487 (2012)
  72. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European Conference on Computer Vision, pp. 525–542. Springer (2016)
  73. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  74. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  75. Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, R.S., Srikumar, V.: Isaac: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Comput. Archit. News 44(3), 14–26 (2016)
  76. Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., Esmaeilzadeh, H.: Bit fusion: bit-level dynamically composable architecture for accelerating deep neural networks. In: Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 764–775. IEEE Press (2018)
  77. Song, L., Qian, X., Li, H., Chen, Y.: Pipelayer: a pipelined ReRAM-based accelerator for deep learning. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552. IEEE (2017)
  78. Song, L., Zhuo, Y., Qian, X., Li, H., Chen, Y.: GRAPHR: accelerating graph processing using ReRAM. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 531–543. IEEE (2018a)
  79. Song, M., Zhao, J., Hu, Y., Zhang, J., Li, T.: Prediction based execution on deep neural networks. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 752–763. IEEE (2018b)
  80. Song, L., Chen, F., Young, S.R., Schuman, C.D., Perdue, G., Potok, T.E.: Deep learning for vertex reconstruction of neutrino-nucleus interaction events with combined energy and time data. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3882–3886. IEEE (2019a)
  81. Song, L., Mao, J., Zhuo, Y., Qian, X., Li, H., Chen, Y.: Hypar: towards hybrid parallelism for deep learning accelerator array. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 56–68. IEEE (2019b)
  82. Sun, X., Peng, X., Chen, P.Y., Liu, R., Seo, J.s., Yu, S.: Fully parallel RRAM synaptic array for implementing binary neural network with (+1, -1) weights and (+1, 0) neurons. In: Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 574–579. IEEE Press (2018a)
  83. Sun, X., Yin, S., Peng, X., Liu, R., Seo, J.s., Yu, S.: XNOR-RRAM: a scalable and parallel resistive synaptic architecture for binary neural networks. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1423–1428. IEEE (2018b)
  84. Tang, T., Xia, L., Li, B., Wang, Y., Yang, H.: Binary convolutional neural network on RRAM. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 782–787. IEEE (2017)
  85. Wang, Y., Xu, J., Han, Y., Li, H., Li, X.: Deepburning: automatic generation of FPGA-based learning accelerators for the neural network family. In: Proceedings of the 53rd Annual Design Automation Conference, p. 110. ACM (2016)
  86. Wang, Y., Wen, W., Song, L., Li, H.H.: Classification accuracy improvement for neuromorphic computing systems with one-level precision synapses. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 776–781. IEEE (2017)
  87. Wang, P., Ji, Y., Hong, C., Lyu, Y., Wang, D., Xie, Y.: SNRRAM: an efficient sparse neural network computation architecture based on resistive random-access memory. In: Proceedings of the 55th Annual Design Automation Conference, p. 106. ACM (2018)
  88. Wong, H.S.P., Lee, H.Y., Yu, S., Chen, Y.S., Wu, Y., Chen, P.S., Lee, B., Chen, F.T., Tsai, M.J.: Metal-oxide RRAM. Proc. IEEE 100(6), 1951–1970 (2012)
  89. Woo, J., Peng, X., Yu, S.: Design considerations of selector device in cross-point RRAM array for neuromorphic computing. In: 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4. IEEE (2018)
  90. Xu, C., Niu, D., Muralimanohar, N., Jouppi, N.P., Xie, Y.: Understanding the trade-offs in multi-level cell ReRAM memory design. In: 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2013)
  91. Xu, C., Niu, D., Muralimanohar, N., Balasubramonian, R., Zhang, T., Yu, S., Xie, Y.: Overcoming the challenges of crossbar resistive memory architectures. In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476–488. IEEE (2015)
  92. Yazdanbakhsh, A., Samadi, K., Kim, N.S., Esmaeilzadeh, H.: GANAX: a unified MIMD-SIMD acceleration for generative adversarial networks. In: Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 650–661. IEEE Press (2018)
  93. Yu, S., Wu, Y., Jeyasingh, R., Kuzum, D., Wong, H.S.P.: An electronic synapse device based on metal oxide resistive switching memory for neuromorphic computation. IEEE Trans. Electron Devices 58(8), 2729–2737 (2011a)
  94. Yu, S., Wu, Y., Wong, H.S.P.: Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory. Appl. Phys. Lett. 98(10), 103514 (2011b)
  95. Yu, S., Gao, B., Fang, Z., Yu, H., Kang, J., Wong, H.S.P.: A low energy oxide-based electronic synaptic device for neuromorphic visual systems with tolerance to device variation. Adv. Mater. 25(12), 1774–1779 (2013)
  96. Yu, S., Chen, P.Y., Cao, Y., Xia, L., Wang, Y., Wu, H.: Scaling-up resistive synaptic arrays for neuro-inspired architecture: challenges and prospect. In: 2015 IEEE International Electron Devices Meeting (IEDM), pp. 17–3. IEEE (2015)
  97. Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: customizing DNN pruning to the underlying hardware parallelism. In: ACM SIGARCH Computer Architecture News, vol. 45, pp. 548–560. ACM (2017)
  98. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170. ACM (2015)
  99. Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., Cong, J.: Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In: Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331. ACM (2016)
  100. Zhang, C., Sun, G., Fang, Z., Zhou, P., Pan, P., Cong, J.: Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput. Aided Design Integr. Circuits Syst. (2018). https://doi.org/10.1109/TCAD.2017.2785257
  101. Zokaee, F., Zhang, M., Jiang, L.: Finder: accelerating fm-index-based exact pattern matching in genomic sequences through ReRAM technology. In: Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques. ACM (2019)

Copyright information

© China Computer Federation (CCF) 2019

Authors and Affiliations

  1. Duke University, Durham, USA
  2. University of Southern California, Los Angeles, USA
