# ReBNN: in-situ acceleration of binarized neural networks in ReRAM using complementary resistive cell


## Abstract

Resistive random access memory (ReRAM) has been proven capable of efficiently performing in-situ matrix-vector computations in convolutional neural network (CNN) processing. The computations are often conducted on multi-level cells (MLCs) that have limited precision and hence are significantly vulnerable to noise. The binarized neural network (BNN) is a hardware-friendly model that can dramatically reduce the computation and storage overheads. However, XNOR, which is the key operation in BNNs, cannot be directly computed in-situ in ReRAM because of its nonlinear behavior. To enable real in-situ processing of BNNs in ReRAM, we modified the BNN algorithm to enable direct computation of XNOR, POPCOUNT and POOL based on ReRAM cells. We also proposed the complementary resistive cell (CRC) design to efficiently conduct XNOR operations, and optimized the pipeline design with decoupled buffer and computation stages. Our results show that our scheme, namely ReBNN, improves the system performance by \(25.36\times\) and the energy efficiency by \(4.26\times\) compared to a conventional ReRAM-based accelerator, and ensures a throughput higher than state-of-the-art BNN accelerators. The correctness of the modified algorithm is also validated.

## Keywords

BNNs · ReRAM · Accelerator

## 1 Introduction

Since the 2012 ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015), convolutional neural networks (CNNs) (Krizhevsky et al. 2012) have become the main drivers of revolutions in various application domains, such as computer vision (Liu et al. 2016, 2018d; Ren et al. 2015), healthcare (Ching et al. 2018; Faust et al. 2018; Miotto et al. 2017) and scientific computing (Baldi et al. 2014; Goh et al. 2017; Song et al. 2019a). To accelerate CNNs, various hardware acceleration solutions were proposed, focusing on computation efficiency improvement (Esmaeilzadeh et al. 2012), data flow optimization (Chen et al. 2016), or both (Chen et al. 2014b). Many acceleration designs based on ASICs (Akhlaghi et al. 2018; Chen et al. 2014a, b; Du et al. 2015; Esmaeilzadeh et al. 2012; Jouppi et al. 2017; Kwon et al. 2018; Parashar et al. 2017; Sharma et al. 2018; Song et al. 2018b, 2019b; Yazdanbakhsh et al. 2018; Yu et al. 2017) and FPGAs (Guan et al. 2017a, b; Han et al. 2017; Mahajan et al. 2016; Qiu et al. 2016; Wang et al. 2016; Zhang et al. 2015, 2018, 2016) were proposed.

Because of the large memory capacity required by CNN computations and the heavy data traffic between computing units and memories, processing-in-memory (PIM) has become an attractive approach for CNN acceleration. Emerging resistive random access memory (ReRAM) (Mao et al. 2015, 2017, 2018b; Niu et al. 2010, 2012; Wong et al. 2012; Xu et al. 2013, 2015) is one of the promising technologies to enable PIM. In particular, ReRAM can be organized into arrays of weights (synapses) (Kuzum et al. 2013; Woo et al. 2018; Yu et al. 2015, 2011a, 2013) that can efficiently perform matrix-vector multiplications. Prime (Chi et al. 2016), ISAAC (Shafiee et al. 2016) and PipeLayer (Song et al. 2017) are three representative ReRAM-based neural network accelerators; other ReRAM-based accelerators for neural networks include (Chen and Li 2018a; Chen et al. 2018b, 2019a, b; Cheng et al. 2017; Ji et al. 2018a, 2019; Li et al. 2018; Lin et al. 2019a, b; Liu et al. 2018c, 2015; Mao et al. 2019, 2018a; Qiao et al. 2018; Sun et al. 2018a, b; Tang et al. 2017; Wang et al. 2018). Ji et al. (2016, 2018b) integrated ReRAM-based accelerators into system-level solutions. Besides neural networks, processing in ReRAM is also a potential solution for graph processing (Dai et al. 2019; Song et al. 2018a) and genome sequencing (Chen et al. 2020; Huangfu et al. 2018; Zokaee et al. 2019).

Although ReRAM has demonstrated great potential in accelerating neural networks, there are still many obstacles on the way to its commercialization. For example, ReRAM-based PIM often conducts its operations on multi-level cells (MLCs) that have limited precision and hence are significantly vulnerable to noise; fault-tolerant or low-precision designs are required to compensate for the loss of computation accuracy (Liu et al. 2018a, b, 2019a, b; Mohanty et al. 2017; Wang et al. 2017). As ReRAM devices continue to scale, this reliability issue will become even more prominent. It is therefore necessary to explore new neural network models that are friendly to the hardware implementation of neural network accelerators with high computing efficiency and robustness.

The binarized neural network (BNN) (Hubara et al. 2016; Rastegari et al. 2016) is a promising solution to the high computational complexity and data storage requirements of conventional full-precision CNNs. In a BNN, feature maps and kernel weights are all binary, i.e., \(\{+1,-1\}\) in the number domain or \(\{1,0\}\) in the logic domain. BNNs are hardware-friendly as they occupy much less memory and require much simpler operations than conventional CNNs. BNNs are naturally suitable for ReRAM-based PIM because the data precision they need is low (i.e., binary) and the corresponding binary operations are very resilient to noise. Recently, Tang et al. (2017) proposed a ReRAM-based BNN accelerator. However, that work simply treats a BNN as a low-precision CNN, and only a pseudo in-situ acceleration was developed without fully utilizing the ReRAM arrays: apart from matrix-vector multiplications, computations such as normalization and pooling still need to be implemented with peripheral circuits rather than the ReRAM array.

We modify the original algorithm to allow highly efficient hardware implementations. Specifically, we reformulate the original MAX-NORM-SIGN operation flow into NORM/COMP-OR, so that the MAX operation directly works on binaries and can be implemented simply with OR gates. The MAX implementation in ReRAM also becomes more noise-resistant. This algorithm change is the key enabler for supporting BNNs in ReRAM.

We design a novel complementary resistive cell (CRC) as the basic building block of the processing unit. The XNOR-POPCOUNT operation is performed by the CRC, and the sense amplifier is redesigned to also perform the normalization for BNNs.

We propose a decoupled buffer-computation execution to support multiple instances of layer-wise pipelined execution of BNNs. Both peak computation performance and throughput are improved.

We evaluate our design on three datasets, MNIST (LeCun 1998), CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011). The results show that ReBNN improves the performance by \(25.36\times\) and the energy efficiency by \(4.26\times\), the throughput of ReBNN is higher than state-of-the-art BNN accelerators, and the correctness of the modified algorithm is validated.

## 2 Background and motivation

### 2.1 Neural networks

#### 2.1.1 Convolutional neural networks

*K* is the kernel size, \(f(\cdot )\) is a nonlinear activation function, and \(*\) denotes a 2D convolution (\((H_i\times L_i)*(K\times K)\)). The fully-connected (FC) or inner product layer is another category of weighted layer; the core operations of FC are matrix multiplications. FC can also be viewed as a special CONV where the second and third dimensions of the feature maps are 1, i.e., the dimensions of \({\mathbf {I}}\) and \({{\mathbf {O}}}\) are \(N_i\times (1\times 1)\) and \(N_o\times (1\times 1)\), respectively.

#### 2.1.2 Binarized neural networks

In binarized neural networks, feature maps and kernel weights are binary, i.e., \(\{+1,-1\}\) in the number domain or \(\{1,0\}\) in the logic domain. BNNs are hardware-friendly as they require a smaller memory footprint and simpler operations compared to CNNs. XNOR-Net (Rastegari et al. 2016) and BinaryNet (Hubara et al. 2016) are two promising BNNs. Compared with BinaryNet, XNOR-Net requires extra non-binary operations [i.e., scaling-factor (Rastegari et al. 2016) related transformations] in each layer. Thus, we focus on BinaryNet in this paper.

If a number-domain value *A* (\(\in \{+1,-1\}\)) is encoded with a logic-domain bit *B* (\(\in \{1,0\}\)), then (\(B_1\) XNOR \(B_2\)) is identical to (\(A_1 \times A_2\)). Based on this property, BNNs can be significantly accelerated using XNOR, as multiplications constitute the largest percentage of operations in BNNs. In this encoding scheme, an addition is performed by a POPCOUNT operation, which counts the population (number) of ‘1’s in a binary stream. Thus, the convolutions in BNNs can be realized with XNOR and POPCOUNT.

XNOR operation to perform multiplication

| \(A_1\) | \(A_2\) | \(B_1\) | \(B_2\) | \(A_1 \times A_2\) | \(B_1\) XNOR \(B_2\) |
|---|---|---|---|---|---|
| \(+1\) | \(+1\) | 1 | 1 | \(+1\) | 1 |
| \(+1\) | \(-1\) | 1 | 0 | \(-1\) | 0 |
| \(-1\) | \(+1\) | 0 | 1 | \(-1\) | 0 |
| \(-1\) | \(-1\) | 0 | 0 | \(+1\) | 1 |
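The encoding above can be sanity-checked with a short sketch (plain Python, random test vectors): for length-\(N\) \(\{+1,-1\}\) vectors, the \(\pm 1\) dot product equals \(2\cdot \mathrm{POPCOUNT}(\mathrm{XNOR}) - N\), since each matching bit pair contributes \(+1\) and each mismatch contributes \(-1\).

```python
# Sketch: verify that a {+1,-1} dot product equals 2*POPCOUNT(XNOR) - N
# under the encoding +1 -> 1 and -1 -> 0 used in the paper.
import random

def binarize(a):
    # Number domain {+1,-1} -> logic domain {1,0}.
    return [1 if v == +1 else 0 for v in a]

def xnor_popcount(b1, b2):
    # XNOR is 1 exactly when the bits match; POPCOUNT counts the 1s.
    return sum(1 if x == y else 0 for x, y in zip(b1, b2))

N = 64
a1 = [random.choice([+1, -1]) for _ in range(N)]
a2 = [random.choice([+1, -1]) for _ in range(N)]

dot = sum(x * y for x, y in zip(a1, a2))          # conventional multiply-accumulate
pop = xnor_popcount(binarize(a1), binarize(a2))   # BNN-style XNOR-POPCOUNT
assert dot == 2 * pop - N                          # the two agree exactly
```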

Batch normalization linearly shifts and scales feature maps before they are binarized by their signs, i.e., \(y = \gamma \frac{x-\mu }{\sqrt{\sigma ^2+\epsilon }} + \beta\), where *x* and *y* are the input and output values, and \(\mu\), \(\sigma\), \(\epsilon\), \(\gamma\), \(\beta\) are constant parameters (\(\mu\) and \(\sigma\) are the statistical mean and standard deviation, \(\gamma\) and \(\beta\) are trained parameters, and \(\epsilon\) is a small floating-point term that avoids division-by-zero errors). The normalized feature maps are then binarized by the function Sign\((\cdot )\).

Figure 1 compares the data flows in a CNN layer and a BNN layer. In the CNN layer, operations of cascaded CONVOLUTION-ACTIVATION perform the computation in Eq. (1), and POOLING is the last operation to generate output feature maps. In a BNN layer, CONVOLUTION is replaced by XNOR-POPCOUNT. Another noticeable difference is that POOLING is no longer the last operation before output feature maps. The order of POOLING, NORM and SIGN is subtle and has implications on hardware implementations. The details will be discussed in Sect. 3.

### 2.2 Neural network acceleration in ReRAM

Resistive random access memory (ReRAM) is an emerging non-volatile memory with the appealing properties of high density, fast read access and low leakage power. ReRAM has been considered a promising candidate for main memory (Xu et al. 2015), where a ReRAM cell switches between a high resistance state (‘0’) and a low resistance state (‘1’). The crossbar architecture (Hu et al. 2016) and the multi-level cell (MLC) (Yu et al. 2011b) were originally proposed to improve the density and reduce the cost of ReRAM, but these two techniques also inspired storing matrix weights in ReRAM cells and performing in-situ matrix-vector multiplications. When voltages are applied to the wordlines of a weighted ReRAM array (\({\mathbf {W}}\)), according to Kirchhoff’s current law (KCL), the bitlines accumulate currents in parallel; hence, \(I={\mathbf {W}}U\): weights are represented by the resistance states of the ReRAM cells in the array, and the bitline current accumulation performs matrix-vector multiplication in an analog manner. Many ReRAM-based neural network accelerators were proposed in recent years (Chi et al. 2016; Hu et al. 2016; Liu et al. 2015; Shafiee et al. 2016; Song et al. 2017).
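As a rough functional illustration (nested loops standing in for KCL, not a circuit model), the in-situ product \(I={\mathbf {W}}U\) can be sketched as follows; the conductance and voltage values are made up:

```python
# Sketch of the in-situ matrix-vector product I = W*U: each cell's conductance
# W[i][j] stores a weight, wordline voltages U drive the rows, and every
# bitline j sums its cell currents (Kirchhoff's current law).
def crossbar_mvm(W, U):
    cols = len(W[0])
    return [sum(W[i][j] * U[i] for i in range(len(W))) for j in range(cols)]

W = [[0.5, 1.0],
     [1.0, 0.2]]        # cell conductances (illustrative units)
U = [0.4, 0.4]          # read voltages applied to the two wordlines

I = crossbar_mvm(W, U)  # bitline currents, approximately [0.6, 0.48]
assert abs(I[0] - 0.6) < 1e-9 and abs(I[1] - 0.48) < 1e-9
```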

However, the current ReRAM-based accelerators (Chi et al. 2016; Shafiee et al. 2016; Song et al. 2017) are customized to run CNNs and need to be adapted for BNNs, due to the differences between BNNs and conventional CNNs. We argue that BNNs are a perfect fit for ReRAM. First, BNNs by nature require much lower precision. The current designs for CNNs rely on multi-bit weights, yet realizing multi-bit precision on ReRAM cells is difficult even with MLCs because of large device variations; complex iterative tuning schemes (Alibart et al. 2012) or conversion algorithms (Hu et al. 2016) must be used. This is why ReRAM only provides limited precision. Moreover, the analog computations on ReRAM are vulnerable to various noises (Lee et al. 2011) and variations (Chang et al. 2012), whereas BNNs are in general not affected by these factors. Second, BNNs avoid analog/digital conversions, which incur high energy consumption and large chip area. In BNNs, the ReRAM cells used for computation are the same as the cells used for storage (i.e., with two states ‘0’ and ‘1’), which are easy to program and immune to noises and variations. Nonetheless, supporting BNNs in ReRAM is non-trivial because a ReRAM array cannot intrinsically implement XNOR operations.

## 3 Hardware oriented BNN flow

After the constant normalization parameters are folded into a threshold \(\tau\), a comparison between *x* and \(\tau\) is the only needed computation.

With this modification, the operation flow is simplified to NORM/COMP, which can be implemented as a simple comparator. More importantly, re-organizing the operation flow introduces an additional optimization opportunity. Specifically, if we place MAX after NORM/COMP in BNNs, MAX directly works on binaries and can be implemented by an OR gate. Eventually, the original operation flow (i.e., MAX-NORM/COMP) is changed to NORM/COMP-OR, as shown in Fig. 2b.
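The folding can be sketched numerically. The parameter values below are made up for illustration, and the fold as written assumes \(\gamma > 0\) (for \(\gamma < 0\) the comparison direction flips):

```python
import math

# Fold NORM (batch normalization) followed by SIGN into a single comparison
# against a precomputed threshold tau. Parameter values are illustrative.
mu, sigma, eps, gamma, beta = 0.3, 1.2, 1e-5, 0.8, -0.5

def norm_sign(x):
    # The original two-step flow: normalize, then binarize by sign.
    return 1 if gamma * (x - mu) / math.sqrt(sigma**2 + eps) + beta >= 0 else 0

# Solving gamma*(x-mu)/sqrt(sigma^2+eps) + beta >= 0 for x (gamma > 0):
tau = mu - beta * math.sqrt(sigma**2 + eps) / gamma   # precomputed constant

for x in [-2.0, 0.0, 1.0, 1.2, 3.0]:
    assert norm_sign(x) == (1 if x >= tau else 0)     # NORM-SIGN == COMP with tau
```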

While the new operation flow leads to a more efficient hardware implementation, it is critical to show that the function of the new implementation is identical to that of the original implementation. To this end, we propose Lemma 1.

### Lemma 1

\(\max (a_i)>\tau \;\Leftrightarrow\; \vee (a_i>\tau )\).

### Proof

- 1.
Necessity: \(\max (a_i)>\tau \Rightarrow \exists \ i=i_1\) such that \(a_{i_1}=\max (a_i)>\tau \Rightarrow \vee (a_i>\tau )\).

- 2.
Sufficiency: \(\vee (a_i>\tau ) \Rightarrow \exists \ i=i_2\) such that \(a_{i_2}>\tau \Rightarrow \max (a_i)\geqslant a_{i_2}>\tau\).
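The lemma is also easy to check numerically; a throwaway sketch over random pooling windows:

```python
import random

# Numeric check of Lemma 1: max(a_i) > tau  <=>  OR over (a_i > tau).
def lemma1_holds(a, tau):
    return (max(a) > tau) == any(x > tau for x in a)

random.seed(0)
for _ in range(1000):
    tau = random.uniform(-1.0, 1.0)
    window = [random.uniform(-2.0, 2.0) for _ in range(4)]  # e.g., a 2x2 pooling window
    assert lemma1_holds(window, tau)
```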

In summary, the operation flow modification is crucial when mapping BNN to ReRAM. Although the default flow can be implemented in CMOS architecture, integer comparators are less efficient than OR gates. The default MAX implementation in ReRAM requires dynamic analog comparison, which unfortunately is not noise resistant. We will show the effect of the modifications of operation flows in Sect. 6.3.

## 4 Complementary resistive cell (CRC)

### 4.1 Complementary resistive cell (CRC)

Figure 3 shows the structure of a CRC and how it performs XNOR. The conductances of the two cells in a CRC are complementary, and the input voltages to the two cells, generated by a buffer and an inverter, are also complementary. In this way, only one cell in a CRC is active at a time. To store a weight \(W=1\) in a CRC, the upper cell is set to a high (H) conductance state while the lower one is set to a low (L) conductance state, as shown in Fig. 3a, c. Similarly, the upper-L and lower-H states represent \(W=0\), as shown in Fig. 3b, d. To compute \((X=0)\) XNOR \((W=0)\), the lower cell connected to the inverter is active, so that a high current (\(I_\mathrm{H}\)) is generated to represent ‘1’. On the contrary, a low current (\(I_\mathrm{L}\)) represents ‘0’, as shown in Fig. 3d. The other cases can be derived similarly.
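A behavioral sketch of the CRC truth table (the conductance and voltage values are illustrative, loosely following the cell parameters in the evaluation tables; the real cell is analog):

```python
# Sketch of the CRC: two complementary cells per weight, driven by
# complementary input voltages, so exactly one cell conducts per read.
G_H, G_L = 1.0, 0.1      # high/low conductance (illustrative units)
V_READ = 0.4             # read voltage

def crc_current(x, w):
    upper, lower = (G_H, G_L) if w == 1 else (G_L, G_H)   # complementary states
    active = upper if x == 1 else lower                    # inverter selects one cell
    return active * V_READ

# Sensing: a current above the midpoint reference reads out as '1'.
threshold = (G_H + G_L) / 2 * V_READ
for x in (0, 1):
    for w in (0, 1):
        bit = 1 if crc_current(x, w) > threshold else 0
        assert bit == (1 if x == w else 0)                 # I_H exactly when XNOR = 1
```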

### 4.2 Fused activation with CRC array

Suppose the number of CRCs on a bitline is *N* and the number of ‘1’s generated by the CRCs is *n*; the current on the bitline can then be calculated as \(I_{\mathrm {BL}} = n I_\mathrm{H} + (N-n) I_\mathrm{L}\).
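Because the accumulated bitline current \(n I_\mathrm{H} + (N-n) I_\mathrm{L}\) is linear in *n*, the sense amplifier can compare it against a fixed reference current instead of digitizing a popcount. A sketch with illustrative current values:

```python
# Sketch: with n of the N CRCs on a bitline producing I_H and the rest I_L,
# the accumulated current is linear in n. Values are illustrative.
I_H, I_L = 0.4, 0.04     # per-CRC high/low currents
N = 128

def bitline_current(n):
    return n * I_H + (N - n) * I_L

def recover_popcount(i_bl):
    # Invert the linear relation to get the popcount back.
    return round((i_bl - N * I_L) / (I_H - I_L))

tau = 70                                  # popcount threshold from the folded NORM
I_ref = bitline_current(tau)              # reference current for the comparator
for n in range(N + 1):
    assert recover_popcount(bitline_current(n)) == n
    assert (bitline_current(n) > I_ref) == (n > tau)   # COMP done purely in current
```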

Previous works on ReRAM-based accelerators for BNNs, such as Sun et al. (2018b) and Tang et al. (2017), failed to perform real in-situ computation in ReRAM. In Sun et al. (2018b) and Tang et al. (2017), ReRAM arrays are only used as processing engines for low-precision multiplication, and extra logic components are deployed for the nonlinear activation and binary quantization. In this work, we enable real in-situ computation in ReRAM for BNN processing through a software/hardware co-design approach, including (1) the modified hardware-oriented operation flow and (2) the fused activation with the CRC array.

## 5 Architecture and pipeline design

### 5.1 Architecture

### 5.2 Pipeline

We design a pipeline to further boost the performance of BNN processing. Figure 5 shows the pipeline design. The processing for one layer is decoupled into two stages: buffer and computation. In the pipeline, a layer *i* only needs to communicate with the previous layer \(i-1\) and the next layer \(i+1\). The data movement is coordinated by the buffers.

In the buffer stage, as shown in Fig. 5 ❶, the output feature map of layer \(i-1\) in the output buffer (colored yellow) is sent to the input buffer of layer *i* (colored yellow). Note that for the processing of non-binarized neural networks, the intermediate results of a whole layer cannot be transferred directly on-chip because of their memory footprint; the transfer must be coordinated through extra main memory space, as in Song et al. (2017). The output feature map of layer *i* is cached in the output buffer (colored red) and sent to the input buffer (colored red) of layer \(i+1\).

In the computation stage, as shown in Fig. 5 ❷, data from the input buffer are applied to the wordlines in complementary form by the driver, and the sense amplifiers perform NORM/COMP on the accumulated XNORed values from the bitlines. Thus, the computation is just a read from the CRC array with multiple wordlines activated.

During the execution of the whole binarized neural network, a static and regular communication graph is formed between CRC arrays based on the order of layers in the network. Note that this pipeline design also benefits the processing of the first layer in BNNs. The feature maps in BNNs are all in binary format except the input feature maps (the images) to the first layer, which are in an 8-bit fixed-point format. With this pipeline design, the input feature map of the first layer can also utilize the binary drivers, which input the bits of the fixed-point values from MSB to LSB and accumulate the shifted partial sums cached in the buffer.
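The bit-serial handling of the first layer can be sketched as follows (a functional model only; `bit_serial_dot` is a hypothetical helper, with \(\{+1,-1\}\) weights and unsigned 8-bit inputs):

```python
# Sketch of bit-serial first-layer processing: 8-bit inputs are streamed one
# bit-plane at a time (MSB first) through the binary drivers, and the buffer
# accumulates shifted partial sums.
def bit_serial_dot(pixels, weights, bits=8):
    acc = 0
    for b in reversed(range(bits)):                 # stream bit-planes MSB first
        plane = [(p >> b) & 1 for p in pixels]      # binary inputs for the drivers
        partial = sum(x * w for x, w in zip(plane, weights))
        acc = acc * 2 + partial                     # shift-and-accumulate in buffer
    return acc

pixels = [200, 17, 99, 3]
weights = [+1, -1, +1, -1]
expected = sum(p * w for p, w in zip(pixels, weights))
assert bit_serial_dot(pixels, weights) == expected  # matches the full-precision dot
```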

## 6 Evaluation

### 6.1 Evaluation setup

In the evaluation, we use three classic datasets: MNIST (LeCun 1998), CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011). MNIST is a popular dataset of handwritten digits, which contains a total of 70,000 samples. CIFAR-10 is a dataset of tiny colorful images for advanced recognition system design; it contains 10 classes with 6000 images in each class. SVHN is a dataset of real-world street-view house numbers, with more than 100,000 samples. Based on the three datasets, three binarized networks were trained.

ReBNN parameters

| Parameter | Value |
|---|---|
| **ReRAM cell** | |
| Cell on resistance | 1.000 M\(\Omega\) |
| Cell off resistance | 10.00 M\(\Omega\) |
| Write voltage | 1.000 V |
| Read voltage | 0.400 V |
| **ReRAM array** | |
| Array size | \(128\times 128\) |
| Write latency | 23.93 ns |
| Read latency | 5.472 ns |
| Write energy | 52.23 nJ |
| Read energy | 25.15 nJ |
| Leakage power | 32.75 mW |

Architecture and memory footprint of three binarized neural networks

**MNIST**

| Layer | FC1 | FC2 | FC3 | FC4 |
|---|---|---|---|---|
| Kernel | 2048,784 | 2048,2048 | 2048,2048 | 10,2048 |
| Weight bits | 1.5 K | 4.0 M | 4.0 M | 20 K |
| OFmap bits | 2.0 K | 2.0 K | 2.0 K | 10 |

**CIFAR-10**

| Layer | conv1 | conv2 (pool) | conv3 | conv4 (pool) | conv5 | conv6 (pool) | FC1 | FC2 | FC3 |
|---|---|---|---|---|---|---|---|---|---|
| Kernel | 128,3, \(3\times 3\) | 128,128, \(3\times 3\) | 256,128, \(3\times 3\) | 256,256, \(3\times 3\) | 512,256, \(3\times 3\) | 512,512, \(3\times 3\) | 1024,8192 | 1024,1024 | 10,1024 |
| Weight bits | 3.4 K | 144 K | 288 K | 576 K | 1.1 M | 2.3 M | 8.0 M | 1.0 M | 10 K |
| OFmap bits | 128 K | 32 K | 64 K | 16 K | 32 K | 8 K | 1 K | 1 K | 10 |

**SVHN**

| Layer | conv1 | conv2 (pool) | conv3 | conv4 (pool) | conv5 | conv6 (pool) | FC1 | FC2 | FC3 |
|---|---|---|---|---|---|---|---|---|---|
| Kernel | 64,3, \(3\times 3\) | 64,64, \(3\times 3\) | 128,64, \(3\times 3\) | 128,128, \(3\times 3\) | 256,128, \(3\times 3\) | 256,256, \(3\times 3\) | 1024,4096 | 1024,1024 | 10,1024 |
| Weight bits | 1.7 K | 36 K | 72 K | 144 K | 288 K | 576 K | 4.0 M | 1.0 M | 10 K |
| OFmap bits | 64 K | 16 K | 32 K | 8 K | 16 K | 4 K | 1 K | 1 K | 10 |

### 6.2 Performance and energy efficiency

We also compare the throughput, i.e., frames per second (FPS), of ReBNN with popular state-of-the-art neural network accelerators: ShiDianNao (Du et al. 2015), YodaNN (Andri et al. 2016), Eyeriss (Chen et al. 2016), ISAAC (Shafiee et al. 2016), Neurocube (Kim et al. 2016), BNN-FPGA (Li et al. 2017) and XNOR-POP (Jiang et al. 2017), as shown in Fig. 6c. BNN-FPGA (Li et al. 2017) and XNOR-POP (Jiang et al. 2017) are two accelerators for BNNs; the throughput of ReBNN is around \(2\times\) that of XNOR-POP. Moreover, as a stand-alone accelerator, ReBNN can easily be scaled to a larger design with duplicated computation units, whereas XNOR-POP, which modifies DRAM peripherals, is not easy to scale up.

### 6.3 Accuracy of modified operation flow

Accuracies of BNNs on SVHN and CIFAR-10

| Algorithm | Validation SVHN | Validation CIFAR-10 | Training SVHN | Training CIFAR-10 |
|---|---|---|---|---|
| Original | \(97.33\%\) | \(88.42\%\) | \(99.58\%\) | \(98.42\%\) |
| Modified | \(98.35\%\) | \(87.02\%\) | \(99.46\%\) | \(95.98\%\) |

Figure 7 shows the error rate curves for validation and training of the original and modified algorithms on SVHN. The modified algorithm has almost the same curves as the original one, especially for the validation (inference) error; the modified algorithm even achieves a slightly smaller validation error rate, 2.65%, than the original one, 2.67%. The error rate curves of the original and modified BNN on CIFAR-10 are shown in Fig. 8. While the modified BNN performs slightly worse than the original on CIFAR-10, it converges along the same trend: the validation error rate of the modified algorithm is 12.98%, compared to 11.58% for the original. We speculate that the gap arises because the modified operation flow loses the indication of the max element, which the original flow keeps. However, the validation error rate gap is only 1.40%, which is acceptable and shows that the modified operation flow is practical. The accuracies of the original and modified BNNs are listed in Table 4; the modified hardware-friendly data flow works properly.

## 7 Conclusion

Binarized neural networks are more hardware-efficient than convolutional neural networks, but BNNs cannot be directly processed in-situ in ReRAM. We modify the original algorithm to fuse the normalization, comparison and pooling in BNNs, allowing more efficient hardware implementations. We design a new complementary resistive cell (CRC) as the processing unit to perform XNOR and POPCOUNT in ReRAM. To improve the performance and throughput, we propose a decoupled buffer-computation pipeline execution. In the evaluation, our design ReBNN improves the performance by \(25.36\times\) and the energy efficiency by \(4.26\times\), and also achieves a higher throughput than state-of-the-art BNN accelerators. The correctness of the modified algorithm, which is the software foundation for the hardware design of ReBNN, is validated.

## Notes

### Acknowledgements

This work was supported in part by NSF 1910299, 1717657, DOE DE-SC0018064, AFRL FA8750-18-2-0057, NSF CCF-1657333, CCF-1717754, CNS-1717984, CCF-1750656, and CCF-1919289.

### Compliance with ethical standards

### Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

## References

- Akhlaghi, V., Yazdanbakhsh, A., Samadi, K., Gupta, R.K., Esmaeilzadeh, H.: Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 662–673. IEEE (2018)Google Scholar
- Alibart, F., Gao, L., Hoskins, B.D., Strukov, D.B.: High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology
**23**(7), 075201 (2012)CrossRefGoogle Scholar - Andri, R., Cavigelli, L., Rossi, D., Benini, L.: Yodann: an ultra-low power convolutional neural network accelerator based on binary weights. In: 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 236–241. IEEE (2016)Google Scholar
- Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun.
**5**, 4308 (2014)CrossRefGoogle Scholar - Chang, M.F., Sheu, S.S., Lin, K.F., Wu, C.W., Kuo, C.C., Chiu, P.F., Yang, Y.S., Chen, Y.S., Lee, H.Y., Lien, C.H., et al.: A high-speed 7.2-ns read-write random access 4-mb embedded resistive ram (ReRAM) macro using process-variation-tolerant current-mode read schemes. IEEE J. Solid State Circuits
**48**(3), 878–891 (2012)CrossRefGoogle Scholar - Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., Temam, O.: Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In: ACM Sigplan Notices, vol. 49, pp. 269–284. ACM (2014a)Google Scholar
- Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al.: Dadiannao: a machine-learning supercomputer. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. IEEE Computer Society (2014b)Google Scholar
- Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits
**52**(1), 127–138 (2016)CrossRefGoogle Scholar - Chen, F., Li, H.: Emat: an efficient multi-task architecture for transfer learning using ReRAM. In: Proceedings of the International Conference on Computer-Aided Design, p. 33. ACM (2018a)Google Scholar
- Chen, F., Song, L., Chen, Y.: Regan: A pipelined ReRAM-based accelerator for generative adversarial networks. In: 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 178—183. IEEE (2018b)Google Scholar
- Chen, P.Y., Peng, X., Yu, S.: Neurosim: a circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
**37**(12), 3067–3080 (2018c)CrossRefGoogle Scholar - Chen, F., Song, L., Li, H.: Efficient process-in-memory architecture design for unsupervised GAN-based deep learning using ReRAM. In: Proceedings of the 2019 on Great Lakes Symposium on VLSI, pp. 423–428. ACM (2019a)Google Scholar
- Chen, F., Song, L., Li, H.H., Chen, Y.: Zara: a novel zero-free dataflow accelerator for generative adversarial networks in 3d ReRAM. In: Proceedings of the 56th Annual Design Automation Conference 2019, p. 133. ACM (2019b)Google Scholar
- Chen, F., Song, L., Li, H., Chen, Y.: Parc: a processing-in-cam architecture for genomic long read pairwise alignment using ReRAM. In: 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC). ACM (2020)Google Scholar
- Cheng, M., Xia, L., Zhu, Z., Cai, Y., Xie, Y., Wang, Y., Yang, H.: Time: a training-in-memory architecture for memristor-based deep neural networks. In: Proceedings of the 54th Annual Design Automation Conference 2017, p. 26. ACM (2017)Google Scholar
- Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., Shelhamer, E.: CUDNN: efficient primitives for deep learning. arXiv:1410.0759 (2014)
- Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y.: Prime: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: ACM SIGARCH Computer Architecture News, vol. 44, pp. 27–39. IEEE Press (2016)Google Scholar
- Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K., Kalinin, A.A., Do, B.T., Way, G.P., Ferrero, E., Agapow, P.M., Zietz, M., Hoffman, M.M., et al.: Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface
**15**(141), 20170387 (2018)CrossRefGoogle Scholar - Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a matlab-like environment for machine learning. Tech. rep. (2011)Google Scholar
- Dai, G., Huang, T., Wang, Y., Yang, H., Wawrzynek, J.: Graphsar: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 120–126. ACM (2019)Google Scholar
- Dong, X., Xu, C., Xie, Y., Jouppi, N.P.: Nvsim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
**31**(7), 994–1007 (2012)CrossRefGoogle Scholar - Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y., Temam, O.: Shidiannao: shifting vision processing closer to the sensor. In: ACM SIGARCH Computer Architecture News, vol. 43, pp. 92–104. ACM (2015)CrossRefGoogle Scholar
- Esmaeilzadeh, H., Sampson, A., Ceze, L., Burger, D.: Neural acceleration for general-purpose approximate programs. In: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 449–460. IEEE Computer Society (2012)Google Scholar
- Faust, O., Hagiwara, Y., Hong, T.J., Lih, O.S., Acharya, U.R.: Deep learning for healthcare applications based on physiological signals: a review. Comput. Methods Programs Biomed.
**161**, 1–13 (2018)CrossRefGoogle Scholar - Goh, G.B., Hodas, N.O., Vishnu, A.: Deep learning for computational chemistry. J. Comput. Chem.
**38**(16), 1291–1307 (2017)CrossRefGoogle Scholar - Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., Cong, J.: FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In: 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–159. IEEE (2017a)Google Scholar
- Guan, Y., Yuan, Z., Sun, G., Cong, J.: FPGA-based accelerator for long short-term memory recurrent neural networks. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634. IEEE (2017b)Google Scholar
- Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)
- Hu, M., Strachan, J.P., Li, Z., Grafals, E.M., Davila, N., Graves, C., Lam, S., Ge, N., Yang, J.J., Williams, R.S.: Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd Annual Design Automation Conference, p. 19. ACM (2016)
- Huangfu, W., Li, S., Hu, X., Xie, Y.: RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2018)
- Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems, pp. 4107–4115 (2016)
- Ji, Y., Zhang, Y., Li, S., Chi, P., Jiang, C., Qu, P., Xie, Y., Chen, W.: NEUTRAMS: neural network transformation and co-design under neuromorphic hardware constraints. In: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 21. IEEE Press (2016)
- Ji, H., Song, L., Jiang, L., Li, H.H., Chen, Y.: ReCom: an efficient resistive accelerator for compressed deep neural networks. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 237–240. IEEE (2018a)
- Ji, Y., Zhang, Y., Chen, W., Xie, Y.: Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler. In: ACM SIGPLAN Notices, vol. 53, pp. 448–460. ACM (2018b)
- Ji, H., Jiang, L., Li, T., Jing, N., Ke, J., Liang, X.: HUBPA: high utilization bidirectional pipeline architecture for neuromorphic computing. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 249–254. ACM (2019)
- Jiang, L., Kim, M., Wen, W., Wang, D.: XNOR-POP: a processing-in-memory architecture for binary convolutional neural networks in Wide-IO2 DRAMs. In: 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6. IEEE (2017)
- Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)
- Kim, D., Kung, J., Chai, S., Yalamanchili, S., Mukhopadhyay, S.: Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 380–392. IEEE (2016)
- Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
- Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
- Kuzum, D., Yu, S., Wong, H.P.: Synaptic electronics: materials, devices and applications. Nanotechnology **24**(38), 382001 (2013)
- Kwon, H., Samajdar, A., Krishna, T.: MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In: ACM SIGPLAN Notices, vol. 53, pp. 461–475. ACM (2018)
- LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
- Lee, D., Lee, J., Jo, M., Park, J., Siddik, M., Hwang, H.: Noise-analysis-based model of filamentary switching ReRAM with \(\text{ZrO}_{x}/\text{HfO}_{x}\) stacks. IEEE Electron Device Lett. **32**(7), 964–966 (2011)
- Li, Y., Liu, Z., Xu, K., Yu, H., Ren, F.: A 7.663-TOPS 8.2-W energy-efficient FPGA accelerator for binary convolutional neural networks. In: FPGA, pp. 290–291 (2017)
- Li, B., Song, L., Chen, F., Qian, X., Chen, Y., Li, H.H.: ReRAM-based accelerator for deep learning. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 815–820. IEEE (2018)
- Lin, J., Li, S., Hu, X., Deng, L., Xie, Y.: CNNWire: boosting convolutional neural network with Winograd on ReRAM based accelerators. In: Proceedings of the 2019 on Great Lakes Symposium on VLSI, pp. 283–286. ACM (2019a)
- Lin, J., Zhu, Z., Wang, Y., Xie, Y.: Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 639–644. ACM (2019b)
- Liu, X., Mao, M., Liu, B., Li, H., Chen, Y., Li, B., Wang, Y., Jiang, H., Barnell, M., Wu, Q., et al.: RENO: a high-efficient reconfigurable neuromorphic computing accelerator design. In: 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2015)
- Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016, pp. 21–37. Springer, Cham (2016)
- Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Design of fault-tolerant neuromorphic computing systems. In: 2018 IEEE 23rd European Test Symposium (ETS), pp. 1–9. IEEE (2018a)
- Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Fault tolerance for RRAM-based matrix operations. In: 2018 IEEE International Test Conference (ITC), pp. 1–10. IEEE (2018b)
- Liu, R., Peng, X., Sun, X., Khwa, W.S., Si, X., Chen, J.J., Li, J.F., Chang, M.F., Yu, S.: Parallelizing SRAM arrays with customized bit-cell for binary neural networks. In: Proceedings of the 55th Annual Design Automation Conference, p. 21. ACM (2018c)
- Liu, X., Yang, H., Liu, Z., Song, L., Li, H., Chen, Y.: DPatch: an adversarial patch attack on object detectors. arXiv:1806.02299 (2018d)
- Liu, M., Xia, L., Wang, Y., Chakrabarty, K.: Fault tolerance in neuromorphic computing systems. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 216–223. ACM (2019a)
- Liu, T., Wen, W., Jiang, L., Wang, Y., Yang, C., Quan, G.: A fault-tolerant neural network architecture. In: Proceedings of the 56th Annual Design Automation Conference 2019, DAC ’19, pp. 55:1–55:6. ACM, New York (2019b). https://doi.org/10.1145/3316781.3317742
- Mahajan, D., Park, J., Amaro, E., Sharma, H., Yazdanbakhsh, A., Kim, J.K., Esmaeilzadeh, H.: TABLA: a unified template-based framework for accelerating statistical machine learning. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 14–26. IEEE (2016)
- Mao, M., Cao, Y., Yu, S., Chakrabarti, C.: Optimizing latency, energy, and reliability of 1T1R ReRAM through appropriate voltage settings. In: 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 359–366. IEEE (2015)
- Mao, M., Chen, P.Y., Yu, S., Chakrabarti, C.: A multilayer approach to designing energy-efficient and reliable ReRAM cross-point array system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **25**(5), 1611–1621 (2017)
- Mao, M., Sun, X., Peng, X., Yu, S., Chakrabarti, C.: A versatile ReRAM-based accelerator for convolutional neural networks. In: 2018 IEEE International Workshop on Signal Processing Systems (SiPS), pp. 211–216. IEEE (2018a)
- Mao, M., Yu, S., Chakrabarti, C.: Design and analysis of energy-efficient and reliable 3-D ReRAM cross-point array system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **26**(7), 1290–1300 (2018b)
- Mao, M., Peng, X., Liu, R., Li, J., Yu, S., Chakrabarti, C.: MAX2: an ReRAM-based neural network accelerator that maximizes data reuse and area utilization. IEEE J. Emerg. Sel. Top. Circuits Syst. (2019)
- Miotto, R., Wang, F., Wang, S., Jiang, X., Dudley, J.T.: Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform. **19**(6), 1236–1246 (2017)
- Mohanty, A., Du, X., Chen, P.Y., Seo, J.S., Yu, S., Cao, Y.: Random sparse adaptation for accurate inference with inaccurate multi-level RRAM arrays. In: 2017 IEEE International Electron Devices Meeting (IEDM), pp. 6–3. IEEE (2017)
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)
- Niu, D., Chen, Y., Xu, C., Xie, Y.: Impact of process variations on emerging memristor. In: Proceedings of the 47th Design Automation Conference, pp. 877–882. ACM (2010)
- Niu, D., Xu, C., Muralimanohar, N., Jouppi, N.P., Xie, Y.: Design trade-offs for high density cross-point resistive memory. In: Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 209–214. ACM (2012)
- Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: SCNN: an accelerator for compressed-sparse convolutional neural networks. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. IEEE (2017)
- Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. ACM (2016)
- Qiao, X., Cao, X., Yang, H., Song, L., Li, H.: AtomLayer: a universal ReRAM-based CNN accelerator with atomic layer computation. In: Proceedings of the 55th Annual Design Automation Conference, p. 103. ACM (2018)
- Rajendran, J., Manem, H., Karri, R., Rose, G.S.: An energy-efficient memristive threshold logic circuit. IEEE Trans. Comput. **61**(4), 474–487 (2012)
- Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, pp. 525–542. Springer (2016)
- Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. **115**(3), 211–252 (2015)
- Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, R.S., Srikumar, V.: ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Comput. Archit. News **44**(3), 14–26 (2016)
- Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., Esmaeilzadeh, H.: Bit Fusion: bit-level dynamically composable architecture for accelerating deep neural networks. In: Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 764–775. IEEE Press (2018)
- Song, L., Qian, X., Li, H., Chen, Y.: PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552. IEEE (2017)
- Song, L., Zhuo, Y., Qian, X., Li, H., Chen, Y.: GraphR: accelerating graph processing using ReRAM. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 531–543. IEEE (2018a)
- Song, M., Zhao, J., Hu, Y., Zhang, J., Li, T.: Prediction based execution on deep neural networks. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 752–763. IEEE (2018b)
- Song, L., Chen, F., Young, S.R., Schuman, C.D., Perdue, G., Potok, T.E.: Deep learning for vertex reconstruction of neutrino-nucleus interaction events with combined energy and time data. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3882–3886. IEEE (2019a)
- Song, L., Mao, J., Zhuo, Y., Qian, X., Li, H., Chen, Y.: HyPar: towards hybrid parallelism for deep learning accelerator array. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 56–68. IEEE (2019b)
- Sun, X., Peng, X., Chen, P.Y., Liu, R., Seo, J.S., Yu, S.: Fully parallel RRAM synaptic array for implementing binary neural network with (+1, −1) weights and (+1, 0) neurons. In: Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 574–579. IEEE Press (2018a)
- Sun, X., Yin, S., Peng, X., Liu, R., Seo, J.S., Yu, S.: XNOR-RRAM: a scalable and parallel resistive synaptic architecture for binary neural networks. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1423–1428. IEEE (2018b)
- Tang, T., Xia, L., Li, B., Wang, Y., Yang, H.: Binary convolutional neural network on RRAM. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 782–787. IEEE (2017)
- Wang, Y., Xu, J., Han, Y., Li, H., Li, X.: DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family. In: Proceedings of the 53rd Annual Design Automation Conference, p. 110. ACM (2016)
- Wang, Y., Wen, W., Song, L., Li, H.H.: Classification accuracy improvement for neuromorphic computing systems with one-level precision synapses. In: 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 776–781. IEEE (2017)
- Wang, P., Ji, Y., Hong, C., Lyu, Y., Wang, D., Xie, Y.: SNrram: an efficient sparse neural network computation architecture based on resistive random-access memory. In: Proceedings of the 55th Annual Design Automation Conference, p. 106. ACM (2018)
- Wong, H.S.P., Lee, H.Y., Yu, S., Chen, Y.S., Wu, Y., Chen, P.S., Lee, B., Chen, F.T., Tsai, M.J.: Metal-oxide RRAM. Proc. IEEE **100**(6), 1951–1970 (2012)
- Woo, J., Peng, X., Yu, S.: Design considerations of selector device in cross-point RRAM array for neuromorphic computing. In: 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4. IEEE (2018)
- Xu, C., Niu, D., Muralimanohar, N., Jouppi, N.P., Xie, Y.: Understanding the trade-offs in multi-level cell ReRAM memory design. In: 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2013)
- Xu, C., Niu, D., Muralimanohar, N., Balasubramonian, R., Zhang, T., Yu, S., Xie, Y.: Overcoming the challenges of crossbar resistive memory architectures. In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476–488. IEEE (2015)
- Yazdanbakhsh, A., Samadi, K., Kim, N.S., Esmaeilzadeh, H.: GANAX: a unified MIMD-SIMD acceleration for generative adversarial networks. In: Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 650–661. IEEE Press (2018)
- Yu, S., Wu, Y., Jeyasingh, R., Kuzum, D., Wong, H.S.P.: An electronic synapse device based on metal oxide resistive switching memory for neuromorphic computation. IEEE Trans. Electron Devices **58**(8), 2729–2737 (2011)
- Yu, S., Wu, Y., Wong, H.S.P.: Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory. Appl. Phys. Lett. **98**(10), 103514 (2011)
- Yu, S., Gao, B., Fang, Z., Yu, H., Kang, J., Wong, H.S.P.: A low energy oxide-based electronic synaptic device for neuromorphic visual systems with tolerance to device variation. Adv. Mater. **25**(12), 1774–1779 (2013)
- Yu, S., Chen, P.Y., Cao, Y., Xia, L., Wang, Y., Wu, H.: Scaling-up resistive synaptic arrays for neuro-inspired architecture: challenges and prospect. In: 2015 IEEE International Electron Devices Meeting (IEDM), pp. 17–3. IEEE (2015)
- Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: customizing DNN pruning to the underlying hardware parallelism. In: ACM SIGARCH Computer Architecture News, vol. 45, pp. 548–560. ACM (2017)
- Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170. ACM (2015)
- Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., Cong, J.: Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In: Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331. ACM (2016)
- Zhang, C., Sun, G., Fang, Z., Zhou, P., Pan, P., Cong, J.: Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput. Aided Design Integr. Circuits Syst. (2018). https://doi.org/10.1109/TCAD.2017.2785257
- Zokaee, F., Zhang, M., Jiang, L.: FindeR: accelerating FM-index-based exact pattern matching in genomic sequences through ReRAM technology. In: Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques. ACM (2019)