On-chip trainable hardware-based deep Q-networks approximating a backpropagation algorithm

Reinforcement learning (RL) using deep Q-networks (DQNs) has shown performance beyond the human level in a number of complex problems. In addition, many studies have focused on bio-inspired hardware-based spiking neural networks (SNNs) given the capabilities of these technologies to realize both parallel operation and low power consumption. Here, we propose an on-chip training method for DQNs applicable to hardware-based SNNs. The method approximates the conventional backpropagation (BP) algorithm, and a performance evaluation based on two simple games shows that the proposed system achieves performance similar to that of a software-based system. The proposed training method can minimize memory usage and reduce power consumption and area occupation levels. In particular, for simple problems, the memory dependency can be significantly reduced given that high performance is achieved without using replay memory. Furthermore, we investigate the effect of the nonlinearity characteristics and two types of variation of non-ideal synaptic devices on the performance outcomes. In this work, thin-film transistor (TFT)-type flash memory cells are used as synaptic devices. A simulation is also conducted using a fully connected neural network with non-leaky integrate-and-fire (I&F) neurons. The proposed system shows strong immunity to device variations because an on-chip training scheme is adopted.


Introduction
Recently, neuromorphic computing inspired by the human brain has emerged as one of the most promising types of computing architectures. It overcomes the limitations of the conventional von Neumann architecture, which is associated with a bottleneck between the memory and the processor, and offers advantages in terms of time and power consumption [1][2][3]. Two methods are most commonly used to train a neural network: the backpropagation (BP) algorithm and the spike-timing-dependent plasticity (STDP) learning rule [4,5]. The BP algorithm propagates error values obtained from the output layer in the backward direction and updates the synaptic weights through these error values. It is suitable for processing labeled data and is mainly used for offline supervised learning. Software-based deep neural networks (DNNs) using the BP algorithm have shown high performance in many fields [6][7][8]. However, this training algorithm requires considerable amounts of time and power to determine the error values [4]. The STDP learning rule is inspired by the weight changes of biological synapses and is mainly used for online unsupervised learning. Unlike in the BP algorithm, the synaptic weights are updated according to the time difference between the presynaptic and postsynaptic spikes [9]. The STDP learning rule has an advantage in that the neural network consumes less power by enabling event-driven operations and on-chip training. However, its performance still falls short of that of the BP algorithm [10][11][12][13].
For low power consumption and high speed, hardware-based spiking neural networks (SNNs) using the conductance of electronic synaptic devices as synaptic weights have been actively studied [14][15][16][17]. Hardware-based neural networks can perform massively parallel computations with low levels of power. When training a hardware-based neural network using the BP algorithm, two methods are most commonly used: an off-chip training method that simply transfers the weight values trained in software using the BP algorithm to the synaptic devices, and an on-chip training method that continuously updates the synaptic weights during the training process [4]. Off-chip training uses more power because the training process for weight updating occurs in the software. Furthermore, the performance can be degraded due to variations of non-ideal synaptic devices [5,18]. On the other hand, on-chip training is immune to variations of non-ideal synaptic devices [10][19][20][21][22]. This approach is also advantageous given its low power consumption and high-speed training capabilities, as both the weighted sum and the weight updating occur in the hardware-based neural network [5,23,24]. Recently, several studies have investigated on-chip training for low power consumption and good performance outcomes [25][26][27]. On-chip training in a hardware-based neural network showed performance fairly similar to that of the conventional software-based neural network.
Thus, we focused on the on-chip training method using the BP algorithm in a hardware-based neural network. This is advantageous given its relatively high performance compared to the STDP learning rule, low power consumption, high-speed training capabilities, and strong immunity to variations of non-ideal synaptic devices compared to the off-chip training method. While several studies related to on-chip training have been conducted on various networks, to the best of our knowledge no studies have been reported on on-chip reinforcement learning (RL). RL achieves human-level performance, which is difficult to achieve with simple DNNs [28,29]. In RL with complex problems, training using deep Q-networks (DQNs) is preferred, as performance levels when using this method have already surpassed human levels in several fields [28][29][30][31][32].
In this work, we propose an on-chip training method for RL using vanilla DQNs while minimizing memory usage. While several training methods applicable to RL have been reported, vanilla DQNs were used as a simple training method suitable for implementing hardware-based RL [31][32][33][34][35][36][37][38]. In addition, nonlinearity characteristics and variations of non-ideal synaptic devices are considered. Two types of variation of non-ideal synaptic devices were considered: the pulse-to-pulse variation and the device-to-device variation. The entire training process is divided into four phases: two forward phases, one backward phase, and one update phase. Moreover, for simple problems, the network can be trained without using replay memory, thereby significantly reducing the memory dependency. We use the previously proposed neuron circuits [39] and synaptic devices [40] to evaluate the proposed training method. The performance of the training method is evaluated through two example games: a 'Fruit Catching' game and a 'Rush Hour' game.
This paper is organized as follows. Section 2 presents the characteristics of the synaptic device used in this work and the proposed training method for hardware-based DQNs. Section 3 provides the system-level simulation results of the proposed hardware-based neural network based on two simple games and a discussion about the results. Section 4 provides a summary of the overall paper and the future work.
2 Device characteristics and training method

Synaptic device
In a hardware-based neural network, synaptic devices representing weight values are very important. In this work, we use thin-film transistor (TFT)-type flash memory cells as synaptic devices, fabricated using a method published in an earlier report by the authors [40]. A schematic 3D view of the proposed synaptic device is shown in Fig. 1a. The TFT-type flash memory cells are fabricated on a six-inch Si wafer with conventional CMOS process technology. Between the word line (WL) and the source line (SL), a half-covered n+ poly-Si floating gate (FG) is formed as a charge storage layer. Because the FG covers only half of the poly-Si channel, the threshold voltage does not fall below 0 in the full erase state. This prevents leakage current and reduces the standby power during the training process. The thicknesses of the poly-Si active layer, the tunneling SiO2 layer, and the blocking SiO2 layer are 20 nm, 7 nm, and 15 nm, respectively. The distance between the source and drain is 0.5 μm, and the width of the control gate is 2 μm. If the width of the control gate is scaled to the minimum feature size (F), one synaptic device can be scaled down to $8F^2$. Figure 1b shows the measured long-term potentiation (LTP) and long-term depression (LTD) characteristics of the proposed synaptic device as a function of the number of pulses applied to the WL and SL. Fifty repeated erase pulses ($V_{WL}$ = -3 V, $V_{SL}$ = 5 V) and 300 repeated program pulses ($V_{WL}$ = 0 V, $V_{SL}$ = -4.8 V) are applied in this measurement. The behavioral model of the nonlinear synaptic device is generally expressed using Eqs. (1) and (2) [10], which represent the LTP and LTD behaviors of synaptic devices, respectively. $G(n)$ denotes the conductance of the synaptic device when potentiation or depression pulses are applied n times, and $G_{max}$ and $G_{min}$ are the maximum and minimum conductance values, respectively. $\alpha_p$ and $\beta_p$ represent the fitting parameters of the potentiation characteristics.
Similarly, $\alpha_d$ and $\beta_d$ represent the fitting parameters of the depression characteristics. The nonlinearity of the synaptic device is determined by $\beta_p$ and $\beta_d$ in these equations. The LTP and LTD characteristics of the proposed TFT-type synaptic device are fitted with $\beta_p$ equal to 2.5 and $\beta_d$ equal to 5. In addition, the synaptic device consumes an energy of ~100 fJ/spike on average (approximately a 10 nA current at a pulse with a 1 V amplitude and a 10 μs width).
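To make the role of the nonlinearity parameters concrete, the following is a minimal sketch of a nonlinear conductance update. The exponential-saturation form used here is an assumption chosen for illustration (a commonly used behavioral form); it is not necessarily identical to Eqs. (1) and (2) of this paper. The function names, the normalized conductance range, and the step size `a_p = a_d = 0.05` are hypothetical. The last function only checks the energy arithmetic quoted above (10 nA × 1 V × 10 μs).

```python
import math

def potentiate(g, g_min, g_max, a_p, b_p):
    """Apply one potentiation pulse (ASSUMED exponential-saturation model)."""
    x = (g - g_min) / (g_max - g_min)        # normalized conductance in [0, 1]
    return min(g_max, g + a_p * math.exp(-b_p * x))

def depress(g, g_min, g_max, a_d, b_d):
    """Apply one depression pulse (ASSUMED mirror image of potentiate)."""
    x = (g_max - g) / (g_max - g_min)
    return max(g_min, g - a_d * math.exp(-b_d * x))

def ltp_curve(n_pulses, g_min=0.0, g_max=1.0, a_p=0.05, b_p=2.5):
    """Conductance after 0..n_pulses potentiation pulses, starting at g_min."""
    g = g_min
    curve = [g]
    for _ in range(n_pulses):
        g = potentiate(g, g_min, g_max, a_p, b_p)
        curve.append(g)
    return curve

def spike_energy(current_a, voltage_v, width_s):
    """Energy of a rectangular pulse: E = I * V * t (joules)."""
    return current_a * voltage_v * width_s
```

With `b_p = 0` each pulse changes the conductance by the same amount (a linear device); with a larger value the step shrinks as the conductance approaches its limit, which is the nonlinearity that the fitting parameters describe.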

Training method
Before introducing the training method, we describe the behavior of the integrate-and-fire (I&F) neuron model used in the neural network. The spikes from presynaptic neurons are integrated into the membrane capacitor of the postsynaptic I&F neurons. When the membrane potential exceeds the threshold voltage, postsynaptic spikes are generated from the I&F neuron and are transmitted to the next layer. The behavior of the I&F neurons in the SNNs can approximate the rectified linear unit (ReLU) activation function of conventional DNNs, as the number of the postsynaptic spikes is proportional to the membrane potential of the I&F neuron. Here, the membrane potentials are initialized to zero, and the lower limit is set to zero.
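The neuron behavior described above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the authors' circuit: the function name `if_layer_forward`, the normalized threshold `v_th = 1.0`, the normalized capacitance `c_mem = 1.0`, and the subtract-threshold reset are choices made for the example. It uses a non-leaky neuron whose membrane potential starts at zero, is clamped at zero from below, and emits at most one spike per time step.

```python
def if_layer_forward(in_spikes, weights, v_th=1.0, c_mem=1.0):
    """Simulate one layer of non-leaky integrate-and-fire neurons.

    in_spikes: list over time steps of lists of 0/1 presynaptic spikes.
    weights:   weights[i][j] connects presynaptic i to postsynaptic j.
    Returns (out_spikes, final_membrane_potentials).
    """
    n_out = len(weights[0])
    v = [0.0] * n_out                      # membrane potentials start at zero
    out_spikes = []
    for s_t in in_spikes:
        fired = []
        for j in range(n_out):
            weighted_sum = sum(weights[i][j] * s_t[i] for i in range(len(s_t)))
            v[j] = max(v[j] + weighted_sum / c_mem, 0.0)   # lower limit is zero
            if v[j] > v_th:
                v[j] -= v_th               # potential drops by the threshold
                fired.append(1)
            else:
                fired.append(0)
        out_spikes.append(fired)
    return out_spikes, v
```

For example, one input that spikes at all five time steps, driving weights 0.625 and -0.5, makes the first neuron fire three times while the second never fires, which mirrors the ReLU-like behavior: a negative weighted sum yields zero output.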
Details of the behavior of the membrane potential ($V_j^l$) of the j-th I&F neuron in the l-th layer are expressed by Eqs. (3) and (4).
Here, $S_j^l(t)$ represents the spikes generated in the j-th neuron in the l-th layer at time t in the form of a voltage pulse. $w_{ij}^{l-1}$ indicates the synaptic weight between the i-th neuron in the (l-1)-th layer and the j-th neuron in the l-th layer. $C_{mem}$ and $N_{l-1}$ represent the membrane capacitance of the I&F neuron and the total number of neurons in the (l-1)-th layer, respectively. Equation (4) represents the behavior of the j-th neuron in the l-th layer when the membrane potential exceeds the threshold voltage ($V_{th}$). At this time, the membrane potential drops by $V_{th}$ and the neuron generates a postsynaptic spike ($S_j^l(t)$). By accumulating the teaching signal ($Z_k(t)$) and the output spike, we can obtain the error value ($\delta_k^L$) in the output layer (l = L), where T is the total number of time steps taken to train one set of input pixels. In this paper, images were used as input data, and the input pixels are presented as binary data with a value of 0 or 1. The total number of times a neuron fires in each layer during the training process ($\sum_t^T S_j^l(t)$) is limited to a maximum of T times. The teaching signal supervises the training direction and is obtained through methods described in Sect. 2.3.
The error values in the previous layers ($l \in \{1, 2, \ldots, L-1\}$) are obtained through the backward weighted sum. When the j-th neuron in the l-th layer fires at least once during time T ($\sum_t^T S_j^l \geq 1$), the error value is obtained through the backward weighted sum. On the other hand, the error value is 0 when the j-th neuron in the l-th layer does not fire during time T ($\sum_t^T S_j^l = 0$). This reflects the derivative value of the ReLU activation function.
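The gating of the backward weighted sum by neuron activity can be sketched as follows. This is a minimal illustration assuming plain Python lists and a weight layout `weights[j][k]` from neuron j in layer l to neuron k in layer l+1; the function names `fire_bits` and `hidden_errors` are hypothetical and not taken from the paper.

```python
def fire_bits(spike_trains):
    """Return 1 for each neuron that fired at least once during T, else 0.

    spike_trains: list over time steps of lists of 0/1 spikes per neuron.
    """
    n = len(spike_trains[0])
    return [1 if sum(step[j] for step in spike_trains) >= 1 else 0
            for j in range(n)]

def hidden_errors(next_errors, weights, fired):
    """Backward weighted sum of next-layer errors, gated by the fire bit.

    The gate plays the role of the ReLU derivative: 1 if the neuron fired
    at least once during T, 0 otherwise.
    """
    errors = []
    for j, bit in enumerate(fired):
        if bit:
            errors.append(sum(w * d for w, d in zip(weights[j], next_errors)))
        else:
            errors.append(0.0)             # silent neuron: derivative is 0
    return errors
```

Only the single fire bit per neuron is needed in the backward pass, so the spike counts themselves do not have to be retained.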
The weight update value is computed from the error value obtained in Eq. (7) according to Eq. (8), where $\eta$ denotes the ratio used when converting the magnitude of the error value to the pulse width. The conversion ratio $\eta$ has a meaning similar to that of the learning rate in software-based networks. In addition, the weight update value ($\Delta w_{jk}^l$) becomes 0 when the number of presynaptic spikes generated during time T ($\sum_t^T S_j^l(t)$) is 0. Then, the synaptic weight between the j-th neuron in the l-th layer and the k-th neuron in the (l+1)-th layer is updated by $\Delta w_{jk}^l$. The derivative value of the activation function is not computed explicitly in the overall training process, as the derivative value of the ReLU function is 1 or 0. When $\sum_t^T S_j^l(t)$ is 0, $\delta_j^l$ becomes 0 in Eq. (7) and $\Delta w_{ij}^{l-1}$ becomes 0 in Eq. (8), an outcome identical to reflecting the derivative value 0 of the ReLU function. When $\sum_t^T S_j^l(t)$ is 1 or more, $\delta_j^l$ is obtained as if the derivative value 1 of the ReLU function were reflected, and $\Delta w_{ij}^{l-1}$ is obtained through Eq. (8). In a hardware-based neural network, the weight value between the i-th presynaptic neuron and the j-th postsynaptic neuron is represented by the difference in the conductance of two synaptic devices, $w_{ij} = G_{ij}^+ - G_{ij}^-$, where $G_{ij}^+$ and $G_{ij}^-$ represent the positive and negative weight values, respectively. Two synaptic devices are required for each weight value to express a negative synaptic weight because the conductance of a synaptic device only has a positive value. The update and reset methods of $G^+$ and $G^-$ follow the method proposed by Lim [41]. $G^+$ is increased when weight potentiation is required, and $G^-$ is increased when weight depression is necessary. If $G^+$ reaches $G_{max}$ and weight potentiation is necessary, $G^-$ is initialized and then increased to a conductance level one step lower than the previous value. When both $G^+$ and $G^-$ reach $G_{max}$, they are initialized to $G_{min}$. Figure 2 represents the overall training process of the hardware-based DQN.
It consists of three elements: the environment in which the game progresses, the DQN where training takes place and the appropriate action is selected, and replay memory for an experience replay.
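Returning to the differential-pair scheme of the training method, the update and reset rules for $G^+$ and $G^-$ can be sketched with discrete conductance levels. This is an illustration under assumptions rather than the method of [41] itself: conductances are modeled as integer levels from 0 ($G_{min}$) to `levels - 1` ($G_{max}$), the depression rule for a saturated $G^-$ is assumed to mirror the potentiation rule, and the class and function names are hypothetical.

```python
def weight_update(eta, pre_fired_bit, error):
    """Weight update value: zero if the presynaptic neuron never fired."""
    return eta * error if pre_fired_bit else 0.0

class DifferentialSynapse:
    """Weight stored as the difference of two conductance levels."""

    def __init__(self, levels):
        self.g_max = levels - 1            # level index of G_max; G_min is 0
        self.g_plus = 0
        self.g_minus = 0

    @property
    def weight(self):
        return self.g_plus - self.g_minus

    def _reset_if_both_saturated(self):
        if self.g_plus == self.g_max and self.g_minus == self.g_max:
            self.g_plus = self.g_minus = 0  # both initialized to G_min
            return True
        return False

    def potentiate(self):
        if self._reset_if_both_saturated():
            return
        if self.g_plus < self.g_max:
            self.g_plus += 1
        elif self.g_minus > 0:
            # G- is initialized, then re-increased to one step below its
            # previous level; the net effect is one step down.
            self.g_minus -= 1

    def depress(self):
        # ASSUMPTION: mirror image of potentiate for a saturated G-.
        if self._reset_if_both_saturated():
            return
        if self.g_minus < self.g_max:
            self.g_minus += 1
        elif self.g_plus > 0:
            self.g_plus -= 1
```

Because each device is only ever stepped upward during normal operation, a net decrease of the opposite device has to be realized as a reset followed by re-programming, which is what the `elif` branches stand for.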

Hardware-based deep Q-network
First, the current state (s) of the environment is applied to the input of the DQN and stored in the replay memory (① of Fig. 2). In the network, forward propagation occurs and the first fired output neuron is selected as an action (a) according to the learning rule. This action is applied to the environment and stored in the replay memory (② of Fig. 2). Next, the reward (r) and the next state (s′) that appear when the given action is performed are stored in the replay memory (③ of Fig. 2). Through this process, one set of (s, a, r, s′) data is stored in the replay memory. Finally, using the data stored in the replay memory, the DQN is trained using the method described in Sect. 2.2 (④ of Fig. 2).
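The storage and sampling of (s, a, r, s′) data described above can be sketched as a fixed-size buffer. This is a minimal software illustration, not the hardware memory organization; the class name `ReplayMemory` and the oldest-first eviction policy are assumptions, while the default capacity of 500 and batch of 50 follow the values used in the simulations of this work.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity=500):
        # ASSUMPTION: the oldest transition is dropped when the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=50, rng=random):
        """Return up to batch_size transitions chosen uniformly at random."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Each sampled transition would then drive one pass of the four-phase training procedure.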
In this hardware-based network, the entire training process is divided into four phases. Each phase is split into five time steps (T = 5) with a total length of 150 μs. Only one spike can be generated during each time step, and the input pixel is presented as binary data with a value of 0 or 1. When the input data from a pixel is 0, no input spike is generated, and if it is 1, input spikes are generated at every time step. Figure 3a shows a schematic illustration of a fully connected network with one hidden layer as an example. The case of the second phase where the total number of time steps per phase is 5 and the input data from a pixel is 1 (equivalent to 5 pulses with a 3 V amplitude) is shown in Fig. 3b as an example. Figure 3c presents the pulse scheme for weight updating during the fourth phase.
The first phase is the forward phase that receives state s′ as an input and obtains the maximum value of the output value (Q). This output value Q represents the long-term expected return of executing action a′ from given state s′.
With a higher Q-value, better long-term results in state s′ can be obtained when the corresponding action a′ is performed. In order to obtain the maximum result, the agent selects the action that will lead to the highest Q-value in each state. When the weighted sum from the presynaptic neurons is stored in the forward direction membrane capacitor of the postsynaptic neuron and the membrane potential exceeds the threshold voltage, a postsynaptic spike is generated through the I&F neuron circuit. When the first spike is generated in the k-th output neuron, the membrane potentials of the other output neurons are set to 0. The generated spikes of the k-th output neuron then charge the connected capacitor. The amount of charge stored in the capacitor connected to the k-th output neuron represents the maximum Q-value, which is used as a teaching signal in the next phase.
The second phase is also a forward phase; it receives state s as an input and obtains the error value for backpropagation. A process identical to that during the first phase occurs, except that the input data are different. In the neurons of all layers except for the last layer ($l \in \{1, 2, \ldots, L-1\}$), whether or not each neuron fired during time T ($\mathrm{sign}(\sum_t^T S_j^l(t))$) is stored as a single bit. In other words, neurons that fired at least once during time T are stored with a value of 1, and neurons that did not fire during time T are stored with a value of 0. In this phase, a teaching signal is generated through a pulse-width modulation (PWM) circuit and the error value is obtained from the difference between the output spike and the teaching signal. The teaching signal $Z_j$ is obtained as shown below using the maximum Q-value obtained in the first phase and the reward r obtained when action a is taken in state s.

$$Z_j = r + \gamma \max_{a'} Q(s', a')$$

where $\gamma$ represents a discount factor that decreases the value of future rewards over time. $\gamma$ is between 0 and 1, and usually has a value of 0.9. This equation is well known as the Bellman equation. The teaching signal is applied only to the m-th output neuron and only the m-th error value is calculated. This m-th output neuron is the neuron corresponding to the action a taken when changing from state s to state s′. The error values of the other output neurons are set to 0.
The third phase is the backward phase, which propagates the error values obtained in the second phase. In this phase, the weighted sum of the error values obtained in the next layer is stored in the backward direction membrane capacitor. The hidden layer has two membrane capacitors: the forward direction membrane capacitor to store the weighted sum and the backward direction membrane capacitor to store the weighted sum of the error values obtained in the next layer.
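The computation of the teaching signal and of the output-layer error for the selected action can be sketched as follows. This is a minimal numerical illustration: the function names are hypothetical, and the error is assumed to be the teaching signal minus the accumulated output spike count of the m-th neuron, so that a positive error calls for potentiation.

```python
def teaching_signal(reward, max_q_next, gamma=0.9):
    """Bellman target: reward plus discounted maximum Q-value of s'."""
    return reward + gamma * max_q_next

def output_errors(teaching, output_spike_counts, action_index):
    """Error vector of the output layer.

    Only the neuron corresponding to the taken action (index m) receives
    the teaching signal; all other error values are set to 0.
    """
    errors = [0.0] * len(output_spike_counts)
    # ASSUMPTION: error = teaching signal - accumulated output spikes.
    errors[action_index] = teaching - output_spike_counts[action_index]
    return errors
```

For instance, with a reward of 1, a maximum next-state Q-value of 2, and a discount factor of 0.5, the teaching signal is 2; if the selected neuron fired three times, its error is -1 and every other entry is 0.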
The fourth phase is the update phase, which updates the synaptic weights using the error values obtained in the third phase. When the error values are positive or negative, error spikes with corresponding values of 5.0 V or -1.8 V are generated. The pulse width of the error spike is made proportional to the magnitude of the error value by the PWM circuit. These error spikes are applied to the source of the synaptic devices twice during this phase, as shown in Fig. 3c. The second error spike is applied after 10 μs.
When the single bit value per neuron ($\mathrm{sign}(\sum_t^T S_j^l(t))$) stored in the second phase is 1, two 10 μs wide spikes having magnitudes of 3 V and -3 V are applied to the gate of the synaptic device in turn. If the error spike is positive, an erase pulse is applied to the synaptic device by overlapping the error spike applied to the source line and the negative part of the spike applied to the gate of the synaptic device, which potentiates the synaptic weight, as shown in Fig. 3c. In the opposite case, when the error spike is negative, a program pulse is applied to the synaptic device by overlapping the error spike applied to the source line and the positive part of the spike applied to the gate of the synaptic device, which depresses the synaptic weight. However, when the single bit value is 0, no spike is applied to the gate of the synaptic device, and a weight update does not occur.
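The decision logic of the update phase can be summarized in a short sketch. The voltages and the 10 μs gate pulse width are taken from the description above; the function name, the conversion ratio argument `eta_us`, and the clipping of the overlap at the gate pulse width are assumptions made for illustration.

```python
V_ERR_POS = 5.0        # source voltage for a positive error (V)
V_ERR_NEG = -1.8       # source voltage for a negative error (V)
GATE_WIDTH_US = 10.0   # width of each gate spike (microseconds)

def update_pulse(error, fired_bit, eta_us):
    """Return (operation, source_voltage, overlap_width_us) for one synapse.

    eta_us converts the error magnitude to an error-spike width in
    microseconds. ASSUMPTION: the effective overlap cannot exceed the
    10 us gate spike, so the width is clipped there.
    """
    if fired_bit == 0 or error == 0:
        return ("none", 0.0, 0.0)          # no gate spike, no update
    width = min(abs(error) * eta_us, GATE_WIDTH_US)
    if error > 0:
        # Overlap with the -3 V gate spike: erase pulse, weight potentiated.
        return ("erase", V_ERR_POS, width)
    # Overlap with the +3 V gate spike: program pulse, weight depressed.
    return ("program", V_ERR_NEG, width)
```

In the program case, the gate-to-source difference is 3 V - (-1.8 V) = 4.8 V, the same magnitude as the program condition ($V_{WL}$ = 0 V, $V_{SL}$ = -4.8 V) used in the device measurement.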

Results and discussion
Two system-level simulations are conducted in Python with the PyTorch library to evaluate the proposed training method and hardware-based network architecture on the Fruit Catching game and the Rush Hour game. The parameters used in this simulation are shown in Table 1. The synaptic weights for all the simulations are initialized using the initialization method proposed by He [42].

Fruit catching game
Figure 4 shows an example of how the Fruit Catching game proceeds. In a 10 × 10 grid world, the fruit is 1 × 1 in size and the basket is 1 × 3 in size. When a new game starts ($t_{game} = 1$), a fruit is created at a random position among the ten columns in the first row, and as each time step passes, the fruit falls one row. At each time step, the basket can take one of three actions in the bottom row: staying in place or moving to the left or right by one column. The bottom of Fig. 4 presents an example of the output spikes generated in the output neurons when $t_{game} = 1$. In this case, because the firing rate of the output neuron representing the moving-left action is the highest, the agent takes an action that moves the basket to the left. When the fruit reaches the ninth row ($t_{game} = 9$), as shown in the rightmost part of Fig. 4, the agent receives a reward of 1 if the basket is under the fruit. On the other hand, if the basket is not under the fruit, the agent receives a reward of -1. In all other situations, the reward received by the agent is 0. One episode ends with this process, and a new game begins. Figures 5a and b represent the catching rate of the proposed hardware-based neural network with different epsilon values (ε) and discount factors (γ). Figure 5a is an ideal case with an epsilon value of 1 and a discount factor of 0.9, and Fig. 5b is a simplified case with an epsilon value of 0 and a discount factor of 1.0. The catching rate was obtained through 1000 test games. The network size used in the simulation is 100-100-100-3. Replay memory with a size of 500 is used in this work, and the network is trained using 50 randomly selected datasets for each action. Here, for example, replay memory with a size of 1 has the number of bits required to store one (s, a, r, s′) dataset. Figure 5c shows the average value of the catching rate during the last 200 episodes for the ideal case (ε = 1, γ = 0.9) and the simplified case (ε = 0, γ = 1.0). As the networks in the ideal and simplified cases are trained well without a significant difference, the network is trained well even if exploration is not employed and the discount factor is set to 1 in the Fruit Catching game, which is a relatively simple game.
The catching rate of the network as a function of the nonlinearity factor (β) is also investigated, as indicated in Fig. 6a. Training was conducted under conditions identical to those in Fig. 5b, with the result showing the average catching rate for the last 200 episodes. As the nonlinearity factor increases, the catching rate decreases slightly (by about 5% when β = 5). Figure 6b shows the average catching rate for the last 200 episodes versus the variation of the synaptic weights. The pulse-to-pulse variation and the device-to-device variation are considered, and the x-axis in Fig. 6b represents the standard deviation (σ) of the modeled variation. For the two variation cases, the catching rate scarcely drops and remains nearly constant even if σ increases to 0.5, as the on-chip training scheme is employed.
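The rules of the Fruit Catching game can be sketched as a small environment. This is an illustrative reconstruction from the description in this section, not the authors' simulation code: the class name, the action encoding (0 stay, 1 left, 2 right), the random initial basket position, and the flattening of the 10 × 10 grid into 100 binary pixels are assumptions.

```python
import random

GRID = 10
MOVES = (0, -1, +1)    # action 0: stay, 1: move left, 2: move right

class FruitCatching:
    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.reset()

    def reset(self):
        self.t_game = 1
        self.fruit_row = 0                              # first row
        self.fruit_col = self.rng.randrange(GRID)       # random column
        # ASSUMPTION: the 1 x 3 basket starts at a random legal position.
        self.basket_left = self.rng.randrange(GRID - 2)
        return self.state()

    def state(self):
        """Flatten the grid into 100 binary pixels (fruit and basket = 1)."""
        pixels = [0] * (GRID * GRID)
        pixels[self.fruit_row * GRID + self.fruit_col] = 1
        for col in range(self.basket_left, self.basket_left + 3):
            pixels[(GRID - 1) * GRID + col] = 1
        return pixels

    def step(self, action):
        """Move the basket, drop the fruit one row, return (s', r, done)."""
        moved = self.basket_left + MOVES[action]
        self.basket_left = min(max(moved, 0), GRID - 3)
        self.fruit_row += 1
        self.t_game += 1
        done = self.t_game == 9                         # fruit in ninth row
        reward = 0
        if done:
            caught = self.basket_left <= self.fruit_col <= self.basket_left + 2
            reward = 1 if caught else -1
        return self.state(), reward, done
```

Each state has exactly four active pixels (one for the fruit, three for the basket), which matches the 100 binary inputs of the 100-100-100-3 network, and the three actions correspond to its three output neurons.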

Rush hour game
The second example used to verify the proposed training method is a simple Rush Hour game. On a 6 × 6 board, several cars of length 2 or 3 are placed horizontally or vertically. Only one car can be moved by one position in one move. The goal of the game is to move the target car (red car) to the exit with the fewest number of moves. If the road between the target car and the exit is not blocked, the agent receives a reward of 1 and one episode ends. In all other situations, the reward received by the agent is 0. Figures 7a and d show two examples of the Rush Hour game. Figure 7a is an example of a relatively easy game, which an average adult can solve within a few minutes. On the other hand, Fig. 7d is a relatively complicated example, and it is difficult to know which car to move first. The neural network used in the simulation has 288 input neurons and 16 output neurons with no hidden layers. Here, 288 (6 × 6 × 8) input neurons are used because each car can have 36 (6 × 6) positions, and 16 (8 × 2) output neurons are used because each car can move in two directions, i.e., up/left or down/right. Figures 7b and e represent the number of moves required to move the target car to the exit. Both Figs. 7b and e are ideal cases with an epsilon value of 1 and a discount factor of 0.9. As above, replay memory with a size of 500 is used, and the network is trained using 50 randomly selected datasets for each action. As training progresses, the number of moves required to move the target car to the exit decreases to the optimal value (9 in Fig. 7b and 14 in Fig. 7e). Figures 7c and f are simplified cases in which the agent only takes random actions in the first episode (ε = 1, exploration only). In subsequent episodes, the agent does not take random actions and only exploitation occurs (ε = 0, exploitation only).
In the simplified cases, the number of moves required to move the target car to the exit converges to the optimal value, as in the ideal cases. This means that training can be conducted well in a simpler manner in which exploration is conducted only in the first episode. The subsequent simulations for training are performed with this simplified method. In addition to the two examples discussed above, 18 random examples were trained under identical conditions. When the optimal number of moves is reached for an example, the training is considered successful. Otherwise, the network is considered to need more training. Figure 8 presents how many of the 20 examples reached the optimal number of moves. The inset in Fig. 8 shows the accuracy of the last 50 episodes. As training progresses, the accuracy converges to 100%. This indicates that the network can be trained well for various game examples.
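The input and output sizes quoted for the Rush Hour network, and the simplified exploration schedule, can be illustrated with a short sketch. The encoding details are assumptions made for the example: each of up to eight cars is represented by a one-hot vector over the 36 cells using a single reference cell, and an action index is split into a car index and a direction. All function names are hypothetical.

```python
GRID = 6
N_CARS = 8

def encode_state(car_positions):
    """One-hot encode up to eight cars into 288 (6 x 6 x 8) binary inputs.

    car_positions: list of (row, col) reference cells, one per car.
    ASSUMPTION: one reference cell per car is enough because each car's
    length and orientation are fixed within a given puzzle.
    """
    x = [0] * (GRID * GRID * N_CARS)
    for car, (row, col) in enumerate(car_positions):
        x[car * GRID * GRID + row * GRID + col] = 1
    return x

def decode_action(index):
    """Split one of the 16 (8 x 2) outputs into (car, direction).

    direction 0 is up/left and direction 1 is down/right.
    """
    return divmod(index, 2)

def epsilon_for_episode(episode):
    """Simplified schedule: explore only in the first episode (index 0)."""
    return 1.0 if episode == 0 else 0.0
```

With this layout, the network input length is 8 × 36 = 288 and the output length is 8 × 2 = 16, matching the sizes stated above.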

Network without replay memory
Thus far, we have trained the network using replay memory with a size of 500 in all simulations. However, for a relatively simple problem such as the Fruit Catching game, omitting the replay memory brings various advantages, such as lower power consumption, a smaller occupied area, and faster learning. Figure 9 shows the overall process of training when the replay memory is not used. Only one set of (s, a, r, s′) data is stored at any moment and is used for network training. As shown in ③ of Fig. 2, the network with replay memory receives s′ and stores it in the replay memory first, after which the network is trained (④ of Fig. 2). However, in the network without replay memory, s′ is not stored and is immediately applied to the input of the DQN (③ of Fig. 9), which becomes the first phase of the four-phase training process. Therefore, in the subsequent training step of Fig. 9, it is sufficient to proceed with training from the second phase, which increases the overall training speed. Figure 10a presents the catching rate of the network without the replay memory. The size of the network used in the simulation and all other parameters are identical to those in Fig. 5b, except that the replay memory is not used and the network is trained with only one dataset for each action. Therefore, more episodes are needed for training compared to the network with replay memory. However, because the memory access required for training is significantly reduced for each episode, the total time required for training is decreased. The catching rate was obtained through 1000 test games. The inset in Fig. 10a shows the catching rate for the last 200 episodes. Figure 10b shows the average value of the catching rate during the last 200 episodes for the network with and without the replay memory. For a simple problem, the network can be trained well regardless of whether or not the replay memory is used.
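To give a feel for the memory saving, the storage needed for the (s, a, r, s′) data can be estimated with simple arithmetic. This is a rough sketch under assumptions that are not stated in the paper: states are stored as one bit per pixel, and the action and reward are stored with the minimum number of bits needed to distinguish their possible values. The function names are hypothetical.

```python
def bits_for(n_values):
    """Minimum number of bits needed to distinguish n_values alternatives."""
    return (n_values - 1).bit_length()

def transition_bits(n_state_pixels, n_actions, n_reward_values):
    """Bits for one (s, a, r, s') dataset with binary pixels (ASSUMED)."""
    return 2 * n_state_pixels + bits_for(n_actions) + bits_for(n_reward_values)

def replay_bits(size, n_state_pixels, n_actions, n_reward_values):
    """Bits for a replay memory holding `size` datasets."""
    return size * transition_bits(n_state_pixels, n_actions, n_reward_values)
```

For the Fruit Catching game (100 pixels, 3 actions, rewards of -1, 0, or 1), one dataset would take 204 bits under these assumptions, so a replay memory of size 500 would take 102,000 bits, against 204 bits when only the current dataset is kept.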

Conclusion
In this paper, we proposed a training method for on-chip trainable hardware-based DQNs. The entire training process is divided into four phases: two forward phases, one backward phase, and one update phase. In the two forward phases, two values are stored: the value for the target spike ($\max_{a'} Q(s', a')$) and whether each neuron generated a spike ($\mathrm{sign}(\sum_t^T S_j^l(t))$) for the weight update. In the backward phase and update phase, a training method approximating the conventional backpropagation algorithm is used. To implement on-chip training, only a single bit of memory per neuron is used, and the dependency on memory is low. The performance of the proposed training method is evaluated through two example games: a Fruit Catching game and a Rush Hour game. Evaluation results show that the network is trained well without significant performance differences relative to the outcomes from a software-based training method in both cases. In particular, for the simpler of the two games, the Fruit Catching game, a catching rate of approximately 98% was achieved even though the replay memory was not used. This means that the network can be suitably trained while significantly reducing the use of memory, thus reducing the power consumption and area occupation that come with memory usage. In addition, further performance improvements can be achieved through optimization of the parameters used in the simulation. Dealing with large input image data remains a challenge for future study. It might require a large amount of replay memory and additional convolutional neural networks (CNNs). This issue will be addressed more thoroughly in future work.
In this work, TFT-type flash memory cells are used as synaptic devices. Because the FG covers only half of the channel, the threshold voltage does not fall below 0 in the full erase state, and the standby power consumption is reduced by preventing leakage current. In addition, the bidirectional conductance update characteristic makes this device suitable for use as a synaptic device in which conductance updates frequently occur.
Fig. 10 (a) The catching rate of the proposed hardware-based network without replay memory (ε = 0, γ = 1.0). (b) The average catching rate during the last 200 episodes for the network with and without replay memory.
The effects of non-ideal properties of the synaptic devices are also investigated. Nonlinear characteristics and two types of variation of the synaptic devices are considered. The performance of the proposed training method is evaluated while increasing β, a factor indicating the nonlinearity of the synaptic device, to 5. There is a slight decrease in the performance as β is increased, but overall the outcome indicates good performance nonetheless. In addition, because the on-chip training scheme is employed, the proposed system shows strong immunity to device variations.