BPLC + NOSO: backpropagation of errors based on latency code with neurons that only spike once at most

For mathematical completeness, we propose an error-backpropagation algorithm based on latency code (BPLC) with spiking neurons that conform to the spike-response model but are allowed to spike at most once (NOSOs). Unlike previous temporal code-based error-backpropagation algorithms, BPLC is based on gradients derived without approximation. The latency code uses the spiking latency (the period from the first input spike to the output spike) as the measure of neuronal activity. To support the latency code, we introduce a minimum-latency pooling layer that passes only the spike of minimum latency in a given patch. We also introduce a symmetric dual threshold for spiking (i) to avoid the dead-neuron issue and (ii) to confine the potential distribution to the range between the two thresholds. Given that the number of spikes (rather than the number of timesteps) is the major cause of inference delay on digital neuromorphic hardware, NOSONets trained using BPLC are likely to reduce inference delay significantly. To demonstrate the feasibility of BPLC + NOSO, we trained CNN-based NOSONets on Fashion-MNIST and CIFAR-10. The classification accuracy on CIFAR-10 exceeds the state-of-the-art result from an SNN of the same depth and width by approximately 2%. Additionally, the number of spikes used for inference is reduced by approximately one order of magnitude, implying a corresponding reduction in inference delay.


Introduction
Spiking neural networks (SNNs) with a layer-wise feedforward structure can process and convey data forward based on asynchronous spiking events, without the forward locking of feedforward deep neural networks (DNNs) [10,32]. When implemented in asynchronous neuromorphic hardware, SNNs are believed to leverage this processing efficiency. Nevertheless, asynchronous neuromorphic hardware often suffers from traffic congestion due to the large number of spikes (events) routed to their destination neurons through a network-on-chip with limited bandwidth [9]. In this regard, the number of synaptic operations per second (SynOPS) is considered a crucial measure of neuromorphic hardware performance, and attempts have been made to improve synaptic operation speed to further accelerate inference [8,12,27,28]. Algorithm-side approaches to improving inference speed include developing learning algorithms that support inference using fewer spikes.
Given the limited accessibility to global data in multicore neuromorphic hardware, local learning algorithms are favored for on-chip learning. However, local learning algorithms, e.g., the naive Hebb rule [15], spike timing-dependent plasticity [4], and the Ca-signaling model [21], fail to achieve high performance. Currently, the trend is moving toward off-chip learning, which allows the learner to access large global data within the general framework of error backpropagation (backprop). The advantage is that the rich optimization techniques developed for DNNs can readily be applied to SNNs, which significantly improves SNN performance [10]. Nevertheless, a notable inconsistency between DNNs and SNNs lies in the fact that output spikes, unlike activation functions, are non-differentiable.
As a workaround, the gradients of spikes are often approximated by heuristic functions, popularly referred to as surrogate gradients [2,11,34,38,44]. With surrogate gradients, gradient values are available regardless of the presence of spike events, avoiding the dead-neuron issue that hinders the network from learning. To date, various surrogate gradients have been proposed, e.g., the boxcar function [38], arctan function [11], and exponential function [34]; these methods remove the inconsistency between DNNs and SNNs, yielding state-of-the-art classification accuracy on various datasets. Despite this technical success, heuristic surrogate gradient methods lack theoretical completeness given the absence of theoretical foundations for surrogate gradients.
Spike timing-based backprop (temporal backprop) algorithms can avoid such surrogate gradients because spike timing can be differentiated with respect to the membrane potential using a linear approximation of the near-threshold potential evolution [5]. However, temporal backprop is generally prone to learning failure because of limited error-backpropagation paths: unlike with surrogate gradients, spike timing gradients are available only for neurons that spike at a given timestep. The number of error-backpropagation paths is further limited by dead neurons, i.e., neurons whose current fan-in weights are so low that they no longer fire spikes. STDBP, a temporal backprop algorithm, uses a rectified linear potential kernel to avoid the dead-neuron issue [46]. The rectified linear kernel causes a monotonic increase in potential upon receiving an input spike with a positive weight, so that the neuron eventually fires a spike. TSSL-BP considers additional error-backpropagation paths via spikes from the same neuron to avoid learning failure due to limited error-backpropagation paths [48]; its timing gradient is calculated using the linear approximation of Bohte et al. [5]. Another temporal backprop algorithm (STiDi-BP) uses a piece-wise linear kernel to approximate the spike timing gradient with a simple function and thus reduce the computational cost [25,26].
Because spike timing gradients are available only for neurons that spike, the larger the number of spikes, the richer the error-backpropagation paths in general. Thus, more spikes are desirable for better training. However, more spikes cause considerable inference delay when implemented in digital neuromorphic hardware because of its limited synaptic operation speed. Concerning the desires for

• theoretically seamless application of temporal backprop to SNNs,
• a workaround for the dead-neuron issue, and
• fewer spikes for fast inference,

we propose a novel learning algorithm based on the spiking latency code of neurons that spike at most once (NOSOs). NOSOs are based on the spike-response model (SRM) [13] but with an infinite hard refractory period to prevent additional spikes. The algorithm is based on the backpropagation of errors evaluated using the spiking latency code (BPLC). The key to BPLC + NOSO is that the spiking latency (rather than the spike itself) is the measure of the response to a given input, and the latency is differentiable without the approximations of [5]. Thus, BPLC + NOSO is mathematically rigorous in that all required gradients are derived analytically. Other important features of BPLC + NOSO are as follows.
• The use of NOSOs for both learning and inference minimizes the workload on the event-routing circuits in neuromorphic hardware.
• To support the latency code, NOSONet includes minimum-latency pooling (MinPool) layers (instead of MaxPool or AvgPool) that pass only the event of minimum latency in a given patch.
• Each NOSO is given two symmetric thresholds (−ϑ and ϑ) for spiking to confine the potential distribution to the range between them.
• BPLC + NOSO fully supports both folded and unfolded NOSONets, allowing us to use the automatic differentiation framework [31].
The primary contributions of this study include the following:

• We introduce a novel learning algorithm based on the spiking latency code (BPLC + NOSO), with full derivations of the primary gradients without approximations.
• We provide novel methods essential to BPLC + NOSO, namely MinPool layers and the symmetric dual threshold for spiking, which greatly improve accuracy and inference efficiency.
• We introduce a method to quickly estimate the wall-clock inference time on general digital neuromorphic hardware, which allows a quick estimation of the inference delay for a given fully trained SNN.
The rest of the paper is organized as follows. Section "Related work" briefly overviews previous learning algorithms based on temporal codes. Section "Preliminaries" addresses the primary techniques employed in BPLC + NOSO. Section "BPLC with spike response model" is dedicated to the theoretical foundations of BPLC + NOSO. Section "Experiments" evaluates the performance of BPLC + NOSO on Fashion-MNIST and CIFAR-10 and the effects of MinPool and the symmetric dual threshold on learning efficacy. Section "Discussion" discusses the estimation of inference time for an SNN mapped onto a general digital multicore neuromorphic processor. Finally, Section "Conclusion and outlook" concludes the study.

Related work
Spike timing gradient approximation: Temporal backprop algorithms frequently use the linearly approximated spike timing gradients proposed by Bohte et al. [5]. The specific form of the gradient depends on the membrane potential kernel used. Bohte et al. [5], Comsa et al. [7], and Kim et al. [19] used an alpha kernel as an approximation of the genuine SRM kernel, and the corresponding gradients were evaluated using the linear approximation. Zhang et al. employed a rectified linear kernel to avoid the dead-neuron issue [46], while Mirsadeghi et al. employed a piece-wise linear kernel to simplify the gradient calculation [25,26]. To apply the linear approximation of Bohte et al. [5], the gradient of the membrane potential at the spike timing must be available. Integrate-and-fire (IF) neurons do not provide the gradient value at the spike timing, so Kheradpisheh and Masquelier [17] approximated the gradient as a constant of −1. The same holds for leaky integrate-and-fire (LIF) neurons. Zhang and Li [48] stated that the linear approximation was employed, but the gradient is not clearly derived.
Label-encoding as spike timings: For SNN with temporal code, the correct labels are frequently encoded as particular output spike timings [17,25,26] or the temporal order of output spikes such as time-to-first-spike (TTFS) code [30,45,46]. In the TTFS code, the neuron index of the first output spike indicates the output label.
Workaround for dead neurons: Comsa et al. proposed temporal backprop with a means to avoid dead neurons (assigning penalties to the presynaptic weights of each dead neuron) [7]. Zhang et al. [46] proposed a rectified linear potential kernel that causes a monotonic increase in potential upon receiving a spike with positive weight, so that the neuron eventually fires a spike. Zhang and Li [48] proposed TSSL-BP with additional backprop paths via the spikes emitted from the same neuron (intra-neuron dependency); the additional paths avoid the learning failure caused by backprop paths limited by dead neurons. Kim et al. [19] combined temporal backprop paths with rate-based backprop paths to compensate for the loss of temporal backprop paths due to dead neurons.
BPLC + NOSO is clearly distinguished from the previous temporal backprop algorithms in the respects addressed in this section. First, BPLC + NOSO employs no approximation for gradient evaluation, unlike the previous temporal backprop algorithms, including those reviewed here; it therefore involves little ambiguity. Second, the proposed spiking latency code is a novel data encoding scheme, distinguishable from previous temporal code schemes. Third, the symmetric dual threshold for spiking is a novel and computationally efficient method to avoid the dead-neuron issue, since it involves no high-cost computations. Additionally, BPLC + NOSO is fully compatible with the original SRM without approximations.

Latency code
Spiking latency is the period from the first input spike timing t_in to the consequent spike timing t̂, as illustrated in Fig. 1a. In the latency code, NOSONet encodes input data x as the spiking latency T(L)_lat of the neurons in the output layer L (Eq. (1)), where t̂(·) and t(·)_in denote the spike timings of the neurons in the (·)th layer and their first input spike timings, respectively. The function f(L) encodes the input spikes (from layer L−1) at t̂(L−1) as spiking latency values T(L)_lat; the larger the weight w(L−1), the shorter the spiking latency T(L)_lat. This latency code should be distinguished from the TTFS code [30,45,46], in which the first input spike timings t(L)_in in Eq. (1) are ignored, so that only the output spike timings are considered.
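As a concrete illustration, the latency encoding and the minimum-latency readout can be sketched as follows (a minimal sketch, not the authors' implementation; the array names are hypothetical, and neurons that never spike are given infinite latency):

```python
import numpy as np

def spiking_latency(spike_time, first_input_time):
    """Latency code: T_lat = t_hat - t_in, the period from a neuron's first
    input spike to its own (at most one) output spike.
    Non-spiking neurons (spike_time = inf) get infinite latency."""
    spike_time = np.asarray(spike_time, dtype=float)
    first_input_time = np.asarray(first_input_time, dtype=float)
    lat = spike_time - first_input_time
    return np.where(np.isfinite(spike_time), lat, np.inf)

# Output-layer readout: the class of the minimum-latency neuron.
t_hat = np.array([7.0, 5.0, np.inf])  # output spike timesteps (inf = no spike)
t_in = np.array([2.0, 1.0, 3.0])      # first input spike timesteps
T_lat = spiking_latency(t_hat, t_in)  # -> [5.0, 4.0, inf]
pred = int(np.argmin(T_lat))          # neuron 1: shortest latency wins
```

Note that a stronger drive (larger fan-in weight) makes a neuron reach threshold sooner, which is exactly the monotonic weight-to-latency relationship the code exploits.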

Minimum-latency pooling
The MinPool layers support the latency code. Consider the time elapsed since the first input spike, t_elap = t − t_in, for a given neuron. For a given 2D patch D_pool, we consider the spiking latency map T_lat,D_pool[t] and the feature (spike) map s_D_pool[t] at timestep t. The latency map T_lat,D_pool is initialized to infinite values and updated with t_elap as the neurons in the patch spike. MinPool then passes only the spike of the neuron x_min with the minimum latency in the patch (Eq. (2)), where s_x_min[t] indicates the spike function value of x_min at timestep t.
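A minimal sketch of minimum-latency pooling over non-overlapping k × k patches might look as follows (an illustration under the assumption that the per-timestep latency and spike maps are given as 2D arrays; names are hypothetical):

```python
import numpy as np

def min_latency_pool(spikes, latency, k=2):
    """Minimum-latency pooling (sketch): in each k-by-k patch, pass only the
    spike of the neuron with the smallest spiking latency; suppress the rest.
    `spikes` and `latency` are (H, W) arrays at the current timestep;
    entries of `latency` for neurons that have not spiked are np.inf."""
    H, W = spikes.shape
    out = np.zeros((H // k, W // k), dtype=spikes.dtype)
    for i in range(0, H, k):
        for j in range(0, W, k):
            patch_lat = latency[i:i + k, j:j + k]
            patch_spk = spikes[i:i + k, j:j + k]
            # Position of the minimum-latency neuron within the patch.
            r, c = np.unravel_index(np.argmin(patch_lat), patch_lat.shape)
            out[i // k, j // k] = patch_spk[r, c]  # pass that event only
    return out

lat = np.array([[3.0, np.inf], [1.0, 2.0]])  # neuron (1,0) has min latency
spk = np.array([[1.0, 0.0], [1.0, 1.0]])
out = min_latency_pool(spk, lat)             # -> [[1.0]]
```

Unlike MaxPool on spike counts, this readout is consistent with the latency code: the earliest (strongest) response in a patch is the one that survives.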

NOSO with dual threshold for spiking
Each NOSO is endowed with a symmetric dual threshold for spiking (−ϑ and ϑ): a spike is generated if the membrane potential u satisfies u ≥ ϑ or u ≤ −ϑ. The subthreshold potential u is therefore confined to the range between −ϑ and ϑ. This restriction bounds the potential variance over the samples in a given batch, preventing large variance, and the symmetry of the two bounds tends to keep the potential mean over the samples near zero. Additionally, the restriction is expected to avoid dead neurons, given that most dead neurons arise from potentials largely biased toward the negative side.
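The confinement argument can be checked numerically. The sketch below uses randomly drawn potentials as a stand-in for real membrane dynamics (an assumption for illustration only, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.15  # symmetric dual threshold: spike if u >= theta or u <= -theta

# Toy membrane potentials over a batch of samples (illustrative stand-in).
u = rng.normal(0.0, 0.2, size=10_000)

spiked = np.abs(u) >= theta  # dual-threshold spike condition
sub = u[~spiked]             # potentials that remain subthreshold

# The subthreshold distribution is confined to (-theta, theta): bounded
# variance, and (by the symmetry of the two bounds) a near-zero mean.
print(round(sub.mean(), 3), round(sub.std(), 3))
```

With a single positive threshold, the negative tail would survive unclipped, which is precisely the population from which dead neurons are recruited.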

Spike response model mapped onto computational graphs
We consider the SRM, which is equivalent to the basic leaky integrate-and-fire (LIF) model with exponentially decaying synaptic current [13], but our model is allowed to spike at most once in response to a single input sample, realized by an infinite hard refractory period in place of the refractory kernel. The SRM is chosen over simpler models, e.g., Stein's model [35], to enlarge the mutual information between spike timing and synaptic weight, which is the key to temporal coding.
In the SRM, the subthreshold potential of the ith spiking neuron in the lth layer, u(l)_i, is given by Eq. (3), where j denotes the indices of the presynaptic neurons and w(l)_ij denotes the synaptic weight from the jth neuron in the (l−1)th layer. The spiking-availability function sav(l)_i equals one until the neuron spikes and zero afterward. The kernel is expressed as Eq. (4) [13], where Θ denotes the Heaviside step function, and the potential and synaptic current time constants are denoted by τ_m and τ_s, respectively. A spike from the jth neuron in the (l−1)th layer arrives at t̂(l−1)_j. Because the kernel in Eq. (4) consists of two independent exponential sub-kernels, Eq. (3) can be expressed as Eq. (5); here we introduce a new variable v(l) and track the potential with two sub-potential traces (one per sub-kernel), which are reset to zero when the neuron fires a spike. The advantage of this decomposition is that the membrane potential can be evaluated by simply convolving the input spikes with two independent kernels, which otherwise requires solving two sequential differential equations [20]. After spiking, the spiking-availability function sav(l)_i remains constant at zero, preventing additional spike generation.
All variables are recursively evaluated using the explicit update rule in Eq. (6). Equation (6) can be mapped onto the computational graph shown in Fig. 2; each layer's processed data are conveyed along the forward pass as spikes (s(l)).
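A discrete-time sketch of this two-trace evaluation, with the symmetric dual threshold and the infinite refractory period, might look as follows (the decay factors, normalization, and names are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def noso_forward(inp, w, tau_m=20.0, tau_s=5.0, theta=0.15, T=100):
    """NOSO forward pass for one neuron (sketch).
    The SRM kernel splits into two exponential sub-kernels, so the potential
    is tracked with two leaky traces and read out as u = u_m - u_s.
    inp: (T, n_pre) binary spike raster; w: (n_pre,) weights.
    Returns the spike timestep, or None -- at most one spike, enforced by
    returning immediately (infinite hard refractory period)."""
    dm, ds = np.exp(-1.0 / tau_m), np.exp(-1.0 / tau_s)
    u_m = u_s = 0.0
    for t in range(T):
        drive = float(w @ inp[t])
        u_m = dm * u_m + drive         # sub-kernel with time constant tau_m
        u_s = ds * u_s + drive         # sub-kernel with time constant tau_s
        u = u_m - u_s
        if u >= theta or u <= -theta:  # symmetric dual threshold
            return t                   # spiking availability -> 0 afterward
    return None

rng = np.random.default_rng(1)
inp = (rng.random((100, 30)) < 0.1).astype(float)
w = rng.normal(0.05, 0.02, 30)
t_hat = noso_forward(inp, w)  # spike timestep of this NOSO, if any
```

The two traces replace the explicit convolution with the double-exponential kernel, which keeps the per-timestep cost at two multiply-adds per neuron.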

Backward pass and gradients
SNNs are typically trained using forward and backward passes aligned in opposite directions, which makes surrogate gradients unavoidable owing to the non-differentiability of spikes [29,34,44]. Instead, BPLC + NOSO uses a backward pass via the spike timings t̂(·) rather than the spikes s(·) themselves (Fig. 2); this backward pass involves differentiable functions only. The output of NOSONet (with M output NOSOs) is the set of spiking latency values of the output NOSOs, as given in Eq. (1). The prediction is then made by reference to the output neuron with the minimum spiking latency. We use a cross-entropy loss function L(−T(L)_lat, ŷ), where ŷ denotes a one-hot encoded label vector. The loss is evaluated at the end of the learning phase, and the weights are then updated using the gradients assessed when the neurons spiked.
We calculate the weight update Δw(l)_ij using the gradient descent method, Δw(l)_ij = −η ∂L/∂w(l)_ij (Eq. (7)), where η and L denote the learning rate and loss function, respectively. Equation (7) is equivalent to Eq. (8), with the error e(l) given by Eq. (9) for the N neurons in the lth layer. The symbol ⊙ denotes the Hadamard product.
The backward propagation of the error from the lth layer to the (l − 1)th layer (with M neurons) is given by Eq. (10), which is derived in Appendix A. Because a NOSO spikes at most once, the elements once written in v(l) and v(l)[t̂(l)] are never overwritten. Equation (10) shows that BPLC involves the gradients of spike timings rather than of spikes themselves; therefore, the backward pass differs from the forward pass. Two types of gradients are thus required for BPLC + NOSO: the gradient of the spike timing with respect to the membrane potential, and the gradient of the potential with respect to the input spike timings. Fortunately, the SRM allows both gradients to be expressed analytically.
Theorem 1 When an SRM neuron (whose membrane potential is u(l)_i) spikes at a given time t (= t̂(l)_i), the gradient of the spike timing t̂(l)_i with respect to the membrane potential is given by Eq. (11). The proof of Theorem 1 is given in Appendix B. If the neuron does not spike during a learning phase, the gradient in Eq. (11) is zero.

Theorem 2 When an SRM neuron receives an input spike at t̂(l−1)_j, the gradient of its membrane potential with respect to the input spike timing is given by Eq. (12).

The proof of Theorem 2 is also given in Appendix B. Using Theorem 2, the gradient ∂v(l)/∂t̂(l−1) in Eq. (10) is obtained as Eq. (13); likewise, this gradient is zero if the neuron does not spike. Both gradients in Eqs. (11) and (13) can be calculated simply by reading out the four local sub-potential variables when the neuron spikes. The above derivations are for the folded NOSONet, where all tensors for each layer are simply overwritten over time, so that the space complexity is independent of the number of timesteps. We used the unfolded NOSONet in the temporal domain to apply the automatic differentiation framework [31]. The equivalence between the folded and unfolded NOSONets is proven in Appendix C.

Classification accuracy and the number of spikes for inference
We evaluated the classification accuracy on Fashion-MNIST and CIFAR-10 and the total number of spikes used for inference, N_sp = Σ_i Σ_t n(i,t)_sp, where n(i,t)_sp denotes the number of spikes generated from layer i at timestep t.

Fashion-MNIST:
Fashion-MNIST consists of 70,000 grayscale images (each 28 × 28 in size) of clothing items categorized into 10 classes [40]. We rescaled each grayscale pixel value to the range 0–0.3 and used it as input current. The fully trained NOSONet yields the classification accuracy listed in Table 2 in comparison with previous works. We also evaluated the total number of spikes N_sp over all hidden and output NOSOs in the network for each test sample (Table 2). The results highlight the large sparsity of active NOSOs, which likely reduces the inference latency when implemented in neuromorphic hardware; this is discussed in Section "Discussion". Figure 3a shows the ratio of active NOSOs to all NOSOs, n(i)_sp (= Σ_t n(i,t)_sp / (C(i)H(i)W(i))), for layer i over the entire timesteps.

CIFAR-10: CIFAR-10 consists of 60,000 real-world color images (each 3 × 32 × 32 in size) of objects labeled with 10 classes [22]. All training images were preprocessed such that each image, zero-padded by 4 pixels, was randomly cropped to 32 × 32, followed by random horizontal flipping. The RGB values of each pixel were rescaled to the range 0–0.3 and then used as input currents. For learning stability, we linearly increased the initial learning rate (1E-2) to the plateau learning rate (5E-2) over the first five epochs (ramp rate: 8E-3/epoch). The fully trained C-NOSONet (64C5-128C5-MP2-256C5-MP2-512C5-256C5-1024-512) yields the classification accuracy and the number of spikes for inference in Table 2. Notably, our classification accuracy exceeds the result from an SNN of the same depth and width (CNN2-half-ch) [39] by approximately 2.0%. Additionally, our NOSONet uses far fewer spikes (only 10.9% of CNN2-half-ch), supporting high-throughput inference. The layer-wise active NOSO ratio n(i)_sp over the entire timesteps is plotted in Fig. 3b, highlighting the high sparsity of spikes.

Minimum-latency pooling versus MaxPool
MinPool supports the latency code by passing only the event of minimum spiking latency in a given 2D patch. To identify its effect on learning, we compared NOSONets with MinPool layers against those with conventional MaxPool layers.

Effect of symmetric dual threshold on potential distribution
We identified the effect of the dual threshold on the potential distribution over the samples in a given batch by training NOSONet (32C5-MP2-64C5-MP2-600) on Fashion-MNIST and CIFAR-10 under four threshold conditions: single thresholds of 0.05 and 0.1, and dual thresholds of ±0.1 and ±0.15. The results are shown in Fig. 6. The dual threshold greatly lowers the standard deviation and yields a nearly zero mean because it confines the potential to the range between −ϑ and ϑ. Additionally, the highest accuracy was attained with the dual threshold ±0.15. The potential distributions for a single-threshold case (0.1) and a dual-threshold case (±0.15) on Fashion-MNIST are detailed in Appendix F.

Discussion
We estimate the inference time for an SNN mapped onto a general digital multicore neuromorphic processor under the following assumptions. Assumption 1: The N_n neurons in a given SNN are distributed uniformly over the N_c cores of a neuromorphic processor, i.e., N_n/N_c neurons per core. Assumption 2: All N_n/N_c neurons in each core share a multiplier by time-division multiplexing, so that the current potential is multiplied by a potential decay factor (e^(−1/τ_m)) for one neuron per cycle. Each timestep for an SNN with LIF neurons includes two primary processes: (i) multiplying the current potential by the decay factor and (ii) synaptic operation (spike routing to the destination neurons plus the consequent potential updates). Process (i) in a digital neuromorphic processor is commonly pipelined within a core but executed in parallel over the N_c cores [20]. Thus, at each timestep, the time for process (i) for all N_n neurons is T_up = (N_n/N_c + a)/f_clk, where a and f_clk denote the number of initialization cycles and the clock speed, respectively. Although a differs among processor designs, it is commonly a few clock cycles. Given the total number of spikes generated at timestep t, n_sp[t], the time for the synaptic operations at that timestep, T_sop[t], scales with n_sp[t] and inversely with the synaptic operation rate (SynOPS). Under these assumptions, the total time for processes (i) and (ii) at each timestep is T_step = T_up + T_sop, and the total inference time over N_step timesteps is T_inf = Σ_t T_step[t], as given in Eq. (14).
where N_sp = Σ_t n_sp[t]. The number of neurons per core (N_n/N_c) differs among designs. We assume 1k neurons per core [8], a few tens of MSynOPS as in [3,12,27], and a 100 MHz clock. For inference involving N_sp ∼ 10^6 spikes (as in Table 2) and N_step ∼ 100, Eq. (14) shows that T_sop dominates T_up, so that T_inf is dictated by T_sop. Therefore, N_sp should be a primary concern when developing learning algorithms. For SNNs with IF neurons (without leakage), process (i) is unnecessary, so T_up vanishes and T_inf is determined solely by N_sp.
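Under Assumptions 1 and 2, the estimate can be written as a small calculator (a sketch only; the per-spike fan-out and the initialization-cycle count are hypothetical values, not taken from the paper):

```python
def inference_time(N_n, N_c, N_step, N_sp, f_clk=100e6, a=4,
                   synops_rate=20e6, fanout=1_000):
    """Rough inference-delay estimate, T_inf = N_step * T_up + T_sop, for a
    digital multicore neuromorphic processor (hypothetical parameters).
    T_up: pipelined potential-decay update per timestep, parallel over cores.
    T_sop: synaptic operations for all N_sp spikes, limited by SynOPS;
    each spike is assumed to trigger `fanout` synaptic operations."""
    T_up = (N_n / N_c + a) / f_clk       # time for process (i) per timestep
    T_sop = N_sp * fanout / synops_rate  # time for process (ii), all timesteps
    return N_step * T_up + T_sop

# Scale discussed above: 1k neurons/core, tens of MSynOPS, 100 MHz clock.
t_inf = inference_time(N_n=2**20, N_c=2**10, N_step=100, N_sp=1e6)
t_up_total = 100 * (2**20 / 2**10 + 4) / 100e6  # decay-update term alone
```

With these assumed numbers, the decay-update term is on the order of a millisecond while the synaptic-operation term dominates by orders of magnitude, consistent with the argument that T_inf is dictated by N_sp.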

Conclusion and outlook
We proposed a mathematically rigorous learning algorithm (BPLC) based on the spiking latency code, in conjunction with minimum-latency pooling (MinPool) operations. We overcome the dead-neuron issue using a symmetric dual threshold for spiking, which additionally improves the potential distribution over the samples in a given batch (and thus the classification accuracy). The BPLC-trained NOSONet on CIFAR-10 outperforms the SNN of the same depth and width by approximately 2% while using far fewer spikes (only 10.9%). This large reduction in the number of spikes greatly reduces the inference latency of SNNs implemented on digital neuromorphic processors.
Currently, we conceive the following future work to boost the impact of BPLC + NOSO.
• Scalability confirmation: Although the viability of BPLC + NOSO was demonstrated, its applicability to deeper SNNs on more complex datasets should be confirmed. Such datasets include not only static image datasets like ImageNet [33] but also event datasets like CIFAR10-DVS [24] and DVS128 Gesture [1]. Given that the number of spikes is severely capped, BPLC + NOSO on event datasets in particular might be challenging.
• Hyperparameter fine-tuning: To further increase the classification accuracy, the hyperparameters should be fine-tuned using optimization techniques.
• Weight quantization: BPLC + NOSO is based on full-precision (32-bit FP) weights. The viability of BPLC + NOSO with reduced-precision weights should be confirmed to improve memory efficiency. This may require an additional weight-quantization algorithm, such as CBP [18], in conjunction with BPLC + NOSO.

Availability of data and materials
The datasets generated during and/or analyzed during the current study are available in the GitHub repository, https://github.com/dooseokjeong/BPLC-NOSO.

Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Derivation of backward propagation of errors
We first define the quantities used in the derivation (Eq. (A1)). The subthreshold membrane potential of a NOSO is given by Eq. (A2). Thus, Eq. (A3) holds when the neuron spikes with a spiking threshold ϑ.
For simplicity, we omit the spiking-availability function. According to Theorem 1, the denominator of the right-hand side of Eq. (A4) equals (∂t̂(l)_i/∂u(l)_i)^(−1), and thus we have Eq. (A5). Applying the chain rule to the left-hand side of Eq. (A5) yields Eq. (A6), which is re-expressed as Eq. (A7). According to Theorem 2, Eq. (A8) holds. Using Eq. (A7) at t = t̂(l)_i, Eq. (A9) follows. The error for the jth neuron in the (l − 1)th layer, e(l−1)_j, is then given by Eqs. (A10) and (A11), and Eq. (A11) is expressed as the matrix formula in Eq. (A12).

Appendix B Proofs of Theorems
Theorem 1 When an SRM neuron (whose membrane potential is u(l)_i) spikes at a given time t (= t̂(l)_i), the gradient of the spike timing t̂(l)_i with respect to the membrane potential is given by Eq. (11).

Proof The update of the weight w(l)_ij is calculated using the gradient descent method (Eq. (B12)). Because u(l)_i equals the threshold at the spike timing, we consequently have Eq. (B13). The left-hand side of Eq. (B14) is zero because the threshold ϑ is constant; thus Eq. (B15) holds. A comparison between Eqs. (B13) and (B16) indicates that Eq. (B17) holds.
The right-hand side of Eq. (B17) is obtained by differentiating Eq. (A2) with respect to t and evaluating the derivative at the spike timing t̂(l)_i, which finally leads to Eq. (B18). To be precise, the Heaviside step function in Eq. (B18) should be evaluated at t − t̂.

Proof (Theorem 3) Because NOSOs spike at most once, the stated property of s(l) holds for all i.

Theorem 4
The weight update for the folded SNN (Eq. (B19)) is equivalent to the unfolded form in Eq. (B20), where v(l)[t] is given by Eq. (B21).

Proof
The error e(l) is known to be given by Eq. (B22). Using Eq. (B22) and a basic property of the Hadamard product, the matrix diag(e(l))^T on the right-hand side of Eq. (B20) is unfolded as Eq. (B23). The matrix v(l)[t̂(l)] in Eq. (B20) is given by Eq. (B24). Inserting Eqs. (B23) and (B24) into Eq. (B20) yields Eq. (B25), whose mismatched-timestep terms are always zero according to Theorem 3. Therefore, we obtain Eq. (B26). Theorem 4 identifies the backward propagation of errors at a given timestep toward earlier timesteps through time. Thus, BPLC + NOSO can be unfolded on a computational graph as shown in Fig. 2, allowing the automatic differentiation framework to be used to learn the weights. Note that we rule out the backward pass from sav(l)[t + 1] to s(l)[t] because it can be ignored when the learning uses spike function gradients (rather than surrogate gradients) and refractory periods; this is proven in Appendix D.

Appendix D Gradient of the spike-availability function with respect to a spike from the previous timestep
Spike-function gradients, unlike surrogate gradients, are non-zero only when the neuron spikes. Moreover, the same neuron cannot spike at two consecutive timesteps because of the refractory period. Consider the computational graph in Fig. 2. When the ith neuron in the lth layer is quiet at timestep t + 1, the gradient ∂t̂(l)_i/∂u(l)_i[t + 1] is zero, so no gradient flows to s(l)_i[t] regardless of the presence of the backward pass. When the neuron is active at timestep t + 1 (and hence quiet at timestep t), the gradient ∂t̂(l)_i/∂u(l)_i[t + 1] is non-zero, but the gradient at timestep t is zero, so the presence or absence of the backward pass again does not affect any gradient flow.

Appendix E Hyperparameters
We used the hyperparameters in Table 3. The input scaling factor is the upper limit of the scaled pixel values of the input image. We initialized the kernels and weight matrices using the Xavier uniform initialization method [14], W ∼ U(−√(a/(n_in + n_out)), √(a/(n_in + n_out))), where a is set to 6. The parameters of NOSONet (32C5-MP2-64C5-MP2-600) on Fashion-MNIST were initialized using the Xavier uniform method. We also initialized NOSONet (64C5-128C5-MP2-256C5-MP2-512C5-256C5-1024-512) on CIFAR-10 using the Xavier uniform method, but the weight matrices of the fully connected layers were initialized using a modified Xavier uniform method with a = 3 rather than 6.
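The initialization rule above can be sketched as a small helper (a hypothetical function mirroring the formula; the (n_out, n_in) shape convention is an assumption):

```python
import numpy as np

def xavier_uniform(n_in, n_out, a=6.0, rng=None):
    """Xavier uniform: W ~ U(-sqrt(a/(n_in+n_out)), +sqrt(a/(n_in+n_out))).
    a=6 is the standard Xavier bound [14]; a=3 gives the narrower 'modified'
    bound used here for the CIFAR-10 fully connected layers."""
    if rng is None:
        rng = np.random.default_rng()
    bound = np.sqrt(a / (n_in + n_out))
    # Weight matrix mapping n_in inputs to n_out outputs.
    return rng.uniform(-bound, bound, size=(n_out, n_in))

W6 = xavier_uniform(600, 10, a=6.0, rng=np.random.default_rng(0))
W3 = xavier_uniform(600, 10, a=3.0, rng=np.random.default_rng(0))
```

Halving a from 6 to 3 shrinks the bound by a factor of √2, tightening the initial weight distribution of the deep fully connected layers.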

Appendix F Potential distribution over samples in a batch
Figures 7 and 8 show potential distributions over samples in a random batch (batch size: 300) for single threshold and dual threshold cases, respectively. Note that the distributions exclude zero potential.