Introduction

Spiking neural networks (SNNs) with a layer-wise feedforward structure can process and convey data forward based on asynchronous spiking events, free of the forward locking that constrains feedforward deep neural networks (DNNs) [10, 32]. When implemented in asynchronous neuromorphic hardware, SNNs are expected to leverage this processing efficiency. Nevertheless, asynchronous neuromorphic hardware often suffers from traffic congestion because a large number of spikes (events) must be routed to their destination neurons through a network-on-chip with limited bandwidth [9]. In this regard, the number of synaptic operations per second (SynOPS) is considered a crucial measure of neuromorphic hardware performance, and attempts have been made to improve this synaptic operation speed to further accelerate inference [8, 12, 27, 28]. Algorithm-wise, inference speed can be improved by developing learning algorithms that support inference using fewer spikes.

Given the limited accessibility to global data in multicore neuromorphic hardware, local learning algorithms are favored for on-chip learning. However, local learning algorithms, e.g., the naive Hebb rule [15], spike timing-dependent plasticity [4], and the Ca-signaling model [21], fail to achieve high performance. Currently, the trend appears to be moving toward off-chip learning, which allows the learner to access large amounts of global data within the general framework of error backpropagation (backprop). The advantage is that the rich optimization techniques developed for DNNs can readily be applied to SNNs, which significantly improves SNN performance [10]. Nevertheless, a notable inconsistency between DNNs and SNNs lies in the fact that output spikes are non-differentiable, unlike activation functions.

As a workaround, the gradients of spikes are often approximated by heuristic functions, popularly referred to as surrogate gradients [2, 11, 34, 38, 44]. With surrogate gradients, gradient values are available regardless of the presence of spikes, avoiding the dead neuron issue that hinders the network from learning. To date, various surrogate gradients have been proposed, e.g., the boxcar function [38], arctan function [11], and exponential function [34]; these methods remove the inconsistency between DNNs and SNNs, yielding state-of-the-art classification accuracy on various datasets. Despite this technical success, such heuristic surrogate gradient methods lack theoretical completeness because surrogate gradients themselves lack theoretical foundations.

Spike timing-based backprop (temporal backprop) algorithms can avoid such surrogate gradients because the spike timing can be differentiated with respect to the membrane potential using a linear approximation of the near-threshold potential evolution [5]. Temporal backprop is, however, generally prone to learning failure because of limited error-backpropagation paths: unlike surrogate gradients, spike timing gradients are available only for the neurons that spike at a given timestep. The number of error-backpropagation paths is further limited by dead neurons, i.e., neurons whose current fan-in weights are so low that they no longer fire spikes. STDBP, a temporal backprop algorithm, uses a rectified linear potential kernel to avoid the dead neuron issue [46]. The rectified linear kernel causes a monotonic increase in potential upon receiving an input spike with a positive weight, so that the neurons eventually fire spikes. TSSL-BP considers additional error-backpropagation paths via spikes from the same neuron to avoid learning failure due to limited error-backpropagation paths [48]. The timing gradient is calculated using the linear approximation by Bohte et al. [5]. Another temporal backprop algorithm (STiDi-BP) uses a piece-wise linear kernel to approximate the spike timing gradient by a simple function and thus reduce the computational cost [25, 26].

Because spike timing gradients are available only for the neurons that spike, the larger the number of spikes, the richer the error-backpropagation paths in general. Thus, more spikes are desired for better training. However, this causes a considerable inference delay when implemented in digital neuromorphic hardware because of its limited synaptic operation speed. Motivated by the need for

  • theoretically seamless applications of temporal backprop to SNNs,

  • workaround for the dead neuron issue,

  • fewer spikes for fast inference,

we propose a novel learning algorithm based on the spiking latency code of neurons that spike only once at most (NOSOs). NOSOs are based on the spike–response model (SRM) [13] but with an infinite hard refractory period to prevent additional spikes. The algorithm is based on the backpropagation of errors evaluated using the spiking latency code (BPLC). The key to BPLC + NOSO is that, when a neuron spikes, its spiking latency (rather than the spike itself) is the measure of its response to a given input, and this latency is differentiable without approximations, unlike [5]. Thus, BPLC + NOSO is mathematically rigorous in that all required gradients are derived analytically. Other important features of BPLC + NOSO are as follows.

  • The use of NOSOs for both learning and inference minimizes the workload on the event-routing circuits in neuromorphic hardware.

  • To support the latency code, NOSONet includes minimum-latency pooling (MinPool) layers (instead of MaxPool or AvgPool) that pass the event of the minimum latency only for a given patch.

  • Each NOSO is given two symmetric thresholds (\(-\vartheta \) and \(\vartheta \)) for spiking to confine the potential distribution to the range between the symmetric thresholds.

  • BPLC + NOSO fully supports both folded and unfolded NOSONets, allowing us to use the automatic differentiation framework [31].

The primary contributions of this study include the following:

  • We introduce a novel learning algorithm based on the spiking latency code (BPLC + NOSO) with full derivations of the primary gradients without approximations.

  • We provide novel and essential methods for BPLC + NOSO support, such as MinPool layers and symmetric dual threshold for spiking, which greatly improve accuracy and inference efficiency.

  • We introduce a method to quickly calculate wallclock time for inference on general digital neuromorphic hardware, which allows a quick estimation of the inference delay for a given fully trained SNN.

The rest of the paper is organized as follows. Section “Related work” briefly overviews previous learning algorithms based on temporal codes. Section “Preliminaries” addresses the primary techniques employed in BPLC + NOSO. Section “BPLC with spike response model” is dedicated to the theoretical foundations of BPLC + NOSO. Section “Experiments” addresses the performance evaluation of BPLC + NOSO on Fashion-MNIST and CIFAR-10 and the effects of MinPool and the symmetric dual threshold for spiking on learning efficacy. Section “Discussion” discusses the estimation of inference time for an SNN mapped onto a general digital multicore neuromorphic processor. Finally, Section “Conclusion and outlook” concludes our study.

Table 1 Acronyms and symbols

Related work

Spike timing gradient approximation: Temporal backprop algorithms frequently use the linearly approximated spike timing gradients proposed by Bohte et al. [5]. The specific form of the gradient depends on the membrane potential kernel used. Bohte et al. [5], Comsa et al. [7], and Kim et al. [19] used an alpha kernel as an approximation of the genuine SRM kernel, and the corresponding gradients were evaluated using the linear approximation. Zhang et al. employed a rectified linear kernel to avoid the dead neuron issue [46], while Mirsadeghi et al. employed a piece-wise linear kernel for simple calculation of the gradient [25, 26]. To apply the linear approximation by Bohte et al. [5], the gradient of the membrane potential at the spike timing must be available. Integrate-and-fire (IF) neurons do not provide a gradient value at the spike timing, so Kheradpisheh and Masquelier [17] approximated the gradient by a constant of -1. The same holds for leaky integrate-and-fire (LIF) neurons. Zhang and Li [48] stated that the linear approximation was employed, but the gradient is not clearly derived.

Label-encoding as spike timings: For SNNs with temporal codes, the correct labels are frequently encoded as particular output spike timings [17, 25, 26] or as the temporal order of output spikes, such as the time-to-first-spike (TTFS) code [30, 45, 46]. In the TTFS code, the index of the neuron that fires the first output spike indicates the output label.

Workaround for dead neurons: Comsa et al. proposed temporal backprop with a means to avoid dead neurons (assigning penalties to the presynaptic weights of each dead neuron) [7]. Zhang et al. [46] proposed a rectified linear potential kernel that causes a monotonic increase in potential upon receiving a spike with a positive weight, so that the neuron eventually fires a spike. Zhang and Li [48] proposed TSSL-BP with additional backprop paths via the spikes emitted from the same neuron (intra-neuron dependency). The additional paths avoid the learning failure caused by the limited backprop paths due to dead neurons. Kim et al. [19] combined temporal backprop paths with rate-based backprop paths to compensate for the loss of temporal backprop paths due to dead neurons.

BPLC + NOSO is clearly distinguished from the previous temporal backprop algorithms in terms of the primary perspectives addressed in this section. First, BPLC + NOSO employs no approximation for gradient evaluation, unlike the previous temporal backprop algorithms, including those reviewed in this section; it therefore leaves little ambiguity. Second, the proposed spiking latency code is a novel data encoding scheme, distinguishable from previous temporal code schemes. Third, the symmetric dual threshold for spiking is a novel method for avoiding the dead neuron issue and is computationally efficient since it hardly involves any high-cost computation. Additionally, BPLC + NOSO is fully compatible with the original SRM without approximations.

Preliminaries

Latency code

Spiking latency is the period from the first input spike timing \(t_\text {in}\) to the consequent output spike timing \(\hat{t}\), as illustrated in Fig. 1a. In the latency code, NOSONet encodes input data \(\varvec{x}\) as the spiking latencies \(\varvec{T}_\text {lat}^{(L)}\) of the neurons in the output layer L.

$$\begin{aligned} \varvec{T}_\text {lat}^{(L)} = \hat{\varvec{t}}^{(L)} - \varvec{t}_\text {in}^{(L)} = f^{(L)}(\hat{\varvec{t}}^{(L-1)};\varvec{w}^{(L-1)}), \end{aligned}$$
(1)

where \(\hat{\varvec{t}}^{(\cdot )}\) and \(\varvec{t}^{(\cdot )}_\text {in}\) denote the spike timings of the neurons in the \((\cdot )\)th layer and their first input spike timings, respectively. The function \(f^{(L)}\) encodes the input spikes (from layer \(L-1\)) at \(\hat{\varvec{t}}^{(L-1)}\) as spiking latency values \(\varvec{T}_\text {lat}^{(L)}\). The larger the weights \(\varvec{w}^{(L-1)}\), the shorter the spiking latencies \(\varvec{T}_\text {lat}^{(L)}\). This latency code should be distinguished from the TTFS code [30, 45, 46], in which the first input spike timings \(\varvec{t}_\text {in}^{(L)}\) in Eq. (1) are ignored, so that only the output spike timings are considered.

Minimum-latency pooling

The MinPool layers support the latency code. Consider the time elapsed since the first input spike, \(t_\text {elap}=t-t_\text {in}\), for a given neuron. We consider the spiking latency map and the feature (spike) map in a given 2D patch \(\mathcal {D}_\text {pool}\) at timestep t, \(\varvec{T}_{\text {lat},\mathcal {D}_\text {pool}}[t]\) and \(\varvec{s}_{\mathcal {D}_\text {pool}}[t]\), respectively. The latency map \(\varvec{T}_{\text {lat},\mathcal {D}_\text {pool}}\) is initialized with infinite values. Each element in the map is replaced by the actual spiking latency when the corresponding neuron spikes. Note that elements once replaced by actual latency values are never overwritten because of the use of NOSOs. At timestep t, MinPool outputs one if the neuron with the smallest spiking latency in the patch fires a spike, and zero otherwise.

$$\begin{aligned}&x_\text {min} = {{\,\mathrm{arg\,min}\,}}_{x\in \mathcal {D}_\text {pool}}\left\{ \varvec{T}_{\text {lat},\mathcal {D}_\text {pool}}[t]\right\} \text {,}\nonumber \\&\text {MinPool}\left( \mathcal {D}_\text {pool}\right) \left[ t\right] = s_{x_\text {min}}\left[ t\right] , \end{aligned}$$
(2)

where \(s_{x_\text {min}}[t]\) indicates the spike function value for \(x_\text {min}\) at timestep t. An example of \(\text {MinPool}\left( \mathcal {D}_\text {pool}\right) \left[ t\right] \left( =1\right) \) is illustrated in Fig. 1b.

Fig. 1 a Definition of spiking latency, b Schematic of the minimum-latency pooling operation
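As an illustration, the operation in Eq. (2) can be sketched in PyTorch as follows; the tensor shapes, the helper name min_pool2d, and the use of F.unfold are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def min_pool2d(latency, spikes, kernel_size=2):
    """Minimum-latency pooling over non-overlapping patches (sketch of Eq. (2)).

    latency: (C, H, W) spiking-latency map, +inf where the neuron has not spiked yet
    spikes:  (C, H, W) binary spike map at the current timestep t
    """
    C, H, W = latency.shape
    k = kernel_size
    # Gather the k*k candidates of every patch: shape (C, k*k, number of patches)
    lat = F.unfold(latency.unsqueeze(0), k, stride=k).squeeze(0).view(C, k * k, -1)
    spk = F.unfold(spikes.unsqueeze(0), k, stride=k).squeeze(0).view(C, k * k, -1)
    idx = lat.argmin(dim=1, keepdim=True)   # x_min: position of the minimum latency
    out = torch.gather(spk, 1, idx)         # s_{x_min}[t]
    return out.view(C, H // k, W // k)
```

Because each element of the latency map is written at most once, the argmin is stable over time once the minimum-latency neuron of a patch has spiked.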

NOSO with dual threshold for spiking

Each NOSO is endowed with a symmetric dual threshold for spiking (\(-\vartheta \) and \(\vartheta \)); thus, a spike is generated if the membrane potential u satisfies \(u\ge \vartheta \) or \(u\le -\vartheta \). Therefore, the subthreshold potential u is confined to the range between \(-\vartheta \) and \(\vartheta \). This restriction bounds the potential variance over the samples in a given batch, preventing large potential variance across samples. The symmetry of the two bounds also tends to keep the mean potential over the samples near zero. Additionally, the restriction on the potential is expected to avoid dead neurons, given that most dead neurons arise from potentials largely biased toward the negative side.
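A minimal sketch of the spike-generation rule under this symmetric dual threshold is given below, assuming a membrane-potential tensor u, an availability mask sav, and a threshold of 0.15; all names and values are illustrative.

```python
import torch

def noso_spike(u, sav, theta=0.15):
    """Symmetric dual-threshold spike generation (sketch).

    u:   membrane potential
    sav: spiking-availability mask (1 = has not spiked yet, 0 = already spiked)
    """
    spikes = ((u >= theta) | (u <= -theta)).float() * sav
    sav = sav * (1.0 - spikes)   # infinite hard refractory period: at most one spike
    return spikes, sav
```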

BPLC with spike response model

Spike response model mapped onto computational graphs

We consider the SRM, which is equivalent to the basic leaky integrate-and-fire (LIF) model with exponentially decaying synaptic current [13], but our model is allowed to spike at most once in response to a single input sample by using an infinite hard refractory period in place of the refractory kernel. The choice of the SRM, rather than simpler models, e.g., Stein's model [35], is to enlarge the mutual information between spike timing and synaptic weight, which is the key to the temporal code.

In SRM, the subthreshold potential of the ith spiking neuron in the lth layer (\(u_{i}^{(l)}\)) is given by

$$\begin{aligned} u_{i}^{(l)}\left[ t\right] = \sum _{j} w_{ij}^{(l)}\left( \epsilon *s_{j}^{(l-1)} \right) \left[ t\right] sav_i^{(l)}\left[ t\right] , \end{aligned}$$
(3)

where j denotes the indices of the presynaptic neurons, and \(w_{ij}^{(l)}\) denotes the synaptic weight from the jth neuron in the (l-1)th layer. The spiking-availability function \(sav_i^{(l)}\) is employed to allow each neuron to spike once at most such that \(sav_i^{(l)}=1\) if the neuron has not spiked before, and \(sav_i^{(l)}=0\) otherwise. The kernel \(\epsilon \) is expressed as follows [13].

$$\begin{aligned} \epsilon =\frac{\tau _{m}}{\tau _{m}-\tau _{s}}\left( e^{-t\mathbin {/}\tau _{m}} - e^{-t\mathbin {/}\tau _{s}}\right) \Theta \left[ t\right] , \end{aligned}$$
(4)

where \(\Theta \) denotes the Heaviside step function. The potential and synaptic current time constants are denoted by \(\tau _{m}\) and \(\tau _s\), respectively. A spike from the jth neuron in the (l-1)th layer at \(\hat{t}_j^{(l-1)}\) is denoted by \(s_{j}^{(l-1)}\). Because the kernel in Eq. (4) consists of two independent sub-kernels,

$$\begin{aligned} \epsilon _{\left( \cdot \right) } = \dfrac{\tau _m}{\tau _m-\tau _s}e^{-t/\tau _{\left( \cdot \right) }}\Theta \left[ t\right] \text {, where } (\cdot )\in \left\{ m, s\right\} , \end{aligned}$$
(5)

Eq. (3) can be expressed as

$$\begin{aligned} u_i^{(l)}\left[ t\right]&= \left( u_{i,m}^{(l)}\left[ t\right] -u_{i,s}^{(l)}\left[ t\right] \right) sav_i^{(l)}\left[ t\right] \text {,}\\\nonumber u_{i,(\cdot )}^{(l)}\left[ t\right]&= \sum _{j}\dfrac{\tau _{m}w_{ij}^{(l)}}{\tau _{m}-\tau _{s}} e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }} \Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \text {,} \\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} .\nonumber \end{aligned}$$

Here, we introduce a new variable \(v_j^{(l)}\) given by

$$\begin{aligned} v_j^{(l)}\left[ t\right]&= v_{j,m}^{(l)}\left[ t\right] -v_{j,s}^{(l)}\left[ t\right] \text {,}\\\nonumber v_{j,(\cdot )}^{(l)}\left[ t\right]&= \dfrac{\tau _{m}}{\tau _{m}-\tau _{s}} e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }} \Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \text {,}&\\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} .\nonumber&\end{aligned}$$

The variables \(u_{i,m}^{(l)}\) and \(u_{i,s}^{(l)}\) are reset to zero when the neuron fires a spike. The advantage of this formulation is that the membrane potential can be evaluated by simply convolving the input spikes with two independent kernels, which otherwise requires solving two sequential differential equations [20]. After spiking, the spiking-availability function \(sav_i^{(l)}\) remains constant at zero, preventing additional spike generation.

Fig. 2 Unfolded NOSONet on a computational graph

All variables are recursively evaluated using the explicit finite difference method.

$$\begin{aligned} v_{j,(\cdot )}^{(l)}\left[ t+1\right]&= v_{j,(\cdot )}^{(l)}\left[ t\right] e^{-1/\tau _{(\cdot )}} + \dfrac{\tau _m}{\tau _m-\tau _s}s_j^{(l-1)}\left[ t+1\right] \text {,}\nonumber \\&\quad \text { where } (\cdot )\in \left\{ m, s\right\} ,\nonumber \\ u_{i,(\cdot )}^{(l)}\left[ t+1\right]&= \sum _{j} w_{ij}^{(l)}v_{j,(\cdot )}^{(l)}\left[ t+1\right] \text {, where } (\cdot )\in \left\{ m, s\right\} ,\nonumber \\ u_i^{(l)}\left[ t+1\right]&= \left( u_{i,m}^{(l)}\left[ t+1\right] - u_{i,s}^{(l)}\left[ t+1\right] \right) sav_i^{(l)}\left[ t+1\right] . \end{aligned}$$
(6)

Equation (6) can be mapped onto a computational graph as shown in Fig. 2. A layer’s processed data is transmitted along the forward pass through the use of spikes (\(\varvec{s}^{(l)}\)).
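To make the recursion in Eq. (6) concrete, the following sketch implements one timestep for a fully connected NOSO layer; the tensor shapes, time constants, and threshold value are assumptions for illustration, not the authors' code.

```python
import math
import torch

def noso_layer_step(s_prev, v_m, v_s, sav, weight, tau_m=20.0, tau_s=5.0, theta=0.15):
    """One timestep of Eq. (6) for a fully connected NOSO layer (sketch).

    s_prev  : (B, M) input spikes from layer l-1 at the current timestep
    v_m, v_s: (B, M) kernel traces of the inputs
    sav     : (B, N) spiking-availability mask (1 = has not spiked yet)
    weight  : (N, M) synaptic weights
    """
    c = tau_m / (tau_m - tau_s)
    v_m = v_m * math.exp(-1.0 / tau_m) + c * s_prev
    v_s = v_s * math.exp(-1.0 / tau_s) + c * s_prev
    u_m = v_m @ weight.t()                                  # (B, N)
    u_s = v_s @ weight.t()
    u = (u_m - u_s) * sav
    spikes = ((u >= theta) | (u <= -theta)).float() * sav   # symmetric dual threshold
    sav = sav * (1.0 - spikes)                              # NOSO: at most one spike
    return spikes, u, v_m, v_s, sav
```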

Backward pass and gradients

SNNs are typically trained using forward and backward passes aligned in opposite directions, so surrogate gradients are unavoidable because of the non-differentiability of spikes [29, 34, 44]. Instead, BPLC + NOSO uses a backward pass through the spike timings \(\hat{\varvec{t}}^{(\cdot )}\) rather than through the spikes themselves \(\varvec{s}^{(\cdot )}\) (Fig. 2). This backward pass involves differentiable functions only. The output of NOSONet (with M output NOSOs) is the set of spiking latencies of the output NOSOs, \(\varvec{T}_\text {lat}^{(L)}=\{T^{(L)}_{\text {lat}, i}\}_{i=1}^{M}\), as given in Eq. (1). The prediction is then the output neuron with the minimum spiking latency. We use the cross-entropy loss \(\mathcal {L}(-\varvec{T}_\text {lat}^{(L)}, \varvec{\hat{y}})\), where \(\varvec{\hat{y}}\) denotes the one-hot encoded label vector. The loss is evaluated at the end of the learning phase, and the weights are then updated using the gradients assessed when the neurons spiked.
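A sketch of this output stage is shown below, assuming the output latencies are collected in a tensor T_lat of shape (batch, M) with a large finite value for neurons that never spike; using integer class labels with F.cross_entropy is equivalent to the one-hot formulation.

```python
import torch
import torch.nn.functional as F

def bplc_output(T_lat, labels):
    """Prediction and loss from the output spiking latencies (sketch).

    T_lat:  (B, M) spiking latencies of the M output NOSOs
    labels: (B,) integer class labels
    """
    logits = -T_lat                        # shorter latency -> larger logit
    loss = F.cross_entropy(logits, labels)
    preds = T_lat.argmin(dim=1)            # minimum-latency output neuron
    return loss, preds
```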

We calculate the weight update \(\Delta w_{ij}^{(l)}\) using gradient descent as follows.

$$\begin{aligned} \Delta w_{ij}^{(l)}= & {} -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_i^{(l)}}\dfrac{{\partial \hat{t}_i^{(l)}}}{\partial u_i^{(l)}}\dfrac{\partial u_i^{(l)}}{\partial w_{ij}^{(l)}}\left[ \hat{t}_i^{(l)}\right] \nonumber \\ {}= & {} -\eta \dfrac{\partial \mathcal {L}}{\partial \hat{t}_i^{(l)}}\dfrac{{\partial \hat{t}_i^{(l)}}}{\partial u_i^{(l)}}v_j^{(l)}\left[ \hat{t}_i^{(l)}\right] \text {.} \end{aligned}$$
(7)

The learning rate and loss function are denoted by \(\eta \) and \(\mathcal {L}\), respectively. Equation (7) is equivalent to

$$\begin{aligned} \Delta \varvec{w}^{(l)} = -\eta diag\left( \varvec{e}^{(l)}\right) \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] , \end{aligned}$$
(8)

with the error \(\varvec{e}^{(l)}\) given by

$$\begin{aligned} \varvec{e}^{(l)}&= \nabla _{\varvec{\hat{t}}^{(l)}}\mathcal {L}\odot \hat{\varvec{t}}^{(l)'},\nonumber \\ \nabla _{\varvec{\hat{t}}^{(l)}}\mathcal {L}&= \left[ \dfrac{\partial \mathcal {L}}{\partial \hat{t}_1^{(l)}},\cdots , \dfrac{\partial \mathcal {L}}{\partial \hat{t}_N^{(l)}}\right] ^\textrm{T},\nonumber \\ \hat{\varvec{t}}^{(l)'}&= \left[ \dfrac{\partial \hat{t}_1^{(l)}}{\partial u_1^{(l)}}, \cdots , \dfrac{\partial \hat{t}_N^{(l)}}{\partial u_N^{(l)}}\right] ^\textrm{T}, \end{aligned}$$
(9)

for N neurons in the lth layer. The symbol \(\odot \) denotes the Hadamard product. The matrix \(\varvec{v}^{(l)}[\varvec{\hat{t}}^{(l)}]\) is given by

$$\begin{aligned}\nonumber \varvec{v}^{(l)}\left[ \varvec{\hat{t}}^{(l)}\right] = \begin{bmatrix} v^{(l)}_1\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_1\right] \\ \vdots &{} \ddots &{} \vdots \\ v^{(l)}_1\left[ \hat{t}^{(l)}_N\right] &{} \ldots &{} v^{(l)}_M\left[ \hat{t}^{(l)}_N\right] \end{bmatrix}, \end{aligned}$$

for M neurons in the \(\left( l-1\right) \)th layer.

Table 2 Classification accuracy and the number of spikes used for inference

The backward propagation of the error from the lth layer to the \(\left( l-1\right) \)th layer (with M neurons) is given by

$$\begin{aligned} \varvec{e}^{(l-1)}&= \left( \varvec{w}^{(l)\mathrm T}\odot \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right] \right) \varvec{e}^{(l)}\odot \hat{\varvec{t}}^{(l-1)'},\nonumber \\ \varvec{v}^{(l)'}\left[ \varvec{\hat{t}}^{(l)}\right]&= \begin{bmatrix} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_1}{\partial \hat{t}^{(l-1)}_1}\left[ \hat{t}^{(l)}_N\right] \\ \vdots &{} \ddots &{} \vdots \\ \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_1\right] &{} \ldots &{} \dfrac{\partial v^{(l)}_M}{\partial \hat{t}^{(l-1)}_M}\left[ \hat{t}^{(l)}_N\right] \end{bmatrix}. \end{aligned}$$
(10)

Equation (10) is derived in Appendix A. Because a NOSO spikes at most once, the elements once written in \(\varvec{v}^{(l)}[\varvec{\hat{t}}^{(l)}]\) and \(\varvec{v}^{(l)'}[\varvec{\hat{t}}^{(l)}]\) are never overwritten. Equation (10) shows that BPLC involves the gradients of spike timings rather than of the spikes themselves. Therefore, the backward pass differs from the forward pass.

Two types of gradients are thus required for BPLC + NOSO: (i) \(\partial \hat{t}_i^{(l)}/\partial u_i^{(l)}\) and (ii) \(\partial v_j^{(l)}/\partial \hat{t}_j^{(l-1)}\) at the spike timing \(\hat{t}_i^{(l)}\). Fortunately, the SRM allows these gradients to be expressed analytically.

Theorem 1

When an SRM neuron (whose membrane potential is \(u_i^{(l)}\)) spikes at a given time \(t (=\hat{t}_i^{(l)})\), the gradient of spike timing \(\hat{t}_i^{(l)}\) with membrane potential is given by

$$\begin{aligned} \dfrac{\partial \hat{t}_i^{(l)}}{\partial u_{i}^{(l)}} = \left( u_{i,m}^{(l)}\left[ \hat{t}_i\right] /\tau _m - u_{i,s}^{(l)}\left[ \hat{t}_i\right] /\tau _s\right) ^{-1}. \end{aligned}$$
(11)

The proof of Theorem 1 is given in Appendix  B. If the neuron does not spike during a learning phase, the gradient in Eq. (11) is zero.

Theorem 2

When an SRM neuron receives an input spike at \(\hat{t}_j^{(l-1)}\), the gradients of \(v_{j,m}^{(l)}\) and \(v_{j,s}^{(l)}\) with respect to \(\hat{t}_j^{(l-1)}\) are given by

$$\begin{aligned} \dfrac{\partial v_{j,\left( \cdot \right) }^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ t\right]&= \dfrac{\tau _m}{\tau _{\left( \cdot \right) }\left( \tau _m - \tau _s\right) }e^{-\left( t-\hat{t}_{j}^{(l-1)}\right) \mathbin {/}\tau _{\left( \cdot \right) }}\Theta \left[ t-\hat{t}_{j}^{(l-1)}\right] \nonumber \\&= \dfrac{v_{j,\left( \cdot \right) }^{(l)}\left[ t\right] }{\tau _{\left( \cdot \right) }} \text {, where } (\cdot )\in \left\{ m, s\right\} . \end{aligned}$$
(12)

The proof of Theorem 2 is also given in Appendix B. Using Theorem 2, the gradient \(\partial v_j^{(l)}/\partial \hat{t}_j^{(l-1)}\) is given by

$$\begin{aligned} \dfrac{\partial v_{j}^{(l)}}{\partial \hat{t}_j^{(l-1)}}\left[ \hat{t}_i^{(l)}\right] = v_{j,m}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _m - v_{j,s}^{(l)}\left[ \hat{t}_i^{(l)}\right] /\tau _s. \end{aligned}$$
(13)

Likewise, this gradient is also zero if this neuron does not spike. Both gradients in Eqs. (11) and (13) can simply be calculated by reading out the four local variables (\(u_{i,m}^{(l)}\), \(u_{i,s}^{(l)}\), \(v_{j,m}^{(l)}\), \(v_{j,s}^{(l)}\)) when the neuron spikes.
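For illustration, the two analytic gradients can be read out as follows once the four local variables have been stored at the spike timing; the function and variable names are ours, and a real implementation would embed this readout in a custom autograd function.

```python
def spike_timing_grads(u_m, u_s, v_m, v_s, tau_m=20.0, tau_s=5.0):
    """Analytic BPLC gradients from the four local variables (sketch).

    u_m, u_s: potential components of the spiking neuron at its spike timing (Eq. (11))
    v_m, v_s: kernel traces of a presynaptic input at that spike timing (Eq. (13))
    """
    dt_du = 1.0 / (u_m / tau_m - u_s / tau_s)   # d t_hat_i / d u_i, Theorem 1
    dv_dt = v_m / tau_m - v_s / tau_s           # d v_j / d t_hat_j, Theorem 2
    return dt_du, dv_dt
```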

The above derivations are for the folded NOSONet, in which all tensors for each layer are simply overwritten over time so that the space complexity is independent of the number of timesteps. We used the unfolded NOSONet in the temporal domain to apply the automatic differentiation framework [31]. The equivalence between the folded and unfolded NOSONets is proven in Appendix C.

Experiments

A convolutional NOSONet (C-NOSONet) was trained on Fashion-MNIST [40] and CIFAR-10 [22] using BPLC + NOSO. We used the hyperparameters listed in Appendix E unless otherwise stated; the hyperparameters were searched manually. All experiments were conducted in the PyTorch framework [31] on a GPU workstation (CPU: Intel Xeon Processor Gold, GPU: RTX A6000). NOSONet was trained on Fashion-MNIST using one GPU and on CIFAR-10 using four GPUs.

Classification accuracy and the number of spikes for inference

Fig. 3 Active NOSO ratio \(\overline{n}_{sp}^{(i)}\) for each layer on a Fashion-MNIST and b CIFAR-10 over all timesteps

We evaluated the classification accuracy on Fashion-MNIST and CIFAR-10 and the total number of spikes used for inference \(N_{\text {sp}} (=\sum _{i,t}n_{\text {sp}}^{(i,t)})\), where \(n_{\text {sp}}^{(i,t)}\) denotes the number of spikes generated from the layer i at timestep t.

Fashion-MNIST: Fashion-MNIST consists of 70,000 gray-scale images (each \(28\times 28\) in size) of clothing categorized into 10 classes [40]. We rescaled each gray-scale pixel value to the range \(0-0.3\) and applied additive white Gaussian noise (zero mean and 0.05 standard deviation). These values were then used as input currents to the input LIF neurons. We trained a C-NOSONet (32C5-MP2-64C5-MP2-600, where MP denotes MinPool). The classification accuracy of the C-NOSONet is shown in Table 2 in comparison with previous works. We also evaluated the total number of spikes \(N_{\text {sp}}\) over all hidden and output NOSOs in the network for each test sample (Table 2). The results highlight the high sparsity of active NOSOs, which likely reduces the inference latency when implemented in neuromorphic hardware; this is discussed in Section “Discussion”. Figure 3a shows the ratio of active NOSOs to all NOSOs, \(\overline{n}_{\text {sp}}^{(i)} (=\sum _tn_{\text {sp}}^{(i,t)}/C^{(i)}H^{(i)}W^{(i)})\), for layer i over the entire timesteps.
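The input encoding described above can be sketched as follows; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def encode_fashion_mnist(images):
    """Rescale pixels to [0, 0.3] and add white Gaussian noise (sketch).

    images: (B, 1, 28, 28) tensor with values in [0, 1]; the returned values
    are used as input currents to the input LIF neurons.
    """
    currents = 0.3 * images
    currents = currents + 0.05 * torch.randn_like(currents)
    return currents
```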

CIFAR-10: CIFAR-10 consists of 60,000 real-world color images (each \(3\times 32\times 32\) in size) of objects labeled with 10 classes [22]. All training images were pre-processed such that each image, zero-padded by 4 pixels, was randomly cropped to \(32\times 32\), followed by random horizontal flipping. The RGB values of each pixel were rescaled to the range \(0-0.3\) and then used as input currents. For learning stability, we linearly increased the learning rate from its initial value (1E-2) to the plateau value (5E-2) over the first five epochs (ramp rate: 8E-3/epoch). The fully trained C-NOSONet (64C5-128C5-MP2-256C5-MP2-512C5-256C5-1024-512) yields the classification accuracy and the number of spikes for inference listed in Table 2. Notably, our classification accuracy exceeds that of an SNN of the same depth and width (CNN2-half-ch) [39] by approximately 2.0%. Additionally, our NOSONet uses far fewer spikes (only 10.9% of CNN2-half-ch), supporting high-throughput inference. The layer-wise active NOSO ratio \(\overline{n}_{\text {sp}}^{(i)}\) over the entire timesteps is plotted in Fig. 3b, highlighting the high sparsity of spikes.
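The learning-rate warm-up described above could be realized, for example, with a LambdaLR schedule as sketched below; the optimizer choice and the stand-in parameter are assumptions, and only the 1E-2 to 5E-2 ramp over five epochs follows the text.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the NOSONet parameters
optimizer = torch.optim.SGD(params, lr=5e-2)    # plateau learning rate

def warmup(epoch):
    # Linear ramp 1e-2 -> 5e-2 over the first five epochs (8e-3 per epoch).
    return min(1e-2 + 8e-3 * epoch, 5e-2) / 5e-2

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

for epoch in range(10):
    optimizer.step()       # the training steps of one epoch would go here
    scheduler.step()
```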

Fig. 4 Comparison between MinPool and MaxPool in terms of a validation accuracy, b training loss, and c layer-wise active NOSO ratio on Fashion-MNIST

Fig. 5 Comparison between MinPool and MaxPool in terms of a validation accuracy, b training loss, and c layer-wise active NOSO ratio on CIFAR-10

Fig. 6 Mean potential and standard deviation for neurons in each layer of NOSONet a on Fashion-MNIST, b on CIFAR-10. They were evaluated from the potential distribution over samples in a random batch (size: 300 on Fashion-MNIST and 100 on CIFAR-10)

Minimum-latency pooling versus MaxPool

MinPool supports the latency code by passing the event with the minimum spiking latency in a given 2D patch. To identify its effect on learning, we compared NOSONets with MinPool layers and with conventional MaxPool layers. Figures 4 and 5 show the comparisons on Fashion-MNIST and CIFAR-10, respectively. Compared with MaxPool, MinPool yields (i) higher classification accuracy, as shown in Figs. 4a and 5a, and (ii) higher spike sparsity, as shown in Figs. 4c and 5c. The accuracy increase despite the decrease in spike number may imply that MinPool removes spikes that are unimportant for classification, unlike dropout, which removes spikes at random.

Effect of symmetric dual threshold on potential distribution

We identified the effect of the dual threshold on the potential distribution over the samples in a given batch by training NOSONet (32C5-MP2-64C5-MP2-600) on Fashion-MNIST and CIFAR-10 under four different threshold conditions: single thresholds of 0.05 and 0.1, and dual thresholds of ±0.1 and ±0.15. The results are shown in Fig. 6. The use of the dual threshold greatly lowers the standard deviation and yields a mean that is almost zero because it confines the potential to the range between \(-\vartheta \) and \(\vartheta \). Additionally, the highest accuracy was attained with the dual threshold of ±0.15. The potential distributions for a single-threshold case (0.1) and a dual-threshold case (±0.15) on Fashion-MNIST are detailed in Appendix F.

Discussion

We estimate the inference time for an SNN mapped onto a general digital multicore neuromorphic processor under the following assumptions.

Assumption 1: A total of \(N_\text {n}\) neurons in a given SNN are distributed uniformly over the \(N_\text {c}\) cores of a neuromorphic processor, i.e., \(N_{\text {n}}/N_{\text {c}}\) neurons per core.

Assumption 2: All \(N_{\text {n}}/N_{\text {c}}\) neurons in each core share a multiplier by time-division multiplexing, so that the current potential is multiplied by a potential decay factor (\(e^{-1/\tau _\text {m}}\)) for one neuron at each cycle.

Assumption 3: Synaptic operations are also executed serially.

Assumption 4: Neurons in different cores are updated in parallel.

Each timestep for an SNN with LIF neurons includes two primary processes: (i) the process of multiplying the current potential by a decay factor and (ii) synaptic operation (spike routing to the destination neurons plus the consequent potential update). Process (i) in a digital neuromorphic processor is commonly pipelined within a core but executed in parallel over the \(N_{\text {c}}\) cores [20]. Thus, at each timestep, the time for process (i) for all \(N_{\text {n}}\) neurons (\(T_{\text {up}}\)) is given by

$$\begin{aligned} T_{\text {up}} = \left( N_{\text {n}}/N_{\text {c}} + a\right) f_{\text {clk}}^{-1}\text {,} \end{aligned}$$

where a and \(f_{\text {clk}}\) denote the initialization cycle number and clock speed, respectively. Although the number of initialization cycles a differs for different processor designs, it is commonly a few clock cycles. Given the total number of spikes generated at timestep t (\(n_{\text {sp}}[t]\)), the time for synaptic operations at each timestep is given by

$$\begin{aligned} T_{\text {sop}} = n_{\text {sp}}[t]\left( \text {SynOPS}\right) ^{-1}\text {.} \end{aligned}$$

Given Assumptions 1-4, the total time for processes (i) and (ii) at each timestep is given by \(T_{\text {step}}=T_{\text {up}}+T_{\text {sop}}\). Therefore, the total inference time over \(N_{\text {step}}\) timesteps, \(T_{\text {inf}}=\sum _tT_{\text {step}}[t]\), is as follows.

$$\begin{aligned} T_{\text {inf}} = N_{\text {step}}\left( N_{\text {n}}/N_{\text {c}} + a\right) f_{\text {clk}}^{-1} + N_{\text {sp}}\left( \text {SynOPS}\right) ^{-1}\text {,} \end{aligned}$$
(14)

where \(N_{\text {sp}}=\sum _tn_{\text {sp}}[t]\). The number of neurons per core \((N_{\text {n}}/N_{\text {c}})\) differs between designs. We assume 1k neurons per core [8], a few tens of MSynOPS as in [3, 12, 27], and a 100 MHz clock speed. For inference involving \(N_{\text {sp}}\sim 10^6\) spikes (as in Table 2) and \(N_{\text {step}}\sim 100\), Eq. (14) shows that \(T_{\text {sop}}\) dominates \(T_{\text {up}}\), so that \(T_{\text {inf}}\) is dictated by \(T_{\text {sop}}\). Therefore, \(N_{\text {sp}}\) should be a primary concern when developing learning algorithms.
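A back-of-the-envelope evaluation of Eq. (14) under these assumed figures is sketched below; the core count, SynOPS value, and number of initialization cycles are placeholders chosen only to reproduce the order-of-magnitude argument.

```python
def inference_time(n_neurons, n_cores, n_steps, n_spikes,
                   f_clk=100e6, syn_ops=20e6, a=4):
    """Evaluate Eq. (14): T_inf = N_step*(N_n/N_c + a)/f_clk + N_sp/SynOPS (sketch)."""
    t_up = n_steps * (n_neurons / n_cores + a) / f_clk
    t_sop = n_spikes / syn_ops
    return t_up, t_sop, t_up + t_sop

# Example: 1k neurons per core, ~100 timesteps, ~1e6 spikes (cf. Table 2).
t_up, t_sop, t_inf = inference_time(n_neurons=128_000, n_cores=128,
                                    n_steps=100, n_spikes=1_000_000)
print(f"T_up = {t_up*1e3:.1f} ms, T_sop = {t_sop*1e3:.1f} ms, T_inf = {t_inf*1e3:.1f} ms")
# T_up is on the order of 1 ms, whereas T_sop is ~50 ms, so T_sop dominates.
```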

For SNNs with IF neurons (without leakage), process (i) is unnecessary so that \(T_{\text {up}}\) vanishes. Therefore, \(T_{\text {inf}}\) is solely determined by \(N_{\text {sp}}\).

Conclusion and outlook

We proposed a mathematically rigorous learning algorithm (BPLC) based on the spiking latency code in conjunction with minimum-latency pooling (MinPool) operations. We overcome the dead neuron issue using a symmetric dual threshold for spiking, which additionally improves the potential distribution over the samples in a given batch (and thus the classification accuracy). The BPLC-trained NOSONet on CIFAR-10 achieves high accuracy, outperforming an SNN of the same depth and width by approximately 2% while using far fewer spikes (only 10.9%). This large reduction in the number of spikes substantially reduces the inference latency of SNNs implemented in digital neuromorphic processors.

Currently, we conceive the following future work to boost the impact of BPLC + NOSO.

  • Scalability confirmation: Although the viability of BPLC + NOSO has been demonstrated, its applicability to deeper SNNs on more complex datasets should be confirmed. Such datasets include not only static image datasets like ImageNet [33] but also event datasets like CIFAR10-DVS [24] and DVS128 Gesture [1]. Given that the number of spikes is severely capped, BPLC + NOSO on event datasets in particular might be challenging.

  • Hyperparameter fine-tuning: To further increase the classification accuracy, the hyperparameters should be fine-tuned using optimization techniques.

  • Weight quantization: BPLC + NOSO is based on full-precision (32-bit floating-point) weights. However, the viability of BPLC + NOSO with reduced-precision weights should be confirmed to improve memory efficiency. This may require an additional weight-quantization algorithm in conjunction with BPLC + NOSO, such as CBP [18].

  • Search for new application domains: We need to search for new application domains in which BPLC + NOSO can leverage its low processing latency and power consumption when implemented in neuromorphic hardware. Examples potentially include intelligent control systems, such as constrained nonlinear systems [41,42,43].