Introduction

It is no exaggeration to say that the demand for learning from data lies at the core of the ongoing age of intelligence. The ability to effectively process complex, multi-scale spatiotemporal information is essential for many real-world applications, such as handling video in self-driving cars, interpreting written text in mobile reading apps, and managing diverse types of sensor data in outdoor robots. However, processing such complex data accurately, reliably, and efficiently remains a significant challenge, particularly in varying environments with different performance requirements.

In mainstream machine learning, non-spiking recurrent neural networks (RNNs) serve as a pivotal model for processing spatiotemporal data. Unlike traditional feedforward architectures, RNNs incorporate recurrent connections into standard artificial neural network (ANN) models, enabling them to capture temporal patterns. While RNNs are extensively employed in diverse applications like speech recognition1, language modeling2, and state control3, they struggle to learn long temporal dependencies because of vanishing gradients during backpropagation (BP) learning4. To address this issue, variants such as long short-term memory (LSTM) networks5 have been developed. These advanced models, equipped with additional gated units, excel at capturing long-term temporal dependencies but come at the cost of increased computational complexity.

Concurrently, there is growing interest in neuromorphic computing as an alternative pathway for developing intelligent models that are both computationally efficient and biologically inspired. Spiking neural networks (SNNs), regarded as the third generation of neural networks6, are the best-known family of neuromorphic models. The behavior of each spiking neuron is described by the nonlinear dynamics of its membrane potential and a binary spiking mechanism for communication between adjacent neurons7. Distinct from the extrinsic dynamics of RNNs induced by external recurrence, the dynamics of SNNs exist intrinsically within each neuron. The intra-neuron temporal dynamics and the spatial dataflow through the network make SNNs well-suited for processing spatiotemporal data. To date, SNNs have been extensively used for spike stream processing8, speech recognition9, ECG signal analysis10, state control11, and so forth.

Machine-learning-oriented RNNs use intense matrix multiplications for computation and continuous activations for inter-neuron communication; in contrast, neuromorphic-computing-oriented SNNs use sparse matrix accumulations for computation and binary spikes for inter-neuron communication. Compared to the continuous activation state space of RNNs, the spike states of SNNs usually evolve in a discrete space. Given these distinctions, RNNs have been shown to achieve higher accuracy on conventional continuous data sources (e.g., speech signals and language texts), while SNNs are better suited for discrete data sources12 such as the event streams collected by dynamic vision sensors (DVS)13. Owing to the natural filtering effect of membrane potential leakage together with the spike firing and reset mechanisms of spiking neurons, SNNs have demonstrated strong robustness against variations in temporal resolution12 and adversarial attacks14. In addition, owing to the binary format of spikes and the sparsity of spiking activities, the computational cost of an SNN model can be much lower than that of its non-spiking counterpart under the same network structure12,15.

The above analyses show that RNNs and SNNs deliver different performance profiles owing to their disparate modeling paradigms. However, in practical scenarios, the type of data source varies, e.g., continuous data or discrete data, and the performance requirements may also be highly diverse. For example, cloud users care most about high functional accuracy, whereas low computational cost matters more for energy-restricted edge devices. Furthermore, for many core components in a system, guaranteeing high robustness against internal noise or external attacks becomes the primary design consideration. Although a dedicated model can be built for each task, this is inefficient because the modeling experience accumulated for one task cannot be directly transferred when the task changes. To escape this one-task-one-model dilemma, a unified modeling framework that realizes adaptive accuracy, robustness, and efficiency is highly desirable for processing spatiotemporal data in various scenarios.

Here we report a unified modeling framework that creates hybrid spatiotemporal neural networks (HSTNNs) by synergistically combining RNNs and SNNs for processing spatiotemporal data sources. To make the hybrid model learnable, our work builds on a unified learning methodology, backpropagation through time (BPTT) augmented with a surrogate function, which works for both RNNs and SNNs and thus opens the possibility of hybridization. Furthermore, we exploit a classical pruning method16,17 to realize neuron selection from RNN and SNN populations and further develop a neuron-aware three-stage hybridization solution to create HSTNNs. It leverages second-order (Hessian) gradient information and enables automatic learning of a hybrid structure during the training phase. On several typical spatiotemporal dataset benchmarks, HSTNNs demonstrate a better ability to adaptively balance accuracy, robustness, and efficiency by tuning the configuration between the two types of neurons, and usually outperform conventional single-paradigm RNNs and SNNs. With a robotic place recognition task, we evidence the great potential of HSTNNs in varying environments. Overall, the proposed HSTNNs provide an attractive way to adaptively process variable spatiotemporal data sources in the open world.

Results

Creating HSTNNs

Generally, RNNs and SNNs adopt different strategies in neural coding, computation, and communication, leading to varying performance and application suitability on specialized devices. How to incorporate the distinct features of RNNs and SNNs and integrate their complementary advantages is an open but foremost issue for designing HSTNNs. Recent progress in the neuromorphic field has seen a surge of interest in hybrid approaches that convert non-recurrent neural networks into spiking networks18,19,20,21,22. Several studies23,24 have explored integrating non-recurrent ANN and SNN modules at the layer level. However, these methods require a fixed heterogeneous network to be elaborately specified in advance for each specific task, and a hybrid approach that effectively integrates diverse temporal dynamics and handles spatiotemporal data flows is still lacking.

To maintain the features of different neuron types, we adopt a decoupling strategy that preserves their diverse spatiotemporal dynamics and allows hybrid information transmission at the neuron level. The general structure of HSTNNs is shown in Fig. 1a. Each hybrid layer contains two neuron populations, an RNN one with artificial neurons and an SNN one with spiking neurons. Both populations receive the same mixed inputs from the previous layer, independently update their respective spatiotemporal dynamics, and synergistically send mixed outputs by concatenating the RNN and SNN outputs. Within each hybrid layer, each neuron only connects to neurons belonging to the same population.

Fig. 1: Methodology of creating HSTNNs.
figure 1

a The HSTNN architecture. Each hidden layer of an HSTNN contains two types of neurons and their outputs are combined before being injected into the next layer. b The three-stage learning process for creating an HSTNN. In the Adaptation stage, we create two redundant populations of neurons in each hidden layer and apply the unified BPTT learning algorithm to warm up the connection weights. In the Selection stage, we propose a neuron-aware selection mechanism to measure the importance of neurons and select important neurons. In the Restoration stage, we shrink the network by invalidating unselected neurons and their connections and retrain the compact model until convergence. c HSTNNs exhibit several advantageous features: an HSTNN can be conveniently initialized under a unified Adaptation stage no matter what the specific task scenario is; it enables a flexible balance between accuracy, cost, and robustness to satisfy variable performance requirements in practice; and it can be deployed on neuromorphic hardware for constructing an efficient application system.

To generate optimal dynamics between spiking and non-spiking artificial neurons, we expect the hybrid network to be learned automatically. RNNs are usually trained with the classical BPTT algorithm, while SNNs are widely trained with bio-plausible synaptic plasticity rules, e.g., spike timing dependent plasticity (STDP)25,26, which is incompatible with BPTT. Fortunately, BPTT has recently been adapted to SNNs by addressing the training convergence problems and the non-differentiability of spiking activities. This progress in the SNN domain lays the foundation for hybridizing RNNs and SNNs under a unified learning framework. Based on BPTT, we propose a three-stage hybrid learning methodology to create HSTNNs, which is detailed in the following sections.

The core idea of the learning methodology is an evolution-inspired strategy, which combines a unified BPTT learning algorithm and a neuron-aware selection mechanism to select prominent neurons from two redundant neuron populations for building a hybrid network. As shown in Fig. 1b, to explore the optimal structure of the hybrid network, we generate two types of neuron populations in each layer, which represent a redundant set of candidate neurons for constructing the HSTNN. Each population is independent of the other within the layer, and the outputs are combined before being injected into the next layer. We adopt BPTT with the surrogate gradient technique to pretrain this initial network to a good starting point with inter-population interactions. Next, we develop a neuron-aware selection mechanism to measure the importance of different types of neurons, which guides the selection of prominent neurons from the two redundant populations by invalidating unimportant neurons and their connections. Finally, we retrain the resulting compact hybrid network after neuron selection until convergence. Note that the ratio between the two types of neurons is specified at the beginning, whereas which neurons to select is determined automatically during learning.

The proposed HSTNNs demonstrate several advantageous features that benefit practical use, as depicted in Fig. 1c. First, the proposed hybridization approach adopts a unified adaptation strategy to create redundant neuron populations no matter what the target task is, thereby easing its use. Second, HSTNNs enable a flexible balance between accuracy, cost, and robustness according to actual needs by customizing the ratio of the different types of neurons, which satisfies variable performance requirements in practice. Third, HSTNNs exhibit better accuracy and robustness compared to single-paradigm networks under appropriate neuron configurations. Last, although HSTNNs represent a novel paradigm, they can be deployed on existing neuromorphic hardware, especially on chips with hybrid computing architectures11,27,28, which promises the construction of efficient application systems.

Three-stage hybrid learning

HSTNNs use a three-stage learning methodology to generate the hybrid network, including Adaptation, Selection, and Restoration stages. The three stages gradually extract a compact hybrid network from an initialized redundant network in an evolutionary manner.

In the Adaptation stage, an HSTNN first expands each hidden layer with redundant pools of different types of neurons. The neurons in each pool are governed by their respective spatiotemporal dynamics, and the output representations are merged and then propagated to the next layer. The surrogate gradient technique12,29 is used to handle the derivative of the non-differentiable spiking function, and the BPTT learning algorithm is then applied to warm up the connection weights.

The Selection stage aims at identifying critical neurons and pruning unimportant neurons to satisfy a target neuron configuration. To quantitatively describe the neuron configuration, in the following we introduce the SNN ratio as the ratio of the number of spiking neurons to the total number of hidden neurons after the three-stage learning. To produce an effective selection mechanism, we draw inspiration from the optimal brain surgeon (OBS) method16, a classical network pruning method for shallow feedforward networks, and extend it to the selection of important neurons in hybrid recurrent networks. Specifically, we use the second-order gradient information of synaptic connections around the local minimum of the loss function as the basic measure of importance, collect the importance scores of the different afferent connections of each neuron, and finally evaluate the neuron importance according to the accumulated gradients spanning the temporal domain. Because spiking and non-spiking neurons carry different types of gradient information, we introduce a grouping selection mechanism that ranks neurons of the same type across all layers and constrains the overall number of neurons according to the specified SNN ratio. By doing so, the Selection stage invalidates unimportant neurons and shrinks the redundant structure to the expected compact level.

Because the selection process is decoupled from the Adaptation stage, this learnable architecture is efficient and flexible enough to meet actual needs without requiring expert guidance to tailor a network architecture for each specific task. Given the compact network structure, the Restoration stage finally fine-tunes the remaining neural interactions to obtain the final HSTNN.

The three-stage learning methodology employs the surrogate gradient technique to approximate the first- and second-order gradient information of the non-differentiable spiking activities, which potentially influences the learned interaction between spiking and non-spiking activities. Our empirical analysis in Supplementary Fig. 2 shows that the specific form of the surrogate function has little impact on the learning performance of the HSTNN. HSTNNs learn a similar profile of neuron importance under different surrogate functions (see Supplementary Fig. 2b, c), leading to competitive learning performance at the end of the third stage (see Supplementary Fig. 2d, e). This indicates that the proposed hybridization approach is stable against various hyper-parameters of the surrogate functions during the learning process.

Comprehensive evaluation of task performance and computational cost

The HSTNN presents a general hybrid strategy for integrating different network paradigms, suitable for various sequential learning tasks. We first evaluate its comprehensive performance in terms of both task accuracy and the computational cost in four different types of tasks, as shown in Fig. 2. For tasks on PTB, S-MNIST, and N-MNIST datasets, HSTNNs with two hybrid fully-connected hidden layers are implemented. For the more challenging DVS-Gesture dataset, a nine-layer convolutional-based network structure is employed (see Methods).

Fig. 2: Comprehensive evaluation of HSTNNs in terms of task performance and the computational cost.
figure 2

Impact of the SNN ratio on accuracy (upper) for (a) S-MNIST, (b) PTB, (c) N-MNIST, and (d) DVS-Gesture datasets. Impact of the SNN ratio on the number of operations (lower) for (e) S-MNIST, (f) PTB, (g) N-MNIST, and (h) DVS-Gesture datasets. A two-hidden-layer network structure is adopted for PTB, S-MNIST, and N-MNIST, and a convolutional structure for DVS-Gesture (see Methods). Note that task performance on the PTB dataset is measured by perplexity (ppl), where lower is better. S-MNIST is a variant of the standard MNIST dataset in which the images are input into the model sequentially, one column at a time. i Comprehensive analysis of the trade-off between accuracy and the computational cost for the N-MNIST dataset, where the computational cost is measured by the total number of multiplication and addition operations. j Analysis of average accuracy improvement by HSTNNs compared to directly hybrid models, measured by the ratio of remaining hidden neurons in the Restoration stage to those in the Adaptation stage. All error bars represent the standard deviation over five repeated trials.

As depicted in Fig. 2a–d, the single-paradigm SNNs and RNNs exhibit variable performance on different datasets: RNNs perform better on traditional deep-learning-oriented datasets like S-MNIST and PTB, while SNNs excel on neuromorphic-computing-oriented datasets like N-MNIST and DVS-Gesture. This prominent difference may result from the fundamental differences in neural coding and computation between the two networks. Compared to SNNs, the high-precision neural representation of RNNs is advantageous in handling continuous-value-based text analysis tasks like PTB. Conversely, the natural filtering effect of membrane potential leakage and the spike rate coding scheme may enhance the robustness of SNNs against input fluctuations, leading to better performance of SNNs on neuromorphic datasets, as evidenced by the results on N-MNIST and DVS-Gesture.

Combining Fig. 2a–h, we can draw three key conclusions about HSTNNs. First, HSTNNs with suitable neuron configurations can outperform both single-paradigm SNNs and RNNs. This improvement is likely due to the hybrid information representation of non-spiking and spiking neural networks with richer neuronal computation mechanisms, which increases learning nonlinearity and integrates the complementary strengths of SNNs and RNNs in addressing specific tasks. For instance, such integration can increase the representation precision of single-paradigm SNNs on the non-spiking PTB dataset and enhance the robustness of single-paradigm RNNs against input fluctuations on the spiking N-MNIST dataset.

Second, Fig. 2e–h shows that HSTNNs achieve an adaptive balance of accuracy and computational cost between those of SNNs and RNNs. The computational cost gradually decreases as the SNN ratio grows, owing to the high efficiency of SNNs, which are dominated by sparse accumulation operations. In Fig. 2i, we further take the N-MNIST dataset as an example to evaluate the correlation between accuracy and computational cost. A lighter bubble color indicates a larger SNN ratio and thus a lower computational cost. The results show that HSTNNs can produce a better comprehensive solution with higher accuracy and lower computational cost, demonstrating the effectiveness of the proposed HSTNN model.

Third, HSTNNs provide an effective hybridization method compared with the direct hybridization method. To evidence the superiority of our three-stage hybridization, we build a baseline model for comparison. This baseline model, named the directly-hybrid model, trains a hybrid model from scratch under a given SNN ratio without employing the three-stage learning process. As shown in Fig. 2a–d, HSTNNs perform better than the directly-hybrid models. Furthermore, as observed in Fig. 2j, the improvement of HSTNNs over directly-hybrid models becomes more pronounced when constructing smaller-size hybrid neural networks, a practical constraint commonly considered in real-world edge systems with limited resources. These comparisons reflect the effectiveness of the elaborate three-stage learning methodology: expanding the representation dimension initially and then meticulously selecting important neurons from the hybrid redundant populations, rather than directly training a compact hybrid network from scratch.

Robustness analysis against noise, frame-loss, and adversarial attack

We next examine the robustness of sole SNNs, sole RNNs, and HSTNNs on the deep-learning-oriented S-MNIST and the neuromorphic-computing-oriented N-MNIST dataset. We evaluate the model robustness in three aspects including random noise robustness, frame-loss robustness, and adversarial attack robustness. For the random noise robustness, considering the diverse data characteristics of different datasets, we add the Gaussian noise into the testing samples of S-MNIST and the salt-and-pepper noise into the testing samples of N-MNIST. For the frame-loss robustness, we randomly mask some sequence information of each frame of the testing sample. For the adversarial attack robustness, we generate untargeted adversarial samples with a small level of perturbation on the raw samples (see Methods). Three selected digits are shown in Fig. 3a to illustrate different formats of testing samples in robustness experiments. All models are trained on standard training sets while evaluated on preprocessed testing sets with testing samples described above. We record the performance of HSTNNs compared to that of sole RNNs and SNNs in the three types of noise tests in Fig. 3b. Note that for the noise and frame-loss robustness experiments, higher recognition accuracy indicates better robustness, while for the adversarial attack robustness experiments, a lower attack success rate implies better resistance to adversarial attacks.

Fig. 3: Validating the robustness and scalability of HSTNNs.
figure 3

a Illustration of three selected digit samples in varied formats for the different robustness experiments. Following the training phase, we introduce distinct noise types into the testing samples and report the average accuracy across the entire testing dataset. b Comparison of noise robustness, frame-loss (FL) robustness, and adversarial attack (AA) robustness between HSTNNs under the optimal ratios and single-paradigm networks on the S-MNIST and N-MNIST datasets. c Analysis of the robustness with increasing noise levels (left) and frame-loss probability (right). RNNs and SNNs exhibit distinct robustness advantages for the two types of perturbation, while HSTNNs demonstrate a complementary robustness profile. d HSTNNs are scalable for integrating neuron models with various types of temporal dynamics, evidenced by accuracy improvements when incorporating more complex neuronal computation features. The HSTNNs with the optimal SNN ratios are reported here for comparison. Error bars represent the standard deviation, and the numbers above the yellow bars denote the best accuracies of HSTNNs over five runs.

We observe two prominent phenomena in Fig. 3b–c. First, as presented in Fig. 3b, in both the noise robustness and frame-loss robustness experiments, HSTNNs perform comparably to the best single-paradigm models, i.e., RNNs or SNNs. In the adversarial attack robustness experiment, HSTNNs even demonstrate better robustness than both RNNs and SNNs. Second, we observe in Fig. 3c that RNNs and SNNs exhibit different robustness performance in the noise and frame-loss experiments. As the noise level increases, measured by the standard deviation of the Gaussian noise or the frame-loss probability on S-MNIST, HSTNNs track the more robust single-paradigm model in each test and achieve significantly higher accuracy than the other single-paradigm network. This implies that hybridization enables HSTNNs to inherit the robustness advantages of both single-paradigm models.

Scalability of HSTNNs in integrating different neuron modules and architectures

The proposed neuron-wise hybridization approach facilitates the easy incorporation of more complex neuronal models and deeper network structures. We have demonstrated in Fig. 2 that the proposed hybrid approach can be applied to different shallow network structures for various sequential learning tasks. We now further quantitatively analyze the scalability of HSTNNs in integrating different neuronal models, including vanilla RNN (vRNN), LSTM, LIF, and adaptive LIF (ALIF), and in applying the hybrid approach to deeper network structures.

Figure 3d shows the results of building HSTNNs with different neuronal models on DVS-Gesture. When comparing the second to fourth groups of bars with the first group, it becomes evident that including more complex neuronal dynamics enhances the overall task performance of single-paradigm spiking and non-spiking neural networks. Notably, the improvements of HSTNNs for the third and fourth groups are more prominent than those for the first and second groups, where the spike-based models demonstrate overwhelmingly better performance. This suggests that the effectiveness of HSTNNs in exploring a complementary and superior solution depends on the original performance difference between the single-paradigm networks. Namely, when the two single-paradigm networks perform comparably, the HSTNN has a greater chance of producing a better hybridization solution. In addition to DVS-Gesture, we also test the scalability of the proposed approach with deep convolutional network structures (see Methods) on N-MNIST and CIFAR10-DVS. The results, displayed in Table 1, show that integrating convolutional neuron models can consistently enhance the performance of single-paradigm networks and surpass other advanced models. This underscores the potential of combining heterogeneous neuron types within a single hybrid network to improve task performance.

Table 1 Comparison with advanced models on neuromorphic datasets

Adaptability of HSTNNs in varying environments

In addition to evaluating HSTNNs on standard datasets, we further assess their adaptability with a robotic place recognition task in real-world varying environments. To conduct the experiments, we utilize a robot platform developed in our previous study30, as depicted in Fig. 4a, where the robot navigates different environments. The objective of this task is to accurately recognize a specific place based on event-based and frame-based vision inputs, as illustrated in Fig. 4b. The entire path is divided into 100 classes representing distinct places, and we collect event-based and frame-based data using a DVS camera and an RGB-D camera in three different environments: (i) an indoor environment with adequate lighting conditions (env1), (ii) an outdoor environment with varying lighting conditions (env2), and (iii) an indoor environment with low lighting conditions (env3).

Fig. 4: Validating the adaptability of HSTNNs with a robotic place recognition task in varying environments.
figure 4

a Experimental setup for data collection. b The HSTNNs are tested in three different environments: (i) env1, an indoor environment with adequate lighting conditions; (ii) env2, an outdoor environment with varying lighting conditions; and (iii) env3, an indoor environment with low lighting conditions. The HSTNNs simultaneously receive inputs from a dynamic vision sensor (DVS) device (the first column) and an RGB-D camera (the second column). c The best place recognition accuracy in different environments is achieved at different SNN ratios. df Accuracy of the HSTNNs in the three environments, with error bars representing the standard deviation over three trials. g The overall number of operations performed by the HSTNNs in this task. h Comprehensive comparison of performance results between SNN/RNN models and the HSTNNs with the optimal SNN ratio. Acc.: accuracy and opts.: operations.

The results obtained from these place recognition experiments highlight the flexible adaptability of HSTNNs to varying environmental conditions. Figure 4c–g presents a comparative analysis of HSTNNs with varying SNN ratios, including single-paradigm RNNs and SNNs, across the three environments. The results indicate that the SNN ratio yielding the best recognition accuracy varies across environments, demonstrating that different hybrid configurations suit specific environmental conditions. This adaptability enables HSTNNs to perform effectively in variable scenarios. Figure 4h further provides a comprehensive comparison between HSTNNs and single-paradigm RNN/SNN models in terms of accuracy and computational cost. Notably, HSTNNs outperform the single-paradigm models, demonstrating the superiority of the hybrid modeling paradigm. Figure 4h also highlights the trade-off between accuracy and computational cost, as the best accuracy achieved by HSTNNs corresponds to different numbers of operations. This flexibility empowers practical users to make informed decisions based on their specific requirements.

The performance adaptability, robustness to varying environments, and consideration of the computational cost make HSTNNs a promising approach for processing spatiotemporal data sources in the open world. The findings from the above robotic place recognition experiments evidence the great potential of HSTNNs in achieving improved accuracy compared to single-paradigm models and providing a flexible solution for different application environments.

Hardware deployability of HSTNNs

Applying neural network models in practice depends on efficient hardware. In current intelligent machines, general-purpose processors such as GPUs are the mainstream platforms for running neural networks. Although they can execute ANNs efficiently, a large performance gap exists when executing SNNs. Neuromorphic processors are another family of hardware that can execute SNNs efficiently but support RNNs inadequately. Therefore, our HSTNNs, with hybrid computation of both RNNs and SNNs, are not well suited to these single-paradigm-oriented hardware platforms. Fortunately, hybrid-paradigm neuromorphic chips31,32,33 have emerged in recent years, which show promising performance whether running RNNs or SNNs. To validate the application potential of HSTNNs, we select a recent hybrid neuromorphic chip, TianjicX34, as the platform for execution efficiency analysis.

We have implemented RNNs, SNNs, and HSTNNs with different SNN ratios on the TianjicX chip. All the networks contain three layers running in a pipeline on the chip as illustrated in Fig. 5a. We employed two mapping strategies, namely fixed-core mapping and variable-core mapping, for different hybrid layers, as presented in Fig. 5b. The small layers are mapped to two fixed cores respectively used for computing SNN and RNN modules, i.e., the fixed-core mapping strategy, while the larger layers are mapped to more cores where the computational resources for SNN and RNN modules are proportional to the number of neurons, i.e., the variable-core mapping strategy. The choice of the strategy depends on the potential for parallel execution of the layer across multiple cores, which will be detailed in Methods.

Fig. 5: Comparing the efficiency of RNNs, SNNs, and HSTNNs on neuromorphic hardware.
figure 5

a Implementation of HSTNNs on the TianjicX chip. Two three-layer neural networks on the S-MNIST and N-MNIST datasets are mapped onto the chip with different mapping strategies. b Illustration of the two layer-mapping strategies. c Execution latency and energy consumption of RNNs, SNNs, and HSTNNs with different SNN ratios.

To minimize the effect of the mismatch between the network structure and the hardware architecture, we add extra restrictions to HSTNNs. First, the input size of each neuron population is set to a multiple of 16 and the output size to a multiple of 32 so that the networks can exploit parallelism within each core. Second, we fix the total number of neurons in each layer for HSTNNs with different SNN ratios. This restriction addresses the issue that the Selection stage, which imposes only a global constraint on the entire network, might otherwise assign a variable number of neurons to each layer, which complicates hardware mapping and makes comparisons of execution performance unfair.
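As a simple illustration of these sizing restrictions, the following Python helper (our own illustrative sketch, not the authors' deployment tooling) splits a fixed hidden-layer width between the two populations so that each population size is a multiple of 32 and the total width is preserved across SNN ratios:

```python
def split_hidden_layer(total_neurons, snn_ratio, out_multiple=32):
    """Illustrative sizing helper (an assumption of ours, not the authors' tool).

    Splits a fixed hidden-layer width between the SNN and RNN populations so
    that each population's output size is a multiple of `out_multiple`, as
    required for parallel execution within a core. The total width is kept
    constant across SNN ratios for a fair hardware comparison.
    """
    if total_neurons % out_multiple != 0:
        raise ValueError("total width must itself be a multiple of out_multiple")
    snn = round(total_neurons * snn_ratio / out_multiple) * out_multiple
    snn = min(max(snn, 0), total_neurons)   # clamp to [0, total]
    rnn = total_neurons - snn               # automatically a multiple of out_multiple too
    return snn, rnn

# Example: an 800-neuron hybrid layer at SNN ratio 0.75 -> (608, 192)
print(split_hidden_layer(800, 0.75))
```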

Figure 5c shows that, in general, the higher the ratio of SNN neurons in the network, the shorter the execution latency, as recurrent connections in the RNN consume additional computation. The execution latency of the sole SNN on S-MNIST is significantly higher than that of the HSTNN at the SNN ratio of 0.75 because the sole SNN only utilizes a single SNN core for computation in the fixed-core mapping strategy while the hybrid HSTNN can use both cores. By measuring both the execution latency and dynamic power consumption on the chip, we calculate the dynamic energy consumed by RNNs (SNN ratio 0), SNNs (SNN ratio 1), and HSTNNs for inferring one sample. As given in Fig. 5c, the energy consumption and the SNN ratio of HSTNNs are negatively correlated on both S-MNIST and N-MNIST datasets, regardless of the mapping strategy. These results are consistent with the previous analysis of the computational cost and once again demonstrate the flexibility of HSTNNs. As the SNN ratio increases, the heavy computation of RNNs is reduced, which results in lower energy consumption. More details of the mapping strategies and experimental results are available in Methods and Supplementary Fig. 3.

Discussion

We presented a generic hybridization approach that can maintain and integrate the complementary features of RNNs and SNNs, promising a unified effective way to process different types of spatiotemporal data. We observed that RNNs and SNNs have shown divergent performances across six distinct types of tasks. By leveraging their complementary features through our hybrid models, we demonstrated that the HSTNNs not only surpass single-paradigm models in comprehensive performance but also exhibit superior robustness against noise, frame loss, and adversarial attack. Furthermore, the adaptability of HSTNNs to diverse environmental conditions was evidenced in the robot place recognition task. The flexible hybrid paradigm yielded optimal recognition accuracy in a variety of lighting conditions, indicating its potential for handling the complexity and variability of real-world applications. Even though HSTNNs integrate two types of neurons, they can be deployed on emerging neuromorphic chips with the hybrid architecture for efficient execution.

Interestingly, HSTNNs exhibit intriguing similarities to the coding strategies and integration of continuous and spiking activities observed in the human brain for information processing. The brain, renowned as a hybrid learning system, employs diverse types of neuron populations and a range of coding schemes to tackle complex spatiotemporal tasks. HSTNNs can actually achieve similar functionality by leveraging different coding strategies of SNNs and RNNs. We observed that the interaction between continuous and spike-based neural activities can alter the spiking activities of the spiking neuron population (see Supplementary Figs. 5, 6). This amalgamation of various neuronal dynamics and coding strategies in HSTNNs embraces the diversity and richness of the brain’s own computational strategies, an aspect that has been underscored by contemporary neuroscientific research35,36. These parallels not only highlight the relevance and potential of HSTNNs for handling diverse real-world applications but also provide hints to understanding the design of more robust and adaptable systems for artificial intelligence.

How to determine an optimal SNN ratio in HSTNNs, and thereby achieve the balance between task performance and computational cost, is a crucial but open issue. The optimal SNN ratio is highly context-dependent and varies according to specific user requirements, as the weights assigned to accuracy and cost differ across environments. We present a heuristic method in the Supplementary Information to address this challenge, which automatically searches for an optimal SNN ratio in specific tasks. By formulating the optimization problem and employing approximate optimization methods such as subgradient descent, we demonstrate in Supplementary Fig. 7 the feasibility of finding an optimal SNN ratio. This allows users to customize the model based on their specific needs and achieve an optimal trade-off between accuracy and cost. The practical implications of this optimization process are significant, as it facilitates real-world applications of HSTNNs in practical environments for effective and efficient processing of spatiotemporal data.

Hybrid neural network models are gaining increasing interest across fields, driven by the rapid development of neuroscience and breakthroughs in deep learning23,31. Several advantages of layer-wise hybridization have been demonstrated in references23,24. These layer-wise hybridization approaches focus on integrating non-recurrent ANN and SNN modules using layer-wise strategies, providing efficient solutions for practical applications such as optical flow estimation and high-speed tracking tasks. In contrast, the neuron-wise hybridization approach proposed in this work offers a finer-grained method of hybridization, enabling real-time interaction of the coding and computational features of different types of neurons. Moreover, the neuron selection strategy employed in the Selection stage represents a more general solution that encompasses layer-wise hybridization as a specific case. We demonstrate in Supplementary Fig. 4 that the proposed hybridization model can be applied to optical flow estimation, yielding results comparable to those of specially designed single-paradigm networks.

The proposed HSTNN is an initial effort to bridge dynamic models in machine learning and neuromorphic computing. As mentioned above, HSTNNs have shown great potential in task performance, model robustness, and computational cost, providing a flexible trade-off to satisfy variable environments and user requirements under a unified modeling and learning framework. The models showcased in this work are relatively simple, offering considerable scope for further enhancement. For instance, HSTNNs could be enhanced with advanced transformer architectures to create deeper and larger models, enabling the processing of more complex spatiotemporal data. Furthermore, intelligent machines equipped with neuromorphic chips can incorporate HSTNNs to process spatiotemporal information collected by various sensors such as cameras, microphones, electroencephalogram electrodes, and so forth. We hope this work inspires more investigations that exploit the complementary features and advantages of computer-science-oriented models and neuroscience-oriented models.

Methods

Establishment of HSTNN

The HSTNN contains an input layer, one or multiple hidden hybrid layers, and a readout layer. Each hybrid layer has a population of non-spiking recurrent neurons and a population of spiking neurons. The size of each population can change adaptively during the training process. To facilitate efficient training of hybrid models, non-spiking recurrent neurons are simulated on the same synchronized time grid as spiking neurons. At each time step, neurons in the two populations receive the same mixed inputs from the previous layer together with their own outputs from the previous time step, and then update their neural dynamics and generate their respective outputs (rt or st). These outputs are combined before being forwarded to the next processing layer. The combined output \({{{\boldsymbol{y}}}}_{t}^{n}\) of the n-th hybrid layer at the t-th time step is formalized as:

$${{{{\boldsymbol{y}}}}_{t}^{n}={{\rm{Concat}}}\left({{\rm{SNN}}}\left({{{\boldsymbol{s}}}}_{t-1}^{n},\,{{{\boldsymbol{y}}}}_{t}^{n-1} \parallel {{{\boldsymbol{\theta }}}}_{SNN}^{n}\right),\, {{\rm{RNN}}}\left({{{\boldsymbol{r}}}}_{t-1}^{n},\, {{{\boldsymbol{y}}}}_{t}^{n-1}\parallel {{{\boldsymbol{\theta }}}}_{RNN}^{n}\right)\right),\, n=2,...,\, N-1,}$$
(1)

where N denotes the number of layers. A detailed illustration of the propagation of hybrid information in HSTNN is provided in Supplementary Fig. 1. To support diverse spike decoding schemes, the HSTNN uses a generic parametric decoder d(xω) capable of decoding rate-based or timing-based information from the output spike train \(\{{{{\boldsymbol{s}}}}_{1}^{N},\, {{{\boldsymbol{s}}}}_{2}^{N},...,\, {{{\boldsymbol{s}}}}_{t}^{N}\}\) into a vector representation. The parameter ω represents the weight matrix assigned to the output spike train in the specific decoding scheme. The decoded spiking information is concatenated with the RNN’s outputs to produce a final output via a readout weight WN. This process can be formalized as follows:

$${{{\boldsymbol{y}}}}_{t}^{N}={{{\boldsymbol{W}}}}^{N}{{\rm{Concat}}}\left({{{\boldsymbol{r}}}}_{t}^{N},\, d\left({{{\boldsymbol{s}}}}_{1}^{N},\, {{{\boldsymbol{s}}}}_{2}^{N},...,\, {{{\boldsymbol{s}}}}_{t}^{N} \big| {{\boldsymbol{\omega }}}\right)\right).$$
(2)

In our experiments, we instantiate d(*) in a straightforward form: \({d}_{i}={\sum }_{{t}^{{\prime} }=1}^{t}{\omega }_{i,{t}^{{\prime} }}{{{\boldsymbol{s}}}}_{i,{t}^{{\prime} }}^{N}\), where di(*) represents the ith component of d(*). For the neuromorphic datasets, including N-MNIST, DVS-Gesture, and CIFAR10-DVS, we decode rate-based information from the output spike trains by setting an equal entry \({\omega }_{i,{t}^{{\prime} }}=\frac{1}{t},\forall i,\, {t}^{{\prime} }\); for the text analysis tasks, we employ the spike timing information at the last time step by setting ωi,t = 1 and \({\omega }_{i,{t}^{{\prime} }}=0\) for \({t}^{{\prime} } \, < \, t\).
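To make the readout of Eqs. (1)–(2) concrete, the NumPy sketch below decodes a random output spike train with either the rate-based weighting (ω equal to 1/t for all steps) or the last-time-step weighting, concatenates the decoded vector with the RNN outputs, and applies the readout weight W^N. All shapes, seeds, and variable names are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_snn, n_rnn, n_out = 10, 8, 8, 4        # time steps and toy layer sizes

spikes = rng.integers(0, 2, size=(T, n_snn)).astype(float)  # {s_1^N, ..., s_T^N}
r_T = rng.standard_normal(n_rnn)                             # RNN hidden state r_T^N
W_N = rng.standard_normal((n_out, n_snn + n_rnn))            # readout weight W^N

def decode(spike_train, omega):
    """Generic parametric decoder d(s | omega): a weighted sum over time."""
    return (omega[:, None] * spike_train).sum(axis=0)

# Rate-based decoding (neuromorphic datasets): omega_{i,t'} = 1/T for all t'.
omega_rate = np.full(T, 1.0 / T)
# Last-time-step decoding (text tasks): omega at the final step is 1, earlier weights 0.
omega_last = np.zeros(T)
omega_last[-1] = 1.0

decoded = decode(spikes, omega_rate)          # or decode(spikes, omega_last)
y_T = W_N @ np.concatenate([r_T, decoded])    # Eq. (2): readout of the hybrid output
print(y_T.shape)                              # (4,)
```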

Neuron models for HSTNN

The HSTNN facilitates the integration of various non-spiking RNN modules and spiking modules. In this work, we primarily instantiate two representative RNN modules, including vanilla RNN and LSTM, as well as two SNN modules, LIF and ALIF37, for constructing the HSTNN. The behaviours of a vanilla RNN module can be described by

$${{{\boldsymbol{r}}}}_{t}^{n}=\sigma \left({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,r}{{{\boldsymbol{y}}}}_{t}^{n-1}+{{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{n}{{{\boldsymbol{r}}}}_{t-1}^{n}\right)$$
(3)

where rn denotes the hidden state, \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,r}\) is the input weight matrix, \({{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{n}\) is the recurrent weight matrix, and σ( ⋅ ) is the sigmoid( ⋅ ) function. The LSTM module consists of an input gate, a forget gate, an output gate, a candidate state, and a continuous cell state, which can be formulated as

$$\left\{\begin{array}{l}{{{\boldsymbol{i}}}}_{t}^{n,r}=\sigma \left ({{{\boldsymbol{W}}}}_{{{\rm{in,i}}}}^{n,r}{{{\boldsymbol{y}}}}_{t}^{n-1}+{{{\boldsymbol{W}}}}_{{{\rm{rec,i}}}}^{n}{{{\boldsymbol{r}}}}_{t-1}^{n}\right)\\ {{{\boldsymbol{f}}}}_{\!t}^{n}=\sigma \left({{{\boldsymbol{W}}}}_{{{\rm{in,f}}}}^{n,r}{{{\boldsymbol{y}}}}_{t}^{n-1}+{{{\boldsymbol{W}}}}_{{{\rm{rec,f}}}}^{n}{{{\boldsymbol{r}}}}_{t-1}^{n}\right)\\ {{{\boldsymbol{o}}}}_{t}^{n}=\sigma \left({{{\boldsymbol{W}}}}_{{{\rm{in,o}}}}^{n,r}{{{\boldsymbol{y}}}}_{t}^{n-1}+{{{\boldsymbol{W}}}}_{{{\rm{rec,o}}}}^{n}{{{\boldsymbol{r}}}}_{t-1}^{n}\right)\\ {{{\boldsymbol{g}}}}_{t}^{n}=\phi \left({{{\boldsymbol{W}}}}_{{{\rm{in,g}}}}^{n,r}{{{\boldsymbol{y}}}}_{t}^{n-1}+{{{\boldsymbol{W}}}}_{{{\rm{rec,g}}}}^{n}{{{\boldsymbol{r}}}}_{t-1}^{n}\right) \hfill \\ {{{\boldsymbol{c}}}}_{t}^{n}={{{\boldsymbol{c}}}}_{t-1}^{n}\odot {{{\boldsymbol{f}}}}_{\!t}^{n}+{{{\boldsymbol{i}}}}_{t}^{n}\odot {{{\boldsymbol{g}}}}_{t}^{n} \hfill \\ {{{\boldsymbol{r}}}}_{t}^{n}={{{\boldsymbol{o}}}}_{t}^{n}\odot \phi \left({{{\boldsymbol{c}}}}_{t}^{n}\right) \hfill \end{array}\right.$$
(4)

where i, f, and o denote the states of the input gate, the forget gate, and the output gate, respectively. g, c, and r denote the candidate state, the cell state, and the hidden state, respectively. ϕ( ⋅ ) denotes the \(\tanh (\cdot )\) function and  ⊙ is the Hadamard product. The LIF neuron simultaneously receives signals from the previous layer and the current layer to update its membrane potential u. When a neuron’s membrane potential ui exceeds the firing threshold uth, the neuron fires a spike si and resets its membrane potential to u0. The behaviours of the LIF module can be written as

$$\left\{\begin{array}{l}\tau \frac{d{{{\boldsymbol{u}}}}^{n}(t)}{dt}=-{{{\boldsymbol{u}}}}^{n}(t)+{{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,s}{{{\boldsymbol{y}}}}^{n-1}(t)\\ \left\{\begin{array}{l}{s}_{i}^{n}(t)=1\,\&\,{u}_{i}^{n}(t)={u}_{0},\quad {{\rm{if}}}\,{u}_{i}^{n}(t)\ge {u}_{th} \hfill \\ {s}_{i}^{n}(t)=0,\qquad {{\rm{otherwise}}}\hfill \end{array}\right.\end{array}\right.$$
(5)

where \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,s}\) is the input weight matrix. To make the continuous neural dynamics more amenable to gradient-descent learning, we further convert Eq. (5) into an explicitly iterative form8:

$$\left\{\begin{array}{l}{{{\boldsymbol{u}}}}_{t}^{n}={e}^{-\frac{dt}{\tau }}{{{\boldsymbol{u}}}}_{t-1}^{n}\odot \left(1-{{{\boldsymbol{s}}}}_{t-1}^{n}\right)+{{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,s}{{{\boldsymbol{y}}}}_{t}^{n-1}\\ {{{\boldsymbol{s}}}}_{t}^{n}={{\rm{H}}}\left({{{\boldsymbol{u}}}}_{t}^{n}-{u}_{th}\right),\hfill \end{array}\right.$$
(6)

where H( ⋅ ) is the Heaviside function that satisfies H(x) = 1 when x ≥ 0 and H(x) = 0 otherwise. Here we assume u0 = 0.
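For concreteness, the explicitly iterative LIF dynamics of Eq. (6) can be simulated directly; the NumPy sketch below uses toy sizes and illustrative hyper-parameter values rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_out = 20, 16, 8
u_th, decay = 1.0, np.exp(-1.0)        # firing threshold and e^{-dt/tau} (illustrative values)

W_in = 0.5 * rng.standard_normal((n_out, n_in))   # W_in^{n,s}
y_prev = rng.random((T, n_in))                    # inputs y_t^{n-1} from the previous layer

u = np.zeros(n_out)                               # membrane potential, with u_0 = 0
s = np.zeros(n_out)                               # spikes from the previous step
spike_train = np.zeros((T, n_out))
for t in range(T):
    # Eq. (6): leak, reset-by-gating with the previous spikes, then integrate the input.
    u = decay * u * (1.0 - s) + W_in @ y_prev[t]
    s = (u >= u_th).astype(float)                 # Heaviside firing condition H(u - u_th)
    spike_train[t] = s

print(spike_train.mean())                          # average firing rate of the population
```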

Unlike the LIF neuron, which has a fixed firing threshold, the ALIF neuron introduces adaptive thresholds. The evolution of the firing thresholds of the ALIF neuron, η, can be described as

$$\left\{\begin{array}{l}{{{\boldsymbol{b}}}}_{t}^{n}={\rho }^{n}{{{\boldsymbol{b}}}}_{t-1}^{n}+(1-{\rho }^{n}){{{\boldsymbol{s}}}}_{t}^{n}\\ {{{\boldsymbol{\eta }}}}_{t}^{n}=0.1+{\alpha }^{n}{{{\boldsymbol{b}}}}_{t}^{n},\hfill \end{array}\right.$$
(7)

where ρn denotes the learnable parameters that control the update rate of the adaptive thresholds. The parameter αn is a constant that controls the magnitude of the threshold adaptation and is set to 0.2 by default.
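The adaptive threshold of Eq. (7) simply layers a spike-driven trace on top of the LIF update. The small sketch below is our own illustration; ρ is a learned parameter in the HSTNN but is fixed here, and we assume the threshold at step t is computed from the trace of past spikes.

```python
import numpy as np

n, T = 8, 50
rho, alpha = 0.9, 0.2                  # threshold update rate (learned in practice) and adaptation size
b = np.zeros(n)                        # running trace of recent spiking activity
u = np.zeros(n)
rng = np.random.default_rng(2)

for t in range(T):
    u = 0.8 * u + rng.random(n)        # a stand-in membrane update (see the LIF sketch above)
    eta = 0.1 + alpha * b              # Eq. (7): adaptive threshold, here using the trace of past spikes
    s = (u >= eta).astype(float)
    u = u * (1.0 - s)                  # reset fired neurons
    b = rho * b + (1.0 - rho) * s      # Eq. (7): the trace follows recent spikes
```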

Details of the three-stage learning for HSTNN

We develop a three-stage learning methodology to create the HSTNN progressively, including Adaptation, Selection, and Restoration stages.

Adaptation stage

To learn the optimal hybrid connections, the Adaptation stage expands each hybrid layer with two redundant neuron populations. In particular, to generate a hybrid layer with M neurons, it first introduces an SNN pool with M neurons and an RNN pool with M neurons. Each pool follows its respective dynamics, and the different types of signals are mixed by Eq. 1 before being sent to the next layer. The synaptic weights are trained by the BPTT learning algorithm, producing a better starting point for the following Selection stage. In this way, the Adaptation stage provides greater flexibility in exploring the hybrid structure and integrates the distinct dynamic behaviours of RNNs and SNNs into a unified optimization framework.

Selection stage

To select the optimal structure from the abundant pools, the Selection stage identifies and ranks the importance of neurons.

The neuronal importance is evaluated by aggregating the importance scores of its afferent weights. The weight score is based on a classical parameter saliency measure16,38, which assesses the saliency of a parameter by calculating the smallest change of the loss function ΔL caused by perturbing that parameter.

Next, we formulate the smallest change of the loss function caused by the perturbation as an optimization problem16 and employ a neuron-wise pruning strategy to adapt the saliency measure to the hybrid model. A key relationship exists between parameter perturbation and neuron pruning: pruning an unimportant neuron can be formalized as perturbing the model such that all weights connecting to the unimportant neuron become zero (i.e., Δw = −w, where Δw denotes the weight perturbation). On this basis, the change ΔL, expressed in Taylor expansion form, is governed by

$${{\Delta }}L=\, {\nabla }_{w}L{(w)}^{T}{{\Delta }}w \,+\frac{1}{2}{{\Delta }}{w}^{T}H{{\Delta }}w \,+\, O(\parallel {{\Delta }}w{\parallel }^{3}).$$
(8)

Based on the OBS16, we assume that a trained neural network model (i.e., the HSTNN established by the Adaptation stage) has converged to a local minimum of the loss function L, where the gradient yields ∇wL(w) = 0, and the Hessian matrix H is positive semidefinite. Thus, ΔL can be primarily associated with the second-order term containing the Hessian matrix ΔwTHΔw.

We then formulate the process of finding the smallest change of the loss function ΔL while removing the specific weight parameter wu as an optimization problem:

$$ \mathop{\min }_{{{\Delta }}w}\frac{1}{2}{{\Delta }}{w}^{T}H{{\Delta }}w=\frac{1}{2}{\left(\begin{array}{r}{{\Delta }}{w}_{u}\\ {{\Delta }}{w}_{i}\end{array}\right)}^{T}\left(\begin{array}{rc}{H}_{u,u}&{H}_{u,i}\\ {H}_{i,u}&{H}_{i,i}\end{array}\right)\left(\begin{array}{r}{{\Delta }}{w}_{u}\\ {{\Delta }}{w}_{i}\end{array}\right)\,\\ {{\rm{s.t.}}}\,{{\Delta }}{w}_{u}+{w}_{u}=0,$$
(9)

where wu and wi denote the weight groups of unimportant and important neurons, respectively. The Hessian matrix H can be further written as a block matrix. Given the importance of wu is measured by how its removal influences the smallest change in the loss function, we set Δwu = −wu. Solving the above optimization problem using the Lagrangian method yields:

$${H}_{i,i}{{\Delta }}{w}_{i}+{H}_{i,u}{{\Delta }}{w}_{u}=0.$$
(10)

Substituting Δwi by \({{\Delta }}{w}_{i}={H}_{i,i}^{-1}{H}_{i,u}{w}_{u}\) results in a solution16 to Eq. (9):

$$\frac{1}{2}{{\Delta }}{w}^{T}H{{\Delta }}w=\frac{1}{2}{w}_{u}^{T}\big({H}_{u,u}-{H}_{u,i}{H}_{i,i}^{-1}{H}_{i,u}\big){w}_{u}.$$
(11)

By employing the Schur complement of the inverse matrix, we have

$$\frac{1}{2}{w}_{u}^{T}\big({H}_{u,u}-{H}_{u,i}{H}_{i,i}^{-1}{H}_{i,u}\big){w}_{u}=\frac{1}{2}{w}_{u}^{T}{\big[{H}^{-1}\big]}_{u,u}^{-1}{w}_{u}.$$
(12)

The original OBS requires estimating the perturbation for every parameter separately and calculating the inverse of the Hessian matrix, which leads to an intolerable computational cost in large-scale neural networks. Instead, we focus on evaluating the comprehensive impact of the group of afferent weights connecting to the same neuron. Therefore, we develop a neuron-wise strategy based on the structural pruning method17. Specifically, we first group all weight parameters connecting to a specific output neuron and compute the corresponding perturbation when this group of weights is pruned. Additionally, following previous studies17,38,39, we assume that the main saliency features are contained within the diagonal blocks. The Hessian matrix H can therefore be approximated by a block-diagonal matrix in which each block is further approximated by a scaled identity given by its average diagonal entry:

$$\frac{1}{2}{w}_{u}^{T}{\big[{H}^{-1}\big]}_{u,u}^{-1}{w}_{u} \, \approx \, \frac{1}{2}{w}_{u}^{T}{H}_{u,u}{w}_{u}\, \approx \, \frac{1}{2}{w}_{u}^{T}\frac{{{\rm{Tr}}}({H}_{u,u})}{u}{w}_{u} \,=\, \frac{{{\rm{Tr}}}({H}_{u,u})}{2u}\parallel {w}_{u}{\parallel }_{2}^{2}$$
(13)

where Tr(Hu,u) denotes the trace of the diagonal Hessian block of the unimportant group. The above equation avoids computing the inverse of the Hessian matrix by using the trace Tr(Hu,u) instead. Furthermore, we employ the Hutchinson method40,41 to calculate the trace, which uses stochastic probe vectors to efficiently estimate the Hessian trace (see Eqs 6, 7 in reference41 for implementation details).
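For readers unfamiliar with the Hutchinson estimator, the PyTorch sketch below estimates the diagonal of the Hessian of a toy loss through Hessian-vector products with Rademacher probe vectors; summing the estimated diagonal entries over one neuron's group of afferent weights then gives the Tr(Hu,u) term of Eq. (13). The toy model, loss, and probe count are assumptions of ours, not the configuration used in this work.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 4))
x, y = torch.randn(32, 20), torch.randn(32, 4)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)

n_probes = 16
diag_est = [torch.zeros_like(p) for p in params]
for _ in range(n_probes):
    # Rademacher probe vectors v with entries +1/-1, one per parameter tensor.
    vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
    # Hessian-vector products Hv, obtained by differentiating sum(g * v).
    hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
    for d, v, hv in zip(diag_est, vs, hvs):
        d += v * hv / n_probes           # E[v * Hv] approximates diag(H)

# Estimated Tr(H_{u,u}) for the group of afferent weights of hidden neuron 0 (first layer):
trace_block = diag_est[0][0].sum().item()
print(trace_block)
```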

Consequently, the neuronal importance score, defined by the smallest perturbation above, can be evaluated via the Hessian trace estimation with only a moderate computational cost. Considering the distinct neuronal dynamics and representation manners of RNNs and SNNs, we rank the importance scores of spiking and non-spiking neurons separately. Specifically, we collect the neurons of the same type from all layers and sort them uniformly according to their importance scores. Given the ranking results, we select a certain percentage of neurons from each pool as important neurons according to the predefined SNN ratio.
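The grouping selection itself reduces to a global ranking within each neuron type. The NumPy sketch below (with our own function and variable names) keeps the top-scoring spiking and non-spiking neurons across all layers according to a target SNN ratio and the total number of hidden neurons to retain.

```python
import numpy as np

def select_neurons(scores_snn, scores_rnn, total_kept, snn_ratio):
    """Rank each neuron type globally across layers and keep the top ones.

    scores_snn / scores_rnn: dict mapping layer index -> 1-D array of
    per-neuron importance scores (e.g., the Eq. (13) saliency of each
    neuron's afferent-weight group). Returns, per type, a dict of kept
    neuron indices for every layer. Names and data layout are illustrative.
    """
    n_snn = int(round(total_kept * snn_ratio))
    n_rnn = total_kept - n_snn

    def top_indices(scores_by_layer, k):
        flat = [(s, layer, i) for layer, arr in scores_by_layer.items()
                for i, s in enumerate(arr)]
        flat.sort(key=lambda t: t[0], reverse=True)          # global ranking within one type
        kept = {layer: [] for layer in scores_by_layer}
        for _, layer, i in flat[:k]:
            kept[layer].append(i)
        return {layer: sorted(idx) for layer, idx in kept.items()}

    return top_indices(scores_snn, n_snn), top_indices(scores_rnn, n_rnn)

# Toy usage: two hidden layers, 4 candidate neurons per pool, keep 4 neurons in total.
rng = np.random.default_rng(3)
snn_scores = {1: rng.random(4), 2: rng.random(4)}
rnn_scores = {1: rng.random(4), 2: rng.random(4)}
keep_snn, keep_rnn = select_neurons(snn_scores, rnn_scores, total_kept=4, snn_ratio=0.5)
print(keep_snn, keep_rnn)
```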

Restoration stage

Given the ranking results, the Restoration stage further prunes redundant neurons and their inactive connections and fine-tunes the resulting compact network. To this end, we create the corresponding binary mask matrices for the specified input weight connections, \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,r}\) and \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,s}\), and for the recurrent weight connections of the RNNs, \({{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{n}\), based on the indices of the selected neurons.

Formally, let the total number of neurons in layer n be ln, the index set of selected artificial neurons be \(R(n):=\{{i}_{1}^{n},...,\, {i}_{k}^{n}\}\), and the index set of selected spiking neurons be \(S(n):=\{{j}_{1}^{n},...,\, {j}_{{k}^{{\prime} }}^{n}\}\). Let 1i be a column unit vector with the i-th element being 1. The sizes of R(n) and S(n) are denoted as ∣R(n)∣ and ∣S(n)∣, respectively. The mask matrix \({{{\boldsymbol{m}}}}^{n,r}\in {{\mathbb{R}}}^{{l}_{n-1}\times | R(n)| }\) for non-spiking neurons can be formalized using unit vectors, with the indices in R(n) arranged in ascending order:

$${{{\boldsymbol{m}}}}^{n,r}=[{{{\boldsymbol{1}}}}_{{i}_{1}^{n}},...,\, {{{\boldsymbol{1}}}}_{{i}_{k}^{n}}],\quad {i}_{1}^{n},...,\, {i}_{k}^{n}\in R(n).$$
(14)

Similarly, the mask matrix \({{{\boldsymbol{m}}}}^{n,s}\in {{\mathbb{R}}}^{{l}_{n-1}\times | S(n)| }\) for spiking neurons is given by

$${{{\boldsymbol{m}}}}^{n,s}=[{{{\boldsymbol{1}}}}_{{j}_{1}^{n}},...,\, {{{\boldsymbol{1}}}}_{{j}_{{k}^{{\prime} }}^{n}}],\quad {j}_{1}^{n},...,\, {j}_{{k}^{{\prime} }}^{n}\in S(n).$$
(15)

We can then derive the concatenated mask matrix \({{{\boldsymbol{m}}}}^{n-1}\in {{\mathbb{R}}}^{{l}_{n-1}\times (| S(n-1)|+| R(n-1)| )}\) for both \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,r}\) and \({{{\boldsymbol{W}}}}_{in}^{n,s}\)

$${{{\boldsymbol{m}}}}^{n-1}=\left(\begin{array}{rc}{{{\boldsymbol{m}}}}^{n-1,s}&0\\ 0&{{{\boldsymbol{m}}}}^{n-1,r}\end{array}\right).$$
(16)

Given mn,r, mn,s, and mn−1, we derive the shrunk weights, \({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{{\prime} n,r},\, {{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{{\prime} n,s},\, {{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{{\prime} n}\), of the n-th layer after the Restoration stage by

$$\left\{\begin{array}{l}{{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{{\prime} n,r}\leftarrow {({{{\boldsymbol{m}}}}^{n,r})}^{T}{{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,r}{{{\boldsymbol{m}}}}^{n-1}\quad \\ {{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{{\prime} n,s}\leftarrow {({{{\boldsymbol{m}}}}^{n,s})}^{T}{{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n,s}{{{\boldsymbol{m}}}}^{n-1}\quad \\ {{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{{\prime} n\,\,}\leftarrow {({{{\boldsymbol{m}}}}^{n,r})}^{T}{{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{n}{{{\boldsymbol{m}}}}^{n,r}\quad \end{array}\right..$$
(17)

After that, we retrain the final compact HSTNN to fine-tune the parameters.
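In code, the masking of Eqs. (14)–(17) amounts to selecting rows and columns of the trained weight matrices. The NumPy sketch below builds the unit-vector masks for one hybrid layer and applies Eq. (17); the pool sizes, selected index sets, and the convention that each input weight matrix maps the previous layer's full mixed output to the full candidate pool are toy assumptions of ours.

```python
import numpy as np

def unit_mask(pool_size, selected):
    """Columns are the unit vectors 1_i for the selected neuron indices (Eqs. 14-15)."""
    m = np.zeros((pool_size, len(selected)))
    for col, i in enumerate(sorted(selected)):
        m[i, col] = 1.0
    return m

# Toy candidate pool sizes: layer n-1 feeds layer n through the concatenated mixed output.
snn_prev, rnn_prev, snn_cur, rnn_cur = 6, 6, 6, 6
W_in_r = np.random.randn(rnn_cur, snn_prev + rnn_prev)   # W_in^{n,r}
W_in_s = np.random.randn(snn_cur, snn_prev + rnn_prev)   # W_in^{n,s}
W_rec  = np.random.randn(rnn_cur, rnn_cur)               # W_rec^{n}

S_prev, R_prev = [0, 2, 5], [1, 3]        # selected spiking / non-spiking indices, layer n-1
S_cur,  R_cur  = [1, 4],    [0, 2, 3]     # selected indices, layer n

# Eq. (16): block-diagonal concatenated mask over the previous layer's mixed output.
m_prev = np.block([[unit_mask(snn_prev, S_prev), np.zeros((snn_prev, len(R_prev)))],
                   [np.zeros((rnn_prev, len(S_prev))), unit_mask(rnn_prev, R_prev)]])
m_r, m_s = unit_mask(rnn_cur, R_cur), unit_mask(snn_cur, S_cur)

# Eq. (17): shrink the weights to the selected neurons only.
W_in_r_c = m_r.T @ W_in_r @ m_prev
W_in_s_c = m_s.T @ W_in_s @ m_prev
W_rec_c  = m_r.T @ W_rec  @ m_r
print(W_in_r_c.shape, W_in_s_c.shape, W_rec_c.shape)     # (3, 5) (2, 5) (3, 3)
```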

Details of the learning algorithm for HSTNN

BPTT4 is a powerful learning algorithm for RNNs and has recently been adapted to train SNNs by addressing the convergence problem and the non-differentiability of spiking activities8,12,29. The training approaches for RNNs and SNNs share several core features, including the backpropagation of gradients through the spatial (layer-wise) and temporal (time-step-wise) dimensions, and the subsequent update of parameters based on these gradients across all time steps. Given these similarities, we employ a unified BPTT methodology, incorporating the surrogate function for spiking activities, to train the HSTNN. We introduce the notation δ for the gradient of the loss function L with respect to a variable, for example, \(\delta o=\frac{\partial L}{\partial o}\). For a vanilla RNN module, we have

$$\delta {{{\boldsymbol{r}}}}_{t}^{n}={\big({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n+1,r}\big)}^{T}\big({\sigma }^{{\prime} }\odot \delta {{{\boldsymbol{r}}}}_{t}^{n+1}\big)+{\big({{{\boldsymbol{W}}}}_{{{\rm{rec}}}}^{n}\big)}^{T}\big({\sigma }^{{\prime} }\odot \delta {{{\boldsymbol{r}}}}_{t+1}^{n}\big)$$
(18)

where \({\sigma }^{{\prime} }\) represents the gradient of the activation function. For the LIF-based SNN module, we have

$$\left\{\begin{array}{l}\delta {{{\boldsymbol{s}}}}_{t}^{n}={\left({{{\boldsymbol{W}}}}_{{{\rm{in}}}}^{n+1,s}\right)}^{T}\delta {{{\boldsymbol{u}}}}_{t}^{n+1}-{e}^{-\frac{dt}{\tau }}{{{\boldsymbol{u}}}}_{t}^{n}\odot \delta {{{\boldsymbol{u}}}}_{t+1}^{n}\quad \\ \delta {{{\boldsymbol{u}}}}_{t}^{n}={{{\rm{H}}}}^{{\prime} }\odot \delta {{{\boldsymbol{s}}}}_{t}^{n}+{e}^{-\frac{dt}{\tau }}\big(1-{{{\boldsymbol{s}}}}_{t}^{n}\big)\odot \delta {{{\boldsymbol{u}}}}_{t+1}^{n}\quad \end{array}\right.$$
(19)

where \({{{\rm{H}}}}^{{\prime} }\) is the gradient of the Heaviside function, which is non-differentiable. To address this problem, we use a surrogate function to approximate its gradient29. An empirical analysis of the effect of specific surrogate function formats on the Hessian trace, which is used in the Selection stage, is provided in Supplementary Fig. 2. The gradient expressions for more complex neuronal modules are similar to those in Eqs. 18 and 19 and thus are omitted here for clarity.
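For illustration, a minimal PyTorch sketch of one possible surrogate for \({{{\rm{H}}}}^{{\prime} }\) is given below; the rectangular window is only one of several common choices, and the threshold and width values are placeholders rather than the settings used in our experiments.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside firing in the forward pass, rectangular surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, u, threshold=1.0, width=1.0):
        ctx.save_for_backward(u)
        ctx.threshold, ctx.width = threshold, width
        return (u >= threshold).float()                     # binary spike output

    @staticmethod
    def backward(ctx, grad_spike):
        u, = ctx.saved_tensors
        # approximate H'(u) by 1/width inside a window centered at the threshold
        surrogate = ((u - ctx.threshold).abs() < ctx.width / 2).float() / ctx.width
        return grad_spike * surrogate, None, None           # no gradients for the hyperparameters

spike_fn = SurrogateSpike.apply   # drop-in replacement for the non-differentiable Heaviside step
```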

Details of parameter configurations and comprehensive model evaluation

We used consistent network structures for SNNs, RNNs, directly-hybrid models, and HSTNNs in Fig. 2. On the N-MNIST, S-MNIST, and PTB datasets, the network structures [input-800-800-10], [input-800-800-10], and [input-650-650-10,000] were employed, respectively, to compare the task performance of different models. On these three datasets, the HSTNNs were built from vanilla RNN and LIF models. Note that, throughout our experiments, the network structure consistently refers to the structure after the three-stage learning. By default, equal numbers of spiking and non-spiking neurons were used in the Selection stage unless stated otherwise. On DVS-Gesture, the network structure [input-128C3-AP2-256C3-AP2-384C3-AP2-256-11] was adopted, using recurrent convolutional neural network (RCNN) and LIF-based spiking convolutional neural network (SCNN) models for HSTNN construction. The implementations of RCNN and SCNN followed the formulations in Eqs. 3–6 but with the simple weighted-sum operation replaced by the convolutional operation. For RCNN and SCNN, we adapted the selection process with a structural grouping strategy that selects the most important output feature map channels based on the cumulative importance scores of all neurons within the same feature map. The selected feature maps are then retained to create a reduced network structure for retraining in the Restoration stage. The SGD optimizer was chosen for the PTB dataset, while Adam was used for the S-MNIST, N-MNIST, and DVS-Gesture datasets. Detailed parameter configurations are provided in Supplementary Table 1.
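The structural grouping step for convolutional layers can be sketched as follows; the per-neuron importance scores are assumed to be given (in our method they come from the Selection stage criterion), and the array layout is an illustrative assumption.

```python
import numpy as np

def select_feature_maps(importance, num_keep):
    """Rank output channels by the cumulative importance of the neurons they contain.

    importance: per-neuron scores with shape (channels, height, width)  -- assumed layout
    num_keep:   number of output feature maps to retain for the reduced structure
    """
    channel_scores = importance.reshape(importance.shape[0], -1).sum(axis=1)
    kept = np.argsort(channel_scores)[::-1][:num_keep]   # highest cumulative score first
    return np.sort(kept)                                 # channel indices kept for retraining
```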

We employed consistent loss functions across the three learning stages on all datasets. For the language modeling task, we utilized a cross-entropy-based loss function, which can be formalized by

$${{\rm{L}}}=-\frac{1}{T}{\sum }_{t=1}^{T}{\sum }_{i=1}^{{l}_{N}}{{{\boldsymbol{g}}}}_{t,i}\log \left({\hat{{{\boldsymbol{y}}}}}_{t,i}^{N}\right)$$
(20)

where \({{{\boldsymbol{g}}}}_{t}\) is a one-hot vector that denotes the ground-truth distribution over the vocabulary and \({\hat{{{\boldsymbol{y}}}}}_{t}^{N}={{\rm{softmax}}}({{{\boldsymbol{y}}}}_{t}^{N})\) denotes the predicted distribution at the t-th time step. The most recent spiking temporal information was used in Eq. 2 for computing \({{{\boldsymbol{y}}}}_{t,i}^{N}\). For S-MNIST, a similar cross-entropy loss was used:

$${{\rm{L}}}=-{\sum }_{i=1}^{{l}_{N}}{{{\boldsymbol{y}}}}_{i}\log \left({{{\boldsymbol{y}}}}_{T,i}^{N}\right),$$
(21)

where rate coding was used in Eq. 2 for computing \({{{\boldsymbol{y}}}}_{T,i}^{N}\).

For classification tasks on neuromorphic datasets including N-MNIST and DVS-Gesture, we used the Mean Squared Error loss function:

$${{\rm{L}}}=\frac{1}{{l}_{N}}{\sum }_{i=1}^{{l}_{N}}{\left({{{\boldsymbol{y}}}}_{i}-{{{\boldsymbol{y}}}}_{T,i}^{N}\right)}^{2},$$
(22)

where \({l}_{N}\) denotes the number of neurons in layer N and rate coding was used for computing \({{{\boldsymbol{y}}}}_{T,i}^{N}\).
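A compact PyTorch sketch of how the rate-coded readout and the losses of Eqs. 21 and 22 could be evaluated is shown below; it uses the standard softmax-based cross-entropy for numerical stability, which slightly differs from the literal form of Eq. 21, and all tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def rate_coded_readout(outputs):
    """outputs: (T, batch, l_N) per-time-step readout; rate coding averages over time."""
    return outputs.mean(dim=0)

def smnist_loss(outputs, labels):
    """Cross-entropy in the spirit of Eq. 21 (softmax applied for numerical stability)."""
    return F.cross_entropy(rate_coded_readout(outputs), labels)

def nmnist_loss(outputs, labels, l_N):
    """Mean squared error of Eq. 22 against one-hot targets."""
    target = F.one_hot(labels, num_classes=l_N).float()
    return F.mse_loss(rate_coded_readout(outputs), target)
```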

The computational cost was evaluated at the operation level. For a vanilla RNN module with \({M}_{i}\) input neurons and \({M}_{o}\) output neurons, the computational cost can be estimated as:

$${C}_{RNN}\approx \big({M}_{i}{M}_{o}({C}_{mul}+{C}_{add})+{M}_{o}{M}_{o}({C}_{mul}+{C}_{add})\big)T=\big(({M}_{i}{M}_{o}+{M}_{o}{M}_{o})({C}_{mul}+{C}_{add})\big)T$$
(23)

where \({C}_{mul}\) and \({C}_{add}\) denote the basic computational costs of a multiplication operation and an addition operation, respectively, and T denotes the number of time steps. To provide an intuitive and concise comparison, we mainly estimated the computational cost of matrix operations, which have the greatest impact on hardware execution energy, and ignored vector and scalar operations. In our implementation of the LIF-based SNN module, there is no recurrent matrix computation, and the multiplication operations can be replaced with sparse accumulation operations owing to the binary spike format. We thereby evaluated the cost of a LIF-based SNN module by

$${C}_{SNN} \, \approx \, s{C}_{add}{M}_{i}{M}_{o}T$$
(24)

where s denotes the average spike rate during the entire inference stage (normalized within [0, 1]). As with RNNs, the cost of vector and scalar operations is omitted for clarity. Since a growing number of neuromorphic chips11,28 efficiently support hybrid execution of non-spiking and spiking computation, the computational cost of a hybrid layer can be derived from the results of the single-paradigm RNN and SNN modules. Assuming that there are \({M}_{i1}\) non-spiking inputs, \({M}_{i2}\) spiking inputs, \({M}_{o1}\) RNN output neurons, and \({M}_{o2}\) SNN output neurons, the computational cost of an HSTNN layer is given by

$${C}_{HSTNN}\approx \big[({M}_{i1}({M}_{o1}+{M}_{o2})+{M}_{o1}{M}_{o1})({C}_{mul}+{C}_{add})+s{M}_{i2}({M}_{o1}+{M}_{o2}){C}_{add}\big]T$$
(25)

where \({C}_{HSTNN}\) is smaller than \({C}_{RNN}\) because the inserted SNN part incurs a much lower computational cost. The estimation for more complicated neuron models is similar, requiring only additional matrix-operation terms, and is omitted for clarity.
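The following Python sketch evaluates Eqs. 23–25 directly; the unit costs \({C}_{mul}={C}_{add}=1\) are placeholders, since the real energy per operation depends on the hardware.

```python
def rnn_ops(M_i, M_o, T, c_mul=1.0, c_add=1.0):
    """Eq. 23: input and recurrent matrix operations of a vanilla RNN layer."""
    return (M_i * M_o + M_o * M_o) * (c_mul + c_add) * T

def snn_ops(M_i, M_o, T, spike_rate, c_add=1.0):
    """Eq. 24: sparse accumulations of a LIF layer without recurrent weights."""
    return spike_rate * c_add * M_i * M_o * T

def hstnn_ops(M_i1, M_i2, M_o1, M_o2, T, spike_rate, c_mul=1.0, c_add=1.0):
    """Eq. 25: hybrid layer with non-spiking parts (M_i1, M_o1) and spiking parts (M_i2, M_o2)."""
    dense = (M_i1 * (M_o1 + M_o2) + M_o1 * M_o1) * (c_mul + c_add)
    sparse = spike_rate * M_i2 * (M_o1 + M_o2) * c_add
    return (dense + sparse) * T

# Example: an 800-neuron layer at SNN ratio 0.5, spike rate 0.1, and T = 10 time steps
# gives hstnn_ops(400, 400, 400, 400, 10, 0.1), which is well below rnn_ops(800, 800, 10).
```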

Details of the experimental setup for the robustness evaluation

HSTNNs were constructed using the optimal SNN ratios reported in Fig. 2 for comparison: 0.25, 0.95, and 0.75 for the S-MNIST, N-MNIST, and DVS-Gesture datasets, respectively. All models were trained on the standard training sets and evaluated on preprocessed testing sets. Three types of model robustness were evaluated: random noise robustness, frame-loss robustness, and adversarial attack robustness. On S-MNIST, the network structure [input-400-400-10] was employed. For N-MNIST and DVS-Gesture, the same structures as those used in Fig. 2 were employed in the comparison.

In Fig. 3b, for the random noise robustness, we added Gaussian noise with zero mean and a standard deviation of 0.05 to each testing sample of S-MNIST, and added salt-and-pepper noise to each testing sample of N-MNIST with a probability of 0.1. For the frame-loss robustness, we randomly masked the information in each frame of the testing sample with a probability of 0.1. For the adversarial attack robustness, we generated an untargeted adversarial sample (\({{{\boldsymbol{x}}}}^{{\prime} }\)) by adding an imperceptible perturbation (δ) to the raw testing sample (x)42; an illustrative sketch of these perturbations is given at the end of this subsection. The perturbation can be defined by

$${{{\rm{argmin}}}}_{{{\boldsymbol{\delta }}}}{\parallel {{\boldsymbol{\delta }}}\parallel }_{2},\quad s.t.,\, f({{\boldsymbol{x}}}+{{\boldsymbol{\delta }}})\, \ne \,f({{\boldsymbol{x}}}),$$
(26)

where f(x) generally refers to the output of the victim model. To solve the above optimization problem, we followed the prior work42 and took an iterative strategy to calculate the gradient with respect to the spike input sample (xs) as follows:

$$\delta {{{\boldsymbol{s}}}}_{i}={\nabla }_{{{{\boldsymbol{xs}}}}_{i}}L(\theta,\,{{{\boldsymbol{xs}}}}_{i},\,{{{\boldsymbol{y}}}}_{original}),$$
(27)

where \(\delta {{{\boldsymbol{s}}}}_{i}\) represents the input gradient at the i-th iteration. Since the elements in \(\delta {{{\boldsymbol{s}}}}_{i}\) are continuous values, to generate the spike-based adversarial input \({{{\boldsymbol{xs}}}}_{i}^{{\prime} }\), we used a two-stage method proposed by Liang et al.14, comprising gradient-to-spike (G2S) and restricted spike flipper (RSF). Specifically, the G2S technique converts the continuous gradient into a ternary one (i.e., { − 1, 0, 1}) via probabilistic sampling from the normalized version of \(\delta {{{\boldsymbol{s}}}}_{i}\):

$$\delta {{{\boldsymbol{s}}}}_{i}^{{\prime} }={{\rm{sign}}}({{{\boldsymbol{\delta }}}}_{mask}\odot \delta {{{\boldsymbol{s}}}}_{i}),\quad P({{{\boldsymbol{\delta }}}}_{mask}=1)={{\rm{norm}}}(| \delta {{{\boldsymbol{s}}}}_{i}| ),\quad P({{{\boldsymbol{\delta }}}}_{mask}=0)=1-{{\rm{norm}}}(| \delta {{{\boldsymbol{s}}}}_{i}| ),$$
(28)

where \({{{\boldsymbol{\delta }}}}_{mask}\) is a binary mask and norm( ⋅ ) is a scaling normalization function that maps each element into the range [0, 1]. An overflow-aware transformation was then utilized to keep the resulting \({{{\boldsymbol{xs}}}}_{i}^{{\prime} }\) within the binary spike domain {0, 1}. The entire G2S process can be described as

$${{{\boldsymbol{xs}}}}_{i}^{{\prime} }={{\rm{transform}}}(\delta {{{\boldsymbol{s}}}}_{i}^{{\prime} },\,{{{\boldsymbol{xs}}}}_{i}).$$
(29)

The RSF technique was used to address the gradient vanishing problem: when the input gradients are all zero, the spiking inputs are flipped randomly under a controlled turnover rate. We ran 20 iterations to generate each adversarial sample.
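For concreteness, the NumPy sketch below implements the three test-set perturbations described in this subsection; the element-wise masking for frame loss, the clipping used in place of the overflow-aware transform, and the turnover rate value are illustrative assumptions based on our reading of Liang et al.14, not the exact released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(x, std=0.05):
    """Additive zero-mean Gaussian noise used on S-MNIST testing samples."""
    return x + rng.normal(0.0, std, size=x.shape)

def salt_and_pepper(x, p=0.1):
    """Randomly reset each binary N-MNIST element to 0 or 1 with probability p."""
    flip = rng.random(x.shape) < p
    return np.where(flip, rng.integers(0, 2, size=x.shape), x)

def frame_loss(x, p=0.1):
    """Mask information within each frame with probability p (element-wise here); x: (T, ...)."""
    return x * (rng.random(x.shape) >= p)

def g2s(grad, xs):
    """Gradient-to-spike (Eqs. 28-29): ternarize the gradient, then keep the input binary."""
    mag = np.abs(grad)
    prob = mag / mag.max() if mag.max() > 0 else mag       # scaling normalization into [0, 1]
    mask = rng.random(grad.shape) < prob                    # probabilistic sampling of delta_mask
    ternary = np.sign(grad) * mask                          # values in {-1, 0, 1}
    return np.clip(xs + ternary, 0, 1)                      # clipping stands in for the overflow-aware transform

def rsf(xs, turnover=0.05):
    """Restricted spike flipper: random flips when the gradient vanishes everywhere."""
    flip = rng.random(xs.shape) < turnover
    return np.where(flip, 1 - xs, xs)
```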

Details of the experimental setup for the scalability evaluation

In Fig. 3d, we demonstrated the combinations of vRNN&LIF, vRNN&ALIF, and LSTM&LIF using the same network structure (i.e., [input-400-400-10]). For RCNN&SCNN, we applied the same network structure as in Fig. 2d. An optimal SNN ratio of 0.75, which yielded the best classification accuracy, was selected for constructing the HSTNNs. In Table 1, a network structure of [input-128C3-AP2-256C3-AP2-384C3-256-10] was used for N-MNIST, and a structure of [input-128C3-128C3-AP2-128C3-128C3-AP2-256C3-256C3-AP2-512C3-512C3-512C3-512C3-10] was employed for CIFAR10-DVS. An optimal SNN ratio of 0.875 was adopted for both N-MNIST and CIFAR10-DVS when constructing the HSTNNs. Other parameter settings and training details are provided in Supplementary Table 1.

Details of experiments on the robot place recognition

We conducted robot navigation in three different environments: an indoor environment with adequate lighting (env1), an outdoor environment with varying lighting conditions (env2), and an indoor environment with low lighting (env3). The robot traversed a predefined path six times in each environment, collecting event-based data using a DVS camera and frame-based data using an RGB camera. The path was divided into 100 segments representing distinct places, and the objective was to recognize the current scenario among these 100 classes. For data preprocessing, we utilized a pre-trained four-layer CNN and a pre-trained four-layer SCNN, as described in prior work30, to handle the inputs from the RGB and DVS cameras, respectively. The CNN, used for the RGB images, processed inputs of size 240 × 180 × 3 (three color channels). The SCNN processed event images with an input size of 240 × 180 pixels, incorporating both positive and negative polarity information. The parameters of both pre-trained models were fixed in our simulations. Outputs from the CNN and SCNN were combined and fed into a three-layer HSTNN with a network structure of [input-500-500-100]. Due to the different temporal resolutions of the DVS and RGB cameras, we used nine consecutive event images and three corresponding RGB images as one training sample. The HSTNN was then constructed through our three-stage hybrid approach, learning to recognize the correct place among the 100 candidates. The training process involved 150 epochs for the Selection stage and 100 epochs for the Restoration stage. We employed the Adam optimizer and a cross-entropy loss function for all three learning stages. To evaluate the computational cost, we analyzed the overall number of operations performed by the hybrid modules using Eqs. 23–25.
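The sketch below shows one way the pre-trained CNN and SCNN features could be aligned into a nine-step HSTNN training sample; the repetition of each RGB feature across three consecutive event frames is our illustrative assumption about the pairing, and the feature extractors are represented by placeholder callables.

```python
import numpy as np

def build_sample(event_images, rgb_images, scnn_features, cnn_features):
    """Combine 9 event images and 3 RGB images into per-time-step HSTNN inputs.

    scnn_features / cnn_features: placeholder callables standing in for the fixed,
    pre-trained SCNN and CNN feature extractors described above.
    """
    steps = []
    for t, ev in enumerate(event_images):                  # 9 consecutive event images
        rgb = rgb_images[t // 3]                            # each RGB image paired with 3 event frames (assumption)
        steps.append(np.concatenate([scnn_features(ev), cnn_features(rgb)]))
    return np.stack(steps)                                  # shape: (9, combined_feature_dim)
```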

Details of implementation on neuromorphic hardware

TianjicX is a hybrid neuromorphic chip that can flexibly allocate computing resources and schedule execution time for multiple neural network tasks, including both ANNs and SNNs34. However, this flexibility also complicates the deployment of neural networks. Below, we describe the mapping details for deploying HSTNNs on TianjicX from a top-down perspective.

At the network level, layers were first grouped and mapped to core groups, where the number of cores depends on the structure and the computational cost of each layer. In the experiment, we assigned a core group to each layer of the HSTNNs. Core groups can run in a pipelined manner on the TianjicX chip, as depicted in Supplementary Fig. 3b. The reported results were collected in the scenario of running a single sample. The layer-level mapping strategies are illustrated in Fig. 5b, and we applied different strategies to different layers according to the layer size. In the fixed-core mapping, a layer was mapped onto a core group containing a small, fixed number of cores dedicated to computing the RNN and SNN modules, respectively, with the workload of each core varying according to the SNN ratio. We used this strategy for small layers, such as those on S-MNIST, and fixed the number of cores for both the SNN and RNN modules to one, because partitioning a small layer cannot fully utilize the parallelism of multiple cores but introduces additional data transfer, which results in resource wastage and excessive power consumption. In contrast, larger layers can better utilize the resources of multiple cores, so we used more cores with a fixed workload each and allocated them to the SNN or RNN module according to the SNN ratio. The TianjicX chip supports a primitive instruction set that covers a wide range of operations; to perform the computation of each layer, we configured the primitive sequence for each core. The operations required by HSTNNs are listed in Supplementary Fig. 3c. We used 8-bit integers for the outputs of both the RNN and SNN populations, thus simplifying the output concatenation in each layer.
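The mapping decision described above can be summarized by a simple rule of thumb; the sketch below is only a schematic of that decision, with the layer-size threshold and core count invented for illustration and not reflecting the actual TianjicX toolchain.

```python
def map_layer(layer_size, snn_ratio, small_layer_threshold=512, cores_per_large_layer=8):
    """Schematic mapping rule: fixed-core for small layers, fixed-workload for large ones."""
    if layer_size <= small_layer_threshold:
        # small layers: one core each for the SNN and RNN modules; per-core workload follows the SNN ratio
        return {"strategy": "fixed-core", "snn_cores": 1, "rnn_cores": 1}
    # large layers: fixed workload per core, cores split between modules by the SNN ratio
    snn_cores = round(cores_per_large_layer * snn_ratio)
    return {"strategy": "fixed-workload",
            "snn_cores": snn_cores,
            "rnn_cores": cores_per_large_layer - snn_cores}
```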

Following the mapping steps above, we successfully implemented HSTNNs on the TianjicX development board (see Supplementary Fig. 3d). The execution latency and energy consumption results shown in Supplementary Fig. 3a validate the efficiency and flexibility of HSTNNs. It is worth mentioning that the execution latency of HSTNNs can be shorter than that of a sole SNN. For the networks on S-MNIST, we noticed that although a sole SNN (with an SNN ratio of 1) has the least computational workload, it does not achieve the minimal latency because it utilizes only a single core. Conversely, for hybrid models, as the SNN ratio decreases, the latency of the SNN core shortens and that of the RNN core lengthens. The total latency is the maximum of the latencies of the two cores and thus reaches its minimum when their latencies are equal. For the networks on N-MNIST, we found that the latencies at SNN ratios of 0.75 and 1 were almost identical, possibly because the small additional latency of the RNN module in the first layer at the SNN ratio of 0.75 is offset by data transfer latency, and because the off-chip measurement we adopted might introduce errors.
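The latency argument above can be made explicit with a toy model in which the SNN and RNN workloads scale with the SNN ratio and the layer latency is the maximum of the two core latencies; the per-unit speeds below are arbitrary illustrative numbers, not measured chip characteristics.

```python
def hybrid_layer_latency(total_work, snn_ratio, snn_speed=4.0, rnn_speed=1.0):
    """Latency of one pipelined hybrid layer: the slower of its SNN and RNN cores."""
    snn_latency = total_work * snn_ratio / snn_speed
    rnn_latency = total_work * (1.0 - snn_ratio) / rnn_speed
    return max(snn_latency, rnn_latency)

# The minimum is reached when the two core latencies are equal, i.e., when
# snn_ratio / snn_speed == (1 - snn_ratio) / rnn_speed, matching the observation above.
```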