1 Introduction

Gesture sensing technology allows users to interact with machines more easily and intuitively than conventional mouse and touch-screen systems [1], enabling applications in gaming, TVs, smart homes, and automotive. Gesture sensing is primarily dominated by vision-based systems, in which large data volumes obtained with camera sensors are processed using advanced computer vision techniques [2,3,4,5]. However, camera-based solutions raise privacy concerns because they operate on photos and videos. Furthermore, they require suitable lighting conditions and, in outdoor scenarios, suitable weather, which limits their applicability. Non-vision based solutions have been devised to overcome these constraints. In non-vision based systems, specialized sensors in the form of gloves or bands are commonly attached to human subjects to record hand motion, and gestures are recognized by analyzing the data received from these sensors [6,7,8,9,10]. Although these systems have been shown to overcome the limitations of vision-based systems, wearing such sensors provides a cumbersome experience to users.

To overcome the aforementioned limitations, contactless non-vision based systems are becoming more prevalent. Among them, radar-based technologies are favorable because they are insensitive to lighting conditions, invariant to occlusion of the hand, and possess a simple signal processing pipeline [11,12,13]. Additionally, owing to their ability to preserve privacy, work inside enclosures, and detect fine motions, radar has become a sensor of choice for gesture sensing applications. There are two key research directions in radar-based gesture sensing systems: one is developing efficient miniature hardware that can produce high-fidelity target data [14,15,16,17,18,19,20,21], and the other is the signal processing pipeline, mostly driven by deep learning techniques [22,23,24,25,26,27,28,29,30,31].

With the current adoption of radars in IoT devices, radar-based techniques have focused on energy efficiency, a key requirement in IoT [32,33,34]. Effective gesture identification has been demonstrated with a tiny radar chip incorporated into a mobile phone and a small CNN fitting resource-constrained CPUs [35]. TinyRadarNN [36] uses a 2D CNN paired with a temporal convolutional network (TCN) to detect gestures using range-frequency Doppler features on low-power wearable devices. Similarly, in [37], to improve robustness, additional features including range, Doppler, elevation and azimuth are fed to a CNN paired with recurrent neural networks (RNNs). In [38], to minimize power consumption and computational costs, a tiny CNN has been developed as an embedded solution for hand gesture recognition. Furthermore, to capture the micro-motion dynamics of gestures [39,40,41], micro-Doppler [42] signatures are used to classify minute gestures, such as gestures made with fingers. The potential of radar is not limited to the recognition of hand swipes; its capabilities have also been shown in air-writing applications [43,44,45], where characters or phrases drawn on a virtual board in front of the radar are recognized and categorized.

While traditional deep neural network (deepNet) approaches remain the best candidates for detection and recognition, the energy efficiency of these systems during inference remains a concern [46], particularly for edge devices. The multiply-accumulate (MAC) operations between deepNet layers account for the vast bulk of the energy, and research efforts focus mainly on reducing MACs by employing smaller networks, pruning approaches, and weight quantization.

Spiking Neural Networks (SNNs) [47] have recently gained popularity for their energy efficiency, aided by the growing availability of hardware on which SNNs can run efficiently. In contrast to deepNets, information in SNNs is communicated by spike timing, which includes latencies and spike rates. Communication in an SNN is highly sparse, as information is only transmitted when the membrane potential of a node (neuron) reaches a specific threshold. This sparse, 1-bit activity significantly reduces the volume of data communicated between nodes. Furthermore, since nodes merely integrate the incoming spikes, MAC arrays are replaced with adders, significantly reducing the amount of computation. Despite the energy efficiency of SNNs [48,49,50,51], training them is challenging due to the non-differentiable transfer function.

Since conventional backpropagation cannot be applied directly, local unsupervised learning, mostly spike-timing-dependent plasticity (STDP) and its variants, is used [52]. However, these methods only facilitate small networks with few parameters. Although recent advancements have shown very promising results with STDP [53], for bigger networks the concept of deepNets is mostly adopted, where the network is trained with backpropagation using differentiable approximations to the spiking neurons [54]. Among the existing SNN neuron models, the leaky integrate-and-fire (LIF) [55] is the most well-known spiking neuron model and is used in this paper. The LIF is a good choice for developing SNN models since it is simple and easy to implement [56], requires little computation (floating-point operations), and retains key neuro-computational features. The applications of SNNs in radar-based gesture sensing systems have been shown in [57,58,59,60], where range-Doppler features are fed to SNNs for robust gesture recognition and classification.

In contrast to approaches that operate on range-Doppler features, in [61] we proposed an SNN-based gesture recognition system that works only on raw data, where the fast-time FFT is mimicked within the SNN itself. In this paper, we extend that approach so that both the fast-time and slow-time FFTs are mimicked in the SNN, allowing the system to classify up to 8 gestures. The main contributions of this paper are as follows:

  • An end-to-end radar-based gesture sensing system is proposed where the SNN takes the raw data and performs gesture recognition.

  • Unlike [57,58,59,60], which work on Doppler images, the proposed approach relies only on raw ADC data. Pre-processing steps such as the slow-time and fast-time FFTs are not required, reducing the computational overhead and removing the need for additional computational units.

  • As an advancement over [61], which classifies only 4 gestures and mimics only the fast-time FFT, the proposed approach also mimics the slow-time FFT in the SNN, enabling the system to classify up to 8 gestures.

  • A novel SNN architecture is proposed where the signal pre-processing (slow-time FFT, fast-time FFT) is mimicked in SNN.

2 System design

2.1 Hardware

To perform our experiments, we use the hardware platform developed by Infineon Technologies [62], shown in Fig. 1.

Fig. 1 XENSIV™ 60 GHz radar sensor for advanced sensing

The simplified internal circuitry of the radar chipset is shown in Fig. 2. It consists of one transmit (Tx) antenna path, three receive (Rx) paths, a mixer, and Analog-to-Digital Converters (ADCs). An external phase-locked loop is used for linear frequency sweeping; the loop and the Finite State Machine (FSM) are controlled by a reference clock ticking at 80 MHz [63]. The tuning voltage is varied from 1 to 4.5 V to drive a voltage-controlled oscillator (VCO) that generates linear frequency sweeps from 57 GHz to 63 GHz, enabling the chipset to transmit signals with up to 6 GHz of bandwidth. A Serial Peripheral Interface (SPI) and a Queued Serial Peripheral Interface (QSPI) are included for memory readout, with a maximum data transfer rate of up to 200 Mb/s (4 × 50 Mb/s). Data streaming is triggered by an interrupt (IRQ) flag issued by the FSM when a threshold set by the host is reached. The mixers are followed by high-pass filters, a variable gain amplifier (VGA), an anti-aliasing filter (AAF) and an ADC driver. A 4 Mb/s, 12-bit Successive Approximation Register (SAR) multi-channel ADC (MADC) digitizes the signal, and a 196 k static random-access memory (SRAM) stores the raw data. The temperature and transmit power readout is done by a sensor ADC (SADC) [63].

Fig. 2 Simplified block diagram of Infineon’s BGT60TR13C radar chipset

For further hardware details, readers can refer to [63].

Fig. 3 Proposed processing chain where a radar is connected to a CPU. The CPU takes raw ADC data and performs signal processing such as MTI filtering and target detection. The filtered data is fed to SNN hardware/software for gesture classification

2.2 System design

The experimental setup used in this paper is shown in Fig. 3. A 60 GHz radar, configured with the parameters shown in Table 1, collects the hand gesture data in the form of raw ADC samples. The raw ADC data is then fed to a PC via USB, where the signal processing steps take place.

Table 1 System parameters summary

The start and end of a gesture are detected automatically. When a hand is detected by the radar, the recording starts, and the end of the gesture is marked when 32 consecutive frames have been recorded. Shorter, quicker gestures spanning fewer than 32 frames are padded with zero-valued chirps.
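For illustration, the zero-padding step can be sketched as follows in NumPy; the function name, the 64-sample chirp length and the random example are placeholders rather than part of the actual implementation.

```python
import numpy as np

FRAMES_PER_GESTURE = 32   # fixed gesture length used in this work
SAMPLES_PER_CHIRP = 64    # hypothetical value; set from Table 1

def pad_gesture(frames):
    """Pad a recorded gesture (stack of per-frame chirps) to 32 frames.

    Shorter gestures are appended with zero-valued chirps so that every
    gesture presented to the SNN has the same temporal length.
    """
    frames = np.asarray(frames, dtype=np.float32)
    n_missing = FRAMES_PER_GESTURE - frames.shape[0]
    if n_missing > 0:
        zeros = np.zeros((n_missing,) + frames.shape[1:], dtype=np.float32)
        frames = np.concatenate([frames, zeros], axis=0)
    return frames[:FRAMES_PER_GESTURE]

# example: a quick gesture that only spanned 20 frames
short_gesture = np.random.randn(20, SAMPLES_PER_CHIRP)
print(pad_gesture(short_gesture).shape)   # (32, 64)
```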

2.3 Signal processing

2.3.1 Signal model

A Frequency-Modulated Continuous Wave (FMCW) radar transmits chirps, i.e., waves whose frequency increases linearly with time. The chirps reflected from a target are collected at a receive antenna. A mixer at the receiver mixes the transmitted and received signals, resulting in a low-pass filtered signal known as the intermediate frequency (IF) signal. For a signal of bandwidth Bw and duration T, the frequency of the FMCW waveform can be mathematically expressed as:

$$ \nu (t) = \nu_{c} +\frac{B_{w}}{T}t, $$
(1)

where νc is the ramp start frequency. The beat signal is formed by mixing the reflected signal with a replica of the transmitted signal. The down-converted IF signal is given as:

$$ {I_{F}}(t) = \sum\limits_{m=1}^{M} \exp \left(j{2\pi \left({\frac {2\nu_{c} R_{m}}{c} + \left({\frac {2\nu_{c} \mathbf{v}_{m}}{c} + \frac {2B_{w} R_{m}}{c T}}\right) t }\right) }\right), $$
(2)

with the assumption that τm/T ≪ 1. Here, \(\tau _{m} = \frac {2(R_{m} + \mathbf{v}_{m}t)}{c} \) represents the round-trip time taken by the transmitted signal to reach back at the receiver after reflecting from the mth target at distance Rm from the radar, moving with radial velocity vm, and c is the speed of light. The IF signal IF(t) is sampled and fed to the subsequent processing steps.

2.3.2 Raw data

The IF(t), or raw ADC data, from the radar chipset is collected chirp-wise (fast-time), and the chirps are stacked together in rows (slow-time). A frame is therefore represented by a 2D array in which each row corresponds to one chirp.
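As an illustration of the signal model and of this fast-time/slow-time arrangement, the following NumPy sketch synthesizes the IF signal of (2) for a single point target and stacks the sampled chirps into a frame; the chirp duration, sample counts and target parameters are placeholder values, not those of Table 1.

```python
import numpy as np

c = 3e8                 # speed of light (m/s)
f_c = 60e9              # ramp start frequency nu_c (Hz)
B_w = 6e9               # sweep bandwidth (Hz)
T = 64e-6               # chirp duration (s); placeholder value
n_samples = 64          # samples per chirp (fast-time)
n_chirps = 32           # chirps stacked in slow-time (e.g. one chirp per frame over 32 frames)

R, v = 0.3, 0.5         # example target: hand at 0.3 m moving at 0.5 m/s

t = np.arange(n_samples) * (T / n_samples)          # fast-time axis within one chirp
frame = np.empty((n_chirps, n_samples), dtype=complex)
for m in range(n_chirps):
    R_m = R + v * m * T                              # range at the m-th chirp
    beat = 2 * B_w * R_m / (c * T)                   # range-dependent beat frequency
    doppler = 2 * f_c * v / c                        # Doppler frequency shift
    phase = 2 * f_c * R_m / c                        # constant phase term of eq. (2)
    frame[m] = np.exp(1j * 2 * np.pi * (phase + (doppler + beat) * t))

# each row of `frame` is one chirp (fast-time); rows are stacked in slow-time
print(frame.shape)   # (32, 64)
```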

2.3.3 Moving target indication filtering

In radar, signals reflected from stationary objects can have greater magnitude than the target reflection, thereby masking the reflections of the target (in our case, the hand). To eliminate these reflections from stationary objects, we use moving target indication (MTI) filtering. In MTI, an exponential moving-average filter is applied at each frame i to the fast-time data f(i), which is mathematically given as:

$$ \begin{array}{@{}rcl@{}} \hat{f}(i) &=& f(i) - m(i-1), \\ m(i) &=& \alpha \times f(i) + (1-\alpha)\times m(i-1), \end{array} $$
(3) (4)

where α denotes the forget factor, set to 0.01, and \(\hat{f}(i)\) is the filtered fast-time data. Once the data is filtered using MTI, it is fed to the target detection block.
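A minimal NumPy sketch of the MTI filter of (3)-(4), applied frame by frame, is shown below; the function name and the example input shape are illustrative only.

```python
import numpy as np

def mti_filter(frames, alpha=0.01):
    """Exponential moving-average MTI filter applied frame-by-frame.

    `frames` has shape (n_frames, n_samples): one fast-time vector per frame.
    The running mean m(i) tracks the static background and is subtracted
    from every incoming frame, removing reflections of stationary objects.
    """
    m = np.zeros(frames.shape[1])
    filtered = np.empty_like(frames)
    for i, f in enumerate(frames):
        filtered[i] = f - m                       # eq. (3): subtract background estimate
        m = alpha * f + (1.0 - alpha) * m         # eq. (4): update background estimate
    return filtered

filtered = mti_filter(np.random.randn(32, 64))
```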

2.3.4 Target detection

For hand detection, thresholding is employed on the filtered fast-time data. The threshold is calculated by taking the mean value of the fast-time data and applying a scaling factor. For example, the threshold 𝜗 at frame j is given by:

$$ \vartheta_{j}= \beta \times \sum\limits_{n=1}^{N_{b}} f_{n}(j), $$
(5)

where β is the scaling factor, set to 3, n is the index along the range bins, and Nb is the number of range bins. The scaling factor is obtained empirically as a trade-off between the false-positive rate and the probability of detection; fn(j) is computed on frames j in which no target is present. As soon as a moving target is detected, the filtered raw data is collected and fed to the SNN block for gesture classification.
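The detection rule can be sketched as follows; here the threshold is formed from the scaled mean of the filtered fast-time magnitudes, which differs from the sum in (5) only by a constant factor of 1/Nb that can be absorbed into β. The function name and default value are illustrative.

```python
import numpy as np

def target_detected(fast_time, beta=3.0):
    """Detect a moving target in an MTI-filtered fast-time vector.

    The magnitudes are averaged across the Nb range bins and scaled by
    beta to form the threshold; a bin exceeding the threshold marks the
    presence of a moving target (the hand).
    """
    magnitudes = np.abs(fast_time)
    threshold = beta * magnitudes.mean()     # scaled mean over the range bins
    return bool(np.any(magnitudes > threshold))
```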

3 Spiking neural networks

Spiking neural networks, or SNNs, are the third generation of neural networks, advancing previous generations by closely capturing the neurocomputational features of the human nervous system [47]. Neurons in SNNs are more computationally powerful because SNNs take into account not only spatial information but also temporal aspects (the precise timing of spikes) [64]. SNNs are well suited to large-scale data processing due to their low power consumption and ability to perform parallel analysis [65]. Their low power consumption, quick inference, and event-driven information processing make them an ideal candidate for deep learning/machine learning applications [57,58,59] where low energy consumption is desired.

Different neuron models exist, whose computational efficiency and biological plausibility are discussed briefly in [66]. Among them, the LIF is popular because it is a simple model requiring few computations while remaining biologically plausible.
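For readers unfamiliar with the LIF dynamics, the following sketch implements a single discrete-time LIF neuron; the time constants and threshold are illustrative values, not the parameters of Table 4.

```python
import numpy as np

def lif_neuron(input_current, dt=1e-3, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Minimal leaky integrate-and-fire neuron (illustrative parameter values).

    The membrane potential v leaks towards zero with time constant tau_rc,
    integrates the input current, and emits a spike (and resets) whenever
    it crosses the firing threshold v_th; after a spike the neuron stays
    silent for the refractory period tau_ref.
    """
    v, refractory, spikes = 0.0, 0.0, []
    for J in input_current:
        if refractory > 0:                       # still in refractory period
            refractory -= dt
            spikes.append(0)
            continue
        v += dt / tau_rc * (J - v)               # leaky integration of the input
        if v >= v_th:                            # threshold crossed: fire a spike
            spikes.append(1)
            v = 0.0                              # reset membrane potential
            refractory = tau_ref
        else:
            spikes.append(0)
    return np.array(spikes)

spike_train = lif_neuron(np.full(200, 1.5))      # constant supra-threshold input
print(spike_train.sum(), "spikes in 200 steps")
```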

4 Proposed spiking neural network architecture

The aim of the proposed architecture is to be efficient in terms of computing power and energy. Therefore, we have opted for the LIF as the neuron model. The proposed SNN architecture is shown in Fig. 4. Since the LIF is not differentiable, backpropagation cannot be performed directly; therefore, we use a differentiable approximation to the LIF.

Fig. 4 The proposed spiking neural network architecture, where the slow-time and fast-time FFTs are mimicked in the first layers of the SNN, followed by a CNN and dense layers for classification. Block a) represents the multiplication of the input with the real coefficients of the FFT resulting in Yr, and block b) corresponds to the multiplication of the input with the imaginary coefficients of the FFT resulting in Yi, thus mimicking the range FFT. Similarly, block c) represents the multiplication of Yr with the real coefficients of the FFT resulting in Yrr, and block d) represents the multiplication of Yr with the imaginary coefficients of the FFT resulting in Yri. Likewise, block e) represents the multiplication of Yi with the real coefficients of the FFT resulting in Yir, and block f) represents the multiplication of Yi with the imaginary coefficients of the FFT resulting in Yii. Subsequently, the outputs Yr, Yi, Yrr, Yri, Yir and Yii are appended and fed to the convolution layer, followed by a dense and an output layer

To mimic the discrete Fourier transform (DFT) in SNN layers, we exploit the matrix-multiplication representation of the DFT, which is possible because the DFT is a linear transformation. Each DFT dimension is represented by a single dense layer whose weights are the real and imaginary parts of the DFT coefficients. Let the radar operate with Fn frames in total and Sn samples per chirp, so that the input data dimension is Sn × Fn. To perform the DFT on this input data, the first layer contains 2 × Sn nodes to compute the real and imaginary values, and the connectivity between the input nodes and the layer nodes comprises 2Sn × Sn weights. We use the DFT trigonometric identity to calculate these connection weights, mathematically expressed as:

$$ C_{q}= \sum\limits_{p=0}^{P-1} Y_{p} \left[\cos\left(\frac{2\pi}{P}qp \right) -i\sin \left(\frac{2\pi}{P}qp \right) \right], $$
(6)

where q and p take values between 0 and Sn − 1. When applied to an input vector Y, (6) can be written in matrix form as:

$$ \boldsymbol{C}= \left(\boldsymbol{W_{R}} + i \boldsymbol{ W_{I}} \right) \boldsymbol{Y}, $$
(7)

where C is the result of the transform, and WR and WI are the matrices of real and imaginary DFT coefficients, respectively.
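The construction of these fixed weight matrices can be sketched in a few lines of NumPy; the sketch also verifies that the resulting linear layer reproduces the standard FFT, which is the property exploited here. Names and sizes are illustrative.

```python
import numpy as np

def dft_weight_matrices(P):
    """Real and imaginary DFT coefficient matrices used as fixed dense-layer weights.

    Entry (q, p) holds cos(2*pi*q*p/P) and -sin(2*pi*q*p/P) respectively,
    so that C = (W_R + i W_I) Y implements eq. (6)-(7).
    """
    q, p = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
    W_R = np.cos(2 * np.pi * q * p / P)
    W_I = -np.sin(2 * np.pi * q * p / P)
    return W_R, W_I

S_n = 64
W_R, W_I = dft_weight_matrices(S_n)

y = np.random.randn(S_n)               # one fast-time vector (one chirp)
Y_r, Y_i = W_R @ y, W_I @ y            # outputs of the real and imaginary branches
assert np.allclose(Y_r + 1j * Y_i, np.fft.fft(y))   # matches the standard FFT
```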

Since in radar processing the second FFT is applied across slow-time, the output of the first layer is reshaped and then transposed using a transpose layer. The same formulation is then applied to the third layer, where the real and imaginary parts of the first layer's output are separately connected to real and imaginary weights calculated using the trigonometric identity (6). Let Yr represent the transformation YTWR and Yi the transformation YTWI; then at layer 3 the following transformation occurs:

$$ \boldsymbol{C^{r}}= \left(\boldsymbol{{W^{r}_{R}}} + i \boldsymbol{ {W^{r}_{I}}} \right) \boldsymbol{Y^{r}}, $$
(8)
$$ \boldsymbol{C^{i}}= \left(\boldsymbol{{W^{i}_{R}}} + i \boldsymbol{ {W^{i}_{I}}} \right) \boldsymbol{Y^{i}}. $$
(9)

The first and third layers are appended with the LIF as an activation function, which transforms the output of each neuron into spikes. Let the third layer’s output be represented as:

$$ \begin{array}{@{}rcl@{}} \boldsymbol{Y^{rr}} &=& \boldsymbol{{Y^{r}}^{T}}\boldsymbol{{W^{r}_{R}}}, \\ \boldsymbol{Y^{ri}} &=& \boldsymbol{{Y^{r}}^{T}} \boldsymbol{{W^{r}_{I}}}, \\ \boldsymbol{Y^{ir}} &=& \boldsymbol{{Y^{i}}^{T}}\boldsymbol{{W^{i}_{R}}}, \\ \boldsymbol{Y^{ii}} &=& \boldsymbol{{Y^{i}}^{T}} \boldsymbol{{W^{i}_{I}}}, \end{array} $$
(10) (11) (12) (13)

then the first and third layers' outputs are concatenated with one another as:

$$ \boldsymbol{\phi} = [\boldsymbol{Y^{r}},\boldsymbol{Y^{i}},\boldsymbol{Y^{rr}},\boldsymbol{Y^{ri}},\boldsymbol{Y^{ir}},\boldsymbol{Y^{ii}}], $$
(15)

and is provided to the convolutional layer, where each transformation is represented as a channel, making ϕ a 6-channel input. The convolution layer has 16 filters of size 3 with a stride of 1. The output of the convolutional layer is fed to a fully connected layer with 64 neurons. Both the fully connected layer and the convolutional layer use the LIF as activation function. Finally, a fully connected layer with 8 neurons serves as the model output.
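A non-spiking NumPy sketch of the branch structure of Fig. 4, ignoring the LIF non-linearities between the layers so that the algebra is easy to check, is given below; in the actual network each branch is a dense layer with LIF activation, and the sizes are placeholders.

```python
import numpy as np

def dft_weight_matrices(P):
    """Fixed DFT weights: cos(2*pi*q*p/P) and -sin(2*pi*q*p/P), as in eq. (6)."""
    q, p = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
    return np.cos(2 * np.pi * q * p / P), -np.sin(2 * np.pi * q * p / P)

S_n, F_n = 64, 32                       # samples per chirp, number of frames
X = np.random.randn(S_n, F_n)           # raw ADC input: one chirp (column) per frame

Wr_fast, Wi_fast = dft_weight_matrices(S_n)   # fast-time DFT weights
Wr_slow, Wi_slow = dft_weight_matrices(F_n)   # slow-time DFT weights

# layer 1: fast-time "FFT" applied to every chirp (blocks a and b in Fig. 4)
Y_r = Wr_fast @ X
Y_i = Wi_fast @ X

# layer 3: slow-time "FFT" applied across frames (blocks c-f in Fig. 4);
# right-multiplying by the transposed weights plays the role of the transpose layer
Y_rr = Y_r @ Wr_slow.T
Y_ri = Y_r @ Wi_slow.T
Y_ir = Y_i @ Wr_slow.T
Y_ii = Y_i @ Wi_slow.T

# eq. (15): stack all six transformations as channels for the convolution layer
phi = np.stack([Y_r, Y_i, Y_rr, Y_ri, Y_ir, Y_ii], axis=0)
print(phi.shape)                        # (6, 64, 32)
```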

4.1 Training

The training of the network is performed in a conventional backpropagation manner using NengoDL [67], which provides a differentiable approximation of the firing rate of the LIF neurons in the form of the SoftLIF [68] activation. For calculating classification probabilities, we use multi-class cross-entropy as the objective function.

As with ANNs, the learning performance of SNNs depends on a suitable optimization solver and weight initialization. We employ adaptive moment estimation (Adam) [69] as the optimization solver because of its computational efficiency and because it has proven to be a good candidate for large networks [70].

A softmax classifier with a cross-entropy loss is used. The cross-entropy for J training samples belonging to K classes is mathematically expressed as:

$$ CE=-\frac {1}{J}\sum\limits_{k=1}^{K} \sum\limits_{j=1}^{J} {{z_{j}^{k}} \log \left ({H_{\theta }\left ({x_{j},k }\right) }\right)}, $$
(16)

where \({z_{j}^{k}}\) is the true label of training example j for class k, and xj is the input example to the model H with weights 𝜃.
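For completeness, the loss of (16) computed from raw model outputs can be sketched as follows; the logits and labels are random placeholders.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Multi-class cross-entropy of eq. (16) from raw model outputs.

    `logits` has shape (J, K); `labels` holds the true class index of each
    of the J training examples. The softmax converts logits into class
    probabilities H_theta(x_j, k) before the log-loss is averaged over J.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

loss = cross_entropy(np.random.randn(16, 8), np.random.randint(0, 8, size=16))
```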

4.2 Testing

To make the network spiking, the LIF neurons in the trained model are replaced with spiking LIF neurons, whose connection weights and biases are taken from the trained model. In order to acquire an accurate estimate of the spiking neuron output over time, each test sample is presented to the network repeatedly over several time steps.
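Conceptually, this rate-estimation step can be sketched as follows; model_step is a hypothetical stand-in for one simulation step of the spiking network and is not part of the NengoDL API.

```python
import numpy as np

def rate_estimate(model_step, x, n_steps=60):
    """Estimate class scores of a spiking network by presenting the input repeatedly.

    `model_step` is assumed to run one simulation time step and return the
    (binary or graded) activity of the 8 output neurons; averaging over
    `n_steps` repetitions of the same input converts spike counts into an
    estimate of the output firing rates, from which the class is taken.
    """
    outputs = np.stack([model_step(x) for _ in range(n_steps)])
    rates = outputs.mean(axis=0)           # average output activity per class
    return int(np.argmax(rates)), rates

# hypothetical stand-in for the spiking model: noisy spikes favoring class 3
fake_step = lambda x: (np.random.rand(8) < np.eye(8)[3] * 0.4).astype(float)
predicted, _ = rate_estimate(fake_step, None)
```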

5 Results and discussion

5.1 Dataset

We have used the dataset from [57]. This dataset contains a total of 4800 hand gesture swipes collected from 5 people. The dataset is a collection of 8 different gestures, as shown in Fig. 5.

Fig. 5 Types of gestures included in the dataset: a) up down, b) down up, c) left-right, d) right-left, e) rubbing, f) diagonal southeast to northwest, g) diagonal southwest to northeast, h) clapping

The dataset has 600 samples for each gesture. It was collected with minimal prior supervision provided to the users, and the data was gathered in a variety of settings, including different locations and environments. We performed 10 trials of the training and testing experiment, and the accuracy is averaged over those trials. For each trial, we randomly select 80% of the total dataset for training and 20% for testing.

5.2 Results

To assess the performance of the proposed system, we use classification accuracy as the measure. Our proposed system achieved an average accuracy of 98.7% over random trials, similar to that achieved by state-of-the-art methods, as shown in Table 2. All methods were applied to the same dataset; note that [58, 59, 61] use only 4 of the 8 gestures from this dataset.

Table 2 Comparison of the classification accuracy of the proposed SNN with other models where RVA stands for range, velocity and angle over time images

5.3 Discussion

In this paper, a radar-based gesture sensing system running on an SNN is proposed that does not require the conventional radar signal pre-processing steps, such as the fast-time and slow-time FFTs followed by constant false alarm rate (CFAR) detection; gesture sensing is performed using the raw ADC data only. The fast-time and slow-time FFTs are mimicked within the SNN, and the CFAR and other denoising steps are learned by the SNN itself, demonstrating the full capabilities of SNNs. SNNs are energy-efficient, fast, scalable and hardware-friendly, making them a suitable energy-efficient and cost-effective solution.

Besides avoiding traditional feature engineering (the slow-time and fast-time FFTs, which are instead mimicked within the SNN), the proposed method works with a single chirp per frame, in contrast to state-of-the-art methods that rely on 32 chirps per frame. This significantly reduces the computational cost.

The behavior of the proposed system in terms of firing/prediction is shown in Fig. 6, where the firing over time is plotted for examples from each class. For every class, the model starts by firing with roughly similar probabilities for all classes but quickly starts firing with a higher probability for the correct class after a few time steps. This happens because spikes are integrated over a longer time period, enabling higher accuracy in terms of correct class prediction.

Fig. 6 The firing choice of the SNN for examples taken from each class, where the class labels are: a) 0 - down up, b) 1 - up down, c) 2 - left-right, d) 3 - rubbing, e) 4 - right-left, f) 5 - diagonal southwest to northeast, g) 6 - diagonal southeast to northwest, h) 7 - clapping. It can be seen that after a few time steps the SNN starts firing for the correct class

Since we aim to make our model close to biologically plausible and computationally efficient, we opted for the LIF model because of its simplicity. As with conventional deepNets or any machine learning model, the learning performance of the SNN depends on hyperparameters. For SNNs, two types of hyperparameters need to be considered: those of the SNN model itself and those at the neuron level. The optimal parameters used for training are shown in Tables 3 and 4. These parameters were obtained by performing a grid search over configurations ranging from underfitting to overfitting the network. Furthermore, the parameters used for Adam and for the weight initialization of the dense and convolutional layers are shown in Table 5. The convolution layer weights are drawn from the well-known Xavier initialization [72], which has proven to be well suited for SNNs [73]. With the parameters given in Tables 3, 4 and 5, the proposed system is capable of classifying 8 gestures with an accuracy (98.7%) similar to that of the state-of-the-art SNNs and deepNets in Table 2.

Table 3 Hyperparameters used for training of the proposed SNN
Table 4 The parameters of the LIF neuron used in the proposed SNN

Figure 7(c) shows the confusion matrix obtained by testing the model with the aforementioned test dataset. It can be seen that the system is able to classify gestures 1 and 2 100% correctly without any confusion. Gesture 3 is confused with gesture 5 in 1.10% of cases, and gesture 4 is confused with gestures 6 and 7 in 0.87% of cases each. The only larger misclassification, around 5.88%, occurs for gesture 5, which is confused with gesture 7. Gesture 6 is confused with gesture 7 in 1.03% of cases, and gesture 7 is confused with gesture 6 in 0.88% of cases. The last gesture, gesture 8, is confused with gestures 4 and 5 in 0.98% of cases each. The classification confusion of the proposed system (Model_3) shown in Fig. 7(c) is significantly lower than that in Figs. 7(a) (Model_1) and 7(b) (Model_2). This lower confusion of the proposed model is attributed to the FFTs mimicked in the first layers (Table 5).

Fig. 7 The confusion matrices of the models listed in Table 6, obtained with the test dataset. It can be seen that the performance of Model_3 (our proposed model) is good compared to the other models; this performance is attributed to mimicking the FFTs as the first layers. The confusion matrix values are shown in percentage. The labels on the axes correspond to the 8 gesture classes: 0 - down up, 1 - up down, 2 - left-right, 3 - rubbing, 4 - right-left, 5 - diagonal southwest to northeast, 6 - diagonal southeast to northwest, 7 - clapping

Table 5 Parameters for Adam and the weight initialization of the dense and convolutional layers

To further evaluate the performance of our proposed system, we used the t-Distributed Stochastic Neighbour Embedding (t-SNE) algorithm to visualize the high-dimensional feature space. This helps us assess how discriminative the features produced by the network are for each class in the dataset. We fed the output of the last layer (before the classification layer) into t-SNE together with the associated labels. We varied the number of neurons in the layer from 2 to 64 in powers of 2 and calculated the t-SNE embedding for each case, as shown in Fig. 8. It is observed that increasing the number of neurons (the dimension of the embedding space) in the layer increases the separability. The proposed SNN learned separable and discriminative features and generated close-knit clusters for categorizing the 8 gestures already at 32 neurons. This indicates that our SNN can correctly categorize the 8 gesture types at a lower dimension of 32.
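A sketch of this visualization step, assuming the last-layer activity has been collected into a feature matrix, is given below; the feature matrix and labels are random placeholders and the t-SNE settings are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# `features` stands for the last-layer activity collected on the test set,
# and `labels` for the corresponding gesture classes (placeholders here)
features = np.random.randn(960, 32)      # e.g. 20% test split, 32-neuron layer
labels = np.random.randint(0, 8, size=960)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
# `embedded` (960 x 2) can then be scattered and colored by `labels`
# to inspect how separable the learned gesture clusters are (cf. Fig. 8)
```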

Fig. 8 t-SNE plots showing the dimensionality-reduced embedded feature clusters for a) 2, b) 4, c) 8, d) 16, e) 32 and f) 64 dimensions. Subfigure e) shows that even at lower dimensions the SNN is able to learn both separable and discriminating features

Furthermore, Fig. 9(c) shows the discriminating features learned layer by layer. It can be seen that, moving along the layers, the features become more and more discriminative and form tight clusters. At the last layer, the features are well discriminated and hence easily classified. The importance of mimicking the range and Doppler FFTs in the SNN model is evident from Table 6, where the accuracy is given for training and testing the model with and without the FFTs. Without the FFTs in the SNN, the performance is considerably lower, at 81.03% accuracy. The performance improves by 7.87% when the range FFT layer is introduced and reaches 98.7% when the Doppler FFT layer is added, which is our proposed method. It is also evident from the t-SNE plots of the CNN model and the range model that the features produced are not well discriminated, resulting in lower performance than the proposed model.

Fig. 9 Visualization of the high-dimensional feature spaces of the network layers using t-SNE plots. Column a) shows the t-SNE for the input layer, CNN layer and dense layer of the CNN (Model_1). Similarly, column b) shows the t-SNE for the range model (Model_2) and column c) for the layers of the proposed method (Model_3). It can be seen that the features become more discriminated as we move deeper in the network. The legends of the subfigures represent the classes: 0 - down up, 1 - up down, 2 - left-right, 3 - rubbing, 4 - right-left, 5 - diagonal southwest to northeast, 6 - diagonal southeast to northwest, 7 - clapping

Table 6 A comparison of classifications based on different layers. The comparison is done using the same parameters as shown in Tables 3, 4 and 5

To further increase power efficiency, we examined the effect of post-training quantization on accuracy, as shown in Table 7. Quantization increases power efficiency in two aspects: it reduces the memory footprint and the computational cost. Furthermore, quantized data with a lower bit width requires less data movement on-chip and off-chip, resulting in better energy efficiency and reduced memory bandwidth. Table 7 shows the effect of post-training quantization on all 3 models of Table 6. For 4-bit quantization, 1 bit is kept for the sign, 2 bits for the integer part and 1 bit for the fractional part. Similarly, for 8- and 16-bit quantization, 1 bit is used for the sign, 3 bits are assigned to the integer part and the remaining bits to the fractional part. As expected, the accuracy drops for aggressive quantization, i.e., 4 bits, for all models, and it drops by a larger percentage for our proposed model (Model_3). This drop is due to the higher number of neurons used in the proposed model and hence the larger impact of quantization. However, we believe higher accuracy could be achieved with quantization-aware training, which the NengoDL framework currently does not support. Increasing the number of quantization bits increases the accuracy as expected: with 8 and 16 bits, the proposed model achieves 92.68% and 96.46% accuracy, respectively, with good precision and recall as indicated by the f1-scores. For “f1-scores_micro”, the average is calculated by counting the total true positives, false positives and false negatives, whereas for “f1-scores_macro” the metric is calculated for each class and averaged using the unweighted arithmetic mean.

Table 7 Post training quantization and the effect on accuracy
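A sketch of the post-training fixed-point quantization described above is given below; the function is illustrative and operates on a random weight tensor.

```python
import numpy as np

def quantize_fixed_point(w, int_bits=3, frac_bits=4):
    """Post-training fixed-point quantization of a weight tensor.

    The bit layout follows the text: e.g. 8-bit quantization uses 1 sign bit,
    3 integer bits and the remaining 4 bits for the fractional part. Values
    are rounded to the nearest representable step and clipped to the range.
    """
    step = 2.0 ** -frac_bits                          # smallest representable value
    max_val = 2.0 ** int_bits - step                  # largest positive value
    return np.clip(np.round(w / step) * step, -max_val - step, max_val)

w = np.random.randn(64, 8)
w8 = quantize_fixed_point(w, int_bits=3, frac_bits=4)    # 8-bit layout from the text
w4 = quantize_fixed_point(w, int_bits=2, frac_bits=1)    # 4-bit layout from the text
```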

Considering the goal of an energy-efficient system, we estimated the energy consumption per classification of the proposed system. Since an actual hardware-based energy measurement is out of the scope of this work, we relied on the hardware metrics of the μBrain chip defined in [74] to estimate the energy consumption. If SPN is the maximum number of spikes, SPE = 2.1 pJ is the energy per spike and LKP = 73 μW is the static leakage power, then the energy consumption CE per classification using the μBrain hardware metrics is given as:

$$ C_{E} = SP_{N} \times SP_{E} + \delta T \times LK_{P}, $$
(17)

where δT is the inference time. Assuming δT = 28 ms, the energy consumption per classification of our proposed system is approximately CE = 2.1 μJ.
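The estimate of (17) can be reproduced with a few lines; the number of spikes per inference is an assumed placeholder, chosen only to illustrate that the leakage term dominates at the assumed 28 ms inference time.

```python
# Energy per classification following eq. (17) with the muBrain metrics of [74].
SP_E = 2.1e-12        # energy per spike (J)
LK_P = 73e-6          # static leakage power (W)
dT = 28e-3            # assumed inference time (s)
SP_N = 25_000         # assumed maximum number of spikes per inference (placeholder)

C_E = SP_N * SP_E + dT * LK_P
print(f"{C_E * 1e6:.2f} uJ per classification")   # ~2.1 uJ, dominated by leakage
```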

For the energy efficiency of SNN hardware, readers can refer to [75], where SNN hardware is compared with other deep learning hardware in terms of energy efficiency. The performance of the SNN hardware was tested on a keyword spotting application by measuring the dynamic energy cost per inference on several commercially available energy-efficient accelerators, as shown in Fig. 10. The dynamic energy cost per inference is the difference between the total energy consumed by the hardware for a single inference and the energy consumed while the hardware is idle [75]. Here, an inference means passing an input vector through a feed-forward ANN with two hidden layers to predict a probability distribution over alphabetical characters. The authors showed up to a 10× improvement in power efficiency in their experiments.

Fig. 10 Dynamic energy cost per inference across hardware devices for a feed-forward neural network with two hidden layers in a keyword spotting application [75]

Despite its simplicity, the proposed prototype SNN solution is able to identify hand gestures in real time with high accuracy, comparable to state-of-the-art deepNets and SNN counterparts. Additionally, the use of SNNs makes the proposed system a low-power and hardware-friendly solution suitable for applications where low power is desired.

6 Conclusion

A novel spiking neural network (SNN)-based gesture sensing system implemented with a 60-GHz radar is proposed. Unlike existing methods that use image-based or point cloud input data, we propose to directly use raw ADC data as the input to the SNN. The SNN implicitly mimics the Fourier transform processing, which not only removes the overhead of additional FFT accelerators but also makes the FFT pre-processing specific to the task, in this case gesture sensing. In comparison to the state-of-the-art, our proposed SNN architecture offers a similar degree of accuracy on 8 gestures, making the proposed system suitable for low-latency and low-power embedded implementations. As future work, we would like to investigate how to mimic non-parametric Fourier transforms. Furthermore, we would also like to mimic the micro-Doppler behavior, which would allow us to classify micro-Doppler gestures.