Action recognition based on dynamic mode decomposition

Based on dynamic mode decomposition (DMD), a new empirical feature for skeleton-based action recognition (SAR) in the quasi-few-shot setting (QFSS) is proposed in this study. DMD linearizes the system and extracts its modes in the form of a flattened system matrix or stacked eigenvalues, named the DMD feature. The DMD feature has three advantages. The first is its translational and rotational invariance with respect to changes in the location and pose of the camera. The second is its clear physical meaning: if a skeleton trajectory is treated as the output of a nonlinear closed-loop system, then the modes of the system represent the intrinsic dynamic properties of the motion. The third is its compact length and simple calculation without training. The information contained in the DMD feature is not as complete as that of a feature extracted using a deep convolutional neural network (CNN). However, the DMD feature can be concatenated with CNN features to greatly improve their performance in QFSS tasks, in which we have neither adequate samples to train a deep CNN directly nor numerous support sets for standard few-shot learning methods. Four QFSS datasets for SAR, named CMU, Badminton, miniNTU-xsub, and miniNTU-xview, are established based on widely used public datasets to validate the performance of the DMD feature. One group of experiments is conducted to analyze the intrinsic properties of DMD, whereas another group focuses on its auxiliary function. Experimental results show that the DMD feature can improve the performance of most typical CNN features in QFSS SAR tasks.


Introduction
Action recognition (AR) has broad application prospects in intelligent security monitoring, human-machine interaction, virtual reality, and kinematic analysis (Zhu et al. 2020). With the development of deep learning (DL), AR methods based on deep convolutional neural networks (CNNs) have shown great superiority over traditional visual technologies. These methods can be divided into three types: two-stream network (TSN), 3D CNN, and skeleton-based action recognition (SAR) methods. TSN and 3D CNN process a video clip in an end-to-end manner and utilize context information when actions are closely related to the context. In contrast, SAR methods operate in a decoupled manner and often consist of two stages. The first stage is human pose estimation, which detects the skeleton trajectories (STs) of one or more humans from a video clip. The second stage classifies the action category of the STs. Decoupling human pose estimation from action classification makes it possible to exploit the powerful generalization ability of well-trained pose estimation frameworks (Cao et al. 2017; Open-MMLab 2019) and thereby eliminate the disturbance of the background when the training set suffers from insufficient diversity.
Given that the collection of samples is expensive and time-consuming, some few-shot (Guo et al. 2018), one-shot (Memmesheimer et al. 2020), or zero-shot (Jasani and Mazagonwalla 2019) learning-based AR methods have been proposed in the past two years to deal with sample shortage based on numerous support sets. However, in many tasks whose goal is to detect illegal behaviors, the training set contains more samples than in the few-shot setting but not enough to train a deep CNN directly or to generate support sets. This is the quasi-few-shot setting (QFSS). The challenge lies in seeking a priori knowledge to help the deep CNN learn the feature better. The attention mechanism and the part-aware (Li et al. 2017a) convolutional operation are two useful ways to guide the training process.
In this paper, we propose a new empirical feature for SAR based on dynamic mode decomposition (DMD). DMD is a popular realization of the Koopman operator (Takeishi et al. 2017) and has been widely used in nonlinear dynamic analysis. By modeling the human action as a nonlinear dynamic system that determines the evolution of the ST, the system matrix or its eigenvalues can be treated as an empirical feature. The DMD feature has multiple advantages. First, DMD has a clear physical meaning. Although some information is lost during the linearization process, DMD retains important time-frequency domain information that can approximately recover the action when the initial state is given. Second, DMD has the property of translational and rotational invariance, that is, the DMD feature is constant when the position and pose of the camera change. The DMD feature is also effective on 2D skeletons in a fixed scene. Finally, the DMD feature can be concatenated with CNN features to improve their accuracy.
The currently widely used CNN is optimized as a black box and extracts time domain features that are not interpretable. In contrast, the DMD feature, which is inspired by control theory, is an empirical and interpretable feature in the frequency domain and, owing to its clear physical meaning, has a fixed computational process without training. These differences allow the DMD feature to play an auxiliary role for CNN features in QFSS tasks.
The remainder of this paper is organized as follows. Section 2 reviews recent developments in AR. Section 3 proposes a new DMD-based SAR framework and proves the translational and rotational invariance of DMD. Section 4 presents and analyzes the experimental results. Finally, Section 5 concludes the study.

Related work
The progress of video AR before the DL era was slow because of the inability of traditional visual technologies to perform high semantic-level tasks. A complete pipeline of traditional methods comprises feature extraction, combination, and classification. One typical method is the dense trajectory (DT) algorithm based on optical flow. The motion trail of the video is first captured by optical flow; then features including trajectory shape, histograms of oriented optical flow, gradient, and motion boundary are extracted. These features are encoded and used to train a support vector machine (SVM) classifier. Wang et al. also proposed the improved DT (IDT) algorithm in the same year. Compared with DT, IDT utilized an improved optical flow graph, feature regularization, and encoding method to increase the accuracy from 84.54 to 91.2% on the UCF50 dataset and from 46.6 to 57.2% on the HMDB51 dataset.
Since DL flourished in 2015, many DL-based AR methods have been proposed (Kong and Fu 2018) and have offered a wide range of possible applications in safety management (Zhu et al. 2020), violence detection (Sumon et al. 2019), and ambient assisted living (Singh et al. 2017). According to the architecture of the network, these methods can be divided into three categories, namely, TSN (Lin et al. 2020), 3D CNN (Tran et al. 2017; Diba et al. 2017), and SAR (Yan et al. 2018). In some works, long short-term memory (LSTM) networks (Singh et al. 2017) are also used to model the evolution process of STs, but their performance is inferior to that of TSN and 3D CNN because of the difficulty of training.
TSN mainly uses a two-stream architecture to extract semantic information from RGB frames and time domain information from optical flow, and combines features to make collaborative predictions. This technical route was first proposed by Simonyan (2014) and improved by other researchers from several aspects. Feichtenhofer et al. introduced 3D pooling (Feichtenhofer et al. 2016) and multiscale time (Feichtenhofer et al. 2018) into TSN. Wang et al. (2016) proposed temporal segment networks to address long-time videos and Zhou et al. (2018) put forward a temporal relation network to learn the dependency relationship between frames. Overall, TSN is the DL version of IDT that appropriately balances the computational burden and accuracy requirement.
Unlike TSN, which establishes the connection between frames with optical flow, 3D CNN executes the convolution operation in the time dimension to achieve the same goal (Tran et al. 2015). To reduce the computational burden and improve the performance of 3D CNN, many equivalent operations have been proposed. ResNet-(2+1)D architectures, which use 2D convolution on each RGB image and 3×1×1 convolution on the temporal dimension, were proposed by Tran et al. (2015, 2017) and Qiu et al. (2017) independently. Diba et al. (2017) proposed a temporal 3D CNN to explore long-term information comprehensively, together with a temporal transition layer to replace the pooling layer. They initialized the 3D CNN with a pre-trained 2D CNN, which is also an enlightening approach. Lin et al. (2019) proposed a novel method suitable for 2D CNN models that remarkably reduces the computation and performs the cross concatenation of channels between frames to allow information sharing.
SAR consists of two steps. The first step is human pose estimation, which can be classified into top-down and bottom-up strategies. Top-down strategies use an object detection framework to detect humans and locate skeleton joint points based on the detected boxes, whereas bottom-up strategies detect all possible joint points and cluster them into different humans. Many studies have been proposed and built on the open source frameworks Open-Pose (Wei et al. 2016; Simon et al. 2017; Cao et al. 2017) or mmpose (Open-MMLab 2019).
Once the ST has been obtained by the human pose estimation module, the most intuitive method of SAR is to stack the ST into a one-channel image and input it into a one-channel 2D CNN, which is named the temporal convolution network (TCN) (Kim and Reiter 2017; Memmesheimer et al. 2020). Another direct way is to use recurrent neural networks (RNNs) to represent the temporal relation (Liu et al. 2017; Singh et al. 2017). An indirect way is to project the ST into three orthogonal views and stack them as a three-channel image (Hou et al. 2018) that is suitable for a general multi-channel 2D CNN.
To enhance the performance of SAR, a priori knowledge about body parts is introduced into the network in the form of an undirected graph (Yan et al. 2018; Shi et al. 2019; Holzinger et al. 2021) or a fixed concatenation (Li et al. 2017a; Zhang et al. 2017). As with graph neural networks (Holzinger et al. 2021) in other applications, these part-aware methods essentially provide a supervised attention mechanism. At the same time, the unsupervised attention mechanism has also been exploited by some researchers. Si et al. (2019) proposed an attention-enhanced graph convolutional LSTM, which achieves state-of-the-art results on several public datasets; Li et al. (2019) combined an adaptive attention module with a two-stream RNN architecture. Furthermore, Zhao et al. (2019) combined a graph convolutional network (GCN) with LSTM into a Bayesian framework, and Peng et al. (2020) proposed a Neural Architecture Search (NAS) framework to design a part-aware GCN automatically.
Although a complex architecture can achieve better performance when the dataset is large enough, many applications fail to satisfy this requirement. Following the flourishing few-shot learning methods (including the one-shot and zero-shot methods) in other visual tasks, a small group of researchers has started to seek one-shot learning methods for SAR (Memmesheimer et al. 2020). A new dataset for few-shot learning of SAR, which contains adequate support sets, has been established based on the NTU dataset (Li et al. 2017b). Many few-shot learning methods are expected to be extended to SAR in the following two years. Moreover, QFSS, which is closer to the requirements of real applications, deserves additional attention.

Method
The human body is a complex dynamic system, with the brain as the controller, the action target and external environment as the inputs, and the human joints as the actuators. The sequential skeleton points, that is, the ST, are the observed states of the system. When performing different actions, the system evolves under the guidance of different controllers and outputs different STs. Thus, if we can recover the closed-loop system from a given ST with DMD, the action type can be recognized according to the modes extracted by DMD.
Inspired by this motivation, the DMD-based SAR framework is proposed in this section. DMD theory is introduced first; then the translational and rotational invariance of the DMD feature is proven; finally, the DMD-based action recognition framework is proposed.

Dynamic mode decomposition
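Given a discrete system $x_{k+1} = f(x_k)$, where $x_k \in \mathbb{R}^n$ is the latent state, the Koopman operator $\mathcal{K}$ (Takeishi et al. 2017) acts on an observation function $g$ as

$$\mathcal{K} g(x_k) = g(f(x_k)) = g(x_{k+1}).$$

It is assumed that $\mathcal{K}$ has a discrete spectrum, which can be written in the form of infinitely many eigenvalues $\lambda_1, \lambda_2, \lambda_3, \cdots$ and eigenfunctions $\varphi_1, \varphi_2, \varphi_3, \cdots$ with the relation $\mathcal{K}\varphi_i = \lambda_i \varphi_i$. The observation function $g$ is expanded on these eigenfunctions. In this way, the Koopman operator lifts a lower-dimensional nonlinear system to an infinite-dimensional linear system. With $K + 1$ sequential samples collected in the shifted matrices $H_1 = [g(x_1), \cdots, g(x_K)] \in \mathbb{R}^{n \times K}$ and $H_2 = [g(x_2), \cdots, g(x_{K+1})] \in \mathbb{R}^{n \times K}$, this linear system is approximated by seeking a state transition matrix $A \in \mathbb{R}^{n \times n}$ that satisfies

$$H_2 = A H_1.$$

DMD is the most widely used method to calculate $A$.
Performing a singular value decomposition on $H_1$, we have the following:

$$H_1 = U \Sigma V^{T}, \tag{1}$$

where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times K}$, and $V \in \mathbb{R}^{K \times K}$. The diagonal elements of $\Sigma$ are the singular values sorted in descending order, and all off-diagonal elements are 0. We can then obtain the similar matrix of $A$ as follows:

$$\tilde{A} = U^{T} A U = U^{T} H_2 V \Sigma^{\dagger}, \tag{2}$$

where $\Sigma^{\dagger}$ is the pseudo-inverse of $\Sigma$. $A$ and $\tilde{A}$ have the same eigenvalues.
Considering that the response of a dynamic system is mainly determined by its low-frequency parts, only the first $r$ singular values are typically retained to describe the feature of the system in practice, where $r \ll K$. Let $U_r \in \mathbb{R}^{n \times r}$, $V_r \in \mathbb{R}^{K \times r}$, and $\Sigma_r \in \mathbb{R}^{r \times r}$ be the leading submatrices of $U$, $V$, and $\Sigma$ that correspond to the $r$ largest singular values, respectively. We can then obtain the approximated state transition matrix

$$\tilde{A}_r = U_r^{T} H_2 V_r \Sigma_r^{-1} \tag{3}$$

and its eigenvalues $\tilde{\lambda}_i$, $i = 1, 2, \cdots, r$.
The state matrix $\tilde{A}_r$ determines the dynamic response of the system, including stability, response speed, and overshoot. Each $\tilde{\lambda}_i$ is a pole of the approximate linear closed-loop system and determines the stability of the system. Thus, both $\tilde{A}_r$ and $\tilde{\lambda}_i$ can serve as an empirical feature for SAR. The feature dimension is $r^2$ for the flattened $\tilde{A}_r$ and $2r$ for the stacked $\tilde{\lambda}_i$ ($r$ real parts and $r$ imaginary parts). The eigenvalue feature is shorter, whereas $\tilde{A}_r$ contains more information. The experimental results in the following section show that the performance difference between them is unclear.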
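The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `dmd_feature`, the `kind` argument, and the eigenvalue ordering are assumptions made for the example, and the trajectory is assumed to be stored column-wise as an n × (K + 1) array.

```python
import numpy as np

def dmd_feature(G, r, kind="matrix"):
    """Return a DMD feature for a trajectory G of shape (n, K + 1).

    kind="matrix": flattened reduced matrix A_r (length r * r)
    kind="eig":    stacked real and imaginary parts of its eigenvalues (length 2 * r)
    """
    H1, H2 = G[:, :-1], G[:, 1:]                    # shifted snapshot matrices
    U, s, Vt = np.linalg.svd(H1, full_matrices=False)
    Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T       # keep the r largest singular values
    A_r = Ur.T @ H2 @ Vr / sr                       # U_r^T H2 V_r Sigma_r^{-1}
    if kind == "matrix":
        return A_r.ravel()
    lam = np.linalg.eigvals(A_r)
    lam = lam[np.lexsort((lam.imag, lam.real))]     # fixed order (assumed convention)
    return np.concatenate([lam.real, lam.imag])
```

For a trajectory generated by an exactly linear system of order r, the eigenvalues returned with `kind="eig"` should coincide with those of the true system matrix up to numerical error, which gives a simple sanity check.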

Translational and rotational invariance of DMD
For an action sample, when the sensor moves or rotates, the feature for SAR should be consistent. With a simple normalization method, DMD can satisfy this requirement theoretically, that is, translational and rotational invariance.
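Two STs of the same action captured by different cameras are denoted as follows:

$$G^1 = [g^1_1, g^1_2, \cdots, g^1_{K+1}]$$

and

$$G^2 = [g^2_1, g^2_2, \cdots, g^2_{K+1}].$$

$G^1$ and $G^2$ are captured by cameras with fixed coordinates $O_1x_1y_1z_1$ and $O_2x_2y_2z_2$. The $s$th skeleton joint at step $j$ captured by camera $i$ is denoted as $p^i_{s,j} = [x^i_{s,j}, y^i_{s,j}, z^i_{s,j}]^T$. Then the spatial coordinates of all $S$ skeleton points can be stacked as

$$g^i_j = [(p^i_{1,j})^T, (p^i_{2,j})^T, \cdots, (p^i_{S,j})^T]^T \in \mathbb{R}^{3S \times 1}.$$

The transfer matrix from $O_1x_1y_1z_1$ to $O_2x_2y_2z_2$ is denoted as

$$T_{1to2} = \begin{bmatrix} r_{1to2} & l_{1to2} \\ 0 & 1 \end{bmatrix},$$

where $r_{1to2}$ is the rotation matrix and $l_{1to2}$ is the translation vector. $r_{1to2}$ and $l_{1to2}$ satisfy

$$p^2_{s,k} = r_{1to2}\, p^1_{s,k} + l_{1to2}$$

for $k = 1, 2, \cdots, K + 1$.
The translational and rotational invariance of DMD means that $\tilde{A}^1_r = \tilde{A}^2_r$. This property is proven as follows. Normalizing both trajectories according to Eq. (6) removes the translation term $l_{1to2}$, and we then obtain the following relation between the normalized stacked states $\bar{g}^1_k$ and $\bar{g}^2_k$:

$$\bar{g}^2_k = R\, \bar{g}^1_k, \quad R = \mathrm{diag}(r_{1to2}, \cdots, r_{1to2}) \in \mathbb{R}^{3S \times 3S}.$$

Thus, $H^2_1 = R H^1_1$, and there is $H^2_2 = R H^1_2$. The system matrices can be obtained with Eqs. (1 and 2) as follows. With $H^1_1 = U^1 \Sigma^1 (V^1)^T$, we have

$$H^2_1 = (R U^1)\, \Sigma^1 (V^1)^T,$$

so that $U^2 = R U^1$, $\Sigma^2 = \Sigma^1$, and $V^2 = V^1$. Their similar matrices are as follows:

$$\tilde{A}^1_r = (U^1_r)^T H^1_2 V^1_r (\Sigma^1_r)^{-1}, \quad \tilde{A}^2_r = (R U^1_r)^T R H^1_2 V^1_r (\Sigma^1_r)^{-1}.$$

As the rotation matrix is orthogonal and satisfies $R^T = R^{-1}$, we can obtain the following:

$$\tilde{A}^2_r = (U^1_r)^T R^T R H^1_2 V^1_r (\Sigma^1_r)^{-1} = \tilde{A}^1_r.$$

From the proof above, DMD can guarantee rotational invariance inherently, and the normalization method in Eq. (6) can guarantee translational invariance. Thus, the normalization is a necessary preprocessing step for the DMD feature. Some other normalization methods can also guarantee translational invariance, but they have some disadvantages. For instance, some methods normalize the trajectory into [0, 1] and satisfy translational and rotational invariance. However, when $K = 0$, they are not applicable.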
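The invariance argument can be checked numerically with a small sketch. Everything below is illustrative: the synthetic random-walk trajectory, the helper names, and in particular the normalization (subtracting the mean joint position over all joints and frames) are assumptions of this example, since any translation-removing reference point would serve the same purpose. The check compares eigenvalues rather than matrix entries because a numerical SVD fixes the singular vectors only up to sign, which can flip the signs of individual entries of the reduced matrix without changing its spectrum.

```python
import numpy as np

def normalize(P):
    # Assumed normalization: subtract the mean joint position over all joints and frames.
    return P - P.mean(axis=(0, 1), keepdims=True)

def dmd_eigs(P, r):
    """Eigenvalues of the rank-r DMD matrix of a trajectory P with shape (K + 1, S, 3)."""
    G = normalize(P).reshape(P.shape[0], -1).T        # stacked states, shape (3S, K + 1)
    H1, H2 = G[:, :-1], G[:, 1:]
    U, s, Vt = np.linalg.svd(H1, full_matrices=False)
    A_r = U[:, :r].T @ H2 @ Vt[:r].T / s[:r]
    return np.linalg.eigvals(A_r)

rng = np.random.default_rng(0)
P1 = rng.standard_normal((40, 5, 3)).cumsum(axis=0)   # synthetic trajectory seen by camera 1
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))      # random orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1                                     # make it a proper rotation
shift = np.array([2.0, -1.0, 0.5])
P2 = P1 @ Q.T + shift                                 # same motion seen by camera 2

e1, e2 = dmd_eigs(P1, 4), dmd_eigs(P2, 4)
gap = max(np.min(np.abs(e2 - z)) for z in e1)         # distance between the two spectra
print(f"largest eigenvalue mismatch: {gap:.2e}")
```

If the normalization step is skipped, the translation no longer cancels and the two spectra generally differ, which is consistent with the role the text assigns to the normalization.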

DMD feature for SAR with QFSS
Deep CNNs can extract more information than DMD because of their large number of parameters, and they are the standard answer for SAR if the training set is adequate. However, in many QFSS SAR tasks, which do not have adequate training samples, training a deep CNN directly is infeasible. Facing this problem, we design a framework to improve the performance of CNN features on QFSS SAR tasks with the empirical DMD feature. Denote a skeleton trajectory that has been normalized according to formula (6) in the form of a matrix

$$G = [g_1, g_2, \cdots, g_{K+1}],$$

where $g_j = [p_{1,j}, p_{2,j}, \cdots, p_{S,j}]^T \in \mathbb{R}^{3S \times 1}$ is the stacked skeleton points with $p_{s,j} = [x_{s,j}, y_{s,j}, z_{s,j}]$ at step $j$. Then, we have

$$H_1 = [g_1, g_2, \cdots, g_K], \quad H_2 = [g_2, g_3, \cdots, g_{K+1}].$$

By substituting $H_1$ and $H_2$ into Eqs. (1, 2, and 3), we can obtain the DMD feature $v_{DMD}$ of $G$, that is, the flattened $\tilde{A}_r$ or the stacked eigenvalues $\tilde{\lambda}_i$. Inputting $G$ into a CNN yields the CNN feature $v_{CNN}$. Then, a DMD-based SAR framework can be established, as depicted in Fig. 1, with the following five components: (1) a human pose estimation module, for instance, Open-Pose or mmpose, that obtains skeleton trajectories from video clips; (2) normalization of the ST to obtain $G$ according to formula (6); (3) a CNN feature extractor to obtain $v_{CNN}$; (4) a DMD feature extractor to obtain $v_{DMD}$; (5) a final classifier to predict the action category.
As a trajectory can be approximately recovered from its modes and eigenvectors (Takeishi et al. 2017), DMD serves as an encoder in the framework. The DMD feature is compact and informative, and its physical meaning is clear. Although the order truncation and linearization may lose some information, the feature is useful when the training set is not adequate.
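The encoder view can be illustrated by reconstructing a trajectory from the reduced matrix and the initial state. The sketch below uses projected DMD modes, which is one common convention and an assumption here rather than a detail confirmed by the text; the function name `dmd_reconstruct` is likewise hypothetical.

```python
import numpy as np

def dmd_reconstruct(G, r):
    """Approximate a trajectory G of shape (n, K + 1) from its rank-r DMD modes."""
    H1, H2 = G[:, :-1], G[:, 1:]
    U, s, Vt = np.linalg.svd(H1, full_matrices=False)
    Ur = U[:, :r]
    A_r = Ur.T @ H2 @ Vt[:r].T / s[:r]              # reduced state transition matrix
    lam, W = np.linalg.eig(A_r)                     # eigenvalues and eigenvectors of A_r
    Phi = Ur @ W                                    # projected DMD modes
    g0 = G[:, 0].astype(complex)
    b = np.linalg.lstsq(Phi, g0, rcond=None)[0]     # mode amplitudes from the initial state
    k = np.arange(G.shape[1])                       # time steps 0 .. K
    return (Phi @ (b[:, None] * lam[:, None] ** k)).real
```

For data produced by an exactly linear system of order r, the reconstruction should match the input up to rounding; for real skeleton trajectories it is only an approximation, and the error reflects the information discarded by truncation and linearization.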
The rank of DMD for a SAR task is often less than 10, and the length of $v_{DMD}$ is less than 100. When $v_{DMD}$ is used together with $v_{CNN}$, the additional computation is negligible.
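The concatenation step of the framework can be sketched as follows. The function name `classify`, the weight matrix `W`, and the bias `b` are hypothetical; the sketch assumes that the final classifier is a single linear layer whose input length is the CNN feature length plus the DMD feature length.

```python
import numpy as np

def classify(v_cnn, v_dmd, W, b):
    """Concatenate the CNN and DMD features and apply one linear layer.

    W has shape (num_classes, len(v_cnn) + len(v_dmd)); b has shape (num_classes,).
    Returns the index of the predicted action category.
    """
    v = np.concatenate([v_cnn, v_dmd])   # joint feature vector
    return int(np.argmax(W @ v + b))     # class with the largest score
```

With a 256-dimensional CNN feature and a rank of 3, the flattened matrix feature adds only 9 inputs to the final layer, which illustrates why the extra cost is small.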
Considering the perspective transformation in the image acquisition of RGB videos, the DMD feature of a 2D skeleton trajectory cannot guarantee translational and rotational invariance. However, in some applications where the position and pose of the camera are fixed, the distortion of skeletons can be treated as one part of the action itself, and thus, the DMD feature is also suitable.

Fig. 1 Framework of SAR based on DMD

Experiments and analysis
To analyze the performance of the DMD feature comprehensively, two groups of experiments are conducted based on three datasets. In one group, the DMD feature is used alone to analyze its intrinsic properties. The matrix feature and eigenvalue feature of DMD are compared with a basic LSTM on the CMU and Badminton datasets. In the other group, we focus on the auxiliary performance of the DMD feature. The method of Shi et al. (2019), which used to be state-of-the-art on the NTU and Kinetics datasets, failed to converge on the miniNTU dataset. Thus, we do not present its results. ResNet18 is a special realization of TCN whose backbone is a one-channel residual network, which has a much deeper architecture than the other methods. The basic LSTM contains three layers, and each layer contains 100 neurons. In all methods, we have adjusted the feature length to 256 and the output layer to a linear fully-connected layer with 256 inputs and 4 (CMU and Badminton) or 40 (miniNTU) outputs. All these methods are trained with randomly initialized parameters.

Datasets
The DMD feature is an empirical feature with limited length, and it does not have as strong an expressive ability as the CNN feature. Thus, the motivation of this work is to explore the applicable scenes of DMD rather than to seek state-of-the-art accuracy. We have chosen three datasets with very different properties to analyze DMD fully.
(1) CMU dataset. The CMU dataset (CMU 2013) is a classic dataset for motion capture, in which 29 skeleton points are measured by wearable devices. Thus, its precision is much higher than that of other datasets. We extracted a subset from the CMU dataset for this group, which includes dancing, jumping, running, and walking actions. Figure 2 shows some samples of the CMU dataset. A total of 119 samples are used for training and testing, and their distribution is listed in Table 1. We removed 4 unnecessary skeleton joints so that the subset can share the NTU data loader. CMU is easier than the other datasets.
(2) Badminton dataset. The Badminton dataset is a self-established dataset to illustrate the applicability of the DMD feature for 2D STs in a fixed scene. This dataset also contains four categories of actions, namely, backhand striking, forehand striking, backhand lifting, and forehand lifting. The 2D skeleton trajectories are obtained from video clips with the human pose estimation framework mmpose (Open-MMLab 2019). Some failed frames are filled in with linear interpolation. A badminton action contains three stages, namely, moving toward the shuttlecock, hitting, and returning to the defensive position. In addition, some athletes hold the racket in their right hand, whereas the rest hold it in their left hand. This makes the actions harder to distinguish. Figure 3 shows some samples. The training set contains 30 trajectories for each type, and the test set contains 12, 10, 10, and 13 trajectories for the four types, respectively. We only considered the athlete in the lower court by limiting the detection region of the feet. In this dataset, the skeleton contains 17 joints. The distortion of the 2D skeleton by perspective transformation, the consistency of the athlete's movements, and the confusion of dominant hands between athletes make it much more difficult than the CMU dataset.
(3) miniNTU dataset. NTU (Li et al. 2017b) is a widely used large-scale dataset for SAR. This dataset contains 60 categories of actions, and 20 of them involve multiple humans. The skeletons are captured with three RGBD sensors placed at different poses. In this work, we considered the 40 types of actions with only one human and chose 30 training and 10 test samples for each type of action, so that the dataset satisfies QFSS. As in the standard NTU dataset, we also established cross-subject and cross-view subsets. The former means that the humans in the training set and test set are different; the latter means that the trajectories in the training set and test set are captured by different cameras. This dataset is much more difficult than the CMU and Badminton datasets. Another widely used dataset is the Kinetics (Kay et al. 2017) dataset, which is even more difficult than NTU. Its skeleton trajectories cannot satisfy the requirement of translational and rotational invariance, and thus, we have not tested the Kinetics dataset.
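The per-class subsampling that turns a large dataset into a QFSS dataset can be sketched as follows. The function name `qfss_split` and its parameters are hypothetical, and the sketch draws samples at random within each class; a cross-subject or cross-view split would additionally require filtering by subject or camera, which is not shown here.

```python
import random
from collections import defaultdict

def qfss_split(labels, n_train=30, n_test=10, seed=0):
    """Pick n_train training and n_test test indices per class, without overlap."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for y, idxs in sorted(by_class.items()):
        if len(idxs) < n_train + n_test:
            raise ValueError(f"class {y} has too few samples")
        picked = rng.sample(idxs, n_train + n_test)   # sampling without replacement
        train += picked[:n_train]
        test += picked[n_train:]
    return train, test
```

With 40 single-person action classes, this yields 1200 training and 400 test trajectories in total.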

SAR based on DMD feature and ovoSVM
In this group, the intrinsic properties of DMD were explored. Because the miniNTU dataset is too difficult to fully exhibit the properties of the DMD feature, we only conducted experiments on the CMU and Badminton datasets. We designed the comparative experiments from several aspects.
First, as both DMD and LSTM can directly utilize temporal information, we designed a DMD+ovoSVM framework as the realization of the DMD feature and chose a basic shallow LSTM for comparison. A shallow CNN performs much worse than DMD+ovoSVM and LSTM because it cannot extract temporal information. Thus, we did not compare DMD+ovoSVM with the shallow CNN in this group. The DMD+ovoSVM framework is a simple realization of Fig. 1, in which the classifier is an ovoSVM with radial basis function (RBF) kernels, and the DMD feature is input into the ovoSVM without concatenation with any CNN feature. Considering that the length of many skeleton trajectories in the Badminton dataset is less than 40, we limited the DMD rank to be smaller than 7, and the number of RBF kernels in ovoSVM ranges from 0.1 to 300. Second, both types of DMD features mentioned above, that is, the flattened matrix and the stacked eigenvalues, were considered. Finally, to explore ways of reducing the computation, a half truncation trick and a four joints trick were tested. The former truncates trajectories from the middle, inspired by the fact that all movements in Badminton contain recovering processes. The latter reduces the skeleton to 4 joints, namely the two wrists and two ankles, because the movement range of the limbs is larger than that of the body. Table 2 shows all optional hyperparameter configurations of DMD+ovoSVM. Uniformly distributed noise is added to the trajectories to augment the training and test sets, and 10 duplicates of each trajectory are generated. The noise is defined as $x \to (1 + 0.05\,\epsilon) \cdot x$, where $\epsilon \sim U(-1, 1)$. The LSTM has three linear fully-connected layers with 100 neurons to extract the feature, and another linear fully-connected layer is used to predict the action categories. The input of the LSTM is truncated or zero-padded to a length of 200 for CMU and 40 for Badminton. The appropriate hyperparameters in Table 2, including truncation, augmentation, and four joints, are also used on LSTM.
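The multiplicative noise augmentation and the length adjustment described above can be sketched as follows. The function names are hypothetical, a trajectory is assumed to be a list of frames with each frame a list of coordinates, and the noise is assumed to be drawn independently for every coordinate, which the text does not specify.

```python
import random

def augment(traj, copies=10, scale=0.05, seed=0):
    """Return noisy duplicates of traj using x -> (1 + scale * eps) * x, eps ~ U(-1, 1)."""
    rng = random.Random(seed)
    return [[[x * (1 + scale * rng.uniform(-1, 1)) for x in frame] for frame in traj]
            for _ in range(copies)]

def fix_length(traj, length):
    """Truncate traj to length frames, or pad it with all-zero frames."""
    dim = len(traj[0])
    out = [list(frame) for frame in traj[:length]]
    out += [[0.0] * dim for _ in range(length - len(out))]
    return out
```

Because the noise is multiplicative, each perturbed coordinate stays within 5% of its original value, and coordinates equal to zero are left unchanged.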
Each configuration is repeated 50 times, and the best results of all configurations are listed in Table 3. Binary classification results on the striking and lifting subsets of Badminton are also presented for reference. It can be found that: (1) LSTM achieved the highest accuracy on CMU with good stability, whereas it achieved the lowest accuracy on Badminton with the worst stability. (2) The matrix feature is preferred on the CMU dataset, whereas the eigenvalue …
Figure 4 shows the corresponding distribution of the results in Table 3. In the figures, the flattened matrix and the stacked eigenvalues are denoted as Amat and Mu, respectively. LSTM performs better than DMD+ovoSVM on the CMU dataset but worse on the Badminton dataset. The result of Amat on the CMU dataset is shaped like a barbell, that is, it suffers from a large standard deviation. As DMD extracts the modes of an approximate linear system, the DMD feature is independent of the input and discards some spatial information that is useful for the classification of CMU. A latent temporal condition of Badminton is that lift actions must occur in the frontcourt and strike actions must occur in the backcourt. If this temporal condition could be utilized, the classification results on the complete Badminton dataset should be close to those on the subsets. However, both LSTM and DMD failed to utilize this condition.
To analyze the performance of DMD more comprehensively, we compared the results of different hyperparameters. We computed the accuracy of all executions in the rank test. Figure 5 shows the distribution of accuracy versus the rank ($r$) of DMD. The optimal results are achieved for lift and strike actions when $r = 3$ because the difficulty of finding the feature boundary increases as the length of the feature increases. The results of $r = 2$ for the strike action are poorer than those of $r > 2$. This indicates that too few low-frequency modes may be insufficient to describe the strike action. A tradeoff exists between the rank and the length of the DMD feature, and how to determine the rank for different tasks is an important problem that deserves in-depth investigation. Figures 6, 7 and 8 show the comparison of the half trajectory, four points, and shuffle eigenvalue tricks, respectively. The half trajectory and four points tricks do not lead to a loss of accuracy. Thus, they can be used to reduce the computation significantly in some tasks. The shuffling operation on eigenvalues has a negative effect on the high accuracy region but makes the distribution converge to the middle region.
The best performance of DMD+ovoSVM is close to that of the shallow LSTM. The training speed of DMD+ovoSVM is higher and more stable. The solving process of DMD+ovoSVM only takes approximately 0.02-0.4 ms when running on an Intel i9-9900K CPU. LSTM takes approximately 2 min for 50 epochs of training when running on the same CPU. The time decreases to 0.2-4 s when running on a single NVIDIA RTX 2080 Ti GPU. Since DMD involves singular value decomposition and matrix inversion, a GPU cannot accelerate the computation of DMD. The inability to utilize the GPU is a disadvantage of DMD.

SAR based on DMD feature and CNN feature
In this group, we considered the auxiliary role of the DMD feature for some popular deep CNNs, including ST-GCN, TCN, ResNet18, basic LSTM, and PLSTM. According to the framework in Fig. 1, the DMD feature in the form of the flattened matrix is concatenated with the CNN feature that is extracted by one of those deep CNNs and input into a linear fully-connected layer for classification. No trick in Table 2 has been used in this group. Table 4 shows all configurations for this group of experiments.
Tables 5 and 6 present the results on the CMU and Badminton datasets, and the miniNTU datasets, respectively. We collected the mean, maximum, and standard deviation of accuracy from 20 executions of ST-GCN and ST-GCN+DMD on the miniNTU dataset and 50 executions of the others. The results of DMD+ovoSVM are also presented … Tables 7 and 8. From the results, it can be found that: (1) The DMD feature can improve the performance of most methods; in particular, it helps TCN converge on miniNTU-xsub and PLSTM converge on miniNTU-xview. ResNet18 can represent the frequency domain information owing to its deep architecture and multiple convolution layers. Thus, the DMD feature cannot provide supplementary information for ResNet18. The DMD feature loses some spatial information and is not as complete as a deep CNN feature.
(2) A recurrent architecture can also extract temporal information, but shallow layers limit its feature expression ability. Thus, LSTM and PLSTM perform better than TCN but much worse than ST-GCN and ResNet18. (3) The performance of ResNet18 exceeds that of ST-GCN dramatically on all three datasets. However, ST-GCN is better than ResNet18 on the standard NTU dataset. In our test, the top-1 accuracies of ResNet18 are 79% and 87% on the standard NTU-xsub and NTU-xview datasets, whereas ST-GCN achieves 81.3% and 89.1%. The results of ST-GCN+DMD are very close to those of ST-GCN, which means that DMD provides no additional information for ST-GCN. With the predefined relation of the human skeleton, ST-GCN has a stronger ability to extract spatial and temporal information than ResNet18. When the training samples are adequate, the spatial relation between joints brings more benefit than the frequency domain information. However, when the samples are not adequate, as in QFSS tasks, the predefined relation cannot be fully exploited.
(4) Although noise noticeably degrades the performance of all methods, the auxiliary function of DMD persists when 5% noise exists.
A deeper GCN, which combines the advantages of both deep architecture and part-aware knowledge, would achieve better performance. However, it requires more samples and stronger computing power. When a deep architecture cannot be deployed, for instance, when running on embedded neural computing devices or when training samples are lacking, the DMD feature can be used to assist a simpler CNN feature in achieving higher accuracy.

Conclusion
The DMD feature for SAR is studied in this work. This feature has a clear physical meaning in the frequency domain and can guarantee translational and rotational invariance with an appropriate normalization. The DMD feature can achieve a performance close to that of a shallow LSTM when it is used alone in SAR tasks. A DMD-based SAR framework is proposed, in which the DMD feature is concatenated with a CNN feature. The DMD feature can noticeably improve the accuracy of CNN features in QFSS SAR tasks with a small computational cost, even when 5% noise exists. In particular, DMD can help TCN converge on the miniNTU-xsub dataset and PLSTM converge on the miniNTU-xview dataset. Because we cannot utilize a GPU to accelerate the calculation of DMD, the DMD-based SAR framework cannot be combined into an end-to-end framework. Thus, one direction of our future work is to find a realization of DMD on GPU, for instance, training a CNN to extract the modes. Furthermore, as the DMD feature only represents the modes of an approximated linear system and loses some spatial information, another problem that deserves further research is to explore empirical spatial features that can compensate for the information loss of DMD.

Declarations
Data availability The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.