Abstract
Based on dynamic mode decomposition (DMD), a new empirical feature for quasi-few-shot setting (QFSS) skeleton-based action recognition (SAR) is proposed in this study. DMD linearizes the system and extracts its modes in the form of a flattened system matrix or stacked eigenvalues, named the DMD feature. The DMD feature has three advantages. The first is its translational and rotational invariance with respect to changes in the localization and pose of the camera. The second is its clear physical meaning: if a skeleton trajectory is treated as the output of a nonlinear closed-loop system, then the modes of the system represent the intrinsic dynamic property of the motion. The last is its compact length and simple calculation, which requires no training. The information contained in the DMD feature is not as complete as that of a feature extracted using a deep convolutional neural network (CNN). However, the DMD feature can be concatenated with CNN features to greatly improve their performance in QFSS tasks, in which we have neither adequate samples to train a deep CNN directly nor numerous support sets for standard few-shot learning methods. Four QFSS datasets for SAR, named CMU, Badminton, miniNTU-xsub, and miniNTU-xview, are established based on widely used public datasets to validate the performance of the DMD feature. One group of experiments analyzes the intrinsic properties of DMD, whereas another focuses on its auxiliary functions. Experimental results show that the DMD feature can improve the performance of most typical CNN features in QFSS SAR tasks.
Introduction
Action recognition (AR) demonstrates broad application prospects in intelligent security monitoring, human-machine interaction, virtual reality, and kinematic analysis (Zhu et al. 2020). With the development of deep learning (DL), AR methods based on deep convolutional neural networks (CNNs) have shown great superiority over traditional visual technologies. These methods can be divided into three types: two-stream network (TSN), 3D CNN, and skeleton-based action recognition (SAR) methods. TSN and 3D CNN deal with a video clip in an end-to-end manner and utilize context information when actions are closely related to the context. By contrast, SAR methods operate in a decoupled manner and often consist of two stages. The first stage is human pose estimation, which detects skeleton trajectories (STs) of one or more humans in a video clip. The second stage classifies the action category of the STs. Decoupling human pose estimation from action classification can exploit the powerful generalization ability of well-trained pose estimation frameworks (Cao et al. 2017; OpenMMLab 2019) to eliminate background disturbance when the training set suffers from insufficient diversity.
Given that the collection of samples is expensive and time-consuming, some few-shot (Guo et al. 2018), one-shot (Memmesheimer et al. 2020), or zero-shot (Jasani and Mazagonwalla 2019) learning-based AR methods relying on numerous support sets have been proposed in the past two years to deal with sample shortage. However, in many tasks whose goal is to detect illegal behaviors, the training set contains more samples than the few-shot setting but not enough to train a deep CNN directly or to generate support sets. This is the quasi-few-shot setting (QFSS). The challenge lies in seeking a priori knowledge that helps the deep CNN learn the feature better. The attention mechanism (Liu et al. 2020) and part-aware (Li et al. 2017a) convolutional operations are two useful ways to guide the training process.
In this paper, we propose a new empirical feature for SAR based on dynamic mode decomposition (DMD). DMD is a popular realization of the Koopman operator (Takeishi et al. 2017) and has been widely used in nonlinear dynamic analysis. By modeling the human action as a nonlinear dynamic system that determines the evolution of the ST, the system matrix or its eigenvalues can be treated as an empirical feature. The DMD feature has multiple advantages. First, DMD has a clear physical meaning. Although some information is lost during linearization, DMD contains important time-frequency domain information that can recover the action appropriately when the initial state is given. Second, DMD has the property of translational and rotational invariance, that is, the DMD feature is constant when the position and pose of the camera change. The DMD feature is also effective on 2D skeletons in a fixed scene. Finally, the DMD feature can be concatenated with CNN features to improve their accuracy.
Currently, widely used CNNs are optimized as black boxes and extract time-domain features that are not interpretable. By contrast, the DMD feature, which is inspired by control theory, is an empirical and interpretable feature in the frequency domain and has a fixed computational process without training, owing to its clear physical meaning. These differences allow the DMD feature to play an auxiliary role for CNN features in QFSS tasks.
The remainder of this paper is organized as follows. Section 2 reviews recent developments in AR. Section 3 proposes a new DMD-based SAR framework and proves the translational and rotational invariance of DMD. Section 4 presents and analyzes experimental results. Finally, Sect. 5 concludes the study.
Related work
The progress of video AR before the DL era was slow because of the inability of traditional visual technologies to perform high semantic-level tasks. A complete pipeline of traditional methods comprises feature extraction, combination, and classification. One typical method is the dense trajectory (DT) algorithm based on optical flow (Wang et al. 2013). The motion trail of the video is captured by optical flow first; then features including trajectory shape, histograms of oriented optical flow, gradient, and motion boundary are extracted. These features are encoded and used to train a support vector machine (SVM) classifier. Wang et al. also proposed the improved DT (IDT) algorithm (Wang and Schmid 2013) in the same year. Compared with DT, IDT utilized an improved optical flow graph, feature regularization, and encoding method to increase accuracy from 84.54 to 91.2% on the UCF50 dataset and from 46.6 to 57.2% on the HMDB51 dataset.
Since DL flourished in 2015, many DL-based AR methods have been proposed (Kong and Fu 2018) and offered a wide range of possible applications in safety management (Zhu et al. 2020), violence detection (Sumon et al. 2019), and ambient assisted living (Singh et al. 2017). According to the architecture of the network, these methods can be divided into three categories, namely, TSN (Lin et al. 2020), 3D CNN (Tran et al. 2017; Diba et al. 2017), and SAR (Yan et al. 2018). In some works, long short-term memory (LSTM) networks (Singh et al. 2017) are also used to model the evolution process of STs, but their performance is inferior to TSN and 3D CNN because of the difficulty of training.
TSN mainly uses a two-stream architecture to extract semantic information from RGB frames and time-domain information from optical flow, and combines the features to make collaborative predictions. This technical route was first proposed by Simonyan (2014) and improved by other researchers in several respects. Feichtenhofer et al. introduced 3D pooling (Feichtenhofer et al. 2016) and multiscale time (Feichtenhofer et al. 2018) into TSN. Wang et al. (2016) proposed temporal segment networks to address long videos, and Zhou et al. (2018) put forward a temporal relation network to learn the dependency relationship between frames. Overall, TSN is the DL version of IDT that appropriately balances the computational burden and accuracy requirement.
Unlike TSN, which establishes the connection between frames with optical flow, 3D CNN executes the convolution operation in the time dimension to achieve the same goal (Tran et al. 2015). To reduce the computational burden and improve the performance of 3D CNN, many equivalent operations have been proposed. The ResNet(2+1)D architecture, which uses 2D convolution on each RGB image and 3×1×1 convolution in the temporal dimension, was proposed by Tran et al. (2015, 2017) and Qiu et al. (2017) independently. Diba et al. (2017) proposed a temporal 3D CNN to explore long-term information comprehensively, together with a temporal transition layer to replace the pooling layer. They initialized the 3D CNN with a pretrained 2D CNN, which is also an enlightening approach. Lin et al. (2019) proposed a novel method suitable for 2D CNN models that remarkably reduces the computation and performs cross concatenation of channels between frames to allow information sharing.
SAR consists of two steps. The first step is human pose estimation, which can be classified into top-down and bottom-up strategies. Top-down strategies use an object detection framework to detect humans and locate skeleton joint points based on the detected boxes, whereas bottom-up strategies detect all possible joint points and cluster them to different humans. Many studies have been proposed and built on the open-source frameworks OpenPose (Wei et al. 2016; Simon et al. 2017; Cao et al. 2017) or mmpose (OpenMMLab 2019).
Once the ST has been obtained by the human pose estimation module, the most intuitive method of SAR is to stack the ST into a one-channel image and input it into a one-channel 2D CNN, which is named the temporal convolution network (TCN) (Kim and Reiter 2017; Memmesheimer et al. 2020). Another direct way is to use recurrent neural networks (RNNs) to represent the temporal relation (Wang and Wang 2017; Liu et al. 2017; Singh et al. 2017). An indirect manner is to project the ST onto three orthogonal views and stack them as a three-channel image (Hou et al. 2018) that is suitable for a general multichannel 2D CNN.
To enhance the performance of SAR, a priori knowledge about body parts is introduced into the network in the form of an undirected graph (Yan et al. 2018; Shi et al. 2019; Holzinger et al. 2021) or a fixed concatenation (Li et al. 2017a; Zhang et al. 2017). Similar to the performance of graph neural networks (Holzinger et al. 2021) in other applications, these part-aware methods essentially provide a supervised attention mechanism. At the same time, the unsupervised attention mechanism has also been exploited by some researchers. Si et al. (2019) proposed an attention-enhanced graph convolutional LSTM that achieves state-of-the-art performance on several public datasets; Li et al. (2019) combined an adaptive attention module with a two-stream RNN architecture. Furthermore, Zhao et al. (2019) combined a graph convolutional network (GCN) with LSTM in a Bayesian framework, and Peng et al. (2020) proposed a neural architecture search (NAS) framework to design a part-aware GCN automatically.
Although a complex architecture can achieve better performance when the dataset is large enough, many applications fail to satisfy this requirement. Referring to the flourishing few-shot learning methods (including one-shot and zero-shot methods) in other visual tasks, a small group of researchers has started to seek one-shot learning methods for SAR (Memmesheimer et al. 2020). A new dataset for few-shot learning of SAR, which contains adequate support sets, has been established based on the NTU dataset (Li et al. 2017b). Many few-shot learning methods are likely to be extended to SAR in the following two years. Moreover, QFSS, which is closer to the requirements of real applications, deserves additional attention.
Method
The human body is a complex dynamic system, with the brain as the controller, the action target and external environment as the inputs, and human joints as the actuators. The sequential skeleton points, that is, the ST, are the observed states of the system. When performing different actions, the system evolves under the navigation of different controllers and outputs different STs. Thus, if we can recover the closed-loop system from a given ST with DMD, the action type can be recognized according to the modes extracted by DMD.
Inspired by this motivation, the DMD-based SAR framework is proposed in this section. DMD theory is introduced first; then the translational and rotational invariance of the DMD feature is proven; finally, the DMD-based action recognition framework is proposed.
Dynamic mode decomposition
Given a discrete system \(\zeta _{k+1}=f\left( \zeta _{k}\right) \), where \(\zeta _{k} \in \mathbb {R}^{n}\) is the latent state and \(f(\cdot )\) is the dynamic function, the Koopman operator \(\mathrm {K}\) (Takeishi et al. 2017) is an infinite-dimensional linear operator defined as \(\mathrm {K}(g(\zeta ))=g(f(\zeta ))\) for \(\forall g: \mathrm {M} \rightarrow \mathbb {R}\;(\text {or}\;\mathbb {C})\), where \(\mathrm {M}\) is the state space of \(\zeta \), \(\mathbb {R}\) (or \(\mathbb {C}\)) is the real (or complex) set, and \(g(\cdot )\) is the observation function.
It is assumed that \(\mathrm {K}\) has a discrete spectrum, which can be written in the form of infinitely many eigenvalues \(\left\{ \lambda _{1}, \lambda _{2}, \lambda _{3}, \cdots \right\} \) and eigenfunctions \(\left\{ \phi _{1}, \phi _{2}, \phi _{3}, \cdots \right\} \) with the relation \(\mathrm {K} \phi _{i}=\lambda _{i} \phi _{i}\). The observation function expanded on the eigenfunctions is \(g(\zeta )=\sum _{i} \phi _{i}(\zeta ) c_{i}\), i.e., \(g\left( \zeta _{k}\right) =\sum _{i} \lambda _{i}^{k} \phi _{i}\left( \zeta _{0}\right) c_{i}\). The Koopman operator thus lifts a finite-dimensional nonlinear system to an infinite-dimensional linear one. In practice, it is approximated with a finite linear system from \(K+1\) sequential samples by seeking a state transition matrix \(\varvec{A} \in \mathbb {R}^{n \times n}\) that satisfies
\[ \varvec{H}_{2}=\varvec{A} \varvec{H}_{1}, \quad (1) \]
where \(\varvec{H}_{1}=\left[ g\left( \zeta _{1}\right) , \cdots , g\left( \zeta _{K}\right) \right] \) and \(\varvec{H}_{2}=\left[ g\left( \zeta _{2}\right) , \cdots , g\left( \zeta _{K+1}\right) \right] \).
DMD is the most widely used method to calculate A.
Performing a singular value decomposition on \(\varvec{H}_{1}\), we have the following:
\[ \varvec{H}_{1}=\varvec{U} \varvec{\varSigma } \varvec{V}^{T}, \quad (2) \]
where \(\varvec{U} \in \mathbb {R}^{n \times n}\), \(\varvec{\varSigma } \in \mathbb {R}^{n \times K}\), and \(\varvec{V} \in \mathbb {R}^{K \times K}\). The diagonal elements of \(\varvec{\varSigma }\) are the singular values sorted in descending order, and all off-diagonal elements are 0. We can then obtain the similar matrix of \(\varvec{A}\) as follows:
\[ \varvec{\tilde{A}}=\varvec{U}^{T} \varvec{A} \varvec{U}=\varvec{U}^{T} \varvec{H}_{2} \varvec{V} \varvec{\varSigma }^{\dagger }. \]
\(\varvec{A}\) and \(\varvec{\tilde{A}}\) have the same eigenvalues.
Considering that the response of a dynamic system is mainly determined by its low-frequency parts, only the first r singular values are typically reserved to describe the feature of the system in practice, where \(r \ll K\). Let \(\varvec{U}_{r} \in \mathbb {R}^{n \times r}\), \(\varvec{V}_{r} \in \mathbb {R}^{K \times r}\), and \(\varvec{\varSigma }_{r} \in \mathbb {R}^{r \times r}\) be the left-top submatrices of \(\varvec{U}\), \(\varvec{V}\), and \(\varvec{\varSigma }\) after truncation, respectively. We can then obtain the approximated state transition matrix
\[ \varvec{\tilde{A}}_{r}=\varvec{U}_{r}^{T} \varvec{H}_{2} \varvec{V}_{r} \varvec{\varSigma }_{r}^{-1} \quad (3) \]
and its eigenvalues \(\tilde{\lambda }_{i}\), \(i=1,2, \cdots , r\).
The state matrix \(\varvec{\tilde{A}}_{r}\) determines the dynamic response of the system, including stability, response speed, and overshoot. The \(\tilde{\lambda }_{i}\) are the poles of the approximate linear closed-loop system and determine its stability. Thus, both \(\varvec{\tilde{A}}_{r}\) and \(\tilde{\lambda }_{i}\) can serve as an empirical feature for SAR. The feature dimension is \(r^{2}\) for the flattened \(\varvec{\tilde{A}}_{r}\) and 2r for the stacked \(\tilde{\lambda }_{i}\) (r real parts and r imaginary parts). \(\tilde{\lambda }_{i}\) is shorter, whereas \(\varvec{\tilde{A}}_{r}\) contains more information. The experimental results in the following section show no clear performance distinction between them.
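The two feature variants can be computed in a few lines. The following NumPy sketch is illustrative (the function name, the economy-size SVD, and the default rank are our choices, not prescribed here):

```python
import numpy as np

def dmd_feature(G, r=3, use_eigenvalues=False):
    """Sketch of the DMD feature for one normalized skeleton trajectory.

    G : (3S, K+1) array whose columns are the stacked skeleton points
        eta_1 ... eta_{K+1}.
    r : truncation rank (r << K).
    Returns the flattened r x r matrix (length r**2) or the stacked
    eigenvalues (length 2r: r real parts and r imaginary parts).
    """
    H1, H2 = G[:, :-1], G[:, 1:]               # shifted snapshot matrices
    U, s, Vt = np.linalg.svd(H1, full_matrices=False)
    Ur, Sr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r].T
    A_r = Ur.T @ H2 @ Vr @ np.linalg.inv(Sr)   # r x r reduced system matrix
    if use_eigenvalues:
        lam = np.linalg.eigvals(A_r)
        return np.concatenate([lam.real, lam.imag])
    return A_r.ravel()
```

For a trajectory of S joints over K+1 frames, G has shape (3S, K+1) after the normalization introduced in the next subsection.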
Translational and rotational invariance of DMD
For an action sample, the feature for SAR should remain consistent when the sensor moves or rotates. With a simple normalization method, DMD can satisfy this requirement theoretically, that is, translational and rotational invariance.
Two STs of the same action captured by different cameras are denoted as follows:
\[ \varvec{G}^{1}=\left[ \varvec{\eta }_{1}^{1}, \varvec{\eta }_{2}^{1}, \cdots , \varvec{\eta }_{K+1}^{1}\right] \]
and
\[ \varvec{G}^{2}=\left[ \varvec{\eta }_{1}^{2}, \varvec{\eta }_{2}^{2}, \cdots , \varvec{\eta }_{K+1}^{2}\right] . \quad (4) \]
\(\varvec{G}^{1}\) and \(\varvec{G}^{2}\) are captured by cameras with fixed coordinate frames \(O_{1} x_{1} y_{1} z_{1}\) and \(O_{2} x_{2} y_{2} z_{2}\). The \(s^{t h}\) skeleton joint at step j captured by camera i is denoted as \(\varvec{p}_{s, j}^{i}=\left[ x_{s, j}^{i}, y_{s, j}^{i}, z_{s, j}^{i}\right] \). Then the spatial coordinates of all S skeleton points can be stacked as \(\varvec{\eta }_{j}^{i}=g^{i}\left( \varvec{\zeta }_{j}\right) =\left[ \varvec{p}_{1, j}^{i}, \varvec{p}_{2, j}^{i}, \cdots , \varvec{p}_{S, j}^{i}\right] ^{T} \in \mathbb {R}^{3 S \times 1}\).
The transfer matrix from \(O_{1} x_{1} y_{1} z_{1}\) to \(O_{2} x_{2} y_{2} z_{2}\) is denoted as
\[ \varvec{T}_{1to2}=\left[ \begin{matrix} \varvec{r}_{1to2} &{} \varvec{l}_{1to2} \\ \varvec{0} &{} 1 \end{matrix} \right] , \]
where \(\varvec{r}_{1to2}\) is the rotation matrix and \(\varvec{l}_{1to2}\) is the translation vector. \(\varvec{r}_{1to2}\) and \(\varvec{l}_{1to2}\) satisfy
\[ \varvec{p}_{s, k}^{2}=\varvec{r}_{1to2} \varvec{p}_{s, k}^{1}+\varvec{l}_{1to2} \quad (5) \]
for \(s=1,2, \cdots , S\) and \(k=1,2, \cdots , K+1\).
The translational and rotational invariance of DMD means that \(\varvec{\tilde{A}}_{r}^{1}=\varvec{\tilde{A}}_{r}^{2}\). This property is proven as follows.
Proof
Based on (5), we have
\[ \varvec{\eta }_{k}^{2}=\varvec{R} \varvec{\eta }_{k}^{1}+\varvec{L}, \]
where
\[ \varvec{R}={\text {diag}}\underbrace{\left( \varvec{r}_{1to2}, \cdots , \varvec{r}_{1to2}\right) }_{S \text { blocks}} \in \mathbb {R}^{3S \times 3S}, \quad \varvec{L}=\underbrace{\left[ \varvec{l}_{1to2}, \cdots , \varvec{l}_{1to2}\right] }_{S \text { blocks}}{\!}^{T} \in \mathbb {R}^{3S \times 1}. \]
Normalize \(\varvec{\eta }^{i}\) with
\[ \overline{\varvec{\eta }}_{k}^{i}=\varvec{\eta }_{k}^{i}-\varvec{L}_{0}^{i}, \quad (6) \]
where \(\varvec{L}_{0}^{i}=\underbrace{\left[ \varvec{p}_{1,1}^{i}, \varvec{p}_{1,1}^{i}, \cdots , \varvec{p}_{1,1}^{i}\right] }_{S \text { blocks}}{\!}^{T}\). With \(\varvec{R} \varvec{L}_{0}^{1}=\varvec{L}_{0}^{2}-\varvec{L}\), we can then obtain the following relation:
\[ \overline{\varvec{\eta }}_{k}^{2}=\varvec{R} \varvec{\eta }_{k}^{1}+\varvec{L}-\varvec{L}_{0}^{2}=\varvec{R}\left( \varvec{\eta }_{k}^{1}-\varvec{L}_{0}^{1}\right) . \]
Thus,
\[ \overline{\varvec{\eta }}_{k}^{2}=\varvec{R} \overline{\varvec{\eta }}_{k}^{1}. \]
Let \(\varvec{H}_{1}^{i}=\left[ \overline{\varvec{\eta }}_{1}^{i}, \overline{\varvec{\eta }}_{2}^{i}, \cdots , \overline{\varvec{\eta }}_{K}^{i}\right] \) and \(\varvec{H}_{2}^{i}=\left[ \overline{\varvec{\eta }}_{2}^{i}, \overline{\varvec{\eta }}_{3}^{i}, \cdots , \overline{\varvec{\eta }}_{K+1}^{i}\right] \); then
\[ \varvec{H}_{1}^{2}=\varvec{R} \varvec{H}_{1}^{1}, \quad \varvec{H}_{2}^{2}=\varvec{R} \varvec{H}_{2}^{1}. \]
The system matrices can be obtained with Eqs. (1) and (2) as follows:
\[ \varvec{A}^{i}=\varvec{H}_{2}^{i} \varvec{V}^{i} \varvec{\varSigma }^{i\dagger } \varvec{U}^{iT}. \]
Then, from \(\varvec{H}_{1}^{2}=\left( \varvec{R} \varvec{U}^{1}\right) \varvec{\varSigma }^{1} \varvec{V}^{1T}\), we have \(\varvec{U}^{2}=\varvec{R} \varvec{U}^{1}\), \(\varvec{\varSigma }^{2}=\varvec{\varSigma }^{1}\), and \(\varvec{V}^{2}=\varvec{V}^{1}\), and therefore
\[ \varvec{A}^{2}=\varvec{R} \varvec{H}_{2}^{1} \varvec{V}^{1} \varvec{\varSigma }^{1\dagger } \varvec{U}^{1T} \varvec{R}^{T}=\varvec{R} \varvec{A}^{1} \varvec{R}^{T}. \]
Their similar matrices are as follows:
\[ \varvec{\tilde{A}}_{r}^{i}=\varvec{U}_{r}^{iT} \varvec{H}_{2}^{i} \varvec{V}_{r}^{i} \varvec{\varSigma }_{r}^{i-1}. \]
As the rotation matrix is orthogonal and satisfies \(\varvec{R}^{T}=\varvec{R}^{-1}\), we can obtain the following:
\[ \varvec{\tilde{A}}_{r}^{2}=\varvec{U}_{r}^{1T} \varvec{R}^{T} \varvec{R} \varvec{H}_{2}^{1} \varvec{V}_{r}^{1} \varvec{\varSigma }_{r}^{1-1}=\varvec{\tilde{A}}_{r}^{1}. \]
Thus, there exists \(\varvec{\tilde{A}}_{r}^{2}=\varvec{\tilde{A}}_{r}^{1}\). \(\square \)
From the proof above, DMD guarantees rotational invariance inherently, and the normalization method in Eq. (6) guarantees translational invariance. Thus, the normalization is a necessary preprocessing step for the DMD feature. Some other normalization methods can also guarantee translational invariance, but they have disadvantages. For instance, \(\overline{\varvec{\eta }}^{i}=(\varvec{\eta }^{i}-\varvec{\eta }^{0})/(\varvec{\eta }^{K}-\varvec{\eta }^{0})\) or \(\overline{\varvec{\eta }}^{i}=(\varvec{\eta }^{K}-\varvec{\eta }^{i})/(\varvec{\eta }^{K}-\varvec{\eta }^{0})\) can normalize the trajectory into [0, 1] and satisfy translational and rotational invariance. However, when \(\varvec{\eta }^{K}=\varvec{\eta }^{0}\), they are not applicable.
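A minimal sketch of the normalization in Eq. (6), assuming the trajectory is stored as a (3S, K+1) array whose first three rows hold the first skeleton joint (the function name and array layout are illustrative):

```python
import numpy as np

def normalize_trajectory(eta):
    """Subtract the first joint's position at the first frame from every
    joint at every frame (Eq. (6)), removing the camera translation.

    eta : (3S, K+1) array; rows 0:3 hold the first skeleton joint.
    """
    S3 = eta.shape[0]
    p0 = eta[0:3, 0]                      # first joint, first frame
    L0 = np.tile(p0, S3 // 3)[:, None]    # stacked into a (3S, 1) block L_0
    return eta - L0
```

Because the same offset is subtracted from every frame, two trajectories that differ only by a rigid translation normalize to the same result.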
DMD feature for SAR with QFSS
Deep CNNs can extract more information than DMD because of their large number of parameters, and they are the standard solution for SAR when the training set is adequate. However, many QFSS SAR tasks do not have adequate training samples, so training a deep CNN directly is impossible. To address this problem, we design a framework that improves the performance of CNN features on QFSS SAR tasks with the empirical DMD feature.
Denote a skeleton trajectory that has been normalized according to formula (6) in the form of a matrix
\[ \varvec{G}=\left[ \varvec{\eta }_{1}, \varvec{\eta }_{2}, \cdots , \varvec{\eta }_{K+1}\right] , \]
where \(\varvec{\eta }_{j}=\left[ \varvec{p}_{1, j}, \varvec{p}_{2, j}, \cdots , \varvec{p}_{S, j}\right] ^{T} \in \mathbb {R}^{3 S \times 1}\) is the stacked skeleton points with \(\varvec{p}_{s, j}=\left[ x_{s, j}, y_{s, j}, z_{s, j}\right] \) at step j. Then, we have
\[ \varvec{H}_{1}=\left[ \varvec{\eta }_{1}, \cdots , \varvec{\eta }_{K}\right] , \quad \varvec{H}_{2}=\left[ \varvec{\eta }_{2}, \cdots , \varvec{\eta }_{K+1}\right] . \]
By substituting \(\varvec{H}_1\) and \(\varvec{H}_2\) into Eqs. (1, 2, and 3), we can obtain the DMD feature \(v_{DMD}\) of \(\varvec{G}\) as the flattened \(\varvec{\tilde{A}}_{r}\) or the stacked eigenvalues \(\tilde{\lambda }_{i}\).
Inputting \(\varvec{G}\) into a CNN, the output is the CNN feature \(v_{CNN}\).
Then, a DMD-based SAR framework can be established, as depicted in Fig. 1, with the following five components:

(1) a human pose estimation module, for instance, OpenPose or mmpose, that obtains skeleton trajectories from video clips;

(2) normalization of the ST according to formula (6) to obtain \(\varvec{G}\);

(3) a CNN feature extractor to obtain \(v_{CNN}\);

(4) a DMD feature extractor to obtain \(v_{DMD}\);

(5) a final classifier to predict the action category.
As a trajectory can be approximately recovered from its modes and eigenvectors (Takeishi et al. 2017), DMD serves as an encoder in the framework. The physical meaning of the DMD feature is clear, and the feature is compact and informative. Although the order truncation and linearization may lose some information, the DMD feature is useful when the training set is not adequate.
The rank of DMD for a SAR task is often less than 10, and the length of \(v_{DMD}\) is less than 100. When using \(v_{DMD}\) together with \(v_{CNN}\), the increased computation is negligible.
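The concatenation step of the framework can be sketched in PyTorch as follows; the 256-dimensional CNN feature and the single linear classification layer follow the experimental setup described later, while the module name and the dummy backbone interface are our assumptions:

```python
import torch
import torch.nn as nn

class DMDAugmentedClassifier(nn.Module):
    """Sketch of the framework in Fig. 1: a CNN feature (length 256) is
    concatenated with the DMD feature (length r**2 for the flattened
    matrix) before one linear classification layer. `backbone` is any
    feature extractor returning a (batch, 256) tensor."""

    def __init__(self, backbone, dmd_len=9, num_classes=40):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(256 + dmd_len, num_classes)

    def forward(self, g, v_dmd):
        v_cnn = self.backbone(g)               # (batch, 256)
        v = torch.cat([v_cnn, v_dmd], dim=1)   # (batch, 256 + dmd_len)
        return self.fc(v)
```

Because \(v_{DMD}\) is computed once per trajectory without training, only the backbone and the final linear layer carry learnable parameters.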
Considering the perspective transformation in the image acquisition of RGB videos, the DMD feature of a 2D skeleton trajectory cannot guarantee translational and rotational invariance. However, in applications where the position and pose of the camera are fixed, the distortion of skeletons can be treated as part of the action itself, and thus, the DMD feature is also suitable.
Experiments and analysis
To analyze the performance of the DMD feature comprehensively, two groups of experiments are conducted on three datasets. In one group, the DMD feature is used alone to analyze its intrinsic properties. The matrix feature and eigenvalue feature of DMD are compared with a basic LSTM on the CMU and Badminton datasets. The other group focuses on the auxiliary performance of the DMD feature. Five CNN-based methods, namely, ST-GCN (Yan et al. 2018), TCN (Kim and Reiter 2017), part-aware LSTM (P-LSTM) (Shahroudy et al. 2016), ResNet18 (He et al. 2016), and basic LSTM (Graves 2012), have been chosen for comparison. DGNN (Shi et al. 2019), which used to be the state of the art on the NTU and Kinetics datasets, failed to converge on the miniNTU dataset; thus, we do not present its results. ResNet18 is a special realization of TCN whose backbone is a one-channel residual network with a much deeper architecture than the other methods. The basic LSTM contains three layers, and each layer contains 100 neurons. In all methods, we adjusted the feature length to 256 and the output layer to a linear fully connected layer with 256 inputs and 4 (CMU and Badminton) or 40 (miniNTU) outputs. All methods are trained with randomly initialized parameters.
Datasets
The DMD feature is an empirical feature with limited length, and it does not have the strong expressive ability of a CNN feature. Thus, the motivation of this work is to explore the applicable scenarios of DMD rather than to seek state-of-the-art accuracy. We have chosen three datasets with very different properties to analyze DMD fully.
(1) CMU dataset. The CMU dataset (CMU 2013) is a classic motion capture dataset, in which 29 skeleton points are measured by wearable devices. Thus, its precision is much higher than that of other datasets. We divided a subset from the CMU dataset that includes dancing, jumping, running, and walking actions. Figure 2 shows some samples of the CMU dataset. A total of 119 samples are used for training and testing, and their distribution is listed in Table 1. We removed 4 unnecessary skeleton joints so that it shares NTU's data loader. CMU is easier than the other datasets.
(2) Badminton dataset. The Badminton dataset is a self-established dataset used to illustrate the applicability of the DMD feature to 2D STs in a fixed scene. This dataset also contains four categories of actions, namely, backhand striking, forehand striking, backhand lifting, and forehand lifting. The 2D skeleton trajectories are obtained from video clips with the human pose estimation framework mmpose (OpenMMLab 2019). Some failed frames are replenished with linear interpolation. A badminton action contains three stages: moving toward the shuttlecock, hitting, and returning to the defensive position. In addition, some athletes hold the racket in the right hand and the rest in the left, which makes the actions more indistinguishable. Figure 3 shows some samples. The training set contains 30 trajectories for each type, and the test set contains 12, 10, 10, and 13 trajectories for the four types, respectively. We only considered the athlete in the lower half of the field by limiting the detection region of the feet. In this dataset, the skeleton contains 17 joints. The distortion of the 2D skeleton by perspective transformation, the consistency of the athletes' movements, and the confusion of dominant hands between athletes make it much more difficult than the CMU dataset.
(3) miniNTU dataset. NTU (Li et al. 2017b) is a widely used large-scale dataset for SAR. It contains 60 categories of actions, 20 of which involve multiple humans. The skeletons are captured with three RGBD sensors located at different poses. In this work, we considered the 40 types of actions with only one human and chose 30 training and 10 test samples for each type of action so that the QFSS is satisfied. As with the standard NTU dataset, we established cross-subject and cross-view subsets. The former means that the humans in the training set and test set are different; the latter means that the trajectories in the training set and test set are captured by different cameras. This dataset is much more difficult than CMU and Badminton.
Another widely used dataset is Kinetics (Kay et al. 2017), which is even more difficult than NTU. Its skeleton trajectories cannot satisfy the requirement of translational and rotational invariance, and thus, we have not tested on the Kinetics dataset.
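For the Badminton dataset above, the linear interpolation used to replenish failed pose estimation frames can be sketched as follows; the frame-validity mask and array layout are assumptions, since the exact implementation is not specified:

```python
import numpy as np

def fill_failed_frames(traj, valid):
    """Linearly interpolate pose-estimation failures per joint coordinate.

    traj  : (T, J, 2) array of 2D joints; failed frames may hold garbage.
    valid : (T,) boolean mask, True where the pose estimate succeeded.
    """
    t = np.arange(len(traj))
    out = traj.copy()
    for j in range(traj.shape[1]):
        for c in range(traj.shape[2]):
            # replace invalid frames by interpolating between valid ones
            out[~valid, j, c] = np.interp(t[~valid], t[valid],
                                          traj[valid, j, c])
    return out
```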
SAR based on DMD feature and ovoSVM
In this group, the intrinsic properties of DMD were explored. Because the miniNTU dataset is too difficult to fully exhibit the properties of the DMD feature, we only conducted experiments on the CMU and Badminton datasets. The contrast experiments were considered from several aspects.
First, as both DMD and LSTM can directly utilize temporal information, we designed a DMD+ovoSVM framework as the realization of the DMD feature and chose a basic shallow LSTM for comparison. A shallow CNN performs much more poorly than DMD+ovoSVM and LSTM because it cannot extract temporal information; thus, we did not compare DMD+ovoSVM with a shallow CNN in this group. The DMD+ovoSVM framework is a simple realization of Fig. 1, in which the classifier is a one-versus-one SVM (ovoSVM) with radial basis function (RBF) kernels, and the DMD feature is input into the ovoSVM without concatenation with any CNN feature. Considering that the length of many skeleton trajectories in the Badminton dataset is less than 40, we limited the DMD rank to be smaller than 7, and the RBF kernel coefficient of the ovoSVM ranges from 0.1 to 300. Second, both types of DMD features mentioned above, that is, the flattened matrix and the stacked eigenvalues, were considered. Finally, to explore computation-reducing methods, half-truncation and four-joints tricks were tested. The former truncates trajectories from the middle, inspired by the fact that all movements in Badminton contain a recovering process. The latter reduces the skeleton to 4 joints, including the two wrists and two ankles, because the limbs' movement range is relatively larger than the body's.
Table 2 shows all optional hyperparameter configurations of DMD+ovoSVM. A uniformly distributed noise is added to the trajectories to augment the training and test sets, and 10 duplicates of each trajectory are generated. The noise is defined as \(x\rightarrow (1+0.05\varDelta )\cdot x\), where \(\varDelta \sim U(-1,1)\). The LSTM has three linear fully connected layers with 100 neurons each to extract the feature, and another linear fully connected layer is used to predict the action categories. The input of the LSTM is truncated or zero-padded to 200 frames for CMU and 40 for Badminton. The appropriate hyperparameters in Table 2, including truncation, augmentation, and four joints, are also used for LSTM.
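The augmentation rule \(x\rightarrow (1+0.05\varDelta )x\) with \(\varDelta \sim U(-1,1)\) can be sketched as follows (function and parameter names are illustrative):

```python
import numpy as np

def augment_with_noise(G, copies=10, scale=0.05, rng=None):
    """Generate `copies` noisy duplicates of a trajectory G using
    multiplicative uniform noise x -> (1 + scale*Delta)*x, Delta ~ U(-1, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.uniform(-1.0, 1.0, size=(copies,) + G.shape)
    return (1.0 + scale * delta) * G[None]   # (copies, *G.shape)
```

With the default `scale=0.05`, every duplicated coordinate stays within 5% of its original value, matching the 5% noise level used in the robustness experiments.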
Each configuration is repeated 50 times, and the best results of all configurations are listed in Table 4. Binary classification results on the striking and lifting subsets of Badminton are also presented for reference. It can be found that: (1) LSTM achieved the highest accuracy on CMU with good stability, whereas it achieved the lowest accuracy on Badminton with the worst stability; (2) the matrix feature is preferred on the CMU dataset, whereas the eigenvalue feature is preferred on the Badminton dataset; (3) backhand and forehand lift actions are more difficult to classify than strike actions because of their high similarity.
Figure 4 shows the corresponding distribution of the results in Table 3. In the figures, the flattened matrix and the stacked eigenvalues are denoted as Amat and Mu, respectively. LSTM performs better than DMD+ovoSVM on the CMU dataset but more poorly on the Badminton dataset. The distribution of Amat on the CMU dataset is barbell-shaped, that is, it suffers from a large standard deviation. As DMD extracts the modes of an approximate linear system, the DMD feature is independent of the input and drops some spatial information that is useful for the classification of CMU. A latent temporal condition of Badminton is that lift actions must occur in the frontcourt and strike actions must occur in the backcourt. If this condition could be utilized, the classification results on the complete Badminton dataset would be close to those on the subsets. However, both LSTM and DMD failed to utilize this condition.
To analyze the performance of DMD more comprehensively, we compared the results of different hyperparameters. We computed the accuracy of all executions in the rank test. Figure 5 shows the distribution of accuracy versus the rank r of DMD. The optimal results are achieved for lift and strike actions when \(r=3\) because the difficulty of finding the feature boundary increases with the length of the feature. The results of \(r=2\) for strike actions are poorer than those of \(r>2\), which indicates that the minimum number of low-frequency modes may be insufficient to describe the strike action. A trade-off exists between the rank and the length of the DMD feature, and how to determine the rank for different tasks is an important problem that deserves in-depth investigation.
Figures 6, 7 and 8 show the comparison of the half-trajectory, four-joints, and shuffled-eigenvalue tricks, respectively. The half-trajectory and four-joints tricks do not lead to a loss of accuracy; thus, they can be used to reduce the computation significantly in some tasks. The shuffling operation on eigenvalues has a negative effect on the high-accuracy region but makes the distribution converge to the middle region.
DMD+ovoSVM can achieve its best performance close to that of the shallow LSTM, and its training is faster and more stable. The solving process of DMD+ovoSVM takes only approximately 0.02–0.4 ms when running on an Intel i9-9900K CPU. LSTM takes approximately 2 min for 50 training epochs on the same CPU; the time decreases to 0.2–4 s when running on one NVIDIA RTX 2080 Ti GPU. Since DMD involves singular value decomposition and matrix inversion, a GPU cannot accelerate the computation of DMD. This inability to utilize the GPU is a disadvantage of DMD.
SAR based on DMD feature and CNN feature
In this group, we considered the auxiliary role of the DMD feature for some popular deep CNNs, including ST-GCN, TCN, ResNet18, basic LSTM, and P-LSTM. According to the framework in Fig. 1, the DMD feature in the form of a flattened matrix is concatenated with the CNN feature extracted by one of those deep CNNs and input into a linear fully connected layer for classification. No trick in Table 2 is used in this group. Table 4 shows all configurations for this group of experiments.
Tables 5 and 6 present the results on the CMU and Badminton datasets and on the miniNTU datasets, respectively. We collected the mean, maximum, and standard deviation of accuracy from 20 executions of ST-GCN and ST-GCN+DMD on the miniNTU dataset and 50 executions of the others. The results of DMD+ovoSVM are also presented for reference. The results of TCN, LSTM, and P-LSTM on miniNTU are (\(mean = 0.025, max = 0.025, std = 0\)), which means that the networks did not converge and always output a fixed prediction. Because the STs obtained by the pose estimation module often suffer from instability, the robustness of the DMD feature should be analyzed. Thus, we augmented the training and test sets 10 times with 5% uniformly distributed random noise according to \(x\rightarrow (1+0.05\varDelta )x\), where \(\varDelta \sim U(-1,1)\). The results are listed in Tables 7 and 8.
From the results, it can be found that:
(1) The DMD feature improves the performance of most methods; in particular, it helps TCN converge on miniNTU-xsub and PLSTM converge on miniNTU-xview. Owing to its deep architecture and multiple convolution layers, ResNet18 can already represent the frequency-domain information; thus, the DMD feature provides no supplementary information for ResNet18. The DMD feature loses some spatial information and is not as complete as a deep CNN feature.
(2) A recurrent architecture can also extract temporal information, but its shallow layers limit its feature expression ability. Thus, LSTM and PLSTM perform better than TCN but much worse than STGCN and ResNet18.
(3) ResNet18 dramatically outperforms STGCN on all three datasets, although STGCN is better than ResNet18 on the standard NTU dataset: in our test, the top-1 accuracies of ResNet18 are 79% and 87% on the standard NTU-xsub and NTU-xview datasets, whereas STGCN achieves 81.3% and 89.1%. The results of STGCN+DMD are very close to those of STGCN, which means that DMD provides no additional information for STGCN. With the predefined relations of the human skeleton, STGCN has a stronger ability to extract spatial and temporal information than ResNet18. When training samples are adequate, the spatial relations between joints bring more benefit than the frequency-domain information; however, when samples are inadequate, as in QFSS tasks, the predefined relations fail to perform fully.
(4) Although noise evidently degrades the performance of all methods, the auxiliary function of DMD persists even when 5% noise exists.
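The 5% multiplicative noise model used in this robustness test, \(x \rightarrow (1 + 0.05\varDelta)x\) with \(\varDelta \sim U(-1, 1)\), can be sketched as follows; the `augment` helper and the toy trajectory are illustrative.

```python
import numpy as np

def augment(X, copies=10, level=0.05, rng=None):
    """Multiplicative noise augmentation x -> (1 + level*Delta)*x,
    with Delta drawn elementwise from U(-1, 1). Returns `copies`
    noisy versions of X stacked along a new leading axis."""
    if rng is None:
        rng = np.random.default_rng()
    out = []
    for _ in range(copies):
        delta = rng.uniform(-1.0, 1.0, size=X.shape)
        out.append((1.0 + level * delta) * X)
    return np.stack(out)

X = np.ones((10, 30))                        # toy 10-joint, 30-frame trajectory
aug = augment(X, copies=10, level=0.05, rng=np.random.default_rng(0))
print(aug.shape)                             # (10, 10, 30)
```

With `X` set to all ones, every augmented value stays within \([0.95, 1.05]\), matching the 5% noise level.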
A deeper GCN, which combines the advantages of a deep architecture and part-aware knowledge, would achieve better performance; however, it requires more samples and stronger computing power. When a deep architecture cannot be deployed, for instance, on embedded neural computing devices or for lack of training samples, the DMD feature can be used to assist a simpler CNN feature to achieve higher accuracy.
Conclusion
The DMD feature for SAR is studied in this work. This feature has a clear physical meaning in the frequency domain and guarantees translational and rotational invariance under an appropriate normalization. Used alone in SAR tasks, the DMD feature achieves performance close to that of a shallow LSTM. A DMD-based SAR framework is proposed, in which the DMD feature is concatenated with a CNN feature. The DMD feature evidently improves the accuracy of CNN features in QFSS SAR tasks at a small computational cost, even when 5% noise exists; in particular, DMD helps TCN converge on the miniNTU-xsub dataset and PLSTM converge on the miniNTU-xview dataset. Because a GPU cannot accelerate the calculation of DMD, the DMD-based SAR framework cannot be combined into an end-to-end framework. Thus, one of our future works is to find a realization of DMD on the GPU, for instance, by training a CNN to extract the modes. Furthermore, as the DMD feature only represents the modes of an approximated linear system and loses some spatial information, another problem that deserves further research is to explore empirical spatial features that can eliminate the information loss of DMD.
References
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: IEEE conference on computer vision and pattern recognition, pp 7291–7299
CMU (2013) CMU graphics lab motion capture database
Diba A, Fayyaz M, Sharma V, Karami AH, Arzani MM, Yousefzadeh R, Gool LV (2017) Temporal 3D ConvNets: new architecture and transfer learning for video classification. arXiv:1711.08200, pp. 1–10
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional twostream network fusion for video action recognition. IEEE computer society conference on computer vision and pattern recognition. pp. 1933–1941. https://doi.org/10.1109/CVPR.2016.213
Feichtenhofer C, Fan H, Malik J, He K (2018) Slowfast networks for video recognition. In: IEEE/CVF international conference on computer vision, pp. 6201–6210
Graves A (2012) Long shortterm memory. Springer, Berlin, pp 37–45
Guo M, Chou E, Huang DA, Song S, Yeung S, FeiFei L (2018) Neural graph matching networks for fewshot 3D action recognition. European conference on computer vision. Munich, Germany, pp. 673–689
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition. Las Vegas, USA, pp 770–778
Holzinger A, Malle B, Saranti A, Pfeifer B (2021) Towards multimodal causability with Graph Neural Networks enabling information fusion for explainable AI. Inf Fusion 71:28–37. https://doi.org/10.1016/j.inffus.2021.01.008
Hou Y, Li Z, Wang P, Li W (2018) Skeleton optical spectrabased action recognition using convolutional neural networks. IEEE Trans Circuits Syst Video Technol 28(3):807–811. https://doi.org/10.1109/TCSVT.2016.2628339
Jasani B, Mazagonwalla A (2019) Skeleton based zero shot action recognition in joint pose-language semantic space. arXiv:1911.11344, pp. 1–8
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The Kinetics human action video dataset. arXiv:1705.06950, pp. 1–22
Kim TS, Reiter A (2017) Interpretable 3D human action analysis with temporal convolutional networks. IEEE conference on computer vision and pattern recognition workshops, pp. 1623–1631
Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv:1806.11230, pp. 1–19
Li B, He M, Cheng X, Chen Y, Dai Y (2017a) Skeleton based action recognition using translationscale invariant image mapping and multiscale deep CNN. In: IEEE international conference on multimedia and expo workshops, pp. 601–604
Li C, Zhong Q, Xie D, Pu S (2017b) Skeleton-based action recognition with convolutional neural networks. IEEE international conference on multimedia and expo workshops. Hong Kong, China, pp. 597–600
Li L, Zheng W, Zhang Z, Huang Y, Wang L (2019) Relational network for skeletonbased action recognition. IEEE international conference on multimedia and expo, pp .826–831. arXiv: 1805.02556v1
Lin J, Gan C, Han S (2019) TSM: temporal shift module for efficient video understanding. In: IEEE/CVF international conference on computer vision (ICCV), pp. 7082–7092. https://doi.org/10.1109/ICCV.2019.00718
Lin J, Gan C, Wang K, Han S (2020) TSM: Temporal shift module for efficient and scalable video understanding on edge devices. IEEE transactions on pattern analysis and machine intelligence, p. 1, https://doi.org/10.1109/TPAMI.2020.3029799
Liu J, Wang G, Hu P, Duan LY, Kot AC (2017) Global context-aware attention LSTM networks for 3D action recognition. IEEE conference on computer vision and pattern recognition, pp. 1647–1656
Liu R, Shen J, Wang H, Chen C, Cheung SC, Asari V (2020) Attention mechanism exploits temporal contexts: realtime 3D human pose reconstruction. Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp. 5063–5072. https://doi.org/10.1109/CVPR42600.2020.00511
Memmesheimer R, Theisen N, Paulus D (2020) Signal level deep metric learning for multimodal one-shot action recognition. arXiv:2012.13823, pp. 1–7
OpenMMLab (2019) mmpose. https://github.com/open-mmlab/mmpose
Peng W, Hong X, Chen H, Zhao G (2020) Learning graph convolutional network for skeletonbased human action recognition by neural searching. In: AAAI conference on artificial intelligence, New York, USA, pp. 2669–2676. https://doi.org/10.1609/aaai.v34i03.5652
Qiu Z, Yao T, Mei T (2017) Learning spatiotemporal representation with pseudo3D residual networks. IEEE international conference on computer vision. pp. 5534–5542. https://doi.org/10.1109/ICCV.2017.590
Shahroudy A, Liu J, Ng TT, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. IEEE conference on computer vision and pattern recognition. Las Vegas, USA, pp. 1010–1019
Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. IEEE conference on computer vision and pattern recognition. Long Beach, USA, pp. 7912–7921
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeletonbased action recognition. IEEE/CVF conference on computer vision and pattern recognition. Los Angeles CA, United States, pp. 1227–1236
Simon T, Joo H, Matthews I, Sheikh Y (2017) Hand keypoint detection in single images using multiview bootstrapping. In: IEEE conference on computer vision and pattern recognition, pp. 1145–1153
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: 27th International conference on neural information processing systems, pp. 1–11, arXiv:1406.2199
Singh D, Merdivan E, Psychoula I, Kropf J, Hanke S, Geist M, Holzinger A (2017) Human activity recognition using recurrent neural networks. In: Holzinger A, Kieseberg P, Tjoa AM, Weippl E (eds) Machine learning and knowledge extraction. Springer, Cham, pp 267–274
Sumon SA, Shahria MT, Goni MR, Hasan N, Almarufuzzaman AM, Rahman RM (2019) Violent crowd flow detection using deep learning. Springer, Berlin
Takeishi N, Kawahara Y, Yairi T (2017) Learning Koopman invariant subspaces for dynamic mode decomposition. arXiv:1710.04340, pp. 1–18
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. IEEE international conference on computer vision, pp. 4489–4497. https://doi.org/10.1109/ICCV.2015.510
Tran D, Ray J, Shou Z, Chang SF, Paluri M (2017) Convnet architecture search for spatiotemporal feature learning. arXiv:1708.05038, pp. 1–10
Wang H, Schmid C (2013) Action recognition with improved trajectories. IEEE international conference on computer vision, pp. 3551–3558, https://doi.org/10.1109/ICCV.2013.441
Wang H, Wang L (2017) Modeling temporal dynamics and spatial configurations of actions using twostream recurrent neural networks. IEEE conference on computer vision and pattern recognition, pp. 499–508
Wang H, Kläser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79. https://doi.org/10.1007/s11263-012-0594-8
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Gool LV (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision, pp. 20–36
Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: IEEE conference on computer vision and pattern recognition, pp. 4724–4732
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeletonbased action recognition. In: AAAI conference on artificial intelligence, New Orleans, USA, pp. 1–10, arXiv: 1801.07455v2
Zhang S, Liu X, Xiao J (2017) On geometric features for skeletonbased action recognition using multilayer LSTM networks. IEEE winter conference on applications of computer vision, pp. 148–157
Zhao R, Wang K, Su H, Ji Q (2019) Bayesian graph convolution LSTM for skeleton based action recognition. In: IEEE international conference on computer vision, Los Angeles CA, United States, pp. 6881–6891, https://doi.org/10.1109/ICCV.2019.00698
Zhou B, Andonian A, Oliva A, Torralba A (2018) Temporal relational reasoning in videos. In: European conference on computer vision, pp. 803–818
Zhu Y, Li X, Liu C, Zolfaghari M, Xiong Y, Wu C, Zhang Z, Tighe J, Manmatha R, Li M (2020) A comprehensive study of deep video action recognition. arXiv:2012.06567, pp. 1–30
Acknowledgements
This work is supported by National Natural Science Foundation of China (62002053), Natural Science Foundation of Guangdong Province (2021A1515011866), Guangdong Basic and Applied Basic Research Projects (2019A1515111082, 2020A1515110504), Fund for High-Level Talents Afforded by University of Electronic Science and Technology of China, Zhongshan Institute (417YKQ12, 419YKQN15), Social Welfare Major Project of Zhongshan (2019B2010, 2019B2011, 420S36), Achievement Cultivation Project of Zhongshan Industrial Technology Research Institute (419N26), the Science and Technology Foundation of Guangdong Province (2021A0101180005), and Young Innovative Talents Project of Education Department of Guangdong Province (2018KQNCX337, 2019KQNCX186).
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Dong, S., Zhang, W., Wang, W. et al. Action recognition based on dynamic mode decomposition. J Ambient Intell Human Comput (2021). https://doi.org/10.1007/s12652-021-03567-1
Keywords
Skeleton-based action recognition
Dynamic mode decomposition
Quasi-few-shot setting
Translational and rotational invariance