Introduction

Human action recognition has become an active research topic in recent years and has been widely applied in video surveillance, multimedia analysis, smart home applications, and other domains. The data modalities employed for action recognition can be categorized into RGB images [1, 2], optical flow [3,4,5], depth images [6,7,8,9], human skeleton data [10, 11], and data captured by wearable sensors [12, 13]. Among these modalities, skeleton data captured by Kinect sensors provide a compact representation and therefore reduce computational complexity. Moreover, the skeleton representation is unaffected by background noise.

Earlier methods extract discriminative features from skeleton data with Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), treating the skeleton sequence as a vector sequence [10, 11, 14] or a pseudo-image [15,16,17] and thereby ignoring the natural graph structure among the body joints. More recently, graph convolutional networks (GCNs) [18,19,20,21], which can handle data with non-Euclidean structure, have proven highly successful for skeleton-based action recognition. However, these GCN-based methods only extract features and ignore the information contained in the feature distributions.

Fisher Vector (FV) encoding [22] and its variants [23,24,25,26], which are based on a generative model that approximates the global distribution of features, have been successfully applied to image classification and video-based action recognition [27,28,29] over the past few years. However, this powerful algorithm has rarely been combined with GCN-based methods for skeleton-based action recognition.

In this work, we introduce Fisher Vector (FV) encoding into GCNs to exploit the information of feature distributions. Our analysis shows, however, that because a Gaussian Mixture Model (GMM) is employed to fit the global distribution of features, Fisher vector encoding inevitably loses the temporal information of actions. To solve this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) that provides a more discriminative visual representation. Moreover, we propose a two-stream framework (2sTEFV-GCN) that combines the TEFV model with a GCN model to further improve performance. Since TEFV and GCN extract features in fundamentally different ways, this hybrid architecture incorporates the advantages of both and effectively discovers complementary feature information.

The contributions of this work include:

  1. We demonstrate by our analysis that Fisher vector encoding inevitably leads to a loss of temporal information of actions. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide a more discriminative visual representation.

  2. We propose a two-stream model (2sTEFV-GCN) with a complementary feature structure: the TEFV branch enriches and enhances the temporal information of the skeleton data, while the GCN branch mines its non-Euclidean spatial information. The combined model therefore captures richer spatiotemporal information.

  3. On two large-scale datasets, we demonstrate improved performance over state-of-the-art methods.

The rest of this paper is organized as follows. “Related works” reviews related work. “Algorithm” analyzes the temporal information loss problem of FV and describes the proposed TEFV and the 2sTEFV-GCN model in detail. “Experiments” reports the experimental results. Finally, we conclude the paper in “Conclusion”.

Related works

GCN-based action recognition from skeleton data

Recently, owing to their ability to handle data with non-Euclidean structure, graph convolutional networks have been successfully applied to image classification [30], document classification [31], and semi-supervised learning [32].

Since the connections of the body joints can be naturally treated as a graph, the Graph Convolutional Network (GCN) has been applied to skeleton-based action recognition and has proven highly successful. Li et al. [33] proposed the actional-structural GCN, introducing an encoder–decoder structure and capturing higher-order relationships. To efficiently incorporate the joint and bone data, the study in [34] designed a directed GCN and applied a two-stream framework. The work in [35] addressed the part-level action modeling problem and proposed a Part-Level GCN by integrating a part relation block and a part attention block. To overcome the fixed graph structure and the first-order-hop limitation, the work in [36] proposed a Neural Architecture Search GCN using neural architecture search and multiple-hop modules. Since some GCN models are exceedingly sophisticated and computationally expensive, several studies focus on this issue. The work in [37] proposed an effective semantics-guided neural network that introduces the high-level semantics of joints and exploits the relationships of joints through a joint-level module and a frame-level module. The work in [21], the current state-of-the-art method, proposed an efficient GCN with a compound scaling strategy and early-fused multiple input branches; the resulting EfficientGCN achieves high accuracy with a small number of parameters. However, these GCN-based methods only consider extracting features while ignoring the information of feature distributions.

Fisher vector encoding

Fisher vector encoding, originally proposed in [22], is based on a generative model that approximates the global distribution of features. Owing to its power, Perronnin et al. [38] applied Fisher vector encoding to image categorization, where a Gaussian mixture model was adopted as the generative model to approximate the distribution of low-level image features. Following this work, several variants were proposed. Perronnin et al. [23] proposed an improved Fisher vector by adding normalization strategies. Cinbis et al. [25] tackled the i.i.d. assumption of FV and introduced a latent variable model for image classification. Klein et al. [26] adopted a Laplacian Mixture Model and a hybrid Gaussian–Laplacian Mixture Model as generative models for image annotation. Wang et al. [27] applied FV to video-based action recognition, where it outperformed the Bag-of-Features model. Peng et al. [28] designed a stacked Fisher vector architecture to refine the video representation for action recognition. Tang et al. [39, 40] proposed an FV-GCN architecture that combines Fisher vector encoding with GCN for skeleton-based action recognition. The main differences between our work and the works in [39] and [40] are as follows: (1) an improved Fisher vector encoding algorithm, i.e., the temporal enhanced Fisher vector encoding algorithm (TEFV), is proposed in our work to tackle the temporal information loss problem of the classical FV used in [39] and [40]; (2) a two-stream framework (2sTEFV-GCN) is proposed in our work to further improve performance.

Algorithm

Fisher vector encoding

In this section, we introduce Fisher vector encoding [23, 39, 40]. We further demonstrate by our analysis that Fisher vector encoding inevitably leads to a loss of temporal information of actions.

GMM generation

Fisher vector (FV) encoding is a powerful feature-encoding algorithm. A Gaussian Mixture Model (GMM) [41] \(u_\lambda \) is employed as the generative model of the FV encoding architecture. The goal of the GMM is to find a mixture of Gaussian components that fits the distribution of the given feature space; it captures both the mean and the covariance of the feature distribution

$$\begin{aligned} \begin{aligned} u_\lambda (x)=\sum _{i=1}^{K} \omega _i u_i(x), \end{aligned} \end{aligned}$$
(1)

where \(\lambda =\{\omega _i,\mu _i,\delta _i,i=1,2,\ldots ,K\}\) are the parameters, \(\omega _i\in [0,1]\) is the weight of the i-th Gaussian component, and \(\mu _i\in R^D\) and \(\delta _i\in R^D\) are the mean and the diagonal covariance of the i-th Gaussian component, respectively. K is the number of Gaussians.

Consider the feature maps \(S\in R^{T\times N \times C}\) extracted from an action V, where T, N, and C denote the temporal dimension, the number of joints, and the number of channels, respectively. The feature map S is split into T slices \(X=\{x_t \in R^{N\times C}, t=1,2,\ldots ,T\}\) along the temporal dimension T. Given the training set of descriptors X, the parameters of the GMM are estimated by the expectation–maximization (EM) algorithm [42]. The action sample V can then be represented by the gradient of the log-likelihood with respect to the GMM parameters

$$\begin{aligned} \begin{aligned} G_{\lambda }^{x}=\frac{1}{T}\sum _{t=1}^{T}\nabla _\lambda \log u_\lambda (x_t). \end{aligned} \end{aligned}$$
(2)
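As a concrete illustration, the GMM generation step can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes the GCN features are available as NumPy arrays, uses scikit-learn's EM-based GaussianMixture with diagonal covariance, and the shapes T, N, C and the number of Gaussians K are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch of the GMM generation step (Eq. 1): descriptors x_t from the
# training set are pooled into one matrix and a diagonal-covariance GMM is
# fitted with EM. This is an illustrative stand-in, not the authors' code.
T, N, C, K = 64, 25, 256, 1          # frames, joints, channels, number of Gaussians (placeholders)

train_features = [np.random.randn(T, N, C) for _ in range(8)]   # placeholder training samples
X_train = np.concatenate([S.reshape(T, N * C) for S in train_features], axis=0)

gmm = GaussianMixture(n_components=K, covariance_type='diag', max_iter=100)
gmm.fit(X_train)                     # EM estimation of lambda = {w_i, mu_i, delta_i}

weights, means, variances = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)
```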

Feature encoding

Let \({\mathcal {G}}_{\mu ,i}^{x}\) be the gradient of \(\mu _i\) and \({\mathcal {G}}_{\delta ,i}^{x}\) be the gradient of \(\delta _i\) of the i-th Gaussian component, respectively

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\mu ,i}^{x}=\frac{1}{T\sqrt{\omega _i}}\sum _{t=1}^{T}\gamma _t(i)\left( \frac{x_t-\mu _i}{\delta _i}\right) \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\delta ,i}^{x}=\frac{1}{T\sqrt{2\omega _i}}\sum _{t=1}^{T}\gamma _t(i)\left[ \frac{(x_t-\mu _i)^2}{\delta _i^2}-1\right] , \end{aligned} \end{aligned}$$
(4)

where \(\gamma _t(i)\) is the weight of the descriptor \(x_t\) with the i-th Gaussian component

$$\begin{aligned} \begin{aligned} \gamma _t(i)=\frac{\omega _i u_i(x_t)}{\sum _{j=1}^{K}\omega _ju_j(x_t)}. \end{aligned} \end{aligned}$$
(5)

We obtain the final Fisher vector by concatenating all the \({\mathcal {G}}_{\mu ,i}^{x}\) and \({\mathcal {G}}_{\delta ,i}^{x}\) \((i = 1,2, \ldots ,K)\), where K is the number of Gaussians

$$\begin{aligned} \begin{aligned} FV: {\mathcal {G}}_{\lambda }^{x}=[{\mathcal {G}}_{\mu ,1}^{x}, {\mathcal {G}}_{\delta ,1}^{x},\ldots , {\mathcal {G}}_{\mu ,K}^{x}, {\mathcal {G}}_{\delta ,K}^{x}]. \end{aligned} \end{aligned}$$
(6)

Normalization

The result is further normalized in two steps, which was shown to improve performance in [23]. To alleviate the sparsity of the FV, following [23], power normalization is applied first:

$$\begin{aligned} \begin{aligned} {[{\mathcal {G}}_\lambda ^x]}_i \leftarrow \hbox {sign}([{\mathcal {G}}_\lambda ^x]_i)\sqrt{|[{\mathcal {G}}_\lambda ^x]_i|}. \end{aligned} \end{aligned}$$
(7)

To eliminate the impact of the class-independent background information, the feature \({\mathcal {G}}_\lambda ^x\) is further \(L_2\)-normalized

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_\lambda ^x= {\mathcal {G}}_\lambda ^x/\Vert [{\mathcal {G}}_\lambda ^x]\Vert _2. \end{aligned} \end{aligned}$$
(8)
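Under the same assumptions as the fitting sketch above (diagonal-covariance GMM parameters stored in NumPy arrays), the encoding and normalization steps of Eqs. (3)–(8) can be sketched as follows. The helper names fv_gradients and normalize_fv are ours, not from the original implementation, and \(\delta_i\) is taken as the per-dimension standard deviation (the square root of the variances returned by scikit-learn).

```python
import numpy as np

def fv_gradients(X, weights, means, variances):
    """Unnormalized Fisher vector of a descriptor set X of shape (M, D) with
    respect to a K-component diagonal-covariance GMM (Eqs. 3-6)."""
    M, _ = X.shape
    sigma = np.sqrt(variances)                                            # (K, D) standard deviations

    # Soft assignments gamma_t(i), Eq. (5), computed in the log domain for stability
    log_prob = -0.5 * (((X[:, None, :] - means[None]) / sigma[None]) ** 2
                       + np.log(2 * np.pi * variances[None])).sum(-1)     # (M, K)
    log_post = np.log(weights)[None] + log_prob
    gamma = np.exp(log_post - np.logaddexp.reduce(log_post, axis=1, keepdims=True))

    # Gradients w.r.t. means and (diagonal) covariances, Eqs. (3) and (4)
    diff = (X[:, None, :] - means[None]) / sigma[None]                    # (M, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (M * np.sqrt(weights)[:, None])
    g_delta = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (M * np.sqrt(2 * weights)[:, None])

    # Eq. (6): [G_mu_1, G_delta_1, ..., G_mu_K, G_delta_K]
    return np.stack([g_mu, g_delta], axis=1).reshape(-1)

def normalize_fv(fv, eps=1e-12):
    """Power normalization (Eq. 7) followed by L2 normalization (Eq. 8)."""
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + eps)

# Classical FV of one action: all T descriptors are pooled against the global GMM
# fv = normalize_fv(fv_gradients(S.reshape(T, N * C), weights, means, variances))
```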

Temporal information loss problem

For image classification, FV encodes the local features (descriptors) of an image into a super vector that captures global information of the feature distribution, and it has proven highly successful. However, unlike image features, skeleton action features carry temporal information, which is crucial for action recognition. According to Eqs. (3) and (4), \({\mathcal {G}}_{\mu ,i}^{x}\) and \({\mathcal {G}}_{\delta ,i}^{x}\) are obtained by summing and averaging over the temporal dimension T, which inevitably discards the temporal information of actions. The same issue arises in video-based action recognition methods that use FV.

Temporal enhanced Fisher vector encoding

Since the FV cannot capture the temporal variations within an action clip, in this section we present a temporal enhanced FV algorithm (TEFV) to address this issue. The architecture is shown in Fig. 1. The skeleton data are first fed into EfficientGCN [21], and the extracted features are then fed into the TEFV block. The encoded vector passes through two FC layers, with a ReLU activation and a dropout layer (drop probability 0.25) between them; a sketch of this classification head is given after Fig. 1. The softmax scores are finally used to predict the action labels.

Fig. 1
figure 1

The pipeline of temporal enhanced FV architecture (TEFV) for action recognition from skeleton data. \(\oplus \) represents concatenation
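As referenced above, a minimal PyTorch sketch of the classification head that follows the TEFV block is given below. Only the FC–ReLU–Dropout(0.25)–FC structure with softmax-based prediction comes from the text; the hidden width and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TEFVHead(nn.Module):
    """Classification head after the TEFV block: FC -> ReLU -> Dropout(0.25)
    -> FC, with softmax scores used for prediction. Hidden width and class
    count are illustrative assumptions."""
    def __init__(self, fv_dim, num_classes=120, hidden_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(fv_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.25)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, fv):                       # fv: (batch, fv_dim) encoded TEFV features
        return self.fc2(self.drop(self.relu(self.fc1(fv))))   # class logits

    @torch.no_grad()
    def predict(self, fv):
        return self.forward(fv).softmax(dim=-1)  # softmax scores used to predict labels
```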

GMM generation

The feature maps \(F\in R^{T\times N \times C}\) are split into T slices \(X=\{x_t \in R^{N \times C}, t=1, 2, \ldots , T \}\) along the temporal dimension T. Each \(x_t\) is further split into N slices \(x_t=\{x_n(t) \in R^C, n=1, 2, \ldots , N \}\) along the joint dimension N. To retain the temporal information within the action clip, the expectation–maximization (EM) algorithm is used to estimate the GMM parameters of each \(x_t\), rather than the global GMM parameters of the entire feature map X

$$\begin{aligned} \begin{aligned} u_{\lambda _t}(x_t)=\sum _{i=1}^{K}\omega _i u_i(x_t), \end{aligned} \end{aligned}$$
(9)

where \(\lambda _t = \{\omega _i, \mu _i, \delta _i, i=1,2, \ldots , K\}\), and \(\omega _i \in [0,1]\), \(\mu _i \in R^D\), and \(\delta _i \in R^D\) are the weight, mean, and diagonal covariance of the i-th Gaussian component, respectively. K is the number of Gaussians. The final set of GMMs is \(u_\lambda (X)=\{u_{\lambda _t}(x_t), t=1,2, \ldots , T\}\).

Then, the action V can be described by \(G_\lambda ^X=G_{\lambda _1}^{x_1}\oplus G_{\lambda _2}^{x_2}\oplus \cdots \oplus G_{\lambda _T}^{x_T}\), where \(\oplus \) is the concatenation operation and \(G_{\lambda _t}^{x_t}\) is the gradient of the log-likelihood with respect to the parameters of the GMM \(u_{\lambda _t}(x_t)\) of \(x_t\)

$$\begin{aligned} \begin{aligned} G_{\lambda _t}^{x_t}=\frac{1}{N}\sum _{n=1}^N\nabla _{\lambda _t}\log u_{\lambda _t}(x_n(t)). \end{aligned} \end{aligned}$$
(10)

Here, N is the number of descriptors, i.e., the number of joints.

Feature encoding

Specifically, let \({\mathcal {G}}_{\mu ,i}^{x_t}\) be the gradient of \(\mu _i\) and \({\mathcal {G}}_{\delta ,i}^{x_t}\) be the gradient of \(\delta _i\) of the i-th Gaussian component of GMM \(u_{\lambda _t}(x_t)\), respectively

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\mu ,i}^{x_t}=\frac{1}{N\sqrt{\omega _i}}\sum _{n=1}^{N}\gamma _n(t,i)\left( \frac{x_n(t)-\mu _i}{\delta _i}\right) \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\delta ,i}^{x_t}=\frac{1}{N\sqrt{2\omega _i}}\sum _{n=1}^{N}\gamma _n(t,i)\left[ \frac{(x_n(t)-\mu _i)^2}{\delta _i^2}-1\right] , \end{aligned} \end{aligned}$$
(12)

where \(\gamma _n(t,i)\) is the soft assignment of \(x_n(t)\) to Gaussian i of GMM \(u_{\lambda _t}(x_t)\)

$$\begin{aligned} \begin{aligned} \gamma _n(t,i)=\frac{\omega _i u_i(x_n(t))}{\sum _{j=1}^K \omega _ju_j(x_n(t))}. \end{aligned} \end{aligned}$$
(13)

The Fisher vector with respect to the GMM \(u_{\lambda _t}(x_t)\) is the concatenation of the \({\mathcal {G}}_{\mu ,i}^{x_t}\) and \({\mathcal {G}}_{\delta ,i}^{x_t}\) gradient vectors for \(i=1,2, \ldots , K\), where K is the number of Gaussians

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\lambda _t}^{x_t}={\mathcal {G}}_{\mu ,1}^{x_t}\oplus {\mathcal {G}}_{\delta ,1}^{x_t}\oplus \cdots \oplus {\mathcal {G}}_{\mu ,K}^{x_t}\oplus {\mathcal {G}}_{\delta ,K}^{x_t}. \end{aligned} \end{aligned}$$
(14)

The Fisher vector of the entire action clip V can be written as

$$\begin{aligned} \begin{aligned} FV: {\mathcal {G}}_{\lambda }^{X}={\mathcal {G}}_{\lambda _1}^{x_1}\oplus {\mathcal {G}}_{\lambda _2}^{x_2}\oplus \cdots \oplus {\mathcal {G}}_{\lambda _T}^{x_T}, \end{aligned} \end{aligned}$$
(15)

where \(\oplus \) is the concatenation operation. The final Fisher vector is normalized by power normalization and \(L_2\)-normalization. According to Eqs. (11) and (12), since the GMMs \(u_\lambda (X)\) preserve the temporal information of the entire action, our TEFV provides a more discriminative visual representation. This is important for opposite action pairs and for fine-grained action recognition. Opposite action pairs, such as “14 put on jacket” and “15 take off jacket”, or “16 put on a shoe” and “17 take off a shoe”, have similar spatial feature distributions but different temporal orders, and temporal order is crucial for distinguishing these reversed pairs. Temporally fine-grained actions, such as “3 brush teeth” and “4 brush hair”, or “74 counting money”, “75 cutting nails”, and “82 fold paper”, share similar body movements but differ in subtle hand and finger motions. Instead of averaging the gradients along the temporal dimension T, Eqs. (11) and (12) preserve the temporal information of the entire action, which helps to recognize these fine-grained actions. We discuss these cases in detail in Sect. “Comparison with Fisher vector encoding”.
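Putting Eqs. (9)–(15) together, the TEFV encoding of one action can be sketched as below, reusing the fv_gradients and normalize_fv helpers from the FV sketch above. Fitting one scikit-learn GMM per temporal slice is our reading of the text, and the default \(K=1\) follows the setting used in the experiments; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def tefv_encode(F, K=1):
    """TEFV sketch: fit a GMM per temporal slice x_t (Eq. 9), compute the
    per-slice Fisher gradients (Eqs. 11-14), and concatenate them along time
    (Eq. 15) so that the temporal order of the action is preserved. Reuses
    fv_gradients() and normalize_fv() from the FV sketch above."""
    T, N, C = F.shape                                       # frames, joints, channels
    per_slice = []
    for t in range(T):
        x_t = F[t]                                          # N joint descriptors of dimension C
        gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(x_t)
        per_slice.append(fv_gradients(x_t, gmm.weights_, gmm.means_, gmm.covariances_))
    # Final vector: power + L2 normalization of the concatenation (Eqs. 7, 8)
    return normalize_fv(np.concatenate(per_slice))
```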

Two-stream framework

Inspired by the two-stream methods in [19, 20], we propose a two-stream framework (2sTEFV-GCN) that combines the TEFV model with the GCN model to further improve performance. The temporal enhanced Fisher vector encoding forms the TEFV stream, which enriches and enhances the temporal information of the skeleton data. The other stream, i.e., the GCN stream, adopts EfficientGCN [21]. EfficientGCN derives joint, velocity, and bone data from the original 3D skeleton coordinates by data preprocessing. The three input sequences are fed into three separate input branches, each consisting of several GCN and attention blocks. These three branches are then fused and fed into the main stream, which has a similar structure to the input branches. In the GCN stream, the final features are obtained through global average pooling and a fully connected layer for action recognition. Finally, the softmax scores of the TEFV stream and the GCN stream are fused to predict the action labels.
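A minimal sketch of the late score fusion is given below. The paper states only that the softmax scores of the two streams are fused, so the equal weighting used here is an assumption.

```python
import torch

def fuse_scores(tefv_scores, gcn_scores, alpha=0.5):
    """Late fusion of the two streams' softmax scores of shape (batch, num_classes).
    The weight alpha is an assumption; the text only specifies that the two
    sets of scores are fused before prediction."""
    fused = alpha * tefv_scores + (1.0 - alpha) * gcn_scores
    return fused.argmax(dim=-1)          # predicted action labels
```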

Experiments

This section evaluates our TEFV and two-stream framework (2sTEFV-GCN) on two large-scale datasets: NTU RGB+D 60 [43] and NTU RGB+D 120 [44]. We first evaluate the number of Gaussians K. We next compare different normalization strategies, individually and combined. Then, we compare the proposed TEFV with FV to examine its contribution. To further demonstrate the advantage of our TEFV, we also provide class-wise accuracies and a confusion-matrix analysis. Finally, we compare the results of our two-stream temporal enhanced Fisher vector framework (2sTEFV-GCN) with state-of-the-art methods.

Datasets

NTU-RGB+D 60 [43] is a large-scale dataset for RGB+D human action recognition. It contains 56,880 samples of 60 action classes. These actions fall into three major categories: daily actions (e.g., brush hair, drop, put on jacket), two-person interactions (e.g., pushing, point finger, giving object), and medical conditions (e.g., sneeze/cough, fan self). The dataset provides 3D coordinates of 25 joints for each subject.

The authors recommend two benchmarks: (1) cross-subject (X-Sub): since the actions were performed by 40 volunteers, the dataset is split into training and validation sets according to the volunteers. The training set consists of 40,320 samples performed by 20 actors, and the test set consists of 16,560 samples performed by the other 20 actors. (2) Cross-view (X-View): since the actions were captured by three cameras at different horizontal angles, this benchmark uses camera 1 (\(45^\circ \) angle), containing 18,960 samples, for testing, and the other two cameras (\(-45^\circ \) and \(0^\circ \) angles), containing 37,920 samples, for training.

NTU RGB+D 120 [44] is by far the largest dataset for skeleton-based action recognition. It extends NTU-RGB+D 60 with another 60 classes and 57,600 additional samples. The 120 action classes again include daily actions, two-person interactions, and medical conditions. The actions are performed by 106 volunteers across a wide range of ages and captured from 155 different camera viewpoints. The newly added fine-grained actions and the larger variation in subjects, viewpoints, and backgrounds make it very challenging for skeleton-based action recognition.

The authors recommend two benchmarks: (1) cross-subject (X-Sub): the training set consists of 63,026 samples performed by 53 volunteers, and the validation set consists of 50,922 samples performed by the other 53 volunteers. (2) Cross-setup (X-Set): since 32 different setups were used to build the dataset, this benchmark employs the 16 odd setup IDs, containing 54,471 samples, for training and the other 16 setups, containing 59,477 samples, for testing.

Training details

In our experiments, the learning rate is set to 0.0001 with cosine schedule decay. Adam is employed as the optimizer, and cross-entropy is used as the loss function. The maximum number of training epochs is set to 100. Our model is implemented in the PyTorch framework and trained on two Tesla V100 GPUs.
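The training configuration described above can be summarized by the following PyTorch sketch. The model and the synthetic data are placeholders, and the cosine schedule over 100 epochs is our reading of "cosine schedule decay"; only the optimizer, learning rate, loss, and epoch budget come from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the training setup: Adam with an initial learning rate of 1e-4,
# cosine learning-rate decay, cross-entropy loss, and at most 100 epochs.
# The model and the synthetic data below are placeholders.
fv_dim, num_classes = 512, 120
model = nn.Sequential(nn.Linear(fv_dim, 256), nn.ReLU(),
                      nn.Dropout(0.25), nn.Linear(256, num_classes))
train_loader = DataLoader(TensorDataset(torch.randn(64, fv_dim),
                                        torch.randint(0, num_classes, (64,))),
                          batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)  # cross-entropy on class logits
        loss.backward()
        optimizer.step()
    scheduler.step()                               # cosine decay of the learning rate
```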

Evaluation of the number of Gaussians

The key parameter in the GMM aggregation of our TEFV is the number of Gaussians K. The performance on the two datasets with different numbers of Gaussians K, varied from 1 to 9 at equal intervals, is shown in Figs. 2 and 3. Power normalization and \(L_2\) normalization are applied by default for a fair comparison.

Fig. 2
figure 2

Accuracies of TEFV with different numbers of Gaussians K on the NTU-RGB+D 60 dataset

Fig. 3
figure 3

Accuracies of TEFV with different numbers of Gaussians K on the NTU-RGB+D 120 dataset

As shown in Figs. 2 and 3, the larger the GMM size, the lower the recognition performance. Note that this differs from previous work [27,28,29], where the best results are often obtained with a larger number of Gaussians K. We attribute this to the fact that hand-crafted features used for image classification (e.g., SIFT [24]) and video-based action recognition (e.g., IDT, HOF, HOG, MBH [27,28,29]) consist of large numbers of local descriptors, so a larger GMM is required to fit their distribution. In contrast, the features extracted from the last graph convolution block of EfficientGCN are high-level features with far fewer descriptors, and a GMM with a single component is sufficient to fit their distribution. Accordingly, we set the number of Gaussians to \(K=1\) by default in the following experiments.

Evaluation of normalization strategies

We compare different normalization strategies on both the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are presented in Table 1. Our TEFV with Power+\(L_2\) normalization performs best on both datasets under all protocols; therefore, Power+\(L_2\) normalization is adopted in the following experiments. The effect of power normalization is shown in Fig. 4: many values of the \(L_2\)-normalized FV features are crowded around zero, and power normalization effectively alleviates this sparsity.

Table 1 Performance of TEFV with different normalization strategies on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset
Fig. 4
figure 4

Distribution of the values in several dimensions of the normalized Fisher vector (\(K=1\)). a, b show the 25th dimension and the 161st dimension with only \(L_2\) normalization, respectively; c, d show the corresponding dimensions with \(L_2\) and power normalization. These histograms were estimated on the 54,471 training samples of the NTU-RGB+D 120 dataset under the X-Set benchmark

Comparison with Fisher vector encoding

The results of the temporal enhanced FV and FV are summarized in Table 2. Our temporal enhanced FV outperforms FV.

To further validate the advantage of our TEFV, we analyze the class-wise accuracy and confusion matrix on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset. The accuracy of each class is presented in Fig. 5.

Our TEFV outperforms FV in most of the classes. For example, in classes 4, 11, 13, 14, 15, 17, 29, 30, 37, 57, 74, 82, 103, and 106, our TEFV significantly outperforms FV. The confusion matrices of some action classes for our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset are shown in Fig. 6.

In class “14 put on jacket”, 5 samples are misclassified by FV as the class “15 take off jacket”, while all of them are correctly classified by our TEFV. In class “17 take off a shoe”, 63 samples are confused as class “16 put on a shoe” by FV, while only 43 of them are incorrectly classified by our TEFV. These opposite action pairs have similar spatial feature distribution, but different temporal dynamics. These cases demonstrate the advantage of our TEFV, which can preserve the temporal information of the entire sample.

Besides the advantage of preserving the temporal information of the entire sample we mentioned above, the ability to capture fine-grained motions is also an advantage of our TEFV.

In class “4 brush hair”, 6 samples are misclassified by FV as the class “3 brush teeth”, while only 2 of them are incorrectly classified by our TEFV. 2 samples are incorrectly classified by FV as the class “20 put on a hat/cap”, while all of them are correctly classified by our TEFV. In class “37 wipe face”, 8 samples are incorrectly classified by FV as the class “18 put on glasses”, while only 2 of them are misclassified by our TEFV. These samples from different classes share a similar sub-action of raising a hand (or hands) close to the mouth (or the head/face). The key to correctly classifying these samples is recognizing the fine-grained spatial configurations and temporal dynamics of hand and arm skeletons.

In class “11 reading”, 71 samples are confused as class “12 writing” by FV, while only 56 of them are incorrectly classified by our TEFV. In class “29 play with phone/tablet”, 36 samples are confused as the class “30 type on a keyboard” by FV, while only 28 of them are incorrectly classified by our TEFV. 23 samples are misclassified by FV as the class “75 cutting nails”, while only 14 of them are incorrectly classified by our TEFV. 12 samples are incorrectly classified as the class “84 play magic cube” by FV, while only half of them are misclassified by our TEFV. In class “30 type on a keyboard”, 18 samples are confused as the class “29 play with phone/tablet” by FV, while only 13 of them are incorrectly classified by our TEFV. 7 samples are misclassified by FV as the class “84 play magic cube”, while only 2 of them are incorrectly classified by our TEFV. In class “74 counting money”, 36 samples are confused as the class “75 cutting nails” by FV, while only 27 of them are incorrectly classified by our TEFV. 13 samples are incorrectly classified as the class “82 fold paper” by FV, while only 7 of them are misclassified by our TEFV. Since these groups of actions share similar spatial and temporal variations, distinguishing these actions is challenging. The key to correctly classifying these classes is recognizing fine-grained hand and finger dynamics.

In class “14 put on jacket”, 11 samples are misclassified by FV as the class “87 put on bag”, while all of them are correctly classified by our TEFV. In class “15 take off jacket”, 4 samples are confused by FV with the class “16 put on a shoe”, while all of them are correctly classified by our TEFV. 8 samples are misclassified by FV as class “87 put on bag”, while only 2 of them are incorrectly classified by our TEFV. Correctly classifying these classes requires the ability to recognize the fine-grained motions of the hand and body skeletons.

Table 2 Performance of TEFV and FV on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset
Fig. 5
figure 5

The class-wise accuracy on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset. The blue bins and red bins denote the accuracy of our TEFV and FV, respectively

Fig. 6
figure 6

Confusion matrices of some action classes for our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset

All these cases demonstrate that our TEFV not only preserves the temporal information of the entire sample but also captures fine-grained spatial configurations and temporal dynamics.

Although our TEFV outperforms FV in most of the classes, a few actions remain difficult to recognize. For example, in class 56, our TEFV is inferior to FV. The failure numbers of this class for our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset are shown in Table 3.

Table 3 The failure numbers of action “56 giving something to other person” on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset

In class “56 giving something to other person”, 18 samples are misclassified by TEFV as the class “57 touch other person’s pocket”, while only 12 of them are incorrectly classified by FV. “56 giving something to other person” is characterized by two sub-actions: one person hands something over and the other reaches for it. In class “57 touch other person’s pocket”, only one person performs arm and hand movements. Both actions involve significant arm and hand movements with no fine-grained temporal dynamics, so they are well recognized by the classical FV, which employs a GMM to fit the global distribution of features. Over-emphasizing detailed temporal information leads to overfitting for these actions.

Evaluation of the two-stream framework

We verify our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are summarized in Table 4. The two-stream framework outperforms the corresponding single-stream methods.

Comparison with the state-of-the-art

We compare our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) with state-of-the-art methods on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are shown in Tables 5 and 6, respectively.

NTU-RGB+D 60 dataset. On the NTU-RGB+D 60 dataset, we compare our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) with LSTM-based methods (i.e., HBRNN [10], ST-LSTM [11], STA-LSTM [45]), CNN-based methods (i.e., Temporal CNN [16], HCN [46]), GCN-based methods (i.e., ST-GCN [18], RA-GCNv1 [47], RA-GCNv2 [48], AS-GCN [33], 2s-AGCN [19], DGNN [34], PL-GCN [35], NAS-GCN [36], MS-G3D [20], PA-ResGCN-B19 [49], Dynamic-GCN [50], RNXt-GCN [51], EfficientGCN-B4 [21]), and hybrid methods based on two types of networks (i.e., SR-TSL [52], VA-fusion [53], AGC-LSTM [54], and SGN [37]).

Two typical works deserve mention. The first is ST-GCN [18], which innovatively employs GCN for skeleton-based action recognition; compared with this popular model, our two-stream temporal enhanced Fisher vector framework obtains clear improvements on both the X-View and X-Sub benchmarks. The second is EfficientGCN-B4 [21], the recent state-of-the-art method; our 2sTEFV-GCN consistently outperforms EfficientGCN-B4 on both the X-View and X-Sub benchmarks.

NTU-RGB+D 120 dataset. Since NTU-RGB+D 120 was released more recently, fewer results have been reported on it. We compare our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) with LSTM-based methods (i.e., PA-LSTM [43], ST-LSTM [11]), a CNN-based method (i.e., FSNet [55]), GCN-based methods (i.e., ST-GCN [18], RA-GCNv1 [47], RA-GCNv2 [48], AS-GCN [33], 2s-AGCN [19], MS-G3D [20], PA-ResGCN-B19 [49], Dynamic-GCN [50], RNXt-GCN [51], EfficientGCN-B4 [21]), and hybrid methods based on two types of networks (i.e., SR-TSL [52] and SGN [37]). The accuracies are reported in Table 6, where our method achieves superior classification accuracy over these state-of-the-art approaches under all evaluation settings. Notably, our 2sTEFV-GCN outperforms the recent state-of-the-art method, EfficientGCN-B4 [21], on both the X-Set and X-Sub benchmarks.

Table 4 Performance of two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset
Table 5 Performance of SOTA methods on NTU-RGB+D 60 dataset
Table 6 Performance of SOTA methods on NTU-RGB+D 120 dataset

These results on the two large-scale datasets demonstrate the superiority of our 2sTEFV-GCN. We attribute this to the fact that our TEFV algorithm effectively utilizes the information of feature distributions and thus provides information complementary to GCN-based methods, which ignore feature distributions. Since the TEFV algorithm differs from GCN, the hybrid framework (2sTEFV-GCN) incorporates the advantages of both algorithms.

Conclusion

In this paper, we introduce Fisher vector (FV) encoding into GCN to effectively utilize the information of feature distributions. We demonstrate by our analysis that Fisher vector encoding inevitably leads to a loss of temporal information of actions. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) that provides a more discriminative visual representation. Compared with FV, our TEFV not only preserves the temporal information of the entire action but also captures fine-grained spatial configurations and temporal dynamics. Moreover, we propose a two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) that combines the TEFV model with a GCN model to further improve performance. On two large-scale datasets for skeleton-based action recognition, NTU-RGB+D 60 and NTU-RGB+D 120, our model achieves state-of-the-art performance.