Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition

Tang, Jun; Liu, Baodi; Guo, Wenhui; Wang, Yanjiang

doi:10.1007/s40747-022-00914-3

Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition

Original Article
Open access
Published: 25 November 2022

Volume 9, pages 3147–3159, (2023)
Cite this article

Download PDF

You have full access to this open access article

Complex & Intelligent Systems Aims and scope Submit manuscript

Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition

Download PDF

1092 Accesses
1 Citation
Explore all metrics

Abstract

The key to skeleton-based action recognition is how to extract discriminative features from skeleton data. Recently, graph convolutional networks (GCNs) are proven to be highly successful for skeleton-based action recognition. However, existing GCN-based methods focus on extracting robust features while neglecting the information of feature distributions. In this work, we aim to introduce Fisher vector (FV) encoding into GCN to effectively utilize the information of feature distributions. However, since the Gaussian Mixture Model (GMM) is employed to fit the global distribution of features, Fisher vector encoding inevitably leads to losing temporal information of actions, which is demonstrated by our analysis. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide more discriminative visual representation. Compared with FV, our TEFV model can not only preserve the temporal information of the entire action but also capture fine-grained spatial configurations and temporal dynamics. Moreover, we propose a two-stream framework (2sTEFV-GCN) by combining the TEFV model with the GCN model to further improve the performance. On two large-scale datasets for skeleton-based action recognition, NTU-RGB+D 60 and NTU-RGB+D 120, our model achieves state-of-the-art performance.

Short-term path signature for skeleton-based action recognition

Article 27 November 2022

Two-Stream Transformer Encoders for Skeleton-Based Action Recognition

Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Human action recognition has become an active issue in recent years, which has been widely used in video surveillance systems, multimedia analysis, smart home applications, etc. The multiple modalities of data employed in action recognition can be categorized into RGB images [1, 2], optical flows [3,4,5], depth images [6,7,8,9], human skeleton data [10, 11], and data captured by wearable sensors [12, 13]. Among these modalities, skeleton data captured by the Kinect sensors can provide compact representation and therefore reduce computational complexity. Moreover, skeleton representation is independent of background noise.

To extract discriminative features from skeleton data, Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) are utilized in earlier methods. These methods treat the skeleton sequence as a sequence vector [10, 11, 14] or pseudo-image [15,16,17], ignoring the natural graphical connections among the body joints. Recently, since it can handle data with non-Euclidean structures, graph convolutional networks (GCN) [18,19,20,21] are proven to be highly successful for skeleton-based action recognition. However, these GCN-based methods only consider extracting features but ignore information of feature distributions.

Fisher Vector (FV) encoding [22] and its variants [23,24,25,26], based on a generative model which approximates the global distribution of features, have successfully been utilized in image classification, and video-based action recognition [27,28,29] over the past few years. However, this powerful algorithm was rarely utilized in previous GCN-based methods for skeleton-based action recognition.

In this work, we aim to introduce Fisher Vector (FV) encoding into GCN to effectively utilize the information of feature distributions. However, we demonstrate by our analysis that since the Gaussian Mixture Model (GMM) is employed to fit the global distribution of features, Fisher vector encoding inevitably leads to losing temporal information of actions. To solve this problem, we further propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide discriminative visual representation. Moreover, we propose a two-stream framework (2sTEFV-GCN) by combining the TEFV model with the GCN model to further improve the performance. Since the TEFV algorithm is different from GCN, this hybrid architecture is capable of incorporating the advantages of both algorithms to discover complementary information of feature representation effectively.

The contributions of this work include:

(1)
We demonstrate by our analysis that Fisher vector encoding inevitably leads to losing temporal information of actions. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide a more discriminative visual representation.
(2)
We propose a two-stream model (2sTEFV-GCN), which belongs to a complementary structure of feature information. TEFV branch could enrich and enhance the temporal information of skeleton data, and the GCN branch could mine the non-Euclidean spatial information of data. The proposed model can obtain more abundant spatiotemporal information.
(3)
On two large-scale datasets, we demonstrate improved performance over state-of-the-art methods.

The rest of this paper is arranged as follows. “Related works” reviews related works. “Algorithm” demonstrates the temporal information loss problem of FV and provides a detailed description of the proposed TEFV and the 2sTEFV-GCN model. “Experiments” evaluates experiment results. Finally, we conclude our report in “Conclusion”.

Related works

GCN-based action recognition from skeleton data

Recently, since it is capable of handling data with non-Euclidean structures, graph convolutional networks have successfully been applied in image classification [30], document classification [31], and semi-supervised learning [32].

Since the connections of the body joints can be naturally treated as a graph structure, the Graph Convolution Network (GCN) has been utilized in skeleton-based action recognition and has proven to be highly successful. Li et al. [33] proposed the actional-structural GCN by introducing an encoder–decoder structure and capturing higher order relationships. To solve the problem of how to efficiently incorporate the joint and bone data, the study in [34] designed a directed GCN and applied a two-stream framework. The work in [35] aimed to address the part-level action modeling problem and proposed a Part-Level GCN by integrating the part relation block and part attention block. To address the fixed graph structure problem and the one-order hop limitation, the work in [36] proposed a Neural Architecture Search GCN by Neural Architecture Search and multiple-hop modules. Since some GCN models are exceedingly sophisticated and high computationally expensive, some studies focus on tackling this issue. The work in [37] proposed an effective semantics-guided neural network by introducing the high-level semantics of joints and exploiting the relationship of joints through the joint-level module and frame-level module. The work in [21] is the current state-of-the-art method, which proposed an efficient GCN by a compound scaling strategy and an early fused multiple input branches. The proposed EfficientGCN achieves high accuracies with a small number of parameters. However, these GCN-based methods only consider extracting features while ignoring the information of feature distributions.

Fisher vector encoding

Fisher vector encoding originally proposed in [22] was based on a generative model which approximates the global distribution of features. Since its great power, Perronnin et al. [38] applied the Fisher vector encoding to image categorization where a Gaussian mixture model was adopted as the generative model to approximate the distribution of low-level features in images. Following this work, several variants were proposed. Perronnin et al. [23] proposed an improved Fisher vector by adding normalization strategies. Cinbis et al. [25] tackled the i.i.d. assumption problem of FV and introduced a latent variable model for image classification. Klein et al. [26] adopted Laplacian Mixture Model and a hybrid Gaussian–Laplacian Mixture Model as the generative model separately for image annotation. Wang et al. [27] applied FV which showed better performance than the Bag of Features model for video-based action recognition. Peng et al. [28] designed a stacked Fisher Vector architecture to refine the video representation for action recognition. Tang et al. [39, 40] proposed an FV-GCN architecture by combining Fisher vector encoding with GCN for skeleton-based action recognition. The main differences between our work and the works in [39] and [40] are as follows: (1) An improved Fisher vector encoding algorithm, i.e., temporal enhanced Fisher vector encoding algorithm (TEFV), is proposed in our work to tackle the temporal information loss problem of classical FV in [39] and [40]. (2) A two-stream framework (2sTEFV-GCN) is proposed in our work to further improve performance.

Algorithm

Fisher vector encoding

In this section, we introduce Fisher vector encoding [23, 39, 40]. We further demonstrate by our analysis that Fisher vector encoding inevitably leads to losing temporal information of actions.

GMM generation

Fisher vector (FV) encoding is a powerful algorithm that is capable of obtaining better performance. Gaussian Mixture Model (GMM) [41] $u_\lambda $ is employed as the generative model of FV encoding architecture. The goal of the Gaussian mixture model is to find a mixture of multiple Gaussian models that will fit the distribution of the given feature space. It provides both the average information and the covariance of feature distribution

$$\begin{aligned} \begin{aligned} u_\lambda (x)=\sum _{i=1}^{K} \omega _i u_i(x), \end{aligned} \end{aligned}$$

(1)

where $\lambda =\{\omega _i,\mu _i,\delta _i,i=1,2,\ldots ,K\}$ are the parameters. $\omega _i\in [0,1]$ is the weight of the ith Gaussian model. $\mu _i\in R^D$ and $\delta _i\in R^D$ are means, and diagonal covariance of the ith Gaussian model, respectively. K is the number of Gaussians.

Consider the extracted feature maps $S\in R^{T\times N \times C}$ from an action V, where T, N, and C denote the temporal dimension, the number of joints, and the number of channels, respectively. Then, the feature map S is split into T slices $X=\{x_t \in R^{N\times C}, t=1,2,\ldots ,T\}$ along the temporal dimension T. Given the training set of X, the parameters of GMM are estimated by the expectation–maximization (EM) algorithm [42]. The action sample V can be represented as gradient vectors with respect to the GMM model parameters

$$\begin{aligned} \begin{aligned} G_{\lambda }^{x}=\frac{1}{T}\sum _{t=1}^{T}\nabla _\lambda \log u_\lambda (x_t). \end{aligned} \end{aligned}$$

(2)

Feature encoding

Let ${\mathcal {G}}_{\mu ,i}^{x}$ be the gradient of $\mu _i$ and ${\mathcal {G}}_{\delta ,i}^{x}$ be the gradient of $\delta _i$ of the i-th Gaussian component, respectively

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {G}}_{\mu ,i}^{x}=\frac{1}{T\sqrt{\omega _i}}\sum _{t=1}^{T}\gamma _t(i)\left( \frac{x_t-\mu _i}{\delta _i}\right) \end{aligned} \end{aligned}$$

(3)

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {G}}_{\delta ,i}^{x}=\frac{1}{T\sqrt{2\omega _i}}\sum _{t=1}^{T}\gamma _t(i)\left[ \frac{(x_t-\mu _i)^2}{\delta _i^2}-1\right] , \end{aligned} \end{aligned}$$

(4)

where $\gamma _t(i)$ is the weight of the descriptor $x_t$ with the i-th Gaussian component

$$\begin{aligned} \begin{aligned} \gamma _t(i)=\frac{\omega _i u_i(x_t)}{\sum _{j=1}^{K}\omega _ju_j(x_t)}. \end{aligned} \end{aligned}$$

(5)

We can obtain the final Fisher vector by concatenating all the ${\mathcal {G}}_{\mu ,i}^{x}$ and ${\mathcal {G}}_{\delta ,i}^{x} (i = 1,2, \ldots ,K)$,where K is the number of Gaussians

$$\begin{aligned} \begin{aligned} FV: {\mathcal {G}}_{\lambda }^{x}=[{\mathcal {G}}_{\mu ,1}^{x}, {\mathcal {G}}_{\delta ,1}^{x},\ldots , {\mathcal {G}}_{\mu ,K}^{x}, {\mathcal {G}}_{\delta ,K}^{x}]. \end{aligned} \end{aligned}$$

(6)

Normalization

The result is further normalized by two steps, which improves the performance in [23]. To tackle the sparse problem of FV, following [23], power normalization is applied first:

$$\begin{aligned} \begin{aligned} {[{\mathcal {G}}_\lambda ^x]}_i \leftarrow \hbox {sign}([{\mathcal {G}}_\lambda ^x]_i)\sqrt{|[{\mathcal {G}}_\lambda ^x]_i|}. \end{aligned} \end{aligned}$$

(7)

To eliminate the impact of the background class-independent part, the feature ${\mathcal {G}}_\lambda ^x$ is further $L_2$-normalized

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_\lambda ^x= {\mathcal {G}}_\lambda ^x/\Vert [{\mathcal {G}}_\lambda ^x]\Vert _2. \end{aligned} \end{aligned}$$

(8)

Temporal information loss problem

For the image classification task, FV encodes the local features (descriptor) of an image into a super vector, which obtains global information of the feature distribution, and has been proven to be highly successful. However, unlike image features, skeleton action features have temporal information, which is crucial to action recognition. According to Eqs. (3) and (4), ${\mathcal {G}}_{\mu ,i}^{x}$ and ${\mathcal {G}}_{\delta ,i}^{x}$ are obtained by sum and average operations over temporal dimension T, which inevitably leads to losing temporal information of actions. This issue also exists in video-based action recognition that utilizes FV methods.

Temporal enhanced Fisher vector encoding

Since the FV cannot depict the temporal variations in the action clip, in this section, we present a temporal enhanced FV algorithm (TEFV) to address this issue. The architecture is shown in Fig. 1. The skeleton data are first fed into EfficientGCN [21]. Then, the extracted features are fed into the TEFV block. ReLU function and a dropout layer with 0.25 drop probability are added between the two FC layers. The softmax scores are finally adopted to predict the action labels.

GMM generation

The feature maps $F\in R^{T\times N \times C}$ are split into T slices $X=\{x_t \in R^{N \times C}, t=1, 2, \ldots , T \}$ along the temporal dimension T. Each $x_t$ is further split into N slices $x_t=\{x_n(t) \in R^C, n=1, 2, \ldots , N \}$ along the dimension N. To retain the temporal information among the action clip, the expectation–maximization (EM) algorithm is utilized to estimate the GMM parameters of each $x_t$ rather than the global GMM parameters of the entire feature map X

$$\begin{aligned} \begin{aligned} u_{\lambda _t}(x_t)=\sum _{i=1}^k\omega _i u_i(x_t), \end{aligned} \end{aligned}$$

(9)

$\lambda _t = \{\omega _i, \mu _i, \delta _i, i=1,2, \ldots , K\}$ where $\omega _i \in [0,1]$, $\mu _i \in R^D$ and $\delta _i \in R^D$ are the weight, means, and diagonal covariances of the i-th Gaussian model, respectively. K is the number of Gaussians. The final GMMs are $u_\lambda (X)=\{u_{\lambda _t}(x_t), t=1,2, \ldots , T\}$.

Then, the action V can be described by $G_\lambda ^X=G_{\lambda _1}^{x_1}\oplus G_{\lambda _2}^{x_2}\oplus \cdots \oplus G_{\lambda _T}^{x_T}$ where $\oplus $ is concatenation operation and $G_{\lambda _t}^{x_t}$ is the gradient vector of log-likelihood with respect to the parameters of GMM models $u_{\lambda _t}(x_t)$ of $x_t$

$$\begin{aligned} \begin{aligned} G_{\lambda _t}^{x_t}=\frac{1}{N}\sum _{n=1}^N\nabla _{\lambda _t}\log u_{\lambda _t}(x_n(t)). \end{aligned} \end{aligned}$$

(10)

Here, N is the number of descriptors, i.e., the number of joints.

Feature encoding

Specifically, let ${\mathcal {G}}_{\mu ,i}^{x_t}$ be the gradient of $\mu _i$ and ${\mathcal {G}}_{\delta ,i}^{x_t}$ be the gradient of $\delta _i$ of the i-th Gaussian component of GMM $u_{\lambda _t}(x_t)$, respectively

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {G}}_{\mu ,i}^{x_t}=\frac{1}{N\sqrt{\omega _i}}\sum _{n=1}^{N}\gamma _n(t,i)\left( \frac{x_n(t)-\mu _i}{\delta _i}\right) \end{aligned} \end{aligned}$$

(11)

$$\begin{aligned}{} & {} \begin{aligned} {\mathcal {G}}_{\delta ,i}^{x_t}=\frac{1}{N\sqrt{2\omega _i}}\sum _{n=1}^{N}\gamma _n(t,i)\left[ \frac{(x_n(t)-\mu _i)^2}{\delta _i^2}-1\right] , \end{aligned} \end{aligned}$$

(12)

where $\gamma _n(t,i)$ is the soft assignment of $x_n(t)$ to Gaussian i of GMM $u_{\lambda _t}(x_t)$

$$\begin{aligned} \begin{aligned} \gamma _n(t,i)=\frac{\omega _i u_i(x_n(t))}{\sum _{j=1}^K \omega _ju_j(x_n(t))}. \end{aligned} \end{aligned}$$

(13)

The Fisher vector with respect to GMM $u_\lambda (x_t)$ is the concatenation of the ${\mathcal {G}}_{\mu ,i}^{x_t}$ and ${\mathcal {G}}_{\delta ,i}^{x_t}$ gradient vectors for $i=1,2, \ldots , K$ ,where K is the number of Gaussians

$$\begin{aligned} \begin{aligned} {\mathcal {G}}_{\lambda _t}^{x_t}={\mathcal {G}}_{\mu ,1}^{x_t}\oplus {\mathcal {G}}_{\delta ,1}^{x_t}\oplus \cdots \oplus {\mathcal {G}}_{\mu ,K}^{x_t}\oplus {\mathcal {G}}_{\delta ,K}^{x_t}. \end{aligned} \end{aligned}$$

(14)

The Fisher vector of the entire action clip V can be written as

$$\begin{aligned} \begin{aligned} FV: {\mathcal {G}}_{\lambda }^{X}={\mathcal {G}}_{\lambda _1}^{x_1}\oplus {\mathcal {G}}_{\lambda _2}^{x_2}\oplus \cdots \oplus {\mathcal {G}}_{\lambda _T}^{x_T}, \end{aligned} \end{aligned}$$

(15)

where $\oplus $ is the concatenation operation. The final Fisher vector is normalized by power normalization and $L_2$-normalization. According to Eqs. (11) and (12), since the GMMs $u_\lambda (X)$ preserve the temporal information of the entire action, our TEFV can provide more discriminative visual representation. It is important for opposite action pairs and fine-grained action recognition. Opposite action pairs, such as “14 put on jacket”and “15 take off jacket”, and “16 put on a shoe”, “17 take off a shoe”, have similar spatial feature distribution but different temporal orders. Temporal order is crucial for recognizing these reversed action pairs. Temporal fine-grained actions, such as “3 brush teeth” and “4 brush hair”, “74 counting money”, “75 cutting nails” and “82 fold paper”, have similar body movements but subtle variations in hand/finger motions. Instead of averaging the gradient along the temporal dimension T, preserving the temporal information of the entire action Eqs. (11) and (12) is helpful for recognizing these fine-grained actions. We provide a detailed discussion of these cases in Sect. “Comparison with Fisher vector encoding”.

Two-stream framework

Inspired by the two-stream methods in [19, 20], we propose a two-stream framework (2sTEFV-GCN) by combining the TEFV model with the GCN model to further improve performance. Our Temporal Enhanced Fisher Vector Encoding is referred to as the TEFV stream, which could enrich and enhance the temporal information of skeleton data. Another stream, i.e., the GCN stream, adopts the EfficientGCN [21]. The EfficientGCN obtains joint, velocity, and bone data from the original 3D skeleton coordinate by data preprocessing. Next, the three input sequences are fed into three input branches, separately. The input branch consists of several GCN and attention blocks. Then, these three branches are fused and fed into the main stream, which is similar to the input branch. The features are obtained through a global average pooling operation and a Fully Connected layer for action recognition in the GCN stream. Finally, the softmax scores of the TEFV stream and the GCN stream are fused to predict the action labels.

Experiments

This section evaluates our TEFV and two-stream framework (2sTEFV-GCN) on two large-scale datasets: NTU RGB+D 60 [43] and NTU RGB+D 120 [44]. In particular, we first evaluate the number of Gaussians k. We next compare different normalization strategies individually or when combined. Then, we compare the proposed TEFV with FV to examine the contributions. To further demonstrate the advantage of our TEFV, we also provide class-wise accuracies and confusion matrix analysis. Finally, we compare the results of our two-stream Fisher Vector framework (2sTEFV-GCN) with state-of-the-art methods.

Datasets

NTU-RGB+D 60 [43] is a large-scale dataset for RGB+D human action recognition. It contains 56,880 samples of 60 action classes. These actions are in three major categories: daily actions (e.g., brush hair, drop, put on jacket), two person interactions (e.g., pushing, point finger, giving object), and medical conditions (e.g., sneeze/cough, fan self). The dataset provides 3D coordinates of 25 joints for each subject.

The authors recommend two benchmarks: (1) cross-subject (X-Sub): since these actions were performed by 40 volunteers, the dataset can be split into training and validation sets according to different volunteers. The training set consists of 40,320 samples performed by 20 actors. The test set consists of 16,560 samples performed by the other 20 actors. (2) Cross-view (X-View): since these actions were captured by three cameras with different horizontal angles, this benchmark employed camera 1 with 45 angles containing 18,960 samples for testing, and the other two cameras with $-45^\circ $ and $0^\circ $ angles containing 37,920 samples for training.

NTU RGB+D 120 [44] is by far the largest dataset for skeleton-based action recognition. This dataset is an extended version of NTU-RGB+D 60 by adding another 60 classes with 57,600 samples. The 120 action classes also include daily actions, two person interactions, and medical conditions actions. These actions are acted by 106 volunteers of a wide range of ages and captured from 155 different camera viewpoints. The newly added fine-grained actions and larger variation in subjects, viewpoints, and backgrounds make it very challenging for skeleton-based action recognition.

The authors recommend two benchmarks: (1) cross-subject (X-Sub): The training set consists of 63,026 samples acted by 53 volunteers. The validation set consists of 50,922 samples acted by the other 53 volunteers. (2) Cross-set (X-Set): Since 32 different setups were used to build the dataset, this benchmark employed 16 odd setup IDs with 54,471 samples for training and the other 16 setups with 59,477 samples for testing.

Training details

In our experiments, the learning rate is set as 0.0001 with a cosine schedule decay. Adam is employed as the optimization strategy. Cross-entropy is employed as the loss function. The maximum number of training epochs is set as 100. Our model is implemented in the Pytorch framework and trained on two Tesla V100 GPUs.

Evaluation of the number of Gaussians

The key parameter in the GMM aggregation of our TEFV is the number of Gaussians K. The performances on two datasets with different numbers of Gaussians K, which change from 1 to 9 at equal intervals, are shown in Figs. 2 and 3. Power normalization and $L_2$ normalization are performed as default for a fair comparison.

As shown in Figs. 2 and 3, the larger the GMM size we set, the lower the recognition performance we obtain. Note that this is different from previous work [27,28,29] where the best results are often obtained with a larger Gaussians number K. We suggest that since lots of hand-crafted features used for image classification (SIFT [24]) and video-based action recognition (IDT, HOF, HOG, MBH [27,28,29]) are represented by lots of local descriptors, a larger size GMM is required to fit their distribution. The features extracted from the last graph convolution block of the EfficientGCN are high-level features with fewer descriptors. A GMM with one component is sufficient to fit the distribution of these high-level features. Accordingly, we set the number of Gaussians $k=1$ as default in the following experiments.

Evaluation of normalization strategies

We compare different normalization strategies on both NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are presented in Table 1. It is observed that our TEFV with Power+L2 Normalization performs the best on both NTU RGB+D 60 and NTU RGB+D 120 datasets under all protocols. Therefore, Power+L2 Normalization is adopted in the following experiments. The performance of power normalization is shown in Fig. 4. It is observed that lots of values of $L_2$-normalized FV features are crowded around zero. The power normalization effectively alleviates the sparse trouble of the $L_2$-normalized features.

Table 1 Performance of TEFV with different normalization strategies on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset

Full size table

Comparison with Fisher vector encoding

The results of Temporal Enhanced FV and FV are summarized in Table 2. It is observed that our Temporal Enhanced FV is better than FV.

To further validate the advantage of our TEFV, we analyze the class-wise accuracy and confusion matrix on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset. The accuracy of each class is presented in Fig. 5.

It is observed that our TEFV outperforms FV in most of the classes. For example, in classes 4, 11, 13, 14, 15, 17, 29, 30, 37, 57, 74, 82, 103, and 106 , our TEFV significantly outperforms FV. The confusion matrices of some action classes for Our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset are shown in Fig. 6.

In class “14 put on jacket”, 5 samples are misclassified by FV as the class “15 take off jacket”, while all of them are correctly classified by our TEFV. In class “17 take off a shoe”, 63 samples are confused as class “16 put on a shoe” by FV, while only 43 of them are incorrectly classified by our TEFV. These opposite action pairs have similar spatial feature distribution, but different temporal dynamics. These cases demonstrate the advantage of our TEFV, which can preserve the temporal information of the entire sample.

Besides the advantage of preserving the temporal information of the entire sample we mentioned above, the ability to capture fine-grained motions is also an advantage of our TEFV.

In class “4 brush hair”, 6 samples are misclassified by FV as the class “3 brush teeth”, while only 2 of them are incorrectly classified by our TEFV. 2 samples are incorrectly classified by FV as the class “20 put on a hat/cap”, while all of them are correctly classified by our TEFV. In class “37 wipe face”, 8 samples are incorrectly classified by FV as the class “18 put on glasses”, while only 2 of them are misclassified by our TEFV. These samples from different classes share a similar sub-action of raising a hand (or hands) close to the mouth (or the head/face). The key to correctly classifying these samples is recognizing the fine-grained spatial configurations and temporal dynamics of hand and arm skeletons.

In class “11 reading”, 71 samples are confused as class “12 writing” by FV, while only 56 of them are incorrectly classified by our TEFV. In class “29 play with phone/tablet”, 36 samples are confused as the class “30 type on a keyboard” by FV, while only 28 of them are incorrectly classified by our TEFV. 23 samples are misclassified by FV as the class “75 cutting nails”, while only 14 of them are incorrectly classified by our TEFV. 12 samples are incorrectly classified as the class “84 play magic cube” by FV, while only half of them are misclassified by our TEFV. In class “30 type on a keyboard”, 18 samples are confused as the class “29 play with phone/tablet” by FV, while only 13 of them are incorrectly classified by our TEFV. 7 samples are misclassified by FV as the class “84 play magic cube”, while only 2 of them are incorrectly classified by our TEFV. In class “74 counting money”, 36 samples are confused as the class “75 cutting nails” by FV, while only 27 of them are incorrectly classified by our TEFV. 13 samples are incorrectly classified as the class “82 fold paper” by FV, while only 7 of them are misclassified by our TEFV. Since these groups of actions share similar spatial and temporal variations, distinguishing these actions is challenging. The key to correctly classifying these classes is recognizing fine-grained hand and finger dynamics.

In class “14 put on jacket”, 11 samples are misclassified by FV as the class “87 put on bag”, while all of them are correctly classified by our TEFV. In class “15 take off jacket”, 4 samples are confused as the class “16 put on a shoe”, while all of them are correctly classified by our TEFV. 8 samples are misclassified by FV as class “87 put on bag”, while only 2 of them are incorrectly classified by our TEFV. Correctly classifying these classes requires the ability to recognize the fine-grained motions of hand and body skeletons.

Table 2 Performance of TEFV and FV on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset

Full size table

All these cases demonstrate that our TEFV not only preserves the temporal information of the entire sample but also captures fine-grained spatial configurations and temporal dynamics.

Although our TEFV outperforms FV in most of the classes, there are still a few actions difficult to be recognized. For example, in class 56, our TEFV is inferior to FV. The failure numbers of these classes for our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset are shown in Table 3.

Table 3 The failure numbers of action “56 giving something to other person” on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset

Full size table

In class “56 giving something to other person”, 18 samples are misclassified by TEFV as the class “57 touch other person’s pocket”, while only 12 of them are incorrectly classified by FV. “56 giving something to other person” is marked by two sub-actions: one person hands out something and the other reaches for it. In class “57 touch other person’s pocket”, only one person has arm and hand movements. These actions have significant arm and hand movements, with no temporal fine-grained dynamics. These actions can be well recognized by the classical FV, which employs GMM to fit the global distribution of features. An over-emphasis on preserving detailed temporal information leads to overfitting for these actions.

Evaluation of the two-stream framework

We verify our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are summarized in Table 4. The two-stream temporal enhanced Fisher vector framework outperforms the one-stream-based method.

Comparison with the state-of-the-art

We compare our Two-stream Temporal Enhanced Fisher Vector encoding framework (2sTEFV-GCN) on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are shown in Tables 5 and 6, respectively.

NTU-RGB+D 60 dataset. On NTU-RGB+D 60 dataset, We compare our Two-stream Temporal Enhanced Fisher Vector encoding framework (2sTEFV-GCN) with LSTM-based methods (i.e., HBRNN [10], ST-LSTM [11],STA-LSTM [45], CNN-based methods (i.e., Temporal CNN [16], HCN [46], GCN-based method (i.e., ST-GCN [18], RA-GCNv1 [47], RA-GCNv2 [48], AS-GCN [33], 2s-AGCN [19], DGNN [34], PL-GCN [35], NAS-GCN [36], MS-G3D [20], PA-ResGCN-B19 [49], Dynamic-GCN [50], RNXt-GCN [51], EfficientGCN-B4 [21]), and hybrid methods based on two types of networks (i.e., SR-TSL [52],VA-fusion [53], AGC-LSTM [54], and SGN [37]).

There are two typical works we should mention. The first one is ST-GCN [18], which innovatively employs GCN for skeleton-based action recognition. Compared with this popular model, our two-stream Temporal Enhanced Fisher Vector framework obtains obvious improvements on both X-view and X-sub benchmarks. Another method is EfficientGCN-B4 [21], which is the recent state-of-the-art method. Our Two-stream Temporal Enhanced Fisher Vector encoding framework (2sTEFV-GCN) constantly outperforms EfficientGCN-B4 on both X-view and X-sub benchmarks.

NTU-RGB+D 120 dataset. Since the NTU-RGB+D 120 is newly released, fewer performances are reported. We compare our two-stream Temporal Enhanced Fisher Vector encoding framework (2sTEFV-GCN) with LSTM-based methods (i.e., PA-LSTM [43], ST-LSTM [11]), CNN-based method (i.e., FSNet [55]), GCN-based methods (i.e., ST-GCN [18], RA-GCNv1 [47], RA-GCNv2 [48], AS-GCN [33], 2s-AGCN [19], MS-G3D [20], PA-ResGCN-B19 [49], Dynamic-GCN [50], RNXt-GCN [51], EfficientGCN-B4 [21]), and hybrid methods based on two types of networks (i.e., SR-TSL [52] and SGN [37]). The accuracies are reported in Table 6, where our method achieves superior classification accuracy over these state-of-the-art approaches under all evaluation settings. Notably, our 2sTEFV-GCN outperforms the recent state-of-the-art method, EfficientGCN-B4 [21], on both X-set and X-sub benchmarks.

Table 4 Performance of two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) on both the NTU-RGB+D 60 dataset and NTU-RGB+D 120 dataset

Full size table

Table 5 Performance of SOTA methods on NTU-RGB+D 60 dataset

Full size table

Table 6 Performance of SOTA methods on NTU-RGB+D 120 dataset

Full size table

These results on the two large-scale datasets demonstrate the superiority of our 2sTEFV-GCN. We consider that this is caused by our TEFV algorithm can effectively utilize the information of feature distributions to provide complementary information to GCN-based methods, while existing GCN-based methods ignoring the information of feature distributions. Since the TEFV algorithm is different from GCN, this hybrid framework (2sTEFV-GCN) is capable of incorporating the advantages of both algorithms.

Conclusion

In this paper, we aim to introduce Fisher vector (FV) encoding into GCN to effectively utilize the information of feature distributions. We demonstrate by our analysis that Fisher vector encoding inevitably leads to losing temporal information of actions. To tackle this problem, we further propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide more discriminative visual representation. Compared with FV, our TEFV model can not only preserve the temporal information of the entire action but also capture fine-grained spatial configurations and temporal dynamics. Moreover, we propose a two-stream Temporal Enhanced Fisher vector encoding framework (2sTEFV-GCN) by combining the TEFV model with the GCN model to further improve the performance. On two large-scale datasets for skeleton-based action recognition, NTU-RGB+D 60 and NTU-RGB+D 120, our model achieves state-of-the-art performance.

Data availability statement

The datasets analyzed during the current study are available in the [NTU RGB+D] repository and the [NTU RGB+D 120] repository, [https://rose1.ntu.edu.sg/dataset/actionRecognition/].

References

Lin J, Gan C, Han S (2019) Tsm: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 7083–7093
Tran D, Wang H, Torresani L, et al (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 6450–6459
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inform Process Syst 27:568–576
Google Scholar
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
Wang L, Xiong Y, Wang Z et al (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755
Article Google Scholar
Hu JF, Zheng WS, Lai J, et al (2015) Jointly learning heterogeneous features for rgb-d activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5344–5352
Xu C, Govindarajan LN, Zhang Y et al (2017) Lie-x: depth image based articulated object pose estimation, tracking, and action recognition on lie groups. Int J Comput Vis 123(3):454–478
Article MathSciNet MATH Google Scholar
Huynh-The T, Hua CH, Ngo TT et al (2020) Image representation of pose-transition feature for 3d skeleton-based action recognition. Inform Sci 513:112–126
Article Google Scholar
Divya R, Peter JD (2021) Smart healthcare system-a brain-like computing approach for analyzing the performance of detectron2 and posenet models for anomalous action detection in aged people with movement impairments. Complex Intell Syst:1–20
Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118
Liu J, Shahroudy A, Xu D, et al (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In: European conference on computer vision, Springer, pp 816–833
Khaled H, Abu-Elnasr O, Elmougy S, et al (2021) Intelligent system for human activity recognition in iot environment. Complex Intell Syst: 1–12
Kareem Z, Zaidan A, Ahmed M et al (2022) An approach to pedestrian walking behaviour classification in wireless communication and network failure contexts. Complex Intell Syst 8(2):909–931
Article Google Scholar
Zhang S, Liu X, Xiao J (2017) On geometric features for skeleton-based action recognition using multilayer lstm networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 148–157
Ke Q, Bennamoun M, An S, et al (2017) A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3288–3297
Kim TS, Reiter A (2017) Interpretable 3d human action analysis with temporal convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), IEEE, pp 1623–1631
Li C, Zhong Q, Xie D, et al (2017) Skeleton-based action recognition with convolutional neural networks. In: 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, pp 597–600
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence
Shi L, Zhang Y, Cheng J, et al (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12026–12035
Liu Z, Zhang H, Chen Z, et al (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 143–152
Song YF, Zhang Z, Shan C, et al (2022) Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans Pattern Anal Mach Intell
Jaakkola T, Haussler D (1998) Exploiting generative models in discriminative classifiers. Adv Neural Inform Process Syst 11
Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European conference on computer vision, Springer, pp 143–156
Sánchez J, Perronnin F, Mensink T et al (2013) Image classification with the fisher vector: theory and practice. Int J Comput Vis 105(3):222–245
Article MathSciNet MATH Google Scholar
Cinbis RG, Verbeek J, Schmid C (2015) Approximate Fisher kernels of non-iid image models for image categorization. IEEE Trans Pattern Anal Mach Intell 38(6):1084–1098
Article Google Scholar
Klein B, Lev G, Sadeh G, et al (2015) Associating neural word embeddings with deep image representations using fisher vectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4437–4446
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Peng X, Zou C, Qiao Y, et al (2014) Action recognition with stacked fisher vectors. In: European conference on computer vision, Springer, pp 581–595
Chen C, Liu M, Zhang B, et al (2016) 3d action recognition using multi-temporal depth motion maps and fisher vector. In: IJCAI, pp 3331–3337
Fu S, Liu W, Tao D et al (2020) Hesgcn: hessian graph convolutional networks for semi-supervised classification. Inform Sci 514:484–498
Article MathSciNet MATH Google Scholar
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Adv Neural Inform Process Syst 29:3844–3852
Google Scholar
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proc. Int. Conf. Learning Representations
Li M, Chen S, Chen X, et al (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3595–3603
Shi L, Zhang Y, Cheng J, et al (2019) Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7912–7921
Huang L, Huang Y, Ouyang W, et al (2020) Part-level graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 11045–11052
Peng W, Hong X, Chen H, et al (2020) Learning graph convolutional network for skeleton-based human action recognition by neural searching. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 2669–2676
Zhang P, Lan C, Zeng W, et al (2020) Semantics-guided neural networks for efficient skeleton-based human action recognition. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1112–1121
Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: 2007 IEEE conference on computer vision and pattern recognition, IEEE, pp 1–8
Tang J, Wang Y, Fu S, et al (2022) A graph convolutional neural network model with fisher vector encoding and channel-wise spatial-temporal aggregation for skeleton-based action recognition. IET Image Processing
Tang J, Wang Y, Liu B (2020) Effective skeleton-based action recognition by combining graph convolutional networks and fisher vector encoding. In: 2020 15th IEEE International Conference on Signal Processing (ICSP), IEEE, pp 230–233
Titterington DM, Smith AF, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 4. Springer
Shahroudy A, Liu J, Ng TT, et al (2016) Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019
Liu J, Shahroudy A, Perez M et al (2019) Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans Pattern Anal Mach Intell 42(10):2684–2701
Article Google Scholar
Song S, Lan C, Xing J, et al (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the AAAI conference on artificial intelligence
Li C, Zhong Q, Xie D, et al (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp 786–792
Song YF, Zhang Z, Wang L (2019) Richly activated graph convolutional network for action recognition with incomplete skeletons. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, pp 1–5
Song YF, Zhang Z, Shan C et al (2020) Richly activated graph convolutional network for robust skeleton-based action recognition. IEEE Trans Circ Syst Video Technol 31(5):1915–1925
Article Google Scholar
Song YF, Zhang Z, Shan C, et al (2020b) Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: proceedings of the 28th ACM international conference on multimedia, pp 1625–1633
Ye F, Pu S, Zhong Q, et al (2020) Dynamic gcn: context-enriched topology learning for skeleton-based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 55–63
Liu S, Bai X, Fang M et al (2022) Mixed graph convolution and residual transformation network for skeleton-based action recognition. Appl Intell 52(2):1544–1555
Article Google Scholar
Si C, Jing Y, Wang W, et al (2018) Skeleton-based action recognition with spatial reasoning and temporal stack learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 103–118
Zhang P, Lan C, Xing J et al (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell 41(8):1963–1978
Article Google Scholar
Si C, Chen W, Wang W, et al (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1227–1236
Liu J, Shahroudy A, Wang G et al (2019) Skeleton-based online action prediction using scale selection network. IEEE Trans Pattern Anal Mach Intell 42(6):1453–1467
Article Google Scholar

Download references

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62072468, the Natural Science Foundation of Shandong Province under Grant ZR2019MF073, the Fundamental Research Funds for the Central Universities, China University of Petroleum (East China) under Grant 20CX05001A, and the Open Research Fund from Shandong Provincial Key Laboratory of Computer Network under Grant SKLCN-2021-02.

Author information

Authors and Affiliations

College of Control Science and Engineering, China University of Petroleum (East China), Qingdao, 266580, China
Jun Tang, Baodi Liu, Wenhui Guo & Yanjiang Wang
College of Animation and Communication, Qingdao Agricultural University, Qingdao, 266109, China
Jun Tang

Authors

Jun Tang
View author publications
You can also search for this author in PubMed Google Scholar
Baodi Liu
View author publications
You can also search for this author in PubMed Google Scholar
Wenhui Guo
View author publications
You can also search for this author in PubMed Google Scholar
Yanjiang Wang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

JT: conceptualization, formal analysis, methodology, software, and writing—original draft. BL: funding acquisition, supervision, review, and editing. WG: review and editing. YW: conceptualization, funding acquisition, and supervision. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Yanjiang Wang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Tang, J., Liu, B., Guo, W. et al. Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition. Complex Intell. Syst. 9, 3147–3159 (2023). https://doi.org/10.1007/s40747-022-00914-3

Download citation

Received: 30 May 2022
Accepted: 06 November 2022
Published: 25 November 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s40747-022-00914-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition

Abstract

Similar content being viewed by others

Short-term path signature for skeleton-based action recognition

Two-Stream Transformer Encoders for Skeleton-Based Action Recognition

Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition

Introduction

Related works

GCN-based action recognition from skeleton data

Fisher vector encoding

Algorithm

Fisher vector encoding

GMM generation

Feature encoding

Normalization

Temporal information loss problem

Temporal enhanced Fisher vector encoding

GMM generation

Feature encoding

Two-stream framework

Experiments

Datasets

Training details

Evaluation of the number of Gaussians

Evaluation of normalization strategies

Comparison with Fisher vector encoding

Evaluation of the two-stream framework

Comparison with the state-of-the-art

Conclusion

Data availability statement

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation