Two-stream temporal enhanced Fisher vector encoding for skeleton-based action recognition

The key to skeleton-based action recognition is how to extract discriminative features from skeleton data. Recently, graph convolutional networks (GCNs) have proven highly successful for skeleton-based action recognition. However, existing GCN-based methods focus on extracting robust features while neglecting the information of feature distributions. In this work, we aim to introduce Fisher vector (FV) encoding into GCN to effectively utilize the information of feature distributions. However, since a Gaussian Mixture Model (GMM) is employed to fit the global distribution of features, Fisher vector encoding inevitably loses the temporal information of actions, as demonstrated by our analysis. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide a more discriminative visual representation. Compared with FV, our TEFV model can not only preserve the temporal information of the entire action but also capture fine-grained spatial configurations and temporal dynamics. Moreover, we propose a two-stream framework (2sTEFV-GCN) that combines the TEFV model with the GCN model to further improve performance. On two large-scale datasets for skeleton-based action recognition, NTU-RGB+D 60 and NTU-RGB+D 120, our model achieves state-of-the-art performance.


Introduction
Human action recognition has become an active research topic in recent years and has been widely applied in video surveillance systems, multimedia analysis, smart home applications, etc. The multiple modalities of data employed in action recognition can be categorized into RGB images [1,2], optical flows [3][4][5], depth images [6][7][8][9], human skeleton data [10,11], and data captured by wearable sensors [12,13]. Among these modalities, skeleton data captured by Kinect sensors provide a compact representation and therefore reduce computational complexity. Moreover, skeleton representation is independent of background noise. To extract discriminative features from skeleton data, earlier methods utilized Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). These methods treat the skeleton sequence as a sequence of vectors [10,11,14] or a pseudo-image [15][16][17], ignoring the natural graphical connections among the body joints. Recently, since they can handle data with non-Euclidean structures, graph convolutional networks (GCNs) [18][19][20][21] have proven highly successful for skeleton-based action recognition. However, these GCN-based methods only consider extracting features but ignore the information of feature distributions.
Fisher Vector (FV) encoding [22] and its variants [23][24][25][26], based on a generative model which approximates the global distribution of features, have successfully been utilized in image classification, and video-based action recognition [27][28][29] over the past few years. However, this powerful algorithm was rarely utilized in previous GCN-based methods for skeleton-based action recognition.
In this work, we aim to introduce Fisher Vector (FV) encoding into GCN to effectively utilize the information of feature distributions. However, we demonstrate by our analysis that since the Gaussian Mixture Model (GMM) is employed to fit the global distribution of features, Fisher vector encoding inevitably leads to losing temporal information of actions. To solve this problem, we further propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide discriminative visual representation. Moreover, we propose a two-stream framework (2sTEFV-GCN) by combining the TEFV model with the GCN model to further improve the performance. Since the TEFV algorithm is different from GCN, this hybrid architecture is capable of incorporating the advantages of both algorithms to discover complementary information of feature representation effectively.
The contributions of this work include: (1) We demonstrate by our analysis that Fisher vector encoding inevitably loses the temporal information of actions. To tackle this problem, we propose a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide a more discriminative visual representation. (2) We propose a two-stream model (2sTEFV-GCN) with a complementary structure: the TEFV branch enriches and enhances the temporal information of skeleton data, while the GCN branch mines the non-Euclidean spatial information of the data, so the proposed model obtains richer spatiotemporal information. (3) On two large-scale datasets, we demonstrate improved performance over state-of-the-art methods.
The rest of this paper is arranged as follows. "Related works" reviews related works. "Algorithm" demonstrates the temporal information loss problem of FV and provides a detailed description of the proposed TEFV and the 2sTEFV-GCN model. "Experiments" evaluates experiment results. Finally, we conclude our report in "Conclusion".

GCN-based action recognition from skeleton data
Recently, since it is capable of handling data with non-Euclidean structures, graph convolutional networks have successfully been applied in image classification [30], document classification [31], and semi-supervised learning [32].
Since the connections of the body joints can be naturally treated as a graph structure, the Graph Convolutional Network (GCN) has been utilized in skeleton-based action recognition and has proven highly successful. Li et al. [33] proposed the actional-structural GCN by introducing an encoder-decoder structure and capturing higher-order relationships. To efficiently incorporate the joint and bone data, the study in [34] designed a directed GCN and applied a two-stream framework. The work in [35] aimed to address the part-level action modeling problem and proposed a Part-Level GCN by integrating a part relation block and a part attention block. To address the fixed graph structure problem and the one-order-hop limitation, the work in [36] proposed a Neural Architecture Search GCN using Neural Architecture Search and multiple-hop modules. Since some GCN models are exceedingly sophisticated and computationally expensive, several studies focus on tackling this issue. The work in [37] proposed an effective semantics-guided neural network by introducing the high-level semantics of joints and exploiting the relationships of joints through a joint-level module and a frame-level module. The work in [21] is the current state-of-the-art method, which proposed an efficient GCN via a compound scaling strategy and early-fused multiple input branches; the proposed EfficientGCN achieves high accuracy with a small number of parameters. However, these GCN-based methods only consider extracting features while ignoring the information of feature distributions.

Fisher vector encoding
Fisher vector encoding, originally proposed in [22], is based on a generative model that approximates the global distribution of features. Owing to its power, Perronnin et al. [38] applied Fisher vector encoding to image categorization, where a Gaussian mixture model was adopted as the generative model to approximate the distribution of low-level features in images. Following this work, several variants were proposed. Perronnin et al. [23] proposed an improved Fisher vector by adding normalization strategies. Cinbis et al. [25] tackled the i.i.d. assumption problem of FV and introduced a latent variable model for image classification. Klein et al. [26] adopted a Laplacian Mixture Model and a hybrid Gaussian-Laplacian Mixture Model as the generative model, respectively, for image annotation. Wang et al. [27] applied FV, which showed better performance than the Bag of Features model, for video-based action recognition. Peng et al. [28] designed a stacked Fisher Vector architecture to refine the video representation for action recognition. Tang et al. [39,40] proposed an FV-GCN architecture by combining Fisher vector encoding with GCN for skeleton-based action recognition. The main differences between our work and the works in [39] and [40] are as follows: (1) an improved Fisher vector encoding algorithm, i.e., the temporal enhanced Fisher vector encoding algorithm (TEFV), is proposed in our work to tackle the temporal information loss problem of the classical FV used in [39] and [40]; (2) a two-stream framework (2sTEFV-GCN) is proposed in our work to further improve performance.

Fisher vector encoding
In this section, we introduce Fisher vector encoding [23,39,40]. We further demonstrate by our analysis that Fisher vector encoding inevitably leads to losing temporal information of actions.

GMM generation
Fisher vector (FV) encoding is a powerful algorithm capable of obtaining strong performance. A Gaussian Mixture Model (GMM) [41] u_λ is employed as the generative model of the FV encoding architecture. The goal of the GMM is to find a mixture of multiple Gaussian components that fits the distribution of the given feature space; it provides both the mean and the covariance information of the feature distribution:

u_\lambda(x) = \sum_{i=1}^{K} w_i u_i(x),   (1)

u_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i) \right),   (2)

where w_i ∈ [0, 1] is the weight of the i-th Gaussian component (with \sum_{i=1}^{K} w_i = 1), μ_i ∈ R^D and δ_i ∈ R^D are the mean and diagonal covariance (Σ_i = diag(δ_i^2)) of the i-th Gaussian component, respectively, and K is the number of Gaussians. Consider the extracted feature maps S ∈ R^{T×N×C} from an action V, where T, N, and C denote the temporal dimension, the number of joints, and the number of channels, respectively. The feature map S is split into T slices X = {x_t ∈ R^{N×C}, t = 1, 2, ..., T} along the temporal dimension T. Given the training set X, the parameters of the GMM are estimated by the expectation-maximization (EM) algorithm [42]. The action sample V can then be represented as gradient vectors with respect to the GMM model parameters.
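As an illustration, fitting a diagonal-covariance GMM by EM can be sketched in NumPy as follows. This is a toy re-implementation for clarity only; the function and variable names are our own, and a practical pipeline would rely on an off-the-shelf EM implementation.

```python
import numpy as np

def fit_diag_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component diagonal-covariance GMM to descriptors X (M, D) via EM."""
    rng = np.random.default_rng(seed)
    M, D = X.shape
    w = np.full(K, 1.0 / K)                      # mixture weights w_i
    mu = X[rng.choice(M, K, replace=False)]      # means mu_i, initialized from data
    var = np.tile(X.var(axis=0) + 1e-6, (K, 1))  # diagonal variances delta_i^2
    for _ in range(n_iter):
        # E-step: posterior responsibility of component i for descriptor x_t
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        Nk = gamma.sum(axis=0)
        w = Nk / M
        mu = gamma.T @ X / Nk[:, None]
        var = gamma.T @ (X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

The small constants (1e-6) guard against degenerate variances; they are an implementation convenience, not part of the formulation.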

Feature encoding
Let G^X_{μ,i} and G^X_{δ,i} be the gradients with respect to the mean μ_i and the covariance δ_i of the i-th Gaussian component, respectively:

G^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i)\, \frac{x_t - \mu_i}{\delta_i},   (3)

G^X_{\delta,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\delta_i^2} - 1 \right],   (4)

where γ_t(i) is the soft assignment weight of the descriptor x_t to the i-th Gaussian component:

\gamma_t(i) = \frac{w_i u_i(x_t)}{\sum_{j=1}^{K} w_j u_j(x_t)}.   (5)

We obtain the final Fisher vector by concatenating all of the gradient vectors:

G^X_\lambda = \left( G^X_{\mu,1}, G^X_{\delta,1}, \ldots, G^X_{\mu,K}, G^X_{\delta,K} \right),   (6)

whose dimension is 2KD.
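Given fitted GMM parameters, the gradient computation of Eqs. (3)-(5) can be sketched as follows (function and variable names are illustrative; the concatenation order in the last line is one possible layout of Eq. (6)):

```python
import numpy as np

def fisher_vector(X, w, mu, var):
    """Fisher vector of descriptors X (T, D) under a diagonal GMM (w, mu, var)."""
    T = X.shape[0]
    # Soft assignment gamma_t(i) of each descriptor to each component, Eq. (5)
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    z = (X[:, None, :] - mu) / np.sqrt(var)                  # standardized differences
    G_mu = (gamma[..., None] * z).sum(0) / (T * np.sqrt(w)[:, None])                 # Eq. (3)
    G_var = (gamma[..., None] * (z ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None]) # Eq. (4)
    return np.concatenate([G_mu.ravel(), G_var.ravel()])     # Eq. (6): length 2*K*D
```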

Normalization
The result is further normalized in two steps, which improves performance [23]. To tackle the sparsity problem of FV, following [23], power normalization is applied first, element-wise:

f(z) = \operatorname{sign}(z)\, |z|^{\alpha}, \quad 0 < \alpha \le 1,   (7)

with α = 1/2. To eliminate the impact of the background (class-independent) part, the feature G^X_λ is further L2-normalized:

G^X_\lambda \leftarrow G^X_\lambda / \| G^X_\lambda \|_2.   (8)
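The two normalization steps can be sketched as follows (α = 1/2 as in [23]; the helper name is ours):

```python
import numpy as np

def normalize_fv(g, alpha=0.5):
    """Power normalization followed by L2 normalization of a Fisher vector."""
    g = np.sign(g) * np.abs(g) ** alpha      # power normalization: damps large peaks
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g       # L2 normalization (guard the zero vector)
```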

Temporal information loss problem
For the image classification task, FV encodes the local features (descriptors) of an image into a super vector that captures global information of the feature distribution, and it has proven highly successful. However, unlike image features, skeleton action features carry temporal information, which is crucial for action recognition. According to Eqs. (3) and (4), G^X_{μ,i} and G^X_{δ,i} are obtained by sum and average operations over the temporal dimension T, which inevitably leads to losing the temporal information of actions. This issue also exists in video-based action recognition methods that utilize FV.
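This loss of temporal order can be checked directly: because Eqs. (3) and (4) only sum over t, permuting the frames leaves the Fisher vector unchanged. A toy check for K = 1 with fixed model parameters (all names are illustrative):

```python
import numpy as np

def fv_k1(X, mu, sd):
    """Fisher vector for a single Gaussian (K = 1) with given per-channel mean/std."""
    z = (X - mu) / sd
    return np.concatenate([z.mean(0), (z ** 2 - 1).mean(0) / np.sqrt(2)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))              # 20 "frames" of 4-dimensional features
mu, sd = np.zeros(4), np.ones(4)          # a fixed reference Gaussian
perm = rng.permutation(20)
# Shuffling the temporal order produces exactly the same encoding
assert np.allclose(fv_k1(X, mu, sd), fv_k1(X[perm], mu, sd))
```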

Temporal enhanced Fisher vector encoding
Since the FV cannot depict the temporal variations in an action clip, in this section we present a temporal enhanced FV algorithm (TEFV) to address this issue. The architecture is shown in Fig. 1. The skeleton data are first fed into EfficientGCN [21]. The extracted features are then fed into the TEFV block, followed by two fully connected (FC) layers; a ReLU function and a dropout layer with 0.25 drop probability are added between the two FC layers. The softmax scores are finally adopted to predict the action labels.
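The classification head described above (FC → ReLU → dropout(0.25) → FC → softmax) can be sketched as follows; the weight shapes and names are illustrative, not the paper's:

```python
import numpy as np

def tefv_head(fv, W1, b1, W2, b2, rng=None, p_drop=0.25):
    """Classifier head sketch: FC -> ReLU -> dropout -> FC -> softmax scores."""
    h = np.maximum(fv @ W1 + b1, 0.0)                 # first FC + ReLU
    if rng is not None:                               # dropout is active only in training
        h *= (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    z = h @ W2 + b2                                   # second FC produces class logits
    e = np.exp(z - z.max())
    return e / e.sum()                                # softmax scores over action classes
```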

GMM generation
The feature maps F ∈ R^{T×N×C} are split into T slices X = {x_t ∈ R^{N×C}, t = 1, 2, ..., T} along the temporal dimension. To retain the temporal information within the action clip, the expectation-maximization (EM) algorithm is utilized to estimate the GMM parameters of each slice x_t rather than the global GMM parameters of the entire feature map X:

u_{\lambda_t}(x) = \sum_{i=1}^{K} w_{t,i}\, u_{t,i}(x), \quad t = 1, 2, \ldots, T.   (9)

Then, the action V can be described by the sequence of GMMs

u_\lambda(X) = \{ u_{\lambda_1}, u_{\lambda_2}, \ldots, u_{\lambda_T} \}.   (10)

Here, N is the number of descriptors per slice, i.e., the number of joints.

Feature encoding
The Fisher vector of each temporal slice x_t is computed with respect to its own GMM u_{λ_t}:

G^{x_t}_{\mu,i} = \frac{1}{N\sqrt{w_{t,i}}} \sum_{n=1}^{N} \gamma_n(i)\, \frac{x_{t,n} - \mu_{t,i}}{\delta_{t,i}},   (11)

G^{x_t}_{\delta,i} = \frac{1}{N\sqrt{2 w_{t,i}}} \sum_{n=1}^{N} \gamma_n(i) \left[ \frac{(x_{t,n} - \mu_{t,i})^2}{\delta_{t,i}^2} - 1 \right],   (12)

where x_{t,n} ∈ R^C is the descriptor of the n-th joint in slice x_t. The Fisher vector of the entire action clip V can be written as

G^V_\lambda = G^{x_1}_\lambda \oplus G^{x_2}_\lambda \oplus \cdots \oplus G^{x_T}_\lambda,   (13)

where ⊕ is the concatenation operation. The final Fisher vector is normalized by power normalization and L2-normalization. According to Eqs. (11) and (12), since the GMMs u_λ(X) preserve the temporal information of the entire action, our TEFV can provide a more discriminative visual representation. This is important for opposite action pairs and fine-grained action recognition. Opposite action pairs, such as "14 put on jacket" and "15 take off jacket", or "16 put on a shoe" and "17 take off a shoe", have similar spatial feature distributions but different temporal orders. Temporal order is crucial for recognizing these reversed action pairs. Temporally fine-grained actions, such as "3 brush teeth" and "4 brush hair", or "74 counting money", "75 cutting nails" and "82 fold paper", have similar body movements but subtle variations in hand/finger motions. Instead of averaging the gradient along the temporal dimension T, preserving the temporal information of the entire action via Eqs. (11) and (12) is helpful for recognizing these fine-grained actions. We provide a detailed discussion of these cases in Sect. "Comparison with Fisher vector encoding".
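For the paper's default K = 1, the per-slice encoding and temporal concatenation can be sketched as follows. The per-slice Gaussian parameters mu_t and sd_t are passed in (e.g., estimated from training data); all names are illustrative.

```python
import numpy as np

def tefv_encode(F, mu_t, sd_t):
    """TEFV sketch with K = 1: one Fisher vector per temporal slice over the N joint
    descriptors, concatenated across time so temporal order is preserved.
    F: features (T, N, C); mu_t, sd_t: per-slice Gaussian parameters (T, C)."""
    T, N, C = F.shape
    parts = []
    for t in range(T):
        z = (F[t] - mu_t[t]) / sd_t[t]               # standardized joint descriptors
        g_mu = z.mean(0)                             # gradient w.r.t. the mean
        g_sd = (z ** 2 - 1).mean(0) / np.sqrt(2)     # gradient w.r.t. the variance
        parts.append(np.concatenate([g_mu, g_sd]))
    g = np.concatenate(parts)                        # concatenation over t: length T * 2C
    g = np.sign(g) * np.sqrt(np.abs(g))              # power normalization
    n = np.linalg.norm(g)
    return g / n if n > 0 else g                     # L2 normalization
```

Unlike the classical FV, reordering the slices reorders the blocks of the output, so temporal order remains visible to the classifier.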

Two-stream framework
Inspired by the two-stream methods in [19,20], we propose a two-stream framework (2sTEFV-GCN) that combines the TEFV model with the GCN model to further improve performance. Our temporal enhanced Fisher vector encoding forms the TEFV stream, which enriches and enhances the temporal information of skeleton data. The other stream, i.e., the GCN stream, adopts EfficientGCN [21]. EfficientGCN obtains joint, velocity, and bone data from the original 3D skeleton coordinates by data preprocessing. The three input sequences are then fed into three separate input branches, each consisting of several GCN and attention blocks. These three branches are fused and fed into the main stream, which is similar in structure to the input branches. In the GCN stream, the features are obtained through a global average pooling operation and a fully connected layer for action recognition. Finally, the softmax scores of the TEFV stream and the GCN stream are fused to predict the action labels.
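The final score fusion can be sketched as a simple late fusion of the two streams' softmax scores; the fusion weight alpha here is an assumed parameter, not taken from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(logits_tefv, logits_gcn, alpha=0.5):
    """Late fusion: weighted sum of the two streams' softmax scores, then argmax."""
    scores = alpha * softmax(logits_tefv) + (1 - alpha) * softmax(logits_gcn)
    return scores.argmax(axis=-1)                    # predicted action labels
```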

Experiments
This section evaluates our TEFV and two-stream framework (2sTEFV-GCN) on two large-scale datasets: NTU RGB+D 60 [43] and NTU RGB+D 120 [44]. In particular, we first evaluate the number of Gaussians K. We next compare different normalization strategies individually and in combination. Then, we compare the proposed TEFV with FV to examine its contributions. To further demonstrate the advantage of our TEFV, we also provide class-wise accuracies and confusion matrix analysis. Finally, we compare the results of our two-stream temporal enhanced Fisher vector framework (2sTEFV-GCN) with state-of-the-art methods.

Datasets
NTU-RGB+D 60 [43] is a large-scale dataset for RGB+D human action recognition. It contains 56,880 samples of 60 action classes. These actions fall into three major categories: daily actions (e.g., brush hair, drop, put on jacket), two-person interactions (e.g., pushing, point finger, giving object), and medical conditions (e.g., sneeze/cough, fan self). The dataset provides 3D coordinates of 25 joints for each subject. The authors recommend two benchmarks: (1) cross-subject (X-Sub): since these actions were performed by 40 volunteers, the dataset can be split into training and validation sets according to different volunteers. The training set consists of 40,320 samples performed by 20 actors. The test set consists of 16,560 samples performed by the other 20 actors.
(2) Cross-view (X-View): since these actions were captured by three cameras with different horizontal angles, this benchmark employs camera 1, with a 45° horizontal angle, containing 18,960 samples for testing, and the other two cameras, with −45° and 0° angles, containing 37,920 samples for training.
NTU RGB+D 120 [44] is by far the largest dataset for skeleton-based action recognition. This dataset is an extended version of NTU-RGB+D 60 by adding another 60 classes with 57,600 samples. The 120 action classes also include daily actions, two person interactions, and medical conditions actions. These actions are acted by 106 volunteers of a wide range of ages and captured from 155 different camera viewpoints. The newly added fine-grained actions and larger variation in subjects, viewpoints, and backgrounds make it very challenging for skeleton-based action recognition.
The authors recommend two benchmarks: (1) cross-subject (X-Sub), in which the 106 subjects are split into separate training and test groups; and (2) cross-setup (X-Set), in which samples with even collection-setup IDs are used for training and samples with odd setup IDs are used for testing.

Training details
In our experiments, the learning rate is set as 0.0001 with a cosine schedule decay. Adam is employed as the optimization strategy. Cross-entropy is employed as the loss function. The maximum number of training epochs is set as 100. Our model is implemented in the Pytorch framework and trained on two Tesla V100 GPUs.

Evaluation of the number of Gaussians
The key parameter in the GMM aggregation of our TEFV is the number of Gaussians K. The performance on the two datasets with different numbers of Gaussians K, varying from 1 to 9 at equal intervals, is shown in Figs. 2 and 3. Power normalization and L2 normalization are applied by default for a fair comparison.
As shown in Figs. 2 and 3, the larger the GMM size we set, the lower the recognition performance we obtain. Note that this differs from previous work [27][28][29], where the best results are often obtained with a larger number of Gaussians K. We suggest that this is because many hand-crafted features used for image classification (SIFT [24]) and video-based action recognition (IDT, HOF, HOG, MBH [27][28][29]) are represented by a large number of local descriptors, so a larger GMM is required to fit their distribution. In contrast, the features extracted from the last graph convolution block of the EfficientGCN are high-level features with fewer descriptors. A GMM with one component is sufficient to fit the distribution of these high-level features. Accordingly, we set the number of Gaussians K = 1 as default in the following experiments.

Evaluation of normalization strategies
We compare different normalization strategies on both the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are presented in Table 1. It is observed that our TEFV with Power+L2 normalization performs best on both datasets under all protocols. Therefore, Power+L2 normalization is adopted in the following experiments. The effect of power normalization is shown in Fig. 4. Many values of the L2-normalized FV features are crowded around zero; the power normalization effectively alleviates this sparsity problem of the L2-normalized features.

Comparison with Fisher vector encoding
The results of Temporal Enhanced FV and FV are summarized in Table 2. It is observed that our Temporal Enhanced FV is better than FV.
To further validate the advantage of our TEFV, we analyze the class-wise accuracy and confusion matrix on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset. The accuracy of each class is presented in Fig. 5.
In class "14 put on jacket", 5 samples are misclassified by FV as the class "15 take off jacket", while all of them are correctly classified by our TEFV. In class "17 take off a shoe", 63 samples are confused as class "16 put on a shoe" by FV, while only 43 of them are incorrectly classified by our TEFV. These opposite action pairs have similar spatial feature distribution, but different temporal dynamics. These cases demonstrate the advantage of our TEFV, which can preserve the temporal information of the entire sample.
Besides the advantage of preserving the temporal information of the entire sample we mentioned above, the ability to capture fine-grained motions is also an advantage of our TEFV.
In class "4 brush hair", 6 samples are misclassified by FV as the class "3 brush teeth", while only 2 of them are incorrectly classified by our TEFV; 2 samples are incorrectly classified by FV as the class "20 put on a hat/cap", while all of them are correctly classified by our TEFV. In class "37 wipe face", 8 samples are incorrectly classified by FV as the class "18 put on glasses", while only 2 of them are misclassified by our TEFV. The key to correctly classifying these samples is recognizing the fine-grained spatial configurations and temporal dynamics of hand and arm skeletons. In class "11 reading", 71 samples are confused with class "12 writing" by FV, while only 56 of them are incorrectly classified by our TEFV. In class "29 play with phone/tablet", 36 samples are confused with the class "30 type on a keyboard" by FV, while only 28 of them are incorrectly classified by our TEFV; 23 samples are misclassified by FV as the class "75 cutting nails", while only 14 of them are incorrectly classified by our TEFV; 12 samples are incorrectly classified as the class "84 play magic cube" by FV, while only half of them are misclassified by our TEFV. In class "30 type on a keyboard", 18 samples are confused with the class "29 play with phone/tablet" by FV, while only 13 of them are incorrectly classified by our TEFV; 7 samples are misclassified by FV as the class "84 play magic cube", while only 2 of them are incorrectly classified by our TEFV. In class "74 counting money", 36 samples are confused with the class "75 cutting nails" by FV, while only 27 of them are incorrectly classified by our TEFV; 13 samples are incorrectly classified as the class "82 fold paper" by FV, while only 7 of them are misclassified by our TEFV. Since these groups of actions share similar spatial and temporal variations, distinguishing them is challenging. The key to correctly classifying these classes is recognizing fine-grained hand and finger dynamics.
In class "14 put on jacket", 11 samples are misclassified by FV as the class "87 put on bag", while all of them are correctly classified by our TEFV. In class "15 take off jacket", 4 samples are confused as the class "16 put on a shoe", while all of them are correctly classified by our TEFV. 8 samples are misclassified by FV as class "87 put on bag", while only 2 of them are incorrectly classified by our TEFV. Correctly classifying these classes requires the ability to recognize the fine-grained motions of hand and body skeletons.
All these cases demonstrate that our TEFV not only preserves the temporal information of the entire sample but also captures fine-grained spatial configurations and temporal dynamics.
Although our TEFV outperforms FV in most of the classes, there are still a few actions that are difficult to recognize. For example, in class 56, our TEFV is inferior to FV. The failure numbers of these classes for our TEFV and FV on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset are shown in Table 3.
In class "56 giving something to other person", 18 samples are misclassified by TEFV as the class "57 touch other person's pocket", while only 12 of them are incorrectly classified by FV. "56 giving something to other person" consists of two sub-actions: one person hands out something and the other reaches for it. In class "57 touch other person's pocket", only one person has arm and hand movements. These actions involve significant arm and hand movements but no fine-grained temporal dynamics, so they can be well recognized by the classical FV, which employs a GMM to fit the global distribution of features. An over-emphasis on preserving detailed temporal information leads to overfitting for these actions.

Evaluation of the two-stream framework
We verify our two-stream temporal enhanced Fisher vector encoding framework (2sTEFV-GCN) on the NTU RGB+D 60 and NTU RGB+D 120 datasets. The results are summarized in Table 4. The two-stream temporal enhanced Fisher vector framework outperforms the one-stream-based methods.
There are two typical works we should mention. The first is ST-GCN [18], which innovatively employed GCN for skeleton-based action recognition. Compared with this popular model, our two-stream Temporal Enhanced Fisher Vector framework obtains obvious improvements on both the X-view and X-sub benchmarks. The other is EfficientGCN-B4 [21], the recent state-of-the-art method. Our two-stream Temporal Enhanced Fisher Vector encoding framework (2sTEFV-GCN) consistently outperforms EfficientGCN-B4 on both benchmarks.
These results on the two large-scale datasets demonstrate the superiority of our 2sTEFV-GCN. We consider that this is because our TEFV algorithm can effectively utilize the information of feature distributions to provide complementary information to GCN-based methods, while existing GCN-based methods ignore the information of feature distributions. Since the TEFV algorithm is different from GCN, this hybrid framework (2sTEFV-GCN) is capable of incorporating the advantages of both algorithms.

Table 3 The failure numbers of action "56 giving something to other person" on the Cross-setup (X-set) benchmark of the NTU-RGB+D 120 dataset. Bold indicates the optimal recognition result.

Confused class            5   25   52   53   54   55   57   106   107   109   110
Failure numbers of FV     4    2    1    2    2    1   12     2     1     1     2
Failure numbers of TEFV   4    2    0    1    1    1   18     2     1     0     2

Note on Table 4: the top part consists of several LSTM-based methods, the second part contains CNN-based methods, the third part consists of GCN-based methods, and the fourth part contains hybrid methods. Bold indicates the optimal recognition result. *These results are reported in [19].

Conclusion
In this paper, we introduced Fisher vector (FV) encoding into GCN to effectively utilize the information of feature distributions. We demonstrated by our analysis that Fisher vector encoding inevitably loses the temporal information of actions. To tackle this problem, we proposed a temporal enhanced Fisher vector encoding algorithm (TEFV) to provide a more discriminative visual representation. Compared with FV, our TEFV model not only preserves the temporal information of the entire action but also captures fine-grained spatial configurations and temporal dynamics. Moreover, we proposed a two-stream Temporal Enhanced Fisher vector encoding framework (2sTEFV-GCN) that combines the TEFV model with the GCN model to further improve performance. On two large-scale datasets for skeleton-based action recognition, NTU-RGB+D 60 and NTU-RGB+D 120, our model achieves state-of-the-art performance.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.