
1 Introduction

In computer vision, human action recognition plays a fundamental and important role. It has many important applications, including video surveillance, human-computer interaction, game control, and sports video analysis [1].

Action recognition is a challenging task in the computer vision community. According to the type of the input data, the existing approaches can be roughly divided into two categories: RGB video based action recognition and 3D skeleton data based action recognition [2]. The RGB video based methods mainly focus on modeling spatial and temporal representations from RGB frames and temporal optical flow. Although RGB video based methods have achieved promising results [1, 3], there still exist some limitations, e.g., background clutter and illumination changes. Moreover, the spatial appearance only contains 2D information, which makes it hard to capture all the action information, and optical flow is generally computationally intensive. On the other hand, Johansson et al. [4] have recognized that 3D skeleton sequences can effectively represent the dynamics of human actions. Moreover, a skeleton sequence does not contain color information and is not affected by the limitations of RGB video. Such a robust representation allows modeling more discriminative temporal characteristics of human actions. Besides, skeleton sequences can be captured by the Kinect [5] and by advanced human pose estimation algorithms [6]. In the last decade, skeleton-based human action recognition has attracted more and more attention [1, 2]. In this paper, we focus on the problem of skeleton-based action recognition.

For skeleton-based action recognition, the existing methods explore different models to learn spatial and temporal features of skeleton sequences. Many existing methods apply relative joint coordinates; for example, Vemulapalli et al. [8] utilize rotations and translations to represent the 3D geometric relationships of body parts in a Lie group, which may overlook the absolute movements of skeleton joints. Recently, there is a growing trend toward methods based on Long Short-Term Memory (LSTM) networks; Shahroudy et al. [9] introduce a part-aware LSTM network to further improve the performance of action recognition. Despite the great improvement in performance, two urgent problems remain to be solved [7]. First, human behavior is accomplished in coordination with each part of the body. It is very difficult to capture the high-level spatial structural information within each frame if the concatenation of all body joints is fed directly into networks. Second, these methods use LSTM to directly model the overall temporal dynamics of skeleton sequences, and the final hidden representation of the RNN is used to recognize the actions. For the task of action recognition, the current prediction depends not only on the past but also on expectations of the future. However, the last hidden representation cannot completely contain the detailed temporal dynamics of sequences.

In this paper, we propose a novel Deep Stacked Bidirectional LSTM Network (DSB-LSTM) for this task, which can effectively address the above challenges. Figure 1 shows the overall pipeline of our model. Firstly, we propose a feature extraction method to capture modulus ratio spatial features and vector angle spatial features within each frame. Then we concatenate these spatial features and normalize them using the mask layer. Next, we apply deep stacked bidirectional LSTM layers to model spatial-temporal features. Finally, a fully connected layer and a softmax layer are applied to the obtained representation to classify the actions. The main contributions of this work are summarized as follows:

  • We propose a novel DSB-LSTM network for skeleton-based action recognition, which is able to effectively capture discriminative spatiotemporal information.

  • We perform extensive experiments on three action recognition datasets; the results consistently demonstrate the effectiveness of the proposed model.

Fig. 1. The architecture of the proposed Deep Stacked Bidirectional LSTM Network (DSB-LSTM). The MASK layer pads sequences with different frame counts to a normalized frame number. The TDP layer strengthens the robustness of the model. The MP layer enhances the representation of temporal features. BiLSTM denotes a bidirectional LSTM.

2 Related Work

Human action recognition based on skeleton data has received a lot of attention due to its effective representation of motion dynamics. Traditional skeleton-based action recognition methods mainly focus on designing hand-crafted features [8, 10, 11]. Vemulapalli et al. [8] represent each skeleton using the relative 3D rotations between various body parts; the relative 3D geometry between all pairs of body parts is also applied to represent the 3D human skeleton in [8].

Recent works mainly learn human action representations with deep learning networks. Du et al. [1] divide the human skeleton into five parts according to the human physical structure, and then separately feed them into a hierarchical recurrent neural network to recognize actions. A spatial-temporal attention network learns to selectively focus on discriminative spatial and temporal features in [11]. Zhang et al. [12] present a view adaptive model for skeleton sequences, which is capable of regulating the observation viewpoints to suitable ones by itself. The works in [13] further show that learning discriminative spatial and temporal features is the key element for human action recognition. A hierarchical CNN model is presented in [14] to learn representations for joint co-occurrences and temporal evolutions. A spatial-temporal graph convolutional network (ST-GCN) is proposed for action recognition in [13]. Each spatial-temporal graph convolutional layer constructs spatial characteristics with a graph convolutional operator and models temporal dynamics with a convolutional operator. Compared with ST-GCN, Si et al. [11] apply graph neural networks to capture spatial structural information and then use LSTM to model temporal dynamics. Despite the significant performance improvement in [11], that model ignores the co-occurrence relationship between spatial and temporal features. In this paper, we propose a novel DSB-LSTM network that can not only effectively extract discriminative spatial and temporal features but also explore the relationship between the spatial and temporal domains.

3 Method

In this section, we first review some necessary background. Then, we introduce the features extracted from the skeleton sequence. Finally, the proposed DSB-LSTM for action recognition is discussed.

3.1 Preliminaries

Recurrent neural networks (RNNs) have an internal state that exhibits dynamic temporal behavior, which makes them naturally suitable for supervised sequence labelling. They map an input sequence to an output sequence and can process sequences of arbitrary length. LSTM is an advanced RNN architecture that can learn long-range dependencies. The hidden state representation \(h_t\) of a unit at each time step t is updated by:

$$\begin{aligned} f_t&=\sigma (W_f \cdot [h_{t-1},x_t]+b_f),\\ i_t&=\sigma (W_i \cdot [h_{t-1},x_t]+b_i),\\ C_t&=f_t*C_{t-1}+i_t*\tanh (W_C \cdot [h_{t-1},x_t]+b_C),\\ o_t&=\sigma (W_o \cdot [h_{t-1},x_t]+b_o),\\ h_t&=o_t*\tanh (C_t), \end{aligned}$$
(1)

where \(x_t\) denotes the input, and \(i_t\), \(f_t\), \(o_t\) denote the internal representations corresponding to the input gate, forget gate and output gate, respectively. All the matrices W are connection weights and all the variables b are biases. The gates determine when the input is significant enough to remember, when the cell should continue to remember or forget the value, and when it should output the value.
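To make Eq. (1) concrete, the following minimal NumPy sketch performs one LSTM step; the fused weight layout and all sizes are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of one LSTM step from Eq. (1), with assumed dimensions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One update of Eq. (1). W maps [h_{t-1}, x_t] to all four gates at once."""
    z = W @ np.concatenate([h_prev, x_t]) + b   # shared affine transform
    d = h_prev.shape[0]
    f_t = sigmoid(z[0:d])                       # forget gate
    i_t = sigmoid(z[d:2*d])                     # input gate
    C_hat = np.tanh(z[2*d:3*d])                 # candidate cell state
    o_t = sigmoid(z[3*d:4*d])                   # output gate
    C_t = f_t * C_prev + i_t * C_hat            # new cell state
    h_t = o_t * np.tanh(C_t)                    # new hidden state
    return h_t, C_t

# Toy usage with assumed sizes: 48-dim input, 64-dim hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 48, 64
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_in), h, C, W, b)
```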

Since both past and future contexts are important for sequence labelling, for the task of action recognition the current prediction depends not only on the past but also on expectations of the future. The bidirectional LSTM (BiLSTM) [15] elegantly combines forward and backward dependencies by using two separate recurrent hidden layers to process the input sequence. With a BiLSTM, the output at each time step is provided with complete historical and future contexts. The BiLSTM hidden state representation \(h_t\) at level l is updated by:

$$\begin{aligned} h_{t}^{(l)}=f_{h}^{(l)}(h_{t}^{(l-1)},h_{t-1}^{(l)}), \end{aligned}$$
(2)

where \(h_{t}^{(l)}\) is the hidden state of the l-th level at time step t, and \(f_{h}^{(l)}\) is the nonlinear function of the BiLSTM unit. When l = 1, the state is computed using \(x_t\) instead of \(h_{t}^{(l-1)}\).
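As an illustration of the stacked recursion in Eq. (2), the following tf.keras sketch builds two stacked bidirectional levels, each consuming the full output sequence of the level below; the layer widths and input dimensions are assumed values.

```python
# A sketch of the stacked recursion in Eq. (2); sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

T, d = 60, 48                                  # assumed frames per clip, features per frame
x = tf.keras.Input(shape=(T, d))               # level l = 0: h^(0)_t = x_t
h1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # level l = 1
h2 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(h1)  # level l = 2
model = tf.keras.Model(x, h2)
model.summary()   # each level consumes the full output sequence of the level below
```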

3.2 Skeleton Sequence Feature Extraction

A human body can be represented by a stick figure, called a human skeleton, which consists of line segments linked by joints; the motion of the joints provides the key to motion estimation and recognition of the whole figure. Given a human subject, the skeleton data involves two geometric constraints. First, since bone lengths are constant, the distance between two adjacent joints along a connected segment is fixed. Second, the rotation angle at each joint has a fixed range. Based on the above observations, the skeleton data conveys two types of information: the skeleton sequence modulus ratio (MR) and the skeleton sequence vector angles (VA). To make the most of the ability of deep networks to learn representations from raw data, the extracted MR and VA features should have excellent spatiotemporal discriminability. The details are presented as follows.

Skeleton Sequence Modulus Ratio: Skeleton data contains the positions of the joints in each frame. The relative positions of the joints vary with body type, and the captured joints are expressed in the sensor's coordinate system. In order to overcome the influence of the subject's body type and the sensor's coordinate system, we use the joint modulus ratio to characterize the relative change of the skeletal point positions. We set the subject's hip joint as the origin of a new coordinate system and convert the coordinates of the other joints from the sensor's coordinate system into the new one. The new coordinates are computed by:

$$\begin{aligned} f=p_{n}-p_{0}, \qquad (n=1,2,3,...,N), \end{aligned}$$
(3)

where \(p_{0}\) denotes the coordinates of the hip joint and \(p_{n}\) denotes the coordinates of the other joints, so f denotes the vector from the hip to another joint. The feature vector of each frame is \(f^{t}=[f_{x}^{t},f_{y}^{t},f_{z}^{t}]\), and the whole sequence is characterized by \(F=[f^{1},f^{2},...,f^{t},...,f^{T}]\). In order to balance the effects of body differences between subjects, we normalize each feature vector using the distance from the head to the hip. Let h be the distance from the head to the center of the hip; the modulus ratio is finally given by

$$\begin{aligned} \overline{f}=f/h. \end{aligned}$$
(4)
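The MR feature of Eqs. (3)-(4) can be sketched in a few lines of NumPy; the hip and head joint indices below are placeholders that depend on the sensor's joint ordering.

```python
# A NumPy sketch of the modulus ratio (MR) features in Eqs. (3)-(4).
import numpy as np

HIP, HEAD = 0, 3    # assumed joint indices; adjust for the actual skeleton layout

def modulus_ratio(seq):
    """seq: (T, N, 3) array of T frames with N joints in 3D."""
    f = seq - seq[:, HIP:HIP + 1, :]                           # Eq. (3): hip-centered vectors
    h = np.linalg.norm(seq[:, HEAD] - seq[:, HIP], axis=-1)    # head-to-hip distance per frame
    return f / h[:, None, None]                                # Eq. (4): normalize by h

seq = np.random.rand(60, 20, 3)    # toy clip: 60 frames, 20 joints
mr = modulus_ratio(seq)            # shape (60, 20, 3)
```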

Skeleton Sequence Vector Angles: The joints drive the movements of the interconnected segments, causing the angles between connected segments to change; the skeleton sequence vector angles thus indirectly reflect human actions. In order to obtain the skeleton sequence vector angles, we need to construct structure vectors from the skeletal data. Based on the geometric constraints, we first construct 22 structure vectors, as shown in Table 1. In Table 1, \(V_{lshoulder\_to\_lelbow}\) denotes the vector from the left shoulder to the left elbow. For example, in the t-th frame of an action sequence, let the left shoulder point be \(p_{lshoulder}^{t}=(x_{1}^{t},y_{1}^{t},z_{1}^{t})\), the left elbow point \(p_{lelbow}^{t}=(x_{2}^{t},y_{2}^{t},z_{2}^{t})\), and the left wrist point \(p_{lwrist}^{t}=(x_{3}^{t},y_{3}^{t},z_{3}^{t})\). Then \(V_{lshoulder\_to\_lelbow}\) and \(V_{lelbow\_to\_lwrist}\) are calculated as:

$$\begin{aligned} V_{lshoulder\_to\_lelbow}^{t}&=(x_{1}^{t}-x_{2}^{t},\;y_{1}^{t}-y_{2}^{t},\;z_{1}^{t}-z_{2}^{t}),\\ V_{lelbow\_to\_lwrist}^{t}&=(x_{2}^{t}-x_{3}^{t},\;y_{2}^{t}-y_{3}^{t},\;z_{2}^{t}-z_{3}^{t}), \end{aligned}$$
(5)
Table 1. The definition of the structure vectors.
Table 2. The calculated angles from the structure vectors.

Then, the angles calculated from the structure vectors are shown in Table 2. For example, \(\theta _{lshoulder\_lelbow\_lwrist}\) denotes the angle formed by the vectors \(V_{lshoulder\_to\_lelbow}\) and \(V_{lelbow\_to\_lwrist}\). Each skeleton sequence vector angle is obtained from the following cosine relation:

$$\begin{aligned} \cos \theta _{lshoulder\_lelbow\_lwrist}= \frac{V_{lshoulder\_to\_lelbow} \cdot V_{lelbow\_to\_lwrist}}{|V_{lshoulder\_to\_lelbow}|\,|V_{lelbow\_to\_lwrist}|}, \end{aligned}$$
(6)

where \(V_{lshoulder\_to\_lelbow} \ne 0 \) and \( V_{lelbow\_to\_lwrist} \ne 0\).
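As a worked illustration of Eqs. (5)-(6), the following NumPy sketch computes the left-elbow angle from three joint positions; np.arccos recovers the angle from the cosine in Eq. (6).

```python
# A NumPy sketch of one vector angle feature from Eqs. (5)-(6).
import numpy as np

def joint_angle(p_shoulder, p_elbow, p_wrist, eps=1e-8):
    v1 = p_shoulder - p_elbow                  # V_{lshoulder_to_lelbow}, Eq. (5)
    v2 = p_elbow - p_wrist                     # V_{lelbow_to_lwrist}, Eq. (5)
    cos_theta = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps)  # Eq. (6)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

theta = joint_angle(np.array([0.0, 1.0, 0.0]),   # toy shoulder
                    np.array([0.0, 0.0, 0.0]),   # toy elbow
                    np.array([1.0, 0.0, 0.0]))   # toy wrist
print(np.degrees(theta))                         # 90 degrees for this toy pose
```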

3.3 Deep Stacked Bidirectional LSTM Neural Network

The DSB-LSTM for action recognition is shown in Fig. 1. The skeleton sequence modulus ratio and vector angle features are fed as the inputs. The backbone of our network consists of two BiLSTM layers and an LSTM layer, chosen for their excellent classification performance. A temporal max pooling (MP) layer along the time axis is placed on top of the last LSTM layer to obtain a time-invariant vector representation of the sequence. After that, dropout (DP) is employed and a fully-connected (FC) layer with softmax activation is used to classify actions. In particular, to facilitate feature learning and improve model robustness, we introduce two novel layers: the masking mechanism (MASK) layer and the temporal dropout (TDP) layer. The details are described as follows.
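A minimal tf.keras sketch of this pipeline (MASK, TDP, two BiLSTM layers, an LSTM layer, MP, DP, and the FC/softmax classifier) is given below. All layer sizes, dropout rates, and the mask value are assumptions, since the exact hyperparameters are not specified here.

```python
# A sketch of the DSB-LSTM pipeline in Fig. 1 under assumed hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

T, d, n_classes = 60, 82, 20     # assumed: frames, MR+VA feature dim, action classes

inputs = tf.keras.Input(shape=(T, d))
x = layers.Masking(mask_value=0.0)(inputs)                   # MASK layer
x = layers.Dropout(0.1, noise_shape=(None, T, 1))(x)         # TDP: drop whole frames
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.GlobalMaxPooling1D()(x)                           # MP along the time axis
x = layers.Dropout(0.5)(x)                                   # DP
outputs = layers.Dense(n_classes, activation="softmax")(x)   # FC + softmax

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```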

Masking Mechanism: In practice, the lengths of the skeleton sequences collected by the sensor differ. For an LSTM-based prediction problem, if the time series data contains missing/null values or an inconsistent number of frames, the LSTM-based model will fail, because null values cannot be computed during the training process. If the missing values are simply set to zero, or some other pre-defined value, the training and testing results will be highly biased. Thus, we adopt a masking mechanism to overcome the potential missing values problem.

Figure 2 demonstrates the details of the masking mechanism. A BiLSTM cell denotes a BiLSTM layer. A mask value, \(\phi \), is pre-defined, which is normally 0 or Null. We employ the maximum number of frames over all sequences as the standard; a sequence that does not reach the maximum number of frames is padded with missing values. For an input series \(X_T\) of MR and VA features, if \(x_t\) is a missing element, i.e., it equals \(\phi \), the training process at the t-th step will be skipped, and the calculated cell state of the (t-1)-th step will be directly input into the (t+1)-th step. In this case, the output at step t also equals \(\phi \) and will be considered a missing value. Similarly, input data with consecutive missing values can be handled by the masking mechanism.
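The following small tf.keras example illustrates the padding-plus-masking idea, assuming 0 as the mask value: shorter clips are padded to the maximum length, and the computed mask marks which steps downstream recurrent layers should skip.

```python
# A demonstration of masking padded frames, assuming mask value 0.
import numpy as np
import tensorflow as tf

seqs = [np.ones((40, 48), dtype="float32"),    # 40-frame clip
        np.ones((60, 48), dtype="float32")]    # 60-frame clip (the maximum)
batch = tf.keras.preprocessing.sequence.pad_sequences(
    seqs, maxlen=60, dtype="float32", padding="post", value=0.0)

mask = tf.keras.layers.Masking(mask_value=0.0).compute_mask(tf.constant(batch))
print(mask.numpy().sum(axis=1))                # -> [40 60]: valid steps per clip
```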

Fig. 2. Masking layer for time series data with missing values.

Temporal Dropout: Skeletons collected by the sensor may not always be reliable due to noise and occlusion. To address this problem, we adopt an approach based on dropout [16], which improves model robustness. As shown in Fig. 3(a), in standard dropout each hidden unit is randomly omitted from the network with probability \(p_{drop}\) during training. For testing, all activations are used and multiplied by \(1-p_{drop}\) to account for the increase in the expected bias. Temporal dropout is slightly different from standard dropout. Given the \(T\times d\) matrix representation of a sequence, where T is the length of the sequence and d is the feature dimension, we perform only T dropout trials and extend each dropout value across the feature dimension. This technique is inspired by spatial dropout, which processes 4D convolutional feature tensors [17]; we modify it for 3D tensors and apply it to feature learning from sequences. As shown in Fig. 3(b), the temporal dropout is performed before the bidirectional LSTM layer.
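The sketch below contrasts standard dropout with temporal dropout using the noise_shape argument of tf.keras Dropout: a mask of shape (batch, T, 1) performs T Bernoulli trials and broadcasts each across all d features, dropping whole frames. This is one possible realization under assumed shapes, not necessarily the authors' implementation.

```python
# Temporal dropout via a per-time-step dropout mask broadcast over features.
import tensorflow as tf

T, d = 60, 48
x = tf.random.uniform((2, T, d)) + 1.0                # strictly positive inputs

std_drop = tf.keras.layers.Dropout(0.3)               # independent trial per element
tdp = tf.keras.layers.Dropout(0.3, noise_shape=(None, T, 1))  # one trial per time step

y = tdp(x, training=True)
frames_dropped = tf.reduce_all(tf.equal(y, 0.0), axis=-1)     # whole-frame zeros
print(frames_dropped.numpy().sum(axis=1))             # ~0.3 * T frames per sample
```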

Fig. 3. Black grids of the feature map denote the dropped parts.

4 Experiments

We evaluate our proposed model on three benchmark datasets: the MSR Action3D dataset, the Florence 3D Action dataset, and the UTKinect-Action dataset. The analysis of the experimental results confirms the effectiveness of our model for skeleton-based action recognition.

4.1 Datasets

MSR Action3D Dataset: This dataset was captured using a depth sensor similar to Kinect. It consists of 20 actions performed by 10 subjects two or three times each. Altogether, there are 557 valid action sequences, and each frame in a sequence is composed of 20 skeleton joints [8].

Florence 3D Action: This dataset was collected at the University of Florence using a Kinect camera [5]. It includes 9 actions: arm wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, bow. Each action is performed by 10 subjects several times, for a total of 215 sequences. The sequences are acquired using the OpenNI SDK, with skeletons represented by 15 joints instead of the 20 used by the Microsoft Kinect SDK. The main challenges of this dataset are the similarity between actions, the human-object interaction, and the different ways of performing the same action [13].

UTKinect-Action: This dataset was captured using a single stationary Kinect. It consists of 10 actions performed by 10 different subjects, and each subject performed every action twice. Altogether, there are 199 action sequences, and the 3D locations of 20 joints are given. This is regarded as a challenging dataset because of variations in the view point and high intra-class variations [8].

4.2 Results and Comparisons

MSR Action3D Dataset: For the MSR Action3D dataset, we follow the standard protocol provided in [8]. In this protocol, the dataset is divided into three action sets: Action Set 1 (AS1), Action Set 2 (AS2) and Action Set 3 (AS3). In our experiments, we use the samples of subjects 1, 3, 5, 7, 9 for training and the samples of subjects 2, 4, 6, 8, 10 for testing. The experimental results are shown in Table 3, where MR and VA denote the skeleton sequence modulus ratio and vector angle features, and MR+VA denotes the combination of the two. As shown in Table 3, adding MR and VA to the LSTM increases the average accuracy by 5.31% and 8.09%, respectively, which indicates that our feature representation is very useful on this dataset. The DSB-LSTM is around 3.5% higher than the previous method [20]. The confusion matrices on this dataset are shown in Fig. 4(a), (b), (c), where some misclassifications can be seen. For example, in Fig. 4(a) the action “Pick up&Throw” is often misclassified as “Bend”, while the action “Forward-Punch” is misclassified as “TennisServe”. Actually, “Pick up&Throw” has just one more “throw” move than “Bend”, and the “throw” move often occupies only a few frames in the sequence, so it is very difficult to distinguish these two actions.

Table 3. The experimental results comparison on the MSR Action3D dataset.

Florence 3D Action: For the Florence 3D Action dataset, we follow the protocol of [21], in which the dataset is benchmarked by leave-one-subject-out cross-validation. The experimental results are shown on the left of Table 4, where MR and VA denote the skeleton sequence modulus ratio and vector angle features, and MR+VA denotes the combination of the two. As shown in Table 4, the DSB-LSTM achieves the best accuracy of 97.46% on this dataset, and using the combined MR+VA feature is 2.17% and 3.24% higher than using the single MR and VA features, respectively. The DSB-LSTM is around 3.5% higher than the previous method [20]. The confusion matrix on the Florence 3D Action dataset is shown in Fig. 4(d).

UTKinect 3D Action: For the UTKinect-Action dataset, we follow the protocol of [13], in which half of the subjects are used for training and the rest for testing: the first 5 subjects are used for training and the last 5 subjects for testing. As shown on the right of Table 4, the proposed DSB-LSTM achieves the best accuracy of 95.96% on the UTKinect 3D Action dataset. The confusion matrix on the UTKinect 3D Action dataset is shown in Fig. 4(e).

Table 4. Experimental result comparison on the Florence dataset and UTKinect dataset.
Fig. 4. (a), (b), (c) are the confusion matrices of the three action sets of the MSR Action3D dataset; (d), (e) are the confusion matrices of the Florence dataset and the UTKinect dataset.

5 Conclusion

In this paper, we propose a Deep Stacked Bidirectional LSTM Network (DSB-LSTM) for skeleton-based action recognition, which achieves more promising results than existing methods. We first extract modulus ratio features and vector angle features from the skeleton sequence. Then, the DSB-LSTM is presented to effectively capture discriminative spatiotemporal features. The success of our approach can be explained by the introduction of the masking layer and the temporal dropout layer, which improve action recognition precision and generalization capability. In the future, we will try combining skeleton sequences with object appearance to further improve human action recognition.