
1 Introduction

In computer vision, human action recognition plays a fundamental and important role. It has many important applications, including video surveillance, human-computer interaction, game control, and sports video analysis [1].

Action recognition is a challenging task in the computer vision community. According to the type of the input data, the existing approaches can be roughly divided into two categories: RGB video based action recognition and 3D skeleton data based action recognition [2]. The RGB video based methods mainly focus on modeling spatial and temporal representations from RGB frames and temporal optical flow. Although RGB video based methods have achieved promising results [1, 3], there still exist some limitations, e.g., background clutter and illumination changes. Moreover, the spatial appearance only contains 2D information, which makes it hard to capture all the action information, and optical flow is generally computationally intensive. On the other hand, Johansson et al. [4] have recognized that 3D skeleton sequences can effectively represent the dynamics of human actions. Moreover, a skeleton sequence does not contain color information and is not affected by the limitations of RGB video. Such a robust representation allows modeling more discriminative temporal characteristics of human actions. Besides, skeleton sequences can be captured by the Kinect [5] and by advanced human pose estimation algorithms [6]. In the last decade, skeleton-based human action recognition has attracted more and more attention [1, 2]. In this paper, we focus on the problem of skeleton-based action recognition.

For skeleton-based action recognition, the existing methods explore different models to learn spatial and temporal features of skeleton sequences. Many existing methods apply relative joint coordinates; for example, Vemulapalli et al. [8] utilize rotations and translations to represent the 3D geometric relationships of body parts in a Lie group, which may overlook the absolute movements of skeleton joints. Recently, there is a growing trend toward methods based on Long Short-Term Memory (LSTM) networks; Shahroudy et al. [9] introduce a part-aware LSTM network to further improve the performance of action recognition. Despite the great improvement in performance, two urgent problems remain to be solved [7]. First, human behavior is accomplished in coordination with each part of the body. It is very difficult to capture the high-level spatial structural information within each frame if the concatenation of all body joints is fed directly into networks. Second, these methods use LSTM to directly model the overall temporal dynamics of skeleton sequences, and the final hidden representation of the RNN is used to recognize the actions. For the task of action recognition, the current prediction depends not only on the past but also on expectations of the future. However, the last hidden representation cannot completely contain the detailed temporal dynamics of sequences.

In this paper, we propose a novel Deep Stacked Bidirectional LSTM Network (DSB-LSTM) for this task, which can effectively address the above challenges. Figure 1 shows the overall pipeline of our model. Firstly, we propose a feature extraction method to capture modulus ratio spatial features and vector angle spatial features within each frame. Then we concatenate these spatial features and normalize them using the mask layer. Next, we apply deep stacked bidirectional LSTM layers to model spatial-temporal features. Finally, a fully connected layer and a softmax layer are applied to the obtained representation to classify the actions. The main contributions of this work are summarized as follows:

  • We propose a novel DSB-LSTM network for skeleton-based action recognition, which is able to effectively capture discriminative spatiotemporal information.

  • We perform extensive experiments on three action recognition datasets; the results consistently demonstrate the effectiveness of the proposed model.

Fig. 1. The architecture of the proposed Deep Stacked Bidirectional LSTM Network (DSB-LSTM). The MASK layer pads sequences with different frame counts to a normalized frame number. The TDP layer strengthens the robustness of the model. The MP layer enhances the representation of temporal features. BiLSTM denotes a bidirectional LSTM.

2 Related Work

Human action recognition based on skeleton data has received a lot of attention due to its effective representation of motion dynamics. Traditional skeleton-based action recognition methods mainly focus on designing hand-crafted features [8, 10, 11]. Vemulapalli et al. [8] represent each skeleton using the relative 3D rotations between various body parts; the relative 3D geometry between all pairs of body parts is also applied to represent the 3D human skeleton in [8].

Recent works mainly learn human action representations with deep learning networks. Du et al. [1] divide the human skeleton into five parts according to the human physical structure, and then separately feed them into a hierarchical recurrent neural network to recognize actions. A spatial-temporal attention network learns to selectively focus on discriminative spatial and temporal features in [11]. Zhang et al. [12] present a view adaptive model for skeleton sequences, which is capable of regulating the observation viewpoints to suitable ones by itself. The works in [13] further show that learning discriminative spatial and temporal features is the key element for human action recognition. A hierarchical CNN model is presented in [14] to learn representations for joint co-occurrences and temporal evolutions. A spatial-temporal graph convolutional network (ST-GCN) is proposed for action recognition in [13]. Each spatial-temporal graph convolutional layer constructs spatial characteristics with a graph convolutional operator and models temporal dynamics with a convolutional operator. Compared with ST-GCN, Si et al. [11] apply graph neural networks to capture spatial structural information and then use LSTM to model temporal dynamics. Despite the significant performance improvement in [11], that model ignores the co-occurrence relationship between spatial and temporal features. In this paper, we propose a novel DSB-LSTM network that can not only effectively extract discriminative spatial and temporal features but also explore the relationship between the spatial and temporal domains.

3 Method

In this section, we first review some necessary background. Then, we introduce the features extracted from the skeleton sequence. Finally, the proposed DSB-LSTM for action recognition is discussed.

3.1 Preliminaries

Recurrent neural networks (RNNs) have an internal state that exhibits dynamic temporal behavior, which makes them naturally suitable for supervised sequence labelling. They map an input sequence to an output sequence and can process sequences of arbitrary length. LSTM is an advanced RNN architecture that can learn long-range dependencies. The hidden state representation \(h_t\) of a unit at each time step t is updated by:

$$\begin{aligned} f_t&=\sigma (W_f \cdot [h_{t-1},x_t]+b_f),\\ i_t&=\sigma (W_i \cdot [h_{t-1},x_t]+b_i),\\ C_t&=f_t*C_{t-1}+i_t*\tanh (W_C \cdot [h_{t-1},x_t]+b_C),\\ o_t&=\sigma (W_o \cdot [h_{t-1},x_t]+b_o),\\ h_t&=o_t*\tanh (C_t), \end{aligned}$$
(1)

where \(x_t\) denotes the input, and \(i_t\), \(f_t\), \(o_t\) denote the internal representations corresponding to the input gate, forget gate and output gate, respectively. All the matrices W are connection weights and all the variables b are biases. The gates determine when the input is significant enough to remember, when the cell should continue to remember or forget the value, and when it should output the value.
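To make Eq. (1) concrete, the following minimal NumPy sketch performs one LSTM step; the fused weight layout and all sizes are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of one LSTM step from Eq. (1), with assumed dimensions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One update of Eq. (1). W maps [h_{t-1}, x_t] to all four gates at once."""
    z = W @ np.concatenate([h_prev, x_t]) + b   # shared affine transform
    d = h_prev.shape[0]
    f_t = sigmoid(z[0:d])                       # forget gate
    i_t = sigmoid(z[d:2*d])                     # input gate
    C_hat = np.tanh(z[2*d:3*d])                 # candidate cell state
    o_t = sigmoid(z[3*d:4*d])                   # output gate
    C_t = f_t * C_prev + i_t * C_hat            # new cell state
    h_t = o_t * np.tanh(C_t)                    # new hidden state
    return h_t, C_t

# Toy usage with assumed sizes: 48-dim input, 64-dim hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 48, 64
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_in), h, C, W, b)
```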

Since both past and future contexts are important for sequence labelling, for the task of action recognition the current prediction depends not only on the past but also on expectations of the future. The bidirectional LSTM (BiLSTM) [15] elegantly combines forward and backward dependencies by using two separate recurrent hidden layers to process the input sequence. With a BiLSTM, the output at each time step is provided with complete historical and future contexts. The BiLSTM hidden state representation \(h_t\) at level l is updated by:

$$\begin{aligned} h_{t}^{(l)}=f_{h}^{(l)}(h_{t}^{(l-1)},h_{t-1}^{(l)}), \end{aligned}$$
(2)

where \(h_{t}^{(l)}\) is the hidden state of the l-th level at time step t, and \(f_{h}^{(l)}\) is the nonlinear function of the BiLSTM unit. When l = 1, the state is computed using \(x_t\) instead of \(h_{t}^{(l-1)}\).
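As an illustration of the stacked recursion in Eq. (2), the following tf.keras sketch builds two stacked bidirectional levels, each consuming the full output sequence of the level below; the layer widths and input dimensions are assumed values.

```python
# A sketch of the stacked recursion in Eq. (2); sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

T, d = 60, 48                                  # assumed frames per clip, features per frame
x = tf.keras.Input(shape=(T, d))               # level l = 0: h^(0)_t = x_t
h1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # level l = 1
h2 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(h1)  # level l = 2
model = tf.keras.Model(x, h2)
model.summary()   # each level consumes the full output sequence of the level below
```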

3.2 Skeleton Sequence Feature Extraction

A human body can be represented by a stick figure, called a human skeleton, which consists of line segments linked by joints; the motion of the joints provides the key to motion estimation and recognition of the whole figure. Given a human subject, the skeleton data involves two geometric constraints. First, since bone lengths are constant, the distance between two adjacent joints along a connected segment is fixed. Second, the rotation angle at each joint has a fixed range. Based on the above observations, the skeleton data conveys two types of information: the skeleton sequence modulus ratio (MR) and the skeleton sequence vector angles (VA). To make the most of the ability of deep networks to learn representations from raw data, the extracted MR and VA features should have excellent spatiotemporal discriminability. The details are presented as follows.

Skeleton Sequence Modulus Ratio: Skeleton data contains the positions of the joints in each frame. The relative positions of the joints vary with body type, and the captured joints are expressed in the sensor's coordinate system. In order to overcome the influence of the subject's body type and the sensor's coordinate system, we use the joint modulus ratio to characterize the relative change of the skeletal point positions. We set the subject's hip joint as the origin of a new coordinate system and convert the coordinates of the other joints from the sensor's coordinate system into the new one. The new coordinates are computed by:

$$\begin{aligned} f=p_{n}-p_{0}, \qquad (n=1,2,3,...,N), \end{aligned}$$
(3)

where \(p_{0}\) denotes the coordinates of the hip joint and \(p_{n}\) denotes the coordinates of the other joints, so f denotes the vector from the hip to another joint. The feature vector of each frame is \(f^{t}=[f_{x}^{t},f_{y}^{t},f_{z}^{t}]\), and the whole sequence is characterized by \(F=[f^{1},f^{2},...,f^{t},...,f^{T}]\). In order to balance the effects of body differences between subjects, we normalize each feature vector using the distance from the head to the hip. Let h be the distance from the head to the center of the hip; the modulus ratio is finally given by

$$\begin{aligned} \overline{f}=f/h. \end{aligned}$$
(4)
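The MR feature of Eqs. (3)-(4) can be sketched in a few lines of NumPy; the hip and head joint indices below are placeholders that depend on the sensor's joint ordering.

```python
# A NumPy sketch of the modulus ratio (MR) features in Eqs. (3)-(4).
import numpy as np

HIP, HEAD = 0, 3    # assumed joint indices; adjust for the actual skeleton layout

def modulus_ratio(seq):
    """seq: (T, N, 3) array of T frames with N joints in 3D."""
    f = seq - seq[:, HIP:HIP + 1, :]                           # Eq. (3): hip-centered vectors
    h = np.linalg.norm(seq[:, HEAD] - seq[:, HIP], axis=-1)    # head-to-hip distance per frame
    return f / h[:, None, None]                                # Eq. (4): normalize by h

seq = np.random.rand(60, 20, 3)    # toy clip: 60 frames, 20 joints
mr = modulus_ratio(seq)            # shape (60, 20, 3)
```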

Skeleton Sequence Vector Angles: The joints drive the movements of the interconnected segments, causing the angles between connected segments to change; the skeleton sequence vector angles thus indirectly reflect human actions. In order to obtain the skeleton sequence vector angles, we need to construct structure vectors from the skeletal data. Based on the geometric constraints, we first construct 22 structure vectors, as shown in Table 1. In Table 1, \(V_{lshoulder\_to\_lelbow}\) denotes the vector from the left shoulder to the left elbow. For example, in the t-th frame of an action sequence, let the left shoulder point be \(p_{lshoulder}^{t}=(x_{1}^{t},y_{1}^{t},z_{1}^{t})\), the left elbow point \(p_{lelbow}^{t}=(x_{2}^{t},y_{2}^{t},z_{2}^{t})\), and the left wrist point \(p_{lwrist}^{t}=(x_{3}^{t},y_{3}^{t},z_{3}^{t})\). Then \(V_{lshoulder\_to\_lelbow}\) and \(V_{lelbow\_to\_lwrist}\) are calculated as:

$$\begin{aligned} V_{lshoulder\_to\_lelbow}^{t}&=(x_{1}^{t}-x_{2}^{t},\;y_{1}^{t}-y_{2}^{t},\;z_{1}^{t}-z_{2}^{t}),\\ V_{lelbow\_to\_lwrist}^{t}&=(x_{2}^{t}-x_{3}^{t},\;y_{2}^{t}-y_{3}^{t},\;z_{2}^{t}-z_{3}^{t}), \end{aligned}$$
(5)
Table 1. The definition of the structure vectors.
Table 2. The calculated angles from the structure vectors.

Then, the angles calculated from the structure vectors are shown in Table 2. For example, \(\theta _{lshoulder\_lelbow\_lwrist}\) denotes the angle formed by the vectors \(V_{lshoulder\_to\_lelbow}\) and \(V_{lelbow\_to\_lwrist}\). Each skeleton sequence vector angle is obtained from the following cosine relation:

$$\begin{aligned} \cos \theta _{lshoulder\_lelbow\_lwrist}= \frac{V_{lshoulder\_to\_lelbow} \cdot V_{lelbow\_to\_lwrist}}{|V_{lshoulder\_to\_lelbow}|\,|V_{lelbow\_to\_lwrist}|}, \end{aligned}$$
(6)

where \(V_{lshoulder\_to\_lelbow} \ne 0 \) and \( V_{lelbow\_to\_lwrist} \ne 0\).
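As a worked illustration of Eqs. (5)-(6), the following NumPy sketch computes the left-elbow angle from three joint positions; np.arccos recovers the angle from the cosine in Eq. (6).

```python
# A NumPy sketch of one vector angle feature from Eqs. (5)-(6).
import numpy as np

def joint_angle(p_shoulder, p_elbow, p_wrist, eps=1e-8):
    v1 = p_shoulder - p_elbow                  # V_{lshoulder_to_lelbow}, Eq. (5)
    v2 = p_elbow - p_wrist                     # V_{lelbow_to_lwrist}, Eq. (5)
    cos_theta = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps)  # Eq. (6)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

theta = joint_angle(np.array([0.0, 1.0, 0.0]),   # toy shoulder
                    np.array([0.0, 0.0, 0.0]),   # toy elbow
                    np.array([1.0, 0.0, 0.0]))   # toy wrist
print(np.degrees(theta))                         # 90 degrees for this toy pose
```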

3.3 Deep Stacked Bidirectional LSTM Neural Network

The DSB-LSTM for action recognition is shown in Fig. 1. The skeleton sequence modulus ratio and vector angle features are fed as the inputs. The backbone of our network consists of two BiLSTM layers and an LSTM layer, chosen for their excellent classification performance. A temporal max pooling (MP) layer along the time axis is placed on top of the last LSTM layer to obtain a time-invariant vector representation of the sequence. After that, dropout (DP) is employed and a fully-connected (FC) layer with softmax activation is used to classify actions. In particular, to facilitate feature learning and improve model robustness, we introduce two novel layers: the masking mechanism (MASK) layer and the temporal dropout (TDP) layer. The details are described as follows.
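A minimal tf.keras sketch of this pipeline (MASK, TDP, two BiLSTM layers, an LSTM layer, MP, DP, and the FC/softmax classifier) is given below. All layer sizes, dropout rates, and the mask value are assumptions, since the exact hyperparameters are not specified here.

```python
# A sketch of the DSB-LSTM pipeline in Fig. 1 under assumed hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

T, d, n_classes = 60, 82, 20     # assumed: frames, MR+VA feature dim, action classes

inputs = tf.keras.Input(shape=(T, d))
x = layers.Masking(mask_value=0.0)(inputs)                   # MASK layer
x = layers.Dropout(0.1, noise_shape=(None, T, 1))(x)         # TDP: drop whole frames
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.GlobalMaxPooling1D()(x)                           # MP along the time axis
x = layers.Dropout(0.5)(x)                                   # DP
outputs = layers.Dense(n_classes, activation="softmax")(x)   # FC + softmax

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```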

Masking Mechanism: In practice, the lengths of the skeleton sequences collected by the sensor differ. For an LSTM-based prediction problem, if the time series data contains missing/null values or an inconsistent number of frames, the LSTM-based model will fail, because null values cannot be computed during the training process. If the missing values are simply set to zero, or some other pre-defined value, the training and testing results will be highly biased. Thus, we adopt a masking mechanism to overcome the potential missing values problem.

Figure 2 demonstrates the details of the masking mechanism. A BiLSTM cell denotes a BiLSTM layer. A mask value, \(\phi \), is pre-defined, which is normally 0 or Null. We employ the maximum number of frames over all sequences as the standard; a sequence that does not reach the maximum number of frames is padded with missing values. For an input series \(X_T\) of MR and VA features, if \(x_t\) is a missing element, i.e., it equals \(\phi \), the training process at the t-th step will be skipped, and the calculated cell state of the (t-1)-th step will be directly input into the (t+1)-th step. In this case, the output at step t also equals \(\phi \) and will be considered a missing value. Similarly, input data with consecutive missing values can be handled by the masking mechanism.
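The following small tf.keras example illustrates the padding-plus-masking idea, assuming 0 as the mask value: shorter clips are padded to the maximum length, and the computed mask marks which steps downstream recurrent layers should skip.

```python
# A demonstration of masking padded frames, assuming mask value 0.
import numpy as np
import tensorflow as tf

seqs = [np.ones((40, 48), dtype="float32"),    # 40-frame clip
        np.ones((60, 48), dtype="float32")]    # 60-frame clip (the maximum)
batch = tf.keras.preprocessing.sequence.pad_sequences(
    seqs, maxlen=60, dtype="float32", padding="post", value=0.0)

mask = tf.keras.layers.Masking(mask_value=0.0).compute_mask(tf.constant(batch))
print(mask.numpy().sum(axis=1))                # -> [40 60]: valid steps per clip
```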

Fig. 2. Masking layer for time series data with missing values.

Temporal Dropout: Skeletons collected by the sensor may not always be reliable due to noise and occlusion. To address this problem, we adopt an approach based on dropout [16], which improves model robustness. As shown in Fig. 3(a), in standard dropout each hidden unit is randomly omitted from the network with probability \(p_{drop}\) during training. For testing, all activations are used and multiplied by \(1-p_{drop}\) to account for the increase in the expected bias. Temporal dropout is slightly different from standard dropout. Given the \(T\times d\) matrix representation of a sequence, where T is the length of the sequence and d is the feature dimension, we perform only T dropout trials and extend each dropout value across the feature dimension. This technique is inspired by spatial dropout, which processes 4D convolutional feature tensors [17]; we modify it for 3D tensors and apply it to feature learning from sequences. As shown in Fig. 3(b), the temporal dropout is performed before the bidirectional LSTM layer.
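The sketch below contrasts standard dropout with temporal dropout using the noise_shape argument of tf.keras Dropout: a mask of shape (batch, T, 1) performs T Bernoulli trials and broadcasts each across all d features, dropping whole frames. This is one possible realization under assumed shapes, not necessarily the authors' implementation.

```python
# Temporal dropout via a per-time-step dropout mask broadcast over features.
import tensorflow as tf

T, d = 60, 48
x = tf.random.uniform((2, T, d)) + 1.0                # strictly positive inputs

std_drop = tf.keras.layers.Dropout(0.3)               # independent trial per element
tdp = tf.keras.layers.Dropout(0.3, noise_shape=(None, T, 1))  # one trial per time step

y = tdp(x, training=True)
frames_dropped = tf.reduce_all(tf.equal(y, 0.0), axis=-1)     # whole-frame zeros
print(frames_dropped.numpy().sum(axis=1))             # ~0.3 * T frames per sample
```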

Fig. 3. Black grids of the feature map denote the dropped parts.

4 Experiments

We evaluate our proposed model on three benchmark datasets: the MSR Action3D dataset, the Florence 3D Action dataset, and the UTKinect-Action dataset. The analysis of the experimental results confirms the effectiveness of our model for skeleton-based action recognition.

4.1 Datasets

MSR Action3D Dataset: This dataset was captured using a depth sensor similar to Kinect. It consists of 20 actions performed by 10 subjects two or three times each. Altogether, there are 557 valid action sequences, and each frame in a sequence is composed of 20 skeleton joints [8].

Florence 3D Action: This dataset was collected at the University of Florence using a Kinect camera [5]. It includes 9 actions: arm wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, bow. Each action is performed by 10 subjects several times, for a total of 215 sequences. The sequences are acquired using the OpenNI SDK, with skeletons represented by 15 joints instead of the 20 used by the Microsoft Kinect SDK. The main challenges of this dataset are the similarity between actions, the human-object interaction, and the different ways of performing the same action [13].

UTKinect-Action: This dataset was captured using a single stationary Kinect. It consists of 10 actions performed by 10 different subjects, and each subject performed every action twice. Altogether, there are 199 action sequences, and the 3D locations of 20 joints are given. This is regarded as a challenging dataset because of variations in the view point and high intra-class variations [8].

4.2 Results and Comparisons

MSR Action3D Dataset: For the MSR Action3D dataset, we follow the standard protocol provided in [8]. In this protocol, the dataset is divided into three action sets: Action Set 1 (AS1), Action Set 2 (AS2) and Action Set 3 (AS3). In our experiments, we use the samples of subjects 1, 3, 5, 7, 9 for training and the samples of subjects 2, 4, 6, 8, 10 for testing. The experimental results are shown in Table 3, where MR and VA denote the skeleton sequence modulus ratio and vector angle features, and MR+VA denotes the combination of the two. As shown in Table 3, adding MR and VA to the LSTM increases the average accuracy by 5.31% and 8.09%, respectively, which indicates that our feature representation is very useful on this dataset. The DSB-LSTM is around 3.5% higher than the previous method [20]. The confusion matrices on this dataset are shown in Fig. 4(a), (b), (c), where some misclassifications can be seen. For example, in Fig. 4(a) the action “Pick up&Throw” is often misclassified as “Bend”, while the action “Forward-Punch” is misclassified as “TennisServe”. Actually, “Pick up&Throw” has just one more “throw” move than “Bend”, and the “throw” move often occupies only a few frames in the sequence, so it is very difficult to distinguish these two actions.

Table 3. The experimental results comparison on the MSR Action3D dataset.

Florence 3D Action: For the Florence 3D Action dataset, we follow the protocol of [21], in which the dataset is benchmarked by leave-one-subject-out cross-validation. The experimental results are shown on the left of Table 4, where MR and VA denote the skeleton sequence modulus ratio and vector angle features, and MR+VA denotes the combination of the two. As shown in Table 4, the DSB-LSTM achieves the best accuracy of 97.46% on this dataset, and using the combined MR+VA feature is 2.17% and 3.24% higher than using the single MR and VA features, respectively. The DSB-LSTM is around 3.5% higher than the previous method [20]. The confusion matrix on the Florence 3D Action dataset is shown in Fig. 4(d).

UTKinect 3D Action: For the UTKinect-Action dataset, we follow the protocol of [13], in which half of the subjects are used for training and the rest for testing: the first 5 subjects are used for training and the last 5 subjects for testing. As shown on the right of Table 4, the proposed DSB-LSTM achieves the best accuracy of 95.96% on the UTKinect 3D Action dataset. The confusion matrix on the UTKinect 3D Action dataset is shown in Fig. 4(e).

Table 4. Experimental result comparison on the Florence dataset and UTKinect dataset.
Fig. 4. (a), (b), (c) are the confusion matrices of the three action sets of the MSR Action3D dataset; (d), (e) are the confusion matrices of the Florence dataset and the UTKinect dataset.

5 Conclusion

In this paper, we propose a Deep Stacked Bidirectional LSTM Network (DSB-LSTM) for skeleton-based action recognition, which achieves more promising results than existing methods. We first extract modulus ratio features and vector angle features from the skeleton sequence. Then, the DSB-LSTM is presented to effectively capture discriminative spatiotemporal features. The success of our approach can be explained by the introduction of the masking layer and the temporal dropout layer, which improve action recognition precision and generalization capability. In the future, we will try combining skeleton sequences with object appearance to further improve human action recognition.