Skeleton-based human activity recognition using ConvLSTM and guided feature learning

Human activity recognition aims to determine the actions performed by a human in an image or video. Examples of human activity include standing, running, sitting, and sleeping. These activities may involve intricate motion patterns and undesired events such as falling. This paper proposes a novel deep convolutional long short-term memory (ConvLSTM) network for skeleton-based activity recognition and fall detection. The proposed ConvLSTM network is a sequential fusion of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and fully connected layers. The acquisition system applies human detection and pose estimation to pre-calculate skeleton coordinates from the image/video sequence. The ConvLSTM model uses the raw skeleton coordinates along with their characteristic geometrical and kinematic features to construct the novel guided features. The geometrical and kinematic features are built upon the raw skeleton coordinates using relative joint position values, differences between joints, spherical joint angles between selected joints, and their angular velocities. The novel spatiotemporal guided features are obtained using a trained multi-layer CNN-LSTM combination. A classification head consisting of fully connected layers is subsequently applied. The proposed model has been evaluated on the KinectHAR dataset, which has 130,000 samples with 81 attribute values collected with a Kinect (v2) sensor. Experimental results are compared against the performance of isolated CNNs and LSTM networks. The proposed ConvLSTM achieves an accuracy of 98.89%, which is better than the CNNs and LSTMs with accuracies of 93.89% and 92.75%, respectively. The proposed system has been tested in real time and is found to be independent of the pose, the facing of the camera, individuals, clothing, etc. The code and dataset will be made publicly available.


Introduction
The basic aim of human activity recognition systems is to automatically recognize the activities of an individual from the raw data obtained from sensors. Applications of activity detection can be found in many areas, such as human-computer interaction, video surveillance, sports analysis, and video understanding [24,34,19]. Fall monitoring and early reporting is also an important application of human activity recognition. The world is expected to see a 25% increase in its elderly population by 2050, so it is necessary to assist elderly adults over the age of 65 [6]. Falls are a major cause of accidents and even death, especially among the elderly. An estimated $31 billion is spent on direct medical costs for fall injuries in the US [27], making fall prevention and early reporting necessary.
Classical studies typically addressed action recognition using monocular RGB videos [42], which makes it hard to comprehensively represent actions in 3D space. In recent years, low-cost and highly mobile sensors such as Microsoft Kinect have been widely adopted for human activity recognition. Kinect's ability to track skeleton joints has attracted significant attention from computer vision researchers, and different algorithms have been proposed that use the skeleton joint information for recognizing human activities. Skeleton joints extracted from the Kinect can be used to calculate features that are invariant to human body size, appearance, and changes in camera viewpoint [39].
Earlier works on activity recognition have combined convolutional neural networks (CNNs) and long short-term memory (LSTM) networks [11,33,12]. Many of them ran different models in parallel, like an ensemble classifier, and fused the scores of each model to predict the final class labels. For example, Li et al. [17] used score fusion of CNN and LSTM models. Although this method improves upon using either CNNs or recurrent neural networks (RNNs) individually, it does not fully utilize the advantages of both models. As in [17], many previous works on activity recognition feed the skeleton data directly into the model, leaving it to the neural network to extract features from those coordinates.
Skeletal features alone are not enough for recognizing all activities. Some tasks, such as detecting a fall or differentiating between running and walking, require the rate of change of the coordinates of the center of mass, velocity, acceleration, and other derived features such as the head-to-floor distance and the angles between joints. Such features cannot be modeled directly by CNN and LSTM techniques, and they require hand engineering. However, using only hand-engineered features results in a model that is shallow and dataset-dependent [32].
In this paper, a set of derived features along with the raw skeleton joint coordinates is fed to the deep learning networks, as shown in Fig. 2. Initially, we applied skeletal tracking algorithms using a Kinect (v2) sensor, collected the 3D joint locations for each frame, and constructed a skeleton of bones. A set of features has been extracted from the raw data obtained from the Kinect (v2) sensor to improve the model efficiency. Standard features such as velocity, acceleration, position of the center of gravity, and angles between different body joints have been derived from the raw body coordinates. After feature extraction, the dataset, which consists of fourteen derived features along with seventeen skeleton coordinates, is preprocessed and input to the deep learning models. The proposed ConvLSTM model is a sequential combination of CNNs, LSTM, and dense layers, where the CNN is used for feature filtering, the LSTM is used for classification, and the dense layers are used for feature mapping (as shown in Fig. 2). LSTM cells can represent contextual dependencies in the temporal domain effectively, while CNNs are better at processing feature sets with more spatial information. By combining both, we retrieve the best set of features containing spatial and temporal information. Finally, a fully connected network is applied to these features to obtain the classification scores. We have found experimentally that combining CNNs and LSTMs in a serial manner gives better efficiency than using either of them individually or using them in parallel.
Major contributions of this manuscript are highlighted as follows:
1. The paper presents a privacy-preserving activity recognition and fall detection system using the data obtained from a Kinect (v2) sensor.
2. We propose a ConvLSTM model, which is a sequential combination of CNNs, LSTMs, and dense layers. LSTM cells are used to represent the contextual dependencies in the temporal domain, while CNNs are used for extracting the spatial features. The combination of these gives the best set of spatiotemporal features.
3. To preserve the privacy of the user, instead of passing the raw videos directly to the network, a set of derived features along with the skeletal joint coordinates is fed to the deep learning network. A set of fourteen derived features along with seventeen skeleton coordinates is input to the networks for recognizing the activities and detecting falls.
4. A new dataset is presented for activity recognition and fall detection. This dataset has been collected with wide variations and has the features and complexity generally required for training a system. During experimentation, this dataset is used to test the performance of the proposed model in recognizing activities and detecting falls (described in Section 4.1).
The remainder of this paper is organized as follows. Section 2 presents the literature survey. Section 3 describes the proposed methodology along with feature extraction and the CNN, LSTM, and ConvLSTM models. Section 4 describes the data collection procedure and the experimental results, and compares the performance of the CNN, LSTM, and ConvLSTM models. Section 5 presents the concluding remarks and proposes future directions.

Related Works
Various approaches have been proposed for activity recognition and fall detection using single or multiple cameras. Multi-view cameras were found to improve accuracy [2], but they lead to higher complexity and duplicate cost [25]. Low-cost depth sensors such as Kinect have recently been investigated to deal with these limitations. These approaches used low-resolution depth information from Kinect for joint localization to detect falls. Several works have proposed fall detection systems using Kinect depth sensors. For example, Gasparrini et al. [9] placed a Kinect sensor on the ceiling and used depth images for fall detection, while Uden et al. [28] placed a Kinect sensor under the bed to detect falls along with other activities such as leaving the room, feet in front of the bed, and activity in the room. However, these approaches struggle to give accurate results in a real-time environment.
Researchers have used different techniques to construct classification models for human activity recognition (HAR). For example, Wang et al. [31] used deep fully-connected networks (DFNs) to facilitate a better representation of data than artificial neural networks (ANNs). Vepakomma et al. [29] took hand-engineered features obtained from the sensors for human activity recognition. Hammerla et al. [11] used DFNs with five hidden layers for feature extraction. Generally, DFNs with a larger number of hidden layers serve as the dense layer for other deep learning algorithms. A few researchers have also used autoencoders, a variety of ANNs used for unsupervised learning, for activity recognition. The aim of an autoencoder is to learn a compact representation of the dataset, typically for dimensionality reduction. Almaslukh et al. [1] and Wang et al. [30] utilized greedy approaches in which each layer was pre-trained and then fine-tuned. In contrast, Li et al. [20] used sparse autoencoders, adding the Kullback-Leibler (KL) divergence and noise to the cost function, which ultimately improved the performance for activity recognition. Stacked autoencoders (SAEs) learn features in an unsupervised manner, which may be used to enhance feature extraction for HAR. However, the performance of SAEs depends on the number of layers and their activation functions, which makes the search for an optimal solution hard.
Deep learning based action recognition can be divided into two broad categories, i.e., CNN-based approaches [7,17,14,35] and RNN-based approaches [8,43,21,16,41]. CNNs have obtained promising results in image/video classification, signal processing, etc. They perform better when processing feature sets with more spatial information [36]. CNNs comprise one or more convolutional layers. Once the convolution operation completes, pooling and fully connected layers are applied. Weng et al. [35] applied Naive-Bayes mutual information maximization (NBMIM) [38] to CNNs for action recognition. RNNs are best suited for sequential information. They are widely applied in fields such as speech recognition and natural language processing [40,18,10]. Activity recognition can also be considered a sequential problem. Du et al. [8] presented a skeleton-based activity recognition system using an end-to-end deep learning model consisting of hierarchical RNNs. In their methodology, the human skeleton obtained from the Kinect sensor is divided into five parts, which are then fed to five different bidirectional RNNs. Among the various RNN architectures, LSTMs are the most popular due to their memory capacity and their ability to remember useful data for an extended period (Fig. 1). For activity recognition, LSTMs are very robust in real-world settings [13]. Zhu et al. [43] proposed an approach to automatically learn human skeletal representations. They used RNNs and LSTMs to learn the long-term temporal dependencies in the dataset. To model the joint co-occurrences with the LSTMs and RNNs, joint position values were used as the input for each time slot. Liu et al. [21] proposed a new gating scheme in LSTM for sequential action recognition. Lee et al. [16] proposed a temporal sliding LSTM, which includes short-, medium-, and long-term units. Zhang et al. [41] presented an element-wise attention gate using RNNs for action recognition.
The combination of CNNs and LSTMs is among the fastest-emerging hybrid models for activity recognition. Such models are especially applied to vision tasks involving sequential inputs and outputs. CNNs basically consist of two modules: first, feature map construction or extraction, and second, a basic classifier. The work in this research demonstrates a sequential ConvLSTM approach that uses the best features of both CNN and LSTM. We have focused on combining CNNs, LSTMs, and dense layers in a sequential manner to take advantage of all three methods.

Proposed Methodology
The proposed methodology is threefold. First, the human body frames are acquired from the Kinect-v2 sensor and the 3D skeleton joint coordinates are tracked. A 3D joint normalization technique is then applied to preprocess the data. The 3D coordinates are used to build a 3D bounding box around the tracked human. Suitable features such as velocity, acceleration, angles between skeleton joints, height, and width are extracted for identifying different activities. For each activity, the important features are selected and the feature vectors are constructed. Second, the dataset is stored in CSV format, consisting of the raw skeleton joint coordinates and the extracted features. Third, the dataset containing the joint values along with the manually extracted features is input to the deep learning networks, i.e., CNNs, LSTMs, and ConvLSTM, for activity recognition and fall detection. Fig. 2 illustrates the proposed ConvLSTM model, which is a sequential fusion of CNNs and LSTMs.

Preprocessing
This subsection presents the preprocessing methods applied to the raw data obtained from the Kinect (v2) sensor. We applied skeleton joint estimation methods to acquire the body frames and then applied 3D joint tracking methods. Fig. 3 shows the skeleton joints tracked by the sensor. As Chen et al. [4] stated, not all skeletal joints are useful, and only some of them are informative for a particular activity. In our case, we excluded coordinates such as the hand tip, thumb, and neck, which are not very important for recognizing the intended activities. Once the 3D skeletal coordinates are available, we applied 3D joint normalization techniques and built 3D bounding boxes around the tracked human skeleton. The bounding box varies as the person moves in the video. Fig. 4 presents the block diagram of the proposed automatic dataset labeling procedure. The normalization technique used in the proposed work to stabilize the convergence of the loss functions is min-max normalization. If $X$ represents the training dataset, then in the standard min-max form each value is rescaled as

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the corresponding attribute in $X$.
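As an illustration of min-max normalization, the following minimal Python sketch rescales each attribute column to the range [0, 1]. The function name `min_max_normalize`, the toy values, and the choice to map constant columns to 0.0 are assumptions made for illustration, not part of the original pipeline.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column of `rows` to the range [0, 1]."""
    if not rows:
        return []
    n_cols = len(rows[0])
    col_min = [min(r[c] for r in rows) for c in range(n_cols)]
    col_max = [max(r[c] for r in rows) for c in range(n_cols)]
    normalized = []
    for r in rows:
        scaled = []
        for c in range(n_cols):
            span = col_max[c] - col_min[c]
            # Assumption: a constant column carries no information, map it to 0.0.
            scaled.append((r[c] - col_min[c]) / span if span != 0 else 0.0)
        normalized.append(scaled)
    return normalized


# Toy samples with three attributes each (hypothetical values).
sample = [[0.5, 2.0, 1.0], [1.5, 4.0, 1.0], [1.0, 3.0, 1.0]]
normalized = min_max_normalize(sample)
```

In practice the minimum and maximum would be computed on the training split only and reused for validation and real-time data, so that all inputs share the same scale.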

Geometric and Kinematic Features Calculation
The 3D human skeleton joint coordinates are used to evaluate different features and construct the feature vectors. The feature vectors are used to decide which body joints are tracked for different activities. Only discriminative features are utilized for each activity.
Angle Between Skeleton Joints: The 3D coordinates of the different body joints are connected with lines to draw a skeleton. Here, only 10 joints, namely shoulder left, shoulder center, shoulder right, spine base, knee left, hip right, ankle left, hip left, ankle right, and knee right, are used for calculating the angle values. These are the most relevant of the twenty-five skeleton joints for recognizing the activities. A set of angle values is calculated using these joint coordinates.
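A joint angle of this kind can be sketched in Python as follows. This is a minimal example under assumptions: it measures the angle at a middle joint (for example the knee, between hip and ankle) with the arccosine of the normalized dot product, which is a common choice and not necessarily the exact formula used in this work. The function name `joint_angle` and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence


def joint_angle(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Angle in degrees at joint `b`, formed by the segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("coincident joints: the angle is undefined")
    cos_theta = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


# Toy (x, y, z) joints in metres: a straight leg and a leg bent at a right angle.
hip, knee, ankle = (0.0, 1.0, 2.0), (0.0, 0.5, 2.0), (0.0, 0.0, 2.0)
straight_leg = joint_angle(hip, knee, ankle)
bent_leg = joint_angle(hip, knee, (0.5, 0.5, 2.0))
```

Because the angle depends only on the directions of the two segments, it does not change with the body size of the person or the distance to the camera.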
[...], and $R_1 = x_3 - y_3$, then the angle between the skeleton joints is: [...] where $P$, $Q$, $R$ [...]

Fig. 5: An example of the angle calculation from the right side between the ankle, knee, and hip. Figure labels: Ankle Left $(x_3, y_3, z_3)$ and angle $\theta$.

Velocity Estimation: Velocities in the X, Y, and Z directions are estimated from the differences between the positions of the human skeleton at time instances $t$ and $t+1$. The displacement between two consecutive frames is calculated using the spine mid joint coordinate of the person. The displacement per unit time then gives the velocity of the person.

$$\text{Velocity} = \frac{\text{Displacement of the tracked person between frames}}{\text{Time}} \qquad (4)$$
Acceleration Estimation: Accelerations in the X, Y, and Z directions are estimated from the changes in velocity between consecutive frames as follows:

$$\text{Acceleration} = \frac{\text{Change in velocity of the tracked person between frames}}{\text{Time}} \qquad (5)$$

Distance from Floor: This feature estimates the distance between the floor and the head joint coordinate of the tracked person.
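The velocity and acceleration estimates can be sketched with finite differences between consecutive frames. The following Python example is an illustration under assumptions: the helper name `finite_difference`, the toy spine mid positions, and the use of a fixed frame interval of 1/30 s (from the 30 fps frame rate of the sensor) are choices made here, not details taken from the original implementation.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def finite_difference(series: List[Vec3], dt: float) -> List[Vec3]:
    """Per-axis rate of change between consecutive frames."""
    return [
        tuple((b[k] - a[k]) / dt for k in range(3))
        for a, b in zip(series, series[1:])
    ]


FPS = 30.0       # frame rate of the sensor
dt = 1.0 / FPS   # assumed constant time between frames, in seconds

# Toy spine mid positions (x, y, z) in metres for three consecutive frames.
spine_mid = [(0.0, 1.0, 2.0), (0.1, 1.0, 2.0), (0.3, 1.0, 2.0)]

velocity = finite_difference(spine_mid, dt)      # m/s, one value per frame pair
acceleration = finite_difference(velocity, dt)   # m/s^2, one value per velocity pair
```

Each differencing step shortens the series by one frame, so a window of N frames yields N-1 velocity values and N-2 acceleration values.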
Depth Estimation: Depth is the distance from the camera to the nearest object. It is estimated from the Z coordinate of the head joint of the tracked person.

Convolutional Neural Network Architecture: First, activity recognition has been implemented using CNNs [23]. Let $X^0_t = [X_1, X_2, \ldots, X_n]$ be the input vector of readings from the sensor data, where $n$ is the number of input samples. The output of the convolutional layer is expressed in terms of the following quantities: $l$ is the index of the layer, $\sigma$ is a sigmoid activation function, $B_j$ is the bias corresponding to the $j$th feature map, $M$ is the filter size, and $W^j_m$ is the weight for the $j$th feature map and $m$th filter index.
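A form of the convolutional layer output that is consistent with the symbols defined above can be written as follows. This is a sketch under assumptions: the output symbol $C_i^{l,j}$ and the position index $i$ are notation introduced here, following the one-dimensional convolution form commonly used for sensor-based activity recognition. For $l = 1$ the layer reads directly from the input $X^0$.

```latex
C_i^{l,j} = \sigma\left( B_j + \sum_{m=1}^{M} W_m^{j} \, X_{i+m-1}^{l-1,j} \right)
```

Each output position $i$ is therefore a weighted sum over a window of $M$ consecutive inputs, shifted by the bias and passed through the activation function.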
Three input channels are used for the RGB dataset as the input layers. In the convolution layer, six filters are applied with the chosen kernel sizes, padding, and ReLU activation functions. Max-pooling is used as the pooling layer, which down-samples the data and reduces the dimensionality to shorten the processing time. Its output is passed to the fully connected layers. Finally, the Softmax scores give the probabilities of the classes. The negative log-likelihood cost function is minimized using a stochastic gradient descent optimizer.
Long Short-Term Memory Architecture: Second, we have implemented activity recognition using LSTM, an improved version of the RNN that avoids the vanishing gradient problem and consists of memory cells. LSTMs mainly consist of forget, input, and output gates that control and protect the cell state [3]. The forget gate is a sigmoid gate that decides how much information to let through. The input gate layer decides which new information needs to be stored in the cell state. The output gate consists of a sigmoid gate, which decides which parts of the cell state to give as output. Passing the cell state through the tanh layer and multiplying it by the output of the sigmoid gate provides the final output (Fig. 1).
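The Softmax scores and the negative log-likelihood cost can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical logits for seven activity classes. It shows only the loss computation for a single sample, not the training loop or the optimizer.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs: List[float], target: int) -> float:
    """Cost for one sample: minus the log-probability of the true class."""
    return -math.log(probs[target])


# Hypothetical logits for the seven activity classes of one sample.
logits = [2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3]
probs = softmax(logits)
loss = negative_log_likelihood(probs, target=0)
uniform = softmax([0.0] * 7)
```

The loss approaches zero as the probability assigned to the true class approaches one, which is the quantity the optimizer drives down during training.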
The standard equations describing the actions of each gate use the following notation. $W_i$, $W_f$, and $W_o$ are the weight matrices, and $X_t$ is the input to the LSTM cells at time instance $t$. $\sigma$ is the sigmoid activation function, and $\tanh$ is the hyperbolic tangent activation function. $f$, $i$, and $o$ are the forget, input, and output gates, respectively. $C$ represents the state of a memory cell. $B_i$, $B_c$, $B_f$, and $B_o$ are the bias vectors.
Different combinations of batch sizes, hidden layers, and learning rates have been investigated. The best results were obtained using 32 hidden layers for 7 classes with a learning rate of 0.0025, a lambda loss amount of 0.0015, and a batch size of 1500. Two LSTM cells were stacked, which adds depth to the network. For loss computation, the Softmax loss function was used and optimized with the Adam optimizer.
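For reference, the gate equations in their widely used standard form are given below, written with the symbols listed above. Two pieces of notation are assumptions introduced here because they are not named in the text: $W_c$ for the weight matrix of the candidate cell state $\tilde{C}_t$, and $H_t$ for the hidden state. $[H_{t-1}, X_t]$ denotes concatenation and $\odot$ denotes element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \, [H_{t-1}, X_t] + B_f\right) \\
i_t &= \sigma\left(W_i \, [H_{t-1}, X_t] + B_i\right) \\
\tilde{C}_t &= \tanh\left(W_c \, [H_{t-1}, X_t] + B_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \, [H_{t-1}, X_t] + B_o\right) \\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

The last line corresponds to the description of the output step: the cell state is passed through tanh and multiplied by the output of the sigmoid gate.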
ConvLSTM Architecture: A ConvLSTM network is proposed using the fusion of CNNs, LSTM, and dense layers. Here, CNNs are used for spatial feature extraction, LSTMs are used for sequence prediction, and dense layers are used for mapping the features to a more separable space (Fig. 2). Fig. 6 shows a traditional activity recognition model, in which a parallel fusion of CNN and LSTM is performed. This approach has been used in many previous works [11,33,12]. Although it is much better than using only a CNN or only an LSTM, it does not use the efficiency of both models to their fullest.
In this work, a sequential fusion of CNN, LSTM, and dense layers is used, as shown in Fig. 7. Here, the outputs of the last hidden layer of the CNNs are input to the LSTM layers, followed by fully connected layers for classification. The ConvLSTM equations use the following notation. $\sigma$ (sigmoid) and $\tanh$ (hyperbolic tangent) are non-linear activation functions, $\circ$ represents the Hadamard product, and $*$ represents the convolution operation. The inputs ($X_t$), cells ($C_t$), hidden states ($H_t$), forget gates ($F_t$), input gates ($I_t$), input-modulation gates ($\check{C}_t$), and output gates ($O_t$) are all $M \times N \times F$ (rows, columns, feature maps) dimensional 3D tensors. The memory cell $C_t$ is the most crucial module, acting as an aggregator of the state information controlled by the gates.
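In the commonly used formulation without peephole connections, the ConvLSTM equations can be written with the symbols named above as follows. The weight names $W_{x\cdot}$ and $W_{h\cdot}$ and the bias names $B_{\cdot}$ are assumptions introduced here for illustration, and the Hadamard product is written as $\circ$.

```latex
\begin{aligned}
I_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + B_i\right) \\
F_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + B_f\right) \\
O_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + B_o\right) \\
\check{C}_t &= \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + B_c\right) \\
C_t &= F_t \circ C_{t-1} + I_t \circ \check{C}_t \\
H_t &= O_t \circ \tanh(C_t)
\end{aligned}
```

Compared with the fully connected LSTM, the matrix products are replaced by convolutions, so the gates keep the spatial layout of the input tensors.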
Initially, the helper functions are defined to increase the reusability and readability of the code. The hyper-parameters, such as the size, number of layers, steps, batch sizes, and learning rate (0.0001), have been set to their optimum values. Next, we constructed the LSTM cells and reshaped the dataset for the LSTM into sequence length, batches, and channels. Then we applied ReLU activation and dropout regularization, which operates simultaneously on the gates, cells, and output responses of the LSTM neurons. Finally, the logits are used to compute the cost function. Further, we used the Adam optimizer for cost function optimization and applied gradient clipping. For training the network, we set the checkpoint path and saver function, initialized the session, set the iterations, computed the loss and accuracy on the validation dataset, and saved the checkpoint for further testing of the model. The performance improves markedly when the sequential model (Fig. 7) is used instead of score fusion (Fig. 6). The experimental results and comparisons of the three models, i.e., CNNs, LSTMs, and ConvLSTM, for activity recognition are presented in the next sections.
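Gradient clipping, mentioned above, can be illustrated with a short Python sketch. Clipping by the global norm is assumed here as one common variant; the text does not state which variant or threshold was used, so the function name `clip_by_global_norm` and the threshold of 1.0 are hypothetical.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> List[List[float]]:
    """Rescale all gradients together so their joint L2 norm is at most `max_norm`."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm or global_norm == 0.0:
        return [list(tensor) for tensor in grads]  # nothing to clip
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads]


# Two toy gradient tensors whose global norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling all gradients by the same factor keeps the update direction unchanged and only limits the step size, which helps recurrent layers train stably.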

Experimental Result
This section presents the dataset building procedure and the experimental results obtained with the deep learning algorithms, i.e., CNNs, LSTMs, and ConvLSTM. The proposed model has been tested on the newly collected KinectHAR dataset recorded with the Kinect-v2 sensor. Only the skeleton joint coordinates along with suitable features are stored as input to the deep learning model.

Dataset Building
With the development of cost-effective RGB-D sensing technologies, it is now more convenient to obtain 3D and depth data. We used the Microsoft Kinect (v2) sensor for data collection. It is a depth sensor-based motion-sensing input device that offers a convenient way to capture and record human skeleton joints. The name Kinect is a combination of kinetic and connect [4]. It produces three-dimensional RGB-D data. Kinect runs at 30 fps and has a resolution of 640×480 for both video and depth [5]. Kinect (v2) has a sensing range of 4 meters [5].
For the dataset collection, 20 different people (12 males and 8 females) participated and performed seven different activities: sitting, standing, bending, walking fast, walking slow, lying, and falling. Every person performed each activity for more than two minutes with wide variations, so that the system can identify activities accurately in a real-time environment. As we do not record the videos and use only the skeleton coordinates, privacy is preserved. Throughout the experiment, the Kinect (v2) sensor was placed at a height of two meters above the ground. All the experiments were performed at a range of 0.5 meters to 4.0 meters in front of the camera. The final dataset contains a total of 130,000 samples with 81 attribute values. The source code and dataset will be made publicly available to the research community.
Table 3 presents the precision, recall, and F1-score values for the different algorithms. The proposed ConvLSTM results in better accuracy than LSTM and CNNs, as illustrated in Table 4. The accuracy of ConvLSTM is approximately 5-6 percentage points higher than that of either LSTM or CNN individually. Table 5 presents the precision, recall, and F1-score of each class obtained using the ConvLSTM model. As Table 3 and Table 4 show, ConvLSTM gives the best results among the compared algorithms, so we stored the trained ConvLSTM model and tested it in real time. Fig. 11 presents the activity recognition results obtained in real time. The performance is sufficiently high for the general adoption of the system.
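For clarity, the per-class precision, recall, and F1-score reported in such tables can be computed from predicted and true labels as in the following Python sketch. The function name `per_class_scores` and the toy labels are hypothetical and are not taken from the KinectHAR results.

```python
from typing import Dict, List


def per_class_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and F1-score for every class label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy labels: one fall is misclassified as sitting.
y_true = ["fall", "fall", "sitting", "standing", "sitting", "fall"]
y_pred = ["fall", "sitting", "sitting", "standing", "sitting", "fall"]
scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For fall detection, recall on the fall class is the critical value, since a missed fall is costlier than a false alarm.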

Conclusion
This paper presented a privacy-preserving activity recognition and fall detection system using a single Kinect (v2) sensor and ConvLSTM. The proposed system derives geometrical and kinematic features and passes them along with the raw skeleton coordinates into deep learning networks. As the system uses only derived features along with raw skeleton joint coordinates and does not use the actual images of the user, the privacy of the user is protected. We proposed a simple and effective method based on the sequential fusion of CNNs and LSTM, named the ConvLSTM model. The performance of the deep learning-based classification algorithms, namely CNN, LSTM, and ConvLSTM, has been compared on the novel dataset consisting of 130,000 samples with 81 attribute values. The proposed system recognizes standing, walking slow, walking fast, sitting, bending, falling, and lying down activities. It is unobtrusive to the users and independent of the camera orientation, clothing, etc. The system gives sufficiently high performance for activity recognition and fall detection to support general adoption.
The source code and presented dataset will be made publicly available.

Fig. 2: The proposed model of ConvLSTM. From the raw input videos, 3D skeleton coordinates are extracted and used to calculate the geometrical and kinematic features. The extracted features along with the raw skeleton joint coordinates are passed to CNNs for extracting the automated spatial features. These spatial features are then passed to LSTMs for extracting the temporal features. Finally, fully connected layers are applied to classify the activities and calculate the Softmax scores.
Fig. 3 presents the twenty-five skeleton joints tracked at each instance. These skeletal joints include knee right, hip right, knee left, hip left, foot left, ankle left, foot right, ankle right, head, spine mid, wrist left, shoulder right, shoulder left, wrist right, elbow right, and elbow left. All these joint values contain X, Y, and Z coordinates in space.

Fig. 4: The block diagram of the proposed data collection procedure. The input video streams from the Kinect-v2 sensor are used to acquire the 3D skeleton joint coordinates. A 3D joint normalization technique is then applied. A bounding box is drawn around the practitioner using the upper and lower body joint coordinates. Suitable features such as the angle between joints, velocity, acceleration, height, and width are calculated and stored in the dataset along with their activity labels. This dataset is later passed to the deep learning algorithms for training the network.

Width Estimation: The difference between the maximum right joint and maximum left joint coordinates is used to calculate the width. The extreme right joint values are calculated using the elbow right, hip right, knee right, shoulder right, hand-tip right, ankle left, foot right, and head joint values. A similar procedure is used to find the left extreme joints using all the left side joint coordinates.

$$\text{Width} = \left| \text{Max. Right Joint} - \text{Max. Left Joint} \right| \qquad (6)$$

Height Estimation: The difference between the extreme top joints and extreme bottom joints is used to calculate the height. The extreme bottom is calculated using the ankle left, knee right, ankle right, knee left, foot right, ankle left, foot left, and ankle right joint coordinates. The extreme top is calculated using the head, hand tip left, hand tip right, ankle right, elbow right, ankle left, elbow left, knee right, and knee left joint values.

$$\text{Height} = \left| \text{Top Joint} - \text{Bottom Joint} \right| \qquad (7)$$

Classification Models
In this section, the final dataset along with the derived features is input to the deep learning network for classification. Three different classification methods have been used, namely CNN, LSTM, and the newly designed ConvLSTM.

The precision, recall, and F1-score values are obtained for the different algorithms with a ±1 percentage change. We have applied different machine learning and deep learning algorithms, such as SVMs, decision trees (DT), random forest (RF), artificial neural networks (ANNs), CNNs, LSTMs, and ConvLSTMs. The comparison of their accuracies is shown in Table 4. All the accuracies and plots are calculated for 200 epochs. The accuracy and loss curves using LSTM, CNN, and ConvLSTM are shown in Fig. 8, Fig. 9, and Fig. 10, respectively.

Fig. 11: Real-time testing of the standing, sitting, bending, walking slow, and walking fast activities (a working model demonstrating real-time testing is available at https://www.youtube.com/watch?v=2PqkyXMVBLg).

Table 1: Summary of the literature review.
The hybrid model of CNN and LSTM uses the CNN for feature extraction and the LSTM for feature classification. Ordonez et al. [22], Yao et al. [37], and Singh et al. [26] used a combination of CNNs and LSTMs and demonstrated a tremendous improvement in results. They ran both CNN and LSTM models in parallel and used score fusion for the final prediction. Table 1 presents a summary of the literature survey. Table 2 describes the list of skeletal joints tracked, the set of derived features, and the activity class labels.

Table 2: Feature Set Specifications.

Table 4: Comparison of accuracies using different algorithms.