1 Introduction

The basic aim of human activity recognition systems is to automatically recognize the activities of an individual from the raw data obtained from sensors. Applications of activity detection can be found in many areas like human-computer interaction, video surveillance, sports analysis, and video understanding (Poppe 2010; Weinland et al. 2011; Li et al. 2019). Fall detection and early reporting is another important application of human activity recognition. The elderly population of the world is expected to increase by 25% by 2050, making it necessary to assist elderly adults over the age of 65 (DeSA et al. 2013). Falls are a major cause of injury and even death, especially among the elderly. An estimated $31 billion is spent on direct medical costs for fall injuries in the US (Stevens et al. 2006), making fall prevention and early reporting necessary.

Classical studies typically addressed action recognition using monocular RGB videos (Zhang et al. 2019), which make it hard to comprehensively represent actions in 3D space. In recent years, low-cost, high-mobility sensors like the Microsoft Kinect have been widely adopted for human activity recognition. Kinect's ability to track skeleton joints has attracted significant attention from computer vision researchers, and different algorithms have been proposed that use the skeleton joint information for recognizing human activities. Skeleton joints extracted from the Kinect can be used to calculate features invariant to human body size, appearance, and changes in camera viewpoint (Zhang et al. 2016).

There have been several works on activity recognition in which a combination of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks is utilized (Hammerla et al. 2016; Wang et al. 2015; Hou et al. 2016). Many of these run different models in parallel, like an ensemble classifier, and fuse the scores of each model to predict the final class labels. For example, Li et al. (2017) used the score fusion of CNN and LSTM models. Although this method improves upon using either CNNs or recurrent neural networks (RNNs) individually, it does not fully utilize the advantages of both models. As done by Li et al. (2017), many previous works on activity recognition feed the skeleton data directly into the model, leaving it to the neural network to extract features from those coordinates.

Skeletal features alone are not enough for recognizing all activities. A few tasks, such as detecting falls or differentiating between running and walking, require the rate of change of the coordinates of the center of mass, velocity, acceleration, and other derived features like the head-to-floor distance and the angles between the joint points. Such features cannot be modeled by CNN and LSTM techniques directly; they require hand engineering. However, using only the hand-engineered features results in a model that is shallow and dataset-dependent (Wang et al. 2013).

In this paper, a set of derived features along with the raw skeleton joint coordinates is fed to the deep learning networks, as shown in Fig. 2. Initially, we applied skeletal tracking algorithms using a Kinect (v2) sensor, collected the 3D joint locations for each frame, and constructed a skeleton of bones. A set of features has been extracted from the raw data obtained from the Kinect (v2) sensor to improve the model efficiency. Standard features like velocity, acceleration, position of the center of gravity, and angles between different body joints have been derived from the raw body coordinates. After extracting the features, the dataset, which consists of fourteen derived features along with seventeen skeleton coordinates, is preprocessed and fed to the deep learning models. The proposed ConvLSTM model is a sequential combination of CNNs, LSTM, and dense layers, where the CNN is used for feature filtering, the LSTM for classification, and the dense layers for feature mapping (as shown in Fig. 2). LSTM cells can effectively represent the contextual dependencies in the temporal domain, while CNNs perform better at processing feature sets with more spatial information. By combining both, we retrieve the best set of features containing spatial and temporal information. Finally, a fully connected network is applied to these features to obtain the classification scores. We have experimentally found that combining CNNs and LSTMs in a serial manner results in better accuracy than using either of them individually, or using them in parallel.

Major contributions of this manuscript are highlighted as follows:

  1. The paper presents a privacy-preserving activity recognition and fall detection system using the data obtained from the Kinect (v2) sensor.

  2. We propose a ConvLSTM model, a sequential combination of CNNs, LSTMs, and dense layers. LSTM cells represent the contextual dependencies in the temporal domain, while CNNs extract the spatial features. Their combination yields the best set of spatiotemporal features.

  3. To preserve the privacy of the user, instead of passing the raw videos directly to the network, a set of derived features along with the skeletal joint coordinates is fed to the deep learning network. Fourteen derived features along with seventeen skeleton coordinates are fed to the networks for recognizing activities and detecting falls.

  4. A new dataset is presented for activity recognition and fall detection. This dataset has been collected with all possible variations and has the features and complexity generally required for training such a system. During experimentation, this dataset is used for testing the performance of the proposed model in recognizing activities and detecting falls (described in Sect. 4.1).

  5. Experimental results show that the ConvLSTM model achieves better accuracy (98.89%) than the LSTM (92.75%) and CNN (93.89%) models individually.

The remainder of this paper is organized as follows. Section 2 presents the literature survey. Section 3 describes the proposed methodology along with the feature extraction and the CNN, LSTM, and ConvLSTM models. Section 4 describes the experimental results and the data collection procedure, and compares the performances of the CNN, LSTM, and ConvLSTM models. Section 5 presents the concluding remarks and proposes future directions.

Fig. 1

Sequence learning using (1) LSTM and (2) RNN (Zhang et al. 2017)

2 Related works

A considerable body of literature has been proposed for activity recognition and fall detection using single or multiple cameras. Multi-view cameras were found to improve accuracy (Auvinet et al. 2010), but they lead to higher complexity and cost (Rougier et al. 2011). Low-cost depth sensors like the Kinect have recently been investigated to deal with these limitations. Such approaches use low-resolution depth information from the Kinect for joint localization to detect falls. Several fall detection systems using Kinect depth sensors have been proposed: Gasparrini et al. (2014) placed the Kinect sensor on the ceiling and used depth images for fall detection, while Uden et al. (2018) placed the Kinect sensor under the bed to detect falls along with other activities like leaving the room, feet in front of the bed, and activity in the room. However, these approaches struggle to give accurate results in a real-time environment.

Researchers have used different techniques to construct classification models for human activity recognition. For example, Wang et al. (2019) used deep fully-connected networks (DFNs) to facilitate a better representation of the data compared to artificial neural networks (ANNs). Vepakomma et al. (2015) used hand-engineered features obtained from the sensors for human activity recognition. Hammerla et al. (2016) used a five-hidden-layer DFN for feature extraction. Generally, DFNs with more hidden layers serve as the dense layers for other deep learning algorithms. A few researchers have also used autoencoders, a variety of ANN used for unsupervised learning, for activity recognition. The aim of an autoencoder is to learn a compact representation of the dataset, typically for dimensionality reduction. Almaslukh et al. (2017) and Wang et al. (2016) utilized greedy approaches in which each layer was pre-trained and then fine-tuned. In comparison, Li et al. (2014) utilized sparse autoencoders by adding the Kullback-Leibler (KL) divergence to the cost function and introducing noise, which ultimately improved the performance for activity recognition. Stacked autoencoders (SAEs) are used for learning features in an unsupervised manner, which may enhance the feature extraction for HAR. However, SAEs depend on the number of layers and their activation functions, which makes it hard to search for the optimal solution.

Table 1 Summary of the literature review
Fig. 2

The proposed ConvLSTM model. From the raw input videos, 3D skeleton coordinates are extracted and used to calculate the geometrical and kinematic features. The extracted features, along with the raw skeleton joint coordinates, are passed to CNNs for extracting the automated spatial features. These spatial features are then passed to LSTMs for extracting the temporal features. Finally, fully connected layers are applied to classify the activities and calculate the Softmax scores

Deep learning-based action recognition can be divided into two broad categories, i.e.,  CNN-based approaches (Ding et al. 2017; Li et al. 2017; Ke et al. 2017; Weng et al. 2018) and RNN-based approaches (Du et al. 2015; Zhu et al. 2016; Liu et al. 2017; Lee et al. 2017; Zhang et al. 2018). CNNs have obtained promising results in image/video classification, signal processing, etc., and perform well at processing feature sets with more spatial information (Yadav et al. 2021). CNNs comprise one or more convolutional layers; once the convolution operations complete, pooling and fully connected layers are combined to perform the classification at the final output (LeCun et al. 2015). Ding et al. (2017) investigated different skeletal features using CNNs for 3D action recognition. They encoded the spatial skeleton features into images using various encoding techniques and also studied the performance implications of the different skeleton joints in the feature extraction. Li et al. (2017) proposed two-stream CNNs in which one stream was applied to the raw joint coordinates, whereas the other was used for the motion data obtained by subtracting the joint coordinates of subsequent frames. Ke et al. (2017) converted the skeletal features into images and passed them to deep CNNs. Weng et al. (2018) applied Naive-Bayes mutual information maximization (NBMIM) (Yuan et al. 2009) to CNNs for action recognition. RNNs are best suited for sequential information and are widely applied in fields like speech recognition and natural language processing (Zhang et al. 2017; Li et al. 2015; Grushin et al. 2013). Activity recognition can also be considered a sequential problem. Du et al. (2015) presented a skeleton-based activity recognition system using an end-to-end deep learning model consisting of hierarchical RNNs. In their methodology, they divided the human skeleton obtained from the Kinect sensor into five different parts, which were then fed to five different bidirectional RNNs. Among the various RNN architectures, LSTMs are the most popular due to their memory capacity and their ability to retain useful data for an extended period (Fig. 1). For activity recognition, LSTMs are very robust in real-world recognition (Inoue et al. 2018). Zhu et al. (2016) proposed an approach to automatically learn human skeletal representations. They used RNNs and LSTMs to learn the long-term temporal dependencies in the dataset; to model the joint co-occurrences, the joint position values were used as the input for each time slot. Liu et al. (2017) proposed a new gating scheme in the LSTM for sequential action recognition. Lee et al. (2017) proposed a temporal sliding LSTM, which includes short-, medium-, and long-term units. Zhang et al. (2018) presented an element-wise attention gate using RNNs for action recognition.

Fig. 3

The 25 skeleton joints tracked using Kinect (v2) sensor

The combination of CNNs and LSTMs is among the most promising hybrid models for activity recognition, especially for vision tasks involving sequential inputs and outputs. CNNs basically consist of two modules: first, feature map construction or extraction, and second, a basic classifier. The hybrid model of CNN and LSTM uses the CNN for feature extraction and the LSTM for classification. Ordóñez and Roggen (2016), Yao et al. (2017), and Singh et al. (2017) used a combination of CNNs and LSTMs and demonstrated a tremendous improvement in results. They ran both CNN and LSTM models in parallel and used score fusion for the final prediction. Table 1 presents a summary of the literature survey. The work in this paper demonstrates a sequential ConvLSTM approach that uses the best features of both CNNs and LSTMs.

We have focused on combining the CNNs, LSTMs, and dense layers in a sequential manner to take advantage of all three methods.

Fig. 4

The block diagram of the proposed data collection procedure. The input video streams from the Kinect-v2 sensor are used to acquire the 3D skeleton joint coordinates, which are then normalized using a 3D joint normalization technique. A bounding box is made around the practitioner using the upper and lower body joint coordinates. Suitable features such as the angle between joints, velocity, acceleration, height, and width are calculated and stored in the dataset along with their activity labels. This dataset is later passed to the deep learning algorithms for training the network

3 Proposed methodology

The proposed methodology of this paper is threefold. First, the human body frames are acquired from the Kinect-v2 sensor and the 3D skeleton joint coordinates are tracked. A 3D joint normalization technique is then applied to preprocess the data, and the 3D coordinates are used to make a 3D bounding box over the tracked human. Suitable features like velocity, acceleration, angles between skeleton joints, height, and width have been extracted for identifying different activities; for each activity, the important features have been selected and feature vectors constructed. Second, the dataset is stored in CSV format, consisting of the raw skeleton joint coordinates and the extracted features. Third, the dataset containing the joint values along with the extracted manual features is fed to the deep learning networks, i.e.,  CNNs, LSTMs, and ConvLSTM, for activity recognition and fall detection. Figure 2 illustrates the proposed ConvLSTM model, which is a sequential fusion of CNNs and LSTMs.

Table 2 Feature set specifications

3.1 Preprocessing

This subsection presents the preprocessing methods applied to the raw data obtained from the Kinect (v2) sensor. We applied skeleton joint estimation methods to acquire the body frames and applied 3D joint tracking methods. Figure 3 presents the twenty-five skeleton joints tracked at each instance. The tracked skeletal joints include knee right, hip right, knee left, hip left, foot left, ankle left, foot right, ankle right, head, spine mid, wrist left, shoulder right, shoulder left, wrist right, elbow right, and elbow left; all these joint values contain XYZ coordinates in space. As Chen et al. (2016) stated, not all skeletal joints are useful; only some of them are informative for a particular activity. In our case, we excluded coordinates like hand tip, thumb, and neck, which are not very important for recognizing the intended activities. Once the 3D skeletal coordinates are available, we applied 3D joint normalization techniques to make 3D bounding boxes across the tracked human skeleton; the bounding box varies as the person moves in the video. Figure 4 presents the block diagram of the proposed automatic dataset labeling procedure. Table 2 describes the list of tracked skeletal joints, the set of derived features, and the activity class labels. The normalization technique used for stabilizing the convergence of the loss functions in the proposed work is min-max normalization. If X represents the training dataset, then:

$$\begin{aligned} X = \frac{X - X_{\mathrm{min}}}{X_{\mathrm{max}} - X_{\mathrm{min}}} \end{aligned}$$
(1)
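For illustration, Eq. (1) can be realized in a few lines of Python. The following is a minimal NumPy sketch; the small epsilon guard against constant (zero-range) feature columns is our addition:

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalization (Eq. 1): rescale each feature column to [0, 1]."""
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    # Epsilon guards against division by zero for constant columns
    # (our addition; Eq. (1) assumes X_max > X_min).
    return (X - X_min) / np.maximum(X_max - X_min, 1e-8)
```

In practice, the minimum and maximum should be computed on the training split only and reused for the validation and test splits.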

3.2 Geometric and kinematic features calculation

The 3D human skeleton joint coordinates are used for evaluating different features and constructing the feature vectors. The feature vectors determine which body joints are to be tracked for the different activities; only the discriminative features are utilized for each activity.

Angle between skeleton joints:

The 3D coordinates of different body joints are connected using lines to draw a skeleton. Here, only 10 joints, namely shoulder left, shoulder center, shoulder right, spine base, knee left, hip right, ankle left, hip left, ankle right, and knee right, are utilized for calculating the angle values, as these are the most relevant among the twenty-five skeleton joints for recognizing the activities. A set of angle values is calculated using these joint coordinates. Figure 5 presents an example of calculating the joint angle between the ankle, knee, and hip on the left side. The differences of the hip-to-knee and the ankle-to-knee coordinate values are used for calculating the angle. If \((P_1, Q_1, R_1)\) and \((P_2, Q_2, R_2)\) denote the coordinate-wise differences between the joint pairs, e.g., \(P_1 = x_1 - y_1\), \(Q_1 = x_2 - y_2\), and \(R_1 = x_3 - y_3\), then the cosine of the angle between the skeleton joints is:

$$\begin{aligned} \theta = \frac{\mathrm{PQR}}{(\mathrm{PQ} * \mathrm{QR})} \end{aligned}$$
(2)

where, \(PQR = P_1 * P_2 + Q_1 * Q_2 + R_1 * R_2;\;\; PQ = \sqrt{P_1^2 + Q_1^2 + R_1^2}\);  and \(QR = \sqrt{P_2^2 + Q_2^2 + R_2^2}\)

$$\begin{aligned} \mathrm{Angle} = \frac{\cos ^{-1}(\theta ) \times 180}{\pi } \end{aligned}$$
(3)
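A minimal NumPy sketch of Eqs. (2) and (3), assuming the knee as the vertex joint of the angle; the clip call, which guards against floating-point values slightly outside \([-1, 1]\), is our addition:

```python
import numpy as np

def joint_angle(hip, knee, ankle):
    """Angle at the knee (Eqs. 2-3): the angle between the knee-to-hip
    and knee-to-ankle vectors, returned in degrees."""
    v1 = np.asarray(hip) - np.asarray(knee)    # (P1, Q1, R1)
    v2 = np.asarray(ankle) - np.asarray(knee)  # (P2, Q2, R2)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Hypothetical 3D joint coordinates (meters) for a slightly bent knee.
print(joint_angle(hip=[0.0, 0.9, 2.0], knee=[0.05, 0.5, 2.0], ankle=[0.0, 0.1, 2.0]))
```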
Fig. 5

An example of the angle calculation from the right side between the ankle, knee, and hip

Velocity estimation: Velocities in the XYZ directions are estimated using the differences between the positions of the human skeleton at time instances t and \(t+1\). The displacement between two consecutive frames is calculated using the spine-mid joint coordinate of the person, and the displacement per unit time gives the velocity of the person.

$$\begin{aligned} \mathrm{Velocity} = \frac{{\text {Displacement of the tracked person between frames}}}{{\text {Time}}} \end{aligned}$$
(4)

Acceleration estimation: Acceleration in the XYZ directions is estimated using the changes in the velocity between consecutive frames as follows:

$$\begin{aligned} \mathrm{Acceleration} = \frac{{\text {Change in velocity of the tracked person between frames}}}{{\text {Time}}} \end{aligned}$$
(5)
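Since the Kinect (v2) streams at 30 fps, the time step between frames is fixed, and Eqs. (4) and (5) reduce to frame-to-frame differences of the spine-mid trajectory. The following sketch assumes a trajectory array of shape (num_frames, 3):

```python
import numpy as np

FPS = 30  # Kinect (v2) frame rate, so dt = 1/30 s between frames

def velocity_and_acceleration(spine_mid):
    """Velocity (Eq. 4) and acceleration (Eq. 5) in the XYZ directions
    from the spine-mid joint trajectory, shape (num_frames, 3)."""
    dt = 1.0 / FPS
    velocity = np.diff(spine_mid, axis=0) / dt     # displacement per unit time
    acceleration = np.diff(velocity, axis=0) / dt  # change in velocity per unit time
    return velocity, acceleration
```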

Distance from floor: This is the estimated distance between the floor and the head joint coordinate of the tracked person.

Depth estimation: Depth is the distance from the camera to the nearest object. It is estimated using the Z coordinate of the tracked person's head joint.

Width estimation: The difference between the maximum right joint and the maximum left joint coordinates is used to calculate the width. The extreme right joint values are calculated using the elbow right, hip right, knee right, shoulder right, hand-tip right, ankle right, foot right, and head joint values. A similar procedure is used to find the extreme left joints using all the left-side joint coordinates.

$$\begin{aligned} {\text {Width}} = \mathrm{abs} ({\text {Max. Right Joint}} - {\text {Max. Left Joint}}) \end{aligned}$$
(6)

Height estimation: The difference between the extreme top joint and the extreme bottom joint is used to calculate the height. The extreme bottom is calculated using the knee left, knee right, ankle left, ankle right, foot left, and foot right joint coordinates. The extreme top is calculated using the head, hand tip left, hand tip right, elbow left, elbow right, knee left, knee right, ankle left, and ankle right joint values.

$$\begin{aligned} {\text {Height}} = \mathrm{abs} ({\text {Top Joint}} - {\text {Bottom Joint}}) \end{aligned}$$
(7)
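Equations (6) and (7) amount to taking the extremes of the joint coordinates along the horizontal and vertical axes. The sketch below simplifies the joint subsets described above by taking the extremes over all tracked joints, assuming an array laid out as (num_joints, 3) with (X, Y, Z) columns:

```python
import numpy as np

def bounding_box_width_height(joints):
    """Width (Eq. 6) and height (Eq. 7) of the bounding box around the
    tracked skeleton; joints has shape (num_joints, 3) as (X, Y, Z)."""
    width = abs(joints[:, 0].max() - joints[:, 0].min())   # right extreme - left extreme
    height = abs(joints[:, 1].max() - joints[:, 1].min())  # top extreme - bottom extreme
    return width, height
```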

3.3 Classification models

In this section, the final dataset along with the derived features is fed to the deep learning networks for classification. Three different methods have been used for the classification, namely CNN, LSTM, and the newly designed ConvLSTM.

Convolutional neural network architecture: Firstly, activity recognition has been implemented using CNNs (O'Shea and Nash 2015). Let \( X_t^0=[X_1,X_2,\ldots ,X_n] \) be the readings from the sensor data as an input vector, where n is the number of input samples. The convolutional layer's output can be given as:

$$\begin{aligned} C_i^{l,j}= \sigma \left( B_j + \sum _{m=1}^M W_m^j * X_{i+m-1}^{0,j} \right) \end{aligned}$$
(8)

where l corresponds to the layer index, and \(\sigma \) is the sigmoid activation function. \(B_j\) is the bias corresponding to the \(j^{th}\) feature map, M is the filter size, and \(W_m^j\) represents the weight for the \(j^{th}\) feature map and \(m^{th}\) filter index.

Three input channels are used as the input layer for the RGB dataset. In the convolution layer, six filters are applied with appropriate kernel sizes, padding, and ReLU activation functions. Max-pooling is used as the pooling layer; it down-samples the data and reduces the dimensionality, which cuts the processing time. Its output is passed to the fully connected layers, and finally the Softmax scores give the probabilities of the classes. The negative log-likelihood cost function is minimized using a stochastic gradient descent optimizer.
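A minimal Keras sketch of such a CNN baseline is given below. The six filters, max-pooling, ReLU activations, Softmax output, and stochastic gradient descent follow the description above; the window length, kernel size, and dense-layer width are illustrative assumptions:

```python
from tensorflow.keras import layers, models

WINDOW, N_FEATURES, N_CLASSES = 32, 81, 7  # frames per window, attributes, activities

def build_cnn():
    """1D-CNN baseline over windows of skeleton-feature frames."""
    model = models.Sequential([
        layers.Input(shape=(WINDOW, N_FEATURES)),
        layers.Conv1D(filters=6, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),  # down-sample to reduce dimensionality
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```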

Long short-term memory architecture: Secondly, we have implemented activity recognition using LSTM, an improved variant of the RNN that avoids the vanishing gradient problem (Bengio et al. 1994) and consists of memory cells. LSTMs mainly consist of gates, namely the forget, input, and output gates, to control and protect the cell states. The forget gate uses a sigmoid to decide how much of the previous information to let through. The input gate layer decides what new information needs to be stored in the cell state. The output gate consists of a sigmoid layer, which decides what parts of the cell state to give as output. Passing the cell state through the tanh layer and multiplying it with the output of the sigmoid gate provides the final output (Fig. 1). The standard equations describing the actions of each gate are as follows:

$$\begin{aligned} i_t= & {} \sigma ~(W_{(Xi)}~X_t + W_{(Hi)}~H_{(t-1)} \nonumber \\&+ W_{(Ci)}~C_{(t-1)} + B_i) \end{aligned}$$
(9)
$$\begin{aligned} f_t= & {} \sigma ~(W_{(Xf)}~X_t + W_{(Hf)}~H_{(t-1)} \nonumber \\&+ W_{(Cf)}~C_{(t-1)} + B_f) \end{aligned}$$
(10)
$$\begin{aligned} o_t= & {} \sigma ~(W_{(Xo)}~X_t + W_{(Ho)}~H_{(t-1)} \nonumber \\&+ W_{(Co)}~C_t + B_o) \end{aligned}$$
(11)
$$\begin{aligned} C_t= & {} f_t~C_{(t-1)} + i_t~\mathrm{tan}h~(W_{(Xc)}~X_t \nonumber \\&+ W_{(Hc)}~H_{(t-1)} + B_c) \end{aligned}$$
(12)
$$\begin{aligned} H_t= & {} o_t~\mathrm{tan}h~{(C_t)} \end{aligned}$$
(13)

where \(W_i, W_f, W_o\) are the weight matrices, and \(X_t\) is the input to the LSTM cells at time instance t. \(\sigma \) is the sigmoid activation function, whereas tanh is the hyperbolic tangent activation function. f, i, and o are the forget, input, and output gates, respectively. C represents the state of a memory cell. \(B_i, B_c, B_f\), and \(B_o\) are the bias vectors.

Different combinations of batch sizes, hidden units, and learning rates have been investigated. The best results were obtained using 32 hidden units for 7 classes with a learning rate of 0.0025, a lambda loss amount of 0.0015, and a batch size of 1500. Two LSTM cells were stacked, which adds depth to the network. For loss computation, the Softmax loss function was used and optimized using the Adam optimizer.
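The reported configuration can be sketched in Keras as follows; the window length is an illustrative assumption, and the lambda loss term, which would correspond to a weight-regularization penalty, is omitted for brevity:

```python
from tensorflow.keras import layers, models, optimizers

WINDOW, N_FEATURES, N_CLASSES = 32, 81, 7

def build_lstm():
    """Two stacked LSTM cells with 32 hidden units each, Adam optimizer,
    learning rate 0.0025, and Softmax output, as reported above."""
    model = models.Sequential([
        layers.Input(shape=(WINDOW, N_FEATURES)),
        layers.LSTM(32, return_sequences=True),  # first stacked LSTM cell
        layers.LSTM(32),                         # second stacked LSTM cell
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.0025),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training would then use the reported batch size:
# model.fit(X_train, y_train, batch_size=1500, epochs=200, validation_data=(X_val, y_val))
```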

ConvLSTM architecture: Here, a ConvLSTM network is proposed using the fusion of CNNs, LSTM, and dense layers: the CNNs are used for spatial feature extraction, the LSTMs for sequence prediction, and the dense layers for mapping the features into a more separable space (Fig. 2). Figure 6 shows a traditional activity recognition model in which CNN and LSTM are fused in parallel; this scheme has been used in many previous works (Hammerla et al. 2016; Wang et al. 2015; Hou et al. 2016). Although this approach is much better than using either only the CNN or only the LSTM, it does not use the strengths of both models to their fullest.

Fig. 6

A parallel ConvLSTM model with score fusion

Fig. 7

A sequential ConvLSTM model

In the proposed model, a sequential fusion of CNN, LSTM, and dense layers is used, as shown in Fig. 7. The outputs of the last hidden layer of the CNNs are fed to the LSTM layers, followed by the fully connected layers for the classification. The equations of the ConvLSTM can be given as follows:

$$\begin{aligned} F_t= & {} \sigma ~ (W_{(XF)}~*~X_t + W_{(HF)}~*~H_{(t-1)} + B_F) \end{aligned}$$
(14)
$$\begin{aligned} I_t= & {} \sigma ~ (W_{(XI)}~*~X_t + W_{(HI)}~*~H_{(t-1)} + B_I) \end{aligned}$$
(15)
$$\begin{aligned} \check{C}_t= & {} tanh~(W_{(X\check{C})}~*~X_t + W_{(H\check{C})}~*~H_{(t-1)} + B_{\check{C}} ) \end{aligned}$$
(16)
$$\begin{aligned} O_t= & {} \sigma ~ (W_{(XO)}~*~X_t + W_{(HO)}~*~H_{(t-1)} + B_O) \end{aligned}$$
(17)
$$\begin{aligned} C_t= & {} F_t~\odot ~C_{(t-1)} + I_t~\odot ~\check{C}_t \end{aligned}$$
(18)
$$\begin{aligned} H_t= & {} O_t~\odot ~\mathrm{tan}h~(C_t) \end{aligned}$$
(19)

where \(\sigma \) (sigmoid) and tanh (hyperbolic tangent) are nonlinear activation functions. \(\odot \) represents the Hadamard product, and \(*\) represents the convolution operations. The inputs \((X_t)\), cells \((C_t)\), hidden states \((H_t)\), forget gates \((F_t)\), input gates \((I_t)\), input-modulation gates \((\check{C}_t)\), and output gates \((O_t)\) are all \(M \times N \times F\) (rows, columns, feature maps) dimensional 3D tensors. The memory cell \(C_t\) is the most crucial module, acting as an aggregator of the state information controlled by the gates.

Initially, helper functions are defined to increase the reusability and readability of the code. The hyper-parameters like the layer sizes, number of layers, steps, batch size, and learning rate (0.0001) have been set to optimal values. Next, we construct the LSTM cells and reshape the dataset for the LSTM into sequence length, batches, and channels. We then apply the ReLU activation and set dropout regularization, which operates simultaneously on the gates, cells, and output responses of the LSTM neurons. The logits are used for computing the cost function, which is optimized with the Adam optimizer together with gradient clipping. For training the network, we set the checkpoint path and saver function, initialized the session, set the iterations, computed the loss and accuracy on the validation dataset, and saved the checkpoint for later testing of the model. The performance is greatly improved by the sequential model (Fig. 7) as compared to score fusion (Fig. 6). The experimental results and comparisons of the three models, i.e.,  CNNs, LSTMs, and ConvLSTM, for activity recognition are presented in the next sections.
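Putting the pieces together, the sequential CNN-to-LSTM-to-dense fusion of Fig. 7 can be sketched in Keras as below. The learning rate (0.0001), ReLU activation, dropout, Adam optimizer, and gradient clipping follow the description above; the layer widths and window length are illustrative assumptions:

```python
from tensorflow.keras import layers, models, optimizers

WINDOW, N_FEATURES, N_CLASSES = 32, 81, 7

def build_convlstm():
    """Sequential ConvLSTM fusion (Fig. 7): CNN layers extract spatial
    features, LSTM layers model the temporal dependencies, and dense
    layers map the features to class scores."""
    model = models.Sequential([
        layers.Input(shape=(WINDOW, N_FEATURES)),
        # CNN stage: spatial feature extraction within each window.
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # LSTM stage: temporal dependencies over the CNN feature maps,
        # with dropout regularization on the recurrent layers.
        layers.LSTM(64, return_sequences=True, dropout=0.5),
        layers.LSTM(64, dropout=0.5),
        # Dense stage: map features into a more separable space, then classify.
        layers.Dense(64, activation="relu"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.0001, clipnorm=1.0),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```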

Table 3 Precision, recall, and F1-score using LSTM, CNNs, and ConvLSTM

4 Experimental results

This section presents the dataset building procedure and the experimental results using the deep learning algorithms, i.e.,  CNNs, LSTMs, and ConvLSTM. The proposed model has been tested on the newly collected KinectHAR dataset recorded with a Kinect-v2 sensor. Only the skeleton joint coordinates along with the suitable features are stored as input to the deep learning models.

4.1 Dataset building

With the development of cost-effective RGB-D sensing technologies, it is now more convenient to acquire 3D and depth data. We utilized the Microsoft Kinect (v2) sensor for the data collection; it is a depth-sensor-based motion-sensing input device that offers a convenient way to record and capture human skeleton joints. The name Kinect is a combination of kinetic and connect (Chen et al. 2016). It produces three-dimensional RGB-D data, runs at 30 fps with a resolution of \(640 \times 480\) pixels for both video and depth (Cippitelli et al. 2016), and has a sensing range of 4 meters (Cippitelli et al. 2016).

Table 4 Comparison of accuracies using different algorithms
Fig. 8

Model accuracy and loss curves using LSTMs

For the dataset collection, 20 different people (12 males and 8 females) participated and performed seven different activities: sitting, standing, bending, walking fast, walking slow, lying, and falling. Every person performed each activity for more than two minutes with all possible variations, so that the system can identify activities accurately in a real-time environment. As no videos are recorded and only the skeleton coordinates are used, privacy is preserved. Throughout the experiment, the Kinect (v2) sensor was placed at a height of two meters above the ground, and all the experiments were performed at a range of 0.5 to 4.0 meters in front of the camera. The final dataset contains a total of 130,000 samples with 81 attribute values.

The source codes and dataset will be made publicly available to the research community.

Fig. 9

Model accuracy and loss curves using CNNs

Fig. 10

Model accuracy and loss curves using ConvLSTM

4.2 Model evaluation

This section describes the experimental results obtained using the three deep learning algorithms, namely CNN, LSTM, and ConvLSTM. In the ConvLSTM, the CNN is used for feature filtering, the LSTM for sequential classification, and the fully connected layers for feature mapping, as illustrated in Fig. 2. LSTM cells represent the contextual dependencies in the temporal domain effectively, while CNNs perform better at processing spatial features. Their combination yields the best set of spatial features from the CNNs and the long-term temporal dependencies from the LSTMs.

Table 5 Activity-wise precision, recall, and F1-score using the ConvLSTM
Fig. 11

Realtime testing of the standing, sitting, bending, walking slow, and walking fast activities (a working model demonstrating the realtime testing is available at https://www.youtube.com/watch?v=2PqkyXMVBLg)

The dataset has been split into training and validation sets with a ratio of 60:20, while the remaining 20% is left for testing. The training data have been used to train the classifiers, while the validation dataset has been used to measure the performance and accuracy of the trained model. The model loss is obtained using categorical cross-entropy, due to its suitability for measuring the performance of a final layer with Softmax activation. All three models were trained for 200 epochs on a machine with an NVidia TITAN-X GPU. The performance of the system has been measured with different metrics, including precision, recall, F1-score, and accuracy. If we denote true negatives by \(T_N\), true positives by \(T_P\), false negatives by \(F_N\), and false positives by \(F_P\), then the performance metrics can be expressed as follows:

$$\begin{aligned} {\text {Recall}}= & {} \frac{T_P}{T_P + F_N} \end{aligned}$$
(20)
$$\begin{aligned} {\text {Precision}}= & {} \frac{T_P}{T_P + F_P}\end{aligned}$$
(21)
$$\begin{aligned} \text {F1-Score}= & {} 2 \times \frac{{\text {Precision}} \times {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}}\end{aligned}$$
(22)
$$\begin{aligned} {\text {Accuracy}}= & {} \frac{T_P + T_N}{T_P + F_P + F_N + T_N} \end{aligned}$$
(23)
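For completeness, Eqs. (20)-(23) can be computed per class from integer label vectors, as in the following sketch:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes=7):
    """Per-class precision, recall, and F1-score (Eqs. 20-22) and overall
    accuracy (Eq. 23) from integer label vectors."""
    per_class = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives for class c
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives for class c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    accuracy = np.mean(y_true == y_pred)
    return per_class, accuracy
```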

Table 3 presents the precision, recall, and F1-score values for the different algorithms, which vary within \(\pm 1\) percentage point. We have applied different machine learning and deep learning algorithms, namely SVMs, decision trees (DT), random forests (RF), artificial neural networks (ANNs), CNNs, LSTMs, and ConvLSTM. The comparison of their accuracies is shown in Table 4. All the accuracies and plots are computed over 200 epochs. The accuracy and loss curves using LSTM, CNN, and ConvLSTM are shown in Figs. 8, 9, and 10, respectively. The proposed ConvLSTM results in better accuracy than the LSTM and CNN, as illustrated in Table 4; its accuracy is approximately 5-6% higher than that of either the LSTM or the CNN individually. Table 5 presents the precision, recall, and F1-score of each class obtained using the ConvLSTM model. As Tables 3 and 4 show, the ConvLSTM gives the best results in comparison to the other algorithms, so we have stored the trained ConvLSTM model and tested it in real time. Figure 11 presents the activity recognition results obtained in real time. The performance is sufficiently high for the general adoption of the system.

5 Conclusion

This paper presented a privacy-preserving activity recognition and fall detection system using a single Kinect (v2) sensor and a ConvLSTM. The proposed system derives geometrical and kinematic features and passes them, along with the raw skeleton coordinates, to deep learning networks. As the system uses only the derived features along with the raw skeleton joint coordinates, and not the actual images of the user, the privacy of the user is protected. We proposed a simple and effective method based on the sequential fusion of CNNs and LSTMs, named the ConvLSTM model. The performance of the deep learning-based classification algorithms, namely CNN, LSTM, and ConvLSTM, has been compared on a novel dataset consisting of 130,000 samples with 81 attribute values. The proposed system recognizes the standing, walking slow, walking fast, sitting, bending, falling, and lying down activities. It is unobtrusive to the users and independent of the camera orientation, clothing, etc., and it gives sufficiently high performance for activity recognition and fall detection for general adoption. The source code and the presented dataset will be made publicly available.