1 Introduction

In recent years, the use of skeleton data for human posture recognition has emerged as a popular research topic in computer vision. The technology shows good prospects for application in human-computer interaction, rehabilitation medicine, multimedia applications, virtual reality, robot control, and other fields. In general, postures differ from actions in that the former are static and the latter dynamic. A human posture is the basis of actions and is often taken as the key frame in various action recognition algorithms. Moreover, in some fields, such as physical training, rehabilitation training [8] and sign language communication, a human posture is more important than an action. In noisy workshops and dangerous working environments, posture recognition, as a human-computer interaction mode, is superior to keystroke control and voice interaction, being more accurate, efficient and natural.

There are several main methods for posture recognition. One is to use wearable sensors [39], such as accelerometers [2, 3, 16] and pressure sensors [11]. However, wearing such a device burdens the subject, which compromises the interactive experience. Another is based on monocular cameras [35]. However, this approach is susceptible to illumination and background interference, offering unsatisfactory recognition accuracy and robustness in complex conditions. With the decreasing cost of depth image sensors, RGB-D image based posture and action recognition has become an important research focus in the field of human-computer interaction. Researchers can easily obtain color and depth images as well as human skeleton data. Many posture recognition algorithms [6, 22] that use skeleton data obtained from Kinect have been proposed. These algorithms not only avoid the influence of illumination, but also eliminate the need for preprocessing such as segmentation and object detection in complex backgrounds, which greatly improves accuracy. However, most existing works focus on action recognition rather than posture recognition, with more and more attention being paid to daily actions. Additionally, datasets and algorithms for posture recognition are still of limited availability. Therefore, in this paper we propose a human posture recognition method, evaluated on several datasets containing a wide variety of postures, that achieves more accurate posture recognition.

The contributions of this paper are that we extract features at different granular levels and create diverse training subsets for enhanced accuracy in the rule-based classifier. Specifically, to better represent human postures, (1) we extract angle features between joints at the fine-grained level and relative distance features between key body parts at the coarse-grained level; (2) in the classification stage, bagging and random subspace approaches are used to divide the original training dataset into subsets with different samples and features. The final decision is made by majority voting among RIPPER classifiers trained on these diverse training subsets. The experimental results show that our algorithm performs better than CNNs on the current datasets, even when using the same parameter settings across datasets.

The rest of this paper is organized as follows. A review of related work is offered in Sect. 2. The proposed human posture recognition algorithm is described in Sect. 3. A description of the datasets and the experimental results are provided in Sect. 4. The conclusions are given in Sect. 5.

2 Related works

Most traditional posture recognition methods describe human visual information and two-dimensional posture information by extracting features from RGB images. Ramanan and Sminchisescu [36] proposed an algorithm that uses human contour samples to obtain human edge templates and a similarity and gradient descent method to estimate postures. Jiang et al. [18] presented a posture recognition method using convex programming based matching schemes. This method proves more efficient than alternatives such as graph-cut or belief propagation for object matching problems that involve a large search range. However, these methods are sensitive to irrelevant features arising from people's clothes, environmental interference and illumination in the image.

Souto and Musse [38] proposed an algorithm that uses artificial neural networks to automatically detect human poses in a single image. However, this approach determines the human skeleton from static image features, which requires a large amount of computation for feature extraction. Mun Wai and Isaac [34] presented a data-driven MCMC technique to estimate 3D human poses from static images. Sarafianos et al. [37] reviewed the progress and shortcomings of recent research on 3D human pose estimation. Considering that different input modes and different key features are introduced separately, they conducted an extensive experimental evaluation of the approaches on a synthetic dataset and concluded by discussing the findings from the literature review and the experimental results.

Since the advent of the Microsoft Kinect sensor in 2010, more and more researchers have begun developing posture recognition methods based on skeleton data and depth images. Lin et al. [25] proposed a Kinect-based rehabilitation system, which defines two kinds of features: the average distances between 10 joint points of the upper limb and the angles of 9 adjacent joints, both compared against the posture to be recognized. The recognition result depends on the setting of the matching threshold, so the robustness is less than ideal. Islam et al. [17] used a Kinect sensor to detect different joint points of the human body and calculated the average deviation to recognize users' yoga poses. Miranda et al. [33] presented a method that uses the angles between skeletal joints to describe human postures, with a multi-level support vector machine (SVM) and a decision forest used for classification. The method, however, offers limited accuracy when recognizing multiple similar postures. Li et al. [22] used angular features to represent six human postures and SVM to classify them. Chen and Wang [6] proposed a method that uses the back propagation (BP) network, SVM and naive Bayes to recognize three postures. This method involves no feature extraction and uses the original skeleton data as the input to the classifier.

Agarwal and Triggs [1] proposed a relevance vector machine (RVM) regression method that employs contour information to estimate human postures. This method requires matching against multiple templates and is therefore time-consuming. Zainordin et al. [41] proposed a method to classify postures by setting threshold distances and angles between joints and establishing a set of rules based on the skeleton and depth information. However, this method is only suitable for classifying a few postures, because its recognition accuracy relies heavily on how the posture rules are formulated and trained. Georgakopoulos et al. [13] proposed a method that can automatically recognize any user-defined postures: nine features representing specific body parts are generated from the user's posture skeleton information and input into an SVM to build posture learning models. Elforaici et al. [9] proposed a method in which convolutional features are extracted from color images and transfer learning is used to train convolutional neural networks (CNNs) for recognizing human postures from RGB and depth images. Li et al. [23] proposed a method that uses anthropometry and the BP neural network to recognize human postures with the person facing the Kinect sensor from different directions. Deep learning methods exhibit relatively good recognition rates, but the resulting models are difficult to interpret. They also require very large datasets and time-consuming parameter tuning to achieve high performance.

To sum up, most existing works are image-based methods. Therefore, we propose a posture recognition algorithm based on the skeletal information obtained from Kinect.

3 Proposed approach

The proposed approach for human posture recognition is based on the skeleton information extracted from a Kinect sensor. Figure 1 illustrates the stages involved. First, multiple features are defined, including angle features and distance features between joints. Then the bagging and random subspace methods are used to create rule ensembles based on the RIPPER rule learning algorithm: 100 rule sets are trained, which make up a rule ensemble whose final classification is obtained by majority voting.

Fig. 1 Overview of the proposed approach

3.1 Extraction of multiple features

The Kinect sensor can acquire real-time 3D position information of 20 human joints, expressed as \(x,\,y\) and z coordinates in meters. In the original data, each posture is recorded as the absolute positions of the 20 joints of the human body; the skeleton information is denoted as \(J = \{j_{1},\, j_{2},\, j_{3},\, \ldots ,\,j_{N}\}\), where \(j_{i} = (x_{i},\, y_{i},\, z_{i})\) is the coordinate position of joint i, and N = 20 is the total number of skeleton joints. The label of each joint is defined as shown in Fig. 2.

Fig. 2 The label of each joint point

Two joints form one skeletal segment. As shown in Table 1, a total of 23 skeletal segments are defined as \(S = \{ S_{1},\, S_{2},\, \ldots , \,S_{23}\}\). Each skeletal segment \(S_{i}\) consists of two joint points in the table, whose spatial coordinates are expressed as \(j_{a} = ( x_{a},\, y_{a},\, z_{a}),\, a=1,2,\ldots ,20\), and \(j_{b} = (x_{b},\,y_{b},\,z_{b}),\, b=1,2,\ldots ,20,\, b \ne a\).

Table 1 Composition of skeletal segments

Then the direction vector of skeletal segment \(S_i\) is denoted as follows:

$$\begin{aligned} \upsilon _i = (\upsilon _x, \upsilon _y, \upsilon _z) = (x_b - x_a,\, y_b - y_a,\, z_b - z_a) \end{aligned}$$
(1)

Thus the angle between the two skeletal segments \(S_{a}\) and \(S_{b}\) is defined as:

$$\begin{aligned} Angle = \arccos \frac{\upsilon _{xa} \upsilon _{xb} + \upsilon _{ya} \upsilon _{yb} + \upsilon _{za} \upsilon _{zb}}{\sqrt{(\upsilon _{xa}^2+\upsilon _{ya}^2+\upsilon _{za}^2) (\upsilon _{xb}^2+\upsilon _{yb}^2+\upsilon _{zb}^2)}} \end{aligned}$$
(2)

Here the direction vector of \(S_{a}\) is \(\upsilon _{a} = (\upsilon _{xa},\, \upsilon _{ya},\, \upsilon _{za})\), and that of \(S_{b}\) is \(\upsilon _{b} = (\upsilon _{xb},\, \upsilon _{yb},\, \upsilon _{zb})\).

In this study, 253 angle values were obtained by computing the defined angle for every pair of the 23 skeletal segments (\(\binom{23}{2} = 253\)). After removing 67 redundant angles, 186 angle features were retained, defined as \(Angle = [Angle_{1},\,Angle_{2},\,\ldots ,\,Angle_{186}]\). These three-dimensional angle features are rotation and scale invariant, and they play an important role in the recognition process.
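
To make the computation concrete, the sketch below derives the pairwise segment angles from a 20 × 3 joint array, following Eqs. (1) and (2). The segment index pairs are hypothetical stand-ins, since Table 1 is not reproduced in this text:

```python
import numpy as np
from itertools import combinations

# Hypothetical stand-in for Table 1: each segment is an (a, b) index pair
# into the 20 x 3 joint array; the paper defines 23 such segments.
SEGMENTS = [(0, 1), (1, 2), (2, 3), (2, 4)]  # ... 23 pairs in total

def segment_angles(joints):
    """joints: (20, 3) array of Kinect joint coordinates in meters.
    Returns the angle (radians) between every pair of segments, Eq. (2)."""
    vecs = [joints[b] - joints[a] for a, b in SEGMENTS]     # Eq. (1)
    angles = []
    for va, vb in combinations(vecs, 2):
        cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))   # clip for round-off
    return np.array(angles)  # C(23, 2) = 253 values before redundancy removal
```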

Next, we define the relative distance features of 11 groups of joint points, which are shown in Table 2. The distance feature \(D_{i} = \{d_{ix},\,d_{iy},\, d_{iz}\},\,i\in \,[1,11]\), where \(d_{ix}= x_{a} - x_{b}\), \(d_{iy}= y_{a} - y_{b}\) and \(d_{iz}= z_{a} - z_{b}\). The distance features represent the global human posture, complementing the angle features.
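
A matching sketch for the distance features; the joint index pairs are again hypothetical, as Table 2 is not reproduced here. Concatenating the 186 retained angles with these 33 values yields the 219-dimensional vector described next:

```python
import numpy as np

# Hypothetical stand-in for Table 2: 11 (a, b) joint index pairs.
DIST_PAIRS = [(7, 11), (4, 19), (0, 3)]  # ... 11 pairs in total

def distance_features(joints):
    """Per-axis signed differences d_ix, d_iy, d_iz for the 11 joint
    pairs, giving 11 x 3 = 33 values."""
    return np.concatenate([joints[a] - joints[b] for a, b in DIST_PAIRS])
```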

Table 2 Composition of distance feature D

Finally, a 219-dimensional feature vector \(f = \{ f_{1},\, f_{2},\, \ldots ,\, f_{219}\}\) is generated, which includes the 186 angle features and the 33 distance features. The angle features describe the relationship between two skeletal segments and thus local human posture, while the distance features capture the relative distances between joint points, roughly describing the configuration of the limbs. Combining the two permits a more comprehensive representation of postures.

3.2 Classification method

As mentioned in Sect. 1, the classification process entails the Bagging approach, the random subspace method, and the RIPPER rule learning algorithm for creating rule ensembles.

The Bagging (bootstrap aggregating) approach, proposed in [4], is used here to draw n different versions of the training data through random sampling with replacement. In this way, some instances may be selected more than once into a new training sample \(s_i\), whereas others may never be selected. On average, each sample \(s_i\) is expected to contain 63.2% of the instances in the original training set [21, 26, 27]. The base classifiers trained (using the same learning algorithm) on the n samples are therefore likely to be diverse [5, 19], because the n samples cover different parts of the original training set. The procedure of the Bagging approach is illustrated in Fig. 3.
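
The 63.2% figure follows from \(1-(1-1/n)^{n} \approx 1-1/e\); a minimal sketch to check it empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # size of the training set
sample = rng.integers(0, n, size=n)     # one bootstrap sample, with replacement
coverage = np.unique(sample).size / n   # fraction of distinct instances drawn
print(f"{coverage:.3f}")                # ~0.632, i.e. about 1 - 1/e
```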

Fig. 3 The procedure of bagging [29]

The random subspace method, proposed in [15], is used here to create m diverse feature subsets. Since each feature subset \(fs_j\) represents a random subspace of the full feature set, the m base classifiers trained on these subsets are likely to be diverse [5, 19]. The random subspace method was originally used as an effective way of creating decision tree ensembles, and its resulting models are referred to as random decision forests [14]. It involves a similar procedure to the Bagging approach, as shown in Fig. 3; in the sampling stage, however, features instead of instances are selected. Hence, the random subspace method is also known as feature bagging.
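
A minimal sketch of drawing one random subspace, using the subspace size of 0.5 later adopted in Sect. 4.2 (the feature matrix here is placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 219))   # placeholder 219-dimensional feature matrix
frac = 0.5                        # subspace size used in Sect. 4.2
cols = rng.choice(X.shape[1], size=int(X.shape[1] * frac), replace=False)
X_sub = X[:, cols]                # one random view for one base classifier
```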

The RIPPER algorithm, proposed in [7], trains rule-based classifiers through the separate-and-conquer strategy of rule learning [12], as illustrated in Algorithm 1.

Algorithm 1 The separate-and-conquer rule learning procedure

At each iteration of learning a single rule (line 2 of Algorithm 1), the attribute-value pair (e.g. \(x_1>2\)) that maximizes the rule quality is selected as a condition (an antecedent of the rule), and the process is repeated until the stopping criterion for this rule is satisfied. Once the rule has been finalized, it will normally cover training instances of a single class, and its learning is finished. All instances covered by this rule are then deleted from the training set, so that the learning of the next rule starts from the remaining instances.
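
Since Algorithm 1 is not reproduced in this text, the following is a hedged schematic of the separate-and-conquer loop it describes; the single-rule learner is left abstract, and rules are modeled as conjunctions of \(x_f > t\) conditions purely for illustration:

```python
from typing import List, Tuple
import numpy as np

Rule = List[Tuple[int, float]]  # conjunction of conditions: x[f] > t

def covers(rule: Rule, x: np.ndarray) -> bool:
    """True if instance x satisfies every antecedent of the rule."""
    return all(x[f] > t for f, t in rule)

def separate_and_conquer(X, y, target, learn_one_rule) -> List[Rule]:
    """Schematic of Algorithm 1: learn one rule for the target class,
    delete the instances it covers, repeat while positives remain."""
    rules, mask = [], np.ones(len(y), dtype=bool)
    while np.any(y[mask] == target):
        rule = learn_one_rule(X[mask], y[mask], target)  # greedy growth
        rules.append(rule)
        covered = np.array([covers(rule, x) for x in X])
        mask &= ~covered                                 # the "separate" step
    return rules
```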

For the RIPPER algorithm, the selection of an attribute-value pair at each iteration of learning a single rule is made by evaluating the rule quality, based on the FOIL information gain shown in Eq. (3), after adding the candidate pair as an antecedent of the rule:

$$\begin{aligned} IG_{r_i}= p_{r_i} \times \left( \log _2\left( \frac{p_{r_i}}{p_{r_i}+n_{r_i}}\right) -\log _2\left( \frac{p}{p+n}\right) \right) \end{aligned}$$
(3)

where \(p_{r_i}\) and \(n_{r_i}\) represent, respectively, the number of positive and negative instances covered by rule \(r_i\), whereas p and n represent, respectively, the number of positive and negative instances in the initial training subset from which the learning of rule \(r_i\) starts.
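
Eq. (3) translates directly into code; a small sketch, with candidate antecedents compared by their gain:

```python
import numpy as np

def foil_gain(p_r, n_r, p, n):
    """FOIL information gain, Eq. (3). p_r/n_r: positive/negative
    instances covered by the candidate rule; p/n: positives/negatives
    in the training subset the rule's learning started from."""
    return p_r * (np.log2(p_r / (p_r + n_r)) - np.log2(p / (p + n)))

# e.g. a condition keeping 30 of 40 positives while dropping 25 of 30
# negatives from a 40/30 subset yields a positive gain:
print(foil_gain(30, 5, 40, 30))   # ~17.5, so the condition helps
```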

On the other hand, the RIPPER algorithm also requires pruning of each rule \(r_i\) once its learning is complete, before the learning of the next rule starts. In particular, incremental reduced error pruning (IREP) is adopted to simplify each rule \(r_i\), based on the rule-value metric shown in Eq. (4).

$$\begin{aligned} w_{r_i}= \frac{p_{r_i}-n_{r_i}}{p_{r_i}+n_{r_i}} \end{aligned}$$
(4)

IREP prunes each rule by evaluating the removal of the last antecedent of rule \(r_i\) in terms of the rule-value metric \(w_{r_i}\). If the value of \(w_{r_i}\) increases after removing the last antecedent, the pruning process is repeated on the shortened rule; as soon as a removal fails to increase \(w_{r_i}\), pruning stops immediately and that antecedent is kept.
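
A sketch of this pruning loop, reusing the Rule and covers definitions from the separate-and-conquer sketch above; the pruning split and its labels are passed in:

```python
def rule_value(p_r, n_r):
    """Rule-value metric of Eq. (4)."""
    return (p_r - n_r) / (p_r + n_r) if (p_r + n_r) else -1.0

def irep_prune(rule, X_prune, y_prune, target):
    """Drop the last antecedent while doing so increases Eq. (4) on the
    pruning split; stop at the first non-improving removal."""
    def value(r):
        hits = [covers(r, x) for x in X_prune]
        p_r = sum(h and lab == target for h, lab in zip(hits, y_prune))
        n_r = sum(h and lab != target for h, lab in zip(hits, y_prune))
        return rule_value(p_r, n_r)
    while len(rule) > 1 and value(rule[:-1]) > value(rule):
        rule = rule[:-1]                 # remove the last antecedent
    return rule
```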

Once a whole set of rules has been trained, a global optimization stage is performed to further enhance the quality of the rule set. More details on the whole rule learning and pruning procedure of the RIPPER algorithm can be found in [7].

The whole framework for training classifiers involves three levels. Level 1 creates n samples of the training data through the Bagging approach; level 2 creates m feature subsets from each of the n training samples, using the random subspace method; and level 3 trains a base classifier on each of the \(m\times n\) feature subsets, using the RIPPER algorithm. The final classification is made by fusing the outputs of the \(m\times n\) base classifiers through majority voting.
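
A compact sketch of this three-level framework; since RIPPER is not available in scikit-learn, a decision tree stands in as the base learner (third-party RIPPER implementations exist, e.g. the wittgenstein package, but are not assumed here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, n_bags=10, m_subspaces=10, frac=0.5, seed=0):
    """Level 1: bootstrap samples; level 2: random feature subsets;
    level 3: one base classifier per (sample, subset) combination."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_bags):
        rows = rng.integers(0, len(X), size=len(X))           # bagging
        for _ in range(m_subspaces):
            cols = rng.choice(X.shape[1], int(X.shape[1] * frac),
                              replace=False)                  # random subspace
            clf = DecisionTreeClassifier(random_state=0)      # RIPPER stand-in
            clf.fit(X[np.ix_(rows, cols)], y[rows])
            members.append((cols, clf))
    return members

def predict(members, X):
    """Majority vote over the m x n base classifiers
    (assumes non-negative integer class labels)."""
    votes = np.stack([clf.predict(X[:, cols]) for cols, clf in members])
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```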

4 Experiments

In this section, the data sets used for this study are described alongside the details on the experimental setup. Moreover, the experimental results are discussed in a comparative way.

4.1 Datasets

We have performed an extensive evaluation of our proposed method on four datasets. The first three were extracted from the public action databases MSR-Action3D, Microsoft MSRC-12, and UTKinect-Action. The fourth dataset, called “Baduanjin posture”, we built ourselves using the Kinect sensor.

The MSR-Action3D dataset [24] covers 20 actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up and throw. There are 10 subjects, and each subject performs each action 2 or 3 times. We extracted 20 postures from the MSR-Action3D dataset to build the MSR-Action3D posture dataset, which consists of 3224 frames. As shown in Fig. 4, the first posture is highly similar to the fourth and sixth ones, and the thirteenth is highly similar to the twentieth, which causes difficulties in posture recognition.

Fig. 4 MSR-Action3D posture depth images

The second posture dataset used in this paper was established from the Microsoft Research Cambridge MSRC-12 dataset [10], which was collected from 30 people performing 12 gestures. We extracted 5884 frames from the 719,359 frames of action samples to build a new posture dataset. Figure 5 shows the 12 postures.

Fig. 5 MSRC-12 posture RGB images

The third posture dataset was extracted from the UTKinect-Action dataset [40]. We chose 10 action types from this dataset: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands. There were 10 subjects, and a total of 3795 frames were extracted. This dataset was collected to investigate variations across views: right view, frontal view, left view and back view. In addition, the background clutter and human-object interactions in some postures add further challenges to posture recognition. Figure 6 shows the 10 postures.

Fig. 6 UTKinect-Action posture RGB images

We have also collected a new dataset of rehabilitation postures, called the Baduanjin dataset, in accordance with the standard operating procedures. Baduanjin is a traditional Chinese fitness method, often used to improve the physical constitution, balance and joint flexibility of patients with motor dysfunction. We defined 15 types of postures and collected them using a Kinect sensor, with each action performed by 10 subjects. Figure 7 shows the postures.

Fig. 7 Baduanjin posture RGB images

4.2 Experimental setup and results

The experiments were conducted on the KNIME Analytics Platform, which allowed easier integration of algorithms and more convenient manipulation and visualization of data. We used the Bagging node (part of the Weka plugin), where the size of each bag (as a percentage of the training data size) was set to 100 and the option for calculating the out-of-bag error was set to false. The number of iterations was set to 10, i.e., the Bagging approach was used to draw 10 training samples, and the Random Subspace method was used to draw 10 feature subsets from each of the 10 training samples, with the size of each subspace set to 0.5. The RIPPER algorithm was used to train 10 base classifiers (rule sets) on the 10 feature subsets drawn from each training sample, with 2 runs of rule optimization and 1/3 of the training data used for rule pruning. The adoption of the whole ensemble creation framework (based on Bagging, Random Subspace and RIPPER) therefore produced 100 base classifiers in total. All the algorithms were tested using 10-fold cross-validation. The proposed method was compared with five common classification methods and convolutional neural networks.
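
Under the stated settings (10 bags × 10 subspaces of half the features, 10-fold cross-validation), a rough scikit-learn stand-in for the KNIME/Weka pipeline, reusing the train_ensemble/predict sketch from Sect. 3.2 and assuming a feature matrix X with integer labels y:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for tr, te in skf.split(X, y):
    members = train_ensemble(X[tr], y[tr], n_bags=10, m_subspaces=10, frac=0.5)
    scores.append(accuracy_score(y[te], predict(members, X[te])))
print(np.mean(scores))                      # mean 10-fold accuracy
```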

We performed parameter selection for SVM and KNN by cross-validation [20]. The optimized parameters for SVM and KNN, and those of the other three common classification algorithms, are listed in Table 3. We also conducted experiments using these five algorithms with default parameter settings. For SVM, C is the complexity constant, L is the tolerance parameter, P is the epsilon for round-off error and K denotes the polynomial kernel. For KNN, K is the number of nearest neighbors used in classification. According to Table 3, the selected parameters for the SVM algorithm differ across datasets, while those for KNN are the same.

Table 3 Comparison methods and parameter settings

We used PCA and wrapper-based feature selection to reduce the feature dimensionality; the results are shown in Table 4. Our method without feature selection achieves the best accuracy on all 4 datasets. PCA significantly reduces the feature dimensionality to between 34 and 53, but the accuracy drops slightly. With the combination of genetic search and the ZeroR classifier, the number of features decreases dramatically to 10, but the performance also declines. In fact, in the random subspace stage, the original feature set is already divided into diverse subsets of lower dimensionality, so the feature dimensionality is reduced even without explicit feature selection. Thus the following experiments all use the 219-dimensional feature vectors.
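
For reference, a minimal sketch of the PCA baseline; the exact component-selection criterion is not stated in the text, so retaining 95% of the variance is an assumption here:

```python
from sklearn.decomposition import PCA

# Assumption: keep enough components to explain 95% of the variance
# (the paper's actual selection criterion is not stated).
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(X_reduced.shape[1])                   # number of retained components
```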

Table 4 The accuracy and the number of selected features of our method after feature selection

In the experiments, we used the angle features and the distance features proposed in this paper for posture recognition. As shown in Fig. 4, the MSR-Action3D posture dataset contains several groups of similar postures: the first posture, extracted from the waving action, resembles the sixth, extracted from the high throw action; the second, fourth and twelfth postures, obtained from the horizontal arm wave, hand catch and side-boxing actions, are also similar to one another. These similar postures cause great difficulties in posture recognition. As shown in Table 5, our algorithm achieves higher recognition rates than the other five algorithms with default parameter settings. The confusion matrices of the proposed algorithm and the SVM algorithm in Fig. 8a, b demonstrate that the proposed algorithm classifies the similar postures in this dataset better than the SVM algorithm.

Table 5 Comparison of recognition accuracy on the MSR-Action3D posture dataset
Fig. 8 Confusion matrices of the proposed algorithm and the SVM algorithm

The experimental results on the posture dataset derived from the MSRC-12 action dataset are given in Table 6. This dataset also contains some similar postures, such as posture 4 (using a telescope) and posture 7 (shooting), and posture 10 (holding the head) and posture 12 (air hitting). The results show that our method outperforms the comparison algorithms with default parameter settings in terms of recognition accuracy, and that the similar postures are also classified well.

Table 6 Comparison of recognition accuracy on the MSRC-12 posture dataset

The recognition results on the UTKinect-Action posture dataset are shown in Table 7. Our algorithm again produces better recognition results than the other algorithms with default parameter settings. As indicated in Fig. 6, the collection environment of this dataset is complex; however, the angle and distance features obtained from the skeleton data are not affected by the environment background, giving the algorithm a higher level of robustness.

Table 7 Comparison of recognition accuracy on the UTKinect-Action posture dataset

The recognition results of each algorithm on the Baduanjin posture dataset we built are shown in Table 8. Our algorithm is again superior to the other five classification algorithms with default parameter settings in the recognition accuracy of the 15 rehabilitation postures. Additionally, our proposed algorithm is based on rule learning, so the classification model it produces is an ensemble of rule sets. Compared with many machine learning and deep learning methods, our algorithm generates more interpretable models. The KNIME platform can export the model generated by the rule learning algorithm to text, which consists of the feature subset selected by random subspace and the classification rule set of each base classifier. These readable rule sets can be used to recognize different rehabilitation postures, making the method more promising for rehabilitation applications than the other algorithms.

Table 8 Comparison of recognition accuracy on the Baduanjin posture dataset
Table 9 Accuracy of SVM and KNN using optimized parameters on four datasets

The accuracy achieved by SVM and KNN with the optimized parameters of Table 3 on the four datasets is shown in Table 9. It is worth noticing that the parameter selection process for SVM and KNN is time-consuming. Even though the results of these classifiers with optimized parameters are marginally better than ours, our algorithm achieves strong results on all four datasets while sharing the same parameters across datasets, indicating a stronger generalization ability and greater robustness [42]. More importantly, with its base classifiers being rule-based, our method can output rules for each posture, making it useful and convenient in real-world applications.

According to Table 10, our method achieves the highest recognition accuracy on all four datasets. On the Baduanjin and MSRC datasets, all four methods have recognition accuracy rates above 95%; AlexNet exhibits higher accuracy than the other two CNNs but still lower than ours. On the MSR3D dataset, our method is the only one with an accuracy rate of over 90%. On the UTKA dataset, our method has a 6.4% higher recognition accuracy rate than VGG-13. A main reason is that the features we extract contain information at different granular levels: the 219-dimensional feature vector consists of 186 angle features (fine-grained level) and 33 distance features (coarse-grained level), capturing both the local and the global information of different postures.

Table 10 Comparison between the results of our method and those of CNN
Table 11 AUC values of different methods
Table 12 AP values of different methods

We saved the results of each fold of the 10-fold classification and used the micro-average method in the sklearn toolbox to generate the precision-recall (PR) curves and ROC curves. The ROC curves and the PR curves for the different datasets are shown in Figs. 9 and 10, and the AUC and AP values in Tables 11 and 12. Our method shows better performance than the CNNs, achieving the highest AUC and AP values on all the datasets.
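
A sketch of this evaluation with illustrative stand-in arrays: y_true pools the fold labels and y_score holds the per-class vote fractions of the ensemble (both names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import (roc_curve, precision_recall_curve, auc,
                             average_precision_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=200)            # stand-in pooled labels
y_score = rng.dirichlet(np.ones(5), size=200)    # stand-in class scores

Y = label_binarize(y_true, classes=np.arange(5))
fpr, tpr, _ = roc_curve(Y.ravel(), y_score.ravel())          # micro-average ROC
print("micro AUC:", auc(fpr, tpr))
prec, rec, _ = precision_recall_curve(Y.ravel(), y_score.ravel())
print("micro AP:", average_precision_score(Y, y_score, average="micro"))
```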

5 Conclusion

In this paper, we have proposed a rule ensemble approach for human posture recognition based on multiple features. The approach employs the Bagging approach for random sampling of training data and the Random Subspace method for random selection of feature subsets. This allows diverse rule-based classifiers to be trained using the RIPPER rule learning algorithm, creating a high-performance ensemble. In terms of feature extraction, we extract multiple features, including angle features and distance features between joints. A comparison was made between our proposed approach and five popular learning methods on three public action datasets and one we built ourselves. The experimental results show that our proposed approach outperforms the other learning methods.

In the future, we will investigate the techniques of granular computing [28, 30,31,32] towards extraction of features at multiple levels of granularity and fusion of different features to reduce the dimensionality and the sparsity of feature sets. It is also critical to explore how the extraction of multiple features can increase the diversity among classifiers trained using different feature sets or learning algorithms, so as to enable further advances in the performance of human posture recognition.

Fig. 9 ROC curves of different methods

Fig. 10 Precision-recall curves of different methods