1 Introduction

Automated high-level human activity analysis and recognition play a fundamental role in many relevant and heterogeneous application fields such as video surveillance, ambient assisted living, automatic video annotation and human-computer interfaces. Of course, different applications require specific approaches to be designed and implemented; general-purpose solutions, though highly desirable, are very difficult to realize due to differences in the sources of information, the requirements in terms of efficiency, the environmental factors which significantly affect performance, and so on. This work focuses on human activity recognition in indoor environments, with typical applications in fall detection for elderly people, abnormal human behavior detection and human-computer interfaces. In our opinion, unobtrusiveness is one of the most important and interesting features of ambient intelligence applications; to meet this requirement, this paper proposes a vision-based technique where simple cameras are used as input devices and users are required neither to wear sensors nor to actively interact with devices of any other nature.

With respect to other application scenarios such as video surveillance, indoor environments offer several advantages: the input data are somewhat more “controlled” and easier to process (e.g. to segment the subjects in the scene), the number of possible users is generally limited, and input devices such as RGB-D cameras can be successfully adopted for data acquisition. The problem of activity recognition is nevertheless still complex, considering that users are not cooperative and real-time processing is needed to produce timely and useful information. This paper proposes an activity recognition technique based on the use of RGB-D cameras, and in particular the Kinect sensor, for data acquisition. To the best of our knowledge, all the existing techniques based on skeleton data exploit only the 3D joint positions, while joint orientation is typically neglected. The aim of this work is to evaluate the reliability of the joint orientation estimates provided by Kinect and to verify their effectiveness for action recognition.

The paper is organized as follows: an overview of the state of the art is provided in Sect. 2, Sect. 3 presents the proposed approach, the results of the experimental evaluation are given in Sect. 4 and finally Sect. 5 draws some conclusions and presents possible future research directions.

2 State of the Art

Vision-based activity recognition techniques do not require the use of special devices: the only source of information is represented by cameras placed in the environment which continuously acquire video sequences. Many works adopt common RGB cameras to acquire information from the environment, but undoubtedly the widespread diffusion of low-cost RGB-D sensors, such as the well-known Microsoft Kinect, has greatly boosted research on this topic. Even though a few hybrid approaches combining gray-scale and depth information have been proposed (e.g. [1]), RGB-D sensors alone have been widely used for activity analysis [2] and several benchmarks have been released to facilitate the comparative evaluation of recognition algorithms [3, 4]. The most attractive feature of the Kinect sensor is its ability to capture depth images, coupled with the possibility of tracking rather accurately the skeletons of the individuals in the scene. The skeleton representation provided by Kinect consists of a set of joints, each described in terms of position and orientation in 3D space. Such information is extremely useful for human activity analysis, as confirmed by many approaches in the literature. A few works exploit only the depth information (and not the skeleton), and typically perform an image segmentation to identify some relevant posture features of the human body [5]. Most of the approaches perform a skeleton analysis, adopting different representations of the set of joints such as the simple joint coordinates, normalized according to some body reference measure [6, 7], joint distances [8], EigenJoints [9], where PCA is applied to static and dynamic posture features to create a motion model, histograms of 3D joints [10], kinematic features obtained by observing the angles between pairs of joints [11], Gaussian Mixture Models representing the 3D positions of the skeleton joints [12], a Dynamic Bayesian Mixture Model of 3D skeleton features [13], or spatio-temporal interest points and descriptors derived from the depth image [14]. Another common approach is to adopt a hierarchical representation where an activity is composed of a set of sub-activities, also called actionlets [15,16,17,18]. Finally, a few works also analyze the interaction of humans with objects to obtain a better scene understanding. The authors of [18] adopt a Markov random field whose nodes represent objects and sub-activities, and whose edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time; in [19] the authors propose a graph-based representation.

3 Proposed Approach

The idea behind the proposed approach is to encode each frame of a video sequence as a set of angles derived from the human skeleton, which summarize the relative positions of the different body parts. This proposal presents some advantages: the use of skeleton data ensures a higher level of privacy for the user with respect to RGB sequences, and the angle information derived from skeletons is intrinsically normalized and independent of the user’s physical build. The skeleton information extracted by the Kinect [20] consists of a set of n joints \(J=\left\{ j_1, j_2,...,j_n\right\} \), where the number n of joints depends on the software used for the skeleton tracking (typical configurations include 15, 20 or 25 joints). Each joint \(j_i=\left( \mathbf {p_i}, \overrightarrow{\mathbf {o_i}}\right) \) is described by its 3D position \(\mathbf {p_i}\) and its orientation \(\overrightarrow{\mathbf {o_i}}\) with respect to the world reference frame. Our approach exploits the information given by the joint orientations to compute relevant angles whose spatio-temporal evolution characterizes an activity. We consider three different families of angles (see Fig. 1a and b):

  • \(\theta _{ab}\): angle between the orientations \(\overrightarrow{\mathbf {o}_a}\) and \(\overrightarrow{\mathbf {o}_b}\) of joints \(j_a\) and \(j_b\). Angles \(\theta _{ab}\) are computed for the following set of joint pairs:

    $$ A_\theta =\{(j_1,j_3), (j_1,j_5), (j_3,j_4), (j_5,j_6), (j_0,j_{11}), (j_0,j_{12}), (j_7,j_8), (j_9,j_{10})\} $$
  • \(\varphi _{ab}\): angle between the orientation \(\overrightarrow{\mathbf {o}_a}\) of \(j_a\) and the segment \(\overrightarrow{j_aj_b}\) connecting \(j_a\) to \(j_b\) (the segment can be thought of as the bone that interconnects the two joints). Angles \(\varphi _{ab}\) are computed for the following set of joint pairs:

    $$\begin{aligned} A_\varphi =\{(j_3,j_1), (j_3,j_4), (j_4,j_3), (j_4,j_{11}), (j_{11},j_4), (j_5,j_1), (j_5,j_6), (j_6,j_5), \end{aligned}$$
    $$\begin{aligned} (j_6,j_{12}), (j_{12},j_6), (j_2,j_7), (j_7,j_2), (j_7,j_8), (j_2,j_9), (j_9,j_2), (j_9,j_{10})\} \end{aligned}$$
  • \(\alpha _{bac}\): angle between the segment \(\overrightarrow{j_aj_b}\) connecting \(j_a\) to \(j_b\) and the segment \(\overrightarrow{j_aj_c}\) connecting \(j_a\) to \(j_c\). Angles \(\alpha _{bac}\) are computed for the following triplets of joints:

    $$\begin{aligned} A_\alpha =\{(j_2, j_7, j_8), (j_7, j_8, j_{13}), (j_2, j_9, j_{10}), (j_9, j_{10}, j_{14})\} \end{aligned}$$

We consider only a subset of the possible angles, mainly obtained from the joints of the upper part of the body, because not all the angles are really informative: for example, the angles between head and neck are almost constant over time and do not provide useful information for activity discrimination. Different configurations of angles are evaluated and compared in Sect. 4. Each frame \(f_i\) of a video sequence is then represented by a vector \(\mathbf {v}_i\) obtained as the ordered concatenation of the values of the angles \(\theta \) defined by \(A_\theta \), \(\varphi \) defined by \(A_\varphi \) and \(\alpha \) defined by \(A_\alpha \):

$$\begin{aligned} \mathbf {v}_i=\left( \theta _1,...,\theta _m, \varphi _1,...\varphi _n, \alpha _1,...,\alpha _s\right) \end{aligned}$$

of size \(m+n+s\), where \(m=|A_\theta |=8\), \(n=|A_\varphi |=16\) and \(s=|A_\alpha |=4\).
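For illustration, the following Python sketch (numpy only) shows how such a per-frame feature vector could be assembled from the joint positions and orientations; the array layout, the function names and the convention that the middle element of each triplet in \(A_\alpha \) is the vertex joint are our own assumptions and not part of the Kinect SDK.

```python
import numpy as np

def angle_between(u, v):
    """Angle (in radians) between two 3D vectors, clipped for numerical safety."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def frame_features(positions, orientations, A_theta, A_phi, A_alpha):
    """Encode one frame as the ordered concatenation (theta..., phi..., alpha...).

    positions:    (n, 3) array with the joint positions p_i
    orientations: (n, 3) array with the joint orientation vectors o_i
    A_theta: pairs (a, b)       -> angle between o_a and o_b
    A_phi:   pairs (a, b)       -> angle between o_a and the bone from j_a to j_b
    A_alpha: triplets (b, a, c) -> angle at j_a between the bones a->b and a->c
    """
    theta = [angle_between(orientations[a], orientations[b]) for a, b in A_theta]
    phi = [angle_between(orientations[a], positions[b] - positions[a]) for a, b in A_phi]
    alpha = [angle_between(positions[b] - positions[a], positions[c] - positions[a])
             for b, a, c in A_alpha]
    return np.concatenate([theta, phi, alpha])
```

With the 8 pairs in \(A_\theta \), the 16 pairs in \(A_\varphi \) and the 4 triplets in \(A_\alpha \), the resulting vector has the 28 components used in our experiments (Fig. 1b).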

Fig. 1.

(a) Representation of a subset of joints \(j_a = (p_a, \overrightarrow{o_a})\), \(j_b = (p_b, \overrightarrow{o_b})\) and \(j_c = (p_c, \overrightarrow{o_c})\) and related angles \(\theta \), \(\varphi \) and \(\alpha \). (b) The 28 angles used in our experiments computed from a skeleton configuration with 15 joints.

It is worth noting that the number of frames in each video sequence can be extremely high and certainly not all the resulting feature vectors are significant: the variation of the angles between two subsequent frames is minimal and usually unnoticeable. We therefore decided to adopt a Bag of Words model [21] with a two-fold objective: minimizing the representation of each sequence by keeping only the relevant information, and producing a fixed-length descriptor which can be used to train an activity classifier. The idea is to represent each activity as a histogram of occurrences of some reference postures (see Fig. 2 for a visual representation), derived from the analysis of the training set. A reference dictionary is first built by applying the k-means clustering algorithm [22] to the set of posture features extracted from the training sequences. Since some subjects could be left-handed, all the angle features are mirrored with respect to the x-axis. The dictionary should encode the basic postures assumed during the different actions in the training set and is used to represent each sequence as a histogram of occurrences of such basic elements. Given a set of training sequences \(TS=\left\{ S_i, i=1,..,d \right\} \), representative of the different actions, the k-means clustering algorithm is applied to the associated set of feature vectors \(FV\) to obtain a set of k clusters: the cluster centroids are used as the words of the reference dictionary \(W=\left\{ w_i, i=1,..,k \right\} \). The number of clusters k determines the size of the dictionary and is one of the most relevant parameters of the proposed approach. Each sequence is then encoded as a normalized histogram of occurrences of the words in W. Of course, the angle features are continuous values and an exact correspondence between the words in the dictionary and the descriptors is very unlikely; therefore, when computing the histogram each feature vector \(\mathbf {v}_i\) is associated to the closest word \(w_{j^*}\) in the dictionary: \(j^*= \arg \!\min _j || \mathbf {v}_i - w_j ||\).
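A minimal sketch of the dictionary construction and of the histogram encoding is given below; it relies on scikit-learn's KMeans, and the variable names (training_frames, frames) are placeholders for the per-frame angle vectors of the training sequences and of a single sequence respectively.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(training_frames, k=100, seed=0):
    """Cluster all per-frame angle vectors of the training set; the k centroids
    act as the words (reference postures) of the dictionary W."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    kmeans.fit(np.vstack(training_frames))   # training_frames: list of (T_i, D) arrays
    return kmeans.cluster_centers_           # shape (k, D)

def encode_sequence(frames, dictionary):
    """Represent one sequence (T, D) as a normalized histogram of closest words."""
    dists = np.linalg.norm(frames[:, None, :] - dictionary[None, :, :], axis=2)
    words = dists.argmin(axis=1)                        # closest word for each frame
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()                            # normalized histogram
```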

A Random Forest classifier [23] is trained to discriminate the different activities represented in the training set; the classifier consists of an ensemble of decision trees, each trained on a subset of the patterns and a subset of the features, and the final classification is obtained by combining the decisions of the individual trees.
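For illustration, a possible training and prediction step with scikit-learn is sketched below; X_train and X_test are assumed to hold one BoW histogram per sequence and y_train the corresponding activity labels (all placeholder names), and the number of trees is an arbitrary choice, not a value tuned in this work.

```python
from sklearn.ensemble import RandomForestClassifier

# X_train, X_test: (n_sequences, k) BoW histograms; y_train: activity labels
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)   # predicted activity for each test sequence
```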

Fig. 2.

Visual representation of a subset of key poses corresponding to some cluster centroids of the dictionary W.

4 Experiments

Several experiments have been conducted to evaluate the sensitivity of the proposed approach to its main parameters (i.e. the set of angles selected and the dictionary size). Despite the large number of existing benchmarks for activity recognition from skeleton information, joint orientations are generally not available. We used for testing the well-known CAD-60 dataset [15, 24], released by Cornell University, and a newly acquired dataset. CAD-60 contains 60 RGB-D videos where 4 different subjects (two male and two female, one left-handed) perform 12 daily activities in 5 environments (office, kitchen, bedroom, bathroom and living room). The authors of the benchmark propose two settings, named new person, where a leave-one-out cross-validation over the subjects is adopted, and have seen, where the training set includes data from all the subjects. We adopted the new person testing protocol, in accordance with all the related works in the literature, to allow for a comparison of the results. Moreover, analogously to other works, the recognition accuracy is measured separately for the different rooms.
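The new person protocol is a leave-one-subject-out cross-validation; a possible implementation is sketched below, assuming the BoW histograms X, the activity labels y and the subject identifiers subjects are already available as arrays (scikit-learn's LeaveOneGroupOut performs the rotation of the test subject).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def new_person_accuracy(X, y, subjects):
    """Train on all subjects but one and test on the left-out subject, in turn."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))   # accuracy per fold
    return float(np.mean(scores))
```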

4.1 Office Activity Dataset (OAD)

Due to the lack of datasets including information on joint orientations, we decided to acquire a new database of human activities to perform further tests. Data acquisition was carried out in a single environment (an office) from several perspectives, chosen according to the action being performed. From this point of view the benchmark is more complex than CAD-60, because all the activities need to be compared for activity recognition and the higher number of subjects increases the variability of each action. It contains 14 different activities: drinking, getting up, grabbing an object from the ground, pouring a drink, scrolling book pages, sitting, stacking items, taking objects from a shelf, talking on the phone, throwing something in the bin, waving hand, wearing a coat, working on computer, writing on paper. Data was collected from 10 different subjects (five males and five females) aged between 20 and 35, one of whom left-handed. The volunteers received only basic instructions (e.g. “pour yourself a drink”) in order to be as natural as possible while performing the actions. Each subject performed each activity twice, so we collected 280 sequences overall.

The device used for data acquisition is the Microsoft Kinect V2, whose SDK allows tracking 25 different joints (19 of which have their own orientation). For testing, we adopted the same “new person” setting used for the CAD-60 dataset: a leave-one-out cross-validation with rotation of the test subject. The set of angles used for testing the proposed approach is the same used for CAD-60. The dataset will be made available online on the Smart City Lab web site (http://smartcity.csr.unibo.it).

4.2 Results

Performance evaluation starts from the analysis of the confusion matrix M, where a generic element \(M\left( i,j\right) \) represents the percentage of patterns of class i classified by the system as belonging to class j. Further summary indicators can be derived from the confusion matrix; in particular, we computed precision P and recall R as follows:

$$\begin{aligned} P=\frac{TP}{TP+FP}, R=\frac{TP}{TP+FN} \end{aligned}$$

where TP, FP and FN denote respectively the True Positives, False Positives and False Negatives, which can be easily derived from the diagonal and extra-diagonal elements of the confusion matrix. In analogy to the proposal in [8], each video sequence is partitioned into three subsequences which are used independently in the tests. The results obtained are summarized in Fig. 3, where the Precision (P) and Recall (R) values are reported for different experimental settings, i.e. variable dictionary size (k) and three subsets of angles considered for the skeleton representation. In particular, the efficacy of the joint orientations is assessed by comparing the results of two different settings, 24 angles (\(\alpha \) angles omitted) and 28 angles, with those obtained using only \(A_\alpha \) angles, computed between all the existing pairs of neighboring segments (13 angles, where no joint orientation is used).

The results show that, overall, the accuracy of the proposed technique is good. As expected, the dictionary size has a significant impact on the performance; it is worth noting that different actions often have very similar postures (e.g. drinking and talking on the phone) and an excessively low value of k probably causes the reference postures of such activities to collapse into a single word, thus making it difficult to distinguish them correctly. On the other hand, a high value of k produces very sparse feature vectors, more sensitive to the presence of noise. The best results have been reached with \(k=100\), a value which also allows the classification task to be performed efficiently. The angle configuration is also important: the use of 28 angles produces better results, both in terms of precision and recall, than the version with 24 angles. The limited accuracy of the configuration with 13 angles, where the orientation is not exploited, confirms the effectiveness of joint orientations for an accurate posture representation. These results also show that the significance of the angles varies greatly and that a few strategic angles can greatly improve the recognition performance. As to the computational complexity, the proposed approach is very efficient, and all the angle configurations are suitable for real-time processing.
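As a small worked example of the formulas above, the following function derives the per-class precision and recall from a confusion matrix of raw counts, with rows indexing the true class and columns the predicted class, consistently with the definition of M (which Table 1 reports in percentage form).

```python
import numpy as np

def precision_recall(conf):
    """Per-class precision and recall from a confusion matrix of counts."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)              # correctly classified patterns (diagonal)
    fp = conf.sum(axis=0) - tp      # predicted as the class but belonging elsewhere
    fn = conf.sum(axis=1) - tp      # belonging to the class but predicted elsewhere
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    return precision, recall
```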

Fig. 3.

Precision (a) and recall (b) values on CAD-60 with different configurations of angles, as a function of the dictionary size (k).

Table 1. Confusion matrix using \(k = 100\) words and a configuration of 28 angles on CAD-60.
Table 2. Precision (P) and recall (R) of the proposed approach on CAD-60, compared to the results published in the benchmark website. “*” indicates that a different protocol was used.
Table 3. Precision (P) and Recall (R) values of the proposed approach for each activity on OAD.
Table 4. Confusion matrix using \(k = 100\) words and a configuration of 28 angles on OAD.

The confusion matrix reported in Table 1 allows us to analyze the main causes of errors. The mismatches that occur are all rather understandable, since they involve very similar activities (e.g. cooking-chopping vs. cooking-stirring). In these cases the skeleton information is probably too synthetic to discriminate between two actions which are very similar in terms of posture. A comparison with the state of the art is provided in Table 2, which summarizes the results published in the benchmark website. Despite the very good accuracy reached by different approaches in recent years, the proposed approach outperforms the existing methods, both in terms of precision and recall.

The results on the Office Activity Dataset are reported in Tables 3 and 4 for the standard configuration with 28 angles and \(k=100\). The overall results confirm that this benchmark is more difficult for several reasons: (i) the activities are not partitioned according to the room where they are performed, so the probability of misclassification increases; (ii) the number of subjects is higher and the variability in executing the actions increases proportionally. For instance, the worst results have been measured for the activity “throwing something in the bin”, which the different subjects executed very differently. Other mismatches occur between the activities “sitting” and “getting up”; in principle the reference postures of the two actions are similar, but their temporal ordering during execution is different, and probably the BoW representation adopted is not able to capture this aspect. In general, however, the good performance of the proposed approach is confirmed on this dataset as well.

5 Conclusions

A human activity recognition technique based on skeleton information has been proposed in this work. In particular, the effectiveness of joint orientations, typically neglected by the works in the literature, has been evaluated on different benchmarks. The efficacy of the proposal has been confirmed: the results obtained surpass the state of the art on the well-known CAD-60 benchmark, and good accuracy levels are also reached on the newly acquired OAD dataset. Future research will be devoted to the study of techniques able to couple the human posture information (encoded according to the model proposed here) with information from the surrounding environment (e.g. about interactions with objects or facial expressions), which would certainly increase the performance and enable a fine-grained classification of activities.