1 Introduction

The classification of human body movements is an interesting and still open problem. There are numerous common and specialized activities a human performs every day, and there are many reasons why the analysis of those actions is in the scope of interest of science and industry. For example, properly classified hand gestures or full-body movements can be used as a user interface for digital systems. Human actions during training can be interpreted by computer coaching systems. Pattern recognition methods can also detect potentially hazardous situations that might lead to violence. In the literature there are many approaches that can be used for hand gesture or body action recognition (see the state of the art in [11, 12]). Hidden Markov models (HMM), however, are among the most popular and reliable methods for those tasks.

1.1 State of the art on the application of HMM to gesture and action recognition

Discrete HMM, together with other pattern recognition techniques like fuzzy logic or neural networks (NN) [9, 10], have many applications in the classification of, for example, handwritten text [22], speech [25, 36] or gestures and actions [19, 29, 34]. The HMM classifier is often accompanied by another pattern recognition method that is used for discretization of the continuous signal. Those can be, for example, NN [20, 25] or Support Vector Machines [36].

Many up-to-date gesture and action recognition methods utilize multi-dimensional continuous HMM; the most commonly used distribution to describe the observation densities is the Gaussian one. The characteristics and dimensionality of the input signal differ depending on the data acquisition method. Moreover, a classifier very often requires pre-processing of the input data.

For example, in [19] the authors present glove-based hand gesture recognition. They decompose the input signal into a set of so-called hand strokes. A stroke can be a simple hand gesture, but usually two or three strokes constitute a more complex gesture. Once the discrete stroke HMMs have been trained for the defined strokes, hand gesture recognition can be accomplished by identifying a composition of such strokes.

In paper [34] a human activity recognition method is proposed which utilizes Independent Component Analysis (ICA) for activity shape information extraction from image sequences and a hidden Markov model for recognition. Various human activities are represented by shape feature vectors obtained from sequences of activity shape images via ICA. Based on these features, an HMM is trained for each activity, and recognition is achieved with the trained HMMs of the different activities.

Paper [13] introduces a model-based hand gesture recognition system which consists of three phases: feature extraction, training, and recognition. In the feature extraction phase, a hybrid technique combines the spatial (edge) and temporal (motion) information of each frame to extract the feature images. Then, in the training phase, the authors use principal component analysis (PCA) to characterize spatial shape variations and hidden Markov models to describe temporal shape variations. A modified Hausdorff distance measure is also applied to measure the similarity between the feature images and the pre-stored PCA models; the similarity measures are treated as the possible observations for each frame. Finally, in the recognition phase, with the pre-trained PCA models and HMM, the authors generate the observation patterns from the input sequences and then apply the Viterbi algorithm to identify the gesture.

The depth map-based human activity recognition system of [33] uses silhouette vector extraction and dimensionality reduction with PCA to generate input for continuous hidden Markov models.

In [18] a hierarchical human activity recognition (HAR) system is proposed to recognize abnormal activities among the daily life activities of elderly people living alone. The system is structured to have two levels of feature extraction and activity recognition. The first level consists of the R-transform, kernel discriminant analysis (KDA), the k-means algorithm and HMM to recognize the video activity. The second level consists of KDA, the k-means algorithm and HMM, and is selectively applied to the activities recognized at the first level when they belong to a specified group.

In [21] the authors propose a forward spotting scheme that executes gesture segmentation and recognition simultaneously. The start and end points of gestures are determined by zero crossings from negative to positive (or from positive to negative) of a competitive differential observation probability, which is defined as the difference in observation probability between the maximal gesture and the non-gesture. The authors also propose sliding window and accumulative HMMs.

In the method proposed in [17] a gesture is represented as an ordered sequence of poses in a spatiotemporal space. To estimate the human pose in a frame, all frames extracted from an image sequence are classified. The features of a frame are represented by the position values of three upper body parts: the head, the left hand, and the right hand. Further analysis is based on trajectory recognition.

Paper [14] introduces a multi-Principal-Distribution-Model (PDM) method and a hidden Markov model for gesture recognition. To track the hand shape, it uses the PDM model, which is built by learning patterns of variability from a training set of correctly annotated images.

The method for hand gesture interpretation in Sign Language [8] uses raw data captured with a color digital camera in video format. A SwisTrack implementation of the color separation and tracking method is then used to extract the 2D trajectories of the head and hands of the signer. The extracted trajectories are converted from Cartesian to polar coordinates in the first stage. Stationary head or hand components are not informative, hence they are removed in the second stage. At stage three, it is possible to distinguish gestures made with a single hand or with both hands, assuming the head remains relatively stationary. In stage four, the mean coordinate of the gesture is removed to reposition it at the origin. In stage five, the gesture is rotated and scaled to a neutral position using Singular Value Decomposition (SVD). Finally, in stage six, the gesture is re-sampled using Bezier curve techniques in order to obtain trajectories that are jitter-free, clean and smooth, with a standard sampling rate.

In paper [6] the authors introduce a hand gesture recognition system to recognize continuous gestures against a stationary background. The system consists of four modules: real-time hand tracking and extraction, feature extraction, hidden Markov model training, and gesture recognition. First, the authors apply a real-time hand tracking and extraction algorithm to trace the moving hand and extract the hand region; then they use the Fourier descriptor (FD) to characterize spatial features and motion analysis to characterize temporal features. They combine the spatial and temporal features of the input image sequence into a feature vector. After the feature vectors have been extracted, HMMs are applied to recognize the input gesture.

Paper [15] presents EasyGR (Easy Gesture Recognition), a tool based on machine learning algorithms that helps to reduce the effort involved in gesture recognition. The authors use joint-based user tracking. The input features are generated by normalizing the body position to make them invariant to camera position, and an HMM is used for recognition.

In [35] an HMM is proposed for various types of hand gesture recognition. In the preprocessing stage, this approach consists of three different procedures for hand localization, hand tracking and gesture spotting. The hand localization procedure detects hand candidate regions on the basis of skin color and motion. The hand tracking algorithm finds the centroids of the moving hand regions, connects them, and produces a hand trajectory.

In paper [4] the authors propose a method of modeling hand gestures based on the angles and angular change rates of the hand trajectories. Each hand motion trajectory is composed of a unique series of straight and curved segments. In the hidden Markov model implementation, these trajectories are modeled as a connected series of states, analogous to the series of phonemes in speech recognition.

In [24] the authors use a discrete HMM on derivatives of the tracked body joints of the shoulder, elbow and wrist. Those joints are used to calculate angle features of the left and right arm. Afterward, the gesture feature vector is clustered using the k-means clustering algorithm [4], with k set to 16.

Also [37] uses gesture grouping with a k-means algorithm before presenting the features to an HMM. The input signal is gathered with a three-dimensional finger-worn accelerometer.

In the work [26] a background modeling algorithm using fuzzy logic is used to detect foreground objects in outdoor video data. Three distinct features are extracted from the contours of the detected objects. A unique aggregated feature vector is formed using a fuzzy inference system by aggregating the three feature vectors. To minimize computation in recognition using the hidden Markov model, the length of the final feature vector is reduced using vector quantization.

1.2 Our motivation

As can be seen in the state of the art, there are many methods capable of human action analysis; however, most of those papers do not address motion capture data generated by the modern multimedia motion capture devices that are actually available on the market. Those devices (like, for example, Microsoft Kinect) generate a low frame rate (about 30 fps) and very noisy stream of tracking data; however, due to their low price, this type of data acquisition device has become a common component of home multimedia systems. Also, those types of devices require a different approach to feature selection and signal analysis than RGB or monochrome video images or super-precise wearable motion capture suits.

The goal of this paper is to propose a reliable way of feature selection for motion capture data that will enable further classification of user actions with an appropriate pattern recognition algorithm. The main novelty of this paper, besides presenting this approach, is the evaluation of results on a large motion capture dataset (containing 770 actions) of various gym exercises. The common problem with markerless vision-based motion capture systems is that if some parts of the body surface are covered by another part, it is impossible to perform accurate feature measurements. Knowing this, we have evaluated two types of feature sets: angle-based and coordinate-based. We have performed a PCA analysis of both feature sets to justify our choices. We have also tested 24 subsets of the proposed feature set in order to estimate which body joints are vital for correct action recognition. As classifiers we have evaluated HMMs with various numbers of hidden states and also a Gaussian mixture model.

Some of the already mentioned methods might be applied to motion capture (MoCap) data acquired by the Kinect controller; however, very often the data processing and training procedure contains many parameters that might be difficult to set up properly to obtain the best recognition results (like, for example, an arbitrary number of clusters [18], the type of kernel, etc.).

The choice of continuous HMM was motivated by the simplicity of its application to signals with Gaussian observation densities and by its high efficiency. As we will show, no other processing of the input signal besides the feature calculation presented in Section 2.1 is required to obtain a high recognition rate on our test dataset. For comparison, in [5] the authors use an action description called Histogram-of-Oriented-Velocity-Vectors. The algorithm first tracks the 3D positions of skeleton joints and extracts the velocity vector associated with each joint. The orientations of the velocity vectors are defined by α and β in a spherical coordinate system. The algorithm then forms a spatial histogram of these vectors by grouping them over orientations. To overcome the problem of actions which do not involve movement, the authors use a thresholding scheme to distinguish them from non-static actions. That method uses SVM and KNN classifiers. Our proposition based on angle-based features, similarly to [5], is also a scale-invariant, speed-invariant and length-invariant descriptor for human actions represented by 3D skeletons acquired by Kinect; however, it is far simpler to implement and does not require any arbitrary threshold that might strongly depend on the tracking hardware and software.

Paper [1] also presents an interesting approach to the action recognition task. The depth maps are processed to track human silhouettes by considering temporal continuity constraints of human motion information, and centroids are computed for each activity based on contour generation. For body joint features, depth silhouettes are first processed through geodesic distance to identify anatomical landmarks which produce joint point information from specific body parts. Then, body joints are processed to produce centroid distance features and key joint distance features. Finally, a Self-Organizing Map (SOM) is employed to train and recognize different human activities from the features. The mentioned paper proposes an original method for computing body joint features, while we utilize the Kinect SDK method [30]. Both methods, however, cannot overcome the problem of body occlusion, which can only be solved by using a multi-view camera system. In our paper we have evaluated the proposed HMM approach both on our angle-based features and on the centroid distance features from paper [1].

2 Methods

The problem we want to solve is the classification of gym warm-up and fitness exercises. The exercises were recorded in SKL format and stored in SKL files [27]. An SKL recording contains motion capture information: the spatial positions of points on the human body (so-called joints) that were tracked during the recording session. As the tracking device we used Microsoft Kinect (capture frequency 30 Hz); for raw-data segmentation and tracking we used the Microsoft Kinect 1.8 SDK (it generates 20 body joints), and to register the SKL recordings we used Gesture Description Language Studio v1.1 [27]. To solve the above task we utilized angle-based features and classification based on HMM with multivariate Gaussian distributions and also a Gaussian mixture model.

2.1 Feature selection and preprocessing

The role of feature selection is to change the original representation of the motion capture data, that is, the three-dimensional coordinates of 20 joints, into another one that is more convenient for further analysis. The original representation has two main drawbacks: it depends on the relative position of the user to the camera lens, and it is 60-dimensional (3 dimensions of the Cartesian frame * 20 joints). The dependence on the camera position virtually prevents the method from being usable in a real-world scenario. Also, the number of dimensions might affect the processing speed of the methodology, making it non-real-time. We have changed the position-based representation to an angle-based representation (see Fig. 1). In this notation the vertices of the angles are positioned either in body joints that are important for movement analysis (like the elbows - angles 1 and 2, shoulders - angles 3 and 4, knees - angles 6 and 7), or the angles measure the position of limbs relative to each other or to the torso. The second type of angles we utilized are the angle defined between the forearms (angle 5), the angles between the thighs and the vector defined by the joint between the shoulders and the joint between the hips (angles 8 and 9), and finally the angle between the thighs (angle 10).
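
To make the computation concrete, the following is a minimal sketch in C# (the language of our implementation) of how a single angle feature can be derived from three tracked joints; the Joint3 struct and the shoulder-elbow-wrist example are illustrative assumptions, not the actual Kinect SDK types:

```csharp
// Minimal sketch: one angle feature from three tracked joints.
// Joint3 is an illustrative stand-in for a tracked Kinect joint,
// not an actual Kinect SDK type.
using System;

public struct Joint3 { public double X, Y, Z; }

public static class AngleFeatures
{
    // Angle (in degrees) at vertex b, formed by the segments b->a and
    // b->c; e.g. a = shoulder, b = elbow, c = wrist corresponds to
    // angles 1 and 2 from Fig. 1.
    public static double AngleAt(Joint3 a, Joint3 b, Joint3 c)
    {
        double ux = a.X - b.X, uy = a.Y - b.Y, uz = a.Z - b.Z;
        double vx = c.X - b.X, vy = c.Y - b.Y, vz = c.Z - b.Z;
        double dot = ux * vx + uy * vy + uz * vz;
        double norms = Math.Sqrt(ux * ux + uy * uy + uz * uz)
                     * Math.Sqrt(vx * vx + vy * vy + vz * vz);
        // Clamp to [-1, 1] to guard against floating-point drift.
        double cos = Math.Max(-1.0, Math.Min(1.0, dot / norms));
        return Math.Acos(cos) * 180.0 / Math.PI;
    }
}
```

Because an angle depends only on the relative positions of the three joints, a feature computed this way does not change when the user moves with respect to the camera.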

Fig. 1 Visualization of the 24 feature sets, with tracked points marked as green circles and features (angles) marked with blue lines and red semi-circles. Each feature has its own unique index number (from 1 to 10)

After the feature selection the data is preprocessed. For each data sample we calculate the Z-score for each feature value:

$$ \bar{z}=\frac{\bar{x}-\mu_{\bar{x}}}{\sigma_{\bar{x}}} $$
(1)

where \( \bar{x} \) denotes the values of a single feature in a data sample, \( \mu_{\bar{x}} \) is the mean of the elements of \( \bar{x} \) and \( \sigma_{\bar{x}} \) is the standard deviation of the elements of \( \bar{x} \).
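
A direct transcription of Eq. (1) into C# might look as follows; this is a sketch which uses the population standard deviation and assumes each recording is given as an array of per-frame feature vectors with no constant feature:

```csharp
// Sketch of the per-feature Z-score normalization of Eq. (1). Each
// recording is an array of per-frame feature vectors; every feature
// column is shifted by its mean and scaled by its (population)
// standard deviation.
using System;
using System.Linq;

public static class Preprocessing
{
    // features[t][f]: value of feature f at frame t of one data sample.
    public static double[][] ZScore(double[][] features)
    {
        int frames = features.Length, dims = features[0].Length;
        var z = new double[frames][];
        for (int t = 0; t < frames; t++) z[t] = new double[dims];

        for (int f = 0; f < dims; f++)
        {
            double mean = features.Average(frame => frame[f]);
            double sd = Math.Sqrt(features.Average(
                frame => (frame[f] - mean) * (frame[f] - mean)));
            for (int t = 0; t < frames; t++)
                z[t][f] = (features[t][f] - mean) / sd;
        }
        return z;
    }
}
```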

The correctness of the angle estimation depends on the precision of data tracking. In the case of markerless vision-based tracking systems the precision is limited, especially when some body parts become invisible from the perspective of the camera lens. In this case it is impossible to measure the angle value correctly. Due to this fact it is important to investigate which features influence the recognition process the most. In the next paragraph we will investigate this influence on different subsets of the 10-element feature set defined above. All subsets are presented in Fig. 1.

The second type of features we investigated is the state-of-the-art joint-coordinate-based set ([1, 2, 7, 16]). In [1] the authors propose so-called centroid distance features (CDF) and key joint distance features (KDF). For CDF they compute the distance between the body joints and the mean centroid of each activity (15 values in total). For KDF they calculate the distance between a set of key joints (i.e., extreme body joints) and the other body joints (i.e., sub-extreme landmarks); there are 54 KDF (together with CDF there are 69 features). We have decided to adapt those features to our needs. We used only 12 CDF (see Fig. 2, left) because the Kinect SDK estimations of hand and foot positions seem to be noisy and therefore unreliable. The centroid point coordinate was replaced by the spine joint (with index zero in Fig. 2). We also did not use KDF because the 69-dimensional representation of movement seems sparse and redundant. We also calculated the Z-score (1) of each feature value.
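
A sketch of the adapted CDF computation, reusing the illustrative Joint3 struct from above (the 12 selected joints are passed in by the caller):

```csharp
// Sketch of the adapted centroid distance features: Euclidean distances
// between the spine joint (index 0 in Fig. 2) and the 12 other selected
// joints.
using System;

public static class CoordinateFeatures
{
    public static double[] CentroidDistances(Joint3 spine, Joint3[] joints)
    {
        var cdf = new double[joints.Length]; // 12 values in our setup
        for (int i = 0; i < joints.Length; i++)
        {
            double dx = joints[i].X - spine.X;
            double dy = joints[i].Y - spine.Y;
            double dz = joints[i].Z - spine.Z;
            cdf[i] = Math.Sqrt(dx * dx + dy * dy + dz * dz);
        }
        return cdf;
    }
}
```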

Fig. 2 On the left, visualization of the joint-coordinate-based feature set we used in our experiment; we compute the distance between the spine joint (index 0) and the other body joints (with indexes from 1 to 12). On the right, a three-state left-to-right HMM

2.2 Signal classification based on HMM with multivariate Gaussian distribution

A continuous HMM denoted as H = {S, p, A, B} can be expressed as follows [33]: \( S=\{S_1, S_2, \ldots, S_q\} \) indicates the states, where q is the number of states. The state of the model at time t can be expressed as \( \Omega_t \in S \), 1 ≤ t ≤ T, where T is the length of the observation sequence. The initial probability of the states p can be represented as:

$$ p=\left\{p_j\right\}, \quad \sum_{j=1}^{q} p_j=1 $$
(2)

The state transition probability matrix is denoted as A, where \( a_{i,j} \) denotes the probability of changing from state i to state j, i.e.:

$$ a_{i,j}=P\left(\Omega_{t+1}=S_j \mid \Omega_t=S_i\right), \quad 1\le i,j\le q $$
(3)
$$ \sum_{j=1}^{q} a_{i,j}=1, \quad 1\le i\le q $$
(4)

The observation probability matrix is denoted as B, where \( b_j(d) \) represents the probability of observing d in state j, which can be expressed as:

$$ b_j(d)=P\left(O_t=d \mid \Omega_t=S_j\right), \quad 1\le j\le q $$
(5)

In the case of a forward model (see Fig. 2, right), the initial state probabilities p are initialized as {1, 0, 0} if there are three states and the modeling procedure starts from the first state. The transition matrix can be initialized uniformly based on the connections between states:

$$ A=\begin{bmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & 0 & 1 \end{bmatrix} $$
(6)
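
The same initialization can be written for an arbitrary number of states; a short sketch generalizing Eq. (6):

```csharp
// Sketch generalizing Eq. (6): initial probabilities and a uniform
// forward-only transition matrix for an HMM with q states. Each state
// connects to itself and to all following states.
public static class HmmInit
{
    public static (double[] pi, double[,] A) ForwardTopology(int q)
    {
        var pi = new double[q];
        pi[0] = 1.0; // modeling always starts in the first state

        var A = new double[q, q];
        for (int i = 0; i < q; i++)
            for (int j = i; j < q; j++)
                A[i, j] = 1.0 / (q - i); // uniform over reachable states
        return (pi, A);
    }
}
```

For q = 3 this reproduces exactly the matrix of Eq. (6).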

In the continuous observation probability density function matrix B, the most commonly used distribution to describe the observation densities is the Gaussian one. To represent the continuous observation probability matrix, means and covariances are utilized, and weight coefficients are necessary when a mixture of probability density functions (pdf) is used. Thus, the observation probability of \( O_t \) at time t in state j can be represented as

$$ b_j\left(O_t\right)=\sum_{k=1}^{M} c_{j,k}\, b_{j,k}\left(O_t\right), \quad 1\le j\le q $$
(7)
$$ \sum_{k=1}^{M} c_{j,k}=1, \quad 1\le j\le q $$
(8)

where c represents the weighting coefficients, M the number of mixtures, and \( O_t \) the observation feature vector at time t. The Baum-Welch algorithm can be applied to estimate the HMM parameters [3]. In order to find the probability of an observed sequence, the forward algorithm might be used. It is also possible to create an HMM-based classifier that decides to which of n classes a given signal belongs. In this case the classifier is composed of n HMMs (the same number as the number of classes), and the parameters of each HMM are estimated on exemplars of signals from the corresponding class. When an unknown signal is examined, it is classified to the class whose HMM has the largest probability of producing the sequence.
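
The following condensed C# sketch shows the scaled forward algorithm and the resulting n-class decision rule; LogB(j, o) stands for \( \log b_j(O_t) \), which in our setting is a multivariate Gaussian log-density. The class and member names are ours, for illustration only:

```csharp
// Condensed sketch of the scaled forward algorithm and the n-class
// decision rule. LogB(j, o) stands for log b_j(O_t).
using System;

public class HmmModel
{
    public double[] Pi;                      // initial probabilities
    public double[,] A;                      // transition matrix
    public Func<int, double[], double> LogB; // log observation density

    public double LogLikelihood(double[][] sequence)
    {
        int q = Pi.Length;
        var alpha = new double[q];
        double logLik = 0.0;

        for (int t = 0; t < sequence.Length; t++)
        {
            var next = new double[q];
            for (int j = 0; j < q; j++)
            {
                double s = 0.0;
                if (t == 0) s = Pi[j];
                else for (int i = 0; i < q; i++) s += alpha[i] * A[i, j];
                next[j] = s * Math.Exp(LogB(j, sequence[t]));
            }
            double scale = 0.0;
            foreach (double v in next) scale += v;
            logLik += Math.Log(scale);  // accumulate scaling factors
            for (int j = 0; j < q; j++) alpha[j] = next[j] / scale;
        }
        return logLik;
    }

    // One model per class: the class whose HMM assigns the largest
    // log-likelihood to the sequence wins.
    public static int Classify(HmmModel[] models, double[][] sequence)
    {
        int best = 0; double bestLL = double.NegativeInfinity;
        for (int c = 0; c < models.Length; c++)
        {
            double ll = models[c].LogLikelihood(sequence);
            if (ll > bestLL) { bestLL = ll; best = c; }
        }
        return best;
    }
}
```

Per-step scaling keeps the forward variables in a numerically safe range, which matters for long sequences of high-dimensional Gaussian observations.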

3 Experiment and results

The dataset we used in our experiment consists of SKL recordings for 14 participants, 4 women (W1-W4) and 10 men (M1-M10); W means a woman, M a man, and the numbers define the id of a participant. The exercises that were performed were: body weight lunge left (bwll), body weight lunge right (bwlr), body weight squat (bws), dumbbell bicep curl (dbc), jumping jacks (jj), side lunges left (sll), side lunges right (slr), standing dumbbell upright row (sdur), triceps dumbbell kickback (tdk). In Fig. 3 we present the body joint sets (so-called skeletons) that represent important phases of each physical activity in our dataset. We selected those nine actions because they were among the exercises that we were capable of registering on our multimedia hardware owing to the trajectories of the movements (nearly all tracked joints are visible during the whole activity). Also, those exercises did not require equipment that might disturb the process of body tracking. Quite often proper equipment handling is a crucial part of an exercise's safety, and we did not want to incorporate into our dataset actions that might be potentially harmful to the experiment participants.

Fig. 3 Body joint sets (skeletons) representing important phases of each physical activity in our dataset

In Table 1 we present the quantities of actions of a given type that were present in our dataset. As can be seen, not every person performed each action, and the numbers of repetitions are not equal. That is because the recordings were made over a certain period of time and not all users were asked to perform all gestures (for example, in four recordings bws was skipped). Those gaps were then filled with recordings from four other persons. Each person was asked to perform each exercise as many times as he or she was capable of, but not more than ten times (in order not to get too tired). Some people performed those exercises more than ten times (for example M1). On the other hand, many participants were getting tired more quickly, and we decided to reduce the number of repetitions to 5 of each type. For example, participant M4 after performing slr was not capable of performing sll correctly. The test set was partitioned into 10 groups of recordings (the index number of the test set is presented in the second column of Table 1). The criterion of division was that each set consists of SKL recordings from a single person; the movement samples of this person are not present in the other sets. In the case of users who did not perform a particular action (bws for M3, M4, M5 and M8), those sets were complemented with recordings from W3, W4, M6 and M9.

Table 1 This table presents the test dataset we used in our experiment. We used recordings of 14 persons, both men and women (M and W respectively); this partitioning is shown in the first column. Those recordings were used to create 10 test datasets for K-fold cross validation; the division into test datasets is shown in the second column. Each test dataset except one (the sixth) contains each type of examined action. The other columns contain the quantities of actions (exercises) of a given type that were performed by a particular person

The evaluation goes as follows: each HMM classifier, composed of q-state HMMs, was trained to recognize the actions presented in Table 1 (each HMM in the classifier was trained to recognize a single action). We utilized the Baum-Welch learning method. The classifier was trained on nine of the ten datasets and then evaluated on the remaining one, which was the test dataset. Each action recording from the test dataset was classified to the class whose HMM had the largest probability of producing the sequence. The HMM classifiers were composed of nine HMMs (the same number as classes of actions) with q = 1, 2, 3 and then 4 states. A 1-state HMM is basically just a Gaussian mixture model (GMM).

We have implemented our approach in C#, utilizing HMM libraries from the Accord Framework [31]. Some parts of the experiment were evaluated with the FactoMineR package [23]. The evaluation of the methodology was performed on the dataset presented in Table 1. We performed K-fold cross validation with K = 10. After evaluating each action from a test dataset, we trained the classifier on another nine of the ten datasets, until each dataset had been considered as the test dataset. The above procedure was repeated for all feature sets presented in Fig. 1.
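
For orientation, a sketch of how one class model can be set up and trained with the Accord Framework; it reflects the 3.x API as we recall it, so the exact class and method names should be treated as an assumption and may differ between framework versions:

```csharp
// Hedged sketch against the Accord.NET 3.x API (class and method names
// may differ between framework versions): one forward-only HMM with
// multivariate Gaussian emissions per action class, trained with
// Baum-Welch on the 10-dimensional angle-based feature sequences.
using Accord.Statistics.Models.Markov;
using Accord.Statistics.Models.Markov.Learning;
using Accord.Statistics.Models.Markov.Topology;
using Accord.Statistics.Distributions.Multivariate;
using Accord.Statistics.Distributions.Fitting;

// trainSequences: double[][][] - the training sequences of one class,
// each a series of 10-dimensional feature vectors.
var hmm = new HiddenMarkovModel<MultivariateNormalDistribution>(
    new Forward(4),                           // forward-only, 4 states
    new MultivariateNormalDistribution(10));  // Gaussian emissions

var teacher = new BaumWelchLearning<MultivariateNormalDistribution>(hmm)
{
    Tolerance = 1e-4,
    Iterations = 100, // cap on Baum-Welch iterations
    // keep the estimated covariances well-conditioned
    FittingOptions = new NormalOptions { Regularization = 1e-6 }
};
teacher.Run(trainSequences); // newer Accord versions use Learn instead

// At test time, each class model scores the sequence and the class with
// the largest log-likelihood wins (Evaluate returns the log-likelihood).
double logLikelihood = hmm.Evaluate(testSequence);
```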

We have investigated the angle-based and coordinate-based features. We have also taken into account feature sets derived from both feature sets with PCA, with the dimensionality reduced to between 3 and 7. We did that because we wanted to investigate the correlation between the principal components and the features in order to find the subsets most relevant for correct recognition. The results of our evaluation are presented in Table 2 and visualized in Fig. 4. The recognition rate (RR) is the value of RR (the percentage of correctly recognized actions among actions of a given type) averaged over all 10 considered test sets, plus/minus standard deviation. Bold font indicates the settings where the averaged RR is equal to or above 95 %. We do not present the evaluation of HMMs with more than 4 states because we did not notice any significant increase of RR above this value.

Table 2 This table presents results of K-fold cross validation on various forward-only HMMs. Each column corresponds to an HMM with a different number of hidden states. Each row corresponds to a different feature set: Angle means angle-based, Coordinate means coordinate-based. PCA stands for the number of principal components of the feature set that was used for classification (all means all original features). Values in this table are recognition rates averaged over all action types +/− standard deviation
Fig. 4 Visualization of the data from Table 2, comparing recognition rates between classifiers. Colored bars correspond to the number of states in the HMM classifiers; the vertical axis shows the averaged recognition rate of a classifier, and the horizontal axis corresponds to the different feature sets

In Table 3 and Fig. 5 we present the percentage of overall variance for the i-th principal component of the angle-based and coordinate-based features. This parameter can be calculated as:

$$ Var_i=\frac{\lambda_i}{\sum_{j=1}^{r}\lambda_j}\cdot 100\,\% $$
Table 3 This table presents the percentage of overall variance for the i-th principal component of the angle-based and coordinate-based features
Fig. 5 Visualization of the results from Table 3. The vertical axis shows the percentage of variance of a particular PC; each bar corresponds to a different PC

where \( \lambda_i \) is the eigenvalue corresponding to the i-th principal component and r is the total number of components.
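
In code, once the eigenvalues of the covariance matrix of the standardized features are available, the percentages follow directly from the formula above (a trivial sketch):

```csharp
// Direct transcription of the formula above: the share of the i-th
// principal component is its eigenvalue divided by the sum of all r
// eigenvalues, expressed as a percentage.
using System.Linq;

public static class PcaVariance
{
    public static double[] ExplainedVariancePercent(double[] eigenvalues)
    {
        double total = eigenvalues.Sum();
        return eigenvalues.Select(l => 100.0 * l / total).ToArray();
    }
}
```

For our angle-based features this yields about 34 % for the first component and 16 % for the second (see Table 3).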

In further analysis we do not take the coordinate-based features into account because we obtained better RR results with the angle-based set.

In Table 4 and Fig. 6 we present the correlations of the angle-based features and the projection of the angle-based features onto the two-dimensional PC space.

Table 4 This table presents the correlations of the angle-based features with the first PCs, ordered by the corresponding eigenvalue. In brackets in the first column are the indexes of the angles from Fig. 1
Fig. 6 Projection of the angle-based features onto the two-dimensional PC space. The data is from Tables 3 and 4

Figure 6 visualizes the projection of the angle-based features onto the two-dimensional PC space. The PCA supplies us with information about the correlation between features. The arrows are determined by the coordinates of the variables in the PC frame. Dim 1 contains 34 % of the overall variance while Dim 2 contains 16 % (see Tables 3 and 4).

We present the evaluation results for the angle-based feature subsets in Table 5, and detailed per-action results of the 4-state HMM classifier in Table 6. We decided to limit the presentation of the GMM / HMM evaluations on the proposed combinations of feature sets in Tables 2, 3, 4 and 5 to the overall recognition rate because our intention was to select the configuration with the highest RR. A detailed presentation as in Table 6 would require much more space and would not be much more informative for the reader.

Table 5 This table presents results of K-fold cross validation on various forward-only HMMs. Each column corresponds to an HMM with a different number of hidden states. Each row corresponds to a different angle-based feature set (see Fig. 1). Values in this table are recognition rates averaged over all action types +/− standard deviation
Table 6 This table presents more detailed results of K-fold validation on the 4-state forward-only HMM. Each column corresponds to a different action. Each row corresponds to a different angle-based feature set (see Fig. 1). Values in this table are 10-fold cross validation results +/− standard deviation

In Table 5 the RR for each classifier with a given feature set is averaged over all actions and over all 10 considered test sets, plus/minus standard deviation. Due to this, the recognition rate of a classifier is represented by a single cell in Table 5. Bold font indicates the settings where the averaged RR is equal to or above 95 %. We do not present the evaluation of HMMs with more than 4 states because we did not notice any significant increase of RR above this value (Figs. 7 and 8).

Fig. 7 Visualization of the data from Table 5, comparing recognition rates between classifiers. Colored bars correspond to the number of states in the HMM classifiers; the vertical axis shows the averaged recognition rate of a classifier, and the horizontal axis corresponds to the different angle-based feature sets (see Fig. 1)

Fig. 8 Visualization of the results from Table 5, showing averaged recognition rates (blue bars) and standard deviations (black bars). Each plot corresponds to a different forward-only HMM classifier; the vertical axis shows the averaged recognition rate, and the horizontal axis corresponds to the different angle-based feature sets (see Fig. 1)

4 Discussion

Our evaluation results in Table 2 and Fig. 4 show that, on our dataset, angle-based features guarantee a better RR than coordinate-based features. Only in two cases (HMMs with 3 and 4 hidden states) did the recognition rate of coordinate-based features reach a value greater than or equal to 95 %, while for angle-based features this happened in 8 cases. When projecting the dataset to a lower dimension with PCA, the angle-based features always had a better RR than the coordinate-based ones. However, the overall RR when using all features without projection into a lower dimension does not differ much (8 % for the 1-state GMM, 2 % for the 2-state HMM, 0 % for the 3-state HMM and 2 % for 4 states). That means that the multivariate continuous hidden Markov model classifier is a convenient method of action recognition. Also, fewer angle-based features are required to obtain a better RR. The percentage of overall variance for the i-th principal component (Table 3 and Fig. 5) is similar for both feature sets; the first four components carry over 70 % of the overall variance. Table 4 and Fig. 6 show that the BetweenWrists feature (number 5 in Fig. 1) and BetweenLegs (number 10 in Fig. 1) have the largest influence on Dim 1 and Dim 2 respectively. This is also confirmed by the evaluation presented in Table 5, where most of the feature sets containing those two features (namely 1, 15, 16, 17, 19, 20 and 24) have an RR over 90 %.

As can be seen in Table 5, 15 examined classifier settings obtained an RR greater than or equal to 95 %. If we select all ten angle-based features, even the GMM is capable of correctly identifying most (96 ± 17 %) of the actions. However, when the number of features is lowered, the RR of the GMM drops more quickly than that of the other considered classifiers. The highest RR, 97 ± 14 %, was obtained for the 4-state HMM with 10 features (see Tables 5 and 6). The most difficult to classify were actions that required whole-body movements and in which some body parts were covered by others from the perspective of our tracking hardware. In those cases the data processing software approximated the positions of the body joints that were not visible to the camera lens. Those approximations, however, very often did not match the real joint positions, which affected the subsequent feature calculation. In those cases we could not overcome this phenomenon by increasing the number of features, because their calculation was based on wrongly tracked body joints. The most difficult to classify were mainly sll and slr; they were the only actions for which the averaged RR of the 4-state HMM classifier dropped below 90 %. Sadly, because those errors were caused by limitations of the tracking hardware, they cannot be easily overcome.

The quite obvious observation is that the more features were used to describe the movement, the better the recognition results obtained. Quite surprisingly, when only two features were used for recognition (see feature sets 11, 12 and 13), all classifiers besides the GMM were capable of recognizing nearly or over 50 % of the actions. The knee joints seemed to be more important for correct recognition than the elbow joints. The bigger impact of leg-based features over hand-based ones on the averaged RR can also be observed in feature sets 6 and 7.

A very important observation is that it is possible to obtain an averaged RR at the level of 95 % for HMMs with more than 1 state without using the features based on joints positioned in the torso of the observed user (see feature set 24). Also, the RR of feature set 24 does not differ much (1 to 2 %) from the best results of feature set 1. This is very important information because during physical activities the torso is very often covered from the camera's perspective by the limbs. We can utilize this knowledge when designing a data acquisition setup for action recognition and track only the body joints positioned on the limbs (hands and legs).

As can be seen, the highest RR was achieved for nearly all feature sets with the 4-state HMM; however, the difference between the 4-, 3- and 2-state HMMs is not large, about 1 or 2 %. Also, the standard deviations of those results are nearly identical: they did not differ by more than 4 % (in the case of feature set 14) and more often differed by 1 to 2 %. That means that the classifiers react in a very similar way to lowering/altering the observed features. The implementation of the forward algorithm in the case of all considered HMM classifiers enables real-time performance (with a frequency above 30 Hz) on our dataset. Knowing that, we can safely assume that a forward-only 4-state HMM-based classifier is suitable for solving our classification problem.

5 Conclusion

In this paper we describe how to create an HMM-based classifier that is capable of real-time recognition of gym actions. What is more, we have shown that this classifier can operate on the low-quality motion capture data generated by multimedia hardware. The feature sets we used are invariant to the relative position of the user to the camera lens. With the angle-based features we proposed, it is even possible to obtain a high RR with a simple GMM classifier. The experimental results we observed on the angle-based feature subsets were confirmed by the PCA analysis: the features that have more variance (that influence the "most relevant" PCs the most) are also present in the subsets with the highest RR.

The methodology and results presented in this paper are significant from both a practical and a theoretical point of view. Practically, our methodology can be directly applied in a multimedia computer coaching application. The goal of such a program is to oversee gym warm-up and training by checking if a person follows a workout routine, to show his or her progress and to motivate further efforts. However, we have to remember that it is very difficult or nearly impossible to apply the proposed methodology to evaluate the quality (similarity to an "ideal pattern") of gym exercises when using only multimedia hardware. We believe that qualitative evaluation will become possible after applying a professional motion capture suit, especially as professional hardware produces a very similar set of body joints to the Kinect.

The results of our work also indicate which angle-based feature sets are most important for the correct recognition of body actions, that is, which parts of the human body should be captured in order to optimize the RR of the observed actions. Those findings might then be applied, for example, when designing an outdoor real-time system which detects potentially hazardous situations that might lead to violence. This type of solution is an extension of street monitoring systems. With our results, it is possible to optimize the positions of cameras to enable monitoring of those parts of the human figure that are most important for further behavior classification. Knowing our findings, it will also be possible to limit the number of sensors on a motion capture suit if the intention is to use this type of hardware for action recognition and quality evaluation. The optimization of the positions and the limitation of the number of sensors significantly lower the cost of the hardware, because motion capture sensors are a relatively expensive part of the equipment.

The goal of our further research will be to utilize the results presented in this paper in the two already mentioned applications, that is, outdoor real-time hazardous situation detection and high-quality body action analysis (especially in sport). Based on the latest achievements in web-based visual data computation and high-quality visualization [28, 32], we believe that there are no technological or methodological obstacles to achieving our scientific goals and introducing human action recognition even to low-power mobile devices.