Human actions recognition from motion capture recordings using signal resampling and pattern recognition methods

In this paper we experimentally show that motion capture (MoCap) data, once recalculated to a position-invariant representation, can be used directly by a classifier to successfully recognize various types of actions. The only assumption about the classifier is that it is capable of dealing with objects described by hundreds of numeric values. The second novelty of this paper is the application of a neural network trained with parallel stochastic gradient descent, Random Forests and a Support Vector Machine with a Gaussian radial basis kernel to the classification of gym exercises and karate techniques MoCap datasets. We tested our approach on two datasets using k-fold cross-validation. Depending on the dataset, we obtained an averaged recognition rate between 97 and 100 %. The results presented in this work give important hints for developing similar action recognition systems, because the proposed feature selection and classification setup appears to deliver high efficiency and effectiveness.


Introduction
Human action recognition is a challenging and topical problem that appears in many practical applications such as computer games, security monitoring or smart home technologies. In this section we present a review of the state of the art in action recognition methods and our motivation for writing this paper.

State-of-the-art in actions recognition
Nearly every action recognition framework proposed in the literature introduces its own feature selection method. Neural networks (NN) are among the pattern recognition methods most commonly reported for action recognition and human pose estimation Jiu et al. (2012), Li et al. (2015), Charalampous and Gasteratos (2014). Li et al. (2014) propose a framework that combines Fast HOG3D description with a self-organization feature map (SOM) network for action recognition from unconstrained videos, bypassing demanding preprocessing such as human detection, tracking or contour extraction. Support vector machines (SVM) are another supervised classification method used for action recognition Liu et al. (2013a), Díaz-Más et al. (2012), Mahbub et al. (2014), Shen et al. (2015), Cao et al. (2014), Ji et al. (2014), Bilen et al. (2014), Omidyeganeh et al. (2013), Nasiri et al. (2014), Zhen et al. (2014), Wu et al. (2014). A different class of pattern classification methods designed for action recognition uses rule-based descriptions and reasoning modules. Among those is the Gesture Description Language (GDL) Hachaj and Ogiela (2014), which uses unsupervised R-GDL training Ogiela (2014, 2015a, b) for automatic rule generation. GDL can also be used as an online video segmentation method that prepares the input signal for other classification methods such as the hidden Markov model (HMM) Hachaj et al. (2015a, b). The methodology proposed by Rincón et al. (2013) is decomposed into two stages. First, a bag-of-words model gives a first estimate of the action class from video sequences by performing an image feature analysis. Those results are afterwards passed to a common-sense reasoning system, which analyses, selects and corrects the initial estimate yielded by the machine learning algorithm. This second stage resorts to the knowledge implicit in the rationality that motivates human behavior.
Some action classification tasks can be solved with a simple naive Bayes nearest-neighbor method Liu et al. (2013b), Yang and Tian (2014). The random forests (RF) approach is a popular method for segmentation and recognition of actions Zhu et al. (2013), Jiang et al. (2013), Saito and Nishiyama (2015), Liu et al. (2014), Burghouts et al. (2014), Burghouts et al. (2013), Chen and Guo (2015), Jiang et al. (2013). SVM and RF are very flexible approaches that have many important applications and can operate on objects described by various feature sets Fan and Chaovalitwongse (2010), Yahav and Shmueli (2014). Features and feature selection methods often applied to human action recognition include optical flow Liu et al. (2013b), Mahbub et al. (2014), Jiang et al. (2013), Liu et al. (2014), various dimensionality reduction techniques such as PCA, 2D-PCA and LDA Díaz-Más et al. (2012), the bag-of-words framework Shen et al. (2015), Cao et al. (2014), Nasiri et al. (2014), Burghouts et al. (2013), probability-distribution-based features Ji et al. (2014) and the 3D wavelet transform Omidyeganeh et al. (2013). There are also a number of pattern recognition methods that are less commonly used in human action recognition tasks. These include regularized multi-task learning Guo and Chen (2015), the multivariate continuous hidden Markov model classifier used to model actions in Ogiela (2014, 2015a, b), and dynamic time warping and canonical time warping Vrigkas et al. (2014). In Jiang et al. (2015) and Liu et al. (2015) feature sets are evaluated using linear Conditional Random Fields (CRFs). Pazhoumand-Dar et al. (2015) use the longest common subsequence (LCSS) algorithm to assign an action, represented by features derived from body joints, to the proper class.
A review of recent developments in deep learning and unsupervised feature learning for time-series problems can be found in Längkvist et al. (2014), while Ziaeefard and Bergevin (2015) present an overview of state-of-the-art methods in activity recognition using semantic features.

Our motivation for writing this paper
As the above review shows, one of the most challenging stages of action recognition is the selection of appropriate features that capture the characteristics of movements in a video sequence. However, current multimedia depth cameras, for example Kinect controllers, enable relatively cheap registration of a video stream that can then be used to extract the human posture as a so-called skeleton. This approach is marker-less MoCap. There are a number of methods capable of this type of extraction and of body joint tracking Papadopoulos et al. (2014), Shotton et al. (2013), Coleca et al. (2013). The tracked features, which consist of so-called body joints, are a valuable source of information that does not require much further processing before being used by a classifier. State-of-the-art papers, however, even when dealing with skeleton data, process it with additional methods, making the output data dependent on many additional parameters. The values of those parameters often depend on the processing model and might differ between the actions to which we want to apply them. In fact, the feature set that describes an action has one crucial requirement: it has to be invariant to the position of the observed user relative to the camera. In this paper we experimentally show that MoCap data, once recalculated to a position-invariant representation, can be used directly by a classifier to successfully recognize various types of actions. The only assumption about the classifier is that it is capable of dealing with objects described by hundreds of numeric values. For example, a current implementation of the parallel stochastic gradient descent training method Recht et al. (2011) makes it possible to train relatively quickly an NN that depends on hundreds of thousands of synaptic weights.
The second novelty of this paper is the application of an NN trained with parallel stochastic gradient descent, Random Forests and an SVM with a Gaussian radial basis kernel to the classification of gym exercises and karate techniques MoCap datasets. The original MoCap data, which consists of 20 or 25 time-varying three-dimensional body joint coordinates acquired with a Kinect or Kinect 2 controller, respectively, is preprocessed into a 9-dimensional angle-based time-varying feature set or into a 15-dimensional or 16-dimensional distance-based feature set. The data is resampled to a uniform length with cubic spline interpolation, after which each action is represented by 60 samples, and eventually 540 (60 × 9), 900 (60 × 15) or 960 (60 × 16)-dimensional vectors are presented to the classifier. We tested our approach on two datasets using k-fold cross-validation. The first dataset, introduced in Hachaj and Ogiela (2015a), consists of recordings of 14 participants who perform nine types of popular gym exercises (770 action samples in total). The second dataset is an extended version of one introduced in our earlier work. It consists of recordings of 6 participants who perform sixteen types of karate techniques (1996 action samples in total). In the following sections we present the datasets used in our experiment, the feature selection methodology and the classification methods. We then discuss the obtained results and present goals for future research.

Material and methods
In this section we present the datasets, the feature selection procedure and the classifiers used in our experiment.

Dataset and features selection
The launch of Microsoft Kinect with its skeleton tracking technique opened up new potential for skeleton-based human action recognition. However, the 3D human skeletons generated via skeleton tracking from depth map sequences are generally very noisy and unreliable, which makes action recognition a challenging task Jiang et al. (2015). Despite the fact that Kinect was initially designed to be a game controller, its potential as a cheap general-purpose depth camera was quickly noticed.
To gather the datasets for evaluating the proposed methodology we used Microsoft Kinect v1 for the gym exercises dataset and Microsoft Kinect v2 for the karate techniques dataset. The datasets were prepared using different hardware because Kinect v2 was not yet available when the gym dataset was recorded. According to research, the Kinect v2 controller with the Kinect v2 SDK is capable of generating more reliable data for classification than Kinect v1, so the second dataset was recorded using the newer hardware. The Kinect SDK software library for Kinect v1 is capable of segmenting and tracking 20 joints on the human body with an acquisition frequency of 30 Hz, while the SDK for Kinect v2 segments and tracks 25 joints with the same frequency. The tracking is a marker-less procedure. We used those joints to produce a representation of an action that is invariant to camera position, because dependence on the camera position virtually prevents a method from being usable in a real-world scenario. Our angle-based representation (Fig. 1a) uses two types of angles. In the first type, the vertices of the angles are positioned in body joints that are important for movement analysis (elbows: angles 1 and 2; shoulders: angles 3 and 4; knees: angles 6 and 7). The second type measures the position of limbs relative to each other or relative to the torso: the angle defined between the forearms (angle 5) and the angles between the thighs and the vector from the joint between the shoulders to the joint between the hips (angles 8 and 9). The same representation was used for both the Kinect v1 and Kinect v2 datasets. This subset of all possible angles was among the subsets considered in earlier work, for which an HMM obtained a high recognition rate. The second and third feature sets were defined as Euclidean distances between a central joint (in Fig. 1b, c it is the "spine" joint with index 0) and 15 other joints in Fig. 1b and 16 other joints in Fig. 1c.
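The two kinds of features described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation; the function names and the joint coordinates are hypothetical, and only the general idea (an angle at a joint, and a Euclidean distance to a central joint) is taken from the text. The last lines check that both quantities are unchanged when every joint is shifted by the same offset, which is the position invariance the representation relies on.

```python
import math

def angle_at_joint(a, vertex, b):
    """Angle (radians) at `vertex` between the segments vertex->a and vertex->b."""
    u = [p - q for p, q in zip(a, vertex)]
    v = [p - q for p, q in zip(b, vertex)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norms = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def distance_to_centre(joint, centre):
    """Euclidean distance between a joint and the central ("spine") joint."""
    return math.dist(joint, centre)

# Hypothetical joint positions (metres, camera coordinates) for one frame.
shoulder, elbow, wrist = (0.0, 1.4, 2.0), (0.3, 1.4, 2.0), (0.3, 1.1, 2.0)
spine = (0.0, 1.0, 2.0)

elbow_angle = angle_at_joint(shoulder, elbow, wrist)
wrist_distance = distance_to_centre(wrist, spine)

def translate(point, offset=(1.5, -0.2, 0.7)):
    """Shift a joint by a fixed offset, as if the user stood elsewhere."""
    return tuple(p + o for p, o in zip(point, offset))

shifted_angle = angle_at_joint(translate(shoulder), translate(elbow), translate(wrist))
shifted_distance = distance_to_centre(translate(wrist), translate(spine))
```

Both feature types depend only on relative joint positions, so they are also unaffected by a rotation of the camera frame; the sketch tests translation only.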
The joints we used are nearly all the joints from the Kinect SDK, except for the feet and hand joints, which we skipped due to the high inaccuracy of tracking of those body parts. The above joint representations were calculated for all frames of the acquired action recordings. In the next step the data is resampled to a uniform length with cubic spline interpolation, after which each action is represented by a vector of the same size. The uniform length we chose was 60 frames per recording, which was the smallest number of frames present among all action recordings in both datasets. After this operation the gym exercises dataset was represented by 540 variables (60 × 9, see Fig. 1a) or by 900 variables (60 × 15, see Fig. 1b). The karate techniques dataset was also represented by 540 variables (60 × 9) or by 960 variables (60 × 16, see Fig. 1c). All those feature sets were evaluated separately in our experimental setup.
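The resampling step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the text says only "cubic spline interpolation", so the natural boundary condition (zero second derivative at both ends), the uniform frame spacing, and the channel-by-channel ordering of the final vector are all assumptions, and the function names are hypothetical.

```python
import math

def natural_spline_second_derivatives(y):
    """Second derivatives of a natural cubic spline through y at knots 0..n-1."""
    n = len(y)
    m = [0.0] * n                      # natural spline: m[0] = m[n-1] = 0
    if n < 3:
        return m
    size = n - 2
    # Interior equations: m[i-1] + 4*m[i] + m[i+1] = 6*(y[i-1] - 2*y[i] + y[i+1])
    rhs = [6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1]) for i in range(1, n - 1)]
    c_prime, d_prime = [0.0] * size, [0.0] * size
    c_prime[0], d_prime[0] = 1.0 / 4.0, rhs[0] / 4.0
    for k in range(1, size):           # forward sweep of the Thomas algorithm
        denom = 4.0 - c_prime[k - 1]
        c_prime[k] = 1.0 / denom
        d_prime[k] = (rhs[k] - d_prime[k - 1]) / denom
    for k in range(size - 1, -1, -1):  # back substitution
        m[k + 1] = d_prime[k] - c_prime[k] * m[k + 2]
    return m

def resample(y, target_len=60):
    """Resample one feature channel to target_len samples (target_len >= 2)."""
    n = len(y)
    if n == 1:
        return [float(y[0])] * target_len
    m = natural_spline_second_derivatives(y)
    out = []
    for j in range(target_len):
        x = j * (n - 1) / (target_len - 1)   # position on the original time axis
        i = min(int(x), n - 2)
        t = x - i
        out.append(m[i] * (1 - t) ** 3 / 6 + m[i + 1] * t ** 3 / 6
                   + (y[i] - m[i] / 6) * (1 - t) + (y[i + 1] - m[i + 1] / 6) * t)
    return out

def action_feature_vector(channels, target_len=60):
    """Concatenate the resampled channels into one vector (ordering assumed)."""
    vector = []
    for channel in channels:
        vector.extend(resample(channel, target_len))
    return vector

# Hypothetical recording: 95 frames of 9 angle channels.
channels = [[math.sin(0.1 * f + c) for f in range(95)] for c in range(9)]
features = action_feature_vector(channels)   # 60 x 9 = 540 values
ramp = resample(list(range(90)), 60)         # a straight line stays a straight line
```

Because 60 was the shortest recording, every other recording is shortened by this step, and the first and last frames of each channel are kept exactly.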

Neural network implementation
In our experiment we used multi-layer, feedforward neural networks Candel and Parmer (2015). Such a network consists of many layers of interconnected neuron units: it begins with an input layer that matches the feature space, followed by a layer of nonlinearity, and terminates with a classification layer that matches the output space. For each training example j the objective is to minimize a loss function $L(W, B \mid j)$, where W is the collection $\{W_i\}_{1:N-1}$ and $W_i$ denotes the weight matrix connecting layers i and i + 1 for a network of N layers; similarly, B is the collection $\{b_i\}_{1:N-1}$, where $b_i$ denotes the column vector of biases for layer i + 1. The training of an NN for a classification task is based on minimization of the cross-entropy loss function Candel and Parmer (2015):

$$L(W, B \mid j) = -\sum_{y \in O} \left( \ln\left(o_y^{(j)}\right) t_y^{(j)} + \ln\left(1 - o_y^{(j)}\right) \left(1 - t_y^{(j)}\right) \right) \qquad (1)$$

where $t^{(j)}$ and $o^{(j)}$ are the predicted (target) output and the actual output, respectively, for training example j, y denotes the output units and O the output layer.
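As a small numerical illustration of the cross-entropy loss for a single training example, the sketch below assumes the common form that sums, over the output units, the log of the output weighted by the target plus the log of one minus the output weighted by one minus the target. The function name, the clipping constant `eps` and the example values are hypothetical and not taken from the paper.

```python
import math

def cross_entropy(target, output, eps=1e-12):
    """Cross-entropy loss for one example, summed over the output units."""
    total = 0.0
    for t, o in zip(target, output):
        o = min(max(o, eps), 1.0 - eps)   # keep log() finite (assumed safeguard)
        total += math.log(o) * t + math.log(1.0 - o) * (1.0 - t)
    return -total

# Hypothetical three-class example whose true class is the second one.
target = [0.0, 1.0, 0.0]
loss_good = cross_entropy(target, [0.1, 0.8, 0.1])   # confident and correct
loss_bad = cross_entropy(target, [0.6, 0.2, 0.2])    # mostly wrong
```

The loss is smaller for the output that puts more weight on the true class, which is what the minimization exploits.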
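To minimize (1), the stochastic gradient descent (SGD) method can be used, which is an iterative procedure applied for each training example j LeCun et al. (2002):

$$w_{jk} := w_{jk} - \eta \frac{\partial L(W, B \mid j)}{\partial w_{jk}}, \qquad b_{jk} := b_{jk} - \eta \frac{\partial L(W, B \mid j)}{\partial b_{jk}} \qquad (2)$$

where $w_{jk} \in W$ (weights), $b_{jk} \in B$ (biases) and $\eta$ denotes the learning rate.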
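The per-example update can be illustrated on the smallest possible model, a single sigmoid output unit trained with cross-entropy, for which the gradient with respect to the unit's input is simply output minus target. This is a sequential toy sketch in Python, not the parallel scheme or the network used in the experiments; the data, the learning rate of 0.5 and all names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias, examples):
    """Average cross-entropy over all examples."""
    total = 0.0
    for x, t in examples:
        o = predict(weights, bias, x)
        total -= t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
    return total / len(examples)

def sgd_epoch(weights, bias, examples, learning_rate, rng):
    """One pass over the data, updating the parameters after every example."""
    order = list(examples)
    rng.shuffle(order)
    for x, t in order:
        error = predict(weights, bias, x) - t          # dL/dz for this example
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error
    return weights, bias

# Hypothetical, linearly separable toy data: (features, target).
examples = [((0.0, 1.0), 0.0), ((0.1, 0.8), 0.0), ((1.0, 0.0), 1.0), ((0.9, 0.2), 1.0)]
rng = random.Random(0)
weights, bias = [0.0, 0.0], 0.0
loss_before = mean_loss(weights, bias, examples)
for _ in range(200):
    weights, bias = sgd_epoch(weights, bias, examples, learning_rate=0.5, rng=rng)
loss_after = mean_loss(weights, bias, examples)
```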
To speed up the training procedure we used Hogwild, a recently published lock-free parallelization scheme for SGD Recht et al. (2011).
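The activation function in the hidden layer might be a rectified linear function:

$$f(\alpha) = \max(0, \alpha), \qquad \alpha = \sum_i w_i x_i + b \qquad (3)$$

where $x_i$ and $w_i$ denote the firing neuron's input values and their weights, respectively, $b$ is the neuron's bias and $\alpha$ denotes the weighted combination.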
In our experiment we used a fully connected NN. The input layer had 540, 900 or 960 neurons, depending on the number of variables in the feature set. We experimented with different numbers of neurons in the hidden layer, from 4 to 256. The activation function of the neurons in the hidden layer was (3). The input data for the network is standardized to N(0, 1).
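The input standardization, the rectified linear hidden layer and the size of such a network can be sketched as follows. The sketch assumes a single hidden layer (the text speaks of "the hidden layer"), a population standard deviation, and an output layer with one unit per class; all function names are hypothetical and the code is independent of H2O.

```python
import math

def standardize(rows):
    """Scale every column to zero mean and unit variance (population std)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    stds = [math.sqrt(sum((r[k] - means[k]) ** 2 for r in rows) / n) for k in range(d)]
    # Constant columns carry no information; map them to 0 instead of dividing by 0.
    return [[(r[k] - means[k]) / stds[k] if stds[k] > 0 else 0.0 for k in range(d)]
            for r in rows]

def relu(alpha):
    """Rectified linear activation."""
    return max(0.0, alpha)

def hidden_layer(x, weights, biases):
    """Outputs of a fully connected layer: relu(weighted sum of inputs + bias)."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def parameter_count(n_in, n_hidden, n_out):
    """Weights and biases of an input -> hidden -> output network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

scaled = standardize([[1.0, 10.0], [3.0, 10.0]])
activations = hidden_layer([1.0, -2.0], [[1.0, 1.0], [0.5, -0.5]], [0.0, 0.0])
largest_gym_net = parameter_count(540, 256, 9)       # angle features, 9 exercises
largest_karate_net = parameter_count(960, 256, 16)   # distance features, 16 techniques
```

Under these assumptions the largest configurations have on the order of 10^5 trainable parameters, in line with the "hundreds of thousands of synaptic weights" mentioned in the introduction. In practice the means and standard deviations would be estimated on the training folds only.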

Support vector machine implementation
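Kernel-based learning methods use a kernel function to map the input data into a high-dimensional feature space Karatzoglou et al. (2004). Learning then takes place in the feature space, where the data points only appear inside dot products with other points (the "kernel trick") Schölkopf and Smola (2002). If a projection $\Phi: X \to H$ is used, the dot product $\langle \Phi(x), \Phi(y) \rangle$ can be represented by a kernel function k:

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

which is computationally simpler than explicitly projecting x and y into the feature space H Karatzoglou et al. (2004). Support vector machines Vapnik (1998) have gained prominence in the field of machine learning and in pattern classification and regression. The solutions to classification and regression problems sought by methods such as the SVM are linear functions in the feature space:

$$f(x) = \langle w, \Phi(x) \rangle + b$$

where $w \in F$ is a weight vector in the feature space. If the weight vector w can be expressed as a linear combination of the training points, $w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$, the kernel trick can be exploited:

$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b$$

In the case of 2-norm soft margin classification the optimization problem during classifier learning takes the form:

Minimize:

$$t(w, \xi) = \frac{1}{2} \lVert w \rVert^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i$$

Subject to:

$$y_i \left( \langle \Phi(x_i), w \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \qquad (i = 1, \ldots, m)$$

where m is the number of training points, $y_i \in \{-1, +1\}$ are the class labels, $\xi_i$ are slack variables and C is the cost parameter. For classification problems that include more than two classes (multi-class), a one-against-one Knerr et al. (1990) or pairwise classification method Kreßel (1999) is used. In our research we use the Gaussian radial basis kernel:

$$k(x, x') = \exp\left(-\sigma \lVert x - x' \rVert^2\right)$$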
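Two ingredients of this setup are easy to illustrate in Python: the Gaussian radial basis kernel and the one-against-one scheme for more than two classes. The kernel is written in the parametrization exp(-sigma * squared distance), which is the form kernlab documents for its Gaussian kernel; the value of sigma, the helper names and the voting rule by simple majority are assumptions made for this sketch, and no SVM is actually trained here.

```python
import math
from collections import Counter

def rbf_kernel(x, y, sigma):
    """Gaussian radial basis kernel exp(-sigma * ||x - y||^2)."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_classifier_count(n_classes):
    """One-against-one trains one binary classifier per pair of classes."""
    return n_classes * (n_classes - 1) // 2

def one_against_one_vote(pair_winners):
    """Final label is the class that wins the most pairwise decisions."""
    return Counter(pair_winners).most_common(1)[0][0]

same_point = rbf_kernel((0.2, 0.4), (0.2, 0.4), sigma=0.5)
far_apart = rbf_kernel((0.0, 0.0), (1.0, 1.0), sigma=0.5)
gym_pairs = pairwise_classifier_count(9)       # nine gym exercises
karate_pairs = pairwise_classifier_count(16)   # sixteen karate techniques
# Hypothetical winners of three pairwise decisions among bws, jj and dbc.
label = one_against_one_vote(["bws", "bws", "jj"])
```

The kernel equals 1 for identical inputs and decays with distance, so sigma controls how local the resulting decision function is.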

Random forests implementation
Random forests are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As the number of trees in the forest becomes large, the generalization error of the forest converges to a limit (2001). Each tree uses only a random sample of the training data and captures only a part of the overall information. This is called a bagging procedure. The second randomized procedure is feature selection when determining the best split. In the H2O (2015) implementation used in our experiment, the tree randomly selects a subset of features whose size is the square root of the number of all features. The simplest random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on. Tree growth uses the CART methodology Breiman et al. (1984).
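The two sources of randomness described above can be sketched in a few lines of Python. This is not H2O's implementation: sampling the training rows with replacement is the classical bagging scheme and is an assumption here, as is rounding the square root down; the function names are hypothetical.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Row indices for one tree, drawn with replacement (classical bagging)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def split_candidates(n_features, rng):
    """Random subset of about sqrt(n_features) features considered for a split."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
rows_for_tree = bootstrap_indices(770, rng)        # gym dataset size
candidates_angles = split_candidates(540, rng)     # 60 x 9 angle features
candidates_distances = split_candidates(900, rng)  # 60 x 15 distance features
```

With 540 to 960 input variables only a few dozen are examined per split, which keeps individual trees cheap and decorrelated.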

Results
The gym exercises dataset was used in earlier work Hachaj and Ogiela (2015a). It consists of recordings of 14 participants, 10 men (M1-M10) and 4 women (W1-W4), where the number is the id of a participant (see Table 1). The users were asked to perform: body weight lunge left (bwll), body weight lunge right (bwlr), body weight squat (bws), dumbbell bicep curl (dbc), jumping jacks (jj), side lunges left (sll), side lunges right (slr), standing dumbbell upright row (sdur), tricep dumbbell kickback (tdk). Table 1 presents the number of actions of a given type performed by each person. The total number of samples was 770. Important phases of actions from the gym exercises dataset are visualized in Fig. 2. The karate techniques dataset is an extension of the dataset we used in earlier work. The dataset consisted of MoCap recordings of six volunteers, including a multiple champion of Kumite Knockdown Oyama karate. We recorded four types of defense techniques (gedan-barai, jodan-uke, soto-uke and uchi-uke), three types of kicks (hiza-geri, mae-geri and yoko-geri) and three stances (kiba-dachi, kokutsu-dachi and zenkutsu-dachi). The stances were preceded by fudo-dachi and were also evaluated as actions (not as static body positions). Kicks were done with the right foot and blocks with the right hand. The original dataset was extended by three types of punches: furi-uchi, shita-uchi and tsuki. Punches were done with the right and left hand separately. In Fig. 3 we present important stages of the karate techniques we evaluated. The total number of samples was 1996 (see Table 2). In Fig. 4 we present four plots of 9-dimensional angle-based signals of exemplary karate techniques from our datasets. We can clearly see that kicking actions heavily involve the whole body, while punching involves mostly the hands and only marginally the rest of the body, which agrees with the motor characteristics of those movements.
In both experiments we used the feature sets described in Sect. 2.1. We implemented our solution in the R language using the H2O package H2O (2015) for NN and RF and kernlab Karatzoglou et al. (2004) for SVM.
In the first experiment on the gym exercises dataset (Table 1)    which also used angle-based and distance-based feature sets, however not the same as we proposed in Sect. 2.1. Table 3 presents the averaged recognition rate (RR) for gym exercises obtained with k-fold cross-validation, plus/minus standard deviation.
In Figs. 5 and 6 we visualize the data from Table 3. Colored bars represent the averaged RR and black bars the standard deviation.
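The evaluation protocol behind the reported "averaged RR plus/minus standard deviation" can be sketched as follows. The paper does not state the value of k, how the folds were drawn, or whether the standard deviation is the sample or population variant, so k = 5, shuffled folds and the population standard deviation are assumptions. The nearest-neighbour function is only a stand-in for the NN, RF and SVM classifiers, and the toy data and all names are hypothetical.

```python
import random
import statistics

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k disjoint, shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def recognition_rate(true_labels, predicted):
    """Percentage of correctly recognized actions."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return 100.0 * correct / len(true_labels)

def cross_validate(features, labels, fit_predict, k):
    """Mean and standard deviation of the recognition rate over k folds."""
    rates = []
    for train, test in k_fold_splits(len(labels), k):
        predicted = fit_predict([features[i] for i in train],
                                [labels[i] for i in train],
                                [features[i] for i in test])
        rates.append(recognition_rate([labels[i] for i in test], predicted))
    return statistics.mean(rates), statistics.pstdev(rates)

def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in classifier: label of the closest training sample (1-D features)."""
    return [train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))]
            for x in test_x]

# Hypothetical, well separated toy data with two classes.
features = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
labels = ["a"] * 5 + ["b"] * 5
mean_rr, std_rr = cross_validate(features, labels, nearest_neighbour, k=5)
```

Any classifier with the same `fit_predict(train_x, train_y, test_x)` shape can be plugged in, so the three methods can be compared on identical folds.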
In the second experiment, on the karate techniques dataset (Table 2), we used classifier settings very similar to those in the previous one. We also compared the obtained results with a multivariate continuous hidden Markov model classifier with 4 hidden states from earlier work, which used an angle-based feature set, however not the same as we proposed in Sect. 2.1. Also, the dataset in that work did not contain six classes of actions, namely punches. Table 4 presents the averaged RR for karate techniques obtained with k-fold cross-validation, plus/minus standard deviation.
In Figs. 7 and 8 we visualize the data from Table 4. Colored bars represent the averaged RR and black bars the standard deviation.  and SVM with Gaussian radial basis kernel with angle-based features the RR reaches 99 % or even 100 %. We might conclude that applying all examined pattern recognition techniques (NN, RF and SVM) to both types of feature representation gave equally good classification results. The karate techniques dataset is more difficult to recognize correctly than the previous one, because it has more classes of movements (16 compared to 9 in the gym dataset). None of the methods exceeded 97 % RR. This time we can clearly observe that the angle-based representation gives a better RR than the distance-based one. Most often, errors were caused by misclassification of punches (most notably furi-uchi with tsuki) and blocks (uchi-uke, gedan-barai and jodan-uke). This is caused by the low quality of data tracking and heavy tracking errors that become visible when hands are positioned near other body parts. The angle-based features derived from joint positions seem to be more resistant to this noise than distance-based features. The highest RR (97 ± 2 %) was obtained for the NN with 256 neurons in the hidden layer that uses angle-based features. This result was similar to that of the HMM Hachaj et al. (2015a), which was also 97 ± 2 %; however, we must note that the karate dataset used there did not include punches (6 additional classes of movements), and we might expect that the RR of the HMM on the extended dataset would be worse than 97 %. The SVM classifier and RF with 64 and 256 trees also have a very similar RR, namely 97 %, with only a slightly higher standard deviation (±3 % for both SVM and RF).

Conclusions
The proposed movement data representation technique, based on resampling the input multidimensional signal to a common length, resulted in a high RR for all applied pattern recognition methods. Based on our experiments on relatively large datasets (9 classes with 770 action samples and 16 classes with 1996 samples), it seems that the angle-based 9-dimensional feature set yields a higher RR than the 15- or 16-dimensional distance-based feature sets. This is because angle-based features seem to be more resistant to the tracking inaccuracies present in the dataset. The most important aspect when choosing an appropriate classifier is to select a method capable of operating on data samples with many dimensions (in our case between 540 and 960). This type of action recognition approach outperforms the key frame-based approach that uses a multivariate continuous hidden Markov model classifier. Our method is also easy to set up and does not require many adaptive parameters to work successfully. The results presented in this work give important hints for developing similar action recognition systems, because the easily repeatable feature selection and classification setup we have described appears to deliver high efficiency and effectiveness of the overall solution.
The goal for the future is to apply the proposed data representation scheme to quantitative analysis of actions. The most straightforward but promising approach might be to use an NN with an auto-encoding architecture, which is an effective approach to anomaly and outlier detection Candel and Parmer (2015). We believe that this type of analysis will be useful in outdoor real-time detection of hazardous situations and in high-quality analysis of body actions (especially in sport).