# Landmark-based multimodal human action recognition


## Abstract

Human activity recognition has received a lot of attention recently, mainly thanks to the advancements in sensing technologies and systems’ increasing computational power. However, complexity in human movements, sensing devices’ noise and person-specific characteristics impose challenges that still remain to be overcome. In the proposed work, a novel, multi-modal human action recognition method is presented for handling the aforementioned issues. Each action is represented by a basis vector and spectral analysis is performed on an affinity matrix of new action feature vectors. Using modality-dependent kernel regressors for computing the affinity matrix, complexity is reduced and robust low-dimensional representations are achieved. The proposed scheme supports online adaptivity of modalities, in a dynamic fashion, according to their automatically inferred reliability. Evaluation on three publicly available datasets demonstrates the potential of the approach.

### Keywords

Spectral clustering · Human action recognition · Multimodal fusion

## 1 Introduction

Human-machine interaction is entering a new era, with computers altering the way they respond to human stimuli. Natural interaction, expressivity, affect [4] and activity recognition [1] are the principal factors that enrich a human-machine interaction experience. Indeed, technology now offers an increasingly large number of sensing devices for capturing human activity and, in many cases, hidden intentions, behaviors, affective and cognitive states. Wearable inertial measurement sensors [11], robust video processing algorithms [1], infrared and depth sensors [7] and audio [27] are only a few of the cues available for understanding human activity. These advances have brought automatic action recognition to the forefront in many applications, ranging from entertainment to health-care systems. Based on the above, it is understood that a robust action recognition scheme should fulfil a series of criteria. First of all, algorithms guaranteeing real-time performance are necessary, while accuracy is equally important, especially in critical circumstances, such as those involving healthcare systems. Although the more information a system is provided with, the more accurate feedback it is likely to deliver, in many circumstances a large volume of information dramatically increases computational complexity, leading to systems inappropriate for real-time applications. Exploiting multi-modal information can also significantly boost the performance of a system, but care should be taken to place more importance on 'good' modalities than on noisy ones.

In the proposed work, a real-time human action recognition method is introduced. The proposed framework approaches the problem by taking into account the aforementioned challenges. In particular, a low-dimensional representation of high-dimensional feature vectors is utilized, following a landmark-based spectral analysis scheme. In this way, low-dimensional subspaces encoding valuable information are built, and new, unknown actions are projected onto them. Consequently, only valuable information from different modalities is identified and used in the construction of the models and in the classification of new instances. Based on the mathematical framework of spectral analysis, a method for constructing the adjacency matrix by combining cues from multiple modalities is also introduced in this work. Modalities are fused adaptively, according to automatically inferred reliability metrics, guaranteeing increased robustness to sensor instability or tracking failures. Furthermore, a methodology for handling large variance within the same action is proposed; in this manner, different styles of executing the same action are accommodated, boosting the system's ability to generalize to unknown individuals. Finally, for inference on new, unseen vectors, no local sub-manifold unfolding is necessary and only simple matrix operations are needed, making the proposed technique suitable for demanding real-time applications. The above are illustrated through experiments, where comparisons with state-of-the-art methods on three datasets are presented (HMMs & Bayes classification, Bag-of-Words used in Support Vector Machines, multiclass Multiple Kernel Learning) and classification speed is assessed.

The proposed technique builds on the authors' preliminary work on spectral analysis for Microsoft Kinect-based activity recognition [3], where results were presented for the single-modality case of depth data only, while inter- and intra-individual sub-actions were not considered and experiments were limited to a single scenario. The rest of the paper is structured as follows: Section 2 gives an overview of systems employed for human action recognition. Section 3 provides the technical details of the proposed method, while Section 4 presents extensive experiments on three publicly available datasets. Section 5 concludes the paper.

## 2 Related work

Feature pre-processing is strongly related to the utilized cue, in problems related to human activity recognition. Raw inertial sensor data are used extensively, due to their ability to capture instantaneous features of local character and, thus, lead to a rich source of information for action classification. Statistical [23], expressivity [5] and frequency domain parameters [17], on the other hand, although local, convey a summary of an action for different parts of the human body and, thus, they can be time independent. Such parameters usually depend on efficient tracking in video sequences, which is a challenging area of research on its own, attracting the attention of numerous researchers. Recent advances in object tracking have given rise to new techniques aiming at handling (self-)occlusions and local anomalies, using uncertainty-based techniques [36]. Space-Time Volumes [15] concatenate consecutive vision-based two-dimensional human silhouettes along time, leading to three-dimensional volumes and have been extensively used in non-periodic activities, with their performance in the case of varying speed and motion still questioned [1]. Local descriptors (e.g. SIFT [24] and Histograms of Oriented Gradients [19]) necessitate optimal alignment between training and testing data and, although they possess strong discriminative power, they fail to take advantage of whole body actions. A recently proposed approach in the domain of computer vision has introduced the notion of mid-level discriminative patches [12] to automatically extract semantically rich spatial or spatiotemporal windows of RGB information, in order to distinguish elements that account for primitive human actions. Various feature extraction techniques have also been proposed in the area of depth maps for human action recognition; typical is the work in [6], where the authors proposed the use of Depth Motion Maps (DMMs) for capturing motion and shape cues concurrently.
Subsequently, LBP descriptors are employed for describing rotation invariant textures of the patches employed. Recently, Song et al. [26] conducted experiments in re-projecting multiple modalities to a new space where correlation among them is maximised and showed that, following this pre-processing step, nonlinear relationships among different data sources can be found.

On a second level lie the methodologies which use processed features as input. The robustness of the selected approach depends on the context of the application and the availability of features. Dynamic Time Warping (DTW) [30] is one of the most well-known classification schemes. One of the major advantages of the method is its adjustability to varying time lengths, but it usually requires a very large number of training examples, as it is basically a template matching technique. Models describing statistical dependencies have also been used extensively, mainly in order to encode time-related correlations. A classical approach, in this vein, is the Hidden Markov Model (HMM) [16, 35]. Authors in [32] propose a discriminative parameter learning method for a hybrid dynamic network in human activity recognition. They showcase results on walking, jogging, running, hand waving and hand clapping activities. Authors in [20] employ DBNs for the semantic analysis of sports-related events in videos. The probabilistic behavior of human motion-related features has also been widely exploited through Support Vector Machines (SVMs). SVMs seek hyperplanes in the feature space for separating data into classes; the data points on the margin of the hyperplane are called support vectors. Laptev et al. [18] use non-linear SVMs for the task of recognizing daily activities of short temporal length (answer the phone, sit down/up, kiss, hug, get out of car). Similarly, authors in [29] use SVMs on temporal and time-weighted variances, and authors in [21] employ SVMs on RGB and depth data to recover gestures, and then apply a fusion scheme using inferred motion and audio, in a multimodal environment. Authors in [14] have also utilized SVMs for activity feature classification, on joint orientation angles and their forward differences, while view-invariant features (normalized between-joint distances, orientations and velocities) have been employed in [28].
The output of an Artificial Neural Network (ANN) can also be used for modelling the probability *P*(*y*|*x*) of an activity *y* occurring, given input feature vector *x*. Three- and four-layer perceptrons are among the most common architectures. Typical is the work in [9], where the authors perform indoor action recognition using two modalities, namely wearable and depth sensors. Authors in [10] have also recently proposed a method for human action recognition based on skeletal information, using Hierarchical Recurrent Neural Networks to exploit temporal information in different parts of the human body, while the work in [13] proposes a three-dimensional Convolutional Neural Network in order to jointly make use of spatial and temporal information. When using neural networks, special attention should be paid to high training complexity, as well as overfitting. Classical classification schemes, such as *k*-Nearest Neighbor-based ones (*k*-NNs) and binary trees, have also been widely reported in the bibliography. The authors in [17] employ the Discrete Fourier Transform (DFT) as their representation scheme and feed the corresponding parameters to a *k*-NN. The main drawbacks of these systems are that they are quite sensitive to parameter fine-tuning and tend to generalize poorly to unknown subjects. Recently, there has also been a surge in the use of Sparse Representation techniques, especially in computer vision tasks [25, 33, 34], and authors in [37] propose a novel methodology for pattern recognition, applied to action, face, digit and object recognition, by transferring the data structure into the optimization process.

## 3 Landmark-based action recognition

Identical or similar actions represented by feature vectors \(\mathbf {x}_{i}{\in }\mathbb {R}^{m}\) can be considered to lie close to each other on a manifold space. Thus, they can be approximated by the linear combination of representation vectors \(\mathbf {z}_{i}{\in }\mathbb {R}^{k}\) (*k* << *m*) with a set of basis vectors \(\mathbf {l}_{j}{\in }\mathbb {R}^{m}\), leading to the optimization problem of minimizing ||*X*−*L**Z*||, with \(X=[\mathbf {x}_{1}, ... , \mathbf {x}_{n}]{\in }\mathbb {R}^{m{\times }n}\) being a set of *n* actions, \(L=[\mathbf {l}_{1}, ... , \mathbf {l}_{k}]{\in }\mathbb {R}^{m{\times }k}\) a table of feature vectors corresponding to landmark-features (derived randomly, after clustering or straight from the activities themselves) and \(Z=[\mathbf {z}_{1}, ... , \mathbf {z}_{n}]{\in }\mathbb {R}^{k{\times }n}\) the low-dimensional representation of *X*. A typical approach for finding low-dimensional representations in manifold spaces is the calculation of distances among all *n* data vectors, leading to the adjacency matrix \(W=(w_{i,j})_{i,j=1}^{n}\) [31]. From *W*, the degree diagonal matrix *D* is built, whose elements are the column (or row) sums of *W*. Subtracting *W* from *D* gives the graph Laplacian matrix *L*, and the eigenvectors corresponding to its *k* smallest eigenvalues are the low (*k*)-dimensional representation of the initial dataset. However, large datasets lead to time-consuming construction and eigen-decomposition of the Laplacian. Moreover, real-time action classification, using a spectral analysis scheme, requires a per-frame unfolding of local submanifolds, as well as the use of a pre-defined number of closest feature points in it. Authors in [8] present a methodology for solving the problem by only using a subset of feature (basis) vectors **l**_{j}, instead of finding one-to-one relationships among all feature vectors in a dataset, for building the adjacency matrix.
According to this method, the *n* data points \(\mathbf {x}_{i}{\in }\mathbb {R}^{m}\) can be represented by linear combinations of *k* (*k* ≪ *n*) representative landmarks (basis vectors). This representation can be used in the spectral embedding. The new representations are *k*-dimensional vectors \(\mathbf {b}_{i}{\in }\mathbb {R}^{k}\), while the landmarks are the result of random selection or a *k*-means algorithm. We hereby extend this technique by introducing a dynamic weighting scheme for handling multiple modalities in the adjacency matrix and provide a framework for real-time inference using simple matrix operations, thus avoiding manifold unfolding at test time, which would be prohibitive for real-time applications.
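As a minimal sketch of this landmark-based embedding (not the authors' code; the Laplacian-kernel width, NumPy API and toy data are illustrative assumptions), the pipeline — kernel similarities to landmarks, column normalization, degree normalization and SVD — can be written as:

```python
import numpy as np

def laplacian_kernel(x, l, sigma=1.0):
    # Laplacian kernel: k_h(x, l) = exp(-||x - l||_1 / sigma)
    return np.exp(-np.abs(x - l).sum() / sigma)

def landmark_embedding(X, L, sigma=1.0):
    """Spectral embedding of the n columns of X (m x n) against the
    k landmark columns of L (m x k), in the spirit of Chen and Cai [8]."""
    n, k = X.shape[1], L.shape[1]
    # Representation matrix Z (k x n): kernel similarities, normalized so
    # every column sums to 1.
    Z = np.array([[laplacian_kernel(X[:, i], L[:, j], sigma)
                   for i in range(n)] for j in range(k)])
    Z /= Z.sum(axis=0, keepdims=True)
    D = Z.sum(axis=1)                      # row sums of Z
    Z_hat = Z / np.sqrt(D)[:, None]        # degree normalization
    # SVD Z_hat = A diag(s) B^T; the rows of B are the k-dim representations.
    A, s, Bt = np.linalg.svd(Z_hat, full_matrices=False)
    return Bt.T, A, s

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))               # 10 feature vectors in R^6
L = X[:, [0, 5]]                           # two landmarks taken from the data
B, A, s = landmark_embedding(X, L)         # B is 10 x 2
```

Note that `B` has orthonormal columns by construction (right singular vectors), so no extra orthogonalization step is needed.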

The *k*^{′} classes of a training dataset can constitute a basis for building the landmark matrix \(L{\in }\mathbb {R}^{m{\times }k^{\prime }}\). Here, we consider each (sub-)action-specific landmark as the average of the corresponding *m*-dimensional feature vectors. The original data matrix \(X=[\mathbf {x}_{1}, ... , \mathbf {x}_{n}]{\in }\mathbb {R}^{m{\times }n}\) can be approximated by the product of *L* and the representation matrix \(Z{\in }\mathbb {R}^{k^{\prime }{\times }n}\):

$$X \approx LZ \qquad (1)$$

Since different individuals (or the same individual, at different times) might adopt different expressivity for performing the same action, the idea of sub-action basis vectors in the spectral embedding is proposed here. In particular, since an action may be defined by more than one classes, a within-action clustering scheme is followed. For a given action *a*, a hierarchical cluster tree is used, in order to lead to the identification of significant sub-clusters. The algorithm computes the matrix \(Y{\in }\mathbb {R}^{n_{a}{\times }m}\) of the cosine distance between pairs of the *n*_{a} feature vectors belonging to the same action. It constructs *k*_{a} clusters using the distance criterion, finding the lowest height where a cut through the hierarchical tree leaves a maximum of a pre-defined number of sub-clusters. A stopping criterion is also imposed, so that heavily imbalanced clusters are not created. Using the above, the total number of the landmarks used for spectral classification is \(k=\sum \limits _{a=1}^{k^{\prime }}k_{a}\geq k^{\prime }\).
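The within-action clustering step can be sketched as follows, assuming SciPy's hierarchical clustering with a `maxclust` cut stands in for the tree-cutting described above; the `min_size` guard is an assumed form of the stopping criterion against heavily imbalanced clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def subaction_landmarks(Xa, max_clusters=3, min_size=2):
    """Cut a cosine-distance hierarchical tree over the n_a feature vectors
    of one action (rows of Xa) into at most `max_clusters` sub-clusters and
    return one landmark (the cluster mean) per sufficiently large cluster."""
    tree = linkage(pdist(Xa, metric='cosine'), method='average')
    labels = fcluster(tree, t=max_clusters, criterion='maxclust')
    return np.array([Xa[labels == c].mean(axis=0)
                     for c in np.unique(labels)
                     if (labels == c).sum() >= min_size])

# Two execution "styles" of the same action, as nearly orthogonal directions.
rng = np.random.default_rng(1)
style_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(5, 3))
style_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(5, 3))
landmarks = subaction_landmarks(np.vstack([style_a, style_b]), max_clusters=2)
```

On this toy action the two styles are recovered as two sub-action landmarks, which would then both enter the landmark matrix *L*.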

Each element *z*_{ji} of the representation matrix *Z* can be found as the output of a kernel function *k*_{h}(⋅) (here, we use the Laplacian kernel, \(k_{h}(\mathbf{x},\mathbf{l})=e^{-\|\mathbf{x}-\mathbf{l}\|_{1}/\sigma}\)) of feature vector **x**_{i} and landmark **l**_{j}, normalized with the sum of the corresponding values for all landmark vectors:

$$z_{ji}=\frac{k_{h}(\mathbf{x}_{i},\mathbf{l}_{j})}{\sum_{j^{\prime}=1}^{k}k_{h}(\mathbf{x}_{i},\mathbf{l}_{j^{\prime}})} \qquad (2)$$

where *σ* is the width of the kernel.

*Z* represents the similarity values between data vectors and actions’ (or sub-actions’) representative landmarks and defines an undirected graph *G* = (*V*, *E*) with graph matrix \(W=\hat {Z}^{T}\hat {Z}\), where:

$$\hat{Z}=D^{-1/2}Z \qquad (3)$$

with *D* being a diagonal matrix whose elements are the row sums of *Z*. Since each column of the representation matrix sums up to 1, it is straightforward to check that the degree matrix of *W* is the identity matrix. Consequently [22], the eigenvectors of *W* are the same as those of the corresponding Laplacian matrix.
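This property is easy to confirm numerically; the following check (illustrative only, not part of the method) builds a random column-normalized *Z* and verifies that the degree matrix of *W* is the identity:

```python
import numpy as np

# If each column of Z sums to 1 and Z_hat = D^(-1/2) Z, with D the diagonal
# of row sums of Z, then every row of W = Z_hat^T Z_hat sums to 1, i.e. the
# degree matrix of W is the identity.
rng = np.random.default_rng(2)
Z = rng.random((4, 7))              # k = 4 landmarks, n = 7 data vectors
Z /= Z.sum(axis=0, keepdims=True)   # column-normalize, as in (2)
D = Z.sum(axis=1)                   # row sums of Z
Z_hat = Z / np.sqrt(D)[:, None]
W = Z_hat.T @ Z_hat
degrees = W.sum(axis=1)             # row sums of W
```

The row sum of *W* at index *i* equals \(\mathbf{z}_i^T D^{-1/2} D^{1/2}\mathbf{1} = \mathbf{z}_i^T\mathbf{1} = 1\), which is exactly what the check confirms.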

The singular value decomposition of \(\hat{Z}\) is written as:

$$\hat{Z}=A{\Sigma}B^{T} \qquad (4)$$

where *σ*_{j} are the singular values of \(\hat {Z}\) and *A* consists of the left singular vectors of \(\hat {Z}\), found through the singular value decomposition (4), while \(B=[\mathbf {b}_{1}...\mathbf {b}_{k}]{\in }\mathbb {R}^{n{\times }k}\) are the eigenvectors of matrix \(W=\hat {Z}^{T}\hat {Z}\). Each row of *B* is a low-dimensional representation of the original, high-dimensional feature vectors. Since *A*^{T} = *A*^{−1}, *B* can be computed directly from (4), as:

$$B=\hat{Z}^{T}A{\Sigma}^{-1} \qquad (5)$$

with \({\Sigma}=\text{diag}(\sigma_{1},...,\sigma_{k})\) holding the singular values *σ*_{j}, in decreasing order.

### 3.1 Dynamic fusion of different modalities

Kernels of different widths *σ*^{c} can be used per modality *c* for calculating the representation matrix. When properly weighted, they can adjust the amount of reliability attributed to each modality. This can be achieved by considering that *σ*^{c} increases with the probability of model *𝜃*^{c, f} of modality *c* and feature *f* generating observation *x*^{c, f}, and it is calculated as the normalized average for each modality, where *N*_{c} is the number of features used for modality *c* and *η*^{c} is a multiplying factor. Thus, the kernel of (2), for given feature and basis vectors \(\mathbf {x}^{c}_{i}\), \(\mathbf {l}^{c}_{j}\) corresponding to modality *c*, becomes a kernel of modality-dependent width *σ*^{c}.
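Since the exact weighting equations of this subsection are not reproduced in the extracted text, the following sketch only illustrates the idea under stated assumptions: per-modality widths as *η*-scaled, normalized average likelihoods, and a multimodal Laplacian kernel that combines per-modality distances inside one exponent:

```python
import numpy as np

def modality_widths(likelihoods, eta):
    """Kernel width sigma^c per modality: the eta^c-scaled, normalized
    average likelihood of its features.  This normalization is an assumed
    reading of the 'normalized average for each modality' in the text."""
    avg = np.array([np.mean(lk) for lk in likelihoods])
    return np.asarray(eta) * avg / avg.sum()

def multimodal_kernel(x_mods, l_mods, sigmas):
    """Laplacian kernel over M modalities with modality-dependent widths:
    a larger sigma^c makes modality c's distance decay the similarity more
    slowly (i.e. the modality is trusted more).  Combining modalities in a
    single exponent is an assumption, not the paper's stated formula."""
    d = sum(np.abs(np.asarray(x) - np.asarray(l)).sum() / sig
            for x, l, sig in zip(x_mods, l_mods, sigmas))
    return np.exp(-d)

# A reliable modality (high likelihoods) receives the larger width.
sig = modality_widths([[0.9, 0.8], [0.1, 0.2]], eta=[2.0, 2.0])
```

With identical feature and landmark vectors the kernel evaluates to 1, its maximum, regardless of the widths.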

### 3.2 Classification of new instances

To assign a new feature vector **x**^{′} = [*x*^{′1} ... *x*^{′M}], coming from *M* modalities, to an activity, the elements \(z_{j}^{\prime }\) of the representation vector \(\mathbf {z}^{\prime }{\in }\mathbb {R}^{k}\), defined by the similarities between **x**^{′} and \(L=[(\mathbf {l}^{1}_{1}...\mathbf {l}^{M}_{1})^{T}...(\mathbf {l}^{1}_{k}...\mathbf {l}^{M}_{k})^{T}]\), are found as:

$$z_{j}^{\prime}=\frac{k_{h}(\mathbf{x}^{\prime},\mathbf{l}_{j})}{\sum_{j^{\prime}=1}^{k}k_{h}(\mathbf{x}^{\prime},\mathbf{l}_{j^{\prime}})}$$

The representation **b**^{′} of the new feature vector in the low-dimensional domain is then given by:

$$\mathbf{b}^{\prime}={\Sigma}^{-1}A^{T}\hat{\mathbf{z}}^{\prime}$$

with \(\hat{\mathbf{z}}^{\prime}\) normalized as in (3). The new vector is assigned to the class *C* of the action with low-dimensional representation matrix *B*_{a} (as calculated in training) that minimizes a distance metric *d*(⋅) from **b**^{′}:

$$C=\underset{a}{\arg\min}\; d(\mathbf{b}^{\prime},B_{a})$$

Thus, for new data vectors, no local sub-manifold unfolding is necessary and only simple matrix operations are needed for inference. This is of great significance, since it allows for real-time action recognition and makes the proposed method appropriate for online evaluation of whether the projection of multi-modal features over the course of an action is close to the class subspaces of a trained model.
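The inference path can be sketched end-to-end on a toy one-dimensional example (the Euclidean distance is an assumed choice for *d*(⋅), which the paper leaves open):

```python
import numpy as np

def classify(x_new, L, A, s, D, B_per_action, sigma=1.0):
    """Assign a new feature vector to an action: build z' from Laplacian-
    kernel similarities to the landmarks, degree-normalize it with the
    training D, map it to b' = Sigma^-1 A^T z_hat', and pick the action
    whose training rows B_a lie closest."""
    z = np.array([np.exp(-np.abs(x_new - L[:, j]).sum() / sigma)
                  for j in range(L.shape[1])])
    z /= z.sum()
    b = (A.T @ (z / np.sqrt(D))) / s
    return min(B_per_action,
               key=lambda a: np.linalg.norm(B_per_action[a] - b, axis=1).min())

# Toy training: two one-dimensional "actions", around 0 and around 10.
X = np.array([[0.0, 0.2, 9.8, 10.0]])
L = np.array([[0.1, 9.9]])                 # one landmark per action
Z = np.array([[np.exp(-abs(X[0, i] - L[0, j])) for i in range(4)]
              for j in range(2)])
Z /= Z.sum(axis=0, keepdims=True)
D = Z.sum(axis=1)
A, s, Bt = np.linalg.svd(Z / np.sqrt(D)[:, None], full_matrices=False)
B = Bt.T
pred = classify(np.array([0.3]), L, A, s, D,
                {'low': B[:2], 'high': B[2:]})
```

Only kernel evaluations, one matrix-vector product and a nearest-row search are involved, which is the source of the sub-0.02 s classification times reported later.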

## 4 Experimental evaluation

To validate its accuracy, the proposed methodology was tested on three publicly available datasets.

### 4.1 Skoda Mini Checkpoint Dataset

The Skoda Mini Checkpoint dataset [35] contains recordings of assembly-line quality-check gestures, captured with wearable accelerometers; statistical features were computed along the *x*, *y* and *z* axes. In the experiments, in order to capture temporal and not only qualitative characteristics, every instance was split into 4 periods and the average values of the above features were calculated within these time segments. The above procedure gave a total of 240 features per instance.
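The segmentation step might be sketched as follows, assuming 60 per-frame features so that 4 segments yield the reported 240 features per instance (the 60 is inferred, not stated explicitly in the text):

```python
import numpy as np

def temporal_features(instance, n_segments=4):
    """Split one recorded instance (a T x F array of per-frame features)
    into n_segments equal periods and average each feature within every
    period, concatenating the results into a single vector."""
    segments = np.array_split(instance, n_segments, axis=0)
    return np.concatenate([seg.mean(axis=0) for seg in segments])

# e.g. 100 frames of 60 features -> one 240-dimensional instance vector
rng = np.random.default_rng(3)
vec = temporal_features(rng.normal(size=(100, 60)))
```

`np.array_split` tolerates frame counts not divisible by 4, so short recordings are handled without padding.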

### 4.2 Huawei/3DLife Dataset 1

Experiments were also conducted on the Huawei/3DLife dataset,^{1} where 14 subjects participated, each performing a set of 16 repetitive actions. These actions are either sports-related, or involve some standard movements (e.g. knocking on the door), as shown in Fig. 3. Each action was performed 5 times by each subject. Subjects’ motion was captured using a series of depth sensors (Microsoft Kinect). As authors in [28] report results on the non-repetitive action of running on a treadmill, we hereby included this action in our experiments, as well.

Using Kinect depth sensors, human motion can be easily extracted in the form of moving human skeletons [2] and real-time feedback regarding a series of features’ positions is obtained (head, neck, shoulders, elbows, hands, torso, hips, knees, feet). Authors in [28] introduce a set of view-invariant features that we briefly present here: For each joint, its distance along all three axes from the torso (as the torso is seen in the first frame of each action) is calculated. This is normalized with the average distance between the torso joint and the feet joints, in order to cater for different body sizes. Moreover, joint orientations expressed in quaternions are used. Also, velocity information is used, based on both positional and orientation-related information. Velocities are calculated for two different time intervals for each feature. The above strategy leads to 264-dimensional features per time segment. Sun and Aizawa [28] use the above features and, after a feature refinement step, represent them by Bags of Words at sampling intervals over the whole sequence of the action, as well as three temporal subsequences, and use SVMs for classification. Similarly, in our experiments, we used the expected values of the same features over the course of each action, as well as three subsequences of them, which assists in differentiating between similar actions with temporal differences (e.g. backward vs forward tennis moves). Since many actions consist of fewer than 5 frames, velocity-related features were extracted for two time segments.
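The position part of these features can be sketched as below (the joint list, coordinates and helper name are illustrative, and only the torso-relative, body-size-normalized offsets of [28] are shown):

```python
import numpy as np

def joint_position_features(joints, torso0, lfoot0, rfoot0):
    """Per-axis offsets of every joint from the torso as seen in the first
    frame, normalized by the average torso-to-foot distance to account for
    different body sizes."""
    scale = 0.5 * (np.linalg.norm(torso0 - lfoot0)
                   + np.linalg.norm(torso0 - rfoot0))
    return ((joints - torso0) / scale).ravel()

torso = np.array([0.0, 1.0, 0.0])
lfoot = np.array([-0.1, 0.0, 0.0])
rfoot = np.array([0.1, 0.0, 0.0])
joints = np.array([[0.0, 1.6, 0.0],   # head
                   [0.3, 1.2, 0.1]])  # right hand
feat = joint_position_features(joints, torso, lfoot, rfoot)
```

Because the offsets are taken against the first-frame torso and divided by a body-size scale, the resulting vector is invariant to the subject's position and, largely, stature.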

For this dataset, the reliability weight of modality *c* for feature *i* can be given by (13), where \(x^{c}_{i}\) is feature *i* of modality *c*, *η*^{c} is a modality-specific parameter and *N*_{c} is the number of feature variables in modality *c*.

**Table 1** Results on the Huawei/3DLife Dataset Session 2 using the proposed technique with/without reliability, different combinations of modalities and the technique described in [28]

| Method | Accuracy |
| --- | --- |
| All modalities (using reliability indicators) | 80.4 % |
| All modalities (not using reliability indicators) | 77.6 % |
| Position and Orientation raw values | 71.0 % |
| Position and Orientation velocities | 72.1 % |
| Method in [28] (Bag-of-words/SVM) | 79.78 % |

### 4.3 Berkeley MHAD database

Experimental results are also presented on the recently published Berkeley MHAD (Multimodal Human Action Database) dataset, described in [23]. The dataset comprises 11 actions performed by 12 subjects, with each subject performing a set of 5 repetitions of each action. Three different types of actions resulted in a total of 82 min of recording time: 1) actions in both upper and lower body extremities, 2) actions with high dynamics in upper extremities, 3) actions with high dynamics in lower extremities. The actions performed in the dataset are: jumping, jumping jacks, bending, punching, waving two hands, waving one hand, clapping, throwing, sit down/stand up, sit down, stand up. For each action, 5 different cues were used for recognition: a Mocap system, a set of multi-view video data, a set of two Microsoft Kinect depth sensors, six three-axis accelerometers that capture the motion of hips, ankles and wrists, and an audio system.

For this dataset, each feature *i*, belonging to modality *c*, is considered to follow a lognormal distribution \(f({x_{i}^{c}}\mid {{\mu _{i}^{c}},{\sigma _{i}^{c}}})\), with \({\mu _{i}^{c}}\) and \({\sigma _{i}^{c}}\) being the mean and standard deviation, respectively, of the associated normal distribution. Equation (14) can then be used to obtain the normalized weight corresponding to each modality *c*, with *η*^{c} being a modality-specific constant and *N*_{c} the number of feature variables in modality *c*. In our experiments, *η*^{c} was set to 6, for both modalities, as it achieved the best accuracy on a validation dataset of 2 subjects, part of the training data of the 7 subjects. Table 2 compares the results achieved using the proposed method and the method used in [23], where multiclass Multiple Kernel Learning was used, while Figs. 5, 6 and 7 are indicative of the discriminative power of the proposed technique. Specifically, as the corresponding results suggest, using both modalities clearly helps to distinguish classes from each other that would not be separable using one modality alone. Moreover, classes similar to each other (sit down - stand up) can be effectively separated at dimensionalities of **b**_{j} explaining lower feature variances (Fig. 8). For classification of new feature vectors, less than 0.02 s were necessary, while training on the first 7 subjects requires about 25 s, using non-optimized Matlab code.
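A sketch of the lognormal weighting idea follows; since the exact form of (14) is not reproduced in the extracted text, the per-modality averaging and normalization used here are assumptions:

```python
import numpy as np

def lognormal_weights(X_per_mod, eta):
    """Modality weights from lognormal likelihoods: each feature of each
    modality c is scored under a lognormal fitted to the (positive) training
    samples, the per-modality average likelihood is scaled by eta^c, and the
    weights are normalized to sum to 1.  An assumed reading of (14)."""
    weights = []
    for X, eta_c in zip(X_per_mod, eta):
        logx = np.log(X)                               # requires X > 0
        mu, sd = logx.mean(axis=0), logx.std(axis=0) + 1e-9
        # lognormal pdf evaluated at the training samples, then averaged
        pdf = np.exp(-((logx - mu) ** 2) / (2 * sd ** 2)) \
              / (X * sd * np.sqrt(2 * np.pi))
        weights.append(eta_c * pdf.mean())
    w = np.array(weights)
    return w / w.sum()

rng = np.random.default_rng(4)
X1 = np.exp(rng.normal(0.0, 0.3, size=(50, 5)))   # well-behaved modality
X2 = np.exp(rng.normal(0.0, 2.0, size=(50, 5)))   # much noisier modality
w = lognormal_weights([X1, X2], [6.0, 6.0])
```

The output is a convex combination over modalities, so it can be plugged directly into the modality-dependent kernel widths of Section 3.1.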

## 5 Conclusions

In this paper, we used action-dependent basis vectors for projecting high-dimensional feature vectors to low-dimensional spaces. An affinity matrix between feature and basis vectors was constructed, instead of the full adjacency matrix. The proposed method caters for different action styles, while an online, adaptive modality-weighting scheme is introduced in the representation matrix. Evaluation on three publicly available datasets showed that the method is promising and that the proposed technique, building on multimodal spectral analysis, can achieve levels of accuracy comparable to or even higher than those of state-of-the-art methods in the field (Bag of Words, Hidden Markov Models, Support Vector Machines). Moreover, the proposed method provides an analytical approach to action recognition, using expressivity-dependent features. This can alleviate the constraints imposed by the Markovian assumption in HMMs and the large amounts of training data they require. Finally, as seen through the experiments, the method can be used in real-time applications, since only simple matrix operations are needed for inference; in each of the experiments, less than 0.02 s were needed per instance, using non-optimized code, which is a promising result for on-the-fly recognition of activities in a multimodal environment.

## Footnotes

- 1. Huawei/3DLife ACM Multimedia Grand Challenge for 2013

## Notes

### Acknowledgments

This work has been partly funded by the EU Horizon 2020 Framework Programme under grant agreement no. 690090 (ICT4Life project).

### References

- 1.Aggarwal JK, Ryoo MS (2011) Human activity analysis: a review. ACM Comput Surv 43(3):16
- 2.Asteriadis S, Chatzitofis A, Zarpalas D, Alexiadis DS, Daras P (2013) Estimating human motion from multiple kinect sensors. In: Proceedings of the 6th international conference on computer vision/computer graphics collaboration techniques and applications, p 3. ACM
- 3.Asteriadis S, Daras P (2015) Skeleton-based human action recognition using basis vectors. In: International conference on pervasive technologies related to assistive environments (PETRA)
- 4.Asteriadis S, Karpouzis K, Kollias SD (2008) A neuro-fuzzy approach to user attention recognition. In: 18th international conference on artificial neural networks (ICANN). Prague, 3–6 September 2008, pp 927–936
- 5.Caridakis G, Castellano G, Kessous L, Raouzaiou A, Malatesta L, Asteriadis S, Karpouzis K (2007) Expressive faces, gestures and speech in multimodal affective analysis. In: Boukis C, Pnevmatikakis A, Polymenakos L (eds) Artificial intelligence and innovations: from theory to applications, pp 375–388
- 6.Chen C, Liu M, Zhang B, Han J, Jiang J, Liu H. 3d action recognition using multi-temporal depth motion maps and fisher vector
- 7.Chen L, Wei H, Ferryman JM (2013) A survey of human motion analysis using depth imagery. Pattern Recogn Lett 34(15):1995–2006
- 8.Chen X, Cai D (2011) Large scale spectral clustering with landmark-based representation. In: AAAI conference on artificial intelligence
- 9.Delachaux B, Rebetez J, Perez-Uribe A, Mejia HFS (2013) Indoor activity recognition by combining one-vs.-all neural network classifiers exploiting wearable and depth sensors. In: Lecture notes in computer science, pp 216–223
- 10.Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118
- 11.He W, Guo Y, Gao C, Li X (2012) Recognition of human activities with wearable sensors. EURASIP J Adv Sig Proc 2012:108
- 12.Jain A, Gupta A, Rodriguez M, Davis LS (2013) Representing videos using mid-level discriminative patches. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2571–2578
- 13.Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
- 14.Kapsouras I, Nikolaidis N (2014) Action recognition on motion capture data using a dynemes and forward differences representation. J Vis Commun Image Represent 25(6):1432–1445
- 15.Ke Y, Sukthankar R, Hebert M (2007) Spatio-temporal shape and flow correlation for action recognition. In: 7th international workshop on visual surveillance
- 16.Kim E, Helal S, Cook D (2010) Human activity recognition and pattern discovery. IEEE Pervasive Comput 9(1):48–53. doi:10.1109/MPRV.2010.7
- 17.Kumari S, Mitra SK (2011) Human action recognition using dft. In: Computer vision, pattern recognition national conference on image processing and graphics, vol 0, pp 239–242
- 18.Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE conference on computer vision & pattern recognition (CVPR)
- 19.Lu WL, Little JJ (2006) Simultaneous tracking and action recognition using the pca-hog descriptor. In: The 3rd Canadian conference on computer and robot vision, p 6
- 20.Luo Y, Wu TD, Hwang JN (2003) Object-based analysis and interpretation of human motion in sports video sequences by dynamic bayesian networks. Comput Vis Image Underst 92(2–3):196–216
- 21.Nandakumar K, Wan KW, Chan SMA, Ng WZT, Wang JG, Yau WY (2013) A multi-modal gesture recognition system using audio, video, and skeletal joint data. In: Proceedings of the 15th ACM international conference on multimodal interaction, pp 475–482. ACM
- 22.Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems. MIT Press, pp 849–856
- 23.Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R (2013) Berkeley mhad: a comprehensive multimodal human action database. In: IEEE workshop on applications of computer vision, vol 0, pp 53–60
- 24.Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th international conference on multimedia, MULTIMEDIA ’07. ACM, New York, pp 357–360
- 25.Shen C, Chen L, Priebe CE (2015) Sparse representation classification beyond l1 minimization and the subspace assumption. arXiv preprint arXiv:1502.01368
- 26.Song Y, Morency LP, Davis R (2012) Multimodal human behavior analysis: learning correlation and interaction across modalities. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 27–30
- 27.Stork J, Spinello L, Silva J, Arras K (2012) Audio-based human activity recognition using non-markovian ensemble voting. In: IEEE international workshop on robots and human interactive communications (RO-MAN), pp 509–514
- 28.Sun L, Aizawa K (2013) Action recognition using invariant features under unexampled viewing conditions. In: Proceedings of the 21st ACM international conference on multimedia, MM ’13. ACM, New York, pp 389–392
- 29.Vantigodi S, Babu RV (2013) Real-time human action recognition from motion capture data. In: 2013 fourth national conference on computer vision, pattern recognition, image processing and graphics (NCVPRIPG). IEEE, pp 1–4
- 30.Veeraraghavan A, Roy-Chowdhury AK (2005) Matching shape sequences in video with applications in human movement analysis. IEEE Trans Pattern Anal Mach Intell 27:1896–1909
- 31.von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput
- 32.Wang X, Ji Q (2012) Learning dynamic bayesian network discriminatively for human activity recognition. In: Proceedings of the 21st international conference on pattern recognition (ICPR), pp 3553–3556
- 33.Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
- 34.Yang AY, Zhou Z, Balasubramanian AG, Sastry SS, Ma Y (2013) Fast l1-minimization algorithms for robust face recognition. IEEE Trans Image Process 22(8):3234–3246
- 35.Zappi P, Lombriser C, Stiefmeier T, Farella E, Roggen D, Benini L, Tröster G (2008) Activity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection. Springer
- 36.Zhang B, Perina A, Li Z, Murino V, Liu J, Ji R (2016) Bounding multiple gaussians uncertainty with application to object tracking. Int J Comput Vis 1–16
- 37.Zhang B, Perina A, Murino V, Del Bue A (2015) Sparse representation classification with manifold constraints transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4557–4565

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.