Abstract
A framework is presented for the prediction and classification of multichannel Motion Capture (MoCap) data, based on kernel adaptive filters and multikernel learning. To this end, a Kernel Adaptive Filter (KAF) algorithm extracts the dynamics of each channel, and the similarity between multiple realizations is measured through the Maximum Mean Discrepancy (MMD) criterion. To combine the dynamics extracted from all MoCap channels, centered kernel alignment (CKA) is used to assess the contribution of each channel to the classification task (that is, its relevance). Validation is performed on a database of tennis players, achieving good classification accuracy for the considered stroke classes. Besides, we find that the relevance of each channel agrees with the findings reported in the biomechanical analysis. Therefore, the combination of KAF and CKA allows building a proper representation for extracting relevant dynamics from multichannel MoCap data.
1 Introduction
In human action recognition using MoCap data, the primary efforts are directed at extracting sufficiently robust dynamics to model the movements performed under given actions [1]. In practice, the models are mostly oriented toward accurately classifying the executed actions, accounting for the relevance of the extracted feature sets but ignoring the contribution of the body segments and articulations (i.e., channel relevance). One obstacle to assessing channel relevance is the need to develop spatial filtering methods that provide an adequate biomechanical interpretation of how the movement is generated.
To deal with this issue, compact and meaningful dictionaries or codebooks that match physiological principles are built. To this end, Kernel Adaptive Filters (KAFs) are widely employed in time-series prediction tasks, since they encode the salient elements of the signals [2] and avoid the segmentation step within the feature extraction stage of human action recognition [3]. Furthermore, multiple dynamic models can be combined by kernel methods through different approaches, such as the CKA proposed in [4].
Provided a set of output labels, the supervised CKA algorithm employs a measure of the (dis)similarity between each basis kernel and the target kernel, yielding combination weights that estimate the relevance of each input kernel. In channel relevance tasks on multichannel MoCap time series, however, the construction of adequate basis kernel sets, which must be independent of each other, is still a challenging issue.
Here, to reveal the contribution of the channels involved in each action execution, a channel relevance methodology is presented to improve the performance of prediction and classification tasks using multichannel MoCap data. Initially, from the input data, the Kernel Adaptive Filter builds a codebook set as well as a vector of predicted outputs, which are further mapped into a Reproducing Kernel Hilbert Space (RKHS). Relying on the similarity between multiple realizations through the Maximum Mean Discrepancy criterion, we construct one basis kernel per channel. Then, CKA aligns the whole basis kernel set with the label set. As a result, we find that the relevance of each channel agrees with the findings reported in the biomechanical analysis. Therefore, the combination of KAF and CKA allows building a proper representation for extracting relevant dynamics from multichannel MoCap data.
2 Theoretical Framework
2.1 Dynamical Channel Model Encoded by Kernel Adaptive Filtering
We assume a scenario in which a set of J time series \(\mathbf{{x}}_j[t]\) is obtained from sensor measurements, with \(j=1,\dots ,J\). For each time series, T time steps are available, i.e. \(t=1,\dots ,T\). We collect the entire set of measurements in the matrix \({\varvec{X}} \in \mathbb {R}^{J\times T}\), which contains the J time series as its rows:
\[ {\varvec{X}} = \begin{bmatrix} x_1[1] & x_1[2] & \cdots & x_1[T] \\ \vdots & \vdots & \ddots & \vdots \\ x_J[1] & x_J[2] & \cdots & x_J[T] \end{bmatrix}. \qquad (1) \]
Throughout this paper, we assume that multiple sets are available, where the n-th set is represented as \(\varvec{X}^n\), with \(n=1,\dots ,N\). Also, to indicate that a time series belongs to a particular set n, we use the notation \(\mathbf{{x}}_j^n[t]\).
With the aim of properly modeling each time series \(\mathbf{{x}}_j\), its dynamic behavior is represented through Kernel Adaptive Filters (KAFs), so that the nonlinearities of the problem can be represented as a kernel expansion in terms of the training data:
\[ f(\mathbf{{x}}) = \sum _{r=1}^{R} \alpha _r\, \kappa (\mathbf{{x}}_r, \mathbf{{x}}), \qquad (2) \]
where \(\kappa (\cdot ,\cdot )\) is a kernel function, \(\mathbf{{x}}_r\) are the retained training inputs, and the weights \(\alpha _r\) are built using kernel least-mean-square (KLMS) algorithms. Here, we employ KAFs that enable tracking of non-stationary data with nonlinear relationships. Among KAF algorithms, we are interested in those that construct a dictionary set or codebook composed of R elements, which holds the most representative data points learned from the quantization process.
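The text does not spell out a particular KLMS variant, so the following is only an illustrative sketch of how a quantized KLMS filter can grow a codebook while learning the expansion weights. It is not the KRLS tracker adopted in Sect. 2.2, and the function names, the Gaussian kernel width `sigma`, the step size `step` and the quantization radius `eps` are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between one vector `a` and every row of `b`."""
    diff = b - a
    return np.exp(-np.sum(diff * diff, axis=1) / (2.0 * sigma ** 2))

def qklms(inputs, targets, sigma=1.0, step=0.5, eps=0.1):
    """One pass of a quantized KLMS filter (illustrative sketch).

    inputs:  (n, L) array of input vectors
    targets: (n,) array of desired outputs
    Returns the codebook (m, L) and its expansion weights (m,).
    """
    codebook = [inputs[0]]
    alpha = [step * targets[0]]
    for x, y in zip(inputs[1:], targets[1:]):
        centers = np.asarray(codebook)
        # Prediction error of the current expansion on the new sample.
        err = y - float(np.dot(alpha, gaussian_kernel(x, centers, sigma)))
        dists = np.linalg.norm(centers - x, axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] <= eps:
            # Close to an existing center: update its weight only.
            alpha[nearest] += step * err
        else:
            # Novel input: add it to the codebook as a new center.
            codebook.append(x)
            alpha.append(step * err)
    return np.asarray(codebook), np.asarray(alpha)

def predict(x, codebook, alpha, sigma=1.0):
    """Evaluate the kernel expansion at a single input vector `x`."""
    return float(np.dot(alpha, gaussian_kernel(x, codebook, sigma)))
```

A larger `eps` yields a smaller codebook at the cost of a coarser model; with `eps=0` every distinct input becomes a center.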
2.2 Model Construction and Similarity Measure
The KRLS tracker (KRLST) introduced in [5] assumes a set of ordered input-output pairs \(\{\mathbf{{x}}_j[t],y_j[t]\}\) in which the input is the time-embedded version of the series with L lags, \(\mathbf{{x}}_j[t] = \left[ x_j[t],x_j[t-1],\dots ,x_j[t-L+1]\right] \), and the desired output is the next sample, \(y_j[t] = x_j[t+1]\). In addition to the channel predictor (see Eq. (2)), applying the KRLS tracker [5] yields a codebook \(\mathbf{{c}}_j[r]\) and its estimated latent function outputs or desired values \(d_j[r]\). Consequently, we define the model associated with each time series as \(\mathcal {M}_j=\{\mathbf{{c}}_j[r],d_j[r], \ r=1,...,R\}\).
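The time embedding described above can be sketched in a few lines of NumPy. The helper name `time_embed` is hypothetical, and indices are 0-based, so the first usable row corresponds to \(t = L-1\).

```python
import numpy as np

def time_embed(series, lags):
    """Build one-step-ahead prediction pairs from a scalar time series.

    Row i of `inputs` is [x[t], x[t-1], ..., x[t-lags+1]] and
    `targets[i]` is x[t+1], for t = lags-1, ..., len(series)-2.
    """
    x = np.asarray(series, dtype=float)
    if lags < 1 or len(x) <= lags:
        raise ValueError("series must be longer than the number of lags")
    n = len(x) - lags
    # Column k holds the series delayed by k samples.
    inputs = np.stack(
        [x[lags - 1 - k: lags - 1 - k + n] for k in range(lags)], axis=1
    )
    targets = x[lags:]
    return inputs, targets
```

For a series of length T and L lags, this produces T - L input-output pairs.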
Next, we measure the similarity between models. Namely, let us consider two different models with samples \(\mathbf{p}_r = \left( \mathbf{c}_p[r], d_p[r] \right) \) and \(\mathbf{q}_r = \left( \mathbf{c}_q[r], d_q[r] \right) \). The elements of each model (the model samples), as given by KRLST, are not ordered. Therefore, any permutation or reordering of the elements represents the same model. Bearing this in mind, we interpret each model as a cluster of points in the input space. We now define a mapping from the set of models \(\mathcal {Z}\) to an RKHS as \(\varPhi : \mathcal {Z} \longrightarrow \mathcal {H}\), which maps \(\{ \mathbf{p}_r \}_{r=1}^{R} \longmapsto \{ \varPhi \left( \mathbf{p}_r \right) \}_{r=1}^{R}\). A model can be interpreted as a distribution \(\mathcal {P}\) from which R realizations are available. Then, to define a distance between models we resort to the Maximum Mean Discrepancy (MMD) defined by Gretton et al. in [6]. Given two models \(\mathcal {P}\) and \(\mathcal {Q}\), the MMD criterion computes the distance between them as
\[ \mathfrak {d}^2(\mathcal {P},\mathcal {Q}) = \left\| \frac{1}{R}\sum _{r=1}^{R} \varPhi (\mathbf{p}_r) - \frac{1}{R}\sum _{r=1}^{R} \varPhi (\mathbf{q}_r) \right\| ^2_{\mathcal {H}}. \qquad (3) \]
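Assuming a separable model that decouples the influence of the input and the output [7], the distance between models in Eq. (3) can be rewritten in terms of kernel matrices as
\[ \mathfrak {d}^2(\mathcal {P},\mathcal {Q}) = \frac{1}{R^2}\sum _{r,r'=1}^{R} \Big ( {\varvec{K}}_{pp}(r,r')\, d_p[r]\, d_p[r'] + {\varvec{K}}_{qq}(r,r')\, d_q[r]\, d_q[r'] - 2\, {\varvec{K}}_{pq}(r,r')\, d_p[r]\, d_q[r'] \Big ), \qquad (4) \]
where \({\varvec{K}}_{pq} (r,r') = \exp (-\Vert \mathbf{{c}}_p[r] - \mathbf{{c}}_q[r']\Vert ^2 / 2 \sigma ^2_{c})\) is a Gaussian kernel between codebook elements, and the product \(d_p[r]\, d_q[r']\) is a linear kernel on the outputs of each model.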
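As a rough illustration of this distance, the sketch below computes a biased empirical MMD between two models using the product of a Gaussian kernel on the codebook vectors and a linear kernel on the outputs. The function names and the default kernel width are assumptions for the example, and the uniform averaging over samples is one possible estimator rather than necessarily the exact one used in the experiments.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gaussian Gram matrix between the rows of `a` and the rows of `b`."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def model_distance_sq(c_p, d_p, c_q, d_q, sigma_c=1.0):
    """Biased squared MMD between two models (codebook, outputs).

    c_p, c_q: (R_p, L) and (R_q, L) codebook matrices
    d_p, d_q: (R_p,) and (R_q,) output vectors
    """
    def mean_kernel(ca, da, cb, db):
        # Separable kernel: Gaussian on inputs times linear on outputs.
        return float((gaussian_gram(ca, cb, sigma_c) * np.outer(da, db)).mean())

    value = (
        mean_kernel(c_p, d_p, c_p, d_p)
        + mean_kernel(c_q, d_q, c_q, d_q)
        - 2.0 * mean_kernel(c_p, d_p, c_q, d_q)
    )
    return max(value, 0.0)
```

Because the estimate only involves averages over all sample pairs, it is invariant to reordering the elements of either model, which matches the unordered nature of the codebooks.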
2.3 Relevance Assessment by Multikernel Learning
Let \({\varvec{X}}^n \in \mathbb {R}^{J \times T}\), \(n= 1, \ldots , N\), be a labeled set of J-dimensional time series. For the n-th multichannel time series we have a collection of J models that we denote as \(\{ \mathcal {M}_j[n] \}_{j=1}^J\). Let us denote as \(\mathbf{K}_j\) the \(N \times N\) kernel matrix that measures the (dis)similarities for the j-th channel between the N time series in the training data set. The element (n, m) of this kernel matrix is given by \( \mathbf{K}_j(n,m) = \exp \left( - \frac{\mathfrak {d}^2 (\mathcal {M}_j[n],\mathcal {M}_j[m])}{2\sigma ^2_{\mathfrak {d}}} \right) \), where \(\mathfrak {d}^2 (\mathcal {M}_j[n],\mathcal {M}_j[m])\) is the pairwise distance between models described in Sect. 2.2 (Eq. (4)).
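A minimal sketch of how one channel kernel could be assembled from pairwise model distances is given below. The helper `channel_gram` and its arguments are hypothetical; `dist_sq` stands for any function that returns a squared distance between two models, and the bandwidth `sigma` is left as a free parameter because the text does not state how it is chosen.

```python
import numpy as np

def channel_gram(models, dist_sq, sigma):
    """Gaussian kernel matrix built from pairwise squared model distances.

    models:  sequence of N model objects for one channel
    dist_sq: callable returning the squared distance between two models
    sigma:   kernel bandwidth
    """
    n = len(models)
    d2 = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so fill both triangles at once.
            d2[i, j] = d2[j, i] = dist_sq(models[i], models[j])
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Repeating this for every channel gives the J basis kernels that are later combined.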
To combine the information from the J channels, we propose to use a multikernel constructed as follows:
\[ {\varvec{K}} = \sum _{j=1}^{J} \alpha _j\, \mathbf{K}_j, \qquad (5) \]
where the weights \(\alpha _j\), \(j=1,\ldots , J\), are yet to be determined. To find informative weights that allow us to quantify the relevance of individual channels, we propose to use a centered kernel alignment procedure [4]. The basic idea is to find the optimal \(\alpha _j^*\) maximizing the alignment between the multikernel matrix \({\varvec{K}}\) and the target kernel matrix \({\varvec{K}}_{{\varvec{l}}}={\varvec{l}} {\varvec{l}}^\top \), which is calculated from the known class labels \({\varvec{l}}=\{l[i]\}_{i=1}^N\). For a given set of weights \(\alpha _j\), the centered correlation or alignment between the kernel matrices \({\varvec{K}}\) and \({\varvec{K}}_{\pmb {l}}\) is given by
\[ \rho ({\varvec{K}},{\varvec{K}}_{\pmb {l}},\mathbf{{\alpha }}) = \frac{\langle {\varvec{H}}{\varvec{K}}{\varvec{H}}, {\varvec{H}}{\varvec{K}}_{\pmb {l}}{\varvec{H}} \rangle _F}{\Vert {\varvec{H}}{\varvec{K}}{\varvec{H}} \Vert _F \, \Vert {\varvec{H}}{\varvec{K}}_{\pmb {l}}{\varvec{H}} \Vert _F}, \qquad (6) \]
where \({\varvec{H}} = {\varvec{I}}- N^{-1}\mathbf{{1}}\mathbf{{1}}^\top \) is a centering matrix, \({\varvec{I}} \in \mathbb {R}^{N \times N}\) is the identity matrix, \(\mathbf{{1}} \in \mathbb {R}^N\) is an all-ones vector, and the notations \(\langle \cdot ,\cdot \rangle _F\) and \(\Vert \cdot \Vert _F\) stand for the Frobenius inner product and the Frobenius norm, respectively.
Then, the optimal relevance weights are \(\alpha ^*=\arg \max _{\alpha } \ \rho ({\varvec{K}},{\varvec{K}}_{\pmb {l}},\mathbf{{\alpha }})\) subject to the constraint \(\Vert \alpha ^* \Vert =1\). This problem is solved by the Centered Kernel Alignment (CKA) algorithm [4].
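The alignment measure and the weighted combination can be sketched as follows. Note that `channel_weights` below is a simplified heuristic assumed for illustration: it scores each channel kernel independently against the target kernel and rescales the scores to unit norm, whereas the CKA algorithm of [4] solves a joint optimization over all weights. All function names are hypothetical.

```python
import numpy as np

def center(k):
    """Center a kernel matrix: H K H with H = I - (1/N) 1 1^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def centered_alignment(k1, k2):
    """Centered alignment between two kernel matrices of equal size."""
    k1c, k2c = center(k1), center(k2)
    denom = np.linalg.norm(k1c) * np.linalg.norm(k2c)  # Frobenius norms
    return float(np.sum(k1c * k2c) / denom)

def channel_weights(kernels, k_target):
    """Heuristic relevance: per-channel alignment, scaled to unit norm."""
    scores = np.array(
        [max(centered_alignment(k, k_target), 0.0) for k in kernels]
    )
    norm = np.linalg.norm(scores)
    if norm == 0.0:
        raise ValueError("no channel kernel is aligned with the target")
    return scores / norm

def multikernel(kernels, alpha):
    """Weighted sum of the channel kernels."""
    return sum(a * k for a, k in zip(alpha, kernels))
```

With labels `l`, the target kernel is `np.outer(l, l)`; a channel whose kernel reproduces the label structure receives a larger weight than an uninformative one.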
3 Experimental Setup
3.1 Database Description
The data were collected from 17 high-performance tennis players of the Caldas-Colombia tennis league. Infrared videography with 23 optical markers was recorded by six cameras covering the sagittal, frontal, and lateral planes, and the skeleton and multichannel time series were estimated in Optitrack Arena®. All subjects were encouraged to hit the ball with the same velocity and action as they would in a match. They were instructed to perform one continuous 30 s series of each indicated stroke. The strokes indicated in each record were: serve, forehand, backhand, volley, backhand volley, and smash.
3.2 MoCap Data Preprocessing
Let \({\varvec{U}} \in \mathbb {R}^{T \times (J \times D)}\) be a multichannel input matrix that holds T frames and \(J \times D\) channels, where J is the number of joints of the body model. Each \({\varvec{U}}_j = \left\{ \mathbf{{u}}_{ij} \in \mathbb {R}^D: i = 1,\dots ,T \right\} \) assembles the time behavior of the D-dimensional body joint j. Initially, all channels are centered with respect to the limb center. Then, to describe the time behavior of the j-th body joint from \({\varvec{U}}_j\), we perform a dimensionality reduction stage \(\mathbb {R}^D \rightarrow \mathbb {R}\) to obtain a compact representation. In this case, from the covariance matrix \({\varvec{W}} \in \mathbb {R}^{D \times D}\) we keep only the first principal eigenvector \(\mathbf{{w}}_1\), that is, the eigenvector associated with the largest eigenvalue of \({\varvec{W}}\). Then, we obtain the linear projection \(\mathbf{{x}}_j = {\varvec{U}}_j \mathbf{{w}}_1\), where \(\mathbf{{w}}_1 \in \mathbb {R}^{D \times 1}\).
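The projection of one joint trajectory onto its first principal direction can be sketched with NumPy as below. The function name `project_joint` is hypothetical, and the input is assumed to be already centered with respect to the limb center as described above; the sign of the eigenvector is arbitrary.

```python
import numpy as np

def project_joint(u_j):
    """Reduce a (T, D) joint trajectory to a single (T,) channel.

    Returns the projection x_j = U_j w_1 and the direction w_1, where
    w_1 is the eigenvector of the D x D covariance matrix with the
    largest eigenvalue.
    """
    u = np.asarray(u_j, dtype=float)
    cov = np.cov(u, rowvar=False)           # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    w1 = eigvecs[:, -1]                     # direction of largest variance
    return u @ w1, w1
```

For 3D marker positions (D = 3), each joint therefore contributes one scalar time series to the matrix of channels.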
3.3 Model Estimation and Similarity Measure
We compute each model \(\mathcal {M}_j\) through the KRLST algorithm with the following parameters: forgetting factor 1, time embedding \(L=6\), codebook size \(R=50\), regularization parameter \(\lambda =10^{-6}\), and a Gaussian kernel with \(\sigma \) calculated as the median value of channel \(\mathbf{{x}}_j\). The initial codebooks are built directly from the input time series \(\mathbf{{x}}_j \in \mathbb {R}^{T \times 1}\). Each model is validated on a simple task: predicting \(x[t+1]\) from the data available up to time t.
Figure 1 shows the mean prediction error in each channel j for all sets of multichannel data, in this case N = 102. Although the number of outliers looks high, the mean error is low and regular, which is notable given the high inter-subject and inter-class variability. Besides, our approach works with the full-length, single-take 30 s recordings in which several continuous actions were captured. There are approximately 12 to 16 strokes in each individual record. It is worth noting that segmentation and selection of actions are not required in our modeling process.
Besides, our proposed functional \(\mathfrak {d}^2\) allows us to construct a kernel similarity measure \(\kappa (\mathcal {M}_{j}[n],\mathcal {M}_{j}[m])\) that highlights each group of actions without prior information about the classes. In Fig. 2(a) we can see the block-diagonal structure of the Gram matrix \({\varvec{K}}\) constructed over the records of the right wrist joint. In fact, the KPCA 2D embedding in Fig. 2(b) shows the separability between groups of records, which are colored according to their true labels.
4 Relevance and Classification Results
Once the multikernel \({\varvec{K}}\) from Eq. (5) is constructed, it allows us to compare multichannel data, so that we can apply any kernel-based classifier. In this work, we use a kernel k-nearest-neighbor (KNN) classifier. The KNN classifier finds the k samples in the training dataset closest to the test sample (those with maximum similarity) and carries out a majority vote. Classification performance and relevance are computed using a cross-validation scheme.
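A kernel nearest-neighbor vote of this kind can be sketched as follows. The function name and the tie-breaking rule are assumptions for the example: neighbors are ranked by decreasing kernel similarity, and ties in the vote are resolved in favor of the class of the most similar neighbor among the tied classes.

```python
from collections import Counter

import numpy as np

def kernel_knn_predict(k_test_train, train_labels, k=3):
    """Classify test samples from a precomputed similarity matrix.

    k_test_train: (n_test, n_train) kernel values between test and
                  training samples (larger means more similar)
    train_labels: sequence of n_train class labels
    """
    predictions = []
    for row in np.asarray(k_test_train):
        # Indices of the k most similar training samples.
        nearest = np.argsort(-row, kind="stable")[:k]
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

For example, `kernel_knn_predict(K[test][:, train], labels_train, k=5)` would classify the held-out records of one cross-validation fold, where `K`, `test`, `train` and `labels_train` are placeholders for the multikernel matrix, the fold indices and the training labels.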
Figure 3 shows a boxplot of the attained \(\alpha \) values. In particular, the body joints at the ends of the limbs are the most relevant. These channels highlight the differences between the six classes of executed actions. Nonetheless, the variability observed in the most relevant channels implies a strong dependency on the execution; namely, the angle of the racquet at the moment of impact varies with the relation between the wrist and finger channels.
Regarding the classification results, as can be seen in Fig. 4(a), accuracies over 90% are attained for a number of nearest neighbors ranging from 1 to 9. In Fig. 4(b), the lowest results must be analyzed in light of the action itself: the backhand presents low ball speeds after the impact, which are close to the speeds obtained in volley stroke executions. Nevertheless, each classified record contains 12 to 16 continuous stroke executions without segmentation, so the confused actions depend on the execution speed across the 30 s record.
4.1 Discussion and Concluding Remarks
The proposed framework for multichannel MoCap analysis presents a methodology that, first, obtains an appropriate and individual representation of the dynamics of each channel and, second, uses this KAF-based channel representation to measure the similarity between several realizations. In fact, this framework fits naturally with a multikernel algorithm such as CKA, which merges multiple channels into a single kernel that can be used in classification tasks. It can be seen that CKA reveals the most significant channels in a set of actions, and these results are consistent with the biomechanical theory of tennis stroke execution [8].
This framework should be extended to analyze the optimal number and placement of sensors in human action recognition tasks, regardless of their source: optical markers, inertial sensors, or depth cameras. Besides, human motion involves an interaction between all body segments: every action has a biomechanical chain that produces it, so the relevance of channels must give information about the most relevant body segments involved over time. The results encourage us to develop an algorithm for biomechanical chain generation without kinetic information, just from skeleton representations of actions.
As future work, this framework must be validated on larger action datasets, as well as evaluated in the assessment of motor disorders to check whether the relevance shows alterations in specific body segments or articulations.
References
1. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition. J. Vis. Commun. Image Represent. 25(1), 24–38 (2014)
2. Van Vaerenbergh, S., Santamaría, I.: A comparative study of kernel adaptive filtering algorithms. In: 2013 IEEE DSP/SPE Meeting, pp. 181–186, August 2013. Software available at https://github.com/steven2358/kafbox/
3. Pulgarin-Giraldo, J.D., Alvarez-Meza, A.M., Melo-Betancourt, L.G., Ramos-Bermudez, S., Castellanos-Dominguez, G.: A similarity indicator for differentiating kinematic performance between qualified tennis players. In: Beltrán-Castañón, C., Nyström, I., Famili, F. (eds.) CIARP 2016. LNCS, vol. 10125, pp. 309–317. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52277-7_38
4. Cortes, C., Mohri, M., Rostamizadeh, A.: Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res. 13(1), 795–828 (2012)
5. Van Vaerenbergh, S., Lazaro-Gredilla, M., Santamaria, I.: Kernel recursive least-squares tracker for time-varying regression. IEEE Trans. Neural Netw. Learn. Syst. 23(8), 1313–1326 (2012)
6. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
7. Álvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for vector-valued functions: a review. Found. Trends Mach. Learn. 4(3), 195–266 (2012)
8. Landlinger, J., Lindinger, S., Stoggl, T., Wagner, H., Muller, E.: Key factors and timing patterns in the tennis forehand of different skill levels. J. Sports Sci. Med. 9, 643–651 (2010)
Acknowledgments
This work is supported by the project 36075 and mobility grant 8401 funded by Universidad Nacional de Colombia sede Manizales, by program “Doctorados Nacionales 2014” number 647 funded by COLCIENCIAS, as well as PhD financial support from Universidad Autónoma de Occidente.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Pulgarin-Giraldo, J.D., Alvarez-Meza, A.M., Van Vaerenbergh, S., Santamaría, I., Castellanos-Dominguez, G. (2019). Analysis and Classification of MoCap Data by Hilbert Space Embedding-Based Distance and Multikernel Learning. In: Vera-Rodriguez, R., Fierrez, J., Morales, A. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2018. Lecture Notes in Computer Science(), vol 11401. Springer, Cham. https://doi.org/10.1007/978-3-030-13469-3_22
DOI: https://doi.org/10.1007/978-3-030-13469-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13468-6
Online ISBN: 978-3-030-13469-3
eBook Packages: Computer Science, Computer Science (R0)