
1 Introduction

Action recognition is a key research domain in video/image processing and computer vision, and it is nowadays ubiquitous in human-robot interaction, autonomous driving, elderly care and video-surveillance, to name a few applications [21]. Yet, challenges arise from visual ambiguities (illumination variations, texture of clothing, general background noise, view heterogeneity, occlusions). As an effective countermeasure, joint-based skeletal representations (extracted from depth images) are a viable solution.

Combined with a skeletal representation, the symmetric and positive definite (SPD) covariance operator achieves sound performance in 3D action recognition [5, 9, 22]. Indeed, while properly modeling the skeletal dynamics with a second-order statistic, the covariance operator also naturally handles action instances of different temporal duration. This avoids slow pre-processing stages such as time warping or interpolation [20]. In addition, the strength of this representation is attested by state-of-the-art performance obtained with a relatively simple classification pipeline [5, 22] where, basically, a non-linear Support Vector Machine (SVM) is trained using the Log-Euclidean kernel

$$\begin{aligned} K_{\ell E}(\mathbf {X},\mathbf {Y}) = \exp \left( - \dfrac{1}{2 \sigma ^2} \Vert \log \mathbf {X} - \log \mathbf {Y} \Vert _F^2\right) \end{aligned}$$
(1)

to compare covariance operators \(\mathbf {X}\), \(\mathbf {Y}\). In (1), for any SPD matrix \(\mathbf {X}\), we define

$$\begin{aligned} \log \mathbf {X} = \mathbf {U} \mathrm{diag} (\log \lambda _1, \dots , \log \lambda _d) \mathbf {U}^\top , \end{aligned}$$
(2)

where \(\mathbf {U}\) is the matrix of eigenvectors which diagonalizes \(\mathbf {X}\) in terms of the eigenvalues \(\lambda _1 \ge \dots \ge \lambda _d > 0\). Intuitively, for any fixed bandwidth \(\sigma > 0\), \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) computes a Gaussian radial basis function that compares the covariance operators \(\mathbf {X}\) and \(\mathbf {Y}\) through the Frobenius norm \(\Vert \cdot \Vert _F\) of their difference after log-projection. Computationally, the latter stage is not problematic (see Sect. 3) and can be performed once for each covariance operator before computing the kernel. Together with its formal properties in Riemannian geometry, this makes (1) widely used in practice [5, 9, 22].
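For illustration, (1)-(2) can be computed with a few lines of NumPy/SciPy (a minimal sketch, not our optimized implementation):

```python
import numpy as np
from scipy.linalg import eigh

def log_spd(X):
    # Matrix logarithm of an SPD matrix via its eigendecomposition, as in Eq. (2).
    w, U = eigh(X)                                   # eigenvalues are strictly positive for SPD input
    return (U * np.log(w)) @ U.T

def k_log_euclidean(X, Y, sigma=1.0):
    # Log-Euclidean kernel of Eq. (1): a Gaussian RBF on the log-projected matrices.
    D = log_spd(X) - log_spd(Y)
    return np.exp(-np.linalg.norm(D, 'fro') ** 2 / (2.0 * sigma ** 2))
```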

However, the modern big data regime undermines the applicability of such a kernel function. Indeed, since (1) has to be computed for every pair of instances in the dataset, the resulting Gram matrix has prohibitive size: its storage becomes time- and memory-expensive, and the related computations (required to train the model) are simply unfeasible.

This inconvenience can be solved as follows. According to well-established kernel theory [2], the kernel (1) induces an infinite-dimensional feature map \(\varphi \), meaning that \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = \langle \varphi (\mathbf {X}), \varphi (\mathbf {Y}) \rangle \). However, if we can obtain an explicit feature map \(\varPhi \) such that \(K_{\ell E}(\mathbf {X},\mathbf {Y}) \approx \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \), we can directly compute a finite-dimensional feature representation \(\varPhi (\mathbf {X})\) for each action instance separately. Then, with a compact \(\varPhi \), we can train a linear SVM instead of its kernelized version, which is totally feasible and quite efficient even in the big data regime [7]. Therefore, the whole pipeline actually provides a scalable implementation of a Log-Euclidean SVM, whose cost is reduced from quadratic to linear.
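For illustration, the two training regimes compare as follows (a sketch only: G_train is a hypothetical N x N precomputed Log-Euclidean Gram matrix, Phi_train holds one \(\nu \)-dimensional feature vector per instance, y_train are the class labels):

```python
from sklearn.svm import SVC, LinearSVC

# Exact route: the full N x N Gram matrix must be computed and stored (quadratic cost).
kernel_svm = SVC(kernel='precomputed').fit(G_train, y_train)

# Approximate route: one explicit feature vector per instance, linear training cost.
linear_svm = LinearSVC().fit(Phi_train, y_train)
```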

In our work we specifically tackle the aforementioned issue through the following main contributions.

  1. We propose a novel, compact and explicit feature map to approximate the Log-Euclidean kernel within a probabilistic framework.

  2. We provide a rigorous mathematical formulation, proving that the proposed approximation has null bias and bounded variance.

  3. We compare the proposed feature map against alternative approximation schemes, showing the formal superiority of our framework.

  4. We experimentally evaluate our method against the very same approximation schemes on six 3D action recognition datasets, confirming our theoretical findings in practice.

The rest of the paper is outlined as follows. In Sect. 2 we review the most relevant related literature. Section 3 proposes the novel approximation and discusses its foundation. We compare it with alternative paradigms in Sect. 4. Section 5 draws conclusions, and Appendix A reports all proofs of our theoretical results.

2 Related Work

In this Section, we summarize the most relevant works in both covariance-based 3D action recognition and kernel approximation.

Originally envisaged for image classification and detection tasks, the covariance operator has attracted growing interest in action recognition, giving rise to several research trends: [9] extends it to the infinite-dimensional case, while [10] hierarchically combines it in a temporal pyramid; [12, 22] investigate the conceptual analogy with trial-specific kernel matrices, and [5] further proposes a new kernelization to model arbitrary, non-linear relationships conveyed by the raw data. However, those kernel methods usually do not scale up easily to big datasets due to demanding storage and computational costs. As a solution, the exact kernel representation can be replaced by an approximated, more efficient version. In the literature, this is done according to the following mainstream approaches.

  (i) The kernel Gram matrix is replaced with a surrogate low-rank version, in order to alleviate both memory and computational costs. Within these methods, [1] applied Cholesky decomposition and [24] exploited Nyström approximation.

  (ii) Instead of the exact kernel function k, an explicit feature map \(\varPhi \) is computed, so that the induced linear kernel \(\langle \varPhi (\mathbf {x}),\varPhi (\mathbf {y}) \rangle \) approximates \(k(\mathbf {x},\mathbf {y})\). Our work belongs to this class of methods.

In this context, Rahimi and Recht [17] exploited the formalism of the Fourier transform to approximate shift-invariant kernels \(k(\mathbf {x},\mathbf {y}) = k(\mathbf {x}-\mathbf {y})\) through an expansion of trigonometric functions. Leveraging a similar idea, Le et al. [13] sped up the computation by exploiting the Walsh-Hadamard transform, downgrading the running cost of [17] from linear to log-linear with respect to the data dimension. Recently, Kar and Karnick [11] proposed an approximated feature map for dot product kernels \(k(\mathbf {x},\mathbf {y}) = k(\langle \mathbf {x},\mathbf {y} \rangle )\) by directly exploiting the MacLaurin expansion of the kernel function.
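For reference, a minimal sketch of the random Fourier feature construction of [17] for a Gaussian kernel (illustrative only, not the original code of [17]):

```python
import numpy as np

def random_fourier_features(X, nu, sigma, rng):
    """Random Fourier features approximating the Gaussian (shift-invariant) kernel,
    in the spirit of Rahimi and Recht [17]; X has shape (n_samples, d)."""
    d = X.shape[1]
    Omega = rng.standard_normal((d, nu)) / sigma     # frequencies drawn from N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=nu)       # random phases
    return np.sqrt(2.0 / nu) * np.cos(X @ Omega + b)
```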

Instead of considering a generic class of kernels, our work specifically focuses on the Log-Euclidean one, approximating it through a novel unbiased estimator which ensures an explicit bound on the variance (otherwise only provided by [13]) and results in superior classification performance with respect to [11, 13, 17].

3 The Proposed Approximated Feature Map

In this Section, we present the main theoretical contribution of this work, namely (i) a random, explicit feature map \(\varPhi \) such that \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \approx K_{\ell E}(\mathbf {X},\mathbf {Y})\), (ii) the proof of its unbiasedness and (iii) a strong theoretical bound on its variance.

Construction of the Approximated Feature Map. In order to construct a \(\nu \)-dimensional feature map \(\mathbf {X} \mapsto \varPhi (\mathbf {X}) = [\varPhi _1(\mathbf {X}),\dots ,\varPhi _\nu (\mathbf {X})] \in \mathbb {R}^{\nu }\) for any \(d \times d\) SPD matrix \(\mathbf {X}\), we fix a probability distribution \(\rho \) supported over \(\mathbb {N}\). Each component \(\varPhi _1, \dots , \varPhi _\nu \) of \(\varPhi \) is then independently computed according to the following algorithm.

[Figure a (algorithm box in the original layout): the construction of each component \(\varPhi _j(\mathbf {X})\) of the feature map, defining Eq. (3); not reproduced here.]

The genesis of (3) can be explained by inspecting the feature map \(\varphi \) associated with the kernel \(K(x,y)=\exp \left( -\frac{1}{2\sigma ^2}|x-y|^2\right) \), where \(x,y \in \mathbb {R}\) for simplicity. One obtains \(\varphi (x) \propto \left[ 1, \sqrt{\frac{1}{1!\sigma ^2}}x,\sqrt{\frac{1}{2!\sigma ^4}}x^2,\sqrt{\frac{1}{3!\sigma ^6}}x^3,\dots \right] .\) Intuitively, (3) approximates the infinite-dimensional \(\varphi (x)\) by randomly selecting one of its components: this is the role played by \(n \sim \rho \). In addition, we introduce the \(\log \) mapping and replace the exponentiation with a Kronecker product. As a consequence, the random weights \(\mathbf {W}\) ensure that \(\varPhi (\mathbf {X})\) achieves a sound approximation of (1), in terms of unbiasedness and rapidly decreasing variance.
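For illustration, this construction can be sketched as follows. Since the exact constants of (3) are given in the algorithm above, the normalization used below is only one plausible choice, derived from the Maclaurin expansion of the Gaussian kernel under the unit Frobenius norm assumption of Theorem 1; it should not be read as our exact formula. Note that the same random draws \((n_j, \mathbf {W}_j)\) must be shared by every matrix for the inner product of features to estimate the kernel.

```python
import numpy as np
from math import factorial
from scipy.linalg import eigh

def log_spd(X):
    # Matrix logarithm via eigendecomposition, Eq. (2).
    w, U = eigh(X)
    return (U * np.log(w)) @ U.T

def make_feature_map(d, nu, sigma, theta, rng):
    """Draw the random parameters (n_j, W_j) once and reuse them for every matrix,
    so that <Phi(X), Phi(Y)> estimates K_lE(X, Y) (illustrative normalization)."""
    params = []
    for _ in range(nu):
        # n ~ rho, here a geometric law over {0, 1, 2, ...}; theta close to 1 keeps n,
        # and hence the Kronecker power size (d*d)**n, small with high probability.
        n = int(rng.geometric(p=theta)) - 1
        W = rng.standard_normal((d * d) ** n)          # Gaussian weights matching vec(log X)^{kron n}
        coef = np.exp(-0.5 / sigma ** 2) / np.sqrt(
            theta * (1.0 - theta) ** n * factorial(n) * sigma ** (2 * n))
        params.append((n, W, coef))

    def phi(X):
        v = log_spd(X)
        v = (v / np.linalg.norm(v, 'fro')).ravel()     # unit Frobenius norm, then vectorize
        feats = np.empty(nu)
        for j, (n, W, coef) in enumerate(params):
            t = np.ones(1)
            for _ in range(n):                         # log(X)^{kron n} as repeated Kronecker products
                t = np.kron(t, v)
            feats[j] = coef * (W @ t)
        return feats / np.sqrt(nu)

    return phi
```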

In the rest of the Section we discuss the theoretical foundation of our analysis; for convenience, all proofs are deferred to Appendix A.

Unbiased Estimation. In order for a statistical estimator to be reliable, we need it to be at least unbiased, i.e., its expected value must be equal to the exact function it is approximating. The unbiasedness of the feature map \(\varPhi \) of Eq. (3) for the Log-Euclidean kernel (1) is proved by the following result.

Theorem 1

Let \(\rho \) be a generic probability distribution over \(\mathbb {N}\) and consider \(\mathbf {X}\) and \(\mathbf {Y}\), two generic SPD matrices such that \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _{F} = 1.\) Then, \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) is an unbiased estimator of (1). That is

$$\begin{aligned} \mathbb {E}[\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle ] = K_{\ell E} (\mathbf {X}, \mathbf {Y}), \end{aligned}$$
(4)

where the expectation is computed over n and \(\mathbf {W}\) which define \(\varPhi _j(\mathbf {X})\) as in (3).

When averaging over all possible realizations of n sampled from \(\rho \) and of the Gaussian weights \(\mathbf {W}\), Theorem 1 guarantees that the linear kernel \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) induced by \(\varPhi \) equals \(K_{\ell E}(\mathbf {X},\mathbf {Y})\). This formalizes the unbiasedness of our approximation.

On the Assumption \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _{F} = 1\). From a practical point of view, this assumption may seem unfavorable, but this is not the case. The reason is provided by Eq. (2), which gives a very convenient way to compute the logarithm of an SPD matrix. Since in (3) \(\varPhi (\mathbf {X})\) depends explicitly on \(\log \mathbf {X}\), we can simply use (2) and then divide each entry of the obtained matrix by \(\Vert \log \mathbf {X} \Vert _F\). This is a non-restrictive strategy to satisfy our assumption and is analogous to requiring input vectors to have unit norm, which is very common in machine learning [2].
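In code, reusing the log_spd helper from the sketch above, this normalization amounts to:

```python
L = log_spd(X)                          # Eq. (2)
L /= np.linalg.norm(L, 'fro')           # now ||log X||_F = 1, as required by Theorem 1
```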

Low-Variance. One can note that, in Theorem 1, even choosing \(\nu = 1\) (a scalar feature map) makes \(\varPhi (\mathbf {X}) = [\varPhi _1(\mathbf {X})] \in \mathbb {R}\) unbiased for (1). However, since \(\varPhi \) is a finite approximation of the exact infinite-dimensional feature map associated with (1), one would expect the quality of the approximation to be very poor in the scalar case and to improve as \(\nu \) grows larger. This is indeed true, as proved by the following statement.

Theorem 2

The variance of \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) as estimator of (1) can be explicitly bounded. Precisely,

$$\begin{aligned} \mathbb {V}_{n,\mathbf {W}}(K_\varPhi (\mathbf {X},\mathbf {Y})) \le \dfrac{\mathcal {C}_\rho }{\nu ^3} \exp \left( \dfrac{3 - 2\sigma ^2}{\sigma ^4}\right) , \end{aligned}$$
(5)

where \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _F = 1\) and the variance is computed over all possible realizations of \(n \sim \rho \) and \(\mathbf {W}\), the latter being element-wise sampled from a \(\mathcal {N}(0,\sigma ^2/\sqrt{n})\) distribution. Also, \(\mathcal {C}_\rho {\mathop {=}\limits ^\mathrm{def}} \sum _{n = 0}^{\infty } \frac{1}{\rho (n) n!}\), the series being convergent.

Let us discuss the bound on the variance provided by Theorem 2. Since the bandwidth \(\sigma \) of the kernel function (1) we want to approximate is fixed, the term \(\exp \left( \frac{3 - 2\sigma ^2}{\sigma ^4} \right) \) can be left out of our analysis. The bound in (5) is linear in \(\mathcal {C}_\rho \) and inversely cubic in \(\nu \). When \(\nu \) grows, the increased dimensionality of our feature map \(\varPhi \) makes the variance vanish rapidly, ensuring that the approximated kernel \(K_\varPhi (\mathbf {X}, \mathbf {Y})= \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) converges to the target one, that is \(K_{\ell E}\). Such a trend may be harmed by large values of \(\mathcal {C}_\rho .\) Since the latter depends on the distribution \(\rho \), let us fix it to be the geometric distribution \(\mathcal {G}(\theta )\) with parameter \(0 \le \theta < 1\). This yields

$$\begin{aligned} \mathcal {C}_\rho \propto \sum _{n = 0}^\infty \dfrac{1}{(1 - \theta )^n \cdot n!} = \exp \left( \dfrac{1}{1 - \theta } \right) . \end{aligned}$$
(6)

There is a trade-off between a low variance (i.e., a small \(\mathcal {C}_\rho \)) and a reduced computational cost for \(\varPhi \) (i.e., a small n). Indeed, choosing \(\theta \approx 1\) makes \(\mathcal {C}_\rho \) large in (6). In this case, the integer n sampled from \(\rho = \mathcal {G}(\theta )\) is small with high probability: this leads to a reduced number of Kronecker products to be computed in \(\log (\mathbf {X})^{\otimes n}\). Conversely, when \(\theta \approx 0\), although n and the related computational cost of \(\log (\mathbf {X})^{\otimes n}\) are likely to grow, \(\mathcal {C}_\rho \) is small, ensuring a low variance for the estimator.
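As a small numerical illustration of this trade-off, assuming the parameterization \(\rho (n) = \theta (1-\theta )^n\) over \(n = 0, 1, 2, \dots \) (which matches the series in (6) up to a constant factor):

```python
import numpy as np

rng = np.random.default_rng(0)
for theta in (0.2, 0.5, 0.9):
    n = rng.geometric(p=theta, size=100_000) - 1      # n ~ G(theta) over {0, 1, 2, ...}
    c_rho = np.exp(1.0 / (1.0 - theta))               # C_rho up to a constant, Eq. (6)
    print(f"theta={theta:.1f}  E[n]={n.mean():.2f}  C_rho (up to const.)={c_rho:.1f}")
# theta close to 1: few Kronecker products (small n) but a large C_rho (looser variance bound);
# theta close to 0: more Kronecker products but a small C_rho (tighter bound).
```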

As a final theoretical result, Theorems 1 and 2 immediately yield that

$$\begin{aligned} \mathbb {P}\left[ \left| K_\varPhi (\mathbf {X}, \mathbf {Y}) - K_{\ell E}(\mathbf {X}, \mathbf {Y}) \right| \ge \epsilon \right] \le \frac{\mathcal {C}_\rho }{\nu ^3 \epsilon ^2} \exp \left( \frac{3 - 2\sigma ^2}{\sigma ^4}\right) \end{aligned}$$
(7)

for every pair of SPD matrices \(\mathbf {X},\mathbf {Y}\) whose logarithms have unit Frobenius norm and every \(\epsilon > 0\), as a straightforward implication of the Chebyshev inequality. This ensures that \(K_\varPhi \) differs in absolute value from \(K_{\ell E}\) by more than \(\epsilon \) only with a (low) probability, which is inversely cubic in \(\nu \) and inversely quadratic in \(\epsilon \).

Final Remarks. To sum up, we have presented a constructive algorithm to compute a \(\nu \)-dimensional feature map \(\varPhi \) whose induced linear kernel is an unbiased estimator of the Log-Euclidean one. Additionally, we ensure an explicit bound on the variance which rapidly vanishes as \(\nu \) grows (inversely cubic decrease). This implies that \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) and \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) are equal with very high probability, even at low \(\nu \) values. This implements the Log-Euclidean kernel in a scalable manner, downgrading the quadratic cost of computing \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) for every pair \(\mathbf {X},\mathbf {Y}\) into the linear cost of evaluating the feature map \(\varPhi (\mathbf {X})\) as in (3) for every \(\mathbf {X}\). In practice, this achieves a linear implementation of the Log-Euclidean SVM.

4 Results

In this Section, we compare our proposed approximated feature map with the alternative ones by Rahimi and Recht [17], Kar and Karnick [11] and Le et al. [13] (see Sect. 2).

Theoretical Comparison. Let us notice that all the approaches [11, 13, 17] are also applicable to the Log-Euclidean kernel (1). Indeed, [13, 17] cover our case of study since \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = k(\log \mathbf {X} - \log \mathbf {Y})\) is shift invariant with respect to the log-projected matrices. At the same time, thanks to the assumption \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _F = 1\) as in Theorem 1, we obtain \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = k(\langle \log \mathbf {X}, \log \mathbf {Y} \rangle )\) (see (13) in Appendix A), thus satisfying the hypothesis of Kar and Karnick [11].

Like our estimator (Theorem 1), all the works [11, 13, 17] can also guarantee an unbiased estimation of the exact kernel function.

What actually makes our approach superior is the explicit bound on the variance (see Table 1). Indeed, [11, 17] are totally lacking in this respect. Moreover, although an analogous bound is provided in [13, Theorem 4], it only ensures a \(O(1/\nu )\) decrease rate of the variance with respect to the feature dimensionality \(\nu \). Differently, we can guarantee a \(O(1/{\nu ^3})\) trend. This implies that we achieve a better approximation of the kernel with a lower-dimensional feature representation, which eases the training of the linear SVM [7].

Table 1. Comparison of explicit bounds on variance between the proposed approximation and [11, 13, 17]: the quicker the decrease, the better the bound. Here, \(\nu \ge 1\) denotes the dimensionality of the approximated feature vector.

Experimental Comparison. We report here the experimental comparison on 3D action recognition between our proposed approximation and the paradigms of [11, 13, 17].

Datasets. For the experiments, we considered UTKinect [23], Florence3D [19], MSR-Action-Pairs (MSR-pairs) [16], MSR-Action3D [3, 14], HDM-05 [15] and MSRC-Kinect12 [8] datasets.

For each dataset, we follow the usual training and testing splits proposed in the literature. For Florence3D and UTKinect, we use the protocols of [20]. For MSR-Action3D, we adopt the splits originally proposed by [14]. On MSRC-Kinect12, once highly corrupted action instances are removed as in [10], training is performed on the odd-index subjects, while testing on the even-index ones. On HDM-05, training exploits all instances of the “bd” and “mm” subjects, while “bk”, “dg” and “tr” are left out for testing [22], using the 65-action-class protocol of [6].

Fig. 1. Experimental comparison of our approximation (red curves) against the schemes of Rahimi and Recht [17] (pink curves), Kar and Karnick [11] (green curves) and Le et al. [13] (blue curves). Best viewed in colors.

Data Preprocessing. As a common pre-processing step, we normalize the data by computing, for each timestamp, the relative displacements between the \(x\)-\(y\)-\(z\) coordinates of all joints and those of the hip (central) joint.
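For illustration, this pre-processing and the covariance descriptor of Sect. 1 can be sketched as follows (the hip joint index and the small diagonal ridge are illustrative assumptions, not protocol details):

```python
import numpy as np

def covariance_descriptor(skeleton, hip_index=0, eps=1e-6):
    """skeleton: array of shape (T, J, 3) with J joints over T timestamps."""
    rel = skeleton - skeleton[:, hip_index:hip_index + 1, :]   # displacements w.r.t. the hip joint
    rel = rel.reshape(skeleton.shape[0], -1)                   # one (3J)-dimensional sample per timestamp
    C = np.cov(rel, rowvar=False)                              # second-order statistic of the sequence
    return C + eps * np.eye(C.shape[0])                        # small ridge keeps the matrix SPD
```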

Results. Figure 1 reports the quantitative performance while varying \(\nu \) in the range 10, 20, 50, 100, 200, 500, 1000, 2000, 5000. When comparing with [13], since the input data size must be a multiple of a power of 2, we zero-padded our vectorized representation to match 4096 and (whenever possible) 2048 and 1024 input dimensions. These cases are then compared with the results obtained with \(\nu =\) 5000, 2000 and 1000, respectively, for RGW and [11, 17]. Since all approaches have a random component, we performed ten repetitions for each method and dimensionality setup, averaging the classification performances obtained through a linear SVM with \(C = 10\). We employ the publicly available code for [11, 13, 17]. Finally, we also report the classification performance of the exact method, obtained by feeding an SVM with the Log-Euclidean kernel whose bandwidth \(\sigma \) is chosen via cross-validation.
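For illustration only, the repetition-and-averaging protocol can be sketched as follows, reusing the hypothetical make_feature_map helper from Sect. 3 (this is a sketch, not the evaluation code actually used):

```python
import numpy as np
from sklearn.svm import LinearSVC

def run_protocol(train_covs, y_train, test_covs, y_test, nu, sigma, theta, repetitions=10):
    d = train_covs[0].shape[0]
    scores = []
    for r in range(repetitions):                               # ten random repetitions per setup
        rng = np.random.default_rng(r)
        phi = make_feature_map(d, nu, sigma, theta, rng)       # same random draws for train and test
        Phi_tr = np.stack([phi(X) for X in train_covs])
        Phi_te = np.stack([phi(X) for X in test_covs])
        clf = LinearSVC(C=10.0).fit(Phi_tr, y_train)
        scores.append(clf.score(Phi_te, y_test))
    return float(np.mean(scores)), float(np.std(scores))
```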

Discussion. For large \(\nu \) values, all methods are able to reproduce the performance of the Log-Euclidean kernel (black dotted line). Still, in almost all cases, our approximation outperforms the competitors: for instance, we clearly outperform Rahimi and Recht on both MSR-pairs and MSR-Action3D, while Kar and Karnick score a much lower performance on HDM-05 and Florence3D. Compared to Le et al., the performance is closer, but this holds for all methods, which are able to match the performance of the Log-Euclidean kernel for \(\nu =\) 2000, 5000. The true superiority of our approach is evident for small \(\nu \) values (\(\nu =\) 10, 20, 50). Indeed, our approximation always provides a much more rapidly growing accuracy (MSR-Action3D, Florence3D and UTKinect), with only a few cases where the gap is thinner (Kar and Karnick [11] on MSR-pairs and Rahimi and Recht [17] on MSRC-Kinect12). Therefore, our approach ensures a more descriptive and compact representation, providing superior classification performance.

5 Conclusions

In this work we propose a novel scalable implementation of a Log-Euclidean SVM to perform effective classification of SPD (covariance) matrices. We achieve linear complexity by providing an explicit random feature map whose induced linear kernel is an unbiased estimator of the exact kernel function.

Our approach proved to be more effective than alternative approximations [11, 13, 17], both theoretically and experimentally. Theoretically, we derive an explicit bound on the variance of the estimator (such a result is totally absent in [11, 17]), which decreases at an inversely cubic rate versus the inversely linear one of [13]. Experimentally, through a broad evaluation, we assess the superiority of our representation, which provides superior classification performance at a lower dimensionality.