Our recognition system consists of the following main stages: (1) Pose description: for each video sequence, a set of visual features is extracted from every frame to represent the pose appearing in it. (2) Sequential distance kernel learning: each feature view is transformed into a kernel matrix via our proposed Gaussian-sequential learning. (3) Kernelized Multiview Projection: KMP exploits the complementary properties of the different views and finds a discriminative low-dimensional subspace in which all views are fused into a single feature vector. (4) Action recognition: an SVM with the RBF kernel is finally applied to categorize actions into different classes. The flowchart of the proposed method is illustrated in Fig. 2. We detail these stages in the following sections.
Notations
We are given N training video sequences \(\{v_1, \ldots , v_N\}\), and M different descriptors are used for multiview feature extraction. For the i-th view and the p-th video sequence, \(X^i_p\) denotes the matrix whose columns are the i-th-view feature vectors arranged in time-sequential order. Since the dimensions of the various descriptors differ, kernel matrices \(K_1, \ldots , K_M \in \mathbb {R}^{N \times N}\) are constructed in Sect. 3.3 for the fusion of the different views. Our task is to output an optimal projection matrix \(P \in \mathbb {R}^{N \times d}\) and weights \(\{\alpha _1, \ldots , \alpha _M\}\) (\(\sum ^M_{i=1} \alpha _i =1\)) for the kernel matrices such that the fused feature matrix \(Y = [\mathbf {y}_1, \ldots , \mathbf {y}_N]^T = KP = (\sum ^{M}_{i=1} \alpha _i K_i)P\) represents the original data comprehensively.
Incremental Naive Bayes Denoising
In a video sequence, however, not all poses are informative and discriminative for action recognition. Some poses carry neither complete nor accurate information and may even contain common patterns shared by various action types. Since such poses cannot represent the action well and would cause confusion during the classification phase, a weakly supervised method, termed the incremental Naive Bayes filter (INBF), is employed to filter out the noisy representations and keep the relatively representative and discriminative poses, i.e., the key poses.
For each action category, ten action sequences are randomly selected. From each of these sequences we choose a small set of discriminative poses for the given action type as the initial positive samples of INBF (labeled \(y=1\)), and the remaining frames are used as the negative ones (\(y=0\)). As illustrated in Fig. 1, the five frames in the middle of an action sequence are selected as discriminative poses. We repeat the above procedure for each action type. After this initialization, INBF proceeds as an unsupervised online learning strategy.
For the i-th feature view, the representation of each pose (frame) s is \(\mathbf {x}^{i}(s) = (x^{i}_{1}(s), \ldots , x^{i}_{D}(s)) \in \mathbb {R}^{D}\). Since all the features we extract are based on statistical histograms, we assume that all elements of \(\mathbf {x}^{i}\) are independently distributed and model them with a naive Bayes classifier:
$$\begin{aligned} \begin{aligned} P(\mathbf {x}^{i})&=\log \frac{\varPi _{m=1}^{D}\Pr (x^{i}_{m}|y=1) \Pr (y=1)}{\varPi _{m=1}^{D}\Pr (x^{i}_{m}|y=0) \Pr (y=0)}\\&=\sum _{m=1}^{D}\log \frac{\Pr (x^{i}_{m}|y=1)}{\Pr (x^{i}_{m}|y=0)}.\\ \end{aligned} \end{aligned}$$
(1)
Note that we assume a uniform prior, i.e., \(\Pr (y = 1)=\Pr (y = 0)\), where \(y\in \{0,1\}\) is a binary variable indicating negative and positive samples, respectively.
Furthermore, real-world data in both statistics and physics are often empirically well described by the Gaussian distribution. Thus, the conditional distributions \(x^{i}_{m}|y=1\) and \(x^{i}_{m}|y=0\) in the classifier \(P(\mathbf {x}^i)\) are assumed to be Gaussian with the four-tuple \((\mu ^{m}_{y=1}, \mu ^{m}_{y=0}, \sigma ^{m}_{y=1}, \sigma ^{m}_{y=0})\), which satisfies
$$\begin{aligned} x^{i}_{m}|y=1\thicksim N\left( \mu ^{m}_{y=1},\sigma ^{m}_{y=1}\right) \end{aligned}$$
and
$$\begin{aligned} x^{i}_{m}|y=0\thicksim N\left( \mu ^{m}_{y=0},\sigma ^{m}_{y=0}\right) . \end{aligned}$$
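For concreteness, the classifier score in Eq. (1) can be evaluated under this Gaussian assumption as in the following Python sketch; the function name and the variance floor `eps` are illustrative choices of ours, not part of the proposed method.

```python
import numpy as np

def nb_log_ratio(x, mu1, sigma1, mu0, sigma0, eps=1e-8):
    """Naive Bayes log-ratio P(x^i) of Eq. (1), assuming Gaussian conditionals.
    All arguments are length-D arrays; the uniform prior cancels out."""
    def log_gauss(x, mu, sigma):
        sigma = np.maximum(sigma, eps)  # guard against zero variance
        # log N(x; mu, sigma) up to an additive constant shared by both classes
        return -np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
    return float(np.sum(log_gauss(x, mu1, sigma1) - log_gauss(x, mu0, sigma0)))
```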
For a given feature view, we can thus initialize a group of naive Bayes models for each action type, and each training sequence is passed through all the models in turn. The Gaussian parameters in INBF can then be incrementally updated as follows:
$$\begin{aligned} \begin{aligned}&\mu ^{m}_{y=1}\leftarrow \lambda \mu ^{m}_{y=1}+(1-\lambda )\mu _{y=1},\\&\sigma ^{m}_{y=1}\leftarrow \sqrt{\lambda \left( \sigma ^{m}_{y=1}\right) ^{2}+(1-\lambda )(\sigma _{y=1})^{2} + \lambda (1-\lambda ) \left( \mu ^{m}_{y=1}-\mu _{y=1}\right) ^2}, \end{aligned} \end{aligned}$$
(2)
where \(\lambda >0\) denotes the learning rate of INBF, \(\mu _{y=1} = \frac{1}{S}\sum _{s|y(s)=1}x^{i}_{m}(s)\), \(\sigma _{y=1} = \sqrt{\frac{1}{S}\sum _{s|y(s)=1}(x^{i}_{m}(s)-\mu _{y=1})^{2}}\) and \(S = |\{s|y(s)=1\}|\); \(\mu ^m_{y=0}\) and \(\sigma ^m_{y=0}\) have analogous update rules. These solutions are easily obtained by maximum likelihood estimation. In this way, INBF keeps the representative frames for the later learning phase and discards irrelevant frames to reduce the influence of noise.
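A minimal sketch of one such update follows, assuming the batch statistics \(\mu _{y=1}\) and \(\sigma _{y=1}\) are recomputed from the frames currently treated as positive; the default learning rate shown is a placeholder rather than a value prescribed by the method.

```python
import numpy as np

def inbf_update(mu_m, sigma_m, x_values, lam=0.85):
    """One incremental update of (mu^m, sigma^m) as in Eq. (2).
    x_values: the m-th histogram bin x_m^i(s) over the frames s with y(s)=1
    (the y=0 parameters are updated analogously); lam is the learning rate
    (0.85 is an illustrative value, not taken from the text)."""
    mu_batch = np.mean(x_values)            # mu_{y=1}
    sigma_batch = np.std(x_values)          # sigma_{y=1}
    mu_new = lam * mu_m + (1.0 - lam) * mu_batch
    sigma_new = np.sqrt(lam * sigma_m ** 2 + (1.0 - lam) * sigma_batch ** 2
                        + lam * (1.0 - lam) * (mu_m - mu_batch) ** 2)
    return mu_new, sigma_new
```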
RBF Sequential Kernel Construction
For the i-th view, since we extract features from the frames of video sequences, each video sequence can be described by a set of features in sequential order (along the temporal axis). The similarity \(k_i(v_p, v_q)\) between video \(v_p\) and video \(v_q\) under view i can be measured via DTW (Berndt and Clifford 1994). The kernel function is therefore defined as \(k_i(v_p, v_q) = \exp (-\frac{DTW(X^i_p, X^i_q)^2}{2\sigma ^{2}})\), where \(DTW(X^i_p, X^i_q)\) denotes the sequential distance computed via DTW and \(\sigma \) is the standard deviation of the RBF kernel. In this way, we can easily obtain the kernel matrices for the different views.
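A possible implementation of this construction is sketched below in Python; since the text does not fix the frame-to-frame cost inside DTW, the Euclidean cost used here is an assumption, and sequences are stored with one frame per row for convenience.

```python
import numpy as np

def dtw_distance(Xp, Xq):
    """DTW distance between two sequences of frame descriptors (rows are frames),
    using a Euclidean frame-to-frame cost."""
    Tp, Tq = len(Xp), len(Xq)
    cost = np.full((Tp + 1, Tq + 1), np.inf)
    cost[0, 0] = 0.0
    for s in range(1, Tp + 1):
        for t in range(1, Tq + 1):
            d = np.linalg.norm(np.asarray(Xp[s - 1]) - np.asarray(Xq[t - 1]))
            cost[s, t] = d + min(cost[s - 1, t], cost[s, t - 1], cost[s - 1, t - 1])
    return cost[Tp, Tq]

def sequential_kernel(sequences, sigma):
    """Kernel matrix K_i of one view: k_i(v_p, v_q) = exp(-DTW(X_p, X_q)^2 / (2 sigma^2))."""
    N = len(sequences)
    K = np.zeros((N, N))
    for p in range(N):
        for q in range(p, N):
            d = dtw_distance(sequences[p], sequences[q])
            K[p, q] = K[q, p] = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return K
```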
Kernelized Multiview Projection
Based on the above kernel construction, we obtain kernel matrices \(K_1, \ldots , K_M \in \mathbb {R}^{N \times N}\) of the same size for the M views, despite their different feature dimensions. Furthermore, we use the labels of the training video sequences to supervise the computation of the similarity matrix \(W_i\) for the i-th view. Each component of \(W_i\) is computed as follows:
$$\begin{aligned} (W_i)_{pq}= \left\{ \begin{array}{ll} \exp \left( -\frac{DTW(X^i_p, X^i_q)^2}{2\sigma ^{2}}\right) , &{} C(p)=C(q) \\ 0, &{} otherwise \end{array} \right. , \end{aligned}$$
(3)
where C(p) is the label function which indicates the label of video \(v_p\) and \(p, q = 1, \ldots , N\). In fact, the similarity matrix \(W_i\) is a block matrix consisting of submatrices of the kernel matrix \(K_i\), as illustrated in Fig. 3. We then have the diagonal matrix \(D_i\) with \((D_i)_{pp} = \sum _q (W_i)_{pq}\) and the Laplacian matrix \(L_i = D_i - W_i\) for each view i.
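A direct implementation of Eq. (3) together with \(D_i\) and \(L_i\) might look as follows, assuming the kernel matrix \(K_i\) from Sect. 3.3 has already been computed; the function name is ours.

```python
import numpy as np

def similarity_and_laplacian(K, labels):
    """W_i, D_i and L_i for one view, following Eq. (3): kernel entries are kept
    only when the two sequences share the same action label C(p) = C(q)."""
    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    W = np.where(same_class, K, 0.0)    # block structure, as illustrated in Fig. 3
    D = np.diag(W.sum(axis=1))          # (D_i)_pp = sum_q (W_i)_pq
    L = D - W                           # Laplacian L_i = D_i - W_i
    return W, D, L
```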
Due to the complementary nature of the different descriptors, we assign different weights to the different views. The goal of KMP is to find the basis of a subspace in which the lower-dimensional representation preserves the intrinsic structure of the original data. Therefore, we impose a set of nonnegative weights \(\alpha = (\alpha _1, \ldots , \alpha _M)\) on the similarity matrices \(W_1, \ldots , W_M\), giving the fused similarity matrix \(W = \sum ^M_{i=1} \alpha _i W_i\) and the fused Laplacian matrix \(L = \sum ^M_{i=1} \alpha _i L_i\).
For the kernel matrix, since we use the same method (DTW) to compute kernel values and similarities, we can also define the fused kernel matrix \(K= \sum ^M_{i=1} \alpha _i K_i\). In fact, suppose \(\phi _i\) is the underlying feature map of kernel \(K_i\), i.e., \(K_i = \phi _i(X^i)^T \phi _i(X^i)\); then the fused kernel value is computed from the feature vector obtained by concatenating the vectors mapped via \(\phi _1, \ldots , \phi _M\), since we have
$$\begin{aligned} \begin{aligned} K&= \sum ^M_{i=1} \alpha _i K_i = \sum ^M_{i=1} \alpha _i \phi _i(X^i)^T \phi _i(X^i) \\&= \left[ \begin{array}{c} \sqrt{\alpha _1} \phi _1(X^1) \\ \vdots \\ \sqrt{\alpha _M} \phi _M(X^M) \\ \end{array} \right] ^T \left[ \begin{array}{c} \sqrt{\alpha _1} \phi _1(X^1) \\ \vdots \\ \sqrt{\alpha _M} \phi _M(X^M) \\ \end{array} \right] \\&= \phi (X)^T \phi (X), \end{aligned} \end{aligned}$$
where \(\phi (\cdot ) = [\sqrt{\alpha _1} \phi _1(\cdot )^T, \cdots , \sqrt{\alpha _M} \phi _M(\cdot )^T]^T\) is the fused feature map and \(X = (X^1, \ldots , X^M)\) is the M-tuple consisting of features from all the views.
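The identity can be verified numerically with toy explicit (finite-dimensional) feature maps standing in for \(\phi _i\); the dimensions and weights below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims, alphas = 6, [4, 7], [0.3, 0.7]                  # two toy views
Phi = [rng.standard_normal((d, N)) for d in dims]        # stand-ins for phi_i(X^i)

K_fused = sum(a * F.T @ F for a, F in zip(alphas, Phi))          # sum_i alpha_i K_i
Phi_cat = np.vstack([np.sqrt(a) * F for a, F in zip(alphas, Phi)])  # stacked sqrt(alpha_i) phi_i
assert np.allclose(K_fused, Phi_cat.T @ Phi_cat)         # K = phi(X)^T phi(X)
```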
To preserve the fused locality information, we need to find the optimal projection for the following optimization problem:
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\mathbf {v}} \sum _{ij} \Vert \mathbf {v}^T \psi _i - \mathbf {v}^T \psi _j\Vert ^2 (W)_{ij}, \end{aligned}$$
(4)
where \(\psi _i\) is the fused mapped feature, i.e., \([\psi _1, \ldots , \psi _N] = \phi (X)\). Through a simple algebraic derivation, using \(\sum _{ij} \Vert \mathbf {v}^T \psi _i - \mathbf {v}^T \psi _j\Vert ^2 (W)_{ij} = 2\,\mathbf {v}^T \phi (X) (D-W) \phi (X)^T \mathbf {v}\) with \(L = D - W\), the above optimization problem can be transformed into the following form:
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\mathbf {v}} \hbox {Tr}(\mathbf {v}^T \phi (X) L \phi (X)^T \mathbf {v}). \end{aligned}$$
(5)
With the constraint \(\hbox {Tr}(\mathbf {v}^T \phi (X) D \phi (X)^T \mathbf {v}) = 1\), minimizing the objective function in Eq. (5) is to solve the following generalized eigenvalue problem:
$$\begin{aligned} \phi (X) L \phi (X)^T \mathbf {v} = \lambda \phi (X) D \phi (X)^T \mathbf {v}. \end{aligned}$$
(6)
Note that each solution of problem (6) is a linear combination of \(\psi _1, \ldots , \psi _N\), and there exists an N-tuple \(\mathbf {p} = (p_1, \ldots , p_N) \in \mathbb {R}^N\) such that \(\mathbf {v} = \sum ^N_{i=1} p_i \psi _i = \phi (X) \mathbf {p}\). For the matrix V consisting of all the solutions, there exists a matrix P such that \(V= \phi (X)P\). Therefore, with the additional constraint \(\hbox {Tr}(P^T \phi (X) D \phi (X)^T P)=1\), we can formulate the new objective function as follows:
$$\begin{aligned} \begin{aligned}&\mathop {\text {arg min}}\limits _{P, \alpha } \hbox {Tr}(P^T K L K P) \\&\text {s.t.}~ \hbox {Tr}(P^T K D K P)=1,~ \sum ^{M}_{i=1} \alpha _i=1,~ \alpha _i \ge 0, \end{aligned} \end{aligned}$$
(7)
or in the form without the trace constraint:
$$\begin{aligned} \begin{aligned}&\mathop {\text {arg min}}\limits _{P, \alpha } \frac{\hbox {Tr}(P^T K L K P)}{\hbox {Tr}(P^T K D K P)}, ~ \text {s.t.}~ \sum ^{M}_{i=1} \alpha _i=1,~ \alpha _i \ge 0. \end{aligned} \end{aligned}$$
(8)
Alternate Optimization via Relaxation
In this section, we employ an alternate optimization procedure (Bezdek and Hathaway 2002; Tao et al. 2007) to derive the solution of the optimization problem. To the best of our knowledge, it is difficult to find the optimal solution directly, especially for the weights in (8). To optimize \(\alpha \), we therefore derive a relaxed objective function from the original problem. The output of the relaxed function ensures that the value of the objective function in (8) lies in a small neighborhood of the true minimum.
For a fixed \(\alpha \), finding the optimal projection P simply reduces to solving the generalized eigenvalue problem
$$\begin{aligned} KLK \mathbf {p} = \lambda KDK \mathbf {p}, \end{aligned}$$
(9)
and setting \(P = [\mathbf {p}_1, \ldots , \mathbf {p}_d]\) to the eigenvectors corresponding to the d smallest eigenvalues, based on the Ky-Fan theorem (Bhatia 1997).
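One way to implement this P-step is with a symmetric generalized eigensolver, as in the sketch below; the small ridge added to \(KDK\) is a numerical safeguard of ours rather than part of the formulation.

```python
import numpy as np
from scipy.linalg import eigh

def projection_step(K, L, D, d, reg=1e-6):
    """Solve the generalized eigenvalue problem of Eq. (9) and keep the eigenvectors
    of the d smallest eigenvalues; the ridge term keeps K D K well conditioned."""
    A = K @ L @ K
    B = K @ D @ K + reg * np.eye(K.shape[0])
    eigvals, eigvecs = eigh(A, B)        # eigenvalues returned in ascending order
    return eigvecs[:, :d]                # P = [p_1, ..., p_d]
```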
Next, we fix the projection P and update \(\alpha \) alone. Without loss of generality, we first consider the case \(M = 2\), i.e., there are only two views. Then the optimization problem (8) reduces to
$$\begin{aligned} \begin{aligned}&\mathop {\text {arg min}}\limits _{P, \alpha } \frac{\hbox {Tr}(P^T K L K P)}{\hbox {Tr}(P^T K D K P)}, ~\text {s.t.}~ \alpha _1 + \alpha _2 =1,~ \alpha _1, \alpha _2 \ge 0. \end{aligned} \end{aligned}$$
(10)
For simplicity, we denote \(L_{ijk} = \hbox {Tr}(P^T K_i L_k K_j P)\) and \(D_{ijk} = \hbox {Tr}(P^T K_i D_k K_j P)\), \(i, j, k \in \{1, 2\}\). Then we can simply find that \(L_{ijk} = L_{jik}\) and \(D_{ijk} = D_{jik}\).
With the Cauchy-Schwarz inequality (Hardy et al. 1952), the relaxation for the objective function in (10) is shown in Eq. (11),
$$\begin{aligned} \frac{\hbox {Tr}(P^T K L K P)}{\hbox {Tr}(P^T K D K P)}= & {} \frac{\hbox {Tr}\Big (P^T (\alpha _1 K_1 + \alpha _2 K_2) (\alpha _1 L_1 + \alpha _2 L_2) (\alpha _1 K_1 + \alpha _2 K_2) P\Big )}{\hbox {Tr}\Big (P^T (\alpha _1 K_1 + \alpha _2 K_2) (\alpha _1 D_1 + \alpha _2 D_2) (\alpha _1 K_1 + \alpha _2 K_2) P\Big )} \nonumber \\= & {} \frac{\alpha _1^3 L_{111} + 2\alpha _1^2 \alpha _2 L_{121} + \alpha _1 \alpha _2^2 L_{221} + \alpha _1^2 \alpha _2 L_{112} + 2 \alpha _1 \alpha _2^2 L_{122} + \alpha _2^3 L_{222}}{\alpha _1^3 D_{111} + 2\alpha _1^2 \alpha _2 D_{121} + \alpha _1 \alpha _2^2 D_{221} + \alpha _1^2 \alpha _2 D_{112} + 2 \alpha _1 \alpha _2^2 D_{122} + \alpha _2^3 D_{222}}\nonumber \\\le & {} \frac{1}{\alpha _1^3 L_{111} + 2\alpha _1^2 \alpha _2 L_{121} + \alpha _1 \alpha _2^2 L_{221} + \alpha _1^2 \alpha _2 L_{112} + 2 \alpha _1 \alpha _2^2 L_{122} + \alpha _2^3 L_{222}} \nonumber \\&\times \left( \frac{\left( \alpha _1^3 L_{111}\right) ^2}{\alpha _1^3 D_{111}} + \frac{\left( 2\alpha _1^2 \alpha _2 L_{121}\right) ^2}{2\alpha _1^2 \alpha _2 D_{121}} + \frac{\left( \alpha _1 \alpha _2^2 L_{221}\right) ^2}{\alpha _1 \alpha _2^2 D_{221}} + \frac{\left( \alpha _1^2 \alpha _2 L_{112}\right) ^2}{\alpha _1^2 \alpha _2 D_{112}} + \frac{\left( 2 \alpha _1 \alpha _2^2 L_{122}\right) ^2}{2 \alpha _1 \alpha _2^2 D_{122}} + \frac{\left( \alpha _2^3 L_{222}\right) ^2}{\alpha _2^3 D_{222}}\right) \nonumber \\= & {} \frac{1}{\alpha _1^3 L_{111} + 2\alpha _1^2 \alpha _2 L_{121} + \alpha _1 \alpha _2^2 L_{221} + \alpha _1^2 \alpha _2 L_{112} + 2 \alpha _1 \alpha _2^2 L_{122} + \alpha _2^3 L_{222}} \nonumber \\&\times \left( \alpha _1^3 L_{111} \frac{L_{111}}{D_{111}} + 2\alpha _1^2 \alpha _2 L_{121} \frac{L_{121}}{D_{121}} + \alpha _1 \alpha _2^2 L_{221} \frac{L_{221}}{D_{221}} + \alpha _1^2 \alpha _2 L_{112} \frac{L_{112}}{D_{112}} + 2 \alpha _1 \alpha _2^2 L_{122} \frac{L_{122}}{D_{122}} + \alpha _2^3 L_{222} \frac{L_{222}}{D_{222}}\right) \nonumber \\= & {} \sum _{i,j,k \in \{1,2\}} w_{ijk}(\alpha _1, \alpha _2) \frac{L_{ijk}}{D_{ijk}}, \end{aligned}$$
(11)
where \(w_{ijk}\) is the coefficient of \(\frac{L_{ijk}}{D_{ijk}}\) and \(\sum _{i,j,k \in \{1,2\}} w_{ijk} =1\). In this way, the objective function in (10) is relaxed to a weighted sum of the terms \(\frac{L_{ijk}}{D_{ijk}}\). Thus, minimizing the weighted sum on the right-hand side of (11) lowers the objective function value in (10). Note that
$$\begin{aligned} \alpha _1^2 \alpha _2 = \frac{1}{2} \alpha _1 \cdot \alpha _1 \cdot 2\alpha _2 \le \frac{1}{2}\left( \frac{\alpha _1 + \alpha _1 + 2\alpha _2}{3}\right) ^3 = \frac{4}{27}, \end{aligned}$$
by the inequality of arithmetic and geometric means, so the weights that do not contain \(\alpha _1^3\) or \(\alpha _2^3\) are always bounded above by a constant. Therefore, we only ensure that part of the terms in the weighted sum (those associated with \(\alpha _1^3\) and \(\alpha _2^3\)) is minimized, i.e., we solve the following optimization problem:
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\alpha _1, \alpha _2} w_{111} \frac{L_{111}}{D_{111}} + w_{222} \frac{L_{222}}{D_{222}}. \end{aligned}$$
(12)
Since \(w_{111}\) and \(w_{222}\) are functions of \((\alpha _1, \alpha _2)\), we first find the optimal weights without these parameters. To avoid a trivial solution, we assign an exponent \(r > 1\) to each weight. The relaxed optimization becomes
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\beta _1, \beta _2} \beta _1^r \frac{L_{111}}{D_{111}} + \beta _2^r \frac{L_{222}}{D_{222}}, ~\text {s.t.}~ \beta _1 + \beta _2 =1, \beta _1, \beta _2 \ge 0. \end{aligned}$$
(13)
For (13), we have the Lagrangian function with the Lagrangian multiplier \(\eta \):
$$\begin{aligned} L(\beta _1,\beta _2,\eta ) = \beta _1^r \frac{L_{111}}{D_{111}} + \beta _2^r \frac{L_{222}}{D_{222}} - \eta (\beta _1 + \beta _2 -1 ). \end{aligned}$$
(14)
We only need to set the derivatives of L with respect to \(\beta _1\), \(\beta _2\) and \(\eta \) to zero as follows:
$$\begin{aligned} \frac{\partial L}{\partial \beta _1}= & {} r \beta _1^{r-1} \frac{L_{111}}{D_{111}} - \eta =0, \end{aligned}$$
(15)
$$\begin{aligned} \frac{\partial L}{\partial \beta _2}= & {} r \beta _2^{r-1} \frac{L_{222}}{D_{222}} - \eta =0, \end{aligned}$$
(16)
$$\begin{aligned} \frac{\partial L}{\partial \eta }= & {} \beta _1 + \beta _2 -1 =0. \end{aligned}$$
(17)
Then \(\beta _1\) and \(\beta _2\) can be calculated by
$$\begin{aligned} \begin{aligned} \beta _1&= \frac{(L_{222} D_{111})^{\frac{1}{r-1}}}{(L_{222} D_{111})^{\frac{1}{r-1}} + (L_{111} D_{222})^{\frac{1}{r-1}}}, \\ \beta _2&= \frac{(L_{111} D_{222})^{\frac{1}{r-1}}}{(L_{222} D_{111})^{\frac{1}{r-1}} + (L_{111} D_{222})^{\frac{1}{r-1}}}. \end{aligned} \end{aligned}$$
(18)
Having acquired \(\beta _1\) and \(\beta _2\), we can obtain \(\alpha _1\) and \(\alpha _2\) by matching the coefficients of the objective functions in (12) and (13):
$$\begin{aligned} \frac{\alpha _1^3 L_{111}}{\alpha _2^3 L_{222}} = \frac{w_{111}}{w_{222}} = \frac{\beta _1^r}{\beta _2^r}. \end{aligned}$$
(19)
With the constraint \(\alpha _1 + \alpha _2 =1\), we can easily find that
$$\begin{aligned} \begin{aligned} \alpha _1&= \frac{\left( \beta _1^r L_{222}\right) ^{\frac{1}{3}}}{\left( \beta _1^r L_{222}\right) ^{\frac{1}{3}} + \left( \beta _2^r L_{111}\right) ^{\frac{1}{3}}}, \\ \alpha _2&= \frac{\left( \beta _2^r L_{111}\right) ^{\frac{1}{3}}}{\left( \beta _1^r L_{222}\right) ^{\frac{1}{3}} + \left( \beta _2^r L_{111}\right) ^{\frac{1}{3}}}. \end{aligned} \end{aligned}$$
(20)
Hence, for the general M-view situation, we also have the corresponding relaxed problems:
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\sum ^M_{i=1} \alpha _i =1} \sum _{i,j,k \in \{1, \ldots , M\}} w_{ijk}(\alpha _1, \ldots , \alpha _M) \frac{L_{ijk}}{D_{ijk}} \end{aligned}$$
(21)
and
$$\begin{aligned} \mathop {\text {arg min}}\limits _{\beta _1, \ldots , \beta _M} \sum ^M_{i=1} \beta _i^r \frac{L_{iii}}{D_{iii}}, ~\text {s.t.}~ \sum ^M_{i=1} \beta _i = 1,~ \beta _i \ge 0. \end{aligned}$$
(22)
The coefficients \((\beta _1, \ldots , \beta _M)\) and \((\alpha _1, \ldots , \alpha _M)\) can be obtained in similar forms:
$$\begin{aligned} \beta _i = \frac{(D_{iii}/L_{iii})^{\frac{1}{r-1}}}{\sum ^M_{j=1}(D_{jjj}/L_{jjj})^{\frac{1}{r-1}}},~ i=1,\ldots , M \end{aligned}$$
(23)
and
$$\begin{aligned} \alpha _i = \frac{\left( \beta _i^r/L_{iii}\right) ^{\frac{1}{3}}}{\sum ^M_{j=1} \left( \beta _j^r/L_{jjj}\right) ^{\frac{1}{3}}},~ i=1,\ldots , M. \end{aligned}$$
(24)
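The resulting \(\alpha \)-step can be sketched as follows in Python; the default exponent r is only a placeholder, since its value is not fixed in this section.

```python
import numpy as np

def weight_step(Ks, Ls, Ds, P, r=5.0):
    """Update the view weights with P fixed: beta from Eq. (23), then alpha from Eq. (24).
    Ks, Ls, Ds hold the per-view K_i, L_i, D_i; r > 1 is the relaxation exponent
    (the default here is an assumed value)."""
    L_iii = np.array([np.trace(P.T @ K @ L @ K @ P) for K, L in zip(Ks, Ls)])
    D_iii = np.array([np.trace(P.T @ K @ D @ K @ P) for K, D in zip(Ks, Ds)])
    beta = (D_iii / L_iii) ** (1.0 / (r - 1.0))
    beta /= beta.sum()                                  # Eq. (23)
    alpha = (beta ** r / L_iii) ** (1.0 / 3.0)
    return alpha / alpha.sum()                          # Eq. (24)
```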
Although the weight vector \(\alpha \) obtained by the above procedure is not the global minimizer, the objective function is guaranteed to remain within a small range of values. Let \(F_1\) and \(F_2\) be the objective functions in (8) and (21), respectively, and let
$$\begin{aligned} F_3 = \sum _{i=j=k} w_{ijk} \frac{L_{ijk}}{D_{ijk}} = \sum ^M_{i=1} w_{iii} \frac{L_{iii}}{D_{iii}}. \end{aligned}$$
(25)
We can find that \(F_1 \le F_2\), and if \(\alpha _i = 1\) for some i, then \(F_1 = F_2 = F_3\). During the alternate procedure, \(F_1\) is minimized when optimizing P, and \(F_3\) is minimized when optimizing \(\alpha \). Denoting \(m_1 = \max (F_1 - F_3)\) and \((P_1, \alpha _1) = \hbox {arg max} (F_1 - F_3)\), we have
$$\begin{aligned} \begin{aligned} \min F_3 + m_1&\le F_3 (P_1, \alpha _1) + (F_1 - F_3)(P_1, \alpha _1) \\&= F_1 (P_1, \alpha _1) \le \max F_1, \end{aligned} \end{aligned}$$
and we can define the following nonnegative continuous function:
$$\begin{aligned} F_4 (P,\alpha ) = \max \Big (F_1(P, \alpha ), \min _{\alpha } \big (F_3(P, \alpha ) + m_1\big )\Big ). \end{aligned}$$
(26)
Note that \(\min _{\alpha } \big (F_3(P, \alpha ) + m_1\big )\) is independent of \(\alpha \); thus, for any P, there exists \(\alpha _0\) such that \(F_1 (P, \alpha _0) \le \min _{\alpha } \big (F_3(P, \alpha ) + m_1\big )\). If we impose the above alternate optimization on \(F_4\), \(F_4\) is nonincreasing and therefore converges. Although \(\alpha \) does not converge to a fixed point, the range of \(F_1\) is reduced to a small interval, i.e., it stays below \(\min _{\alpha } F_3\) plus a constant. It is also worth noting that \(F_3\) is actually the weighted sum of the objective functions that preserve each view's locality information; however, optimizing \(F_3\) still learns from each view separately, i.e., the locality similarity is not fused. We summarize KMP in Algorithm 1.
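Since Algorithm 1 itself is not reproduced here, the following sketch assembles the two update steps above into the alternate procedure as described; the uniform initialization of \(\alpha \) and the fixed number of iterations are our assumptions, and the helpers `projection_step` and `weight_step` are those sketched earlier.

```python
import numpy as np

def kmp(Ks, Ls, Ds, d, n_iter=10, r=5.0):
    """Alternate optimization: fix alpha to solve Eq. (9) for P, then fix P to update
    alpha via Eqs. (23)-(24)."""
    M = len(Ks)
    alpha = np.full(M, 1.0 / M)                         # assumed uniform initialization
    for _ in range(n_iter):
        K = sum(a * Ki for a, Ki in zip(alpha, Ks))     # fused kernel matrix
        L = sum(a * Li for a, Li in zip(alpha, Ls))     # fused Laplacian matrix
        D = sum(a * Di for a, Di in zip(alpha, Ds))     # fused degree matrix
        P = projection_step(K, L, D, d)                 # P-step, Eq. (9)
        alpha = weight_step(Ks, Ls, Ds, P, r)           # alpha-step, Eqs. (23)-(24)
    return P, alpha
```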
During the testing phase, having acquired the per-view data \(X_{test}^1, \cdots , X_{test}^M\) of a test video sequence \(v_{test}\), we first compute the kernel values to form the representation of \(v_{test}\) in the RKHS of each view:
$$\begin{aligned} \mathbf {k}^i_{test} = (k_i(v_1, v_{test}), \cdots , k_i(v_N, v_{test})), ~i = 1, \ldots , M, \end{aligned}$$
where \(k_i (\cdot , \cdot )\) is the kernel function defined in Sect. 3.3. Using the weights \((\alpha _1, \ldots , \alpha _M)\) optimized by Algorithm 1, we obtain the fused representation of \(v_{test}\): \(\mathbf {k}_{test} = \sum _{i=1}^M \alpha _i \mathbf {k}^i_{test}\). The final fused representation of \(v_{test}\) in the reduced space is then \(\mathbf {y}_{test} = \mathbf {k}_{test} P\).
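Putting the testing phase together, reusing `dtw_distance` from the kernel-construction sketch (the function name and argument layout below are illustrative only):

```python
import numpy as np

def embed_test(test_views, train_views, alpha, P, sigma):
    """Fused reduced representation y_test of a test sequence: per-view kernel vectors
    k_test^i against the N training sequences, fused with the learned alpha, then
    projected with P."""
    N = len(train_views[0])
    k_test = np.zeros(N)
    for a, X_test, train_seqs in zip(alpha, test_views, train_views):
        k_i = np.array([np.exp(-dtw_distance(Xp, X_test) ** 2 / (2.0 * sigma ** 2))
                        for Xp in train_seqs])          # k_i(v_p, v_test), p = 1..N
        k_test += a * k_i                               # k_test = sum_i alpha_i k_test^i
    return k_test @ P                                   # y_test = k_test P
```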