Multi-view kernel completion

Bhadra, Sahely; Kaski, Samuel; Rousu, Juho

doi:10.1007/s10994-016-5618-0

Published: 30 December 2016

Volume 106, pages 713–739, (2017)
Machine Learning

Abstract

In this paper, we introduce the first method that (1) can complete kernel matrices with completely missing rows and columns as opposed to individual missing kernel values, with help of information from other incomplete kernel matrices. Moreover, (2) the method does not require any of the kernels to be complete a priori, and (3) can tackle non-linear kernels. The kernel completion is done by finding, from the set of available incomplete kernels, an appropriate set of related kernels for each missing entry. These aspects are necessary in practical applications such as integrating legacy data sets, learning under sensor failures and learning when measurements are costly for some of the views. The proposed approach predicts missing rows by modelling both within-view and between-view relationships among kernel values. For within-view learning, we propose a new kernel approximation that generalizes and improves Nyström approximation. We show, both on simulated data and real case studies, that the proposed method outperforms existing techniques in the settings where they are available, and extends applicability to new settings.

1 Introduction

In recent years, many methods have been proposed for multi-view learning, i.e., learning with data collected from multiple sources or “views” to utilize the complementary information in them. Kernelized methods capture the similarities among data points in a kernel matrix. The multiple kernel learning (MKL) framework (cf. Gönen and Alpaydin 2011) is a popular way to accumulate information from multiple data sources, where kernel matrices built on features from individual views are combined for better learning. In MKL methods, it is commonly assumed that full kernel matrices for each view are available. However, in partial data analytics, it is common that information from some sources is not available for some data points.

The incomplete data problem exists in a wide range of fields, including social sciences, computer vision, biological systems, and remote sensing. For example, in remote sensing, some sensors can go off for periods of time, leaving gaps in data. A second example is that when integrating legacy data sets, some views may not be available for some data points, because integration needs were not considered when originally collecting and storing the data. For instance, gene expression may have been measured for some of the biological samples, but not for others, and as biological sample material has been exhausted, the missing measurements cannot be made any more. On the other hand, some measurements may be too expensive to repeat for all samples; for example, a patient’s genotype may be measured only if a particular condition holds. All these examples introduce missing views, i.e., all features of a view for a data point can be missing simultaneously.

Novelties in problem definition: Previous methods for kernel completion have addressed completion of the aggregated Gaussian kernel matrix by integrating multiple incomplete kernels (Williams and Carin 2005) or single-view kernel completion assuming individual missing values (Graepel 2002; Paisley and Carin 2010), or required at least one complete kernel with a full eigen-system to be used as an auxiliary data source (Tsuda et al. 2003; Trivedi et al. 2005), or assumed the eigen-systems of two kernels to be exactly the same (Shao et al. 2013), or assumed a linear kernel approximation (Lian et al. 2015). Williams and Carin (2005) do not complete the individual incomplete kernel matrices but complete only aggregated kernels when all kernels are Gaussian. Due to the absence of full rows/columns in the incomplete kernel matrices, no existing single-view kernel completion method (Graepel 2002; Paisley and Carin 2010) can be applied to complete kernel matrices of individual views independently. In the multi-view setting, Tsuda et al. (2003) have proposed an expectation maximization based method to complete an incomplete kernel matrix for a view, with the help of a complete kernel matrix from another view. As it requires a full eigen-system of the auxiliary full kernel matrix, that method cannot be used to complete a kernel matrix with missing rows/columns when no other auxiliary complete kernel matrix is available. Both Trivedi et al. (2005) and Shao et al. (2013) match kernels through their graph Laplacians, which may not work optimally if the kernels have different eigen-structures arising from different types of measurements. The method by Shao et al. (2013) completes multiple kernels sequentially, making an implicit assumption that the adjacent kernels in the sequence are related. This can be a hard constraint and in general may not match reality. On the other hand, Lian et al. (2015) proposed a generative model based method which approximates the similarity matrix for each view as a linear kernel in some low-dimensional space. Therefore, it is unable to model highly non-linear kernels such as RBFs. Hence no conventional method can, by itself, complete highly non-linear kernel matrices with completely missing rows and columns in a multi-view setting when no other auxiliary full kernel matrix is available.

Contributions: In this paper, we propose a novel method to complete all incomplete kernel matrices collaboratively, by learning both between-view and within-view relationships among the kernel values (Fig. 1). We model between-view relationships in the following two ways: (1) First, adapting the strategies from multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012), we complete kernel matrices by expressing individual normalized kernel matrices corresponding to each view as a convex combination of normalized kernel matrices of other views. (2) Second, to model relationships between kernels having different eigen-systems, we propose a novel approach of restricting the local embedding of one view to the convex hull of local embeddings of other views. We relate theoretically the kernel approximation quality of the different approaches to the properties of the underlying eigen-spaces of the kernels, pointing out settings where different approaches are optimal.

For within-view learning, we begin from the concept of local linear embedding (Roweis and Saul 2000) applied to the feature vector, and extend it to the kernel matrix by reconstructing each feature representation for a kernel as a sparse linear combination of other available feature representations or “basis” vectors in the same view. We assume the local embeddings, i.e., the reconstruction weights and the basis vectors for reconstructing each sample, are similar across views. In this approach, the non-linearity of kernel functions of individual views is also preserved in the basis vectors. We prove (Theorem 1) that the proposed within-view kernel reconstruction can be seen as generalizing and improving the Nyström method (Williams and Seeger 2001), which has been successfully applied to efficient kernel learning. Most importantly, we show (in Theorem 2) that for a general single kernel matrix the proposed scheme results in the optimal low-rank approximation, which is not reached by the Nyström method.

For between-view learning, we recognize that the similarity of the eigen-spaces of the views plays a crucial role. When the different kernels have similar optimal low-rank approximations, we show (Theorem 3) that predicting kernel values across views is a potent approach. For this case, we propose a method ($\mathbf{MKC}_{app}$ (25)) relying on a technique previously used in multiple kernel learning literature, namely, restricting a kernel matrix to the convex hull of other kernel matrices (Argyriou et al. 2005; Cortes et al. 2012). Here, we use this technique for simultaneously completing multiple incomplete kernel matrices, while Argyriou et al. (2005) and Cortes et al. (2012) used it only for learning an effective linear combination of complete kernel matrices.

For the case when the eigen-systems of the different views differ, we propose methods that, instead of kernel values, translate the reconstruction weights across views. For the cases where the leading eigen-vectors are similar but eigen-value spectra are different, we show (Theorem 4) that it is sufficient to maintain one global set of reconstruction weights, used in the within-view reconstructions of all views. In the case of heterogeneous leading eigen-vector sets across views, we propose to learn the reconstruction weights for each view while restricting them to the convex hull of the reconstruction weights of the other views ($\mathbf{MKC}_{embd(ht)}$ (24)).

2 Multi-view kernel completion

We assume N data observations ${\mathbf {X}}=\{ {\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\}$ from a multi-view input space ${\mathscr {X}}= {\mathscr {X}}^{(1)} \times \cdots \times {\mathscr {X}}^{(M)}$, where ${\mathscr {X}}^{(m)}$ is the input space generating the mth view. We denote by ${\mathbf {X}}^{(m)}=\{ {\mathbf {x}}^{(m)}_1,\ldots ,{\mathbf {x}}^{(m)}_N\}$, $\forall $ $m = 1,\ldots , M$, the set of observations for the mth view, where ${\mathbf {x}}^{(m)}_i \in {\mathscr {X}}^{(m)}$ is the ith observation in the mth view. For simplicity of notation we sometimes omit the superscript (m) denoting the different views when there is no need to refer to several views at a time.

Considering an implicit mapping of the observations of the mth view to an inner product space ${\mathscr {F}}^{(m)}$ via a mapping $\phi ^{(m)}: {\mathscr {X}}^{(m)} \rightarrow {\mathscr {F}}^{(m)}$, and following the usual recipe for kernel methods (Bach et al. 2004), we specify the kernel as the inner product in ${\mathscr {F}}^{(m)}$. The kernel value between the ith and jth data points is defined as ${k}^{(m)}_{ij}= \langle \phi _i^{(m)},\phi _j^{(m)} \rangle $, where $\phi _i^{(m)} = \phi ^{(m)}({\mathbf {x}}_i^{(m)})$ and $k_{ij}^{(m)}$ is an element of ${\mathbf {K}}^{(m)}$, the kernel Gram matrix for the set ${\mathbf {X}}^{(m)}$.

In this paper we make the assumption that a subset of samples is observed in each view, and correspondingly, a subset of views is observed for each sample. Let $I_N = \{1,\ldots , N\} $ be the set of indices of all data points and $I^{(m)}$ be the set of indices of all available data points in the mth view. Hence for each view, only a kernel sub-matrix (${\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}$) corresponding to the rows and columns indexed by $I^{(m)}$ is observed. Our aim is to predict a complete positive semi-definite (PSD) kernel matrix (${\hat{{\mathbf {K}}}}^{(m)} \in {\mathbb {R}}^{N \times N}$) corresponding to each view. The crucial task is to predict the missing (tth) rows and columns of ${\hat{{\mathbf {K}}}}^{(m)}$, for all $t \in I_N \setminus I^{(m)}$. Our approach for predicting ${\hat{{\mathbf {K}}}}^{(m)}$ is based on learning both between-view and within-view relationships among the kernel values (Fig. 1). The sub-matrix ${\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}$ should be approximately equal to the observed matrix ${\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}$; however, in our approach, the approximation quality of the two parts of the kernel matrix can be traded off.
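The index bookkeeping above can be illustrated with a small NumPy sketch. The sizes, the observed index set `I_m`, and the toy linear kernel are assumptions chosen purely for illustration, and indices are 0-based here rather than 1-based as in the text.

```python
import numpy as np

N = 6
I_N = np.arange(N)                    # indices of all data points (0-based)
I_m = np.array([0, 1, 3, 4])          # hypothetical observed indices of view m
missing = np.setdiff1d(I_N, I_m)      # I_N \ I^(m): rows/columns to be predicted

rng = np.random.default_rng(0)
X = rng.normal(size=(N, 3))           # toy features, used only to build a PSD kernel
K_full = X @ X.T                      # full kernel (not available in practice)
K_obs = K_full[np.ix_(I_m, I_m)]      # observed sub-matrix K_{I I} of view m
```

Only `K_obs` would be available to a completion method; `K_full` is built here just to have something to slice.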

2.1 Within-view kernel relationships

For within-view learning, relying on the concept of local linear embedding (Roweis and Saul 2000), we reconstruct the feature map of the tth data point $\phi _t$ by a sparse linear combination of observed data samples

$$\begin{aligned} {\hat{\phi }}_t= \sum _{i \in I} a_{it} \phi _i \end{aligned}$$

where $ a_{it} \in {\mathbb {R}}$ is the reconstruction weight of the ith feature representation for representing the tth observation. Hence, approximated kernel values can be expressed as

$$\begin{aligned} {\hat{k}}_{tt'} = \langle {\hat{\phi }}_t,{\hat{\phi }}_{t'} \rangle = \sum _{i,j \in I} a_{it} a_{jt'} \langle \phi _i, \phi _j \rangle . \end{aligned}$$

We note that the above formulation retains the non-linearity of the feature map $\phi $ and the corresponding kernel. We collect all reconstruction weights of a view into the matrix ${\mathbf {A}}= \left( a_{ij}\right) _{i,j=1}^N$. Further, by ${\mathbf {A}}_{I}$ we denote the sub-matrix of ${\mathbf {A}}$ containing the rows indexed by I, the known data samples in the view. Thus the reconstructed kernel matrix ${\hat{{\mathbf {K}}}}$ can be written as

$$\begin{aligned} {\hat{{\mathbf {K}}}} = {\mathbf {A}}^{^T}_{I} {\mathbf {K}}_{II} {\mathbf {A}}_{I} = g({\mathbf {K}}).\,\,\,\, \end{aligned}$$

(1)

Note that ${\hat{{\mathbf {K}}}}$ is positive semi-definite when ${\mathbf {K}}$ is positive semi-definite, since for any vector ${\mathbf {v}}$ we have ${\mathbf {v}}^T {\hat{{\mathbf {K}}}} {\mathbf {v}} = ({\mathbf {A}}_{I}{\mathbf {v}})^T {\mathbf {K}}_{II} ({\mathbf {A}}_{I}{\mathbf {v}}) \ge 0$. Thus, a by-product of this approximation is that, in optimization, the PSD property is automatically guaranteed without inserting explicit positive semi-definiteness constraints.

Intuitively, the reconstruction weights are used to extend the known part of the kernel to the unknown part; in other words, the unknown part is assumed to reside within the span of the known part.
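As a minimal numerical sketch of Eq. (1), the following builds a reconstructed kernel from an observed PSD block and an arbitrary weight matrix, and checks that the result stays PSD. The function name `reconstruct_kernel` and the random toy data are assumptions for illustration only; the weights here are not optimized.

```python
import numpy as np

def reconstruct_kernel(K_II, A_I):
    """Return K_hat = A_I^T K_II A_I as in Eq. (1)."""
    return A_I.T @ K_II @ A_I

rng = np.random.default_rng(0)
n_obs, N = 4, 6
Phi = rng.normal(size=(n_obs, 3))     # toy feature maps of the observed samples
K_II = Phi @ Phi.T                    # PSD kernel on the observed samples
A_I = rng.normal(size=(n_obs, N))     # rows: observed samples, columns: all N samples
K_hat = reconstruct_kernel(K_II, A_I)

# Smallest eigenvalue of the (symmetrized) reconstruction; non-negative up to round-off.
min_eig = np.linalg.eigvalsh((K_hat + K_hat.T) / 2).min()
```

Because every reconstructed feature map lies in the span of the observed ones, the rank of `K_hat` cannot exceed the rank of `K_II`.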

We further assume that in each view there exists a sparse embedding in ${\mathscr {F}}$, given by a small set of samples $B\subset I$, called a basis set, that is able to represent all possible feature representations in that particular view. Thus the non-zero reconstruction weights are confined to the basis set: $a_{ij} \ne 0$ only if $i \in B$. To select such a sparse set of reconstruction weights, we regularize the reconstruction weights by the $\ell _{2,1}$ norm (Argyriou et al. 2006) of the reconstruction weight matrix,

$$\begin{aligned} \Vert {\mathbf {A}}_{I}\Vert _{2,1} = \sum _{i \in I}\sqrt{\sum _{j \in I} (a_{ij})^2}. \end{aligned}$$

(2)

Finally, for the observed part of the kernel, we add the additional objective that the reconstructed kernel values closely approximate the known or observed values. To this end, we define a loss function measuring the within-view approximation error for each view as

$$\begin{aligned} Loss_{within} = \Vert {\hat{{\mathbf {K}}}}_{II}- {\mathbf {K}}_{II}\Vert ^2_2. \end{aligned}$$

(3)

Hence, for individual views the observed part of a kernel is approximated by

$$\begin{aligned} {\hat{{\mathbf {K}}}}_{II} = {\mathbf {A}}^{*^T}_{II} {\mathbf {K}}_{II} {\mathbf {A}}^{*}_{II} \end{aligned}$$

(4)

where the reconstruction weights ${\mathbf {A}}^{*}_{II}$ (here the superscript $*$ indicates the optimum values) are optimized using (2) and (3) by

$$\begin{aligned} {\mathbf {A}}^{*}_{II}= \mathop {{{\mathrm{arg\,min}}}}\limits _{{\mathbf {A}}_{II}} \Vert {\mathbf {A}}^{^T}_{II} {\mathbf {K}}_{II} {\mathbf {A}}_{II}- {\mathbf {K}}_{II}\Vert ^2_2 + \lambda \Vert {\mathbf {A}}_{II}\Vert _{2,1} \end{aligned}$$

(5)

where $\lambda $ is a user-defined hyper-parameter that controls the weight of the regularization.

Without the $\ell _{2,1}$ regularization, the above approximation loss could be trivially optimized by choosing ${\mathbf {A}}_{II}$ as the identity matrix. The $\ell _{2,1}$ regularization will have the effect of zeroing out some of the diagonal values and introducing non-zeros to the sub-matrix ${\mathbf {A}}_{BI}$, corresponding to the rows and columns indexed by $B$ and $I$ respectively, where $B= \{i | a_{ii} \ne 0\}$.
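The effect of the regularizer can be seen on a toy case. The sketch below evaluates the objective of (5), assuming the squared norm is the Frobenius norm (the text does not say which matrix norm is meant) and using hypothetical helper names. For a kernel of two identical samples, both the identity and a one-row weight matrix reproduce the kernel exactly, but the one-row solution has the smaller $\ell _{2,1}$ penalty.

```python
import numpy as np

def l21_norm(A):
    """Sum of the Euclidean norms of the rows of A, as in Eq. (2)."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

def within_view_objective(A, K, lam):
    """Objective of Eq. (5); the squared norm is taken as Frobenius (an assumption)."""
    residual = A.T @ K @ A - K
    return np.sum(residual ** 2) + lam * l21_norm(A)

K = np.array([[1.0, 1.0],
              [1.0, 1.0]])            # kernel of two identical samples
A_identity = np.eye(2)                # trivial solution: every sample is its own basis
A_sparse = np.array([[1.0, 1.0],
                     [0.0, 0.0]])     # both samples reconstructed from sample 0 only

obj_identity = within_view_objective(A_identity, K, lam=1.0)
obj_sparse = within_view_objective(A_sparse, K, lam=1.0)
```

Here the basis set would be $B=\{0\}$ in 0-based indexing: the second row of `A_sparse` is zero, so the second sample is not used as a basis vector.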

In Sect. 3 we show by Theorem 1 that (5) corresponds to a generalized form of the Nyström method (Williams and Seeger 2001), which is a sparse kernel approximation method that has been successfully applied to efficient kernel learning. The Nyström method finds a small set of vectors (not necessarily linearly independent) spanning the kernel, whereas our method searches for linearly independent basis vectors (cf. Sect. 3, Lemma 1) and optimizes the reconstruction weights for the data samples. In particular, we show that (5) achieves the best rank-r approximation of a kernel when the original kernel has rank higher than r, which is not achieved by the Nyström method (cf. Theorem 2).

2.2 Between-view kernel relationships

For a completely missing row or column of a kernel matrix, there is not enough information available for completing it within the same view, and hence the completion needs to be based on other information sources, in our case the other views where the corresponding kernel parts are known. In the following, we introduce two approaches for relaying information of the other views for completing the unknown rows/columns of a particular view. The first technique is based on learning a convex combination of the kernels, extending the multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012) techniques to kernel completion. The second technique is based on learning reconstruction weights so that they share information between the views.

Between-view learning of kernel values: In multi-view kernel completion, perhaps the simplest situation arises when the kernels of the different views are similar, i.e.,

$$\begin{aligned} {\hat{{\mathbf {K}}}}^{(m)}\approx & {} {\hat{{\mathbf {K}}}}^{(l)},\,\, \forall \,\, l \hbox { and } m. \end{aligned}$$

(6)

In this case, predicting kernel values across views may lead to a good kernel approximation. One way to model the similarity is to require the kernels of the different views to have similar low-rank approximations. In Theorem 3 we show that the optimal rank-r approximation can be achieved if the kernels have the same ‘true’ rank-r approximation and the kernels themselves have rank at least r.

However, this is probably an overly restrictive assumption in most applications. Thus, in our approach we allow the views to have different approximate kernel matrices with a parametrized relationship learned from data. To learn between-view relationships we express the individual normalized kernel matrix $\left( \frac{k_{tt'}}{\sqrt{k_{tt}k_{t't'}}}\right) $ corresponding to each view as a convex combination of normalized kernel matrices of the other views. Hence the proposed model learns kernel weights ${\mathbf {S}}= (s_{ml})_{m,l=1}^M$ between all pairs of kernels (m, l) such that

$$\begin{aligned} {\hat{{\mathbf {K}}}}^{(m)}\approx & {} \sum _{l=1,l \ne m}^{M} s_{ml} {\hat{{\mathbf {K}}}}^{(l)}, \end{aligned}$$

(7)

where the kernel weights are confined to a convex combination

$$\begin{aligned} {\mathscr {S}}=\{{\mathbf {S}}| s_{ml} \ge 0, \,\, \sum _{l=1,l \ne m}^{M} s_{ml} =1\}. \end{aligned}$$

The kernel weights then can flexibly pick up a subset of relevant views to the current view m. This gives us between-view loss as

$$\begin{aligned} Loss_{between}^{(m)}({\hat{{\mathbf {K}}}},{\mathbf {S}}) =\Vert {\hat{{\mathbf {K}}}}^{(m)} -\sum _{l=1,l\ne m}^M s_{ml}{\hat{{\mathbf {K}}}}^{(l)} \Vert ^2_2. \end{aligned}$$

(8)

Previously, Argyriou et al. (2005) have proposed a method for learning kernels by restricting the search in the convex hull of a set of given kernels to learn parameters of individual kernel matrices. Here, we apply the idea to kernel completion, which has not been previously considered. We further note that kernel approximation as a convex combination has the interpretation of avoiding extrapolation in the space of kernels, and can be interpreted as a type of regularization to constrain the otherwise flexible set of PSD kernel matrices.
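A small sketch of the between-view loss (8) on normalized kernels follows. The helper names, the toy kernels, and the choice of the Frobenius norm are assumptions for illustration. View 0 is constructed as an exact convex combination of views 1 and 2, so the loss vanishes for the matching weights and is positive for mismatched ones.

```python
import numpy as np

def normalize_kernel(K):
    """Rescale K entrywise to k_tt' / sqrt(k_tt * k_t't'), giving a unit diagonal."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def between_view_loss(K_hats, S, m):
    """Eq. (8) for view m; the squared norm is taken as Frobenius (an assumption)."""
    combo = sum(S[m, l] * K_hats[l] for l in range(len(K_hats)) if l != m)
    return float(np.sum((K_hats[m] - combo) ** 2))

rng = np.random.default_rng(0)
N = 5
X1, X2 = rng.normal(size=(N, 3)), rng.normal(size=(N, 3))
K1, K2 = normalize_kernel(X1 @ X1.T), normalize_kernel(X2 @ X2.T)
K0 = 0.3 * K1 + 0.7 * K2              # view 0 lies in the convex hull of views 1 and 2
K_hats = [K0, K1, K2]

S_true = np.array([[0.0, 0.3, 0.7], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
S_unif = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
loss_true = between_view_loss(K_hats, S_true, m=0)
loss_unif = between_view_loss(K_hats, S_unif, m=0)
```

Each row of `S_true` and `S_unif` is non-negative and sums to one over the off-diagonal entries, matching the constraint set ${\mathscr {S}}$.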

Between-view learning of reconstruction weights: In practical applications, the kernels arising in a multi-view setup might be very heterogeneous in their distributions. In such cases, it might not be realistic to find a convex combination of other kernels that are closely similar to the kernel of a given view. In particular, when the eigen-spectra of the kernels are very different, we expect a low between-view loss (8) to be hard to achieve.

Here we assume the approximated kernel matrices have related eigen-spaces in that the eigen-vectors of the related kernels can be written as linear combinations of eigen-vectors of the others, but each of them has its own set of eigen-values. In other words,

$$\begin{aligned} {\hat{{\mathbf {K}}}}^{(m)}= & {} {\mathbf {U}}^{(m)} {\varSigma }^{(m)} {\mathbf {U}}^{(m)^T} = {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)} {\varSigma }^{(m)} {\mathbf {T}}^{(m)^T} {\mathbf {U}}^{(1)^T} \,\,\, \end{aligned}$$

(9)

where ${\mathbf {U}}^{(m)}$ contains the eigen-vectors of ${\mathbf {K}}^{(m)}$ in its columns and ${\varSigma }^{(m)}$ contains the corresponding eigen-values on its diagonal. ${\mathbf {T}}^{(m)}$ is a linear operator such that ${\mathbf {U}}^{(m)}= {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}$. Above, the matrices ${\mathbf {T}}^{(m)}$ allow rotations of the eigen-vectors, while their scaling is governed by the matrices ${\varSigma }^{(m)}$.

For this situation, we propose an alternative approach where, instead of the kernel values, we assume that the basis sets and the reconstruction weights have between-view dependencies that we can learn. Theorem 4 states that when the kernels of all views satisfy (9), learning a single set of reconstruction weights used in all views, i.e.,

$$\begin{aligned} {\mathbf {A}}^{(1)} = \ldots ={\mathbf {A}}^{(M)} \end{aligned}$$

(10)

gives better approximation than learning a convex combination of kernels as in (7).

However, assuming that kernel functions in all the views have similar eigen-vectors is also unrealistic for many real world data-sets with heterogeneous sources and kernels applied to them. On the contrary, it is quite possible that only for a subset of views the eigen-vectors of approximated kernel are linearly related. Thus, in our approach we allow the views to have different reconstruction weights, but assume a parametrized relationship learned from data. This also allows the model to find an appropriate set of related kernels from the set of available incomplete kernels, for each missing entry.

To capture the relationship, we assume the reconstruction weights in a view can be approximated by a convex combination of the reconstruction weights of the other views,

$$\begin{aligned} {\mathbf {A}}^{(m)}_{I^{(m)}} \approx \sum _{l=1,l\ne m}^M s_{ml}{\mathbf {A}}^{(l)}_{I^{(m)}}, \end{aligned}$$

(11)

where the coefficients $s_{ml}$ are defined as in (7). This gives us between-view loss for reconstruction weights as

$$\begin{aligned} Loss_{between}^{(m)}({\mathbf {A}},{\mathbf {S}}) =\Vert {{\mathbf {A}}}^{(m)}_{I^{(m)}} -\sum _{l=1,l\ne m}^M s_{ml}{\mathbf {A}}^{(l)}_{I^{(m)}} \Vert ^2_2. \end{aligned}$$

(12)

The reconstructed kernel is thus given by

$$\begin{aligned} {\hat{{\mathbf {K}}}}^{(m)} = \left( \sum _{l=1,l\ne m}^M s_{ml}{\mathbf {A}}_{I^{(m)}} ^{(l)^T}\right) {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} \left( \sum _{l=1,l\ne m}^M s_{ml}{\mathbf {A}}^{(l)}_{I^{(m)}} \right) . \end{aligned}$$

(13)
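Equation (13) can be sketched as follows: the reconstruction weights of view m are replaced by a convex combination of the other views' weights (restricted to the rows $I^{(m)}$), and the result is plugged into the within-view reconstruction. The function name and all toy inputs are assumptions for illustration; no weights are learned here.

```python
import numpy as np

def reconstruct_from_other_views(K_obs_m, A_list, s_m, m):
    """Eq. (13): A_list[l] is A^(l) restricted to rows I^(m); s_m is row m of S."""
    A_mix = sum(s_m[l] * A_list[l] for l in range(len(A_list)) if l != m)
    return A_mix.T @ K_obs_m @ A_mix

rng = np.random.default_rng(0)
n_obs, N, M = 4, 6, 3
Phi = rng.normal(size=(n_obs, 3))
K_obs_m = Phi @ Phi.T                                  # observed block of view m = 0
A_list = [rng.normal(size=(n_obs, N)) for _ in range(M)]
s_m = np.array([0.0, 0.4, 0.6])                        # convex weights over views 1 and 2

K_hat_m = reconstruct_from_other_views(K_obs_m, A_list, s_m, m=0)
min_eig = np.linalg.eigvalsh((K_hat_m + K_hat_m.T) / 2).min()
```

As with Eq. (1), the completed kernel is PSD by construction, whatever convex weights are used.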

3 Theoretical analysis

In this section, we present the theoretical results underlying our methods. We begin by showing the relationship and advantages of our within-kernel approximation to the Nyström method, and follow with theorems establishing the approximation quality of different kernel completion models.

3.1 Rank of the within-kernel approximation

We begin with analysis of the rank of the proposed within-kernel approximation method, given in (4) and (5). It approximates the individual kernels as ${\hat{{\mathbf {K}}}} = {\mathbf {A}}^{*^T} {\mathbf {K}}{\mathbf {A}}^{*}$ where

$$\begin{aligned} {\mathbf {A}}^{*}= & {} \mathop {{{\mathrm{arg\,min}}}}\limits _{{\mathbf {A}}} \Vert {\mathbf {A}}^{^T} {\mathbf {K}}{\mathbf {A}}- {\mathbf {K}}\Vert ^2_2 + \lambda \Vert {\mathbf {A}}\Vert _{2,1}. \end{aligned}$$

(14)

For the purposes of the analysis, we derive an equivalent form that reveals the rank behaviour of the method more easily. Above, the matrix ${\mathbf {A}}$ simultaneously indicates the position of feature maps in the underlying subspace and also selects the basis vectors for defining these subspaces. Hence ${\mathbf {A}}$ can be written as a product of two operators ${\mathbf {A}}= {\mathbf {P}}{\hat{{\mathbf {A}}}} $ where ${\mathbf {P}}= diag({\mathbf {p}})$ and ${\mathbf {p}}\in \{0,1\}^{N}$. Here ${\mathbf {p}}$ acts as a selector operator such that ${\mathbf {p}}_{i} = 1$ if $i \in B$, i.e., if the ith feature map is selected as a basis vector, and all other elements of ${\mathbf {p}}$ are assigned zero values. ${\hat{{\mathbf {A}}}}$ is a matrix of the same size as ${\mathbf {A}}$, such that ${\hat{{\mathbf {A}}}}_{i \in B}={\mathbf {A}}_{i \in B} $ and other elements are zeros.

The $\ell _{2,1}$ norm on ${\mathbf {A}}$ in (14) assigns zeros to a few rows of the matrix ${\mathbf {A}}$; equivalently, an $\ell _{1}$ norm on the selection operator (${\mathbf {p}}$) fulfils the same purpose. Again, after the row selection is done, both ${\hat{{\mathbf {A}}}}_{i \in B}$ and ${\mathbf {A}}_{i \in B}$ denote reconstruction weights for the kernel using the selected rows ($B$) and behave similarly. Therefore, (14) is equivalent to

$$\begin{aligned}{}[{\hat{{\mathbf {A}}}}^{*}, {\mathbf {P}}^{*} ]= & {} \mathop {{{\mathrm{arg\,min}}}}\limits _{{\hat{{\mathbf {A}}}},{\mathbf {P}}= diag({\mathbf {p}})} \Vert {\hat{{\mathbf {A}}}}^{^T} {\mathbf {P}}^{^T} {\mathbf {K}}{\mathbf {P}}{\hat{{\mathbf {A}}}}- {\mathbf {K}}\Vert ^2_2 \,\,\, + \lambda \Vert {\mathbf {p}}\Vert _1 + \sum _i \lambda _i \Vert {\hat{{\mathbf {A}}}}_i\Vert ^2_2. \end{aligned}$$

(15)

To see the equivalence, note that at optimum, the rows of ${\hat{{\mathbf {A}}}}$ that are not selected by ${\mathbf {P}}$ will be identically zero, since the value of the first term of the objective only depends on the selected rows. Again, the equivalence in the regularization term can be shown as

$$\begin{aligned} \Vert {\mathbf {A}}\Vert _{2,1}= & {} \sum _i \Vert {\mathbf {A}}_{i}\Vert _2 = \sum _i \Vert {\mathbf {p}}_i {\hat{{\mathbf {A}}}}_{i}\Vert _2 \,\,\, \hbox { where } {\mathbf {A}}_i \hbox { denotes the } i\hbox {th row of } {\mathbf {A}}\nonumber \\= & {} \sum _i |{\mathbf {p}}_i| + \sum _i \frac{1}{4} \left( 1-\left( \frac{{\mathbf {p}}_i}{\Vert {\hat{{\mathbf {A}}}}_{i}\Vert _2 }-1\right) ^2\right) \Vert {\hat{{\mathbf {A}}}}_i\Vert _2^2 \,\,\, \hbox { as }{\mathbf {p}}_i \in \{1,0\} \nonumber \\ \lambda \Vert {\mathbf {A}}\Vert _{2,1}= & {} \lambda \Vert {\mathbf {p}}\Vert _1 + \sum _i \lambda _{i} \Vert {\hat{{\mathbf {A}}}}_i\Vert _2^2 \,\,\, \hbox { by equating }\lambda _{i} = \frac{\lambda }{4} \left( 1-\left( \frac{{\mathbf {p}}^*_i}{\Vert {\hat{{\mathbf {A}}}}^*_{i}\Vert _2 }-1\right) ^2\right) \end{aligned}$$

(16)

Now, the approximated kernel can be written as

$$\begin{aligned} {\hat{{\mathbf {K}}}}= & {} {\hat{{\mathbf {A}}}}^{*^T} {\mathbf {P}}^{*^T} {\mathbf {K}}{\mathbf {P}}^{*} {\hat{{\mathbf {A}}}}^{*} \nonumber \\= & {} {\hat{{\mathbf {A}}}}^{*^T}_{B} {\mathbf {W}}^{*}{\hat{{\mathbf {A}}}}^{*}_{B}, \end{aligned}$$

(17)

where ${\mathbf {W}}^{*}$ and ${\hat{{\mathbf {A}}}}^{*}_{B}$ are the non-zero sub-matrices of ${\mathbf {P}}^{*^T}{\mathbf {K}}{\mathbf {P}}^{*}$ (corresponding to the $B$ rows and $B$ columns) and ${\hat{{\mathbf {A}}}}^{*}$ (corresponding to the $B$ rows), respectively.

Lemma 1

For $rank({\mathbf {K}}) \ge r$, $\exists $ $\lambda =\lambda ^r $ and $\lambda _{{\hat{{\mathbf {A}}}}^*_{i}}$ such that the solution of (15) selects a ${\mathbf {W}}^{*} \in {\mathbb {R}}^{r \times r}$ with $rank({\mathbf {W}}^{*})=r$.

Proof

When $rank({\mathbf {K}}) \ge r$, there must exist a rank-r sub-matrix of ${\mathbf {K}}$ of size $r \times r$. With $\ell _1$ regularization on the binary vector ${\mathbf {p}}$, one can tune $\lambda $ to $\lambda ^r$ to obtain the required sparsity on ${\mathbf {p}}$, i.e., $\Vert {\mathbf {p}}^{*}\Vert _1= r$. Moreover, $\ell _1$ regularization on the binary vector ensures the linear independence of the selected columns and rows when $\lambda _{i}$ is carefully chosen. Suppose a solution of (15) selects a column which is linearly dependent on the other selected columns. Compared with the solution which selects the same columns except this linearly dependent one, it raises the $\ell _1$ norm regularization term of the objective function by $\lambda $, while keeping the first part of the objective function the same and lowering the third part of (15) by $\frac{\lambda _{i}\Vert A_i\Vert ^2_2}{2}$. Hence, if $\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}$, such a solution cannot be an optimum solution, and if $\lambda _{i}$ is chosen according to (16), then $\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}$. This completes the proof. $\square $

3.2 Relation to Nyström approximation

The Nyström method (Williams and Seeger 2001) approximates the kernel matrix ${\mathbf {K}}$ as

$$\begin{aligned} {\hat{{\mathbf {K}}}}_{nys}= {\mathbf {C}}{\mathbf {W}}^{-1} {\mathbf {C}}^{T} \end{aligned}$$

(18)

where the matrix ${\mathbf {C}}\in {\mathbb {R}}^{N \times c}$ consists of c randomly chosen columns of the kernel ${\mathbf {K}}$ and ${\mathbf {W}}\in {\mathbb {R}}^{c \times c}$ is a matrix consisting of the intersection of those c columns with the corresponding c rows. Due to the random selection, the Nyström method, unlike ours (as established above), does not in general produce a linearly independent set of vectors. One can re-write

$$\begin{aligned} {\mathbf {C}}= \left[ \begin{array}{c} {\mathbf {C}}_1 \\ {\mathbf {C}}_{2} \end{array} \right] = \left[ \begin{array}{c} {\mathbf {W}}\\ {\mathbf {C}}_2 \end{array}\right] \end{aligned}$$

and then the Nyström approximation as

$$\begin{aligned} {\hat{{\mathbf {K}}}}_{nys} = \left[ \begin{array}{c} {\mathbf {W}}\\ {\mathbf {C}}_2 \end{array} \right] {\mathbf {W}}^{-1}\left[ {\mathbf {W}}\,\, {\mathbf {C}}_2^{T}\right] = \left[ \begin{array}{cc} {\mathbf {W}}&{} {\mathbf {C}}_2^{T} \\ {\mathbf {C}}_2&{} {\mathbf {C}}_2{\mathbf {W}}^{-1}{\mathbf {C}}_2^{T} \end{array}\right] . \end{aligned}$$

(19)

For non-invertible ${\mathbf {W}}$, its pseudo inverse can be used instead of ${\mathbf {W}}^{-1}$.
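For reference, the Nyström approximation (18) with a pseudo-inverse can be sketched in a few lines of NumPy. The function name `nystrom`, the explicit `rcond` cut-off, and the toy rank-2 kernel are assumptions for illustration. The example picks three columns of a rank-2 kernel, so ${\mathbf {W}}$ is singular and the pseudo-inverse is needed.

```python
import numpy as np

def nystrom(K, cols):
    """Nystrom approximation C W^+ C^T from the selected columns (Eq. 18)."""
    C = K[:, cols]                         # N x c block of selected columns
    W = K[np.ix_(cols, cols)]              # c x c intersection block
    # rcond is set explicitly so that round-off-level singular values are discarded.
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = X @ X.T                                # rank-2 kernel on 6 points
cols = [0, 2, 4]                           # c = 3 > rank(K), so W is not invertible
K_nys = nystrom(K, cols)
```

In this toy case the selected block has the same rank as the full kernel, so the approximation reproduces the kernel up to round-off.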

Theorem 1

The Nyström approximation of ${\mathbf {K}}$ is a feasible solution of (15), i.e., for invertible ${\mathbf {W}}$, $ \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{c \times N}$ such that $ {\hat{{\mathbf {K}}}}_{nys} = {\hat{{\mathbf {A}}}}^{^T} {\mathbf {W}}{\hat{{\mathbf {A}}}} $.

Proof

Equate

$$\begin{aligned} {\hat{{\mathbf {A}}}}^{^T} =\left[ \begin{array}{c} { I}_c \\ {\mathbf {C}}_2 {\mathbf {W}}^{-1} \end{array}\right] \end{aligned}$$

in (19) where $I_c$ denotes the identity matrix of size c. $\square $

The above theorem shows that the approach of (15), by finding the optimal feasible solution, produces a kernel approximation that is at least as good as that of the Nyström method at the same level of sparsity.
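The construction in the proof of Theorem 1 is easy to check numerically. The sketch below assumes, as in (19), that the selected columns are the first c ones, and uses a random toy kernel for which ${\mathbf {W}}$ is invertible; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, c = 6, 3
X = rng.normal(size=(N, 4))
K = X @ X.T
C = K[:, :c]                               # first c columns, so C = [W; C_2]
W = K[:c, :c]
C2 = K[c:, :c]
W_inv = np.linalg.inv(W)

K_nys = C @ W_inv @ C.T                    # Nystrom approximation, Eq. (18)
A_hat_T = np.vstack([np.eye(c), C2 @ W_inv])   # A_hat^T = [I_c; C_2 W^{-1}], N x c
K_from_A = A_hat_T @ W @ A_hat_T.T         # A_hat^T W A_hat
```

The two matrices agree, which is what makes the Nyström solution one feasible point of (15) rather than necessarily its optimum.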

3.3 Low-rank approximation quality

The Nyström approximation satisfies the following low-rank approximation properties (Kumar et al. 2009):

If $r=rank({\mathbf {K}})\le c$ and $rank({\mathbf {W}})=r$, then the Nyström approximation is exact, i.e., $\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}_{nys}\Vert ^2_2 =0$.

For general ${\mathbf {K}}$ with $rank({\mathbf {K}}) \ge r$ and $rank({\mathbf {W}})=r$, the Nyström approximation is not the best rank-r approximation of ${\mathbf {K}}$.

Below, we will establish that our approach will result in the best rank-r approximation also in the general case where the original kernel has rank higher than r.

Lemma 2

Let ${\mathbf {K}}_r$ be the best rank-r approximation of a kernel ${\mathbf {K}}$ with $rank({\mathbf {K}}) \ge r$, and let ${\mathbf {W}}$ be a full-rank sub-matrix of ${\mathbf {K}}$ of size $r \times r$, i.e., $rank({\mathbf {W}})=r$. Then $ \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{r \times N}$ such that the proposed approximation ${\hat{{\mathbf {K}}}} ={\hat{{\mathbf {A}}}}^T {\mathbf {W}}{\hat{{\mathbf {A}}}}$ is equivalent to ${\mathbf {K}}_r$, i.e., $\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}}- {\mathbf {K}}_r\Vert ^2_2$.

Proof

Using eigen-value decomposition one can write

$$\begin{aligned} {\mathbf {K}}_r = {\mathbf {U}}{\mathbf {U}}^T&\hbox { and }&{\mathbf {W}}= {\mathbf {U}}_W {\mathbf {U}}_W^T \end{aligned}$$

where ${\mathbf {U}}\in {\mathbb {R}}^{N \times r}$ and $rank({\mathbf {U}}) =r$ and ${\mathbf {U}}_W \in {\mathbb {R}}^{r \times r}$ and $rank({\mathbf {U}}_W) =r$.

Using the invertibility of ${\mathbf {U}}_W$, one can express ${\mathbf {U}}$ as ${\mathbf {U}}=({\mathbf {U}}{\mathbf {U}}_W^{-1}) {\mathbf {U}}_W$. Hence

$$\begin{aligned} {\mathbf {K}}_r= & {} {\mathbf {U}}{\mathbf {U}}^T \\= & {} ({\mathbf {U}}{\mathbf {U}}_W^{-1}) ({\mathbf {U}}_W{\mathbf {U}}_W^T) ({\mathbf {U}}{\mathbf {U}}_W^{-1})^T \\= & {} {\hat{{\mathbf {A}}}}^T {\mathbf {W}}{\hat{{\mathbf {A}}}} \end{aligned}$$

where ${\hat{{\mathbf {A}}}}^T = {\mathbf {U}}{\mathbf {U}}_W^{-1}$. $\square $
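The factorization used in the proof of Lemma 2 can be reproduced on a small example. This is only an illustrative sketch: the kernel is random, ${\mathbf {W}}$ is taken to be the leading $r \times r$ block (an assumption made for convenience), and the Cholesky factor is used as one possible choice of ${\mathbf {U}}_W$ with ${\mathbf {W}}= {\mathbf {U}}_W {\mathbf {U}}_W^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 7, 2
X = rng.standard_normal((N, 4))
K = X @ X.T                                   # PSD kernel with rank 4 > r

# Best rank-r approximation K_r = U U^T from the top-r eigenpairs.
evals, evecs = np.linalg.eigh(K)              # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:r]
U = evecs[:, top] * np.sqrt(evals[top])       # N x r, so that K_r = U U^T
K_r = U @ U.T

# A full-rank r x r block W of K, factored as W = U_W U_W^T.
W = K[:r, :r]
U_W = np.linalg.cholesky(W)                   # lower-triangular and invertible

A_hat_T = U @ np.linalg.inv(U_W)              # A_hat^T = U U_W^{-1}
K_hat = A_hat_T @ W @ A_hat_T.T               # equals U U^T = K_r
```

The check confirms only the existence claim of the lemma; it does not show how an optimizer would find this particular ${\hat{{\mathbf {A}}}}$.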

Theorem 2

If $rank({{\mathbf {K}}}) \ge r$, then $\exists $ $\lambda ^r$ such that the proposed approximation ${\hat{{\mathbf {K}}}}$ in (17) is equivalent to the best rank-r approximation of ${{\mathbf {K}}}$, i.e., $\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}}- {{\mathbf {K}}}_r\Vert ^2_2$, where ${{\mathbf {K}}}_r$ is the best rank-r approximation of ${{\mathbf {K}}}$.

Proof

Lemma 1 proves that there exists a $\lambda _r$ for which the optimum solution of (15) results in a ${\mathbf {W}}^{*} \in {\mathbb {R}}^{r \times r}$ such that $rank({\mathbf {W}}^{*})=r$. According to Lemma 2, there also exists a feasible ${\hat{{\mathbf {A}}}}$ which reconstructs ${{\mathbf {K}}}_r$. Let ${\hat{{\mathbf {A}}}}^{*}$ be the optimum solution of (15) with $\lambda _A =0$; then

$$\begin{aligned} \Vert {\mathbf {K}}-{\hat{{\mathbf {A}}}}^{*T} {\mathbf {W}}^{*} {\hat{{\mathbf {A}}}}^{*}\Vert _2^2\le & {} \Vert {\mathbf {K}}-{\hat{{\mathbf {A}}}}^{T} {\mathbf {W}}^{*} {\hat{{\mathbf {A}}}}\Vert _2^2 \\ \Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}\Vert ^2_2\le & {} \Vert {\mathbf {K}}- {{\mathbf {K}}}_r\Vert ^2_2. \end{aligned}$$

Since ${\hat{{\mathbf {K}}}}$ has rank at most r and ${{\mathbf {K}}}_r$ is the best rank-r approximation of ${\mathbf {K}}$, the reverse inequality also holds, so the two errors are equal. This completes the proof. $\square $

3.4 Low-rank approximation quality of multiple kernel matrices

In this section, we establish the approximation quality achieved in the multi-view setup, when the different kernels are similar either in the sense of having the same underlying ‘true’ low-rank approximations (Theorem 3) or more generally similar sets of eigen-vectors (Theorem 4).

Theorem 3

Assume ${\mathbf {K}}^{(1)}, \ldots ,{\mathbf {K}}^{(M)}$ are M kernel matrices such that $ \forall \,\,m,\,\, rank({\mathbf {K}}^{(m)}) \ge r$ and that all of them have the same rank-r approximation, i.e., ${\mathbf {K}}^{(1)}_r = \ldots = {\mathbf {K}}^{(M)}_r$ (assumption in (6)). Then $\exists \lambda ^r$ and $\lambda _A$ such that the following optimization problem:

$$\begin{aligned} \min _{{\hat{{\mathbf {A}}}}^{(m)},{\mathbf {P}}}&\sum _{m=1}^M \Vert {\mathbf {K}}^{(m)}-{\hat{{\mathbf {A}}}}^{(m)^T}{\mathbf {P}}^{^T}{\mathbf {K}}^{(m)}{\mathbf {P}}{\hat{{\mathbf {A}}}}^{(m)}\Vert _2^2 + \lambda ^r \sum _{m=1}^M \Vert {\mathbf {p}}\Vert _1 + \lambda _A \sum _{m=1}^M\Vert {\mathbf {A}}^{(m)}\Vert _2^2 \nonumber \\ s.t.&{\hat{{\mathbf {A}}}}^{(m)^T}{\mathbf {P}}^{^T}{\mathbf {K}}^{(m)}{\mathbf {P}}{\hat{{\mathbf {A}}}}^{(m)} = {\hat{{\mathbf {A}}}}^{(l)^T}{\mathbf {P}}^{^T}{\mathbf {K}}^{(l)}{\mathbf {P}}{\hat{{\mathbf {A}}}}^{(l)}\,\,\,\, \forall \,\, m \hbox { and }l \nonumber \\&{\mathbf {P}}=diag({\mathbf {p}}) \hbox { and } {\mathbf {p}}_i \in \{0,1\} \end{aligned}$$

(20)

produces the exact rank-r approximation for individual matrices, i.e.,

$$\begin{aligned} \sum _{m=1}^M \Vert {\mathbf {K}}^{(m)}-{\hat{{\mathbf {A}}}}^{(m)^T}{\mathbf {P}}^{^T}{\mathbf {K}}^{(m)}{\mathbf {P}}{\hat{{\mathbf {A}}}}^{(m)}\Vert _2^2 = \sum _{m=1}^M \Vert {\mathbf {K}}^{(m)}-{\mathbf {K}}^{(m)}_r\Vert _2^2. \end{aligned}$$

Proof

Each symmetric positive semi-definite kernel matrix can be written as

$$\begin{aligned} {\mathbf {K}}^{(m)}={\mathbf {X}}^{(m)}{\mathbf {X}}^{(m)^T}, \end{aligned}$$

where ${\mathbf {X}}^{(m)} \in {\mathbb {R}}^{N \times rank({\mathbf {K}}^{(m)})}$ and columns of ${\mathbf {X}}^{(m)}$ are orthogonal to each other.

When all ${\mathbf {K}}^{(m)}$s have the same rank-r approximation, the first r columns of ${\mathbf {X}}^{(m)}$ are the same for all m. Hence ${\mathbf {X}}^{(m)}$ can be expressed as

$$\begin{aligned} {\mathbf {X}}^{(m)}=\left[ {\mathbf {X}}_r \, {\mathbf {X}}_{r^c}^{(m)} \right] \end{aligned}$$

where $r^c$ denotes the complement of set r. Here ${\mathbf {X}}_r \in {\mathbb {R}}^{N \times r}$ is a rank-r matrix and hence it is possible to find a set of r rows from ${\mathbf {X}}_r$ which together produce a rank-r sub-matrix of size $r \times r$. Let ${\mathbf {P}}^{*^T}$ be such a selector operator, which selects r linearly independent rows from ${\mathbf {X}}_r$, i.e., $rank({\mathbf {P}}^{*^T}{\mathbf {X}}_r)=r$. Moreover, according to Lemma 1 there exists a $\lambda _r$ for which the optimization problem in (20) gives the required sparsity in ${\mathbf {P}}^*$.

Hence,

$$\begin{aligned} {\mathbf {W}}^{(m)}={\mathbf {P}}^{*^T} {\mathbf {K}}^{(m)} {\mathbf {P}}^{*} = {\mathbf {X}}_W^{(m)}{{\mathbf {X}}_{W}^{(m)}}^T \end{aligned}$$

where ${\mathbf {X}}_W^{(m)} = {\mathbf {P}}^{*^T}{\mathbf {X}}^{(m)} = {\mathbf {P}}^{*^T} \left[ {\mathbf {X}}_r \, {\mathbf {X}}_{r^c}^{(m)} \right] $. Hence ${\mathbf {X}}_W^{(m)}$ contains r linearly independent rows of ${\mathbf {X}}^{(m)}$, and therefore $rank({\mathbf {W}}^{(m)})=r$ for all m.

When the parameter $\lambda _A$ is sufficiently small, Theorem 2 implies that for a ${\mathbf {W}}^{(m)}$ with $rank({\mathbf {W}}^{(m)})=r$ there exists an ${\hat{{\mathbf {A}}}}^{(m)}$ which generates the exact rank-r approximation of each individual kernel matrix, i.e.,

$$\begin{aligned} \Vert {\mathbf {K}}^{(m)}-{\hat{{\mathbf {A}}}}^{(m)^T}{\mathbf {P}}^{*^T}{\mathbf {K}}^{(m)}{\mathbf {P}}^{*} {\hat{{\mathbf {A}}}}^{(m)}\Vert _2^2 = \Vert {\mathbf {K}}^{(m)}-{\mathbf {K}}^{(m)}_r\Vert _2^2, \,\,\,\, \forall \,\, m. \end{aligned}$$

This completes the proof. $\square $

Theorem 4

Assume ${\mathbf {K}}^{(1)}, \ldots ,{\mathbf {K}}^{(M)}$ are M kernel matrices such that $ \forall \,\,m,\,\, rank({\mathbf {K}}^{(m)}) = r$ and all of them have the same eigen-space, i.e., the eigen-vectors are linearly transferable while the eigen-values may differ (assumption in (9)): ${{\mathbf {K}}}^{(m)} = {\mathbf {U}}^{(m)} {\varSigma }^{(m)} {\mathbf {U}}^{(m)^T} $ such that ${\mathbf {U}}^{(m)} = {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}$. Then $\exists \lambda ^r$ and $\lambda _A$ such that the following optimization problem (by the assumption in (10))

$$\begin{aligned} \min _{{\hat{{\mathbf {A}}}},{\mathbf {P}}}&\sum _{m=1}^M \Vert {\mathbf {K}}^{(m)}-{\hat{{\mathbf {A}}}}^T{\mathbf {P}}^T{\mathbf {K}}^{(m)}{\mathbf {P}}{\hat{{\mathbf {A}}}}\Vert _2^2 + \lambda ^r \sum _{m=1}^M \Vert {\mathbf {p}}\Vert _1 + \lambda _A \Vert {\hat{{\mathbf {A}}}}\Vert _2^2 \nonumber \\&{\mathbf {P}}=diag({\mathbf {p}}) \hbox { and } {\mathbf {p}}_i \in \{0,1\} \end{aligned}$$

(21)

selects a rank-r sub-matrix ${\mathbf {W}}^{*(m)} \in {\mathbb {R}}^{r \times r} $ with $rank({\mathbf {W}}^{*(m)})=r$ of each kernel ${\mathbf {K}}^{(m)}$ which can produce the exact reconstruction for individual matrices, i.e.,

$$\begin{aligned} \sum _{m=1}^M \Vert {\mathbf {K}}^{(m)}-{\hat{{\mathbf {A}}}}^{*T}{\mathbf {W}}^{*(m)} {\hat{{\mathbf {A}}}}^*\Vert _2^2 = 0. \end{aligned}$$

Proof

According to assumption in (9) each kernel matrix can be written as

$$\begin{aligned} {\mathbf {K}}^{(m)} = {\mathbf {U}}^{(m)} {\varSigma }^{(m)} {\mathbf {U}}^{(m)^T} = {\mathbf {U}}^{(1)} {\mathbf {T}}^{(m)} {\varSigma }^{(m)} {\mathbf {T}}^{(m)^T} {\mathbf {U}}^{(1)^T} \end{aligned}$$

where ${\mathbf {U}}^{(1)} \in {\mathbb {R}}^{N \times r}$ is orthonormal. Hence it is possible to find a set of r rows from ${\mathbf {U}}^{(1)}$ which together produce a rank-r sub-matrix of size $r \times r$. Let ${\mathbf {P}}^{*^T}$ be such a selector operator, which selects r linearly independent rows from ${\mathbf {U}}^{(1)}$, and let $r^*$ denote the set of indices of these rows. Hence ${\mathbf {U}}^{(1)}= \left[ \begin{array}{c} {\mathbf {U}}^{(1)}_{r^*}\\ {\mathbf {U}}^{(1)}_{r^{c}} \end{array} \right] $ and ${\mathbf {U}}^{(1)}_{r^*}$ is invertible.

According to Lemma 1 there exists a $\lambda _r$ for which the optimization problem (21) gives the required sparsity in ${\mathbf {P}}^{*T}$. Hence using the assumption ${\mathbf {U}}^{(m)} = {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}$, we get

$$\begin{aligned} {\mathbf {W}}^{*(m)}={\mathbf {P}}^{*^T} {\mathbf {K}}^{(m)} {\mathbf {P}}^{*}= {\mathbf {P}}^{*T} {\mathbf {U}}^{(1)} \left( {\mathbf {T}}^{(m)} {\varSigma }^{(m)} {\mathbf {T}}^{(m)^T} \right) {\mathbf {U}}^{(1)^T} {\mathbf {P}}^{*} = {\mathbf {U}}_{r^*}^{(1)} \left( {\mathbf {T}}^{(m)} {\varSigma }^{(m)} {\mathbf {T}}^{(m)^T} \right) {\mathbf {U}}_{r^*}^{(1)^T} \end{aligned}$$

Hence, given that $\lambda _A$ is sufficiently small, according to Theorem 2 the optimization problem (21) selects a sub-matrix ${\mathbf {W}}^{*(m)}$ such that ${\mathbf {W}}^{*(m)} \in {\mathbb {R}}^{r \times r}$ and $rank({\mathbf {W}}^{*(m)})=r$. Each kernel matrix can then be expressed in terms of ${\mathbf {W}}^{*(m)}$ as

$$\begin{aligned} {\mathbf {K}}^{(m)}= {\mathbf {U}}^{(m)} {\varSigma }^{(m)} {\mathbf {U}}^{(m)T}= & {} {\mathbf {U}}^{(1)}\left( {\mathbf {T}}^{(m)} {\varSigma }^{(m)} {\mathbf {T}}^{(m)T} \right) {\mathbf {U}}^{(1)T}\\= & {} \left( {\mathbf {U}}^{(1)} {\mathbf {U}}_{r^*}^{(1)-1}\right) {\mathbf {W}}^{*(m)} \left( {\mathbf {U}}^{(1)} {\mathbf {U}}_{r^*}^{(1)-1}\right) ^T. \end{aligned}$$

Defining ${\hat{{\mathbf {A}}}}^*= \left( {\mathbf {U}}^{(1)} {\mathbf {U}}_{r^*}^{(1)-1}\right) ^T$ (which is possible for sufficiently small $\lambda _A$), we get ${\mathbf {K}}^{(m)} = {\hat{{\mathbf {A}}}}^{*^T}{\mathbf {W}}^{*(m)} {\hat{{\mathbf {A}}}}^*. $ This completes the proof. $\square $
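The key point of Theorem 4, that a single set of reconstruction weights serves every view when the eigen-vectors are linearly transferable, can be illustrated numerically. The sketch below is built on assumptions that are not from the paper: the shared basis comes from a QR factorization of random data, the transfer matrices ${\mathbf {T}}^{(m)}$ are random orthogonal matrices, and the selected index set $r^*$ is simply the first r rows.

```python
import numpy as np

rng = np.random.default_rng(2)
N, r, M = 6, 2, 3

# Shared orthonormal basis U1 (N x r), obtained here from a QR factorization.
U1, _ = np.linalg.qr(rng.standard_normal((N, r)))

kernels = []
for _ in range(M):
    T, _ = np.linalg.qr(rng.standard_normal((r, r)))   # orthogonal transfer matrix T^(m)
    Sigma = np.diag(rng.uniform(0.5, 2.0, size=r))     # view-specific eigenvalues
    kernels.append(U1 @ T @ Sigma @ T.T @ U1.T)        # K^(m) = U1 T Sigma T^T U1^T

rows = [0, 1]                                  # r*: assumed to index independent rows of U1
U1_sel = U1[rows, :]
A_star_T = U1 @ np.linalg.inv(U1_sel)          # (A_hat^*)^T = U1 U1_{r*}^{-1}, shared by all views

# Reconstruction error of each view from its own r x r block W^{*(m)}.
errors = [np.linalg.norm(K - A_star_T @ K[np.ix_(rows, rows)] @ A_star_T.T)
          for K in kernels]
```

All entries of `errors` are at the level of floating-point noise, although the view-specific blocks ${\mathbf {W}}^{*(m)}$ differ from view to view.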

4 Optimization problems

Here we present the optimization problems for Multi-view Kernel Completion (MKC), arising from the within-view and between-view kernel approximations described above.

4.1 MKC using semi-definite programming ($\mathbf{MKC}_{sdp}$)

This is the most general case where we do not put any other restrictions on kernels of individual views, other than restricting them to be positive semi-definite kernels. In this general case we propagate information from other views by learning between-view relationships depending on kernel values in (7). Hence, using (3) and (8) we get

$$\begin{aligned} \min _{\begin{array}{c} {\mathbf {S}},{\hat{{\mathbf {K}}}}^{(m)},\\ m=1,\ldots ,M \end{array}}&\sum _{m=1}^M \left( \Vert {\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}-{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\Vert ^2_2 + \lambda _1 \Vert {\hat{{\mathbf {K}}}}^{(m)} -\sum _{l=1,l\ne m}^M s_{ml}{\hat{{\mathbf {K}}}}^{(l)} \Vert ^2_2\right) \nonumber \\ s.t.&{\hat{{\mathbf {K}}}}^{(m)} \succeq 0 \,\,\, \forall \, m=1,\ldots ,M \nonumber \\&{\mathbf {S}}\in {\mathscr {S}}. \end{aligned}$$

(22)

We solve this non-convex optimization problem by iteratively solving it for ${\mathbf {S}}$ and ${\hat{{\mathbf {K}}}}^{(m)}$ using block-coordinate descent. For a fixed ${\mathbf {S}}$, to update the ${\hat{{\mathbf {K}}}}^{(m)}$’s we need to solve a semi-definite program with M positive semi-definite constraints.

4.2 MKC using homogeneous embeddings ($\mathbf{MKC}_{embd(hm)}$)

An optimization problem with M positive semi-definite constraints is inefficient even for a data set of size 100. To avoid solving the SDP in each iteration we assume a kernel approximation (1). When kernel functions in different views are not the same and kernel matrices in different views have different eigen-spectra, we learn relationships among underlying embeddings of different views (10), instead of the actual kernel values. Hence, using (3), (1) and (10) along with $\ell _{2,1}$ regularization on ${\mathbf {A}}$, we get

$$\begin{aligned} \min _{ \begin{array}{c} {\mathbf {A}},{\hat{{\mathbf {K}}}}^{(m)}, \\ \forall m=1,\ldots ,M \end{array}}&\sum _{m=1}^M \Vert {\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}-{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\Vert ^2_2 +\lambda \Vert {\mathbf {A}}\Vert _{2,1}\nonumber \\ s.t&{\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}_{I^{(m)}} \nonumber \\&{\mathbf {S}}\in {\mathscr {S}} \end{aligned}$$

(23)

Theorem 4 shows that the above formulation is appropriate when the first few eigen-vectors of all kernels are the same while the corresponding eigen-values may be different.

4.3 MKC using heterogeneous embeddings ($\mathbf{MKC}_{embd(ht)}$)

When kernel matrices in different views have different eigen-spectra, both in eigen-values and eigen-vectors, we learn relationships among underlying embeddings of different views (11), instead of the actual kernel values. Hence, using (3), (1) and (12) along with $\ell _{2,1}$ regularization on ${\mathbf {A}}^{(m)}$, we get

$$\begin{aligned} \min _{ \begin{array}{c} {\mathbf {S}}, {\mathbf {A}}^{(m)},{\hat{{\mathbf {K}}}}^{(m)}, \\ \forall m=1,\ldots ,M \end{array}}&\sum _{m=1}^M \left( \Vert {\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}-{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\Vert ^2_2 +\lambda _1 \Vert {{\mathbf {A}}}^{(m)}_{I^{(m)}} -\sum _{l=1,l\ne m}^M s_{ml}{\mathbf {A}}^{(l)}_{I^{(m)}} \Vert ^2_2 \right) \nonumber \\&+ \lambda _2\sum _{m=1}^M \Vert {\mathbf {A}}^{(m)}\Vert _{2,1}\nonumber \\ s.t&{\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}} \nonumber \\&{\mathbf {S}}\in {\mathscr {S}} \end{aligned}$$

(24)
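The equality constraint in (23) and (24) replaces the explicit positive semi-definite constraint of (22): a kernel of the form ${\mathbf {A}}^{T}_{I} {\mathbf {K}}_{II} {\mathbf {A}}_{I}$ is positive semi-definite by construction whenever the observed block is. The sketch below illustrates this under simplifying assumptions: the observed points are the first `n_obs` points, and the weights are hypothetical (identity on observed points, random on missing ones) rather than the result of any optimization.

```python
import numpy as np

def complete_kernel(K_obs, A_rows):
    """Build K_hat = A_I^T K_II A_I.

    K_obs  : |I| x |I| observed kernel block K_II
    A_rows : |I| x N rows of the reconstruction weights indexed by I
    """
    return A_rows.T @ K_obs @ A_rows

rng = np.random.default_rng(3)
N, n_obs = 6, 4
X = rng.standard_normal((N, 3))
K_full = X @ X.T
K_obs = K_full[:n_obs, :n_obs]          # observed points assumed to be the first n_obs

# Hypothetical weights: identity on observed points, random weights for missing ones.
A_rows = np.hstack([np.eye(n_obs), rng.standard_normal((n_obs, N - n_obs))])
K_hat = complete_kernel(K_obs, A_rows)  # N x N completed kernel
```

With identity weights on the observed points the observed block is reproduced exactly, and the rows and columns of the missing points are expressed through the observed block only.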

4.4 MKC using kernel approximation ($\mathbf{MKC}_{app}$)

On the other hand, when the low-rank approximations of related kernels are the same (6), between-view relationships are learnt on kernel values using (8). In this case the kernel is approximated to avoid solving the SDP:

$$\begin{aligned} \min _{\begin{array}{c} {\mathbf {S}},{\mathbf {A}}^{(m)},{\hat{{\mathbf {K}}}}^{(m)},\\ \forall m=1,\ldots ,M \end{array}}&\sum _{m=1}^M \left( \Vert {\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}-{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\Vert ^2_2 + \lambda _1 \Vert {\hat{{\mathbf {K}}}}^{(m)} -\sum _{l=1,l\ne m}^M s_{ml}{\hat{{\mathbf {K}}}}^{(l)} \Vert ^2_2 \right) \nonumber \\&+\lambda _2\sum _{m=1}^M \Vert {\mathbf {A}}^{(m)}\Vert _{2,1}\nonumber \\ s.t&{\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}} \nonumber \\&{\mathbf {S}}\in {\mathscr {S}} \end{aligned}$$

(25)

Theorem 3 shows that this method results in the exact rank-r approximation when the rank-r approximations of the kernels of related views are the same. We solve all the above-mentioned non-convex optimization problems with $\ell _{2,1}$ regularization by sequentially updating ${\mathbf {S}}$ and ${\mathbf {A}}^{(m)}$. In each iteration, ${\mathbf {S}}$ is updated by solving a quadratic program and, for each m, ${\mathbf {A}}^{(m)}$ is updated using proximal gradient descent.

5 Algorithms

Here we present algorithms for solving the optimization problems described in the previous section.^{Footnote 1}

5.1 Algorithm to solve $\mathbf{MKC}_{embd(ht)}$

Algorithm 1 describes how to solve $\mathbf{MKC}_{embd(ht)}$ (24).

Substituting ${\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}}$, the optimization problem (24) contains two sets of unknowns, ${\mathbf {S}}$ and the ${\mathbf {A}}^{(m)}$’s. We update ${\mathbf {A}}^{(m)}$ and ${\mathbf {S}}$ in an iterative manner. In the kth iteration, for a fixed ${\mathbf {S}}^{k-1}$ from the previous iteration, to update the ${\mathbf {A}}^{(m)}$’s we need to solve the following for each m:

$$\begin{aligned} {\mathbf {A}}^{(m)^k} = \arg \min _{{\mathbf {A}}^{(m)}} Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)}) +\lambda _2 {\varOmega }({\mathbf {A}}^{(m)}) \end{aligned}$$

where ${\varOmega }({\mathbf {A}}^{(m)})=\Vert {\mathbf {A}}^{(m)}\Vert _{2,1}$ and $Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)})=\Vert {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}- \left[ {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}\right] _{I^{(m)}I^{(m)}}\Vert _2^2 +\lambda _1 \sum _{m=1}^M \Vert {\mathbf {A}}^{(m)} - \sum _{l=1, l\ne m}^{M}s^{k-1}_{ml}{\mathbf {A}}^{(l)} \Vert _2^2$.

Instead of solving this problem in each iteration we update ${\mathbf {A}}^{(m)}$ using proximal gradient descent. Hence, in each iteration,

$$\begin{aligned} {\mathbf {A}}^{(m)^k} = Prox_{\gamma \lambda _2{\varOmega }}\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}}) \right)&\end{aligned}$$

(26)

where $\partial Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)})$ is the differential of $Aobj_{{\mathbf {S}}}^{k}({\mathbf {A}}^{(m)})$ at ${\mathbf {A}}^{(m)^{k-1}}$ and $\gamma $ is the step size which is decided by a line search. In (26) each row of ${\mathbf {A}}^{(m)}$ (i.e., ${\mathbf {a}}_{t}^{(m)}$) can be solved independently and we apply a proximal operator on each row. Following Bach et al. (2011), the solution of (26) is

$$\begin{aligned} {\mathbf {a}}_t^{(m)^k} = max\left( 0, 1-\frac{\gamma \lambda _2}{\Vert {\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}\Vert _2}\right) {\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}, \end{aligned}$$

(27)

where ${\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}$ is the tth row of $\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}})\right) $.
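The row-wise update can be written compactly as a group soft-thresholding step. The following is a minimal sketch, assuming that the gradient of the smooth part is supplied by the caller; the function names `prox_l21` and `proximal_step` are hypothetical, and the step size $\gamma $ is passed in directly instead of being chosen by a line search.

```python
import numpy as np

def prox_l21(Z, threshold):
    """Row-wise group soft-thresholding: the prox of threshold * ||.||_{2,1}."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    # The scaling factor is clipped at zero; the inner maximum guards against
    # division by zero for rows that are already all-zero.
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return scale * Z

def proximal_step(A_prev, grad, gamma, lam2):
    """One update of the form (26): a gradient step followed by the prox."""
    return prox_l21(A_prev - gamma * grad, gamma * lam2)

Z = np.array([[3.0, 4.0],     # norm 5: shrunk by the factor 1 - 1/5
              [0.3, 0.4],     # norm 0.5: set to zero when the threshold is 1
              [0.0, 0.0]])    # already zero: stays zero
shrunk = prox_l21(Z, threshold=1.0)
```

Rows whose norm falls below the threshold are set exactly to zero, which is how the $\ell _{2,1}$ penalty removes data points from the set of basis vectors.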

Again, in the kth iteration, for fixed ${\mathbf {A}}^{(m)^k}$’s, ${\mathbf {S}}$ is updated by independently updating each row (${\mathbf {s}}_m$) by solving the following quadratic program:

$$\begin{aligned} {\mathbf {s}}_m^{k}= & {} \arg \min _{{\mathbf {s}}_m} Sobj_{{\mathbf {A}}^{(m),m=1,\ldots ,M}}^k({\mathbf {s}}_m) \nonumber \\ s.t&\sum _{l\ne m} s_{ml} =1, \nonumber \\&s_{ml} \ge 0 \,\, \forall l \end{aligned}$$

(28)

where $Sobj_{{\mathbf {A}}^{(m),m=1,\ldots ,M}}^k({\mathbf {s}}_m)=\Vert {\mathbf {A}}^{(m)^k}-\sum _{l=1, l\ne m}^{M}s_{ml}{\mathbf {A}}^{(l)^k}\Vert _2^2$.
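Problem (28) is a least-squares problem over the probability simplex. The sketch below solves it with projected gradient descent, which is a stand-in chosen for self-containedness and not the quadratic programming solver used in the paper; the function names, the iteration count and the step-size rule are all assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def update_s_row(target, others, n_iter=500):
    """Minimise ||target - sum_l s_l * others[l]||^2 over the simplex.

    target : A^(m) for the view being updated
    others : list of A^(l) for all l != m
    """
    B = np.stack([O.ravel() for O in others], axis=1)   # columns are vectorised A^(l)
    y = target.ravel()
    step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    s = np.full(B.shape[1], 1.0 / B.shape[1])           # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * B.T @ (B @ s - y)
        s = project_simplex(s - step * grad)
    return s

rng = np.random.default_rng(4)
others = [rng.standard_normal((4, 4)) for _ in range(3)]
target = 0.7 * others[0] + 0.3 * others[2]
s_row = update_s_row(target, others)
```

In this toy example the target is an exact convex combination of two of the other views, so the recovered row of ${\mathbf {S}}$ concentrates on those two views.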

Computational complexity: Each iteration of Algorithm 1 needs to update the reconstruction weight vectors of size N for N data points in M views, as well as the between-view relation weights of size $M \times M$. Hence the effective computational complexity is ${\mathscr {O}}\left( M(N^2 +M)\right) $.

5.2 Algorithm to solve $\mathbf{MKC}_{sdp}$

Algorithm 2 describes how to solve $\mathbf{MKC}_{sdp}$ (22). The optimization problem (22) has two sets of unknowns, ${\mathbf {S}}$ and the ${\hat{{\mathbf {K}}}}^{(m)}$’s. We update ${\hat{{\mathbf {K}}}}^{(m)}$ and ${\mathbf {S}}$ in an iterative manner. In the kth iteration, for fixed ${\mathbf {S}}^{k-1}$ and ${\hat{{\mathbf {K}}}}^{(m)^{k-1}}$, each ${\hat{{\mathbf {K}}}}^{(m)}$ is updated independently by solving the following semi-definite program:

$$\begin{aligned} {\hat{{\mathbf {K}}}}^{(m)^{k}}= & {} \arg \min _{{\mathbf {K}}} Kobj_{{\mathbf {S}}}^k({\hat{{\mathbf {K}}}}^{(m)}) \nonumber \\ s.t&{\hat{{\mathbf {K}}}}^{(m)} \succeq 0, \end{aligned}$$

(29)

where

$$\begin{aligned} Kobj_{{\mathbf {S}}}^k({\hat{{\mathbf {K}}}}^{(m)})= & {} \Vert {\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}} -{{\mathbf {K}}}^{(m)}_{I^{(m)}I^{(m)}} \Vert _2^2 + \lambda \Vert {\hat{{\mathbf {K}}}}^{(m)}- \sum _{l=1, l\ne m}^{M}s^{k-1}_{ml}{\hat{{\mathbf {K}}}}^{(l)^{k-1}} \Vert _2^2 \\&+\lambda \sum _{l=1, l\ne m}^{M} \Vert \left( {\hat{{\mathbf {K}}}}^{(l)^{k-1}}- \sum _{l'=1, l' \ne l,m}^{M} s^{k-1}_{ll'} {\hat{{\mathbf {K}}}}^{(l')^{k-1}}\right) - s^{k-1}_{lm}{\hat{{\mathbf {K}}}}^{(m)}\Vert _2^2 \end{aligned}$$

Again, in the kth iteration, for fixed ${\hat{{\mathbf {K}}}}^{(m)^k}, \forall m=[1,\ldots ,M]$, ${\mathbf {S}}$ is updated by independently updating each row (${\mathbf {s}}_m$) through solving the following Quadratic Program:

$$\begin{aligned} {\mathbf {s}}_m^{k}= & {} \arg \min _{{\mathbf {s}}_m} Sobj_{[psd]{\hat{{\mathbf {K}}}}^{(m), m=1,\ldots ,M}}^k({\mathbf {s}}_m) \nonumber \\ s.t&\sum _{l\ne m} s_{ml} =1, \nonumber \\&s_{ml} \ge 0 \,\, \forall l \end{aligned}$$

(30)

Here $ Sobj_{[psd]{\hat{{\mathbf {K}}}}^{(m), m=1,\ldots ,M}}^k({\mathbf {s}}_m) = \Vert {\hat{{\mathbf {K}}}}^{(m)^k}- \sum _{l=1, l\ne m}^{M}s_{ml}{\hat{{\mathbf {K}}}}^{(l)^k} \Vert _2^2 $.

Computational complexity: Each iteration of Algorithm 2 needs to optimize M kernels by solving M semi-definite programs (SDPs) of size N. A general SDP solver has computational complexity ${\mathscr {O}} \left( N^{6.5 }\right) $ (Wang et al. 2013). Hence the effective computational complexity is ${\mathscr {O}}\left( M N^{6.5}\right) $.

5.3 Algorithm to solve $\mathbf{MKC}_{app}$

Algorithm 3 describes how to solve $\mathbf{MKC}_{app}$ (25), and is similar to Algorithm 1. Substituting ${\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}}$, the optimization problem (25) also has two sets of unknowns, ${\mathbf {S}}$ and the ${\mathbf {A}}^{(m)}$’s, and again we update ${\mathbf {A}}^{(m)}$ and ${\mathbf {S}}$ in an iterative manner. In the kth iteration, for a fixed ${\mathbf {S}}^{k-1}$ from the previous iteration, to update the ${\mathbf {A}}^{(m)}$’s we need to solve, unlike in $\mathbf{MKC}_{embd(ht)}$, the following for each m:

$$\begin{aligned} {\mathbf {A}}^{(m)^k} = \arg \min _{{\mathbf {A}}^{(m)}} Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)}) +\lambda _2 {\varOmega }({\mathbf {A}}^{(m)}) \end{aligned}$$

where $ {\varOmega }({\mathbf {A}}^{(m)})=\Vert {\mathbf {A}}^{(m)}\Vert _{2,1}$ and

$$\begin{aligned} Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)})= & {} \Vert {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}-\left[ {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}\right] _{I^{(m)}I^{(m)}}\Vert _2^2 \\&+ \lambda _1 \sum _{m=1}^M \Vert {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}- \sum _{l=1, l\ne m}^{M}s^{k-1}_{ml}{\mathbf {A}}^{(l)^T}_{I^{(l)}}{\mathbf {K}}^{(l)}_{I^{(l)}I^{(l)}}{\mathbf {A}}^{(l)}_{I^{(l)}} \Vert _2^2. \end{aligned}$$

For this case too, instead of solving this problem in each iteration we update ${\mathbf {A}}^{(m)}$ using proximal gradient descent. Hence, in each iteration,

$$\begin{aligned} {\mathbf {A}}^{(m)^k} = Prox_{\gamma \lambda _2{\varOmega }}\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}}) \right) \end{aligned}$$

(31)

where $\partial Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}})$ is the differential of $Aobj_{[app]{\mathbf {S}}}^{k}({\mathbf {A}}^{(m)})$ at ${\mathbf {A}}^{(m)^{k-1}}$ and $\gamma $ is the step size, which is decided by a line search. Applying the proximal operator to each row of ${\mathbf {A}}^{(m)}$ (i.e., ${\mathbf {a}}_{t}^{(m)}$) in (31) gives

$$\begin{aligned} {\mathbf {a}}_t^{(m)^k} = max\left( 0, 1-\frac{\gamma \lambda _2}{\Vert {\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}\Vert _2}\right) {\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}, \end{aligned}$$

(32)

where ${\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}$ is the tth row of $\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)})\right) $.

Again, in the kth iteration, for fixed ${\mathbf {A}}^{(m)^k}, \forall m=[1,\ldots ,M]$, ${\mathbf {S}}$ is updated by independently updating each row (${\mathbf {s}}_m$) through solving the following Quadratic Program:

$$\begin{aligned} {\mathbf {s}}_m^{k}= & {} \arg \min _{{\mathbf {s}}_m} Sobj_{[app]{\mathbf {A}}}^k({\mathbf {s}}_m) \nonumber \\ s.t&\sum _{l\ne m} s_{ml} =1, \nonumber \\&s_{ml} \ge 0 \,\, \forall l \end{aligned}$$

(33)

where $ Sobj_{[app]{\mathbf {A}}}^k({\mathbf {s}}_m) = \Vert {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}- \sum _{l=1, l\ne m}^{M}s_{ml}{\mathbf {A}}^{(l)^T}_{I^{(l)}}{\mathbf {K}}^{(l)}_{I^{(l)}I^{(l)}}{\mathbf {A}}^{(l)}_{I^{(l)}} \Vert _2^2 $.

Computational complexity: Each iteration of Algorithm 3 needs to update the reconstruction weight vectors of size N for N data points in M views, as well as the between-view relation weights of size $M \times M$. Hence the effective computational complexity is ${\mathscr {O}}\left( M(N^2 +M)\right) $.

6 Experiments

We apply the proposed MKC method on a variety of data sets, with different types of kernel functions in different views, along with different amounts of missing data points. The objectives of our experiments are: (1) to compare the performance of MKC against other existing methods in terms of the ability to predict the missing kernel rows, (2) to empirically show that the proposed kernel approximation with the help of the reconstruction weights also improves running-time over the $\mathbf{MKC}_{sdp}$ method.

6.1 Experimental setup

6.1.1 Data sets:

To evaluate the performance of our method, we used 4 simulated data sets with 100 data points and 5 views, as well as two real-world multi-view data sets: (1) Dream Challenge 7 data set (DREAM) (Daemen et al. 2013; Heiser and Sadanandam 2012) and (2) Reuters RCV1/RCV2 multilingual data (Amini et al. 2009).^{Footnote 2}

Synthetic data sets: We simulated the synthetic data sets with the following steps:

1. We generated the first 10 points (${\mathbf {X}}_{B^{(m)}}^{(m)}$) for each view, where ${\mathbf {X}}_{B^{(m)}}^{(1)}$ and ${\mathbf {X}}_{B^{(m)}}^{(2)}$ are uniformly distributed in $[-1,1]^{5}$ and ${\mathbf {X}}_{B^{(m)}}^{(3)}$, ${\mathbf {X}}_{B^{(m)}}^{(4)}$ and ${\mathbf {X}}_{B^{(m)}}^{(5)}$ are uniformly distributed in $[-1,1]^{10}$.
2. These 10 data points were used as basis sets for each view, and a further 90 data points in each view were generated by ${\mathbf {X}}^{(m)}={\mathbf {A}}^{(m)} {\mathbf {X}}_{B^{(m)}}^{(m)}$, where the ${\mathbf {A}}^{(m)}$ are uniformly distributed random matrices $ \in {\mathbb {R}}^{90 \times 10}$. We chose ${\mathbf {A}}^{(1)}={\mathbf {A}}^{(2)}$ and ${\mathbf {A}}^{(3)}={\mathbf {A}}^{(4)} = {\mathbf {A}}^{(5)}$.
3. Finally, ${\mathbf {K}}^{(m)}$ was generated from ${\mathbf {X}}^{(m)}$ by using different kernel functions for different data sets as follows:
- TOYL: Linear kernel for all views.
- TOYG1 and TOYG0.1: Gaussian kernel for all views, where the kernel width of the Gaussian kernel is 1 and 0.1, respectively.
- TOYLG1: Linear kernel for the first 3 views and Gaussian kernel for the last two views with the kernel width 1. Note that with this selection view 3 shares reconstruction weights with views 4 and 5, but has the same kernel as views 1 and 2.

Figure 2 shows that the eigen-spectra of the kernel matrices are very different for TOYLG1, where we have used different kernels in different views.
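The three simulation steps above can be sketched as follows. Several details are assumptions, since the text does not specify them: the entries of ${\mathbf {A}}^{(m)}$ are drawn from $[0,1)$, the Gaussian kernel is parametrized as $\exp (-\Vert x-y\Vert ^2/(2\sigma ^2))$ with $\sigma $ the kernel width, and the function names are hypothetical.

```python
import numpy as np

def make_toy_views(seed=0):
    """Generate 5 views with 100 points each (10 basis points + 90 combinations)."""
    rng = np.random.default_rng(seed)
    dims = [5, 5, 10, 10, 10]                            # feature dimension per view
    # Shared reconstruction weights: A1 = A2 and A3 = A4 = A5.
    A_12 = rng.uniform(0.0, 1.0, size=(90, 10))          # range [0, 1) is an assumption
    A_345 = rng.uniform(0.0, 1.0, size=(90, 10))
    weights = [A_12, A_12, A_345, A_345, A_345]
    views = []
    for d, A in zip(dims, weights):
        X_basis = rng.uniform(-1.0, 1.0, size=(10, d))   # first 10 points of the view
        views.append(np.vstack([X_basis, A @ X_basis]))  # 100 x d
    return views

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X, width):
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * width**2))                # assumed parametrisation of the width

views = make_toy_views()
toyl = [linear_kernel(X) for X in views]
toylg1 = ([linear_kernel(X) for X in views[:3]]
          + [gaussian_kernel(X, 1.0) for X in views[3:]])
```

Because every point is a linear combination of the 10 basis points, the linear kernels have rank at most the feature dimension of the view, whereas the Gaussian kernels do not share this low-rank structure.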

The Dream Challenge 7 data set (DREAM): For Dream Challenge 7, genomic characterizations of multiple types on 53 breast cancer cell lines are provided. They consist of DNA copy number variation, transcript expression values, whole exome sequencing, RNA sequencing data, DNA methylation data and RPPA protein quantification measurements. In addition, some of the views are missing for some cell lines. For 25 data points all 6 views are available. For all the 6 views, we calculated Gaussian kernels after normalizing the data sets. We generated two other kernels by using Jaccard’s kernel function over binarized exome data and RNA sequencing data. Hence, the final data set has 8 kernel matrices. Figure 2 shows the eigen-spectra of the kernel matrices of all views, which are quite different for different views.

RCV1/RCV2: The Reuters RCV1/RCV2 multilingual data set contains aligned documents for 5 languages (English, French, German, Italian and Spanish). Each document was originally written in one of these languages, and the corresponding documents for the other views were generated by machine translation of the original document. For our experiment, we randomly selected 1500 documents which were originally in English. The latent semantic kernel (Cristianini et al. 2002) is used for all languages.

6.1.2 Evaluation setup

Each of the data sets was partitioned into tuning and test sets. The missing views were introduced in these partitions independently. To induce missing views, we randomly selected data points from each partition, a few views for each of them, and deleted the corresponding rows and columns from the kernel matrices. The tuning set was used for parameter tuning. All the results have been reported on the test set which was independent of the tuning set.

For all 4 synthetic data sets as well as RCV1/RCV2 we chose $40\%$ of the data samples as the tuning set, and the remaining $60\%$ were used for testing. For the DREAM data set these partitions were $60\%$ for tuning and $40\%$ for testing.

We generated versions of the data with different amounts of missing values. For the first test case, we deleted 1 view from each selected data point in each data set. In the second test case, we removed 2 views for TOY and RCV1/RCV2 data sets and 3 views for DREAM. For the third one we deleted 3 views among 5 views per selected data point in TOY and RCV1/RCV2, and 5 views among 8 views per selected data point in DREAM.
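The procedure for inducing missing views can be sketched with a boolean availability mask. This is an illustration only: the fraction of selected data points (`frac_points`) is a hypothetical parameter that is not stated in the text, and the function names are invented for this example.

```python
import numpy as np

def induce_missing_views(n_points, n_views, frac_points, views_per_point, seed=0):
    """Return a boolean mask observed[i, m]; False marks a deleted row/column."""
    rng = np.random.default_rng(seed)
    observed = np.ones((n_points, n_views), dtype=bool)
    n_selected = int(round(frac_points * n_points))
    selected = rng.choice(n_points, size=n_selected, replace=False)
    for i in selected:
        # Delete a fixed number of randomly chosen views for this data point.
        dropped = rng.choice(n_views, size=views_per_point, replace=False)
        observed[i, dropped] = False
    return observed

def observed_block(K, observed_m):
    """Keep only the rows and columns of K for points observed in this view."""
    idx = np.flatnonzero(observed_m)
    return K[np.ix_(idx, idx)], idx

mask = induce_missing_views(n_points=60, n_views=5, frac_points=0.5,
                            views_per_point=2, seed=1)
K_block, idx = observed_block(np.eye(60), mask[:, 0])
```

The returned index array plays the role of $I^{(m)}$, the set of data points available in view m.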

We repeated all our experiments for 5 random tuning and test partitions with different missing entries and report the average performance on them.

6.1.3 Compared methods

We compared the performance of the proposed methods, $\mathbf{MKC}_{embd(hm)}$, $\mathbf{MKC}_{embd(ht)}$, $\mathbf{MKC}_{app}$ and $\mathbf{MKC}_{sdp}$, with k nearest neighbour (KNN) imputation as a baseline. KNN has previously been shown to be a competitive imputation method (Brock et al. 2008). For KNN imputation we first concatenated the underlying feature representations from all views to get a joint feature representation. We then sought the k nearest data points by using their available parts, and the missing part was imputed as either the average (Knn) or the weighted average (wKnn) of the selected neighbours. We also compared our results with the generative-model-based approach of Lian et al. (2015) (MLFS) and with an EM-based kernel completion method ($\mathbf{EM}_{based}$) proposed by Tsuda et al. (2003). The method of Tsuda et al. (2003) cannot solve our problem when no view is complete, hence we study the relative performance only in the cases which it can solve. For the method of Tsuda et al. (2003) we assume that the first view is complete.

We also compared $\mathbf{MKC}_{embd(ht)}$ with $\mathbf{MKC}_{rnd}$, where the basis vectors are selected uniformly at random without replacement, after which the reconstruction weights for all views are optimized.

The hyper-parameters $\lambda _1$ and $\lambda _2$ of MKC and k of Knn and wKnn were selected with the help of tuning set, from the range of $10^{-3}$ to $10^3$ and [1, 2, 3, 5, 7, 10] respectively. All reported results indicate performance in the test sets.

6.2 Prediction error comparisons

6.2.1 Average Relative Error (ARE)

We evaluated the performance of all methods using the average relative error (ARE) (Xu et al. 2013). Let ${\hat{{\mathbf {k}}}}^{(m)}_{t}$ be the predicted tth row for the mth view and the corresponding true values of kernel row be ${{\mathbf {k}}}^{(m)}_{t}$, then the relative error is the relative root mean square deviation.

The average relative error (in percentage) is then computed over all missing data points for a view, that is,

$$\begin{aligned} \text {ARE} = \frac{100}{n_{t}^{(m)}}\left( \sum _{t \notin I^{(m)}}\frac{\Vert {\hat{{\mathbf {k}}}}^{(m)}_{t}-{{\mathbf {k}}}^{(m)}_{t}\Vert _2}{\Vert {{\mathbf {k}}}^{(m)}_{t}\Vert _2}\right) . \end{aligned}$$

(34)

Here $n_{t}^{(m)}$ is the number of missing samples in the mth view.
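As an illustration of Eq. (34), the following sketch computes the ARE for one view from a true and a predicted kernel matrix. The function name `average_relative_error` and the argument `missing_rows` (the indices $t \notin I^{(m)}$) are hypothetical names chosen for this example.

```python
import numpy as np

def average_relative_error(K_true, K_pred, missing_rows):
    """Average relative error (in percent) over the missing rows of one view."""
    K_true = np.asarray(K_true, dtype=float)
    K_pred = np.asarray(K_pred, dtype=float)
    rows = np.asarray(missing_rows, dtype=int)
    # Per-row Euclidean norm of the prediction error and of the true row.
    err = np.linalg.norm(K_pred[rows] - K_true[rows], axis=1)
    ref = np.linalg.norm(K_true[rows], axis=1)
    # The mean over the missing rows equals the sum divided by n_t^(m).
    return 100.0 * float(np.mean(err / ref))
```

An ARE of 100 corresponds to an average prediction error as large as the norm of the true kernel row itself, so values above 100 are possible.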

Table 1 Average relative error percentage (Eq. 34)

6.2.2 Results

Table 1 shows the average relative error (34) for the compared methods. The proposed MKC methods generally predict missing values more accurately than Knn, wKnn, $\mathbf{EM}_{based}$ and MLFS. In particular, the differences in favour of the MKC methods increase as the number of missing views increases.

The most recent method, MLFS, performs comparably on the DREAM data-set and on the data-set with linear kernels (TOYL). However, its performance deteriorates badly as the non-linearity of the kernels increases, i.e., on TOYG1 and TOYLG1. For highly non-linear sparse kernels (TOYG0.1) and for the RCV1/RCV2 data-set with a large number of missing views, MLFS fails to predict.

$\mathbf{EM}_{based}$ sometimes has more than $200\%$ error and a high variance (also more than $200\%$). The most accurate method in each setup is one of the proposed MKC variants. $\mathbf{MKC}_{embd(hm)}$ is generally the least accurate of them, but still competitive against the other compared methods. We further note that:

$\mathbf{MKC}_{embd(ht)}$ is consistently the best when different views have different kernel functions or eigen-spectra, e.g., TOYLG1 and DREAM (Fig. 2). The better performance of $\mathbf{MKC}_{embd(ht)}$ compared to $\mathbf{MKC}_{embd(hm)}$ on the DREAM data gives evidence of the applicability of $\mathbf{MKC}_{embd(ht)}$ to real-world data-sets.
$\mathbf{MKC}_{app}$ performs best or very close to $\mathbf{MKC}_{embd(ht)}$ when the kernel functions and eigen-spectra of all views are the same (for instance TOYL, TOYG1 and RCV1/RCV2). As $\mathbf{MKC}_{app}$ learns between-view relationships on kernel values, it does not perform well on TOYLG1 and DREAM, where very different kernel functions are used in different views.
$\mathbf{MKC}_{sdp}$ outperforms all other methods when the kernel functions are highly non-linear (such as in TOYG0.1). In less non-linear cases, on the other hand, $\mathbf{MKC}_{sdp}$ trails the other MKC variants in accuracy. $\mathbf{MKC}_{sdp}$ is also computationally more demanding than the others, to the extent that we had to skip it on the RCV1/RCV2 data.

Figure 3 depicts the performance as the number of missing samples per view is increased. Here, $\mathbf{MKC}_{embd(ht)}$, $\mathbf{MKC}_{app}$ and $\mathbf{MKC}_{embd(hm)}$ prove to be the most robust methods over all data sets. The performance of $\mathbf{MKC}_{sdp}$ seems to be the most sensitive to the amount of missing samples. Overall, $\mathbf{EM}_{based}$, Knn and wKnn have worse error rates than the MKC methods.

6.3 Comparison of performance of different versions of the proposed approach

Figure 4 shows how the relative prediction error (ARE) of $\mathbf{MKC}_{embd(ht)}$, $\mathbf{MKC}_{app}$ and $\mathbf{MKC}_{sdp}$ varies with two properties of the given data-sets: (1) the difference among the eigen-spectra of the kernels of different views (x-axis) and (2) the non-linearity of the kernels of all views (y-axis). For this experiment, we consider the 3rd, 4th and 5th views of the TOY data. Here all views have been generated from the same embedding. The non-linearity of the kernel function is varied by using combinations of linear and Gaussian kernels, where the kernel width of the Gaussian kernel varies among 5, 1 and 0.1.

Table 2 Average running time over all views and, for TOY, over all 4 data sets

The heterogeneity of the eigen-spectra of all kernels is calculated as the average mean square difference between the eigen-spectra of each pair of kernels. The non-linearity of the kernels is indicated by the average of the 20th eigenvalues of all views. Each circle indicates the prediction error of $\mathbf{MKC}_{embd(ht)}$, $\mathbf{MKC}_{app}$ or $\mathbf{MKC}_{sdp}$, where the radius of each circle is proportional to $\log (\text {ARE}) + \text {Thr}$. The constant Thr was required to obtain positive radii for all circles for better visualization. We further note that:

The performance of $\mathbf{MKC}_{embd(ht)}$ is the best among these three methods in almost all cases.
Only when all views have similarly highly non-linear kernels (top-left corner) does $\mathbf{MKC}_{sdp}$ perform best. The figure also shows that the performance of $\mathbf{MKC}_{sdp}$ improves as the non-linearity increases.
We can also see that as the heterogeneity of the kernels increases (moving right along the x-axis), the performance of $\mathbf{MKC}_{app}$ deteriorates and becomes worse than that of $\mathbf{MKC}_{embd(ht)}$.
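The two axes used in this comparison, the heterogeneity of the eigen-spectra and the non-linearity indicator, can be sketched as below. This is an illustrative reading of the description in this section, under several assumptions: eigenvalues are sorted in descending order, all kernels have the same size, no normalization is applied, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def eigen_spectrum(K):
    """Eigenvalues of a symmetric kernel matrix, largest first."""
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    return np.linalg.eigvalsh(np.asarray(K, dtype=float))[::-1]

def spectrum_heterogeneity(kernels):
    """Average, over all pairs of views, of the mean square spectrum difference."""
    spectra = [eigen_spectrum(K) for K in kernels]
    diffs = [np.mean((a - b) ** 2) for a, b in combinations(spectra, 2)]
    return float(np.mean(diffs))

def nonlinearity_index(kernels, rank=20):
    """Average over views of the rank-th largest eigenvalue (20th in the text)."""
    return float(np.mean([eigen_spectrum(K)[rank - 1] for K in kernels]))
```

A slowly decaying spectrum, as produced by a narrow Gaussian kernel, leaves a comparatively large 20th eigenvalue, which is why that eigenvalue can serve as a rough indicator of non-linearity.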

6.4 Running time comparison

Table 2 depicts the running times for the compared methods. $\mathbf{MKC}_{app}$, $\mathbf{MKC}_{embd(ht)}$ and $\mathbf{MKC}_{embd(hm)}$ are many times faster than $\mathbf{MKC}_{sdp}$. In particular, $\mathbf{MKC}_{embd(hm)}$ is competitive in running time with the significantly less accurate $\mathbf{EM}_{based}$ and MLFS methods, except on the RCV1/RCV2 data. As expected, Knn and wKnn are orders of magnitude faster but fall far short of the reconstruction quality of the MKC methods.

7 Conclusions

In this paper, we have introduced new methods for kernel completion in the multi-view setting. The methods are able to propagate relevant information across views to predict missing rows/columns of kernel matrices in multi-view data. In particular, we are able to predict missing rows/columns of kernel matrices for non-linear kernels, and do not need any complete kernel matrices a priori.

Our method of within-view learning approximates the full kernel by a sparse basis set of examples with local reconstruction weights, picked up by $\ell _{2,1}$ regularization. This approach has the added benefit of circumventing the need for an explicit PSD constraint in the optimization. We showed that the method generalizes and improves on the Nyström approximation. For learning between views, we proposed two alternative approaches, one based on learning convex kernel combinations and another based on learning a convex set of reconstruction weights. The heterogeneity of the kernels in different views affects which of the approaches is favourable. We related theoretically the kernel approximation quality of these methods to the similarity of the eigen-spaces of the individual kernels.
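For readers unfamiliar with the two ingredients mentioned above, the sketch below shows the standard Nyström approximation and an $\ell _{2,1}$ norm. It is not the within-view approximation proposed in this paper: the landmark indices are simply given rather than learned, and the convention that the $\ell _{2,1}$ norm sums the Euclidean norms of the rows (rather than the columns) is an assumption made for this example.

```python
import numpy as np

def nystrom(K, landmarks):
    """Standard Nystrom approximation K ~ C W^+ C^T from landmark columns."""
    K = np.asarray(K, dtype=float)
    C = K[:, landmarks]                   # columns of the landmark points
    W = K[np.ix_(landmarks, landmarks)]   # kernel among the landmarks
    return C @ np.linalg.pinv(W) @ C.T

def l21_norm(A):
    """Sum of the Euclidean norms of the rows of A."""
    return float(np.sum(np.linalg.norm(np.asarray(A, dtype=float), axis=1)))
```

Penalizing a norm of this kind drives entire rows of a weight matrix to zero, which is how a sparse set of basis examples can be selected. The approximation `nystrom(K, landmarks)` is exact when the landmark block has the same rank as `K`.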

Our experiments show that the proposed multi-view completion methods are in general more accurate than previously available methods. In terms of running time, due to the inherent non-convexity of the optimization problems, the new proposals still have room to improve. However, the methods are amenable to efficient parallelization, which we leave for future work.

Notes

MKC code is available at https://github.com/aalto-ics-kepaco/MKC_software.
All data-sets and MKC code are available at https://github.com/aalto-ics-kepaco/MKC_software.

References

Amini, M., Usunier, N., & Goutte, C. (2009). Learning from multiple partially observed views—An application to multilingual text categorization. Advances in Neural Information Processing Systems, 22, 28–36.
Argyriou, A., Micchelli, C.A., & Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th annual conference on learning theory (pp. 338–352).
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. Advances in Neural Information Processing Systems, 19, 41–48.
Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning (pp. 6–13). ACM.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2011). Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5, 19–53.
Brock, G., Shaffer, J., Blakesley, R., Lotz, M., & Tseng, G. (2008). Which missing value imputation method to use in expression profiles: A comparative study and two selection schemes. BMC Bioinformatics, 9, 1–12.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13, 795–828.
Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2–3), 127–152.
Daemen, A., Griffith, O., Heiser, L., et al. (2013). Modeling precision treatment of breast cancer. Genome Biology, 14(10), 1.
Gönen, M., & Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211–2268.
Graepel, T. (2002). Kernel matrix completion by semidefinite programming. In Proceedings of the 12th international conference on artificial neural networks, Springer (pp. 694–699).
Heiser, L. M., Sadanandam, A., et al. (2012). Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences, 109(8), 2724–2729.
Kumar, S., Mohri, M., & Talwalkar, A. (2009). On sampling-based approximate spectral decomposition. In Proceedings of the 26th annual international conference on machine learning (pp. 553–560). ACM.
Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexible models for heterogeneous multiview data. In Proceedings of the 29th AAAI conference on artificial intelligence (pp. 2757–2763).
Paisley, J., & Carin, L. (2010). A nonparametric Bayesian model for kernel matrix completion. In The 35th international conference on acoustics, speech, and signal processing, IEEE (pp. 2090–2093).
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Shao, W., Shi, X., & Yu, P.S. (2013). Clustering on multiple incomplete datasets via collective kernel learning. In IEEE 13th international conference on data mining (ICDM) (pp. 1181–1186). IEEE.
Trivedi, A., Rai, P., Daumé III, H., & DuVall, S.L. (2005). Multiview clustering with incomplete views. In Proceedings of the NIPS workshop.
Tsuda, K., Akaho, S., & Asai, K. (2003). The EM algorithm for kernel matrix completion with auxiliary data. The Journal of Machine Learning Research, 4, 67–81.
Wang, P., Shen, C., & Van Den Hengel, A. (2013). A fast semidefinite approach to solving binary quadratic problems. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1312–1319).
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, EPFL-CONF-161322 (pp. 682–688).
Williams, D., & Carin, L. (2005). Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the ICML workshop on learning with multiple views.
Xu, M., Jin, R., & Zhou, Z.H. (2013). Speedup matrix completion with side information: Application to multi-label learning. In Advances in neural information processing systems (pp. 2301–2309).

Acknowledgements

We thank the Academy of Finland (Grants 292334, 294238, 295503 and Center of Excellence in Computational Inference COIN for SK, Grant 295496 for JR and Grants 295503 and 295496 for SB) and the Finnish Funding Agency for Innovation Tekes (Grant 40128/14 for JR) for funding. We also thank the authors of Lian et al. (2015) for sharing their software for MLFS.

Author information

Authors and Affiliations

Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
Sahely Bhadra, Samuel Kaski & Juho Rousu

Corresponding author

Correspondence to Sahely Bhadra.

Additional information

Editors: Bob Durrant, Kee-Eung Kim, Geoff Holmes, Stephen Marsland, Zhi-Hua Zhou and Masashi Sugiyama.

About this article

Cite this article

Bhadra, S., Kaski, S. & Rousu, J. Multi-view kernel completion. Mach Learn 106, 713–739 (2017). https://doi.org/10.1007/s10994-016-5618-0

Received: 03 May 2016
Accepted: 21 November 2016
Published: 30 December 2016
Issue Date: May 2017
DOI: https://doi.org/10.1007/s10994-016-5618-0
