Abstract
In this paper, we introduce the first method that (1) can complete kernel matrices with completely missing rows and columns as opposed to individual missing kernel values, with help of information from other incomplete kernel matrices. Moreover, (2) the method does not require any of the kernels to be complete a priori, and (3) can tackle non-linear kernels. The kernel completion is done by finding, from the set of available incomplete kernels, an appropriate set of related kernels for each missing entry. These aspects are necessary in practical applications such as integrating legacy data sets, learning under sensor failures and learning when measurements are costly for some of the views. The proposed approach predicts missing rows by modelling both within-view and between-view relationships among kernel values. For within-view learning, we propose a new kernel approximation that generalizes and improves Nyström approximation. We show, both on simulated data and real case studies, that the proposed method outperforms existing techniques in the settings where they are available, and extends applicability to new settings.
Similar content being viewed by others
1 Introduction
In recent years, many methods have been proposed for multi-view learning, i.e, learning with data collected from multiple sources or “views” to utilize the complementary information in them. Kernelized methods capture the similarities among data points in a kernel matrix. The multiple kernel learning (MKL) framework (c.f. Gönen and Alpaydin 2011) is a popular way to accumulate information from multiple data sources, where kernel matrices built on features from individual views are combined for better learning. In MKL methods, it is commonly assumed that full kernel matrices for each view are available. However, in partial data analytics, it is common that information from some sources is not available for some data points.
The incomplete data problem exists in a wide range of fields, including social sciences, computer vision, biological systems, and remote sensing. For example, in remote sensing, some sensors can go off for periods of time, leaving gaps in data. A second example is that when integrating legacy data sets, some views may not available for some data points, because integration needs were not considered when originally collecting and storing the data. For instance, gene expression may have been measured for some of the biological samples, but not for others, and as biological sample material has been exhausted, the missing measurements cannot be made any more. On the other hand, some measurements may be too expensive to repeat for all samples; for example, patient’s genotype may be measured only if a particular condition holds. All these examples introduce missing views, i.e, all features of a view for a data point can be missing simultaneously.
Novelties in problem definition: Previous methods for kernel completion have addressed completion of the aggregated Gaussian kernel matrix by integrating multiple incomplete kernels (Williams and Carin 2005) or single-view kernel completion assuming individual missing values (Graepel 2002; Paisley and Carin 2010), or required at least one complete kernel with a full eigen-system to be used as an auxiliary data source (Tsuda et al. 2003; Trivedi et al. 2005), or assume the eigen-system of two kernels to be exactly the same (Shao et al. 2013), or assumed a linear kernel approximation (Lian et al. 2015). Williams and Carin (2005) do not complete the individual incomplete kernel matrix but complete only aggregated kernels when all kernels are Gaussian. Due to absence of full rows/columns in the incomplete kernel matrices, no existing or non-existing single-view kernel completion method (Graepel 2002; Paisley and Carin 2010) can be applied to complete kernel matrices of individual views independently. In the multi-view setting, Tsuda et al. (2003) have proposed an expectation maximization based method to complete an incomplete kernel matrix for a view, with the help of a complete kernel matrix from another view. As it requires a full eigen-system of the auxiliary full kernel matrix, that method cannot be used to complete a kernel matrix with missing rows/columns when no other auxiliary complete kernel matrix is available. Both Trivedi et al. (2005) and Shao et al. (2013) match kernels through their Graph Laplacians, which may not work optimally if the kernels have different eigen-structures arising from different types of measurements. The method by Shao et al. (2013) completes multiple kernels sequentially, making an implicit assumption that the adjacent kernels in the sequence are related. This can be a hard constraint and in general may not match the reality. On the other hand, Lian et al. (2015) proposed a generative model based method which approximates the similarity matrix for each view as a linear kernel in some low-dimensional space. Therefore, it is unable to model highly non-linear kernels such as RBFs. Hence no conventional method can, by itself, complete highly non-linear kernel matrices with completely missing rows and columns in a multi-view setting when no other auxiliary full kernel matrix is available.
Contributions: In this paper, we propose a novel method to complete all incomplete kernel matrices collaboratively, by learning both between-view and within-view relationships among the kernel values (Fig. 1). We model between-view relationships in the following two ways: (1) Initially, adapting the strategies from multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012), we complete kernel matrices by expressing individual normalized kernel matrices corresponding to each view as a convex combination of normalized kernel matrices of other views. (2) Second, to model relationships between kernels having different eigen-systems we propose a novel approach of restricting the local embedding of one view in the convex hull of local embeddings of other views. We relate theoretically the kernel approximation quality of the different approaches to the properties of the underlying eigen-spaces of the kernels, pointing out settings where different approaches are optimal.
For within-view learning, we begin from the concept of local linear embedding (Roweis and Saul 2000) applied to the feature vector, and extend it to the kernel matrix by reconstructing each feature representation for a kernel as a sparse linear combination of other available feature representations or “basis” vectors in the same view. We assume the local embeddings, i.e., the reconstruction weights and the basis vectors for reconstructing each samples, are similar across views. In this approach, the non-linearity of kernel functions of individual views is also preserved in the basis vectors. We prove (Theorem 2) that the proposed within-view kernel reconstruction can be seen as generalizing and improving the Nyström method (Williams and Seeger 2001) which have been successfully applied to efficient kernel learning. Most importantly, we show (in Theorem 3) for a general single kernel matrix the proposed scheme results into optimal low rank approximation which is not reached by the Nyström method.
For between-view learning, we recognize that the similarity of the eigen-spaces of the views plays a crucial role. When the different kernels have similar optimal low-rank approximations, we show (Theorem 4) that predicting kernel values across views is a potent approach. For this case, we propose a method (\(\mathbf{MKC}_{app}\) (25)) relying on a technique previously used in multiple kernel learning literature, namely, restricting a kernel matrix into the convex hull of other kernel matrices (Argyriou et al. 2005; Cortes et al. 2012). Here, we use this technique for simultaneously completing multiple incomplete kernel matrices, while Argyriou et al. (2005) and Cortes et al. (2012) used it only for learning effective linear combination of complete kernel matrices.
For the case when the eigen-systems of the different views differ, we propose methods that, instead of kernel values, translate the reconstruction weights across views. For the cases where the leading eigen-vectors are similar but eigen-value spectra are different, we show (Theorem 5) that it is sufficient to maintain one global set of reconstruction weights, used in the within-view reconstructions of all views. In the case of heterogeneous leading eigen-vector sets across views, we propose to learn the reconstruction weights for each view restricting it in convex hull of the reconstruction weights of the other views (\(\mathbf{MKC}_{embd(ht)}\) (24)).
2 Multi-view kernel completion
We assume N data observations \({\mathbf {X}}=\{ {\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\}\) from a multi-view input space \({\mathscr {X}}= {\mathscr {X}}^{1} \times \cdots \times {\mathscr {X}}^{(M)}\), where \({\mathscr {X}}^{(m)}\) is the input space generating the mth view. We denote by \({\mathbf {X}}^{(m)}=\{ {\mathbf {x}}^{(m)}_1,\ldots ,{\mathbf {x}}^{(m)}_N\}\), \(\forall \) \(m = 1,\ldots , M\), the set of observations for the mth view, where \({\mathbf {x}}^{(m)}_i \in {\mathscr {X}}^{(m)}\) is the ith observation in the mth view and \({\mathscr {X}}^{(m)}\) is the input space. For simplicity of notation we sometimes omit the superscript (m) denoting the different views when there is no need to refer to several views at a time.
Considering an implicit mapping of the observations of the mth view to an inner product space \({\mathscr {F}}^{(m)}\) via a mapping \(\phi ^{(m)}: {\mathscr {X}}^{(m)} \rightarrow {\mathscr {F}}^{(m)}\), and following the usual recipe for kernel methods (Bach et al. 2004), we specify the kernel as the inner product in \({\mathscr {F}}^{(m)}\). The kernel value between the ith and jth data points is defined as \({k}^{(m)}_{ij}= \langle \phi _i^{(m)},\phi _j^{(m)} \rangle \), where \(\phi _i^{(m)} = \phi ^{(m)}({\mathbf {x}}_i^{(m)})\) and \(k_{ij}^{(m)}\) is an element of \({\mathbf {K}}^{(m)}\), the kernel Gram matrix for the set \({\mathbf {X}}^{(m)}\).
In this paper we make the assumption that a subset of samples is observed in each view, and correspondingly, a subset of views is observed for each sample. Let \(I_N = [1,\ldots , N] \) be the set of indices of all data points and \(I^{(m)}\) be the set of indices of all available data points in the mth view. Hence for each view, only a kernel sub-matrix (\({\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\)) corresponding to the rows and columns indexed by \(I^{(m)}\) is observed. Our aim is to predict a complete positive semi-definite (PSD) kernel matrix (\({\hat{{\mathbf {K}}}}^{(m)} \in {\mathbb {R}}^{N \times N}\)) corresponding to each view. The crucial task is to predict the missing (tth) rows and columns of \({\hat{{\mathbf {K}}}}^{(m)}\), for all \(t \in \{ I_N/I^{(m)}\}\). Our approach for predicting \({\hat{{\mathbf {K}}}}^{(m)}\) is based on learning both between-view and within-view relationships among the kernel values (Fig. 1). The sub-matrix \({\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}\) should be approximately equal to the observed matrix \({\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\), however, in our approach, approximation quality of the two parts of the kernel matrix can be traded.
2.1 Within-view kernel relationships
For within-view learning, relying on the concept of local linear embedding (Roweis and Saul 2000), we reconstruct the feature map of tth data point \(\phi _t\) by a sparse linear combination of observed data samples
where \( a_{it} \in {\mathbb {R}}\) is the reconstruction weight of the ith feature representation for representing the tth observation. Hence, approximated kernel values can be expressed as
We note that the above formulation retains the non-linearity of the feature map \(\phi \) and the corresponding kernel. We collect all reconstruction weights of a view into the matrix \({\mathbf {A}}= \left( a_{ij}\right) _{i,j=1}^N\). Further, by \({\mathbf {A}}_{I}\) we denote the sub-matrix of \({\mathbf {A}}\) containing the rows indexed by I, the known data samples in the view. Thus the reconstructed kernel matrix \({\hat{{\mathbf {K}}}}\) can be written as
Note that \({\hat{{\mathbf {K}}}}\) is positive semi-definite when \({\mathbf {K}}\) is positive semi-definite. Thus, a by-product of this approximation is that in optimization, PSD property is automatically guaranteed without inserting explicit positive semi-definiteness constraints.
Intuitively, the reconstruction weights are used to extend the known part of the kernel to the unknown part, in other words, the unknown part is assumed to reside within the span of the known part.
We further assume that in each view there exists a sparse embedding in \({\mathscr {F}}\), given by a small set of samples \(B\subset I\), called a basis set, that is able to represent all possible feature representations in that particular view. Thus the non-zero reconstruction weights are confined to the basis set: \(a_{ij} \ne 0\) only if \(i \in B\). To select such a sparse set of reconstruction weights, we regularize the reconstruction weights by the \(\ell _{2,1}\) norm (Argyriou et al. 2006) of the reconstruction weight matrix,
Finally, for the observed part of the kernel, we add the additional objective that the reconstructed kernel values closely approximate the known or observed values. To this end, we define a loss function measuring the within-view approximation error for each view as
Hence, for individual views the observed part of a kernel is approximated by
where the reconstruction weights \({\mathbf {A}}^{*}_{II}\) (here the superscript \(*\) indicates the optimum values) are optimized using (2) and (3) by
where \(\lambda \) is user defined hyper-parameter which indicate the weights of regularization.
Without the \(\ell _{2,1}\) regularization, the above approximation loss could be trivially optimized by choosing \({\mathbf {A}}_{II}\) as the identity matrix. The \(\ell _{2,1}\) regularization will have the effect of zeroing out some of the diagonal values and introducing non-zeros to the sub-matrix \({\mathbf {A}}_{BI}\), corresponding to the rows and columns indexed by \(B\) and \(I\) respectively, where \(B= \{i | a_{ii} \ne 0\}\).
In Sect. 3 we show by Theorem 1 that (5) corresponds to a generalized form of the Nyström method (Williams and Seeger 2001) which is a sparse kernel approximation method that has been successfully applied to efficient kernel learning. Nyström method finds a small set of vectors (not necessarily linearly independent) spanning the kernel, whereas our method searches for linearly independent basis vectors (c.f. Sect. 3, Lemma 1) and optimizes the reconstruction weights for the data samples. In particular, we show that (5) achieves the best rank-r approximation of a kernel, when the original kernel has rank higher than r, which is not achieved by Nyström method (c.f. Theorem 2).
2.2 Between-view kernel relationships
For a completely missing row or column of a kernel matrix, there is not enough information available for completing it within the same view, and hence the completion needs to be based on other information sources, in our case the other views where the corresponding kernel parts are known. In the following, we introduce two approaches for relaying information of the other views for completing the unknown rows/columns of a particular view. The first technique is based on learning a convex combination of the kernels, extending the multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012) techniques to kernel completion. The second technique is based on learning reconstruction weights so that they share information between the views.
Between-view learning of kernel values: In multi-view kernel completion the perhaps simplest situation arises when the kernels of the different views are similar, i.e.,
In this case predicting kernel values across views may lead to good kernel approximation. One way to model the similarity is to require the kernels of the different views to have the similar low-rank approximations. In Theorem 3 we show that optimal rank-r approximation can be achieved if the kernels have the same ‘true’ rank-r approximation and the kernels themselves have rank at least r.
However, this is probably an overly restrictive assumption in most applications. Thus, in our approach we allow the views to have different approximate kernel matrices with a parametrized relationship learned from data. To learn between-view relationships we express the individual normalized kernel matrix \(\left( \frac{k_{tt'}}{\sqrt{k_{tt}k_{t't'}}}\right) \) corresponding to each view as a convex combination of normalized kernel matrices of the other views. Hence the proposed model learns kernel weights \({\mathbf {S}}= (s_{ml})_{m,l=1}^M\) between all pairs of kernels (m, l) such that
where the kernel weights are confined to a convex combination
The kernel weights then can flexibly pick up a subset of relevant views to the current view m. This gives us between-view loss as
Previously, Argyriou et al. (2005) have proposed a method for learning kernels by restricting the search in the convex hull of a set of given kernels to learn parameters of individual kernel matrices. Here, we apply the idea to kernel completion, which has not been previously considered. We further note that kernel approximation as a convex combination has the interpretation of avoiding extrapolation in the space of kernels, and can be interpreted as a type of regularization to constrain the otherwise flexible set of PSD kernel matrices.
Between-view learning of reconstruction weights: In practical applications, the kernels arising in a multi-view setup might be very heterogeneous in their distributions. In such cases, it might not be realistic to find a convex combination of other kernels that are closely similar to the kernel of a given view. In particular, when the eigen-spectra of the kernels are very different, we expect a low between-view loss (8) to be hard to achieve.
Here we assume the approximated kernel matrices have related eigen-spaces in that the eigen-vectors of the related kernels can be written as linear combinations of eigen-vectors of the others, but each of them have their own set of eigen-values. In other words,
where \({\mathbf {U}}^{(m)}\) contains eigen-vectors of \({\mathbf {K}}^{(m)}\) in its column and \({\varSigma }^{(m)}\) contains corresponding eigen-values in its diagonal. \({\mathbf {T}}^{(m)}\) is a linear operator such that \({\mathbf {U}}^{(m)}= {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}\). Above the matrices \({\mathbf {T}}^{(m)}\) allow rotations of the eigen-vectors while the scaling is of them is governed by the matrices \({\varSigma }^{(m)}\).
For this situation, we propose and alternative approach, where instead of the kernel values, we assume that the basis sets and the reconstruction weights have between-view dependencies that we can learn. Theorem 4 claims when kernels of all views satisfy (9) then learning a set of reconstruction weights, used in in all views, i.e.,
gives better approximation than learning a convex combination of kernels as in (7).
However, assuming that kernel functions in all the views have similar eigen-vectors is also unrealistic for many real world data-sets with heterogeneous sources and kernels applied to them. On the contrary, it is quite possible that only for a subset of views the eigen-vectors of approximated kernel are linearly related. Thus, in our approach we allow the views to have different reconstruction weights, but assume a parametrized relationship learned from data. This also allows the model to find an appropriate set of related kernels from the set of available incomplete kernels, for each missing entry.
To capture the relationship, we assume the reconstruction weights in a view can be approximated by a convex combination of the reconstruction weights of the other views,
where the coefficients \(s_{ml}\) are defined as in (7). This gives us between-view loss for reconstruction weights as
The reconstructed kernel is thus given by
3 Theoretical analysis
In this section, we present the theoretical results underlying our methods. We begin by showing the relationship and advantages of our within-kernel approximation to the Nyström method, and follow with theorems establishing the approximation quality of different kernel completion models.
3.1 Rank of the within-kernel approximation
We begin with analysis of the rank of the proposed within-kernel approximation method, given in (4) and (5). It approximates the individual kernels as \({\hat{{\mathbf {K}}}} = {\mathbf {A}}^{*^T} {\mathbf {K}}{\mathbf {A}}^{*}\) where
For the purposes of the analysis, we derive an equivalent form that reveals the rank behaviour of the method more easily. Above, the matrix \({\mathbf {A}}\) simultaneously indicates the position of feature maps in the underlying subspace and also selects the basis vectors for defining these subspaces. Hence \({\mathbf {A}}\) can be written as convolution of two operators \({\mathbf {A}}= {\mathbf {P}}{\hat{{\mathbf {A}}}} \) where \({\mathbf {P}}= diag({\mathbf {p}})\) and \({\mathbf {p}}\in \{0,1\}^{N}\). Here \({\mathbf {p}}\) acts as a selector operator such that \({\mathbf {p}}_{i} = 1\) if \(i \in B\) or the ith feature-map is selected as a basis vectors and all other elements of \({\mathbf {p}}\) are assigned zero values. \({\hat{{\mathbf {A}}}}\) is a matrix of size \({\mathbf {A}}\), such that \({\hat{{\mathbf {A}}}}_{i \in B}={\mathbf {A}}_{i \in B} \) and other elements are zeros.
The \(\ell _{2,1}\) norm on \({\mathbf {A}}\) in (14) assigns zeros to few rows of the matrix \({\mathbf {A}}\); equivalently \(\ell _{1}\) norm on selection operator (\({\mathbf {p}}\)) fulfils the same purpose. Again, after rows selection is done both \({\hat{{\mathbf {A}}}}_{i \in B}\) and \({\mathbf {A}}_{i \in B}\) denote reconstruction weights for kernel by using selected rows (\(B\)) and would behave similarly. Therefore the (14) is equivalent to
To see the equivalence, note that at optimum, the rows of \({\hat{A}}\) that are not selected by P will be identically zero, since the value of the first term of the objective only depends on the selected rows. Again the equivalence in regularization term can be shown as
Now, the approximated kernel can be written as
where, \({\mathbf {W}}^{*}\) and \({\hat{{\mathbf {A}}}}^{*}_{B}\) are non-zero sub-matrices of \({\mathbf {P}}^{*^T}{\mathbf {K}}{\mathbf {P}}^{*}\) (corresponding to \(B\) rows and \(B\) columns) and \({\hat{{\mathbf {A}}}}^{*}\) (corresponding to \(B\) rows) respectively.
Lemma 1
For \(rank({\mathbf {K}}) \ge r\), \(\exists \) \(\lambda =\lambda ^r \) and \(\lambda _{{\hat{{\mathbf {A}}}}^*_{i}}\) such that the solution of (15) selects a \({\mathbf {W}}^{*} \in {\mathbb {R}}^{r \times r}\) with \(rank({\mathbf {W}}^{*})=r\).
Proof
When \(rank({\mathbf {K}}) \ge r\) then there must exist a rank-r sub-matrix of \({\mathbf {K}}\) of size \(r \times r\). Again, in case of \(\ell _1\) regularization on binary vector \({\mathbf {p}}\), one can tune \(\lambda \) to \(\lambda ^r\) to have required sparsity on \({\mathbf {p}}\), i.e., \(\Vert {\mathbf {p}}^{*}\Vert _1= r\). Moreover, \(\ell _1\) regularization on binary vector ensures the linear independence of selected columns and rows when \(\lambda _{i}\) is carefully chosen. If a solution of (15) selects a column which is linearly dependent on other selected columns, then the solution, from the objective function value of some other solution which selects the same columns except this linearly dependent column, will raise the value of \(\ell _1\) norm regularization term of objective function by \(\lambda \) while keeping the first part of the objective function same and lowering the third part of (15) by \(\frac{\lambda _{i}\Vert A_i\Vert ^2_2}{2}\). Hence if \(\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}\) then that can not be an optimum solution and if \(\lambda _{i}\) is chosen according to the (16) then \(\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}\). This completes the proof. \(\square \)
3.2 Relation to Nyström approximation
Nyström method (Williams and Seeger 2001) approximates the kernel matrix \({\mathbf {K}}\) as
where the matrix \({\mathbf {C}}\in {\mathbb {R}}^{N \times c}\) consists of c randomly chosen columns of kernel \({\mathbf {K}}\) and \({\mathbf {W}}\in {\mathbb {R}}^{c \times c}\) is a matrix consisting of the intersection of those c columns with the corresponding c rows. Due to the random selection, Nyström method, unlike ours (as established above), does not in general produce linearly independent set of vectors. One can re-write
and then the Nyström approximation as
For non-invertible \({\mathbf {W}}\), its pseudo inverse can be used instead of \({\mathbf {W}}^{-1}\).
Theorem 1
The Nyström approximation of \({\mathbf {K}}\) is a feasible solution of (15), i.e., for invertible \({\mathbf {W}}\), \( \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{N \times c}\) such that \( {\hat{{\mathbf {K}}}}_{nys} = {\hat{{\mathbf {A}}}}^{^T} {\mathbf {W}}{\hat{{\mathbf {A}}}} \).
Proof
Equate
in (19) where \(I_c\) denotes the identity matrix of size c. \(\square \)
The above theorem shows that the approach of (15), by finding the optimal feasible solution, will always produce better kernel approximation with the same level of sparsity as the Nyström method.
3.3 Low-rank approximation quality
Nyström approximation satisfies following low-rank approximation properties (Kumar et al. 2009):
-
If \(r=rank({\mathbf {K}}^{})\le c\) and \(rank({\mathbf {W}}^{})=r\), then the Nyström approximation is exact, i.e., \(\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}^{}_{nys}\Vert ^2_2 =0\).
-
For general \({\mathbf {K}}^{}\) when \(rank({\mathbf {K}}^{}) \ge r\) and \(rank({\mathbf {W}}^{})=r\), then the Nyström approximation is not the best rank-r approximation of \({\mathbf {K}}^{}\).
Below, we will establish that our approach will result in the best rank-r approximation also in the general case where the original kernel has rank higher than r.
Lemma 2
If \({\mathbf {K}}_r\) be the best rank-r approximation of a kernel \({\mathbf {K}}\) with \(rank({\mathbf {K}}) \ge r\) and \({\mathbf {W}}\) be a full rank sub-matrix of \({\mathbf {K}}\) of size \(r \times r\), i.e, \(rank({\mathbf {W}})=r\). Then \( \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{N \times r}\) such that for the proposed approximation \({\hat{{\mathbf {K}}}} =\hat{{\mathbf {A}}^T} {\mathbf {W}}{\hat{{\mathbf {A}}}}\) is equivalent to \({\mathbf {K}}_r\), i.e., \(\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}}- {\mathbf {K}}_r\Vert ^2_2\).
Proof
Using eigen-value decomposition one can write
where \({\mathbf {U}}\in {\mathbb {R}}^{N \times r}\) and \(rank({\mathbf {U}}) =r\) and \({\mathbf {U}}_W \in {\mathbb {R}}^{r \times r}\) and \(rank({\mathbf {U}}_W) =r\).
Using invertible property of \({\mathbf {U}}_W\), one can express \({\mathbf {U}}\) as \({\mathbf {U}}=({\mathbf {U}}{\mathbf {U}}_W^{-1}) {\mathbf {U}}_W\).
where \({\hat{{\mathbf {A}}}}^T = {\mathbf {U}}{\mathbf {U}}_W^{-1}\). \(\square \)
Theorem 2
If \(rank({{\mathbf {K}}}) \ge r\), then \(\exists \) \(\lambda ^r\) such that the proposed approximation \({\hat{{\mathbf {K}}}}\) in (17) is equivalent to the best rank-r approximation of \({{\mathbf {K}}}\), i.e., \(\Vert {\mathbf {K}}- {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}}- {{\mathbf {K}}}_r\Vert ^2_2\), where \({{\mathbf {K}}}_r\) is the best rank-r approximation of \({{\mathbf {K}}}\).
Proof
Lemma 1 proves that there exist a \(\lambda _r\) for which the optimum solution of (15) results into a \({\mathbf {W}}^{*} \in {\mathbb {R}}^{r \times r}\) such that \(rank({\mathbf {W}}^{*})=r\). According to Lemma 2 there exist also a feasible \({\hat{{\mathbf {A}}}}\) which reconstructs \({{\mathbf {K}}}_r\). Let us assume \(\hat{{\mathbf {A}}^{*}}\) is the optimum solution of (15) with \(\lambda _A =0\), then
This completes the proof. \(\square \)
3.4 Low-rank approximation quality of multiple kernel matrices
In this section, we establish the approximation quality achieved in the multi-view setup, when the different kernels are similar either in the sense of having the same underlying ‘true’ low-rank approximations (Theorem 3) or more generally similar sets of eigen-vectors (Theorem 4).
Theorem 3
Assume \({\mathbf {K}}^{(1)}, \ldots ,{\mathbf {K}}^{M}\) are M kernel matrices such that \( \forall \,\,m,\,\, rank({\mathbf {K}}^{m}) \ge r\) and that all of them have the same rank-r approximation, i.e., \({\mathbf {K}}^{(1)}_r = \ldots = {\mathbf {K}}^{(m)}_r\) (assumption in (6)). Then \(\exists \lambda ^r\) and \(\lambda _A\) such that the following optimization problem:
produces the exact rank-r approximation for individual matrices, i.e.,
Proof
Each symmetric positive semi-definite kernel matrix can be written as
where \({\mathbf {X}}^{(m)} \in {\mathbb {R}}^{N \times rank({\mathbf {K}}^{(m)})}\) and columns of \({\mathbf {X}}^{(m)}\) are orthogonal to each other.
When all \({\mathbf {K}}^{(m)}\)s have the same rank-r approximation then the first r columns of \({\mathbf {X}}^{(m)}\) are same for all m. Hence \({\mathbf {X}}^{(m)}\) can be expressed as
where \(r^c\) denotes the complement of set r. Here \({\mathbf {X}}_r \in {\mathbb {R}}^{N \times r}\) is a rank-r matrix and hence it is possible to find a set of r rows from \({\mathbf {X}}_r\) which together produce a rank-r sub-matrix of size \(r \times r\). Let \({\mathbf {P}}^{*^T}\) be such a selector operator which select r linearly independent rows from \({\mathbf {X}}_r\), i.e., Moreover, according to Lemma 1 there exist a \(\lambda _r\) for which the optimization problem in (20) gives the required sparsity in \({\mathbf {P}}^*\).
Hence,
where \({\mathbf {X}}_W^{(m)} = {\mathbf {P}}^{*^T}{\mathbf {X}}^{(m)} = {\mathbf {P}}^{*^T} \left[ {\mathbf {X}}_r {\mathbf {X}}_{r'}^{(m)} \right] \) and hence \({\mathbf {X}}_W^{(m)}\) contains r linearly independent rows of \({\mathbf {X}}^{(m)}\) and hence for all m, \(rank({\mathbf {W}}^{(m)})=r.\)
When the parameter \(\lambda _A\) is significantly small then, using Theorem 2, we can prove that for a \({\mathbf {W}}^{(m)}\) with \(rank({\mathbf {W}}^{(m)})=r\), there exist a \({\hat{{\mathbf {A}}}}^{(m)}\) which is able to generate exact rank-r approximation for individual kernel matrix i.e,
This completes the proof. \(\square \)
Theorem 4
Assume \({\mathbf {K}}^{(1)}, \ldots ,{\mathbf {K}}^{(M)}\) are M kernel matrices such that \( \forall \,\,m,\,\, rank({\mathbf {K}}^{(m)}) = r\) and all of them have same eigen-space, i.e, eigen-vectors are linearly transferable and eigen-values are different (assumption in (9)), i.e., \({{\mathbf {K}}}^{(m)} = {\mathbf {U}}^{(m)} {\varSigma }^{(m)} {\mathbf {U}}^{(m)^T} \) such that \({\mathbf {U}}^{(m)} = {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}\). Then \(\exists \lambda ^r\) and \(\lambda _A\) such that the following optimization problem (by the assumption in (10))
selects a rank-r sub-matrix \({\mathbf {W}}^{*(m)} \in {\mathbb {R}}^{r \times r} \) with \(rank({\mathbf {W}}^{*(m)})=r\) of each kernel \({\mathbf {K}}^{(m)}\) which can produce the exact reconstruction for individual matrices, i.e.,
Proof
According to assumption in (9) each kernel matrix can be written as
where \({\mathbf {U}}^{(1)} \in {\mathbb {R}}^{N \times r}\) is orthonormal. Hence it is possible to find a set of r rows from \({\mathbf {U}}^{(1)}\) which together produce a rank-r sub-matrix of size \(r \times r\). Let \({\mathbf {P}}^{*^T}\) be such selector operator which selects r linearly independent rows from \({\mathbf {U}}^{(1)}\). Let \(r^*\) denote the set of indices of such linearly independent rows of \({\mathbf {U}}^{(1)}\). Hence \({\mathbf {U}}^{(1)}= \left[ \begin{array}{c} {\mathbf {U}}^{(1)}_{r*}\\ {\mathbf {U}}^{(1)}_{r^{c}} \end{array} \right] \) and \({\mathbf {U}}^{(1)}_{r*}\) is invertible.
According to Lemma 1 there exist a \(\lambda _r\) for which the optimization problem (21) gives the required sparsity in \({\mathbf {P}}^{*T}\). Hence using the assumption \({\mathbf {U}}^{(m)} = {\mathbf {U}}^{(1)}{\mathbf {T}}^{(m)}\), we get
Hence, given \(\lambda _A\) is significantly small, according to Theorem 2 the optimization problem (21) selects a sub-matrix \({\mathbf {W}}^{*(m)}\) such that \({\mathbf {W}}^{*(m)} \in {\mathbb {R}}^{r \times r}\) and \(rank({\mathbf {W}}^{*(m)})=r\). Then each kernel matrix is expressed in terms of \({\mathbf {W}}^{*(m)}\) as
Defining \({\hat{{\mathbf {A}}}}^*= \left( {\mathbf {U}}^{(1)} {\mathbf {U}}^{(1)^{-1}}_{r*}\right) ^T\) (which is possible for significantly small \(\lambda _A\)), we get \({\mathbf {K}}^{(m)} = {\hat{{\mathbf {A}}}}^{*^T}{\mathbf {W}}^{*(m)} {\hat{{\mathbf {A}}}}^*. \) This completes the proof. \(\square \)
4 Optimization problems
Here we present the optimization problems for Multi-view Kernel Completion (MKC), arising from the within-view and between-view kernel approximations described above.
4.1 MKC using semi-definite programming (\(\mathbf{MKC}_{sdp}\))
This is the most general case where we do not put any other restrictions on kernels of individual views, other than restricting them to be positive semi-definite kernels. In this general case we propagate information from other views by learning between-view relationships depending on kernel values in (7). Hence, using (3) and (8) we get
We solve this non-convex optimization problem by iteratively solving it for \({\mathbf {S}}\) and \({\hat{{\mathbf {K}}}}^{(m)}\) using block-coordinate descent. For a fixed \({\mathbf {S}}\), to update the \({\hat{{\mathbf {K}}}}^{(m)}\)’s we need to solve a semi-definite program with M positive constraints.
4.2 MKC using homogeneous embeddings ( \(\mathbf{MKC}_{embd(hm)}\) )
An optimization problem with M positive semi-definite constraints is inefficient for even a data set of size 100. To avoid solving the SDP in each iteration we assume a kernel approximation (1). When kernel functions in different views are not the same and kernel matrices in different views have different eigen-spectra, we learn relationships among underlying embeddings of different views (10), instead of the actual kernel values. Hence, using (3), (1) and (10) along with \(\ell _{2,1}\) regularization on \({\mathbf {A}}\), we get
Theorem 4 shows that the above formulation is appropriate when the first few eigen-vectors for all kernels are same while corresponding eigen-values may be different.
4.3 MKC using heterogeneous embeddings (\(\mathbf{MKC}_{embd(ht)}\))
When kernel matrices in different views have different eigen-spectra both in eigen-values and eigen-vectors, we learn relationships among underlying embeddings of different views ( 11), instead of the actual kernel values. Hence, using (3), (1) and (12) along with \(l_{2,1}\) regularization on \({\mathbf {A}}^{(m)}\), we get
4.4 MKC using kernel approximation (\(\mathbf{MKC}_{app}\))
On the other hand when the low rank approximation of related kernels are same (6) then between-view relationships are learnt on kernel values using (8). In this case the kernel is approximated to avoid solving the SDP:
Theorem 3 shows that this method results into the exact rank-r approximation when rank-r approximation kernels for related views are same. We solve all the above-mentioned non-convex optimization problems with \(l_{2,1}\) regularization by sequentially updating \({\mathbf {S}}\) and \({\mathbf {A}}^{(m)}\). In each iteration \({\mathbf {S}}\) is updated by solving a quadratic program and for each m, \({\mathbf {A}}^{(m)}\) is updated using proximal gradient descent.
5 Algorithms
Here we present algorithms for solving various optimization problems, described in previous section.Footnote 1
5.1 Algorithm to solve \(\mathbf{MKC}_{embd(ht)}\)
In this section the Algorithm 1 describes the algorithm to solve \(\mathbf{MKC}_{embd(ht)}\) (24).
Substituting \({\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}}\), the optimization problem (24) contains two sets of unknowns, \({\mathbf {S}}\) and the \({\mathbf {A}}^{(m)}\)’s. We update \({\mathbf {A}}^{(m)}\) and \({\mathbf {S}}\) in an iterative manner. In the kth iteration for a fixed \({\mathbf {S}}^{k-1}\) from the previous iteration, to update \({\mathbf {A}}^{(m)}\)’s we need to solve following for each m:
where \({\varOmega }({\mathbf {A}}^{(m)})=\Vert {\mathbf {A}}^{(m)}\Vert _{2,1}\) and \(Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)})=\Vert {\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}- \left[ {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}\right] _{I^{(m)}I^{(m)}}\Vert _2^2 +\lambda _1 \sum _{m=1}^M \Vert {\mathbf {A}}^{(m)} - \sum _{l=1, l\ne m}^{M}s^{k-1}_{ml}{\mathbf {A}}^{(l)} \Vert _2^2\).
Instead of solving this problem in each iteration we update \({\mathbf {A}}^{(m)}\) using proximal gradient descent. Hence, in each iteration,
where \(\partial Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)})\) is the differential of \(Aobj_{{\mathbf {S}}}^{k}({\mathbf {A}}^{(m)})\) at \({\mathbf {A}}^{(m)^{k-1}}\) and \(\gamma \) is the step size which is decided by a line search. In (26) each row of \({\mathbf {A}}^{(m)}\) (i.e., \({\mathbf {a}}_{t}^{(m)}\)) can be solved independently and we apply a proximal operator on each row. Following Bach et al. (2011), the solution of (26) is
where \({\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}\) is the tth row of \(\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}})\right) \).
Again, in the kth iteration, for fixed \({\mathbf {A}}^{(m)^k}\)’s, the \({\mathbf {S}}\) is updated by independently updating each row (\({\mathbf {s}}_m\)) through solving the following Quadratic Program:
where \(Sobj_{{\mathbf {A}}^{(m),m=1,\ldots ,M}}^k({\mathbf {s}}_m)=\Vert {\mathbf {A}}^{(m)^k}-\sum _{l=1, l\ne m}^{M}s_{ml}{\mathbf {A}}^{(l)^k}\Vert _2^2\).
Computational Complexity: Each iteration of Algorithm 1 needs to update reconstruction weight vectors of size N for N data-points for M views and also between view relation weights of size \(M \times M\). Hence the effective computational complexity is \({\mathscr {O}}\left( M(N^2 +M)\right) \).
5.2 Algorithm to solve \(\mathbf{MKC}_{sdp}\)
In this section the Algorithm 2 describes the algorithm to solve \(\mathbf{MKC}_{sdp}\)(22). The optimization problem (22) has two sets of unknowns, \({\mathbf {S}}\) and the \({\hat{{\mathbf {K}}}}^{(m)}\)’s. We update \({\hat{{\mathbf {K}}}}^{(m)}\) and \({\mathbf {S}}\) in an iterative manner. In the kth iteration, for fixed \({\mathbf {S}}^{k-1}\) and \({\hat{{\mathbf {K}}}}^{(m)^{k-1}}\), the \({\hat{{\mathbf {K}}}}^{(m)}\) is updated independently by solving following Semi-definite Program:
where
Again, in the kth iteration, for fixed \({\hat{{\mathbf {K}}}}^{(m)^k}, \forall m=[1,\ldots ,M]\), \({\mathbf {S}}\) is updated by independently updating each row (\({\mathbf {s}}_m\)) through solving the following Quadratic Program:
Here \( Sobj_{[psd]{\hat{{\mathbf {K}}}}^{(m), m=1,\ldots ,M}}^k({\mathbf {s}}_m) = \Vert {\hat{{\mathbf {K}}}}^{(m)^k}- \sum _{l=1, l\ne m}^{M}s_{ml}{\hat{{\mathbf {K}}}}^{(l)^k} \Vert _2^2 \).
Computational Complexity: Each iteration of Algorithm 2 needs to optimize M kernel by solving of M semi-definite programming(SDP) of size N. General SDP solver has computation complexity \({\mathscr {O}} \left( N^{6.5 }\right) \) (Wang et al. 2013). Hence the effective computational complexity is \({\mathscr {O}}\left( M N^{6.5}\right) \).
5.3 Algorithm to solve \(\mathbf{MKC}_{app}\)
In this section the Algorithm 3 describes the algorithm to solve \(\mathbf{MKC}_{app}\)(25) which is similar to Algorithm 1. Substituting \({\hat{{\mathbf {K}}}}^{(m)} = {\mathbf {A}}^{(m)^T}_{I^{(m)}} {\mathbf {K}}^{(m)}_{I^{(m)},I^{(m)}} {\mathbf {A}}^{(m)}_{I^{(m)}}\), the optimization problem (25) also has two sets of unknowns, \({\mathbf {S}}\) and the \({\mathbf {A}}^{(m)}\)’s and again we update \({\mathbf {A}}^{(m)}\) and \({\mathbf {S}}\) in an iterative manner. In the kth iteration for a fixed \({\mathbf {S}}^{k-1}\) from previous iteration, to update \({\mathbf {A}}^{(m)}\)’s, unlike \(\mathbf{MKC}_{embd(ht)}\), we need to solve following for each m:
where \( {\varOmega }({\mathbf {A}}^{(m)})=\Vert {\mathbf {A}}^{(m)}\Vert _{2,1}\) and
For this case too, instead of solving this problem in each iteration we update \({\mathbf {A}}^{(m)}\) using proximal gradient descent. Hence, in each iteration,
where \(\partial Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)^{k-1}})\) is the differential of \(Aobj_{[app]{\mathbf {S}}}^{k}({\mathbf {A}}^{(m)})\) at \({\mathbf {A}}^{(m)^{k-1}}\) and \(\gamma \) is the step size which is decided by a line search. By applying proximal operator on each row of \({\mathbf {A}}\) (i.e., \({\mathbf {a}}_{t}\)) in (31)
where \({\varDelta } {\mathbf {a}}_t^{(m)^{k-1}}\) is the tth row of \(\left( {\mathbf {A}}^{(m)^{k-1}}-\gamma \partial Aobj_{[app]{\mathbf {S}}}^k({\mathbf {A}}^{(m)})\right) \).
Again, in the kth iteration, for fixed \({\mathbf {A}}^{(m)^k}, \forall m=[1,\ldots ,M]\), \({\mathbf {S}}\) is updated by independently updating each row (\({\mathbf {s}}_m\)) through solving the following Quadratic Program:
where \( Sobj_{[app]{\mathbf {A}}}^k({\mathbf {s}}_m) = \Vert {\mathbf {A}}^{(m)^T}_{I^{(m)}}{\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}{\mathbf {A}}^{(m)}_{I^{(m)}}- \sum _{l=1, l\ne m}^{M}s_{ml}{\mathbf {A}}^{(l)^T}_{I^{(l)}}{\mathbf {K}}^{(l)}_{I^{(l)}I^{(l)}}{\mathbf {A}}^{(l)}_{I^{(l)}} \Vert _2^2 \).
Computational Complexity: Each iteration of Algorithm 3 needs to update reconstruction weight vectors of size N for N data-points for M views and also between view relation weights of size \(M \times M\). Hence the effective computational complexity is \({\mathscr {O}}\left( M(N^2 +M)\right) \).
6 Experiments
We apply the proposed MKC method on a variety of data sets, with different types of kernel functions in different views, along with different amounts of missing data points. The objectives of our experiments are: (1) to compare the performance of MKC against other existing methods in terms of the ability to predict the missing kernel rows, (2) to empirically show that the proposed kernel approximation with the help of the reconstruction weights also improves running-time over the \(\mathbf{MKC}_{sdp}\) method.
6.1 Experimental setup
6.1.1 Data sets:
To evaluate the performance of our method, we used 4 simulated data sets with 100 data points and 5 views, as well as two real-world multi-view data sets: (1) Dream Challenge 7 data set (DREAM) (Daemen et al. 2013; Heiser and Sadanandam 2012) and (2) Reuters RCV1/RCV2 multilingual data (Amini et al. 2009).Footnote 2
Synthetic data sets: We followed the following steps to simulate our synthetic data sets:
-
1
We generated the first 10 points (\({\mathbf {X}}_{B^{(m)}}^{(m)}\)) for each view, where \({\mathbf {X}}_{B^{(m)}}^{(1)}\) and \({\mathbf {X}}_{B^{(m)}}^{(2)}\) are uniformly distributed in \([-1,1]^{5}\) and \({\mathbf {X}}_{B^{(m)}}^{(3)}\), \({\mathbf {X}}_{B^{(m)}}^{(4)}\) and, \({\mathbf {X}}_{B^{(m)}}^{(5)}\) are uniformly distributed in \([-1,1]^{10}\).
-
2
These 10 data points were used as basis sets for each view, and further 90 data points in each view were generated by \({\mathbf {X}}^{(m)}={\mathbf {A}}^{(m)} {\mathbf {X}}_{B^{(m)}}^{(m)}\), where the \({\mathbf {A}}^{(m)}\) are uniformly distributed random matrices \( \in {\mathbb {R}}^{90 \times 10}\). We chose \({\mathbf {A}}^{(1)}={\mathbf {A}}^{(2)}\) and \({\mathbf {A}}^{(3)}={\mathbf {A}}^{(4)} = {\mathbf {A}}^{(5)}\).
-
3
Finally, \({\mathbf {K}}^{(m)}\) was generated from \({\mathbf {X}}^{(m)}\) by using different kernel functions for different data sets as follows:
-
TOYL: Linear kernel for all views
-
TOYG1 and TOYG0.1: Gaussian kernel for all views where the kernel with of the Gaussian kernel are 1 and 0.1 respectively.
-
TOYLG1: Linear kernel for the first 3 views and Gaussian kernel for the last two views with the kernel width 1. Note that with this selection view 3 shares reconstruction weights with view 4 and 5, but has the same kernel as views 1 and 2.
Figure 2 shows the eigen-spectra of kernel matrices are very much different for TOYLG1 where we have used different kernels in different views.
-
The Dream Challenge 7 data set (DREAM): For Dream Challenge 7, genomic characterizations of multiple types on 53 breast cancer cell lines are provided. They consist of DNA copy number variation, transcript expression values, whole exome sequencing, RNA sequencing data, DNA methylation data and RPPA protein quantification measurements. In addition, some of the views are missing for some cell lines. For 25 data points all 6 views are available. For all the 6 views, we calculated Gaussian kernels after normalizing the data sets. We generated two other kernels by using Jaccard’s kernel function over binarized exome data and RNA sequencing data. Hence, the final data set has 8 kernel matrices. Figure 2 shows the eigen-spectra of the kernel matrices of all views, which are quite different for different views.
RCV1/RCV2: Reuters RCV1/RCV2 multilingual data set contains aligned documents for 5 languages (English, French, Germany, Italian and Spanish). Originally the documents are in any one of these languages and then corresponding documents for other views have been generated by machine translations of the original document. For our experiment, we randomly selected 1500 documents which were originally in English. The latent semantic kernel (Cristianini et al. 2002) is used for all languages.
6.1.2 Evaluation setup
Each of the data sets was partitioned into tuning and test sets. The missing views were introduced in these partitions independently. To induce missing views, we randomly selected data points from each partition, a few views for each of them, and deleted the corresponding rows and columns from the kernel matrices. The tuning set was used for parameter tuning. All the results have been reported on the test set which was independent of the tuning set.
For all 4 synthetic data sets as well as RCV1/RCV2 we chose \(40\%\) of the data samples as the tuning set, and the rest \(60\%\) were used for testing. For the DREAM data set these partitions were \(60\%\) for tuning and \(40\%\) for testing.
We generated versions of the data with different amounts of missing values. For the first test case, we deleted 1 view from each selected data point in each data set. In the second test case, we removed 2 views for TOY and RCV1/RCV2 data sets and 3 views for DREAM. For the third one we deleted 3 views among 5 views per selected data point in TOY and RCV1/RCV2, and 5 views among 8 views per selected data point in DREAM.
We repeated all our experiments for 5 random tuning and test partitions with different missing entries and report the average performance on them.
6.1.3 Compared methods
We compared performance of the proposed methods, \(\mathbf{MKC}_{embd(hm)}\) \(\mathbf{MKC}_{embd(ht)}\), \(\mathbf{MKC}_{app}\), \(\mathbf{MKC}_{sdp}\), with k nearest neighbour (KNN) imputation as a baseline KNN has previously been shown to be a competitive imputation method (Brock et al. 2008). For KNN imputation we first concatenated underlying feature representations from all views to get a joint feature representation. We then sought k nearest data points by using their available parts, and the missing part was imputed as either average (Knn) or the weighted average (wKnn) of the selected neighbours. We also compare our result with generative model based approach of Lian et al. (2015) (MLFS) and with an EM-based kernel completion method (\(\mathbf{EM}_{based}\)) proposed by Tsuda et al. (2003). Tsuda et al. (2003) cannot solve our problem when no view is complete, hence we study the relative performance only in the cases which it can solve. For Tsuda et al. (2003)’s method we assume the first view is complete.
We also compared \(\mathbf{MKC}_{embd(ht)}\), with \(\mathbf{MKC}_{rnd}\) where we assumed the basis vectors are selected randomly with uniform distribution with out replacement and after that reconstruction weights for all views are optimizied.
The hyper-parameters \(\lambda _1\) and \(\lambda _2\) of MKC and k of Knn and wKnn were selected with the help of tuning set, from the range of \(10^{-3}\) to \(10^3\) and [1, 2, 3, 5, 7, 10] respectively. All reported results indicate performance in the test sets.
6.2 Prediction error comparisons
6.2.1 Average Relative Error (ARE)
We evaluated the performance of all methods using the average relative error (ARE) (Xu et al. 2013). Let \({\hat{{\mathbf {k}}}}^{(m)}_{t}\) be the predicted tth row for the mth view and the corresponding true values of kernel row be \({{\mathbf {k}}}^{(m)}_{t}\), then the relative error is the relative root mean square deviation.
The average relative error (in percentage) is then computed over all missing data points for a view, that is,
Here \(n_{t}^{(m)}\) is the number of missing samples in the mth view.
6.2.2 Results
Table 1 shows the Average Relative Error (34) for the compared methods. It shows that the proposed MKC methods generally predict missing values more accurately than Knn, wKnn, \(\mathbf{EM}_{based}\) and MLFS. In particular, the differences in favor to the MKC methods increase when the number of missing views is increased.
The most recent method MLFS performs comparatively for the DREAM data-set and the data-set with linear kernels (TOYL). But it deteriorates very badly with the increased of non-linearity of kernels, i.e., TOYG1 and TOYLG1. For highly non-linear sparse kernels (TOYG0.1) and for RCV1/RCV2 data-set with large amount of missing views the MLFS fails to predict.
The \(\mathbf{EM}_{based}\) sometimes has more than \(200\%\) error and higher (more than \(200\%\)) variance. The most accurate method in each setup is one of the proposed MKC’s. \(\mathbf{MKC}_{embd(hm)}\) is generally the least accurate of them, but still competitive against the other compared methods. We further note that:
-
\(\mathbf{MKC}_{embd(ht)}\) is consistently the best when different views have different kernel functions or eigen-spectra, e.g., TOYLG1 and DREAM (Fig. 2). Better performance of \(\mathbf{MKC}_{embd(ht)}\) than \(\mathbf{MKC}_{embd(hm)}\) in DREAM data gives evidence of applicability of \(\mathbf{MKC}_{embd(ht)}\) in real-world data-set.
-
\(\mathbf{MKC}_{app}\) performs best or very close to \(\mathbf{MKC}_{embd(ht)}\) when kernel functions and eigen-spectra of all views are the same (for instance TOYL, TOYG1 and RCV1/RCV2). As \(\mathbf{MKC}_{app}\) learns between-view relationships on kernel values it is not able to perform well for TOYLG1 and DREAM where very different kernel functions are used in different views.
-
\(\mathbf{MKC}_{sdp}\) outperforms all other methods when kernel functions are highly non-linear (such as in TOYG0.1). On less non-linear cases, \(\mathbf{MKC}_{sdp}\) on the other hand trails in accuracy to the other MKC variants. \(\mathbf{MKC}_{sdp}\) is computationally more demanding than the others, to the extent that on RCV1/RCV2 data we had to skip it.
Figure 3 depicts the performance as the number of missing samples per view is increased. Here, \(\mathbf{MKC}_{embd(ht)}\), \(\mathbf{MKC}_{app}\) and \(\mathbf{MKC}_{embd(hm)}\) prove to be the most robust methods over all data sets. The performance of \(\mathbf{MKC}_{sdp}\) seems to be the most sensitive to amount of missing samples. Overall, \(\mathbf{EM}_{based}\), Knn, and wKnn have worse error rates than the MKC methods.
6.3 Comparison of performance of different versions of the proposed approach
Figure 4 shows how relative prediction error (ARE) of \(\mathbf{MKC}_{embd(ht)}\), \(\mathbf{MKC}_{app}\) and \(\mathbf{MKC}_{sdp}\) vary with two properties of given data-sets. Namely, (1) difference among eigen-spectra of kernel of different views(x axis) and (2) non-linearity of kernels for all views(y axis). For this experiment, we consider 3rd, 4th and 5th views of TOY data. Here all views have been generated from same embedding. Non-linearity of kernel function varies with combination of linear and Gaussian kernel where the kernel with of the Gaussian kernel varies among 5, 1 and 0.1.
The heterogeneity of eigen-spectra of all kernels are calculated as average mean square difference of eigen-spectra of each pair of kernels. The non-linearity of kernel is indicated by the average of the 20th eigen-values of all views. Each circle indicates amount of prediction error by \(\mathbf{MKC}_{embd(ht)}\), \(\mathbf{MKC}_{app}\) and \(\mathbf{MKC}_{sdp}\) where radius of each circle is proportional to “\(log(ARE) + Thr\)”. A constant “Thr” was required to have positive radii for all circles for better visualization. We further note that:
-
The performance of \(\mathbf{MKC}_{embd(ht)}\) is the best among these three methods for all most all cases.
-
Only when all views have similarly high non-linear kernel (top-left corner), \(\mathbf{MKC}_{sdp}\) performs best among all. It also shows that the performance of \(\mathbf{MKC}_{sdp}\) improves with increase of non-linearity.
-
We can also see that with the increase of heterogeneity in kernels (increase of x-axis) the performance of \(\mathbf{MKC}_{app}\) deteriorates and is getting worse than that of \(\mathbf{MKC}_{embd(ht)}\).
6.4 Running time comparison
Table 2 depicts the running times for the compared methods. \(\mathbf{MKC}_{app}\), \(\mathbf{MKC}_{embd(ht)}\) and \(\mathbf{MKC}_{embd(hm)}\) are many times faster than \(\mathbf{MKC}_{sdp}\). In particular, \(\mathbf{MKC}_{embd(hm)}\) is competitive in running time with the significantly less accurate \(\mathbf{EM}_{based}\) and MLFS methods, except on the RCV1/RCV2 data. As expected, Knn and wKnn are orders of magnitude faster but fall far short of the reconstruction quality of the MKC methods.
7 Conclusions
In this paper, we have introduced new methods for kernel completion in the multi-view setting. The methods are able to propagate relevant information across views to predict missing rows/columns of kernel matrices in multi-view data. In particular, we are able to predict missing rows/columns of kernel matrices for non-linear kernels, and do not need any complete kernel matrices a priori.
Our method of within-view learning approximates the full kernel by a sparse basis set of examples with local reconstruction weights, picked up by \(\ell _{2,1}\) regularization. This approach has the added benefit of circumventing the need of an explicit PSD constraint in optimization. We showed that the method generalizes and improves Nyström approximation. For learning between views, we proposed two alternative approaches, one based on learning convex kernel combinations and another based on learning a convex set of reconstruction weights. The heterogeneity of the kernels in different views affects which of the approaches is favourable. We related theoretically the kernel approximation quality of these methods to the similarity of eigen-spaces of the individual kernels.
Our experiments show that the proposed multi-view completion methods are in general more accurate than previously available methods. In terms of running time, due to the inherent non-convexity of the optimization problems, the new proposals still have room to improve. However, the methods are amenable for efficient parallelization, which we leave for further work.
Notes
MKC code is available in https://github.com/aalto-ics-kepaco/MKC_software.
All data-sets and MKC code is available in https://github.com/aalto-ics-kepaco/MKC_software.
References
Amini, M., Usunier, N., & Goutte, C. (2009). Learning from multiple partially observed views—An application to multilingual text categorization. Advances in Neural Information Processing Systems, 22, 28–36.
Argyriou, A., Micchelli, C.A., & Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th annual conference on learning theory (pp. 338–352).
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. Advances in Neural Information Processing Systems, 19, 41–48.
Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning (pp. 6–13). ACM.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2011). Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5, pp. 19–53.
Brock, G., Shaffer, J., Blakesley, R., Lotz, M., & Tseng, G. (2008). Which missing value imputation method to use in expression profiles: A comparative study and two selection schemes. BMC Bioinformatics, 9, 1–12.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13, 795–828.
Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2–3), 127–152.
Daemen, A., Griffith, O., Heiser, L., et al. (2013). Modeling precision treatment of breast cancer. Genome Biology, 14(10), 1.
Gönen, M., & Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211–2268.
Graepel, T. (2002). Kernel matrix completion by semidefinite programming. In Proceedings of the 12th international conference on artificial neural networks, Springer (pp. 694–699).
Heiser, L. M., Sadanandam, A., et al. (2012). Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences, 109(8), 2724–2729.
Kumar, S., Mohri, M., & Talwalkar, A. (2009). On sampling-based approximate spectral decomposition. In Proceedings of the 26th annual international conference on machine learning (pp. 53–560). ACM.
Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexible models for heterogeneous multiview data. In Proceedings of the 29th AAAI conference on artificial intelligence (pp. 2757–2763).
Paisley, J., Carin, & L. (2010). A nonparametric Bayesian model for kernel matrix completion. In The 35th international conference on acoustics, speech, and signal processing, IEEE (pp. 2090–2093).
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Shao, W., Shi, X., & Yu, P.S. (2013). Clustering on multiple incomplete datasets via collective kernel learning. In IEEE 13th international conference on, data mining (ICDM), 2013 (pp. 1181–1186). IEEE.
Trivedi, A., Rai, P., Daumé III, H., & DuVall, S.L. (2005). Multiview clustering with incomplete views. In Proceedings of the NIPS workshop.
Tsuda, K., Akaho, S., & Asai, K. (2003). The em algorithm for kernel matrix completion with auxiliary data. The Journal of Machine Learning Research, 4, 67–81.
Wang, P., Shen, C., & Van Den Hengel, A. (2013). A fast semidefinite approach to solving binary quadratic problems. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1312–1319).
Williams, C., Seeger, M. (2001). Using the nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, EPFL-CONF-161322 (pp. 682–688).
Williams, D., Carin, L. (2005). Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the ICML workshop on learning with multiple views.
Xu, M., Jin, R., & Zhou, Z.H. (2013). Speedup matrix completion with side information: Application to multi-label learning. In Advances in neural information processing systems (pp. 2301–2309).
Acknowledgements
We thank Academy of Finland (Grants 292334, 294238, 295503 and Center of Excellence in Computational Inference COIN for SK, Grant 295496 for JR and Grants 295503 and 295496 for SB) and Finnish Funding Agency for Innovation Tekes (Grant 40128/14 for JR) for funding. We also thank authors of Lian et al. (2015) for sharing their software for MLFS.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Bob Durrant, Kee-Eung Kim, Geoff Holmes, Stephen Marsland, Zhi-Hua Zhou and Masashi Sugiyama.
Rights and permissions
About this article
Cite this article
Bhadra, S., Kaski, S. & Rousu, J. Multi-view kernel completion. Mach Learn 106, 713–739 (2017). https://doi.org/10.1007/s10994-016-5618-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-016-5618-0