Multiview kernel completion
Abstract
In this paper, we introduce the first method that (1) can complete kernel matrices with completely missing rows and columns, as opposed to individual missing kernel values, with the help of information from other incomplete kernel matrices. Moreover, (2) the method does not require any of the kernels to be complete a priori, and (3) it can tackle nonlinear kernels. The kernel completion is done by finding, from the set of available incomplete kernels, an appropriate set of related kernels for each missing entry. These aspects are necessary in practical applications such as integrating legacy data sets, learning under sensor failures and learning when measurements are costly for some of the views. The proposed approach predicts missing rows by modelling both within-view and between-view relationships among kernel values. For within-view learning, we propose a new kernel approximation that generalizes and improves the Nyström approximation. We show, both on simulated data and real case studies, that the proposed method outperforms existing techniques in the settings where they are available, and extends applicability to new settings.
Keywords
Kernel completion, Low rank kernel approximation, Multiview data, Missing values
1 Introduction
In recent years, many methods have been proposed for multiview learning, i.e., learning with data collected from multiple sources or “views” to utilize the complementary information in them. Kernelized methods capture the similarities among data points in a kernel matrix. The multiple kernel learning (MKL) framework (cf. Gönen and Alpaydin 2011) is a popular way to accumulate information from multiple data sources, where kernel matrices built on features from individual views are combined for better learning. In MKL methods, it is commonly assumed that full kernel matrices for each view are available. However, in practical data analytics, it is common that information from some sources is not available for some data points.
The incomplete data problem exists in a wide range of fields, including social sciences, computer vision, biological systems, and remote sensing. For example, in remote sensing, some sensors can go off for periods of time, leaving gaps in data. A second example is that when integrating legacy data sets, some views may not be available for some data points, because integration needs were not considered when originally collecting and storing the data. For instance, gene expression may have been measured for some of the biological samples, but not for others, and as the biological sample material has been exhausted, the missing measurements cannot be made any more. On the other hand, some measurements may be too expensive to repeat for all samples; for example, a patient’s genotype may be measured only if a particular condition holds. All these examples introduce missing views, i.e., all features of a view for a data point can be missing simultaneously.
Novelties in problem definition: Previous methods for kernel completion have addressed completion of the aggregated Gaussian kernel matrix by integrating multiple incomplete kernels (Williams and Carin 2005), or single-view kernel completion assuming individual missing values (Graepel 2002; Paisley and Carin 2010), or required at least one complete kernel with a full eigensystem to be used as an auxiliary data source (Tsuda et al. 2003; Trivedi et al. 2005), or assumed the eigensystems of two kernels to be exactly the same (Shao et al. 2013), or assumed a linear kernel approximation (Lian et al. 2015). Williams and Carin (2005) do not complete the individual incomplete kernel matrices but only the aggregated kernel, and only when all kernels are Gaussian. Because full rows and columns are absent from the incomplete kernel matrices, no existing single-view kernel completion method (Graepel 2002; Paisley and Carin 2010) can be applied to complete the kernel matrices of individual views independently. In the multiview setting, Tsuda et al. (2003) have proposed an expectation-maximization based method to complete an incomplete kernel matrix for a view with the help of a complete kernel matrix from another view. As it requires a full eigensystem of the auxiliary full kernel matrix, that method cannot be used to complete a kernel matrix with missing rows/columns when no auxiliary complete kernel matrix is available. Both Trivedi et al. (2005) and Shao et al. (2013) match kernels through their graph Laplacians, which may not work optimally if the kernels have different eigenstructures arising from different types of measurements. The method by Shao et al. (2013) completes multiple kernels sequentially, making the implicit assumption that adjacent kernels in the sequence are related. This can be a hard constraint and in general may not match reality. On the other hand, Lian et al. (2015) proposed a generative model based method which approximates the similarity matrix for each view as a linear kernel in some low-dimensional space. Therefore, it is unable to model highly nonlinear kernels such as RBFs. Hence no existing method can, by itself, complete highly nonlinear kernel matrices with completely missing rows and columns in a multiview setting when no auxiliary full kernel matrix is available.
Contributions: In this paper, we propose a novel method to complete all incomplete kernel matrices collaboratively, by learning both between-view and within-view relationships among the kernel values (Fig. 1). We model between-view relationships in the following two ways: (1) First, adapting the strategies from multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012), we complete kernel matrices by expressing the individual normalized kernel matrix of each view as a convex combination of the normalized kernel matrices of the other views. (2) Second, to model relationships between kernels having different eigensystems, we propose a novel approach of restricting the local embedding of one view to the convex hull of the local embeddings of the other views. We relate theoretically the kernel approximation quality of the different approaches to the properties of the underlying eigenspaces of the kernels, pointing out settings where different approaches are optimal.
For within-view learning, we begin from the concept of local linear embedding (Roweis and Saul 2000) applied to the feature vectors, and extend it to the kernel matrix by reconstructing each feature representation for a kernel as a sparse linear combination of other available feature representations or “basis” vectors in the same view. We assume the local embeddings, i.e., the reconstruction weights and the basis vectors for reconstructing each sample, are similar across views. In this approach, the nonlinearity of the kernel functions of individual views is also preserved in the basis vectors. We prove (Theorem 1) that the proposed within-view kernel reconstruction can be seen as generalizing and improving the Nyström method (Williams and Seeger 2001), which has been successfully applied to efficient kernel learning. Most importantly, we show (Theorem 2) that for a general single kernel matrix the proposed scheme results in the optimal low-rank approximation, which is not reached by the Nyström method.
For between-view learning, we recognize that the similarity of the eigenspaces of the views plays a crucial role. When the different kernels have similar optimal low-rank approximations, we show (Theorem 3) that predicting kernel values across views is a potent approach. For this case, we propose a method (\(\mathbf{MKC}_{app}\) (25)) relying on a technique previously used in the multiple kernel learning literature, namely, restricting a kernel matrix to the convex hull of other kernel matrices (Argyriou et al. 2005; Cortes et al. 2012). Here, we use this technique for simultaneously completing multiple incomplete kernel matrices, while Argyriou et al. (2005) and Cortes et al. (2012) used it only for learning an effective linear combination of complete kernel matrices.
2 Multiview kernel completion
We assume N data observations \({\mathbf {X}}=\{ {\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\}\) from a multiview input space \({\mathscr {X}}= {\mathscr {X}}^{(1)} \times \cdots \times {\mathscr {X}}^{(M)}\), where \({\mathscr {X}}^{(m)}\) is the input space generating the mth view. We denote by \({\mathbf {X}}^{(m)}=\{ {\mathbf {x}}^{(m)}_1,\ldots ,{\mathbf {x}}^{(m)}_N\}\), \(\forall \) \(m = 1,\ldots , M\), the set of observations for the mth view, where \({\mathbf {x}}^{(m)}_i \in {\mathscr {X}}^{(m)}\) is the ith observation in the mth view. For simplicity of notation we sometimes omit the superscript (m) denoting the different views when there is no need to refer to several views at a time.
Considering an implicit mapping of the observations of the mth view to an inner product space \({\mathscr {F}}^{(m)}\) via a mapping \(\phi ^{(m)}: {\mathscr {X}}^{(m)} \rightarrow {\mathscr {F}}^{(m)}\), and following the usual recipe for kernel methods (Bach et al. 2004), we specify the kernel as the inner product in \({\mathscr {F}}^{(m)}\). The kernel value between the ith and jth data points is defined as \({k}^{(m)}_{ij}= \langle \phi _i^{(m)},\phi _j^{(m)} \rangle \), where \(\phi _i^{(m)} = \phi ^{(m)}({\mathbf {x}}_i^{(m)})\) and \(k_{ij}^{(m)}\) is an element of \({\mathbf {K}}^{(m)}\), the kernel Gram matrix for the set \({\mathbf {X}}^{(m)}\).
In this paper we make the assumption that a subset of samples is observed in each view, and correspondingly, a subset of views is observed for each sample. Let \(I_N = \{1,\ldots , N\} \) be the set of indices of all data points and \(I^{(m)}\) be the set of indices of all available data points in the mth view. Hence for each view, only a kernel submatrix (\({\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\)) corresponding to the rows and columns indexed by \(I^{(m)}\) is observed. Our aim is to predict a complete positive semidefinite (PSD) kernel matrix (\({\hat{{\mathbf {K}}}}^{(m)} \in {\mathbb {R}}^{N \times N}\)) corresponding to each view. The crucial task is to predict the missing (tth) rows and columns of \({\hat{{\mathbf {K}}}}^{(m)}\), for all \(t \in I_N \setminus I^{(m)}\). Our approach for predicting \({\hat{{\mathbf {K}}}}^{(m)}\) is based on learning both between-view and within-view relationships among the kernel values (Fig. 1). The submatrix \({\hat{{\mathbf {K}}}}^{(m)}_{I^{(m)}I^{(m)}}\) should be approximately equal to the observed matrix \({\mathbf {K}}^{(m)}_{I^{(m)}I^{(m)}}\); in our approach, however, the approximation quality of the two parts of the kernel matrix can be traded off against each other.
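The index-set notation can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the data, the Gaussian kernel and the names `rbf_kernel`, `observed` and `K_incomplete` are hypothetical and not part of the method itself.

```python
import numpy as np

def rbf_kernel(X, width=1.0):
    """Gaussian kernel matrix; the parameterization is an assumption."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * width**2))

rng = np.random.default_rng(0)
N = 8
X = rng.normal(size=(N, 3))          # hypothetical observations of one view
K = rbf_kernel(X)                    # full kernel (unknown in practice)

observed = np.array([0, 1, 3, 4, 6])                 # I^(m)
missing = np.setdiff1d(np.arange(N), observed)       # I_N \ I^(m)

# Only the submatrix indexed by I^(m) x I^(m) is available.
K_obs = K[np.ix_(observed, observed)]

# N x N view of the incomplete kernel: whole rows and columns are missing.
K_incomplete = np.full((N, N), np.nan)
K_incomplete[np.ix_(observed, observed)] = K_obs
```

Every row and column indexed by `missing` is entirely unknown, which is what distinguishes this setting from completing individual missing entries.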
2.1 Within-view kernel relationships
Intuitively, the reconstruction weights are used to extend the known part of the kernel to the unknown part; in other words, the unknown part is assumed to reside within the span of the known part.
Without the \(\ell _{2,1}\) regularization, the above approximation loss could be trivially optimized by choosing \({\mathbf {A}}_{II}\) as the identity matrix. The \(\ell _{2,1}\) regularization will have the effect of zeroing out some of the diagonal values and introducing nonzeros to the submatrix \({\mathbf {A}}_{BI}\), corresponding to the rows and columns indexed by \(B\) and \(I\) respectively, where \(B= \{i \mid a_{ii} \ne 0\}\).
In Sect. 3 we show by Theorem 1 that (5) corresponds to a generalized form of the Nyström method (Williams and Seeger 2001), which is a sparse kernel approximation method that has been successfully applied to efficient kernel learning. The Nyström method finds a small set of vectors (not necessarily linearly independent) spanning the kernel, whereas our method searches for linearly independent basis vectors (cf. Sect. 3, Lemma 1) and optimizes the reconstruction weights for the data samples. In particular, we show that (5) achieves the best rank-r approximation of a kernel when the original kernel has rank higher than r, which is not achieved by the Nyström method (cf. Theorem 2).
2.2 Between-view kernel relationships
For a completely missing row or column of a kernel matrix, there is not enough information available for completing it within the same view, and hence the completion needs to be based on other information sources, in our case the other views where the corresponding kernel parts are known. In the following, we introduce two approaches for relaying information of the other views for completing the unknown rows/columns of a particular view. The first technique is based on learning a convex combination of the kernels, extending the multiple kernel learning (Argyriou et al. 2005; Cortes et al. 2012) techniques to kernel completion. The second technique is based on learning reconstruction weights so that they share information between the views.
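To illustrate the first idea, expressing one kernel as a convex combination of the others, the following sketch fits simplex-constrained weights by projected gradient descent on fully known, normalized kernels. It is a simplified stand-in and not the completion procedure itself: the helper names, the solver choice and the toy kernels are all assumptions made for illustration.

```python
import numpy as np

def normalize(K):
    """Scale a kernel so that its diagonal equals one."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_error(target, others, s):
    """Squared Frobenius error of target - sum_l s_l * others[l]."""
    approx = sum(w * K for w, K in zip(s, others))
    return float(np.sum((target - approx) ** 2))

def fit_convex_weights(target, others, n_iter=500):
    """Least-squares convex-combination weights via projected gradient."""
    D = np.stack([K.ravel() for K in others], axis=1)   # (N*N, L)
    t = target.ravel()
    G = D.T @ D
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / Lipschitz constant
    s = np.full(D.shape[1], 1.0 / D.shape[1])
    for _ in range(n_iter):
        grad = G @ s - D.T @ t
        s = project_to_simplex(s - step * grad)
    return s

# Hypothetical example: three kernels computed on the same points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = np.sum(X**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
others = [normalize(X @ X.T + 1e-6 * np.eye(20)),
          normalize((X @ X.T + 1.0) ** 2),
          np.exp(-0.5 * d2)]
target = 0.7 * others[0] + 0.3 * others[2]
weights = fit_convex_weights(target, others)
```

In the incomplete setting the other kernels are themselves partly unknown, so the weights and the missing kernel values have to be estimated jointly rather than in one pass as here.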
Between-view learning of reconstruction weights: In practical applications, the kernels arising in a multiview setup might be very heterogeneous in their distributions. In such cases, it might not be realistic to find a convex combination of other kernels that is closely similar to the kernel of a given view. In particular, when the eigenspectra of the kernels are very different, we expect a low between-view loss (8) to be hard to achieve.
However, assuming that kernel functions in all the views have similar eigenvectors is also unrealistic for many real world datasets with heterogeneous sources and kernels applied to them. On the contrary, it is quite possible that only for a subset of views the eigenvectors of approximated kernel are linearly related. Thus, in our approach we allow the views to have different reconstruction weights, but assume a parametrized relationship learned from data. This also allows the model to find an appropriate set of related kernels from the set of available incomplete kernels, for each missing entry.
3 Theoretical analysis
In this section, we present the theoretical results underlying our methods. We begin by showing the relationship and advantages of our within-kernel approximation to the Nyström method, and follow with theorems establishing the approximation quality of different kernel completion models.
3.1 Rank of the within-kernel approximation
Lemma 1
For \(rank({\mathbf {K}}) \ge r\), \(\exists \) \(\lambda =\lambda ^r \) and \(\lambda _{{\hat{{\mathbf {A}}}}^*_{i}}\) such that the solution of (15) selects a \({\mathbf {W}}^{*} \in {\mathbb {R}}^{r \times r}\) with \(rank({\mathbf {W}}^{*})=r\).
Proof
When \(rank({\mathbf {K}}) \ge r\), there must exist a rank-r submatrix of \({\mathbf {K}}\) of size \(r \times r\). With \(\ell _1\) regularization on the binary vector \({\mathbf {p}}\), one can tune \(\lambda \) to \(\lambda ^r\) to obtain the required sparsity of \({\mathbf {p}}\), i.e., \(\Vert {\mathbf {p}}^{*}\Vert _1= r\). Moreover, \(\ell _1\) regularization on the binary vector ensures the linear independence of the selected columns and rows when \(\lambda _{i}\) is chosen carefully. Suppose a solution of (15) selects a column that is linearly dependent on the other selected columns. Compared with the solution that selects the same columns except this linearly dependent one, it raises the \(\ell _1\) regularization term of the objective by \(\lambda \), keeps the first part of the objective the same, and lowers the third part of (15) by \(\frac{\lambda _{i}\Vert A_i\Vert ^2_2}{2}\). Hence, if \(\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}\), such a solution cannot be optimal, and choosing \(\lambda _{i}\) according to (16) guarantees \(\lambda _{i} < 2\frac{\lambda }{\Vert \hat{{\mathbf {A}}^*}_i\Vert ^2_2}\). This completes the proof. \(\square \)
3.2 Relation to Nyström approximation
Theorem 1
The Nyström approximation of \({\mathbf {K}}\) is a feasible solution of (15), i.e., for invertible \({\mathbf {W}}\), \( \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{N \times c}\) such that \( {\hat{{\mathbf {K}}}}_{nys} = {\hat{{\mathbf {A}}}}^{^T} {\mathbf {W}}{\hat{{\mathbf {A}}}} \).
Proof
The above theorem shows that the approach of (15), by finding the optimal feasible solution, always produces a kernel approximation that is at least as good as that of the Nyström method at the same level of sparsity.
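The factorization stated in Theorem 1 can be checked numerically. The sketch below assumes the standard Nyström form \(C W^{-1} C^{T}\) with randomly chosen landmark columns, and stores the reconstruction weights as a \(c \times N\) array so that the product conforms; the data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 30, 5
X = rng.normal(size=(N, 4))                      # hypothetical data
sq = np.sum(X**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-0.5 * d2)                            # Gaussian kernel, width 1

landmarks = rng.choice(N, size=c, replace=False)
C = K[:, landmarks]                              # N x c sampled columns
W = K[np.ix_(landmarks, landmarks)]              # c x c, assumed invertible

# Standard Nystrom approximation: C W^{-1} C^T.
K_nys = C @ np.linalg.solve(W, C.T)

# The same matrix written as A^T W A with A = W^{-1} C^T.
A = np.linalg.solve(W, C.T)                      # c x N reconstruction weights
K_from_A = A.T @ W @ A
```

Since \(W\) is symmetric, \(A^{T} W A = C W^{-1} W W^{-1} C^{T} = C W^{-1} C^{T}\), so the Nyström approximation corresponds to one particular, non-optimized choice of the weights.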
3.3 Low-rank approximation quality

If \(r=rank({\mathbf {K}})\le c\) and \(rank({\mathbf {W}})=r\), then the Nyström approximation is exact, i.e., \(\Vert {\mathbf {K}} - {\hat{{\mathbf {K}}}}_{nys}\Vert ^2_2 =0\).

For general \({\mathbf {K}}\), when \(rank({\mathbf {K}}) \ge r\) and \(rank({\mathbf {W}})=r\), the Nyström approximation is not the best rank-r approximation of \({\mathbf {K}}\).
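Both properties can be observed on synthetic matrices. The sketch below is an illustration under assumptions: the random matrices, the choice of the first r columns as landmarks, and the helper names `nystrom` and `best_rank_r` are hypothetical, and the best rank-r approximation is taken to be the truncated eigendecomposition.

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom approximation from the columns indexed by idx."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

def best_rank_r(K, r):
    """Best rank-r approximation of a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(K)
    top = np.argsort(vals)[::-1][:r]
    return (vecs[:, top] * vals[top]) @ vecs[:, top].T

rng = np.random.default_rng(1)
N, r = 40, 3
idx = np.arange(r)                       # landmark columns (arbitrary choice)

# Case 1: rank(K) = rank(W) = r, so the Nystrom approximation is exact.
Z = rng.normal(size=(N, r))
K_low = Z @ Z.T
err_exact = np.linalg.norm(K_low - nystrom(K_low, idx), 2)

# Case 2: full-rank K, where Nystrom is generally worse than the best rank-r fit.
F = rng.normal(size=(N, N))
K_full = F @ F.T / N
err_nys = np.linalg.norm(K_full - nystrom(K_full, idx), 2)
err_best = np.linalg.norm(K_full - best_rank_r(K_full, r), 2)
```

Comparing `err_nys` with `err_best` shows the gap that an optimized choice of reconstruction weights is meant to close.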
Lemma 2
Let \({\mathbf {K}}_r\) be the best rank-r approximation of a kernel \({\mathbf {K}}\) with \(rank({\mathbf {K}}) \ge r\), and let \({\mathbf {W}}\) be a full-rank submatrix of \({\mathbf {K}}\) of size \(r \times r\), i.e., \(rank({\mathbf {W}})=r\). Then \( \exists {\hat{{\mathbf {A}}}} \in {\mathbb {R}}^{N \times r}\) such that the proposed approximation \({\hat{{\mathbf {K}}}} =\hat{{\mathbf {A}}^T} {\mathbf {W}}{\hat{{\mathbf {A}}}}\) is equivalent to \({\mathbf {K}}_r\), i.e., \(\Vert {\mathbf {K}} - {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}} - {\mathbf {K}}_r\Vert ^2_2\).
Proof
Theorem 2
If \(rank({{\mathbf {K}}}) \ge r\), then \(\exists \) \(\lambda ^r\) such that the proposed approximation \({\hat{{\mathbf {K}}}}\) in (17) is equivalent to the best rank-r approximation of \({{\mathbf {K}}}\), i.e., \(\Vert {\mathbf {K}} - {\hat{{\mathbf {K}}}}\Vert ^2_2 = \Vert {\mathbf {K}} - {{\mathbf {K}}}_r\Vert ^2_2\), where \({{\mathbf {K}}}_r\) is the best rank-r approximation of \({{\mathbf {K}}}\).
Proof
3.4 Low-rank approximation quality of multiple kernel matrices
In this section, we establish the approximation quality achieved in the multiview setup, when the different kernels are similar either in the sense of having the same underlying ‘true’ low-rank approximations (Theorem 3) or more generally similar sets of eigenvectors (Theorem 4).
Theorem 3
Proof
Theorem 4
Proof
4 Optimization problems
Here we present the optimization problems for Multiview Kernel Completion (MKC), arising from the withinview and betweenview kernel approximations described above.
4.1 MKC using semidefinite programming (\(\mathbf{MKC}_{sdp}\))
4.2 MKC using homogeneous embeddings (\(\mathbf{MKC}_{embd(hm)}\))
4.3 MKC using heterogeneous embeddings (\(\mathbf{MKC}_{embd(ht)}\))
4.4 MKC using kernel approximation (\(\mathbf{MKC}_{app}\))
5 Algorithms
Here we present algorithms for solving the optimization problems described in the previous section.^{1}
5.1 Algorithm to solve \(\mathbf{MKC}_{embd(ht)}\)
Algorithm 1 describes the procedure for solving \(\mathbf{MKC}_{embd(ht)}\) (24).
Computational Complexity: Each iteration of Algorithm 1 updates the reconstruction weight vectors of size N for N data points in M views, as well as the between-view relation weights of size \(M \times M\). Hence the effective computational complexity per iteration is \({\mathscr {O}}\left( M(N^2 +M)\right) \).
5.2 Algorithm to solve \(\mathbf{MKC}_{sdp}\)
Computational Complexity: Each iteration of Algorithm 2 optimizes M kernels by solving M semidefinite programs (SDPs) of size N. A general SDP solver has computational complexity \({\mathscr {O}} \left( N^{6.5 }\right) \) (Wang et al. 2013). Hence the effective computational complexity is \({\mathscr {O}}\left( M N^{6.5}\right) \).
5.3 Algorithm to solve \(\mathbf{MKC}_{app}\)
Computational Complexity: Each iteration of Algorithm 3 updates the reconstruction weight vectors of size N for N data points in M views, as well as the between-view relation weights of size \(M \times M\). Hence the effective computational complexity per iteration is \({\mathscr {O}}\left( M(N^2 +M)\right) \).
6 Experiments
We apply the proposed MKC method on a variety of data sets, with different types of kernel functions in different views, along with different amounts of missing data points. The objectives of our experiments are: (1) to compare the performance of MKC against other existing methods in terms of the ability to predict the missing kernel rows, and (2) to empirically show that the proposed kernel approximation with the help of the reconstruction weights also improves running time over the \(\mathbf{MKC}_{sdp}\) method.
6.1 Experimental setup
6.1.1 Data sets:
To evaluate the performance of our method, we used 4 simulated data sets with 100 data points and 5 views, as well as two real-world multiview data sets: (1) Dream Challenge 7 data set (DREAM) (Daemen et al. 2013; Heiser and Sadanandam 2012) and (2) Reuters RCV1/RCV2 multilingual data (Amini et al. 2009).^{2}
1. We generated the first 10 points (\({\mathbf {X}}_{B^{(m)}}^{(m)}\)) for each view, where \({\mathbf {X}}_{B^{(m)}}^{(1)}\) and \({\mathbf {X}}_{B^{(m)}}^{(2)}\) are uniformly distributed in \([-1,1]^{5}\) and \({\mathbf {X}}_{B^{(m)}}^{(3)}\), \({\mathbf {X}}_{B^{(m)}}^{(4)}\) and \({\mathbf {X}}_{B^{(m)}}^{(5)}\) are uniformly distributed in \([-1,1]^{10}\).
2. These 10 data points were used as basis sets for each view, and a further 90 data points in each view were generated by \({\mathbf {X}}^{(m)}={\mathbf {A}}^{(m)} {\mathbf {X}}_{B^{(m)}}^{(m)}\), where the \({\mathbf {A}}^{(m)}\) are uniformly distributed random matrices \( \in {\mathbb {R}}^{90 \times 10}\). We chose \({\mathbf {A}}^{(1)}={\mathbf {A}}^{(2)}\) and \({\mathbf {A}}^{(3)}={\mathbf {A}}^{(4)} = {\mathbf {A}}^{(5)}\).
3. Finally, \({\mathbf {K}}^{(m)}\) was generated from \({\mathbf {X}}^{(m)}\) by using different kernel functions for different data sets as follows:
   TOYL: Linear kernel for all views.
   TOYG1 and TOYG0.1: Gaussian kernel for all views, with kernel widths 1 and 0.1 respectively.
   TOYLG1: Linear kernel for the first 3 views and Gaussian kernel with kernel width 1 for the last two views. Note that with this selection view 3 shares reconstruction weights with views 4 and 5, but has the same kernel as views 1 and 2.
Figure 2 shows that the eigenspectra of the kernel matrices differ considerably for TOYLG1, where we have used different kernels in different views.
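The generation procedure for the simulated data can be sketched as follows. Several details are assumptions rather than facts from the text: the basis points are drawn from \([-1,1]\), the entries of \({\mathbf {A}}^{(m)}\) are drawn uniformly from \([0,1)\), and the Gaussian kernel is parameterized as \(\exp (-\Vert x-y\Vert ^2/(2\sigma ^2))\) with \(\sigma \) the kernel width.

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X, width):
    # Assumed parameterization: exp(-||x - y||^2 / (2 * width^2)).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * width**2))

rng = np.random.default_rng(0)
n_basis, n_rest = 10, 90
dims = [5, 5, 10, 10, 10]                       # feature dimension per view

# Shared reconstruction weights: views 1-2 share one A, views 3-5 another.
A12 = rng.uniform(size=(n_rest, n_basis))       # range [0, 1) is an assumption
A345 = rng.uniform(size=(n_rest, n_basis))
weights = [A12, A12, A345, A345, A345]

views = []
for d, A in zip(dims, weights):
    X_basis = rng.uniform(-1.0, 1.0, size=(n_basis, d))   # first 10 points
    views.append(np.vstack([X_basis, A @ X_basis]))       # 100 points per view

toyl = [linear_kernel(X) for X in views]
toyg1 = [gaussian_kernel(X, 1.0) for X in views]
toyg01 = [gaussian_kernel(X, 0.1) for X in views]
toylg1 = ([linear_kernel(X) for X in views[:3]]
          + [gaussian_kernel(X, 1.0) for X in views[3:]])
```

Because every non-basis point is a linear combination of the 10 basis points, the reconstruction-weight assumption holds exactly in feature space for these data.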

The Dream Challenge 7 data set (DREAM): For Dream Challenge 7, genomic characterizations of multiple types on 53 breast cancer cell lines are provided. They consist of DNA copy number variation, transcript expression values, whole exome sequencing, RNA sequencing data, DNA methylation data and RPPA protein quantification measurements. In addition, some of the views are missing for some cell lines. For 25 data points all 6 views are available. For all the 6 views, we calculated Gaussian kernels after normalizing the data sets. We generated two other kernels by using Jaccard’s kernel function over binarized exome data and RNA sequencing data. Hence, the final data set has 8 kernel matrices. Figure 2 shows the eigenspectra of the kernel matrices of all views, which are quite different for different views.
RCV1/RCV2: The Reuters RCV1/RCV2 multilingual data set contains aligned documents in 5 languages (English, French, German, Italian and Spanish). Each document was originally written in one of these languages, and the corresponding documents for the other views were generated by machine translation of the original document. For our experiment, we randomly selected 1500 documents which were originally in English. The latent semantic kernel (Cristianini et al. 2002) is used for all languages.
6.1.2 Evaluation setup
Each of the data sets was partitioned into tuning and test sets. The missing views were introduced in these partitions independently. To induce missing views, we randomly selected data points from each partition, a few views for each of them, and deleted the corresponding rows and columns from the kernel matrices. The tuning set was used for parameter tuning. All the results have been reported on the test set which was independent of the tuning set.
For all 4 synthetic data sets as well as RCV1/RCV2 we chose \(40\%\) of the data samples as the tuning set, and the remaining \(60\%\) were used for testing. For the DREAM data set these partitions were \(60\%\) for tuning and \(40\%\) for testing.
We generated versions of the data with different amounts of missing values. For the first test case, we deleted 1 view from each selected data point in each data set. In the second test case, we removed 2 views for TOY and RCV1/RCV2 data sets and 3 views for DREAM. For the third one we deleted 3 views among 5 views per selected data point in TOY and RCV1/RCV2, and 5 views among 8 views per selected data point in DREAM.
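The procedure for inducing missing views can be sketched as below. The number of selected data points is not specified in this section, so `n_selected` is a hypothetical parameter, as are the function names.

```python
import numpy as np

def induce_missing_views(n_points, n_views, n_selected, views_per_point, rng):
    """Return a boolean mask observed[i, m]; False means view m is missing for point i."""
    observed = np.ones((n_points, n_views), dtype=bool)
    selected = rng.choice(n_points, size=n_selected, replace=False)
    for i in selected:
        dropped = rng.choice(n_views, size=views_per_point, replace=False)
        observed[i, dropped] = False
    return observed

def delete_rows_and_columns(K, observed_m):
    """Keep only the rows and columns of K for points observed in this view."""
    idx = np.flatnonzero(observed_m)
    return K[np.ix_(idx, idx)], idx

rng = np.random.default_rng(0)
mask = induce_missing_views(n_points=60, n_views=5, n_selected=20,
                            views_per_point=2, rng=rng)

K_view0 = np.eye(60)                         # placeholder kernel for view 0
K_sub, kept = delete_rows_and_columns(K_view0, mask[:, 0])
```

Each selected data point keeps at least one view as long as `views_per_point` is smaller than the number of views, which matches the three test cases described above.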
We repeated all our experiments for 5 random tuning and test partitions with different missing entries and report the average performance on them.
6.1.3 Compared methods
We compared the performance of the proposed methods, \(\mathbf{MKC}_{embd(hm)}\), \(\mathbf{MKC}_{embd(ht)}\), \(\mathbf{MKC}_{app}\) and \(\mathbf{MKC}_{sdp}\), with k nearest neighbour (KNN) imputation as a baseline. KNN has previously been shown to be a competitive imputation method (Brock et al. 2008). For KNN imputation we first concatenated the underlying feature representations from all views to get a joint feature representation. We then sought the k nearest data points by using their available parts, and the missing part was imputed as either the average (Knn) or the weighted average (wKnn) of the selected neighbours. We also compared our results with the generative model based approach of Lian et al. (2015) (MLFS) and with an EM-based kernel completion method (\(\mathbf{EM}_{based}\)) proposed by Tsuda et al. (2003). The method of Tsuda et al. (2003) cannot solve our problem when no view is complete, hence we study its relative performance only in the cases which it can solve. For the method of Tsuda et al. (2003) we assume the first view is complete.
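A minimal sketch of the KNN baseline on the joint feature representation is given below. The details are assumptions made for illustration: neighbours are restricted to fully observed rows, distances use only the features available for the incomplete row, and wKnn weights are inverse distances.

```python
import numpy as np

def knn_impute(X, k=3, weighted=False):
    """Impute np.nan entries of X (n x d joint features) from k nearest complete rows."""
    X_filled = X.copy()
    missing = np.isnan(X)
    complete = np.flatnonzero(~missing.any(axis=1))       # candidate neighbours
    for i in np.flatnonzero(missing.any(axis=1)):
        avail = ~missing[i]
        # Distance computed on the features that row i actually has.
        dist = np.linalg.norm(X[complete][:, avail] - X[i, avail], axis=1)
        nearest = np.argsort(dist)[:k]
        nbrs = X[complete[nearest]]
        if weighted:
            w = 1.0 / (dist[nearest] + 1e-12)             # inverse-distance weights
            est = (w / w.sum()) @ nbrs
        else:
            est = nbrs.mean(axis=0)
        X_filled[i, ~avail] = est[~avail]
    return X_filled

# Hypothetical joint features: the last row misses its second feature.
X_joint = np.array([[0.0, 0.0],
                    [1.0, 1.0],
                    [10.0, 10.0],
                    [0.1, np.nan]])
filled_knn = knn_impute(X_joint, k=2)
filled_wknn = knn_impute(X_joint, k=2, weighted=True)
```

Kernel rows for the imputed points would then be computed from the filled features with the kernel function of the corresponding view.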
We also compared \(\mathbf{MKC}_{embd(ht)}\) with \(\mathbf{MKC}_{rnd}\), where the basis vectors are selected uniformly at random without replacement, after which the reconstruction weights for all views are optimized.
The hyperparameters \(\lambda _1\) and \(\lambda _2\) of MKC and k of Knn and wKnn were selected with the help of the tuning set, from the range of \(10^{-3}\) to \(10^3\) and [1, 2, 3, 5, 7, 10] respectively. All reported results indicate performance on the test sets.
6.2 Prediction error comparisons
6.2.1 Average Relative Error (ARE)
We evaluated the performance of all methods using the average relative error (ARE) (Xu et al. 2013). Let \({\hat{{\mathbf {k}}}}^{(m)}_{t}\) be the predicted tth row for the mth view and the corresponding true values of kernel row be \({{\mathbf {k}}}^{(m)}_{t}\), then the relative error is the relative root mean square deviation.
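Since Eq. (34) is not reproduced here, the following sketch assumes a common form of the measure: the relative error of a predicted row is \(\Vert {\hat{{\mathbf {k}}}}_t - {\mathbf {k}}_t\Vert _2 / \Vert {\mathbf {k}}_t\Vert _2\), averaged over the missing rows of a view and reported as a percentage. The function name and the toy matrices are illustrative.

```python
import numpy as np

def average_relative_error(K_true, K_pred, missing_rows):
    """Average relative error (in percent) over the predicted rows of one view."""
    errors = [np.linalg.norm(K_pred[t] - K_true[t]) / np.linalg.norm(K_true[t])
              for t in missing_rows]
    return 100.0 * float(np.mean(errors))

K_true = np.eye(4) * 2.0 + 1.0        # hypothetical ground-truth kernel
K_pred = 1.1 * K_true                 # every entry over-estimated by 10 %
are = average_relative_error(K_true, K_pred, missing_rows=[1, 3])
```

With this definition a value of 100 corresponds to an error as large as the norm of the true row, for example when the prediction is identically zero.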
Table 1 Average relative error percentage (34)
Algorithm  TOYL  TOYG1  TOYG0.1  TOYLG1  DREAM  RCV1/RCV2

Number of missing views = 1 (TOY and RCV1/RCV2) and 1 (DREAM)
\(\mathbf{MKC}_{embd(ht)}\)  0.07 (±0.09)  7.40 (±9.20)  84.91 (±5.18)  4.50 (±6.72)  13.36 (±26.53)  1.79 (±0.89)
\(\mathbf{MKC}_{app}\)  0.09 (±0.10)  5.02 (±3.60)  76.24 (±10.59)  2.11 (±3.40)  14.46 (±28.39)  1.15 (±0.48)
\(\mathbf{MKC}_{sdp}\)  0.22 (±0.32)  11.29 (±6.29)  7.83 (±5.46)  6.06 (±7.84)  20.19 (±41.28)  –
\(\mathbf{MKC}_{embd(hm)}\)  0.19 (±0.18)  27.54 (±14.38)  86.08 (±6.34)  8.93 (±11.86)  16.12 (±30.27)  3.27 (±1.26)
\(\mathbf{MKC}_{rnd}\)  0.08 (±0.09)  16.79 (±13.14)  93.36 (±5.18)  4.50 (±6.72)  16.00 (±28.83)  1.40 (±0.28)
MLFS  0.28 (±0.20)  79.20 (±22.95)  100.00 (±0.00)  25.34 (±31.42)  11.69 (±26.47)  15.13 (±3.25)
\(\mathbf{EM}_{based}\)  20.65 (±41.08)  554.08 (±90.00)  31.23 (±37.02)  759.74 (±90.00)  14.78 (±32.93)  23.38 (±29.00)
Knn  0.34 (±0.53)  42.89 (±27.93)  62.69 (±8.77)  11.27 (±15.53)  14.94 (±25.29)  5.79 (±2.65)
wKnn  0.34 (±0.53)  45.47 (±29.50)  62.80 (±8.86)  15.30 (±20.15)  15.00 (±25.35)  5.91 (±2.71)

Number of missing views = 2 (TOY and RCV1/RCV2) and 3 (DREAM)
\(\mathbf{MKC}_{embd(ht)}\)  0.08 (±0.07)  9.43 (±6.72)  86.72 (±3.34)  3.26 (±5.07)  16.13 (±28.29)  2.74 (±0.85)
\(\mathbf{MKC}_{app}\)  0.07 (±0.05)  6.89 (±3.44)  84.40 (±9.04)  4.01 (±6.03)  17.51 (±27.65)  1.61 (±0.65)
\(\mathbf{MKC}_{sdp}\)  0.34 (±0.39)  19.87 (±13.88)  18.30 (±12.94)  37.88 (±49.58)  32.86 (±51.73)  –
\(\mathbf{MKC}_{embd(hm)}\)  0.14 (±0.09)  29.69 (±9.85)  96.19 (±1.60)  13.78 (±21.33)  18.33 (±29.56)  2.71 (±0.71)
\(\mathbf{MKC}_{rnd}\)  0.08 (±0.07)  17.94 (±6.91)  93.24 (±3.34)  6.96 (±9.39)  17.70 (±28.29)  2.29 (±0.11)
MLFS  0.58 (±0.26)  79.54 (±11.88)  100.00 (±0.00)  30.86 (±38.79)  16.59 (±31.28)  25.23 (±36.29)
\(\mathbf{EM}_{based}\)  28.66 (±42.28)  202.58 (±339.02)  61.87 (±50.48)  298.57 (±281.79)  25.98 (±63.70)  27.83 (±13.70)
Knn  0.26 (±0.26)  52.18 (±16.34)  97.65 (±11.62)  19.64 (±25.62)  22.04 (±30.07)  7.47 (±2.38)
wKnn  0.26 (±0.26)  54.94 (±16.02)  98.24 (±11.17)  20.90 (±27.16)  22.20 (±30.26)  7.61 (±2.40)

Number of missing views = 3 (TOY and RCV1/RCV2) and 5 (DREAM)
\(\mathbf{MKC}_{embd(ht)}\)  0.05 (±0.04)  12.87 (±3.40)  89.88 (±3.26)  5.13 (±7.17)  20.04 (±30.58)  1.69 (±0.85)
\(\mathbf{MKC}_{app}\)  0.10 (±0.05)  12.04 (±3.71)  89.69 (±5.54)  5.72 (±7.88)  20.43 (±30.39)  2.91 (±3.15)
\(\mathbf{MKC}_{sdp}\)  0.41 (±0.35)  86.21 (±55.84)  17.59 (±9.37)  438.92 (±624.21)  97.79 (±89.51)  –
\(\mathbf{MKC}_{embd(hm)}\)  0.16 (±0.10)  32.70 (±10.63)  95.43 (±1.75)  15.91 (±23.31)  22.13 (±33.29)  2.45 (±1.54)
\(\mathbf{MKC}_{rnd}\)  0.08 (±0.06)  20.56 (±4.88)  94.54 (±3.26)  7.95 (±10.76)  21.04 (±30.58)  1.38 (±0.29)
MLFS  8.55 (±3.08)  99.92 (±0.23)  100.00 (±0.00)  32.17 (±32.28)  21.54 (±30.94)  100.00 (±0.00)
\(\mathbf{EM}_{based}\)  21.46 (±40.73)  101.87 (±63.31)  554.08 (±90.00)  231.98 (±416.30)  60.47 (±245.88)  29.76 (±12.28)
Knn  0.39 (±0.33)  62.32 (±14.94)  112.79 (±13.22)  24.93 (±33.57)  27.54 (±36.88)  8.66 (±1.99)
wKnn  0.38 (±0.33)  66.94 (±14.74)  97.96 (±4.52)  27.48 (±36.90)  27.57 (±36.70)  8.85 (±1.99)
6.2.2 Results
Table 1 shows the Average Relative Error (34) for the compared methods. It shows that the proposed MKC methods generally predict missing values more accurately than Knn, wKnn, \(\mathbf{EM}_{based}\) and MLFS. In particular, the differences in favor of the MKC methods increase when the number of missing views is increased.
The most recent method, MLFS, performs comparably on the DREAM data set and on the data set with linear kernels (TOYL). However, it deteriorates badly as the nonlinearity of the kernels increases, i.e., on TOYG1 and TOYLG1. For highly nonlinear sparse kernels (TOYG0.1), and for the RCV1/RCV2 data set with a large number of missing views, MLFS fails to predict.

\(\mathbf{MKC}_{embd(ht)}\) is the best MKC variant in most settings where different views have different kernel functions or eigenspectra, e.g., TOYLG1 and DREAM (Fig. 2). The better performance of \(\mathbf{MKC}_{embd(ht)}\) compared with \(\mathbf{MKC}_{embd(hm)}\) on the DREAM data gives evidence of the applicability of \(\mathbf{MKC}_{embd(ht)}\) to real-world data sets.

\(\mathbf{MKC}_{app}\) performs best or very close to \(\mathbf{MKC}_{embd(ht)}\) when the kernel functions and eigenspectra of all views are the same (for instance TOYL, TOYG1 and RCV1/RCV2). As \(\mathbf{MKC}_{app}\) learns between-view relationships on kernel values, it is not able to perform well for TOYLG1 and DREAM, where very different kernel functions are used in different views.

\(\mathbf{MKC}_{sdp}\) outperforms all other methods when the kernel functions are highly nonlinear (such as in TOYG0.1). In less nonlinear cases, on the other hand, \(\mathbf{MKC}_{sdp}\) trails the other MKC variants in accuracy. \(\mathbf{MKC}_{sdp}\) is computationally more demanding than the others, to the extent that we had to skip it on the RCV1/RCV2 data.
6.3 Comparison of performance of different versions of the proposed approach
Average running time over all views and for TOY, over all 4 data sets
Algorithm  TOY (mins)  DREAM (mins)  RCV1/RCV2 (h) 

Number of missing views = 1 (TOY and RCV1/RCV2) and 1 (DREAM)  
\(\mathbf{MKC}_{embd(ht)}\)  5.00( ±2.04)  0.86( ±0.29)  45.93( ±2.27) 
\(\mathbf{MKC}_{app}\)  2.91( ±0.39)  1.89( ±0.62)  16.59( ±0.28) 
\(\mathbf{MKC}_{sdp}\)  14.82( ±4.39)  1.13( ±0.11)  – 
\(\mathbf{MKC}_{embd(hm)}\)  0.15( ±0.07)  0.05( ±0.03)  0.28( ±0.02) 
\(\mathbf{MKC}_{rnd}\)  2.71( ±0.53)  0.85( ±0.18)  4.21( ±0.56) 
MLFS  0.15( ±0.01)  0.01( ±0.01)  1.72( ±0.11) 
\(\mathbf{EM}_{based}\)  0.50( ±0.19)  0.03( ±0.05)  0.03( ±0.00) 
Number of missing views = 2 (TOY and RCV1/RCV2) and 3 (DREAM)  
\(\mathbf{MKC}_{embd(ht)}\)  7.58( ±2.18)  1.13( ±0.12)  25.86( ±0.36) 
\(\mathbf{MKC}_{app}\)  2.78( ±0.68)  1.29( ±0.25)  34.42( ±1.28) 
\(\mathbf{MKC}_{sdp}\)  25.65( ±5.43)  1.97( ±0.34)  – 
\(\mathbf{MKC}_{embd(hm)}\)  0.11( ±0.05)  0.03( ±0.01)  0.47( ±0.02) 
\(\mathbf{MKC}_{rnd}\)  1.33( ±0.59)  1.34( ±0.19)  3.63( ±0.50) 
MLFS  0.14( ±0.01)  0.01( ±0.02)  2.61( ±1.01) 
\(\mathbf{EM}_{based}\)  0.45( ±0.08)  0.06( ±0.06)  0.03( ±0.00) 
Number of missing views = 3 (TOY and RCV1/RCV2) and 5 (DREAM)  
\(\mathbf{MKC}_{embd(ht)}\)  6.83( ±2.14)  3.39( ±1.11)  24.39( ±2.13) 
\(\mathbf{MKC}_{app}\)  2.20( ±0.66)  3.64( ±1.79)  20.26( ±1.72) 
\(\mathbf{MKC}_{sdp}\)  178.1( ±162.9)  4.94( ±2.48)  – 
\(\mathbf{MKC}_{embd(hm)}\)  0.12( ±0.08)  0.03( ±0.02)  0.57( ±0.00) 
\(\mathbf{MKC}_{rnd}\)  1.83( ±0.73)  1.31( ±0.20)  2.81( ±0.17) 
MLFS  0.07( ±0.05)  0.05( ±0.01)  1.04( ±0.40) 
\(\mathbf{EM}_{based}\)  0.45( ±0.05)  0.10( ±0.05)  0.03( ±0.00) 

\(\mathbf{MKC}_{embd(ht)}\) performs best among these three methods in almost all cases.

\(\mathbf{MKC}_{sdp}\) performs best only when all views have similarly highly nonlinear kernels (top-left corner). The results also show that the performance of \(\mathbf{MKC}_{sdp}\) improves as nonlinearity increases.

We can also see that as the heterogeneity of the kernels increases (along the x-axis), the performance of \(\mathbf{MKC}_{app}\) deteriorates and becomes worse than that of \(\mathbf{MKC}_{embd(ht)}\).
6.4 Running time comparison
Table 2 depicts the running times for the compared methods. \(\mathbf{MKC}_{app}\), \(\mathbf{MKC}_{embd(ht)}\) and \(\mathbf{MKC}_{embd(hm)}\) are many times faster than \(\mathbf{MKC}_{sdp}\). In particular, \(\mathbf{MKC}_{embd(hm)}\) is competitive in running time with the significantly less accurate \(\mathbf{EM}_{based}\) and MLFS methods, except on the RCV1/RCV2 data. As expected, Knn and wKnn are orders of magnitude faster but fall far short of the reconstruction quality of the MKC methods.
7 Conclusions
In this paper, we have introduced new methods for kernel completion in the multiview setting. The methods are able to propagate relevant information across views to predict missing rows/columns of kernel matrices in multiview data. In particular, we are able to predict missing rows/columns of kernel matrices for nonlinear kernels, and do not need any complete kernel matrices a priori.
Our method of within-view learning approximates the full kernel by a sparse basis set of examples with local reconstruction weights, selected by \(\ell _{2,1}\) regularization. This approach has the added benefit of circumventing the need for an explicit PSD constraint in optimization. We showed that the method generalizes and improves on the Nyström approximation. For learning between views, we proposed two alternative approaches, one based on learning convex kernel combinations and another based on learning a convex set of reconstruction weights. The heterogeneity of the kernels in different views affects which of the approaches is favourable. We theoretically related the kernel approximation quality of these methods to the similarity of the eigenspaces of the individual kernels.
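The within-view construction can be illustrated with a small NumPy sketch. It assumes the approximation takes the form \(\hat{K} = A^\top K A\) for a reconstruction-weight matrix \(A\) whose nonzero rows mark the basis examples; the function names are hypothetical and no optimization over \(A\) is performed. Under that assumption, \(\hat{K}\) is PSD for any real \(A\) whenever \(K\) is, and setting the nonzero rows of \(A\) to \(W^{+}C^\top\) recovers the standard Nyström approximation \(C W^{+} C^\top\), where \(C\) holds the basis columns of \(K\) and \(W\) the basis-by-basis block.

```python
import numpy as np


def l21_norm(A):
    """Sum of the Euclidean norms of the rows of A (promotes all-zero rows)."""
    return float(np.sum(np.linalg.norm(A, axis=1)))


def reconstruct_kernel(K, A):
    """Assumed approximation form K_hat = A^T K A."""
    return A.T @ K @ A


def nystrom(K, basis):
    """Standard Nystrom approximation C W^+ C^T built from the basis columns."""
    C = K[:, basis]
    W = K[np.ix_(basis, basis)]
    return C @ np.linalg.pinv(W) @ C.T


def nystrom_weights(K, basis):
    """Weights A whose only nonzero rows are the basis rows, set to W^+ C^T."""
    n = K.shape[0]
    C = K[:, basis]
    W = K[np.ix_(basis, basis)]
    A = np.zeros((n, n))
    A[basis, :] = np.linalg.pinv(W) @ C.T
    return A


# Toy data: a Gaussian kernel on 8 random points, with 3 basis examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)
basis = [0, 2, 5]

A = nystrom_weights(K, basis)
K_hat = reconstruct_kernel(K, A)

# The special choice of A reproduces Nystrom, and K_hat stays PSD.
print(np.allclose(K_hat, nystrom(K, basis)))
print(np.linalg.eigvalsh((K_hat + K_hat.T) / 2).min() >= -1e-10)
print(l21_norm(A))
```

Rows of \(A\) that the \(\ell _{2,1}\) penalty drives to zero correspond to examples left out of the basis set, which is why the penalty acts on row norms in this sketch.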
Our experiments show that the proposed multiview completion methods are in general more accurate than previously available methods. In terms of running time, due to the inherent nonconvexity of the optimization problems, the new proposals still have room to improve. However, the methods are amenable to efficient parallelization, which we leave for further work.
Footnotes
 1. MKC code is available at https://github.com/aaltoicskepaco/MKC_software.
 2. All datasets and MKC code are available at https://github.com/aaltoicskepaco/MKC_software.
Notes
Acknowledgements
We thank the Academy of Finland (Grants 292334, 294238, 295503 and Center of Excellence in Computational Inference COIN for SK, Grant 295496 for JR and Grants 295503 and 295496 for SB) and the Finnish Funding Agency for Innovation Tekes (Grant 40128/14 for JR) for funding. We also thank the authors of Lian et al. (2015) for sharing their software for MLFS.
References
Amini, M., Usunier, N., & Goutte, C. (2009). Learning from multiple partially observed views - An application to multilingual text categorization. Advances in Neural Information Processing Systems, 22, 28–36.
Argyriou, A., Micchelli, C. A., & Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th annual conference on learning theory (pp. 338–352).
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multitask feature learning. Advances in Neural Information Processing Systems, 19, 41–48.
Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning (pp. 6–13). ACM.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2011). Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5, 19–53.
Brock, G., Shaffer, J., Blakesley, R., Lotz, M., & Tseng, G. (2008). Which missing value imputation method to use in expression profiles: A comparative study and two selection schemes. BMC Bioinformatics, 9, 1–12.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13, 795–828.
Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2–3), 127–152.
Daemen, A., Griffith, O., Heiser, L., et al. (2013). Modeling precision treatment of breast cancer. Genome Biology, 14(10), 1.
Gönen, M., & Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211–2268.
Graepel, T. (2002). Kernel matrix completion by semidefinite programming. In Proceedings of the 12th international conference on artificial neural networks (pp. 694–699). Springer.
Heiser, L. M., Sadanandam, A., et al. (2012). Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences, 109(8), 2724–2729.
Kumar, S., Mohri, M., & Talwalkar, A. (2009). On sampling-based approximate spectral decomposition. In Proceedings of the 26th annual international conference on machine learning (pp. 553–560). ACM.
Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexible models for heterogeneous multiview data. In Proceedings of the 29th AAAI conference on artificial intelligence (pp. 2757–2763).
Paisley, J., & Carin, L. (2010). A nonparametric Bayesian model for kernel matrix completion. In The 35th international conference on acoustics, speech, and signal processing (pp. 2090–2093). IEEE.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Shao, W., Shi, X., & Yu, P. S. (2013). Clustering on multiple incomplete datasets via collective kernel learning. In IEEE 13th international conference on data mining (ICDM) (pp. 1181–1186). IEEE.
Trivedi, A., Rai, P., Daumé III, H., & DuVall, S. L. (2005). Multiview clustering with incomplete views. In Proceedings of the NIPS workshop.
Tsuda, K., Akaho, S., & Asai, K. (2003). The EM algorithm for kernel matrix completion with auxiliary data. The Journal of Machine Learning Research, 4, 67–81.
Wang, P., Shen, C., & Van Den Hengel, A. (2013). A fast semidefinite approach to solving binary quadratic problems. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1312–1319).
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, EPFLCONF161322 (pp. 682–688).
Williams, D., & Carin, L. (2005). Analytical kernel matrix completion with incomplete multiview data. In Proceedings of the ICML workshop on learning with multiple views.
Xu, M., Jin, R., & Zhou, Z. H. (2013). Speedup matrix completion with side information: Application to multilabel learning. In Advances in neural information processing systems (pp. 2301–2309).