Abstract
We study the problem of finding a subspace representative of multiple datasets by minimizing the maximal dissimilarity between this subspace and all the subspaces generated by those datasets. After arguing for the choice of the dissimilarity function, we derive some properties of the corresponding formulation. We propose an adaptation of an algorithm used for a similar problem on Riemannian manifolds. Experiments on synthetic data show that the subspace recovered by our algorithm is closer to the true common subspace than the solution obtained using an SVD.
1 Introduction
We address the problem of extracting common information from multiple datasets. In recent years, data has become increasingly easy to generate and store for analysis to guide decision making, and it is not uncommon to have access to datasets representing similar but not exactly equivalent phenomena. A typical example can be found in bioinformatics, where datasets usually have a few tens to at most a few hundred samples for a few (tens of) thousands of features. However, there usually exist several datasets measuring the same disease on different sets of patients, corresponding to different studies and different experimental conditions that should be taken into account in further analysis. Considering all those similar datasets at once can be very useful to deal with the high number of features, since statistical inferences require a large number of samples to be robust and generalizable to other data.
Besides the basic possibility of simply concatenating all the datasets \(X_1,\ldots,X_m\) into a larger dataset \(X =[X_1\ \ldots \ X_m]\) and applying usual methods such as principal component analysis to X, more specific approaches exist to extract common components present in the datasets. A method to factorize two datasets with a common factor was proposed in [1] with a closed-form solution, and an extension to more than two datasets was proposed in [2]. However, such methods assume that the common dimension of the datasets is full-rank, which is not the case if we consider datasets with more variables than samples, such as gene expression datasets. The best-known method is probably canonical correlation analysis (CCA) [3], which aims to find a linear combination of the initial features for each of the two datasets maximizing the correlation between those two combinations. When dealing with two datasets only, an exact solution can be computed based on the covariance matrix. In order to find more than one pair of correlated combinations of features, deflation is usually used: the same procedure (CCA) is repeated on the data from which the previous components were removed. Another well-known method, partial least squares regression [4], aims to find linear combinations of features for the two datasets such that the covariance between those two new representations is maximal. As in CCA, a closed-form solution exists, and deflation can be used to compute the following components. Another variation is co-inertia analysis (CIA) and its extension multiple CIA [5], which maximizes a sum of weighted squared covariances between linear combinations of the datasets' features and a reference vector. Consensus principal component analysis is very similar to CIA, the main difference lying in the deflation process [6].
Different extensions of those methods to more than two datasets have been proposed, with various criteria to optimize (see for example [7, 8] and references therein): maximizing a sum over all pairs of datasets of covariances or correlations, possibly squared or in absolute value, and with different constraints. In such cases, a closed-form solution does not always exist.
A central question when using more than two datasets is the importance to give to those different (pairs of) datasets. Common approaches are to give all datasets the same importance or, as in [7], to consider whether a pair of datasets is connected and to give the corresponding term a weight of 1 or 0. If we are dealing with a set of datasets all very similar except one (for example, because it was measured using another technology), such choices can lead to components representing very well all the similar datasets but not representative at all of the last one. Here, we want to avoid this situation, and in order to take all \(X_i\) into account we propose to minimize the maximal dissimilarity d between the common component \(U \in \mathbb {R}^{p\times K}\) and all datasets \(X_i \in \mathbb {R}^{p \times n_i}\):
$$ \min_{U} \ \max_{i=1,\ldots,m} d(U,X_i). \qquad (1)$$
This formulation can be viewed as looking for the center of the smallest-radius sphere enclosing all \({X}_i\), and can be linked to the minimum enclosing ball, 1-center, or minimax optimization problems. However, since here U represents a subspace, we are really interested in the subspace generated by the columns of U. So we want to solve problem (1) such that \(d(U,X_i)=d(\mathcal {U},\mathcal {X}_i)\) is a dissimilarity measure between \(\mathcal {U}\) and \(\mathcal {X}_i\), the subspaces generated by the columns of U and \(X_i\).
The problem of finding the smallest enclosing ball of a finite point set \({\mathbb {X}=\{x_1,\ldots ,x_m\}}\) has already been thoroughly investigated in Euclidean space, and an efficient approximation algorithm was proposed in [9]. An adaptation of the algorithm of [9] to Riemannian geometry is proposed in [10] with a study of the convergence rate, and in [11] to compute Riemannian \(L_1\) and \(L_\infty \) centers of mass of structure tensor images in order to denoise those images.
In this paper we assume that each point \(\mathcal {X}_i\) represents a subspace of dimension \(n_i\) in \(\mathbb {R}^p\), that is, \(\mathcal {X}_i\) belongs to the Grassmannian manifold \(\mathcal {G}(n_i,p)\) and so \({\mathbb {X}=\{\mathcal {X}_1,\ldots ,\mathcal {X}_m\}}\) is included in the total Grassmannian \(\cup \mathcal {G}(n_i,p)\). The proposed approach to solve problem (1) is inspired by [10]. The main difference is that each data point \(\mathcal {X}_i\) belongs to a different Grassmannian \(\mathcal {G}(n_i,p)\), which prevents us from using the usual Grassmannian distance. Instead we use an adaptation based on principal angles, which allows us to measure the dissimilarity between any pair of subspaces of different dimensions, and to project \(\mathcal {G}(n_i,p)\) onto \(\mathcal {G}(K,p)\) in order to return to a common manifold.
The paper is organized as follows. We first discuss the choice of the dissimilarity measure and the resulting problem in Sect. 2, then details of the proposed approach are presented in Sect. 3. Section 4 describes the results obtained on synthetic data, and we conclude in Sect. 5.
2 Problem Formulation
Let \(X_i \in \mathbb {R}^{p\times n_i}\) be a matrix of p variables by \(n_i\) samples, for \(i=1,\ldots,m\). Our goal is to find a subspace \(\mathcal {U}\) of dimension K representative of all the subspaces \(\mathcal {X}_i\), where \(\mathcal {X}_i\) is the subspace generated by the columns of \(X_i\). In other words, we are looking for a \(U \in \mathbb {R}^{p\times K}\) minimizing \(d(U,X_i)\) for all i, where \(d(U,X)=d(\mathcal {U},\mathcal {X})\) is a dissimilarity measure between the span of U and the span of X.
2.1 Dissimilarity Measure
Different dissimilarities can be used to quantify d(U, X); we detail some of them below. For \(K=1\), a possible choice to evaluate whether a vector \(u \in \mathbb {R}^p\) is close to \(\mathcal {X}\) is the angle between u and its orthogonal projection on \(\mathcal {X}\). A vector u is close to the subspace \(\mathcal {X}\) if the (positive) angle between them is small. If we define \(\phi \) as the angle between u and \(\mathcal {X}\) (in \([0,\frac{\pi }{2}]\)), we have
$$ \cos^2 \phi = u^{\top\!} \check{X} \check{X}^{\top\!} u, $$
where \(||u||=1\) and \( \check{X} \) is an orthonormal basis of \(\mathcal {X}\). The term \(u^{\top \!} \check{X} \check{X} ^{\top \!} u\) can then be seen as a similarity measure evaluating how close u is to \(\mathcal {X}\), with a value of 1 when u is in the subspace \(\mathcal {X}\), and 0 when they are orthogonal. We can then define a dissimilarity:
$$ d(u,X) = 1 - u^{\top\!} \check{X} \check{X}^{\top\!} u, $$
with \(d(u,X)=0\) if and only if \(u \in \mathcal {X}\).
This can be extended to a more general \(U \in \mathbb {R}^{p\times K}\) with \(p \ge n \ge K\ge 1\) (with n the dimension of \(\mathcal {X}\)) by summing the dissimilarities obtained for each element of an orthonormal basis \( \check{U} \) of \(\mathcal {U}\):
$$ d_a(U,X) = K - \mathrm{Tr}\left( \check{U}^{\top\!} \check{X} \check{X}^{\top\!} \check{U} \right) = K - \sum_{k=1}^{K} \cos^2 \phi_k(U,X), $$
with \( \cos \phi _k(U,X)\) the singular values of \( \check{U} ^{\top \!} \check{X} \). Note that this quantity does not depend on the \( \check{U} \) or \( \check{X} \) chosen.
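As an illustration, \(d_a\) can be computed directly from the singular values of \(\check{U}^{\top}\check{X}\). A minimal sketch (the function names are ours, not from the paper):

```python
import numpy as np

def orth_basis(X):
    # Orthonormal basis of the column span of X via thin QR.
    Q, _ = np.linalg.qr(X)
    return Q

def d_a(U, X):
    # d_a(U, X) = K - sum_k cos^2(phi_k), where the cos(phi_k) are
    # the singular values of U_check^T X_check.
    Uc, Xc = orth_basis(U), orth_basis(X)
    s = np.linalg.svd(Uc.T @ Xc, compute_uv=False)
    return U.shape[1] - np.sum(s ** 2)
```

As expected, `d_a(U, X)` returns 0 when \(\mathrm{span}(U) \subset \mathrm{span}(X)\) and K when the two subspaces are orthogonal.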
Another possible dissimilarity is [12]:
$$ d_b(U,X) = \frac{1}{\sqrt{2}}\, \big\| \check{U}\check{U}^{\top\!} - \check{X}\check{X}^{\top\!} \big\|_F. $$
Similarly, we can consider the norm between \( \check{X} \) and its projection onto the common subspace \(\mathcal {U}\) (termed chordal metric in [13, Table 3]):
$$ d_c(U,X) = \big\| \check{X} - \check{U}\check{U}^{\top\!}\check{X} \big\|_F = \left( n_x - \sum_{k=1}^{K} \cos^2\phi_k(U,X) \right)^{1/2}. $$
Another possibility is to consider the principal angles \(\phi _k\) between both subspaces:
$$ d_d(U,X) = \left( \sum_{k=1}^{K} \phi_k^2(U,X) \right)^{1/2}. $$
See [13] for other possible dissimilarity measures.
Letting \(\sigma _k= \cos \phi _k(U,X)\) denote the kth singular value of \( \check{U} ^{\top \!} \check{X} \), we can compare the different dissimilarities in Table 1 (with \(n_u\) and \(n_x\) the dimensions of the subspaces \(\mathcal {U}\) and \(\mathcal {X}\)). When using those dissimilarities in \(\min _{U} \max _i d(U,X_i)\), \(d_b\) and \(d_c\) give more importance to datasets \(X_i\) with a higher \(n_i\). All dissimilarities except \(d_d\) can be directly expressed in terms of \( \check{U} ^{\top \!}XX^{\top \!} \check{U} \). As \(d_a\) and \(d_d\) satisfy \( \mathcal {U} \subsetneq \mathcal {X} \Rightarrow d(U,X)=0\), they are not distances. Note that if \(n_x=n_u\), we have \(\sqrt{d_a} = d_b =d_c\).
In the context of (1), it is natural to require that \(d(U,X)=0\) when \(\mathcal {U} \subset \mathcal {X}\) or \(\mathcal {X} \subset \mathcal {U}\). We opt for \(d_a\), since it yields a simpler objective function than \(d_d\). Hence, (1) becomes:
$$ \min_{U} \ \max_{i=1,\ldots,m} \left( K - \mathrm{Tr}\left( \check{U}^{\top\!} \check{X}_i \check{X}_i^{\top\!} \check{U} \right) \right). $$
Since K is fixed and \( \check{U} \) satisfies \( \check{U} ^{\top \!} \check{U} =I_K\), this is equivalent to
$$ \max_{U \in \mathbb{R}^{p\times K}} \ \min_{i=1,\ldots,m} \mathrm{Tr}\left( U^{\top\!} \check{X}_i \check{X}_i^{\top\!} U \right) \quad \text{s.t.} \quad U^{\top\!} U = I_K. \qquad (2)$$
Since \(\max _U \min _i f_i(U)\) is equivalent to \(\max _{U,\tau } \tau \) subject to \(\tau \le f_i(U)\) for all i, (2) is equivalent to:
$$ \max_{U \in \mathbb{R}^{p\times K},\ \tau}\ \tau \quad \text{s.t.} \quad \tau \le \mathrm{Tr}\left( U^{\top\!} \check{X}_i \check{X}_i^{\top\!} U \right) \ \ \forall i, \qquad (3a)$$
$$ u_j^{\top\!} u_j = 1 \ \ \forall j, \qquad (3b)$$
$$ u_j^{\top\!} u_k = 0 \ \ \forall j \ne k, \qquad (3c)$$
with \(u_i\) the ith column of U. Observe that (3) is an optimization problem with a linear objective function and quadratic (in)equality constraints.
2.2 KKT Conditions
We derive the first-order necessary optimality conditions for problem (3). Associating Lagrange multipliers \(\gamma _i\) with constraints (3a), \(M_{jj}\) with constraints (3b) and \(M_{jk}\) with constraints (3c), the KKT conditions (see, e.g., [14]) can be written as:
The \(M_{ij}\)’s correspond to the Lagrange multipliers associated with constraints \(u_i^{\top \!} u_j =0\) and the \(M_{ii}\)’s to \(u_i^{\top \!} u_i - 1=0\), so M is symmetric. Therefore there exist a diagonal matrix D and an orthogonal matrix Q such that \(M = QDQ^{\top \!}\). We then have \( \left( \sum _i \gamma _i \check{X} _i \check{X} _i^{\top \!} \right) UQ= UQ D \), which means that UQ is a matrix of eigenvectors of \(\sum _i \gamma _i \check{X} _i \check{X} _i^{\top \!}\). The \(\gamma _i\)’s can be interpreted as the importance given to the corresponding subspaces, and are positive only for those subspaces that attain the maximal dissimilarity in problem (3).
Let \(U_Y D_Y V_Y^{\top \!} \) be the singular value decomposition of
$$ Y = \left[ \sqrt{\gamma_1}\, \check{X}_1 \ \ldots \ \sqrt{\gamma_m}\, \check{X}_m \right]. $$
Observe that \(U_Y D_Y^2 U_Y^{\top \!} \) is then an eigendecomposition of \(YY^{\top \!}\). A candidate solution of problem (3) would then be, for fixed \(\gamma _i\) respecting condition (4f):
The last equality results from the combination of conditions (4a) and (4f):
To maximize \(\tau \), we should consider the first K singular values of Y. The difficulty is then to find \(\gamma _i\) such that condition (4f) is respected.
We can easily see that unless the optimal U belongs to all subspaces \(\mathcal {X}_i\), more than one \(\gamma _i\) is nonzero. To see this, observe that if \(\gamma _i=0\) for all \(i\ne j\), constraint (4b) would imply that U belongs to subspace \(\mathcal {X}_j\), which means that \(\mathrm{Tr}(U^{\top \!} \check{X} _j \check{X} _j^{\top \!} U) = K\) and \(\tau = K\) by condition (4f). Since for all i, k we have \(0 \le u_k^{\top \!} \check{X} _i \check{X} _i^{\top \!} u_k\le 1\) and \( \mathrm{Tr}\left( U^{\top \!} \check{X} _i \check{X} _i^{\top \!} U \right) \ge \tau =K\) by condition (4d), we have \(\mathrm{Tr}\left( U^{\top \!} \check{X} _i \check{X} _i^{\top \!} U \right) =K\) for all i, and U belongs to all the other \(\mathcal {X}_i\)’s. As a result, any candidate solution should have at least two \(X_i\)’s realizing the optimum.
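The candidate solution discussed above can be sketched numerically: for fixed multipliers \(\gamma_i\), stack the weighted bases into \(Y\) and keep the K leading left singular vectors. The explicit form of Y and the function name are our assumptions, consistent with the eigendecomposition relation stated above:

```python
import numpy as np

def candidate_U(X_checks, gammas, K):
    # For fixed multipliers gamma_i, build
    # Y = [sqrt(gamma_1) X_1, ..., sqrt(gamma_m) X_m] so that
    # Y Y^T = sum_i gamma_i X_i X_i^T, and keep the K leading
    # left singular vectors of Y as the candidate U.
    Y = np.hstack([np.sqrt(g) * Xc for g, Xc in zip(gammas, X_checks)])
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    return Uy[:, :K]
```

Finding multipliers that satisfy the remaining KKT conditions is the hard part; this sketch only covers the inner eigenvector computation.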
3 Proposed Approach
In [9], a fast and simple procedure is proposed to find an approximation of the minimum enclosing ball center of a finite point set in Euclidean space. The procedure is extended to arbitrary Riemannian manifolds in [10]:
- Initialize the candidate solution \(U^{(0)}\) with a point in the set.
- Iteratively update as \( U^{(t+1)} = {\text {Geodesic}}\left( U^{(t)},X_f^{(t)},\frac{1}{t+1}\right) \), where \(X_f^{(t)}\) is the farthest point from \(U^{(t)}\), and \({\text {Geodesic}}(p,q,t)\) represents the intermediate point m on the geodesic from p to q such that \({\text {dist}}(p,m)=t\,{\text {dist}}(p,q)\).
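The geodesic step above can be made concrete on the Grassmannian using principal vectors: align the two bases via the SVD of \(U_1^{\top}U_2\), then rotate each principal vector by a fraction t of its principal angle. This is a sketch of one standard construction, not necessarily the parametrization used in [10] or [15]:

```python
import numpy as np

def grassmann_geodesic(U1, U2, t):
    # Point at fraction t on the geodesic from span(U1) to span(U2);
    # U1, U2 are p x K orthonormal bases.
    V, c, Wt = np.linalg.svd(U1.T @ U2)
    c = np.clip(c, -1.0, 1.0)
    theta = np.arccos(c)          # principal angles
    P = U1 @ V                    # principal vectors in span(U1)
    Q = U2 @ Wt.T                 # matched principal vectors in span(U2)
    s = np.sin(theta)
    G = np.zeros_like(P)          # unit directions orthogonal to span(U1)
    nz = s > 1e-12                # guard columns with a zero angle
    G[:, nz] = (Q[:, nz] - P[:, nz] * c[nz]) / s[nz]
    return P * np.cos(t * theta) + G * np.sin(t * theta)
```

At t = 0 the result spans \(\mathcal{U}_1\), at t = 1 it spans \(\mathcal{U}_2\), and the returned basis stays orthonormal along the path.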
Since we are interested in finding the best subspace of dimension K in \(\mathbb {R}^p\), our solution U belongs to the Grassmann manifold \(\mathcal {G}(K,p)\). The main difference with [10] is that we are dealing with points representing subspaces of different dimensions \(n_i\), and therefore belonging to different manifolds \(\mathcal {G}(n_i,p)\). The first consequence is that the usual Grassmannian distance cannot be used to determine the farthest point \(X_f^{(t)}\). Since we want to preserve \(d(U,X_i)=0\) when \(\mathcal {U} \subset \mathcal {X}_i\), we use a dissimilarity that is not a metric unless the two subspaces belong to the same Grassmannian. The second consequence is that to update the current iterate \(U^{(t)}\) using a geodesic, \(X_f^{(t)}\) must first be projected onto \(\mathcal {G}(K,p)\). The next proposition shows how, given \(\mathcal {X}_f \in \mathcal {G}(n_f,p)\) and \(\mathcal {U} \in \mathcal {G}(K,p)\) with \(n_f \ge K\), we can compute \(\mathcal {Y}_f \in \mathcal {G}(K,p)\) included in \(\mathcal {X}_f\) that minimizes the distance to \(\mathcal {U}\). We can then update U using the corresponding geodesic.
Proposition 1
Let \(\mathcal {Y},\ \mathcal {U} \in \mathcal {G}(K,p) \) and \(\mathcal {X}\in \mathcal {G}(n,p)\) where \(n \ge K\), with \( \check{X} \) and \( \check{U} \) orthonormal bases of \(\mathcal {X}\) and \(\mathcal {U}\). Let \(A_1D_1B_1^{\top\!}\) be an SVD of \( \check{U} ^{\top\!} \check{X} \); then we have
$$ d_a(U,Y^\star) = \min_{\substack{\mathcal{Y} \in \mathcal{G}(K,p) \\ \mathcal{Y} \subset \mathcal{X}}} d_a(U,Y) = d_a(U,X), \quad \text{with } \check{Y}^\star = \check{X} B_1(:,1{:}K). $$
These equalities also hold for \(d_d\).
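Proposition 1 translates into a short computation: take the SVD of \(\check{U}^{\top}\check{X}\) and keep the K leading right singular vectors. A sketch (the function name is ours):

```python
import numpy as np

def project_to_GK(U_check, X_check, K):
    # Closest K-dim subspace of span(X_check) to span(U_check):
    # with A1 D1 B1^T an SVD of U_check^T X_check, take
    # Y_check = X_check @ B1[:, :K].
    B1t = np.linalg.svd(U_check.T @ X_check)[2]
    return X_check @ B1t.T[:, :K]
```

By construction the singular values of \(\check{U}^{\top}\check{Y}\) coincide with the K largest singular values of \(\check{U}^{\top}\check{X}\), so the dissimilarity to \(\mathcal{U}\) is preserved by the projection.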
An adaptation is proposed in Algorithm 1, integrating the results obtained from the KKT analysis. We initialize with a K-truncated SVD of \({Y =[ \check{X} _1\ \check{X} _2 \ldots \check{X} _m]}\), corresponding to the case where all the \(\gamma _i\)'s are equal (line 2), and stop when the two farthest subspaces have close dissimilarity values (line 18). As explained in Subsect. 2.2, this is a necessary, but not sufficient, condition for optimality. The farthest \(X_i\) from the current \(U^{(t)}\) is determined using the chosen dissimilarity based on the principal angles (lines 5 to 8). The associated orthonormal bases \(S_0\) and \(S_1\) of \(\mathcal {U}\) and \(\mathcal {X}_{i_{\max}}\) are computed (lines 9 to 11) to update \(U^{(t)}\) in the direction of \(X_{i_{\max}}\) with a step \(\frac{1}{t+1}\) along the Grassmannian geodesic [15] (lines 12 to 16).
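Putting the pieces together, the following is a hypothetical end-to-end sketch of the procedure just described (truncated-SVD initialization, farthest subspace via \(d_a\), projection onto \(\mathcal{G}(K,p)\), geodesic step \(\frac{1}{t+1}\)). It is our reconstruction, not the authors' exact Algorithm 1; the iteration cap and tolerance are illustrative:

```python
import numpy as np

def gmeb(X_checks, K, n_iter=200, tol=1e-6):
    # X_checks: list of p x n_i orthonormal bases; returns a p x K
    # orthonormal basis of the candidate common subspace.
    def d_a(Uc, Xc):
        s = np.linalg.svd(Uc.T @ Xc, compute_uv=False)
        return K - np.sum(s[:K] ** 2)

    # Initialization: K-truncated SVD of [X_1 ... X_m] (all gamma_i equal).
    U = np.linalg.svd(np.hstack(X_checks), full_matrices=False)[0][:, :K]

    for t in range(1, n_iter + 1):
        d = np.array([d_a(U, Xc) for Xc in X_checks])
        order = np.argsort(d)
        # Stop when the two farthest subspaces are (almost) equally far.
        if len(d) > 1 and d[order[-1]] - d[order[-2]] < tol:
            break
        Xf = X_checks[order[-1]]
        # Proposition 1: closest K-dim subspace of span(Xf) to span(U).
        B1t = np.linalg.svd(U.T @ Xf)[2]
        Yf = Xf @ B1t.T[:, :K]
        # Step 1/(t+1) along the geodesic from U toward Yf
        # (principal-vector parametrization).
        V, c, Wt = np.linalg.svd(U.T @ Yf)
        c = np.clip(c, -1.0, 1.0)
        theta = np.arccos(c)
        P, Q = U @ V, Yf @ Wt.T
        s = np.sin(theta)
        G = np.zeros_like(P)
        nz = s > 1e-12
        G[:, nz] = (Q[:, nz] - P[:, nz] * c[nz]) / s[nz]
        step = 1.0 / (t + 1)
        U = P * np.cos(step * theta) + G * np.sin(step * theta)
    return U
```

The iterate stays orthonormal throughout, since each update moves along a Grassmannian geodesic between orthonormal bases.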
4 Experiments
We generated synthetic data to represent a case where datasets are unevenly distributed in space and the minimax approach is justified. We first generated a common subspace \(U_c \in \mathbb {R}^{p\times K_c}\) with entries \(\sim N(0,1)\). We then perturbed it to generate two different noisy versions \(U_{j} = U_c + N(0,s_{j}\mu _{U_c})\), \(j \in \{1,2\}\), with \(\mu _{U_c}=\mathrm{mean}(|U_c|)\), from which we generated two groups of data. For each \(U_j\), \(j \in \left\{ 1,2\right\} \), we generated different datasets \(X_i\):
$$ X_i = U_j V_i^{\top\!} + A_i B_i^{\top\!}, $$
where \(B_i \in \mathbb {R}^{n_i\times K_i}\) has entries distributed \(\sim U_{[0,1]}\), and \(A_i \in \mathbb {R}^{p\times K_i}\) has entries \(\sim N(0,1)\). Each column of the matrices \(U_j\), \(A_i\) and \(B_i\) is normalized (using the \(L_2\) norm) to give the same importance to each component within the dataset. Each column \(V_i(:,j)\) of \(V_{i} \in \mathbb {R}^{n_i\times K_c}\) is distributed \(\sim U_{[0, \frac{3 w_{ij}}{p}]}\), where \(w_{ij}\) represents the importance of the common component j within dataset i. Finally, Gaussian noise is added to each dataset: \( X_i \leftarrow X_i + N(0,\sigma _i\mu _{X_i})\) with \(\mu _{X_i}=\mathrm{mean}(|X_i|)\).
We generated datasets in two groups: the first, based on \(U_1\), contains more datasets but with higher noise, while the second, based on \(U_2\), contains fewer, less noisy datasets. The first group contains 17 datasets with \(s_1 = 1\), while the second contains 3 datasets with \(s_2=0.1\). We took \(K_c=3\) common components and \(K_i=5\) additional components, \(p=1000\) features and \(n_i \sim U_{[20,\ 220]}\) samples for each dataset \(X_i\). The weights \(w_{ij}\) were randomly generated \(\sim U_{[0.05,\ 0.5]}\) to 'hide' the common components in the datasets. The final added noise has \(\sigma _i= 0.1\).
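The generation of one dataset can be sketched as follows; the function name and the exact magnitudes are illustrative assumptions consistent with the description above:

```python
import numpy as np

def make_dataset(U_j, n_i, K_i, w, rng):
    # One synthetic dataset X_i = U_j V_i^T + A_i B_i^T, plus noise.
    p, K_c = U_j.shape
    A = rng.standard_normal((p, K_i))
    A /= np.linalg.norm(A, axis=0)                 # normalize columns
    B = rng.uniform(0.0, 1.0, size=(n_i, K_i))
    B /= np.linalg.norm(B, axis=0)
    # V_i(:, j) ~ U[0, 3 w_ij / p] controls how strongly the common
    # component j appears in this dataset.
    V = rng.uniform(0.0, 3.0 * w / p, size=(n_i, K_c))
    X = U_j @ V.T + A @ B.T
    # Additive Gaussian noise scaled by the mean absolute entry.
    X += rng.normal(0.0, 0.1 * np.mean(np.abs(X)), size=X.shape)
    return X
```

Here `w` is the vector of weights \(w_{ij}\) for dataset i, and the noise level 0.1 plays the role of \(\sigma_i\).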
We compared our Grassmannian Minimum Enclosing Ball approach GMEB\(_{da}\) described in Algorithm 1 to a K-truncated SVD on \( {X =[X_1 \ldots X_m]}\) (SVD) and on \({ \check{X} = [ \check{X} _1 \ldots \check{X} _m]}\) (\(SVD_o\)). Working with \( \check{X} \) instead of X improves the recovery of components that are (weakly) present in all \(X_i\)'s. For each subspace obtained, we computed its maximal dissimilarity to the \( \check{X} _i\), but also to the ground truth \(U_c\) and the two noisy \(U_j\). Mean results on 100 randomly generated datasets are shown in Fig. 1, where we also give results when using dissimilarities \(d_b\), \(d_c\) or \(d_d\) in Algorithm 1.
When computing dissimilarities to the U's, we logically have \(\sqrt{d_a} = d_b =d_c\) since, in these cases, \(n_x\) and \(n_u\) of Table 1 are equal. Results obtained for \(d_b\) and \(d_c\) with \( \check{X} _i\) are similar for all methods, due to the influence of \(n_i\) in the dissimilarities. Since we have \(d_c(U,X_i) \in [\sqrt{n_i-K}, \sqrt{n_i}]\) and \(n_i > K\), the results are mainly influenced by \(\max _i n_i\). On the criterion minimized (\(d_a\) on \( \check{X} _i\)), the common subspace approach is the best one. As expected, \(SVD_o\) recovers the noisy components \(U_1\) very well, but the common subspace approach recovers \(U_2\) better. The original \(U_c\) is then better recovered by the subspace approach than by \(SVD_o\).
5 Conclusion
In this paper, we examined the problem of finding a subspace representative of multiple datasets by minimizing the maximal dissimilarity between this subspace and all the subspaces generated by those datasets. After arguing for a particular choice of dissimilarity measure, we derived some properties of the corresponding formulation. Based on those properties, we proposed an adaptation of an algorithm used for a similar problem on a Riemannian manifold. We then tested the proposed algorithm on synthetic data. Compared to SVD, the subspace recovered by our algorithm is closer to the true common subspace. Based on these promising results, the next step is to analyze properly the convergence of the proposed algorithm. Other approaches to solve the problem should also be investigated, for example based on the KKT conditions or on linearization.
References
Alter, O., Brown, P.O., Botstein, D.: Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc. Natl. Acad. Sci. 100(6), 3351–3356 (2003)
Ponnapalli, S.P., Saunders, M.A., Van Loan, C.F., Alter, O.: A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS ONE 6(12), e28072 (2011)
Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)
Wold, H.: Partial least squares. Encycl. Stat. Sci. 6, 581–591 (1985)
Meng, C., Kuster, B., Culhane, A.C., Gholami, A.M.: A multivariate approach to the integration of multi-omics datasets. BMC Bioinf. 15(1), 162 (2014)
Hanafi, M., Kohler, A., Qannari, E.M.: Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometr. Intell. Lab. Syst. 106(1), 37–40 (2011)
Tenenhaus, A., Tenenhaus, M.: Regularized generalized canonical correlation analysis. Psychometrika 76(2), 257–284 (2011)
Westerhuis, J.A., Kourti, T., MacGregor, J.F.: Analysis of multiblock and hierarchical PCA and PLS models. J. Chemometr. 12(5), 301–321 (1998)
Badoiu, M., Clarkson, K.L.: Smaller core-sets for balls. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 801–802 (2003)
Arnaudon, M., Nielsen, F.: On approximating the Riemannian 1-center. Comput. Geom. 46(1), 93–104 (2013)
Angulo, J.: Structure tensor image filtering using Riemannian \(L_1\) and \(L_\infty \) center-of-mass. Image Anal. Stereol. 33(2), 95–105 (2014)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)
Ye, K., Lim, L.H.: Schubert varieties and distances between subspaces of different dimensions. SIAM J. Matrix Anal. Appl. 37(3), 1176–1197 (2016)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Gallivan, K.A., Srivastava, A., Liu, X., Van Dooren, P.: Efficient algorithms for inferences on Grassmann manifolds. In: IEEE Workshop on Statistical Signal Processing, pp. 315–318 (2003)
Acknowledgments
Part of this work was performed while the second author was a visiting professor at Université catholique de Louvain.
© 2018 Springer International Publishing AG, part of Springer Nature
Renard, E., Gallivan, K.A., Absil, PA. (2018). A Grassmannian Minimum Enclosing Ball Approach for Common Subspace Extraction. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science(), vol 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_7