Scalable variable selection for two-view learning tasks with projection operators

In this paper we propose a novel variable selection method for two-view settings, or for vector-valued supervised learning problems. Our framework can handle extremely large selection tasks, with sample sizes in the millions. In a nutshell, our method performs variable selection by iteratively selecting variables that are highly correlated with the output variables but uncorrelated with the previously chosen variables. To measure correlation, our method uses the concept of projection operators and their algebra. With projection operators, the relationship (correlation) between sets of input and output variables can also be expressed by kernel functions, so nonlinear correlation models can be exploited as well. We experimentally validate our approach, showing on both synthetic and real data its scalability and the relevance of the selected features.

Keywords: Supervised variable selection, vector-valued learning, projection-valued measure, reproducing kernel Hilbert space


Introduction
Vector-valued, or more generally structured output learning tasks arising from various domains have attracted much research attention in recent years [Micchelli and Pontil, 2005, Deshwal et al., 2019, Brogat-Motte et al., 2022]. For both supervised and unsupervised learning approaches, multi-view data has been of interest [Hotelling, 1936, Xu et al., 2013, Minh et al., 2016a]. Despite many successful approaches for various multi-view and vector-valued learning settings, adding interpretability to these models has received less attention. While there are various feature selection and dimensionality reduction methods, either for scalar-valued learning tasks or unsupervised methods for data represented in a single view [Zebari et al., 2020, Li et al., 2017, Anette and Nokto, 2018, Bommert et al., 2020], there is a scarcity of methods suitable for data represented in two views, or arising from a vector-valued learning task. From the point of view of interpretability, feature selection methods are especially advantageous over dimensionality reduction, since the relevant features are obtained directly as a result, and not only in (linear) combinations.
Recently, some feature selection methods have been proposed for structured output learning tasks. [Brouard et al., 2022] proposed a kernel-based non-linear feature selection model relying on sparsity regularization. Another supervised feature selection approach based on kernel methods is introduced in [Song et al., 2012], this one relying instead on a forward- and backward-selection ideology. In addition, [Jordan et al., 2021] discusses feature selection in conjunction with kernel-based models, obtaining sparsity implicitly via the loss function without an explicit regularization term. An alternative, spline-based approach to non-linear feature selection is proposed by [Boyd et al., 2018]. These methods, relying on kernel evaluations between data samples for both inputs and outputs, tend not to scale well to large sample sizes.
In this paper, we introduce a novel variable selection approach for vector-valued, or two-view learning tasks, including CCA. Our method is based on efficient iterative computation of projections of input variables onto the vector space intersection between the space spanned by the output variables and the complement of the previously selected input variables. In this space, the input variables are then selected by a correlation-based criterion. Going one step further, we also exploit a kernel-based representation of the variables, allowing us to capture complex, nonlinear relationships. Here, we consider the kernelised representation of the variables instead of data samples: in essence, we model the covariance of the features in the Hilbert space induced by the kernel. Notably, both input and output features are captured with the same kernelisation. This is in stark contrast to other kernel-based feature selection approaches proposed in the literature, where separate kernels are used for data samples in input and output spaces [Brouard et al., 2022, Song et al., 2012, Jordan et al., 2021]. We can more readily draw comparisons to canonical correlation analysis (CCA) and its kernelized version, where the correlations are computed between two sets of variables instead of pairs of individual ones [Bie et al., 2005].
For many approaches, scalability in feature selection becomes a challenge when the data dimensionality is extremely large. Some supervised linear feature selection models adapted to this setting are proposed in [Fan et al., 2009, Aghazadeh et al., 2018, Valcárcel et al., 2022]. We note that all these methods address the supervised setting, but with scalar-valued output variables. While scalability w.r.t. the feature dimensionality is often considered, motivated by fat data, scalability to large sample sizes receives less focus. Traditionally, kernelized algorithms, while powerful, scale poorly due to their dependence on the kernel matrix, especially if its inverse is required. Contrary to the usual, by leveraging the recursive formulation of our algorithm and a trick with the singular value decomposition of the variable representation, our approach is extremely scalable to large sample sizes, which we also demonstrate experimentally in Section 5.
To summarize, our main contributions in this paper are as follows: • we propose the projective selection (ProjSe) algorithm, a novel approach to variable selection for vector-valued or two-view learning problems, based on projection operators. In ProjSe the result of the feature selection depends only on the subspace spanned by the outputs, not on their specific values (invariance).
• our proposed iterative method offers high scalability, even for the kernelised formulation capturing nonlinearities in the data, due to a singular-value-decomposition trick applied to the feature representation.
• we experimentally validate the proposed approach, showing both the relevance of the selected features and the efficiency of the algorithm.

Table 1: Frequently used notation. L_P is the subspace corresponding to the projection operator P. P_{L^⊥} is the orthogonal projection operator onto the orthogonal complement of the subspace L. P_X is the orthogonal projection operator onto the subspace of H spanned by the columns of matrix X; it is the same as P_{L_X}. P_{X^⊥} is the orthogonal projection operator onto the subspace of H orthogonal to the subspace spanned by the columns of matrix X; it is the same as P_{L_X^⊥}. A^+ denotes the Moore-Penrose inverse of matrix A.

[n] is shorthand notation for the set {1, . . ., n}. A • B denotes the pointwise (Hadamard) product of matrices A and B. A^{•n} is the pointwise power of matrix A. A[:, I] selects the subset of columns of matrix A with indices in the set I.
The paper is organised as follows. In the next section we give an overview of the method, before moving to a more rigorous treatment in Section 3. There we give a brief introduction to projection operators and their matrix representation, and discuss the key detail of our approach, expressing the projector onto an intersection. We then move on to describing our large-scale kernelized adaptation of the algorithm in Section 4. We validate our approach experimentally in Section 5 before concluding.

Method overview
Our algorithm is designed to perform variable selection when there are multiple dependent variables of interest. We denote the matrix containing the data from which the variables are selected as X ∈ R^{m×n_x}, and the reference data as Y ∈ R^{m×n_y}; the sample size is m, and the numbers of features/variables are n_x and n_y (see other frequently used notation in Table 1). Here X and Y could also correspond to vector-valued inputs and outputs of some supervised learning task. Our method is based on defining correlation via projection operators: we define the correlation between a variable vector x ∈ R^m (a column vector from X containing the values of a single input variable for all data points) and a set of variables in the columns of matrix Y as the norm of the projection, ||P_{L_Y} x||, where P_{L_Y} (or P_Y in shorthand) is the orthogonal projection operator onto the subspace L_Y spanned by the columns of Y. This definition is motivated by the concept of a projection-valued measure, which plays a significant role in quantum mechanics theory (see for example [Nielsen and Chuang, 2000]). Our approach selects variables from the input data X iteratively, such that the correlation between the selected variable and the outputs is high, while the correlation with the previously selected variables is low.

Remark 1. For the sake of simplicity, we assume that ||x|| = 1 for all x ∈ R^m.
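As a quick illustration, the projection-based correlation above can be sketched in a few lines of NumPy. This is a minimal sketch under a linear representation; the function and variable names are ours, not from the paper.

```python
import numpy as np

def projection_correlation(x, Y):
    """Correlation of a variable vector x with span(Y), defined as ||P_Y x||.

    P_Y is represented via an orthonormal basis U of the column span of Y,
    so that ||P_Y x|| = ||U^T x||; x is normalized to unit length (Remark 1)."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    U = U[:, s > 1e-12]              # drop directions with zero singular value
    x = x / np.linalg.norm(x)
    return np.linalg.norm(U.T @ x)

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))
x_in = Y @ rng.standard_normal(5)    # lies in span(Y)
print(round(projection_correlation(x_in, Y), 6))  # → 1.0
```

A vector inside L_Y has correlation 1; a generic random vector in R^100 projects onto the 5-dimensional L_Y with a much smaller norm, which is exactly what the selection criterion exploits.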
Our variable selection algorithm, ProjSe, is illustrated in Figure 1. The first variable is chosen simply to maximize the projection onto the subspace L_Y spanned by the columns of Y. This is illustrated with x_1, which is projected with P_Y as P_Y x_1. The second feature chosen, x_2 in the figure, is projected into the intersection of L_Y and the orthogonal complement of the previously chosen feature x_1. Our variable selection algorithm thus in a sense joins the variable spaces of the inputs and outputs: both of them are considered in the same space. At the same time, for our selection approach to work, L_X should not be fully orthogonal to L_Y. Additionally, due to the properties of the projection operators, our approach promotes invariance: the selected explanatory variables (input features) depend only on the subspace spanned by the response variables (output features), and are independent of any transformation of the response variables that spans the same subspace. These transformations can be singular or even nonlinear, as long as they are automorphisms of the output space.
In this basic form the algorithm is scalable to medium-scale data, as it is limited by the memory required to store the projection matrix. In the following sections we present techniques that allow scaling to very large datasets, e.g. m > 1,000,000 and m ≫ n_x, n_y. A recursive representation of the projection operators (see Section 3.2), and especially the singular-vector-based form (eq. (9)), significantly reduces the demand for resources, both in memory and in computation time.

Projection operators
This section first introduces the relevant background on projection operators and their algebra. Then, two key points for our algorithm are discussed: the matrix representation of the projectors, and how the projection onto an intersection can be expressed.

Projection operators, projectors
We now briefly introduce the mathematical framework describing the projection operators of a Hilbert space. The proofs of the statements mentioned, as well as further details, are presented for example by [Kreyszig, 1989].
Let T : H → H be a linear operator. Its adjoint T* : H → H is defined by ⟨y, T*x⟩ = ⟨Ty, x⟩ for all x, y ∈ H. A linear operator T is self-adjoint, or Hermitian, if T = T*, unitary if T* = T^{-1}, and normal if TT* = T*T. On the set of self-adjoint operators of H one can define a partial order: S ≤ T if ⟨x, Sx⟩ ≤ ⟨x, Tx⟩ for all x ∈ H. An operator T is positive if ⟨x, Tx⟩ ≥ 0 for all x ∈ H. Let L be a subspace of H; the orthogonal complement of L is given by L^⊥ = {x ∈ H : ⟨x, y⟩ = 0 for all y ∈ L}.

Theorem 2. For any closed subspace L ⊆ H, H = L ⊕ L^⊥.
A linear operator P is a projection operator if P : H → L for a subspace L of H. To highlight the connection between the subspace and the projection, they can also be denoted L_P and P_L.
An operator P is idempotent if PPx = Px for all x ∈ H, i.e. PP = P. The projection operators can be characterized by the following statements.

Theorem 3. A linear operator P : H → H is an orthogonal projection if and only if it is self-adjoint, P = P*, and idempotent, PP = P.

Proposition 4. The map connecting the set of closed subspaces of H and the set of the corresponding orthogonal projections is bijective.
As a consequence of the idempotent and self-adjoint properties, the range R(P) and the null space N(P) of P are orthogonal, namely for any x, y ∈ H,

⟨Px, y − Py⟩ = ⟨P²x, y − Py⟩ = ⟨Px, (P − P²)y⟩ = 0. (5)

The following theorems describe some algebraic properties of projection operators that we are going to exploit.

Theorem 5. (Product of projections) Let P_1 and P_2 be projections on H. P = P_1 P_2 is a projection if and only if P_1 P_2 = P_2 P_1; in that case P projects onto L_{P_1} ∩ L_{P_2}.

Theorem 6. (Sum of projections) Let P_1 and P_2 be projections on H. P = P_1 + P_2 is a projection if and only if L_{P_1} ⊥ L_{P_2}; in that case P projects onto L_{P_1} ⊕ L_{P_2}.
Theorem 7. (Partial order) Let P_1 and P_2 be projections on H, and N(P_1) and N(P_2) the corresponding null spaces. Then the following statements are equivalent: P_1 P_2 = P_2 P_1 = P_1; L_{P_1} ⊆ L_{P_2}; N(P_1) ⊇ N(P_2); ||P_1 x|| ≤ ||P_2 x|| for all x ∈ H; P_1 ≤ P_2.
Theorem 8. (Difference of projections) Let P_1 and P_2 be projections on H. P = P_2 − P_1 is a projection if and only if L_{P_1} ⊆ L_{P_2}. From the theorems above we can derive a simple corollary: if L is a subspace, then the projection onto its complement is equal to P_{L^⊥} = I − P_L.

Theorem 9. (Monotone increasing sequence) Let (P_n) be a monotone increasing sequence of projections defined on H. Then: 1. (P_n) is strongly operator convergent, and the limit P, P_n → P, is a projection. 2. P projects H onto the closure of the union of the subspaces L_{P_n}.

If S is a self-adjoint operator and P is the projection onto the closure of the range of S, then SP = PS; see [Conway, 1997] for further details.

Let I be the projection onto the entire space, and 0 the projection onto its complement, the zero subspace. Let 0 ≤ S ≤ I and T ≥ 0 be operators. If P is the projection onto the range of S + T, then P commutes with both S and T; see [Hayashi and Nagaoka, 2002].

Matrix representation of projectors
If a basis of the Hilbert space H is fixed, then every linear operator acting on H can be represented by a matrix. Let the subspace L of H be spanned by the vectors a_1, . . ., a_k of H, and let us construct a matrix A whose columns are equal to the vectors a_1, . . ., a_k. Here the linear independence of those vectors is not assumed. The corresponding subspace is denoted by L_A. The matrix representing the orthogonal projection operator onto the subspace L_A can be derived from a well-known minimization problem [Golub and Loan, 2013],

w* = arg min_w ||Aw − x||² = A^+ x,

where ^+ denotes the Moore-Penrose pseudo-inverse. Based on eq. (5), the vector Aw* is the orthogonal projection of x into L, so the orthogonal projection of x equals AA^+ x. Since this is true for any x ∈ H, the matrix representation of the orthogonal projection operator P_A is given by P_A = AA^+. This formula can be simplified by exploiting the properties of the Moore-Penrose pseudo-inverse, see for example [Ben-Israel and Greville, 2003], via the singular value decomposition U_A S_A V_A^T of the matrix A. Here we assume that A ∈ R^{m×n_A}, m > n_A, V_A is a square matrix, and U_A contains only those left singular vectors whose corresponding singular values are nonzero. We have

P_A = U_A U_A^T. (9)

This representation of the projection operator plays a central role in our variable selection algorithm. The following proposition ensures that the projection operator does not depend on its representation.

Proposition 10. Assume that two different matrices A and B span the same subspace L of dimension k. Then the two representations P_A = U_A U_A^T and P_B = U_B U_B^T yield the same projection operator.
Proof. Since the columns of U_B, as linear combinations of the columns of B, lie in L, we have P_A U_B = U_B, and thus P_A P_B = P_B. Because the right-hand side, P_B, is a projection, the left-hand side P_A P_B is also one. Following the same line we have P_B P_A = P_A as well. From Theorem 5 we know that if the product of projections is a projection, then the product is commutative, P_B P_A = P_A P_B. Finally we can conclude that P_A = P_B P_A = P_A P_B = P_B. We also exploited that if H is finite dimensional and the corresponding field is R, then the adjoint P* is represented by the transpose P^T of the matrix P.
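Proposition 10 is easy to verify numerically: two matrices with the same column span, even one with redundant columns, yield the same projector U U^T. A minimal NumPy sketch (our naming):

```python
import numpy as np

def projector(A, tol=1e-12):
    """P_A = U_A U_A^T, keeping only left singular vectors with nonzero singular values."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol]
    return U @ U.T

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
B = A @ rng.standard_normal((3, 7))   # 7 redundant columns, same 3-dim span
P_A, P_B = projector(A), projector(B)
print(np.allclose(P_A, P_B))          # → True (Proposition 10)
print(np.allclose(P_A @ P_A, P_A))    # → True (idempotent, Theorem 3)
```

Note that B is rank-deficient by construction, which is exactly why the pseudo-inverse (equivalently, dropping zero singular values) is needed in eq. (9).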

Projection onto the intersection of subspaces -General view
Our algorithm hinges on the orthogonal projector onto the intersection of a set of subspaces {L_1, L_2, . . ., L_{n_L}}. To introduce this concept, we mainly follow the line presented by [Ben-Israel, 2015]. We can start with a classical result of [von Neumann, 1950], who derived a solution in the case of two subspaces as a limit:

P_{L_1 ∩ L_2} = lim_{n→∞} (P_{L_1} P_{L_2})^n.

This result has been extended to arbitrary finite sets of subspaces by [Halperin, 1962]:

P_{L_1 ∩ ··· ∩ L_{n_L}} = lim_{n→∞} (P_{L_1} P_{L_2} ··· P_{L_{n_L}})^n.

[Anderson and Duffin, 1969] gave an explicit formula for the case of two subspaces, P_{L_1 ∩ L_2} = 2 P_{L_1} (P_{L_1} + P_{L_2})^+ P_{L_2}, and [Ben-Israel, 2015] provides an alternative way to compute P_{L_1 ∩ ··· ∩ L_{n_L}}. Here we rely on Lemma 4.1 and Corollary 4.2 of that work:

Proposition 11. For i = 1, . . ., n_L, let L_i be subspaces of H, P_i the corresponding projectors, and P_i^⊥ = I − P_i. Then, with Q = Σ_i P_i^⊥, we have P_{L_1 ∩ ··· ∩ L_{n_L}} = I − Q^+ Q, eliminating all the complements of the projectors.
By exploiting that for any projector P we have P^⊥ = I − P, the Q_t corresponding to P_{L_Y ∩ L_{X_t}^⊥} can be written as Q_t = (I − P_Y) + P_{X_t}. The critical point is the computation of the Moore-Penrose inverse of Q.
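The von Neumann limit above can be checked on a toy example: two planes in R^3 meeting in a line. This is a small NumPy sketch of the classical result, not part of the paper's algorithm (which avoids such iterations altogether):

```python
import numpy as np

def projector(A, tol=1e-12):
    # Orthogonal projector onto the column span of A, via the thin SVD
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol]
    return U @ U.T

# Two planes in R^3 whose intersection is the line spanned by e1
L1 = np.column_stack(([1., 0., 0.], [0., 1., 0.]))   # span{e1, e2}
L2 = np.column_stack(([1., 0., 0.], [0., 1., 1.]))   # span{e1, e2 + e3}
P1, P2 = projector(L1), projector(L2)

# (P1 P2)^n converges to the projector onto L1 ∩ L2 = span{e1}
P = np.linalg.matrix_power(P1 @ P2, 60)
target = np.zeros((3, 3)); target[0, 0] = 1.0
print(np.allclose(P, target))  # → True
```

The convergence rate is governed by the angle between the subspaces, which is one reason an explicit (non-iterative) formula such as Proposition 11 is preferable in practice.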

Expressing the projector into intersection
To implement the proposed variable selection algorithm (Figure 2), the projection onto the intersection of an arbitrary subspace L_P and the complement of an arbitrary vector x, P_{L ∩ x^⊥}, has to be computed. The projector P_{L^⊥} onto the complement of a subspace L can be expressed as I − P_L; hence the projector onto x^⊥ is given by I − xx^T/||x||². Since L is arbitrary, we write P instead of P_L for the sake of simplicity.
While we have these two projectors, their product is, according to Theorem 5, not a projection, since the factors do not commute: in the general case Pxx^T ≠ xx^T P. To overcome this problem we can recognize that the intersection L_P ∩ L_{x^⊥} can be expressed after a simple transformation.
Lemma 12. Let P be a projector and x be any vector; then the intersections L_P ∩ L_{x^⊥} and L_P ∩ L_{(Px)^⊥} are the same subspace of L_P.
Proof. Any vector u ∈ L_P satisfies ⟨u, x⟩ = ⟨Pu, x⟩ = ⟨u, Px⟩, so u is orthogonal to x if and only if it is orthogonal to Px. By projecting x into L first, and then computing the corresponding intersection, we can compute the projector onto L_P ∩ L_{x^⊥} in a simple way.
Proposition 13. Let ||x|| = 1 and α = ||Px||² = x^T PPx = x^T Px. If α > 0 then

P_{L_P ∩ L_{x^⊥}} = P − (1/α) Pxx^T P.

When α = 0, which means x is orthogonal to L_P, we have P_{L_P ∩ L_{x^⊥}} = P.
We can check that P − (1/α) Pxx^T P is a real projector. It is idempotent, since

(P − (1/α) Pxx^T P)² = P − (1/α) Pxx^T P − (1/α) Pxx^T P + (α/α²) Pxx^T P = P − (1/α) Pxx^T P.

This agrees with Theorem 5, which states that a commuting product of projections is a projection, idempotent and self-adjoint, mapping into the intersection of the corresponding subspaces. Furthermore, the orthogonality of x and the projection of any u ∈ H by P − (1/α) Pxx^T P can also be verified, as

⟨x, (P − (1/α) Pxx^T P) u⟩ = x^T Pu − (1/α)(x^T Px)(x^T Pu) = x^T Pu − x^T Pu = 0.

Definition 14. ([Tjur, 1984]) Two subspaces L_1 and L_2 are orthogonal if ⟨x_1, x_2⟩ = 0 holds for any vectors x_1 ∈ L_1 and x_2 ∈ L_2. Two subspaces L_1 and L_2 are geometrically orthogonal if the subspaces L_1 ∩ (L_1 ∩ L_2)^⊥ and L_2 ∩ (L_1 ∩ L_2)^⊥ are orthogonal.

Lemma 15. ([Tjur, 1984]) Two subspaces L_1 and L_2 are geometrically orthogonal if and only if P_{L_1} P_{L_2} = P_{L_2} P_{L_1}, in which case P_{L_1} P_{L_2} = P_{L_1 ∩ L_2}.

Proposition 16. The subspaces L_P and L_{(Px)^⊥} are geometrically orthogonal.
Proof. It is a simple corollary of Proposition 13.
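The rank-one downdate of Proposition 13 is the computational core of the selection loop, so it is worth sanity-checking numerically. A minimal NumPy sketch (variable names ours):

```python
import numpy as np

def projector(A, tol=1e-12):
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol]
    return U @ U.T

rng = np.random.default_rng(2)
P = projector(rng.standard_normal((30, 5)))   # projector onto a 5-dim L_P
x = rng.standard_normal(30)
x /= np.linalg.norm(x)                        # Remark 1: ||x|| = 1

alpha = x @ P @ x                             # alpha = ||Px||^2 = x^T P x
Px = P @ x
Q = P - np.outer(Px, Px) / alpha              # Proposition 13

print(np.allclose(Q @ Q, Q))                  # → True (idempotent)
print(np.allclose(Q @ x, 0))                  # → True (Q projects into x^⊥)
```

The second check reflects the derivation in the text: Qx = Px − Px (x^T Px)/α = 0, so the range of Q is orthogonal to x while remaining inside L_P.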

Selecting variables in RKHS
In order to take into account non-linear correlations in the data, we propose a kernelized adaptation of the problem. Kernel methods are a varied group of machine learning models, taking advantage of a symmetric and positive semidefinite kernel function comparing data samples (sets of features), k : X × X → R. The usage of a kernel function allows including non-linearity in the models implicitly via a feature map ϕ : X → F_k: a kernel evaluated on two samples corresponds to an inner product in this so-called feature space (more specifically, a reproducing kernel Hilbert space, RKHS): k(x, z) = ⟨ϕ(x), ϕ(z)⟩_{F_k}. For a more thorough introduction to traditional kernel methods, we refer the reader e.g. to [Hofmann et al., 2008].
We here propose to kernelize the variable representation. We consider φ : R^m → H, where R^m is the vector space containing all columns of Y ∈ R^{m×n_y} and X ∈ R^{m×n_x}, and H is an RKHS. In essence, this corresponds to defining a kernel on the variable vectors, κ : R^m × R^m → R; in fact, we assume that φ is only given implicitly via κ. In the mathematical sense, the resulting matrix can equally well be considered a kernel matrix, since the distinction between rows and columns is by convention only. Usually, however, the matrix built from inner products between the variables is referred to as the covariance operator. Covariance operators have also been extended to RKHSs, with various applications in machine learning tasks [Muandet et al., 2017, Minh et al., 2016b]. Contrary to our approach, there the feature map and kernel are defined on the data space X instead of the variable space R^m. We also mention Gaussian process regression [Rasmussen and Williams, 2005], where kernels are likewise used to model the covariance matrix, thus connecting the variables via inner products.

Expressing the Projection operator in RKHS
Based on Section 3.2, Equation (9), the projection P_Y is represented with the left singular vectors U_Y of φ(Y). This representation is also needed for the kernelized algorithm. However, calculating the singular value decomposition φ(Y) = U_Y S_Y V_Y^T directly might not be feasible if the dimensionality of the feature space is large. Assuming that H is finite dimensional with dimension d, we have φ(Y), U_Y ∈ R^{d×n_y}, and S_Y, V_Y ∈ R^{n_y×n_y}, with S_Y diagonal with the nonnegative singular values on its diagonal. Again, this decomposition cannot be computed directly; however, we can proceed along the following line of computation.
To express U_Y we can apply an approach similar to that exploited in the computation of kernel principal component analysis [Schölkopf et al., 1998]. Recall that the kernel matrix on the columns of φ(Y) is K_Y = φ(Y)^T φ(Y). From the singular value decomposition we can derive that K_Y = V_Y S_Y² V_Y^T. This kernel matrix has a reasonably small size, n_y × n_y, thus its eigenvalue decomposition can be computed, which yields V_Y and the squares of the singular values as the diagonal elements of S_Y². By combining these expressions we have, with the help of the Moore-Penrose generalized inverse, U_Y = φ(Y) V_Y S_Y^+. Our algorithm hinges on evaluating products between projectors and the variable vectors. We can now write the product of U_Y^T with an arbitrary vector φ(x) represented in H as

U_Y^T φ(x) = S_Y^+ V_Y^T φ(Y)^T φ(x) = S_Y^+ V_Y^T κ(Y, x).

Thus the product can be expressed with the help of the kernel on the variables, with complexity O(n_y²) if K_Y, V_Y and S_Y^{-1} are precomputed.
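The kernel trick above can be checked numerically with a linear kernel, where the explicit left singular vectors are also available for comparison. A NumPy sketch under that assumption (names ours):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((1000, 8))   # m = 1000 samples, n_y = 8 variables
x = rng.standard_normal(1000)

# Kernel matrix on the *variables* (columns of Y); linear kernel chosen
# here only so the result can be verified against an explicit SVD.
K_Y = Y.T @ Y                                  # (n_y, n_y)
evals, V = np.linalg.eigh(K_Y)
keep = evals > 1e-10
V, s_inv = V[:, keep], 1.0 / np.sqrt(evals[keep])

k_Yx = Y.T @ x                                 # kappa(y_i, x) for all i
coords = s_inv * (V.T @ k_Yx)                  # U_Y^T phi(x) = S^+ V^T kappa(Y, x)

# Check against the explicit left singular vectors (norms must agree;
# individual coordinates may differ by ordering and sign)
U, sv, _ = np.linalg.svd(Y, full_matrices=False)
print(np.allclose(np.linalg.norm(coords), np.linalg.norm(U.T @ x)))  # → True
```

The key point is that the m-dimensional (or d-dimensional, in feature space) vectors never need to be materialized: only the small n_y × n_y eigendecomposition and kernel evaluations κ(Y, x) are used.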

The recursive selection procedure
To calculate the projection operator efficiently in each iteration, we exploit the structure of P_{L_Y ∩ L_{X_t}^⊥} φ(x) introduced in Proposition 13. To this end, we define an intermediate operator, the projection onto the complement subspace of a vector q ∈ R^n:

Q(q) = I − qq^T/||q||².

Since Q(q) is a projection, we have Q(q) = Q(q)Q(q) and Q(q) = Q(q)^T. It can also be seen that multiplying Q(q) with a matrix A ∈ R^{n×n} has complexity only O(n²), since only a matrix-vector product and an outer product are needed. For a fixed t we are also going to use the recurrent matrix products U_t^T = Q(q_t) U_{t-1}^T, with U_0 = U_Y. Now we can write up the sequence of projections corresponding to Algorithm (2):

Proposition 17. The sequence of projections above correctly computes the projection operators of the algorithm in Figure 2.
Proof. We apply induction on t. In the case t = 1 we have, by Proposition 13, that the projector equals U_0 Q(q_1) U_0^T with q_1 = U_0^T φ(x_{1*}); in transforming ||U_0 U_0^T φ(x_{1*})||² into ||q_1||² we exploited that U_0 U_0^T is a projection, hence idempotent. Let t > 1 be arbitrary, and suppose the statement holds for t. Computing the projector for t + 1, Proposition 13 yields the same deflated form with q_{t+1}; in the norm we again applied that U_t U_t^T is idempotent.
We can express the main computational step, Step 2.b in Algorithm 2, by exploiting the kernelized recursive iteration.
From the sequential procedure we can see that a key step of the computation is the calculation of the vectors q_i via Equation (18), for an arbitrary φ(x) ∈ H. The inputs of the algorithm are the output data matrix Y ∈ R^{m×n_y}, the input data matrix X ∈ R^{m×n_x}, the number D ≤ n_y of variables to select, and a kernel κ on the variables; the output is the set I_D of indices of the selected variables from X, in selection order.
Figure 3: Efficient implementation of the kernelized realization of the supervised variable selection by projection algorithm, ProjSe.
Taking advantage of the recursive definition of U_t^T φ(x), we also have that q_{t+1} = U_t^T φ(x_{k(t+1)*}); thus all terms relate to quantities computed in the previous iteration. The computation of the norm ||q_{t+1}||² can also exploit the recursive nature of the algorithm. Finally, all the feature representations φ(x) and φ(Y) are implicit, and are expressed only via kernel evaluations, since they appear only in inner products.
Based on these statements and Proposition 17, a concrete implementation of our algorithm is presented in Figure 3. In the first step the kernels are computed, where K_{YX} requires O(m n_y n_x) and K_Y requires O(m n_y²) operations in the case of, for example, linear and Gaussian kernels. For the eigenvalue decomposition of K_Y we need O(n_y³) operations, where D ≤ min(n_y, n_x). In the algorithm, the critical step is Step 4.a. Its complexity in step t is O(n_y(n_x − t)); thus, for selecting D variables we need O(n_y n_x D) operations in general. Assuming that m ≫ n_y, n_x, the dominating part is the computation of the kernels, so the overall complexity is O(m n_y max(n_x, n_y)).
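The greedy loop can be condensed into a short sketch. The following is a simplified, linear-representation version of the selection procedure (not the full kernelized Figure 3 implementation); function and variable names are ours. It combines eq. (9), Proposition 13, and the deflation idea: candidate scores live entirely in the small coordinate space of L_Y.

```python
import numpy as np

def projse(X, Y, D, tol=1e-10):
    """Greedy projective selection, linear sketch.

    Iteratively picks the column of X with the largest projection onto
    L_Y intersected with the orthogonal complement of the columns chosen
    so far, using the rank-one deflation of Proposition 13."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    U = U[:, s > tol]                      # orthonormal basis of L_Y
    Xn = X / np.linalg.norm(X, axis=0)     # Remark 1: unit-norm variables
    R = U.T @ Xn                           # candidate coordinates in L_Y
    selected = []
    for _ in range(D):
        scores = np.linalg.norm(R, axis=0)
        if selected:
            scores[selected] = -1.0        # never re-select a variable
        j = int(np.argmax(scores))
        if scores[j] < tol:                # the rest is orthogonal to L_Y
            break
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)         # deflate the chosen direction
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
Y = X[:, [3, 7]] @ rng.standard_normal((2, 4))  # outputs driven by x_3, x_7
print(sorted(projse(X, Y, 2)))
```

On this synthetic example the two columns that generate Y are recovered. Each iteration costs O(n_y n_x), matching the complexity analysis above, and m enters only through the initial SVD (or, in the kernelized version, the kernel computation).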

Experiments
We first demonstrate our algorithm's scalability on synthetic data, before moving on to experiments with real data and analysing the stability of the feature selection.

Scalability demonstration with synthetic data
This test follows the generation scheme shown in Figure 4: the components of the input matrix X and of a transformation matrix W are independently sampled from a normal distribution; the output matrix is then constructed by the linear transformation, and finally random noise is added to the output.

Table 2: ProjSe clustering results (NMI and time) with selected features (10 or 300), compared to results reported in [Brouard et al., 2022], on Carcinoma (m=174, n_x=9182, C=11) and Glioma (m=50, n_x=4434, C=4). The running time of ProjSe is for choosing 300 features; the running time and variation of k-means is negligible.

We apply ProjSe to this data with various sample sizes. Figure 4 presents the dependence of the selection time on the sample size, where the maximum sample size is 10 million and the number of variables is 10; the variable selection is performed in less than four seconds.

Feature selection from biological data
In this set of experiments, we compare our approach to [Brouard et al., 2022], a kernel-based feature selection method where kernels are considered traditionally on data samples instead of features. We experiment with the two gene expression datasets, "Carcinoma" and "Glioma", considered there for unsupervised feature selection. While this kind of setting with one-view fat data is not the one our method was developed for, as the scalability we propose is first and foremost for large sample sizes, these experiments still serve as an illustrative comparison of feature selection performance.
As the data is available only in one view in the unsupervised setting, we apply our algorithm by using the data in both views: as the reference/label view and as the view the feature selection is performed on. Intuitively, this filters out the noise and redundant features in the view. In our method we consider the linear kernel, k(x, z) = x^T z, the polynomial kernel of degree 3, k(x, z) = (x^T z)^3, and the RBF kernel, k(x, z) = exp(−||x − z||²/(2σ²)), with the kernel parameter σ set as the mean of the pairwise distances. We assess the performance of the feature selection by measuring the normalised mutual information (NMI) of k-means clustering results. Here the clusterer is given the number of classes in the data as the number of clusters.
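For concreteness, the RBF kernel on feature vectors with the mean-pairwise-distance bandwidth can be sketched as follows (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def feature_rbf_kernel(X):
    """RBF kernel on the *feature vectors* (columns of X), with the
    bandwidth sigma set to the mean pairwise distance between features."""
    F = X.T                                        # one row per feature
    sq = np.sum(F**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * F @ F.T, 0.0)
    sigma = np.sqrt(d2[np.triu_indices_from(d2, k=1)]).mean()
    return np.exp(-d2 / (2 * sigma**2))

K = feature_rbf_kernel(np.random.default_rng(4).standard_normal((100, 12)))
print(K.shape, np.allclose(K, K.T))  # → (12, 12) True
```

Note that the kernel matrix has the size of the number of features (12 here), not the number of samples, which is what makes the method insensitive to large m.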
The results are displayed in Table 2, with comparison to selected methods from [Brouard et al., 2022]: the UKFS method proposed there, as well as a scoring-based method, "lapl" [He et al., 2005], that performed well on the Glioma dataset, and NDFS [Li et al., 2012], a clustering-based approach that performed well on the Carcinoma dataset. Our method is very competitive with these, sometimes achieving better performance. As in our method the kernel is calculated on features, the running time is slower than for UKFS, where the kernel is computed on samples. Notably, however, we are still competitive compared to NDFS.
Additionally, Figure 5 displays more detailed clustering results with respect to the number of variables chosen by ProjSe. These results also highlight the differences that can be obtained by applying different kernels on the features: with the Carcinoma dataset the non-linear kernels, RBF and polynomial of degree 3, are clearly superior, while with Glioma the linear kernel works best.

Feature selection from vector-valued output setting
We next consider a setting more aligned with our method: supervised feature selection with a vector-valued output as the reference view. Here we consider three datasets from the UEA & UCR Time Series Classification Repository: Crop, NonInvasiveFetalECGThorax1 ("Thorax"), and ShapesAll, as detailed in Table 3. These datasets are associated with multi-class classification tasks, and we use the one-hot encoding of the labels as the vector-valued output to perform the feature selection with ProjSe. As before, we consider linear, polynomial and RBF kernels. We assess the success of the feature selection task by performing SVM classification with the selected features; here we consider both linear and RBF kernels on the data samples.
The results are displayed in Figure 6, which shows both the centered kernel alignment to the linear kernel on the one-hot-encoded outputs and the accuracy of SVM classification. The different kernels used in feature selection give slightly different features; however, the performance on the subsequent classification task depends mostly on which kernel is used on the samples. Especially for the Thorax and ShapesAll datasets, with higher dimensionality, all the ProjSe results with a linear SVM outperform using the full set of features.
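The kernel alignment formula is only partially legible in the text; we assume the standard centered kernel alignment, which can be computed as follows (NumPy sketch, our naming):

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Centered kernel alignment: <K1c, K2c>_F / (||K1c||_F ||K2c||_F),
    where Kc = H K H with the centering matrix H = I - (1/n) 11^T."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 6))
K = X @ X.T
print(round(kernel_alignment(K, K), 6))   # → 1.0 (self-alignment)
```

Alignment close to 1 between the kernel on the selected features and the label kernel indicates that the selected subset preserves the class structure.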

Experiments with two-view data
In our last set of experiments, we consider the following datasets:

• MNIST handwritten digits [LeCun et al., 1998, LeCun, 1998]: this dataset contains 60,000 training and 10,000 test examples of handwritten digits in greyscale. Each image has 28×28 = 784 pixels, resulting in 784 variables in total. To construct the two sets of variables, the image columns are split in half, similarly to [Andrew et al., 2013]; thus both views comprise 392 variables.
• MediaMill dataset [Snoek et al., 2006].

We perform variable selection independently on both views of the datasets. After the variable selection is performed on both sides, we compute canonical correlations between all subset pairs of the extracted variables, starting from the first ones and incrementally growing them to the entire sets. To demonstrate the performance of the proposed variable selection algorithm ProjSe, it is compared to the following methods: large-scale sparse kernel canonical correlation analysis (GradKCCA) [Uurtio et al., 2019], deep canonical correlation analysis (DCCA) [Andrew et al., 2013], randomized non-linear CCA (RCCA) [Lopez-Paz et al., 2014], kernel non-linear orthogonal iterations (KNOI) [Wang and Livescu, 2015], and CCA through the Hilbert-Schmidt independence criterion (SCCA-HSIC) [Uurtio et al., 2018].
These CCA variants explicitly or implicitly rely on the singular value decomposition, and their performance depends highly on the distribution of the singular values of the data matrices. Since the data matrices have a small number of dominating singular values, we expect a variable selection method to be able to capture a relatively small set of variables that reproduces similar accuracy, measured in canonical correlation. We need to bear in mind that the CCA-based methods add up all information represented by the full collection of variables, whereas the projector-based selection relies only on a relatively small subset of those variables. (The algorithm was run once with 100 variables; the smaller amounts were subsampled.)
Figure 7 shows the performance of ProjSe on the MNIST and MediaMill datasets with both linear and Gaussian kernels, as a function of the number of selected variables. The results are measured by the canonical correlation between the subsets of variables selected from the two views. Comparing the results to the other CCA methods in Table 5 (taken from [Uurtio et al., 2019]), we observe that ProjSe obtains comparable performance after 20 to 50 selected variables, while being orders of magnitude faster than the other methods. Since ProjSe is fully deterministic, no variance is reported for it. The heaviest computation in ProjSe occurs at the beginning, when the eigenvalue decomposition of K_YY is calculated (see Table 4). Thus the running time varies only minimally with the number of selected variables. This is also demonstrated in Table 4, where the running times for the MNIST and Cifar100 datasets are detailed.
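This cost profile, one O(n³) eigendecomposition up front followed by cheap per-step operations that reuse its factors, can be illustrated with a small timing sketch. The problem size, the Gaussian output kernel, and the stand-in per-step operation below are toy assumptions, not the paper's code.

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n = 800
Y = rng.standard_normal((n, 10))
# Gaussian (RBF) kernel on the outputs.
sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
K_YY = np.exp(-sq / (2 * Y.shape[1]))

t0 = time.perf_counter()
evals, evecs = np.linalg.eigh(K_YY)      # one-off O(n^3) cost
t_eig = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):                     # stand-in for 100 selection steps:
    _ = evecs @ (evals * rng.standard_normal(n))  # cheap reuse of the factors
t_steps = time.perf_counter() - t0
print(f"eig: {t_eig:.3f}s, 100 steps: {t_steps:.3f}s")
```

Because the decomposition is paid once, selecting 10 or 100 variables changes the total time only marginally, consistent with the runtimes reported in Table 4.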
As a variable selection method, we are interested in evaluating the stability of the selection process. To measure this, we consider the stability index [Nogueira et al., 2018] and the average Pearson correlation of relevance [Hamer and Dupont, 2021]. In both measures, higher values indicate higher stability; the maximum value of both is 1.
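For concreteness, the stability index of Nogueira et al. [2018] admits a compact implementation. The sketch below follows the published formula on toy selection matrices; it is not the code used in our experiments.

```python
import numpy as np

def nogueira_stability(Z):
    """Stability index of Nogueira et al. (2018).
    Z: binary matrix (runs x features), Z[i, f] = 1 iff feature f was
    selected in run i. Maximum value is 1 (identical selections)."""
    M, d = Z.shape
    k_bar = Z.sum(axis=1).mean()      # average selected-subset size
    s2 = Z.var(axis=0, ddof=1)        # per-feature sample variance
    denom = (k_bar / d) * (1 - k_bar / d)
    return 1.0 - s2.mean() / denom

# Perfectly repeatable selection -> stability 1.
Z_stable = np.tile(np.array([1, 1, 0, 0, 0]), (10, 1))
print(nogueira_stability(Z_stable))   # 1.0
```

A selector that picks disjoint subsets across runs drives the per-feature variances up and the index toward (or below) 0.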
For each number of selected variables, subsamples are taken with the following percentages of the entire training sets: 10%, 20%, . . ., 50%. Random subsets of these sizes are then extracted 10 times. The scores are computed for each number of selected variables and for each subsample size, and finally averaged over all random samples. They are shown in Figure 8, where the averages over all pairs of subsets of variables and over all subsample sizes are presented.

Conclusion
In this paper we introduced a novel variable selection method for two-view settings. Our method is deterministic, selecting variables based on correlation defined with projection operators. The kernelised formulation of our approach paves the way for an efficient and highly scalable implementation, allowing the application of our method to datasets with millions of data samples. We empirically demonstrated this efficiency and the suitability of our approach for feature selection tasks, on both synthetic and real data.

Figure 2: The generic algorithm of supervised variable selection by projection.

Figure 4: The dependence of the variable selection time on the sample size is shown in seconds on the left, and the random data generation scheme applied is on the right.
Figure 5:

Figure 7: Variable selection results w.r.t. the number of selected variables.

Figure 8: Means of the two stability scores on random sub-samples, computed on the MNIST (left) and MediaMill (right) datasets with a linear kernel.

Table 1: Some of the frequently used notation in this paper.

H: a Hilbert space; unless otherwise noted, it has finite dimension d, in which case H = R^d.
X: a subspace of H, spanned by the columns of the matrix X.
L^⊥: a subspace of H, the orthogonal complement of L.
P_L: an orthogonal projection operator into the subspace L, P_L : H → L.
P_{L_Y ∩ L_x^⊥}: the projection operator with which the correlation is measured at each selection step.

1. Input: (a) a set of output variables {y_1, . . ., y_{n_y}} of R^m, in Y ∈ R^{m×n_y}; (b) a set of input variables {x_1, . . ., x_{n_x}} of R^m, collected to X ∈ R^{m×n_x}; (c) D ≤ n_y: the number of variables to be chosen from X.
2. Output: the set I_D of indices of the selected variables from X, in the selection order.
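The projector algebra summarised in the notation above can be made concrete with a small numpy sketch: an orthogonal projector onto the span of the selected variables is idempotent and symmetric, and the norm of the projected output measures how much of it lies in that span. The helper `projector` and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def projector(X):
    """Orthogonal projection onto the column span of X, via QR."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

rng = np.random.default_rng(2)
m = 200
X = rng.standard_normal((m, 3))             # already selected variables
y = X[:, 0] + 0.1 * rng.standard_normal(m)  # an output-like direction

P = projector(X)
# Projector algebra: P is idempotent and symmetric.
assert np.allclose(P @ P, P) and np.allclose(P, P.T)

# Share of y inside the selected subspace vs. its orthogonal complement.
in_span = np.linalg.norm(P @ y) / np.linalg.norm(y)
in_comp = np.linalg.norm((np.eye(m) - P) @ y) / np.linalg.norm(y)
print(round(in_span, 3), round(in_comp, 3))
```

By the Pythagorean theorem the two shares satisfy in_span² + in_comp² = 1, which is what makes the projector a convenient correlation measure.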

Table 3: The time series classification datasets.

Results with the time series datasets. The top row reports the kernel alignment of the input kernel with the chosen variables (either RBF or linear) to the ideal output kernel. The bottom row reports the accuracy on the test set, again with both linear and RBF kernels for the SVM; comparison is shown to randomly selected features and to the full feature set. The colors differentiate which kernel is used on the features in ProjSe, while the line style indicates whether a traditional linear or RBF kernel is used on the samples.

Table 4: The detailed computation times in seconds for the variable selection method, where 10, 20, 50 and 100 variables are extracted.
• MediaMill dataset: This dataset contains 43907 examples which are extracted from keyframes of video shots. There are two views in this data: text annotations (101 variables) and visual features (120 variables).
• Cifar100 dataset [Krizhevsky, 2009]: This dataset, chosen to demonstrate the scalability of ProjSe, contains 50000 training and 10000 test examples of color images. Each image has 32 × 32 = 1024 pixels, with 3 colors assigned to each pixel. The examples belong to 100 classes, where each class contains 500 training and 100 test examples. The classes are represented by indicator vectors.
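The indicator-vector encoding of the class labels can be formed with a one-hot construction; a minimal sketch with toy labels (the label array and class count are illustrative, not the Cifar100 data):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # toy class labels
n_classes = 3
# Each class is represented by an indicator (one-hot) vector.
Y = np.eye(n_classes)[labels]
print(Y.shape)   # (4, 3)
```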

Table 5: CCA comparison on the MNIST and MediaMill datasets.