Recipes for sparse LDA of horizontal data

Many important modern applications require analyzing data with more variables than observations, called for short horizontal. In such situation the classical Fisher’s linear discriminant analysis (LDA) does not possess solution because the within-group scatter matrix is singular. Moreover, the number of the variables is usually huge and the classical type of solutions (discriminant functions) are difficult to interpret as they involve all available variables. Nowadays, the aim is to develop fast and reliable algorithms for sparse LDA of horizontal data. The resulting discriminant functions depend on very few original variables, which facilitates their interpretation. The main theoretical and numerical challenge is how to cope with the singularity of the within-group scatter matrix. This work aims at classifying the existing approaches according to the way they tackle this singularity issue, and suggest new ones.


Introduction
Discriminant analysis (DA) is a descriptive multivariate technique for analyzing grouped data, i.e. data where the observations are divided into a number of groups that usually represent samples from different populations [14].Recently DA has also been viewed as a promising dimensionality reduction technique.Indeed, the presence of group structure in the data additionally facilitates dimensionality reduction.The best known variety of DA is linear discriminant analysis (LDA), whose central goal is to describe the differences between the groups in terms of canonical variates which are linear combinations of the original variables [14].LDA requires solving a generalized eigenvalue problem [19, §8.7].
The interpretation of the canonical variates is based on the coefficients of the original variables in the linear combinations.The interpretation can be clear and obvious if the coefficients in the loadings vectors take one of a small number of values which includes exact zero.Unfortunately, in many applications this is not the case.The interpretation problem is exacerbated by the fact that there are three types of coefficient, raw, standardized and structure, which can be used to describe the canonical variates [37,44], where the disadvantages for their interpretation are also discussed.A modification of LDA, aiming for better discrimination and possibly interpretation, is considered in [11,27].In this approach the vectors of coefficients in the canonical variates are constrained to be orthogonal.
These difficulties are similar to those encountered when interpreting principal component analysis (PCA) [25].Last decade this problem was approached by developing PCA procedures that produce sparse component loadings, i.e. containing many zeros.Such techniques are commonly known as sparse PCA [42], and can be adapted for use in LDA.This was realized first by Trendafilov and Jolliffe [44] who obtained sparse discriminant functions.The non-zero entries correspond to the variables that dominate the discrimination.This method cannot be applied directly to horizontal data, but triggered active research in this direction, which we try to review here.
Horizontal data occur when the number of variables ( p) is larger than the sample size (n).Such datasets are nowadays common in many applications.There are two main problems in using classical LDA on horizontal data.First, the within group covariance matrix W is singular or nearly singular and, hence, it cannot be inverted.This is because of the presence of many variables which are not useful for discrimination.Second, computations are very difficult if not impossible, hence deterring the applicability of classical LDA on horizontal data.
The paper is organized as follows.The classic LDA is briefly revised in Sect. 2. Section 3 briefly summarizes the idea for sparse solutions and the approaches to achieve sparseness.Section 4 is central and is divided into three parts.Section 4.1 reviews several approaches to do LDA of horizontal data by replacing the singular within-group scatter matrix W by its main diagonal.Another alternative to avoid the singular W is presented in Sect.4.2, where sparse LDA is based on minimization of classification error.Section 4.3 lists techniques equivalent to LDA which do not need inverse W or T, as optimal scaling and common principal components (CPC).Section 4.4 briefly reminds the application of multidimensional scaling is discrimination problems.Finally, sparse pattern with each original variable contributing to only one discriminant functions is discussed in Sect.4.5.The last Sect.5 briefly reports the performance three methods for sparse LDA on several data sets.

Basic notations, definitions and assumptions of classical LDA
Consider the following linear combinations XA also called discriminant scores.This is a linear transformation of the original data X into another vector space.There is interest in finding a ( p × s) transformation matrix A of the original data X such that the a priori groups are better separated in the dimensions of the transformed data XA than with respect to any of the original variables.The number of transformed dimensions s is typically much smaller than the original p.Thus the transformation also achieves dimension reduction.Fisher's LDA works by finding a transformation A which produces the "best" discrimination of the groups by simultaneous maximization of the between-groups variance and minimization of the within-groups variance of XA.Formally this is organized by maximizing: where and G is the g × n group indicator matrix, i.e.G has 1/n j at its (i, j) position if the ith observation (row of X) belongs to the jth group, and 0 otherwise.Then, the matrix of the group means X = GX.
In other words LDA depends on the between-groups sums-of-squares matrix, B, and the within-groups sums-of-squares matrix, W, of the original data.B and W are also called between-and within-groups scatter matrices respectively.
The procedure of finding A from B and W is sequential.Suppose a is the first column of A. One can show [14,27, §11.1] that a should be chosen so that (1) is maximized.The maximisation of objective function ( 1) is equivalent to the following generalized eigenvalue problem: Thus the maximum of objective function ( 1) is the largest eigenvalue of W −1 B and is achieved at the corresponding eigenvector a. Successive columns of A are also eigenvectors of W −1 B and the corresponding values of objective function (1) are the corresponding eigenvalues.
The rank of W −1 B is r ≤ min( p, g − 1), i.e. all the eigenvalues after the first r are 0s.The number r is called the dimension of the discriminant function representation.The number of useful dimensions for discriminating between groups, s, is smaller than r , and the transformation A is formed by the eigenvectors corresponding to the s largest eigenvalues ordered in decreasing order.Clearly the ( p × s) transformation A determined by Fisher's LDA maximizes the discrimination among the groups and represents the transformed data in a lower s-dimensional space.
Note, that the Fisher's LDA problem ( 1) and ( 4) can be solved without forming B and W explicitly.For vertical data, the maximum of (1) can be found by the generalized SVD of [19, §8.7.3].For large data this method is rather expensive as it requires O((n + g) 2 p) operations.The method proposed by [12] seems a better alternative if one needs to avoid the calculation of large B and W. Further related results can be found in [38].
The eigenvalue problems (4) can be rewritten in matrix terms as BA = WAD, where D is the (s × s) diagonal matrix of the s largest eigenvalues of W −1 B ordered in decreasing order.This is not a symmetric eigenvalue problem and in general the columns of A are not orthogonal.However the matrix A WA is diagonal i.e. the solution is orthogonal in the W-space.
Note also that an important assumption for a valid LDA is that the population withingroup covariance matrices are equal.This can be checked by using the likelihood-ratio test [27, p. 370] to compare each within-group covariance matrix to the common one.If the null hypothesis is rejected in some groups than the results from LDA are considered unreliable.
The common principal components (CPC) model has been introduced by Krzanowski [26] and Flury [15] to study discrimination problems with unequal group covariance matrices.

Interpretation and sparseness
It turns out that in the modern applications the typical data format is with more variables than observations.Such data are also commonly referred to as the small-sample or horizontal data.In other words horizontal data occur when the number of variables ( p) is larger than the sample size (n).The following two data are examples of horizontal data.
1. Ovarian cancer data [9] are collected from women who have a high risk of ovarian cancer due to family or personal history of cancer.The objective is to distinguish ovarian cancer from non-cancer observations (women).The data contains 216 samples, 121 cancer samples and 95 normal samples.The number of variables is as many as 373,401.But only 4000 variables are considered in this study.2. Rice data [29,36] have 100 variables and 62 observations.They have four groups (varieties) of rice with 7, 19, 9 and 27 observations in them.
The main problem with such data is that the within-groups scatter matrix is singular and the Fisher's LDA (1) is not defined.Moreover, the number of variables is usually huge (e.g.tens of thousands), and thus, it make sense to look for methods that produce sparse discriminant functions, i.e. involving only few of the original variables.
Broadly speaking a vector/matrix is called sparse when it has very few non-zero entries.The number of the non-zero entries is called cardinality of the vector/matrix.There are two main ways to impose sparseness on a vector/matrix solution: by specifying certain cardinality constrain on the solution, or by finding the solution subject to sparseness inducing penalties.The most popular sparseness inducing penalty is the Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Tibshirani [39] for multiple regression problems.For a unit length vector a ( a 2 = 1), the LASSO has the form a 1 = i |a i | < τ, where τ is called tuning parameter.By reducing τ , one forces the smaller entries of a to become exact zeros.Apparently, the sparsest a has only one non-zero entry equal to 1.
It is also possible to obtain sparse solution by prescribing in advance certain pattern of sparseness [40,47].For example, one can be interested to find a sparse matrix A having a single non-zero entry in each raw, as considered in Sect.4.5.
Another possible option is the employ the vector/matrix majorization [31], which intuitively is expressed by the following example for unit length vectors from ≺ (0, 0, 1) , i.e. the "smallest" vector has equal entries.One can use some procedure for generation of majorization [31, p. 128] in order to achieve sparseness.A benefit of such an approach is that sparseness can be achieved without tuning parameters.For example, the procedure to obtain sparse patterns by Trendafilov [41] is equivalent to what is known now as soft-thresholding.However, the threshold is found easily by the majorization construction, rather than by tuning different values.Such a pattern construction can be further related to the fit, the classification error, and/or other desired features of the solution.

Sparse LDA with diagonal W
The straightforward idea to replace the non-existing inverse of W by some kind of generalized inverse has many drawbacks, and thus is not satisfactory.For this reason, Witten and Tibshirani [49] adopted the idea proposed by Bickel and Levina [2] to circumvent this difficulty by replacing W with a diagonal matrix W d containing its diagonal, i.e.W d := I p W. Note, that Dhillon et al. [10] were even more extreme and proposed doing LDA of high-dimensional data by simply taking W = I p , i.e.PCA of B. Such LDA version was adopted already by Trendafilov and Vines [45] to obtain sparse discriminant functions when W is singular.

Sparse LDA as a two-stage sparse PCA
Probably the simplest strategy can be based on the LDA approach proposed by Campbell and Reyment [7], where LDA is performed in two stages each consisting of eigenvalue decomposition (EVD) of a specific matrix.This approach was already applied by Krzanowski et al. [30] with quite reasonable success to LDA problems with singular W. When W d is adopted, the original two-stage procedure simplifies like this.At the first stage, the original data are transformed as . Then, at the second stage, the between-groups scatter matrix B Y of the transformed data Y is formed: and then, some kind of sparse PCA on B Y is applied.Let the resulting sparse components be collected in a p × min{ p, g − 1} matrix C.Then, the sparse canonical variates are given by The sparseness achieved by C is inherited in A because W d is diagonal.Note that the calculation of B Y in ( 5) is not really needed.Following (2), the sparse PCA can be performed directly on (GG ) −1/2 GY.
Krzanowski [28] proposed a generalization of this two-stage procedure for the case of unequal within-group scatter matrices.He adopted the CPC model for each of the withingroup scatter matrices in each group.For horizontal data, this generalized procedure results in a slightly different way of calculating B Y in (5) , where X i is the data sub-matrix containing the observations of the ith group and W i,d = I p W i , where W i is the within-group scatter matrix of the ith group.

Function constrained LDA (FC-LDA)
By adopting the simplification where , and d is found as a solution of the standard Fisher's LDA problem (1) with Note that b are in fact the so-called raw coefficients [27, p. 298].As W d is diagonal, a and b have the same sparseness.Then, the modified Fisher's LDA problem (6) to produce sparse raw coefficients is defined as: Thus, the problem ( 7) is in fact a function-constraint PCA problem [42].For small data, as those considered in the following examples one can readily apply the dynamical system approach [43].For this reason, one can employ some kind of smoothing of the 1 vector norm, e.g.: with some large γ > 0. Other smoothing options are considered elsewhere [22].Let f denote the objective function from (7), i.e.
Then, the solution of ( 7) can be found as an initial value problem (IVP) for: where ∇ f denotes the gradient of f with respect to the standard (Frobenius) matrix inner product and The current ordinary differential equations(ODE) solvers [32] are not suitable for solving large optimization problems.They track the whole trajectory defined by the ODE which is time-consuming and undesirable when the asymptotic state is of interest only [35].Instead, one can employ numerical methods for optimization on matrix manifolds, and in particular on the Stiefel manifold [1], and employ some existing software [3,48].
Replacing (7) by increases considerably the speed but usually increases the classification error.
In experiments with simulated and real data, solving (11) outperforms [49] in any case (probably to blame the MM optimization method), and is comparable to [8].
Example 1 The data in the following examples are centered, and normalized to variables with unit length.Iris data [14] have four variables and three groups with 50 observations each.First we solve the original Fisher's LDA (1).The effective number of discriminant functions for this problem is min(4, 3 − 1) = 2.The first two eigenvalues are 32.1919 and .2854(32.4773 in total), and the raw coefficients are depicted in the first two columns of Table 1.The projection of the data onto the space spanned by the first two discriminant functions is given in the (1,1) panel of Fig. 1.It is well known that there are three misclassified points (52, 103 and 104) for this solution, i.e. 2 % misclassification.Then, we solve the original Fisher's LDA with W = W d .The first two eigenvalues are 31.0969and .3125(31.4094 in total), and the raw coefficients are depicted in the second two columns of Table 1.There are six misclassified points (9, 31, 50, 52, 103 and 119) for this solution, i.e. 4 % misclassification.The discriminant plot of the data is given in the (1,2) panel of Fig. 1.Next, we solve (7) with τ = 1.2.The minimum of the objective function in ( 7) is 1.0680.The first two eigenvalues 31.0969 and .3125are approximated by 30.7763 and .4407respectively.The sparse raw coefficients are depicted in the third two columns of Table 1.There are five misclassified points (9,31,50,52,103) for this solution, i.e. 3.3 % misclassification.The discriminant plot of the data is given in the (2,1) panel of Fig. 1.Finally, we solve (7) with τ = .5.The Example 2 Rice data [29,36] have 100 variables (wavelengths) and four groups of rice with 7, 19, 9 and 27 observations in them.The effective number of discriminant functions for this problem is min(100, 4 − 1) = 3.The first three eigenvalues are 25.3009, 1.6737 and .0077,which indicates that the discrimination power of the second and the third discriminant functions are not high.There are 37 misclassified points for this solution, i.e. 59.68 % misclassification.This solution is worse than the results obtained by [29] and employing PCA as a preprocessing (reduction the number of variables).The projection of the data onto the space spanned by the first two discriminant functions is given in the (1,1) panel of Fig. 2. The panel (1,2) contains the raw coefficients of these discriminant functions.Next, we solve (7) with τ = .5.The minimum of the objective function in ( 7) is 1.1896.The first three eigenvalues are approximated by 23.6843, 0.0874 and 0.0803, respectively.The discriminant plot of the data is given in the (2,1) panel of Fig. 2.There are 40 misclassified points for this solution, i.e. 64.52 % misclassification.The panel (2,2) contains the raw coefficients of these discriminant functions, and the first ones are not sparse at all.Finally, we solve (7) with τ = .01.The minimum of the objective function in ( 7) is 1.0000.The first three eigenvalues are approximated by .4260,.1437and .2418,respectively.The discriminant plot of the data is given in the (3,1) panel of Fig. 2.There are again 37 misclassified points for this solution, i.e. 59.68 % misclassification.The panel (3,2) contains the sparse raw coefficients of these discriminant functions.It is really surprising to achieve such discrimination by two variables only!They are probably too sparse and one can look for a better τ .

Sparse LDA based on minimization of the classification error
Fan et al. [13] argued that ignoring the covariances (the off-diagonal entries in W) as suggested by Bickel and Levina [2] may not be a good idea.In order to avoid redefining Fisher's LDA for singular W, Fan et al. [13] proposed working with (minimizing) the classification error instead of the Fisher's LDA ratio (1).The method is called for short ROAD (from Regularized Optimal Affine Discriminant) and is developed for two groups.Let , or minimize a Ta subject to d a = 1.Then, the ROAD problem is to find such a minimizer a, which is moreover sparse.Thus, the ROAD minimizer a is sought subject to a LASSO-type constraint introduced as a penalty term, i.e.: min Further on, Fan et al. [13] replace the affine constrained problem (12) by a quadratic penalty term, which results in the following unconstraint problem: Let us forget for a while for the sparseness of a, and solve (13) for the Iris data with τ = 0.Then, the first group Iris setosa is perfectly separated by the cloud composed by the rest two groups of Iris versicolor and Iris virginica.The difference between the means of these two groups is d = (−.1243,.1045,−.1598, −.1537).The discriminant is a = (.6623,1.4585, −5.1101, −.7362).One can check that a d = 1.However, this solution of ( 13) is not convenient for interpretation because one cannot assess the relative sizes of the elements of a. Another related problem is that the LASSO constraint may not work well with vectors a with arbitrary length.
Thus, it seems reasonable to consider a constrained version of problem ( 13) subject to a a = 1.Solution of the following related "dense" problem (with τ = 0) is available by Gander et al. [17].One can consider "sparsifying" their solution to produce unit length ROAD discriminants.Other works in this direction exploit the fact that the classified error depends on T −1 and d only through their product T −1 d [5,23].As the ROAD approach, they are also designed for discrimination into two groups.This is helpful for obtaining asymptotic results, however not quite helpful for complicated applications involving several groups.
Finally, the function-constraint reformulation of ROAD is: where d is a standard eigenvalue of T.

Indirect methods for discriminant analysis
The main purpose of this class of approaches is to avoid the explicit use of the singular matrices T −1 and/or W −1 .

Equivalent definitions of discriminant analysis
Clemmensen et al. [8] make use of the LDA re-formulation as optimal scoring, discussed in detail by Hastie et al. [24].The optimal scoring problem does not require W −1 , and thus, is

Discriminant analysis with CPC
Common principal components (CPC) are developed by Flury [15] and can be used to discriminate several groups of observations with different covariance matrices in each group.Zou [50] already considered briefly such an option.In a simulated study, Flury et al. [16] demonstrated that even a simpler CPC model with proportional covariance matrices [15,Ch 5] can provide quite competitive discrimination compared to other more complicated methods.

Sparse LDA based on metric scaling
Gower [20] showed that metric scaling of the matrix of Mahalanobis distances between all pairs of groups will recover the canonical variate configuration of group means.However, the Mahalanobis distances use the pooled within-group scatter matrix, and thus, this approach is not applicable for horizontal data.It was mentioned before, that Dhillon et al. [10] avoided this problem by simply doing PCA of the between-group scatter matrix B to obtain LDA results.Trendafilov and Vines [45] considered sparse version of this LDA procedure.
The above approach can still be applied if the equality of the population covariance matrices of the groups cannot be assumed.A particularly elegant solution, employing Hellinger distances, can be obtained if the CPC hypothesis is appropriate for the different covariance matrices [28].
Another unexplored option would be to consider linear discrimination employing withinand between-group distance matrices [21], which have sizes n × n.

Sparse LDA without sparse inducing penalty
In this section we consider a new procedure for sparse LDA.The sparseness of the discriminant functions A will be achieved without employing sparse inducing penalties.Instead, we will look for a solution A with specific pattern of sparseness, with only one nonzero entry in each row of A. The methods is inspired by the recent works of Timmerman et al. [40] and Vichi and Saporta [47].
The following model represents the original data X by only the group means projected onto the reduced space, formed by the orthonormal discriminant functions A. The model can be formally written as: where X is the g × p matrix of group means and U is the n × g indicator matrix of the groups, such that G = (U U) −1 U .In this notations, one has X = GX = (U U) −1 U X, and the model ( 16) can be rewritten as: where one notes that P = U(U U) −1 U is a projector.The p × r orthonormal matrix A contains the orthonormal "raw coefficients" of the problem, and r is the number of required discriminant functions.We want to find sparse raw coefficients A but without relying on sparseness inducing constrains as in the previous sections.In general, this is unsolvable problem, but it can be easily tackled if we restrain ourselves to a particular pattern of sparseness: each row of A should posses a single nonzero entry.Thus, the total number of nonzero entries in A will be p.To construct A with such a pattern, we introduce a p × r binary (of 0's and 1's) membership matrix V, indicating which variables have nonzero loadings on each particular discriminant function i.e. in each column of A. Then, A will be sought in the form of a product A = Diag(b)V, where Diag(b) is a diagonal matrix formed by the vector b.The ith element of b gives the nonzero value at the ith row of A. In other words, V is responsible for the locations of the nonzero entries in A, while b will give their values.Apparently, the choice of V and b will affect the fit of the model (17).Thus, we need to solve the following least-squares problem: which will be called for short SDP (Sparse Discriminative Projection).
Example 3 We apply SDP to the Iris data.Two solutions (matrices A) are depicted in the last four columns of  what was achieved by other approaches, and probably needs further checking.The discriminant plot of the data is given in the (4,1) panel of Fig. 2. The panel (4,2) contains the raw coefficients of these discriminant functions: the first ones has a single nonzero entry, while the second is not sparse at all.The separation achieved by these discriminant functions is quite satisfying, but not the sparseness.
One can develop a better SDP method if the classification error is minimized instead of fitting the data matrix X or its projection onto the subspace spanned by the discriminant functions.Nevertheless, the main weakness of SDP is that for large p the SDP solutions are not sparse enough, and thus, not attractive for application.

Comparison of existing methods
We consider three sparse discriminant analysis methods for comparison using five datasets.The three methods are: • Function constrained linear discriminant analysis (FC-LDA) which is introduced in Sect.
In Table 2 we summarize the results from numerical experiments with the three methods referred above.The solutions produced by FC-LDA and SDA have about 5 % non-zero entries.From this table we see that FC-LDA works as well as SDA.The reason that FC-LDA does not show superiority as compared with SDA may be due to the fact that FC-LDA uses diagonal within group covariance matrix.The results in [49] have higher percentage of non-zero entries, so not quite comparable with the other two.

Connection with Gini's transvariation
This paper mainly revises sparse LDA methods on horizontal data.The main objective of LDA is to find the linear combination of p variables which maximizes group separation.In other words, it is obtained by maximizing the ratio of between to within covariance matrices [14].
Alternatively, the linear combinations can also be obtained in terms of Gini's transvariation [33].Gini [18] defined that two groups are said to be transvariate on a variable X if the sign of the difference of any two values of X from different groups is opposite to the sign of their corresponding mean difference.Any difference satisfying this condition is called a transvariation [46].Montanari [33] has shown that transvariation measures can be used to discriminate between groups.Moreover, Caló [6] has used the transvariaon method to measure group separablity.Other authors such as [34] and Bragoli et al. [4] have also applied transvariation for group separation and classification.
These references show that LDA is related to the Gini's transvariation due to the fact that linear discriminant function can be derived as the linear combination which minimize transvariation probability or area.Therefore, we can see that our method, FC-LDA is also related to Gini's transvariation.It is also possible to impose sparsity penalty on the transvariation method so as to find only few important variables in the case of horizontal data.The nature of the trasvariation formulation most likely will require non-parametric methods.Nowadays, Bayesian methods with sparseness inducing priors are widely used for sparse PCA and factor analysis.This could be a new contribution of Gini's transvariation to LDA and in general, to discrimination and classification problems.reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Table 1
Different raw coefficients for Fisher's Iris data of of the objective function in (7) is 1.0579.The first two eigenvalues 31.0969 and .3125areapproximatedby30.502 and .616,respectively.The sparse raw coefficients are depicted in the last two columns of Table1.The same five points are misclassified in this solution.The discriminant plot of the data is given in the (2,2) panel of Fig.1.It seems, that the LDA (1) with W = W d gives the worst solution, while the sparse LDA with τ = .5 is most satisfying both in terms of fit and interpretability. minimum

Table
1, two for each solution.The first pair of columns is the solution A for which the SDP objective function (17) is minimal (1.0563) among several random starts.However, this solution produces 11 misclassified observations, which is 7.33 % misclassification rate.This solution looks less satisfying compared to previous ones, reported in Example 1.The last two columns of Table 1 give another SDP solution for which the objective function (17) is 1.1121, but the misclassification is 4 % only, with six misclassified observations (9, 31, 50, 52, 103 and 104).The quality of this solution resembles the (dense) LDA solution with W = W d from Example 1.It's clear that SDP performance is not satisfying for this data set.Now, we apply SDP to the Rice data.The best solution (among several random starts) produces 26 misclassified observations, which is only 41.94 % misclassification rate.This result looks much better than

Table 2
Results from three methods for sparse LDA applied to several data sets