Stability of feature selection in classification issues for high-dimensional correlated data
Abstract
Whether or not to handle dependence in feature selection is still an open question in supervised classification issues where the number of covariates exceeds the number of observations. Some recent papers surprisingly show the superiority of naive Bayes approaches based on an obviously erroneous assumption of independence, whereas others recommend inferring the dependence structure in order to decorrelate the selection statistics. In the classical linear discriminant analysis (LDA) framework, the present paper first highlights the impact of dependence in terms of instability of feature selection. A second objective is to revisit the above issue using a flexible factor modeling for the covariance. This framework introduces latent components of dependence, conditionally on which a new Bayes consistency is defined. A procedure is then proposed for the joint estimation of the expectation and variance parameters of the model. The present method is compared to recent regularized diagonal discriminant analysis approaches, assuming independence among features, and regularized LDA procedures, both in terms of classification performance and stability of feature selection. The proposed method is implemented in the R package FADA, freely available from the R repository CRAN.
Keywords
Variable selection, High dimension, Stability, Classification, Discriminant analysis
1 Introduction
High-throughput technologies, increasingly used in diverse contexts such as brain activity modeling, astronomy, or gene expression analysis, share the property of generating a huge volume of data, which makes possible the large-scale analysis of complex systems. Such data are generally characterized by their high dimension, as the number of features can reach several thousands, whereas the sample size is usually a few tens. More and more authors also point out their heterogeneity, as the true signal and some confounding factors (uncontrolled and unobserved) are often observed at the same time. For example, in Omics data used for quantitative issues in systems biology, both these factors and the joint contribution of subsets of features to common biological pathways generate a biologically meaningful dependence structure among features. The impact of such a dependence on the performance of the supervised classification procedures which are used to predict the class of a biological sample from its genomic profile is still an open question.
Recent advances on the impact of dependence on the performance of supervised classification methods in situations where the number of covariates is much larger than the number of sampling items have led to apparently contradictory conclusions. Indeed, the superiority of approaches based on an erroneous independence assumption is reported (Dudoit et al. 2002; Levina 2002; Bickel and Levina 2004), whereas more and more methods account for the covariance structure (Guo et al. 2007; Dabney and Storey 2007; Xu et al. 2009; Zuber and Strimmer 2009). More recently, Ahdesmäki and Strimmer (2010) give more insight into this issue by revisiting the naive Bayes approach of Efron (2008) called diagonal discriminant analysis (DDA), using decorrelated individual scores. In the DDA framework in which independence among features is assumed, finding the support of the signal, namely the subset of truly informative covariates, shows some similarity with large-scale significance study since it consists in ranking individual scores. However, as recalled by Ahdesmäki and Strimmer (2010), whereas multiple testing procedures aim at controlling the number of false discoveries, it is usually more relevant for selection issues to control the number of erroneously nonselected features. Interestingly, in this multiple testing context, some papers (Leek and Storey 2007, 2008; Efron 2007; Friguet et al. 2009; Sun et al. 2012) have also reported the negative impact of large correlation among scores on the consistency of the ranking of p values. These authors propose to handle this correlation in a joint modeling of the relationships between features and covariates and residual variance–covariance using a flexible model which assumes that latent effects can linearly affect the dependence among features. The present paper introduces a specific procedure for the supervised classification issue.
The first objective of the present paper is to illustrate the instability of variable selection in the classical linear discriminant analysis (LDA) context, when the number of covariates exceeds the number of observations. For such high-dimensional issues, regularized procedures based on \(\ell _{1}\) or \(\ell _{2}\) penalization of usual loss functions are now well established to handle the bias-variance trade-off efficiently in the estimation of the discriminant scores [see for example Tibshirani et al. (2002, 2003) for a regularized DDA, Hastie et al. (1995) for a penalized LDA or Friedman et al. (2010) for an elastic net penalization of deviance-based estimation]. The stability and classification performance of some of these usual procedures are investigated and the impact of dependence on their repeatability properties is studied.
Section 2 introduces the context of feature selection for high-dimensional supervised classification in a normal framework, focusing on the two-class issue. A regression factor model is proposed to identify a low-dimensional linear kernel which captures data dependence. Some analytical properties are derived and a new strategy for model selection is deduced. This approach is described in Sect. 3. Sections 4 and 5 investigate the properties of variable selection procedures for high-dimensional data, considering different structures for dependence and real data. The improvements brought by the proposed approach in terms of stability and classification performance are highlighted.
2 High-dimensional variable selection for classification
In order to highlight the selection stability issue, we intentionally focus hereafter on two-class prediction in a normal setting with equal covariance in both groups. However, the general principles of our approach are applicable in the wider framework of more than two classes or unequal covariance structures.
2.1 Notation
The sample consists of \(n\) independent joint observations \((x'_{i},Y_{i}), i = 1,\ldots ,n\), of the explanatory and response variables. In the present high-dimensional framework, \(n\) can be much smaller than the number \(m\) of features. Hereafter, \(n_{1}\) (resp. \(n_{0}=n-n_{1}\)) denotes the number of observations in the sample for which \(Y=1\) (resp. \(Y=0\)).
2.2 Bayes consistency and usual estimation procedures
However, the invertibility of the sample covariance matrix \(S\) is also required to minimize \( \mathcal{D} ( \beta ) \). This invertibility condition does not hold in a high-dimensional framework.
This issue can be addressed by assuming that the support \(I = \left\{ j , \ \beta _{j} \ne 0 \right\} \subset [\![ 1;m ]\!]\) of the classification model is small relative to the number \(m\) of features. Under this assumption of a sparse model, feature selection procedures, which aim at identifying the nonzero coefficients in \(\beta \), are needed to reduce the explanatory profile to the most group-predictive variables.
2.3 Feature selection
There is an abundant statistical literature dealing with the issue of feature selection in regression and classification. Among many other methods, minimization of the Akaike or Bayesian information criteria (AIC, BIC), which are based on an \(\ell _0\)-penalization of the deviance, is frequently used. Indeed, minimization of BIC leads to consistent estimators of the support \(I\) and minimization of the AIC to minimax rate optimal rules for estimating the regression function (Yang 2005). The main cause of concern of these procedures in high dimension is of a computational nature, as an exhaustive search through all possible models (\(2^m\)) is needed. Stepwise exploration of the whole family of models provides an alternative, but this strategy can be unstable in high dimension because the number of fitted candidate models, at most \(m(m+1)/2\), is extremely small relative to the number of possible models.
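To give an order of magnitude for this last remark, the following minimal sketch compares the two counts quoted above for \(m=1000\). The function names are ours and purely illustrative.

```python
def count_all_models(m: int) -> int:
    """Number of candidate models in an exhaustive search: all subsets of m features."""
    return 2 ** m


def max_stepwise_models(m: int) -> int:
    """Upper bound m(m+1)/2 on the number of models fitted by a stepwise search."""
    return m * (m + 1) // 2


m = 1000
# Fraction of the model space that a stepwise search can visit at most.
explored_fraction = max_stepwise_models(m) / count_all_models(m)
print(max_stepwise_models(m), explored_fraction)
```

With \(m=1000\), a stepwise search fits at most 500500 models, a vanishing fraction of the \(2^{1000}\) possible ones.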
As mentioned by Van de Geer (2010), LASSO makes strong assumptions on the covariance matrix, mainly that correlations between variables are weak. Consequently, a major and still open question remains the application of the procedure while coping with large correlations between variables.
Let us illustrate the impact of dependence on the stability of a standard variable selection procedure (LASSO) by a simulation study, comparing the dependent and independent cases.
The same simulation setting is used for the independent case, except that the within-group correlation matrix is here \(\mathbb I_{m}\). Besides, \(\mu _{1}=\beta \) to keep the same \(\beta \) as in the dependent case.
The first striking impact of dependence is related to the number of selected features (Fig. 2), which is much larger when the features are correlated. Moreover, whereas in the independent case, no erroneous selection of null features is reported in the simulations, in 12.1 % of the simulations under dependence, the FDP is nonzero. Accuracy of selection is also clearly affected by dependence: the mean ranks in the independent case are consistent with the expected means if the most group predictive variables are selected (Fig. 2), namely half the number of selected features, whereas these mean ranks are much larger in the dependent case (Fig. 3a).
3 Factor-adjusted variable selection
We propose a framework in which dependence is tractable at the level of the original data, which allows a direct adjustment of the data for that dependence. This dependence adjustment step can be combined with any selection procedure, as proposed in the comparative studies of Sect. 5.
3.1 Factor adjustment
In many areas, and particularly in the analysis of gene expression data (Kustra et al. 2006; Leek and Storey 2008; Carvalho et al. 2008; Friguet et al. 2009; Teschendorff et al. 2011; Sun et al. 2012), it has become frequent to cope with dependence by assuming the existence of a moderate number of latent factors conditionally on which it is assumed that features are independent. The main advantage of such an approach is that dependence is captured into a low-dimensional linear space. Then the statistical procedures initially designed for the independent (or weak correlation) case can be applied to the decorrelated data, obtained after adjustment for the latent effects. Several methods have been proposed to model the latent factors, such as (Independent) surrogate variable analysis (Leek and Storey 2007; Teschendorff et al. 2011), independent component analysis (Lee and Batzoglou 2003), latent-effect adjustment after primary projection (Sun et al. 2012) or factor analysis (Friguet et al. 2009).
Hereafter, we introduce a supervised Factor Analysis model for classification. Based on this model, the conditional linear Bayes classifier is defined and the conditional Bayes consistency of the factor-adjusted approach is proved.
3.2 A flexible framework for dependence
Note that the Bayes classifier general optimality, which is established without any assumption on \(\varSigma \), is not questioned here. However, under the assumption of a factor model for \(\varSigma \), the above result establishes the theoretical superiority of a conditional approach based on the factor-adjusted explanatory variables \(x-Bz\). Consequently, we propose hereafter an estimation procedure for the regression factor model (12).
3.3 An iterative estimation procedure for the supervised factor model
We propose an iterative method, which alternates the estimation of \(\mu _{0}, \ \mu _{1}, B\) and \(\varPsi \), and the derivation of the latent factors \(Z\).
3.3.1 Initialization
The algorithm starts with \(\hat{\mu }_{0} = \bar{x}_{0}, \ \hat{\mu }_{1} = \bar{x}_{1}\). Based on these estimates of the group means, the centered profiles \( x - \hat{\mu }_{y}\) are used to estimate \(B\) and \(\varPsi \), using the EM algorithm detailed in Friguet et al. (2009). The corresponding estimators are hereafter denoted \(\hat{B}\) and \(\hat{\varPsi }\).
3.3.2 Step 1: factors extraction (Z)
3.3.3 Remarks about the implementation
 1. Note that the calculation of \(\beta ^{*}_{0}\) and \(\beta ^{*}\), using expressions (3) and (4), only involves the inversion of a \(q\times q\) matrix, according to Woodbury’s identity:
$$\begin{aligned} \varSigma ^{-1} = \varPsi ^{-1} - \varPsi ^{-1} B ( I_{q} + B' \varPsi ^{-1} B )^{-1} B' \varPsi ^{-1}. \end{aligned}$$
 2. Besides, the plug-in estimator of \(\mathbb {P}_{x}(Y=1)\) can be affected if the factor model is overfitted, which penalizes the classification performance. Alternative estimation procedures can therefore be preferred to estimate \(\mathbb {P}_{x}(Y=1)\), such as \(\ell _{1}\)-penalized logistic regression, which introduces sparsity to reduce the effects of overfitting.
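The identity of remark 1 can be checked numerically on a toy example. The sketch below is only an illustration in plain Python, not the FADA implementation: it assumes a factor covariance \(\varSigma = BB' + \varPsi \) with diagonal \(\varPsi \), and all function names and numerical values are ours. Only a \(q\times q\) matrix is inverted.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting, meant for small q x q matrices."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def factor_covariance(psi_diag, b):
    """Sigma = B B' + Psi, with Psi diagonal (assumed factor structure)."""
    bbt = matmul(b, transpose(b))
    m = len(b)
    return [[bbt[i][j] + (psi_diag[i] if i == j else 0.0) for j in range(m)]
            for i in range(m)]


def woodbury_inverse(psi_diag, b):
    """Sigma^{-1} = Psi^{-1} - Psi^{-1} B (I_q + B' Psi^{-1} B)^{-1} B' Psi^{-1}."""
    m, q = len(b), len(b[0])
    psi_inv_b = [[b[i][k] / psi_diag[i] for k in range(q)] for i in range(m)]
    core = matmul(transpose(b), psi_inv_b)  # B' Psi^{-1} B, a q x q matrix
    for k in range(q):
        core[k][k] += 1.0
    core_inv = inverse(core)  # the only matrix inversion, of size q x q
    correction = matmul(matmul(psi_inv_b, core_inv), transpose(psi_inv_b))
    return [[(1.0 / psi_diag[i] if i == j else 0.0) - correction[i][j]
             for j in range(m)] for i in range(m)]


# Toy example with m = 4 features and q = 2 factors (arbitrary values).
psi_diag = [1.0, 0.5, 2.0, 1.5]
b = [[0.9, 0.1], [0.4, -0.3], [-0.7, 0.2], [0.2, 0.8]]
sigma = factor_covariance(psi_diag, b)
product = matmul(sigma, woodbury_inverse(psi_diag, b))  # should be close to I_4
```

In this toy case the product of \(\varSigma \) and its Woodbury inverse equals the identity matrix up to rounding error, although no \(m\times m\) inversion is performed.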
3.3.4 Step 2: model parameters estimation (\(\mu _{0}, \mu _{1}, B\) and \(\varPsi \))
The estimations of \(\mu _{0}\) and \(\mu _{1}\) are updated by the least-squares fitting of the multivariate regression model (12), where \(Z\) is replaced by \(\hat{Z}\). The factor decomposition of the covariance of the centered profiles (\(x-\hat{\mu }_{y}\)) provides updated estimates of \(B\) and \(\varPsi \).
3.3.5 Iterations and stop criterion
Steps \(1\) and \(2\) are iterated, alternately updating the estimates of the factors and of the model parameters. The algorithm stops when two successive estimates of the factor model parameters are similar.
Therefore, the proposed strategy consists in defining factor-adjusted versions of usual classification methods by applying these methods on the factor-adjusted data \(x-\hat{B}\hat{Z}\).
A crucial point in the present feature selection context is the choice of the proper number of factors. Indeed, an overestimation of \(q\) would artificially reduce the estimates of the residual specific variances \(\hat{\varPsi }\), which could generate false positives. In a multiple testing context, Friguet et al. (2009) notice that the variance of the number of false positives is an increasing function of the amount of dependence among the test statistics and give a closed-form expression for the variance inflation \(\mathcal{V}_{k}\) due to the \(k\)-factor model for this dependence. Consequently, they suggest an ad hoc procedure which consists, for each \(k\)-factor model \((\varPsi _{k},B_{k})\), in estimating the variance of the number of false positives when the tests are calculated with the \(k\)-factor-adjusted residuals \( \hat{e} - \hat{Z}_{k} \hat{B}_{k} \).
The algorithm described in this section is implemented in the R package FADA available from the R repository CRAN, providing functions for decorrelation, feature selection, and estimation of a classification model.
In the following two sections, we illustrate, on real data and by simulations, that this new factor adjustment algorithm improves variable selection, both in terms of classification or prediction performance and reproducibility of the selected variables.
4 Stability of variable selection in high dimension
4.1 DNA microarray data
In genomics, microarrays let biologists measure expression levels for thousands of genes in a single sample all at once. The level of measured gene expressions is influenced both by a biological trait of interest and by unwanted technical and/or biological factors, referred to as heterogeneity factors (Leek and Storey 2007, 2008). Moreover, it is now widely considered that groups of genes contributing to a few biological processes can show coexpression patterns: some genes are activators or inhibitors of others. This motivates the emerging issue of gene coexpression network inference from microarray data. In such a context, dealing with dependence is a major concern in carrying out statistical analyses.
Feature selection is increasingly common in genomic data analysis to identify genes whose expression patterns have meaningful biological links with a phenotypic trait. Therefore, as an illustration of selection issues in high dimension, let us consider the microarray experiment detailed in Hedenfalk et al. (2001), which is commonly used in the statistical literature for comparative studies of high-dimensional statistical procedures.
4.1.1 Data: breast cancer study
The data were primarily analyzed in order to compare expressions of three types of breast cancer tumor tissues: BRCA1, BRCA2, and Sporadic. The raw expression data, downloaded from http://research.nhgri.nih.gov/microarray/NEJM_Supplement/, initially consist of 3226 genes in 22 arrays; seven arrays from the BRCA1 group, eight from the BRCA2 group, and six from the Sporadic group. The label of one sample being unclear, it has been removed from the study. 196 genes presenting some suspicious levels of expression (larger than 10 or lower than 0.1) are removed and the data are finally log\(_{2}\) transformed. In the following, we focus on the selection of gene expressions among the \(m=3030\) included in the study that best predict the two types of tumors BRCA1 and BRCA2. The sample size is then \(n=15\).
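The filtering and transformation steps described above can be sketched as follows. This is a minimal illustration on toy values, not the code used for the study: we assume that a gene is discarded as soon as one of its raw expression levels falls outside \([0.1, 10]\), and the function and variable names are ours.

```python
import math


def preprocess(expr, low=0.1, high=10.0):
    """Filter suspicious genes, then log2-transform the remaining ones.

    expr maps a gene identifier to its raw expression levels across arrays.
    Assumption: a gene is removed if any of its values is below `low` or above `high`.
    """
    kept = {}
    for gene, values in expr.items():
        if all(low <= v <= high for v in values):
            kept[gene] = [math.log2(v) for v in values]
    return kept


# Toy data with three hypothetical genes measured on three arrays.
toy = {
    "g1": [1.0, 2.0, 4.0],   # kept
    "g2": [0.05, 1.0, 1.0],  # removed: one value lower than 0.1
    "g3": [0.5, 12.0, 1.0],  # removed: one value larger than 10
}
clean = preprocess(toy)
```

Applied to the 3226 genes of the raw data, such a filter would leave the \(m=3030\) genes reported above once the 196 suspicious genes are removed.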
4.1.2 Methods
Variable selection is performed using the R package glmnet (Friedman et al. 2010), which provides a function to fit a two-group logistic regression model via \(\ell _{1}\)-regularized maximum likelihood (Tibshirani 1996). The sample being small, the tuning parameter is chosen by leave-one-out cross-validation. LASSO is known to be inconsistent when performed on correlated data (Bach 2008). However, the following example aims to illustrate how a lack of stability can be observed on real data and how factor adjustment can stabilize a usual selection procedure.
The procedure is first applied on the complete dataset (with \(n=15\) observations). The performance of the procedure is evaluated through the number of selected variables and the cross-validation error.
Then, to illustrate the instability of variable selection, the same procedure is applied, removing successively one of the observations. The aim is to evaluate the sensitivity of the procedure to changes in the data. The results of the selection procedure are compared to those obtained on the complete data, considering the number of selected variables and the overlap with the subset of variables initially selected using the complete data.
Finally, the same procedure is applied on the factor-adjusted data. Factor adjustment is performed with the method presented in Sect. 3.3. The minimization of the variance inflation criterion suggests keeping \(q=1\) common factor for the complete data and for each incomplete dataset.
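The stability criteria used in this comparison can be sketched as follows. This is a minimal illustration under our own assumptions: `select` stands for any selection procedure (a hypothetical callable returning the identifiers of the selected features), and the toy rule below merely thresholds feature means; it is not the LASSO fit used in the study.

```python
def overlap_with_reference(reference, selected):
    """Number and percentage of reference features recovered in `selected`."""
    reference, selected = set(reference), set(selected)
    common = len(reference & selected)
    pct = 100.0 * common / len(reference) if reference else float("nan")
    return common, pct


def leave_one_out_stability(select, samples):
    """Rerun `select` after removing each observation in turn.

    `select` is a hypothetical callable: list of samples -> iterable of feature ids.
    """
    reference = set(select(samples))
    rows = []
    for i in range(len(samples)):
        reduced = samples[:i] + samples[i + 1:]
        selected = set(select(reduced))
        common, pct = overlap_with_reference(reference, selected)
        rows.append({"removed": i + 1, "features": len(selected),
                     "included_n": common, "included_pct": pct})
    return reference, rows


def toy_select(samples):
    """Toy rule (assumption): keep features whose mean exceeds 1."""
    n = len(samples)
    return [j for j in range(len(samples[0]))
            if sum(s[j] for s in samples) / n > 1]


samples = [[2, 0, 2], [2, 0, 0], [2, 2, 2]]
reference, rows = leave_one_out_stability(toy_select, samples)
```

For instance, recovering 9 of 11 reference features gives an overlap of 81.8 %, which is how the percentages of included variables are to be read.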
4.1.3 Results
Selection procedure on the complete dataset
Data  Features  Prediction error 

Raw data  11  0.400 
Factor-adjusted data  8  0.267 
Selection procedure on the incomplete datasets. The selection procedure is then applied on the 15 sub-datasets, removing successively one of the observations from the complete raw data (resp. complete factor-adjusted data). Table 2a (resp. 2b) reports the number of selected features, the number and proportion of selected variables which belong to \(I_{\mathrm{{raw}}}\) (resp. \(I_{\mathrm{{FA}}}\)) and the cross-validated prediction error of the selection procedure for each sub-dataset. For each criterion, the tables report the results after the removal of the first four and last four observations as an overview of results, as well as the mean and standard deviation in the last column. Results for all observations are not presented to avoid overloading.
A wide range of situations are reported in Table 2a, regarding both the number and the set of selected features, depending on which observation has been removed. Each observation has therefore a strong influence on the stability of the selection procedure.
Selection procedure after having removed one observation
Removed id.  1  2  3  4  ...  12  13  14  15  Mean (SD) 

(a) Raw data  
Features  1  10  7  8  ...  6  12  6  6  6.4 (3.6) 
Included (N)  1  9  3  6  ...  6  6  3  5  4.2 (2.5) 
Included (%)  9.1  81.8  27.3  54.5  ...  54.5  54.5  27.3  45.5  38.2 (22.3) 
Prediction error  0.571  0.214  0.286  0.214  ...  0.214  0.357  0.214  0.357  0.3 (0.138) 
(b) Factor-adjusted data  
Features  9  7  9  10  ...  9  7  12  8  7.9 (2.5) 
Included (N)  5  7  6  8  ...  7  6  8  7  5.7 (2) 
Included (%)  62.5  87.5  75.0  100.0  ...  87.5  75.0  100.0  87.5  70.8 (24.9) 
Prediction error  0.357  0.214  0.286  0.071  ...  0.357  0.357  0.214  0.214  0.229 (0.115) 
Conclusion. This illustrative situation shows that the usual statistical approaches for variable selection, such as LASSO selection here, are questionable for dependent high-dimensional data analysis. A small change in the data, just considering the removal of one observation, induces variability in the performance of the procedure and leads to different sets of selected variables. Factor adjustment helps to block such effects of heterogeneity and improves both the stability of the set of selected variables and the prediction error.
4.2 DNA methylation data
Recently, DNA methylation data have attracted the attention of biologists because new biological processes can be identified from the analysis of such data. In this section, a study is conducted to highlight the contribution of factor adjustment for the analysis of data generated by such experiments.
4.2.1 Data: gastric tumors study
The data were primarily published in Zouridis (2012) and initially consist of 27578 DNA methylation measures and 297 observations. 2573 variables were removed because of missing data so that the studied dataset has 25,005 columns. The binary response variable codes for gastric tumors (203 cases) and gastric nonmalignant samples (94 cases).
4.2.2 Methods
According to the simulation study in Sect. 5, shrinkage discriminant analysis (SDA) appears to be the most efficient method regarding the prediction error and the precision of selection. Thus, SDA is conducted on the whole dataset using the R package sda. The results are compared to the following three-step procedure. (1) A decorrelation step is performed on the whole dataset using the FADA R package. (2) To decrease the dimension of the dataset and to avoid high computation time in step (3), a rough selection is performed through standard t tests on decorrelated data and the first 3000 CpG sites are selected for the next step. (3) Variable selection and estimation of the classification model are finally performed by SDA on the factor-adjusted sub-dataset. Prediction errors are computed through a tenfold cross-validation with 20 repetitions, so that the model is estimated on 200 splits of the data.
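The resampling scheme behind these 200 splits can be sketched as follows. This is a minimal illustration, assuming plain (non-stratified) random folds, which the text does not specify; the function name and the seed are ours.

```python
import random


def repeated_kfold(n, k=10, repeats=20, seed=0):
    """Return `repeats * k` (train, test) index splits of n observations.

    Assumption: folds are drawn by simple random shuffling, without stratification.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k disjoint folds covering all indices
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g, fold in enumerate(folds) if g != f for i in fold)
            splits.append((train, test))
    return splits


# 297 observations as in the gastric tumors data: 10 folds x 20 repetitions.
splits = repeated_kfold(297)
```

Each of the 200 splits trains the model on about nine tenths of the 297 observations and evaluates it on the remaining tenth.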
4.2.3 Results
Nb. of selected features and estimated prediction errors for gastric tumors data
Method  Nb. features  Error rate 

SDA  2638  0.0301 
Factor-adjusted SDA  305  0.0217 
5 Impact of the dependence design: a simulation study
In order to study the performance of factor adjustment for classification and variable selection, we propose a more intensive simulation study. Considering several scenarios of dependence between variables [independence, block dependence, factor structure, and Toeplitz design, in the manner of Meinshausen and Bühlmann (2010)], some well-known classification methods are applied on simulated datasets. The stability of original procedures is compared to their factor-adjusted versions.
5.1 Simulation design
Let us consider datasets simulated according to a multivariate normal distribution, each dataset being composed of \(m=1000\) variables and \(n=30\) observations. Besides, let us consider a binary variable \(Y\) such that the observations are split into two arbitrary groups of size \(n_{0}=n_{1}=n/2\). The \(m\)-dimensional profiles \(X\) are normally distributed with mean \(\mu _{0} = 0_{m}\) in the first group (\(Y=0\)), where \(0_m \in \mathbb {R}^m\) is the zero vector, and \(\mu _1\) in the second group (\(Y=1\)). A subset \(I\) of 50 variables is randomly chosen to be group-predictive. For these variables, \(\mu _1\) has nonzero components: \(\mu _{1j}=\delta \) for \(j \in I\) and \(\mu _{1j}=0\) otherwise. The value of \(\delta \) is set to 0.55 or 0.47, which correspond to high and moderate signal strength, as introduced by Donoho and Jin (2008).
 (A) The \(m\) variables are normally and independently distributed with variance \(1\), so that \(\varSigma \) is the \(m \times m\) identity matrix \(\mathbb I_m\). This scenario is used as a control situation to check that the proposed method does not falsely catch dependence;
 (B) \(\varSigma \) is a two-block matrix. Correlation between the first \(100\) variables is set to \(0.7\) and correlation between the remaining \(900\) variables is equal to \(0.3\). This correlation matrix is used to study the impact of dependence in multiple testing in the context of gene expression analysis in Zuber and Strimmer (2009);
 (C) \(\varSigma \) is decomposed into a specific and a common part as in a factor model (see Sect. 3.2): \(\varSigma =BB'+\varPsi \). \(\varPsi \) is a diagonal matrix of specific variances and \(B\) is an \(m \times q\) matrix of coefficients, chosen so that the proportion \(\mathrm{{trace}}(BB')/\mathrm{{trace}}(\varSigma )\) of dependence among variables is high (78 %). In the present simulation study, the number of common factors is \(q = 5\). Note that the signal is here set to a weaker value \(\delta =0.47\) because generating dependence through a factor structure favors the proposed approach.
 (D) \(\varSigma \) is a Toeplitz matrix. This kind of design corresponds to autoregressive time dependence such that the covariance between two variables \(i\) and \(j\) is equal to \(\sigma \rho ^{\vert i-j \vert }\). In this simulation study, \(\sigma =1\) and \(\rho =0.99\).
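The group means and the covariance designs (A), (B) and (D) can be written down explicitly, as in the following minimal sketch. It is an illustration under stated assumptions rather than the code used for the study: in design (B), variables belonging to different blocks are assumed to be uncorrelated, design (C) is omitted because the loadings \(B\) are not specified here, and all function names and the seed are ours.

```python
import random


def identity_cov(m):
    """Design (A): independent variables with unit variance."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]


def block_cov(blocks):
    """Design (B): `blocks` is a list of (size, within-block correlation).

    Assumption: unit variances and zero correlation between different blocks.
    """
    m = sum(size for size, _ in blocks)
    cov = identity_cov(m)
    start = 0
    for size, rho in blocks:
        for i in range(start, start + size):
            for j in range(start, start + size):
                if i != j:
                    cov[i][j] = rho
        start += size
    return cov


def toeplitz_cov(m, rho, sigma=1.0):
    """Design (D): covariance sigma * rho^|i-j| between variables i and j."""
    return [[sigma * rho ** abs(i - j) for j in range(m)] for i in range(m)]


def group_means(m, n_predictive, delta, seed=0):
    """mu_0 = 0_m, and mu_1 equal to delta on a random support I, 0 elsewhere."""
    rng = random.Random(seed)
    support = sorted(rng.sample(range(m), n_predictive))
    mu1 = [0.0] * m
    for j in support:
        mu1[j] = delta
    return [0.0] * m, mu1, support


mu0, mu1, support = group_means(m=1000, n_predictive=50, delta=0.55)
sigma_b = block_cov([(100, 0.7), (900, 0.3)])
sigma_d = toeplitz_cov(1000, rho=0.99)
```

Samples for each design can then be drawn from the corresponding multivariate normal distribution with any standard generator.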
5.2 Methods
 (LASSO) \(\ell _{1}\)-regularized logistic regression using the R package glmnet (Friedman et al. 2010);
 (SLDA) Sparse linear discriminant analysis, which is an \(\ell _1\)-penalized LDA, using the R package SparseLDA (Clemmensen et al. 2011), with the stop parameter set to 10;
 (SDA) Shrinkage discriminant analysis, which is a James–Stein regularized version of LDA, using the R package sda (Ahdesmäki and Strimmer 2010). Note that SDA consists finally in a correlation adjustment of the scores used for feature selection in DDA;
 (DDA) Shrinkage diagonal discriminant analysis, which assumes within-group independence among features, using the R package sda (Ahdesmäki and Strimmer 2010). Estimation of the DDA model is here regularized using a ridge approach.
Several cutoffs are implemented in the R package sda to conduct DDA and SDA such as the False Non-Discovery Rate (FNDR) or Higher Criticism (Donoho and Jin 2008). Both lead to similar results in this simulation study and the results reported here concern the FNDR cutoff.
Each procedure is applied both on raw data and on factor-adjusted data, using the decorrelation method presented in Sect. 3.3: for each simulated dataset, covariance parameters \(\varPsi \) and \(B\) and latent factors \(Z\) are estimated on training datasets and factor-adjusted training data (decorrelation step) are computed using formula \(x-Bz\) introduced in expression (14). Estimates \(\hat{\varPsi }\) and \(\hat{B}\) are used to estimate latent factors of testing data and factor-adjusted testing data are computed in the same way. Classification methods are finally trained on decorrelated training samples and assessed on decorrelated testing sample.
Prediction errors are calculated on an independent balanced test dataset consisting of 10,000 sampling items, generated according to each structure of dependence. Performances of methods are assessed by calculating, for each simulated dataset, the prediction error calculated on the test dataset, the number of selected features, and the proportion of truly selected variables (or positive predictive value, reported hereafter as “precision”).
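These three criteria are simple to compute once the selected set and the predicted labels are available. The sketch below is a minimal illustration with function names of our own; returning `None` for the precision when no feature is selected is our convention, not one stated in the text.

```python
def selection_metrics(selected, true_support):
    """Number of selected features and precision (positive predictive value)."""
    selected, true_support = set(selected), set(true_support)
    n_selected = len(selected)
    true_positives = len(selected & true_support)
    # Precision is undefined when nothing is selected (our convention: None).
    precision = true_positives / n_selected if n_selected else None
    return {"n_selected": n_selected, "precision": precision}


def prediction_error(y_true, y_pred):
    """Proportion of misclassified items in the test dataset."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: 4 selected features, 2 of which belong to the true support.
metrics = selection_metrics(selected=[1, 2, 3, 4], true_support=[2, 4, 6])
error = prediction_error([0, 1, 1, 0], [0, 1, 0, 0])
```

In the toy example, the precision is 50 % and one test item out of four is misclassified.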
5.3 Results
5.3.1 Cross-validation
Check of cross-validated error rates (prediction errors) for a no-signal design
Method  Raw data  Factor-adjusted data  

LASSO  0.4989  0.4990 
SLDA  0.4989  0.4992 
SDA  0.5000  0.5004 
DDA  0.4999  0.4996 
5.3.2 Independence design
No factor found for independence design (A)
Method  Prediction error  Features  Precision (%) mean (SD)  

LASSO  0.3858  12.82  40.32 (20.96) 
SLDA  0.3873  10.00  39.50 (15.33) 
SDA  0.3868  35.09  35.52 (21.77) 
DDA  0.3489  32.90  38.44 (23.68) 
5.3.3 Structures with correlations
Simulation results for several designs of dependence
Method  Prediction error  Features  Precision (%) mean (SD) 

Block structure (B)  
LASSO  0.3780  12.64  40.05 (23.85) 
Factor-adjusted LASSO  0.3118  15.44  49.16 (21.30) 
SLDA  0.3872  10.00  39.80 (15.50) 
Factor-adjusted SLDA  0.3426  10.00  50.80 (16.00) 
SDA  0.3244  41.63  42.12 (17.77) 
Factor-adjusted SDA  0.2863  44.19  42.46 (18.08) 
DDA  0.4393  165.10  28.31 (24.46) 
Factor-adjusted DDA  0.2820  48.44  42.13 (19.14) 
Factor structure (C)  
LASSO  0.2660  14.025  62.67 (14.94) 
Factor-adjusted LASSO  0.1038  8.477  90.43 (12.35) 
SLDA  0.3000  10.00  68.80 (17.25) 
Factor-adjusted SLDA  0.0926  10.00  87.50 (11.67) 
SDA  0.1258  70.00  50.29 (14.84) 
Factor-adjusted SDA  0.0452  53.17  65.17 (19.00) 
DDA  0.4772  4.18  69.75 (18.30) 
Factor-adjusted DDA  0.0474  55.26  65.04 (20.61) 
Temporal dependence (D)  
LASSO  0.3020  13.10  62.36 (20.63) 
Factor-adjusted LASSO  0.1510  8.03  93.02 (9.69) 
SLDA  0.3314  10.00  62.50 (17.08) 
Factor-adjusted SLDA  0.1222  10.00  90.90 (10.83) 
SDA  0.2695  57.20  75.07 (23.94) 
Factor-adjusted SDA  0.0893  68.22  67.93 (25.66) 
DDA  0.4813  149.42  15.58 (15.27) 
Factor-adjusted DDA  0.3146  97.65  48.76 (29.91) 
Considering the block structure (B), factor adjustment reduces the error rate of each classification method, and relevant features are more often selected, except for SDA.
As expected, scenario (C) leads to the most significant results mainly because this scenario is favored by the factor model used for the covariance matrix.
When applied on raw data, DDA always leads to the highest error rates. In scenario (C), the selection step is very unstable as no variable was selected in 15 % of simulations, which explains why the average number of selected features is only 4.18. For the two other scenarios, the number of selected features is high, but without catching relevant ones. As expected, DDA, which assumes independence between covariates, is more suitable on factor-adjusted data and performances are better both in terms of prediction ability and in selection.
LASSO and Sparse LDA are considerably improved by factor adjustment. Interestingly, these two methods give similar results, probably because they are both based on \(\ell _1\)-regularization. However, the benefit of factor adjustment on SDA is smaller than on the other classification methods. SDA is indeed a competing method to factor adjustment as it is also based on decorrelation. Nevertheless, SDA seems to be improved by a factor adjustment, which could be explained by the better ability of the factor model to catch a complex dependence than the James–Stein approach.
6 Discussion and conclusion
The analysis of highdimensional data has markedly renewed the statistical methodology for feature selection in classification issues. Such data are characterized by their heterogeneity, as confusing factors can interfere with the signal of interest. A common and notorious difficulty in largescale data analysis is therefore the handling of these confounding factors, which may induce bias in significance studies, cause unreliable feature selection and high error rates.
The present article illustrates that data heterogeneity affects the ranking and the stability of supervised classification model selection. Most of the usual procedures in supervised classification assume a weak correlation structure between variables and heterogeneity of the data violates this assumption. This article describes an innovative methodology based on an explicit modeling of the data heterogeneity, which provides a general framework to deal with dependence in variable selection. A supervised factor model is used to capture data dependence into a linear lowdimensional space and a conditional Bayes consistency is defined in this framework. This paper provides an algorithm which takes advantage of the correlation structure to estimate at the same time the correlation structure, the signal and individual probabilities in order to decorrelate data. Furthermore, we show that the conditional optimality of the linear Bayes classifier is achieved by the usual Bayes classifier applied to the factoradjusted data.
Factor adjustment is shown to improve the stability of some usual selection and classification procedures. One very important implication of the factor-adjusted approach is that, in situations where a strong dependence can be approximated using a factor decomposition, the classification performance is markedly improved.
Our simulation study shows good operating characteristics under dependence structures that, according to several authors, are well suited to genomics, which is one of our scientific areas of interest. We believe that this approach can also be useful in other scientific areas. As an illustration, we have considered a Toeplitz design, which can be used to model simple autoregressive time dependence structures.
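For reference, a Toeplitz correlation matrix with entries \(\rho^{|i-j|}\) is the correlation matrix of a stationary first-order autoregressive process. The sketch below builds such a matrix. The function name, the dimension and the value of \(\rho\) are illustrative assumptions and are not the settings of the simulation study.

```python
def ar1_toeplitz(p, rho):
    """Return the p x p matrix with entries rho ** |i - j| (AR(1) correlation)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


# Illustrative values only: 4 features, lag-one correlation 0.5.
sigma = ar1_toeplitz(4, 0.5)
for row in sigma:
    print(row)
```

The correlation decays geometrically with the distance between feature indices, so each descending diagonal of the matrix is constant.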
In this paper, it is assumed that the covariance structures in both groups are the same, which is consistent with the homoscedasticity assumption of Linear Discriminant Analysis. Extraction of factors \(Z\) depending on the response variable \(Y\) is possible by considering a different factor model in each group. In that case, two models are independently estimated from the two sets of observations where \(Y=0\) or \(Y=1\). However, in high-dimensional data analysis, where the total number of observations is often small, this could reduce the power to detect the biological signal (different means in each group).
References
 Ahdesmäki, M., Strimmer, K.: Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann. Appl. Stat. 4, 503–519 (2010)MathSciNetCrossRefMATHGoogle Scholar
 Bach, F.: Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the twentyfifth International Conference on Machine Learning (ICML) (2008)Google Scholar
 Bickel, P., Levina, E.: Some theory for Fisher’s linear discriminant function, naive Bayes, and some alternatives when there are many more variables than observations. Bernoulli 10(6), 989–1010 (2004)MathSciNetCrossRefMATHGoogle Scholar
 Blum, Y., LeMignon, G., Lagarrigue, S., Causeur, D.: A factor model to analyze heterogeneity in gene expression. BMC Bioinform. 11, 368 (2010)CrossRefGoogle Scholar
 Carvalho, C., Chang, J., Lucas, J., Nevins, J., Wang, Q., West, M.: Highdimensional sparse factor modeling: applications in gene expression genomics. J. Am. Stat. Assoc. Appl. Case Stud. 103, 484 (2008)MathSciNetMATHGoogle Scholar
 Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)MathSciNetCrossRefGoogle Scholar
 Dabney, A., Storey, J.: Optimality driven nearest centroid classification from genomic data. PLoS ONE 2(10), e1002 (2007)CrossRefGoogle Scholar
 Donoho, D., Jin, J.: Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. 105(39), 14790–14795 (2008)CrossRefGoogle Scholar
 Dudoit, S., Fridlyand, J., Speed, T.: Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97, 77–87 (2002)MathSciNetCrossRefMATHGoogle Scholar
 Efron, B.: Empirical Bayes estimates for largescale prediction problems. Technical report, Department of Statistics, Stanford University (2008)Google Scholar
 Efron, B.: Correlation and largescale simultaneous testing. J. Am. Stat. Assoc. 102, 93–103 (2007)MathSciNetCrossRefMATHGoogle Scholar
 Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)CrossRefGoogle Scholar
 Friguet, C., Kloareg, M., Causeur, D.: A factor model approach to multiple testing under dependence. J. Am. Stat. Assoc. 104(488), 1406–1415 (2009)MathSciNetCrossRefMATHGoogle Scholar
 Guo, Y., Hastie, T., Tibshirani, R.: Regularized discriminant analysis and its application in microarrays. Biostatistics 8, 86–100 (2007)CrossRefMATHGoogle Scholar
 Hastie, T., Buja, A., Tibshirani, R.: Penalized discriminant analysis. Ann. Stat. 23(1), 73–102 (1995)MathSciNetCrossRefMATHGoogle Scholar
 Hedenfalk, I., Duggan, D., Chen, Y.D., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.P., Wilfond, B., Borg, A., Trent, J.: Gene expression profiles in hereditary breast cancer. New Engl. J. Med. 344, 539–548 (2001)CrossRefGoogle Scholar
 Kustra, R., Shioda, R., Zhu, M.: A factor analysis model for functional genomics. BMC Inform. 7, 216–229 (2006)Google Scholar
 Lee, S., Batzoglou, S.: Application of independent component analysis to microarrays. Genome Biol. 4(11), R76 (2003)CrossRefGoogle Scholar
 Leek, J.T., Storey, J.: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3(9), e161 (2007)CrossRefGoogle Scholar
 Leek, J.T., Storey, J.: A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. 105, 18718–18723 (2008)CrossRefGoogle Scholar
 Levina, E.: Statistical issues in texture analysis. PhD thesis, University of California, Berkeley (2002)Google Scholar
 Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. B 72(4), 417–473 (2010)MathSciNetCrossRefGoogle Scholar
 Pournara, I., Wernisch, L.: Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinform. 8, 61 (2007)CrossRefGoogle Scholar
 Spearman, C.: General intelligence, objectively determined and measured. Am. J. Psychol. 15, 201–293 (1904)CrossRefGoogle Scholar
 Sun, Y., Zhang, N., Owen, A.: Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6(4), 1664–1688 (2012)Google Scholar
 Teschendorff, A., Zhuang, J., Widschwendter, M.: Independent surrogate variable analysis to deconvolve confounding factors in largescale microarray profiling studies. Bioinformatics 27(11), 1496–1505 (2011)CrossRefGoogle Scholar
 Tibshirani, R.: Regression shrinkage and selection via LASSO. J. R. Stat. Soc. B 58, 267–288 (1996)MathSciNetMATHGoogle Scholar
 Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer type by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99, 6567–6572 (2002)CrossRefGoogle Scholar
 Tibshirani, R., Hastie, T., Narsimhan, B., Chu, G.: Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci. 18, 104–117 (2003)MathSciNetCrossRefMATHGoogle Scholar
 Van de Geer, S.: L1regularization in highdimensional statistical models. Proceedings of the International Congress of Mathematicians (2010)Google Scholar
 Xu, P., Brock, G., Parrish, R.S.: Modified linear discriminant analysis approaches for classification of highdimensional microarray data. Comput. Stat. Data Anal. 53, 1674–1687 (2009)MathSciNetCrossRefMATHGoogle Scholar
 Yang, Y.: Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4), 937–950 (2005)MathSciNetCrossRefMATHGoogle Scholar
 Zou, H.: The adaptive LASSO and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)MathSciNetCrossRefMATHGoogle Scholar
 Zouridis, H., et al.: Methylation subtypes and largescale epigenetic alterations in gastric cancer. Sci. Transl. Med. 4(156), 156140 (2012)CrossRefGoogle Scholar
 Zuber, V., Strimmer, K.: Gene ranking and biomarker discovery under correlation. Bioinformatics 25, 2700–2707 (2009)CrossRefGoogle Scholar
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.