# Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions


## Abstract

We prove theoretical guarantees for an averaging-ensemble of randomly projected Fisher linear discriminant classifiers, focusing on the case when there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permit a direct and much more detailed analysis than is possible with the generic tools used in previous works. In particular, we derive the exact form of the generalization error of our ensemble, conditional on the training set, and from this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly projected ensemble is equivalent to applying a sophisticated regularization scheme to the linear discriminant learned in the original data space, and that this prevents overfitting in conditions of small sample size where pseudo-inverse FLD learned in the data space is provably poor. Our ensemble is learned from a set of randomly projected representations of the original high-dimensional data, so data can be collected, stored and processed entirely in this compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain and one very high-dimensional dataset from the drug discovery domain, both settings in which fewer observations than dimensions are the norm.

## Keywords

Random projections · Ensemble learning · Linear discriminant analysis · Compressed learning · Learning theory

## 1 Introduction

Classification ensembles that use some form of randomization in the design of the base classifiers have a long and successful history in machine learning, especially when there are fewer training observations than data dimensions. Common approaches include bagging (Breiman 1996), random subspaces (Ho 1998), and random forests (Breiman 2001).

Surprisingly, despite the well-known theoretical properties of random projections as dimension-reducing approximate isometries (Dasgupta and Gupta 2002; Achlioptas 2003) and empirical and theoretical studies demonstrating their usefulness when learning a *single* classifier (e.g. Fradkin and Madigan 2003; Durrant and Kabán 2010), results in the literature employing random projections to create weak learners for *ensemble* classification are sparse compared to results for other approaches such as bagging and random subspaces. On the other hand, given their appealing properties and tractability to analysis, random projections seem like a rather natural choice in this setting. Those empirical studies we could find on randomly-projected ensembles in the literature (Goel et al. 2005; Folgieri 2008; Schclar and Rokach 2009) all report good empirical performance from the ensemble, but none attempt a theoretical analysis. Indeed for all of the randomizing approaches mentioned above, despite a wealth of empirical evidence demonstrating the effectiveness of these ensembles, there are very few theoretical studies.

An important paper by Fumera et al. (2008) gives an approximate analytical model as a function of the ensemble size, applicable to linear combiners, which explains the variance reducing property of bagging. However, besides the inherent difficulties with the approach of bias-variance decomposition for classification problems (Schapire et al. 1998), such analysis only serves to relate the performance of an ensemble to its members and Fumera et al. (2008) correctly point out that even for bagging, the simplest such approach and in use since at least 1996, there is ‘no clear understanding yet of the conditions under which bagging outperforms an individual classifier [trained on the full original data set]’. They further state that, even with specific assumptions on the data distribution, such an analytical comparison would be a complex task. In other words, there is no clear understanding yet about when to use an ensemble vs. when to use one classifier.

Here we take a completely different approach to address this last open issue for a specific classifier ensemble: Focusing on an ensemble of randomly projected Fisher linear discriminant (RP-FLD) classifiers as our base learners, we leverage recent random matrix theoretic results to link the performance of the linearly combined ensemble to the corresponding classifier trained on the original data. In particular, we extend and simplify the work of Marzetta et al. (2011) specifically for this classification setting, and one of our main contributions is to derive theoretical guarantees that directly link the performance of the randomly projected ensemble to the performance of Fisher linear discriminant (FLD) learned in the full data space. This theory is, however, not simply of abstract interest: We also show experimentally that the algorithm we analyze here is highly competitive with the state-of-the-art. Furthermore our algorithm has several practically desirable properties, amongst which are: Firstly, the individual ensemble members are learned in a very low-dimensional space from randomly-projected data, and so training data can be collected, stored and processed entirely in this form. Secondly, our approach is fast—training on a single core typically has lower time complexity than learning a regularized FLD in the data space, while for classification the time complexity is the same as the data space FLD. Thirdly, parallel implementation of our approach is straightforward since, both for training and classification, the individual ensemble members can be run on separate cores. Finally, our approach returns an inverse covariance matrix estimate for the full \(d\)-dimensional data space, the entries of which are interpretable as conditional correlations; this may be useful in a wide range of settings.

Our randomly projected ensemble approach can be viewed as a generalization of bagged ensembles, in the sense that here we generate multiple instances of training data by projecting a training set of size \(N\) onto a subspace drawn uniformly at random with replacement from the data space, whereas in bagging one generates instances of training data by drawing \(N'\) training examples uniformly with replacement from a training set of size \(N \geqslant N'\). However, in this setting, an obvious advantage of our approach over bagging is that it is able to repair the rank deficiency of the sample covariance matrix we need to invert in order to build the classifier. In particular, we show that when there are fewer observations than dimensions our ensemble implements a data space FLD with a sophisticated covariance regularization scheme (parametrized by an integer parameter) that subsumes a combination of several previous regularization schemes. In order to see the clear structural links between our ensemble and its data space counterpart we develop our theory in a random matrix theoretic setting. We avoid a bias-variance decomposition approach since, in common with the analysis of Schapire et al. (1998), a key property of our ensemble is that its effect is not simply to reduce the variance of a biased classifier.
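For illustration, the converged rule described above can be sketched in a few lines of numpy. This is a minimal sketch under the stated assumptions (Gaussian projection matrices, shared-covariance FLD), not the authors' exact implementation; the function and variable names are ours. The averaged quantity \(\frac{1}{M}\sum_{i=1}^{M} R_i^\mathrm{T}(R_i\hat{\Sigma }R_i^\mathrm{T})^{-1}R_i\) plays the role of the regularized inverse covariance estimate.

```python
import numpy as np

def rp_fld_ensemble(X, y, k, M, rng=None):
    """Averaging ensemble of M randomly projected FLDs (binary labels 0/1).

    Returns (w, b) defining the combined linear rule sign(w @ x + b)
    in the original d-dimensional data space.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Pooled maximum-likelihood covariance estimate (rank-deficient when N < d).
    Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    Sigma = Xc.T @ Xc / X.shape[0]
    S_inv = np.zeros((d, d))
    for _ in range(M):
        R = rng.standard_normal((k, d))            # k x d random projection
        S_proj = R @ Sigma @ R.T                    # k x k, invertible for k < rank
        S_inv += R.T @ np.linalg.inv(S_proj) @ R    # lift back to d x d
    S_inv /= M                                      # ensemble average
    w = S_inv @ (mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2
    return w, b
```

Note that, unlike bagging, each ensemble member sees every training point, only in a different random \(k\)-dimensional representation.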

A preliminary version of this work (Durrant and Kabán 2013) won the best paper award at the 5th Asian Conference on Machine Learning (ACML 2013). Here we extend that work in several directions, as well as including material omitted there due to space constraints: The high probability lower bound on the generalization error of pseudo-inverted FLD (Theorem 3.3), full proofs of all theorems in the place of the sketches in Durrant and Kabán (2013), and most of Sect. 5 are new. Moreover we have greatly extended the experimental section, adding a 100,000-dimensional dataset, additional carefully-tuned comparison methods, and comparisons of our approach with an ensemble of random subspace FLDs.

The structure of the remainder of the paper is as follows: We give some brief background and describe the randomly projected FLD classifier ensemble. Next, we present theoretical findings that give insight into how this ensemble behaves. We continue by presenting extensive experiments on real datasets from the bioinformatics domain, where FLD (and variants) are a popular classifier choice even though they are often restricted to a diagonal covariance because of high dimensionality and data scarcity (Guo et al. 2007; Dudoit et al. 2002). We further present experimental results on a 100,000-dimensional drug discovery dataset from another problem domain where the small sample size problem typically arises. Our experiments suggest that in practice, when the number of training examples is less than the number of data dimensions, our ensemble approach outperforms the traditional FLD in the data space both in terms of prediction performance and computation time. Finally, we summarize and discuss possible future directions for this and similar approaches.

## 2 Preliminaries

We consider a binary classification problem in which we observe \(N\) i.i.d. examples of labelled training data \(\mathcal {T}_{N}=\{( x_{i},y_{i}): x_i \in \mathbb {R}^d, y_{i} \in \{0,1\} \}_{i=1}^{N}\) where \((x_{i},y_{i}) \overset{i.i.d}{\sim } \mathcal {D}_{x,y}\). We are interested in comparing the performance of a randomly-projected ensemble classifier working in the projected space \(\mathbb {R}^k, k \ll d\), to the performance achieved by the corresponding classifier working in the data space \(\mathbb {R}^d\). We will consider Fisher's linear discriminant classifier working in both of these settings, since FLD is a popular and widely used linear classifier (in the data space setting) and yet it is simple enough to analyse in detail.

We commence our theoretical analysis of this algorithm by examining the expected performance of the RP-FLD ensemble when the training set is fixed, which is central to linking the ensemble and data space classifiers, and then later in Theorem 3.2 we will consider random instantiations of the training set.

## 3 Theory

Our main theoretical results are the following three theorems: the first characterizes the regularization effect of our ensemble, while the second bounds the generalization error of the ensemble for an arbitrary training set of size \(N\) in the case of multivariate Gaussian class-conditional distributions with shared covariance. The third is a finite sample generalization of the negative result of Bickel and Levina (2004) showing that when the data dimension \(d\) is large compared to the rank of \(\hat{\Sigma }\) (which is a function of the sample size) then, with high probability, pseudoinverted FLD performs poorly.

### **Theorem 3.1**

The significance of this theorem from a generalization error analysis point of view stems from the fact that the rank deficient maximum-likelihood covariance estimate has unbounded condition number and, as we see below in Theorem 3.2, (an upper bound on) the generalization error of FLD increases as a function of the condition number of the covariance estimate employed. In turn, the bound given in our Theorem 3.1 depends on the extreme *non-zero* eigenvalues of \(\hat{\Sigma }\), its rank^{1} \(\rho \), and the subspace dimension \(k\), which are all finite for any particular training set instance. We should also note that the subspace dimension \(k\) is a parameter that we can choose, and in what follows \(k\) therefore acts as the integer regularization parameter in our setting.

### **Theorem 3.2**

The principal terms in this bound are: (i) The function \(g:[1,\infty ) \rightarrow (0,\frac{1}{2}]\) which is a decreasing function of its argument and here captures the effect of the mismatch between the estimated model covariance matrix \(\hat{S}^{-1}\) and the true class-conditional covariance \(\Sigma \), via a high-probability upper bound on the condition number of \(\hat{S}^{-1}\Sigma \); (ii) The Mahalanobis distance between the two class centres which captures the fact that the better separated the classes are the smaller the generalization error should be; and (iii) antagonistic terms involving the sample size (\(N\)) and the number of training examples in each class (\(N_{0},N_{1}\)), which capture the effect of class (im)balance—the more evenly the training data are split, the tighter the bound.

We note that a bound on generalization error with similar behaviour can be obtained for the much larger family of sub-Gaussian distributions, or when the true class-conditional covariance matrices are taken to be different (see e.g. Durrant and Kabán 2010; Durrant 2013). Therefore the distributional assumptions of Theorem 3.2 are not crucial.

### **Theorem 3.3**

It is interesting to notice that this lower bound depends on the rank of the covariance estimate, not on its fit to the true covariance \(\Sigma \). Note in particular that when \(N \ll d\) our lower bound explains the bad performance of pseudo-inverted FLD since \(\rho \), the rank of \(\hat{\Sigma }\), is at most \(\min \{N-2,d\}\) and the lower bound of Theorem 3.3 becomes tighter as \(\rho /d\) decreases. Allowing the dimensionality \(d\) to be large, as in Bickel and Levina (2004), so that \(\rho /d \rightarrow 0\), this fraction goes to \(0\) which means the lower bound of Theorem 3.3 converges to \(\Phi (0) = 1/2\)—in other words random guessing.
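The behaviour predicted by Theorem 3.3 is easy to reproduce in simulation. The following sketch is our own construction, not from the paper: it trains pseudo-inverted FLD with \(N \ll d\) on well-separated Gaussian classes with identity covariance, and the resulting test error lands far above the Bayes error, near the random-guessing rate, even though the classes themselves are easy to separate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10                       # 10 training points per class, d = 1000
delta = np.zeros(d)
delta[0] = 4.0                        # true separation: Bayes error Phi(-2) ~ 0.023
X0 = rng.standard_normal((n, d))
X1 = rng.standard_normal((n, d)) + delta

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Xc = np.vstack([X0 - mu0, X1 - mu1])
Sigma_hat = Xc.T @ Xc / (2 * n)       # rank at most 2n - 2 = 18, so rho/d = 0.018
w = np.linalg.pinv(Sigma_hat) @ (mu1 - mu0)
b = -w @ (mu0 + mu1) / 2

m = 2000
Xte = np.vstack([rng.standard_normal((m, d)), rng.standard_normal((m, d)) + delta])
yte = np.array([0] * m + [1] * m)
err = ((Xte @ w + b > 0).astype(int) != yte).mean()
# err is far above the ~2% Bayes error: with rho/d this small,
# pseudo-inverted FLD approaches random guessing.
```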

## 4 Proofs

### 4.1 Proof of Theorem 3.1

Estimating the condition number of \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^{T} \right) ^{-1} R \right]\) is the key result underpinning our generalization error results. We will make use of the following two easy, but useful, lemmas:

### **Lemma 4.1**

### **Lemma 4.2**

(Expected preservation of eigenvectors) Let \(\hat{\Lambda }\) be a diagonal matrix, then \(\text {E}\left[ R^\mathrm{T} \left( R\hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) is a diagonal matrix.

Furthermore, if \(\hat{U}\) diagonalizes \(\hat{\Sigma }\) as \(\hat{\Sigma }=\hat{U}\hat{\Lambda } \hat{U}^\mathrm{T}\), then \(\hat{U}\) also diagonalizes \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Sigma }R^\mathrm{T} \right) ^{-1} R \right]\).

We omit the proofs which are straightforward and can be found in Marzetta et al. (2011).
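The diagonality asserted by Lemma 4.2 is also easy to check numerically. The following Monte Carlo sketch (our own, with arbitrarily chosen dimensions and spectrum) estimates \(\text {E}\left[ R^\mathrm{T} \left( R\hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right]\) for a diagonal \(\hat{\Lambda }\) with Gaussian \(R\), and confirms that the off-diagonal entries vanish up to Monte Carlo noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 8, 3, 20000
lam = np.linspace(0.5, 2.0, d)          # arbitrary positive spectrum
Lam = np.diag(lam)

acc = np.zeros((d, d))
for _ in range(trials):
    R = rng.standard_normal((k, d))      # k x d Gaussian projection
    acc += R.T @ np.linalg.inv(R @ Lam @ R.T) @ R
E = acc / trials                          # Monte Carlo estimate of the expectation

off_diag = np.abs(E - np.diag(np.diag(E))).max()
# off_diag is within sampling noise of zero, while the diagonal entries are not.
```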

Recall that, as a result of Lemmas 4.1 and 4.2, we need only consider the diagonal entries of this expectation, since the off-diagonal terms are known to be zero. The *i*th diagonal element of \(\text {E}\left[ R^\mathrm{T} \left( R \hat{\Lambda } R^\mathrm{T} \right) ^{-1} R \right] \) is \(\text {E}\left[\frac{r_{i}^2}{\sum _{j=1}^{\rho } \lambda _{j} r_{j}^2} \right]\), where \(r_i\) is the *i*th entry of the single-row matrix \(R\). Denoting by \(p_i\) the *i*th column of \(P\), this can be upper and lower bounded by writing and bounding the *i*th diagonal element as:

### 4.2 Proof of Theorem 3.2

Traditionally ensemble methods are regarded as ‘meta-learning’ approaches and although bounds exist (e.g. Koltchinskii and Panchenko 2002) there are, to the best of our knowledge, no results giving the exact analytical form of the generalization error of any particular ensemble. Indeed, in general it is not analytically tractable to evaluate the generalization error exactly, so one can only derive bounds. Because we deal with a particular ensemble of FLD classifiers we are able to derive the exact generalization error of the ensemble in the case of Gaussian classes with shared covariance \(\Sigma \), the setting in which FLD is Bayes’ optimal. This allows us to explicitly connect the performance of the ensemble to its data space analogue. As noted earlier, an upper bound on generalization error with similar behaviour can be derived for the much larger class of sub-Gaussian distributions (see e.g. Durrant and Kabán 2010; Durrant 2013), therefore this Gaussianity assumption is not crucial.

We proceed in two steps: (1) Obtain the generalization error of the ensemble conditional on a fixed training set; (2) Bound the deviation of this error caused by a random draw of a training set.

#### 4.2.1 Generalization error of the ensemble for a fixed training set

For a fixed training set, the generalization error is given by the following lemma:

### **Lemma 4.3**

The proof of this lemma is similar in spirit to the one for a single FLD in Pattison and Gossink (1999). For completeness we give it below.

### 4.3 Proof of Lemma 4.3

A similar argument deals with the case when \(x_{q}\) belongs to class \(1\), and applying the law of total probability completes the proof. \(\square \)

Indeed Eq. (4.5) has the same form as the error of the data space FLD (see e.g. Bickel and Levina 2004; Pattison and Gossink 1999), and the converged ensemble, inspected in the original data space, produces exactly the same mean estimates and covariance eigenvector estimates as FLD working on the original data set. However, it has different eigenvalue estimates, which result from the sophisticated regularization scheme that we analyzed in Sect. 4.1.

#### 4.3.1 Tail bound on the generalization error of the ensemble

The previous section gave the exact generalization error of our ensemble conditional on a given training set. In this section our goal is to derive an upper bound with high probability on the ensemble generalization error w.r.t. random draws of the training set.

We will use the following concentration lemma:

### **Lemma 4.4**

#### 4.3.2 Lower-bounding the term \(A\)

Next, since \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{1}\) and \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{0}\) are independent with \(\Sigma ^{-\frac{1}{2}}\hat{\mu }_{y} \sim \mathcal {N}(\Sigma ^{-\frac{1}{2}}\mu _{y}, I_d/N_{y})\), we have \(\Sigma ^{-\frac{1}{2}}(\hat{\mu }_{1}-\hat{\mu }_{0}) \sim \mathcal {N}(\Sigma ^{-\frac{1}{2}}(\mu _{1}- \mu _{0}), N/(N_{0}N_{1})\cdot I_d )\).

#### 4.3.3 Upper-bounding the term \(B\)

To bound the condition number \(\kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\) with high probability we need the following additional lemma:

### **Lemma 4.5**

#### 4.3.4 Proof of Lemma 4.5

To upper bound the condition number \(\kappa (\hat{S}^{- \frac{1}{2}} \Sigma \hat{S}^{- \frac{1}{2}})\) with high probability, we derive high-probability upper and lower bounds on (respectively) the greatest and least eigenvalues of its argument. We will make use of the following result, Eq. (2.3) from Vershynin (2012):

### **Lemma 4.6**

#### 4.3.5 Upper-bound on largest eigenvalue

#### 4.3.6 Lower-bound on smallest eigenvalue

Back to the proof of Theorem 3.2, substituting into Lemma 4.3 the high probability bounds for \(A\) and \(B\), rearranging, then setting each of the failure probabilities to \(\delta /5\) so that the overall probability of failure remains below \(\delta \), then solving for \(\epsilon \) we obtain Theorem 3.2 after some algebra. For completeness we give these last few straightforward details in Appendix 2. \(\square \)

### 4.4 Proof of Theorem 3.3

Setting both of these exponential risk probabilities to \(\delta /2\) and solving for \(\epsilon _1\) and \(\epsilon _2\), we obtain the lower bound on the generalization error of pseudoinverted FLD, Theorem 3.3. \(\square \)

## 5 Remarks

### 5.1 On the effect of eigenvector misestimation

We have seen that the eigenvector estimates are not affected by the regularization scheme implemented by our converged ensemble. One may then wonder, since we are dealing with small sample problems, how does misestimation of the eigenvectors of \(\Sigma \) affect the classification performance?

Applying this in the small sample setting we consider here, if the eigengaps of \(\Sigma \) are too small we can expect bad estimates of its eigenvectors. However, we have seen in Theorem 3.2 that the generalization error of the ensemble can be bounded above by an expression that depends on covariance misestimation only through the condition number of \(\hat{S}^{-1}\Sigma \equiv (\Sigma +E)^{-1}\Sigma \) so even a large misestimation of the eigenvectors need not have a large effect on the classification performance: if all the eigengaps are small, so that all the eigenvalues of \(\Sigma \) are similar, then poor estimates of the eigenvectors will not affect this condition number too much. Conversely, following Johnstone and Lu (2009) if the eigengaps are large—i.e. we have a very elliptical covariance—then better eigenvector estimates are likely from the same sample size and the condition number of \(\hat{S}^{-1}\Sigma \) should still not grow too much as a result of any eigenvector misestimation. In the case of the toy example above, the eigenvalues of \(\Sigma (\Sigma +E)^{-1}\) are \(\frac{1\pm \epsilon \sqrt{2-\epsilon ^2}}{1-\epsilon ^2}\), so its condition number is \(\frac{1+\epsilon \sqrt{2-\epsilon ^2}}{1-\epsilon \sqrt{2-\epsilon ^2}}\). For small \(\epsilon \) this remains fairly close to one—meaning eigenvector misestimation indeed has a negligible effect on classification performance.
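A quick numerical check of the closed form given above confirms this. The sketch below simply evaluates the condition number expression \(\frac{1+\epsilon \sqrt{2-\epsilon ^2}}{1-\epsilon \sqrt{2-\epsilon ^2}}\) for the toy example (the helper name is ours):

```python
import numpy as np

def kappa(eps):
    """Condition number of Sigma (Sigma + E)^{-1} in the toy example,
    using the closed form given in the text."""
    s = eps * np.sqrt(2.0 - eps ** 2)
    return (1.0 + s) / (1.0 - s)

kappa_small = kappa(0.01)   # stays very close to 1
kappa_medium = kappa(0.1)   # still modest
# For small eps the condition number remains near 1, so eigenvector
# misestimation of this size barely moves the error bound.
```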

### 5.2 On the effect of \(k\)

It is interesting to examine the condition number bound in (4.17) in isolation, and to observe the trade-off in the projection dimension \(k\), which describes very well its role as regularization parameter in the context of our RP-FLD ensemble: to make the numerator smaller \(k\) needs to be large, while to make the denominator larger it needs to be small. We also see natural behaviour with \(N\), \(d\) and the conditioning of the true covariance. From Eqs. (4.13) and (4.16) we see that the condition number bounded by Eq. (4.17) is the only term in the generalization error bound affected by the choice of \(k\), so we can also partly answer the question left open in Marzetta et al. (2011) about how the optimal \(k\) depends on the problem characteristics, from the perspective of classification performance, by reading off the most influential dependencies of the optimal \(k\) on those characteristics. The first term in the numerator of (4.17) contains \(d\) but not \(k\), while the remaining terms contain \(k\) but not \(d\), so we infer that in the setting \(k < \rho -1 < d\) the optimal choice of \(k\) is not affected by the dimensionality \(d\). Noting that for \(N < d\) and Gaussian class-conditionals we have \(\rho = N-2\) with probability \(1\), we see also that for small \(N\) or \(\rho \) the minimizer of this condition number is achieved by a smaller \(k\) (meaning a stronger regularizer), as is the case for a small \(\kappa (\Sigma )\). Conversely, when \(N\), \(\rho \), or \(\kappa (\Sigma )\) is large then \(k\) should also be large to minimize the bound.

It is also interesting to note that the regularization scheme implemented by our ensemble has a particularly pleasing form. Shrinkage regularization is the optimal regularizer (w.r.t the Frobenius norm) in the setting when there are sufficient samples to make a full rank estimation of the covariance matrix (Ledoit and Wolf 2004), and therefore one would also expect it to be a good choice for regularization in the range space of \(\hat{\Sigma }\). Furthermore ridge regularization in the null space of \(\hat{\Sigma }\) can also be considered optimal in the following sense—its effect is to ensure that any query point lying entirely in the null space of \(\hat{\Sigma }\) is assigned the maximum likelihood estimate of its class label (i.e. the label of the class with the nearest mean).

### 5.3 Bias of the ensemble

Note that plugging the expectation examined above into the classifier corresponds to an ensemble with infinitely many members. Therefore, for any choice of \(k < \rho -1\), although we can underfit with a poor choice of \(k\), we cannot overfit any worse than the pseudo-inverse FLD data space classifier, regardless of the ensemble size, since we do not learn any combination weights from the data. This is quite unlike adaptive ensemble approaches such as AdaBoost, where it is well known that increasing the ensemble size can indeed lead to overfitting. Furthermore, we shall see from the experiments in Sect. 6 that this guarantee vs. the performance of pseudo-inversion appears to be a conservative prediction of the performance achievable by our randomly-projected ensemble.

### 5.4 Time complexities for the RP-FLD ensemble

We noted in the Introduction that our ensemble, although simple to implement, is also fast. Here we briefly compare the time complexity of our ensemble approach (for a finite ensemble) with that for regularized FLD learnt in the data space.

The time complexity of training a regularized FLD in the data space is dominated by the cost of inverting the estimated covariance matrix \(\hat{\Sigma }\) (Duda et al. 2000), which is \(\mathcal {O}(d^3)\), or \(\mathcal {O}(d^{\log _{2} 7}) \simeq \mathcal {O}(d^{2.807})\) using Strassen’s algorithm (Golub and Van Loan 2012).^{2} On the other hand, obtaining a full-rank inverse covariance estimate in the data space using our ensemble requires \(M \in \mathcal {O}\left( \lceil d/k \rceil \right) \) members, and our experimental results in Sect. 6 suggest that \(M\) of this order is indeed enough to get good classification performance. Using this, and taking into account the \(M\) \(k\times d\) matrix multiplications required to construct the randomly-projected training sets, the time complexity of training our algorithm is \(\mathcal {O}(\frac{d}{k}(Nkd+k^3)) = \mathcal {O}(Nd^{2}+k^{2}d)\) overall, where the \(k^3\) term comes from inverting the full-rank covariance matrix estimate in the projected space. Since we have \(k < \rho -1 < N \ll d\) this is generally considerably faster than learning regularized FLD in the original data space; furthermore, by using sparse random projection matrices such as those described in Achlioptas (2003), Ailon and Chazelle (2006) or Matoušek (2008), one can considerably improve the constant factors hidden by the \(\mathcal {O}\).
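The sparse projections of Achlioptas (2003) mentioned above are simple to generate. The following sketch (our construction, not from the paper) draws entries from \(\{+\sqrt{3}, 0, -\sqrt{3}\}\) with probabilities \(\{1/6, 2/3, 1/6\}\), so roughly two thirds of the multiplications in each projection can be skipped while norms are still preserved in expectation:

```python
import numpy as np

def achlioptas_rp(k, d, rng=None):
    """Sparse k x d random projection with entries in {+sqrt(3), 0, -sqrt(3)}.

    Entries take +sqrt(3) w.p. 1/6, 0 w.p. 2/3, -sqrt(3) w.p. 1/6,
    so each entry has unit variance; the 1/sqrt(k) scaling then gives
    E[||R x||^2] = ||x||^2 for any fixed vector x.
    """
    rng = np.random.default_rng(rng)
    R = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                   size=(k, d), p=[1 / 6, 2 / 3, 1 / 6])
    return R / np.sqrt(k)
```

Such a matrix can be stored and applied in sparse form, which is where the constant-factor savings come from.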

## 6 Experiments

We now present experimental results which show that our ensemble approach is competitive with the state of the art in terms of prediction performance. We do not claim of course that the choice of FLD as a classifier is optimal for these data sets; rather we demonstrate that the various practical advantages of our RP-FLD approach that we listed in the Introduction and Sect. 5.4, and most importantly its analytical tractability, do not come at a cost in terms of prediction performance.

### 6.1 Datasets

### 6.2 Protocol

We standardized each data set to have features with mean 0 and variance 1. For dorothea we removed features with zero variance, there were 8402 such features which left a working dimensionality of 91598; we did not do any further feature selection filtering to avoid any external effects in our comparison. For the first five datasets we ran experiments on 100 independent splits, and in each split we took 12 points for testing and used the remainder for training. For dorothea we used the same data split as was used in the NIPS challenge, taking the provided 800 point training set for training and the 350 point validation set for testing. We ran 10 instances for each combination of projection dimension, projection method, and ensemble size—that is 1120 experiments.

For our data space experiments on colon and leukaemia we used FLD with ridge regularization and fitted the regularization parameter by 5-fold cross-validation independently on each training set, following Cawley and Talbot (2010), searching in the set \(\{2^{-11}, 2^{-10},\ldots ,2\}\). However, on these data this provided no statistically significant improvement over employing a diagonal covariance in the data space, most likely because of the data scarcity. Therefore for the remaining three bioinformatics datasets (which are even higher dimensional) we used diagonal FLD in the data space. Indeed, since diagonal FLD is in use for gene array data sets (Dudoit et al. 2002) despite the features being known to be correlated (the diagonality constraint acting as a form of regularization), one of the useful benefits of our ensemble is that such a constraint is no longer necessary.

To satisfy ourselves that building on FLD was a reasonable choice of classifier we also ran experiments in the data space using classical SVM (using the matlab implementation of Cawley (2000) on the first five datasets, and the ‘liblinear’ toolbox (Fan et al. 2008), which is specialised for very large datasets, for dorothea) and \(\ell _1\)-regularized SVM (Fan et al. 2008) with linear kernel. In all SVMs the \(C\) parameter was fitted by 5-fold cross-validation as above, with search in the set \(\{2^{-10},2^{-9},\ldots ,2^{10}\}\).

For the dorothea dataset it was impractical to construct the full FLD in the data space, since the covariance matrix would not fit in memory on the authors’ machines. We linearized diagonal FLD to get around this issue, but its performance was extremely poor (accuracy of 0.2686) and, since the classical linear SVM is also known to perform poorly on this dataset (Guyon et al. 2006, 2004), for dorothea we baselined against Bernoulli Naïve Bayes (without preprocessing the binary data), following the advice of the challenge organiser to her students given in Guyon et al. (2006). We also ran \(\ell _1\)-regularized SVM (Fan et al. 2008), which performed well on this data set.

For all experiments carried out in the projected space, the randomly projected base learners are FLDs with full covariance and no regularization (since we choose \(k < \rho -1\) and so the projected sample covariances are invertible).

### 6.3 Results

For the five bioinformatics datasets, in each case we compare the performance of the RP ensembles with (regularized) FLD in the data space, vanilla and \(\ell _1\)-regularized SVM, and (as suggested by one of the anonymous referees) with an ensemble of Random Subspace (RS) FLD classifiers.^{3} For dorothea we also compare our RP-FLD ensemble with Bernoulli Naïve Bayes.

Mean error rates \(\pm \) 1 standard error, and CPU times, estimated from 100 independent splits (10 instances of the fixed split for dorothea), for random projection ensembles with 100 (RP-Ens M=100) or 1000 (RP-Ens M=1000) members, and competing methods (see text for details). For both RP-ensembles and RS-ensembles \(k=\rho /2\) was used

| Dataset | \(\rho /2\) | Method | Error (%) | Sig. | CPU time (s) |
|---|---|---|---|---|---|
| Colon | 24 | RP-Ens M=100 | \(13.50 \pm 0.88\) | | \(0.18 \pm 0.002\) |
| | | RP-Ens M=1000 | \(13.08 \pm 0.88\) | | \(1.48 \pm 0.008\) |
| | | FLD-full | \(15.50 \pm 0.89\) | −− | \(10.01 \pm 0.064\) |
| | | SVM | \(11.58 \pm 0.89\) | ++ | \(0.54 \pm 0.001\) |
| | | SVM L1 | \(15.83 \pm 1.01\) | −− | \(0.53 \pm 0.002\) |
| | | RS-Ens M=100 | \(12.83 \pm 0.82\) | + | \(0.12 \pm 0.025\) |
| | | RS-Ens M=1000 | \(12.58 \pm 0.81\) | | \(0.88 \pm 0.000\) |
| Leukaemia | 29 | RP-Ens M=100 | \(2.08 \pm 0.40\) | | \(0.35 \pm 0.004\) |
| | | RP-Ens M=1000 | \(1.67 \pm 0.33\) | | \(3.31 \pm 0.029\) |
| | | FLD-full | \(2.17 \pm 0.39\) | − | \(44.99 \pm 0.261\) |
| | | SVM | \(1.67 \pm 0.36\) | | \(1.09 \pm 0.004\) |
| | | SVM L1 | \(6.08 \pm 0.66\) | −− | \(1.07 \pm 0.003\) |
| | | RS-Ens M=100 | \(1.83 \pm 0.35\) | | \(0.18 \pm 0.000\) |
| | | RS-Ens M=1000 | \(1.83 \pm 0.37\) | | \(1.81 \pm 0.001\) |
| Leuk-large | 29 | RP-Ens M=100 | \(2.25 \pm 0.44\) | | \(0.59 \pm 0.003\) |
| | | RP-Ens M=1000 | \(1.92 \pm 0.41\) | | \(6.30 \pm 0.056\) |
| | | FLD-diag | \(13.33 \pm 1.09\) | −− | \(0.48 \pm 0.003\) |
| | | SVM | \(3.50 \pm 0.46\) | −− | \(2.18 \pm 0.012\) |
| | | SVM L1 | \(2.83 \pm 0.55\) | − | \(7.03 \pm 0.075\) |
| | | RS-Ens M=100 | \(3.33 \pm 0.56\) | −− | \(0.44 \pm 0.006\) |
| | | RS-Ens M=1000 | \(2.33 \pm 0.49\) | | \(4.16 \pm 0.044\) |
| Prostate | 44 | RP-Ens M=100 | \(7.42 \pm 0.70\) | | \(0.82 \pm 0.005\) |
| | | RP-Ens M=1000 | \(7.00 \pm 0.70\) | | \(8.15 \pm 0.054\) |
| | | FLD-diag | \(38.33 \pm 1.57\) | −− | \(0.35 \pm 0.000\) |
| | | SVM | \(7.33 \pm 0.72\) | | \(2.91 \pm 0.023\) |
| | | SVM L1 | \(6.75 \pm 0.73\) | | \(2.85 \pm 0.008\) |
| | | RS-Ens M=100 | \(8.75 \pm 0.71\) | −− | \(0.56 \pm 0.009\) |
| | | RS-Ens M=1000 | \(8.92 \pm 0.73\) | −− | \(4.95 \pm 0.026\) |
| Duke | 15 | RP-Ens M=100 | \(17.50 \pm 1.28\) | | \(0.33 \pm 0.002\) |
| | | RP-Ens M=1000 | \(15.67 \pm 1.25\) | + | \(3.28 \pm 0.023\) |
| | | FLD-diag | \(30.58 \pm 1.57\) | −− | \(0.47 \pm 0.000\) |
| | | SVM | \(13.50 \pm 1.10\) | ++ | \(0.90 \pm 0.001\) |
| | | SVM L1 | \(17.42 \pm 1.05\) | | \(1.14 \pm 0.004\) |
| | | RS-Ens M=100 | \(19.25 \pm 1.30\) | − | \(0.21 \pm 0.000\) |
| | | RS-Ens M=1000 | \(18.67 \pm 1.32\) | − | \(2.12 \pm 0.002\) |

Dataset | \(\rho /2\) | Method | Error | | CPU Time (s) |
---|---|---|---|---|---|

Dorothea | 399 | RP-Ens M=100 | \(8.66 \pm 0.044\) | \(211.56 \pm 1.944\) | |

RP-Ens M=1000 | \(8.80 \pm 0.038\) | \(2149.00 \pm 24.910\) | |||

Bernoulli NB | \(33.43 \) | \(\varvec{-}\) \(\varvec{-}\) | \(4.00 \) | ||

FLD-diag | \(71.34 \) | \(\varvec{-}\) \(\varvec{-}\) | \(72.98 \) | ||

SVM | \(86.86 \) | \(\varvec{-}\) \(\varvec{-}\) | \(308.64 \) | ||

SVM L1 | \(6.00 \) | \(\varvec{+}\) \(\varvec{+}\) | \(958.53 \) | ||

RS-Ens M=100 | \(8.57 \pm 0.000\) | \(122.33 \pm 1.312\) | |||

RS-Ens M=1000 | \(8.63 \pm 0.000\) | \(1233.33 \pm 10.563\) |

We see from Table 2 that with \(M=1000\) members in our ensemble, the SVM outperforms our ensemble on two datasets (colon and duke), we outperform it on two (leukaemia-large and dorothea), and no statistically significant difference is found on the remaining two. The \(\ell _1\)-regularised SVM does better than us on one dataset (dorothea), we outperform it on three (colon, leukaemia, and leukaemia-large), and there is no significant difference on the remaining two. We outperform the random subspace ensemble with 1000 members on two datasets (duke and prostate), with no significant difference found on the remaining four.

The picture is much the same for our method with \(M=100\) ensemble members, except that there is no significant difference from the \(\ell _1\)-regularised SVM on leukaemia-large, and the random subspace ensemble with 100 members beats us on colon and leukaemia while showing no difference on duke. In fact, our ensemble with 1000 members differs significantly from that with 100 members on only one dataset (duke).

The random subspace FLD ensemble wins over our RP-FLD ensemble with respect to computation time, although this difference is confined to training time, since the time complexity of classification is \(\mathcal {O}(d)\) in both cases. Interestingly, the overall error performance of the random subspace ensembles is only slightly behind that of the random projection ensembles. Since trading a small amount of accuracy for a speed-up may be desirable in some applications, an interesting research question is whether theoretical guarantees similar to those we obtained for our RP-FLD ensemble can be proved in the random subspace case. Nevertheless, the computation time of our RP-FLD ensemble is comparable with the sophisticated liblinear implementation of \(\ell _1\)-regularised SVM, as is its performance. In fact, on three of the six datasets tested none of the competing methods outperformed our RP-FLD ensemble at the 95 % confidence level.

## 7 Discussion and future work

We considered a randomly projected (RP) ensemble of FLD classifiers and gave theory which, for a fixed training set, explicitly links this ensemble classifier to its data space analogue. We showed that the RP ensemble implements an implicit regularization of the corresponding FLD classifier in the data space, and we demonstrated experimentally that the ensemble can recover or exceed the performance of a carefully-fitted ridge-regularized data space equivalent, generally at lower computational cost. Our theory guarantees that, for most choices of projection dimension \(k\), the error of a large ensemble remains bounded even when the number of training examples is far lower than the number of data dimensions, and it gives a good understanding of the effect of our discrete regularization parameter \(k\). In particular, we argued that the regularization parameter \(k\) allows us to finesse the known issue of poor eigenvector estimates in this setting. We also demonstrated empirically that with an appropriate choice of \(k\) we can obtain good generalization performance even with few training examples; the rule of thumb \(k=\rho /2\) appears to work well.
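The implicit-regularization view can be sketched in formulas. Each base learner's score is linear in the query point \(x_q\), so averaging over \(M\) learners and letting \(M\to\infty\) replaces the empirical average by an expectation over the random projection \(R\) (a sketch in our notation, conditional on the training set; the exact converged form and its properties are what the theory above establishes):

```latex
% M-average of the base learners' discriminant scores for a query x_q:
\lim_{M\to\infty}\frac{1}{M}\sum_{i=1}^{M}
  (\hat{\mu}_1-\hat{\mu}_0)^{\mathrm{T}} R_i^{\mathrm{T}}
  \big(R_i\hat{\Sigma}R_i^{\mathrm{T}}\big)^{-1} R_i
  \Big(x_q-\tfrac{\hat{\mu}_0+\hat{\mu}_1}{2}\Big)
= (\hat{\mu}_1-\hat{\mu}_0)^{\mathrm{T}}\,
  \mathbb{E}_{R}\!\left[R^{\mathrm{T}}\big(R\hat{\Sigma}R^{\mathrm{T}}\big)^{-1}R\right]
  \Big(x_q-\tfrac{\hat{\mu}_0+\hat{\mu}_1}{2}\Big)
% i.e. the large ensemble behaves like a single FLD in which the singular
% \hat{\Sigma}^{+} is replaced by the full-rank matrix
% E_R[ R^T (R \hat{\Sigma} R^T)^{-1} R ], with k controlling the
% amount of regularization.
```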

We showed that, for classification, the optimal choice of \(k\) depends on the true data parameters (which are unknown), thereby partly answering, in the negative, the question in Marzetta et al. (2011) of whether a simple formula for the optimal \(k\) exists.

It would be interesting to extend this work to obtain similar guarantees for ensembles of generic randomly-projected linear classifiers in convex combination, and for an ensemble of random subspace FLDs; we are working on ways to do this. Furthermore, it would be interesting to derive a concentration inequality for matrices in the p.s.d. ordering, in order to quantify the probability that a finite ensemble is far from its expectation. This, however, appears to be far from straightforward: the rank deficiency of \(\hat{\Sigma }\) is the main technical obstacle.

## Footnotes

- 1. In the setting considered here we typically have \(\rho = N-2\).
- 2. We note that pseudo-inverting \(\hat{\Sigma }\) or inverting a diagonal covariance matrix has typical time complexity \(\mathcal {O}(Nd^2)\) or \(\mathcal {O}(d)\) respectively. However, as we see from Theorem 3.3 and the experiments in Sect. 6, the cost in classification performance of these approaches can be prohibitive.

- 3. The Random Subspace method (Ho 1998) projects onto the span of \(k\) randomly chosen canonical basis vectors. Note that the theory developed here applies to Gaussian random projection, which is different from random subspace projection. The RS-FLD decision rule is \(\hat{h}_{P}(x_{q}){:=}\mathbf {1} \left\{ (\hat{\mu }_{1}- \hat{\mu }_{0})^{\mathrm{T}}P^{\mathrm{T}}(P\hat{\Sigma }P^{\mathrm{T}})^{+}P\left( x_{q}- \frac{\hat{\mu }_{0}+ \hat{\mu }_{1}}{2}\right) > 0 \right\} \), where \(P\) is a canonical projection matrix; in particular \(P\hat{\Sigma }P^{\mathrm{T}}\) need not be full rank, hence the pseudo-inverse.
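For concreteness, the RS-FLD decision rule above amounts to keeping \(k\) randomly chosen coordinates and applying pseudo-inverse FLD there. A minimal sketch of one such vote (ours, not the authors' code; function and variable names are our own):

```python
import numpy as np

def rs_fld_decision(x_q, X, y, k, rng):
    """One random-subspace FLD vote: project onto k randomly chosen
    canonical basis vectors (i.e. keep k coordinates of the data) and
    apply FLD with the pseudo-inverse of the subsampled covariance,
    as in the footnoted decision rule."""
    d = X.shape[1]
    idx = rng.choice(d, size=k, replace=False)     # canonical projection P
    Z = X[:, idx]                                  # P applied to the training data
    mu0, mu1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    Zc = np.vstack([Z[y == 0] - mu0, Z[y == 1] - mu1])
    Sigma = Zc.T @ Zc / (len(y) - 2)               # P Sigma-hat P^T
    w = np.linalg.pinv(Sigma) @ (mu1 - mu0)        # pseudo-inverse, not inverse
    return float(w @ (x_q[idx] - (mu0 + mu1) / 2)) > 0
```

Unlike the Gaussian projections used in our ensemble, the subsampled covariance \(P\hat{\Sigma }P^{\mathrm{T}}\) here may be singular, which is why the rule uses the pseudo-inverse.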

## References

- Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. *Journal of Computer and System Sciences*, *66*(4), 671–687.
- Ailon, N., & Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform. In *Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing* (pp. 557–563).
- Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. *Proceedings of the National Academy of Sciences*, *96*(12), 6745.
- Arriaga, R., & Vempala, S. (1999). An algorithmic theory of learning: Robust concepts and random projection. In *40th Annual Symposium on Foundations of Computer Science* (pp. 616–623).
- Bickel, P., & Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naïve Bayes', and some alternatives when there are many more variables than observations. *Bernoulli*, *10*(6), 989–1010.
- Breiman, L. (1996). Bagging predictors. *Machine Learning*, *24*(2), 123–140.
- Brown, G. (2009). Ensemble learning. In C. Sammut & G. Webb (Eds.), *Encyclopedia of Machine Learning*. Berlin: Springer.
- Cawley, G. C. (2000). MATLAB Support Vector Machine Toolbox (v0.55\(\beta \)). School of Information Systems, University of East Anglia, Norwich, Norfolk, UK. http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox. Accessed 28 June 2014.
- Cawley, G. C., & Talbot, N. L. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. *Journal of Machine Learning Research*, *99*, 2079–2107.
- Dasgupta, S., & Gupta, A. (2002). An elementary proof of the Johnson–Lindenstrauss Lemma. *Random Structures & Algorithms*, *22*, 60–65.
- Duda, R., Hart, P., & Stork, D. (2000). *Pattern classification* (2nd ed.). New York: Wiley.
- Dudoit, S., Fridlyand, J., & Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. *Journal of the American Statistical Association*, *97*(457), 77–87.
- Durrant, R. J. (2013). Learning in high dimensions with projected linear discriminants. Ph.D. thesis, School of Computer Science, University of Birmingham.
- Durrant, R., & Kabán, A. (2010). Compressed Fisher linear discriminant analysis: Classification of randomly projected data. In *Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010)*.
- Durrant, R., & Kabán, A. (2012). Error bounds for kernel Fisher linear discriminant in Gaussian Hilbert space. In *Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AIStats 2012)*.
- Durrant, R., & Kabán, A. (2013). Random projections as regularizers: Learning a linear discriminant ensemble from fewer observations than dimensions. In *Proceedings of the 5th Asian Conference on Machine Learning (ACML 2013)*.
- Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. *Journal of the Royal Statistical Society B*, *70*(Part 5), 849–911.
- Fan, R., Chang, K., Hsieh, C., Wang, X., & Lin, C. (2008). LIBLINEAR: A library for large linear classification. *Journal of Machine Learning Research*, *9*, 1871–1874.
- Folgieri, R. (2008). Ensembles based on random projection for gene expression data analysis. Ph.D. thesis, Università degli Studi di Milano.
- Fradkin, D., & Madigan, D. (2003). Experiments with random projections for machine learning. In *Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (pp. 522–529).
- Fumera, G., Roli, F., & Serrau, A. (2008). A theoretical analysis of bagging as a linear combination of classifiers. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *30*(7), 1293–1299.
- Goel, N., Bebis, G., & Nefian, A. (2005). Face recognition experiments with random projection. *Proceedings of SPIE*, *5779*, 426.
- Golub, G., & Van Loan, C. (2012). *Matrix computations* (Vol. 3). London: JHU Press.
- Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. *Science*, *286*(5439), 531.
- Guo, Y., Hastie, T., & Tibshirani, R. (2007). Regularized linear discriminant analysis and its application in microarrays. *Biostatistics*, *8*(1), 86–100.
- Guyon, I. NIPS 2003 Feature Selection Challenge: Dorothea dataset. Retrieved April 14, 2014, from http://www.nipsfsc.ecs.soton.ac.uk/datasets/DOROTHEA.zip.
- Guyon, I., Gunn, S. R., Ben-Hur, A., & Dror, G. (2004). Result analysis of the NIPS 2003 feature selection challenge. *NIPS*, *4*, 545–552.
- Guyon, I., Li, J., Mader, T., Pletscher, P., Schneider, G., & Uhr, M. (2006). Feature selection with the CLOP package. Technical report. Retrieved April 14, 2014, from http://clopinet.com/isabelle/Projects/ETH/TM-fextract-class.
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). *The elements of statistical learning: Data mining, inference, and prediction*. Berlin: Springer.
- Ho, T. (1998). The random subspace method for constructing decision forests. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *20*(8), 832–844.
- Horn, R., & Johnson, C. (1985). *Matrix analysis*. Cambridge: Cambridge University Press.
- Johnstone, I., & Lu, A. (2009). On consistency and sparsity for principal components analysis in high dimensions. *Journal of the American Statistical Association*, *104*(486), 682–693.
- Koltchinskii, V., & Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. *The Annals of Statistics*, *30*(1), 1–50.
- Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. *Journal of Multivariate Analysis*, *88*(2), 365–411.
- Maniglia, S., & Rhandi, A. (2004). Gaussian measures on separable Hilbert spaces and applications. *Quaderni del Dipartimento di Matematica dell'Università del Salento*, *1*(1), 1–24.
- Mardia, K., Kent, J., & Bibby, J. (1979). *Multivariate analysis*. London: Academic Press.
- Marzetta, T., Tucci, G., & Simon, S. (2011). A random matrix-theoretic approach to handling singular covariance estimates. *IEEE Transactions on Information Theory*, *57*(9), 6256–6271.
- Matoušek, J. (2008). On variants of the Johnson–Lindenstrauss lemma. *Random Structures & Algorithms*, *33*(2), 142–156.
- Pattison, T., & Gossink, D. (1999). Misclassification probability bounds for multivariate Gaussian classes. *Digital Signal Processing*, *9*, 280–296.
- Paul, D., & Johnstone, I. (2012). Augmented sparse principal component analysis for high dimensional data. arXiv:1202.1242.
- Raudys, S., & Duin, R. (1998). Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. *Pattern Recognition Letters*, *19*(5), 385–392.
- Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. *The Annals of Statistics*, *26*(5), 1651–1686.
- Schclar, A., & Rokach, L. (2009). Random projection ensemble classifiers. In J. Filipe, J. Cordeiro, W. Aalst, J. Mylopoulos, M. Rosemann, M. J. Shaw, & C. Szyperski (Eds.), *Enterprise Information Systems. Vol. 24 of Lecture Notes in Business Information Processing* (pp. 309–316). Berlin: Springer.
- Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. *Cancer Cell*, *1*(2), 203–209.
- Tulino, A., & Verdú, S. (2004). *Random matrix theory and wireless communications*. Hanover, MA: Now Publishers Inc.
- Vershynin, R. (2012). Introduction to non-asymptotic random matrix theory. In Y. Eldar & G. Kutyniok (Eds.), *Compressed sensing: Theory and applications* (pp. 210–268). Cambridge: Cambridge University Press.
- Vu, V. (2011). Singular vectors under random perturbation. *Random Structures & Algorithms*, *39*(4), 526–538.
- Vu, V., & Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. In *Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012)* (Vol. 22, pp. 1278–1286).
- West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., et al. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. *Proceedings of the National Academy of Sciences*, *98*(20), 11462.