1 Introduction

In many machine learning applications, the size of the training data is small relative to the number of features, and acquiring more data may be too costly or time-consuming. For example, a standard dataset to predict breast cancer from gene-expression data has only 99 positive examples for 7,650 features (Sotiriou et al. 2003), and datasets with smaller training sizes are also of interest (Blagus and Lusa 2013). In computational advertising, we must learn to predict whether an ad is relevant to a person after seeing only limited data for that ad. Since there are many available ads, waiting for more training data can reduce ad revenue. Limited training data also leads to the “cold-start” problem in recommendation systems, where we must quickly tune our recommendations for new users, or find relevant matches for new items. Thus, the limited-data setting is widely applicable.

We consider the problem of learning a linear classifier from limited training data. In such cases, overfitting is common. In other words, the feature weight vector that minimizes the training loss often has a test loss that is much worse than the training loss. The usual solution for such overfitting is to do dimensionality reduction or regularization. Dimensionality reduction reduces the number of features, while regularization keeps all features but penalizes large weights. For example, for the least-squares loss, Principal Components Regression (PCR) does dimensionality reduction, while Ridge and LASSO do regularization based on the \(L_2\) and \(L_1\) norms respectively.
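To make the contrast concrete, the following is a minimal scikit-learn sketch (our own illustration; the synthetic dataset and hyperparameters are arbitrary, not from the paper) comparing PCR with Ridge and LASSO in a limited-data regime:

```python
# Illustrative comparison of dimensionality reduction (PCR) vs. norm-based
# regularization (Ridge, Lasso) for least-squares, on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 40, 100                      # limited data: fewer samples than features
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                 # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# PCR = project onto top principal components, then ordinary least squares
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
models = {"PCR": pcr, "Ridge": Ridge(alpha=1.0), "Lasso": Lasso(alpha=0.1)}
for name, model in models.items():
    model.fit(X, y)
    X_test = rng.standard_normal((500, p))
    mse = np.mean((X_test @ beta_true - model.predict(X_test)) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
```

Which of the three wins depends on the data-generating process, which is exactly the difficulty the paper raises next.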

However, both dimensionality reduction and regularization have weaknesses. Dimensionality reduction ignores information. For example, PCR uses only the top few principal components of the feature matrix. But, if the top principal components are uncorrelated with the response variable, the PCR solution may perform poorly (Jolliffe 1982). For regularization, it is not easy to choose the best \(L_p\) norm. It depends on the dataset, the training size, and the loss function. Different norms can lead to significantly different test losses. Existing explanations of regularization rely on priors, or a distance between probability distributions, or a distance metric in feature space. It is not clear why such inputs are needed and how we should choose them in practice. This motivates the following problem:

How can we build a linear classifier that (a) outperforms both dimensionality reduction and norm-based regularization, (b) works for a wide range of loss functions, and (c) needs no user input such as a norm or a prior?

Our proposed method, called RoLin (RObust LINear classification), aims to achieve this by combining dimensionality reduction with robust optimization. Dimensionality reduction methods such as PCR rely on the idea that solutions constructed from the top singular vectors (principal components) are less prone to overfitting. Like PCR, RoLin first constructs such a classifier. But, unlike PCR, we do not ignore the bottom singular vectors. Training and test loss can indeed be very different for data projected on to the bottom singular vectors. But even from this “unreliable” projected data, we may be able to estimate some low-order moments, such as the mean and some aspects of the covariance. Now, the loss function depends on the entire data distribution, not just the low-order moments. So, RoLin constructs a “worst-case” distribution that matches the low-order moments. Then it finds a classifier that has the smallest loss under this distribution. Now, we have two classifiers: one from the top principal components, and a robust one from the rest. RoLin combines them into a single classifier that captures all available information. Thus, RoLin goes beyond the top few principal components, but still avoids overfitting. Figure 1 shows the intuition behind RoLin.

Fig. 1 Overview of RoLin: (a) We can reliably estimate the top Principal Components (PCs) of the data distribution from limited training data. In this example, only the first PC is reliably estimated. (b) We project the data onto the reliable and unreliable subspaces. (c) We find the classifier with the least training loss on data projected on the reliable subspace. (d) But for the orthogonal subspace, such a classifier may overfit. Linear classifiers such as logistic regression minimize a loss that is a function of \({\varvec{z}}:=y\cdot {\tilde{\varvec{x}}}\), where \(\tilde{\varvec{x}}\) is a projected datapoint. Given limited training data, even the covariance estimate of \({\varvec{z}}\) can be noisy (Marcenko and Pastur 1967). So, minimizing the loss over the empirical distribution of \({\varvec{z}}\) can yield classifiers (shown by the dashed line) that have low training loss but much higher test loss. (e) RoLin builds a robust covariance of \({\varvec{z}}\). (f) It uses this to construct a maximum-uncertainty distribution for \({\varvec{z}}\). (g) Optimizing on this distribution gives a robust classifier, which RoLin combines with the reliable classifier from step (c).

We summarize our main contributions below.

A new approach to avoid overfitting Given limited training data, the top principal components are often reliable while the bottom components are noisy. Motivated by this, RoLin processes the top and bottom principal components differently. Reliable data from the top components is used directly, while noisy data from the bottom components is filtered through a robust optimization. This ensures that all available reliable information is extracted and used in building RoLin’s classifier. By limiting the robust optimization to the unreliable subspace, RoLin avoids becoming too conservative.

No user choice needed In contrast to existing regularization methods, RoLin does not force the user to choose a norm, prior, or distance metric.

Applicable to many loss functions RoLin works, unchanged, for the logistic, hinge, squared hinge, and modified Huber losses, among others. In particular, we can use RoLin for both logistic regression and linear SVMs.

Robust cross-validation Existing cross-validation methods identify overfitting classifiers by their poor accuracy on holdout sets. But in limited-data settings, holdout sets are small and holdout accuracy may be too noisy. We develop a new cross-validation method, called RobustCV, that checks for several signs of overfitting that are missed by standard cross-validation.

Empirical results We compare RoLin against competing methods on three loss functions and 25 real-world datasets, where the number of features ranges from \(p=8\) to \(p=43,680\). We test each dataset under five different training sizes, from \(n=15\) to \(n=200\) samples. RoLin outperforms dimensionality reduction as well as \(L_1\) and \(L_2\) regularization. Dimensionality reduction has \(14\%-40\%\) worse loss on average than RoLin, under all problem settings. For some datasets, dimensionality reduction can be 4x worse. Under logistic loss, RoLin can be up to 3x better than the best norm-based regularization. Under squared hinge loss, RoLin can be up to 12x better. RoLin performs particularly well for small training sizes, where robustness is most important. When 50 or fewer training samples are available, RoLin achieves the smallest loss on around 2x to 3x as many datasets as the next best method, depending on the loss function. For some datasets, RoLin with \(n=15\) samples is better than both \(L_1\) and \(L_2\) regularization with \(n=1500\) samples. Finally, among the competitors of RoLin, no single method dominates, and it is challenging to choose the best method for a given dataset, loss function, and training size. In contrast, we find that RoLin works well for all datasets under all problem settings.

The rest of the paper is organized as follows. We present our robust formulation and the main theorems in Sect. 2. Section 3 provides detailed algorithms for RoLin and RobustCV. Section 4 presents empirical results. We discuss prior work on overfitting in Sect. 5, and we conclude in Sect. 6. All proofs are deferred to Appendix A.

2 Robust minimization of expected loss

We are given n independent training samples from some distribution \(\mathcal {D}\) of pairs \(({\varvec{x}}, y)\in \mathbb {R}^p\times \{-1, 1\}\), where \({\varvec{x}} \) is a feature vector with p features, and y is a binary class label. We want to train a classifier, parameterized by \(\varvec{\beta }\), to output a positive score \({g_{\varvec{\beta }} ({\varvec{x}})} \) when it predicts \(y=1\), and a negative score otherwise. The quality of classification is measured by a loss function \(\ell (y, {g_{\varvec{\beta }} ({\varvec{x}})})\). The best classifier is the one that minimizes the expected loss

$$\begin{aligned} \min _{{\varvec{\beta }}} E_{(y,{\varvec{x}})\in \mathcal {D}}\; \ell (y, {g_{\varvec{\beta }} ({\varvec{x}})}). \end{aligned}$$
(1)

We consider linear classifiers where \({g_{\varvec{\beta }} ({\varvec{x}})} =\beta _0 + {\varvec{\beta }}_{w} ^T {\varvec{x}} \), where \(\beta _0\) (intercept) and \({\varvec{\beta }}_{w} \) (feature weights) are the first and the remaining elements of \({\varvec{\beta }} \in \mathbb {R}^{p+1}\). In this setting, many common losses are functions of \(y\cdot {g_{\varvec{\beta }} ({\varvec{x}})} \), and we denote the loss \(\ell (y, {g_{\varvec{\beta }} ({\varvec{x}})})\) as \(\ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})\) henceforth. Common loss functions include

$$\begin{aligned} \ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})&= \left\{ \begin{array}{cl} \log _2(1+\exp (-y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})) &{} \text {(logistic loss)}\\ \max (0, 1-y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}) &{} \text {(hinge loss)}\\ \left( \max (0, 1-y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})\right) ^2 &{} \text {(squared hinge loss)}\\ \mathbbm {1}_{y{g_{\varvec{\beta }} ({\varvec{x}})} \ge -1}\cdot \max (0, 1-y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})^2 - \mathbbm {1}_{y{g_{\varvec{\beta }} ({\varvec{x}})} < -1}\cdot (4y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}) &{} \text {(modified Huber loss)}\\ \mathbbm {1}_{y\cdot {g_{\varvec{\beta }} ({\varvec{x}})} \le 0} &{} \text {(zero-one loss).} \end{array}\right. \end{aligned}$$
(2)

Well-known classifiers such as logistic regression (logistic loss) and linear SVM (hinge or squared hinge loss) fall under this framework. Such linear classifiers are also the building blocks for popular complex classifiers such as neural networks. Except for zero-one loss, all the other losses are convex in \(\varvec{\beta }\). In this paper, we seek to minimize the expected loss in Eq. 1 for such convex loss functions.
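For reference, the losses in Eq. 2 can be written directly as functions of the margin \(m = y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}\). The sketch below is our own, not the paper's code:

```python
# The loss functions of Eq. 2 as plain NumPy functions of the margin m.
import numpy as np

def logistic_loss(m):
    # log-loss in bits, as in Eq. 2
    return np.log2(1.0 + np.exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def squared_hinge_loss(m):
    return np.maximum(0.0, 1.0 - m) ** 2

def modified_huber_loss(m):
    # quadratic for m >= -1, linear (-4m) below; the two pieces meet at m = -1
    return np.where(m >= -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

def zero_one_loss(m):
    return (m <= 0).astype(float)

m = np.array([-2.0, 0.0, 0.5, 2.0])
for f in (logistic_loss, hinge_loss, squared_hinge_loss, modified_huber_loss):
    print(f.__name__, f(m))
```

Note that the modified Huber loss agrees with the squared hinge loss at \(m=-1\) (both give 4), so it is continuous, but grows only linearly for large negative margins.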

RoLin splits this problem into separate problems in different subspaces of the feature space. Next, we discuss the details of subspace separation, our robust optimization, and its solution. But first, we discuss the connection to PCR in more detail, as this helps us explain the unique features of RoLin.

2.1 Intuition for subspaces via PCR

Consider the problem of least-squares regression:

$$\begin{aligned} \min _{{\varvec{\beta }} \in \mathbb {R}^p} E_{{\varvec{x}}, y} (y-{\varvec{\beta }} ^T {\varvec{x}})^2 = \min _{{\varvec{\beta }} \in \mathbb {R}^p} E\left[ y^2\right] - 2{\varvec{\beta }} ^T E\left[ y\cdot {\varvec{x}} \right] + {\varvec{\beta }} ^T E\left[ {\varvec{x}} {\varvec{x}} ^T\right] {\varvec{\beta }}, \end{aligned}$$
(3)

where we assume a zero intercept for ease of exposition. Suppose we are given n i.i.d. training samples \((y_i, {\varvec{x}} _i)\in \mathbb {R}\times \mathbb {R}^p\). Then we can solve Eq. 3 after replacing the expectation terms with their estimates. But for small n, estimation errors lead to poor out-of-sample performance. Instead, Principal Components Regression (PCR) first projects the features \({\varvec{x}} _i\) on to the top few principal components. Then, it solves Eq. 3 only on the projected data. In other words, PCR splits the feature space \(\mathbb {R}^p\) into a subspace \({\mathcal {S}_1} \) spanned by the top principal components, and the orthogonal subspace \({\mathcal {S}_2} \). It then solves for the best \({\varvec{\beta }} \in {\mathcal {S}_1} \) and ignores \({\mathcal {S}_2} \).

The reason for the success of PCR is as follows. The principal directions and singular values correspond to the eigenvectors and eigenvalues of the matrix \(\hat{M} = \sum _i {\varvec{x}} _i{\varvec{x}} _i^T/n\). The top eigenvalues and eigenvectors of \(\hat{M}\) are often close to those of the expectation matrix \(M=E[\hat{M}]=E[{\varvec{x}} {\varvec{x}} ^T]\), even for small training sizes. This is because the estimation error for an eigenvector depends on the gap between its eigenvalue and all other eigenvalues (Davis and Kahan 1970; Yu et al. 2015). A larger gap implies smaller estimation error. For many datasets, this gap is large for the top eigenvalues. So the top principal directions are well estimated, and the same holds for the singular values too (Zhao et al. 2019). Hence, \(\hat{M}\) and M have similar projections on the subspace \({\mathcal {S}_1}\) spanned by these well-estimated principal directions. So for any \({\varvec{\beta }} \in {\mathcal {S}_1} \), \({\varvec{\beta }} ^T \hat{M}\) is close to \({\varvec{\beta }} ^T M\), and so \({\varvec{\beta }} ^T \hat{M} {\varvec{\beta }} \approx {\varvec{\beta }} ^T M {\varvec{\beta }} \). Applying this in Eq. 3, \(\min _{{\varvec{\beta }} \in {\mathcal {S}_1}} E(y-{\varvec{\beta }} ^Tx)^2 \approx \min _{{\varvec{\beta }} \in {\mathcal {S}_1}} \sum _i (y_i - {\varvec{\beta }} ^T {\varvec{x}} _i)^2 / n\), and this becomes PCR’s solution. In contrast, the remaining principal components are poorly estimated when n is small. So, the least-squares training loss is not a reliable indicator of the expected loss in the orthogonal subspace \({\mathcal {S}_2}\). Therefore PCR ignores \({\mathcal {S}_2}\).
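The eigengap argument can be checked numerically. In the sketch below (our own illustration, with an arbitrary spectrum), the top eigenvector of the sample second-moment matrix aligns well with its population counterpart, while a bulk eigenvector with a tiny eigengap does not:

```python
# With limited samples, eigenvectors with a large eigengap (the top ones)
# are estimated far more accurately than eigenvectors in the bulk.
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 100
# Population M = diag(evals): big gaps at the top, tiny gaps in the bulk
evals = np.concatenate(([25.0, 10.0], np.linspace(1.0, 0.9, p - 2)))
X = rng.standard_normal((n, p)) * np.sqrt(evals)   # rows ~ N(0, diag(evals))

M_hat = X.T @ X / n
_, V = np.linalg.eigh(M_hat)
V = V[:, ::-1]                                     # sort eigenvectors descending

# |cos angle| between estimated and true eigenvectors (true ones are e_i)
align_top = abs(V[0, 0])      # first PC: large eigengap
align_bulk = abs(V[10, 10])   # a bulk PC: near-degenerate eigenvalues
print(f"top PC alignment: {align_top:.3f}, bulk PC alignment: {align_bulk:.3f}")
```

This is the Davis–Kahan phenomenon the text describes: the top direction is recovered almost exactly, the bulk direction is essentially noise.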

The loss functions we consider in Eq. 2 are not restricted to just second moments as in Eq. 3. However, the basic ideas underlying PCR are still applicable, as we discuss next.

2.2 Subspace separation

We want to minimize the expected loss under a linear score function. The loss is given by \(\ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}) = \ell (y\beta _0 + {\varvec{\beta }}_{w} ^T(y\cdot {\varvec{x}})) = \ell (y\beta _0 + {\varvec{\beta }}_{w} ^T {\varvec{z}})\), where \({\varvec{\beta }} =(\beta _0\; {\varvec{\beta }}_{w})^T\) with intercept \(\beta _0\in \mathbb {R}\) and feature vector \({\varvec{\beta }}_{w} \in \mathbb {R}^p\), and \({\varvec{z}} =y\cdot {\varvec{x}} \). Extending the PCR argument, we propose to split the space \(\mathbb {R}^p\) into three subspaces \({\mathcal {S}_0}\), \({\mathcal {S}_1}\), and \({\mathcal {S}_2}\). The subspace \({\mathcal {S}_0} \) is spanned by the top few eigenvectors of \(\hat{M} = \sum _i {\varvec{x}} _i{\varvec{x}} _i^T/n\). We expect that the loss function can be reliably estimated in this subspace:

$$\begin{aligned} E\left[ \ell (y\beta _0 + {\varvec{\beta }}_{w} ^T{\varvec{z}})\right] \approx \mathbb {P}_n \left[ \ell (y\beta _0 + {\varvec{\beta }}_{w} ^T{\varvec{z}})\right] \quad \text {for any } {\varvec{\beta }}_{w} \in {\mathcal {S}_0}, \end{aligned}$$
(4)

where \(\mathbb {P}_n[.]\) represents the empirical mean. The next few eigenvectors span the subspace \({\mathcal {S}_1} \). Here, we can reliably estimate only the first and second moments of the distribution of \({\varvec{z}} \) projected on to \({\mathcal {S}_1} \):

$$\begin{aligned} {\varvec{\beta }}_{w} ^T E\left[ {\varvec{z}} {\varvec{z}} ^T\right] {\varvec{\beta }}_{w} \approx {\varvec{\beta }}_{w} ^T \mathbb {P}_n\left[ {\varvec{z}} {\varvec{z}} ^T\right] {\varvec{\beta }}_{w} \quad \text {for any } {\varvec{\beta }}_{w} \in {\mathcal {S}_1} \end{aligned}$$
(5)

But this may not be enough to estimate the loss function accurately. We must also consider the subspace \({\mathcal {S}_2} \) that is orthogonal to both \({\mathcal {S}_0} \) and \({\mathcal {S}_1} \). In \({\mathcal {S}_2} \), we expect only the first moments to be well-estimated:

$$\begin{aligned} {\varvec{\beta }}_{w} ^T E\left[ {\varvec{z}} \right] \approx {\varvec{\beta }}_{w} ^T \mathbb {P}_n \left[ {\varvec{z}} \right] \quad \text {for any }{\varvec{\beta }}_{w} \in {\mathcal {S}_1\cup \mathcal {S}_2} \end{aligned}$$
(6)

But the second moments under \({\mathcal {S}_2}\) are not arbitrary. Note that \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\) are constructed from separate sets of eigenvectors of the sample covariance matrix \(\mathbb {P}_n[{\varvec{z}} {\varvec{z}} ^T] = \mathbb {P}_n[{\varvec{x}} {\varvec{x}} ^T] = \hat{M}\) (since \({\varvec{z}} =y\cdot {\varvec{x}} \) and \(y\in \{+1, -1\}\)). So they are orthogonal under \(\hat{M}\), that is, \(\mathbb {P}_n\left[ \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) \left( P_{{\mathcal {S}_2}} {\varvec{z}} \right) ^T \right] = 0\), where \(P_{{\mathcal {S}_i}}\in \mathbb {R}^{p\times p}\) is a matrix that projects any vector on to \({\mathcal {S}_i} \), for \(i\in \{0, 1, 2\}\). We expect \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\) to remain nearly orthogonal under the population covariance M:

$$\begin{aligned} E\left[ \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) \left( P_{{\mathcal {S}_2}} {\varvec{z}} \right) ^T \right]&\approx 0. \end{aligned}$$
(7)

Finally, we expect the eigenvalues of the second-moment matrix under \({\mathcal {S}_2}\) to be smaller than those under \({\mathcal {S}_1}\).

$$\begin{aligned} \sigma _{max}\left( E\left[ \left( P_{{\mathcal {S}_2}} {\varvec{z}} \right) \left( P_{{\mathcal {S}_2}} {\varvec{z}} \right) ^T \right] \right)&< \sigma _{min} \left( \mathbb {P}_n\left[ \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) ^T \right] \right) , \end{aligned}$$
(8)

where \(\sigma _{max}\) and \(\sigma _{min}\) refer to the maximum and minimum non-zero eigenvalues of a matrix.
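The three-way split can be sketched as follows (our own illustration; the subspace sizes are arbitrary). Note that the empirical analog of Eq. 7 holds exactly by construction, since \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\) are spanned by disjoint sets of eigenvectors of the sample matrix:

```python
# Three-way subspace split of Sect. 2.2 from the SVD of Z (rows z_i = y_i x_i).
# The sample cross-moment between S1 and S2 projections is exactly zero.
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 20
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)
Z = y[:, None] * X                       # z_i = y_i * x_i

_, D, Vt = np.linalg.svd(Z, full_matrices=False)
V = Vt.T                                 # eigenvectors of M_hat = Z^T Z / n
k0, k1 = 3, 5                            # illustrative subspace sizes
V0, V1, V2 = V[:, :k0], V[:, k0:k0 + k1], V[:, k0 + k1:]

P1 = V1 @ V1.T                           # projector onto S1
P2 = V2 @ V2.T                           # projector onto S2
cross = (P1 @ Z.T) @ (P2 @ Z.T).T / n    # P_n[(P_S1 z)(P_S2 z)^T]
print("max |cross-moment| =", abs(cross).max())
```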

2.3 Different optimizations over subspaces

To construct our solution, we first optimize the training loss over \({\mathcal {S}_0} \). By Eq. 4, the training loss accurately reflects the expected loss of a solution \(\varvec{\beta }\) with intercept \(\beta _0\) and \({\varvec{\beta }}_{w} \in {\mathcal {S}_0} \). So, we choose an intercept \(\beta _0\in \mathbb {R}\) and \({\varvec{\beta }}_{\mathcal {S}_0} \in {\mathcal {S}_0} \) that solves the following optimization:

$$\begin{aligned}&\min _{\beta _0\in \mathbb {R}, {\varvec{\beta }}_{\mathcal {S}_0} \in {\mathcal {S}_0}} \frac{1}{n} \sum _{i=1}^n \ell \left( \beta _0\cdot y_i + {\varvec{\beta }}_{\mathcal {S}_0} ^t (P_{{\mathcal {S}_0}} {\varvec{z}} _i)\right) \nonumber \\&\quad = \min _{\beta _0, {\varvec{\beta }}_{\mathcal {S}_0}} \frac{1}{n} \sum _{i=1}^n \ell \left( y_i \cdot \left( \beta _0 + {\varvec{\beta }}_{\mathcal {S}_0} ^t (P_{{\mathcal {S}_0}} {\varvec{x}} _i)\right) \right) \end{aligned}$$
(9)

This is the usual parameter-fitting problem in classification but with projected features \(P_{\mathcal {S}_0} {\varvec{x}} _i\). We can use off-the-shelf solvers for logistic regression (for logistic loss) or linear SVM (hinge or squared hinge losses). For other convex loss functions, such as the modified Huber loss, we can use standard optimizers such as stochastic gradient descent.

Now, unlike PCR, we do not ignore \({\mathcal {S}_1\cup \mathcal {S}_2} \). Suppose we set \({\varvec{\beta }}_{w} ={\varvec{\beta }}_{\mathcal {S}_0} + {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \) for some vector \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \in {\mathcal {S}_1\cup \mathcal {S}_2} \). Then, the expected loss \(E[\ell (\beta _0\cdot y + {\varvec{\beta }}_{w} ^T{\varvec{z}})]\) equals \(E[ \ell \left( \beta _0\cdot y + {\varvec{\beta }}_{\mathcal {S}_0} ^T (P_{\mathcal {S}_0} {\varvec{z}}) + {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} ^T (P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}})\right) ]\). If we change \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\), it affects the third term but not the first two. Setting \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} =0\) corresponds to dimensionality reduction, because we only use the top principal components in \({\mathcal {S}_0}\). But a careful choice of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) can reduce the loss obtained from \({\varvec{\beta }}_{\mathcal {S}_0}\) alone. However, we cannot simply project the data on to \({\mathcal {S}_1\cup \mathcal {S}_2}\) and pick the \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) that minimizes training loss, because we cannot reliably estimate the loss function in this subspace. Instead, we need a \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) that is robust to estimation errors. To avoid being too conservative, we still need to use all available information about \({\mathcal {S}_1\cup \mathcal {S}_2}\) (Eqs. 5–8). We formulate this as a robust optimization problem, which we discuss next.

2.4 Robust formulation

To select a robust \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\), we need to characterize the distribution of the data projected on to \({\mathcal {S}_1\cup \mathcal {S}_2}\). The empirical distribution is unreliable here. Instead, we will construct distributions that are “worst-case”, in that they have the maximum uncertainty subject to the constraints in Eqs. 5–8. Then, we pick the \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) with the best worst-case performance. This prevents \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) from overfitting to incidental aspects of the empirical distribution, while still using all reliable information about moments.

We note that our worst-case distribution depends on the data, but not on the weight vector \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\). An alternative notion of robustness is to let the worst-case distribution depend on \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) as well. This corresponds to setting, for each possible choice of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\), the worst possible higher-order moments of the data distribution. Since our only constraints are on the mean and covariance, setting all other moments to their worst-case values is overly conservative. Fixing the data distribution to the maximum-uncertainty distribution helps us achieve robustness in a more practical way.

We will formulate our robust model assuming that Eqs. 5–7 are equalities. The first moments of the distribution of \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}} \) can be taken to be the first moments of the empirical distribution. For the second moments of this distribution, we have only partial information. Let \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) be a matrix whose columns are the eigenvectors that span \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\). Writing the second-moment matrix of \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}} \) in the basis \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\), we get a block-wise form:

$$\begin{aligned}&V_{{\mathcal {S}_1\cup \mathcal {S}_2}} ^T E\left[ \left( P_{{\mathcal {S}_1\cup \mathcal {S}_2}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_1\cup \mathcal {S}_2}}{\varvec{z}} \right) ^T\right] V_{{\mathcal {S}_1\cup \mathcal {S}_2}} \nonumber \\&\quad =\begin{bmatrix} \begin{array}{c@{}c} V_{{\mathcal {S}_1}} ^T E\left[ \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) ^T\right] V_{{\mathcal {S}_1}} &{} V_{{\mathcal {S}_1}} ^T E\left[ \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_2}}{\varvec{z}} \right) ^T\right] V_{{\mathcal {S}_2}} \\ V_{{\mathcal {S}_2}} ^T E\left[ \left( P_{{\mathcal {S}_2}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) ^T\right] V_{{\mathcal {S}_1}} &{} V_{{\mathcal {S}_2}} ^T E\left[ \left( P_{{\mathcal {S}_2}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_2}}{\varvec{z}} \right) ^T\right] V_{{\mathcal {S}_2}} \end{array} \end{bmatrix} =: \begin{bmatrix} B_{11} &{} B_{12}\\ B_{21} &{} B_{22} \end{bmatrix} \end{aligned}$$
(10)

Now,

$$\begin{aligned}&B_{11} = V_{{\mathcal {S}_1}} ^T E \left[ \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) ^T \right] V_{{\mathcal {S}_1}} \\&\quad = V_{{\mathcal {S}_1}} ^T \mathbb {P}_n \left[ \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) \left( P_{{\mathcal {S}_1}}{\varvec{z}} \right) ^T \right] V_{{\mathcal {S}_1}} \\&\quad = \mathbb {P}_n \left[ \left( V_{{\mathcal {S}_1}} ^T{\varvec{z}} \right) \left( V_{{\mathcal {S}_1}} ^T{\varvec{z}} \right) ^T\right] , \end{aligned}$$

where the second equality follows from applications of Eq. 5, and the third equality follows from \(P_{\mathcal {S}_1} =V_{{\mathcal {S}_1}} V_{{\mathcal {S}_1}} ^T\). So \(B_{11}\) is the second-moment matrix of the data projected on to \(V_{{\mathcal {S}_1}}\). Also, by Eq. 7,

$$\begin{aligned} B_{12} = B_{21} = 0. \end{aligned}$$
(11)

For \(B_{22}\), we have no estimates but only a bound (Eq. 8). This suggests the following uncertainty set for \(B_{22}\):

$$\begin{aligned} B_{22} \in \mathcal {U} := \left\{ W\left| W\succeq \frac{1}{n} \sum _{i=1}^n \left( V_{{\mathcal {S}_2}} ^T{\varvec{z}} _i\right) \left( V_{{\mathcal {S}_2}} ^T{\varvec{z}} _i\right) ^T, \Vert W\Vert \le \sigma _{bound}\right. \right\} , \end{aligned}$$
(12)

where \(\sigma _{bound} = \sigma _{min} \left( \mathbb {P}_n\left[ \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) \left( P_{{\mathcal {S}_1}} {\varvec{z}} \right) ^T \right] \right) \). Note that by construction, this uncertainty set is non-empty. Equations 10–12 thus characterize the second moments of the distribution of \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}} \).

Now, we construct our worst-case distribution for \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}} \); call it \(q(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}})\). We choose \(q(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}})\) to be the distribution with the maximum entropy (and hence the most “uncertainty”) subject to the first and second moments specified above. It is well known that the maximum entropy is achieved by the exponential family distribution with those moments (Cover and Thomas 2006). Now, we pick \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) that performs best under \(q(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}})\):

$$\begin{aligned} {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} = {{\,\mathrm{arg\,min}\,}}_{{\varvec{b}}\in {\mathcal {S}_1\cup \mathcal {S}_2}} \max _{B_{22}\in \mathcal {U}} \frac{1}{n} \sum _{i=1}^n E_{{\varvec{r}} \sim q(.)} \left[ \ell \left( \beta _0\cdot y_i + {\varvec{\beta }}_{\mathcal {S}_0} ^T \left( P_{{\mathcal {S}_0}} {\varvec{z}} _i\right) + {\varvec{b}}^T {\varvec{r}}\right) \right] , \end{aligned}$$
(13)

where \({\varvec{r}}\) is a random variable that represents \(P_{{\mathcal {S}_1\cup \mathcal {S}_2}}{\varvec{z}} \), and \(\beta _0\) and \({\varvec{\beta }}_{\mathcal {S}_0}\) are the solutions of Eq. 9. Note that q(.) depends on \(B_{22}\).

2.5 The solution of the robust objective

The solution \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) of Eq. 13 depends not only on the distribution \(q(P_{{\mathcal {S}_1\cup \mathcal {S}_2}}{\varvec{z}})\) but also on \(\beta _0\), \({\varvec{\beta }}_{\mathcal {S}_0}\), and \(P_{{\mathcal {S}_0}}{\varvec{z}} _i\). This suggests that solving the robust optimization might be difficult. However, we show a surprising result. While the scale of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) indeed depends on all the above factors, the direction of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) does not. In fact, in many cases, the direction does not even depend on the specific loss function.

Theorem 1

(Direction of the Robust solution) Suppose the loss function \(\ell (.)\) is non-negative, monotonically non-increasing, convex, differentiable, and the absolute value of its first derivative \(|\ell '(.)|\) has finite non-zero expectation under the standard Normal distribution. Then, any solution of Eq. 13 satisfies

$$\begin{aligned} {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}&= c\cdot V_{{\mathcal {S}_1\cup \mathcal {S}_2}} \Sigma ^{-1} {\varvec{\mu }},\nonumber \\ \text {where } \Sigma&= \begin{bmatrix} \begin{array}{c@{}c} \frac{1}{n} \sum _{i=1}^n \left( V_{{\mathcal {S}_1}} ^T{\varvec{z}} _i\right) \left( V_{{\mathcal {S}_1}} ^T{\varvec{z}} _i\right) ^T &{} 0 \\ 0 &{} \sigma _{bound}\cdot I \end{array} \end{bmatrix},\nonumber \\ {\varvec{\mu }}&= \frac{1}{n}\sum _{i=1}^n V_{{\mathcal {S}_1\cup \mathcal {S}_2}} ^T{\varvec{z}} _i, \end{aligned}$$
(14)

for some scalar c. Further, if there is a sequence of loss functions \(\ell ^{(m)}(.)\) satisfying the properties mentioned above such that \(\lim _{m\rightarrow \infty }\sup _{x\in \mathbb {R}} |\ell ^{(m)}(x) - \ell (x)| = 0\), then there is a solution of the form of Eq. 14 that is arbitrarily close to the optimal.

Corollary 1

(Wide applicability) The minimizer \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) of Eq. 13 has the form of Eq. 14 for logistic, hinge, squared hinge, and modified Huber losses.

These results are significant from both a theoretical and practical standpoint. It is challenging to formulate tractable robust optimizations. Uncertainty sets are often chosen for their ease of analysis. So, it is encouraging to see a simple closed-form structure emerge from a well-motivated formulation. Further, we do not need separate analyses for each loss function. Armed with Theorem 1, we only need to pick a single scalar, which is the magnitude \(\Vert {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \Vert \). We will choose this by cross-validation.

Computing the direction of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) is also easy because \(\Sigma \) is a diagonal matrix. To see this, let X be the matrix with \({\varvec{x}} _i\) as its \(i^{th}\) row, and let \(X=UDV^T\) be its singular value decomposition (SVD). The SVD of X is related to the eigenvectors and eigenvalues of \(\hat{M}\) by the formula \(n\cdot \hat{M} = \sum _i {\varvec{z}} _i{\varvec{z}} _i^T = \sum _i {\varvec{x}} _i{\varvec{x}} _i^T = V D^2 V^T\). For \(i\in \{0, 1, 2\}\), let \(D_{\mathcal {S}_i} \) be the diagonal matrix of singular values corresponding to the eigenvectors in \(V_{{\mathcal {S}_i}}\). Then, the top-left block of \(\Sigma \) equals \(\sum _i (V_{{\mathcal {S}_1}} ^T{\varvec{z}} _i)(V_{{\mathcal {S}_1}} ^T{\varvec{z}} _i)^T/n = D_{{\mathcal {S}_1}}^2/n\). So, \(\Sigma \) is a diagonal matrix with entries \(D_{{\mathcal {S}_1}}^2/n\) and \(\sigma _{bound}\). Since \(\sigma _{bound}=\min (D_{\mathcal {S}_1} ^2/n)>\max (D_{\mathcal {S}_2} ^2/n)\) from Eq. 8, we may write \(\Sigma = \text {diag}(\max (D_{\mathcal {S}_1\cup \mathcal {S}_2} ^2/n, \sigma _{bound}))\). We propose using a smooth upper-bound of this: \(\Sigma _{smooth} = \text {diag}(D_{\mathcal {S}_1\cup \mathcal {S}_2} ^2/n + \sigma _{bound})\). By varying \(\sigma _{bound}\), we get smooth transitions between different choices for \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\). Using \(\Sigma _{smooth}\) also reveals a curious connection between our robust solution and ridge regression.

Theorem 2

(Connection to ridge regression) The robust solution \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) using \(\Sigma _{smooth}\) is also the solution, up to a scaling factor, for regressing \(y_i\) on \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{x}} _i\) with a ridge penalty:

$$\begin{aligned} {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}&\propto {{\,\mathrm{arg\,min}\,}}_{\varvec{b}} \sum _{i=1}^n \left( y_i-{\varvec{b}}^TP_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{x}} _i\right) ^2 + n\sigma _{bound}\cdot \Vert {\varvec{b}}\Vert ^2. \end{aligned}$$

Thus, we can view RoLin as a mix of standard classification over the well-estimated top principal components and ridge regression over the poorly-estimated orthogonal subspace.
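Theorem 2 can be checked numerically. The sketch below assumes, based on the surrounding text, that the robust direction from Theorem 1 takes the Markowitz-like form \(\Sigma_{smooth}^{-1}\) applied to the projected mean of \({\varvec{z}}\); under that assumption it coincides exactly with the closed-form ridge solution:

```python
import numpy as np

np.random.seed(1)
n, p, k = 30, 8, 3
X = np.random.randn(n, p)
y = np.sign(np.random.randn(n))
Z = y[:, None] * X

_, D, Vt = np.linalg.svd(Z, full_matrices=False)
V_rest, D_rest = Vt[k:].T, D[k:]        # basis and spectrum of S1 ∪ S2
sigma_bound = 0.5                       # illustrative value

# Assumed Theorem 1 form: Sigma_smooth^{-1} applied to V^T mean(z)
mu = Z.mean(axis=0)
eta = V_rest @ ((V_rest.T @ mu) / (D_rest**2 / n + sigma_bound))

# Theorem 2: ridge regression of y on P x with penalty n * sigma_bound
P = V_rest @ V_rest.T
XP = X @ P
b = np.linalg.solve(XP.T @ XP + n * sigma_bound * np.eye(p), XP.T @ y)
```

The normal equations \((P X^T X P + n\sigma_{bound} I)\,{\varvec{b}} = P X^T {\varvec{y}}\), written in the \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) basis, reduce term by term to the assumed robust direction, since \(X^T{\varvec{y}} = \sum_i {\varvec{z}}_i\).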

3 Algorithm and Robust cross-validation

[Algorithm 1: CalcBeta (pseudocode)]

RoLin combines two algorithms: the CalcBeta algorithm to calculate the solution \(\varvec{\beta }\), and the RobustCV algorithm to robustly select model parameters for CalcBeta. We now provide details for both these algorithms.

Calculation of the solution vector \(\varvec{\beta }\) Algorithm 1 shows the steps in calculating \(\varvec{\beta }\). Apart from the data itself, it requires three inputs. The first input is the number k of top principal components that comprise the subspace \({\mathcal {S}_0}\). The second input is a parameter \(\sigma _{ratio}\) from which we construct \(\sigma _{bound}\). Tuning \(\sigma _{ratio}\) allows for smooth transitions between \({\mathcal {S}_1}\) and \({\mathcal {S}_2}\). Setting \(\sigma _{ratio}=0\) corresponds to setting \({\mathcal {S}_2} =\varnothing \), while a large \(\sigma _{ratio}\) corresponds to \({\mathcal {S}_1} =\varnothing \). Third, we need an upper bound \(b_{max}\) on the magnitude \(\Vert {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \Vert \).

We first construct the matrix Z with rows \({\varvec{z}} _i = y_i\cdot {\varvec{x}} _i\). The singular value decomposition of Z gives the diagonal matrix D of singular values and the matrix V of singular vectors (step 4)Footnote 1. We form \(V_{{\mathcal {S}_0}}\) from the first k singular vectors and \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) from the remaining singular vectors (steps 5, 6). \(V_{{\mathcal {S}_0}}\) and \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) span the subspaces \({\mathcal {S}_0}\) and \({\mathcal {S}_1\cup \mathcal {S}_2}\), respectively. For \({\mathcal {S}_0}\), we compute the optimal intercept \(\beta _0\) and weight vector \({\varvec{\beta }}_{\mathcal {S}_0} \in {\mathcal {S}_0} \) via Eq. 9 (steps 8, 9). As discussed in Sect. 2.3, this step can use any convex minimizer. Then, for \({\mathcal {S}_1\cup \mathcal {S}_2}\), we compute the direction vector \({\varvec{\eta }}_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) using Theorem 1 (steps 11–13). Here, we use \(\Sigma _{smooth}\) with \(\sigma _{bound} = \sigma _{ratio} * \max (D_{{\mathcal {S}_1\cup \mathcal {S}_2}}^2)\), where \(D_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) contains singular values corresponding to \(V_{{\mathcal {S}_1\cup \mathcal {S}_2}}\). Finally, we choose the best magnitude of \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\) over the training samples, but under the bound \(\Vert {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \Vert \le b_{max}\) (step 14). Bounded norm solutions have small generalization error (Sect. 5), so this is appropriate for the poorly-estimated subspace \({\mathcal {S}_1\cup \mathcal {S}_2}\). Note that the question of the “right” norm does not arise. We must bound the \(L_2\)-norm \(\Vert {\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2} \Vert \) since we already know the direction of the vector \({\varvec{\beta }}_{\mathcal {S}_1\cup \mathcal {S}_2}\).
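The steps above can be condensed into a hedged sketch of CalcBeta. The helper structure, the use of scikit-learn's `LogisticRegression` (with a large `C` standing in for the unregularized fit on \({\mathcal {S}_0}\)), and the grid search over magnitudes are our own simplifications:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calc_beta(X, y, k, sigma_ratio, b_max):
    """Hedged sketch of Algorithm 1 (CalcBeta) for logistic loss."""
    n, p = X.shape
    Z = y[:, None] * X
    _, D, Vt = np.linalg.svd(Z, full_matrices=False)
    V0, V_rest = Vt[:k].T, Vt[k:].T
    D_rest = D[k:]

    # Ordinary loss minimization on the well-estimated subspace S0
    # (large C approximates the unregularized fit).
    clf = LogisticRegression(C=1e4, max_iter=1000).fit(X @ V0, y)
    beta0, beta_S0 = clf.intercept_[0], V0 @ clf.coef_[0]

    # Robust direction on S1 ∪ S2, using Sigma_smooth (assumed Theorem 1 form)
    sigma_bound = sigma_ratio * np.max(D_rest**2) / n
    mu = Z.mean(axis=0)
    eta = V_rest @ ((V_rest.T @ mu) / (D_rest**2 / n + sigma_bound))
    eta /= np.linalg.norm(eta)

    # Pick the magnitude c <= b_max with the smallest training loss
    def loss(c):
        m = np.clip(y * (beta0 + X @ (beta_S0 + c * eta)), -30, 30)
        return np.mean(np.log1p(np.exp(-m)))
    c_best = min(np.linspace(0, b_max, 21), key=loss)
    return beta0, beta_S0 + c_best * eta
```

Since \({\varvec{\eta }}_{{\mathcal {S}_1\cup \mathcal {S}_2}}\) is orthogonal to \({\varvec{\beta }}_{\mathcal {S}_0}\), the norm of the robust component equals the chosen magnitude, so the \(b_{max}\) bound is respected by construction.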

[Algorithm 2: RobustCV (pseudocode)]

Robust cross-validation for choosing model parameters. Now we need to select the input parameters \(\psi =(k, \sigma _{ratio}, b_{max})\) for CalcBeta (Algorithm 1). A poor \(\psi \) leads to an overconfident classifier. But, there may only be a few holdout samples where an overconfident classifier incurs significant losses. The averaging step of cross-validation can hide these few large losses. To counter this, we develop a new robust cross-validation method called RobustCV (Algorithm 2).

RobustCV guards against overconfidence by using three signals. The first signal is the loss ratio, which we define as the ratio of the holdout loss to training loss, averaged over all cross-validation splits (step 36). A \(\psi \) with a high loss ratio indicates that overfitting is likely. To implement this idea, we use a loss ratio threshold \(\Theta _{ratio}\). Below the threshold, we use the average holdout loss as a measure of the cost of \(\psi \), like standard cross-validation. But above the threshold, the cost of \(\psi \) is set to the maximum holdout loss (step 37). We also use the loss ratio to find an upper bound \(k_{max}\) for the number of principal components that are well-estimated (step 4). Throughout our algorithm, we restrict the parameter k in \(\psi \) to \(k\le k_{max}\).

The second warning sign of overconfidence is a significant difference between the average holdout loss and the maximum holdout loss. Standard cross-validation picks the \(\psi ^\star \) with the smallest cost. But, we find a robust parameter setting \(\psi ^{rob}\) whose cost is within a factor \((1+\Theta _{slack})\) of \(\psi ^\star \), but whose worst-case holdout loss is better (steps 2426).

Third, realizing that the distribution of \(P_{\mathcal {S}_1\cup \mathcal {S}_2} {\varvec{z}} \) is poorly estimated, we check if a solution constructed from \({\mathcal {S}_0}\) alone is good enough. That is, we select a \(\psi _{\mathcal {S}_0} ^{rob}\) that ignores the subspace \({\mathcal {S}_1\cup \mathcal {S}_2}\) unless the cost improves by a factor of \(\Theta _{gain}\) under \(\psi ^{rob}\) (step 9). Together, these steps ensure that RobustCV selects a \(\psi ^{best}\) that is robust but not too conservative. RoLin then runs CalcBeta under \(\psi ^{best}\) over the entire training sample. The output of this is RoLin ’s solution.
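The first two signals can be sketched as small helper functions. Function names and the list-based interface are our own illustrative choices:

```python
import numpy as np

def robustcv_cost(holdout_losses, train_losses, theta_ratio=5.0):
    """Signal 1: if the ratio of average holdout loss to average training
    loss exceeds theta_ratio, distrust the average and use the maximum
    holdout loss as the cost of this parameter setting."""
    loss_ratio = np.mean(holdout_losses) / np.mean(train_losses)
    if loss_ratio > theta_ratio:
        return np.max(holdout_losses)
    return np.mean(holdout_losses)

def robust_pick(costs, worst_case, theta_slack=0.1):
    """Signal 2: among settings whose cost is within (1 + theta_slack) of
    the best cost, prefer the one with the smallest worst-case holdout
    loss.  Returns the index of the robust setting."""
    best = np.min(costs)
    candidates = [i for i in range(len(costs))
                  if costs[i] <= (1 + theta_slack) * best]
    return min(candidates, key=lambda i: worst_case[i])
```

A usage example: with costs `[1.0, 1.05, 2.0]` and worst-case losses `[5.0, 2.0, 1.0]`, the second setting wins; it is nearly as cheap as the best but far safer in the worst case.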

4 Experiments

We will first compare RoLin against competing methods for 25 real-world datasets and three loss functions. Then, we will contrast running RoLin with RobustCV against alternative cross-validation schemes. Finally, we will present a sensitivity analysis for RoLin ’s parameters.

Datasets We use 25 benchmark real-world datasets from the UCI repositoryFootnote 2. These span many domains, and the number of features ranges from \(p=8\) to \(p=43,680\). We convert categorical variables into binary “dummy” variables, and count each dummy variable as a separate feature.

Evaluation methodology We run experiments with logistic loss, squared hinge loss, and modified Huber loss (Eq. 2). All three are standard loss functions, and the first two are widely used in logistic regression and linear support vector machines. Our focus is on the limited-data setting because this is where estimation errors are significant, and finding a good solution is difficult. So we vary the number of training samples from \(n=15\) to \(n=200\) for each dataset. For each experiment, we randomly choose n points as the training samples and the remainder as the test samples. Then, we compute the average test loss for RoLin and all competing methods. We repeat this process 50 times. For each n, we report the trimmed mean of the losses, because it is robust to the occasional outlier. That is, we drop the five best and five worst test losses from the 50 repetitions, and calculate the average of the remaining test losses. We note that the mean losses have the same pattern as the trimmed means, with RoLin outperforming other methods by an even wider margin.
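The trimmed mean used in this protocol is straightforward to compute; a minimal sketch, with the function name our own:

```python
import numpy as np

def trimmed_mean(losses, n_drop=5):
    """Drop the n_drop smallest and n_drop largest values and average the
    rest, as in the evaluation protocol (50 repetitions, drop 5 per side)."""
    s = np.sort(np.asarray(losses, dtype=float))
    return s[n_drop:len(s) - n_drop].mean()
```

For 50 repetitions this averages the middle 40 test losses, so a handful of outlier runs cannot dominate the reported number.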

Competing methods The closest competitors to RoLin are norm-based regularization (using \(L_1\) and \(L_2\) norms), and dimensionality reduction using the top few principal components (Top PCs). For norm-based regularization, we use Python’s scikit-learn implementations for all losses and regularizations. We select the regularization parameter via standard cross-validation. Note that we calculate cross-validation loss using the actual loss function which we want to minimize, and not zero-one loss as is common in practice. Using zero-one loss gives sub-optimal results (see Appendix B). For Top PCs, the number of principal components is chosen by cross-validation.
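Cross-validating on the actual loss rather than accuracy is easy to express in scikit-learn. A hedged sketch for the logistic-loss baseline, with the dataset and parameter grid purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 10))
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(60)).astype(int)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},
    scoring="neg_log_loss",   # cross-validate the loss we minimize,
    cv=5,                     # not the misclassification rate
)
grid.fit(X, y)
```

Swapping `scoring` for `"accuracy"` recovers the common (but, per Appendix B, sub-optimal) zero-one practice.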

We also contrast RobustCV with two cross-validation methods. One is the standard CV, which picks the parameter with the best average holdout loss. The second method (CV-1-SD) considers all parameters whose loss is within one standard deviation of the best loss, and picks the parameter that achieves the most regularization (Hastie et al. 2009).

Implementation details. For RoLinFootnote 3, we run RobustCV with \(\Theta _{ratio}=5, \Theta _{slack}=0.1,\) and \(\Theta _{gain}=0.05\). We split the training data using five instances of 5-fold cross-validation (De Brabanter et al. 2002). We vary \(b_{max}\) from 0.01 to 0.1 for \(n=15\). For larger n, we scale the upper range with \(\sqrt{n}\). This allows more weight to be placed on the robust solution when more data is available. We vary \(\sigma _{ratio}\) from 1 to 10. We also use RobustCV to choose whether to normalize the features. We use the above settings for all experiments except for the sensitivity analysis in Sect. 4.3.

4.1 Accuracy of RoLin

The detailed plots of the performance of every method for each dataset and each loss function are shown in Fig. 10 in the appendix. In the following, we summarize our results and present our main observations.

Fig. 2

Comparison of RoLin against competing methods: The top row shows the number of datasets on which each method is best. RoLin performs best for most losses and training sizes. The bottom row shows the ratio of the trimmed means of losses of each method against that of RoLin, averaged over all 25 datasets. Again, RoLin works best for all loss functions and training sizes. We do not show \(L_2\) regularization for the squared hinge and modified Huber losses, and \(L_1\) regularization for the modified Huber loss, because their average loss is too large

Fig. 3

Among the competitors of RoLin, no one method is best: We compare the three competitors of RoLin against each other (ignoring RoLin). Top PCs works well for Modified Huber loss, but for other losses, there is no one method that works best

Fig. 4

RoLin versus Top PCs: RoLin is consistently better

RoLin outperforms the competing methods. Figure 2 shows the aggregate statistics comparing RoLin against competing methods over all 25 datasets. The top panel of Fig. 2 counts the number of datasets on which any given method achieves the best loss. We see that RoLin is the best in all settings except for logistic loss with 200 training samples. RoLin is particularly dominant for small training sizes, since this is when robustness to estimation error is most needed. For \(n=15\) training samples, RoLin is the best performer on at least 15 datasets, irrespective of the loss function. When 50 or fewer training samples are available, RoLin achieves the smallest loss on around 2x to 3x as many datasets as the next best method. RoLin also works very well for modified Huber loss; it is best for 14 or more datasets for any training size.

We observe the same pattern when we compare the actual value of the loss. The bottom panel of Fig. 2 shows the loss incurred by each method compared against that of RoLin, averaged over all datasets. RoLin always has a better loss on average, for all loss functions and training sizes. The greatest difference is for the smallest training size \(n=15\), where the next best method is on average \(14\%-40\%\) worse than RoLin, depending on the loss function. But even with \(n=200\) training samples, every method is worse on average than RoLin. For the modified Huber loss, the average improvement of RoLin over \(L_1\) and \(L_2\) regularization is too large to fit on the plot.

There is no clear second-best method among the competitors of RoLin. In practice, we must choose a single method to apply to a dataset. Figure 2 shows that while RoLin is best, there is no clear second-best method. In terms of the loss, Top PCs works well everywhere. However, Fig. 2d shows that \(L_1\) regularization is better for most training sizes for logistic loss. Further, if we consider the instances where some method outperforms RoLin, that method is often \(L_2\) regularization (Figs. 2a and 2b). But the average loss for \(L_2\) regularization can be much worse than the other methods (Fig. 2e, f). Figure 3 compares only the competitors of RoLin to each other. Top PCs works well for the modified Huber loss, but for other losses, \(L_2\) regularization is comparable or sometimes better. Thus, among the competitors of RoLin, no single method dominates.

Robust optimization contributes significantly to RoLin’s performance. The difference between Top PCs and RoLin is that Top PCs ignores the bottom principal components, while RoLin uses a robust optimization for them. Hence, the importance of robust optimization can be gauged from the difference between these two methodsFootnote 4.

The previous results show that Top PCs works reasonably well on all loss functions and training sizes. But the consistency of Top PCs comes at a cost: it rarely outperforms RoLin (top panel of Fig. 2). Figure 4 shows the ratio of the loss of Top PCs against RoLin on a log-scale. For every loss function, and for any training size, RoLin is better than Top PCs on at least \(75\%\) of the datasets (shown by the bottom of the boxes being around one). Further, Top PCs can be up to 4x worse than RoLin. Even with \(n=200\) training samples, Top PCs can still be 2x worse. This clearly demonstrates the need for the robust optimization step in RoLin.

Norm-based regularization can occasionally have very large losses. For particular datasets and settings, RoLin can be much better than both \(L_1\) and \(L_2\) norm-based regularizations. It can be up to 3x better under logistic loss, and 12x better under squared hinge loss. For modified Huber loss, no norm-based regularization yields a reasonable classifier for the Credit and Gas sensor datasets (Fig. 10(12) and 10(18)). Indeed, for several datasets, the classifiers obtained from norm-based regularization have such poor test loss that they do not appear in the plots for Fig. 10.

To illustrate this, Fig. 5 compares RoLin against norm-based regularization for three specific datasets. Note that the y-axis is on a log scale, and we report trimmed means which remove outliers. Plot (a) shows an instance where RoLin with \(n=15\) training samples is better than both norm-based methods with \(n=1500\) samples. Plot (c) shows a similar situation. In plot (b), the losses for norm-based regularization become much worse when training size is reduced. This cannot be due to occasional outliers, because the trimmed mean ignores the worst five test losses. Further, \(L_1\)-regularization in plot (c) is not close to convergence even with \(n=1500\). These examples highlight the perils of choosing the “wrong” norm. RoLin sidesteps this issue entirely.

Finally, we note that RoLin performs as well or better than competing methods for zero-one loss (or, misclassification rate). Since our focus is on convex losses, we defer these results to Appendix B.

Fig. 5

Test loss comparison on three example datasets: RoLin is compared against \(L_1\) and \(L_2\) regularizations for three loss functions, over a wide range of training sizes n. For each n, we report the trimmed mean of the test losses over 50 repeated experiments. Note that the y-axis is plotted on a log-scale

Fig. 6

Importance of RobustCV: We plot the loss when RoLin is run with CV and CV-1-SD, versus RobustCV. RobustCV is much better for small training sizes

4.2 Importance of RobustCV

To examine the influence of RobustCV, we run RoLin with standard cross-validation (CV) and another common variant (CV-1-SD). For each dataset and loss function, we calculate the trimmed mean of test loss of RoLin with CV and CV-1-SD, for \(n=15\) to \(n=200\). We compare these against the trimmed means using RobustCV. Figure 6 shows that CV is better than CV-1-SD for all losses. Between CV and RobustCV, RobustCV outperforms CV for small training sizes. For \(n\le 30\), CV is \(5\%-25\%\) worse on average than RobustCV, depending on the loss function. The differences mostly disappear when more training samples are available.

These results show the usefulness of robustness in cross-validation for small sample sizes. In such scenarios, an overconfident classifier may correctly classify all but a few points, and only these few points provide any warning about the unsuitability of the classifier. RobustCV is designed to look for these warning signals and hence can avoid overconfident classifiers. Standard CV averages over all holdout sets, and this attenuates or even hides the warning signs.

Fig. 7

Sensitivity to the parameters of RobustCV: We plot the relative difference in trimmed means for logistic loss when the parameters \((\Theta _{ratio}, \Theta _{slack}, \Theta _{gain})\) are varied from their default values of (5, 0.1, 0.05). Positive values imply larger losses. RoLin is seen to be robust to a wide range of parameter choices

4.3 Sensitivity analysis

Recall that RobustCV requires three parameters. The first is \(\Theta _{ratio}\), which is the threshold ratio of holdout to training loss above which we distrust the average holdout loss. The second is \(\Theta _{slack}\), which is the importance we assign to the maximum holdout loss versus the average holdout loss. The third is \(\Theta _{gain}\), which characterizes our preference for solutions constructed only from the top principal components. Now, we vary these parameters one at a time from their default values and report results on ten datasets.

Figure 7 shows the relative increase in the trimmed means under logistic loss for different values of these parameters. Plot (a) shows that for \(\Theta _{ratio}\), any value in the range \(\Theta _{ratio}\in [2.5, 100]\) works well (the default is 5). Larger values of \(\Theta _{ratio}\) mean that we ignore instances where training loss is much smaller than holdout loss, which is a clear sign of overfitting. Smaller values mean that we always use the maximum holdout loss instead of the average holdout loss. Always focusing on maximum loss is too conservative, so it performs poorly for our expected test loss objective.

Plot (b) shows that any choice of \(\Theta _{slack}\le 0.15\) yields similar results (the default is 0.1). Losses become worse for larger values of \(\Theta _{slack}\). A large \(\Theta _{slack}\) means that we downplay the average holdout loss and focus on the maximum holdout loss. Like a small \(\Theta _{ratio}\), this is too conservative and does not work for the same reason.

Plot (c) shows that any \(\Theta _{gain}\le 0.05\) yields good results (the default is 0.05). Higher values imply a preference for solutions based on only the top few principal components, ignoring the robust solution from the remaining principal components. When \(\Theta _{gain}\rightarrow \infty \), we get Top PCs. We see that high \(\Theta _{gain}\) leads to a significant increase in the test loss, showing the importance of the robust component of RoLin.

Extreme values for any of these parameters correspond to either standard cross-validation or very conservative choices. The former is bad for small n, while the latter performs poorly for large n. But for a broad range of parameters, RobustCV achieves good results.

5 Prior work

A common approach to deal with limited data is regularization. Here, we add to the desired objective an extra term that penalizes large feature weights. This term is typically some \(L_q\) norm of the feature weight vector, with \(q=1\) and \(q=2\) being common choices:

$$\begin{aligned} \min _{{\varvec{\beta }}} \mathbb {P}_n \ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}) + \lambda \cdot \Vert {\varvec{\beta }}_{w} \Vert _q. \end{aligned}$$
(15)
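A minimal sketch of Eq. 15 for logistic loss with \(q=2\) (using the squared \(L_2\) penalty, the common variant) via plain gradient descent; the function and step sizes are illustrative, not any library's implementation:

```python
import numpy as np

def l2_regularized_logistic(X, y, lam, steps=2000, lr=0.1):
    """Gradient descent on mean logistic loss + lam * ||beta||^2.
    Intercept omitted for brevity; labels y are +/-1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        m = np.clip(y * (X @ beta), -30, 30)      # margins, clipped for stability
        # d/dbeta of mean log(1 + exp(-m)) is -mean(y * x / (1 + exp(m)))
        g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)
        beta -= lr * (g + 2 * lam * beta)
    return beta
```

Increasing `lam` shrinks \(\Vert {\varvec{\beta }}\Vert \), which is exactly the bias that the justifications below attempt to explain.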

There are several competing justifications of the regularization term in Eq. 15. Regularization can emerge from a prior, or as the solution of a robust optimization, or as a way to bound estimation errors. Next, we discuss these, and contrast them with RoLin.

Prior. We can cast regularization as a prior on the parameter vector \({\varvec{\beta }} \). Then, the solution of Eq. 15 is the maximum a posteriori (MAP) estimate of \({\varvec{\beta }} \). For example, a zero-mean spherical Gaussian prior for \({\varvec{\beta }}_{w}\) gives \(L_2\) regularization, while a Laplace prior yields \(L_1\) regularization. But one may construct a prior for any \(L_q\)-norm, or any Mahalanobis distance metric. Choosing the best prior for a dataset is difficult, but it matters a lot, as we showed in Sect. 4. RoLin does not assume a prior, so it sidesteps this difficulty entirely.

Priors are also useful for dealing with corrupted data (Kordzakhia et al. 2001; Feng et al. 2014; Tibshirani and Manning 2014). Further, \(L_1\) priors induce sparsity in the solution, which makes the model easier to interpret (Tibshirani 1996). We do not consider data corruption or interpretability in this paper.

Robust optimization. Many optimization problems have parameters or constraints that must be learned from data. Robust optimization methods protect against corrupted data, outliers, and incorrect assumptions (Ben-Tal et al. 2009). These methods first construct uncertainty sets that reflect the ambiguity in the data. Then, they optimize a worst-case objective over the uncertainty set. For some uncertainty sets, this worst-case objective matches norm-based regularization.

Robust optimization methods typically fall into two groups. Methods in the first group assume that the training samples are perturbed. The perturbation could be because of uncertain or missing data (Trafalis and Gilbert 2006; Gao Huang et al. 2012; Wang and Pardalos 2014; Tzelepis et al. 2018), adversarial opponents (Globerson and Roweis 2006), or different training and test distributions (Bi and Zhang 2004). Robustness to perturbations is also equivalent to robustness under chance constraints (Bhattacharyya 2004; Shivaswamy et al. 2006). To achieve robustness, we assume that the “true” data fall inside an uncertainty set constructed from the “perturbed” data. Standard uncertainty sets impose a bound on some norm of the perturbation. Choosing a particular norm gives a corresponding norm-based regularization (El Ghaoui and Lebret 1997; Xu et al. 2009a, b).

The second group of robust optimization methods constructs uncertainty sets of probability distributions. They assume that the true distribution of \((y, {\varvec{x}})\) lies in this uncertainty set and optimize for the worst-case distribution in this set. Delage and Ye (2010); Goh and Sim (2010); Wiesemann et al. (2014) consider distributions with appropriately bounded moments. Others choose distributions within a bounded distance from the empirical distribution. The distance can be the Prohorov metric (Erdoğan and Iyengar 2006), KL-divergence (Jiang and Guan 2016), or Wasserstein distance (Wozabal 2012; Shafieezadeh-Abadeh et al. 2015, 2017; Mohajerin Esfahani and Kuhn 2018). For Wasserstein distance, the user must also choose a distance metric in feature space. Choosing a distance metric based on some norm yields a regularization using that norm.

Thus, both types of robust optimization approaches rely on the user to choose a distance metric or a norm. This choice determines the form of the regularization term. The “best” choice for a dataset is unclear. Further, robust optimization emphasizes worst-case performance. This can make robust algorithms too conservative for our average-loss objective. In contrast, RoLin does not require any user inputs. Also, RoLin restricts robust optimization to just the bottom principal components, which are much noisier than the top components. This protects RoLin from becoming too conservative.

Estimation error bounds. Regularization also ensures that the training loss is close to the expected loss. Let \(\Delta _{\varvec{\beta }}:= E\ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})}) - \mathbb {P}_n \ell (y\cdot {g_{\varvec{\beta }} ({\varvec{x}})})\) be the difference between the expected and training losses. For large training sizes n, this is small for any \(\varvec{\beta }\). But, for small n, \(\Delta _{\varvec{\beta }} \) can be much greater than zero for some values of \(\varvec{\beta }\). However, if \({\varvec{\beta }} \) has a small norm, \(\Delta _{\varvec{\beta }} \) can be upper-bounded. For example, under zero-one loss, if \(\Vert {\varvec{\beta }} \Vert _1\le 1/\rho \) (bounded \(L_1\) norm), then \(\Delta _{\varvec{\beta }} \) decays with \(\rho \) and the square-root of n (see Mohri et al. 2018). Similar results hold when \(\varvec{\beta }\) has a small \(L_2\) norm. Now, regularization biases the objective of Eq. 15 towards a \(\varvec{\beta }\) with a small norm. This ensures that the expected loss of the solution is comparable to its training loss. Hence, regularization avoids overfitting.
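Schematically, such margin-based bounds take the following form: with probability at least \(1-\delta \) over the n training samples, for all \({\varvec{\beta }}\) with \(\Vert {\varvec{\beta }} \Vert _1\le 1/\rho \),

$$\begin{aligned} \Delta _{\varvec{\beta }} \le \frac{c_1}{\rho }\sqrt{\frac{\log p}{n}} + c_2\sqrt{\frac{\log (1/\delta )}{n}}, \end{aligned}$$

where the constants \(c_1, c_2\) depend on bounds on the features and the loss. This schematic form is our paraphrase of the standard statement (see Mohri et al. 2018); the precise constants differ by norm and loss function.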

Still, bounding \(\Delta _{\varvec{\beta }} \) is not enough. Our aim is a \(\varvec{\beta }\) with small expected loss, not a small \(\Delta _{\varvec{\beta }} \). Further, such bounds may hold for many norms or Mahalanobis distances. Choosing the best norm for a dataset is difficult. For small n, the bounds on \(\Delta _{\varvec{\beta }} \) can be looseFootnote 5. So, we cannot just pick the norm with the best bound. Finally, while a \(\varvec{\beta }\) with a small norm may have a small \(\Delta _{\varvec{\beta }} \), the converse need not be true. There may be other solutions that have a small \(\Delta _{\varvec{\beta }} \) and also a low loss.

All the above justifications for regularization need the user to choose a norm, or a prior, or a distance metric. The right choice depends on the dataset, the training size, and the loss function. This choice is challenging but also crucial because the wrong choice can significantly hurt performance. In contrast, RoLin needs no user input and does not force the solution to have a small norm. This suggests that explicit norm-based regularization of the form of Eq. 15 is unnecessary.

6 Conclusions

Our goal is to build a linear classifier with two properties. First, it should optimize for general loss functions, instead of the usual zero-one loss. This can be interpreted as accurately predicting class probabilities and not just the binary class labels. Second, its accuracy should gracefully degrade with smaller training sample sizes. The usual approach is to do dimensionality reduction via principal components, or to add to the loss function a regularization term based on a norm chosen by the user. But dimensionality reduction loses data, while regularization is sensitive to the choice of norm. Our proposed method, called RoLin, overcomes these flaws. Unlike dimensionality reduction, it does not ignore the bottom principal components. Unlike regularization, RoLin is entirely automatic and needs no user input. Further, it works well with many loss functions.

RoLin first projects the data on to its top principal components and minimizes training loss on the projected data. The resulting classifier does not overfit because the top principal components are stable. But this classifier ignores the subspace orthogonal to the top principal components. We cannot minimize training loss in this subspace, because estimates of loss are unreliable. So RoLin constructs a robust classifier here. Finally, RoLin combines the two classifiers to get the benefits of both.

To select the parameters of RoLin, we develop a new robust cross-validation algorithm called RobustCV. This checks for several warning signs of overfitting missed by standard cross-validation. RobustCV helps RoLin work well even with small training sizes.

Experiments on 25 real-world datasets and three loss functions show that RoLin outperforms existing state of the art methods. RoLin does particularly well for small training sizes. For \(n=15\) training samples, RoLin has \(14\%-40\%\) lower loss on average than the next-best competitor, under all problem settings. When 50 or fewer training samples are available, RoLin achieves the smallest loss on around 2x to 3x as many datasets as the next best method. For the modified Huber loss, RoLin dominates other methods for all training sizes. Further, among the competitors of RoLin, no single method is best. Norm-based regularization is close to RoLin for logistic regression, but is not comparable for modified Huber loss. On some datasets, RoLin achieves with \(n=15\) samples an accuracy that regularization fails to reach with \(n=1500\) samples. Dimensionality reduction via the top principal components rarely outperforms RoLin, especially for logistic loss. Finally, the best norm for regularization depends on the dataset, training size, and loss function. So, for a new problem setting, picking the right norm is difficult. In contrast, RoLin works well for all datasets and settings.

There are several ways to extend RoLin. We can try to use RoLin for non-linear classification via the kernel trick. Here, each test point \(\varvec{x}\) is classified based on a linear combination of \(K({\varvec{x}}, {\varvec{x}} _i)\) where \({\varvec{x}} _i\) is a training point and K(., .) is a kernel function. This suggests that we can use RoLin on the kernel matrix instead of the feature matrix. We can also use RoLin for multiclass classification via one-versus-the-rest binary classification. Finally, we note that RoLin does not handle outliers or different training and test distributions. The top principal components of the training and test distributions may not be similar in this setting. One possibility is to project RoLin ’s solution on to the set of small-norm solutions, which may be more robust under outliers.