1 Introduction

In practically all analyses of real-world data we encounter outliers, i.e., data items that do not follow the same patterns as the majority of the data. Such items are problematic, since they may negatively influence modelling of the data. This is observed, for instance, in ordinary least-squares (ols) regression, where even a single outlier may lead to arbitrarily large errors (Donoho and Huber 1983). It is, hence, important to consider robust methods that effectively avoid the influence of outliers.

Robust regression methods can be used as almost drop-in replacements for linear regression, which is still widely used because of its inherent interpretability and simplicity. Linear regression is also often used as a part of other machine learning or data mining algorithms, e.g., in explainable artificial intelligence (Ribeiro et al. 2016; Björklund et al. 2019). Furthermore, robust regression can be used to search for outliers by investigating the data items that do not adhere to the robust model.

In this paper we present a sparse robust regression method, termed slise (Sparse Linear Subset Explanations), that achieves the highest possible theoretical robustness and outperforms many existing state-of-the-art robust regression methods in terms of scalability on large datasets. Specifically, we consider finding the largest subset of data items that can be represented by a linear model to a given accuracy.

To illustrate the need for robust regression methods, consider the dataset shown in Fig. 1 containing a cluster of outliers in the top right corner. Here ordinary least-squares (ols) linear regression clearly fails due to the influence of the outliers. In contrast, our slise method achieves high robustness by ignoring the outliers and finds a linear model that approximates a (large) subset of the data items.

Fig. 1

Outliers can cause ordinary least-squares regression to give unusable results (green dashed line). Robust regression mitigates the effect of the outliers. slise (orange solid line) accomplishes this by finding a maximal subset of items which can be modelled by a linear model to a given maximum error. This maximum error is illustrated as the lightly shaded “corridor” around the slise regression line (Color figure online)

The two main goals of the slise algorithm can be stated in brief as follows. The primary goal of the algorithm is to find a maximum subset of points that can be modelled by a linear model to a given maximum error. This subset is visualised in Fig. 1 by the shaded area around the orange slise line. The width of this “corridor” shows the (adjustable) error tolerance. The secondary goal of slise is to minimise the residuals of the items within the subset marked by the corridor.

In the analysis of multidimensional datasets it is also important to consider the interpretability of the model. slise incorporates lasso regularisation and can, hence, yield sparse models, i.e., models where some of the coefficients are zero and have no impact on the model, making interpreting the results easier (Ribeiro et al. 2016).

1.1 Related work

One of the reasons for the non-robustness of ols is the minimisation of the sum of squared residuals in the loss function, meaning that outliers highly affect the final fit. One remedy for this is to use a loss function without squaring, e.g., absolute deviation (Giloni et al. 2006) or Huber-loss (Huber 1964). Although this is, in practice, often more robust than ols, theoretically these loss functions are just as susceptible to outliers as ols (Rousseeuw and Hubert 2011).

Linear regression can also be solved by finding a maximum likelihood estimate for the parameters. The likelihood function can then be chosen such that the estimation becomes robust. Early examples include S-Estimators (Rousseeuw and Yohai 1984) and MM-Estimators (Yohai 1987). These have recently been further developed into, e.g., mm-lasso (Smucler and Yohai 2017), mte-lasso (Qin et al. 2017), and smdm (Koller and Stahel 2017).

Another approach to robust modelling is to fit the model only to non-outliers, instead of considering the full dataset. However, in order to ignore the outliers, these must first be found. One approach is to ignore a fixed fraction of the items, typically up to half, and simultaneously optimise for both the regression model and the subset of included data items. This is the idea behind the Least Trimmed Squares (lts) algorithm (Rousseeuw 1984), which also has more recent improvements (Rousseeuw and Van Driessen 2000; Alfons et al. 2013).

This can also be done by randomly selecting subsets until a subset of only non-outliers is found. This subset can then be expanded to include all non-outliers, e.g., by fitting a linear model to the subset and selecting all data items with small-enough errors. This is what the ransac algorithm does (Fischler and Bolles 1981). ransac has gained popularity in the field of computer vision with multiple recent improvements (Barath and Matas 2018; Barath et al. 2020). However, these improvements operate on different assumptions and data structures, since the task is to match pixels between two images.

Another way of achieving robustness is to replace non-robust parts of the algorithms with more robust parts, e.g., by replacing means with medians (Hubert and Debruyne 2009). This is the idea behind Quantile Regression (Koenker and Hallock 2001), for which the most recent development is conquer (Fernandes et al. 2021).

Beside robustness, linear regression methods can have other characteristics, such as sparsity. In this context, sparsity means that some of the coefficients are (deliberately) zero. A common way of achieving sparsity is through lasso regularisation (Tibshirani 1996), and this has also been added to various robust regression methods (Alfons et al. 2013; Smucler and Yohai 2017; Wang et al. 2007).

Of the robust regression methods, slise is most closely related to lts, or rather to its sparse variant sparse-lts (Alfons et al. 2013). Both lts and slise fit linear models to a subset of the items. The difference is that lts requires the size of the subset to be fixed and specified a priori, while slise defines the subset based on a maximum tolerable error. This requires a new algorithm, which actually scales better with regard to the number of dimensions (see Sect. 4.2) without sacrificing any theoretical robustness (see Sect. 2.2). The slise algorithm uses graduated optimisation to find solutions, which makes it both fast and robust against noise and outliers.

1.2 Contributions

We present a novel robust regression method, slise, by considering the problem of finding the largest subset that can be approximated by a sparse linear model to a given accuracy (Problem 1). We show that this can yield the best possible breakdown value (Sect. 2.2), but that the problem is \(\mathbf {NP}\)-hard (Theorem 2). We present an approximative algorithm for solving it (Algorithm 1) and demonstrate that it also works empirically using both synthetic and real-world datasets (Sects. 5–7). Compared to other robust regression methods slise yields high robustness and good scalability to high-dimensional datasets.

The initial version of the slise algorithm was presented in a conference paper (Björklund et al. 2019). That paper has a greater focus on one specific use case for the robust regression method, namely, explaining outcomes from black box supervised learning models. (The word “Explanations” in the name of the slise algorithm stems from this application.) When applied to the problem of explaining models, the idea is to find a simple interpretable linear model that approximates a more complex supervised learning model in the neighbourhood of a data item of interest. For this task slise must be able to find good solutions even for small error tolerances. The advantage of using slise for explaining supervised learning models is that no resampling of datasets is required and that the explanations (linear models) obey constraints imposed by the data. In the present paper, however, we focus solely on robust regression.

Compared to Björklund et al. (2019), the discussion on the problem characteristics and numeric approximation is substantially extended with additional proofs in Sects. 2 and 3. We also provide novel initialisation schemes for the slise algorithm in Sect. 4, and evaluate their effect, and the effect of all the other parameters, on the stability and performance of the algorithm in Sect. 6. We find that one of the new schemes is more robust than the lasso initialisation used previously, and provide recommendations for suitable default values for the other parameters. Furthermore, we perform a more thorough empirical comparison to related methods in Sect. 7.

1.3 Organisation

In Sect. 2 we formalise our robust regression problem, and show its complexity and breakdown value. We then discuss the practical numeric optimisation and its approximation properties in Sect. 3. The algorithm, with different initialisation schemes and asymptotic complexity, is presented in Sect. 4. The empirical evaluation, which consists of the default parameter selection and the comparison to related methods, follows in Sects. 5–7. We end with conclusions in Sect. 8.

2 Problem definition

Let \((X, Y)\), where \(X \in \mathbb {R}^{n \times d}\) and \(Y \in \mathbb {R}^{n}\), be a dataset consisting of n pairs \(\{(x_i, y_i)\}_{i = 1}^n\), where we denote the ith d-dimensional item (row) in X by \(x_i\in {\mathbb {R}}^d\) (the predictor) and similarly the ith element in Y by \(y_i\in \mathbb {R}\) (the response). We use the shorthand \([n] = \{1, \ldots , n\}\).

Our goal is to develop a linear regression method that is robust to outliers by finding the largest subset of data items that can be described by a sparse linear model to a given precision, as exemplified in the introduction.

We now state the main problem in this paper:

Problem 1

Given \(X \in \mathbb {R}^{n \times d}\), \(Y \in \mathbb {R}^{n}\), the error tolerance \(\varepsilon \in {\mathbb {R}}_{\ge 0}\), and the regularisation strength \(\lambda \in {\mathbb {R}}_{\ge 0}\); find the regression coefficients \(\alpha \in \mathbb {R}^d\) minimising the \((\varepsilon , \lambda )\)-loss function

$$\begin{aligned} \texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha ) = \sum \nolimits _{i=1}^n{H\left( \varepsilon ^2-r_i^2\right) \left( r_i^2/n-\varepsilon ^2\right) } + \lambda {\Vert \alpha \Vert }_1, \end{aligned}$$
(1)

where the residual errors are given by \(r_i = y_i - \alpha {}^\intercal x_i\), \(H(\cdot )\) is the Heaviside step function satisfying \(H(u)=1\) if \(u \ge 0\) and \(H(u)=0\) otherwise, and \({\Vert \alpha \Vert }_1\) denotes the L1-norm. In the remainder of the paper we use the L1-norm, given by \({\Vert \alpha \Vert }_1 = \sum \nolimits _{i=1}^d \left| \alpha _i\right| \), even though the theoretical results would be valid for the L2 or other norms as well. A benefit of L1 (lasso) regularisation is that it leads to sparse solutions, which aids interpretability and explanations (Björklund et al. 2019). If necessary, the data matrix X can be augmented with a column of all ones to accommodate the intercept term of the model.
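To make the loss concrete, a minimal NumPy sketch of Eq. (1) could look as follows (the function and variable names are ours, not those of the reference implementation):

```python
import numpy as np

def slise_loss(alpha, X, Y, epsilon, lam=0.0):
    """(epsilon, lambda)-loss of Eq. (1)."""
    n = len(Y)
    r = Y - X @ alpha                      # residuals r_i = y_i - alpha^T x_i
    S = r ** 2 <= epsilon ** 2             # items with H(eps^2 - r_i^2) = 1
    return np.sum(r[S] ** 2 / n - epsilon ** 2) + lam * np.sum(np.abs(alpha))
```

Each item inside the error tolerance contributes a non-positive term of magnitude at most \(\varepsilon ^2\), so lowering the loss primarily means including more items (cf. Theorem 1 below).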

Alternatively, the Lagrangian term \(\lambda \Vert \alpha \Vert _1\) in Eq. (1) can be replaced by a constraint \(\Vert \alpha \Vert _1\le t\) for some t. Now we can rewrite Eq. (1) as

$$\begin{aligned} \texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha ) = \sum \nolimits _{i\in S}{\left( r_i^2/n-\varepsilon ^2\right) } \end{aligned}$$
(2)

under the constraint \({\Vert \alpha \Vert }_1\le t\), and where

$$\begin{aligned} S=\{i \in [n] \mid r_i^2 \le \varepsilon ^2\}. \end{aligned}$$
(3)

Here S is the subset of data items assumed to be non-outliers and the complement, \(S^\mathsf {c} = [n] \setminus S\), is the subset of potential outliers. As can be seen in Eq. (2), we only consider the non-outliers when fitting the linear model.

In the limit \(\varepsilon \rightarrow \infty \) we have \(S = [n]\) and Problem 1 becomes equivalent to lasso (Tibshirani 1996). When \(\varepsilon \) is small, Problem 1 is a combinatorial problem in disguise: due to the subtraction of \(\varepsilon ^2\) in Eq. (2), we effectively try to find a maximal subset S.

Theorem 1

Minimising the loss in Eq. (2) maximises the size of the subset S in Eq. (3).

Proof

The subset size term in Eq. (2) is \(-\sum _{i\in S}{\varepsilon ^2}=-|S|\varepsilon ^2\), while for the residual term it holds that \(\sum \nolimits _{i\in S}{r_i^2/n}\le \varepsilon ^2\). This means that any change in the subset size has at least as large an impact on the loss as all the residuals combined, and for any \(\alpha \) and \(\alpha ^*\) satisfying \(\texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha )\le \texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha ^*)\), it then follows that \(|S|\ge |S^*|\). \(\square \)

2.1 Complexity class

Due to the combinatorial nature, finding an exact solution to Problem 1 is difficult. By showing that Problem 1 is a generalisation of a known \(\mathbf {NP}\)-hard problem we can give a lower bound for the complexity class.

Theorem 2

Problem 1 is \(\mathbf {NP}\)-hard and hard to approximate.

Proof

We prove the theorem by a reduction from the maximum satisfying linear subsystem problem (Ausiello et al. 1999, Problem MP10), which is known to be \(\mathbf {NP}\)-hard. In maximum satisfying linear subsystem we are given the system \(X\alpha = y\), where \(X \in \mathbb {Z}^{n \times m}\) and \(y \in \mathbb {Z}^n\), and we want to find \(\alpha \in \mathbb {Q}^{m}\) such that as many equations as possible are satisfied. This is equivalent to Problem 1 with \(\varepsilon = 0\) and \(\lambda = 0\). Additionally, the problem is not approximable within \(n^\gamma \) for some \(\gamma > 0\), according to Amaldi and Kann (1995). \(\square \)

2.2 Breakdown value

The robustness of robust regression methods is often measured using the breakdown value (Donoho and Huber 1983; Rousseeuw and Hubert 2011), which is defined as the theoretical minimum fraction of (adversarial) outliers that can cause an arbitrarily large deviation in the model. This can, e.g., be measured with (Alfons et al. 2013):

$$\begin{aligned} v_\text {breakdown} = \min \{ v: \sup \Vert X \alpha _v - Y \Vert = \infty \}, \end{aligned}$$

where v is the fraction of outliers and \(\alpha _v\) is the linear model that fits the dataset \((X_v, Y_v)\), in which a fraction v of the items has been replaced by items with arbitrary values (outliers).

Non-robust regression methods, such as ordinary least-squares, have a breakdown value of 1/n (Hubert and Debruyne 2009), i.e., a single outlier is enough to cause a breakdown. The practical upper limit for the breakdown value is 0.5, since values larger than that cannot be guaranteed without prior information.

Theorem 3

The breakdown value of Problem 1 is 0.5.

Proof

Following the definition, the breakdown value can be found as follows. We start from an uncorrupted (i.e., no outliers) dataset \((X_0, Y_0)\) with n items obeying a linear model parametrised by \(\alpha _0\) and where all data items are within the corridor defined by \(\varepsilon \). A fraction v of the data items are then changed into adversarial outliers, yielding the corrupted dataset \((X_v, Y_v)\).

The subset given by Problem 1 on \((X_0, Y_0)\) is \(S_0\) and \(|S_0| = n\). With a finite \(\varepsilon \) the breakdown occurs when the subset \(S_v\) for \((X_v, Y_v)\) contains corrupted items. This requires \(|S_v| \ge |S_0|/2\), since otherwise the uncorrupted subset is larger, \(|S_0 \setminus S_v| > |S_v|\), and is selected. Thus, the breakdown value is \(v = |S_v|/n = (|S_0|/2)/n = 0.5\). \(\square \)

In Sect. 7.2 we empirically validate the breakdown value.

3 Numeric approximation

In order to solve the \(\mathbf {NP}\)-hard Problem 1 efficiently in the general case, we relax the problem by replacing the Heaviside function with a sigmoid function \(\sigma (u)=1/(1+e^{-u})\) and introducing a continuous and differentiable rectifier function \(\phi (u)\approx \min {(0,u)}\). This allows us to compute the gradient and find \(\alpha \) by minimising

$$\begin{aligned} \beta \text {-}\texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha ) = \sum \nolimits _{i=1}^n{\sigma \left( \beta (\varepsilon ^2-r_i^2)\right) \phi \left( r_i^2/n-\varepsilon ^2\right) } + \lambda {\Vert \alpha \Vert }_1, \end{aligned}$$
(4)

where the parameter \(\beta \) determines the steepness of the sigmoid and the rectifier function \(\phi \) is parametrised by a small constant \(\omega >0\) such that

$$\begin{aligned} \phi (u) = \left\{ \begin{array}{lr} u, &{} \quad u\le -\omega , \\ -(u^2/\omega +\omega )/2, &{} \quad -\omega< u < 0, \\ -\omega /2, &{} \quad u\ge 0. \end{array}\right. \end{aligned}$$

It is easy to see that Eq. (4) is a smoothed variant of Eq. (1) and that the two become equal when \(\beta \rightarrow \infty \) and \(\omega \rightarrow 0^+\).
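A direct NumPy sketch of the smoothed loss in Eq. (4), with \(\phi \) implemented as above, might look as follows (the naming is ours; scipy's expit is used as a numerically stable sigmoid):

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def phi(u, omega=1e-15):
    """Smooth rectifier approximating min(0, u) for a small omega > 0."""
    return np.where(u <= -omega, u,
                    np.where(u >= 0.0, -omega / 2.0, -(u ** 2 / omega + omega) / 2.0))

def beta_loss(alpha, X, Y, epsilon, beta, lam=0.0, omega=1e-15):
    """Smoothed beta-loss of Eq. (4); approaches Eq. (1) as beta -> inf, omega -> 0+."""
    n = len(Y)
    r = Y - X @ alpha
    sig = expit(beta * (epsilon ** 2 - r ** 2))
    return np.sum(sig * phi(r ** 2 / n - epsilon ** 2, omega)) + lam * np.sum(np.abs(alpha))
```

Since \(\omega \) is tiny, \(\phi (u)\) is numerically indistinguishable from \(\min (0, u)\) in practice; the explicit form above is what makes the loss differentiable everywhere.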

3.1 Graduated optimisation

We perform the optimisation of Eq. (4) using graduated optimisation (Mobahi and Fisher 2015). Graduated optimisation iteratively solves a difficult optimisation problem by starting from a simpler version of the problem and progressively increasing its complexity. A natural parametrisation for the complexity of our problem is via the \(\beta \) parameter, since \(\beta =0\) corresponds to a convex optimisation problem equivalent to lasso and when \(\beta \rightarrow \infty \) the problem becomes equivalent to Problem 1. This yields an iterative optimisation strategy.

At each step we use the previous optimal value of \(\alpha \) as the starting point for the minimisation of Eq. (4), and then increase the value of \(\beta \). It is important that the optima for consecutive values of \(\beta \) are “close enough”, which is why we derive an approximation ratio between solutions with different values of \(\beta \). We observe that our problem can be rewritten as a maximisation of \(-\beta \text {-}\texttt {Loss} (\varepsilon , \lambda , X, Y, \alpha )\). The choice of \(\beta \) does not affect the L1-norm term, so we omit it for simplicity (\(\lambda = 0\)).

Theorem 4

Let \(\varepsilon , \beta _1, \beta _2 \ge 0\), such that \(\beta _1\le \beta _2\), and define the functions

$$\begin{aligned} f_j(r)=-\sigma (\beta _j(\varepsilon ^2-r^2))\phi (r^2/n-\varepsilon ^2), \end{aligned}$$

and \(G_j(\alpha )={\sum _{i=1}^n f_j(r_i)}\), where \(r_i = y_i - \alpha ^\intercal x_i\) and \(j\in \{1,2\}\). For \(\alpha _1={\mathrm{arg\,max}}_\alpha {G_1(\alpha )}\) and \(\alpha _2={\mathrm{arg\,max}}_\alpha {G_2(\alpha )}\), the inequality

$$\begin{aligned} G_2(\alpha _2)\le K G_2(\alpha _1) \end{aligned}$$
(5)

always holds, where \(K= G_1(\alpha _1)/\left( G_2(\alpha _1)\min _r{f_1(r)/f_2(r)}\right) \) is the approximation ratio.

Proof

The functions \(f_1\) and \(f_2\) are both always strictly positive: the inequalities \(\sigma (u) > 0\) and \(\phi (u) < 0\) hold for all \(u \in \mathbb {R}\), thus \(f_j(r) > 0\). Now, by definition, \(G_1(\alpha _2)\le G_1(\alpha _1)\). We denote \(r_i^*= y_i - \alpha _2^\intercal x_i\) and \(k=\min _r f_1(r)/f_2(r)\), which allows us to rewrite and bound:

$$\begin{aligned} G_1(\alpha _2)=\sum \nolimits _{i=1}^n{f_1(r_i^*)}=\sum \nolimits _{i=1}^n{f_2(r_i^*)f_1(r_i^*)/f_2(r_i^*)}\ge kG_2(\alpha _2). \end{aligned}$$

Then \(G_2(\alpha _2)\le G_1(\alpha _2)/k \le G_1(\alpha _1)/k \le G_2(\alpha _1) G_1(\alpha _1)/(k G_2(\alpha _1))\), and the inequality from the theorem holds. \(\square \)

We use Theorem 4 in the graduated optimisation to choose the sequence of increasing \(\beta \) values (\(\beta _1, \beta _2, \ldots , \beta _i > \beta _{i-1}\)), so that the approximation ratio K between consecutive steps stays constant.

3.2 Stopping criterion

Iterating until \(\beta \rightarrow \infty \) is not possible in practice, so we need a stopping criterion for the algorithm. The iterations should be stopped when \({\sigma (\beta (\varepsilon ^2 - r^2)) \approx H(\varepsilon ^2 - r^2)}\), i.e., the stopping criterion is dependent on the shape of the sigmoid function. The sigmoid shape is determined by both \(\beta \) and \(\varepsilon \). However, \(\varepsilon \) is expected to change often, depending on both the dataset and the task, so a stopping criterion independent of \(\varepsilon \) would be preferable.

Observation 1

Setting \(\beta _\text {max} \propto 1/\varepsilon ^2\) makes the stopping criterion independent of \(\varepsilon \).

Proof

Fix a pair of reference values \(\beta _\text {c}\) and \(\varepsilon _\text {c}\). We say that a sigmoid function parametrised by some \(\beta _\text {max}\) and \(\varepsilon \) has the same relative shape if and only if

$$\begin{aligned} \sigma (\beta _\text {max}(\varepsilon ^2 - (p\varepsilon )^2)) = \sigma (\beta _\text {c}(\varepsilon _\text {c}^2 - (p\varepsilon _\text {c})^2)) \end{aligned}$$

for every value of \(p \in \mathbb {R}\). Since the sigmoid function is strictly increasing, it can trivially be removed from both sides of the equation and hence

$$\begin{aligned} \begin{aligned} \sigma (\beta _\text {max}(\varepsilon ^2 - (p\varepsilon )^2))&= \sigma (\beta _\text {c}(\varepsilon _\text {c}^2 - (p\varepsilon _\text {c})^2))\\ \beta _\text {max}\varepsilon ^2(1-p^2)&= \beta _\text {c}\varepsilon _\text {c}^2(1-p^2)\\ \beta _\text {max}&= \beta _\text {c}\varepsilon _\text {c}^2/\varepsilon ^2 = c/\varepsilon ^2, \end{aligned} \end{aligned}$$
(6)

where \(c = \beta _\text {c}\varepsilon _\text {c}^2\) is a constant. \(\square \)

We empirically determine a good default value for c in Sect. 6.5.

3.3 Approximation ratio for the numeric approximation

By using the approximation ratio between two \(\beta \)-Losses (Theorem 4) we can derive a new approximation ratio between the losses of Eqs. (4) and (1) (the problem definition and the numeric approximation).

We set \(\beta _2\rightarrow \infty \) and \(\omega \rightarrow 0^+\) so that \(f_2(r)=-H(\varepsilon ^2-r^2)\phi (r^2/n-\varepsilon ^2)\). Additionally we introduce an \(\varepsilon ^*\) such that \(f_2^*(r)=-H({(\varepsilon ^*)}^2-r^2)\phi (r^2/n-{(\varepsilon ^*)}^2)\), \(G_2^*(\alpha )=\sum _{i=1}^n{f_2^*(y_i - \alpha ^\intercal x_i)}\), and \(\alpha _2^*={\mathrm{arg\,min}}_\alpha G_2^*(\alpha )\). This leads to a new approximation ratio \(K_{\varepsilon ^*}\) derived in the following lemma.

Lemma 1

The approximation ratio between the losses parametrised by \(\alpha _1\) and \(\alpha _2^*\) is \({K_{\varepsilon ^*} = G_1(\alpha _1) / \left( G_2^*(\alpha _1) k_{\varepsilon ^*}\right) }\) where

$$\begin{aligned} {k_{\varepsilon ^*} = \sigma (\beta _1(\varepsilon ^2 - {(\varepsilon ^*)}^2))\phi ({(\varepsilon ^*)}^2/n-\varepsilon ^2) / ({(\varepsilon ^*)}^2/n-{(\varepsilon ^*)}^2)}. \end{aligned}$$

Proof

The proof is omitted since it exactly mirrors that of Theorem 4, with the observation that \({k_{\varepsilon ^*} = \min _r f_1(r) / f_2^*(r) = \min _{r \le \varepsilon ^*}{( - f_1(r) / (r^2/n - {\varepsilon ^*}^2)})}\), which leads to

\(k_{\varepsilon ^*} = \sigma (\beta _1(\varepsilon ^2 - {(\varepsilon ^*)}^2)) {\phi ({(\varepsilon ^*)}^2/n-\varepsilon ^2)} / ({(\varepsilon ^*)}^2/n-{(\varepsilon ^*)}^2)\). \(\square \)

We call \(\alpha _2^*\) the matching solution, since it is the optimum for Eq. (1) closest to \(\alpha _1\). Note that \(\alpha _2^*\) has a different \(\varepsilon \) (namely \(\varepsilon ^*\)) that we need to specify in order to fully define the matching solution.

Lemma 2

For \({\varepsilon ^*}\) minimising the approximation ratio \(K_{\varepsilon ^*}\), it holds that

$$\begin{aligned} \varepsilon ^* = {\mathrm{arg\,max}}_{\varepsilon ^*}{\left( -\sum _{i=1}^n H({(\varepsilon ^*)}^2-r_i^2)(r_i^2/n-{(\varepsilon ^*)}^2) \frac{\sigma (\beta _1(\varepsilon ^2-{(\varepsilon ^*)}^2))\phi ({(\varepsilon ^*)}^2/n-\varepsilon ^2)}{{(\varepsilon ^*)}^2/n-{(\varepsilon ^*)}^2}\right) } \end{aligned}$$
(7)

where \(r_i = y_i - \alpha _1 ^\intercal x_i\).

Proof

Let us denote

$$\begin{aligned} \varepsilon ^* = {\mathrm{arg\,min}}_{\varepsilon ^*} K_{\varepsilon ^*} = {\mathrm{arg\,min}}_{\varepsilon ^*} G_1(\alpha _1) / ( G_2^*(\alpha _1) k_{\varepsilon ^*} ) = {\mathrm{arg\,max}}_{\varepsilon ^*} G_2^*(\alpha _1) k_{\varepsilon ^*}, \end{aligned}$$

from which Eq. (7) follows. \(\square \)

Due to the non-continuity of the Heaviside function, the maximum can be found at \(\varepsilon ^* = r_j\) for some \(j\in [n]\). We can further assume, without loss of generality, that the residuals are sorted so that \({r_1^2 \le r_2^2 \le \cdots \le r_n^2}\), which means that \(\sum \nolimits _{i=1}^nH(r_j^2-r_i^2) = j\). With a large enough n, so that \(1/n \approx 0\), Eq. (7) can be simplified to

$$\begin{aligned} \varepsilon ^* \approx {\mathrm{arg\,max}}_{r_j} j \cdot \sigma (\beta _1(\varepsilon ^2-r_j^2)). \end{aligned}$$
(8)
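As a concrete illustration, the matching tolerance of Eq. (8) can be computed from the residuals in a few lines of NumPy (a sketch with our own naming):

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def matching_epsilon(r, epsilon, beta):
    """Approximate eps* of Eq. (8): the sorted residual r_j maximising
    j * sigmoid(beta * (eps^2 - r_j^2))."""
    r2 = np.sort(r ** 2)
    j = np.arange(1, len(r2) + 1)
    score = j * expit(beta * (epsilon ** 2 - r2))
    return np.sqrt(r2[np.argmax(score)])
```

For small \(\beta \) the score keeps growing with j and \(\varepsilon ^*\) is close to the largest residual; as \(\beta \) grows, residuals larger than \(\varepsilon \) are suppressed and \(\varepsilon ^*\) approaches \(\varepsilon \) from below.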

Now, if the data is subsampled to a constant size, then Eq. (4) has a constant approximation ratio for the matching solution.

Theorem 5

The matching solution \(\alpha _2^*\) satisfies the inequality \({G_2^*(\alpha _2^*) \le K_{\varepsilon ^*} G_2^*(\alpha _1)}\) where \({K_{\varepsilon ^*} = \mathscr {O}(\log n)}\) is the approximation ratio.

Proof

Since \(\varepsilon ^* = r_j\) for some \(j \in [n]\), it follows that \(K_{\varepsilon ^*} \le K_{r_t}\) for all \(t \in [n]\), where the definition of \(K_{r_t} = G_1(\alpha _1) / (G_2^{r_t}(\alpha _1) k_{r_t})\) follows Lemma 1 (with \(r_t\) instead of \(\varepsilon ^*\)). Hence,

$$\begin{aligned} 1/K_{r_t} = G_2^{r_t}(\alpha _1) k_{r_t} / G_1(\alpha _1) \le \left( G_2^*(\alpha _1) k_{\varepsilon ^*}\right) / G_1(\alpha _1) = 1/ K_{\varepsilon ^*}, \end{aligned}$$

and we can derive, by rearranging the terms in the inequality,

$$\begin{aligned} k_{r_t} \le G_2^*(\alpha _1) k_{\varepsilon ^*} / G_2^{r_t}(\alpha _1). \end{aligned}$$

Assuming that the residuals are sorted so that \({r_1^2 \le r_2^2 \le \cdots \le r_n^2}\), then

$$\begin{aligned} G_2^{r_t}(\alpha _1) = -\sum _{i=1}^n H(r_t^2-r_i^2)(r_i^2/n-r_t^2) \ge -t(r_t^2/n - r_t^2), \end{aligned}$$

and further following the definition in Lemma 1

$$\begin{aligned} k_{r_t} = \sigma (\beta _1(\varepsilon ^2 - r_t^2)) \phi (r_t^2/n-\varepsilon ^2) / (r_t^2/n-r_t^2), \end{aligned}$$

which, by rearranging, lets us derive

$$\begin{aligned} \begin{aligned} -\sigma (\beta _1(\varepsilon ^2 - r_t^2)) \phi (r_t^2/n-\varepsilon ^2)&= k_{r_t} \cdot (-(r_t^2/n-r_t^2))\\&\le G_2^*(\alpha _1) k_{\varepsilon ^*} / G_2^{r_t}(\alpha _1) \cdot (-(r_t^2/n-r_t^2))\\&\le G_2^*(\alpha _1) k_{\varepsilon ^*} / (-t(r_t^2/n - r_t^2)) \cdot (-(r_t^2/n-r_t^2))\\&= G_2^*(\alpha _1) k_{\varepsilon ^*} / t. \end{aligned} \end{aligned}$$

Inserting this into \(G_1\) yields

$$\begin{aligned} \begin{aligned} G_1(\alpha _1)&= -\sum \nolimits _{i=1}^n{\sigma (\beta _1(\varepsilon ^2-r_i^2)) \phi (r_i^2/n - \varepsilon ^2) }\\&\le \sum \nolimits _{i=1}^n G_2^*(\alpha _1) k_{\varepsilon ^*} / i \\&\le G_2^*(\alpha _1) k_{\varepsilon ^*} (\log n + 1), \end{aligned} \end{aligned}$$

and when combined with \(K_{\varepsilon ^*}\) from Lemma 1 we have

$$\begin{aligned} K_{\varepsilon ^*} = G_1(\alpha _1) / (G_2^*(\alpha _1) k_{\varepsilon ^*}) \le G_2^*(\alpha _1) k_{\varepsilon ^*} (\log n + 1) / (G_2^*(\alpha _1) k_{\varepsilon ^*}) = \log n + 1, \end{aligned}$$

which completes the proof. \(\square \)

4 The slise algorithm

In this section we describe an approximate numeric algorithm, slise, for solving Problem 1. We start by introducing the algorithm, and then continue by discussing different initialisation schemes. Finally we demonstrate the asymptotic complexity of the slise algorithm.

The slise algorithm (Algorithm 1) takes as input the data and the optimisation parameters. The algorithm starts by selecting initial values for the linear model \(\alpha \) and the sigmoid steepness \(\beta \) (line 3). There are several potential initialisation schemes, and we will present and discuss four alternatives later in this section (see Sect. 4.1 and Algorithm 3).

Algorithm 1 The slise algorithm (pseudocode figure)

The main part of slise consists of the graduated optimisation (lines 4–7), which optimises the values of \(\alpha \) and \(\beta \). This is done by alternating between optimising \(\alpha \) and increasing \(\beta \) until we reach \(\beta _\mathrm {max}\). At each iteration, we need to find the \(\alpha \) that minimises the \(\beta \)-Loss in Eq. (4), using the current value of \(\beta \) (line 5). This optimisation is done with owl-qn (Schmidt et al. 2009), which is a quasi-Newton optimisation method with built-in L1-regularisation. We then increase \(\beta \) (line 6) such that the approximation ratio between consecutive steps, as defined in Theorem 4, equals K.
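The following is a minimal, self-contained sketch of this loop, assuming data X (with an intercept column) and Y as above. It is not the reference implementation: scipy's L-BFGS-B is used in place of owl-qn (so the L1 term is simply kept in the objective), \(\phi \) is replaced by \(\min (0,\cdot )\) since \(\omega \) is below the machine epsilon, and the \(\beta \)-update is a plain geometric schedule standing in for the approximation-ratio-based update of Algorithm 2:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def beta_loss(alpha, X, Y, epsilon, beta, lam):
    n = len(Y)
    r = Y - X @ alpha
    return (np.sum(expit(beta * (epsilon ** 2 - r ** 2))
                   * np.minimum(0.0, r ** 2 / n - epsilon ** 2))
            + lam * np.sum(np.abs(alpha)))

def slise_sketch(X, Y, epsilon, lam=0.0, c=20.0, max_iter=300):
    beta, beta_max = 0.0, c / epsilon ** 2       # beta_max = c / eps^2 (Sect. 3.2)
    alpha = np.zeros(X.shape[1])                 # "zeros" initialisation, for brevity
    while True:
        # line 5: minimise the beta-loss for the current beta, warm-started at alpha
        alpha = minimize(beta_loss, alpha, args=(X, Y, epsilon, beta, lam),
                         method="L-BFGS-B", options={"maxiter": max_iter}).x
        if beta >= beta_max:
            return alpha
        # line 6: increase beta; Algorithm 2 would instead pick the next beta so that
        # the approximation ratio of Theorem 4 equals a target K
        beta = min(beta_max, 2.0 * beta if beta > 0.0 else beta_max / 1000.0)
```

With this geometric schedule the loop runs roughly ten graduated steps; the approximation-ratio-based schedule instead adapts the step sizes to the data.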

The pseudocode for the approximation ratio calculation is provided in Algorithm 2. In Eq. (4) we use the rectifier function \(\phi \) to ensure that each term is non-positive. This function requires a constant \(\omega \), whose value can be chosen to be smaller than the machine epsilon. Hence, in the approximation ratio calculation (line 4), \(\phi (u)\) is numerically equivalent to \(\min (0, u)\). As a minor computational side effect of this, we have to make sure not to divide by zero if all \(\phi (r_i) = 0\) (lines 8–11).
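A sketch of the ratio computation, under the same simplification \(\phi (u) \approx \min (0, u)\) and with \(\min _r f_1(r)/f_2(r)\) approximated over the observed residuals (the exact search in Algorithm 2 may differ; the naming below is ours):

```python
import numpy as np
from scipy.special import expit

def approximation_ratio(alpha, X, Y, epsilon, beta1, beta2):
    """Approximation ratio K of Theorem 4 for the current alpha and beta1 <= beta2."""
    n = len(Y)
    r2 = (Y - X @ alpha) ** 2
    neg_phi = -np.minimum(0.0, r2 / n - epsilon ** 2)   # >= 0, zero for large residuals
    f1 = expit(beta1 * (epsilon ** 2 - r2)) * neg_phi
    f2 = expit(beta2 * (epsilon ** 2 - r2)) * neg_phi
    if not np.any(neg_phi > 0.0):                       # all phi(.) = 0: avoid division by zero
        return 1.0
    k = np.min(f1[neg_phi > 0.0] / f2[neg_phi > 0.0])   # the phi factors cancel here
    return np.sum(f1) / (np.sum(f2) * k)
```

In the graduated optimisation, the next \(\beta \) is then chosen (e.g., by a simple one-dimensional search over candidate values) so that this ratio equals the target K.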

Algorithm 2 Calculating the approximation ratio (pseudocode figure)

4.1 Initialisation schemes

In Algorithm 3 we introduce four alternative initialisation schemes, and later in Sect. 6.3 we empirically compare the proposed approaches.

The first initialisation scheme (lines 2–5) is to use the non-robust counterpart to slise, i.e., lasso regression, by setting \(\beta = 0\). Since lasso is a convex problem, the choice of initial \(\alpha \) does not affect the result; here we use ordinary least-squares regression to obtain the initial \(\alpha \).

With \(\beta > 0\) the problem is no longer convex and the choice of initial \(\alpha \) becomes important. The next two initialisation schemes (lines 6–9 and lines 10–13) offer two alternatives. In the first, the initial \(\alpha \) (line 7) is given by non-sparse ols, while in the second, \(\alpha \) (line 11) is a maximally sparse vector of only zeros. In both cases the approximation ratio (Algorithm 2) is used to select an initial value for \(\beta \).

The final scheme (lines 14–27) is inspired by the initialisation used in fast-lts (Rousseeuw and Van Driessen 2000) and ransac (Fischler and Bolles 1981), which are related robust regression methods. The idea is to generate \(u_\text {init}\) initial candidates and heuristically select the best one. The candidates are generated by drawing random subsets (\(X_S\), \(Y_S\)) of \(d+1\) items, i.e., \(X_S\in \mathbb {R}^{(d+1) \times d}\) and \(Y_S\in \mathbb {R}^{d+1}\), and fitting linear models to them (using ols, lines 20–22). Note that the probability that at least one of the subsets is free from outliers is given by \(p = 1 - (1 - (1 - o)^d )^u\), where o is the fraction of outliers, d is the number of dimensions, and u is the number of candidates. If d is large, then u would also have to be large to compensate. To alleviate this potential issue we use pca to temporarily reduce the number of dimensions when d is larger than a threshold \(t_\text {pca}\) (lines 17–19). Finally, the best candidate \(\alpha \) is the one that minimises the \(\beta \)-Loss, and \(\beta \) is updated to match the currently best \(\alpha \) (lines 23–26).
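A simplified sketch of the candidates scheme (omitting the pca step and using a fixed placeholder \(\beta \) when scoring the candidates; the full algorithm instead sets \(\beta \) via the approximation ratio):

```python
import numpy as np
from scipy.special import expit

def candidates_init(X, Y, epsilon, u_init=500, seed=None):
    """Fit OLS models to u_init random subsets of d + 1 items and keep the
    candidate alpha with the smallest (smoothed) beta-loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta0 = 1.0 / epsilon ** 2                       # placeholder initial steepness
    best_alpha, best_loss = np.zeros(d), np.inf
    for _ in range(u_init):
        idx = rng.choice(n, size=min(d + 1, n), replace=False)
        alpha = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
        r2 = (Y - X @ alpha) ** 2
        loss = np.sum(expit(beta0 * (epsilon ** 2 - r2))
                      * np.minimum(0.0, r2 / n - epsilon ** 2))
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, beta0
```

Because each candidate is scored on the full data, starting points that sit near a cluster of outliers tend to receive a poor \(\beta \)-Loss and are discarded.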

Algorithm 3 Initialisation schemes for slise (pseudocode figure)

4.2 Asymptotic complexity

The evaluation of the loss function (and its gradient) has a time complexity of \(\mathscr {O}(nd)\), due to the multiplication between the linear model \(\alpha \) and the data-matrix X. The approximation ratio calculation also has a complexity of \(\mathscr {O}(nd)\) for the same reason. This means that the complexity of the initialisation is dominated by the complexity of ols, which is \(\mathscr {O}(\min (nd^2, d^3))\).

The optimisation consists of owl-qn and graduated optimisation. owl-qn is a variant of lbfgs and has a per-iteration time complexity of \(\mathscr {O}(md)\) (Schmidt et al. 2009), where m is the size of the “memory” for approximating the inverse Hessian. Additionally, owl-qn requires the loss value and gradient to be calculated at every iteration. In practice, the number of iterations can be limited by a constant upper bound (see Sect. 6.4). The graduated optimisation only adds the approximation ratio calculation, the cost of which is negligible.

The total asymptotic time complexity of slise is a combination of the initialisation and the optimisation complexities: \(\mathscr {O}(\min (nd^2,d^3)+ndp)\), where p is an upper bound for the total number of optimisation iterations. However, in most cases \(p \gg d\) and the ols term becomes vanishingly small (see Sect. 7.1). In cases where \(d>p\), the exact ols solution can be replaced by an optimisation, e.g., using lbfgs (\(\mathscr {O}(ndi_\text {max})\)), since the initialisation does not have to be exact. In this case the complexity of slise becomes \(\mathscr {O}(ndp)\), or rather \(\mathscr {O}(nd)\), since p is a constant. Also note that any given dataset might require fewer than p iterations, but this is linked to the difficulty of finding the largest linear subset rather than to the size of the dataset.

The memory complexity of slise is at least \(\mathscr {O}(nd)\), i.e., the memory required to store the dataset itself. The loss function adds \(\mathscr {O}(\max (n,d))\), the approximation ratio \(\mathscr {O}(n)\), and owl-qn \(\mathscr {O}(md)\). This makes the total asymptotic memory complexity of slise \(\mathscr {O}(nd)\).

5 Experimental setup

We divide the empirical evaluation into three sections. In this section we describe the datasets and environment used for the experiments. In Sect. 6, we empirically determine suitable default values for the parameters of slise, and which initialisation method to recommend. Later, in Sect. 7 we compare slise to competing methods by demonstrating that (i) slise scales better on high-dimensional datasets than competing methods, (ii) slise is very robust to noise, and (iii) the solutions found using slise are optimal.

The experiments were run using R (Microsoft and R Core Team 2019, v. 3.5) on a high-performance computer cluster (FGCI 2021), using 4 cores from an Intel Xeon E5-2680 at 2.4 GHz, and reserving 16 GB of RAM. Our implementation of the slise algorithm, the code to run all the experiments, and the data processing are released as open source (Björklund et al. 2021).

5.1 Datasets

For the empirical evaluation we use both real and synthetic datasets. An overview of all the datasets is shown in Table 1. The real datasets are three regression datasets from the UCI Machine Learning Repository (Dua and Graff 2019), namely student (Cortez and Silva 2008), air quality (De Vito et al. 2008), and superconductivity (Hamidieh 2018).

As additional real datasets we use some classification datasets for which we create regression tasks with the help of complex classifiers. The dataset and classifier pairs are handwritten digits (Cohen et al. 2017, emnist) with a convolutional neural network, movie reviews (Maas et al. 2011, imdb) with a support vector machine, and particle jets (HIP CMS Experiment 2019, physics) with a neural network. The predictions from the classifiers are probabilities, which we turn into real values using the logit function: \(y_i' = \log (y_i / (1 - y_i))\). emnist has ten classes (digits), which yields ten different regression tasks. Whenever one of these tasks is used, we randomly subsample the dataset such that 50% of the items are of the “correct” digit and 50% are from all the other digits. This creates datasets with many potential outliers.

Synthetic datasets are generated as follows. The data matrix \(X \in \mathbb {R}^{n \times d}\) is created by sampling from ten normal distributions with randomised means and variances (\(\mu \sim \mathscr {N}(0, 2)\) and \(\sigma ^2 \sim \mathscr {U}(0, 1)\)). The response vector \(Y \in \mathbb {R}^n\) is created by \(y_i = a_{k0} + a_k ^\intercal x_i + e\), where e is normal noise with zero mean and unit variance, and \((a_{k0}, a_k) \in \mathbb {R}^{d+1}\) is one of seven linear models with the coefficients drawn from a uniform distribution between \(-1\) and 1. Each of the seven models is used to create 10% of the Y-values, except one that creates 40% of the Y-values. The code for creating the synthetic datasets is available in the repository (Björklund et al. 2021) for full reproducibility.
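For reference, a sketch of such a generator follows; the exact partitioning into the ten distributions and seven models is an assumption on our part, and the released code remains authoritative:

```python
import numpy as np

def make_synthetic(n=10_000, d=100, seed=0):
    rng = np.random.default_rng(seed)
    # ten d-dimensional normal distributions with randomised means and variances
    mus = rng.normal(0.0, 2.0, size=(10, d))
    sds = np.sqrt(rng.uniform(0.0, 1.0, size=(10, d)))
    comp = rng.integers(0, 10, size=n)                  # which distribution each item comes from
    X = rng.normal(mus[comp], sds[comp])
    # seven linear models (intercept + d coefficients) with coefficients in [-1, 1]
    A = rng.uniform(-1.0, 1.0, size=(7, d + 1))
    model = rng.choice(7, size=n, p=[0.4] + [0.1] * 6)  # one model generates 40% of the items
    Y = A[model, 0] + np.einsum("ij,ij->i", X, A[model, 1:]) + rng.normal(0.0, 1.0, n)
    return X, Y
```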

Table 1 The datasets used in the experiments. The synthetic dataset can be generated to any desired size

5.2 Pre-processing

slise uses lasso regularisation both to introduce sparsity and to regularise the model. Since the lasso penalty sums the absolute values of the coefficients, it is important that the variables have roughly the same magnitude (Tibshirani 1996). Normalising the variables also makes it easier to compare the relative importance of the variables when interpreting the results. Thus, the variables in all datasets, except imdb and emnist, have been centred around (subtracted) the median and scaled (divided) by the mad (median absolute deviation), since this is a more robust form of normalisation than using means and standard deviations (Rousseeuw and Hubert 2011). Furthermore, we add an intercept column to the datasets after the normalisation. For all datasets we also normalise the target (Y) in the same way, to make the \(\varepsilon \):s comparable.
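A minimal sketch of this robust scaling (our own helper, not part of the slise package):

```python
import numpy as np

def robust_scale(Z):
    """Centre each column by its median and scale it by its MAD."""
    med = np.median(Z, axis=0)
    mad = np.median(np.abs(Z - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)      # guard against constant columns
    return (Z - med) / mad
```

The intercept column of ones is appended only after this scaling.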

In the student dataset we remove the grades for the first and second period, since these are very correlated with the target (the grades for the third period), as noted by Cortez and Silva (2008).

For the emnist dataset the targets (Y) are robustly scaled as described above, whereas the pixel values of the input images are just linearly scaled from [0, 255] to \([-1,1]\). Some of the pixels have the same value in all of the images (e.g., the corner pixels), which is problematic for some of the comparison methods (in Sect. 7), so these are removed and the images are flattened to vectors of length 672.

The text data in the imdb dataset is transformed into real-valued vectors by using a bag-of-words model, after case normalisation, removal of stop words, removal of punctuation, and stemming. The obtained word frequencies are divided by the most frequent word in each review to adjust for different review lengths, yielding real-valued vectors of length 1000.

6 Parameter experiments

The slise algorithm presented in Sect. 4 has multiple parameters that must be set. In addition, we also presented four different initialisation schemes. In this section we find good default values for the parameters, and determine which initialisation scheme to recommend.

The two most important parameters for slise are the error tolerance \(\varepsilon \) and the sparsity coefficient \(\lambda \). These depend on both the use case and the dataset. Therefore, they should be manually adjusted whenever slise is used. The default values for the other parameters, and which initialisation scheme to recommend, are selected based on empirical evidence. Specifically, we base the selection on the value of the loss function and the running time.

All experiments are run ten times per dataset (with different random seeds) in order to better capture the variance (this means that emnist is run a hundred times, due to the ten different tasks/digits). Furthermore, any dataset with \(n > 10{,}000\) is randomly subsampled to \(n=10{,}000\). If nothing else is mentioned, the values in Table 2 are used as default values for the parameters in the experiments.

Table 2 In the experiments we use the following default values for the parameters of slise, if nothing else is mentioned

6.1 Error tolerance

The role of the error tolerance is to enable the detection and exclusion of outliers. Depending on the dataset or task, there might be obvious limits for the error, which should be used when available. The question, hence, is how to select the value of \(\varepsilon \) without such knowledge.

The value of \(\varepsilon \) directly affects the size of the subset that fits the resulting model. Ideally, we want a value of \(\varepsilon \) such that the subset contains a large number of non-outliers and no outliers. In order to better understand how to set the value of \(\varepsilon \) we consider, in Fig. 2, how \(\varepsilon \) corresponds to subset sizes for the datasets.

Fig. 2

Measuring how the \(\varepsilon \) value affects the subset size. Small values result in (too) small subsets, while large values have a diminishing return (due to the natural sparseness at the edges of the distributions and to outliers)

To be able to guarantee the maximum possible robustness we could proceed as Rousseeuw (1984) and select the \(\varepsilon \):s that correspond to subset sizes of 50%. Since none of the datasets considered here contain that many outliers, another possibility is to be less strict and choose subset sizes of, e.g., 75% (Alfons et al. 2013). However, decreasing \(\varepsilon \) makes the task more difficult, and we want to test the parameters under adverse conditions. Hence, in this paper we use 30% for all datasets. Furthermore, in the previous paper on slise (Björklund et al. 2019) the main use case is explaining outcomes from black box models, where smaller values of \(\varepsilon \) might be preferable. The corresponding \(\varepsilon \) values for all datasets can be seen in Table 3.

Table 3 Selection of \(\varepsilon \) based on the resulting subset size

Another way to investigate the choice of \(\varepsilon \) is to plot the distribution of errors relative to the size of \(\varepsilon \). When \(\varepsilon \) is small only the peak fits within \([-\varepsilon ,\varepsilon ]\), and when \(\varepsilon \) is large parts of the tails are included. It is natural for distributions to have tails, and not all items in the tails are outliers, but this could nonetheless be used as another criterion for selecting \(\varepsilon \), as can be seen in Fig. 3.

Fig. 3

The distributions of errors relative to \(\varepsilon \) can give information on the effect of \(\varepsilon \). When \(\varepsilon \) becomes large some of the tails of the error distributions fit inside \([-\varepsilon ,\varepsilon ]\). This is a sign that \(\varepsilon \) might be too large. For example, \(\varepsilon > 0.8\) with the physics dataset includes tails within the interval (the selected subset)

6.2 Regularisation

The regularisation coefficient \(\lambda \) is dependent on the use case and dataset. With lasso-type methods it is common to scale the regularisation by the number of items n (or by dividing the rest of the loss by n). For slise the size of the subset |S| would be a better choice but that is not known in advance. Furthermore, the rest of the loss function is proportional to \(\varepsilon ^2\), so selecting \(\lambda \propto n \varepsilon ^2\) makes the transition between dataset sizes and parameter values easier. For the purpose of the experiments, we only use a minimal regularisation by setting \(\lambda = 10^{-4} \cdot n \varepsilon ^2\), since we are not looking for a specific sparsity.

6.3 Initialisation

In Sect. 4 we present four different schemes for selecting the initial values for the linear model \(\alpha \) and sigmoid steepness \(\beta \). Figure 4 shows the results from comparing the four initialisation schemes. No particular method stands out, which indicates that the combination of graduated optimisation and owl-qn yields good performance overall. Furthermore, there are no major differences in running time, according to Table 4. However, slise can only be guaranteed to find a local optimum (in contrast to finding the global optimum), so we need to consider possible failure modes.

Fig. 4

Losses for the different initialisation methods. Lower is better

Table 4 Median times (for the whole algorithm) when trying different initialisation methods. Lower values are better

Both lasso and ols are non-robust so even a single outlier can lead to arbitrarily large deviations (Donoho and Huber 1983). This means that the starting points might be heavily influenced by outliers. Starting from a vector of zeros is good for sparsity, but with a large enough \(\lambda \) it becomes a local optimum. It is, however, easy to detect when the optimisation has failed to escape this kind of local optimum, by checking if \({\Vert \alpha \Vert }_1 \approx 0\).

Another problem with using a fixed starting point (which is what the lasso, ols, and zeros initialisation schemes do) is that if the starting point is bad, then there is no way to detect and recover from it. This can be seen in Fig. 5, where the fixed starting \(\alpha \):s are all very close to a cluster of outliers, which creates a local optimum that the optimisation cannot escape. The candidates initialisation is designed to detect and discard these bad local optima early, by selecting a candidate based on the best \(\beta \)-Loss.

Fig. 5

Example showing when some of the initialisation schemes fail. The data is constructed such that the starting \(\alpha \):s from lasso, ols, and zeros all pass by a cluster of outliers. These outliers create a local optimum which only the candidates initialisation scheme is able to avoid. slise is used with \(\varepsilon = 0.2\) and \(\lambda = 0.01\)

The candidates initialisation scheme is non-deterministic, and requires the number of candidates \(u_\text {init}\) as a parameter. A larger number increases the likelihood that a good candidate is found, but also increases the running time. However, the results from Fig. 6 show that the number of candidates only has a small impact, and also that the time differences are negligible. fast-lts (Rousseeuw and Van Driessen 2000) has a similar notion of candidates, and by default they use \(u_\text {init}=500\), which seems to be a reasonable choice also in this case.

Fig. 6

Losses and running times for different numbers of initial candidates in the initialisation. For a given dataset, the losses and running times are for all practical purposes equal across different numbers of initial candidates. The results from multiple runs are aggregated using the mean. Lower is better for both time and loss

With \(u_\text {init}=500\) fixed, we can find a value for the pca threshold \(t_\mathrm {pca}\) that is independent of the dataset. The formula for the probability of at least one candidate having no outliers is \(p=1-(1-(1-o)^d)^u\), where o is the fraction of outliers, \(d = t_\mathrm {pca}\), and \(u=u_\text {init}=500\). A larger threshold leads to less information being lost in the pca, while a smaller value increases p. The curves for different \(t_\mathrm {pca}\) can be seen in Fig. 7. If \(t_\mathrm {pca}=10\) then \(p \approx 0.38\) when \(o = 0.5\), which should be sufficient in most scenarios, since including one outlier does not automatically lead to an inescapable and bad local optimum.
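The quoted probability can be checked directly with a tiny helper (the name is ours):

```python
def p_clean_subset(o, d, u):
    """Probability that at least one of u random subsets of d items has no outliers."""
    return 1 - (1 - (1 - o) ** d) ** u

print(p_clean_subset(o=0.5, d=10, u=500))   # prints ~0.386
```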

Fig. 7

The probability of finding at least one subset without outliers for different fractions of outliers and different dimensions for the pca threshold. This can be calculated with the formula \(p = 1 - (1 - (1 - o) ^d) ^u\). Both higher probabilities and more dimensions are better

Based on the results in Fig. 4, Table 4, and Fig. 5, we choose candidates as the recommended initialisation scheme. It is the only stochastic initialisation scheme, which means that it does not have a fixed (potentially bad) starting point. Alternatively, the failure mode of zeros is easy to detect, and the initialisation is fast, so it is our second choice if a deterministic algorithm is desired.

6.4 Iterations

slise incorporates two iterative optimisation methods, owl-qn and graduated optimisation. Increasing the number of iterations leads to better results for both methods, but beyond a point there are clear diminishing returns. The number of iterations in graduated optimisation is determined by the target approximation ratio K (where \(K > 1\)). A larger value results in fewer iterations. However, we require the steps to be small enough that consecutive iterations have similar optima.

The number of iterations in owl-qn can be controlled by several different convergence criteria, but the simplest one is to limit the number of iterations to \(i_\text {max}\). This has the advantage of ensuring an upper bound on the number of iterations in the worst-case scenario. Additionally, since owl-qn is run multiple times on similar problems, there would be a lot of wasted resources if it were forced to fully converge at each graduated optimisation iteration.

As can be seen in Fig. 8, K and \(i_\text {max}\) complement each other, in that a decrease in K can be compensated for by a decrease in \(i_\text {max}\) while preserving time and loss values. The choice of values for these parameters can have a massive impact on the running time, while the impact on the loss is often minimal. Based on the results, the combination of \(K=1.15\) and \(i_\text {max}=300\) is a good default trade-off between time and loss. Furthermore, the last optimisation with owl-qn is not an intermediate step and is, therefore, allowed extra time to converge, by multiplying \(i_\text {max}\) by four (\(i_\text {max}=1200\) when \(\beta =\beta _\text {max}\)).

Fig. 8

Losses and running times for different values of the two parameters that control the number of iterations. The results from multiple runs are aggregated using the mean. Lower values are better for both time and losses

6.5 Stopping parameter

It is sufficient that the stopping parameter \(\beta _\text {max}\) is large enough to make the sigmoid function essentially equivalent to a Heaviside function. As shown in Sect. 3.2, in order to make the shape of the sigmoid independent of \(\varepsilon \), \(\beta _\text {max}\) has to be defined as \({\beta _\text {max}=c/\varepsilon ^2}\). The results in Fig. 9 show that \(c = 20\) is sufficiently large, and any larger value merely increases the running time.

Fig. 9

Losses and running times for various values for the \(\beta _\text {max}\) parameter. Lower values are better

7 Robust regression experiments

In this section we compare slise to five robust regression methods: sparse-lts (Alfons et al. 2013), smdm (Koller and Stahel 2017), conquer (Fernandes et al. 2021), mte-lasso (Qin et al. 2017), and ransac (Fischler and Bolles 1981). lasso (Tibshirani 1996) is also included as a non-robust baseline. To make the comparison maximally useful we compare against implementations found in established software libraries. See Table 5 for an overview of all methods.

All algorithms have been used with default settings, with the exception of sparse-lts, which has been used with a subset size of n/2 for maximal robustness, and ransac, where we use 20,000 trials and the same error threshold as for slise. In the case of slise, the parameter values are the same as above and can be found in Table 2.

Table 5 Properties of the regression methods compared in the experiments

7.1 Scalability

First, we investigate the scalability of the methods. Many of the methods have the same asymptotic time complexity, \(\mathscr {O}(n d^2)\), but almost all are iterative methods and the number of iterations is not accounted for in the complexity. We empirically determine the running time on the synthetic datasets with (i) a fixed number of dimensions (\(d = 100\)) and an increasing number of items, and (ii) a fixed number of items (\(n = 10{,}000\)) and an increasing number of variables. The results are aggregated from ten different runs. We also limit the calculations to 1000 s. The results are shown in Fig. 10.

We observe that slise performs comparably to the other robust regression methods when the number of items increases (the left plot of Fig. 10). However, when the number of variables increases (the right plot of Fig. 10) slise is faster than most robust regression methods. The only exception is conquer, which is almost as fast as lasso.

Fig. 10

Running times in seconds. Left: Varying the number of samples n with a fixed number of dimensions \(d = 100\). Right: Varying the number of dimensions d with a fixed number of samples \(n= 10{,}000\). The cut-off time is shown using a dashed horizontal line at \(t = 1000\) s. Lower is better

The scalability experiment only tests running times on a single, synthetic, dataset. To get a broader perspective we also evaluate the running times on the real datasets. The results in Fig. 11 show that slise is comparable to, or only a couple of seconds slower than, most methods on the small datasets. However, on the larger datasets, superconductivity, imdb, and emnist, slise is much faster than the other robust methods, except for conquer. On the larger datasets slise is actually faster than Fig. 10 would suggest, which demonstrates how the difficulty of any given dataset or task affects the running time.

Fig. 11

Running times on real datasets. Lower is better

7.2 Robustness

Next, we empirically compare the methods’ robustness to outliers. To do this we corrupt datasets by replacing a fraction of the items with outliers. We utilise two types of outliers commonly found in the literature (Rousseeuw and van Zomeren 1990; Alfons et al. 2013): vertical outliers and leverage points. The dataset we use is a variant of the synthetic dataset, where the Y values are from only one model (so that we can be sure that there are no inherent outliers).

All methods are trained on datasets corrupted by outliers and evaluated on the uncorrupted datasets. If a method is robust to a certain fraction of noise then the residuals for the uncorrupted data will be small. The results are shown in Fig. 12. The breakdown value is the point where the curves start trending upwards, and at high outlier fractions all methods are expected to break down.

Fig. 12

Robustness to outliers. The x-axis shows the fraction of outliers and the y-axis the mean absolute error on the clean dataset. Consistently small residuals as the number of outliers increases indicate a robust method

Vertical outliers are outliers where the target value is different from the rest. We create a vertical outlier by taking a non-outlier item i and replacing \(y_i\) with \(y_i' \sim \mathscr {N}(10, 1)\). As we can see in the left plot of Fig. 12, slise and sparse-lts are the most robust ones. However, vertical outliers are generally considered to be an easier type of outlier (Alfons et al. 2013).

Leverage points are outliers where the variable values are unusual. Here we create a leverage point by taking a non-outlier item i and replacing \(x_i\) with \(x_i' \sim \mathscr {N}(10, 1)\). When the fraction of leverage points is high, most of the correlation between the predictors and the target is broken, so regression methods tend to converge towards constant predictions, rather than breaking down. Nonetheless, in the right plot of Fig. 12 we can see that lasso and mte-lasso break down immediately, while conquer and ransac follow soon thereafter. The absolute errors on the clean data from smdm, slise, and sparse-lts stay low for larger fractions of outliers, indicating that they are more robust towards leverage points.

Since the robustness experiment is performed on a rather strict dataset, we also consider the robustness to outliers on the real datasets. In Fig. 13 we can see how adding vertical outliers affects the behaviour of the regression methods on real datasets. ransac fares well on the low-dimensional datasets, physics and air quality, but fails on even moderately sized datasets. This is because the chance of randomly finding a set of non-outliers shrinks exponentially with the number of dimensions (Fischler and Bolles 1981), so even the high number of trials (20,000) is not enough. In contrast, slise consistently achieves a breakdown value of at least 0.5.

Fig. 13

Robustness to outliers on real datasets. The x-axis shows the fraction of outliers and the y-axis the mean absolute error on the clean dataset. Some methods were not evaluated for the imdb and emnist datasets, due to their time requirements. Lower error is better

7.3 No outliers

Robust regression methods should also work in situations where there are no outliers. To evaluate this we perform 10-fold cross validation on the real datasets (with no added outliers). As a baseline we include a dummy model that always predicts the mean y-value from the training data. In Fig. 14 we see that most robust regression methods (including slise) perform about as well as the non-robust lasso and clearly better than the mean model, the exception being ransac on high-dimensional datasets.

Fig. 14

Cross validation (10-fold) on the real datasets with no added outliers. sparse-lts, slise, and ransac use subset sizes/error tolerances corresponding to 50% of the data. All values have been divided by the corresponding mean absolute error for the reference (mean) model. Some methods were not evaluated for the imdb and emnist datasets, due to their time requirements. Lower error is better

7.4 Optimality

Finally, we demonstrate that the solution found by slise optimises the loss of Eq. (1). The slise algorithm is designed to find the largest subset such that the residuals are upper-bounded by \(\varepsilon \). To investigate if the model found using slise is optimal, we determine regression models (i.e., obtain the coefficient vectors \(\alpha \)) using each algorithm. We then calculate the value of the loss-function in Eq. (1) for every model with varying values of \(\varepsilon \).

The results are shown in Fig. 15. All loss-values have been normalised with respect to the absolute median for each value of \(\varepsilon \) and dataset. slise consistently reaches the smallest loss in the region around the \(\varepsilon \) used for training, as expected. For superconductivity and emnist the loss curves for sparse-lts are very close to the curves for slise, but slise and sparse-lts should actually give equally good, or even identical, results if the error tolerance \(\varepsilon \) in slise happens to match the subset size h in sparse-lts.

Fig. 15

Finding the best solution to Problem 1. The loss-values are normalised by dividing by the absolute median loss per dataset and \(\varepsilon \) value. The \(\varepsilon \) used for training slise and ransac is marked with a vertical line. Some methods were not evaluated for the imdb and emnist datasets, due to their time requirements. Lower values are better

8 Conclusions

This paper refines the slise algorithm for robust regression. slise introduces a novel way of detecting and discarding outliers: the regression model is fitted to a subset of non-outliers whose errors are less than an adjustable error tolerance (\(\varepsilon \)). This flexible subset size (based on \(\varepsilon \)) is in contrast to other methods (primarily the lts family) where the subset size is fixed. Additionally, slise yields sparse solutions through built-in regularisation.

Although finding an exact solution to the problem definition (Problem 1) is \(\mathbf {NP}\)-hard (see Sect. 2.1), the combination of graduated optimisation with a quasi-Newton optimiser (owl-qn) yields an effective approximation (see Sects. 3 and 7). Adding a stochastic initialisation further mitigates the risk of unfortunate starting conditions (see Sect. 6.3).

When comparing to other robust regression methods, slise is able to achieve robustness levels that are among the best possible, which we show both theoretically (see Sect. 2.2) and empirically (see Sect. 7.2). Furthermore, slise scales better to high-dimensional data than many alternative methods (see Sects. 4.2 and 7.1).

Future work could investigate how the choice of \(\lambda \) affects the sparsity, especially during the graduated optimisation. Another direction would be to try changing the balance between maximising the subset and minimising the residuals, or to introduce different weights for the data items.

In an earlier paper (Björklund et al. 2019) we show how slise can be used to explain outcomes from black box models in a way that respects constraints in the data. Along this line we could further investigate the utility of selecting the data used for the explanations in order to answer specific questions in an interactive manner. This would give better insight into the learned models and their behaviour.

The explanations given by slise are local, i.e., for specific outcomes, and an interesting follow-up would be to combine these local explanations into one global explanation. Furthermore, slise is a robust regression method and, therefore, quite generic, which means that it can readily be integrated into, or combined with, other explanation methods.

Our implementation of the slise algorithm is released under an open source license. It is available in both Python (Björklund 2021) and R (Björklund et al. 2021), which also includes the code for running all the experiments.