1 Introduction

1.1 Machine learning, regression and optimization

Machine learning is a fast-growing field of research that inherits and combines methods from statistics, computer science and optimization to tackle a vast variety of applications such as fraud detection, recommender systems, predictive maintenance and autonomous driving (Marsland 2015). One subfield of machine learning is supervised learning, whose task is to train a function on labeled data. This stands in contrast to other applications, like anomaly detection or clustering, where no labels are available and which are therefore examples of unsupervised learning (Bishop 2006). The most popular examples of supervised learning are classification and regression tasks. While classification aims at assigning discrete values to data points (e.g., binary values for cancer detection), regression methods train functions that assign continuous numbers to data points (e.g., prediction of house prices). Commonly used candidate mappings are (piecewise) linear functions, splines, tree-based models and neural networks (Clark and Pregibon 2015; Goldberg et al. 2021; Krasko and Rebennack 2017; Micula and Micula 2012; Rebennack and Kallrath 2015; Rebennack and Krasko 2020; Specht 1991).

The training procedure involves the minimization of a so-called loss function that measures the distance of the observations to the corresponding predictions. Minimizing the loss function typically results in an unconstrained smooth optimization problem that is tackled by variants of stochastic gradient descent, a lightweight modification of gradient descent in which only parts of the gradient are evaluated in each iteration (Schmidt et al. 2017; Robbins and Monro 1951). Other optimization-related topics within machine learning are Bayesian optimization (Snoek et al. 2012) or the optimization of pretrained machine learning models in the feature space (Thebelt et al. 2020b).

Mixed-integer linear optimization (MILO) models involve linear terms in the decision variables as well as integrality restrictions on some of the decision variables (Jünger et al. 2009; Wolsey and Nemhauser 1999). A very rich class of practical optimization problems can be modeled using MILO models. Current state-of-the-art solvers for general MILO models use so-called branch-and-cut algorithms. The idea of branch-and-cut algorithms is to repeatedly solve linear optimization problems (which are easy to solve) obtained by relaxing the integrality restrictions on the decision variables. The linear optimization problems are updated by additional restrictions on the relaxed variables in order to cut out fractional values. This is called branching. In the worst case, there are exponentially many such branches in the number of integer variables. The branching is accompanied by cutting planes whose goal is to cut away fractional solutions (without the need to execute the costly branching). Therefore, as a general rule-of-thumb, fewer integer variables lead to lower computational times (though this is not always true). We make use of this observation in this paper.
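
To make the branching idea tangible, the following toy sketch (our own illustration with an assumed example problem, not how production branch-and-cut solvers are implemented) repeatedly solves LP relaxations with SciPy and branches on a fractional variable; cutting planes are omitted.

```python
# Toy LP-relaxation-based branch-and-bound (illustrative only):
# maximize 5x + 4y  s.t.  6x + 4y <= 24,  x + 2y <= 6,  x, y >= 0 and integer.
import math
from scipy.optimize import linprog

c = [-5.0, -4.0]                       # linprog minimizes, so negate for maximization
A = [[6.0, 4.0], [1.0, 2.0]]
b = [24.0, 6.0]

def branch_and_bound(bounds, best=(-math.inf, None)):
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    if not res.success:
        return best                    # infeasible node: prune
    obj = -res.fun
    if obj <= best[0]:
        return best                    # bound: relaxation no better than incumbent
    frac = [i for i, v in enumerate(res.x) if abs(v - round(v)) > 1e-6]
    if not frac:
        return (obj, res.x)            # integral LP solution: new incumbent
    i, v = frac[0], res.x[frac[0]]     # branch on the first fractional variable
    lo, hi = bounds[i]
    left = bounds[:i] + [(lo, math.floor(v))] + bounds[i + 1:]
    right = bounds[:i] + [(math.ceil(v), hi)] + bounds[i + 1:]
    best = branch_and_bound(left, best)
    return branch_and_bound(right, best)

print(branch_and_bound([(0, None), (0, None)]))   # integer optimum: value 20 at x=4, y=0
```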

Dimitris Bertsimas was one of the first researchers to point out that recent advances in linear and quadratic mixed-integer optimization have been rarely noticed in the statistics and machine learning communities. This inspired him to publish a series of papers under the motto “Machine Learning under a Modern Optimization Lens” that are summarized in the eponymous book (Bertsimas and Dunn 2019). Bertsimas’ assessment was confirmed by some of the most renowned researchers in the statistics and machine learning community, Trevor Hastie and Robert Tibshirani, who state (Hastie et al. 2017):

In exciting new work, Bertsimas et al. (2016) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community.

This paper was heavily inspired by Bertsimas’ observation that mixed-integer optimization is still relatively unknown but can be applied to many optimization problems in the context of machine learning and statistics. Therefore, we present Leveraged Least Trimmed Absolute Deviations (LLTA), a mixed-integer based robust regression model, whose main idea we explain now.

1.2 Motivation

Let \(f_\theta :{\mathbb {R}}^n\rightarrow {\mathbb {R}}\),

$$\begin{aligned} f_\theta (x)=\theta _0+\sum _{i=1}^n \theta _i x_i \end{aligned}$$

be a linear candidate function whose parameter \(\theta \in {\mathbb {R}}^{n+1}\) we want to determine optimally with respect to some labeled training data

$$\begin{aligned} (x^1, y_1), \ldots , (x^j, y_j), \ldots , (x^N, y_N)\in {\mathbb {R}}^n\times {\mathbb {R}}, \end{aligned}$$

with \(x^j \in {\mathbb {R}}^n\) for \(j=1, \ldots , N\). The most popular approach to obtaining such a function \(f_\theta\) is referred to as Ordinary Least Squares (OLS) and goes back to Legendre or Gauß (Stigler 1981) at the end of the 18th century. Let

$$\begin{aligned} r_{j,\theta }=\theta _0+\sum _{i=1}^n \theta _i x_i^j-y_j \end{aligned}$$

be the residual of \(f_\theta\) with respect to the jth data point, \(j=1, \ldots , N\). Then, OLS computes \(\theta\) by solving the unconstrained convex quadratic optimization problem

$$\begin{aligned} \min _{\theta } \; \sum _{j=1}^N r_{j,\theta }^2. \end{aligned}$$
(1)

OLS is computationally attractive as it possesses a closed-form solution. However, it is very sensitive with respect to outliers. To soften this sensitivity, an alternative approach is to minimize the \(\ell _1\)-norm of the residual vector \(r^\theta =(r_{1,\theta }, \ldots ,r_{j,\theta }, \ldots , r_{N,\theta })\) instead of the \(\ell _2\)-norm. This leads to Least Absolute Deviations (LAD), a problem that was already stated in 1757 by Boscovich (Koenker and Bassett 1985), even before OLS. LAD results in the unconstrained convex piecewise linear optimization problem

$$\begin{aligned} \min _{\theta } \; \sum _{j=1}^N |r_{j, \theta }|. \end{aligned}$$
(2)

In contrast to (1), LAD does not have a closed-form solution but can be reformulated and solved as a linear (continuous) optimization problem (LP) or tackled by a subgradient-based method. When only \(\theta _0\) is estimated and all \(\theta _i = 0\), this leads to the so-called location model (Bassett 1991). In this case, an optimal estimator for \(\theta _0\) is simply the median of the sorted data points \(y_{(j)}\), \(j =1, \ldots , N\).
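
To make the two estimators concrete, the following sketch (our own illustration on synthetic data, assuming numpy and scipy are available) computes OLS via the closed-form least-squares solution of (1) and LAD via the LP reformulation of (2) with epigraph variables \(t_j\ge |r_{j,\theta }|\).

```python
# Contrast OLS (closed form) and LAD (LP reformulation) on synthetic data with y-outliers.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, n = 50, 2
X = rng.normal(size=(N, n))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=N)
y[:5] += 20.0                                       # a few y-outliers

A = np.hstack([np.ones((N, 1)), X])                 # design matrix [1, x^j]

theta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS: closed-form solution of (1)

# LAD: min sum_j t_j  s.t.  A_j theta - t_j <= y_j  and  -A_j theta - t_j <= -y_j
c = np.concatenate([np.zeros(n + 1), np.ones(N)])
A_ub = np.block([[ A, -np.eye(N)],
                 [-A, -np.eye(N)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * (n + 1) + [(0, None)] * N
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
theta_lad = res.x[: n + 1]

print("OLS:", np.round(theta_ols, 2), " LAD:", np.round(theta_lad, 2))
```

On such data, the LAD coefficients typically stay close to the ground truth, while the OLS fit is visibly pulled toward the shifted observations.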

A crucial property of LAD is its robustness against so-called y-outliers. This is a consequence of Theorem 1, which states that LAD is not affected by changes in \(y_j\) for data points that do not lie directly on the regression line, as long as the signs of the residuals are not reversed.

Theorem 1

(Dodge 1997) Suppose \(\theta ^\star\) is a minimizer of

$$\begin{aligned} F(\theta )=\sum _{j=1}^N\left| y_{j}-\left( \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\right) \right| . \end{aligned}$$

Then, \(\theta ^\star\) is also a minimizer of

$$\begin{aligned} G(\theta )=\sum _{j=1}^N \left| z_{j}-\left( \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\right) \right| , \end{aligned}$$

provided \(z_{j}\ge \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\) whenever \(y_{j} > \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\) and \(z_{j}\le \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\) whenever \(y_{j} < \theta _0+\sum _{i=1}^n \theta _{i} x^{j}_{i}\).

However, while being robust to y-outliers, LAD is still affected by leverage points, i.e., outliers in x. This is illustrated in Fig. 1.

Fig. 1 Behavior of ordinary least squares (OLS) and least absolute deviations (LAD) in the presence of outliers of different types

Inspired by this observation, Rousseeuw (1984) proposed the Least Trimmed Squares (LTS) in 1984, whose formulation as an optimization problem is given by the mixed-integer nonlinear optimization problem (MINLP)

$$\begin{aligned} \min _{\theta , b} \; \sum _{j=1}^Nr_{j,\theta }^2\cdot b_{j} \qquad \text {s.t.}\;\sum _{j=1}^Nb_{j}=N-k,\;\; b\in \{0,1\}^N \end{aligned}$$
(3)

with \(k\in {\mathbb {N}}\) and \(k<N/2\), where we use “\(\cdot\)” whenever multiplying decision variables. By design, (3) minimizes the sum of squares while ignoring the k largest squared deviations.

Similar to LTS, the least trimmed sum of absolute deviations (LTA) was proposed in 1999 by Hawkins and Olive (1999). LTA can be formulated as the MINLP

$$\begin{aligned} \min _{\theta , b} \; \sum _{j=1}^N |r_{j,\theta }| \cdot b_{j} \qquad \text {s.t.}\;\sum _{j=1}^Nb_{j}=N-k,\;\; b\in \{0,1\}^N. \end{aligned}$$
(4)

To compute the LTA regression, Hawkins and Olive propose an enumeration algorithm over all possible subsets of \(h=N-k\) retained data points. This algorithm is of particular interest for the location model, as the LTA estimate is then obtained by evaluating the \(N-h+1\) contiguous subsets of ordered data points \(y_{(j)}, y_{(j+1)}, \ldots , y_{(j+h-1)}\), for \(j=1, \ldots , N-h+1\) (Bassett 1991; Tableman 1994).
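
The contiguous-window enumeration for the location model can be sketched as follows (illustrative code using the notation above with coverage \(h=N-k\), not the authors' implementation).

```python
# LTA for the location model: scan the N - h + 1 contiguous windows of the
# ordered observations and take the median of the best window.
import numpy as np

def lta_location(y, k):
    y = np.sort(np.asarray(y, dtype=float))
    N = len(y)
    h = N - k                                   # number of retained observations
    best_obj, best_theta0 = np.inf, None
    for j in range(N - h + 1):                  # window y_(j+1), ..., y_(j+h)
        window = y[j:j + h]
        theta0 = np.median(window)              # optimal location for this window
        obj = np.sum(np.abs(window - theta0))
        if obj < best_obj:
            best_obj, best_theta0 = obj, theta0
    return best_theta0, best_obj

# toy data: nine regular observations and one outlier that is trimmed for k = 1
print(lta_location([1.1, 0.9, 1.0, 1.2, 0.8, 1.05, 0.95, 1.0, 1.1, 50.0], k=1))
```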

Next to the enumeration algorithm, the LTA regression problem is solved by Flores (2011) via a tailored continuous global optimization algorithm for the case that the intercept is zero, i.e., \(\theta _0 = 0\). First, the MINLP is reformulated as an NLP by replacing the binary restriction on \(b_{j}\) with the nonconvex constraint \(b_{j}^2 - b_{j} = 0\) for continuous variables \(b_{j}\). The resulting continuous nonconvex global optimization problem is then solved by a tailored global optimization algorithm in the spirit of Lasserre (2001).

The book on open problems in optimization and data analysis (Pardalos and Migdalas 2018) contains a chapter which discusses the connection between optimization and statistical robust estimators in the context of LTA regression. For the location model, Zioutas et al. present a MINLP model of type (4) and a MILP reformulation of the bilinear terms using standard techniques. The LTA for the location model is extended to take into account outliers violating the correlational structure of the data set via a two-level approach in Chatzinakos et al. (2016).

Nowadays, within the statistics and machine learning communities, LTS and LTA are solved by applying heuristics since both problems are considered intractable, being \({\mathcal {N}}{\mathcal {P}}\)-hard (Bernholt 2006). However, between 1991 and 2015, algorithmic advances alone gave mixed-integer linear optimization (MILP) solvers an average speedup factor of 780,000, cf. Bertsimas et al. (2016), Bixby (2012) and the references therein. These machine-independent advances have been accompanied by an impressive progress in hardware performance. Thus, many real-world applications that could not be solved in the 1980s or 1990s are now solvable to global optimality within seconds. Similarly, modern software packages like CPLEX and GUROBI can now also solve large-scale nonconvex mixed-integer quadratic optimization problems.

Despite the latest solver developments, LTS and LTA can only be solved for medium-sized problem instances. Therefore, we introduce Leveraged Least Trimmed Absolute Deviations (LLTA), which is a two-step approach that trains a linear function on possibly infiltrated data. The two steps are:

  1. Identify all leverage points.

  2. Minimize the total absolute deviations while ignoring the \(k\in {\mathbb {N}}\) largest deviations at data points that are leverage points, for some chosen k not exceeding the number of leverage points.

Consider now Fig. 2. LTS needs 11 binary decision variables and \(k=3\) to achieve a reasonable fit (Fig. 2a). LTA yields the same result with 11 binary decision variables and \(k=1\) (Fig. 2b). However, LLTA produces the same high-quality fit for \(k=1\) using only one binary decision variable (Fig. 2c).

Fig. 2 Behavior of LTS and LLTA in the presence of outliers of different types

These indicated advantages of LLTA compared to LTS and LTA are further examined in Sect. 3 after a formal introduction of LLTA in Sect. 2.

1.3 Statement of contributions

The unique contributions of this paper are the following. We

  1. introduce Leveraged Least Trimmed Absolute Deviations (LLTA),

  2. demonstrate that LLTA outperforms LTS with respect to regression quality and computational speed,

  3. show that the regression quality of LLTA is comparable to that of LTA while being much faster in terms of run time,

  4. are the first to benchmark LTS and LTA with current MIQP solvers (in the literature, LTS and LTA have so far only been solved by heuristic methods, ignoring the recent progress in MIQP algorithms and software).

The remainder of this paper is organized as follows. In Sect. 2, we introduce LLTA. We provide the benchmarking of LLTA with LTA and LTS in Sect. 3 before we conclude with Sect. 4.

2 Leveraged least trimmed absolute deviations

We start by noting that the complexity of LTS and LTA is governed by

  1. the number of ignored data points k, since there are \(\left( {\begin{array}{c}N\\ k\end{array}}\right)\) subsets of size k among the N data points and \(\left( {\begin{array}{c}N\\ k\end{array}}\right)\) grows exponentially in k for fixed N and \(k<N/2\),

  2. the number of binary variables, since the search space also grows exponentially in the number of binary variables.

To mitigate the computational complexity resulting from the second point, we introduce Leveraged Least Trimmed Absolute Deviations (LLTA). LLTA is a two-step procedure. Let \({\mathcal {D}}=\{1,\ldots ,N\}\).

  1. Compute the index set \({\mathcal {O}}\subsetneq {\mathcal {D}}\) of leverage points; see Sect. 2.2 for details.

  2. Solve the optimization problem

    $$\begin{aligned} \min _{\theta , b} \sum _{j \in {\mathcal {D}}{\setminus} {\mathcal {O}}}|r_{j,\theta }| +\sum _{j\in {\mathcal {O}}}|r_{j,\theta }|\cdot b_{j} \qquad \text {s.t.}\;\sum _{j\in {\mathcal {O}}}b_{j}=|{\mathcal {O}}|-k,\;\; b\in \{0,1\}^{|{\mathcal {O}}|}. \end{aligned}$$
    (5)

Since the classical \(\ell _1\)-regression LAD is immune to y-outliers, we only protect our regression function with respect to leverage points. This is achieved through the parameter k, which allows the optimal fit to ignore k data points within the index set of leverage points \({\mathcal {O}}\). Note that we exploit the fact that the set of leverage points can be computed beforehand, which is not possible for the y-outliers because they depend on the regression function. Therefore, we obtain

  1. a significant reduction in the number of binary decision variables in (5) compared to (3) and (4), because LLTA introduces only one binary decision variable per leverage point instead of one per data point, and

  2. the possibility to choose smaller values of k compared to LTS, since LAD is already immune with respect to y-outliers.

Remark 1

As elaborated in Breiman (2001), there are two very different approaches to statistical modeling: In the Data Modeling Culture, a stochastic data model with an underlying distribution is assumed whose distribution is estimated from successive draws. Examples are given in Liu (1996) and Vanhatalo et al. (2009). In the Algorithmic Modeling Culture, a function \(y = f(x)\) is fitted to observed data, where the data generating process remains a black box with no further distributional assumptions. Many successful methods from machine learning like deep neural networks or gradient boosted trees are treated in the spirit of the latter culture. In this introductory work, we made the conscious decision to perform the analysis of LLTA within the framework of the Algorithmic Modeling Culture. This yields a clear, uncluttered overview of the main ideas and may serve as a starting point for extensions from both cultures. Introducing underlying stochastic assumptions and applying distribution-free sensitivity analysis using methods like bootstrapping are then possible extensions, cf. Sect. 2.5.

Remark 2

We assume that the number of infiltrated data points, i.e., the number of possible outliers, is strictly smaller than N/2. This is a standard assumption that is also posed for LTS and LTA. Therefore, k must also not exceed N/2, and we focus on the better half of the residuals, as elaborated in Sect. 2.4.

2.1 Epigraph reformulation

In order to implement LTS, LTA and LLTA in a modern mixed-integer optimization solver, we first have to apply some reformulations. An LTS model is trained by solving the mixed-integer quadratically-constrained quadratic optimization problem (MIQCQP)

$$\begin{aligned} \min _{\theta , b, r} \; \frac{1}{N^2} \sum _{j\in {\mathcal {D}}} r^\mathrm{sqr}_{j,\theta } \cdot b_{j} \quad &\text {s.t.}\quad r^\mathrm{sqr}_{j,\theta }\ge \left( y_{j} -\left( \theta _0+ \sum _{i=1}^n \theta _{i} x_{i}^{j} \right) \right) ^2,\;\; \forall j\in {\mathcal {D}}\\& \sum _{j \in {\mathcal {D}}}b_{j}=N-k\\&\theta \in {\mathbb {R}}^{n+1},\ r^\mathrm{sqr}_\theta \in {\mathbb {R}}_{\ge 0}^N,\ b\in \{0,1\}^N, \end{aligned}$$

where we avoid the trilinear terms \(r_{j,\theta }^2\cdot b_j\) in the objective function by using only bilinear and quadratic expressions in the objective function and constraints, respectively. Specifically, in an optimal solution,

$$\begin{aligned} r^\mathrm{sqr}_{j,\theta } = \left( y_{j} - \left( \theta _0 +\sum _{i=1}^n \theta _{i} x_{i}^{j} \right) \right) ^2 = r^2_{j,\theta }. \end{aligned}$$
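
A possible gurobipy implementation of this LTS formulation might look as follows (an illustrative sketch under the stated reformulation, not the authors' code); setting the NonConvex parameter for the bilinear objective terms is our own precaution for GUROBI 9.

```python
# Sketch: LTS as an MIQCQP with epigraph variables r_sqr[j] >= r_{j,theta}^2.
import gurobipy as gp
from gurobipy import GRB

def lts(X, y, k):
    N, n = len(y), len(X[0])
    m = gp.Model("LTS")
    theta = m.addVars(n + 1, lb=-GRB.INFINITY, name="theta")
    r_sqr = m.addVars(N, lb=0.0, name="r_sqr")
    b = m.addVars(N, vtype=GRB.BINARY, name="b")

    for j in range(N):
        pred = theta[0] + gp.quicksum(theta[i + 1] * X[j][i] for i in range(n))
        m.addQConstr(r_sqr[j] >= (y[j] - pred) * (y[j] - pred))   # epigraph of the squared residual
    m.addConstr(gp.quicksum(b[j] for j in range(N)) == N - k)     # retain N - k data points

    w = 1.0 / N**2                                                # numerical-stability prefactor
    m.setObjective(gp.quicksum(w * r_sqr[j] * b[j] for j in range(N)), GRB.MINIMIZE)
    m.Params.NonConvex = 2                                        # bilinear objective terms
    m.optimize()
    return [theta[i].X for i in range(n + 1)]
```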

An LTA estimate is computed as an optimal solution of the mixed-integer quadratic optimization problem (MIQP)

$$\begin{aligned} \min _{\theta , b, r} \; \frac{1}{N} \sum _{j\in {\mathcal {D}}} r^\mathrm{abs}_{j,\theta } \cdot b_{j} \quad \text {s.t.}\quad &r^\mathrm{abs}_{j,\theta }\ge y_{j} - \left( \theta _0 + \sum _{i=1}^n \theta _{i} x_{i}^{j} \right) ,\;\; \forall j\in {\mathcal {D}}\\&r^\mathrm{abs}_{j,\theta }\ge - y_{j} + \theta _0 + \sum _{i=1}^n \theta _{i} x_{i}^{j},\;\; \forall j\in {\mathcal {D}}\\& \sum _{j\in {\mathcal {D}}}b_{j}=N-k\\&\theta \in {\mathbb {R}}^{n+1},\ r^\mathrm{abs}_\theta \in {\mathbb {R}}_{\ge 0}^N,\ b\in \{0,1\}^N. \end{aligned}$$

The absolute value term \(|r_{j,\theta }|\) in the objective function is modeled through two linear constraints, for every \(j\in {\mathcal {D}}\). This is possible because LTA is a minimization problem and the objective coefficients \(b_{j}\) are nonnegative.

Finally, we compute a linear regression function for LLTA by solving the MIQP

$$\begin{aligned} \min _{\theta , b, r} \; \frac{1}{N}\left( \sum _{j\in {\mathcal {D}}{\setminus} {\mathcal {O}}} r^\mathrm{abs}_{j,\theta } + \sum _{j\in {\mathcal {O}}} r^\mathrm{abs}_{j,\theta } \cdot b_{j}\right) \quad \text {s.t.}\quad &r^\mathrm{abs}_{j,\theta }\ge y_{j} -\left( \theta _0+ \sum _{i=1}^n \theta _{i} x_{i}^{j} \right) ,\;\; \forall j\in {\mathcal {D}}\\&r^\mathrm{abs}_{j,\theta }\ge -y_{j} + \theta _0 + \sum _{i=1}^n \theta _{i} x_{i}^{j},\;\; \forall j\in {\mathcal {D}}\\& \sum _{j\in {\mathcal {O}}}b_{j}=|{\mathcal {O}}|-k\\&\theta \in {\mathbb {R}}^{n+1},\ r^\mathrm{abs}_\theta \in {\mathbb {R}}_{\ge 0}^N,\ b\in \{0,1\}^{|{\mathcal {O}}|}, \end{aligned}$$

where we rewrite the absolute value terms as in the LTA MIQP above.

The prefactors \(\frac{1}{N^2}\) and \(\frac{1}{N}\) in the three formulations above do not affect the optimal solutions, but are added to enhance numerical stability. Because GUROBI version 9.0 is capable of dealing with MIQCQPs and MIQPs, there is no need to reformulate the optimization problems as MILPs. In this way, we avoid the introduction of Big-M constraints which are known to yield notoriously weak relaxations.
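
For completeness, a gurobipy sketch of the LLTA MIQP might look as follows (again our own illustrative code, not the authors' implementation); passing \({\mathcal {O}}={\mathcal {D}}\) would recover the LTA model.

```python
# Sketch: LLTA with binary variables only for the leverage-point indices in O.
import gurobipy as gp
from gurobipy import GRB

def llta(X, y, O, k):
    """O is the (0-based) index set of leverage points, k the number of trimmed points."""
    N, n = len(y), len(X[0])
    m = gp.Model("LLTA")
    theta = m.addVars(n + 1, lb=-GRB.INFINITY, name="theta")
    r_abs = m.addVars(N, lb=0.0, name="r_abs")
    b = m.addVars(list(O), vtype=GRB.BINARY, name="b")

    for j in range(N):
        pred = theta[0] + gp.quicksum(theta[i + 1] * X[j][i] for i in range(n))
        m.addConstr(r_abs[j] >= y[j] - pred)          # epigraph of |r_{j,theta}| ...
        m.addConstr(r_abs[j] >= pred - y[j])          # ... via two linear constraints
    m.addConstr(gp.quicksum(b[j] for j in O) == len(O) - k)

    w = 1.0 / N                                       # numerical-stability prefactor
    obj = gp.quicksum(w * r_abs[j] for j in range(N) if j not in O) \
        + gp.quicksum(w * r_abs[j] * b[j] for j in O)
    m.setObjective(obj, GRB.MINIMIZE)
    m.Params.NonConvex = 2                            # bilinear terms r_abs[j] * b[j]
    m.optimize()
    return [theta[i].X for i in range(n + 1)]
```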

2.2 Computation of leverage points

Let \(x^1,\ldots ,x^j,\ldots , x^N\in {\mathbb {R}}^n\) be the data points and \(q_{i}^{0.25}\) the lower quartile of their ith component. Further, let \(q_{i}^{0.75}\) be the upper quartile and

$$\begin{aligned} \mathrm{iqr}_i := q_{i}^{0.75}-q_{i}^{0.25} \end{aligned}$$

the interquartile range of component \(i\in \{1, \ldots , n\}\). Then, we introduce the following definition of a leverage point.

Definition 1

A data point \(x\in {\mathbb {R}}^n\) is called a leverage point if, for at least one \(i\in \{1,\ldots , n\}\), \(x_i < q_{i}^{0.25} - 1.5\cdot \mathrm{iqr}_i\) or \(x_i > q_{i}^{0.75} + 1.5\cdot \mathrm{iqr}_i\).

This leads us to the definition of the index set of all leverage points

$$\begin{aligned} {\mathcal {O}}:= \left\{ j \in {\mathcal {D}} \ | \ \exists i\in \{1,\ldots ,n\}:x_i^{j} < q_{i}^{0.25} - 1.5\cdot \mathrm{iqr}_i \text { or } x_i^{j} > q_{i}^{0.75} + 1.5\cdot \mathrm{iqr}_i \right\} . \end{aligned}$$

The definition of a leverage point given in this paper is more specific than the one commonly used in the existing literature, where a leverage point is usually described as a “data point that has an extreme value for one of the explanatory variables” (Dodge 1997).
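
The following numpy sketch (our own helper, not the authors' code) implements Definition 1; note that numpy's default quantile interpolation may differ slightly from other quartile conventions.

```python
# Flag a point as a leverage point if any component falls outside the 1.5-IQR whiskers.
import numpy as np

def leverage_indices(X, t=1.5):
    X = np.asarray(X, dtype=float)              # shape (N, n)
    q25 = np.quantile(X, 0.25, axis=0)
    q75 = np.quantile(X, 0.75, axis=0)
    iqr = q75 - q25
    lower, upper = q25 - t * iqr, q75 + t * iqr
    mask = np.any((X < lower) | (X > upper), axis=1)
    return {int(j) for j in np.flatnonzero(mask)}   # index set O (0-based)

# example: only the last point is an x-outlier in the first component
print(leverage_indices([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1], [25.0, 1.9]]))  # {4}
```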

Remark 3

The outlier tolerance 1.5 might be treated as a hyperparameter t of LLTA that influences the number of binary variables as well as the prediction quality. However, for the remainder of this work, we set \(t=1.5\), which coincides with the definition of an outlier for boxplots (Tukey 1977).

Remark 4

Note that based on this definition of leverage points, there might be a tendency to identify more leverage points for high-dimensional data sets since the probability mass within a multidimensional probability distribution tends “to move away from its center” (van Handel 2014). The combination of domain knowledge and the selection of a tailored problem-specific outlier detection method (Hodge and Austin 2004) probably yields the best definition of leverage points for the problem at hand.

2.3 Choosing the number of outliers k

The choice of k may have a significant influence on the computed regression functions for LTS, LTA as well as LLTA. Quite generally, methods developed for LTS and LTA to choose k can also be applied to LLTA.

By inspecting the optimization models (3), (4) and (5), we observe that their objective functions are monotonically decreasing in k, i.e., allowing more outliers leads to a better-fitting regression function. At the same time, increasing the number of outliers beyond the actual number of outliers in the data set implies a loss of information. Consequently, k should not be chosen “too large.” For practical problems, one would compute regression functions for different values of k and heuristically choose a k that yields a good compromise between outlier detection and regression quality, e.g., a k for which the fit improves significantly compared to \(k-1\) while \(k+1\) yields only a minor further improvement.

To choose k optimally, one would need an (objective) function quantifying both the regression fit and the “loss” from excluding potentially useful data.
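
One way to operationalize this heuristic is sketched below (an assumption on our part, not a rule from the paper); solve_objective is a hypothetical callable returning the optimal objective value of (5) for a given k.

```python
# Elbow-style choice of k: stop where one more trimmed point adds little improvement.
def choose_k(solve_objective, k_max, rel_tol=0.05):
    objs = [solve_objective(k) for k in range(k_max + 1)]
    for k in range(1, k_max):
        gain_now = objs[k - 1] - objs[k]        # improvement from allowing k outliers
        gain_next = objs[k] - objs[k + 1]       # improvement from allowing one more
        if gain_now > 0 and gain_next < rel_tol * gain_now:
            return k                            # "elbow": the next step adds little
    return k_max

# toy usage with a hypothetical sequence of optimal objective values
objective_values = [10.0, 4.0, 3.8, 3.7, 3.65]
print(choose_k(lambda k: objective_values[k], k_max=4))   # returns 1
```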

2.4 Performance evaluation

Classical performance measures, like the root-mean-square error (RMSE) or mean absolute error (MAE), are not suitable for measuring the quality of statistical models in the presence of outliers because they evaluate the residuals of all data points. In contrast, a good robust model ignores some data points for being outliers. We evaluate the performance of the models by sorting the absolute residuals in ascending order, i.e.,

$$\begin{aligned} |r_{(1), \theta }| \le \cdots \le |r_{(j), \theta }| \le |r_{(j+1), \theta }| \le \cdots \le |r_{(N), \theta }| \end{aligned}$$

and computing the trimmed MAE on the better half of all residuals

$$\begin{aligned} \mathrm{tMAE}:= \frac{1}{\lfloor {N/2} \rfloor }\sum _{j=1}^{\lfloor {N/2} \rfloor }|r_{(j), \theta }| \end{aligned}$$

where \(\lfloor {N/2} \rfloor\) denotes the floor function of N/2.
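
A numpy sketch of \(\mathrm{tMAE}\) (our own illustrative code) might look as follows; the optional parameter anticipates the adjusted variant, discussed at the end of this section, for a known upper bound on the outlier share.

```python
# Trimmed MAE: average the smallest absolute residuals only.
import numpy as np

def tmae(residuals, outlier_share_bound=0.5):
    r = np.sort(np.abs(np.asarray(residuals, dtype=float)))
    m = int(np.floor(len(r) * (1.0 - outlier_share_bound)))   # size of the retained part
    return r[:m].mean()

# default: average over the better half of the absolute residuals
print(tmae([0.1, -0.2, 0.15, 5.0, -7.0, 0.05]))   # mean of the three smallest |r| = 0.1
```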

To motivate \(\mathrm{tMAE}\) as performance metric, consider the following synthetic example with 30 “good” data points, ten “x-outliers,” i.e., leverage points, and ten “y-outliers.” The scatter plot and regression lines for LLTA and LTS with \(k=10\) are depicted in Fig. 3.

Fig. 3 Synthetic example with 50 data points, 20 of which are outliers, and regression lines computed by LLTA and LTS

It is not surprising to see that LTS is affected by the outliers, while LLTA yields a very good approximation of the “good” data points. However, how can we measure that in a multidimensional setting where visual inspection is not as easy to perform? We notice first that the classical performance measures RMSE and MAE are not suited since LTS outperforms LLTA in both RMSE (3258 vs. 3660) and MAE (1471 vs. 1560), despite the fact that LTS’ fit is obviously worse for this illustrative and synthetic example with 20 outliers. If we knew which of the data points are the “good” ones, we could just evaluate RMSE and MAE on these points. Unfortunately, we do not know that in real-world applications; otherwise, we would not have to immunize the regression function against outliers. Even worse, up to half of the data could be infiltrated and, in fact, we have an outlier rate of 40% in this example.

Let us take a look at the empirical distributions of the residuals of both methods, which are illustrated in Fig. 4.

Fig. 4 Empirical distribution of residuals for LLTA and LTS

We recognize that the distributions are right-skewed: while the majority of the residuals are rather small, there exist some outliers, i.e., some data points show large residuals. This is not a problem per se since our goal is to design robust methods that ignore outliers on purpose. However, we should then also ignore these residuals when calculating our performance metric. Since this might affect up to 50% of our residuals, we decided to use the trimmed MAE as performance metric for LLTA. If an upper bound b on the relative share of outliers is available, an obvious adjustment of \(\mathrm{tMAE}\) would be to evaluate not only the better half of the residuals but the best \((1-b)\cdot 100\)% of them.

2.5 Uncertainty measurement

The quantification of prediction uncertainty is important as it tries to determine the trustworthiness of a particular prediction or even of the statistical model in general. In particular, if decision making relies on the predictions, an uncertainty measure for each prediction is crucial.

A prominent example for model predictions, embedded in a “predict-tell” cycle, is Bayesian optimization. In Bayesian optimization, an acquisition function is optimized in each iteration, taking into account the prediction of a surrogate model as well as the model uncertainty (Pelikan et al. 1999). The predominant surrogate models are Gaussian processes, which rely on a normal distribution assumption and thereby also yield an uncertainty estimate.

However, more recent distribution-free approaches to Bayesian optimization also work with gradient tree ensemble methods as surrogate models and distance-based uncertainty measures (Thebelt et al. 2020a). A distance-based uncertainty measure does not assume any probability distribution in the data and is therefore also applicable to LLTA. The main idea is to measure the distance of an x-value, whose y-value is to be predicted, to the set of existing training data, since the statistical model might have bad extrapolation properties. Distance-based uncertainty measures have a nice intuitive interpretation but might provide misleading information for high-dimensional data sets, where especially the \(\ell _2\)-norm shows counterintuitive behavior (Aggarwal et al. 2001). A distance-based uncertainty measure that uses the \(\ell _1\)-norm might therefore be useful for LLTA.
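
A minimal sketch of such an \(\ell _1\) distance-based uncertainty score (our own illustrative choice, not a method from the paper) could look as follows.

```python
# The farther a query point is from its nearest training point (in l1 distance),
# the less we trust the corresponding prediction.
import numpy as np

def l1_uncertainty(x_new, X_train):
    X_train = np.asarray(X_train, dtype=float)
    return np.min(np.sum(np.abs(X_train - np.asarray(x_new, dtype=float)), axis=1))

X_train = [[0.0, 1.0], [1.0, 1.5], [0.5, 0.5]]
print(l1_uncertainty([0.6, 0.6], X_train))    # small value: close to the training data
print(l1_uncertainty([10.0, -5.0], X_train))  # large value: extrapolation region
```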

Next to distance-based uncertainty measures, the second (by far more popular) approach to uncertainty estimation without distributional assumptions is bootstrapping (Diaconis and Efron 1983). Koenker and Hallock use bootstrapping in their famous work (Koenker and Hallock 2001) to quantify uncertainty in quantile regression, which is a very popular approach to robust regression. The main idea of bootstrapping is to train statistical models on random samples of the training data and to compare their statistical properties. If models trained on different samples vary considerably, this might be an indication of high model uncertainty. Therefore, using bootstrapping, or one of its many variants, is recommended as an uncertainty measure for LLTA.
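
A bootstrap-based uncertainty estimate for the regression coefficients might be sketched as follows (illustrative; fit_llta is a hypothetical callable returning the coefficient vector \(\theta\) for a given training sample).

```python
# Refit the model on bootstrap resamples; the spread of the refitted coefficients
# indicates model uncertainty.
import numpy as np

def bootstrap_theta(fit_llta, X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    thetas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
        thetas.append(fit_llta(X[idx], y[idx]))
    thetas = np.asarray(thetas)
    return thetas.mean(axis=0), thetas.std(axis=0)   # per-coefficient mean and spread
```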

3 Computational results

We perform computational tests, comparing the two approaches LTS and LTA from the literature to the new model LLTA. All models are implemented in GUROBI 9.0 via its Python API and solved on a standard desktop computer with four cores at 2.71 GHz and 16 GB RAM.

We start with an example on the body–brain data set in Sect. 3.1 with the goal of illustrating the usage of LLTA and giving an indication of the strengths and weaknesses of the examined models. We then use 11 instances from the literature in Sect. 3.2 to demonstrate the computational differences between the models resulting from LTS, LTA and LLTA.

3.1 Comparison with LTS and LTA on the body–brain data set

To illustrate the new model LLTA, we compare it to LTS and LTA based on the “Brain and Body Weights” dataset (Rousseeuw and Leroy 1987; Weisberg 1985). This dataset contains the average body weights (in kg) and brain weights (in g) in the format (body, brain) for 65 animals

  • (1.35, 8.1), (465, 423), (36.33, 119.5), (27.66, 115), (1.04, 5.5), (11700, 50), (2547, 4603), (187.1, 419), (521, 655), (10, 115), (3.3, 25.6), (529, 680), (207, 406), (62, 1320), (6654, 5712), (9400, 70), (6.8, 179), (35, 56), (0.12, 1), (0.023, 0.4), (2.5, 12.1), (55.5, 175), (100, 157), (52.16, 440), (0.28, 1.9), (87000, 154.5), (0.122, 3), (192, 180), (3.385, 44.5), (0.48, 15.5), (14.83, 98.2), (4.19, 58), (0.425, 6.4), (0.101, 4), (0.92, 5.7), (1, 6.6), (0.005, 0.14), (0.06, 1), (3.5, 10.8), (2, 12.3), (1.7, 6.3), (0.023, 0.3), (0.785, 3.5), (0.2, 5), (1.41, 17.5), (85, 325), (0.75, 12.3), (3.5, 3.9), (4.05, 17), (0.01, 0.25), (1.4, 12.5), (250, 490), (10.55, 179.5), (0.55, 2.4), (60, 81), (3.6, 21), (4.288, 39.2), (0.075, 1.2), (0.048, 0.33), (3, 25), (160, 169), (0.9, 2.6), (1.62, 11.4), (0.104, 2.5), (4.235, 50.4)

which are depicted in Fig. 5a, b using logarithmic scales.

Fig. 5 Pairs of body–brain weights for 65 species

For the body–brain data set, we have \(n=1\). Its quartiles are given by \(q^{0.25}=0.75\) and \(q^{0.75}=60\) which results in an interquartile range of \(\mathrm{iqr}= 59.25\). We compute

$$\begin{aligned} {\mathcal {O}}&= \{j\in {\mathcal {D}}\ |\ x_{1}^{j}<0.75-1.5\cdot 59.25\text { or }x_{1}^{j}>60+1.5\cdot 59.25\}\\&= \{j\in {\mathcal {D}}\ |\ x_{1}^{j}>148.875\}\\&= \{2, 6, 7, 8, 9, 12, 13, 15, 16, 26, 28, 52, 61\}, \end{aligned}$$

i.e., we have 13 leverage points with the x-values 465, 11700, 2547, 187.1, 521, 529, 207, 6654, 9400, 87000, 192, 250, 160.

Due to the presence of these leverage points, OLS and LAD are heavily affected, such that there is a need for a robust statistical model. Therefore, we train LTS, LTA and LLTA on the data set for different values of k. In Fig. 6, we observe that the trimmed mean absolute errors \(\mathrm{tMAE}\) of the residuals of the three models are asymptotically comparable, whereas LTA and LLTA obtain a better score for \(k<25\).

Fig. 6 Trimmed mean absolute errors and model instances of LLTA, LTA and LTS for different values of k on log-scaled axes

In addition to the good statistical performance of LLTA, we observe in Fig. 7 that LLTA outperforms LTS and LTA with respect to the number of visited nodes in the branch-and-bound tree and with respect to the run time (at a time limit of 600 s). LLTA visits at most 151 nodes and solves most optimization problems within the root node. In contrast, LTS and LTA visit up to 155,572 and 1,071,225 nodes, respectively. Regarding the run time, LLTA needs at most 0.04 s to solve any of the instances, in contrast to LTA, whose run time increases to up to 17 s. In turn, LTA is much better than LTS, for which an optimality certificate cannot always be computed within the time limit. As such, we obtain a maximum speedup factor of 425 (or 99.8%) of LLTA compared to LTA and of 15,000 (or 99.99%) compared to LTS.

Fig. 7 Number of visited branch-and-bound nodes and run times for LLTA, LTS and LTA with a time limit of 120 s for different values of k

For some values \(k\ge 16\), LTS does not manage to close the optimality gap within the time limit, as depicted in Fig. 8.

Fig. 8 Remaining optimality gap with time limit of 600 s

3.2 Datasets from the literature

We extracted 11 datasets from the existing literature on robust regression. Table 1 summarizes some relevant information about the datasets we use for benchmarking.

Table 1 Some information on the datasets, where rows refers to the number of observations and columns to the number of features

We run LTS, LTA and LLTA for all data sets with a time limit of 600 s, for all \(k\in \{0,\ldots , \lfloor N/2-1\rfloor \}\) for LTS and LTA as well as \(k\in \{0,\ldots , |{\mathcal {O}}|\}\) for LLTA. The results are depicted in Figs. 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19, where we see the trimmed MAE, the number of visited branch-and-bound nodes and the run time for each method. Among the 11 datasets, we compare 118 different regression functions.

Fig. 9 Coleman data set

Fig. 10 Delivery time data

Fig. 11 Hawkins, Bradu, Kass's artificial data

Fig. 12 Heart catherization data

Fig. 13 Waterflow measurements of Kootenay River in Libby and Newgate

Fig. 14 Pension funds data

Fig. 15 Phosphorus content data

Fig. 16 Salinity data

Fig. 17 Siegel's exact fit example data

Fig. 18 Steam usage data (excerpt)

Fig. 19 Modified data on wood specific gravity

Consider now Figs. 9a, 10a, 11a, 12a, 13a, 14a, 15a, 16a, 17a, 18a and 19a, where the tMAE is shown for different values of k. Among the 118 regression functions, the performance of LTS is never better than that of LTA. This is because LTS is not immune against y-outliers and, thus, requires larger values of k to achieve a performance similar to LTA. For 68 instances, LLTA performs better than LTS, and for \(k<4\), LLTA is always better. Note that these results are heavily affected by dataset # 3 (Fig. 11a). LLTA performs comparably to LTA on most datasets. Exceptions are datasets # 2 (Fig. 10a), # 7 (Fig. 15a) and # 10 (Fig. 18a), where LTA is consistently better than LLTA, and dataset # 9 (Fig. 17a), where LLTA outperforms LTA. It is remarkable that LTA and LLTA have a quite similar performance, given that LTA has more degrees of freedom (because it can choose the trimmed points among all data points, whereas LLTA is restricted to the index set of leverage points \({\mathcal {O}}\)). Even more surprising is that LLTA shows a (strictly) better tMAE compared to LTA for 12 regression functions! Note that the computed regression functions are evaluated with respect to the tMAE (measuring the better half of the residuals) and not with respect to the objective function (measuring all residuals except the k outliers). The difference between evaluation metric and objective function also explains why the tMAE curves are not monotonically decreasing in k (although we still observe a decreasing trend as k increases). This nonmonotone behavior can be seen, for example, in dataset # 1 (Fig. 9a).

Figures 9b, 10b, 11b, 12b, 13b, 14b, 15b, 16b, 17b, 18b and 19b show the number of visited branch-and-bound nodes necessary to solve the corresponding instances. The trend here is clear: while LLTA shows linear growth in k, both LTA and LTS follow an exponential curve. This behavior is by design, as the primary motivation to introduce LLTA is the significant reduction of the computational burden. In many instances, LLTA can be solved in the root node.

The computational time is highly related to the number of visited branch-and-bound nodes. Therefore, Figs. 9c, 10c, 11c, 12c, 13c, 14c, 15c, 16c, 17c, 18c and 19c show a trend similar to that of the visited branch-and-bound nodes. In addition, we observe that the computational effort to solve LTS is significantly greater than that for LTA. A similar relative increase in run time is observable for LTA compared to LLTA. Note that LLTA solves any instance in at most 0.66 s, except for the one instance of dataset #10 with \(k=0\) (Fig. 18c). On average, LLTA is faster than LTA by a factor of 699.58 and faster than LTS by a factor of 797.08. However, the average speedup is mainly driven by dataset #3, where LLTA is faster than LTA by a factor of 7543 and faster than LTS by a factor of 7436. The median speedup factors are 5.33 compared to LTA and 80.23 compared to LTS.

Instance # 3 (Fig. 11a) is of particular interest due to its size. With 75 rows, this dataset contains about three times more rows than any other dataset (cf. Table 1). When inspecting Fig. 11b, c, we see that for \(k\ge 11\), none of the instances for LTA and LTS can be solved to optimality within the time limit. This explains why we do not observe exponential growth in the number of visited branch-and-bound nodes for large k for this dataset.

4 Summary and outlook

We introduce the Leveraged Least Trimmed Absolute Deviations (LLTA), which is based on the Least Trimmed Absolute Deviations (LTA). We make use of two observations. First, LTA is by design immune against y-outliers. Second, the leverage points can be computed beforehand, in contrast to the y-outliers, which depend on the constructed regression function. As such, LLTA combines the advantages of LTA while considering only leverage points as potential x-outliers. As a consequence, the proposed regression model LLTA is immune against both leverage points and y-outliers. At the same time, the computational burden is much lower compared to LTA.

Our computational results on known benchmark instances show that the regression models computed by LLTA perform comparably to those computed by LTA. At the same time, LLTA can be solved much faster than LTA. The regression models computed by LLTA tend to outperform those computed by the well-known least trimmed squares (LTS); for small k, this effect is drastic. In addition, LLTA can be solved several orders of magnitude faster than LTS.