1 Introduction

In statistics and machine learning, binary classification is one of the most common and relevant tasks. This problem consists of identifying a model, selected from a hypothesis space, that is able to separate samples which are characterized by a well-defined set of numerical features and belong to two different classes. The fitting process is based on a finite set of samples, the training set, but the aim is to obtain a model that correctly labels unseen data.

Among the various existing models for binary classification, such as k-nearest-neighbors, SVM, neural networks or decision trees (for a review of classification models see, e.g., the books of [9, 22] or [24]), we consider the logistic regression model. Logistic regression belongs to the class of Generalized Linear Models and possesses a number of useful properties: it is relatively simple; it is readily interpretable (since the weights are linearly associated with the features); its outputs are particularly informative, as they have a probabilistic interpretation; statistical confidence measures can quickly be obtained; the model can be updated by simple gradient descent steps if new data become available; moreover, in practice it often has good predictive performance, especially when the amount of training data is too limited to exploit more complex models.

In this work, we are interested in the problem of best feature subset selection in logistic regression. This variant of standard logistic regression requires finding a model that, in addition to accurately fitting the data, exploits a limited number of features. In this way, the obtained model only employs the most relevant features, with benefits in terms of both performance and interpretability.

In order to compare the quality of models that exploit different features, i.e., models with different complexity, goodness-of-fit (GOF) measures have been proposed. These measures allow one to evaluate the trade-off between accuracy of fit and complexity associated with a given model. Among the many GOF measures that have been proposed in the literature, those based on information criteria (IC) such as AIC [1], BIC [39] or HQIC [21] are among the most popular [23], especially in the statistics literature. Of course, no criterion is perfect and different ones might have been considered. However, as nicely reported by [13], a reasonable model should be computable from data as well as based on a general statistical inference framework. This means that “model selection is justified and operates within either a likelihood or Bayesian framework or within both frameworks”. So, although many alternative approaches can be proposed and successfully employed, like, e.g., those described in [6, 8, 26], in this paper we prefer to remain on the classical ground of Information Criteria like the AIC, which is an asymptotically unbiased estimator of the expected Kullback–Leibler information loss, or the BIC, which is an easily computable and good approximation of the Bayes factor.

When model selection is based on one of the aforementioned ICs, the underlying optimization problem consists of minimizing a function which is the sum of a convex part (the negative log-likelihood) and a penalty term proportional to the number of employed variables; it is thus a sparse optimization problem.

Problems of this kind are often tackled by heuristic procedures [17] or by \(\ell _1\)-regularization [27, 29, 45]. Specific optimization algorithms also exist that directly handle the zero pseudo-norm [4, 31, 32]. However, none of the aforementioned methods is guaranteed to find the best possible subset of features under a given GOF measure.

For problems where the convex part of the objective is simple, such as least squares linear regression, approaches based on mixed-integer formulations allow certified optima to be obtained, and have thus gained popularity in recent years [7, 14, 19, 34, 35]. The logistic likelihood, although convex, cannot however be embedded in a standard MIQP model. Still, [38] showed that, by means of a cutting-plane-based approximation, a good surrogate MILP problem can be defined and solved, at least for moderate problem sizes, providing a high quality classification model.

The aim of this paper is to introduce a novel technique that, exploiting mixed-integer modeling, is able to produce good solutions to the best subset selection problem in logistic regression, while remaining reasonably scalable w.r.t. problem size. To reach this goal, we make use of a decomposition strategy.

The main contributions of the paper are:

  • the definition of a strong necessary optimality condition for optimization problems with an \(\ell _0\) penalty term;

  • the definition of a decomposition scheme, with a suitable variable selection rule, that improves the scalability of the method from [38], with convergence guarantees to points satisfying the aforementioned condition;

  • practical suggestions to improve the performance of the proposed algorithm;

  • a thorough computational study comparing various solvers from the literature on best subset selection problems in logistic regression.

The rest of the manuscript is organized as follows: in Sect. 2, we formally introduce the problem of best subset selection in logistic regression, state optimality conditions and provide a brief review of a related approach. In Sect. 3, we present our proposed method, explaining in detail the key contributions and carrying out a theoretical analysis of the procedure. Then, in Sect. 4, we describe and report the results of a thorough experimental comparison on a benchmark of real-world classification problems; these results highlight the effectiveness of the proposed approach with respect to state-of-the-art methods. We finally give some concluding remarks and suggest possible future research directions in Sect. 5. In the Appendix, we also provide a detailed review of the algorithms considered in the computational experiments.

2 Best subset selection in logistic regression

Let \(X \in \mathbb {R}^{N\times n}\) be a dataset of N examples with n real features and \(Y\in \{-1,1\}^N\) a set of N binary labels. The logistic regression model [22] for binary classification defines the probability for an example x of belonging to class \(y=1\) as

$$\begin{aligned} \mathbb {P}(y=1\mid x) = \frac{1}{1+\exp (-w^\top x)}. \end{aligned}$$

Essentially, a sigmoid nonlinearity is applied to the output of a linear regression model. Note that the intercept term is not explicitly present in the linear part of the model; in fact, it can be implicitly added by considering it as a feature which is equal to 1 in all examples; we did so in the experimental part of this work. It is easy to see that

$$\begin{aligned} \mathbb {P}(y=-1\mid x) = 1-\mathbb {P}(y=1\mid x) = \frac{1}{1+\exp (w^\top x)}. \end{aligned}$$

Hence, the logistic regression model can be expressed by the single equation below:

$$\begin{aligned} \mathbb {P}(y\mid x) = \frac{1}{1+\exp (-yw ^\top x)}. \end{aligned}$$
(1)

Under the hypothesis that \((y\mid x)\) follows a Bernoulli distribution, model (1) is associated with the following log-likelihood function:

$$\begin{aligned} \ell (w) = -\sum _{i=1}^{N} \log \left( 1+\exp \left( -y^{(i)}w^\top x^{(i)}\right) \right) . \end{aligned}$$
(2)

The function \(f(v) = \log (1+\exp (-v))\) is referred to as the logistic loss and is convex. The maximum likelihood estimation of (1), which requires the maximization of \(\ell (w)\), is thus a continuous convex optimization problem.
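To fix ideas, the following minimal NumPy/SciPy sketch evaluates \(\mathcal {L}(w)=-2\ell (w)\) and its gradient, the two quantities used repeatedly in the sequel; the function names are our own choice and the snippet is not taken from the authors' implementation.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def neg_log_likelihood(w, X, y):
    """L(w) = -2 * ell(w): twice the negative logistic log-likelihood.

    X: (N, n) feature matrix, y: (N,) labels in {-1, +1}, w: (n,) weights.
    np.logaddexp(0, -v) evaluates log(1 + exp(-v)) without overflow.
    """
    v = y * (X @ w)
    return 2.0 * np.sum(np.logaddexp(0.0, -v))

def neg_log_likelihood_grad(w, X, y):
    """Gradient of L(w); expit(-v) = 1 / (1 + exp(v)) is the per-sample weight."""
    v = y * (X @ w)
    return -2.0 * (X.T @ (y * expit(-v)))
```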

Identifying a subset of features that provides a good trade-off between fit quality and model sparsity is a recurrent task in applications. Indeed, a sparse model might offer a better explanation of the underlying generating process; moreover, sparsity has been statistically shown to improve the generalization capabilities of the model [44]; finally, a sparse model is computationally more efficient.

Many different approaches have been proposed in the literature for the best subset selection problem which, we recall, is a specific form of model selection. Every model selection procedure has advantages and disadvantages, as it is difficult to think that a single, correct model exists for a specific application. Among the many different proposals, those which base subset selection on information criteria [12, 13, 28] stand out as the most frequently used, both for their computational appeal and for their deep statistical theoretical support. Information criteria are statistical tools to compare the quality of different models in terms of quality of fit and sparsity simultaneously. The two currently most popular information criteria are:

  • the Akaike Information Criterion (AIC) [1, 2, 11]:

    $$\begin{aligned} \text {AIC}(w) = -2\ell (w) + 2\Vert w\Vert _0; \end{aligned}$$

    When comparing a set of candidate models, the one with the smallest AIC is considered closer to the truth than the others. Since the log-likelihood, at its maximum point, is a biased upward estimator of the model selection target [12], the penalty term \(2\Vert w\Vert _0\), proportional to the total number of parameters involved in the model, allows this bias to be corrected;

  • the Bayesian Information Criterion (BIC) [39]:

    $$\begin{aligned} \text {BIC}(w) = -2\ell (w) + \log (N)\Vert w\Vert _0; \end{aligned}$$

    It has been shown [12, 28] that given a set of candidate models, the one which minimizes the BIC is optimal for the data, in the sense that it is the one that maximizes the marginal likelihood of the data under the Bayesian assumption that all candidate models have equal prior probabilities.

Although other approaches to model selection can be proposed, those based on the AIC and BIC, or their variants, are extremely popular thanks to their solid statistical properties.

In summary, for logistic regression models, the problem of best subset selection based on information criteria like AIC or BIC takes the form of the following optimization problem:

$$\begin{aligned} \min _{w\in \mathbb {R}^n} \mathcal {L}(w) + \lambda \Vert w\Vert _0, \end{aligned}$$
(3)

where \(\mathcal {L}:\mathbb {R}^n\rightarrow \mathbb {R}\) is twice the negative log-likelihood of the logistic regression model (\(\mathcal {L}(w) = -2\ell (w)\)), which is a continuously differentiable convex function, \(\lambda >0\) is a constant depending on the choice of the information criterion and \(\Vert \cdot \Vert _0\) denotes the \(\ell _0\) semi-norm of a vector. Given a solution \({\bar{w}}\), we will denote the set of its nonzero variables, also referred to as its support, by \(S({\bar{w}})\subseteq \{1,\ldots ,n\}\), while \({\bar{S}}({\bar{w}})=\{1,\ldots ,n\}{\setminus } S({\bar{w}})\) denotes its complement. In the following, we will also refer to the objective function as \(\mathcal {F}(w) = \mathcal {L}(w) + \lambda \Vert w\Vert _0\).
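Building on the previous sketch, the objective \(\mathcal {F}\) and the two criteria can be evaluated as follows; the zero-tolerance EPS is our own convention, not prescribed by the paper.

```python
EPS = 1e-8  # weights below this magnitude are treated as zero (our convention)

def support(w, eps=EPS):
    """Indices of the nonzero components, S(w)."""
    return np.flatnonzero(np.abs(w) > eps)

def objective(w, X, y, lam):
    """F(w) = L(w) + lambda * ||w||_0."""
    return neg_log_likelihood(w, X, y) + lam * len(support(w))

def aic(w, X, y):
    return objective(w, X, y, lam=2.0)             # AIC: lambda = 2

def bic(w, X, y):
    return objective(w, X, y, lam=np.log(len(y)))  # BIC: lambda = log(N)
```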

Because of the discontinuous nature of the \(\ell _0\) semi-norm, solving problems of the form (3) is not an easy task. In fact, problems like (3) are well-known to be \({\mathcal {NP}}\)-hard, hence, finding global minima is intrinsically difficult.

Lu and Zhang [32] have established necessary first-order optimality conditions for problem (3); in fact, they consider a more general, constrained version of the problem. In the unconstrained case we are interested in, such conditions reduce to the following.

Definition 1

A point \(w^\star \in \mathbb {R}^n\) satisfies Lu–Zhang first order optimality conditions for problem (3) if \(\nabla _j \mathcal {L}(w^\star ) = 0\) for all \(j\in \{1,\ldots ,n\}\) such that \(w^\star _j\ne 0\).

As proved by [32], if \(\mathcal {L}(w)\) is a convex function, as in the case of the logistic regression log-likelihood, Lu–Zhang optimality is equivalent to local optimality, i.e., to the existence of a neighborhood V of \(w^\star\) such that \(\mathcal {F}(w^\star )\le \mathcal {F}(w)\) for all \(w\in V\).

Proposition 1

Let \(w^\star \in \mathbb {R}^n\). Then, \(w^\star\) is a local minimizer for Problem (3) if and only if it satisfies Lu–Zhang first order optimality conditions.

This may appear surprising at first glance. However, after more careful thought, it becomes evident. Since \(\mathcal {L}\) is convex, a Lu–Zhang point is globally optimal w.r.t. its nonzero variables. As for the zero variables, since \(\mathcal {L}\) is continuous, there exists a neighborhood where the decrease in \(\mathcal {L}\) is bounded by \(\lambda\), i.e., by the penalty that is added to the overall objective function as soon as one of the zero variables is moved away from zero.

Unfortunately, the number of Lu–Zhang local minima is of the order of \(2^n\). Indeed, for any subset of variables, minimizing w.r.t. those components while keeping the others fixed to zero yields a point which satisfies Lu–Zhang conditions. Hence, satisfying these necessary and sufficient conditions of local optimality is a rather weak property in practice. On the other hand, since the search for an optimal subset of variables is a well-known \({\mathcal {NP}}\)-hard problem, requiring theoretical guarantees of global optimality is unreasonable. In conclusion, it should be clear that the evaluation and comparison of algorithms designed to deal with problem (3) have to be based on the quality of the solutions empirically obtained in experiments.

However, we can further characterize candidates for optimality by means of the following notion, which adapts the concept of CW-optimality for cardinality constrained problems defined by [4]. To this aim, we introduce the notation \(w_{\ne i}\) to denote all the components of w except the i-th.

Definition 2

A point \(w^\star \in \mathbb {R}^n\) is a CW-minimum for Problem (3) if

$$w^\star _i\in {\underset{w_i}{\,\mathrm{argmin }\,}}\mathcal {F}(w_i; w^\star _{\ne i})$$
(4)

for all \(i=1,\ldots ,n\).

Equivalently, (4) could be expressed as

$$\begin{aligned} w^\star \in \underset{w}{{\,\mathrm{argmin }\,}}\;&\mathcal {F}(w)\\ \text {s.t. }&\Vert w-w^\star \Vert _0\le 1 \end{aligned}$$
(5)

CW-optimality is a stronger property than Lu–Zhang stationarity. We outline this fact in the following proposition.

Proposition 2

Consider Problem (3) and let \(w^\star \in \mathbb {R}^n\). The following statements hold:

  1. If \(w^\star\) is a CW-minimum for (3), then \(w^\star\) satisfies Lu–Zhang optimality conditions, i.e., \(w^\star\) is a local minimizer for (3).

  2. If \(w^\star\) is a global minimizer for (3), then \(w^\star\) is a CW-minimum for (3).

Proof

We prove the statements one at a time.

  1. Let \(w^\star\) be a CW-minimum, i.e.,

    $$\begin{aligned} w^\star _i\in {\underset{w_i}{\,\mathrm{argmin }\,}}\mathcal {F}(w_i; w^\star _{\ne i}) \end{aligned}$$
    (6)

    for all \(i=1,\ldots ,n\). Assume by contradiction that \(w^\star\) does not satisfy Lu–Zhang conditions; then, there exists \(h\in \{1,\ldots ,n\}\) such that \(w^\star _h\ne 0\) and \(\nabla _h\mathcal {L}(w^\star )\ne 0\). Hence, \(-\nabla _h\mathcal {L}(w^\star )\) is a descent direction for \(\mathcal {L}(w_h;w^\star _{\ne h})\) at \(w_h^\star\); since \(w_h^\star \ne 0\), a sufficiently small step along this direction does not change the support and thus strictly decreases \(\mathcal {F}(w_h;w^\star _{\ne h})\), which contradicts (6).

  2. Let \(w^\star\) be a globally optimal point for (3). Assume by contradiction that \(w^\star\) is not a CW-minimum, i.e., there exist \(h\in \{1,\ldots ,n\}\) and \({\hat{w}}_h\) such that \(\mathcal {F}({\hat{w}}_h;w^\star _{\ne h})<\mathcal {F}(w^\star )\). This clearly contradicts the global optimality of \(w^\star\).

\(\square\)

Note that CW-optimality is a sufficient, yet not necessary, condition for local optimality. Indeed, Lu–Zhang conditions, and hence local optimality, only certify that an improvement cannot be achieved without changing the set of nonzero variables. CW-optimality also takes into account possible changes in the support, although limited to one variable at a time. We show this in the following example, where, for the sake of simplicity, we consider a simpler convex function than \(\mathcal {L}\).

Example 1

Consider the problem

$$\begin{aligned} \min _{w\in \mathbb {R}^2} \varphi (w) = (w_1-1)^2 + (w_2-2)^2 + 2\Vert w\Vert _0. \end{aligned}$$

It is easy to see that Lu–Zhang conditions are satisfied by the points \(w^a=(0,0)\), \(w^b=(1,2)\), \(w^c=(0,2)\) and \(w^d=(1,0)\). We have \(\varphi (w^a) = 5\), \(\varphi (w^b) = 4\), \(\varphi (w^c) = 3\), \(\varphi (w^d) = 6\). We can then observe that \(w^c\) is a CW-minimum (indeed, the global minimum), as its objective value cannot be improved by changing only one of its components, while \(w^a\), \(w^b\) and \(w^d\) are not CW-optima: \(w^a\) and \(w^d\) can be improved by setting their second component to 2, and \(w^b\) by zeroing its first component.

We can conclude by remarking that searching among the CW-points allows one to filter out a number of local minima that are certainly not globally optimal.
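The CW-condition of Definition 2 can also be checked numerically, by comparing for each coordinate the value attained when that coordinate is set to zero with the best value attainable when it is free; the sketch below is our own construction (relying on the helpers introduced earlier) and applies the check to Example 1.

```python
from scipy.optimize import minimize_scalar

def is_cw_minimum(w, smooth, lam, tol=1e-6):
    """Check Definition 2 for F(w) = smooth(w) + lam * ||w||_0.

    `smooth` is the smooth convex part of the objective (e.g. the quadratic of
    Example 1, or L(w) above); each 1-D subproblem is solved numerically.
    """
    w = np.asarray(w, dtype=float)
    F = lambda v: smooth(v) + lam * len(support(v))
    best = F(w)
    for i in range(w.size):
        w0 = w.copy(); w0[i] = 0.0           # candidate with w_i forced to zero
        cand = F(w0)
        def phi(t):                          # smooth part with only w_i free
            wi = w.copy(); wi[i] = t
            return smooth(wi)
        others = len(support(np.delete(w, i)))
        cand = min(cand, minimize_scalar(phi).fun + lam * (others + 1))
        if cand < best - tol:
            return False
    return True

# Example 1: only (0, 2) passes the check among the four Lu-Zhang points.
quad = lambda v: (v[0] - 1.0) ** 2 + (v[1] - 2.0) ** 2
for point in [(0, 0), (1, 2), (0, 2), (1, 0)]:
    print(point, is_cw_minimum(point, quad, lam=2.0))
```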

2.1 The MILO approach

Many approaches have been proposed to tackle cardinality-penalized problems in general and for problem (3) specifically. We provide a detailed review of many of these methods in Appendix. Here, we focus on a particular approach that is relevant for the rest of the paper.

Sato et al. [38] proposed a mixed integer linear optimization (MILO) reformulation for problem (3), which is, to the best of our knowledge, the top performing one, as long as the dimensions of the underlying classification problem are not exceedingly large. Such an approach has two core ideas. The first one consists of the replacement of the \(\ell _0\) term by the sum of binary indicator variables.

The second key element is the approximation of the nonlinearity in \(\mathcal {L}\), i.e., the logistic loss function, by a piecewise linear function, so that the resulting reformulated problem is a MILP problem. The approximating piecewise linear function is defined by the pointwise maximum of a family of tangent lines, that is,

$$\begin{aligned} f(v) = \log (1+\exp (-v)) \approx \hat{{f}}(v)&= \max \{f'(v^k)(v-v^k) + f(v^k) \mid k=1,2,\ldots ,K\}\\&=\min \{t\mid t\ge f'(v^k)(v-v^k) + f(v^k),\;k=1,\ldots ,K\} \end{aligned}$$

for some discrete set of points \(\{v^1,\ldots ,v^K\}\). The function \({\hat{f}}\) is a piecewise linear underestimator of the true logistic loss function. The final MILP reformulation of problem (3) is given by

$$\begin{aligned} \begin{aligned} \min _{w,z,t}\;&2\sum _{i=1}^{N} t_i + \lambda \sum _{i=1}^{n}z_i\\ \text {s.t. }&-Mz_i\le w_i\le Mz_i\quad \forall \,i=1,\ldots ,n,\\ {}&z\in \{0,1\}^n,\\ {}&t_i\ge f'(v^k)(y^{(i)}(w^\top x^{(i)})-v^k) + f(v^k)\quad \forall \,k=1,\ldots ,K,\quad \forall \, i=1,\ldots , N, \end{aligned} \end{aligned}$$
(7)

where M is a large enough positive constant.
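For illustration, a possible gurobipy sketch of formulation (7) is reported below; it is not the authors' code, the big-M value is purely illustrative, and the tangent points at \(\pm \infty\) used in the sets reported next correspond to the two asymptotic cuts \(t_i\ge 0\) and \(t_i\ge -v\), which we add explicitly.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def build_milo(X, y, lam, points, M=100.0):
    """Sketch of the surrogate MILP (7) for given finite tangent points."""
    N, n = X.shape
    model = gp.Model("milo-logreg")
    w = model.addVars(n, lb=-M, ub=M, name="w")
    z = model.addVars(n, vtype=GRB.BINARY, name="z")
    t = model.addVars(N, lb=-GRB.INFINITY, name="t")
    for j in range(n):                         # big-M link between w_j and z_j
        model.addConstr(w[j] <= M * z[j])
        model.addConstr(w[j] >= -M * z[j])
    for i in range(N):
        coeff = float(y[i]) * X[i, :]
        v_i = gp.quicksum(float(coeff[j]) * w[j] for j in range(n))
        model.addConstr(t[i] >= 0)             # tangent at +infinity
        model.addConstr(t[i] >= -v_i)          # tangent at -infinity
        for vk in map(float, points):          # tangent of the logistic loss at v^k
            fk = float(np.logaddexp(0.0, -vk))
            dk = -1.0 / (1.0 + float(np.exp(vk)))
            model.addConstr(t[i] >= dk * (v_i - vk) + fk)
    model.setObjective(2 * t.sum() + lam * z.sum(), GRB.MINIMIZE)
    return model, w, z
```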

The choice of the tangent lines is clearly crucial for this method. For large values of K, problem (7) becomes hard to solve. On the other hand, if the number of lines is small, the quality of the approximation will reasonably be low. Hence, the points \(v^k\) should be selected carefully. Sato et al. [38] suggest adopting a greedy algorithm that adds one tangent line at a time, minimizing the area of the gap between the exact logistic loss and the piecewise linear approximation. In their work, Sato et al. [38] show that the greedy algorithm provides, depending on the desired set size, the following sets of interpolation points:

$$\begin{aligned}&V_1 = \{0, \pm 1.9, \pm \infty \},\qquad V_2 = V_1\cup \{\pm 0.89,\pm 3.55\},\\&V_3 = V_2 \cup \{\pm 0.44, \pm 1.37, \pm 2.63, \pm 5.16\} \end{aligned}$$

As problem (7) employs an approximation of \(\mathcal {L}\), the optimal solution \({\hat{w}}\) obtained by solving it is not necessarily optimal for (3). However, since the objective of (7) is an underestimator of the original objective function, it is possible to make a posteriori accuracy evaluations. In particular, letting \(w^\star\) be the optimal solution of (3) and

$$\begin{aligned} \hat{\mathcal {L}}(w) = 2\sum _{i=1}^{N}\max _k\left\{ f'(v^k)\left( y^{(i)}(w^\top x^{(i)})-v^k\right) + f(v^k)\right\} , \end{aligned}$$

we have

$$\begin{aligned} \hat{\mathcal {L}}({\hat{w}}) + \lambda \Vert {\hat{w}}\Vert _0\le \mathcal {L}(w^\star ) + \lambda \Vert w^\star \Vert _0 \le \mathcal {L}({\hat{w}}) + \lambda \Vert {\hat{w}}\Vert _0. \end{aligned}$$

Hence, if \(\mathcal {L}({\hat{w}}) - \hat{\mathcal {L}}({\hat{w}})\) is small, it is guaranteed that the value of the real objective function at \({\hat{w}}\) is close to the optimum.
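Under the same conventions as the earlier sketches, this a posteriori check can be computed directly; here we use the finite points of \(V_2\) together with the two asymptotic tangents, and the helper names come from the previous snippets.

```python
V2_FINITE = np.array([0.0, 0.89, 1.9, 3.55])
V2_FINITE = np.unique(np.concatenate([V2_FINITE, -V2_FINITE]))

def approx_loss(w, X, y, points=V2_FINITE):
    """hat{L}(w): twice the piecewise-linear underestimator of the logistic loss."""
    v = y * (X @ w)                                        # shape (N,)
    f = np.logaddexp(0.0, -points)                         # f(v^k)
    df = -expit(-points)                                   # f'(v^k)
    tangents = df[None, :] * (v[:, None] - points[None, :]) + f[None, :]
    t = np.maximum(tangents.max(axis=1), np.maximum(0.0, -v))   # asymptotic cuts
    return 2.0 * np.sum(t)

def posterior_gap(w_hat, X, y):
    """L(w_hat) - hat{L}(w_hat): if small, w_hat is certified near-optimal for (3)."""
    return neg_log_likelihood(w_hat, X, y) - approx_loss(w_hat, X, y)
```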

3 The proposed method

The MILO approach from [38] is computationally very effective, but it suffers from a main drawback: it scales rather poorly as either the number of examples or the number of features in the dataset grows. This fact is also highlighted by the experimental results reported in the original MILO paper.

On the other hand, heuristic enumerative-like approaches have the limitation of performing moves with a limited horizon. This holds not only for simple stepwise procedures, but also for other, possibly more complex and structured, strategies that one may come up with. Indeed, selecting one move among all those that add to or remove from the current best subset multiple variables at a time is computationally unsustainable except for tiny datasets.

In this work, we propose a new approach that employs the MILO formulation to overcome the limitations of discrete enumeration methods, while also having better scalability than the standard MILO approach itself, in particular w.r.t. the number of features. The core idea of our proposal consists of the application of a decomposition strategy to problem (3). The classical Block Coordinate Descent (BCD) [5, 42] algorithm performs, at each iteration, the optimization w.r.t. one block of variables, i.e., the iterations have the form

$$\begin{aligned}&w_{B_\ell }^{\ell +1} \in {\underset{w_{B_\ell }}{\,\mathrm{argmin }\,}} \mathcal {F}(w_{B_\ell };w^\ell _{{\bar{B}}_\ell }), \end{aligned}$$
(8)
$$\begin{aligned}&w_{{\bar{B}}_\ell }^{\ell +1} = w_{{\bar{B}}_\ell }^{\ell }, \end{aligned}$$
(9)

where \(B_\ell \subset \{1,\ldots ,n\}\) is referred to as the working set and \({\bar{B}}_\ell = \{1,\ldots ,n\}{\setminus } B_\ell\) is its complement. Now, if the working set size |B| is reasonably small, the subproblems can be easily handled by means of a MILO model analogous to that from [38]. With such a strategy, the subproblems to be solved at each iteration have the form

$$\begin{aligned} \begin{aligned} \min _{w_{B_\ell },z,t}\;&2\sum _{i=1}^{N} t_i + \lambda \sum _{i\in B_\ell }z_i\\ \text {s.t. }&-Mz_i\le w_i\le Mz_i\quad \forall \,i\in B_\ell ,\\ {}&z_i\in \{0,1\}\quad \forall \,i\in B_\ell ,\\ {}&t_i\ge f'(v^k)\left( y^{(i)}(w^\top x^{(i)})-v^k\right) + f(v^k)\quad \forall \,k=1,\ldots ,K,\quad \forall \, i=1,\ldots , N,\\ {}&w_{{\bar{B}}_\ell } = w^\ell _{{\bar{B}}_\ell }. \end{aligned} \end{aligned}$$
(10)

At the end of each iteration, we can also introduce a minimization step of \(\mathcal {L}\) w.r.t. the current nonzero variables. Since this is a convex minimization step, it allows every iterate to be refined up to global optimality w.r.t. its support and thus up to Lu–Zhang stationarity, i.e., local optimality, in terms of the original problem. This operation has low computational cost and great practical utility; moreover, as we will show in the following, it guarantees finite termination of the algorithm.
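For example, this refinement step can be implemented as a support-restricted L-BFGS call, as in the sketch below (built on the earlier helpers; the actual implementation details of the paper may differ).

```python
from scipy.optimize import minimize

def refine_on_support(w, X, y):
    """Minimize L w.r.t. the currently nonzero variables, keeping the others at zero."""
    S = support(w)
    if S.size == 0:
        return np.zeros(np.asarray(w).size)
    def obj(ws):
        full = np.zeros(np.asarray(w).size)
        full[S] = ws
        return neg_log_likelihood(full, X, y)
    res = minimize(obj, np.asarray(w, dtype=float)[S], method="L-BFGS-B")
    out = np.zeros(np.asarray(w).size)
    out[S] = res.x
    return out
```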

3.1 The working set selection rule

Many different strategies could be designed for selecting, at each iteration \(\ell\), the variables constituting the working set \(B_\ell\), within the BCD framework. In this work, we propose a rule based on the violation of CW-optimality.

Given the current iterate \(w^\ell\), we can define a score function

$$\begin{aligned} p(w^\ell ,i) = \left\{ \begin{array}{ll} \mathcal {L}(0,w^\ell _{\ne i}) - \lambda + \lambda \Vert w^\ell \Vert _0&{}\text {if } w^\ell _i\ne 0,\\ \min _{w_i}\mathcal {L}(w_i,w^\ell _{\ne i}) + \lambda + \lambda \Vert w^\ell \Vert _0&{}\text {if } w^\ell _i = 0. \end{array}\right. \end{aligned}$$
(11)

The rationale of this score is to estimate what the objective function would become if we forced the considered variable \(w_i\) alone to change its status, i.e., to enter or leave the support.

We finally select the working set \(B^\ell\), of size b, choosing, in a greedy way, the b lowest scoring variables, i.e., by solving the problem

$$\begin{aligned} \begin{aligned} B^\ell \in \;&{{\,\mathrm{arg \,min }\,}}_{B} \sum _{h\in B}^{} p(w^\ell ,h)\\\text {s.t. }&B\subseteq \{1,\ldots ,n\},\\ {}&|B|=b. \end{aligned} \end{aligned}$$
(12)
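A direct, if naive, implementation of the score (11) and of the greedy selection (12) could look as follows (again building on the previous sketches; in practice the one-dimensional minimizations can be made much cheaper).

```python
def scores(w, X, y, lam):
    """p(w, i) of (11) for every variable i."""
    k = len(support(w))
    p = np.empty(w.size)
    for i in range(w.size):
        if abs(w[i]) > EPS:                    # variable would leave the support
            w0 = w.copy(); w0[i] = 0.0
            p[i] = neg_log_likelihood(w0, X, y) - lam + lam * k
        else:                                  # variable would enter the support
            def phi(t):
                wi = w.copy(); wi[i] = t
                return neg_log_likelihood(wi, X, y)
            p[i] = minimize_scalar(phi).fun + lam + lam * k
    return p

def select_working_set(w, X, y, lam, b=20):
    """Greedy rule (12): pick the b lowest-scoring variables."""
    return np.argsort(scores(w, X, y, lam))[:b]
```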

3.2 The complete procedure

The whole proposed algorithm is formally summarized in Algorithm 1. Basically, it is a BCD where subproblems are (approximately) solved by the MILO reformulation and variables are selected by (12).

In addition, there are some technical steps aimed at making the algorithm work from both the theoretical and the practical point of view.

In the ideal case where the subproblems are solved exactly, thanks to our selection rule, we would be guaranteed to do at least as well as a greedy descent step along a single variable. However, the subproblems are approximated, and it may happen that solving the MILO does not decrease the true objective, even though the simple greedy step would. In such cases, we perform the greedy step instead to produce the next iterate.

Moreover, at the end of each iteration we perform the refinement step previously discussed. Note that this step cannot increase the value of \(\mathcal {F}\), as we are lowering the value of \(\mathcal {L}\) by only moving nonzero variables.

Lastly, we make the stopping criterion explicit: the algorithm stops as soon as an iteration is not able to produce a decrease in the objective value; the point \(w^\ell\) is then returned.

Algorithm 1: The MILO-BCD procedure (pseudocode)
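As a complement to the formal listing, the following high-level Python sketch reflects our reading of the procedure described in Sects. 3.1 and 3.2 (working set selection, block MILO subproblem, greedy fallback, refinement, stopping test); `solve_subproblem` stands for an external routine, e.g. a Gurobi solve of model (10), and is not implemented here, while the escaping heuristic introduced later in Sect. 3.4 is omitted for brevity.

```python
def milo_bcd(X, y, lam, solve_subproblem, b=20):
    """Our reconstruction of the MILO-BCD loop; not the authors' code.

    solve_subproblem(w, B) is assumed to return a point obtained by solving the
    block MILO model (10) on working set B, with the remaining variables fixed.
    """
    n = X.shape[1]
    w = np.zeros(n)                                   # start from the empty model
    F = lambda v: objective(v, X, y, lam)
    while True:
        B = select_working_set(w, X, y, lam, b)       # greedy rule (12)
        cand = solve_subproblem(w, B)                 # approximate block MILO step
        if F(cand) >= F(w):
            # fallback: best single-coordinate move suggested by the scores
            i = int(np.argmin(scores(w, X, y, lam)))
            cand = w.copy()
            if abs(w[i]) > EPS:
                cand[i] = 0.0
            else:
                def phi(t):
                    wt = w.copy(); wt[i] = t
                    return neg_log_likelihood(wt, X, y)
                cand[i] = minimize_scalar(phi).x
        cand = refine_on_support(cand, X, y)          # convex refinement on support
        if F(cand) >= F(w):                           # no decrease: stop
            return w
        w = cand
```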

3.3 Theoretical analysis

In this section, we provide a theoretical characterization for Algorithm 1.

We begin by stating a useful property of the set of objective values attained at local minima of problem (3).

Lemma 1

Let \(\Gamma =\{\mathcal {F}(w)\mid w \text { is a local minimum point for problem}\;({3})\}\). Then \(|\Gamma |\le 2^n.\)

Proof

For each support set \(S\subseteq \{1,\ldots ,n\}\) let \(L^\star _S\) be the optimal value of the problem

$$\begin{aligned} \min _{w:w_{{\bar{S}}}=0}\mathcal {L}(w). \end{aligned}$$

Let \(w^\star\) be a local minimizer for problem (3). Then, from Lu–Zhang conditions and the convexity of \(\mathcal {L}\), it is a global minimizer of

$$\begin{aligned} \min _{w:w_{{\bar{S}}(w^\star )}=0}\mathcal {L}(w), \end{aligned}$$

and \(\mathcal {F}(w^\star ) = L^\star _{S(w^\star )} + \lambda |S(w^\star )|\). We hence have

$$\begin{aligned} \Gamma&= \{L^\star _{S(w^\star )} + \lambda | S(w^\star )|\mid w^\star \text { is a local minimizer for}\;(3)\}\\ {}&\subseteq \{L^\star _{S} + \lambda | S|\mid S\subseteq \{1,\ldots ,n\}\} \end{aligned}$$

and so

$$\begin{aligned} |\Gamma | \le |\{L^\star _{S} + \lambda |S|\mid S\subseteq \{1,\ldots ,n\}\}|\le |\{S\mid S\subseteq \{1,\ldots ,n\}\}| = 2^n. \end{aligned}$$

\(\square\)

We continue with a statement about the relationship between the objective function \(\mathcal {F}(w)\) and the score function \(p(w,i)\).

Lemma 2

Let p be the score function defined as in (11) and let \({\bar{w}}\in \mathbb {R}^n\). Moreover, for all \(h=1,\ldots ,n\), let \({\bar{w}}^h\in {{\,\mathrm{argmin }\,}}_{w_h} \mathcal {F}(w_h, {\bar{w}}_{\ne h})\). Then the following statements hold

  (1) If \(\mathcal {F}({\bar{w}}^h)=\mathcal {F}({\bar{w}})\), then \(p({\bar{w}},h)\ge \mathcal {F}({\bar{w}});\)

  (2) If \(\mathcal {F}({\bar{w}}^h)<\mathcal {F}({\bar{w}})\) and \({\bar{w}}\) satisfies Lu–Zhang conditions, then \(p({\bar{w}},h)= \mathcal {F}({\bar{w}}^h).\)

Proof

We prove the two statements one at a time:

  (1) Let us assume that the thesis is false, i.e., \(\mathcal {F}({\bar{w}}^h)=\mathcal {F}({\bar{w}})\) and \(p({\bar{w}}, h)<\mathcal {F}({\bar{w}})\). We distinguish two cases: \({\bar{w}}_h=0\) and \({\bar{w}}_h\ne 0\). In the former case we have

    $$\begin{aligned} \mathcal {F}({\bar{w}})>p({\bar{w}},h)&= \min _{w_h}\mathcal {L}(w_h,{\bar{w}}_{\ne h})+\lambda + \lambda \Vert {\bar{w}}\Vert _0\\ {}&=\min _{w_h}\mathcal {L}(w_h,{\bar{w}}_{\ne h})+\lambda + \lambda \Vert {\bar{w}}_{\ne h}\Vert _0\\ {}&\ge \min _{w_h}\mathcal {L}(w_h,{\bar{w}}_{\ne h})+\lambda \Vert w_h\Vert _0 + \lambda \Vert {\bar{w}}_{\ne h}\Vert _0\\ {}&=\min _{w_h}\mathcal {L}(w_h,{\bar{w}}_{\ne h})+\lambda \Vert (w_h,{\bar{w}}_{\ne h})\Vert _0 \\ {}&= \mathcal {F}({\bar{w}}^h)=\mathcal {F}({\bar{w}}), \end{aligned}$$

    which is absurd. In the latter case, we have

    $$\begin{aligned} \mathcal {F}({\bar{w}})>p({\bar{w}},h)&= \mathcal {L}(0,{\bar{w}}_{\ne h})-\lambda + \lambda \Vert {\bar{w}}\Vert _0\\ {}&=\mathcal {L}(0,{\bar{w}}_{\ne h}) + \lambda \Vert (0,{\bar{w}}_{\ne h})\Vert _0\\ {}&\ge \mathcal {F}({\bar{w}}^h)=\mathcal {F}({\bar{w}}), \end{aligned}$$

    which is again a contradiction; hence we get the thesis.

  (2) We again distinguish two cases: \({\bar{w}}_h=0\) and \({\bar{w}}_h\ne 0\). In the first case we have

    $$\begin{aligned} \mathcal {F}({\bar{w}}^h)&= \min _{w_h}\mathcal {F}(w_h,{\bar{w}}_{\ne h})\\&=\min \left\{ \min _{w_h\ne 0}\mathcal {F}(w_h,{\bar{w}}_{\ne h}),\mathcal {F}(0,{\bar{w}}_{\ne h})\right\} \\&=\min \left\{ \min _{w_h\ne 0}\mathcal {F}(w_h,{\bar{w}}_{\ne h}),\mathcal {F}({\bar{w}})\right\} \end{aligned}$$

    Since we know that \(\mathcal {F}({\bar{w}}^h)<\mathcal {F}({\bar{w}})\), it follows that

    $$\begin{aligned} \min _{w_h\ne 0} \mathcal {L}(w_h,{\bar{w}}_{\ne h})< \mathcal {L}(0,{\bar{w}}_{\ne h}) \end{aligned}$$

    and we can also write

    $$\begin{aligned} \mathcal {F}({\bar{w}}^h)&= \min _{w_h\ne 0}\mathcal {F}(w_h,{\bar{w}}_{\ne h})\\&= \min _{w_h\ne 0} \mathcal {L}(w_h,{\bar{w}}_{\ne h}) + \lambda \Vert (w_h,{\bar{w}}_{\ne h})\Vert _0\\&=\min _{w_h\ne 0} \mathcal {L}(w_h,{\bar{w}}_{\ne h}) + \lambda + \lambda \Vert {\bar{w}}_{\ne h}\Vert _0\\&=\min _{w_h\ne 0} \mathcal {L}(w_h,{\bar{w}}_{\ne h}) + \lambda + \lambda \Vert {\bar{w}}\Vert _0\\ {}&=\min _{w_h} \mathcal {L}(w_h,{\bar{w}}_{\ne h}) + \lambda + \lambda \Vert {\bar{w}}\Vert _0\\&=p({\bar{w}},h). \end{aligned}$$

    In the second case, since \({\bar{w}}\) satisfies Lu–Zhang conditions, we have \({\bar{w}}_h\in {{\,\mathrm{arg \,min }\,}}_{w_h} \mathcal {L}(w_h,{\bar{w}}_{\ne h})\). Therefore

    $$\begin{aligned} {\bar{w}}_h\in {{\,\mathrm{arg \,min }\,}}_{w_h\ne 0} \mathcal {L}(w_h,{\bar{w}}_{\ne h})+\lambda \Vert (w_h,{\bar{w}}_{\ne h})\Vert _0={{\,\mathrm{arg \,min }\,}}_{w_h\ne 0}\mathcal {F}(w_h,{\bar{w}}_{\ne h}). \end{aligned}$$

    Since \(\mathcal {F}({\bar{w}}^h)<\mathcal {F}({\bar{w}}) = \min _{w_h\ne 0}\mathcal {F}(w_h,{\bar{w}}_{\ne h})\), we get \({\bar{w}}^h = (0,{\bar{w}}_{\ne h})\). We finally obtain

    $$\begin{aligned} \mathcal {F}({\bar{w}}^h)&= \mathcal {L}({\bar{w}}^h) + \lambda \Vert {\bar{w}}^h\Vert _0 \\&= \mathcal {L}(0,{\bar{w}}_{\ne h}) + \lambda \Vert (0,{\bar{w}}_{\ne h})\Vert _0\\&= \mathcal {L}(0,{\bar{w}}_{\ne h}) + \lambda \Vert {\bar{w}}_{\ne h}\Vert _0\\&= \mathcal {L}(0,{\bar{w}}_{\ne h}) + \lambda \Vert {\bar{w}}\Vert _0 - \lambda \\&=p({\bar{w}},h). \end{aligned}$$

\(\square\)

We are finally able to state finite termination and optimality properties of the returned solution of the MILO-BCD procedure.

Proposition 3

Let \(\{w^\ell \}\) be the sequence generated by Algorithm 1. Then \(\{w^\ell \}\) is a finite sequence and the last element \({\bar{w}}\) is a CW-minimum for problem (3).

Proof

From the instructions of the algorithm, for all \(\ell =1,2,\ldots\), we have that

$$\begin{aligned} w^{\ell } \in {\underset{w}{\,\mathrm{argmin }\,}} \;&\mathcal {L}(w)\\ \text {s.t.}\quad&w_i = 0 \text { for all } i\in {\bar{S}}(w^{\ell }), \end{aligned}$$

hence \(\nabla _i \mathcal {L}(w^{\ell }) = 0\) for all \(i\in S(w^{\ell })\), i.e., \(w^\ell\) satisfies Lu–Zhang conditions and is therefore a local minimum point for problem (3). From Lemma 1, we thus know that only finitely many values are possible for \(\mathcal {F}(w^\ell )\). Moreover, \(\{\mathcal {F}(w^\ell )\}\) is a nonincreasing sequence. We can conclude that in a finite number of iterations we get \(\mathcal {F}(w^\ell ) = \mathcal {F}(w^{\ell +1})\), activating the stopping criterion.

We now prove that the returned point, \({\bar{w}}=w^{\bar{\ell }}\) for some \(\bar{\ell }\in \mathbb {N}\), is CW-optimal. Assume by contradiction that \({\bar{w}}\) is not CW-optimal. Then, there exists \(h\in \{1,\ldots ,n\}\) such that \(\min _{w_h}\mathcal {F}(w_h,{\bar{w}}_{\ne h})<\mathcal {F}({\bar{w}})\).

We show that this implies that there exists \(t\in \{1,\ldots ,n\}\) such that \(t\in B^{\bar{\ell }}\) and \(\min _{w_t}\mathcal {F}(w_t,{\bar{w}}_{\ne t})<\mathcal {F}({\bar{w}})\). Assume by contradiction that for all \(j\in B^{\bar{\ell }}\) it holds that \(\min _{w_j}\mathcal {F}(w_j,{\bar{w}}_{\ne j})=\mathcal {F}({\bar{w}})\). Letting i be any index in the working set \(B^{\bar{\ell }}\) and recalling Lemma 2, we have

$$\begin{aligned} \sum _{j\in B^{\bar{\ell }}}^{}p(w^{\bar{\ell }},j)&= \sum _{j\in B^{\bar{\ell }}{\setminus }\{i\}}^{}p(w^{\bar{\ell }},j) + p(w^{\bar{\ell }},i)\\&\ge \sum _{j\in B^{\bar{\ell }}{\setminus }\{i\}}^{}p(w^{\bar{\ell }},j) +\mathcal {F}({w^{\bar{\ell }}})\\&> \sum _{j\in B^{\bar{\ell }}{\setminus }\{i\}}^{}p(w^{\bar{\ell }},j) + p(w^{\bar{\ell }},h)\\&=\sum _{j\in B^{\bar{\ell }}\cup \{h\}{\setminus }\{i\}}^{}p(w^{\bar{\ell }},j), \end{aligned}$$

which contradicts the working set selection rule (12).

Now, either \(\mathcal {F}(\nu ^{\bar{\ell }+1})<\mathcal {F}(w^{\bar{\ell }})\) after steps 4–5 of the algorithm, or, after step 7, we get

$$\begin{aligned} \mathcal {F}(\nu ^{\bar{\ell }+1})\le \min _{w_t}\mathcal {F}(w_t,w^{\bar{\ell }}_{\ne t})<\mathcal {F}(w^{\bar{\ell }}). \end{aligned}$$

Therefore, since step 8 cannot increase the value of \(\mathcal {F}\), we get \(\mathcal {F}(w^{\bar{\ell }+1})<\mathcal {F}(w^{\bar{\ell }})\), but this contradicts the fact that the stopping criterion at line 9 is satisfied at iteration \(\bar{\ell }\). \(\square\)

3.4 Finding good CW-optima

We have shown in the previous section that Algorithm 1 always returns a CW-optimal solution. Although this allows us to cut off a lot of local minima, in practice there are many low-quality CW-minima. For this reason, we introduce in our algorithm a heuristic aimed at escaping bad CW-optima where it may get stuck.

In detail, we do as follows. Instead of stopping the algorithm as soon as the objective value does not decrease, we try to repeat the iteration with a different working set. In doing this, we obviously have to change the working set selection rule. This operation is repeated up to a maximum number of times. If, after testing a suitable number of different working sets, a decrease in the objective function cannot be achieved, the algorithm stops.

Specifically, we define a modified score function

$$\begin{aligned} {\hat{p}}(w^\ell ,i) = p(w^\ell ,i) + 2^{r_i} -1, \end{aligned}$$
(13)

where \(r_i\) is the number of times the i-th variable was in the working set in the previous attempts.

The idea of this working set selection rule is to first try a greedy selection. Then, if that first attempt failed, we penalize (exponentially) variables that were tried more often without eventually providing improvements. This penalty is heuristic; in fact, we may end up repeating the search over the same working set from the same starting point. However, we can keep track of the working sets used throughout the outer iteration, in order to avoid duplicate computations.

Note that such a modification does not alter the theoretical properties of the algorithm; on the other hand, it has a massive impact on the empirical performance.
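In code, the modification amounts to a single extra term on top of the score of (11); the sketch below builds on the previous snippets, and `r` denotes the per-variable counter of previous attempts.

```python
def modified_scores(w, X, y, lam, r):
    """hat{p}(w, i) = p(w, i) + 2**r_i - 1, as in Eq. (13)."""
    return scores(w, X, y, lam) + 2.0 ** np.asarray(r, dtype=float) - 1.0
```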

4 Computational results

This section is dedicated to a computational comparison between the approach proposed in this paper and the state-of-the-art algorithms described in Sect. 2 and the Appendix. In our experiments we considered 11 datasets for binary classification tasks, listed in Table 1, from the UCI Machine Learning Repository [15]. The digits dataset is in fact inherently multi-class; we followed the same binarization strategy as [38], assigning a positive label to the examples from the largest class and a negative one to all the others. Moreover, we removed data points with missing variables, encoded the categorical variables with one-hot vectors and normalized the remaining ones to zero mean and unit variance. In Table 1 we also report the number n of data points and the number p of features of each dataset, after the aforementioned preprocessing operations.

Table 1 List of datasets used for the experiments on best subset selection in logistic regression

These datasets constitute a benchmark to evaluate the performance of the algorithms under examination, namely: Forward Selection and Backward Elimination Stepwise heuristics, LASSO, Penalty Decomposition, Concave approximation, the Outer Approximation method in its original form, in the adapted version for cardinality-penalized problems and also in the variant exploiting the approximated dual problems, MILO and our proposed method MILO-BCD. All of these algorithms are described in Appendix and Sect. 2.

All the experiments described in this section were performed on a machine with Ubuntu Server 18.04 LTS OS, Intel Xeon E5-2430 v2 @ 2.50 GHz CPU and 16GB RAM. The algorithms were implemented in Python 3.7.4, exploiting Gurobi 9.0.0 [20] for the outer approximation method, MILO and MILO-BCD. The scipy [43] implementation of the L-BFGS algorithm defined in [30] was employed for local optimization steps of all methods. A time limit of 10,000 s was set for each method.

Both stepwise methods (forward and backward) exploit L-BFGS [30] as internal optimizer. The forward selection version uses L-BFGS to optimize the logistic loss with respect to one variable at a time, whereas backward elimination defines its starting point by using L-BFGS to optimize the model w.r.t. all the variables.

Concerning LASSO, we solved Problem (14) using the scikit-learn implementation [36], with the LIBLINEAR library [18] as internal optimizer, for each value of the hyperparameter \(\lambda\). Each \(\lambda\) value was chosen so that two different hyperparameters, \(\lambda _1 \ne \lambda _2\), would not produce the same level of sparsity, and so as to avoid the zero solution. More specifically, we defined our set of hyperparameters by computing the LASSO path, exploiting the scikit-learn function l1_min_c. All the obtained solutions were refined by further optimizing w.r.t. the nonzero components only by means of L-BFGS. At the end of this grid search we selected, among the resulting solutions, the one providing the best information criterion value.
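For reference, the grid construction is in the spirit of the standard scikit-learn regularization-path recipe sketched below, where X and y denote the preprocessed data matrix and label vector; the number and spacing of grid points shown here are purely illustrative, the actual grid being chosen as described above.

```python
from sklearn.svm import l1_min_c
from sklearn.linear_model import LogisticRegression

# Smallest C (inverse regularization strength) yielding a nonzero model, then a
# log-spaced grid above it.
cs = l1_min_c(X, y, loss="log") * np.logspace(0, 7, 16)
lasso_models = [
    LogisticRegression(penalty="l1", C=c, solver="liblinear", tol=1e-6).fit(X, y)
    for c in cs
]
```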

Penalty Decomposition requires setting a large number of hyperparameters: in our experiments we set \(\varepsilon = 10^{-1}\), \(\eta = 10^{-3}\) and \(\sigma _\varepsilon =1\) for all the datasets. We ran the algorithm multiple times for values of \(\tau\) and \(\sigma _\tau\) taken from a small grid. L-BFGS was again used as internal solver. The best solution obtained, in terms of information criterion, was retained at the end of the process.

Table 2 Results of AIC minimization in logistic regression with different optimization methods on small datasets (best result for each dataset in bold)
Table 3 Results of AIC minimization in logistic regression with different optimization methods on large datasets (best result for each dataset in bold)

Concave approximation, in theory, requires the solution of a sequence of problems. However, as outlined in the Appendix, a single problem with a fixed approximation hyperparameter \(\mu\) can be solved in practice [37]. In our experiments, Problem (17) was solved by using L-BFGS. Again, we retain as optimal solution the one that, after an L-BFGS refining step w.r.t. the nonzero variables, minimizes the information criterion among the set of resulting points obtained for different values of \(\mu\).

It is important to highlight that the refining optimization step is crucial for methods like Concave Approximation or LASSO; as a matter of fact, without this precaution, the computed solutions do not even necessarily satisfy the Lu–Zhang conditions.

All variants of the Outer Approximation method exploit Gurobi to handle the MILP subproblems and L-BFGS for the continuous ones. As suggested by [8], a single branch and bound tree is constructed to solve all the MILP subproblems, adding cutting-type constraints dynamically as lazy constraints. Moreover, the starting cut is decided by means of the first-order heuristic described in the referenced work. For the cardinality-constrained version of the algorithm, we set a time limit of 1000 s for the solution of any individual problem of the form (18) with a fixed value of s. As for the dual formulation, we set \(\gamma =10^4\) to make the considered problem as close as possible to the formulation tackled by all the other algorithms. The approximated version of the dual problem, which is quadratic, is efficiently solved with Gurobi instead of L-BFGS.

As concerns MILO and MILO-BCD, we employed the \(V_2\) set of interpolation points for both methods, in order to have a good trade-off between accuracy and computational burden. Moreover, for MILO-BCD we set the cardinality b of the working set to 20 for all the problems. We report in Sect. 4.1 the results of preliminary computational experiments that support this choice. All the subproblems were solved with Gurobi. For MILO-BCD we employ the heuristic discussed in Sect. 3.4. For each problem, the maximum number of consecutive attempts with no improvement, before stopping the algorithm, is set to n. Note that, in order to improve the efficiency of the algorithm, we instantiate a single MILP problem with n variables and dynamically change the box constraints based on the current working set. The continuous optimization steps needed to perform steps 7 and 8 of Algorithm 1 are carried out using L-BFGS.

Table 4 Results of BIC minimization in logistic regression with different optimization methods on small datasets (best result for each dataset in bold)
Table 5 Results of BIC minimization in logistic regression with different optimization methods on large datasets (best result for each dataset in bold)

Tables 2, 3, 4 and 5 show the computational results of minimizing AIC and BIC, respectively, on the 11 datasets. For each algorithm and problem, we report the information criterion value at the returned solution, its zero norm and the total runtime. We can observe the effectiveness of the MILO-BCD approach w.r.t. the other methods. In particular, in 8 out of 11 test problems MILO-BCD found the best AIC value, while in the remaining three cases it attained a very close second-best result. The results of minimizing BIC are very similar: for 9 out of 11 datasets MILO-BCD returns the best solution and in the remaining two it ranks second. We can also note that, in the cases where p is large, such as the spam, digits, a2a, w2a and madelon datasets, our method is able, within the established time limit, to find a much higher quality solution than the other algorithms (with the only exception of spam for the AIC), and in particular compared to MILO.

As for efficiency, Tables 2, 3, 4 and 5 also allow us to evaluate the computational burden of MILO-BCD. As expected, our method is slower than the approaches that are not based on Mixed Integer Optimization, which on the other hand provide lower quality solutions. However, compared to standard MILO, we can see a considerable improvement in terms of computational time on both the small and the large datasets.

In Fig. 1 we plot the cumulative distribution of the absolute distance from the optimum attained by each solver, computed over the 22 subset selection problems. The x-axis values represent the difference in absolute value between the information criterion obtained and the best one found, while the y-axis reports the fraction of problems solved within a certain distance from the best. As can be seen, MILO-BCD clearly outperforms the other methods. As a matter of fact, MILO-BCD always found a solution whose distance from the optimal one is less than 15, and in around 80% of the problems it attained the optimal solution. We can also see that for all the other methods there is a number of bad cases where the obtained value is very far from the optimal one. Note that we consider the absolute distance from the best, instead of a relative distance, since it is usually the difference in IC values which is considered in practice to assess the quality of a model w.r.t. another one [12].

Finally, we highlight that MILO-BCD manages to greatly improve upon the performance of MILO without making its interface more complex. As a matter of fact, we have only added one hyperparameter, which controls the cardinality of the working set and experimentally appears to be extremely easy to tune. Indeed, note that all the experiments were carried out using the same working set size for every dataset and, despite this choice, MILO-BCD showed impressive performance on all the considered datasets.

Fig. 1

Each curve represents the fraction of the 22 classification problems for which the corresponding solver obtains an absolute error less than or equal to \(\Delta _{\text {abs}}\) w.r.t. the optimal value

4.1 Varying the working set size

The value of the working set size b may greatly affect the performance of the MILO-BCD procedure, in terms of both solution quality and running time. For this reason, we performed a study to evaluate the behavior of the algorithm as the value of b changes. We ran MILO-BCD on the problems obtained from four datasets of different scales: heart, breast, spectf and a2a, using AIC as the GOF measure.

The results are reported in Table 6 and Fig. 2. We can see that a working set size of 20, as employed in the experiments of the previous section, provides a good trade-off. Indeed, the running time generally grows with the working set size, whereas the optimal solution is approached only when sufficiently large working sets are employed. In some cases a slightly larger value of b allows even better solutions to be retrieved than those obtained in the experiments of Sect. 4, but the computational cost increases significantly. In the end, as can be observed in Sect. 4, the choice \(b=20\) experimentally led to excellent results on the entirety of our benchmark.

Table 6 Results obtained by the MILO-BCD procedure on the best subset selection problem based on AIC with four datasets for different values of working set size b
Fig. 2

Trade-off between runtime and solution quality for different values of the working set size in MILO-BCD, on the best subset selection problem based on AIC for the four considered problems

5 Conclusions

In this paper, we considered the problem of best subset selection in logistic regression, with particular emphasis on the IC-based formulation. We introduced an algorithm combining mixed-integer programming models and decomposition techniques such as block coordinate descent. The aim of the algorithm is to find high quality solutions even on larger scale problems, where other existing MIP-based methods are unreasonably expensive, while heuristic and local-optimization-based methods produce very poor solutions.

We theoretically characterized the features and the behavior of the proposed method. Then, we presented the results of extensive computational experiments, showing that the proposed approach is indeed able to find, in a reasonable running time, much better solutions than a set of other state-of-the-art solvers; this fact is particularly evident on the problems with higher dimensions.

Future research will be focused on the definition of possibly more effective and efficient working set selection rules for our algorithm. Upcoming work may also be aimed at adapting the proposed algorithm to deal with different or more general problems.

In particular, the case of multi-class classification is of great interest. However, the problem is challenging. Specifically, the difficulty in directly extending our approach to the multinomial case lies in the definition of the piecewise linear approximation of the objective function. Indeed, in the multi-class scenario, the number of weights is \(n\times m\), where m is the number of classes, and \(N\times m\) pieces of the objective function need to be approximated. This results in an increasingly high number of variables and constraints, which might rapidly become unmanageable even with our decomposition approach. Hence, future work might be focused on devising alternative decomposition approaches specifically designed to tackle the multinomial case.