A Unifying Framework for Sparsity-Constrained Optimization

In this paper, we consider the optimization problem of minimizing a continuously differentiable function subject to both convex constraints and sparsity constraints. By exploiting a mixed-integer reformulation from the literature, we define a necessary optimality condition based on a tailored neighborhood that allows us to take into account potential changes of the support set. We then propose an algorithmic framework to tackle the considered class of problems and prove its convergence to points satisfying the newly introduced concept of stationarity. We further show that, by suitably choosing the neighborhood, other well-known optimality conditions from the literature can be recovered at the limit points of the sequence produced by the algorithm. Finally, we analyze the computational impact of the neighborhood size within our framework and in comparison with two state-of-the-art algorithms, namely, the Penalty Decomposition method and the Greedy Sparse-Simplex method. The algorithms have been tested on a benchmark of sparse logistic regression problems.

1. Introduction. We consider the following sparsity constrained problem:

    min f(x)  s.t.  x ∈ X,  ‖x‖_0 ≤ s,    (1.1)

where f : R^n → R is a continuously differentiable function, X ⊆ R^n is a closed and convex set, ‖x‖_0 denotes the number of nonzero components of x, and s < n is a properly chosen integer value. We further use X̄ to indicate the overall feasible set X ∩ {x ∈ R^n | ‖x‖_0 ≤ s}. Problem (1.1) has a wide range of applications, from subset selection in regression [25] and the compressed sensing technique used in signal processing [12] to portfolio optimization [7,26]. Such a problem can be reformulated into different equivalent mixed-integer problems and is known to be NP-hard [7,27,28].
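As a small illustration (a sketch, not the authors' code), the ℓ0 term and the feasibility test for (1.1) can be written as follows; the membership test for the convex set X is a user-supplied assumption:

```python
import numpy as np

def l0_norm(x, tol=1e-12):
    # Number of (numerically) nonzero components of x: the l0 "norm" in (1.1).
    return int(np.sum(np.abs(np.asarray(x)) > tol))

def is_feasible(x, s, in_X=lambda x: True):
    # Membership test for X ∩ {x : ||x||_0 <= s}; in_X is a user-supplied
    # test for the closed convex set X (here defaulted to R^n).
    return bool(in_X(x)) and l0_norm(x) <= s
```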
We observe that problem (1.1) is generally hard to solve because both the objective function and the feasible set (due to the combinatorial nature of the sparsity constraint) are nonconvex. The inherently combinatorial flavor of the problem makes the definition of proper optimality conditions and, consequently, the development of algorithms that generate points satisfying those conditions a challenging task. A number of ways to address these issues have been proposed in the literature (see, e.g., [3,4,11,20,23]). However, some of the proposed optimality conditions do not fully take into account the combinatorial nature of the problem, whereas some of the corresponding algorithms [3,23] require the exact solution of a sequence of nonconvex subproblems, which may be practically prohibitive. Moreover, due to the theoretical tools involved in the analysis, it is in any case not easy to relate the different approaches to each other.
In this paper, we hence give a unifying view on this matter. More specifically, we consider the mixed-integer reformulation of problem (1.1) described in [11] and use it to define a suitable optimality condition. This condition is then embedded into an algorithmic framework aimed at finding points satisfying the resulting optimality criterion. The algorithm combines inexact minimizations with a strategy that explores tailored neighborhoods of a given feasible point. These features make it easy to handle the nonconvexity of both the objective function and the feasible set, also from a practical point of view. We prove the convergence of the algorithmic scheme, establishing that its limit points satisfy the specific optimality condition. We then show that different conditions proposed in the literature (see, e.g., [3,11,23]) can be easily derived from ours. We finally perform numerical tests on sparse logistic regression in order to show that the devised method is also computationally viable.
The paper is organized as follows: in section 2, we provide basic definitions and preliminary results related to optimality conditions for problem (1.1). In section 3, we describe our proposed algorithmic framework and present (subsection 3.1) the convergence analysis without constraint qualifications. In section 4, we analyze the asymptotic convergence properties of the algorithm when constraint qualifications hold. Finally, we report numerical experiments in section 5 and give some concluding remarks in section 6. We also provide in Appendix A some insights on the relationship between classical stationarity conditions for convex problems with and without constraint qualifications.
2. Basic definitions and preliminary results. Even though problem (1.1) is a continuous optimization problem, it has an intrinsic combinatorial nature, and in applications the interest often lies in finding a good, possibly globally optimal configuration of active variables. Since (1.1) is a continuous problem, x* ∈ X̄ is a local minimizer if there exists an open ball B(x*, ε) such that f(x*) = min{f(x) | x ∈ X̄ ∩ B(x*, ε)}. In some works from the literature (e.g., [11,23]), necessary conditions of local optimality have been proposed. However, for this particular problem, every local minimizer over a fixed active set of s variables is a local minimizer of the given problem. Hence the number of local minimizers grows as fast as the binomial coefficient (n choose s), and this notion of optimality is thus of low practical usefulness.
In [3,4], the authors propose necessary conditions for global optimality that go beyond the concept of local minimum described above, thus allowing for possible changes to the structure of the support set and reducing the pool of optimal candidates. However, these conditions are either tailored to the "unconstrained case", or limited to moderate changes in the support, or involve hard operations, such as exact minimizations or projections onto nonconvex sets.
In order to introduce a general and affordable necessary optimality condition that also takes into account the combinatorial nature of the problem, we consider in our analysis the equivalent reformulation of problem (1.1) described in [11]:

    min f(x)  s.t.  x ∈ X,  e^T y ≥ n − s,  x_i y_i = 0 for all i = 1, ..., n,  y ∈ {0, 1}^n.    (2.1)

From here onwards, we will use the following notation. We define the support set of a vector z by I_1(z) = {i ∈ {1, ..., n} | z_i ≠ 0}, while its complement is defined by I_0(z) = {i ∈ {1, ..., n} | z_i = 0}. Moreover, we recall the concept of super support set [4]: an index set J ⊆ {1, ..., n} is a super support set for x if I_1(x) ⊆ J and |J| = s. We denote by z_I the subvector of z identified by the components contained in an index set I. We also denote by Π_C the orthogonal projection operator onto the closed convex set C. We notice that, given a feasible point (x, y) of problem (2.1), the components in I_0(y) give an active subspace for x, i.e., those components identify the subspace where the nonzero components of x lie. We thus have that I_1(x) ⊆ I_0(y).
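The index-set notation above can be made concrete with a short sketch (helper names are illustrative, not from the paper):

```python
import numpy as np

def I1(z, tol=1e-12):
    # Support set I_1(z): indices of the (numerically) nonzero components.
    return {i for i, zi in enumerate(np.asarray(z, dtype=float)) if abs(zi) > tol}

def I0(z, tol=1e-12):
    # Complement I_0(z): indices of the zero components.
    return set(range(len(z))) - I1(z, tol)

def is_super_support(J, x, s):
    # J is a super support set for x if I_1(x) is contained in J and |J| = s.
    return I1(x) <= set(J) and len(set(J)) == s
```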
Nonlinear mixed-integer programs have been characterized by exploiting the notion of neighborhood [21,24]. Given a feasible point (x, y), a discrete neighborhood N(x, y) is a set of feasible points that are close, to some extent, to (x, y) and that contains (x, y) itself. We introduce here an example of tailored neighborhood for problem (2.1) that can be implemented at a reasonable computational cost. Such a neighborhood will also help us relate our analysis to the other theoretical tools available in the literature.
We notice that this particular definition of neighborhood allows us to take into account the potential "change of status" of up to ρ variables in the vector ŷ defining an active subspace.
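For intuition, a neighborhood of this kind can be enumerated explicitly. The sketch below assumes the feasibility condition e^T ŷ ≥ n − s from (2.1) and changes the status of at most ρ components of y; exhaustive enumeration is for illustration only, since its cost grows combinatorially with ρ:

```python
import itertools
import numpy as np

def discrete_neighborhood(y, rho, s):
    # All binary vectors feasible for (2.1) (e^T y_hat >= n - s) obtained by
    # flipping at most rho components of y; includes y itself (r = 0).
    n = len(y)
    out = []
    for r in range(rho + 1):
        for idx in itertools.combinations(range(n), r):
            y_hat = np.array(y, dtype=int)
            y_hat[list(idx)] ^= 1  # flip the selected components
            if y_hat.sum() >= n - s:
                out.append(y_hat)
    return out
```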
Example 1. Consider problem (2.1) with n = 3 and s = 2, and let ρ = 2. Let (x, y) be a feasible point defined as follows. Now, a notion of local optimality for problem (2.1), depending on the neighborhood N(x, y), can be introduced: (x*, y*) is a local minimum point if there exists an ε > 0 such that f(x*) ≤ f(x) for all x ∈ X(y*) ∩ B(x*, ε), and for all (x̂, ŷ) ∈ N(x*, y*) it holds that f(x*) ≤ f(x̂). Note that in the above definition the continuous nature of the problem, expressed by the variables x, is taken into account by means of the standard ball B(x, ε). The given definition clearly depends on the choice of the discrete neighborhoods. A larger neighborhood N(x*, y*) should give a better local minimizer, but the computational effort needed to locate the solution may increase.
Inspired by the definition of local optimality for problem (2.1), we introduce a necessary condition of global optimality for problem (1.1) that allows us to take into account possible beneficial changes of the support and hence properly captures, from an applied point of view, the essence of the problem.
Such a condition relies on the use of stationary points of the continuous problems obtained by fixing the binary variables in problem (2.1): for a fixed ȳ ∈ Y, x̄ is a stationary point of the continuous problem

    min f(x)  s.t.  x ∈ X(ȳ),    (2.3)

where X(ȳ) = X ∩ {x ∈ R^n | x_i = 0 for all i ∈ I_1(ȳ)}. It is easy to see that the following result holds: Theorem 2.5. Let x* be a global minimum point of problem (1.1). Then x* is an N-stationary point.
We will show later in this work that the definition of N-stationarity allows us to recover, in a unified view, most of the known optimality conditions, provided that a suitable neighborhood N is employed.
In Definition 2.4 we generically refer to stationary points of problem (2.3), namely, to points satisfying suitable optimality conditions. Concerning the assumptions on the feasible set X(ȳ), we then distinguish two cases: (i) no constraint qualifications hold; (ii) constraint qualifications are satisfied and the usual KKT theory can be applied. In case (i), we will refer to the following definition of stationary point of problem (2.3). Definition 2.6. Given ȳ ∈ Y and x̄ ∈ X(ȳ), we say that x̄ is a stationary point of problem (2.3) if and only if x̄ = Π_{X(ȳ)}[x̄ − ∇f(x̄)]. We notice that X(ȳ) is a convex set when X is convex; the condition given above is then a classic stationarity condition for problem (2.3). Case (ii) will be considered later.
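The stationarity test of Definition 2.6 can be checked numerically as follows; the projection onto X(ȳ) is a user-supplied assumption:

```python
import numpy as np

def is_stationary(x, grad_x, project, tol=1e-8):
    # Definition 2.6 test: x is stationary for (2.3) iff
    # x = Pi_{X(y)}[x - grad f(x)]; `project` is a user-supplied projection
    # onto the closed convex set X(y).
    x = np.asarray(x, dtype=float)
    return bool(np.linalg.norm(x - project(x - np.asarray(grad_x))) <= tol)
```

For instance, with X = R^n, the projection Π_{X(ȳ)} simply zeroes the components indexed by I_1(ȳ).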
3. Algorithmic framework. Here, we discuss an algorithmic framework for the solution of problem (1.1) that exploits the reformulation given in problem (2.1). The proposed approach is related to classic methods for mixed-variable programming proposed in the literature (see, e.g., [21,24]). Roughly speaking, at each iteration the approach is based on the definition of a suitable neighborhood N(x^k, y^k) of the current point (x^k, y^k) and on exploratory moves with respect to the continuous variables around the points of the neighborhood.
The exploration move is a local search performed by an Armijo-type line search along the projected gradient direction. The procedure is formalized in Algorithm 3.1.
For any point (x̂^k, ŷ^k) ∈ N(x^k, y^k) that is not significantly worse (in terms of the objective value) than the current candidate, we perform a local continuous search around x̂^k; we skip to the following iteration as soon as a point providing a sufficient decrease of the objective value is found. The algorithm, which we refer to as Sparse Neighborhood Search (SNS), is formally defined in Algorithm 3.2.
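A hedged sketch of an Armijo-type line search along the projected gradient direction, in the spirit of Algorithm 3.1 (parameter names and default values are illustrative, not the paper's):

```python
import numpy as np

def projected_gradient_armijo(f, grad, project, x, gamma=1e-4, beta=0.5,
                              max_iter=200, tol=1e-8):
    # Repeatedly move along the projected gradient direction with Armijo
    # backtracking; `project` is the projection onto the convex set X(y).
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        d = project(x - grad(x)) - x            # projected gradient direction
        if np.linalg.norm(d) <= tol:            # stationary (Definition 2.6)
            break
        alpha = 1.0
        while (f(x + alpha * d) > f(x) + gamma * alpha * grad(x) @ d
               and alpha > 1e-12):
            alpha *= beta                       # Armijo backtracking
        x = x + alpha * d
    return x
```

On a convex quadratic over the nonnegative orthant, for example, the iteration converges to the projected minimizer.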

3.1. Convergence analysis.
In this section, we prove a set of results concerning the properties of the sequences produced by Algorithm 3.2. Note that in this section we employ the concept of stationarity (A.2). First, we state some suitable assumptions.
Assumption 1. The gradient ∇f(x) is Lipschitz-continuous, i.e., there exists a constant L > 0 such that ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^n.

Assumption 2. Given y^0 ∈ Y, x^0 ∈ X(y^0) and a scalar ξ > 0, the level set L(x^0, y^0) = {(x, y) feasible for (2.1) | f(x) ≤ f(x^0) + ξ} is compact.

The crucial point in the proposed framework is choosing suitable discrete neighborhoods. First, note that when we deal with both continuous and integer variables, the usual notion of convergence to a point needs to be tweaked. In particular, we have the following definition. Definition 3.1. A sequence {(x^k, y^k)} converges to a point (x̄, ȳ) if for any ε > 0 there exists an index k̄ such that for all k ≥ k̄ we have y^k = ȳ and ‖x^k − x̄‖ < ε.
To ensure convergence to meaningful points, we need a "continuity" assumption on the discrete neighborhoods we explore.
The assumption above is a mild continuity assumption on the discrete neighborhoods and is equivalent to the lower semicontinuity of a point-to-set map as defined in [5]. Next, we properly define the discrete neighborhood used in our algorithmic framework. A discrete neighborhood, by definition, is a set of feasible points. When X ⊂ R^n, zeroing variables may result in points that are not feasible. For this reason, we initially consider an easier version of problem (2.1) where X = R^n. In this case the neighborhood N_ρ defined in (2.2) contains feasible points. Moreover, it satisfies Assumption 3, as stated below.
If j ∉ J, we have lim_{k→∞} x̂^k_j = x̂_j. On the other hand, if j ∈ J, x̂^k_j = 0 and x̂_j = 0. Hence {x̂^k} converges to x̂, and we thus get the thesis.
To generalize the previous proposition to the case where X ⊂ R^n, we can replace each (x̃, ỹ) ∈ N_ρ(x, y) with the point (x̂, ŷ), where ŷ = ỹ and x̂ = Π_{X(ŷ)}(x̃). In other words, we first change the structure of the active set and then project the x part onto X(ŷ), which is a convex set. In the following, we will refer to this new discrete neighborhood as N_{Cρ}(x, y). Proposition 3.3. Let {(x^k, y^k)} be a sequence converging to (x̄, ȳ). Then the neighborhood N_{Cρ}(x̄, ȳ) satisfies Assumption 3.
Proof. The proof follows exactly as in Proposition 3.2, recalling the continuity of the projection operator Π_{X(ŷ)}.
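In the special case X = R^n, the construction of N_{Cρ} reduces to zeroing the components fixed by each candidate binary vector; a minimal sketch, assuming this case:

```python
import numpy as np

def neighborhood_points(x, y_candidates):
    # For each candidate binary vector y_hat, build the continuous part of
    # the neighborhood point as Pi_{X(y_hat)}(x). Here we assume X = R^n, so
    # the projection simply zeroes the components with y_hat_i = 1; for a
    # general convex X, the projection onto X(y_hat) would be used instead.
    pts = []
    for y_hat in y_candidates:
        x_hat = np.where(np.asarray(y_hat) == 1, 0.0, np.asarray(x, dtype=float))
        pts.append((x_hat, np.asarray(y_hat)))
    return pts
```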
Before turning to the convergence analysis of the algorithm, we prove a further useful preliminary result concerning the neighborhood N_ρ. Notice that this result can be easily extended to N_{Cρ}. In order to avoid an overly cumbersome notation, we will always refer to N_ρ from now on, even when dealing with additional constraints.
Lemma 3.4. Let y ∈ Y and x ∈ X(y) with δ = ‖x‖_0, and let us consider the set N(x). Proof. Let (x̂, ŷ) be any point in N(x). From the feasibility of (x, y), and from the definition of N(x), we obtain the corresponding bounds. Now, since I_0(y) ⊇ I_1(x) and I_0(ŷ) ⊇ I_1(x), it must be that I_0(y) ∩ I_0(ŷ) ⊇ I_1(x). We can now turn to I_1(y) ∩ I_1(ŷ). Since the latter set can be equivalently written, by De Morgan's laws, as {1, ..., n} \ (I_0(y) ∪ I_0(ŷ)), we can bound its cardinality, where the second-to-last inequality comes from (3.1) and (3.3). Putting everything back together in (3.2), and taking into account that ρ ≥ 2(s − δ) in the definition of N_ρ(x, y), we obtain (x̂, ŷ) ∈ N_ρ(x, y), thus getting the desired result.
We can now focus on the algorithms. First, we prove a property of Algorithm 3.1 that will play an important role in the convergence analysis of Algorithm 3.2. Proposition 3.5. Given a feasible point (x, y), Algorithm 3.1 produces a feasible point (x̄, y) satisfying a sufficient decrease condition involving the function σ(·). Proof. By the properties of the projection operator, we can write the corresponding inequality. By the instructions of the algorithm, either α = 1 or α < 1.
Applying the mean value theorem to equation (3.7), we get the corresponding expansion, where θ ∈ (0, 1). Adding and subtracting ∇f(x)^T d, rearranging, and then exploiting the Lipschitz continuity of ∇f(x), we obtain a bound on the directional term. Rearranging again, this last inequality, together with (3.4), yields a lower bound on the accepted step size, and substituting in equation (3.6) we finally get the desired estimate. This last inequality, together with (3.5), implies the thesis. We can now state a couple of preliminary theoretical results. We first show that Algorithm 3.2 is well posed.
Proposition 3.6.For each iteration k, Step 2 of Algorithm 3.2 terminates in a finite number of steps.
Proof. Suppose by contradiction that Steps 2.1-2.4 generate an infinite loop, so that an infinite sequence of points {x^j} is produced for which (3.8) holds. By Proposition 3.5, for each j we have the sufficient decrease condition (3.9), where σ(·) ≥ 0. The sequence {f(x^j)} is therefore nonincreasing. Moreover, (3.9) implies (3.10). By Assumption 2, {f(x^j)} is lower bounded. Therefore, recalling that {f(x^j)} is nonincreasing, we get that {f(x^j)} converges, which implies that f(x^{j+1}) − f(x^j) → 0. By (3.10), we get that σ(‖x^j − Π_{X(y)}[x^j − ∇f(x^j)]‖) → 0, and, by the properties of σ(·), we finally get that ‖x^j − Π_{X(y)}[x^j − ∇f(x^j)]‖ → 0, which contradicts (3.8).
The next proposition shows some properties of the sequences generated by the algorithm, which will play an important role in the subsequent analysis. Proposition 3.7. Let {(x^k, y^k)}, {µ_k} and {η_k} be the sequences produced by the algorithm. Then: (i) the sequence {f(x^k)} is nonincreasing and convergent; (ii) the sequence {(x^k, y^k)} is bounded. Proof.
(i) The instructions of the algorithm and Proposition 3.5 imply that {f(x^k)} is nonincreasing, and Assumption 2 implies that {f(x^k)} is lower bounded. Hence, {f(x^k)} converges. (ii) The instructions of the algorithm imply that each point (x^k, y^k) belongs to the level set L(x^0, y^0), which is compact by Assumption 2. Therefore, {(x^k, y^k)} is bounded. (iii) Suppose that K_u is finite. Then there exists k̄ > 0 such that all iterates with k > k̄ are successful, and η_k = η_{k−1} = η̄ > 0 for all k ≥ k̄. Since η̄ > 0, this implies that {f(x^k)} diverges to −∞, in contradiction with item (i). (iv) Since, for all k, µ_{k+1} = δµ_k, where δ ∈ (0, 1), the claim holds. (v) If k ∈ K_u, then η_{k+1} = θη_k, where θ ∈ (0, 1). Since K_u is infinite and θ ∈ (0, 1), it follows that η_k → 0.
By the instructions of the algorithm, f(x^{k+1}) ≤ f(x^k), and so we can write the corresponding chain of inequalities. Before stating the main theorem of this section, it is useful to summarize some theoretical properties of the subsequence {(x^k, y^k)}_{K_u} of the unsuccessful iterates. As the proof shows, the next proposition follows easily from the theoretical results shown above.
We can now prove the main theoretical result of this section.
Theorem 3.9. Let {(x^k, y^k)} be the sequence generated by Algorithm 3.2. Every accumulation point (x*, y*) of {(x^k, y^k)}_{K_u} is such that x* is an N-stationary point of problem (1.1).
Proof. Let (x*, y*) be an accumulation point of {(x^k, y^k)}_{K_u}. We must show that conditions (i)-(iii) of Definition 2.4 are satisfied.
(i) From the instructions of Algorithm 3.2, the iterates (x^k, y^k) belong to the set L(x^0, y^0), which is closed by Assumption 2. Any limit point (x*, y*) thus belongs to L(x^0, y^0) and is feasible for problem (2.1). (ii) The result follows from Proposition 3.7, item (vi). (iii) Since K_u is an infinite subset of unsuccessful iterations, the test at Step 3 fails at every iteration k ∈ K_u, and therefore f(x̂^k) ≥ f(x^k) − η_k for all (x̂^k, ŷ^k) ∈ N(x^k, y^k). Since the sequence {f(x^k)} is nonincreasing (Proposition 3.7, item (i)), we can write f(x̂^k) ≥ f(x^{k+1}) − η_k for all (x̂^k, ŷ^k) ∈ N(x^k, y^k). Taking limits, we get from Proposition 3.7, item (v), Proposition 3.2, and the continuity of f that f(x*) ≤ f(x̂) for all (x̂, ŷ) ∈ N(x*, y*). Now, note that item (i) of Proposition 3.7 ensures the existence of f* = lim_{k→∞} f(x^k). Consider any (x̂, ŷ) ∈ N(x*, y*) such that f(x̂) = f*. (3.12) Proposition 3.8 implies that (x̂, ŷ) is an accumulation point of a sequence {(x̂^k, ŷ^k)}_{K_u}, where (x̂^k, ŷ^k) ∈ N(x^k, y^k). Since k ∈ K_u, by (3.11) and (3.12) we get, for k sufficiently large, that the test at Step 3 fails, and we can write the corresponding inequality. Moreover, as the sequence {(x̂^k, ŷ^k)}_{K_u} converges to the point (x̂, ŷ), by (3.11), (3.12), (3.13), (3.14), and by item (v) of Proposition 3.7, we obtain the limit relation. By Proposition 3.5, we have the sufficient decrease inequality, which can be rewritten accordingly. Taking limits for k → ∞, k ∈ K_u, we finally get the stationarity condition of Definition 2.6 at (x̂, ŷ), and the claim holds.
In [4], the concept of basic feasibility (BF) introduced in [3] is extended to problem (1.1). Definition 3.10. A feasible point x* of problem (1.1) is referred to as basic feasible if, for any super support set J, letting y_J ∈ {0, 1}^n be such that y_i = 0 if i ∈ J and y_i = 1 otherwise, there exists L > 0 such that the corresponding fixed-support stationarity condition holds. Note that BF stationarity requires that, for any y_J defining a super support set, x* = Π_{X(y_J)}[x* + d] with d_J = −(1/L)∇_J f(x*) and d_j = 0 for j ∉ J, whereas the condition in Definition 2.6 requires x* = Π_{X(y_J)}[x* − ∇f(x*)]. In fact, in the case of our problem the two conditions are equivalent, as we show below. Lemma 3.11. Let y ∈ Y and x* ∈ X(y). Then x* satisfies x* = Π_{X(y)}[x* + d], where d_{I_0(y)} = −(1/L)∇_{I_0(y)} f(x*) and d_{I_1(y)} = 0, if and only if it satisfies x* = Π_{X(y)}[x* − ∇f(x*)]. Proof. Let us denote x̂_p = x* − ∇f(x*) and x̃_p = x* + d. Since both projected points belong to X(y), their components on I_1(y) are zero. From the well-known properties of the projection operator onto a convex set, we get the equivalence. We can hence show that, provided that N_ρ is employed as the neighborhood in Algorithm 3.2, with a sufficiently large value of ρ, the SNS procedure converges to basic feasible solutions.
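Basic feasibility quantifies over all super support sets; for small instances they can be enumerated explicitly. An exhaustive, illustrative sketch, assuming ‖x‖_0 ≤ s (enumeration is exponential in general):

```python
import itertools

def super_support_sets(x, s, tol=1e-12):
    # Enumerate all super support sets of x: index sets J containing supp(x)
    # with |J| = s. Assumes ||x||_0 <= s; exhaustive, for illustration only.
    n = len(x)
    supp = {i for i in range(n) if abs(x[i]) > tol}
    rest = [i for i in range(n) if i not in supp]
    for extra in itertools.combinations(rest, s - len(supp)):
        yield sorted(supp | set(extra))
```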
Theorem 3.12. Let {(x^k, y^k)} be the sequence of iterates generated by Algorithm 3.2 equipped with N_ρ as neighborhood, and let A* be the set of accumulation points of the subsequence {(x^k, y^k)}_{K_u} of unsuccessful iterates. If ρ ≥ 2(s − δ*) in the definition of the set N_ρ(x, y), where δ* = min{‖x*‖_0 | (x*, y*) ∈ A*}, then, given a point (x*, y*) ∈ A*, x* is basic feasible for problem (1.1).
Proof. Let J ⊂ {1, ..., n} be any super support set for x*, and consider the vector ŷ such that ŷ_j = 1 for all j ∉ J and ŷ_j = 0 otherwise. As |J| = s, we have e^T ŷ = n − s, and, taking into account that i ∉ J implies x*_i = 0 and i ∈ J implies ŷ_i = 0, it follows that I_1(x*) ⊆ I_0(ŷ) and (x*, ŷ) ∈ N(x*) ⊆ N_ρ(x*, y*), where we used Lemma 3.4. Taking into account Theorem 3.9, we finally get that x* is an N_ρ-stationary point of problem (1.1) and that it is also a stationary point of problem (2.3) with ȳ = ŷ. Then, by Lemma 3.11, recalling that ŷ_i = 0 if and only if i ∈ J, we obtain that x* is basic feasible.

4. Convergence results under constraint qualifications.
In this section, we show that, under constraint qualifications and by choosing suitable neighborhoods, it is possible to state convergence results similar to those considered in important works of the related literature [11,23]. Here, we assume that X = {x ∈ R^n | g(x) ≤ 0, h(x) = 0}, where the h_i, i = 1, ..., p, are affine functions and the g_i, i = 1, ..., m, are convex functions. First we state the following assumption, which implicitly involves constraint qualifications: x̄ is a stationary point of problem (2.3) if and only if there exist multipliers λ ∈ R^m with λ ≥ 0, µ ∈ R^p and γ ∈ R^n such that

    ∇f(x̄) + Σ_{i=1}^m λ_i ∇g_i(x̄) + Σ_{j=1}^p µ_j ∇h_j(x̄) + Σ_{i : ȳ_i = 1} γ_i e_i = 0,

with λ_i g_i(x̄) = 0 for all i = 1, ..., m. The above assumption states that x̄ is a stationary point of problem (2.3) if and only if it is a KKT point of the following problem, which can be equivalently rewritten as follows. Remark 4.1. As shown in Appendix A, Assumption 4 holds when, e.g., the functions g_i are strongly convex with constants µ_i > 0, for i = 1, ..., m, the functions h_j, for j = 1, ..., p, are affine, and some Cardinality Constraint-Constraint Qualification (CC-CQ) is satisfied. For instance, a standard CC-CQ is the Cardinality Constraint-Linear Independence Constraint Qualification (CC-LICQ), requiring that the gradients ∇g_i(x̄) for all i : g_i(x̄) = 0, ∇h_i(x̄) for all i = 1, ..., p, and e_i for all i : ȳ_i = 1, are linearly independent.
From Theorem 3.9 and Assumption 4 we immediately get the following result.
Theorem 4.2. Let {(x^k, y^k)} be the sequence generated by Algorithm 3.2. Every accumulation point (x*, y*) of the sequence of unsuccessful iterates satisfies condition (4.1). Remark 4.3. Condition (4.1) is the S-stationarity concept introduced in [11]. Basically, the limit points of the sequence {(x^k, y^k)}_{K_u} produced by Algorithm 3.2 are always guaranteed to be S-stationary. This implies, by the results in [11], that x* is also Mordukhovich-stationary for problem (1.1). In fact, under Assumption 4, it is easy to see that N-stationarity is a stronger condition than M-stationarity, from points (i)-(ii) of Definition 2.4.
In order to state stronger convergence results, we need to use suitable neighborhoods (e.g., N_ρ with a sufficiently large value of ρ) in the algorithm. Theorem 4.4. Let {(x^k, y^k)} be the sequence generated by Algorithm 3.2 equipped with N_ρ as neighborhood, and let A* be the set of accumulation points of the subsequence {(x^k, y^k)}_{K_u} of unsuccessful iterates. If ρ ≥ 2(s − δ*) in the definition of the set N_ρ(x, y), where δ* = min{‖x*‖_0 | (x*, y*) ∈ A*}, then, given a point (x*, y*) ∈ A*, for every super support set J ⊂ {1, ..., n} there exist multipliers such that (4.2) holds. Proof. Let J ⊂ {1, ..., n} be any super support set for x*, and consider the vector ŷ such that ŷ_j = 1 for all j ∉ J and ŷ_j = 0 otherwise. As |J| = s, we have e^T ŷ = n − s, and, taking into account that i ∉ J implies x*_i = 0 and i ∈ J implies ŷ_i = 0, it follows that I_1(x*) ⊆ I_0(ŷ) and (x*, ŷ) ∈ N(x*) ⊆ N_ρ(x*, y*), where we used Lemma 3.4. Taking into account Theorem 3.9, we finally get that x* is an N_ρ-stationary point of problem (1.1) and that it is also a stationary point of problem (2.3) with ȳ = ŷ. Then, by Assumption 4, recalling that ŷ_i = 0 if and only if i ∈ J, we obtain that (4.2) holds.
Remark 4.5. Condition (4.2) is the necessary optimality condition first defined in [23]. It is interesting to note that the Penalty Decomposition algorithm proposed in the referenced work is in fact not guaranteed to converge to a point satisfying such conditions, which are guaranteed to hold only if the limit point has full support. In the general case, the PD method generates points satisfying (4.2) for at least one super support set. Our SNS algorithm would have exactly the same convergence guarantees if we used the neighborhood that checks all the super support sets at the current iterate x^k; however, such a neighborhood does not satisfy the continuity Assumption 3, hence failing to guarantee that condition (4.2) is satisfied for all super support sets at the limit point.

5. Numerical Experiments.
From a computational point of view, we are particularly interested in studying two relevant aspects. Specifically, we want to: • analyze the benefits and the costs of increasing the size of the neighborhood; • assess the performance of the proposed approach, compared with the Greedy Sparse-Simplex (GSS) method proposed in [3] and the Penalty Decomposition (PD) approach [23]. To these aims, we considered the problem of sparse logistic regression, where the objective function is continuously differentiable and convex, but the solution of the problem for a fixed support set requires the adoption of an iterative method. Note that we preferred to consider a problem with no constraints other than the sparsity one, in order to simplify the analysis of the behavior of the proposed algorithm.
The problem of sparse logistic regression [19] has important applications, for instance, in machine learning [2,30]. Given a dataset of N samples {z^1, ..., z^N} with n features and N corresponding labels {t^1, ..., t^N} belonging to {−1, 1}, the problem of sparse maximum likelihood estimation of a logistic regression model can be formulated as follows:

    min_w (1/N) Σ_{i=1}^N log(1 + exp(−t^i (w^T z^i)))  s.t.  ‖w‖_0 ≤ s.    (5.1)

The benchmark for this experiment is made up of problems of the form (5.1), obtained as described hereafter. We employed 6 binary classification datasets, listed in Table 1. All the datasets are from the UCI Machine Learning Repository [17]. For each dataset, we removed data points with missing variables; moreover, we one-hot encoded the categorical variables and standardized the other ones to zero mean and unit standard deviation. For every dataset, we chose different values of s, as specified later in this section.

5.1. Implementation details. Algorithms SNS, PD and GSS have been implemented in Python 3.7, mainly exploiting the libraries numpy and scipy. The convex subproblems of both PD and GSS have been solved up to global optimality by using the L-BFGS algorithm (in the implementation from [22], provided by scipy). We also employed L-BFGS for the local optimization steps in SNS. All algorithms start from the feasible initial point x^0 = 0 ∈ R^n. For the PD algorithm, we set the starting penalty parameter to 1 and its growth rate to 1.05; the algorithm stops when ‖x^k − y^k‖ < 10^{-4}, as suggested in [23]. As for the GSS, we stop the algorithm as soon as ‖x^{k+1} − x^k‖ ≤ 10^{-4}.
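A minimal sketch of the objective of (5.1) and its gradient (intercept omitted for brevity, names illustrative; this is not the authors' implementation):

```python
import numpy as np

def logistic_loss(w, Z, t):
    # Average negative log-likelihood of the logistic model in (5.1);
    # Z is the N x n sample matrix, t the label vector in {-1, 1}^N.
    margins = t * (Z @ w)
    return float(np.mean(np.log1p(np.exp(-margins))))

def logistic_grad(w, Z, t):
    # Gradient of the loss above with respect to w.
    margins = t * (Z @ w)
    coeff = -t / (1.0 + np.exp(margins))
    return (Z.T @ coeff) / len(t)
```

At w = 0 the loss equals log 2 regardless of the data, a convenient sanity check.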
Concerning our proposed Algorithm 3.2, the parameters have been set as follows. As for µ_0 and δ, we actually keep the value of µ fixed to 10^{-6}. We again employ the stopping criterion ‖x^{k+1} − x^k‖ ≤ 10^{-4}.
For all the algorithms, we have also set a time limit of 10^4 seconds. All the experiments have been carried out on an Intel(R) Xeon E5-2430 v2 @2.50GHz CPU machine with 6 physical cores (12 threads) and 16 GB RAM.
As a benchmark for our experiments, we considered 18 problems, obtained from the 6 datasets in Table 1 by setting s to 3, 5 and 8 in (5.1). For SNS and GSS we consider the computational time employed to find the best solution. We take into account four versions of Algorithm 3.2, with neighborhood radius ρ ∈ {1, 2, 3, 4}.
In Figure 1 we show the performance profiles [16] w.r.t. the objective function values and the runtimes (intended as the time to find the best solution) attained by the different algorithms. We do not report the runtime profile of SNS(1), since it is much faster than all the other methods and would thus dominate the plot, making it poorly informative; its speed is, however, outweighed by the very poor quality of its solutions. We can observe that increasing the size of the neighborhood consistently leads to higher quality solutions, even though the computational cost grows. We can see that SNS (with a sufficiently large neighborhood) performs better than the other algorithms from the literature; in particular, while the neighborhood radius ρ = 1 only allows forward selection, with poor outcomes, ρ ≥ 2 makes swap operations possible, with a significant impact on the exploration capabilities. The GSS attains worse solution quality than SNS(2), which is reasonable, since its move set is smaller and optimization is always carried out w.r.t. a single variable rather than the entire active set. However, it proved to also be slower than SNS, mostly for two reasons: it always tries all feasible moves, not necessarily accepting the first one that provides an objective decrease, and it requires many more iterations to converge, since it considers one variable at a time. Finally, the PD method appears not to be competitive from either point of view: it is slow to converge to a feasible point and it has essentially no global optimization features that could guide it to globally good solutions. It is interesting to remark that considering larger neighborhoods appears to be particularly useful in problems where the sparsity constraint is less strict and thus combinatorially more challenging. As an example, we show the runtime/objective trade-off for the breast, spam and a2a problems for s = 3 and s = 8 in Figure 2.
We can observe that for s = 3, SNS finds good, similar solutions for ρ = 2, 3 or 4, with a similar computational cost. On the other hand, as s grows to 8, using ρ = 4 allows us to significantly improve the quality of the solution without a significant increase in runtime.

Fig. 2. Quality/cost trade-off for the algorithms on sparse logistic regression problems from datasets breast, spam and a2a.

6. Conclusions.
In this paper we have analyzed sparsity constrained optimization problems. For this class of problems, we have defined a necessary optimality condition, namely N-stationarity, exploiting the concept of discrete neighborhood associated with a well-known mixed-integer equivalent reformulation, which allows us to take into account potentially advantageous changes to the set of active variables.
We have then proposed an algorithmic framework to tackle the family of problems under analysis. Our SNS method alternates continuous local search steps and neighborhood exploration steps; the algorithm is proved to produce a sequence of iterates whose cluster points are N-stationary. Moreover, we proved that, by suitably employing a tailored neighborhood, the limit points also satisfy other optimality conditions from the literature, based on both gradient projection and Lagrange multipliers, thus providing stronger optimality guarantees than other state-of-the-art approaches.
Finally, we studied the features and the benefits of our proposed procedure from a computational perspective. Specifically, we compared the performance of SNS as the size of the neighborhood increases, observing that wider neighborhoods consistently provide higher quality solutions with a reasonable increase of the computational cost, especially when the required cardinality is not that small. Moreover, when comparing SNS with the Penalty Decomposition method and the Greedy Sparse-Simplex method, we observed that our method has higher exploration capability, thus obtaining a nice match between theory and practice, and that it is affordable in terms of computational cost, being even faster than the other considered methods.

Appendix A. This implies that the system

    ∇f(x*)^T d < 0,  ∇g_i(x*)^T d < 0 for i : g_i(x*) = 0,  ∇h_i(x*)^T d = 0 for i = 1, ..., p

does not admit a solution. By Motzkin's theorem we get that x* satisfies the Fritz John conditions and hence, by assuming a constraint qualification, the thesis is proved.
Condition (ii) of Proposition A.3 holds by assuming that the convex functions g_i, for i = 1, ..., m, satisfy condition (A.6) with γ > 0. Indeed, in this case it is easy to see that a direction d is a feasible direction at x* if and only if ∇g_i(x*)^T d < 0 for i : g_i(x*) = 0 and ∇h_j(x*)^T d = 0 for j = 1, ..., p. Condition (A.6) is satisfied by assuming that the functions g_i are twice continuously differentiable with positive definite Hessian matrices.
Condition (A.6) also holds for continuously differentiable functions g_i that are strongly convex with constants µ_i > 0, i.e., such that, for i = 1, ..., m,

    g_i(y) ≥ g_i(x) + ∇g_i(x)^T (y − x) + (µ_i/2)‖y − x‖^2  for all x, y.

Table 1. List of datasets used for experiments on sparse logistic regression.
Fig. 1. Performance profiles for the considered algorithms on 18 sparse logistic regression problems.