Sparse optimization via vector k-norm and DC programming with an application to feature selection for support vector machines

Sparse optimization is about finding minimizers of functions characterized by a number of nonzero components as small as possible, a paradigm of great practical relevance in Machine Learning, particularly in classification approaches based on support vector machines. By exploiting some properties of the k-norm of a vector, namely, of the sum of its k largest absolute-value components, we formulate a sparse optimization problem as a mixed-integer nonlinear program, whose continuous relaxation is equivalent to the unconstrained minimization of a difference-of-convex function. The approach is applied to Feature Selection in the support vector machine framework, and tested on a set of benchmark instances. Numerical comparisons against both the standard ℓ1-based support vector machine and a simple version of the Slope method are presented, demonstrating the effectiveness of our approach in achieving highly sparse solutions without impairing test-correctness.


Introduction
The sparse counterpart of a mathematical program is aimed at finding optimal solutions that have as few nonzero components as possible, a feature that plays a significant role in several applications, mainly (but not solely) arising in the broad fields of data science and machine learning. Indeed, this is one of the reasons why sparse optimization has recently become a topic of great interest in the field of mathematical optimization.
Accounting for sparsity in a mathematical program amounts to embedding into its formulation a measure dependent on the number of zero components of a solution vector. The mathematical tool that is most naturally suited to such a task is the ℓ0-(pseudo-)norm ∥·∥_0, often simply referred to as the ℓ0-norm, defined as the number of nonzero components of a vector; strictly speaking, it is thus a measure of vector density rather than sparsity. A sparse optimization problem should therefore encompass ℓ0-norm minimization in order to achieve the largest possible sparsity. Such a feature highlights the intrinsic bi-objective nature of sparse optimization, which makes it fit two commonplace single-objective formulations, the ℓ0-regularization problem and the cardinality-constrained problem, depending on whether one aims at pushing for sparsity or at ensuring an assigned sparsity level, respectively.
Given a real-valued function f : ℝⁿ → ℝ, with n ≥ 2, the (unconstrained) ℓ0-regularization problem has the following structure
$$ \min_{x \in \mathbb{R}^n} \; f(x) + \sigma \|x\|_0, \qquad \text{(SOP)} $$
where the ℓ0-norm penalty term inside the objective function pushes towards sparse solutions, and a fixed penalty parameter σ > 0 settles the trade-off between the two (possibly) conflicting objectives. Problem (SOP) has a nonconvex and discontinuous nature (see, e.g., [1] for a study on complexity issues), which makes it unfit to be faced by means of standard nonlinear optimization approaches. In fact, it has usually been tackled by replacing the ℓ0-norm with the ℓ1-norm, thus obtaining an optimization problem, the ℓ1-regularization one (SOP₁), whose tractability rather depends on the properties of f, the ℓ1 term being a convex function. It has been proved (see [16, 17, 21]) that, under appropriate conditions, the solutions of (SOP) and (SOP₁) coincide. Nevertheless, in many practical applications the equivalence conditions do not hold true and the solutions obtained from the ℓ1 minimization problem are less sparse than those of the ℓ0 problem. A different viewpoint in dealing with the bi-objective nature of sparse optimization is given by the cardinality-constrained formulation
$$ \min \; \{ f(x) : \|x\|_0 \le t \}, $$
where sparsity is ensured by forcing the ℓ0-norm to be not larger than a given integer t ∈ {1, …, n − 1}, thus obtaining minimizers with at least n − t zero components. Such a problem has been studied in depth in recent years due to its possible application in areas of relevant interest such as compressed sensing [21] and portfolio selection [10]. We refer the reader to [37] for an up-to-date survey. Here, we only mention [4], on the theoretical side, where necessary optimality conditions have been analyzed, and [15, 24], on the methodological side, where reformulations as mathematical programs with complementarity constraints have been proposed.
In this paper we focus on sparse optimization as an effective way to deal with Feature Selection (FS) in Machine Learning (ML), see [32], where the problem is to detect the sample parameters that are really significant in applications such as unsupervised learning (see [22]) and regression (see [45]). In particular, we restrict our attention to supervised binary classification, an area where intensive research activity has been performed in the last decades, mainly since the advent of the Support Vector Machine (SVM) as the classification methodology of choice, see [19, 47].
Most of the applications of sparse optimization to feature selection fall into two classes of methods: those where the ℓ0-norm has been approximated by means of appropriate concave functions (see, e.g., [13, 14, 42, 49]) and those where sparsity of the solution has been enforced by resorting to the definition of appropriate sets of binary variables, giving rise to Mixed Integer Linear (MILP) or even Mixed Integer Nonlinear (MINLP) problems. In fact, it is natural to associate to any continuous variable an appropriate binary one according to its property of being zero or nonzero. Mixed integer reformulations have been widely adopted in relevant areas of Machine Learning such as classification, logistic regression, medical scoring, etc. (see, e.g., [7, 8, 20, 43, 46] and the references therein). As for the SVM literature, we cite here [6, 29, 39].
The model we propose is inspired by the possibility of connecting the ℓ0-norm to the vector k-norm, which is defined as the sum of the k largest absolute-value components of any vector, and is a polyhedral norm, thus relatively easy to treat by standard tools of convex nonsmooth optimization. Properties of the vector k-norm have been investigated in several papers (see, e.g., [30, 41, 50]). We particularly recall the applications to matrix completion, see [40], and to linear system approximation, see [48], as a possible alternative to the use of the classic ℓ1, ℓ2 and ℓ∞ norms.
More recently, the vector k-norm has been proved to be an effective tool for dealing with sparse optimization, see [31], also with specific applications to feature selection in the SVM framework, see [28]. The use of the k-norm is somehow evoked also in the trimmed LASSO model described in [9].
The novelty of this paper is the development of a continuous k-norm-based model which is obtained as the continuous relaxation of a MINLP problem. The objective function turns out to be nonconvex, in particular of the Difference-of-Convex (DC) type; thus our approach is closer to the trimmed LASSO [9] than to the SLOPE method [5, 11]. Other DC-like approaches are presented in [23, 38, 52].
The remainder of the article is organized as follows. In Sect. 2 we summarize the properties of vector k-norms, and we introduce a couple of MINLP reformulations of the ℓ0-regularization problem (SOP). In Sect. 3 we consider the continuous relaxations of both such reformulations, and in Sect. 4 we focus on the one based on the vector k-norm, which consists of a DC (Difference of Convex) optimization problem. In Sect. 5 we cast the SVM-based Feature Selection problem into our sparse optimization setting. Finally, in Sect. 6 we report on the computational experience, obtained on some benchmark datasets for binary classification, with both relaxed MINLP reformulations, highlighting the role of k-norms in increasing the sparsity level of the solutions.
Notation. Vectors are represented in lower-case bold letters, with e and 0 representing vectors with all elements equal to one and to zero, respectively. The inner product of vectors x and y is denoted by x⊤y. Given any vector in ℝⁿ, say x = (x₁, …, xₙ)⊤, we denote the ℓp-norm, with 1 ≤ p < +∞, and the ℓ∞-norm, respectively, by
$$ \|x\|_p = \Big( \sum_{j=1}^{n} |x_j|^p \Big)^{1/p} \quad \text{and} \quad \|x\|_\infty = \max_{1 \le j \le n} |x_j|. $$
Furthermore, we recall the definition of the ℓ0-norm,
$$ \|x\|_0 = \big| \{ j \in \{1, \ldots, n\} : x_j \ne 0 \} \big|, $$
and some of its relevant properties; in particular, ∥·∥_0 is lower semicontinuous.

MINLP-based formulations of the ℓ0-regularization problem
Introducing an n-dimensional binary decision-vector z, associated with x, whose kth component z_k is set to one if |x_k| > 0 and set to zero otherwise, an obvious and classic reformulation of problem (SOP), referred to as (MISOP), is the following:
$$ \min_{x,z} \; f(x) + \sigma e^\top z \qquad (1) $$
$$ \text{s.t.} \quad -M z_k \le x_k \le M z_k, \quad k = 1, \ldots, n, \qquad (2)\text{--}(3) $$
$$ \phantom{\text{s.t.} \quad} z \in \{0,1\}^n, \qquad (4) $$
where the positive "big" M parameter denotes a uniform bound on the absolute value of any single component of x. It is easy to verify that any minimizer (x*, z*) of (MISOP) is such that z*_k = 0 whenever x*_k = 0 and, consequently, the term e⊤z* actually represents the ℓ0-norm of x*. It is worth observing that an ideal threshold for M is M* ≜ ∥x*∥_∞, where x* is any global minimum of (SOP), as indeed any M ≥ M* guarantees the equivalence of (SOP) and (MISOP).
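To make the big-M reformulation concrete, the following brute-force sketch (toy, hypothetical data, not from the paper) checks on a discretized two-dimensional example that (MISOP) and (SOP) attain the same optimal value.

```python
import itertools
import numpy as np

# Toy illustration of (SOP) vs its big-M reformulation (MISOP); data are hypothetical.
c, sigma, M = np.array([3.0, 0.0]), 1.0, 10.0
f = lambda x: np.abs(x - c).sum()                      # a simple convex loss
grid = [np.array([a, b]) for a in np.linspace(-5, 5, 21)
                         for b in np.linspace(-5, 5, 21)]

# (SOP): minimize f(x) + sigma * ||x||_0 over the grid
sop_val = min(f(x) + sigma * np.count_nonzero(x) for x in grid)

# (MISOP): minimize f(x) + sigma * e'z  s.t.  -M z_k <= x_k <= M z_k,  z binary
misop_val = min(f(x) + sigma * sum(z)
                for x in grid
                for z in itertools.product((0, 1), repeat=2)
                if all(abs(x[k]) <= M * z[k] for k in range(2)))

print(sop_val, misop_val)   # identical optimal values on this toy instance
```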
In view of introducing an alternative MINLP-based reformulation of (SOP), we first recall, for k ∈ {1, …, n}, the polyhedral vector k-norm ∥x∥_[k], defined as the sum of the k largest unsigned components of x. In particular, given x = (x₁, …, xₙ)⊤, and denoting by j₁, …, jₙ a reordering of the indices such that
$$ |x_{j_1}| \ge |x_{j_2}| \ge \cdots \ge |x_{j_n}|, $$
the vector k-norm of x can be expressed as
$$ \|x\|_{[k]} = \sum_{s=1}^{k} |x_{j_s}|. $$
Moreover, it is easy to see that ∥·∥_[k] fulfills the following properties: it is a (polyhedral) norm, it is non-decreasing in k, ∥x∥_[1] = ∥x∥_∞, ∥x∥_[n] = ∥x∥_1, and
$$ \|x\|_0 \le k \;\Longleftrightarrow\; \|x\|_1 - \|x\|_{[k]} = 0. \qquad (9) $$
Now, focusing in particular on the equivalence (9), and introducing an n-dimensional binary decision-vector y, whose kth component y_k is set to one if ∥x∥_1 − ∥x∥_[k] > 0 and set to zero otherwise, it is possible to state yet another mixed binary reformulation of problem (SOP), referred to as (MIkSOP):
$$ \min_{x,y} \; f(x) + \sigma e^\top y \qquad (10) $$
$$ \text{s.t.} \quad \|x\|_1 - \|x\|_{[k]} \le \bar{M} y_k, \quad k = 1, \ldots, n, \qquad (11) $$
$$ \phantom{\text{s.t.} \quad} y \in \{0,1\}^n, \qquad (12) $$
where the sufficiently large parameter M̄ > 0 has been introduced. An equivalence between (MIkSOP) and (SOP) can be obtained in the constrained case where x is restricted to stay in a compact set S ⊂ ℝⁿ. In fact, letting D ≜ max_{x∈S} ∥x∥_1, and observing that ∥x∥_1 − ∥x∥_[k] ≤ (1 − 1/n)∥x∥_1 for every k ∈ {1, …, n}, any M̄ ≥ (1 − 1/n)D ensures the equivalence. Unlike the (MISOP) formulation (1)–(4), the (MIkSOP) formulation (10)–(12) is characterized by the presence of a set of nonconvex constraints (11), whose left-hand sides are, in particular, DC functions. Furthermore, as for the objective function, we remark that the only feasible solutions of (MIkSOP) which are candidates to be optimal are those (x, y) for which y_k = 1 exactly when ∥x∥_1 − ∥x∥_[k] > 0, for every k ∈ {1, …, n}. Observing that at any such solution it is necessarily y_n = 0 and that, provided x ≠ 0, there holds e⊤y = ∥x∥_0 − 1, it follows that if (x*, y*) is optimal for problem (MIkSOP), then x* is optimal for problem (SOP) as well.
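As a small illustration of the k-norm machinery underlying (9), the following numpy sketch (illustrative only, with a hypothetical helper name) computes ∥x∥_[k] and shows that the gap ∥x∥_1 − ∥x∥_[k] vanishes exactly from k = ∥x∥_0 onwards.

```python
import numpy as np

def knorm(x, k):
    """Vector k-norm: sum of the k largest absolute-value components of x."""
    return np.sort(np.abs(x))[::-1][:k].sum()

x = np.array([0.0, -3.0, 1.5, 0.0, 2.0])              # ||x||_0 = 3

# ||x||_[1] = ||x||_inf  and  ||x||_[n] = ||x||_1
assert np.isclose(knorm(x, 1), np.linalg.norm(x, np.inf))
assert np.isclose(knorm(x, x.size), np.linalg.norm(x, 1))

# ||x||_0 <= k  iff  ||x||_1 - ||x||_[k] = 0: here the gap vanishes from k = 3 onwards
for k in range(1, x.size + 1):
    print(k, np.linalg.norm(x, 1) - knorm(x, k))
```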
In the sequel, we analyze the continuous relaxations of the MINLP-based formulations (MISOP) and (MIkSOP), assuming in particular that f is a convex function, not necessarily differentiable.

Continuous relaxation of MINLP-based SOP formulations
We start by studying the continuous relaxation (RSOP) of (MISOP), obtained by replacing the binary constraints z ∈ {0,1}ⁿ with 0 ≤ z ≤ 1. A simple contradiction argument ensures that any minimizer (x̄, z̄) of (RSOP) satisfies with equality at least one constraint of the pair (2)–(3), that is, z̄_k = |x̄_k|/M for every k, and the relaxed problem (RSOP) can be written as
$$ \min_{x} \; f(x) + \frac{\sigma}{M} \|x\|_1. \qquad \text{(RSOP)} $$
In other words, the continuous relaxation of (MISOP) just provides the ℓ1-regularization of function f, often referred to as the LASSO model, a convex, possibly nonsmooth program due to the assumption made on f (for a discussion on LASSO and possible variants see, e.g., [52]).
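As an illustration of the resulting LASSO-type problem, for the particular (hypothetical, not from the paper) choice f(x) = ½∥x − c∥², the minimizer of (RSOP) is available in closed form via component-wise soft-thresholding; a minimal sketch:

```python
import numpy as np

def soft_threshold(c, tau):
    """Componentwise minimizer of 0.5*(x - c)**2 + tau*|x| (LASSO with a quadratic loss)."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

c = np.array([3.0, -0.4, 0.05])
print(soft_threshold(c, tau=0.5))   # small components are driven exactly to zero
```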
Consider now the continuous relaxation (RkSOP) of (MIkSOP), where 0 ≤ y ≤ 1 replaces y ∈ {0,1}ⁿ. Note that any minimizer (x̄, ȳ) of (RkSOP) satisfies with equality all the constraints (11), since if this were not the case for some k, the corresponding ȳ_k could be reduced, leading to a reduction in the objective function without impairing feasibility. Consequently, every continuous variable y_k, with k ∈ {1, …, n}, is such that
$$ y_k = \frac{\|x\|_1 - \|x\|_{[k]}}{\bar{M}}, $$
implying that
$$ e^\top y = \frac{1}{\bar{M}} \Big( n\|x\|_1 - \sum_{k=1}^{n} \|x\|_{[k]} \Big). $$
As a consequence, (RkSOP) can be stated as
$$ \min_{x} \; f(x) + \frac{\sigma}{\bar{M}} \Big( n\|x\|_1 - \sum_{k=1}^{n} \|x\|_{[k]} \Big). $$
We observe that, recalling (7), the sum of the k-norms of x can be written as
$$ \sum_{k=1}^{n} \|x\|_{[k]} = \sum_{s=1}^{n} (n - s + 1)\, |x_{j_s}|. $$
Hence, taking into account the definition of ∥x∥_1, we obtain that
$$ n\|x\|_1 - \sum_{k=1}^{n} \|x\|_{[k]} = \sum_{s=1}^{n} (s - 1)\, |x_{j_s}| $$
and, consequently, we come out with the problem
$$ \min_{x} \; f(x) + \frac{\sigma}{\bar{M}} \sum_{s=1}^{n} (s - 1)\, |x_{j_s}|. \qquad (16) $$
The latter formula highlights that increasing weights are assigned to decreasing absolute values of the components of x, i.e., the smaller the absolute value, the bigger its weight; this somehow indicates a kind of preference towards reduction of the small components, which is, in turn, in favor of reduction of the ℓ0-norm of x.
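The penalty in (16) only depends on the sorted absolute values of x; the following sketch (plain numpy, hypothetical helper name) shows how the larger weights fall on the smaller components.

```python
import numpy as np

def rksop_penalty(x, sigma=1.0, M_bar=1.0):
    """Penalty term of (16): (sigma/M_bar) * sum_s (s-1) * |x_{j_s}|,
    with |x_{j_1}| >= ... >= |x_{j_n}| the sorted absolute values."""
    a = np.sort(np.abs(x))[::-1]           # non-increasing absolute values
    weights = np.arange(a.size)            # 0, 1, ..., n-1
    return sigma / M_bar * weights @ a

x = np.array([5.0, -0.1, 0.02, 0.0])
print(rksop_penalty(x))   # 0*5 + 1*0.1 + 2*0.02 + 3*0 = 0.14: small entries dominate
```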
Remark 1 Formula (16) clarifies the differences between our approach and the SLOPE method (see [5, 11]), the trimmed LASSO method (see [9]), the approach proposed in [28] and the ℓ1−2 method [51]. In SLOPE, a reverse ordering of the weights with respect to our model is adopted, as the components |x_{j_s}| are associated with non-increasing penalty weights. In fact, the resulting convex program is
$$ \min_{x} \; f(x) + \sum_{s=1}^{n} \lambda_s |x_{j_s}|, \qquad (17) $$
with λ₁ ≥ λ₂ ≥ ⋯ ≥ λₙ ≥ 0. Given any σ > 0, a possible choice of the parameters is λ_s = σ for s ∈ {1, …, t} and λ_s = 0 otherwise, for some sparsity parameter t ∈ {1, …, n − 1}. Under such a choice problem (17) becomes the following convex penalized k-norm model:
$$ \min_{x} \; f(x) + \sigma \|x\|_{[t]}. $$
In trimmed LASSO, given a sparsity parameter t ∈ {1, …, n − 1}, the following nonconvex problem is addressed
$$ \min_{x} \; f(x) + \sigma \big( \|x\|_1 - \|x\|_{[t]} \big) = \min_{x} \; f(x) + \sigma \sum_{s=t+1}^{n} |x_{j_s}|, $$
where, unlike our approach, the components of x with the highest absolute values are not penalized at all, whereas the (n − t) smallest ones are equally penalized. As for the approach presented in [28], it can be proved that, for a given penalty parameter σ > 0 and letting J_σ(x) ≜ {j ∈ {1, …, n} : |x_j| > 1/σ}, the proposed DC (hence nonconvex) program can be formulated in terms of the index set J_σ(x), and it tends to (SOP) as σ → ∞. Finally, we remark the substantial difference between the method we propose and the ℓ1−2 approach [51], where the ℓ0 pseudo-norm is approximated by the DC model ∥x∥_1 − ∥x∥_2.

A better insight into the differences between the two formulations (RSOP) and (RkSOP) can be gathered by an appropriate setting of the constants M and M̄. Since the former refers to single vector components and the latter to norms, it is natural to set M̄ = nM. With such a choice the objective function F(·) of (RkSOP) becomes
$$ F(x) = f(x) + \frac{\sigma}{M}\|x\|_1 - \frac{\sigma}{nM} \sum_{k=1}^{n} \|x\|_{[k]}. $$
We observe that function F(·) is the difference of two convex functions, namely, F(x) = f_1(x) − f_2(x), with
$$ f_1(x) = f(x) + \frac{\sigma}{M}\|x\|_1 \quad \text{and} \quad f_2(x) = \frac{\sigma}{nM} \sum_{k=1}^{n} \|x\|_{[k]}. \qquad (24) $$
As a consequence, (RkSOP) can be handled by resorting to both the theoretical and the algorithmic tools of DC programming (see, e.g., [2, 33, 44]). Moreover, we observe that f_1 is exactly the objective function of (RSOP), thus it is convex. Summing up, f_1(x) majorizes the (nonconvex) F(x), with f_2(x) being the nonnegative gap, whose value depends on the sum of the k-norms.
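A small sketch of the DC decomposition under the choice M̄ = nM (function names and toy data are hypothetical): f₁ is the (RSOP) objective and f₂ collects the sum of the k-norms, computed here via a cumulative sum of the sorted absolute values.

```python
import numpy as np

def f2(x, sigma, M):
    """Gap term f2(x) = (sigma/(n*M)) * sum_{k=1}^n ||x||_[k];
    the cumulative sums of the sorted absolute values are exactly the k-norms."""
    a = np.sort(np.abs(x))[::-1]
    return sigma / (a.size * M) * np.cumsum(a).sum()

def F(x, f, sigma, M):
    """DC objective of (RkSOP): F = f1 - f2, with f1 the (RSOP) objective."""
    f1 = f(x) + sigma / M * np.abs(x).sum()
    return f1 - f2(x, sigma, M)

# toy check with a hypothetical convex loss
f = lambda x: np.sum((x - np.array([1.0, 0.0, 2.0])) ** 2)
print(F(np.array([1.0, 0.0, 2.0]), f, sigma=0.5, M=10.0))
```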

Properties of problem (RkSOP) and its solutions
We start by analyzing some properties of function f_2(·). First, taking into account (14) and (26), we rewrite f_2 as
$$ f_2(x) = \frac{\sigma}{M} \sum_{s=1}^{n} c_s\, |x_{j_s}|, \qquad (27) $$
where c_s = 1 − (s − 1)/n for every s ∈ {1, …, n}, with 1 = c_1 > c_2 > ⋯ > c_n = 1/n. Next, in order to study the behavior of f_2(·) at equal ℓ1-norm value, we consider the set B(0, ρ) ≜ {x ∈ ℝⁿ : ∥x∥_1 = ρ}, centered at the origin with radius ρ > 0, and we focus on the following two optimization problems,
$$ f_2^{\max} = \max\{ f_2(x) : x \in B(0,\rho) \} \qquad (P_{\max}) $$
and
$$ f_2^{\min} = \min\{ f_2(x) : x \in B(0,\rho) \}, \qquad (P_{\min}) $$
respectively. We observe that, denoting by Π the set of all permutations of {1, …, n}, both problems can be decomposed into subproblems of the type (P_max^π) and (P_min^π), respectively, in which the ordering of the absolute values of the components of x is fixed according to π ∈ Π.

Proposition 4.1 For any given π ∈ Π, the following properties regarding optimal solutions of (P_max^π) and (P_min^π) hold:
(i) Problem (P_max^π) has two maximizers, obtained by setting x_{π_1} = ±ρ and all the remaining components to zero;
(ii) the minimizers of (P_min^π) have all components equal in modulus, namely |x^(min)_{π_s}| = ρ/n for every s ∈ {1, …, n}.

Proof (i) The property follows by observing that the solutions with |x_{π_1}| = ρ and all remaining components equal to zero, which in view of (27) load the entire ℓ1-mass on the largest coefficient c_1 = 1, are optimal for the relaxation obtained by eliminating the ordering constraints from (P_max^π), and that such solutions are also feasible for (P_max^π). (ii) We observe first that the monotonically decreasing structure of the cost coefficients c_s guarantees that at the optimum |x^(min)_{π_s}| > |x^(min)_{π_{s+1}}| cannot occur for any index s ∈ {1, …, n − 1}. Consequently, it follows from satisfaction of the ordering constraints in (P_min^π), and from x^(min) ∈ B(0, ρ), that the optimal solutions satisfy |x^(min)_{π_s}| = ρ/n for every s, and the property follows.

Remark 2
The optimal solutions of (P_max^π) depend on π, in particular on the choice of the index π_1, but they all have the same optimal value. Thus problem (P_max) has a total number of 2n global solutions, with f_2^max = ρσ/M. As for problem (P_min^π), its 2ⁿ optimal solutions (obtained by arbitrary sign choices) are independent of π, hence they are also global solutions for (P_min), with f_2^min = ρσ(n + 1)/(2nM). As a consequence, the variation Δ(ρ) ≜ f_2^max − f_2^min of f_2 on B(0, ρ) is
$$ \Delta(\rho) = \frac{\rho\sigma}{M}\,\frac{n-1}{2n}. $$

Remark 3
The ℓ0-norm of the optimal solutions of problem (P_max) is equal to 1, while the ℓ0-norm of those of (P_min) is equal to n. As a consequence, the (subtractive) gap f_2, for a fixed value of the ℓ1-norm, is maximal when ∥x∥_0 = 1 and it is minimal when all components are nonzero and equal in modulus, that is ∥x∥_0 = n; in other words, model (RkSOP) exhibits a stronger bias towards reduction of the ℓ0-norm than (RSOP). Of course, the price to be paid to obtain such an advantage is the need of solving the DC (global) optimization problem (RkSOP) instead of the convex program (RSOP).
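The bias described in Remarks 2 and 3 can be checked numerically; in the sketch below (hypothetical data), among points with the same ℓ1-norm, the 1-sparse vector maximizes the gap f_2 while the vector with all components equal in modulus minimizes it.

```python
import numpy as np

def f2(x, sigma=1.0, M=1.0):
    """Gap term: (sigma/(n*M)) * sum of all k-norms of x."""
    a = np.sort(np.abs(x))[::-1]
    return sigma / (a.size * M) * np.cumsum(a).sum()

rho, n = 1.0, 5
x_sparse = np.zeros(n); x_sparse[0] = rho            # ||x||_0 = 1
x_dense  = np.full(n, rho / n)                       # ||x||_0 = n, equal moduli

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=n)
    x *= rho / np.abs(x).sum()                       # rescale so that ||x||_1 = rho
    assert f2(x_dense) <= f2(x) <= f2(x_sparse)

print(f2(x_sparse), f2(x_dense))                     # rho*sigma/M and rho*sigma*(n+1)/(2*n*M)
```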
An additional insight into the properties of the solutions of (RkSOP) can be obtained by focusing on the related necessary conditions for global optimality. In particular, at any global minimizer x̄, taking into account the DC decomposition (24), the inclusion
$$ \partial f_2(\bar{x}) \subseteq \partial f_1(\bar{x}) \qquad (32) $$
is satisfied, see [34]. Before introducing a property of minimizers in terms of their ℓ0-norm, we recall some differential properties of the k-norm, see [31], whose subdifferential admits an explicit description at any point x ∈ ℝⁿ. In particular, given any x ∈ ℝⁿ, and denoting by J_[k](x) ≜ {j_1, …, j_k} the index set of the k largest absolute-value components of x, a subgradient g^[k] ∈ ∂∥x∥_[k] can be obtained as
$$ g^{[k]}_j = \begin{cases} \mathrm{sign}(x_j) & \text{if } j \in J_{[k]}(x),\\ 0 & \text{otherwise}. \end{cases} \qquad (30) $$
Moreover, we note that the subdifferential ∂∥·∥_[k] is a singleton (i.e., the vector k-norm is differentiable) any time the index set J_[k](x) is uniquely determined and the corresponding components are all nonzero. As for the subdifferential ∂∥·∥_1 of the ℓ1-norm, we recall that its elements are the subgradients g^(1) at x whose jth component, for every j ∈ {1, …, n}, is obtained as
$$ g^{(1)}_j = \begin{cases} \mathrm{sign}(x_j) & \text{if } x_j \ne 0,\\ \alpha_j & \text{if } x_j = 0, \end{cases} $$
where α_j ∈ [−1, 1]. Summing up, we obtain the component-wise description (31) of the subdifferentials involved in condition (32).
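A direct implementation of the subgradient formula (30) (a sketch; ties among the k largest components are broken arbitrarily, which still yields a valid subgradient):

```python
import numpy as np

def knorm_subgradient(x, k):
    """A subgradient of ||.||_[k] at x, as in (30): sign(x_j) on the indices of the
    k largest absolute values, zero elsewhere."""
    g = np.zeros_like(x, dtype=float)
    top_k = np.argsort(-np.abs(x))[:k]      # index set J_[k](x), ties broken arbitrarily
    g[top_k] = np.sign(x[top_k])
    return g

x = np.array([0.0, -3.0, 1.5, 0.0, 2.0])
print(knorm_subgradient(x, 2))              # sign pattern on the two largest components
```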
The following proposition provides a quantitative evaluation of the effect of the penalty parameter σ on the ℓ0-norm of a global minimizer. By a component-wise rewriting of condition (32), and taking into account only the components j ∈ {1, …, r} such that x̄_j = 0, we obtain from (31) that at the minimizer x̄ condition (33) holds for some ξ̄ ∈ ∂f(x̄). Then, taking into account |ξ̄_j| ≤ L, the thesis follows from the bound implied by condition (33).

Remark 4
Proposition 4.2 suggests a quantitative way to control the sparsity of the solution by acting on the parameter σ. It is worth pointing out, however, that the bound holds only if a global minimizer is available. This is not necessarily the case if a local optimization algorithm is adopted, as we do in Sect. 6.

Feature selection in support vector machine (SVM)
In the SVM framework for binary classification two (labeled) point-sets A ≜ {a_1, …, a_{m_1}} and B ≜ {b_1, …, b_{m_2}} in ℝⁿ are given, the objective being to find a hyperplane, associated with a couple (x, γ) ∈ ℝⁿ × ℝ, strictly separating them. Thus, it is required that the following inequalities hold true:
$$ a_i^\top x \ge \gamma + 1, \quad i = 1, \ldots, m_1, \qquad (34) $$
$$ b_l^\top x \le \gamma - 1, \quad l = 1, \ldots, m_2. \qquad (35) $$
The existence of such a hyperplane is ensured if and only if conv A ∩ conv B = ∅, a property that is hard to check. A convex, piecewise linear and nonnegative error function of (x, γ) is thus defined. It has the form
$$ e(x,\gamma) = \sum_{i=1}^{m_1} \max\{0,\, \gamma + 1 - a_i^\top x\} + \sum_{l=1}^{m_2} \max\{0,\, b_l^\top x - \gamma + 1\}, \qquad (36) $$
being equal to zero if and only if (x, γ) actually defines a (strictly) separating hyperplane satisfying (34)–(35).
In the SVM approach the following convex problem is solved:
$$ \min_{x,\gamma} \; \|x\| + C e(x,\gamma), \qquad (37) $$
where the norm of x is added to the error function with the aim of obtaining a maximum-margin separation, C being a positive trade-off parameter. The ℓ1 and ℓ2 norms are in general adopted in model (37) but, in case feature selection is pursued, the ℓ0-norm looks like the most suitable tool, although the ℓ1-norm is usually considered a good approximation. In the following, we will focus on the feature selection ℓ0-norm problem
$$ \min_{x,\gamma} \; \|x\|_0 + C e(x,\gamma), $$
by applying a relaxed model of the type (RkSOP) and the related machinery. Note that, to keep the notation as close as possible to that commonly adopted in the SVM framework, the trade-off parameter C, unlike the classical formulation of (SOP), where the parameter σ multiplies the ℓ0 term, is now equivalently put in front of the error term e(x, γ).
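A minimal sketch of the error function (36) (one common convention; the rows of A and B are the labeled samples, and any per-class averaging factors are omitted here):

```python
import numpy as np

def svm_error(x, gamma, A, B):
    """Piecewise-linear classification error e(x, gamma): zero iff the hyperplane
    (x, gamma) strictly separates the rows of A from the rows of B."""
    err_A = np.maximum(0.0, gamma + 1.0 - A @ x).sum()   # violations of a_i' x >= gamma + 1
    err_B = np.maximum(0.0, B @ x - gamma + 1.0).sum()   # violations of b_l' x <= gamma - 1
    return err_A + err_B

A = np.array([[2.0, 0.0], [3.0, 1.0]])
B = np.array([[-2.0, 0.0], [-1.0, -1.0]])
print(svm_error(np.array([1.0, 0.0]), 0.0, A, B))        # 0.0: the two sets are separated
```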
By adopting the same relaxation scheme described in Sect. 3, we obtain the DC continuous program
$$ \min_{x,\gamma} \; C e(x,\gamma) + \frac{1}{\bar{M}} \Big( n\|x\|_1 - \sum_{k=1}^{n} \|x\|_{[k]} \Big), \qquad \text{(SVM-RkSOP)} $$
an SVM model tailored for feature selection, as it exploits the continuous relaxation of (MIkSOP); its computational behavior will be analyzed in the next section.

Computational experience
We have evaluated the computational behavior of our SVM-based feature selection model (SVM-RkSOP) by testing it on 8 well-known datasets whose relevant details are listed in Table 1. In particular, we remark that such datasets can be partitioned into two groups depending on the relative proportion between the number of points (m_1 + m_2) and the number of features (n). From this viewpoint, datasets 1 to 4 have a large number of points compared to the small number of features, whilst datasets 5 to 8 have a large number of features compared to the small number of points. We recall that (SVM-RkSOP) is a DC optimization problem, analogous to (24), with the convex components f_1 and f_2 defined accordingly, and that it can be tackled by adopting techniques, like those described in [3, 27, 36], which require calculating at each iterate-point a linearization of function f_2(·). In fact, such a linearization can be easily calculated by applying (30) to get a subgradient g^[k] ∈ ∂∥x∥_[k]. In particular, we have implemented a version of the DCA algorithm, see [2], exploiting the fact that at an iterate point x̄, once a linearization h_2(x̄, ·) of f_2(·) at x̄ has been obtained, the new iterate-point can be calculated by minimizing the convex function f_1(·) − h_2(x̄, ·), which can be easily turned into an equivalent linear program. Hence, applying DCA to (SVM-RkSOP) amounts to solving a sequence of linear programs as long as a decrease of the objective function is obtained. In order to evaluate the effectiveness of (SVM-RkSOP) solved by DCA, we also make some comparisons against the behavior of the standard convex ℓ1-norm-based SVM model (SVM-RSOP), whose solution can be easily obtained by solving just one equivalent linear program, next referred to as LP-SVM-RSOP. The experimental plan for both approaches is based on a two-level cross-validation protocol to tune parameter C and to train the classifier. In fact, the tenfold cross-validation approach has been adopted at the higher level to train the classifier, every dataset being randomly partitioned into 10 groups of equal size. Then, 10 different blocks (the training sets) are built, each containing 9 out of the 10 groups. Every block is used to train the classifier, using the left-out group as the testing set, which returns the percentage of correctly classified instances (test correctness). Before executing the training phase, at the lower level it is necessary to tune parameter C.
A grid of 20 values, ranging between 10⁻³ and 10², has been selected, and a five-fold cross-validation approach has been adopted on each training set (the model-selection phase). For each training set, we choose the C value as the one returning the highest average test-correctness. As for the selection of an appropriate starting point for DCA applied to (SVM-RkSOP), next referred to as DCA-SVM-RkSOP, denoting by x_a the barycenter of all the a_i instances, and by x_b the barycenter of all the b_l instances, we have selected the starting point (x⁰, γ⁰) by setting x⁰ on the basis of the two barycenters and choosing γ⁰ such that all the b_l instances are well classified. We have implemented the DCA-SVM-RkSOP algorithm [26] in Python 3.6, and run the computational experiments on a 2.80 GHz Intel(R) Core(TM) i7 computer. The LP solver of IBM ILOG CPLEX 12.8 [35] has been used to solve the linear programs. The numerical results regarding DCA-SVM-RkSOP and LP-SVM-RSOP are reported in Table 2 and Table 3, respectively, where we list the percentage correctness averaged over the 10 folds of both the testing and the training phases. Moreover, we report the values ft0, ft-2, ft-4, and ft-9, representing the average percentage of features for which the corresponding component of the minimizer x*, in modulus, is larger than 1, 10⁻², 10⁻⁴ and 10⁻⁹, respectively. The effectiveness of sparsity-inducing models with respect to standard ℓ2-norm-based approaches has been demonstrated elsewhere (see [28, 29]), at least for problems of reasonable size. As for the comparison with the k-norm-based approach SVM0 described in [28], it has to be taken into account that the results therein (see Tables 6 and 7) are strongly affected by the choice of the penalty parameter. Nevertheless, as far as test-correctness is considered, from a comparison of Table 2 with [28, Table 5], we observe that each of the two approaches SVM0 and DCA-SVM-RkSOP prevails on four of the eight datasets considered in both papers. As for the sparsity of the solutions, the behavior is comparable on the datasets of the second group, while SVM0 exhibits stronger sparsity enforcement on the first group of datasets, somehow at the expense of test-correctness.
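For completeness, a schematic sketch of the DCA loop used above (hypothetical function names; the convexified subproblem, a linear program in our setting, is left to any LP solver):

```python
import numpy as np

def dca(f1_argmin, f2_subgradient, x0, max_iter=100, tol=1e-6):
    """Schematic DCA for minimizing F(x) = f1(x) - f2(x):
    at each iteration f2 is linearized at the current point and the resulting
    convex subproblem  min_x f1(x) - g'x  is solved (an LP for (SVM-RkSOP))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = f2_subgradient(x)          # linearization of the concave part at x
        x_new = f1_argmin(g)           # solve the convexified subproblem
        if np.linalg.norm(x_new - x) <= tol:
            break
        x = x_new
    return x
```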

Conclusions
We have introduced a novel continuous nonconvex k-norm-based model for sparse optimization, which is derived as the continuous relaxation of a MINLP problem.
We have applied such a model to Feature Selection in the SVM setting, and the results suggest the superiority of the proposed approach over other attempts to approximate the ℓ0 pseudo-norm (e.g., the ℓ1-norm penalization). On the other hand, we have also observed that the mere replacement of the ℓ0 pseudo-norm with the k-norm, evoked in some methods available in the literature, does not provide satisfactory results, at least in the experimentation area considered in this paper.
Funding Open access funding provided by Universitá della Calabria within the CRUI-CARE Agreement.