Polyhedral separation via difference of convex (DC) programming

We consider polyhedral separation of sets as a possible tool in supervised classification. In particular, we focus on the optimization model introduced by Astorino and Gaudioso (J Optim Theory Appl 112(2):265–293, 2002) and adopt its reformulation in difference of convex (DC) form. We tackle the problem by adapting the algorithm for DC programming known as DCA. We present the results of the implementation of DCA on a number of benchmark classification datasets.


Introduction
The classification of an object is a decision-making process whose outcome is the assignment of a specific class membership to the object under observation. Medical diagnosis (Mangasarian et al. 1995), chemistry (Jurs 1986), cybersecurity, and image processing (Khalaf et al. 2017) are only some of the possible application areas of classification.
Each object (sample) is characterized by a finite number of (quantitative and/or qualitative) attributes, usually referred to as the features.
Construction of a classifier is a supervised learning activity, where a dataset of samples, whose class membership is known in advance, is given as input. The objective is to obtain a mathematical model capable of correctly classifying newly incoming samples whose class membership is, instead, unknown.
Classification deals, mainly, with the separation of sets of samples in the feature space, which is assumed to be $\mathbb{R}^n$. When the classes are two, we are faced with a binary classification problem. In this paper, in fact, the training dataset is partitioned into two subsets, say A and B, and thus the problem consists in finding a separation surface, if any, between them.
Most of the models rely on setting an appropriate optimization problem whose output is either a separating surface or a nearly separating one, resulting in the minimization of some measure of the classification error.
Starting from the pioneering works by Bennett and Mangasarian (1992) and Vapnik (1995), the hyperplane has been the separation surface of choice, although the use of nonlinear separation surfaces has been pursued too, by Rosen (1965), Plastria et al. (2014) and Gaudioso (2005, 2009).
The literature on classification is huge. We cite Vapnik (1995), Cristianini and Shawe-Taylor (2000), Schölkopf et al. (1999) and Sra et al. (2011) as basic references in the support vector machine (SVM) framework, and Thongsuwan et al. (2020) as a recent approach in deep learning.
It is well known (see, e.g., Cristianini and Shawe-Taylor (2000)) that if the convex hulls of the two sets A and B do not intersect, there exists a separating hyperplane such that A lies on one side of the hyperplane and B on the other. Such a hyperplane can be calculated by linear programming (Bennett and Mangasarian 1992), and the two sets are referred to as linearly separable. On the other hand, if conv(A) ∩ conv(B) ≠ ∅, a number of algorithms can be adopted to determine a quasiseparating hyperplane minimizing a related error function (see, for example, Mangasarian (1999)).
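The linear separability test can be phrased as a linear programming feasibility problem: find $(w, \gamma)$ with $a^\top w \le \gamma - 1$ for all $a \in A$ and $b^\top w \ge \gamma + 1$ for all $b \in B$. Below is a minimal sketch of this check (illustrative, not the code used in the paper) via SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(A, B):
    """LP feasibility check of linear separability: find w, gamma with
    a^T w <= gamma - 1 for a in A and b^T w >= gamma + 1 for b in B
    (the unit offsets rule out the trivial solution w = 0)."""
    m, n = A.shape
    k, _ = B.shape
    # Variables x = (w_1, ..., w_n, gamma); constraints as A_ub x <= b_ub.
    A_ub = np.vstack([
        np.hstack([A, -np.ones((m, 1))]),   #  a^T w - gamma <= -1
        np.hstack([-B, np.ones((k, 1))]),   # -b^T w + gamma <= -1
    ])
    b_ub = -np.ones(m + k)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.success  # feasible <=> linearly separable

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 3.0], [4.0, 3.0]])
print(linearly_separable(A, B))   # True: the two sets are linearly separable
```

Note that the variables must be declared free (`bounds=(None, None)`), since `linprog` defaults to nonnegative variables.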
In this paper, we deal with binary classification based on the use of a polyhedral surface. The concept of polyhedral separability was introduced in Megiddo (1988) and applied within the classification framework in Astorino and Gaudioso (2002) and Astorino and Fuduli (2015).
Whenever, in fact, the two sets A and B are not linearly separable, it is possible to resort to polyhedral separation, that is, to determine h > 1 hyperplanes such that A is contained in the convex polyhedron given by the intersection of h half-spaces and B lies outside such polyhedron.
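As an illustration of how such a polyhedral classifier acts once the h hyperplanes are available, here is a small sketch (names and data are illustrative, not from the paper):

```python
import numpy as np

def classify(x, W, gamma):
    """Assign class 'A' if x lies in the polyhedron given by the
    intersection of the h half-spaces {y : y^T w_j <= gamma_j},
    and 'B' otherwise. W is n-by-h with columns w_j; gamma has
    h components. Illustrative helper, not the paper's code."""
    return 'A' if np.all(x @ W <= gamma) else 'B'

# Two hyperplanes in the plane: x1 <= 1 and x2 <= 1 define a "corner" polyhedron.
W = np.array([[1.0, 0.0], [0.0, 1.0]])      # columns are w_1, w_2
gamma = np.array([1.0, 1.0])
print(classify(np.array([0.5, 0.5]), W, gamma))  # A  (inside the polyhedron)
print(classify(np.array([2.0, 0.5]), W, gamma))  # B  (violates the first half-space)
```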
In Astorino and Gaudioso (2002), an optimization model was proposed to calculate a set of h hyperplanes generating a polyhedral separation, whenever possible, for the sets A and B. The model consists, as usual in classification, in minimizing an error function to cope with the case when the two sets are not h-polyhedrally separable. In parallel with SVM, the model was extended in Astorino and Fuduli (2015) to accommodate margin maximization.
The error function adopted in Astorino and Gaudioso (2002) is neither convex nor concave, and it was dealt with by means of successive linearizations.
In this paper, we focus on the numerical treatment of the optimization problem to be solved in order to get a polyhedral separation surface. In particular, we fully exploit the DC (difference of convex) nature of the objective function (Hiriart-Urruty 1986), and thus, differently from Astorino and Gaudioso (2002), we adopt an algorithm designed to treat DC functions. In fact, the literature provides a wide set of efficient algorithms in this area nowadays.
In Pham Dinh (2005), Pham Dinh and Le Thi Hoai (2014), an iterative algorithm was introduced to minimize functions of the form $f = f_1 - f_2$, with $f_1$ and $f_2$ convex. The algorithm, called DCA, considers at iteration t the linearization $\bar{f}_2^{(t)}$ of the function $f_2$ at the point $x_t$ and determines the next iterate $x_{t+1}$ as an optimal solution of the convex problem
$$\min_x \; f_1(x) - \bar{f}_2^{(t)}(x).$$
DCA has proven to be an efficient method to tackle DC problems, even non-smooth ones, and different artificial intelligence problems have been approached by means of the DCA (Astorino et al. 2010, 2012; Khalaf et al. 2017). Several other methods have been proposed more recently in the literature (see Gaudioso et al. (2018) and Joki et al. (2017)), which allow one to solve large-scale DC programs too.
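The DCA iteration just described can be sketched on a toy one-dimensional DC function where the convex subproblem has a closed-form solution. This is purely an illustrative example of the scheme, not the model of this paper: take $f(x) = x^4 - 2x^2$, with $f_1(x) = x^4$ and $f_2(x) = 2x^2$ both convex.

```python
def dca_scalar(x0, iters=50):
    """Minimal DCA sketch for f(x) = f1(x) - f2(x) with f1(x) = x**4 and
    f2(x) = 2*x**2.  At each iteration f2 is replaced by its linearization
    at x_t (slope f2'(x_t) = 4*x_t), and the convex subproblem
    min_x x**4 - 4*x_t*x is solved in closed form (4 x^3 = 4 x_t)."""
    x = x0
    for _ in range(iters):
        g = 4.0 * x                                   # gradient of f2 at x_t
        x = abs(g / 4.0) ** (1.0 / 3.0) * (1.0 if g >= 0 else -1.0)
    return x

# The iterates converge to x = 1, a local (here also global) minimizer
# of x^4 - 2x^2; starting from a negative point they converge to x = -1.
print(round(dca_scalar(0.5), 6))    # 1.0
print(round(dca_scalar(-0.5), 6))   # -1.0
```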
In this paper, we build on the model of Astorino and Gaudioso (2002) and adopt a decomposition of the error function for the h-polyhedral separability problem as the difference of two convex functions. We then apply the DCA and carry out an extensive experimentation on several classes of benchmark instances.
The paper is organized as follows. In Sect. 2, we describe the h-polyhedral classification model and its reformulation as a difference of convex optimization problem. In Sect. 3, we describe how the DCA has been adapted to the DC reformulations. In Sect. 4, we present the results of our implementation on a number of benchmark classification problems. Some conclusions are drawn in Sect. 5.

The polyhedral separability model
Let $A = \{a_1, \ldots, a_m\}$ and $B = \{b_1, \ldots, b_k\}$ be two finite subsets of $\mathbb{R}^n$.

Definition 1 The sets A and B are h-polyhedrally separable if there exists a set of h hyperplanes $\{(w^j, \gamma^j)\}$, with $w^j \in \mathbb{R}^n$ and $\gamma^j \in \mathbb{R}$, $j = 1, \ldots, h$, such that $a_i^\top w^j \le \gamma^j$ for all $i = 1, \ldots, m$ and $j = 1, \ldots, h$, while for each $l = 1, \ldots, k$ it holds that $b_l^\top w^j > \gamma^j$ for at least one $j \in \{1, \ldots, h\}$.
The following proposition gives an equivalent characterization of h-polyhedral separability:

Proposition 1 The sets A and B are h-polyhedrally separable if and only if there exist h hyperplanes $\{(w^j, \gamma^j)\}$, $j = 1, \ldots, h$, such that
$$a_i^\top w^j - \gamma^j + 1 \le 0, \quad i = 1, \ldots, m, \; j = 1, \ldots, h, \qquad (2)$$
and
$$b_l^\top w^j - \gamma^j - 1 \ge 0, \quad \text{for each } l = 1, \ldots, k \text{ and at least one } j \in \{1, \ldots, h\}. \qquad (3)$$

Proof [Astorino and Gaudioso (2002), Proposition 2.1]

Moreover, in Astorino and Gaudioso (2002) [Proposition 2.2] it is proven that a necessary and sufficient condition for the sets A and B to be h-polyhedrally separable (for some h ≤ |B|) is
$$\mathrm{conv}(A) \cap B = \emptyset. \qquad (4)$$

Remark 1
The roles of A and B in (4) are not symmetric.
According to Proposition 1, a point $a_i \in A$ is well classified by the set of hyperplanes $\{(w^j, \gamma^j)\}$ if $a_i^\top w^j - \gamma^j + 1 \le 0$ for all $j = 1, \ldots, h$. Therefore, we can compute the classification error of the point $a_i$ with respect to $\{(w^j, \gamma^j)\}$ as
$$\max_{1 \le j \le h} \max\{0, \; a_i^\top w^j - \gamma^j + 1\}.$$

Given a set of h hyperplanes $\{(w^j, \gamma^j)\}$, we denote by $W = [w^1 : \cdots : w^h]$ the matrix whose jth column is the vector $w^j$ and by $\Gamma = (\gamma^1, \ldots, \gamma^h)$ the vector whose components are the $\gamma^j$'s. The classification error function for the h-polyhedral separability problem for the sets A and B, with respect to the hyperplanes $\{(w^j, \gamma^j)\}$, is then given by
$$e(W, \Gamma) = e_1(W, \Gamma) + e_2(W, \Gamma), \qquad (7)$$
where
$$e_1(W, \Gamma) = \frac{1}{m} \sum_{i=1}^{m} \max_{1 \le j \le h} \max\{0, \; a_i^\top w^j - \gamma^j + 1\} \qquad (8)$$
and
$$e_2(W, \Gamma) = \frac{1}{k} \sum_{l=1}^{k} \max\Big\{0, \; \min_{1 \le j \le h} (\gamma^j + 1 - b_l^\top w^j)\Big\} \qquad (9)$$
represent the errors for the points of A and B, respectively.

The function $e(W, \Gamma)$ is nonnegative and piecewise affine; $e_1(W, \Gamma)$ is convex, and $e_2(W, \Gamma)$ is quasi-concave. Moreover, in Astorino and Gaudioso (2002) it has also been proven that the sets A and B are h-polyhedrally separable if and only if there exists a set of h hyperplanes $(W^*, \Gamma^*)$ such that $e(W^*, \Gamma^*) = 0$ and that, in such case, $w^j = 0$ for all $j = 1, \ldots, h$ cannot be the optimal solution. In Astorino and Gaudioso (2002), the problem of minimizing the error function (7) is tackled by solving, at each iteration, a linear program providing a descent direction. Here, instead, we rewrite $e(W, \Gamma)$ as a difference of convex functions, and then we address its minimization through ad hoc DC techniques.
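Assuming the error terms have the averaged hinge-like form of Astorino and Gaudioso (2002) recalled above, the error function can be evaluated as in the following sketch (an illustrative helper, not the paper's code):

```python
import numpy as np

def poly_error(W, gamma, A, B):
    """Classification error e = e1 + e2 for h-polyhedral separation:
    points of A must satisfy a^T w_j - gamma_j + 1 <= 0 for ALL j,
    while points of B must violate at least one such inequality.
    W is n-by-h with columns w_j; errors are averaged over each set."""
    ZA = A @ W - gamma + 1.0             # m-by-h, entry (i, j) = a_i^T w_j - gamma_j + 1
    ZB = -(B @ W) + gamma + 1.0          # k-by-h, entry (l, j) = gamma_j + 1 - b_l^T w_j
    e1 = np.mean(np.maximum(0.0, ZA).max(axis=1))   # convex error over A
    e2 = np.mean(np.maximum(0.0, ZB.min(axis=1)))   # error over B
    return e1 + e2

# Two hyperplanes separating a "corner" set A from B: the error is zero.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
gamma = np.array([2.0, 2.0])
A = np.array([[0.0, 0.0], [1.0, 0.5]])
B = np.array([[4.0, 4.0], [0.0, 5.0]])
print(poly_error(W, gamma, A, B))   # 0.0: A and B are 2-polyhedrally separated
```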
To obtain such a reformulation, consider the following identity, which is valid for any set of h affine functions $z_j(x)$, $j = 1, \ldots, h$:
$$\max\Big\{0, \min_{1 \le j \le h} z_j(x)\Big\} = \max\Big\{\max_{1 \le r \le h} \sum_{j \ne r} z_j(x), \; \sum_{j=1}^{h} z_j(x)\Big\} - \max_{1 \le r \le h} \sum_{j \ne r} z_j(x). \qquad (10)$$
By applying (10) to $e_2(W, \Gamma)$, with $z_j^l(W, \Gamma) = \gamma^j + 1 - b_l^\top w^j$ for each point $b_l \in B$, we obtain the following DC decomposition of $e(W, \Gamma)$:
$$e(W, \Gamma) = \hat{e}_1(W, \Gamma) - \hat{e}_2(W, \Gamma), \qquad (11)$$
where both
$$\hat{e}_1(W, \Gamma) = e_1(W, \Gamma) + \frac{1}{k} \sum_{l=1}^{k} \max\Big\{\max_{1 \le r \le h} \sum_{j \ne r} z_j^l(W, \Gamma), \; \sum_{j=1}^{h} z_j^l(W, \Gamma)\Big\} \qquad (12)$$
and
$$\hat{e}_2(W, \Gamma) = \frac{1}{k} \sum_{l=1}^{k} \max_{1 \le r \le h} \sum_{j \ne r} z_j^l(W, \Gamma) \qquad (13)$$
are convex, being sums of pointwise maxima of affine functions. The DC decomposition (11) has already been discussed in Strekalovsky et al. (2015), where the authors suggested an algorithm combining a local and a global search in order to find a global minimum of the error function. In the numerical experience discussed in the next sections, we confine ourselves to finding just local minima of the error functions involved.
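The identity underlying the decomposition rewrites $\max\{0, \min_j z_j\}$ as a difference of two maxima of sums. Assuming it takes the standard min-of-affine form shown above, it can be checked numerically, since for fixed values $z_1, \ldots, z_h$ it reduces to an identity between real numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# For any values z_1..z_h:
#   max(0, min_j z_j) = max(max_r sum_{j != r} z_j, sum_j z_j)
#                       - max_r sum_{j != r} z_j.
# When the z_j are affine, both right-hand terms are maxima of affine
# functions, hence convex -- which is what makes (11) a DC decomposition.
for _ in range(1000):
    z = rng.normal(size=5)
    s = z.sum()
    M = max(s - zj for zj in z)        # max over r of sum_{j != r} z_j
    lhs = max(0.0, z.min())
    rhs = max(M, s) - M
    assert abs(lhs - rhs) < 1e-12
print("identity verified on 1000 random samples")
```

The check also makes the mechanism transparent: since $\max_r \sum_{j \ne r} z_j = \sum_j z_j - \min_j z_j$, the right-hand side equals $\max\{0, \min_j z_j\}$ by direct substitution.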

Exploiting the function structure in the DCA implementation
We have applied the DCA to the minimization of the error function (11). Before discussing our experimental setting, we describe how we have adapted the DCA to deal with polyhedral separation applied to a number of datasets from the classification literature.

At iteration t, in any possible configuration $(W_t, \Gamma_t)$ of the h hyperplanes, we can calculate for each $l = 1, \ldots, k$ the index $j_l$ where the maximum in (13) is achieved:
$$j_l \in \arg\max_{1 \le r \le h} \sum_{j \ne r} z_j^l(W_t, \Gamma_t), \qquad (14)$$
and we define consequently the linearization of the function $\hat{e}_2$ at iteration t:
$$\hat{e}_2^{\,t}(W, \Gamma) = \frac{1}{k} \sum_{l=1}^{k} \sum_{j \ne j_l} z_j^l(W, \Gamma), \qquad (15)$$
which satisfies $\hat{e}_2^{\,t}(W, \Gamma) \le \hat{e}_2(W, \Gamma)$. We observe that the choice of the index (14) amounts to selecting a subgradient $g_t \in \partial \hat{e}_2(W_t, \Gamma_t)$, which yields the construction of (15), the linearization of $\hat{e}_2$ at $(W_t, \Gamma_t)$, according to the DCA scheme.
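The linearization step can be verified numerically: under the decomposition assumed above, picking for each $b_l$ the index achieving the inner maximum yields an affine function that minorizes $\hat{e}_2$ everywhere and is tight at the current point. A sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, k = 3, 2, 6
B = rng.normal(size=(k, n))

def e2_hat(W, gamma):
    """Convex part e2_hat: for each b_l, the maximum over r of
    sum_{j != r} z_j^l, with z_j^l = gamma_j + 1 - b_l^T w_j,
    averaged over B (illustrative reconstruction, not the paper's code)."""
    Z = -(B @ W) + gamma + 1.0           # k-by-h, entry (l, j) = z_j^l
    S = Z.sum(axis=1, keepdims=True)
    return np.mean((S - Z).max(axis=1))  # max_r sum_{j != r} z_j^l

# Linearization at (W_t, Gamma_t): for each l pick the index j_l achieving
# the max, then keep the corresponding affine piece.
W_t, g_t = rng.normal(size=(n, h)), rng.normal(size=h)
Z_t = -(B @ W_t) + g_t + 1.0
j_sel = (Z_t.sum(axis=1, keepdims=True) - Z_t).argmax(axis=1)

def e2_lin(W, gamma):
    Z = -(B @ W) + gamma + 1.0
    S = Z.sum(axis=1)
    return np.mean(S - Z[np.arange(k), j_sel])  # sum_{j != j_l} z_j^l

# The linearization is tight at (W_t, Gamma_t) and minorizes e2_hat everywhere.
assert abs(e2_lin(W_t, g_t) - e2_hat(W_t, g_t)) < 1e-12
for _ in range(200):
    W, g = rng.normal(size=(n, h)), rng.normal(size=h)
    assert e2_lin(W, g) <= e2_hat(W, g) + 1e-12
print("linearization minorizes e2_hat and is tight at the current point")
```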
Then, we consider the convexification of the original DC function
$$\hat{e}_1(W, \Gamma) - \hat{e}_2^{\,t}(W, \Gamma), \qquad (16)$$
so that the next configuration $(W_{t+1}, \Gamma_{t+1})$ is obtained by solving the convex program
$$\min_{W, \Gamma} \; \hat{e}_1(W, \Gamma) - \hat{e}_2^{\,t}(W, \Gamma), \qquad (17)$$
which in turn can be put in the form of the following linear program, thanks to the introduction of the additional variables $\xi_i$, $i = 1, \ldots, m$, and $\zeta_l$, $l = 1, \ldots, k$:
$$\begin{aligned}
\min_{W, \Gamma, \xi, \zeta} \quad & \frac{1}{m} \sum_{i=1}^{m} \xi_i + \frac{1}{k} \sum_{l=1}^{k} \zeta_l - \hat{e}_2^{\,t}(W, \Gamma) \\
\text{s.t.} \quad & \xi_i \ge a_i^\top w^j - \gamma^j + 1, \quad i = 1, \ldots, m, \; j = 1, \ldots, h, \\
& \xi_i \ge 0, \quad i = 1, \ldots, m, \\
& \zeta_l \ge \sum_{j \ne r} z_j^l(W, \Gamma), \quad l = 1, \ldots, k, \; r = 1, \ldots, h, \\
& \zeta_l \ge \sum_{j=1}^{h} z_j^l(W, \Gamma), \quad l = 1, \ldots, k.
\end{aligned}$$

Summing up, the DCA-based algorithm for the minimization of (11) can be stated as follows:
0. Choose $W_0 \in \mathbb{R}^{n \times h}$, $\Gamma_0 \in \mathbb{R}^h$ and a tolerance $\epsilon > 0$. Set t = 0.
1. Compute $g_t \in \partial \hat{e}_2(W_t, \Gamma_t)$ and construct (15).
2. Set $(W_{t+1}, \Gamma_{t+1})$ as a solution of (17).
3. If $e(W_t, \Gamma_t) - e(W_{t+1}, \Gamma_{t+1}) \le \epsilon$, stop. Otherwise, increase t by 1 and return to step 1.
The above algorithm is a descent method. It is easy to verify that if $e(W_{t+1}, \Gamma_{t+1}) = e(W_t, \Gamma_t)$, then $(W_t, \Gamma_t)$ is a critical point, i.e., $\partial \hat{e}_1(W_t, \Gamma_t) \cap \partial \hat{e}_2(W_t, \Gamma_t) \ne \emptyset$. Hence, this result provides the stopping criterion at step 3. For more theoretical details and the convergence theorem, see Pham Dinh and Le Thi Hoai (2014), Pham Dinh (2005).
Following the SVM paradigm, aimed at obtaining a good generalization capability, we have added to $\hat{e}_1(W, \Gamma)$ in (11) the margin term
$$\frac{1}{2} \sum_{j=1}^{h} \|w^j\|^2, \qquad (18)$$
thus coming out with the following DC model:
$$\min_{W, \Gamma} \; \frac{1}{2} \sum_{j=1}^{h} \|w^j\|^2 + C \big(\hat{e}_1(W, \Gamma) - \hat{e}_2(W, \Gamma)\big), \qquad (19)$$
where C > 0 is a trade-off parameter between the two objectives of maximizing the margin and minimizing the classification error. The minimization of (19) can be addressed by the DCA too. In this case, at each iteration we have to solve a quadratic program that differs from the linear program (17) only in the quadratic margin term (18). Consequently, the algorithmic scheme is unchanged, except for step 2, where a quadratic program has to be solved.

Summing up, we have implemented two codes:
- h-PolSepDC, where we minimize (11) (the separation problem with no separation margin) by solving at each iteration the linear program (17);
- h-PolSepDC-QP, where we minimize (19) (a margin has been accounted for) by solving at each iteration a quadratic program.
In particular, we have chosen h = 2 according to the hyperparameter tuning performed in Astorino and Gaudioso (2002).
Moreover, since the roles of the sets A and B are not symmetric in the definition of polyhedral separability, in the numerical experiments one has to decide which set plays the role of A and which of B in each dataset. We have taken as A the set with the smaller number of points, following, also on this issue, the rule adopted in Astorino and Gaudioso (2002).
We have used MATLAB R2015b, calling the CPLEX library, on a 2.6 GHz Intel Core i7 processor under the OS X 10.12.6 operating system.
To evaluate the impact of the DC decomposition of the error function (7) with respect to the classic non-smooth optimization approach, we have reimplemented, in MATLAB, the algorithm proposed in Astorino and Gaudioso (2002) (2-PolSep code). Finally, for the sake of completeness, we have also included a standard linear SVM (SVM-LINEAR code) in the comparison. We have considered several test problems drawn from the binary classification literature, which are described in Table 1. In particular, all datasets are taken from the LIBSVM (Library for Support Vector Machines) repository (Chang and Lin 2011), except for g50c and g10n, which are described in Chapelle and Zien (2005).
For all datasets, we have performed a standard tenfold cross-validation protocol; in Table 2, we summarize the LP/QP problems solved at each fold, in terms of number of variables and constraints.
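A tenfold split of this kind can be generated as in the following sketch (a generic illustration of the protocol, not the experiment code):

```python
import numpy as np

def tenfold_indices(n_samples, seed=0):
    """Index splits for a standard tenfold cross-validation: shuffle once,
    then use each tenth as the test fold in turn while training on the
    remaining nine tenths (illustrative sketch)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, 10)
    for f in range(10):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(10) if g != f])
        yield train, test

# Every sample appears in exactly one test fold.
all_test = np.concatenate([test for _, test in tenfold_indices(103)])
assert np.array_equal(np.sort(all_test), np.arange(103))
print("10 folds cover all samples exactly once")
```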
For each approach, in the columns train and test of Table 3 we report the average percentage of training and testing correctness, respectively. The best results in terms of testing correctness are underlined.
A preliminary tuning of the parameter C in the 2-PolSepDC-QP and SVM-LINEAR codes has been performed, selecting, for each dataset, the value optimizing the performance on the testing set.
2-PolSep and 2-PolSepDC are different algorithms for solving the same problem, i.e., the minimization of (11). Comparing the objective function values obtained by the two codes starting from the same initial point, we note that the DC-based approach provides better solutions, although the difference is not large.
Since in the definition of polyhedral separability the roles of the sets A and B are not symmetric, we also compare the results in terms of recall, specificity, precision and F1-score (see Table 5). The trend of these key performance indicators confirms the effectiveness of the DC-based approaches.
For completeness, we have also run both codes h-PolSepDC and h-PolSepDC-QP with h > 2. The running time is not dramatically larger, but the numerical experiments show no significant improvement in terms of correctness. Even worse, in some cases an improvement in training performance is not accompanied by an improvement in testing performance, suggesting that a high value of h (the number of hyperplanes) yields classifiers with reduced generalization capability. We report some of these results in Tables 6 and 7.

Conclusions
We have adopted a difference of convex decomposition of the error function in polyhedral separation and have tackled the resulting optimization problem via the DCA algorithm.
The numerical results we have obtained demonstrate the good performance of the approach both in terms of classification correctness and computation time.
Future research will investigate the integration of feature selection (Gaudioso et al. 2017) with polyhedral separation, aimed at detecting a possibly smaller subset of attributes that is significant in terms of classification correctness.
Funding Open access funding provided by Università degli Studi di Cagliari within the CRUI-CARE Agreement. The fifth author was supported by KASBA-funded by Regione Autonoma della Sardegna. This research was also supported by Fondazione di Sardegna through the project Algorithms for Approximation with Applications.

Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.