Multiclass optimal classification trees with SVM-splits

In this paper we present a novel mathematical optimization-based methodology to construct tree-shaped classification rules for multiclass instances. Our approach consists of building Classification Trees in which, except for the leaf nodes, the labels are temporarily left out and grouped into two classes by means of a SVM separating hyperplane. We provide a Mixed Integer Non Linear Programming formulation for the problem and report the results of an extended battery of computational experiments to assess the performance of our proposal with respect to other benchmarking classification methods.


Introduction
Interpretability is a crucial requisite demanded to machine learning methods provoked by the tremendous amount of methodologies that have arised in the last decade [21].It is expected that the model that results when applying a machine learning methodology using a training sample, apart from being able to adequately predict the behaviour of out-of-sample observations, can be interpreted.Different tools have been applied to derive interpretable machine learning methods.One of the most popular strategies to simplify the obtained models is feature selection, in which a reduced set of attributes is to be selected without loosing quality in the predictions.Reducing the number of parameters to analyze, the models can be easier to understand, yielding higher descriptive accuracy.One could also consider models that can be modulated, in the sense that a great proportion of its prediction-making process can be interpreted independently.This is the case of generalized linear models [29].Other methods incorporate interpretability as a synonym of being able to be reproduced by humans in its entire construction [15,34].This is the case of Decision Trees with small depth which can be visualized and interpreted easily by users even not familiar with the tools behind their construction.We adopt a tree-based methodology through this paper.
Among the wide variety of strands derived under the lens of Machine Learning, classification is one that has attracted a lot of attention because of its applicability in many different fields [4,28,31,36,41].Classification methodologies aim to adequately predict the class of new observations provided that a given sample has been used to construct the classification rule.The role of Mathematical Programming in the construction of classification models has been widely recognized, and some of the most popular methods to derive classification rules are based on solving optimization problems [13,17,16,11,27].Moreover, Mathematical Programming has also been proven to be a flexible and accurate tool when requiring interpretability to the obtained models [5,6,12,25].
However, most of the optimization tools derived to construct classifiers assume instances with only two classes.In this paper, we provide a novel classification method in which the instances are allowed to be classified into two or more classes.The method is constructed using one of the most interpretable classification method, Classification Trees, but combined with Support Vector Machines, which provide highly predictive models.
We have developed a Mathematical Programming model that allows to construct an Optimal Classification Tree for a given training sample, in which each split is generated by means of a SVM-based hyperplane.When building the tree, the labels of the observations are ignored in the branch nodes, and they are only accounted for in the leaf nodes where misclassification errors are considered.The classification tree is constructed to minimize the complexity of the tree (assuring interpretability) and also the misclassification risk (assuring predictive power).
1.1.Related Works.Several machine learning methodologies have been proposed in the literature in order to construct highly predictive classification rules.The most popular ones are based on Deep Learning mechanisms [1], k-Nearest Neighborhoods [18,42], Naïve Bayes [35], Classification Trees (CT) [15,23] and Support Vector Machines (SVM) [17].Among them, CT and SVM, which are, by nature, optimization-based methodologies, apart from producing highly predictive classifiers have been proven to be very flexible tools since both allow the incorporation of different elements (through the adequate optimization models by means of constraints and objective functions) to be adapted to different situations, as Feature Selection [27,5,6,30], accuracy requirements [7,24] or dealing with unbalanced or noisy instances [22,10,14], amongst others.Support Vector Machines were originally introduced by Cortes and Vapnik [17] as a binary classification tool that builds the decision rule by means of a separating hyperplane with large separation between the two classes.This hyperplane is obtained by solving a convex quadratic optimization problem, in which the goal is to separate data by their two differentiated classes, maximizing the margin between them and minimizing the misclassification errors.Duality properties of this optimization problem allow one to extend the methodology to find nonlinear separators by means of kernels.Classification Trees were firstly introduced by Breiman et.al [15], and the decision rule is based on a hierarchical relation among a set of nodes which is used to define paths that lead observations from the root node (highest node in the hierarchical relation), to some of the leaves in which a class is assigned to the data.These paths are obtained according to different optimization criteria over the predictor variables of the training sample.The decision rule comes up naturally, the classes predicted for new observations are the ones assigned to the terminal nodes in which observations fall in.Clearly, the classification rules derived from CTs are easily interpretable by means of the splits that are constructed at the tree nodes.In [15], a greedy heuristic procedure, the so-called CART approach, is presented to construct CTs.Each level of the tree is sequentially constructed: starting at the root node and using the whole training sample, the method minimizes an impurity measure function obtaining as a result a split that divides the sample into two disjoint sets which determine the two descendant nodes.This process is repeated until a given termination criteria is reached (minimum number of observations belonging to a leaf, maximum depth of the tree, or minimum percentage of observations of the same class on a leaf, amongst others).In this approach, the tree grows following a top-down greedy approach, an idea that is also shared in other popular decision tree methods like C4.5 [40] or ID3 [39].The advantage of these methods is that the decision rule can be obtained rather quickly even for large training samples, since the whole process relies on solving manageable problems at each node.Nevertheless, these types of heuristic approaches may not obtain the optimal classification tree, since they look for the best split locally at each node, not taking into account the splits that will come afterwards.Thus, these local branches may not capture the proper structure of the data, leading to misclassification errors in out-of-sample observations.Furthermore, the solutions provided by these methods can result into very deep (complex) trees, resulting in overfitting and, at times, loosing interpretability of the classification rule.This difficulty is usually overcome by pruning the tree as it is being constructed by comparing the gain on the impurity measure reduction with respect to the complexity cost of the tree.
The recent advances on modeling and solving difficult Optimization problems together with the flexibility and adaptability of these models have motivated the use of optimization tools to construct supervised classification methods with a great success [9,16]).In particular, recently, Bertsimas and Dunn [8] introduced the notion of Optimal Classification Trees (OCT) by approaching Classification and Regression Trees under optimization lens, providing a Mixed Integer Linear Programming formulation for its optimal construction.Moreover, the authors proved that this model can be solved for reasonable size datasets, and equally important, that for many different real datasets, significant improvements in accuracy with respect to CART can be obtained.In contrast to the standard CART approach, OCT builds the tree by solving a single optimization problem taking into account (in the objective function) the complexity of the tree, avoiding post pruning processes.Moreover, every split is directly applied in order to minimize the misclassification errors on the terminal nodes, and hence, OCTs are more likely to capture the hidden patterns of the data.
While SVM were initially designed to deal only with bi-class instances, some extensions have been proposed in the literature for multiclass classification.The most popular multiclass SVM-based approaches are One-Versus-All (OVA) and One-Versus-One (OVO).The former, namely OVA, computes, for each class r ∈ {1, . . ., k}, a binary SVM classifier labeling the observations as 1, if the observation is in the class r, and −1 otherwise.The process is repeated for all classes (k times), and then each observation is classified into the class whose constructed hyperplane is the furthest from it in the positive halfspace.In the OVO approach, classes are separated with k 2 hyperplanes using one hyperplane for each pair of classes, and the decision rule comes from a voting strategy in which the most represented class among votes becomes the class predicted.OVA and OVO inherit most of the good properties of binary SVM.In spite of that, they are not able to correctly classify datasets where separated clouds of observations may belong to the same class (and thus are given the same label) when a linear kernel is used.Another popular method is the directed acyclic graph SVM, DAGSVM [1].In this technique, although the decision rule involves the same hyperplanes built with the OVO approach, it is not given by a unique voting strategy but for a sequential number of votings in which the most unlikely class is removed until only one class remains.In addition, apart from OVA and OVO, there are some other methods based on decomposing the original multiclass problem into several binary classification ones.In particular, in [2] and [20], this decomposition is based on the construction of a coding matrix that determines the pairs of classes that will be used to build the separating hyperplanes.Alternatively, other methods such as CS ( [19]), WW ( [44]) or LLW ( [33]), do not address the classification problem sequentially but as a whole considering all the classes within the same optimization model.Obviously, this seems to be the correct approach.In particular, in WW, k hyperplanes are used to separate the k classes, each hyperplane separating one class from the others, using k − 1 misclassification errors for each observation.The same separating idea, is applied in CS but reducing the number of misclassification errors for each observation to a unique value.In LLW, a different error measure is proposed to cast the Bayes classification rule into the SVM problem implying theoretical statistical properties in the obtained classifier.These properties cannot be ensured in WW or CS.
We can also find a quadratic extension based on LLW proposed by [26].In [43], the authors propose a multiclass SVM-based approach, GenSVM, in which the classification boundaries for a problem with k classes are obtained in a (k − 1)-dimensional space using a simplex encoding.Some of these methods have become popular and are implemented in most software packages in machine learning as e1071 [37], scikit-learn [38] or [32].Finally, in the recent work [11] the authors propose an alternative approach to handle multiclass classification extending the paradigm of binary SVM classifiers by construnting a polyhedral partition of the feature space and an assignment of classes to the cells of the partition, by maximizing the separation between classes and minimizing two intuitive misclassification errors.1.2.Contributions.In this paper, we propose a novel approach to construct Classification Trees for multiclass instances by means of a mathematical programming model.Our method is based on two main ingredients: (1) An optimal binary classification tree (with oblique cuts) is constructed in the sense of [8], in which the splits and pruned nodes are determined in terms of the misclassification errors at the leaf nodes; (2) The splits generating the branches of the tree are build by means of binary SVM-based hyperplanes separating fictitious clases (which are also decided by the model), i.e., maximizing separation between classes and minimizing the distance-based misclassification errors.Our specific contributions include: (1) Deriving an interpretable multiclass classification rule which combines two of the most powerful tools in supervised classification, namely OCT and SVM.(2) The classifier is constructed using a mathematical programming model that can be formulated as a Mixed Integer Second Order Cone Programming problem.The classifier is simple to apply and interpretable.(3) Several valid inequalities are presented for the formulation that allow one to strengthen the model and to solve larger size instances in smaller CPU times.(4) An extensive battery of computational experiments on realistic datases from UCI is reported showing that our approach outperforms other decision tree-based methodologies as CART, OCT and OCT-H.
1.3.Paper structure.Section 2 is devoted to fix the notation and to recall the tools that are used to derive our method.In Section 3 we detail the main ingredients of our approach and illustrate its performance on a toy example.
The mathematical programming model that allows us to construct the classifier is given in Section 4, where we include all the elements involved in the model: parameters, variables, objective function and constraints.In Section 5 we report the results of our experiments to assess the performance of our method compared with other tree-shaped classifiers.Finally, Section 6 is devoted to draw some conclusions and future research lines on the topic.

Preliminaries
This section is devoted to introduce the problem under study and to fix the notation used through this paper.We also recall the main tools involved in our proposed approach namely, Support Vector Machines and Optimal Classification Trees.These methods are adequately combined to develop a new method, called Multiclass Optimal Classification Trees with Support Vector Machines based splits (MOCTSVM).
We are given a training sample, X = {(x 1 , y 1 ), . . ., (x n , y n )} ⊆ R p ×{1, . . ., K}, which comes up as the result of measuring p features over a set of n observations (x 1 , . . ., x n ) as well as a label in {1, . . ., K} for each of them (y 1 , . . ., y n ).The goal of a classification method is to build a decision rule so as to accurately assign labels (y) to data (x) based on the behaviour of the given training sample X .
The first ingredient that we use in our approach is the Support Vector Machine method.SVM is one of the most popular optimization-based methods to design a classification rule in which only two classes are involved, usually referred as the positive (y = +1) and the negative class (y = −1).The goal of linear SVM is to construct a hyperplane separating the two classes by maximizing their separation and simultaneously minimizing the misclassification and margin violation errors.Linear SVM can be formulated as the following convex optimization problem: where c is the regularization parameter that states the trade-off between training errors and model complexity (margin), ω is the transpose of the vector ω and • 2 is the Euclidean norm in R p (other norms can also be considered but still keeping similar structural properties of the optimization problem [13]).Note that with this approach, the positive (resp.negative) class will tend to lie on the positive (resp.negative) half space induced by the hyperplane H = {z ∈ R p : ω z + ω 0 = 0}.On the other hand, the popularity of SVM is mostly due to the so call kernel trick.This allows one to project the data onto a higher dimensional space in which a linear separation is performed in a most accurate way with no need of knowing such a space, but just knowing the form of its inner products, and maintaining the computational complexity of the optimization problem (see [17] for further details).
The second method that we combine in our approach is Classification Trees.CTs are a family of classification methods based on a hierarchical relationship among a set of nodes.These methods allow one to create a partition of the feature space by means of hyperplanes that are sequentialy built.CT starts on a node containing the whole sample, that is called the root node, in which the first split is applied.When applying a split on a node (by means of a hyperplane separating the observations, two new branches are created leading to two new nodes, which are referred to as its child nodes.The nodes are usually distinguished into two groups: branch nodes, that are nodes in which a split is applied, and on the other hand the leaf nodes, which are the terminal nodes of the tree.Given a branch node and a hyperplane split in such a node, their branches (left and right) are defined as each of the two halfspaces defined by the hyperplane.The final goal of CT is to construct branches in order to obtain leaf nodes as pure as possible with respect to the classes.In this way, the classification rule for a given observation consists of assigning it to the most popular class of the leaf where it belongs to.
There is a vast amount of literature on CTs since they provide an easy interpretable classification rule.One of the most popular methods to construct CT is known as CART, introduced in [15].CART is a greedy heuristic approach that starts at the root node looking for the split in a single feature that minimizes a given impurity function, creating two new nodes.The same procedure is sequentially applied until a stop criteria is reached (maximal depth of the tree, proportion of observations of a single class in a node, etc).The main advantage of CART is its low computational cost, since nowadays very deep trees can be obtained within a few seconds.However, CART does not guarantee the optimality of the classification tree, in the sense that more accurate trees could be obtained if instead of locally constructing the branches one looks at the final configuration of the leaf nodes.For instance, in Figure 1(left) we show a CT constructed by CART for a biclass problem with maximal depth 2. We draw the classification tree, and also in the top right corner, the partition of the feature space (in this case R 2 ).As can be observed, the obtained classification is not perfect (not all leaf nodes are composed by pure classes) while in this case is not difficult to construct a CT with no classification errors.This situation is caused by the myopic construction done by the CART approach that, at each node only cares on better classification at their children, but not at the final leaf nodes, while subsequent branching decisions clearly affect the overall shape of the tree.
Motivated by this drawback of CART, in [8], the authors propose an approach to build an Optimal Classification Tree (OCT) by solving a single Mathematical Programming problem in which not only single-variable splits are possible but oblique splits involving more than one predictive variable (by means of general hyperplanes in the feature space) can be constructed.In Figure 1(right) we show a solution provided by OCT with hyperplanes (OCT-H) for the same example.One can observe that when splitting the root node (orange branches) a good local split is not obtained (the nodes contain half of the observations in different classes), however, when adding the other two splits, the final leaves only have observations of the same class, resulting in a perfect classification rule for the training sample.Both approaches, OCTs and SVMs can be combined in order to construct classification trees in which the classes separated by the hyperplanes determined in the CT are maximally separated, in the sense of the SVM approach.This idea is not new and has been proven to outperform standard optimal decision trees methods amongst many different biclass classification problems, as for instance, in [12] where the OCTSVM method is proposed.In Figure 2 we show how one could construct OCTs with larger separations between the classes using OCTSVM but still with the same 100% accuracy in the training sample as in OCT-H, but more protected to misclassification in out-sample observations.Nevertheless, as far as we know, the combination of OCT and SVM has only been analyzed for biclass instances.The extension of this method to multiclass settings (more than two classes) is not trivial, since one could construct more complex trees or use a multiclass SVM-based methodology (see e.g.[19,44,33]).However, these adaptations of the classical SVM method have been proved to fail in real-world instances (see [11]).In the rest of the paper we describe a novel methodology to construct accurate multiclass tree-shaped classifiers based on a different idea: constructing CTs with splits induced by bi-class SVM separators in which the classes of the observations at each one of the branch nodes are determined by the model, but adequately chosen to provide small classification errors at the leaf nodes.The details of the approach are given in the next section.

Multiclass OCT with SVM splits
In this section we describe the method that we propose to construct classification rules for multiclass instances, in particular Classification Trees in which splits are generated based on the SVM paradigm.
As already mentioned, our method is based on constructing OCT with SVM splits, but where the classes of the observations are momentarily ignored and only accounted for at the leaf nodes.In order to illustrate the idea under our method, in Figure 3 we show a toy instance with a set of points with four different classes (blue, red, orange and green).First, at the root node (the one in which all the observations are involved), our method constructs a SVM separating hyperplane for two fictitious classes (which have to be also decided).A possible separation could be the one shown in Figure 4, in which the training dataset has been classified into two classes (black and white).This separation allows one to generate two child nodes, the black and the white nodes.At each of these nodes, the same idea is applied until the leaf nodes are reached.In Figure 5 we show the final partition of the feature space according to this procedure.Once the tree is constructed with this strategy, the decision rule comes up naturally as it is usually done in decision trees methods, that is, out of sample observations will follow a path on the tree according to the splits and they will be assigned to the class of the leaf where they lie in (the most represented class of the leaf over the training set).In case a branch is pruned when building the tree, observations will be assigned to the most represented class of the node where the prune took place.

Mathematical Programming Formulation for MOCTSVM
In this section we derive a Mixed Integer Non Linear Programming formulation for the MOCTSVM method described in the previous section.
We assume to be given a training sample X = {(x 1 , y 1 ), . . ., (x n , y n )} ⊆ R p × {1, . . ., K}.We denote by N = {1, . . ., n} the index set for the observations in the training sample.We also consider the binary representation of the labels y as: Moreover, without loss of generality we will assume the features to be normalized, i.e., x 1 , . . ., x n ∈ [0, 1] p .We will construct decision trees with a fixed maximum depth D. Thus, the classification tree is formed by at most T = 2 D+1 − 1 nodes.We denote by τ = {1, . . ., T } the index set for the tree nodes, where node 1 is the root node and nodes 2 D , . . ., 2 D+1 − 1 are the leaf nodes.
For any node t ∈ τ \{1}, we denote by p(t) its (unique) parent node.The tree nodes can be classified in two sets: branching and leaf nodes.The branching nodes, that we denote by τ b , will be those in which the splits are applied.In constrast, in the leaf nodes, denoted by τ l , no splits are applied but is where predictions take place.The branching nodes can be also classified into two sets: τ bl and τ br depending on whether they follow the left or the right branch on the path from their parent nodes, respectively.τ bl nodes are indexed with even numbers meanwhile τ br nodes are indexed with odd numbers.
We define a level as a set of nodes which have the same depth within the tree.The number of levels in the tree to be built is D + 1 since the root node is assumed as the zero-level.Let U = {u 0 , . . ., u D } be the set of levels of the tree, where each u s ∈ U is the set of nodes at level s, for s = 0, . . ., D. With this notation, the root node is u 0 while u D represent the set of leaf nodes.
In Figure 7 we show the above mentioned elements in a 3-depth tree.Apart from the information about the topological structure of the tree, we also consider three regularization parameters that have to be calibrated in the validation process that allow us to find a trade-off between the different goals that we combine in our model: margin violation and classification errors of the separating splitting hyperplanes, correct classification at the leaf nodes and complexity of the tree.These parameters are the following:   The complete list of index sets and parameters used in our model are summarized in Table 1.
4.1.Variables.Our model uses a set of decision and auxiliary variables that are described in Table 2.We use both binary and continuous decision variables to model the MOCTSVM.The binary variables allow us to decide the allocation of observations to the decision tree nodes, or to decide whether a node is splited or not in the tree.The continuous variables allow us to determine the coefficients of the splitting hyperplanes or the misclassification errors (both in th SVM separations or at the leaf nodes).We also use auxiliary binary and integer variables that are useful to model adequately the problem.
In Figure 8 we illustrate the use of these variables in a feasible solution of a toy instance with three classes (red,blue and green).N = {1, . . ., n} Index set for the observations in the training sample.
D Maximal depth of the tree.T = 2 D+1 maximal number of nodes in a D-depth tree.τ = {1, . . ., T } Index set for the set of nodes of the tree.p(t) Parent of node t, for t ∈ τ \{1}.τ b ∈ τ Branching nodes of the tree.
τ l Leaf nodes of the tree.τ bl ∈ τ b Nodes that follow the left branch on the path from their parent nodes.τ br ∈ τ b Nodes whose right branch has been followed on the path from their parent nodes.u s Nodes at level s of the tree, for s = 0, . . ., D. U = {u 0 , . . ., u d } Sets of levels of the tree.The whole set of training observation is considered at the root node (node t = 1).There, the original labels are ignored and to determine the fictitious class of each observation a SVM-based hyperplane is constructed.Such a hyperplane is defined by the coefficients ω 1 ∈ R p and ω 10 ∈ R (hyperplane/line drawn with a dotted line in the picture) and it induces a margin separation ( 2 ω 1 2 ) and misclassification errors e i1 .In the feasible solution drawn in the figure, only three observations induce positive errors (those that are classified either in the margin area or in the opposite side of the hyperplane).Such a hyperplane also determines the splitting rule for the definition of the children of that node.Since the node is split (d 1 = 1), the observations that belongs to the positive side of the hyperplane are assigned to the left node (node t = 2) while those in the negative side are assigned to the right node (node t = 3) through the z-variables.At node t = 2, the same scheme is applied, that is, the hyperplane defined by ω 2 is constructed, inducing SVM-based margin and errors and since d 2 = 1, also the splitting rule applies to define nodes t = 4 and t = 5.At node t = 2, one must control the observations in that node to quantify the misclassifying errors, e i2 , only for those observations in the objective function.Specifically, we only account for these errors for the observations that belong to the node (z i2 = 1) and either belong to the positive (α i2 = 1) or the negative (α i2 = 0) side of the hyperplane.Also, in order to control the complexity of the tree, the h-variables are used to know whether an observation belongs to the node and to the positive side of the SVM-hyperplane.If all observations in a node belong to the positive side of the hyperplane, the variable v assumes the value 0. Otherwise, in case v takes value 1, two situations are possible: 1) there are observations in both sides of the hyperplane (as in node t = 2) inducing a new split (d 2 = 1), and 2) all observations belong to the negative side (as in node t = 3) determining that the tree is pruned at that node (d 3 = 0).
Concerning the leaf nodes, node t = 2 is split into nodes t = 4 and t = 5 and node t = 3, which was decided to be no longer split, is fictitiously split in two leaf nodes, although one of them is empty and the other one receives all the observations of the parent node (node t = 3).The allocation of any leaf node τ l to a class is done through the q-variables (to the most popular class in the node or arbitrarily in case the node has no observations) and the number of misclassified observations is accounted for by the L-variables.

Objective Function.
As already mentioned, our method aims to construct classification trees with small misclassification errors at the leaf nodes, but at the same time with maximal separation between the classes with the SVM-based hyperplanes and minimum distance based errors.
Using the variables described in the previous section, the four terms that are included in the objective functions are the following: Margins of the splitting hyperplanes:: The separating hyperplane of branching node t ∈ τ b has margin 2 ωt 2 .Thus, our method aims to maximize the minimum of these margins.This is equivalent to minimize the maximum among the inverse margins { 1 2 ω t 2 2 : t ∈ τ b } which is represented by the auxiliary variable δ.
Misclassification Errors at the leaf nodes:: Variable L t accounts for the number of misclassified observations in leaf node t, i.e., the number of observations that do not belong to the most represented class in that leaf node.These variables allow us to count the overall number of misclassified observations in the training sample.Therefore, the amount to be minimized by the model is given by the following sum: Distance-based Errors at branching nodes:: Each time a split is added to the tree, a SVM-based hyperplane in which the labels are assigned based on the global convenience of for the overall tree is incorporated.Thus, we measure, at each branching node in τ b , the distance-based errors incurred by the SVM classifier at that split.This amount is measured by the e it variables and is incorporated to the model through the sum: Complexity of the tree: The simplicity of the resulting tree is measured by the number of splits that are done in its construction.Since the d t variable tells us whether node t is split or not, this term is accounted for in our model as: Summarizing, the overall objective function of our model is: Note that the coefficients c 1 , c 2 and c 3 trade-off the misclassification of the training sample, the separation between classes and the complexity of the tree, respectively.These parameters should be carefully calibrated in order to construct simple decision trees with high predictive power, as can be seen in our computational experiments.

4.3.
Constraints.The requirements on the relationships between the variables and the rationale of our model are described through the following constraints that define the mathematical programming model.
First of all, in order to adequately represent the maximum among the inverse margins of the sppliting hyperplanes, we require: Next, we impose how the splits are performed in the tree.To this end, we need to know which observations belong to a certain node t (z-variable) and how these observations are distributed with respect to the two fictitious classes to be separated (α-variables).Gathering all these elements together, we use the following constraints to define the splits of the decision tree: According to this, constraint (C2a) is activated just in case the observation i belongs to the reference class and it is in node t (h it = 1).On the other hand, (C2b) is activated if i is allocated to node t (z it = 1) but it does not belong to the reference class (α it = 0).Therefore, the reference class is located on the positive half space of hyperplane H t , while the other class is positioned in the negative half space, and at the same time, margin violations are regulated by the e it variables.
To ensure the correct behaviour of the above constraints, we must correctly define the z it variables.First, it is required that each observation belongs to exactly one node per level in the tree.This can be easily done by adding the usual assingment constraints to the problem at each of the levels, u ∈ U , of the tree: Furthermore, we should enforce that if observation i is in node t (z it = 1), then observation i must also be in the parent node of t, p(t) (z ip(t) = 1), and also observation i can not be in node t if it is not in its parent node (z ip(t) = 0 ⇒ z it = 0).These implications can be obtained by means of the following constraints: Nevertheless, the way observations descend through the tree needs a further analysis, since at this point they could just randomly define a path in the tree.Whenever an observation i is in the positive half space of the splitting hyperplane at node t, H t , this observation should follow the right branch connecting to the child node of t.Otherwise, in case i is on the negative half space, it should follow the left branch.The knowledge on the side of the splitting hyperplane where an observation belongs to is encoded in the α-variables.Then, in case i lies on the positive half space of H t , α it will never be equal to zero since it would lead to a value of e it greater than one, while e it < 1 is guaranteed in case α it = 1.With the above observations, the constraints that assure the correct construction of the splitting hyperplanes with respect to the side of them where the observations belong to are the following: Constraints (C5a) assure that if observation i is on the parent node of an even node t (z ip(t) = 1), and i lies on the negative half space of H p(t) (α ip(t) = 0), then z it is enforced to be equal to one.As a result, α ip(t) = 0 forces observation i to take the left branch in node t.Note that in case z ip(t) = 1, and at the same time observation i is not in the left child node of t (z it = 0 for i ∈ τ bl ), then α ip(t) = 1, which means that observation i lies on the positive half space of H p(t) .Constraints (C5b) are analogous to (C5a) but allowing to adequately represent right branching nodes.
Moreover, two additional important elements need to be incorporated to complete our model: the tree complexity and the correct definition of misclassified observations.Note that in usual Optimal Classification Trees that do not use SVM-based splits, the complexity can be easily regulated by just imposing ω t 2 2 ≤ M d t (for a big enough M constant) in all the branch nodes, since in case a node is no further branched (d t = 0), the coefficients of the splitting hyperplane are set to zero.However, in our case, in which the splitting hyperplanes are SVM-based hyperplanes, these constraints are in conflict with constraints (C2a) and (C2b), since in case d t = 0 (and therefore ω t = 0) it would not only imply that the coefficients ω t are equal to zero, but also that the distance based errors would be set to the maximum value of 1, i.e., e it = 1 for every observation i in the node, even though these errors would not make any sense since observations would not be separated at the node.To overcome this issue, we consider the auxiliary binary variables h it = z it α it (h it takes value 1 if observation i belongs to node t and lies in the positive half-space of the splitting hyperplane applied at node t) and v t (that takes value zero in case all the points in the node belong to the positive halfspace and one otherwise).The variables are adequatelly defined if the following constraints are incorporated to the model: i∈N where constraints (C6a) and (C6b) are the linearization of the bilinear constraint h it = z it α it .On the other hand, Constraints (C6c) assure that in case v t = 0, then all observations in node t belong to the positive halfspace of H t , and constraints (C6d) assure that if v t = 1 and the tree is pruned at node t (d t = 0), then those observations allocated to node t are placed in the negative halfspace defined by the splitting hyperplane.Thus, it implies that d t takes value one if and only if the observations in node t are separated by H t , and therefore producing an effective split at the node.Finally, in order to adequately represent the L t variables (the ones that measure the number of misclassified observations at the leaf nodes) we use the constraints already incorporated in the OCT-H model in [8].On the one hand, we assign each leaf node to a single class (the most popular class of the observations that belong to that node).We use the binary variable q kt to check whether leaf node t ∈ τ l is assigned to class k = 1, . . ., K. The usual assignment constraints are considered to assure that each node is assigned to exactly one class: The correct definition of the variable L t is then guaranteed by the following set of constraints: These constraints are activated if and only if q kt = 1, i.e., if observations in node t are assigned to class k.In such a case, since L t is being minimized in the objective function, L t will be determined by the number of training observations in node t except those whose label is k, i.e., the number of missclasified observations in node t according to the k-class assignment.
Observe that the constant n in (C8) can be decreased and fixed to the maximum number of misclassified observations in the training sample.This number coincide with the difference between the number of observations in the training sample (n) and the number of observations in the most represented class in the sample.
Summarizing the above paragraphs, the MOCTSVM can be formulated as the following MINLP problem:

4.4.
Strengthening the model.The MINLP formulation presented above is valid for our MOCTSVM model.However, it is a computationally costly problem, and although it can be solved by most of the off-the-shelf optimization solvers (as Gurobi, CPLEX or XPRESS), it is able to solve optimally only small to medium size instances.To improve its performance, the problem can be strengthen by means of valid inequalities which allows one to reduce the gap between the continuous relaxation of the problem and its optimal integer solution, being then able to solve larger instances in smaller CPU times.In what follows we describe some of these inequalities that we have incorporated to the MINLP formulation: • If observations i and i belongs to different nodes, they cannot be assigned to the same node for the remainder levels of the tree: • If leaf nodes t and s are the result of proper splitting hyperplanes, then, both nodes cannot be assigned to the same class: • Variable α it is enforced to take value 0 in case z it = 0: • Variable h it is not allowed to take value one if α it takes value zero: • There should be at least a leaf node to which each class is assigned to (assuming that each class is represented in the training sample).It also implies that the number of nodes to which a class is assigned is bounded as: In order to reduce the dimensionality and also to avoid symmetries of the MINLP problem, one can also apply some heuristic strategies to fix the values of some of the binary variables in a preprocessing phase.For instance, we choose i 0 ∈ arg max i∈N | {i ∈ N : a i − a i ≤ ε and y i = y i } | , that is, the observation with a maximum amount of observations in the same class close enough to it.Then, we fix to one all the variables z it 0 with i ∈ {i ∈ N : a i −a i ≤ ε and y i = y i }, being t 0 the first left leaf node of the tree (and fixing to zero the allocation of these points to the rest of the leaf nodes).Analogously, we fix also to one all the z-variables allocating observations to the last right leaf node of the tree for a subset of observations in the same class which are far enough from i 0 , i.e., z it f = 1 for all i ∈ {i ∈ N : a i f − a i ≤ ε and y i f = y i } where i f = arg max i∈N a i 0 − a i (and fixing to zero the allocation of these points to the rest of the leaf nodes).

Experiments
In order to analyze the performance of this new methodology we have run a series of experiments among different real datasets from UCI Machine learning Repository [3].We have chosen twelve datasets with number of classes between two and seven.The dimension of these problems is reported in Table 3 by the tuple (n : number of observations, p : number of features, K : number of classes).
We have compared the MOCTSVM model with three other Classification Tree-based methodologies, namely CART, OCT and OCT-H.The maximum tree depth, D, for all the models was equal to 3, and the minimum number of observations per node in CART, OCT and OCT-H was equal to the 5% of the training size.
We have performed, for each instance a 5-fold cross validation scheme, i.e., datasets have been splited into five random train-test partitions where one of the folds is used to build the model and the remaining are used to measure the accuracy of the predictions.Moreover, in order to avoid taking advantage of beneficial initial partitions, we have repeated the cross-validation scheme five times for all the datasets.
The CART method was coded in R using the rpart library.On the other hand, MOCTSVM, OCT and OCT-H were coded in Python and solved using the optimization solver Gurobi 8.1.1.All the experiments were run on a PC Intel Xeon E-2146G processor at 3.50GHz and 64GB of RAM.A time limit of 300 seconds was set for training the training folds.Although not all the problems were optimally solved within the time limit, as can be observed in Table 3, the results obtained with our model already outperform the other methods.
In order to calibrate the parameters of the different models that regulate the complexity of the tree, we have used different approaches.On the one hand, for CART and OCT, since the maximum number of nodes for such a depth is 2 D − 1 = 7, one can search for the tree with best complexity by searching in the grid 1, . . ., 2 D − 1 of possible active nodes.For OCT-H, we search the complexity regularization factor in the grid 10 i : i = −5, . . ., 5 .Finally, in MOCTSVM we used the same grid 10 i : i = −5, . . ., 5 for c 1 and c 2 , and 10 i : i = −2, . . ., 2 for c 3 .
In Table 3 we report the results obtained in our experiments for all the models.The first column of the table indicates the identification of the dataset (together with its dimensionality).Second, for each of the methods that we have tested, we report the obtained average test accuracy and the standard deviation.We have highlighted in bold the best average test accuracies obtained for each dataset.
As can be observed, our method clearly outperforms in most of the instances the rest of the methods in terms of accuracy.Clearly, our model is designed to construct Optimal Classification Trees with larger separations between the classes, which results in better accuracies in the test sample.The datasets Australian and BalanceScale obtain their better results with OCT-H, but, as can be observed, the differences with respect the rest of the methods are tiny (it is the result of correctly classifying in the test sample just a few more observations than the rest of the methods).In that case, our method gets an accuracy almost as good as OCT-H.In the rest of the datasets, our method consistently gets better classifiers and for instance for Dermatology the difference with respect to the best classifiers among the others ranges in [4%, 19%], for Parkinson the accuracy with our model is at least 6% better than the rest, for Wine we get 5% more accuracy than OCTH and 10% more than CART and for Zoo the accuracy of our model is more than 17% greater than the one obtained with CART.
Concerning the variability of our method, the standard deviations reported in Table 3 show that our results are, in average, more stable than the others, with small deviations with respect to the average accuracies.This behaviour differs from the one observed in CART or OCT, where larger deviations are obtained, implying that the accuracies highly depends of the test folder where the method is applied.

Conclusions and Further Research
We have presented in this paper a novel methodology to construct classifiers for multiclass instances by means of a Mathematical Programming model.The proposed method outputs a classification tree were the splits are based on SVMbased hyperplanes.At each branch node of the tree, a binary SVM hyperplane is constructed in which the observations are classified in two fictitious classes (the original classes are ignored in all the splitting nodes), but the global goodness of the tree is measured at the leaf nodes, where misclassification errors are minimized.Also, the model minimizes the complexity of the tree together with the two elements that appear in SVM-approaches: margin separation and distance-based misclassifying errors.We have run an extensive battery of computational experiments that shows that our method outperforms most of the Decision Tree-based methodologies both in accuracy and stability.Future research lines on this topic include the analysis of nonlinear splits when branching in MOCTSVM, both using kernel tools derived from SVM classifiers or specific families of nonlinear separators.This approach will result into more flexible classifiers able to capture the nonlinear trends of many real-life datasets.Additionally, we also plan to incorporate features selection in our method in order to construct highly predictive but also more interpretable classification tools.

Figure 1 .
Figure 1.Example of a CT obtained with CART (left) and OCT-H (right) approaches for the same instance.

Figure 2 .
Figure 2. Example of a CT obtained with OCTSVM.

Figure 4 .
Figure 4. Root split on the 4-class classification problem

Figure 5 .
Figure 5. Child node splits on the 4-class classification problem higheleted as the fictitious classes decided in our model.

Figure 6 .
Figure 6.Child node splits on the 4-class classification problem with their original labels (colors).

Figure 7 .
Figure 7. Elements in a depth D = 3 tree.

c 1 :
unit misclassification cost at the leaf nodes.

c 2 :
unit distance based misclassification errors for SVM splits.

c 3 :
unit cost for each splitting hyperplane introduced in the tree.

Figure 8 .
Figure 8. Illustration of the sets of variables used in our model in a toy example.

1
Unit misclassification cost.c 2 Unit distance based missclassification errors for SVM splits.c 3 Unit cost for splitting hyperplanes.

Table 1 .
Index sets and parameters used in our model.Continuous Decision Variables ω t ∈ R p Coefficients of the separating hyperplane of node t. ω t 0 ∈ R Intercept of the separating hyperplane of node t. e it ∈ R + Misclassification error of observation i at node t. δ ∈ R + Inverse of the minimum margin between splitting hyperplanes.Binary Decision Variables z it ∈ {0, 1} Is one if observation i belongs to node t and zero otherwise.d t ∈ {0, 1} Is one if a split is applied at node t and zero otherwise.Auxiliary Variables L t ∈ Z + Number of misclassified observations at leaf node t. α it ∈ {0, 1} Is one if observation i belongs to the reference fictitious class in node t and zero otherwise.h it ∈ {0, 1} Is one if observation i is in node t and lies on the positive half space of the hyperplane of node t, and zero otherwise.v t ∈ {0, 1} Is one if not all observations in node t lie on the positive half space of the hyperplane in node t and zero otherwise.q kt ∈ {0, 1} Is one if class k is the most represented one in leaf node t and zero otherwise.

Table 2 .
Summary of the variables used in our model.