# A Geometric Viewpoint of the Selection of the Regularization Parameter in Some Support Vector Machines

## Abstract

The regularization parameter of support vector machines is intended to improve their generalization performance. Since the feasible region of binary class support vector machines with finite dimensional feature space is a polytope, we note that classifiers at vertices of this unbounded polytope correspond to certain ranges of the regularization parameter. This reduces the search for a suitable regularization parameter to a search over the (finitely many) vertices of this polytope. We propose an algorithm that identifies neighbouring vertices of a given vertex and thereby identifies the classifiers corresponding to the set of vertices of this polytope. A classifier can then be chosen from them based on a suitable test error criterion. We illustrate our results with an example which demonstrates that the regularization path can be complicated: a portion of the path is sandwiched between two finite intervals of the path, each generated by separate sets of vertices and edges.

## Keywords

Support vector machines · Regularization path · Polytopes · Neighbouring vertices · Prediction error · Parameter tuning · Linear programming

## 1 Introduction

A classical learning problem is that of binary classification wherein the learner is trained on a given data set (training set) and predicts the class of a new data point. Let the *n* point training set be \(\lbrace (\mathbf {x}_i, y_i)\rbrace _{i = 1}^n\), where \(\mathbf {x}_i \in \mathbb {R}^m\) is a vector of *m* features and \(y_i \in \lbrace -1, +1 \rbrace \) is the label of \(\mathbf {x}_i, \; i \in \{1, \cdots , n\}\). We consider the class of linear classifiers, \((\mathbf {w}, b)\), with \(\mathbf {w} \in \mathbb {R}^m\) and \(b \in \mathbb {R}\). The classifier predicts the class of data point \(\mathbf {x}\) as \(-1\) if \(\mathbf {w} \cdot \mathbf {x} + b < 0\) and predicts the class as \(+1\) otherwise, i.e., the predicted class for \(\mathbf {x}\) is sign\((\mathbf {w} \cdot \mathbf {x} + b).\) Such classifiers are called linear Support Vector Machines (SVMs).
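The prediction rule above can be sketched in a few lines. This is only an illustration: the function name `predict` and the plain-list representation of \(\mathbf {w}\) and \(\mathbf {x}\) are assumptions made here, not part of the paper.

```python
def predict(w, b, x):
    """Return the predicted class of x under the linear classifier (w, b)."""
    # Decision value w . x + b
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    # As in the text: class -1 if w . x + b < 0, and class +1 otherwise,
    # so a point lying exactly on the classifier is assigned +1.
    return -1 if score < 0 else 1
```

For instance, with `w = [1.0, -1.0]` and `b = 0.5`, the point `[0.0, 2.0]` has decision value \(-1.5\) and is assigned to class \(-1\).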

Among finite dimensional models for binary class prediction, polynomial kernels form an important class. These are quite popular in natural language processing (NLP) because fast linear SVM methods can be applied to the polynomially mapped data and can achieve accuracy close to that of highly nonlinear kernels [2].

The purpose of the regularization parameter \(\lambda \) is to improve the generalization error of the SVM. It is known that a proper choice is needed; see, for example, Figure 4 of [7]. The purpose of this paper is to investigate this choice in fairly basic SVMs by considering the polyhedral nature of the feasible region of the above SVM QP.

The main results of this paper are summarized as follows. We characterize, to the best of our knowledge for the first time, the polytope, *P*, associated with the feasible space of (1), in terms of its vertices and give an algorithm that lists all its vertices. We notice that, starting from a vertex, the path is generated by vertices and edges (one-dimensional facets) as well as facets of higher dimensions. The regularization parameter, \(\lambda \), for any classifier can be identified by linear programs; for classifiers corresponding to vertices, the set of such \(\lambda \) values is an interval. SVMs are generally assessed in terms of their performance on the \(0-1\) loss criterion. We find that the vertex classifiers dominate other boundary classifiers on a single test point under this \(0-1\) loss function. This means that, for the SVMs that we consider, the choice of \(\lambda \) as a design parameter can be replaced by a search among the finite but large number of vertices of *P*.

Different approaches have been employed to select an optimal regularization parameter, \(\lambda \), for the SVM QP. The task of tracing an entire regularization path was pioneered by [7]. The sets *E*, *L* and *R* of [7] in the feature space \((\mathbf {w}, b)\) correspond to a vertex *v* of the polytope *P* in the lifted space in \((\mathbf {w}, b, \varvec{\xi })\). Another approach [3] considers finding the optimal parameters for SVM based classifiers with kernel functions that could be infinite dimensional, where \(\lambda \) is included in the parameter vector. Bounds on the test error are obtained, based on the leave-one-out testing scheme and these are differentiable with respect to the parameter vector. A gradient based scheme is proposed for finding optimal parameters. Apart from tracing the path, various other aspects have been studied regarding the design of the SVMs, such as the feature selection problem [3, 10].

Note that, to trace the path, [7] use the dual of the SVM QP to study the trajectories of the primal and dual variables as a function of the regularization parameter, whereas the polytope considered in this paper resides in the primal space itself. As a consequence, we need to search among a finite, albeit large, set of vertices. Unlike [3], we restrict our analysis to the case of finite dimensional kernels, which can be handled using fast algorithms.

## 2 The Polytope of the Feasible Region of SVMs

First we notice that the feasible region is unbounded; hence it admits a Minkowski decomposition into a base polytope, *P*, and a recession cone. We want to concentrate on characterizing *P* in terms of its vertices and more importantly, the role of these vertices in the regularization path of the SVM.

### **Theorem 1**

For a given \(\lambda \ge 0\), the optimal point (the classifier for the SVM) lies on the boundary of the polytope *P*.

### *Proof*

Consider the unconstrained problem with the same objective function of SVM QP: \(\lambda ||\mathbf {w}|| + \sum _i \xi _i \). This optimization is separable into two optimization problems: \(\min _{\mathbb {R}^m} \lambda ||\mathbf {w}||\) and \(\min _{\mathbb {R}^n} \sum _i \xi _{i}\). While the first one has the optimal value zero at \(\mathbf {w} = 0\), the second one is unbounded. Hence the unconstrained problem has an unbounded value, whereas the SVM QP has a finite non-negative optimal value. The SVM QP is a convex minimization problem and hence its finite optimal solution will lie on the boundary of its feasible region, the polytope *P*. \(\square \)

### **Theorem 2**

A classifier on the vertex of the polytope dominates a boundary classifier, i.e., a classifier corresponding to an edge or a facet, on \(0-1\) loss function.

### *Proof*

*Remark 1.* The above result was shown for a single test point. When we have a collection of the points, we will have a ‘dominating set’ of vertices, which may or may not lie on the \(\lambda \)-path.

### 2.1 Characterization of the Vertices in Terms of Active Constraints

We recall the following (see [1, 9], etc.): For a polytope in \(\mathbb {R}^k\), a vertex is a point of zero dimension. A vertex in \(\mathbb {R}^k\) can be identified as a solution of *k* linearly independent linear equations. A vertex is an extreme point of the polytope, and cannot be obtained as a convex combination of any two distinct points. An edge is a facet of dimension one and is a convex combination of two vertices of the polytope. A facet of dimension two is a convex combination of three vertices and so forth. We define a vertex classifier as a classifier corresponding to a vertex on the polytope of the feasible region of the standard SVM model. The edge and facet classifiers are defined in a similar fashion. This terminology is used in the rest of the paper.

The dimension of \((\mathbf {w}, b, \varvec{\xi })\) is \((m+ 1 + n)\) and we have *n* linear inequalities with *n* positively constrained variables, \(\varvec{\xi } = (\xi _1, \ldots , \xi _n)\). (We assume that \((m+1) < n\).) Rewriting them as *n* equalities with positive slack variables \(s_i\), we get a set of 2*n* linear constraints whose intersection gives us a polytope as the feasible region. Hence, if at least \((m+n+1)\) of these 2*n* constraints are active and are linearly independent, the resulting unique solution is a vertex.

At a vertex *v*, then, for a given \(i \in \{1, \cdots , n\}\), if \(\xi _i - s_i \ne 0\), then only one of \(\xi _i\) or \(s_i\) is non-zero. It can be noted that,

$$
s_i - \xi _i = \begin{cases}
(-\infty , -2) & \text{if } \mathbf {x}_i \text{ is misclassified by } (\mathbf {w},b) \text{ and outside the margin} \\
-2 & \text{if } \mathbf {x}_i \text{ is misclassified by } (\mathbf {w},b) \text{ and on the margin} \\
(-2 , -1) & \text{if } \mathbf {x}_i \text{ is misclassified by } (\mathbf {w},b) \text{ and within the margin} \\
-1 & \text{if } \mathbf {x}_i \text{ is correctly classified by } (\mathbf {w},b) \text{ and on the classifier} \\
(-1, 0) & \text{if } \mathbf {x}_i \text{ is correctly classified by } (\mathbf {w},b) \text{ and within the margin} \\
0 & \text{if } \mathbf {x}_i \text{ is correctly classified by } (\mathbf {w},b) \text{ and on the margin, i.e., } \mathbf {x}_i \text{ is a support vector} \\
(0, \infty ) & \text{if } \mathbf {x}_i \text{ is correctly classified by } (\mathbf {w},b) \text{ and outside the margin.}
\end{cases}
$$
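The case analysis above can be checked numerically with a small sketch. It assumes that the *i*-th constraint of the SVM QP has the standard soft-margin form \(y_i(\mathbf {w} \cdot \mathbf {x}_i + b) + \xi _i - s_i = 1\), which is consistent with the listed values but is not spelled out in this section; the function names `slack_and_surplus` and `category` are hypothetical.

```python
def slack_and_surplus(w, b, x, y):
    """Return (xi, s) for one training point.

    Assumes the constraint y * (w . x + b) + xi - s = 1 with xi, s >= 0
    and at most one of xi, s non-zero.
    """
    margin = y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
    diff = margin - 1.0      # this equals s - xi
    xi = max(0.0, -diff)     # slack: positive when the margin constraint is violated
    s = max(0.0, diff)       # surplus: positive when the point is beyond the margin
    return xi, s


def category(w, b, x, y, tol=1e-12):
    """Map s - xi to the seven cases listed in the text."""
    xi, s = slack_and_surplus(w, b, x, y)
    d = s - xi
    if d < -2 - tol:
        return "misclassified, outside the margin"
    if abs(d + 2) <= tol:
        return "misclassified, on the margin"
    if d < -1 - tol:
        return "misclassified, within the margin"
    if abs(d + 1) <= tol:
        return "on the classifier"
    if d < -tol:
        return "correctly classified, within the margin"
    if abs(d) <= tol:
        return "support vector"
    return "correctly classified, outside the margin"
```

With the one-dimensional classifier `w = [1.0]`, `b = 0.0` and label `y = 1`, the point `x = [1.0]` gives \(s_i - \xi _i = 0\) (a support vector), while `x = [-0.5]` gives \(-1.5\) (misclassified and within the margin).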

We can identify three different categories of classifiers based on the above values of \((s_i - \xi _i)\), as mentioned in the following theorem:

### **Theorem 3**

- (i) The classifiers for which the points can be within, on or outside the margin. Thus, \((s_i - \,\xi _i) \in (-\infty , \infty ) \; \forall i \in \lbrace 1, \ldots , n \rbrace \) for such classifiers.
- (ii) The \((\mathbf {0}, 1)\) and \((\mathbf {0}, -1)\) classifiers, for which \(\xi _i\) is either 0 or 2 and \((s_i - \,\xi _i) \in \lbrace 0, -2 \rbrace \; \forall i \in \lbrace 1, \ldots , n \rbrace .\)
- (iii) The classifiers for which all the points are within or on the margins. For such classifiers \((s_i - \,\xi _i) \in (-2, 0) \; \forall i \in \lbrace 1, \ldots , n \rbrace .\)

We have another important characterization of a vertex of the polytope in terms of support vectors of the classifier.

### **Theorem 4**

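A vertex classifier of the polytope *P* has at least one correctly classified point on its margin, also known as a support vector. Therefore, for a vertex \(v \in P\), we have \(\xi _i = s_i = 0\) for at least one \(i \in \lbrace 1, \ldots , n \rbrace \).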

### *Proof*

*n* constraints in (1) have to be active. Thus, for any set of at least \((m + 1 + n)\) active constraints, \(I^*\)

Rewriting the constraints in (1) in a matrix form, we have: \(A \cdot (\mathbf {w}, b, \mathbf {\varvec{\xi }}) \ge b,\) where *A* corresponds to the coefficient matrix of the two sets of constraints of the SVM QP and \(b = \begin{pmatrix} \mathbf {1} \\ \mathbf {0} \end{pmatrix}.\)

Given a set of active constraints, \(I^*(v)\), we can find the corresponding vertex \(v = (\mathbf {w}, b, \mathbf {\varvec{\xi }})\) by solving the following equation:

$$A_{I^*(v)} \cdot (\mathbf {w}, b, \mathbf {\varvec{\xi }}) = b_{I^*(v)}, \tag{4}$$

where \(A_{I^*(v)}\) is the submatrix of *A* with rows corresponding to \(I^*(v)\) and \(b_{I^*(v)}\) is the corresponding part of the right-hand side. Such a vertex corresponds to a basic solution of the feasible region. It is a feasible vertex if and only if it satisfies (1). Equivalently, it is not a feasible point if some \(s_i\) is strictly negative. A feasible vertex corresponds to a basic feasible solution [1].

Given an active set of constraints, \(I^*(v),\) Algorithm Vertex(*Active*), as described in the technical report [8], computes the vertex *v* corresponding to this active constraint set, if it exists and is feasible.
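The idea of recovering a vertex from an active set can be sketched as follows. This is not the Algorithm Vertex(*Active*) of [8], which is not reproduced here; it is a minimal illustration that assumes constraints of the form \(y_i(\mathbf {w} \cdot \mathbf {x}_i + b) + \xi _i \ge 1\) and \(\xi _i \ge 0\), with rows `0..n-1` indexing the margin constraints and rows `n..2n-1` the non-negativity constraints. The names `constraint_system` and `vertex_from_active` are hypothetical.

```python
import numpy as np


def constraint_system(X, y):
    """Build A and the right-hand side so that feasibility reads A z >= rhs,
    with z = (w, b, xi). Row ordering is an assumption of this sketch."""
    n, m = X.shape
    margin = np.hstack([y[:, None] * X, y[:, None], np.eye(n)])  # y_i (w.x_i + b) + xi_i >= 1
    nonneg = np.hstack([np.zeros((n, m + 1)), np.eye(n)])        # xi_i >= 0
    A = np.vstack([margin, nonneg])
    rhs = np.concatenate([np.ones(n), np.zeros(n)])
    return A, rhs


def vertex_from_active(X, y, active, tol=1e-9):
    """Return the feasible vertex determined by the active rows, or None."""
    A, rhs = constraint_system(X, y)
    A_act, rhs_act = A[active], rhs[active]
    # A vertex needs m + 1 + n linearly independent active constraints.
    if np.linalg.matrix_rank(A_act) < A.shape[1]:
        return None
    z, *_ = np.linalg.lstsq(A_act, rhs_act, rcond=None)
    # With more active rows than unknowns the system may be inconsistent.
    if not np.allclose(A_act @ z, rhs_act, atol=1e-8):
        return None
    # Basic solution found; it is a vertex only if all constraints hold.
    if np.any(A @ z < rhs - tol):
        return None
    return z
```

For the toy data `X = [[-1], [1], [2]]`, `y = [-1, 1, 1]`, the active set `[0, 1, 3, 4, 5]` yields the vertex \((w, b, \varvec{\xi }) = (1, 0, 0, 0, 0)\), whereas `[0, 2, 3, 4, 5]` yields a basic solution that violates the second margin constraint and is therefore rejected.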

### 2.2 Neighbours of a Vertex of the Polytope, *P*

Given that a vertex is characterized by the set of constraints, \(I^*(v)\), that are active at that point, we can find a neighbouring vertex \(\tilde{v}\) by changing \(I^*(v)\) in the following way:

Replace an active constraint by one that is currently inactive at *v*. The constraint \(i \in I^*(v)\) to leave the active set is one such that \(\xi _i = s_i = 0.\) The existence of such a constraint \(i \in I^*(v)\) is guaranteed by Theorem 4. The incoming constraint \(j \in I^*(v)\) is chosen from the set \( \lbrace j \; \vert \; (s_j > 0 \, \& \, \xi _j = 0) \text { or } (\xi _j > 0 \, \& \, s_j = 0) \rbrace \) at *v*. At the neighbour \(\tilde{v}\), we set \(\xi _j = s_j = 0\) to ensure a support vector for \(\tilde{v}.\)

If the solution to (4) with these new active constraints is feasible, then it is a valid neighbour of *v*. Note that if the given vertex *v* is degenerate, the above change in the active constraint set \(I^*(v)\) can lead to another degenerate vertex and hence not to a neighbouring vertex. Such degenerate vertices need to be ignored in the list of neighbours of *v*.

Such a careful updating of the set of neighbouring vertices avoids potential cycling while listing the set of all vertices of the polytope *P*. The set of all such neighbours of given *v* is denoted by *N*(*v*) and can be found as in Algorithm Neighbour(*v*), described in our technical report [8].
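The swap-and-solve idea can be illustrated on a generic polyhedron \(\lbrace z : Az \ge \text {rhs} \rbrace \). The sketch below is not Algorithm Neighbour(*v*) of [8]: it tries every single swap of an active row for an inactive one, which is a simpler rule than the support-vector-based choice described above, and the names `solve_active` and `neighbours` are hypothetical.

```python
import numpy as np
from itertools import product


def solve_active(A, rhs, active, tol=1e-9):
    """Solve A[active] z = rhs[active]; return z only if it is a feasible vertex."""
    rows = sorted(active)
    dim = A.shape[1]
    # A vertex needs `dim` linearly independent active constraints.
    if len(rows) != dim or np.linalg.matrix_rank(A[rows]) < dim:
        return None
    z = np.linalg.solve(A[rows], rhs[rows])
    # Feasibility: every constraint A z >= rhs must hold.
    return z if np.all(A @ z >= rhs - tol) else None


def neighbours(A, rhs, active, tol=1e-9):
    """Map each neighbouring active set (one constraint swapped) to its vertex."""
    active = set(active)
    inactive = set(range(A.shape[0])) - active
    v = solve_active(A, rhs, active, tol)
    found = {}
    for leave, enter in product(sorted(active), sorted(inactive)):
        candidate = (active - {leave}) | {enter}
        z = solve_active(A, rhs, candidate, tol)
        # Skip infeasible swaps and swaps that return to the same point,
        # which can happen at a degenerate vertex.
        if z is None or (v is not None and np.allclose(z, v)):
            continue
        found[frozenset(candidate)] = z
    return found
```

On the unit square, written as \(x \ge 0,\ y \ge 0,\ -x \ge -1,\ -y \ge -1\), the vertex \((0, 0)\) with active rows `{0, 1}` has exactly two neighbours under this rule, \((1, 0)\) and \((0, 1)\); the other two swaps give singular systems and are discarded.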

### 2.3 Vertices of the Polytope, *P*

All the vertices of the polytope can then be listed by traversing *P* using Algorithm Neighbour(*v*) [8] as required, which in turn calls procedure Vertex(*Active*) [8] with \(I^*(v)\).

## 3 The Regularization Path

As the optimal classifiers for the SVM QP are on the boundary of the polytope, by Theorem 1, the set of classifiers given by the regularization path is a subset of the set of vertices and related edges of the polytope of the feasible region. Since the classifier is eventually chosen by the 0–1 loss function, one can argue, by using this loss in the SVM design phase itself, that vertex classifiers on the path dominate those at the related edges.

However, for some set of test points, the dominating vertex classifier (as in Theorem 2) may or may not be on the path (see the example in Sect. 4). In the following discussion, we focus on the classifiers at vertices that generate some portions of the regularization path.

Before describing the procedure to identify the vertices on the path traced by the parameter \(\lambda \), we mention a few results which will be used by this procedure.

Using the fact that, at optimality, the gradient of the objective function in a convex setting needs to be a member of the normal cone at that point, and that the KKT system gives an algebraic representation of this geometric phenomenon, we have the following:

### **Theorem 5**

The next result says that a portion of the \(\lambda \)-path of a given SVM is partitioned by intervals corresponding to some of the vertices and edges of the polytope *P*. Also, we can see that the \(\lambda \) value for an edge classifier is a weighted harmonic mean of the bounds on the \(\lambda \) intervals of the related vertices. Please refer to our technical report [8] for details of the proof.

### **Theorem 6**

- (i) For a classifier \((\mathbf {w}, b)\) which is a vertex \(v := (\mathbf {w}, b, \mathbf {\varvec{\xi }})\) on the polytope *P*, the range \([\lambda _{l}, \lambda _{u}]\) of \(\lambda \) values for which *v* is optimal is an interval in \(\mathbb {R}\). In fact, this range is always finite since the gradient of the objective is never parallel to the generators of the normal cone.
- (ii) For a classifier on an edge or a facet of *P*, the feasible \(\lambda \) value is a singleton, i.e., \(\lambda _{l} = \lambda _{u}.\) Specifically, for an edge point classifier, \(e_{v_1, v_2}\), lying between two on-the-path vertices \(v_1\) and \(v_2\) such that \(\lambda _u(v_1) < \lambda _l(v_2)\), we have
  $$\begin{aligned} \lambda (e_{v_1, v_2}) = \frac{\lambda _u(v_1) \lambda _l(v_2)}{\beta \lambda _l(v_2) + (1 - \beta ) \lambda _u(v_1)}. \end{aligned} \tag{7}$$
- (iii) A portion of the regularization path corresponding to the vertex-edge boundary of *P* can be decomposed into intervals corresponding to vertices on the path and the edges between them. This is so because, for an edge point classifier, \(e_{v_1, v_2}\), as described above, we have
  $$\begin{aligned} \lambda (e_{v_1, v_2}) \in (\lambda _u(v_1), \lambda _l(v_2)). \end{aligned} \tag{8}$$
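Equation (7) can be rewritten as \(1/\lambda (e_{v_1, v_2}) = \beta /\lambda _u(v_1) + (1 - \beta )/\lambda _l(v_2)\), which makes the inclusion (8) immediate whenever \(0 < \beta < 1\). The sketch below evaluates (7) directly; the symbol \(\beta \) is not defined in this section, so its interpretation as a weight in \((0, 1)\) locating the classifier on the edge is an assumption, as is the function name `edge_lambda`.

```python
def edge_lambda(lam_u_v1, lam_l_v2, beta):
    """Evaluate Eq. (7) for an edge classifier between vertices v1 and v2.

    beta is assumed to lie strictly between 0 and 1; beta -> 1 recovers
    lambda_u(v1) and beta -> 0 recovers lambda_l(v2).
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta is assumed to lie strictly between 0 and 1")
    if not 0.0 < lam_u_v1 < lam_l_v2:
        raise ValueError("expected 0 < lambda_u(v1) < lambda_l(v2)")
    return (lam_u_v1 * lam_l_v2) / (beta * lam_l_v2 + (1.0 - beta) * lam_u_v1)
```

For example, with \(\lambda _u(v_1) = 0.002712\) and \(\lambda _l(v_2) = 0.002776\) (the first two rows of Table 1) and `beta = 0.5`, the returned value is the ordinary harmonic mean of the two bounds and lies strictly between them.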

To trace the regularization path, we solve the SVM QP for \(\lambda = 0\) which is a linear program. This gives us a classifier corresponding to a vertex in *P*, say \(v_0\). The range of \(\lambda \) for \(v_0\) can be obtained via the solutions to the linear programs: (5) and (6). We know that \(\lambda \) traces a continuous path along the boundary of *P*, so the next vertex on the path will be a neighbour of \(v_0\), found using the procedure Neighbour(*v*) [8]. Many of the neighbouring vertices are not optimal classifiers for any value of \(\lambda \) and hence, the LPs (5) and (6) become infeasible at such neighbouring vertices. We will have one such neighbour for which there exists an interval of \(\lambda \) and hence it becomes the next vertex on the \(\lambda \)-path. Then we search amongst the neighbours of the current vertex, to find the next vertex on the path. This procedure continues iteratively till all the neighbours of the current vertex become infeasible for the path. Such a vertex corresponds to the last but one vertex on the path.
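The iterative search just described can be sketched as a simple loop. The callbacks `neighbours` and `lambda_interval` are hypothetical stand-ins for the procedure Neighbour(*v*) [8] and for the linear programs (5) and (6), neither of which is reproduced here; `lambda_interval` is assumed to return `None` when those LPs are infeasible, and \(\lambda \) is assumed to increase along the path.

```python
def trace_path(v0, neighbours, lambda_interval):
    """Follow the lambda-path through first-generation neighbours only.

    neighbours(v)      -> iterable of vertices adjacent to v (hypothetical callback)
    lambda_interval(v) -> (lambda_l, lambda_u), or None if v is optimal for no lambda
    """
    path = []
    visited = {v0}
    current, interval = v0, lambda_interval(v0)
    while interval is not None:
        path.append((current, interval))
        nxt = None
        for cand in neighbours(current):
            if cand in visited:
                continue
            cand_interval = lambda_interval(cand)
            # Accept a neighbour whose interval starts at or after the current one ends.
            if cand_interval is not None and cand_interval[0] >= interval[1]:
                nxt = (cand, cand_interval)
                break
        if nxt is None:
            break  # no neighbour is optimal for any lambda
        current, interval = nxt
        visited.add(current)
    return path
```

On a toy graph where `a` is adjacent to `b` and `c`, `b` is adjacent to `d`, and only `a`, `c` and `d` have \(\lambda \) intervals, the loop returns `a` followed by `c` and then stops, even though `d` is optimal for some \(\lambda \). Reaching such a vertex would require the exhaustive search over later-generation neighbours.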

Yet, there can be instances, as we will show in our example, where the path does not retain continuity along the vertex-edge boundary of the polytope. This happens when none of the neighbours of the current vertex are optimal for any value of \(\lambda \). This forces us to search exhaustively for next generation neighbours which are optimal for some value of the parameter \(\lambda \).

## 4 An Illustrative Example

The purpose of this example is to illustrate two aspects of the regularization path: a contiguous portion of the path composed of vertices and edges of the polytope, and another portion on facets of two or more dimensions. Interestingly, this second portion is sandwiched between intervals generated by some vertices and edges. We tabulate the \(\lambda \) intervals for the vertices of *P* on the \(\lambda \)-path for a binary classification SVM model. We consider a training set with 50 points drawn from two bivariate normal distributions. The two classes have means (0, 0) and (1, 0) and the same covariance matrix (0.5, 0; 0, 0.5). An indexed list of 15002 valid vertices was generated using the rcdd package [5] in the R programming language.
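A training set of this kind can be simulated with the following sketch. The experiment in the paper was carried out in R; the Python code here, the equal split of 25 points per class, the assignment of label \(-1\) to the class with mean (0, 0), and the seed are all assumptions made only for illustration, so the resulting points will not reproduce the vertices reported below.

```python
import random


def make_training_set(n_per_class=25, seed=0):
    """Draw points from two bivariate normals with covariance 0.5 * I."""
    rng = random.Random(seed)
    # A diagonal covariance means independent coordinates, each with variance 0.5.
    std = 0.5 ** 0.5
    data = []
    for label, mean in ((-1, (0.0, 0.0)), (+1, (1.0, 0.0))):
        for _ in range(n_per_class):
            x = tuple(rng.gauss(mu, std) for mu in mean)
            data.append((x, label))
    return data
```

Calling `make_training_set()` returns a list of 50 pairs `(x, y)` with `x` a 2-tuple of features and `y` in \(\lbrace -1, +1 \rbrace \).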

The vertices are arranged so that the lower and upper bounds on the \(\lambda \) intervals are in an increasing order (Table 1). It was observed that each vertex on this path is a neighbour of the previous vertex on the list (except the first vertex, \(v_0\)). The portion of the path that occurs between vertices 9158 and 10470 corresponds to classifiers on the facets, since none of the first generation neighbours of vertex 9158 are optimal for any value of \(\lambda \). The next optimal vertex is 10470 which may be obtained as a fifth generation neighbour to the vertex 9158.

Using Theorem 6, we note that only 15 of these 15002 vertices are on the \(\lambda \)-path. The first 13 vertices correspond to the first portion of the path, followed by a portion on the facets of two or higher dimensions. The last two vertices and the edge joining them correspond to the next segment of the path. As mentioned above, there are no more optimal vertices on the path.

**Table 1.** \((\lambda _l, \lambda _u)\) for vertices on the \(\lambda \)-path for bivariate normal data with 50 points

| Vertex index | \(\lambda _l(v)\) | \(\lambda _u(v)\) |
|---|---|---|
| 7274 (\(v_0\)) | 0 | 0.002712 |
| 7325 | 0.002776 | 0.059187 |
| 7327 | 0.059275 | 0.339041 |
| 7326 | 0.339110 | 0.522892 |
| 7328 | 0.523507 | 0.602838 |
| 7426 | 0.610561 | 0.745904 |
| 7420 | 0.749820 | 0.818734 |
| 7421 | 0.851777 | 1.191843 |
| 7424 | 1.198099 | 1.370998 |
| 7425 | 1.456195 | 1.723649 |
| 7352 | 1.785567 | 2.047029 |
| 9038 | 2.206312 | 2.428423 |
| 9158 | 2.535443 | 3.053205 |
| - | 3.5 | 3.5 |
| - | 4.25 | 4.25 |
| 10470 | 5.372196 | 6.368380 |
| 10471 | 6.486113 | 7.576469 |
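The structure of the tabulated path can be checked with a short sketch. The numbers are copied from the table above; the variable and function names are illustrative. It computes the open gap \((\lambda _u(v_k), \lambda _l(v_{k+1}))\) between consecutive vertices, which by (8) is where the \(\lambda \) values of the connecting edge classifiers lie.

```python
# (vertex index, lambda_l, lambda_u), in the order of the table
vertices = [
    (7274, 0.0, 0.002712), (7325, 0.002776, 0.059187), (7327, 0.059275, 0.339041),
    (7326, 0.339110, 0.522892), (7328, 0.523507, 0.602838), (7426, 0.610561, 0.745904),
    (7420, 0.749820, 0.818734), (7421, 0.851777, 1.191843), (7424, 1.198099, 1.370998),
    (7425, 1.456195, 1.723649), (7352, 1.785567, 2.047029), (9038, 2.206312, 2.428423),
    (9158, 2.535443, 3.053205), (10470, 5.372196, 6.368380), (10471, 6.486113, 7.576469),
]
# lambda values of the two tabulated classifiers that are not vertices
facet_lambdas = [3.5, 4.25]


def gaps(rows):
    """Return (v_k, v_next, lambda_u(v_k), lambda_l(v_next)) for consecutive rows."""
    return [(a[0], b[0], a[2], b[1]) for a, b in zip(rows, rows[1:])]


# The widest gap is the stretch of the path that leaves the vertex-edge boundary.
widest = max(gaps(vertices), key=lambda g: g[3] - g[2])
```

Here `widest` identifies the gap between vertices 9158 and 10470, and both tabulated facet values fall inside it, in line with the description of the sandwiched portion of the path.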

## 5 Discussion

The polyhedral structure of the feasible space of the standard SVM optimization model allows us to trace the \(\lambda \)-path on a subset of vertices of the base polytope. It was observed that, for initial values, the \(\lambda \)-path comprises vertices and related edges. We have examples where the path has classifiers that are on facets of the SVM polytope, *P*. We have restricted ourselves to the subset of vertices that dominate the whole polytope of feasible classifiers on 0–1 loss. The vertices and their neighbours can be identified via suitable active constraint sets. We noticed, in our limited computational exercise, that the tracing of the \(\lambda \)-path encountered numerical instabilities; such problems are also reported by [7].

Some aspects that naturally need attention include the need for a scheme to pick the neighbour of a vertex which generates the adjacent interval on the \(\lambda \)-path, if such an interval exists. Perhaps a broader and more important aspect is to be able to restrict ourselves to a suitable subset of vertices, which may or may not be on the \(\lambda \)-path, but have a promising test error.

A leave-one-out scheme [3] can be employed for testing the design of the classifier. Such leave-one-out training sets can be viewed as suitable perturbations of a given training set, and corresponding robust classification problems can be formalized. We can ensure the stability of the SVM algorithm by establishing such equivalence with a robust optimization formulation [13].

Besides the test error, some other measures such as bias, variance and the margin [4] of the classifier can be used for design. An analysis of error decomposition of the learning algorithms such as decision trees, *k*-NNs, etc. is done by [4], where the main consideration is for the algorithms that are consistent in the use of the same loss function for training as well as testing. This analysis was further taken up by [12] to the case of SVMs, where training makes use of hinge loss and testing is based on \(0-1\) loss.

Some statistical properties of the risk of the classifier [11, 14] can be explored to improve the efficiency of our algorithms. These results may help us pick ‘good’ vertex classifiers, for example, via the \(\lambda \) intervals corresponding to good generalization guarantees.

## References

- 1. Bertsimas, D., Tsitsiklis, J.N.: Introduction to Linear Optimization. Athena Scientific, Belmont, MA (1997)
- 2. Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M., Lin, C.-J.: Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. **11**, 1471–1490 (2010)
- 3. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. **46**(1), 131–159 (2002)
- 4. Domingos, P.: A unified bias-variance decomposition. In: Proceedings of 17th International Conference on Machine Learning, pp. 231–238. Morgan Kaufmann, Stanford CA (2000)
- 5. Geyer, C.J., Meeden, G.D.: rcdd: Computational Geometry, R package version 1.1-9 (2015). Incorporates code from cddlib written by Komei Fukuda
- 6. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, Heidelberg (2001)
- 7. Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: The entire regularization path for the support vector machine. J. Mach. Learn. Res. **5**, 1391–1415 (2004)
- 8. Hemachandra, N., Sahu, P.: A geometric viewpoint of the selection of the regularization parameter in some support vector machines. Technical report, IE & OR, IIT Bombay, Mumbai, September 2015. http://www.ieor.iitb.ac.in/files/SVMpath_TechReport.pdf, September 30, 2015
- 9. Hiriart-Urruty, J.-B., Lemaréchal, C.: Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer, Heidelberg (2004)
- 10. Jawanpuria, P., Varma, M., Nath, S.: On p-norm path following in multiple kernel learning for non-linear feature selection. In: Proceedings of the 31st International Conference on Machine Learning, pp. 118–126 (2014)
- 11. Steinwart, I., Christmann, A.: Support Vector Machines. Springer Science & Business Media (2008)
- 12. Valentini, G., Dietterich, T.G.: Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. J. Mach. Learn. Res. **5**, 725–775 (2004)
- 13. Xu, H., Caramanis, C., Mannor, S.: Robustness and regularization of support vector machines. J. Mach. Learn. Res. **10**, 1485–1510 (2009)
- 14. Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. **32**, 56–85 (2004)