Abstract
Support vector classification (SVC) is a classical and well-performing learning method for classification problems. A regularization parameter, which significantly affects the classification performance, has to be chosen, and this is usually done by cross-validation. In this paper, we reformulate the hyperparameter selection problem for support vector classification as a bilevel optimization problem in which the upper-level problem minimizes the average number of misclassified data points over all the cross-validation folds, and the lower-level problems are the \(l_1\)-loss SVC problems, one for each fold in T-fold cross-validation. The resulting bilevel optimization model is then converted to a mathematical program with equilibrium constraints (MPEC). To solve this MPEC, we propose a global relaxation cross-validation algorithm (GR–CV) based on the well-known Scholtes-type global relaxation method (GRM). It is proven to converge to a C-stationary point. Moreover, we prove that the MPEC-tailored version of the Mangasarian–Fromovitz constraint qualification (MFCQ), which is a key property to guarantee the convergence of the GRM, automatically holds at each feasible point of this MPEC. Extensive numerical results verify the efficiency of the proposed approach. In particular, compared with other methods, our algorithm enjoys superior generalization performance over almost all the data sets used in this paper.
1 Introduction
Support vector classification (SVC) is a classical and widely used learning method for classification problems; see, e.g., (Chauhan et al. 2019; Cortes and Vapnik 1995; Vapnik 2013). In SVC, hyperparameter selection is a critical issue and has been addressed by many researchers, both theoretically and practically (Chapelle et al. 2002; Dong et al. 2007; Duan et al. 2003; Keerthi et al. 2006; Kunapuli 2008; Kunapuli et al. 2008a, b). While there have been many interesting attempts to use bounds, gradient descent methods or other techniques to identify these hyperparameters (Chapelle et al. 2002; Duan et al. 2003; Keerthi et al. 2006), one of the most widely used methods is cross-validation (CV). A classical approach for cross-validation is the grid search method (Momma and Bennett 2002), where one defines a grid over the hyperparameters of interest and searches for the combination of hyperparameters that minimizes the cross-validation error (CV error). Bennett et al. (2006) emphasize that one of the drawbacks of the grid search approach is that the continuity of the hyperparameter is ignored by the discretization. A bilevel optimization formulation has been proposed to choose hyperparameters (Bennett et al. 2006; Kunapuli 2008). Below, we will focus on the bilevel optimization approach, which is the most relevant to our work. We refer to Yu and Zhu (2020), Luo (2016) for a survey of various hyperparameter optimization methods and applications.
In terms of selecting hyperparameters through bilevel optimization, different models and approaches have been considered in the literature. For example, Okuno et al. (2018) propose a bilevel optimization model to select the best hyperparameter for a nonsmooth, possibly nonconvex, \(l_{p}\)-regularized problem. They then present a smoothing-type algorithm with convergence analysis to solve this bilevel optimization model. Kunisch and Pock (2013) formulate a parameter learning problem for variational image denoising models as a bilevel optimization problem. They design a semismooth Newton method for solving the resulting nonsmooth bilevel optimization problems. Moore et al. [17] develop an implicit gradient-type algorithm for selecting hyperparameters for linear SVM-type machine learning models which are expressed as bilevel optimization problems. Moore et al. (2009) propose a nonsmooth bilevel model to select hyperparameters for support vector regression (SVR) via T-fold cross-validation. They design a proximity control approximation algorithm to solve this bilevel optimization model. Couellan and Wang (2015) design a bilevel stochastic gradient algorithm for training large-scale SVMs with automatic selection of the hyperparameter. We refer to Crockett and Fessler (2021), Colson et al. (2007), Dempe (2002), Dempe and Zemkoho (2020) for recent general surveys on bilevel optimization, as well as Mejía-de-Dios and Mezura-Montes (2019), Zemkoho and Zhou (2021), Fischer et al. (2021), Lin et al. (2014), Ye and Zhu (2010), Ochs et al. (2016, 2015) for some of the latest algorithms on the subject. Next, we provide a brief overview of the MPEC reformulation of the bilevel optimization problem, which will play a fundamental role in this paper.
For a bilevel program, replacing the lower-level problem by its Karush–Kuhn–Tucker (KKT) conditions results in a mathematical program with equilibrium constraints (MPEC) Luo et al. (1996). Therefore, various algorithms for MPECs can potentially be applied to solve bilevel optimization problems, although one should keep in mind that the two problems are not necessarily equivalent. Bennett and her collaborators have carried out a series of works (Bennett et al. 2006; Kunapuli et al. 2008b; Bennett et al. 2008; Kunapuli et al. 2008a; Kunapuli 2008) on hyperparameter selection by reformulating a bilevel program into an MPEC. For example, Kunapuli et al. (2008b) consider a bilevel optimization model for selecting many hyperparameters for \(l_{1}\)-loss SVC problems, in which the upper-level problem has box constraints for the regularization parameter and feature selection. They reformulate this bilevel program into an MPEC and solve it by the inexact cross-validation method. Other methods include Newton-type algorithms (Wu et al. 2015; Harder et al. 2021; Lee et al. 2015).
Considering these works, a natural question is whether one can build a bilevel hyperparameter selection model for SVC. If so, are there special, hidden properties when the corresponding bilevel optimization problem is transformed into its MPEC reformulation, and how can the latter be solved efficiently? These questions are the main motivation for the work in this paper.
In this paper, we consider a bilevel optimization model for selecting the hyperparameter in SVC. This regularization hyperparameter C is selected to minimize the T-fold cross-validated estimate of the out-of-sample misclassification error, which is basically a 0–1 loss function. Therefore, the upper-level problem minimizes the average misclassification error in T-fold cross-validation based on the optimal solution of the lower-level problem (we use the typical \(l_{1}\)-loss SVC model) for all the possible values of the hyperparameter C. There are several challenges in designing efficient algorithms for such potentially large-scale bilevel programs. Firstly, the objective function in the upper-level problem is a 0–1 loss function, which is discontinuous and nonconvex. Secondly, the constraints of the upper-level problem involve the optimal solution set of the lower-level problem, i.e., the \(l_{1}\)-loss SVC model, whose optimal solution is not explicitly given. To deal with the first challenge, we reformulate the minimization of the 0–1 loss function as a linear optimization problem, inspired by the technique in Mangasarian (1994). We then replace the lower-level problem by its optimality conditions to tackle the second challenge. This leads to an MPEC.
The contributions of the paper are as follows. Firstly, we propose a bilevel optimization model for hyperparameter selection in a binary SVC and study its reformulation as an MPEC. Secondly, we apply the GRM originating from Scholtes (2001) to solve this MPEC, which is shown to converge to a C-stationary point. The resulting algorithm is called the GR–CV, which is a concrete implementation of the GRM for selecting the hyperparameter C in SVC. Thirdly, we prove the MPEC–Mangasarian–Fromovitz constraint qualification (MPEC–MFCQ, for short) property for each feasible point of our MPEC. The MPEC–MFCQ is a key property to guarantee the convergence of the GRM. We show that it automatically holds for our problem thanks to its special structure. Finally, we conduct extensive numerical experiments, which show that our method is very efficient; in particular, it enjoys superior generalization performance over almost all the data sets used in this paper.
The paper is organized as follows. In Sect. 2, based on T-fold cross-validation for SVC, we introduce a bilevel optimization model to select an optimal hyperparameter for SVC. We also analyze the interesting properties of the lower-level problem. In Sect. 3, we reformulate the bilevel optimization problem as an MPEC (also known as the KKT reformulation), and apply the GRM for solving the MPEC. In Sect. 4, we prove that every feasible point of this MPEC satisfies the regularity condition MPEC–MFCQ, which is a key property to guarantee the convergence of the GRM. In Sect. 5, we present some computational experiments comparing the resulting GR–CV, based on the GRM, with two other methods that have been used in the literature for a similar purpose, i.e., the inexact cross-validation method (In–CV) and the grid search method (G–S). We conclude the paper in Sect. 6.
Notations. For \(x \in \mathbb {R}^{n}\), \(\Vert x \Vert _{0}\) denotes the number of nonzero elements in x, while \(\Vert x \Vert _{1}\) and \(\Vert x \Vert _{2}\) correspond to the \(l_{1}\)-norm and \(l_{2}\)-norm of x, respectively. Also, we will use \(x_{+}=((x_{1})_{+},\ \cdots ,\ (x_{n})_{+}) \in \mathbb {R}^{n}\), where \((x_{i})_{+}=\max (x_{i},\ 0)\). \(\mid \! \Omega \! \mid \) denotes the number of elements in the set \(\Omega \subset \mathbb {R}^n\). We use \(\mathbf {1}_{k}\) to denote a vector with elements all ones in \(\mathbb {R}^{k}\). \(I_{k}\) is the identity matrix in \(\mathbb {R}^{k \times k}\), while \(e^{k}_{\gamma }\) is the \(\gamma \)-th row vector of an identity matrix in \(\mathbb {R}^{k \times k}\). The notation \(\mathbf {0}_{k \times q}\) represents a zero matrix in \(\mathbb {R}^{k \times q}\) and \(\mathbf {0}_{k}\) stands for a zero vector in \(\mathbb {R}^{k}\). On the other hand, \(\mathbf {0}_{(\tau ,\ \kappa )}\) will be used for a submatrix of the zero matrix, where \(\tau \) is the index set of the rows and \(\kappa \) is the index set of the columns. Similarly to the case of the zero matrix, \(I_{(\tau ,\ \tau )}\) corresponds to a submatrix of an identity matrix indexed by both rows and columns in the set \(\tau \). Finally, \(\Theta _{(\tau ,\ \cdot )}\) represents a submatrix of the matrix \(\Theta \), where \(\tau \) is the index set of the rows, and \(x_{\tau }\) is a subvector of the vector x corresponding to the index set \(\tau \).
2 Bilevel hyperparameter optimization for SVC
We start this section by first introducing the problem settings in relation to the T-fold cross-validation for SVC. Subsequently, we present the lower-level problem with some interesting and relevant properties for further analysis in the later parts of the paper. Finally, we introduce the upper-level problem, that is, the bilevel optimization model for hyperparameter selection in SVC.
2.1 T-fold cross-validation for SVC
As discussed in the introduction, the most commonly used method for selecting the hyperparameter C is T-fold cross-validation. In T-fold cross-validation, the data set is split into a subset \(\Omega \) with \(l_{1}\) points, which is used for cross-validation, and a hold-out test set \(\Theta \) with \(l_{2}\) points. Here, \(\Omega =\{(x_{i},y_{i})\}_{i=1}^{l_{1}} \in \mathbb {R}^{n+1}\), where \(x_{i} \in \mathbb {R}^{n}\) denotes a data point and \(y_{i}\in \{\pm 1\}\) the corresponding label. For T-fold cross-validation, \(\Omega \) is equally partitioned into T disjoint subsetsFootnote 1, as done in Couellan and Wang (2015), Moore et al. (2009), one for each fold. The process is executed over T iterations. For the t-th iteration (\(t=1, \ldots , T\)), the t-th fold is the validation set \(\Omega _{t}\), and the remaining \(T-1\) folds make up the training set \(\overline{\Omega }_{t}=\Omega \backslash \Omega _{t}\). Therefore, in the t-th iteration, the separating hyperplane is trained using the training set \(\overline{\Omega }_{t}\), and the validation error is computed on the validation set \(\Omega _{t}\).
Then, the cross-validation error (CV error) is the average of the validation error over all the T iterations. The value of C that gives the best CV error will be selected. Finally, the final classifier is trained using all the data in \(\Omega \) and the rescaled optimal C. The test error is computed on the test set \(\Theta \). Note that the CV error and the test error are the evaluation indices for the classification performance in T-fold cross-validation. As shown in Fig. 1, for three-fold cross-validation, the yellow part is the subset \(\Omega \) which is used for three-fold cross-validation. In the first iteration, the blue part is the validation set \(\Omega _{1}\), and the remaining two folds are the training set \(\overline{\Omega }_{1}\). The second and third iterations have similar meanings.
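The procedure above can be sketched as follows; `train_svc` is a hypothetical placeholder for any routine that solves the lower-level SVC problem and returns a weight vector, not the solver developed in this paper:

```python
import numpy as np

def cv_error(X, y, folds, train_svc, C):
    """T-fold CV error: average validation error over the T folds.

    X : (l1, n) data matrix; y : (l1,) labels in {-1, +1};
    folds : T disjoint index arrays partitioning range(l1);
    train_svc : any routine returning a weight vector w (placeholder);
    C : regularization hyperparameter passed to the trainer.
    """
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        w = train_svc(X[train_idx], y[train_idx], C)   # train on T-1 folds
        pred = np.sign(X[val_idx] @ w)                 # bias-free classifier
        errors.append(np.mean(pred != y[val_idx]))     # validation error
    return float(np.mean(errors))                      # CV error
```

Selecting C then amounts to minimizing this function over C, which is exactly what the bilevel model formalizes.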
Let \(m_{1}\) be the size of the validation set \(\Omega _{t}\) and \(m_{2}\) the size of the training set \(\overline{\Omega }_{t}\). The corresponding index sets for the validation and training sets are \(\mathcal {N}_{t}\) and \(\overline{\mathcal {N}}_{t}\), respectively. In T-fold cross-validation, there are T validation sets. Therefore, there are totally \(T m_{1}\) validation points in T-fold cross-validation. We use the index set
to represent all the validation points in T-fold cross-validation. Similarly, there are totally \(T m_{2}\) training points in T-fold cross-validation. We use the index set
to represent all the training points in T-fold cross-validation. These two index sets will be used later.
To analyze different cases of the data points in the training set and the validation set, we need to introduce the soft-margin support vector classification (without bias term) Cristianini and Shawe-Taylor (2000), Galli and Lin (2021). The traditional SVC model, referred to as hard-margin SVC, requires the data to be strictly separated, i.e., the constraints must be satisfied strictly. In contrast, the regularized model (soft-margin SVC) allows data to be wrongly labelled, i.e., the inequality constraints may be violated, which is the case in the model
where \(C \! \ge \!0\) is a penalty parameter and \(\xi (\cdot )\) is the loss function. If \(\xi (w;\ x_i,y_i)\!=\!(1-y_{i}(x_{i}^{\top }w))_{+}\), it is referred to as the \(l_1\)-loss function; if \(\xi (w;\ x_i,y_i)=(1-y_{i}(x_{i}^{\top }w))_{+}^{2}\), it is referred to as the \(l_2\)-loss function. We refer to Chauhan et al. (2019), Wang et al. (2021), Huang et al. (2013) for various other types of loss functions. Next, we use Fig. 2 to illustrate the geometric relationships of the different cases in soft-margin SVC.
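As a minimal numerical sketch of the bias-free soft-margin objective with either loss (the function name and the `loss` switch are ours, not the paper's notation):

```python
import numpy as np

def soft_margin_objective(w, X, y, C, loss="l1"):
    """0.5 * ||w||^2 + C * sum of losses, with no bias term.

    loss="l1" uses (1 - y_i x_i^T w)_+ ; loss="l2" uses its square.
    """
    margins = 1.0 - y * (X @ w)          # 1 - y_i <x_i, w>
    hinge = np.maximum(margins, 0.0)     # (1 - y_i x_i^T w)_+
    penalty = hinge.sum() if loss == "l1" else (hinge ** 2).sum()
    return 0.5 * np.dot(w, w) + C * penalty
```

Points with margin at least one contribute nothing to the penalty, which is why only margin violations are traded off against the regularization term.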
For a sample \((x_{i},y_{i})\), the point \(x_{i}\) is referred to as a positive point if \(y_{i}=1\); the point \(x_{i}\) is referred to as a negative point if \(y_{i}=-1\). In Fig. 2, the plus signs ‘\(+\)’ are the positive points (i.e., \(y_i=1\)) and the minus signs ‘−’ are the negative ones (i.e., \(y_i=-1\)). The distance between the hyperplanes \(H_{1}: w^{\top } x=1\) and \(H_{2}: w^{\top } x=-1\) is called margin. The separating hyperplane H lies between \(H_{1}\) and \(H_{2}\). Clearly, the hyperplanes \(H_{1}\) and \(H_{2}\) are the boundaries of the margin. Therefore, if a positive point lies on the hyperplane \(H_{1}\) or a negative point lies on the hyperplane \(H_{2}\), we call it lying on the boundary of the margin (indicated by ‘①’ in Fig. 2). If a positive point lies between the separating hyperplane H and the hyperplane \(H_{1}\), or a negative point lies between the separating hyperplane H and the hyperplane \(H_{2}\), we call it lying between the separating hyperplane H and the boundary of the margin (indicated by ‘②’ in Fig. 2). Similarly, if a positive point lies on the correctly classified side of the hyperplane \(H_{1}\), or a negative point lies on the correctly classified side of the hyperplane \(H_{2}\), we call it lying on the correctly classified side of the boundary of the margin (indicated by ‘③’ in Fig. 2).
Based on Fig. 2, we have the following observations, which address the different cases for the data points in the training set \(\overline{\Omega }_t\). Consider the soft-margin SVC problem corresponding to the t-th fold, i.e., the t-th training set \(\overline{\Omega }_{t}\) and validation set \(\Omega _{t}\) are used. We also use \(w^{t}\) to represent the optimal solution of (3) trained on \(\overline{\Omega }_t\).
Proposition 1
Let \(w^{t}\) be an optimal solution of the t-th soft-margin SVC model. For \(i \in \overline{\mathcal {N}}_{t}\), consider a positive point \(x_i\). Then it holds that:
-
(a)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}<0\) if and only if it lies on the misclassified side of the separating hyperplane H, and is therefore misclassified.
-
(b)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}=0\) if and only if it lies on the separating hyperplane H, and is therefore correctly classified.
-
(c)
\(x_i\) satisfies \(0<(w^{t})^{\top } x_{i}<1\) if and only if it lies between the separating hyperplane H and the boundary of the margin; hence, it is correctly classified.
-
(d)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}=1\) if and only if it lies on the boundary of the margin, and is therefore correctly classified.
-
(e)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}>1\) if and only if it lies on the correctly classified side of the boundary of the margin, and is therefore correctly classified.
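The five cases of Proposition 1 can be tabulated in a small helper for positive points (exact comparisons stand in for the tolerance checks a floating-point implementation would need; the function name is ours):

```python
def training_point_case(score):
    """Case (a)-(e) of Proposition 1 for a positive training point,
    where score = (w^t)^T x_i."""
    if score < 0:
        return "a"  # misclassified side of the separating hyperplane H
    if score == 0:
        return "b"  # on H
    if score < 1:
        return "c"  # between H and the boundary of the margin
    if score == 1:
        return "d"  # on the boundary of the margin
    return "e"      # correctly classified side of the margin boundary
```

For negative points, the analogous classification uses \(-(w^{t})^{\top } x_{i}\) in place of the score.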
A result analogous to Proposition 1 can be stated for the negative points. In Fig. 3, any point \(x_{i} \in \overline{\Omega }_{t}\) in blue is a training point in each case (notation is the same as in Fig. 5).
As for data points in the validation set \(\Omega _{t}\), we have the following scenarios.
Proposition 2
Let \(w^{t}\) be an optimal solution of the t-th soft-margin SVC model. For \(i \in \mathcal {N}_{t}\), consider a positive point \(x_i\). Then it holds that:
-
(a)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}<0\) if and only if it lies on the misclassified side of the separating hyperplane H, and is therefore misclassified.
-
(b)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}=0\) if and only if it lies on the separating hyperplane H, and is therefore correctly classified.
-
(c)
\(x_i\) satisfies \((w^{t})^{\top } x_{i}>0\) if and only if it lies on the correctly classified side of the separating hyperplane H, and it is hence correctly classified.
A result analogous to Proposition 2 can be stated for the negative points. In Fig. 4, any point \(x_{i} \in \Omega _{t}\) in blue is a validation point in each case (notation is the same as in Fig. 6).
Remark 1
Note that Propositions 1 and 2 are applicable to the soft-margin SVC model with other loss functions. These two propositions will be used in the proof of Propositions 3 and 4.
2.2 The lower-level problem
In this part, we focus on the lower-level problem. That is, given the hyperparameter C and the training set \(\overline{\Omega }_{t}\), we train the classifier via the \(l_{1}\)-loss SVC model. We will also discuss the properties of the lower-level problem.
2.2.1 The training model: \(l_{1}\)-loss SVC
In T-fold cross-validation, there are T lower-level problems. In the t-th lower-level problem, we train the t-th fold training set \(\overline{\Omega }_{t}\) by the soft-margin SVC model in (3) with the \(l_1\)-loss functionFootnote 2. That is, given \(C \ge 0\), we solve the following optimization problem:
A popular reformulation of the problem above is the convex quadratic optimization problem obtained by introducing slack variables \(\xi ^{t} \in \mathbb {R}^{m_{2}}\):
where, for \(t=1, \cdots , T\) and \(k=m_{1}+ 1, \cdots , l_{1}\), we have
and we use \(\xi ^{t}_{i}\) to denote the i-th element of \(\xi ^{t} \in \mathbb {R}^{m_{2}}\).
Let \(\alpha ^{t} \in \mathbb {R}^{m_{2}}\) and \(\mu ^{t} \in \mathbb {R}^{m_{2}}\) be the multipliers of the constraints in (4). We can write the KKT conditions for the lower-level problem (4) as
where for two vectors a and b, writing \(\mathbf {0} \le a \perp b \ge \mathbf {0}\) means that \(a^{\top } b=0,\ a \ge \mathbf {0}\) and \(b \ge \mathbf {0}\). Also note that each complementarity constraint in (5a) corresponds to a training point \(x_{i}\) with \(i \in Q_{l}\) in (2). Each training point corresponds to a slack variable \(\xi ^{t}_i\), so each complementarity constraint in (5b) also corresponds to a training point \(x_{i}\) with \(i \in Q_{l}\) in (2). Therefore, there is a one-to-one correspondence between the index set of the training points \(Q_{l}\) and the complementarity constraints in (5a) and (5b), respectively. This will be used in the definition of some index sets below.
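The notation \(\mathbf {0} \le a \perp b \ge \mathbf {0}\) can be checked numerically with a short helper (a sketch with our own tolerance convention):

```python
import numpy as np

def complementarity_holds(a, b, tol=1e-8):
    """Check 0 <= a, 0 <= b and a^T b = 0, i.e. 0 <= a ⊥ b >= 0,
    up to a small numerical tolerance."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return bool((a >= -tol).all() and (b >= -tol).all() and abs(a @ b) <= tol)
```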
Furthermore, we would like to emphasize the support vectors implied in (5). From (5c), the weight vector \(w^{t}=(B^{t})^{\top } \alpha ^{t}=\sum \limits _{i \in \overline{\mathcal {N}}_{t}} \alpha ^{t}_{i} y_{i}x_{i}\). This implies that only the data points \(x_{i} \in \overline{\Omega }_{t}\) corresponding to \(\alpha ^{t}_{i} \ne 0\) are involved. Since \(\alpha ^{t}_{i} \ge 0\) by (5a), only \(x_{i} \in \overline{\Omega }_{t}\) with \(\alpha ^{t}_{i}>0\) are involved. It is for this reason that they are called support vectors. By eliminating \(\mu ^{t}\) and \(w^{t}\) from the system in (5), we get the reduced KKT conditions for problem (4) as follows:
2.2.2 Some properties of the lower-level problem
Let \(\alpha \in \mathbb {R}^{Tm_{2}},\ \xi \in \mathbb {R}^{Tm_{2}},\ w \in \mathbb {R}^{Tn}\), and \(B \in \mathbb {R}^{Tm_{2} \times Tn}\) be defined by
respectively. The KKT conditions in (6) can be decomposed as
Obviously, the intersection of any pair of these index sets \(\Lambda _{i}\) for \(i=1, \ldots , 6\) is empty. An illustrative representation of data points corresponding to these index sets is given in Fig. 5.
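The support-vector expansion \(w^{t}=(B^{t})^{\top } \alpha ^{t}=\sum _{i} \alpha ^{t}_{i} y_{i}x_{i}\) from (5c), which is used repeatedly in the proofs below, can be sketched as follows (function name and tolerance are ours):

```python
import numpy as np

def weight_from_duals(alpha, X, y, tol=1e-10):
    """Recover w = B^T alpha = sum_i alpha_i y_i x_i.

    Only points with alpha_i > 0 (the support vectors) contribute;
    returns the weight vector and the support-vector indices."""
    sv = alpha > tol
    w = (alpha[sv] * y[sv]) @ X[sv]      # sum over support vectors only
    return w, np.flatnonzero(sv)
```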
Proposition 3
Considering the training points corresponding to \(Q_{l}\) in (2), let \((\alpha ,\, \xi )\) satisfy the conditions in (6). Then, the following statements hold true:
-
(a)
The points \(\{x_i\}_{i\in \Lambda _{1}}\) lie on the boundary of the margin; they are correctly classified points, but are not support vectors.
-
(b)
The points \(\{x_i\}_{i\in \Lambda _{2}}\) lie on the correctly classified side of the boundary of the margin; they are correctly classified points, but are not support vectors.
-
(c)
The points \(\{x_i\}_{i\in \Lambda _{3}}\) lie on the boundary of the margin; they are correctly classified points and are support vectors.
-
(d)
The points \(\{x_i\}_{i\in \Lambda _{4}}\) lie between the separating hyperplane H and the boundary of the margin; they are correctly classified points and are support vectors.
-
(e)
The points \(\{x_i\}_{i\in \Lambda _{5}}\) lie on the separating hyperplane H; they are correctly classified points and are support vectors.
-
(f)
The points \(\{x_i\}_{i\in \Lambda _{6}}\) lie on the misclassified side of the separating hyperplane H; they are misclassified points and are support vectors.
Proof
We take positive points for example. The same analysis can be applied to negative ones. Since \(w=B^{\top } \alpha \) in (5c), we get \((B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=(Bw -\mathbf {1}+\xi )_{i}\).
-
(a)
For the points \(\{x_i\}_{i\in \Lambda _{1}}\), since \(\xi _{i}=0\) in (8), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}=0\), that is, \(y_i(w^{\top } x_i) -1=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=1\). It corresponds to (d) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the boundary of the margin. It is correctly classified, and it is not a support vector, since \(\alpha _{i}=0\).
-
(b)
For the points \(\{x_i\}_{i\in \Lambda _{2}}\), since \(\xi _{i}=0\) in (9), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}>0\), that is, \(y_i(w^{\top } x_i) -1>0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i>1\). It corresponds to (e) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the correctly classified side of the boundary of the margin. It is correctly classified, but not a support vector, as \(\alpha _{i}=0\).
-
(c)
For the points \(\{x_i\}_{i\in \Lambda _{3}}\), since \(\xi _{i}=0\) in (10), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}=0\), that is, \(y_i(w^{\top } x_i) -1=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=1\). It corresponds to (d) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the boundary of the margin. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).
-
(d)
For the points \(\{x_i\}_{i\in \Lambda _{4}}\), since \(0<\xi _{i}<1\) in (11), we have \(0<(Bw)_{i}<1\), that is, \(0<y_i(w^{\top } x_i)<1\). For a positive point, \(y_{i}=1\), it implies that \(0<w^{\top } x_i<1\). It corresponds to (c) in Proposition 1. Therefore, \(x_i\) lies between the separating hyperplane H and the boundary of the margin. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).
-
(e)
For the points \(\{x_i\}_{i\in \Lambda _{5}}\), since \(\xi _{i}=1\) in (12), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw)_{i}=0\), that is, \(y_i(w^{\top } x_i)=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=0\). It corresponds to (b) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the separating hyperplane H. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).
-
(f)
For the points \(\{x_i\}_{i\in \Lambda _{6}}\), since \(\xi _{i}>1\) in (13), we have \((Bw)_{i}<0\), that is, \(y_i(w^{\top } x_i)<0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i<0\). It corresponds to (a) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the misclassified side of the separating hyperplane H. It is misclassified, and it is a support vector, since \(\alpha _{i} >0\).
Remark 2
Note that all the data points \(x_i\) for \(i \in \Lambda _{1}\) corresponding to Fig. 5a and \(i \in \Lambda _{3}\) corresponding to Fig. 5c lie on the boundary of the margin. In other words, Fig. 5a and Fig. 5c are identical. However, the values of \(\alpha _i\) for \(i \in \Lambda _{1}\) and \(i \in \Lambda _{3}\) are different, so we depict them in two subfigures.
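The six cases of Proposition 3 can be summarized in a small classifier. The precise index-set definitions are those in (8)–(13), so the thresholds below are our reading of the proof; `slack_i` denotes \((Bw-\mathbf {1}+\xi )_i\), which is zero by complementarity whenever \(\alpha _i>0\):

```python
def lambda_index(alpha_i, xi_i, slack_i, tol=1e-10):
    """Assign a training point to one of the six sets Λ1-Λ6, following
    the case analysis in the proof of Proposition 3."""
    if alpha_i <= tol:                 # not a support vector
        if abs(slack_i) <= tol:
            return 1                   # Λ1: on the margin boundary
        return 2                       # Λ2: strictly on the correct side
    if abs(xi_i) <= tol:
        return 3                       # Λ3: on the margin boundary, SV
    if xi_i < 1.0 - tol:
        return 4                       # Λ4: between H and the boundary
    if abs(xi_i - 1.0) <= tol:
        return 5                       # Λ5: on the separating hyperplane H
    return 6                           # Λ6: misclassified
```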
2.3 The upper-level problem
In this part, we introduce the upper-level problem, that is, the bilevel optimization model for hyperparameter selection in SVC under the settings of T-fold cross-validation. Note that the aim of the upper-level problem is to minimize the T-fold CV error measured on the validation sets based on the optimal solutions of the lower-level problems. Specifically, the basic bilevel optimization model for selecting the hyperparameter C in SVC is formulated as
Here, the expression \(\sum \limits _{i \in \mathcal {N}_{t}}\Vert \left( -y_{i} \left( x_{i}^{\top } w^{t}\right) \right) _{+}\Vert _{0}\) basically counts the number of data points that are misclassified in the validation set \(\Omega _{t}\), while the outer summation (i.e., the objective function in (14)) averages the misclassification error over all the folds.
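The per-fold count \(\sum _{i \in \mathcal {N}_{t}}\Vert (-y_{i}(x_{i}^{\top } w^{t}))_{+}\Vert _{0}\) can be computed directly (a minimal sketch; the function name is ours):

```python
import numpy as np

def fold_misclassification_count(w_t, X_val, y_val):
    """|| (-y_i x_i^T w^t)_+ ||_0 summed over one validation fold:
    the number of validation points with y_i x_i^T w^t < 0."""
    scores = -y_val * (X_val @ w_t)          # positive iff misclassified
    return int(np.count_nonzero(np.maximum(scores, 0.0)))
```

Averaging this count over the T folds (and dividing by the fold size) gives exactly the upper-level objective in (14).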
Problem (14) can be equivalently written in the matrix form as follows
where, for \(t=1,\ \cdots ,\ T\) and \(k=1,\ \cdots ,\ m_{1}\), we have
Remark 3
Compared with the model in Kunapuli et al. (2008b), we consider a simpler bilevel optimization model, only with an extra constraint \(C \ge 0\) in the upper-level problem.
3 Single-level reformulation and method
In this section, we first reformulate the bilevel optimization problem as a single-level optimization problem, precisely, we write the problem as an MPEC. Then we present the properties of this single-level problem. Finally, we discuss the GRM to solve the MPEC problem.
3.1 The MPEC reformulation
Recall that the upper-level objective function in (15) is a measure of the misclassification error based on the T out-of-sample validation sets, which we minimize. The measure used here is the classical CV error for classification, i.e., the average number of misclassified data points. It is clear that \(\Vert (\cdot )_{+}\Vert _{0}\) is discontinuous and nonconvex. However, the function \(\Vert (\cdot )_{+}\Vert _{0}\) can be characterized as the minimum of the sum of all elements of the solution to the following linear optimization problem, as demonstrated in Mangasarian (1994), i.e.,
Therefore, for each fold, \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}\) is the minimum of the sum of all elements of the solution to the following linear optimization problem:
Let \(\hat{\zeta }^{t}\) be a solution of problem (16) such that \(\sum \limits _{i=1}^{m_1} \hat{\zeta }_{i}^{t}\) is the minimum of the sum of all elements over the solutions of problem (16). This implies that \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}= \sum \limits _{i=1}^{m_{1}}\hat{\zeta }^{t}_{i}\) in each fold. According to Proposition 2, there are two cases for the validation points:
-
1.
If the validation point \((x_{i},y_{i}) \in \Omega _{t}\) is misclassified, then \(y_{i} \left( x_{i}^{\top } w^{t}\right) <0 \). That is, \( (-A^{t} w^{t})_{i} >0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}>0\).
-
2.
If the validation point \((x_{i},y_{i}) \in \Omega _{t}\) is correctly classified, we have \(y_{i} \left( x_{i}^{\top } w^{t}\right) \ge 0\). There are two cases. Firstly, \(x_{i}\) lies on the separating hyperplane H, that is, \(y_{i} \left( x_{i}^{\top } w^{t}\right) =0\). For \(y_{i}=1\), we have \((-A^{t} w^{t})_{i} =0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}=0\). Secondly, \(x_{i}\) lies on the correctly classified side of the separating hyperplane H, that is, \(y_{i} \left( x_{i}^{\top } w^{t}\right) >0\). For \(y_{i}=1\), we have \( (-A^{t} w^{t})_{i} <0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}=0\).
Combining with \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}= \sum \limits _{i=1}^{m_{1}}\hat{\zeta }^{t}_{i}\), it means that
where \(\hat{\zeta }^{t}_{i}\) is the i-th element of \(\hat{\zeta }^{t}\) in the t-th fold.
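The counting property can be illustrated with a box-constrained LP. This is our hedged reading of the Mangasarian (1994)-style construction and the paper's LP (16) may differ in form: minimizing \(-r^{\top }\zeta \) over \(0 \le \zeta \le \mathbf {1}\) yields the indicator of the positive components of r whenever no component is exactly zero, so the element sum of the solution counts them.

```python
import numpy as np
from scipy.optimize import linprog

def zero_norm_of_positive_part(r):
    """||r_+||_0 via the LP  min -r^T zeta  s.t.  0 <= zeta <= 1.

    Illustration only (hypothetical form, not necessarily the paper's
    LP (16)); assumes no component of r is exactly zero, so the unique
    solution is zeta_i = 1 iff r_i > 0."""
    res = linprog(c=-np.asarray(r, float),
                  bounds=[(0.0, 1.0)] * len(r), method="highs")
    return round(res.x.sum())
```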
The linear programs (LPs) (16), for \(t= 1, \cdots , T\), are inserted into the bilevel optimization problem in order to recast the discontinuous upper-level objective function into a continuous one. Each LP in the form of (16) can also be replaced with its KKT conditions as follows
By eliminating \(\lambda ^{t}\) and \(w^{t}\) with \(w^{t}=(B^{t})^{\top } \alpha ^{t}\) in (5c), we get the reduced KKT conditions for problem (16) with
Note that each complementarity constraint in (18a) corresponds to a validation point \(x_{i}\) with \(i \in Q_{u}\) in (1). Each validation point corresponds to a variable \(\zeta ^{t}_i\), so each complementarity constraint in (18b) also corresponds to a validation point \(x_{i}\) with \(i \in Q_{u}\) in (1). Therefore, there is a one-to-one correspondence between the index set of the validation points \(Q_{u}\) and the complementarity constraints in (18a) and (18b), respectively.
Combining the systems in (6) and (18), we can transform the bilevel optimization problem (15) into the single-level optimization problem
Note that the constraints \(C\mathbf {1}-\alpha ^{t} \ge \mathbf {0}\) and \(\alpha ^{t} \ge \mathbf {0}\) imply \(C \ge 0\). Therefore, we remove the redundant constraint \(C \ge 0\), and get an equivalent form of the problem above as follows
The presence of the equilibrium constraints makes problem (20) an instance of an MPEC, which is sometimes labelled as an extension of a bilevel optimization problem Luo et al. (1996). The optimal hyperparameter is now well defined as a global optimal solution to the MPEC Lee et al. (2015). Now, we have transformed a bilevel classification model into an MPEC.
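To illustrate the Scholtes-type global relaxation on which GR–CV is built, the following toy sketch treats a two-variable MPEC (a hypothetical example, not the model (20)): each complementarity constraint \(0 \le x \perp y \ge 0\) is replaced by \(x \ge 0,\ y \ge 0,\ xy \le t_k\) with \(t_k \downarrow 0\), warm-starting each relaxed solve.

```python
import numpy as np
from scipy.optimize import minimize

def scholtes_relaxation(f, grad, x0, ts=(1e-1, 1e-2, 1e-3, 1e-4)):
    """Drive the relaxation parameter t -> 0, solving each relaxed
    problem  min f(v)  s.t.  v >= 0, v[0]*v[1] <= t  by SLSQP."""
    v = np.asarray(x0, dtype=float)
    for t in ts:
        cons = [{"type": "ineq", "fun": lambda v: v[0]},                  # x >= 0
                {"type": "ineq", "fun": lambda v: v[1]},                  # y >= 0
                {"type": "ineq", "fun": lambda v, t=t: t - v[0] * v[1]}]  # xy <= t
        v = minimize(f, v, jac=grad, constraints=cons, method="SLSQP").x
    return v

# toy MPEC:  min (x-1)^2 + (y-1)^2   s.t.  0 <= x ⊥ y >= 0
f = lambda v: (v[0] - 1.0) ** 2 + (v[1] - 1.0) ** 2
grad = lambda v: np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] - 1.0)])
sol = scholtes_relaxation(f, grad, x0=[1.0, 0.2])
```

As \(t_k \downarrow 0\) the relaxed feasible sets shrink toward the complementarity set, and under suitable constraint qualifications accumulation points of such a scheme are C-stationary (Scholtes 2001); the concrete algorithm for (21) is developed later in the paper.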
We can also write (20) in a compact form. To proceed, let
and \(\alpha ,\ \xi ,\ B\) be defined in (7).
Then problem (20) can be written as
From now on, all our analysis is going to be based on this model.
3.2 Some properties of the MPEC reformulation
Observe that the last two constraints of problem (21) correspond to the complementarity systems that are part of the KKT conditions of the lower-level problem in (6). As the latter conditions are carefully studied in Proposition 3, it remains to analyze the first two complementarity systems describing the feasible set of problem (21). Hence, we partition them as follows
As with (8)–(13), the index sets \(\Psi _{j}\) for \(j=1, \,2, \, 3\) are pairwise disjoint. In the same vein, an illustrative representation of the data points corresponding to the index sets \(\Psi _{j}\) for \(j=1, \,2, \, 3\) is given in Fig. 6.
Proposition 4
Considering the validation points corresponding to \(Q_{u}\) in (1), let \((\zeta ,\ z, \, \alpha )\) satisfy the first two complementarity systems describing the feasible set of problem (21). Then, the following statements hold true:
-
(a)
The points \(\{x_i\}_{i\in \Psi _{1}}\) lie on the separating hyperplane H and are therefore correctly classified.
-
(b)
The points \(\{x_i\}_{i\in \Psi _{2}}\) lie on the correctly classified side of the separating hyperplane H and are therefore correctly classified.
-
(c)
The points \(\{x_i\}_{i\in \Psi _{3}}\) lie on the misclassified side of the separating hyperplane H and are therefore misclassified.
Proof
We take positive points as an example; the same analysis applies to negative ones. Since \(w=B^{\top } \alpha \) in (5c), we get \((A B^{\top } \alpha +z)_{i}=(A w +z)_{i}\).
-
(a)
For the points \(\{x_i\}_{i\in \Psi _{1}}\), since \(z_{i}=0\) in (22), we have \((A w+z)_{i}=(Aw)_{i}=0\), that is, \(y_i(w^{\top } x_i)=0\). For a positive point, \(y_{i}=1\), this implies \(w^{\top } x_i=0\), which corresponds to (b) in Proposition 2. Therefore, the point \(x_i\) lies on the separating hyperplane H and is correctly classified.
-
(b)
For the points \(\{x_i\}_{i\in \Psi _{2}}\), since \(z_{i}=0\) in (23), we have \((Aw+z)_{i}=(Aw)_{i}>0\), that is, \(y_i(w^{\top } x_i)>0\). For a positive point, \(y_{i}=1\), this implies \(w^{\top } x_i>0\), which corresponds to (c) in Proposition 2. Therefore, the point \(x_i\) lies on the correctly classified side of the separating hyperplane H and is correctly classified.
-
(c)
For the points \(\{x_i\}_{i\in \Psi _{3}}\), since \(z_{i}>0\) in (24), we have \((Aw)_{i}<0\), that is, \(y_{i}(w^{\top } x_{i})<0\). For a positive point, \(y_{i}=1\), this implies \(w^{\top } x_i<0\), which corresponds to (a) in Proposition 2. Therefore, the point \(x_i\) lies on the misclassified side of the separating hyperplane H and is misclassified.
\(\square \)
In Sect. 4, Proposition 4 will be combined with Proposition 3 to prove Proposition 5. It might also be important to note that if a validation point \(x_{i}\) lies on the separating hyperplane H, then we will have \(0 \le \zeta _{i} < 1\).
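The classification underlying Proposition 4 can be made concrete: given \(z\) and the margins \((Aw)_i\), each validation point falls into exactly one of \(\Psi_1,\ \Psi_2,\ \Psi_3\). The following is an illustrative sketch (names and the exact thresholding are ours), following the conditions used in the proof:

```python
import numpy as np

def partition_validation_points(z, margins, tol=1e-8):
    """Partition indices into (Psi1, Psi2, Psi3) following
    Proposition 4: z_i > 0 marks a misclassified point, while
    z_i = 0 splits into on-hyperplane ((Aw)_i = 0) and the
    correctly classified side ((Aw)_i > 0)."""
    psi1 = [i for i in range(len(z))
            if z[i] <= tol and abs(margins[i]) <= tol]
    psi2 = [i for i in range(len(z)) if z[i] <= tol and margins[i] > tol]
    psi3 = [i for i in range(len(z)) if z[i] > tol]
    return psi1, psi2, psi3

z = np.array([0.0, 0.0, 0.7])
margins = np.array([0.0, 1.5, -0.7])   # entries (Aw)_i = y_i * w^T x_i
# point 0 lies on H, point 1 is on the correct side, point 2 is misclassified
```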
3.3 The global relaxation method (GRM)
Here, we present a numerical algorithm to solve the MPEC (21). There are various methods for solving MPECs; we refer to Dempe (2003), Luo et al. (1996) for surveys on the problem and to Ye (2005), Flegel (2005), Wu et al. (2015), Harder et al. (2021), Guo et al. (2015), Jara-Moroni et al. (2018), Júdice (2012), Li et al. (2015), Yu et al. (2019), Anitescu (2000), Facchinei and Pang (2007), Fletcher et al. (2006), Fukushima and Tseng (2002) for some of the latest solution methods. One of the most popular of these is the relaxation method due to Scholtes (2001). Recently, Hoheisel et al. (2013) compared five relaxation methods for solving MPECs and found that the GRM has the best theoretical (in terms of requiring weaker assumptions for convergence) and numerical performance. Therefore, we apply the GRM to solve our MPEC (21).
To simplify the presentation of the method, we now write problem (21) into a further compact format. Let \(v = \left[ C,\ \zeta ^{\top },\ z^{\top },\ \alpha ^{\top },\ \xi ^{\top }\right] ^{\top } \in \mathbb {R}^{\overline{m}+1}\) with \(\overline{m}= 2 T (m_{1}+m_{2})\) and define the functions
where
Problem (21) can then be written in the form
The basic idea of the GRM is as follows. Let \(\{t_{k}\} \downarrow 0\). At each iteration, we replace the MPEC (26) by the nonlinear program (NLP) of the following form, parameterized in \(t_k\):
The details of the GRM are shown in Algorithm 1.
Here, the maximum violation of all constraints Vio defined by
is used to measure the feasibility of the final iterate \(v_{opt}\), where \(\Vert \cdot \Vert _{\infty }\) denotes the \(l_{\infty }\) norm. Note that in step 4, the approximate solution refers to an approximate stationary point, in the sense that it satisfies the KKT conditions of (NLP-\(t_{k}\)) approximately. Numerically, we use the SNOPT solver (Gill et al. 2002) to compute the KKT points of (NLP-\(t_{k}\)) approximately, such that the norm of the KKT residual is less than a threshold \(\epsilon \!=\!10^{-6}\). The point \(v^{k+1}\) returned by the SNOPT solver is referred to as an approximate solution of (NLP-\(t_{k}\)). We use the GRM in Algorithm 1 to solve the MPEC (26), and obtain the optimal hyperparameter C and the corresponding function value \(F(v_{opt})\), which is the cross-validation error (CV error) measured on the validation sets in T-fold cross-validation. To analyze the convergence of the GRM, we need the concept of C-stationarity, which we define next.
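The relaxation scheme in Algorithm 1 can be illustrated on a toy MPEC. The following is a minimal sketch, not the paper's implementation (which uses SNOPT on (NLP-\(t_k\))); here SciPy's SLSQP solver stands in for the NLP solver, and the complementarity constraint \(0 \le x \perp y \ge 0\) is relaxed to \(x, y \ge 0\), \(xy \le t_k\) with \(t_k \downarrow 0\):

```python
import numpy as np
from scipy.optimize import minimize

# Toy MPEC: min (x-1)^2 + (y-1)^2  s.t.  0 <= x, 0 <= y, x*y = 0.
# Scholtes relaxation: replace x*y = 0 by x, y >= 0 and x*y <= t_k.
def solve_relaxed(t, v0):
    cons = [{"type": "ineq", "fun": lambda v: v[0]},                  # x >= 0
            {"type": "ineq", "fun": lambda v: v[1]},                  # y >= 0
            {"type": "ineq", "fun": lambda v, t=t: t - v[0] * v[1]}]  # x*y <= t
    obj = lambda v: (v[0] - 1.0) ** 2 + (v[1] - 1.0) ** 2
    return minimize(obj, v0, constraints=cons, method="SLSQP").x

v, t = np.array([0.5, 0.5]), 1.0
while t > 1e-8:              # drive the relaxation parameter to zero
    v = solve_relaxed(t, v)  # warm-start from the previous iterate
    t *= 0.01                # t_{k+1} = sigma * t_k with sigma = 0.01
# v approaches a C-stationary point of the toy MPEC
```

The final iterate is (approximately) feasible for the original complementarity constraint, which is exactly what the violation measure Vio checks for \(v_{opt}\).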
To proceed, let v be a feasible point for the MPEC (26) and recall that \(F(v),\ G(v)\) and H(v) are defined in (25). Based on v, let
Definition 1
(C-stationarity) Let v be a feasible point for the MPEC (26). Then v is said to be a C-stationary point, if there are multipliers \(\gamma ,\ \nu \in \mathbb {R}^{\overline{m}}\), such that
and \(\gamma _{i}=0\) for \( i \in I_{H},\ \nu _{i}=0\) for \(i \in I_{G}\), and \(\gamma _{i} \nu _{i} \ge 0\) for \(i \in I_{GH}\).
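For illustration, once multipliers and index sets are at hand, the sign conditions of Definition 1 can be checked mechanically. The helper below is a hypothetical sketch (names are ours), not part of the algorithm:

```python
def is_c_stationary_multiplier(gamma, nu, I_G, I_H, I_GH, tol=1e-8):
    """Check the multiplier sign conditions of Definition 1:
    gamma_i = 0 on I_H, nu_i = 0 on I_G, and gamma_i * nu_i >= 0
    on the biactive set I_GH (S-stationarity would instead require
    gamma_i >= 0 and nu_i >= 0 there)."""
    return (all(abs(gamma[i]) <= tol for i in I_H)
            and all(abs(nu[i]) <= tol for i in I_G)
            and all(gamma[i] * nu[i] >= -tol for i in I_GH))

# A biactive index with gamma = nu = -2 satisfies C-stationarity
# (product 4 >= 0) even though both multipliers are negative.
```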
Note that for problem (26), C-stationarity holds at any local optimal solution that satisfies the MPEC–MFCQ, which can be defined as follows Hoheisel et al. (2013).
Definition 2
A feasible point v for problem (26) satisfies the MPEC-MFCQ if and only if the set of gradient vectors
is positive-linearly independent.
Recall that the set of gradient vectors in (28) is said to be positive-linearly dependent if there exist scalars \(\{\delta _{i}\}_{i \in I_{G} \cup I_{GH}}\) and \( \{\beta _{i}\}_{i \in I_{H} \cup I_{GH}}\) with \(\delta _{i} \ge 0\) for \(i \in I_{G} \cup I_{GH}\), \(\beta _{i} \ge 0\) for \(i \in I_{H} \cup I_{GH}\), not all of them being zero, such that \(\Sigma _{i \in I_{G} \cup I_{GH}} \delta _{i} \nabla G_{i}(v)+\Sigma _{i \in I_{H} \cup I_{GH}} \beta _{i} \nabla H_{i}(v)=\mathbf{0}\). Otherwise, we say that this set of gradient vectors is positive-linearly independent.
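Positive-linear independence of a finite set of vectors can also be tested numerically as a linear feasibility problem: nonnegative multipliers, not all zero, can always be rescaled to sum to one. The following is an illustrative sketch (names are ours) using SciPy's `linprog`; for the MPEC-MFCQ one would apply it to the gradient rows in (28) with the appropriate sign restrictions:

```python
import numpy as np
from scipy.optimize import linprog

def positive_linearly_independent(V):
    """Check whether the rows of V are positive-linearly independent.
    They are positive-linearly *dependent* iff some rho >= 0 with
    sum(rho) = 1 (a normalization excluding rho = 0) satisfies
    V^T rho = 0; we test this feasibility problem with an LP."""
    n, d = V.shape
    A_eq = np.vstack([V.T, np.ones((1, n))])
    b_eq = np.append(np.zeros(d), 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n)
    return not res.success  # infeasible LP => independent

V1 = np.array([[1.0, 0.0], [0.0, 1.0]])    # independent
V2 = np.array([[1.0, 0.0], [-1.0, 0.0]])   # dependent: rho = (1/2, 1/2)
```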
Also note that various other stationarity concepts can be defined for problem (26); for more details on this, interested readers are referred to Dempe and Zemkoho (2012), Flegel (2005).
The following result establishes that Algorithm 1 is well-defined, as it provides a framework ensuring that a solution (or a stationary point, to be precise) exists for problem (NLP-\(t_{k}\)) as required.
Theorem 1
Hoheisel et al. (2013) Let v be a feasible point for the MPEC (26) such that MPEC-MFCQ is satisfied at v. Then there exists a neighborhood N of v and \(\overline{t} > 0\) such that standard MFCQ for (NLP-\(t_{k}\)) at \(t_k=t\) is satisfied at all feasible points of (NLP-\(t_{k}\)) at \(t_k=t\) in this neighborhood N for all \(t \in (0,\ \overline{t})\).
Subsequently, we have the following convergence result, which ensures that a sequence of stationary points of problem (NLP-\(t_{k}\)), computed by Algorithm 1, converges to a C-stationary point of problem (26).
Theorem 2
Hoheisel et al. (2013) Let \(\{t_{k}\} \downarrow 0\) and let \(v^k\) be a stationary point of (NLP-\(t_{k}\)) with \(v^{k} \rightarrow v\) such that MPEC-MFCQ holds at the feasible point v. Then v is a C-stationary point of the MPEC (26).
Clearly, the MPEC-MFCQ is crucial for the analysis of problem (26), as it not only ensures that the C-stationarity condition can hold at a locally optimal point, but also helps in establishing the two fundamental results in Theorems 1 and 2. Considering this importance of the condition, we carefully analyze it in the next section, and show, in particular, that it automatically holds at any feasible point of problem (26).
4 Fulfilment of the MPEC–MFCQ
In this section, we prove that every point in the feasible set of the MPEC (26) satisfies the MPEC–MFCQ. The rough idea of our proof is as follows. Firstly, by analyzing the relationship of different index sets (Proposition 5), we reach a reduced form of the MPEC–MFCQ (Proposition 6). Then based on the positive-linear independence of three submatrices (Lemmas 1–3), we eventually show the MPEC–MFCQ in Theorem 3. The roadmap of the proof is summarized in Fig. 7.
4.1 Relationships between the index sets
In this part, we first explore more properties about the index sets \(I_{H},\ I_{G},\ I_{GH}\), as they are the key to the analysis of the positive-linear independence of the vectors in (28). Let \(I_{H}:=\underset{k=1}{\overset{4}{\cup }}I_{H_{k}},\ I_{G}:=\underset{k=1}{\overset{4}{\cup }}I_{G_{k}},\) and \(I_{GH}:=\underset{k=1}{\overset{4}{\cup }}I_{GH_{k}}\), where
Here, \(Q_{u},\ Q_{l}\) are defined in (1) and (2), respectively. Furthermore, let
It can be observed that each index set \(I^{k},\ k=1,\ 2,\ 3,\ 4\) corresponds to the union of the three components in the partition involved in the corresponding part of the complementarity systems in (21); that is,
-
Part 1: \(I^{1}\) for the partition of the system \(\mathbf {0} \le \zeta \perp \ A B^{T} \alpha +z \ge \mathbf {0}\);
-
Part 2: \(I^{2}\) for the partition of the system \(\mathbf {0} \le z \perp \mathbf {1}-\zeta \ge \mathbf {0}\);
-
Part 3: \(I^{3}\) for the partition of the system \(\mathbf {0} \le \alpha \perp B B^{T} \alpha - \mathbf {1}+\xi \ge \mathbf {0}\);
-
Part 4: \(I^{4}\) for the partition of the system \(\mathbf {0} \le \xi \perp C \mathbf {1}-\alpha \ge \mathbf {0}\).
In the previous section, we have clarified a one-to-one correspondence between the index set of the validation points \(Q_u\) in (1) and the complementarity constraints in Part 1 and Part 2, respectively. It is clear that \(I^{1}=I^{2}=Q_{u}\). Similarly, we have \(I^{3}=I^{4}=Q_{l}\).
Next, we give the relationships between the index sets in (29); recall that we already have some index sets described in Propositions 3 and 4. For the convenience of the analysis, we divide the index set \(\Lambda _3\) in (10) into two subsets \(\Lambda ^{+}_{3}\) and \(\Lambda ^{c}_{3}\), as well as \(\Psi _{1}\) in (22) into \(\Psi ^{0}_{1}\) and \(\Psi ^{+}_{1}\):
Proposition 5
The index sets in (29) and the index sets in Proposition 3 and Proposition 4 have the following relationship:
-
(a)
In Part \(1,\ I_{H_{1}}=\Psi _{2},\ I_{G_{1}}=\Psi ^{+}_{1} \cup \Psi _{3},\ I_{GH_{1}}=\Psi ^{0}_{1}.\)
-
(b)
In Part \(2,\ I_{H_{2}}=\Psi _{1} \cup \Psi _{2},\ I_{G_{2}}=\Psi _{3},\ I_{GH_{2}}=\emptyset .\)
-
(c)
In Part \(3,\ I_{H_{3}}=\Lambda _{2},\ I_{G_{3}}=\Lambda _{3} \cup \Lambda _{u},\ I_{GH_{3}}=\Lambda _{1}.\)
-
(d)
In Part \(4,\ I_{H_{4}}=\Lambda _{1} \cup \Lambda _{2} \cup \Lambda ^{+}_{3},\ I_{G_{4}}=\Lambda _{u},\ I_{GH_{4}}=\Lambda ^{c}_{3}.\)
Here, \(\Lambda _{u}\) is defined as follows
Proof
According to the definition of the index sets in (29) and the index sets in Proposition 3 and Proposition 4, we have the following analysis.
-
(a)
In Part 1, for \(i \in I_{H_{1}}\), compared with the index set \(\Psi _{2}\) in (23), it follows that we have \(z_{i}=0\) and \(I_{H_{1}}=\Psi _{2}\). For \(i \in I_{G_{1}}\), compared with the index sets \(\Psi ^{+}_{1}\) in (33) and \(\Psi _{3}\) in (24), we get \(0<\zeta _{i}<1,\ z_{i}=0\) or \(\zeta _{i}=1,\ z_{i}>0\), and \(I_{G_{1}}=\Psi ^{+}_{1} \cup \Psi _{3}\). For \(i \in I_{GH_{1}}\), compared with the index set \(\Psi ^{0}_{1}\) in (32), we get \(z_{i}=0\) and \(I_{GH_{1}}=\Psi ^{0}_{1}\).
-
(b)
In Part 2, for \(i \in I_{H_{2}}\), compared with the index sets \(\Psi _{1}\) in (22) and \(\Psi _{2}\) in (23), we get \(I_{H_{2}}=\Psi _{1} \cup \Psi _{2}.\) For \(i \in I_{G_{2}}\), compared with the index set \(\Psi _{3}\) in (24), we get \((A B^{\top } \alpha +z)_{i}=0\) and \(I_{G_{2}}=\Psi _{3}\). For \(i \in I_{GH_{2}}\), no index set in Proposition 4 corresponds to \(I_{GH_{2}}\). Therefore, \(I_{GH_{2}}=\emptyset .\)
-
(c)
In Part 3, for \(i \in I_{H_{3}}\), compared with the index set \(\Lambda _{2}\) in (9), we get \(\xi _{i}=0\) and \(I_{H_{3}}=\Lambda _{2}\). For \(i \in I_{G_{3}}\), compared with the index sets \(\Lambda _{3}\) in (10) and \(\Lambda _{u}\) in (34), we get \(I_{G_{3}}=\Lambda _{3} \cup \Lambda _{u}.\) For \(i \in I_{GH_{3}}\), compared with the index set \(\Lambda _{1}\) in (8), we get \(\xi _{i}=0\) and \(I_{GH_{3}}=\Lambda _{1}\).
-
(d)
In Part 4, for \(i \in I_{H_{4}}\), compared with the index sets \(\Lambda _{1}\) in (8), \(\Lambda _{2}\) in (9) and \(\Lambda ^{+}_{3}\) in (30), we get \(I_{H_{4}}=\Lambda _{1} \cup \Lambda _{2} \cup \Lambda ^{+}_{3}.\) For \(i \in I_{G_{4}}\), compared with the index set \(\Lambda _{u}\) in (34), we get \((Bw-\mathbf {1}+\xi )_{i}=0\) and \(I_{G_{4}}=\Lambda _{u}.\) For \(i \in I_{GH_{4}}\), compared with the index set \(\Lambda ^{c}_{3}\) in (31), we get \((Bw-\mathbf {1}+\xi )_{i}=0\) and \(I_{GH_{4}}=\Lambda ^{c}_{3}\). \(\square \)
The results in Proposition 5 are demonstrated in Fig. 8. For example, for (a) in Proposition 5, the index sets of complementarity constraints in Part 1 are shown in Fig. 8a, which is about the relationship of \(I_{H_{1}},\ I_{G_{1}},\ I_{GH_{1}}\) in (29) and the index sets (22)–(24). In Fig. 8a, the red shaded part represents the index set \(I_{G_{1}}\), which contains the index sets \(\Psi ^{+}_{1}\) and \(\Psi _{3}\). (b)–(d) in Proposition 5 are demonstrated in Fig. 8 b–d. Specifically, in Fig. 8b, the red shaded part represents the index set \(I_{H_{2}}\), which contains the index sets \(\Psi _{1}\) (or \(\Psi ^{0}_{1} \cup \Psi ^{+}_{1}\)) and \(\Psi _{2}\). In Fig. 8c, the red shaded part represents the index set \(I_{G_{3}}\), which contains the index sets \(\Lambda _{3}\) (or \(\Lambda ^{+}_{3} \cup \Lambda ^{c}_{3}\)) and \(\Lambda _{u}\). In Fig. 8d, the red shaded part represents the index set \(I_{H_{4}}\), which contains the index sets \(\Lambda _{1},\ \Lambda _{2}\), and \(\Lambda ^{+}_{3}\).
4.2 The reduced form of the MPEC-MFCQ
Proposition 6
The set of gradient vectors in (28) at a feasible point v for the MPEC (26) can be written in the matrix form
where \(L_{q},\ q=1,\ \cdots ,\ 5\) are the index sets of columns corresponding to the variables C, \(\zeta \), z, \(\alpha \), and \(\xi \), respectively, and
Proof
Based on Definition 2, we can write the set of gradient vectors in (28) at a feasible point v in the rows of the matrix \(\Gamma \) as follows
Now, we can easily show that the matrix \(\Gamma \) in (37) is equivalent to the more specific form in (35). To proceed, first note that from Proposition 5 (a) and (b), we have
So, we get \(\Gamma _{a}^{3},\ \Gamma _{b}^{3},\ \Gamma _{c}^{2},\ \Gamma _{d}^{2},\ \Gamma _{e}^{2}\), and \(\Gamma _{f}^{3}\) in (36). On the other hand, from Proposition 5 (c) and (d), we have
Subsequently, it follows that \(\Gamma _{g}^{5},\ \Gamma _{h}^{5},\ \Gamma _{i}^{4},\ \Gamma _{j}^{4},\ \Gamma _{k}^{4},\ \Gamma _{l}^{4},\ \Gamma _{m}^{5}\), and \(\Gamma _{n}^{5}\) in (36). Therefore, we obtain the form of the matrix \(\Gamma \) in (35). \(\square \)
4.3 Three important lemmas
Due to the complicated form of \(\Gamma \) in (35), in this part, we first present three lemmas, addressing the positive-linear independence of three submatrices in \(\Gamma \) marked by blue, green and yellow, respectively. To proceed from here on, we define the size of each index set in (29) and Propositions 3–4 as follows. We denote the size of the index set \(I_{G_1}\) by \(S_1\), that is, \( \mid \! I_{G_1}\mid =S_1\). Similarly,
Further, we denote the index corresponding to each row in the matrices \(\Gamma _{a}^{3},\ \cdots \ \Gamma _{n}^{5}\), in (36) by \(a_{s},\ \cdots \ n_{s}\), respectively.
Lemma 1
The row vectors in the following matrix
are positive-linearly independent.
Proof
Assume that there exist \(\overline{\rho }^{c} \in \mathbb {R}^{W_{1}}\ \text {and}\ \overline{\rho }^{c} \ge \mathbf {0},\ \overline{\rho }^{d} \in \mathbb {R}^{U_{1}} \ \text {and}\ \overline{\rho }^{d} \ge \mathbf {0},\ \overline{\rho }^{e} \in \mathbb {R}^{S_{2}} \ \text {and}\ \overline{\rho }^{e} \ge \mathbf {0},\) such that
The above equation is equivalent to the following system
Since \(\overline{\rho }^{c} \ge \mathbf {0},\ \overline{\rho }^{d} \ge \mathbf {0},\ \overline{\rho }^{e} \ge \mathbf {0}\), we get \(\overline{\rho }^{c} = \mathbf {0},\ \overline{\rho }^{d} = \mathbf {0},\ \overline{\rho }^{e} = \mathbf {0}\) from Eq. (39). Therefore, the row vectors in the matrix (38) are positive-linearly independent. \(\square \)
Lemma 2
The row vectors in the following matrix
are positive-linearly independent.
Proof
Assume that there exist \(\overline{\rho }^{a} \in \mathbb {R}^{S_{1}}\ \text {and}\ \overline{\rho }^{a} \ge \mathbf {0},\ \overline{\rho }^{b} \in \mathbb {R}^{W_{1}} \ \text {and}\ \overline{\rho }^{b} \ge \mathbf {0},\ \overline{\rho }^{f} \in \mathbb {R}^{U_{2}} \ \text {and}\ \overline{\rho }^{f} \ge \mathbf {0},\) such that
The above equation is equivalent to the following system
Since \(\overline{\rho }^{a} \ge \mathbf {0},\ \overline{\rho }^{b} \ge \mathbf {0},\ \overline{\rho }^{f} \ge \mathbf {0}\), we get \(\overline{\rho }^{a} = \mathbf {0},\ \overline{\rho }^{b} = \mathbf {0},\ \overline{\rho }^{f} = \mathbf {0}\) from Eq. (41). Therefore, the row vectors in the matrix (40) are positive-linearly independent. \(\square \)
Lemma 3
The row vectors in the matrix \(\Gamma _{sub}\) defined by
are positive-linearly independent.
Proof
For the convenience of analysis, note that
and assume that we can find some vectors \(\overline{\rho }^{g} \in \mathbb {R}^{S_{3}}\) and \(\overline{\rho }^{g} \ge \mathbf {0}\), \(\overline{\rho }^{h} \in \mathbb {R}^{W_{2}}\) and \(\overline{\rho }^{h} \ge \mathbf {0}\), \(\overline{\rho }^{i} \in \mathbb {R}^{W_{2}}\) and \(\overline{\rho }^{i} \ge \mathbf {0}\), \(\overline{\rho }^{j} \in \mathbb {R}^{U_{3}}\) and \(\overline{\rho }^{j} \ge \mathbf {0}\), \(\overline{\rho }^{m} \in \mathbb {R}^{W_{3}}\) and \(\overline{\rho }^{m} \ge \mathbf {0}\), and \(\overline{\rho }^{n} \in \mathbb {R}^{U_{4}}\) and \(\overline{\rho }^{n} \ge \mathbf {0}\), such that
The above equation is equivalent to the compact system
which leads to \(\overline{\rho }^{g} = \mathbf {0},\ \overline{\rho }^{h} = \mathbf {0},\ \overline{\rho }^{i} = \mathbf {0},\ \overline{\rho }^{j} = \mathbf {0},\ \overline{\rho }^{m} = \mathbf {0},\ \overline{\rho }^{n} = \mathbf {0}\), given that \(\overline{\rho }^{g} \ge \mathbf {0},\ \overline{\rho }^{h} \ge \mathbf {0},\ \overline{\rho }^{i} \ge \mathbf {0},\ \overline{\rho }^{j} \ge \mathbf {0},\ \overline{\rho }^{m} \ge \mathbf {0},\ \overline{\rho }^{n} \ge \mathbf {0}\). Therefore, the row vectors in the matrix \(\Gamma _{sub}\) are positive-linearly independent. \(\square \)
4.4 The main result
Based on the above lemmas, we are ready to present the main theorem on the MPEC–MFCQ.
Theorem 3
Let \(v=(C,\zeta ,z,\alpha ,\xi )\) be any feasible point for the MPEC (26), then v satisfies the MPEC–MFCQ.
Proof
Assume there exist \(\overline{\rho }^{a} \in \mathbb {R}^{S_{1}}\ \text {and}\ \overline{\rho }^{a} \ge \mathbf {0},\ \overline{\rho }^{b} \in \mathbb {R}^{W_{1}}\ \text {and}\ \overline{\rho }^{b} \ge \mathbf {0},\ \overline{\rho }^{c} \in \mathbb {R}^{W_{1}}\ \text {and}\ \overline{\rho }^{c} \ge \mathbf {0},\ \overline{\rho }^{d} \in \mathbb {R}^{U_{1}}\ \text {and}\ \overline{\rho }^{d} \ge \mathbf {0},\ \overline{\rho }^{e} \in \mathbb {R}^{S_{2}}\ \text {and}\ \overline{\rho }^{e} \ge \mathbf {0},\ \overline{\rho }^{f} \in \mathbb {R}^{U_{2}}\ \text {and}\ \overline{\rho }^{f} \ge \mathbf {0},\ \overline{\rho }^{g} \in \mathbb {R}^{S_{3}}\ \text {and}\ \overline{\rho }^{g} \ge \mathbf {0},\ \overline{\rho }^{h} \in \mathbb {R}^{W_{2}} \ \text {and}\ \overline{\rho }^{h} \ge \mathbf {0},\ \overline{\rho }^{i} \in \mathbb {R}^{W_{2}} \ \text {and}\ \overline{\rho }^{i} \ge \mathbf {0},\ \overline{\rho }^{j} \in \mathbb {R}^{U_{3}}\ \text {and}\ \overline{\rho }^{j} \ge \mathbf {0},\ \overline{\rho }^{k} \in \mathbb {R}^{S_{4}}\ \text {and}\ \overline{\rho }^{k} \ge \mathbf {0},\ \overline{\rho }^{l} \in \mathbb {R}^{W_{3}}\ \text {and}\ \overline{\rho }^{l} \ge \mathbf {0},\ \overline{\rho }^{m} \in \mathbb {R}^{W_{3}} \ \text {and}\ \overline{\rho }^{m} \ge \mathbf {0},\ \overline{\rho }^{n} \in \mathbb {R}^{U_{4}} \ \text {and}\ \overline{\rho }^{n} \ge \mathbf {0},\) such that the following holds
From the first row in Eq. (43), we get \(\sum \limits _{s=1}^{S_4} \rho _{s}^{k}+\sum \limits _{s=1}^{W_3} \rho _{s}^{l}=0\). Together with the fact that \(\overline{\rho }^{k} \ge \mathbf {0},\ \overline{\rho }^{l}\ge \mathbf {0}\), we get \(\overline{\rho }^{k}=\mathbf {0}\) and \(\overline{\rho }^{l}=\mathbf {0}\). From Lemma 1, we get \(\overline{\rho }^{c} = \mathbf {0},\ \overline{\rho }^{d} = \mathbf {0},\ \overline{\rho }^{e} = \mathbf {0}\) in Eq. (43). From Lemma 2, we get \(\overline{\rho }^{a} = \mathbf {0},\ \overline{\rho }^{b} = \mathbf {0},\ \overline{\rho }^{f} = \mathbf {0}\) in Eq. (43). From Lemma 3, we get \(\overline{\rho }^{g} = \mathbf {0},\ \overline{\rho }^{h} = \mathbf {0},\ \overline{\rho }^{i} = \mathbf {0},\ \overline{\rho }^{j} = \mathbf {0},\ \overline{\rho }^{m} = \mathbf {0},\ \overline{\rho }^{n} = \mathbf {0}\) in Eq. (43).
In summary, the row vectors in the matrix \(\Gamma \) (35) are positive-linearly independent at every feasible point v for the MPEC (26). That is to say, every feasible point v for the MPEC (26) satisfies the MPEC-MFCQ. \(\square \)
5 Numerical results
In this section, we present the GR–CV, which is a concrete implementation of the GRM in Algorithm 1 for selecting the hyperparameter C in SVC, as shown in Algorithm 2. We show numerical results of the proposed GR–CV, and compare it with other approaches.
All the numerical tests are conducted in Matlab R2018a on a Windows 7 Dell Laptop with an Intel(R) Core(TM) i5-6500U CPU at 3.20GHz and 8 GB of RAM. All the data sets are collected from the LIBSVM library: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Each data set is split into a subset \(\Omega \) with \(l_{1}\) points (used for cross-validation) and a hold-out test set \(\Theta \) with \(l_{2}\) points. The data descriptions are shown in Table 1.
We compare our GR–CV with two other approaches: the inexact cross-validation method (In-CV) and the grid search method (G–S). In-CV (Kunapuli et al. 2008b) is a relaxation method based on relaxing the complementarity constraints by a prescribed tolerance parameter \(\mathbf {tol} > 0\); that is, it solves (NLP-\(t_{k}\)) once with \(t_{k}=\mathbf {tol}\) held fixed, rather than decreasing \(t_{k}\) gradually.
The parameters of three methods are set as follows. For GR–CV, we set the initial values as \(v_{0}=\left[ 1,\ \mathbf {0}_{1 \times \overline{m}}\right] ^{\top }\), \(t_{0}=1,\ t_{\min }=10^{-8},\ \sigma =0.01.\) The relaxed subproblems (NLP-\(t_{k}\)) are solved by the snsolve function, which is part of the SNOPT solver (Gill et al. 2002). For In-CV, we use the same \(v_{0}\) as in GR–CV and \(\mathbf {tol}=10^{-4}\). For G–S, we use \(C \in \{10^{-4},\ 10^{-3},\ 10^{-2}\), \(10^{-1},\ 1,\ 10^{1},\ 10^{2},\ 10^{3},\ 10^{4}\}\), which is a commonly used grid range (Bennett et al. 2006; Kunapuli et al. 2008b; Kunapuli 2008; Moore et al. 2009). In each training process, the ALM–SNCG algorithm from Yan and Li (2020), which is outstanding and competitive with the most popular methods in LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) in both speed and accuracy, is used to solve the \(l_{1}\)-loss SVC problem.
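For comparison, the G–S baseline amounts to an exhaustive loop over the grid: for each candidate C, train on the training folds, measure the validation error, and keep the minimizer. The sketch below is illustrative only (names are ours, and a plain subgradient solver stands in for the ALM–SNCG solver used in the paper):

```python
import numpy as np

def train_l1_svc(X, y, C, iters=500):
    # Plain subgradient descent on the l1-loss SVC objective
    # 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i); a simple
    # stand-in for the ALM-SNCG solver used in the paper.
    w = np.zeros(X.shape[1])
    for k in range(1, iters + 1):
        margins = y * (X @ w)
        active = margins < 1
        g = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= g / k                      # diminishing step size 1/k
    return w

def grid_search_cv(X, y, grid, T=3):
    folds = np.array_split(np.random.permutation(len(y)), T)
    errs = []
    for C in grid:
        err = 0.0
        for t in range(T):
            val = folds[t]
            trn = np.concatenate([folds[s] for s in range(T) if s != t])
            w = train_l1_svc(X[trn], y[trn], C)
            err += np.mean(np.sign(X[val] @ w) != y[val]) / T
        errs.append(err)
    return grid[int(np.argmin(errs))]   # C with smallest CV error

grid = [10.0 ** k for k in range(-4, 5)]   # the grid used for G-S
```

This also makes the drawback noted in the introduction visible: the search is restricted to the nine discrete grid values, whereas GR–CV treats C as a continuous variable.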
We compare the aforementioned methods in the following three aspects:
-
1.
Test error (\(E_{t}\)) as defined by
$$\begin{aligned} E_{t}=\frac{1}{l_{2}} \sum _{(x, y) \in \Theta } \frac{1}{2} \mid {\text {sign}}\left( \widehat{w}^{\top } x\right) -y \mid , \end{aligned}$$which is a measure of the ability of generalization.
-
2.
CV error (\(E_{C}\)) as defined in the objective function of problem (14).
-
3.
The number of iterations \(k\) of an algorithm, and the total number of iterations \(it\) for solving the subproblems, reported as the pair \((k,\ it)\).
We also report the maximum violation of all constraints defined as in (27), to measure the feasibility of the final solution given by GR–CV and In-CV.
The results are reported in Table 2, where we mark the winners of test error \(E_{t}\), CV error \(E_{C}\) and the maximum violation of all constraints Vio in bold. We also show the comparisons of the three methods for different data sets on test error \(E_{t}\) and CV error \(E_{C}\) in Figs. 9 and 10, respectively. The data sets on the horizontal axis are arranged in the order shown in Table 2.
From Figs. 9, 10 and Table 2, we have the following observations. Firstly, GR–CV performs the best in terms of test error in Fig. 9, implying that our approach generalizes better. Secondly, in terms of CV error in Fig. 10, GR–CV is competitive with G–S: GR–CV wins on five of the twelve data sets, whereas G–S wins on eight. Finally, comparing GR–CV with In-CV, the feasibility of the solution returned by GR–CV is significantly better, since the Vio value given by GR–CV is much smaller than that given by In-CV. In terms of CPU time, In-CV naturally takes less time than GR–CV, since it solves the relaxed problem (NLP-\(t_{k}\)) only once. Since G–S solves a completely different type of problem to find the hyperparameter C, a CPU time comparison between GR–CV and G–S is not meaningful.
To further study the effect of the number of folds on the test error \(E_{t}\) and the CV error \(E_{C}\) for the three methods, we report results on the Australian data set in Fig. 11. The results show that as T changes, the test error for GR–CV is always the lowest, and the CV error for GR–CV is competitive with the other two methods. Although GR–CV successfully handles larger numbers of folds, the computing time grows with the number of folds because of the increasing number of variables and constraints in the MPEC to be solved. The ranges of the test error and CV error for different numbers of folds are not large, so \(T=3\) represents a reasonable choice.
6 Conclusion
In this paper, we have proposed a bilevel optimization model for the hyperparameter selection for support vector classification in which the upper-level problem minimizes a T-fold cross-validation error and the lower-level problems are T \(l_{1}\)-loss SVC problems on the training sets. We reformulated the bilevel optimization problem into an MPEC, and proposed the GR–CV to solve it based on the GRM from Scholtes (2001). We also proved that the MPEC–MFCQ automatically holds at each feasible point. Extensive numerical results on the data sets from the LIBSVM library demonstrated the superior generalization performance of the proposed method over almost all the data sets used in this paper. The proposed approach has the potential to deal with other hyperparameter selection problems in SVM, which may involve multiple hyperparameters or other types of loss functions. However, whether the resulting MPEC enjoys the property of MPEC–MFCQ needs to be further investigated. How to choose the most suitable numerical algorithms to solve the perturbed problem resulting from the Scholtes relaxation is also worth further study. These topics will be investigated further in the near future.
Notes
We choose the \(l_1\)-loss SVC model as the typical lower-level problem due to the following reasons. Firstly, from the practical perspective, the \(l_1\)-loss SVC model is a widely used statistical model in machine learning (Yan and Li 2020; Zhang 2004; Shalev-Shwartz et al. 2011). Secondly, the \(l_1\)-loss SVC model is more challenging to tackle than the \(l_2\)-loss SVC model due to the nonsmoothness of the \(l_1\)-loss function.
References
Anitescu M (2000) On solving mathematical programs with complementarity constraints as nonlinear programs. Preprint ANL/MCS-P864-1200, Argonne National Laboratory, Argonne, IL
Bennett KP, Hu J, Ji XY, Kunapuli G, Pang J-S (2006) Model selection via bilevel optimization. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp 1922–1929. IEEE
Bennett KP, Kunapuli G, Hu J, Pang J-S (2008) Bilevel optimization and machine learning. In: IEEE World Congress on Computational Intelligence, pp 25–47
Chapelle O, Vapnik V, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Mach Learn 46(1):131–159
Chauhan VK, Dahiya K, Sharma A (2019) Problem formulations and solvers in linear SVM: a review. Artif Intell Rev 52(2):803–855
Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Ann Oper Res 153(1):235–256
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Couellan N, Wang WJ (2015) Bi-level stochastic gradient for large scale support vector machine. Neurocomputing 153:300–308
Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge
Crockett C, Fessler JA (2021) Bilevel methods for image reconstruction. arXiv preprint arXiv:2109.09610
Dempe S (2002) Foundations of Bilevel Programming. Springer, New York
Dempe S (2003) Annotated bibliography on bilevel programming and mathematical programs with equilibrium constraints. Optimization 52(3):333–359
Dempe S, Zemkoho AB (2012) On the Karush-Kuhn-Tucker reformulation of the bilevel optimization problem. Nonlinear Analysis: Theory, Methods Appl 75(3):1202–1218
Dempe S, Zemkoho AB (2020) Bilevel Optimization Advances and Next Challenges. Springer, New York
Dong Y-L, Xia Z-Q, Wang M-Z (2007) An MPEC model for selecting optimal parameter in support vector machines. In: The First International Symposium on Optimization and Systems Biology, pp 351–357
Duan KB, Keerthi SS, Poo AN (2003) Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 51:41–59
Facchinei F, Pang J-S (2007) Finite-dimensional Variational Inequalities and Complementarity Problems. Springer, New York
Fischer A, Zemkoho AB, Zhou S (2021) Semismooth Newton-type method for bilevel optimization: global convergence and extensive numerical experiments. Optim Methods Softw 1–35
Flegel ML (2005) Constraint qualifications and stationarity concepts for mathematical programs with equilibrium constraints. PhD thesis, Universität Würzburg
Fletcher R, Leyffer S, Ralph D, Scholtes S (2006) Local convergence of SQP methods for mathematical programs with equilibrium constraints. SIAM J Optim 17(1):259–286
Fukushima M, Tseng P (2002) An implementable active-set algorithm for computing a B-stationary point of a mathematical program with linear complementarity constraints. SIAM J Optim 12(3):724–739
Galli L, Lin C-J (2021) A study on truncated Newton methods for linear classification. IEEE Transactions on Neural Networks and Learning Systems
Gill PE, Murray W, Saunders MA (2002) User's Guide for SNOPT Version 6: A Fortran Package for Large-Scale Nonlinear Programming. University of California, California
Guo L, Lin G-H, Ye JJ (2015) Solving mathematical programs with equilibrium constraints. J Optim Theory Appl 166(1):234–256
Harder F, Mehlitz P, Wachsmuth G (2021) Reformulation of the M-stationarity conditions as a system of discontinuous equations and its solution by a semismooth Newton method. SIAM J Optim 31(2):1459–1488
Hoheisel T, Kanzow C, Schwartz A (2013) Theoretical and numerical comparison of relaxation methods for mathematical programs with complementarity constraints. Math Program 137(1):257–288
Hsieh C-J, Chang K-W, Lin C-J, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp 408–415
Huang X, Shi L, Suykens JA (2013) Support vector machine classifier with pinball loss. IEEE Trans Pattern Anal Mach Intell 36(5):984–997
Jara-Moroni F, Pang J-S, Wächter A (2018) A study of the difference-of-convex approach for solving linear programs with complementarity constraints. Math Program 169(1):221–254
Júdice JJ (2012) Algorithms for linear programming with linear complementarity constraints. TOP 20(1):4–25
Keerthi S, Sindhwani V, Chapelle O (2006) An efficient method for gradient-based adaptation of hyperparameters in SVM models. Advances in neural information processing systems 19
Kunapuli G (2008) A Bilevel Optimization Approach to Machine Learning. Rensselaer Polytechnic Institute, New York
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming. Optim Methods Softw 23(4):475–489
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Bilevel model selection for support vector machines. Data Mining and Mathematical Programming 45:129–158
Kunisch K, Pock T (2013) A bilevel optimization approach for parameter learning in variational models. SIAM J Imaging Sci 6(2):938–983
Lee Y-C, Pang J-S, Mitchell JE (2015) Global resolution of the support vector machine regression parameters selection problem with LPCC. EURO J Comput Optim 3(3):197–261
Li JL, Huang RS, Jian JB (2015) A superlinearly convergent QP-free algorithm for mathematical programs with equilibrium constraints. Appl Math Comput 269:885–903
Lin G-H, Xu MW, Ye JJ (2014) On solving simple bilevel programs with a nonconvex lower level program. Math Program 144(1):277–305
Luo G (2016) A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw Model Anal Health Inform Bioinform 5(1):1–16
Luo Z-Q, Pang J-S, Ralph D (1996) Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge
Mangasarian OL (1994) Misclassification minimization. J Global Optim 5(4):309–323
Mejía-de-Dios J-A, Mezura-Montes E (2019) A metaheuristic for bilevel optimization using Tykhonov regularization and the quasi-Newton method. In: 2019 IEEE Congress on Evolutionary Computation (CEC), pp 3134–3141
Momma M, Bennett KP (2002) A pattern search method for model selection of support vector regression. In: Proceedings of the 2002 SIAM International Conference on Data Mining, pp 261–274
Moore G, Bergeron C, Bennett KP. Gradient-type methods for primal SVM model selection
Moore G, Bergeron C, Bennett KP (2009) Nonsmooth bilevel programming for hyperparameter selection. In: 2009 IEEE International Conference on Data Mining Workshops, pp 374–381
Ochs P, Ranftl R, Brox T, Pock T (2016) Techniques for gradient-based bilevel optimization with non-smooth lower level problems. J Math Imaging Vis 56(2):175–194
Ochs P, Ranftl R, Brox T, Pock T (2015) Bilevel optimization with nonsmooth lower level problems. In: International Conference on Scale Space and Variational Methods in Computer Vision, pp 654–665
Okuno T, Takeda A, Kawana A (2018) Hyperparameter learning for bilevel nonsmooth optimization. arXiv preprint arXiv:1806.01520
Scholtes S (2001) Convergence properties of a regularization scheme for mathematical programs with complementarity constraints. SIAM J Optim 11(4):918–936
Shalev-Shwartz S, Singer Y, Srebro N, Cotter A (2011) Pegasos: Primal estimated sub-gradient solver for SVM. Math Program 127(1):3–30
Vapnik V (2013) The Nature of Statistical Learning Theory. Springer, New York
Wang H, Shao Y, Zhou S, Zhang C, Xiu N (2021) Support vector machine classifier via \({L}_{0/1}\) soft-margin loss. IEEE Trans Pattern Anal Mach Intell
Wu J, Zhang LW, Zhang Y (2015) An inexact Newton method for stationary points of mathematical programs constrained by parameterized quasi-variational inequalities. Numer Algorithms 69(4):713–735
Yan YQ, Li QN (2020) An efficient augmented Lagrangian method for support vector machine. Optim Methods Softw 35(4):855–883
Ye JJ (2005) Necessary and sufficient optimality conditions for mathematical programs with equilibrium constraints. J Math Anal Appl 307(1):350–369
Ye JJ, Zhu DL (2010) New necessary optimality conditions for bilevel programs by combining the MPEC and value function approaches. SIAM J Optim 20(4):1885–1905
Yu B, Mitchell JE, Pang J-S (2019) Solving linear programs with complementarity constraints using branch-and-cut. Math Program Comput 11(2):267–310
Yu T, Zhu H (2020) Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689
Zemkoho AB, Zhou SL (2021) Theoretical and numerical comparison of the Karush-Kuhn-Tucker and value function reformulations in bilevel optimization. Comput Optim Appl 78(2):625–674
Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, p 116
Acknowledgements
We would like to thank the editor for the efficient handling of our submission. We would also like to thank the two anonymous referees for their valuable comments, which have helped us improve the presentation of the paper.
Q. Li: This author's research is supported by the National Natural Science Foundation of China (NSFC) grant 12071032.
A. Zemkoho: The work of this author is supported by the EPSRC grant EP/V049038/1 and the Alan Turing Institute under the EPSRC grant EP/N510129/1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Li, Q., Li, Z. & Zemkoho, A. Bilevel hyperparameter optimization for support vector classification: theoretical analysis and a solution method. Math Meth Oper Res 96, 315–350 (2022). https://doi.org/10.1007/s00186-022-00798-6
Keywords
- Support vector classification
- Hyperparameter selection
- Bilevel optimization
- Mathematical program with equilibrium constraints
- C-stationarity