1 Introduction

Support vector classification (SVC) is a classical and widely used learning method for classification problems; see, e.g., (Chauhan et al. 2019; Cortes and Vapnik 1995; Vapnik 2013). In SVC, hyperparameter selection is a critical issue and has been addressed by many researchers, both theoretically and practically (Chapelle et al. 2002; Dong et al. 2007; Duan et al. 2003; Keerthi et al. 2006; Kunapuli 2008; Kunapuli et al. 2008a, b). While there have been many interesting attempts to use bounds, gradient descent methods or other techniques to identify these hyperparameters (Chapelle et al. 2002; Duan et al. 2003; Keerthi et al. 2006), one of the most widely used methods is cross-validation (CV). A classical approach for cross-validation is the grid search method (Momma and Bennett 2002), where one defines a grid over the hyperparameters of interest and searches for the combination of hyperparameters that minimizes the cross-validation error (CV error). Bennett et al. (2006) emphasize that one of the drawbacks of the grid search approach is that the continuity of the hyperparameter is ignored by the discretization. A bilevel optimization formulation has therefore been proposed to choose hyperparameters (Bennett et al. 2006; Kunapuli 2008). Below, we focus on the bilevel optimization approach, which is the most relevant to our work. We refer to Yu and Zhu (2020), Luo (2016) for surveys of various hyperparameter optimization methods and applications.

In terms of selecting hyperparameters through bilevel optimization, different models and approaches have been considered in the literature. For example, Okuno et al. (2018) propose a bilevel optimization model to select the best hyperparameter for a nonsmooth, possibly nonconvex, \(l_{p}\)-regularized problem. They then present a smoothing-type algorithm with convergence analysis to solve this bilevel optimization model. Kunisch and Pock (2013) formulate the parameter learning problem for a variational image denoising model as a bilevel optimization problem. They design a semismooth Newton method for solving the resulting nonsmooth bilevel optimization problems. Moore et al. [17] develop an implicit gradient-type algorithm for selecting hyperparameters for linear SVM-type machine learning models expressed as bilevel optimization problems. Moore et al. (2009) propose a nonsmooth bilevel model to select hyperparameters for support vector regression (SVR) via T-fold cross-validation. They design a proximity control approximation algorithm to solve this bilevel optimization model. Couellan and Wang (2015) design a bilevel stochastic gradient algorithm for training large-scale SVMs with automatic selection of the hyperparameter. We refer to Crockett and Fessler (2021), Colson et al. (2007), Dempe (2002), Dempe and Zemkoho (2020) for recent general surveys on bilevel optimization, as well as Mejía-de-Dios and Mezura-Montes (2019), Zemkoho and Zhou (2021), Fischer et al. (2021), Lin et al. (2014), Ye and Zhu (2010), Ochs et al. (2016, 2015) for some of the latest algorithms on the subject. Next, we provide a brief overview of the MPEC reformulation of the bilevel optimization problem, which plays a fundamental role in this paper.

For a bilevel program, replacing the lower-level problem by its Karush–Kuhn–Tucker (KKT) conditions results in a mathematical program with equilibrium constraints (MPEC) Luo et al. (1996). Therefore, various algorithms for MPECs can potentially be applied to solve bilevel optimization problems, although one should keep in mind that the two problems are not necessarily equivalent. Bennett and her collaborators have produced a series of works (Bennett et al. 2006; Kunapuli et al. 2008b; Bennett et al. 2008; Kunapuli et al. 2008a; Kunapuli 2008) on hyperparameter selection by reformulating a bilevel program as an MPEC. For example, Kunapuli et al. (2008b) consider a bilevel optimization model for selecting many hyperparameters for \(l_{1}\)-loss SVC problems, in which the upper-level problem has box constraints for the regularization parameter and feature selection. They reformulate this bilevel program into an MPEC and solve it by the inexact cross-validation method. Other methods include Newton-type algorithms (Wu et al. 2015; Harder et al. 2021; Lee et al. 2015).

Considering these works, a natural question is whether one can build a bilevel hyperparameter selection model for SVC. If so, are there special, hidden properties when the corresponding bilevel optimization problem is transformed into its MPEC reformulation, and how can the resulting problem be solved efficiently? These questions are the main motivation for the work in this paper.

In this paper, we consider a bilevel optimization model for selecting the hyperparameter in SVC. The regularization hyperparameter C is selected to minimize the T-fold cross-validated estimate of the out-of-sample misclassification error, which is basically a 0–1 loss function. Therefore, the upper-level problem minimizes the average misclassification error in T-fold cross-validation based on the optimal solutions of the lower-level problems (we use the typical \(l_{1}\)-loss SVC model) over all possible values of the hyperparameter C. There are several challenges in designing efficient algorithms for such potentially large-scale bilevel programs. Firstly, the objective function of the upper-level problem is a 0–1 loss function, which is discontinuous and nonconvex. Secondly, the constraints of the upper-level problem involve the optimal solution set of the lower-level problem, i.e., the \(l_{1}\)-loss SVC model, whose optimal solution is not explicitly given. To deal with the first challenge, we reformulate the minimization of the 0–1 loss function as a linear optimization problem, inspired by the technique in Mangasarian (1994). We then replace the lower-level problem by its optimality conditions to tackle the second challenge. This leads to an MPEC.

The contributions of the paper are as follows. Firstly, we propose a bilevel optimization model for hyperparameter selection in binary SVC and study its reformulation as an MPEC. Secondly, we apply the global relaxation method (GRM) originating from Scholtes (2001) to solve this MPEC, which is shown to converge to a C-stationary point. The resulting algorithm is called the GR–CV; it is a concrete implementation of the GRM for selecting the hyperparameter C in SVC. Thirdly, we prove that the MPEC–Mangasarian–Fromovitz constraint qualification (MPEC–MFCQ, for short) holds at every feasible point of our MPEC. The MPEC–MFCQ is a key property to guarantee the convergence of the GRM, and we show that it automatically holds for our problem thanks to its special structure. Finally, we conduct extensive numerical experiments, which show that our method is very efficient; in particular, it enjoys superior generalization performance on almost all the data sets used in this paper.

The paper is organized as follows. In Sect. 2, based on T-fold cross-validation for SVC, we introduce a bilevel optimization model to select an optimal hyperparameter for SVC. We also analyze some interesting properties of the lower-level problem. In Sect. 3, we reformulate the bilevel optimization problem as an MPEC (also known as the KKT reformulation), and apply the GRM for solving the MPEC. In Sect. 4, we prove that every feasible point of this MPEC satisfies the regularity condition MPEC–MFCQ, which is a key property to guarantee the convergence of the GRM. In Sect. 5, we present computational experiments comparing the resulting GR–CV with two other methods that have been used in the literature for a similar purpose, namely the inexact cross-validation method (In–CV) and the grid search method (G–S). We conclude the paper in Sect. 6.

Notations. For \(x \in \mathbb {R}^{n}\), \(\Vert x \Vert _{0}\) denotes the number of nonzero elements in x, while \(\Vert x \Vert _{1}\) and \(\Vert x \Vert _{2}\) correspond to the \(l_{1}\)-norm and \(l_{2}\)-norm of x, respectively. Also, we will use \(x_{+}=((x_{1})_{+},\ \cdots ,\ (x_{n})_{+}) \in \mathbb {R}^{n}\), where \((x_{i})_{+}=\max (x_{i},\ 0)\). For a set \(\Omega \subset \mathbb {R}^n\), \(|\Omega |\) denotes the number of its elements. We use \(\mathbf {1}_{k}\) to denote the all-ones vector in \(\mathbb {R}^{k}\). \(I_{k}\) is the identity matrix in \(\mathbb {R}^{k \times k}\), while \(e^{k}_{\gamma }\) is the \(\gamma \)-th row vector of the identity matrix in \(\mathbb {R}^{k \times k}\). The notation \(\mathbf {0}_{k \times q}\) represents the zero matrix in \(\mathbb {R}^{k \times q}\) and \(\mathbf {0}_{k}\) stands for the zero vector in \(\mathbb {R}^{k}\). On the other hand, \(\mathbf {0}_{(\tau ,\ \kappa )}\) will be used for a submatrix of the zero matrix, where \(\tau \) is the index set of the rows and \(\kappa \) is the index set of the columns. Similarly, \(I_{(\tau ,\ \tau )}\) corresponds to the submatrix of an identity matrix indexed by the rows and columns in the set \(\tau \). Finally, \(\Theta _{(\tau ,\ \cdot )}\) represents the submatrix of a matrix \(\Theta \) formed by the rows indexed by \(\tau \), and \(x_{\tau }\) is the subvector of a vector x corresponding to the index set \(\tau \).

2 Bilevel hyperparameter optimization for SVC

We start this section by first introducing the problem settings in relation to the T-fold cross-validation for SVC. Subsequently, we present the lower-level problem with some interesting and relevant properties for further analysis in the later parts of the paper. Finally, we introduce the upper-level problem, that is, the bilevel optimization model for hyperparameter selection in SVC.

2.1 T-fold cross-validation for SVC

As discussed in the introduction, the most commonly used method for selecting the hyperparameter C is T-fold cross-validation. In T-fold cross-validation, the data set is split into a subset \(\Omega \) with \(l_{1}\) points, which is used for cross-validation, and a hold-out test set \(\Theta \) with \(l_{2}\) points. Here, \(\Omega =\{(x_{i},y_{i})\}_{i=1}^{l_{1}} \in \mathbb {R}^{n+1}\), where \(x_{i} \in \mathbb {R}^{n}\) denotes a data point and \(y_{i}\in \{\pm 1\}\) the corresponding label. For T-fold cross-validation, \(\Omega \) is equally partitioned into T disjoint subsetsFootnote 1, as done in Couellan and Wang (2015), Moore et al. (2009), one for each fold. The process is executed for T iterations. In the t-th iteration (\(t=1, \ldots , T\)), the t-th fold is the validation set \(\Omega _{t}\), and the remaining \(T-1\) folds make up the training set \(\overline{\Omega }_{t}=\Omega \backslash \Omega _{t}\). Therefore, in the t-th iteration, the separating hyperplane is trained using the training set \(\overline{\Omega }_{t}\), and the validation error is computed on the validation set \(\Omega _{t}\).

Then, the cross-validation error (CV error) is the average of the validation error over all T iterations. The value of C that gives the best CV error is selected. Finally, the classifier is trained using all the data in \(\Omega \) and the rescaled optimal C, and the test error is computed on the test set \(\Theta \). Note that the CV error and the test error are the evaluation indices for the classification performance in T-fold cross-validation. As shown in Fig. 1, for three-fold cross-validation, the yellow part is the subset \(\Omega \) used for three-fold cross-validation. In the first iteration, the blue part is the validation set \(\Omega _{1}\), and the remaining two folds form the training set \(\overline{\Omega }_{1}\). The second and third iterations are interpreted analogously.

Fig. 1: Three-fold cross-validation

Let \(m_{1}\) be the size of the validation set \(\Omega _{t}\) and \(m_{2}\) the size of the training set \(\overline{\Omega }_{t}\). The corresponding index sets for the validation and training sets are \(\mathcal {N}_{t}\) and \(\overline{\mathcal {N}}_{t}\), respectively. In T-fold cross-validation, there are T validation sets, and hence \(T m_{1}\) validation points in total. We use the index set

$$\begin{aligned} Q_{u}:=\{i\ \mid \ i=1,\ 2,\ \cdots ,\ Tm_{1}\} \end{aligned}$$
(1)

to represent all the validation points in T-fold cross-validation. Similarly, there are \(T m_{2}\) training points in total, and we use the index set

$$\begin{aligned} Q_{l}:=\{i\ \mid \ i=1,\ 2,\ \cdots ,\ Tm_{2}\} \end{aligned}$$
(2)

to represent all the training points in T-fold cross-validation. These two index sets will be used later.

To analyze the different cases of data points in the training set and the validation set, we need to introduce the soft-margin support vector classification (without bias term) Cristianini and Shawe-Taylor (2000), Galli and Lin (2021). The traditional SVC model, referred to as hard-margin SVC, requires the data to be strictly separated, i.e., the constraints must be satisfied strictly. In contrast, the regularized model (soft-margin SVC) allows data points to be wrongly labelled, i.e., the inequality constraints may be violated, as in the model

$$\begin{aligned} \underset{ w \in \mathbb {R}^{n}}{{\text {min}}} \frac{1}{2}\Vert w\Vert _{2}^{2}+ C \sum _{i=1}^{\tau } \xi (w;\quad x_i,y_i), \end{aligned}$$
(3)

where \(C \! \ge \!0\) is a penalty parameter and \(\xi (\cdot )\) is the loss function. If \(\xi (w;\quad x_i,y_i)\!=\!(1-y_{i}(x_{i}^{\top }w))_{+}\), it is referred to as the \(l_1\)-loss function; if \(\xi (w;\quad x_i,y_i)=(1-y_{i}(x_{i}^{\top }w))_{+}^{2}\), it is referred to as the \(l_2\)-loss function. We refer to Chauhan et al. (2019), Wang et al. (2021), Huang et al. (2013) for various other types of loss functions. Next, we use Fig. 2 to show geometric relationships of different cases in soft-margin SVC.
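Before turning to the geometry in Fig. 2, the following minimal sketch (not taken from the paper; CVXPY and the toy data are our own illustration) shows how the soft-margin \(l_1\)-loss SVC in (3), without bias term, can be solved for a given value of C:

```python
# A minimal sketch (ours): the soft-margin l1-loss SVC in (3), without bias
# term, solved with CVXPY on placeholder data.
import cvxpy as cp
import numpy as np

def train_l1_svc(X, y, C):
    """Solve  min_w 0.5*||w||_2^2 + C * sum_i (1 - y_i * x_i^T w)_+ ."""
    w = cp.Variable(X.shape[1])
    margins = cp.multiply(y, X @ w)          # y_i * x_i^T w
    hinge = cp.sum(cp.pos(1 - margins))      # l1 (hinge) loss
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * hinge)).solve()
    return w.value

# Toy usage: random, roughly separable data (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = np.sign(X @ np.array([1.0, -0.5, 0.2]) + 1e-3)
w_hat = train_l1_svc(X, y, C=1.0)
```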

For a sample \((x_{i},y_{i})\), the point \(x_{i}\) is referred to as a positive point if \(y_{i}=1\); the point \(x_{i}\) is referred to as a negative point if \(y_{i}=-1\). In Fig. 2, the plus signs ‘\(+\)’ are the positive points (i.e., \(y_i=1\)) and the minus signs ‘−’ are the negative ones (i.e., \(y_i=-1\)). The distance between the hyperplanes \(H_{1}: w^{\top } x=1\) and \(H_{2}: w^{\top } x=-1\) is called margin. The separating hyperplane H lies between \(H_{1}\) and \(H_{2}\). Clearly, the hyperplanes \(H_{1}\) and \(H_{2}\) are the boundaries of the margin. Therefore, if a positive point lies on the hyperplane \(H_{1}\) or a negative point lies on the hyperplane \(H_{2}\), we call it lying on the boundary of the margin (indicated by ‘①’ in Fig. 2). If a positive point lies between the separating hyperplane H and the hyperplane \(H_{1}\), or a negative point lies between the separating hyperplane H and the hyperplane \(H_{2}\), we call it lying between the separating hyperplane H and the boundary of the margin (indicated by ‘②’ in Fig. 2). Similarly, if a positive point lies on the correctly classified side of the hyperplane \(H_{1}\), or a negative point lies on the correctly classified side of the hyperplane \(H_{2}\), we call it lying on the correctly classified side of the boundary of the margin (indicated by ‘③’ in Fig. 2).

Fig. 2: Training points in soft-margin support vector machine

Based on Fig. 2, we have the following observations which address different cases for the data points in the training set \(\overline{\Omega }_t\). Consider the soft-margin SVC problem corresponding to the t-th fold, i.e., the t-th training set \(\overline{\Omega }_{t}\) and validation set \(\Omega _{t}\) are used. We also use \(w^{t}\) to represent the optimal solution in (3) trained by \(\overline{\Omega }_t\).

Proposition 1

Let \(w^{t}\) be an optimal solution of the t-th soft-margin SVC model. For \(i \in \overline{\mathcal {N}}_{t}\), consider a positive point \(x_i\). Then it holds that:

  1. (a)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}<0\) if and only if it lies on the misclassified side of the separating hyperplane H, and is therefore misclassified.

  2. (b)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}=0\) if and only if it lies on the separating hyperplane H, and is therefore correctly classified.

  3. (c)

    \(x_i\) satisfies \(0<(w^{t})^{\top } x_{i}<1\) if and only if it lies between the separating hyperplane H and the boundary of the margin; hence, it is correctly classified.

  4. (d)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}=1\) if and only if it lies on the boundary of the margin, and is therefore correctly classified.

  5. (e)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}>1\) if and only if it lies on the correctly classified side of the boundary of the margin, and is therefore correctly classified.

A result analogous to Proposition 1 can be stated for the negative points. In Fig. 3, any point \(x_{i} \in \overline{\Omega }_{t}\) in blue is a training point in each case (notation is the same as in Fig. 5).

Fig. 3: Each case for different values of \((w^{t})^{\top } x_{i}\) with \(x_i\) in the training set \(\overline{\Omega }_{t}\)

As for data points in the validation set \(\Omega _{t}\), we have the following scenarios.

Proposition 2

Let \(w^{t}\) be an optimal solution of the t-th soft-margin SVC model. For \(i \in \mathcal {N}_{t}\), consider a positive point \(x_i\). Then it holds that:

  1. (a)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}<0\) if and only if it lies on the misclassified side of the separating hyperplane H, and is therefore misclassified.

  2. (b)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}=0\) if and only if it lies on the separating hyperplane H, and is therefore correctly classified.

  3. (c)

    \(x_i\) satisfies \((w^{t})^{\top } x_{i}>0\) if and only if it lies on the correctly classified side of the separating hyperplane H, and it is hence correctly classified.

A result analogous to Proposition 2 can be stated for the negative points. In Fig. 4, any point \(x_{i} \in \Omega _{t}\) in blue is a validation point in each case (notation is the same as in Fig. 6).

Remark 1

Note that Propositions 1 and 2 are applicable to the soft-margin SVC model with other loss functions. These two propositions will be used in the proof of Propositions 3 and 4.

Fig. 4: Each case for different values of \((w^t)^{\top } x_{i}\) with \(x_i\) in the validation set \(\Omega _{t}\)

2.2 The lower-level problem

In this part, we focus on the lower-level problem. That is, given hyperparameter C and the training set \(\overline{\Omega }_{t}\), we train the dataset via \(l_{1}\)-loss SVC model. We will also discuss the properties of the lower-level problem.

2.2.1 The training model: \(l_{1}\)-loss SVC

In T-fold cross-validation, there are T lower-level problems. In the t-th lower-level problem, we train the t-th fold training set \(\overline{\Omega }_{t}\) by the soft-margin SVC model in (3) with the \(l_1\)-loss functionFootnote 2. That is, given \(C \ge 0\), we solve the following optimization problem:

$$\begin{aligned} \underset{ w^{t} \in \mathbb {R}^{n}}{{\text {min}}} \frac{1}{2}\Vert w^{t}\Vert _{2}^{2}+ C \sum _{i \in \overline{\mathcal {N}}_{t}} (1-y_{i}(x_{i}^{\top }w^{t}))_{+}. \end{aligned}$$

A popular reformulation of the problem above is the convex quadratic optimization problem obtained by introducing slack variables \(\xi ^{t} \in \mathbb {R}^{m_{2}}\):

$$\begin{aligned} \begin{array}{l} \underset{w^{t} \in \mathbf {R}^{n},\, \xi ^{t} \in \mathbf {R}^{m_{2}}}{\min }\quad \frac{1}{2}\left\| w^{t}\right\| _{2}^{2}+C\sum \limits _{i=1}^{m_{2}} \xi ^{t}_{i} \\ \quad \quad \ \hbox {s.t.} \quad \quad \quad \ B^{t} w^{t} \ge \mathbf {1}-\xi ^{t}, \\ \quad \quad \quad \quad \quad \quad \quad \ \xi ^{t} \ge \mathbf {0}, \end{array} \end{aligned}$$
(4)

where, for \(t=1, \cdots , T\) and \(k=m_{1}+ 1, \cdots , l_{1}\), we have

$$\begin{aligned} B^{t}\!=\! \left[ \begin{array}{c} y_{t_{m_{1}+1}} x_{t_{m_{1}+1}}^{\top } \\ \vdots \\ y_{t_{l_{1}}} x_{t_{l_{1}}}^{\top } \end{array}\right] \! \in \! \mathbb {R}^{m_{2} \times n}, \;\;\, (x_{t_k},y_{t_k}) \in \overline{\Omega }_{t}, \end{aligned}$$

and we use \(\xi ^{t}_{i}\) to denote the i-th element of \(\xi ^{t} \in \mathbb {R}^{m_{2}}\).

Let \(\alpha ^{t} \in \mathbb {R}^{m_{2}}\) and \(\mu ^{t} \in \mathbb {R}^{m_{2}}\) be the multipliers of the constraints in (4). We can write the KKT conditions for the lower-level problem (4) as

$$\begin{aligned}&\mathbf {0} \le \alpha ^{t} \perp B^{t} w^{t}-\mathbf {1}+\xi ^{t} \ge \mathbf {0}, \end{aligned}$$
(5a)
$$\begin{aligned}&\mathbf {0} \le \xi ^{t} \perp \mu ^{t} \ge \mathbf {0}, \end{aligned}$$
(5b)
$$\begin{aligned}&w^{t}-(B^{t})^{\top } \alpha ^{t}=\mathbf {0}, \end{aligned}$$
(5c)
$$\begin{aligned}&C\mathbf {1}-\alpha ^{t}-\mu ^{t}=\mathbf {0}, \end{aligned}$$
(5d)

where for two vectors a and b, writing \(\mathbf {0} \le a \perp b \ge \mathbf {0}\) means that \(a^{\top } b=0,\ a \ge \mathbf {0}\) and \(b \ge \mathbf {0}\). Also note that each complementarity constraint in (5a) corresponds to a training point \(x_{i}\) with \(i \in Q_{l}\) in (2). Since each training point corresponds to a slack variable \(\xi ^{t}_i\), each complementarity constraint in (5b) also corresponds to a training point \(x_{i}\) with \(i \in Q_{l}\) in (2). Therefore, there is a one-to-one correspondence between the index set of the training points \(Q_{l}\) and the complementarity constraints in (5a) and (5b), respectively. This will be used in the definition of some index sets below.

Furthermore, we would like to highlight the support vectors implied in (5). From (5c), the weight vector satisfies \(w^{t}=(B^{t})^{\top } \alpha ^{t}=\sum \limits _{i \in \overline{\mathcal {N}}_{t}} \alpha ^{t}_{i} y_{i}x_{i}\). This implies that only the data points \(x_{i} \in \overline{\Omega }_{t}\) with \(\alpha ^{t}_{i} \ne 0\) contribute to \(w^{t}\). Since \(\alpha ^{t}_{i} \ge 0\) by (5a), only the points \(x_{i} \in \overline{\Omega }_{t}\) with \(\alpha ^{t}_{i}>0\) are involved; it is for this reason that they are called support vectors. By eliminating \(\mu ^{t}\) and \(w^{t}\) from the system in (5), we get the reduced KKT conditions for problem (4) as follows:

$$\begin{aligned} \left\{ \begin{aligned}&\mathbf {0} \le \alpha ^{t} \perp B^{t} (B^{t})^{\top } \alpha ^{t}-\mathbf {1}+\xi ^{t} \ge \mathbf {0}, \\&\mathbf {0} \le \xi ^{t} \perp C\mathbf {1}-\alpha ^{t} \ge \mathbf {0}. \end{aligned}\right. \end{aligned}$$
(6)
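As a quick numerical illustration (a hedged sketch under our own naming; CVXPY is assumed, and the multiplier \(\alpha ^{t}\) is read off as the dual variable of the first constraint in (4)), one can solve a fold problem and verify the complementarity system (6):

```python
# A hedged sketch (ours): solve the fold QP (4) and check the reduced
# KKT / complementarity system (6) at the computed primal-dual pair.
import cvxpy as cp
import numpy as np

def solve_fold_qp(B_t, C):
    """B_t has rows y_i * x_i^T for the t-th training set; returns (w, xi, alpha)."""
    m2, n = B_t.shape
    w, xi = cp.Variable(n), cp.Variable(m2)
    cons = [B_t @ w >= 1 - xi, xi >= 0]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), cons).solve()
    return w.value, xi.value, cons[0].dual_value   # dual of  B^t w >= 1 - xi

def kkt_residual(B_t, C, xi, alpha):
    s = B_t @ (B_t.T @ alpha) - 1.0 + xi           # B^t (B^t)^T alpha - 1 + xi
    r1 = np.abs(alpha * s).max()                   # 0 <= alpha  _|_  s >= 0
    r2 = np.abs(xi * (C - alpha)).max()            # 0 <= xi  _|_  C*1 - alpha >= 0
    return max(r1, r2)
```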

2.2.2 Some properties of the lower-level problem

Let \(\alpha \in \mathbb {R}^{Tm_{2}},\ \xi \in \mathbb {R}^{Tm_{2}},\ w \in \mathbb {R}^{Tn}\), and \(B \in \mathbb {R}^{Tm_{2} \times Tn}\) be defined by

$$\begin{aligned} \alpha :=\left[ \begin{array}{l} \alpha ^{1} \\ \alpha ^{2}\\ \vdots \\ \alpha ^{T} \end{array}\right] ,\ \xi :=\left[ \begin{array}{l} \xi ^{1} \\ \xi ^{2} \\ \vdots \\ \xi ^{T} \end{array}\right] ,\ w:=\left[ \begin{array}{l} w^{1} \\ w^{2} \\ \vdots \\ w^{T} \end{array}\right] ,\ \text{ and } \; B:=\left[ \begin{array}{cccc} B^{1} &{} \mathbf {0} &{} \cdots &{} \mathbf {0} \\ \mathbf {0} &{}B^{2} &{} \cdots &{} \mathbf {0} \\ \vdots &{}\vdots &{} \ddots &{} \vdots \\ \mathbf {0} &{} \mathbf {0} &{} \cdots &{} B^{T} \end{array}\right] , \end{aligned}$$
(7)

respectively. The KKT conditions in (6) can be decomposed as

$$\begin{aligned} \Lambda _{1}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=0,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}=0\}, \end{aligned}$$
(8)
$$\begin{aligned} \Lambda _{2}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=0,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}>0,\ \xi _{i}=0\}, \end{aligned}$$
(9)
$$\begin{aligned} \Lambda _{3}:= & {} \{i \in Q_{l}\ \mid \ 0< \alpha _{i} \le C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}=0\}, \end{aligned}$$
(10)
$$\begin{aligned} \Lambda _{4}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ 0<\xi _{i}<1\}, \end{aligned}$$
(11)
$$\begin{aligned} \Lambda _{5}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}=1\}, \end{aligned}$$
(12)
$$\begin{aligned} \Lambda _{6}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}>1\}. \end{aligned}$$
(13)

Obviously, the intersection of any pair of these index sets \(\Lambda _{i}\) for \(i=1, \ldots , 6\) is empty. An illustrative representation of data points corresponding to these index sets is given in Fig. 5.
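The following small sketch (ours; the numerical tolerance tol is an implementation detail not used in the paper) sorts the training indices of a pair \((\alpha ,\ \xi )\) satisfying (6) into the sets (8)–(13):

```python
# A sketch (ours) partitioning the training indices into the sets (8)-(13),
# given alpha, xi, the block matrix B from (7) and C.
import numpy as np

def partition_training_points(B, alpha, xi, C, tol=1e-8):
    s = B @ (B.T @ alpha) - 1.0 + xi               # B B^T alpha - 1 + xi
    Lambda = {k: [] for k in range(1, 7)}
    for i in range(alpha.size):
        a0, s0, x0 = alpha[i] < tol, abs(s[i]) < tol, xi[i] < tol
        aC = abs(alpha[i] - C) < tol
        if a0 and s0 and x0:
            Lambda[1].append(i)                    # on the margin boundary, not a SV
        elif a0 and s[i] > tol and x0:
            Lambda[2].append(i)                    # strictly correctly classified, not a SV
        elif alpha[i] > tol and s0 and x0:
            Lambda[3].append(i)                    # on the margin boundary, SV
        elif aC and s0 and tol < xi[i] < 1.0 - tol:
            Lambda[4].append(i)                    # between H and the margin boundary
        elif aC and s0 and abs(xi[i] - 1.0) < tol:
            Lambda[5].append(i)                    # on the separating hyperplane H
        elif aC and s0 and xi[i] > 1.0 + tol:
            Lambda[6].append(i)                    # misclassified SV
        # indices matching no case are left unassigned (degenerate/numerical)
    return Lambda
```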

Fig. 5: Representation of points with index sets \(\Lambda _j\), \(j=1, \ldots , 6\)

Proposition 3

Considering the training points corresponding to \(Q_{l}\) in (2), let \((\alpha ,\, \xi )\) satisfy the conditions in (6). Then, the following statements hold true:

  1. (a)

    The points \(\{x_i\}_{i\in \Lambda _{1}}\) lie on the boundary of the margin; they are correctly classified points, but are not support vectors.

  2. (b)

    The points \(\{x_i\}_{i\in \Lambda _{2}}\) lie on the correctly classified side of the boundary of the margin; they are correctly classified points, but are not support vectors.

  3. (c)

    The points \(\{x_i\}_{i\in \Lambda _{3}}\) lie on the boundary of the margin; they are correctly classified points and are support vectors.

  4. (d)

    The points \(\{x_i\}_{i\in \Lambda _{4}}\) lie between the separating hyperplane H and the boundary of the margin; they are correctly classified points and are support vectors.

  5. (e)

    The points \(\{x_i\}_{i\in \Lambda _{5}}\) lie on the separating hyperplane H; they are correctly classified points and are support vectors.

  6. (f)

    The points \(\{x_i\}_{i\in \Lambda _{6}}\) lie on the misclassified side of the separating hyperplane H; they are misclassified points and are support vectors.

Proof

We take positive points as an example; the same analysis applies to negative ones. Since \(w=B^{\top } \alpha \) by (5c), we get \((B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=(Bw -\mathbf {1}+\xi )_{i}\).

  1. (a)

    For the points \(\{x_i\}_{i\in \Lambda _{1}}\), since \(\xi _{i}=0\) in (8), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}=0\), that is, \(y_i(w^{\top } x_i) -1=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=1\). It corresponds to (d) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the boundary of the margin. It is correctly classified, and it is not a support vector, since \(\alpha _{i}=0\).

  2. (b)

    For the points \(\{x_i\}_{i\in \Lambda _{2}}\), since \(\xi _{i}=0\) in (9), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}>0\), that is, \(y_i(w^{\top } x_i) -1>0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i>1\). It corresponds to (e) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the correctly classified side of the boundary of the margin. It is correctly classified, but not a support vector, as \(\alpha _{i}=0\).

  3. (c)

    For the points \(\{x_i\}_{i\in \Lambda _{3}}\), since \(\xi _{i}=0\) in (10), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw-\mathbf {1})_{i}=0\), that is, \(y_i(w^{\top } x_i) -1=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=1\). It corresponds to (d) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the boundary of the margin. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).

  4. (d)

    For the points \(\{x_i\}_{i\in \Lambda _{4}}\), since \(0<\xi _{i}<1\) in (11) and \((Bw-\mathbf {1}+\xi )_{i}=0\), we have \((Bw)_{i}=1-\xi _{i}\), hence \(0<(Bw)_{i}<1\), that is, \(0<y_i(w^{\top } x_i)<1\). For a positive point, \(y_{i}=1\), it implies that \(0<w^{\top } x_i<1\). It corresponds to (c) in Proposition 1. Therefore, \(x_i\) lies between the separating hyperplane H and the boundary of the margin. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).

  5. (e)

    For the points \(\{x_i\}_{i\in \Lambda _{5}}\), since \(\xi _{i}=1\) in (12), we have \((Bw-\mathbf {1}+\xi )_{i}=(Bw)_{i}=0\), that is, \(y_i(w^{\top } x_i)=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=0\). It corresponds to (b) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the separating hyperplane H. It is correctly classified, and it is a support vector, since \(\alpha _{i} >0\).

  6. (f)

    For the points \(\{x_i\}_{i\in \Lambda _{6}}\), since \(\xi _{i}>1\) in (13) and \((Bw-\mathbf {1}+\xi )_{i}=0\), we have \((Bw)_{i}=1-\xi _{i}<0\), that is, \(y_i(w^{\top } x_i)<0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i<0\). It corresponds to (a) in Proposition 1. Therefore, it means that the point \(x_i\) lies on the misclassified side of the separating hyperplane H. It is misclassified, and it is a support vector, since \(\alpha _{i} >0\). \(\square \)

Remark 2

Note that all the data points \(x_i\) for \(i \in \Lambda _{1}\) corresponding to Fig. 5a and \(i \in \Lambda _{3}\) corresponding to Fig. 5c lie on the boundary of the margin. In other words, Fig. 5a and Fig. 5c are identical. However, the values of \(\alpha _i\) for \(i \in \Lambda _{1}\) and \(i \in \Lambda _{3}\) are different, so we demonstrate them in two subfigures.

2.3 The upper-level problem

In this part, we introduce the upper-level problem, that is, the bilevel optimization model for hyperparameter selection in SVC under the settings of T-fold cross-validation. Note that the aim of the upper-level problem is to minimize the T-fold CV error measured on the validation sets based on the optimal solutions of the lower-level problems. Specifically, the basic bilevel optimization model for selecting the hyperparameter C in SVC is formulated as

$$\begin{aligned} \begin{aligned} \min \limits _{C \in \mathbb {R},\ w^{t} \in \mathbb {R}^{n},\ t=1, \cdots ,T}&\frac{1}{T}\sum _{t=1}^{T} \frac{1}{m_{1}} \sum _{i \in \mathcal {N}_{t}} \Vert \left( -y_{i} \left( x_{i}^{\top } w^{t}\right) \right) _{+}\Vert _{0}\\ \hbox {s.t.} \quad \quad \quad \,&C \ge 0,\\ \quad \quad \quad \,&\text {and for} \quad t=1,\cdots ,T: \\ \quad \quad&w^{t} \in \underset{ w \in \mathbb {R}^{n}}{{\text {argmin}}} \left\{ \frac{1}{2}\Vert w\Vert _{2}^{2}+ C \sum _{i \in \overline{\mathcal {N}}_{t}} \left( 1-y_{i}\left( x_{i}^{\top }w\right) \right) _{+} \right\} . \end{aligned} \end{aligned}$$
(14)

Here, the expression \(\sum \limits _{i \in \mathcal {N}_{t}}\Vert \left( -y_{i} \left( x_{i}^{\top } w^{t}\right) \right) _{+}\Vert _{0}\) basically counts the number of data points that are misclassified in the validation set \(\Omega _{t}\), while the outer summation (i.e., the objective function in (14)) averages the misclassification error over all the folds.
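In other words, the upper-level objective is simply the average 0–1 misclassification rate over the validation folds; a short sketch (ours, following the convention of Proposition 2 that points on H count as correctly classified) is:

```python
# A small sketch (ours) of the upper-level objective in (14): the T-fold CV
# error as the average 0-1 misclassification rate over the validation folds.
import numpy as np

def cv_error(ws, val_folds):
    """ws[t] is the weight vector w^t; val_folds[t] = (X_val, y_val) for fold t."""
    errors = []
    for w, (Xv, yv) in zip(ws, val_folds):
        # ||(-y_i x_i^T w^t)_+||_0 counts the points with y_i x_i^T w^t < 0
        errors.append(np.mean(yv * (Xv @ w) < 0))
    return float(np.mean(errors))
```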

Problem (14) can be equivalently written in the matrix form as follows

$$\begin{aligned} \begin{aligned} \min \limits _{C \in \mathbb {R},\ w^{t} \in \mathbb {R}^{n},\ t=1,\cdots ,T}&\frac{1}{T} \sum _{t=1}^{T} \frac{1}{m_{1}} \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}\\ \hbox {s.t.} \quad \quad \quad \,&C \ge 0,\\ \quad \quad \quad \,&\text {and for} \quad t=1,\cdots ,T: \\ \quad \quad&w^{t} \in {\mathop {\mathrm{argmin}}\limits _{w \in \mathbb {R}^{n}}} \left\{ \frac{1}{2}\Vert w\Vert _{2}^{2}+ C \Vert \left( \mathbf {1}-B^{t} w\right) _{+} \Vert _{1} \right\} , \end{aligned} \end{aligned}$$
(15)

where, for \(t=1,\ \cdots ,\ T\) and \(k=1,\ \cdots ,\ m_{1}\), we have

$$\begin{aligned} A^{t}=\left[ \begin{array}{c} y_{t_1} x_{t_1}^{\top } \\ \vdots \\ y_{t_{m_{1}}} x_{t_{m_{1}}}^{\top } \end{array}\right] \in \mathbb {R}^{m_{1} \times n} \;\, \text{ and } \;\, (x_{t_k},y_{t_k}) \in \Omega _{t}. \end{aligned}$$

Remark 3

Compared with the model in Kunapuli et al. (2008b), we consider a simpler bilevel optimization model, whose upper-level problem only involves the constraint \(C \ge 0\).

3 Single-level reformulation and method

In this section, we first reformulate the bilevel optimization problem as a single-level optimization problem, precisely, we write the problem as an MPEC. Then we present the properties of this single-level problem. Finally, we discuss the GRM to solve the MPEC problem.

3.1 The MPEC reformulation

Recall that the upper-level objective function in (15) is a measure of the misclassification error on the T out-of-sample validation sets, which we minimize. The measure used here is the classical CV error for classification, i.e., the average number of misclassified data points. It is clear that \(\Vert (\cdot )_{+}\Vert _{0}\) is discontinuous and nonconvex. However, as demonstrated in Mangasarian (1994), the function \(\Vert (\cdot )_{+}\Vert _{0}\) can be characterized as the minimum of the sum of all elements of the solution to the following linear optimization problem, i.e.,

$$\begin{aligned} \Vert r_{+} \Vert _{0}=\left\{ {{\text {min}}}\sum \limits _{i=1}^{m_1} \zeta _{i}:\; \zeta =\underset{u}{{\text {argmin}}} \left\{ -u^{\top } r \ :\ \mathbf {0} \le u \le \mathbf {1} \right\} \right\} . \end{aligned}$$
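This characterization is easy to verify numerically; the sketch below (ours, using scipy.optimize.linprog) selects the minimal-sum optimal solution by summing only the components with \(r_{i}>0\):

```python
# A quick numerical check (illustration only): among the minimizers of
# min{-u^T r : 0 <= u <= 1}, the one with minimal element sum has u_i = 1
# exactly when r_i > 0, so its sum equals ||r_+||_0.
import numpy as np
from scipy.optimize import linprog

def zero_norm_of_positive_part(r):
    r = np.asarray(r, dtype=float)
    res = linprog(c=-r, bounds=[(0.0, 1.0)] * len(r), method="highs")
    return int(np.round(np.sum(res.x[r > 0])))     # minimal-sum selection

r = np.array([0.3, -1.2, 0.0, 2.5, -0.1])
assert zero_norm_of_positive_part(r) == np.count_nonzero(np.maximum(r, 0.0))
```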

Therefore, for each fold, \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}\) is the minimum of the sum of all elements of the solution to the following linear optimization problem:

$$\begin{aligned} \begin{aligned}&\min \limits _{\zeta ^{t} \in \mathbb {R}^{m_{1}}} \ -(\zeta ^{t})^{\top } (-A^{t} w^{t}) \\&\ \ \begin{array}{ll} \hbox {s.t.} \quad \ \zeta ^{t} \ge \mathbf {0}, \\ \quad \quad \ \ \mathbf {1}- \zeta ^{t} \ge \mathbf {0}. \end{array} \end{aligned} \end{aligned}$$
(16)

Let \(\hat{\zeta }^{t}\) be the solution of problem (16) whose element sum \(\sum \limits _{i=1}^{m_1} \hat{\zeta }_{i}^{t}\) is minimal among the solutions of (16). This implies that \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}= \sum \limits _{i=1}^{m_{1}}\hat{\zeta }^{t}_{i}\) in each fold. According to Proposition 2, there are two cases for the validation points:

  1. 1.

    If the validation point \((x_{i},y_{i}) \in \Omega _{t}\) is misclassified, then \(y_{i} \left( x_{i}^{\top } w^{t}\right) <0 \). That is, \( (-A^{t} w^{t})_{i} >0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}>0\).

  2. 2.

    If the validation point \((x_{i},y_{i}) \in \Omega _{t}\) is correctly classified, we have \(y_{i} \left( x_{i}^{\top } w^{t}\right) \ge 0\). There are two cases. Firstly, \(x_{i}\) lies on the separating hyperplane H, that is, \(y_{i} \left( x_{i}^{\top } w^{t}\right) =0\); then \((-A^{t} w^{t})_{i} =0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}=0\). Secondly, \(x_{i}\) lies on the correctly classified side of the separating hyperplane H, that is, \(y_{i} \left( x_{i}^{\top } w^{t}\right) >0\); then \( (-A^{t} w^{t})_{i} <0\), which corresponds to \((\left( -A^{t} w^{t}\right) _{+})_{i}=0\).

Combining this with \( \Vert \left( -A^{t} w^{t} \right) _{+} \Vert _{0}= \sum \limits _{i=1}^{m_{1}}\hat{\zeta }^{t}_{i}\), it follows that

$$\begin{aligned} \hat{\zeta ^{t}_{i}}=\left\{ \begin{array}{ll} 1, \quad \quad \text {if} \quad (x_{i},y_{i}) \in \Omega _{t} \ \text {is misclassified}, \\ 0, \quad \quad \text {if} \quad (x_{i},y_{i}) \in \Omega _{t}\ \text {is correctly classified}, \end{array}\right. \end{aligned}$$
(17)

where \(\hat{\zeta }^{t}_{i}\) is the i-th element of \(\hat{\zeta }^{t}\) in the t-th fold.

The linear programs (LPs) (16), for \(t= 1, \cdots , T\), are inserted into the bilevel optimization problem in order to recast the discontinuous upper-level objective function into a continuous one. Each LP in the form of (16) can also be replaced with its KKT conditions as follows

$$\begin{aligned} \left\{ \begin{aligned}&\begin{array}{l} \mathbf {0} \le \zeta ^{t} \perp \lambda ^{t} \ge \mathbf {0}, \\ \mathbf {0} \le z^{t} \perp \mathbf {1}-\zeta ^{t} \ge \mathbf {0}, \\ A^{t} w^{t}-\lambda ^{t}+z^{t}=\mathbf {0}. \end{array} \end{aligned}\right. \end{aligned}$$

By eliminating \(\lambda ^{t}\) and \(w^{t}\) with \(w^{t}=(B^{t})^{\top } \alpha ^{t}\) in (5c), we get the reduced KKT conditions for problem (16) with

$$\begin{aligned}&\mathbf {0} \le \zeta ^{t} \perp A^{t} (B^{t})^{\top } \alpha ^{t}+z^{t} \ge \mathbf {0}, \end{aligned}$$
(18a)
$$\begin{aligned}&\mathbf {0} \le z^{t} \perp \mathbf {1}-\zeta ^{t} \ge \mathbf {0}. \end{aligned}$$
(18b)

Note that each complementarity constraint in (18a) corresponds to a validation point \(x_{i}\) with \(i \in Q_{u}\) in (1). Since each validation point corresponds to a variable \(\zeta ^{t}_i\), each complementarity constraint in (18b) also corresponds to a validation point \(x_{i}\) with \(i \in Q_{u}\) in (1). Therefore, there is a one-to-one correspondence between the index set of the validation points \(Q_{u}\) and the complementarity constraints in (18a) and (18b), respectively.

Combining the systems in (6) and (18), we can transform the bilevel optimization problem (15) into the single-level optimization problem

$$\begin{aligned} \begin{aligned} \min \limits _{C,\ \zeta ^{t},\ z^{t},\ \alpha ^{t},\ \xi ^{t},\ t=1,\cdots ,T} \quad&\frac{1}{T} \sum _{t=1}^{T} \frac{1}{m_{1}} \sum _{i=1}^{m_{1}} \zeta ^{t}_{i}\\ \hbox {s.t.} \quad \quad&C \ge 0, \ \text {and for} \quad t=1,\cdots ,T: \\&\mathbf {0} \le \zeta ^{t} \perp A^{t} (B^{t})^{\top } \alpha ^{t}+z^{t} \ge \mathbf {0}, \\&\mathbf {0} \le z^{t} \perp \mathbf {1}-\zeta ^{t} \ge \mathbf {0}, \\&\mathbf {0} \le \alpha ^{t} \perp B^{t} (B^{t})^{\top } \alpha ^{t}-\mathbf {1}+\xi ^{t} \ge \mathbf {0}, \\&\mathbf {0} \le \xi ^{t} \perp C\mathbf {1}-\alpha ^{t} \ge \mathbf {0}. \end{aligned} \end{aligned}$$
(19)

Note that the constraints \(C\mathbf {1}-\alpha ^{t} \ge \mathbf {0}\) and \(\alpha ^{t} \ge \mathbf {0}\) imply \(C \ge 0\). Therefore, we remove the redundant constraint \(C \ge 0\), and get an equivalent form of the problem above as follows

$$\begin{aligned} \begin{aligned} \min \limits _{C,\ \zeta ^{t},\ z^{t},\ \alpha ^{t},\ \xi ^{t},\ t=1,\cdots ,T} \quad&\frac{1}{T} \sum _{t=1}^{T} \frac{1}{m_{1}} \sum _{i=1}^{m_{1}} \zeta ^{t}_{i}\\ \hbox {s.t.} \quad \quad&\text {for} \quad t=1,\cdots ,T: \\&\mathbf {0} \le \zeta ^{t} \perp A^{t} (B^{t})^{\top } \alpha ^{t}+z^{t} \ge \mathbf {0}, \\&\mathbf {0} \le z^{t} \perp \mathbf {1}-\zeta ^{t} \ge \mathbf {0}, \\&\mathbf {0} \le \alpha ^{t} \perp B^{t} (B^{t})^{\top } \alpha ^{t}-\mathbf {1}+\xi ^{t} \ge \mathbf {0}, \\&\mathbf {0} \le \xi ^{t} \perp C\mathbf {1}-\alpha ^{t} \ge \mathbf {0}. \end{aligned} \end{aligned}$$
(20)

The presence of the equilibrium constraints makes problem (20) an instance of an MPEC, which is sometimes labelled as an extension of a bilevel optimization problem Luo et al. (1996). The optimal hyperparameter is now well defined as a global optimal solution to the MPEC Lee et al. (2015). Now, we have transformed a bilevel classification model into an MPEC.

We can also write (20) in a compact form. To proceed, let

$$\begin{aligned} \begin{array}{c} \zeta \!:=\!\left[ \begin{array}{l} \zeta ^{1} \\ \zeta ^{2} \\ \vdots \\ \zeta ^{T} \end{array}\right] \! \in \! \mathbb {R}^{Tm_{1}},\ z \!:=\!\left[ \begin{array}{l} z^{1} \\ z^{2} \\ \vdots \\ z^{T} \end{array}\right] \! \in \! \mathbb {R}^{Tm_{1}},\ A:=\left[ \begin{array}{cccc} A^{1} &{} \mathbf {0} &{} \cdots &{} \mathbf {0} \\ \mathbf {0} &{}A^{2} &{} \cdots &{} \mathbf {0} \\ \vdots &{}\vdots &{} \ddots &{} \vdots \\ \mathbf {0} &{} \mathbf {0} &{} \cdots &{} A^{T} \end{array}\right] \in \mathbb {R}^{Tm_{1} \times {Tn}}, \end{array} \end{aligned}$$

and \(\alpha ,\ \xi ,\ B\) be defined in (7).

Then problem (20) can be written as

$$\begin{aligned} \begin{aligned} \min \limits _{C,\ \zeta ,\ z,\ \alpha ,\ \xi } \quad&\frac{1}{T m_{1}} \sum _{i=1}^{T m_{1}} \zeta _{i}\\ \hbox {s.t.} \quad&\mathbf {0} \le \zeta \perp A B^{\top } \alpha +z \ge \mathbf {0}, \\&\mathbf {0} \le z \perp \mathbf {1}-\zeta \ge \mathbf {0}, \\&\mathbf {0} \le \alpha \perp B B^{\top } \alpha -\mathbf {1}+\xi \ge \mathbf {0}, \\&\mathbf {0} \le \xi \perp C\mathbf {1}-\alpha \ge \mathbf {0}. \end{aligned} \end{aligned}$$
(21)

From now on, all our analysis is going to be based on this model.
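For later reference, the stacked data in (7) and the block-diagonal matrix A above can be assembled from the raw folds as in the following sketch (ours; scipy.linalg.block_diag is assumed):

```python
# A sketch (ours) assembling the block-diagonal matrices A and B of (7)
# from per-fold validation/training samples.
import numpy as np
from scipy.linalg import block_diag

def fold_matrix(X, y):
    """Rows y_i * x_i^T for the points (x_i, y_i) of one fold (A^t or B^t)."""
    return y[:, None] * X

def assemble_blocks(val_folds, train_folds):
    """val_folds[t] = (X_val, y_val), train_folds[t] = (X_train, y_train)."""
    A = block_diag(*[fold_matrix(X, y) for X, y in val_folds])    # T m_1 x T n
    B = block_diag(*[fold_matrix(X, y) for X, y in train_folds])  # T m_2 x T n
    return A, B
```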

3.2 Some properties of the MPEC reformulation

Observe that the last two constraints of problem (21) correspond to the complementarity systems that are part of the KKT conditions of the lower-level problem in (6). As the latter conditions are carefully studied in Proposition 3, it remains to analyze the first two complementarity systems describing the feasible set of problem (21). Hence, we partition them as follows

$$\begin{aligned} \Psi _{1}:= & {} \left\{ i \in Q_{u}\ \mid \ 0\le \zeta _{i}<1,\ (A B^{\top } \alpha +z)_{i}=0,\ z_{i}=0\right\} , \end{aligned}$$
(22)
$$\begin{aligned} \Psi _{2}:= & {} \left\{ i \in Q_{u}\ \mid \ \zeta _{i}=0,\ (A B^{\top } \alpha +z)_{i}>0,\ z_{i}=0\right\} , \end{aligned}$$
(23)
$$\begin{aligned} \Psi _{3}:= & {} \left\{ i \in Q_{u}\ \mid \ \zeta _{i}=1,\ (A B^{\top } \alpha +z)_{i}=0,\ z_{i}>0\right\} . \end{aligned}$$
(24)

Similarly to (8)–(13), the intersection of any pair of the index sets \(\Psi _{j}\) for \(j=1, \,2, \, 3\) is empty. In the same vein, an illustrative representation of data points corresponding to the index sets \(\Psi _{j}\) for \(j=1, \,2, \, 3\) is given in Fig. 6.

Fig. 6: Representation of points with index sets \(\Psi _j\), \(j=1,\, 2, \, 3\)

Proposition 4

Considering the validation points corresponding to \(Q_{u}\) in (1), let \((\zeta ,\ z, \, \alpha )\) satisfy the first two complementarity systems describing the feasible set of problem (21). Then, the following statements hold true:

  1. (a)

    The points \(\{x_i\}_{i\in \Psi _{1}}\) lie on the separating hyperplane H and are therefore correctly classified.

  2. (b)

    The points \(\{x_i\}_{i\in \Psi _{2}}\) lie on the correctly classified side of the separating hyperplane H and are therefore correctly classified.

  3. (c)

    The points \(\{x_i\}_{i\in \Psi _{3}}\) lie on the misclassified side of the separating hyperplane H and are therefore misclassified.

Proof

We take positive points as an example; the same analysis applies to negative ones. Since \(w=B^{\top } \alpha \) by (5c), we get \((A B^{\top } \alpha +z)_{i}=(A w +z)_{i}\).

  1. (a)

    For the points \(\{x_i\}_{i\in \Psi _{1}}\), since \(z_{i}=0\) in (22), we have \((A w+z)_{i}=(Aw)_{i}=0\), that is, \(y_i(w^{\top } x_i)=0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i=0\). It corresponds to (b) in Proposition 2. Therefore, it means that the point \(x_i\) lies on the separating hyperplane H. It is correctly classified.

  2. (b)

    For the points \(\{x_i\}_{i\in \Psi _{2}}\), since \(z_{i}=0\) in (23), we have \((Aw+z)_{i}=(Aw)_{i}>0\), that is, \(y_i(w^{\top } x_i)>0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i>0\). It corresponds to (c) in Proposition 2. Therefore, it means that the point \(x_i\) lies on the correctly classified side of the separating hyperplane H. It is correctly classified.

  3. (c)

    For the points \(\{x_i\}_{i\in \Psi _{3}}\), since \(z_{i}>0\) in (24), we have \((Aw)_{i}<0\), that is, \(y_{i}(w^{\top } x_{i})<0\). For a positive point, \(y_{i}=1\), it implies that \(w^{\top } x_i<0\). It corresponds to (a) in Proposition 2. Therefore, it means that the point \(x_i\) lies on the misclassified side of the separating hyperplane H.

\(\square \)

In Sect. 4, Proposition 4 will be combined with Proposition 3 to prove Proposition 5. It might also be important to note that if a validation point \(x_{i}\) lies on the separating hyperplane H, then we will have \(0 \le \zeta _{i} < 1\).

3.3 The global relaxation method (GRM)

Here, we present a numerical algorithm to solve the MPEC (21). There are various methods for solving MPECs; we refer to Dempe (2003), Luo et al. (1996) for surveys on the problem and to Ye (2005), Flegel (2005), Wu et al. (2015), Harder et al. (2021), Guo et al. (2015), Jara-Moroni et al. (2018), Júdice (2012), Li et al. (2015), Yu et al. (2019), Dempe (2003), Anitescu (2000), Facchinei and Pang (2007), Fletcher et al. (2006), Fukushima and Tseng (2002) for some of the latest methods to solve it. Among these, one of the most popular is the relaxation method due to Scholtes (2001). Recently, Hoheisel et al. (2013) provided comparisons of five relaxation methods for solving MPECs, where it appears that the GRM has the best theoretical (in terms of requiring weaker assumptions for convergence) and numerical performance. Therefore, we apply the GRM to solve our MPEC (21).

To simplify the presentation of the method, we now write problem (21) into a further compact format. Let \(v = \left[ C,\ \zeta ^{\top },\ z^{\top },\ \alpha ^{\top },\ \xi ^{\top }\right] ^{\top } \in \mathbb {R}^{\overline{m}+1}\) with \(\overline{m}= 2 T (m_{1}+m_{2})\) and define the functions

$$\begin{aligned} F(v)\! = \!M^{\top }v, \;\; G(v)\!= \!Pv+a, \; \text{ and } \; H(v) \!= \!Qv, \end{aligned}$$
(25)

where

$$\begin{aligned} M\!=\!\frac{1}{T m_{1}} \left[ \begin{array}{c} 0\\ \mathbf {1}_{T m_1} \\ \mathbf {0}_{T m_1 } \\ \mathbf {0}_{T m_2 } \\ \mathbf {0}_{T m_2 } \end{array}\right] \! \in \! \mathbb {R}^{\overline{m}+1},\ a \! = \! \left[ \begin{array}{l} \mathbf {0}_{T m_1} \\ \mathbf {1}_{T m_1} \\ -\mathbf {1}_{T m_2} \\ \mathbf {0}_{T m_2} \end{array}\right] \! \in \! \mathbb {R}^{\overline{m}},\ Q \!= \!\left[ \begin{array}{cc} \mathbf {0}_{\overline{m}}&I_{\overline{m}} \end{array}\right] \! \in \! \mathbb {R}^{\overline{m} \times (\overline{m}+1)}, \end{aligned}$$
$$\begin{aligned} P \!= \!\left[ \begin{array}{ccccc}\mathbf {0}_{T m_1 }&{} \mathbf {0}_{T m_1 \times T m_1} &{} I_{T m_1} &{} A B^{\top } &{} \mathbf {0}_{T m_1 \times T m_2} \\ \mathbf {0}_{T m_1} &{}-I_{T m_1 } &{} \mathbf {0}_{T m_1 \times T m_1} &{} \mathbf {0}_{T m_1 \times T m_2} &{} \mathbf {0}_{T m_1 \times T m_2}\\ \mathbf {0}_{T m_2} &{}\mathbf {0}_{T m_2 \times T m_1} &{} \mathbf {0}_{T m_2 \times T m_1} &{} B B^{\top } &{} I_{ T m_2} \\ \mathbf {1}_{T m_2 } &{}\mathbf {0}_{T m_2 \times T m_1} &{} \mathbf {0}_{T m_2 \times T m_1} &{} -I_{T m_2 } &{} \mathbf {0}_{T m_2 \times T m_2 }\end{array}\right] \! \in \! \mathbb {R}^{\overline{m} \times (\overline{m}+1)}. \end{aligned}$$

Problem (21) can then be written in the form

$$\begin{aligned} \begin{aligned}&\min _{v \in \mathbb {R}^{\overline{m}+1}} F(v)\\&\quad \hbox {s.t.}\ \ \mathbf {0} \le H(v) \perp G(v) \ge \mathbf {0}. \end{aligned} \end{aligned}$$
(26)
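To connect (25) and (26) to concrete arrays, the following hedged sketch (ours) builds M, a, Q and P from the block matrices A and B of (7), mirroring the displayed block structure:

```python
# A hedged sketch (ours) of the data in (25): M, a, Q, P built from the block
# matrices A, B in (7), for v = [C, zeta, z, alpha, xi].
import numpy as np

def build_mpec_data(A, B, T, m1, m2):
    Tm1, Tm2 = T * m1, T * m2
    m_bar = 2 * (Tm1 + Tm2)
    M = np.concatenate(([0.0], np.ones(Tm1), np.zeros(Tm1 + 2 * Tm2))) / (T * m1)
    a = np.concatenate((np.zeros(Tm1), np.ones(Tm1), -np.ones(Tm2), np.zeros(Tm2)))
    Q = np.hstack((np.zeros((m_bar, 1)), np.eye(m_bar)))        # H(v) = Q v
    P = np.zeros((m_bar, m_bar + 1))
    c_zeta, c_z, c_alpha, c_xi = 1, 1 + Tm1, 1 + 2 * Tm1, 1 + 2 * Tm1 + Tm2
    P[:Tm1, c_z:c_alpha] = np.eye(Tm1)                          # + z
    P[:Tm1, c_alpha:c_xi] = A @ B.T                             # + A B^T alpha
    P[Tm1:2 * Tm1, c_zeta:c_z] = -np.eye(Tm1)                   # - zeta (a adds +1)
    P[2 * Tm1:2 * Tm1 + Tm2, c_alpha:c_xi] = B @ B.T            # + B B^T alpha
    P[2 * Tm1:2 * Tm1 + Tm2, c_xi:] = np.eye(Tm2)               # + xi (a adds -1)
    P[2 * Tm1 + Tm2:, 0] = 1.0                                  # + C
    P[2 * Tm1 + Tm2:, c_alpha:c_xi] = -np.eye(Tm2)              # - alpha
    return M, P, a, Q                                           # G(v) = P v + a
```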

The basic idea of the GRM is as follows. Let \(\{t_{k}\} \downarrow 0\). At each iteration, we replace the MPEC (26) by the nonlinear program (NLP) of the following form, parameterized in \(t_k\):

$$\begin{aligned} (\hbox {NLP-}t_{k}) \quad \begin{aligned}&\min _{v \in \mathbb {R}^{\overline{m}+1}} F(v)\\&\quad \hbox {s.t.}\ \ G(v) \ge \mathbf {0},\ H(v) \ge \mathbf {0},\\&\quad \quad \quad \ G_{i}(v)\, H_{i}(v) \le t_{k},\ i=1,\ \cdots ,\ \overline{m}. \end{aligned} \end{aligned}$$

The details of the GRM are shown in Algorithm 1.

Algorithm 1: The global relaxation method (GRM) for solving the MPEC (26)
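To illustrate Algorithm 1, here is a hedged sketch (ours) of the GRM loop applied to the generic MPEC (26), with scipy's SLSQP standing in for the SNOPT solver used in the paper; the parameters t0, sigma and t_min are illustrative choices, and the final feasibility is measured by Vio as defined in (27) below.

```python
# A hedged sketch (ours) of the GRM loop for the MPEC (26) with the data of
# (25); SLSQP replaces SNOPT purely for illustration.
import numpy as np
from scipy.optimize import minimize

def grm(M, P, a, Q, v0, t0=1.0, sigma=0.01, t_min=1e-8):
    v, t = np.asarray(v0, dtype=float), t0
    while t > t_min:
        cons = [
            {"type": "ineq", "fun": lambda v: P @ v + a},                    # G(v) >= 0
            {"type": "ineq", "fun": lambda v: Q @ v},                        # H(v) >= 0
            {"type": "ineq",                                                 # relaxed
             "fun": lambda v, t=t: t - (P @ v + a) * (Q @ v)},               # G_i H_i <= t_k
        ]
        res = minimize(lambda v: M @ v, v, method="SLSQP", constraints=cons)
        v, t = res.x, sigma * t                                              # shrink t_k
    vio = np.linalg.norm(np.minimum(P @ v + a, Q @ v), ord=np.inf)           # Vio in (27)
    return v, vio
```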

Here, the maximum constraint violation \({\mathrm{Vio}}\), defined by

$$\begin{aligned} {\mathrm{Vio}}\,\left( v_{o p t}\right) =\Vert \min \{G(v_{opt}),\ H(v_{opt}) \}\Vert _{\infty } \end{aligned}$$
(27)

is used to measure the feasibility of the final iterate \(v_{opt}\), where \(\Vert \cdot \Vert _{\infty }\) denotes the \(l_{\infty }\)-norm. Note that in step 4, the approximate solution refers to an approximate stationary point, in the sense that it satisfies the KKT conditions of (NLP-\(t_{k}\)) approximately. Numerically, we use the SNOPT solver Gill et al. (2002) to compute KKT points of (NLP-\(t_{k}\)) approximately, such that the norm of the KKT residual is less than a threshold value \(\epsilon \!=\!10^{-6}\). The point \(v^{k+1}\) returned by the SNOPT solver is referred to as an approximate solution of (NLP-\(t_{k}\)). We use the GRM in Algorithm 1 to solve the MPEC (26), and obtain the optimal hyperparameter C together with the corresponding function value \(F(v_{opt})\), which is the cross-validation error (CV error) measured on the validation sets in T-fold cross-validation. To analyze the convergence of the GRM, we need the concept of C-stationarity, which we define next.

To proceed, let v be a feasible point for the MPEC (26) and recall that \(F(v),\ G(v)\) and H(v) are defined in (25). Based on v, let

$$\begin{aligned} \begin{aligned}&I_{G}&:= \;\;&\{i \ \mid \ G_{i}(v)=0,\ H_{i}(v)>0 \}, \\&I_{GH}&:=\;\;&\{i \ \mid \ G_{i}(v)=0,\ H_{i}(v)=0 \},\\&I_{H}&:=\;\;&\{i \ \mid \ G_{i}(v)>0,\ H_{i}(v)=0 \}. \end{aligned} \end{aligned}$$

Definition 1

(C-stationarity) Let v be a feasible point for the MPEC (26). Then v is said to be a C-stationary point, if there are multipliers \(\gamma ,\ \nu \in \mathbb {R}^{\overline{m}}\), such that

$$\begin{aligned} \nabla F\left( v\right) -\sum _{i=1}^{\overline{m}} \gamma _{i} \nabla G_{i}\left( v\right) -\sum _{i=1}^{\overline{m}} \nu _{i} \nabla H_{i}\left( v\right) =\mathbf{0}, \end{aligned}$$

and \(\gamma _{i}=0\) for \( i \in I_{H},\ \nu _{i}=0\) for \(i \in I_{G}\), and \(\gamma _{i} \nu _{i} \ge 0\) for \(i \in I_{GH}\).

Note that for problem (26), C-stationarity holds at any local optimal solution that satisfies the MPEC–MFCQ, which can be defined as follows Hoheisel et al. (2013).

Definition 2

A feasible point v for problem (26) satisfies the MPEC-MFCQ if and only if the set of gradient vectors

$$\begin{aligned} \begin{aligned} \left\{ \nabla G_{i}\left( v\right) \mid i \in I_{G} \cup I_{GH} \right\} \cup \left\{ \nabla H_{i}\left( v\right) \mid i \in I_{H} \cup I_{GH}\right\} \end{aligned} \end{aligned}$$
(28)

is positive-linearly independent.

Recall that the set of gradient vectors in (28) is said to be positive-linearly dependent if there exist scalars \(\{\delta _{i}\}_{i \in I_{G} \cup I_{GH}}\) and \( \{\beta _{i}\}_{i \in I_{H} \cup I_{GH}}\) with \(\delta _{i} \ge 0\) for \(i \in I_{G} \cup I_{GH}\), \(\beta _{i} \ge 0\) for \(i \in I_{H} \cup I_{GH}\), not all of them being zero, such that \(\Sigma _{i \in I_{G} \cup I_{GH}} \delta _{i} \nabla G_{i}(v)+\Sigma _{i \in I_{H} \cup I_{GH}} \beta _{i} \nabla H_{i}(v)=\mathbf{0}\). Otherwise, we say that this set of gradient vectors is positive-linearly independent.
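Positive-linear dependence of a finite set of vectors can itself be checked as a linear feasibility problem; the following sketch (ours, purely illustrative, since the paper's verification in the next results is analytical) normalizes the coefficients to sum to one:

```python
# A sketch (ours): the rows of V are positive-linearly dependent iff there is
# c >= 0 with sum(c) = 1 and V^T c = 0, a linear feasibility problem.
import numpy as np
from scipy.optimize import linprog

def positive_linearly_dependent(V):
    k, n = V.shape
    res = linprog(c=np.zeros(k),
                  A_eq=np.vstack((V.T, np.ones((1, k)))),
                  b_eq=np.concatenate((np.zeros(n), [1.0])),
                  bounds=[(0.0, None)] * k, method="highs")
    return res.status == 0            # feasible  <=>  positive-linearly dependent
```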

Also note that various other stationarity concepts can be defined for problem (26); for more details on this, interested readers are referred to Dempe and Zemkoho (2012), Flegel (2005).

The following result establishes that Algorithm 1 is well-defined, as it provides a framework ensuring that a solution (or a stationary point, to be precise) exists for problem (NLP-\(t_{k}\)) as required.

Theorem 1

Hoheisel et al. (2013) Let v be a feasible point for the MPEC (26) such that MPEC-MFCQ is satisfied at v. Then there exist a neighborhood N of v and \(\overline{t} > 0\) such that, for all \(t \in (0,\ \overline{t})\), the standard MFCQ for (NLP-\(t_{k}\)) with \(t_{k}=t\) holds at every feasible point of (NLP-\(t_{k}\)) lying in N.

Subsequently, we have the following convergence result, which ensures that a sequence of stationary points of problem (NLP-\(t_{k}\)), computed by Algorithm 1, converges to a C-stationary point of problem (26).

Theorem 2

Hoheisel et al. (2013) Let \(\{t_{k}\} \downarrow 0\) and let \(v^k\) be a stationary point of (NLP-\(t_{k}\)) with \(v^{k} \rightarrow v\) such that MPEC-MFCQ holds at the feasible point v. Then v is a C-stationary point of the MPEC (26).

Clearly, the MPEC-MFCQ is crucial for the analysis of problem (26), as it not only ensures that the C-stationarity condition can hold at a locally optimal point, but also helps in establishing the two fundamental results in Theorems 1 and 2. Considering this importance of the condition, we carefully analyze it in the next section, and show, in particular, that it automatically holds at any feasible point of problem (26).

4 Fulfilment of the MPEC–MFCQ

In this section, we prove that every point in the feasible set of the MPEC (26) satisfies the MPEC–MFCQ. The rough idea of our proof is as follows. Firstly, by analyzing the relationship between different index sets (Proposition 5), we reach a reduced form of the MPEC–MFCQ (Proposition 6). Then, based on the positive-linear independence of three submatrices (Lemmas 1–3), we eventually establish the MPEC–MFCQ in Theorem 3. The roadmap of the proof is summarized in Fig. 7.

Fig. 7: The roadmap of the proof of the MPEC–MFCQ

4.1 Relationships between the index sets

In this part, we first explore more properties about the index sets \(I_{H},\ I_{G},\ I_{GH}\), as they are the key to the analysis of the positive-linear independence of the vectors in (28). Let \(I_{H}:=\underset{k=1}{\overset{4}{\cup }}I_{H_{k}},\ I_{G}:=\underset{k=1}{\overset{4}{\cup }}I_{G_{k}},\) and \(I_{GH}:=\underset{k=1}{\overset{4}{\cup }}I_{GH_{k}}\), where

$$\begin{aligned}&I_{H_{1}}\ \; := \; \{i \in Q_{u} \ \mid \ \zeta _{i}=0,\ (AB^{\top } \alpha +z)_{i}>0\}, \end{aligned}$$
(29a)
$$\begin{aligned}&I_{H_{2}} \ \; :=\; \{i \in Q_{u} \ \mid \ z_{i}=0,\ 1-\zeta _{i}>0\}, \end{aligned}$$
(29b)
$$\begin{aligned}&I_{H_{3}} \ \; := \; \{i \in Q_{l} \ \mid \ \alpha _{i}=0,\ (BB^{\top } \alpha -\mathbf {1}+\xi )_{i}>0\}, \end{aligned}$$
(29c)
$$\begin{aligned}&I_{H_{4}} \ \; := \; \{i \in Q_{l} \ \mid \ \xi _{i}=0,\ C-\alpha _{i}>0\}, \end{aligned}$$
(29d)
$$\begin{aligned}&I_{G_{1}} \ \; :=\; \{i \in Q_{u} \ \mid \ \zeta _{i}>0,\ (AB^{\top } \alpha +z)_{i}=0\}, \end{aligned}$$
(29e)
$$\begin{aligned}&I_{G_{2}} \ \; :=\; \{i \in Q_{u} \ \mid \ z_{i}>0,\ 1-\zeta _{i}=0\}, \end{aligned}$$
(29f)
$$\begin{aligned}&I_{G_{3}} \ \; :=\; \{i \in Q_{l} \ \mid \ \alpha _{i}>0,\ (BB^{\top } \alpha -\mathbf {1}+\xi )_{i}=0\}, \end{aligned}$$
(29g)
$$\begin{aligned}&I_{G_{4}}\ \; :=\; \{i \in Q_{l} \ \mid \ \xi _{i}>0,\ C-\alpha _{i}=0\}, \end{aligned}$$
(29h)
$$\begin{aligned}&I_{GH_{1}} :=\; \{i \in Q_{u} \ \mid \ \zeta _{i}=0,\ (AB^{\top } \alpha +z)_{i}=0\}, \end{aligned}$$
(29i)
$$\begin{aligned}&I_{GH_{2}} :=\; \{i \in Q_{u} \ \mid \ z_{i}=0,\ 1-\zeta _{i}=0\}, \end{aligned}$$
(29j)
$$\begin{aligned}&I_{GH_{3}} :=\; \{i \in Q_{l} \ \mid \ \alpha _{i}=0,\ (BB^{\top } \alpha -\mathbf {1}+\xi )_{i}=0\}, \end{aligned}$$
(29k)
$$\begin{aligned}&I_{GH_{4}} :=\; \{i \in Q_{l} \ \mid \ \xi _{i}=0,\ C-\alpha _{i}=0\}. \end{aligned}$$
(29l)

Here, \(Q_{u},\ Q_{l}\) are defined in (1) and (2), respectively. Furthermore, let

$$\begin{aligned} I^{k}:=I_{H_{k}}\cup I_{G_{k}} \cup I_{GH_{k}},\ k=1,\ 2,\ 3,\ 4. \end{aligned}$$

It can be observed that each index set \(I^{k},\ k=1,\ 2,\ 3,\ 4\) corresponds to the union of the three components in the partition involved in the corresponding part of the complementarity systems in (21); that is,

  • Part 1: \(I^{1}\) for the partition of the system \(\mathbf {0} \le \zeta \perp \ A B^{T} \alpha +z \ge \mathbf {0}\);

  • Part 2: \(I^{2}\) for the partition of the system \(\mathbf {0} \le z \perp \mathbf {1}-\zeta \ge \mathbf {0}\);

  • Part 3: \(I^{3}\) for the partition of the system \(\mathbf {0} \le \alpha \perp B B^{T} \alpha - \mathbf {1}+\xi \ge \mathbf {0}\);

  • Part 4: \(I^{4}\) for the partition of the system \(\mathbf {0} \le \xi \perp C \mathbf {1}-\alpha \ge \mathbf {0}\).

In the previous section, we clarified a one-to-one correspondence between the index set of the validation points \(Q_u\) in (1) and the complementarity constraints in Part 1 and Part 2, respectively. It is clear that \(I^{1}=I^{2}=Q_{u}\). Similarly, we have \(I^{3}=I^{4}=Q_{l}\).

Next, we give the relationships between the index sets in (29); recall that we already have some index sets described in Propositions 3 and 4. For the convenience of the analysis, we divide the index set \(\Lambda _3\) in (10) into two subsets \(\Lambda ^{+}_{3}\) and \(\Lambda ^{c}_{3}\), as well as \(\Psi _{1}\) in (22) into \(\Psi ^{0}_{1}\) and \(\Psi ^{+}_{1}\):

$$\begin{aligned} \Lambda ^{+}_{3}:= & {} \{i \in Q_{l}\ \mid \ 0<\alpha _{i}<C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}=0\}, \end{aligned}$$
(30)
$$\begin{aligned} \Lambda ^{c}_{3}:= & {} \{i \in Q_{l}\ \mid \ \alpha _{i}=C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}=0\}, \end{aligned}$$
(31)
$$\begin{aligned} \Psi ^{0}_{1}:= & {} \{i \in Q_{u}\ \mid \ \zeta _{i}=0,\ (A B^{\top } \alpha +z)_{i}=0,\ z_{i}=0\}, \end{aligned}$$
(32)
$$\begin{aligned} \Psi ^{+}_{1}:= & {} \{i \in Q_{u}\ \mid \ 0<\zeta _{i}<1,\ (A B^{\top } \alpha +z)_{i}=0,\ z_{i}=0\}. \end{aligned}$$
(33)
Fig. 8: The index sets corresponding to the complementarity constraints in Parts 1–4

Proposition 5

The index sets in (29) and the index sets in Proposition 3 and Proposition 4 have the following relationship:

  1. (a)

    In Part \(1,\ I_{H_{1}}=\Psi _{2},\ I_{G_{1}}=\Psi ^{+}_{1} \cup \Psi _{3},\ I_{GH_{1}}=\Psi ^{0}_{1}.\)

  2. (b)

    In Part \(2,\ I_{H_{2}}=\Psi _{1} \cup \Psi _{2},\ I_{G_{2}}=\Psi _{3},\ I_{GH_{2}}=\emptyset .\)

  3. (c)

    In Part \(3,\ I_{H_{3}}=\Lambda _{2},\ I_{G_{3}}=\Lambda _{3} \cup \Lambda _{u},\ I_{GH_{3}}=\Lambda _{1}.\)

  4. (d)

    In Part \(4,\ I_{H_{4}}=\Lambda _{1} \cup \Lambda _{2} \cup \Lambda ^{+}_{3},\ I_{G_{4}}=\Lambda _{u},\ I_{GH_{4}}=\Lambda ^{c}_{3}.\)

Here, \(\Lambda _{u}\) is defined as follows

$$\begin{aligned} \Lambda _{u}:=\left\{ i \in Q_{l}\ \mid \ \alpha _{i}=C,\ (B B^{\top } \alpha -\mathbf {1}+\xi )_{i}=0,\ \xi _{i}>0\right\} =\Lambda _{4} \cup \Lambda _{5} \cup \Lambda _{6}. \end{aligned}$$
(34)

Proof

According to the definition of the index sets in (29) and the index sets in Proposition 3 and Proposition 4, we have the following analysis.

  1. (a)

    In Part 1, for \(i \in I_{H_{1}}\), compared with the index set \(\Psi _{2}\) in (23), it follows that \(z_{i}=0\) and \(I_{H_{1}}=\Psi _{2}\). For \(i \in I_{G_{1}}\), compared with the index sets \(\Psi ^{+}_{1}\) in (33) and \(\Psi _{3}\) in (24), we get \(0<\zeta _{i}<1,\ z_{i}=0\) or \(\zeta _{i}=1,\ z_{i}>0\), and \(I_{G_{1}}=\Psi ^{+}_{1} \cup \Psi _{3}\). For \(i \in I_{GH_{1}}\), compared with the index set \(\Psi ^{0}_{1}\) in (32), we get \(z_{i}=0\) and \(I_{GH_{1}}=\Psi ^{0}_{1}\).

  2. (b)

    In Part 2, for \(i \in I_{H_{2}}\), compared with the index sets \(\Psi _{1}\) in (22) and \(\Psi _{2}\) in (23), we get \(I_{H_{2}}=\Psi _{1} \cup \Psi _{2}.\) For \(i \in I_{G_{2}}\), compared with the index set \(\Psi _{3}\) in (24), we get \((A B^{\top } \alpha +z)_{i}=0\) and \(I_{G_{2}}=\Psi _{3}\). For \(i \in I_{GH_{2}}\), there is no index set in Proposition 4 that corresponds to \(I_{GH_{2}}\). Therefore, \(I_{GH_{2}}=\emptyset .\)

  3. (c)

    In Part 3, for \(i \in I_{H_{3}}\), compared with the index set \(\Lambda _{2}\) in (9), we get \(\xi _{i}=0\) and \(I_{H_{3}}=\Lambda _{2}\). For \(i \in I_{G_{3}}\), compared with the index sets \(\Lambda _{3}\) in (10) and \(\Lambda _{u}\) in (34), we get \(I_{G_{3}}=\Lambda _{3} \cup \Lambda _{u}.\) For \(i \in I_{GH_{3}}\), compared with the index set \(\Lambda _{1}\) in (8), we get \(\xi _{i}=0\) and \(I_{GH_{3}}=\Lambda _{1}\).

  4. (d)

    In Part 4, for \(i \in I_{H_{4}}\), compared with the index sets \(\Lambda _{1}\) in (8), \(\Lambda _{2}\) in (9) and \(\Lambda ^{+}_{3}\) in (30), we get \(I_{H_{4}}=\Lambda _{1} \cup \Lambda _{2} \cup \Lambda ^{+}_{3}.\) For \(i \in I_{G_{4}}\), compared with the index set \(\Lambda _{u}\) in (34), we get \((Bw-\mathbf {1}+\xi )_{i}=0\) and \(I_{G_{4}}=\Lambda _{u}.\) For \(i \in I_{GH_{4}}\), compared with the index set \(\Lambda ^{c}_{3}\) in (31), it results that we have \((Bw-\mathbf {1}+\xi )_{i}=0\) and \(I_{GH_{4}}=\Lambda ^{c}_{3}\).

The results in Proposition 5 are illustrated in Fig. 8. For example, for (a) in Proposition 5, the index sets of the complementarity constraints in Part 1 are shown in Fig. 8a, which depicts the relationship between \(I_{H_{1}},\ I_{G_{1}},\ I_{GH_{1}}\) in (29) and the index sets (22)–(24). In Fig. 8a, the red shaded part represents the index set \(I_{G_{1}}\), which contains the index sets \(\Psi ^{+}_{1}\) and \(\Psi _{3}\). Parts (b)–(d) of Proposition 5 are illustrated in Fig. 8b–d. Specifically, in Fig. 8b, the red shaded part represents the index set \(I_{H_{2}}\), which contains the index sets \(\Psi _{1}\) (or \(\Psi ^{0}_{1} \cup \Psi ^{+}_{1}\)) and \(\Psi _{2}\). In Fig. 8c, the red shaded part represents the index set \(I_{G_{3}}\), which contains the index sets \(\Lambda _{3}\) (or \(\Lambda ^{+}_{3} \cup \Lambda ^{c}_{3}\)) and \(\Lambda _{u}\). In Fig. 8d, the red shaded part represents the index set \(I_{H_{4}}\), which contains the index sets \(\Lambda _{1},\ \Lambda _{2}\), and \(\Lambda ^{+}_{3}\).

4.2 The reduced form of the MPEC-MFCQ

Proposition 6

The set of gradient vectors in (28) at a feasible point v for the MPEC (26) can be written in the matrix form

(35)

where \(L_{q},\ q=1,\ \cdots ,\ 5\) are the index sets of columns corresponding to the variables C, \(\zeta \), z, \(\alpha \), and \(\xi \), respectively, and

(36)

Proof

Based on Definition 2, the set of gradient vectors in (28) at a feasible point v can be written as the rows of the matrix \(\Gamma \), as follows:

$$\begin{aligned} \Gamma = \left[ \begin{array}{l} \nabla G(v)_{I_{G_1}} \\ \nabla G(v)_{I_{GH_1}} \\ \nabla H(v)_{I_{GH_1}} \\ \nabla H(v)_{I_{H_1}}\\ \nabla G(v)_{I_{G_2}} \\ \nabla H(v)_{I_{H_2}} \\ \nabla G(v)_{I_{G_3}}\\ \nabla G(v)_{I_{GH_3}} \\ \nabla H(v)_{I_{GH_3}} \\ \nabla H(v)_{I_{H_3}} \\ \nabla G(v)_{I_{G_4}} \\ \nabla G(v)_{I_{GH_4}} \\ \nabla H(v)_{I_{GH_4}} \\ \nabla H(v)_{I_{H_4}} \end{array}\right] . \end{aligned}$$
(37)

Now, we can easily show that the matrix \(\Gamma \) in (37) is equivalent to the more specific form in (35). To proceed, first note that from Proposition 5 (a) and (b), we have

$$\begin{aligned} \begin{array}{l} I_{H_{1}}\!=\! \Psi _{2},\ I_{G_{1}}\!=\! \Psi ^{+}_{1}\cup \Psi _{3},\ I_{GH_{1}}\!=\!\Psi ^{0}_{1},\ I_{H_{2}}\!=\!\Psi _{1} \cup \Psi _{2},\\ I_{G_{2}}\!=\!\Psi _{3},\ Q_u\!=\!\Psi _{1} \cup \Psi _{2}\cup \Psi _{3}. \end{array} \end{aligned}$$

So, we get \(\Gamma _{a}^{3},\ \Gamma _{b}^{3},\ \Gamma _{c}^{2},\ \Gamma _{d}^{2},\ \Gamma _{e}^{2}\), and \(\Gamma _{f}^{3}\) in (36). On the other hand, it follows from Proposition 5 (c) and (d) that

$$\begin{aligned} \begin{array}{l} I_{H_{3}}\!=\!\Lambda _{2},\ I_{G_{3}}\!=\!\Lambda _{3} \cup \Lambda _{u},\ I_{GH_{3}}\!=\!\Lambda _{1},\ I_{H_{4}}\!=\!\Lambda _{1} \cup \Lambda _{2} \cup \Lambda ^{+}_{3},\\ I_{G_{4}}\!=\!\Lambda _{u},\ I_{GH_{4}}\!=\!\Lambda ^{c}_{3}, \;\; Q_{l}\!=\!\Lambda _{1} \cup \Lambda _{2} \cup \Lambda _{3} \cup \Lambda _{u}. \end{array} \end{aligned}$$

Subsequently, we obtain \(\Gamma _{g}^{5},\ \Gamma _{h}^{5},\ \Gamma _{i}^{4},\ \Gamma _{j}^{4},\ \Gamma _{k}^{4},\ \Gamma _{l}^{4},\ \Gamma _{m}^{5}\), and \(\Gamma _{n}^{5}\) in (36). Therefore, we obtain the form of the matrix \(\Gamma \) in (35). \(\square \)

4.3 Three important lemmas

Due to the complicated form of \(\Gamma \) in (35), in this part we first present three lemmas addressing the positive-linear independence of the three submatrices of \(\Gamma \) marked in blue, green and yellow, respectively. To proceed from here on, we define the size of each index set in (29) and Propositions 3 and 4 as follows. We denote the size of the index set \(I_{G_1}\) by \(S_1\), that is, \( \mid \! I_{G_1}\mid =S_1\). Similarly,

$$\begin{aligned} \begin{array}{lllll} \mid \! I_{G_2}\! \mid =S_2, & \mid \! I_{G_3}\! \mid =S_3, & \mid \! I_{G_4}\! \mid =S_4, & & \\ \mid \! I_{H_1}\! \mid =U_1, & \mid \! I_{H_2}\! \mid =U_2, & \mid \! I_{H_3}\! \mid =U_3, & \mid \! I_{H_4}\! \mid =U_4, & \\ \mid \! I_{GH_1}\! \mid =W_1, & \mid \! I_{GH_3}\! \mid =W_2, & \mid \! I_{GH_4}\! \mid =W_3, & & \\ \mid \! \Lambda _{1}\! \mid =D_1, & \mid \! \Lambda _{2}\! \mid =D_2, & \mid \! \Lambda ^{+}_{3}\! \mid =D_3, & \mid \! \Lambda ^{c}_{3}\! \mid =D_4, & \mid \! \Lambda _{u}\! \mid =D_5,\\ \mid \! \Psi ^{0}_{1}\! \mid =N_1, & \mid \! \Psi ^{+}_{1}\! \mid =N_2, & \mid \! \Psi _{2}\! \mid =N_3, & \mid \! \Psi _{3}\! \mid =N_4. & \end{array} \end{aligned}$$

Further, we denote the index corresponding to each row of the matrices \(\Gamma _{a}^{3},\ \cdots ,\ \Gamma _{n}^{5}\) in (36) by \(a_{s},\ \cdots ,\ n_{s}\), respectively.

Lemma 1

The row vectors in the following matrix

$$\begin{aligned} \left[ \begin{array}{l}\Gamma _{c}^{2}\\ \Gamma _{d}^{2} \\ \Gamma _{e}^{2} \end{array}\right] =\left[ \begin{array}{cccc}I_{(I_{GH_{1}},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(I_{GH_{1}},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(I_{GH_{1}},\ \Psi _{2})} &{} \mathbf {0}_{(I_{GH_{1}},\ \Psi _{3})} \\ \mathbf {0}_{(I_{H_{1}},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(I_{H_{1}},\ \Psi ^{+}_{1})}&{}I_{(I_{H_{1}},\ \Psi _{2})} &{} \mathbf {0}_{(I_{H_{1}},\ \Psi _{3})} \\ \mathbf {0}_{(I_{G_{2}},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(I_{G_{2}},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(I_{G_{2}},\ \Psi _{2})} &{} -I_{(I_{G_{2}},\ \Psi _{3})} \\ \end{array}\right] \end{aligned}$$
(38)

are positive-linearly independent.

Proof

Assume that there exist \(\overline{\rho }^{c} \in \mathbb {R}^{W_{1}}\ \text {and}\ \overline{\rho }^{c} \ge \mathbf {0},\ \overline{\rho }^{d} \in \mathbb {R}^{U_{1}} \ \text {and}\ \overline{\rho }^{d} \ge \mathbf {0},\ \overline{\rho }^{e} \in \mathbb {R}^{S_{2}} \ \text {and}\ \overline{\rho }^{e} \ge \mathbf {0},\) such that

$$\begin{aligned} \sum \limits _{s=1}^{W_1} \rho _{s}^{c}\left[ \begin{array}{c} e^{W_{1}}_{c_{s}} \\ \mathbf {0}_{N_2}\\ \mathbf {0}_{N_3}\\ \mathbf {0}_{N_4} \end{array}\right] +\sum \limits _{s=1}^{U_1} \rho _{s}^{d}\left[ \begin{array}{c} \mathbf {0}_{N_{1}}\\ \mathbf {0}_{N_2}\\ e^{U_1}_{d_{s}}\\ \mathbf {0}_{N_{4}}\\ \end{array}\right] +\sum \limits _{s=1}^{S_2} \rho _{s}^{e}\left[ \begin{array}{c} \mathbf {0}_{N_{1}}\\ \mathbf {0}_{N_2}\\ \mathbf {0}_{N_3}\\ -e^{S_2}_{e_{s}} \end{array}\right] =\mathbf {0}. \end{aligned}$$

The above equation is equivalent to the following system

$$\begin{aligned} \left[ \begin{array}{c}\overline{\rho }^{c} \\ \mathbf {0}_{N_2} \\ \overline{\rho }^{d}\\ -\overline{\rho }^{e} \end{array}\right] =\mathbf {0}. \end{aligned}$$
(39)

Since \(\overline{\rho }^{c} \ge \mathbf {0},\ \overline{\rho }^{d} \ge \mathbf {0},\ \overline{\rho }^{e} \ge \mathbf {0}\), we get \(\overline{\rho }^{c} = \mathbf {0},\ \overline{\rho }^{d} = \mathbf {0},\ \overline{\rho }^{e} = \mathbf {0}\) from Eq. (39). Therefore, the row vectors in the matrix (38) are positive-linearly independent. \(\square \)
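Positive-linear independence, in the form used in Lemmas 1–3 where all multipliers are nonnegative, can also be verified numerically on small instances: the rows of a matrix M admit a nonnegative, not-all-zero vanishing combination exactly when the linear program below has a positive optimal value. The sketch and the toy matrix (which merely mimics the signed identity-block structure of (38)) are illustrative only and are not part of the paper's development.

```python
import numpy as np
from scipy.optimize import linprog

def nonneg_combo_exists(M, tol=1e-9):
    """Return True if some nonnegative, not-all-zero combination of the rows
    of M equals zero.  We solve
        max  sum(rho)  s.t.  M^T rho = 0,  0 <= rho <= 1,
    and report dependence if the optimum is positive.
    """
    m = M.shape[0]
    res = linprog(c=-np.ones(m), A_eq=M.T, b_eq=np.zeros(M.shape[1]),
                  bounds=[(0.0, 1.0)] * m, method="highs")
    return res.status == 0 and -res.fun > tol

# Toy matrix mimicking the signed identity-block structure of (38).
M = np.block([
    [np.eye(2),        np.zeros((2, 2)), np.zeros((2, 2))],
    [np.zeros((2, 2)), np.eye(2),        np.zeros((2, 2))],
    [np.zeros((2, 2)), np.zeros((2, 2)), -np.eye(2)],
])
print(nonneg_combo_exists(M))   # expected: False, i.e. positive-linearly independent rows
```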

Lemma 2

The row vectors in the following matrix

$$\begin{aligned} \left[ \begin{array}{l}\Gamma _{a}^{3}\\ \Gamma _{b}^{3}\\ \Gamma _{f}^{3} \end{array}\right] =\left[ \begin{array}{cccc}\mathbf {0}_{(\Psi ^{+}_{1},\ \Psi ^{0}_{1}) }&{} I_{(\Psi ^{+}_{1},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(\Psi ^{+}_{1},\ \Psi _{2})} &{} \mathbf {0}_{(\Psi ^{+}_{1},\ \Psi _{3})} \\ \mathbf {0}_{(\Psi _{3},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(\Psi _{3},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(\Psi _{3},\ \Psi _{2})} &{} I_{(\Psi _{3},\ \Psi _{3})}\\ I_{(I_{GH_{1}},\ \Psi ^{0}_{1}) }&{} \mathbf {0}_{(I_{GH_{1}},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(I_{GH_{1}},\ \Psi _{2})} &{} \mathbf {0}_{(I_{GH_{1}},\ \Psi _{3})}\\ I_{(\Psi ^{0}_{1},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(\Psi ^{0}_{1},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(\Psi ^{0}_{1},\ \Psi _{2})} &{} \mathbf {0}_{(\Psi ^{0}_{1},\ \Psi _{3})}\\ \mathbf {0}_{(\Psi ^{+}_{1},\ \Psi ^{0}_{1}) }&{} I_{(\Psi ^{+}_{1},\ \Psi ^{+}_{1})}&{} \mathbf {0}_{(\Psi ^{+}_{1},\ \Psi _{2})} &{} \mathbf {0}_{(\Psi ^{+}_{1},\ \Psi _{3})}\\ \mathbf {0}_{(\Psi _{2},\ \Psi ^{0}_{1})}&{} \mathbf {0}_{(\Psi _{2},\ \Psi ^{+}_{1})}&{}I_{(\Psi _{2},\ \Psi _{2})} &{} \mathbf {0}_{(\Psi _{2},\ \Psi _{3})} \end{array}\right] \end{aligned}$$
(40)

are positive-linearly independent.

Proof

Assume that there exist \(\overline{\rho }^{a} \in \mathbb {R}^{S_{1}}\ \text {and}\ \overline{\rho }^{a} \ge \mathbf {0},\ \overline{\rho }^{b} \in \mathbb {R}^{W_{1}} \ \text {and}\ \overline{\rho }^{b} \ge \mathbf {0},\ \overline{\rho }^{f} \in \mathbb {R}^{U_{2}} \ \text {and}\ \overline{\rho }^{f} \ge \mathbf {0},\) such that

$$\begin{aligned} \sum \limits _{s=1}^{S_1} \rho _{s}^{a}\left[ \begin{array}{c} \mathbf {0}_{N_1+N_3}\\ e^{S_{1}}_{a_{s}} \end{array}\right] +\sum \limits _{s=1}^{W_1} \rho _{s}^{b}\left[ \begin{array}{c} e^{W_{1}}_{b_{s}}\\ \mathbf {0}_{N_2+N_3+N_{4}} \end{array}\right] +\sum \limits _{s=1}^{U_2} \rho _{s}^{f}\left[ \begin{array}{c} e^{U_{2}}_{f_{s}}\\ \mathbf {0}_{N_4} \end{array}\right] =\mathbf {0}. \end{aligned}$$

The above equation is equivalent to the following system

$$\begin{aligned} \left[ \begin{array}{c}\overline{\rho }^{b}+\overline{\rho }_{\Psi ^{0}_1}^{f}\\ \overline{\rho }_{\Psi ^{+}_{1}}^{a}+\overline{\rho }_{\Psi ^{+}_{1}}^{f} \\ \overline{\rho }_{\Psi _2}^{f}\\ \overline{\rho }_{\Psi _3}^{a} \end{array}\right] =\mathbf {0}. \end{aligned}$$
(41)

Since \(\overline{\rho }^{a} \ge \mathbf {0},\ \overline{\rho }^{b} \ge \mathbf {0},\ \overline{\rho }^{f} \ge \mathbf {0}\), we get \(\overline{\rho }^{a} = \mathbf {0},\ \overline{\rho }^{b} = \mathbf {0},\ \overline{\rho }^{f} = \mathbf {0}\) from Eq. (41). Therefore, the row vectors in the matrix (40) are positive-linearly independent. \(\square \)

Lemma 3

The row vectors in the matrix \(\Gamma _{sub}\) defined by

$$\begin{aligned} \begin{array}{c} \Gamma _{sub} =\left[ \begin{array}{cc}(B B^{\top })_{(I_{G_{3}},\ \cdot \ )} &{} \Gamma _{g}^{5}\\ (B B^{\top })_{(I_{GH_{3}},\ \cdot \ )} &{} \Gamma _{h}^{5}\\ \Gamma _{i}^{4} &{} \mathbf {0}_{(I_{GH_{3}},\ L_{5} )} \\ \Gamma _{j}^{4} &{} \mathbf {0}_{(I_{H_{3}},\ L_{5}) } \\ \mathbf {0}_{(I_{GH_{4}},\ L_{4})} &{} \Gamma _{m}^{5} \\ \mathbf {0}_{(I_{H_{4}},\ L_{4})} &{} \Gamma _{n}^{5} \end{array}\right] \end{array} \end{aligned}$$
(42)

are positive-linearly independent.

Proof

For the convenience of analysis, note that

$$\begin{aligned} \left[ \begin{array}{c} \Gamma _{g}^{5} \\ \Gamma _{h}^{5} \\ \Gamma _{m}^{5} \\ \Gamma _{n}^{5} \end{array} \right] =\left[ \begin{array}{ccccc} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{1})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{2})} &{} I_{(\Lambda ^{+}_{3},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{u})}\\ \mathbf {0}_{(\Lambda ^{c}_{3},\ \Lambda _{1})} &{} \mathbf {0}_{(\Lambda ^{c}_{3},\ \Lambda _{2})} &{} \mathbf {0}_{(\Lambda ^{c}_{3},\ \Lambda ^{+}_{3})} &{} I_{(\Lambda ^{c}_{3},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(\Lambda ^{c}_{3},\ \Lambda _{u})}\\ \mathbf {0}_{(\Lambda _{u},\ \Lambda _{1})} &{} \mathbf {0}_{(\Lambda _{u},\ \Lambda _{2})} &{} \mathbf {0}_{(\Lambda _{u},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(\Lambda _{u},\ \Lambda ^{c}_{3})} &{} I_{(\Lambda _{u},\ \Lambda _{u})}\\ I_{(I_{GH_{3}},\ \Lambda _{1})} &{} \mathbf {0}_{(I_{GH_{3}},\ \Lambda _{2})} &{} \mathbf {0}_{(I_{GH_{3}},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(I_{GH_{3}},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(I_{GH_{3}},\ \Lambda _{u})}\\ \mathbf {0}_{(I_{GH_{4}},\ \Lambda _{1})} &{} \mathbf {0}_{(I_{GH_{4}},\ \Lambda _{2})} &{} \mathbf {0}_{(I_{GH_{4}},\ \Lambda ^{+}_{3})} &{} I_{(I_{GH_{4}},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(I_{GH_{4}},\ \Lambda _{u})}\\ I_{(\Lambda _{1},\ \Lambda _{1})} &{} \mathbf {0}_{(\Lambda _{1},\ \Lambda _{2})} &{} \mathbf {0}_{(\Lambda _{1},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(\Lambda _{1},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(\Lambda _{1},\ \Lambda _{u})}\\ \mathbf {0}_{(\Lambda _{2},\ \Lambda _{1})} &{} I_{(\Lambda _{2},\ \Lambda _{2})} &{} \mathbf {0}_{(\Lambda _{2},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(\Lambda _{2},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(\Lambda _{2},\ \Lambda _{u})}\\ \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{1})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{2})} &{} I_{(\Lambda ^{+}_{3},\ \Lambda ^{+}_{3})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda ^{c}_{3})} &{} \mathbf {0}_{(\Lambda ^{+}_{3},\ \Lambda _{u})}\\ \end{array} \right] , \end{aligned}$$

and assume that we can find some vectors \(\overline{\rho }^{g} \in \mathbb {R}^{S_{3}}\) and \(\overline{\rho }^{g} \ge \mathbf {0}\), \(\overline{\rho }^{h} \in \mathbb {R}^{W_{2}}\) and \(\overline{\rho }^{h} \ge \mathbf {0}\), \(\overline{\rho }^{i} \in \mathbb {R}^{W_{2}}\) and \(\overline{\rho }^{i} \ge \mathbf {0}\), \(\overline{\rho }^{j} \in \mathbb {R}^{U_{3}}\) and \(\overline{\rho }^{j} \ge \mathbf {0}\), \(\overline{\rho }^{m} \in \mathbb {R}^{W_{3}}\) and \(\overline{\rho }^{m} \ge \mathbf {0}\), and \(\overline{\rho }^{n} \in \mathbb {R}^{U_{4}}\) and \(\overline{\rho }^{n} \ge \mathbf {0}\), such that

$$\begin{aligned} \begin{array}{l} \sum \limits _{s=1}^{S_3} \rho _{s}^{g}\left[ \begin{array}{c} \left( BB^{\top }\right) _{(g_{s},\ \cdot \ )}^{\top }\\ \left[ \begin{array}{c}\mathbf {0}_{D_{1}+D_{2}}\\ e^{S_3}_{g_{s}} \end{array}\right] \end{array}\right] \!+\!\sum \limits _{s=1}^{W_2} \rho _{s}^{h}\left[ \begin{array}{c}\left( BB^{\top }\right) _{(h_{s},\ \cdot \ )}^{\top } \\ \left[ \begin{array}{c} e^{W_{2}}_{h_{s}}\\ \mathbf {0}_{T m_{2} -D_{1}} \end{array}\right] \end{array}\right] \!+\!\sum \limits _{s=1}^{W_2} \rho _{s}^{i} \left[ \begin{array}{c} \left[ \begin{array}{c} e^{W_2}_{i_{s}} \\ \mathbf {0}_{T m_{2} -D_{1}} \end{array}\right] \\ \mathbf {0}_{T m_{2}}\end{array}\right] \!+\!\\ \sum \limits _{s=1}^{U_3} \rho _{s}^{j}\left[ \begin{array}{c} \left[ \begin{array}{c} \mathbf {0}_{D_{1}} \\ e^{U_3}_{j_{s}}\\ \mathbf {0}_{D_{3}+D_{4}+D_{5}}\end{array}\right] \\ \mathbf {0}_{T m_{2}}\end{array}\right] \!+\!\sum \limits _{s=1}^{W_3} \rho _{s}^{m}\left[ \begin{array}{c} \mathbf {0}_{T m_{2}}\\ \left[ \begin{array}{c} \mathbf {0}_{(D_{1}+D_{2}+D_{3})} \\ e^{W_3}_{m_{s}}\\ \mathbf {0}_{D_{5}} \\ \end{array}\right] \end{array}\right] \!+\!\sum \limits _{s=1}^{U_4} \rho _{s}^{n}\left[ \begin{array}{c} \mathbf {0}_{T m_{2}}\\ \left[ \begin{array}{c} e^{U_{4}}_{n_{s}}\\ \mathbf {0}_{D_{4}+D_{5}} \\ \end{array}\right] \end{array}\right] \!= \!\mathbf {0}. \end{array} \end{aligned}$$

The above equation is equivalent to the compact system

$$\begin{aligned} \left[ \begin{array}{c} \sum \limits _{s=1}^{S_3}\rho _{s}^{g}\left( \left( BB^{\top }\right) _{(g_{s},\ \cdot )}\right) ^{\top }+\sum \limits _{s=1}^{W_2} \rho _{s}^{h} \left( \left( BB^{\top }\right) _{(h_{s},\ \cdot )}\right) ^{\top }+\left[ \begin{array}{c} \overline{\rho }^{i}\\ \overline{\rho }^{j}\\ \mathbf {0}_{D_3+D_4+D_5 } \end{array}\right] \\ \overline{\rho }^{h}+\overline{\rho }_{\Lambda _1}^{n} \\ \overline{\rho }_{\Lambda _2}^{n}\\ \overline{\rho }_{\Lambda ^{+}_{3}}^{g}+\overline{\rho }_{\Lambda ^{+}_{3}}^{n} \\ \overline{\rho }_{\Lambda ^{c}_{3}}^{g}+\overline{\rho }^{m}\\ \overline{\rho }_{\Lambda _{u}}^{g} \end{array}\right] =\mathbf {0}, \end{aligned}$$

which leads to \(\overline{\rho }^{g} = \mathbf {0},\ \overline{\rho }^{h} = \mathbf {0},\ \overline{\rho }^{i} = \mathbf {0},\ \overline{\rho }^{j} = \mathbf {0},\ \overline{\rho }^{m} = \mathbf {0},\ \overline{\rho }^{n} = \mathbf {0}\), given that \(\overline{\rho }^{g} \ge \mathbf {0},\ \overline{\rho }^{h} \ge \mathbf {0},\ \overline{\rho }^{i} \ge \mathbf {0},\ \overline{\rho }^{j} \ge \mathbf {0},\ \overline{\rho }^{m} \ge \mathbf {0},\ \overline{\rho }^{n} \ge \mathbf {0}\). Therefore, the row vectors in the matrix \(\Gamma _{sub}\) are positive-linearly independent. \(\square \)

4.4 The main result

Based on the above lemmas, we are ready to present the main theorem on the MPEC–MFCQ.

Theorem 3

Let \(v=(C,\zeta ,z,\alpha ,\xi )\) be any feasible point for the MPEC (26). Then v satisfies the MPEC–MFCQ.

Proof

Assume there exist vectors \(\overline{\rho }^{a} \in \mathbb {R}^{S_{1}},\ \overline{\rho }^{b} \in \mathbb {R}^{W_{1}},\ \overline{\rho }^{c} \in \mathbb {R}^{W_{1}},\ \overline{\rho }^{d} \in \mathbb {R}^{U_{1}},\ \overline{\rho }^{e} \in \mathbb {R}^{S_{2}},\ \overline{\rho }^{f} \in \mathbb {R}^{U_{2}},\ \overline{\rho }^{g} \in \mathbb {R}^{S_{3}},\ \overline{\rho }^{h} \in \mathbb {R}^{W_{2}},\ \overline{\rho }^{i} \in \mathbb {R}^{W_{2}},\ \overline{\rho }^{j} \in \mathbb {R}^{U_{3}},\ \overline{\rho }^{k} \in \mathbb {R}^{S_{4}},\ \overline{\rho }^{l} \in \mathbb {R}^{W_{3}},\ \overline{\rho }^{m} \in \mathbb {R}^{W_{3}},\ \overline{\rho }^{n} \in \mathbb {R}^{U_{4}}\), all componentwise nonnegative, such that the following holds

$$\begin{aligned}&\sum \limits _{s=1}^{S_1}\rho _{s}^{a}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1}} \\ \left( \Gamma _{a}^{3}\right) _{(a_{s},\ \cdot \ )}^{\top } \\ \left( AB^{\top }\right) _{(a_{s},\ \cdot \ )}^{\top }\\ \mathbf {0}_{T m_{2}} \end{array}\right] + \sum \limits _{s=1}^{W_1}\rho _{s}^{b}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1}} \\ \left( \Gamma _{b}^{3}\right) _{(b_{s},\ \cdot \ )}^{\top } \\ \left( AB^{\top }\right) _{(b_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2}} \end{array}\right] +\sum \limits _{s=1}^{W_1}\rho _{s}^{c}\left[ \begin{array}{c} 0 \\ \left( \Gamma _{c}^{2}\right) _{(c_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{2}} \\ \mathbf {0}_{T m_{2}} \end{array}\right] \nonumber \\&\quad +\sum \limits _{s=1}^{U_1}\rho _{s}^{d}\left[ \begin{array}{c} 0 \\ \left( \Gamma _{d}^{2}\right) _{(d_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{2} } \\ \mathbf {0}_{T m_{2} } \end{array}\right] +\sum \limits _{s=1}^{S_2}\rho _{s}^{e}\left[ \begin{array}{c} 0 \\ \left( \Gamma _{e}^{2}\right) _{(e_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{2} } \\ \mathbf {0}_{T m_{2} } \end{array}\right] +\sum \limits _{s=1}^{U_2}\rho _{s}^{f}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1} } \\ \left( \Gamma _{f}^{3}\right) _{(f_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2} } \\ \mathbf {0}_{T m_{2}} \end{array}\right] \nonumber \\&\quad +\sum \limits _{s=1}^{S_3}\rho _{s}^{g}\left[ \begin{array}{c} 0\\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{1}} \\ \left( BB^{\top }\right) _{(g_{s},\ \cdot \ )}^{\top } \\ \left( \Gamma _{g}^{5}\right) _{(g_{s},\ \cdot \ )}^{\top } \end{array}\right] + \sum \limits _{s=1}^{W_2}\rho _{s}^{h}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{1} } \\ \left( BB^{\top }\right) _{(h_{s},\ \cdot \ )}^{\top } \\ \left( \Gamma _{h}^{5}\right) _{(h_{s},\ \cdot \ )}^{\top } \end{array}\right] + \sum \limits _{s=1}^{W_2}\rho _{s}^{i}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{1}} \\ \left( \Gamma _{i}^{4}\right) _{(i_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2}} \end{array}\right] \nonumber \\&\quad +\sum \limits _{s=1}^{U_3}\rho _{s}^{j}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{1} } \\ \left( \Gamma _{j}^{4}\right) _{(j_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2} } \end{array}\right] +\sum \limits _{s=1}^{S_4}\rho _{s}^{k}\left[ \begin{array}{c} 1 \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{1}} \\ \left( \Gamma _{k}^{4}\right) _{(k_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2}} \end{array}\right] +\sum \limits _{s=1}^{W_3}\rho _{s}^{l}\left[ \begin{array}{c} 1 \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{1}} \\ \left( \Gamma _{l}^{4}\right) _{(l_{s},\ \cdot \ )}^{\top } \\ \mathbf {0}_{T m_{2}} \end{array}\right] \nonumber \\&\quad +\sum \limits _{s=1}^{W_3}\rho _{s}^{m}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{2} } \\ \left( \Gamma _{m}^{5}\right) _{(m_{s},\ \cdot \ )}^{\top } \end{array}\right] +\sum \limits _{s=1}^{U_4}\rho _{s}^{n}\left[ \begin{array}{c} 0 \\ \mathbf {0}_{T m_{1} } \\ \mathbf {0}_{T m_{1}} \\ \mathbf {0}_{T m_{2} } \\ \left( \Gamma _{n}^{5}\right) _{(n_{s},\ \cdot \ )}^{\top } \end{array}\right] =\mathbf {0}. \end{aligned}$$
(43)

From the first row in Eq. (43), we get \(\sum \limits _{s=1}^{S_4} \rho _{s}^{k}+\sum \limits _{s=1}^{W_3} \rho _{s}^{l}=0\). Together with \(\overline{\rho }^{k} \ge \mathbf {0}\) and \(\overline{\rho }^{l}\ge \mathbf {0}\), this gives \(\overline{\rho }^{k}=\mathbf {0}\) and \(\overline{\rho }^{l}=\mathbf {0}\). Lemma 1 then gives \(\overline{\rho }^{c} = \mathbf {0},\ \overline{\rho }^{d} = \mathbf {0},\ \overline{\rho }^{e} = \mathbf {0}\) in Eq. (43); Lemma 2 gives \(\overline{\rho }^{a} = \mathbf {0},\ \overline{\rho }^{b} = \mathbf {0},\ \overline{\rho }^{f} = \mathbf {0}\); and Lemma 3 gives \(\overline{\rho }^{g} = \mathbf {0},\ \overline{\rho }^{h} = \mathbf {0},\ \overline{\rho }^{i} = \mathbf {0},\ \overline{\rho }^{j} = \mathbf {0},\ \overline{\rho }^{m} = \mathbf {0},\ \overline{\rho }^{n} = \mathbf {0}\).

In summary, the row vectors of the matrix \(\Gamma \) in (35) are positive-linearly independent at every feasible point v of the MPEC (26); that is, every feasible point v of the MPEC (26) satisfies the MPEC–MFCQ. \(\square \)

5 Numerical results

In this section, we present GR–CV (Algorithm 2), a concrete implementation of the GRM in Algorithm 1 for selecting the hyperparameter C in SVC. We report numerical results for the proposed GR–CV and compare it with other approaches.

Algorithm 2 GR–CV

All the numerical tests are conducted in Matlab R2018a on a Windows 7 Dell Laptop with an Intel(R) Core(TM) i5-6500U CPU at 3.20GHz and 8 GB of RAM. All the data sets are collected from the LIBSVM library: https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/. Each data set is split into a subset \(\Omega \) with \(l_{1}\) points, used for cross-validation, and a hold-out test set \(\Theta \) with \(l_{2}\) points. The data descriptions are shown in Table 1.
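As an illustration of this data preparation step, the following sketch splits a data set into the cross-validation subset \(\Omega \), the hold-out test set \(\Theta \), and T folds of \(\Omega \). The random seed and the equal-size fold construction are assumptions of this sketch; the paper does not describe how the splits were generated.

```python
import numpy as np

def split_data(X, y, l1, T, seed=0):
    """Illustrative split into Omega (l1 points), Theta (the rest),
    and T (train, validation) index pairs over Omega."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    omega, theta = idx[:l1], idx[l1:]
    folds = np.array_split(omega, T)                          # T validation folds of Omega
    cv_pairs = [(np.setdiff1d(omega, f), f) for f in folds]   # (train, validation) indices
    return cv_pairs, (X[theta], y[theta])
```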

Table 1 Descriptions of data sets

We compare our GR–CV with two other approaches: the inexact cross-validation method (In-CV) and the grid search method (G–S). In-CV (Kunapuli et al. 2008b) is a relaxation method that relaxes the complementarity constraints by a prescribed tolerance parameter \(\mathbf {tol} > 0\); that is, it solves (NLP-\(t_{k}\)) with \(t_{k}=\mathbf {tol}\) as a fixed tolerance rather than decreasing \(t_{k}\) gradually.
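To make the contrast concrete, here is a minimal Python sketch of the two strategies: a Scholtes-type loop that shrinks the relaxation parameter (the idea behind GR–CV) versus a single solve at a fixed tolerance (In-CV). The callback solve_relaxed_nlp, the geometric update \(t \leftarrow \sigma t\), and the stopping test are assumptions made for illustration; in the paper, (NLP-\(t_{k}\)) is solved with SNOPT's snsolve, and Algorithm 2 may use additional acceptance criteria.

```python
def gr_cv_loop(v0, solve_relaxed_nlp, t0=1.0, t_min=1e-8, sigma=0.01):
    """Sketch of a global relaxation (Scholtes-type) loop as behind GR-CV.

    `solve_relaxed_nlp(v, t)` stands for solving (NLP-t_k) from a warm start v;
    it is an assumed callback. Parameter values mirror those reported below
    (t0 = 1, t_min = 1e-8, sigma = 0.01); the exact update rule of Algorithm 2
    may differ.
    """
    v, t = v0, t0
    while t >= t_min:
        v = solve_relaxed_nlp(v, t)   # solve (NLP-t_k) at the current relaxation level
        t *= sigma                    # shrink the complementarity relaxation
    return v

def in_cv(v0, solve_relaxed_nlp, tol=1e-4):
    """Sketch of In-CV: a single solve of (NLP-t_k) with t_k fixed at tol."""
    return solve_relaxed_nlp(v0, tol)
```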

The parameters of the three methods are set as follows. For GR–CV, we set the initial values as \(v_{0}=\left[ 1,\ \mathbf {0}_{1 \times \overline{m}}\right] ^{\top }\), \(t_{0}=1,\ t_{\min }=10^{-8},\ \sigma =0.01.\) The relaxed subproblems (NLP-\(t_{k}\)) are solved by the snsolve function, which is part of the SNOPT solver (Gill et al. 2002). For In-CV, we use the same \(v_{0}\) as in GR–CV and \(\mathbf {tol}=10^{-4}\). For G–S, we use \(C \in \{10^{-4},\ 10^{-3},\ 10^{-2}\), \(10^{-1},\ 1,\ 10^{1},\ 10^{2},\ 10^{3},\ 10^{4}\}\), which is a commonly used grid range (Bennett et al. 2006; Kunapuli et al. 2008b; Kunapuli 2008; Moore et al. 2009). In each training process, the ALM–SNCG algorithm from Yan and Li (2020), which is competitive with the most popular solvers in LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) in both speed and accuracy, is used to solve the \(l_{1}\)-loss SVC problem.
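For comparison, the G–S baseline can be sketched as follows. The callbacks train_svc and cv_error and the fold representation are placeholders introduced for this sketch (in the paper, each training problem is solved with ALM–SNCG); only the grid of C values is taken from the settings above.

```python
import numpy as np

def grid_search_C(folds, train_svc, cv_error, grid=None):
    """Sketch of G-S: pick C from a fixed grid by minimizing the T-fold CV error.

    `folds` is a list of (train_set, validation_set) pairs; `train_svc` and
    `cv_error` are assumed callbacks.
    """
    if grid is None:
        grid = [10.0 ** k for k in range(-4, 5)]   # the grid used in this section
    best_C, best_err = None, np.inf
    for C in grid:
        err = np.mean([cv_error(train_svc(tr, C), va) for tr, va in folds])
        if err < best_err:
            best_C, best_err = C, err
    return best_C, best_err
```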

We compare the aforementioned methods in terms of the following three aspects:

  1. Test error (\(E_{t}\)), defined by

     $$\begin{aligned} E_{t}=\frac{1}{l_{2}} \sum _{(x, y) \in \Theta } \frac{1}{2}\left| {\text {sign}}\left( \widehat{w}^{\top } x\right) -y\right| , \end{aligned}$$

     which measures the generalization ability (a minimal computation sketch is given after this list).

  2. CV error (\(E_{C}\)), as defined in the objective function of problem (14).

  3. The number of iterations k of an algorithm and the total number of iterations it for solving the subproblems, reported as the pair \((k,\ it)\).
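The sketch referred to in item 1 above shows how the test error \(E_{t}\) can be computed from a learned weight vector; the function name and the NumPy-based interface are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def test_error(w_hat, X_test, y_test):
    """Hold-out test error E_t for labels y in {-1, +1}.

    Each term (1/2)|sign(w^T x) - y| equals 1 for a misclassified point and 0
    otherwise.  The handling of sign(0) follows NumPy's convention here, an
    assumption since the text does not specify it.
    """
    pred = np.sign(X_test @ w_hat)               # predicted labels
    return 0.5 * np.mean(np.abs(pred - y_test))  # fraction of misclassified points
```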

We also report the maximum violation of all constraints defined as in (27), to measure the feasibility of the final solution given by GR–CV and In-CV.

The results are reported in Table 2, where we mark the winners of test error \(E_{t}\), CV error \(E_{C}\) and the maximum violation of all constraints Vio in bold. We also show the comparisons of the three methods for different data sets on test error \(E_{t}\) and CV error \(E_{C}\) in Figs. 9 and 10, respectively. The data sets on the horizontal axis are arranged in the order shown in Table 2.

From Figs. 9, 10 and Table 2, we have the following observations. Firstly, GR–CV performs best in terms of test error in Fig. 9, implying that our approach generalizes better. Secondly, in terms of CV error in Fig. 10, GR–CV is competitive with G–S: GR–CV is the winner on five of the twelve data sets, whereas G–S wins on eight. Finally, comparing GR–CV with In-CV, the feasibility of the solution returned by GR–CV is significantly better than that of In-CV, since the Vio value of GR–CV is much smaller. In terms of CPU time, In-CV naturally takes less time than GR–CV since it solves the relaxation problem (NLP-\(t_{k}\)) only once. Since G–S solves a completely different type of problem to find the hyperparameter C, a CPU time comparison between GR–CV and G–S is not meaningful.

Table 2 Computational results for \(T=3\)
Fig. 9 The comparison among the three methods on test error

Fig. 10 The comparison among the three methods on CV error

Fig. 11 Effect of increasing the number of folds on test error and CV error

To further study the effect of increasing the number of folds on the test error \(E_{t}\) and the CV error \(E_{C}\) for the three methods, we report results on the Australian data set in Fig. 11. The results show that as T changes, the test error for GR–CV is always the lowest, and the CV error for GR–CV is competitive with the other two methods. Moreover, although GR–CV can successfully handle a larger number of folds, the computing time grows with the number of folds because the resulting MPEC involves more variables and constraints. The ranges of the test error and CV error over different numbers of folds are not large, so \(T=3\) represents a reasonable choice.

6 Conclusion

In this paper, we have proposed a bilevel optimization model for hyperparameter selection in support vector classification, in which the upper-level problem minimizes the T-fold cross-validation error and the lower-level problems are T \(l_{1}\)-loss SVC problems on the training sets. We reformulated the bilevel optimization problem as an MPEC and proposed GR–CV to solve it, based on the GRM from Scholtes (2001). We also proved that the MPEC–MFCQ automatically holds at each feasible point. Extensive numerical results on data sets from the LIBSVM library demonstrated the superior generalization performance of the proposed method on almost all the data sets used in this paper. The proposed approach has the potential to deal with other hyperparameter selection problems in SVM, which may involve multiple hyperparameters or other types of loss functions. However, whether the resulting MPEC enjoys the MPEC–MFCQ property needs to be further investigated. How to choose the most suitable numerical algorithms for solving the perturbed problem resulting from the Scholtes relaxation is also worth further study. These topics will be investigated in the near future.