Abstract
Support vector machine regression is a robust data-fitting method that minimizes the sum of the deducted regression residuals, i.e., the residuals reduced by the insensitive tube size, and is therefore less sensitive to changes in the data near the regression hyperplane. Two design parameters, the insensitive tube size (\(\varepsilon _\mathrm{e}\)) and the weight (\(C_\mathrm{e}\)) assigned to the regression error in its trade-off against the normed support vector, are selected by the user to obtain better forecasts. The global training-and-validation parameter selection procedure for support vector machine regression can be formulated as a bi-level optimization model, which is equivalently reformulated as a linear program with linear complementarity constraints (LPCC). We propose a rectangle search global optimization algorithm to solve this LPCC. The algorithm exhausts the invariancy regions on the parameter plane (the \((C_\mathrm{e},\varepsilon _\mathrm{e})\)-plane) without explicitly identifying the edges of the regions. The algorithm is tested on synthetic and real-world support vector machine regression problems with up to hundreds of data points, and its efficiency is compared with that of several other approaches. The global optimal parameter obtained provides an important benchmark for any other choice of parameters.
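For readers who want to experiment numerically, the following minimal sketch (in Python, using scikit-learn's `SVR`, which is not part of this paper) illustrates how the two design parameters enter \(\varepsilon \)-insensitive regression. It is an illustration only: it does not implement the bilevel LPCC approach developed in the paper, and the parameter values shown are arbitrary.

```python
# Illustrative only: the role of (C_e, eps_e) in epsilon-insensitive SVR,
# shown with scikit-learn's SVR rather than the paper's bilevel LPCC model.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
y = 2.0 * X.ravel() + 0.1 * rng.standard_normal(60)

# Two hypothetical parameter choices; the paper's algorithm instead searches
# the (C_e, eps_e)-plane globally via its invariancy regions.
for C_e, eps_e in [(1.0, 0.01), (100.0, 0.2)]:
    model = SVR(kernel="linear", C=C_e, epsilon=eps_e).fit(X, y)
    resid = np.abs(model.predict(X) - y)
    # Points with residual <= eps_e incur no loss (they lie inside the tube).
    print(C_e, eps_e, "points outside tube:", int(np.sum(resid > eps_e)))
```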
Notes
The first-level subscript indicates the role of the mathematical expression it is attached to: subscript d, data; subscript s, support vector machine regression; subscript e, exogenous parameter to the support vector machine regression.
By exhausting all the invariancy regions, we mean that for every invariancy region, at least one \((C_\mathrm{e},\varepsilon _\mathrm{e})\) point within the region is chosen and the associated restricted linear program is solved.
Lower-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f-\varepsilon _\mathrm{e}\).
Upper-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f+\varepsilon _\mathrm{e}\).
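As a small illustration of the two hyperplanes just defined, the following sketch flags which data points fall inside the \(\varepsilon \)-tube of a given linear regression function; the function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch (hypothetical names): given a fitted linear SVR (w_s, b_s)
# and tube size eps_e, locate each data point relative to the epsilon-tube
# bounded by the lower and upper hyperplanes defined above.
import numpy as np

def tube_position(X_d, y_d, w_s, b_s, eps_e):
    f = X_d @ w_s + b_s                   # values of the regression hyperplane
    lower, upper = f - eps_e, f + eps_e   # lower and upper hyperplanes
    inside = (y_d >= lower) & (y_d <= upper)
    return inside, lower, upper
```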
A rectangle is defined by four bounds: upper and lower bounds of \(C_\mathrm{e}\) and \(\varepsilon _\mathrm{e}\).
If there is no previous interval, there is no need to run the Add-In.
Partitioning at the endpoints is also a theoretically valid strategy. We choose to partition at the midpoints rather than at the endpoints to avoid losing information to arithmetic imprecision at the dividing line. The drawback is that most rectangles are not eliminated in the first stage and must be passed to the second stage.
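A minimal sketch of the rectangle bookkeeping and the midpoint partition described in the footnotes above; the `Rect` record and the helper name are hypothetical, chosen only for illustration.

```python
# Sketch: a rectangle is stored as bounds on C_e and eps_e and is split at
# the midpoints into four child rectangles.
from dataclasses import dataclass

@dataclass
class Rect:
    c_lo: float
    c_hi: float
    e_lo: float
    e_hi: float

def split_at_midpoints(r: Rect):
    c_mid = 0.5 * (r.c_lo + r.c_hi)
    e_mid = 0.5 * (r.e_lo + r.e_hi)
    return [Rect(r.c_lo, c_mid, r.e_lo, e_mid),
            Rect(c_mid, r.c_hi, r.e_lo, e_mid),
            Rect(r.c_lo, c_mid, e_mid, r.e_hi),
            Rect(c_mid, r.c_hi, e_mid, r.e_hi)]
```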
Recall that we solve a \({\mathbb {RQCP}}\) when \({\mathbb {RLP}}\) or \({\mathbb {LCP}}\) fail to be solved.
Because the \(b_\mathrm{s}\) of SVM regression is not unique, we select a \(b_\mathrm{s}\) minimizing the absolute residual over the 45 validation data points. See Sect. 2.3 for the selection of \(b_\mathrm{s}\).
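The following is one possible reading of this selection, not necessarily the procedure of Sect. 2.3: among a list of candidate intercepts, pick the one minimizing the sum of absolute residuals on the validation set. All names are illustrative.

```python
# Sketch: choose b_s from candidate intercepts by minimizing the sum of
# absolute validation residuals of the fitted linear model (w_s, b).
import numpy as np

def select_b(b_candidates, X_val, y_val, w_s):
    errors = [np.sum(np.abs(X_val @ w_s + b - y_val)) for b in b_candidates]
    return b_candidates[int(np.argmin(errors))]
```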
A finer grid might give us a grid-searched parameter that coincides with the global optimal parameter, since finding the global optimum is relatively easy compared to verifying it. Note that the grids commonly used in SVM parameter selection are logarithmic grids (Fung and Mangasarian 2004) of base 2 or 10, i.e., \(2^i\) or \(10^j\), where i and j range from some negative integer to some positive integer.
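For concreteness, a sketch of such logarithmic grids; the exponent ranges below are illustrative choices, not the ones used in the experiments.

```python
# Base-2 grid for C_e and base-10 grid for eps_e over illustrative ranges.
c_grid = [2.0 ** i for i in range(-5, 6)]
eps_grid = [10.0 ** j for j in range(-4, 1)]
grid = [(c, e) for c in c_grid for e in eps_grid]   # candidate (C_e, eps_e) pairs
```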
We use \(tol = 10^{-14}\).
References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J Sci Stat Comput 11(2):281–292
Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for constrained systems. Automatica 38(1):3–20
Billups SC (1995) Algorithm for complementarity problems and generalized equations. PhD thesis, University of Wisconsin Madison
Burges CJC, Crisp DJ (1999) Uniqueness of the SVM solution. In NIPS’99, pp 223–229
Byrd RH, Nocedal J, Waltz RA (2006) Knitro: an integrated package for nonlinear optimization. In: Di Pillo G, Roma M (eds) Large-scale nonlinear optimization. Nonconvex optimization and its applications, vol 83. Springer, US, pp 35–59
Carrizosa E, Morales DR (2013) Supervised classification and mathematical optimization. Comput Oper Res 40(1):150–165
Carrizosa E, Martín-Barragán B, Morales DR (2014) A nested heuristic for parameter tuning in support vector machines. Comput Oper Res 43(0):328–334
Cawley GC, Talbot NLC (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079–2107
Columbano S, Fukuda K, Jones CN (2009) An output-sensitive algorithm for multi-parametric LCPs with sufficient matrices. In CRM proceedings and lecture notes, vol 48
De Luca T, Facchinei F, Kanzow C (1996) A semismooth equation approach to the solution of nonlinear complementarity problems. Math Program 75:407–439
Demiriz A, Bennett KP, Breneman CM, Embrechts MJ (2001) Support vector machine regression in chemometrics. In: Proceedings of the 33rd symposium on the interface of computing science and statistics
Facchinei F, Soares J (1997) A new merit function for nonlinear complementarity problems and a related algorithm. SIAM J Optim 7(1):225–247
Facchinei F, Pang J-S (2003) Finite-dimensional variational inequalities and complementarity problems II. Springer, New York
Ferris MC, Munson TS (2002) Interior-point methods for massive support vector machines. SIAM J Optim 13(3):783–804
Ferris MC, Munson TS (2004) Semismooth support vector machines. Math Program 101:185–204
Floudas CA, Gounaris C (2009) A review of recent advances in global optimization. J Glob Optim 45:3–38
Fung GM, Mangasarian OL (2004) A feature selection newton method for support vector machine classification. Comput Optim Appl 28(2):185–202
Ghaffari-Hadigheh A, Romanko O, Terlaky T (2010) Bi-parametric convex quadratic optimization. Optim Methods Softw 25:229–245
Gumus ZH, Floudas CA (2001) Global optimization of nonlinear bilevel programming problems. J Glob Optim 20:1–31
IBM ILOG CPLEX Optimizer (2010) http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
Hu J, Mitchell JE, Pang J-S, Bennett KP, Kunapuli G (2008) On the global solution of linear programs with linear complementarity constraints. SIAM J Optim 19:445–471
Kecman V (2005) Support vector machines—an introduction. In: Wang L (ed) Support vector machines: theory and applications. Studies in fuzziness and soft computing, vol 177. Springer, Berlin, pp 1–48
Keerthi SS, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput 15(7):1667–1689
Kunapuli G (2008) A bilevel optimization approach to machine learning. PhD thesis, Rensselaer Polytechnic Institute
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming. Optim Methods Softw 23(4):475–489
Kunapuli G, Bennett KP, Hu J, Pang J-S (2006) Model selection via bilevel programming. In: Proceedings of the IEEE international joint conference on neural networks
Lee Y-C, Pang J-S, Mitchell JE (2015) An algorithm for global solution to bi-parametric linear complementarity constrained linear programs. J Glob Optim 62(2):263–297
Mangasarian OL, Musicant DR (1998) Successive overrelaxation for support vector machines. IEEE Trans Neural Netw 10:1032–1037
Ng AY (1997) Preventing “overfitting” of cross-validation data. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann, Menlo Park, pp 245–253
Schittkowski K (2005) Optimal parameter selection in support vector machines. J Ind Manag Optim 1(4):465–476
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Tøndel P, Johansen TA, Bemporad A (2003) An algorithm for multi-parametric quadratic programming and explicit MPC solutions. Automatica 39(3):489–497
Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression estimation and signal processing. In: Mozer M, Jordan MI, Petsche T (eds) Advances in neural information processing systems, vol 9. Proceedings of the 1996 neural information processing systems conference (NIPS 1996). MIT Press, Cambridge, pp 281–287
Yu B (2011) A branch and cut approach to linear programs with linear complementarity constraints. PhD thesis, Rensselaer Polytechnic Institute
Acknowledgments
We thank the reviewers for their thorough comments and suggestions, which allowed us to significantly improve the completeness and presentation of this work. Lee and Pang were supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0151. Mitchell was supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0260. Pang was supported by the National Science Foundation under Grant Number CMMI-1333902. Mitchell was supported by the National Science Foundation under Grant Number CMMI-1334327. Lee's new affiliation after August 1, 2015 will be the Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan.
Appendices
Appendix A: Semismooth method for SVR
Algorithm 12
Damped Newton method (Algorithm 2 in [17])
- Step 0: Initialization: Let \(\mathbf {a}^0 \in {\mathbb {R}}^n\), \(\rho \ge 0\), \(p \ge 2\), and \(\sigma \in (0, \frac{1}{2})\) be given. Set \(k = 0\). Set \(tol\) (see Footnote 12).
- Step 1: Termination: If \(g(\mathbf {a}^k) := \frac{1}{2} \parallel \Phi (\mathbf {a}^k) \parallel _2^2 \le tol \), stop.
- Step 2: Direction Generation: Otherwise, let \(\mathbf {H}^k \in \partial _B \Phi (\mathbf {a}^k)\), and calculate \(\mathbf {d}^k \in {\mathbb {R}}^n\) solving the Newton system:
$$\begin{aligned} \mathbf {H}^k\mathbf {d}^k = - \Phi (\mathbf {a}^k). \end{aligned}$$(31)
If either (31) is unsolvable or the descent condition
$$\begin{aligned} \nabla g(\mathbf {a}^k)^T \mathbf {d}^k \le -\rho \parallel \mathbf {d}^k \parallel ^p \end{aligned}$$(32)
is not satisfied, then set
$$\begin{aligned} \mathbf {d}^k = - \nabla g(\mathbf {a}^k). \end{aligned}$$(33)
- Step 3: Line Search: Choose \(t^k = 2^{-i_k}\), where \(i_k\) is the smallest integer such that
$$\begin{aligned} g(\mathbf {a}^k + 2^{-i_k}\,\mathbf {d}^k) \le g(\mathbf {a}^k) + \sigma \, 2^{-i_k}\, \nabla g(\mathbf {a}^k)^T \mathbf {d}^k. \end{aligned}$$
- Step 4: Update: Let \(\mathbf {a}^{k+1} := \mathbf {a}^k + t^k\mathbf {d}^k \) and \(k := k + 1\). Go to Step 1. \(\square \)
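A minimal Python sketch of Algorithm 12, assuming the caller supplies \(\Phi \) and a routine returning an element of \(\partial _B \Phi (\mathbf {a})\) (for example via Theorem 13 below). The parameter values, the steepest-descent safeguard, and the iteration caps are illustrative; this is not the implementation used in the paper.

```python
import numpy as np

def damped_newton(phi, b_subdiff, a0, rho=1e-10, p=2.1, sigma=1e-4,
                  tol=1e-14, max_iter=200):
    # phi(a): residual Phi(a); b_subdiff(a): an element H of the
    # B-subdifferential of Phi at a; a0: starting point a^0.
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        r = phi(a)
        g = 0.5 * np.dot(r, r)                  # merit function g(a)
        if g <= tol:                            # Step 1: termination
            return a
        H = b_subdiff(a)
        grad_g = H.T @ r                        # gradient of the merit function
        try:
            d = np.linalg.solve(H, -r)          # Step 2: Newton system (31)
            if np.dot(grad_g, d) > -rho * np.linalg.norm(d) ** p:
                d = -grad_g                     # safeguard of (32)-(33)
        except np.linalg.LinAlgError:
            d = -grad_g
        t = 1.0                                 # Step 3: Armijo line search
        for _ in range(60):
            r_trial = phi(a + t * d)
            if 0.5 * np.dot(r_trial, r_trial) <= g + sigma * t * np.dot(grad_g, d):
                break
            t *= 0.5
        a = a + t * d                           # Step 4: update
    return a
```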
The B-subdifferential used in Step 2 of Algorithm 12 is obtained using the following theorem.
Theorem 13
(Theorem 5 in Ferris and Munson (2004)) Let \(\mathbf {U}: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) be continuously differentiable. Then
where \(D_a \in {\mathbb {R}}^{n \times n}\) and \(D_b \in {\mathbb {R}}^{n\times n}\) are diagonal matrices with entries defined as follows:
1. For all \(i \in \mathbb {I}\): If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), then
$$\begin{aligned} (D_a)_{ii}= & {} 1 - \frac{a_i}{\parallel (a_i, U_i(\mathbf {a})) \parallel }, \nonumber \\ (D_b)_{ii}= & {} 1 - \frac{U_i(\mathbf {a})}{\parallel (a_i, U_i(\mathbf {a})) \parallel }; \end{aligned}$$(34)
otherwise
$$\begin{aligned} ((D_a)_{ii}, (D_b)_{ii} ) \in \{ (1- \kappa , 1-\gamma ) \in {\mathbb {R}}^2 \mid \parallel (\kappa , \gamma ) \parallel \le 1\}. \end{aligned}$$(35)
2. For all \(i\in \mathbb {E}\):
$$\begin{aligned} (D_a)_{ii}= & {} 0,\\(D_b)_{ii}= & {} 1. \end{aligned}$$\(\square \)
If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), then \(\Phi (\mathbf {a})\) is differentiable at \(\mathbf {a}\) and formula (34) computes the exact Jacobian. On the other hand, if \(\parallel (a_i, U_i(\mathbf {a})) \parallel = 0\) occurs at the \(i\)th complementarity, the \(\kappa \) and \(\gamma \) appearing in (35) are computed as suggested in Facchinei and Pang (2003):
where \(\mathbf {v} \in {\mathbb {R}}^n\) is a vector of the user's choice whose \(i\)th element is nonzero.
To compute \(\kappa \) and \(\gamma \) in (36), we can choose \(\mathbf {v} = \mathbf {1}\), and let
which does not need to be updated. Then the computation of \(\kappa \) and \(\gamma \) simplifies to
where the index i is the same as defined in (36).
In our experiments, we skip the safeguard steps (32) and (33) to save running time because the system (31) is always solvable, and convergence is obtained in most cases from the initial point \(\mathbf {a}^0 = \mathbf {0}\). For the rare cases in which the termination condition \(g(\mathbf {a}^k) \le tol\) in Step 1 cannot be fulfilled, we restart from a different initial point.
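For completeness, the sketch below assembles one element of the B-subdifferential from the diagonal entries in (34)–(35). Two simplifications are assumed for this illustration only: the assembly \(\mathbf {H} = D_a + D_b \nabla \mathbf {U}(\mathbf {a})\), and the choice \(\kappa = \gamma = 0\) at a kink instead of the computation via \(\mathbf {v}\) described above.

```python
# Sketch: build H = D_a + D_b * U'(a) with D_a, D_b as in (34)-(35).
import numpy as np

def b_subdiff_element(a, U, Uprime, eq_idx=()):
    # a: current point; U(a): function values; Uprime(a): Jacobian of U;
    # eq_idx: indices treated as plain equations (the index set E).
    u, J = U(a), Uprime(a)
    n = len(a)
    Da, Db = np.zeros(n), np.zeros(n)
    for i in range(n):
        if i in eq_idx:                    # case 2: i in E
            Da[i], Db[i] = 0.0, 1.0
        else:                              # case 1: i in I
            nrm = np.hypot(a[i], u[i])
            if nrm != 0.0:                 # differentiable case, formula (34)
                Da[i] = 1.0 - a[i] / nrm
                Db[i] = 1.0 - u[i] / nrm
            else:                          # kink, formula (35) with kappa = gamma = 0
                Da[i], Db[i] = 1.0, 1.0
    return np.diag(Da) + np.diag(Db) @ J
```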
Appendix B: Steps 2–4 of Algorithm 8
Step 2. Search on the vertical line \(C_e = \underline{C}\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{left} = {\mathbb {G}}rouping{\mathbb {VS}}et^{lu}\).
- 2a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\), solve (20) subject to \(C_e = \underline{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countLeft\) by 1.
- 2b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\underline{C},\varepsilon _{min})\) and obtain the set of grouping vectors.
- 2c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) by the set of grouping vectors obtained in 2b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 2d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 2e: If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\underline{C}\,,\, \varepsilon _{min}-deviation)\).
- 2f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 2c–2d.
- 2g: Repeat 2a–2d until \(\varepsilon _{min} = \underline{\varepsilon }\).
Step 3. Search on the horizontal line \(\varepsilon _e = \underline{\varepsilon }\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom} = {\mathbb {G}}rouping{\mathbb {VS}}et^{ul}\).
- 3a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\), solve (19) subject to \(\varepsilon _e = \underline{\varepsilon }\) to obtain invariancy intervals. Let \(\big [C_{min}, \,C_{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countBottom\) by 1.
- 3b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((C_{min},\underline{\varepsilon })\) and obtain the set of grouping vectors.
- 3c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) by the set of grouping vectors obtained in 3b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 3d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 3e: If \(C_{min}\) is greater than \(\underline{C}\), let \(newStarting = (C_{min}-deviation\,,\, \underline{\varepsilon })\).
- 3f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 3c–3d.
- 3g: Repeat 3a–3d until \(C_{min} = \underline{C}\).
Step 4. Search on the vertical line \(C_e = \bar{C}\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{right} = {\mathbb {G}}rouping{\mathbb {VS}}et^{uu}\).
- 4a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\), solve (20) subject to \(C_e = \bar{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countRight\) by 1.
- 4b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\bar{C},\varepsilon _{min})\) and obtain the set of grouping vectors.
- 4c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) by the set of grouping vectors obtained in 4b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 4d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 4e: If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\bar{C}\,,\, \varepsilon _{min}-deviation)\).
- 4f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 4c–4d.
- 4g: Repeat 4a–4d until \(\varepsilon _{min} = \underline{\varepsilon }\).
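Steps 2–4 share the same sweep pattern along one boundary line of the rectangle. The schematic sketch below abstracts that pattern with placeholder callbacks; it omits the Add-In, the counters, and the grouping-vector sets, and is meant only to convey the control flow, not to reproduce the algorithm exactly.

```python
# Schematic of the boundary sweep (written for a vertical line as in Step 2):
# repeatedly take the largest invariancy interval at the current point, record
# the grouping vectors at its lower endpoint, update the incumbent bound, and
# restart just below the interval until the lower bound of the line is reached.
def sweep_boundary(solve_interval, solve_lcp, update_bound,
                   eps_start, eps_lower, deviation):
    found = set()                                    # grouping vectors seen so far
    eps_min = eps_start
    while True:
        eps_min, _eps_max = solve_interval(eps_min)  # cf. 2a
        groups = solve_lcp(eps_min)                  # cf. 2b
        found.update(groups)                         # cf. 2c
        update_bound(groups)                         # cf. 2d
        if eps_min <= eps_lower:                     # cf. 2g
            return found
        eps_min = eps_min - deviation                # cf. 2e-2f
```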
Appendix C: Cases 2–6 of Algorithm 10
Cite this article
Lee, YC., Pang, JS. & Mitchell, J.E. Global resolution of the support vector machine regression parameters selection problem with LPCC. EURO J Comput Optim 3, 197–261 (2015). https://doi.org/10.1007/s13675-015-0041-z
Keywords
- Support vector machine regression
- Parameter selection
- Global optimal parameter
- Mathematical program with complementarity constraints
- Global optimization algorithm