Abstract
Support vector machine regression is a robust data-fitting method that minimizes the sum of the deducted regression residuals, i.e., the residuals reduced by the insensitive tube size, and is therefore less sensitive to changes in the data near the regression hyperplane. Two design parameters, the insensitive tube size (\(\varepsilon _\mathrm{e}\)) and the weight (\(C_\mathrm{e}\)) assigned to the regression error in its trade-off against the normed support vector, are selected by the user to obtain better forecasts. The global training-and-validation parameter selection procedure for support vector machine regression can be formulated as a bi-level optimization model, which is equivalently reformulated as a linear program with linear complementarity constraints (LPCC). We propose a rectangle search global optimization algorithm to solve this LPCC. The algorithm exhausts the invariancy regions on the parameter plane (the \((C_\mathrm{e},\varepsilon _\mathrm{e})\)-plane) without explicitly identifying the edges of the regions. The algorithm is tested on synthetic and real-world support vector machine regression problems with up to hundreds of data points, and its efficiency is compared with that of several other approaches. The global optimal parameter obtained provides an important benchmark for any other choice of parameters.
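For readers who want to experiment numerically, the following minimal sketch (in Python, using scikit-learn's `SVR`, which is not part of this paper) illustrates how the two design parameters enter \(\varepsilon \)-insensitive regression. It is an illustration only: it does not implement the bilevel LPCC approach developed in the paper, and the parameter values shown are arbitrary.

```python
# Illustrative only: the role of (C_e, eps_e) in epsilon-insensitive SVR,
# shown with scikit-learn's SVR rather than the paper's bilevel LPCC model.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
y = 2.0 * X.ravel() + 0.1 * rng.standard_normal(60)

# Two hypothetical parameter choices; the paper's algorithm instead searches
# the (C_e, eps_e)-plane globally via its invariancy regions.
for C_e, eps_e in [(1.0, 0.01), (100.0, 0.2)]:
    model = SVR(kernel="linear", C=C_e, epsilon=eps_e).fit(X, y)
    resid = np.abs(model.predict(X) - y)
    # Points with residual <= eps_e incur no loss (they lie inside the tube).
    print(C_e, eps_e, "points outside tube:", int(np.sum(resid > eps_e)))
```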
Notes
The first-level subscript indicates the role of the mathematical expression it is attached to: subscript d, data; subscript s, support vector machine regression; subscript e, exogenous parameter to the support vector machine regression.
By exhausting all the invariancy regions, we mean that for every invariancy region, at least one \((C_\mathrm{e},\varepsilon _\mathrm{e})\) point within the region is chosen and the associated restricted linear program is solved.
Lower-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f-\varepsilon _\mathrm{e}\).
Upper-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f+\varepsilon _\mathrm{e}\).
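As a small illustration of the two hyperplanes just defined, the following sketch flags which data points fall inside the \(\varepsilon \)-tube of a given linear regression function; the function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch (hypothetical names): given a fitted linear SVR (w_s, b_s)
# and tube size eps_e, locate each data point relative to the epsilon-tube
# bounded by the lower and upper hyperplanes defined above.
import numpy as np

def tube_position(X_d, y_d, w_s, b_s, eps_e):
    f = X_d @ w_s + b_s                   # values of the regression hyperplane
    lower, upper = f - eps_e, f + eps_e   # lower and upper hyperplanes
    inside = (y_d >= lower) & (y_d <= upper)
    return inside, lower, upper
```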
A rectangle is defined by four bounds: upper and lower bounds of \(C_\mathrm{e}\) and \(\varepsilon _\mathrm{e}\).
If there is no previous interval, there is no need to run the Add-In.
Partitioning at the endpoints is also a theoretically valid strategy. We choose to partition at the midpoints rather than at the endpoints to avoid losing information to arithmetic imprecision at the dividing line. The drawback is that most rectangles are not eliminated in the first stage and must be passed to the second stage.
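A minimal sketch of the rectangle bookkeeping and the midpoint partition described in the footnotes above; the `Rect` record and the helper name are hypothetical, chosen only for illustration.

```python
# Sketch: a rectangle is stored as bounds on C_e and eps_e and is split at
# the midpoints into four child rectangles.
from dataclasses import dataclass

@dataclass
class Rect:
    c_lo: float
    c_hi: float
    e_lo: float
    e_hi: float

def split_at_midpoints(r: Rect):
    c_mid = 0.5 * (r.c_lo + r.c_hi)
    e_mid = 0.5 * (r.e_lo + r.e_hi)
    return [Rect(r.c_lo, c_mid, r.e_lo, e_mid),
            Rect(c_mid, r.c_hi, r.e_lo, e_mid),
            Rect(r.c_lo, c_mid, e_mid, r.e_hi),
            Rect(c_mid, r.c_hi, e_mid, r.e_hi)]
```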
Recall that we solve a \({\mathbb {RQCP}}\) when \({\mathbb {RLP}}\) or \({\mathbb {LCP}}\) fail to be solved.
Because the \(b_\mathrm{s}\) of SVM regression is not unique, we select a \(b_\mathrm{s}\) minimizing the absolute residual over the 45 validation data points. See Sect. 2.3 for the selection of \(b_\mathrm{s}\).
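The following is one possible reading of this selection, not necessarily the procedure of Sect. 2.3: among a list of candidate intercepts, pick the one minimizing the sum of absolute residuals on the validation set. All names are illustrative.

```python
# Sketch: choose b_s from candidate intercepts by minimizing the sum of
# absolute validation residuals of the fitted linear model (w_s, b).
import numpy as np

def select_b(b_candidates, X_val, y_val, w_s):
    errors = [np.sum(np.abs(X_val @ w_s + b - y_val)) for b in b_candidates]
    return b_candidates[int(np.argmin(errors))]
```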
A finer grid might give us a grid-searched parameter that coincides with the global optimal parameter, since finding the global optimum is relatively easy compared to verifying it. Note that the grids commonly used in SVM parameter selection are logarithmic grids (Fung and Mangasarian 2004) of base 2 or 10, i.e., \(2^i\) or \(10^j\), where i and j range from some negative integer to some positive integer.
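For concreteness, a sketch of such logarithmic grids; the exponent ranges below are illustrative choices, not the ones used in the experiments.

```python
# Base-2 grid for C_e and base-10 grid for eps_e over illustrative ranges.
c_grid = [2.0 ** i for i in range(-5, 6)]
eps_grid = [10.0 ** j for j in range(-4, 1)]
grid = [(c, e) for c in c_grid for e in eps_grid]   # candidate (C_e, eps_e) pairs
```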
We use \(tol = 10^{-14}\).
References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J Sci Stat Comput 11(2):281–292
Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for constrained systems. Automatica 38(1):3–20
Billups SC (1995) Algorithm for complementarity problems and generalized equations. PhD thesis, University of Wisconsin Madison
Burges CJC, Crisp DJ (1999) Uniqueness of the SVM solution. In NIPS’99, pp 223–229
Byrd RH, Nocedal J, Waltz RA (2006) Knitro: an integrated package for nonlinear optimization. In: Di Pillo G, Roma M (eds) Large-scale nonlinear optimization. Nonconvex optimization and its applications, vol 83. Springer, US, pp 35–59
Carrizosa E, Morales DR (2013) Supervised classification and mathematical optimization. Comput Oper Res 40(1):150–165
Carrizosa E, Martín-Barragán B, Morales DR (2014) A nested heuristic for parameter tuning in support vector machines. Comput Oper Res 43(0):328–334
Cawley GC, Talbot NLC (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079–2107
Columbano S, Fukuda K, Jones CN (2009) An output-sensitive algorithm for multi-parametric LCPs with sufficient matrices. In CRM proceedings and lecture notes, vol 48
De Luca T, Facchinei F, Kanzow C (1996) A semismooth equation approach to the solution of nonlinear complementarity problems. Math Program 75:407–439
Demiriz A, Bennett KP, Breneman CM, Embrechts MJ (2001) Support vector machine regression in chemometrics. In: Proceedings of the 33rd symposium on the interface of computing science and statistics
Facchinei F, Soares J (1997) A new merit function for nonlinear complementarity problems and a related algorithm. SIAM J Optim 7(1):225–247
Facchinei F, Pang J-S (2003) Finite-dimensional variational inequalities and complementarity problems II. Springer, New York
Ferris MC, Munson TS (2002) Interior-point methods for massive support vector machines. SIAM J Optim 13(3):783–804
Ferris MC, Munson TS (2004) Semismooth support vector machines. Math Program 101:185–204
Floudas CA, Gounaris C (2009) A review of recent advances in global optimization. J Glob Optim 45:3–38
Fung GM, Mangasarian OL (2004) A feature selection newton method for support vector machine classification. Comput Optim Appl 28(2):185–202
Ghaffari-Hadigheh A, Romanko O, Terlaky T (2010) Bi-parametric convex quadratic optimization. Optim Methods Softw 25:229–245
Gumus ZH, Floudas CA (2001) Global optimization of nonlinear bilevel programming problems. J Glob Optim 20:1–31
IBM ILOG CPLEX Optimizer (2010) http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
Hu J, Mitchell JE, Pang J-S, Bennett KP, Kunapuli G (2008) On the global solution of linear programs with linear complementarity constraints. SIAM J Optim 19:445–471
Kecman V (2005) Support vector machines—an introduction. In: Wang L (ed) Support vector machines: theory and applications. Studies in fuzziness and soft computing, vol 177. Springer, Berlin, pp 1–48
Keerthi SS, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput 15(7):1667–1689
Kunapuli G (2008) A bilevel optimization approach to machine learning. PhD thesis, Rensselaer Polytechnic Institute
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming. Optim Methods Softw 23(4):475–489
Kunapuli G, Bennett KP, Hu J, Pang J-S (2006) Model selection via bilevel programming. In: Proceedings of the IEEE international joint conference on neural networks
Lee Y-C, Pang J-S, Mitchell JE (2015) An algorithm for global solution to bi-parametric linear complementarity constrained linear programs. J Glob Optim 62(2):263–297
Mangasarian OL, Musicant DR (1998) Successive overrelaxation for support vector machines. IEEE Trans Neural Netw 10:1032–1037
Ng AY (1997) Preventing “overfitting” of cross-validation data. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann, Menlo Park, pp 245–253
Schittkowski K (2005) Optimal parameter selection in support vector machines. J Ind Manag Optim 1(4):465–476
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Tøndel P, Johansen TA, Bemporad A (2003) An algorithm for multi-parametric quadratic programming and explicit MPC solutions. Automatica 39(3):489–497
Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression estimation and signal processing. In: Mozer M, Jordan MI, Petsche T (eds) Advances in neural information processing systems, vol 9. Proceedings of the 1996 neural information processing systems conference (NIPS 1996). MIT Press, Cambridge, pp 281–287
Yu B (2011) A branch and cut approach to linear programs with linear complementarity constraints. PhD thesis, Rensselaer Polytechnic Institute
Acknowledgments
We thank the reviewers for their thorough comments and suggestions, which allowed us to significantly improve the completeness and presentation of this work. Lee and Pang were supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0151. Mitchell was supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0260. Pang was supported by the National Science Foundation under Grant Number CMMI-1333902. Mitchell was supported by the National Science Foundation under Grant Number CMMI-1334327. Lee's new affiliation after August 1, 2015 will be the Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan.
Appendices
Appendix A: Semismooth method for SVR
Algorithm 12
Damped Newton method (Algorithm 2 in [17])
- Step 0: Initialization: Let \(\mathbf {a}^0 \in {\mathbb {R}}^n\), \(\rho \ge 0\), \(p \ge 2\), and \(\sigma \in (0, \frac{1}{2})\) be given. Set \(k = 0\). Set \(tol\) (see Footnote 12).
- Step 1: Termination: If \(g(\mathbf {a}^k) := \frac{1}{2} \parallel \Phi (\mathbf {a}^k) \parallel _2^2 \le tol \), stop.
- Step 2: Direction Generation: Otherwise, let \(\mathbf {H}^k \in \partial _B \Phi (\mathbf {a}^k)\), and calculate \(\mathbf {d}^k \in {\mathbb {R}}^n\) solving the Newton system:
$$\begin{aligned} \mathbf {H}^k\mathbf {d}^k = - \Phi (\mathbf {a}^k). \end{aligned}$$(31)
If either (31) is unsolvable or the descent condition
$$\begin{aligned} \nabla g(\mathbf {a}^k)^T \mathbf {d}^k \le -\rho \parallel \mathbf {d}^k \parallel ^p \end{aligned}$$(32)
is not satisfied, then set
$$\begin{aligned} \mathbf {d}^k = - \nabla g(\mathbf {a}^k). \end{aligned}$$(33)
- Step 3: Line Search: Choose \(t^k = 2^{-i_k}\), where \(i_k\) is the smallest integer such that
$$\begin{aligned} g(\mathbf {a}^k + 2^{-i_k}\,\mathbf {d}^k) \le g(\mathbf {a}^k) + \sigma \, 2^{-i_k}\, \nabla g(\mathbf {a}^k)^T \mathbf {d}^k. \end{aligned}$$
- Step 4: Update: Let \(\mathbf {a}^{k+1} := \mathbf {a}^k + t^k\mathbf {d}^k \) and \(k := k + 1\). Go to Step 1. \(\square \)
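A minimal Python sketch of Algorithm 12, assuming the caller supplies \(\Phi \) and a routine returning an element of \(\partial _B \Phi (\mathbf {a})\) (for example via Theorem 13 below). The parameter values, the steepest-descent safeguard, and the iteration caps are illustrative; this is not the implementation used in the paper.

```python
import numpy as np

def damped_newton(phi, b_subdiff, a0, rho=1e-10, p=2.1, sigma=1e-4,
                  tol=1e-14, max_iter=200):
    # phi(a): residual Phi(a); b_subdiff(a): an element H of the
    # B-subdifferential of Phi at a; a0: starting point a^0.
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        r = phi(a)
        g = 0.5 * np.dot(r, r)                  # merit function g(a)
        if g <= tol:                            # Step 1: termination
            return a
        H = b_subdiff(a)
        grad_g = H.T @ r                        # gradient of the merit function
        try:
            d = np.linalg.solve(H, -r)          # Step 2: Newton system (31)
            if np.dot(grad_g, d) > -rho * np.linalg.norm(d) ** p:
                d = -grad_g                     # safeguard of (32)-(33)
        except np.linalg.LinAlgError:
            d = -grad_g
        t = 1.0                                 # Step 3: Armijo line search
        for _ in range(60):
            r_trial = phi(a + t * d)
            if 0.5 * np.dot(r_trial, r_trial) <= g + sigma * t * np.dot(grad_g, d):
                break
            t *= 0.5
        a = a + t * d                           # Step 4: update
    return a
```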
The B-subdifferential used in Step 2 of Algorithm 12 is obtained using the following theorem.
Theorem 13
(Theorem 5 in Ferris and Munson (2004)) Let \(\mathbf {U}: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) be continuously differentiable. Then
where \(D_a \in {\mathbb {R}}^{n \times n}\) and \(D_b \in {\mathbb {R}}^{n\times n}\) are diagonal matrices with entries defined as follows:
1. For all \(i \in \mathbb {I}\): If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), then
$$\begin{aligned} (D_a)_{ii}= & {} 1 - \frac{a_i}{\parallel (a_i, U_i(\mathbf {a})) \parallel }, \nonumber \\ (D_b)_{ii}= & {} 1 - \frac{U_i(\mathbf {a})}{\parallel (a_i, U_i(\mathbf {a})) \parallel }; \end{aligned}$$(34)
otherwise
$$\begin{aligned} ((D_a)_{ii}, (D_b)_{ii} ) \in \{ (1- \kappa , 1-\gamma ) \in {\mathbb {R}}^2 \mid \parallel (\kappa , \gamma ) \parallel \le 1\}. \end{aligned}$$(35)
2. For all \(i\in \mathbb {E}\):
$$\begin{aligned} (D_a)_{ii}= & {} 0,\\(D_b)_{ii}= & {} 1. \end{aligned}$$\(\square \)
If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), then \(\Phi (\mathbf {a})\) is differentiable at \(\mathbf {a}\) and formula (34) computes the exact Jacobian. On the other hand, if \(\parallel (a_i, U_i(\mathbf {a})) \parallel = 0\) occurs at the \(i\)th complementarity, the \(\kappa \) and \(\gamma \) appearing in (35) are computed as suggested in Facchinei and Pang (2003):
where \(\mathbf {v} \in {\mathbb {R}}^n\) is a vector of the user's choice whose \(i\)th element is nonzero.
To compute \(\kappa \) and \(\gamma \) in (36), we can choose \(\mathbf {v} = \mathbf {1}\), and let
which does not need to be updated. Then the computation of \(\kappa \) and \(\gamma \) simplifies to
where the index i is the same as defined in (36).
In our experiments, we skip the safeguard steps (32) and (33) to save running time because the system (31) is always solvable, and convergence is obtained in most cases from the initial point \(\mathbf {a}^0 = \mathbf {0}\). For the rare cases in which the termination condition \(g(\mathbf {a}^k) \le tol\) in Step 1 cannot be fulfilled, we restart from a different initial point.
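For completeness, the sketch below assembles one element of the B-subdifferential from the diagonal entries in (34)–(35). Two simplifications are assumed for this illustration only: the assembly \(\mathbf {H} = D_a + D_b \nabla \mathbf {U}(\mathbf {a})\), and the choice \(\kappa = \gamma = 0\) at a kink instead of the computation via \(\mathbf {v}\) described above.

```python
# Sketch: build H = D_a + D_b * U'(a) with D_a, D_b as in (34)-(35).
import numpy as np

def b_subdiff_element(a, U, Uprime, eq_idx=()):
    # a: current point; U(a): function values; Uprime(a): Jacobian of U;
    # eq_idx: indices treated as plain equations (the index set E).
    u, J = U(a), Uprime(a)
    n = len(a)
    Da, Db = np.zeros(n), np.zeros(n)
    for i in range(n):
        if i in eq_idx:                    # case 2: i in E
            Da[i], Db[i] = 0.0, 1.0
        else:                              # case 1: i in I
            nrm = np.hypot(a[i], u[i])
            if nrm != 0.0:                 # differentiable case, formula (34)
                Da[i] = 1.0 - a[i] / nrm
                Db[i] = 1.0 - u[i] / nrm
            else:                          # kink, formula (35) with kappa = gamma = 0
                Da[i], Db[i] = 1.0, 1.0
    return np.diag(Da) + np.diag(Db) @ J
```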
Appendix B: Steps 2–4 of Algorithm 8
Step 2. Search on the vertical line \(C_e = \underline{C}\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{left} = {\mathbb {G}}rouping{\mathbb {VS}}et^{lu}\).
- 2a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\), solve (20) subject to \(C_e = \underline{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countLeft\) by 1.
- 2b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\underline{C},\varepsilon _{min})\) and obtain the set of grouping vectors.
- 2c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) by the set of grouping vectors obtained in 2b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 2d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 2e: If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\underline{C}\,,\, \varepsilon _{min}-deviation)\).
- 2f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 2c–2d.
- 2g: Repeat 2a–2d until \(\varepsilon _{min} = \underline{\varepsilon }\).
Step 3. Search on the horizontal line \(\varepsilon _e = \underline{\varepsilon }\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom} = {\mathbb {G}}rouping{\mathbb {VS}}et^{ul}\).
- 3a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\), solve (19) subject to \(\varepsilon _e = \underline{\varepsilon }\) to obtain invariancy intervals. Let \(\big [C_{min}, \,C_{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countBottom\) by 1.
- 3b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((C_{min},\underline{\varepsilon })\) and obtain the set of grouping vectors.
- 3c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) by the set of grouping vectors obtained in 3b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 3d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 3e: If \(C_{min}\) is greater than \(\underline{C}\), let \(newStarting = (C_{min}-deviation\,,\, \underline{\varepsilon })\).
- 3f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 3c–3d.
- 3g: Repeat 3a–3d until \(C_{min} = \underline{C}\).
Step 4. Search on the vertical line \(C_e = \bar{C}\).
Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{right} = {\mathbb {G}}rouping{\mathbb {VS}}et^{uu}\).
- 4a: For every piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\), solve (20) subject to \(C_e = \bar{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Do the Add-In when repeating. Increment \(countRight\) by 1.
- 4b: Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\bar{C},\varepsilon _{min})\) and obtain the set of grouping vectors.
- 4c: Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) by the set of grouping vectors obtained in 4b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.
- 4d: If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.
- 4e: If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\bar{C}\,,\, \varepsilon _{min}-deviation)\).
- 4f: Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the set of grouping vectors. Proceed as in 4c–4d.
- 4g: Repeat 4a–4d until \(\varepsilon _{min} = \underline{\varepsilon }\).
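Steps 2–4 share the same sweep pattern along one boundary line of the rectangle. The schematic sketch below abstracts that pattern with placeholder callbacks; it omits the Add-In, the counters, and the grouping-vector sets, and is meant only to convey the control flow, not to reproduce the algorithm exactly.

```python
# Schematic of the boundary sweep (written for a vertical line as in Step 2):
# repeatedly take the largest invariancy interval at the current point, record
# the grouping vectors at its lower endpoint, update the incumbent bound, and
# restart just below the interval until the lower bound of the line is reached.
def sweep_boundary(solve_interval, solve_lcp, update_bound,
                   eps_start, eps_lower, deviation):
    found = set()                                    # grouping vectors seen so far
    eps_min = eps_start
    while True:
        eps_min, _eps_max = solve_interval(eps_min)  # cf. 2a
        groups = solve_lcp(eps_min)                  # cf. 2b
        found.update(groups)                         # cf. 2c
        update_bound(groups)                         # cf. 2d
        if eps_min <= eps_lower:                     # cf. 2g
            return found
        eps_min = eps_min - deviation                # cf. 2e-2f
```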
Appendix C: Cases 2–6 of Algorithm 10
Cite this article
Lee, YC., Pang, JS. & Mitchell, J.E. Global resolution of the support vector machine regression parameters selection problem with LPCC. EURO J Comput Optim 3, 197–261 (2015). https://doi.org/10.1007/s13675-015-0041-z
Keywords
- Support vector machine regression
- Parameter selection
- Global optimal parameter
- Mathematical program with complementarity constraints
- Global optimization algorithm