1 Introduction

As one of the best known supervised learning algorithms, Support Vector Machines (SVM) are gaining more and more attention. SVM was proposed by Vapnik [1, 2] as a maximum-margin classifier; tutorials on SVM can be found in [3–6]. In recent years, SVM has been applied to many fields and has many algorithmic and modeling variations. In the biomedical field, SVM has been used to identify physical diseases [7–10] as well as psychological diseases [11]. Electroencephalography (EEG) signals can also be analyzed using SVM [12–14]. In addition, SVM has been applied to protein prediction [15–19] and medical images [20–22]. Computer vision includes many applications of SVM, such as person identification [23], hand gesture detection [24], face recognition [25] and background subtraction [26]. In geosciences, SVM has been applied to remote sensing analysis [27–29], land cover change [30–32], landslide susceptibility [33–36] and hydrology [37, 38]. In power systems, SVM has been used for transient status prediction [39], power load forecasting [40], electricity consumption prediction [41] and wind power forecasting [42]. Stock price forecasting [43–45] and business administration [46] can also use SVM. Other applications of SVM include plant disease detection in agriculture [47], condition monitoring [48], network security [49] and electronics [50, 51]. When basic SVM models cannot satisfy the application requirements, different modeling variations of SVM can be found in [52].

In this paper, a survey of SVM with uncertainties is presented. Basic SVM models deal with the situation in which the exact values of the data points are known. When the data points are uncertain, different models have been proposed to formulate SVM with uncertainties. Bi and Zhang [53] assumed that the data points are subject to an additive noise which is bounded in norm and proposed a very direct model. However, this model cannot guarantee generally good performance over the uncertainty set. To guarantee optimal performance while the worst-case constraints are still satisfied, robust optimization is utilized. Trafalis et al. [54–58] proposed a robust optimization model in which the perturbation of the uncertain data is bounded in norm. Ghaoui et al. [59] derived a robust model when the uncertainty is expressed as intervals. Fan et al. [60] studied the more general case of polyhedral uncertainty. Robust optimization is also used when the constraint is a chance constraint, which ensures a small probability of misclassification for the uncertain data. The chance constraints are transformed by different bounding inequalities, for example the multivariate Chebyshev inequality [61, 62] and Bernstein bounding schemes [63].

The organization of this paper is as follows: Sect. 2 gives an introduction to the basic SVM models. Section 3 presents SVM with uncertainties, covering both robust SVM with bounded uncertainty and chance constrained SVM through robust optimization. Section 4 presents concluding remarks and suggestions for further research.

2 Basic SVM Models

Support Vector Machines construct maximum-margin classifiers, such that small perturbations in the data are least likely to cause misclassification. SVM works well empirically and is one of the best known supervised learning algorithms, proposed by Vapnik [1, 2]. Suppose we have a two-class dataset of \(m\) data points \(\{\mathbf {x}_i,y_i\}_{i=1}^m\) with \(n\)-dimensional features \( \mathbf {x}_i \in \mathbb {R}^n \) and respective class labels \( y_i \in \{ +1,-1 \} \). For linearly separable datasets, there exists a hyperplane \( \mathbf {w}^\top \mathbf {x} + b = 0 \) separating the two classes, and the corresponding classification rule is based on the sign of \(\mathbf {w}^\top \mathbf {x} + b \). If this value is positive, \(\mathbf {x}\) is classified into the \(+1\) class; otherwise, into the \(-1\) class.

The data points that the margin pushes up against are called support vectors. A maximum-margin hyperplane is one that maximizes the distance between the hyperplane and the support vectors. For the separating hyperplane \( \mathbf {w}^\top \mathbf {x} + b = 0 \), \(\mathbf {w}\) and \(b\) can be normalized so that \( \mathbf {w}^\top \mathbf {x} + b = +1 \) goes through the support vectors of the \(+1\) class and \( \mathbf {w}^\top \mathbf {x} + b = -1 \) goes through the support vectors of the \(-1\) class. The distance between these two hyperplanes, i.e., the margin width, is \({2 \over \Vert \mathbf {w} \Vert _2}\); therefore, maximizing the margin amounts to minimizing \( {1 \over 2} \Vert \mathbf {w} \Vert _2^2 \) subject to the separation constraints. This can be expressed as the following quadratic optimization problem:

$$\begin{aligned} \min _{\mathbf {w},b}\,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 \end{aligned}$$
(1a)
$$\begin{aligned} \hbox {s.t.} \, \,&y_i (\mathbf {w}^\top \mathbf {x}_i +b) \ge 1, \ \ i=1,\ldots ,m \end{aligned}$$
(1b)
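
As an illustration, the quadratic program (1a)–(1b) can be handed directly to an off-the-shelf convex solver. The following sketch assumes the cvxpy modeling package and a small, hypothetical, linearly separable dataset; it is meant to make the formulation concrete rather than to be an efficient SVM trainer.

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data: m points in R^n with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))                  # (1a)
constraints = [y[i] * (X[i] @ w + b) >= 1 for i in range(m)]      # (1b)
prob = cp.Problem(objective, constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
```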

Introducing Lagrange multipliers \(\varvec{\alpha }=[\alpha _1, \ldots , \alpha _m]\), the above constrained problem can be expressed as:

$$\begin{aligned} \min _{\mathbf {w},b} \max _{\varvec{\alpha } \ge 0} \ \fancyscript{L} (\mathbf {w},b,\varvec{\alpha }) = \dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 - \sum _{i=1}^m \alpha _i \bigl [y_i (\mathbf {w}^\top \mathbf {x}_i +b)-1\bigr ] \end{aligned}$$
(2)

Take the derivatives with respect to \(\mathbf {w}\) and \(b\), and set to zero:

$$\begin{aligned} \frac{\partial \fancyscript{L} (\mathbf {w},b,\varvec{\alpha })}{\partial \mathbf {w}} = 0 \ \,&\Rightarrow \ \ \mathbf {w} = \sum _{i=1}^m \alpha _i y_i \mathbf {x}_i \end{aligned}$$
(3a)
$$\begin{aligned} \frac{\partial \fancyscript{L} (\mathbf {w},b,\varvec{\alpha })}{\partial b} = 0 \ \,&\Rightarrow \ \ \sum _{i=1}^m \alpha _i y_i = 0 \end{aligned}$$
(3b)

Substituting into \( \fancyscript{L} (\mathbf {w},b,\varvec{\alpha }) \):

$$\begin{aligned} \fancyscript{L} (\varvec{\alpha }) = \sum _{i=1}^m \alpha _i - \dfrac{1}{2} \sum _{i=1}^m \sum _{j=1}^m \alpha _i \alpha _j y_i y_j \mathbf {x}_i^\top \mathbf {x}_j \end{aligned}$$
(4)

Then the dual of the original SVM problem is also a convex quadratic problem:

$$\begin{aligned} \max _{\varvec{\alpha }} \ \,&\sum _{i=1}^m \alpha _i - \dfrac{1}{2} \sum _{i=1}^m \sum _{j=1}^m \alpha _i \alpha _j y_i y_j \mathbf {x}_i^\top \mathbf {x}_j \end{aligned}$$
(5a)
$$\begin{aligned} \hbox {s.t.} \ \,&\sum _{i=1}^m \alpha _i y_i = 0, \ \ \alpha _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(5b)

Since only the \(\alpha _i\) corresponding to support vectors can be nonzero, this dramatically simplifies solving the dual problem.
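
For small problems the dual (5a)–(5b) can likewise be solved with a generic solver, after which \(\mathbf {w}\) is recovered from (3a) and \(b\) from any support vector via complementary slackness. The sketch below again assumes cvxpy and the same toy data; the quadratic form is rewritten as a squared norm so the model is convex by construction.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

alpha = cp.Variable(m, nonneg=True)                        # alpha_i >= 0
# 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j = 0.5 * || sum_i alpha_i y_i x_i ||_2^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [y @ alpha == 0]                             # (5b)
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)                                          # (3a)
sv = int(np.argmax(a))                                     # index of a support vector
b = y[sv] - X[sv] @ w                                      # complementary slackness
print("multipliers:", a)
print("w =", w, "b =", b)
```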

The above holds when the two classes are linearly separable. When they are not, misclassified samples need to be allowed, which gives rise to the soft margin SVM. Soft margin SVM introduces non-negative slack variables \(\xi _i\) to measure the distance of within-margin or misclassified data \(\mathbf {x}_i\) to the hyperplane with the correct label, \(\xi _i = \max \{0,1-y_i (\mathbf {w}^\top \mathbf {x}_i +b) \}\). When \(0<\xi _i<1\), the data point is within the margin but correctly classified; when \(\xi _i > 1 \), the data point is misclassified. The objective function then gains a term that penalizes these slack variables, and the optimization is a trade-off between a large margin and a small error penalty. The soft margin SVM formulation with \(L_1\) regularization [64] is:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(6a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \mathbf {x}_i +b) \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(6b)

where \(C\) is a trade-off parameter.
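
In practice the soft margin problem is rarely solved with a generic QP solver; dedicated implementations such as scikit-learn's SVC, which solves a dual of the form derived below, are used instead. A short sketch, with X_train and y_train as placeholder data:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder two-class data with labels in {+1, -1}.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(2.0, 1.0, (20, 2)),
                     rng.normal(-2.0, 1.0, (20, 2))])
y_train = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0)      # C is the trade-off parameter of (6a)
clf.fit(X_train, y_train)

print("w =", clf.coef_, "b =", clf.intercept_)
print("support vector indices:", clf.support_)
```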

Similarly, the Lagrangian of the soft margin SVM is:

$$\begin{aligned} \min _{\mathbf {w},b,\varvec{\xi }} \max _{\varvec{\alpha },\varvec{\beta } \ge 0} \ \fancyscript{L} (\mathbf {w},b,\varvec{\xi },\varvec{\alpha },\varvec{\beta })&= \dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i\nonumber \\&- \sum _{i=1}^m \alpha _i \bigl [y_i (\mathbf {w}^\top \mathbf {x}_i +b)-1+\xi _i \bigr ] - \sum _{i=1}^m \beta _i \xi _i\qquad \end{aligned}$$
(7)

Take the derivative with respect to \(\xi _i\) and set to zero:

$$\begin{aligned} \frac{\partial \fancyscript{L} (\mathbf {w},b,\varvec{\xi },\varvec{\alpha },\varvec{\beta })}{\partial \xi _i} = 0 \ \ \Rightarrow \ \ C-\alpha _i-\beta _i = 0 \end{aligned}$$
(8)

Then \(\alpha _i=C-\beta _i\), and since \(\beta _i \ge 0\), it follows that \(\alpha _i \le C\).

The derivatives with respect to \(\mathbf {w}\) and \(b\) are the same as before; substituting into \(\fancyscript{L} (\mathbf {w},b,\varvec{\xi },\varvec{\alpha },\varvec{\beta })\) gives the dual of the soft margin SVM:

$$\begin{aligned} \max _{\varvec{\alpha }} \ \,&\sum _{i=1}^m \alpha _i - \dfrac{1}{2} \sum _{i=1}^m \sum _{j=1}^m \alpha _i \alpha _j y_i y_j \mathbf {x}_i^\top \mathbf {x}_j \end{aligned}$$
(9a)
$$\begin{aligned} \hbox {s.t.} \ \,&\sum _{i=1}^m \alpha _i y_i = 0, \ \ 0 \le \alpha _i \le C, \ \ i=1,\ldots ,m \end{aligned}$$
(9b)

The only difference is that the dual variables \(\alpha _i\) now have upper bounds \(C\). The advantage of the \(L_1\) regularization (linear penalty function) is that in the dual problem, the slack variables \(\xi _i\) vanish and the constant \(C\) is just an additional constraint on the Lagrange multipliers \(\alpha _i\). Because of this nice property and its huge impact in practice, \(L_1\) is the most widely used regularization term.

Besides the linear kernel \(k(\mathbf {x}_i,\mathbf {x}_j)=\mathbf {x}_i^\top \mathbf {x}_j\), nonlinear kernels have also been introduced into SVM to create nonlinear classifiers. The maximum-margin hyperplane is constructed in a high-dimensional transformed feature space under a possibly nonlinear transformation; therefore, it can be nonlinear in the original feature space. A widely used nonlinear kernel is the Gaussian radial basis function \(k(\mathbf {x}_i,\mathbf {x}_j)=\exp \big (-\gamma \Vert \mathbf {x}_i - \mathbf {x}_j \Vert _2^2\big )\), which corresponds to a Hilbert space of infinite dimension.
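
As a brief illustration, switching to the RBF kernel in the same SVC interface only changes the kernel argument; \(\gamma \) and \(C\) are hyperparameters that would normally be tuned by cross-validation, and the values below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(2.0, 1.0, (20, 2)),
                     rng.normal(-2.0, 1.0, (20, 2))])
y_train = np.hstack([np.ones(20), -np.ones(20)])

# Only the kernel (and its parameter gamma) changes relative to the linear case.
clf_rbf = SVC(kernel="rbf", gamma=0.5, C=1.0)
clf_rbf.fit(X_train, y_train)
print("support vectors per class:", clf_rbf.n_support_)
```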

3 SVM with Uncertainties

Given \(m\) training data points in \(\mathbb {R}^n\), let \( X_i = [X_{i1},\ldots ,X_{in}]^\top \in \mathbb {R}^n, i=1,\ldots ,m \) denote the uncertain training data points and \( y_i \in \{ +1,-1 \}, i=1,\ldots ,m \) their respective class labels. The soft margin SVM with uncertainty is as follows:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(10a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top X_i +b) \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(10b)

When the training data points \(X_i\) are random vectors, the model needs to be modified to account for the uncertainties. The simplest model just employs the means of the uncertain data points, \(\mu _i = \mathbf {E}[X_i]\). The formulation becomes:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(11a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \mu _i +b) \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(11b)

The above model is equivalent to a soft margin SVM on data points fixed at their means and therefore does not take the uncertainties of the data into account. Bi and Zhang [53] assumed the data points are subject to an additive noise, \( X_i = \bar{\mathbf {x}}_i + \Delta \mathbf {x}_i\), with the noise bounded by \(\Vert \Delta \mathbf {x}_i \Vert _2 \le \delta _i\). They proposed the following model:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(12a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top (\bar{\mathbf {x}}_i+\Delta \mathbf {x}_i) +b) \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m\end{aligned}$$
(12b)
$$\begin{aligned}&\Vert \Delta \mathbf {x}_i \Vert _2 \le \delta _i, \ \ i=1,\ldots ,m \end{aligned}$$
(12c)

In this model, the uncertain data point \( X_i \) is free to lie anywhere in the ball centered at \(\bar{\mathbf {x}}_i\) with radius \(\delta _i\), i.e., \( X_i \) can move in any direction within the uncertainty set. A drawback of this model is that it cannot guarantee generally good performance over the uncertainty set, since the direction in which the data points are perturbed is not constrained. It is quite possible that a data point whose perturbation moves it far away from the separating hyperplane is used as a support vector; considering the original uncertainty set of this data point, most of it would then lie within the margin and the constraint would no longer be satisfied. To guarantee better performance under most conditions, or with higher probability, robust optimization is introduced to solve SVM with uncertainty.

3.1 Robust SVM with Bounded Uncertainty

Robust optimization guarantees optimal performance under the worst case scenario. Given different information about the uncertain data, several models have been proposed. Trafalis et al. [54–58] proposed a model in which the perturbation of the uncertain data is bounded in norm. The uncertain data are expressed as \( X_i = \bar{\mathbf {x}}_i + \varvec{\sigma }_i\), the mean vector \(\bar{\mathbf {x}}_i\) plus an additional perturbation \(\varvec{\sigma }_i\) bounded in the \(L_p\) norm by \(\Vert \varvec{\sigma }_i \Vert _p \le \eta _i \), for all \(i=1,\ldots ,m\). Robust optimization deals with the worst case perturbation, so the constraint becomes:

$$\begin{aligned} \min _{\Vert \varvec{\sigma }_i \Vert _p \le \eta _i} y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) + y_i \mathbf {w}^\top \varvec{\sigma }_i \ge 1-\xi _i, \ \ i=1,\ldots ,m \end{aligned}$$
(13)

To solve the robust SVM, the following subproblem needs to be solved first:

$$\begin{aligned} \min _{\varvec{\sigma }_i} \ \,&y_i \mathbf {w}^\top \varvec{\sigma }_i\end{aligned}$$
(14a)
$$\begin{aligned} \hbox {s.t.} \ \,&\Vert \varvec{\sigma }_i \Vert _p \le \eta _i \end{aligned}$$
(14b)

Hölder’s inequality says that for a pair of dual norms \(L_p\) and \(L_q\) with \(p,q \in [1,\infty ]\) and \(1/p+1/q=1\), the following inequality holds:

$$\begin{aligned} \Vert fg \Vert _1 \le \Vert f \Vert _p \Vert g \Vert _q \end{aligned}$$
(15)

Therefore

$$\begin{aligned} | y_i \mathbf {w}^\top \varvec{\sigma }_i | \le \Vert \varvec{\sigma }_i \Vert _p \Vert \mathbf {w} \Vert _q \le \eta _i \Vert \mathbf {w} \Vert _q \end{aligned}$$
(16)

A lower bound of \( y_i \mathbf {w}^\top \varvec{\sigma }_i \) is \( -\eta _i \Vert \mathbf {w} \Vert _q \); substituting into the original problem gives the following formulation:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(17a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \Vert \mathbf {w} \Vert _q \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(17b)

The above formulation depends on the norm \(L_p\). When \(p=q=2\), the following second-order cone representable program is obtained:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(18a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \Vert \mathbf {w} \Vert _2 \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(18b)
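
A sketch of formulation (18a)–(18b) in cvxpy is given below; x_bar, eta and C are hypothetical nominal points, perturbation radii and trade-off parameter. The norm term turns each constraint into a second order cone constraint, which standard conic solvers handle directly.

```python
import cvxpy as cp
import numpy as np

# Hypothetical nominal points x_bar, L2 perturbation radii eta and trade-off C.
x_bar = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
eta = np.full(len(y), 0.3)
C = 1.0
m, n = x_bar.shape

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
constraints = [y[i] * (x_bar[i] @ w + b) - eta[i] * cp.norm(w, 2) >= 1 - xi[i]   # (18b)
               for i in range(m)]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
prob.solve()
print("w =", w.value, "b =", b.value)
```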

An interesting property of the norm transformation is that for \(L_1\) and \(L_\infty \) norms, with the objective function \( \dfrac{1}{2} \Vert \mathbf {w} \Vert _q + C \sum _{i=1}^m \xi _i \), the problem can be transformed into a linear programming (LP) problem.

The dual of the \(L_1\) norm is the \(L_\infty \) norm. When \(p=1\), the formulation becomes:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _\infty + C \sum _{i=1}^m \xi _i \end{aligned}$$
(19a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \Vert \mathbf {w} \Vert _\infty \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(19b)

Introducing an auxiliary variable \(\alpha = \Vert \mathbf {w} \Vert _\infty \), the above formulation can be written as an LP problem:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \alpha + C \sum _{i=1}^m \xi _i \end{aligned}$$
(20a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \alpha \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(20b)
$$\begin{aligned}&\alpha \ge -w_j, \ \alpha \ge w_j, \ \ j=1,\ldots ,n \end{aligned}$$
(20c)

When the \(L_\infty \) norm is chosen to express the perturbation, the formulation becomes:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _1 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(21a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \Vert \mathbf {w} \Vert _1 \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(21b)

Introducing an auxiliary vector \(\varvec{\alpha }\) with \(\alpha _j = |w_j|\), the resulting optimization problem is also an LP:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \sum _{j=1}^n \alpha _j + C \sum _{i=1}^m \xi _i \end{aligned}$$
(22a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \bar{\mathbf {x}}_i +b) - \eta _i \sum _{j=1}^n \alpha _j \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(22b)
$$\begin{aligned}&\alpha _j \ge -w_j, \ \alpha _j \ge w_j, \ \ j=1,\ldots ,n \end{aligned}$$
(22c)
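
The explicit LP reformulations (20) and (22) are what one would pass to an LP solver; with a modeling layer such as cvxpy the absolute values can also be left to the package, which performs an equivalent reformulation internally. A sketch of (21a)–(21b) with hypothetical data:

```python
import cvxpy as cp
import numpy as np

# Hypothetical data; the L_inf-bounded perturbation leads to the L1 norm of w in (21).
x_bar = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
eta = np.full(len(y), 0.3)
C = 1.0
m, n = x_bar.shape

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
constraints = [y[i] * (x_bar[i] @ w + b) - eta[i] * cp.norm(w, 1) >= 1 - xi[i]   # (21b)
               for i in range(m)]
cp.Problem(cp.Minimize(0.5 * cp.norm(w, 1) + C * cp.sum(xi)), constraints).solve()
print("w =", w.value, "b =", b.value)
```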

Ghaoui et al. [59] derived a robust model when the uncertainty is expressed as intervals (also known as support or extremum values). Suppose the extremum values of the uncertain data points are known, \( l_{ij} \le X_{ij} \le u_{ij} \); then each training data point \(X_i\) lies in a hyper-rectangle \(\fancyscript{R}_i = \{ \mathbf {x}_i=[x_{i1},\ldots ,x_{in}]^\top \in \mathbb {R}^n \ | \ l_{ij} \le x_{ij} \le u_{ij}, j=1,\ldots ,n \} \), and robust optimization requires that all points in the hyper-rectangle satisfy \( y_i (\mathbf {w}^\top \mathbf {x}_i +b) \ge 1-\xi _i, \forall \mathbf {x}_i \in \fancyscript{R}_i \). The geometric center of the hyper-rectangle \(\fancyscript{R}_i\) is \(\mathbf {c}_i = [c_{i1}, \ldots , c_{in}]^\top \in \mathbb {R}^n \) where \(c_{ij} = (l_{ij}+u_{ij}) / 2, j=1,\ldots ,n\), and the semi-lengths of the sides of \(\fancyscript{R}_i\) are \(s_{ij} = (u_{ij}-l_{ij}) / 2, j=1,\ldots ,n\). The worst case with this interval information is then:

$$\begin{aligned} y_i (\mathbf {w}^\top \mathbf {c}_i +b) \ge 1-\xi _i + \sum _{j=1}^n s_{ij} |w_j| \end{aligned}$$
(23)

Then the SVM model with support information can be written as:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(24a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \mathbf {c}_i +b) \ge 1-\xi _i+||\mathbf {S}_i \mathbf {w}||_1, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(24b)

where \(\mathbf {S}_i\) is a diagonal matrix with entries \(s_{ij}\).
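
Formulation (24a)–(24b) is again directly expressible in a convex modeling language. The sketch below assumes cvxpy and hypothetical interval bounds l and u; the term \(||\mathbf {S}_i \mathbf {w}||_1\) is written as the \(L_1\) norm of the elementwise product of the semi-lengths with \(\mathbf {w}\).

```python
import cvxpy as cp
import numpy as np

# Hypothetical per-feature interval bounds l_ij <= X_ij <= u_ij.
l = np.array([[1.5, 1.5], [2.5, 2.5], [-1.5, -1.5], [-2.5, -1.5]])
u = l + 1.0
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
m, n = l.shape
c = (l + u) / 2.0          # hyper-rectangle centers c_ij
s = (u - l) / 2.0          # semi-lengths s_ij (diagonal of S_i)

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
constraints = [y[i] * (c[i] @ w + b) >= 1 - xi[i] + cp.norm(cp.multiply(s[i], w), 1)  # (24b)
               for i in range(m)]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints).solve()
print("w =", w.value, "b =", b.value)
```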

The interval uncertainty is a special case of polyhedral uncertainty [60]. Polyhedral uncertainty can be expressed as \(\mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i\), where the matrix \(\mathbf {D}_i \in \mathbb {R}^{q \times n}\) and the vector \(\mathbf {d}_i \in \mathbb {R}^q\). Since zero rows can be added to obtain the same number \(q\) of inequalities for all data points, \(q\) is the largest dimension of the uncertainties over all the points. The robust SVM with polyhedral uncertainty is:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(25a)
$$\begin{aligned} \hbox {s.t.} \ \,&\min _{\{ \mathbf {x}_i: \mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i \}} y_i (\mathbf {w}^\top \mathbf {x}_i +b) \ge 1-\xi _i, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(25b)

The constraint \(\min _{\{ \mathbf {x}_i: \mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i \}} y_i (\mathbf {w}^\top \mathbf {x}_i +b) \ge 1-\xi _i\) is equivalent to:

$$\begin{aligned} \max _{\{ \mathbf {x}_i: \mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i \}} (-y_i \mathbf {w}^\top \mathbf {x}_i) - y_i b \le -1+\xi _i \end{aligned}$$
(26)

To handle the inner maximization, consider the linear program:

$$\begin{aligned} \max \ \,&-y_i \mathbf {w}^\top \mathbf {x}_i \end{aligned}$$
(27a)
$$\begin{aligned} \hbox {s.t.} \ \,&\mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i \end{aligned}$$
(27b)

The dual is:

$$\begin{aligned} \min \ \,&\mathbf {d}_i^\top \mathbf {z}_i \end{aligned}$$
(28a)
$$\begin{aligned} \hbox {s.t.} \ \,&\mathbf {D}_i^\top \mathbf {z}_i = -y_i \mathbf {w} \end{aligned}$$
(28b)
$$\begin{aligned}&\mathbf {z}_i = (z_{i1}, \ldots , z_{iq})^\top \ge 0 \end{aligned}$$
(28c)

Strong duality guarantees that the objective values of the dual and primal are equal. Therefore, the robust SVM formulation with polyhedral uncertainty is equivalent to:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i,\mathbf {z}} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(29a)
$$\begin{aligned} \hbox {s.t.} \ \,&\mathbf {d}_i^\top \mathbf {z}_i - y_i b \le -1+\xi _i , \ \ \ \xi _i \ge 0 \end{aligned}$$
(29b)
$$\begin{aligned}&\mathbf {D}_i^\top \mathbf {z}_i + y_i \,\mathbf {w} = 0, \ \ \ \mathbf {z}_i = (z_{i1}, \ldots , z_{iq})^\top \end{aligned}$$
(29c)
$$\begin{aligned}&z_{ij} \ge 0, \ \ \ i=1,\ldots ,m, \ \ \ j = 1,\ldots ,q \end{aligned}$$
(29d)
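
The reformulation (29a)–(29d) can be written down almost verbatim with cvxpy. The sketch below encodes a simple box uncertainty set as a polyhedron (cf. (31) below) purely for illustration; D, d and the data are hypothetical placeholders.

```python
import cvxpy as cp
import numpy as np

# Hypothetical nominal points and box half-widths, encoded as D_i x_i <= d_i.
x0 = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
delta = 0.5 * np.ones_like(x0)
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
m, n = x0.shape
q = 2 * n
I = np.eye(n)
D = [np.vstack([I, -I]) for _ in range(m)]
d = [np.hstack([x0[i] + delta[i], -x0[i] + delta[i]]) for i in range(m)]

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
z = [cp.Variable(q, nonneg=True) for _ in range(m)]        # dual multipliers z_i >= 0

constraints = []
for i in range(m):
    constraints += [d[i] @ z[i] - y[i] * b <= -1 + xi[i],  # (29b)
                    D[i].T @ z[i] + y[i] * w == 0]         # (29c)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints).solve()
print("w =", w.value, "b =", b.value)
```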

The authors also proved that for the hard margin SVM (i.e., when there is no \(\xi _i\)), the dual of the above formulation is:

$$\begin{aligned} \min _{\lambda ,\mu } \ \,&\sum _{i=1}^m \lambda _i - {1 \over 2} \sum _{k=1}^n \left( \sum _{i=1}^m y_i \mu _{ik} \right) ^2 \end{aligned}$$
(30a)
$$\begin{aligned} \hbox {s.t.} \ \,&\lambda _i d_{ij} + \sum _{k=1}^n \mu _{ik} D_{ijk} = 0, \ i=1,\ldots ,m, \ j = 1,\ldots ,q\end{aligned}$$
(30b)
$$\begin{aligned}&\sum _{i=1}^m \lambda _i y_i = 0 \end{aligned}$$
(30c)
$$\begin{aligned}&\lambda _i \ge 0, \ i=1,\ldots ,m \end{aligned}$$
(30d)

The interval uncertainty \([\mathbf {x}_i^0-\varvec{\delta }_i, \mathbf {x}_i^0+\varvec{\delta }_i]\) is a special case of polyhedral uncertainty, since by defining

$$\begin{aligned} \mathbf {D}_i = \begin{pmatrix} I \\ -I \end{pmatrix}, \ \ \ \mathbf {d}_i = \begin{pmatrix} \mathbf {x}_i^0+\varvec{\delta }_i \\ -\mathbf {x}_i^0+\varvec{\delta }_i \end{pmatrix} \end{aligned}$$
(31)

\( \{ \mathbf {x}_i : \mathbf {x}_i \in [\mathbf {x}_i^0-\varvec{\delta }_i, \mathbf {x}_i^0+\varvec{\delta }_i] \} \) and \( \{ \mathbf {x}_i : \mathbf {D}_i \mathbf {x}_i \le \mathbf {d}_i \} \) are equivalent. The authors of [60] also proposed probabilistic bounds on constraint violation in this case.

3.2 Chance Constrained SVM through Robust Optimization

The chance-constrained program (CCP) is used to ensure a small probability of misclassification for the uncertain data. The chance-constrained SVM formulation is:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(32a)
$$\begin{aligned} \hbox {s.t.} \ \,&\hbox {Prob} \Bigl \{ y_i (\mathbf {w}^\top X_i +b) \le 1-\xi _i \Bigr \} \le \varepsilon , \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(32b)

where \( 0 < \varepsilon \le 1 \) is a parameter given by the user and close to 0. This model ensures an upper bound on the misclassification probability, but the chance constraints are typically non-convex, so the problem is very hard to solve.

The approach taken so far to deal with the chance constraints is to transform them using different bounding inequalities. When the mean and covariance matrix are known, the multivariate Chebyshev bound via robust optimization can be used to reformulate the chance constraints above [61, 62].

Markov’s inequality states that if \(X\) is a nonnegative random variable and \(a>0\), then

$$\begin{aligned} \hbox {Prob} \{ X \ge a \} \le { \mathbf {E} [X] \over a } \end{aligned}$$
(33)

Consider the random variable \( \bigl ( X-\mathbf {E} [X] \bigr )^2 \). Applying Markov's inequality and using \( \mathbf {Var} (X) = \mathbf {E} \bigl [ (X-\mathbf {E} [X])^2 \bigr ] \) gives

$$\begin{aligned} \hbox {Prob} \{ \bigl ( X-\mathbf {E} [X] \bigr )^2 \ge a^2 \} \le { \mathbf {Var} (X) \over a^2 } \end{aligned}$$
(34)

which yields Chebyshev’s inequality

$$\begin{aligned} \hbox {Prob} \{ \bigl | X-\mathbf {E} [X] \bigr | \ge a \} \le { \mathbf {Var} (X) \over a^2 } \end{aligned}$$
(35)

Let \( \mathbf {x} \sim (\mathbf {\mu }, \Sigma ) \) denote a random vector \( \mathbf {x} \) with mean \(\mathbf {\mu }\) and covariance matrix \(\Sigma \). The multivariate Chebyshev inequality [65, 66] states that for an arbitrary closed convex set \(S\), the supremum of the probability that \( \mathbf {x} \) takes a value in \(S\) is

$$\begin{aligned} \sup _{\mathbf {x} \sim (\mathbf {\mu }, \mathbf {\Sigma })} \hbox {Prob} \{\mathbf {x} \in S\}&= { 1 \over 1+d^2 } \end{aligned}$$
(36a)
$$\begin{aligned} d^2&= \inf _{\mathbf {x} \in S} (\mathbf {x} - \mathbf {\mu })^\top \Sigma ^{-1} (\mathbf {x} - \mathbf {\mu }) \end{aligned}$$
(36b)

For the constraint \(\hbox {Prob} \{ \mathbf {w}^\top \mathbf {x} + b \le 0 \} \le \varepsilon \), it can be derived that:

$$\begin{aligned} \mathbf {w}^\top \mathbf {\mu } + b \ge \kappa _C ||\Sigma ^{1 \over 2} \mathbf {w}||_2 \end{aligned}$$
(37)

where \( \kappa _C=\sqrt{(1-\varepsilon )/\varepsilon } \).

Applying the above result to the chance constrained SVM, the Chebyshev based reformulation utilizing the means \(\mathbf {\mu }_i \) and covariance matrix \(\Sigma _i\) of each uncertain training point \(X_i\) can be obtained as the following robust model [61, 62]:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(38a)
$$\begin{aligned} \hbox {s.t.} \ \,&y_i (\mathbf {w}^\top \mu _i +b) \ge 1-\xi _i+\kappa _C||\Sigma _i^{1 \over 2} \mathbf {w}||_2, \ \ \ \xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(38b)
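
The Chebyshev based model (38a)–(38b) is a second order cone program. A cvxpy sketch follows, with hypothetical means mu, covariance square roots Sigma_sqrt and misclassification bound epsilon:

```python
import cvxpy as cp
import numpy as np

# Hypothetical means mu_i, covariance square roots Sigma_i^{1/2} and chance level epsilon.
mu = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
Sigma_sqrt = [0.3 * np.eye(2) for _ in range(4)]
y = np.array([1.0, 1.0, -1.0, -1.0])
C, eps = 1.0, 0.1
m, n = mu.shape
kappa_C = np.sqrt((1 - eps) / eps)

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
constraints = [y[i] * (mu[i] @ w + b) >= 1 - xi[i]
               + kappa_C * cp.norm(Sigma_sqrt[i] @ w, 2)   # (38b)
               for i in range(m)]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints).solve()
print("w =", w.value, "b =", b.value)
```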

Another approach to SVM with chance constraints uses Bernstein approximation schemes [67–69]. Ben-Tal et al. [63] employed Bernstein bounding schemes to relax the CCP and transformed the problem into a convex second order cone program with robust set constraints, which guarantees satisfaction of the chance constraints and can be solved efficiently using interior point solvers.

The Bernstein based relaxation utilizes both the support (bounds, i.e. extremum values of the data points) and the moment information (mean and variance). For a random data point \( X_i = [X_{i1},\ldots ,X_{in}]^\top \) with label \( y_i \): the support information consists of bounds on the data points, \( l_{ij} \le X_{ij} \le u_{ij} \), i.e. \( X_i \in \fancyscript{R}_i = \{ \mathbf {x}_i=[x_{i1},\ldots ,x_{in}]^\top \in \mathbb {R}^n \ | \ l_{ij} \le x_{ij} \le u_{ij}, j=1,\ldots ,n \} \); the first-moment information consists of bounds on the means of the data points, \( \mu _i^- = [\mu _{i1}^-,\ldots ,\mu _{in}^-]^\top \le \mu _i = \mathbf {E}[X_i] = [\mathbf {E}[X_{i1}],\ldots ,\mathbf {E}[X_{in}]]^\top \le \mu _i^+ = [\mu _{i1}^+,\ldots ,\mu _{in}^+]^\top \); and the second-moment information consists of bounds on the second moments of the data points, \( 0 \le \mathbf {E}[X_{ij}^2] \le \sigma _{ij}^2 \).

The Bernstein based relaxation derives convex constraints whose satisfaction guarantees that the chance constraints are satisfied. The authors proved that, assuming the components \(X_{ij}\) are independent and given the support \( l_{ij} \le X_{ij} \le u_{ij} \), bounds on the first moment \( \mu _{ij}^- \le \mu _{ij} = \mathbf {E}[X_{ij}] \le \mu _{ij}^+ \), and bounds on the second moment \( 0 \le \mathbf {E}[X_{ij}^2] \le \sigma _{ij}^2 \), the chance constraint in SVM is satisfied if the following convex constraint holds:

$$\begin{aligned} 1 - \xi _i - y_i b + \sum _j \Bigl ( \max \bigl [ -y_i \mu _{ij}^- w_j, -y_i \mu _{ij}^+ w_j \bigr ] \Bigr ) + \kappa _B ||\Sigma _i \mathbf {w}||_2 \le 0 \end{aligned}$$
(39)

where \( \kappa _B = \sqrt{2\log (1/\varepsilon )} \), and the diagonal matrix

$$\begin{aligned} \Sigma _i = \hbox {diag} \Bigl ( s_{i1} \nu ( \mu _{i1}^-, \mu _{i1}^+, \sigma _{i1} ), \ldots , s_{in} \nu ( \mu _{in}^-, \mu _{in}^+, \sigma _{in} ) \Bigr ) \end{aligned}$$
(40)

where \( s_{ij} = { u_{ij} - l_{ij} \over 2 } \) and the function \( \nu ( \mu _{ij}^-, \mu _{ij}^+, \sigma _{ij} ) \) is defined via the normalized variable \( \hat{X}_{ij} = { X_{ij} - c_{ij} \over s_{ij} } \), where \( c_{ij} = {l_{ij}+u_{ij} \over 2} \). Using the information on \( X_{ij} \), one can easily compute the moment information of \( \hat{X}_{ij} \), denoted by \( \hat{\mu }_{ij}^- \le \hat{\mu }_{ij} = \mathbf {E}[\hat{X}_{ij}] \le \hat{\mu }_{ij}^+ \) and \( 0 \le \mathbf {E}[\hat{X}_{ij}^2] \le \hat{\sigma }_{ij}^2 \). They proved that

$$\begin{aligned} \mathbf {E} \Bigl [ \exp \{\tilde{t} \hat{X}_{ij} \} \Bigr ] \le g_{\hat{\mu }_{ij},\hat{\sigma }_{ij}} (\tilde{t}) = {\left\{ \begin{array}{ll} {(1-\hat{\mu }_{ij})^2 \exp \Bigl \{ \tilde{t} { \hat{\mu }_{ij}-\hat{\sigma }_{ij}^2 \over 1-\hat{\mu }_{ij}} \Bigr \} \,+\, \bigl ( \hat{\sigma }_{ij}^2 - \hat{\mu }_{ij}^2 \bigr ) \exp \{ \tilde{t} \} \over 1-2\hat{\mu }_{ij}\,+\,\hat{\sigma }_{ij}^2}, &{}\tilde{t} \ge 0 \\ {(1+\hat{\mu }_{ij})^2 \exp \Bigl \{ \tilde{t} { \hat{\mu }_{ij}+\hat{\sigma }_{ij}^2 \over 1+\hat{\mu }_{ij}} \Bigr \} \,+\,\bigl ( \hat{\sigma }_{ij}^2 - \hat{\mu }_{ij}^2 \bigr ) \exp \{ - \tilde{t} \} \over 1+2\hat{\mu }_{ij}\,+\,\hat{\sigma }_{ij}^2}, &{}\tilde{t} \le 0 \end{array}\right. }\nonumber \\ \end{aligned}$$
(41)

They defined \( h_{\hat{\mu }_{ij},\hat{\sigma }_{ij}} (\tilde{t}) = \log g_{\hat{\mu }_{ij},\hat{\sigma }_{ij}} (\tilde{t}) \), and the function \(\nu ( \mu ^-, \mu ^+, \sigma )\) is defined as:

$$\begin{aligned} \nu ( \mu ^-, \mu ^+, \sigma )&= \min \Bigl \{ k \ge 0 : h_{\hat{\mu },\hat{\sigma }} (\tilde{t}) \le \max [\hat{\mu }^- \tilde{t}, \hat{\mu }^+ \tilde{t}] + {k^2 \over 2} \tilde{t}^2, \nonumber \\&\quad \ \forall \hat{\mu } \in [\hat{\mu }^-,\hat{\mu }^+], \tilde{t} \Bigr \} \end{aligned}$$
(42)

This value can be calculated numerically. Under the condition that \( \mu _{ij}^- \le c_{ij} \le \mu _{ij}^+ \), this value can be computed analytically by \( \nu ( \mu ^-, \mu ^+, \sigma ) = \sqrt{1-(\hat{\mu }^{\min })^2} \), where \( \hat{\mu }^{\min } = \min (-\hat{\mu }^-, \hat{\mu }^+ ) \).
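
As a small numerical illustration (a sketch under the stated condition \( \mu _{ij}^- \le c_{ij} \le \mu _{ij}^+ \)), the following helper normalizes the first-moment bounds of a single component \(X_{ij}\) and evaluates the analytic expression for \(\nu \); the example values are hypothetical.

```python
import numpy as np

def nu_analytic(l_ij, u_ij, mu_minus, mu_plus):
    """Analytic nu(mu^-, mu^+, sigma), valid when mu^- <= c_ij <= mu^+."""
    c_ij = (l_ij + u_ij) / 2.0
    s_ij = (u_ij - l_ij) / 2.0
    mu_hat_minus = (mu_minus - c_ij) / s_ij     # normalized first-moment bounds
    mu_hat_plus = (mu_plus - c_ij) / s_ij
    mu_hat_min = min(-mu_hat_minus, mu_hat_plus)
    return np.sqrt(1.0 - mu_hat_min ** 2)

# Example: support [0, 2] and mean known to lie in [0.8, 1.2].
print(nu_analytic(0.0, 2.0, 0.8, 1.2))          # ~0.98
```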

Replacing the chance-constraints in SVM by the convex constraint derived above, the problem is transformed into a convex second order cone program:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i,z_{ij}} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(43a)
$$\begin{aligned} \hbox {s.t.} \ \,&1 - \xi _i - y_i b + \sum _j z_{ij} + \kappa _B ||\Sigma _i \mathbf {w}||_2 \le 0 \end{aligned}$$
(43b)
$$\begin{aligned}&z_{ij} \ge -y_i \mu _{ij}^- w_j, \ z_{ij} \ge -y_i \mu _{ij}^+ w_j\end{aligned}$$
(43c)
$$\begin{aligned}&\xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(43d)

which can be solved efficiently using cone programming solvers.
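
A cvxpy sketch of (43a)–(43d) is given below; mu_minus, mu_plus and sv (the precomputed diagonal entries \(s_{ij}\,\nu _{ij}\) of \(\Sigma _i\), e.g. from nu_analytic above) are hypothetical placeholders, and the per-point constraints follow (43b)–(43c) directly.

```python
import cvxpy as cp
import numpy as np

# Hypothetical first-moment bounds and precomputed s_ij * nu_ij (diagonal of Sigma_i).
mu_minus = np.array([[1.8, 1.8], [2.8, 2.8], [-1.2, -1.2], [-2.2, -1.2]])
mu_plus = mu_minus + 0.4
sv = 0.5 * np.ones_like(mu_minus)
y = np.array([1.0, 1.0, -1.0, -1.0])
C, eps = 1.0, 0.1
m, n = mu_minus.shape
kappa_B = np.sqrt(2 * np.log(1 / eps))

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
z = cp.Variable((m, n))

constraints = []
for i in range(m):
    constraints += [
        1 - xi[i] - y[i] * b + cp.sum(z[i])
        + kappa_B * cp.norm(cp.multiply(sv[i], w), 2) <= 0,    # (43b)
        z[i] >= -y[i] * cp.multiply(mu_minus[i], w),           # (43c)
        z[i] >= -y[i] * cp.multiply(mu_plus[i], w),
    ]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints).solve()
print("w =", w.value, "b =", b.value)
```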

The geometrical interpretation of this convex constraint is that \( y_i (\mathbf {w}^\top \mathbf {x} +b) \ge 1-\xi _i \) is satisfied for all \( \mathbf {x} \) belonging to the union of ellipsoids \( \fancyscript{E} \bigl ( \mu _i, \kappa _B \Sigma _i \bigr ) = \bigl \{ \mathbf {x} = \mu _i + \kappa _B \Sigma _i \mathbf {a} \ : \ || \mathbf {a} ||_2 \le 1 \bigr \} \) with center \( \mu _i \), shape and size \( \kappa _B \Sigma _i \), and the union is over \( \mu _i \in [\mu _i^-, \mu _i^+] \), i.e.,

$$\begin{aligned} y_i (\mathbf {w}^\top \mathbf {x} +b) \ge 1-\xi _i, \ \ \forall \mathbf {x} \in \cup _{\mu _i \in [\mu _i^-, \mu _i^+]} \fancyscript{E} \bigl ( \mu _i, \kappa _B \Sigma _i \bigr ) \end{aligned}$$
(44)

Therefore, this constraint is defining an uncertainty set \( \cup _{\mu _i \in [\mu _i^-, \mu _i^+]} \fancyscript{E} \bigl ( \mu _i, \kappa _B \Sigma _i \bigr ) \) for each uncertain training data point \( X_i \). If all the points in the uncertainty set satisfy \( y_i (\mathbf {w}^\top \mathbf {x} +b) \ge 1-\xi _i \), then the chance-constraint is guaranteed to be satisfied. This transforms the CCP into a robust optimization problem over the uncertainty set.

Since the size of the uncertainty set depends on \(\kappa _B\) and \( \kappa _B = \sqrt{2\log (1/\varepsilon )} \), when the upper bound \(\varepsilon \) on the misclassification probability decreases, the size of the uncertainty set increases. When \(\varepsilon \) is very small, the uncertainty set becomes huge and the constraint becomes too conservative. As the support information provides the bounding hyper-rectangle \(\fancyscript{R}_i\) in which the true training data point \(X_i\) always lies, a less conservative classifier can be obtained by taking the intersection of \( \cup _{\mu _i \in [\mu _i^-, \mu _i^+]} \fancyscript{E} \bigl ( \mu _i, \kappa _B \Sigma _i \bigr ) \) and \(\fancyscript{R}_i\) as the new uncertainty set.

The authors proved that when the uncertainty set is the intersection, i.e.,

$$\begin{aligned} y_i (\mathbf {w}^\top \mathbf {x} +b) \ge 1-\xi _i, \ \ \forall \mathbf {x} \in \Bigg ( \cup _{\mu _i \in [\mu _i^-, \mu _i^+]} \fancyscript{E} \bigl ( \mu _i, \kappa _B \Sigma _i \bigr ) \Bigg ) \cap \fancyscript{R}_i \end{aligned}$$
(45)

then constraint (45) is satisfied if and only if the following convex constraint holds:

$$\begin{aligned}&\sum _j \Bigl ( \max \bigl [ -l_{ij} (y_i w_j + a_{ij}), -u_{ij} (y_i w_j + a_{ij}) \bigr ] + \max \bigl [ \mu _{ij}^- a_{ij}, \mu _{ij}^+ a_{ij} \bigr ] \Bigr )\nonumber \\&\qquad + 1 - \xi _i - y_i b + \kappa _B ||\Sigma _i \mathbf {a}_i||_2 \le 0 \end{aligned}$$
(46)

Replacing the chance-constraints in SVM by the robust but less conservative convex constraint above, the problem is transformed into the following SOCP:

$$\begin{aligned} \min _{\mathbf {w},b,\xi _i,z_{ij},\tilde{z}_{ij},\mathbf {a}_i} \ \,&\dfrac{1}{2} \Vert \mathbf {w} \Vert _2^2 + C \sum _{i=1}^m \xi _i \end{aligned}$$
(47a)
$$\begin{aligned} \hbox {s.t.} \ \,&1 - \xi _i - y_i b + \sum _j \tilde{z}_{ij} + \sum _j z_{ij} + \kappa _B ||\Sigma _i \mathbf {a}_i||_2 \le 0 \end{aligned}$$
(47b)
$$\begin{aligned}&z_{ij} \ge \mu _{ij}^- a_{ij}, \ z_{ij} \ge \mu _{ij}^+ a_{ij} \end{aligned}$$
(47c)
$$\begin{aligned}&\tilde{z}_{ij} \ge -l_{ij} (y_i w_j + a_{ij}), \ \tilde{z}_{ij} \ge -u_{ij} (y_i w_j + a_{ij}) \end{aligned}$$
(47d)
$$\begin{aligned}&\xi _i \ge 0, \ \ i=1,\ldots ,m \end{aligned}$$
(47e)

The Bernstein based formulations (43) and (47) are robust to moment estimation errors in addition to the uncertainty in the data, since they use bounds on the mean \(\big (\mu _{ij}^-,\mu _{ij}^+\big ) \) and on the second moment \(\big (\sigma _{ij}^2\big )\) instead of the exact values of the moments, which are often unknown.

Comparing the two approaches to chance constrained SVM, both are robust to uncertainties in the data and make no assumptions about the underlying probability distribution. Chebyshev based schemes employ only the moment information of the uncertain training points, while Bernstein bounds employ both support and moment information and can therefore be less conservative than Chebyshev bounds. The classifier resulting from the Bernstein approach achieves larger classification margins and therefore better generalization ability according to the structural risk minimization principle of Vapnik [1]. A drawback of the Bernstein based formulation is that it assumes the components \(X_{ij}\) are mutually independent, while the Chebyshev based formulation can employ the full covariance matrix \(\Sigma _i\) of the uncertain training point \(X_i\).

4 Concluding Remarks

This paper presented a survey on SVM with uncertainties. When the direct model cannot guarantee generally good performance over the uncertainty set, robust optimization is utilized to obtain optimal performance under the worst case scenario. The perturbation of the uncertain data can be bounded in norm, or expressed as intervals or polyhedra. When the constraint is a chance constraint, different bounding schemes such as the multivariate Chebyshev inequality and Bernstein bounding schemes are used to ensure a small probability of misclassification for the uncertain data.

The models in the literature generally address the linear SVM. A large part of the power of SVM lies in nonlinear kernels, which generate nonlinear classification boundaries. Therefore, more study could be conducted on how to handle nonlinear kernels. In addition, more schemes could be explored to represent the robust regions of the uncertain data and to formulate the models as tractable convex problems.