Universal consistency of twin support vector machines

A classification problem aims at constructing a classifier with the smallest possible risk. As the sample size approaches infinity, learning algorithms for classification are characterized by an asymptotic property, namely universal consistency, which plays a crucial role in assessing how classification rules are constructed. A universally consistent algorithm ensures that the larger the sample size is, the more accurately the distribution of the samples can be reconstructed. Support vector machines (SVMs) are regarded as one of the most important models for binary classification. How to effectively extend SVMs to twin support vector machines (TWSVMs) so as to improve classification performance has recently attracted increasing interest in many research areas, and many variants of TWSVMs have been proposed and used in practice. In this paper, we therefore focus on the universal consistency of TWSVMs in a binary classification setting. We first give a general framework for TWSVM classifiers that unifies most of the variants of TWSVMs for binary classification problems. Based on it, we then investigate the universal consistency of TWSVMs. To do this, we give definitions of risk, Bayes risk and universal consistency for TWSVMs. Theoretical results indicate that universal consistency is valid for various TWSVM classifiers under certain conditions, based on covering number, localized covering number and stability. As applications of our general framework, several variants of TWSVMs are considered.


Introduction
As sample size increases gradually to infinity, there is an asymptotical property for learning algorithms, called consistency. It is an extremely important part in statistical learning theory. In fact, though the sample size is always finite for practical problems, the consistency of learning algorithms guarantees that by using more samples, a more accurate distribution could be reconstructed. Since the concept of consistency was first proposed by Vapnik and Chervonenkis [27][28][29], the consistency of various learning algorithms has been extensively explored in statistical learning and machine learning areas.
According to the setting of the learning machine, consistency can be summarized into the following types. The consistency of the empirical risk minimization (ERM) method [29] is a classical type of consistency. The loss function minimizing the empirical risk is used to approximate the loss function minimizing the true risk. For example, Chen et al. [5] studied the consistency of the ERM method based on convex losses for multi-class classification problems. Brownlees et al. [4] investigated performance bounds for heavy-tailed losses from the viewpoint of the consistency of the ERM method. Berner et al. [1] analyzed the generalization error of the ERM method over deep artificial neural network hypothesis classes. Xu et al. [30] proposed a general framework for statistical learning with group invariance, and studied the consistency of that framework. Fisher consistency [10] strengthens the property of unbiasedness for the parameters of functions. It estimates the parameters of functions directly, and uses the estimated values to approximate their true values. For instance, Liu [17] established the Fisher consistency theory for different loss functions of multi-category support vector machine (SVM) algorithms. Fathony et al. [8] proposed an adversarial bipartite matching algorithm that is both computationally efficient and Fisher consistent.
In addition to these two types, universal consistency is also a typical type of consistency, which measures the consistency of learning algorithms on the basis of the structural risk minimization (SRM) method. Indeed, the ERM method concerns only the empirical risk and does not involve a regularization term. However, in many practical situations, a regularization term is included in order to control the generalization ability of learning machines. To this end, Vapnik [26] proposed the SRM method to balance the empirical risk on the training data against the generalization ability of learning machines. Later, the concept of universal consistency was introduced to demonstrate the consistency of many learning algorithms based on the SRM method. For example, Steinwart [25] showed the universal consistency of SVMs and their different variants in a unified framework. Liu et al. [16] showed that the extreme learning machine (ELM) is universally consistent for radial basis function networks, and pointed out how to select the optimal kernel functions in ELM applications. Dumpert and Christmann [7] studied the universal consistency of localized kernel based methods. Gyorfi et al. [12] presented universal consistency results for the nearest-neighbor prototype algorithm in the multi-class classification setting, and also derived the convergence rate based on the universal consistency. In summary, universal consistency has been studied in depth in many problem settings. The present paper focuses on universal consistency for binary classification problems.
SVMs are powerful tools for binary classification problems. The key idea is to construct two parallel hyper-planes such that the positive and negative classes are well separated, and then to maximize the margin between the two hyper-planes, which amounts to minimizing the regularization term. SVMs have been widely applied to many practical problems, such as text classification [15], face recognition [23] and bioinformatics [9]. Though successful in these applications, SVMs still face difficulties: they are best suited to small-sample problems, and their computational cost becomes very high for large-scale problems.
In order to reduce the computational cost of SVMs, many extensions of SVMs have been proposed and studied. For instance, the generalized eigenvalue proximal support vector machine (GEPSVM) [18] is such an extension, which constructs two non-parallel hyper-planes so that each hyper-plane is closest to one of the two classes and as far away from the other class as possible. Based on SVMs and GEPSVM, Jayadeva et al. [13] established another extension of SVMs, the twin support vector machine (TWSVM). The main idea of TWSVM is similar to that of GEPSVM, while its formulation is entirely different. TWSVM solves a pair of quadratic programming problems (QPPs), and the formulation of each QPP is similar to that of SVMs, except that only one class of the training samples appears in the constraints of each QPP. In brief, the computational cost of TWSVM is roughly one fourth of that of SVMs.
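The one-fourth figure can be traced to a back-of-the-envelope count. The sketch below rests on two assumptions that are not stated in the text above: solving a QPP with n constraints costs on the order of n^3 operations, and the two classes are roughly balanced, so that each of the two TWSVM problems has about m/2 constraints while the SVM problem has m.

```latex
\[
\frac{\operatorname{cost}(\text{TWSVM})}{\operatorname{cost}(\text{SVM})}
\;\approx\; \frac{2\,(m/2)^{3}}{m^{3}}
\;=\; \frac{2}{8}
\;=\; \frac{1}{4}.
\]
```

For strongly imbalanced classes the ratio differs, since the two QPPs then have different sizes.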
Similar to SVMs, TWSVM also has many variants for binary classification problems. For instance, the smooth TWSVM algorithm [14] replaces the plus function $(x)_+$ with a smooth approximation when solving the QPPs, so that the learnt classifier is smoother than before. The twin bounded SVM (TBSVM) algorithm [21] embeds a regularization term into TWSVM, according to the SRM principle, and thus improves the classification performance. The weighted linear loss TWSVM algorithm [22] adjusts the impact of each point on the hyper-plane by assigning weights to the slack variables. The least squares TBSVM algorithm based on the L1-norm distance metric [32] first forms a least squares version of TBSVM and then substitutes the L1-norm for the L2-norm to enhance robustness. Besides, there are many other algorithms based on the same idea as TWSVM, such as the robust TWSVM algorithm [19], the margin-based TWSVM with unity norm hyper-planes [20] and the fuzzy TWSVM algorithm [11]. Though these TWSVM variants are well formulated and have been applied to different practical problems, their universal consistency has not been studied so far.
In this study, we address the universal consistency of TWSVM and its variants. Since it is very cumbersome to analyze the universal consistency of the TWSVM variants one by one, we propose to do this work under a general framework. However, there is no unified framework for all TWSVM variants in the literature, so we first construct one. Because not all the variants are based on the same idea, it is very difficult to find a general framework that fits all of them. We therefore construct a general framework for most of the variants based on the idea of TWSVM, and formulate it as a general optimization problem. Concretely, the optimization problem consists of two minimization problems, each of which contains two terms: the first term measures the average of the losses over both the positive and negative data, and the second term is a regularization term for maximizing some margin.
With the general framework, we then study the universal consistency of the general optimization problem. We first introduce the definition of universal consistency for the general optimization problem, and then show under what conditions universal consistency is valid. Since the definition of universal consistency involves the risk $R_P(f)$ and the Bayes risk $R_P$ of the general optimization problem, both are redefined here. To show the validity of universal consistency under these conditions, an assertion is needed and is therefore proposed. Since the regularized $L_1$-risk $R^{reg}_{1,P,c_1}(f_1)$ and the regularized $L_2$-risk $R^{reg}_{2,P,c_2}(f_2)$ are essential for describing the assertion, their detailed definitions are given before it. The assertion is then centered on a pair of concentration inequalities. Under three different conditions, based on covering number, localized covering number and stability, respectively, we derive three different pairs of concentration inequalities, and thus obtain three different results on universal consistency.
The rest of the paper is organized as follows: Section 2 gives the preliminaries. Section 3 derives the general framework for most variants of TWSVM and formulates it as a general optimization problem. Section 4 introduces the definitions of risk, Bayes risk and universal consistency for this optimization problem and proposes an assertion to theoretically support the universal consistency. Section 5 presents the theoretical results about the assertion. Finally, Sect. 6 concludes this paper.

Preliminaries
Here, we give some notations and concepts that are used in the following sections. Denote $\mathbb{R} = (-\infty, +\infty)$ and $\mathbb{R}_+ = [0, +\infty)$. Suppose $X$ is a compact metric space, and $k : X \times X \to \mathbb{R}$ is a positive semi-definite kernel. We define the quantity $K = \sup\{\sqrt{k(x, x)} : x \in X\}$. Let $H$ be the reproducing kernel Hilbert space (RKHS) with respect to the kernel $k$. Recall that there is a mapping $\Phi : X \to H$ satisfying the reproducing property, that is, $f(x) = \langle f, \Phi(x) \rangle_H$ for all $f \in H$ and $x \in X$. Suppose the kernel $k$ is continuous; then every element of $H$ is also continuous on $X$. In this situation, there is a mapping $I : H \to C(X)$, which continuously embeds the RKHS $H$ into the space $C(X)$ of all continuous functions on $X$. We say $k$ is a universal kernel if the image of $I$ is dense in $C(X)$. Let $\Omega : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function, written $\Omega(c, t)$. This function is continuous at 0 with respect to the variable $c$, and is unbounded as the variable $t$ tends to infinity. We introduce the following definition [25] to explain in which case $\Omega$ is a regularization function.

Definition 1 Given a function $\Omega(c, t)$, assume there exists $t > 0$ satisfying the inequality $\Omega(c, t) < \infty$ for any $c > 0$. Then, $\Omega$ is a regularization function, if for all $c > 0$, $s, t \in \mathbb{R}_+$, and for all sequences
For any given loss function
Then $L$ is an admissible loss function [25], redefined as follows:

A general framework for twin support vector machines
Denote by $X \subseteq \mathbb{R}^d$ the input space for instances and by $Y = \{-1, 1\}$ the output space for labels. Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the training set belonging to the space $(X \times Y)^m$. Assume the data of the training set $S$ are sampled from an unknown distribution $P$ on $X \times Y$, and that they are independent and identically distributed (i.i.d.). The training data are arranged in matrices $A$, $B$ and $D$. We give a general framework to cover most of the variants of TWSVM, which is formulated as a pair of minimization problems, Eqs. (1) and (2), where $c_1$, $c_2$ and the remaining coefficients are all trade-off parameters, $f_1$, $f_2$ are the hyper-planes corresponding to the positive and negative classes, respectively, and $H$ is the RKHS. Here, the first loss function measures the squared distance from the data of one class to the corresponding hyper-plane. The second loss function measures the slack variable such that the distance from the data of the other class to the same hyper-plane is no smaller than 1. The last term is a regularization term for maximizing some margin. Note that in the minimization problem Eq. (1), the first loss function measures the loss for one positive sample and the second measures the loss for one negative sample. The sum of the first two terms is the total loss over all the positive and negative samples. We now combine the two loss functions into one revised loss function that measures the loss of any positive or negative sample with respect to the hyper-plane $f_1$. Given a function $p(y)$, the revised loss function $L_1$ is defined from the two loss functions. Analogously, a revised loss function $L_2$ is defined for the minimization problem Eq. (2); it measures the loss of any positive or negative sample with respect to the hyper-plane $f_2$. Therefore, the general framework can be rewritten in terms of $L_1$ and $L_2$. Obviously, it is equivalent to the optimization problem Eq. (3), whose regularization term is also a regularization function. To sum up, the optimization problem Eq. (3) is a unified framework for most of the TWSVM variants considered in the paper.
Note that, in the linear case, the two non-parallel hyper-planes are constructed as $f_1(x) = x^T w_1 + b_1$ for positive samples and $f_2(x) = x^T w_2 + b_2$ for negative samples, respectively. In the nonlinear case, they are formulated analogously through the kernel $k$. The final classifier $f : X \to \mathbb{R}$ for both the linear and nonlinear cases can be expressed as $f(\cdot) = |f_2(\cdot)| - |f_1(\cdot)|$. In the examples, two slack variables are used, and $e_1$, $e_2$ are two vectors whose elements are all 1's and whose dimensions are $m_1$ and $m_2$, respectively. Below, we give several examples that can be expressed by the unified framework.
Example 1 (TWSVM [13]) For linear TWSVM, the optimization problem is formulated as follows: For nonlinear TWSVM, the optimization problem is formulated as follows:

and let
Then, the optimization problems of linear and nonlinear case of TWSVM could be both converted to the general framework Eq. (3).
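As a concrete illustration of Example 1, the following is a minimal sketch of the first linear TWSVM problem with the slack variable eliminated. It assumes the standard formulation of [13], namely minimizing 0.5*||A w1 + e1 b1||^2 + c1 * e2^T xi subject to -(B w1 + e2 b1) + xi >= e2 and xi >= 0, with A holding the positive samples and B the negative samples row-wise. The function name and argument layout are hypothetical and chosen only for this sketch.

```python
import numpy as np

def twsvm_plane1_objective(w, b, A, B, c1):
    """Objective of the first linear TWSVM problem with the slack eliminated.

    Assumed layout: A is (m1, d) with positive samples as rows,
    B is (m2, d) with negative samples as rows.
    """
    # Closeness term: 0.5 * ||A w + e1 b||^2 (positive class near the plane)
    closeness = 0.5 * np.sum((A @ w + b) ** 2)
    # At the optimum the slack is xi = max(0, e2 + (B w + e2 b)),
    # i.e. negative samples should satisfy -(B w + b) >= 1.
    slack = np.maximum(0.0, 1.0 + (B @ w + b))
    return closeness + c1 * np.sum(slack)
```

The second problem is obtained by swapping the roles of A and B and flipping the sign inside the slack term.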
Example 2 (TBSVM [21]) For linear TBSVM, the optimization problem is formulated as follows: For nonlinear TBSVM, the optimization problem is formulated as follows: Let Connected with Eqs. (4) and (5), the optimization problems of linear and nonlinear case of TBSVM could be both converted to the general framework Eq. (3).
Example 3 (Improved LSTSVM [31]) For linear improved LSTSVM, the optimization problem is formulated as follows: (4) For nonlinear improved LSTSVM, the optimization problem is formulated as follows:

Let
Considering Eqs. (4) and (6), the optimization problems of linear and nonlinear case of improved LSTSVM could be both converted to the general framework Eq. (3).
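For least squares variants, the inequality constraints become equalities and each problem has a closed-form solution. The sketch below assumes an objective of the form 0.5*||E z||^2 + 0.5*c1*||F z + e||^2 + 0.5*c3*||z||^2 with z = (w, b), E = [A, e1] and F = [B, e2]; whether this matches the formulation of [31] term by term is an assumption, and the names `ls_twin_plane1`, `c1` and `c3` are hypothetical.

```python
import numpy as np

def ls_twin_plane1(A, B, c1, c3):
    """Closed-form minimizer of
    0.5*||E z||^2 + 0.5*c1*||F z + e||^2 + 0.5*c3*||z||^2,  z = (w, b).
    """
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # [A, e1]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # [B, e2]
    e = np.ones(B.shape[0])
    # Setting the gradient to zero gives
    # (E^T E + c1 F^T F + c3 I) z = -c1 F^T e
    H = E.T @ E + c1 * (F.T @ F) + c3 * np.eye(E.shape[1])
    z = -c1 * np.linalg.solve(H, F.T @ e)
    return z[:-1], z[-1]  # (w, b)
```

With c3 > 0 the system matrix is positive definite, so a single linear solve replaces the QPP.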

Universal consistency of TWSVMs
Since TWSVM and its variants are all built based on SRM principle, we study the universal consistency of TWSVMs in a general framework, that is, the universal consistency of the optimization problem (3).

Definitions
Given the training set $S$, let $f_S : X \to \mathbb{R}$ be the classifier of the optimization problem Eq. (3) learned from the training set $S$, where $f_S(\cdot) = |f_{2,S}(\cdot)| - |f_{1,S}(\cdot)|$ and $f_{1,S}, f_{2,S} : X \to \mathbb{R}$ are measurable functions corresponding to the positive and negative hyper-planes, respectively. In order to make the classifier work well, we need the probability of misclassifying a new sample $(x, y)$, drawn from $P$ independently of $S$, to be as small as possible. A misclassification means that $\operatorname{sign}(f_S(x)) \neq y$. In what follows, we therefore redefine risk, Bayes risk and universal consistency for the optimization problem Eq. (3).
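The decision rule just described, together with the fraction of misclassified samples on a finite data set, can be sketched as follows for the linear case. The function names are hypothetical, and counting ties f(x) = 0 as errors is an assumption made only for this sketch.

```python
import numpy as np

def twsvm_decision(X, w1, b1, w2, b2):
    """f(x) = |f2(x)| - |f1(x)| with linear planes f_i(x) = x^T w_i + b_i."""
    f1 = X @ w1 + b1
    f2 = X @ w2 + b2
    return np.abs(f2) - np.abs(f1)

def empirical_error(X, y, w1, b1, w2, b2):
    """Fraction of samples with sign(f(x)) != y (labels in {-1, +1}).

    Ties f(x) = 0 give sign 0 and are therefore counted as errors.
    """
    pred = np.sign(twsvm_decision(X, w1, b1, w2, b2))
    return float(np.mean(pred != y))
```

A sample is assigned to the positive class when it lies closer to the first hyper-plane than to the second, which is exactly when f(x) > 0.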

Definition 3 (Risk) Given a measurable function $f : X \to \mathbb{R}$, the risk of $f$ is the probability that a sample $(x, y)$ drawn from $P$ independently of $S$ is wrongly classified, i.e., $R_P(f) = P\{(x, y) : \operatorname{sign}(f(x)) \neq y\}$, where $f(\cdot) = |f_2(\cdot)| - |f_1(\cdot)|$, and $f_1, f_2 : X \to \mathbb{R}$ are both measurable functions corresponding to the positive and negative hyper-planes, respectively.

Definition 4 (Bayes Risk) The Bayes risk with respect to distribution $P$, denoted by $R_P$, is the smallest achievable risk, $R_P = \inf\{R_P(f) : f : X \to \mathbb{R} \text{ measurable}\}$, where $f(\cdot) = |f_2(\cdot)| - |f_1(\cdot)|$, and $f_1, f_2 : X \to \mathbb{R}$ are both measurable functions corresponding to the positive and negative hyper-planes, respectively.
Bayes risk is the minimal value of the risk $R_P(f)$, and thus is the minimal true risk with respect to distribution $P$ on the space $X \times Y$. Given the training set $S$, to make the probability of misclassification as small as possible, we must make the risk $R_P(f_S)$ of the classifier $f_S$ arbitrarily close to the minimal true risk. Thereby, universal consistency is defined as follows:
Definition 5 (Universal consistency) The classifier $f_S$ is universally consistent if $\lim_{m \to \infty} R_P(f_S) = R_P$ (7) holds in probability for every distribution $P$ on $X \times Y$. If Eq. (7) is valid almost surely, the classifier $f_S$ is strongly universally consistent.
Universal consistency is a key point in explaining the success of the optimization problem Eq. (3), and in providing it with a solid theoretical basis. Thus, in what follows, we discuss under what conditions universal consistency is guaranteed for the optimization problem Eq. (3).

Assertion
Though researchers have developed many variants based on the idea of TWSVM, there is still no theoretical study of the universal consistency of any variant. Also, no existing technique can be regarded as a direct reference for analyzing the universal consistency of the general optimization problem Eq. (3). Nevertheless, there is one specific technique for investigating the universal consistency of SVMs [25], and this technique can be applied to study the universal consistency of the problem Eq. (3). The reason is that TWSVMs are extensions of SVMs, and the difference between the two problems is that TWSVMs formulate two QPPs, while SVMs formulate only one. The QPPs for TWSVMs and the QPP for SVMs are constructed in a similar way, both with two terms: a loss function term and a regularization function term. Therefore, in order to tackle the universal consistency of the optimization problem Eq. (3), an assertion similar to that in [25] is needed. Before this, we give some definitions which are used in the assertion.
Step 1: Show that there exist two elements $f_{1,P,c_1}, f_{2,P,c_2} \in H$ minimizing the regularized $L_1$-risk and the regularized $L_2$-risk, respectively.
Step 2: Show that the minimal $L_1$-risk $R_{1,P}$ can be achieved at the element $f_{1,P,c_1}$ by the regularized $L_1$-risk with $c_1$ tending to 0, and the minimal $L_2$-risk $R_{2,P}$ can be achieved at $f_{2,P,c_2}$ by the regularized $L_2$-risk with $c_2$ tending to 0.
Assume that the assertion holds true; we can now see how to demonstrate the universal consistency of the optimization problem Eq. (3). First, we can find two elements and show their existence by Step 1, since $S$ is the empirical measure of $P$. Next, upper bounds on the probabilities of the events $|R_{1,S}(f_{1,S,c_1}) - R_{1,P}(f_{1,S,c_1})| \ge \varepsilon$ and $|R_{2,S}(f_{2,S,c_2}) - R_{2,P}(f_{2,S,c_2})| \ge \varepsilon$ can be derived according to Hoeffding's inequality [6]. Furthermore, the two derived inequalities on the upper bounds are exactly the pair of concentration inequalities we want in Step 4. Then, we force the two upper bounds to tend to 0, by letting two sequences $c_1(m)$ and $c_2(m)$ both tend to 0 as $m$ tends to infinity. Following Step 2, we obtain two measurable sequences $f_{1,S,c_1(m)}$ and $f_{2,S,c_2(m)}$ for which the conditions of Step 3 are valid. Finally, universal consistency is guaranteed by virtue of Step 3. Note that the validity of the assertion is so far only an assumption without proof. Thus, in the rest of the paper, we complete the assertion by verifying its validity.
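To give a feel for the kind of concentration inequality referred to above, the following sketch evaluates the two-sided Hoeffding bound for a single fixed function whose loss takes values in an interval of length `loss_range`. It is only illustrative: the classifier learned from S depends on the sample, so the bound for one fixed function does not apply to it directly, which is why the analysis relies on covering numbers or stability. The function names and the bounded-loss assumption are hypothetical.

```python
import math

def hoeffding_bound(m, eps, loss_range=1.0):
    """Upper bound on P(|R_S(f) - R_P(f)| >= eps) for ONE fixed f,
    assuming i.i.d. samples and a loss with values in [0, loss_range]."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps ** 2 / loss_range ** 2))

def samples_needed(eps, delta, loss_range=1.0):
    """Smallest m for which 2*exp(-2*m*eps^2/loss_range^2) <= delta."""
    return math.ceil(loss_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For instance, `samples_needed(0.1, 0.05)` gives the sample size at which the bound for accuracy 0.1 drops below 0.05; the bound decays exponentially in m for fixed eps.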

Theoretical results
Here, we investigate the assertion and use some theorems to support the validity of the assertion in each step. Three theorems are first given to show Steps 1-3 of the assertion, respectively. In the next three subsections, three pairs of concentration inequalities are derived for Step 4 based on different conditions, including covering number, localized covering number and stability, respectively. Theorems are given to show that the universal consistency is valid based on different concentration inequalities. The proofs follow the idea in [25]. Because of space limit for the paper, the detailed proofs of Theorems and Lemmas are put to the supplementary material. Readers can see the supplementary material for the proofs.
Let $k$ be a positive semi-definite kernel, $L_1$, $L_2$ be admissible loss functions, and $\Omega$ be a regularization function. Given the trade-off parameters, define for $c_1, c_2 > 0$ the quantities $\delta_{c_1}$ and $\delta_{c_2}$, which satisfy $0 < \delta_{c_1}, \delta_{c_2} < \infty$. For the loss function $L_{i,c_i}$, $i = 1, 2$, denote by $|\cdot|_1$ the associated supremum.
Theorem 1 Assume $k$ is a continuous kernel on $X$. Let $L_1$ and $L_2$ be two admissible loss functions, and $\Omega$ be a regularization function. There exist two elements $f_{1,P,c_1}, f_{2,P,c_2} \in H$ that minimize the regularized $L_1$-risk and the regularized $L_2$-risk, respectively, for the probability distribution $P$ and for any $c_1, c_2 > 0$. Furthermore, we have $\|f_{1,P,c_1}\|_H \le \delta_{c_1}$ and $\|f_{2,P,c_2}\|_H \le \delta_{c_2}$.
Theorem 1 corresponds to Step 1 of the assertion, and ensures the existence of two elements $f_{1,P,c_1}$ and $f_{2,P,c_2}$ which minimize the regularized $L_1$-risk and the regularized $L_2$-risk, respectively. Here, $\delta_{c_1}$ and $\delta_{c_2}$ are two critical quantities that give upper bounds on the norms of the solutions to the optimization problem Eq. (3), respectively.

Lemma 1 There exist two measurable functions $M_1$ and $M_2$ whose expectations with respect to $P_X$ give the minimal $L_1$-risk and the minimal $L_2$-risk, respectively, where $P_X$ is the marginal distribution of the probability distribution $P$ on $X$.
Lemma 1 converts the minimal $L_1$-risk and the minimal $L_2$-risk to the expectations of $M_1$ and $M_2$, respectively. It is necessary for the proof of Step 2.
Theorem 2 Assume $k$ is a universal kernel on $X$. Let $L_1$ and $L_2$ be two admissible loss functions, and $\Omega$ be a regularization function. Then, for the probability distribution $P$, the minimal $L_1$-risk $R_{1,P}$ and the minimal $L_2$-risk $R_{2,P}$ are attained in the limit by the regularized risks at $f_{1,P,c_1}$ and $f_{2,P,c_2}$ as $c_1, c_2$ tend to 0.
Theorem 2 corresponds to Step 2 of the assertion, and guarantees that the minimal $L_1$-risk and the minimal $L_2$-risk can be achieved at $f_{1,P,c_1}$ and $f_{2,P,c_2}$, respectively, with $c_1, c_2$ tending to 0.

Universal consistency based on covering number
Now we turn to Step 4 of the assertion, and derive a pair of concentration inequalities based on the covering number [33].

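Definition 9 Given a metric space $(M, d)$, the covering number of $M$ is defined as $N((M, d), \varepsilon) = \min\{n \in \mathbb{N} : \exists\, x_1, \ldots, x_n \in M \text{ with } M \subseteq \bigcup_{i=1}^{n} B(x_i, \varepsilon)\}$, where $B(x, \varepsilon)$ is a closed ball with center at the point $x$ and radius $\varepsilon > 0$.
Instead of using the covering number directly, its logarithmic form is employed more frequently, which is denoted as $H((M, d), \varepsilon) = \ln N((M, d), \varepsilon)$.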
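For a finite point set, a cover by closed balls can be built greedily, which gives a quick numerical handle on the covering number. The sketch below assumes the Euclidean metric and a finite set of points given as rows of an array; the greedy count is an upper bound on the covering number and need not be the minimum. The function name is hypothetical.

```python
import numpy as np

def greedy_cover_size(points, eps):
    """Number of closed eps-balls used by a greedy cover of a finite point set.

    Centers are chosen among the points themselves; the result upper-bounds
    the covering number with respect to the Euclidean metric.
    """
    remaining = np.asarray(points, dtype=float)
    centers = 0
    while len(remaining) > 0:
        center = remaining[0]                        # pick any uncovered point
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > eps]            # drop points in the closed ball
        centers += 1
    return centers
```

Taking the natural logarithm of the returned count gives the corresponding upper bound on the logarithmic covering number.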
In addition, we have to measure the continuity of a function. Given a loss function $L_1$, the modulus of continuity and the inverted modulus of continuity [2] of the function are denoted by $w(L_1, \cdot)$ and $w^{-1}(L_1, \cdot)$, respectively. With these definitions, we formulate the pair of concentration inequalities for the optimization problem Eq. (3) in the following lemma, and establish the consistency result in the following theorem.

Lemma 2 Assume $k$ is a continuous kernel on $X$. Let $L_1$ and $L_2$ be two admissible loss functions, and $\Omega$ be a regularization function. For the probability distribution $P$, a pair of concentration inequalities holds, where $\Pr$ is the joint probability of the data $(x_1, y_1) \times (x_2, y_2) \times \ldots \times (x_m, y_m)$ from the training set $S$.
Note that each sample $(x_i, y_i) \in S \subseteq X \times Y$, $i = 1, \ldots, m$, follows the probability distribution $P$; then $(x_1, y_1) \times (x_2, y_2) \times \ldots \times (x_m, y_m) \in (X \times Y)^m$ follows the probability distribution $P^m$.
Lemma 2 and Theorem 4 correspond to the fourth step of the assertion. By virtue of the covering numbers of $\delta_{c_1} I$ and $\delta_{c_2} I$, the pair of concentration inequalities is derived first. Based on them, the universal consistency of TWSVMs is then obtained. Next we give an example to explain the case in detail.

Example 4 Given a TWSVM variant like TBSVM [21], we consider the nonlinear case and use the Gaussian kernel $k$ on $X \subset \mathbb{R}^d$. There is an upper bound for the covering number of $I$ in [33], namely $H(I, \varepsilon) \le a \left(\ln \frac{1}{\varepsilon}\right)^{d+1}$ for some positive constant $a$. It then follows that the classifier is universally consistent by Theorem 4.

Universal consistency based on localized covering number
Sometimes, we suggest the pair of concentration inequalities based on the localized covering number, instead of the covering number. Given a function set $F = \{f : X \to \mathbb{R}\}$, the localized covering number of $F$ is $N(F, m, \varepsilon) = \sup\{N((F_{|X_0}, \ell_\infty^{|X_0|}), \varepsilon) : X_0 \subseteq X, |X_0| \le m\}$ for any $\varepsilon > 0$ and $m \ge 1$, where $\ell_\infty^{|X_0|}$ is the space $\mathbb{R}^{|X_0|}$ with the maximum norm, and $F_{|X_0} = \{f_{|X_0} : f \in F\}$ can be regarded as a subset of $\ell_\infty^{|X_0|}$. The logarithm of the localized covering number is $H(F, m, \varepsilon) = \ln N(F, m, \varepsilon)$. Now we obtain another pair of concentration inequalities for the optimization problem Eq. (3) in the following lemma, and derive the universal consistency in the following theorem.
Lemma 3 presents another pair of concentration inequalities based on the localized covering numbers of $\delta_{c_1} I$ and $\delta_{c_2} I$, and Theorem 5 discusses the conditions under which universal consistency is valid for TWSVMs. Thus, Lemma 3 and Theorem 5 validate Step 4 of the assertion. In the following, an example illustrates the universal consistency.
Example 5 Given a TWSVM variant, we study the nonlinear case. The universal kernel $k(x, x') = \sum_{n=0}^{\infty} a_n \Phi_n(x) \Phi_n(x')$ is advised here, where $a_n > 0$, $n = 0, 1, 2, \ldots$, and $\Phi_n : X \to \mathbb{R}$ are all continuous functions uniformly bounded in the $\|\cdot\|_\infty$-norm. The localized covering number of $I$ is upper bounded [24]. By virtue of Theorem 5, the classifier is then universally consistent.

Universal consistency based on stability
For practical problems, the case of convex loss functions and the regularization function $\Omega(c, t) = c t^2$ is often considered for the optimization problem Eq. (3). In this case, a stable classifier is always employed. Here, the property of stability [3] is redefined as follows. Replace the $i$-th sample of the training set $S$ with $(x, y)$, and denote the new set by $S_{i,(x,y)}$. If there exist two sequences $(\beta_1(m))$ and $(\beta_2(m))$ such that the stability inequalities are valid for any $(x', y') \in X \times Y$, the classifier $f_S$ is stable with respect to the sequences $(\beta_1(m))$ and $(\beta_2(m))$.

Lemma 4 Let $L_1$ and $L_2$ be two convex loss functions, $\Omega(c, f) = c\|f\|^2$, and $(c_1(m))$, $(c_2(m))$ be two sequences for the regularization function. Then the classifier is stable with respect to two sequences.
In Lemma 4, the classifier of the optimization problem Eq. (3) is shown to be stable for the convex loss functions $L_1$, $L_2$ and for the regularization function $\Omega(c, t) = c t^2$. Below, we verify Step 4 of the assertion. In Lemma 5, a pair of concentration inequalities is derived under the condition that the classifier is stable. With these inequalities, the universal consistency of the optimization problem Eq. (3) is then obtained in Theorem 6.

Conclusion
In this paper, the universal consistency of TWSVMs for binary classification is addressed. Since many variants of TWSVM have been proposed, we first summarize a general framework of TWSVMs, which covers most of the TWSVM variants. We then perform theoretical study on universal consistency of the general framework in detail by defining an assertion. This assertion consists of four steps. In the first three steps, the regularized L 1 -risk and regularized L 2 -risk are introduced to build connections with the Bayes risk. In the last step, some pairs of concentration inequalities are derived based on different conditions, including covering number, localized covering number and stability. Universal consistency in different situations is proved based on different pairs of concentration inequalities, respectively.