1 Introduction

The support vector machine (SVM) was proposed by Vapnik et al. [1]. The goal of SVM is to find an optimal hyperplane that separates the labeled data points into two classes. Owing to its excellent performance in text classification tasks [2], it quickly became a mainstream machine learning technique. At present, SVM and its variants have been successfully applied in many fields such as face recognition [3], financial distress prediction [4], regression [5], traffic flow prediction [6], medical applications [7] and more. The proximal support vector machine (PSVM) [8, 9] was derived from SVM; it aims to find two parallel hyperplanes such that each plane is closer to one of the two classes and as far away from the other as possible. Furthermore, in order to simplify the constraints, the generalized eigenvalue proximal support vector machine (GEPSVM) [10] was proposed, whose main idea is to replace the two parallel hyperplanes with two nonparallel ones. Following this idea, Jayadeva et al. [11] proposed the well-known twin support vector machine (TSVM). Unlike the single large quadratic programming problem (QPP) solved by the traditional SVM, TSVM solves a pair of smaller QPPs, where the constraints of each QPP involve only the data points of one of the two classes. Therefore, TSVM not only keeps the advantages of SVM but also trains approximately four times faster. Based on TSVM, Shao et al. [12] proposed a weighted Lagrangian twin support vector machine (WLTSVM) for imbalanced data classification. Other extensions and applications of TSVM can be found in [13, 14].

Recently, research on semi-supervised learning (SSL) [15,16,17] has become a new hotspot in machine learning. The main reason is that in many practical problems labeled data are scarce while unlabeled data are abundant. SSL uses the unlabeled data to assist the small amount of labeled data during learning, so as to improve the performance of the classifier. Manifold regularization (MR) [18, 19] is one of the frameworks of SSL. The MR framework contains two regularization terms: one controls the complexity of the classifier in the Reproducing Kernel Hilbert Space (RKHS), and the other controls the complexity as measured by the geometry of the data distribution. Following the MR framework, Qi et al. [20] proposed the Laplacian twin support vector machine (Lap-TSVM), the first twin support vector machine applied to the SSL problem. Extensive experimental results show that Lap-TSVM performs very well in semi-supervised classification. Other extensions and applications of semi-supervised twin support vector machines can be found in [21, 22].

In general, data contain noise arising from faulty instruments, flawed measurements or faulty communication, and learning from such data, whether for classification or regression, is inevitably affected by it. If the training samples are contaminated by noise, SVM and its variants are often unable to find an optimal hyperplane and consequently struggle to obtain satisfactory results. To address this problem, the fuzzy support vector machine (FSVM) [23] was proposed. The idea of FSVM is to assign a membership value to each training sample; introducing the membership function effectively reduces the effect of noise and outlier points and thus produces a robust classifier. Moreover, combining TSVM with a membership function not only improves computational efficiency but also yields robust performance. In recent years, the intuitionistic fuzzy twin support vector machine (IFTSVM) [24] has been proposed, which assigns a pair of membership and nonmembership degrees to every training sample. These two degrees help IFTSVM reduce the influence of noise and distinguish support vectors from noise.

The same difficulty is encountered by current semi-supervised twin support vector machines and their variants: when the data contain much noise, the classification results are unsatisfactory. Ideally, we would like to determine which points are noisy and then either remove them or greatly lower their weight. Therefore, inspired by the ideas of IFTSVM, we assign a pair of membership and nonmembership degrees to each labeled point, which reduces the influence of noise on the classifier. Combining the intuitionistic fuzzy membership functions with Lap-TSVM, in this paper we propose a novel intuitionistic fuzzy Laplacian twin support vector machine (IFLap-TSVM) for semi-supervised classification, and we evaluate its effectiveness on constructed tests and several real datasets. The main advantages of our IFLap-TSVM are:

  (1) Membership and nonmembership functions are assigned to each training sample to indicate its contribution to the learning of the decision functions, which significantly reduces the negative impact of noise and outliers on classification accuracy.

  (2) The intuitionistic fuzzy numbers reduce the influence of noise and outliers in the labeled samples, and the semi-supervised manifold regularization framework is introduced to handle labeled and unlabeled samples in the primal space and the feature space. The combination of the two further improves classification accuracy.

  (3) IFLap-TSVM achieves better classification accuracy than the state-of-the-art TSVM, IFTSVM and Lap-TSVM on constructed tests and real-world datasets.

The remaining parts of this paper are organized as follows. In Sect. 2, we briefly introduce the background of SSL and Lap-TSVM. In Sect. 3, we describe the details of IFLap-TSVM. In Sect. 4, numerical results on the constructed test datasets, UCI datasets and the MNIST dataset are reported. Finally, Sect. 5 concludes the paper.

2 Background

In this section, we give a brief description of the semi-supervised learning (SSL) framework and Lap-TSVM. The training data of the classification problem can be described as follows:

$$\begin{aligned} T=\{(x_1,y_1),(x_2,y_2),\cdots ,(x_l,y_l),x_{l+1},\cdots ,x_{l+u}\} , \end{aligned}$$
(1)

where \(x_i \in \mathbb {R}^{n}, y_i\in \{+1,-1\}, i=1,2,\cdots ,l,\) are the labeled data, and \(x_i, i=l+1,\cdots ,l+u,\) are the unlabeled data. Denote by \(A \in \mathbb {R}^{l_1\times n}\) the matrix of labeled data belonging to class \(+1\), where every row of A represents a data point. Similarly, denote by \(B \in \mathbb {R}^{l_2\times n}\) the matrix of labeled data belonging to class \(-1\). Clearly, \(l_1+l_2=l\).

2.1 Semi-supervised Learning Framework

SSL uses both labeled and unlabeled data to improve supervised learning. The goal is to build a more effective classifier from large amounts of unlabeled data and relatively few labeled data. Regularization is a technique to prevent overfitting of training data and is widely used in machine learning [25]. The MR framework exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. The decision function of the MR framework can be expressed as [18]

$$\begin{aligned} f^{*}=\mathop {\arg \min }_{f \in \mathcal {H}_{K}} \frac{1}{l} \sum _{i=1}^{l} V\left( x_{i}, y_{i}, f\right) +\gamma _{A}\Vert f\Vert _{K}^{2}+\gamma _{I}\Vert f\Vert _{I}^{2}, \end{aligned}$$
(2)

where f is an unknown decision function. The first term of the above expression is a loss function on the labeled data. The second term is a regularization term: \(\gamma _{A}\) is the weight of \(\Vert f\Vert _{K}^{2}\) and controls the complexity of f in the Reproducing Kernel Hilbert Space, while \(\gamma _{I}\) is the weight of \(\Vert f\Vert _{I}^{2}\) and controls the complexity of the function in the intrinsic geometry of the marginal distribution, where \(\Vert f\Vert _{I}^{2}\) is an appropriate penalty term that should reflect the intrinsic structure of the marginal distribution.

The MR framework [18] incorporates additional information about the geometric structure of the marginal distribution. The key assumption of this approach is that the probability distribution of the data has the geometric structure of a Riemannian manifold \(\mathcal {M}\): if two points are close in the intrinsic geometry, then they should have the same or similar labels. The RKHS regularization term \(\Vert f\Vert _{K}^{2}\) and the intrinsic regularizer \(\Vert f\Vert _{I}^{2}\) are as follows:

$$\begin{aligned} \Vert f\Vert _{K}^{2}&= \Vert f\Vert _2^2, \end{aligned}$$
(3)
$$\begin{aligned} \Vert f\Vert _{I}^{2}&=\frac{1}{(l+u)^2}\sum _{i,j=1}^{l+u} (f(x_i)-f(x_j))^2W_{ij}\nonumber \\&=\frac{1}{(l+u)^2}f(X)^{\top }Lf(X), \end{aligned}$$
(4)

where \(f(X) =[f(x_1), \cdots , f(x_{l+u})]^{\top }\) represents the decision function values over the labeled and unlabeled points, \(W_{ij}\) are the edge weights in the data adjacency graph, L is the graph Laplacian given by \(L=D-W\), \(W \in \mathbb {R}^{(l+u)\times (l+u)}\) is the weight matrix with entries \(W_{ij}\), and D is the diagonal matrix with i-th diagonal entry \(D_{ii}=\sum _{j=1}^{l+u}W_{ij}\). A more detailed discussion of manifold regularization can be found in [18].
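
For illustration only (this helper is not part of the original formulation), the unnormalized graph Laplacian can be assembled from a given weight matrix as follows; the function name and the assumption that `W` is a dense symmetric NumPy array are ours.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    D = np.diag(W.sum(axis=1))  # D_ii = sum_j W_ij
    return D - W
```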

2.2 Laplacian Twin Support Vector Machine

Based on TSVM, the Lap-TSVM [20] model is derived by introducing the semi-supervised learning framework. For the linear case, the primal problems of the linear Lap-TSVM can be written as

$$\begin{aligned} \begin{aligned}&\min _{{ w}_{1},{ b}_{1},\xi }&\frac{1}{2}||Aw_{1}+e_{1} b_{1}||_{2}^{2}+c_{1}e_{2}^{\top }\xi +\frac{1}{2}c_{2}(||w_{1}||_{2}^{2}+b_{1}^2)\\&\quad&+\frac{1}{2}c_3(Mw_{1}+eb_{1})^{\top }L(Mw_{1}+eb_{1}) \\&\quad {\mathrm{s.t.}}&-(Bw_{1}+e_{2}b_{1})+\xi \geqslant e_{2}, \quad \xi \geqslant 0, \end{aligned} \end{aligned}$$
(5)

and

$$\begin{aligned} \begin{aligned}&\min _{{ w}_{2},{ b}_{2}, \eta }&\frac{1}{2}||Bw_{2}+e_{2} b_{2}||_2^2+c_1e_{1}^{\top }\eta +\frac{1}{2}c_2(||w_{2}||_2^2+b_{2}^2)\\&\quad&+\frac{1}{2}c_3(Mw_{2}+eb_{2})^{\top }L(Mw_{2}+eb_{2}) \\&\quad {\mathrm{s.t.}}&(Aw_{2}+e_{1}b_{2})+\eta \geqslant e_{1}, \quad \eta \geqslant 0. \end{aligned} \end{aligned}$$
(6)

Here, \(W_{ij}\) are the edge weights in the data adjacency graph, which may be defined over k-nearest neighbors with a Gaussian (heat) kernel as follows:

$$\begin{aligned} W_{ij}=\left\{ \begin{aligned}&\exp (-||x_i-x_j||_2^2/2\sigma ^2),&\quad \text {if } x_i,x_j \text { are neighbors}; \\&0,&\quad \text {otherwise}. \end{aligned} \right. \end{aligned}$$
(7)

\(f_{1}=[f_{1}(x_1),\cdots ,f_{1}(x_{l+u})]^{\top }=Mw_{1}+eb_{1}\) and \(f_{2} =[f_{2}(x_1),\cdots ,f_{2}(x_{l+u})]^{\top }=Mw_{2}+eb_{2}\), where \(M \in \mathbb {R}^{(l+u)\times n}\) contains all the training data and e is a column vector of ones of appropriate dimension.
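
As a minimal sketch of the adjacency weights in Eq. (7), the following builds a symmetrized k-nearest-neighbor graph with Gaussian weights; the neighborhood size `k`, the bandwidth `sigma` and the use of plain Euclidean distances are illustrative choices, not fixed by the paper.

```python
import numpy as np

def knn_gaussian_weights(X, k=5, sigma=1.0):
    """Edge weights of Eq. (7): Gaussian weights on a symmetrized k-NN graph."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # pairwise squared distances
    n = X.shape[0]
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]  # k nearest neighbors, excluding the point itself
    for i in range(n):
        W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2.0 * sigma**2))
    return np.maximum(W, W.T)  # symmetrize the neighborhood relation
```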

By introducing Lagrangian multipliers, the Wolfe duals of problems (5) and (6) can be formulated as

$$\begin{aligned} \begin{aligned}&{{\max _{\alpha }}}&e_{2}^{\top } \alpha - \frac{1}{2}\alpha ^{\top }G(H^{\top }H+c_2I+c_3J^{\top }LJ)^{-1}G^{\top }\alpha \\&{\mathrm{s.t.}}&0 \leqslant \alpha \leqslant c_1e_{2}, \end{aligned} \end{aligned}$$
(8)

and

$$\begin{aligned} \begin{aligned}&{{\max _{\beta }}}&e_{1}^{\top } \beta - \frac{1}{2}\beta ^{\top }H(G^{\top }G+c_2I+c_3J^{\top }LJ)^{-1}H^{\top }\beta \\&{\mathrm{s.t.}}&0 \leqslant \beta \leqslant c_1e_{1}. \end{aligned} \end{aligned}$$
(9)

Here \(G=[B~e_{2}]\), \(H=[A~e_{1}]\) and \(J=[M~e]\), and I is an identity matrix of appropriate dimension. It can be proved that \(H^{\top }H+c_2I+c_3J^{\top }LJ\) and \(G^{\top }G+c_2I+c_3J^{\top }LJ\) are positive definite matrices [26]. The augmented vectors \(v_1,v_2\) are given by

$$\begin{aligned} \begin{aligned}&v_1=-(H^{\top }H+c_2I+c_3J^{\top }LJ)^{-1}G^{\top }\alpha ,&\quad \text {where }v_1=[w_{1}^{\top }~b_{1}]^{\top }, \\&v_2=(G^{\top }G+c_2I+c_3J^{\top }LJ)^{-1}H^{\top }\beta ,&\quad \text {where }v_2=[w_{2}^{\top }~b_{2}]^{\top }. \end{aligned} \end{aligned}$$
(10)

As in TSVM, the decision function of Lap-TSVM is as follows:

$$\begin{aligned} f(x)=\mathop {\arg \min }_{i\in 1,2} |w_{i}^{\top }x+b_{i}|, \end{aligned}$$
(11)

where \(|w_{i}^{\top }x+b_{i}|\) measures the proximity of the point x to the plane \(w_{i}^{\top }x+b_{i}=0\). For the nonlinear case, we refer to [20].

3 Intuitionistic Fuzzy Laplacian Twin Support Vector Machine

In this section, we first describe the concept of the intuitionistic fuzzy set and then propose the IFLap-TSVM model. Both the linear and the nonlinear (kernel) formulations are discussed in detail.

3.1 Intuitionistic Fuzzy Set

The traditional fuzzy set was introduced by Zadeh [27]. Let X be a nonempty set; the fuzzy set A on the universe X can be defined as

$$\begin{aligned} A=\{(x,\mu _A(x))|x\in X\}, \end{aligned}$$
(12)

where \(\mu _A : X \rightarrow [0,1]\) and \(\mu _A(x)\) is the degree of membership of x in A.

As an extension of fuzzy set, an intuitionistic fuzzy set [28] is defined as

$$\begin{aligned} \tilde{A}=\{(x,\mu _{\tilde{A}}(x), \nu _{\tilde{A}}(x))|x\in X \}, \end{aligned}$$
(13)

where \(\mu _{\tilde{A}}(x)\) and \(\nu _{\tilde{A}}(x)\) are the degrees of membership and nonmembership of x in \(\tilde{A}\). Here \(\mu _{\tilde{A}}: X \rightarrow [0,1]\), \(\nu _{\tilde{A}}: X \rightarrow [0,1]\) and \(0 \leqslant \mu _{\tilde{A}}(x)+\nu _{\tilde{A}}(x) \leqslant 1\). The quantity \(\pi _{\tilde{A}}(x)=1-\mu _{\tilde{A}}(x)-\nu _{\tilde{A}}(x)\) denotes the hesitation degree of x with respect to \(\tilde{A}\).

It is important to select an appropriate membership function to reduce the effect of noise and outlier points. For example, as shown in Fig. 1, the training points A and B are both located on the boundary of the positive class, so their degrees of membership in the positive class are the same; however, there are many negative points around point B. Therefore, points A and B contribute differently to the classification, and considering the membership degree alone may lead to wrong predictions. In this case, we assign an intuitionistic fuzzy number \((\mu ,\nu )\) to each training point as proposed in [29], where \(\mu \) is the degree of membership with respect to one class and \(\nu \) is the degree of nonmembership with respect to the other class. With this construction, the points A and B in the positive class have different degrees of nonmembership.

Fig. 1 Two training points with the same degree of membership

3.1.1 The Degree of Membership Function

In the high-dimensional feature space, the distance between a training point and its class center is used to define the membership function. The distance between two training points is expressed as

$$\begin{aligned} D(\phi (x_i),\phi (x_j))=||\phi (x_i)-\phi (x_j)||, \end{aligned}$$
(14)

where \(\phi \) represents the mapping from the sample space to the high-dimensional feature space.

The class center of each class is given by

$$\begin{aligned} C^{\pm }=\frac{1}{l^{\pm }} \sum _{y_i=\pm 1} \phi (x_i), \end{aligned}$$
(15)

where \(l^{+}\) and \(l^{-}\) denote the total numbers of positive and negative points, respectively.

The radius of each class can be measured by

$$\begin{aligned} r^{\pm }=\max _{y_i =\pm 1}||\phi (x_i)-C^{\pm }||. \end{aligned}$$
(16)

For each training point, the degree of membership can be defined as

$$\begin{aligned} \mu (x_i)=\left\{ \begin{aligned}&1-\frac{||\phi (x_i)-C^{+}||}{r^{+}+\delta },&\quad y_i=+1, \\&1-\frac{||\phi (x_i)-C^{-}||}{r^{-}+\delta },&\quad y_i=-1, \end{aligned} \right. \end{aligned}$$
(17)

where \(\delta >0\) is an adjustable parameter.
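
Since \(\Vert \phi (x_i)-C^{\pm }\Vert ^2\) expands into kernel evaluations, the membership degrees of Eqs. (14)–(17) can be computed from a Gram matrix alone. The sketch below assumes a precomputed Gram matrix `K` over the labeled points; the function name and the default `delta` are ours.

```python
import numpy as np

def membership_degrees(K, y, delta=1e-4):
    """Degrees of membership of Eq. (17) from a kernel Gram matrix K and labels y in {+1, -1}."""
    mu = np.zeros(len(y))
    for c in (+1, -1):
        idx = np.where(y == c)[0]
        Kc = K[np.ix_(idx, idx)]
        # squared feature-space distance of each class-c point to the class center C^c:
        # ||phi(x_i) - C^c||^2 = K_ii - (2/l_c) sum_j K_ij + (1/l_c^2) sum_{j,k} K_jk
        d2 = np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean()
        d = np.sqrt(np.maximum(d2, 0.0))
        r = d.max()                      # class radius of Eq. (16)
        mu[idx] = 1.0 - d / (r + delta)  # Eq. (17)
    return mu
```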

3.1.2 The Degree of Nonmembership Function

The degree of nonmembership is determined by the ratio between the number of heterogeneous points (points with a different label) and the total number of training points in the neighborhood of a point. It is defined as

$$\begin{aligned} \nu (x_i)=(1-\mu (x_i))\rho (x_i), \end{aligned}$$
(18)

where \(0\leqslant \mu (x_i)+\nu (x_i) \leqslant 1\) and \(\rho (x_i)\) is the proportion of heterogeneous points among all points in the neighborhood of \(x_i\):

$$\begin{aligned} \rho (x_i)=\frac{|\{x_j| ||\phi (x_i)-\phi (x_j)|| \leqslant \alpha , y_j\ne y_i\}|}{|\{x_j|||\phi (x_i)-\phi (x_j)||\leqslant \alpha \}|}, \end{aligned}$$
(19)

where \(|\cdot |\) denotes the cardinality and \(\alpha >0\) is an adjustable parameter.

The degrees of membership and nonmembership of a training point are defined through inner-product distances in the feature space. Therefore, kernel functions can be used to construct the intuitionistic fuzzy numbers.
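
Continuing the sketch above under the same assumptions (a Gram matrix `K` over the labeled points and labels `y`), the nonmembership degrees of Eqs. (18) and (19) might be computed as follows; the default neighborhood radius `alpha` is illustrative.

```python
import numpy as np

def nonmembership_degrees(K, y, mu, alpha=1.0):
    """Degrees of nonmembership of Eqs. (18)-(19) from a Gram matrix K, labels y and memberships mu."""
    diag = np.diag(K)
    # feature-space pairwise distances ||phi(x_i) - phi(x_j)||
    d = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))
    nu = np.zeros(len(y))
    for i in range(len(y)):
        nbr = d[i] <= alpha                              # neighborhood of x_i (includes x_i)
        rho = np.sum(nbr & (y != y[i])) / np.sum(nbr)    # Eq. (19)
        nu[i] = (1.0 - mu[i]) * rho                      # Eq. (18)
    return nu
```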

3.1.3 The Score Function

Based on the above definitions, the training points can be converted into the intuitionistic fuzzy numbers as follows:

$$\begin{aligned} T=\{(x_1,y_1,\mu _1,\nu _1),(x_2,y_2,\mu _2,\nu _2),\cdots ,(x_l,y_l,\mu _l,\nu _l)\}, \end{aligned}$$
(20)

where \(\mu _i, \nu _i\) denote the degrees of membership and nonmembership of \(x_i\), respectively. For each intuitionistic fuzzy number, a score function is used to measure the classification contribution of the corresponding training point. The score function is defined as

$$\begin{aligned} s_i=\left\{ \begin{aligned}&\mu _i,&\quad \nu _i=0, \\&0,&\quad \mu _i \leqslant \nu _i,\\&\frac{1-\nu _i}{2-\mu _i-\nu _i},&\quad \text {otherwise}. \end{aligned} \right. \end{aligned}$$
(21)

The score value \(s_i\) easily distinguishes support vectors from noise and outlier points. For example, when \(\nu _i=0\) (the positive point A shown in Fig. 2), there are no negative points in the neighborhood of A, so a correct degree of membership can be assigned directly; since A is far away from its class center, its classification contribution is small. When \(\mu _i \leqslant \nu _i\) (the negative point B shown in Fig. 2), point B has no negative points in its neighborhood, and its degree of nonmembership exceeds its degree of membership; thus B is a noise point with zero classification contribution. For the positive point C, we have \(\mu _i >\nu _i\) and \(\nu _i \ne 0\): C is far away from the class center, but there are some positive points in its neighborhood, so it may be a support vector rather than an outlier. Hence, the classification contribution of point C is greater than that of the outlier A.
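
A direct transcription of the score function (21) could look like this (a sketch; the function name is ours):

```python
import numpy as np

def score_values(mu, nu):
    """Score function of Eq. (21): classification contribution of each training point."""
    s = np.empty_like(mu, dtype=float)
    for i, (m, v) in enumerate(zip(mu, nu)):
        if v == 0:
            s[i] = m
        elif m <= v:
            s[i] = 0.0          # treated as noise
        else:
            s[i] = (1.0 - v) / (2.0 - m - v)
    return s
```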

Fig. 2 Support vector, noise and outlier

3.2 Linear IFLap-TSVM

According to the semi-supervised learning framework, the squared loss and hinge loss functions \(V(x_i, y_i,f)\) can be expressed as

$$\begin{aligned} V_{1}(x_i, y_i,f_{1})&= ((A_{i_{, \cdot }} \cdot w_1) + b_1)^2 + S_{2,i} \cdot \mathrm{max}(0, 1-f_{1}(B_{i_{, \cdot }})), \end{aligned}$$
(22)
$$\begin{aligned} V_{2}(x_i, y_i,f_{2})&= ((B_{i_{, \cdot }} \cdot w_2) + b_2)^2 + S_{1,i} \cdot \mathrm{max}(0, 1-f_{2}(A_{i_{, \cdot }})), \end{aligned}$$
(23)

where \(A_{i_{, \cdot }}\) and \(B_{i_{, \cdot }}\) represent the i-th rows of A and B, respectively, and \(S_{1,i}\) and \(S_{2,i}\) denote the i-th elements of the vectors \(S_1\) and \(S_2\). Here \(S_1\in \mathbb {R}^{l_{1}}\) and \(S_2 \in \mathbb {R}^{l_{2}}\) are the score values of the positive and negative points, respectively.

The regularization terms \(\Vert f_{1} \Vert _{K}^{2}\) and \(\Vert f_{2} \Vert _{K}^{2}\) can be written as

$$\begin{aligned} \Vert f_{1} \Vert _{K}^{2}&= \frac{1}{2}(||w_{1}||_{2}^{2}+b_{1}^2),\end{aligned}$$
(24)
$$\begin{aligned} \Vert f_{2} \Vert _{K}^{2}&= \frac{1}{2}(||w_{2}||_{2}^{2}+b_{2}^2). \end{aligned}$$
(25)

And the manifold regularization terms \(\Vert f_1 \Vert _{I}^{2}\) and \(\Vert f_2 \Vert _{I}^{2}\) are defined by

$$\begin{aligned} \Vert f_1 \Vert _{I}^{2}&= \frac{1}{(l+u)^2}f_1^{\top }Lf_1, \end{aligned}$$
(26)
$$\begin{aligned} \Vert f_2 \Vert _{I}^{2}&= \frac{1}{(l+u)^2}f_2^{\top }Lf_2. \end{aligned}$$
(27)

In accordance with (2), the linear IFLap-TSVM can be written as

$$\begin{aligned} \begin{aligned}&{\min _{w_{1},b_{1},\xi }}&\frac{1}{2}||Aw_{1}+e_{1} b_{1}||_{2}^{2}+c_{1} S_{2}^{\top }\xi +\frac{1}{2}c_{2}(||w_{1}||_{2}^{2}+b_{1}^2)\\&\quad&+\frac{1}{2}c_3(Mw_{1}+eb_{1})^{\top }L(Mw_{1}+eb_{1}) \\&\quad {\mathrm{s.t.}}&-(Bw_{1}+e_{2}b_{1})+\xi \geqslant e_{2}, \quad \xi \geqslant 0, \end{aligned} \end{aligned}$$
(28)

and

$$\begin{aligned} \begin{aligned}&{\min _{w_{2},b_{2}, \eta }}&\frac{1}{2}||Bw_{2}+e_{2} b_{2}||_2^2+c_4 S_{1}^{\top }\eta +\frac{1}{2}c_5(||w_{2}||_2^2+b_{2}^2)\\&\quad&+\frac{1}{2}c_6(Mw_{2}+eb_{2})^{\top }L(Mw_{2}+eb_{2}) \\&\quad {\mathrm{s.t.}}&(Aw_{2}+e_{1}b_{2})+\eta \geqslant e_{1}, \quad \eta \geqslant 0, \end{aligned} \end{aligned}$$
(29)

where \(c_1,c_2,\cdots ,c_6\) are pre-specified penalty parameters, \(\xi , \eta \) are slack variables, \(e_1,e_2,e\) are column vectors of ones of appropriate dimensions, and L is the graph Laplacian.

The Lagrangian corresponding to the problem (28) is given by

$$\begin{aligned} \begin{aligned} L(w_1,b_1,\xi ,\alpha ,\beta )=&\frac{1}{2}||Aw_1+e_1b_1||_2^{2}+c_1S_2^{\top }\xi +\frac{1}{2}c_2(||w_1||_2^2+b_1^2)\\&+\frac{1}{2}c_3(Mw_{1}+eb_{1})^{\top }L(Mw_{1}+eb_{1})\\&-\alpha ^{\top }(-(Bw_{1}+e_{2}b_{1})+\xi -e_2)-\beta ^{\top }\xi , \end{aligned} \end{aligned}$$
(30)

where \(\alpha =(\alpha _1,\cdots ,\alpha _{l_2})^{\top }\) and \(\beta =(\beta _1,\cdots ,\beta _{l_1})^{\top }\) are the Lagrangian multipliers.

With the KKT conditions, we get

$$\begin{aligned} \frac{\partial L}{\partial w_1}= & {} A^{\top }(Aw_1+e_1b_1)+c_2w_1\nonumber \\&+c_3M^{\top }L(Mw_1+eb_1)+B^{\top }\alpha =0, \end{aligned}$$
(31)
$$\begin{aligned} \frac{\partial L}{\partial b_1}= & {} e_1^{\top }(Aw_1+e_1b_1)+c_2b_1~\nonumber \\&+c_3e^{\top }L(Mw_1+eb_1)+e_2^{\top }\alpha =0,~~~ \end{aligned}$$
(32)
$$\begin{aligned} \frac{\partial L}{\partial \xi }= & {} c_1S_2-\alpha -\beta =0. \end{aligned}$$
(33)

Combining (31) and (32) leads to

$$\begin{aligned} \begin{array}{l} {\left[ \begin{array}{l} A^{\top } \\ e_{1}^{\top } \end{array}\right] \left[ A~e_{1}\right] \left[ \begin{array}{l} w_{1} \\ b_{1} \end{array}\right] +c_{2}\left[ \begin{array}{l} w_{1} \\ b_{1} \end{array}\right] } \\ +c_{3}\left[ \begin{array}{l} M^{\top } \\ e^{\top } \end{array}\right] L\left[ M~e\right] \left[ \begin{array}{l} w_{1} \\ b_{1} \end{array}\right] +\left[ \begin{array}{l} B^{\top } \\ e_{2}^{\top } \end{array}\right] \alpha =0. \end{array} \end{aligned}$$
(34)

Let \(H=[A~e_1]\), \(J=[M~e]\), \(G=[B~e_2]\), and let the augmented vector be \(v_1=[w_1^{\top }~b_1]^{\top }\); then (34) can be rewritten as

$$\begin{aligned} \begin{aligned}&(H^{\top }H+c_2I+c_3J^{\top }LJ)v_1+G^{\top }\alpha =0\\&\quad \Rightarrow v_1=-(H^{\top }H+c_2I+c_3J^{\top }LJ)^{-1}(G^{\top }\alpha ), \end{aligned} \end{aligned}$$
(35)

where I is an identity matrix of appropriate dimensions. It can be proved that \(H^{\top }H+c_2I+c_3J^{\top }LJ\) is a positive definite matrix according to matrix theory [26].

Since \(\beta \geqslant 0\), from (33), we get

$$\begin{aligned} 0\leqslant \alpha \leqslant c_1S_2. \end{aligned}$$
(36)

Therefore, the Wolfe dual of the problem (28) can be written as

$$\begin{aligned} \begin{aligned}&\max _{\alpha }&e_{2}^{\top } \alpha - \frac{1}{2}\alpha ^{\top }G(H^{\top }H+c_2I+c_3J^{\top }LJ)^{-1}G^{\top }\alpha \\&\quad {\mathrm{s.t.}}&0 \leqslant \alpha \leqslant c_1S_{2}. \end{aligned} \end{aligned}$$
(37)
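
To illustrate how the box-constrained dual (37) might be solved in practice, the sketch below uses a general-purpose bounded solver (SciPy's L-BFGS-B) instead of a dedicated QP solver; this solver choice is our assumption, not the paper's. The routine also recovers \(v_1\) via (35).

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_1(H, G, J, L, S2, c1, c2, c3):
    """Sketch of dual (37): maximize e2'a - 0.5 a'G M^{-1} G'a  s.t.  0 <= a <= c1*S2,
    with M = H'H + c2*I + c3*J'LJ.  Returns (alpha, v1), where v1 = [w1; b1] from (35)."""
    n = H.shape[1]
    Minv = np.linalg.inv(H.T @ H + c2 * np.eye(n) + c3 * J.T @ L @ J)
    Q = G @ Minv @ G.T                          # dual Hessian
    e2 = np.ones(G.shape[0])
    obj = lambda a: 0.5 * a @ Q @ a - e2 @ a    # negated objective (minimized)
    grad = lambda a: Q @ a - e2
    bounds = [(0.0, c1 * s) for s in S2]
    res = minimize(obj, np.zeros(G.shape[0]), jac=grad, method="L-BFGS-B", bounds=bounds)
    alpha = res.x
    v1 = -Minv @ (G.T @ alpha)                  # Eq. (35)
    return alpha, v1
```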

Likewise, the dual of (29) is

$$\begin{aligned} \begin{aligned}&\max _{\beta }&e_{1}^{\top } \beta - \frac{1}{2}\beta ^{\top }P(Q^{\top }Q+c_5I+c_6F^{\top }LF)^{-1}P^{\top }\beta \\&{\mathrm{s.t.}}&0 \leqslant \beta \leqslant c_4S_{1}, \end{aligned} \end{aligned}$$
(38)

where \(P=[A~e_1]\), \(F=[M~e]\), \(Q=[B~e_2]\), and the augmented vector \(v_2=[w_2^{\top }~b_2]^{\top }\) is given by

$$\begin{aligned} v_2=(Q^{\top }Q+c_5I+c_6F^{\top }LF)^{-1}P^{\top }\beta . \end{aligned}$$
(39)

Once the optimal \(v_1^{*}, v_2^{*}\) are obtained, the two hyperplanes are known. A new input point x is assigned to the positive or negative class by the decision function

$$\begin{aligned} f(x)=\mathop {\arg \min }_{i\in 1,2} \frac{|w_{i}^{\top }x+b_{i}|}{||w_i||}, \end{aligned}$$
(40)

where \(|\cdot |\) is the absolute value, so that \(|w_{i}^{\top }x+b_{i}|/||w_i||\) is the perpendicular distance from x to the i-th hyperplane. The whole procedure of the linear IFLap-TSVM is described in Algorithm 1.

Algorithm 1 Linear IFLap-TSVM
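
Once \(v_1\) and \(v_2\) are available, the decision rule (40) is straightforward; the following sketch assumes they are NumPy vectors of the form \([w_i^{\top }~b_i]^{\top }\) obtained from (35) and (39).

```python
import numpy as np

def predict(X, v1, v2):
    """Decision rule (40): assign each row of X to the class whose hyperplane is nearer."""
    w1, b1 = v1[:-1], v1[-1]
    w2, b2 = v2[:-1], v2[-1]
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)   # distance to the class +1 hyperplane
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)   # distance to the class -1 hyperplane
    return np.where(d1 <= d2, +1, -1)
```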

3.3 Nonlinear IFLap-TSVM

So far, the discussion has been restricted to the linear case. Here, we extend the linear IFLap-TSVM to the nonlinear case by considering the following two kernel-generated hyperplanes:

$$\begin{aligned} K\left( x^{\top }, M^{\top }\right) \lambda _{1}+b_{1}=0,\quad K\left( x^{\top }, M^{\top }\right) \lambda _{2}+b_{2}=0, \end{aligned}$$
(41)

where \(K(x_i,x_j)=\langle \phi (x_i),\phi (x_j)\rangle \) is a chosen kernel function and, for brevity, K denotes the kernel matrix \(K(M,M^{\top })\). The nonlinear optimization problems can be written as

$$\begin{aligned} \begin{aligned}&{{\min _{\lambda _{1},{ b}_{1},\xi }}}&\frac{1}{2}||K(A,M^{\top })\lambda _{1}+e_{1} b_{1}||_{2}^{2}+c_{1}S_{2}^{\top }\xi +\frac{1}{2}c_{2}(\lambda _{1}^{\top }K\lambda _{1}+b_{1}^2)\\&\quad&+\frac{1}{2}c_3(K\lambda _{1}+eb_{1})^{\top }L(K\lambda _{1}+eb_{1}) \\&\quad {\mathrm{s.t.}}&-(K(B,M^{\top })\lambda _{1}+e_{2}b_{1})+\xi \geqslant e_{2}, \quad \xi \geqslant 0, \end{aligned} \end{aligned}$$
(42)

and

$$\begin{aligned} \begin{aligned}&{{\min _{\lambda _{2},{ b}_{2}, \eta }}}&\frac{1}{2}||K(B,M^{\top })\lambda _{2}+e_{2} b_{2}||_2^2+c_4S_{1}^{\top }\eta +\frac{1}{2}c_5(\lambda _{2}^{\top }K\lambda _{2}+b_{2}^2)\\&\quad&+\frac{1}{2}c_6(K\lambda _{2}+eb_{2})^{\top }L(K\lambda _{2}+eb_{2}) \\&\quad {\mathrm{s.t.}}&(K(A,M^{\top })\lambda _{2}+e_{1}b_{2})+\eta \geqslant e_{1}, \quad \eta \geqslant 0. \end{aligned} \end{aligned}$$
(43)

The Lagrangian corresponding to the problem (42) is given by

$$\begin{aligned} \begin{aligned} L(\lambda _1,b_1,\xi ,\alpha ,\beta )=&\frac{1}{2}||K(A,M^{\top })\lambda _{1}+e_1b_1||_2^{2}+c_1S_2^{\top }\xi \\&+\frac{1}{2}c_2(\lambda _{1}^{\top }K\lambda _{1}+b_1^2)\\&+\frac{1}{2}c_3(K\lambda _{1}+eb_{1})^{\top }L(K\lambda _{1}+eb_{1})\\&-\alpha ^{\top }(-(K(B,M^{\top })\lambda _{1}+e_{2}b_{1})+\xi -e_2)-\beta ^{\top }\xi . \end{aligned} \end{aligned}$$
(44)

The KKT conditions are obtained as follows:

$$\begin{aligned} \frac{\partial L}{\partial \lambda _1}= & {} K(A,M^{\top })^{\top }(K(A,M^{\top })\lambda _{1}+e_1b_1)+c_2K\lambda _{1}\nonumber \\&+c_3K^{\top }L(K\lambda _1+eb_1)+K(B,M^{\top })^{\top }\alpha =0, \end{aligned}$$
(45)
$$\begin{aligned} \frac{\partial L}{\partial b_1}= & {} e_1^{\top }(K(A,M^{\top })\lambda _{1}+e_1b_1)+c_2b_1\nonumber \\&+c_3e^{\top }L(K\lambda _{1}+eb_1)+e_2^{\top }\alpha =0, \end{aligned}$$
(46)
$$\begin{aligned} \frac{\partial L}{\partial \xi }= & {} c_1S_2-\alpha -\beta =0. \end{aligned}$$
(47)

Combining (45) and (46) leads to

$$\begin{aligned} \begin{array}{l} {\left[ \begin{array}{c} K\left( A, M^{\top }\right) ^{\top } \\ e_{1}^{\top } \end{array}\right] \left[ K\left( A, M^{\top }\right) ~ e_{1}\right] \left[ \begin{array}{c} \lambda _{1} \\ b_{1} \end{array}\right] } +c_{2}\left[ \begin{array}{cc} K &{} \quad ~0 \\ 0 &{} \quad ~1 \end{array}\right] \left[ \begin{array}{c} \lambda _{1} \\ b_{1} \end{array}\right] \\ +c_{3}\left[ \begin{array}{c} K^{\top } \\ e^{\top } \end{array}\right] L[K~e]\left[ \begin{array}{c} \lambda _{1} \\ b_{1} \end{array}\right] +\left[ \begin{array}{c} K\left( B, M^{\top }\right) ^{\top } \\ e_{2}^{\top } \end{array}\right] \alpha =0. \end{array} \end{aligned}$$
(48)

Let \(H_\mathrm{non}=[K(A,M^{\top })~e_1]\), \(O_\mathrm{non}=\left[ \begin{array}{cc} K &{} \quad ~0 \\ 0 &{} \quad ~1 \end{array}\right] \), \(J_\mathrm{non}=[K~e]\), \(G_\mathrm{non}=[K(B,M^{\top })~e_2]\), and let the augmented vector be \(v_\mathrm{non1}=[\lambda _1^{\top }~b_1]^{\top }\); then (48) can be rewritten as

$$\begin{aligned} \begin{aligned}&(H_\mathrm{non}^{\top }H_\mathrm{non}+c_2O_\mathrm{non}+c_3J_\mathrm{non}^{\top }LJ_\mathrm{non})v_\mathrm{non1}+G_\mathrm{non}^{\top }\alpha =0\\&\quad \Rightarrow v_\mathrm{non1}=-(H_\mathrm{non}^{\top }H_\mathrm{non}+c_2O_\mathrm{non}+c_3J_\mathrm{non}^{\top }LJ_\mathrm{non})^{-1}(G_\mathrm{non}^{\top }\alpha ). \end{aligned} \end{aligned}$$
(49)

Therefore, the Wolfe dual of the problem (42) can be written as

$$\begin{aligned} \begin{aligned}&\max _{\alpha }&e_{2}^{\top } \alpha - \frac{1}{2}\alpha ^{\top }G_\mathrm{non}(H_\mathrm{non}^{\top }H_\mathrm{non}+c_2O_\mathrm{non}+c_3J_\mathrm{non}^{\top }LJ_\mathrm{non})^{-1}G_\mathrm{non}^{\top }\alpha \\&\quad {\mathrm{s.t.}}&0 \leqslant \alpha \leqslant c_1S_{2}. \end{aligned} \end{aligned}$$
(50)

Likewise, the dual of (43) is

$$\begin{aligned} \begin{aligned}&{\mathrm{\max _{\beta }}}&e_{1}^{\top } \beta - \frac{1}{2}\beta ^{\top }P_\mathrm{non}(Q_\mathrm{non}^{\top }Q_\mathrm{non}+c_5O_\mathrm{non}+c_6F_\mathrm{non}^{\top }LF_\mathrm{non})^{-1}P_\mathrm{non}^{\top }\beta \\&{\mathrm{s.t.}}&0 \leqslant \beta \leqslant c_4S_{1}, \end{aligned} \end{aligned}$$
(51)

where \(P_\mathrm{non}=[K(A,M^{\top })~e_1]\), \(F_\mathrm{non}=[K~e]\), \(Q_\mathrm{non}=[K(B,M^{\top })~e_2]\), and the augmented vector \(v_\mathrm{non2}=[\lambda _2^{\top }~b_2]^{\top }\) is given by

$$\begin{aligned} v_\mathrm{non2}=(Q_\mathrm{non}^{\top }Q_\mathrm{non}+c_5O_\mathrm{non}+c_6F_\mathrm{non}^{\top }LF_\mathrm{non})^{-1}P_\mathrm{non}^{\top }\beta . \end{aligned}$$
(52)

Once the optimal \(v_\mathrm{non1}^{*}, v_\mathrm{non2}^{*}\) are obtained, the two hyperplanes are known. A new input point x is assigned to the positive or negative class by the decision function

$$\begin{aligned} f(x)=\mathop {\arg \min }_{i\in 1,2} \frac{|K(x,M^{\top })\lambda _i+b_{i}|}{\sqrt{\lambda _i^{\top }K\lambda _i}}. \end{aligned}$$
(53)

The whole procedure of nonlinear IFLap-TSVM is described in Algorithm 2.

Algorithm 2 Nonlinear IFLap-TSVM

4 Experiment

In this section, we investigate the effectiveness and generalization capability of the proposed method on artificial datasets, UCI datasets and the MNIST dataset, and we compare IFLap-TSVM with Lap-TSVM [20], IFTSVM [24] and TSVM [11].

The testing accuracies of all experiments are computed using standard 10-fold cross-validation [30]. The penalty parameters \(c_i\) \((i=1,\cdots ,6)\) and the RBF kernel parameter \(\sigma \) are selected from the set \(\{2^i\,|\,i=-5,\cdots ,5\}\), and we set \(c_1=c_4,c_2=c_5,c_3=c_6\). The Gaussian kernel \(K(x_1,x_2)=\exp (-||x_1-x_2||^2/\sigma ^2)\) is used for the nonlinear case. Each experiment is repeated 10 times. All methods are implemented in MATLAB R2017b on a PC with an Intel Core i5 processor and 8 GB RAM.
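
For reference, the Gaussian kernel matrix and the hyperparameter grid described above could be set up as follows (a sketch; the helper name is ours and the cross-validation loop itself is omitted):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """Gaussian kernel matrix K(x1, x2) = exp(-||x1 - x2||^2 / sigma^2) for row sets X1, X2."""
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(d2, 0.0) / sigma**2)

# candidate values for c_1, ..., c_6 and sigma, as described in the text
grid = [2.0**i for i in range(-5, 6)]
```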

4.1 Artificial Datasets

In order to verify the validity of the model, two artificial datasets are constructed to evaluate IFLap-TSVM: a two-lines dataset and a half-moons dataset, each containing 200 points. A linear kernel is used for the two-lines dataset and an RBF kernel for the half-moons dataset. We choose 10 labeled points from each class as the training set and inject different proportions of label noise, i.e., 10% and 20%, into the training points; for example, at the 10% level, 10% of the training points are randomly selected and their class labels are flipped to the other class.
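
The exact generation of the artificial data is not spelled out here; a comparable half-moons construction with random label flips, using scikit-learn's `make_moons`, might look like the following (sample size, seed and the flipped subset are illustrative; in the experiments only the labeled training points are flipped).

```python
import numpy as np
from sklearn.datasets import make_moons

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
y = np.where(y == 1, 1, -1)          # relabel the two classes to {+1, -1}

# inject 10% label noise by flipping the class of randomly chosen points
noise_rate = 0.10
flip = rng.choice(len(y), size=int(noise_rate * len(y)), replace=False)
y[flip] = -y[flip]
```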

4.1.1 The Impact of the Parameters

In this subsection, the effects of different settings of the parameters \(c_1\) and \(c_3\) are analyzed on the half-moons dataset. In the first experiment, we compare the performance of IFLap-TSVM and Lap-TSVM for different values of \(c_1\). For both classifiers, we consider the half-moons dataset with 10% noise, fix the regularization parameters \(c_2=c_3=1\) and the RBF kernel parameter \(\sigma = 0.5\), and let \(c_1\) vary from \(2^{-5}\) to \(2^5\). Figure 3(a) shows the accuracy rates of IFLap-TSVM and Lap-TSVM: IFLap-TSVM achieves its optimal accuracy at \(c_1=2^2\), while Lap-TSVM obtains its best result at \(c_1 =2^0\). The best value of \(c_1\) for IFLap-TSVM is larger than that for Lap-TSVM because the score \(s_i\) of each training sample in IFLap-TSVM is at most 1; training samples receive different score values so as to be penalized to different degrees, and the smaller the score value, the smaller the influence of the corresponding training sample.

In the second experiment, in order to assess the effectiveness of the manifold regularization, we fix \(c_1=c_2=1\) and the RBF kernel parameter \(\sigma = 0.5\), and let \(c_3\) vary from \(2^{-5}\) to \(2^{10}\). Figure 3(b) shows that the accuracy of IFLap-TSVM improves as \(c_3\) increases. However, the value of \(c_3\) should not be too large: once \(c_3\) exceeds \(2^3\), the accuracy begins to decline drastically and finally drops to 50%. The reason is that when \(c_3\) exceeds a certain limit, the manifold regularization term is penalized too heavily, loses its original function and makes the model degenerate into a supervised one.

Fig. 3 (a) Comparison of IFLap-TSVM and Lap-TSVM on the half-moons dataset with different values of the parameter \(c_1\) (\(c_1 = 2^n\)); (b) accuracy of IFLap-TSVM on the half-moons dataset with different values of the parameter \(c_3\) (\(c_3 = 2^n\))

4.1.2 Comparison with Other Methods

In this subsection, we compare the effectiveness of our IFLap-TSVM with Lap-TSVM, IFTSVM and TSVM on the two-lines and half-moons datasets. Figure 4 shows the one-run results of each classifier on the two-lines dataset, with Figs. 4(b)–4(d) corresponding to noise levels of 0%, 10% and 20%, respectively. It can be seen that as the noise increases, our IFLap-TSVM produces more accurate hyperplanes than the other models.

Fig. 4 (a) Original data points of the two-lines dataset without noise distortion; classification results of TSVM, IFTSVM, Lap-TSVM and IFLap-TSVM with different levels of noise: (b) 0% noise; (c) 10% noise; (d) 20% noise

Fig. 5 Classification results of TSVM, IFTSVM, Lap-TSVM and IFLap-TSVM with 0% noise on the half-moons dataset. (a) TSVM; (b) IFTSVM; (c) Lap-TSVM; (d) IFLap-TSVM

Fig. 6 Classification results of TSVM, IFTSVM, Lap-TSVM and IFLap-TSVM with 10% noise on the half-moons dataset. (a) TSVM; (b) IFTSVM; (c) Lap-TSVM; (d) IFLap-TSVM

Fig. 7 Classification results of TSVM, IFTSVM, Lap-TSVM and IFLap-TSVM with 20% noise on the half-moons dataset. (a) TSVM; (b) IFTSVM; (c) Lap-TSVM; (d) IFLap-TSVM

The one-run results of each classifier on the half-moons dataset with different levels of noise are shown in Figs. 5, 6 and 7. Compared with the other methods, our IFLap-TSVM is more robust to noise and its decision boundary is more accurate. In addition, for the half-moons dataset, we ran 10 experiments to further evaluate the classification accuracy and the training time of each classifier, as shown in Table 1. The results show that as the noise level increases, the accuracy of every method decreases, but the effect of noise on the accuracy of IFLap-TSVM is the smallest. The training time of IFLap-TSVM is the longest because, compared with the supervised models, the dual QPPs of the semi-supervised model require two inversions of matrices of size \((n+1)\times (n+1)\), and IFLap-TSVM additionally computes the score value of each training sample, a step Lap-TSVM does not have.

In general, compared with Lap-TSVM, IFTSVM and TSVM, IFLap-TSVM provides higher accuracy on both noiseless and noisy datasets. This is because IFLap-TSVM uses the information of unlabeled data to improve accuracy and uses intuitionistic fuzzy numbers to reduce the effect of noise and outliers.

4.2 UCI Datasets

In this section, we investigate the performance of the IFLap-TSVM model on UCI datasets [31] and compare the results with TSVM, Lap-TSVM and IFTSVM. Before training, all data are scaled so that every feature lies in [0, 1]. First, each dataset is divided into two subsets: 65% for training and 35% for testing. Then, for each dataset, we randomly label \(m\) \((m = 10\%,20\%,30\%)\) of the training points and treat the remainder as unlabeled data. Table 2 shows the detailed information of the UCI datasets.
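
A sketch of this preprocessing and splitting protocol using scikit-learn utilities (an illustration of the protocol, not the authors' script; the function name and the random seed are ours):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

def prepare_split(X, y, label_ratio=0.10, seed=0):
    """Scale features to [0, 1], split 65%/35%, and mark a fraction of training labels as known."""
    X = MinMaxScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, random_state=seed)
    rng = np.random.RandomState(seed)
    labeled = rng.choice(len(y_tr), size=int(label_ratio * len(y_tr)), replace=False)
    mask = np.zeros(len(y_tr), dtype=bool)
    mask[labeled] = True             # True = labeled, False = treated as unlabeled
    return X_tr, y_tr, mask, X_te, y_te
```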

The classification accuracies and standard deviations of IFLap-TSVM and the other models are shown in Tables 3, 4 and 5. They show that as the proportion of labeled data increases, the classification performance of all classifiers also increases. Moreover, judging from the average ("mean") accuracies given in Tables 3, 4 and 5, IFLap-TSVM outperforms the other methods for the same amount of labeled data.

Table 1 Classification accuracy and the training time on the half-moons test with different levels of noise
Table 2 Detailed information of UCI datasets
Table 3 Mean and standard deviation (%) of accuracy at 10% of labeled data points on UCI dataset
Table 4 Mean and standard deviation (%) of accuracy at 20% of labeled data points on UCI dataset

Furthermore, IFLap-TSVM and Lap-TSVM achieve higher classification accuracy than IFTSVM and TSVM, which shows that manifold regularization helps the classification model by using the geometric distribution information of both labeled and unlabeled data. More importantly, the accuracy of IFLap-TSVM is higher than that of Lap-TSVM, indicating that the intuitionistic fuzzy functions are effective in reducing the effect of noise and outlier points.

Fig. 8 Samples from the MNIST dataset

Table 5 Mean and standard deviation (%) of accuracy at 30% of labeled data points on UCI dataset

4.3 MNIST Dataset

In this section, we apply IFLap-TSVM to handwritten digit recognition. The MNIST dataset, shown in Fig. 8, is a handwritten digit dataset composed of images of the digits '0' to '9'. Each image has \(28\times 28\) pixels with 256 gray levels. Similar to [32], we select four pairs of digits on raw pixel features for comparison. Each pair of digits contains 450 images, of which 300 are used for training and the remaining 150 for testing. Furthermore, we randomly label 50 images of the training set, and \(m\) \((m=50,100,150,200,250)\) unlabeled images are selected from the remaining training images. In addition, we only consider these classifiers with the RBF kernel.

Fig. 9 Test accuracies and standard deviations of IFLap-TSVM, Lap-TSVM, IFTSVM and TSVM on the MNIST dataset. (a) 0 versus 2; (b) 1 versus 7; (c) 3 versus 6; (d) 4 versus 9

Figure 9 shows the experimental results. As the amount of unlabeled data increases, the test accuracies of the semi-supervised classifiers gradually improve, because manifold regularization can use the geometric distribution information of labeled and unlabeled data to find a more accurate classifier. In most cases, the classification results of IFLap-TSVM are better than those of the other models, and the standard deviations of IFLap-TSVM and IFTSVM are smaller than those of Lap-TSVM and TSVM. Therefore, the intuitionistic fuzzy numbers effectively reduce the impact of noise and outliers on classification accuracy.

5 Conclusion

In this paper, we have proposed an intuitionistic fuzzy Laplacian twin support vector machine for semi-supervised classification, inspired by intuitionistic fuzzy numbers and Lap-TSVM. It not only reduces the effect of noise and outliers through the membership and nonmembership functions, but also uses the geometric distribution information of labeled and unlabeled data to construct a more accurate classifier. Experimental results indicate that our IFLap-TSVM performs well on both constructed test data and several real-world datasets; compared with Lap-TSVM, IFTSVM and TSVM, IFLap-TSVM achieves the best performance. In the future, we will focus on improving the intuitionistic fuzzy numbers to further reduce the effect of noise and outliers on the model. Moreover, there may be noise in the unlabeled samples, and how to deal with it should also be considered. Another possible direction is to extend IFLap-TSVM to multi-class classification.