1 Introduction

Extreme learning machine (ELM) [1] is a simple and efficient single-hidden layer feedforward neural network (SLFN) that trains much faster than conventional feedforward neural networks. The input weights and hidden layer biases of ELM are randomly generated, and the output weights are obtained by using the Moore-Penrose generalized inverse. The objective of ELM is to achieve both the minimal norm of the output weights and the minimum training error. In recent years, various improvements have been proposed to enhance the classification performance of ELM [2,3,4,5,6,7,8,9,10,11,12]. Owing to their excellent classification performance, these algorithms have been applied to a wide range of fields [13,14,15,16,17,18,19,20,21,22,23].

In practical classification problems, the data often contain a large amount of noise. Noise can interfere with the construction of classifiers and reduce the classification performance of algorithms. However, traditional ELMs cannot effectively suppress the negative impact of noise. To enhance the noise robustness of ELM, Ren et al. [24] proposed the correntropy-based hinge loss robust extreme learning machine (CHELM). Wang et al. [25] proposed the extreme learning machine with the homotopy loss (\(l_1\)-HELM), which introduces the homotopy loss into ELM. To enhance noise robustness and re-sampling stability, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples. To further achieve robustness and sparsity, Shen et al. [27] introduced the \(\varepsilon \)-insensitive zone pinball loss into ELM. In practical classification problems, the data also contain a large number of redundant or irrelevant features, which negatively affect these algorithms and reduce their performance. To reduce redundant or irrelevant features, Huang et al. [28] proposed a method that employs a new fuzzy \(\beta \) neighborhood-related discernibility measure and fuzzy \(\beta \) covering (FBC) decision tables. To enhance the robustness of FBC in feature learning, Huang et al. [29] proposed a noise-tolerant fuzzy-\(\beta \)-covering-based multigranulation rough set model. To deal with noisy data, the VPDI method [30] uses noise-tolerant discrimination indexes and a heuristic feature selection algorithm to reduce redundant or irrelevant features. Although these FBC-based feature selection algorithms can eliminate redundant or irrelevant features, they do not consider the importance of samples in the classification process. In recent years, determining the importance of samples has become a research hotspot. Inspired by FSVM [31], Zhang et al. [32] proposed the fuzzy extreme learning machine (FELM), which assigns a membership degree to each training sample to reduce the influence of outliers and noise.

FELM is an effective algorithm for dealing with classification problems with noise. However, it has two drawbacks: (1) FELM considers only the membership degree of samples and not their non-membership degree, so it can easily mistake boundary samples for noise. (2) FELM uses the least squares loss function, which makes it sensitive to feature noise and unstable under re-sampling. To address these drawbacks, we propose an improved ELM model that combines intuitionistic fuzzy sets (IFSs) and the truncated pinball loss function, called the intuitionistic fuzzy extreme learning machine with the truncated pinball loss (TPin-IFELM). First, TPin-IFELM constructs the membership and non-membership degrees based on the local information of samples obtained by the K-nearest neighbor (KNN) method. The membership degree is calculated from the distance between the sample and its class center, and the non-membership degree is calculated from the proportion of heterogeneous samples among all samples in its neighborhood. We then obtain the score value of each sample from its membership and non-membership degrees; this score can effectively identify whether boundary samples are noise. Finally, to further reduce the negative effects of noise, we introduce the truncated pinball loss function [33, 34] into TPin-IFELM, which makes it more robust and sparse. Since TPin-IFELM is a non-convex problem, we solve it with the concave-convex procedure (CCCP) [35, 36]. Extensive experimental results show that the proposed TPin-IFELM is superior to several state-of-the-art comparison algorithms in dealing with classification problems with noise.

The rest of this paper is organized as follows. In Sect. 2, we briefly review ELM and its loss functions, as well as FELM and its improvements. In Sect. 3, we discuss the optimization models of the linear and nonlinear TPin-IFELM in detail. In Sect. 4, we investigate some properties of TPin-IFELM. In Sect. 5, TPin-IFELM is evaluated via a series of experiments. Section 6 summarizes this paper and outlines directions for future research.

2 Related Works

2.1 Notations

We define the binary dataset \(D=\left\{ \left( x_i,t_i\right) \mid 1\le i\le N\right\} \), where \(t_i\in \left\{ +1,-1\right\} \). Let \(\mathcal {X}^+=\left\{ x_i\mid \left( x_i,t_i\right) \in D,t_i=+1\right\} \) denote the positive samples, \(\mathcal {X}^-=\{x_j\mid (x_j,t_j)\in D,t_j=-1\}\) denote the negative samples, \(N^+=\vert \mathcal {X}^+\vert \), \(N^-=\vert \mathcal {X}^-\vert \), \(\mathcal {X}=\mathcal {X}^+\cup \mathcal {X}^-\), and \(N=N^++N^-\).

2.2 ELM and its Loss Functions

ELM is an effective single-hidden layer feedforward neural network [2, 37]. First, the input weights and hidden layer biases are randomly assigned; then the hidden layer output matrix H is obtained by applying the activation function \(G\left( \cdot \right) \); finally, the output weights \(\beta \) are obtained via the Moore-Penrose generalized inverse.

The output of ELM is defined as follows:

$$\begin{aligned} f\left( x\right) =\sum _{i=1}^{L}G\left( \theta _i^{\textrm{T}}x+\vartheta _i\right) \beta _i=h\left( x\right) \cdot \beta , \end{aligned}$$
(1)

where \(\theta _i={[\theta _{i1},\ldots ,\theta _{in}]}^{\textrm{T}}\in \mathfrak {R}^n\) and \(\vartheta _i\in \mathfrak {R}\) are the input weight vector and the bias of the i-th hidden node, respectively, \(h\left( x\right) =\left[ G(\theta _1^{\textrm{T}}x+\vartheta _1),\ldots ,G(\theta _L^{\textrm{T}}x+\vartheta _L)\right] ^{\textrm{T}}\in \mathfrak {R}^L\) is the random feature mapping output of the hidden layer, and \(\beta ={\left[ \beta _1,\beta _2,\ldots ,\beta _L\right] }^{\textrm{T}}\in \mathfrak {R}^L\) is the output weight vector.

\(\beta \) is obtained by solving

$$\begin{aligned} min\left\| {\beta }\right\| \ \textrm{and}\ min{\sum _{i=1}^{N}{\left\| {\beta \cdot h\left( x_i\right) -t_i} \right\| }}. \end{aligned}$$
(2)

The optimal solution of (2) can be calculated by

$$\begin{aligned} \hat{\beta }=H^\dag T,\ \end{aligned}$$
(3)

where \(H=\left[ h\left( x_1\right) ,\ldots ,h\left( x_N\right) \right] ^{\textrm{T}}\in \mathfrak {R}^{N\times L}\) and \(T={\ \left[ t_1,t_2,\ldots ,t_N\right] }^{\textrm{T}}\in \mathfrak {R}^N\) is the output vector. \(H^\dag \) is the Moore-Penrose generalized inverse of matrix H.

The decision function of ELM is

$$\begin{aligned} f\left( x\right) =\ sign\left( {h\left( x\right) }^{\textrm{T}}\hat{\beta }\right) . \end{aligned}$$
(4)
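For concreteness, the following NumPy sketch illustrates the training and prediction pipeline of Eqs. (1)-(4). It is illustrative only (the paper's experiments use MATLAB); the uniform initialization range and the sigmoid activation are assumptions of the sketch, not choices prescribed above.

```python
import numpy as np

def elm_train(X, t, L=100, seed=0):
    """Basic ELM: random input weights/biases, hidden matrix H, then beta = H^+ t (Eq. (3))."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    theta = rng.uniform(-1.0, 1.0, size=(L, n))           # input weight vectors theta_i
    vartheta = rng.uniform(-1.0, 1.0, size=L)             # hidden biases vartheta_i
    H = 1.0 / (1.0 + np.exp(-(X @ theta.T + vartheta)))   # sigmoid G(.), H has shape (N, L)
    beta = np.linalg.pinv(H) @ t                          # Moore-Penrose solution of Eq. (2)
    return theta, vartheta, beta

def elm_predict(X, theta, vartheta, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ theta.T + vartheta)))
    return np.sign(H @ beta)                              # decision function of Eq. (4)
```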

In order to improve the classification performance of ELM, Huang et al. [3] proposed the optimization method-based ELM (OELM), which introduces the hinge loss function into ELM. To speed up the solution, Huang et al. [2] proposed the regularized ELM (RELM), which introduces the least squares loss function into ELM. However, OELM and RELM are sensitive to noise. To enhance the noise robustness of ELM, Ren et al. [26] proposed the extreme learning machine with the pinball loss function (PELM), which maximizes the quantile distance between two classes of samples.

For convenience, we unify the optimization problems of these algorithms as follows:

$$\begin{aligned} min_\beta \ {\frac{1}{2}\left\| {\beta } \right\| ^2+c\sum _{i=1}^{N}L\left( \textrm{U}\right) }, \end{aligned}$$
(5)

where \(L\left( \textrm{U}\right) \) is the loss function. When \(L\left( \textrm{U}\right) \) is the hinge loss or the pinball loss function, \(\textrm{U}=1-t_ih\left( x_i\right) \cdot \beta \) (see Fig. 1a, b); when \(L\left( \textrm{U}\right) \) is the least squares loss function, \(\textrm{U}=t_i-h\left( x_i\right) \cdot \beta \) (see Fig. 1c).

Fig. 1: Illustrations of loss functions

2.3 FELM and Its Improvements

FELM [32] assigns a membership degree to each training sample to reduce the influence of outliers and noise. The optimization problem of FELM can be formulated as follows:

$$\begin{aligned}&{min_{\beta ,\xi _i}}\ \frac{1}{2}\left\| {\beta } \right\| ^2+\frac{c}{2}\sum _{i=1}^{N}{s_i{\left\| {\xi _i} \right\| ^2}}\nonumber \\&s.t.\ \ h\left( x_i\right) \cdot \beta =t_i-\xi _i,\ i=1,\ldots ,N, \end{aligned}$$
(6)

where \(\xi _i\) is the training error, c is the penalty parameter, \(s_i=\left\{ \begin{array}{ll} 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| }{r^++\delta },&{}t_i=+1\\ 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| }{r^-+\delta },&{}t_i=-1\\ \end{array}\right. \) is the membership degree of \(x_i\) in the random mapping feature space with \(\delta >0\) a small adjustable constant, \({\widetilde{\mathcal {C}}}^+=\frac{1}{N^+}\sum _{x_i\in \mathcal {X}^+} h\left( x_i\right) \) and \({\widetilde{\mathcal {C}}}^-=\frac{1}{N^-}\sum _{x_i\in \mathcal {X}^-} h\left( x_i\right) \) are the centers of the positive and negative classes, respectively, and \(r^+=\max _{x_i\in \mathcal {X}^+}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| )\) and \(r^-=\max _{x_i\in \mathcal {X}^-}(\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| )\) are the radii of the positive and negative classes, respectively.
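For reference, the membership-weighted problem (6) admits a closed-form solution through its KKT system, namely \(\beta =H^{\textrm{T}}(HH^{\textrm{T}}+(cS)^{-1})^{-1}T\) with \(S=\textrm{diag}(s_1,\ldots ,s_N)\). The sketch below implements this form; the derivation is the standard one for equality-constrained weighted least-squares ELMs, and the function names are ours.

```python
import numpy as np

def felm_membership(H, t, delta=1e-7):
    """Distance-to-class-center membership degrees s_i of FELM (all s_i > 0 since delta > 0)."""
    s = np.empty(len(t))
    for label in (+1, -1):
        idx = (t == label)
        center = H[idx].mean(axis=0)
        d = np.linalg.norm(H[idx] - center, axis=1)
        s[idx] = 1.0 - d / (d.max() + delta)
    return s

def felm_output_weights(H, t, s, c=1.0):
    """Solve (6) via its KKT system: beta = H^T (H H^T + (c S)^{-1})^{-1} t."""
    A = H @ H.T + np.diag(1.0 / (c * s))
    alpha = np.linalg.solve(A, t)
    return H.T @ alpha
```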

However, FELM only considers the membership degree of samples, which can easily mistake some boundary samples for noise. To identify the noise among the support vectors, Rezvani et al. proposed the intuitionistic fuzzy twin support vector machine (IFTSVM), which uses intuitionistic fuzzy sets (IFSs) to construct the score values of samples [38]. In order to enhance the robustness and re-sampling stability of IFTSVM, Liang et al. proposed the intuitionistic fuzzy twin support vector machine with the \(\varepsilon \)-insensitive pinball loss (PIFTSVM) [39], which defines the score function named SFA:

$$\begin{aligned} s(x)=\sqrt{\frac{{\mu (x)}^2+(1-\nu (x))^2}{2}}, \end{aligned}$$
(7)

where \(\mu (x)\) is the membership function and \(\nu (x)\) is the non-membership function. Laxmi et al. proposed multi-category intuitionistic fuzzy twin support vector machines to solve multi-class classification problems [40]. In order to effectively solve the class imbalance problem, Rezvani et al. proposed class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines.

3 Intuitionistic Fuzzy Extreme Learning Machines with the Truncated Pinball Loss

In this section, we propose TPin-IFELM to address the drawbacks of FELM. The algorithm framework of TPin-IFELM is shown in Fig. 2.

Fig. 2: The framework diagram of TPin-IFELM

3.1 Intuitionistic Fuzzy Settings

FELM easily mistakes some boundary samples for noise because it uses only the membership degree. To address this issue, in this subsection we employ IFSs to assign membership and non-membership degrees to each sample and thereby reduce the negative impact of noise.

Define an intuitionistic fuzzy set \(\bar{A}=\ \left\{ \left( x,\mu _{\bar{A}}\left( x\right) ,\nu _{\bar{A}}\left( x\right) \right) |\ x\in \mathcal {X}\right\} \), where \(\mu _{\bar{A}}\): \(\mathcal {X}\) \(\rightarrow \left[ 0,1\right] \) is the membership degree of x in \(\mathcal {X}\), \(\nu _{\bar{A}}: \mathcal {X}\rightarrow \left[ 0,1\right] \) is the non-membership degree of x in \(\mathcal {X}\), and \(0\le \mu _{\bar{A}}\left( x\right) +\nu _{\bar{A}}\left( x\right) \le 1\). We illustrate the acquisition of membership and non-membership degrees through the following examples.

3.1.1 Intuitionistic Fuzzy Membership Degree

In the random mapping feature space, the membership degree of samples is determined by the distance between samples and the class center, i.e.,

$$\begin{aligned} \mu _i=\left\{ \begin{array}{ll} 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^+ \right\| }{r^++\varrho },&{}t_i=+1\\ 1-\frac{\left\| h\left( x_i\right) -{\widetilde{\mathcal {C}}}^- \right\| }{r^-+\varrho },&{}t_i=-1\\ \end{array}\right. , \end{aligned}$$
(8)

where \(1 \le i\le N\) and \(\varrho >0\) is an adjustable parameter in the random mapping feature space.

Example 1

Let \(h(x_*) = (0.91, 0.27, 0.21, 0.22, 0.23)\) with \(t_*=+1\), let \({\widetilde{\mathcal {C}}}^+ = (0.80, 0.52, 0.40, 0.57, 0.43)\) be the center of the positive class, let \(r^+ = 0.87\) be the radius of the positive class, and take \(\varrho ={10}^{-7}\). According to Eq. (8), \(\mu _*=1-\frac{0.5227}{0.87+{10}^{-7}}=0.3992\).

3.1.2 Intuitionistic Fuzzy Non-membership Degree

We can effectively capture the correlation between \(x_i\) and all heterogeneous samples in its neighborhood by using the KNN method, i.e.,

$$\begin{aligned} \rho \left( x_i\right) =\frac{\vert \{x_j\mid h(x_j)\in KNN\left( h\left( x_i\right) \right) ,t_j\ne t_i\}\vert }{K}, \end{aligned}$$
(9)

where \(KNN\left( h\left( x_i\right) \right) \) is used to represent the K nearest neighbors of \(x_i\) in the random mapping feature space.

The non-membership degree \(\upsilon _i\) is defined as:

$$\begin{aligned} \upsilon _i=\left( 1-\mu _i\right) \rho \left( x_i\right) , \end{aligned}$$
(10)

and \(0\le \mu _i+\upsilon _i\le 1.\)

Example 2

Let \(K = 5\) and \(\rho \left( x_*\right) = \frac{4}{5}\). According to Eq. (10), \(\upsilon _*=\left( 1-0.3992\right) \times \frac{4}{5}=0.4806\).

3.1.3 The Score Function

We construct an IFS \(\breve{S}=\left\{ \left( x_1,t_1,\mu _1,\nu _1\right) ,\left( x_2,t_2,\mu _2,\nu _2\right) ,\ldots ,\left( x_N,t_N,\mu _N,\nu _N\right) \right\} \). According to \(\breve{S}\), we construct the score value (SV) as follows:

$$\begin{aligned} s_i=\ \left\{ \begin{array}{cc} \mu _i,&{}\nu _i=0\\ 0,&{}\mu _i\le \nu _i\\ \frac{1-\nu _i}{2-\mu _i-\nu _i},&{}\mu _i>\nu _i\ \textrm{and}\ \nu _i\ne 0\\ \end{array}\right. , 1\le i\le N, \end{aligned}$$
(11)

where \(s_i=\mu _i\) indicates that \(x_i\) is a correctly classified sample; \(s_i=0\) indicates that \(x_i\) is treated as noise; and \(s_i=\frac{1-\nu _i}{2-\mu _i-\nu _i}\) indicates that \(x_i\) is a support vector of the corresponding class rather than noise.
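The whole construction of this subsection can be summarized in a few lines. The sketch below is our own illustrative code (with a brute-force neighbor search): it computes the membership degrees of Eq. (8), the non-membership degrees of Eqs. (9)-(10), and the score values of Eq. (11) from the hidden-layer representations.

```python
import numpy as np

def intuitionistic_fuzzy_scores(H, t, K=5, varrho=1e-7):
    """Return (mu, nu, s) for samples with representations H (N x L) and labels t in {+1, -1}."""
    N = H.shape[0]
    mu = np.empty(N)
    for label in (+1, -1):                            # membership: distance to the class center, Eq. (8)
        idx = (t == label)
        center = H[idx].mean(axis=0)
        d = np.linalg.norm(H[idx] - center, axis=1)
        mu[idx] = 1.0 - d / (d.max() + varrho)

    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=2)   # pairwise distances
    np.fill_diagonal(D, np.inf)                       # exclude the sample itself
    knn = np.argsort(D, axis=1)[:, :K]                # K nearest neighbors in feature space
    rho = (t[knn] != t[:, None]).mean(axis=1)         # fraction of heterogeneous neighbors, Eq. (9)
    nu = (1.0 - mu) * rho                             # non-membership, Eq. (10)

    s = np.where(nu == 0.0, mu,
                 np.where(mu <= nu, 0.0,
                          (1.0 - nu) / (2.0 - mu - nu)))        # score values, Eq. (11)
    return mu, nu, s
```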

3.2 Linear Case

Unlike FELM [32], which uses the least squares loss function, TPin-IFELM employs the truncated pinball loss function, which not only makes the model robust to noise but also preserves sparsity. The truncated pinball loss function (see Fig. 3) is as follows:

$$\begin{aligned} P_{\tau ,\varsigma }\left( x,t,f\left( x\right) \right) =\left\{ \begin{array}{cc} \tau \varsigma ,&{}\textrm{U}\le -\varsigma \\ -\tau \textrm{U},&{}-\varsigma<\textrm{U}<0\\ \textrm{U},&{}\textrm{U}\ge 0\\ \end{array}\right. , \end{aligned}$$
(12)

where \(0\le \tau \le 1\), \(\varsigma >0\) is a preset value, t is the label of x, and \(\textrm{U}=1-tf\left( x\right) \).

As shown in Fig. 3, the truncated pinball loss function combines the advantages of the hinge loss function and the pinball loss function, so it enjoys both noise robustness and sparsity.

Fig. 3: The truncated pinball loss

Equation (12) can be decomposed as follows:

$$\begin{aligned} P_{\tau ,\varsigma }\left( x,t,f\left( x\right) \right) =H_{1+\tau }\left( 1-tf\left( x\right) \right) -\left( H_\tau \left( 1-tf\left( x\right) +\varsigma \right) -\tau \varsigma \right) , \end{aligned}$$
(13)

where \(H_{1+\tau }\left( 1-tf\left( x\right) \right) = \left( 1+\tau \right) max\left( 0,1-tf(x)\right) \) and \(H_\tau \left( 1-tf\left( x\right) +\varsigma \right) = \tau max\left( 0,1-tf\left( x\right) +\varsigma \right) \).
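The decomposition (13) is easy to verify numerically. The short sketch below evaluates both sides of Eq. (13) over a grid of margin values \(\textrm{U}=1-tf(x)\) and checks that they coincide; the function names and parameter values are illustrative.

```python
import numpy as np

def truncated_pinball(u, tau, varsigma):
    """P_{tau, varsigma}(U) of Eq. (12), written directly from its three cases."""
    return np.where(u <= -varsigma, tau * varsigma,
                    np.where(u < 0.0, -tau * u, u))

def dc_decomposition(u, tau, varsigma):
    """Right-hand side of Eq. (13): H_{1+tau}(U) - (H_tau(U + varsigma) - tau * varsigma)."""
    h_1t = (1.0 + tau) * np.maximum(0.0, u)
    h_t = tau * np.maximum(0.0, u + varsigma)
    return h_1t - (h_t - tau * varsigma)

u = np.linspace(-3.0, 3.0, 601)
assert np.allclose(truncated_pinball(u, 0.5, 1.0), dc_decomposition(u, 0.5, 1.0))
```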

We replace the least squares loss function in Eq. (6) with the truncated pinball loss and weight each sample by its score value, which yields:

$$\begin{aligned} {min}_\beta \ J\left( \beta \right) =\ \frac{1}{2}\left\| {\beta }\right\| ^2+\ c\sum _{i=1}^{N}{s_iP}_{\tau ,\varsigma }\ \left( 1-t_if_\beta \left( x_i\right) \right) , \end{aligned}$$
(14)

where c is the penalty parameter.

The gradient \(\nabla _\beta \left( J\left( \beta \right) \right) \) of \(J\left( \beta \right) \) is as follows:

$$\begin{aligned} \nabla _\beta \left( J\left( \beta \right) \right) \ =\ \beta -c\sum _{i=1}^{N}s_it_ih\left( x_i\right) \partial _\beta P_{\tau ,\varsigma }\left( 1-t_if_\beta \left( x_i\right) \right) . \end{aligned}$$
(15)

It can be proved that the minimum of (14) with respect to \(\beta \) should satisfy the following condition

$$\begin{aligned} \beta =c\sum _{i=1}^{N}s_it_ih\left( x_i\right) \partial _\beta P_{\tau ,\varsigma }\left( 1-t_if_\beta \left( x_i\right) \right) . \end{aligned}$$
(16)

The function \(J\left( \beta \right) \) in (14) can be decomposed into the sum of the convex function \(J_{vex}(\beta )\) and the concave function \(J_{cav}(\beta )\), i.e.,

$$\begin{aligned} J\left( \beta \right)&={\underbrace{\ \frac{1}{2}\left\| {\beta }\right\| ^2+c\sum _{i=1}^{N}{s_iH}_{1+\tau }\left( 1-t_if_\beta \left( x_i\right) \right) }_{J_{vex}\left( \beta \right) }}\nonumber \\&\quad {\underbrace{-c\sum _{i=1}^{N}{s_iH_\tau }\left( 1-t_if_\beta \left( x_i\right) +\varsigma \right) +c\tau \varsigma \sum _{i=1}^{N}s_i}_{J_{cav}\left( \beta \right) }}. \end{aligned}$$
(17)

Obviously, (17) is a non-differentiable non-convex optimization problem, which can be solved by the CCCP. The detailed procedure of the CCCP is shown in Algorithm 1.

Algorithm 1: CCCP for optimization problem (17)

Using the CCCP to solve (17), the subproblem of the kth iteration can be expressed as:

$$\begin{aligned}&{min}_\beta \ J_{vex}\left( \beta \right) +\nabla _\beta {\left( J_{cav}\left( \beta ^{\left( k-1\right) }\right) \right) }^{\textrm{T}}\beta \nonumber \\&\quad =\frac{1}{2}\left\| {\beta }\right\| ^2+c\sum _{i=1}^{N}{s_iH_{1+\tau }}\left( 1-t_if_\beta \left( x_i\right) \right) +\sum _{i=1}^{N}{\delta _i^{k-1}t_if_\beta \left( x_i\right) }, \end{aligned}$$
(18)

where

$$\begin{aligned} \delta _i^{k-1}=\left\{ \begin{array}{cc} cs_i\tau ,&{}t_if_{\beta ^{k-1}}\ \left( x_i\right) = t_i\left( h\left( x_i\right) \cdot \beta ^{k-1}\right) <\varsigma +1\\ 0,&{}\ \ \ \textrm{otherwise}\\ \end{array}\right. . \end{aligned}$$
(19)

By introducing the slack variables \(\xi =\left[ \xi _1,\ldots ,\xi _N\right] ^{\textrm{T}}\), (18) is equivalent to the following form:

$$\begin{aligned}&{min_{\beta ,\xi _i}}\ \frac{1}{2}\left\| {\beta } \right\| ^2+c\sum _{i=1}^{N}{s_i\xi _i}+\ \sum _{i=1}^{N}{\delta _i^{k-1}t_if_\beta \left( x_i\right) }\nonumber \\&s.t.\ t_if_\beta \left( x_i\right) \ge \ 1-\ {\frac{1}{1+\tau }\xi _i},{\ \xi }_i\ge 0,\ i=1\ldots N. \end{aligned}$$
(20)

According to the Lagrange method, we can obtain the following dual problem. The detailed solution process is shown in Appendix A.

$$\begin{aligned}&min_\alpha {\ \frac{1}{2}({\alpha }^{\textrm{T}} -{\delta }^{\textrm{T}})Q(\alpha -\delta )-{e}^{\textrm{T}}\alpha }\nonumber \\&s.t.\ \ 0\le \ \alpha \le \left( 1+\tau \right) cS, \end{aligned}$$
(21)

where \(Q=TH{H}^{\textrm{T}}T\), with a slight abuse of notation \(T=\textrm{diag}\left( t_1,\ldots ,t_N\right) \) denoting the diagonal label matrix here, \(S={\left[ s_1,\ldots ,s_N\right] }^{\textrm{T}}\) is the vector of score values, \(\delta ={\left[ \delta _1^{k-1},\ldots ,\delta _N^{k-1}\right] }^{\textrm{T}}\), \(e={\left[ 1,\ldots ,1\right] }^{\textrm{T}}\in \mathfrak {R}^N\), and the bound \(\alpha \le \left( 1+\tau \right) cS\) holds componentwise.

Set \(\lambda =\ \alpha -\delta \), and the lower and upper bounds of the box constraint are defined as \(\mathfrak {L}=-\delta \in \mathfrak {R}^N\) and \(\mathfrak {U}=\left( 1+\tau \right) cS-\ \delta \in \mathfrak {R}^N\). Then, (21) is equivalent to

$$\begin{aligned}&min_\lambda {\ \frac{1}{2}{\lambda }^{\textrm{T}}Q\lambda -{e}^{\textrm{T}}\alpha }\nonumber \\&s.t.\ \ \mathfrak {L}\le \ \lambda \le \mathfrak {U}. \end{aligned}$$
(22)

The label t of the unknown sample x is determined by the following decision function.

$$\begin{aligned} f\left( x\right) =sign\left( {\lambda }^{\textrm{T}} THh(x)\right) . \end{aligned}$$
(23)

The complete process of linear TPin-IFELM is shown in Algorithm 2.

Algorithm 2: The complete process of linear TPin-IFELM
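To make the CCCP iteration of Algorithm 2 concrete, the sketch below alternates between updating \(\delta \) via Eq. (19) and solving the box-constrained problem (22). As a stand-in for the quadprog solver used in the experiments, the bound-constrained L-BFGS-B routine is applied to the smooth quadratic objective, and \(\beta =H^{\textrm{T}}T\lambda \) is recovered in the form implied by the decision function (23). All function names are ours, and the score computation is assumed to be given.

```python
import numpy as np
from scipy.optimize import minimize

def tpin_ifelm_linear(H, t, s, c=1.0, tau=0.5, varsigma=0.5, max_iter=20, tol=1e-5):
    """CCCP sketch for linear TPin-IFELM.

    H : (N, L) hidden-layer matrix, t : labels in {+1, -1}, s : score values of Eq. (11).
    """
    N = H.shape[0]
    Q = (t[:, None] * H) @ (t[:, None] * H).T             # Q = T H H^T T with T = diag(t)
    e = np.ones(N)
    beta = np.zeros(H.shape[1])
    for _ in range(max_iter):
        margin = t * (H @ beta)
        delta = np.where(margin < varsigma + 1.0, c * s * tau, 0.0)   # Eq. (19)
        lo, hi = -delta, (1.0 + tau) * c * s - delta                  # box constraints of (22)

        def obj(lam):
            return 0.5 * lam @ Q @ lam - e @ (lam + delta)

        def grad(lam):
            return Q @ lam - e

        res = minimize(obj, np.clip(np.zeros(N), lo, hi), jac=grad,
                       method="L-BFGS-B", bounds=list(zip(lo, hi)))
        beta_new = H.T @ (t * res.x)                      # beta = H^T T lambda
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

def tpin_ifelm_predict(H_test, beta):
    return np.sign(H_test @ beta)                         # Eq. (23) written through beta
```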

3.3 Nonlinear Case

In the ELM kernel space [2, 3, 41], the membership degree of the sample is defined by

$$\begin{aligned} \mu _i^\Phi =\left\{ \begin{array}{cc} 1-\ \frac{\sqrt{\mathcal {K}_{ELM}\left( x_i,x_i\right) -\frac{2}{N^+}\sum _{x_j\in \mathcal {X}^+}{\mathcal {K}_{ELM}(x_i,x_j)} +\frac{1}{{N^+}^2}\sum _{x_l\in \mathcal {X}^+} \sum _{x_j\in \mathcal {X}^+}{\mathcal {K}_{ELM}(x_l,x_j)}}}{r^++\varrho },&{} t_i=+1\\ 1-\ \frac{\sqrt{\mathcal {K}_{ELM}\left( x_i,x_i\right) -\frac{2}{N^-} \sum _{x_j\in \mathcal {X}^-}{\mathcal {K}_{ELM}(x_i,x_j)}+\frac{1}{{N^-}^2} \sum _{x_l\in \mathcal {X}^-}\sum _{x_j\in \mathcal {X}^-}{\mathcal {K}_{ELM} (x_l,x_j)}}}{r^-+\varrho },&{}t_i=-1\\ \end{array}\right. , \end{aligned}$$
(24)

and the non-membership degree of the sample is defined as

$$\begin{aligned} \nu _i^\Phi =(1-\ \mu _i^\Phi )\rho \left( x_i\right) , \end{aligned}$$
(25)

where \(\mathcal {K}_{ELM}\left( x_i,x_i\right) = h\left( x_i\right) \cdot h\left( x_i\right) \), \(\mathcal {K}_{ELM}(x_i,x_j)= h\left( x_i\right) \cdot h(x_j)\), \(1\le i\le N\),

$$\begin{aligned} r^+=\max _{x_i\in \mathcal {X}^+}{\left( \sqrt{\mathcal {K}_{ELM}\left( x_i,x_i\right) -\frac{2}{N^+}\sum _{x_j\in \mathcal {X}^+}{\mathcal {K}_{ELM}(x_i,x_j)}+\frac{1}{{N^+}^2}\sum _{x_l\in \mathcal {X}^+}\sum _{x_j\in \mathcal {X}^+}{\mathcal {K}_{ELM}(x_l,x_j)}}\right) }, \end{aligned}$$

and

$$\begin{aligned} r^-=\max _{x_i\in \mathcal {X}^-}{\left( \sqrt{\mathcal {K}_{ELM}\left( x_i,x_i\right) -\frac{2}{N^-}\sum _{x_j\in \mathcal {X}^-}{\mathcal {K}_{ELM}(x_i,x_j)}+\frac{1}{{N^-}^2}\sum _{x_l\in \mathcal {X}^-}\sum _{x_j\in \mathcal {X}^-}{\mathcal {K}_{ELM}(x_l,x_j)}}\right) }. \end{aligned}$$

According to Eq. (24) and Eq. (25), the score function is defined as follows:

$$\begin{aligned} s_i^\Phi =\ \left\{ \begin{array}{cc} \mu _i^\Phi ,&{}\nu _i^\Phi =0\\ 0,&{}\mu _i^\Phi \le \nu _i^\Phi \\ \frac{1-\nu _i^\Phi }{2-\mu _i^\Phi -\nu _i^\Phi },&{}\mu _i^\Phi >\nu _i^\Phi \ {\textrm{and}}\ \nu _i^\Phi \ne 0\\ \end{array}\right. . \end{aligned}$$
(26)
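The square roots in Eq. (24) are kernel-space distances to the class centers and can be evaluated directly from the kernel matrix without forming \(h(x)\) explicitly. A small illustrative helper (with names of our own choosing) is given below; the non-membership degrees (25) and the scores (26) then follow exactly as in the linear case, with the KNN search performed on kernel-induced distances.

```python
import numpy as np

def kernel_membership(K, t, varrho=1e-7):
    """Membership degrees of Eq. (24) from the ELM kernel matrix K[i, j] = K_ELM(x_i, x_j)."""
    mu = np.empty(len(t))
    for label in (+1, -1):
        idx = np.where(t == label)[0]
        Kc = K[np.ix_(idx, idx)]                   # within-class kernel block
        # ||h(x_i) - center||^2 = K(x_i, x_i) - 2 * mean_j K(x_i, x_j) + mean_{l, j} K(x_l, x_j)
        sq = np.diag(Kc) - 2.0 * Kc.mean(axis=1) + Kc.mean()
        d = np.sqrt(np.maximum(sq, 0.0))           # guard against round-off
        mu[idx] = 1.0 - d / (d.max() + varrho)     # d.max() plays the role of r^+ or r^-
    return mu
```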

The original problem of nonlinear TPin-IFELM can be expressed as

$$\begin{aligned} {min}_\varpi \ J\left( \varpi \right) =\ \frac{1}{2}\left\| {\varpi }\right\| ^2+\ c\sum _{i=1}^{N}{s_i^\Phi P}_{\tau ,\varsigma }\ \left( 1-t_i{\varpi }^{\textrm{T}} h\left( x_i\right) \right) . \end{aligned}$$
(27)

Similar to linear TPin-IFELM, (27) can be solved by CCCP. In the kth iteration, the subproblem of (27) can be expressed as

$$\begin{aligned}&min_{\varpi ,\xi }\ {\frac{1}{2}\left\| {\varpi }\right\| ^2+c\sum _{i=1}^{N}{s_i^\Phi \xi _i}+\ \sum _{i=1}^{N}{{\delta _i^\Phi }^{k-1}t_i{\varpi }^{\textrm{T}} h\left( x_i\right) }}\nonumber \\&s.t.\ t_i{\varpi }^{\textrm{T}} h\left( x_i\right) \ge \ 1-\ {\frac{1}{1+\tau }\xi _i},{\ \xi }_i\ge 0,\ i=1\ldots N, \end{aligned}$$
(28)

where \(\varpi \) is the output weight vector in the ELM kernel space, and

$$\begin{aligned} {\delta _i^\Phi }^{k-1}=\left\{ \begin{array}{cc} cs_i^\Phi \tau ,&{}\ t_if_{\varpi ^{k-1}}\left( x_i\right) = t_i{h\left( x_i\right) }^{\textrm{T}}\varpi ^{k-1}<\varsigma +1\\ 0,&{}\ \ \ \textrm{otherwise}\\ \end{array}\right. . \end{aligned}$$
(29)

The dual problem of (28) is as follows:

$$\begin{aligned}&min_\alpha \ {\frac{1}{2}({\alpha }^{\textrm{T}}-{\delta }^{\textrm{T}})\widetilde{Q}(\alpha -\delta )-{e}^{\textrm{T}}\alpha }\nonumber \\&s.t.\ 0\le \ \alpha \le \left( 1+\tau \right) cS^\Phi , \end{aligned}$$
(30)

where \(\widetilde{Q}=T\Omega _{ELM}T\), \(\Omega _{ELM}=H{H}^{\textrm{T}}\in \mathfrak {R}^{N\times N}\) is the ELM kernel matrix whose elements are \({\Omega _{ELM}}_{ij}=\mathcal {K}_{ELM}(x_i,x_j)\), and \(S^\Phi ={\left[ s_1^\Phi ,\ldots ,s_N^\Phi \right] }^{\textrm{T}}\) is the vector of kernel-space score values.

Algorithm 3: The complete process of nonlinear TPin-IFELM

Similar to linear TPin-IFELM, Eq. (30) is equivalent to

$$\begin{aligned}&min_\lambda \ {\frac{1}{2}{\lambda ^\Phi }^{\textrm{T}}\widetilde{Q}\lambda ^\Phi -{e}^{\textrm{T}}\alpha }\nonumber \\&s.t.\ \mathfrak {L}^\Phi \le \ \lambda ^\Phi \le \mathfrak {U}^\Phi , \end{aligned}$$
(31)

where \(\mathfrak {L}^\Phi =-\delta ^\Phi \in \mathfrak {R}^N\) and \(\mathfrak {U}^\Phi =\left( 1+\tau \right) cS^\Phi -\ \delta ^\Phi \in \mathfrak {R}^N\).

For the unknown sample x, the decision function of nonlinear TPin-IFELM is

$$\begin{aligned} f\left( x\right) =sign\left( {\lambda ^\Phi }^{\textrm{T}} T\left[ \begin{array}{c} \mathcal {K}_{ELM}\left( x_1,x\right) \\ \vdots \\ \mathcal {K}_{ELM}\left( x_N,x\right) \\ \end{array}\right] \right) . \end{aligned}$$
(32)

The complete process of nonlinear TPin-IFELM is shown in Algorithm 3.

3.4 The Discussion

In this subsection, we discuss the relationship between TPin-IFELM and FELM. Similar to ELM, TPin-IFELM and FELM randomly assign the input weights and the biases of the hidden layer. Then, the hidden layer output matrix is obtained by the activation function.

In order to suppress the negative effects of noise, FELM uses only the membership degree for each sample, while TPin-IFELM employs both the membership and non-membership degrees based on the local information of samples. To further reduce the interference of noise, TPin-IFELM uses the truncated pinball loss function, which not only maintains sparsity and robustness but also enhances re-sampling stability.

4 Properties of the TPin-IFELM

In this section, we analyze the theoretical properties of TPin-IFELM, including noise insensitivity, sparsity, weight scatter minimization, and misclassification error minimization.

4.1 Noise Insensitivity and Sparsity

In this subsection, we discuss the noise insensitivity and sparsity of TPin-IFELM. The sub-gradient function of (12) is

$$\begin{aligned} \partial P_{\tau ,\varsigma }\left( 1-t_if\left( x_i\right) \right) =\left\{ \begin{array}{cc} 0, &{} 1-t_if\left( x_i\right)<-\varsigma \\ \left[ -\tau ,0\right] , &{} 1-t_if\left( x_i\right) =-\varsigma \\ -\tau , &{} -\varsigma<1-t_if\left( x_i\right) <0\\ \left[ -\tau ,1\right] , &{} 1-t_if\left( x_i\right) =0\\ 1, &{} 1-t_if\left( x_i\right) >0\\ \end{array}\right. . \end{aligned}$$
(33)

Equation (16) can be rewritten as:

$$\begin{aligned} \textbf{0}\in \frac{\beta }{c}-\sum _{i=1}^{N}s_it_ih\left( x_i\right) \partial P_{\tau ,\varsigma }\left( 1-t_i{h\left( x_i\right) }^{\textrm{T}}\beta \right) , \end{aligned}$$
(34)

where \(\textbf{0}\in \mathfrak {R}^L\) is a column vector whose elements are all zero.

For a given \(\beta \), the index set can be partitioned into five sets:

$$\begin{aligned} \mathcal {S}_0^\beta&=\left\{ i:1-t_i{h\left( x_i\right) }^{\textrm{T}} \beta<-\varsigma \right\} ,\nonumber \\ \mathcal {S}_1^\beta&=\left\{ i:1-t_i{h\left( x_i\right) }^{\textrm{T}} \beta =-\varsigma \right\} ,\nonumber \\ \mathcal {S}_2^\beta&=\left\{ i:-\varsigma<1-t_i{h \left( x_i\right) }^{\textrm{T}}\beta <0\right\} ,\nonumber \\ \mathcal {S}_3^\beta&=\left\{ i:1-t_i{h\left( x_i\right) }^{\textrm{T}} \beta =0\right\} ,\nonumber \\ \mathcal {S}_4^\beta&=\left\{ i:1-t_i{h\left( x_i\right) }^{\textrm{T}} \beta >0\right\} . \end{aligned}$$
(35)

Since \(\partial P_{\tau ,\varsigma }\left( 1-t_if\left( x_i\right) \right) =0\) when the samples are located in \(\mathcal {S}_0^\beta \), the samples in \(\mathcal {S}_0^\beta \) make no contribution to the calculation of \(\beta \). Therefore, \(\mathcal {S}_0^\beta \) is closely related to the sparsity of (14); in other words, the parameter \(\varsigma \) controls the number of samples in \(\mathcal {S}_0^\beta \). The smaller the value of \(\varsigma \), the more samples fall into \(\mathcal {S}_0^\beta \) and the sparser (14) becomes. In particular, when \(\varsigma \rightarrow 0\), the truncated pinball loss function can be regarded as a hinge loss function, which is very sensitive to noise. On the contrary, the larger the value of \(\varsigma \), the fewer samples fall into \(\mathcal {S}_0^\beta \); (14) is then robust to noise but gradually loses its sparsity. In particular, when \(\varsigma \rightarrow +\infty \), the truncated pinball loss function degenerates into the pinball loss, and the sparsity is completely lost.
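The role of \(\varsigma \) can be made concrete by partitioning the training indices according to (35). The illustrative helper below does this literally; in practice, the equality sets \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) would be detected with a small numerical tolerance.

```python
import numpy as np

def partition_index_sets(H, t, beta, varsigma):
    """Split sample indices into the five sets of Eq. (35) for a given beta."""
    u = 1.0 - t * (H @ beta)                   # U_i = 1 - t_i h(x_i)^T beta
    S0 = np.where(u < -varsigma)[0]            # zero sub-gradient: these samples do not enter beta
    S1 = np.where(u == -varsigma)[0]
    S2 = np.where((u > -varsigma) & (u < 0.0))[0]
    S3 = np.where(u == 0.0)[0]
    S4 = np.where(u > 0.0)[0]
    return S0, S1, S2, S3, S4
```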

According to the five sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_1^\beta \), \(\mathcal {S}_2^\beta \), \(\mathcal {S}_3^\beta \) and \(\mathcal {S}_4^\beta \) of (35), the optimality condition can be written as the existence of \(\psi _i\in \left[ -\tau ,0\right] \) and \(\zeta _i\in \left[ -\tau ,1\right] \) such that

$$\begin{aligned}&\frac{\beta }{c}-\sum _{i\epsilon \mathcal {S}_1^\beta }{\psi _is_i}t_ih\left( x_i\right) +\tau \sum _{i\epsilon \mathcal {S}_2^\beta } s_it_ih\left( x_i\right) \nonumber \\ {}&\quad -\sum _{i\epsilon \mathcal {S}_3^\beta }{\zeta _is_i}t_ih\left( x_i\right) -\sum _{i\epsilon \mathcal {S}_4^\beta } s_it_ih\left( x_i\right) =0. \end{aligned}$$
(36)

The number of samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) is much smaller than that in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), and the samples in \(\mathcal {S}_1^\beta \) and \(\mathcal {S}_3^\beta \) make little contribution to Eq. (36). Therefore, the main concern here is the sets \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \). When \(\varsigma \) is fixed to a suitable value, the parameter \(\tau \) controls the number of samples in \(\mathcal {S}_0^\beta \), \(\mathcal {S}_2^\beta \) and \(\mathcal {S}_4^\beta \), which in turn affects the sparsity of (14). When \(\tau \) is large, such as \(\tau =1\), these three sets contain many samples, so (14) is robust to feature noise. When \(\tau \) is very small, such as \(\tau =0.1\), there are few samples in \(\mathcal {S}_4^\beta \), and (14) is more sensitive. In particular, when \(\tau =0\), there are no samples or only a few samples in \(\mathcal {S}_4^\beta \); therefore, when constructing the model, feature noise around the decision boundary brings significant negative effects. Since the total number of samples is fixed, the smaller \(\tau \) is, the fewer samples fall into \(\mathcal {S}_4^\beta \), the more samples fall into \(\mathcal {S}_0^\beta \), and the sparser (14) becomes.

In summary, appropriate values of \(\tau \) and \(\varsigma \) enable TPin-IFELM to balance noise insensitivity and sparsity.

4.2 Weight Scatter and Misclassification Error Minimization

The mechanism of TPin-IFELM can also be explained by the weight scatter and misclassification error minimization. The positive hyperplane \(f_+\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =1\) and the negative hyperplane \(f_-\left( x\right) :{\beta }^{\textrm{T}} h\left( x\right) =-1\) are constructed by the samples in \(\mathcal {S}_3^\beta \). The distance between positive and negative hyperplanes is \(\frac{2}{\left\| {\beta }\right\| }\). We can measure the weight scatter in terms of the sum of the distances of a given point from similar samples. In the random mapping feature space related to \(\beta \), the weight scatter of \(x_{i_0}\) can be defined as

$$\begin{aligned} \sum _{x_i\in \mathcal {X}}{s_i\vert {{\beta }^{\textrm{T}}\left( h\left( x_{i_0}\right) -h\left( x_i\right) \right) }\vert },\quad x_{i_0}\in \mathcal {S}_3^\beta . \end{aligned}$$
(37)

If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^+\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=1\) and \(t_{i_0}=1\), then

$$\begin{aligned} \sum _{x_i\in \mathcal {X}^+}{s_i\vert {{\beta }^{\textrm{T}}\left( h\left( x_{i_0}\right) -h\left( x_i\right) \right) } \vert } = \sum _{x_i\in \mathcal {X}^+}{s_i\vert {1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) }\vert }; \end{aligned}$$
(38)

If \(x_{i_0}\in \mathcal {S}_3^\beta \cap \mathcal {X}^-\), i.e., \({{\beta }^{\textrm{T}}}{h\left( x_{i_0}\right) }=-1\) and \(t_{i_0}=-1\), then

$$\begin{aligned} \sum _{x_i\in \mathcal {X}^-}{s_i\vert {{\beta }^{\textrm{T}}\left( h\left( x_{i_0}\right) -h\left( x_i\right) \right) } \vert } = \sum _{x_i\in \mathcal {X}^-}{s_i\vert {1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) }\vert }. \end{aligned}$$
(39)

Therefore,

$$\begin{aligned} {min}_\beta \frac{1}{2}\left\| {\beta }\right\| ^2+C_1\sum _{i=1}^{N}{s_i\vert {1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) }\vert } \end{aligned}$$
(40)

can be interpreted as maximizing the distance between the hyperplanes \(f_+\left( x\right) \) and \(f_-\left( x\right) \) while minimizing the weight scatter.

In (14), (40) is extended to \(P_{\tau ,\varsigma }\). The misclassification term

$$\begin{aligned} C_1min\left( s_i\left( 1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) \right) ,0\right) - C_2\left( L_{hinge}\left( s_i\left( 1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) +\varsigma \right) \right) -s_i\varsigma \right) \end{aligned}$$

is introduced into Eq. (40), i.e.,

$$\begin{aligned}&{min}_\beta \ \frac{1}{2}\left\| {\beta }\right\| ^2+C_1\sum _{i=1}^{N}{s_i\vert {1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) } \vert }\nonumber \\&\quad +C_1\sum _{i=1}^{N}min\left( s_i\left( 1-t_i\left( {\beta }^{\textrm{T}}h\left( x_i\right) \right) \right) ,0\right) \nonumber \\&\quad -C_2\sum _{i=1}^{N}\left( L_{hinge}\left( s_i\left( 1-t_i\left( {\beta }^{\textrm{T}} h\left( x_i\right) \right) +\varsigma \right) \right) -s_i\varsigma \right) . \end{aligned}$$
(41)

We obtain TPin-IFELM with \(C_1=c\left( 1+\tau \right) \) and \(C_2=c\tau \). Thus, TPin-IFELM minimizes the weight scatter and the misclassification error simultaneously.

5 Experiments

In this section, we verify the effectiveness of TPin-IFELM through a series of experiments on an artificial dataset and benchmark datasets.

5.1 Experimental Configuration

In order to evaluate the effectiveness of TPin-IFELM, we compare it with eight other state-of-the-art algorithms. TPin-IFELM with SFA, which replaces the score function SV in TPin-IFELM with the score function SFA of PIFTSVM, contains four parameters c, L, \(\tau \), and \(\varsigma \), and TPin-IFELM contains five parameters c, L, \(\tau \), \(\varsigma \) and K. To ensure the objectivity of the experiments, for the datasets with fewer than 2000 samples, the penalty parameter c of TPin-IFELM and TPin-IFELM with SFA, the penalty parameter C of OELM, RELM, and FELM, and the penalty parameters \(C_1\) and \(C_2\) of TELM, SPTELM, and PIFTSVM are searched from the set \(\left\{ 2^i|i=-10,-8,\ldots ,8,10\right\} \), and the number of hidden layer nodes L for these algorithms is searched from \(\left\{ 50,100,200,500\right\} \). For the datasets with 2000 or more samples, the penalty parameters c, C, \(C_1\) and \(C_2\) are searched from \(\left\{ 2^i|i=-10,-6,\ldots ,6,10\right\} \), and the number of hidden layer nodes L is searched from \(\left\{ 50,100,200\right\} \). \(\tau \) and \(\varsigma \) are searched from \(\left\{ 0.25,0.5,0.75\right\} \), \(\varepsilon \) is searched from \(\{0,0.2,0.5\}\), and for TPin-IFELM, the number of nearest neighbors K is searched from \(\left\{ 1,3,\ldots ,20\right\} \).

We implement all algorithms in MATLAB (R2018a). The experimental environment is a workstation with an 11th Gen Intel Core i5-11400H (2.70 GHz) processor and 16 GB RAM. We use quadprog to solve the quadratic programming problems and use three evaluation metrics to evaluate the classification performance: accuracy (ACC), the area under the ROC curve (AUC), and the \(F_1\)-measure \((F_1)\):

$$\begin{aligned} ACC&=\frac{TP+TN}{TP+FN+TN+FP}, \end{aligned}$$
(42)
$$\begin{aligned} F_1&=\frac{2\times T P}{2\times T P+FP+FN}, \end{aligned}$$
(43)
$$\begin{aligned} AUC&=\frac{\vert \{(x_i,x_j)\mid f(x_j)\le f\left( x_i\right) ,(x_i,x_j)\in \mathcal {X}^+\times \mathcal {X}^-\}\vert }{N^+\times N^-}, \end{aligned}$$
(44)

where FN denotes the number of false negatives, FP denotes the number of false positives, TN denotes the number of true negatives and TP denotes the number of true positives.
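The three metrics can be computed directly from the real-valued decision values and the labels; a short sketch (assuming nonzero decision values, so that the sign is well defined) is given below.

```python
import numpy as np

def evaluate(scores, t):
    """ACC, F1 and AUC of Eqs. (42)-(44); scores are decision values, t are labels in {+1, -1}."""
    pred = np.sign(scores)
    tp = np.sum((pred == 1) & (t == 1))
    tn = np.sum((pred == -1) & (t == -1))
    fp = np.sum((pred == 1) & (t == -1))
    fn = np.sum((pred == -1) & (t == 1))
    acc = (tp + tn) / len(t)
    f1 = 2 * tp / (2 * tp + fp + fn)
    pos, neg = scores[t == 1], scores[t == -1]
    auc = np.mean(pos[:, None] >= neg[None, :])     # pairwise count of Eq. (44)
    return acc, f1, auc
```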

5.2 Experiments on the Artificial Dataset

To verify the robustness and sparsity of TPin-IFELM, we conduct comparative experiments on an artificial dataset with two-dimensional features. The training set and test sets consist of 200 samples and 50 samples, respectively. The positive and negative samples of the artificial dataset are generated by the Gaussian distributions \(\mathcal {X}^+\sim \mathcal {N}\left( \mathcal {V}_1,\Sigma _1\right) \) and \(\mathcal {X}^-\sim \mathcal {N}\left( \mathcal {V}_2,\Sigma _2\right) \), respectively, where \(\mathcal {V}_1=\left[ \begin{array}{cc} 1&{}1\\ \end{array}\right] ^{\textrm{T}}\), \(\mathcal {V}_2=\left[ \begin{array}{cc} -1&{}-1\\ \end{array}\right] ^{\textrm{T}}\) and \(\Sigma _1=\Sigma _2=\left[ \begin{array}{cc} 1&{}\\ &{}1\\ \end{array}\right] \).

Fig. 4: Separating boundaries of FELM, TPin-IFELM with SFA and TPin-IFELM

As shown in Fig. 4, the red “+” and blue “\(\times \)” denote the positive and negative training samples, respectively. The pink “+” and green “\(\times \)” denote the positive and negative test samples, respectively. The support vectors are circled by “\(\circ \)”, and the noises identified by the algorithm are framed by black “\(\diamond \)”. We can see that, compared with FELM, TPin-IFELM with SFA and TPin-IFELM use both the membership and non-membership degrees and the truncated pinball loss function, so they can more effectively reduce the negative effect of noise. The numbers of support vectors of TPin-IFELM with SFA and TPin-IFELM are 33% and 29% of the total number of samples, respectively. Thus, compared with FELM, TPin-IFELM with SFA and TPin-IFELM are sparser. Table 1 shows the experimental results of FELM, TPin-IFELM with SFA, and TPin-IFELM on the artificial dataset, and the best results of each evaluation indicator are shown in bold. As shown in Table 1, TPin-IFELM is superior to FELM and TPin-IFELM with SFA in terms of ACC and AUC and is second only to TPin-IFELM with SFA in terms of \(F_1\).

Table 1 The experimental results on artificial dataset

5.3 Experiments on the Benchmark Datasets

To evaluate the effectiveness and robustness of TPin-IFELM, we conduct comparative experiments on 15 benchmark datasets. The detailed characteristics of the datasets are shown in Table 2, where #Samples, #Positive samples, #Negative samples, and #Features denote the number of samples, the number of positive samples, the number of negative samples and the number of features, respectively.

Table 2 The characteristics of experimental datasets

In order to verify the classification performance of TPin-IFELM and the eight comparison algorithms, we conduct extensive experiments on fifteen benchmark datasets. Appendix B provides additional experimental results. Unlike the other seven comparison algorithms, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of each sample in the classification process. As shown in Tables 7, 8 and 9, in terms of the average rank, each evaluation metric of TPin-IFELM is superior to that of the other eight algorithms, and the ACC and AUC of TPin-IFELM with SFA are second only to those of TPin-IFELM.

Table 3 ACC of nine algorithms in the 50% label noise and 0.5 Gaussian feature noise environment
Table 4 AUC of nine algorithms in the 50% label noise and 0.5 Gaussian feature noise environment
Table 5 \(F_1\) of nine algorithms in the 50% label noise and 0.5 Gaussian feature noise environment

Noise is commonly present in datasets and can reduce the classification performance of algorithms. In order to demonstrate the robustness of TPin-IFELM, we conduct noise experiments on the 15 benchmark datasets using label noise: we randomly select 50% of the training samples and add label noise to them. The experimental results are shown in Tables 10, 11 and 12. We can observe that all algorithms are negatively affected by the samples with label noise. However, TPin-IFELM is less disturbed by label noise than the other eight comparison algorithms. In addition, it is superior to them on most datasets. For classification problems with both label noise and feature noise, we add Gaussian noise [39] that follows the normal distribution \(N\left( 0,\sigma ^2\right) \) with \(\sigma =0.5\) to the training set to form a training set with feature noise, and then randomly select 50% of the training samples as samples with label noise. Tables 3, 4 and 5 show the experimental results, where the best results for each dataset are shown in bold. TPin-IFELM with SFA and TPin-IFELM are less disturbed by label noise and feature noise than the other seven algorithms and are superior to them on most datasets.

From the above noise experimental results, we can observe that ELM, OELM, RELM, TELM, and SPTELM do not consider the membership degree of the samples to reduce the negative impact of noise, resulting in a significant decrease in their classification performance. Different from FELM, TPin-IFELM with SFA and TPin-IFELM employ the membership and non-membership degrees to effectively identify the role of the samples and the noise in the classification process. At the same time, they introduce the truncated pinball loss function to enhance the robustness of the model. Compared to TPin-IFELM with SFA, TPin-IFELM uses the local information of the samples to construct more appropriate membership and non-membership degrees. Therefore, TPin-IFELM handles classification problems with noise better than the other eight comparison algorithms.

5.4 Statistical Analysis

From Tables 7, 8, 9, 10, 11 and 12 and Tables 3, 4 and 5, we can observe that no algorithm outperforms all the others on all datasets. In this subsection, we use the Friedman test [42] to analyze these algorithms statistically. Given \(\mathfrak {K}\) comparison algorithms and \(\mathcal {N}\) datasets, let \(r_i^j\) denote the rank of the j-th algorithm on the i-th dataset and \(R_j=\frac{1}{\mathcal {N}}\sum _{i=1}^{\mathcal {N}}r_i^j\) denote the average rank of the j-th algorithm. The Friedman statistic \(F_F=\frac{\left( \mathcal {N}-1\right) \chi _F^2}{\mathcal {N} \left( \mathfrak {K}-1\right) -\chi _F^2}\sim F\left( \mathfrak {K}-1,\left( \mathfrak {K}-1\right) \left( \mathcal {N}-1\right) \right) \), where \(\chi _F^2=\frac{12\mathcal {N}}{\mathfrak {K} \left( \mathfrak {K}+1\right) }\left[ \sum _{j=1}^{\mathfrak {K}}R_j^2 -\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) ^2}{4}\right] \). Table 6 shows the Friedman test results on the datasets without noise and the datasets with noise. We observe that the Friedman statistics are much larger than the critical values, so the null hypothesis that all algorithms have the same classification performance is rejected, i.e., there is a significant difference in classification performance among the algorithms.
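For reproducibility, the Friedman statistics reported in Table 6 follow directly from the average ranks; a minimal sketch of the two formulas above (the variable names are ours) is given below.

```python
import numpy as np

def friedman_test(avg_ranks, n_datasets):
    """Return (chi_F^2, F_F) for K algorithms with average ranks R_j over N datasets."""
    R = np.asarray(avg_ranks, dtype=float)
    K, N = len(R), n_datasets
    chi2 = 12.0 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2 / (N * (K - 1) - chi2)
    return chi2, f_f
```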

Table 6 Summary of Friedman statistics with and without noise and critical values with and without noise for all evaluation metrics
Fig. 5: Comparison results of TPin-IFELM and eight comparison algorithms using the Nemenyi test on datasets without noise and datasets with noise

Fig. 6: Comparative results of three methods for obtaining sample structure information on datasets without noise and datasets with noise

Fig. 7: Sensitivity analysis of parameters c and L on the datasets Heart and Ionosphere

Fig. 8: The performance of TPin-IFELM on four datasets without noise changes with increasing the value of \(\tau \)

Fig. 9: The performance of TPin-IFELM on four datasets with 30% label noise changes with increasing the value of \(\tau \)

Fig. 10: The performance of TPin-IFELM on four datasets with 50% label noise and feature noise of \(\sigma = 0.5\) changes with increasing the value of \(\tau \)

Fig. 11: The performance of TPin-IFELM on four datasets without noise changes with increasing the value of \(\varsigma \)

Fig. 12: The performance of TPin-IFELM on four datasets with 30% label noise changes with increasing the value of \(\varsigma \)

Fig. 13: The performance of TPin-IFELM on four datasets with 50% label noise and feature noise of \(\sigma = 0.5\) changes with increasing the value of \(\varsigma \)

Fig. 14: The performance of TPin-IFELM on four datasets without noise changes with increasing the value of K

Fig. 15: The performance of TPin-IFELM on four datasets with 30% label noise changes with increasing the value of K

Fig. 16: The performance of TPin-IFELM on four datasets with 50% label noise and feature noise of \(\sigma = 0.5\) changes with increasing the value of K

The difference between TPin-IFELM and the other eight algorithms is further compared by using the Nemenyi test [42]. The average rank difference between pairs of algorithms is compared with the critical difference (CD), where \(\textrm{CD}=q_\alpha \sqrt{\frac{\mathfrak {K}\left( \mathfrak {K}+1\right) }{6\mathcal {N}}}\). For the Nemenyi test, \(q_\alpha =3.102\) at the significance level \(\alpha =0.05\); thus, for the experiments without noise, \(\textrm{CD}=3.102\ \left( \mathfrak {K}=9,\mathcal {N}=15\right) \), and for the experiments with noise, \(\textrm{CD}=1.5510\ (\mathfrak {K}=9, \mathcal {N}=60)\). The CD diagrams of all evaluation metrics with and without noise are shown in Fig. 5. We observe that TPin-IFELM is superior to the other eight algorithms on each evaluation metric.
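The critical differences quoted above can be reproduced directly from the CD formula; a two-line check (with \(q_{0.05}=3.102\) for \(\mathfrak {K}=9\) algorithms) follows.

```python
import numpy as np

def nemenyi_cd(q_alpha, n_algorithms, n_datasets):
    """Critical difference CD = q_alpha * sqrt(K (K + 1) / (6 N)) of the Nemenyi test."""
    K, N = n_algorithms, n_datasets
    return q_alpha * np.sqrt(K * (K + 1) / (6.0 * N))

print(nemenyi_cd(3.102, 9, 15))   # ~3.102, noise-free experiments
print(nemenyi_cd(3.102, 9, 60))   # ~1.551, noise experiments
```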

5.5 Sensitivity Analysis

To analyze the parameter sensitivity of TPin-IFELM and the performance of methods for obtaining sample structure information, we conduct experiments on the benchmark datasets. The main parameters of TPin-IFELM include the penalty parameter c, the number L of hidden layer nodes, the parameter \(\tau \), the parameter K, and the parameter \(\varsigma \). The methods for obtaining sample structure information include KNN, K-Means, and Ward Linkage.

5.5.1 Methods for Obtaining Sample Structure Information

In order to investigate the impact of different methods of obtaining sample structure information on TPin-IFELM, we use KNN, K-Means, and Ward linkage to extract the local information of samples and conduct experiments on the Sonar and Colon-cancer datasets. The comparative results are shown in Fig. 6. As shown in Fig. 6, TPin-IFELM using KNN achieves the best performance. Compared to K-Means and Ward linkage, KNN can more effectively capture the correlation between a sample and the heterogeneous samples in its neighborhood, thus obtaining valuable local information.

5.5.2 Parameters c and L

To analyze the sensitivity of TPin-IFELM to c and L, we perform parameter sensitivity experiments on the Heart and Ionosphere datasets. The parameter c is searched from \(\{2^i\mid i=\ -10,-8,\ldots ,8,10\}\), the parameter L is searched from \(\{50,100,200,500\}\), and the other parameters are fixed. From Fig. 7, we can observe that the ACC, AUC, and \(F_1\) of TPin-IFELM are higher when c and L take larger values. In general, TPin-IFELM is sensitive to the parameter c and is less affected by changes in L.

5.5.3 Parameters \(\tau \), \(\varsigma \) and K

To analyze the effects of the parameters \(\tau \), \(\varsigma \), and K on the classification performance of TPin-IFELM, we conduct experiments on the Colon-cancer, Sonar, Heart and Ionosphere datasets, both without noise and with noise. Two types of noise are considered: 30% label noise, and 50% label noise combined with feature noise of \(\sigma = 0.5\). As shown in Figs. 8, 9 and 10, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\tau \), except on the Colon-cancer dataset; however, for samples with noise, TPin-IFELM is strongly affected by \(\tau \). As shown in Figs. 11, 12 and 13, for samples without noise, TPin-IFELM is minimally affected by the parameter \(\varsigma \); however, for samples with noise, TPin-IFELM is sensitive to \(\varsigma \). As shown in Figs. 14, 15 and 16, TPin-IFELM is sensitive to the parameter K.

6 Conclusion

In this paper, inspired by intuitionistic fuzzy theory and the truncated pinball loss, we propose a novel model, TPin-IFELM, for classification problems with noise. TPin-IFELM employs the KNN method to obtain the local information of samples, which yields more suitable membership and non-membership degrees. It exploits the membership and non-membership degrees to effectively identify whether boundary samples are noise, and it uses the truncated pinball loss function, which makes it more robust and sparse. Extensive experiments verify the effectiveness of TPin-IFELM: compared with state-of-the-art comparison algorithms, TPin-IFELM has superior classification performance. In future work, we will extend the proposed model to the multi-view classification problem.