1 Introduction

Fault diagnosis and detection are now largely data-driven. Although data acquisition has become easier, handling large amounts of high-dimensional data remains difficult. High-dimensional data contain much usable information, but much of it is redundant, and the sheer volume of data increases the computational complexity, which may lead to the curse of dimensionality.

A common feature extraction strategy for high-dimensional data is dimension reduction. Traditional linear dimension reduction methods include principal component analysis (PCA) (Duchene and Leclercq 1988; Wold et al 1987) and linear discriminant analysis (LDA) (Duda et al 2001; Etemad and Chellapa 1997), among others. Later, Schölkopf et al. (1997) proposed kernel principal component analysis (KPCA) to deal with nonlinear data and achieved good results. With the further development of manifold learning, better choices for handling nonlinear data have become available.

Manifold learning methods are mainly divided into those that preserve local information and those that preserve global information. Locally linear embedding (LLE) (Roweis and Saul 2000) is a classical manifold learning method based on local information, which generates an implicit mapping from the high-dimensional space to the low-dimensional space. Nonlinear methods that preserve global information mainly include multi-dimensional scaling (MDS) (Cox and Cox 1994) and isometric mapping (ISOMAP) (Law and Jain 2006; Tenenbaum et al 2000). ISOMAP shows that changing the distance measure between data points is feasible and effective. Li et al. (2008) combined the maximum margin criterion (MMC) (Li et al 2006) with LLE to propose locally linear discriminant embedding (LLDE). LLDE exploits the category attributes of the data so that, after dimension reduction, samples of different categories lie farther apart and similar samples lie closer together. Besides, the algorithm avoids the small sample size (SSS) problem (Zheng et al 2004).

As mentioned above, using category attributes is more conducive to classification. Because of attraction or repulsion between data samples (Zhou and Chen 2006), the data distribution may be deformed. The cam weighted distance accounts for the scale of the data distribution and the direction of deformation, so it can measure deformed data well. When building a neighborhood with the cam weighted distance, samples in denser regions receive smaller weights and samples in sparser regions receive larger weights (Zhou and Chen 2006). Fault data deserve more attention in fault diagnosis, and the cam weighted distance naturally gives them more attention. Owing to the way LLDE computes distances, it has limitations when dealing with deformed data. To address this problem, cam locally linear discriminant embedding (CLLDE) is proposed based on the cam weighted distance.

After feature extraction, the extracted data must be classified to realize fault identification. Zhang et al. (2018) used robust LLE to extract features and then a support vector machine (SVM) to identify mechanical faults. Fan et al. (2019) used a convolutional neural network (CNN) for feature extraction and LightGBM for mechanical fault prediction. LightGBM (Ke et al 2017) is a gradient boosting algorithm based on decision trees proposed by Microsoft. The algorithm builds decision trees from feature histograms; gradient-based one-side sampling (GOSS) reduces the number of training samples, and exclusive feature bundling (EFB) merges mutually exclusive features. In addition, the leaf-wise tree growth strategy improves the computational efficiency of the algorithm.

The influence of hyper-parameters should be considered when training a classifier. LightGBM has several hyper-parameters that determine the final classification performance, so it is important to select appropriate hyper-parameters for the classifier.

Swarm intelligence optimization algorithms are widely applied in engineering and play an important role in tuning model parameters. Zeng et al. (2017) applied an improved particle swarm optimization (PSO) to the short-term load forecasting problem. Liu et al. (2019) used an improved PSO with the K-means algorithm to solve an emergency-patient clustering problem. You et al. (2018) used a BP neural network optimized by hybrid PSO to study an electro-hydraulic control system. Zeng et al. (2018) proposed a switching delayed PSO to optimize the hyper-parameters of an SVM for medical diagnosis. Pathana et al. (2021) used an optimization algorithm to tune a CNN that identifies COVID-19 patients from lung X-ray images.

The sparrow search algorithm (SSA) (Xue and Shen 2020) is a novel swarm intelligence algorithm that has been applied in many fields with good results (Xing et al 2021; Zhang and Ding 2021; Zhu and Yousefi 2021). Xing et al. (2021) applied SSA to fault diagnosis of a wheelset-bearing system. Zhang and Ding (2021) used a chaotic mapping strategy to improve SSA for optimizing a stochastic configuration network. Zhu and Yousefi (2021) proposed an adaptive SSA to address a parameter identification problem. However, the search efficiency and convergence speed of SSA still need to be improved. To strengthen the algorithm's ability to find the optimal solution and to accelerate its convergence, the elite opposite sparrow search algorithm (EOSSA) is proposed based on the elite opposite learning strategy and the orifice imaging opposite learning strategy. EOSSA is used to optimize LightGBM to find the most suitable hyper-parameters for identifying fault data. The diagram of the proposed fault diagnosis approach is shown in Fig. 1.

Fig. 1 Diagram of the proposed framework based on the EOSSA-LightGBM approach

The main contributions of this paper are summarized as follows. The CLLDE algorithm is proposed based on the cam weighted distance to overcome the limitations of LLDE when dealing with deformed data. EOSSA is proposed based on the elite opposite learning strategy and the orifice imaging opposite learning strategy, which improves the ability to find the optimal solution and accelerates convergence compared with SSA. SSA and EOSSA are used to optimize the hyper-parameters of LightGBM for the first time, and a new fault diagnosis approach based on EOSSA-optimized LightGBM is proposed. Aiming at the problem that fault data may not be well approximated by a normal distribution for similarity measurement, an approach based on CLLDE and EOSSA-LightGBM is proposed to address the diagnosis of deformed fault data.

The rest of this paper is organized as follows. In Sect. 2, CLLDE is proposed based on cam weighted distance, and its effectiveness is verified on the iris dataset. In Sect. 3, EOSSA is proposed by introducing the elite opposite learning strategy and orifice imaging opposite learning strategy into SSA. The establishment process of EOSSA-LightGBM and the fault diagnosis experiment are introduced in Sect. 4. The conclusion of the whole paper is in Sect. 5.

2 Cam locally linear discriminant embedding

2.1 Cam weighted distance

Let the d-dimensional vector Z obey the standard normal distribution N(0, I); its probability density function is

$$\begin{aligned} f\left( z \right) =\frac{1}{\left( 2\pi \right) ^{\frac{d}{2}}}\exp \left( -\frac{1}{2}z^Tz \right) . \end{aligned}$$
(1)

A random vector X following the cam distribution \(D_d(a,b,\tau )\) can be defined by the transformation (Zhou and Chen 2006)

$$\begin{aligned} X =\left( a+b\frac{z^T\tau }{\Vert z \Vert } \right) z, \end{aligned}$$
(2)

where z denotes a sample from the original, well-distributed data and \(a>b\geqslant 0\): a reflects the scale of the data distribution and b the degree of deformation in a certain direction, while \(\tau\) is a normalized vector that describes the deformation orientation. With \(\Vert z\Vert =\sqrt{z^Tz}\), X follows the deformed distribution, and a and b are the deformation parameters (Pan et al 2009). The deformed data can be transformed back to a normal distribution by \(z=X/\left( a+b\cos \theta \right)\), where \(\theta\) is the angle between z and \(\tau\); from this transformation the definition of the cam weighted distance is obtained.

Let \(x_0 \in R^d\) be the center of cam distribution \(D_d\left( a,b,\tau \right)\). The cam weighted distance from point \(x \in R^d\) to \(x_0\) can be defined as

$$\begin{aligned} \mathrm{Dist}\left( x_0,x \right) =\frac{\Vert x-x_0 \Vert }{a+b\frac{\left( x-x_0 \right) ^T\tau }{\Vert x-x_0 \Vert }}. \end{aligned}$$
(3)

Lemma 1

 (Zhou and Chen 2006) If a random vector \(X\sim D_d\left( a,b,\tau \right)\), then \(E\left( X \right) =c_1b\tau\) and \(E\left( \Vert X \Vert \right) =c_2a\), where \(c_1\) and \(c_2\) are constants that can be expressed as

$$\begin{aligned}&c_1=\sqrt{2}\times \frac{\varGamma \left( \frac{d+1}{2} \right) }{\varGamma \left( \frac{d}{2} \right) \cdot d}, \end{aligned}$$
(4)
$$\begin{aligned}&c_2=\sqrt{2}\times \frac{\varGamma \left( \frac{d+1}{2} \right) }{\varGamma \left( \frac{d}{2} \right) }, \end{aligned}$$
(5)

where d is the dimensionality of the random vector X. \(\varGamma\) represents Gamma function \(\varGamma \left( m\right) =\int _0^{\infty }{t^{m-1}e^{-t}dt}\left( m>0 \right)\).

According to Lemma 1, an arbitrary point \(x_i\) in the dataset can be regarded as the center of a cam distribution, with the points around it subject to that distribution. Its k nearest neighbors \(X_i = [x_{i1}, x_{i2}, \dotsc , x_{ik}]\) can be transformed as \(V_i = [v_{i1},v_{i2}, \dotsc , v_{ik}]\), where \(v_{ij} = x_{ij} - x_i\), \({j} = 1, 2, \dotsc , k\). The central vector \(\hat{G}_i\) and the average vector length \(\hat{L}_i\) can be presented as follows:

$$\begin{aligned}&\hat{G}_i=\frac{1}{k}\sum \limits _{j=1}^k{v_{ij}}, \end{aligned}$$
(6)
$$\begin{aligned}&\hat{L}_i=\frac{1}{k}\sum \limits _{j=1}^k{\Vert v_{ij} \Vert }. \end{aligned}$$
(7)

\(\hat{G}_i\), the mean of the vectors \(v_{ij}\), can be used to estimate \(E\left( X \right)\), and \(\hat{L}_i\), the mean of \(\Vert v_{ij} \Vert\), can be used to estimate \(E\left( \Vert X \Vert \right)\). The parameter estimates then follow from Lemma 1:

$$\begin{aligned}&\hat{a}_i=\frac{\hat{L}_i}{c_2}, \end{aligned}$$
(8)
$$\begin{aligned}&\hat{b}_i=\frac{\Vert \hat{G}_i \Vert }{c_1}, \end{aligned}$$
(9)
$$\begin{aligned}&\hat{\tau }_i=\frac{\hat{G}_i}{\Vert \hat{G}_i \Vert }. \end{aligned}$$
(10)
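
As an illustration of Eqs. (3) and (6)-(10), the following NumPy sketch estimates the cam parameters of a point from its k nearest neighbors and evaluates the cam weighted distance. The function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.special import gammaln

def cam_constants(d):
    # c1 and c2 from Eqs. (4)-(5); gammaln is used for numerical stability
    c2 = np.sqrt(2.0) * np.exp(gammaln((d + 1) / 2.0) - gammaln(d / 2.0))
    c1 = c2 / d
    return c1, c2

def estimate_cam_params(x_i, neighbors):
    # neighbors: (k, d) array holding the k nearest neighbors of x_i
    V = neighbors - x_i                       # v_ij = x_ij - x_i
    G = V.mean(axis=0)                        # central vector, Eq. (6)
    L = np.linalg.norm(V, axis=1).mean()      # average vector length, Eq. (7)
    c1, c2 = cam_constants(x_i.shape[0])
    a = L / c2                                # Eq. (8)
    b = np.linalg.norm(G) / c1                # Eq. (9)
    tau = G / (np.linalg.norm(G) + 1e-12)     # Eq. (10)
    return a, b, tau

def cam_distance(x0, x, a, b, tau):
    # cam weighted distance from x to the center x0, Eq. (3)
    diff = x - x0
    norm = np.linalg.norm(diff) + 1e-12
    return norm / (a + b * np.dot(diff, tau) / norm)
```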

2.2 Cam locally linear discriminant embedding

In practical applications, most high-dimensional data are not normally distributed. Fault data, in particular, often differ from normal data during acquisition because of attraction, repulsion, and other factors, and these changes cannot be ignored. Under this condition, assuming that the data approximately follow the standard normal distribution produces large errors, and the usual Euclidean distance based on the normal-distribution assumption is no longer applicable. The cam weighted distance transforms the deformed data toward a normal distribution, which eliminates the influence of data deformation and better describes similarity (Zhou and Chen 2006). Considering this, the cam weighted distance is used to improve the nearest-neighbor distance calculation of LLDE and make it more suitable for fault diagnosis. Cam locally linear discriminant embedding (CLLDE) uses the cam weighted distance to compute the distances between samples; the other calculation steps are the same as in LLDE.

Changing the distance used by LLDE between a point and its candidate neighbors changes how the algorithm selects nearest neighbors. Instead of simply taking the closest points under the Euclidean metric, the cam weighted distance weighs points in all directions around the center point when selecting neighbors.

Based on the cam weighted distance and combining LLE with MMC, CLLDE is proposed for feature extraction. LLE is a nonlinear dimension reduction method. Let \(X=\{X_1, X_2, \dotsc , X_n \}\) with \(X_{i} \in {R}^{D}\) denote n points in the D-dimensional space, where \(i=1,2, \dotsc , n\). The mapping of the high-dimensional data into the low-dimensional space is written as \(Y=\{ Y_1, Y_2, \dotsc , Y_n \}\), \(Y_i \in {R}^{d}\), where \(i=1,2, \dots , n\) and d is the dimensionality of the low-dimensional space.

LLE achieves this goal in three steps (Lei et al 2010). First, the kNN algorithm or the \(\varepsilon\)-ball criterion is used to find the nearest neighbors of each point. Second, LLE finds the best reconstruction weight matrix by minimizing the local reconstruction error of each \(x_i\), which can be described as

$$\begin{aligned} \phi \left( W \right) =\,\text {arg}\,\underset{w}{\min }\sum _{i=1}^N{\Vert x_i-\sum _{j=1}^k{w_{ij}x_j} \Vert ^2}, \end{aligned}$$
(11)

where N is the number of data points, \(x_i\) denotes the i-th data point, \(x_j\) denotes the j-th nearest neighbor of \(x_i\), and \(w_{ij}\) is the weight of the j-th nearest neighbor in the reconstruction of the i-th data point.

Let \(N_i^{k}\) be the set of the k neighbor points of \(x_i\). The point \(x_i\) is reconstructed from its neighbors: if \(x_j\) is a neighbor of \(x_i\), then \(w_{ij} \ne 0\); otherwise, \(w_{ij} = 0\). In addition, the weights should satisfy \(\sum _{j=1}^k{w_{ij}=1}\) to obtain a better data distribution after dimensionality reduction. The restrictions on the weights W can be written as

$$\begin{aligned} {\left\{ \begin{array}{ll} {w_{ij}= 0},\,\,\,\,&{}\mathrm{if} \,\,x_j\notin N_{i}^{k},\\ \sum \limits _{j=1}^k{w_{ij}=1}, \,\,\,\,&{}\mathrm{if}\,\,x_j\in N_{i}^{k}. \end{array}\right. } \end{aligned}$$
(12)
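
For reference, a minimal sketch of the standard solution of (11) under the constraints (12) is given below: for each \(x_i\), the local Gram matrix of the shifted neighbors is regularized and solved against the all-ones vector, and the weights are normalized to sum to one. The names and the regularization constant are assumptions.

```python
import numpy as np

def local_weights(x_i, neighbors, reg=1e-3):
    # neighbors: (k, D) array of the k nearest neighbors of x_i
    Z = neighbors - x_i                                   # shift neighbors to the origin
    C = Z @ Z.T                                           # local Gram matrix, k x k
    C += reg * np.trace(C) * np.eye(len(neighbors))       # regularization (important when k > D)
    w = np.linalg.solve(C, np.ones(len(neighbors)))       # minimize the reconstruction error
    return w / w.sum()                                    # enforce sum_j w_ij = 1, Eq. (12)
```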

In the third step, the reconstruction matrix W can be used to calculate the optimal embedding matrix Y after dimension reduction. This step can be described as

$$\begin{aligned} \begin{aligned} \phi \left( Y \right)&=\,\text {arg}\,\underset{y}{\min }\sum _{i=1}^N{\Vert y_i-\sum _{j=1}^k{w_{ij}y_j} \Vert ^2} \\&=\,\,\text {arg}\,\underset{y}{\min }\,\,\text {tr}\{Y^T\left( I-W \right) ^T\left( I-W \right) Y\}. \end{aligned} \end{aligned}$$
(13)

Let \(M=\left( I-W \right) ^T\left( I-W \right)\); then (13) can be written as \(\mathrm{arg}\,\,\underset{y}{\min }\,\,\mathrm{tr}\{Y^TMY\}\), where \(M=\left[ M_{ij}\right] _{n\times n}\). To avoid degenerate solutions, the embedding vectors are constrained to have unit covariance, i.e., \(\frac{1}{N}\,\,Y^TY=I\). According to the Rayleigh-Ritz theorem (Li et al 2008), the eigenvectors associated with the smallest nonzero eigenvalues of the sparse matrix M give the solution of (13).

To obtain an explicit mapping, Li et al. (2008) linearized LLE by introducing \(Y=V^TX\), so (13) can be rewritten as

$$\begin{aligned} \varPhi \left( Y \right) =\text {arg}\,\,\underset{y}{\min }\,\,\text {tr}\{V^TXMX^TV\}. \end{aligned}$$
(14)

The maximum margin criterion (MMC) restricts the mapping relationship; its purpose is to maximize the distance between samples of different classes. It uses the category attributes of the data to improve separability. \(S_w\) is the within-class scatter matrix and \(S_b\) is the between-class scatter matrix; they can be calculated by (Li et al 2006)

$$\begin{aligned} S_w&=\sum _{i=1}^c{\sum _{j=1}^{n_i}{\left( x_{j}^{i}-m_i \right) \left( x_{j}^{i}-m_i \right) ^T}}, \end{aligned}$$
(15)
$$\begin{aligned} S_b&=\sum \limits _{i=1}^c{n_i\left( m_i-m \right) \left( m_i-m \right) ^T}, \end{aligned}$$
(16)

where c is the number of classes, m is the mean vector of all the samples, \(m_i\) represents the average vector of the i-th class, \(n_i\) is the number of samples in the i-th class, and \(x_{j}^{i}\) is the j-th sample in the i-th class.
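
A compact sketch of Eqs. (15)-(16) is shown below; X is assumed to be an (n, D) sample matrix and y an integer label vector, and the function name is illustrative.

```python
import numpy as np

def scatter_matrices(X, y):
    m = X.mean(axis=0)                                   # overall mean vector
    D = X.shape[1]
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                             # class mean m_i
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-class scatter, Eq. (15)
        Sb += len(Xc) * np.outer(mc - m, mc - m)         # between-class scatter, Eq. (16)
    return Sw, Sb
```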

The objective function of MMC under projection matrix W can be described as

$$\begin{aligned} J=\mathrm{tr}\{ W^T\left( S_b-S_w \right) W \}. \end{aligned}$$
(17)

The class relationship between samples is considered in MMC, which can obtain the optimal linear discriminant mapping while maintaining the internal geometric structure of the neighborhoods. The linearized LLE captures the manifold structure of the high-dimensional data, which ensures the scalability of the data. To minimize the reconstruction error and maximize the between-class distance at the same time, a multi-objective optimization problem is obtained by combining (14) and (17), which share the same mapping matrix in this case. The multi-objective optimization problem can be represented as

$$\begin{aligned}&{\left\{ \begin{array}{ll} \min \,\, \mathrm{tr}\{V^TXMX^TV\}, \\ \max \,\, \mathrm{tr}\{V^T\left( S_b-S_w \right) V\}. \end{array}\right. } \end{aligned}$$
(18)

Since (14) and (17) have the same mapping matrix, a single-objective optimization problem can be constructed by taking their difference to solve for the mapping matrix V. The single-objective optimization problem can be written as

$$\begin{aligned}&\mathrm{min}\,\,\mathrm{tr} \{V^T\left( XMX^T-\left( S_b-S_w \right) \right) V \}, \end{aligned}$$
(19)
$$\begin{aligned}&\mathrm{s.t.} \,\, V^TXX^TV=I. \end{aligned}$$
(20)

The problem can be solved by the Lagrange multiplier method, and the form of (19) can be transformed into

$$\begin{aligned} \left( XMX^T-\left( S_b-S_w \right) \right) V = \lambda XX^TV, \end{aligned}$$
(21)

where \(\lambda\) is a generalized eigenvalue of the matrix pair \((XMX^T-(S_b-S_w))\) and \(XX^T\), and V is the corresponding generalized eigenvector.

Therefore, the minimum of the objective function (19) is obtained once the generalized eigenvectors V of (21) are computed. V is composed of the eigenvectors corresponding to the first d smallest eigenvalues of the generalized spectral decomposition, where d is the target dimension.
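
The generalized eigenproblem (21) can be solved with standard linear algebra routines; the following sketch uses SciPy and returns the embedding \(Y=V^TX\). The symmetrization step and the small regularization of \(XX^T\) are implementation assumptions, and the names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def cllde_projection(X, M, Sw, Sb, d):
    # X: (D, n) data matrix with samples as columns, M = (I - W)^T (I - W)
    A = X @ M @ X.T - (Sb - Sw)               # left-hand matrix of Eq. (21)
    B = X @ X.T + 1e-8 * np.eye(X.shape[0])   # regularize XX^T for invertibility
    A = (A + A.T) / 2                         # keep both matrices symmetric
    B = (B + B.T) / 2
    vals, vecs = eigh(A, B)                   # generalized eigenvalues, ascending order
    V = vecs[:, :d]                           # eigenvectors of the d smallest eigenvalues
    return V.T @ X                            # low-dimensional embedding Y = V^T X
```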

After changing the distance used to select the nearest neighbors, the mapping Y is obtained from (21) and \(Y=V^TX\), which better reflects the actual distribution of the deformed data. The main steps of CLLDE are summarized in Algorithm 1.

Algorithm 1 The main steps of CLLDE

The next part presents a performance experiment comparing CLLDE with LLDE. The integrated development environment for all experiments is Python 3.7.6, the CPU is an Intel(R) Core(TM) i7-8750H, and the operating system is Windows 10. The following experiments are run in the same environment.

The iris dataset contains 150 records of three categories: iris-setosa, iris-versicolour, and iris-virginica. Each category contains 50 samples, and each sample has 4 characteristic variables, i.e., the samples are 4-dimensional. Figure 2 shows the dimension reduction results of LLDE on the iris data with 10, 20, 30, 40, and 50 neighbors, while Fig. 3 shows the corresponding results of CLLDE. The comparison shows that CLLDE achieves a better dimensionality reduction effect on the iris dataset than LLDE.

In summary, the experiments on the iris dataset show that CLLDE distinguishes the different categories better and, under the same neighborhood settings, is more suitable for labeled pattern recognition on uneven or deformed data.

Fig. 2 LLDE dimension reduction effect on iris dataset

Fig. 3 CLLDE dimension reduction effect on iris dataset

3 Elite opposite sparrow search algorithm

3.1 Sparrow search algorithm and elite opposite sparrow search algorithm

SSA is a novel swarm intelligence algorithm inspired by the behaviors of sparrows. It searches for solutions to optimization problems by imitating the foraging and anti-predatory behaviors of sparrows. First, some sparrows are randomly selected as producers in the population. The producers' main mission is to find food for the whole population, which determines the direction and area of foraging. Scroungers follow the producers to find food, and some scroungers keep a lookout over the producers and try to wrest food from them. Once a predator is detected, individuals send an anti-predatory signal. Finally, over many iterations of updating the positions of producers and scroungers, the population finds the most suitable foraging location.

The sparrow population has dimension n \(\times\) H, where n is the number of sparrows and H is the dimensionality of the variables to be optimized. The position of the i-th sparrow can be described as \(S_i=[s_{i,1}, s_{i,2}, \dotsc , s_{i,q}]\), \(i\in \left[ 1,n \right]\), \(q\in \left[ 1,H \right]\), where \(s_{i,j}\) denotes the position of the i-th sparrow in the j-th dimension. The fitness values of the sparrow population can be described as

$$\begin{aligned} F_S=\left[ \begin{array}{c} f\left( \left[ s_{1,1},s_{1,2}, \dotsc , s_{1,q} \right] \right) \\ f\left( \left[ s_{2,1},s_{2,2}, \dotsc , s_{2,q} \right] \right) \\ \vdots \\ f\left( \left[ s_{n,1},s_{n,2}, \dotsc , s_{n,q} \right] \right) \end{array} \right] , \end{aligned}$$
(22)

where n represents the number of all the sparrows of population, and the value of each row in \(F_S\) is the fitness value of the individual sparrow.

In SSA, the mission of the producers is to find food for the population, while the scroungers mainly follow the producers to obtain food. Besides, some scroungers may compete with producers for food and take it over if they win. Based on the above description, the position update of the producers is defined as

$$\begin{aligned} s_{i,j}^{t+1}=\left\{ \begin{aligned}&{s_{i,j}^{t}\cdot \exp \left( \frac{-i}{\beta \cdot T} \right) },&A_{\mathrm{alarm}}<J_{th}, \\&{s_{i,j}^{t}+Q\cdot \vartheta },&A_{\mathrm{alarm}}\geqslant J_{th}, \end{aligned} \right. \end{aligned}$$
(23)

where t represents the current iteration, \(s_{i,j}^{t}\) is the value of the i-th sparrow in the j-th dimension at the t-th iteration, T is the maximum number of iterations, and \(\beta \in (0, 1]\) is a random number. \(A_{\mathrm{alarm}}\in \left[ 0,1 \right]\) represents the warning value while \(J_{th}\in \left[ 0.5,1 \right]\) expresses the safety threshold. Q is a random number that is subject to the normal distribution. \(\vartheta\) is a \(1\times q\) matrix where all elements are one (Xue and Shen 2020).

The position of the scrounger can be updated by

$$\begin{aligned} s_{i,j}^{t+1}=\left\{ \begin{aligned}&Q\cdot \exp \left( \frac{s_\mathrm{worst}^{t}-s_{i,j}^{t}}{i^2} \right) ,&i>n/2,\\&s_\mathrm{pbest}^{t+1}+\vert s_{i,j}^{t}-s_\mathrm{pbest}^{t+1} \vert \cdot \iota ^+\cdot \vartheta ,&i\leqslant n/2, \end{aligned} \right. \end{aligned}$$
(24)

where \(s_\mathrm{pbest}^{t+1}\) represents the best position in the global search space at the t+1 iteration, while \(s_\mathrm{worst}^{t}\) represents the global worst position at  t-th iteration. \(\iota\) is a 1 \(\times\) q matrix whose elements are randomly assigned 1 or -1, and \(\iota ^+\) satisfies \(\iota ^+=\iota ^T(\iota \iota ^T)^{-1}\).

When sparrows in the population become aware of danger, those at the edge of the population fly to a safe place, while those in the middle of the population approach other sparrows in a random walk. Accordingly, the mathematical model can be expressed as

$$\begin{aligned} s_{i,j}^{t+1}={\left\{ \begin{array}{ll} s_\mathrm{best}^{t}+ \gamma \cdot \left( s_{i,j}^{t}-s_\mathrm{best}^{t} \right) , &{}f_i \ne f_g,\\ s_{i,j}^{t}+ K \cdot \left( \frac{s_{i,j}^{t}-s_\mathrm{worst}^{t}}{\vert f_i-f_w \vert +\mu } \right) , &{}f_i=f_g, \end{array}\right. } \end{aligned}$$
(25)

where \(s_\mathrm{best}^{t}\) expresses the globally optimal position at the t-th iteration, and \(\gamma\) is a random number that follows the standard normal distribution and controls the step size. \(K\in \left[ -1,1 \right]\) indicates the moving direction of the sparrow, and \(\mu\) is a very small constant introduced to avoid division by zero. Here \(f_i\) is the fitness value of the current sparrow, while \(f_g\) and \(f_w\) are the current global best and worst fitness values, respectively.
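
As a rough illustration of the update rules (23)-(25), the sketch below implements them for a single sparrow; `rng` denotes a `numpy.random.Generator`, and the handling of the random quantities and the function names are assumptions made for brevity.

```python
import numpy as np

def producer_update(s, i, t, T, alarm, J_th, rng):
    # s: 1-D position vector of the i-th sparrow at iteration t
    beta = rng.uniform(1e-6, 1.0)
    if alarm < J_th:
        return s * np.exp(-i / (beta * T))                    # Eq. (23), no danger detected
    return s + rng.normal() * np.ones_like(s)                 # Eq. (23), alarm raised

def scrounger_update(s, i, n, s_pbest, s_worst, rng):
    if i > n / 2:
        return rng.normal() * np.exp((s_worst - s) / (i ** 2))   # Eq. (24), fly elsewhere
    iota = rng.choice([-1.0, 1.0], size=s.size)
    iota_plus = iota / (iota @ iota)                          # iota^+ = iota^T (iota iota^T)^{-1}
    step = np.abs(s - s_pbest) @ iota_plus                    # scalar step size
    return s_pbest + step * np.ones_like(s)                   # Eq. (24), follow the best producer

def alarm_update(s, s_best, s_worst, f_i, f_g, f_w, rng, mu=1e-10):
    if f_i != f_g:
        return s_best + rng.normal() * (s - s_best)           # Eq. (25), edge sparrow flies to safety
    K = rng.uniform(-1.0, 1.0)
    return s + K * (s - s_worst) / (abs(f_i - f_w) + mu)      # Eq. (25), middle sparrow random walk
```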

In a swarm intelligence algorithm, the initial solutions determine the distribution of the initial population in the solution space, and this spatial distribution affects the search ability and convergence efficiency of the algorithm. A high-quality initial population can accelerate convergence and is more conducive to finding the global optimum. Based on the above, the elite opposite learning strategy and the orifice imaging opposite learning strategy are introduced into SSA, and the elite opposite sparrow search algorithm (EOSSA) is proposed.

Before introducing EOSSA, the opposite point and the elite opposite solution need to be explained. Let \(s=\left[ s_1, s_2, \dotsc , s_q \right]\) be a point in the q-dimensional space, where \(s_j\in \left[ \upsilon _j, \omega _j \right]\), \(j = 1, 2, \dotsc , q\), and \(\upsilon _j\) and \(\omega _j\) represent the lower and upper boundaries of the search space, respectively. The opposite point is defined as (Wang et al 2011)

$$\begin{aligned} s_{j}^{*}=\upsilon _j+\omega _j-s_j. \end{aligned}$$
(26)

Let \(s_{i,j}\) be the value of an ordinary individual \(s_i\) in the j-th dimension; then the elite opposite solution can be defined as (Sihwail et al 2020)

$$\begin{aligned} s_{i,j}^{e}=\rho \left( \upsilon _j+\omega _j \right) -s_{i,j}, \end{aligned}$$
(27)

where \(\rho \in [0,1]\) is a random number, \(\left[ \upsilon _j, \omega _j \right]\) is the interval constructed by the elite sparrows, and \(s_{i,j}^{e}\in \left[ \upsilon _j, \omega _j \right]\).

The elite opposite strategy provides more alternative solutions, and the dynamic boundary accelerates the convergence of SSA. EOSSA takes the sparrows whose fitness values rank in the top \(\nu\)% as the elite sparrows, where \(\nu\) is a constant that specifies the ratio of elite sparrows. EOSSA obtains the dynamic boundary from the elite sparrows and then applies the opposite strategy to generate opposite solutions. The sparrow before updating is compared with the sparrow after updating; if the opposite solution obtains a better fitness value, it replaces the previous sparrow.
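
A possible realization of this elite opposite step is sketched below: the dynamic boundary is taken from the current elite sparrows, each elite individual is reflected inside it according to Eq. (27), and the better of the original and opposite positions is kept. Applying the reflection to the elite sparrows themselves and assuming that smaller fitness is better are interpretation choices, not details fixed by the paper, and all names are illustrative.

```python
import numpy as np

def elite_opposition(pop, fitness, ratio, fitness_fn, rng):
    # pop: (n, q) sparrow positions; fitness: (n,) values, smaller assumed better
    n_elite = max(1, int(len(pop) * ratio))
    elite_idx = np.argsort(fitness)[:n_elite]             # top-ranked sparrows
    elite = pop[elite_idx]
    lower, upper = elite.min(axis=0), elite.max(axis=0)    # dynamic boundary per dimension
    rho = rng.uniform(0.0, 1.0, size=elite.shape)
    opposite = rho * (lower + upper) - elite               # elite opposite solution, Eq. (27)
    opposite = np.clip(opposite, lower, upper)             # keep it inside the dynamic interval
    for row, idx in enumerate(elite_idx):                  # greedy replacement by fitness
        f_new = fitness_fn(opposite[row])
        if f_new < fitness[idx]:
            pop[idx], fitness[idx] = opposite[row], f_new
    return pop, fitness
```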

The rest of the population consists of ordinary sparrows. Their low fitness is mainly because their search area is far away from the search range of the excellent individuals. To expand the search range, give the algorithm the ability to jump out of the current poor search area, and increase the diversity of the population, a different opposite learning strategy from that of the elite individuals is applied to the ordinary sparrows.

Algorithm 2 The main steps of EOSSA

Orifice imaging opposite learning is an opposite learning strategy based on an optical principle (Zhang et al 2021). Assume there is a light source \(\varpi\) with height h, and the lower and upper boundaries of the coordinate axis are \(\upsilon _j\) and \(\omega _j\), respectively. The projection of the light source on the coordinate axis is s. The image of height \(\tilde{h}\) formed by the light source through the orifice is \(\varpi ^*\), and its projection on the coordinate axis is \(s^{oi}\), which is called the opposite point of s. The schematic diagram of orifice imaging is shown in Fig. 4.

Fig. 4 The schematic diagram of orifice imaging principle

According to the principle of orifice imaging, the process can be described as

$$\begin{aligned} \frac{\left( \upsilon _j+\omega _j \right) /2-s}{s^{oi}-\left( \upsilon _j+\omega _j \right) /2}=\frac{h}{\tilde{h}}. \end{aligned}$$
(28)

Let \(h/\tilde{h}=\varrho\), then (28) can be transformed as

$$\begin{aligned} s^{oi}=\frac{\left( \upsilon _j+\omega _j \right) }{2}+\frac{\left( \upsilon _j+\omega _j \right) }{2\varrho }-\frac{s}{\varrho }. \end{aligned}$$
(29)

The process is extended to EOSSA, which can be described as

$$\begin{aligned} \frac{\left( \upsilon _j+\omega _j \right) /2-s_{i,j}^{t}}{s_{i,j}^{oi}-\left( \upsilon _j+\omega _j \right) /2}=\varrho , \end{aligned}$$
(30)
which can be solved for the opposite point as
$$\begin{aligned} s_{i,j}^{oi}=\frac{\left( \upsilon _j+\omega _j \right) }{2}+\frac{\left( \upsilon _j+\omega _j \right) }{2\varrho }-\frac{s_{i,j}^{t}}{\varrho }, \end{aligned}$$
(31)

where \(s_{i,j}^{t}\) is the j-th dimension of the i-th sparrow at the t-th iteration, \(s_{i,j}^{oi}\), formed by imaging through the orifice, is the opposite point of \(s_{i,j}^{t}\), \(\upsilon _j\) and \(\omega _j\) represent the lower and upper boundaries of the search space, respectively, and \(\varrho\) is the regulation factor.
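
A short sketch of the orifice imaging opposite point in Eq. (31) might look as follows; the default value of the regulation factor \(\varrho\) is an assumed tuning choice, not one given in the paper.

```python
import numpy as np

def orifice_opposite(s, lower, upper, rho=1000.0):
    # s, lower, upper: arrays over the dimensions of one sparrow; rho = h / h~
    mid = (lower + upper) / 2.0
    return mid + mid / rho - s / rho          # opposite point, Eq. (31)
```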

Remark 1

SSA, as a swarm intelligence optimization algorithm, is essentially greedy and may fall into a local extremum point. The orifice imaging opposite strategy and the elite opposite learning strategy are introduced into SSA to give it the ability to escape local extremum points. These strategies help the algorithm explore more of the search space in limited time: when the algorithm tends toward a local optimum, the opposite strategies make the sparrows search in the opposite direction of their current search region. Besides, this approach increases the diversity of the search population, which is conducive to obtaining the optimal solution.

The steps of the EOSSA can be summarized as Algorithm 2.

Table 1 The list of benchmark functions
Table 2 Results of benchmark functions
Fig. 5 The convergence curves on the unimodal benchmark functions

Fig. 6 The convergence curves on the multimodal benchmark functions

Fig. 7 The convergence curves on the fixed-dimension benchmark functions

3.2 The convergence characteristics of the algorithms on benchmark functions

Fig. 8 Box plot comparison of six algorithms in this experiment

In this part, benchmark functions are used to verify the feasibility and effectiveness of EOSSA. SSA, chaotic SSA (CSSA) (Zhang and Ding 2021), chaos particle swarm optimization (CPSO) (Kennedy and Eberhart 1995; Su et al 2015; Pluhacek et al 2018), disturbance and somersault foraging grey wolf optimizer (DSFGWO) (Mirjalili et al 2014; Wang et al 2021), and Levy-flight based moth-flame optimization (LMFO) (Mirjalili 2015; Suja 2021) are compared with EOSSA in terms of convergence characteristics. The initial population size of all algorithms is set to 100 and the number of iterations to 1000. The parameters of the algorithms used for comparison are set according to (Mirjalili 2015; Xue and Shen 2020). The detailed information of the standard benchmark functions is listed in Table 1 and the optimization results in Table 2. The reported results are the mean and standard deviation (Std) over 30 runs. Figure 5 shows the convergence curves of the algorithms on the unimodal benchmark functions \(F_1(x)\)-\(F_5(x)\), Fig. 6 on the multimodal benchmark functions \(F_6(x)\)-\(F_9(x)\), and Fig. 7 on the fixed-dimension benchmark functions \(F_{10}(x)\) and \(F_{11}(x)\).

Remark 2

\(F_1\) to \(F_5\) are unimodal benchmark functions. These functions have only one extreme point and can be used to verify the convergence speed, optimization accuracy, and local exploitation ability of the algorithm. \(F_6\) to \(F_9\) are multimodal benchmark functions. They have multiple local extremum points, which makes it extremely easy for an algorithm to fall into a local extremum, so they can be used to verify the ability to escape from local extrema and the global exploration ability. \(F_{10}\) and \(F_{11}\) are fixed-dimension benchmark functions, which can be used to further verify the convergence speed, stability, and convergence accuracy of the algorithm.

From the optimization results, the performance of EOSSA is improved compared with SSA because the opposite strategies provide more alternative solutions and improve the ability to avoid falling into local optima. Over the 30 runs, the average result of EOSSA is better than that of the other algorithms, and its Std is also better on most benchmark functions, although the standard deviation is slightly worse than the others on a few functions, such as \(F_6(x)\). From the convergence curves, the convergence speed of EOSSA is improved compared with SSA.

To better evaluate and compare the optimization performance of the algorithms, box plots are used for statistical comparison and analysis. Figure 8a-k shows the statistical results of SSA, EOSSA, CSSA, CPSO, DSFGWO, and LMFO on each benchmark function of this experiment. From the box plots, the comprehensive performance of EOSSA on the benchmark functions is better than that of the other algorithms. In summary, the convergence characteristics of EOSSA are better overall than those of the other five optimization algorithms.

4 Experiments of fault diagnosis

The accuracy of feature extraction affects the final recognition rate: simple and significant features improve fault recognition accuracy and reduce computational complexity. Manifold learning removes redundant dimensions through the geometric structure of the data. LLE is a classical manifold learning algorithm that reveals the internal structure of data, but for labeled data it does not use the category attributes. Therefore, LLDE was proposed by combining LLE with MMC. Because the data distribution is deformed, the Euclidean distance in LLDE is replaced by the cam weighted distance, yielding CLLDE. The experiment in Sect. 2 showed that CLLDE performs well on the classification problem.

The main steps of the fault diagnosis approach can be summarized as follows. First, CLLDE maps the original high-dimensional data into the low-dimensional feature space. Second, EOSSA optimizes the hyper-parameters of LightGBM to establish a classifier with a better diagnosis effect. Finally, the trained EOSSA-LightGBM identifies the new test fault data. The flowchart of EOSSA-LightGBM is shown in Fig. 9.

Fig. 9 The flowchart of EOSSA-LightGBM

The fault data used in this paper come from the Tennessee Eastman (TE) process (Yin et al 2012) dataset. Each sample of the simulation dataset contains 52 observed variables: the first 22 are continuous process measurements, the 23rd to 41st are composition measurements, and the last 11 are manipulated (control) variables. Considering the influence of a real environment, Gaussian noise is added to all observed variables.

The No. 18 dataset is selected to test the effectiveness of the algorithm. There are 980 training samples, including 480 fault samples and 500 normal samples. The training samples are randomly divided into a training set and a validation set at a ratio of 7 : 3. The testing set contains 960 samples, including 160 normal samples and 800 fault samples. After validation, the intrinsic dimension of CLLDE is set to 7 and the number of nearest neighbors to 13.

Fig. 10 The histogram of Table 4

The \(F_1\) score is used to analyze and evaluate the performance of the fault diagnosis approach. It is calculated from the precision rate and the recall rate; the mathematical expressions of these three indicators can be written as

$$\begin{aligned} \mathrm{Precision}&=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \end{aligned}$$
(32)
$$\begin{aligned} \mathrm{Recall}&=\,\,\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \end{aligned}$$
(33)
$$\begin{aligned} F_1\,\,\mathrm{score}&=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \end{aligned}$$
(34)

where TP is the number of positive samples correctly predicted as positive, FN is the number of positive samples incorrectly predicted as negative, and FP is the number of negative samples incorrectly predicted as positive.
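
These indicators correspond to standard scikit-learn metrics, as in the short sketch below; `y_true` and `y_pred` are placeholder label arrays, not variables from the paper.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    precision = precision_score(y_true, y_pred)   # Eq. (32)
    recall = recall_score(y_true, y_pred)          # Eq. (33)
    f1 = f1_score(y_true, y_pred)                  # Eq. (34)
    return precision, recall, f1
```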

Fig. 11 The ROC curve and PR curve of LightGBM optimized by SSA

Fig. 12 The ROC curve and PR curve of LightGBM optimized by EOSSA

The fault diagnosis process is divided into feature extraction and fault recognition. Different commonly used feature extraction methods and classification algorithms are combined and then compared with the proposed method, which uses CLLDE for feature extraction and EOSSA-LightGBM for fault diagnosis. The feature extraction methods include PCA, KPCA, ISOMAP, LLE, and LLDE; the classification algorithms include random forest (RF) (Cerrada et al 2016) and support vector machine (SVM) (Mavroforakis and Theodoridis 2006; Pan et al 2008; Su et al 2015).

Table 3 \(F_1\) score (%) comparison of different feature extraction methods and classifiers
Table 4 \(F_1\) score (%) comparison of different feature extraction methods and optimization algorithm
Fig. 13 The ROC curve and PR curve of LightGBM optimized by CSSA

In order to obtain the best hyper-parameters of LightGBM, EOSSA is used to optimize them. The major hyper-parameters, the learning rate \(\eta\) and the decision tree depth \(\varphi\), are selected for optimization. The \(F_1\) score is selected as the fitness function of EOSSA, which is formulated as

$$\begin{aligned} f=\frac{2\times \text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned}$$
(35)
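
A minimal sketch of the assumed fitness evaluation is given below: a candidate sparrow encodes \((\eta ,\varphi )\), a LightGBM classifier is trained on the training split, and the \(F_1\) score on the validation split is returned as the fitness of Eq. (35). The variable names, the parameter decoding, and the sign convention (negating so a minimizing optimizer maximizes \(F_1\)) are assumptions, not details from the paper.

```python
import lightgbm as lgb
from sklearn.metrics import f1_score

def lightgbm_fitness(candidate, X_train, y_train, X_val, y_val):
    eta = float(candidate[0])                  # learning rate, searched by EOSSA
    phi = int(round(candidate[1]))             # decision tree depth, searched by EOSSA
    model = lgb.LGBMClassifier(learning_rate=eta, max_depth=phi)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    return -f1_score(y_val, y_pred)            # negate so a minimizing EOSSA maximizes F1
```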

When the learning rate is 0.1058 and the depth of the decision tree is 7, the \(F_1\) score is the best, and its effectiveness is verified on the validation set. Therefore, the two hyper-parameters \(\eta\) and \(\varphi\) of LightGBM are set to 0.1058 and 7, respectively. The experimental results on the test set are listed in Tables 3 and 4. In Table 3, the horizontal comparison shows the performance of the classifiers with different feature extraction algorithms, and the vertical comparison shows the performance of different classifiers with the same feature extraction algorithm. Table 4 lists the results of the various optimization algorithms in fault diagnosis; to show the results more intuitively, the corresponding histogram is given in Fig. 10. From Table 4 and Fig. 10, we conclude that the proposed approach outperforms the others, which proves its effectiveness.

Fig. 14 The ROC curve and PR curve of LightGBM optimized by CPSO

Fig. 15 The ROC curve and PR curve of LightGBM optimized by DSFGWO

Fig. 16 The ROC curve and PR curve of LightGBM optimized by LMFO

Also, to compare the performance of the algorithms more intuitively, the receiver operating characteristic (ROC) curves and the precision-recall (PR) curves are shown in Figs. 11, 12, 13, 14, 15, and 16, which correspond to LightGBM optimized by SSA, EOSSA, CSSA, CPSO, DSFGWO, and LMFO, respectively. They show the performance of the different fault diagnosis approaches with different feature extraction methods. From the area under the curve (AUC) values, the proposed approach is superior to the other algorithms in fault identification in this comparison experiment.

5 Conclusion

In this paper, a new fault diagnosis approach based on EOSSA-optimized LightGBM is proposed. Aiming at the deformation of the data distribution, the cam weighted distance is introduced into LLDE for feature extraction. The experiments show that CLLDE is effective, and the influence of a randomly selected k value on CLLDE is smaller than on LLDE. In the fault diagnosis experiment, CLLDE is used as the feature extraction method and improves the performance of various classifiers, which shows that it can effectively extract data features. EOSSA is proposed by introducing the elite opposite learning strategy and the orifice imaging opposite learning strategy into SSA. The experimental results show that these strategies work: they accelerate the convergence of SSA and make the solutions more accurate. Overall, EOSSA shows superior performance on unimodal, multimodal, and fixed-dimension problems, although its statistical performance on benchmark function \(F_6\) is not as good as that of the other algorithms in the comparison. EOSSA finds the best solution of all the algorithms on \(F_6\), but its standard deviation is larger, which indicates that it cannot guarantee that the solution of every run falls in an appropriate range. Compared with the other optimization algorithms, EOSSA still shows advantages in the fault diagnosis problem, which indicates that it is feasible and effective for this problem. In future work, we will study the application of deep learning (DL) (Liu et al 2017) and reinforcement learning (RL) in fault diagnosis (Song et al 2021), as well as the performance of DL and RL combined with EOSSA in engineering applications.