Introduction

With the emergence of the Internet of Things, big data, and artificial intelligence (AI), deep learning (DL)-based approaches have been increasingly implemented and have achieved state-of-the-art performance in intelligent fault diagnosis (IFD) (Li et al., 2020). However, the applicability of these methods remains limited in practice due to domain shift or out-of-distribution (OOD) problems, as illustrated in Fig. 1 (Han & Li, 2022). Most DL-based approaches can only guarantee reliable diagnostic results if the distributions of training and testing data are close. Under complex working conditions, however, the collected vibration signals are usually nonstationary and nonlinear and contaminated by noise (Tian et al., 2021). Real vibration signals are therefore likely drawn from distributions different from the training data, which leads to OOD problems (Zhou et al., 2022a, 2022b) and tends to produce overconfident and untrustworthy diagnostic results. Trustworthy fault diagnosis plays a critical role in ensuring the safety, reliability, and efficiency of safety–critical industrial systems, such as nuclear power plants, aviation and aerospace systems, autonomous vehicles, railway systems, and robotics. Trustworthiness refers to the degree of confidence and reliability that can be placed in a fault diagnosis capability to accurately identify and localize faults within a given system. Several factors contribute to the trustworthiness of a fault diagnosis task, such as accuracy, robustness, reliability, interpretability, and adaptability. The uncertainty of a prediction can serve as a sign of its trustworthiness for fault diagnosis.

Fig. 1 The illustration of domain shift or OOD problem in IFD

Naturally, uncertainty estimation is a promising way to distinguish OOD samples and provides an indicator of when the diagnostic results can be trusted (Abdar et al., 2021). The higher the uncertainty of a sample, the lower the trustworthiness of its prediction. Specifically, uncertainty in fault diagnosis can stem from two main sources, namely aleatoric uncertainty (data uncertainty) and epistemic uncertainty (model uncertainty) (Hüllermeier & Waegeman, 2021). Data uncertainty arises from noise or class overlap, usually due to sensor anomalies, interruptions in data transmission, and interference from the working environment (Abdar et al., 2021). Effective denoising of the collected data can largely improve data quality and reduce data uncertainty (Xinqiang Chen et al., 2021a, 2021b). Model uncertainty is caused by limited data and knowledge during training, which leads to inconsistent outputs when the model parameters are varied; it can be reduced by optimizing the model parameters with adequate data during iterative training (Loquercio et al., 2020).

Noise degrades model diagnostic performance and increases data uncertainty, particularly in early fault diagnosis tasks (Cai et al., 2021). Denoising technologies have therefore been employed to enhance diagnostic performance and reduce data uncertainty. Empirical mode decomposition (EMD) is suitable for decomposing nonlinear and non-stationary vibration signals but suffers from mode mixing (Zhang et al., 2010). The improved complete ensemble EMD with adaptive noise (ICEEMDAN) was proposed as an enhancement of EMD, offering reduced mode mixing, diminished noise residue, and increased physical interpretability of the intrinsic mode functions (IMFs) (Colominas et al., 2014). However, when an EMD-based method alone is used to process vibration signals, there is a risk that some effective components are not completely extracted (Li et al., 2022). In addition, the selection of noise components among the IMFs and the reconstruction of the denoised signal still rely on expert knowledge (Q. Xu et al., 2020a, 2020b).

Independent Component Analysis (ICA) is a blind source separation (BSS) method that has been widely used as a denoising technique. Multiple aliased signals can be separated into independent components (ICs) while maintaining the characteristic information of the source signal (Miao & Zhao, 2020). However, it presents several limitations: the assumption that multiple sources are disjoint and the inability to identify noise ICs (H. Zhou et al., 2022a, 2022b). Recently, joint denoising techniques have been implemented to address this issue and have demonstrated promising results (Li et al., 2022). EMD-based methods can decompose a single-channel signal into multiple IMFs, fulfilling the multiple-input assumption of ICA; ICA can then be used to further analyze the IMFs and uncover latent relationships among them. To reduce dependence on expert judgment, the Fuzzy Entropy (FuEn) method is employed as a tool for analyzing the complexity or information content of the ICs. Compared with other entropy measures, FuEn offers advantages such as enhanced consistency and minimized data dependence (Wei et al., 2021). Upon computing the FuEn values of the ICs, a threshold discriminant method is introduced to identify noise ICs (Gómez-Herrero et al., 2006).

Quantifying prediction uncertainty in DL-based methods is still challenging. Bayesian models offer a mathematical framework for uncertainty estimation, e.g., a stochastic deep learning model trained using Bayesian inference (Gal & Ghahramani, 2016). Although exact posterior inference is difficult, approximate inference technologies have been widely adopted in Bayesian neural networks (BNNs). Monte Carlo (MC) dropout is an effective method, which uses dropout as a regularization term to calculate the prediction uncertainty (Brach et al., 2020). Other dropout methods, including elementwise Bernoulli dropout and spatial Bernoulli dropout, have been implemented to compute the model uncertainty in BNNs (Amini et al., 2018). Another effective method used for approximate inference is Markov-chain Monte Carlo (MCMC) (Gómez-Rubio & Rue, 2018). Despite the success of MCMC, its convergence to the desired distribution is slow and requires a large number of iterations (Neal, 2012). To address this problem, an MCMC posterior sampling algorithm with faster mixing has been proposed for wide BNNs (Hron et al., 2022).

Variational inference (VI) is a more efficient approximation method that learns the posterior distribution over BNN weights (Swiatkowski et al., 2020). To learn a probability distribution over the weights of NNs, the Bayes by Backprop (BBB) algorithm was proposed to quantify the uncertainty of the weights while minimizing the compression cost (Blundell et al., 2015). In another work, a probabilistic framework based on a Bayesian variational autoencoder (VAE) was developed to detect OOD samples (Daxberger & Hernández-Lobato, 2019). The Laplace approximation has also been used to solve the Bayesian inference problem (De Wolf et al., 2021). Overall, the above approximate inference approaches, including dropout methods, MCMC, VI, and the Laplace approximation, obtain the posterior distribution over the weights of the BNNs, which captures the uncertainty and provides more reliable predictions (Maged & Xie, 2022). However, the success of BNNs depends on how well the approximate weight distributions match their true counterparts, which drives the computational complexity (Tsiligkaridis, 2021). High computational costs can limit the application of BNNs, and the choice of the prior distribution largely influences the uncertainty estimation.

Estimating uncertainty through evidential neural networks (ENNs) is another recently proposed approach (Malinin & Gales, 2019; Sensoy et al., 2018). For neural network classifiers, the softmax function has been used to estimate the classification probability. Softmax, however, provides only a point estimate of the class probability for a sample and lacks an uncertainty estimate, increasing the risk of an overconfident diagnosis. A Dirichlet prior distribution has been introduced to model distributions of class compositions, in which the Dirichlet distribution parameters can be learned by training deterministic NNs (Tsiligkaridis, 2021). The predictions of the neural network can be viewed as evidence, which refers to the quantitative support for assigning an input sample to a specific class. The evidence can be used to compute probability distributions over all possible classes and the uncertainty of the prediction (Sensoy et al., 2018). Because they require no sampling and only minimal changes to the network structure, ENNs yield closed-form predictive distributions and outperform BNNs in uncertainty estimation for OOD and adversarial queries. However, ENNs rely on a specific loss function to generate more evidence for the correct class and reduce the risk of misclassification (Sensoy et al., 2021).

For IFD tasks, deep belief networks (DBNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs) have been widely used as deep classifiers. With their unique self-feedback connections, RNNs outperform regular DBNs and CNNs in time-series prediction (Xia et al., 2020). However, RNNs suffer from vanishing gradients, which prevents them from learning long-term dependencies (Rahman et al., 2018). To address this issue, long short-term memory (LSTM) units and gated recurrent units (GRU) have been added to the RNN structure and outperform basic RNNs. With this improved structure, the GRU network can effectively track long-term dependencies while mitigating the vanishing and exploding gradient problems (Liu et al., 2023). Compared with LSTM, GRU has fewer parameters and can be trained faster (Liu et al., 2023). A stacked GRU model has been developed and applied to long time-series data, achieving accurate prediction (Xia et al., 2021).

The present paper develops a novel trustworthy and intelligent fault diagnosis approach with effective denoising and uncertainty estimation. A two-stage joint denoising method is constructed by integrating ICEEMDAN and ICA with the FuEn threshold discriminant as the selection criterion; it is capable of filtering noise components and reconstructing the denoised signal. An improved stacked gated recurrent unit (SGRU) is then incorporated with evidence theory to construct the evidential SGRU (ESGRU) model. The proposed ESGRU model can evaluate the reliability of the denoising performance and provide accurate predictive uncertainty. Predictive entropy and the reliability diagram are employed as calibration methods to verify the effectiveness of the uncertainty estimation; calibration is a process that ensures the accuracy and consistency of a diagnostic and uncertainty estimation system. Two case studies are conducted to verify the performance of the approach in different practical scenarios. The main contributions of the paper are as follows:

  1. A trustworthy and intelligent fault diagnosis framework is developed to address the unreliable performance of existing fault diagnosis models. This framework requires minimal structural changes to the model during the training process and provides reliable predictions for critical decision-making.

  2. An effective joint denoising method is proposed to reduce data uncertainty. A signal is decomposed into multiple IMFs using the ICEEMDAN method, which serve as multiple inputs for ICA. ICs can be derived from the IMFs using ICA. The Fuzzy Entropy method is employed as a tool to analyze the complexity of the ICs, with the threshold discriminant applied to filter out noise components.

  3. By integrating evidence theory, the proposed ESGRU provides superior and reliable diagnostic results that capture prediction uncertainty. It places a Dirichlet distribution as a prior on the likelihood function. The parameters of the ESGRU network are optimized by the evidential loss function, which assigns more evidence to the correct classification.

  4. Uncertainty estimation provides a novel way to detect OOD samples. It can also be used as an indicator to measure the trustworthiness and reliability of diagnostic behavior and results in practical applications. The validity of the uncertainty estimation can be confirmed through calibration methods involving predictive entropy and reliability diagrams.

The remainder of the paper is organized as follows. Section II introduces the proposed two-stage joint denoising method. Section III presents the details of the uncertainty estimation method of ESGRU. The experimental study is presented in Section IV. Section V gives the conclusion.

Related work of joint denoising method

Problem formulation

In multi-class fault diagnosis with different levels of noise interference, it is critical to recover the original fault information from the noisy data at the data preparation stage, which improves the diagnostic performance. The goal of the proposed method is to accurately classify the testing samples under variable noise conditions by using the denoising method, while the uncertainty estimation quantifies the trustworthiness of the prediction results. Let \(D = \left\{ {({\text{x}}_{i} ,y_{i} )} \right\}_{i = 1}^{n}\) be the collected dataset, where n represents the number of samples, \({\text{x}}_{i} \in R\) denotes the multi-dimensional data such as vibration signals or images, and \(y_{i}\) denotes the corresponding label. By using denoising technologies, the denoised signal \(\user2{x^{\prime}}_{i}\) can be estimated from the collected signal, which updates the dataset to \(D^{\prime} = \left\{ {\left( {\user2{x^{\prime}}_{i} ,y_{i} } \right)} \right\}_{i = 1}^{n}\). The task is to achieve an accurate and reliable diagnosis in a noise-interference environment with uncertainty estimation.

EMD model

The EMD technique decomposes the input signal into several IMFs (Huang et al., 1998). IMFs are oscillatory components with varying amplitudes and frequencies that retain the local characteristics of the original signal. The following conditions must be met by a candidate IMF: (1) the number of extrema (maxima and minima) and the number of zero-crossings must be equal or differ by no more than one; and (2) the local mean, calculated as the average of the upper and lower envelopes, must be approximately zero. The EMD algorithm consists of the following steps:

Step 1: Set \(k = 0\) and find all the extrema of \({\text{r}}_{0} = {\text{x}}\). Here \(k = 0,1,...,K\) indexes the decomposed modes and K denotes the maximum number of modes.

Step 2: Interpolate between maxima (minima) of kth residue \({\text{r}}_{k}\) to obtain the upper (lower) envelope \({\text{e}}_{\max }\) (\({\text{e}}_{\min }\)).

Step 3: Compute the mean envelope \({\text{m}} = ({\text{e}}_{\min } + {\text{e}}_{\max } )/2\).

Step 4: Compute the IMF candidate \({\text{c}}_{k + 1} = {\text{r}}_{k} - {\text{m}}\).

Step 5: Check whether the candidate satisfies the IMF conditions above. If \({\text{c}}_{k + 1}\) is an IMF, compute the residue \({\text{r}}_{k + 1} = {\text{x}} - \sum\nolimits_{n = 1}^{k + 1} {{\text{c}}_{n} }\), set \(k \leftarrow k + 1\), and treat \({\text{r}}_{k}\) as the input data in step 2. Otherwise, treat \({\text{c}}_{k + 1}\) as the input data in step 2.

Step 6: Continue until the final residue \({\text{r}}_{K}\) satisfies some predefined stopping criterion.

The extraction of each mode involves a refinement procedure, encompassing steps 2 to 5, which requires multiple iterations and is referred to as the sifting process.
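To make the sifting process concrete, the sketch below implements the steps above in NumPy. It is a simplified illustration rather than the implementation used in the paper: the envelope interpolation, the IMF acceptance test, and the stopping criterion are common default choices, and library implementations such as PyEMD handle boundary effects more carefully.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(r):
    """One sifting pass (Steps 2-4): subtract the mean envelope from the residue."""
    t = np.arange(len(r))
    max_idx = argrelextrema(r, np.greater)[0]
    min_idx = argrelextrema(r, np.less)[0]
    if len(max_idx) < 2 or len(min_idx) < 2:
        return None                                    # too few extrema to build envelopes
    e_max = CubicSpline(max_idx, r[max_idx])(t)        # upper envelope
    e_min = CubicSpline(min_idx, r[min_idx])(t)        # lower envelope
    return r - (e_max + e_min) / 2.0                   # remove the mean envelope

def emd(x, max_imfs=8, max_sift=50, sd_tol=0.2):
    """Simplified EMD: returns the list of IMFs and the final residue."""
    imfs, r = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        c = r.copy()
        for _ in range(max_sift):                      # sifting loop (Steps 2-5)
            c_new = sift_once(c)
            if c_new is None:
                break
            sd = np.sum((c_new - c) ** 2) / (np.sum(c ** 2) + 1e-12)
            c = c_new
            if sd < sd_tol:                            # Cauchy-type stopping test
                break
        imfs.append(c)
        r = r - c                                      # residue update (Step 5)
        if sift_once(r) is None:                       # residue is (near-)monotonic: stop
            break
    return imfs, r
```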

ICEEMDAN model

The ICEEMDAN model is an important improvement on CEEMDAN, which solves the problems of residual noise in the modes and of spurious modes in the early stages of the decomposition (Colominas et al., 2014). The main idea of these EMD-based methods is to add controlled noise to the signal to create new extrema, so that the local mean can follow the original signal in the portions where new extrema are created. Each mode is defined as the difference between the previous residue and the current residue, where the current residue is the average of the local means of the noise-added realizations.

Let \({\text{x}}\) be the original signal. Generate the new signals \({\text{x}}^{(i)} = {\text{x}} + \beta {\text{w}}^{(i)}\), in which \({\mathbf{w}}^{(i)}\) (\(i = 1,2,...,I\)) is the ith realization of zero-mean, unit-variance white Gaussian noise and \(\beta\) is the coefficient of the added noise component; \(i\) indexes the added noise realizations, and I denotes their total number. Let \(E_{k} ( \cdot )\) be the operator that produces the kth mode obtained by EMD, and let \(M( \cdot )\) be the operator that produces the local mean of a signal, so that \(E_{1} ({\text{x}}) = {\text{x}} - M({\text{x}})\). Hence, the first mode \({\tilde{\text{c}}}_{1}\) can be calculated as

$$ \tilde{\user2{c}}_{1} = \left\langle {E_{1} \left( {{\varvec{x}}^{\left( i \right)} } \right)} \right\rangle = \left\langle {{\varvec{x}}^{\left( i \right)} - M\left( {{\varvec{x}}^{\left( i \right)} } \right)} \right\rangle = \left\langle {{\varvec{x}}^{\left( i \right)} } \right\rangle - \left\langle {M\left( {{\varvec{x}}^{\left( i \right)} } \right)} \right\rangle $$
(1)

in which \(\left\langle \cdot \right\rangle\) denotes averaging over the realizations. By estimating only the local mean and subtracting it from the original signal, one obtains:

$$ {\tilde{\text{c}}}_{1} = {\text{x}} - \left\langle {M\left( {{\varvec{x}}^{\left( i \right)} } \right)} \right\rangle $$
(2)

Following the above idea, the ICEEMDAN algorithm can be outlined as follows (Colominas et al., 2014):

Step 1: Compute by EMD the local means of the I realizations \({\text{x}}^{(i)} = {\text{x}} + \beta_{0} E_{1} ({\text{w}}^{(i)} )\) to obtain the first residue \({\text{r}}_{1} = \left\langle {M({\text{x}}^{(i)} )} \right\rangle\).

Step 2: Calculate the first mode \({\tilde{\text{c}}}_{1} = {\text{x}} - {\text{r}}_{1}\).

Step 3: Estimate the second residue as the average of the local means of the realizations \({\text{r}}_{1} + \beta_{1} E_{2} ({\text{w}}^{(i)} )\) and calculate the second mode \({\tilde{\text{c}}}_{2} = {\text{r}}_{1} - {\text{r}}_{2} = {\text{r}}_{1} - \left\langle {M({\text{r}}_{1} + \beta_{1} E_{2} ({\text{w}}^{(i)} ))} \right\rangle\).

Step 4: For the kth residue (\(k = 3,...,K\)), compute \({\text{r}}_{k} = \left\langle {M({\text{r}}_{k - 1} + \beta_{k - 1} E_{k} ({\text{w}}^{(i)} ))} \right\rangle\). The coefficient \(\beta_{k - 1}\) is set to \(\varepsilon_{k - 1} std({\text{r}}_{k - 1} )\), where \(\varepsilon_{k - 1}\) is the reciprocal of the desired SNR between the added noise and the analyzed signal. \(std( \cdot )\) denotes the standard deviation, defined as \(std(q) = \sqrt {\sum\nolimits_{i = 1}^{N} {(q_{i} - \sum\nolimits_{i = 1}^{N} {q_{i} } /N)^{2} /N} }\), where N is the number of sample points.

Step 5: Compute the kth mode \({\tilde{\text{c}}}_{k} = {\text{r}}_{k - 1} - {\text{r}}_{k}\).

Step 6: Go to Step 4 for the next k until the final residue \({\text{r}}_{K}\) satisfies the stopping criterion.

By construction of ICEEMDAN, the final residue satisfies:

$$ {\text{r}}_{K} = {\text{x}} - \sum_{k = 1}^{K} {\tilde{\text{c}}}_{k} $$
(3)

in which K is the total number of modes. Hence, the signal \({\text{x}}\) can be expressed as

$$ {\text{x}} = \sum_{k = 1}^{K} {\tilde{\text{c}}}_{k} + {\text{r}}_{K} $$
(4)

The flowchart depicting the ICEEMDAN algorithm can be found in Fig. 2.

Fig. 2 The flowchart of the ICEEMDAN algorithm (Colominas et al., 2014)
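A compact sketch of the loop in Steps 1-6 is given below. It reuses the simplified `emd` routine above to realize the operators \(E_k(\cdot)\) and \(M(\cdot)\); the noise scaling and the stopping criterion are simplified assumptions rather than the exact settings of Colominas et al. (2014).

```python
import numpy as np

def local_mean(x):
    """M(.): local mean of a signal, i.e. the signal minus its first EMD mode."""
    imfs, _ = emd(x, max_imfs=1)
    return x - imfs[0] if imfs else x

def emd_mode(x, k):
    """E_k(.): the kth mode produced by EMD (1-indexed)."""
    imfs, _ = emd(x, max_imfs=k)
    return imfs[k - 1] if len(imfs) >= k else np.zeros_like(x)

def iceemdan(x, n_real=50, eps=0.2, max_modes=10, seed=0):
    """Simplified ICEEMDAN (Steps 1-6): returns the modes and the final residue."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    noises = [rng.standard_normal(len(x)) for _ in range(n_real)]
    # Step 1: first residue = average local mean of the noise-added realizations
    r = np.mean([local_mean(x + eps * np.std(x) * emd_mode(w, 1)) for w in noises], axis=0)
    modes = [x - r]                                    # Step 2: first mode
    for k in range(2, max_modes + 1):                  # Steps 3-6
        beta = eps * np.std(r)
        r_next = np.mean([local_mean(r + beta * emd_mode(w, k)) for w in noises], axis=0)
        modes.append(r - r_next)                       # Step 5: kth mode
        r = r_next
        if np.count_nonzero(np.diff(np.sign(r))) < 3:  # crude stopping criterion
            break
    return np.array(modes), r
```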

ICA model and threshold discriminant with FuEn

As a classic blind source separation technique, ICA can decompose an observed signal into a linear combination of unknown independent signals (Draper et al., 2003). Let \({\text{c}}\) represent the vector of observed signals, which in this paper are the IMFs obtained through the ICEEMDAN model. The vector of unknown source signals, separable from the observed signals, is referred to as the ICs. If \({\mathbf{A}}\) is the unknown mixing matrix, the mixing model can be written as follows.

$$ {\text{c}} = {\mathbf{As}} $$
(5)

It is assumed that the source signals are independent and the mixing matrix \({\mathbf{A}}\) is invertible. Based on the assumptions and the observed mixtures \({\text{c}}\), the ICA algorithm estimates the mixing matrix \({\mathbf{A}}\) and the separating matrix \({\mathbf{W}}\) such that

$$ {\mathbf{u = W}}{\text{c}} $$
(6)

in which \({\mathbf{u}}\) is an estimation of the source signals, i.e., ICs.

Upon computing the FuEn value of the ICs, a threshold discriminant method is introduced to identify noise ICs (Gómez-Herrero et al., 2006). The discriminant is as follows:

$$ (\phi^{(g + 1)} - \phi^{(g)} ) < (\phi^{(g)} - \phi^{(g - 1)} ) $$
(7)

in which \(\phi^{(g)}\) is the fuzzy entropy value of the gth IC sorted in ascending order, g is the smallest integer in the range \(1 < g \le [N/2]\), and N is the number of ICs. If no g value within the specified range satisfies the required condition, then g = 1. If \(g\) satisfies Eq. (7), the corresponding IC is determined to be a noise component and is filtered by setting it to zero, resulting in the denoised source signal vector \(\user2{u^{\prime}}\). Subsequently, the denoised observed signals \(\user2{c^{\prime}}\) can be acquired by multiplying the mixing matrix with the ICs matrix, as indicated by Eq. (5). Lastly, the reconstructed signal \(\user2{x^{\prime}}\) is formed by summing up the denoised observed signals. This process is shown in Fig. 3.

Fig. 3 ICA blind source separation and denoising process
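The sketch below illustrates the second denoising stage under stated assumptions: scikit-learn's FastICA stands in for the ICA step, the fuzzy entropy uses illustrative parameters (m = 2, r = 0.2 std), and zeroing the ICs up to position g in ascending-FuEn order is one interpretation of how the discriminant in Eq. (7) flags noise components.

```python
import numpy as np
from sklearn.decomposition import FastICA

def fuzzy_entropy(x, m=2, r=None):
    """Simplified fuzzy entropy (embedding dimension m, tolerance r).
    O(n^2); for long signals, compute it on a short window of each IC."""
    x = np.asarray(x, dtype=float)
    r = 0.2 * np.std(x) if r is None else r
    def phi(dim):
        n = len(x) - dim
        emb = np.stack([x[i:i + dim] for i in range(n)])
        emb = emb - emb.mean(axis=1, keepdims=True)                # remove local baseline
        d = np.abs(emb[:, None, :] - emb[None, :, :]).max(axis=2)  # Chebyshev distance
        sim = np.exp(-(d ** 2) / r)                                # fuzzy membership degree
        np.fill_diagonal(sim, 0.0)
        return sim.sum() / (n * (n - 1))
    return -np.log(phi(m + 1) / phi(m))

def denoise_with_ica_fuen(imfs):
    """Separate the IMFs into ICs, zero the IC(s) flagged by Eq. (7), and reconstruct."""
    ica = FastICA(random_state=0)
    ics = ica.fit_transform(np.asarray(imfs).T)        # shape: (time, n_ics)
    fe = np.array([fuzzy_entropy(ic) for ic in ics.T])
    order = np.argsort(fe)                             # ICs sorted by ascending FuEn
    phi_s = fe[order]
    N = len(phi_s)
    g = 1
    for cand in range(2, N // 2 + 1):                  # 1 < g <= [N/2] (1-indexed)
        if phi_s[cand] - phi_s[cand - 1] < phi_s[cand - 1] - phi_s[cand - 2]:
            g = cand
            break
    ics[:, order[:g]] = 0.0                            # interpretation: zero the flagged ICs
    denoised_imfs = ica.inverse_transform(ics).T       # back to the IMF space
    return denoised_imfs.sum(axis=0)                   # reconstructed signal
```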

Evidential stacked GRU neural network

Uncertainty estimation with evidence theory

The Dempster-Shafer Theory (DST) of evidence is a generalization of the Bayesian theory to subjective probabilities (Dempster, 1968). Subjective Logic (SL) offers computationally efficient operators that approximate second-order Bayesian reasoning. SL assigns a probability (belief mass) to each class and generates a total uncertainty for this assignment (Jsang, 2016). The belief assignment is formalized as a Dirichlet distribution, and the parameters of the Dirichlet distribution are related to the evidence calculated by DNNs (Sensoy et al., 2018). The procedure is summarized in Algorithm 1.

Algorithm 1: evidential deep neural networks

Input: dataset \({\text{x}}\), class labels \({\text{m}} = [1,..,M]\), prior Dirichlet distribution parameter \({\upbeta }_{0} = [1,...,1]_{1 \times M}\)

1. Obtain the evidence vector \({\text{v}} = [v_{1} ,...,v_{M} ]\) over the classes from the DNN

2. Compute each belief mass \(a_{m} = v_{m} /\sum\nolimits_{m = 1}^{M} {(v_{m} } + 1)\) and the uncertainty \(w = M/\sum\nolimits_{m = 1}^{M} {(v_{m} } + 1)\), so that \(w + \sum\nolimits_{m = 1}^{M} {a_{m} = 1}\)

3. Update the Dirichlet distribution \(D({\text{p}}|{\upbeta })\) with the updated parameter \({\varvec{\beta}} = {\varvec{v}} + {\varvec{\beta}}_{{\varvec{0}}}\), i.e., \(\beta_{m} = v_{m} + 1\)

4. Compute the class probability \(\hat{p}_{m} = \beta_{m} / \sum\nolimits_{m = 1}^{M} {\beta_{m} }\)

Return: predicted label \({\text{y}}\), uncertainty estimation value \({\text{w}}\)

According to the theory, the total evidence \(V\) refers to the Dirichlet strength, which can be calculated as \(V = \sum\nolimits_{m = 1}^{M} {(v_{m} } + 1) = \sum\nolimits_{m = 1}^{M} {\beta_{m} }\).
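The computation in Algorithm 1 is short enough to transcribe directly; the snippet below assumes the network's non-negative outputs are used as the evidence vector.

```python
import numpy as np

def evidential_prediction(evidence):
    """Algorithm 1: map a non-negative evidence vector to belief masses,
    expected class probabilities, and a scalar uncertainty."""
    v = np.asarray(evidence, dtype=float)     # v_m >= 0, one entry per class
    M = len(v)
    beta = v + 1.0                            # Dirichlet parameters: beta_m = v_m + 1
    V = beta.sum()                            # Dirichlet strength (total evidence)
    belief = v / V                            # belief mass a_m = v_m / sum(v_m + 1)
    w = M / V                                 # uncertainty; note w + sum(belief) == 1
    p_hat = beta / V                          # expected class probabilities
    return belief, w, p_hat

# Example: strong evidence for class 2 out of 4 classes
belief, w, p_hat = evidential_prediction([0.5, 18.0, 1.2, 0.1])
print(p_hat.argmax(), round(w, 3))            # predicted label and its uncertainty
```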

The Dirichlet distribution \(D( \cdot )\) serves as a prior over all possible outputs for the classification of a given sample. It is a probability density function over categorical distributions and is characterized by the parameter \({\upbeta }\) (Jsang, 2016; Sensoy et al., 2021):

$$ D({\text{p}}|{\upbeta }) = \begin{cases} \dfrac{1}{{B({\upbeta })}}\prod\nolimits_{i = 1}^{M} {p_{i}^{{\beta_{i} - 1}} } & \text{for } {\text{p}} \in V_{M}, \\ 0 & \text{otherwise} \end{cases} $$
(8)

where \({\text{p}}\) represents the probabilities that the sample belongs to each of the M categories. \(B({\upbeta }) = \prod\nolimits_{m = 1}^{M} {\Gamma (\beta_{m} )} /\Gamma (V)\) is the M-dimensional multinomial beta function, where \(\Gamma ( \cdot )\) is the gamma function, and \(V_{M}\) is the M-dimensional unit simplex,

$$ V_{M} = \left\{ {{\text{p}}|\sum\nolimits_{i = 1}^{M} {p_{i} = 1,{\kern 1pt} {\kern 1pt} 0 \le p_{i} {\kern 1pt} \le 1{\kern 1pt} } } \right\} $$
(9)

When the observation favors one class over others, the corresponding Dirichlet parameter is incremented to update the Dirichlet distribution related to the class probabilities. The expected calibration error (ce) can be calculated by

$$ ce = \;\frac{1}{N}\sum\nolimits_{d = 1}^{N} {(co_{d} } - acc) $$
(10)

where \(co_{d}\) denotes the confidence level of sample d and acc is the diagnostic accuracy for the test set consisting of N samples. Confidence, defined as the maximum predicted probability, is the highest probability score that the model assigns to its prediction for a specific input (Maddox et al., 2019). For a well-calibrated model, the calibration error is close to zero. The logic of the evidential DNNs is presented in Fig. 4.

Fig. 4 The logic of the evidential deep classifier

The evidential loss function

Given a sample \(i\), \({\text{x}}_{i}\) is the observation and \({\text{y}}_{i}\) is the one-hot encoded class label. The Dirichlet distribution \(D({\text{p}}|{\upbeta })\) is treated as a prior on the multinomial likelihood function \(Mult({\text{y}}_{i} |{\text{p}}_{i} ) = \prod\nolimits_{j} {p_{ij}^{{y_{ij} }} }\). Using maximum likelihood, one can minimize the negative log-marginal likelihood. The evidential loss function can be defined as follows (Sensoy et al., 2018).

$$ \begin{aligned} {\mathcal{L}}_{i} (\Theta ) &= - \log \left( \int {\prod\limits_{m = 1}^{M} {p_{im}^{{y_{im} }} } \frac{1}{{B({\upbeta }_{i} )}}} \prod\limits_{m = 1}^{M} {p_{im}^{{\beta_{im} - 1}} \, d{\text{p}}_{i} } \right) \\ &= - \sum\limits_{m = 1}^{M} {y_{im} \log (\hat{p}_{m} )} = - \sum\limits_{m = 1}^{M} {y_{im} \log \left( \frac{{\beta_{im} }}{{V_{i} }} \right)} \\ &= \sum\limits_{m = 1}^{M} {y_{im} \left( \log (V_{i} ) - \log (\beta_{im} ) \right)} \end{aligned} $$
(11)

where \(\Theta\) represents the network parameters. The loss function is minimized by searching parameters \({\upbeta }_{i}\). As evidence is removed from the correct label, the expected probability \(\hat{p}_{m}\) decreases and the loss function increases. Therefore, the DNN with the loss function is optimized to generate more evidence for the correct class labels and to avoid misclassification by removing misleading evidence for each sample. It encourages the maximization of correct class likelihoods. This technique is well-known as the Type II Maximum Likelihood.

In addition, a KL divergence has been incorporated as a regularization term in the loss function. This term measures the difference between the predicted Dirichlet distribution \(D({\text{p}}_{i} |{\tilde{\beta }}_{i} )\) and a target Dirichlet distribution \(D({\text{p}}_{i} |{1})\). \({\tilde{\beta }}_{i}\) represents the filtered Dirichlet parameter after removing the non-misleading evidence, given as follows:

$$ {\tilde{\beta }}_{i} = {\text{y}}_{i} + (1 - {\text{y}}_{i} ) \odot {\upbeta }_{i} $$
(12)

where \(\odot\) represents the element-wise product.

The regularization term can be written as

$$ \begin{aligned} KL\left[ {D({\text{p}}_{i} |{\tilde{\beta }}_{i} )} \,\middle\|\, {D({\text{p}}_{i} |{\mathbf{1}})} \right] &= \log \left( \frac{{\Gamma \left( \sum\nolimits_{m = 1}^{M} {\tilde{\beta }_{im} } \right)}}{{\Gamma (M)\prod\nolimits_{m = 1}^{M} {\Gamma (\tilde{\beta }_{im} )} }} \right) \\ &\quad + \sum\nolimits_{m = 1}^{M} {(\tilde{\beta }_{im} - 1)\left[ \psi (\tilde{\beta }_{im} ) - \psi \left( \sum\nolimits_{m = 1}^{M} {\tilde{\beta }_{im} } \right) \right]} \end{aligned} $$
(13)

where \(\psi ( \cdot )\) is the digamma function. By incorporating the regularization term into the loss function, less evidence will be assigned in the misclassification classes. Then the evidential loss function can be calculated as

$$ {\mathcal{L}}_{total} (\Theta ) = \sum\limits_{i = 1}^{N} {{\mathcal{L}}_{i} (\Theta ) + \lambda_{t} \sum\limits_{i = 1}^{N} {KL[\left. {D({\text{p}}_{i} |{\tilde{\beta }}_{i} )} \right\|D({\text{p}}_{i} |{1})]} } $$
(14)

where \(\lambda_{t} = \min (1,t/10) \in [0,1]\) represents the annealing coefficient, which helps the DNNs explore the parameter space and prevents premature convergence to the uniform distribution. \(t\) is an index for the current training epoch.
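A PyTorch sketch of Eqs. (11)-(14) is shown below. It assumes the network output has already been mapped to non-negative evidence; the explicit digamma expression of Eq. (13) is replaced by the built-in Dirichlet KL divergence of `torch.distributions`, which computes the same quantity, and the batch loss is averaged rather than summed.

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def evidential_loss(evidence, y_onehot, epoch):
    """Type II ML loss (Eq. 11) plus the annealed KL regularizer (Eqs. 12-14).
    evidence: (batch, M) non-negative tensor; y_onehot: (batch, M) one-hot labels."""
    beta = evidence + 1.0                              # Dirichlet parameters
    V = beta.sum(dim=1, keepdim=True)                  # Dirichlet strength
    # Eq. (11): sum_m y_m * (log V - log beta_m)
    nll = (y_onehot * (torch.log(V) - torch.log(beta))).sum(dim=1)
    # Eq. (12): remove the correct-class evidence before regularizing
    beta_tilde = y_onehot + (1.0 - y_onehot) * beta
    # Eq. (13): KL divergence to the uniform Dirichlet D(p | 1)
    kl = kl_divergence(Dirichlet(beta_tilde), Dirichlet(torch.ones_like(beta_tilde)))
    # Eq. (14): annealing coefficient lambda_t = min(1, t/10)
    lam = min(1.0, epoch / 10.0)
    return (nll + lam * kl).mean()                     # averaged over the batch
```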

Stacked GRU-based intelligent fault diagnosis model

The GRU network performs well in IFD tasks because it considers the relationships between neurons in a layer and extracts the temporal characteristics of vibration signals. Figure 5 shows the model architecture of a classic GRU cell, which is formed by a reset gate and an update gate. For a time-series sample set \(\left\{ {{\text{x}}_{t} } \right\}\), the calculation procedures are

$$ {\text{z}}_{t} = \sigma ({\text{W}}_{xz} {\text{x}}_{t} + {\text{W}}_{hz} {\text{h}}_{t - 1} + {\text{b}}_{z} ) $$
(15)
$$ {\text{r}}_{t} = \sigma ({\text{W}}_{xr} {\text{x}}_{t} + {\text{W}}_{hr} {\text{h}}_{t - 1} + {\text{b}}_{r} ) $$
(16)
$$ {\text{c}}_{t} = \varphi ({\text{W}}_{xh} {\text{x}}_{t} + {\text{W}}_{hh} ({\text{r}}_{t} \odot {\text{h}}_{t - 1} ) + {\text{b}}_{h} ) $$
(17)
$$ {\text{h}}_{t} = {\text{z}}_{t} \odot {\text{h}}_{t - 1} + (1 - {\text{z}}_{t} ) \odot {\text{c}}_{t} $$
(18)

where \({\text{z}}_{t}\) and \({\text{r}}_{t}\) are the update gate and the reset gate; \({\text{h}}_{t - 1}\), \({\text{h}}_{t}\), and \({\text{c}}_{t}\) are the previous hidden state, the current hidden state, and the candidate state, respectively. \({\text{b}}_{z}\), \({\text{b}}_{r}\), and \({\text{b}}_{h}\) are the biases at the update gate, the reset gate, and the candidate layer. \({\text{W}}_{xz}\) is the weight matrix between the input state and the update gate, and the weight matrices \({\text{W}}_{hz}\), \({\text{W}}_{xr}\), \({\text{W}}_{hr}\), \({\text{W}}_{xh}\), and \({\text{W}}_{hh}\) are defined analogously. \(\sigma ( \cdot )\) and \(\varphi \left( \cdot \right)\) are the activation functions given by

$$ \sigma (x) = 1/(1 + \exp ( - x)) $$
(19)
$$ \varphi (x) = {{(e^{x} - e^{ - x} )} \mathord{\left/ {\vphantom {{(e^{x} - e^{ - x} )} {(e^{x} + e^{ - x} )}}} \right. \kern-0pt} {(e^{x} + e^{ - x} )}} $$
(20)
Fig. 5 The model architecture of a classic GRU cell
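For reference, the cell below writes Eqs. (15)-(18) directly in PyTorch; in practice `nn.GRU` implements the same gating more efficiently and would normally be used instead.

```python
import torch
import torch.nn as nn

class GRUCellEq(nn.Module):
    """A GRU cell transcribed from Eqs. (15)-(18)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xz = nn.Linear(input_size, hidden_size)            # input -> update gate
        self.W_hz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xr = nn.Linear(input_size, hidden_size)            # input -> reset gate
        self.W_hr = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_xh = nn.Linear(input_size, hidden_size)            # input -> candidate state
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        z_t = torch.sigmoid(self.W_xz(x_t) + self.W_hz(h_prev))   # Eq. (15)
        r_t = torch.sigmoid(self.W_xr(x_t) + self.W_hr(h_prev))   # Eq. (16)
        c_t = torch.tanh(self.W_xh(x_t) + self.W_hh(r_t * h_prev))  # Eq. (17)
        h_t = z_t * h_prev + (1.0 - z_t) * c_t                    # Eq. (18)
        return h_t
```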

According to the forward propagation of the GRU, the activation of the hidden unit at time t contains the state of the previous hidden unit and receives the current input value. Due to its ability to store and access information and to capture nonlinear relationships, the GRU is well suited to processing sequential data (Liu et al., 2018). Therefore, to perform accurate and reliable feature mapping, multiple GRU layers can be stacked to construct a stacked GRU model (Xia et al., 2021), as shown in Fig. 6.

Fig. 6 The two-layered structure of stacked GRU

This paper improves the stacked GRU architecture by adding Batch Normalization (BN) layers. By introducing an activation layer at the top, the stacked GRU can be used for prediction, defined as

$$ {\text{y}}_{t} = \sigma ({\text{W}}_{yh} {\text{h}}_{t} + {\text{b}}_{y} ) $$
(21)

where \({\text{y}}_{t}\), i.e., the evidence, represents the predicted output after the activation layer; \({\text{W}}_{yh}\) and \({\text{b}}_{y}\) represent the weight and the bias, respectively; and \(\sigma ( \cdot )\) is the activation function. Dropout has been widely used during the training process to solve overfitting problems in DNNs (Xiaohan Chen et al., 2021a, 2021b). During training, dropout randomly drops some units of the network to prevent excessive co-adaptation.
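A possible PyTorch realization of the improved stacked GRU with an evidence head is sketched below. The layer sizes, the placement of the BN layers, and the softplus output activation (in place of the generic activation \(\sigma(\cdot)\) of Eq. (21), since the evidence only needs to be non-negative) are assumptions made for illustration; the paper does not fix these choices here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESGRU(nn.Module):
    """Sketch of the evidential stacked GRU: two stacked GRU layers with Batch
    Normalization and dropout, followed by a fully connected evidence head."""
    def __init__(self, input_size=1, hidden_size=64, num_classes=10, dropout=0.5):
        super().__init__()
        self.gru1 = nn.GRU(input_size, hidden_size, batch_first=True)
        self.bn1 = nn.BatchNorm1d(hidden_size)
        self.gru2 = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.bn2 = nn.BatchNorm1d(hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                               # x: (batch, seq_len, input_size)
        h, _ = self.gru1(x)
        h = self.bn1(h.transpose(1, 2)).transpose(1, 2)  # BN over the feature dimension
        h, _ = self.gru2(h)
        h = self.bn2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(h[:, -1, :])                    # last time step as sequence summary
        return F.softplus(self.fc(h))                    # non-negative evidence (cf. Eq. 21)

# Usage with the evidential loss sketched earlier (hypothetical shapes):
# model = ESGRU(); evidence = model(torch.randn(32, 784, 1))
# loss = evidential_loss(evidence, y_onehot, epoch)
```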

Denoising and trustworthy fault diagnosis framework

The framework of the joint denoising method and the trustworthy fault diagnosis method with uncertainty estimation is shown in Fig. 7. In the first stage, the vibration signals are decomposed into multiple IMFs by ICEEMDAN. In the second stage, these IMFs serve as the input sources of ICA, which separates the residual noise by using the FuEn discriminant as the threshold criterion and retains the source signal to reconstruct a new signal. The proposed ESGRU method can then provide trustworthy diagnosis results with uncertainty estimation, avoiding overconfident misclassification. The procedure is summarized as follows:

Fig. 7 The framework of the proposed ESGRU model in reliable IFD with uncertainty estimation

Step 1: Sense and collect vibration data of the monitored machine under different work conditions.

Step 2: Pre-process the collected signals using the proposed joint denoising approach. First, decompose the signal into multiple IMFs using ICEEMDAN. Then, separate source signals into ICs by the ICA method. The FuEn values of the ICs are calculated, and a threshold discriminant based on these FuEn values is chosen to filter out noise components.

Step 3: Reconstruct the denoised signal with the remaining independent components. Randomly divide the denoised vibration signals into training and testing sets.

Step 4: Construct the ESGRU method by integrating evidence theory and the stacked GRU model. Use the evidential loss function with KL divergence in the network. L2 regularization is applied to the weights of the fully connected layers to reduce the complexity of the model.

Step 5: Train the ESGRU model using the training dataset. Calculate the evidence over classes and provide category probabilities for testing samples.

Step 6: Implement the trained ESGRU for fault diagnosis. With the uncertainty values and the predictive entropy of the predictions, overconfident diagnoses can be avoided and OOD samples can be identified. The ESGRU model can thus provide accurate and reliable diagnostic results.

Case validation

Case 1 CWRU data

Experimental setup and data description

In this experiment, the roller bearing dataset collected from a motor drive system by Case Western Reserve University (CWRU) is used. The test stand is shown in Fig. 8. Vibration signals were acquired from the drive-side bearing (6205-2RS JEM SKF) at a sampling frequency of 12 kHz. The monitored conditions of the bearings included one normal condition and nine faulty conditions. Three types of faults are included, namely inner race faults (IRF), ball faults (BF), and outer race faults (ORF). According to the severity levels of each failure type, 0.007, 0.014, and 0.021 inches correspond to slight, moderate, and severe faults, respectively. For each bearing condition, 1000 samples are obtained from 100,684 collected data points. The dataset is split randomly into subsets of training data (70%) and testing data (30%), with details given in Table 1.

Fig. 8 Test rig of rolling-element bearing

Table 1 The structure of the data set considered in the present paper

Four standard evaluation measures of fault diagnosis performance are used: accuracy, precision, recall, and F1 score (Xiaohan Chen et al., 2021a, 2021b). In addition, the mean uncertainty (U) of the test samples is employed as a sign of the trustworthiness of the predictions, and the predictive entropy (H) is introduced to measure the precision of the uncertainty (Tsiligkaridis, 2021); it quantifies the dispersion of the predicted probability distribution over the class labels.

$$ Accuracy = \frac{TP + TN}{TP + FP + FN + TN} $$
(22)
$$ Precision = \frac{TP}{TP + FP} $$
(23)
$$ Recall = \frac{TP}{TP + FN} $$
(24)
$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
(25)
$$ U = (\sum\nolimits_{i = 1}^{n} {w_{i} } )/n $$
(26)
$$ H = - \sum\nolimits_{m = 1}^{M} {\hat{p}_{m} } \log \hat{p}_{m} $$
(27)

where TP, FP, FN, and TN represent the number of true positive, false positive, false negative, and true negative outcomes, respectively. \(\hat{p}_{m}\) is the expected class probability, and \(w_{i}\) represents the uncertainty of sample i, which can be calculated from Algorithm 1. n represents the number of selected test samples, while M denotes the total number of classes.
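A small helper for Eqs. (22)-(27) might look as follows; the classification metrics are delegated to scikit-learn, and the uncertainty-related quantities follow Algorithm 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def uncertainty_metrics(evidence):
    """Mean uncertainty U (Eq. 26) and mean predictive entropy H (Eq. 27)
    for a batch of evidence vectors of shape (n, M)."""
    beta = np.asarray(evidence, dtype=float) + 1.0
    V = beta.sum(axis=1, keepdims=True)
    w = beta.shape[1] / V[:, 0]                     # per-sample uncertainty w_i = M / V_i
    p_hat = beta / V                                # expected class probabilities
    H = -(p_hat * np.log(p_hat)).sum(axis=1)        # predictive entropy per sample
    return w.mean(), H.mean()

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score (macro-averaged over classes)."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return acc, prec, rec, f1
```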

Simulating different noisy environments

To simulate real working conditions, additive white Gaussian noise with different SNR is added to the collected vibration data. SNR is defined as follows.

$$ SNR_{dB} = 10\log_{10} \left( {{{P_{signal} } \mathord{\left/ {\vphantom {{P_{signal} } {P_{noise} }}} \right. \kern-0pt} {P_{noise} }}} \right) $$
(28)

where \(P_{signal}\) and \(P_{noise}\) are the power of the original signal and of the added noise, respectively. Following the literature, the SNR is set to 4 dB, 0 dB, −4 dB, and −10 dB for slight, medium, strong, and extreme noise conditions, respectively (Zhang et al., 2018). Vibration signals with Gaussian noise are plotted in Fig. 9.

Fig. 9 Different levels of Gaussian noise disturbance in C1
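Noise injection at a prescribed SNR per Eq. (28) can be sketched as follows.

```python
import numpy as np

def add_awgn(x, snr_db, seed=0):
    """Add white Gaussian noise so that the resulting SNR (Eq. 28) equals snr_db."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(np.asarray(x, dtype=float) ** 2)   # signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))        # required noise power
    noise = rng.standard_normal(len(x)) * np.sqrt(p_noise)
    return x + noise

# e.g. the extreme noise condition used in the paper:
# x_noisy = add_awgn(x_clean, snr_db=-10)
```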

Denoising with the ICEEMDAN-ICA-FuEn

In the first stage of the joint denoising, the ICEEMDAN method decomposes the single-channel signal into several IMFs, which serve as the input sources of ICA. In terms of the parameters of the ICEEMDAN algorithm, the number of realizations and the maximum number of sifting iterations are both set to 50, the added noise amplitude is 0.2 times the signal standard deviation, and the added noise increases at each stage. Figure 10 shows the decomposition result for a normal bearing with slight noise, including 17 IMFs (7 IMFs presented) and a residual component.

Fig. 10 The decomposition results using ICEEMDAN of normal bearing

Then, ICA produced 16 ICs, of which 8 are shown in Fig. 11. In general, the number of ICs obtained from the separation is smaller than the number of IMFs. Using the fuzzy entropy threshold discriminant as the selection criterion, IC2, marked with a red line, is identified as a noise component, as shown in Fig. 11. The denoised signal can then be reconstructed from the remaining ICs.

Fig. 11 The separation results using ICA

The overlap sliding segmentation method is used to expand the number of training and testing samples (Han et al., 2022). Each sample includes 784 time points, and the overlap between two neighboring segments is 684 time points, as shown in Fig. 12. After the segmentation, 1000 samples are obtained from the 100,684 data points for each of the ten conditions.

Fig. 12 The process of data overlap sliding segmentation
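The segmentation described above can be sketched as follows; with 100,684 points, a window of 784 points and an overlap of 684 points (stride 100), it yields 1000 segments per condition.

```python
import numpy as np

def sliding_segments(x, length=784, overlap=684):
    """Overlap sliding segmentation: windows of `length` points with the stated
    overlap between neighbouring segments (stride = length - overlap)."""
    step = length - overlap
    n = (len(x) - length) // step + 1
    return np.stack([x[i * step : i * step + length] for i in range(n)])

# segments = sliding_segments(signal)   # for a 100,684-point signal: shape (1000, 784)
```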

Diagnosis performance with original data

In the experiments, the key hyperparameters of the model are determined through grid search as follows: batch size 32, maximum training epochs 50, Adam optimizer, and dropout rate 50%. The original data without additional noise are used to evaluate the proposed ESGRU, with results shown in Fig. 13.

Fig. 13 Training and testing process diagram of the original data in Case 1

In subplots (a) and (c) of Fig. 13, the difference in evidence support between correctly and incorrectly classified samples increases as the training epoch increases. The uncertainty of correctly classified samples is significantly lower than that of misclassified samples. The overall training accuracy reaches 100%. The same behavior of the uncertainty is present during the testing process, as shown in subplots (b) and (d) of Fig. 13, and the testing accuracy is 99.91%. The results show that the ESGRU model can provide excellent diagnostic accuracy and reliable uncertainty estimation.

Diagnosis performance with noisy data

The performance of fault diagnosis is quantified using the indicators of accuracy, recall, precision, F1 score, uncertainty (U), and predictive entropy (H), as shown in Table 2. Under extreme noise disturbance, the average accuracy is 98.15%, the average F1 score is 90.77%, the average testing uncertainty is 10.07%, and the average predictive entropy is 0.43. Specifically, the uncertainty values of ball faults (C3, C6, and C9) and outer race faults (C4 and C10) of testing datasets are greater than 10%, which carries a risk of misclassification due to noise interference and class overlap. It can also provide higher predictive entropy on untrustworthy predictions, consistent with the higher uncertainty value. The proposed ESGRU model can achieve trustworthy fault diagnosis with uncertainty estimation, even in an extreme noise environment.

Table 2 Diagnosis results with noise data (extreme noise corruption)

To further measure the precision of uncertainty estimation, Fig. 14 shows boxplots of predictive entropy for correct classification and misclassification under an extreme noise environment. The median predictive entropy for correctly classified and misclassified samples is 0.155 and 1.172 respectively. This confirms that the proposed method can assign lower uncertainties to correct classifications and higher uncertainties to misclassifications.

Fig. 14 Boxplots of predictive entropy for correct classification and misclassification under extreme noise environment

Following the calibration method (Maddox et al., 2019), the reliability diagram is used to evaluate the calibration of the uncertainty estimation, as shown in Fig. 15. The test data are equally divided into 20 bins based on the mean confidence values. The dashed black line in Fig. 15 represents the baseline of a perfectly calibrated network without calibration error. Points above the baseline represent overconfident predictions; points below it represent underconfident predictions. Compared to the GRU model, the proposed ESGRU method provides well-calibrated prediction results.

Fig. 15 Calibration result of EGRU and GRU under extreme noise environment
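A sketch of the binning behind the reliability diagram and of the calibration error of Eq. (10) is given below; splitting the sorted confidences into 20 equal-count bins follows the description above.

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=20):
    """Per-bin mean confidence and accuracy, the quantities plotted in Fig. 15.
    confidences: max predicted probability per sample; correct: 0/1 per sample."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    groups = np.array_split(order, n_bins)                  # equal-count bins
    mean_conf = np.array([confidences[g].mean() for g in groups])
    mean_acc = np.array([correct[g].mean() for g in groups])
    return mean_conf, mean_acc

def calibration_error(confidences, correct):
    """Calibration error in the spirit of Eq. (10): mean gap between confidence
    and the overall test accuracy."""
    return float(np.mean(confidences) - np.mean(correct))
```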

The testing results with various levels of Gaussian white noise interference are provided in Table 3. In terms of overall testing accuracy, the average values are 99.91%, 99.87%, 99.85%, 99.53%, and 98.15%, respectively. The average uncertainty estimation values gradually increase from 3.17% to 10.07% and the predictive entropy increases from 0.19 to 0.43, with increasing noise interference.

Table 3 Diagnosis results from different noise disturbance

To better demonstrate the classification probabilities and uncertainty estimation, 10 samples are randomly selected from C10 and plotted in Fig. 16. Among these 10 samples, sample 6 was misclassified with an uncertainty value of 67.28%. Among the correctly classified samples, sample 7 also has high uncertainty due to the similar classification probabilities of C4 and C10, which puts it at risk of misclassification. With the proposed method, unreliably classified samples with high uncertainty estimation values can therefore be identified and overconfidence alerts can be issued for these predictions.

Fig. 16 Samples diagnostic probability and uncertainty estimation of C10

Diagnosis performance with denoised data

To evaluate the performance of the proposed denoising method, the ESGRU is applied to the denoised data. The fault diagnosis results in conjunction with uncertainty estimation are shown in Table 4. Specifically, the extreme noise condition is selected to demonstrate the effectiveness of the denoising method. Compared to the results with noisy data listed in Table 2, the accuracy on the denoised signals increases by 1.85% and reaches 100%. The average uncertainty value of the testing samples decreases to 3.29%, and the uncertainty values of each class are all less than 5%. The average predictive entropy is reduced to 0.19, a decrease of 0.24. The results show that the proposed denoising method is effective and can reduce data uncertainty.

Table 4 Diagnosis results from denoised data (extreme noise corruption)

Under different noise severities, the proposed ESGRU achieves high diagnostic accuracy and reduced uncertainty, as shown in Table 5. The average accuracy reaches nearly 100% with the denoised signals as the testing dataset. The uncertainty estimation values are 3.10%, 3.10%, 2.97%, 3.30%, and 3.29%, respectively. Correspondingly, the average predictive entropy values remain at a low level, between 0.18 and 0.19. The results confirm that the proposed ICEEMDAN-ICA-FuEn denoising method is effective, and that uncertainty estimation provides trustworthy predictions in IFD tasks for safety–critical applications.

Table 5 Diagnosis results from different denoised data

Comparison with other advanced methods

Under the same noise environments, several state-of-the-art models are chosen for comparison to verify the effectiveness of the proposed ESGRU model. The raw signals from the CWRU dataset are used as the original samples, while different levels of Gaussian noise are added to simulate different noisy environments. Table 6 compares the fault diagnosis results of ESGRU, CNN, Ensemble TICNN (Zhang et al., 2018), AAnNet (Jin et al., 2020), VMD-DCNN (Z. Xu et al., 2020a, 2020b), and the Bayesian Convolutional Neural Network (BCNN) (Fang et al., 2020). Overall, the proposed method provides high accuracy (99.46% on average) and reasonable uncertainty (given in parentheses). The other methods perform well at low noise levels; however, at extreme noise levels, their diagnostic capabilities are relatively poor and unreliable. In contrast, the proposed method provides reliable and robust diagnosis results supported by uncertainty estimation under variable noisy environments, avoiding overconfidence.

Table 6 Comparison of accuracy results with other models (%)

The validity of the proposed joint denoising method compared with other denoising methods is verified by using the ESGRU model as the classifier, with results presented in Fig. 17. Under variable noise interference, the proposed method maintains superior robustness and reliability, with average accuracy and average F1 score close to 100%, a lower average uncertainty (3.15%), and a lower average predictive entropy (0.19). The proposed denoising method outperforms other joint denoising methods and is more effective in severely noisy environments, which is particularly important in practical applications.

Fig. 17 Comparison results with the different denoising methods

The effectiveness of the proposed ESGRU in uncertainty estimation is verified by comparing the predictive entropy values of the BCNN and the proposed method under different noise disturbances, with results shown in Table 7. As the level of noise increases, the predictive entropy values of the two methods follow the same trend, indicating a gradual increase in prediction uncertainty. Compared with the BCNN, the proposed method has a more stable diagnostic performance and is more resistant to noise interference, especially in strong and extreme noise environments. In addition, the validity of the proposed joint denoising method is also verified by its significantly lower predictive entropy values.

Table 7 Comparison of predictive entropy results with the BCNN method

Case 2 XJTU data

Experimental setup and data description

In this experiment, we use the public rolling bearing dataset provided by Xi’an Jiaotong University and the Changxing Sumyoung Technology Co. (XJTU-SY) (Wang et al., 2018) shown in Fig. 18.

Fig. 18 Tested rolling bearing of XJTU-SY

Testing was conducted on LDK UER204 bearings, and vibration data were collected at a sampling frequency of 25.6 kHz. Three different operating conditions of rotating speed were examined: condition 1, 2100 rpm; condition 2, 2250 rpm; condition 3, 2400 rpm. The five fault types included inner race fault (IRF), cage fault (CF), outer race fault (ORF), inner race and outer race fault (IORF), and mixed fault (MF, including inner race, ball, cage, and outer race faults). The description of the XJTU dataset is given in Table 8. Each condition contains 1000 samples obtained from 100,684 data points. Then, 70% of the data is used for training and 30% for testing.

Table 8 Description of XJTU-SY datasets in the present paper

Simulating different noisy environments

To simulate actual working conditions, different noise disturbances are added to the original signal, including 4 dB, 0 dB, -4 dB, and -10 dB for slight, medium, strong, and extreme noise conditions, respectively, as depicted in Fig. 19.

Fig. 19 Different levels of Gaussian noise disturbance

Denoising with the ICEEMDAN-ICA-FuEn

The proposed two-stage joint denoising method was applied. Figure 20 depicts a comparison between the original signal, noisy signal, and denoised signal, which allows visual verification of the validity of the proposed denoising method.

Fig. 20 The original, noisy, and denoised signal of MF under strong noise

Diagnosis performance with original data

Figure 21 shows the effectiveness of trustworthy diagnosis in the training and testing process, using the proposed ESGRU model as the classifier.

Fig. 21 Training and testing process diagram of the original data in Case 2

The evidence support allows correct classifications to be confidently distinguished from misclassifications, as illustrated in subplots (a) and (c) of Fig. 21. According to subplots (b) and (d) of Fig. 21, the diagnostic results during the training and testing processes reach 100% accuracy with very low uncertainty estimation values. This confirms that the proposed ESGRU model is capable of making reliable and accurate predictions.

Diagnosis performance with noisy data

Table 9 presents the diagnosis results on noisy data. Even under extreme noise conditions, the proposed ESGRU model provides accurate fault diagnosis results for all fault types and achieves an average accuracy of 97.07%. As illustrated by the uncertainty indicator, noise from the environment and the sensors disturbs the distribution of the signals and affects diagnostic reliability. Extreme noise has a greater impact on the outer race fault (C2), the inner race fault (C3), the inner and outer race fault (C5), and the mixed fault (C6), for which the testing uncertainty values all exceed 10%. Correspondingly, the predictive entropy values of these fault classes are significantly higher than those of the normal condition (C1) and the cage fault (C4). This verifies that the proposed model can perform reliable fault diagnosis and achieves good resistance to extreme noise.

Table 9 Diagnosis results of noisy data (extreme noise corrupted)

Figure 22 illustrates the significant difference between the predictive entropy for correct classification and misclassification. Under extreme noise interference, the median predictive entropy for correctly classified and misclassified samples is 0.087 and 0.989, respectively. This shows that higher uncertainty occurs mainly in misclassified samples. Figure 23 shows that the proposed method substantially improves calibration over the GRU model, for which the risk of overconfident predictions is high.

Fig. 22 Boxplots of predictive entropy for correct classification and misclassification under extreme noise environment

Fig. 23 Calibration result of EGRU and GRU under extreme noise environment

The proposed method achieves 100% accuracy and 1.99% uncertainty on the original signal, as Table 10 shows. When noise is gradually added, the average accuracy decreases from 100% to 97.07%, the average estimated uncertainty increases from 1.99% to 10.47%, and the average predictive entropy increases from 0.11 to 0.38. The experiments show the robustness of the proposed fault diagnosis method. As an example, under strong noise (SNR = −4 dB), 99.80% testing accuracy is achieved with an uncertainty of 4.21%.

Table 10 Diagnosis results from different noise disturbance

Ten samples are randomly selected from C2 under extreme noise conditions to illustrate the importance of uncertainty estimation in fault diagnosis, as shown in Fig. 24. Among the 10 samples, sample 4 was misclassified with a high uncertainty value of 79.20%. However, for the other correctly classified samples, the average value of uncertainty still reaches 16.63%. In this regard, even though samples are correctly classified, similar classification probabilities exist between classes, resulting in high uncertainty. For example, sample 7 has a misclassification risk between C2 and C3 with a high uncertainty value of 75.32%. Sample 1, with a high probability of 98.77% in class 2 and a low uncertainty of 1.47%, can be confidently classified as C2.

Fig. 24 Samples diagnostic probability and uncertainty estimation of C2

Diagnosis performance with denoised data

Diagnosis results with the denoised data are presented in Table 11. The extreme noise condition is selected to demonstrate the denoising performance of the proposed method. In comparison with the noisy data, the average testing accuracy increases by 2.93% and reaches 100%. The average uncertainty values of the testing samples drop to 2.93%, which is a reduction of 7.54% compared to the results of the noisy data, as shown in Table 9. Correspondingly, the average predictive entropy value is reduced to 0.14, decreased by 0.24, which verifies the effectiveness of the proposed denoising method.

Table 11 Diagnosis results of denoised data (extreme noise corrupted)

The results under different levels of noise interference can be found in Table 12. It achieves an average diagnosis accuracy of almost 100% in variable noise environments. The average values of uncertainty estimation are 1.85%, 2.05%, 2.38%, 2.88%, and 2.93%, respectively. Correspondingly, the average predictive entropy values are 0.10, 0.10, 0.12, 0.14, and 0.14. A significant improvement has been observed both in prediction accuracy and reliability when compared to noisy data results, as shown in Table 10. The proposed denoising method can perform well even in extreme noise conditions.

Table 12 Diagnosis results from different denoised data

Comparison with other advanced methods

Many state-of-the-art classification methods have been reported for rolling bearing fault diagnosis; however, limited work can be found on uncertainty estimation under noise interference. This paper uses the CNN, stacked autoencoder (SAE), SAE-CNN, VMD-DCNN, and BCNN methods to conduct experiments under variable noise conditions. Results are presented in Table 13, with uncertainties provided in parentheses. The experimental results show that the proposed method reaches a classification accuracy of 97.07% even when operating in extreme noise environments. A further benefit of the proposed method is that it provides an estimate of trustworthiness by quantifying the prediction uncertainty.

Table 13 Comparison of accuracy results with other models (%)

The comparison result with other denoising methods is shown in Fig. 25. The average accuracy and the average F1 score reach almost 100%, the average uncertainty is low at 2.36%, and the average predictive entropy is 0.12 under varying noise conditions. The proposed denoising method outperforms other joint denoising methods in variable noise environments, particularly in extreme noise disturbance conditions.

Fig. 25 Comparison results with the different denoising methods

The comparison results with the BCNN method under different noise disturbances are shown in Table 14. The proposed method has a more robust diagnostic performance, especially in strong and extreme noise environments, with lower predictive entropy values of 0.18 and 0.38, respectively. Moreover, the validity of the proposed joint denoising method is verified for both methods, with lower predictive entropy values compared to the noisy data.

Table 14 Comparison of predictive entropy results with the BCNN method

Conclusions

This paper proposed a framework for trustworthy and intelligent fault diagnosis through an effective denoising process and a diagnosis approach with uncertainty estimation. A novel two-stage joint denoising method based on ICEEMDAN and ICA, with the fuzzy entropy threshold discriminant as the selection criterion, was developed and achieved an excellent denoising effect under varying noise conditions. Through the integration of evidence theory and the stacked GRU model, the proposed ESGRU method with calibration can provide reliable predictions with uncertainty estimation. The evidential loss function with KL divergence was applied to the ESGRU model, which enhanced the precision of the uncertainty estimation without reducing diagnostic ability. Experimental studies on two roller bearing datasets verified the diagnostic performance of the proposed approach. The proposed method can be applied in noisy practical scenarios and safety–critical applications.

In future work, achieving trustworthy diagnosis under cross-working conditions and with limited data samples needs to be investigated. The risk of misclassification and the cost of wrong decisions should also be considered in the evidential loss function, which is an important step toward practical application. Furthermore, integrating approximate BNN inference methods with an evidential loss function that places a Dirichlet prior on the predictions would be a promising direction for addressing the challenge of untrustworthy fault diagnosis.