1 Introduction

Prediction of the remaining service life is one of the key elements of prognostics and health management (PHM) technology. PHM has changed the contemporary maintenance concept and realized the transformation from scheduled maintenance to condition-based maintenance, which balances maintenance costs, resource losses, and production benefits while also addressing latent safety hazards; it has been widely applied in fields such as electronics, the aviation industry, and military applications [1, 2]. Health prediction is one of the core missions of PHM, and its main purpose is to estimate the remaining useful life (RUL) of equipment by revealing its performance degradation pattern from data on the equipment's operating state. Accurate RUL prediction enables sound maintenance planning and timely replacement of safety-critical components, thereby avoiding sudden system failures [3].

To meet this need, many advanced methods for remaining useful life prediction have emerged, falling mainly into two directions: mechanism-model-based and data-driven. Model-based prediction requires an accurate description of the dynamics of the mechanical system (or component) [4, 5]. Common model-based methods include particle filters [6], the Weibull distribution [7], etc. However, with the rapid development of industry, mechanical structures are becoming increasingly integrated and their working mechanisms increasingly complex, so constructing a comprehensive and accurate mechanism model is impractical even for industry specialists.

Moreover, the fault types of complex mechanical systems are varied, and a single mechanism model does not adapt well to complex and variable faults; solutions based on mechanism models are therefore less flexible [8] and have certain limitations.

In recent years, data-driven schemes have been widely used for lifetime prediction [9]. Data-driven prediction methods demand less domain expertise and no longer require knowledge of the detailed working mechanisms of mechanical systems. Instead, it suffices to gather relevant data from the system through sensors and use a data-driven algorithm to capture the degradation trend in the data for accurate RUL prediction [10]. Compared with the mechanism-model approach, the machine-learning-based approach abandons specific stochastic process assumptions and builds a direct mapping between degradation data and RUL through algorithms. Commonly used lifespan prediction methods based on shallow machine learning include artificial neural networks (ANNs) [11, 12], support vector machines (SVMs) [13], random forests (RFs) [14], and hidden Markov models [15].

However, industrial systems are becoming increasingly complex, and shallow machine learning methods are not well suited to handling massive amounts of degradation data. Advances in computing have provided a platform for solving big-data problems, and deep learning has developed rapidly and is now used in many fields [16,17,18]. Convolutional neural networks (CNNs) describe the spatial features of sequences, with the advantages of local perception and parameter sharing [19], and have been used for RUL prediction. However, ordinary convolutional networks tend to lose key information through their pooling and downsampling operations; the capsule network was proposed to solve this problem. Capsule networks [20, 21] have also been used for lifetime prediction in the last two years, achieving good prediction accuracy. Considering the time dependence of vibration signals, Malhi et al. [22] proposed a competitive-learning-based approach that employs a recurrent neural network (RNN) to capture the long-term degradation information of the machine operating state. However, long-term RNN training suffers from vanishing or exploding gradients, which affects its online prediction capability [23]. As an improvement on the traditional RNN, Yuan et al. [24] designed a Long Short-Term Memory (LSTM) network for temporal data and applied it to RUL prediction of an aero-engine operating in complex environments with high noise and multiple faults. LSTM still has limitations for RUL prediction: the traditional LSTM uses only the features learned at the last time step to predict the remaining lifetime [8, 25], whereas features learned at other time steps of the engine's full life cycle may also contribute to the final prediction, so appropriate weights should be assigned to the more important sensors and time steps to better capture key information [26]. The attention mechanism has attracted widespread interest in RUL prediction [27] in recent years because it can assign weights to different features according to their influence on the mechanical degradation trend. Ren et al. [28] proposed an attention-based deep learning system that imitates human attention by building an attention network that assigns corresponding weights to different features, improving accuracy. Transfer learning has also been widely applied to lifetime prediction because it handles prediction tasks across variable operating conditions better. Zhang et al. [29] proposed a transfer learning algorithm based on a bidirectional LSTM network for lifespan prediction, which solved the cross-domain prediction task. Tang et al. [30] applied a meta-transfer learning strategy to lifetime prediction and achieved high accuracy while improving adaptive hyperparameter learning in small-sample environments. Huang et al. [31] proposed a model-agnostic meta-learning algorithm to address the difficulty of obtaining full-lifecycle data in practical prediction tasks; their model constructs a pseudo-meta-task set by measuring the similarity between cross-domain time series, further improving generalization under small-sample conditions.

Weighing the advantages and shortcomings of the above methods, this paper proposes a deep-learning approach for remaining useful life prediction based on a capsule network with a two-layer attention mechanism and multi-scale feature extraction. The main contributions are as follows:

  1. A channel attention mechanism (CAM) is introduced to assign corresponding weights to the raw data from the different sensors. A time-step attention mechanism (TSAM) is embedded in the LSTM network to weigh the importance of the different moments of the whole engine life cycle while weakening the impact of noise in the raw data.

  2. The convolutional layer is replaced with an Inception V1 module to extract multi-scale features as input to the capsule network, which improves the feature extraction capacity of the model.

  3. The capsule layer portrays the overall features of the temporal data more effectively while preventing the loss of degradation information caused by the downsampling and pooling operations of traditional convolutional networks (CNNs), ensuring the integrity of the information and improving the prediction accuracy.

  4. Multiple types of experimental validation on a publicly available dataset demonstrate the feasibility of the proposed method.

2 Methodology

2.1 Long Short-Term Memory

Life prediction requires multiple sensors to collect degradation data with temporal correlation across multiple dimensions of the monitored object, such as vibration and acoustic signals. Recurrent neural networks (RNNs) are widely used for remaining useful life prediction because they handle time-series data well, taking into account the correlation between successive moments of the sequence. However, the "long-term dependency" problem arises after the RNN nodes are computed over many steps, leading to vanishing or exploding gradients. To overcome this difficulty, Liu et al. [32] combined Long Short-Term Memory (LSTM) with the RNN to forecast the remaining useful life of a fuel cell. Owing to its unique advantages, the LSTM network has also achieved great success in other fields of temporal data processing, such as video analysis [33] and face recognition [34].

Figure 1 shows the structure of the Long Short-Term Memory network. It consists mainly of a forget gate, an input gate, and an output gate. The forget gate \(f_t\) discards redundant information from previous time steps, the input gate \(i_t\) is responsible for filtering information, and the output gate \(O_t\) controls the output of the network. In the figure, \(C_{t - 1}\) stores the cell state of the previous moment; \(f_t\) determines how much of the previous cell state is retained in the current state \(C_t\); the input gate \(i_t\) decides how much of the current input \(x_t\) is stored in the cell state \(C_t\); and \(O_t\) controls how much of the cell state \(C_t\) is passed to the current output value \(h_t\) of the LSTM. \(\sigma\) and \(\tanh\) are activation functions. The LSTM network is computed as follows:

$$ f_{\text{t}} = \sigma \left( {w_f \left[ {h_{t - 1} ,X_t } \right] + b_f } \right) $$
(1)
$$ i_{\text{t}} = \sigma \left( {w_i \left[ {h_{t - 1} ,X_t } \right] + b_i } \right) $$
(2)
$$ \widetilde{C}_t = \tanh \left( {w_C \left[ {h_{t - 1} ,X_t } \right] + b_C } \right) $$
(3)
$$ C_{\text{t}} = f_t \ast C_{t - 1} + i_t \ast \widetilde{C}_t $$
(4)
$$ O_{\text{t}} = \sigma \left( {w_O \left[ {h_{t - 1} ,X_t } \right] + b_O } \right) $$
(5)
$$ h_{\text{t}} = O_t \ast \tanh \left( {C_t } \right) $$
(6)
Fig. 1. LSTM network structure

Among them, \(w_f\), \(w_i\), \(w_C\), and \(w_O\) are the weight matrices of the forget gate, the input gate, the candidate cell state, and the output gate, respectively; \(b_f\), \(b_i\), \(b_C\), and \(b_O\) are the corresponding biases.
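For reference, the following is a minimal NumPy sketch of a single LSTM step corresponding to Eqs. (1)-(6); the function name, the toy dimensions, and the random initialization are illustrative assumptions rather than the configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step following Eqs. (1)-(6); w and b hold the gate weights and biases."""
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(w["f"] @ hx + b["f"])       # forget gate, Eq. (1)
    i_t = sigmoid(w["i"] @ hx + b["i"])       # input gate, Eq. (2)
    c_tilde = np.tanh(w["c"] @ hx + b["c"])   # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update, Eq. (4)
    o_t = sigmoid(w["o"] @ hx + b["o"])       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                  # hidden state, Eq. (6)
    return h_t, c_t

n_in, n_hid = 14, 32                          # toy sizes: 14 sensor inputs, 32 hidden units
rng = np.random.default_rng(0)
w = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, w, b)
```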

Given the excellent temporal modeling capability of LSTM networks, this paper uses an LSTM network to learn the temporal characteristics of the data. To address the limitation that the LSTM network focuses only on the features learned in the last time step for the final prediction, a temporal attention mechanism is embedded in the LSTM network to assign weights to the different time steps and improve the learning capacity of the model.

2.2 Attention Mechanism

The attention mechanism [35] is a data-processing technique in deep learning and is widely used in tasks such as natural language processing, face recognition, and speech recognition. The attention mechanism resembles human vision: when people observe external things, they first look for the most representative features and then downweight secondary features, forming a macroscopic overall impression of the target. The self-attention mechanism used in this paper allocates computational resources to the more important parts of the input, so that limited attention resources can rapidly screen out key information from massive amounts of information. It has been successfully applied in [28, 36, 37] and many other fields. Its calculation is as follows:

1. Given a data sample \(H = \left\{ {h_1 ,h_2 ,...,h_i ,...,h_m } \right\}\), \(h_i \in R^m\), where m is the number of feature sequences, the importance of each feature is scored according to Eq. (7), in which \(\phi \left( \cdot \right)\) is the activation function, used as the scoring function of the attention mechanism to judge the importance of features:

$$ s_i = \phi \left( {w \cdot h_i + b} \right) $$
(7)

2. The obtained scores are transformed by the softmax function. On the one hand, this normalizes them into a probability distribution whose weights sum to 1; on the other hand, it highlights the key features.

$$ \alpha_i = softmax\left( {s_i } \right) = \frac{{\exp \left( {s_i } \right)}}{{\sum_i {\exp \left( {s_i } \right)} }} $$
(8)

3. The final weighted output feature data are:

\(O = H \otimes A = \left\{ {\alpha_1 h_1 ,\alpha_2 h_2 ,...,\alpha_m h_m } \right\}\), where \(A = \left\{ {\alpha_1 ,\alpha_2 ,...,\alpha_m } \right\}\) is the weight vector and \(\otimes\) denotes element-wise multiplication.
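The scoring, softmax normalization, and element-wise weighting of Eqs. (7)-(8) can be sketched as follows; using tanh for \(\phi\) and a single projection vector for w is an illustrative assumption.

```python
import numpy as np

def attention_weights(H, w, b=0.0):
    """Score each feature h_i (Eq. (7)), normalize with softmax (Eq. (8)),
    and return the weighted features alpha_i * h_i together with the weights."""
    s = np.tanh(H @ w + b)                    # scores s_i, one per feature
    alpha = np.exp(s) / np.exp(s).sum()       # softmax over the m features
    return alpha[:, None] * H, alpha

rng = np.random.default_rng(1)
H = rng.standard_normal((14, 30))             # m = 14 feature sequences of length 30
w = rng.standard_normal(30) * 0.1
O, alpha = attention_weights(H, w)
print(alpha.sum())                            # the weights sum to 1
```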

2.3 Inception V1

The Inception network architecture was proposed by Google in 2014. Compared with a traditional convolutional layer, this structure further increases the depth and width of the network. Inception V1 employs convolutional kernels of different sizes (and therefore different receptive fields) in the same layer for multi-scale feature extraction and merges the parallel convolution branches, with 1×1 convolutions acting as a bottleneck layer. To avoid the overfitting that would result from simply increasing the width and depth of the network, sparse connections are used instead of full connections. Viewed from another perspective, this design matches intuition and is similar to how humans perceive a thing: it is processed from different perspectives and the features learned from each perspective are then aggregated.

In this paper, we use the Inception module instead of the traditional convolutional layer to extract degradation features from the sensor data at multiple scales as the input to the capsule network. The structure of the Inception module is shown in Fig. 2 [38].

Fig. 2. InceptionV1 structure
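As an illustration of the parallel-branch design in Fig. 2, the following is a minimal Keras sketch of an Inception-V1-style block built from 1D convolutions; the branch widths, kernel sizes, and 1×1 bottlenecks are illustrative assumptions, not the exact configuration of the proposed model.

```python
from tensorflow.keras import layers

def inception_block(x, f1=16, f3=16, f5=16, fp=16):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a pooling branch,
    all padded 'same' and concatenated along the channel axis (Fig. 2)."""
    b1 = layers.Conv1D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(16, 1, padding="same", activation="relu")(x)   # 1x1 bottleneck
    b3 = layers.Conv1D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv1D(16, 1, padding="same", activation="relu")(x)   # 1x1 bottleneck
    b5 = layers.Conv1D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    bp = layers.Conv1D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

inp = layers.Input(shape=(30, 14))            # e.g. a window of 30 cycles x 14 sensors
out = inception_block(inp)                    # shape (30, f1 + f3 + f5 + fp)
```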

2.4 Capsule

Capsule networks were first proposed by Geoffrey Hinton [39] in 2011, and in late 2017 Hinton et al. [40] proposed the capsule network architecture, a new deep neural network model. Such models are currently used in several fields, such as image recognition, lifetime prediction [20, 21], and fault diagnosis. Unlike traditional network neurons, a capsule takes vectors as input and output and, thanks to a dynamic routing algorithm, discards the pooling operation of traditional convolutional networks so as to preserve the prediction information as completely as possible. As shown in Eq. (9), traditional convolutional networks treat the relationship between high-level and low-level features as a simple weighted sum, whereas capsule networks are good at capturing the positional relationships between high-level and low-level features.

$$ x_l^{\left( m \right)} = \alpha \left( {\sum_{d = 1}^D {w_l^{\left( {d,m} \right)} } \ast x_{l - 1}^{\left( d \right)} + b_l^{\left( m \right)} } \right) $$
(9)

where * denotes the convolution operation, \(x_l^{\left( m \right)}\) is the mth feature map output by the lth convolutional layer, \(\alpha\) is a nonlinear activation function, \(w_l^{\left( {d,m} \right)}\) is the convolution kernel (weight matrix) connecting the dth input feature map to the mth output feature map, and \(b_l^{\left( m \right)}\) is the bias. Figure 3 shows a schematic diagram of a basic capsule network, consisting of a convolutional layer, a primary capsule layer, and a digital capsule layer.

Fig. 3. Capsule network structure

To understand the capsule network in more depth, it is essential to understand its learning strategy, i.e., the dynamic routing between capsules shown in Fig. 4 [39]. Capsules can be regarded as groups of neurons forming vectors: the vector dimensions represent the spatial location information of the features, and the vector length indicates the probability that a feature exists. Because these vectors are directional, the capsule guarantees translation equivariance of the features during feature extraction, whereas traditional convolutional networks only provide translation invariance. Therefore, CapsNet is used as the final high-dimensional feature extraction module in this paper. The iterative computation of dynamic routing can be expressed as:

$$ u_{j|i} = w_{ij} u_i $$
(10)
$$ s_j = \sum_i {c_{ij} } u_{j|i} $$
(11)
$$ v_j = \frac{{\left\| {s_j } \right\|^2 }}{{1 + \left\| {s_j } \right\|^2 }}\frac{s_j }{{\left\| {s_j } \right\|}} $$
(12)
$$ c_{ij} = \frac{{\exp \left( {b_{ij} } \right)}}{{\sum_k {\exp \left( {b_{ik} } \right)} }} $$
(13)
$$ b_{ij} \leftarrow b_{ij} + u_{j|i} \cdot v_j $$
(14)
Fig. 4. Dynamic routing diagram

In Eq. (10), \(w_{ij}\) is the prediction matrix, which is multiplied by the input vector \(u_i\) to obtain the high-level prediction vector \(u_{j|i}\). Equation (11) is the weighted sum of all prediction vectors, giving the vector \(s_j\). \(c_{ij}\) is the coupling coefficient, which determines the information transfer between the low-level capsule and the high-level capsule and is updated by Eq. (13). The initial value of \(b_{ij}\) is 0; it reflects, to some extent, the similarity between the output vector and the input vector. The vector \(v_j\) is obtained through the squashing function of Eq. (12), which compresses its length into (0, 1).
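A compact NumPy sketch of the squashing function and the routing iterations of Eqs. (10)-(14) is given below; the numbers of low-level and high-level capsules and the capsule dimension are toy values chosen only for illustration.

```python
import numpy as np

def squash(s):
    """Eq. (12): compress the vector length into (0, 1) while keeping its direction."""
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat[i, j] holds the prediction vector u_{j|i} of low-level capsule i for
    high-level capsule j (Eq. (10)); returns the output capsules v_j."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                                # routing logits, initially 0
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)     # coupling coefficients, Eq. (13)
        s = np.einsum("ij,ijk->jk", c, u_hat)                    # weighted sum, Eq. (11)
        v = squash(s)                                            # Eq. (12)
        b = b + np.einsum("ijk,jk->ij", u_hat, v)                # agreement update, Eq. (14)
    return v

rng = np.random.default_rng(2)
u_hat = rng.standard_normal((32, 10, 8))      # 32 primary capsules -> 10 high-level 8-D capsules
v = dynamic_routing(u_hat)                    # shape (10, 8)
```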

3 The Proposed MAI-Capsule Model

As shown in Fig. 5 [43], the proposed MAI-Capsule model framework comprises three distinct modules: the double-layer attention module, the Inception multi-scale feature extraction module, and the capsule network. The normalized sensor data are first passed through the CAM to assess the impact of the different sensor channels on the prediction. The TSAM is then embedded after the LSTM network, which learns the temporal characteristics of the data, to evaluate the influence of the different moments of the engine life cycle on the prediction. The Inception module extracts multi-scale features from the weighted output of the two-layer attention structure, which is then fed into the capsule network for regression prediction.

Fig. 5. Network structure of the proposed model

3.1 Attention Mechanism Model

In practice, a number of factors can interfere with the sensors' data collection. For instance, data gathered at different sites and by different types of sensors affect the prediction differently, and even the data gathered by the same sensor at different moments of the machine's life cycle differ. In traditional neural networks, the data collected from different sensors, and from different moments of the same sensor, are weighted equally in prediction, which causes important degradation information to be ignored and non-essential information to be amplified, affecting the accuracy of the final prediction.

First is the channel attention mechanism. A data sample can be denoted as \(x = \left\{ {x_1 ,...,x_i ,...,x_m } \right\}\), and the data from the different sensors at moment t as \(x_t = \left\{ {x_{1,t} ,...,x_{i,t} ,...,x_{m,t} } \right\}\), where m is the number of sensors.

1. The sample data of the sensors at moment t is passed through Eq. (15) to obtain the scores of different sensors:

$$ S_t = \phi \left( {w \cdot x_t + b} \right) $$
(15)

This yields the scores of all sensors at moment t, \(S_t = \left\{ {S_{1,t} ,S_{2,t} ,...,S_{i,t} ,...,S_{m,t} } \right\}\).

2. After obtaining the score of the sensor at moment t, it is transformed into the weight value of the ith sensor at moment t by Eq. (16):

$$ \alpha_{i,t} = softmax\left( {S_{i,t} } \right) = \frac{{\exp \left( {S_{i,t} } \right)}}{{\sum_i {\exp \left( {S_{i,t} } \right)} }} $$
(16)

That is, the total sensor weights at time t are denoted as \(\alpha_t = \left( {\alpha_{1,t} ,...,\alpha_{i,t} ,...,\alpha_{m,t} } \right)\).

3. Calculate the average weight of the ith sensor:

$$ \overline{\alpha }_i = \frac{1}{T}\sum_t {\alpha_{i,t} } $$
(17)

The average weight of each sensor is denoted as \(\alpha = \left\{ {\overline{\alpha }_1 ,\overline{\alpha }_2 ,...,\overline{\alpha }_m } \right\}\), where T is the total number of cycles.

4. The first-level attention mechanism output can be expressed as:

$$ O = \alpha \otimes x = \left\{ {\overline{\alpha }_1 x_1 ,\overline{\alpha }_2 x_2 ,...,\overline{\alpha }_m x_m } \right\} $$
(18)

After CAM processing, unnecessary information is suppressed and the degradation information from the important channels is given more weight.
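A minimal sketch of the CAM computation of Eqs. (15)-(18) is given below; reading w as a per-sensor scalar weight and \(\phi\) as tanh is an illustrative assumption.

```python
import numpy as np

def channel_attention(x, w, b=0.0):
    """x has shape (m sensors, T cycles). Score each sensor at every moment (Eq. (15)),
    softmax over the sensor axis (Eq. (16)), average over the cycle (Eq. (17)),
    and reweight the raw channels (Eq. (18))."""
    s = np.tanh(w[:, None] * x + b)                         # scores S_{i,t}
    a = np.exp(s) / np.exp(s).sum(axis=0, keepdims=True)    # softmax over the m sensors
    a_bar = a.mean(axis=1, keepdims=True)                   # average weight of each sensor
    return a_bar * x, a_bar

rng = np.random.default_rng(3)
x = rng.standard_normal((14, 30))             # 14 selected sensors over a 30-cycle window
w = rng.standard_normal(14) * 0.1
x_weighted, sensor_weights = channel_attention(x, w)
```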

Next, the TSAM further weights the output of the LSTM network along the time dimension. The data collected by the same sensor at different moments of engine operation contribute differently to the final prediction, so the TSAM is used to capture the more critical time points and improve the prediction accuracy. The sample data are \(x^{\prime} = \left\{ {x_1^{\prime} ,...,x_i^{\prime} ,...,x_m^{\prime} } \right\}^{\text{T}} = \left\{ {x_1^{\prime} ,...,x_t^{\prime} ,...,x_T^{\prime} } \right\}\), where the superscript T denotes transposition and T is the number of time steps in the cycle. The data of the ith sensor at the different moments are \(x_i^{\prime} = \left\{ {x_{i,1}^{\prime} ,...,x_{i,t}^{\prime} ,...,x_{i,T}^{\prime} } \right\}\).

1. As with the CAM, the scores of the sensor at the different time steps are first calculated according to Eq. (19):

$$ S_i = \phi \left( {w \cdot x_i^{\prime} + b} \right) $$
(19)

Here \(S_i = \left\{ {S_{i,1} ,S_{i,2} ,...,S_{i,t} ,...,S_{i,T} } \right\}\) is the score of the ith sensor at each moment.

2. The weight of the i-th sensor at moment t can be calculated according to Eq. (20):

$$ \eta_{i,t} = softmax\left( {S_{i,t} } \right) = \frac{{\exp \left( {S_{i,t} } \right)}}{{\sum_t {\exp \left( {S_{i,t} } \right)} }} $$
(20)

That is, the weight value for all moments of the ith sensor is \(\eta_i = \left\{ {\eta_{i,1} ,\eta_{i,2} ,...,\eta_{i,t} ,...,\eta_{i,T} } \right\}\).

3. Eq. (21) averages the weights of all sensors at a given moment t:

$$ \overline{\eta }_t = \frac{1}{m}\sum_i {\eta_{i,t} } $$
(21)

The average weights over all time steps are \(\eta = \left\{ {\overline{\eta }_1 ,\overline{\eta }_2 ,...,\overline{\eta }_t ,...,\overline{\eta }_T } \right\}\).

4. The output of the second layer of attention mechanism can be expressed as:

$$ O^{\prime} = \eta \otimes x^{\prime} = \left\{ {\overline{\eta }_1 x^{\prime}_1 ,\overline{\eta }_2 x^{\prime}_2 ,...,\overline{\eta }_t x^{\prime}_t ,...,\overline{\eta }_T x^{\prime}_T } \right\} $$
(22)

The weighted data obtained after the above two layers of attention processing highlight the more meaningful degradation information hidden in the data and lay the foundation for subsequent feature extraction.
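The TSAM mirrors the CAM but normalizes over the time axis and averages over the sensors, as in Eqs. (19)-(22); the sketch below makes the same illustrative assumptions about w and \(\phi\) as the CAM sketch above.

```python
import numpy as np

def time_step_attention(x, w, b=0.0):
    """x has shape (m sensors, T time steps), e.g. the LSTM output arranged per channel.
    Score each time step (Eq. (19)), softmax over the T steps (Eq. (20)),
    average over the sensors (Eq. (21)), and reweight the time steps (Eq. (22))."""
    s = np.tanh(x * w[None, :] + b)                           # scores S_{i,t}
    eta = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)    # softmax over the T steps
    eta_bar = eta.mean(axis=0, keepdims=True)                 # average weight of each step
    return eta_bar * x, eta_bar

rng = np.random.default_rng(4)
x = rng.standard_normal((14, 30))
w = rng.standard_normal(30) * 0.1
x_weighted, step_weights = time_step_attention(x, w)
```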

3.2 The Inception Module

The sensor data are quite complex and contain deep degradation information. To fully extract the hidden regression information in the weighted data, the capacity of the network must be increased by enlarging its depth and width, i.e., the number of layers and neurons. However, this increases the computational burden and can cause overfitting. The Inception module solves this problem well: with a sparse network structure it can still generate dense representations, improving the effectiveness of the neural network while guaranteeing efficient use of computational resources.

In this paper, we use Inception modules instead of traditional deep convolutional layers, using convolutional kernels of different sizes connected in parallel. As shown in Fig. 2 (Inception V1), the 1×1 convolution is used to reduce the data dimensionality, the borders of the feature maps are handled with "same" padding, and finally a Concatenate layer aggregates the learned multi-scale features.

3.3 Capsule Network

The main structure of the capsule network consists of a primary capsule layer and a digital capsule layer. The primary capsule layer is made up of a convolutional layer and a reshaping layer. The convolutional layer uses 32 convolution kernels of size 10 with a stride of 1 to create a 32-channel feature map that mines low-dimensional information from the output of the Inception module. The mined low-dimensional features are reshaped into 8-dimensional primary capsules by the reshaping layer and serve as the input to the second-level digital capsule layer. Finally, to extract high-dimensional features while maintaining the overall positional hierarchy of the temporal data, the digital capsule layer employs ten 8-dimensional capsules. The number of capsules affects the prediction accuracy to a certain extent, and subsequent comparison experiments take the number of capsules as the variable.
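A Keras-style sketch of the primary capsule layer described above (32 kernels of size 10, stride 1, reshaped into 8-dimensional capsules) is shown below; the input shape is an assumption, and the digital capsule layer with dynamic routing would be added on top as a custom layer, which standard Keras does not provide.

```python
from tensorflow.keras import layers, models

def primary_capsules(x, n_kernels=32, kernel_size=10, capsule_dim=8):
    """Primary capsule layer: a Conv1D with 32 kernels of size 10 and stride 1,
    reshaped into 8-dimensional capsule vectors (Sect. 3.3)."""
    feat = layers.Conv1D(n_kernels, kernel_size, strides=1, activation="relu")(x)
    return layers.Reshape((-1, capsule_dim))(feat)   # regroup the feature map into capsules

inp = layers.Input(shape=(30, 64))            # assumed Inception output: 30 steps x 64 channels
caps = primary_capsules(inp)                  # shape (None, number of capsules, 8)
model = models.Model(inp, caps)
model.summary()
```

The ten 8-dimensional digital capsules are then obtained from these primary capsules through the dynamic routing of Sect. 2.4.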

The use of capsule networks makes it possible to describe the overall properties of the time-series data more effectively. The dynamic routing algorithm (DRA) is responsible for updating the coupling coefficients between the two capsule layers [39], and three rounds of routing are used in this paper. Through the iterative updates, the coupling between related capsules is strengthened while that between unrelated capsules is weakened [21]; this property is reflected in the coupling coefficients updated through Eqs. (13) and (14).

Since lifetime prediction is a typical regression problem, the mean square error (MSE) is selected as the loss function [34], and the Adam algorithm is used as the optimizer. To improve the prediction of the model, a learning-rate decay strategy is added, with the initial value set to 0.001 and decreasing to 0.0001, after which the decay stops.
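A hedged Keras sketch of this training configuration is given below; the plateau-based decay schedule and the stand-in model are assumptions, since only MSE, Adam, and a decay from 0.001 to 0.0001 are specified above.

```python
from tensorflow.keras import layers, models, optimizers, callbacks

# `model` stands in for the assembled MAI-Capsule network
model = models.Sequential([layers.Input(shape=(30, 14)), layers.Flatten(), layers.Dense(1)])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3), loss="mse")   # MSE loss, Adam

# decay the learning rate from 1e-3 toward 1e-4 when the validation loss stops improving
lr_decay = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                       patience=5, min_lr=1e-4)
# model.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=[lr_decay])
```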

4 Experimental study and discussion

4.1 Dataset

In this section, the proposed model is empirically validated on the publicly available NASA C-MAPSS dataset [41]; the evaluation metrics, experimental results, and analytical discussion are also presented. The network was built with Keras.

The commercial aero-engine dataset provided by NASA [41] uses 21 sensors located at different positions to record the degradation process of aircraft engines. The dataset models the engine's degradation behaviour under various operating conditions and flight modes, each with its own set of failures. Figure 6 displays a simplified diagram of the aero-engine corresponding to the dataset, whose principal elements are the fan, low-pressure compressor, nozzle, high-pressure rotor, etc. Each engine starts in an initially unknown but healthy state; when a failure occurs it affects the engine performance, and the sensors register abnormal data. The life cycle of an engine runs from the healthy condition to shutdown, and the training data contain the data of the whole life cycle of each engine. There are four sub-datasets in the C-MAPSS dataset, and the details are displayed in Table 1.

Fig. 6. Engine schematic

Table 1 Data set

Each subset of the C-MAPSS dataset contains 26 columns for both the training and test data: the first five columns give the engine unit number, the current cycle, and the three operational settings, and the next 21 columns give the degradation data gathered by the 21 sensors spread across various locations. Some sensors (for example in dataset FD001) record nearly constant values from the beginning of engine operation until the end of its life and therefore have no impact on the final prediction. The 14 informative sensors selected in the experimental part of this paper are sensors 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 20, and 21.

In practical applications, once a failure occurs, the engine's operating performance decreases significantly until the end of its life, when it is completely degraded. Based on the research in [42], Heimes argues that the RUL of an engine in a healthy state should be between 120 and 130; the maximum RUL in this paper is therefore set to 125.

4.2 Data Preprocessing

In this paper, we apply the sliding time window approach of [36] to produce data samples; an example [43] is shown in Fig. 7. Given a complete life cycle of length T, a time step of S, and a window length of p, each sample has size p × m, where m is the number of sensors, and its RUL label is \(T - S - p\).

Fig. 7. Example of data sample creation
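The sample-creation procedure can be sketched as follows; the function name and the piecewise-linear capping of the label at 125 (see Sect. 4.1) are illustrative choices.

```python
import numpy as np

def make_samples(run, window=30, max_rul=125):
    """Slide a window of length `window` over one engine run of shape (T, m)
    and label every sample with its remaining cycles, capped at `max_rul`."""
    T = run.shape[0]
    samples, labels = [], []
    for start in range(T - window + 1):
        samples.append(run[start:start + window])            # one sample of size window x m
        labels.append(min(T - (start + window), max_rul))    # remaining cycles after the window
    return np.stack(samples), np.array(labels)

rng = np.random.default_rng(5)
run = rng.standard_normal((192, 14))          # one engine: 192 cycles, 14 sensors
X, y = make_samples(run)
print(X.shape, y.shape)                       # (163, 30, 14) (163,)
```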

Normalization is required before producing the data samples. In this paper, we use max-min normalization: formula (23) normalizes the raw data of each sensor to [0, 1].

$$ x_{m,t}^* = \frac{{x_{m,t} - \min \left( {x_m } \right)}}{{\max \left( {x_m } \right) - \min \left( {x_m } \right)}} $$
(23)

where \(\max \left( {x_m } \right)\) and \(\min \left( {x_m } \right)\) are the maximum and minimum values of sensor m, and \(x_{m,t}\) is the value of sensor m at instant t.
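A short sketch of the per-sensor max-min normalization of Eq. (23) is given below; the small epsilon in the denominator is an added safeguard against constant channels and is not part of Eq. (23).

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (23): scale every sensor column of x (shape: cycles x sensors) to [0, 1]."""
    x_min = x.min(axis=0, keepdims=True)
    x_max = x.max(axis=0, keepdims=True)
    return (x - x_min) / (x_max - x_min + 1e-12)

rng = np.random.default_rng(6)
raw = rng.standard_normal((192, 14)) * 50.0 + 100.0
norm = min_max_normalize(raw)
print(norm.min(), norm.max())                 # approximately 0.0 and 1.0
```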

4.3 Evaluation Metrics

In this paper, the root mean square error (RMSE) [35] and the scoring function [41] are selected to assess the predictive performance of the model.

N is the total number of test samples, and \(d_i = y_{pred}^i - y_{true}^i\) is the difference between the predicted and true values.

The formula (24) used to compute RMSE is:

$$ RMSE = \sqrt {{\frac{1}{N}\sum_{i = 1}^N {d_i^2 } }} $$
(24)

Score is expressed as:

$$ Score = \begin{cases} \sum_{i = 1}^N \left( {e^{\frac{d_i }{{10}}} - 1} \right), & d_i \ge 0 \\ \sum_{i = 1}^N \left( {e^{ - \frac{d_i }{{13}}} - 1} \right), & d_i < 0 \end{cases} $$
(25)

The scoring function penalizes under-prediction (\(d_i < 0\)) and over-prediction (\(d_i \ge 0\)) with different degrees of severity. The penalty for over-prediction is higher than that for under-prediction because its real-world consequences are more serious. The smaller the RMSE and Score, which together assess the predictive accuracy of the model, the better.
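The two metrics can be computed as in the following sketch, which follows Eqs. (24) and (25) directly; the toy values are only for illustration.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Eq. (24)."""
    d = y_pred - y_true
    return np.sqrt(np.mean(d ** 2))

def score(y_pred, y_true):
    """Eq. (25): asymmetric penalty that punishes over-prediction (d_i >= 0) more."""
    d = y_pred - y_true
    return np.sum(np.where(d >= 0, np.exp(d / 10.0) - 1.0, np.exp(-d / 13.0) - 1.0))

y_true = np.array([20.0, 50.0, 90.0])
y_pred = np.array([25.0, 45.0, 95.0])
print(rmse(y_pred, y_true), score(y_pred, y_true))
```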

4.4 Experimental Implementation and Results

To verify the superiority of the model, five experiments are set up. The first experiment investigates the effect of producing data samples with various time window sizes on the final prediction performance. The second studies the loss differences among different activation functions. The third examines the influence of the number of digital capsules on the prediction. The fourth is an ablation experiment of the proposed model, which aims to prove the contribution of each of the model's components. The last experiment compares the proposed model with other prediction models.

4.4.1 Experiments with different time windows

Using different sizes of time window to produce the data samples changes the amount of information contained in each sample. Too small a window fails to capture the key information; too large a window, although containing more information, increases the computational load of the model and in turn affects the final prediction performance. It is therefore crucial to choose a suitable time window. This experiment uses different time windows to generate data samples on datasets FD001 and FD002, and the final prediction results are shown in Fig. 8. The left Y-axis of the graph indicates the RMSE value and the right Y-axis the Score. From the FD001 results, both the RMSE and the Score are minimal when the time window size is set to 30. For FD002 on the right, the minimum Score is also obtained at a time window of 30, while the RMSE is smallest at a time window of 20, but the difference from 30 is small. Combining the results of the two experiments, the final time window size chosen for this study is 30.

Fig. 8. Experimental results of different time windows on the FD001 and FD002 datasets

4.4.2 Loss differences and prediction accuracy based on different activation functions.

In the experiments, it was found that the choice of activation function affects both the final prediction accuracy and the convergence speed of the model. This experiment compares the model convergence speed and prediction accuracy under four activation functions: PReLU, ELU, ReLU, and SeLU. ReLU was finally selected as the activation function in this study; the results are shown in Table 2 and Fig. 9. To minimize the effect of random errors, all results are the averages of 10 runs. According to the results in Table 2, the four activation functions have little effect on the final prediction results, but ReLU works best, with a 0.5% reduction in RMSE and a 4.02% reduction in Score compared with SeLU. Also, as shown in Fig. 9, the model converges fastest with ReLU as the activation function. In summary, ReLU is chosen as the activation function in this study.

Table 2 Experimental results of FD001 with different activation functions
Fig. 9. Losses of different activation functions

Fig. 10. Effect of the number of capsules on RMSE

4.4.3 The influence of the number of capsules on the prediction effects as well as the training time

During the experiments, it was found that the number of digital capsules has a considerable impact on the final prediction performance of the model as well as on the training time. A larger number of capsules means more parameters to train, and the increased network complexity requires more training time, so it is necessary to choose a suitable number of capsules. In this paper, four capsule quantities (6, 8, 10, and 12) were tested on the FD001 dataset, each repeated 10 times and averaged; the final experimental results are shown in Figs. 10 and 11. Figure 10 shows the effect of the different numbers of capsules on RMSE and training time: the RMSE is smallest and its distribution most concentrated when the number of capsules is 10. Although the training time is longer than with 6 or 8 capsules, with the continuing improvement in computing power this time cost can be neglected. Figure 11 is consistent with the results in Fig. 10: with 10 capsules the lowest index values are obtained. Summarizing the analysis of the experimental results, the number of digital capsules in this paper is set to 10. (Note: the green line in Figs. 10 and 11 is the median, and the black box is the mean.)

Fig. 11. Effect of the number of capsules on Score

4.4.4 Ablation experiments of the proposed model

The purpose of this set of experiments is to demonstrate how each module of the proposed model affects the accuracy of the final prediction. Five groups of experiments were set up: the model without an attention mechanism (I-capsule), the model with only channel attention (CAI-capsule), the model with only temporal attention (TAI-capsule), the model with stacked convolutional layers instead of the Inception module (MAN-capsule), and the proposed model (MAI-capsule). To enhance the reliability of the experiments, validation was performed on each of the four datasets, and each model was averaged over 10 runs. The experimental results are shown in Figs. 12 and 13, corresponding to the data in Table 3. In the table, Mean is the average value and STD is the standard deviation; STD reflects the stability of the model to a certain extent, and the smaller it is, the more stable the model. According to the experimental results in the graphs,

Fig. 12. RMSE values for different model settings

Fig. 13. Score values for different model settings

Table 3 Experimental results of different model settings

it is clear that the models with the attention mechanism work better than the model without it. Moreover, the accuracy of the model increases as more attention layers are added, which demonstrates that assigning weights with the attention mechanism conspicuously improves the prediction performance of the model. In addition, the Inception module, because it uses convolutional layers in parallel, has fewer parameters and is easier to train than a stacked, serial convolutional network, and yields significantly better prediction results.

4.4.5 Comparison between different forecasting models.

In this round of experiments, five models were selected for comparison to demonstrate the advantage of the proposed prediction model: the shallow learning model random forest (RF) [14] and four deep learning models, namely the multilayer attention and temporal convolutional network model (MLSA) [35], the deep capsule network model (NDCN) [20], the gated capsule network model (GAM-CapsNet) [21], and the deep separable convolutional network model (DSCN). To ensure the reliability of the experiments, they were conducted on the four datasets, and all results were averaged over ten runs. The final experimental data are shown in Tables 4 and 5, corresponding to Figs. 14 and 15, respectively. As shown in Table 4 and Fig. 14, there is a significant improvement in the RMSE values on FD001, FD002, and FD004 compared with the other models, especially on the FD004 dataset, where the RMSE improves by 6.33% compared with the best of the other models. According to Table 5 and Fig. 15, the Score improvement is significant on FD001, FD003, and FD004, with a 27.05% improvement on FD004 compared with the MLSA model. In addition, the STD also improves significantly compared with the other models. In summary, the proposed model achieves the expected results and is more stable.

Table 4 RMSE values of different prediction models
Table 5 Scores of different prediction models
Fig. 14. RMSE values of different prediction models

Fig. 15. Score values of different prediction models

5 Analysis

Figure 16 displays the fitted trajectories of the true RUL and the predicted RUL for the four sub-datasets. As can be seen from the figure, the prediction results on the four sub-datasets are good, especially on FD001 and FD003, where the fit is better. Compared with FD001 and FD003, the poorer fit on FD002 and FD004 is due to the fact that FD001 and FD003 have only one operating condition, while FD002 and FD004 have six operating settings; the operating conditions of FD002 and FD004 are therefore more complex, which increases the uncertainty of the prediction.

Fig. 16. Fitted plots of true RUL and predicted RUL for the four sub-datasets

As shown in Fig. 17, normal distribution plots of the error between the true RUL and the predicted RUL for the four sub-datasets are included to show the final prediction more clearly. The results in the figure are obtained by taking the arithmetic difference between the predicted value and the true value. The prediction errors of the four datasets are distributed around 0, and the maximum error is below 60, further proving the superiority of the proposed model. The errors are concentrated mostly below 0, i.e., under-prediction (predicted RUL < true RUL). Since over-prediction (predicted RUL > true RUL) is much more harmful than under-prediction in real production, this demonstrates the reliability of the model.

Fig. 17. Prediction error distribution of the four sub-datasets

To understand the two-layer attention mechanism in the model more intuitively, a time-series sample from FD001 is chosen for a simple visualization of the attention mechanism. In Fig. 18, panel (a) shows the data of a sample after normalization, and panel (b) shows the output of the normalized data after the channel attention layer. Comparing the two plots, the colour of the heat map clearly changes, especially for sensors 4, 8, and 14, because the channel attention mechanism assigns corresponding weights to the different sensors, so that the raw data of each sensor contribute differently to the final prediction. Panel (c) shows the output of the data after the time-step attention layer; the data weighted by the time-step attention mechanism change significantly, focusing more on the moments that contribute more to the final prediction. The two layers of attention mechanisms work together to capture the more important degradation information and greatly improve the prediction accuracy of the model.

Fig. 18. Visualization of the attention mechanism

6 Conclusion

In this paper, an MAI-Capsule model is proposed to improve the accuracy of remaining useful life prediction. The two-layer attention network evaluates the effects on the final prediction of the data from different sensors and of the same sensor at different moments. The Inception module extracts multi-scale features from the weighted data, which are finally fed into the capsule network to preserve the overall characteristics of the time-series data more effectively. The feasibility of the model was verified on a publicly available turbofan engine dataset, and its advantage was confirmed by comparison with other methods. In addition, experiments were conducted on the effect of the number of capsules on the model training time, which is crucial for future real-time prediction.

The main advantages of the proposed model are as follows. First, the channel attention mechanism processes the data and highlights the contribution of the important sensor channels. Second, adding a temporal attention mechanism to the LSTM effectively reduces the influence of environmental noise while screening out the features of the important moments. Third, the multi-scale feature extraction module retrieves degradation information more comprehensively from multiple perspectives. Finally, the capsule network better identifies the spatial location relationships of the overall features. Taken together, the experimental results show that the proposed method significantly improves prediction performance and can be applied to future remaining useful life prediction tasks. However, the model still has room for improvement, as the training time grows under complex working conditions. Moreover, the random initialization of the neural network parameters increases the uncertainty of the final prediction. In future research, we will focus on further reducing the model complexity, shortening the model training time, and optimizing the parameter initialization.