1 Introduction

Throughout the modern history of humankind, manufacturing has been of central importance to economic advancement. According to statistics from the World Bank, the manufacturing industry contributed 16.8% to global gross domestic product (GDP) in 2018. The contribution to national GDP is as high as 27% in China, one of the highest proportions among all nations [1]. Driven by the evolving demands from “mass production” and “mass customization” to “mass personalization” [2, 3], manufacturing has evolved from mechanization and manual operation (Industry 1.0, 18th century) to today’s Industry 4.0, where operations take place in complex, digitized cyber-physical production systems (CPPS) characterized by sensor-rich monitoring and Internet-enabled edge/cloud computing for in-situ failure root cause diagnosis and future performance prognosis [4-7]. Accompanied by the increasing availability of abundant sensor data [6], as illustrated in Figure 1, analytical and numerical models, and computational infrastructure, the state of the art in real-time condition monitoring, failure root cause diagnosis, and machine remaining useful life (RUL) prognosis has enabled a higher level of automation, robustness, and adaptivity in networked and optimized manufacturing systems [8, 9].

Figure 1: Evolution of manufacturing, sensing technology and big data, adapted from Ref. [6]

The current wave of innovation in manufacturing, characterized by the concept of smart manufacturing and the digital transformation of the factory, is witnessing the convergence of big data [6, 8], artificial intelligence (AI, e.g., machine learning, ML, and deep learning, DL) [10, 11], and expanding communication and computational capabilities (e.g., the industrial internet of things, IIoT, cloud and edge computing, and graphics processing units, GPUs) [12-14]. This convergence has transformed a variety of manufacturing practices [15, 16], with condition monitoring, fault diagnosis and RUL prognosis among the most significant beneficiaries [17]. The increased availability of data due to the massive deployment of sensors, together with the rapid advancement of DL, has made it possible to gain insight into the mechanisms underlying manufacturing operations, leading to enhanced observability of machines and processes [18]. It has also made it feasible to associate data with condition-related parameters (e.g., fault type and RUL) with unprecedented accuracy [19]. Furthermore, advancements in communication and computational infrastructure have made it possible to carry out data transmission and computation with low latency to satisfy the requirements of real-time operation [20].

The past decade has seen fast growth in the number of papers on DL-enabled condition monitoring, diagnosis and prognosis, which are comprehensively summarized in several review articles [21-23]. While the contributions from DL have been well highlighted, several limitations have also been identified. First, the datasets investigated to evaluate DL algorithms are generally error-free (e.g., no outliers or missing values) and well balanced, with each class having a sufficient number of samples to fully optimize the DL model parameters [10]. However, in real-world manufacturing scenarios, data errors can occur due to sensing or communication faults, and collecting a large amount of data from faulty equipment for algorithm training is often infeasible for both safety and economic reasons. The consequence is that datasets with errors or imbalance can degrade the performance of otherwise high-performing DL models if the level of error or imbalance is high [23, 24]. Second, the datasets used in the reported studies did not require a posteriori (and therefore error-prone) labeling. This is because commonly investigated scenarios, such as faults in the inner or outer race of a rolling bearing, are usually pre-labeled and seeded into the testing equipment before the data is collected. However, in realistic manufacturing scenarios, structural faults or anomalies are not “pre-labeled” because they are not known a priori, and therefore have to be interpreted from the collected data a posteriori. As an example, in additive manufacturing (AM), part surface defects are observed from images acquired after the completion of the AM process, and automated defect annotation (data labeling) is crucial to supporting the relevant diagnostic and/or predictive tasks. Third, the prediction logic of many DL models is generally not interpretable (or transparent) to users in a physical sense [25]. Without a clear understanding of the data patterns that a DL-based method uses to carry out a specific analysis task, it is difficult for users to establish trust in the performance and outcome of the algorithms.

To tackle these limitations, research on (1) data curation [26], which aims to improve data quality and provide semantic annotation, and (2) model interpretation, which aims to decipher DL model prediction logic [27], has become an indispensable step before and after the execution of DL algorithms (Figure 2), and has thus been gaining attention in recent years. Representative efforts include: (1) data denoising and cleansing methods that remove data pollution [28]; (2) generative models that recognize patterns underlying the data and synthesize samples to resolve problems arising from small or unbalanced datasets [29]; (3) semantic data annotation that automates the data labeling and contextualization process [30]; (4) relevance analysis methods, such as layer-wise relevance propagation (LRP), which trace the feature extraction processes utilized by neural networks to reveal the salient information from the input data for decision-making [31]; (5) attention mechanisms that enable the incorporation of interpretable prediction logic at the design stage of neural networks for enhanced model interpretability [32]; and (6) integration of DL and physics to ensure consistency between DL discoveries and existing physical domain knowledge [33].

Figure 2: DL-driven pipeline for monitoring, diagnosis and prognosis

This paper is motivated by the limitations identified above and aims to fill the knowledge gap by analyzing research outcomes on data curation and model interpretation for DL-enabled operation monitoring, fault diagnosis, and RUL prognosis in manufacturing. As illustrated in Figure 3, data provides the material basis for DL algorithms, and DL algorithms advance the state of technology for monitoring, diagnosis, and prognosis.

Figure 3: Interactions in a DL-enabled monitoring, diagnosis, and prognosis paradigm

The rest of this paper is organized as follows: Section 2 reviews the latest developments in data quality assurance that support effective data curation. Section 3 highlights several methods that improve the interpretability of DL models by relating the reasoning logic propagated through the neural network structure to physical laws. Section 4 examines several manufacturing applications that have benefited from these techniques. Conclusions and future directions are described in Section 5.

2 Data Curation

As a co-product of manufacturing, data encodes critical information underlying the dynamical behaviors of manufacturing machines and processes, providing the foundation for DL-driven algorithms. Advancements in sensing technologies have resulted in an ever-increasing amount of data acquired on factory floors [6, 34]. The increasing diversity and complexity of data not only pose new challenges for handling data quality issues such as noise, but also amplify additional problems such as data imbalance, outliers, unannotated data, or data with missing values. These become more prominent with the widespread usage of DL. To help ensure success of DL analysis, low-quality data need to be properly curated first [8, 26]. Several representative data curation techniques are summarized in Table 1 and are discussed in detail below.

Table 1 Representative techniques for data curation

2.1 Data Denoising

The purpose of data denoising is to extract pertinent information (e.g., process and machine dynamics, fault characteristics) from occluding background noise, thereby improving the effectiveness of data analysis [35]. The most widely adopted approach is to denoise by increasing the signal-to-noise ratio (SNR). Relevant techniques include projection-based methods, such as local geometric projection (LGP) [36], and frequency or time-frequency analysis, such as empirical mode decomposition (EMD) [37] and the wavelet transform [38].

The idea of LGP is that once the data is mapped into a high-dimensional phase space, the useful information and noise embedded in the data can be decomposed by orthogonal projection into different subspaces. By reconstructing the data from the subspace occupied by the useful information, noise can be removed. In practice [36], the phase space is first segmented into local regions. Within each region, the orthogonal projection matrix is computed by singular value decomposition (SVD). Specifically, only the eigenvectors associated with the largest singular values are used to form the projection matrix, which captures the majority of the data variance in the phase space and is therefore likely to represent the useful information. In experimental evaluations, an SNR improvement of 10 dB has been reported. Since LGP does not require prior knowledge about the frequency range of the noise components, it is more convenient to use than filtering-based methods.
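To make the projection concrete, the following is a minimal sketch of the LGP idea on a 1-D signal. The delay-embedding dimension, the retained rank, and the use of one global region in place of local segmentation are illustrative simplifications, not values from Ref. [36].

```python
import numpy as np

def lgp_denoise(signal, dim=10, rank=3):
    # Delay-embed the signal into a phase space of dimension `dim`.
    n = len(signal) - dim + 1
    X = np.stack([signal[i:i + n] for i in range(dim)], axis=1)  # (n, dim)

    # A single global region is used here for brevity; in practice the phase
    # space is first segmented into local regions and each is projected.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Keep only the leading singular directions, which carry most of the
    # data variance and are assumed to capture the useful information.
    X_proj = (U[:, :rank] * S[:rank]) @ Vt[:rank] + X.mean(axis=0)

    # Average the overlapping delay coordinates back into a 1-D signal.
    out = np.zeros(len(signal))
    counts = np.zeros(len(signal))
    for i in range(dim):
        out[i:i + n] += X_proj[:, i]
        counts[i:i + n] += 1
    return out / counts

t = np.linspace(0, 1, 2000)
clean = np.sin(2 * np.pi * 50 * t)
noisy = clean + 0.5 * np.random.randn(t.size)
denoised = lgp_denoise(noisy)
```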

The EMD algorithm decomposes data (commonly time series) into a sum of intrinsic mode functions (IMFs). The first IMF represents the highest dominant frequency in the data, and the frequency decreases as the decomposition proceeds [37]. As a result, EMD represents the data as a sum of frequency bands, and noise removal can be achieved by reconstructing the data from only the IMFs that contain the useful information (e.g., critical frequency components). In practice, a suitable IMF range can be determined based on metrics such as the mutual information ratio (MIR). For example, the cutoff point can be chosen as the one that leads to the largest increase in MIR, representing the threshold at which the useful information begins to be captured by the IMFs [37].
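A minimal sketch of EMD-based denoising is given below. It assumes the PyEMD and scikit-learn packages and uses the mutual information between each IMF and the raw signal as a stand-in for the MIR criterion of Ref. [37]; the cutoff rule is a simplified illustration of the largest-increase heuristic.

```python
import numpy as np
from PyEMD import EMD
from sklearn.feature_selection import mutual_info_regression

t = np.linspace(0, 1, 2000)
signal = np.sin(2 * np.pi * 10 * t) + 0.4 * np.random.randn(t.size)

imfs = EMD()(signal)  # IMFs ordered from highest to lowest frequency

# Score each IMF by its mutual information with the raw signal.
mi = np.array([mutual_info_regression(imf.reshape(-1, 1), signal)[0]
               for imf in imfs])

# Choose the cutoff at the largest jump in the score, taken here as the
# point where the IMFs begin to capture the useful signal content.
cutoff = int(np.argmax(np.diff(mi))) + 1

# Reconstruct from the IMFs past the cutoff (noise sits in the early IMFs).
denoised = imfs[cutoff:].sum(axis=0)
```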

One time-frequency technique for data denoising is the wavelet transform, which denoises by thresholding small wavelet coefficients and reconstructing the data using the inverse wavelet transform. The rationale is that large wavelet coefficients usually contain the dominant data components [38]. The threshold value is commonly determined based on the estimated data variance [39]. Using a customized wavelet developed from the impulse response of a sensor-embedded rolling bearing, the SNR of the bearing’s vibration data was improved by up to eight times as compared to a standard wavelet [40].
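The following sketch illustrates wavelet-threshold denoising with the PyWavelets package; the 'db4' wavelet, the decomposition level, and the universal-threshold rule are common defaults used here for illustration, not the customized bearing wavelet of Ref. [40].

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)

    # Estimate the noise standard deviation from the finest detail band and
    # form the universal threshold sigma * sqrt(2 log n) [39].
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))

    # Soft-threshold the detail coefficients; large coefficients carrying
    # the dominant data components survive, small (noise) ones are removed.
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(signal)]

t = np.linspace(0, 1, 2000)
noisy = np.sin(2 * np.pi * 25 * t) + 0.4 * np.random.randn(t.size)
denoised = wavelet_denoise(noisy)
```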

An alternative approach to minimizing the effect of noise is stochastic resonance (SR) [41]. The idea is to amplify the critical frequency (e.g., the fault characteristic frequency) through interactions between the data (e.g., a time series) and a bistable system [41]. Specifically, when the time series is added to the governing equation of the bistable system, it can be shown mathematically that the critical frequency will be amplified at the system output if the “switch” frequency of the system is tuned to match the critical frequency [41]. In Ref. [42], an adaptive SR strategy is introduced that overcomes the standard SR method’s requirement for prior knowledge about the critical frequency and can accurately pinpoint fault-related frequencies from noisy vibration data that are otherwise undetectable. The technique is applicable to a wide range of critical frequency extraction applications.
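The following is a minimal sketch of SR with a bistable (double-well) system integrated by forward Euler; the system parameters are illustrative and would, in an adaptive scheme such as Ref. [42], be tuned so that the switching behavior matches the critical frequency to be amplified.

```python
import numpy as np

def bistable_sr(s, dt=1e-3, a=1.0, b=1.0):
    # Overdamped bistable dynamics dx/dt = a*x - b*x**3 + s(t), driven by
    # the noisy measurement s(t); integrated with forward Euler.
    x = np.zeros(len(s))
    for k in range(1, len(s)):
        x[k] = x[k - 1] + dt * (a * x[k - 1] - b * x[k - 1] ** 3 + s[k - 1])
    return x

t = np.arange(0, 10, 1e-3)
weak_tone = 0.3 * np.sin(2 * np.pi * 0.5 * t)   # critical frequency, buried
noisy = weak_tone + 1.0 * np.random.randn(t.size)
out = bistable_sr(noisy)  # spectrum of `out` shows an amplified 0.5 Hz peak
```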

More recently, hybrid denoising methods have been reported that take advantage of both the pattern recognition capability of DL and the physical understanding of noise contamination [43, 44]. Specifically, these methods first establish a contamination model y = G(x) based on knowledge of the contaminants, where x is the ideal, clean data representing the physical phenomenon and y is the measured data, contaminated with noise and determined by G. The contamination model serves as the guidance for data denoising, as only clean data that satisfies the model will be recovered (e.g., improving the SNR in a physically meaningful way). Since solving for x from y is generally an ill-posed problem (i.e., a large number of x can satisfy the model), the solution x must also be regularized to be consistent with prior knowledge about the data [45]. For this purpose, the method follows Bayesian theory by iteratively minimizing the inconsistency between the solution x and the contamination model G(x), as well as the inconsistency with the prior knowledge, R(x). Mathematically, the denoising problem is expressed as:

$$\boldsymbol{x}={\text{argmin}}_{\boldsymbol{x}}\left[\left\|\boldsymbol{y}-G\left(\boldsymbol{x}\right)\right\|_{2}^{2}+R\left(\boldsymbol{x}\right)\right].$$
(1)

The outcome of this process is termed the maximum-a-posteriori (MAP) estimate and is graphically illustrated in Figure 4. Essentially, x is iteratively recovered by alternately projecting the intermediate outcome onto the cluster (orange line) that satisfies the contamination model and the cluster (blue line) that satisfies the prior knowledge. At the end, the joint distance between x and the clusters is minimized, yielding the denoised data that is most consistent with both the physical contamination knowledge and the prior knowledge.
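A minimal sketch of solving Eq. (1) is shown below, assuming a known linear contamination model G and a simple Tikhonov smoothness prior for R(x), minimized by plain gradient descent; practical methods use far richer priors, as discussed next.

```python
import numpy as np

def map_denoise(y, G, lam=0.1, lr=0.05, iters=500):
    n = len(y)
    D = np.eye(n) - np.eye(n, k=1)          # finite-difference operator
    x = y.copy()
    for _ in range(iters):
        # Gradient of the data-fidelity term ||y - Gx||^2 ...
        grad = 2 * G.T @ (G @ x - y)
        # ... plus the gradient of the prior term R(x) = lam * ||Dx||^2.
        grad += 2 * lam * D.T @ (D @ x)
        x -= lr * grad
    return x

n = 200
G = np.eye(n)                                # identity contamination model
clean = np.sin(np.linspace(0, 4 * np.pi, n))
y = clean + 0.3 * np.random.randn(n)
x_hat = map_denoise(y, G)
```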

Figure 4: Illustration of MAP estimation

A major challenge in solving Eq. (1) is to analytically formulate R(x), given the limited prior knowledge available to characterize x [46]. To solve this problem, DL-based prior characterization has been developed, and two representative algorithms are described below.

Unrolled optimization. This is motivated by the iterative process of solving Eq. (1), in which the iterations can be concatenated into a layered structure. At each layer (iteration), two steps are carried out: (1) optimization with respect to the prior \({R(\boldsymbol{x}}^{k-1})\) to obtain the intermediate outcome \({\boldsymbol{s}}^{k}\), and (2) optimization with respect to the contamination model \({G(\boldsymbol{s}}^{k})\) to obtain the new estimate for the next iteration, \({\boldsymbol{x}}^{k}\). To formulate R(x) under limited physical knowledge for characterizing x, optimization w.r.t. the prior is modeled by a neural network, or \({\boldsymbol{s}}^{k}\leftarrow {\text{NN}(\boldsymbol{x}}^{k-1})\). The complete layered structure is trained in an end-to-end manner [46]. As a result of this iterative process, the underlying structural pattern of the image is learned, and an improvement in image peak signal-to-noise ratio (PSNR) of 3 dB has been reported as compared to other techniques. This denoising technique is also computationally efficient, with a processing time of around 0.1 s per image, making it well suited for real-time applications.
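The following sketch illustrates the unrolled structure under the assumption of a linear contamination model y = Ax + noise: each layer applies a small learned prior network followed by a gradient step on the data-fidelity term, and the whole stack is trained end to end. The architecture sizes are illustrative.

```python
import torch
import torch.nn as nn

class UnrolledDenoiser(nn.Module):
    def __init__(self, dim, n_layers=5):
        super().__init__()
        # One small prior network per layer (step 1: optimize w.r.t. prior).
        self.priors = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_layers))
        self.step = nn.Parameter(torch.full((n_layers,), 0.1))

    def forward(self, y, A):
        x = y.clone()
        for k, prior in enumerate(self.priors):
            s = prior(x)                       # s^k <- NN(x^{k-1})
            # Step 2: gradient step on ||y - A s||^2 (contamination model).
            x = s - self.step[k] * (A.t() @ (A @ s.t() - y.t())).t()
        return x

dim = 64
A = torch.eye(dim)                            # linear contamination model
model = UnrolledDenoiser(dim)
clean = torch.sin(torch.linspace(0, 6.28, dim)).repeat(16, 1)
y = clean + 0.2 * torch.randn_like(clean)
loss = ((model(y, A) - clean) ** 2).mean()    # end-to-end training loss
loss.backward()
```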

Generative prior. Sensing data in the form of time series or images often contain information that can be represented sparsely in a certain representation domain (e.g., the Fourier or wavelet domain) [45]. Such sparsity can be used to formulate \(R\left(\boldsymbol{x}\right)\) to regularize the solution of Eq. (1), for example, by setting R(x) = \(\left\| x \right\|_{1}\). However, such algorithms suffer from low efficiency, as the computational cost increases with the square of the data dimension [47]. To resolve this limitation, the method of generative prior has been developed [47]. Specifically, it first uses a generative DL model, such as a variational auto-encoder (VAE), to obtain the sparse representation of x. Subsequently, by replacing x with its sparse representation z through the decoder of the VAE, g(z), the prior term in Eq. (1) can be neglected (as the VAE already enforces the sparsity) and G(x) becomes G(g(z)), resulting in a low-dimensional problem, as shown in Figure 5, that can be solved more efficiently [47].
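Below is a minimal sketch of denoising with a generative prior: the prior term of Eq. (1) is dropped and the search runs over the low-dimensional latent vector z of a decoder g(z). The decoder here is an untrained placeholder standing in for a pretrained VAE decoder, and the identity contamination model is an assumption for brevity.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 64
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                        nn.Linear(32, data_dim))     # stands in for g(z)
G = lambda x: x                                      # contamination model

y = torch.randn(data_dim)                            # measured data
z = torch.zeros(latent_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    # Minimize ||y - G(g(z))||^2 directly in the low-dimensional latent space.
    loss = ((y - G(decoder(z))) ** 2).sum()
    loss.backward()
    opt.step()

x_hat = decoder(z).detach()                          # denoised estimate
```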

Figure 5: Generative prior for simplification of Eq. (1)

A significant advantage of the generative prior is its high computational efficiency, since it reduces the computational cost from quadratic to linear growth with the data dimension while achieving comparable denoising results. Table 2 summarizes the advantages and disadvantages of unrolled optimization and the generative prior.

Table 2 Comparison between hybrid denoising techniques

2.2 Data Cleansing

The widespread deployment of sensors and increasing complexity of machines and processes have also increased the vulnerability of in-situ data to problems, such as outliers [48] or missing values [49]. Data cleansing addresses these problems by detecting, removing, or correcting outliers and/or missing values among normal samples to improve data quality.

Outlier detection. An outlier (also known as an out-of-distribution, or OOD, sample) generally refers to a data sample that significantly deviates from the expected data pattern associated with the physical phenomenon it represents [50]. Common outlier detection methods can be classified into the following categories [48]: (1) statistical methods, which detect outliers based on the likelihood of seeing the sample under an assumed data distribution (e.g., Gaussian) [51]; (2) distance-based methods, which assume that within-distribution samples are located in a dense region of the data space while outliers are located farther away [52]; (3) density-based methods, which are based on the assumption that the data density should be similar around within-distribution samples and significantly different around outliers [53]; and (4) cluster-based methods, in which clustering techniques are applied and outliers are detected as samples not in the neighborhood of any cluster [54]. Despite ongoing progress, these methods are limited in handling high-dimensional and nonlinear data [48]. Most recently, DL-based methods have been reported, which focus primarily on enhancing the separability between within-distribution samples and outliers to facilitate the determination of an outlier detection threshold.

One DL-based method for outlier detection is to use autoencoders (AE) to find latent features by projecting data into network layers with progressively reduced dimensions. The basic idea is that the data reconstruction error is expected to be small for within-distribution samples and large for outliers [55]. This is because within-distribution samples are generally well clustered in the data space while outliers are likely to be randomly scattered. As a result, the gradients induced by outliers during AE training are likely to be diluted by the within-distribution sample gradients, and the network weights will be updated predominantly according to the within-distribution samples. Consequently, the data reconstruction error will be minimized primarily for within-distribution samples rather than outliers [55]. It is reported in Ref. [48] that by training an AE using only within-distribution samples, the reconstruction error serves as a good indicator for outlier detection based on a simple 3-standard-deviation threshold.
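A minimal sketch of this scheme is given below: an AE is trained on within-distribution samples only, and inputs whose reconstruction error exceeds a 3-standard-deviation threshold are flagged. The network sizes and data are illustrative placeholders.

```python
import torch
import torch.nn as nn

dim = 32
ae = nn.Sequential(nn.Linear(dim, 8), nn.ReLU(),   # encoder: reduce dimension
                   nn.Linear(8, dim))              # decoder: reconstruct
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

train = torch.randn(1024, dim)                     # within-distribution data
for _ in range(200):
    opt.zero_grad()
    loss = ((ae(train) - train) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    err = ((ae(train) - train) ** 2).mean(dim=1)
    threshold = err.mean() + 3 * err.std()         # 3-sigma detection rule

    x_new = torch.randn(5, dim) + 4.0              # shifted, likely outliers
    err_new = ((ae(x_new) - x_new) ** 2).mean(dim=1)
    is_outlier = err_new > threshold
```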

For datasets in which within-distribution samples are polluted with a comparatively small number of outliers (which is less ideal than the scenario in Ref. [48]), an iterative approach has been developed in Ref. [55], in which two steps are involved at each iteration: (1) discriminative labeling, which estimates the within-distribution samples from the mixed data based on the reconstruction errors at that iteration, and (2) reconstruction learning, which updates the AE to reduce the reconstruction error for the identified within-distribution samples. This iterative approach has been shown to gradually converge to a level of performance comparable to that of an AE trained using within-distribution samples only. The advantage of AE-based approaches is that outlier detection is done completely at the data level and is therefore task-agnostic.

Besides AE, probabilistic neural networks have also been investigated for outlier detection. The idea is to quantify the uncertainty associated with the predicted outcome (e.g., the type of a structural fault), where a high uncertainty suggests that the input is potentially an outlier. Commonly used techniques for the probabilistic formulation of neural networks include ensemble learning [56] and Monte Carlo (MC) methods [57, 58], which assume that multiple DL diagnostic/prognostic models exist for any task and quantify uncertainty based on the prediction entropy across all model outputs.
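As an illustration of the MC approach, the sketch below keeps dropout active at inference and scores inputs by the entropy of the class probabilities averaged over repeated stochastic passes; the network and the number of passes are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(64, 4))               # 4 fault classes

def mc_uncertainty(x, n_samples=50):
    net.train()                                     # keep dropout active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(net(x), dim=-1)
                             for _ in range(n_samples)]).mean(dim=0)
    # Predictive entropy: high for inputs the model is unsure about,
    # which suggests a potential outlier.
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

scores = mc_uncertainty(torch.randn(8, 16))
# Flag as outlier when the score exceeds a validation-chosen threshold.
```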

One of the latest developments is the multi-head network [59] (Figure 6), which consists of shared layers at the lower part of the network structure that diverge into multiple classifier heads at the upper part. In this work, a distributed gradient scheme is developed for network training, in which only a fraction of the gradient is used to update the best-performing classifier at each iteration to increase accuracy, while the remaining fraction flows through the other classifiers to improve generality for unseen data. As the multi-head network is trained only on within-distribution samples, the probability that the network produces similar uncertainty for both within-distribution samples and outliers is expected to drop quickly as the number of classifier heads M becomes large [59]. This facilitates the determination of an uncertainty-based threshold for outlier detection. Table 3 summarizes probabilistic network techniques.

Figure 6: Multi-head network, adapted from Ref. [59]

Table 3 Comparison of probabilistic network techniques

DL-based outlier detection methods based on temperature scaling and input perturbation have also been developed for pretrained DL classifiers [60]. Here, temperature scaling refers to calibrating the scaling factor in the softmax function of the classifier. Mathematically, larger scaling factors lead to larger softmax scores for within-distribution samples than for outliers. Perturbation refers to preprocessing the input by adding a perturbation term calculated from the gradient of the softmax function with respect to the input, which tends to have a larger value for within-distribution samples than for outliers based on experimental observations. Therefore, by adding this gradient as a perturbation to the input, the softmax scores of within-distribution samples are likely to become greater than those of the outliers. Collectively, temperature scaling and perturbation enhance the separability of the softmax scores between within-distribution samples and outliers, allowing a detection threshold to be set flexibly.
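A minimal sketch in the spirit of Ref. [60] is shown below: the logits are scaled by a temperature T, the input is nudged along the gradient that raises the top softmax score, and the resulting score is thresholded. The classifier, T, and the perturbation size are illustrative assumptions.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
T, eps = 1000.0, 0.002

def ood_score(x):
    x = x.clone().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x) / T, dim=-1)
    top = log_probs.max(dim=-1).values.sum()
    top.backward()
    # Perturb the input in the direction that increases the top score;
    # within-distribution samples respond more strongly than outliers.
    x_pert = x + eps * x.grad.sign()
    with torch.no_grad():
        score = torch.softmax(classifier(x_pert) / T, dim=-1).max(dim=-1).values
    return score  # larger score -> more likely within-distribution

scores = ood_score(torch.randn(8, 16))
# Flag as outlier when the score falls below a validation-chosen threshold.
```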

This work is further extended in Ref. [61]. Rather than relying on a single score at the final network layer for outlier detection, a Mahalanobis distance-based confidence score is calculated for all layers based on the layer-wise features, and the final score is a weighted sum of all the scores.

Data imputation. Besides outliers, another frequently encountered data quality issue is missing values, which are commonly caused by sensing or communication errors [62]. Imputation of missing values improves data quality by filling data gaps. Typical data imputation methods are based on statistics, such as linear interpolation or auto-regressive modeling [63, 64]. However, these methods often require strong assumptions (e.g., linearity) about the data generation process, making them less effective for more complex data in which the assumptions do not hold.

One actively researched topic in recent years is time series data imputation based on recurrent neural networks (RNN) and their variants, such as long short-term memory (LSTM) and gated recurrent units (GRU) [62, 65, 66]. These techniques capture the nonlinear time evolution pattern underlying the data to estimate the missing values. The main idea is to use a rolling window that progressively predicts the missing values by analyzing the data sequences immediately preceding the gaps.

The newly developed methods mainly differ in how the missing value is handled at the network input. In Refs. [65, 66], when the observed value is available at the current step, the input to the network is designed as a weighted sum of the observed value and the predicted value from the previous time step (Figure 7). When the value is missing at the current step, these methods use the predicted value from the previous time step directly as the network input.
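The following sketch illustrates this rolling scheme with a GRU cell: when a value is observed, the input is a weighted sum of the observation and the previous prediction; when it is missing, the previous prediction is used alone. The mixing weight and model sizes are illustrative assumptions, and training is omitted for brevity.

```python
import torch
import torch.nn as nn

hidden = 32
cell = nn.GRUCell(1, hidden)
readout = nn.Linear(hidden, 1)

def impute(series, mask, alpha=0.7):
    """series: (T,) values; mask: (T,) 1 = observed, 0 = missing."""
    h = torch.zeros(1, hidden)
    x_prev = series[0].view(1, 1)
    filled = []
    for t in range(len(series)):
        obs = series[t].view(1, 1)
        # Weighted sum of observation and previous prediction when observed;
        # the previous prediction alone when the value is missing.
        x_in = mask[t] * (alpha * obs + (1 - alpha) * x_prev) \
               + (1 - mask[t]) * x_prev
        h = cell(x_in, h)
        x_prev = readout(h)
        filled.append(x_in.squeeze())
    return torch.stack(filled)

t = torch.linspace(0, 6.28, 100)
series = torch.sin(t)
mask = (torch.rand(100) > 0.2).float()      # ~20% of values missing
imputed = impute(series, mask)
```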

Figure 7: Time series imputation with RNN, adapted from Ref. [66]

In contrast to Refs. [65, 66] in which the missing values are replaced by the predicted values of the same sequence, a different approach has been developed in Ref. [62] that aligns a separate, auxiliary “source” sequence to the “target” sequence with missing values. Then, it replaces the missing values with the corresponding values in the source sequence that are adjusted by the mean values of both sequences.

The development of DL, especially the convolutional neural networks (CNNs) that are specialized in image processing, has advanced image imputation, which was previously considered challenging [67]. One strategy for image imputation is to train an end-to-end CNN that learns a direct mapping from images with missing pixels to the corresponding complete images.

However, this approach has been shown to produce unsatisfactory results that tend to blur out the regions of missing pixels rather than learning the underlying structural pattern. This is mainly because the network training is guided by a reconstruction error that is averaged over all pixels [67]. One remedy is to treat image imputation as a special “denoising” process and implement the hybrid approach described in Section 2.1 [68]. Another approach, which has attracted increasing attention, is to add an “adversarial” training loss that penalizes the CNN when the recovered image does not resemble the original image [67]. Adversarial approaches are gaining popularity for high-fidelity data generation. In the context of data curation, one of their most significant applications is data balancing, which is described in the next section.

2.3 Data Balancing

Data balancing addresses the need for sufficient faulty data to produce a balanced dataset when training neural networks, thereby minimizing learning bias [69]. This is particularly important for tasks involving data classification, e.g., in fault diagnosis [70]. However, operating machines under faulty conditions just for the purpose of collecting faulty data for algorithm training is not feasible in real-world applications, since machines are required to maintain normal operating conditions to ensure product quality. One approach to remedy this issue is transfer learning, which transfers the diagnostic model or features from a source domain in which faulty data is sufficient to a target domain that lacks sufficient data for network training. The latest developments in transfer learning have been summarized in a series of review and technical papers [71-74]. The other research direction typically relies on high-fidelity synthetic data to augment the number of samples and improve data balance, which is the primary focus of this section.

Early work on data synthesis mainly relied on data interpolation, e.g., the synthetic minority over-sampling technique, or SMOTE [75]. The idea is to first randomly select a minority-class sample and one of its neighbors; a synthetic sample is then generated as a convex combination of the two, as sketched below. While the method works well for low-dimensional data such as process parameters and machine settings, SMOTE and related techniques cannot capture the complex characteristics commonly present in high-dimensional data, such as high-speed time series or images [24].
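A minimal sketch of SMOTE is given below: each synthetic sample is a convex combination of a randomly chosen minority-class sample and one of its k nearest minority-class neighbors; k and the sample counts are illustrative.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # k nearest neighbors of the chosen sample within the minority class.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        # Convex combination of the sample and its neighbor.
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.randn(20, 4)          # 20 faulty samples, 4 features
X_aug = smote(X_minority, n_synthetic=80)    # rebalance toward the majority
```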

A more systematic approach is made available by AE-based generative models [76]. The idea is to use the AE to learn a latent representation of the existing data and its underlying distribution, and then to generate synthetic data by sampling from that distribution. However, as the generation process is not guided by any supervision with regard to the quality of the outcome, the results of the AE can be unsatisfactory. A major breakthrough came with the development of generative adversarial networks (GAN) [29], a specialized DL architecture that allows for high-fidelity data synthesis with supervision.

The main structure of a GAN is composed of a generator and a discriminator, as shown in Figure 8. The GAN operates on the premise that the generator can be trained to convert a random noise vector into synthetic data (e.g., time series or images) that closely resembles the real data. The performance of the generator is evaluated by the discriminator, which aims to correctly classify an input as either “real” or “generated”. Specifically, the discriminator randomly takes as input either real data or the synthetic data produced by the generator, and outputs a scalar representing the probability that the input is “real”. Conversely, the objective of the generator is to generate synthetic data that is indistinguishable from the real data and deceives the discriminator. This is realized through the training of the GAN, in which the generator and discriminator play a minimax game: the generator tries to minimize the discriminator's accuracy, while the discriminator tries to maximize it. The final training outcome is an equilibrium point at which the discriminator can no longer distinguish the generated data from the real data, and the generator can no longer synthesize “better” data because the discriminator no longer provides useful feedback for further improvement. At this point, the generator is capable of synthesizing high-fidelity data to augment the number of samples in the minority classes and reduce dataset imbalance.
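A minimal sketch of this minimax training loop is shown below; the fully connected generator and discriminator, the data, and the hyperparameters are illustrative placeholders rather than an architecture from the cited works.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(256, data_dim)       # stands in for minority samples

for step in range(1000):
    real = real_data[torch.randint(0, 256, (32,))]
    fake = G(torch.randn(32, noise_dim))

    # Discriminator: maximize accuracy of real-vs-generated classification.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: minimize the discriminator's accuracy by driving generated
    # samples to be scored as "real".
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()
```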

Figure 8: GAN for image data synthesis, adapted from Ref. [29]

2.4 Data Annotation

Data annotation associates data, in a proper semantic format, with the contextual information under which it was acquired. The use of image data, which contains rich spatial information not captured in time series signals, has become an important aspect of DL-based techniques [6]. At the same time, semantic annotation and labeling of the image regions of interest (ROIs), which are indicative of critical information on the condition of the machine and process of interest, has remained a challenge due to the lack of techniques that can effectively parse abstract image patterns. Traditionally, the method of thresholding has been extensively applied [77, 78], with the goal of setting a pixel intensity threshold that separates the ROI from the remaining regions. However, the technique assumes that: (1) all pixels with intensity values within an established range belong to the same ROI, and (2) the ranges for different ROIs are non-overlapping. In reality, both assumptions are often invalid.

Built upon the image analysis capability of the CNNs, fully convolutional networks (FCNs) have been developed for the purpose of image semantic annotation [30]. A typical FCN is constructed by a pair of CNNs: an encoder and a decoder. The encoder, consisting of convolutional layers and pooling layers, distills essential information from the input image that is most relevant to semantic annotation. The decoder, which consists of upsampling operations and a classification layer, generates the annotated images.

At the classification layer, instead of producing a single probability distribution indicating the probabilities with which the whole image belongs to different categories (e.g., fault types), as in standard CNNs, an FCN applies the softmax function at the pixel level, generating for each pixel a probability distribution over the different ROIs (e.g., defect, tool wear) and non-ROI. Each pixel is then assigned to the ROI or non-ROI class with the highest probability.
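The sketch below illustrates the pixel-level softmax with a small encoder-decoder; the channel counts and the three classes (non-ROI, defect, tool wear) are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes = 3                                    # non-ROI, defect, tool wear
fcn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # encoder: distill information
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),                 # decoder: restore resolution
    nn.Conv2d(16, n_classes, 1))                 # per-pixel class scores

image = torch.rand(1, 1, 64, 64)
logits = fcn(image)                              # (1, n_classes, 64, 64)
probs = torch.softmax(logits, dim=1)             # pixel-level softmax
annotation = probs.argmax(dim=1)                 # ROI label per pixel
```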

Over the years, various semantic annotation methods built upon FCNs have been developed. Two widely used techniques are U-Net [79] and the mask region-based CNN (RCNN) [80]. The basic structure of U-Net is shown in Figure 9. Compared to the FCN, U-Net is designed in a symmetric fashion, with progressive upsampling layers in the decoder that match the encoder layers. In addition, the corresponding layers in the encoder and decoder are connected via skip connections to facilitate network training [82].

Figure 9: Structure of U-Net, adapted from Ref. [81]

In mask RCNNs, instead of analyzing the image as a whole, a region proposal network (RPN) is attached before the FCN to allow the network to first focus on small regions that potentially contain ROIs before carrying out FCN-based annotation [80]. Once annotated, the image can not only be used directly for diagnosis (e.g., surface defect diagnosis, tool wear evaluation), but the information extracted from the ROIs, such as area and geometry features, can also serve as input to DL models for other predictive tasks.

Besides image data, semantic annotation and labeling of text is also attracting increasing attention, as reflected in the development of natural language processing (NLP) techniques [83]. The fundamental problem in annotating manufacturing text, such as maintenance logs and inspection reports, is to convert the text into computable representations while maintaining its semantic information. One of the most widely investigated techniques is embedding, which refers to the mapping of words to representations in a high-dimensional space [84]. To establish domain-specific embeddings and annotate manufacturing context, a key step is to train the embedding mapping to maximize the consistency (quantified as an inner product) between an individual word and its observed manufacturing contexts while minimizing the consistency with non-observed contexts, as sketched below. This allows word semantics to be implicitly encoded in the representations based on the number and frequency of shared contexts, and semantically similar words are expected to have similar representations.
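The following sketch illustrates this training objective with negative sampling: the inner product between a word and an observed context is pushed up, and that with a randomly sampled context is pushed down. The vocabulary, the sampled pairs, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
word_emb = nn.Embedding(vocab, dim)
ctx_emb = nn.Embedding(vocab, dim)
opt = torch.optim.Adam(list(word_emb.parameters()) +
                       list(ctx_emb.parameters()), lr=1e-3)

words = torch.randint(0, vocab, (32,))      # center words, e.g., "spindle"
ctx_pos = torch.randint(0, vocab, (32,))    # contexts observed with the word
ctx_neg = torch.randint(0, vocab, (32,))    # randomly sampled contexts

opt.zero_grad()
# Consistency quantified as the inner product of the representations.
pos = (word_emb(words) * ctx_emb(ctx_pos)).sum(-1)
neg = (word_emb(words) * ctx_emb(ctx_neg)).sum(-1)
# Raise consistency with observed contexts, lower it with sampled ones.
loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()
loss.backward()
opt.step()
```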

Once the embedding is established, DL-based language models can be trained to decompose texts of interest into interpretable labels for diagnosis and prognosis [85]. Common DL-based models include the 1-D CNN and the RNN (and its variants), both of which allow the analysis of sequential patterns, a prerequisite for language understanding [86]. Recently, more dedicated, pre-trained language models have emerged, such as transformers [87] and bidirectional encoder representations from transformers, or BERT [88]. These models generally consist of a stack of encoders and self-attention modules, which allow efficient analysis of the relationships among different words in the inputs and outputs. For example, the key element in the transformer is a {query, key, value} tuple computed for each word, and the association among different words is quantified as the inner product of their corresponding tuple values. These pre-trained models provide a backbone for general language analysis, which can be adapted for specific purposes through fine-tuning with a task model [89]. The general flowchart for text annotation/labeling is shown in Figure 10.

Figure 10: Flowchart of NLP in text annotation, adapted from Ref. [86]

3 Model Interpretation

The capability of DL algorithms to automatically learn characteristic features from data, minimizing errors in diagnosis or prognosis while reducing the need for extensive human knowledge, is often credited as a key advantage of DL [10]. However, the prediction logic of DL-based algorithms is generally not clearly interpretable in a physical sense, making it difficult to establish user trust in model performance. To address the need for understanding the working mechanisms of DL and to facilitate its broad acceptance, several representative techniques that improve the interpretability of DL models are highlighted in this section and summarized in Table 4.

Table 4 Representative techniques for DL model interpretation

3.1 Relevance Analysis

One major research activity toward improving DL model interpretability is to determine the association, or relevance, of each input with the output. For tasks in which different regions of the input have semantic meanings, such as images or frequency spectrums, the prediction logic of the DL model can then be evaluated and verified against human knowledge by means of relevance analysis. Representative techniques include saliency maps [90, 91], deconvnet [92], and layer-wise relevance propagation (LRP) [31].

Saliency maps. Saliency maps provide a ranking of the individual inputs based on their influence on the network decision [90]. The idea is to approximate the network in the neighborhood of the inputs using a Taylor expansion and quantify the sensitivity of the decision relative to changes in each input. Once the sensitivity of each input is computed, a sensitivity heatmap can be generated to visualize the input regions that most influence the network decision, as shown in Figure 11.
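A minimal sketch of a gradient-based saliency map is given below: the derivative of the top class score with respect to each input element, i.e., the first-order Taylor term, ranks the inputs by influence. The model and input are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

x = torch.rand(1, 28 * 28, requires_grad=True)   # stands in for an image
score = model(x).max()                           # top-class score
score.backward()                                 # sensitivity of the decision

# Absolute input gradients form the saliency heatmap (28x28 here).
saliency = x.grad.abs().reshape(28, 28)
```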

Figure 11: Example of saliency maps, adapted from Ref. [91]

Deconvnet. An approach similar to saliency maps is the deconvnet [92]. Intuitively, a deconvnet is a network that uses the same kernels and pooling operations as the standard CNN that carries out the decision-making, but in the reversed direction. For example, instead of computing a weighted sum of image pixels through convolutional operations to generate features, it redistributes the features backwards to the individual pixels. In practice, the deconvnet is attached to its corresponding CNN, forming a U-shaped structure as shown in Figure 12. At each convolutional layer, an individual neuron is evaluated by first setting all other neurons in the layer to zero. The resulting feature maps are then passed as input to the attached deconvnet layer to reconstruct the activity in the layer beneath that produced the selected neuron's output. This process is repeated until the associations with the individual image pixels are obtained.

Figure 12: Structure of deconvnet, adapted from Ref. [93]

LRP. Different from saliency maps and the deconvnet, in which relevance is determined via network weights alone, the concept of LRP is to redistribute the network's outcome backwards using a local redistribution rule based on both network weights and activations, until a relevance score is assigned to each individual input [31]. Specifically, the relevance scores are propagated starting from the final layer of the neural network and are passed to the earlier layers in such a way that the sum of the scores is preserved at each layer. Mathematically, the local redistribution rule is expressed as:

$$R_{i}^{\left(l\right)}=\sum_{j}\frac{z_{ij}}{\sum_{i'}z_{i'j}+\epsilon\,\text{sign}\left(\sum_{i'}z_{i'j}\right)}R_{j}^{\left(l+1\right)},$$
(2)

where \(R_{i}^{\left(l\right)}\) is the relevance score of the ith neuron in the lth layer, sign() is the sign function, \(\epsilon\) is a numerical stabilizer, and \(z_{ij}\) is the contribution (the activation of the ith neuron multiplied by the corresponding weight) of the ith neuron in the lth layer to the jth neuron in the (l+1)th layer. Eq. (2) indicates that relevance scores can be positive or negative.

Using the diagnosis of machine fault types as an example, a positive-valued relevance score represents the evidence for the diagnostic decision, while a negative-valued score indicates the evidence against the diagnostic decision. By analyzing relevance scores propagated to the input of the DL model, the input regions assigned with high positive scores can be interpreted as an indication that the corresponding regions significantly contribute to the diagnostic decision, and vice-versa for regions with high negative scores. An illustrative example of recognizing major structural features of an airplane image using LRP is shown in Figure 13.
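A minimal sketch of LRP with the stabilized rule of Eq. (2), for a small fully connected network with ReLU activations, is given below; the weights and input are random placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 16)), rng.standard_normal((16, 4))]  # 2 layers
x = rng.standard_normal(8)

# Forward pass, storing each layer's activations.
acts = [x]
for W in Ws:
    acts.append(np.maximum(acts[-1] @ W, 0.0))        # ReLU activations

# Start relevance at the output: the score of the predicted class only.
R = np.zeros_like(acts[-1])
c = acts[-1].argmax()
R[c] = acts[-1][c]

eps = 1e-6
for l in range(len(Ws) - 1, -1, -1):
    a, W = acts[l], Ws[l]
    z = a[:, None] * W                    # z_ij = activation_i * weight_ij
    denom = z.sum(axis=0)
    denom = denom + eps * np.sign(denom)  # numerical stabilizer of Eq. (2)
    # Redistribute relevance to the layer below, conserving its sum.
    R = (z / denom) @ R

relevance_per_input = R                   # signed score per input element
```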

Figure 13: Illustration of LRP, adapted from Ref. [31]

3.2 Attention for Interpretable Structure

Different from relevance analysis, which interprets a trained DL model, the attention mechanism is a structure incorporated into the network design to establish prediction logic that is inherently interpretable [32, 94]. The design of the attention mechanism comes from domain knowledge and is most suited for capturing the dynamic relationships within a process. For example, in the sequential printing process of additive manufacturing (AM), the layer-wise influence of thermal activities on the final part property can be different for different parts, even for the same number of printed layers. In addition, the total number of printed layers of a part also varies as the layer height setting changes. This means that the prediction logic of the DL model must adapt to these variations. This poses a challenge for a standard neural network, since it encodes relationships in network weights that become fixed after training and are independent of the input; it therefore cannot capture dynamic relationships.

Attention mechanisms alleviate this limitation by enabling dynamic weight generation based on the specific context of the process. Specifically, the weights are generated by a separate context network that takes the relevant context as input. For example, to compute the thermal influence of a particular printed layer i on the part property in AM, the context can include the machine settings, material properties and the thermal activities of the adjacent printed layers. The adaptive weights \({w}_{i}\), i = 1, 2, …, N, are computed by first generating the corresponding unnormalized weights through a dense layer in the context network. Subsequently, to ensure that the relative influences of all printed layers add up to one for interpretability, a softmax layer normalizes the generated weights. Mathematically, the process can be expressed as:

$$w_{i}^{'}={W}_{\text{attn,dense}}\,{x}_{\text{context}},$$
(3)
$$w_{i}=\frac{\exp\left(w_{i}^{'}\right)}{\sum_{j=1,2,\dots ,N}\exp\left(w_{j}^{'}\right)},$$
(4)

in which \({W}_{\text{attn,dense}}\) represents the weights of the dense layer in the context network, \({x}_{\text{context}}\) is the input to the context network, and \(w_{i}^{'}\) is the unnormalized weight. This version of the attention mechanism is called Bahdanau attention, named after its creator [32]. Luong et al. [94] later improved the Bahdanau version by replacing the dense layer in the context network with an inner product to improve computational efficiency.
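A minimal sketch of Eqs. (3) and (4) is shown below: a dense layer scores each printed layer from its context vector, and a softmax normalizes the scores into weights that sum to one. The dimensions and the use of the weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

ctx_dim, n_layers = 12, 30
attn_dense = nn.Linear(ctx_dim, 1)            # W_attn,dense

x_context = torch.randn(n_layers, ctx_dim)    # settings, material, thermal
w_unnorm = attn_dense(x_context).squeeze(-1)  # Eq. (3): unnormalized w'_i
w = torch.softmax(w_unnorm, dim=0)            # Eq. (4): weights sum to one

layer_feats = torch.randn(n_layers, 8)        # per-layer thermal features
# Interpretable aggregation: each layer's contribution is read off from w_i.
summary = (w.unsqueeze(-1) * layer_feats).sum(dim=0)
```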

3.3 Integrating Neural Network with Physics

Although the attention-based network structure provides a pathway to capturing interpretable relations, it does not guarantee that the discovered relation is consistent with the underlying physics of the machine or process. This is because during network training, the update of the network weights, which determine the exact relationship between the inputs and outputs that the network represents, is guided by the prediction error only. Therefore, there is no guarantee that the network will converge to a relation that is also physically meaningful. The “spurious” relations discovered by neural networks often cannot generalize to unseen scenarios and can be detrimental for critical tasks such as machine fault diagnosis and performance prognosis [95]. To remedy this issue, the integration of neural networks and physics has attracted increasing attention in recent years. Three representative approaches are described in this section.

The first approach is based on the fact that the physical models underlying machines and processes often involve assumptions and simplifications [96]. For example, physical predictive models for machining processes such as grinding and milling often include the effects of major process parameters only, such as the depth of cut, while having limited capability to incorporate other factors, such as the operating conditions [97, 98]. Therefore, while these models can generalize well, their predictive accuracy is often lacking due to the incompleteness of the physical phenomena accounted for. Neural networks, on the other hand, have the advantage of learning the deviation of the physical model from real-world observations by leveraging in-situ sensing data that reflects the operating conditions. Therefore, by using a neural network to compensate for the deviation of the physical model (Figure 14), the complementary strengths of the two can be synergistically integrated [99], leading to improved predictive accuracy compared to physical models alone and enhanced generalization capability compared to purely data-driven methods.
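A minimal sketch of this first approach is given below: a network is trained on the residual between a simplified physical model and measurements, so the final prediction is the physics prediction plus the learned correction. The force law, data, and sizes are illustrative placeholders.

```python
import torch
import torch.nn as nn

def physics_model(x):
    # Simplified physical prediction from major parameters only, e.g., a
    # cutting-force law F = K * depth_of_cut (first input feature).
    return 50.0 * x[:, :1]

corrector = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

x = torch.rand(256, 4)              # process parameters + sensed conditions
# Synthetic "measurements" with an effect the physical model omits.
y = 50.0 * x[:, :1] + 5.0 * x[:, 1:2] ** 2 + 0.1 * torch.randn(256, 1)

for _ in range(500):
    opt.zero_grad()
    # The network only has to learn the deviation from the physical model.
    y_hat = physics_model(x) + corrector(x)
    loss = ((y_hat - y) ** 2).mean()
    loss.backward()
    opt.step()
```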

Figure 14: Integration of physical model and neural network [99]

The second approach leverages neural networks to numerically calibrate unknown parameters in physical models that are time-consuming or difficult to calibrate experimentally. As an example, Paris's law for fatigue crack propagation is expressed as \({\mathrm{d}a}/{\mathrm{d}N}=C\left(\Delta K\right)^{m}\), in which both C and m are unknown parameters that require experimental testing to determine [100]. In addition, the stress intensity range \(\Delta K\) also depends on a parameter related to the part geometry [100]. In this scenario, a neural network can be used to calibrate these unknown model parameters by associating them with the in-situ sensing inputs, thereby preserving the physical intuition of the model while alleviating the need for extensive experimental parameter calibration [101].
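The sketch below illustrates this second approach with Paris's law: a network maps in-situ sensing features to the unknown parameters C and m, and the physical law itself produces the crack growth prediction that is fitted to observations. All data, ranges, and parameterizations are illustrative assumptions.

```python
import torch
import torch.nn as nn

param_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(param_net.parameters(), lr=1e-3)

sensing = torch.randn(128, 6)             # in-situ sensing features
delta_K = torch.full((128, 1), 15.0)      # stress intensity range
da_observed = torch.rand(128, 1) * 1e-6   # measured crack growth per cycle

for _ in range(500):
    opt.zero_grad()
    out = param_net(sensing)
    # Keep C positive and m in a plausible range via soft parameterizations.
    C = 1e-11 * torch.exp(out[:, :1])
    m = 2.0 + torch.sigmoid(out[:, 1:2]) * 2.0
    # Paris's law: da/dN = C * (delta_K)^m
    da_pred = C * delta_K ** m
    loss = ((da_pred - da_observed) ** 2).mean()
    loss.backward()
    opt.step()
```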

The third approach adds physical constraints to the network training process, such that the relation discovered by the network is consistent with physical domain knowledge. The physical constraints can take the form of analytical equations or experimentally verified trends. For example, machine performance degradation should be monotonic; therefore, the performance predicted by the neural network should decrease monotonically as the operation cycle increases [102]. With the physical constraint, the network is forced to follow the physical equation or trend imposed by the constraint and can generalize well outside the range of the training data [103, 104]. Table 5 summarizes the comparison of these three approaches.
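A minimal sketch of this third approach is given below, in the spirit of Ref. [102]: a penalty added to the training loss discourages any decrease of the predicted degradation between consecutive cycles. The network, the data, and the penalty weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

cycles = torch.linspace(0, 1, 200).unsqueeze(-1)
wear = 0.5 * cycles + 0.05 * torch.randn_like(cycles)   # noisy measurements

for _ in range(1000):
    opt.zero_grad()
    pred = net(cycles)
    mse = ((pred - wear) ** 2).mean()
    # Physical constraint: penalize any decrease in predicted wear between
    # consecutive cycles (monotonicity of degradation).
    mono = torch.relu(pred[:-1] - pred[1:]).mean()
    loss = mse + 10.0 * mono
    loss.backward()
    opt.step()
```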

Table 5 Comparison of physical model and neural network integration approach

4 Application Highlights

The ultimate goals of data curation and model interpretation are to improve data quality, thereby ensuring the effectiveness and reliability of DL-based analysis, and to improve the interpretability of DL-based methods. In this section, several applications in manufacturing that have benefited from data curation and model interpretation techniques are highlighted.

4.1 Condition Monitoring of Machines and Processes

Condition monitoring refers to monitoring the quality and performance-related variables of a machine or process to identify significant deviation that is indicative of potential faults or anomalies [105]. One prerequisite for effective condition monitoring is the assurance of data quality such that the important variables can be faithfully reflected. Recent developments in data curation have contributed to data quality improvement in condition monitoring; two representative examples based on data annotation and data imputation are presented as follows.

In many condition monitoring scenarios, although part quality-related information is captured in the sensing data, it is often not directly computable and requires significant manual examination. One example is machine tool wear images. While a human worker can delineate regions of tool wear from images and consequently estimate the tool's condition, automated, time-saving annotation had long been missing until the recent development of DL-based methods. In Ref. [106], Miao et al. presented a U-Net based approach for tool wear annotation in the cutting process, as shown in Figure 15. Considering that the worn region of the tool typically covers only a small portion of the image, making the numbers of worn-region and normal-region pixels unbalanced, a Matthews Correlation Coefficient (MCC)-based loss function is designed to alleviate the effect of data imbalance during U-Net training. The effectiveness of the developed method has been confirmed in experimental evaluation, achieving over 95% accuracy in tool wear ROI annotation.

Figure 15: U-Net for tool wear image annotation, adapted from Ref. [106]

With the increasing variety of data sources, data with missing values has become a frequent phenomenon that negatively impacts the effectiveness of condition monitoring and can potentially lead to faults or anomalies going undetected. DL-based data imputation provides an effective means of dealing with this issue. In Ref. [62], a bi-directional LSTM-based method has been developed for time-series imputation in energy consumption monitoring. In addition to the rolling-window strategy and auxiliary sequence alignment described in Section 2.2, this work also features a bi-directional strategy that allows the missing values to be estimated by two estimators to further improve accuracy and robustness [107]. Experimental evaluation demonstrated a clear advantage of the developed method over traditional techniques in terms of imputation accuracy (root mean squared error reduced from 170.8 W to 90.3 W), especially in the situation of continuous missing values.

While common machine and process variables such as temperature can be measured in real-time, other important variables may not be directly measurable in-situ. They often only become available at the end of the process through post-process inspection. Therefore, predictive models are required to infer these variables from in-situ sensing data for timely detection of faults or anomalies. While DL-based predictive modeling for condition monitoring has been an active research field, the recent development of interpretable DL models has the potential to facilitate their widespread acceptance.

In Ref. [102], a bi-directional GRU with physics-informed network training has been developed for tool wear monitoring in milling. The input to the network at each step consists of statistical, frequency, and time-frequency features extracted from real-time force and vibration sensing data. The tool wear prediction at the network output is regularized by a physics-based loss function, which penalizes the network during training whenever the predicted tool wear values do not increase monotonically with cycle number, guiding the weight updates toward maximum physical consistency. Experimental evaluation has demonstrated that the integration of the neural network with physics not only eliminated physical inconsistency in the tool wear predictions, but also consistently achieved higher predictive accuracy compared to networks without physics-informed training.

In Ref. [108], an attention-based method for AM process monitoring and part property prediction has been developed, as shown in Figure 16. The attention mechanism is designed to capture the dynamic layer-wise thermal influence on the AM condition and part property. To generate the dynamic weights for each printed layer, the machine settings, material properties and the thermal activities up to that printed layer are selected as the context for the attention mechanism. Evaluation results have shown that larger weights are generated for the layers printed later in the AM process as compared to the earlier layers, and that this trend is consistent under different AM machine settings.

Figure 16: Attention-based AM monitoring and part property prediction [108]

4.2 Diagnosis of Fault and Anomaly

DL-enabled diagnosis requires associating condition-related features extracted from sensing data with the corresponding fault or anomaly root cause. To handle a large number of fault types with multiple fault severity levels, DL models often require a large number of training samples to be fully optimized. In real-world applications, however, the collection of faulty data is often limited by production and safety constraints.

Recently, GAN-based data synthesis has shown great potential for alleviating the lack of high-fidelity faulty data for model training. As an example, the effectiveness of GAN in synthesizing sensing data features related to a faulty motor is presented in Ref. [109]. Specifically, the features evaluated are the IMFs from EMD. The evaluated motor conditions include the normal condition, inner race and outer race faults of a motor bearing, and a broken rotor bar. In addition, different data imbalance ratios (from 2:1 to 16:1) between the normal and faulty datasets are considered. For data synthesis, both the generator and the discriminator of the GAN are formulated as fully connected networks, with the synthesized features serving as input to another fully connected network for motor condition diagnosis. It has been shown that the GAN-based method consistently outperformed the SMOTE-based approach in terms of fault recognition accuracy.

In Ref. [110], an auxiliary classifier GAN, or ACGAN, has been presented that incorporates the classification capability for the fault types directly into the discriminator, as shown in Figure 17. In this work, the vibration signal from the motor is used as the target for data synthesis. To analyze the temporal pattern embedded in the time series data, both the generator and the discriminator are constructed as stacked 1-D CNNs. Evaluated on a set of six different motor conditions, including normal, stator winding defect, unbalanced rotor, inner race bearing fault, broken rotor bar and bowed rotor, the GAN-based method has shown significant improvement in diagnosis accuracy for a dataset with a 2:1 imbalance ratio. A similar work has been reported by Wang et al., who investigated synthetic vibration signals for gearbox fault diagnosis [111].

Figure 17: Auxiliary classifier GAN, adapted from Ref. [110]

In addition to machine fault diagnosis, GAN has also been investigated for non-compliant tool condition detection. In Ref. [112], synthesis of wavelet time-frequency spectrums using GAN has been investigated for non-compliant tool detection in milling. Different from the previous works, in which the classifier is either constructed separately or incorporated into the discriminator, the generator of the GAN is inverted to perform non-compliance detection in this work, resulting in a 25% improvement in accuracy for a dataset with a 2:1 imbalance ratio.

Besides data balancing, the capability of image semantic annotation based on FCNs has also enabled process condition monitoring and anomaly detection that would otherwise require significant human intervention. One of the successful applications is AM. In Ref. [113], a comprehensive investigation of layer-wise anomaly detection and evaluation based on image semantic annotation has been reported for three AM technologies: laser fusion, binder jetting and electron beam fusion. A total of 12 common surface anomalies, such as spatter and recoater streaking, have been evaluated. The network structure of the developed annotation method is built upon U-Net, enhanced by multi-stream analysis of the image at multiple scales. Evaluation has shown that the developed method can be executed in real-time on image data with resolutions of up to 3672×5496 pixels, and that its anomaly ROI segmentation accuracy is superior to the previous state of the art. A similar work has been reported for over-extrusion detection in the fused filament fabrication process [114].

In addition to the AM processes, Wu et al. developed a solder joint annotation method based on mask RCNN, as shown in Figure 18, which locates, segments, and classifies solder joint regions simultaneously, a capability critical for quality assurance in printed circuit board (PCB) manufacturing [115]. Due to the limited number of training images for solder joints, transfer learning has also been investigated, in which a network pre-trained on the large-scale “common objects in context” dataset (by Microsoft) is transferred for the purpose of solder joint annotation. Four defective joint conditions are evaluated. The mask RCNN-based method has been shown to achieve 100% condition recognition accuracy and 97.4% ROI segmentation accuracy.

Figure 18
Mask RCNN for solder joint annotation and defect detection, adapted from Ref. [115]
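In practice, this kind of COCO-pretrained transfer learning is commonly realized with torchvision's Mask R-CNN by swapping its box and mask heads, as sketched below; the class count is an assumption, and the recipe is a generic one rather than the exact procedure of Ref. [115].

```python
# Generic transfer-learning recipe: start from a COCO-pretrained
# Mask R-CNN and replace its prediction heads for solder-joint classes.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 6  # assumed: background + normal + 4 defective conditions

# Backbone and heads pre-trained on the Microsoft COCO dataset
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the new class set
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)

# Replace the mask-prediction head likewise
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256,
                                                   NUM_CLASSES)

# Fine-tune on the small annotated solder-joint dataset as usual;
# only the newly created heads start from random weights.
```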

Recent developments in text annotation have also contributed to the field of fault diagnosis. In Ref. [89], a text semantic decomposition method based on BERT has been described that extracts fault-related information, such as equipment, fault, cause, and solution, directly from text documents. In addition to BERT as the pretrained backbone language model, the developed method features a stacked LSTM and a conditional random field (CRF) [116] as the task model. The whole network structure is shown in Figure 19. In experimental evaluation, the developed method has achieved state-of-the-art performance, outperforming the LSTM + CRF baseline in annotation accuracy for “equipment” (89.7% vs. 83.7%), “fault” (61.6% vs. 53.8%), “cause” (77.4% vs. 76.3%) and “solution” (44.2% vs. 36.7%). On the other hand, the absolute performance is still far from satisfactory: the accuracy of extracting “fault”, “cause” and “solution” information is significantly lower than that for “equipment”, indicating that NLP for fault diagnosis is still at an early stage of development and has a long way to go before reaching its full potential.

Figure 19
Flowchart for fault-related information extraction from text documents based on NLP [89] (the input, in Chinese, means “the drain valve does not match the pressure applied”)
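A compact sketch of such a BERT + LSTM + CRF tagger is given below, using the Hugging Face transformers library and the third-party pytorch-crf package; the tag scheme, hidden sizes and the choice of bert-base-chinese are assumptions rather than the configuration of Ref. [89].

```python
# Sketch of a BERT -> BiLSTM -> CRF sequence tagger for extracting
# fault-related entities (equipment / fault / cause / solution).
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

N_TAGS = 9  # assumed: BIO tags for 4 entity types + "O"

class BertBiLstmCrf(nn.Module):
    def __init__(self, bert_name="bert-base-chinese"):  # assumed backbone
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, 256,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(512, N_TAGS)   # per-token tag scores
        self.crf = CRF(N_TAGS, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids,
                      attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(h)[0])
        mask = attention_mask.bool()
        if tags is not None:   # training: negative CRF log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: tag paths
```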

Beyond data curation, research on improving the interpretability of DL-based diagnostic models has also been reported. Grezmak et al. investigated LRP to determine which regions of the wavelet time-frequency spectrum of the vibration signal contribute the most to motor fault diagnosis performance [117]. The diagnostic model is first constructed as a CNN, which takes the time-frequency spectrums as input and determines which of four conditions the corresponding motor belongs to: normal, bowed rotor, broken rotor bar or unbalanced rotor. Subsequently, LRP is applied to determine how the fault-related information embedded in the input is learned by the network to recognize different fault types. Figure 20 shows the corresponding flowchart of the developed method. Experimental evaluation confirms that the CNN learns to distinguish fault types through different frequency bands in the wavelet spectrums, with patterns that are consistent for the same motor condition and robust to the initial network weights used during training. A similar work has been reported for gearbox fault diagnosis based on frequency spectrums of the vibration signal [118]. In Ref. [119], different relevance-based DL interpretation methods are compared in the context of LCD panel inspection, in which LRP has been shown to produce the most informative relevance heatmap for defect detection.

Figure 20
Interpretable CNN using LRP for motor fault diagnosis, adapted from Ref. [117]
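The core relevance-redistribution step of LRP can be illustrated with the epsilon rule for a fully connected layer, as sketched below; a full implementation applies the analogous rule to convolution and pooling layers, and this generic sketch is not the specific code of Ref. [117].

```python
# Minimal sketch of the LRP epsilon-rule for one fully connected layer,
# illustrating how output relevance is redistributed onto the inputs.
import torch

def lrp_epsilon_linear(layer, a_in, relevance_out, eps=1e-6):
    """Redistribute the output relevance of `layer` onto its inputs.

    a_in:          activations entering the layer, shape (batch, n_in)
    relevance_out: relevance of the layer outputs, shape (batch, n_out)
    """
    z = a_in @ layer.weight.t() + layer.bias       # forward pre-activations
    sign = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    s = relevance_out / (z + eps * sign)           # stabilized ratio
    c = s @ layer.weight                           # back-projection
    return a_in * c                                # relevance of the inputs

# Starting from the logit of the predicted fault class and applying this
# rule layer by layer yields a heatmap over the wavelet time-frequency
# spectrum showing which regions drove the diagnosis.
```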

In addition to relevance analysis, the attention mechanism has been increasingly investigated to determine how input elements are associated by DL models with machine and process conditions. Li et al. designed an attention-incorporated network to determine the influence of different segments of bearing vibration time series on the decision-making process of fault recognition [120]. In this work, for each evaluated vibration segment, the context of the attention mechanism includes the segments within a time interval in its vicinity. Experimental evaluation has shown that the attention mechanism tends to assign large weights to segments that contain, or are located close to, fault-related impulses and smaller weights to the remaining regions, which is consistent with human reasoning. A similar work is reported in Ref. [121].
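A minimal sketch of such segment-level attention, with neighborhood context supplied by a small convolution over the segment axis, is shown below; the dimensions and scoring function are illustrative assumptions rather than the design of Ref. [120].

```python
# Sketch of segment-level attention over a vibration time series: each
# segment is scored against a context built from its neighbors, and the
# resulting weights reveal which segments drive the classification.
import torch
import torch.nn as nn

class SegmentAttention(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        # local context: each segment is mixed with its immediate neighbors
        self.context = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.score = nn.Linear(d_model, 1)

    def forward(self, seg_feats):         # (batch, n_segments, d_model)
        ctx = self.context(seg_feats.transpose(1, 2)).transpose(1, 2)
        w = torch.softmax(self.score(torch.tanh(ctx)).squeeze(-1), dim=1)
        fused = torch.einsum("bn,bnd->bd", w, seg_feats)
        return fused, w   # plot w to see which segments drove the decision
```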

Recently, new interpretable neural network structures have been reported. For example, Li et al. developed WaveletKernelNet [122], in which a continuous wavelet convolutional (CWConv) layer, as shown in Figure 21, is designed to replace the standard convolutional layer in the CNN to discover interpretable filters. By parameterizing each filter with a scaling and a translation parameter [123], the network is shown to learn highly customized wavelet filters from the raw time series signals, which are shown to be effective for bearing and gearbox fault diagnosis.

Figure 21
Operation of the CWConv layer [122]
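The CWConv idea can be sketched as a layer whose kernels are generated from learnable scale and translation parameters of a mother wavelet; the Morlet-like wavelet and all sizes below are assumptions for illustration, not the exact formulation of Ref. [122].

```python
# Sketch of a continuous wavelet convolution (CWConv) layer: each filter
# is a wavelet whose scale and translation are learnable, so the learned
# kernels remain physically interpretable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CWConv(nn.Module):
    def __init__(self, n_filters=16, kernel_size=64):
        super().__init__()
        self.kernel_size = kernel_size
        self.scale = nn.Parameter(torch.linspace(1.0, 10.0, n_filters))
        self.shift = nn.Parameter(torch.zeros(n_filters))

    def forward(self, x):                        # x: (batch, 1, length)
        t = torch.linspace(-1.0, 1.0, self.kernel_size, device=x.device)
        s = self.scale.abs() + 1e-3              # keep scales positive
        u = (t.unsqueeze(0) - self.shift.unsqueeze(1)) / s.unsqueeze(1)
        # Real Morlet-like mother wavelet: cos(5u) * exp(-u^2 / 2)
        kernels = torch.cos(5.0 * u) * torch.exp(-0.5 * u ** 2)
        kernels = kernels / s.sqrt().unsqueeze(1)  # energy normalization
        return F.conv1d(x, kernels.unsqueeze(1),
                        padding=self.kernel_size // 2)
```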

4.3 Prognosis of Remaining Useful Life

Prognosis aims at predicting the temporal evolution of machine performance from the current time into the future, and possibly until its functional failure. Accurate RUL prediction provides the technological basis for predictive maintenance and contributes directly to the reduction of unexpected downtime in manufacturing [7].

In general, DL-based prognosis consists of establishing a machine performance evolution model that parses the sequential pattern embedded in the performance evolution and forecasts its future progression based on the historical trajectory. The relevant DL models are usually trained with a set of run-to-failure sequences, which can be time-consuming to obtain. This limitation can be addressed by GAN-based run-to-failure data synthesis. Khan et al. investigated this approach for bearing degradation prognosis [124]. Bearing health degradation is represented as the evolution of the root mean square (RMS) value of the vibration signal over time. The degradation trajectories generated by the GAN have been shown to closely resemble the run-to-failure data collected from real experiments, providing the foundation for degradation model training.

Hou et al. developed an integrated method based on GAN and LSTM for RUL prediction [125]. The network structure is shown in Figure 22.

Figure 22
RUL prediction based on GAN and LSTM, adapted from Ref. [125]

Different from conventional GAN-based methods, the data synthesis function of the GAN is utilized in this work to improve the quality of the features extracted from the sequential data to support RUL prediction, rather than to generate more training samples. Specifically, the generator is constructed as an AE and is supervised by a 1-D CNN-based discriminator. In addition, the latent features extracted by the AE are associated with the RUL by an LSTM. As a result, the training process is guided by two objectives: improving the capability of data synthesis from the latent features (which is implicit) and the accuracy of RUL prediction based on these latent features. Once trained, the encoder in the generator and the LSTM are used directly to take an on-going sequence as input and predict its RUL. The authors demonstrated that the developed method reduces the RUL prediction error for aircraft engines by up to 15% compared to the previous state-of-the-art.
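The dual-objective training can be sketched as follows, with the encoder, decoder, discriminator and LSTM passed in as generic modules; the loss weighting and module interfaces are assumptions, not the exact formulation of Ref. [125].

```python
# Sketch of dual-objective training: an autoencoder generator is
# supervised adversarially by a 1-D CNN discriminator while an LSTM maps
# the latent features to RUL, so the latent space is shaped by both
# reconstruction realism and predictive accuracy.
import torch
import torch.nn as nn

def joint_step(encoder, decoder, discriminator, lstm_rul,
               x_seq, rul_target, opt, alpha=0.5,
               bce=nn.BCEWithLogitsLoss(), mse=nn.MSELoss()):
    """One update of the generator/LSTM branch on a batch of sequences."""
    z = encoder(x_seq)                   # latent features
    x_rec = decoder(z)                   # synthesized sequence
    # adversarial term: make the reconstruction look real to the critic
    adv = bce(discriminator(x_rec), torch.ones(x_seq.size(0), 1))
    # prognostic term: the latent features must predict RUL
    rul_pred = lstm_rul(z)
    loss = alpha * adv + (1 - alpha) * mse(rul_pred, rul_target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# At deployment only `encoder` and `lstm_rul` are kept: an on-going
# sequence is encoded and its RUL predicted directly from the latents.
```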

Similar to its application in diagnosis, the attention mechanism has also been increasingly investigated in RUL prognosis. The objective is to determine the importance of individual features from the sequential data, as well as the relevance of individual time steps in the past trajectory, to the machine’s RUL. Chen et al. developed an attention-based DL model that fuses the temporal features learned by an LSTM across different time steps with handcrafted features, such as the mean and the coefficients of a regression model fitted to the historical sequence [126]. The context of the attention mechanism at each time step is represented by the sequential evolution pattern up to that time step, as generated by the LSTM. The developed network structure is shown in Figure 23.

Figure 23
Attention-based RUL prediction, adapted from Ref. [126]

In the experimental evaluation for aircraft engine RUL prediction, the weights generated by the attention mechanism indicate that features from more recent time steps have a larger influence on the RUL prediction, with the importance decreasing for earlier time steps. This is consistent with the logic used by humans for prediction. An attention-based method has also been investigated for bearing RUL prediction [127]. In this work, an encoder-decoder structure based on a gated recurrent unit (GRU) network has been developed. The encoder first distills the essential information from the temporal features and stores it in hidden states. The attention-incorporated decoder then analyzes the hidden states and adaptively determines the information to be used for RUL prediction.
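A minimal sketch of temporal attention over LSTM hidden states fused with handcrafted statistics is given below, in the spirit of Ref. [126]; the feature dimensions and regression head are assumptions. Inspecting the returned weights on a trained model is what reveals the recency pattern described above.

```python
# Sketch of attention-based RUL prediction: LSTM hidden states from all
# time steps are fused by learned attention weights and concatenated
# with handcrafted trajectory statistics before regression.
import torch
import torch.nn as nn

class AttentionRUL(nn.Module):
    def __init__(self, n_feats=14, d_hidden=64, n_hand=4):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, d_hidden, batch_first=True)
        self.score = nn.Linear(d_hidden, 1)
        self.head = nn.Linear(d_hidden + n_hand, 1)

    def forward(self, x, hand_feats):
        """x: (batch, T, n_feats); hand_feats: (batch, n_hand), e.g.,
        the mean and regression coefficients of the past trajectory."""
        h, _ = self.lstm(x)                              # (batch, T, d_hidden)
        w = torch.softmax(self.score(h).squeeze(-1), 1)  # per-step weights
        ctx = torch.einsum("bt,btd->bd", w, h)           # attention fusion
        return self.head(torch.cat([ctx, hand_feats], 1)), w
```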

5 Conclusions and Future Work

While the convergence of big data, DL and computation has provided an unprecedented opportunity to advance the state-of-the-art in machine condition monitoring, fault diagnosis and RUL prognosis, uncertainty associated with data and the resulting low data quality, as well as the general “black-box” nature of DL algorithms, have posed significant challenges to the effectiveness and broad acceptance of DL-based methods in manufacturing. To improve data quality and promote user trust in DL, two closely related topics, data curation and model interpretability, have been comprehensively reviewed in this paper. Major techniques covered include: (1) data denoising, which utilizes both physical modeling and data-driven characterization for contamination removal; (2) data cleansing, which detects, corrects or removes outliers and missing values to ensure data completeness and validity; (3) data synthesis, which compensates for insufficient and imbalanced datasets to reduce bias in model learning; (4) semantic annotation, which provides condition-related contextualization of the sensing data; (5) relevance analysis, which quantifies the contribution of different inputs to the decision-making process of the neural network; (6) the attention mechanism, which captures dynamic process relationships for improved model interpretation and performance; and (7) integration of DL with physics, which ensures the consistency of DL findings with domain physical knowledge. To explain how these techniques are utilized in practical scenarios, typical manufacturing applications enabled by them are highlighted.

As research on DL-enabled manufacturing continues to accelerate, several topics that closely relate to data curation and model interpretation are summarized here, as recommendations for future study:

Uncertainty quantification. Uncertainty quantification is critical to ensuring the robustness of DL models. DL algorithms do not natively incorporate data uncertainty into the analysis, and little of the reported research on DL-enabled monitoring, fault diagnosis and RUL prognosis has discussed uncertainty quantification [128]. This makes it difficult to translate algorithms developed in academic laboratories into critical applications on the factory floor, where analysis and prediction results without uncertainty quantification cannot be considered realistic or trustworthy. Several uncertainty quantification techniques have been proposed recently for DL models, such as Bayesian deep learning [56, 57, 129, 130]; still, more rigorous and general approaches need to be developed.
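As one concrete example of an approximate Bayesian technique, Monte Carlo dropout attaches an uncertainty estimate to an existing network by keeping dropout active at inference time; the generic sketch below is illustrative, with the number of stochastic passes assumed.

```python
# Minimal Monte Carlo dropout sketch: repeated stochastic forward passes
# yield a mean prediction and an epistemic-uncertainty estimate for any
# network that contains dropout layers.
import torch

def mc_dropout_predict(model, x, n_samples=50):
    """Return the mean prediction and its spread over stochastic passes."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```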

Physics-informed learning. While researchers have started to explore the integration of neural networks with physics by incorporating relevant physical knowledge directly into DL models to ensure consistency between DL findings and physical laws [99, 101, 102], physics-informed learning is still in its infancy. Manufacturing is characterized by rich physical-domain knowledge that has been accumulated over the past century, yet most of this knowledge cannot be incorporated into the existing physics-informed learning framework. A broad, systematic approach is needed for transforming physical knowledge in its various forms into elements that can be recognized and operated on by DL algorithms.
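A generic illustration of the physics-informed idea is to augment the data-fit loss with the residual of a governing equation evaluated on the network output; the first-order degradation ODE dh/dt = -k*h used below is purely an illustrative stand-in for domain knowledge, not a model from the cited works.

```python
# Sketch of a physics-informed loss: data fit plus the residual of an
# assumed governing equation, computed via automatic differentiation.
import torch

def physics_informed_loss(model, t, y_obs, k=0.1, lam=1.0):
    """t: (N, 1) time points; y_obs: (N, 1) observed health index."""
    t = t.requires_grad_(True)
    h = model(t)                              # predicted health index h(t)
    data_term = torch.mean((h - y_obs) ** 2)  # fit to observations
    dh_dt = torch.autograd.grad(h.sum(), t, create_graph=True)[0]
    physics_term = torch.mean((dh_dt + k * h) ** 2)  # ODE residual
    return data_term + lam * physics_term
```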

Mitigating false discovery. One of the most compelling aspects of DL is the discovery of potential new knowledge, such as unknown associations between machine settings, process parameters and the resulting material characteristics of the product. A common limitation of current DL techniques is that they are generally not capable of controlling the false discovery rate (FDR), which leads to significant waste of resources in verifying the DL algorithms’ findings. Researchers have started to develop techniques that integrate analytical rigor into DL algorithms to control the FDR [131]. Successful adaptation of these techniques to DL-based analysis in manufacturing presents an exciting direction for future research.
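For reference, the classical Benjamini-Hochberg procedure, the kind of statistical safeguard on which such techniques build, is sketched below in its plain form; this is a textbook procedure, not the specific method of Ref. [131].

```python
# Classical Benjamini-Hochberg procedure for FDR control: reject the k
# smallest p-values, where k is the largest index i (1-based, sorted
# order) with p_(i) <= alpha * i / m.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of discoveries with FDR controlled at alpha."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True   # reject the k smallest p-values
    return mask
```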