SSMSPC: self-supervised multivariate statistical in-process control in discrete manufacturing processes

Biegel, Tobias; Helm, Patrick; Jourdan, Nicolas; Metternich, Joachim

doi:10.1007/s10845-023-02156-7


Open access
Published: 29 June 2023

Volume 35, pages 2671–2698, (2024)

Journal of Intelligent Manufacturing


Tobias Biegel ORCID: orcid.org/0000-0001-9227-2238¹,
Patrick Helm¹,
Nicolas Jourdan¹ &
…
Joachim Metternich¹


Abstract

Self-supervised learning has demonstrated state-of-the-art performance on various anomaly detection tasks. Learning effective representations by solving a supervised pretext task with pseudo-labels generated from unlabeled data provides a promising concept for industrial downstream tasks such as process monitoring. In this paper, we present SSMSPC, a novel approach for multivariate statistical in-process control (MSPC) based on self-supervised learning. Our motivation for SSMSPC is to leverage the potential of unsupervised representation learning by incorporating self-supervised learning into the general statistical process control (SPC) framework to develop a holistic approach for the detection and localization of anomalous process behavior in discrete manufacturing processes. We propose a pretext task called Location + Transformation prediction, where the objective is to classify both the type and the location of a randomly applied augmentation on a given time series input. In the downstream task, we follow the one-class classification setting and apply Hotelling’s $T^2$ statistic to the learned representations. We further propose an extension to the control chart view that combines metadata with the learned representations to visualize the anomalous time steps in the process data, which supports a machine operator in the root cause analysis. We evaluate the effectiveness of SSMSPC with two real-world CNC-milling datasets and show that it outperforms state-of-the-art anomaly detection approaches, achieving $100\%$ and $99.6\%$ AUROC, respectively. Lastly, we deploy SSMSPC on a CNC-milling machine to demonstrate its practical applicability when used as a process monitoring tool in a running process.


Introduction

SPC is a well-known concept for monitoring the condition of a process over time with the objective of detecting any anomalous process behavior that affects the performance of the process (Ferrer, 2007; Kourti & MacGregor, 1996; Zhang et al., 2015). Within the discrete manufacturing domain, the practical application of SPC is typically based on univariate measurements of predefined quality characteristics of manufactured parts that are sampled from the process at equidistant time intervals (Montgomery, 2009). The literature agrees that this univariate post-process SPC scheme is outdated, since it ignores the large amount of process data available in today’s data-rich manufacturing environments (Ferrer, 2014; Kourti & MacGregor, 1996; MacGregor, 1997; Woodall, 2017). Thus, in recent years, researchers in discrete manufacturing have advocated shifting this paradigm from univariate post-process SPC to MSPC, in which sensor data collected from the machining process are analyzed with machine learning (ML) methods to evaluate the process condition (Biegel et al., 2022a, b; Li et al., 2020; Qiu & Xie, 2021). Figure 1 visualizes this paradigm shift.

The general SPC framework consists of two distinct phases. Phase I represents the offline monitoring phase, where the control limits of the process are determined, based on historical normal process condition data. Phase II is the actual process monitoring phase, where new samples are drawn from the process and compared to the control limits to assess the process condition (Grasso et al., 2015; Qiu & Xie, 2021; Woodall, 2000; Woodall et al., 2004; Woodall & Montgomery, 1999). From an anomaly detection perspective, the SPC framework essentially describes a semi-supervised anomaly detection problem (Grasso et al., 2015; Wu et al., 2021; Xie & Peihua, 2022). Semi-supervised anomaly detection assumes that the training data incorporate only normal samples. The task is to model the normal behavior of the data and flag samples that strongly deviate from this state as anomalies (Chandola et al., 2009; Gu et al., 2019; Kumagai et al., 2019; Ruff et al., 2020; Shen et al., 2021; Tax & Duin, 2004; Ye et al., 2021). Due to this analogy, we treat MSPC in general and SSMSPC in particular as a semi-supervised anomaly detection problem.

It is worth noting that some researchers refer to the semi-supervised anomaly detection problem as unsupervised anomaly detection, see, e.g., Bergmann et al. (2019), Dehaene et al. (2020), and Zhang et al. (2019). However, unsupervised anomaly detection typically refers to a problem setting in which most but not all data in the training set are assumed to belong to the normal class (Bergman & Hoshen, 2020; Dai & Chen, 2022; Goyal et al., 2020; Pang et al., 2021).

Research in anomaly detection is extensive, with many papers published in recent years that focus on both shallow learning and deep learning approaches, see, e.g., Chalapathy and Chawla (2019), Chandola et al. (2009), Gupta et al. (2014), and Pang et al. (2021) for excellent reviews. Ruff et al. (2021) developed a unifying view for shallow and deep anomaly detection approaches in which they identify four main categories to which these methods can be assigned: (1) one-class classification, (2) probabilistic models, (3) reconstruction models and (4) distance-based methods.

Recently, anomaly detection methods that rely on self-supervised learning have shown outstanding performance on various benchmark tasks. Self-supervised learning is a form of unsupervised learning which aims to learn effective representations for real-world downstream tasks from unlabeled data by solving a supervised pretext task with automatically generated pseudo-labels, e.g., solving jigsaw puzzles or predicting image rotations (Doersch et al., 2015; Gidaris et al., 2018; Jing & Tian, 2020; Noroozi & Favaro, 2016; Noroozi et al., 2018). Anomaly detection methods based on self-supervised learning derive their anomaly score either directly from the pretext task or by using the learned representations in the downstream anomaly detection task (Qiu et al., 2021; Sohn et al., 2021). Thus, defining suitable pretext tasks is a vital component in self-supervised learning approaches (Li et al., 2021).

In this paper, we present SSMSPC, a novel approach for MSPC based on self-supervised learning to detect and localize anomalous process behavior in discrete manufacturing processes. We propose a pretext task that we refer to as Location + Transformation prediction. Given a time series input that has been augmented by one of k predefined augmentation functions in one of p equally sized windows, the objective in this pretext task is to classify both the augmentation and the corresponding window in a multi-task fashion. In the downstream task, we follow the conventional one-class classification setting and compute Hotelling’s $T^2$ statistic as the anomaly score, based on the learned representations of the pretext task. The control limits are fitted with Kernel Density Estimation (KDE). In addition, we propose an extension to the traditional control chart view that combines metadata with the learned representations to (1) segment the process data into the individual process steps and (2) highlight the anomalous time steps, which supports a machine operator in the root cause analysis.

To summarize, the contribution of this paper is threefold:

We propose SSMSPC, a novel approach for MSPC based on self-supervised learning to detect and localize anomalous process behavior in discrete manufacturing processes.
We present a pretext task called Location + Transformation prediction for learning effective representations, where the objective is to classify both the type and the location of the augmentation based on a given randomly augmented time series input.
We introduce an extension to the conventional control chart view to facilitate the identification of the root cause by segmenting a raw time series signal into the individual process steps using metadata and highlighting the anomalous components.

The remainder of this paper is structured as follows: “Related work” section presents the related work with respect to recent developments in self-supervised anomaly detection and applications of in-process monitoring in continuous processes as well as discrete manufacturing processes. “Problem statement” section provides a comprehensive description of the general problem statement that we consider for the application of SSMSPC. In “SSMSPC” section, we introduce the individual components of SSMSPC. This includes a detailed explanation of the applied framework, the proposed pretext task, the subsequent downstream task and the control chart extension. “Experiments” section presents the experiments based on two real-world CNC-milling datasets. We compare SSMSPC with state-of-the-art anomaly detection baselines and conduct a comprehensive ablation study. Our contribution ends with the conclusion and an outlook for future research.

Related work

Self-supervised anomaly detection

The amount of research related to self-supervised anomaly detection has grown rapidly over recent years. Golan and El-Yaniv (2018) presented Geometric Transformations (GeoTrans) for anomaly detection in images. The authors designed a pretext task, where a multiclass neural classifier is trained to discriminate between geometric transformations that have been applied to normal images. The detection of anomalous images is accomplished by evaluating the softmax activations of the model when transformed images are used as an input. In a paper by Hendrycks et al. (2019), the authors combine rotation prediction (Gidaris et al., 2018) with geometric transformation prediction in their pretext task and are able to outperform a purely supervised approach based on outlier exposure (Hendrycks et al., 2019) to detect anomalies in images. Bergman and Hoshen (2020) present GOAD, a self-supervised anomaly detection approach for general data, i.e., images, tabular data etc. They extend the class of transformation functions in the pretext task to include random affine transformations to generalize to non-image data. Tack et al. (2020) propose contrasting shifted instances (CSI), which is based on the conventional contrastive learning scheme for learning visual representations. They introduce a new training method, in which a given sample is contrasted with distributionally-shifted augmentations of itself as well as other instances. By incorporating this into a new detection score, they achieve strong performance on state-of-the-art image anomaly detection tasks. Shen et al. (2020) propose THOC, a temporal hierarchical one-class network for time series anomaly detection. The authors use a dilated recurrent neural network with skip connections, and apply multiple hyperspheres obtained from a hierarchical clustering process to develop their one-class classification objective. They incorporate self-supervision by using a pretext task for multi-step-ahead prediction. Qiu et al. (2021) propose neural transformation learning for anomaly detection (NeuTraL AD) for data types beyond images. Their approach consists of a fixed set of learnable transformations and an encoder that are trained jointly on a deterministic contrastive loss (DCL) that is also used to score new samples at test time. Sohn et al. (2021) present a two-stage framework for deep one-class classification. They learn self-supervised representations from one-class data and then build a conventional one-class classifier on top of the learned representations. The authors present a thorough analysis of different self-supervised representation learning algorithms under their proposed framework. Li et al. (2021) introduce CutPaste, a simple augmentation strategy that cuts an image patch and pastes it at a random location. Their pretext task involves detecting whether an image has been augmented with CutPaste. They follow the two-stage framework proposed by Sohn et al. (2021). Fu and Xue (2022) introduce MAD, a self-supervised learning task for time series anomaly detection. The objective of their pretext task is to predict the values of randomly masked samples of a time series input. In a paper by Shenkar et al. (2022), the authors present a novel contrastive learning approach for anomaly detection in tabular data. Given a data sample with masked features, the proposed learning approach is based on the assumption that the remaining features can be used to identify the masked ones. Wang et al. (2023) propose COCA, a negative-sample-free contrastive one-class anomaly detection method for time series data. The authors apply jittering and scaling augmentations to expand the number of training samples. They consider the representation in the latent space and the reconstructed representation of a Seq2Seq model as positive pairs.

In-process monitoring applications

Continuous processes

Process monitoring based on MSPC and ML has seen wide application in industrial processes. Most of these applications originated in the context of continuous processes, such as chemical, petrochemical or polymer processes. The most basic approaches rely on Hotelling’s $T^2$ and Squared Prediction Error (SPE) monitoring statistics that are computed based on Principal Component Analysis (PCA) or Partial Least Squares (PLS) (Ge & Song, 2013; Qin, 2003, 2012; Wang et al., 2018). Throughout the years, many shallow learning approaches have been presented and extensively reviewed, see, e.g., Yin et al. (2014), Alauddin et al. (2018), and Qin and Chiang (2019). Recently, research with respect to deep learning applications has been very active, with numerous articles being published every year (Yu & Zhang, 2023).

Yu et al. (2021) use a convolutional long short-term memory autoencoder (CLSTM-AE) for process monitoring and compute Hotelling’s $T^2$ and SPE monitoring statistics in the latent space and residual space, respectively. They validate their approach using two industrial benchmark processes, namely the Tennessee Eastman process (TEP) and the continuous stirred tank reactor (CSTR) and find it to outperform conventional approaches such as PCA and LSTM-AE. In another paper by Yu and Zhang (2020), the authors propose a manifold regularized stacked AE (MRSAE) that relies on Hotelling’s $T^2$ and SPE as monitoring statistics. Their suggested approach outperforms other deep learning approaches on the TEP and the Fed-Batch fermentation penicillin process (FBFP) benchmark. In a paper by Cheng et al. (2019), the authors present a novel monitoring approach based on a variational recurrent AE (VRAE). They use the negative variational scores as the monitoring statistic and fit the control limit with KDE. The approach is validated using the TEP. Kong and Yan (2020) introduce the inner product-based stacked AE (IPSAE), which adds the inner product between the outputs of the neurons to the loss function for regularization purposes. They compute monitoring statistics in the feature space and the residual space and fit the control limit via KDE. For evaluation purposes, the TEP is used and the approach is compared to PCA and stacked AE. In Tang et al. (2020), the authors combine Gaussian mixture models with a variational AE to monitor nonlinear processes with multiple operating modes. They construct latent variable and reconstruction variable indices as monitoring statistics. The authors validate their approach on the TEP and a hot strip mill process. Zhang et al. (2021) introduce a hybrid deep learning model based on a 1D-CNN and a stacked denoising AE. They demonstrate the effectiveness of their approach using the TEP, FBFP and a real-world industrial process for conveyor belts. Li et al. (2022) present a slow feature analysis-aided AE (SFA-AE) which leverages the high-level features extracted by the AE to learn deep slow variation patterns. With these patterns, they compute monitoring statistics based on Hotelling’s $T^2$ and SPE. In addition, they incorporate a self-attention mechanism to identify the anomalous process variables in a contribution plot. Liu et al. (2022) introduce a novel stacked multimanifold AE (S-MMAE) to predict and monitor key quality variables in industrial processes. The authors validate their approach using a real-world hydrocracking process. In Ai et al. (2023), the authors present KD-SCL, a novel industrial process monitoring framework based on knowledge distillation and contrastive learning. They rely on memory queue-based negative sample augmentation and hard negative sampling mechanisms to support the selection of negative samples for contrastive learning. The approach is validated using data from a lead-zinc flotation plant. Lu et al. (2023) introduce a cascaded bagging-PCA and CNN classification network. They define a self-supervised pretext task by trying to discriminate between the reconstructions of bagged and conventional PCA. The approach is validated using the TEP. Li et al. (2023) propose a self-supervised learning framework based on multisource heterogeneous contrastive learning. They employ a two-stage framework, in which the self-supervised feature learning phase is followed by a supervised fine-tuning step. The authors show the effectiveness of their approach using data collected from a heavy-plate production process.

Discrete manufacturing processes

Regarding process monitoring in the context of discrete manufacturing processes, most of the existing research focuses on shallow or deep anomaly detection approaches that rely on some kind of reconstruction-based method, such as the AE. Biegel et al. (2022b) propose a novel approach that uses text data from machine operators to efficiently label the normal process condition data retrieved from a real-world CNC-milling process. Based on these data, they fit a simple PCA-based model and monitor the process with Hotelling’s $T^2$ and Squared Prediction Error (SPE) control charts. In another paper by Biegel et al. (2022a), the authors investigate the application of deep AE-based monitoring approaches and experiment with the reconstruction error and latent representation of the input data to compute Hotelling’s $T^2$ and SPE monitoring statistics. They use a real-world sheet metal forming process for their evaluation. In Lindemann et al. (2019), the authors present two data-driven self-learning approaches relying on k-means and LSTM-AE that are used to detect anomalies within a massive forming process. Lindemann et al. (2020) introduce a novel approach for anomaly detection in discrete manufacturing processes based on LSTM-AE that is evaluated on a multi-step forging process. Proteau et al. (2020) use a variational AE to monitor the condition of a CNC-milling process from the aerospace industry. Hahn and Mechefske (2021) present a disentangled variational AE with a temporal CNN to monitor tool wear in a self-supervised manner. They validate their approach using data from a CNC-milling process of small ball-valves. The authors treat the general AE scheme as a self-supervised learning objective. Ahmad et al. (2020) develop a hybrid model based on deep learning and SPC to monitor manufacturing processes in the presence of image or video data. They apply a fast region-based CNN to derive statistical features that are then plotted in an exponentially weighted moving average (EWMA) control chart. However, the authors validate their approach using only a simulated video. Lorenti et al. (2022) present CUAD-Mo, an approach for anomaly detection in machine operations that is based on Isolation Forest (IF) (Liu et al., 2008). The approach is validated in a CNC-milling process. Oshida et al. (2023) propose a stacked LSTM encoder-decoder model for anomaly detection. Their approach is evaluated on a real-world turning process for Inconel 718 to detect tool wear based on acoustic emission signals. In Sun et al. (2023), the authors introduce an AE-based semi-supervised anomaly detection method for cutting tools in machining processes. They validate the proposed method on an experimental and a public cutting tool wear dataset.

The related work presented here demonstrates that the incorporation of self-supervised anomaly detection methods in process monitoring schemes for both continuous processes and discrete manufacturing processes is still rare and has not yet gained momentum. However, we expect that this line of research is going to grow rapidly in the near future.

Problem statement

In this paper, we address the semi-supervised anomaly detection problem in the context of discrete manufacturing processes. Specifically, we consider the problem of detecting and localizing anomalous process behavior using an ML model trained exclusively on data that correspond to a normal process condition. The reason for restricting ourselves to this problem setting is twofold: First, as pointed out in “Introduction” section, the assumptions made in the general SPC framework correspond to a semi-supervised anomaly detection problem. Second, especially in discrete manufacturing, anomalies typically account for rare data instances while normal data are easier to obtain and generally represent the majority of available data (Chalapathy & Chawla, 2019; Pang et al., 2021).

When we speak of a discrete manufacturing process, we assume a machining process in which, (1) high-frequency process data, e.g., vibrations, and cutting forces are recorded throughout the processing cycle of a part in the form of a multivariate time series and (2) the process data are associated with the final quality characteristics of the produced part. Here, the second condition is of crucial importance to ensure that the monitoring scheme is in alignment with the economic objective of the process, which is to produce parts that meet the quality specifications. Thus, if an anomalous process condition is signaled, it should correspond to an increased likelihood of having produced a part that is out of specification.

With the preceding explanations, we can formalize the problem statement as follows: Let $\mathcal {D} \subset \mathbb {R}^{n\times m}$ be the set of all possible process data of a machining process, where n corresponds to the number of time steps in a processing cycle and m refers to the number of features, i.e., sensors used to record the process data. Each element $\textbf{x}\in \mathcal {D}$ thus represents a manufactured part in the form of a multivariate time series. Let further $X \subset \mathcal {D}$ represent the set of all normal process condition data.

Our overall objective is to provide a machine operator with the information whether the process data $\textbf{x}$ for a produced part correspond to a normal process condition. To achieve this, we wish to learn a mapping $h: \mathcal {D} \rightarrow \mathbb {R}$, where larger values of $h(\textbf{x})$ indicate an increased probability that $\textbf{x} \not \in X$, i.e., $\textbf{x}$ is likely to represent an anomalous process condition. Based on h, we require a threshold mapping $b_{\text {ucl}}: \mathbb {R} \rightarrow \{0,1\}$ such that

$$(b_{\text{ucl}} \circ h)(\textbf{x}) = \begin{cases} 1, & \text{if } h(\textbf{x}) > \text{ucl}\\ 0, & \text{if } h(\textbf{x}) \le \text{ucl} \end{cases} \tag{1}$$

where ucl is the corresponding threshold, i.e., the upper control limit. Thus, process samples exceeding the ucl are flagged as anomalies.
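The composition in Eq. (1) can be illustrated with a minimal Python sketch. The names `make_monitor` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the learned mapping h, not the score used by SSMSPC.

```python
from typing import Callable, List

Series = List[List[float]]  # one part: n time steps x m sensors


def make_monitor(h: Callable[[Series], float], ucl: float) -> Callable[[Series], int]:
    """Compose the threshold b_ucl with the score h, as in Eq. (1)."""
    def monitor(x: Series) -> int:
        # 1 flags an anomalous process condition, 0 a normal one.
        return 1 if h(x) > ucl else 0
    return monitor


def toy_score(x: Series) -> float:
    """Hypothetical stand-in for h: the largest absolute sensor reading."""
    return max(abs(value) for row in x for value in row)


monitor = make_monitor(toy_score, ucl=2.0)
normal_part = [[0.1, -0.3], [0.4, 0.2]]
faulty_part = [[0.1, -0.3], [5.0, 0.2]]
print(monitor(normal_part), monitor(faulty_part))  # prints "0 1"
```

A score exactly equal to ucl is treated as normal, which matches the non-strict inequality in the second case of Eq. (1).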

Once an anomalous process condition has been detected, the ML model shall support a machine operator in the root cause analysis by locating the anomalous process condition in the time series. See Fig. 2 for a visualization of the problem statement considered in this work.

Note that SSMSPC can be applied to any discrete manufacturing process that is in accordance with this problem statement.

SSMSPC

In this section, we present SSMSPC. “Framework” section highlights the key components of the framework that constitutes our approach. “Pretext task” section dives into the details of the proposed self-supervised pretext task. “Downstream task” section explains how the learned representations of the pretext task are used to build a one-class classifier based on Hotelling’s $T^2$ statistic (Hotelling, 1947) in the corresponding downstream task. Finally, in “Control chart extension” section, we demonstrate how the results of our approach can be visually interpreted by a machine operator to help identify the root cause for an anomalous process condition by highlighting the respective anomalous sections in the raw time series signal.

Framework

We follow the two-stage framework introduced by Sohn et al. (2021). The rationale for relying on this two-stage framework is twofold. First, self-supervised learning is inherently a two-stage process, where the first stage corresponds to solving the pretext task and the second stage embodies the downstream task (Jing & Tian, 2020; Noroozi et al., 2018). This framework thus represents a natural choice for self-supervised anomaly detection methods. Second, the selected framework has demonstrated its effectiveness over end-to-end approaches in previous works, see, e.g., Li et al. (2021). Figure 3 illustrates the aforementioned framework.

In the first stage, self-supervised learning is used to learn meaningful representations from normal process condition data with the help of a predefined pretext task. An encoder network f transforms the input data $\textbf{x}$ into a latent representation $f(\textbf{x})$, which is then fed through a prediction head g, usually a simple multi-layer perceptron (MLP) with a softmax output for the respective classification task. The model is trained end-to-end on the respective pretext task using backpropagation.

The second stage corresponds to the downstream task where a one-class classifier is constructed on top of the learned representations of the pretext task. For this stage, it is common practice to discard the prediction head g, and only use the pretrained encoder network f as a feature extractor, since it has been shown that the learned representations of the layer right before g provide better representations (Ermolov et al., 2021; Chen et al., 2020).

It is worth mentioning that, with respect to the general SPC framework, both the pretext task and the downstream task belong to the offline monitoring phase, i.e., phase I, as they are used to fit the control limits for subsequent phase II monitoring.
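To make the split between the two phases concrete, the following Python sketch fits an upper control limit on phase I anomaly scores with a Gaussian KDE and applies it in phase II. It is a sketch under assumptions: the bandwidth rule, the false-alarm rate `alpha`, the bisection search and all function names are illustrative choices, not details taken from the paper.

```python
import math
import statistics
from typing import List


def bandwidth(scores: List[float]) -> float:
    """Normal-reference rule of thumb (an assumption, not taken from the paper)."""
    return 1.06 * statistics.stdev(scores) * len(scores) ** (-0.2)


def kde_cdf(t: float, scores: List[float], bw: float) -> float:
    """CDF of a Gaussian KDE fitted to `scores`, evaluated at t."""
    z = bw * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((t - s) / z)) for s in scores) / len(scores)


def fit_ucl(scores: List[float], alpha: float = 0.01) -> float:
    """Phase I: solve kde_cdf(ucl) = 1 - alpha by bisection."""
    bw = bandwidth(scores)
    lo, hi = min(scores), max(scores) + 10.0 * bw
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, scores, bw) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def phase_two(new_scores: List[float], ucl: float) -> List[int]:
    """Phase II: flag every new sample whose score exceeds the control limit."""
    return [1 if s > ucl else 0 for s in new_scores]


phase_one_scores = [1.0 + 0.1 * i for i in range(50)]  # toy normal-condition scores
ucl = fit_ucl(phase_one_scores, alpha=0.01)
flags = phase_two([2.0, 50.0], ucl)
```

The sketch assumes that the phase I scores are not all identical, since the bandwidth would otherwise be zero.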

Pretext task

In this work, we propose a pretext task that we refer to as Location + Transformation prediction, which is specifically designed to be applied in the setting described in “Problem statement” section.

The core intuition behind our pretext task is (1) to learn which augmentation was applied to a given time series input, and (2) to learn where the augmentation occurred in the time series by predicting the respective augmentation window. By training a model with this pretext task, we show that the learned representations are highly effective for monitoring discrete manufacturing processes. Figure 4 provides a visualization of the individual components that constitute the proposed Location + Transformation prediction pretext task.

Assuming a training set $X_{\text {train}} \subset X$, we apply a set $\mathcal {T}:= \{T_1, \ldots , T_k \mid T_i: \mathbb {R}^{n \times m} \rightarrow \mathbb {R}^{n \times m} \}$ of k different augmentation functions. Each $T_i$ augments the entire set of available normal process condition data $X_{\text {train}}$. For every $\textbf{x} \in X_{\text {train}}$, the augmentation is applied in a randomly chosen window $W_j$ for all m sensors using a set of p predefined windows $\{W_1, \ldots , W_p\}$. The augmented time series data $T_{i}(\textbf{x})$ are subsequently transformed using the continuous wavelet transformation (CWT) to retrieve the respective scalogram representation $\text {CWT} \circ T_{i}(\textbf{x}) \in \mathbb {R}^{s\times n\times m}$, where s represents the number of different scales of the chosen wavelet. Hence, by transforming the normal process condition data in this way, we retrieve a representation that can be interpreted as an m-channel image. Following this routine, the original dataset size is increased by a factor of k. Algorithm 1 summarizes the general augmentation procedure.
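The bookkeeping of this procedure can be sketched in Python as follows. This is not Algorithm 1 itself: the CWT step is omitted, `flip_sign` is a hypothetical stand-in augmentation, and using index p as the "no window" label for the identity is an assumption about the label encoding.

```python
import random
from typing import Callable, List, Tuple

Series = List[List[float]]   # n time steps x m sensors
Window = Tuple[int, int]     # half-open index range [start, end)
Augmentation = Callable[[Series, Window, random.Random], Series]


def make_windows(n: int, p: int) -> List[Window]:
    """Split n time steps into p equally sized windows (assumes p divides n)."""
    size = n // p
    return [(j * size, (j + 1) * size) for j in range(p)]


def identity(x: Series, window: Window, rng: random.Random) -> Series:
    """T_1: return an unchanged copy of the series."""
    return [row[:] for row in x]


def flip_sign(x: Series, window: Window, rng: random.Random) -> Series:
    """Hypothetical stand-in augmentation: negate all sensors inside the window."""
    start, end = window
    return [[-v for v in row] if start <= t < end else row[:]
            for t, row in enumerate(x)]


def augment_dataset(x_train: List[Series], augmentations: List[Augmentation],
                    p: int, rng: random.Random, identity_index: int = 0):
    """Apply every T_i to every series; return (series, trans label, loc label)."""
    samples = []
    for i, t_i in enumerate(augmentations):
        for x in x_train:
            windows = make_windows(len(x), p)
            j = rng.randrange(p)
            # Assumption: label p encodes "no window" for the identity class.
            loc_label = p if i == identity_index else j
            samples.append((t_i(x, windows[j], rng), i, loc_label))
    return samples
```

With k augmentation functions, the returned list holds k times as many samples as the training set, which mirrors the growth factor stated above.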

Once the data preparation is complete, the augmented dataset is used as input for the subsequent Location + Transformation prediction task. The scalograms are fed into a simple LeNet-type encoder network f (Lecun et al., 1998) to which two prediction heads $g_{\text {trans}}$ and $g_{\text {loc}}$ are attached for Transformation prediction and Location prediction, respectively. The prediction heads are represented by two equivalent MLPs differing only in the size of the softmax output layer. Here, $g_{\text {trans}}$ has a softmax output layer of size k to predict the applied augmentation, whereas $g_{\text {loc}}$ has a softmax output layer of size $p+1$ to predict the window in which the augmentation occurred. Note that the output size of $g_{\text {loc}}$ is $p+1$ in order to account for the case when no augmentation was applied, and thus no window was chosen.
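A possible training objective for the two heads is sketched below in Python. The text only states that both predictions are learned in a multi-task fashion, so the equally weighted sum of two cross-entropy terms is an assumption, as are the function names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def multitask_loss(trans_logits: List[float], loc_logits: List[float],
                   trans_label: int, loc_label: int) -> float:
    """Sum of both classification losses (equal weighting is an assumption)."""
    return (cross_entropy(softmax(trans_logits), trans_label)
            + cross_entropy(softmax(loc_logits), loc_label))


k, p = 4, 3
trans_logits = [0.0] * k        # output of g_trans: one logit per augmentation
loc_logits = [0.0] * (p + 1)    # output of g_loc: p windows plus "no window"
loss = multitask_loss(trans_logits, loc_logits, trans_label=1, loc_label=p)
```

For an untrained model with uniform outputs, the loss equals $\log k + \log (p+1)$, which gives a simple sanity check for the head sizes.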

In the following, we provide further details for the individual components of the proposed pretext task.

Data augmentations

We propose a total of $k=4$ augmentations that proved most useful for learning good representations in our experiments.

Identity In alignment with other recent work in the field of self-supervised anomaly detection, such as Tack et al. (2020) and Golan and El-Yaniv (2018), the first augmentation that we propose is simply the identity, i.e., $T_1(\textbf{x}) = \textbf{x}$. Including the identity function means that the normal process condition data $X_{\text {train}}$ form one of the four classes. Since the model is exposed to the original data $X_{\text {train}}$ during training, it needs to be able to distinguish the normal process condition from artificial anomalous data to perform well on the pretext task.

CutPaste For the second augmentation class, we draw inspiration from the work of Li et al. (2021) and as such refer to it as CutPaste. However, in contrast to the authors, who proposed CutPaste to augment images, we transfer the idea of CutPaste to the time series representation and adapt it for the application in Location + Transformation prediction.

Figure 5 provides an exemplary illustration of how a time series is augmented using CutPaste. First, a random cutting window $W_c$ is chosen from $\{W_1, \ldots , W_p\}$. Within the bounds of $W_c$, two points are randomly selected, marking the start and end point of the cutting segment. Next, a pasting window $W_j$ is randomly selected from the entire set of windows. The cutting segment is then pasted to a random location within $W_j$. Note that the described augmentation is done for each of the m features, i.e., sensors of $\textbf{x}$. Algorithm 2 provides further details on how the CutPaste augmentation works.
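The steps described above can be sketched in Python as follows. This is not Algorithm 2 itself: the sketch assumes that the same cut and paste positions are used for all sensors, that the segment must fit inside the pasting window, and that windows are half-open index ranges of equal size.

```python
import random
from typing import List, Tuple

Series = List[List[float]]   # n time steps x m sensors
Window = Tuple[int, int]     # half-open index range [start, end)


def cut_paste(x: Series, windows: List[Window], rng: random.Random):
    """Copy a random segment from a cutting window into a pasting window.

    Returns the augmented copy and the index j of the pasting window.
    """
    c = rng.randrange(len(windows))          # cutting window W_c
    j = rng.randrange(len(windows))          # pasting window W_j
    c_start, c_end = windows[c]
    # Two distinct boundary points inside W_c give the segment x[a:b].
    a, b = sorted(rng.sample(range(c_start, c_end + 1), 2))
    length = b - a
    j_start, j_end = windows[j]
    # Assumption: the pasted segment must lie completely inside W_j.
    paste_at = rng.randint(j_start, j_end - length)
    out = [row[:] for row in x]
    for offset in range(length):
        # Read from the original x so overlapping cut and paste ranges are safe.
        out[paste_at + offset] = x[a + offset][:]
    return out, j


windows = [(0, 4), (4, 8), (8, 12)]
x = [[float(t), float(10 * t)] for t in range(12)]
augmented, paste_window = cut_paste(x, windows, random.Random(0))
```

Returning the index of the pasting window makes it available as the location label for the pretext task; treating the pasting window rather than the cutting window as the label is an assumption of this sketch.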

MeanShift The next augmentation that we found useful to learn effective representations is referred to as MeanShift. For each sensor in $\textbf{x}$, this simple augmentation strategy shifts a pre-selected time series segment by the mean of the respective time series. Figure 6 shows how the proposed MeanShift augmentation works. The first step consists of selecting a random window $W_j$ from the set of windows $\{W_1, \ldots , W_p\}$. Similar to CutPaste, two points are then randomly chosen within the bounds of $W_j$ that mark the start and end point of the segment to be augmented. For each sensor, the mean of the respective time series is then computed and added to the selected segment in $W_j$. Algorithm 3 summarizes the idea of the MeanShift augmentation.

MissingSignal The last proposed augmentation is referred to as MissingSignal and resembles, as the name suggests, a missing sensor signal. Figure 7 visualizes the MissingSignal augmentation. The routine for selecting the window and the respective segment is equivalent to the presented MeanShift augmentation. However, as opposed to the other presented augmentations, the selected segment in the time series is replaced by a constant value that is equal to the first value of the original segment. Algorithm 4 provides further details on the proposed MissingSignal augmentation.
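MeanShift and MissingSignal share the same window and segment selection, so both can be sketched with one helper. The code below is a hedged illustration rather than a transcription of Algorithms 3 and 4; the helper `random_segment`, the equal-width windows, and the shared segment indices across sensors are assumptions.

```python
import numpy as np

def random_segment(n, p, rng):
    """Pick a random window W_j and a random segment [start, end) inside it."""
    assert n % p == 0, "sketch assumes equal-width windows"
    width = n // p
    j = rng.integers(p)
    lo = j * width
    start, end = np.sort(rng.choice(np.arange(lo, lo + width + 1), size=2, replace=False))
    return j, start, end

def mean_shift(x, p, rng):
    """Add each sensor's mean over the whole series to the selected segment."""
    j, start, end = random_segment(x.shape[0], p, rng)
    out = x.astype(float)                     # astype returns a copy
    out[start:end, :] += x.mean(axis=0)       # per-sensor mean, shape (m,)
    return out, j

def missing_signal(x, p, rng):
    """Replace the selected segment by the first value of the original segment."""
    j, start, end = random_segment(x.shape[0], p, rng)
    out = x.astype(float)
    out[start:end, :] = x[start, :]           # constant value per sensor
    return out, j
```

Both functions return the window index j alongside the augmented series, which is the information needed to build the pseudo-label for the location head.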

It is worth mentioning that the proposed CutPaste, MeanShift and MissingSignal augmentations were designed to approximate so-called contextual anomalies, since they represent a common type of anomalies present in time series data (Aggarwal, 2017; Chandola et al., 2009).

We tested more augmentations than the ones presented here, especially with respect to other anomaly classes, such as point outliers. However, as we will show in our ablation study in “Ablation study” section, these augmentations led to a deterioration in performance. Thus, we consider the augmentations presented here as the basis for SSMSPC. Nevertheless, we would like to stress that, depending on the scenario at hand, other augmentations might be a useful addition to the ones presented here.

Continuous wavelet transformation

CWT is a powerful mathematical tool to transform a 1D signal from the time domain into a 2D representation, also called scalogram, in the time-frequency domain. This 2D representation can be interpreted as a single-channel image and thus allows state-of-the-art CNN architectures to be applied to time series data. The CWT of a signal s(t) is defined as follows:

$$\begin{aligned} \text {CWT}(s(t);a,\tau ) = \frac{1}{\sqrt{a}} \int _{-\infty }^{\infty } s(t)\psi ^{*}\left( \frac{t-\tau }{a}\right) dt, \end{aligned}$$

(2)

where $a > 0$ corresponds to the scaling parameter, $\tau $ represents the translational value and $\psi ^*$ is the complex conjugate of the selected base wavelet, also called mother wavelet. The mother wavelet is scaled and shifted in time across the respective signal in order to compute the so-called wavelet coefficients that represent the similarity between the wavelet and the respective signal. Depending on the signal at hand, there are plenty of different types of wavelets that can be selected as the mother wavelet. A mother wavelet frequently used is the Morlet wavelet developed by Grossmann and Morlet (1984).

As opposed to other commonly used techniques for time-frequency analysis, such as Short-Time-Fourier-Transformation (STFT), CWT is particularly useful in the analysis of non-stationary signals like those found in manufacturing processes (Gao & Yan, 2011). The reason for this is that CWT allows variable window sizes with the help of the scaling parameter a to analyze different frequency components of a signal. In recent publications, researchers use CWT due to its advantages over conventional STFT for process monitoring applications such as chatter detection, grinding burn recognition, etc. (Hübner et al., 2020; Liao et al., 2021; Tran et al., 2020; Tran & Lundgren, 2020).

As shown in Fig. 4, we apply the CWT for each feature of the time series input separately and stack the resulting scalograms on top of each other to receive a tensor that is then interpreted as an m-channel image and fed to the LeNet-type encoder network.
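To make Eq. 2 and the per-feature stacking concrete, the following sketch discretizes the integral as a Riemann sum with a complex Morlet mother wavelet. The wavelet parameter `w0`, the choice of scales, the omission of the Morlet admissibility correction term, and the use of coefficient magnitudes as the scalogram are assumptions of this example; a dedicated wavelet library would normally be preferred for long signals.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (admissibility correction term omitted)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)

def cwt(s, scales, dt=1.0):
    """Riemann-sum discretization of Eq. 2; returns an array of shape (len(scales), n)."""
    n = len(s)
    t = np.arange(n) * dt
    out = np.empty((len(scales), n), dtype=complex)
    for k, a in enumerate(scales):
        # arg[tau_idx, t_idx] = (t - tau) / a
        arg = (t[None, :] - t[:, None]) / a
        out[k] = (np.conj(morlet(arg)) @ s) * dt / np.sqrt(a)
    return out

def scalogram_stack(x, scales):
    """x: (n, m) multivariate series -> (m, s, n) tensor, one scalogram per sensor."""
    return np.stack([np.abs(cwt(x[:, i], scales)) for i in range(x.shape[1])])
```

For a sinusoid with angular frequency $\omega$, the coefficient magnitude is largest near the scale $a \approx \omega_0/\omega$, which is how the scalogram localizes frequency content over time.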

Location + Transformation prediction

LeNet-type encoder network Inspired by Liznerski et al. (2021), Ruff et al. (2018), and Ruff et al. (2021), we use a LeNet-type encoder network as the default backbone architecture for SSMSPC. The main reasons for choosing this architecture are (1) its simplicity and (2) its low capacity. Since the amount of available data to monitor discrete manufacturing processes is usually very small compared to other domains, there is an increased risk of overfitting when using architectures with higher capacity.

We tested different network architectures in the development phase of SSMSPC, and found that the LeNet-type encoder represents a reasonable choice in terms of performance, training and inference time for the considered problem statement. It is worth noting that SSMSPC generally allows different design choices depending on the problem at hand. Thus, if the capacity of the LeNet-type encoder is too low, it can be replaced with a higher-capacity CNN-based architecture.

Transformation prediction The objective in transformation prediction is to predict the augmentation $T_i \in \{T_1,\ldots ,T_k\}$ that has been applied to the time series input. To do so, we attach an MLP prediction head $g_\text {trans}$ to the LeNet-type encoder f with a softmax output of size k. The loss function used in transformation prediction is simply the cross-entropy loss based on the softmax output of the prediction head.

$$\begin{aligned} \mathcal {L}_{\text {trans}} = -\sum _{i\in \{1,\ldots ,k\}} y_i\log \left( (g_{\text {trans}}\circ f \circ \text {CWT})(\textbf{x})_i\right) \end{aligned}$$

(3)

Note that the idea of using a classifier to predict the respective augmentation is not new, but rather common practice in self-supervised approaches for anomaly detection, see, e.g., the works of Li et al. (2021), Tack et al. (2020), and Hendrycks et al. (2019).

Location prediction Location prediction aims at predicting the correct location, i.e., the window $W_j \in \{W_1, \ldots ,W_p\}$ in which an augmentation occurred. Similar to transformation prediction, we attach a simple MLP prediction head $g_{\text {loc}}$ with a softmax output of size $p+1$ to the encoder network. The objective function for this task is again a simple cross-entropy loss.

$$\begin{aligned} \mathcal {L}_{\text {loc}} = -\sum _{j\in \{1,\ldots ,p+1\}} y_j\log \left( (g_{\text {loc}}\circ f \circ \text {CWT})(\textbf{x})_j\right) \end{aligned}$$

(4)

Combining the idea of transformation prediction and location prediction leads to the proposed Location + Transformation prediction setting. The novelty here is to attach both prediction heads $g_\text {trans}$, $g_\text {loc}$ to the same encoder network f and train the network via backpropagation in a multi-task setting. The loss function for Location + Transformation prediction is just the linear combination of $\mathcal {L}_{\text {trans}}$ and $\mathcal {L}_{\text {loc}}$.

$$\begin{aligned} \mathcal {L}_\text {loc+trans} = \lambda _1 \mathcal {L}_{\text {trans}} + \lambda _2 \mathcal {L}_{\text {loc}} \end{aligned}$$

(5)

where $\lambda _1, \lambda _2$ are scaling factors. For the sake of simplicity, we use $\lambda _1 = \lambda _2 = 1$ for the remainder of the paper.
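The combined objective in Eq. 5 can be illustrated for a single sample as follows. The sketch works on the pre-softmax outputs (logits) of the two heads and on integer class labels, for which the one-hot sums in Eqs. 3 and 4 reduce to the negative log-probability of the true class. The function names and the use of logits as inputs are assumptions of this example.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample with an integer class label."""
    z = logits - logits.max()                  # stabilise the exponentials
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return -log_probs[label]

def loc_trans_loss(trans_logits, loc_logits, trans_label, loc_label,
                   lam1=1.0, lam2=1.0):
    """Linear combination of the transformation and location losses (Eq. 5)."""
    return (lam1 * cross_entropy(trans_logits, trans_label)
            + lam2 * cross_entropy(loc_logits, loc_label))
```

With $k=4$ and $p=5$, an untrained model that outputs uniform probabilities yields a loss of $\log 4 + \log 6$, which is a useful sanity check at the start of training.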

Downstream task

With the completion of the pretext task, the next step in the two-stage framework, as depicted in Fig. 3, consists in fitting a one-class classifier on top of the learned representations of the pretext task. In SSMSPC, we propose to use Hotelling’s $T^2$ for this purpose. Figure 8 illustrates the individual components of the downstream task.

As opposed to the pretext task, in which the normal process condition data are augmented by a set of k different augmentation functions, the downstream task uses only the normal process condition data, as this is typical for the one-class classification setting.

Given a process sample $\textbf{x}\in X_{\text {train}} \subset X$, the multivariate time series is transformed to the scalogram representation $\text {CWT}(\textbf{x}) =: \widetilde{\textbf{x}}$, and then fed through the pretrained LeNet-type encoder network to retrieve the learned representations $f(\widetilde{\textbf{x}})$. Recall that the prediction heads $g_{\text {trans}}$ and $g_{\text {loc}}$ are discarded in this step, since they were only used for the pretext task. Based on the extracted features, we apply Hotelling’s $T^2$, which is defined as

$$\begin{aligned} T^2 := (f(\widetilde{\textbf{x}}) - \hat{\varvec{\mu }})^\intercal \hat{\Sigma }^{-1} (f(\widetilde{\textbf{x}}) - \hat{\varvec{\mu }}) \in \mathbb {R}, \end{aligned}$$

(6)

where $\hat{\varvec{\mu }} \in \mathbb {R}^{r\times 1}$ and $\hat{\Sigma }^{-1} \in \mathbb {R}^{r\times r}$ represent the estimated mean and the inverse of the estimated covariance matrix of the learned representations $f(\widetilde{X}_{\text {train}}) \in \mathbb {R}^{l\times r}$, respectively.
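Eq. 6 amounts to a squared Mahalanobis distance in the representation space. The sketch below estimates $\hat{\varvec{\mu }}$ and $\hat{\Sigma }^{-1}$ from a matrix of training representations and scores new samples. The function names are hypothetical, and the use of the unbiased sample covariance and of a pseudo-inverse (to tolerate a near-singular covariance matrix) are assumptions of this example; it also assumes $r \ge 2$.

```python
import numpy as np

def fit_hotelling(Z):
    """Z: (l, r) training representations. Returns the mean and inverse covariance."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)              # unbiased sample covariance, (r, r)
    return mu, np.linalg.pinv(cov)             # pseudo-inverse: assumption for robustness

def t2_score(Z, mu, cov_inv):
    """Hotelling's T^2 (Eq. 6) for one or more representation vectors."""
    d = np.atleast_2d(Z) - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)
```

Each row of the input yields one $T^2$ value, so a dataset of l processing cycles produces the l values that are used afterwards to derive the control limit.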

Consequently, for each $\textbf{x}\in X_{\text {train}}$, a single $T^2$ value is computed that represents the processing cycle of the corresponding part. In order to compute the upper control limit, we follow the approach taken by Biegel et al. (2022a, b). First, KDE is used to estimate the probability density function $\hat{\phi }_{T^2}$ based on the l computed $T^2$ values

$$\begin{aligned} \hat{\phi }_{T^2}(T^2) = \frac{1}{lh} \sum _{i} K\left( \frac{T^2 -T_{i}^{2}}{h}\right) , \end{aligned}$$

(7)

where h is the bandwidth and $K(\cdot )$ corresponds to the selected kernel function, in our case a Gaussian kernel. Next, we select a significance value $\alpha $ and compute the quantile function $\hat{\Phi }^{-1}_{T^2}$ of the corresponding cumulative distribution function $\hat{\Phi }_{T^2}$ of $\hat{\phi }_{T^2}$, to retrieve the upper control limit.

$$\begin{aligned} \hat{\Phi }^{-1}_{T^2}(1-\alpha ) = \text {ucl}. \end{aligned}$$

(8)

Note that $\alpha $ corresponds to the type I error or false alarm rate. Varying $\alpha $ translates into an increase or decrease of the type II error, i.e., the false negative rate. Thus, the specific choice of $\alpha $ is problem dependent. If there are no specific restrictions regarding, e.g., the false alarm rate, a possible way is to select the value for $\alpha $ that maximizes the F1-score on the validation set. Combining all the components presented so far, we now have a mapping to provide the anomaly score for a given time series input $\textbf{x}$, and a corresponding threshold based on which the decision will be made whether a sample is flagged as normal or anomalous. Recall that this corresponds exactly to the desired mappings $h:\mathcal {D} \rightarrow \mathbb {R}$ and $b_{\text {ucl}}: \mathbb {R} \rightarrow \{0,1\}$ as stated in “Problem statement” section.
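Eqs. 7 and 8 can be sketched without any plotting or fitting library: for a Gaussian kernel, the cumulative distribution function of the KDE is the average of normal CDFs centred on the observed $T^2$ values, and the quantile can be found by bisection. The bandwidth rule (Silverman's rule of thumb), the bisection tolerance, and the function names are assumptions of this example, since the bandwidth selection is not specified above.

```python
import math
import numpy as np

def kde_cdf(t, samples, h):
    """CDF of a Gaussian KDE: mean of normal CDFs centred on the samples."""
    z = (t - samples) / (h * math.sqrt(2.0))
    return float(np.mean([0.5 * (1.0 + math.erf(v)) for v in z]))

def upper_control_limit(samples, alpha, h=None, tol=1e-8):
    """Solve CDF(ucl) = 1 - alpha by bisection (Eq. 8)."""
    samples = np.asarray(samples, dtype=float)
    if h is None:
        # Silverman's rule of thumb (assumption, not taken from the text)
        h = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    lo, hi = samples.min() - 10 * h, samples.max() + 10 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A sample would then be flagged as anomalous when its $T^2$ value exceeds the returned limit, which corresponds to the threshold mapping $b_{\text {ucl}}$.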

Control chart extension

Once a control chart signals an anomalous process condition, a machine operator needs to determine the root cause of the anomaly and decide which steps to take next (Qiu & Xie, 2021). However, the information provided in a control chart alone does not reveal which part of the process is deemed anomalous, so finding the potential root cause requires a time-intensive investigation of the process. According to Jackson (1991), a good monitoring approach needs to provide an answer to the question of what the problem of the process is if the control chart signals an anomalous process condition.

In order to accomplish this, it is necessary that the underlying model decisions are interpretable, i.e., a mechanism is required to understand the reasoning behind the decision-making process of the model (Ruff et al., 2021). When using time series data, this translates into highlighting the components of the multivariate time series that the model considers anomalous (Abdulaal et al., 2021; Li et al., 2021).

However, with respect to the problem statement considered in this paper, visualizing anomalies in the raw time series signal without any additional domain context makes it very hard for a machine operator to interpret the model decisions correctly. A machining process usually consists of a set of q different process modes $\{M_1, \ldots ,M_q\}$, where each mode corresponds to a different process step in the processing cycle of the part. By collecting additional metadata during the processing cycle such as the Numerical Control (NC) lines of the machine program, it is possible to extract the start point for each process mode and thus segment the time series accordingly.

In SSMSPC, we suggest augmenting the conventional control chart perspective by an additional view in which (1) the anomalous time steps are highlighted and (2) the time series is segmented into the individual process modes. See Fig. 9 for an exemplary illustration of the extended control chart view.

Highlighting anomalies

For the purpose of highlighting the anomalous time steps, we use Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2020), which is a simple and widely used technique to provide visual explanations for the decisions of CNN-based models. The first step in Grad-CAM consists of computing the gradients of $h(\textbf{x})$ with respect to the feature map activations $A^\psi \in \mathbb {R}^{u\times v}$ of the last convolutional layer of the LeNet-type encoder and applying global average pooling to these gradients to obtain the neuron importance weights $\alpha _\psi $

$$\begin{aligned} \alpha _\psi = \frac{1}{uv}\sum _i\sum _j \frac{\partial {h(\textbf{x})}}{\partial {A^{\psi }_{i,j}}}, \quad \forall \psi \in \{1, \ldots , \Psi \}. \end{aligned}$$

(9)

To retrieve the Grad-CAM heatmap $\mathcal {H}\in \mathbb {R}^{u\times v}$ that contains the visual explanations, the next step involves the application of a ReLU on top of the linear combination of the neuron importance weights $\alpha _\psi $ and the feature map activations $A^\psi $

$$\begin{aligned} \mathcal {H} = \text {ReLU}\left( \sum _\psi \alpha _\psi A^{\psi }\right) . \end{aligned}$$

(10)
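Given the feature map activations and the corresponding gradients of the anomaly score, Eqs. 9 and 10 reduce to a few array operations. The sketch below assumes that both arrays have already been obtained from an automatic differentiation framework and have shape $(\Psi , u, v)$; the function name is hypothetical.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (Psi, u, v). Returns a (u, v) heatmap."""
    alpha = gradients.mean(axis=(1, 2))                   # Eq. 9: global average pooling
    heatmap = np.tensordot(alpha, activations, axes=1)    # sum over psi of alpha_psi * A^psi
    return np.maximum(heatmap, 0.0)                       # Eq. 10: ReLU
```

The ReLU keeps only those regions whose activations increase the anomaly score, so feature maps with negative importance weights do not contribute to the heatmap.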

In order to obtain the anomalous time steps, we first resize the Grad-CAM heatmap $\mathcal {H}$ to $\widetilde{\mathcal {H}}\in \mathbb {R}^{s\times n}$. Recall that s represents the number of scales used when applying CWT and n stands for the number of time steps. Next, we look for a threshold function $b_\delta : \mathbb {R}^{s \times n} \rightarrow \mathbb {R}^{s\times n}$ such that

$$\begin{aligned} b_\delta (\widetilde{\mathcal {H}})_{i,j} = {\left\{ \begin{array}{ll} \widetilde{\mathcal {H}}_{i,j}, & \text {if } \widetilde{\mathcal {H}}_{i,j} > \delta \\ 0, & \text {if } \widetilde{\mathcal {H}}_{i,j} \le \delta \end{array}\right. } \quad \forall i, j. \end{aligned}$$

(11)

To find the threshold $\delta $, we follow the exact same idea as demonstrated in the previous section. Thus, we first compute the probability density function $\hat{\phi }_{\widetilde{\mathcal {H}}}$ using KDE. Selecting a significance value $\alpha _{\widetilde{\mathcal {H}}}$ and evaluating the quantile function at the respective position $\hat{\Phi }^{-1}_{\widetilde{\mathcal {H}}}(1-\alpha _{\widetilde{\mathcal {H}}})$, we find the desired threshold $\delta $. Lastly, we simply sum up the columns of $b_{\delta }(\widetilde{\mathcal {H}})$, set all nonzero values to 1 and thus retrieve a binary value for each time step indicating whether it is to be considered anomalous

$$\begin{aligned} \mathbb {1}_{b_{\delta }(\widetilde{\mathcal {H}})_{i,j} > 0}\left( \sum _i b_{\delta }(\widetilde{\mathcal {H}})_{i,j}\right) \forall j \in \{1,\ldots ,n\}. \end{aligned}$$

(12)

Figure 10 visualizes the aforementioned steps.
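The resizing, thresholding and column-wise reduction in Eqs. 11 and 12 can be sketched as follows. The threshold $\delta $ is taken as a given input here, and the nearest-neighbour resizing is an assumption of this example, since the interpolation method is not specified above; the function names are hypothetical.

```python
import numpy as np

def resize_nearest(H, s, n):
    """Resize a (u, v) heatmap to (s, n) by nearest-neighbour index mapping."""
    u, v = H.shape
    rows = np.arange(s) * u // s
    cols = np.arange(n) * v // n
    return H[np.ix_(rows, cols)]

def anomalous_time_steps(H_resized, delta):
    """Return one binary flag per time step (column) of the resized heatmap."""
    thresholded = np.where(H_resized > delta, H_resized, 0.0)   # Eq. 11
    # The heatmap is non-negative after the ReLU, so a positive column sum
    # means that at least one scale exceeds the threshold at this time step.
    return (thresholded.sum(axis=0) > 0).astype(int)             # Eq. 12
```

The resulting vector of length n can be overlaid on the raw time series to mark the time steps that the model considers anomalous.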

Table 1 Overview of the dataset structure of the Bosch CNC-milling dataset, adapted from Tnani et al. (2022)
