Anomaly detection and event mining in cold forming manufacturing processes

Predictive maintenance is one of the main goals within the Industry 4.0 trend. Advances in data-driven techniques offer new opportunities in terms of cost reduction, improved quality control, and increased work safety. This work brings data-driven techniques for two predictive maintenance tasks: anomaly detection and event prediction, applied in the real-world use case of a cold forming manufacturing line for consumer lifestyle products by using acoustic emissions sensors in proximity of the dies of the press module. The proposed models are robust and able to cope with problems such as noise, missing values, and irregular sampling. The detected anomalies are investigated by experts and confirmed to correspond to deviations in the normal operation of the machine. Moreover, we are able to find patterns which are related to the events of interest.


Introduction
In recent years, there has been an increased interest in analysis tools for industrial applications. Within Industry 4.0, one of the goals is combining sensor technologies with data analysis tools in order to improve the manufacturing process. In general, predictive maintenance (PdM) focuses on diagnosing the machine status and providing insights about its current and future conditions. In this paper, we analyze two key aspects of PdM: online anomaly detection (AD) and event prediction on a cold forming manufacturing line.
The cold forming line is equipped with acoustic emission (AE) sensors that provide high-frequency information about the mechanical conditions of the press components. This type of sensor has previously been used to investigate failure modes of mechanical components under laboratory settings [4,7]. In contrast, this work focuses on analyzing the signal under real operation conditions, which poses challenges in the data quality and model capabilities. In particular, the data is challenging as samples are discontinuous and sampled at irregular intervals. This, in turn, prevents the usage of standard time series analysis approaches. Other problems which are also addressed include the presence of noise, sensors getting disconnected, logging errors, and constant stops and restarts of the machines due to production bottlenecks.

Diego Nieves Avendano (diego.nievesavendano@ugent.be), Daniel Caljouw (daniel.caljouw@philips.com). 1 IDLab, Ghent University - imec, Ghent, Belgium. 2 Philips Consumer Lifestyle B.V., Eindhoven, Netherlands.
Anomalies are events that do not conform to the normal or expected behavior of a process. Their correct detection is useful as it can uncover events of interest. In the manufacturing industry, AD provides benefits in terms of quality control, safety, and reduction of costs due to unexpected breakdowns. Fault prediction aims to discover early signs that an error will occur and to report them in terms of probability of the error occurring or a time interval in which it will occur. Fault prediction provides benefits, as it enables the optimization of maintenance schedules.
Traditional AD algorithms use statistical approaches to find extreme values or correlation changes between features [9]. This leaves a considerable gap in scenarios where anomalies are not necessarily linked to extreme values, or where their statistical properties do not diverge considerably from the norm. There exists extensive literature on the detection of anomalies in online settings [1] and multivariate cases [9], but their application in industrial settings is still limited. More specifically, an unsupervised AD approach based on the micro-cluster continuous outlier detection (MCOD) algorithm [13] has been evaluated on the same industrial use case as the current work [18].
The AD approach is based on matrix profile (MP) [22], which is a method for time series analysis that is robust, domain-agnostic, and computationally efficient. Compared with previous approaches, it offers the following advantages:
- Requires a short warm-up time and no historical data to give reliable results.
- Provides a bounded and intuitive anomalous score.
- Can be calculated in real-time as it is not computationally intensive.
- Can handle non-uniform sampling.
Fault prediction in industrial applications using wave signals, such as vibration and acoustic emission, has been applied for valves [2], reciprocating machines [6], diesel engines [5], indenters [19], conveying systems [20], and other rotating components [20,21]. Traditional approaches focus on the analysis of temporal and frequency features in combination with statistical testing. More recently, data-driven methods such as self-organizing maps [21], support vector machines [2], and deep learning [14] have been investigated. In our approach, we present a model for event prediction, where event refers not only to fault events, but also to events which may not be associated with an error but are of interest for PdM.
The event prediction is based on a combination of salient subsequence mining, clustering, and association rules. It is generic enough to deal with different types of sensor data and with non-uniform sampling, and it requires little fine-tuning of parameters. In contrast to previous work carried out under controlled experimental settings, we handle data collected in industrial settings. In our literature review, only the work from Varga et al. [20] handles real industrial conditions, but for a different type of machinery and without classifying the different types of faults. To the best of our knowledge, we are the first to present methods using AE for PdM in cold forming lines. State-of-the-art machine learning techniques rely on data abstractions which reduce interpretability. In recent years, there has been an increased demand for interpretable models within different industrial sectors [12]. By using association rules, the acoustic patterns which signal future faults can be easily retrieved.
This paper is organized as follows: Section 2 presents the use case and the data representation. Section 3 presents the methodology for the anomaly detection based on matrix profile, and the event prediction based on association rules. Section 4 presents the results. Section 5 discusses the results and the lessons learned. Finally, Section 6 contains the conclusions.

Cold forming process
The real-world use case investigated in this work corresponds to a Philips manufacturing line in the Netherlands. The data belongs to a cold forming line of a consumer lifestyle product. Figure 1 shows a scheme of the data collection.
The data were collected under normal operation conditions for a period of over a year. The data consist of information of the material batch, sensors from the press module, and information from the preventive and corrective maintenance. Table 1 summarizes the components and the information retrieved.
The press module is composed of a main rotor and 6 dies which either cut or flatten the metal strip. The dies of interest contain acoustic sensors or position sensors, or both. The position sensors are located within the lower die and measure the correct position of the stripper plate. The stripper position deviates when there is scrap material inside the die. In order to prevent damage, the position sensor triggers a stop of the machine. The AE sensors are located in the proximity of the dies; they measure the acoustic waves propagated through the die. The goal of this work is to analyze the AE signal to provide insights of the machine status. Table 2 summarizes the die functions and the available sensors.

Fig. 1 Cold process diagram. The blue cylinders correspond to the location of the acoustic emission sensors; the green triangles correspond to the location of the position sensors. Each die inside the press module can either cut (C), flatten (F), or both (CF). After the press module, the production line continues with further steps for the final product. Separately, the maintenance and die logs provide information concerning the status and maintenance of the press components

Acoustic emission signal
The cold forming press contains six AE sensors. These sensors record sound information along a wide frequency spectrum. Changes in the AE signal can be associated with gradual mechanical changes, such as those caused by friction and mechanical degradation, as well as sudden changes caused by unexpected events such as jamming of materials.
The acoustic emission is measured in a low-frequency band (< 2 kHz) and filtered, and then a temporal feature is extracted for specific positions of the main rotating component. The temporal feature extracted is not disclosed for intellectual property reasons. Figure 2 shows common patterns for each of the six sensors. The x-axis is not time but the angle of the main rotor. In normal operating conditions, in order to reduce the data size, only one sample per minute is kept. Due to limitations in the sensing equipment, the samples are not uniformly distributed. The median difference between subsequent samples is 63 ± 4.44 s (median and median absolute deviation). Regardless of the non-uniform sampling, our approach is flexible and does not require the signal to be continuously measured.
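The sampling statistic quoted above (median and median absolute deviation of the gaps between consecutive samples) can be reproduced with a few lines of Python. This is a minimal sketch; the function name and the example timestamps are hypothetical and are not taken from the dataset.

```python
import statistics

def median_and_mad(timestamps_s):
    """Return (median, MAD) of the gaps between consecutive timestamps, in seconds."""
    gaps = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    med = statistics.median(gaps)
    # Median absolute deviation: median of the absolute distances to the median gap
    mad = statistics.median(abs(g - med) for g in gaps)
    return med, mad

# Hypothetical timestamps (seconds), roughly one sample per minute
ts = [0, 60, 123, 185, 250, 311]
med, mad = median_and_mad(ts)
print(f"median gap = {med} s, MAD = {mad} s")
```

Both statistics are robust to the occasional very long gap caused by a machine stop, which is presumably why they are preferred here over the mean and standard deviation.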

Stop and maintenance events
The press module is equipped with sensors that automatically stop the press in case of faulty condition(s). The stops can be triggered by either lubrication problems or the presence of double material. The double material can be detected in four modules as described in Table 2.
Additionally, there are maintenance events which are performed after faults in the press or deviations in the product quality have been detected. The maintenance events are associated with specific die modules and reported manually. We focus on two types of maintenance events: scrap returns into the die and cutting faults that occur due to a broken punch or a blunt punch.
In total there are four different types of events of interest: two stop events and two maintenance events. For the rest of this paper, we will refer to them as events of interest regardless of their type.

Methodology
This section presents the methodology for the AD model and the rule mining model. Figure 3 shows the data flow and processing. First, the acoustic emission is converted into the meta-time series using the matrix profile as described in Section 3.1; this information is then processed in two branches: the AD model, which operates online and is described in Section 3.2, and the event prediction, which consists of three blocks described in Section 3.3. Once the anomalies have been detected or labelled, they can be passed to the rule mining module to learn patterns between anomalies and events of interest.

Online matrix profile
The MP is a data mining technique that allows analysis of time series in an intuitive way [22]. It focuses on the problem of all-pairs-similarity-search, where the nearest neighbor for each object in a set is found. The algorithm requires only two parameters: a window length (m) and a distance function. The parameter m is set to the number of angle positions recorded per punch (m = 500). The distance function used is the z-normalized Euclidean distance, which is a preferable measure when the pattern shape is of interest regardless of its magnitude. The outputs of the algorithm are two meta-time series, namely the MP, which contains the distance to the nearest neighbor, and the profile index (PI), which contains the position of the nearest neighbor.
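To illustrate why the z-normalized Euclidean distance captures shape rather than magnitude, the following sketch (using NumPy, with hypothetical function and variable names) compares a pattern with a scaled and shifted copy of itself and with its mirror image. The synthetic sine pattern only stands in for a punch of m = 500 angle positions.

```python
import numpy as np

def znorm_euclidean(a, b, eps=1e-12):
    """Euclidean distance between the z-normalized versions of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    za = (a - a.mean()) / max(a.std(), eps)  # eps guards against flat signals
    zb = (b - b.mean()) / max(b.std(), eps)
    return float(np.linalg.norm(za - zb))

# Synthetic stand-in for one punch with m = 500 angle positions
angles = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
base = np.sin(angles)

d_same = znorm_euclidean(base, 3.0 * base + 5.0)  # same shape, different scale and offset
d_opposite = znorm_euclidean(base, -base)         # perfectly anti-correlated shape
print(d_same, d_opposite)
```

The first distance is zero up to floating-point error, while the second reaches the maximum possible value of 2√m, which is attained only for perfectly anti-correlated patterns.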
Traditional MP requires time series to be continuous and uniformly sampled. In this use case, the signal is subsampled at non-uniform intervals; therefore, the problem is posed in a different representation: the time series X consists of D dimensions, each dimension corresponding to an AE sensor, N samples, each sample corresponding to a full punch recording, and M attributes corresponding to the magnitude at each angle. Hence, X ∈ R^{D×N×M}. In this case, D = 6 and M = 500. A new sample x ∈ R^{6×1×500} arrives approximately every minute.
We modify the original matrix profile in the following aspects:
- The exclusion zone corresponds to the number of full punches to ignore. This differs from the original approach where the exclusion zone corresponds to the number of time steps to exclude.
- Only the left MP is calculated, which corresponds to the comparison of the current sample with the previously received samples. This allows the method to be used in online settings.
- We define a bound value to limit the number of previous samples to which an incoming sample is compared. The bound is set to 1800 samples for all the experiments, which corresponds to approximately the previous 30 h. This reduces the computation time while still giving a good approximation of the MP without reducing the AD capabilities.

Each channel is processed independently and we obtain a D-dimensional meta-time series where each dimension corresponds to the MP of a sensor. Using the timestamps of the original time series, the PI can be converted into time differences. This helps reduce the discrepancy caused by the non-uniform sampling. For example, an anomaly can be explained if a large time gap is found, which occurs when the machine is stopped for inspection. These new meta-time series are used for AD and rule mining as explained in the next sections.
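The three modifications above can be sketched for a single channel as follows. This is an illustrative brute-force version under stated assumptions, not the implementation used in the experiments: the class name `OnlineLeftMP`, its parameters, and the synthetic punches are hypothetical, and each punch is compared as a whole (one window per punch).

```python
from collections import deque

import numpy as np

class OnlineLeftMP:
    """Left matrix profile over whole punches for one channel (illustrative sketch)."""

    def __init__(self, bound=1800, exclusion=0):
        self.bound = bound          # maximum number of previous punches to compare against
        self.exclusion = exclusion  # number of most recent punches to ignore
        self.history = deque(maxlen=bound + exclusion)  # z-normalized punches
        self.count = 0              # number of punches seen so far

    @staticmethod
    def _znorm(x):
        x = np.asarray(x, dtype=float)
        s = x.std()
        return (x - x.mean()) / (s if s > 0 else 1.0)

    def update(self, punch):
        """Return (mp_value, profile_index) for the incoming punch."""
        z = self._znorm(punch)
        hist = list(self.history)
        # Drop the most recent `exclusion` punches from the candidate set
        candidates = hist[:max(0, len(hist) - self.exclusion)]
        if candidates:
            dists = np.linalg.norm(np.stack(candidates) - z, axis=1)
            j = int(np.argmin(dists))
            mp_value = float(dists[j])
            profile_index = (self.count - len(hist)) + j  # absolute sample index
        else:
            mp_value, profile_index = float("inf"), -1    # nothing to compare against yet
        self.history.append(z)
        self.count += 1
        return mp_value, profile_index

angles = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
mp = OnlineLeftMP(bound=1800, exclusion=0)
first = mp.update(np.sin(angles))
second = mp.update(np.cos(angles))
third = mp.update(2.0 * np.sin(angles) + 1.0)  # same shape as the first punch
print(first, second, third)
```

Given the absolute profile index, the time difference to the nearest neighbor follows by subtracting the two timestamps, which is how a large gap after a machine stop can be made visible.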

Anomaly detection
In this work, we focus on point anomalies, where a full punch is labeled as an anomaly if any of its dimensions is anomalous. This can be done by analyzing the MP of each channel and defining a threshold value. Additionally, the MP can be summed over the dimensions and anomalies can be detected on the reduced MP [24]. The z-normalized Euclidean distance is bounded (it lies between 0 and 2√m for a window of length m); therefore, the multidimensional MP can be summed regardless of the magnitude of the original signals. In case one wants to give more focus to a specific channel, the sum can be weighted.
In this use case, the anomalies are unlabelled. Therefore, a threshold value in the MP cannot be defined in a straightforward way. Instead, a statistical approach is used. The running mean and standard deviation over each channel of the MP and the summed MP are calculated, and the threshold bands are defined as the mean ± nσ, where n is the number of standard deviations taken as the tolerance limit. A punch is considered anomalous if its MP value is beyond the bands. Figure 4 shows the moving average and the ±6σ bands for a period of 8 days. The gaps in the plot are caused by stops in the machine. When the machine is restarted, the MP will be high due to the lack of samples to compare against. To avoid labelling these events as anomalous, the rolling mean is only considered after at least 10 samples have arrived.
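The thresholding rule can be sketched with an online mean and standard deviation. The sketch below makes two assumptions that are not stated in the text: the statistics are accumulated over all samples since the last reset (rather than over a fixed-length rolling window), and flagged values are still added to the statistics. The class name `RunningBand` and the example values are hypothetical.

```python
import math

class RunningBand:
    """Flag MP values outside mean ± n_sigma * std, after a warm-up period."""

    def __init__(self, n_sigma=6.0, warmup=10):
        self.n_sigma = n_sigma
        self.warmup = warmup
        self.reset()

    def reset(self):
        """Call after a machine restart so that the warm-up applies again."""
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def update(self, value):
        """Return True if `value` lies outside the bands built from earlier values."""
        if not math.isfinite(value):
            return False  # e.g. the first punch, which has no neighbor yet
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            is_anomaly = abs(value - self.mean) > self.n_sigma * std
        # Welford's online update of mean and variance
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return is_anomaly

band = RunningBand(n_sigma=6.0, warmup=10)
normal_values = [1.0, 1.2] * 10                 # hypothetical MP values in normal operation
flags = [band.update(v) for v in normal_values]
spike_flag = band.update(5.0)                   # hypothetical anomalous MP value
print(any(flags), spike_flag)
```

Applying one such band per channel, plus one for the summed MP, and labelling a punch anomalous when any band fires, corresponds to the point-anomaly rule described above.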

Rule mining
This section explains the rule mining approach, which aims at finding acoustic patterns which precede events of interest. It is organized as follows: Section 3.3.1 presents the salient subsequence mining, which reduces the dataset to the most representative patterns. Section 3.3.2 presents the clustering model which is applied to the reduced dataset to discretize the acoustic signals. Finally, Section 3.3.3 explains the procedure to learn association rules and construct a classifier by association.

Salient subsequence mining
Salient subsequence mining focuses on finding a subset of the data that represents the majority of the common patterns. We apply a salient subsequence mining technique that uses the MP to explore candidates and add them to a representation set using minimum description length (MDL) as a stop criterion [23].
Fig. 4 Matrix profile results for 2 weeks of data. Highlighted in red, the reported anomalies. The y-axis is the z-normalized Euclidean distance (unitless)

Fig. 5 Clusters for the ram. Each color represents a different cluster. In this case, the representation has well-defined clusters

The MDL technique requires the time series to be discrete; therefore, each dimension is normalized to have μ = 0 and σ = 1, and afterwards discretized as shown in Eq. 1, where b is the number of bits for discretization. It has been shown empirically that the discretization of time series does not have an impact on classification and other related tasks, and that in general the number of compression bits (b) does not have a drastic impact [11].
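As an illustration of the discretization step, the sketch below maps an already normalized series onto 2^b integer levels using min-max scaling. This particular form is an assumption: it is the mapping commonly used in MDL-based time series work, but it is not necessarily identical to Eq. 1. The function name is hypothetical.

```python
import numpy as np

def discretize(z, b=4):
    """Map a normalized series onto the integer levels 1..2**b (assumed min-max form)."""
    z = np.asarray(z, dtype=float)
    lo, hi = z.min(), z.max()
    if hi == lo:
        return np.ones(z.shape, dtype=int)  # a flat series collapses to a single level
    return np.round((z - lo) / (hi - lo) * (2 ** b - 1)).astype(int) + 1

codes = discretize(np.linspace(-2.0, 2.0, 500), b=4)
print(codes.min(), codes.max())
```

With b = 4 every value is stored as one of 16 symbols, so a sequence of length m costs m × b bits before any compression.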
The MP is traversed to explore the sequences that have a low value, meaning that there exists at least one other closely related sequence. Notice that our MP approach includes a limited search (defined by the bound limit, see Section 3.1), which means the candidates are local minima and the solution is not optimal. Although the problem of optimal subsequence selection is on its own intractable, good results have been obtained with approximate solutions in similar tasks [23].
Each subsequence is added to a hypothesis set, which is used to encode other segments; or to a compression set, meaning that it can be expressed by a member of the hypothesis set in fewer bits. Equation 2 shows the description length measure (dl) for a segment x, i.e., the number of bits required to store a sequence of length m with a cardinality of b bits.
Equation 3 shows the bit save obtained by compressing a sequence t c with t h as its hypothesis.
where γ(·) is a function that counts the number of differing digits between two discretized sequences. Algorithm 2 presents the sequence selection, which closely follows the one proposed by Yeh et al. [23], with the difference that in this case the patterns are multidimensional and the meta-time series, MP and PI, have a limited search. To employ the same algorithm, the sum of the multidimensional MP is used to select the best candidates. Alternatively, the patterns for each individual channel can be found and then later combined. However, for multidimensional time series, it is often the case that common patterns are found in a lower dimensionality [24].
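The description length and bit-save measures can be illustrated as follows. The conditional cost used here, γ × (log2 m + b), meaning that each differing digit is stored as a position plus a value, is the form commonly used in MDL-based subsequence selection; it is assumed rather than taken from Eqs. 2 and 3. Function names and the example sequences are hypothetical.

```python
import math

def description_length(m, b):
    """Bits needed to store a sequence of length m with b bits per value."""
    return m * b

def bit_save(t_c, t_h, b):
    """Bits saved by encoding t_c as its differences to the hypothesis t_h.

    Assumed conditional cost: gamma * (log2(m) + b), with gamma the number
    of differing digits between the two discretized sequences.
    """
    m = len(t_c)
    gamma = sum(1 for c, h in zip(t_c, t_h) if c != h)
    conditional = gamma * (math.log2(m) + b)
    return description_length(m, b) - conditional

t_h = [1, 2, 3, 4, 5, 6, 7, 8]
t_c = [1, 2, 3, 4, 5, 6, 7, 9]   # differs from the hypothesis in one position
saved = bit_save(t_c, t_h, b=4)
print(saved)
```

In this toy case storing t_c on its own costs 32 bits, while storing one correction costs 7 bits, so the candidate would be a good member of the compression set. A negative bit save indicates that the sequence is cheaper to store uncompressed.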
The sequence selection goes through the following stages: (1) each dimension is normalized in order for them to have comparable magnitudes, and (2) the data is reshaped from a tensor of D dimensions with m instances per sample to 1 dimension and D × m instances per sample (lines 3-5). The initial cost (line 7) is equivalent to storing all the sequences without compression. Then, a greedy search loop starts, where first a set of candidates is selected using the MP (line 9 in Algorithm 3), and then evaluated to determine the best pattern in terms of bits saved, and whether it should be assigned to the hypothesis or compression set (line 12 in Algorithm 4).
Although one could argue that the subsequence selection is not required as clustering can be done on the whole dataset, it is important to consider that reducing the dataset considerably decreases the computational power required. In addition, it has been shown that clustering time series becomes irrelevant if all possible subsequences are considered [11].

Clustering
The salient subsequences discovered in Section 3.3.1 are clustered by first using principal component analysis (PCA) and then applying Gaussian mixture clustering on the principal components (PC). The aim is to use the discovered clusters as labels for the AE signals.
The PCA projection is done as follows: first, each dimension is considered to be a stationary time series, and is normalized to have μ = 0 and σ = 1. Then, for each punch, the sample is seen as the concatenation of the six channels over a full rotation (D × 500 samples), giving a total of 3000 instances per punch in this case.
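The normalization, concatenation, and projection can be sketched with NumPy alone. The array layout (N, D, M), the function name, and the random data are assumptions made for this example; PCA is computed here through a singular value decomposition of the centered data.

```python
import numpy as np

def pca_project(X, n_components=10):
    """Project punches onto their first principal components.

    X has shape (N, D, M): N punches, D channels, M angle positions
    (a layout assumed for this sketch).
    """
    n = X.shape[0]
    # Normalize each channel over all punches and angles (mean 0, std 1)
    mu = X.mean(axis=(0, 2), keepdims=True)
    sd = X.std(axis=(0, 2), keepdims=True)
    flat = ((X - mu) / sd).reshape(n, -1)  # concatenate channels: (N, D * M)
    centered = flat - flat.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, explained_ratio[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6, 500))  # random stand-in for 40 salient punches
scores, ratio = pca_project(X, n_components=10)
print(scores.shape, float(ratio.sum()))
```

The resulting scores are the inputs on which a mixture model would then be fitted; the sum of the explained ratios indicates how much variance the retained components preserve.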
The clustering is done using Gaussian mixture models, a family of probabilistic models that represent the data as a weighted sum of normal distributions, each with its own covariance matrix. Although any clustering algorithm can be used, the Bayesian Gaussian mixture algorithm offers the advantage of reducing the number of clusters when required [3].
We analyze the effects of clustering on individual dimensions. Figure 5 shows the projection in the first two components for the ram. Here, it can be appreciated that the data has some defined clusters. Figure 6 shows the clustering for component G7, where the clusters are not clearly defined. Figure 7 shows the projection of the 6 dimensions combined.

Association rules
The association rule problem focuses on discovering interesting relations in datasets with transactions. This problem has been of particular interest in the retail industry, where associations can be found between products that are bought together or subsequently. Let I be the set of all items and D the set of transactions, where each transaction T is a subset of I. A transaction T contains the set of items X if X ⊆ T. Then, an association rule is stated as follows: X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. This can be read as follows: given X, Y is more likely to occur. In our model, X corresponds to a set of one or multiple acoustic signals discretized by the clusters, and Y is an event of interest.
The frequent pattern growth (FP-growth) algorithm is a fast algorithm for frequent pattern mining which uses an efficient data representation to handle large databases. It is particularly useful in cases where the desired rules contain sets of items with a low support threshold. This coincides with finding rules that predict the events of interest. The full specification of the FP-growth algorithm can be found in the work by Han et al. [10].
The problem can then be seen as mining item sets from the combined list of acoustic clusters and events. In this case, the number of unique items is relatively low, with few possible clusters and only four event types (as explained in Section 2.2).
The AE data is segmented in time windows with no overlap. Each pattern within the window is encoded according to the clusters discovered in Section 3.3.2 and combined with the maintenance logs to form a transaction T. Notice that the acoustic clusters must precede the maintenance events, but we are not interested in the order in which the patterns occur. In case no events occur within the window, a dummy healthy status is assigned to the transaction.
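The construction of transactions can be sketched as follows. The function name, item naming scheme, and example data are hypothetical. As a simplification that is an assumption of this sketch, when a window contains one or more events only the cluster labels observed up to the first event are kept, so that the acoustic patterns precede the event.

```python
from collections import defaultdict

def build_transactions(samples, events, window_s=3600, healthy="healthy"):
    """Group cluster labels and events into non-overlapping time windows.

    samples: list of (timestamp_s, cluster_label)
    events:  list of (timestamp_s, event_name)
    Returns one set of items per window that contains at least one sample.
    """
    clusters = defaultdict(list)
    for t, label in samples:
        clusters[int(t // window_s)].append((t, f"c{label}"))
    logged = defaultdict(list)
    for t, name in events:
        logged[int(t // window_s)].append((t, name))

    transactions = []
    for w in sorted(clusters):
        window_events = logged.get(w, [])
        if window_events:
            first_event = min(t for t, _ in window_events)
            # Keep only the patterns that precede the first event (order is discarded)
            items = {c for t, c in clusters[w] if t <= first_event}
            items |= {name for _, name in window_events}
        else:
            items = {c for _, c in clusters[w]}
            items.add(healthy)  # dummy status for windows without events
        transactions.append(items)
    return transactions

samples = [(0, 3), (600, 5), (3000, 7), (3700, 3)]
events = [(1800, "double_material")]
tx = build_transactions(samples, events, window_s=3600)
print(tx)
```

Because each transaction is a set, repeated occurrences of the same cluster within a window collapse into a single item, which matches the statement that the order of the patterns is not of interest.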
The generated rules are filtered to only keep the ones with AE clusters in their antecedents and events of interest as their consequent. In addition, only the rules with a minimum confidence (as defined in Formula 4) and lift (as defined in Formula 4) are kept. Support is the percentage of transactions in D which contain the given item or tuple of items. The lift of a rule is the ratio between the probability of the antecedent and consequent occurring together and the expected probability if they were independent. This is based on P(A ∩ B) = P(A)P(B), which holds when the probability of A is not affected by B, nor the other way around.
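For reference, support, confidence, and lift can be computed directly from a list of transactions using their standard definitions. The function name and the toy transactions below are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(a <= t for t in transactions)         # transactions containing the antecedent
    n_c = sum(c <= t for t in transactions)         # transactions containing the consequent
    n_ac = sum((a | c) <= t for t in transactions)  # transactions containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # lift = P(A and C) / (P(A) * P(C)) = confidence / P(C)
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

toy = [
    {"c1", "fault"},
    {"c1", "healthy"},
    {"c2", "healthy"},
    {"c2", "healthy"},
]
support, confidence, lift = rule_metrics(toy, {"c1"}, {"fault"})
print(support, confidence, lift)
```

A lift above 1 means that the event follows the acoustic cluster more often than independence would predict; in the toy data the rule c1 ⇒ fault has a lift of 2.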
Finally, the association rules are grouped in order to form a classifier by association using a modified version of the CBA-CB algorithm [15]. In the original implementation, rules are sorted according to their confidence and support. We extend this by taking into consideration the class imbalance by adding a weighted confidence which is the confidence of a rule multiplied by the inverse frequency of its consequent. Algorithm 5 presents the methodology to create a classifier based on association rules.
The algorithm first sorts the generated rules as follows. Given two rules r_i and r_j, r_i precedes r_j if:
- The confidence of r_i is larger than that of r_j.
- Their confidences are equal but the support of r_i is larger than that of r_j.
- Both confidences and supports are equal but the weighted confidence of r_i is larger than that of r_j.
- Otherwise, the rule that was generated first takes precedence.
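The rule ordering can be expressed as a single sort key. This sketch assumes that the third tie-breaker compares the weighted confidence (confidence multiplied by the inverse frequency of the consequent) and that generation order is the last resort; the `Rule` class, its fields, and the example frequencies are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset
    consequent: str
    confidence: float
    support: float
    order: int  # position in which the rule was generated

def sort_rules(rules, class_freq):
    """Sort by confidence, then support, then weighted confidence, then generation order."""
    def key(rule):
        weighted = rule.confidence / class_freq[rule.consequent]
        return (-rule.confidence, -rule.support, -weighted, rule.order)
    return sorted(rules, key=key)

class_freq = {"healthy": 0.98, "cutting_fault": 0.02}  # hypothetical class frequencies
rules = [
    Rule(frozenset({"c1"}), "healthy", 0.6, 0.10, 0),
    Rule(frozenset({"c2"}), "cutting_fault", 0.6, 0.10, 1),
    Rule(frozenset({"c3"}), "healthy", 0.9, 0.05, 2),
]
ranked = sort_rules(rules, class_freq)
print([r.order for r in ranked])
```

With equal confidence and support, the rule predicting the rarer class is ranked first, which is the intended effect of accounting for class imbalance.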

Anomaly detection
The proposed AD method is able to detect anomalies online after a short period of warm-up. In our tests, this was as low as half an hour, or just 30 samples. Table 3 shows the number of detected anomalies for two different thresholds. In order to gain insight into the anomalies, these were clustered using HDBSCAN [17]. HDBSCAN is an extension of the density-based spatial clustering of applications with noise (DBSCAN) which allows clusters with different densities. The algorithm groups samples in high-density spaces and reports as anomalies samples in low-density spaces. In this way, anomalies that reoccur can be grouped together, and those that seldom occur are kept aside.
The anomalies are in most cases related to the magnitude of the main lobes. The magnitude of the acoustic signals is related to the force applied by the rotor; hence, deviations indicate that more or less force was required at a certain portion of the rotation. The discovered clusters were manually investigated and are summarized in Table 4. Figure 8 displays examples of the anomalies. The most common patterns discovered are:
- Decrease in magnitude of main lobe. Appears in channels G0, G3, G4, and RAM. It is related to punches which required less force to go through.
- Appearance of second lobe or reverberations. Appears in channel G0. It occurs when the press required more force to return to the initial position.
- Distortions before or after main lobe. Occurs in channels G0, G3, and G4. Peaks or reverberations appear before or after the main lobe. These events need to be further investigated.
HDBSCAN was not capable of finding significant clusters for the anomalies in the channels G2 and G7. In all channels, there was a fraction of anomalies which could not be clustered; we refer to them in Table 4 as others. These groups are composed of events that are extremely rare and therefore could not be clustered. Among them are events which do not have significance for PdM tasks such as disconnected sensors, noise, and logging errors; and events that may be informative such as punches requiring additional force, or other abnormal patterns. Figure 9 shows examples of patterns which were not clustered. If enough samples are labelled, then a supervised method can be trained to classify the less common anomalies.

Association rules
The mining of salient subsequences offers a considerable compression as it requires only ≤ 6.85% of the data to store the most representative patterns. This representation was then projected using PCA, preserving only 10 components, which are able to explain 94.11% of the variance. The compression percentage can be varied by tuning the number of trivial matches to ignore (n_t). However, the results are similar within a broad range of compression levels. Other aspects such as the number of PC to preserve, the number of clusters generated, and the time window selected have a more significant effect. Figure 10 shows the clustering using windows of 1 h and up to 80 clusters. From a visual inspection, it is evident that certain clusters are highly related to the events, but that some events may also happen unexpectedly. The rule mining technique was able to discover different sets of rules that trigger three out of the four events of interest: cutting faults, scrap returning into the die, and double material. No relevant rules were discovered for lubrication problems, an extremely rare event that occurred only six times over the year. Table 5 shows the sets of rules discovered for a different number of clusters and time windows. The rules have been grouped according to the consequent they predict. These groups can be seen as sets of rules joined by an OR operator.

Fig. 8 [...] where the main lobe is smaller and a large peak appears afterwards. d Anomaly in G3 where the main lobe is smaller and a small peak appears afterwards. e-f Anomalies in G4 and RAM where the magnitude of the main lobe is lower
To compare the results, we use a baseline classifier which predicts the majority class in all cases, which corresponds to the healthy state. This way the model obtains a high micro F1-score (0.986-0.992) but does not predict the events. To take into account the class imbalance, we report the weighted F1-score. The best configuration is found at 60 clusters and windows of 1 h. It is important to notice that the time step has an important impact on the scoring of the classifiers, as it considerably reduces the imbalance between healthy states and fault-related events. This increase in the score may be attributed more to the change in samples than to more accurate rules. The best classifier is capable of detecting all events with a moderate F1-score (0.083 to 0.101) and a significant increase in the weighted score compared with the baseline. This is a relevant result considering that the fault-related events represent only 0.05% of the punches and 2.19% of the samples after windowing.

Discussion
The presented work brings time series analysis (TSA) and machine learning (ML) techniques to a real use case in the manufacturing industry and serves as a study of the viability of these techniques for PdM. The major lessons learned are presented below.

Integration in practice
In order to create solutions that can be easily integrated into practice and cause minimum disruption, it is important to use methods that do not require extensive tuning or interaction from technicians. This was achieved by using the matrix profile technique, which is domain agnostic and does not require tuning. In this case, the technique was effective enough to give results with only a short warm-up of a few samples (30 samples, which is approximately 30 min). In the case of the event detection, the only step that requires careful tuning is selecting an adequate clustering algorithm. In some cases, the selected algorithm may require fine-tuning of its parameters.

Data sampling
This was one of the main challenges in the use case. We hypothesize that the low sampling rate can be one of the limiting factors causing the AD to not provide useful information for fault prediction. Currently, the sampling rate is 60 samples per hour. As the press can produce hundreds of punches within a minute, acoustic patterns related to events or faults might easily be missed. Additionally, a higher and uniform sampling rate would allow treating the signal as continuous, opening opportunities for the use of other advanced ML techniques, such as recurrent neural networks (RNN) and long short-term memory (LSTM).

Table 5 Results for the rule mining

Extra sources of information
In the presented use case, the sensor information of the dies and the maintenance logs were used to assess the machine status. With the great availability of sensors nowadays, it is easy to keep track of all stages of the manufacturing process (such as room temperature, material properties, conditions of machines at later stages, and quality control, among others). This information could be valuable for assessing other aspects, e.g., product quality and product defects, as well as complex interactions between manufacturing stages.

Expert knowledge
Machine learning tries to learn the patterns from the data. However, experts have a lot of knowledge of the manufacturing process considered. Including this expert knowledge within the model design can provide great insight. For example, the AE patterns discovered in the anomalies and clusters could be further investigated and labelled, so that this information can also be included in the modelling. A further step is to move towards hybrid modelling approaches that combine data-driven models with physical and/or expert knowledge to further improve the models.
Concerning future work, we have identified the following opportunities:
- The presented AD is fully online, processing each sample as it arrives. In some scenarios, an immediate response is not required and therefore delayed predictions can be offered. In this case, the right MP can be considered, which provides additional information during the nearest neighbor search. The delayed predictions would also allow the calculation of the arc-count meta-series, which has been proven to be a good estimator for state transition problems and segmentation [8].
- The rule mining approach can be easily extended with new sources of information as long as they can be represented as discrete events, for example, material properties such as thickness, hardness, and temperature. At the same time, the rules approach can be used to predict other types of events, such as quality of the product or defective products.
- The classifier by association is a simple and efficient implementation. However, it can be extended with bootstrapping to create rules with higher accuracy and more robustness [16]. Currently, bootstrapping methods are hard to apply given the small number of faults recorded. As more data are collected in the next years, it will be possible to perform the required partitions.

Conclusions
We present a general approach for AD and event prediction for a cold forming manufacturing line. The models are fast, require little parameter tuning, and need no expert knowledge regarding the physics of the components. The anomalous events were manually analyzed and confirmed to be anomalous punches from the press. The event prediction is capable of detecting fault-related events that occur less than 0.05% of the time with a micro F1-score of 0.632.
Most importantly, this work provides evidence on the potential use of AE signals for diagnostic purposes in the manufacturing industry.