1 Introduction

In the age of Industry 4.0, the automation and digitization of manufacturing processes play an increasingly important role. With the help of sensors, a large number of process parameters are recorded in order to obtain as accurate a picture as possible of the current system status and to automate as many sub-processes as possible. Machine learning is a suitable tool for extracting useful information from this data to increase the efficiency of a manufacturing process. Here, models are trained on training data to automatically detect faults or plant conditions.

In today’s manufacturing plants, machine learning is used for a wide variety of tasks. For example, machine learning is often found in quality control at the end of a manufacturing step [1, 2]. With the help of images and previously trained models, defective parts can be automatically classified and sorted out. In this way, defective products can be identified at an early stage, sometimes even before the final inspection. Another area in which machine learning is already widely used in production is condition and process monitoring [3,4,5]. Condition monitoring or predictive maintenance can be used to monitor machines and systems and detect critical conditions before a machine breaks down. As a result, production downtimes can be avoided, maintenance costs reduced, and maintenance work as a whole made more plannable.

One topic where machine learning has hardly been used so far is the reduction of scrap by directly identifying and eliminating the causes of defects in a manufacturing process [6, 7]. Although a lot of data is often collected during the production of a part and large datasets are available, these are rarely used for optimization. Often, only the experience of the person responsible for the machine is relied on, or no optimization is carried out at all [8].

Especially in the interpretability of models, which is crucial for the identification of root causes, great progress has been made in recent years and new approaches have been explored. For example, in the image domain, many new methods exist, such as GradCAM [9], guided backpropagation [10] and integrated gradients [11], to gain insight into how models make decisions. Similarly, for tabular data, which is the most common in production, there are some new deep learning approaches based on the attention mechanism [12, 13] in addition to the traditional algorithms for computing feature importance [14, 15].

To the best of our knowledge, this paper is the first to compare different machine learning based approaches for the detection of root causes in production processes. For this purpose, various approaches for the determination of feature importance are first presented and then compared with each other on the basis of a real dataset from automated sensor production.

2 Theoretical Background

There are different approaches to determine the root causes of errors using machine learning methods. In this section, classical methods and methods based on the attention mechanism are explained in more detail.

2.1 Classical Methods

The classical methods use a classifier to determine which features are responsible for a high classification result. The worse the classifier performs without a feature, the more important this feature is for the classification. Since the goal of the classifier is to learn the usually complex relationship between features and result, it can be concluded that the features important for the classifier in error classification are also causes of errors in the real production plant. The classical methods work completely independently of the type of classifier or model. The two most important representatives are permutation feature importance and drop column feature importance.

Permutation feature importance is a concept introduced by [14] to determine the feature importance of trained models. It measures the decreasing accuracy \(A_{k,j}\) of a model when permuting a feature j compared to the reference accuracy \(A_{\text {ref}}\) of the original dataset \(D_{\text {test}}\). Permuting a feature breaks its relationship with the result. Accordingly, if the accuracy of a model \(A_{k,j}\) drops sharply after permutation, it is a feature that is important to the model. Insignificant features, on the other hand, are not relevant for the decision process of the model. If these features are permuted, the model ignores them because its decision is based on the important features. Consequently, the feature importance \(F_j\) can be calculated as

$$\begin{aligned} F_j = A_{\text {ref}} - \frac{1}{K} \sum _{k=1}^{K} A_{k,j} \end{aligned}$$
(1)

where K is a defined number of permutations to reduce the influence of random swapping [15, 16].
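Equation (1) can be sketched directly in code. The following is a minimal illustration on synthetic data, not the paper's production dataset; the dataset, model, and number of permutations K are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def permutation_importance(model, X_test, y_test, K=5, seed=0):
    """F_j = A_ref - (1/K) * sum_k A_{k,j}  (Eq. 1)."""
    rng = np.random.default_rng(seed)
    a_ref = accuracy_score(y_test, model.predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        scores = []
        for _ in range(K):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])  # break the feature-target relationship
            scores.append(accuracy_score(y_test, model.predict(X_perm)))
        importances[j] = a_ref - np.mean(scores)
    return importances

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
F = permutation_importance(model, X_te, y_te)
```

In practice, `sklearn.inspection.permutation_importance` implements the same idea and can be used instead of the hand-rolled loop.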

An obvious prerequisite for a feature importance that reflects reality is a well-predicting model: only if the model has learned to pay attention to the right features can a meaningful feature importance be derived.

Highly correlated features are problematic for the permutation feature importance algorithm [17]. Despite the permutation of a feature, the model still has access to it via the correlated feature. This leads to reduced feature importance values for both features, when in fact their importance could be much higher. Therefore, care should be taken with permutation feature importance to remove highly correlated features.

The drop column feature importance method goes one step further than the permutation feature importance method. The importance of a feature is not determined by the decrease in accuracy when that feature is permuted, but by the decrease in accuracy when the feature is removed entirely. A disadvantage is the high computational effort required: the model must be retrained for each feature. In addition, the algorithm exhibits the same shortcomings with regard to highly correlated features as the permutation feature importance method.
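A minimal sketch of the drop-column idea, again on synthetic stand-in data: for each feature the model is retrained from scratch without that column, and the importance is the resulting drop in accuracy relative to the full model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def drop_column_importance(model_factory, X_tr, y_tr, X_te, y_te):
    """Importance of feature j = accuracy drop after retraining without column j."""
    a_ref = accuracy_score(y_te, model_factory().fit(X_tr, y_tr).predict(X_te))
    importances = []
    for j in range(X_tr.shape[1]):
        X_tr_d = np.delete(X_tr, j, axis=1)  # retrain from scratch without feature j
        X_te_d = np.delete(X_te, j, axis=1)
        a_j = accuracy_score(y_te, model_factory().fit(X_tr_d, y_tr).predict(X_te_d))
        importances.append(a_ref - a_j)
    return np.array(importances)

X, y = make_classification(n_samples=400, n_features=5, n_informative=2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
imp = drop_column_importance(lambda: RandomForestClassifier(random_state=1),
                             X_tr, y_tr, X_te, y_te)
```

The `model_factory` callable makes the retraining explicit: a fresh, untrained model is created for every dropped column, which is exactly what makes the method expensive.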

2.2 Method Based on Attention Mechanism

The attention mechanism was introduced in 2015 by [18] as an improvement to neural machine translation systems in natural language processing (NLP). Since then, this revolutionary concept has been transforming the application of deep learning in a wide variety of fields [19]. Not only in NLP, but also in computer vision [20], multiple instance learning [21], and speech processing [22], numerous improvements have been achieved due to the attention mechanism.

Machine attention can be compared well with the cognitive attention of humans. The human brain is perfectly adapted to focus its attention only on the most important features despite an enormous flood of information. The attention mechanism implements the same principle for deep learning architectures, in that the neural network learns to pay attention only to the most important inputs [19]. This allows for more efficient learning and, by evaluating attention values, interpretability of DNN architectures.

TabNet is a neural network developed by Google LLC researchers for interpretable learning of tabular data based on the attention mechanism [12]. It uses a sequential attention mechanism to prioritize/select the correct features at each decision step.

Figure 1 shows the architecture of TabNet. It consists of \(N_{\text {Steps}}\) sequential decision steps. That is, the ith decision step operates on the results of the \((i-1)\)th decision step to select which features to use. This is done using learnable masks \(\mathbf {M[i]} \in \mathbb {R}^{B \times N_F}\). Each decision step is given the same \(N_F\)-dimensional features \(\textbf{F} \in \mathbb {R}^{B \times N_F}\), where B is the batch size and \(N_F\) is the number of features. The data are preprocessed only by group normalization (GN) [23].

Fig. 1. TabNet architecture, consisting of the encoder for classification. It is composed of several decision steps, each with a feature transformer, an attentive transformer, and a learnable selection mask. Based on [12].

The essential features are selected by multiplying the learnable mask \(\mathbf {M[i]}\) by the feature matrix \(\textbf{F}\). Since the values of the mask are \(\mathbf {M[i]} \in [0,\,1]\), this corresponds to a soft feature selection. The masks are computed by the attentive transformer from the processed features of the previous step \(\mathbf {A[i-1]}\) through

$$\begin{aligned} \mathbf {M[i]} = {\text {sparsemax}}\left( \mathbf {P[i-1]} \cdot {\text {h}}_{\textrm{i}}\left( \mathbf {A\left[ i-1\right] }\right) \right) . \end{aligned}$$
(2)

The sparsemax activation function [24] is an extension of the softmax activation function to obtain sparse results, i.e. in this case weakly populated masks. As a result, the mask selects only the most essential features. \({\text {h}}_{\textrm{i}}\) corresponds to a trainable function consisting of a fully-connected (FC) layer and a batch normalization. \(\mathbf {P[i]}\) is a scaling term that tells how many times a given feature has been used so far.
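The sparsemax function of [24] can be computed in closed form as a Euclidean projection onto the probability simplex. The sketch below follows the published algorithm (sort, find the support set, threshold); the input vector is an arbitrary example.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: projection of z onto the probability simplex.
    Unlike softmax, it can assign exact zeros, yielding sparse masks."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # sort in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum         # condition defining the support set
    k_z = k[support][-1]                        # size of the support
    tau = (cumsum[support][-1] - 1) / k_z       # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([2.0, 1.0, -1.0])
# p sums to 1 and zeroes out the small entries, unlike softmax
```

Applied to the mask logits in Eq. (2), this is what produces the weakly populated masks: entries below the data-dependent threshold are set exactly to zero rather than merely made small.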

The features selected by the mask are then processed by the feature transformer \({\text {f}}_{\textrm{i}}\), whose structure is described in detail in [12]. The feature transformer is followed by a split of the resulting matrix into the decision step output \(\mathbf {D[i]}\) and the information propagation \(\mathbf {A[i]}\) to

$$\begin{aligned} \left[ \mathbf {D[i]},\, \mathbf {A[i]}\right] = {\text {f}}_{\textrm{i}}(\mathbf {M[i]} \cdot \textbf{F}). \end{aligned}$$
(3)

Finally, the output \(\hat{y}\) of the TabNet model is composed of a sum of the individual decision step outputs \(\textbf{D}_{\text {out}} = \sum \nolimits _{i=1}^{N_{\text {steps}}} {\text {ReLU}}\left( \mathbf {D[i]}\right) \), a final FC layer, and a softmax function:

$$\begin{aligned} \hat{y} = {\text {softmax}}(\textbf{W}_{\text {final}}\textbf{D}_{\text {out}}). \end{aligned}$$
(4)

The importance of a feature is included in the learned feature selection masks \(\mathbf {M[i]}\). The feature importance can therefore be calculated from the weighted sum of the individual masks:

$$\begin{aligned} \textbf{M}_{\text {agg}\mathbf {-b,j}} = \frac{\sum _{i=1}^{N_{\text {steps}}} \eta _b[i]\textbf{M}_{\textbf{b,j}}\mathbf {[i]}}{\sum _{j=1}^{N_F} \sum _{i=1}^{N_{\text {steps}}} \eta _b[i]\textbf{M}_{\textbf{b,j}}\mathbf {[i]}}, \end{aligned}$$
(5)

where the denominator is added for normalization. To factor in the importance or decision power of an instance b of a decision step, the coefficient \(\eta _b[i]\) is used, which is calculated from the decision step outputs \(\mathbf {D[i]}\).
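The aggregation in Eq. (5) reduces to a weighted sum over decision steps followed by a normalization over features. The sketch below uses random stand-ins for the learned masks \(\mathbf {M[i]}\) and the step coefficients \(\eta _b[i]\), purely to show the tensor arithmetic.

```python
import numpy as np

# Toy sketch of Eq. (5): aggregate per-instance masks M[i] of shape (B, N_F)
# over N_steps decision steps, weighted by the step relevance eta_b[i].
rng = np.random.default_rng(0)
B, N_F, N_steps = 4, 6, 3
masks = rng.random((N_steps, B, N_F))   # stand-in for learned masks M[i]
eta = rng.random((N_steps, B))          # stand-in for eta_b[i], derived from D[i]

weighted = (eta[:, :, None] * masks).sum(axis=0)        # numerator, shape (B, N_F)
M_agg = weighted / weighted.sum(axis=1, keepdims=True)  # normalize over features j
```

Each row of `M_agg` sums to one, so the aggregated mask can be read directly as a per-instance feature importance distribution.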

3 Data

A real dataset from a production plant of a sensor element is available for the investigation and comparison of the presented methods for the determination of failure causes. The production line consists of several production steps, including bonding, coil winding and welding processes, and contains several functional tests as well as quality controls. During the development of the production line, a special focus was placed on the generation, storage and display of process data. Therefore, during the production of a part, data such as process states, pressures, positions, distances, measurement results and other values are measured and stored. In addition, error codes are automatically generated in the event of errors. This enables separate investigation of the individual defect cases. In total, a dataset with 726,004 product instances is available.

During the analysis and pre-processing of the data, a frequent absence of parameters in the event of a defect was noticed. If a defect is detected in an early production step, the affected part is sorted out, further production steps are omitted and many parameters are not measured or stored. This is problematic for determining the root causes of defects, since the missing parameters cannot be used to determine the cause. Replacing them with specific values or distributions would cause the classifier to classify according to these values or distributions, breaking the actual relationship between the features and the output value. A calculation of the feature importance would be falsified by this. Table 1 shows how many features and instances the datasets still contain after sorting out for the most common error cases.

Table 1. Size of the remaining datasets in the most frequent error cases.

A calculation of the feature importance of all faulty instances against the correct instances, as well as a multi-class classification, is only possible either with the smallest common set of remaining features or with few instances. Otherwise, some error cases could be classified trivially, for the reasons described above, and thus falsify the result. We therefore chose to investigate the error cases individually. In the following, we present the results for error code A (position of coil wire before welding out of order) as an example.

4 Results

For the determination of fault causes by means of machine learning, two criteria are particularly decisive: first, how well the model associated with the method can classify the fault case, and second, the quality of the feature importance ranking determined with the respective method. In addition to the methods already presented, the techniques ReliefF and mutual information, which are known from the related topic of feature selection, as well as the Gini index automatically computed by the random forest classifier, are evaluated. They serve as a basis for comparison of the different methods.

In a first step, the classification results of the different approaches are examined. For this purpose, the mean and standard deviation of the metrics balanced accuracy \(A_{\text {bal}}\), sensitivity of error detection \(\text {TPR}\), and specificity \(\text {TNR}\) are calculated for a random forest (RF) classifier, a support vector machine (SVM), and the TabNet model with a 5-fold cross-validation. The respective results are shown in Table 2. The comparison clearly shows that the random forest classifier gives the best classification results. For this reason, the feature importance rankings obtained using this classifier are expected to be the most reliable. The random forest classifier is followed by the TabNet model and the support vector machine in terms of their suitability.

Table 2. Comparison of classification results for error code A achieved with the support vector machine (SVM), random forest (RF) and TabNet models. The results are determined with a 5-fold cross validation. The mean value plus/minus the standard deviation for the test dataset is given in each case.
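The metrics in Table 2 (balanced accuracy, TPR, TNR, each as mean plus/minus standard deviation over a 5-fold cross-validation) can be computed with scikit-learn as sketched below. The production dataset is not public, so synthetic imbalanced data stands in for it, and only the random forest model is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score, make_scorer
from sklearn.model_selection import cross_validate

# Imbalanced stand-in data: ~10% "error" instances, like a defect class.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

scoring = {
    "bal_acc": make_scorer(balanced_accuracy_score),
    "tpr": make_scorer(recall_score, pos_label=1),  # sensitivity of error detection
    "tnr": make_scorer(recall_score, pos_label=0),  # specificity
}
cv = cross_validate(RandomForestClassifier(random_state=0), X, y,
                    cv=5, scoring=scoring)
for name in scoring:
    s = cv[f"test_{name}"]
    print(f"{name}: {s.mean():.3f} +/- {s.std():.3f}")
```

Specificity has no dedicated scorer in scikit-learn; computing recall with `pos_label=0` yields the true negative rate directly.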

One of the most important investigations is to determine the quality of the feature rankings. For this purpose, random forest classifiers are trained with a dataset \(f_j\) containing the j features with the highest importance and their classification result is computed. Random forest classifiers are suitable for this task because their training process is fast enough to obtain acceptable computation times and, as stated in the previous paragraph, they have a good classification result. The features are added one by one according to the rankings computed by the methods until finally the whole dataset F is used for training. The quality \(Q(f_j)\) is defined as

$$\begin{aligned} Q(f_j) = \frac{A_{\text {bal}}(f_j)}{A_{\text {bal}}(F)}. \end{aligned}$$
(6)
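The ranking-quality curve defined by Eq. (6) can be generated as follows. The feature ranking here is a hypothetical example order, not one of the paper's computed rankings, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

def bal_acc(cols):
    """Balanced accuracy of a random forest trained on the given columns only."""
    clf = RandomForestClassifier(random_state=2).fit(X_tr[:, cols], y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te[:, cols]))

ranking = [3, 0, 5, 1, 7, 2, 6, 4]  # hypothetical importance ranking of the 8 features
a_full = bal_acc(list(range(X.shape[1])))          # A_bal(F), full feature set
Q = [bal_acc(ranking[: j + 1]) / a_full            # Q(f_j) = A_bal(f_j) / A_bal(F)
     for j in range(len(ranking))]
```

A good ranking shows Q rising quickly toward 1 within the first few features; late jumps, as observed for the Gini ranking in Fig. 2, indicate that important features were placed near the end.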
Fig. 2. Quality of the feature importance rankings of the different methods for error code A.

Figure 2 shows the result for error code A. The first observation is that the drop column feature importance method produces the best results; possible causes of this error case can therefore best be determined with this method. It is closely followed by the ranking of the permutation feature importance, which is plausible given the strong similarities between the two methods. In contrast, the Gini feature importance algorithm performs worse. Although the approach places the correct features in the top ranks, some important features appear only at the end of its ranking, as can be seen from the late jumps in ranking quality once many features are already in use. The opposite is the case for the rankings determined with the TabNet model and with mutual information: these rank incorrect features as most important, so their quality \(Q(f_j)\) is initially lower and jumps in the curve occur only after several features j have been added.

5 Discussion

The comparison of the classifiers and of the quality of the feature importance rankings shows that the best results for the studied error case can be obtained with a random forest classifier in combination with the drop column feature importance. This is consistent with the results of the other error cases. For error code A, this corresponds to the result shown in Fig. 3.

Fig. 3. Drop column feature importance for error code A.

The two most important features, position of the left/right wire, correspond directly to the error description of the error code, wrong position of the coil wire before welding. Since the error is determined directly from these features, this is a plausible result. The subsequent features are of particular interest: they allow conclusions to be drawn about possible causes or consequences of the defect. In the following, these features are considered individually:

  1. Wire spacing of the coil outgoing wires: This feature is measured in an early process step, the winding of the coil. It is therefore recorded before the parts are assembled and before the defect occurs. A large wire gap near the tolerance limit can cause the wire to be out of position before welding, after the parts are joined. This feature can thus be a cause of the defect.

  2. Lead number of transfer track: The lead number of the transfer track is closely related to the next-ranked feature, the workpiece carrier number. In the production line, there are two different transfer tracks and several workpiece carriers. Due to an even number of workpiece carriers, these are always assigned to one transfer track. If a workpiece carrier produces worse parts, the defects are unevenly distributed among the transfer tracks.

  3. Workpiece carrier number: As mentioned for the previous feature, there are several workpiece carriers. It is quite possible that one or more workpiece carriers negatively influence the parts.

  4. Maximum contact force of the coil into the ferrite: This feature is recorded when the coil is pressed into the ferrite. An effect on the position of the coil wire is therefore possible.

  5. Contact path of the coil into the ferrite: The contact path of the coil is closely related to the maximum contact force of the coil into the ferrite. The same conclusions apply.

  6. Distance of the left edge to the core opening: The distances of the edges to the core opening are measured in the same step as the position of the coil wire and serve as its reference. A direct connection is thus given.

All the above features are directly related to the position of the coil wire. It is therefore quite possible that they are causes of this error case.

6 Conclusion

In this paper, we compared feature-based approaches with a deep learning method based on the attention mechanism to determine root causes in an automated sensor manufacturing process. The results achieved on a dataset containing real production data show that the best results can be obtained with a random forest classifier in combination with the drop column feature importance method. This made it possible to determine potential root causes, as illustrated with one exemplary error case. All identified root causes are directly related to the error case and can now be optimized in further investigations.

This paper also revealed the challenges in the collection of meaningful data. Especially in fault cases, many parameters are not written to the database. As a result, many parameters in some error cases contain no or only a few measured values and are therefore not usable for the evaluation of this error case.