Punching is a widespread production process that is applied when large quantities of identical, inexpensive parts are needed [1]. One important quality indicator of parts manufactured by punching is the burnish height or burnish surface area, which is particularly important for electrical connectors or parts with sealing purposes. A burnish surface area that is as large and continuous as possible is desirable. Note that, as shown in Fig. 1, the burnish height is currently defined in the profile section, not in the surface view. This is the case for many other quality indicators as well [2].

In contrast to the highly economical production process, the evaluation of the cutting surface is still cost-intensive and time-consuming. Currently used evaluation methods include metallography, confocal microscopy, tactile measuring systems, or motorized measurement devices [3], which require parts to be taken out of the production process and analyzed separately. This also means that continuous quality control cannot be guaranteed. To counteract this, we have developed an inline monitoring system which is capable of acquiring an image of each punching surface directly after it emerges from the punching tool [4]. We also developed an automated image processing method for the segmentation of the burnish height by an active contours algorithm. This approach showed promising results in terms of accuracy and prescriptive recognition of tool wear. However, the processing of an image (cf. Fig. 2) takes 40–60 seconds, which is not acceptable with a cycle time of 80–240 ms; note that stroke rates of up to 1000 strokes per minute are possible [1, 3]. Furthermore, this algorithm cannot recognize multiple burnish surface regions, which can occur in many random forms, depending on factors such as the material combination of punch and sheet metal, fluctuations in the tensile strength of the sheet metal, or the location and geometry of punch edge failures. The exact causal relationship between these factors and the occurrence of disturbances in the burnish surface (such as tear sections or holes, as shown later in Fig. 4) has not yet been investigated. However, since these disturbances do occur in real burnish surfaces, they have to be addressed by an image processing algorithm.

Fig. 1

Definition of the cutting surface parameters [2]

Fig. 2

Image of a produced punching part, captured by an inline monitoring system. The image corresponds to phase 1 as described in the text

Here, we therefore consider a neural-network-based approach for the image segmentation problem. Most generally, machine learning techniques for such tasks can be divided into instance segmentation and semantic segmentation methods: instance segmentation treats multiple objects of the same class as separate instances, whereas semantic segmentation assigns every pixel in the input image to a category class without distinguishing between instances. Since an algorithm for identifying the burnish surface should assign all burnish pixels to the same class, we focus on semantic segmentation networks in the following.

Neural networks have already shown promising results for segmentation tasks in terms of accuracy and processing time, for instance, in biomedical image processing [5,6,7]. Although medical imaging is often subject to greater noise compared to other areas involving image processing, it is possible to recognize and segment tumours even in the most diverse organs. Currently, segmentation networks are attracting interest for the monitoring of manufacturing processes, although collecting and preparing data for the training remains a time-consuming and cost-intensive process [8]. Recent examples for the use of segmentation networks in manufacturing are the measurement of the strip position in a rolling mill production [9] and the detection of surface defects in a steel mill production [8, 10]. Lin et al. [11] and Bergs et al. [12] also developed a method for wear detection of milling tools, while Scime et al. [13] showcased the monitoring of additive manufacturing processes.

The goal of our study is to analyze and optimize networks for the processing of image data to segment the burnish surface of punching parts. The segmentation needs to be accurate even in the presence of multiple disconnected burnish parts and should be realizable within an inference time below 80 ms to be suitable for inline-quality control. To this end, we adapt a network architecture that was originally developed for medical image processing.

1 Materials and methods

1.1 Burnish surface

For our purpose, i.e. online quality control, an accurate measurement of the burnish surface—in particular its height—is of primary importance. Therefore, the segmentation must not only be accurate regarding the covered area, but the shape of the burnish surface as well. In particular, the boundary, i.e. the transition between burnish part and fracture, needs to be identified accurately.

Fig. 3

In a survey, 12 experts were asked to draw the transition line between the burnish and the fracture surface. The scattering of the lines shows that there is no consensus, but rather different application-related approaches

Fig. 4

During manual labelling, we distinguished between the main section (1), tear sections (2), and the background (3). Roman numbers denote different components used for the metric evaluation

However, it is important to note that determining the burnish height in the surface view is not only a technical difficulty but also a conceptual one, since there is no standardized definition of the burnish surface. To demonstrate this lack of a commonly accepted definition, we carried out a survey, asking 12 industry experts to mark the transition between the burnish surface and the fracture surface, according to their understanding, in the surface view of a punching part. The results, which are shown in Fig. 3, suggest that there is no clear consensus; rather, the individual definitions of the burnish part are highly dependent on the component produced and its application. However, by investigating the overlapping main characteristics of the different experts’ segmentations, we conclude that in a surface-view image, the burnish part

  • is brightly illuminated,

  • is fluctuating over its length,

  • has a structure with vertical grooves,

  • can have holes,

  • can have multiple tear sections,

  • can increase or decrease in height over the width of the part.

Our manual labelling of the image dataset, as described below (cf. Figure 4), is therefore based on these criteria and the definitions given in [2]. We also note that generally, fewer holes in the burnish part (or none) are favourable for most produced parts, as are fewer tear sections. In particular, an increasing number of holes or tear sections and a decreasing burnish height are a sign of wear on the punching tool.

1.2 Dataset

Images for training, validation, and testing were captured with a monitoring system [4] within the punching process at a resolution of 1280 × 1024 pixels in greyscale. They were taken during a material test where a punch failure occurred on the left-hand side of the images. Overall, 17000 images were captured during this test. In the images, the burnish surface is brightly illuminated, is textured with vertical grooves, and shows an inhomogeneous transition to other cutting surface parts. Tear sections of the burnish surface occurred as well, caused by parameter fluctuation or punch failure.

Table 1 Data augmentation structure of training and evaluation data; the listed colours are used in Fig. 7

Disjoint subsets of these images were chosen as the training and test data. In order to represent the ongoing wear within a punch lifetime, images were taken from different phases within the dataset: phase one contains images with uniform wear rate and consistent burnish height, apart from natural fluctuation; phase two contains images with progressed wear rate and therefore decreasing or increasing burnish height; phase three contains images of parts that were produced with a damaged punch and show tear-off within the burnish height. In total, 415 images were selected for the dataset. A ground truth mask image was created for each image of the dataset by manually segmenting every section of the burnish surface according to the criteria specified above. In particular, the labels provide a per-pixel partition of each image into the classes burnish surface and background based on expert knowledge. Since all parts, and thereby all images, were produced with the same tool and the same parameters, there is a high risk of overfitting to features from this particular process. Since the segmentation should ideally be applicable to images from different processes without re-training (cf. Sect. 4), we try to avoid this effect by extending the dataset via augmentation methods: each image and its corresponding ground truth mask were duplicated and altered with different operations. These consist of changing brightness values to represent different material combinations, vertical mirroring to change the location of the tears or defects, and scaling of the images to represent different material thicknesses. For simulating thinner materials, the images were compressed along the height axis and inserted into an image with the same background noise to preserve dimensions; for thicker materials, the images were scaled by factors of 1.5 and 3 and clipped randomly along the cutting surface such that they would have the same ratio of pixels below and above, as would be expected from the images of the monitoring system. The full data augmentation structure can be seen in Table 1. Note that since each augmentation technique simulates a difference in material properties, we will consider each of these subcategories individually for our evaluation.

Overall, the image augmentation expanded the dataset to 10086 images, divided into training (6052 images), validation (2017 images) and test (2017 images). Finally, to decrease training time, all images were rescaled to \(256\times 256\) pixels. Although a higher resolution might be more suitable for precise measurement tasks, the segmentation functionality can still be analyzed with this reduced image size.

The ratio between background (BG) and foreground (FG) in the image dataset, which is important for the choice of a network and loss function, shows a mild imbalance with a ratio of 9:1, which could increase to 20:1 in applications depending on the specifications of the monitoring system.
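
As an illustration of how such an imbalance ratio can be obtained from a label mask, the following minimal sketch counts background pixels per foreground pixel. The function name `bg_fg_ratio` and the toy mask are hypothetical and only serve as an example; they are not part of the original pipeline.

```python
def bg_fg_ratio(mask):
    """Return the number of background pixels per foreground pixel.

    `mask` is a list of rows; 1 marks burnish surface (FG), 0 marks background (BG).
    """
    fg = sum(sum(row) for row in mask)
    total = sum(len(row) for row in mask)
    if fg == 0:
        raise ValueError("mask contains no foreground pixels")
    return (total - fg) / fg


# Hypothetical toy mask: 2 foreground pixels out of 20, i.e. a BG:FG ratio of 9:1
toy_mask = [[0] * 10, [1, 1] + [0] * 8]
print(bg_fg_ratio(toy_mask))
```

For a whole dataset, the pixel counts would be accumulated over all masks before forming the ratio.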

1.3 Evaluation metrics and loss functions

1.3.1 Evaluation metric

In order to assess the quality of our neural network based image processing approach, it is crucial to select an appropriate evaluation metric to measure the accuracy of the area identified as the burnish surface by the neural network regarding the ground truth labels (i.e. the actual burnish surface in the image according to expert knowledge).

To this end, a combined metric (CM) has been created to evaluate the predictions according to our definition. As indicated above, the total size and, in particular, the height of the burnish part is an important quality indicator. For quantification of the burnish height, however, it is important to obtain a precise segmentation of the boundary. Furthermore, the metric should allow for weighting based on the size and the number of tear sections found, which also play an important role for assessing the part quality.

For a region-based metric, we selected the Dice similarity coefficient

$$\begin{aligned} {\text {DSC}}(G,S) =\dfrac{2 |G \cap S|}{|G| + |S|} \,; \end{aligned}$$
(1)

here and in the following, S and G represent the burnish surface according to the segmentation algorithm and the ground truth, respectively, with \(|X|\) denoting the number of pixels in a subset X of the image. Note that \(0\le {\text {DSC}}(G,S)\le 1\) and that the maximum value 1 is attained if and only if the predicted area S and the ground truth region G are identical.
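
Eq. (1) can be sketched directly on pixel sets. In the following minimal example, regions are represented as Python sets of (row, column) tuples; this representation, the function name `dice_coefficient`, and the handling of two empty masks are assumptions made for illustration only.

```python
def dice_coefficient(G, S):
    """Dice similarity coefficient of two pixel sets, following Eq. (1)."""
    if not G and not S:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2 * len(G & S) / (len(G) + len(S))


# Two 2x2 squares that share one column of two pixels
G = {(0, 0), (0, 1), (1, 0), (1, 1)}
S = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice_coefficient(G, S))  # 2 * 2 / (4 + 4) = 0.5
```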

For the boundary-based part, the normalized surface distance

$$\begin{aligned} {\text {NSD}}(G,S,\tau ) =\dfrac{|\partial G \cap \partial S^{(\tau )}| + |\partial S \cap \partial G^{(\tau )}|}{|\partial G| + |\partial S|} \end{aligned}$$
(2)

was used, where \(\partial G,\partial S\) denote the boundaries of the ground truth and the segmentation surface, and \(\partial S^{(\tau )},\partial G^{(\tau )}\) represent the border regions at tolerance \(\tau \), i.e. the set of pixels whose distance from the respective boundary is less than or equal to \(\tau \). Note that for \(\tau =0\), this metric only accounts for the predicted boundary pixels which match the ground truth boundary exactly, whereas higher tolerance values do not distinguish between an approximate and an exact boundary match.
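
A brute-force sketch of Eq. (2) is given below. It assumes that the boundary of a region consists of those region pixels with at least one four-way neighbour outside the region, and that distances are Euclidean; neither choice is specified in the text, and the helper names are hypothetical. The quadratic search is only suitable for small examples.

```python
import math

NEIGHBOURS_4 = ((1, 0), (-1, 0), (0, 1), (0, -1))


def boundary(region):
    """Pixels of `region` with at least one 4-neighbour outside the region."""
    return {
        (r, c)
        for (r, c) in region
        if any((r + dr, c + dc) not in region for dr, dc in NEIGHBOURS_4)
    }


def within_tolerance(pixel, border, tau):
    """True if `pixel` lies at Euclidean distance <= tau from some border pixel."""
    return any(math.hypot(pixel[0] - b[0], pixel[1] - b[1]) <= tau for b in border)


def nsd(G, S, tau):
    """Normalized surface distance of Eq. (2) for pixel sets G and S."""
    dG, dS = boundary(G), boundary(S)
    if not dG and not dS:
        return 1.0  # assumption: two empty masks count as a perfect match
    hits_g = sum(within_tolerance(p, dS, tau) for p in dG)
    hits_s = sum(within_tolerance(p, dG, tau) for p in dS)
    return (hits_g + hits_s) / (len(dG) + len(dS))


# A 3x3 square and the same square shifted by one column
G = {(r, c) for r in range(3) for c in range(3)}
S = {(r, c + 1) for (r, c) in G}
print(nsd(G, S, 0), nsd(G, S, 1))  # 0.5 at tau = 0, 1.0 at tau = 1
```

The example shows the effect described above: a boundary that is off by one pixel is only partially rewarded at \(\tau =0\) but fully rewarded at \(\tau =1\).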

Fig. 5

Different metric scores for images of over-segmentation and under-segmentation

Finally, the combined metric

$$\begin{aligned} {\text {CM}}(G,S,\tau _1,\tau _2)&=\alpha \,{\text {DSC}}(G,S)+ \beta \,{\text {NSD}}(G,S,\tau _1)\nonumber \\&\quad +\gamma \,{\text {NSD}}(G,S,\tau _2) \end{aligned}$$
(3)

considers both the region-based DSC and the boundary-based NSD. By selecting the weight factors \(\alpha ,\beta ,\gamma \) and the tolerances \(\tau _1,\tau _2\), this metric prioritizes either the overlap between the identified area and the ground truth burnish surface (for higher values of \(\alpha \)) or the accuracy of the predicted outlines of the area.

In the following, we choose the tolerances \(\tau _{1}=0\) and \(\tau _{2}=1\), which leads to both a positive evaluation for predicted outlines close to the actual boundary and an additional distinction between an approximate and an exact boundary match. Using the weights \(\alpha =0.5\), \(\beta ={0.45}\) and \(\gamma ={0.05}\), we put equal emphasis on the area overlap measured by DSC and the boundary matching via NSD. Figure 5 shows the behaviour of the combined metric for different degrees of deviation from the ground truth image.
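
With these weights, Eq. (3) reduces to a simple weighted sum. The sketch below takes precomputed DSC and NSD scores as inputs; the function name `combined_metric` and the example scores are hypothetical.

```python
def combined_metric(dsc, nsd_tau1, nsd_tau2, alpha=0.5, beta=0.45, gamma=0.05):
    """Combined metric of Eq. (3) from precomputed DSC and NSD scores."""
    return alpha * dsc + beta * nsd_tau1 + gamma * nsd_tau2


# Hypothetical scores: good overlap, half of the boundary exact, all of it within one pixel
print(combined_metric(dsc=0.8, nsd_tau1=0.5, nsd_tau2=1.0))
```

Since \(\beta +\gamma =\alpha =0.5\), the two NSD terms together carry the same weight as the DSC term, and a perfect prediction yields a score of 1.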

Note that the number of tear sections is not explicitly taken into account by the combined metric, which therefore needs to be calculated for each section individually. Here and in the following, sections are defined as four-way connected areas of pixels, with one component for each tear section [14]. Overlapping tear sections in the prediction and ground truth mask are combined into one component. The combined metric is then calculated separately for each of the found components. The metric scores of the components are then weighted in relation to the area of the respective component and summed up; thus, larger tear sections have a greater influence on the overall metric score than smaller ones.
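
The component-wise evaluation described above can be sketched as follows. The example makes two assumptions that are not stated in the text: components are taken from the union of both masks, so that overlapping sections of prediction and ground truth fall into the same component, and the area weights are normalized to sum to one. For brevity, the Dice coefficient stands in for the combined metric as the per-component score; all names are hypothetical.

```python
from collections import deque


def components_4(pixels):
    """Split a set of (row, col) pixels into four-way connected components."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return components


def area_weighted_score(G, S, score):
    """Average score(G_k, S_k) over all components k, weighted by component area."""
    comps = components_4(G | S)  # assumption: components of the union of both masks
    total_area = sum(len(c) for c in comps)
    if total_area == 0:
        return 1.0
    return sum(len(c) * score(G & c, S & c) for c in comps) / total_area


def dice(g, s):
    """Per-component stand-in score; the combined metric could be used instead."""
    return 2 * len(g & s) / (len(g) + len(s)) if (g or s) else 1.0


# Ground truth: main section (6 pixels) and one tear section (2 pixels);
# the prediction matches the main section but misses the tear section.
G = {(0, c) for c in range(6)} | {(3, 0), (3, 1)}
S = {(0, c) for c in range(6)}
print(area_weighted_score(G, S, dice))  # (6 * 1.0 + 2 * 0.0) / 8 = 0.75
```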

While this metric already allows for a general assessment of the accuracy of the prediction, some topological information (e.g. tear sections that are either missing or newly added in the prediction) is not taken into account. To address this problem, we consider the following additional metrics:

  • the ratio between the predicted burnish surface to the ground truth area,

  • the percentage of tear sections that could be mapped to components of the ground truth,

  • the ratio between predicted and true tear sections,

  • the ratio between predicted and true holes.

Here, the term “hole” is defined as an eight-way-connected area which is surrounded by pixels that belong to a different class. These four expansions are well-suited for this work to represent the different properties of the predictions.
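
One possible way to count holes in a binary mask is sketched below. It interprets a hole as an eight-way connected group of background pixels that does not reach the image border; this interpretation and the function name `count_holes` are assumptions and may differ from the implementation used for the evaluation.

```python
from collections import deque

NEIGHBOURS_8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def count_holes(mask):
    """Count 8-connected background regions that do not touch the image border."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    holes = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if mask[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one background region and record whether it reaches the border
            seen[r0][c0] = True
            queue, touches_border = deque([(r0, c0)]), False
            while queue:
                r, c = queue.popleft()
                if r in (0, rows - 1) or c in (0, cols - 1):
                    touches_border = True
                for dr, dc in NEIGHBOURS_8:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if not touches_border:
                holes += 1
    return holes


# Toy mask: a closed ring of foreground pixels enclosing a single background pixel
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(count_holes(ring))  # 1
```

The hole ratio listed above would then follow by dividing the count for the prediction by the count for the ground truth, with a separate convention needed for ground truth masks without holes.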

Of course, it is possible for an end user evaluating a segmentation method (such as a neural network) for a specific task to decide based on all the above criteria. In this case, depending on which score is more significant for the task at hand, higher importance can be assigned to particular metrics. For fully automated hyperparameter optimization, however, it would be necessary to aggregate the individual scores into a single metric, e.g., via a weighted sum.

1.3.2 Loss function

During the actual training of the neural network for given hyperparameters, the parameters of the network are modified to minimize a loss function over the set of training data. Choosing a suitable loss function is therefore of major importance to ensure that the prediction by the neural network accurately corresponds to the ground truth. In a previous analysis, [15] compared multiple loss functions on four segmentation tasks. For a dataset containing liver and liver tumour images, which can be considered similar to our dataset based on the BG:FG ratio, a compound loss with a Dice-related component proved suitable. The Dice loss is a region-based loss function that penalizes the mismatched regions between ground truth and prediction, similar to the Dice similarity coefficient. For the general case of images with N pixels and C distinct classes, the Dice loss can be defined by [15, 16]

$$\begin{aligned} L_{\text {Dice}}= 1-\dfrac{2\,\sum _{c=1}^{C} \sum _{i=1}^{N} g_{i}^{c}s_{i}^{c}}{\sum _{c=1}^{C} \sum _{i=1}^{N} g_{i}^{c} + \sum _{c=1}^{C} \sum _{i=1}^{N} s_{i}^{c}} \,, \end{aligned}$$
(4)

where \(g_i^c\) denotes the ground truth binary indicator of class c for pixel i and \(s_i^c\) is the corresponding output confidence of the neural network. Note that if only the burnish surface class with ground truth indicator g is considered, and if the output s is binary, then the Dice loss can be simplified to

$$\begin{aligned} L_{\text {Dice}}&= 1 - \dfrac{2\,\sum _{i=1}^{N} g_i s_i}{\sum _{i=1}^{N} g_i + \sum _{i=1}^{N} s_i}\\&= 1 - \dfrac{2 |G \cap S|}{|G| + |S|} \;=\; 1 - {\text {DSC}}(G,S) \,, \end{aligned}$$

where \({\text {DSC}}\) denotes the Dice similarity coefficient as defined in Eq. (1).
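
The relation between Eq. (4) and the Dice similarity coefficient can be checked numerically. The following sketch stores the indicators and confidences as nested lists indexed by class and pixel; this layout and the function name `dice_loss` are assumptions for illustration, and at least one non-zero entry is assumed so that the denominator does not vanish.

```python
def dice_loss(g, s):
    """Dice loss of Eq. (4); g[c][i] and s[c][i] refer to class c and pixel i."""
    intersection = sum(gi * si for gc, sc in zip(g, s) for gi, si in zip(gc, sc))
    denominator = sum(map(sum, g)) + sum(map(sum, s))
    return 1.0 - 2.0 * intersection / denominator


g = [[1, 1, 0, 0]]               # single class, four pixels
s_binary = [[1, 0, 1, 0]]        # binary prediction: |G ∩ S| = 1, |G| = |S| = 2
s_soft = [[0.9, 0.6, 0.2, 0.1]]  # confidences as produced by a network

print(dice_loss(g, s_binary))    # 1 - DSC = 1 - 0.5 = 0.5
print(dice_loss(g, s_soft))
```

For the binary prediction, the loss equals \(1-{\text {DSC}}(G,S)\), whereas the soft prediction yields a value that varies continuously with the confidences, which is what makes the loss usable for gradient-based training.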

While the DiceTopK-loss showed particularly promising results in the study by [15], the burnish surface identification problem requires a different approach due to the importance of the transition between burnish and fractured part. In order to emphasize the boundary of the burnish surface over its area distribution, we therefore selected the DiceBD loss [15, 17]

$$\begin{aligned} L_{\text {DiceBD}}\;&=\; L_{\text {Dice}} + L_{\text {BD}}\,, \end{aligned}$$
(5)

which combines the Dice loss with the BD loss

$$\begin{aligned} L_{\text {BD}}&=\;\sum _{i=1}^N \phi _i\,s_i\,. \end{aligned}$$
(6)

Here, \(\phi _i\) denotes the level set representation of the boundary \(\partial G\) of the ground truth region, defined by

$$\begin{aligned} \phi _i = {\left\{ \begin{array}{ll} {-}{\text {dist}}(i,\partial G)&{}\text {if }\; i\in G,\\ {\text {dist}}(i,\partial G)&{}\text {if }\; i\notin G\,, \end{array}\right. } \end{aligned}$$
(7)

where \({\text {dist}}(i,\partial G)\) is the distance between a pixel i and the boundary \(\partial G\) [15, 18].
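
Eqs. (6) and (7) can be illustrated on a small grid. The sketch below assumes a Euclidean distance, a boundary made up of the ground truth pixels that have a four-way neighbour outside the region, and a non-empty ground truth region; these are choices made for the example, and the helper names are hypothetical. The brute-force distance computation is only meant for toy inputs.

```python
import math


def boundary(region):
    """Pixels of `region` with at least one 4-neighbour outside the region."""
    return {
        (r, c)
        for (r, c) in region
        if any(nb not in region for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)))
    }


def level_set(G, height, width):
    """Signed distance map of Eq. (7): negative inside G, positive outside."""
    dG = boundary(G)  # assumes G is non-empty
    phi = {}
    for r in range(height):
        for c in range(width):
            d = min(math.hypot(r - br, c - bc) for br, bc in dG)
            phi[(r, c)] = -d if (r, c) in G else d
    return phi


def boundary_loss(phi, s):
    """BD loss of Eq. (6) for per-pixel confidences s."""
    return sum(phi[p] * s[p] for p in phi)


# 5x5 image with a 3x3 ground truth square in the centre
G = {(r, c) for r in range(1, 4) for c in range(1, 4)}
phi = level_set(G, 5, 5)
perfect = {p: 1.0 if p in G else 0.0 for p in phi}
everything = {p: 1.0 for p in phi}
print(boundary_loss(phi, perfect), boundary_loss(phi, everything))
```

Confidence placed outside the ground truth region is penalized in proportion to its distance from the boundary, so the prediction that covers the whole image receives a much larger loss than the exact one.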

1.4 Network architecture

For our purpose, it seems reasonable to use a neural network architecture that was specifically developed for processing monochrome images. In particular, we consider several network structures that have been previously employed, or even originally developed, for medical image segmentation tasks. First, neural networks from four selected types of architecture are trained, analyzed and compared on the given dataset. Afterwards, the network that provides the best performance is analyzed and developed further. The chosen architectures are SegNet [19], UNet++ [6], MedT [20] and nnU-Net [7].

SegNet was originally developed for road scene segmentation, with a focus on low memory consumption and efficient computation [19]. Therefore, this architecture contains fewer trainable parameters than UNet++ or MedT. SegNet’s main novelty lies in the decoder upsampling: the decoder reuses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.

UNet++ [6] is an extension of U-Net [5], which is built for including data augmentation to effectively learn from datasets with very few labelled images. The classical UNet++ network consists of five layers. The encoder is called backbone and, compared to U-Net, contains additional skip-connections to the decoder in the form of a pyramid structure, which is supposed to overcome the problem that the outputs of the simple skip-connections in U-Net are too different in kind. In addition, deep supervision is introduced into the learning process (cf. Fig. 6).

Fig. 6

Structure of UNet++ [6]

MedT consists of a global subnetwork with two layers and a local subnetwork with five layers. The global subnetwork processes the complete input, whereas for the local subnetwork, the input images are divided into 16 parts which are processed individually and then reassembled. We selected MedT explicitly as an alternative to classical CNN approaches, since this architecture does not consist solely of convolutions, but includes gated axial-attention layers to act as the main processing units. In addition, the composition of a global and a local branch ensures that the local subnetwork is effectively trained with more images, which can be advantageous for smaller datasets such as the ones considered here. Furthermore, due to the splitting of the input in the local subnetwork, the positional variance of the image content is automatically included in the training process, and the network is confronted with images with different brightness gradients [20].

Finally, nnU-Net is a self-configuring method for medical image segmentation which automatically generates an architecture layout, with training and post-processing based on interdependent rules and empirical decisions. It is publicly available and scored best in multiple biomedical segmentation competitions [7].

1.5 Hardware and training

Training and evaluation were implemented in PyTorch [21] with mixed precision and performed on an NVIDIA Quadro RTX 5000. Because of the differences in memory consumption between the networks, different batch sizes had to be used. Every network was trained for 100 epochs. The learning rate started at 3e-4 and was multiplied by 0.2 whenever the moving average of the training loss had stagnated for 20 epochs, until a minimum learning rate of 1e-6 had been reached. The training of nnU-Net was performed in its own framework [7].
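
The learning rate schedule can be sketched in plain Python as follows. The class `PlateauSchedule` is hypothetical, and stagnation is interpreted here as 20 consecutive epochs without a new minimum of the monitored value; the exact stagnation criterion and the scheduler implementation that was actually used are not stated in the text.

```python
class PlateauSchedule:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=3e-4, factor=0.2, patience=20, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, monitored_loss):
        """Update the schedule with the (averaged) loss of one epoch; return the new rate."""
        if monitored_loss < self.best:
            self.best = monitored_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.stale_epochs = 0
        return self.lr


# Worst case: the loss never improves during all 100 epochs
schedule = PlateauSchedule()
lrs = [schedule.step(1.0) for _ in range(100)]
print(lrs[0], lrs[-1])
```

In this worst case, the rate is reduced four times within 100 epochs and the last reduction is clipped to the minimum of 1e-6.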

2 Network analysis

2.1 Comparison of different architectures

After training an instance of each architecture type, the metric scores were calculated for each augmentation subcategory (cf. Table 1) of the test dataset. The course of the loss function during training shows a successful training (see Fig. 7a). Since real-time segmentation is crucial for the process, the inference time is also taken into consideration. As shown in Table 2, with respect to the combined metric, UNet++ performed 17.94 percentage points better than SegNet, 5.67 percentage points better than MedT and 3.25 percentage points better than nnU-Net. For the other scores, UNet++ also performed comparatively well. Furthermore, the analysis of the metric for each subcategory (see Fig. 7b) shows that UNet++ responds best to three-times enlarged images in comparison with nnU-Net, MedT and SegNet; the latter two, in particular, falsely tend to recognize multiple tear sections instead of a single main section, as shown in Fig. 8. Based on these results, UNet++ is chosen as the most suitable network architecture for identifying the burnish surface.

Fig. 7

Course of loss function during training and evaluation of each image augmentation class (cf. Table 1) with the combined metric

2.2 UNet++ optimisation

For further investigating the properties and hyperparameters of UNet++, we first established a reference score by training the network a total number of five times with default parameters. The mean metric values and their standard deviations are given in Table 3. We then trained the model with different hyperparameter settings and compared the individual metric scores to these reference values.

As shown in Table 4, only scores which differ from the reference value by more than a standard deviation are considered significant changes. Note that for some metrics (e.g., the ratio of tear sections), an improvement (\(\bigtriangleup \)) is indicated by a lower score, while for others (e.g., the combined metric CM), higher scores correspond to a more accurate prediction.
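
The significance rule can be written down compactly. In the sketch below, the function name `classify_change` and all numerical values are hypothetical and do not correspond to entries of Tables 3 or 4; the flag `higher_is_better` reflects the direction of improvement of the respective metric as described above.

```python
def classify_change(score, ref_mean, ref_std, higher_is_better=True):
    """Compare a score with the reference mean using a one-standard-deviation threshold."""
    delta = score - ref_mean
    if abs(delta) <= ref_std:
        return "not significant"
    improved = (delta > 0) == higher_is_better
    return "better" if improved else "worse"


# Hypothetical values for illustration only
print(classify_change(0.930, ref_mean=0.920, ref_std=0.004))
print(classify_change(0.921, ref_mean=0.920, ref_std=0.004))
print(classify_change(1.30, ref_mean=1.10, ref_std=0.05, higher_is_better=False))
```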

2.3 Hyperparameter variations

The hyperparameters analyzed in the following are the numbers of network layers, feature maps per layer and convolutional layers per block (the block depth). Our reference UNet++ uses 5 layers and block depth 2 with 32 feature maps in the first layer; this number is doubled with each layer, so that the last layer uses 512 feature maps.
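
The doubling rule for the feature maps can be made explicit with a short sketch; the helper `feature_maps` is a hypothetical name used only for this illustration.

```python
def feature_maps(first_layer=32, layers=5):
    """Number of feature maps per layer when the count doubles with each layer."""
    return [first_layer * 2 ** i for i in range(layers)]


print(feature_maps())       # reference network: [32, 64, 128, 256, 512]
print(feature_maps(64, 6))  # 64 first-layer feature maps and 6 layers
```

Under the same rule, a network with 64 first-layer feature maps and 6 layers would reach 2048 feature maps in its last layer, which gives a rough indication of why such configurations are more expensive at inference time.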

Feature maps

To analyze the relationship between the number of feature maps and the prediction, networks with 8, 16 and 64 feature maps in the first layer were compared. The doubling per layer is retained.

The results, as shown in Table 4, indicate a minor, but significant improvement by 0.67 percentage points in terms of the combined metric after increasing the number of first-layer feature maps to 64. Moreover, all other metrics improve with this configuration as well. As expected, less detail is extracted from the image with fewer feature maps. As a result, these networks are less sensitive to changes in the image structure and tend to achieve worse results with a wider distribution. This is confirmed with images that have been magnified three times.

Depth

We also considered networks with one and three layers per block (“Depth 1/3” in Table 4). Due to the resulting changes to the amount of data processed per layer, the network with depth 1 shows a decreased inference time, but scores slightly worse overall. A higher number of layers per block, on the other hand, results in minor improvements: more tear sections can be assigned, and the combined metric score is slightly higher (0.08 percentage points above the standard deviation) compared to the reference architecture. These advantages, however, come with an increased inference time.

Number of layers

Table 4 also shows the comparison between networks with different numbers of layers. While fewer layers deliver worse results in terms of the combined metric, the scores for assigned tear sections and hole ratio are clearly improved by increasing the number of layers, which suggests that the deep layers can assist in processing more complex features (such as holes).

Table 2 Comparison of architectures with default hyperparameters

Synergy between hyperparameters

Considering the results, we selected a network with 64 feature maps, three layers per block and 6 layers for further comparison. This model can be seen as the synthesis between the best-performing hyperparameters and, as shown in Table 5, provides an improvement of the default model by 0.66 percentage points in terms of the combined metric while performing better in every other metric except the hole ratio. We note, however, that the inference time increased significantly as a result of the more complex structure.

Fig. 8

Comparison between the predicted burnish surfaces for different network architectures

2.4 Backbone modification

The previous analysis has shown that competitive predictions can be achieved with only one layer per block. However, during the evaluation, we observed that in this case, the segmentation tends to fail for images with more involved details, such as a higher number of tear sections. Increasing the number of layers beyond 5 led to significant improvements, albeit at the cost of an increased processing time.

Table 3 Mean and standard deviation after fivefold training of the UNet++ architecture
Table 4 Comparison of the modified architectures, showing significantly better (\( \bigtriangleup \)) or worse (\( \bigtriangledown \)) results compared to the reference architecture

As a compromise, we therefore consider an architecture with an increasing block depth per layer and a total number of 6 layers. More specifically, we modify the UNet++ structure with an incremental block depth such that the blocks in the first (top) layer contain one convolutional layer, the blocks in the second layer contain two convolutional layers, and so on.

The underlying assumption behind this architecture modification is that the processing of simple properties (e.g. basic positioning and brightness) takes place within the upper layers, whereas the lower layers process more complex features: for example, whether a pixel lies within a larger group of bright pixels, how large this group is, how the edge of this group is shaped, or whether the group contains a corresponding structure.

However, the results shown in Table 5 suggest that increasing the block depth in lower layers does not lead to better results regarding the different metric scores, whereas the inference time is more than doubled as a result of the more complex structure.

Next, we consider a replacement of the backbone in the UNet++ structure by DenseNet [22], similar to work by [23, 24] but extended to the UNet++ structure. Following the underlying assumption that this modification enriches the information about complex features in the deeper layers by connecting each layer with the previous layer via dense connections (see Fig. 9), this should lead to an overall improvement of the boundary details due to recurring influence of features.

Table 5 shows that the dense-backbone architecture indeed leads to comparable or better results regarding the different metric scores with a minor increase in inference time. Nevertheless, the dense backbone is still outperformed by the hyperparameter-optimized network according to the combined metric.

3 Discussion

The above analysis of the hyperparameters and different backbones demonstrates that:

  • architectures with fewer than 16 feature maps achieve worse results, but require a shorter inference time;

  • architectures with more feature maps achieve better results and require a longer inference time;

  • architectures with a lower block depth achieve comparable results and require a shorter inference time;

  • architectures with a higher block depth achieve slightly better results and require a longer inference time;

  • architectures with fewer layers can achieve worse results and require a shorter inference time;

  • architectures with more layers achieve better results and are slightly slower to process;

  • architectures with optimized hyperparameters achieve better results, but increase the inference time;

  • architectures with increasing block depth achieve worse results and double the inference time;

  • architectures with a dense encoder achieve comparable or better results and require a longer inference time.

Based on these findings, we propose the UNet++ with 64 feature maps as the most suitable configuration. Even though the 6-layer configuration performs better in some metric scores, the shorter inference time of the selected architecture should be prioritized, as it is highly beneficial for the intended purpose of inline segmentation.

Fig. 9

Architecture of UNet++ with dense backbone

4 Transfer evaluation

To the best of the authors’ knowledge, no other monitoring system comparable to the one considered here is currently in use, and thus no other extensive dataset of cutting surface images from punched parts is available for validation purposes. To evaluate the overall performance and transferability of the proposed neural network structure for further applications, we therefore collected a small transfer dataset of 60 images with corresponding mask images. This dataset consists of 40 images of burnish parts from a copper material with varying material thicknesses of 0.5 mm and 0.64 mm as well as images of a steel material with thickness 0.5 mm. All images were collected with the original monitoring system. We additionally collected images with an oil film applied to the burnish part, as would be expected in a real production process. Furthermore, images were taken with a Keyence confocal microscope with a different FG:BG ratio and image characteristics and added to the dataset.

We compared the results for the UNet++ with 64 feature maps, the reference UNet++, the hyperparameter-optimized variant and the 6-layer variant. The results are shown in Table 6. In summary, the best performance is achieved by the UNet++ with 64 feature maps. The metric scores are confirmed by direct inspection of the images (examples are shown in Fig. 10). All networks tend to predict an increased number of tear sections on the transfer dataset, especially for images with an oil film. Considering that the networks were applied to images with previously unseen characteristics, the performance is generally acceptable, even when a different device acquires the images. It is likely that the results can be improved considerably if images from multiple devices and different punching tools or process parameters are integrated into the training dataset.
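The tendency towards additional predicted tear sections can be quantified by counting connected regions in the predicted and ground truth masks. The following Python sketch is one possible way to do this and is not taken from the evaluation pipeline used here. It assumes that each mask is a binary array in which the class of interest (for example tear sections) is marked with 1, and it uses 4-connectivity; the function names are hypothetical.

```python
from collections import deque

import numpy as np


def count_regions(mask) -> int:
    """Count 4-connected foreground regions in a binary 2D mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # New region found: flood-fill it with a breadth-first search.
            regions += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
    return regions


def region_count_ratio(pred, truth) -> float:
    """Ratio of predicted to ground truth region counts (assumed definition)."""
    n_pred, n_truth = count_regions(pred), count_regions(truth)
    if n_truth == 0:
        return 1.0 if n_pred == 0 else float("inf")
    return n_pred / n_truth
```

A ratio above 1 indicates that the network predicts more separate regions than are annotated, which is the behavior observed for the images with an oil film.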

Table 5 Results for the combination of best-performing hyperparameters (6 Layers, Depth 3, 64 Features) and for a modified backbone with incremental block depth
Table 6 Comparison of modified architectures on the transfer dataset
Fig. 10 Comparison of segmentation results on the transfer dataset, with the ground truth contour in green

5 Conclusion

Fast and accurate segmentation of images is essential for in-cycle processing of quality parameters during the punching process. With prior methods for the segmentation of the burnish surface being too slow for real-time applications, machine learning provides a promising alternative approach.

Since related tasks are well known to be solvable by neural networks in a biomedical environment, we compared the network architectures SegNet, UNet++, MedT and nnU-Net for segmentation of the burnish part. The evaluation uses a newly developed metric, which allows for a simultaneous assessment of the segmentation accuracy in terms of both boundary and area overlap. The same targets are considered by the loss function that is used for optimizing the networks’ parameters. It is therefore possible to prioritize characteristics both during training and during evaluation. A modular selection of additional metric scores allows for an even more specific assessment of the results; for example, the ratio of tear sections or holes between prediction and ground truth might be considered especially important, depending on the application and the further use of the segmentation.
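To illustrate the idea of assessing boundary and area overlap at the same time, the following Python sketch combines a Dice score on the full masks with a Dice score on their one-pixel boundaries. This is a simplified stand-in and not the metric developed in this work: the weighting parameter `alpha`, the exact-match boundary comparison without any distance tolerance, and all function names are assumptions made for illustration.

```python
import numpy as np


def dice(a, b) -> float:
    """Dice coefficient of two binary masks (1.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def boundary(mask) -> np.ndarray:
    """Foreground pixels with at least one 4-neighbor outside the mask."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    # A pixel is interior if it and its four neighbors are all foreground.
    interior = (mask
                & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior


def combined_score(pred, truth, alpha: float = 0.5) -> float:
    """Weighted mean of area Dice and boundary Dice (illustrative only)."""
    area = dice(pred, truth)
    edge = dice(boundary(pred), boundary(truth))
    return alpha * area + (1.0 - alpha) * edge
```

Shifting the weight `alpha` towards 0 emphasizes the contour, while values near 1 emphasize the enclosed area. In this sketch, that weight is what allows one characteristic to be prioritized over the other.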

Moreover, we analyzed the hyperparameters of the UNet++ structure. In our comparison, a UNet++ architecture with 64 feature maps in the first layer achieved the best results, with an inference time of 4.43 ms. In particular, this segmentation method makes it possible to reliably identify the burnish surface of a produced punching part within the process cycle time. We also tested the developed architecture on a transfer dataset consisting of images with different characteristics from different devices. Although the prediction scores are worse, as expected, the proposed modified UNet++ architecture still performed best. In addition, the results indicate that segmentation works across different devices and demonstrate that networks for biomedical image segmentation are suitable for manufacturing tasks. In terms of quality monitoring, further research will focus on the performance of the developed architecture and metric with an increased image size; here, we used a resolution of only \(256\times 256\) pixels to decrease development time. In terms of applications towards predictive maintenance, further research should focus on classifying the image into categories after segmentation; for example, an automated identification of rejects or the distinction between phases of the wear diagram, such as running-in, steady state and increasing wear, could be considered. Furthermore, the training data should be expanded with additional image data from punching processes with different parameters for thickness and material to avoid overfitting to specific features.