1 Introduction

Manufacturing processes have been optimized in recent decades to achieve minimum reject rates and high product quality. However, as product and process complexity increases, the importance of reliable quality assurance continues to grow. Defects such as internal holes, pits, abrasions, and scratches on workpieces, or knots, broken picks, and broken yarn in fabrics [1], negatively impact both visual and functional product properties [2]. Defects also cause additional wastage of resources, pose safety hazards, and can have severe economic consequences for a company. Therefore, reliably assuring the quality of manufactured products is of paramount importance in manufacturing. A prominent contemporary approach towards a fully automated quality control system is deep learning (DL)-based computer vision. DL algorithms improve over existing rule-based systems in terms of generalization and performance, while requiring less domain expertise [3, 4, 5]. However, a major disadvantage of data-driven approaches compared to rule-based techniques lies in the strong dependency of model precision on data quantity, data quality, and the evolution of the data over time (data drift) [6]. While the focus in recent years has been on the development of advanced network architectures (e.g., ResNet-50 [7] or Inception-v3 [8]), the gains still attainable in model-space are diminishing. As a result, development is shifting towards data-centric approaches, especially in real-world domains such as manufacturing or medical diagnostics. Table 6.1 provides an overview of the main data challenges that are characteristic of image data acquired from production processes. These properties contrast strongly with those of (research) datasets (e.g., ImageNet [9], COCO [10], MNIST [11]) used for developing and benchmarking deep neural network architectures and DL-algorithms, which is why approaches from research are difficult to transfer one-to-one to such complex defect detection use-cases.

Tab. 6.1 Causes of data quality issues in DL-based visual defect detection in terms of data sparsity, data imbalance and data shift

Data augmentation (DA) represents a data-space solution to the above-mentioned data quality challenges. Various DA techniques change both the geometric and the visual appearance of images to improve both the performance and the robustness of deep neural networks. The most common DA techniques are geometric transformations, color augmentations, kernel filters, image mixing, and random erasing [12]. Even though DA is already an integral part of DL-pipelines, different DA-methods are often applied blindly based on empirical knowledge and require elaborate tuning for specific datasets. To analyze the impact of different DA-methods on both precision and generalization in visual defect detection, this paper introduces our experimental procedure in Sect. 6.3.3, presents the results in Sect. 6.4.2, and finally derives insights about the studied DA-methods in Sect. 6.5. Sect. 6.3.2 introduces the three real-world datasets we work with. Our DA-methods are chosen according to a preliminary study of related papers that is summarized in Sects. 6.2 and 6.3.3.

2 Related Work

This section provides a brief overview of work that addresses the generalization problem, DA approaches, and their impact on real-world DL tasks. One central drawback of real-world datasets is that models trained on them do not generalize well, as these datasets are prone to domain shift [13]. In recent years, model-centric techniques such as dropout [14], transfer learning [15], and pretraining [16] have tried to address the issue of generalization, particularly in deep neural networks. DA tries to avoid poor generalization by addressing the root problem, the training data itself [17], rather than changing the model or training process. Applications of DA can be found across multiple domains such as natural language processing [18], computer vision [19], and time series classification [19]. Particularly in computer vision tasks, DA has been applied to address the domain generalization problem [20, 21, 22]. Many papers apply and analyze basic DA-techniques (e.g., oversampling and data warping on histopathological images [23]) and advanced methods (e.g., stacked DA on medical images [24], style-transfer augmentations [25], cGAN and geometric transformations [26]) for specific use cases and datasets.

Fewer papers provide an overview of DA-methods and examine their influence on model accuracy. The survey of Shorten et al. [17] presents a comprehensive overview of DA and examines the impact of individual methods on well-known datasets (e.g., CIFAR-10, MNIST, Caltech101) in an isolated manner, i.e., via pairwise comparisons. Shijie et al. [27] explore the impact of various DA-methods on image classification tasks with CNNs. On subsets of CIFAR-10 and ImageNet, they conduct pairwise and triple comparisons to identify best-performing DA-techniques and to draw general conclusions. Yang et al. [28] systematically review different DA-methods and propose a taxonomy of the reviewed methods. For semantic segmentation, image classification, and object detection, they compare the performance of different model architectures on datasets (e.g., CIFAR-100, SVHN) with and without a pre-defined set of DA-techniques. The survey of Khosla et al. [29] presents an overview of selected DA-methods without conducting further effect analyses. In addition to generic studies on scientific datasets, a few domain-specific approaches exist. The only related work on DA in defect detection is provided by Jain et al. [30], who propose a GAN-based DA-framework to investigate data synthetization for the classification of manufacturing datasets.

2.1 Scientific Impact

Existing studies are almost exclusively conducted on scientific datasets and make no reference to specific application domains (with the exception of [30]). To the best of our knowledge, there is currently no preliminary work that examines the impact of DA-methods specific to DL-based visual quality control on manufacturing datasets in an unconstrained setting (i.e., not restricted to pairwise evaluations).

3 Approach

In this section, we present our approaches and procedures. Sect. 6.3.1 defines the mathematical problem of binary defect detection. Sect. 6.3.2 introduces the datasets considered in this study and their properties. The experimental procedure, the domain shift measure, and the evaluation metrics are presented in Sect. 6.3.3.

3.1 Binary Defect Detection Problem Definition

For binary visual defect detection, the input feature space is denoted by \(\mathcal{X}\) and the target space by \(\mathcal{Y}\). We define the domain as a joint distribution \(P_{XY}\) on \(\mathcal{X} \times \mathcal{Y}\) and the dataset as \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\), where \(N\) is the number of training examples. In this work, \(\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3\) comprise images from three datasets, namely: (1) AITEX fabric defects [1], (2) Magnetic tile defects [31], and (3) TIG Aluminium 5083 welding defects [32]. We define the binary classification problem with \(y \in \{\text{Defected}, \text{Non-defected}\}\). Furthermore, the DL model is defined as \(f\colon \mathcal{X} \rightarrow \mathcal{Y}\), where the primary objective is to learn a mapping from the input space \(\mathcal{X}\) to the target space \(\mathcal{Y}\); in this work, \(f\) is either ResNet-50 [7] or Inception-v3 [8]. The predictions generated by model \(f\) are denoted as \(\hat{\mathcal{Y}}\). The categorical cross-entropy loss function is defined as \(\ell\colon\mathcal{Y}\times\hat{\mathcal{Y}}\rightarrow[0,\infty)\). Each dataset \(\mathcal{D}\) is augmented using various DAs, where \(\theta_j\) denotes a DA-pipeline, and a new augmented dataset is generated as \(\mathcal{D}_j = \theta_j(\mathcal{D})\). For each dataset, ten DA-pipelines with varying DAs are constructed to create ten different datasets \(\mathcal{D}_1, \ldots, \mathcal{D}_{10}\).
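To make the notation concrete, the following minimal Python sketch shows how a dataset of (image, label) pairs is mapped through a DA-pipeline \(\theta_j\) to produce an augmented derivative \(\mathcal{D}_j\). All names (`apply_pipeline`, `Sample`, `Transform`) are our own illustrations, not the authors' implementation.

```python
from typing import Callable, List, Tuple
import numpy as np

# A sample (x_i, y_i): image plus binary label (1 = defected, 0 = non-defected).
Sample = Tuple[np.ndarray, int]
Transform = Callable[[np.ndarray], np.ndarray]

def apply_pipeline(dataset: List[Sample], theta: List[Transform]) -> List[Sample]:
    """Apply every DA in pipeline theta to each image; labels stay fixed."""
    augmented = []
    for x, y in dataset:
        for transform in theta:
            x = transform(x)
        augmented.append((x, y))
    return augmented

# Ten pipelines theta_1 .. theta_10 then yield the derivatives D_1 .. D_10:
# datasets = [apply_pipeline(D, theta_j) for theta_j in pipelines]
```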

3.2 Presentation of the Datasets

Three real-world industrial-grade datasets are used in this work. An overview of exemplary images is provided in Fig. 6.1. The Magnetic tile defects dataset (MagTile) contains a total of 1,344 images of magnetic tiles with five defect types: blowhole, crack, fray, break, and uneven (grinding unevenness), plus a free class (no defects). AITEX is a fabric production dataset containing 246 images of 4,096 × 256 pixels that capture seven different fabric structures. In total, there are 140 defect-free images, 20 for each type of fabric, and 105 images with defects. The TIG Aluminium 5083 welding seam dataset (TIG5083) contains 33,254 images of aluminium weld seams and their surrounding area, with six classes: good weld, burn through, contamination, lack of fusion, misalignment, and lack of penetration. We convert the multi-class classification task of all datasets into a binary classification problem by merging all individual defect types into a single defect class.
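As a minimal illustration of this class merging, the snippet below maps the original class names to the binary target. The "no defect" names follow the dataset descriptions above; treating AITEX defect-free images under a generic `"defect-free"` label is our assumption.

```python
# Classes that count as non-defected; all other classes are merged into "defected".
NON_DEFECT_LABELS = {"free", "good weld", "defect-free"}

def to_binary(label: str) -> int:
    """Map an original class name to the binary target (1 = defected)."""
    return 0 if label.lower() in NON_DEFECT_LABELS else 1

print([to_binary(l) for l in ["blowhole", "free", "good weld", "crack"]])  # [1, 0, 0, 1]
```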

Fig. 6.1 Exemplary raw images of the datasets studied: AITEX (a), MagTile (b), and TIG5083 (c)

3.3 Experiment Procedure

To evaluate the impact of DA-techniques, we propose a three-stage process: First, for each dataset, apply a DA-pipeline and evaluate model performance on different test sets. Second, measure the domain shift between the train set and the test sets. Third, correlate the achieved performance with the domain shift. This framework provides insight into the effects of different DAs on model performance and domain shift and, through the correlation of both, into the generalization capabilities of the trained model. An overview of our procedure can be found in Fig. 6.2. We use a standard 80/20 train-test split and further split the 80% train portion to obtain a 60/20/20 train/validation/test split overall. Additionally, we create a hold-out test set by splitting off one of the defect classes per dataset before the classes are merged (see Sect. 6.3.2). This hold-out set serves as an additional out-of-distribution test set to measure the generalization capabilities of the model. We apply DA in two different settings. For AITEX and MagTile, augmented datapoints were added as new instances, retaining the original ones; this increases the overall number of instances and stabilizes training. For TIG5083, augmented datapoints replace the originals, since the dataset already contains enough images for training. The hold-out class was 'Broken end' for AITEX, 'Crack' for MagTile, and 'Contamination' for TIG5083. A sketch of the split logic follows below.
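The split logic reads as follows in a hedged scikit-learn sketch; `make_splits`, the stratification, and the fixed seed are our assumptions, not the authors' code.

```python
from sklearn.model_selection import train_test_split

def make_splits(images, labels, hold_out_class, seed=42):
    """60/20/20 train/val/test on the merged data plus an OOD hold-out set."""
    # Split off the hold-out defect class before merging classes.
    hold_out = [(x, y) for x, y in zip(images, labels) if y == hold_out_class]
    rest = [(x, y) for x, y in zip(images, labels) if y != hold_out_class]
    xs, ys = zip(*rest)
    # 80/20 train/test, then 25% of the 80% train portion as validation (= 20% overall).
    x_tr, x_te, y_tr, y_te = train_test_split(
        xs, ys, test_size=0.20, stratify=ys, random_state=seed)
    x_tr, x_va, y_tr, y_va = train_test_split(
        x_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=seed)
    return (x_tr, y_tr), (x_va, y_va), (x_te, y_te), hold_out
```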

Fig. 6.2 Experiment protocol for constructing the DA-pipelines, training, and evaluating the defect detection model

3.3.1 Data Augmentation Pipelines

In order to pre-select the DA-steps for this paper, a survey was conducted across 24 papers dealing with six major industrial image datasets. Table 6.2 lists the available augmentations for each dataset. From these augmentation pools, different pipelines were constructed for each dataset. For each pipeline, two of the augmentations are reserved for the test set and are later referred to as test augmentations. Each remaining DA has a 0.5 probability of being applied to the training set. This process is repeated ten times (see Table 6.3); a sketch of the construction is given below. App. 6.6.1 provides an overview of selected unaugmented and augmented images for all three datasets.
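The construction can be sketched as follows, under the assumption that the 0.5 inclusion probability is sampled independently per augmentation; the pool entries and function name are placeholders.

```python
import random

def sample_pipelines(aug_pool, n_pipelines=10, n_test_augs=2, seed=0):
    """Reserve two test augmentations per pipeline; include the rest with p = 0.5."""
    rng = random.Random(seed)
    pipelines = []
    for _ in range(n_pipelines):
        pool = list(aug_pool)
        rng.shuffle(pool)
        test_augs = pool[:n_test_augs]                      # reserved for the test set
        train_augs = [a for a in pool[n_test_augs:] if rng.random() < 0.5]
        pipelines.append({"train": train_augs, "test": test_augs})
    return pipelines

pipes = sample_pipelines(["rotate", "gaussian_noise", "mls", "affine", "transpose"])
```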

Tab. 6.2 Preselected set of DA-methods for TIG5083, AITEX, and MagTile
Tab. 6.3 Train, validation, and test set DA-Pipelines (AITEX)

3.3.2 Domain Shift Measures

We use the algorithm proposed in [36] for measuring the domain shift between datasets. In computer vision tasks, domain shift can be quantified as the difference between the representations a model produces for the source and the target domain: the more distant the source domain is from the target domain, the more the representations of the two domains diverge in the model's learned feature space. The authors of [36] use the activation values of the model's last layers to quantify the domain shift. Specifically, by building a statistical distribution from each kernel's activation values in those layers, the distance between the datasets can be measured using the Wasserstein distance.
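In code, the measure can be sketched like this: given the activations of one late layer for the source and target data, compare each kernel's empirical distribution with the 1D Wasserstein distance and average over kernels. The averaging over kernels is our reading of [36]; the feature extraction itself is assumed.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def domain_shift(acts_src: np.ndarray, acts_tgt: np.ndarray) -> float:
    """acts_*: (num_samples, num_kernels) activations of one layer."""
    num_kernels = acts_src.shape[1]
    dists = [wasserstein_distance(acts_src[:, k], acts_tgt[:, k])
             for k in range(num_kernels)]
    return float(np.mean(dists))
```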

3.3.3 Evaluation Metrics

To evaluate the results of the binary classification problem, various metrics are used: F1-Score, precision, recall, Jaccard similarity [37], Cohen's kappa score [38], and the Matthews correlation coefficient (MCC) [39]. Since the datasets remain imbalanced even after applying DA, the Jaccard, precision, recall, and F1-Scores are weighted by the class distribution. We use multiple evaluation metrics because they all deviate slightly from each other; in this way, we circumvent the sensitivity of individual metrics and obtain a more conclusive evaluation. Since all these scores take values on comparable scales bounded above by 1 (kappa and MCC can in principle be negative), we average them when reporting final performance values.
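A hedged scikit-learn sketch of the aggregation (the plain average of all six scores is our reading of the text):

```python
from sklearn.metrics import (cohen_kappa_score, f1_score, jaccard_score,
                             matthews_corrcoef, precision_score, recall_score)

def combined_score(y_true, y_pred) -> float:
    """Average of class-weighted F1, precision, recall, Jaccard plus kappa and MCC."""
    scores = [
        f1_score(y_true, y_pred, average="weighted"),
        precision_score(y_true, y_pred, average="weighted"),
        recall_score(y_true, y_pred, average="weighted"),
        jaccard_score(y_true, y_pred, average="weighted"),
        cohen_kappa_score(y_true, y_pred),   # kappa and MCC are not class-weighted
        matthews_corrcoef(y_true, y_pred),
    ]
    return sum(scores) / len(scores)
```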

4 Results

In this section, we present the results. Sect. 6.4.1 describes the training and implementation details. Sect. 6.4.2 illustrates the evaluation protocol using the example of the AITEX dataset. Sect. 6.4.3 presents the results of our ablation study.

4.1 Training and Implementation

To control model training, a validation set is split off from the augmented training set. The model is evaluated on the original test set, the augmented test sets (using the two reserved test augmentations), and the hold-out set, as described in Sect. 6.3.3. As models for our experiment, ResNet-50 and Inception-v3 were chosen, as both are widely used in the literature on industrial applications. The learning rate for both models is set to \(10^{-3}\), the Adam optimizer [40] is used, and the input sizes of the networks are set to 224 × 224 and 299 × 299 pixels, respectively. We initialize both architectures with pre-trained ImageNet weights. Transfer learning proceeds in two stages: 50 epochs with frozen encoder weights (shallow training), followed by 30 epochs of fine-tuning the entire model (deep training); a sketch is given below. As with the evaluation metrics, a class-balanced version of the loss function was employed to stabilize the learning process. The data for each experiment was normalized according to the statistics of the train set after applying DA.
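A PyTorch sketch of the two-stage schedule for ResNet-50 (the Inception-v3 case is analogous with 299 × 299 inputs); `train_epoch` and the data loaders are assumed, and the exact layer freezing is our interpretation of "frozen encoder".

```python
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)   # new binary head

# Stage 1 (shallow training, 50 epochs): freeze everything except the head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# for epoch in range(50): train_epoch(model, optimizer, train_loader)

# Stage 2 (deep training, 30 epochs): unfreeze and fine-tune the entire model.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for epoch in range(30): train_epoch(model, optimizer, train_loader)
```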

4.2 Results for the AITEX Dataset

Fig. 6.3 depicts the average F1-Score across both models and across the DA-steps for each test set. The values are obtained by averaging the performance of each pipeline that contains the respective augmentation. We observe that performance on the original test set and, to a lesser extent, on the augmented test set remains stable, but on the hold-out set (highest amount of domain shift), model performance improves significantly. The top three DA-steps for the AITEX dataset are MLS, Gaussian noise, and random rotation. As stated in Sect. 6.3.3, we also average performance across multiple other metrics, since they all deviate slightly from each other. Similar trends can be observed in Fig. 6.4.

Fig. 6.3 F1-Score averaged across models for each DA-method (AITEX), sorted by hold-out performance

Fig. 6.4 Averaged Jaccard, precision, recall, kappa, and MCC scores across models for each DA-method (AITEX), sorted by hold-out performance

Next, the distance between the train set (source domain) and the test sets (target domains) was calculated for all models and datasets. Table 6.4 contains the mean and standard deviation across all pipelines for the AITEX dataset and the ResNet-50 model. The domain shift increases from the original test set over the augmented test set to the hold-out set. Finally, the domain shift is correlated with the respective F1-Scores, since the Wasserstein distance alone lacks interpretability.

Tab. 6.4 Domain shift measure averaged across DA-pipelines for the last layer of ResNet-50

A negative correlation means that the model's performance on the test data decreases as the domain shift increases; therefore, a strongly negative correlation is expected. Each cell in Table 6.5 contains the Pearson correlation between the distance measure and the F1-Scores across all test sets. Since the domain shift is measured on a single layer of the model, we evaluated the last three layers of each model and report the values separately per column. The correlation values do not change depending on the layer used, but we observe two outlier pipelines that display a weaker correlation between domain shift and model performance. Further information can be found in App. 6.6.2. The same evaluation protocol was followed for the other two datasets, and similar trends were observed. The results for TIG5083 and MagTile can be found in App. 6.6.3.
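The correlation step itself is a one-liner with scipy; the numbers below are made up purely for illustration.

```python
from scipy.stats import pearsonr

# Per pipeline: Wasserstein distances to (original test, augmented test, hold-out)
# and the corresponding F1-Scores.
distances = [0.8, 1.9, 3.1]
f1_scores = [0.95, 0.90, 0.72]
r, p_value = pearsonr(distances, f1_scores)
print(f"Pearson r = {r:.2f}")  # negative: performance drops as domain shift grows
```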

Tab. 6.5 Pearson correlations between the domain shift and model F1-Scores (AITEX). Bold values mark the largest negative mean correlation.

4.3 Results of the Ablation Study

In addition to the average scores presented in Sect. 6.4.2, we draw additional insights from comparing model performance across all models and datasets. Fig. 6.5 depicts a stacked bar plot of weighted F1-Scores, averaged across all datasets and models, for each augmentation available for a dataset. Across all experiments, affine transformations, moving least squares (MLS), and random rotation performed best. Similarly, Fig. 6.6 depicts the average of the scores across all other evaluation metrics. We observe similar trends: on average across experiments, affine transformations, perspective transformation, and MLS perform best.

Fig. 6.5 F1-Scores averaged across all augmentation steps in the train sets

Fig. 6.6 Jaccard, precision, recall, kappa, and MCC scores averaged across all augmentation steps in the train sets

5 Conclusion

DL offers enormous potential for automating complex visual quality control tasks that cannot be solved using rule-based methods. However, manufacturing applications entail three severe data challenges: data sparsity, data imbalance, and data shift. DA-methods have become an integral part of DL-pipelines to improve both performance and generalization. To provide precise assistance in selecting DA-methods for the development of DL-based quality control in the future, this paper presents an experiment protocol that evaluates the impact of individual DA-methods on defect detection performance depending on dataset characteristics. We apply this protocol to three defect detection use-cases and present and interpret the results.

Using our approach, we can evaluate the influence of each DA-method on the model metrics in detail. We show how to determine the domain shift between genuine and augmented dataset derivatives, thereby providing a measure, and thus interpretability, for choosing the degree of DA. By correlating this domain shift with F1-Scores, the strength of the positive influence of a DA-pipeline on bridging the domain shift can be determined. Applying our protocol to the datasets, we obtain the following three best DA-methods per dataset: MLS, Gaussian noise, and random rotation (AITEX); image transpose, random perspective, and salt & pepper noise (MagTile); affine transformation, perspective transformation, and image transpose (TIG5083). Thereby, we confirm that the performance improvement of DA-methods depends on the dataset characteristics, the DL-task to be solved, and the degree of DA. This shows that there is no one-size-fits-all solution, but it also makes all the more clear that establishing a mapping between dataset properties (e.g., degree of imbalance, defect sizes, positional variance of defects) and DA-induced performance enhancement will enable tailor-made and precise DL-pipeline development, especially in real-world applications.

Correlating the observed performances with the respective domain shift revealed additional insights. The two pipelines for the AITEX dataset that induced the weakest negative correlation between domain shift and performance were mainly composed of our three best-performing augmentations for that dataset (see Table 6.5, pipelines 4 and 9). Additionally, we found that the worst-performing pipelines either contained very few augmentations or included badly performing ones (mainly 'random rotate' for AITEX), further highlighting the need for tailor-made DA-pipelines for each dataset. In contrast, our ablation study showed that, when averaging the results over all datasets and models, some augmentations do perform better than others on average. The better-performing augmentations are the more complex ones, showcasing their versatility and robustness, while simple off-the-shelf augmentations display the least lift in model performance. Fig. 6.6 can serve as a benchmark of augmentation techniques for new industrial-grade datasets, or for those with unknown properties.

With the proposed protocol, we lay the foundation for analytically determining the appropriateness of DA-methods for specific data properties. We will also include more advanced DA-methods and extend the study to additional domain-specific datasets to provide more validity to the results. By establishing a catalog of dataset properties to which we can map the results of the study, we aim to develop a domain-specific decision support system for choosing optimal DA-pipelines for DL-applications.

6 Appendix

6.1 Dataset Illustrations

Fig. 6.7 Selection of AITEX [1] images: train set (a), test set (b), hold-out set (c), and augmented test set (d)

Fig. 6.8 Selection of MagTile [31] images: train set (a), test set (b), hold-out set (c), and augmented test set (d)

Fig. 6.9 Selection of TIG5083 [32] images: train set (a), test set (b), hold-out set (c), and augmented test set (d)

6.2 Domain Shift Calculations

The distance measure alone does not have good interpretability. Hence, we correlate the distance measure with the F1-Scores; a negative correlation is expected, since a smaller distance should go along with higher F1-Scores. Table 6.6 provides the distance measures for the average-pooling layer of the ResNet-50 model across train and test sets, where the first three columns show the distances and the following three columns show the F1-Scores for the same pipelines. We compute Pearson correlations along each pipeline, correlating the distance measure with the corresponding performance metric. Repeating this process for the last layers of both models yields Table 6.5. The same procedure was followed to construct similar tables for the MagTile and TIG5083 datasets. Furthermore, we take the mean across the last layers of the models.

Tab. 6.6 Wasserstein distance between the augmented train set and all test sets for ResNet-50 and corresponding model F1-Scores

6.3 Results

6.3.1 MagTile Dataset

Tab. 6.7 Train, validation and test set DA-Pipelines (MagTile)
Fig. 6.10 F1-Scores averaged across models for each DA-method (MagTile), sorted by hold-out performance

Fig. 6.11 Jaccard, precision, recall, kappa, and MCC scores averaged across models for each DA-method (MagTile), sorted by hold-out performance

Tab. 6.8 Pearson correlations between the domain shift and model F1-Scores (MagTile). Bold values mark the largest negative mean correlation.

6.3.2 TIG5083 Dataset

Tab. 6.9 Train, validation and test set DA-Pipelines (TIG5083)
Fig. 6.12 F1-Scores averaged across models for each DA-method (TIG5083), sorted by hold-out performance

Fig. 6.13 Jaccard, precision, recall, kappa, and MCC scores averaged across models for each DA-method (TIG5083), sorted by hold-out performance

Tab. 6.10 Pearson correlations between the domain shift and model F1-Scores (TIG5083). Bold values mark the largest negative mean correlation.