Self-Supervised Learning for Industrial Image Anomaly Detection by Simulating Anomalous Samples

Industrial image anomaly detection (AD) is a critical issue that has been investigated in different research areas. Many works have attempted to detect anomalies by simulating anomalous samples. However, how to simulate abnormal samples remains a significant challenge. In this study, a method for simulating anomalous samples is designed. First, for the object category, patch extraction and patch paste are designed to ensure that the extracted image patches come from the objects and are pasted to the objects in the image. Second, based on the statistical analysis of various anomalies’ presence, a combination of data augmentation is proposed to cover various anomalies as much as possible. The method is evaluated on MVTec AD and BTAD datasets; the experimental results demonstrate that our method achieves an overall detection AUC of 97.6% in MVTec AD datasets, outperforming the baseline by 1.5%, and the improvement over VT-ADL method is 4.3% on the BTAD datasets, demonstrating our method’s effectiveness and generalization.


Introduction
The challenge of identifying patterns in data that do not correspond to expected behavior is known as anomaly detection [1]. It is now widely used in various fields, e.g., financial risk control [2], medical imaging [3,4], computer network security [5], video surveillance [6][7][8], and industrial defect detection [9]. Anomaly detection has several distinctive characteristics. First, anomalies change constantly, so future patterns are difficult to predict. Second, anomalies arise in many situations and take a variety of forms. Finally, normal and abnormal samples are highly imbalanced, and anomalous samples are very scarce. With the development of deep learning, anomaly detection has made great progress. However, numerous challenges remain in industrial defect detection, such as the sparsity of abnormal samples, the high cost of annotation, and the lack of a priori knowledge of defects.
To address these challenges, researchers have proposed many methods, such as reconstruction-based and embedding-based approaches. Due to the lack of anomalous samples, most existing methods adopt unsupervised learning [10], which means that only normal samples are involved in the training stage. Reconstruction-based methods [11,12] assume that normal samples are reconstructed well but abnormal samples are not, so comparing the original sample with its reconstruction is enough to determine whether the input image is normal. However, due to the strong reconstruction ability of neural networks, abnormal samples can also be reconstructed well, which may degrade the detection results. Embedding-based methods [13,14] adopt a deep neural network to map normal and abnormal samples into a feature space, where the distance between the embedding vectors of normal and abnormal samples is calculated. This approach captures higher-level and more abstract information about the samples, which can be beneficial. However, accurately localizing the anomalous region based solely on the embedding distance can be challenging, and further research is needed to improve localization within the embedded feature space. Recently, self-supervised learning techniques have achieved notable results in representation learning. This approach makes effective use of unlabeled data by deriving supervisory signals from pretext tasks. Simulating anomalous samples, as in the Cutpaste technique [15], is currently a prevalent strategy in self-supervised learning. In the Cutpaste method, patches are randomly extracted and randomly pasted at any position of the image in the object category. However, there is still a gap between the simulated samples and real anomalous samples. Therefore, how to simulate good anomalous samples needs to be further explored.
Our method is proposed to simulate better anomalies. First, in the Cutpaste method, patches are randomly extracted and randomly pasted at any position of the image in the object category, which departs from real anomalies. Therefore, a patch extracted from the object must be pasted onto the object itself. For example, in Fig. 1, if the patch is pasted outside the screw, it does not generate an anomalous image of the screw. However, if the patch is pasted on the surface of the screw, it may simulate the screw's anomalies. Second, as shown in Fig. 2, a method based on a combination of data augmentations is devised to cover as many types of anomaly as possible. If only one type of data augmentation is performed on the patch, it may simulate only one type of anomaly and cannot cover more types of anomalies in the test datasets. Therefore, through the above steps, the anomalies simulated by our method are closer to real ones, which improves the accuracy of the model's judgment.
Fig. 1 When the extracted patch is applied outside the object (unsuitable location), the generated anomaly sample is the opposite of the actual situation. When the patch is on the object (suitable location), it is close to the actual anomaly sample
Fig. 2 When only one type of data augmentation is used on the patch, it can simulate only one type of anomalous sample and cannot cover more anomalies. When various augmentations are used, it may cover a wide range of abnormal types
Extensive experiments are conducted on public benchmarks. The experimental results show that our method improves the results. The contributions are summarized as follows:
• For the object categories, the extracted patch is pasted on the surface of the object, which is closer to real anomaly samples.
• According to a statistical analysis of the anomalies present in the test datasets, various data augmentation methods are selected to cover as many anomalies as possible.

Reconstruction-Based Methods
Reconstruction-based methods, such as auto-encoders [11,16] and generative adversarial networks (GANs) [17,18], compute anomaly scores at the pixel level or image level from the reconstruction error. However, due to the generalization capability of neural networks, abnormal samples are also well reconstructed, which makes normal and abnormal samples difficult to distinguish. Therefore, some researchers have tried to address this issue by modifying the network architecture with memory mechanisms, pseudo-anomalies, and image masking strategies.
Memory mechanisms [16,19,20] are frequently used to memorize normal prototypes, which improves the ability to distinguish between normal and abnormal samples. Pseudo-abnormal samples [21,22] are generated to reshape the reconstruction manifold. The image masking strategy [23,24] erases some information from the image and then restores the erased information. However, these methods still lack discriminative ability for real-world anomaly detection [25,26]. The work in [27] preprocesses the images by Omni-frequency Channel selection and then compresses and restores the individual images. The work in [28] achieves good results by reconstructing the images with Position and Neighborhood Information. Reconstruction-based methods are intuitive and highly interpretable. However, existing methods frequently fail to produce the desired reconstruction results.

Embedding-Based Methods
Embedding-based methods compute anomaly scores by comparing the embedding vectors of normal and abnormal samples using a trained network. These methods commonly employ embedding similarity, one-class classification [29], and Gaussian distributions [30] for anomaly scoring. One-class support vector machine (OC-SVM) [31] and support vector data description (SVDD) [32] define a compact, closed one-class distribution using normal sample vectors. To cope with high-dimensional data, deep SVDD estimates feature representations with deep networks. Because they embed the entire image, these methods can only determine whether an image is normal or not and cannot locate the abnormal regions. To address this limitation, some approaches [13][14][15] use patch-level embeddings to generate an anomaly map, where each patch represents a point in the feature space [33,34]. Feature adaptation to the target dataset has also been proposed to improve the representation in the feature space [35], leading to the formation of a decoupled hypersphere. More recently, flow-based methods perform density estimation to calculate anomaly scores [36]. The purpose of embedding-based methods is to find distinguishable feature embeddings. The extracted features contain information about a local receptive field rather than individual pixels, which increases the tolerance to noise.

Self-Supervised Learning Methods
Self-supervised learning methods construct supervisory information from the data itself, enabling the learning of valuable representations for downstream tasks. These methods have proven effective in tasks such as estimating geometric transformations [37,38] and predicting the relative positions of patches [39]. Moreover, they have been used for sub-image anomaly detection [15,40,41]. In [15], self-supervised learning is used to learn a representation, and a one-class classifier is then constructed on the learned image representation. Abnormal samples are generated by cutting patches from the image and pasting them at arbitrary locations. In [40], the method learns the representation of anomaly samples and their reconstruction from anomalous data. Decision boundaries between normal and abnormal samples are then learned. This method enables direct anomaly localization without requiring additional complex post-processing of the network output. The work in [42] uses self-supervised learning to reconstruct images with a masking strategy. Because industrial scenarios lack sufficient manually annotated defect samples, self-supervised learning methods have attracted attention there as well, and they achieve performance comparable to supervised methods. However, their performance relies heavily on the design of the supervisory signal and the pretext task.

Anomaly Simulation Methods
To overcome the lack of anomaly samples in real-world scenarios, researchers have explored various methods to synthesize anomaly samples. Rotation [41] and cutout [43] are the most classical anomaly simulation strategies, but they perform poorly in detecting small anomalous regions. The works in [12,44] use Perlin noise to simulate real anomaly images, but the generated noise images may not closely resemble actual anomaly images. In [45], dynamic local augmentation is used to synthesize new samples, but a difference from actual samples remains. The method in [46] uses a Dual-Siamese structure to capture the discriminative features of normal samples and their corresponding defective samples, which are synthesized by the module. The method in [47] incorporates Poisson image editing to smoothly blend scaled patches from images of different sizes, but it introduces too many hyper-parameters. The work in [48] generates pseudo-anomalous data from latent variables in GANs, which is a promising idea, but the generated data may still differ significantly from actual anomalies. Among these existing approaches [49,50], there is no systematic analysis of data augmentation. The Cutpaste method [15] is a simple and effective anomaly simulation method. To further improve anomaly synthesis, we propose incorporating the idea of self-supervised learning.

Methods
The method in this study defines a self-supervised learning task for anomaly detection that focuses on simulating anomalous samples. Our self-supervised learning method distinguishes between normal samples and simulated anomalous samples. In the anomaly detection phase, a Gaussian probability density is used to score samples, and abnormal regions are visualized by class activation mapping (CAM). As shown in Fig. 3, the proposed architecture consists of four modules: patch extraction, data augmentation, patch paste, and one-class classifier. In the patch extraction module, patches are extracted from within the object so that they can later be pasted onto the object in the patch paste module. The data augmentation module applies various augmentations to the patch to cover as many anomaly types as possible. The one-class classifier module performs a classification between normal and synthetic abnormal samples. In the testing phase, features are extracted from the image by the trained CNN, and the density of the extracted feature vectors is then estimated by the Gaussian probability density function. An anomaly score map is generated from image-level feature vectors, and the visualization of the image is produced by Full Grad CAM. The pseudo-code of the training procedure is shown in Algorithm 1.

Self-Supervised Learning
Self-supervised learning involves training a model to learn useful data representations without relying on explicit supervision. In contrast to traditional supervised learning, where labeled input data are used to train models to predict specific labels, self-supervised learning trains models to learn underlying patterns or features of the data without relying on labeled data. To accomplish this, self-supervised learning often uses pretext tasks, which are designed to provide the model with a learning signal that does not require human annotation or explicit labeling of the data.
In industrial scenarios, collecting abnormal samples can be challenging, and therefore the available training data typically consist mainly of normal samples. As a result, training a model to identify abnormalities becomes difficult, as the model does not have a clear supervisory signal during training. To address this issue, abnormal samples can be simulated through the Cutpaste method and data augmentation to provide a clear supervisory signal for the model. The aim is for the model to generalize and accurately detect unseen real defects when applied to test datasets.
In our self-supervised learning scheme, the Cutpaste method is adopted. Patches are extracted from and pasted onto normal samples, and data augmentation is applied to the extracted patches to simulate better abnormal samples. A CNN is employed to discriminate between normal samples and simulated abnormal samples.

Patch Extraction and Patch Paste
To simulate good anomalous samples for the object category, the extracted patch must come from the object and be pasted onto the object in the image. Two modules are therefore designed: patch extraction and patch paste. As Fig. 1 illustrates, the extracted patch needs to come from the object. Therefore, we pre-process the image, as shown in Fig. 4. First, the original image is binarized by selecting an appropriate threshold value, and a morphological close operation is performed. Second, the original image is multiplied by the binary image to obtain the extracted object. Finally, a suitable patch, which is a small rectangular area of variable size, is extracted from the object. The patch paste module, shown in Fig. 5, gives the procedure for synthesizing anomaly samples. First, an initial synthesized anomaly sample is obtained from the patch and the original image by searching the pixels of the object; at this stage the patch may extend beyond the surface of the object. Next, by selecting an appropriate threshold value and performing a morphological close operation, the mask and inverse mask images are obtained; the object of the synthesized anomaly image is then produced by multiplying the mask with the initial synthesized anomaly image. Likewise, the background of the original image is obtained with the inverse mask. Third, the final synthesized anomaly image is the sum of the object of the synthesized anomaly image and the background of the original image, which removes the part of the patch that extends beyond the surface of the object.
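The mask-based paste described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is a single-channel float array, the object is brighter than the background so that a simple `>` threshold isolates it, the mask is computed from the original image, and the patch fits inside the image. The function names `binary_close`, `object_mask`, and `paste_on_object` are hypothetical.

```python
import numpy as np

def binary_close(mask: np.ndarray, k: int = 3) -> np.ndarray:
    """Morphological close (dilate, then erode) with a k x k square element."""
    r = k // 2

    def sweep(m: np.ndarray, reduce_fn, pad_value: bool) -> np.ndarray:
        # Stack the k*k shifted views of the padded mask and reduce them.
        padded = np.pad(m, r, constant_values=pad_value)
        h, w = m.shape
        windows = [padded[dy:dy + h, dx:dx + w]
                   for dy in range(k) for dx in range(k)]
        return reduce_fn(windows, axis=0)

    dilated = sweep(mask, np.any, False)   # grow the foreground
    return sweep(dilated, np.all, True)    # shrink it back

def object_mask(gray: np.ndarray, threshold: float) -> np.ndarray:
    """Binarize the image (assumption: object brighter than background) and close holes."""
    return binary_close(gray > threshold)

def paste_on_object(image: np.ndarray, patch: np.ndarray,
                    top: int, left: int, threshold: float) -> np.ndarray:
    """Paste a patch, then restore the original background outside the object."""
    image = image.astype(float)
    mask = object_mask(image, threshold)
    initial = image.copy()
    h, w = patch.shape
    initial[top:top + h, left:left + w] = patch   # may spill past the object
    obj = initial * mask                          # object part of the synthesized image
    background = image * ~mask                    # background of the original image
    return obj + background                       # spill-over is replaced by background
```

In this sketch the multiplication by the mask and the final addition correspond to the multiplication and add operations of the patch paste module; a production pipeline would more likely rely on an image-processing library for thresholding and morphology.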
In [15], patches are randomly extracted and randomly pasted at any position of the image in the object category. Our method differs in two respects. First, it ensures that the extracted patches come from the object; second, the patches are pasted onto the object. As a result, the synthesized samples are closer to actual anomalies. In addition, the Cutpaste method applies only a simple data augmentation, while our method applies a more comprehensive data augmentation based on the anomalies that appear in the test dataset.
The MVTec dataset contains object categories and texture categories. For textures, the width and height of the patch are randomly chosen between 2 and 50 and between 10 and 40, respectively. The resulting patches have various sizes, matching the variety of sizes in the test dataset. The object categories contain both large and small objects. Capsule, pill, screw, metal-nut, and transistor are considered small objects, and the others large objects. The patch for small objects cannot be set too large because it would affect the anomaly detection result. Its width and height are chosen randomly from 2 to 16 and from 10 to 25, respectively, so the generated patch is smaller. The patch parameters for large objects are the same as those for texture images. The patch's rotation angle is randomly selected from 0 to 360 degrees for both categories.
Fig. 4 The procedure for extracting a patch
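The size and rotation ranges above translate directly into a small sampling routine. The sketch below assumes integer pixel sizes with inclusive bounds and a uniformly distributed angle; the function name `sample_patch_params` and the category spellings are illustrative choices rather than part of the original method.

```python
import random

# Categories treated as small objects in the text above.
SMALL_OBJECTS = {"capsule", "pill", "screw", "metal-nut", "transistor"}

def sample_patch_params(category: str, rng: random.Random) -> dict:
    """Draw patch width, height, and rotation angle for one training sample."""
    if category in SMALL_OBJECTS:
        width, height = rng.randint(2, 16), rng.randint(10, 25)
    else:
        # Large objects and textures share the same ranges.
        width, height = rng.randint(2, 50), rng.randint(10, 40)
    angle = rng.uniform(0.0, 360.0)  # rotation in degrees
    return {"width": width, "height": height, "angle": angle}
```

For example, `sample_patch_params("screw", random.Random(0))` draws from the small-object ranges, while `"carpet"` or `"bottle"` falls through to the larger ranges.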

Data Augmentation
Patch extraction and patch paste alone do not fully capture the complexity of anomaly samples. Hence, to better simulate anomalies, various data augmentation techniques are applied to the patches. To illustrate our approach, we provide examples from our test datasets. In Fig. 6, based on an analysis of the test datasets, two types of augmentation are identified: geometric and color transformations. The geometric transformation comprises shape and rotation. The color transformation comprises color, gray, noise, and blur. For example, hazelnut exhibits various defects, such as large circles, small circles, and thin lines, so a variety of shape transformations is needed. In carpet, the defects also appear at different rotations. These two phenomena occur not only in hazelnut and carpet but also in other categories. Therefore, these two augmentations are adopted for every object and texture in our experiments. Wood has anomalies of different colors, such as red, blue, and black. The anomalous regions of the capsule correspond to a gray transformation. The pill and bottle show noise and blur. The anomalies of the other categories are very similar to these four types; Fig. 6 lists only the typical anomalies. So, these four augmentation methods are used to synthesize anomalies. According to a statistical analysis of the test datasets, these four phenomena may not appear in the same object or texture at the same time. Therefore, our method randomly selects one of these augmentation methods at a time for the patch.
In summary, one augmentation method among blur, noise, color, and gray is randomly selected each time during training, whereas shape and rotation are always applied. These data augmentation methods are implemented by calling an existing function library [51], which contains a detailed parameter analysis. Among the blur methods, average blur is chosen. The parameter k is a hyperparameter of the blur function and is set in the range 2-11. For noise, the experiments use additive Gaussian noise, with the parameter selected in the range 5-15. ColorJitter augmentation is adopted for color and gray, with a brightness value from 0 to 0.9 and a hue value from 0 to 0.4.
Fig. 5 The procedure for synthesizing anomaly sample images
Fig. 6 Several abnormal samples from the test dataset, displayed and categorized
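The "pick exactly one of four augmentations per patch" rule can be sketched as follows. This is only an illustration under several assumptions: the patch is an H x W x 3 float array on a 0-255 scale, k is read as the side length of the averaging window, the noise parameter is read as the standard deviation, and the color and gray functions are simple stand-ins (a per-channel gain and a blend toward the channel mean) rather than the ColorJitter call from the library cited as [51]. All function names are hypothetical.

```python
import numpy as np

def average_blur(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    k = int(rng.integers(2, 12))                 # window size k in 2..11 (assumed meaning of k)
    lo, hi = (k - 1) // 2, k // 2
    padded = np.pad(patch, ((lo, hi), (lo, hi), (0, 0)), mode="edge")
    h, w = patch.shape[:2]
    acc = np.zeros_like(patch, dtype=float)
    for dy in range(k):                          # sum the k*k shifted views
        for dx in range(k):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (k * k)

def gaussian_noise(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    sigma = rng.uniform(5.0, 15.0)               # assumed: parameter is the noise std
    return patch + rng.normal(0.0, sigma, size=patch.shape)

def color_shift(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    gains = 1.0 + rng.uniform(-0.4, 0.4, size=3) # stand-in for a hue/color jitter
    return patch * gains

def gray_shift(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    strength = rng.uniform(0.0, 0.9)             # stand-in for a gray-strength jitter
    gray = patch.mean(axis=2, keepdims=True)
    return (1.0 - strength) * patch + strength * gray

AUGMENTATIONS = {"blur": average_blur, "noise": gaussian_noise,
                 "color": color_shift, "gray": gray_shift}

def augment_patch(patch: np.ndarray, rng: np.random.Generator):
    """Apply one randomly chosen photometric augmentation to the patch."""
    name = str(rng.choice(list(AUGMENTATIONS)))
    out = AUGMENTATIONS[name](patch.astype(float), rng)
    return name, np.clip(out, 0.0, 255.0)
```

Shape and rotation, which are always applied, are left out of this sketch; they would be applied to the patch before or after this step.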

One-Class Classifier
The one-class classifier, a component of our model, is designed to classify normal and abnormal samples. As shown in Fig. 3, simulated abnormal samples and normal samples are classified by a binary classifier. A normal sample is labeled 0; otherwise, it is labeled 1. The training objective of our method is given in Eq. (1), where χ is the training dataset of normal samples, Aug denotes data augmentation, g is a binary classifier, and CE is the cross-entropy loss.
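Written out with the symbols defined above, an objective of this kind would take the following form. This is a sketch consistent with the stated labels (0 for normal, 1 for simulated abnormal); how patch extraction, augmentation, and paste are composed inside Aug is an assumption here.

```latex
\mathcal{L} = \mathbb{E}_{x \in \chi}
  \Big[ \mathrm{CE}\big(g(x),\, 0\big) + \mathrm{CE}\big(g(\mathrm{Aug}(x)),\, 1\big) \Big]
  \tag{1}
```

The first term pushes the classifier to assign label 0 to each normal sample, and the second pushes it to assign label 1 to the anomaly synthesized from that same sample.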
The pre-trained ResNet34 network is used as the backbone of our CNN. The fully connected output layer is modified to have 2 outputs, since we only need to determine whether the result is 0 or 1.

Computing Anomaly Score
There are many methods to compute anomaly scores, such as the Gaussian density estimator and the kernel density estimator. In the experiments, the Gaussian density estimator is adopted. The probability density function is modeled using a multivariate Gaussian, defined as

$$p(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-u)^{T}\Sigma^{-1}(x-u)\right) \tag{2}$$

In Eq. (2), $u$ is the mean vector, $\Sigma$ is the symmetric covariance matrix, and $D$ is the number of dimensions. Both $u$ and $\Sigma$ are learned from normal training data. Given a vector $x$, its distance is computed with the Mahalanobis distance, defined as follows:

$$M(x) = \sqrt{(x-u)^{T}\Sigma^{-1}(x-u)} \tag{3}$$

In Eq. (3), $u$ and $\Sigma$ have the same meanings as in Eq. (2). $M(x)$ is the resulting measure for a sample $x$.
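A Gaussian density estimator of this kind can be sketched with NumPy as follows. The feature vectors are assumed to be rows of a 2-D array taken from normal training images; the small ridge term `eps` added to the covariance is an assumption for numerical stability and is not part of the description above. The function names are illustrative.

```python
import numpy as np

def fit_gaussian(features: np.ndarray, eps: float = 1e-6):
    """Estimate the mean vector u and covariance matrix sigma from normal features."""
    u = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    sigma = sigma + eps * np.eye(features.shape[1])   # assumed regularization
    return u, sigma

def mahalanobis(x: np.ndarray, u: np.ndarray, sigma: np.ndarray) -> float:
    """Mahalanobis distance of x from the fitted Gaussian."""
    d = x - u
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

def log_density(x: np.ndarray, u: np.ndarray, sigma: np.ndarray) -> float:
    """Logarithm of the multivariate Gaussian density at x."""
    dims = u.shape[0]
    _, logdet = np.linalg.slogdet(sigma)
    m2 = mahalanobis(x, u, sigma) ** 2
    return float(-0.5 * (dims * np.log(2.0 * np.pi) + logdet + m2))
```

A larger Mahalanobis distance, or equivalently a lower log density, indicates that a test feature lies further from the distribution of normal samples. Working with the log density rather than the density itself avoids underflow when the feature dimension is large.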

Anomaly Detection Results on Two Datasets
Table 1 displays the anomaly detection results on the MVTec dataset. Our method achieves the highest overall ROC-AUC average, surpassing the Cutpaste method by 1.51%. Specifically, in the object category, our method outperforms the Cutpaste method by approximately 1.3% in average ROC-AUC. The PaDiM method performs better than our method in the texture category, by nearly 0.4%. Nevertheless, our method achieves an average ROC-AUC around 2.1% higher than PaDiM in the object category. Notably, our method generalizes well across the 15 categories, with ROC-AUC values exceeding 90% in both texture and object categories. In some categories, e.g., cable and pill, our method scores lower than the Cutpaste method. Our analysis suggests that the anomalies in their test data mainly occur over a global area, whereas our method only simulates local anomalies, so it cannot simulate them well.
Table 1 The ROC-AUC scores for each class are listed. In addition, the average score over the texture and object categories is calculated separately, and the average score over all categories is also computed. Bold indicates the best result in each class
To validate our method on other datasets, we conducted experiments on the BTAD dataset and compared the results with those of a recent method. The results are shown in Table 2. Despite the differences between the BTAD and MVTec datasets in the types of anomalies, our method performs well on the BTAD dataset. Our method scores 4% higher than the VT-ADL method [50], which is based on the transformer framework. In addition, the result in the third row increased by about 8%.
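All results in this section are reported as ROC-AUC. As a minimal sketch of how such a score can be obtained from image-level anomaly scores, the function below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name and the toy scores are illustrative, and this is not necessarily the evaluation code used in the experiments.

```python
def roc_auc(scores, labels):
    """ROC-AUC from anomaly scores; labels are 0 (normal) or 1 (abnormal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both normal and abnormal samples are required")
    # Count abnormal/normal pairs that are ranked correctly (ties count as 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four abnormal/normal pairs are ordered correctly.
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for test sets of the size used here; a sort-based formulation would be preferable for larger ones.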
For the object category, we conducted an experimental comparison, as shown in Table 3. The baseline is our reproduction of the Cutpaste method. In our proposed method, the extracted patch must be located on the object. To ensure a fair comparison, data augmentation was not used in this set of experiments. Based on the experimental results, our method achieves a 4.5% improvement, confirming the effectiveness of our approach.
To measure the resource consumption and training time of our model, a comparison experiment was conducted against the baseline, and the results are shown in Table 4. Our model exhibits a level of GPU resource usage similar to that of the baseline, indicating that it has minimal impact on model training.

Full Grad CAM Results and Histogram of Abnormal Score Statistics
CAM is a tool for visualizing neural networks and displaying the activation regions within the network. It has various variants, including Grad CAM and Full Grad CAM. To better locate abnormal image regions, Full Grad CAM is used. Figure 7 shows anomaly samples and their results on the MVTec datasets. The first row shows the input images, the second row the ground truth of the input images, and the last row the visualization results. Our method can localize anomalies in both large and small defective regions. Figure 8 shows the Full Grad CAM results on the BTAD datasets. Compared with the MVTec AD datasets, the types of anomalies differ in shape, color, etc. Meanwhile, the anomalies are not obvious in the image. In the third row, the texture defects of the wood are almost invisible. However, our method can still localize the anomalies.
Fig. 7 Defect detection on the MVTec datasets. The first row is input images, the ground truth of input images is on the second row, and the last row is the visualization results
Fig. 8 The visualization results using our method on BTAD dataset
Fig. 9 Histogram of abnormal score statistics
Figure 9 shows a statistical histogram of the abnormal score results. The blue line represents the normal samples, while the yellow line represents the abnormal samples. The horizontal axis indicates the abnormal scores predicted by the model, where higher scores indicate a higher probability of being an abnormal sample. The vertical axis represents the number of samples.

Results of t-SNE
T-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm used for dimensionality reduction. It projects data from a high-dimensional space into a low-dimensional space. To examine how close the simulated anomalous samples are to the real samples in our experiment, t-SNE was adopted. A smaller distance indicates a higher similarity between the simulated samples and the real samples. The t-SNE plots of the model representations on the MVTec datasets are shown in Fig. 10. The green dots represent normal samples from the training datasets, while the red and blue dots represent abnormal samples and simulated abnormal samples, respectively. In Fig. 10, the simulated abnormal samples (blue dots) cover a portion of the real defect samples (red dots) across all categories. In certain categories, such as wood and zipper, the blue dots lie within the red dots. This indicates that the simulated abnormal samples successfully reproduce the real anomalies in some categories. Consequently, the representations learned by the model can effectively distinguish between normal and abnormal samples. Similarly, the t-SNE plots of the model representations on the BTAD dataset are presented in Fig. 11.

Comparison of Various Noises and Various Blurs
Among the types of noise, Gaussian, Poisson, and salt-and-pepper noise were studied. From Fig. 12, it can be seen that Gaussian and salt-and-pepper noise are more visible than Poisson noise. However, it is not known in advance which noise is more appropriate for the MVTec datasets. For blur, Gaussian, mean, and motion blur were studied. From Fig. 12, the differences among the blur effects are not very evident, so ablation experiments were conducted to select the better blur augmentation.
To improve the ROC-AUC results for anomaly detection, several ablation experiments were conducted, and the results are shown in Table 5. An analysis of the abnormal samples in the test datasets shows that they contain noise and blur. In each ablation setting, only one of them was chosen at a time. Table 5 shows that Gaussian noise is preferable, achieving 91.36%, about 8 points higher than the others. The results of Gaussian and motion blur are almost the same. Gaussian blur was eventually chosen as it obtained the best result. Therefore, Gaussian noise and Gaussian blur were adopted for noise and blur in our data augmentation.
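For readers who want to reproduce the visual comparison of the three noise types, the following NumPy sketch generates each of them on a single-channel image with values on a 0-255 scale. The default parameters (`sigma`, `amount`) are illustrative assumptions and are not the settings used in Table 5.

```python
import numpy as np

def gaussian_noise(img: np.ndarray, rng: np.random.Generator,
                   sigma: float = 10.0) -> np.ndarray:
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 255.0)

def poisson_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Signal-dependent noise: each pixel is redrawn from Poisson(pixel value)."""
    return np.clip(rng.poisson(img).astype(float), 0.0, 255.0)

def salt_and_pepper(img: np.ndarray, rng: np.random.Generator,
                    amount: float = 0.05) -> np.ndarray:
    """Set a fraction `amount` of pixels to 0 (pepper) or 255 (salt)."""
    out = img.astype(float).copy()
    u = rng.random(img.shape)
    out[u < amount / 2] = 0.0
    out[u > 1.0 - amount / 2] = 255.0
    return out
```

Poisson noise has a standard deviation equal to the square root of the pixel value, so on mid-gray regions it is of similar magnitude to moderate Gaussian noise, but it nearly vanishes on dark regions; this may be one reason it appears less pronounced in the examples.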

Study of Grayscale and Color Parameters
Color and gray are each controlled by a hyper-parameter. Figure 13 shows some examples of gray strength in hazelnut and a variety of colors in leather. As shown in Table 6, the color parameter ranges between 0 and 0.5 and was divided into three segments: 0-0.2, 0.3-0.5, and 0-0.5. The results were comparable, with all scores above 92%; the highest score falls in the 0-0.2 segment. The gray parameter takes values between 0 and 0.9 and was also divided into three segments: 0-0.5, 0.6-0.9, and 0-0.9. According to the experimental results, the best result is obtained between 0 and 0.5, reaching 94.16%. Therefore, 0-0.2 was chosen for the color parameter and 0-0.5 for the grayscale parameter in our experiments.

Ablation studies of localization and various augmentation methods selected
To explore more effective augmentation methods, various ablation experiments were conducted. The results are shown in Table 7. The first column is the baseline, which is based on the Cutpaste method. In the second column, location technology was applied along with shape transformation and patch rotation. The remaining columns correspond to different augmentation methods, along with their respective ROC-AUC values. From Table 7, it can be seen that using location technology, shape, and rotation improves the result by about 1 point in the second row. The fourth to the seventh rows give the results obtained with noise, blur, color, and gray augmentation. Using noise augmentation alone resulted in a ROC-AUC of 91.36%, whereas incorporating gray augmentation increased the ROC-AUC to 94.16%, a gain of approximately 3%. Eventually, the experiment achieved a ROC-AUC of 97.61%.
Table 7 The results are presented as ROC-AUC scores

Conclusions
In this paper, a method for simulating anomalous samples is proposed. For the object category, patch extraction and patch paste modules are designed, and a combined data augmentation approach is developed. The experiments show that ROC-AUC improves on both the MVTec and BTAD datasets, demonstrating the method's generalization ability. Our method may be best suited to abnormalities that occur frequently in specific scenarios. How to generalize to unseen anomalies is left for future research.

Fig. 3 An overview of our proposed anomaly detection architecture. The framework has four modules in the training stage: patch extraction, data augmentation, patch paste, and one-class classifier. A self-supervised learning method is used overall. In the training phase, only normal samples are available. The abnormal samples are first
Fig. 11 t-SNE visualization of representations of models trained on BTAD dataset

Fig. 12 Examples for Gaussian, Poisson, Salt and Pepper noise in pill and Gaussian, Motion, Average blur in metal-nut

Fig. 13 Examples of gray strength in hazelnut and a variety of colors in leather

Table 4
Resource consumption and inference time of the model

Table 5
Anomaly detection results of various types in noise and blur

Table 6
Results of various color and grayscale parameters