1 Introduction

Anomaly detection algorithms are widely used thanks to their stable performance and high detection efficiency. One classical approach focuses on reconstruction: a neural network encodes and decodes normal input samples, with reconstruction of the input as the training target. The distribution pattern of the normal samples is thereby learned, and anomalies are identified by analyzing the differences between the original and reconstructed images. Commonly used reconstruction-based methods are categorized into autoencoders (AE) [1,2,3,4] and generative adversarial networks (GAN) [5,6,7] according to their training modes. Frameworks [8, 9] that combine AE and GAN have also achieved impressive results. Moreover, embedding-based methods [10, 11] have shown good performance in anomaly detection tasks; their basic principle is to match the features of test samples against those of normal samples. However, the inference step of such models involves a complex feature-matching process, which increases the computational cost even when the training phase takes little time.

In practical applications, the scarcity and variety of real-world anomaly samples pose significant challenges to supervised learning. In this work, a self-supervised surface anomaly detection method is proposed to address this issue. The main contributions are as follows.

  • Our method, based on autoencoder reconstruction, introduced a novel combination of a foreground enhancement strategy and an efficient channel attention mechanism, which considerably improved anomaly detection performance.

  • An anomaly generation module was designed to generate anomaly samples using a foreground enhancement strategy, which mitigated the impact of irrelevant background information on model learning.

  • An efficient channel attention mechanism was introduced into the autoencoder. It captured cross-channel interaction information and enhanced the network’s ability to extract channel features for better reconstruction.

  • Experimental validation on the MVTecAD [12] and BTAD [13] datasets demonstrated that our method outperformed other advanced approaches on multiple metrics. In particular, it improved the pixel-level average AP score by up to 12.5%.

2 Related work

Most surface anomaly detection models [7, 14, 15] aim to explore the broad patterns of normal samples, and only normal samples are used for reconstruction to train the model. MS-FCAE [16] builds multi-scale feature information on top of an AE, which provides different levels of contextual information for image reconstruction; the reconstructed image is therefore more accurate and clearer. AnoGAN [17] first introduces GAN into anomaly detection. However, it requires several iterations of optimization in the inference stage to generate a suitable normal image as a reference, so the algorithm lacks the computational efficiency needed for real-time detection tasks. F-AnoGAN [18] introduces an additional encoder to extract image features based on GAN, which guides the generator to create the best-matching images. In this regard, our work introduced an efficient channel attention mechanism [19] during the reconstruction of anomalous regions. This mechanism effectively captured inter-channel interactions and enhanced the capacity of the network to extract features, resulting in better reconstruction quality.

Some studies [20,21,22] seek to produce artificially simulated anomalous samples during the training phase to reveal the hidden differences between normal and anomalous samples. Specifically, CutPaste [20] utilizes augmentation techniques such as copy and paste to simulate anomalous samples by randomly copying a tiny rectangular region from the input image and pasting it onto the resulting image. DRAEM [21] creates anomalous areas by superimposing extra texture images as noise over normal images. Haselmann [22] adds rectangular masks at random positions to normal samples to simulate true anomalies. Considering the impact of background interference, we designed an effective foreground enhancement strategy. The strategy involved foreground extraction on the images and introduced noise to simulate anomaly generation, which resulted in more realistic anomaly samples for training.

3 Anomaly detection method

Our method consists of a foreground-enhanced anomaly generation module, an autoencoder reconstruction module and a segmentation module. The foreground-enhanced anomaly generation module generates simulated anomaly samples by combining normal images with anomaly texture source images. The anomaly generation strategy can provide an arbitrary number of anomaly samples and accurate pixel-level segmentation ground truth maps, enabling our method to be trained without real anomaly samples. The autoencoder with an efficient attention mechanism is trained with the reconstruction loss \(L_{\textrm{rec}}\) to repair the anomalous regions. The input and output of the autoencoder are concatenated and fed into the segmentation network, which is trained to localize anomalous regions with the segmentation loss \(L_{\textrm{seg}}\) (Fig. 1). A mean filter convolution layer smooths the segmentation module’s output, and the anomaly score is the maximum value of the smoothed anomaly score map.

Fig. 1
figure 1

The anomaly detection process of our method
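The scoring step described above (mean-filter smoothing followed by taking the maximum) can be sketched as follows; the kernel size `k = 21` is a hypothetical choice, as the paper does not state the filter size:

```python
import numpy as np

def anomaly_score(score_map: np.ndarray, k: int = 21) -> float:
    """Smooth a 2-D anomaly score map with a k x k mean filter and
    return the maximum smoothed value as the image-level score."""
    pad = k // 2
    padded = np.pad(score_map, pad, mode="edge")
    # Integral image: each k x k window sum becomes O(1) to evaluate.
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = score_map.shape
    sums = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    smoothed = sums / (k * k)
    return float(smoothed.max())
```

The edge padding keeps border windows well-defined so that a constant map yields exactly its constant value as the score.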

3.1 Foreground enhancement

Anomalies appear in diverse manifestations in real-world scenarios, which poses challenges to comprehensive anomaly data collection. Consequently, constructing ideal large-scale anomaly datasets for training supervised detection models becomes arduous. Therefore, an effective strategy is designed to simulate anomaly generation for self-supervised learning. Figure 2 depicts the anomaly generation strategy with foreground enhancement.

Fig. 2
figure 2

Simulation anomaly generation strategy

The noise image \(P\) is obtained using a Perlin noise generator [23] to capture various anomalous shapes. It is binarized with a threshold \(T\) \(\left( T=0.5\right) \) to form the anomaly mask image \(P_{m}\). Besides, in some actually collected image datasets, certain industrial components do not occupy a sufficiently high proportion of the image, so directly adding the anomaly mask could easily place noise in the background. The resulting disparity between the data distributions of real and simulated anomaly samples makes it harder for the model to extract meaningful information. Therefore, a foreground enhancement strategy is applied to such images. The Otsu method [24] is used to differentiate the foreground from the background by maximizing the inter-class variance. The original image \(I\) is then binarized to generate mask \(I_{m}\). Element-wise multiplication of masks \(P_{m}\) and \(I_{m}\) yields the mask image \(M\).

$$\begin{aligned} M=P_{m}\odot I_{m} \end{aligned}$$
(1)

where \(\odot \) denotes the element-wise multiplication operation.
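A minimal sketch of the mask construction in Eq. (1), assuming grayscale inputs in [0, 1]; a generic noise map stands in for the Perlin generator (not reproduced here), and the bright-foreground convention in the Otsu step is our assumption:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> float:
    """Otsu's method [24]: pick the threshold maximising inter-class variance."""
    hist, edges = np.histogram(gray.ravel(), bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # class-0 (below threshold) probability
    mu = np.cumsum(p * centers)       # cumulative mean
    mu_t = mu[-1]                     # global mean
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    var_b = np.zeros_like(w0)
    var_b[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(var_b)]

def make_anomaly_mask(image_gray: np.ndarray, noise: np.ndarray, t: float = 0.5):
    """Eq. (1): M = P_m ⊙ I_m, with the noise map standing in for Perlin noise."""
    p_m = (noise > t).astype(np.float32)                                 # anomaly mask
    i_m = (image_gray > otsu_threshold(image_gray)).astype(np.float32)   # foreground
    return p_m * i_m
```

Because \(M\) is the product of the two binary masks, simulated anomalies can only fall on foreground pixels.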

The anomaly texture source image \(D\) is drawn from a collection of anomaly textures unrelated to the distribution of the original image \(I\). Image \(D\) is randomly enhanced with three methods chosen from the set {sharpness, equalize, solarize, posterize, auto-contrast, brightness change}, which preserves the diversity of anomalies. The enhanced texture image \(D\) and original image \(I\) are masked through mask \(M\). Subsequently, they are blended with the original image \(I\) processed through mask \(\bar{P}_{m}\) to obtain the final simulated anomaly image \(I_{A}\).

$$\begin{aligned} I_{A}=\left( 1-\beta \right) \left( I^{}\odot M^{}\right) +\beta \left( D^{}\odot M^{}\right) +I^{}\odot \bar{P}_{m} \end{aligned}$$
(2)

where \(\beta \) denotes the opacity parameter during blending, which is randomly and uniformly sampled from [0.1, 1.0], and \(\bar{P}_{m}\) is the inverse of \(P_{m}\).
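Eq. (2) can be sketched directly; the formula is reproduced term by term, with \(\beta \) sampled as described:

```python
import numpy as np

def blend_anomaly(img, texture, p_m, i_m, rng=None):
    """Eq. (2): I_A = (1-β)(I ⊙ M) + β(D ⊙ M) + I ⊙ P̄_m,
    where M = P_m ⊙ I_m and β ~ U(0.1, 1.0)."""
    if rng is None:
        rng = np.random.default_rng()
    beta = rng.uniform(0.1, 1.0)
    m = p_m * i_m                              # foreground-restricted mask
    return (1 - beta) * img * m + beta * texture * m + img * (1 - p_m)
```

Pixels outside the Perlin mask \(P_{m}\) are passed through unchanged, while masked foreground pixels become a \(\beta \)-weighted blend of image and anomaly texture.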

In addition to the data augmentations applied to the anomaly texture source image \(D\), the strategy performs further operations. It randomly rotates 30% of input images \(I\) and Perlin noise maps \(P\) within [\(-90^{\circ }\), 90\(^{\circ }\)], which strengthens robustness. Furthermore, the granularity of the Perlin noise is varied randomly during simulated anomaly generation, reflecting the diversity of component anomalies in actual industrial environments and yielding anomaly mask images \(P_{m}\) of various sizes and shapes (Fig. 3).

Fig. 3
figure 3

Multi-scale anomaly mask map
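The effect of noise granularity can be illustrated with a simple bilinear value-noise stand-in for the Perlin generator [23]; the grid scales (2, 8, 32) are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

def value_noise(shape, scale, rng):
    """Low-frequency noise via bilinear upsampling of a coarse random grid.
    Smaller `scale` -> coarser noise -> larger, blob-like anomaly masks."""
    h, w = shape
    g = rng.random((scale + 1, scale + 1))
    ys = np.linspace(0, scale, h)
    xs = np.linspace(0, scale, w)
    y0 = np.floor(ys).astype(int).clip(0, scale - 1)
    x0 = np.floor(xs).astype(int).clip(0, scale - 1)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    top = g[y0][:, x0] * (1 - fx) + g[y0][:, x0 + 1] * fx
    bot = g[y0 + 1][:, x0] * (1 - fx) + g[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

# Binarizing at T = 0.5 gives multi-scale masks as in Fig. 3.
rng = np.random.default_rng(0)
masks = [value_noise((256, 256), s, rng) > 0.5 for s in (2, 8, 32)]
```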

3.2 Autoencoder reconstruction

An efficient channel attention (ECA) [19] mechanism, introduced in the encoding phase of the autoencoder, effectively captures cross-channel interaction information and enhances the network’s feature extraction capability. The autoencoder reconstructs the local anomalous patterns of input image \(I_{A}\) into patterns closer to the normal sample distribution. Meanwhile, it keeps the non-anomalous areas of the original image unaltered and produces a reconstructed image \(I_{r}\) of the same size as the original image. Figure 4 presents the architecture of the autoencoder.

Fig. 4
figure 4

Architecture of the autoencoder

ECA uses a dynamically sized convolution kernel to extract features over an appropriate range for input feature maps with different numbers of channels. The kernel size adapts through a function of the channel count (Fig. 5).

Fig. 5
figure 5

Process of efficient channel attention (ECA)

$$\begin{aligned} {k=\psi \left( C \right) =\left\| \frac{\log _{2}\left( C\right) }{\gamma }+\frac{b}{\gamma }\right\| _{odd}} \end{aligned}$$
(3)

where \(k\) is the convolution kernel size, \(C\) is the number of channels, \(\Vert \cdot \Vert _{odd}\) indicates that \(k\) can only take odd values, and \(\gamma =2\) and \(b=1\) set the ratio between the number of channels \(C\) and the convolution kernel size.
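Eq. (3) and the full ECA step can be sketched as follows; the learned 1-D convolution weights are replaced by a uniform kernel purely for illustration, so only the structure (pooling, adaptive-size cross-channel convolution, sigmoid gating) matches the mechanism:

```python
import numpy as np

def eca_kernel_size(c: int, gamma: int = 2, b: int = 1) -> int:
    """Eq. (3): adaptively choose an odd 1-D kernel size from the channel count."""
    t = int(np.log2(c) / gamma + b / gamma)
    return t if t % 2 == 1 else t + 1

def eca(feat: np.ndarray) -> np.ndarray:
    """Efficient channel attention on a (C, H, W) feature map: global average
    pooling, a 1-D convolution of adaptive size k across channels, a sigmoid
    gate, then channel-wise re-weighting of the input."""
    c = feat.shape[0]
    k = eca_kernel_size(c)
    pooled = feat.mean(axis=(1, 2))              # (C,) channel descriptor
    w = np.full(k, 1.0 / k)                      # stand-in for learned weights
    padded = np.pad(pooled, k // 2, mode="edge")
    conv = np.convolve(padded, w, mode="valid")  # cross-channel interaction
    gate = 1.0 / (1.0 + np.exp(-conv))           # sigmoid attention weights
    return feat * gate[:, None, None]
```

For example, a 64-channel feature map yields \(k=3\), so each channel weight depends only on its two neighbouring channels, keeping the attention lightweight.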

3.3 Segmentation

The segmentation network employs a U-net [25]-like structure. Input \(I_{A}\) and output \(I_{r}\) of the autoencoder are first concatenated along the channel dimension and then fed into the segmentation network, providing enough information for anomaly localization. Five downsampling convolution blocks are applied for multi-scale feature extraction; together with the original resolution, this yields six scales in total, allowing features to be fully extracted. At each upsampling stage, the feature map of equivalent size from the feature-extraction path is copied and fused. Eventually, the image is restored to its original size to obtain an accurate anomaly segmentation map. Figure 6 shows the architecture of the segmentation network.

Fig. 6
figure 6

Architecture of the segmentation network

3.4 Loss function

The structural similarity index measure (SSIM) [26] has become a common loss function in computer vision. It is typically utilized to evaluate the similarity of two images and considers three key features of an image (i.e., luminance, contrast and structure).

$$\begin{aligned} \textrm{SSIM}\left( x,y\right) =l\left( x,y\right) ^{\alpha }\times c\left( x,y\right) ^{\beta }\times s\left( x,y\right) ^{\gamma } \end{aligned}$$
(4)

where \(l\left( x,y\right) \), \(c\left( x,y\right) \) and \(s\left( x,y\right) \) are the luminance similarity, contrast similarity and structure similarity of images x and y, respectively. \(\alpha \), \(\beta \) and \(\gamma \) represent the balanced hyperparameters.

The \(L_{2}\) loss is commonly utilized in anomaly detection algorithms, but it assumes adjacent pixels to be independent. Therefore, the SSIM loss is additionally used to capture the interactions between pixels.

$$\begin{aligned} L_{\textrm{SSIM}}\left( I,I_{r}\right) =\frac{1}{N_{P}}\sum _{i=1}^{H}\sum _{j=1}^{W}1-\textrm{SSIM}\left( I,I_{r}\right) _{\left( i,j\right) } \end{aligned}$$
(5)

where \(H\) and \(W\) are the height and width of original image \(I\), respectively. \(N_{P}\) is the number of pixels in \(I\); \(I_{r}\) is the reconstructed image. \(\textrm{SSIM}\left( I,I_{r}\right) _{\left( i,j\right) }\) is the SSIM value of \(I\) and \(I_{r}\) centered on coordinate \(\left( i,j\right) \), so the reconstruction loss is defined by

$$\begin{aligned} L_{\textrm{rec}}\left( I,I_{r}\right) =\lambda L_{\textrm{SSIM}}\left( I,I_{r}\right) +L_{2}\left( I,I_{r}\right) \end{aligned}$$
(6)

where \(\lambda \) is the balancing hyperparameter between the two losses.
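Eq. (6) can be sketched compactly; as a simplification, the per-pixel windowed SSIM map of Eq. (5) is collapsed to a single whole-image window, and \(\lambda = 1\) is a hypothetical default since the paper does not state its value:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image -- a simplification of the
    windowed SSIM map in Eq. (5), kept short for illustration."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def rec_loss(i, i_r, lam=1.0):
    """Eq. (6): L_rec = λ · L_SSIM + L_2, with L_SSIM ≈ 1 − SSIM here."""
    l_ssim = 1.0 - ssim_global(i, i_r)
    l2 = np.mean((i - i_r) ** 2)
    return lam * l_ssim + l2
```

A perfect reconstruction drives both terms to zero, while structural distortions are penalized even when the pixel-wise \(L_{2}\) error is small.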

Focal loss [27] (\(L_{\textrm{seg}}\)) is applied to the segmentation network because it mitigates the imbalance between positive and negative samples and improves the robustness of segmentation on hard samples. According to the network’s reconstruction and segmentation goals, the overall training loss is defined by

$$\begin{aligned} L\left( I,I_{r},M_{a},M\right) =L_{\textrm{rec}}\left( I,I_{r}\right) +L_{\textrm{seg}}\left( M_{a},M\right) \end{aligned}$$
(7)

where \(I\) is the input image; \(I_{r}\) is the reconstructed image; \(M\) is the output segmentation mask; and \(M_{a}\) is the ground truth.
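The segmentation term \(L_{\textrm{seg}}\) can be sketched as a standard binary focal loss; \(\alpha = 0.25\) and \(\gamma = 2\) are the common defaults from [27], not values reported in this work:

```python
import numpy as np

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss between a predicted anomaly-probability map `pred`
    and the ground-truth mask `target` (both in [0, 1])."""
    pred = np.clip(pred, eps, 1.0 - eps)
    p_t = np.where(target == 1, pred, 1.0 - pred)       # prob. of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^γ down-weights well-classified pixels, focusing the loss
    # on hard, misclassified ones.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

Confidently correct pixels contribute almost nothing, which is what lets the loss cope with the heavy normal/anomalous pixel imbalance.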

4 Experiments

The MVTecAD [12] and BTAD [13] anomaly detection and localization datasets were used to evaluate the effectiveness of our method. Our method was aimed particularly at anomaly detection in images with background interference. The MVTecAD and BTAD datasets contained different categories of anomaly images; however, images without backgrounds (e.g., texture-category images) were deemed irrelevant for demonstrating the efficacy of our methodology. Therefore, not all categories in the two datasets were tested. The Adam optimizer [28] was used in the training phase, with a total of 700 iterations. The initial learning rate was set to 0.0001 and decayed at the \(560^{th}\) and \(630^{th}\) iterations with a decay factor of 0.2. The input images were uniformly scaled to 256\(\times \)256, and the batch size was 16.

A series of evaluation metrics was calculated to quantitatively assess the detection capacity. The area under the receiver operating characteristic curve (AUROC) was the primary metric for comparing anomaly detection results. However, most anomalous areas are relatively small in practical applications, so the metric is dominated by the large number of non-anomalous pixels even when only a small number of anomalous pixels are detected. As a result, the pixel-level AUROC did not reflect localization accuracy well. Therefore, the average precision (AP) metric, the area under the precision–recall curve, was additionally calculated. It is particularly well suited to highly imbalanced classes, notably in anomaly detection scenarios where precision plays a pivotal role.
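For reference, the AP metric described above can be computed from per-pixel anomaly scores and binary ground-truth labels as follows (the standard rank-based formulation, not code from the paper):

```python
import numpy as np

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """AP = area under the precision-recall curve: average of the precision
    values at the ranks where each true positive is retrieved."""
    order = np.argsort(-scores)            # sort pixels by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                 # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    # Average precision over the ranks of the actual positives.
    return float(np.sum(precision * labels) / labels.sum())
```

Because every positive pixel contributes a precision term, AP degrades sharply when anomalous pixels are ranked below normal ones, which is why it is more sensitive than pixel-level AUROC for small anomalies.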

4.1 Comparison with existing methods

Our approach is compared with unsupervised anomaly detection methods for images developed in recent years, including GANomaly [15], PaDiM [11], STAD [29], CutPaste [20], and DRAEM [21]. GANomaly combines autoencoders and generative adversarial networks; PaDiM extracts patch embeddings from the input image using a previously trained CNN; STAD solves the unsupervised anomaly segmentation problem using a student–teacher network; and CutPaste and DRAEM attempt to generate simulated anomalous samples during training. In summary, our method outperforms other methods and achieves the highest AUROC and AP scores at both the image level and pixel level.

Table 1 Comparative results of image-level AUROC (%) on the MVTecAD dataset
Table 2 Comparative results of pixel-level AUROC (%) and AP (%) on the MVTecAD dataset

Table 1 displays the image-level AUROC metric. Our method achieves the highest or second-highest AUROC score in each of the five categories in the MVTecAD dataset. Compared to the advanced method DRAEM, our method further improved the average image-level AUROC score by 0.3%. This improvement is evident not only on the screw dataset, where anomalous regions are minimal and challenging to distinguish, but also on the toothbrush dataset with limited training samples, highlighting the effectiveness of our approach.

The pixel-level AUROC and AP metrics in Table 2 show the excellent performance of our method. The average AUROC is improved by 0.6% over PaDiM, and the average AP is significantly improved by 12.5% over DRAEM. These improvements can be attributed to the ECA mechanism, which enhances the model’s reconstruction ability for images containing irregular workpiece anomalies. Additionally, the foreground enhancement strategy eliminates background interference, which allows the model to acquire more valuable information.

Fig. 7
figure 7

Anomaly case analysis. a Anomaly image; b Reconstructed image; c Ground truth; and d Detection result

The pill dataset shows the poorest detection performance. The original training samples of pills all have spots of the same color. During testing, however, several samples differ only in spot color and have no surface anomalies. In this case, the model mistakenly identifies this type of anomaly as a staining anomaly rather than a category anomaly, which leads to significant disparities between the segmentation result and the ground truth and degrades the detection metric (Fig. 7).

To fully demonstrate our advantages, the same settings as for the MVTecAD dataset are used to test the more challenging BTAD dataset, without any additional data augmentation. The anomaly detection results are compared with those of established algorithms [13, 21, 30]. As shown in Table 3, our method outperforms other advanced algorithms in terms of average AUROC scores at the image level and pixel level for two categories within the BTAD dataset, demonstrating exceptional efficacy.

Table 3 Comparative results of image-level and pixel-level AUROC (%) on the BTAD dataset
Fig. 8
figure 8

Visualization of anomaly detection results

Figure 8 shows the visualization of anomaly detection results, and each column displays the anomalous image, ground truth, reconstructed image, and detection result in sequential order. It can be observed that our algorithm can clearly reconstruct anomalous images while accurately locating surface anomalies on the products.

4.2 Ablation study

To validate the effectiveness of our method, ablation studies were carried out by applying the foreground enhancement strategy and adding multiple ECA modules to the baseline model. The baseline model directly simulated anomalies by adding noise to the input image without any enhancement strategy, and its network did not contain any attention module. Tables 4 and 5 show the ablation results for each dataset. It is evident that both modules bring improvements over the baseline model. When both modules are used simultaneously, our method significantly improves the average pixel-level AP score by 13.6% on the MVTecAD dataset. Similarly, it notably increases the average image-level and pixel-level AUROC scores by 6.9% and 12% on the BTAD dataset, respectively. The ablation results further demonstrate the effectiveness of each module.

Table 4 Ablation results on MVTecAD dataset
Table 5 Ablation results on BTAD dataset

In Fig. 9, each group presents the input anomaly image as well as the localization result of the baseline model, the model with only added ECA modules, the model only using foreground enhancement strategy, and the model combining foreground enhancement strategy and ECA modules. It is evident that the combination of the foreground enhancement strategy and the ECA modules yields the best anomaly localization results.

Fig. 9
figure 9

Comparison of ablation results

5 Conclusion

A self-supervised method for surface anomaly detection was proposed in this paper. The method required only normal samples during training and simulated real anomalies through an anomaly generation strategy with foreground enhancement, which mitigated, to some extent, the impact of irrelevant background information on model learning. Accurate anomaly detection results were obtained through an autoencoder with efficient channel attention and a U-net-like network for fine segmentation, addressing the issue of imprecise anomaly localization in existing methods. Experimental results on the MVTecAD and BTAD datasets demonstrated that our method achieved excellent performance in anomaly detection and localization. In particular, compared with other advanced methods, the pixel-level average AP was significantly improved by 12.5%. The proposed method provided a better depiction of anomaly segmentation details and exhibited superior overall detection performance.