Introduction

Image anomaly detection serves as a vital application of computer vision, with significant value in practical settings such as quality inspection, medical image analysis, and security monitoring [1, 2]. Consequently, the study of efficient anomaly detection methods is of great importance. However, neural network-based computer vision models, such as ResNet [3] and deep one-class classification [4], typically depend on substantial volumes of data. Given the sparsity of anomaly data, it is challenging to rely on existing samples for anomaly detection [5]. To address this issue, we employ a data augmentation technique grounded in self-supervised learning for anomaly detection, which is particularly suitable for real-world production scenarios where obtaining anomalous data is difficult.

Fig. 1

The illustration of different data augmentation methods. Subfigure (b) shows Cutout, which has regular changes in shape and color. In subfigure (c), the shape of the mask changes constantly while the color remains the same, generating pseudo data with randomly changing shapes. Subfigure (d) shows pixel-point random walk data augmentation with a color gradient

Despite recent advancements in data augmentation [6], developing effective data augmentation models remains a formidable task. The diversity of defects further complicates data augmentation, as various defect types need to be considered. Traditional data augmentation methods, characterized by regularity and obvious variability, can lead trained models to find shortcuts [7, 8]. Current self-supervised anomaly detection methods face several challenges: (1) defect scarcity, which makes it difficult to train a good anomaly detection model with minimal abnormal data; (2) defect diversity, as traditional self-supervised learning methods struggle to construct pseudo-anomalous samples that reflect the diversity of defects in practical production; and (3) effectiveness of data augmentation, as regular and obvious data augmentation methods tend to cause models to learn shortcuts, leading to lazy models. Figure 1 visualizes the samples produced by different data augmentation methods. We find that Cutout [9] constructs pseudo-anomalous samples with pronounced regularity and obvious variability.

In this paper, we primarily focus on two aspects of tasks, namely anomaly detection and anomaly segmentation. The goal of the anomaly detection task is to recognize abnormalities at a coarse granularity in images, determining whether an image contains any anomalies. However, this approach does not provide precise location information of the anomaly. On the other hand, the anomaly segmentation task aims to achieve a more refined level of anomaly localization, showcasing the segmentation at a fine granularity using visualization techniques, thus pinpointing the exact location of the anomaly.

The data representations of normal and abnormal images exhibit different data distributions as shown in Fig. 2. This paper establishes an anomaly detector based on the distribution of image representations, and diverse data augmentation can create a more varied distribution of anomalies.

Contributions To enhance the uncertainty and diversity of data augmentation, we propose a self-supervised data augmentation method based on random changes in color and shape. Specifically, we establish a pixel-point random walk method, in which the shapes and colors of the original images change randomly during the walk, so as to create more random and diverse pseudo-anomaly samples. Combined with a Siamese feature extraction network, we build a pretext-task model that is better suited for anomaly detection. The main contributions are as follows. (1) We propose a method for creating pseudo-anomaly samples using P2 Random Walk. More diverse random data generation is achieved by random pixel walking. (2) We create a two-stage framework for anomaly detection. In the training stage, features are extracted by a Siamese network with shared parameters; in the test stage, anomalies are detected by a one-class classification model. (3) On the MVTec dataset, we conduct in-depth evaluation studies and experimentally show that P2 Random Walk performs better in anomaly detection. This verifies that the more random pseudo-anomaly samples proposed in this paper benefit feature extraction.

In this paper, Sect. 2 discusses the related work, and the generation of the P2 Random Walk pretext task is presented in Sect. 3; Sect. 4 details the comparison of different experiments to prove the effectiveness of the proposed algorithm.

Related work

In anomaly detection, owing to the lack of anomaly data and the continuous emergence of new anomalies, methods relying only on normal samples are gradually being applied [10,11,12,13]. Discriminant models and reconstruction models are commonly used anomaly detection methods. In previous studies, classification surface construction models often outperformed coding reconstruction models [14]. However, the effectiveness of reconstruction models has significantly improved with the application of image coding techniques and deep networks, such as the Transformer [15] and the generative adversarial network (GAN) method [16]. In addition, extensions of deep cluster learning methods and self-supervised learning are applied to anomaly detection to improve the efficiency and performance of detection models [17, 18].

Fig. 2

The difference in distribution characteristics between normal data types and enhanced pseudo-anomalous data is significant

The purpose of the discriminant model is to establish boundaries between normal and abnormal data, which can be utilized to classify data into these two categories [19]. To address the problem of missing anomalous samples, a one-class classification model built from normal sample data creates a distinct discriminative space for the data features [20]. Traditional discriminative methods construct classification surfaces with a one-class SVM based on statistical modeling. The one-class SVM is a typical single-class model [21] that maps normal sample features to a uniform feature space [22]. An anomaly detection method based on PSO-SVM uses multi-scale information for fault diagnosis of airborne fuel pumps [23]. Luo et al. present an anomaly detection method based on sparse-coding-inspired deep neural networks (DNNs) [24]. Traditional classification-surface methods have limited capacity for high-dimensional data. To accommodate high-dimensional image data more effectively, we utilize a deep network-based approach [25] for image anomaly detection. By training a deep network that minimizes the feature space, we achieve a compact feature distribution. Another commonly used method in anomaly detection is the reconstruction model [26]. The reconstruction network has a strong reconstruction ability for normal training samples and a poor reconstruction ability for abnormal data, so anomaly detection is performed by comparing the differences before and after reconstruction. The reconstruction model disregards anomalous pixels during the encoding process, preventing regular reconstruction of the image data in the abnormal parts [27, 28].

Autoencoding methods construct symmetric networks using encoders and decoders in the reconstruction model [29]. In the reconstruction process, the aim is not only to achieve a good feature representation for normal samples, but also to ensure that abnormal data cannot be well represented. In addition, the reconstruction process can deteriorate due to local defects [28, 30]. The loss of high-frequency information, attributed to the autoencoder’s inherent limitation, is circumvented by employing a recurrent update mechanism for the encoder’s input parameters. This establishes a dual latent-space learning process that ensures any sample can be represented from the latent space while abnormal samples cannot be well represented. [31] treats anomaly detection as a self-supervised inpainting problem by randomly removing parts of the image and reconstructing the removed regions, which addresses the encoder's tendency to reconstruct anomalous pixels. Mask learning is often used in anomaly detection and image inpainting [32]. In the sparse representation approach, [29] uses denoising autoencoders for anomaly detection in images. Moreover, the Student–Teacher Feature Pyramid Matching (STPM) approach is employed for detecting anomalies by reconstructing image features and analyzing the differences between anomalous and normal images. GAN methods are also used for anomaly detection, reconstructing anomalies by training image encoders and decoders [33].

In previous studies, network models trained with normal samples excel in reconstructing normal data but perform poorly in reconstructing anomalous data [34]. Anomaly detection is therefore performed using this property of the reconstruction model. In addition to direct anomaly detection using reconstruction, GANs can perform data augmentation and pseudo-anomaly data construction through self-supervised learning [35]. Self-supervised learning converts the task into supervised learning for training, yielding more targeted anomaly detection and classification models. Pseudo-sample generation is an effective way to enhance anomaly detection [36]. The self-supervised pretext task is first designed to learn task-specific features that capture edge distributions in different spaces through adversarial training [37].

Table 1 Parameters used in our experiments

Various data augmentation models are used in self-supervised learning to better construct pretext tasks in an unsupervised context [38]. It is worth noting that uncertain anomalous data and the emergence of new, previously unseen anomalies call for a higher level of uncertainty in pseudo-anomaly sample construction, which in turn leads to better results. This anomaly uncertainty brings new challenges to the construction of pretext tasks, and the ability to construct more uncertain and effective pretext tasks is the focus that self-supervised learning needs to address. P2 Random Walk constructs more uncertain pretext tasks, considering the diversity of colors and shapes, and builds more uncertain pseudo-anomaly samples to improve the model’s robustness.

Methodology

Problem statement

Our goal is to create an effective data augmentation method for visual anomaly detection, because regular data augmentation leads to the model learning shortcuts. Building a more challenging pretext task encourages the model to learn more about the images. Anomaly-free images and pseudo-defect images are used to train the network. Given an image, we utilize the feature extraction network and classification model to detect sample i; \({\mathcal {N}}\) learns a scoring method to construct a classifier,

$$\begin{aligned} {\mathcal {N}}(i) = {\left\{ \begin{array}{ll} 1 &{} \text {Abnormal} \\ 0 &{} \text {Normal} \end{array}\right. }, \end{aligned}$$
(1)

\({\mathcal {N}}(i) = 0\) represents anomaly-free data and \({\mathcal {N}}(i) = 1\) represents an anomaly. By using self-supervised learning, we increase the difficulty of the classification task to encourage the extractor to extract more image information for anomaly detection. The performance of the model is validated by measuring the area under the receiver operating characteristic curve (AUROC) of \({\mathcal {N}}(i)\). The parameters and their descriptions are shown in Table 1.

Fig. 3

The framework of P2 Random Walk anomaly detection. In the first stage, we train the pretext task with normal and pseudo data to construct a feature extraction network with anomaly classification capability. In the second stage, the feature extraction network obtained through the training phase is used to extract the features of the image to be tested, and the extracted features are evaluated by the Mahalanobis distance to obtain the anomaly score of the image to be detected

As illustrated in Fig. 3, we utilize an anomaly detection framework with P2 Random Walk comprising two stages. The first stage, consisting of the \({\mathcal {P}}\) and \({\mathcal {N}}\) modules, acts as the feature extractor training step. The second stage makes use of the trained feature extraction network to build an anomaly detection classification model.

(1) \({\mathcal {P}}\) is the data augmentation module that creates the pseudo-defect samples for the training stage. The generated pseudo-anomaly samples are used as input to the Siamese neural network \({\mathcal {N}}\) along with the normal samples.

(2) \({\mathcal {N}}\) is a binary classification model used for training the pretext task. \({\mathcal {N}}\) captures feature representations to distinguish between normal and abnormal instances; during training, it constructs more differentiated feature representations.

(3) The second stage \({\mathcal {F}}\) constructs the anomaly detection classification model. \({\mathcal {N}}\) is the feature extractor trained in the first stage, and its features are used to build a Gaussian-distribution-based classifier for anomaly detection. In P2 Random Walk, we implement ConvNeXt as the feature extraction network and a Gaussian density estimator (GDE) as the second-stage classifier.

Pixel-point random walk

\({\mathcal {P}}\) is the construction module for pseudo-defect images. We adopt data augmentation to build the \({\mathcal {P}}\) module and extract low-level features without anomaly data. Specifically, we create pseudo-anomaly samples by P2 Random Walk, which achieves a gradual pixel change. From dataset I, we select normal images to be processed by P2 Random Walk. In practice, image representations learned through pseudo-defect construction prove to be effective alternatives for defect-related tasks, even though the constructed samples may not perfectly match authentic faulty images. However, generating more diverse data during augmentation remains a challenge for these methods. Irregular data augmentation methods can avoid model laziness caused by shortcuts in learning image representations.

To generate more random pseudo-anomaly samples, we propose a self-supervised pretext task based on a pixel-point random walk process, which is described in Algorithm 1. The process combines color change and shape change to improve the model’s adaptability to various anomaly types. The pixel-point random walk process is shown in Fig. 4, and Algorithm 1 lists its specific steps. \(p_{sx}, p_{sy}\) is the initial position of the selected walking pixel, with the initial position \((x, y)\) ranging over \(x \in [\frac{w}{3}, \frac{w}{2}]\) and \(y \in [\frac{h}{3}, \frac{h}{2}]\), where w and h denote the width and height of the image. The initial position is chosen to prevent the local changes from being too concentrated during the walk. As the pixel point walks, its position and pixel attributes change randomly. The position change depends on the direction \((x_{dr}, y_{dr}) \in p_{dr}\) and the step length \((x_{st}, y_{st}) \in p_{st}\), both chosen randomly throughout the walk. As the point moves along its random walk, the color \(P_{pix}\) fluctuates with the traversed distance. Moreover, the step count \(S_t\) of this process strongly influences the effectiveness of anomaly detection, and the number of random walks dictates the extent of the anomalies generated within the image. The pixel value changes according to the walking step length; to prevent the pixel changes from being too concentrated during the walk, we limit the per-step change, where \(s_{pixc}\) denotes the value of the pixel change, and a random color change is constructed by varying this value during the walk. By combining random changes of position and color, greater randomness is generated in the abnormal samples, achieving randomness and diversity in image augmentation.

Fig. 4

The process of P2 Random Walk. Here, we randomly select a pixel as the initial moving pixel point, and the initial color is the pixel value of the selected pixel. The pixel value changes gradually as the pixel point moves

According to the random walk process in Algorithm 1, we can calculate the new position of the pixel point in the horizontal and vertical directions, as shown in Eqs. 2 and 3; in Eq. (3), the set of pixel-point changes defines the forward strategy of the random walk. The direction \((x_{dr}, y_{dr})\) of the walk is defined over [-1, 1], and the step length \((x_{st}, y_{st})\) is in [1, 9]. Considering the initial starting point, the number of steps per gradient movement, and the color alteration, both the direction and distance of the moving point vary unpredictably. This results in a collection of uncertain pseudo-anomalous samples with random characteristics.

$$\begin{aligned} {\left\{ \begin{array}{ll} P_{nx} = x_s + x_{st}\times x_{dr}\\ \text {with} ~x_{st}~ \text {random}~[1,9] \\ \text {with}~ x_{dr}~ \text {random}~[-1, 1], \end{array}\right. } \end{aligned}$$
(2)
$$\begin{aligned} {\left\{ \begin{array}{ll} P_{ny} = y_s + y_{st}\times y_{dr}\\ \text {with}~y_{st} ~ \text {random}~[1,9]\\ \text {with}~y_{dr} ~ \text {random}~[-1, 1]. \end{array}\right. } \end{aligned}$$
(3)
Algorithm 1

Random Walking Procedure
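Since Algorithm 1 is given only as a figure, the following is a minimal NumPy sketch of the pixel-point random walk described above. The function name, the per-step color increment `color_delta`, and the border clipping are illustrative assumptions; the central start region, step-length range, and direction range follow the text.

```python
import numpy as np

def p2_random_walk(image, n_steps=40_000, step_range=(1, 9), color_delta=2):
    """Create one pseudo-anomaly sample by letting a pixel point walk randomly.

    image       : H x W x C uint8 array (a normal, defect-free sample)
    n_steps     : number of walking steps
    step_range  : range of the random step length per direction, as in Eqs. (2)-(3)
    color_delta : assumed bound on the random per-step color change
    """
    img = image.copy()
    h, w = img.shape[:2]

    # Start inside the central region so the changes are not too concentrated
    # at the border (roughly [w/3, w/2] and [h/3, h/2]).
    x = np.random.randint(w // 3, w // 2 + 1)
    y = np.random.randint(h // 3, h // 2 + 1)

    # The initial color is the pixel value of the selected starting pixel.
    color = img[y, x].astype(np.int32)

    for _ in range(n_steps):
        # Random direction in {-1, 0, 1} and random step length in [1, 9].
        x_dir, y_dir = np.random.randint(-1, 2, size=2)
        x_step, y_step = np.random.randint(step_range[0], step_range[1] + 1, size=2)

        # New position (Eqs. 2 and 3), clipped so the walk stays inside the image.
        x = int(np.clip(x + x_step * x_dir, 0, w - 1))
        y = int(np.clip(y + y_step * y_dir, 0, h - 1))

        # The color drifts gradually as the point moves.
        color = np.clip(
            color + np.random.randint(-color_delta, color_delta + 1, size=color.shape),
            0, 255)
        img[y, x] = color.astype(np.uint8)

    return img
```

Applying the function once to a defect-free image yields one pseudo-anomaly sample; repeating the walk controls how many anomalous regions are drawn into the image, as noted above.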

The initial value of the pixel color alteration depends on the image's own color, resulting in a more random and challenging task for building robust models compared to other methods of creating anomalous samples. When moving a pixel point, the number of movement steps should be limited: if the number of steps is too large, the drawn pattern becomes more spread out and the color changes become more pronounced, resulting in discrepancies when drawing anomalous samples. The MVTec training images can be classified into two categories based on pixel channels: grayscale and RGB images. Since the color variation differs between grayscale and RGB images, we make special color settings for the different image types to avoid suboptimal models caused by excessive color differences.

Pretext task

\({\mathcal {N}}\) consists of two components: feature extractor and binary classifier. The feature extractor learns the latent distribution of anomaly-free and pseudo-defect data. In traditional anomaly detection work without data augmentation, the feature extractors rely solely on normal images to extract features. In the pretext task, anomaly detection is redefined as a classification task, allowing the feature extractor to learn a broader range of image features for anomaly detection. The feature extractor is trained on both pseudo-defect and normal data to produce an output that differs from the feature vector derived from normal data. The classification pretext task serves as the training method for the feature extractor. \({\mathcal {N}}\) can be expressed as

$$\begin{aligned} {\mathcal {N}} = (I_t, I_p, l_i), \end{aligned}$$
(4)

where \({\mathcal {N}}\) is the feature extraction network. It is trained with the factors \(I_{t}, I_{p}, l_{i}\), where \(I_{t}\) is the normal data, \(I_{p}\) is the augmented data, and \(l_{i}\) is the sample label.

The goal of \({\mathcal {N}}\) is to classify the image, and its output is given as

$$\begin{aligned} {\mathcal {N}}(i) = {\left\{ \begin{array}{ll} 1 &{} \text {if } i \text { is the pseudo-anomaly image} \\ 0 &{} \text {if } i \text { is the normal image} \end{array}\right. }, \end{aligned}$$
(5)

the pretext task is trained to develop a feature extraction network. When performing anomaly detection, the classifier associated with the pretext task is discarded, leaving only the feature extraction network to extract image features.

The objective of the Siamese neural network is to derive image characteristics that facilitate the differentiation between normal and abnormal instances. In the proposed technique, the Siamese neural networks exhibit shared parameters. Subsequent to the Siamese neural network, the ultimate output comprises a feature vector employed for discerning abnormalities,

$$\begin{aligned} \phi = {\text {ConvNeXt}}(i,w), \end{aligned}$$
(6)

where i is the resized image, w is the parameters of the neural network, and \(\phi \) is the feature of the image.

The goal of \({\mathcal {N}}\) is to capture a feature that can distinguish normal and abnormal samples. Following the process of a binary classifier, we train the model of \({\mathcal {N}}\) using cross-entropy loss.

$$\begin{aligned} {\mathcal {L}} = \frac{1}{M} \sum _{i} L_{i} = - \frac{1}{M} \sum _{i}\left[ l_i \log (p_i)+(1-l_i) \log (1-p_i)\right] , \end{aligned}$$
(7)

where M is the number of samples, \(l_{i}\) is the label of sample i (1 for a defect sample and 0 for a normal one), and \(p_{i}\) is the predicted probability of the image being defective. \(\phi \) is the feature extracted by the Siamese Neural Network (SNN). The feature representation is the output of the SNN, and the results for different categories of inputs are obtained by performing distance calculations on the SNN feature data. We construct a binary classification on the feature representation to realize the pretext task. This drives the latent representation, when classifying the output, to enrich the feature representation of the SNN without ground-truth defect images.
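A condensed PyTorch sketch of this pretext-task training step is given below, assuming a ConvNeXt backbone from `timm` with a linear binary head; the head design, the learning rate, and the use of `BCEWithLogitsLoss` as the cross-entropy of Eq. (7) are assumptions, and the pseudo-defect batch is produced by the random walk sketch above at data-loading time.

```python
import torch
import torch.nn as nn
import timm  # assumption: the ConvNeXt backbone is taken from timm

# Shared-weight ("Siamese") feature extractor N followed by a binary classification head.
backbone = timm.create_model("convnext_tiny", pretrained=True, num_classes=0)
head = nn.Linear(backbone.num_features, 1)
criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy of Eq. (7)
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=0.003)

def training_step(normal_batch, pseudo_batch):
    """Both inputs are B x 3 x 256 x 256 float tensors; pseudo_batch is obtained by
    applying the random walk augmentation to normal images at data-loading time."""
    x = torch.cat([normal_batch, pseudo_batch])            # shared weights: one forward pass
    labels = torch.cat([torch.zeros(len(normal_batch)),    # label 0 = normal
                        torch.ones(len(pseudo_batch))])    # label 1 = pseudo-anomaly
    logits = head(backbone(x)).squeeze(1)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both halves of the batch pass through the same backbone, the "Siamese" branches share all parameters, which is exactly the shared-weight setup described above.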

Models trained with P2 Random Walk are more robust than one-class classification models. Compared with single-class learning, we add pseudo-anomaly samples with discriminative ability during model learning. A pseudo-anomaly sample \(x_p\) is constructed by the pseudo-anomaly sample module \({\mathcal {P}}\). During the training of model \({\mathcal {N}}\), x and \(x_p\) are used together as input samples. Compared to the traditional one-class classification model, the feature extraction model is jointly trained on anomalous and normal samples and is therefore more robust.

The primary objective of model training is to obtain the feature extraction network \({\mathcal {N}}\). This training process yields a feature representation that can effectively discriminate between different sample inputs. The network trained with pseudo-anomaly samples focuses more on features capable of differentiating anomalies and is better suited for detecting unknown anomalies. However, a model trained on normal samples alone is not sufficiently robust for anomaly detection, as a feature extraction approach relying solely on normal samples fails to adequately characterize anomalous data. Consequently, the performance of anomaly detection models can be significantly enhanced by generating anomalous data through pretext tasks.

Anomaly detection

\({\mathcal {F}}\) is the second stage of the framework, a one-class classification model for anomaly detection. \({\mathcal {N}}\) is trained in the first stage to obtain the feature representation network, on which the classification anomaly detection model is built. \({\mathcal {F}}\) infers the image prediction from the feature distribution under a Gaussian model. The classification model is built to distinguish normal samples from non-normal ones, and the feature vector is obtained by \({\mathcal {N}}\). We compute \( \phi = ConvNeXt(i)\) and fit a Gaussian to it. Each image is rated according to the probability induced by the fitted model for anomaly categorization. Abnormal samples are assumed to have a lower probability than normal samples. The final scoring method applies the Gaussian to the feature that \({\mathcal {N}}\) extracts.

$$\begin{aligned} score(i) = s({\mathcal {N}}(i)), \end{aligned}$$
(8)
$$\begin{aligned} s({\mathcal {N}}(i)) = Pr({\mathcal {N}}(i) | \mu , \sigma ), \end{aligned}$$
(9)
$$\begin{aligned} Pr({\mathcal {N}}(i) | \mu , \sigma ) = \frac{1}{\sigma \sqrt{2\pi }}e^{- \frac{({\mathcal {N}}(i)-\mu )^2}{2 \sigma ^2}}, \end{aligned}$$
(10)

where \(\mu \) and \(\sigma \) are the mean and covariance of the image features. The classifier is obtained by fitting this distribution to the features that the trained feature extractor produces on normal samples.

The Gaussian distribution model evaluates each feature to obtain a score. The output of \({\mathcal {N}}\) is mapped to an anomaly probability, which can be considered the anomaly score for a given input. \({\mathcal {N}}\) aims to learn a feature representation \(f_p\) that differs between normal and abnormal samples, so that different data yield different feature representations,

$$\begin{aligned} \textrm{OCC}(x) = {\left\{ \begin{array}{ll}\text {Abnormal} &{} p(x)>\varepsilon \\ \text {Normal} &{} p(x)<\varepsilon ,\end{array}\right. } \end{aligned}$$
(11)

\(\varepsilon \) is a distance parameter used to distinguish normal data from abnormal samples, and \(\textrm{OCC}\) is the classification anomaly detection model. When determining this hyperparameter, the evaluation of the model cannot rely only on accuracy because of the imbalance between anomalous and normal data in anomaly detection. The model's efficacy therefore has to be assessed from a more comprehensive perspective, and the most suitable hyperparameter is selected by evaluating different candidate values.
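A minimal sketch of this second-stage classifier is shown below: a Gaussian is fitted to the features of normal training images, and a test feature is scored by its Mahalanobis distance, following the framework description in Fig. 3; the function names, the pseudo-inverse of the covariance, and the use of a distance threshold \(\varepsilon \) are illustrative assumptions.

```python
import numpy as np

def fit_gde(normal_features):
    """normal_features: N x D matrix of features extracted from anomaly-free training images."""
    mu = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)          # pseudo-inverse guards against a singular covariance
    return mu, cov_inv

def anomaly_score(feature, mu, cov_inv):
    """Mahalanobis distance of one test feature; a larger value means more anomalous."""
    diff = feature - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

def classify(feature, mu, cov_inv, eps):
    """One-class decision in the spirit of Eq. (11), with eps acting as the distance threshold."""
    return "Abnormal" if anomaly_score(feature, mu, cov_inv) > eps else "Normal"
```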

Fig. 5

Samples of MVTec-AD dataset, defect-free (blue box) and defective object samples (red box)

Fig. 6

Samples of MVTec-AD dataset, defect-free (blue box) and defective texture samples (red box)

Anomaly segmentation

During the anomaly segmentation process, to achieve finer-grained anomaly segmentation, this paper adopts an image-patch method, dividing the image into multiple patches. By assessing the anomaly score of each patch, we achieve more detailed anomaly segmentation. Specifically, we segment the input image into patches of size 64 \(\times \) 64 and use a stride of 4 for feature-block extraction. During testing, feature blocks are extracted with a stride of 4, an anomaly score is calculated for each extracted feature block, and anomaly scores are assigned to each pixel through Gaussian smoothing. We achieve a 96.3% anomaly segmentation result and visualize the anomalous areas of the image using heatmaps.
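A sketch of this patch-based scoring is given below; `score_patch` stands for feature extraction plus the Gaussian scorer above, the patch size of 64 and stride of 4 follow the text, and the averaging of overlapping patch scores and the smoothing kernel width are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def segment_anomalies(image, score_patch, patch=64, stride=4, sigma=4):
    """Slide a 64 x 64 window with stride 4, score every patch with `score_patch`,
    and spread the scores back onto the pixels to obtain an anomaly heatmap."""
    h, w = image.shape[:2]
    score_map = np.zeros((h, w), dtype=np.float32)
    count_map = np.zeros((h, w), dtype=np.float32)

    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            s = score_patch(image[top:top + patch, left:left + patch])
            score_map[top:top + patch, left:left + patch] += s
            count_map[top:top + patch, left:left + patch] += 1.0

    score_map /= np.maximum(count_map, 1.0)          # average the overlapping patch scores
    return gaussian_filter(score_map, sigma=sigma)   # pixel-level anomaly scores
```

Thresholding the returned map yields the predicted anomaly mask, while the map itself can be rendered directly as the heatmap shown in the figures.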

In this section, we introduce a framework comprising \({\mathcal {P}}\), \({\mathcal {N}}\), and \({\mathcal {F}}\). We create a pseudo-defect sample dataset \({\mathcal {X}}_{tr}\) from the images in the initial training set using color and shape changes. The key to P2 Random Walk is that more random samples are constructed for pretext-task training. The colors and shapes of the normal sample images are altered in our pretext task so that the network model can learn data and information that are not part of the initial image. Images of the same type should be close to each other in the feature space, whereas images of different types should be far apart.

Experimental setup

To evaluate the performance of P2 Random Walk, we design several sets of experiments. First, we assess the anomaly detection performance of the pixel-point random walk; the primary indicators for evaluating the model in real-world applications are discussed here. Subsequently, a comparison is conducted to assess the effectiveness of different feature extraction networks within the P2 Random Walk framework. The features derived from the feature extraction model are crucial for building robust anomaly detection techniques, so proper feature extraction is essential. Additionally, we examine the impact of various parameters on the model. The number of random walk steps is a vital parameter, with distinct settings yielding different outcomes.

Table 2 Results on the MVTec dataset: comparison of the AUROC of P2 Random Walk and other anomaly detection methods

Dataset and evaluation metrics

Dataset We evaluate P2 Random Walk on the MVTec [39] dataset for anomaly detection, as shown in Figs. 5 and 6. MVTec contains a total of 5354 images in 15 categories; the training set contains 3629 anomaly-free images, and the test set contains 1725 anomalous and anomaly-free images. Figures 5 and 6 show samples of the object and texture categories, respectively. The first row shows defect-free samples, while the second row shows defective samples.

Evaluation metrics To assess the validity of P2 Random Walk, the model is evaluated by the area under the receiver operating characteristic curve (AUROC). The AUROC is widely used in the performance evaluation of anomaly detection models, and a higher AUROC value indicates a better model.
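Because AUROC is threshold-free, it can be computed directly from the per-image anomaly scores; a minimal example using scikit-learn (assumed here as the evaluation tool) with toy labels and scores:

```python
from sklearn.metrics import roc_auc_score

# labels: 1 for anomalous test images, 0 for anomaly-free ones;
# scores: per-image anomaly scores produced by the detector (toy values here).
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.8, 0.6, 0.2, 0.9]
print(roc_auc_score(labels, scores))   # 1.0 here, since every anomaly outscores every normal image
```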

Implementation details We resize the training and testing images to \(256 \times 256\), and 24 images are contained in each training batch. The encoder of P2 Random Walk uses ConvNeXt to build the feature extraction network. Additionally, in the ablation study, we utilize ResNet18, ResNet34, and ConvNeXt. Based on this framework, we utilize ConvNeXt as the backbone \({\mathcal {F}}\) and analyze the results of different step numbers in the ablation experiments. We train the model for 240 epochs with a batch size of 10, employing the Adam optimizer (\(lr = 0.003\)). In a preliminary study, we find that the number of random walking steps and the random walking strategy in pretext-task construction yield different effects.
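For reference, a compact sketch of this training configuration under the reported settings (256 x 256 resize, 240 epochs, Adam with lr = 0.003 as set in the pretext-task sketch); the stand-in tensors, the data loader, and the reuse of `training_step` from the earlier sketch are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import transforms

# Preprocessing reported in the text: every training and testing image is resized to 256 x 256.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# Stand-in tensors so the loop below runs; in practice these come from the MVTec
# training images and their P2 Random Walk counterparts produced at loading time.
normal = torch.rand(48, 3, 256, 256)
pseudo = torch.rand(48, 3, 256, 256)
loader = DataLoader(TensorDataset(normal, pseudo), batch_size=24, shuffle=True)

for epoch in range(240):                                   # reported number of training epochs
    for normal_batch, pseudo_batch in loader:
        loss = training_step(normal_batch, pseudo_batch)   # step from the pretext-task sketch
```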

Table 3 Pixel-level segmentation results for different methods

Experimental results and analysis

Results

Table 2 presents the results of P2 Random Walk, categorized into object and texture classes. The threshold-free area under the receiver operating characteristic curve (AUROC) is used to evaluate the models. We examine various pretext tasks and compare them with previous studies, including Patch SVDD [17], Cutout, DOCC [40], CutPaste [27], and uninformed students [5].

We evaluate self-supervised models based on different data augmentations, including Cutout, rotation, and the proposed random walk. Compared to semantic-level changes, color and shape changes prove to be more effective for anomaly detection tasks. The color-fixed P2 Random Walk outperforms Cutout, as P2 Random Walk improves feature extraction through random changes to the low-level features of the images. Moreover, the full P2 Random Walk features not only more diverse variation in shape but also random shifts in color, so during classification, shortcuts cannot accomplish the pretext task effectively.

The findings showcase the strong detection capability of P2 Random Walk and its classification strategy for anomaly detection. As evidenced by the anomaly detection outcomes on the MVTec data, our method attains outstanding results, with an average AUROC of 96.2 overall, 98.2 for texture-class images, and 95.2 for object-class images, and it further differentiates all of the data. Based on these results, we can make the following observations. (1) On the MVTec dataset, P2 Random Walk demonstrates superior anomaly detection performance. Our method achieves advancements in average anomaly detection, primarily due to the randomness in constructing anomaly samples; a more random pseudo-anomaly sample prevents the feature extractor from learning shortcuts and failing to learn an adequate feature representation. (2) Compared with object samples, texture images achieve better results. The performance of P2 Random Walk benefits from the uniformity of complete texture images, whereas object samples contain a greater variety of colors and a more complex feature structure than texture data. (3) The Screw sample is less prominent compared to the other samples. We find that the proportion of the object within the screw images is relatively small compared to the other samples, so the anomaly detection constructed for it is slightly less effective.

Fig. 7

Display of anomaly segmentation results. A comparison of real anomalies and predicted results is conducted on images from 15 categories in the MVTec dataset

Table 3 presents the segmentation results. From the analysis of the results, the random walk method achieves excellent anomaly segmentation performance, reaching 96.3%. Compared to the object classes, the segmentation effect on the texture classes is superior. Figure 7 shows the effect of the random walk method for anomaly segmentation, where the anomalous regions are displayed as heat maps. Anomaly localization is performed by means of image blocks: the image is sliced into blocks of size 64 \(\times \) 64 and feature blocks are extracted with a stride of 4. Figure 7 shows, respectively, the input anomaly image, the ground-truth anomaly mask from the test set, the anomaly heat map, and the predicted anomaly mask. In the segmentation task, to measure model performance comprehensively under unbalanced data, we introduce the F1 score as an evaluation metric. The F1 score takes both precision and recall into account, which allows a more comprehensive evaluation of model performance. Examining the results presented in Tables 4 and 5, it is apparent that our model exhibits a superior level of balance when dealing with texture data.

Ablation study

To evaluate the effectiveness of the random walk method in anomaly detection, we conduct ablation experiments on the 15 categories of the MVTec AD dataset. This study mainly validates the effectiveness of the method from two aspects: random changes in shape and random changes in color. To ensure a single controlled variable, we use the same ConvNeXt feature extraction model for all three data augmentation methods, and the performance of P2 Random Walk in the ablation experiments is tested on this ConvNeXt-based model.

Shape diversity Impact of shape diversity on detection performance. We construct various irregularly shaped pseudo-anomaly samples to enable the model to learn more image information, and we compare the Cutout and color-fixed random walk methods. Cutout data augmentation creates data samples with regular shapes and a single color; in this study, we use Cutout with regular rectangles filled with white. In the color-fixed random walk, the position of the random walk changes randomly, while the color remains fixed to white, similar to the Cutout method. Compared to Cutout, the color-fixed random walk therefore constructs more randomly shaped pseudo-anomalies. More irregular data augmentation encourages the model to learn more shape information, and the experimental results show a significant improvement of the color-fixed method over Cutout, as seen in Figs. 8 and 9.

Table 4 The F1 of segmentation for texture categories
Table 5 The F1 of segmentation for object categories
Fig. 8

Different self-supervised learning methods show different results for the model. Among all data types, P2 Random Walk has better anomaly detection results

Fig. 9

Different self-supervised learning methods show different results for the model. Among all data types, P2 Random Walk has better anomaly detection results

Color diversity Impact of color variation on anomaly detection. In the data augmentation process, the random walk method encompasses not only shape diversity but also color diversity. In this study, we compare the P2 Random Walk and color-fixed methods. In contrast to the color-fixed random walk, P2 Random Walk introduces random color variations during the walk, creating more diverse color augmentation. The color diversity in the random walk helps prevent the model from merely learning a single color variation. The experimental results show a significant improvement over the Cutout and color-fixed methods, as seen in Figs. 8 and 9. By comparing the three data augmentation methods, we conclude that changes in shape and color during the random walk process have a positive impact on the model's performance.

Analysis of results The analysis of the results in Figs. 8 and 9 reveals that our method is not always the best, for example on the toothbrush category. We analyze the results fully from both the dataset and algorithm perspectives. First, at the data level, we reviewed all the data samples and found that, among all the object categories, the toothbrush contains only one anomaly category, and its anomalies are all at the level of defective semantics, i.e., a single type of anomaly. Second, at the algorithm level, our method constructs a finer-grained, pixel-level data augmentation to achieve more refined augmentation and avoid data bias in the augmentation process. Analyzing the experimental results in Table 2, our method yields a larger improvement on the texture classes than on the object classes: texture-class anomalies are mainly reflected in fine-grained color and shape, whereas object-class anomalies, besides color and shape, also include semantic-level anomalies. Therefore, the analysis shows that the finer-grained data augmentation brings a smaller improvement on semantic, higher-order anomalies than on detailed anomalies, which is a limitation of the algorithm in this paper.

Time complexity For each image, the method we use is a random walk of pixel points, and during the walk the running time mainly depends on the number of walking steps. Therefore, the complexity of our algorithm is linear, and the overall complexity depends on the number of training data, which is O(N). Regarding the difference in training time, the color-fixed and random walking methods have the same time complexity and incur no additional time cost, because the color changes automatically during the walking process, while the color-fixed variant simply keeps the color at a constant value. Therefore, the time to train the model with random walking is the same as with fixed colors.

As for the Cutout method, the pasting process selects an image block and then pastes it as a whole. The color change in the random walk occurs along with the change of position and requires no additional operation. Since the number of channels of the processed image is the same as in the random walk, the time complexity of Cutout depends on the size of the image block: the larger the block, the more pixels need to be processed, which is also linear. Thus, the time complexity of both the random walking and the paste-and-copy approaches is linear, and they consume training time of the same order of magnitude.

Fig. 10

The experimental results for walking steps on texture samples. The AUROC oscillates as the number of random walking steps changes, and its average effect is best when the number of walking steps is 40,000

Parameter analysis

To clearly depict the model’s sensitivity to different parameters, we conducted an in-depth analysis of the relevant parameters that have a significant impact on the model. Throughout the process of parameter sensitivity analysis, we adopted strategies based on P2 Random Walk.

Step number The effect of different walking step counts on the anomaly detection AUROC is illustrated in Fig. 10. We find that the anomaly detection results increase with the number of steps when the number of random walking steps is lower than 40,000 and reach the optimum at 40,000; above 40,000 steps, the results deteriorate. The experimental results demonstrate that a certain level of pseudo-anomaly construction is needed during random walks, where the model learns more information through greater constructed uncertainty. If the number of steps in the random walk is too small, the generated pseudo-anomaly samples are not discriminative enough, making it difficult for the model to learn the image information. Conversely, if too many steps are taken, the pseudo-anomaly samples become too obvious, which can reduce the feature extraction ability of the model.

Fig. 11

Effect of different x and y walking dispersions on the model. The best result is obtained when the walking range of x and y is [1, 9]

Random walking dispersion The dispersion of the walk is another critical parameter in anomaly detection. The dispersion of the random walk refers to the distance that a pixel moves on the horizontal and vertical coordinates during the walk. As shown in Fig. 11, the anomaly detection results improve as the dispersion of the random walk increases, and the AUROC reaches its maximum at a dispersion of [1, 9]. A moderate dispersion is able to construct regional pseudo-anomalous samples within a certain range, and their irregularity increases as the dispersion becomes progressively larger. However, when the dispersion becomes too large, the point-to-point relationship during the walk weakens, showing a highly discrete state. Moreover, since the random walk is carried out within the image region, an excessive dispersion causes the walk to repeat itself, which is equivalent to reducing the number of walking steps and is less effective for constructing the model.

Table 6 Different backbone models show different model effects, with the ConvNeXt network showing the best results

Feature extractor Due to the limited quantity of MVTec data, the selection of a suitable feature extraction network has a considerable influence on the efficacy of anomaly detection. To analyze the impact of different models on the P2 Random Walk method, we compare the algorithm's results with ResNet18, ResNet34, and ConvNeXt; as shown in Table 6, the ConvNeXt network gives the best results. Compared with the other networks, ConvNeXt borrows related methods from the Swin Transformer while maintaining the CNN mechanism. The CNN model is effective for extracting the low-level information of the data, which is crucial for the anomaly detection model, while the Swin Transformer has advantages in global information extraction; ConvNeXt therefore has advantages in extracting both low-level and global information, and thus good anomaly detection performance. Our study employs feature extraction models and classification methods to detect anomalies. When constructing P2 Random Walk pseudo-anomaly samples, there is no need to design complicated manual processing; the processed pseudo-abnormal samples are input into the network alongside normal samples for feature extraction, and image abnormality is finally detected by classification. In the random walk method, the number of steps taken during the walk and the way the color changes affect the anomaly detection results differently; the two most critical factors are the number of steps and the changes to the image pixels.

Conclusion

In this paper, we propose a pixel-level random data augmentation method (P2 Random Walk) to address the issue of data scarcity in image anomaly detection. The method establishes more random color and shape changes by means of a pixel-point random walk, introducing greater randomness and irregularity in the data augmentation process through pixel-level color and position changes. More random data can prompt the model to learn more feature representations, and this increased uncertainty helps the model extract more comprehensive feature representations and avoid learning data shortcuts. Furthermore, during the training process, the model trains the feature extraction network by completing the pretext task. As the uncertainty of the data augmentation increases, the model faces a correspondingly greater challenge; to complete the pretext task effectively, the feature extraction model is prompted to learn sufficient latent features, thereby enhancing its anomaly detection capability. The proposed method achieves advanced results in both anomaly detection and segmentation tasks. Through the analysis of the detection results on the 15 categories of the MVTec dataset, we found that finer-grained data augmentation brings a more noticeable improvement for texture-type anomalies, while the improvement for object-type anomalies is limited. In subsequent research, changing the walking strategy to construct a data augmentation method with stronger anomaly detection capability could improve the detection of object-type anomalies.