Introduction

Anomaly detection is a critical process that involves determining whether a given sample deviates from the normal distribution and detecting its unusual components; it has a wide range of applications in industrial control [12, 33], product quality control [4, 43], and other fields [5, 22, 32]. Real-world datasets are challenging because their distributions vary widely and anomalous samples are scarce, which leaves limited prior knowledge about the anomalous class. Single-class methods such as singular value decomposition [17] and one-class SVM [16] map features to higher dimensions to better characterize normal samples, but their feature extraction capability is relatively limited. In image-related fields, deep neural networks have shown promising results; methods such as [37] extract latent features that characterize the dataset while detecting objects. However, networks trained only on normal samples are prone to over-parameterization, and how to exploit the feature extraction capability of deep neural networks for anomaly detection remains an active research topic [30, 31].

The teacher–student (T–S) architecture [41] plays an important role in anomaly detection. When only normal samples can be observed during training, anomalies are identified by comparing a teacher model trained on a large amount of external data with a student model trained only on normal samples. In contrast to the traditional T–S architecture, which pursues smaller student networks to achieve fast inference, the T–S architecture for anomaly detection exploits the inconsistency between the two networks on the target data to detect anomalous samples. The teacher network’s parameters are typically trained on large datasets, such as ImageNet [15], to obtain semantically strong descriptors. Meanwhile, the student network takes normal samples as input during training, compares its feature maps with the corresponding hierarchical feature maps of the teacher network, and uses cosine similarity or mean squared error as the loss. The underlying assumption of the T–S architecture is that the teacher’s ability to encode images is transferred to the student as completely as possible for normal data; because the student sees only normal samples, its features diverge from those of the pre-trained teacher, especially on anomalous images. Anomaly score maps are then obtained by comparing the feature maps of the two networks.

Since the student network uses the teacher network as its learning target, constructing the student network more efficiently is a key concern in anomaly detection. Some studies have focused on constructing different pre-trained teacher models or distillation methods. For instance, Xu et al. [46] propose using multiple teacher networks to handle different detection objects, while Deng and Li [14] propose reverse distillation to obtain comprehensive representations from the teacher network. Jiang et al. [23] build on [14] and introduce pixel-level and feature-level masking to alleviate the overgeneralization problem. Cao et al. [8] argue that increasing the knowledge of the student network is an effective way to improve model recognition. Previous studies have focused primarily on guiding the learning of a single student network, using methods such as soft logits [46] and one-class embedding [14]. Ma et al. [29] show the effectiveness of training student networks with multiple pre-trained teacher models. Since different distillation methods have different advantages, it is beneficial to construct diverse student networks and aggregate their score maps.

To construct student networks that effectively generate recognition score maps, we adopt a dual-student approach that leverages both high-level and low-level representations and aggregates their outputs. We propose a reverse-distillation student network for anomaly detection that incorporates skip connections and an attention mechanism; the attention mechanism helps the student determine which feature maps in the teacher’s hierarchy are more important. In addition, we develop a forward-distillation student network and integrate the anomaly score maps obtained from both students using synthetic noise. Our main contributions are given below.

  • We propose a novel multi-student knowledge distillation framework for anomaly detection and localization named DSKD. Through synthetic noise, DSKD aggregates the score maps obtained from two different student networks, leading to more powerful representation learning.

  • To further improve the efficiency and effectiveness of anomaly detection and localization during reverse distillation, we propose a skip-connection architecture that helps the student network obtain information from the corresponding teacher layers; moreover, an attention module is added to help the student recombine the features.

The remainder of this paper is organized as follows. “Related work” presents related work on anomaly detection and knowledge distillation. “Methodology” introduces our proposed method. “Experimental results and analysis” presents our experiments and analysis. A conclusion is given in “Conclusion”.

Related work

Anomaly detection

Anomaly detection, also known as outlier detection or novelty detection, involves identifying samples that deviate from the rest of the observations. In our work, we assume that a model is trained using only normal samples, and since there is no supervisory information from other classes, the problem is treated as novelty detection. Traditional methods such as singular value decomposition and one-class SVM construct a hypersphere or hyperplane to check whether outliers are far from the hypersphere center. Deep learning-based methods are also used in anomaly detection due to the effectiveness of neural networks in feature extraction.

Methods based on autoencoders use reconstruction error as the primary measure for judging sample abnormality. This approach has found wide application in domains such as video analysis [28], the Internet of Things (IoT) [48], and railway turnout inspection [9]. For instance, [6] evaluates an autoencoder for visual fault detection and finds deficiencies in the reconstruction of high-frequency textures and small details when using a convolutional autoencoder that combines l2 loss and the structural similarity index. Üzen et al. [40] combine a convolutional layer and a Swin Transformer, where the former provides spatial properties and the latter provides global semantic properties. Meanwhile, [25] reduces the false positives of an autoencoder through contrastive learning over samples with complex shapes, sizes, and colors. In addition, skip connections are widely used in autoencoders to help the model reconstruct sharp details while preserving both high- and low-frequency information [11].

GAN-based methods train a model on normal samples and then compare the generator, discriminator, and reconstructor for detection [45]. AnoGAN [36] uses normal data to train the model and computes reconstruction errors between the generator and discriminator. GANomaly [2] uses the latent vectors of the generator and discriminator to obtain an anomaly score, and Skip-GANomaly [3] combines reconstruction metrics and latent representations through skip connections to improve GANomaly’s underdetection of small-scale anomalies. The discriminator identifies deviations from the normal distribution by comparing the reconstructed image with the original, particularly when the reconstruction is of poor quality. However, relying solely on reconstruction errors does not fully exploit the potential of large datasets; the teacher–student (T–S) architecture was introduced to address this limitation. Furthermore, skip connections were found to markedly improve anomaly detection by providing high-frequency features, which also supports our network design.

Knowledge distillation

With the advent of deep learning, networks designed to enhance classification accuracy have become increasingly deep and wide, resulting in substantial computational requirements for both training and testing. Conventional knowledge distillation aims to give models a lighter architecture and reduce inference time. Knowledge distillation techniques leverage a trained teacher network to transfer knowledge to a student network through soft-loss distillation, thereby offering the potential for real-time prediction. Bergmann et al. [7] propose a T–S framework in which the teacher is trained on a large external dataset and the student on normal samples; anomalies are detected by comparing the two networks. To further improve the student network’s ability to recover the original image, a denoising process [49] was introduced that trains the student with an anomaly mask to recover the original image from synthetically corrupted normal images. Under the assumption that the teacher model is underutilized, a gradient-based adaptive anomaly localization approach [35] based on the distillation of intermediate-layer feature maps was proposed; the maps are sufficiently distilled to simplify the model and bypass the patching process. Tong et al. [39] is similar to [50] in that both introduce self-supervised mask training to strengthen WideResNet50-based single-class prototype models and employ a feature diffusion module to identify large-area anomalies. Ma et al. [29] argue that teacher networks with different structures or initial parameters provide features from different perspectives, which further supports the notion that constructing reliable feature maps is important in T–S-based anomaly detection.

Methodology

In anomaly detection, training data typically consists of only normal images. A crucial aspect of generating anomaly score maps involves constructing comparison networks based on teacher networks. There is a need to improve the sensitivity of students to the semantics and details to achieve both sample- and pixel-level anomaly detection. In our proposed method, we employ a teacher network as an anchor point and utilize distillation in multiple directions to obtain two student networks. Subsequently, we design a fusion network to combine the anomaly score maps generated by these students. The validity of our approach is demonstrated through both semantic and pixel anomaly detection.

Problem definition

An important assumption in anomaly detection is that anomalous samples are difficult to acquire or observe during training. Consistent with the literature [14] on anomaly detection, we let \(L_{t} = \{ L_{t}^{1},..., L_{t}^{n} \}\) denote the set of normal samples that appears only during training, allowing the model to learn the normal distribution. Let \(L_{q} = \{ L_{q}^{1},..., L_{q}^{n} \}\) denote the samples to be detected; this set contains both normal and abnormal data. Our goal is to construct a model trained with \(L_{t}\) that correctly identifies the samples in \(L_{q}\). Normal samples in \(L_{t}\) and \(L_{q}\) conform to the same distribution, and samples outside of this distribution are considered anomalies. Taking the screw category as an example, a model trained with screw samples is used for detection only within that category, and data deviating from the distribution of that category are considered anomalous.

Network architecture

Figure 1 shows our model architecture. Unlike architectures with multiple teachers [29], ours uses only a single pre-trained teacher network E that extracts multi-scale representations of normal samples. Since anomaly detection usually involves pixel-level mining, similar to image segmentation, it is sensitive to both high-level and low-level representation details. We propose to extract these representations using students \(D_\mathrm{{f}}\) and \(D_\mathrm{{r}}\). \(D_\mathrm{{r}}\) uses reverse distillation to obtain high-level representations while connecting E’s earlier feature maps to obtain representations at different levels. Meanwhile, we retain low-level representations using the forward-distillation network \(D_\mathrm{{f}}\) and fuse the outputs of \(D_\mathrm{{f}}\) and \(D_\mathrm{{r}}\). In the inference stage, multi-scale anomaly detection is performed with \(D_\mathrm{{r}}\) and \(D_\mathrm{{f}}\) separately, and a single anomaly map is obtained by fusing the score maps of the components with the fusion network.

Since we focus on the design of the student networks, for the teacher network we simply use WideResNet50: it has more capacity and benefits less from input repetition. The parameters of the teacher network are learned from ImageNet and are fixed. Following the design of WideResNet50, the teacher network is divided into four blocks, with \(E_{1}\)–\(E_{4}\) denoting the blocks from largest to smallest. Correspondingly, the network structures of \(D_\mathrm{{f}}\) and \(D_\mathrm{{r}}\), which we explain in “Forward distillation student network” and “Reverse distillation student network”, respectively, are similar to those of the teacher’s blocks. The adaptive fusion of \(D_\mathrm{{f}}\) and \(D_\mathrm{{r}}\) is elaborated on in “Anomaly score map fusion”.
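As an illustration, the sketch below shows how a frozen, ImageNet-pre-trained WideResNet50 teacher can expose its four block outputs \(E_{1}\)–\(E_{4}\) for feature comparison. This is a minimal PyTorch/torchvision approximation of the setup described above (the wrapper class, the weight-loading argument, and the variable names are our own assumptions and presume a recent torchvision version), not the authors’ released code.

```python
# Minimal sketch (not the authors' code): a frozen WideResNet50 teacher exposing
# its four block outputs E1-E4. Assumes torchvision >= 0.13 for the `weights` API.
import torch
import torch.nn as nn
from torchvision.models import wide_resnet50_2

class TeacherEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = wide_resnet50_2(weights="IMAGENET1K_V1")   # ImageNet pre-training
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        for p in self.parameters():            # teacher parameters stay fixed
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x):
        feats = []
        x = self.stem(x)
        for block in self.blocks:
            x = block(x)
            feats.append(x)                    # E1 (largest) ... E4 (smallest)
        return feats

teacher = TeacherEncoder().eval()
e1, e2, e3, e4 = teacher(torch.randn(1, 3, 256, 256))   # multi-scale teacher features
```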

Fig. 1
figure 1

Network architecture for DSKD. The solid lines represent the direction of data flow and the dotted lines represent the anomaly scores. The attention mechanism is denoted as att. Input images are fed through a teacher network and two student networks; the kernel sizes for conv\(_{1}\) and conv\(_{2}\) are \(3*3\) and \(7*7\), respectively. Feature comparison uses cosine similarity to compare each position in feature maps \(E_{1}\) to \(E_{3}\), providing a description of the inconsistency between the teacher network and the student network. The final result is computed by an anomaly score fusion network

Forward distillation student network

The teacher network learns from a diverse and large amount of external data through pre-training, while the student network focuses on the normal samples of the dataset during its learning process. We examine the inconsistencies between the feature maps of the teacher and student networks to mine anomalous samples in the query data. Forward distillation feeds images to the teacher and student networks to obtain feature maps \(z^{t}\) and \(z^{f}\), respectively. The cosine similarity loss keeps \(z^{t}\) and \(z^{f}\) aligned in direction while leaving the magnitude of the student feature map unconstrained. The loss is shown in Eq. (1).

$$\begin{aligned} \mathrm{{loss}} = 1 - \frac{z^{t} \cdot z^{f}}{\Vert z^{t} \Vert _{2} \cdot \Vert z^{f} \Vert _{2}} \;\text {} \end{aligned}$$
(1)

\(\Vert z^{t} \Vert _{2}\) and \(\Vert z^{f} \Vert _{2}\) are the l2-norms of \(z^{t}\) and \(z^{f}\). In the design of the \(D_\mathrm{{f}}\) network, we use the simple approach of feeding the data directly to the teacher and student. We expect the student network to use the data directly to obtain a more detailed feature map. At the same time, by not making additional connections between the student and teacher, we prevent the student from simply copying the teacher’s parameters and hence keep the student network sensitive to anomalies.
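A minimal PyTorch sketch of this loss, averaged over spatial positions and over the compared teacher/student block pairs, is given below; the function name and the averaging scheme are our own assumptions.

```python
# Hedged sketch of Eq. (1): cosine-similarity distillation loss, averaged over
# spatial positions and over the compared teacher/student block pairs.
import torch.nn.functional as F

def forward_distill_loss(teacher_feats, student_feats):
    loss = 0.0
    for zt, zf in zip(teacher_feats, student_feats):
        cos = F.cosine_similarity(zt, zf, dim=1)   # per-position similarity, (B, H, W)
        loss = loss + (1.0 - cos).mean()
    return loss
```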

In the prediction phase, the anomaly map is obtained by computing the cosine similarity of each \(z^{f}\) with the corresponding \(z^{t}\) through Eq. (1). The position-by-position difference results are recorded and expanded to the original image size by interpolation. In this process, we retain the first three anomaly maps to be fused with the reverse distillation results.
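The per-position map construction can be sketched as follows, again as an illustrative approximation (the output size and the interpolation mode are assumptions):

```python
# Sketch of the inference-time anomaly maps: per-position cosine distance between
# corresponding teacher/student maps, upsampled to the input resolution. Keeping
# only the first three maps follows the description above.
import torch.nn.functional as F

def anomaly_maps(teacher_feats, student_feats, out_size=(256, 256)):
    maps = []
    for zt, zf in zip(teacher_feats[:3], student_feats[:3]):
        dist = 1.0 - F.cosine_similarity(zt, zf, dim=1)        # (B, H, W)
        dist = F.interpolate(dist.unsqueeze(1), size=out_size,
                             mode="bilinear", align_corners=False)
        maps.append(dist)                                      # (B, 1, H_out, W_out)
    return maps
```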

Reverse distillation student network

While \(D_\mathrm{{f}}\) gives a relatively superficial representation of an image, the outputs of \(D_\mathrm{{f}}\) and E may still converge on anomalous samples outside the normal distribution because both networks receive the same input. Deng and Li [14] use reverse distillation in both the training and prediction phases of the T–S network to promote the preferential acquisition of semantic knowledge by the student network. Inspired by this, we introduce skip connections and an attention module and use reverse distillation to improve the representation of feature-map details, since knowledge is then distilled down to the shallow levels.

Assume that the t-th block of the student network is \(D_\mathrm{{r}}^{t}\). DSKD first applies convolution kernels of different sizes (conv\(_{1}\) and conv\(_{2}\)) to extract features from the corresponding teacher layer; the different receptive fields yield diverse feature results. The attention module takes the outputs of conv\(_{1}\) and conv\(_{2}\) together with \(D_\mathrm{{r}}^{t+1}\) as input, and \(D_\mathrm{{r}}^{t}\) is obtained by concatenating this result with \(D_\mathrm{{r}}^{t+1}\) and applying a convolution. Using the attention mechanism helps \(D_\mathrm{{r}}\) eliminate redundant information and learn the features relevant to the loss and to E’s earlier feature map. The feature computation in block \(D_\mathrm{{r}}^{t}\) is shown in Eq. (2).

$$\begin{aligned} D_\mathrm{{r}}^{t} \leftarrow \mathrm{{conv}} \big ( D_\mathrm{{r}}^{t+1} \oplus \mathrm{{att}}\big (D_\mathrm{{r}}^{t+1}, \mathrm{{conv}}_{1}(E^{t+1}) \oplus \mathrm{{conv}}_{2}(E^{t+1})\big ) \big ) \end{aligned}$$
(2)

Here, \(\oplus \) denotes concatenation, and the outermost convolution uses a \(1*1\) convolution kernel. The stride for conv\(_{1}\) and conv\(_{2}\) is set to 2 to align with the input size of the next block. Key features from the semantic knowledge \(D_\mathrm{{r}}^{t+1}\) and the teacher feature map \(E^{t+1}\) are selected and passed to \(D_\mathrm{{r}}^{t}\) to ensure that important anomaly patterns are captured. The loss function and anomaly score map generation are consistent with those used in forward distillation.
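To make the data flow of Eq. (2) concrete, the following exploratory sketch implements one such block in PyTorch. The internals of the attention module are not fully specified in the text, so it is approximated here by a simple sigmoid gate; the channel counts, the fallback interpolation for spatial alignment, and the module name are likewise our own assumptions.

```python
# Exploratory sketch of one reverse-distillation block following Eq. (2).
# conv_1/conv_2 use 3*3 and 7*7 kernels with stride 2 as stated above; "att" is
# approximated by a sigmoid gate over the concatenated inputs (an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseBlock(nn.Module):
    def __init__(self, c_student, c_teacher, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_teacher, c_out, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(c_teacher, c_out, kernel_size=7, stride=2, padding=3)
        self.att = nn.Sequential(                       # assumed form of "att"
            nn.Conv2d(c_student + 2 * c_out, 2 * c_out, kernel_size=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(c_student + 2 * c_out, c_out, kernel_size=1)  # outer 1*1 conv

    def forward(self, d_prev, e_prev):
        # d_prev: student block D_r^{t+1}; e_prev: teacher feature map E^{t+1}
        skip = torch.cat([self.conv1(e_prev), self.conv2(e_prev)], dim=1)
        if skip.shape[-2:] != d_prev.shape[-2:]:        # align spatial sizes if needed
            skip = F.interpolate(skip, size=d_prev.shape[-2:], mode="bilinear",
                                 align_corners=False)
        gate = self.att(torch.cat([d_prev, skip], dim=1))
        return self.fuse(torch.cat([d_prev, gate * skip], dim=1))

block = ReverseBlock(c_student=1024, c_teacher=1024, c_out=512)
out = block(torch.randn(1, 1024, 16, 16), torch.randn(1, 1024, 32, 32))  # (1, 512, 16, 16)
```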

Anomaly score map fusion

To handle anomaly score maps at different scales, methods such as [14] interpolate the low-resolution cosine similarity maps and use multiplication or addition to fuse them with the high-resolution ones, thus aggregating them into a single detection result. However, in our approach with multiple student networks, a key issue surfaces: which student’s anomaly score map should be emphasized. To solve this problem, we add synthetic noise in some regions of the image and regard the corrupted region as the anomaly mask M. The noise is drawn from an external data source A and added to the anomaly-free normal image \(I_{n}\):

$$\begin{aligned} I_{a} = \beta (M \odot A) + (1 - \beta )(M \odot I_{n}) + (1 - M) \odot I_{n} \;\text {,} \end{aligned}$$
(3)

where \(\odot \) denotes the element-wise multiplication operation. \(\beta \) is the opacity, which acts as data augmentation to increase the diversity of the training set, and it is randomly chosen from [0.15, 1]. Such injection of synthetic noise is also used in DeSTSeg [49]. DeSTSeg employs a similar dual-student network for encoding and decoding, but it requires two residual blocks as a segmentation network; in contrast, our work is more concerned with the aggregation of the networks’ results. We use an external dataset [10] for A, take the feature maps described in “Forward distillation student network” and “Reverse distillation student network” as input, and introduce dice loss [38] as the loss of the segmentation task. This loss is shown in Eq. (4).

$$\begin{aligned} \mathrm{{DL}} = 1 - \frac{2 \sum ^{N}_{i=1} p_{i} g_{i} }{ \sum ^{N}_{i=1} p_{i}^{2} + \sum ^{N}_{i=1} g_{i}^{2}} \;\text {} \end{aligned}$$
(4)

Here, \(p_{i}\) is the prediction associated with the ith pixel and \(g_{i}\) is its ground truth. The advantage of using dice loss is that the synthetic masks cover only a small percentage of pixels, yielding an imbalanced segmentation problem that matches the small proportion of faults within an image. Dice loss focuses on mining the foreground mask regions during training, and defect regions usually span only a limited range. As a region-based loss, dice loss also accounts for the correlation between the current pixel value and other pixel values. The anomaly map fusion network uses \(1*1\) convolutional kernels to aggregate the anomaly score maps. Activation is performed using relu and sigmoid functions to ensure that the results lie between 0 and 1. Since the amount of added noise is restricted in order to preserve the overall image information, we use relu to suppress responses at non-anomalous positions before the sigmoid, thereby reducing the likelihood of the network mispredicting the background. We also note that using relu and sigmoid is a common approach when applying dice loss [19, 42], whereas softplus would cause the prediction value to exceed one, surpassing the mask value. For the students’ results, cosine similarity and interpolation are used to obtain anomaly score maps with consistent scales, and the final detection map is obtained using the trained fusion network. The process of anomaly score map fusion is shown in Fig. 2.
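For illustration, the sketch below pulls together the synthetic-noise blending of Eq. (3), the dice loss of Eq. (4), and a small fusion head built from \(1*1\) convolutions with relu and sigmoid activations. It is a hedged approximation: the tensor shapes, the per-image sampling of \(\beta \), the small epsilon added for numerical stability, and the assumption that six score maps (three per student) are fused are ours, not details given in the text.

```python
# Illustrative PyTorch sketch (not the authors' code) of the fusion-training pieces.
import torch
import torch.nn as nn

def add_synthetic_noise(i_n, a, m):
    # Eq. (3): blend external texture a into normal image i_n inside mask m
    # i_n, a: (B, 3, H, W) images; m: (B, 1, H, W) binary anomaly mask
    beta = torch.empty(i_n.size(0), 1, 1, 1).uniform_(0.15, 1.0)  # random opacity
    return beta * (m * a) + (1 - beta) * (m * i_n) + (1 - m) * i_n

def dice_loss(pred, target, eps=1e-6):
    # Eq. (4); eps added for numerical stability (our assumption)
    p, g = pred.flatten(1), target.flatten(1)
    inter = (p * g).sum(dim=1)
    return (1.0 - 2.0 * inter / (p.pow(2).sum(dim=1) + g.pow(2).sum(dim=1) + eps)).mean()

class FusionHead(nn.Module):
    """Aggregates the students' anomaly score maps with 1*1 convolutions."""
    def __init__(self, n_maps=6, hidden=6):   # six input maps, three per student (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_maps, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, score_maps):
        # score_maps: list of (B, 1, H, W) maps produced by the two students
        return self.net(torch.cat(score_maps, dim=1))
```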

Fig. 2
figure 2

Network architecture for the anomaly score map fusion network. Synthetic noise is added to the anomaly-free image, and the network is trained with the corresponding mask

Experimental results and analysis

Implementation details

The dataset used is MVTec [6], which focuses on industrial inspection. It contains 15 categories of images divided into two types: object and texture. We use 3629 images for training (normal samples only) and 1725 images for testing (both normal and abnormal samples). All images are resized to \(128*128\) and \(256*256\) so that we obtain results for images of different sizes. To verify the effectiveness of our method on other datasets, we also conduct one-class novelty detection on CIFAR10 [26] with the same settings as RD4AD: images are resized to 32*32 and evaluated with Sample AUROC.

For comparison with recently published results, the optimizer settings are kept consistent with RD4AD [14]: we use an Adam optimizer with \(\beta =(0.5,0.999)\). The learning rate is set to 0.005, and we train for 200 epochs on an NVIDIA Tesla K40c and an Intel(R) Xeon(R) E5-2680 CPU@2.80 GHz. Our method uses WideResNet50 as the teacher model with 68.8 M parameters. The forward-distilled student network has the same number of parameters as the teacher model, and the reverse-distilled student network has 108 M parameters. For comparison, the RD4AD model using the same teacher network structure has 67.2 M parameters for the BN layer and 24.9 M parameters for the student network. The number of parameters of our proposed method is about 1.58 times that of RD4AD. This also implies that, although the two student networks of DSKD can be trained simultaneously, the training time of the reverse network is longer than that of RD4AD. It should be noted that since the training phase involves only normal samples, the optimal number of epochs cannot be directly observed for comparison. For a fair and visual comparison, RD4AD is run on the same machine with the same number of epochs. The result of the last epoch is used to construct the anomaly score map, with a Gaussian filter applied for smoothing.
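The optimizer configuration above can be reproduced with a few lines of PyTorch; the module objects below are dummy placeholders standing in for the two students and the fusion head, so this is only a skeleton of the reported settings rather than the full training code.

```python
# Skeleton of the reported training settings (Adam, betas=(0.5, 0.999), lr=0.005,
# 200 epochs). The modules below are dummy placeholders for the two students and
# the fusion head; the frozen teacher contributes no trainable parameters.
import torch
import torch.nn as nn

student_f = nn.Conv2d(3, 3, 1)    # placeholder: forward-distillation student
student_r = nn.Conv2d(3, 3, 1)    # placeholder: reverse-distillation student
fusion    = nn.Conv2d(6, 1, 1)    # placeholder: anomaly score map fusion head

params = [*student_f.parameters(), *student_r.parameters(), *fusion.parameters()]
optimizer = torch.optim.Adam(params, lr=0.005, betas=(0.5, 0.999))
num_epochs = 200
```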

We evaluate the proposed method in terms of anomaly detection and anomaly localization. The area under the receiver operating characteristic curve at the sample level (Sample AUROC) reflects the performance of the model in determining whether a sample is anomalous; following RD4AD, we take the maximum value of the anomaly score map as the sample-level anomaly score. AUROC at the pixel level (Pixel AUROC) and per-region overlap (PRO) reflect whether a pixel is correctly judged as anomalous: Pixel AUROC evaluates anomalies pixel by pixel, while PRO reduces the bias toward large anomalous regions. On MVTec, we compare our method with MKD [35], GT [18], GANomaly (GN) [2], Uninformed Student (US) [7], PSVDD [47], DAAD [20], MetaFormer (MF) [44], PaDiM [13], CutPaste [27], and RD4AD. The comparison methods for one-class novelty detection are LSA [1], OCGAN [34], HRN [21], DAAD, and RD4AD. We analyze the advantages and limitations of our approach on the widely used anomaly detection dataset MVTec by evaluating semantic and pixel anomaly detection.
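As an illustration of this evaluation protocol, the following sketch computes Sample AUROC from the maximum of each score map and Pixel AUROC over all pixels using scikit-learn; the array shapes and function names are our own assumptions.

```python
# Sketch of the evaluation protocol: the maximum of each anomaly map serves as the
# sample-level score (as in RD4AD); Pixel AUROC treats every pixel as one prediction.
from sklearn.metrics import roc_auc_score

def sample_auroc(score_maps, labels):
    # score_maps: (N, H, W) anomaly maps; labels: (N,) with 0 = normal, 1 = anomalous
    sample_scores = score_maps.reshape(len(score_maps), -1).max(axis=1)
    return roc_auc_score(labels, sample_scores)

def pixel_auroc(score_maps, masks):
    # masks: (N, H, W) pixel-level ground-truth anomaly masks
    return roc_auc_score(masks.reshape(-1), score_maps.reshape(-1))
```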

Anomaly detection

Table 1 shows the results of anomaly detection using Sample AUROC. Bold text in all tables in this paper indicates optimal values.

Table 1 Sample AUROC for MVTec

From Table 1, our proposed method achieves the optimal average performance on texture images for both image sizes; DSKD falls behind RD4AD and PaDiM only for the carpet and leather categories. Our method is also effective on CIFAR10, where adding random synthetic noise to natural images improves one-class novelty detection; Sample AUROC reaches \(86.5\%\).

We also observe variations in the class distribution within the test set. When considering normal samples as negative and abnormal samples as positive, the positive-to-negative ratio ranges from 0.6 to 5.4, with the transistor and pill categories being the smallest and largest, respectively. DSKD achieves the optimal Sample AUROC in both categories, showing that its anomaly detection performance is not strongly affected by sample imbalance. This finding also confirms that the anomaly score map fusion network consistently detects the simulated defects during training and effectively identifies normal samples that have not been previously encountered.

Limitations While DSKD achieves the optimal average performance for 128*128 images, the results for objects in 256*256 images are slightly worse than those obtained by RD4AD. We believe this is related to the external data sources used for the noise and to the locations where it is added. In texture images, the noise is always added on the target to be detected, whereas object images contain part of the background. The data sources used to introduce synthetic noise are themselves textures and thus help the texture categories improve their metrics. Moreover, increasing the image size does not necessarily improve the Sample AUROC; this is evident for the tile, screw, and transistor categories. Overall, DSKD is competitive in Sample AUROC, especially for textures.

Anomaly localization

Table 2 Pixel AUROC for MVTec
Table 3 PRO for MVTec

Tables 2 and 3 present the Pixel AUROC and PRO results. The two tables are similar in that DSKD achieves the best average performance in all texture categories; in particular, for 128*128 images it performs better than RD4AD in every texture category. For objects, although the results are not as strong as for textures, DSKD performs better than RD4AD in the bottle category. In terms of Pixel AUROC, DSKD achieves the optimal metric in 10 and 6 of the 15 categories for the two image sizes, respectively. The advantages in PRO come mainly with 128*128 images. Overall, Tables 2 and 3 show that our proposed method is highly competitive in the texture categories.

Limitations DSKD has the third highest average PRO of all compared methods; moreover, its PRO is below 90 in the transistor and metal nut categories. The location of defects in both categories is highly variable: transistor images have misplaced defects and metal nut images have flip defects, and these types of defects usually cover the whole inspection area. DSKD is similar to [39] in that it performs well in locating small anomalies in structurally simple samples but less well in locating large anomalous regions in object categories. We believe that the modeling of noise is a factor: for objects with parts and possible defects of different sizes, we use the same noise simulation strategy. Due to the lack of prior knowledge about possible faults, there is still room for improvement in this noise simulation strategy for samples with a large range of faults. To explore practical applications, we utilized the Gamma distribution [24] for automatic threshold selection and pixel-level anomaly detection on 128*128 images. The average recall is \(15.8\%\) higher than the accuracy, which is acceptable for recall-sensitive applications.
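The threshold-selection step is only mentioned briefly, so the following scipy sketch should be read as one plausible realization rather than the exact procedure of [24]: a Gamma distribution is fitted to the pixel scores of normal images and a high quantile (here 0.99, an assumed value) is used as the decision threshold.

```python
# One plausible realization (our assumption, not necessarily the procedure of [24])
# of automatic threshold selection from the Gamma distribution of normal pixel scores.
import numpy as np
from scipy import stats

def gamma_threshold(normal_pixel_scores, quantile=0.99):
    shape, loc, scale = stats.gamma.fit(normal_pixel_scores)
    return stats.gamma.ppf(quantile, shape, loc=loc, scale=scale)

# Example: pixels whose anomaly score exceeds the fitted threshold are flagged as defects.
scores_normal = np.random.gamma(2.0, 0.1, size=10000)        # stand-in for real scores
threshold = gamma_threshold(scores_normal)
```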

Ablation study and visual analysis

Our proposed method introduces two key innovations: a novel network structure and a result fusion technique. The network structure includes a forward (Pos) and a reverse (Res) student network. Tables 4 and 5 present three metrics for the two student networks on 128*128 and 256*256 images, respectively. In addition, we manually set the weighting of the two student networks to 1:3, labeled manual fusion (MaFu). Feature fusion using synthetic noise is labeled synthetic fusion.

Table 4 Metrics for forward distillation, reverse distillation, manual fusion and our method in 128 size
Table 5 Metrics for forward distillation, reverse distillation, manual fusion and our method in 256 size
Fig. 3
figure 3

Visualization results for MVTec, where the blue color represents pixels predicted as normal

Typically, the results of forward distillation are weaker than those of reverse distillation. However, there are some inconsistencies in the results for the bottle and transistor categories, which highlights the necessity of incorporating forward distillation in our DSKD approach. The metrics for reverse distillation are also better with our method, suggesting that our use of skip connections and an attention mechanism is effective. The Pixel AUROC for manual fusion is inconsistent across the two image sizes. By contrast, DSKD with synthetic noise enhancement outperforms both unidirectional distillation approaches and manual fusion in terms of Pixel AUROC. This shows that our proposed approach yields a clear improvement in anomaly localization for dual-student networks.

In Fig. 3, we show the visualization results of forward distillation, reverse distillation, RD4AD, and our proposed method obtained on 15 categories with an image size of \(256*256\).

Our results reveal that forward distillation tends to identify more defective regions than reverse distillation, for example in the bottle and cable categories, and this is also usually true compared with the regions identified by RD4AD. However, forward distillation also produces some false defect regions, such as those seen in the screw and metal nut categories; this phenomenon is not seen in our method. Our method combines the advantages of forward and reverse distillation to obtain better defect prediction regions (see, for example, the transistor and wood categories). Compared with RD4AD, our method predicts complete defect regions even when the defect probability is high.

Discussion

Why two student models with different distillation patterns were applied Our dual-student training strategy uses only normal samples, and the two students can be trained in parallel. As Table 5 shows, a single-student network performs inconsistently across different categories, and how to aggregate the two parts of detection is a novel problem. The forward and reverse distillation results in Fig. 3 are also inconsistent; in addition, forward distillation is usually more sensitive to anomalous regions than reverse distillation. Our model draws on the advantages of both types of distillation.

Why we used synthetic noise for model integration When employing multiple models, how to aggregate the detection results is a key question. For anomaly detection, texture anomalies are an important component, and an existing texture dataset can be incorporated through Eq. (3). The data can be quickly fused into the training process, and it only adds six 1*1 convolution parameters, which is acceptable for both training and testing. According to Tables 1, 2 and 3, such synthetic noise is helpful for detecting both texture-related and object-related anomalies. Our proposed synthetic noise improves the metrics and allows the results of the two students to be fused adaptively.

Conclusion

In this paper, we present a novel approach for anomaly detection called dual-student knowledge distillation. A reverse distillation network is constructed using skip connections and an attention module, which helps it obtain detailed information and high-level representations. In addition, we construct a forward distillation network with a simple architecture. To combine the two distillation results, we introduce synthetic noise that helps adaptively assign weights to the different anomaly score maps. Experimental results demonstrate that our approach achieves state-of-the-art performance on the texture images of the MVTec dataset while obtaining competitive scores on object images.

We believe that object detection can be further improved by combining the results of object segmentation with the noise to reduce the effect of noise on distinguishing an object from its background. In addition, the region where the synthetic noise is added could be determined by manual annotation during training, while no changes are required during inference. Setting the probability threshold in anomaly detection is another area for further research, since the high-probability part of the model’s predicted region can cover most of the real defect regions.