Deep Industrial Image Anomaly Detection: A Survey

The recent rapid development of deep learning has marked a milestone in industrial Image Anomaly Detection (IAD). In this paper, we provide a comprehensive review of deep learning-based image anomaly detection techniques from the perspectives of neural network architectures, levels of supervision, loss functions, metrics and datasets. In addition, we extract a new setting from industrial manufacturing and review the current IAD approaches under this proposed setting. Moreover, we highlight several open challenges for image anomaly detection. The merits and downsides of representative network architectures under varying supervision are discussed. Finally, we summarize the research findings and point out future research directions. More resources are available at https://github.com/M-3LAB/awesome-industrial-anomaly-detection.


Introduction
We review the recent advances in deep learning-based image anomaly detection, since the rapid development of deep learning can bring the capabilities of image anomaly detection onto the factory floor. In modern manufacturing, IAD is usually performed at the end of the manufacturing process to identify product defects. The price of a product is significantly affected by the severity of its defects. In addition, if a flaw exceeds a certain threshold, the product is discarded. Historically, the majority of anomaly detection tasks have been performed by humans, which suffers from the following disadvantages:
• Human fatigue is unavoidable and results in false negatives (i.e., the ground truth is abnormal, while the human's judgment is normal).
• Long and intensive work on anomaly detection may cause health problems, such as visual impairment.
• Locating anomalies requires a significant number of employees, raising operational costs.
Thus, the goal of IAD algorithms is to reduce human labour and improve productivity and product quality. Before deep learning, the performance of IAD could not fulfil the demands of industrial manufacturing. Nowadays, deep learning methods achieve good results, and most of them are more than 97% accurate. Still, IAD faces many problems in real-world use. To comprehensively explore the effectiveness and applicable scenarios of current methods, the more careful analysis of IAD that we conduct in this survey is both necessary and significant.
Table 1: Related surveys and ours for IAD.

Table 1, which compares our survey with those of Czimmermann et al. [1], Tao et al. [2] and Cui et al. [3], clearly demonstrates the merits of our survey in terms of datasets, metrics, neural network architectures, levels of supervision and promising settings for industrial manufacturing. As a representative review that focuses more on traditional methods, Czimmermann et al. [1] discuss deep learning methods only briefly, whereas our survey discusses deep learning in more depth. Firstly, our study covers twice as many IAD datasets as Tao et al. [2]. Secondly, we analyze the performance of IAD using the most comprehensive set of image-level and pixel-level metrics. In contrast, Cui et al. [3] and Tao et al. [2] employ only image-level metrics, neglecting the anomaly localization performance of IAD. Thirdly, our study develops a taxonomy based on the design of neural network architectures with varying degrees of supervision. Finally, to bridge the gap between academic research and real-world industry needs, we review the current IAD algorithms under industrial manufacturing settings.
As an emerging field, research on IAD must fully consider industrial manufacturing requirements. The following is a summary of the challenging issues that need to be investigated:
• IAD datasets should be gathered from actual manufacturing lines, not labs. The public cannot access real-world anomalous datasets due to privacy concerns. The majority of open-source IAD datasets generate anomalies from anomaly-free products. In other words, the abnormalities in open-source IAD datasets may not occur in actual production lines, which makes deploying IAD in industrial manufacturing very challenging.
• It is challenging to create a unified IAD model in the absence of multi-domain IAD datasets. Recently, You et al. [4] proposed a unified IAD model for objects of multiple classes. However, they disregard the fact that commodities produced in the same plant should be of the same sort. For example, an automaker manufactures several types of workpieces but does not produce fruit. Current popular IAD datasets, such as MVTec AD [5] and MVTec LOCO [6], consist of numerous classes but not multiple domains. To simulate a realistic manufacturing process, we must create a new IAD dataset collected from multiple domains.
• It is urgent to set up a uniform assessment of IAD performance at the image level and the pixel level. The majority of IAD metrics shrink the anomaly mask (ground truth) to the size of the feature map for evaluation, which inevitably reduces the precision of the assessment. Moreover, we find that certain IAD methods perform well on image AUROC but poorly on pixel AP, or vice versa. Therefore, it is essential to develop a uniform metric for assessing IAD performance at both the image and pixel levels.
• We should design a more efficient loss function that can leverage both the guidance of labelled data and the exploration of unlabelled data. In realistic manufacturing scenarios, only a limited number of anomalous samples are available. However, most unsupervised IAD methods outperform semi-supervised IAD methods. Given this failure of semi-supervised IAD, we call for more attention to feature extraction and loss function design. In particular, improving feature extraction from abnormal samples and redesigning the deviation loss function could make full use of labelled anomalies and push the feature space of abnormal samples away from that of normal samples.
This paper categorizes the various methods into several paradigms and clearly analyzes the advantages and disadvantages of each. It allows the reader to understand the state of the art quickly and provides a reliable guide for selecting an algorithm for practical applications. More importantly, we analyze the disadvantages of the different paradigms and the current main challenges, so that subsequent researchers can quickly find directions to push the field forward.

Fig. 1: Framework of this survey.

Contributions
The main contributions of this survey can be summarized as follows:
• It provides an in-depth review of image anomaly detection by considering the design of neural network architectures with varying degrees of supervision.
• It provides a comprehensive review of current IAD algorithms in different settings to bridge the gap between academic research and real-world industrial manufacturing.
• It summarizes the main issues and potential challenges in IAD, outlining research directions for future work.
The rest of this paper is organized as shown in Fig. 1. In Section 2 and Section 3, we review IAD on the basis of neural network architectures with different levels of supervision. Next, we review the recent advances in IAD under our proposed industrial manufacturing setting in Section 4. We describe the popular datasets and take a retrospective view of the metric functions in Section 5. Then, we provide an analysis of the performance of current IAD methods on various datasets in Section 6. Finally, we provide future research directions for IAD in Section 7.

Unsupervised Anomaly Detection
The majority of current research focuses on unsupervised anomaly detection, based on the assumption that collecting abnormal samples incurs massive human and financial costs. In this setting, only normal samples are included in the training set, whereas both abnormal and normal samples are included in the test set. Anomaly detection in industrial images is a subset of out-of-distribution (OOD) detection problems. Before the rise of deep learning, differential detection and filtering were frequently used to detect anomalies in industrial images. Following the release of MVTec AD [5], methods for anomaly detection in industrial images can be divided into two categories: feature-embedding-based and reconstruction-based. Currently, more AD techniques are based on feature embedding.

Teacher-Student Architecture
The performance of these methods is outstanding, but they depend on pre-trained models such as ResNet [7], VGG [8] and EfficientNet [?]. The selection of the ideal teacher model is crucial. This type of method is summarized in Table 2. The structure of the network and the method of distillation are the primary distinctions between the various techniques. The teacher-student network architecture depicted in Fig. 2 is the most standard technique for detecting industrial image anomalies. This method typically selects some layers of a backbone network pre-trained on a large-scale dataset as a fixed-parameter teacher model. During training, the teacher model imparts to the student model the knowledge of extracting features from normal samples. During inference, the features that the teacher network and the student network extract from normal test images are similar, whereas the features they extract from abnormal test images are quite distinct. By comparing the feature maps generated by the two networks, it is possible to generate an anomaly score map of the same size as the feature maps. Then, by upsampling the anomaly score map to the size of the input image, we obtain anomaly scores for the individual locations of the input image. On the basis of these scores, it is possible to determine whether the test image is abnormal.
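The inference step described above can be illustrated with a minimal sketch. It assumes that the teacher and student feature maps have already been computed and are given as nested lists; the function names, the squared distance between normalized features, nearest-neighbour upsampling and the maximum as image-level score are illustrative assumptions, not the procedure of any specific surveyed method.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def anomaly_map(teacher_feats, student_feats):
    """Per-location squared distance between normalized teacher and student features.

    Both inputs are H x W grids of C-dimensional feature vectors (nested lists).
    """
    score_map = []
    for t_row, s_row in zip(teacher_feats, student_feats):
        row = []
        for t, s in zip(t_row, s_row):
            t_n, s_n = l2_normalize(t), l2_normalize(s)
            row.append(sum((a - b) ** 2 for a, b in zip(t_n, s_n)))
        score_map.append(row)
    return score_map


def upsample_nearest(score_map, factor):
    """Enlarge the score map towards input resolution by repeating each entry."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def image_score(score_map):
    """Image-level score: the maximum pixel score (one common, assumed choice)."""
    return max(max(row) for row in score_map)


# Toy 2 x 2 feature maps: the student disagrees with the teacher at position (1, 1).
teacher = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]]
student = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
pixel_scores = upsample_nearest(anomaly_map(teacher, student), factor=2)
print(image_score(pixel_scores))
```

In this toy example only the location where the two networks disagree receives a non-zero score, which is the behaviour the teacher-student paradigm relies on for localization.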
Bergmann et al. [9] are the first to use a teacher-student architecture for anomaly detection. The model is straightforward and effective, significantly outperforming other benchmark methods. STPM [11] and MKD [10] both use multi-scale features from different network layers for distillation, but they do so in different ways. With multi-scale distillation, the normal sample features extracted by the student network become more similar to those extracted by the teacher network, whereas the abnormal sample features become more dissimilar.
In addition, MKD finds that a lighter student network performs better than a student network identical in structure to the teacher network. Based on STPM, RSTPM [12,16] adds a second pair of teacher-student networks. During inference, the new teacher network is placed behind the original teacher-student network and is responsible for reconstructing the features.
When anomalous images are presented, the student network typically reconstructs normal features, which can be distinguished from those of the teacher network. RSTPM also includes a mechanism for transferring features from the teacher network to the student network in order to facilitate feature reconstruction. RD4AD [13] and RSTPM share certain similarities. RSTPM employs two pairs of teacher-student networks for feature reconstruction, whereas RD4AD employs only one pair. RD4AD proposes a Multi-scale Feature Fusion (MFF) block and a One-Class Bottleneck (OCB) to form an embedding, which eliminates redundant features at multiple scales so that a single pair of teacher-student networks can perform feature reconstruction effectively. As a result, the features that the teacher and student networks of RD4AD extract from abnormal images differ significantly during inference. AST [15] observes that the abnormal image features extracted by teacher and student models with the same structure are overly similar, and therefore proposes an asymmetric teacher-student architecture to address this issue. AST also introduces a normalizing flow to avoid this problem and to prevent the estimation bias caused by the inconsistency of the two network structures. Previous teacher-student anomaly detection methods suffer from overfitting as a result of the mismatch between neural network capacity and the amount of knowledge. By incorporating the Context Similarity Loss (CSL) and Adaptive Hard Sample Mining (AHSM) modules, Informative Knowledge Distillation (IKD) [14] aims to reduce overfitting. CSL helps the student network understand the context-aware structure of the data manifold. AHSM concentrates on hard samples that carry a lot of information.

One-Class Classification
One-class classification techniques rely more heavily on abnormal samples. If the generated abnormal samples are of poor quality, the method's performance will be severely compromised. As shown in Table 3, with the exception of MemSeg [17], the training of these methods relies on SVDD and Cross-Entropy losses; consequently, the performance of the vast majority of them is somewhat inadequate.

Method | Loss Function | Pre-trained | Highlights
SE-SVDD [20] | … | … | The paper proposes a Semantic Correlation module (SCB) to represent abnormal semantic information.
MOCCA [21] | L2, SVDD | - | The paper extends a single boundary to a hard boundary and a soft boundary; it also trains an AE as feature extractor.
[22] | Cross-Entropy | Xception [23] | The paper uses Xception to train a classification network.
PANDA [24] | SVDD, Log-Likelihood | DN2 [25] | The paper introduces a method to combat collapse in model adaptation.
[26] | Cross-Entropy, Contrastive | ResNet | The paper presents a novel distribution-augmented contrastive learning to enhance the representation ability of the network.
[27] | - | - | The paper performs template matching on salient regions to detect anomalies.
[28] | L1, L2 | - | The paper uses saliency detection to obtain object contours to assist anomaly detection.
UISDI [29] | L1, L2, Log-Likelihood | - | The paper uses salient object detection to segment the foreground and background to obtain abnormal regions.
CutPaste [30] | Cross-Entropy | EfficientNet | The paper applies "cut and paste" augmentation to binary anomaly classification.
[31] | Cosine Similarity, Contrastive | - | The paper applies dynamic local augmentation to generate negative samples.
CPC-AD [32] | InfoNCE | - | The paper applies the Contrastive Predictive Coding (CPC) model to AD and obtains an anomaly score through a patch-wise loss.
MemSeg [17] | L1, Focal | ResNet | The paper artificially creates anomalies in the foreground of products and treats the detection of artificial anomalies as a segmentation task.
Anomaly detection can also be viewed as a One-Class Classification (OCC) problem, which has inspired some research. As depicted in Fig. 3, the method finds a hypersphere that separates normal sample features from abnormal sample features during training. During inference, the method determines whether a sample is abnormal based on the position of the test sample's features relative to the hypersphere. Since the training set does not contain abnormal samples, some methods create abnormal samples artificially to improve the accuracy of the hypersphere.
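The hypersphere idea can be sketched as follows. This is a deliberately simplified illustration that assumes fixed, pre-extracted features: the center is the mean of the normal features and the radius is a quantile of their distances to it. Deep SVDD-style methods additionally train the feature extractor, which is omitted here, and the function names and the `quantile` parameter are hypothetical.

```python
import math


def fit_hypersphere(normal_feats, quantile=0.95):
    """Estimate a center and radius that enclose most normal features.

    Assumption: features are fixed; only the sphere is fitted.
    """
    dim = len(normal_feats[0])
    center = [sum(f[d] for f in normal_feats) / len(normal_feats) for d in range(dim)]
    dists = sorted(math.dist(f, center) for f in normal_feats)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return center, dists[idx]


def anomaly_score(feat, center, radius):
    """Signed distance to the sphere surface: positive means outside (abnormal)."""
    return math.dist(feat, center) - radius


normal = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center, radius = fit_hypersphere(normal)
print(anomaly_score([0.5, 0.5], center, radius) < 0)  # inside the sphere
print(anomaly_score([5.0, 5.0], center, radius) > 0)  # outside the sphere
```

Artificially created abnormal samples, where a method uses them, would enter as additional training signal that pushes such samples outside the sphere; that step is not shown.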
SVDD [33] is a classic algorithm for the OCC problem, and PatchSVDD [18], DSPSVDD [19] and SE-SVDD [20] improve it for industrial image AD. PatchSVDD [18] divides the image into uniform patches and feeds them to the model for training, which significantly enhances the model's ability to detect anomalies. DSPSVDD [19] designs an improved comprehensive optimization objective for the deep SVDD model that simultaneously considers hypersphere volume minimization and network reconstruction error minimization to extract deep data features more effectively. SE-SVDD proposes a Semantic Correlation module (SCB) to improve the representation of abnormal semantics and the accuracy of anomaly localization by extracting multi-level features.

Fig. 3: Architecture of one-class classification models (panels: Training, Testing).
MOCCA [21] employs multi-layer features for anomaly detection. Unlike SE-SVDD, MOCCA uses an autoencoder to extract features and locates the boundary of normal features at each layer. Sauter et al. [22] use the Xception network for classification and obtain results comparable to SVDD. FCDD [34] employs a fully convolutional neural network for OCC. Since the relative positions of the features of each image layer do not change during the convolution process, FCDD yields more interpretable results than alternative methods.
PANDA [24] examines the transfer of pre-trained features and introduces an early stopping mechanism to the OCC problem. In addition, Reiss et al. [35] investigate the issue of catastrophic forgetting in PANDA. They propose a new loss function capable of overcoming the failure modes of both center-loss and contrastive-loss methods, and they replace the Euclidean distance with a confidence-invariant angular center loss for prediction.
DisAug CLR [26] proposes a two-stage anomaly detection framework. The first stage reduces the uniformity of contrastive representations by means of a novel distribution-augmented contrastive learning, after which abnormal and normal sample representations are easier to distinguish. The second stage builds a one-class classifier using the representations learned in the first stage. Yoa et al. [31] present a novel dynamic local augmentation that generates negative image pairs from a normal training dataset, which is effective for anomaly detection. De et al. [32] apply the Contrastive Predictive Coding (CPC) [36] model to anomaly detection and segmentation, using the patch-wise contrastive loss as the anomaly score to localize anomalies.
In addition, inspired by salient object detection [37][38][39], many methods apply saliency detection to anomaly detection. Bai et al. [27] propose to use the Fourier transform to detect salient regions of images and compare the salient regions with templates to detect anomalies. Niu et al. [28] use salient object detection to obtain object contours, thereby assisting the detection of anomalies. Qiu et al. [29] propose a Multi-Scale Saliency Detection (MSSD) method that separates the foreground from the background to obtain coarse anomaly regions and then refines the detected results. Furthermore, GradCAM [40], a common method for obtaining saliency maps, is also used in various anomaly detection algorithms. Both CutPaste [30] and CAVGA [41] treat anomaly detection as a classification problem and use GradCAM for pixel-level anomaly localization.
CutPaste [30] is a representative OCC method based on data augmentation. It generates abnormal images by cutting and pasting portions of normal images, which allows the network to learn to distinguish abnormal images. Segmentation-based methods are also useful; they put more emphasis on pixel-level anomaly localization. Iquebal et al. [42] demonstrate that, when the flow is known, the maximum a posteriori estimation of image labels can be formulated as a continuous max-flow problem. Anomaly segmentation is then accomplished by obtaining flows iteratively using a novel Markov random field on the image domain. The technique shows its adaptability on a dataset for metal additive manufacturing anomaly detection [43]. MemSeg [17] stores the features of normal images in a memory bank in order to improve the segmentation network's ability to distinguish abnormal regions. To prevent the influence of background factors, MemSeg introduces anomalies from external datasets only in the foreground of items, which is another reason for its excellent performance.
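The cut-and-paste augmentation mentioned for CutPaste [30] can be sketched in a few lines. The version below is a simplified illustration on a grayscale image stored as a nested list; the function name, the fixed patch size and the absence of any further patch transformation are assumptions made for brevity and do not reproduce the original augmentation in detail.

```python
import random


def cut_paste(image, patch_h, patch_w, rng=None):
    """Return a copy of `image` with one rectangular patch pasted at another location.

    `image` is an H x W nested list; the source and target positions are random.
    """
    if rng is None:
        rng = random.Random()
    h, w = len(image), len(image[0])
    src_r = rng.randrange(h - patch_h + 1)
    src_c = rng.randrange(w - patch_w + 1)
    dst_r = rng.randrange(h - patch_h + 1)
    dst_c = rng.randrange(w - patch_w + 1)
    # Cut the patch from the untouched original, then paste it into the copy.
    patch = [row[src_c:src_c + patch_w] for row in image[src_r:src_r + patch_h]]
    out = [list(row) for row in image]
    for i in range(patch_h):
        out[dst_r + i][dst_c:dst_c + patch_w] = patch[i]
    return out


normal_image = [[r * 6 + c for c in range(6)] for r in range(6)]
pseudo_anomaly = cut_paste(normal_image, patch_h=2, patch_w=3, rng=random.Random(0))
# (normal_image, label 0) and (pseudo_anomaly, label 1) would then form training
# pairs for a binary classifier.
```

The original image is left unchanged, so each normal sample can yield both a negative and a synthetic positive training example.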

Distribution Map
Distribution-map-based methods require a suitable mapping target for training, and the choice of mapping method affects model performance. As shown in Table 4, Normalizing Flow (NF)-based methods predominate. As a generative model, NF has a strong mapping ability, and it has also demonstrated good performance in AD tasks.
Distribution-map-based methods are very similar to OCC-based methods, except that OCC-based methods concentrate on finding feature boundaries, whereas distribution-map-based methods attempt to map features into a desired distribution. A common framework for these methods is shown in Fig. 4. The desired distribution is typically a MultiVariate Gaussian (MVG) distribution. This type of method first employs a strong pre-trained network to extract the features of normal images and then maps the extracted features to the Gaussian distribution using a mapping module. The features of abnormal images that appear during evaluation deviate from this distribution, and the abnormal probability can be calculated from the degree of deviation.
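As a minimal sketch of measuring the deviation from a Gaussian model of normality, the example below fits a mean and covariance to pre-extracted normal features and scores a test feature by its Mahalanobis distance. The function names and the regularisation constant `eps` are illustrative assumptions, and the learned mapping module that the surveyed methods use is replaced here by a direct Gaussian fit.

```python
import math


def fit_gaussian(feats, eps=1e-3):
    """Estimate the mean and (regularised) covariance of normal features."""
    n, dim = len(feats), len(feats[0])
    mean = [sum(f[d] for f in feats) / n for d in range(dim)]
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in feats) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    for i in range(dim):
        cov[i][i] += eps  # keeps the covariance invertible
    return mean, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis(feat, mean, cov):
    """Distance of a feature from the fitted Gaussian, used as the anomaly score."""
    diff = [a - m for a, m in zip(feat, mean)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))


mean, cov = fit_gaussian([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(mahalanobis([1.0, 1.0], mean, cov) < mahalanobis([5.0, 5.0], mean, cov))
```

Solving the linear system instead of inverting the covariance explicitly is a design choice for numerical convenience; both yield the same distance.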
Method | Loss Function | Pre-trained | Highlights
[44] | … | … | The paper uses PCA and ResNet to extract features and model their distribution.
[45] | Cross-Entropy | ResNet, EfficientNet | The paper establishes a model of normality by fitting a multivariate Gaussian to feature representations of a pre-trained network.
[46] | Mahalanobis Distance | EfficientNet | The paper generates a multivariate Gaussian distribution for the normal class and mitigates the catastrophic forgetting in past research.
PEDENet [47] | Log-Likelihood, Cross-Entropy, Regularization | - | The model predicts the location of a patch and compares it with the actual location to judge the abnormality.
PFM [48] | L2 | ResNet | The paper proposes bidirectional and multi-hierarchical bidirectional pre-trained feature mapping based on vanilla feature mapping.
PEFM [49] | L2 | ResNet | The paper introduces position encoding into PFM.
FYD [50] | L2 | ResNet | The paper aligns samples at image and feature levels to detect anomalies.
DifferNet [51] | Log-Likelihood | ResNet | The paper is the first to introduce normalizing flows into anomaly detection.
CS-Flow [52] | Log-Likelihood | ResNet | The paper uses information from multi-scale feature maps and improves DifferNet.
CFlow-AD [53] | Log-Likelihood | ResNet | The paper introduces positional encoding into the conditional normalizing flow framework.
CAINNFlow [54] | Log-Likelihood | ViT [55] | The paper uses ViT to replace ResNet and achieves better results.
FastFlow [56] | Log-Likelihood | ResNet | The paper introduces an alternate stacking of large and small convolution kernels in the NF module to model global and local distributions.
AltUB [57] | Log-Likelihood | ResNet | The paper designs a module for normalizing flow-based methods and improves their performance.

Tailanian et al. [44] propose an a contrario framework that applies statistical analysis to feature maps produced by patch PCA and ResNet to detect anomalies in images, and it performs well on leather samples. By fitting a multivariate Gaussian to the feature representations of a pre-trained network, Rippel et al. [45] establish a model of normality. Nonetheless, the issue of catastrophic forgetting remains unresolved. Based on the relationship between generative and discriminative modeling, Rippel et al. [46] generate a multivariate Gaussian distribution for the normal class and demonstrate the efficacy of this concept on Deep SVDD and FCDD, which mitigates the catastrophic forgetting observed in previous research. The PEDENet [47] framework consists of a Patch Embedding (PE) network, a Density Estimation (DE) network, and a Location Prediction (LP) network. First, the PE network reduces the size of the features extracted by the pre-trained network. Then, using the DE network, which is inspired by the Gaussian mixture model, and the LP network, the model predicts the relative position of each patch embedding and, based on the difference between the predicted and actual positions during inference, decides whether the image is abnormal. Pre-trained Feature Mapping (PFM) [48] proposes bidirectional and multi-hierarchical bidirectional pre-trained feature mapping to enhance the performance of vanilla feature mapping. In addition, Wan et al. [49] add position encoding to the PFM framework and propose Position Encoding enhanced Feature Mapping (PEFM) to further enhance PFM. FYD [50] introduces registration to industrial image AD for the first time. FYD proposes a coarse-to-fine alignment method that starts by aligning the foreground of objects at the image level. Next, in the refinement alignment stage, non-contrastive learning is used to increase the similarity of features between all corresponding positions in a batch.
Normalizing Flows (NF) [58] are a technique for constructing complex distributions by transforming a probability density through a series of invertible mappings. NF methods extract features of normal images with a pre-trained model, such as ResNet [59] or Swin Transformer [60], and transform the feature distribution into a Gaussian distribution during the training phase. In the test phase, after passing through the NF, the features of abnormal images deviate from the Gaussian distribution of the training phase, which is the key principle for identifying anomalies. DifferNet [51] is the first work to use NF for industrial image AD. By incorporating cross-convolution blocks within the normalizing flow to assign probabilities, CS-Flow [52] makes use of the context within and between multi-scale feature maps to improve DifferNet. CFlow-AD [53] adds positional encoding to the conditional normalizing flow framework to achieve superior results. In addition, CFlow-AD [53] analyzes in depth why the multivariate Gaussian assumption is a reasonable prior in earlier models and why the more general NF framework aims to converge to similar results with less computation. FastFlow [56] introduces an alternate stacking of large and small convolution kernels in the NF module to model global and local distributions efficiently. CAINNFlow [54] enhances the performance of the model by introducing the attention mechanism CBAM [61] to the NF module. In techniques such as FastFlow and CFlow-AD, the center of the feature distribution is not 0 and performance is unstable. To solve this problem, Kim et al. [62] propose a simple solution, AltUB [57], which uses alternating training to update the base distribution of the normalizing flow for anomaly detection. The effect of AltUB is verified on CFlow-AD and FastFlow.
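The NF principle described above can be written compactly with the change-of-variables formula. The following is a sketch in our own notation, which is an assumption and not taken from the surveyed papers: $f_\theta$ denotes the invertible flow, $x$ a $d$-dimensional feature extracted by the pre-trained network, $z = f_\theta(x)$ its latent representation, and $s(x)$ the anomaly score.

```latex
\begin{align}
\log p_X(x) &= \log p_Z\bigl(f_\theta(x)\bigr)
  + \log\left|\det\frac{\partial f_\theta(x)}{\partial x}\right|, \\
\log p_Z(z) &= -\frac{d}{2}\log(2\pi) - \frac{1}{2}\lVert z\rVert_2^2
  \quad\text{for } p_Z = \mathcal{N}(0, I), \\
s(x) &= -\log p_X(x).
\end{align}
```

Training maximizes $\log p_X(x)$ over the features of normal images, so normal features receive low scores while features that deviate from the learned distribution receive high scores. Under this reading, updating the base distribution amounts to replacing the fixed $\mathcal{N}(0, I)$ with a Gaussian whose parameters are also adjusted during training.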

Memory Bank
As illustrated in Table 5, memory-based methods often do not require a loss function for training, and models are constructed quickly. Their performance is ensured by a robust pre-trained network and additional memory space, and this type of method is currently the most effective in IAD tasks.

Method | Loss Function | Pre-trained | Highlights
[62] | … | … | The paper reduces the computational cost of the inverse of the multi-dimensional covariance tensor so that higher-resolution images can be used.
SOMAD [64] | - | ResNet | The paper maintains normal characteristics by using topological memory based on multi-scale features.
GCPF [65] | - | ResNet | The paper processes normal features into multiple independent multivariate Gaussian clusters.
MSPB [66] | Kmeans, Cosine Similarity, SVDD | VGG | The paper enhances network representation capabilities by learning patch position relationships.
SPD [67] | Focal, InfoNCE, SPD, Cosine Similarity | - | The paper designs a contrastive learning method to retrain ResNet to enhance its ability to represent defects.
PatchCore [68] | - | ResNet | The paper introduces a coreset sampling method to build a memory bank.
CFA [69] | SVDD | ResNet | The paper improves PatchCore so that image features are distributed on a hypersphere.
FAPM [70] | - | ResNet | The paper puts features from different image positions into different memory banks to speed up retrieval.
N-pad [71] | Mahalanobis Distance, Log-Likelihood | ResNet | The paper allows for possible edge misalignment by estimating a nominal distribution for each pixel using the pixel's neighborhood features.
The primary distinction between memory bank-based methods and OCC-based methods such as SVDD is that memory bank-based methods require additional memory space to store image features. As shown in Fig. 5, these methods require minimal network training and only need to sample or map the collected normal image features for inference. During inference, the features of the test image are compared with the features in the memory bank. The abnormal probability of the test image is determined by its spatial distance from the normal features in the memory bank. K Nearest Neighbors (KNN) [72] is a widely used algorithm for unsupervised anomaly detection, but it operates only at the sample level. Semantic Pyramid Anomaly Detection (SPADE) [63] is inspired by KNN and utilizes correspondences based on a multi-resolution feature pyramid to obtain pixel-level anomaly segmentation results. PaDiM [73] employs multivariate Gaussian distributions to construct a probabilistic representation of the normal class. Consequently, the memory bank size is determined solely by the image resolution and not by the size of the training set. PaDiM requires the batch inverse of the multidimensional covariance tensor, which makes it challenging to scale up to larger CNNs due to the increased feature size. To reduce the computational cost of the inverse by a factor of three, Kim et al. [62] generalize random feature selection into semi-orthogonal embedding.
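The memory bank comparison at inference time can be sketched as a nearest-neighbour search. The example assumes pre-extracted patch features given as lists of floats; the function names, the Euclidean distance, the averaging over `k` neighbours and the maximum as image-level score are illustrative assumptions rather than the exact scoring rule of any surveyed method.

```python
import math


def build_memory_bank(normal_patch_feats):
    """Store the patch features of normal training images without any training."""
    return [list(f) for f in normal_patch_feats]


def patch_score(feat, bank, k=1):
    """Mean Euclidean distance from `feat` to its k nearest features in the bank."""
    dists = sorted(math.dist(feat, m) for m in bank)
    return sum(dists[:k]) / k


def image_score(patch_feats, bank, k=1):
    """Image-level score: the most anomalous patch decides (an assumed choice)."""
    return max(patch_score(f, bank, k) for f in patch_feats)


bank = build_memory_bank([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(patch_score([0.0, 0.0], bank))            # a stored normal feature scores 0.0
print(image_score([[0.0, 0.0], [3.0, 4.0]], bank))
```

Because the bank grows with the number of stored features, the cost of this search motivates the compression strategies discussed for the individual methods.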
Meanwhile, Self-organizing Map for Anomaly Detection (SOMAD) [64] and GCPF [65] enhance the storage of normal features. SOMAD preserves normal characteristics by employing topological memory based on multi-scale features, while GCPF transforms normal features into multiple independent multivariate Gaussian clusters.
PatchCore [68] is a significant advancement in industrial image AD that substantially raises performance on MVTec AD. PatchCore has two distinguishing features. First, its memory bank is coreset-subsampled to ensure a low inference cost while maximizing performance. Second, PatchCore determines whether a test sample is abnormal based on the distance between the test sample's nearest neighbor feature in the memory bank and other features. This reweighting process makes PatchCore more robust. Since PatchCore was proposed, numerous improved methods have been developed on its foundation. Lee et al. [69] propose Coupled-hypersphere-based Feature Adaptation (CFA) to obtain target-oriented features. The center and surface of the hypersphere in the memory bank are obtained through transfer learning, and the position of a test feature relative to the coupled hypersphere is used to determine whether it is abnormal. FAPM [70] comprises numerous patch-wise and layer-wise memory banks for different positions. During inference, FAPM computes the features in the different memory banks independently, which significantly accelerates inference. N-pad [74] allows for possible marginal misalignment by estimating a per-pixel nominal distribution using neighboring and target pixel features. In addition, anomaly scores are derived from both the Mahalanobis and Euclidean distances between target pixels and the estimated distribution. Similarly, Bae et al. [71] model the cumulative histogram using location information as conditional probabilities and use neighborhood information to establish the normal feature distribution. Furthermore, this work introduces the first refinement approach in the anomaly detection and localization problem, using synthetic anomalous images to improve the anomaly map based on the input image, as well as using neighborhood and location information to estimate the distribution.
By learning the embedding position information and comparing the extracted features with the normal embeddings during inference, Tsai et al. [66] propose a method that improves the network's ability to represent data. It is also based on the concept of self-supervised learning. Zou et al. [67] use contrastive learning to train the backbone network and propose a new data augmentation method called SPD to push the network to differentiate between two images with slight differences. In addition, they demonstrate the representation capability of the backbone network using PatchCore [68].
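The coreset subsampling of a memory bank mentioned for PatchCore [68] can be illustrated with a greedy farthest-point selection. This is a sketch under the assumption that such a greedy selection is an acceptable stand-in for the subsampling step; the function name, the `start` parameter and the use of raw Euclidean distances are hypothetical choices for illustration.

```python
import math


def greedy_coreset(feats, n_select, start=0):
    """Pick indices of features that cover the set well (farthest-point selection).

    Each step adds the feature that is farthest from everything selected so far.
    """
    selected = [start]
    # Distance from every feature to its closest already-selected feature.
    min_dist = [math.dist(f, feats[start]) for f in feats]
    while len(selected) < min(n_select, len(feats)):
        nxt = max(range(len(feats)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(f, feats[nxt])) for d, f in zip(min_dist, feats)]
    return selected


feats = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.0, 0.1], [5.0, 5.0]]
print(greedy_coreset(feats, n_select=3))
```

In the toy example the two pairs of near-duplicate features are each represented by a single member, which is how subsampling reduces the memory bank while preserving its coverage of the normal feature space.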
Reconstruction-based methods primarily self-train encoders and decoders to reconstruct images for anomaly detection, which makes them less reliant on pre-trained models and increases their ability to detect anomalies. However, their image classification capability is poor because they cannot extract high-level semantic features. As shown in Table 6, the loss functions of the various methods are comparable; however, their performance varies due to different reconstruction model paradigms and abnormal sample construction methods.

Method | Loss Function | Pre-trained | Highlights
(1) Autoencoder Model
[75] | L2, SSIM | - | The paper is the first to use SSIM as a loss to reconstruct images and detect anomalies.
[76] | L2, SSIM | - | The paper proposes two AEs and reduces style change during image reconstruction.
UTAD [77] | L1, Adversarial | VGG | The paper uses two-stage reconstruction to generate high-fidelity images to avoid reconstruction errors.
DFR [78] | L2 | VGG | The paper proposes to reconstruct and compare at the feature level to detect anomalies.
ALT [79] | L1, Perceptual, Adversarial | VGG | The paper proposes an adaptive attention-level transition strategy and uses perceptual loss to improve reconstruction quality.
P-Net [80] | L1, Adversarial | - | The paper designs a new architecture for anomaly detection.
[81] | L1, L2 | - | The paper adds skip-connections to the reconstruction network and adds noise during training to improve reconstruction sharpness.
[82] | L2 | VGG | The paper proposes a dense feature fusion module to assist reconstruction.
[83] | L2, Adversarial | - | The paper uses memory to help reconstruct images.
EdgRec [84] | L2, SSIM | - | The paper reconstructs from the gray value edge and preserves the high-frequency information with skip-connections.
PAE [85] | L2, Cross-Entropy | - | The paper gradually increases the resolution of the input image during training.
SMAI [86] | L2, SSIM | - | The paper masks and inpaints images by superpixel.
RIAD [87] | L2, MSGMS, SSIM | - | The paper proposes to inpaint and reconstruct images by patch.
I3AD [88] | L1, Adversarial | - | The paper gradually masks the areas with high anomaly probability and reconstructs them.
[89] | L2 | - | The paper proposes to reconstruct the anomalous area differently from the original image.
[90] | L2, SSIM, GMS | - | Similar to I3AD, but the paper adds skip-connections to the reconstruction network.
DRAEM [91] | L2, SSIM, Focal | - | The paper designs a method to generate abnormal images and uses U-Net [92] to distinguish anomalies after reconstruction.
SGSF [93] | L2, SSIM, Focal | - | The method uses the idea of saliency detection to generate more realistic anomalies than DRAEM.
DSR [94] | L2, Focal | - | The paper generates abnormal samples at the feature level and performs better than DRAEM.
NSA [95] | L2, Cross-Entropy | - | The paper generates abnormal samples by pasting parts of other normal samples, which is the SOTA method without extra data.
SSPCAB [96] | L2 | - | The paper designs a "plug and play" self-supervised block to improve the reconstruction ability of many methods.
SSMCTB [97] | L2 | - | The paper replaces the SE-layer in SSPCAB with a transformer architecture.
[98] | Cross-Entropy | - | The paper guides reconstruction using gradient descent with VAE.
[99] | Attention Disentanglement | - | The paper proposes to use a disentanglement VAE to detect anomalies.
DGM [100] | L2, Log-Likelihood | - | The paper proposes to use non-regularized objective functions for training VAE under heterogeneous datasets.
FAVAE [101] | Log-Likelihood | VGG | The paper uses VAE to model the distribution of features extracted by its pre-trained model.
[102] | L2, Cross-Entropy | - | The paper uses VQ-VAE to construct a discrete latent space and reconstructs images based on the latent space.
(2) GAN Model
SCADN [103] | L2, Adversarial | - | The paper masks part of the image and reconstructs the image with a GAN during training.
AnoSeg [104] | L1, L2, Adversarial | - | The paper generates abnormal samples through a GAN and detects anomalies with the discriminator.
OCR-GAN [105] | L1, L2, Adversarial | - | The paper uses the Frequency Decoupling module to decouple and reconstruct images.
(3) Transformer Model
… | … | VGG | The paper introduces an auto-encoder architecture based on a transformer with HaloNet.
InTra [110] | L2, GMS, SSIM | - | The paper leverages more global information to repair images with a transformer.
MSTUnet [111] | L2, SSIM, Focal | - | The paper uses a Swin transformer to inpaint masked images and detect anomalies.
MeTAL [112] | L1, SSIM | - | The paper uses information from neighboring patches to inpaint images, better accounting for local structural information.
UniAD [4] | L2 | EfficientNet | The paper trains all categories of products in one model.
(4) Diffusion Model
AnoDDPM [113] | L2, Log-Likelihood | - | The paper is the first to apply the diffusion model to industrial image anomaly detection.
[114] | L2, Log-Likelihood | - | The paper significantly speeds up the inference process of anomaly detection using the diffusion model.
from scratch without employing robust pre-trained models, which results in inferior performance compared to image-level feature embedding.

Autoencoder
Autoencoder (AE) is the most prevalent reconstruction network for AD, and many other reconstruction networks also consist of encoder and decoder components. Bergmann et al. [75] investigate the influence of the Structural Similarity Index Measure (SSIM) and L2 losses on AE reconstruction and anomaly segmentation, providing numerous suggestions for future research.
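To make the difference between a per-pixel L2 residual and an SSIM-based residual concrete, the following NumPy sketch computes both for a toy image pair. It is a minimal illustration, not the implementation of [75]: the SSIM uses uniform 7x7 windows instead of Gaussian weighting, the "reconstruction" is hand-made rather than produced by an AE, and the helper name `ssim_map` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=7, data_range=1.0):
    """Local SSIM over win x win windows (uniform weights, valid positions only)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    xw = sliding_window_view(x, (win, win))
    yw = sliding_window_view(y, (win, win))
    mu_x, mu_y = xw.mean(axis=(-2, -1)), yw.mean(axis=(-2, -1))
    var_x, var_y = xw.var(axis=(-2, -1)), yw.var(axis=(-2, -1))
    cov = (xw * yw).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

# Toy pair (assumption): the reconstruction shows the regular texture,
# while the test image has lost its texture inside a 6x6 defect region.
ii, jj = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
reconstruction = 0.5 + 0.4 * np.sin(ii / 2.0) * np.cos(jj / 3.0)
test_image = reconstruction.copy()
test_image[12:18, 12:18] = 0.5

l2_map = (test_image - reconstruction) ** 2                     # per-pixel residual
ssim_anomaly_map = 1.0 - ssim_map(test_image, reconstruction)   # structural residual
```

The L2 map reacts only to per-pixel intensity differences, whereas the SSIM map responds to the change in local structure across every window that overlaps the defect.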
The most fundamental question is how to handle the difference between the reconstructed image and the original image. The two images regularly differ in style, which results in over-detection. Chung et al. [76] present an Outlier-Exposed Style Distillation Network (OE-SDN) that preserves the style translation and suppresses the content translation of the AE in order to avoid over-detection. For the anomaly prediction, Chung et al. replace the difference between the original image and the AE reconstruction with the difference between the OE-SDN reconstruction and the AE reconstruction. Unsupervised Two-stage Anomaly Detection (UTAD) [77] introduces an IE-Net and an Expert-Net to extract and utilize impressions for anomaly-free and high-fidelity reconstructions, thereby making the framework interpretable.
Reconstruction-based methods are nearly as effective as feature embedding methods when they utilize features at different scales. Similar to the teacher-student architecture, the Deep Feature Reconstruction (DFR) [78] method detects anomalies through reconstruction at the feature level. DFR obtains multiple spatial context-aware representations from a pre-trained network. Then, DFR reconstructs the features using a deep yet efficient convolutional AE and detects anomalous regions by comparing the original features with the reconstructed features. Yan et al. [79] propose a novel Multi-Level Image Reconstruction (MLIR) framework that formulates the reconstruction process as an image denoising task at different resolutions. Thus, MLIR accounts for the detection of both global structure anomalies and detail anomalies.
Modifying the structure of the AE can also improve its capacity for reconstruction. Zhou et al. [80] introduce P-Net to compare the difference in structure between the original and reconstructed images. Collin et al. [81] include skip-connections between encoder and decoder to improve the sharpness of the reconstruction. In addition, they propose corrupting the training images with a synthetic noise model to prevent the network from converging to an identity mapping, and they introduce the Stain noise model for this purpose. Tao et al. [82] also operate at the feature level; they employ a dense feature fusion module to obtain a dense feature representation of the double input in order to help reconstruction in the dual-Siamese framework. Hou et al. [83] also use skip-connections to enhance the quality of reconstruction, and they add a memory module to the skip-connections. Liu et al. [84] reconstruct the original RGB image from its gray value edges, with the skip-connections in the model preserving the image's high-frequency information to better guide the reconstruction. Progressive Autoencoder (PAE) [85] improves autoencoder reconstruction performance through progressive learning and modified CutPaste augmentation. During training, PAE achieves progressive learning by gradually increasing the resolution of the input image.
Masking and inpainting is an effective method for self-supervised learning. Li et al. [86] develop the Superpixel Masking And Inpainting (SMAI) technique. SMAI divides the image into multiple superpixel blocks and trains the inpainting module to reconstruct a superpixel within a mask. During inference, SMAI performs masking and inpainting superpixel-by-superpixel on the test image and then compares the reconstructed image with the test image to identify abnormal regions. Iterative Image Inpainting Anomaly Detection (I3AD), proposed by Nakanishi et al. [88], reconstructs partial regions based on the anomaly map. I3AD improves reconstruction quality by inpainting only the masked regions of the image and by masking only regions with a high probability of abnormality. SSM [90] is conceptually similar to I3AD; it adds skip-connections to the reconstruction network and predicts the mask region as the training target. RIAD [87] randomly masks a portion of each training image at the patch level and reconstructs it using a U-Net encoder-decoder network [92]. During inference, RIAD combines multiple random masks and reconstructed patches to generate a reconstructed image, which is then compared with the original image. According to RIAD, Multi-Scale Gradient Magnitude Similarity (MSGMS) outperforms SSIM as an anomaly score.
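The patch-level masking and reassembly used by inpainting methods such as RIAD can be sketched as follows. This is a simplified illustration under stated assumptions: the grid cells are split into disjoint random sets, and the trained inpainting network is replaced by a hypothetical stand-in, `placeholder_inpaint`, that fills hidden pixels with the mean of the visible ones. All function names are illustrative.

```python
import numpy as np

def disjoint_masks(h, w, cell, n_sets, rng):
    """Return n_sets binary masks (1 = hidden) whose hidden cells partition the image grid."""
    gh, gw = h // cell, w // cell
    # Every grid cell is assigned to exactly one of the n_sets groups.
    grid = (rng.permutation(gh * gw) % n_sets).reshape(gh, gw)
    block = np.ones((cell, cell), dtype=np.float32)
    return [np.kron((grid == s).astype(np.float32), block) for s in range(n_sets)]

def placeholder_inpaint(masked_image, mask):
    """Stand-in for a trained inpainting network (assumption, not a real model)."""
    visible_mean = masked_image[mask == 0].mean()
    return np.where(mask == 1, visible_mean, masked_image)

def reconstruct(image, masks, inpaint=placeholder_inpaint):
    """Inpaint each masked copy and keep only the pixels that were hidden in that copy."""
    recon = np.zeros_like(image)
    for m in masks:
        recon += m * inpaint(image * (1 - m), m)
    return recon

rng = np.random.default_rng(0)
image = np.full((32, 32), 0.5)
image[8:12, 8:12] = 1.0          # toy defect aligned with one grid cell

masks = disjoint_masks(32, 32, cell=4, n_sets=3, rng=rng)
reconstruction = reconstruct(image, masks)
residual = np.abs(image - reconstruction)
```

Because every pixel is hidden in exactly one mask, the assembled reconstruction never copies a pixel directly from the input, so a defect cannot simply be passed through to the output.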
DRAEM [91] is representative of reconstruction-based techniques. DRAEM synthesizes abnormal images with the help of external datasets and learns to reconstruct them as normal images, which greatly improves the generalization capacity of the reconstruction network. In addition, DRAEM feeds the original image and the reconstructed image into a segmentation network to predict abnormal regions, significantly enhancing the model's ability to segment anomalous regions. Nevertheless, DRAEM is susceptible to failure when synthesizing near-in-distribution anomalies. Inspired by saliency detection, Xing et al. [93] propose the Saliency Augmentation Module (SAM) to generate more realistic abnormal images than DRAEM and thereby achieve better results. DSR [94] proposes an architecture based on a quantized feature space representation and dual decoders to circumvent the requirement for image-level anomaly generation. By sampling the learned quantized feature space at the feature level, near-in-distribution anomalies are generated in a controlled way. NSA [95] does not use external data for data augmentation and instead adopts more data augmentation methods, allowing it to outperform all previous methods that learn without additional datasets. In contrast to other methods that attempt to reconstruct abnormal images into normal images, Bauer [89] proposes reconstructing the abnormal areas of the image so that they deviate from the appearance of the original image. This approach produces results comparable to those of other methods.
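The kind of anomaly synthesis that DRAEM relies on can be sketched as a mask-guided blend of a normal image with an external texture. The sketch below makes several assumptions: the blob-shaped mask comes from thresholded coarse random noise instead of Perlin noise, the "external texture" is a random array rather than an image from an external dataset, and the blending factor range of [0.1, 1.0] is an assumed setting. The function names are hypothetical.

```python
import numpy as np

def blob_mask(h, w, cell, threshold, rng):
    """Binary anomaly mask from thresholded coarse noise (stand-in for Perlin noise)."""
    coarse = rng.random((h // cell, w // cell))
    return np.kron((coarse > threshold).astype(np.float64), np.ones((cell, cell)))

def synthesize_anomaly(image, texture, mask, beta):
    """Keep the image outside the mask; blend image and texture inside the mask."""
    return (1 - mask) * image + mask * ((1 - beta) * image + beta * texture)

rng = np.random.default_rng(0)
image = np.full((32, 32), 0.5)        # toy normal image
texture = rng.random((32, 32))        # stand-in for an external texture source
mask = blob_mask(32, 32, cell=4, threshold=0.8, rng=rng)
beta = rng.uniform(0.1, 1.0)          # blending opacity (assumed range)

augmented = synthesize_anomaly(image, texture, mask, beta)
# (augmented, mask) can serve as an input/label pair for a segmentation network,
# while (augmented, image) can serve as an input/target pair for reconstruction.
```

Since the mask is known by construction, every synthesized image comes with a pixel-level label at no annotation cost, which is what allows a segmentation network to be trained without real defects.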
In contrast to classical reconstruction-based methods, Ristea et al. [96] propose integrating reconstruction-based functionality into a Self-Supervised Predictive Convolutional Attentive Block (SSPCAB). SSPCAB can be incorporated into models such as DRAEM and CutPaste to enhance them. The Self-Supervised Masked Convolutional Transformer Block (SSMCTB) [97] replaces the SE-layer [115] in SSPCAB with a channel-wise transformer block and achieves superior results.
VAE is a variant of AE, with the difference that the latent variables of a VAE are sampled from a normal distribution. Naturally, VAE has superior interpretability. Dehaene et al. [98] iteratively guide reconstruction using gradient descent with an energy defined by the reconstruction loss, thereby overcoming the tendency of VAE to produce blurry reconstructions and preserving the normal high-frequency structure. Liu et al. [99] train the variational autoencoder with an attention disentanglement loss. Anomalous inputs to this VAE result in latent variables that deviate from the Gaussian during gradient backpropagation and attention generation, and this deviation can be used to locate anomalies. According to Matsubara et al. [100], datasets are commonly heterogeneous rather than regularized, and non-regularized objective functions are more suitable for training VAE models on heterogeneous datasets. FAVAE [101] employs VAE to model the distribution of features extracted by the pre-trained model, implicitly simulating richer anomalies and enhancing the model's generalization. Wang et al. [102] use VQ-VAE to create a discrete latent space, resample the discrete latent codes that deviate from the normal distribution, and reconstruct the image using the resampled latent codes. As a result, VQ-VAE reconstructs images that are closer to the normal images of the training set.

Generative Adversarial Networks
Reconstruction models based on Generative Adversarial Networks (GANs) are less stable than AEs, but the discriminator network works better in some scenarios, as described below.
During training, the Semantic Context based Anomaly Detection Network (SCADN) [103] masks a portion of the image and reconstructs it with a GAN. During inference, SCADN detects anomalies by comparing the input image with the reconstructed image. In addition to masking images, AnoSeg [104] utilizes hard augmentation, adversarial learning, and channel concatenation to generate abnormal samples. AnoSeg then trains a GAN to generate normal samples. AnoSeg differs from the AE reconstruction model in that its objective function incorporates both a reconstruction loss and an adversarial loss. OCR-GAN [105] utilizes the Frequency Decoupling (FD) module to decouple the image into combinations of information at different frequencies, and then reconstructs and combines the information at these frequencies to yield the reconstructed image. During inference, the model can identify a statistically significant difference between the frequency distributions of normal and abnormal images.

Transformer
Transformer has a higher capacity to represent global information, which gives it the potential to surpass AE and become a new reconstruction network foundation for anomaly detection. Mishra et al. [106] propose a transformer-based framework to reconstruct images at the patch level and employ a Gaussian mixture density network to localize anomalous regions. You et al. [107] propose ADTR for reconstructing pre-trained features. According to them, the use of transformers prevents anomalies from being reconstructed well, which makes it easy to identify anomalies when reconstruction fails. Lee et al. [108] introduce a vision transformer-based encoder-decoder model (AnoViT) and assert that AnoViT is superior to the CNN-based L2-CAE for anomaly detection. HaloAE [109] incorporates a transformer into HaloNet [116] and facilitates image reconstruction by reconstructing features, achieving competitive results on the MVTec AD dataset. A common self-supervised learning method for reconstruction-based anomaly detection is the reconstruction of masked images. However, traditional CNNs find it difficult to extract global context information. To address this, Pirnay et al. [110] propose the Inpainting Transformer (InTra), which integrates information from larger regions of the input image. InTra is representative of trained-from-scratch methods. Masked Swin Transformer Unet (MSTUnet) [111] is comparable to InTra, but MSTUnet employs additional enhancements [117] when simulating anomalies, thereby achieving superior results. De et al. [112] use the neighboring patches to reconstruct the masked patch and also achieve a powerful reconstruction ability.

Diffusion Model
The diffusion model [118] is a recently popular generative model that can also be utilized for reconstruction-based anomaly detection. AnoDDPM [113] is, to the best of our knowledge, the first to apply the diffusion model to industrial image anomaly detection. In comparison with GAN-based methods, AnoDDPM with simplex noise can also capture large anomaly regions without the need for large datasets. When applying the diffusion model to anomaly detection, Teng et al. [114] make two main improvements. First, as a replacement for the reconstruction loss, a time-dependent gradient value of the normal data distribution is used to measure defects. Second, they develop a novel T-scales method to reduce the required number of iterations and accelerate the inference process.
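As a rough illustration of the building block that diffusion-based reconstruction starts from, the sketch below implements the standard forward noising step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. It rests on several assumptions: a linear noise schedule with commonly used default values, Gaussian noise instead of the simplex noise used by AnoDDPM, and an image noised only up to an intermediate step. The learned denoiser that would map the noised image back to a normal-looking one is omitted, and the function names are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from the forward process in closed form, given the clean image x0."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.random((16, 16))             # toy test image

# Noise only up to an intermediate step so that coarse image content survives;
# a denoiser trained on normal data (not shown) would then reconstruct the image,
# and the difference to x0 would act as the anomaly map.
x_partial = q_sample(x0, t=250, alpha_bar=alpha_bar, rng=rng)
```

The choice of the intermediate step trades off how much of the defect is destroyed by noise against how much of the normal content must be regenerated.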

Supervised Anomaly Detection
Although abnormal data is diverse and difficult to collect, it is still possible to collect abnormal samples in real-world scenarios. Therefore, some research focuses on how to train anomaly detection models using a small number of abnormal samples and a large number of normal samples. Chu et al. [119] propose a semi-supervised framework for detecting anomalies in the presence of significant data imbalance. They assume that changes in loss values during training can be used as features to identify abnormal data. To achieve this, they train a reinforcement learning-based neural batch sampler to amplify the difference in loss curves between anomalous and non-anomalous regions. FCDD [34] is an unsupervised method that synthesizes abnormal samples for training the OCC model, and this concept is transferable to other OCC methods. Venkataramanan et al. [41] propose a Convolutional Adversarial Variational Autoencoder with Guided Attention (CAVGA) that can be applied equally to cases with and without abnormal images. In the unsupervised setting, CAVGA is guided by an attention expansion loss to focus on all normal regions of an image. In the weakly supervised setting, CAVGA uses a complementary guided attention loss to minimize the attention map corresponding to abnormal regions of the image while focusing on normal regions. Božič et al. [120] examine the influence of image-level supervision, mixed supervision, and pixel-level supervision on surface defect detection tasks within the same deep learning framework. They find that a small number of pixel-level annotations can help the model achieve performance comparable to full supervision. DevNet [121] uses a small number of abnormal samples to realize fine-grained end-to-end differentiable learning. Wan et al. [122] propose a Logit Inducing Loss (LIS) for training with imbalanced data distribution and an Abnormality Capturing Module (ACM) for characterizing anomalous features in order to effectively utilize a small amount of anomalous information. DRA [123] proposes a framework for learning disentangled representations of seen, pseudo, and latent residual anomalies in order to detect both seen and unseen anomalies.
In addition, a number of studies do not account for the unbalanced distribution of normal and abnormal samples and rely primarily on abnormal samples for supervised training. Sindagi et al. [124] investigate the domain transfer problem of datasets for anomaly detection in various settings. Qiu et al. [125] propose Dual Weighted PCA (DWPCA), an algorithm for image registration and surface defect detection. Bhattacharya et al. [126] propose an interleaved Deep Artifacts-aware Attention Mechanism (iDAAM) to classify multi-object and multi-class defects in abnormal images. Zeng et al. [127] view anomaly detection as a subset of object detection and design a Reference-based Defect Detection Network (RDDN) to detect anomalies using a template reference and a context reference. Song et al. [128] regard the abnormal part as the salient area of the image and propose an effective saliency propagation algorithm for anomaly detection. Long et al. [129] investigate defect detection in a tactile image, which has obvious benefits for fabric structure defect detection in RGB images. In addition, there are methods that borrow the concept of semantic segmentation. To detect defects in infrared thermal volumetric data, Hu et al. [130] propose a hybrid multi-dimensional space and temporal segmentation model. Ferguson et al. [131] use a Mask Region-based CNN architecture to detect and segment defects in X-ray images simultaneously. There are also numerous fully supervised anomaly detection models adapted from object detection and semantic segmentation models for natural images [132][133][134], as well as many weakly supervised object detection methods suitable for anomaly detection [135][136][137]. We do not discuss them one by one here.

Industrial Manufacturing Setting
This section introduces the classification standards or application settings that are more appropriate for industrial scenes, namely few-shot anomaly detection, noisy anomaly detection, anomaly synthesis, and 3D anomaly detection.

Few-Shot Anomaly Detection
Few-shot learning matters for data collection and data labeling, and therefore has a great influence on real-world applications. On the one hand, studying few-shot learning can reduce the cost of data collection and annotation for industrial products. On the other hand, it lets us approach the problem from the perspective of data and investigate which kind of data is most valuable for industrial image anomaly detection. Few-Shot Anomaly Detection (FSAD) [138,139] is still in its infancy. There are two settings in FSAD. The first setting is meta-learning [140], which requires a large number of images as a meta-training dataset. Wu et al. [138] propose a novel architecture, called MetaFormer, that employs meta-learned parameters to achieve high model adaptation capability and instance-aware attention to localize abnormal regions. RegAD [140] trains a model for detecting category-agnostic anomalies. In the test phase, anomalies are identified by comparing the registered features of the test image with those of its corresponding normal images. The second setting relies on vanilla few-shot image learning. PatchCore [68], SPADE [63], and PaDiM [73] conduct ablation studies on 16 normal training samples. None of them, however, is specialized for few-shot anomaly detection. Hence, it is necessary to develop new algorithms that concentrate on native few-shot anomaly detection tasks.
Recently, researchers have gone beyond the FSAD setting to the Zero-Shot Anomaly Detection (ZSAD) setting. The goal of ZSAD is to leverage the generalization power of large models to solve anomaly detection problems without any training, thus completely eliminating the cost of data collection and annotation. MAEDAY [141] uses a pre-trained Masked Autoencoder (MAE) [142] to tackle the problem. MAEDAY randomly masks parts of an image and restores them using MAE. If a reconstructed region differs from the region before masking, the region is considered anomalous. WinCLIP [143] utilizes another large model, CLIP [144], for ZSAD. WinCLIP uses the image encoder of CLIP to extract image features. Given textual descriptions such as "a photo of a damaged object", WinCLIP uses the text encoder of CLIP to extract the features of these descriptions and then calculates the similarity between the text features and the image features. If the similarity is high, the image is classified as "a photo of a damaged object"; otherwise, the image is considered normal. MAEDAY and WinCLIP demonstrate that ZSAD is a promising research direction.
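The text-image similarity test described for WinCLIP can be sketched with plain vectors. The sketch is only an illustration under stated assumptions: the embeddings are random vectors standing in for the outputs of CLIP's image and text encoders, a single "normal" prompt is compared against a single "damaged" prompt, and the softmax temperature of 0.07 is an assumed value. The function name `zero_shot_anomaly_score` is hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_anomaly_score(image_emb, normal_text_emb, damaged_text_emb, temperature=0.07):
    """Probability that the image matches the 'damaged' prompt rather than the 'normal' one."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(np.stack([normal_text_emb, damaged_text_emb]))
    logits = txt @ img / temperature
    logits = logits - logits.max()            # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[1]

# Stand-in embeddings (assumption): real ones would come from pre-trained encoders.
rng = np.random.default_rng(0)
normal_text = rng.standard_normal(64)         # e.g. "a photo of a flawless object"
damaged_text = rng.standard_normal(64)        # e.g. "a photo of a damaged object"
damaged_image = damaged_text + 0.1 * rng.standard_normal(64)
normal_image = normal_text + 0.1 * rng.standard_normal(64)

score_damaged = zero_shot_anomaly_score(damaged_image, normal_text, damaged_text)
score_normal = zero_shot_anomaly_score(normal_image, normal_text, damaged_text)
```

Comparing against a "normal" prompt as well as a "damaged" one turns the raw similarity into a relative score, so no dataset-specific threshold on the similarity itself is needed.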

Noisy Anomaly Detection
Noisy learning is a classical problem in anomaly detection. By studying anomaly detection under noisy learning, we can avoid the performance loss caused by labeling errors and reduce false detections. Tan et al. [145] employ a novel trust region memory update scheme to keep noisy feature points away from the memory bank. Yoon et al. [146] use a data refinement approach to improve the robustness of the one-class classification model. Qiu et al. [147] propose a strategy for training an anomaly detector in the presence of unlabeled anomalies, which is compatible with a broad class of models. They create labeled anomalies synthetically and jointly optimize the loss function with normal data and synthesized abnormal data. Chen et al. [148] introduce an interpolated Gaussian descriptor that learns a one-class Gaussian anomaly classifier trained with adversarially interpolated training samples. However, the majority of the aforementioned approaches have not been verified on real industrial image datasets. In other words, the existing noisy anomaly detection methods may not be suitable for industrial manufacturing.

3D Anomaly Detection
3D anomaly detection can utilize more spatial information, thereby capturing information that RGB images cannot contain. In some special lighting environments, or for anomalies that are not sensitive to color information, 3D anomaly detection has significant advantages. This research direction is currently receiving significant attention in academia. Since the release of the MVTec 3D-AD [6] dataset, several papers have focused on anomaly detection in 3D industrial images. Bergmann [149] introduces a teacher-student model for 3D anomaly detection. The teacher network is trained to acquire general local geometric descriptors by reconstructing local receptive fields, while the student network is trained to match the local 3D descriptors of the pre-trained teacher network. Horwitz et al. [150] propose BTF, a method that combines hand-crafted 3D representations (FPFH [151]) with a 2D feature representation method (PatchCore [68]). Reiss et al. [152] argue that the representational ability of self-supervised learning is currently inferior to that of hand-crafted features for 3D anomaly detection. Nevertheless, self-supervised representation still has great potential if large-scale 3D anomaly detection datasets become available. AST [15] employs RGB images with depth information to enhance anomaly detection performance. However, most 3D IAD methods are specialized for RGB-D images, while 3D data in real-world industrial manufacturing consists of point clouds, meaning that current 3D IAD methods cannot be directly deployed in industrial manufacturing. Thus, there are still opportunities for advancing 3D IAD.

Anomaly Synthesis
By artificially synthesizing anomalies, we can improve the performance of models with limited data. This research is complementary to few-shot research. Few-shot learning studies how to improve the model when the data is fixed, whereas anomaly synthesis studies how to artificially increase the amount of credible data to improve performance when the model is fixed. Both can reduce the cost of data collection and labeling. Many unsupervised anomaly detection works use data augmentation to synthesize anomalous images and significantly improve model performance. For example, CutPaste [30], DRAEM [91], and MemSeg [17] are representative methods.
In addition, some supervised methods use limited abnormal samples to synthesize more abnormal samples for training. Liu et al. [153] propose a model designed to generate defects on defect-free fabric images for training semantic segmentation. Rippel et al. [154] use CycleGAN [155], with a ResNet/U-Net generator as the basic architecture, to transfer defects from one fabric to another. By improving the style transfer network, SDGAN [156] achieves better results than CycleGAN. Wei et al. [157] propose a model named DST to simulate defect samples. First, DST generates a blank mask area on a non-defective image. Then, DST uses the masked histogram matching module to make the color of the blank mask area consistent with the overall color of the image. Finally, DST uses U-Net to perform style transfer to make the generated image more realistic. Wei et al. [158] propose a model named DSS, which uses a conventional GAN to reconstruct defect structures in designated regions of defect-free samples and then uses DST for style transfer to blend the simulated defects into the background. Jain et al. [159] use DCGAN, ACGAN, and InfoGAN to generate defect images by adding noise, which improves classification accuracy. Wang et al. [160] propose DTGAN based on StarGAN v2, which adds foreground-background decoupling, achieves a certain degree of style control, and uses the Frechet inception distance (FID [161]) and the kernel inception distance (KID [162]) to evaluate the quality of image generation. DefectGAN [163] also assumes that defects and normal backgrounds can be layered, with defects as the foreground. DefectGAN generates defect foregrounds and their spatial distribution in the form of style transfer. Although there is a considerable amount of research in this field, it has not yet settled into well-established directions as other fields have, so there is still significant potential for further development.

Datasets and Metrics
Datasets. Data is a crucial driving factor for machine learning, particularly for deep learning. The difficulty of obtaining industrial images is the main obstacle to the advancement of image anomaly detection in industrial vision. Table 7 shows that the number and the size of IAD datasets are gradually increasing, but most of them are not collected from a real production line. A promising alternative is to fully utilize industrial simulators to generate anomalous images, which may reduce the gap between academic research and the demands of industrial manufacturing.
Metrics. Table 8 offers a comprehensive review of the metrics in industrial image anomaly detection. The first column gives the name of the metric and the second column gives the level: if the level is up, a larger metric value indicates better performance; if the level is down, a lower metric value indicates better performance. The third column gives the details of each metric, especially how the metric reflects the performance of image anomaly detection. From Table 8, it can be observed that most of the novel metrics are variants of natural image segmentation and detection metrics, such as F1 score, AU-ROC, or AU-PR. However, these metrics may not faithfully reflect IAD performance, because tiny anomalous regions need to be weighted more heavily than anomaly-free regions. Hence, the validity of these metrics for IAD remains to be explored.
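The sensitivity of pixel-level metrics to tiny anomalies can be illustrated with a small NumPy sketch that computes AU-ROC and AU-PR (as average precision) on heavily imbalanced toy data. The data are an assumption: 100 anomalous pixels among 100,000, with scores drawn from two overlapping Gaussians. The helper names `au_roc` and `au_pr` are hypothetical, and ties between scores are not handled.

```python
import numpy as np

def au_roc(labels, scores):
    """AU-ROC via the rank-sum (Mann-Whitney U) statistic; assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def au_pr(labels, scores):
    """AU-PR as average precision: mean precision at the rank of each positive."""
    order = np.argsort(-scores)
    sorted_labels = labels[order]
    precision = np.cumsum(sorted_labels) / np.arange(1, len(labels) + 1)
    return precision[sorted_labels == 1].mean()

# Toy pixel scores (assumption): anomalies are rare but score higher on average.
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(99_900, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 99_900), rng.normal(2.5, 1.0, 100)])

roc = au_roc(labels, scores)
pr = au_pr(labels, scores)
```

On such data, AU-ROC stays high because the many normal pixels dominate the false-positive rate, while AU-PR drops noticeably, which is why the two metrics can rank methods differently for small defects.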

Total Performance Analysis
Table 9 and Table 10 show the statistical results of current IAD performance on MVTec AD. Fig. 7 supports the results of Table 9: even if different methods have similar performance in image classification, there are still significant differences in pixel-level segmentation. We analyze the performance of current IAD methods in depth and draw the following insights:
• Ensemble learning can dramatically improve the performance of state-of-the-art anomaly detection methods.
• SSPCAB [96] can be seamlessly integrated into cutting-edge methods and significantly enhances the performance of reconstruction-based methods.
• The gap between few-shot IAD and vanilla IAD is narrowing. In other words, we may utilize data distillation algorithms to reduce the size of the dataset used for anomaly detection.
• Without using ensemble learning, MemSeg [17] achieves the SOTA result on image-level anomaly classification, which is mainly due to the use of the U-Net [92] framework. DRAEM [91] also uses U-Net to outperform other methods on pixel-level anomaly segmentation. The effectiveness of MemSeg and DRAEM demonstrates the superiority of the segmentation module in anomaly detection. Artificial supervision is usually inferior to real supervision, and segmentation models trained with artificial supervision often perform worse. However, even when using artificially generated anomalies as supervisory information, methods with segmentation modules still outperform methods without segmentation modules on classification and segmentation tasks. We can conclude that the segmentation module is beneficial for anomaly detection tasks.
• AU-PR is more valuable than AU-ROC for segmentation tasks [67]. As shown in Table 10, reconstruction-based methods outperform other methods on the pixel AU-PR metric. In Fig. 7, the detection result of DRAEM is closest to the ground truth, with sharper edges and fewer false detection regions. We can infer from the statistical data and visualizations that reconstruction-based methods are more suitable for segmentation tasks.

Future Directions
We outline several intriguing future directions as follows:
• A multi-modal IAD dataset should be built. On actual assembly lines, RGB images alone are insufficient to detect anomalies. Hence, we may employ information from additional modalities, such as X-ray and ultrasound, to enhance anomaly detection performance.
• Given that test samples are streamed sequentially on the production line, most IAD methods are incapable of making instantaneous predictions upon the arrival of a new test sample. In industrial manufacturing, the inference speed of IAD should be addressed in addition to its accuracy. Adopting multi-objective evolutionary neural architecture search algorithms to find the architecture with the best trade-off is thus a promising approach.
• The majority of IAD methods use ImageNet pre-trained models to extract features from industrial images, which inevitably results in the feature drift issue. Consequently, there is a pressing need to construct a pre-trained model for industrial images.
• Most anomaly detection methods focus on the unsupervised setting. Although this setting can reduce the cost of data labeling, it greatly curbs the development of segmentation-based methods. Unsupervised and supervised methods should complement each other, and the main reason for the slow development of supervised methods in recent years is the lack of large labeled datasets. Therefore, it is necessary to propose a fully supervised anomaly detection dataset with pixel-level annotations in the future.
• Previously, we focused on developing data augmentation methods for normal images. However, we have not made much effort on synthesizing abnormal samples via data augmentation. In industrial manufacturing, it is very difficult to collect a large number of abnormal samples, since most production lines are faultless. Hence, more attention should be paid in the future to anomaly synthesis methods, like CutPaste [30], DRAEM [91] and MemSeg [17].
• Current anomaly detection algorithms often focus on detection accuracy while ignoring the storage size and efficiency of the models. This leads to high computation costs and limits the application of anomaly detection at the production end of enterprises. Therefore, it is necessary to design lightweight but efficient anomaly detection models.
• Currently, image anomaly detection algorithms can be mainly categorized into two tasks: industrial image anomaly detection and medical image anomaly detection. Although medical images have more modalities than industrial images [185][186][187], the two tasks share many similarities in terms of data and experimental settings. However, few studies have explored how to unify these two tasks. One reason is the domain difference between medical and industrial image datasets, and another is the lack of a good baseline and benchmark for comparison. It would be very meaningful to establish a unified framework for both industrial and medical image anomaly detection at the data or method level.
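To make the idea of anomaly synthesis concrete, the following is a minimal sketch in the spirit of cut-and-paste augmentation: a rectangular patch is copied from one location of a normal image and pasted at another, and the pasted region is recorded as a pixel-level pseudo-label. It does not reproduce CutPaste [30], DRAEM [91] or MemSeg [17] in detail. The function name cut_paste, the area and aspect-ratio ranges, and the returned mask are all assumptions made here for illustration.

```python
import numpy as np

def cut_paste(image, rng, area_ratio=(0.02, 0.15), aspect_ratio=(0.3, 3.3)):
    """Copy a random patch of `image` to a random location.

    Returns the augmented image and a binary mask of the pasted region.
    The default ranges are illustrative assumptions, not values from a paper.
    """
    h, w = image.shape[:2]
    # Sample the patch area (as a fraction of the image) and its height/width ratio.
    area = rng.uniform(*area_ratio) * h * w
    aspect = rng.uniform(*aspect_ratio)
    patch_h = int(np.clip(np.sqrt(area * aspect), 1, h))
    patch_w = int(np.clip(np.sqrt(area / aspect), 1, w))

    # Independent source and destination corners, both fully inside the image.
    src_y = rng.integers(0, h - patch_h + 1)
    src_x = rng.integers(0, w - patch_w + 1)
    dst_y = rng.integers(0, h - patch_h + 1)
    dst_x = rng.integers(0, w - patch_w + 1)

    augmented = image.copy()
    augmented[dst_y:dst_y + patch_h, dst_x:dst_x + patch_w] = \
        image[src_y:src_y + patch_h, src_x:src_x + patch_w]

    mask = np.zeros((h, w), dtype=np.uint8)
    mask[dst_y:dst_y + patch_h, dst_x:dst_x + patch_w] = 1
    return augmented, mask

# Usage on a random stand-in for a normal training image (H x W x 3).
rng = np.random.default_rng(0)
normal_image = rng.random((64, 64, 3))
synthetic_anomaly, pseudo_mask = cut_paste(normal_image, rng)
print(synthetic_anomaly.shape, int(pseudo_mask.sum()))
```

In this sketch the source and destination are sampled independently, so the patch may occasionally land on or near its original position. A practical pipeline would likely add rotation, color jitter or external textures, which are omitted here for brevity.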

Conclusions
In this paper, we provide a literature review of image anomaly detection in industrial manufacturing, focusing on the level of supervision, the design of neural network architectures, the types and properties of datasets, and the evaluation metrics. In particular, we characterize a promising setting drawn from industrial manufacturing and review current IAD algorithms under our proposed setting. In addition, we investigate in depth which network architecture designs can considerably improve anomaly detection performance. Finally, we highlight several exciting future research directions for image anomaly detection.

Fig. 7: Visualization of results from representative methods. Note that the visualization results are from the open-source code reproduction.

Table 2: A summary of teacher-student methods regarding loss function, pre-trained model, and highlights.

Table 3: A summary of one-class classification methods regarding loss function, pre-trained model, and highlights.

Table 4: A summary of distribution-map based methods regarding loss function, pre-trained model, and highlights.

Table 5: A summary of memory bank based methods regarding loss function, pre-trained model, and highlights.

Table 6: A summary of reconstruction based methods.

Table 7: Comparison of datasets for anomaly detection.

Table 8: A summary of metrics used for anomaly detection.

Table 9: Image AUROC performance of different methods on MVTec AD. The highest and second-highest results are marked in red and blue, respectively. All results are reported from the original papers.
• For image-level anomaly detection tasks, memory bank-based approaches are the most effective neural network design. However, they are inadequate at detecting pixel-level anomalies.

Table 10: Pixel AUROC and AUPR performance of different methods on MVTec AD. The highest and second-highest results are marked in red and blue, respectively. Note that * refers to results reproduced by us, while other results are reported from the original papers.