1 Introduction

Anomaly detection (AD) tries to identify instances in data that deviate from a previously defined (or learned) concept of “normality” [1, 2]. In this context, identified deviations are referred to as “anomalies”, and labeled as “anomalous”, whereas data points that conform to the concept of normality are considered “normal”. In the field of computer vision, AD tries to identify anomalous images, and one of its most promising application domains is the automated visual inspection (AVI) of manufactured goods [3, 4, 5, 6]. The reason for this is the close match between the properties inherent to the AD problem and the constraints imposed by the manufacturing industry on any AVI system (AVIS):

  1. Anomalies are rare events [1, 2]. As a consequence, AD algorithms generally focus on finding a description of the normal state, and require few to no anomalies during training. Viewing defective goods as anomalies and the expected product as the normal state, this matches the limited availability of defective goods when setting up AVISs. Manually collecting and labeling defective goods for training supervised deep learning (DL) methods has furthermore been identified as one of the main cost factors for DL-based AVISs [7], and has to be minimized to achieve economic feasibility.

  2. The anomaly distribution is ill-defined [1, 2]. This matches the constraint that all possible defect types an AVIS may encounter during deployment are often unknown during training [4]. Still, AVISs are expected to reliably detect such unknown defect types as well.

These two constraints imposed by AD problems in general, and the manufacturing industry in particular, already severely limit the feasibility of supervised, DL-based AVISs. Additionally, two further requirements are imposed by the manufacturing industry on AVISs:

  1. AVI methods should not be compute-intensive during training, so as to minimize the lead times of product changes. Such product changes are furthermore expected to become more frequent due to the general decrease in lot sizes inherent to Industry 4.0 [8].

  2. AVI methods need to run in real-time on limited hardware [9].

While the recent success of DL-based AD algorithms has renewed the general interest in AD [2], these two additional constraints have, combined with the release of public datasets [10, 11, 4, 5], led to the development of AD algorithms that are geared specifically towards AVI [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 5]. The emergence of such algorithms, however, calls for their systematic review in order to consolidate findings and to thereby facilitate additional research. To the best of our knowledge, none of the recent reviews focus on AD for AVI, and they often discuss the broader AD field with a focus on natural images instead [2, 23, 24]. In our work, we fill this gap, and review recent advances in AD for AVI. To this end, we first provide brief formal definitions of both AD and anomaly segmentation (AS), and afterwards summarize public datasets. Next, we systematically categorize algorithms developed for 2D AVI setups, and give their main advantages as well as disadvantages. Last, we outline open research questions, and propose potential ways of addressing them.

2 A Brief Overview of AD/AS

2.1 Formal Definitions

As outlined above, AD is tasked with deciding whether a given test datum is normal or anomalous. More formally, AD is tasked with finding a function Φ : 𝒳 → y that maps from the input space 𝒳 to an anomaly score y, where y should be much lower for a normal test datum than for an anomalous one. In the context of AVI, 𝒳 typically consists of 2D images \(\vec{x}\in\mathbb{R}^{C\times H\times W}\). Here, H and W specify the height and width of the image \(\vec{x}\), and C corresponds to the number of color channels present in the image. For RGB images, C = 3, whereas C = 1 for grayscale images and C > 3 for multi/hyperspectral images. For a more comprehensive definition of AD, see [2].

In addition to AD, AVI is also concerned with AS, i.e. with localizing the anomaly inside \(\vec{x}\). AS is thus tasked with finding a function \(\Phi\colon\mathcal{X}\to\vec{y}\) that produces an anomaly map \(\vec{y}\in\mathbb{R}^{H\times W}\) instead of a scalar anomaly score y. By aggregating \(\vec{y}\) appropriately, e.g. by taking its maximum, an image-level anomaly score y can subsequently be derived for AS algorithms.
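The relationship between AS and AD can be sketched as follows; the max-aggregation and the toy `segment` function are illustrative assumptions for this sketch, not methods from the reviewed literature:

```python
import numpy as np

def segment(x):
    """Hypothetical AS model Phi: maps a C x H x W image to an H x W anomaly map.
    Stand-in implementation: per-pixel deviation from the image's mean intensity."""
    gray = x.mean(axis=0)              # collapse channels: (C, H, W) -> (H, W)
    return np.abs(gray - gray.mean())  # deviation from the mean as anomaly map

def image_score(anomaly_map):
    """Derive an image-level anomaly score y from the map via max-aggregation."""
    return float(anomaly_map.max())

rng = np.random.default_rng(0)
x = rng.normal(0.5, 0.01, size=(3, 32, 32))  # "normal" image: uniform texture
x_anom = x.copy()
x_anom[:, 10:14, 10:14] = 1.0                # inject a bright defect patch

y_normal = image_score(segment(x))
y_anom = image_score(segment(x_anom))
assert y_anom > y_normal                     # the anomalous image scores higher
```

Any aggregation that preserves the ordering of anomalous over normal images (max, mean of the top-k pixels, etc.) can be substituted for the max here.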

2.2 Types of Algorithms

There exist three types of AD/AS algorithms, which differ in their requirements w.r.t. the training data (see Fig. 1.1, [1]):

  1. Supervised algorithms. Supervised algorithms treat the AD/AS problems as imbalanced, binary classification/segmentation problems. As such, they require a fully labeled dataset that contains both normal and anomalous images for training. Sampling the anomaly distribution furthermore induces a significant bias [25], also in AVI [12].

  2. Semi-supervised algorithms. As opposed to supervised algorithms, semi-supervised algorithms require only a dataset of labeled normal images for training. Semi-supervised approaches commonly make use of the concentration assumption [2], i.e. the assumption that the normal data distribution can be bounded inside a given feature space. Examples are neighborhood/prototype-based [15, 26] or density-based approaches [12, 14, 16]. Other approaches such as autoencoders (AEs) [27] use the concentration assumption more indirectly: They train models that are well-behaved only on the manifold of the normal data distribution in the input domain, i.e. the raw images. Thereby, they exploit the observation that DL models fail to generalize to samples that differ from the training dataset [28]. The majority of proposed AD/AS approaches are semi-supervised.

  3. Unsupervised algorithms. As opposed to supervised and semi-supervised approaches, unsupervised approaches can work with unlabeled data [29, 30]. To do so, they combine the concentration assumption with the two core assumptions made about anomalies: (I) that anomalies are rare events, and (II) that their distribution is ill-defined (still, a uniform distribution is often assumed).

Fig. 1.1 The three types of AD/AS algorithms. While supervised approaches require both labeled anomalies and normal data for training, semi-supervised approaches use normal data only. Unsupervised approaches work on unlabeled data, and make assumptions about the normal and anomaly distributions.

We note that the terms semi-supervised and unsupervised, as defined in this review based on [1], are not used consistently throughout the literature. For example, unsupervised is often misused to refer to the semi-supervised setting in AVI [4, 5, 6]. Moreover, the term semi-supervised has also been used to refer to a partially labeled dataset, where labeled anomalies may be present and used for training [31]. We believe that a consistent use of terminology that conforms with its historical definition [1] would facilitate a more intuitive understanding of research in AD for AVI.

2.3 Anomaly Types

In previous literature, three anomaly types are distinguished [2]:

  1. Point anomalies, which are instances that are anomalous on their own, e.g. a scratch.

  2. Contextual anomalies, which are instances that are anomalous only in a specific context. A scratch, for example, might only be considered an anomaly if it lies on a cosmetic surface or otherwise impairs the product’s function.

  3. Group anomalies, which are several data points that, as a whole, form an anomaly. Group anomalies may also be contextual, and are rare in AVI.

Complementary to these categories, anomalies have recently been partitioned based on the degree of semantic understanding required to detect them [2]. In particular, anomalies are divided into low-level, textural anomalies and high-level, semantic anomalies (see Fig. 1.2 for an example). Detecting semantic anomalies is generally more difficult than detecting textural anomalies [2, 5], as learning semantically meaningful feature representations is inherently more difficult. Convolutional neural networks (CNNs), for example, exhibit a significant texture bias [32, 33]. Furthermore, we note that synonymous nomenclature was recently introduced in AD for AVI [5], where textural anomalies correspond to structural anomalies, and semantic anomalies correspond to logical anomalies. We again stress the importance of using terminology consistently, and stick to the terms textural/semantic anomalies as they are more intuitively understood.

Fig. 1.2 Textural vs. semantic anomalies. (a) shows a textural anomaly, whereas (b) shows a semantic anomaly. Images are taken from MVTec AD [4].

2.4 Evaluating AD/AS Performance

Since AD and AS can be viewed as binary problems, corresponding evaluation measures are commonly used to evaluate their performance. Specifically, the area under the receiver operating characteristic (ROC) curve (AUROC) and the area under the precision-recall (PR) curve (AUPR) are employed. It should be noted that the AUPR is better suited for evaluating imbalanced problems such as AS [34]. Recently, the per-region-overlap (PRO) curve was proposed specifically for AS in AVI [13]. As opposed to the pixel-wise AUROC/AUPR, the area under the PRO curve measures an algorithm’s ability to detect all individual anomalies present in an image. Since the PRO curve itself does not take false positive (FP) predictions into account, it is constructed only up to a preset false positive rate (FPR); the most commonly used cut-off value is an FPR of 30%.
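As a minimal illustration, the AUROC can be computed directly from image-level anomaly scores via its equivalence to the Mann-Whitney U statistic, i.e. the probability that a randomly drawn anomaly scores higher than a randomly drawn normal sample. This is a sketch for intuition; library implementations such as scikit-learn’s `roc_auc_score` handle large inputs and edge cases more carefully:

```python
import numpy as np

def auroc(scores_normal, scores_anom):
    """AUROC = P(anomaly score > normal score) + 0.5 * P(tie),
    computed over all normal/anomaly score pairs."""
    s_n = np.asarray(scores_normal, dtype=float)[:, None]
    s_a = np.asarray(scores_anom, dtype=float)[None, :]
    wins = (s_a > s_n).mean()
    ties = (s_a == s_n).mean()
    return wins + 0.5 * ties

# Perfect separation -> AUROC of 1.0
assert auroc([0.1, 0.2, 0.3], [0.8, 0.9]) == 1.0

# Scores drawn from the same distribution -> AUROC near chance level (0.5)
rng = np.random.default_rng(0)
a = auroc(rng.normal(size=1000), rng.normal(size=1000))
assert 0.45 < a < 0.55
```

The same function applies unchanged to pixel-wise AS evaluation when the flattened anomaly maps are split by the ground-truth masks.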

We note that all of the above evaluation measures focus on an algorithm’s general capability of solving the AD/AS problem. The difficulty of selecting the optimal threshold t for y/\(\vec{y}\) is thereby sidestepped by iterating over all possible thresholds t. Finding the optimal value for t, ideally using normal images only, is a promising avenue for future research [3].
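One simple, normal-data-only heuristic in this direction (a hypothetical sketch for illustration, not a method from the reviewed literature) is to set t to an empirical quantile of the anomaly scores observed on normal images, targeting a desired FPR:

```python
import numpy as np

def threshold_for_fpr(normal_scores, target_fpr=0.01):
    """Pick t as the (1 - target_fpr) quantile of anomaly scores on normal data,
    so that roughly target_fpr of normal samples would be flagged as anomalous."""
    return float(np.quantile(normal_scores, 1.0 - target_fpr))

rng = np.random.default_rng(0)
normal_scores = rng.normal(0.0, 1.0, size=10_000)  # scores of normal images only

t = threshold_for_fpr(normal_scores, target_fpr=0.05)
fpr = (normal_scores > t).mean()
assert abs(fpr - 0.05) < 0.01  # achieved FPR on the held-in data is near the target
```

In practice, the quantile would be estimated on held-out normal data, and bootstrapping over resampled score sets can provide confidence intervals for t.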

3 AD/AS for 2D AVI

3.1 Datasets

Recent advances in AD/AS for AVI were spurred by the public release of suitable datasets that depict manufactured goods such as screws (objects) or fabrics (textures). We give an overview of them in Table 1.1, and observe the following:

Table 1.1 Public AD datasets for 2D AVI. In addition to the defect labels (e.g. scratch), we also denote whether anomaly type labels (i.e. textural vs. semantic) are given. Abbreviations: Obj. = Objects, Text. = Textures, Norm. = Normal, Anom. = Anomalous, C = # Channels, M.c.s.l. = Multi-class single label.
  1. In none of the datasets are anomalies rare events. In fact, global prevalence ranges from 10%–30%, and prevalences commonly exceed 50% in the pre-defined test sets. Furthermore, total dataset sizes are small compared to the throughput of a deployed AVIS, and the datasets are thus at risk of not fully capturing the normal data distribution. Fully capturing the normal data distribution, however, is crucial to achieve high true positive rates (TPRs) at low FPRs, a requirement imposed by the rarity of anomalies/defects. Developed AD/AS methods might therefore not transfer as well to industry. To mitigate this, dataset sizes should be increased further, and an effort should be made to sample the normal data distribution as representatively as possible.

  2. Almost no dataset specifies the anomaly type. In fact, only MVTec LOCO AD distinguishes between textural and semantic anomalies, and even it does not differentiate between point, contextual, and group anomalies. Furthermore, MVTec AD, BTAD, and MTD contain mostly textural anomalies. Together, this limits research aimed at detecting semantic anomalies in AVI, and additional datasets are required.

  3. All datasets contain images that can be cast to the RGB format. Moreover, most of the goods contained in MVTec AD and MVTec LOCO AD are relatively similar in appearance to natural images.

3.2 Methods

In general, developed AD/AS methods can be categorized into those that train complex models and their feature representations in an end-to-end manner, and hybrid approaches that leverage feature representations of DL models pre-trained on large-scale, natural image datasets, but leave them unchanged.

3.2.1 Training Complex Models in an End-to-End Manner.

Methods that train complex models in an end-to-end manner tend to pursue either AEs [27] or knowledge distillation (KD) [13]. Both approaches are based on the assumption that the trained DL model is well-behaved only on images that originate from the normal data distribution. For the AE, this means that the image reconstruction fails for anomalies, whereas for KD, this means that the regression of the teacher’s features by the student network fails. While the AE can be easily applied to multi/hyperspectral images [35], KD-based approaches are limited by their need for a suitable teacher model. Since CNNs pre-trained on ILSVRC2012 (a subset of ImageNet [36]) are commonly used as teacher models, this limits KD approaches to images that are castable to the RGB image format used in ImageNet, i.e. RGB or grayscale images. While a randomly initialized CNN might potentially be used as the teacher to circumvent this problem (similar approaches have been pursued successfully in reinforcement learning [37]), its efficacy has not yet been demonstrated for AVI.
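The shared scoring principle of AE- and KD-based methods, i.e. a regression residual that is small on normal data and large where the model fails to generalize, can be sketched as follows (toy numpy example; the near-perfect reconstruction of normal regions is an assumption made for illustration):

```python
import numpy as np

def anomaly_map_from_residual(x, x_hat):
    """Per-pixel squared error, averaged over channels: (C, H, W) -> (H, W).
    For an AE, x_hat is the reconstructed image; for KD, x and x_hat would be
    the teacher's and student's feature maps instead."""
    return ((x - x_hat) ** 2).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16, 16))
x_hat = x + rng.normal(0.0, 0.01, size=x.shape)  # good fit on normal regions
x_hat[:, 4:8, 4:8] = 0.0                         # model fails on an anomalous patch

amap = anomaly_map_from_residual(x, x_hat)
assert amap.shape == (16, 16)
assert amap[4:8, 4:8].mean() > amap[0:4, 0:4].mean()  # residual peaks at the anomaly
```

The assumption that the residual stays small exactly and only on normal data is what the generalization tendency of AEs mentioned above [27] can violate.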

As an alternative to AE and KD, the concentration assumption can be used to formulate learning objectives such as the patch support vector data description [38], which directly learn feature representations where the normal data is concentrated/clustered around a fixed point. Anomalies are then expected to have a larger distance to the cluster center than normal data.
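This objective can be sketched as a simplified, Deep-SVDD-style formulation (toy numpy example with a fixed center and pre-computed features; the actual patch-level method [38] operates on learned patch features and optimizes the network producing them):

```python
import numpy as np

def svdd_loss(features, center):
    """Mean squared distance of feature vectors to a fixed center c.
    Minimizing this over the network parameters concentrates normal data around c."""
    return float(((features - center) ** 2).sum(axis=1).mean())

def anomaly_score(feature, center):
    """At test time, the squared distance to c itself serves as the anomaly score."""
    return float(((feature - center) ** 2).sum())

rng = np.random.default_rng(0)
center = np.zeros(8)
normal_feats = rng.normal(0.0, 0.1, size=(100, 8))  # concentrated around c
anom_feat = rng.normal(1.0, 0.1, size=8)            # far from c

loss = svdd_loss(normal_feats, center)              # small for concentrated data
assert anomaly_score(anom_feat, center) > anomaly_score(normal_feats[0], center)
```

Note that without additional constraints, such objectives admit trivial solutions (e.g. a constant feature map), which is why the center is typically fixed and certain network components are restricted.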

The main advantage of methods that train complex models in an end-to-end manner is their applicability to any data type, including multi/hyperspectral images. Their main drawback is that training is compute-intensive, and they thus do not conform with the requirement of low training effort imposed by the manufacturing industry. Furthermore, these methods tend to produce worse results than hybrid approaches on RGB-castable images. As a potential explanation, it has been hypothesized that discriminative features are inherently difficult to learn from scratch using normal data only [21]. Moreover, it was shown that AEs tend to generalize to anomalies in AVI [27]. To improve results, two approaches are currently pursued in the literature: (I) Initializing the method with a model that was pre-trained on a large-scale natural image dataset [39]. However, this restricts approaches to grayscale/RGB images due to a lack of large-scale multi/hyperspectral image datasets. Furthermore, its effectiveness is limited by catastrophic forgetting [22, 40], which, in AD, refers to a loss of initially present, discriminative features. This technique is therefore often combined with the second approach, where (II) surrogates for the anomaly distribution are provided via anomaly synthesis [31, 41]. This requires either access to representative anomalies as a basis for synthesis, or an exhaustive understanding of the underlying manufacturing process and the visual appearance of occurring defects. Anomaly synthesis thus violates the assumption of an ill-defined anomaly distribution in the same manner as supervised approaches, and is expected to incur a similar bias.

3.2.2 Hybrid Approaches.

At their core, hybrid approaches assume that feature representations generated by training DL models on large-scale, natural image datasets can be transferred to achieve AD in AVI. Specifically, they assume that discriminative features have already been generated, and restrict themselves to finding a description of normality in said features. To this end, they employ three different techniques, all of which are classical AD approaches that are based on the concentration assumption: (I) Generative approaches explicitly model the probability density function (PDF) of the normal data distribution inside the pre-trained feature representations. Both unconstrained PDFs (i.e. via normalizing flows) [16, 42] and constrained PDFs (e.g. by assuming a Gaussian distribution) [12, 14] have been used. (II) Classification-based approaches fit binary classification models such as the one-class support vector machine (SVM) to the pre-trained feature representations [43]. (III) Neighborhood/Prototype-based approaches employ k-NN algorithms or variations thereof to implicitly approximate the PDF of the normal data distribution [15, 26].

A main advantage of hybrid approaches lies in their outstanding performance: All approaches that have achieved state-of-the-art AD performance on MVTec AD so far are hybrid approaches (see Fig. 1.3). Furthermore, hybrid approaches are in general not compute-intensive during training, as they do not train complex DL models. For example, “training” a Gaussian AD model consists of extracting the pre-trained features for the training dataset, followed by the numeric computation of \(\vec{\mu}\) and \(\vec{\Sigma}\) [12]. Moreover, hybrid approaches are fast at test time, as state-of-the-art, lightweight classification CNNs can be used as feature extractors. They thus fit the requirements of the manufacturing industry extremely well.
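Such Gaussian “training” and scoring can be sketched as follows (toy numpy example; in practice the feature vectors would come from a pre-trained CNN, and the squared Mahalanobis distance serves as the anomaly score [12]):

```python
import numpy as np

def fit_gaussian(features):
    """'Training': estimate mu and Sigma of the normal data's feature distribution.
    A small ridge term keeps the covariance invertible for few samples."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(feature, mu, cov_inv):
    """Anomaly score: squared Mahalanobis distance to the normal distribution."""
    d = feature - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(500, 4))  # stand-in for extracted features
mu, cov_inv = fit_gaussian(normal_feats)

in_dist = rng.normal(0.0, 1.0, size=4)
out_dist = np.full(4, 6.0)                           # far outside the normal cluster
assert mahalanobis_score(out_dist, mu, cov_inv) > mahalanobis_score(in_dist, mu, cov_inv)
```

The entire “training” step is two closed-form estimates, which is precisely why such hybrid models meet the low-training-effort requirement discussed above.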

Fig. 1.3 Temporal progression of the state of the art in AD performance on MVTec AD. Data was sourced from https://paperswithcode.com/about on 26.08.2022.

The main disadvantage of hybrid approaches lies in their core assumption: If the feature representations of the underlying, pre-trained feature extractor are not discriminative for the specific AVI problem at hand, hybrid approaches will necessarily fail. However, hybrid approaches have been successfully applied even to AVI setups which produce images that differ from the natural image distribution [15, 18, 42]. To nonetheless mitigate this disadvantage, the diversity of available, pre-trained feature extractors should be increased. There are two straightforward ways to do this: (I) Using datasets other than ILSVRC2012 for pre-training, and (II) using different model architectures, and even different computer vision tasks, for pre-training. Both influence general transfer learning performance [44], and initial work indicates that they might also be beneficial for AD/AS in AVI [20]. The second disadvantage of hybrid approaches is their limitation to images that can be cast to the RGB format, which is a direct consequence of the underlying feature extractors being trained on natural image datasets.

4 Open Research Questions

First, the detection performance on semantic anomalies needs to be improved further. This would facilitate the application of the developed algorithms to even more sophisticated AVI tasks. Here, hybrid approaches can directly benefit from advances in DL that yield more semantically meaningful feature representations. For example, vision transformers were recently shown to possess a smaller texture bias than CNNs [45], and could thus potentially be used as feature extractors. Second, the bias incurred by sampling the anomaly distribution needs to be decreased. A potential way of achieving this would be to explicitly incorporate the assumptions made about the anomaly distribution into the corresponding learning objectives. Third, methods that facilitate setting the working point of AD/AS methods in an automated manner are needed. As these would ideally rely on normal data only, they could aim at achieving specific target FPRs, e.g. via bootstrapping and model ensembling. Fourth, AD/AS methods that are less compute-intensive during training are required for multi/hyperspectral images to meet the requirements imposed by the manufacturing industry. Here, public datasets are expected to facilitate progress, similar to what was observed for RGB images. Fifth, AD/AS methods are required for 3D AVI tasks. A first dataset was published recently [6], and we expect hybrid approaches that rely on models pre-trained on large-scale datasets to achieve strong performance here as well [17].

While not an open research question specific to AVI, anomalies have recently been clustered successfully based on their visual appearance [46]. Together with the empirical success of anomaly synthesis/supervised AD [31, 41, 47], this indicates that the commonly made assumption that anomalies follow a uniform distribution might not hold. This aspect thus warrants additional research.

5 Conclusion

In our work, we have reviewed recent advances in AD/AS for AVI. We have provided brief definitions of AD/AS, and have given an overview of public datasets and their limitations. Moreover, we have identified two general categories of AD/AS approaches for AVI, and discussed their main advantages and disadvantages in light of the constraints and requirements imposed by the manufacturing industry. Last, we have identified open research questions and outlined potential ways of addressing them. We expect our review to facilitate additional research in AD/AS for AVI.