1 Introduction

With the advancement of deep learning, anomaly detection (AD) based on deep learning has been widely adopted across industries. Deep learning-based industrial anomaly detection is crucial for improving product quality and analyzing products. However, it requires a large number of training images: normal images are easy to obtain, whereas anomalous images are both rare and diverse. This imbalance is a key challenge for applying deep learning to industrial anomaly detection. To address it, unsupervised deep learning has been applied to industrial anomaly detection [1, 2], since unsupervised learning can recognize anomalous images without labels. Hence, deep unsupervised learning has been widely used in anomaly detection, including industrial anomaly detection [3,4,5,6], network anomaly detection [7,8,9,10] and hyperspectral anomaly detection [11, 12].

Unsupervised anomaly detection methods fall into two families: reconstruction-based and embedding-based [13]. Reconstruction-based methods identify anomalies from the discrepancy between the original data and its reconstruction. A model is first trained on normal data to learn its characteristics; test data are then reconstructed by this model, and the degree of abnormality is determined from the difference between the original and reconstructed data. A sample is flagged as anomalous if its reconstruction error exceeds a threshold. Common methods include autoencoders [14], generative adversarial networks (GANs) [15,16,17,18], transformers [19,20,21,22,23] and diffusion models [24,25,26,27]. Embedding-based methods, in contrast, map the data into a low-dimensional embedding space and determine anomalies from position or density in that space: normal data cluster together, while anomalous data fall away from the normal cluster. Common embedding-based approaches include one-class classification (OCC) [28,29,30,31], distribution maps [32,33,34,35,36,37], memory banks [38, 39] and teacher–student models [40,41,42,43,44,45,46].

The teacher–student model is interpretable and generalizes well, which has made it a representative approach for industrial anomaly detection. It is also an effective model compression method for classification tasks [47]. The conventional teacher–student model uses individual knowledge distillation for knowledge transfer, which passes only the teacher's last-layer knowledge to the student and may not adequately convey the teacher's structural knowledge. Park et al. [48] propose relational knowledge distillation (RKD), which transfers structural knowledge from the teacher model to the student model. Through RKD, the student model acquires more comprehensive and enriched knowledge, and teacher–student pairs with different structures perform better under RKD. Likewise, teacher–student models with different structures have been shown to outperform those with identical structures in multiresolution knowledge distillation (MKD) [42] and the asymmetric student–teacher (AST) [6]. In addition, AST observes that teacher and student models with the same structure extract very similar features from anomalous images, and such similar anomaly features hinder anomaly detection and anomaly classification. It is therefore preferable to design teacher and student models with different structures for anomaly detection and classification tasks.

The difficulty of anomaly classification lies in the rarity and diversity of anomalous images. First, rarity refers to the low proportion of anomalous images in the overall dataset. This imbalance biases training toward normal images, so the model tends to classify inputs as normal and cannot adequately learn anomalous patterns, which degrades classification accuracy. Second, diversity refers to the variability of anomalous images in their characteristics: anomalies appear in various shapes and colors, with no apparent common feature among them. Most current work on anomaly detection is based on OCC, which only separates images into normal and anomalous.

This paper designs a multi-classification anomaly detection (MCAD) framework to realize both anomaly detection and anomaly classification. MCAD utilizes a teacher–student model with different structures for anomaly detection: a ResNet18 serves as the teacher and a ResNet10 serves as the student. During training, the teacher–student model learns the information of normal images, and the teacher imparts knowledge to the student through RKD so that the student becomes comparable to the teacher. During testing, the teacher and student respond differently to input images because the teacher uses parameters pre-trained on ImageNet. This difference is converted into feature activation values for anomaly detection and anomaly localization. In addition, MCAD employs a multi-classification model for anomaly classification. Through transfer learning, the multi-classification model is equipped with pre-trained weights to categorize a limited set of abnormal images. The primary contributions of this study are outlined as follows.

  • This paper presents a multi-classification anomaly detection framework, which contains two stages. The first stage performs anomaly detection and anomaly localization through a teacher–student model. The second stage implements anomaly classification through a lightweight model.

  • This paper designs a teacher–student model with different structures, in which efficient knowledge transfer between teacher and student is achieved through relational knowledge distillation.

  • This paper proposes a multi-classification model for anomaly classification, treating it as a traditional image classification task. The intermediate features of the student model are fused through feature fusion and used as the input to the classification model.

  • Extensive experiments on an industrial anomaly detection dataset are conducted to validate the performance of the proposed method on anomaly detection and anomaly classification. Compared with other methods, the proposed method shows excellent performance.

The remainder of this paper is organized as follows. Section 2 provides an extensive overview of prior research on anomaly detection and anomaly classification. Section 3 describes the proposed MCAD framework in detail. Section 4 presents the datasets, results, ablation study and discussion. Section 5 summarizes the findings of this paper and explores prospects for future research.

2 Related work

2.1 Anomaly detection

In teacher–student approaches, the teacher model transfers its feature extraction capability for normal data to the student model. During inference, anomalies are detected from the feature differences between the teacher and the student on anomalous images. The utilization of the teacher–student structure for AD was initially proposed by Bergmann et al. [40]. After that, STPM [41] and MKD [42] use different approaches to refine multi-scale features at different network layers. In these methods, the student's features for normal images closely match the teacher's, while its features for anomalous images are less similar. Compared to STPM, RSTPM [43, 44] adds a second pair of teacher–student models: during testing, the new teacher model is placed behind the original teacher–student pair and is responsible for replicating the features. The student model reconstructs normal features even when abnormal images appear, so the reconstructed features can be distinguished from those of the teacher model. In contrast to RSTPM, RD4AD [45] uses only a single teacher–student pair but learns in a similar way. RD4AD introduces multi-scale feature fusion blocks and a one-class bottleneck to create embeddings that eliminate superfluous features across scales; this configuration of the teacher–student model facilitates proficient feature reconstruction. During inference, the anomalous image features extracted by the RD4AD teacher and student differ greatly. Therefore, this paper proposes a teacher–student model with different structures while utilizing a new knowledge distillation approach for knowledge transfer. Efficient knowledge transfer between teacher and student models with different structures improves anomaly detection performance.

2.2 Anomaly classification

Anomaly detection is usually learned only on normal images and is viewed as an OCC problem. Support vector domain description (SVDD) is a classical algorithm for the OCC problem. Based on SVDD, researchers have adapted it to industrial anomaly detection, proposing deep SVDD (DSVDD) [31], patch SVDD (PatchSVDD) [28], deep structure preservation SVDD (DSPSVDD) [29] and semantic-enhanced SVDD (SESVDD) [30]. The main idea of SVDD is to project the images into a feature space and compute the centroid of all sample projections together with a radius r; samples within r of the centroid are considered normal, otherwise they are abnormal. Ruff et al. [31] present DSVDD, which uses DNNs for the mapping. However, both SVDD and DSVDD process the whole image, so each image corresponds to a single point in the feature space during projection; as a result, they identify anomalies but cannot locate them. Yi et al. [28] propose PatchSVDD, which turns the processed objects from whole images into patches, so that each patch corresponds to a point in the feature space. In addition, since the features of different patches differ and patches of normal images may lie far apart in the feature space, using only one centroid is not feasible; PatchSVDD therefore replaces the single centroid of SVDD with multiple centroids formed by clustering. Zhang et al. [29] propose DSPSVDD, which designs an improved integrated optimization objective for the DSVDD module that considers both hypersphere volume minimization and network reconstruction error minimization, extracting deep data features more effectively.

Compared to OCC, contrastive language-image pre-training (CLIP)-based approaches show excellent performance on the anomaly classification task. Jeong et al. [49] propose a window-based CLIP model, WinCLIP, for anomaly classification. With the prompting of a large language model, WinCLIP achieves excellent classification results on the MVTec-AD dataset. Liu et al. [50] design a lightweight and nearly training-free unsupervised semantic segmentation model in which the input image is segmented into multiple parts to realize industrial visual inspection; however, the article only mentions good potential for anomaly classification and does not report results. Although CLIP-based anomaly classification achieves excellent results, it relies heavily on large language models, which increases the computational complexity of the model.

The approaches mentioned above have their respective advantages, but there is still considerable room for improvement. Current anomaly detection and anomaly classification models are complex and sometimes even require large language models, which increases system complexity and reduces practicality. To address these issues, and building on prior studies [6, 42, 48], this paper proposes MCAD, which is described in Sect. 3.

3 Method

As mentioned above, this paper proposes the MCAD framework for anomaly detection, anomaly localization and anomaly classification; this section discusses the framework in detail. First, anomaly detection and localization are introduced: they employ a teacher–student model and, inspired by RKD [48], a relational distillation strategy for knowledge transfer. Then, the lightweight classification model proposed in this paper for anomaly classification is described. The MCAD framework is shown in Fig. 1, and Algorithm 1 summarizes the MCAD anomaly detection and classification process.

Algorithm 1 MCAD pseudo-code

Fig. 1 Overview of the multi-classification anomaly detection and localization framework with relational knowledge distillation. IKD: individual knowledge distillation. RKD: relational knowledge distillation. A smaller student model (S) is trained to mimic the teacher model (T), and a lightweight multi-classification model (MCM) is used for anomaly classification. T: a ResNet18 model pre-trained on ImageNet. S: a ResNet10 model without pre-trained weights. A total loss function for anomaly detection and localization is computed from the teacher–student intermediate differences. FA denotes a feature activation value and F the features of an intermediate critical layer. MCM: a simple classification model whose input is the intermediate features of S. \(\times\) N denotes the number of repetitions and CN the number of channels

3.1 Teacher–student model

The teacher–student model only needs to be trained on normal data for anomaly detection, which avoids the data labeling problem of the anomaly detection task. Meanwhile, the teacher–student model adapts to new images by learning the distributional characteristics of normal data, and it captures the complex patterns of normal data, including the interrelationships and dependencies within the data. In summary, the teacher–student model can detect the distinction between normal and anomalous images during anomaly detection. In contrast to the traditional teacher–student model, this paper designs teacher and student models with different structures: as shown in MKD [42] and AST [6], different structures achieve better anomaly detection performance than teacher–student models sharing the same network. Moreover, RKD is employed to transfer structural information from the teacher to the student and thereby enhance the accuracy of anomaly detection.

ResNet18 and ResNet10 possess relatively shallow network architectures, allowing them to perform effectively even on devices with limited computational resources. As shown in Fig. 1, the teacher model is a ResNet18 loaded with ImageNet pre-trained weights, while the student model is a randomly initialized ResNet10. During training, normal images are fed to the teacher and student models, and the teacher imparts its feature extraction ability to the student through RKD. The upper right corner of Fig. 1 illustrates the difference between RKD and individual knowledge distillation. Conventional knowledge distillation passes the teacher's output to the student as prior knowledge without considering the relationships between feature layers, which limits the interpretability of the student model and conveys little of the teacher's structural information. This paper therefore employs RKD for knowledge transfer to enhance the student model's understanding of the structural information among the teacher model's feature layers. RKD forms structural units from the teacher's outputs over multiple data instances, which better reflects the structured characteristics of the teacher and enables better teaching of the student model. Unlike individual distillation, the RKD loss includes terms that encode this structural information, allowing the student model to learn a more effective representation from the teacher. The loss functions used for anomaly detection and localization are described as follows.
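To make the asymmetric pairing concrete, the sketch below shows one plausible way to instantiate the teacher and student and to collect intermediate features. It is a minimal illustration rather than the authors' implementation: the ResNet10 variant is assumed to be a torchvision-style ResNet with one BasicBlock per stage, and treating the stem output plus the four residual stages as the critical layers is also an assumption.

```python
# Minimal sketch of the asymmetric teacher-student pair (PyTorch / torchvision assumed).
import torch
import torchvision
from torchvision.models.resnet import ResNet, BasicBlock

# Teacher: ResNet18 with ImageNet pre-trained weights, frozen during distillation.
teacher = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student: an assumed "ResNet10", i.e. a ResNet with one BasicBlock per stage,
# randomly initialized (no pre-training).
student = ResNet(BasicBlock, [1, 1, 1, 1])


def critical_layer_features(model, x):
    """Collect features at the assumed critical layers: the stem output (F0)
    followed by the four residual stages (F1-F4)."""
    feats = []
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    feats.append(x)                                   # F0
    for stage in (model.layer1, model.layer2, model.layer3, model.layer4):
        x = stage(x)
        feats.append(x)                               # F1 .. F4
    return feats


images = torch.randn(8, 3, 224, 224)   # stand-in for a mini-batch of normal training images
t_feats = critical_layer_features(teacher, images)
s_feats = critical_layer_features(student, images)
```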

Distance distillation loss The \(i\)-th critical layer of the model is represented as \(\text{CL}_i\). The feature activation (FA) values of the teacher and student at the \(i\)-th critical layer are denoted as \(\text{FA}^{\text{CL}_{i}}_{T}\) and \(\text{FA}^{\text{CL}_{i}}_{S}\), respectively. Based on these values, distance and angle terms are introduced to enhance the knowledge transfer from teacher to student. Accordingly, this paper defines two losses, \(\mathcal {L}_{\text{Dis}}\) and \(\mathcal {L}_{\text{Ang}}\), representing the distance term and the angle term, respectively. \(\mathcal {L}_\text{Dis}\) penalizes mismatches between the pairwise Euclidean distances of the teacher and student feature activation values at \(\text{CL}_i\), and is defined as:

$$\begin{aligned} \mathcal {L}_{\text{Dis}}= \sum \limits _{(x_i,x_j)\in \mathcal {X}^2}l_\delta \bigl (\psi _{\text{D}}(t_i,t_j),\psi _{\text{D}}(s_i,s_j)\bigr ), \end{aligned}$$
(1)

where \(\mathcal {X}^{2}=\{(x_{i},x_{j})\mid i \ne j\}\) denotes the set of 2-tuples of distinct data instances, and \(t_i = \text{FA}^{\text{CL}_{i}}_{T}\) and \(s_i = \text{FA}^{\text{CL}_{i}}_{S}\) denote the teacher and student feature activation values of instance \(x_i\) at the critical layer. \(l_\delta (a,b)\) is the Huber loss, which is defined as:

$$\begin{aligned} l_\delta (a,b)= {\left\{ \begin{array}{ll} \frac{1}{2}(a-b)^2, &{}|a-b|\le 1,\\ |a-b|-\frac{1}{2},&{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

\(\psi _{\text{D}}(t_i,t_j)\) measures the Euclidean distance between two instances in the output representation space:

$$\begin{aligned} \psi _{\text{D}}(t_i,t_j)=\dfrac{1}{\mu }\left\| t_i-t_j\right\| _2, \end{aligned}$$
(3)

where \(\mu\) is the distance normalization factor. To consider the distance between multiple pairs, \(\mu\) is established as the mean distance between pairs of \({\mathcal {X}}^2\) within the mini-batch:

$$\begin{aligned} \mu =\dfrac{1}{\left| \mathcal X^2\right| }\sum \limits _{(x_i,x_j)\in \mathcal X^2}\left\| t_i-t_j\right\| _2. \end{aligned}$$
(4)

Mini-batch distance normalization proves beneficial in aligning the distances between teacher and student embeddings, especially when there are significant scale differences between the teacher distances \(\left\| t_i-t_j\right\| _2\) and the student distances \(\left\| s_i-s_j\right\| _2\). Experiments show that normalization makes the model more stable and converge faster during training. The distance distillation loss conveys relationships between images by evaluating the disparity in distances within their respective output representation spaces. Unlike traditional knowledge distillation, distance distillation focuses on the distance structure of the teacher's outputs rather than directly matching the outputs themselves.
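As an illustration, a minimal PyTorch sketch of Eqs. (1)-(4) for one critical layer is given below; the function names are ours, not the authors', and `F.smooth_l1_loss` with its default threshold of 1 coincides with the Huber loss in Eq. (2).

```python
import torch
import torch.nn.functional as F

def pairwise_distance(e, eps=1e-12):
    """psi_D over a mini-batch: pairwise Euclidean distances normalized by their mean (Eqs. 3-4).

    e: (B, D) flattened feature activations of one critical layer.
    """
    d = torch.cdist(e, e, p=2)                              # ||e_i - e_j||_2 for all pairs
    off_diag = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)
    mu = d[off_diag].mean().clamp_min(eps)                  # mean distance over i != j (Eq. 4)
    return d / mu

def distance_distill_loss(t_feat, s_feat):
    """L_Dis (Eq. 1): Huber loss between normalized teacher and student pairwise distances.
    Pairs with i == j have zero distance under both models and contribute nothing."""
    t, s = t_feat.flatten(1), s_feat.flatten(1)
    return F.smooth_l1_loss(pairwise_distance(s), pairwise_distance(t))
```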

Angle distillation loss The angle distillation loss conveys the relationships among training instance embeddings by assessing the differences in angles. Compared to distances, angles are a higher-order property and impart relational information more effectively, affording the student model greater flexibility during training. \(\mathcal {L}_{\text{Ang}}\) is defined as:

$$\begin{aligned} \mathcal {L}_{\text{Ang}} = \sum _{(x_i,x_j,x_k)\in \mathcal {X}^3}l_\delta \bigl (\psi _{\text{A}}(t_i,t_j,t_k),\psi _{\text{A}}(s_i,s_j,s_k)\bigr ), \end{aligned}$$
(5)

where \(\mathcal {X}^{3}=\{(x_{i},x_{j},x_{k})\mid i \ne j \ne k \}\) and \(l_\delta\) is the Huber loss. The angle term is particularly important in ReLU networks, where a neuron is activated only when its input exceeds zero: two activation vectors can have the same Euclidean distance from a target vector yet exert contrasting influences on the activation of subsequent neurons. The cosine similarity used in \(\mathcal {L}_{\text{Ang}}\) addresses this issue. Equation (6) defines the angle-wise relational potential, which measures the angle formed by three examples in the output representation space:

$$\begin{aligned} \psi _{\text{A}}(t_i,t_j,t_k)=\cos \angle t_i t_j t_k=\langle \textbf{e}^{ij},{\textbf{e}}^{kj}\rangle , \end{aligned}$$
(6)

where

$$\begin{aligned} {\textbf{e}}^{ij}=\dfrac{t_i-t_j}{\left\Vert t_i-t_j\right\Vert _2},{\textbf{e}}^{kj}=\frac{t_k-t_j}{\Vert t_k-t_j\Vert _2}. \end{aligned}$$
(7)
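A matching sketch of the angle term in Eqs. (5)-(7) follows, again as an illustration rather than the authors' implementation; degenerate triplets with repeated indices contribute zero loss and are not filtered out for brevity.

```python
import torch
import torch.nn.functional as F

def angle_potential(e, eps=1e-12):
    """psi_A (Eqs. 6-7): cosine of the angle with vertex j for every triplet of embeddings."""
    diff = e.unsqueeze(0) - e.unsqueeze(1)                    # diff[j, i] = e_i - e_j
    unit = diff / diff.norm(dim=-1, keepdim=True).clamp_min(eps)
    # Entry [j, i, k] equals <e^{ij}, e^{kj}>, i.e. cos(angle t_i t_j t_k).
    return torch.bmm(unit, unit.transpose(1, 2))

def angle_distill_loss(t_feat, s_feat):
    """L_Ang (Eq. 5): Huber loss between teacher and student angle-wise potentials."""
    t, s = t_feat.flatten(1), s_feat.flatten(1)
    return F.smooth_l1_loss(angle_potential(s), angle_potential(t))
```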

RKD is designed to transfer structural knowledge using the interrelationships among the data instances output by the teacher. The total RKD loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{\text{RKD}}= \mathcal {L}_\text{Dis}+\lambda \cdot \mathcal {L}_{\text{Ang}}, \end{aligned}$$
(8)

where \(\lambda\) is a hyperparameter balancing \(\mathcal {L}_\text{Dis}\) and \(\mathcal {L}_{\text{Ang}}\). The impact of \(\lambda\) on the anomaly detection results is discussed in Sect. 4.4.
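Putting the two terms together, a training step might accumulate the loss of Eq. (8) over the critical layers as sketched below, building on the helpers defined above. Summing rather than averaging across layers is our assumption, and the value of \(\lambda\) is only indicative of the region that the ablation in Sect. 4.4 favours.

```python
# Combined RKD objective (Eq. 8), accumulated over the critical layers collected earlier.
lam = 0.6                                   # angle-term weight; see the ablation in Sect. 4.4
loss_rkd = sum(distance_distill_loss(t, s) + lam * angle_distill_loss(t, s)
               for t, s in zip(t_feats, s_feats))
loss_rkd.backward()                         # gradients flow only into the student; the teacher is frozen
```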

3.2 Anomaly detection

A training dataset \(\mathbb {D}_{\text{train}}=\{I_{1},I_{2},\ldots ,I_{N}\}\) containing only normal images is given, and the student model is trained to identify anomalous images within the test dataset \(\mathbb {D}_{\text{test}}\). As shown in Fig. 1, the teacher transfers only the knowledge of normal images to the student during knowledge distillation. Therefore, during inference the feature activation values of anomalous images fall outside the range learned for normal images; they are novel to the student model. Because the teacher model is pre-trained on ImageNet, the teacher and student respond differently to anomalous images. This difference, computed as in Eq. (8), is compared against a threshold for anomaly detection. In Fig. 1, \(\text {FA1}\) represents the feature activation value of the first critical layer of the teacher–student model. MCAD combines different intermediate feature activation values of the teacher–student model for anomaly detection; the effect of different feature activation values on the detection results is discussed in Sect. 4.4.
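The scoring below is one possible reading of this procedure and builds on the earlier sketches. Because the exact per-layer aggregation of the teacher–student discrepancy is not fully specified here, a per-layer mean-squared difference is used as a stand-in; the layer indices correspond to the FA2+FA3+FA4 setting that Sect. 4.4 finds best, and the threshold is purely illustrative.

```python
def anomaly_score(image, layers=(2, 3, 4)):
    """Image-level anomaly score from teacher-student feature discrepancies.

    The per-layer mean-squared difference is an assumed stand-in for the paper's
    feature-activation comparison; layers (2, 3, 4) mimic the FA2+FA3+FA4 setting.
    """
    with torch.no_grad():
        t_feats = critical_layer_features(teacher, image.unsqueeze(0))
        s_feats = critical_layer_features(student, image.unsqueeze(0))
    return sum(F.mse_loss(s_feats[i], t_feats[i]).item() for i in layers)

test_image = torch.randn(3, 224, 224)    # placeholder for a test image
threshold = 0.5                          # illustrative; chosen on held-out data in practice
is_anomalous = anomaly_score(test_image) > threshold
```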

3.3 Anomaly localization

As shown in Fig. 1, each test image is fed into the teacher–student model to locate anomalous areas. MKD [42] has demonstrated that the derivative of the loss function with respect to the input yields valuable insights into the significance of individual pixels. This paper obtains the gradient information of each input dimension by calculating the derivative of the loss \(\mathcal {L}_{\text{RKD}}\) with respect to the input \(x_{i}\). The gradient value signifies the extent to which that dimension affects the optimization and indicates the presence of anomalies in the loss. Therefore, the gradient of \(\mathcal {L}_{\text{RKD}}\) is used to find the anomalous areas that cause its value to increase. To obtain the localization feature map of input \(x_{i}\), the attribution map of \(x_{i}\) is first computed. Equation (9) shows how to calculate the attribution map \(\mathcal {A}\):

$$\begin{aligned} \mathcal {A} = \frac{\partial \mathcal {L}_{\text{RKD}}}{\partial x_{i}}. \end{aligned}$$
(9)

The localization map is then obtained by applying a Gaussian blur and a morphological opening filter to the attribution map \(\mathcal {A}\), which reduces the natural noise in the feature map. Thus, the localization feature map \(L_{\text{map}}\) is represented as:

$$\begin{aligned} \mathbb {M} = \mathcal {G}_{\sigma }(\mathcal {A}), \end{aligned}$$
(10)
$$\begin{aligned} L_{\text{map}} = (\mathbb {M} \ominus \mathcal {B})\oplus \mathcal {B}, \end{aligned}$$
(11)

where \(\mathcal {G}_{\sigma }\) is a Gaussian filter with standard deviation \(\sigma\). \(\ominus\) and \(\oplus\) denote the morphological operators of shape erosion and dilation with the structuring element \(\mathcal {B}\), respectively; applying erosion followed by dilation is known as a morphological opening. \(\ominus\) removes small bright noise, and \(\oplus\) restores the remaining bright components. \(\mathcal {B}\) is a simple binary structuring element, usually an ellipse or a circular disk. Compared to Eq. (9), Eq. (11) not only sharpens each pixel's contribution to the loss value but also enables more accurate identification of anomalous areas.
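As an illustration of Eqs. (9)-(11), the sketch below computes a gradient-based attribution map, smooths it with a Gaussian filter and applies a morphological opening, building on the earlier sketches. The discrepancy loss used for the gradient, the channel pooling and the filter sizes are all assumptions; SciPy is assumed for the filtering.

```python
from scipy.ndimage import gaussian_filter, grey_opening

def localization_map(image, sigma=4.0, opening_size=5):
    """Anomaly localization map: gradient of the loss w.r.t. the input (Eq. 9),
    Gaussian blur (Eq. 10), then a morphological opening, i.e. erosion followed
    by dilation with a square structuring element (Eq. 11)."""
    x = image.unsqueeze(0).clone().requires_grad_(True)
    t_feats = critical_layer_features(teacher, x)
    s_feats = critical_layer_features(student, x)
    loss = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))   # assumed stand-in loss
    loss.backward()
    attribution = x.grad.abs().sum(dim=1).squeeze(0).numpy()         # pool gradients over channels
    blurred = gaussian_filter(attribution, sigma=sigma)              # Eq. (10)
    return grey_opening(blurred, size=(opening_size, opening_size))  # Eq. (11)

heatmap = localization_map(test_image)   # (H, W) array; larger values mark anomalous areas
```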

3.4 Anomaly classification

In deep learning-based classification, the customary approach is to design DNNs that learn image representations. The residual structure offers a useful principle for designing such networks: through the skip connection, the input of the preceding layer is added directly to the output of the subsequent layer, which lets information pass more freely through the network while retaining the original input and thus improves its expressiveness. Therefore, this paper takes this into account and designs a multi-classification model with a residual structure. The model is shown in Fig. 2 and corresponds to the encoder–decoder part of Fig. 1. Since anomalous images appear only during inference, the multi-classification model is trained by transfer learning. A minimal sketch of such a residual classifier is given below.
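The exact layer configuration of Fig. 2 is not reproduced here; the sketch only illustrates the idea of a lightweight classifier built from residual blocks, with all layer sizes chosen for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                # skip connection: add input to output

class MultiClassificationModel(nn.Module):
    """Lightweight multi-classification head over fused student features."""
    def __init__(self, in_channels, num_anomaly_classes, width=128):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, width, kernel_size=1)
        self.blocks = nn.Sequential(ResidualBlock(width), ResidualBlock(width))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(width, num_anomaly_classes))

    def forward(self, x):
        return self.head(self.blocks(self.reduce(x)))
```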

Fig. 2 Multi-classification model

At the bottom of Fig. 1, multiple intermediate features of the student model are fused and used as inputs to the multi-classification model for anomaly classification. Feature fusion is done by feature concatenation, which joins multiple intermediate features into one large feature vector, extending the feature representation capability while preserving the information of each intermediate feature. Moreover, feature concatenation fuses features from different levels and at different scales, capturing richer feature information and improving the model's classification ability. In Fig. 1, the student model's five intermediate features (F0–F4) are fused. During experimentation, however, different subsets of features are selected for fusion: for example, only \(\text {F0}\) may be used as the feature for anomaly classification, or \(\text {F0}+\text {F1}\) may be fused. The impact of fusing features from different intermediate critical layers as input to the multi-classification model is explored in Sect. 4.3, and Table 4 provides the classification results after the fusion of different critical layer features. A sketch of this fusion step follows.
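A minimal sketch of the feature concatenation is shown below, building on the earlier sketches. Because the intermediate features come from different stages, each map is resized to a common spatial size before concatenation; the target size and the channel handling are assumptions.

```python
def fuse_student_features(feats, size=(28, 28)):
    """Concatenate selected intermediate student features along the channel axis,
    after resizing each map to a common (assumed) spatial size."""
    resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feats]
    return torch.cat(resized, dim=1)

# F0+F1 only, versus all five features F0-F4, as inputs to the classification model.
fused_01 = fuse_student_features(s_feats[:2])
fused_all = fuse_student_features(s_feats)
classifier = MultiClassificationModel(in_channels=fused_all.size(1),
                                      num_anomaly_classes=5)   # e.g. carpet has five anomaly types
logits = classifier(fused_all)
```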

4 Experiments

This section demonstrates the effectiveness of the proposed method through extensive experiments. First, the experimental environment and datasets are introduced. Subsequently, we present the results of the MCAD framework across various datasets. After that, an ablation study is conducted. Finally, the experiments are discussed.

4.1 Experiment setup

Environment The experiments are executed using Python 3.9 and PyTorch 1.13.1 on an Intel(R) Core i7-13700KF CPU and an NVIDIA GeForce RTX 4090 GPU. For the MNIST, FashionMNIST and CIFAR10 experiments, the original image size is used as input. For the MVTec-AD dataset [51, 52], all images are resized to \(224 \times 224\).

Datasets MNIST: The MNIST dataset comprises 60,000 training images and 10,000 test images, each a \(28 \times 28\) grayscale handwritten digit. FashionMNIST: FashionMNIST is an image dataset that substitutes for MNIST, containing 70,000 images of products across 10 categories. CIFAR10: CIFAR10 contains 60,000 color images of size \(32\times 32\) in 10 categories. MVTec-AD: The MVTec-AD dataset is proposed by MVTec for industrial anomaly detection. Its training set comprises only normal images, while the test set includes both normal and anomalous images, and the anomalous images are further labeled with different anomaly classes. The dataset mimics industrial production scenarios across a variety of domains, with five texture and ten object categories, and provides pixel-level annotations of anomaly regions, making it a comprehensive multi-object, multi-anomaly dataset. MVTec-AD is a popular dataset in the field of AD.

4.2 Results of MNIST & FashionMNIST & CIFAR10

The proposed MCAD framework is contrasted against other state-of-the-art (SOTA) AD methods. The area under the receiver operating characteristic curve (AUROC) is used as the evaluation metric. Since the MNIST, FashionMNIST and CIFAR10 datasets each contain ten categories, this paper uses Class 0–9 to denote them. Table 1 compares the method described in this paper with other approaches. ARAE [53] improves anomaly detection by adversarially training autoencoders. LSA [54] introduces latent space autoregression for AD; the latent space is the lower-dimensional representation created by the encoder to encapsulate the fundamental attributes of the input data, and anomaly detection is achieved by modeling the correlation between data points in this space. MKD [42] employs multiresolution knowledge distillation for AD, improving detection performance through knowledge transfer between teacher and student models and by incorporating representations at different scales to better capture anomalous behavior. The approach in this paper uses a teacher–student model for anomaly detection, with knowledge transferred from teacher to student through relational knowledge distillation. On the MNIST dataset, the method in this paper achieved several best per-class results and the best average result of 98.95%; compared with MKD and ARAE, it improved the detection accuracy by 0.24 and 1.45%, respectively. For the FashionMNIST dataset, MCAD achieved an average result of 96.04%, which is better than that of MKD; the improvement is limited, however, because some categories in FashionMNIST are closely related and MCAD does not perform well on such multi-scale problems. CIFAR10 is a commonly used computer vision dataset that is also utilized in AD. On CIFAR10, MCAD achieved the best detection results in several categories and the best average result of 92.24%, outperforming MKD [42] by 5.06%.
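For reference, image-level AUROC can be computed from anomaly scores and binary labels as in the brief sketch below (scikit-learn assumed; the numbers are purely illustrative).

```python
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1, 0, 1]                     # 0 = normal, 1 = anomalous (illustrative)
scores = [0.12, 0.08, 0.91, 0.67, 0.15, 0.88]   # anomaly scores from the detector (illustrative)
print(f"AUROC: {roc_auc_score(labels, scores):.4f}")
```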

Table 1 AUROC results for anomaly detection on MNIST, FashionMNIST and CIFAR10
Fig. 3 The original images (1st row) of each class in the MVTec-AD dataset, the input images with the ground-truth anomaly masks (2nd row), the anomaly score maps estimated by the MCAD framework (3rd row) and the generated localization masks (4th row)

4.3 Experimental results of MVTec-AD

This subsection discusses the anomaly detection and classification results for the MVTec-AD dataset. The various categories of the MVTec-AD dataset are divided into two parts, texture and object classes. Consequently, Table 2 presents the outcomes of anomaly detection, and Table 3 exhibits the results of anomaly localization across diverse categories. Meanwhile, the average results of the texture class and object class are also shown. The anomaly classification results are displayed in Table 4.

Table 2 Anomaly detection results with AUROC on MVTec-AD
Table 3 Anomaly localization results with AUPRO on MVTec-AD
Table 4 Anomaly classification results with accuracy on MVTec-AD

Anomaly detection This section showcases the results of AD on the MVTec-AD dataset. Table 2 shows the comparison between the approach presented in this paper and other methods. For the texture and object classes of MVTec-AD, MCAD achieved AUROC results of 98.83 and 96.12%, and an average AUROC of 97.58% over all classes, which is 11.32 and 8.61% higher than US [40] and MKD [42], respectively. US [40] and MKD [42] also use the teacher–student model: US [40] employs an uninformative student model and a discriminative embedding loss function for AD, and notably uses an identical structure for teacher and student, while MKD [42] achieved better detection results than US [40] with a different teacher–student structure. However, MKD [42] employs VGG-style networks, in which the teacher and student respond similarly to the same anomalous region, whereas the residual structure extracts anomalous areas better. Therefore, MCAD employs ResNet18, a deep neural network with a residual structure, as the teacher model, and ResNet10 as the student model to ensure that the teacher and student have different structures. In summary, the MCAD framework proposed in this paper outperforms most models in AD, achieving SOTA results in categories such as carpet, leather, bottle and cable.

Anomaly localization This part examines the anomaly localization results of the MCAD approach. Table 3 presents the localization results on the MVTec-AD dataset in comparison with alternative methods. The average result for the texture classes reaches 97.65%, which differs from the best result by 0.23%, while the object classes achieve the best average localization result of 98.32%. Overall, MCAD achieves a final average localization result of 98.10% across all classes. MCAD attains SOTA localization results in multiple classes, for instance carpet, cable, capsule, toothbrush, transistor and zipper. Although DRAEM [61] achieves the best localization results in several classes (grid, wood, bottle, hazelnut and metal_nut), MCAD also performs well on these classes, and its localization results on the carpet, capsule and transistor classes are much better than those of DRAEM. The MCAD anomaly localization results on the MVTec-AD dataset are depicted in Fig. 3. For complex anomalies such as the texture classes, MCAD locates the anomalous areas accurately, and the anomalous areas of the object classes are likewise located accurately. As a result, MCAD achieves excellent localization performance on the MVTec-AD dataset.

Fig. 4 AUROC results of anomaly classification for different classes

Anomaly classification This part discusses anomaly classification. As shown in Fig. 1, the MCAD framework contains a multi-classification model for anomaly classification. Since anomalous images exist only in the inference process, the multi-classification model classifies only during inference. The anomaly classification results on the MVTec-AD dataset are presented in Table 4, whose top row lists the number of anomaly classes for each category of the MVTec-AD dataset. For example, the category carpet contains five anomaly classes, namely color, cut, hole, metal_contamination and thread.

As described in Sect. 3.4, the multi-classification model uses feature concatenation. The input \(\mathrm{F{0}}\) in Table 4 therefore denotes using the features of the 0-th critical layer, and \(\mathrm{F{0,1,2,3,4}}\) denotes using all feature maps of the critical layers of the student model as input to the multi-classification model. Figure 4 presents the classification result for each class of the MVTec-AD dataset in detail. The recognition accuracy increases as more features are added. The multi-classification model achieves more than 90% AUROC classification results in the leather, tile, hazelnut and zipper categories. However, the classification results are weaker for complex textures and classes with subtle anomalies, such as grid, capsule and pill, whose anomalous features resemble normal features and are challenging to classify. Nevertheless, the multi-classification model achieves an average AUROC classification result of 76.37% on the MVTec-AD dataset. Consequently, MCAD is well-positioned to classify each anomaly accurately for industrial applications.

Fig. 5 Anomaly detection accuracy of different \(\lambda\) (Eq. 8) on different categories of MVTec-AD

4.4 Ablation study

The influence of different critical layer features of the student model on the anomaly classification results is discussed in Sect. 4.3. In this section, we analyze the impact of the feature activation values of different critical layers on the anomaly detection and localization results. In addition, we discuss the anomaly detection results for different values of the hyperparameter \(\lambda\) in Eq. (8).

Table 5 Anomaly detection and localization AUROC(%) results for different feature activation (FA) values

The choice of feature activation values from distinct critical layers has a discernible influence on the final anomaly detection and localization results. Table 5 shows the detection and localization results obtained with the feature activation values of different network layers, where \(\text {FA}{1}\) denotes the feature activation value of the first critical layer. Among single-layer feature activation values, \(\text {FA}{4}\) achieved the best anomaly detection and localization results with 96.83 and 97.32%, respectively. When two feature activation values are used, \(\text {FA}{3}+\text {FA}{4}\) achieves the best results of 96.92 and 97.23%; compared to \(\text {FA}{1}+\text {FA}{2}\), the AUROC results for detection and localization improve by 1.74 and 2.40%, respectively. The best overall results are achieved with \(\text {FA}{2}+\text {FA}{3}+\text {FA}{4}\), at 97.58 and 98.10%. This exceeds the results of \(\text {FA}{1}+\text {FA}{2}+\text {FA}{3}+\text {FA}{4}\), showing that more feature activation values do not necessarily yield better results. The same is reflected in the comparison between \(\text {FA}{3}+\text {FA}{4}\), which achieves 96.92 and 97.23%, and \(\text {FA}{1}+\text {FA}{2}+\text {FA}{3}\), which achieves 94.67 and 95.96% for detection and localization. Although \(\text {FA}{4}\) builds on \(\text {FA}{1}\), the feature extraction capability is enhanced as the network depth increases; thus, \(\text {FA}{4}\) gives the best results among the single-value settings.

Different angle distillation loss weights lead to significantly different anomaly detection results in Fig. 5. As the hyperparameter \(\lambda\) increases, the detection results improve noticeably. However, this improvement is not unlimited; the accuracy decreases once \(\lambda\) exceeds 0.6.

4.5 Discussions

The MCAD framework proposed in this paper is an anomaly detection and localization model that also performs multi-class anomaly classification. A teacher–student model is used for anomaly detection and localization: ResNet18 and ResNet10 serve as the teacher and student models, respectively, relational knowledge distillation transfers knowledge from teacher to student, and the feature activation values of critical layers in both models are leveraged for anomaly detection and localization. The multi-classification model is a lightweight model containing a residual structure. Experimental results show that MCAD achieves anomaly classification together with excellent anomaly detection and localization.

Although MCAD has demonstrated outstanding performance in anomaly detection, localization and classification, its results on the FashionMNIST dataset show only a limited improvement over MKD [42]. In addition, MCAD falls short of the reconstruction-based DRAEM [61] in anomaly detection on several categories; DRAEM achieved the best anomaly detection results across multiple categories of the MVTec-AD dataset. Despite the promising anomaly localization results obtained by MCAD, its results on some categories also lag behind those of DRAEM. At the same time, the feature fusion capability must be improved so that anomaly classification performs better on complex textures and ambiguous categories such as grid, capsule and pill.

5 Conclusion

This paper proposes the MCAD framework for anomaly detection, localization and classification. First, a teacher–student model is applied for anomaly detection and localization, with knowledge transfer from teacher to student improved through RKD. Subsequently, a multi-classification model is employed to classify anomalous images. MCAD achieves 97.58% AUROC and 98.10% AUROC on the MVTec-AD dataset for anomaly detection and localization, respectively. In terms of anomaly classification, the multi-classification model attains an average classification accuracy of 76.37% on the MVTec-AD dataset. The experimental findings demonstrate that MCAD achieves outstanding anomaly detection and localization performance together with anomaly classification, addressing the requirements of industrial scenarios, especially for texture-class localization and object-class classification problems. However, the multi-classification model still has limitations, for example on the grid, capsule and pill classes; we believe these limitations can be addressed by better handling complex backgrounds. In future work, the primary focus lies in enhancing the feature fusion capability, especially for complex backgrounds, to elevate the model's performance in anomaly classification.