Introduction

Anomaly detection serves as a critical technique for identifying anomalous patterns in voluminous datasets, holding particular relevance in the analysis of imaging data. This technology finds applications in diverse domains, including but not limited to medical diagnosis [2, 3], plant healthcare [4], surveillance video [5], and disaster detection [6, 7]. Recent advancements in deep learning have propelled a surge of scholarly interest in the development of automated anomaly detection methods for expansive image datasets. Based on machine learning research, these techniques can be categorized into three primary classes: supervised, semi-supervised, and unsupervised methodologies. Despite each approach’s unique merits and limitations, the predominant challenge is the efficient identification of anomalies based on a limited number of anomalous instances.

Convolutional neural networks (CNN) represent a prevalent architecture in the landscape of computer vision, offering robust solutions for tasks such as image recognition and segmentation. Given substantial labeled datasets, CNN has achieved state-of-the-art performance in real-world image anomaly detection applications [6, 8]. Nonetheless, CNN-based anomaly detectors frequently grapple with the scarcity of labeled instances and a low incidence of anomalies. Several studies have developed strategies to ameliorate these constraints, including the incorporation of active learning [9] and the deployment of transfer learning [6] to enhance the learning efficiency of CNN.

Unsupervised learning methods have achieved wide acceptance in the domain of anomaly detection, primarily because they eliminate the need for labeled anomalous samples during the training phase. A conventional approach in unsupervised image anomaly detection relies on deep convolutional auto-encoders to reconstruct normal images [10]. However, these auto-encoders sometimes falter in the precise reconstruction of fine structures, leading to the generation of excessively blurry images. To counter this limitation, Generative Adversarial Networks (GAN) have been introduced into the field. AnoGAN [11] pioneered the application of GAN to image anomaly detection. Moreover, AnoGAN has been adapted to assess color reconstructability, thereby enabling the sensitive detection of color anomalies [12]. In unsupervised anomaly detection, it is common practice to quantify the deviation between the original image and the reconstructed image as the Anomaly Score.

Although unsupervised anomaly detectors eliminate the need for labeling anomalous instances during training, they have certain shortcomings. First, these detectors are susceptible to overlooking subtle and minute anomalies because the Anomaly Score is predicated upon the distance between the reconstructed and test images. Therefore, the effectiveness of unsupervised anomaly detectors is contingent upon the robust formulation of an Anomaly Score for specific objectives. Second, the threshold of the Anomaly Score must be carefully tuned to classify normal and anomalous instances accurately. This calibration frequently entails a laborious process of trial and error.

Recent advancements in visual attention mechanisms have garnered considerable traction in computer vision [13]. The attention branch network (ABN) incorporates a branching structure termed the Attention Branch [14]. The attention maps from this branch serve as visual explanations that describe the decision-making process within CNN. These attention maps have been demonstrated to improve CNN performance across various image classification tasks.

The visual attention mechanism, combined with contrastive learning, has realized robust prediction for imbalanced data in anomaly detection [15]. This research raises the intriguing prospect of integrating visual attention into image anomaly detection schemes. Nonetheless, existing visual attention modules, including ABN, predominantly rely on self-attention mechanisms [16, 17]. Consequently, the quality of attention in these modules is intrinsically linked to the network’s overall performance, thereby limiting their direct applicability in enhancing the performance of anomaly detectors.

In a preceding study, the layer-wise external attention network (LEA-Net) was introduced to enhance CNN-based anomaly detection through the incorporation of an external attention mechanism. The external attention mechanism leverages prior knowledge from external sources; LEA-Net utilizes the outputs of another pre-trained network. As discussed above, unsupervised and supervised anomaly detectors each have their own limitations. To address these limitations, LEA-Net consolidated supervised and unsupervised anomaly detection algorithms through the lens of a visual attention mechanism. The burgeoning advancements in visual attention mechanisms intimate the feasibility of leveraging prior knowledge in anomaly detection. The strategies described in [1] include the following:

  • The pre-existing knowledge concerning anomalies is articulated through an anomaly map constructed via the unsupervised learning of normal instances.

  • Subsequently, this anomaly map is transformed into an attention map by an auxiliary network.

  • The attention map is then incorporated into the intermediate layers of the anomaly detection network (ADN).

In line with this strategy, the effectiveness of layer-wise external attention in image anomaly detection was assessed through comprehensive experiments utilizing publicly accessible, real-world datasets. The findings revealed that layer-wise external attention reliably enhanced the performance of anomaly detectors, even with limited data. Further, the results suggested that the external attention mechanism can synergistically operate with the self-attention mechanism to enhance anomaly detection capabilities.

Although the external attention mechanism holds considerable promise for setting a new paradigm in image anomaly detection, its effectiveness depends on the judicious selection of an intermediate layer equipped with external attention. To illustrate how the layer-wise external attention mechanism improves anomaly detection performance, we conducted a series of more in-depth experiments. The principal contributions of this research are stated as follows:

  • We introduced an embedding-based approach, the Patch Distribution Modeling framework (PaDiM) [18], for generating anomaly maps, in addition to the reconstruction-based approaches.

  • We comparatively analyzed the performance of LEA-Net with that of baseline models under various conditions to clarify the modes through which external attention improves the detection performance of CNN.

  • We discerned that the presence of well-localized positional features on an anomaly map is instrumental in successfully implementing layer-wise external attention.

Related Work

A more straightforward methodology was employed for the automated detection of thyroid nodule lesions in X-ray computed tomography images [19]. This technique leverages binary segmentation results acquired from a U-Net as input for supervised image classifiers. The authors demonstrated that such preprocessing via binary segmentation significantly enhances anomaly detection accuracy in practical applications. Similarly, the convolutional adversarial variational autoencoder with guided attention (CAVGA) employs anomaly maps in a weakly supervised setting to localize anomalous areas [20]. Through empirical evaluations using the MVTec AD dataset, CAVGA achieved state-of-the-art performance. Both studies substantiate the considerable promise of incorporating visual attention maps in image anomaly detection.

The concept of visual attention pertains to the selective refinement or amplification of image features for recognition tasks. The human perceptual system prioritizes information germane to the task over comprehensive data processing [21, 22]. Visual attention mechanisms emulate this human faculty in the context of image classification [16, 17, 23,24,25,26]. In most configurations, image classifiers incorporating visual attention use a Self-Attention mechanism, seamlessly integrating with pre-existing models. However, the utility of such visual attention mechanisms is intrinsically tied to the performance of the primary model, constituting a limitation inherent to Self-Attention approaches.

To address this issue, ABN [14] introduced an interactive editing feature for the attention map, thereby guiding the focus points of CNN towards more salient regions within images. Another notable approach in the realm of visual attention is Attention Transfer, rooted in the concept of knowledge distillation [27]. During knowledge distillation, a compact network, termed the student network, acquires foundational knowledge from a more extensive teacher network [28]. The premise of attention transfer rests on the hypothesis that a teacher network is predisposed to concentrate on more information-rich regions compared to a student network.

Most recently, the external attention mechanism has emerged, inspired by both ABN and the Attention Transfer network [1]. A salient feature of the external attention mechanism is its independence from user interaction for modulating attention in CNN. Moreover, unlike Self-Attention mechanisms, the efficacy of external attention is contingent upon prior knowledge derived from a pre-trained external network. This mechanism can be seamlessly integrated into any CNN model in an end-to-end training manner.

Layer-Wise External Attention Network

Figure 1 provides an overview of the Layer-wise External Attention Network (LEA-Net) [1]. As previously elucidated, LEA-Net is a paradigmatic implementation of the external attention mechanism. As depicted in Fig. 1, LEA-Net is bifurcated into two principal components: anomaly map generation via an unsupervised network and anomaly detection by supervised networks. These components will be elaborated upon in the subsequent subsections.

Fig. 1 Overview of the layer-wise external attention network

Part 1: Anomaly Map Generation

For anomaly map generation, prior knowledge on normal data is instrumental in identifying a diverse range of real-world anomalies. In practical applications, a pixel-wise scoring metric, known as an anomaly map, is ubiquitously employed for anomaly detection. In this study, we introduce three distinct unsupervised methodologies for generating anomaly maps to explore the dependency of external attention effectiveness on these maps. Figure 2 summarizes the characteristics of these three unsupervised strategies: (a) Color Reconstruction, (b) Auto-Encoding, and (c) PaDiM [18]. These methodologies can be taxonomically divided into two categories: reconstruction-based approaches, comprising (a) and (b), and an embedding-based approach, represented by (c). Further details are provided in the following section.

Fig. 2 Unsupervised approaches for generating anomaly maps

Color Reconstruction

The procedure for the color reconstruction task is delineated in Fig. 2(a). Initially, a grayscale image is extracted from an input color image, which is represented in the \(L^{*}a^{*}b^{*}\) color space. Subsequently, the grayscale image is transformed back into a color image utilizing U-Net [29]. Specifically, the chrominance information \(a^{*}b^{*}\) is predicted from the luminance information \(L^{*}\) within the \(L^{*}a^{*}b^{*}\) color space. The predicted \(a^{*}b^{*}\) is then amalgamated with the \(L^{*}\) of the input color image to yield the final image in the \(L^{*}a^{*}b^{*}\) color space. The U-Net model is pre-trained solely on normal instances to facilitate this color reconstruction process. Ultimately, a color anomaly map is generated by evaluating the CIEDE2000 [30] color difference between the reconstructed and original images. For an in-depth discussion of color anomaly map generation, the reader is directed to [12].
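A minimal sketch of this procedure is given below, assuming a pre-trained U-Net colorization model (here called `colorizer`) that predicts the \(a^{*}b^{*}\) channels from \(L^{*}\); the interface and variable names are illustrative, and the color conversion and CIEDE2000 computation rely on scikit-image.

```python
import numpy as np
import torch
from skimage import color  # rgb2lab and deltaE_ciede2000 are provided by scikit-image


def color_anomaly_map(rgb_image, colorizer):
    """Compute a CIEDE2000 color anomaly map for one RGB image in [0, 1].

    `colorizer` is assumed to be a U-Net pre-trained on normal images that
    predicts the a*b* channels from the L* channel (illustrative interface).
    """
    lab = color.rgb2lab(rgb_image)                              # H x W x 3, L*a*b*
    L = lab[..., 0:1]                                           # luminance only
    with torch.no_grad():
        L_tensor = torch.from_numpy(L.copy()).float().permute(2, 0, 1).unsqueeze(0)
        ab_pred = colorizer(L_tensor)                           # 1 x 2 x H x W, predicted a*b*
    ab_pred = ab_pred.squeeze(0).permute(1, 2, 0).numpy()
    lab_pred = np.concatenate([L, ab_pred], axis=-1)            # keep original L*, predicted a*b*
    # Pixel-wise CIEDE2000 color difference between original and reconstruction
    return color.deltaE_ciede2000(lab, lab_pred)                # H x W anomaly map
```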

Auto-Encoding

For the auto-encoding task, we employ U-Net. An anomaly map is synthesized by computing the pixel-wise absolute deviation between the reconstructed and original images.
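Under the same hedged assumptions, the auto-encoding anomaly map reduces to a pixel-wise absolute difference; a brief sketch with an illustrative `autoencoder` interface follows.

```python
import torch


def reconstruction_anomaly_map(image, autoencoder):
    """Pixel-wise absolute deviation between input and U-Net reconstruction.

    `image` is a 1 x C x H x W tensor; `autoencoder` is assumed to be a U-Net
    pre-trained on normal images only (illustrative interface).
    """
    with torch.no_grad():
        reconstruction = autoencoder(image)
    # Average the absolute error over channels to obtain a single-channel map
    return (image - reconstruction).abs().mean(dim=1)   # 1 x H x W
```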

PaDiM

PaDiM is an embedding-based methodology optimized for anomaly detection in industrial settings, boasting state-of-the-art results on the MVTec AD dataset. Specifically, PaDiM leverages a pre-trained network for patch embedding to compile statistical data on normal instances. It then calculates the Mahalanobis distance between the feature vector derived from an input image and that obtained from normal instances. In the context of this study, the ResNet-50 model [31], trained on the ImageNet dataset [32], serves as the feature extractor for PaDiM.
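The scoring step of PaDiM can be sketched as follows, assuming that patch embeddings have already been extracted and flattened to a D x P array per image (D embedding dimensions, P patch positions); regularizing the covariance with a small multiple of the identity is an assumption following common practice.

```python
import numpy as np


def fit_padim_statistics(normal_embeddings, eps=0.01):
    """Fit a Gaussian per patch position from normal samples.

    normal_embeddings: N x D x P array (N normal images, D embedding dims,
    P = H*W patch positions). Returns per-position means and inverse covariances.
    """
    N, D, P = normal_embeddings.shape
    means = normal_embeddings.mean(axis=0)                         # D x P
    inv_covs = np.empty((P, D, D))
    for p in range(P):
        cov = np.cov(normal_embeddings[:, :, p], rowvar=False)     # D x D
        cov += eps * np.eye(D)                                     # regularize for invertibility
        inv_covs[p] = np.linalg.inv(cov)
    return means, inv_covs


def padim_anomaly_map(test_embedding, means, inv_covs, map_shape):
    """Mahalanobis distance of each test patch to its normal distribution."""
    D, P = test_embedding.shape
    scores = np.empty(P)
    for p in range(P):
        diff = test_embedding[:, p] - means[:, p]
        scores[p] = np.sqrt(diff @ inv_covs[p] @ diff)
    return scores.reshape(map_shape)                               # e.g. 28 x 28
```

In practice, the resulting map would typically be upsampled to the input resolution and smoothed before use, but those steps are omitted from this sketch.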

Part 2: Anomaly Detection Using Supervised Networks

The intricacies of the anomaly attention network (AAN) and the anomaly detection network (ADN) are elucidated in Fig. 1. AAN serves the function of transforming an anomaly map into an attention map. This transformation modulates the complexity, certainty, and sharpness of the attention map in alignment with the progress of training and hierarchical representation in ADN. This mechanism can be conceptualized as a form of reverse Curriculum Learning. In contrast, ADN operates as a supervised network that categorizes images as normal or abnormal. It receives the attention map from AAN and generates the final classification for anomaly detection. These two networks are interconnected at an intermediate layer via an attention block, with the attention map designed to underscore informative regions on a feature map within ADN.

The structural specifics of AAN and ADN are listed in Table 1. Within this context, the term attention point refers to the intermediate layer at which external attention is implemented. In a practical setting, multiple attention points can be designated for external attention. In the scope of this study, a ResNet-based architecture was utilized for AAN, and ResNet-18 was employed for ADN [31]. The ResNet-based AAN is architected by sequentially stacking residual blocks. The number of downsampling points in ADN is equal to or greater than that in AAN to facilitate straightforward network interconnection. In the experiments detailed in this paper, we opted for the inclusion of five attention points.

Table 1 Structures of AAN and ADN

Training Process for LEA-Net

Figure 3 delineates the training regimen for LEA-Net, which is carried out in two sequential stages: (1) the training of an unsupervised network and (2) the training of supervised networks. In the first phase, the unsupervised network is trained using normal images with the aim of generating anomaly maps tailored for LEA-Net. Notably, if PaDiM is employed as the unsupervised network, this training step may be circumvented, with the exception of the statistical computation.

Subsequently, an original image and its corresponding anomaly map are input into ADN and AAN, respectively, to obtain two predictive probabilities. The parameters of AAN and ADN are concurrently optimized by minimizing a loss function that constitutes a linear combination of these two predictive probabilities. Let \(x_{i} \in \mathbb {R}^{\textrm{H} \times \textrm{W} \times 3}\) represent the ith original input image. Here, \(\textrm{H}\) and \(\textrm{W}\) denote the height and width of images, respectively. Additionally, we define \(i \in \{1, \ldots , \textrm{N}\}\), where \(\textrm{N}\) is the number of training images. Let \(x^{\prime }_{i} \in \mathbb {R}^{\textrm{H} \times \textrm{W}}\) represent the ith anomaly map, and let \(y_{i}\in \left\{ 0, 1\right\} \) denote the corresponding ground-truth label. Finally, let \({\hat{y}}^{\prime }_{i} \in \left[ 0, 1\right] \) and \({\hat{y}}_{i} \in \left[ 0, 1\right] \) indicate the ith predictive probabilities of AAN and ADN, respectively. The loss function for the entire classification network can be expressed as the sum of two loss functions:

$$\begin{aligned} L = \dfrac{1}{\textrm{N}} \sum _{i = 1}^{\textrm{N}} \textrm{BCE} ({\hat{y}}^{\prime }_{i}, y_{i}) + \dfrac{1}{\textrm{N}} \sum _{i = 1}^{\textrm{N}} \textrm{BCE}({\hat{y}}_{i}, y_{i}). \end{aligned}$$

In this context, \(\mathrm {BCE(\cdot )}\) signifies the binary cross-entropy. The loss function was designed with the anticipation that AAN would efficaciously modify the attention maps during the training phase, analogous to ABN [14].
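A minimal sketch of one joint optimization step under this loss is shown below, assuming that AAN and ADN each output a probability in [0, 1] and that their parameters are collected in a single optimizer, e.g. `torch.optim.Adam(list(aan.parameters()) + list(adn.parameters()), lr=1e-4)`. The module interfaces are illustrative, and the exchange of attention maps at the attention points is assumed to occur inside the forward passes.

```python
import torch
import torch.nn.functional as F


def lea_net_training_step(adn, aan, images, anomaly_maps, labels, optimizer):
    """One optimization step with the joint loss L = BCE(y'_hat, y) + BCE(y_hat, y).

    `labels` is a float tensor of ground-truth labels in {0., 1.}; `adn` and
    `aan` are assumed to return probabilities in [0, 1] (illustrative interface).
    """
    optimizer.zero_grad()
    y_hat_prime = aan(anomaly_maps)          # AAN prediction from the anomaly map
    y_hat = adn(images)                      # ADN prediction from the original image
    loss = (F.binary_cross_entropy(y_hat_prime, labels)
            + F.binary_cross_entropy(y_hat, labels))
    loss.backward()
    optimizer.step()                         # AAN and ADN parameters are updated jointly
    return loss.item()
```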

We illustrate the architecture of the attention block. Its composition is straightforward, consisting merely of a channel-wise average pooling layer, denoted as \(\phi (\cdot )\), and a sigmoid layer, represented as \(\sigma \). The channel-wise average pooling layer functions by averaging the extracted features per channel. Let us denote the attention map at point \(p \in \{1,2,3,4,5\}\) as \(M_p~\in ~\mathbb {R}^{\mathrm {H_p} \times \mathrm {W_p}}\), such that \(\mathrm {H_p}\) and \(\mathrm {W_p}\) represent the height and width of features at an attention point. Let \(g_{p}(x^{\prime }_i) \in \mathbb {R}^{\mathrm {H_p} \times \mathrm {W_p} \times \mathrm {C_{p}}}\) represent the feature tensor at an attention point p in AAN for input image \(x^{\prime }_i\), and \(\mathrm {C_{p}}\) indicate the number of channels. Therefore, \(M_p\) is generated as follows:

$$\begin{aligned} M_p = \sigma (\phi (g_p(x^{\prime }_i))). \end{aligned}$$

Here, we adopted a channel-wise average pooling layer \(\phi (\cdot )\) instead of a \(1 \times 1\) convolution layer, anticipating the same effect reported in [17]. A sigmoid layer \(\sigma (\cdot )\) normalizes the feature map \(\phi (g_{p}(x^{\prime }_i))\) within the range of [0, 1]. It has been reported that normalization of the attention map can effectively highlight informative regions [23]. Additionally, the sigmoid function prevents the attention map from reversing the significance of ADN features through multiplication by negative values.

Our attention mechanism aims to highlight the informative regions on feature maps, instead of erasing other regions [23]. To mitigate the risk of inadvertently erasing the informative regions through the attention maps, we integrated the attention map into ADN as expressed in the following equation:

$$\begin{aligned} f^{\prime }_{p}(x_{i}) = (1\oplus M_p) \otimes f_p(x_{i}), \end{aligned}$$

where \(\oplus \) denotes the element-wise sum, \(\otimes \) denotes the element-wise product, and \(f^{\prime }_{p}(x_{i})\) represents the updated feature tensor at point p in ADN after the external attention mechanism.
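The attention block and the feature update at a single attention point can be sketched as the following PyTorch module; tensor shapes follow the definitions above, and the module interface is illustrative rather than a reference implementation.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Attention block connecting AAN and ADN at one attention point.

    Builds M_p = sigmoid(channel-wise average of g_p(x')) and applies
    f'_p = (1 + M_p) * f_p, following the equations above (sketch only).
    """

    def forward(self, aan_features, adn_features):
        # aan_features: B x C_p x H_p x W_p feature tensor g_p(x') from AAN
        # adn_features: B x C'_p x H_p x W_p feature tensor f_p(x) from ADN
        attention_map = torch.sigmoid(aan_features.mean(dim=1, keepdim=True))  # B x 1 x H_p x W_p
        # Residual-style application avoids erasing informative regions
        return (1.0 + attention_map) * adn_features
```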

This particular attention strategy is also designed to circumvent the Dying ReLU problem [33]. This issue arises when a large number of units with negative pre-activations are mapped to zero by the ReLU function, thereby engendering the vanishing gradient problem. Note that AAN and ADN often exhibit substantial disparities in their feature maps. Additionally, as the network layers deepen, the feature maps tend to become increasingly sparse. If such sparse features were simply multiplied at the attention points, the performance of ADN would degrade severely. This factor serves as an additional rationale for adopting our attention strategy instead of the straightforward multiplication of the attention map.

Fig. 3 Overview of the training process for LEA-Net

Experiments

In the experimental section, we conducted anomaly detection tests employing LEA-Net across various conditions, utilizing both the MVTec AD [34] and PlantVillage [8] datasets. The particulars of the datasets, experimental configurations, and results are elaborated upon herein.

Datasets

Figure 4 illustrates examples of the data used in this investigation. We assessed image anomaly detection performance with two principal datasets: MVTec AD and PlantVillage. The MVTec AD dataset comprises both defect-free and anomalous images across multiple object and texture categories. Specifically, we employed the carpet, hazelnuts, leather, screw, tile, wood, and zipper sub-datasets. On the other hand, PlantVillage contains images of healthy and diseased leaves from multiple plant species. We utilized the potato, grape, and strawberry sub-datasets. Within PlantVillage, we also employed versions of the datasets whose backgrounds were removed by image segmentation [35].

The lowermost panels in Fig. 4 depict the generated anomaly maps. Prior to conducting the experiments, all images were standardized to dimensions of \(256 \times 256\) pixels. Table 2 delineates the specific numbers of images included within the training datasets. To appraise the performance on datasets reflective of real-world scenarios, we extracted images from each dataset at random to assemble smaller, imbalanced datasets.
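The dataset preparation described above can be sketched as follows; the per-class sample counts are placeholders for the values listed in Table 2, and the file-handling details are assumptions.

```python
import random

from PIL import Image


def prepare_subset(normal_paths, anomalous_paths, n_normal, n_anomalous,
                   seed=0, size=(256, 256)):
    """Resize images to 256x256 and randomly sample a small, imbalanced subset.

    The counts n_normal and n_anomalous are placeholders; the actual values
    used per dataset are listed in Table 2.
    """
    rng = random.Random(seed)
    sampled = ([(p, 0) for p in rng.sample(normal_paths, n_normal)]
               + [(p, 1) for p in rng.sample(anomalous_paths, n_anomalous)])
    images, labels = [], []
    for path, label in sampled:
        images.append(Image.open(path).convert("RGB").resize(size))
        labels.append(label)
    return images, labels
```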

Fig. 4 Datasets in this study. For each dataset, the original image, the reconstructed image, and the anomaly map calculated from the two are displayed for positive and negative samples. These sample anomaly maps were generated by the Color Reconstruction approach

Table 2 Small and imbalanced experimental datasets, reconstructed by random sampling

Experimental Setup

In this section, we delineate the experimental setup employed for training purposes. For the anomaly detection tests, stratified five-fold cross-validation was executed on each dataset, eschewing data augmentation.
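A minimal sketch of the stratified five-fold split (using scikit-learn) is shown below; only the index partitioning is illustrated, and image loading is assumed to happen elsewhere.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_folds(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs preserving the normal/anomalous ratio."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy_X = np.zeros((len(labels), 1))     # indices only; images are loaded elsewhere
    for train_idx, test_idx in skf.split(dummy_X, labels):
        yield train_idx, test_idx
```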

The training regimen for both Color Reconstruction and Auto-Encoding utilized the Adam optimizer with a learning rate of 0.0001. The momentum parameters of the optimizer were configured as \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). The parameters were updated over a total of 500 epochs, with a batch size of 16. Early stopping was implemented to enhance computational efficiency. In contrast, PaDiM circumvents the need for training model parameters but necessitates the extraction of embedding vectors from a pre-trained model. In the context of this paper, these vectors were derived from the outputs of three disparate layers of a ResNet-50 model pre-trained on ImageNet. Notably, the maximum dimensions of the feature maps are \(28\times 28\), and each embedding vector comprised 1000 randomly selected dimensions.
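A hedged sketch of this embedding extraction is given below; the choice of layers (layer1–layer3 of ResNet-50), the alignment of all feature maps to \(28\times 28\), and the hook-based extraction are assumptions made purely for illustration (the torchvision `weights="IMAGENET1K_V1"` argument assumes a recent torchvision version).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def extract_padim_embeddings(images, num_dims=1000, seed=0):
    """Concatenate intermediate ResNet-50 features, align them spatially,
    and keep `num_dims` randomly selected channels (sketch; details assumed)."""
    backbone = resnet50(weights="IMAGENET1K_V1").eval()
    features = {}
    hooks = [
        backbone.layer1.register_forward_hook(lambda m, i, o: features.update(l1=o)),
        backbone.layer2.register_forward_hook(lambda m, i, o: features.update(l2=o)),
        backbone.layer3.register_forward_hook(lambda m, i, o: features.update(l3=o)),
    ]
    with torch.no_grad():
        backbone(images)                                   # images: B x 3 x H x W
    for h in hooks:
        h.remove()
    maps = [F.interpolate(features[k], size=(28, 28), mode="nearest")
            for k in ("l1", "l2", "l3")]                   # align spatial sizes
    embedding = torch.cat(maps, dim=1)                     # B x (256+512+1024) x 28 x 28
    torch.manual_seed(seed)
    idx = torch.randperm(embedding.shape[1])[:num_dims]    # fixed random dimension subset
    return embedding[:, idx]                               # B x 1000 x 28 x 28
```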

The parameters of LEA-Net, including AAN (ResNet-based) and ADN (ResNet-18), were optimized using the Adam optimizer with a learning rate of 0.0001. The momentum parameters of Adam were likewise set to \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). A total of 100 epochs were used to update these parameters, with the batch size maintained at 16. Computational tasks were executed on a system equipped with a GeForce RTX 2080 Ti GPU, running Python 3.10.12 and CUDA 11.8.89.

Comparison of Supervised Networks and LEA-Net

The primary objective of LEA-Net is to augment the performance of the baseline network, which is trained in a purely supervised fashion in the realm of anomaly detection. To assess the efficacy of the external attention mechanism, we conducted comparative analyses on image-level anomaly detection performance across several models: (i) ResNet-18 as the baseline, (ii) ResNet-50, (iii) LEA-Net informed by anomaly maps generated through a color reconstruction task, denoted as LEA-Net (Color Reconstruction), (iv) LEA-Net guided by anomaly maps generated through auto-encoding, referred to as LEA-Net (Auto-Encoding), and (v) LEA-Net shaped by anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM).

For each of these models (i)–(v), the network output threshold was fixed at 0.5 to facilitate the computation of \(F_1\) scores. Figure 5 reveals an enhancement in \(F_1\) scores for the baseline model (ResNet-18) due to the implementation of the external attention mechanism. The horizontal axis demarcates the categories of datasets employed in the experiments, while the bars signify the average \(F_1\) scores ascertained through cross-validation. We report only the maximal \(F_1\) score among the five selected attention points. Additionally, error bars represent the standard deviation, and bars corresponding to the highest average \(F_1\) score in each category are marked with a black inverted triangle.
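The \(F_1\) computation at the fixed 0.5 threshold, averaged over cross-validation folds, can be sketched as follows (scikit-learn is assumed; variable names are illustrative).

```python
import numpy as np
from sklearn.metrics import f1_score


def mean_f1_over_folds(fold_probs, fold_labels, threshold=0.5):
    """Average F1 score across cross-validation folds at a fixed 0.5 threshold."""
    scores = [f1_score(y, (p >= threshold).astype(int))
              for p, y in zip(fold_probs, fold_labels)]
    return np.mean(scores), np.std(scores)   # mean and error-bar standard deviation
```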

As indicated in Fig. 5, the external attention mechanism substantially elevates the baseline model’s performance across all datasets. Most notably, the \(F_1\) scores in MVTec AD’s carpet, tile, and wood categories witnessed an average improvement of approximately \(14.3\%\). Interestingly, ResNet-50 underperformed compared to ResNet-18 in specific instances, such as the carpet category. Furthermore, the parameter counts for ResNet-18, ResNet-50, and LEA-Net are \(11.2\textrm{M}\), \(23.5\textrm{M}\), and \(15.6\textrm{M}\), respectively. This observation substantiates that the sheer number of model parameters is not pivotal in achieving the superior performance of LEA-Net.

Fig. 5 Comparison of \(F_1\) scores for purely supervised networks and that for LEA-Net

Comparison of Unsupervised Networks and LEA-Net

To rigorously evaluate the efficacy of LEA-Net, we juxtaposed its performance with that of a straightforward thresholding method applied to anomaly maps. We assessed the image-level anomaly detection capability by computing \(F_1\) scores in the following settings: (i) LEA-Net employing anomaly maps generated through color reconstruction, denoted as LEA-Net (Color Reconstruction), (ii) LEA-Net utilizing anomaly maps formed via auto-encoding, termed LEA-Net (Auto-Encoding), (iii) LEA-Net with anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM), (iv) Direct thresholding of anomaly maps originated from color reconstruction, identified as Color Reconstruction, (v) Direct thresholding of anomaly maps produced through auto-encoding, referred to as Auto-Encoding, and (vi) Direct thresholding of anomaly maps emanating from PaDiM, designated as PaDiM. For configurations (i)–(iii), the threshold for calculating \(F_1\) scores is set at 0.5.
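A hedged sketch of the direct-thresholding baselines (iv)–(vi) is given below; since the text does not specify how pixel-wise maps are aggregated into image-level scores or how the threshold is tuned, the percentile aggregation and grid search shown here are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score


def threshold_anomaly_maps(anomaly_maps, labels, percentile=99):
    """Image-level detection by direct thresholding of anomaly maps (sketch).

    The image-level score and the threshold search are assumptions: here each
    image is scored by a high percentile of its anomaly map, and the threshold
    maximizing F1 is selected automatically over a candidate grid.
    """
    scores = np.array([np.percentile(m, percentile) for m in anomaly_maps])
    candidates = np.linspace(scores.min(), scores.max(), num=100)
    f1s = [f1_score(labels, (scores >= t).astype(int)) for t in candidates]
    best = int(np.argmax(f1s))
    return candidates[best], f1s[best]       # tuned threshold and its F1 score
```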

As depicted in Fig. 6, LEA-Net consistently outperforms the straightforward thresholding approach in the contexts of both Color Reconstruction and Auto-Encoding across all datasets. Specifically, it is noteworthy that LEA-Net considerably enhances performance across most of the PlantVillage dataset. In contrast, PaDiM (direct thresholding) yields superior results compared to LEA-Net on the MVTec AD dataset, except for the hazelnuts category.

Fig. 6 Comparison of \(F_1\) scores for automatic threshold tuning and that for LEA-Net

Dependence on the Selection of the Attention Points

In this section, we evaluate the influence of attention point selection on the efficacy of anomaly detection. As depicted in Fig. 7, we contrast the detection performance of LEA-Net when configured with different attention points. The horizontal axis portrays the generative methods employed for the anomaly maps of LEA-Net, whereas the vertical axis represents the \(F_1\) score. Each bar signifies the average \(F_1\) score, and an error bar indicates the standard deviation. The quintet of bars arrayed along the horizontal axis illustrates the performance of LEA-Net corresponding to each attention point. The results in Fig. 7 indicate that the anomaly detection performance depends on the attention points, especially for PaDiM. However, in the cases of Color Reconstruction and Auto-Encoding, we did not observe such dependencies except for the carpet category.

Fig. 7 Comparison of \(F_1\) scores for different attention points

Discussion

Figure 7 demonstrates that the choice of attention points significantly influences anomaly detection performance, an influence that concurrently depends on the type of anomaly map in use. To elucidate this, we conducted a comparative study of attention maps at various points, as presented in Fig. 8. These maps are accompanied by their corresponding \(F_1\) scores for the MVTec AD tile category. In the figure, columns (a)–(c) correspond to the anomaly maps derived from the three generation approaches: (a) corresponds to Color Reconstruction, (b) to Auto-Encoding, and (c) to PaDiM. Well-localized anomaly maps are observed to substantially enhance detection efficacy when external attention is applied at the first through fourth attention points. Conversely, poorly localized, excessive attention maps tend to compromise performance, except when external attention is deployed at the final attention point. Emphasizing positional information about the anomaly is essential at shallow attention points, whereas emphasizing the degree of abnormality is critical at deep attention points. As positional information is beneficial for detecting anomalies, we can expect that the hierarchical representation from position to abnormality is vital for external attention to promote anomaly detection performance.

Fig. 8 Attention maps of LEA-Net at each attention point for MVTec AD tile

Conclusion

In this study, we have scrutinized the role of the external attention mechanism in enhancing the detection performance of CNN. Using the MVTec AD and PlantVillage datasets for empirical analysis, we have ascertained that layer-wise external attention effectively augments the performance of the baseline model in anomaly detection. The present findings indicate that the effectiveness of external attention is contingent upon the compatibility between the dataset and the anomaly map. Moreover, the data suggest that the focus on positional information is pivotal for shallower attention points, whereas the emphasis on abnormality becomes crucial at deeper attention points. Intriguingly, we also observed that detection performance was appreciably amplified by external attention, even when dealing with low-intensity anomaly maps. In conclusion, the positional features within anomalies assume greater importance than the overall intensity and appearance of the anomaly map. Therefore, a well-localized positional feature within an anomaly map serves as a key determinant of the effectiveness of layer-wise external attention for anomaly detection.