Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection

Nakanishi, Keiichi; Shiroma, Ryo; Hayakawa, Tokihisa; Katafuchi, Ryoya; Tokunaga, Terumasa

doi:10.1007/s42979-024-02912-3

Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection

Original Research
Open access
Published: 28 May 2024

Volume 5, article number 592, (2024)
Cite this article

Download PDF

You have full access to this open access article

SN Computer Science Aims and scope Submit manuscript

Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection

Download PDF

Keiichi Nakanishi¹,
Ryo Shiroma¹^na1,
Tokihisa Hayakawa¹^na1,
Ryoya Katafuchi¹^na1 &
…
Terumasa Tokunaga²^na1

345 Accesses
Explore all metrics

Abstract

The external attention mechanism offers a promising approach to enhance image anomaly detection (Hayakawa et al., in: IMPROVE, pp. 100-–110, 2023). Nevertheless, the effectiveness of this method is contingent upon the judicious selection of an intermediate layer with external attention. In this study, we performed a comprehensive series of experiments to clarify the mechanisms through which external attention improves detection performance. We assessed the performance of the LEA-Net (Hayakawa et al., in: IMPROVE, pp. 100–110, 2023), which implements layer-wise external attention, using MVTec AD and Plant Village datasets. The detection performances of the LEA-Net were compared with that of the baseline model under different anomaly maps generated by three unsupervised approaches. In addition, we investigated the relationship between the detection performance of LEA-Net and the selection of an attention point, which means an intermediate layer where external attention is applied. The findings reveal that the synergy between the dataset and the generated anomaly map influenced the effectiveness of the LEA-Net. For poorly localized anomaly maps, the selection of the attention point becomes a pivotal factor in determining detection efficiency. At shallow attention points, a well-localized attention map successfully notably improves the detection performance. For deeper attention points, the overall intensity of the attention map is essential; this intensity can be substantially amplified by layer-wise external attention, even for a low-intensity anomaly map. Overall, the results suggest that for layer-wise external attention, the positional attributes of anomalies hold greater significance than the overall intensity or visual appearance of the anomaly map.

Transformer-based contrastive learning framework for image anomaly detection

Article 03 May 2023

A weakly supervised anomaly detection method based on deep anomaly scoring network

Article 19 May 2023

Multiresolution feature guidance based transformer for anomaly detection

Article 24 January 2024

Find the latest articles, discoveries, and news in related topics.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Anomaly detection serves as a critical technique for identifying anomalous patterns in voluminous datasets, holding particular relevance in the analysis of imaging data. This technology finds applications in diverse domains, including but not limited to medical diagnosis [2, 3], plant healthcare [4], surveillance video [5], and disaster detection [6, 7]. Recent advancements in deep learning have propelled a surge of scholarly interest in the development of automated anomaly detection methods for expansive image datasets. Based on machine learning research, these techniques can be categorized into three primary classes: supervised, semi-supervised, and unsupervised methodologies. Despite each approach’s unique merits and limitations, the predominant challenge is the efficient identification of anomalies based on a limited number of anomalous instances.

Convolutional neural networks (CNN) represent a prevalent architecture in the landscape of computer vision, offering robust solutions for tasks such as image recognition and segmentation. Utilizing substantial labeled datasets, CNN has achieved state-of-the-art image anomaly detection performance in real-world anomaly detection applications [6, 8]. Nonetheless, CNN-based anomaly detectors frequently grapple with the scarcity of labeled instances and a low incidence of anomalies. Several studies have developed strategies to ameliorate these constraints, including the incorporation of active learning [9] and the deployment of transfer learning [6] to enhance CNN’s learning efficiency.

Unsupervised learning methods have achieved wide acceptance in the domain of anomaly detection, primarily because they eliminate the need for labeled anomalous samples during the training phase. A conventional approach in unsupervised image anomaly detection relies on the utilization of deep convolutional auto-encoders to reconstruct normal images [10]. However, these auto-encoders sometimes falter in the precise reconstruction of fine structures, leading to the generation of excessively blurry images. To counter this limitation, Generative Adversarial Networks (GAN) has been introduced into the field. AnoGAN [11] pioneered the application of GAN for image anomaly detection. Moreover, AnoGAN has been adapted to the field assess color reconstructability, thereby enabling the sensitive detection of color anomalies [12]. In unsupervised anomaly detection, it is a commonly adopted practice to quantify the deviation between the original image and the reconstructed image as the Anomaly Score.

Although unsupervised anomaly detectors eliminate the need for labeling anomalous instances during training, they pose certain shortcomings. Primarily, these detectors are susceptible to overlooking subtle and minute anomalies because the Anomaly Score is predicated upon the distance between the standard and test images. Therefore, the effectiveness of unsupervised anomaly detectors is contingent upon the robust formulation of an Anomaly Score for specific objectives. Second, the appropriate threshold of the Anomaly Score should be carefully tuned to classify normal and anomalous instances accurately. This calibration frequently entails a laborious process of trial and error.

Recent advancements in visual attention mechanisms have garnered considerable traction in computer vision [13]. Attention branch network (ABN) incorporates a branching structure termed Attention Branch [14]. The attention maps from this branch serve as visual explications to describe the decision-making process within CNN. These attention maps have been demonstrated to improve CNN performance across various image classification tasks.

The visual attention mechanism has realized robust prediction for imbalanced data to utilize contrastive learning in anomaly detection [15]. This research raises the intriguing prospect of integrating visual attention into image anomaly detection schemes. Nonetheless, existing visual attention modules, including ABN, predominantly rely on self-attention mechanisms [16, 17]. Consequently, the quality of attention in these modules is intrinsically linked to the network’s overall performance, thereby limiting their direct applicability in enhancing the performance of anomaly detectors.

In a preceding study, the layer-wise external attention network (LEA-Net) was introduced to enhance CNN-based anomaly detection through the incorporation of an external attention mechanism. The external attention mechanism leverages prior knowledge from some external sources; LEA-Net utilizes the outputs of the other network pre-trained. As discussed, unsupervised and supervised anomaly detectors have their problems. To address these problems, LEA-Net consolidated supervised and unsupervised anomaly detection algorithms through the lens of a visual attention mechanism. The burgeoning advancements in visual attention mechanisms intimate the feasibility of leveraging prior knowledge in anomaly detection. The strategies described in [1] include the following:

The pre-existing knowledge concerning anomalies is articulated through an anomaly map constructed via the unsupervised learning of normal instances.
Subsequently, this anomaly map is transformed into an attention map by an auxiliary network.
The attention map is then incorporated into the intermediate layers of the anomaly detection network (ADN).

In line with this strategy, the effectiveness of layer-wise external attention in image anomaly detection was assessed through comprehensive experiments utilizing publicly accessible, real-world datasets. The findings revealed that layer-wise external attention reliably enhanced the performance of anomaly detectors, even with limited data. Further, the results suggested that the external attention mechanism can synergistically operate with the self-attention mechanism to enhance anomaly detection capabilities.

Although the external attention mechanism holds considerable promise for setting a new paradigm in image anomaly detection, its effectiveness depends on the judicious selection of an intermediate layer equipped with external attention. To illustrate the layer-wise external attention mechanism improving anomaly detection performance, we conducted a series of more in-depth experiments. The principal contributions of this research are stated as follows:

We introduced an embedding-based approach, a Patch Distribution Modeling Framework (PaDiM) [18], to generate an anomaly map along with reconstruction approaches.
We comparatively analyzed the performance of LEA-Net with that of baseline models under various conditions to clarify the modes through which external attention improves the detection performance of CNN.
We discerned that the presence of well-localized positional features on an anomaly map is instrumental in successfully implementing layer-wise external attention.

Related Work

A more straightforward methodology was employed for the automated detection of thyroid nodule lesions in X-ray computed tomography images [19]. This technique leverages binary segmentation results acquired from a U-Net as input for supervised image classifiers. The authors demonstrated that such preprocessing via binary segmentation significantly enhances anomaly detection accuracy in practical applications. Similarly, the convolutional adversarial variational autoencoder with guided attention (CAVGA) employs anomaly maps in a weakly supervised setting to localize anomalous areas [20]. Through empirical evaluations using the MVTec AD dataset, CAVGA achieved state-of-the-art performance. Both studies substantiate the considerable promise of incorporating visual attention maps in image anomaly detection.

The concept of visual attention pertains to the selective refinement or amplification of image features for recognition tasks. The human perceptual system prioritizes information germane to the task over comprehensive data processing [21, 22]. Visual attention mechanisms emulate this human faculty in the context of image classification [16, 17, 23,24,25,26]. In most configurations, image classifiers incorporating visual attention use a Self-Attention mechanism, seamlessly integrating with pre-existing models. However, the utility of such visual attention mechanisms is intrinsically tied to the performance of the primary model, constituting a limitation inherent to Self-Attention approaches.

To address this issue, ABN [14] introduced an interactive editing feature for the attention map, thereby guiding the focus points of CNN towards more salient regions within images. Another notable approach in the realm of visual attention is Attention Transfer, rooted in the concept of knowledge distillation [27]. During knowledge distillation, a compact network, termed the student network, acquires foundational knowledge from a more extensive teacher network [28]. The premise of attention transfer rests on the hypothesis that a teacher network is predisposed to concentrate on more information-rich regions compared to a student network.

Most recently, the external attention mechanism has emerged, inspired by both ABN and the Attention Transfer network [1]. A salient feature of the external attention mechanism is its independence from user interaction for modulating attention in CNN. Moreover, unlike Self-Attention mechanisms, the efficacy of external attention is contingent upon prior knowledge derived from a pre-trained external network. This mechanism can be seamlessly integrated into any CNN model through an end-to-end training manner.

Layer-Wise External Attention Network

Figure 1 provides an overview of the Layer-wise External Attention Network (LEA-Net) [1]. As previously elucidated, LEA-Net is a paradigmatic implementation of the external attention mechanism. As depicted in Fig. 1, LEA-Net is bifurcated into two principal components: anomaly map generation via an unsupervised network and anomaly detection by supervised networks. These components will be elaborated upon in the subsequent subsections.

Part 1: Anomaly Map Generation

For anomaly map generation, prior knowledge on normal data is instrumental in identifying a diverse range of real-world anomalies. In practical applications, a pixel-wise scoring metric, known as an anomaly map, is ubiquitously employed for anomaly detection. In this study, we introduce three distinct unsupervised methodologies for generating anomaly maps to explore the dependency of external attention effectiveness on these maps. Figure 2 summarizes the characteristics of these three unsupervised strategies: (a) Color Reconstruction, (b) Auto-Encoding, and (c) PaDiM [18]. These methodologies can be taxonomically divided into two categories: reconstruction-based approaches, comprising (a) and (b), and an embedding-based approach, represented by (c). Further details are provided in the following section.

Color Reconstruction

The procedure for the color reconstruction task is delineated in Fig. 2(a). Initially, a grayscale image is extracted from an input color image, which is represented in the $L^{*}a^{*}b^{*}$ color space. Subsequently, the grayscale image undergoes a transformation back to a color image utilizing U-Net [29]. Additionally, chrominance information $a^{*}b^{*}$ is predicted based on the luminance information $L^{*}$ within the $L^{*}a^{*}b^{*}$ color space. The predicted $a^{*}b^{*}$ is then amalgamated with the $L^{*}$ from the input color image to yield the final image in $L^{*}a^{*}b^{*}$ color space. The U-Net model is pre-trained solely on normal instances to facilitate this color reconstruction process. Ultimately, a color anomaly map is generated by evaluating the CIEDE2000 [30] color difference between the reconstructed and original images. For an in-depth discussion of color anomaly map generation, the reader is directed to [12].

Auto-Encoding

For the auto-encoding task, we employ U-Net. An anomaly map is synthesized by computing the pixel-wise absolute deviation between the reconstructed and original images.

PaDiM

PaDiM is an embedding-based methodology optimized for anomaly detection in industrial settings, boasting state-of-the-art results on the MVTec AD dataset. Specifically, PaDiM leverages a pre-trained network for patch embedding to compile statistical data on normal instances. It then calculates the Mahalanobis distance between the feature vector derived from an input image and that obtained from normal instances. In the context of this study, the ResNet-50 model [31], trained on the ImageNet dataset [32], serves as the feature extractor for PaDiM.

Part 2: Anomaly Detection Using Supervised Networks

The intricacies of the anomaly attention network (AAN) and the anomaly detection network (ADN) are elucidated in Fig. 1. AAN serves the function of transforming an anomaly map into an attention map. This transformation modulates the complexity, certainty, and sharpness of the attention map in alignment with the progress of training and hierarchical representation in ADN. This mechanism can be conceptualized as a form of reverse Curriculum Learning. In contrast, ADN operates as a supervised network that categorizes images as normal or abnormal. It receives the attention map from AAN and generates the final classification for anomaly detection. These two networks are interconnected at an intermediate layer via an attention block, with the attention map designed to underscore informative regions on a feature map within ADN.

The structural specifics of AAN and ADN are listed in Table 1. Within this context, the term attention point refers to the intermediate layer at which external Attention is implemented. In a practical setting, multiple attention points can be designated for external Attention. In the scope of this study, ResNet-based architecture was utilized for AAN, and ResNet-18 was employed for ADN [31]. The ResNet-based AAN is architected by sequentially stacking residual blocks. The number of downsampling points in ADN is equal to or greater than those in AAN to facilitate straightforward network interconnection. In the experiments detailed in this paper, we opted for the inclusion of five attention points.

Table 1 Structures of AAN and ADN

Full size table

Training Process for LEA-Net

Figure 3 delineates the training regimen for LEA-Net, which is carried out in two sequential stages: (1) the training of an unsupervised network and (2) the training of supervised networks. In the first phase, we train the unsupervised network is trained using normal images with the aim of generating anomaly maps tailored for LEA-Net. Notably, if PaDiM for anis employed as the unsupervised network, this training step may be circumvented, with the exception of statistical computation.

Subsequently, to ascertain two predictive probabilities, both an original image and its corresponding anomaly map are input into ADN and AAN, respectively. The parameters of AAN and ADN are concurrently optimized by minimizing a loss function that constitutes a linear amalgamation of these two predictive probabilities. Let $x_{i} \in \mathbb {R}^{\textrm{H} \times \textrm{W} \times 3}$ represent the ith original input image. Here, $\textrm{H}$ and $\textrm{W}$ denote the height and width of images, respectively. Additionally, we define $i \in \{1, \ldots , \textrm{N}\}$, and $\textrm{N}$ is the number of training images. Let $x^{\prime }_{i} \in \mathbb {R}^{\textrm{H} \times \textrm{W}}$ represent the ith anomaly map. Furthermore, let $y_{i}\in \left\{ 0, 1\right\} $ denote a corresponding ground-truth label. Furthermore, let ${\hat{y}}^{\prime }_{i} \in \left[ 0, 1\right] $ and ${\hat{y}}_{i} \in \left[ 0, 1\right] $ indicate the i-th predictive probabilities for AAN and ADN, respectively. The loss function for the entire classification network can be expressed as the sum of the two loss functions:

$$\begin{aligned} L = \dfrac{1}{\textrm{N}} \sum _{i = 1}^{\textrm{N}} \textrm{BCE} ({\hat{y}}^{\prime }_{i}, y_{i}) + \dfrac{1}{\textrm{N}} \sum _{i = 1}^{\textrm{N}} \textrm{BCE}({\hat{y}}_{i}, y_{i}). \end{aligned}$$

In this context, $\mathrm {BCE(\cdot )}$ signifies the binary cross-entropy. The loss function as designed with the anticipation that AAN would efficaciously modify attention maps during the training phase, analogous to ABN [14].

We illustrate the architecture of the attention block. Its composition is straightforward, consisting merely of a channel-wise average pooling layer, denoted as $\phi (\cdot )$, and a sigmoid layer, represented as $\sigma $. The channel-wise average pooling layer functions by averaging the extracted features per channel. Let us denote the attention map at point $p \in \{1,2,3,4,5\}$ as $M_p~\in ~\mathbb {R}^{\mathrm {H_p} \times \mathrm {W_p}}$, such that $\mathrm {H_p}$ and $\mathrm {W_p}$ represent the height and width of features at an attention point. Let $g_{p}(x^{\prime }_i) \in \mathbb {R}^{\mathrm {H_p} \times \mathrm {W_p} \times \mathrm {C_{p}}}$ represent the feature tensor at an attention point p in AAN for input image $x^{\prime }_i$, and $\mathrm {C_{p}}$ indicate the number of channels. Therefore, $M_p$ is generated as follows:

$$\begin{aligned} M_p = \sigma (\phi (g_p(x^{\prime }_i))). \end{aligned}$$

Here, we adopted channel-wise average pooling layer $\phi (\cdot )$ instead of a convolution layer size of $1 \times 1$, anticipating the same effect reported by [17]. A sigmoid layer $\sigma (\cdot )$ normalizes the feature map $\phi (g_{p}(x^{\prime }_i))$ within the range of [0, 1]. It has been reported that the normalization of the attention map can effectively highlight informative regions [23]. Additionally, the sigmoid function prevents attention maps and ADN features from reversing the significance by multiplying negative values.

Our attention mechanism aims to highlight the informative regions on feature maps, instead of erasing other regions [23]. To mitigate the risk of inadvertently erasing the informative regions through the attention maps, we integrated the attention map into ADN as expressed in the following equation:

$$\begin{aligned} f^{\prime }_{p}(x_{i}) = (1\oplus M_p) \otimes f_p(x_{i}), \end{aligned}$$

where $\oplus $ denotes the element-wise sum, $\otimes $ denotes the element-wise product, and $f^{\prime }_{p}(x_{i})$ represents the updated feature tensor at point p in ADN after the external attention mechanism.

This particular attention strategy is also designed to circumvent the Dying ReLU problem [33]. The issue arises when a large amount of parameters possessing negative values are transformed to zero due to the ReLU function, thereby engendering the vanishing gradient problem. Note that AAN and ADN often exhibit substantial disparities in their feature maps. Additionally, as the network layers deepen, the feature maps tend to become increasingly sparse. In case such sparse features are only multiplied at attention points, the performance of ADN would degrade severely. This factor serves as an additional rationale for adopting our current attention strategy instead of the straightforward multiplication of the attention map.

Experiments

In the experimental section, we conducted anomaly detection tests employing LEA-Net across various conditions, utilizing both the MVTec AD [34] and PlantVillage [8] datasets. The particulars of the datasets, experimental configurations, and results are elaborated upon herein.

Datasets

As illustrated in Fig. 4, we provide exemplars of the data engaged in this investigation. We assessed performance in image anomaly detection with two principal datasets: MVTec AD and PlantVillage. The MVTec AD dataset comprises both defect-free and anomalous images across many objects and texture categories. Specifically, we employed the carpet, hazelnuts, leather, screw, tile, wood, and zipper sub-datasets. On the other hand, PlantVillage contains images of healthy and diseased leaves from multiple plant species. We utilized the potato, grape, and strawberry sub-datasets. Within PlantVillage, we also employed datasets devoid of background elements by image segmentation [35].

The lowermost panels in Fig. 4 depict the generated anomaly maps. Prior to conducting the experiments, all images were standardized to dimensions of $256 \times 256$ pixels. Table 2 delineates the specific numbers of images included within the training datasets. To appraise the performance on datasets reflective of real-world scenarios, we extracted images from each dataset at random to assemble smaller, imbalanced datasets.

Table 2 Small and imbalanced experimental datasets, reconstructed by random sampling

Full size table

Experimental Setup

In this section, we delineate the experimental setup employed for training purposes. For the anomaly detection tests, stratified five-fold cross-validation was executed on each dataset, eschewing data augmentation.

The training regimen for both Color Reconstruction and Auto-Encoding utilized the Adam optimizer, furnished with a learning rate of 0.0001. The momentums of the optimizer were configured as $\beta _1 = 0.9$ and $\beta _2 = 0.999$. The parameters were updated over a total of 500 epochs, with a batch size of 16. The early stopping was implemented to enhance computational efficiency. In contrast, PaDiM circumvents the need for training model parameters but necessitates the extraction of embedding vectors from a pre-trained model. In the context of this paper, these vectors were derived from the output of three disparate layers of a ResNet-50 model pre-trained on ImageNet. Notably, the maximum dimensions of the feature map stand at $28\times 28$, whereas the embedding vectors incorporated 1000 embedding vectors with randomly selected dimensions.

The parameters of LEA-Net, including AAN (ResNet-based) and ADN (ResNet-18), were optimized using Adam optimizer with a learning rate of 0.0001. The momentums of Adam were consistent at $\beta _1 = 0.9$ and $\beta _2 = 0.999$. A total of 100 epochs were selected to update these parameters, with the batch size maintained at 16. Computational tasks were executed on a system equipped with a GeForce RTX 2028 Ti GPU, running Python 3.10.12 and CUDA 11.8.89.

Comparison of Supervised Networks and LEA-Net

The primary objective of LEA-Net is to augment the performance of the baseline network, which is trained in a purely supervised fashion in the realm of anomaly detection. To assess the efficacy of the external attention mechanism, we conducted comparative analyses on image-level anomaly detection performance across several models: (i) ResNet-18 as the baseline, (ii) ResNet-50, (iii) LEA-Net informed by anomaly maps generated through a color reconstruction task, denoted as LEA-Net (Color Reconstruction), (iv) LEA-Net guided by anomaly maps generated through auto-encoding, referred to as LEA-Net (Auto-Encoding), and (v) LEA-Net shaped by anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM).

For each of these models (i)–(v), the network output threshold was fixed at 0.5 to facilitate the computation of $F_1$ scores. Figure 5 reveals an enhancement in $F_1$ scores for the baseline model (ResNet-18) due to the implementation of the external attention mechanism. The horizontal axis demarcates the categories of datasets employed in the experiments, while the bars signify the average $F_1$ scores ascertained through cross-validation. We report only the maximal $F_1$ score among the five selected attention points. Additionally, error bars represent the standard deviation, and bars corresponding to the highest average $F_1$ score in each category are marked with a black inverted triangle.

As indicated in Fig. 5, the external attention mechanism substantially elevates the baseline model’s performance across all datasets. Most notably, the $F_1$ scores in MVTec AD’s carpet, tile, and wood categories witnessed an average improvement of approximately $14.3\%$. Interestingly, ResNet-50 underperformed compared to ResNet-18 in specific instances, such as the carpet category. Furthermore, the parameter counts for ResNet- 18, ResNet-50, and LEA-Net are $11.2\textrm{M}$, $23.5\textrm{M}$, and $15.6\textrm{M}$, respectively. This observation substantiates that the sheer number of model parameters is not pivotal in achieving the superior performance of LEA-Net.

Comparison of Unsupervised Networks and LEA-Net

To rigorously evaluate the efficacy of LEA-Net, we juxtaposed its performance with that of a straightforward thresholding method applied to anomaly maps. We assessed the image-level anomaly detection capability by computing $F_1$ scores in the following settings: (i) LEA-Net employing anomaly maps generated through color reconstruction, denoted as LEA-Net (Color Reconstruction), (ii) LEA-Net utilizing anomaly maps formed via auto-encoding, termed LEA-Net (Auto-Encoding), (iii) LEA-Net with anomaly maps generated based on PaDiM, labelled as LEA-Net (PaDiM), (iv) Direct thresholding of anomaly maps originated from color reconstruction, identified as Color Reconstruction, (v) Direct thresholding of anomaly maps produced through auto-encoding, referred to as Auto-Encoding, and (vi) Direct thresholding of anomaly maps emanating from PaDiM, designated as PaDiM. For configurations (i)–(iii), the threshold for calculating $F_1$ scores is set at 0.5.

As depicted in Fig. 6, LEA-Net consistently outperforms the straightforward thresholding approach in the contexts of both Color Reconstruction and Auto-Encoding across all datasets. Specifically, it is noteworthy that LEA-Net considerably enhances performance across most of the PlantVillage dataset. Moreover, PaDiM yields superior results compared to LEA-Net on the MVTec AD dataset, except for the hazelnuts category.

Dependence on the Selection of the Attention Points

In this section, we focused on evaluating the influence of attention point selection on the efficacy of anomaly detection. As depicted in Fig. 7, we contrast the detection performance of LEA-Net when configured with different attention points. The horizontal axis portrays the generative methods employed for the anomaly maps of the LEA-Net, whereas the vertical axis represents the $F_1$ score. Each bar signifies the average $F_1$ scores, and an error bar indicates the standard deviation. The quintet of bars arrayed along the horizontal axis illustrates the performance of LEA-Net corresponding to each attention point. The results in Fig. 7 indicate that the anomaly detection performance depends on the attention points, especially for PaDiM. However, in cases of Color Reconstruction and Auto-Encoding, we did not observe such dependencies except for the carpet.

Discussion

Specifically, Fig. 7 demonstrates that the choice of attention points significantly influences anomaly detection performance, an influence that concurrently depends on the type of anomaly in use. To elucidate this, we conducted a comparative study of attention maps at various points, as presented in Fig. 8. These maps are accompanied by their corresponding $F_1$ scores for the MVTec AD tile category. In the figure, columns (a)–(c) correlate to distinct anomaly maps derived from separate reconstruction tasks: (a) corresponds to Color Reconstruction, (b) to Auto-Encoding, and (c) to PaDiM. Well-localized anomaly maps are observed to substantially enhance detection efficacy when external attention is applied at the first through fourth attention points. Conversely, poorly localized, excessive attention maps tend to compromise performance, except when external attention is deployed at the final attention point. The emphasis of positional information on anomaly is essential for shallow attention points, whereas that of abnormality is critical for deep attention points. As positional information is beneficial for detecting anomalies, we can expect that the hierarchical representation from position to abnormality is vital for external attention to promote anomaly detection performance.

Conclusion

In this study, we have scrutinized the role of the external attention mechanism in enhancing the detection performance of CNN. To use the MVTec AD and PlantVillage datasets for empirical analysis, we have ascertained that layer-wise external attention effectively augments the performance of the baseline model in anomaly detection. The present findings indicate that the effectiveness of external attention is contingent upon the compatibility between the dataset and the anomaly map. Moreover, the data suggests that the focus on positional information is pivotal for shallower attention points, whereas the emphasis on abnormality becomes crucial at deeper attention points. Intriguingly, we also observed that the overall intensity of appreciably amplified by external attention, even when dealing with low-intensity anomaly maps. In conclusion, the positional features within anomalies assume greater importance than the overall intensity and appearance of the anomaly map. Therefore, a well-localized positional feature within an anomaly map serves as a key determinant in the effectiveness of the layer-wise external attention for anomaly detection.

Data Availability Statement

The data used in this study will be made available by contacting the authors directly.

References

Hayakawa T, Nakanishi K, Katafuchi R, Tokunaga T. Layer-wise external attention for efficient deep anomaly detection. In: IMPROVE. 2023. p. 100–110.
Rezvantalab A, Safigholi H, Karimijeshni S. Dermatologist level dermoscopy skin cancer classification using different deep learning convolutional neural networks algorithms. 2018. arXiv:1810.10348.
Cao C, Liu F, Tan H, Song D, Shu W, Li W, Zhou Y, Bo X, Xie Z. Deep learning and its applications in biomedicine. Genom Proteom Bioinform. 2018;16(1):17–32.
Article Google Scholar
Ferentinos KP. Deep learning models for plant disease detection and diagnosis. Comput Electron Agric. 2018;145:311–8.
Article Google Scholar
Roka S, Diwakar M. Cvit: a convolution vision transformer for video abnormal behavior detection and localization. SN Comput Sci. 2023;4(6):829.
Article Google Scholar
Minhas MS, Zelek J. Anomaly detection in images. 2019. arXiv:1905.13147.
Natarajan V, Mao S, Chia L-T. Salient textural anomaly proposals and classification for metal surface anomalies. In: 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI). 2019. p. 621–28. https://doi.org/10.1109/ICTAI.2019.00092.
Hughes DP, Salathe M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. 2016. arXiv:1511.08060.
Görnitz N, Kloft M, Rieck K, Brefeld U. Toward supervised anomaly detection. J Artif Intell Res (JAIR). 2013;46:235–62.
Article MathSciNet Google Scholar
Haselmann M, Gruber DP, Tabatabai P. Anomaly detection using deep learning based image completion. 2018.
Schlegl T, Seebock P, Waldstein SM, Schmidt-Erfurth U, Langs G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International conference on information processing in medical imaging. 2017. p. 146–57.
Katafuchi R, Tokunaga T. Image-based plant disease diagnosis with unsupervised anomaly detection based on reconstructability of colors. 2021. p. 112–20. https://doi.org/10.5220/0010463201120120.
Zhao H, Jia J, Koltun V. Exploring self-attention for image recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. 2020. p. 10073–82. https://doi.org/10.1109/CVPR42600.2020.01009. arXiv:2004.13621.
Fukui H, Hirakawa T, Yamashita T, Fujiyoshi H. Attention branch network: learning of attention mechanism for visual explanation. 2019.
Takimoto H, Seki J, Situju SF, Kanagawa A. Anomaly detection using siamese network with attention mechanism for few-shot learning. Appl Artif Intell. 2022;36(1):2094885.
Article Google Scholar
Hu J, Shen L, Albanie S, Sun G, Wu E. Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell. 2020;42(8): 2011–23. https://doi.org/10.1109/TPAMI.2019.2913372. arXiv:1709.01507.
Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) LNCS, vol. 11211. 2018. p. 3–19. https://doi.org/10.1007/978-3-030-01234-2_1. arXiv:1807.06521.
Defard T, Setkov A, Loesch A, Audigier R. Padim: a patch distribution modeling framework for anomaly detection and localization. In: International conference on pattern recognition. Springer. 2021. p. 475–89.
Li W, Cheng S, Qian K, Yue K, Liu H. Automatic recognition and classification system of thyroid nodules in CT images based on CNN. Comput Intell Neurosci. 2021. https://doi.org/10.1155/2021/5540186.
Article Google Scholar
Venkataramanan S, Peng K-C, Singh RV, Mahalanobis A. Attention guided anomaly localization in images. In: European conference on computer vision. Springer. 2020. p. 485–503.
Reynolds JH, Chelazzi L. Attentional modulation of visual processing. Annu Rev Neurosci. 2004;27:611–47.
Article Google Scholar
Chun MM, Golomb JD, Turk-Browne NB. A taxonomy of external and internal attention. Annu Rev Psychol. 2011;62:73–101.
Article Google Scholar
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. Proceedings—30th IEEE conference on computer vision and pattern recognition, CVPR 2017 2017. 2017. p. 6450–58. https://doi.org/10.1109/CVPR.2017.683. arXiv:1704.06904.
Lee H, Kim HE, Nam H. SRM: a style-based recalibration module for convolutional neural networks. In: Proceedings of the IEEE international conference on computer vision 2019. 2019. p. 1854–62. https://doi.org/10.1109/ICCV.2019.00194.
Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. 2020. 11531–39. https://doi.org/10.1109/CVPR42600.2020.01155. arXiv:1910.03151.
Yang L, Zhang R-Y, Li L, Xie X. Simam: a simple, parameter-free attention module for convolutional neural networks. In: International conference on machine learning. PMLR. 2021. p. 11863–74.
Zagoruyko S, Komodakis N. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: 5th international conference on learning representations, ICLR 2017—conference track proceedings. 2017. p. 1–13. arXiv:1612.03928.
Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015.
Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2015. p. 234–41.
Sharma G, Wu W, Dalal EN. The CIEDE2000 color-difference formula: implementation notes, supplementary test data, and mathematical observations. Color Res Appl. 2005;30(1):21–30.
Article Google Scholar
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–78.
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.
Lu L. Dying relu and initialization: theory and numerical examples. Commun Comput Phys. 2020;28(5):1671–706. https://doi.org/10.4208/cicp.oa-2020-0165.
Article MathSciNet Google Scholar
Bergmann P, Fauser M, Sattlegger D, Steger C. MVTec AD—a comprehensive real-world dataset for unsupervised anomaly detection. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2019.
Mohanty SP. PlantVillage-Dataset. GitHub. 2023.

Download references

Acknowledgements

The authors thank the ENAGO Group (English Editing Company) for editing a draft of this paper.

Funding

This work was supported by JSPS KAKENHI Grant Number 22K12169, JST PREST Grant Number JPMJPR1875, and NEDO Intensive Support Program for Young Promising Researchers Grant Number 21W2K034.

Author information

Ryo Shiroma, Tokihisa Hayakawa, Ryoya Katafuchi, and Terumasa Tokunaga have contributed equally to this work.

Authors and Affiliations

Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4, Kawazu, Iizuka-shi, Fukuoka-ken, 820-8502, Japan
Keiichi Nakanishi, Ryo Shiroma, Tokihisa Hayakawa & Ryoya Katafuchi
Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4, Kawazu, Iizuka-shi, Fukuoka-ken, 820-8502, Japan
Terumasa Tokunaga

Authors

Keiichi Nakanishi
View author publications
You can also search for this author in PubMed Google Scholar
Ryo Shiroma
View author publications
You can also search for this author in PubMed Google Scholar
Tokihisa Hayakawa
View author publications
You can also search for this author in PubMed Google Scholar
Ryoya Katafuchi
View author publications
You can also search for this author in PubMed Google Scholar
Terumasa Tokunaga
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Authors Mr. Keiichi Nakanishi, Mr. Ryo Shiroma, and Dr. Terumasa Tokunaga conducted the experiments, analyzed the results, and wrote the manuscript. Author Mr. Tokihisa Hayakawa also contributed to writing the manuscript. Author Mr. Ryoya Katafuchi provided significant intellectual content in this study.

Corresponding authors

Correspondence to Keiichi Nakanishi or Terumasa Tokunaga.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Research involving human and/or animals

Not applicable.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Nakanishi, K., Shiroma, R., Hayakawa, T. et al. Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection. SN COMPUT. SCI. 5, 592 (2024). https://doi.org/10.1007/s42979-024-02912-3

Download citation

Received: 11 October 2023
Accepted: 18 April 2024
Published: 28 May 2024
DOI: https://doi.org/10.1007/s42979-024-02912-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Layer-Wise External Attention by Well-Localized Attention Map for Efficient Deep Anomaly Detection

Abstract

Similar content being viewed by others

Transformer-based contrastive learning framework for image anomaly detection

A weakly supervised anomaly detection method based on deep anomaly scoring network

Multiresolution feature guidance based transformer for anomaly detection

Explore related subjects

Introduction

Related Work

Layer-Wise External Attention Network

Part 1: Anomaly Map Generation

Color Reconstruction

Auto-Encoding

PaDiM

Part 2: Anomaly Detection Using Supervised Networks

Training Process for LEA-Net

Experiments

Datasets

Experimental Setup

Comparison of Supervised Networks and LEA-Net

Comparison of Unsupervised Networks and LEA-Net

Dependence on the Selection of the Attention Points

Discussion

Conclusion

Data Availability Statement

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Conflict of interest

Research involving human and/or animals

Informed consent

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation