1 Introduction

Deep learning now enables the automation of many industrial tasks, reducing dependence on human intervention and improving efficiency; production-line automation, data management for industrial sensors, and predictive maintenance are some of the main applications. Industry benefits from deep learning for inspection and quality control: deep neural networks can detect defects or anomalies in goods more accurately and quickly than manual inspection. Anomaly detection in industrial image data is of utmost importance for many tasks in computer vision [1]. However, training deep learning models requires large amounts of high-quality data and, in the field of surface analysis, images acquired in the industrial environment very often contain sections that are not part of the surface to be inspected [2].

Consider, for example, images of products running on conveyor rollers, products connected to other components not subject to inspection, or simply images of the edge of a product, which inevitably include part of the background. In many cases, if we know the shape of the product to be inspected, traditional image processing techniques are enough to remove the useless parts of the images. In other cases, however, we do not know the exact shape of the product or where the background appears in the image.

In order to focus on the relevant parts of an image, this work proposes and tests a systematic way to train a CNN so that it attends only to the area of interest. To do so, we identify the pixels that, according to the CNN, are the most important for classification. We then calculate how much these pixels overlap with the mask provided, for each image, during the training phase. The computed overlap value is added to the loss of the network, forcing it to locate the most important pixels only within the area marked by the masks.

The rest of the paper is organized as follows: Section 3 describes the problem and the scenario where this work is placed, Section 4 describes the idea behind this work, the mathematical formulation, and the algorithm for the custom loss developed in this paper. Section 5 illustrates the experiments and the results obtained on the various datasets. In Sections 2 and 6 we present related work and conclusions respectively.

2 Related work

In Machine Learning, anomaly detection has long been a topic of great interest, especially in Industry 4.0, where identifying defects is one of the major tasks of computer vision. Various articles survey anomaly detection in the literature [3, 4]. Our work focuses primarily on the use of CNNs for structural defect detection in the monitoring of manufacturing lines.

The vast majority of CNN-based approaches have been used to study anomalies over the whole image area. Weimer et al. [5] investigate CNNs to overcome the difficulty of manually redefining a specific feature representation for each new industrial inspection problem. In [6] the authors show how a CNN with Triplet Loss [7] can be used to identify anomalies in industrial environments.

In the context of anomaly detection in industrial images, Akcay et al. [8] introduced an approach based on Generative Adversarial Networks (GANs) [9]. It was designed to address a common challenge in industry: the sample of positive examples (usually representing anomalies) is limited, while negative examples (normal images) are abundant. The GAN learns the distribution of the class of interest and uses the difference between a reconstructed image and the input image to detect anomalies. This methodology has proven effective for anomaly detection even in scenarios where the number of positive examples is low. An and Cho [10] use a Variational Autoencoder (VAE) [11] for anomaly detection. However, GAN- and VAE-based approaches tend to be more effective at reconstructing simple anomalies; they may encounter difficulties with images that contain noise, such as the backgrounds commonly found in industrial environments. This remains an open challenge.

Ferrari et al. [12] illustrate an architecture consisting of a GAN that performs the reconstruction and denoising processes and an image segmentation model capable of detecting defects. The discriminative network is trained using an Area of Interest (AOI) for each image in the training dataset, so the network learns in which area the defects are relevant and the need for pre-processing algorithms is reduced. The model was tested on MVTec's anomaly detection dataset and on a large industrial dataset.

Moon et al. [13] show the importance of using Class Activation Maps (CAMs) to check whether the neural network focuses on the area of interest. The authors analyze the CNN architecture in detail using CAM images along with several evaluation metrics to optimize the CNN. Recently, patch-based imaging has been shown to be effective in the segmentation and recognition of anomalies [14, 15]. In [16], the authors introduce the use of Grad-CAM to build a self-supervised method that removes image noise for robust anomaly detection. Venkataramanan et al. [17] use the activation map to guide autoencoder training, reducing the network's attention on abnormal areas and increasing it on normal areas in order to strengthen anomaly detection. Song et al. [18] propose an interesting methodology based on an Anomaly Segmentation Network (AnoSeg), developed to generate an anomaly map that allows anomalous regions of the image to be segmented effectively. AnoSeg represents a significant contribution because it addresses the challenge of not only detecting anomalies but also segmenting them precisely, which is particularly useful in industrial contexts where both the presence of anomalies and their spatial extent matter. However, like other neural-network-based methods, AnoSeg can be affected by the presence of noise or background in the image, which poses a significant challenge in the industrial environment.

In our approach, we use Grad-CAM [19] to add a penalty when the neural network detects an anomaly outside the AOI. Our methodology differs from the previously mentioned methods because the neural network focuses on distinguishing imperfections in a specific area of the image and not on the entire image. This allows the CNN to learn to distinguish anomalies in the area of interest from noise generated by a heterogeneous background, thus addressing some of the challenges associated with anomaly detection in industrial images.

Fig. 1 Visual examples of possible problems encountered during surface analysis

These detection systems can also be applied in contexts other than anomaly detection. In the field of marine detection, for example, several algorithms have been developed that detect objects while removing noise, using attention-based spatial pyramid pooling networks and a bidirectional feature fusion strategy [20, 21]. In [22], the authors use a multi-path DCNN model, dividing the image into three areas of interest and carefully examining each part.

3 Problem description

As mentioned before, the problem stems from the heterogeneity of the images to be analyzed in an industrial environment. In some cases, it is not possible to perfectly isolate the piece of surface to be analyzed, due to the shape of the object or to the environment in which the image is acquired. These problems can bring a lot of useless information into the dataset, which must nonetheless be processed by the vision systems (neural networks in this case). Figure 1, on the left, shows a representation of a surface with intrinsic features, such as a tapped hole for a screw, which could be recognized as a defect since it is not present in all images and not always in the same location. On the right, there is a representation of a curved surface that, due to the curvature itself, can present dark areas with light refractions that might lead a neural network to mistake them for defects [23]. This useless information can alter the output of the network, making it inefficient or unusable. Figure 2 shows an example that illustrates the problem well.

Fig. 2 Example image taken from the MVTec AD dataset [24, 25] with two defects: one inside (red circle) and one outside (green circle) the area of interest. For classification purposes, the damage outside the area of interest must not be considered a defect

To solve this problem, we need to focus the vision system on a specific Area of Interest (AOI), making sure that the information contained in this area weighs more than the information in the rest of the image.

This problem resembles instance/image segmentation but, unfortunately, in the industrial environment and in the anomaly detection field we do not know the defects a priori. Thus, we cannot generate the masks that highlight the important parts of the images and use them as labels to train a segmentation network. For this reason, we focus on standard Convolutional Neural Networks (CNNs) performing binary classification to identify the images with anomalies.

Fig. 3 Extraction of the hottest pixels

Fig. 4 Cross-Entropy Overlap Distance training phase

4 Cross-entropy overlap distance

The idea is to obtain a value that expresses how much two areas within an image overlap. As the overlap value, we use the Overlap Coefficient (also known as the Szymkiewicz-Simpson coefficient) [26]. The objects for which we calculate the overlap are the mask provided during training (present in the dataset) and the region of the image that the CNN considers most significant for recognizing that image. To find the latter, we exploit an explanation algorithm called Gradient-weighted Class Activation Mapping (Grad-CAM) [19], which identifies the area of the image most involved in the network's decision. At the end of each forward pass of the training phase, we compute with Grad-CAM a heatmap that highlights the most important pixels of the input image (the hottest pixels). We then extract the hottest pixels (see Fig. 3) and calculate how much they overlap with the mask. The greater the overlap, the lower the penalty applied to the loss. This allows the network to learn which area of the image to focus on, so masks are no longer needed in the inference phase. Figure 4 shows the training phase with CEOD.
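For illustration, the hottest-pixel extraction step can be sketched as follows; the relative threshold and the normalization are illustrative assumptions, not values prescribed by our method:

```python
import numpy as np

def hottest_pixels(heatmap: np.ndarray, rel_threshold: float = 0.7) -> np.ndarray:
    """Binarize a Grad-CAM heatmap, keeping only the hottest pixels.

    The relative threshold (a fraction of the heatmap maximum) is an
    illustrative assumption.
    """
    normalized = heatmap / (heatmap.max() + 1e-8)  # scale to [0, 1]
    return (normalized >= rel_threshold).astype(np.float32)  # binary hot-pixel mask
```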

Fig. 5 Stylized sample image with \(A_{d}\) and \(A_{gt}\) as the areas of the network detection and of the mask respectively

4.1 Visual explanation by Grad-CAM

Grad-CAM [19] is a localization technique based on the Class Activation Mapping (CAM) algorithm [27] that generates visual explanations for any CNN without requiring changes or re-training. To generate a class-discriminative heatmap, Grad-CAM computes the gradient of the score \(out_{cls}\) for class cls, taken before the final (softmax) activation, with respect to the feature map activations \(A^{k}\) of the last convolutional layer. The global average pooling of these gradients gives the neuron importance weights \(\alpha ^{k}_{cls}\):

$$\begin{aligned} \alpha ^{k}_{cls} = \frac{1}{P}\sum _{i}\sum _{j} \frac{\partial {out_{cls}}}{\partial {A^{k}_{ij}}} \end{aligned}$$
(1)

where \(\sum _{i}\sum _{j}\) represents the global average pooling over the pixel positions and P is the number of pixels in the feature map. Finally, a weighted combination of the activation maps is computed, followed by a ReLU, to obtain the heatmap:

$$\begin{aligned} L^{HeatMap}_{cls} = ReLU\left( \sum _{k}\alpha ^{k}_{cls}A^{k}\right) \end{aligned}$$
(2)

For more details, see the work of R.R. Selvaraju et al. [19].
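A minimal sketch of this computation, assuming a PyTorch implementation (the hook mechanism, tensor layout, and function names are our assumptions), is:

```python
import torch
import torch.nn.functional as F

def grad_cam_heatmap(model, last_conv, image, cls):
    """Sketch of Grad-CAM following Eqs. (1)-(2).

    Assumes `model` maps a (1, C, H, W) image to pre-softmax class scores
    and `last_conv` is its last convolutional layer.
    """
    store = {}
    fh = last_conv.register_forward_hook(lambda m, i, o: store.update(A=o))
    bh = last_conv.register_full_backward_hook(lambda m, gi, go: store.update(dA=go[0]))

    scores = model(image)          # pre-softmax scores, shape (1, num_classes)
    scores[0, cls].backward()      # gradients of out_cls w.r.t. the feature maps A^k
    fh.remove(); bh.remove()

    alpha = store["dA"].mean(dim=(2, 3), keepdim=True)  # Eq. (1): average over the P pixels
    heatmap = F.relu((alpha * store["A"]).sum(dim=1))   # Eq. (2): weighted sum over k + ReLU
    return heatmap                 # (1, H', W'); upsample to the image size if needed
```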

Fig. 6 2D and 3D heatmaps (top left and bottom left) obtained with Grad-CAM from an image (top right and bottom right)

4.2 Mathematical formulation

The Overlap Coefficient is defined as:

$$\begin{aligned} overlap_{c}(A_{d}, A_{gt}) = \frac{|A_{d} \cap A_{gt}|}{min(|A_{d} |, |A_{gt}|)} \end{aligned}$$
(3)

where \(A_{d}\) and \(A_{gt}\) are the areas obtained through Grad-CAM and from the segmentation mask (the ground truth) respectively. \(A_{gt}\) is obtained by manual segmentation or with a previously trained segmentation neural network [28,29,30]. Figure 5 shows a graphical representation of \(A_{d}\) and \(A_{gt}\). If \(A_{d}\) is a subset of \(A_{gt}\), or vice versa, the Overlap Coefficient is 1. To add this term to the loss function of the neural network, we must invert its behavior so that a perfect overlap produces no penalty. Taking the negative logarithm of the coefficient, we obtain a new value that we call the Overlap Distance (OD), expressed by (4):

$$\begin{aligned} OD(A_{d}, A_{gt}) = - \ln \left( \frac{|A_{d} \cap A_{gt}|}{min(|A_{d}|, |A_{gt}|)}\right) \end{aligned}$$
(4)
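As a numerical illustration with made-up areas, suppose the detected area covers \(|A_{d}| = 120\) pixels, the mask covers \(|A_{gt}| = 100\) pixels, and their intersection covers 80 pixels; then:

$$\begin{aligned} OD = -\ln \left( \frac{80}{min(120, 100)}\right) = -\ln (0.8) \approx 0.223 \end{aligned}$$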

In this way, when \(A_{d}\) is a subset of \(A_{gt}\), we obtain \(-\ln (1)\) and OD becomes 0, giving no contribution to the loss. The logarithm was introduced because it applies a small penalty to small differences between the predicted and correct areas, and a larger penalty as the difference grows. To optimize our CNN also w.r.t. this further aspect, we add this new term to the Cross-Entropy loss [31], described by the following equation:

$$\begin{aligned} ce = -\frac{1}{N} \sum _{i=1}^{N} y_{i}\log (p(y_{i})) \end{aligned}$$
(5)

where N is the number of examples, and \(y_{i}\) and \(p(y_{i})\) are the label and the output of the network for the i-th example respectively. This term accounts for the contribution of the classification task to the overall loss. We thus obtain the Cross-Entropy Overlap Distance (CEOD):

$$\begin{aligned} CEOD = ce + OD(A_{d},A_{gt}) \end{aligned}$$
(6)
$$\begin{aligned} CEOD = - \frac{1}{N} \sum _{i=1}^{N} y_{i}\log (p(y_{i})) + \omega \, y_{i}\left( - \ln \left( \frac{|A_{d}^{i} \cap A_{gt}^{i}|}{min(|A_{d}^{i}|, |A_{gt}^{i}|)}\right) \right) \end{aligned}$$
(7)

The term \(\omega \) in (7) is a new hyper-parameter that controls the impact of the new term on the overall loss. Its value depends on the order of magnitude of, and the difference between, the two parts of the loss. In our experiments, after several tests, we set \(\omega \) to 0.001. The OD part of the loss is also multiplied by \(y_{i}\) to take the image labels into account: in the anomaly detection task, defect-free (good) images have no specific areas of hottest pixels; their heatmaps are rather uniform, with low intensity levels.
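Under this reading of (7), the loss can be sketched in PyTorch as follows; the binary form, the tensor shapes, and the batch reduction of the OD term are our assumptions:

```python
import torch

def ceod_loss(p_true, y_defect, overlap, omega=1e-3, eps=1e-6):
    """Sketch of the CEOD loss of Eq. (7).

    p_true:   probability the network assigns to the true class, shape (N,)
    y_defect: 1 for defective images, 0 for good ones, shape (N,)
    overlap:  per-image Overlap Coefficient of Eq. (3), shape (N,)
    """
    ce = -torch.log(p_true.clamp_min(eps)).mean()     # Eq. (5)
    od = -torch.log(overlap.clamp(eps, 1.0 - eps))    # Eq. (4), clipped for stability
    return ce + omega * (y_defect * od).mean()        # OD penalty only for defective images
```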

As shown in Figure 6 (bottom left), by filtering the heatmap to extract the hottest pixels, we obtain the area (i.e., the \(A_{d}\) term) used in (7).

Fig. 7 2D and 3D representations of the filtered mask (top left and bottom left) and of the mask inside the dataset (top right and bottom right)

4.3 Algorithm

To exploit the OD in the training of a CNN, we need a custom training loop that obtains the features extracted by the last convolutional layer, generates the heatmap, and uses it in the CEOD. The output of the CNN was modified to return both the classification output and the features extracted by the last convolutional layer. Algorithm 1 shows the custom training loop and the CEOD loss calculation; a condensed sketch is given after it.

Algorithm 1 CEOD loss calculation and custom training loop
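Since Algorithm 1 is reported as a figure, the following condensed PyTorch sketch conveys its structure; the helpers grad_cam_batch, smooth_mask, and overlap_distance are hypothetical names for the steps described in this section (a single-image version of the first is sketched in Section 4.1; possible realizations of the other two follow below):

```python
import torch
import torch.nn.functional as F

def train_step(model, images, labels, masks, optimizer, omega=1e-3):
    """One CEOD training step, sketching the structure of Algorithm 1.

    Assumes `model` returns both the classification logits and the
    last-convolutional-layer features, as described in the text.
    """
    optimizer.zero_grad()
    logits, features = model(images)                     # modified CNN output
    heatmaps = grad_cam_batch(features, logits, labels)  # Eqs. (1)-(2), per image (hypothetical helper)
    od = overlap_distance(heatmaps, masks)               # Eq. (8), on the convolved masks

    ce = F.cross_entropy(logits, labels)                 # classification term, Eq. (5)
    loss = ce + omega * (labels.float() * od).mean()     # Eq. (7)
    loss.backward()
    optimizer.step()
    return loss.item()
```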

As can be seen in line 3 of Algorithm 1, we apply a convolution with the filter \(\left({\begin{matrix} 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.5 \end{matrix}}\right)\) to incorporate a notion of distance into the OD computation. The transformation of the mask after the convolution is visible in Fig. 7. Note how, after the convolution, there are no longer sharp differences in height between the values 1 and 0 (see the bottom right and bottom left 3D representations in Fig. 7): the mask becomes gradual, widening the AOI and allowing us to implement the distance computation so that it is differentiable like the rest of the loss.

The distance is important to help the network understand whether a detection is far from or near the AOI. In the example of Fig. 8, focusing on the original mask (left side), the detections marked A and B have two different distances from the AOI (Da and Db), with Db larger than Da. If we use the original mask in the CEOD, these two detections give the same result, because both A and B are multiplied by the same mask value, which is zero. If instead we use the filtered mask, A1 partially overlaps the widened AOI, so its contribution to the CEOD computation is greater than that of B1. The closer the detection is to the AOI, the smaller the distance; in case of overlap, the distance is zero. This contribution leads the network to learn that the further a detection is from the AOI, the worse it is.

Fig. 8 Original mask (left) and filtered mask (right) with two detections each: A, B and A1, B1 respectively. Da, Db, D1a, and D1b represent the distances between the detections and the AOI for the original and the filtered mask respectively

To make the OD part of the loss differentiable, the formulation of the OD becomes:

$$\begin{aligned} OD = - \log \left[ clip_{\epsilon } \left( \frac{\sum A_{d} A_{gt}^{*}}{min(\sum A_d, \sum A_{gt}^{*})}\right) \right] \end{aligned}$$
(8)

where \(\sum \) is computed over the pixels of the images, \(A_{gt}^{*}\) represents the convolved mask, and \(clip_{\epsilon }\) clips the value of the Overlap Coefficient to the interval \([\epsilon , 1-\epsilon ]\) to prevent the logarithm from returning unacceptable values.
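A possible realization of (8), including the mask convolution of Algorithm 1, is sketched below; the (N, 1, H, W) tensor layout and the value of \(\epsilon \) are our assumptions:

```python
import torch
import torch.nn.functional as F

def smooth_mask(mask):
    """Widen the AOI with the 3x3 all-0.5 filter used in Algorithm 1, line 3."""
    kernel = torch.full((1, 1, 3, 3), 0.5, device=mask.device)
    return F.conv2d(mask, kernel, padding=1)             # A*_gt, a gradual mask

def overlap_distance(heatmap, mask, eps=1e-6):
    """Differentiable Overlap Distance of Eq. (8), per image in the batch."""
    m = smooth_mask(mask)
    inter = (heatmap * m).sum(dim=(1, 2, 3))             # sum of A_d * A*_gt over the pixels
    denom = torch.minimum(heatmap.sum(dim=(1, 2, 3)), m.sum(dim=(1, 2, 3)))
    coeff = (inter / denom.clamp_min(eps)).clamp(eps, 1.0 - eps)  # clip_eps
    return -torch.log(coeff)
```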

5 Experiments

In this section, we describe the datasets, the general setup, and the results of the experiments. Each experiment was performed on the Marconi100 cluster provided by Cineca, in which each node is equipped with 2 IBM POWER9 AC922 CPUs and 4 NVIDIA Volta V100 GPUs with 16GB of RAM, connected to the other nodes with NVLink 2.0. We experimented with two different CNNs: EfficientNet-B0 [32], pre-trained on ImageNet [33], and a custom CNN composed of 8 convolutional blocks (each block consists of a convolutional layer followed by a batch normalization layer) with a max-pooling layer every two blocks. The custom network has 294,994 trainable parameters, requires 1.33 GFLOPs, and takes 10.8 ms per image at inference. The custom CNN was trained from scratch. On the MVTec AD dataset, we trained both networks with and without the CEOD contribution. Then, we compared the new CEOD loss with the standard classification loss in terms of confusion matrix, accuracy, ROC AUC, and loss.
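For reference, the structure of the custom CNN can be sketched as follows; the channel widths, kernel size, activation, and classification head are assumptions chosen only to convey the layout (the actual network has 294,994 trainable parameters):

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # one convolutional block: convolution followed by batch normalization
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),  # activation assumed; the text only specifies conv + batch norm
    )

class CustomCNN(nn.Module):
    """Sketch of the 8-block custom CNN; widths and head are assumptions."""

    def __init__(self, num_classes=2):
        super().__init__()
        widths = [3, 16, 16, 32, 32, 64, 64, 128, 128]  # assumed channel progression
        layers = []
        for i in range(8):
            layers.append(conv_block(widths[i], widths[i + 1]))
            if i % 2 == 1:                              # max-pooling every two blocks
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(widths[-1], num_classes)

    def forward(self, x):
        feats = self.features(x)                         # last-conv features, used by Grad-CAM
        logits = self.head(feats.mean(dim=(2, 3)))       # global average pooling + classifier
        return logits, feats                             # both outputs, as in Section 4.3
```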

5.1 Dataset

The experiments were performed on two different datasets. The first is a sub-dataset of MVTec AD [24, 25], consisting only of the zipper images; it was augmented to change its shape and proportions. Figure 9 shows some sample images. Each image has an associated binary mask. The dataset is composed of 216 images without defects and 184 images with defects, for a total of 400 images, split 70/20/10 into training, validation, and test sets. This dataset was chosen because it is representative of the problem in question: its images have flaws both on the zipper and on the fabric outside the zipper. To match our problem setting, we consider defective only the images with defects on the zipper; all images with flaws on the surrounding fabric but none on the zipper were relabelled as non-defective.

Fig. 9 Example images from the augmented zip dataset

Table 1 MVTec AD dataset. CE and Exp. stand for Cross-Entropy loss and experiment respectively (best results in bold)

The second dataset used in these experiments was provided by an Italian company. Unfortunately, it is not possible to give details about this dataset due to an NDA signed with the company; however, it is a real industrial dataset. It consists of 2818 images without defects and 2246 images with defects, 5064 images in total, split 80/20 into training and validation sets. The test set has 893 images: 467 without defects and 426 with defects. The images show the surface of a product developed by this company. The product has a round shape and a clear, reflective, smooth surface. For this reason, a light pattern is applied to the products with a special illuminator to bring out the defects that occur on the surface. Due to the reflective surface, the application of this light pattern causes random scattering effects that are visually comparable to defects, and these effects are rarely the same.

In the field of anomaly detection, due to data imbalance, it is customary to augment the data to reduce the gap between classes. However, this approach can lead to several problems, such as overfitting, and requires increasingly sophisticated augmentation strategies; this is a hot topic in the literature [34,35,36]. We therefore chose not to apply data augmentation and, as described in the next section, still obtained state-of-the-art results.

5.2 Results

For each experiment, as mentioned before, EfficientNet-B0 was pre-trained on ImageNet. We tested both transfer learning and fine-tuning, and the experiments show that fine-tuning gives significantly better results. This is because the network was originally trained on ImageNet with a standard loss (categorical cross-entropy), so re-training only the last dense layer may not be sufficient to exploit the contribution of the new loss. We therefore fine-tuned by unfreezing the last 20 convolutional layers. Other network configurations will be explored in future work to assess the impact of the architecture on performance. The custom CNN, instead, was trained from scratch with the proposed loss. All experiments were performed with \(\omega \) set to 0.001, a batch size of 32, and the Adamax optimizer with a learning rate of 0.002.

5.2.1 MVTec AD Dataset

Table 1 shows the results of EfficientNet-B0 and the custom CNN trained on the MVTec AD dataset. Both networks were trained with the standard loss (Categorical Cross-Entropy) and with our CEOD. In both cases, the application of the new OD term allows the networks to achieve better results in the validation phase. Figure 10 shows the results of EfficientNet-B0 on the test set. From the confusion matrices we can see that, with the OD term added to the loss (thus using the CEOD loss), the network obtains better results in defect identification at the expense of a slight worsening in the identification of non-defective images. The network trained with CEOD reaches 95.5% accuracy and a ROC AUC of 0.95, whereas EfficientNet-B0 with classical cross-entropy reaches 93.3% accuracy and a ROC AUC of 0.925. Figure 11 shows the results of the custom CNN on the same test set: trained with CEOD, it obtains 73.3% accuracy and a ROC AUC of 0.74, while trained with standard cross-entropy it reaches 48.8% accuracy and a ROC AUC of 0.53. Figure 12 compares a heatmap produced by a network trained classically with one produced by the network trained with CEOD. The time spent performing 1K training cycles with EfficientNet-B0 is 2h 28m 24s with standard cross-entropy and 2h 36m 29s with CEOD. The time spent performing 100 training cycles with the custom CNN is 4m 30s with the standard loss versus 4m 28s with CEOD.

Fig. 10 Confusion matrices on the test set of the MVTec AD dataset obtained with EfficientNet-B0. 0_ND and 1_D represent the classes without and with defects respectively

Fig. 11 Confusion matrices on the test set of the MVTec AD dataset obtained with the custom CNN. 0_ND and 1_D represent the classes without and with defects respectively

Fig. 12 (a) Heatmap produced by the network trained with the standard Cross-Entropy (CE) loss: the defect on the zipper is not highlighted, unlike the external defect in the upper right. (b) Heatmap produced by the network trained with CEOD: the defect on the zipper is highlighted, while the external defects are ignored by the network

Table 2 Industrial dataset. For EfficientNet-B0, results are reported for transfer learning (TL) and fine-tuning (FT) (best results in bold)
Fig. 13 Confusion matrices on the test set of the industrial dataset obtained with EfficientNet-B0. 0_ND and 1_D represent the classes without and with defects respectively

Fig. 14 Confusion matrices on the test set of the industrial dataset obtained with the custom CNN. 0_ND and 1_D represent the classes without and with defects respectively

5.2.2 Industrial dataset

Table 2 shows the results of EfficientNet-B0 and the custom CNN on the industrial real-case dataset. The network trained with CEOD is better in terms of accuracy in the validation phase, and the network trained with fine-tuning improves on the one trained with transfer learning only. The slight worsening of the loss is probably due to the fact that, unlike the MVTec AD benchmark, this industrial dataset is considerably more difficult, being representative of a real use case; the addition of the OD term to the cross-entropy causes this slight deterioration. This phenomenon does not occur on the MVTec AD dataset because, that dataset being simpler, the network trained with CEOD clearly exceeds the one trained with the standard loss, so the added OD terms do not worsen the loss. However, despite the excellent results already obtained by the network trained for standard classification, with CEOD we obtain a further increase in performance. Figures 13 and 14 show the results of EfficientNet-B0 and the custom CNN on the test set of the industrial dataset: the networks trained with CEOD obtain better results in defect identification at the expense of a slight worsening in the identification of non-defective images. EfficientNet-B0 trained with CEOD reaches 98.9% accuracy and a ROC AUC of 0.99, compared to 98.8% accuracy and a ROC AUC of 0.98 with standard cross-entropy. The custom CNN trained with CEOD reaches 95.4% accuracy and a ROC AUC of 0.95, compared to 93.3% accuracy and a ROC AUC of 0.93 with the standard loss. The time spent performing 360 training cycles with EfficientNet-B0 is 9h 25m 31s with standard cross-entropy and 8h 12m 47s with CEOD. The time spent performing 80 training cycles with the custom CNN is 1h 44m 25s with the standard loss versus 1h 44m 5s with CEOD.

6 Conclusions

The aim of this work is to improve the use of CNNs in the field of anomaly detection by stimulating the network to pay attention mainly to a specific part of an image, avoiding the identification of defects in parts of the image that contain background noise. This work presents a new loss that acts as an attention mechanism, making a neural network focus on a specific part of an image (the Area of Interest, AOI), which need not be the same across the whole dataset. This goal was achieved by extending the Szymkiewicz-Simpson Overlap Coefficient into what we call the Overlap Distance (OD) and adding this contribution to the loss function used for the classification task (cross-entropy loss). The experiments show that our approach outperforms standard cross-entropy on both a benchmark dataset and an industrial real-case dataset.

The introduction of the Overlap Distance (OD) as an attention mechanism represents a significant advancement in anomaly detection using convolutional neural networks. By focusing the network’s attention on specific Areas of Interest (AOI), we mitigate the risk of false positives caused by background noise defects. This innovation holds promise not only in image processing but also in various domains where precise attention allocation is crucial.

The next step for this project is to apply this new loss within an unsupervised learning framework. We are studying how to integrate this approach into GAN-based networks for anomaly detection, motivated by the high suitability of GANs for anomaly detection tasks in the industrial sector.

The potential impact of this research extends beyond anomaly detection, with implications for a range of industries reliant on accurate image analysis and pattern recognition. We believe that these advancements will contribute to more robust and reliable quality control processes in the industrial sector.