Improving deep learning performance by using Explainable Artificial Intelligence (XAI) approaches

In this work we propose a workflow to deal with overlaid images (images with superimposed text and company logos), which are very common in underwater monitoring videos and surveillance camera footage. We demonstrate that Explainable Artificial Intelligence (XAI) can be used to improve the performance of deep learning models for image classification tasks in general. A deep learning model trained to classify metal surface defects, which initially performed poorly, is evaluated with layer-wise relevance propagation (LRP), an XAI technique, to identify problems in the dataset that hinder training; such problems arise in a wide range of applications. The unwanted information can then be removed from the dataset using different approaches, from cropping part of the images to training a Generative Inpainting neural network model, and the model is retrained on the new preprocessed images. The proposed methodology improved the F1 score by 20% compared with the model trained on the original dataset, validating the proposed workflow.


Introduction
Deep learning classifiers [1,2] are successfully used in a vast range of applications with different objectives, in areas such as scientific studies, industry and entertainment [3][4][5]. Despite the revolutionary character of this technology, there are still challenges that slow its expansion or prevent the consolidation of deep learning in certain areas. Among the main challenges are the great complexity of the models, which entails a high computational cost [6], and the lack of transparency and explainability [7][8][9], which weakens confidence in, and the verifiability of, decisions taken by a deep learning system.
The absence of explainability and transparency is not always a problem, since state-of-the-art models have extremely high accuracy [10]. Furthermore, in many applications errors do not have serious consequences, e.g., facial recognition in photos taken by smart cameras [11]. However, in areas such as autonomous cars [12], financial transactions [13] and, above all, medical applications [14], failures are unacceptable, since erroneous decisions can have disastrous consequences, such as the loss of human lives. For this reason, these application areas have a strong interest in explaining and interpreting each decision made by deep learning models.
Explainable Artificial Intelligence (XAI) [15] is the area of study that aims to explain, interpret, and visualize the decisions made by deep learning models. Many studies have been carried out to understand how models make their decisions, so that in sensitive practical applications specialists can have more confidence in the model's predictions. Medical research is an area that widely uses XAI techniques [16], aiming to understand how models learn to identify certain clinical problems and which features are taken into account to make each decision.
This study finds strong evidence that XAI techniques can also be used to improve the performance of deep learning models. Datasets provided by companies and institutions very commonly consist of overlaid images (images with superimposed text and company logos), as shown in Fig. 1. Using an XAI technique, namely layer-wise relevance propagation (LRP) [17], we observe that this unwanted extra information can consistently reduce a model's performance.
We synthetically reproduce the conditions of overlaid images in GC10-DET, a public dataset [18], by adding random information and company logos superimposed on the original images. The processed synthetic dataset can be found at our repository, presented in the data availability section. The dataset consists of ten classes of metal surface defects collected in an industrial setting. A deep learning model was trained to classify these defects, obtaining a low accuracy, which we consider our baseline. The LRP technique was then used to analyze the model's inferences. The results show that the model learned to solve the problem by identifying patterns in the text and logos superimposed on the image and not by the actual surface defect itself. Therefore, computer vision techniques were used to remove the superimposed text and logos from the images, and the model was retrained so that it identifies the defect of interest. This new model achieved an F1 score 20% higher than the baseline.
The main contribution of this work is to show how XAI techniques can be used to improve the performance of deep learning models. In addition, a problem of practical interest was solved using deep learning and XAI. Several works, such as [19,20], also explore deep uncertainty learning [21] to improve the robustness and interpretability of deep models. The advantage of the proposed approach is that it is straightforward and can be applied to any deep learning model. This paper is organized as follows: Sect. 2 explains the theoretical concepts of the LRP technique and the computer vision techniques used. Section 3 presents the dataset and details of the experimental procedure. Section 4 evaluates the performance of the proposed workflow in a real case study. Finally, Sect. 5 summarizes the conclusions obtained in this work.

Background
The LRP technique reports the relevance of each pixel for the decision made by a deep neural network. Even though this is an oversimplified form of explanation compared with the human conception of explanation, the information is valuable for illustrating the behavior of deep learning models. The technique works by back-propagating the predicted output through the deep neural network using a set of rules.

Layer-wise relevance propagation
This technique is based on a conservation principle, resembling Kirchhoff's conservation laws in electric circuit theory [22]. Let j and k be neurons at two consecutive layers of a deep neural network, where the neurons in j are in a lower layer (closer to the input) than the neurons in k. Neurons in j receive a relevance score from neurons in k. The relevance score $R_j$ at a given neuron in j is obtained by applying the following rule:

$$R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k$$

where $z_{jk} = a_j w_{jk}$, $a_j$ being the output of the activation function of neuron j, and $w_{jk}$ the weight learned during training between neurons j and k. At the input layer, relevance is propagated from the neurons of the first layer to the pixels of the input image; the relevance assigned to each pixel is the final output of the process. Figure 2 shows the LRP back-propagation process. Figure 3 illustrates an application example of the LRP technique in a deep learning model trained to detect metal surface defects. The left image is the input image, and the right image is the LRP output. Red pixels indicate high relevance for the image classification, while the white region indicates low relevance. In this example, the focus of the network (the red pixels) shows that it is in fact learning to identify the defect itself.
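The propagation rule described above can be illustrated on a toy fully connected ReLU network. The following is a minimal NumPy sketch, not the implementation used in this work: the layer sizes, the random weights, the absence of bias terms and the small stabilizer `eps` are all assumptions made only for illustration.

```python
import numpy as np

def lrp_dense(a_j, w_jk, r_k, eps=1e-9):
    """Redistribute the relevance r_k of layer k onto the lower layer j.

    a_j  : activations of the lower layer, shape (J,)
    w_jk : weights connecting layer j to layer k, shape (J, K)
    r_k  : relevance scores of layer k, shape (K,)
    """
    z_jk = a_j[:, None] * w_jk                        # z_jk = a_j * w_jk
    z_k = z_jk.sum(axis=0)                            # sum over j of z_jk
    z_k = z_k + eps * np.where(z_k >= 0, 1.0, -1.0)   # guard against division by zero
    return (z_jk / z_k) @ r_k                         # R_j = sum_k (z_jk / sum_j z_jk) R_k

# Hypothetical toy network: 4 "pixels" -> 3 hidden ReLU units -> 2 classes.
rng = np.random.default_rng(0)
x = rng.random(4)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(3, 2))

h = np.maximum(x @ w1, 0.0)        # hidden activations
out = h @ w2                       # class scores

# Start from the score of the predicted class only.
r_out = np.zeros_like(out)
r_out[out.argmax()] = out.max()

r_h = lrp_dense(h, w2, r_out)      # relevance of the hidden units
r_x = lrp_dense(x, w1, r_h)        # relevance of each input "pixel"
```

Because every layer only redistributes what it receives, the pixel relevances in `r_x` sum (up to the stabilizer) to the class score that was propagated, which is the conservation property the text compares to Kirchhoff's laws.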

Computer vision techniques for inpainting
Four computer vision techniques were employed to remove text and logos and to draw the model's attention away from them: (i) Gaussian Blur [23], (ii) image cropping, (iii) censor bars, and (iv) Generative Inpainting [24]. The first three techniques were applied to the upper and lower regions of the images; details of their application are given in the experimental section. For the fourth technique, Generative Inpainting, a generative model [25] was applied to the images. The generative model was based on Generative Adversarial Networks (GAN) [26]. A GAN estimates generative models via an adversarial process in which two models are trained simultaneously: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from the generator G. GAN was introduced in [26] as a framework in which G and D are approximated by two neural networks.
These two networks compete in a min-max game that, under ideal conditions, converges, and the generator learns the distribution of the given data. In other words, training a GAN is equivalent to minimizing the Jensen-Shannon Divergence (JSD) [27] between the generator and data distributions (other divergences are possible too).
A conditional GAN (CGAN) [28] was adapted in this work for image inpainting [29] and trained using the overlaid image as input to the generator network. The discriminator receives the preprocessed images (images to which the computer vision techniques were applied) as real inputs and the overlaid images as fake ones. As the discriminator associates real images with images without the superimposed text and logos, the generator is forced to learn to remove this information from the input image.
Formally, the generator does not simply maximize the likelihood of single samples; it minimizes the overall distance between the real distribution (images with no overlay) and the generated distribution (overlaid images). It therefore learns the probability distribution of the real images, which under ideal conditions is accomplished by minimizing the JSD, as shown below:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y)))]$$

where G is the generator, D is the discriminator, x is a sample from a given dataset with probability density function $p_{data}$, y is the condition input of the model, z is the random noise drawn from the normal distribution $p_z$, and $\mathbb{E}$ is the expected value.
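The link between the min-max game and the JSD can be checked numerically. For the original GAN objective, the standard result is that with the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)) the value of the game equals -log 4 + 2 JSD(p_data, p_g). The sketch below verifies this on two small discrete distributions; the distributions themselves are made-up toy values, and a discrete sample space is assumed purely to keep the sums exact.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_optimal_d(p_data, p_g):
    """Value of the GAN game when the discriminator is optimal for p_g."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

# Toy distributions over four outcomes (strictly positive, so every log is finite).
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

value = gan_value_optimal_d(p_data, p_g)
bound = -np.log(4.0) + 2.0 * jsd(p_data, p_g)
value_at_optimum = gan_value_optimal_d(p_data, p_data)   # generator matches the data
```

Here `value` and `bound` agree, and when the generator distribution equals the data distribution the JSD vanishes and the value reaches its minimum, -log 4.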
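In practice the expectations in a conditional minimax objective are estimated on minibatches. The following sketch assumes the standard conditional GAN form, in which the discriminator outputs a probability for each (image, condition) pair; the function names and the example probabilities are hypothetical and do not come from this work.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Negative minibatch estimate of V(D, G); the discriminator minimizes it.

    d_real : D(x | y) for real samples, values in (0, 1)
    d_fake : D(G(z | y)) for generated samples, values in (0, 1)
    """
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def generator_loss(d_fake):
    """Minimax generator term: the generator minimizes log(1 - D(G(z | y)))."""
    return float(np.mean(np.log(1.0 - d_fake)))

# A discriminator that separates the two sets well ...
confident = discriminator_loss(np.array([0.9, 0.8, 0.95]), np.array([0.1, 0.3, 0.05]))
# ... versus one that cannot tell them apart (outputs 0.5 everywhere).
undecided = discriminator_loss(np.full(3, 0.5), np.full(3, 0.5))
gen_at_equilibrium = generator_loss(np.full(3, 0.5))
```

When the discriminator outputs 0.5 for every sample, its loss equals log 4, which corresponds to the equilibrium at which the generated and real distributions coincide.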

Experimental setup
We first train a deep learning model to classify 10 common metallic surface defects. The classes present in the dataset are: punching, welding line, crescent gap, water spot, oil spot, silk spot, inclusion, rolled pit, crease and waist folding. An example of each class is shown in Fig. 4.

Network architecture, optimizer and training details
MobileNet [30] with weights pre-trained on ImageNet [31] was chosen empirically for training, as it achieved better results with features extracted for this particular dataset. The Adam optimizer was used with a momentum value of 0.9, an initial learning rate of 0.001 and a batch size of 16. Thirty experiments were performed for each study case.
The dataset, containing 2306 samples, was split into train, validation and test sets in proportions of 70%, 20% and 10%, respectively. Since this dataset is not balanced, the F1 score, Precision and Recall metrics were used to evaluate the model's performance. The model was trained for up to 450 epochs, with early stopping after 60 epochs without an improvement of at least 0.001 in the F1 score on the validation samples.
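For reference, a single Adam update with these settings can be sketched as follows. This is an illustration under assumptions: the momentum value of 0.9 is read here as Adam's first-moment decay rate `beta1`, while `beta2 = 0.999` and `eps = 1e-8` are commonly used defaults that the text does not specify. The parameter and gradient values are arbitrary.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and both moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment (momentum term)
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second moment
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction for step t
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0, -2.0])      # arbitrary parameters
g = np.array([0.5, -4.0])      # arbitrary gradient
w1, m1, v1 = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
```

On the first step the bias-corrected ratio reduces to the sign of the gradient, so each parameter moves by approximately the learning rate (0.001) regardless of the gradient's magnitude.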
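The split, the metrics and the stopping criterion can be sketched in plain Python. The helper names below are hypothetical, and two details are assumptions because the text does not state them: how fractional sample counts are rounded, and that "improvement" means exceeding the best validation F1 so far by more than the tolerance.

```python
def split_sizes(n_samples, train_frac=0.7, val_frac=0.2):
    """Train/validation/test sizes; the test set absorbs the rounding remainder."""
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    return n_train, n_val, n_samples - n_train - n_val

def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 score for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

class EarlyStopping:
    """Stop when validation F1 has not improved by more than min_delta for `patience` epochs."""

    def __init__(self, patience=60, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.wait = 0

    def step(self, val_f1):
        """Record one epoch; return True when training should stop."""
        if val_f1 > self.best + self.min_delta:
            self.best = val_f1
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

sizes = split_sizes(2306)

# Minority class 1 in a small imbalanced example.
precision, recall, f1 = precision_recall_f1([0, 0, 0, 1], [0, 0, 1, 1], label=1)

# Short patience so the toy history triggers the stop.
stopper = EarlyStopping(patience=3, min_delta=0.001)
history = [0.50, 0.60, 0.6005, 0.6003, 0.6009, 0.70]
stopped_at = next(i for i, score in enumerate(history) if stopper.step(score))
```

In the toy history, the gains after the second epoch are all smaller than the tolerance, so they do not reset the patience counter and training stops before the later jump to 0.70 is seen. This is the trade-off a long patience such as 60 epochs is meant to soften.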

Computer vision techniques
LRP results show that text and logos superimposed on the images prevent models from learning relevant features. To solve this problem, computer vision techniques were used to eliminate such elements from the dataset. The techniques used were: (i) Gaussian Blur, (ii) Image Cropping, (iii) Censor Bars, and (iv) Generative Inpainting, as mentioned in the Background section.
The first three techniques were applied to the upper and lower regions of the images, an area of 35 × 224 pixels at the top and at the bottom of the image, as shown in Fig. 5. For the Gaussian Blur technique, the kernel size was 17 × 17 pixels with a standard deviation of 20 in both the horizontal and vertical directions. In the Image Cropping technique, the same area was cropped and the resulting image was resized to its original size of 224 × 224 pixels. The Censor Bars technique, instead of cropping that area, replaces it with a black stripe that covers the text and logos. Finally, in the Generative Inpainting technique, the whole image was generated by the generative model; Fig. 5E shows an example. The original image for all cases is illustrated in Fig. 5A. The complete dataset with preprocessed images and the original dataset are being made publicly available at our repository (data availability section). Table 1 shows the performance metrics for the first trained model, trained on the original dataset with text and logos and hereafter called the Original Model.
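The first three operations can be sketched with NumPy alone. This is an illustrative sketch rather than the code used in this work: it assumes single-channel 224 × 224 images stored as 2-D arrays, nearest-neighbour resizing after cropping, reflect padding at the borders, and that only the vertical size changes when cropping. A library routine such as OpenCV's Gaussian blur would normally replace the hand-written filter.

```python
import numpy as np

BAND = 35    # height in pixels of the top and bottom overlay regions (from the text)
SIZE = 224   # image side length in pixels (from the text)

def censor_bars(img, band=BAND):
    """Cover the top and bottom bands with black stripes."""
    out = img.copy()
    out[:band] = 0
    out[-band:] = 0
    return out

def crop_and_resize(img, band=BAND):
    """Drop both bands, then stretch back to the original height (nearest neighbour)."""
    core = img[band:-band]
    rows = np.arange(img.shape[0]) * core.shape[0] // img.shape[0]
    return core[rows]

def gaussian_kernel_1d(ksize=17, sigma=20.0):
    """Normalized 1-D Gaussian; applying it along both axes gives a ksize x ksize blur."""
    x = np.arange(ksize) - (ksize - 1) / 2
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur_bands(img, band=BAND, ksize=17, sigma=20.0):
    """Blur the image with a separable Gaussian and keep the result only in the bands."""
    k = gaussian_kernel_1d(ksize, sigma)
    padded = np.pad(img.astype(float), ksize // 2, mode="reflect")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, tmp)
    out = img.astype(float).copy()
    out[:band] = blurred[:band]
    out[-band:] = blurred[-band:]
    return out

# Random stand-in for a dataset image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(SIZE, SIZE)).astype(np.uint8)

censored = censor_bars(image)
cropped = crop_and_resize(image)
blurred = blur_bands(image)
```

All three functions leave the central 154 rows of information available to the classifier; they differ only in what replaces the overlay bands (black pixels, stretched content, or a smoothed version of the overlay).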

Results
As previously stated, the LRP technique was used to understand the low performance of the Original Model. Fig. 6 shows two examples belonging to the Inclusion class. It can be observed that the model focused on text and logos to make its decision. LRP results over all samples are publicly available at our repository (data availability section).
By prior knowledge, text and company logos carry no useful information for class labelling. Furthermore, the model is expected to base its decision on relevant features of the image and to avoid these irrelevant patterns. To understand the learned model fully, we evaluate the Original Model (trained on images with text and logos) on the images preprocessed with the Generative Inpainting approach (that is, tested on images without logos and text). Fig. 7 shows the LRP results for the same images shown in Fig. 6, but without text and logos. The LRP output shows that, although the model used the Inclusion class features in this scenario, its relevance was also extremely noisy, with high relevance given to the borders of the image. Table 2 presents the performance metrics of the Original Model: the poor performance shows that the model was focusing on text and logos to make inferences. Thus, computer vision techniques such as blur, crop, censor bars and Generative Inpainting are applied to prevent the model from using such information during the learning phase.

Case Study
Discover Artificial Intelligence (2021) 1:9 | https://doi.org/10.1007/s44163-021-00008-y
It is clear from the results presented that, when unnecessary information such as text and logos is removed, the model is able to improve its performance significantly. In addition, the model infers the correct class based on relevant features of the images, as shown in Fig. 8, which presents two examples of the Oil spot class. This defect is usually caused by contamination with mechanical lubricant, which affects the appearance of the metal surface. The LRP output images show that this is exactly what the model focuses on to make the correct prediction. The model is therefore using the correct patterns in the image to make predictions, in line with the expectation that models use relevant features of images to label them. LRP results over all preprocessed samples can be accessed at our repository (data availability section).

Results after preprocessing images
Lastly, Fig. 9 shows that during the training phase all models have similar performance, indicating that a deep learning model can learn to solve the same problem based on different features of the samples. However, as seen in Fig. 6, these learned features may include noise (from the overlay), resulting in overfitting. This leads to poor performance in real applications, as shown previously.

Conclusions
In this work we show that it is possible to use XAI techniques not only to understand a model's behavior, but also to improve its performance. Using XAI, we observed that the trained model was relying on information from the superimposed text and logos in the images to infer data classes. By prior knowledge, text and company logos carry no useful information for class labelling; furthermore, the model is expected to base its decision on relevant features of the image. Thus, computer vision techniques such as blur, crop, censor bars and generative inpainting are applied in order to prevent the model from using such information during the learning phase, obtaining the best results with