1 Introduction

In recent years, deep learning has drastically changed the fields of image processing, computer vision and computational photography. For image enhancement, image-to-image models [6, 12, 30] have proved to be very effective solutions, capable of achieving very high accuracy. In most image-to-image methods a low quality image is mapped via a neural network to its high quality version, preserving the details and the semantic content. Although these methods can be very effective, they come with important limitations: they typically work on images of a given fixed resolution, they can produce visible artifacts, and they do not provide easy-to-interpret results (black-box approach). This last limitation makes it difficult for a non-expert user to understand the enhancement steps applied by the algorithm. The lack of explainability hinders their application in many fields (e.g. medical or legal) where the interpretability of the results is crucial.

The generation of artifacts in the output mainly depends on the additional, challenging requirement for the underlying neural network to learn to preserve the content of the original image. Moreover, many image-to-image methods assume a fixed size for the input images to reduce the amount of memory and computation required. Therefore, high-resolution images, such as those directly obtained from the sensor of DSLR cameras, are first downsampled to more manageable dimensions. This is a reasonable choice when the resolution of the image is not critical, but in fields where the resolution matters (e.g. computational photography) it is often unacceptable.

Fig. 1. Schematic view of eXIE. Given as input a low quality image x, a pretrained model for image enhancement is used to obtain a high quality version \(\hat{y}\). Both the low quality and the high quality versions of the image are used to execute a modified version of the A\(^*\) algorithm in order to find the shortest sequence of enhancing operators \([a_0, a_1, \dots , a_{n-1}]\) that emulates the enhancement process. Once this sequence is obtained, each operator is applied to the low quality version to enhance it.

In light of these limitations, we present a new method, called eXIE, able to emulate the enhancement process of image-to-image translation models for image enhancement. It is not a new independent enhancement method; rather, it is intended to provide explainable outputs for existing methods. eXIE takes an input image together with its translated version obtained through any other image enhancement method. It provides as output a sequence of enhancing operators that turns the input into an approximation of the output. To do so, eXIE uses a modified version of the A\(^*\) search algorithm. Figure 1 summarizes how the proposed method works.

eXIE can be used in many application scenarios. For instance, it can provide a baseline to professional photographers, who can revise and adjust the proposed processing pipeline instead of starting from scratch. It can also be used as an educational tool to support beginners in understanding when to apply image processing operators.

To verify the quality of the sequences found by the proposed method, we performed a thorough experimentation on the Five-K dataset, comparing the output of several other state-of-the-art methods with their corresponding version modified by eXIE. The loss in accuracy caused by eXIE was minimal, and in some cases the method was even able to improve the underlying enhancement method. Moreover, eXIE showed impressive results when applied to high resolution images. Official code is available here: https://github.com/OcraM17/eXIE

This paper presents the following main contributions:

  • The eXIE algorithm for the explanation of the output of existing image enhancement methods. As far as we know, this is one of the first works combining path finding and image enhancement.

  • eXIE is one of the first explainable algorithms specifically designed to obtain a step-by-step interpretation of the outputs of image enhancement methods.

  • A novel heuristic function utilized by the modified A\(^*\) algorithm that enables the quick discovery of sequences of enhancing operators.

  • An in-depth experimental evaluation, in which we observed that eXIE was able to explain the outputs of several state-of-the-art methods with minimal loss in performance, as well as to reproduce human-retouched images with high fidelity.

This paper presents a new approach in the field of image enhancement and XAI, making it more accessible and understandable for users.

The paper is organized as follows: Section 2 presents the most relevant works on image enhancement and explainable AI published in the literature. Then, Section 3 describes the proposed method and how it works. In Section 4 we define the setup of our experiments. Section 5 presents the results obtained by applying the method to the images produced by several state-of-the-art methods for image enhancement and analyzes the obtained sequences. In Section 6 we present the additional experiments that we carried out on high resolution images and using the expert-retouched images as target. Finally, Section 7 concludes the paper with a discussion of possible directions for future investigation on this topic.

2 Related works

Image enhancement is a classic problem in image processing in which a low quality image is transformed into its high quality version while preserving its visual content. Providing an explanation of the behavior of these methods is very important to allow beginner photo editors to learn how to visually enhance low quality images using the output of these methods as reference. In this section we analyze two different families of works related to our method: image enhancement methods and explainable AI (XAI) algorithms.

The goal of image enhancement algorithms is to improve the visual quality of an image. This can be achieved through various techniques such as image-to-image translation, reinforcement learning, or parametric methods. The purpose of XAI algorithms is to provide an explanation for the predictions made by the algorithms, particularly in sensitive fields such as finance and medicine.

2.1 Enhancement methods

Modern approaches to image enhancement are typically based on deep convolutional neural networks. In particular, in image-to-image translation methods a neural network learns how to transform a low quality image into its enhanced version. Isola et al. developed a GAN-based conditional architecture able to translate the input image from a source domain to a different target domain [12]. Using a similar generative adversarial architecture, Zhu et al. developed a cycle loss for translating images from an unpaired training set [31]. Liu et al. presented an unsupervised version of generative adversarial networks for image-to-image translation [17]. Ronneberger et al. developed an encoder-decoder architecture for image segmentation [23]. This architecture has been adapted over the years for several tasks, including image enhancement. Cai et al. developed a modified version of the UNet architecture to enhance low light images [4].

Another family of methods includes the parametric approaches. In this case, the neural network learns the coefficients of a parametric color transformation to be applied to the low quality input image to obtain the enhanced version. Gharbi et al. developed an architecture able to learn the coefficients of a transformation working on a low resolution version of the original input image. Once the transformation is obtained, it is applied to the high resolution image to enhance it [6]. Bianco et al. proposed an architecture able to learn the parameters of a color transformation. These parameters are combined with a basis function and the resulting transformation is applied to the input image [2]. The same research group also proposed another approach where a neural network estimates the coefficients of spline color curves [1]. Song et al. presented an enhancement architecture based on a color curve encoder. This encoder computes the color curve parameters on a low resolution version of the input image. The learned transformation is then applied to the high resolution image [26].

Recently, Zhang et al. presented a transformer based architecture for real time image enhancement [30]. In this work, a reduced version of the original structure presented by Vaswani et al. [27] is used to enhance low light images. Kim et al. [14] proposed a representative color transformation to enhance low quality images. The proposed architecture is composed of an encoder, a feature fusion module, and two representative color transformation modules (local and global). Kim et al. presented a color representation learning method to enhance low-light images [15]. Hu et al. proposed a white-box RL algorithm able to enhance a low quality image while providing the sequence of enhancing operators applied [11]. Guo et al. presented DCE-Net, a new method for light enhancement in images. It formulates light enhancement as an image-specific curve estimation task and trains a CNN to adjust the dynamic range of an image by estimating pixel-wise and high-order curves [8]. Finally, Cotogni et al. proposed a new method for enhancing low-light images using reinforcement learning. This approach combines tree-search theory with deep reinforcement learning to generate sequences of editing operations for processing low-quality images [5].

The method proposed in this paper, eXIE, can replicate the results of other methods, including those mentioned above. eXIE can also improve some of the methods, in particular those that are likely to generate artifacts (e.g. GANs). As shown in Section 5, our method can imitate the images produced by these methods while also removing the artifacts from the output. This is achieved through the use of enhancement operations that preserve the visual content of the image.

2.2 Explainable AI

In recent years, artificial intelligence (AI) and deep learning based methods have shown superior performance in several fields and applications, such as computer vision, natural language processing and time series analysis. One major drawback of AI methods is their intrinsic lack of interpretability. Several works have been presented in order to provide explanations for neural networks’ predictions, making them interpretable for the final user.

Grad-CAM is one of the most famous XAI algorithms. Following the gradient flow in a convolutional neural network, this method provides as output a localization map highlighting the parts of the image on which the network mainly focused its attention to produce the final prediction. Moreover, the method was also tested on different tasks such as visual question answering (VQA) and captioning, proving its ability to analyze and explain the reasoning of neural networks [24].

Similarly, Zintgraf et al. proposed a prediction difference analysis method to visualize neural networks’ predictions in the task of image classification. This method analyzes the prediction provided by the neural network and assigns a score to each input feature with respect to a chosen class. Unlike Grad-CAM, this method highlights parts of the input feature map using conditional and multivariate sampling. This general approach was tested on two different domains: natural images from the ImageNet dataset and medical magnetic resonance imaging scans [32].

Shrikumar et al. developed a method called Deep Learning Important FeaTures (DeepLIFT). This method decomposes the neural network’s prediction on a given input by backpropagating the contribution of each neuron of the network to the input features. A contribution score is assigned to each neuron in the CNN efficiently, with only a single backpropagation step. This score is based on the difference between the activation of each neuron and a reference activation [25].

Lundberg et al. presented an XAI feature importance method based on SHapley Additive exPlanations (SHAP) values. Given a prediction, a value is assigned to each feature according to the importance of that individual feature in the classification process [19].

Ribeiro et al. proposed a method based on decision rules for explaining the behavior of complex models. The presented method provides local sufficient conditions (highlighting parts of the input) for the correct prediction in several domains, such as text classification, structured prediction, tabular classification, image classification and visual question answering [22].

Generative adversarial networks (GANs) are among the most powerful architectures for image generation. They show high performance in several domains, producing images with high quality details. One of their main drawbacks is that they are complex black-box models. Nguyen et al. developed a GAN, based on activation maximization, to highlight the features learned by each neuron of the generator in order to explain and interpret the generation process [20].

Li et al. proposed an unrolling based method called DUBLID (Deep Unrolling for Blind Deblurring). The unrolling procedure was used to decompose an iterative algorithm, able to emulate in the gradient domain a total-variation regularization method, into a fully interpretable neural network for image deblurring [16].

Wang et al. developed an interpretable sparse coding network for image super resolution. The proposed architecture, composed of a cascade of sparse-coding networks with different scale factors, was able to obtain very good performance without introducing artifacts in the final images [28].

As far as we know, our method is the first XAI algorithm for the explanation of image enhancement methods. In fact, it provides a step-by-step explanation of the output produced by black-box image enhancement methods. To do so, it uses a variant of the A\(^*\) algorithm to emulate the enhancement process of another method by applying an equivalent sequence of enhancing operators, providing the final user with insight into the enhancement process.

2.3 Dataset

The dataset used for our experiments is the Adobe-MIT Five-K dataset [3]. This dataset is composed of 5000 high-resolution images in RAW format. For each of these images, five enhanced versions are included, each one retouched by an expert photographer identified by a letter from A to E. In our experiments we used the RAW images and the images retouched by Expert C, who is considered the most consistent of the five. We split the images by following the procedure presented by Hu et al. [11]: 4000 training images and 1000 test images.

3 Method

The aim of the proposed method, eXIE, is to find an equivalent sequence of enhancing operators able to emulate the enhancement process of another state-of-the-art method. The search of the sequences is performed by a modified version of the A\(^*\) algorithm [9]. We will start by presenting the original A\(^*\) algorithm and the modifications we introduced to tailor it to the problem of image enhancement. We will conclude by describing our novel heuristic function designed ad-hoc for eXIE and providing the complete algorithm with pseudocode.

3.1 Fundamentals of \(A^*\) algorithm

A\(^*\) is an efficient path finding algorithm, able to find the shortest path from an initial node to a final node in a graph, if it exists. To do so, the algorithm uses two functions to evaluate the nodes:

  • The backtrack function g(x), which computes the length of the path from the starting node to the node x.

  • The heuristic function h(x) that estimates the length of the optimal path connecting the current node x to the target node. In order to ensure the optimality of the solution, the heuristic must be optimistic, i.e. the estimate must not exceed the actual distance.

These two functions are added together and the resulting value \(f(x)=g(x)+h(x)\) is assigned to each node. The algorithm iteratively visits one node at a time following ascending values of f. The search terminates when the final node is visited.

While the backtrack function g(x) can be easily defined, the heuristic function h(x) must be carefully designed. A good heuristic function quickly guides the algorithm to the correct solution, while a wrong function may even result in infinite loops and cause the search procedure to fail.
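The search described above can be illustrated with a minimal sketch of the standard A\(^*\) algorithm. It is not the eXIE implementation: the toy graph, the node names and the `estimate` table are hypothetical and only serve to show how g, h and f = g + h interact.

```python
import heapq

def a_star(graph, start, goal, h):
    """Return (cost, path) of a shortest path from start to goal, or None.

    graph maps a node to a list of (neighbor, edge_cost) pairs;
    h(node) is the heuristic estimate of the remaining distance.
    """
    # Each queue entry is (f, g, node, path) with f = g + h(node).
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a shorter route to this node is known
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + h(neighbor), new_g, neighbor, path + [neighbor]),
                )
    return None

# Hypothetical toy graph; edge costs are arbitrary.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
# Optimistic estimates: none exceeds the true remaining distance.
estimate = {"A": 2, "B": 2, "C": 1, "D": 0}

result = a_star(graph, "A", "D", estimate.get)
print(result)
```

In this toy graph the direct edges A-C and B-D are more expensive than the detour through intermediate nodes, so the search returns the three-step path through B and C.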

Our method interprets an image enhancement task as a graph, allowing it to be traversed with the eXIE algorithm. Unlike the \(A^*\) algorithm, which terminates only when the final state has been reached, eXIE uses two additional stopping criteria for the search procedure. In the following subsections, we present in detail our graph model, the new heuristic function we have developed for graph traversal, and the complete pseudocode of the algorithm.

Fig. 2. Example of the graph traversed by the eXIE algorithm. For space reasons, the graph shown is generated using only three editing operators over all the channels of the images and is truncated after two levels.

3.2 eXIE

In this work, the nodes composing the graph are images and the edges connecting two nodes are image processing operators. A connection between two nodes in the graph corresponds to a transition from an image (node) to its modified version, obtained by applying an editing operator to the former. Figure 2 shows an example of the graph traversed by the search algorithm.

The initial node is identified by the low quality image \(\textbf{x}\) that needs to be enhanced. The final node is the image \(\hat{\textbf{y}}\), obtained by a different enhancement method. The goal is to find the shortest sequence of editing operators \([a_{0},a_{1},\dots ,a_{n-1}]\) that transforms \(\textbf{x}\) into an image \(\textbf{y}^*\) that is sufficiently close to \(\hat{\textbf{y}}\) (formally, \(\textbf{y}^*= a_{n-1}(a_{n-2}(\dots a_0(\textbf{x})))\)).

As editing operators we considered a small set of general filtering functions widely used in image processing. In the following, \(\textbf{x}_{ijc}\) represents channel \(c \in \{R, G, B\}\) of the pixel with coordinates (i, j):

  • Brightness adjustment:

    $$\begin{aligned} \textbf{x}_{ijc} \rightarrow \textbf{x}_{ijc} + \delta . \end{aligned}$$
    (1)

    We considered the parameters \(\delta \in \Delta = \{-0.05, +0.05, -0.005, +0.005\}\), and applied the operator either to all the color channels or to a single color channel.

  • Contrast adjustment:

    $$\begin{aligned} \textbf{x}_{ijc} \rightarrow \mu _c + \sigma \times (\textbf{x}_{ijc} - \mu _c), \end{aligned}$$
    (2)

    where \(\mu _c\) is the average channel value. We considered the variants with \(\sigma \in \Sigma = \{0.9, 1.4\}\) and applied the operator channel wise considering all channels or just one.

  • Gamma Correction:

    $$\begin{aligned} \textbf{x}_{ijc} \rightarrow (\textbf{x}_{ijc})^\gamma . \end{aligned}$$
    (3)

    We considered the two values \(\gamma \in \Gamma = \{0.6, 1.05\}\), and applied the transformation channel wise considering all channels or just one.

The values contained in the three sets \(\Delta \), \(\Gamma \), \(\Sigma \) have been selected in order to enhance the input image with a reasonable number of applications of the operators. Smaller steps would lead to the same results but would require a longer search. The value of the input pixels is always assumed to be in the [0, 1] range, and all output values are clipped to stay in that range. In total, 32 image processing operators were considered: 16 brightness, 8 contrast and 8 gamma variants, since each parameter value can be applied to all channels or to one of the three channels.
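The operator set of Eqs. (1) to (3) can be enumerated with a short sketch. The representation of an image as three flat lists `[R, G, B]` of values in [0, 1], the helper names and the dictionary keys are assumptions made for illustration and do not come from the official code.

```python
from itertools import product

DELTAS = (-0.05, +0.05, -0.005, +0.005)   # set Delta, Eq. (1)
SIGMAS = (0.9, 1.4)                        # set Sigma, Eq. (2)
GAMMAS = (0.6, 1.05)                       # set Gamma, Eq. (3)
TARGETS = ("all", 0, 1, 2)                 # all channels, or only R, G or B

def clip(v):
    """Keep a pixel value inside the [0, 1] range."""
    return min(1.0, max(0.0, v))

def brightness(channel, delta):
    return [clip(v + delta) for v in channel]

def contrast(channel, sigma):
    mu = sum(channel) / len(channel)       # average channel value mu_c
    return [clip(mu + sigma * (v - mu)) for v in channel]

def gamma(channel, g):
    return [clip(v ** g) for v in channel]

def make_operator(func, param, target):
    """Build an operator acting on an image stored as [R, G, B] lists."""
    def op(image):
        return [func(ch, param) if target in ("all", c) else list(ch)
                for c, ch in enumerate(image)]
    return op

FAMILIES = (("brightness", brightness, DELTAS),
            ("contrast", contrast, SIGMAS),
            ("gamma", gamma, GAMMAS))

OPERATORS = {}
for (name, func, params), target in product(FAMILIES, TARGETS):
    for p in params:
        OPERATORS[(name, p, target)] = make_operator(func, p, target)

def apply_sequence(image, keys):
    """Apply a sequence of operators, identified by their keys, in order."""
    for key in keys:
        image = OPERATORS[key](image)
    return image

image = [[0.2, 0.4], [0.5, 0.5], [0.9, 1.0]]          # a two-pixel image
out = apply_sequence(image, [("brightness", 0.05, "all"), ("gamma", 0.6, 0)])
print(len(OPERATORS))
```

Each of the 8 parameter values combined with the 4 channel targets yields one operator, which reproduces the total of 32 stated in the text.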

3.3 Heuristic function

The heuristic function is a critical component of the search algorithm as it provides an estimate of the distance between a given node and the final node. To ensure the optimality of the solution, it is essential that the heuristic function be optimistic, which means it must underestimate the actual distance between the nodes. However, if the heuristic function is too optimistic, the algorithm’s progress towards the target may be slow.

We defined the heuristic function by considering how many times one of the operators needs to be applied to transform a single pixel value into the target value. In more detail, for each pixel value \(\textbf{x}_{ijc}\) we compute three counters: the Brightness Counter, the Contrast Counter and the Gamma Correction Counter. The Brightness Counter is the number of times the brightness operator needs to be applied to \(\textbf{x}_{ijc}\) to make it reach the target \(\hat{\textbf{y}}_{ijc}\):

$$\begin{aligned} N^{(B)}_{ijc} = \min _{\delta \in \Delta } \frac{|\textbf{x}_{ijc}- \hat{\textbf{y}}_{ijc}|}{|\delta |}. \end{aligned}$$
(4)

We defined the Contrast Counter in a similar way. This requires identifying special cases, since it is sometimes impossible to transform the pixel value into the desired target by using this operator alone:

$$\begin{aligned} N^{(C)}_{ijc} = {\left\{ \begin{array}{ll} \frac{1}{\log \max \Sigma } \log \frac{\hat{\textbf{y}}_{ijc}-\mu _c}{\textbf{x}_{ijc}-\mu _c} &{} \text {if } \textbf{x}_{ijc}>\mu _c \text { and } \hat{\textbf{y}}_{ijc}>\mu _c, \\ \frac{1}{\log \min \Sigma } \log \frac{\hat{\textbf{y}}_{ijc}-\mu _c}{\textbf{x}_{ijc}-\mu _c} &{} \text {if } \textbf{x}_{ijc}<\mu _c \text { and } \hat{\textbf{y}}_{ijc}<\mu _c, \\ \infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(5)

Finally, the Gamma Correction Counter is defined as:

$$\begin{aligned} N^{(G)}_{ijc} = {\left\{ \begin{array}{ll} \frac{1}{\log \max \Gamma } \log \frac{\log \hat{\textbf{y}}_{ijc} }{\log \textbf{x}_{ijc}} &{} \text {if } \textbf{x}_{ijc} \ge \hat{\textbf{y}}_{ijc}, \\ \frac{1}{\log \min \Gamma } \log \frac{\log \hat{\textbf{y}}_{ijc} }{\log \textbf{x}_{ijc}} &{} \text {if } \textbf{x}_{ijc}< \hat{\textbf{y}}_{ijc}, \\ \end{array}\right. } \end{aligned}$$
(6)

The heuristic function h is defined for the whole image \(\textbf{x}\) as the largest value, over all pixels and channels, of the minimum number of applications of a single operator needed to match a pixel with the corresponding target value:

$$\begin{aligned} h(\textbf{x}) = \max _{ijc} \min _{a \in [B,C,G]} N^{(a)}_{ijc}. \end{aligned}$$
(7)
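The counters of Eqs. (4) to (6) and the heuristic of Eq. (7) can be sketched as follows. The handling of corner cases is an assumption, since the text does not discuss it: pixel values of exactly 0 or 1 are treated as unreachable by gamma correction, and a negative contrast counter is clamped to zero so that the estimate stays non-negative. Images are again assumed to be stored as three flat lists `[R, G, B]`.

```python
import math

DELTAS = (-0.05, 0.05, -0.005, 0.005)
SIGMAS = (0.9, 1.4)
GAMMAS = (0.6, 1.05)
INF = float("inf")

def brightness_counter(x, y):
    # Eq. (4): the largest step gives the fewest applications.
    return min(abs(x - y) / abs(d) for d in DELTAS)

def contrast_counter(x, y, mu):
    # Eq. (5): contrast cannot move a value across the channel mean.
    if x > mu and y > mu:
        n = math.log((y - mu) / (x - mu)) / math.log(max(SIGMAS))
    elif x < mu and y < mu:
        n = math.log((y - mu) / (x - mu)) / math.log(min(SIGMAS))
    else:
        return INF
    return max(0.0, n)   # assumption: clamp negative counts to zero

def gamma_counter(x, y):
    # Eq. (6); 0 and 1 are fixed points of the power function, so they
    # are treated as unreachable here (assumption).
    if not (0.0 < x < 1.0 and 0.0 < y < 1.0):
        return INF
    base = max(GAMMAS) if x >= y else min(GAMMAS)
    return math.log(math.log(y) / math.log(x)) / math.log(base)

def heuristic(image, target):
    """Eq. (7): worst pixel of the best single-operator counter."""
    worst = 0.0
    for ch_x, ch_y in zip(image, target):
        mu = sum(ch_x) / len(ch_x)
        for x, y in zip(ch_x, ch_y):
            n = min(brightness_counter(x, y),
                    contrast_counter(x, y, mu),
                    gamma_counter(x, y))
            worst = max(worst, n)
    return worst

flat = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]
goal = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(brightness_counter(0.3, 0.5), gamma_counter(0.3, 0.5))
print(heuristic(flat, goal))
```

For a pixel going from 0.3 to 0.5, brightness needs about four applications of the largest step, while gamma correction with \(\gamma = 0.6\) needs only slightly more than one, so the gamma counter determines the heuristic value in this example.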

Concerning the backtrack function \(g(\textbf{x})\), we counted the number of times an operator has been applied to the initial image to obtain the image \(\textbf{x}\). In order to limit the search time of the algorithm, we introduced two other modifications. The search terminates when the difference between the current and the target nodes is below a set threshold, \(\Vert \textbf{x} - \hat{\textbf{y}} \Vert < \tau \). Moreover, it can also terminate when the number of explored nodes exceeds a set limit L. When this happens, the visited node that is closest to the target is selected and the path (i.e. the sequence of enhancing operators) from the root to this node is returned as output along with the enhanced image. We observed that, when the number of visited nodes exceeds \(L=7000\), the accuracy of the output images tends to be very stable. In accordance with the maximum number of explored nodes L, we set the threshold to \(\tau =2\). These choices were made in order to deal with the trade-off between the output image quality and the time required to explore the graph. The pseudocode for the whole procedure is reported in Algorithm 1.

Algorithm 1 (figure a). Pseudocode of the eXIE search procedure.
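The two additional stopping criteria can be illustrated with a minimal best-first search sketch. This is not a transcription of Algorithm 1: the function name `exie_like_search`, its arguments and the one-value toy "images" are hypothetical, and the distance function is passed in because the norm used for the threshold is not specified in the text.

```python
import heapq
from itertools import count

def exie_like_search(x, target, operators, h, distance, tau, max_nodes):
    """Best-first search over operator sequences with two extra stops.

    operators maps a name to a function image -> image; h and distance
    take (image, target). Returns (sequence_of_names, image).
    """
    tie = count()                        # tie-breaker, avoids comparing images
    frontier = [(h(x, target), next(tie), 0, x, [])]
    best = (distance(x, target), [], x)  # closest node visited so far
    visited = 0
    while frontier and visited < max_nodes:
        _, _, g, image, seq = heapq.heappop(frontier)
        visited += 1
        d = distance(image, target)
        if d < best[0]:
            best = (d, seq, image)
        if d < tau:                      # close enough to the target: stop
            return seq, image
        for name, op in operators.items():
            child = op(image)
            # g counts the operators applied so far, so f = (g + 1) + h.
            f = g + 1 + h(child, target)
            heapq.heappush(frontier, (f, next(tie), g + 1, child, seq + [name]))
    return best[1], best[2]              # node limit reached: closest node

# Toy problem: an "image" is a single value, with brightness steps only.
ops = {
    "+0.05": lambda v: min(1.0, v + 0.05),
    "-0.05": lambda v: max(0.0, v - 0.05),
    "+0.005": lambda v: min(1.0, v + 0.005),
    "-0.005": lambda v: max(0.0, v - 0.005),
}
dist = lambda a, b: abs(a - b)
est = lambda a, b: abs(a - b) / 0.05     # brightness counter, as in Eq. (4)

seq, out = exie_like_search(0.30, 0.46, ops, est, dist, tau=0.02, max_nodes=7000)
print(seq, out)
```

With the threshold set to 0.02 the search stops after three coarse brightness steps; with a budget of a single node it would instead return the empty sequence and the input itself, which is the closest node visited.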

4 Experimentation

In this section, we present the methods and metrics used in our experiments to evaluate the performance of eXIE. First, we will give an overview of the state-of-the-art image enhancement methods considered in our experiments. Then, we will describe the evaluation metrics that we have used to quantify the performance of our algorithm and compare it to the existing methods.

4.1 Enhancement methods

We selected ten popular state-of-the-art methods for image enhancement. They belong to different families of approaches such as image-to-image translation, parametric, reinforcement learning and transformer based methods:

  1. Exposure: this deep reinforcement learning based method enhances low quality images, producing high quality images [11].

  2. CycleGAN: this GAN based method uses a cycle loss to learn the function that maps an input image from a source domain to a target domain. Among its many possible applications, it can also be applied to enhancement tasks. It may suffer from the generation of artifacts in the output images [31].

  3. DaR: this method, called Distort and Recover, is a double Q-learning based algorithm for image enhancement [21].

  4. Pix2Pix: proposed by Isola et al. in 2016, this conditional adversarial network for image-to-image translation showed great performance in several application domains [12].

  5. HDRNet: this architecture, based on bilateral grid processing and local affine color transformations, learns the transformations to be applied to the low quality, high resolution input image by observing its resized version [6].

  6. Star-DCE: a transformer based method for image enhancement. This method splits the input image into patches and embeds them into tokens. These tokens are then passed to a long-short range Transformer module composed of two branches: one for long-range context (composed of a cascade of transformer blocks) and the other for short-range context (composed of a cascade of convolutions and batch normalizations) [30].

  7. Parametric: proposed by Bianco et al., this pipeline learns, in a paired training scenario, the parameters of a color transformation using a downsampled version of the high resolution, low quality input image. Once the color transformation is obtained, it is applied to the original input image to enhance its content [2].

  8. TreEnhance: a deep reinforcement learning method that generates sequences of enhancement operations to improve the quality of low-light images. This approach, which is based on the Monte Carlo tree search algorithm, is one of the first works to merge tree-search theory and deep reinforcement learning for image enhancement [5].

  9. DCE-Net: in this method a CNN is trained to estimate image-specific curves for dynamic range adjustment. In particular, this method does not require reference images for training and uses non-reference loss functions to measure enhancement quality [8].

  10. DCE-Net-Pooling: a modified version of the previous method. In this solution, pooling is applied after each convolutional block to reduce the spatial dimension of the feature maps [8].

All these methods have been trained using the configuration described in their original papers on the 4000 training images of the Five-K dataset. The trained models were then evaluated on the 1000 test images. Finally, all the images obtained have been analyzed by eXIE to produce the corresponding sequences of operators.

4.2 Metrics

We used four different metrics: the Peak Signal-to-Noise Ratio (PSNR) [13], the Learned Perceptual Image Patch Similarity (LPIPS) [29], Delta E (\(\Delta E\)) [7] and the Structural Similarity Index (SSIM) [10].
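Of these metrics, PSNR has a simple closed form, \(10 \log _{10}(\text {MAX}^2/\text {MSE})\), which the following sketch computes. It assumes pixel values in [0, 1] stored as flat sequences; the exact evaluation protocol used in the experiments (color space, bit depth) is not described here, so the function is only illustrative.

```python
import math

def psnr(image, reference, max_value=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(image) != len(reference):
        raise ValueError("images must have the same number of values")
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(image)
    if mse == 0:
        return math.inf                  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

estimate = [0.5, 0.5, 0.5, 0.5]
reference = [0.6, 0.4, 0.5, 0.5]
score = psnr(estimate, reference)
print(round(score, 2))
```

Here the mean squared error is 0.005, so the score is \(10 \log _{10}(200)\), roughly 23 dB; higher values indicate a closer match to the reference.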

5 Results

In this section, we report the results obtained by applying eXIE to the methods presented in the previous section. We compare the performance obtained with and without the application of our algorithm from both a quantitative and a qualitative point of view. In the last part of this section, we analyze the explainable side of eXIE by inspecting the sequences of enhancing operators produced by our method.

5.1 Quantitative results

Table 1 compares the metrics computed on the output images of the considered methods and those computed on the images produced by eXIE.

Table 1. Comparison of state-of-the-art enhancement methods on the Adobe Five-K dataset with and without eXIE

Fig. 3. Comparison of the images obtained by eXIE and by the original methods

The results show that eXIE can emulate the original methods with great accuracy. The loss in terms of the four metrics considered (PSNR, \(\Delta E\), LPIPS, and SSIM) is never very high and, in some cases, the images found by eXIE are even better than the starting ones. This is due to the fact that eXIE is designed to prevent the introduction of artifacts, which are harder to reproduce than correct enhancements. This is especially true for Exposure, CycleGAN and DaR.

The difference is noticeable only for the most accurate methods, and only for the PSNR and \(\Delta E\) metrics. However, this relatively small drop in accuracy is acceptable in many applications in exchange for the explainability of eXIE. The differences in terms of LPIPS and SSIM are always negligible, demonstrating the high accuracy of our proposed algorithm in preserving the content of the images.

5.2 Qualitative results

From the qualitative analysis of the images produced by eXIE shown in Fig. 3 it is possible to notice how eXIE is able to emulate the enhancement process of the original methods with high fidelity. For instance, by looking at the images produced by eXIE on CycleGan (column 6) it is possible to see that the artifacts have been removed and the color balance for the images obtained by eXIE is better than the version produced by using the image-to-image translation method. Similarly, for DaR (column 3) and Exposure (column 2), we can see that eXIE is able to obtain better saturation and contrast in the final image with respect to the result of the original models.

When applied to the best models, Star-DCE (column 1), Pix2Pix (column 4), HDRNet (column 5), and Parametric (column 7), eXIE is able to replicate their results with high accuracy. These results demonstrate the potential of eXIE as a tool for explaining and improving the output of existing image enhancement methods.

The editing operators have been carefully chosen to avoid the introduction of artifacts in the output image. Unlike other image-to-image translation methods that rely on complex models and can generate undesirable changes, our method utilizes a simple approach to enhance images. The operators used in our method are basic and well-known in the field of image processing, making it easier to understand and control the results. As a result, our method is able to produce high-quality output images with minimal loss in accuracy and with a high level of interpretability.

One of the main features of our solution is its ability to preserve the semantic content of the image. In fact, even if the target method creates distortions in the output image, the enhancing operators designed for eXIE are not able to heavily modify the content of the image. The “Brightness Adjustment”, “Contrast Adjustment” and “Gamma Correction” operators work by modifying the color curve while preserving the structure of the images.

5.3 Sequence inspection

As a further analysis we studied the sequences produced by eXIE (Fig. 4). The sequences obtained with the considered state of the art methods are quite heterogeneous. Moreover, they follow a different order in the application of the operators.

The first operators applied are typically those that increase the global brightness of the image. These operators are applied to all the pixels of the three channels of the image. Then, when the image reaches a good overall balance, more fine-grained operators are applied. For instance, in the HDRNet column the figure shows how the algorithm finds that, after the application of the brightness operator over the whole image, the best next action is to decrease the value of the red channel of the image, obtaining a refinement of the color distribution.

Fig. 4. Example of the sequences obtained with the application of eXIE on Star-DCE, Pix2Pix and HDRNet

6 Additional experiments

In this section, we present two additional experiments. In the first experiment, we applied eXIE to low-resolution images to obtain sequences of editing operators. These sequences were then applied to the original resolution images. The main objective of this experiment was to speed up the sequence generation process without imposing any limitation on the images to be enhanced, as the sequences were applied to the original size images. The second experiment, on the other hand, used the ground truth images of the Five-K dataset as the target. This case study allowed us to assess the ability of our method to explain the manual enhancement procedures performed by expert photographers.

6.1 Enhancement of high resolution images via UNET

Many image-to-image translation methods, like Pix2Pix or CycleGAN, work with images of fixed dimensions. For these methods, enhancing very high resolution images without producing artifacts requires the spatial dimension of the features to be kept constant across all the layers of the model. We addressed this scenario by using our method on very low-resolution versions of the input images and by applying the sequences found to the original high-resolution images. This approach requires very few computational resources in terms of memory and time and, as we will see, it does not decrease output quality.

In more detail, given a high-resolution image from the Five-K dataset, we resized it to a very low resolution (\(32 \times 32\)) and provided it as input to a specially designed convolutional neural network based on the UNet [23] architecture. eXIE is then applied to the resulting low-resolution images, and the sequence of operators found is applied to the high-resolution input image. Figure 5 summarizes this approach.

Fig. 5
figure 5

The high-resolution input image \(\textbf{X}\) is resized to obtain a low-resolution version \(\textbf{x}\). This image is enhanced using the previously trained UNet, which provides as output the image \(\hat{ \textbf{y}}\). The images \(\textbf{x}\) and \(\hat{ \textbf{y}}\) are used to execute the eXIE algorithm and obtain the sequence of enhancing operators, which is then applied to the high-resolution image \(\textbf{X}\) to obtain \(\hat{\textbf{ Y}}\).
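The procedure in Fig. 5 can be sketched as follows. This is a toy illustration under strong assumptions: images are flat lists of gray values in [0, 1], the operator set is a hypothetical pair of global brightness steps, downsampling is plain subsampling, the UNet output is simulated by a brighter copy of the input, and a greedy search stands in for the modified A\(^*\) search used by eXIE. What it shows is that a sequence found at low resolution can be replayed unchanged at full resolution, because every operator acts pointwise.

```python
# Hypothetical operator set: name -> pointwise function on one intensity.
OPERATORS = {
    "brightness+": lambda p: min(1.0, p + 0.05),
    "brightness-": lambda p: max(0.0, p - 0.05),
}

def apply_sequence(pixels, sequence):
    """Apply the named operators, in order, to every pixel."""
    for name in sequence:
        pixels = [OPERATORS[name](p) for p in pixels]
    return pixels

def mse(a, b):
    """Mean squared error between two images of equal length."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def downsample(pixels, step):
    # Plain subsampling; a real pipeline would use proper resizing.
    return pixels[::step]

def greedy_search(x_low, y_low, max_len=20):
    """Greedy stand-in for the A*-based search: at each step, pick the
    operator that most reduces the distance to the target."""
    sequence, current = [], x_low
    for _ in range(max_len):
        best_name, best_img, best_err = None, None, mse(current, y_low)
        for name in OPERATORS:
            candidate = apply_sequence(current, [name])
            err = mse(candidate, y_low)
            if err < best_err:
                best_name, best_img, best_err = name, candidate, err
        if best_name is None:  # no operator improves the match
            break
        sequence.append(best_name)
        current = best_img
    return sequence

X = [i / 100 for i in range(64)]   # "high-resolution" input
x = downsample(X, 8)               # low-resolution input
y_hat = [p + 0.15 for p in x]      # stands in for the UNet output
seq = greedy_search(x, y_hat)      # search runs on the small images only
Y_hat = apply_sequence(X, seq)     # replay at full resolution
print(seq)
```

The search cost depends only on the size of the low-resolution images, while the final edit costs one pass per operator over the full-resolution image.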

The neural network used in this experiment is composed of two parts: an encoder and a decoder. Each of the four blocks in the encoder includes a convolutional layer with kernel of size \(4 \times 4\) and stride 2, batch normalization and leaky ReLU as activation function. The decoder is composed of four blocks, each of them including a transposed convolutional layer (of kernel size \(4 \times 4\) and stride 2), batch normalization, ReLU and dropout (\(p=0.05\)). The output layer of the architecture has a sigmoid function as activation function in order to restrict the values of output pixels to the range [0, 1].
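As a quick sanity check on this architecture, the spatial size of the feature maps can be traced through the four encoder and four decoder blocks using the standard output-size formulas for strided and transposed convolutions. The padding is not stated in the text; a padding of 1 is assumed here, since it is the value for which a \(4 \times 4\) kernel with stride 2 exactly halves (or doubles) the resolution. The function names are hypothetical.

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Output size of a strided convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel

size = 32  # side of the low-resolution input
encoder_sizes = []
for _ in range(4):
    size = conv_out(size)
    encoder_sizes.append(size)

decoder_sizes = []
for _ in range(4):
    size = deconv_out(size)
    decoder_sizes.append(size)

print(encoder_sizes)  # [16, 8, 4, 2]
print(decoder_sizes)  # [4, 8, 16, 32]
```

Under this assumption the bottleneck is a \(2 \times 2\) feature map and the decoder restores the \(32 \times 32\) input size.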

During the training of the network, the high-resolution input images \(\textbf{X}\) from the Five-K training set are resized to obtain their low-resolution versions \(\textbf{x}\). These images are given as input to the network, which produces the enhanced versions \(\hat{\textbf{y}} = f(\textbf{x};\theta )\). The images \(\hat{\textbf{y}}\) are compared with the target images \(\textbf{y}\) using the binary cross entropy (BCE) as loss function:

$$\begin{aligned} BCE = - \frac{1}{HWC} \sum _{c=0}^{C-1}\sum _{i=0}^{W-1}\sum _{j=0}^{H-1} \left[ \textbf{y}_{jic} \cdot \log (\hat{\textbf{y}}_{jic}) + (1-\textbf{y}_{jic})\cdot \log (1-\hat{\textbf{y}}_{jic}) \right] . \end{aligned}$$
(8)
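The loss of Eq. (8), read as the standard binary cross entropy averaged over all \(H \times W \times C\) values, can be checked numerically with a small sketch. Images are represented here as nested lists indexed as [row][column][channel], and predictions are clamped away from 0 and 1 to keep the logarithms finite; both choices, as well as the names `bce` and `eps`, are assumptions made for this example.

```python
import math

def bce(y, y_hat, eps=1e-7):
    """Mean binary cross entropy over all H x W x C values."""
    total, count = 0.0, 0
    for row_t, row_p in zip(y, y_hat):
        for px_t, px_p in zip(row_t, row_p):
            for t, p in zip(px_t, px_p):
                p = min(1.0 - eps, max(eps, p))  # avoid log(0)
                total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
                count += 1
    return -total / count

# A 1 x 2 image with 3 channels.
target = [[[1.0, 0.0, 0.5], [0.2, 0.8, 1.0]]]
uniform = [[[0.5] * 3, [0.5] * 3]]
print(bce(target, uniform))  # equals log(2) whenever every prediction is 0.5
print(bce(target, target))   # lower: predictions match the targets
```

Because the targets are continuous intensities rather than binary labels, the minimum of the loss (reached at \(\hat{\textbf{y}} = \textbf{y}\)) is generally greater than zero.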

The model was trained using standard data augmentation techniques (cropping, resizing, random flips and rotations) on the image pairs for 600 epochs with a batch size of 32. The UNet parameters are updated using mini-batch gradient descent with AdamW as optimizer [18] and a learning rate of 5e-3. The learning rate was decayed by a factor of 0.1 every 100 epochs, starting from the 200th.
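The step schedule can be written down explicitly. The sketch below assumes that each decay multiplies the learning rate by 0.1, that the first decay happens at epoch 200, and that further decays follow every 100 epochs; the function name and the zero-based epoch indexing are also assumptions.

```python
def learning_rate(epoch, base_lr=5e-3, factor=0.1, start=200, every=100):
    """Learning rate at a given (zero-based) epoch under the assumed schedule."""
    if epoch < start:
        return base_lr
    n_decays = (epoch - start) // every + 1
    return base_lr * factor ** n_decays

for epoch in (0, 199, 200, 300, 599):
    print(epoch, learning_rate(epoch))
```

With these assumptions the rate stays at 5e-3 for the first 200 epochs and is reduced four times over the 600 epochs of training.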

Table 2 Results of the experiment with the neural network for low-resolution enhancement (on low-resolution images) and the application of eXIE (on high-resolution images)
Table 3 Results of the application of eXIE on the test images of the Five-K dataset using the images of the 5 available experts as target
Fig. 6
figure 6

Example of the application of eXIE to two high-resolution images from the Five-K test set after being processed by the UNet

Fig. 7
figure 7

Action distributions of the sequences obtained by applying eXIE to replicate the enhancement process applied by the Five-K experts and by the UNet

Once trained, the network was used to enhance the low-resolution versions of the images from the Five-K test set. These enhanced images were then processed by eXIE along with the original low-resolution inputs.

Table 2 summarizes the results of the application of eXIE to the low-resolution images. The first column shows the performance of the neural model on the low-resolution images. The metrics reported in the second column are computed on the high-resolution images. eXIE shows good performance even when it is applied to very high resolution images such as the original ones provided in the Five-K dataset.

The metrics computed on the images enhanced using eXIE compare favorably with those shown in Table 1: eXIE enhances high-resolution images better than other methods such as Exposure, CycleGAN and Dar, and its performance values are very similar to those obtained by HDRNet.

Analyzing the images shown in Fig. 6, it is possible to notice the absence of artifacts and that the color balancing applied by eXIE is correct. Moreover, an analysis of the details shows that the visual content of the image is correctly preserved (this is also confirmed by the high value of SSIM in the previous table).

6.2 Case study: Human target

As a last experiment, we used eXIE to reverse engineer the work of an expert photo editor. The goal here is to reproduce the image enhanced by a human expert as a sequence of elementary editing operations. The application for this scenario is educational: a beginner photo editor could use the system to learn how to achieve a given editing effect by looking at how eXIE reduced it to a sequence of operations.

To do so, we applied eXIE on the images in the Five-K dataset with the goal of reproducing the versions enhanced by the experts. Table 3 summarizes the results obtained.

The results in Table 3 show that eXIE is able to provide very high quality images that reproduce the enhancement process applied by the experts of the Five-K dataset. Moreover, the values shown for the considered metrics are very high, confirming the ability of our algorithm to emulate not only the enhancement process of image-to-image translation models, but also the sequence of operations chosen by human experts.

From the analysis of the distributions of the operations forming the sequences selected by eXIE (Fig. 7), it is possible to notice that the most frequently selected operations are those that modify all the color channels. The distributions are quite similar for all the experts, and also for the UNet, with just some differences in the frequency of single-channel operations (for the brightness, in particular).

Observing the action probability distribution of each individual expert, it is possible to notice general trends or particular preferences. In the action distribution of Expert A, for example, brightness and gamma correction over all three channels have almost the same probability, which suggests that these two actions are interchangeable for Expert A most of the time.
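Distributions like those in Fig. 7 can be obtained by counting how often each operator appears across the sequences found for a given target. The sketch below uses made-up sequences and hypothetical operator names purely to show the computation; it does not reproduce the data behind Fig. 7.

```python
from collections import Counter

def action_distribution(sequences):
    """Relative frequency of each operator over a collection of sequences."""
    counts = Counter(op for seq in sequences for op in seq)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

# Made-up sequences for illustration only.
sequences = [
    ["brightness_all", "gamma_all", "brightness_red"],
    ["gamma_all", "brightness_all"],
    ["brightness_all", "contrast_all", "gamma_all"],
]
dist = action_distribution(sequences)
for op, freq in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{op}: {freq:.3f}")
```

Computing one such distribution per target (each expert, or the UNet) gives the per-target histograms that can then be compared side by side.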

7 Conclusions

In this paper we proposed eXIE, a novel method for explaining the enhancement process of state-of-the-art methods for image enhancement. This XAI method is able to provide an equivalent sequence of enhancement operators that emulates the behavior of image enhancement methods with only a small loss in performance.

eXIE was able to produce good-looking images, in some cases with a better color distribution than that of the methods to which it is applied. Moreover, its output images do not present artifacts and show high-quality details.

The generalization ability of the method has been tested by executing the heuristic search algorithm on low-resolution versions of high-resolution images and applying the resulting sequences to the original images.

In the future we plan to explore the application of eXIE to different processing tasks such as retargeting and restoration. We will also consider more specific application domains such as medical imaging, where the explainability of the enhancement process is of vital importance.