1 Introduction

Interpretability is of vital importance for designing trustworthy and transparent deep learning-based systems (Pedreschi et al., 2019; Tonekaboni et al., 2019), and the field of explainable artificial intelligence (XAI) has made great improvements over the last couple of years (Antoran et al., 2021; Schulz et al., 2020). However, there exist no methods for attribution-based explanations of representations, despite the tremendous advances in representation learning using, e.g., self-supervised learning (Chen et al., 2020; Caron et al., 2020; He et al., 2020). Moreover, modifying existing XAI methods to handle representations is often impractical or not possible at all, as explained in “Appendix A”. This lack of explainability makes representation learning less trustworthy and dependable, and there is therefore a need for representation learning explainability. Being able to explain learned representations would provide crucial information in several use cases. For instance, a typical clustering approach is to apply K-means to the representation produced by a feature extractor trained on unlabeled data (Lin et al., 2021; Wen et al., 2020; Yang et al., 2017), but there is no method for investigating which features are characteristic of the members of a cluster.

Representation learning explainability would also allow for a new approach to evaluating representation learning frameworks. Such frameworks are typically evaluated by training simple classifiers on the representation produced by the feature extractor or through a downstream task (Chen et al., 2020; He et al., 2020; Caron et al., 2020). However, such approaches provide only limited information about the features used by the models and might ignore important distinctions between them. For instance, a similar accuracy on some downstream task does not necessarily mean that the representations are based on the same features. This highlights the need for an explanatory framework for representations, as many of the current evaluation methods are not sufficient for illuminating differences in what features are used by different feature extractors.

However, any explanatory framework can make over- or under-confident explanations. Hence, uncertainty is a key component for designing trustworthy models, since trusting an explanation without knowing its uncertainty might lead to an unjustified trust in the model. A recent survey in which clinicians were asked what is necessary for trustworthy models found that explainability alone was not enough and that uncertainty was also of high importance (Tonekaboni et al., 2019). Our experiments show that uncertainty can be used to increase the faithfulness of explanations, by removing uncertain parts. Nevertheless, little work has been done on uncertainty in explanations of representations.

Fig. 1

Conceptual illustration of RELAX. An image is passed through an encoder that produces a new vector representation of the image. Similarly, masked images are embedded in the same latent space. Input feature importance is estimated by measuring the similarity between the representation of the unmasked input and the representations of numerous masked inputs

In this work, we present the first framework for explaining representations, entitled REpresentation LeArning eXplainability (RELAX), which is also equipped with uncertainty quantification with respect to its own explanations. The framework is illustrated in Fig. 1. RELAX measures the change in the representation of an image when compared with masked versions of itself. The core idea is that when informative parts of the input are masked out, the representation should change significantly. By averaging over numerous masks, RELAX reveals the important regions of the input. RELAX is an intuitive and highly versatile framework that can explain any representation, given a suitable similarity function and masking strategy. To provide insight into the geometrical properties of RELAX, we show that the importance of a pixel can be seen as the result of a scoring function based on an inner product between the input and the mean of the masked representations in the representation space. Figure 2 shows an example where RELAX is used to investigate the relevance maps and the corresponding uncertainties for a selection of widely used feature extraction models, which demonstrates that RELAX is a versatile framework for highlighting the emphasis that feature extractors put on pixels and regions in the input (top row).

Fig. 2

The figure shows the RELAX importance score and its uncertainty for the representation of the leftmost image for three widely used feature extractors. The first row displays the importance for the representation and the second row shows the uncertainty associated with the different explanations. Red indicates high values and blue indicates low values. In this example, two objects are present in the image, one bird prominently displayed in the foreground, and another more inconspicuous bird in the background. The plots show that all models emphasize the bird in the foreground with low uncertainty. On the other hand, there is more disagreement on how much emphasis to put on the bird in the background, also with a differing degree of uncertainty. The example illustrates that different feature extractors utilize different features in the representation of the image, and with different amounts of uncertainty. The image is taken from VOC (Everingham et al., 2009) (Color figure online)

Our contributions are:

  • RELAX, a novel framework for explaining representations that also quantifies its uncertainty.

  • A threshold approach called U-RELAX that removes uncertain parts of an explanation and increases the faithfulness of the explanations.

  • A theoretical analysis of the framework and an estimation of the number of masks needed to obtain a given level of confidence.

  • A comprehensive experimental section that compares widely used supervised and self-supervised feature extraction models and evaluates a number of hyperparameters.

  • A user study that examines how well the explanations align with human evaluation.

  • Two use cases for RELAX. First, RELAX enables explainability in state-of-the-art incomplete multi-view clustering. This illustrates the usability of RELAX in recent cutting-edge research. Second, RELAX allows for explanation of classic computer vision techniques such as Histogram of Oriented Gradients (HOG). This demonstrates that RELAX is a flexible framework, which is capable of explaining representations produced by any method, not just those produced by deep neural networks.

Code for RELAX is available at https://github.com/Wickstrom/RELAX.

2 Related Work

In this section, we present the previous works that are most closely related to our work. The focus will be on attribution-based explanations where each input feature is assigned an importance. Therefore, we will not consider other explainability methods such as example-based explanations (Koh & Liang, 2017; Karimi et al., 2020) or global explanations (Mordvintsev et al., 2015).

Occlusion-based explainability A number of occlusion-based explainability methods exist. Systematically occluding an image with a gray patch and then measuring the change in activations can be used to provide coarse explanations for CNNs (Zeiler & Fergus, 2014). A more sophisticated occlusion approach can improve explanations, in which smooth masks are generated and accumulated to produce explanations for the prediction of a model (Petsiuk et al., 2018). A slightly different approach is meaningful perturbations, where a spatial perturbation mask that maximally affects the model’s output is optimized (Fong & Vedaldi, 2017). A follow-up work proposed extremal perturbations, where a perturbation is considered extremal if it has maximal effect on the network’s output among all perturbations of a given, fixed area (Fong et al., 2019). On a different note, an information theoretic approach to XAI has been proposed, where noise is injected in order to measure the information in different regions of the input (Schulz et al., 2020). Similarly, Kolek et al. (2021) introduced a rate-distortion perspective to explainability. Note that none of these methods are capable of providing explanations for representations.

Explaining representations Attribution-based explainability methods are extensively used to explain specific sample predictions (Bach et al., 2015; Petsiuk et al., 2018; Schulz et al., 2020). However, to the best of our knowledge, no attribution-based explainability method exists for explaining representations. Initial attempts have been made to explain representations, such as Concept Activation Vectors (Kim et al., 2018), which use directional derivatives to quantify the sensitivity of the model’s prediction, but these explanations only relate the representations to high-level concepts and require label information. Similarly, network dissection has been proposed to interpret representations (Bau et al., 2017), but it requires predefined concepts and label information and does not indicate the importance of individual pixels. A different direction is designing models that have the capability to explain their own decisions built into the system (Chen et al., 2019; Alvarez-Melis & Jaakkola, 2018). Two drawbacks of such an approach are that it might lead to models with weaker performance and that it does not explain representations. Another approach maps semantic concepts to vectorial embeddings (Fong & Vedaldi, 2018). However, this requires segmentation masks that are not available in the unsupervised setting. Representations have also been investigated from learnability and describability perspectives (Laina et al., 2020), but this was achieved through human annotators, who are typically not available. Lastly, the inspectability of deep representations has been investigated through an information bottleneck approach (Losch et al., 2021), but with a focus on segmentation and predefined concepts.

Uncertainty in explainability Modeling uncertainty in explainability is a rapidly evolving research topic that is receiving an increasing amount of attention. One of the earliest works proposed to use Monte Carlo Dropout (Gal & Ghahramani, 2016) in order to estimate the uncertainty in gradient-based explanations (Wickstrøm et al., 2018, 2020), which was later followed by a similar approach based on Layer-wise Relevance Propagation (Bykov et al., 2020). Uncertainties that are inherent in the widely used LIME method (Ribeiro et al., 2016) have been explored (Zhang et al., 2019). Ensemble-based approaches, where uncertainty estimates are obtained by taking the standard deviation across the ensemble, have also been proposed (Wickstrøm et al., 2021). Recently, Counterfactual Latent Uncertainty Explanations (CLUE) was presented (Antoran et al., 2021), where uncertainty estimates from probabilistic models can be interpreted. Nevertheless, none of these approaches were designed for quantifying the uncertainty in explanations of representations, as they either require label information or are computationally impractical.

3 Representation Learning Explainability

We present RELAX, our proposed method for explaining representations, equipped with uncertainty quantification. Furthermore, we leverage RELAX’s ability to quantify uncertainty and introduce U-RELAX, a new method for filtering out uncertain parts of the explanations. This is important, as uncertain explanations might give an unwarranted trust in the model. Our framework is inspired by RISE (Petsiuk et al., 2018). However, RISE was designed for explaining predictions and is not directly transferable to explaining representations or quantifying uncertainty. Note that the proofs of the theorems in this section are given in “Appendix E”.

3.1 RELAX

The central idea of RELAX is that when informative parts are masked out, the representation should change significantly. Let \(\textbf{X}\in \mathbb {R}^{H\times W}\) represent an image consisting of \(H\times W\) pixels, and let f denote a feature extractor that transforms an image into a representation \(\textbf{h} = f(\textbf{X}) \in \mathbb {R}^{D}\). To mask out regions of the input, we apply a stochastic mask \(\textbf{M}\in [0, 1]^{H\times W}\), where each element \(M_{ij}\) is drawn from some distribution.

The stochastic variable \(\bar{\textbf{h}} = f(\textbf{X} \odot \textbf{M})\), where \(\odot \) denotes element-wise multiplication, is a representation of a masked version of \(\textbf{X}\). Moreover, we let \(s(\textbf{h}, \bar{\textbf{h}})\) represent a similarity measure between the unmasked and the masked representation. Intuitively, \(\textbf{h}\) and \(\bar{\textbf{h}}\) should be similar if \(\textbf{M}\) masks non-informative parts of \(\textbf{X}\). Conversely, if informative parts are masked out, the similarity between the two representations should be low.

Motivated by this intuition, we define the importance \(R_{ij}\) of pixel (i, j) as:

$$\begin{aligned} R_{ij} = \textrm{E}_{\textbf{M}}\big [s(\textbf{h}, \bar{\textbf{h}})M_{ij}\big ]. \end{aligned}$$
(1)

Equation (1) is core to our framework as it computes the importance of a pixel (i, j) as a weighted similarity score for masked versions of a given image. However, integrating over the entire support of \(\textbf{M}\) is not computationally feasible. Therefore, we approximate the expectation in Eq. (1) by sampling N masks and computing the sample mean:

$$\begin{aligned} \bar{R}_{ij} = \frac{1}{N}\sum \limits _{n=1}^N s(\textbf{h}, \bar{\textbf{h}}_n)M_{ij}(n). \end{aligned}$$
(2)

Here, \(\bar{\textbf{h}}_n\) is the representation of the image masked with mask n, and \(M_{ij}(n)\) is the value of element (i, j) of mask n. The explanations of RELAX are computed through Eq. (2), and an illustration of RELAX is given in Fig. 1. As a similarity measure, we use the cosine similarity

$$\begin{aligned} s(\textbf{h}, \bar{\textbf{h}}) = \frac{\langle \textbf{h}, \bar{\textbf{h}}\rangle }{\Vert \textbf{h}\Vert \Vert \bar{\textbf{h}}\Vert }, \end{aligned}$$
(3)

where \(\Vert \cdot \Vert \) denotes the Euclidean norm of a vector. There are several motivations for this choice. First, Liu et al. (2021) argued that angular information preserves the essential semantics in neural networks, in contrast to magnitude information. Since the cosine kernel normalizes the representations, essentially discarding magnitude information, such a similarity measure is well suited to capture the key information encoded in the representations. We have compared the cosine similarity to the Euclidean distance to examine the effect of including magnitude information, with the results shown in “Appendix B”. Second, the cosine kernel does not rely on hyperparameters that must be selected, which is beneficial in an unsupervised setting where we cannot do cross-validation. Third, a large portion of feature extractors trained using self-supervised learning use the cosine kernel in their loss function (Chen et al., 2020; Chen & He, 2021). Therefore, it is the natural choice for measuring similarities in their latent space. Based on the first two points, the cosine kernel is also suitable for models trained without the cosine kernel. Lastly, other alternatives for the kernel function, such as the radial basis function or polynomial kernel, require careful tuning of hyperparameters. We consider an investigation of such alternatives and their hyperparameters a direction for future research.

Note that we recognize that the masking strategy can introduce a shift in the distribution of pixel intensities. However, in our experiments, we observed that this potential shift did not impact the explanations. An experiment where the distribution is approximately preserved is included in “Appendix C”.

Masking distribution There are several ways to sample the masks in Eq. (2), for instance by letting each \(M_{ij}(n)\) be i.i.d. Bernoulli. However, sampling masks with the same size as the input results in a massive sample space, and simultaneously makes it challenging to create smooth masks that cover different portions of the image.

To avoid these problems, we generate masks as suggested by Petsiuk et al. (2018). Binary masks of smaller size than the input image are generated, where each element of these smaller masks is sampled from a Bernoulli distribution with probability p. These masks are then upsampled using bilinear interpolation to the same size as the image. The distribution for \(M_{ij}\) is then a continuous distribution between 0 and 1. Specifically, we sample N binary masks, each of size \(h\times w\), where \(h<H\) and \(w<W\). We upsample these masks to size \((h+1)C_H \times (w+1)C_W\), where \(C_H\times C_W=\lfloor H/h\rfloor \times \lfloor W/w\rfloor \) is the size of a cell in the upsampled masks. Lastly, we crop the final masks of size \(H\times W\) randomly from the \((h+1)C_H \times (w+1)C_W\) masks.
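
To make the masking and estimation procedure concrete, the following is a minimal PyTorch sketch of the mask generation described above and of the estimator in Eq. (2). The helper names (`generate_masks`, `relax_importance`), the generic `encoder`, and the default parameter values are illustrative assumptions, not the official implementation (see the repository linked above for that).

```python
import torch
import torch.nn.functional as F

def generate_masks(n_masks, h=7, w=7, H=224, W=224, p=0.5, device="cpu"):
    """Sample smooth masks: small Bernoulli(p) grids, bilinearly upsampled
    and randomly cropped to the image size (Petsiuk et al., 2018)."""
    cell_h, cell_w = H // h, W // w                      # C_H, C_W
    up_h, up_w = (h + 1) * cell_h, (w + 1) * cell_w      # size before cropping
    grid = (torch.rand(n_masks, 1, h, w, device=device) < p).float()
    up = F.interpolate(grid, size=(up_h, up_w), mode="bilinear", align_corners=False)
    masks = torch.empty(n_masks, 1, H, W, device=device)
    for n in range(n_masks):
        i = torch.randint(0, cell_h, (1,)).item()
        j = torch.randint(0, cell_w, (1,)).item()
        masks[n] = up[n, :, i:i + H, j:j + W]
    return masks                                          # values in [0, 1]

def relax_importance(x, encoder, masks):
    """Estimate Eq. (2): mask-weighted average cosine similarity between the
    unmasked and masked representations. x has shape (1, C, H, W)."""
    with torch.no_grad():
        h_unmasked = encoder(x)                           # (1, D)
        R = torch.zeros(x.shape[-2:], device=x.device)    # (H, W)
        for mask in masks:
            h_masked = encoder(x * mask)                  # mask broadcasts over channels
            sim = F.cosine_similarity(h_unmasked, h_masked, dim=-1)
            R += sim.squeeze() * mask.squeeze()
        return R / len(masks)
```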

Number of masks required In order to minimize the computational cost of RELAX, we derive the following lower bound on the number of masks required for a certain estimation error.

Theorem 3.1

Suppose \(s(\cdot , \cdot )\) is bounded in (0, 1). Then, for any \(\delta \in (0, 1)\) and \(t > 0\), if N in Eq. (2) satisfies:

$$\begin{aligned} N \ge -\frac{\ln (\delta /2)}{2t^2}, \end{aligned}$$
(4)

we have \(\textrm{P}(|\bar{R}_{ij} - R_{ij}|\ge t)\le \delta \).

Theorem 3.1 states that if N satisfies Eq. (4), we are able to estimate \(R_{ij}\) to an absolute error of less than t with probability at least \(1-\delta \). See “Appendix E” for the proof and a verification of the bound. In all of our experiments, we generate 3000 masks, which by Theorem 3.1 ensures an estimation error below 0.03 with a probability of 0.99.
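
For reference, the bound in Eq. (4) is straightforward to evaluate; a small sketch in Python:

```python
import math

def masks_needed(t, delta):
    """Lower bound on N from Eq. (4): absolute error below t with probability >= 1 - delta."""
    return math.ceil(-math.log(delta / 2) / (2 * t ** 2))

print(masks_needed(t=0.03, delta=0.01))  # 2944, so 3000 masks are sufficient
```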

RELAX from a kernel perspective To provide insights into the geometrical properties of RELAX, we present a kernel viewpoint of Eq. (2).

Theorem 3.2

Suppose the similarity function \(s(\cdot , \cdot )\) is a valid Mercer kernel (Mercer, 1909). The importance \(\bar{R}_{ij}\) then acts as a linear scoring function between \(\textbf{h}\), and the weighted mean of \(\bar{\textbf{h}}_1, \dots , \bar{\textbf{h}}_N\) in the reproducing kernel Hilbert space (RKHS) induced by \(s(\cdot , \cdot )\). That is:

$$\begin{aligned} \bar{R}_{ij} = \langle \phi (\textbf{h}), \frac{1}{N}\sum _{n=1}^N\phi (\bar{\textbf{h}}_n) M_{ij}(n)\rangle _{\mathcal H}, \end{aligned}$$
(5)

where \(\phi : \mathbb R^d \rightarrow \mathcal H \) is the mapping to the RKHS, \(\mathcal H\), induced by the kernel \(s(\cdot , \cdot )\), and \(\langle \cdot , \cdot \rangle _{\mathcal H}\) is the inner product on \(\mathcal H\).

Theorem 3.2 provides interesting insight, as many scoring functions are based on inner products, e.g., between points of interest and class-conditional means (as in Fisher discriminant analysis or the Bayes classifier under Gaussian distributions with equal covariance structure). This means that even though RELAX is a novel approach, it is founded on well-known statistical concepts (McCullagh & Nelder, 1989).
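
The key step behind Theorem 3.2 is short (the full proof is in “Appendix E”): since a Mercer kernel is an inner product in its RKHS, linearity of the inner product gives

$$\begin{aligned} \bar{R}_{ij} = \frac{1}{N}\sum \limits _{n=1}^N s(\textbf{h}, \bar{\textbf{h}}_n)M_{ij}(n) = \frac{1}{N}\sum \limits _{n=1}^N \langle \phi (\textbf{h}), \phi (\bar{\textbf{h}}_n)\rangle _{\mathcal H}\, M_{ij}(n) = \Big \langle \phi (\textbf{h}), \frac{1}{N}\sum \limits _{n=1}^N \phi (\bar{\textbf{h}}_n) M_{ij}(n)\Big \rangle _{\mathcal H}. \end{aligned}$$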

Additionally, RELAX has the following interpretation from non-parametric statistics.

Theorem 3.3

Suppose \(s(\cdot , \cdot )\) is a valid Parzen window (Theodoridis & Koutroumbas, 2009). Then:

$$\begin{aligned} \bar{R}_{ij} \propto p_{ij}(\textbf{h}), \end{aligned}$$
(6)

where \(p_{ij}(\cdot )\) is a weighted Parzen density estimate (Parzen, 1962) of the density of the masked embeddings:

$$\begin{aligned} p_{ij}(\cdot ) = \frac{1}{\sum _{n'=1}^{N}M_{ij}(n')} \sum \limits _{n=1}^{N} s(\cdot , \bar{\textbf{h}}_n) M_{ij}(n). \end{aligned}$$
(7)

A high RELAX score is obtained when the unmasked representation \(\textbf{h}\) is close to the mean of the masked representations, which aligns well with our intuition for RELAX.

3.2 Uncertainty in Explanations

Trusting an explanation without a notion of uncertainty can lead to an unjustified faith in the model. Therefore, we introduce an approach that allows uncertainty quantification to be incorporated into the RELAX framework. Our intuition for this approach stems from what happens when informative and uninformative parts are masked out. If informative parts are masked out, the similarity score will not only drop, but drop to a varying degree. If there is a large variation in the similarity scores for a given pixel, it indicates that the explanation for that pixel is uncertain. Based on this intuition, we propose to estimate the uncertainty in input feature importance as:

$$\begin{aligned} U_{ij} = \textrm{E}_{\textbf{M}} [(s(\textbf{h}, \bar{\textbf{h}}) - \bar{R}_{ij})^2 M_{ij}]. \end{aligned}$$
(8)

Again, it is not feasible to integrate over all of \(\textbf{M}\) and \(U_{ij}\) is therefore approximated by the sample variance:

$$\begin{aligned} \bar{U}_{ij} = \frac{1}{N}\sum \limits _{n=1}^N (s(\textbf{h}, \bar{\textbf{h}}_n) - \bar{R}_{ij})^2 M_{ij}(n). \end{aligned}$$
(9)

Equation (9) estimates the uncertainty of the RELAX score for pixel (i, j) by measuring the spread, along \(M_{ij}\), between the similarity scores and the importance estimate. In other words, Eq. (9) estimates the uncertainty in the importance scores themselves. To estimate Eq. (9), we must first estimate the importance of a pixel. The uncertainty estimates provided in Eq. (9) can be thought of as measuring the spread of pixel importance values in relation to the importance estimated using Eq. (2). Our method has several benefits. First, it requires no labels, which are sometimes required by other uncertainty estimation methods (Antoran et al., 2021). Second, it avoids computationally intensive sampling methods, for instance Monte Carlo sampling (Teye et al., 2018; Gal & Ghahramani, 2016). Lastly, the uncertainty estimation can be combined with the computation of Eq. (2), as explained in Sect. 3.4.
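
A minimal sketch of the two-pass estimator in Eq. (9), reusing the hypothetical `relax_importance`/`generate_masks` helpers and the imports from the sketch in Sect. 3.1; the same masks must be used for the importance and uncertainty estimates.

```python
def relax_uncertainty(x, encoder, masks, R_bar):
    """Estimate Eq. (9): mask-weighted spread of the similarity scores
    around the importance estimate R_bar from Eq. (2)."""
    with torch.no_grad():
        h_unmasked = encoder(x)
        U = torch.zeros_like(R_bar)
        for mask in masks:
            h_masked = encoder(x * mask)
            sim = F.cosine_similarity(h_unmasked, h_masked, dim=-1).squeeze()
            U += (sim - R_bar) ** 2 * mask.squeeze()
        return U / len(masks)
```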

Fig. 3

Comparison of RELAX and U-RELAX on an image taken from PASCAL VOC, where red indicates high importance and blue indicates low importance. In this case, the emphasis on the bird in the background is removed as the uncertainty was too high for this part of the explanation (Color figure online)

3.3 U-RELAX: Uncertainty Filtered Explanations

Not all parts of an explanation have the same level of uncertainty associated with them. In such cases, it could be beneficial to remove input features that are indicated as important but have high uncertainty, while only keeping important input features with low uncertainty. This could increase the faithfulness of an explanation and provide clearer explanations. Therefore, we propose a thresholding approach where parts of the explanation with high uncertainty are removed. We define our U-RELAX importance score as:

$$\begin{aligned} \bar{R}_{ij}' = {\left\{ \begin{array}{ll} \bar{R}_{ij},\quad &{}\text { if } \bar{U}_{ij} < \epsilon \\ 0, &{}\text { otherwise } \end{array}\right. }, \end{aligned}$$
(10)

where \(\epsilon \) is a threshold chosen by the user. Essentially, Eq. (10) provides the possibility to only consider explanations of a particular certainty level, depending on \(\epsilon \). We propose two ways of choosing \(\epsilon \). The first is:

$$\begin{aligned} \epsilon = \frac{\gamma }{HW}\sum _i^H\sum _j^W \bar{U}_{ij}, \end{aligned}$$
(11)

that is, the average uncertainty for a particular image, weighted by the hyperparameter \(\gamma \). This provides a simple and intuitive way of selecting the threshold, which is motivated by only wanting to consider pixels that have high importance and low uncertainty. Alternatively, \(\epsilon \) can be computed by taking the median uncertainty for a particular image. Using mean or median statistics to select hyperparameters is a common approach in machine learning. For instance, in kernel methods the kernel width is often chosen by taking the mean or median distance between all samples in the training data (Nordhaug Myhre et al., 2018; Shi et al., 2009). Which approach gives the best performance depends on the distribution of the data, in this case the distribution of the uncertainty estimates for a given image. The median is more robust to outliers in the data (Leys et al., 2013), and could therefore be a better choice for noisy or challenging samples. If the distribution is symmetric, the mean is usually preferred (Leys et al., 2013). In Sect. 5.4, we conduct a thorough examination of the mean versus median thresholding approach for U-RELAX.

We refer to this uncertainty-filtered version of RELAX as U-RELAX. Figure 3 shows an example of the U-RELAX explanation compared with the RELAX explanation. In this case, the emphasis on the bird in the background is removed as the uncertainty was too high for this part of the explanation.
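
A small sketch of the filtering in Eqs. (10) and (11), building on the earlier sketches; `R_bar` and `U_bar` denote the RELAX importance and uncertainty maps, and the \(\gamma \)-weighting of the median threshold is an assumption made by analogy with Eq. (11).

```python
def u_relax(R_bar, U_bar, gamma=1.0, aggregation="median"):
    """Eq. (10): keep importance scores only where the uncertainty is below the threshold."""
    if aggregation == "mean":
        eps = gamma * U_bar.mean()        # Eq. (11): gamma-weighted mean uncertainty
    else:
        eps = gamma * U_bar.median()      # median alternative (assumed gamma-weighted)
    return torch.where(U_bar < eps, R_bar, torch.zeros_like(R_bar))
```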

3.4 One-Pass Version of RELAX

Computing Eq. (9) requires first computing Eq. (2), since the uncertainty estimate requires an estimate of the importance. This introduces additional computational overhead. We refer to computing Eq. (2) followed by Eq. (9) as the two-pass version of RELAX. To improve computational efficiency, we propose an online version of RELAX where importance and uncertainty are computed simultaneously, which we refer to as the one-pass version of RELAX. One-pass RELAX is based on well-known estimators of running mean and variance (West, 1979). Importance is computed as:

$$\begin{aligned} \begin{aligned} \bar{R}_{ij}^{(n)}&= \bar{R}_{ij}^{(n-1)} \\&\quad + M_{ij}(n)\frac{s(\textbf{h}, \bar{\textbf{h}}_n)-\bar{R}_{ij}^{(n-1)}}{W_{ij}(n)}, \end{aligned} \end{aligned}$$
(12)

where \(\bar{R}_{ij}^{(n)}\) is the importance of pixel (i, j) after the nth mask, and \(W_{ij}(n)=\sum _{n'=1}^n M_{ij}(n')\) is the sum of the mask elements (i, j) over the first n masks. Uncertainty is computed as:

$$\begin{aligned} \begin{aligned} \bar{U}_{ij}^{(n)}&= \bar{U}_{ij}^{(n-1)}+ M_{ij}(n)(s(\textbf{h}, \bar{\textbf{h}}_n) \\&\quad - \bar{R}_{ij}^{(n)})(s(\textbf{h}, \bar{\textbf{h}}_n)-\bar{R}_{ij}^{(n-1)}), \end{aligned} \end{aligned}$$
(13)

where \(\bar{U}_{ij}^{(n)}\) is the uncertainty in the importance of pixel (i, j) after the nth mask. Pseudo-code is shown in Algorithm 1. All experiments are carried out using the one-pass version of RELAX. See “Appendix F” for a comparison of the one-pass and two-pass versions.

Algorithm 1 One-pass RELAX
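
A minimal PyTorch sketch of these running updates following Eqs. (12) and (13) (not the authors' Algorithm 1); the final normalization of the returned uncertainty map and the small constant guarding the division are assumptions.

```python
import torch
import torch.nn.functional as F

def relax_one_pass(x, encoder, masks, eps=1e-12):
    """One-pass (running) estimates of importance and uncertainty following
    the update rules in Eqs. (12) and (13)."""
    with torch.no_grad():
        h = encoder(x)                                     # unmasked representation
        H, W = x.shape[-2:]
        R = torch.zeros(H, W, device=x.device)             # running importance
        U = torch.zeros(H, W, device=x.device)             # running spread
        W_sum = torch.full((H, W), eps, device=x.device)   # running sum of mask weights
        for mask in masks:
            m = mask.squeeze()
            sim = F.cosine_similarity(h, encoder(x * mask), dim=-1).squeeze()
            W_sum += m
            R_new = R + m * (sim - R) / W_sum              # Eq. (12)
            U = U + m * (sim - R_new) * (sim - R)          # Eq. (13)
            R = R_new
        return R, U
```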

4 Evaluation and Baseline

4.1 Evaluation of Explanations

Evaluation is a developing subfield of XAI, and a unifying score is not agreed upon (Doshi-Velez & Kim, 2017), even more so for explanations of representations. To evaluate the explanations, we use two of the most widely used explainability evaluation scores, namely localisation and faithfulness (Samek et al., 2017; Petsiuk et al., 2018; Fong et al., 2019; Schulz et al., 2020). All scores are computed using the Quantus toolbox. Evaluating these metrics is not just important for comparison, but also for ensuring the correctness and rigour of RELAX, similarly to other works (Selvaraju et al., 2017). By measuring the localisation and faithfulness scores of the explanations created by RELAX, we empirically investigate the correctness and reliability of RELAX.

Localisation The explanations should put emphasis on input regions corresponding to the objects present in an image. Localisation measures to which degree the explanation agrees with the ground truth location of an object. High performance in localisation indicates that the explanations often align with the bounding boxes or segmentation masks provided by human annotators. We consider three localisation scores: the pointing game (Zhang et al., 2017), top-k intersection, and relevance rank accuracy (Arras et al., 2022). The pointing game measures whether the pixel with the highest importance is located within the object location. Top-k intersection considers the binarized version of the top-k most important pixels and measures the intersection with the ground truth mask. Relevance rank accuracy is measured by taking the ratio of high-intensity relevances within the ground truth mask. Since RELAX operates in the unsupervised setting, we do not have explanations for individual classes. Therefore, the bounding boxes/segmentation masks are collected into one unified bounding box/segmentation mask. This results in an unsupervised version of localisation that is suitable for explaining representations.
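
Minimal NumPy sketches of the three localisation scores as they are described above; the scores reported in the experiments are computed with the Quantus toolbox, so these are only illustrative. Here `relevance` is a 2D importance map and `gt_mask` is the unified binary object mask.

```python
import numpy as np

def pointing_game(relevance, gt_mask):
    """Hit if the single most important pixel falls inside the unified object mask."""
    i, j = np.unravel_index(relevance.argmax(), relevance.shape)
    return bool(gt_mask[i, j])

def top_k_intersection(relevance, gt_mask, k=1000):
    """Fraction of the k most important pixels that fall inside the object mask."""
    top = np.argsort(relevance.ravel())[-k:]
    return float(gt_mask.ravel()[top].mean())

def relevance_rank_accuracy(relevance, gt_mask):
    """Fraction of the |mask| highest-ranked pixels that lie inside the mask."""
    k = int(gt_mask.sum())
    top = np.argsort(relevance.ravel())[-k:]
    return float(gt_mask.ravel()[top].mean())
```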

Faithfulness Pixels assigned high importance should be indicative of “true” importance. Faithfulness is typically measured by monitoring the classification accuracy of a classifier as input features are iteratively removed. High faithfulness indicates that the explanation is capable of identifying features that are important for classifying an image correctly. We measure faithfulness using the monotonicity score. Nguyen and Martinez (2020) proposed to measure monotonicity by computing the correlation between the absolute values of the attributions and the uncertainty in the probability estimation. This indicates whether an explanation correctly highlights important features in the input.

4.2 Representation Explainability Baseline

Fig. 4

Comparison of RELAX and saliency explanation for an image from PASCAL VOC. The example shows how both explanations focus on the dog, but the saliency explanation is much more erratic and unfocused than the RELAX explanation

Fig. 5

Comparison of RELAX and Saliency explanation for an image from PASCAL VOC. The example shows how RELAX captures information about both objects, while the saliency explanation is focused on the gap in between the two objects

While there are no existing methods that provide attribution-based explanations for representations, it is possible to adapt certain methods to provide such explanations. One of the most common baselines in the field of explainability is saliency explanations (Springenberg et al., 2015; Adebayo et al., 2018), which utilize gradient information to attribute importance. An explanation is obtained by computing the gradient of a prediction with respect to the input. However, it is not trivial to extend such methods to explaining representations. We propose the following saliency approach:

$$\begin{aligned} \textbf{S} = \frac{1}{D}\sum \limits _{d=1}^D \nabla f(\textbf{X})_d, \end{aligned}$$
(14)

where D is the dimensionality of the representation and \(S_{ij}\) is the importance of pixel (i, j) for the given representation. The gradient for each dimension of the representation gives an explanation, and Eq. (14) takes the mean across all of these explanations. This is the most straightforward and intuitive approach for explaining representations with gradients. It also illustrates the challenges that arise when adapting gradient-based explanations to representations, as some form of agglomeration of the explanations is required. Figures 4 and 5 show a qualitative comparison between the RELAX and saliency explanations for the representation of an image. Both figures illustrate how RELAX provides clearer and more intuitive explanations that capture information related to the objects in the image, when compared with the saliency explanation.

Once the saliency approach in Eq. (14) has been established, it is also possible to adapt improvements of the standard saliency explanations. For instance, Guided Backpropagation is a widely used explainability technique that uses gradient information (Springenberg et al., 2015). Guided Backpropagation differs from Eq. (14) by zeroing out negative gradients in the backward pass of the backpropagation scheme. We define the Guided Backpropagation procedure for representations as:

$$\begin{aligned} \textbf{S}_{{\textsc {gb}}} = \frac{1}{D}\sum \limits _{d=1}^D \nabla _{{\textsc {gb}}} f(\textbf{X})_d. \end{aligned}$$
(15)

Second, SmoothGrad is another gradient-based explainability method that can be adapted from Eq. (14) (Smilkov et al., 2017). SmoothGrad injects noise into the input and produces an explanation by averaging over multiple explanations. We define SmoothGrad for representations as:

$$\begin{aligned} \textbf{S}_{{\textsc {sg}}} = \frac{1}{M}\sum \limits _{m=1}^M\frac{1}{D}\sum \limits _{d=1}^D \nabla f(\textbf{X}_m)_d, \end{aligned}$$
(16)

where M is the number of explanations computed based on the noisy input.
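
A minimal PyTorch sketch of the baselines in Eqs. (14) and (16), reusing the imports from the earlier sketches; the channel aggregation for display, the noise level `sigma`, and the number of samples are assumptions (the Guided Backpropagation variant in Eq. (15) additionally requires backward hooks on the ReLU layers and is omitted here).

```python
def representation_saliency(x, encoder):
    """Eq. (14): mean over representation dimensions of the gradient w.r.t. the input."""
    x = x.detach().clone().requires_grad_(True)
    h = encoder(x)                                  # (1, D)
    h.mean().backward()                             # gradient of (1/D) * sum_d f(X)_d
    return x.grad.abs().sum(dim=1).squeeze(0)       # aggregate colour channels for display

def representation_smoothgrad(x, encoder, n_samples=25, sigma=0.1):
    """Eq. (16): average the Eq. (14) saliency over noisy copies of the input."""
    maps = [representation_saliency(x + sigma * torch.randn_like(x), encoder)
            for _ in range(n_samples)]
    return torch.stack(maps).mean(dim=0)
```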

Adaptation of state-of-the-art methods As explained in “Appendix A”, many of the existing explanation methods are not trivially extended to the representation learning explainability setting. Nevertheless, using the baselines introduced above we can construct adaptations of the state-of-the-art algorithms integrated gradients (Sundararajan et al., 2017) and GRAD-CAM (Selvaraju et al., 2017). For the integrated gradients explanations, we follow their proposed procedure but compute gradients using Eq. (14). For the GRAD-CAM explanations, the upsampled output of the global average pooling layer is typically weighted by the class weights. However, these class weights are not available in our unsupervised representation learning setting. Therefore, we weight all parts equally, and gradients are computed using Eq. (14).

Fig. 6

The figure shows the RELAX explanation and its uncertainty for the representation of the leftmost image for a number of widely used feature extractors. The first row displays the explanations for the representation and the second row shows the uncertainty associated with the different explanations. Red indicates high values and blue indicates low values. In this example, three elephants are visible in the image. The results show that all models highlight the elephant in the foreground as important for the representation, but there is more disagreement about the elephants in the background. Moreover, the uncertainty of the explanation for the elephant in the foreground is very low compared to the remaining regions of the image. Image is taken from MS COCO (Color figure online)

5 Experiments

To evaluate RELAX, we conduct numerous experiments and report both quantitative and qualitative results. We evaluate several feature extraction models, both deep and non-deep, trained with and without supervision. Our experiments show the advantages of RELAX compared to the baselines, and illustrate how RELAX enables new approaches for analysing and understanding representation learning.

Implementation details. For the supervised model, we use the pretrained model from Pytorch (Paszke et al., 2019). For the models trained without labels but with self-supervision, we use the SimCLR (Chen et al., 2020) and SwAV (Caron et al., 2020) frameworks, both of which have seen recent widespread use. These methods are chosen to represent two major types of self-supervised learning frameworks, namely contrastive instance learning (SimCLR) and clustering-based learning (SwAV). For SimCLR and SwAV, we use the pretrained models from Pytorch Lightning Bolts (Falcon & Cho, 2020). We use a ResNet50 (He et al., 2016) as the backbone for the feature extractors, and all models are trained on ImageNet (Deng et al., 2009). Additionally, we perform experiments with recent Vision Transformer architectures (Dosovitskiy et al., 2021). The results of these experiments are shown in “Appendix G”.
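
For reference, a supervised ResNet50 feature extractor of this kind can be obtained by dropping the classification head of the pretrained torchvision model and keeping the pooled 2048-dimensional output; a minimal sketch using the recent torchvision weights API (the self-supervised SimCLR/SwAV backbones from Pytorch Lightning Bolts are wrapped analogously):

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())  # drop the fc layer
encoder.eval()

x = torch.randn(1, 3, 224, 224)        # a dummy image tensor
with torch.no_grad():
    h = encoder(x)                     # representation to be explained with RELAX
print(h.shape)                         # torch.Size([1, 2048])
```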

As in previous works (Fong et al., 2019; Schulz et al., 2020), we use the test split of PASCAL VOC07 (VOC) (Everingham et al., 2009) and the validation split of MSCOCO2014 (COCO) (Lin et al., 2014) for evaluating the localisation scores, since they contain information about the location of the objects in the images. For the faithfulness score, we use the validation set of ImageNet (Deng et al., 2009). For all datasets, we randomly sample 1000 images for evaluation and repeat all experiments 3 times. Since we are interested in investigating how RELAX and U-RELAX vary due to the stochastic masking process, we use the same 1000 images across the repeated experiments. We generate 3000 masks to ensure a low estimation error. We set \(h=w=7\) and resize all images to \(H=W=224\), as suggested by Zhang et al. (2017). For the monotonicity score, we use Alexnet (Krizhevsky et al., 2012) as the classifier, as suggested by Samek et al. (2017). We also experiment with VGG13 (Simonyan & Zisserman, 2015) as the classifier for the monotonicity score. These results are reported in “Appendix H”. The threshold for U-RELAX is determined with median aggregation and \(\gamma =1.0\), based on the empirical evaluation conducted in Sect. 5.4.

Table 1 Pointing game, top k, and relevance rank scores in percentages and averaged over 3 runs

5.1 Qualitative Results

Figures 2 and 6 display the explanation and the uncertainty in the explanations provided by RELAX for an image from the PASCAL VOC and MS COCO datasets, respectively. See “Appendix J” for additional qualitative results. The input to the feature extractors is shown on the left, the first row shows the explanations, and the second row shows the uncertainties.

Table 2 Monotonicity scores averaged over 3 runs
Table 3 Human evaluation of representation explainability methods across 10 images from the PASCAL VOC dataset
Table 4 Evaluation of U-RELAX hyperparameters in terms of pointing game, top k, and relevance rank scores in percentages and averaged over 3 runs

5.1.1 Are all instances of the same object equally important?

Figure 2 shows an example with two objects, one bird prominently displayed in the foreground, and another more inconspicuous bird in the background. An interesting question that RELAX allows us to answer is: are both of these birds important for the representation of this image? And, are both of them equally important? First, all models indicate that the bird in the foreground is important, and that the importance scores for this bird have low uncertainty. Second, SimCLR puts little emphasis on the bird in the background. In contrast, both the supervised feature extractor and SwAV highlight the second bird as having an influence on the representation. However, the uncertainty estimates for the second bird are slightly higher than those for the first bird, but still low compared to the remaining parts of the image.

5.1.2 What features are important in complex images with numerous objects?

Figure 6 shows an image with three elephants, one in the foreground and two in the background. Additionally, the background is more diverse and the objects have different lighting and perspective. Again, RELAX enables investigation of interesting aspects of the representations, such as: are the models capable of recognizing all elephants and of utilizing this information? Do the models focus on background information instead of the objects? All models highlight the elephant in the foreground as important with high certainty. However, there is little emphasis on the shaded elephant, and the associated region of the image also has a high degree of uncertainty. Both the supervised model and SwAV put some importance on the third elephant with some degree of certainty, while SimCLR uses little or no information about the third elephant.

In both Figs. 2 and 6, the SwAV feature extractor is focusing on several regions in the input, but with some regions of high uncertainty. While it is difficult to say exactly why, we hypothesize that it can be related to its self-supervised training procedure. SwAV relies on matching image views to a set of prototypes. Therefore, different parts of the input can be related to different prototypes, which we conjecture can lead to SwAV considering several regions of the input.

Table 5 Evaluation of U-RELAX hyperparameters in terms of monotonicity score in percentages and averaged over 3 runs

5.2 Quantitative Results

Tables 1 and 2 display the quantitative evaluation of our proposed methodology compared with the gradient-based baselines described in Sect. 4.2. The results show how the proposed method outperforms the baselines across all scores. The low standard deviation for RELAX shows that the proposed methodology is robust to the stochasticity in the masks. Furthermore, the feature extractor trained using supervised learning achieves the highest performance compared to the feature extractors trained using self-supervised learning, which illustrates that label information does provide additional useful information for these scores.

For the localisation scores, RELAX provides the highest performance. The segmentation masks or bounding boxes can in many cases be large, and U-RELAX might remove uncertain points close to the boundaries of the segmentation masks. This might be desirable from a human perspective, as it provides clearer explanations with less uncertainty, but it will decrease the localisation scores. For the faithfulness score, U-RELAX provides a significant boost in performance for two encoders. The removal of uncertain explanations allows the classifier to focus on a smaller subset of highly relevant features. This can lead to the classifier having a more stable decrease in accuracy and a higher faithfulness score.

Fig. 7

RELAX explanation and uncertainty for the representation of an example image from Noisy MNIST for a number of widely used feature extractors. The first row displays input, explanation, and uncertainty for view 1, and the second row for view 2. Red indicates high values and blue indicates low values. The figure shows that Completer is extracting complementary information from the two views for creating its unified representation (Color figure online)

5.3 Human Evaluation

The localisation and faithfulness scores are both proxies for human evaluation that allow for quantitative analysis. However, the ultimate goal of XAI is to provide explanations that are understandable for people and align well with human intuition. Therefore, we conduct a user study with human evaluation of explanations. Our study is inspired by the localisation scores but relies on evaluations by individual humans instead of segmentation masks or bounding boxes. In this user study, 13 people were asked to select their preferred explanation from a selection of explanations across 10 different images. See “Appendix I” for a detailed description of the user study.

Table 3 reports the results of the human evaluation. The results clearly indicate that RELAX and U-RELAX were the methods that aligned most closely with human intuition. Some participants highlighted that when both RELAX and the gradient-based methods indicated an object as important, they often preferred the more object-focused explanation of RELAX, as opposed to the more edge-focused explanations of the baselines. It was also noted that for some images the participants disagreed with most explanations, and would have provided a different explanation if possible. We believe that these are valuable insights that will be useful for improving explainability methods and also for designing future user studies.

5.4 U-RELAX Hyperparameter Evaluation

Tables 4 and 5 report the localisation and faithfulness scores for different values of the hyperparameters in U-RELAX. We consider mean versus median aggregation and a selection of values for \(\gamma \). The results indicate that setting \(\gamma \) to less than 1 typically degrades performance. This can be understood by the thresholding being too strict and removing too many pixels indicated as important. Also, the differences between mean and median aggregation of the uncertainties are mostly small, but median aggregation gives a slight improvement, particularly for the relevance rank score and the monotonicity score.

Fig. 8

The figure shows the RELAX explanation for two deep learning-based feature extractors compared with the traditional HOG algorithm. It shows how the HOG features focus on more indistinct regions in the input, while the deep learning methods focus mainly on the cat. Image is taken from PASCAL VOC

5.5 Use Case I: Multi-View Clustering

To further illustrate the ability of RELAX to provide insights into new tasks, we conduct an experiment on multi-view clustering. We learn a feature extractor using the Completer framework (Lin et al., 2021), which uses an information theoretic approach to fuse several views into a new representation. Completer uses individual encoders for each view, and concatenates the representations from the encoders to produce a unified representation. Clustering is performed by applying K-means to the learned representations. To adapt RELAX to this setting, we generate individual masks for each view and monitor the change in the unified representation space. While there is no way within the Completer framework to investigate which parts of the different views influence the unified representation, RELAX allows us to answer this question. Figure 7 shows an example on Noisy MNIST (Wang et al., 2015), where one view is a digit and the other view is a noisy version of the same digit. The result shows that the Completer framework exploits information from both views to produce a new representation, even if one view contains more noise. Such insights would not be obtainable without RELAX.

Fig. 9

The figure shows the RELAX explanation for two deep learning-based feature extractors compared with the traditional HOG algorithm. It shows how the HOG features put little attention on the bird and mostly focus on the background. Image is taken from PASCAL VOC

5.6 Use Case II: Explaining HOG Features

RELAX is not limited to representations produced by deep neural networks. It can be used to explain the representation produced by any function that transforms an image into a vector representation. To illustrate the versatility of RELAX, we explain representations produced by the Histogram of Oriented Gradients (HOG) feature extraction method (Dalal & Triggs, 2005), which has been used extensively in the computer vision literature. Figures 8 and 9 show two examples where the relevance map for the HOG representation is compared with those for the SimCLR and SwAV representations. We consider the representations from these two methods since they, like the HOG features, are obtained without label information.
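
A brief sketch of how a descriptor such as HOG can be plugged into RELAX as the feature extractor f, reusing the hypothetical helpers from Sect. 3.1; the scikit-image HOG parameters shown here are illustrative defaults and not necessarily those used in the experiments.

```python
import torch
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_encoder(x):
    """Map a (1, 3, H, W) image tensor to a (1, D) HOG feature vector."""
    img = x.squeeze(0).permute(1, 2, 0).cpu().numpy()            # (H, W, 3)
    feats = hog(rgb2gray(img), orientations=9,
                pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return torch.from_numpy(feats).float().unsqueeze(0)          # (1, D)

# The same RELAX machinery applies unchanged, e.g. (hypothetical helpers from Sect. 3.1):
# masks = generate_masks(n_masks=3000)
# R = relax_importance(x, hog_encoder, masks)
```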

Features produced by deep neural networks typically allow for higher performance than those from algorithms such as HOG and other handcrafted feature extraction methods. RELAX provides insights into why this is. In Fig. 8, both the SimCLR and the SwAV feature extractors focus on the cat in the center of the image. The HOG algorithm has a more widespread focus. Also, much of the emphasis is put on the cord going along the staircase. Since the HOG algorithm utilizes gradient information, such sharp lines have a big influence on the representation, and it is therefore not surprising that the cat receives less attention. In Fig. 9, both SimCLR and SwAV focus on the bird, while the HOG features are more focused on other regions in the image. For instance, the iron rod and a tree in the background are indicated as being important for the representation of this image. Both examples provide insights into why HOG features lead to inferior performance when compared with features produced by deep neural networks. This information would not be available without the proposed RELAX framework.

6 Conclusion

In this work, we presented RELAX, a framework for explaining representations produced by any feature extractor. RELAX is based on masking out parts of an image and measuring the similarity between the masked and unmasked representations in the representation space. We introduced a principled approach to quantifying uncertainty in explanations. RELAX was evaluated by comparing several widely used feature extractors. The results indicate that there can be large differences in the quality of the explanations. It was shown that filtering out parts of an explanation based on its uncertainty can improve faithfulness, and that RELAX can have a facilitating role, providing explainability for several downstream applications such as multi-view clustering. We consider the application of RELAX to other use cases, such as the investigation of a model’s failure cases, an interesting direction for future research. We believe that RELAX can be an important addition at the intersection of XAI and representation learning.