1 Introduction

Key to the wide success of deep neural networks is end-to-end learning of powerful hidden representations that aim to (i) capture all task-relevant characteristics while (ii) being invariant to all other variabilities in the data [LeC12, AS18]. Deep learning can yield abstract representations that are perfectly adapted feature encodings for the task at hand. However, this increasing capability for abstraction and performance comes at the expense of interpretability [BBM+15]: although the network may solve a problem, it does not convey an understanding of its predictions or their causes, often leaving the impression of a black box [Mil19]. In particular, users are missing an explanation of semantic concepts that the model has learned to represent and of those it has learned to ignore, i.e., its invariances.

Providing such explanations and an understanding of network predictions and their causes is thus crucial for transparent AI. Not only is this relevant to discover limitations and promising directions for future improvements of the AI system itself, but also for compliance with legislation [GF17, Eur20], knowledge distillation from such a system [Lip18], and post hoc verification of the model [SWM17]. Consequently, research on interpretable deep models has recently gained a lot of attention, particularly methods that investigate latent representations to understand what the model has learned [SWM17, SZS+14, BZK+17, FV18, ERO20].

Challenges and aims: Assessing these latent representations is challenging due to two fundamental issues: (i) To achieve robustness and generalization despite noisy inputs and data variability, hidden layers exhibit a distributed coding of semantically meaningful concepts [FV18]. Attributing semantics to a single neuron via backpropagation [MLB+17] or synthesis [YCN+15] is thus impossible without altering the network [MSM18, ZKL+16], which typically degrades performance. (ii) End-to-end learning trains deep representations toward a goal task, making them invariant to features irrelevant for this goal. Understanding these characteristics that a representation has abstracted away is challenging, since we essentially need to portray features that have been discarded.

These challenges call for a method that can interpret existing network representations by recovering their invariances without modifying them. Given these recovered invariances, we seek an invertible mapping that translates a representation and the invariances onto understandable semantic concepts. The mapping disentangles the distributed encoding of the high-dimensional representation and its invariances by projecting them onto separate multi-dimensional factors that correspond to human-understandable semantic concepts. Both this translation and the recovering of invariances are implemented with invertible neural networks (INNs) [Red93, DSB17, KD18]. For the translation, this guarantees that the resulting understandable representation is equally expressive as the model representation combined with the recovered invariances (no information is lost). Its invertibility also warrants that feature modifications applied in the semantic domain correctly adjust the recovered representation.

Our contributions: Our contributions to a comprehensive understanding of deep representations are as follows: (i) We present an approach, which, by utilizing invertible neural networks, improves the understanding of representations produced by existing network architectures with no need for re-training or otherwise compromising their performance. (ii) Our generative approach is able to recover the invariances that result from the non-injective projection (of input onto a latent representation) which deep networks typically learn. This model then provides a probabilistic visualization of the latent representation and its invariances. (iii) We bijectively translate an arbitrarily abstract representation and its invariances via a non-linear transformation into another representation of equal expressiveness, but with accessible semantic concepts. (iv) The invertibility also enables manipulation of the original latent representations in a semantically understandable manner, thus facilitating further diagnostics of a network.

2 Background

Two main approaches to interpretable AI can be identified: those which aim to incorporate interpretability directly into the design of models, and those which aim to provide interpretability to existing models [MSM18]. Approaches from the first category range from modifications of network architectures [ZKL+16], through regularization of models to encourage interpretability [LBMO19, PASC+20], to combinations of both [ZNWZ18]. However, these approaches always involve a trade-off between model performance and model interpretability. Belonging to the latter category, our approach interprets representations of existing models without compromising their performance.

To better understand what an existing model has learned, its representations must be studied [SWM17]. Szegedy et al. [SZS+14] show that both random directions and coordinate axes in the feature space of networks can represent semantic properties and conclude that these properties are not necessarily represented by individual neurons. Different works attempt to select groups of neurons which have a certain semantic meaning, e.g., based on scenes [ZKL+15], objects [SR15], and object parts [SRD14]. [BZK+17] studied the interpretability of neurons and found that a rotation of the representation space spanned by the neurons decreases its interpretability. While this suggests that the neurons provide a more interpretable basis compared to a random basis, [FV18] shows that the choice of basis is not the only challenge for interpretability of representations. Their findings demonstrate that learned representations are distributed, i.e., a single semantic concept is encoded by an activation pattern involving multiple neurons, and a single neuron is involved in the encoding of multiple different semantic concepts. Instead of selecting a set of neurons directly, [ERO20] learns an INN that transforms the original representation space to an interpretable space, where a single semantic concept is represented by a known group of neurons and a single neuron is involved in the encoding of just a single semantic concept. However, to interpret not only the representation itself but also its invariances, it is insufficient to transform only the representation itself. Our approach therefore transforms the latent representation space of an autoencoder, which has the capacity to represent its inputs faithfully, and subsequently translates a model representation and its invariances into this space for semantic interpretation and visualization.

A large body of work approaches interpretability of existing networks via visualizations. Selvaraju et al. [SCD+20] use gradients of network outputs with respect to a convolutional layer to obtain coarse localization maps. Bach et al. [BBM+15] propose an approach to obtain pixel-wise relevance scores for a specific class of models, which is generalized in [MLB+17]. To obtain richer visual interpretations, [ZF14, SVZ14, YCN+15, MV16] reconstruct images which maximally activate certain neurons. Nguyen et al. [NDY+16] use a generator network for this task, which was introduced in [DB16] for reconstructing images from their feature representation. Our key insight is that these existing approaches do not explicitly account for the invariances learned by a model. Invariances imply that feature inversion is a one-to-many mapping, and thus they must be recovered to solve the task. Recently, [SGM+20] introduced a GAN-based approach that utilizes features of a pretrained classifier as a semantic pyramid for image generation. Nash et al. [NKW19] used samples from an autoregressive model of images conditioned on a feature representation to gain insights into the representation’s invariances. In contrast, our approach recovers an explicit representation of the invariances, which can be recombined with modified feature representations, and thus makes the effect of modifications to representations, e.g., through adversarial attacks, visible.

Other works consider visual interpretations for specialized models. Santurkar et al. [SIT+19] showed that the quality of images which maximally activate certain neurons is significantly improved when activating neurons of an adversarially robust classifier. Bau et al. [BZS+19] explore the relationship between neurons and the images produced by a generative adversarial network. For the same class of models, [GAOI19] finds directions in their input space which represent semantic concepts corresponding to certain cognitive properties. Such semantic directions have previously also been found in classifier networks [UGP+17], but this requires aligned data. All of these approaches either require special training of models, are limited to a very specific class of models which already provide visualizations, or depend on special assumptions about model and data. In contrast, our approach can be applied to arbitrary models without re-training or modifying them, and provides both visualizations and semantic explanations, for both the model’s representation and its learned invariances.

Fig. 1

Proposed architecture. We provide post hoc interpretation for a given deep network \(\boldsymbol{f}= \boldsymbol{\Psi }\circ \boldsymbol{\Phi }\). For a deep representation \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\) a conditional INN \(\boldsymbol{t}\) recovers \(\boldsymbol{\Phi }\)’s invariances \(\boldsymbol{v}\) from a representation \(\boldsymbol{\hat{z}}\) which contains entangled information about both \(\boldsymbol{z}\) and \(\boldsymbol{v}\). The INN \(\boldsymbol{e}\) then translates the representation \(\boldsymbol{\hat{z}}\) into a factorized representation with accessible semantic concepts. This approach allows for various applications, including visualizations of network representations of natural (green box) and altered inputs (blue box), semantic network analysis (red box) and semantic image modifications (yellow box)

3 Method

Common tasks of computer vision can be phrased as a mapping from an input image \(\boldsymbol{x}\) to some output \(\boldsymbol{f}(\boldsymbol{x})\) such as a classification of the image, a regression (e.g., of object locations), a (semantic) segmentation map, or a re-synthesis that yields another image. Deep learning utilizes a hierarchy of intermediate network layers that gradually transform the input into increasingly more abstract representations. Let \(\boldsymbol{z}=\boldsymbol{\Phi }(\boldsymbol{x}) \in \mathbb {R}^{N_{\boldsymbol{z}}}\) be the representation extracted by one such layer (without loss of generality we consider \(\boldsymbol{z}\) to be an \(N_{\boldsymbol{z}}\)-dim vector, flattening it if necessary) and \(\boldsymbol{f}(\boldsymbol{x})=\boldsymbol{\Psi }(\boldsymbol{z})=\boldsymbol{\Psi }(\boldsymbol{\Phi }(\boldsymbol{x}))\) the mapping onto the output.

An essential characteristic of a deep feature encoding \(\boldsymbol{z}\) is the increasing abstractness of higher feature encoding layers and the resulting reduction of information. This reduction generally causes the feature encoding to become invariant to those properties of the input image which do not provide salient information for the task at hand [CWG+18]. To explain a latent representation, we need to recover such invariances \(\boldsymbol{v}\) and make \(\boldsymbol{z}\) and \(\boldsymbol{v}\) interpretable by learning a bijective mapping onto understandable semantic concepts, see Fig. 1. Section 3.1 describes our INN \(\boldsymbol{t}\) to recover an encoding \(\boldsymbol{v}\) of the invariances. Due to the generative nature of \(\boldsymbol{t}\), our approach can correctly sample visualizations of the model representation and its invariances without leaving the underlying data distribution or introducing artifacts. With \(\boldsymbol{v}\) then available, Sect. 3.2 presents an INN \(\boldsymbol{e}\) that translates \(\boldsymbol{t}\)’s encoding of \(\boldsymbol{z}\) and \(\boldsymbol{v}\) onto disentangled semantic concepts without losing information. Moreover, the invertibility allows modifications in the semantic domain to correctly project back onto the original representation or into image space.

3.1 Recovering the Invariances of Deep Models

Learning an encoding to help recover invariances: Key to a deep representation is not only the information \(\boldsymbol{z}\) captures, but also what it has learned to abstract away. To learn what \(\boldsymbol{z}\) misses with respect to \(\boldsymbol{x}\), we need an encoding \(\boldsymbol{\hat{z}}\), which, in contrast to \(\boldsymbol{z}\), includes the invariances exhibited by \(\boldsymbol{z}\). Without making prior assumptions about the deep model \(\boldsymbol{f}\), autoencoders provide a generic way to obtain such an encoding \(\boldsymbol{\hat{z}}\), since they ensure that their input \(\boldsymbol{x}\) can be recovered from their learned representation \(\boldsymbol{\hat{z}}\), which hence also comprises the invariances.

Therefore, we learn an autoencoder with an encoder \(\boldsymbol{E}\) that provides the data representation \(\boldsymbol{\hat{z}}= \boldsymbol{E}(\boldsymbol{x})\) and a decoder \(\boldsymbol{D}\) producing the data reconstruction \(\boldsymbol{\hat{x}}= \boldsymbol{D}(\boldsymbol{\hat{z}})\). Section 3.2 will utilize the decoding from \(\boldsymbol{\hat{z}}\) to \(\boldsymbol{\hat{x}}\) to visualize both \(\boldsymbol{z}\) and \(\boldsymbol{v}\). The autoencoder is trained to reconstruct its inputs by minimizing a perceptual metric between input and reconstruction, \(\Vert \boldsymbol{x}- \boldsymbol{\hat{x}}\Vert \), as in [DB16]. The details of the architecture and training procedure can be found in Sect. 3.3, autoencoder E, D. Importantly, the autoencoder needs to be trained only once on the training data. Consequently, the same \(\boldsymbol{E}\) can be used to interpret different representations \(\boldsymbol{z}\), e.g., different models or layers within a model, thus ensuring fair comparisons between them. Moreover, the complexity of the autoencoder can be adjusted based on the computational needs, allowing us to work with much lower dimensional encodings \(\boldsymbol{\hat{z}}\) compared to reconstructing the invariances directly from the images \(\boldsymbol{x}\). This reduces the computational demands of our approach significantly.

Learning a conditional INN that recovers invariances: Due to the reconstruction task of the autoencoder, \(\boldsymbol{\hat{z}}\) not only contains the invariances \(\boldsymbol{v}\), but also the representation \(\boldsymbol{z}\). Thus, we must disentangle [EHO19, LSOL20, KSLO19] \(\boldsymbol{v}\) and \(\boldsymbol{z}\) using a mapping \(\boldsymbol{t}(\cdot \vert \boldsymbol{z}): \boldsymbol{\hat{z}}\mapsto \boldsymbol{v}= \boldsymbol{t}(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\) which, depending on \(\boldsymbol{z}\), extracts \(\boldsymbol{v}\) from \(\boldsymbol{\hat{z}}\).

Besides extracting the invariances from a given \(\boldsymbol{\hat{z}}\), \(\boldsymbol{t}\) must also enable an inverse mapping from given model representations \(\boldsymbol{z}\) to \(\boldsymbol{\hat{z}}\) to support a further mapping onto semantic concepts (Sect. 3.2) and visualization based on \(\boldsymbol{D}(\boldsymbol{\hat{z}})\). There are many different \(\boldsymbol{x}\) with \(\boldsymbol{\Phi }(\boldsymbol{x}) = \boldsymbol{z}\), namely, all those \(\boldsymbol{x}\) which differ only in properties that \(\boldsymbol{\Phi }\) is invariant to. Thus, there are also many different \(\boldsymbol{\hat{z}}\) that this mapping must recover. Consequently, the mapping from \(\boldsymbol{z}\) to \(\boldsymbol{\hat{z}}\) is set-valued. However, to understand \(\boldsymbol{f}\) we do not want to recover all possible \(\boldsymbol{\hat{z}}\), but only those which are likely under the training distribution of the autoencoder. In particular, this excludes unnatural images such as those obtained by DeepDream [MOT15] or adversarial attacks [SZS+14]. In conclusion, we need to sample \(\boldsymbol{\hat{z}}\sim p(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\).

To avoid a costly inversion process of \(\boldsymbol{\Phi }\), \(\boldsymbol{t}\) must be invertible (implemented as an INN) so that a change of variables

$$\begin{aligned} p(\boldsymbol{\hat{z}}\vert \boldsymbol{z}) = \frac{p(\boldsymbol{v}\vert \boldsymbol{z})}{\vert \det \nabla (\boldsymbol{t}^{-1})(\boldsymbol{v}\vert \boldsymbol{z}) \vert } \quad \text {, where } \boldsymbol{v}= \boldsymbol{t}(\boldsymbol{\hat{z}}\vert \boldsymbol{z}), \end{aligned}$$
(1)

yields \(p(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\) by means of the distribution \(p(\boldsymbol{v}\vert \boldsymbol{z})\) of invariances, given a model representation \(\boldsymbol{z}\). Here, the denominator denotes the absolute value of the determinant of the Jacobian \(\nabla (\boldsymbol{t}^{-1})\) of \(\boldsymbol{v}\mapsto \boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z})=\boldsymbol{\hat{z}}\), which is efficient to compute for common invertible network architectures. Consequently, we obtain \(\boldsymbol{\hat{z}}\) for given \(\boldsymbol{z}\) by sampling from the invariant space \(\boldsymbol{v}\) given \(\boldsymbol{z}\) and then applying \(\boldsymbol{t}^{-1}\),

$$\begin{aligned} \boldsymbol{\hat{z}}\sim p(\boldsymbol{\hat{z}}\vert \boldsymbol{z}) \quad \iff \quad \boldsymbol{v}\sim p(\boldsymbol{v}\vert \boldsymbol{z}) , \boldsymbol{\hat{z}}= \boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z}) . \end{aligned}$$
(2)

Since \(\boldsymbol{v}\) is the invariant space for \(\boldsymbol{z}\), both are complementary, implying independence \(p(\boldsymbol{v}\vert \boldsymbol{z}) = p(\boldsymbol{v})\). Because a sufficiently powerful transformation \(\boldsymbol{t}^{-1}\) can transform between two arbitrary densities, we can assume without loss of generality a Gaussian prior \(p(\boldsymbol{v}) = \mathcal {N}(\boldsymbol{v}\vert \boldsymbol{0}, \boldsymbol{I})\), where \(\boldsymbol{I}\) is the identity matrix. Given this prior, our task is then to learn the transformation \(\boldsymbol{t}\) that maps \(\mathcal {N}(\boldsymbol{v}\vert \boldsymbol{0}, \boldsymbol{I})\) onto \(p(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\). To this end, we maximize the log-likelihood of \(\boldsymbol{\hat{z}}\) given \(\boldsymbol{z}\), which results in a per-example loss of

$$\begin{aligned} J_e(\boldsymbol{\hat{z}}, \boldsymbol{z}) = -\log p(\boldsymbol{\hat{z}}\vert \boldsymbol{z}) = -\log \mathcal {N}(\boldsymbol{t}(\boldsymbol{\hat{z}}\vert \boldsymbol{z}) \vert \boldsymbol{0}, \boldsymbol{I}) - \log \vert \det \nabla \boldsymbol{t}(\boldsymbol{\hat{z}}\vert \boldsymbol{z}) \vert . \end{aligned}$$
(3)

Minimizing this loss over the training data distribution \(p(\boldsymbol{x})\) gives \(\boldsymbol{t}\), a bijective mapping between \(\boldsymbol{\hat{z}}\) and (\(\boldsymbol{z}, \boldsymbol{v}\)),

$$\begin{aligned} J(\boldsymbol{t})&= \mathbb {E}_{\boldsymbol{x}\sim p(\boldsymbol{x})} \left[ J_e(\boldsymbol{E}(\boldsymbol{x}), \boldsymbol{\Phi }(\boldsymbol{x})) \right] \end{aligned}$$
(4)
$$\begin{aligned}&= \mathbb {E}_{\boldsymbol{x}\sim p(\boldsymbol{x})} \left[ \frac{1}{2}\Vert \boldsymbol{t}(\boldsymbol{E}(\boldsymbol{x}) \vert \boldsymbol{\Phi }(\boldsymbol{x}))\Vert ^2 + \frac{N_{\boldsymbol{\hat{z}}}}{2}\log 2\pi - \log \vert \det \nabla \boldsymbol{t}(\boldsymbol{E}(\boldsymbol{x}) \vert \boldsymbol{\Phi }(\boldsymbol{x})) \vert \right] . \end{aligned}$$
(5)

Note that both \(\boldsymbol{E}\) and \(\boldsymbol{\Phi }\) remain fixed during minimization of \(J\).
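
To make the objective concrete, the following is a minimal PyTorch sketch of the per-batch loss in (3)–(5). The interfaces are assumptions for illustration: `cinn(z_hat, z)` stands for the conditional INN \(\boldsymbol{t}\) and is assumed to return both \(\boldsymbol{v}= \boldsymbol{t}(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\) and the log-determinant of its Jacobian, while `encoder` and `phi` denote the frozen \(\boldsymbol{E}\) and \(\boldsymbol{\Phi }\).

```python
import math
import torch

def cinn_nll_loss(cinn, encoder, phi, x):
    # E and Phi stay frozen during training of t; they only provide
    # z_hat = E(x) and z = Phi(x) for the current batch x.
    with torch.no_grad():
        z_hat = encoder(x)
        z = phi(x)
    # cinn is assumed to return (v, logdet) with v = t(z_hat | z) and
    # logdet = log |det grad t(z_hat | z)|, summed over dimensions.
    v, logdet = cinn(z_hat, z)
    n = v[0].numel()
    nll = 0.5 * v.flatten(1).pow(2).sum(dim=1) + 0.5 * n * math.log(2 * math.pi) - logdet
    return nll.mean()  # Monte-Carlo estimate of J(t) in (4)-(5)
```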

3.2 Interpreting Representations and Their Invariances

Visualizing representations and invariances: For an image representation \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\), (2) presents an efficient approach (a single forward pass through the INN \(\boldsymbol{t}\)) to sample an encoding \(\boldsymbol{\hat{z}}\), which is a combination of \(\boldsymbol{z}\) with a particular realization of its invariances \(\boldsymbol{v}\). Sampling multiple realizations of \(\boldsymbol{\hat{z}}\) for a given \(\boldsymbol{z}\) highlights what remains constant and what changes due to different \(\boldsymbol{v}\): information preserved in the representation \(\boldsymbol{z}\) remains constant over different samples, whereas information discarded by the model ends up in the invariances \(\boldsymbol{v}\) and therefore varies across samples. Visualizing the samples \(\boldsymbol{\hat{z}}\sim p(\boldsymbol{\hat{z}}\vert \boldsymbol{z})\) with \(\boldsymbol{\hat{x}}= \boldsymbol{D}(\boldsymbol{\hat{z}})\) portrays this constancy and variation. To complement this visualization, in the following, we learn a transformation of \(\boldsymbol{\hat{z}}\) into a semantically meaningful representation which allows us to uncover the semantics captured by \(\boldsymbol{z}\) and \(\boldsymbol{v}\).
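
A minimal sketch of this sampling procedure, assuming the same hypothetical `cinn` interface as above, with `cinn.inverse(v, z)` implementing \(\boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z})\) and `cinn.dim_v` giving the dimensionality of \(\boldsymbol{v}\):

```python
import torch

@torch.no_grad()
def visualize_representation(decoder, cinn, z, n_samples=8):
    # Repeat the single model representation z (batch size 1) and draw one
    # invariance sample v ~ N(0, I) for each copy, cf. (2).
    z = z.expand(n_samples, *z.shape[1:])
    v = torch.randn(n_samples, cinn.dim_v, device=z.device)
    z_hat = cinn.inverse(v, z)          # z_hat = t^{-1}(v | z)
    return decoder(z_hat)               # images x_hat = D(z_hat)
```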

Learning an INN to produce semantic interpretations: The autoencoder representation \(\boldsymbol{\hat{z}}\) is an equivalent representation of \((\boldsymbol{z}, \boldsymbol{v})\) but its feature dimensions do not necessarily correspond to semantic concepts [FV18]. More generally, without supervision, we cannot reliably discover semantically meaningful, explanatory factors of \(\boldsymbol{\hat{z}}\) [LBL+19]. In order to explain \(\boldsymbol{\hat{z}}\) in terms of given semantic concepts, we apply the approach of [ERO20] and learn a bijective transformation of \(\boldsymbol{\hat{z}}\) to an interpretable representation \(\boldsymbol{e}(\boldsymbol{\hat{z}})\) where different groups of components, called factors, correspond to semantic concepts.

To learn the transformation \(\boldsymbol{e}\), we parameterize \(\boldsymbol{e}\) by an INN and assume that semantic concepts are defined implicitly by pairs of images, i.e., for each semantic concept we have access to training pairs \(\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}\) that have the respective concept in common. For example, the semantic concept “smiling” is defined by pairs of images where either both images show smiling persons or both images show non-smiling persons. With this formulation, input pairs that are similar in a certain semantic concept are mapped to similar values of the corresponding factor of the interpretable representation \(\boldsymbol{e}(\boldsymbol{\hat{z}})\).

Following [ERO20], the loss for training the invertible network \(\boldsymbol{e}\) is then given by

$$\begin{aligned} J(\boldsymbol{e}) = \mathbb {E}_{\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}}&\left[ -\log p(\boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})), \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {b}))) \right. \nonumber \\&\left. -\log \vert \det \nabla \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})) \vert -\log \vert \det \nabla \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {b})) \vert \right] . \end{aligned}$$
(6)

Interpretation by applying the learned INNs: After training, the combination of \(\boldsymbol{e}\) with \(\boldsymbol{t}\) from Sect. 3.1 provides semantic interpretations given a model representation \(\boldsymbol{z}\): (2) gives realizations of the invariances \(\boldsymbol{v}\) which are combined with \(\boldsymbol{z}\) to produce \(\boldsymbol{\hat{z}}= \boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z})\). Then \(\boldsymbol{e}\) transforms \(\boldsymbol{\hat{z}}\) without loss of information into a semantically accessible representation \((\boldsymbol{e}_i)_i = \boldsymbol{e}(\boldsymbol{\hat{z}}) = \boldsymbol{e}(\boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z}))\) consisting of different semantic factors \(\boldsymbol{e}_i\). Comparing the \(\boldsymbol{e}_i\) for different model representations \(\boldsymbol{z}\) and invariances \(\boldsymbol{v}\) allows us to observe which semantic concepts the model representation \(\boldsymbol{z}=\boldsymbol{\Phi }(\cdot )\) is sensitive to, and which it is invariant to.

Semantic Modifications of Latent Representations: The transformations \(\boldsymbol{t}^{-1}\) and \(\boldsymbol{e}\) not only interpret a representation \(\boldsymbol{z}\) in terms of accessible semantic concepts \((\boldsymbol{e}_i)_i\). Given \(\boldsymbol{v}\sim p(\boldsymbol{v})\), they also allow us to modify \(\boldsymbol{\hat{z}}=\boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z})\) in a semantically meaningful manner by altering its corresponding \((\boldsymbol{e}_i)_i\) and then applying the inverse translation \(\boldsymbol{e}^{-1}\),

$$\begin{aligned} \boldsymbol{\hat{z}}\xrightarrow {\boldsymbol{e}} (\boldsymbol{e}_i) \xrightarrow {\text {modification}} (\boldsymbol{e}_i^{*}) \xrightarrow {\boldsymbol{e}^{-1}} \boldsymbol{\hat{z}}^{*}. \end{aligned}$$
(7)

The modified representation \(\boldsymbol{\hat{z}}^{*}\) is then readily transformed back into image space \(\boldsymbol{\hat{x}}^{*} = \boldsymbol{D}(\boldsymbol{\hat{z}}^{*})\). Besides visual interpretation of the modification, \(\boldsymbol{\hat{x}}^{*}\) can be fed into the model \(\boldsymbol{\Psi }(\boldsymbol{\Phi }(\boldsymbol{\hat{x}}^{*}))\) to probe for sensitivity to certain semantic concepts.
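
The full modification pipeline of (7) then reads as follows in a minimal sketch; `sem_inn` is a hypothetical interface to the INN \(\boldsymbol{e}\) that returns the list of factors \((\boldsymbol{e}_i)_i\) and inverts it again via `sem_inn.inverse`:

```python
import torch

@torch.no_grad()
def semantic_edit(encoder, decoder, sem_inn, x, factor_idx, new_factor):
    z_hat = encoder(x)                    # z_hat = E(x)
    factors = sem_inn(z_hat)              # (e_i)_i = e(z_hat), a list of factors
    factors[factor_idx] = new_factor      # replace a single semantic concept
    z_hat_mod = sem_inn.inverse(factors)  # z_hat* = e^{-1}((e_i*)_i), cf. (7)
    return decoder(z_hat_mod)             # x_hat* = D(z_hat*)
```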

3.3 Implementation Details

In this section, we provide implementation details about the exact training procedure and architecture of all components of our approach, purely for the sake of clarity and completeness. Readers who are already familiar with INNs, or who are more interested in the high-level ideas of our approach than in its technical details, can safely skip this section.

Autoencoder \(\boldsymbol{E}, \boldsymbol{D}\): In Sect. 3.1, we introduced an autoencoder to obtain a representation \(\boldsymbol{\hat{z}}\) of \(\boldsymbol{x}\), which includes the invariances abstracted away by a given model representation \(\boldsymbol{z}\). This autoencoder consists of an encoder \(\boldsymbol{E}(\boldsymbol{x})\) and a decoder \(\boldsymbol{D}(\boldsymbol{\hat{z}})\).

Because the INNs \(\boldsymbol{t}\) and \(\boldsymbol{e}\) transform the distribution of \(\boldsymbol{\hat{z}}\), we must ensure a strictly positive density for \(\boldsymbol{\hat{z}}\) to avoid degenerate solutions. This is readily achieved with a stochastic encoder, i.e., we predict mean \(\boldsymbol{E}(\boldsymbol{x})_{\boldsymbol{\mu }}\) and diagonal \(\boldsymbol{E}(\boldsymbol{x})_{\boldsymbol{\sigma }^2}\) of a Gaussian distribution, and obtain the desired representation as \(\boldsymbol{\hat{z}}\sim \mathcal {N}(\boldsymbol{\hat{z}}\vert \boldsymbol{E}(\boldsymbol{x})_{\boldsymbol{\mu }}, {\text {diag}}(\boldsymbol{E}(\boldsymbol{x})_{\boldsymbol{\sigma }^2}))\). Following [DW19], we train this autoencoder as a variational autoencoder using the reparameterization trick [KW14, RMW14] to match the encoded distribution to a standard normal distribution, and jointly learn the scalar output variance \(\gamma \) under an image metric \(\Vert \boldsymbol{x}- \boldsymbol{\hat{x}}\Vert \) to avoid blurry reconstructions. The resulting loss function is thus

(8)

Note that \(\sqrt{(\cdot )}\) and \(\log (\cdot )\) on multi-dimensional entities are applied element-wise. In the experiments shown in this chapter, we use images of spatial resolutions \(28 \times 28\) and \(128 \times 128\), resulting in different architectures for the autoencoder, summarized in Tables 1 and 2, respectively. For the encoder \(\boldsymbol{E}\) processing images of spatial resolution \(128 \times 128\), we use an architecture based on ResNet-101 [HZRS16], and for the corresponding decoder \(\boldsymbol{D}\) we use an architecture based on BigGAN [BDS19], where we include a small fully connected network to replace the class conditioning used in BigGAN by a conditioning on \(\boldsymbol{\hat{z}}\).
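
For illustration, this objective can be sketched as follows, assuming a plain squared-error image metric and the standard diagonal-Gaussian KL term; `enc`, `dec`, and the learnable scalar `log_gamma` (the log of the output variance \(\gamma \)) are placeholder names, and the exact weighting of the terms may differ from the one used in our experiments.

```python
import torch

def vae_loss(enc, dec, log_gamma, x):
    mu, logvar = enc(x)                                          # stochastic encoder E(x)
    z_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
    x_rec = dec(z_hat)
    # reconstruction under a learned scalar output variance gamma (cf. [DW19])
    rec = ((x - x_rec) ** 2).flatten(1).sum(dim=1) / log_gamma.exp() \
          + x[0].numel() * log_gamma
    # KL divergence of N(mu, diag(sigma^2)) from the standard normal prior
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).flatten(1).sum(dim=1)
    return (rec + kl).mean()
```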

Table 1 Autoencoder architecture on datasets with images of resolution \(28 \times 28\)
Table 2 Autoencoder architecture for datasets with images of resolution \(128 \times 128\)

In the \(28\times 28\) case, we use a squared \(L_2\) loss for the image metric, which corresponds to the first term in (8). For our \(128 \times 128\)-models, we further use an improved metric as in [DB16], which includes additional perceptual [ZIE+18] and discriminator losses. The perceptual loss consists of \(L_1\) feature distances obtained from different layers of a fixed, pretrained network. We use a VGG-16 network pretrained on ImageNet and weighted distances of different layers as in [ZIE+18]. The discriminator is trained along with the autoencoder to distinguish reconstructed images from real images using a binary classification loss, and the autoencoder maximizes the log-probability that reconstructed images are classified as real images. The architectures of VGG-16 and the discriminator are summarized in Table 3.
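
A sketch of such a perceptual term is given below; the selected layers and uniform weights are placeholders for illustration (the actual selection and weighting follow [ZIE+18]), and a recent torchvision is assumed for loading the ImageNet-pretrained VGG-16.

```python
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer_ids=(3, 8, 15, 22), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        # fixed, ImageNet-pretrained VGG-16 feature extractor
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids, self.weights = set(layer_ids), weights

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, x, x_rec):
        # weighted L1 distances between features of input and reconstruction
        loss = 0.0
        for w, f, f_rec in zip(self.weights, self._features(x), self._features(x_rec)):
            loss = loss + w * (f - f_rec).abs().mean()
        return loss
```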

Table 3 Architectures used to compute image metrics for the autoencoder, which were used for training the autoencoder \(\boldsymbol{E}, \, \boldsymbol{D}\) on datasets with images of resolution \(128 \times 128\)

Details on the INN \(\boldsymbol{e}\) for Revealing Semantics of Deep Representations: Previous works have successfully applied INNs for density estimation [DSB17], inverse problems [AKW+19], and on top of autoencoder representations [ERO20, XYA19, DMB+21, BMDO21b, BMDO21a] for a wide range of applications such as video synthesis [DMB+21, BMDO21b] and translation between pretrained, powerful networks [REO20]. This section provides details on how we embed the approach of [ERO20] to reveal the semantic concepts of autoencoder representations \(\boldsymbol{\hat{z}}\), cf. Sect. 3.2.

Since we will never have examples for all relevant semantic concepts, we include a residual concept that captures the remaining variability of \(\boldsymbol{\hat{z}}\), which is not explained by the given semantic concepts.

Following [ERO20], we learn a bijective transformation \(\boldsymbol{e}(\boldsymbol{\hat{z}})\), which translates the non-interpretable representation \(\boldsymbol{\hat{z}}\) invertibly into a factorized representation \((\boldsymbol{e}_i(\boldsymbol{\hat{z}}))_{i=0}^K=\boldsymbol{e}(\boldsymbol{\hat{z}})\), where each factor \(\boldsymbol{e}_i \in \mathbb {R}^{N_{\boldsymbol{e}_{i}}}\) represents one of the given semantic concepts for \(i = 1,\dots ,K\), and \(\boldsymbol{e}_0 \in \mathbb {R}^{N_{\boldsymbol{e}_{0}}}\) is the residual concept.

The INN \(\boldsymbol{e}\) establishes a one-to-one correspondence between an encoding and different semantic concepts and, conversely, enables semantic modifications to correctly alter the original encoding (see next section). Being an INN, \(\boldsymbol{e}(\boldsymbol{\hat{z}})\) and \(\boldsymbol{\hat{z}}\) need to have the same dimensionality and we set \(N_{\boldsymbol{e}_{0}} = N_{\boldsymbol{\hat{z}}}- \sum _{i=1}^KN_{\boldsymbol{e}_{i}}\). We denote the indices of concept i with respect to \(\boldsymbol{e}(\boldsymbol{\hat{z}})\) as \(\mathcal {I}_i \subset \{1, \dots , N_{\boldsymbol{\hat{z}}}\}\) such that we can write \(\boldsymbol{e}_i = (\boldsymbol{e}(\boldsymbol{\hat{z}})_k)_{k\in \mathcal {I}_i}\).

In the following, we focus on deriving a loss function for training the semantic INN. Let \(\boldsymbol{e}_i\) be the factor representing some semantic concept, e.g., gender, that the contents of two images \(\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}\) share. Then the projection of their encodings \(\boldsymbol{\hat{z}}^\mathrm {a}, \boldsymbol{\hat{z}}^\mathrm {b}\) onto this semantic concept must be similar [ERO20, KWKT15],

$$\begin{aligned} \boldsymbol{e}_i(\boldsymbol{\hat{z}}^\mathrm {a}) \simeq \boldsymbol{e}_i(\boldsymbol{\hat{z}}^\mathrm {b}) \quad \text {where } \boldsymbol{\hat{z}}^\mathrm {a} = \boldsymbol{E}(\boldsymbol{x}^\mathrm {a}), \boldsymbol{\hat{z}}^\mathrm {b} = \boldsymbol{E}(\boldsymbol{x}^\mathrm {b}) . \end{aligned}$$
(9)

Moreover, to interpret \(\boldsymbol{\hat{z}}\) we are interested in the separate contribution of different semantic concepts \(\boldsymbol{e}_i\) that explain \(\boldsymbol{\hat{z}}\). Hence, we seek a mapping \(\boldsymbol{e}(\cdot )\) that strives to disentangle different concepts,

$$\begin{aligned} \boldsymbol{e}_i(\boldsymbol{\hat{z}}) \perp \boldsymbol{e}_j(\boldsymbol{\hat{z}}) \quad \forall i \ne j, \boldsymbol{x}\quad \text {where } \boldsymbol{\hat{z}}= E(\boldsymbol{x}) . \end{aligned}$$
(10)

The objectives in (9), (10) imply a correlation in \(\boldsymbol{e}_i\) for pairs \(\boldsymbol{\hat{z}}^\mathrm {a}\) and \(\boldsymbol{\hat{z}}^\mathrm {b}\) and no correlation between concepts \(\boldsymbol{e}_i, \boldsymbol{e}_j\) for \(i \ne j\). This calls for a Gaussian distribution with a covariance matrix that reflects these requirements.

Let \(\boldsymbol{e}^\mathrm {a}=(\boldsymbol{e}^\mathrm {a}_i) = (\boldsymbol{e}_i(\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})))\) and \(\boldsymbol{e}^\mathrm {b}\) likewise, where \(\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}\) are samples from a training distribution \(p(\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b})\) for the ith semantic concept. The distribution of pairs \(\boldsymbol{e}^\mathrm {a}\) and \(\boldsymbol{e}^\mathrm {b}\) factorizes into a conditional and a marginal,

$$\begin{aligned} p(\boldsymbol{e}^\mathrm {a}, \boldsymbol{e}^\mathrm {b}) = p(\boldsymbol{e}^\mathrm {b} \vert \boldsymbol{e}^\mathrm {a}) p(\boldsymbol{e}^\mathrm {a}). \end{aligned}$$
(11)

Objective (10) implies a diagonal covariance for the marginal distribution \(p(\boldsymbol{e}^\mathrm {a})\), i.e., a standard normal distribution, and (9) entails a correlation between \(\boldsymbol{e}^\mathrm {a}_i\) and \(\boldsymbol{e}^\mathrm {b}_i\). Therefore, the cross-correlation matrix between \(\boldsymbol{e}^\mathrm {a}\) and \(\boldsymbol{e}^\mathrm {b}\) is \(\boldsymbol{\Sigma }^{\mathrm {ab}}= \rho {\text {diag}}((\delta _{\mathcal {I}_i}(k))_{k=1}^{N_{\boldsymbol{\hat{z}}}})\), where

$$\begin{aligned} \delta _{\mathcal {I}_i}(k) = {\left\{ \begin{array}{ll} 1 &{} \text {if}~k \in \mathcal {I}_i,\\ 0 &{} \text {else}.\\ \end{array}\right. } \end{aligned}$$

By symmetry, \(p(\boldsymbol{e}^\mathrm {b}) = p(\boldsymbol{e}^\mathrm {a})\), which gives

$$\begin{aligned} p(\boldsymbol{e}^\mathrm {b} \vert \boldsymbol{e}^\mathrm {a}) = \mathcal {N}(\boldsymbol{e}^\mathrm {b} \vert \boldsymbol{\Sigma }^{\mathrm {ab}}\boldsymbol{e}^\mathrm {a}, \boldsymbol{I}- (\boldsymbol{\Sigma }^{\mathrm {ab}})^2) . \end{aligned}$$
(12)

Inserting (12) and a standard normal distribution for \(p(\boldsymbol{e}^\mathrm {a})\) into (11) yields the negative log-likelihood for a pair \(\boldsymbol{e}^\mathrm {a}, \boldsymbol{e}^\mathrm {b}\).

Given pairs \(\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}\) as training data, another change of variables from \(\boldsymbol{\hat{z}}^\mathrm {a}=\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})\) to \(\boldsymbol{e}^\mathrm {a}=\boldsymbol{e}(\boldsymbol{\hat{z}}^\mathrm {a})\) gives the training loss function for \(\boldsymbol{e}\) as the negative log-likelihood of \(\boldsymbol{\hat{z}}^\mathrm {a}, \boldsymbol{\hat{z}}^\mathrm {b}\),

$$\begin{aligned} J(\boldsymbol{e}) = \mathbb {E}_{\boldsymbol{x}^\mathrm {a}, \boldsymbol{x}^\mathrm {b}}&\left[ -\log p(\boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})), \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {b}))) \right. \nonumber \\&\left. -\log \vert \det \nabla \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {a})) \vert -\log \vert \det \nabla \boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}^\mathrm {b})) \vert \right] . \end{aligned}$$
(13)

For simplicity, we have derived the loss for a single semantic concept \(\boldsymbol{e}_i\). Simply summing over the losses of different semantic concepts yields their joint loss function and allows us to learn a joint translator \(\boldsymbol{e}\) for all of them.

We now make the log-likelihood of pairs \(\boldsymbol{e}^\mathrm {a}, \boldsymbol{e}^\mathrm {b}\) appearing in the loss (13) explicit. Inserting (12) and a standard normal distribution for \(p(\boldsymbol{e}^\mathrm {a})\) into (11) yields

$$\begin{aligned} -\log p(\boldsymbol{e}^\mathrm {a}, \boldsymbol{e}^\mathrm {b}) = \frac{1}{2} \left( \sum _{k\in \mathcal {I}_i} \frac{(\boldsymbol{e}^\mathrm {b}_k - \rho \boldsymbol{e}^\mathrm {a}_k)^2}{1-\rho ^2} + \sum _{k\in \mathcal {I}_i^c} (\boldsymbol{e}^\mathrm {b}_k)^2 + \sum _{k=1}^{N_{\boldsymbol{\hat{z}}}} (\boldsymbol{e}^\mathrm {a}_k)^2 \right) + C, \end{aligned}$$
(14)

where \(C=C(\rho , N_{\boldsymbol{\hat{z}}})\) is a constant that can be ignored for the optimization process. \(\rho \in (0,1)\) determines the relative importance of loss terms corresponding to the similarity requirement in (9) and the independence requirement in (10). We use a fixed value of \(\rho =0.9\) for all experiments.
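
For reference, the pair term of (14) can be computed as in the following sketch; `mask_i` is a hypothetical boolean mask over the channels belonging to \(\mathcal {I}_i\), and `e_a`, `e_b` are batches of transformed encodings.

```python
import torch

def pair_nll(e_a, e_b, mask_i, rho=0.9):
    # channels in I_i are tied between the pair, the rest is standard normal
    shared = ((e_b - rho * e_a) ** 2)[:, mask_i].sum(dim=1) / (1.0 - rho ** 2)
    rest = (e_b ** 2)[:, ~mask_i].sum(dim=1)   # remaining channels of e^b
    marginal = (e_a ** 2).sum(dim=1)           # standard normal marginal of e^a
    return 0.5 * (shared + rest + marginal)    # (14) up to the constant C(rho, N)
```

The full objective (13) additionally subtracts the two log-determinant terms and sums this pair term over all annotated concepts.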

Fig. 2

A single invertible block used to build our invertible neural networks

Fig. 3

Architectures of our INN models. top: The semantic INN \(\boldsymbol{e}\) consists of stacked invertible blocks. bottom: The conditional INN \(\boldsymbol{t}\) is composed of an embedding module \(\boldsymbol{H}\) that downsamples (or upsamples, if necessary) a given model representation into an embedding \(\boldsymbol{h} = \boldsymbol{H}(\boldsymbol{z}) = \boldsymbol{H}(\boldsymbol{\Phi }(\boldsymbol{x}))\). Subsequently, \(\boldsymbol{h}\) is concatenated with the inputs of each block of the invertible model

In the following, we describe the architecture of the semantic INN. In our implementation, \(\boldsymbol{e}\) is built by stacking invertible blocks, see Fig. 2, which consist of three invertible layers: coupling blocks [DSB17], actnorm layers [KD18], and shuffling layers. The final output is split into the factors \((\boldsymbol{e}_i)\), see Fig. 3.

Coupling blocks split their input \(\boldsymbol{x}= (\boldsymbol{x}_1 , \boldsymbol{x}_2)\) along the channel dimension and use fully connected neural networks \(\boldsymbol{s}_i\) and \(\boldsymbol{\tau }_i\) to perform the following computation:

$$\begin{aligned} \tilde{\boldsymbol{x}}_1&= \boldsymbol{x}_1\odot \boldsymbol{s}_1(\boldsymbol{x}_2) + \boldsymbol{\tau }_1(\boldsymbol{x}_2), \end{aligned}$$
(15)
$$\begin{aligned} \tilde{\boldsymbol{x}}_2&= \boldsymbol{x}_2\odot \boldsymbol{s}_2(\tilde{\boldsymbol{x}}_1) + \boldsymbol{\tau }_2(\tilde{\boldsymbol{x}}_1), \end{aligned}$$
(16)

with the element-wise multiplication operator \(\odot \). Actnorm layers consist of learnable shift and scale parameters for each channel, which are initialized to ensure activations with zero mean and unit variance on the first training batch. Shuffling layers use a fixed, randomly initialized permutation to shuffle the channels of their input, which provides a better mixing of channels for subsequent coupling layers.
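
A minimal sketch of such an invertible block follows. The scale networks here output a log-scale that is exponentiated, a common implementation choice for guaranteeing invertibility that (15)–(16) leave implicit; the actnorm layer is omitted for brevity.

```python
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    """Affine coupling as in (15)-(16), with an exponentiated log-scale."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.d = dim // 2
        def subnet(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_out))
        self.s1, self.t1 = subnet(dim - self.d, self.d), subnet(dim - self.d, self.d)
        self.s2, self.t2 = subnet(self.d, dim - self.d), subnet(self.d, dim - self.d)

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        ls1 = self.s1(x2)
        y1 = x1 * ls1.exp() + self.t1(x2)                           # (15)
        ls2 = self.s2(y1)
        y2 = x2 * ls2.exp() + self.t2(y1)                           # (16)
        return torch.cat([y1, y2], dim=1), ls1.sum(1) + ls2.sum(1)  # output, log|det|

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        x2 = (y2 - self.t2(y1)) * (-self.s2(y1)).exp()
        x1 = (y1 - self.t1(x2)) * (-self.s1(x2)).exp()
        return torch.cat([x1, x2], dim=1)

class Shuffle(nn.Module):
    """Fixed random channel permutation for better mixing between couplings."""
    def __init__(self, dim):
        super().__init__()
        self.register_buffer("perm", torch.randperm(dim))
        self.register_buffer("inv_perm", torch.argsort(self.perm))

    def forward(self, x):
        return x[:, self.perm]

    def inverse(self, y):
        return y[:, self.inv_perm]
```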

Fig. 4

Graphical distinction of information flow during training and inference. During training of \(\boldsymbol{t}\), the encoder \(\boldsymbol{E}\) provides an (approximately complete) data representation, which is used to learn the invariances of a given model’s representations \(\boldsymbol{z}\). At inference, the encoder is not necessarily needed anymore: given a representation \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\), invariances can be sampled from the prior distribution, combined with \(\boldsymbol{z}\) through \(\boldsymbol{t}^{-1}\), and decoded into data space via \(\boldsymbol{D}\)

Conditional INN \(\boldsymbol{t}\) for recovering invariances of deep representations: We first elaborate on the architecture of the conditional INN. We build the conditional invertible neural network \(\boldsymbol{t}\) by expanding the semantic model \(\boldsymbol{e}\) as follows: Given a model representation \(\boldsymbol{z}\), which is used as the conditioning of the INN, we first calculate its embedding

$$\begin{aligned} \boldsymbol{h} = \boldsymbol{H}(\boldsymbol{z}) \end{aligned}$$
(17)

which is subsequently fed into the affine coupling block:

$$\begin{aligned} \tilde{\boldsymbol{x}}_1&= \boldsymbol{x}_1\odot \boldsymbol{s}_1(\boldsymbol{x}_2, \boldsymbol{h}) + \boldsymbol{\tau }_1(\boldsymbol{x}_2, \boldsymbol{h}), \end{aligned}$$
(18)
$$\begin{aligned} \tilde{\boldsymbol{x}}_2&= \boldsymbol{x}_2\odot \boldsymbol{s}_2(\tilde{\boldsymbol{x}}_1, \boldsymbol{h}) + \boldsymbol{\tau }_2(\tilde{\boldsymbol{x}}_1, \boldsymbol{h}), \end{aligned}$$
(19)

with \(\odot \) again being an element-wise multiplication operator, where \(\boldsymbol{s}_i\) and \(\boldsymbol{\tau }_i\) are modified from (16) such that they are capable of processing a concatenated input \((\boldsymbol{x}_i, \boldsymbol{h})\). The embedding module \(\boldsymbol{H}\) is usually a shallow convolutional neural network used to down-/upsample a given model representation \(\boldsymbol{z}\) to a size that the networks \(\boldsymbol{s}_i\) and \(\boldsymbol{\tau }_i\) are able to process. This means that \(\boldsymbol{t}\), analogous to \(\boldsymbol{e}\), consists of stacked invertible blocks, where each block is composed of coupling blocks, actnorm layers, and shuffling layers, cf. Sect. 3.3, details on the INN \(\boldsymbol{e}\) for revealing semantics of deep representations, and Fig. 2. The complete architectures of both \(\boldsymbol{t}\) and \(\boldsymbol{e}\) are depicted in Fig. 3. Additionally, Fig. 4 provides a graphical distinction of the training and testing process of \(\boldsymbol{t}\). During training, the autoencoder \(\boldsymbol{D}\circ \boldsymbol{E}\) provides a representation of the data that contains both the invariances and the representation of some model w.r.t. the input \(\boldsymbol{x}\). After training of \(\boldsymbol{t}\), the encoder may be discarded and visual decodings and/or semantic interpretations of a model representation \(\boldsymbol{z}\) can be obtained by sampling and transforming \(\boldsymbol{v}\) as described in (2).
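
A self-contained sketch of such a conditional coupling block is given below; as before, the exponentiated log-scale is an implementation choice not spelled out in (18)–(19), and the embedding `h` would be produced by the small network \(\boldsymbol{H}\).

```python
import torch
import torch.nn as nn

class ConditionalCouplingBlock(nn.Module):
    """Affine coupling of (18)-(19): the embedding h = H(z) of the model
    representation is concatenated to the inputs of s_i and tau_i."""
    def __init__(self, dim, cond_dim, hidden=512):
        super().__init__()
        self.d = dim // 2
        def subnet(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_out))
        self.s1, self.t1 = subnet(dim - self.d, self.d), subnet(dim - self.d, self.d)
        self.s2, self.t2 = subnet(self.d, dim - self.d), subnet(self.d, dim - self.d)

    def forward(self, x, h):                      # x: part of z_hat, h: H(Phi(x))
        x1, x2 = x[:, :self.d], x[:, self.d:]
        ls1 = self.s1(torch.cat([x2, h], 1))
        y1 = x1 * ls1.exp() + self.t1(torch.cat([x2, h], 1))        # (18)
        ls2 = self.s2(torch.cat([y1, h], 1))
        y2 = x2 * ls2.exp() + self.t2(torch.cat([y1, h], 1))        # (19)
        return torch.cat([y1, y2], 1), ls1.sum(1) + ls2.sum(1)

    def inverse(self, y, h):                      # used to sample z_hat = t^{-1}(v | z)
        y1, y2 = y[:, :self.d], y[:, self.d:]
        x2 = (y2 - self.t2(torch.cat([y1, h], 1))) * (-self.s2(torch.cat([y1, h], 1))).exp()
        x1 = (y1 - self.t1(torch.cat([x2, h], 1))) * (-self.s1(torch.cat([x2, h], 1))).exp()
        return torch.cat([x1, x2], 1)
```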

4 Experiments

To explore the applicability of our approach, we conduct experiments on several models: SqueezeNet [IHM+16], which provides lightweight classification; FaceNet [SKP15], a baseline for face recognition and clustering trained on the VGGFace2 dataset [CSX+18]; and variants of ResNet [HZRS16], a popular architecture often used when fine-tuning a classifier on a specific task and dataset.

Experiments are conducted on the following datasets: CelebA [LLWT15], AnimalFaces [LHM+19], Animals (containing carnivorous animals), ImageNet [DDS+09], and ColorMNIST, which is an augmented version of the MNIST dataset [LCB98], where both background and foreground have random, independent colors. Evaluation details follow in Sect. 4.5.
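
For reference, a ColorMNIST-style dataset can be constructed roughly as follows; this is our reading of the construction, and the exact recipe (color ranges, normalization) may differ from the one used in our experiments.

```python
import torch
from torchvision import datasets, transforms

def colorize(img):
    # give the digit a random foreground and an independent random background color
    img = img.expand(3, -1, -1)       # grayscale (1, 28, 28) -> 3 channels
    fg = torch.rand(3, 1, 1)          # random foreground color
    bg = torch.rand(3, 1, 1)          # independent random background color
    return img * fg + (1.0 - img) * bg

dataset = datasets.MNIST(root="./data", train=True, download=True,
                         transform=transforms.Compose([transforms.ToTensor(), colorize]))
```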

4.1 Comparison to Existing Methods

A key insight of our chapter is that reconstructions from a given model’s representation \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\) are impossible if the invariances the model has learned are not considered. In Fig. 5, we compare our approach to existing methods that either try to reconstruct the image via gradient-based optimization [MV16] or by training a reconstruction network directly on the representations \(\boldsymbol{z}\) [DB16]. By conditionally sampling images \(\boldsymbol{\hat{x}}= \boldsymbol{D}(\boldsymbol{\hat{z}})\), where we obtain \(\boldsymbol{\hat{z}}\) via the INN \(\boldsymbol{t}\) as described in (2) based on the invariances \(\boldsymbol{v}\sim p(\boldsymbol{v}) = \mathcal {N}(\boldsymbol{0}, \boldsymbol{I})\), we bypass this shortcoming and obtain natural images without artifacts for any layer depth. The increased image quality is further confirmed by the Fréchet inception distance (FID) scores [HRU+17] reported in Table 4.

Fig. 5

Comparison to existing network inversion methods for AlexNet [KSH12]. In contrast to the methods of [DB16] (D&B) and [MV16] (M&V), our invertible method explicitly samples the invariances of \(\boldsymbol{\Phi }\) w.r.t. the data, which circumvents a common cause for artifacts and produces natural images independent of the depth of the layer from which the image is reconstructed

4.2 Understanding Models

Interpreting a face recognition model: FaceNet [SKP15] is a widely accepted baseline in the field of face recognition. This model embeds input images of human faces into a latent space where similar images have a small \(L_2\)-distance. We aim to understand the process of face recognition within this model by analyzing and visualizing learned invariances for several layers explicitly; see Table 6 for a detailed breakdown of the various layers of FaceNet. For the experiment, we use a pretrained FaceNet and train the generative model presented in (2) by conditioning on various layers. Figure 6 depicts the amount of variance present in each selected layer when generating \(n=250\) samples for each of 100 different input images. This variance serves as a proxy for the degree of abstraction FaceNet has learned in the respective layer: more abstract representations allow for a rich variety of corresponding synthesized images, resulting in a large variance in image space when being decoded. We observe approximately exponential growth of learned invariances with increasing layer depth, suggesting that abstraction happens mainly in the deepest layers of the network. Furthermore, we are able to synthesize images that correspond to the given model representation for each selected layer.

Table 4 FID scores for layer visualizations of AlexNet, obtained with our method and [DB16] (D&B). Scores are calculated on the Animals dataset
Fig. 6

Left: Visualizing FaceNet representations and their invariances. Sampling multiple reconstructions \(\boldsymbol{\hat{x}}= \boldsymbol{D}(\boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z}_\ell ))\) shows the degree of invariance learned by different layers \(\ell \). The invariance w.r.t. pose increases for deeper layers as expected for face identification. Surprisingly, FaceNet uses glasses as an identity feature throughout all its layers as evident from the spatial mean and variance plots, where the glasses are still visible. This reveals a bias and weakness of the model. Right: spatially averaged variances over multiple \(\boldsymbol{x}\) and layers

Fig. 7

Analyzing how the degree to which different semantic concepts are captured by a network representation changes as training progresses. For SqueezeNet on ColorMNIST, we measure how much the data varies in different semantic concepts \(\boldsymbol{e}_i\) and how much of this variability is captured by \(\boldsymbol{z}\) at different training iterations. Early on, \(\boldsymbol{z}\) is sensitive to foreground and background color, and later on it learns to focus on the digit attribute. The ability to encode this semantic concept is proportional to the classification accuracy achieved by \(\boldsymbol{z}\). At training iterations 4k and 36k, we apply our method to visualize model representations and thereby illustrate how their content changes during training

How does the relevance of different concepts emerge during training? Humans tend to explain entities by describing them in terms of their semantics, e.g., size or color. In a similar fashion, we want to semantically understand how a network (here: SqueezeNet) learns to solve a given problem.

Intuitively, a network should, for example, be able to solve a given classification problem by focusing on the relevant information while discarding task-irrelevant information. To build on this intuition, we construct a toy problem: digit classification on ColorMNIST. We expect the model to ignore both the random background and foreground colors of the input data, as they do not help in making the classification decision. Thus, we apply the invertible approach presented in Sect. 3.2 and recover three distinct factors: digit class, background color, and foreground color. To capture the semantic changes occurring over the course of training of this classifier, we train 20 instances of the invertible interpretation model on the last convolutional layer, one for each of 20 equally spaced checkpoints between iteration 0 and iteration 40000. The result is shown in Fig. 7: we see that the digit factor becomes increasingly relevant, its relevance being strongly correlated with the accuracy of the model.

Fig. 8

Visualizing FGSM adversarial attacks on ResNet-101. To the human eye, the original image and its attacked version are almost indistinguishable. However, the input image is correctly classified as “Siamese cat”, while the attacked version is classified as “mountain lion”. Our approach visualizes how the attack spreads throughout the network. Reconstructions of representations of attacked images demonstrate that the attack targets the semantic content of deep layers. The variance of \(\boldsymbol{\hat{z}}\) explained by \(\boldsymbol{v}\) combined with these visualizations shows how increasing invariances cause vulnerability to adversarial attacks

Fig. 9

Revealing texture bias in ImageNet classifiers. We compare visualizations of \(\boldsymbol{z}\) from the penultimate layer of ResNet-50 trained on standard ImageNet (left) and a stylized version of ImageNet (right). On natural images (rows 1–3), both models recognize the input. Removing textures through stylization (rows 4–6) makes images unrecognizable to the standard model, which, however, still recognizes objects from textured patches (rows 7–9). Rows 10–12 show that a model without texture bias can be used for sketch-to-image synthesis

4.3 Effects of Data Shifts on Models

This section investigates the effects that altering input data has on the model we want to understand. We examine these effects by manipulating input data through adversarial attacks or image stylization.

How do adversarial attacks affect network representations? Here, we experiment with the fast gradient sign method (FGSM) [GSS15], which manipulates the input image by maximizing the objective of a given classification model. To understand how such an attack modifies representations of a given model, we first compute the image’s invariances with respect to the model as \(\boldsymbol{v}= \boldsymbol{t}(\boldsymbol{E}(\boldsymbol{x}) \vert \boldsymbol{\Phi }(\boldsymbol{x}))\). For an attacked image \(\boldsymbol{x}^{*}\), we then compute the attacked representation as \(\boldsymbol{z}^{*}=\boldsymbol{\Phi }(\boldsymbol{x}^{*})\). Decoding this representation with the original invariances \(\boldsymbol{v}\) allows us to precisely visualize what the adversarial attack changed. This decoding, \(\boldsymbol{\hat{x}}^{*} = \boldsymbol{D}(\boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z}^{*}))\), is shown in Fig. 8. We observe that, over the layers of the network, the adversarial attack gradually changes the representation toward its target. Its ability to do so is strongly correlated with the amount of invariance of a given layer, quantified as the total variance explained by \(\boldsymbol{v}\), as also observed in [JBZB19].
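
A sketch of this procedure, combining the FGSM step with the decoding of the attacked representation under the original invariances; `cinn` again denotes the hypothetical interface to \(\boldsymbol{t}\) from the earlier sketches, and `y` are the ground-truth labels used by the attack.

```python
import torch
import torch.nn.functional as F

def visualize_attack(phi, psi, encoder, decoder, cinn, x, y, eps=0.01):
    # FGSM: a single signed-gradient step that increases the classification loss
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(psi(phi(x_adv)), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

    with torch.no_grad():
        v, _ = cinn(encoder(x), phi(x))           # invariances of the clean image
        z_adv = phi(x_adv)                        # attacked representation z* = Phi(x*)
        x_vis = decoder(cinn.inverse(v, z_adv))   # decode z* with the original invariances
    return x_adv, x_vis
```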

How does training on different data affect the model? Geirhos et al. [GRM+19] proposed the hypothesis that classification networks based on convolutional blocks mainly focus on texture patterns to obtain class probabilities. We further validate this hypothesis by training our invertible network \(\boldsymbol{t}\) conditioned on pre-logits \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\) (i.e., the penultimate layer) of two ResNet-50 realizations. As shown in Fig. 9, a ResNet architecture trained on standard ImageNet is susceptible to the so-called “texture bias”, as samples generated conditioned on representations of pure texture images consistently show valid images of the corresponding input classes. We furthermore visualize that this behavior can indeed be removed by training the same architecture on a stylized version of ImageNet; the classifier then focuses on shape. Rows 10–12 of Fig. 9 show that the proposed approach can be used to generate sketch-based content with the texture-agnostic network.

Fig. 10

Semantic modifications on CelebA. For each column, after inferring the semantic factors \((\boldsymbol{e}_i)_i=\boldsymbol{e}(\boldsymbol{E}(\boldsymbol{x}))\) of the input \(\boldsymbol{x}\), we replace one factor \(\boldsymbol{e}_i\) by that from another randomly chosen image that differs in this concept. The inverse of \(\boldsymbol{e}\) translates this semantic change back into a modified \(\boldsymbol{\hat{z}}\), which is decoded to a semantically modified image. Distances between FaceNet embeddings before and after modification demonstrate its sensitivity to differences in gender and glasses (see also Fig. 6)

4.4 Modifying Representations

Invertible access to semantic concepts enables targeted modifications of representations \(\boldsymbol{\hat{z}}\). In combination with a decoder for \(\boldsymbol{\hat{z}}\), we obtain semantic image editing capabilities. We provide an example in Fig. 10, where we modify the factors hair color, glasses, gender, beard, age, and smile. We infer \(\boldsymbol{\hat{z}}=\boldsymbol{E}(\boldsymbol{x})\) from an input image. Our semantic INN \(\boldsymbol{e}\) then translates this representation into semantic factors \((\boldsymbol{e}_i)_i=\boldsymbol{e}(\boldsymbol{\hat{z}})\), where individual semantic concepts can be modified independently via the corresponding factor \(\boldsymbol{e}_i\). In particular, we can replace each factor with that from another image, effectively transferring semantics from one representation onto another. Due to the invertibility of \(\boldsymbol{e}\), the modified representation can be translated back into the space of the autoencoder and is readily decoded to a modified image \(\boldsymbol{x}^{*}\).

To observe which semantic concepts FaceNet is sensitive to, we compute the average distance \(\Vert \boldsymbol{f}(\boldsymbol{x}) - \boldsymbol{f}(\boldsymbol{x}^{*})\Vert \) between its embeddings of \(\boldsymbol{x}\) and semantically modified \(\boldsymbol{x}^{*}\) over the test set (last row in Fig. 10). Evidently, FaceNet is particularly sensitive to differences in gender and glasses. The latter suggests a failure of FaceNet to identify persons correctly after they put on glasses.
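
This sensitivity measurement amounts to a single averaged embedding distance per modified concept; a minimal sketch (`facenet`, `images`, and `edited_images` are placeholder names):

```python
import torch

@torch.no_grad()
def concept_sensitivity(facenet, images, edited_images):
    # average embedding distance ||f(x) - f(x*)|| for one modified concept
    diff = facenet(images) - facenet(edited_images)
    return diff.pow(2).sum(dim=1).sqrt().mean().item()
```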

Table 5 Hyperparameters of INNs for each experiment. Parameter \(n_{flow}\) denotes the number of invertible blocks within the model, see Fig. 2. Parameters \(h_w\) and \(h_d\) refer to the width and depth of the fully connected subnetworks \(\boldsymbol{s}_i\) and \(\boldsymbol{\tau }_i\), respectively
Table 6 High-level architectures of FaceNet and ResNet, depicted as pytorch-modules. Layers investigated in our experiments are marked in bold. Spatial sizes are provided as a visual aid and vary from model to model in our experiments. If not stated otherwise, we always extract from the last layer in a series of blocks (e.g., on the right: \(23\times \) BottleNeck down \(\rightarrow \mathbb {R}^{8 \times 8 \times 1024}\) refers to the last module in the series of 23 blocks)
Table 7 High-level architectures of SqueezeNet and AlexNet, depicted as pytorch-modules. See Table 6 for further details

4.5 Evaluation Details

Here, we provide additional details on the neural networks investigated in Sect. 4 and present a way to quantify the amount of invariances in those networks. This section is included only for completeness. As with Sect. 3.3, readers who are interested in the high-level concepts of our approach rather than its technical details can skip this section. An overview of INN hyperparameters for all experiments is provided in Table 5.

Throughout our experiments, we interpret four different models: SqueezeNet, AlexNet, ResNet, and FaceNet. Summaries of each model’s architecture are provided in Table 6 and Table 7. Implementations and pretrained weights of these models are taken from publicly available sources.

Explained variance: To quantify the amount of invariances and semantic concepts, we use the fraction of the total variance explained by invariances (Fig. 8) and the fraction of the variance of a semantic concept explained by the model representation (Fig. 7).

Using the INN \(\boldsymbol{t}\), we can consider \(\boldsymbol{\hat{z}}= \boldsymbol{t}^{-1}(\boldsymbol{v}\vert \boldsymbol{z})\) as a function of \(\boldsymbol{v}\) and \(\boldsymbol{z}\). The total variance of \(\boldsymbol{\hat{z}}\) is then obtained by sampling \(\boldsymbol{v}\) from its standard normal prior and \(\boldsymbol{z}\) via \(\boldsymbol{z}= \boldsymbol{\Phi }(\boldsymbol{x})\) with \(\boldsymbol{x}\sim p_{\text {valid}}(\boldsymbol{x})\) drawn from a validation set. We compare this total variance to the average variance obtained when sampling \(\boldsymbol{v}\) for a given \(\boldsymbol{z}\) to obtain the fraction of the total variance explained by invariances:

(20)

In combination with the INN \(\boldsymbol{e}\), which transforms \(\boldsymbol{\hat{z}}\) into semantically meaningful factors, we can analyze the semantic content of a model representation \(\boldsymbol{z}\). To analyze how much of a semantic concept represented by factor \(\boldsymbol{e}_i\) is captured by \(\boldsymbol{z}\), we use \(\boldsymbol{e}\) to transform \(\boldsymbol{\hat{z}}\) into \(\boldsymbol{e}_i\) and measure its variance. To measure how much of the semantic concept is explained by \(\boldsymbol{z}\), we simply swap the roles of \(\boldsymbol{z}\) and \(\boldsymbol{v}\) in (20), to obtain

(21)

Figure 8 reports (20) and its standard error when evaluated via 10k samples, and Fig. 7 reports (21) and its standard error when evaluated via 10k samples.
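
Both fractions can be estimated by Monte-Carlo sampling. A sketch for the fraction in (20), assuming a loader yielding (image, label) batches and the same hypothetical `cinn` interface as in the earlier sketches; the fraction in (21) follows analogously by swapping the roles of \(\boldsymbol{v}\) and \(\boldsymbol{z}\) and reading off the variance of the factor \(\boldsymbol{e}_i\) after applying \(\boldsymbol{e}\).

```python
import torch

@torch.no_grad()
def invariance_explained_fraction(cinn, phi, loader, n_v=16):
    per_z_var, samples = [], []
    for x, _ in loader:
        z = phi(x).repeat_interleave(n_v, dim=0)        # n_v invariance samples per z
        v = torch.randn(z.shape[0], cinn.dim_v)
        z_hat = cinn.inverse(v, z).view(x.shape[0], n_v, -1)
        per_z_var.append(z_hat.var(dim=1).mean(dim=1))  # Var over v for each fixed z
        samples.append(z_hat.reshape(-1, z_hat.shape[-1]))
    total_var = torch.cat(samples).var(dim=0).mean()    # Var over both v and z
    return (torch.cat(per_z_var).mean() / total_var).item()
```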

5 Conclusions

Understanding a representation in terms of both its semantics and learned invariances is crucial for interpretation of deep networks. We presented an approach (i) to recover the invariances a model has learned and (ii) to translate the representation and its invariances onto an equally expressive yet semantically accessible encoding. Our diagnostic method is applicable in a plug-and-play fashion on top of existing deep models with no need to alter or retrain them. Since our translation onto semantic factors is bijective, it loses no information and also allows for semantic modifications. Moreover, recovering invariances probabilistically guarantees that we can correctly visualize representations and sample them without leaving the underlying distribution, which is a common cause for artifacts. Altogether, our approach constitutes a powerful, widely applicable diagnostic pipeline for explaining deep representations.