1 Introduction

The task of generating new data from samples has always held a particular fascination in machine learning, both for the potential of almost endless streams of new and original data, and for the implications regarding the knowledge a model extracts about the data manifold. The effectiveness of generative techniques crucially depends on data representation, and different encodings may result in more or less entangled combinations of the explanatory factors of variation behind the data [1, 2]. The key idea behind unsupervised learning of disentangled representations is that real-world data depends on a relatively small number of explanatory factors of variation, which can be compressed and recovered by unsupervised learning techniques [3,4,5]. Strictly related to representation learning, the exploration of the latent space of generative models aims to understand the “arithmetic” of the variational factors [6, 7], and the effect that particular trajectories in the latent space produce in the visible domain [8,9,10].

In spite of the huge amount of work devoted to the exploration of latent spaces, relatively little attention has so far been devoted to the problem of comparing the latent spaces of different generative techniques, i.e., to the problem of locating the internal representation \(z_X\) of a sample X in a given space starting from its representation in the latent space of a different model (see Fig. 1).

Fig. 1

Given a generative model, it is usually possible to have an encoder-decoder pair mapping the visible space to the latent one (even GANs can be inverted, see Sect. 2.2.1). Under this assumption, it is always possible to map an internal representation in a space \(Z_1\) to the corresponding internal representation in a different space \(Z_2\) by passing through the visible domain. This provides a supervised set of input/output pairs, over which we can try to learn a direct map, as simple as possible. The astonishing fact is that, in many situations, a simple linear map gives excellent results. This is quite surprising, given that both encoder and decoder functions are modeled by deep, nonlinear transformations

Fig. 2

Examples of relocations of different types. In the first row we have the original; in the second row, the image reconstructed by the first generative model; in the third row, the image obtained by the second model after linear relocation in its space. (a) Relocation of Type 1, between latent spaces relative to different training instances of the same generative model, in this case a particular Variational Autoencoder [11]. The two reconstructions are almost identical. (b) Relocation of Type 2, between a vanilla VAE and a state-of-the-art Split-VAE [11]. The SVAE produces better quality images, even if not necessarily in the direction of the original: the information lost by the VAE during encoding cannot be recovered by the SVAE, which instead makes a reasonable guess. (c) Relocation of Type 3, between a vanilla GAN and an SVAE. Additional examples involving StyleGAN are given in Sect. 7. To map the original image (first row) into the latent space of the GAN we use an inversion network. Details of the reconstructions may slightly differ, but colors, pose, and the overall appearance are surprisingly similar. In some cases (e.g., the first picture) the reconstruction re-generated by the VAE (from the GAN encoding) is closer to the original than that of the GAN itself.

The key questions we are interested in are the following:

  1. Do different trainings of the same generative model induce the extraction of similar features from data, and hence substantially isomorphic spaces, up to, say, permutations or linear transformations? We refer to transformations of this type as being of Type 1.

  2. Do different architectural models driven by common learning objectives (e.g., maximizing log-likelihood) learn similar features? How much do the extracted features depend on the neural network structure? We refer to transformations of this type, between spaces of variants of models in the same class, as being of Type 2.

  3. Finally, what is the influence of the learning objective on the internal representation? Is, e.g., a Generative Adversarial Network learning the same features as a Variational AutoEncoder? We refer to these transformations as being of Type 3.

Any answer, whether positive or negative, could substantially improve our knowledge of generative techniques.

Our surprising preliminary results, reported in this article, suggest that (provided models have not been trained or explicitly designed to behave differently) it is possible to pass from one latent space to another by means of a simple linear mapping that preserves most of the information.

This linear transformation may be computed directly through linear regression, but we advocate a learning-based technique based on a suitably small “support set” of data samples capturing, in the visible space, the key variational factors of the data manifold. When we say “small”, we mean a cardinality comparable with the number of variables in the latent space: for instance, in the case of CelebA, we experimented with a support set of 150 images. Locating these 150 samples in the two spaces is enough to define a relocation map for all data.

The main results of our investigation are summarized in Fig. 2. Figure 2a describes an example of relocation between different trainings of the same network (relocation of Type 1); Fig. 2b is relative to the relocation between different models of the same class—two different VAEs, in this case (relocation of Type 2); Fig. 2c is an example of relocation between a GAN and an SVAE, that is, between models with different learning objectives (relocation of Type 3). While details may slightly differ, especially for transformations between different generative models, the overall appearance (pose, colors, and background) is substantially preserved. Considering the nonlinearity of these generative processes, the result is, at first glance, quite surprising: pairs of points related by a simple linear mapping in the latent spaces of two different generative models are decoded by the respective decoders into closely related, in some cases almost identical, images!

1.1 Structure of the article

The structure of the article is the following. We start by providing, in Sect. 2, a quick introduction to generative modeling, and in particular to latent-variable models, comprising the popular Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs); in this section, we also discuss the problem of inverting GANs. Section 3 covers the semantic exploration of latent spaces, representation learning, and disentanglement. In Sect. 4 we introduce the datasets, the models, and the methodology used for our experiments. Since we focus on linear transformations, they can be defined by a small set of points, which we call the Support Set: locating the points of the Support Set in the two latent spaces is enough to define the transformation. Our approach to building a good Support Set is discussed in Sect. 5. In Sect. 6 we give numerical results about the mappings (visual examples, more readily interpretable, are spread throughout the article). Section 7 is devoted to the latent space of StyleGAN, which seems to present some pathological issues: many faces in the CelebA dataset lie outside of its generative range. Even in this case, however, provided we confine the transformation to the StyleGAN subspace, we discover interesting linear mappings to other spaces. Conclusions and future work are discussed in Sect. 8. Additional material is given in the appendices: a detailed description of the models used in this work (Sect. A) and a full list of the images in the CelebA Support Set (Sect. B).

2 Generative modeling

Generative modeling is the task of learning the high-dimensional probability distribution of a data manifold starting from a representative set of samples. When successfully trained, generative models can be used to create new samples from the underlying distribution, possibly providing estimations of their likelihood. The learning process also offers an essential and valuable insight into the kind of features used to encode the distribution, and the way the model “interpreted” and “understood” the data.

At the heart of generative modeling lies a relatively small set of techniques [12, 13]: Auto-Regressive models [14, 15], Flow models [16,17,18,19], Energy-based models [20,21,22] and Latent-Variable models, particularly GANs [6, 23, 24] and VAEs [25,26,27].

In this article, we shall mostly focus on the popular and effective Latent-Variable models, that is, models where the distribution p(x) of a data point x is expressed through marginalization over a vector z of latent variables:

$$\begin{aligned} p(x) = \int _{z} p(x|z)\,p(z)\,dz = \mathbb {E}_{p(z)}\left[ p(x|z)\right] \end{aligned}$$

where z is the latent encoding of x, distributed according to a known distribution p(z) called the prior. The distribution \(p(x|z)\) is usually learned by a deep neural network; after training, it can be used to generate new samples via ancestral sampling (a minimal code sketch follows the two steps):

  1. sample \(z \sim p(z)\);

  2. generate \(x \sim p(x|z)\).
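A minimal PyTorch sketch of the two steps, assuming a generic trained generator (the `decoder` callable below is a hypothetical placeholder):

```python
import torch

def ancestral_sample(decoder, latent_dim, n=16):
    # Step 1: sample z from the prior, here a standard Gaussian N(0, I).
    z = torch.randn(n, latent_dim)
    # Step 2: generate x ~ p(x|z); with a deterministic decoder this amounts
    # to taking the mean of p(x|z), i.e., decoder(z).
    with torch.no_grad():
        return decoder(z)
```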

2.1 Variational autoencoders

A Variational AutoEncoder (VAE) [28] has a structure similar to that of a classical auto-encoder [29, 30], being composed of an encoder, producing a latent vector z from an input x, and a decoder, reconstructing an approximation \({\hat{x}}\) of the input from the latent code; the two components are trained jointly using, e.g., a mean squared error loss \(\Vert x - {\hat{x}} \Vert _2^2\). However, in order to regularize the latent space, which is a precondition for semantically meaningful generation [12], latent variables are interpreted as parameters of a local distribution \(q(z|x)\) and a Kullback–Leibler component \(KL(q(z|x)\; \Vert \;\mathcal {N}(0, 1))\) is added to the reconstruction loss, with the purpose of pushing the marginal distribution q(z) toward a standard Gaussian \(\mathcal {N}(0, 1)\). Balancing these two loss components, usually via a \(\gamma \) or \(\beta \) parameter, is crucial for good generation and for learning disentangled features [31,32,33].
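For concreteness, a minimal PyTorch sketch of this objective, assuming, as is standard, a Gaussian posterior parameterized by `mu` and `logvar`; the `beta` argument plays the balancing role mentioned above:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Standard VAE objective: reconstruction + beta * KL.

    Assumes q(z|x) = N(mu, diag(exp(logvar))) and a N(0, I) prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")  # ||x - x_hat||^2
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```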

Several issues affect the performance of VAEs, most notably the blurriness of generated images [34]. Many variants have thus been proposed over the years, often addressing the mismatch between the aggregate inference distribution q(z) and the prior p(z). These include: quantization of the latent code (VQ-VAE [35]), the use of normalizing flows (Hybrid VAE [36]), two-stage architectures [37], and hierarchical models [16, 38].

2.2 Generative adversarial networks

In a Generative Adversarial Network (GAN) [6, 39, 40] a generator, acting as a sampler for the desired distribution, is jointly trained with a discriminator, which evaluates the output of the generator by attempting to distinguish real from generated (“fake”) data. This can be formalized as a zero-sum game, where one agent’s gain is the other agent’s loss; the generator and the discriminator are trained alternately, freezing the respective adversarial component; at the end of the process the generator is supposed to win, producing samples that the discriminator is unable to distinguish from real ones.
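A minimal PyTorch sketch of the alternating scheme, with the common non-saturating heuristic for the generator; `G`, `D`, and the two optimizers are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, latent_dim):
    """One alternating step of the minimax game (D outputs logits)."""
    # -- Discriminator update (the generator is frozen via detach()) --
    z = torch.randn(real.size(0), latent_dim)
    fake = G(z).detach()
    logits_real, logits_fake = D(real), D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # -- Generator update (only G's parameters are stepped) --
    fake = G(torch.randn(real.size(0), latent_dim))
    logits = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```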

GANs are known for their unstable training and for several issues, among which the well-known mode collapse phenomenon [40]. Multiple variations of the loss function have been studied over time [41], including the Wasserstein loss [42], the least squares loss [43], and the introduction of a penalty term for the discriminator [44]. Furthermore, a myriad of variations of the structure itself have been proposed, among which: maximizing the mutual information between specific latent variables [45]; exploiting pairs of GANs to perform style transfer between images in distinct datasets [46]; and GANs with attention layers [47].

A particularly interesting series of works comes from the application of style-transfer concepts to GANs (StyleGAN and its successors [48,49,50]). StyleGAN builds on Progressive GANs [24], whose structure matches that of a baseline GAN but whose training is progressive: the architecture is trained starting from down-sampled images at very low resolution, and at each progression step the input size is increased while additional layers are introduced into both generator and discriminator.

StyleGAN further builds on this structure by adding to the generator (the Synthesis network) a fully connected Mapping network, which takes the usual seed \(z \in Z\) and produces a “style” vector \(w \in W\). This vector is then specialized per layer through Adaptive Instance Normalization (AdaIN), which, according to the authors, produces a behavior similar to style transfer. Furthermore, a small amount of noise is added to all blocks of the Synthesis network to better fill in the output details. The full structure of StyleGAN is shown in Fig. 3.

Fig. 3

Structure of the StyleGAN generative network (picture from [48]). Observe: (1) the two distinct latent spaces Z and W; (2) the mapping network taking a randomly sampled point \(z\in Z\) as input and generating a style vector w; (3) the use of Adaptive Instance Normalization, or AdaIN (Blocks A), to apply style vectors after each convolution layer of the Synthesis network; (4) the exploitation of noise as an additional source of randomness passed through learned scaling layers (Blocks B)

2.2.1 GAN inversion

The generator of a GAN usually takes as input a seed \(z \sim \mathcal {N}(0, 1)\), and has a role directly comparable to that of a VAE decoder. However, unlike a VAE, a GAN lacks a direct encoding process for input samples. If, as is the case for our study, both generative and encoding processes are needed, a third neural network has to be added to the pre-trained GAN as a sort of plug-in encoder. This re-coder component is known as an inverse GAN, and building an accurate re-coder is a well-known problem in the literature [51].

Fig. 4

Results of our own network for StyleGAN inversion. Images in the first row have been generated by StyleGAN; they are re-coded into the W space and regenerated (second row). The two images are hardly distinguishable. However, as we shall see in Sect. 7, inversion can be more problematic for images outside the generative range of the model; in principle, a good generative model should be able to produce any sample, provided it is not too atypical

Several approaches to inversion have been explored [52,53,54,55], mostly for editing applications. The simplest is SGD optimization [56]; a learning-based alternative trains a neural network on generated images to reconstruct the original latent vector using a mean squared error loss \(\Vert z - {\hat{z}} \Vert _2^2\), with the advantage that over-fitting is never an issue, since training is not constrained to samples of the original data. Hybrid methods combining both efforts have also been explored [57, 58].
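A minimal sketch of the learning-based variant, assuming a pre-trained generator `G` and a trainable re-coder `E` (both hypothetical placeholders):

```python
import torch
import torch.nn.functional as F

def train_recoder(G, E, opt, latent_dim, steps=10_000, batch=64):
    """Train the re-coder E on pairs (z, G(z)) to recover z from a generated
    image. Fresh pairs are sampled at every step, so overfitting is never an
    issue: training is not constrained to samples of the original data."""
    for _ in range(steps):
        z = torch.randn(batch, latent_dim)   # sample seeds from the prior
        with torch.no_grad():
            x = G(z)                         # corresponding generated images
        loss = F.mse_loss(E(x), z)           # || z - E(G(z)) ||^2
        opt.zero_grad(); loss.backward(); opt.step()
```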

Recent works have focused mostly on the inversion of the popular StyleGAN, building on previous work with a variety of inversion structures and minimization objectives [59,60,61,62,63], with the aim of generalizing to any dataset. We instead adopted a simpler and narrower approach, developing our own StyleGAN inverter for the W space using a naive re-coding network. It works surprisingly well on commonly generated samples, with a final mean squared error close to 0.0040. We show some examples of re-coding in Fig. 4.

3 Semantic interpretation of latent spaces

The latent space of a generative model efficiently synthesizes the information contained in the data; however, the resulting compressed vectors cannot be easily mapped onto understandable features such as labels or attributes. It therefore remains unclear how exactly a model learns from data, in terms of how well it encodes features, biases, and human-meaningful characteristics. At the same time, this knowledge could fundamentally influence the quality of models and provide a foundation for improving their performance without relying solely on empirical and qualitative analyses.

Conditional architectures [45, 64] can mitigate this issue by explicitly feeding features alongside samples during training, but in doing so they recast the task as a supervised problem with respect to the classes on which conditioning is done, with all other data features remaining non-explainable. These approaches do not provide interesting information about the way the neural network understands data and, for this reason, will not be discussed in this work.

3.1 Exploration and disentanglement

Many works attempt to understand the latent space of GANs by exploring it: they introduce small nudges along a direction, relying on the empirical principle that small steps in the latent space correspond to small changes in the generated data. The approach is particularly useful for image editing: once a semantically meaningful direction is found (e.g., color, pose, or shape), it can be traveled to tweak an image, introducing a desired feature without the need for a conditional generative model. InterFaceGAN [8] supposes that for a given feature taking values in \((-\infty , \infty )\) there exists a hyperplane in the latent space whose normal vector allows a gradual modification of the feature, and which can be found, e.g., via an SVM [65]. Further work based on this idea searches for these directions iteratively or as an optimization problem [66], and extends it to controllable walks in the latent space [10].
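As an illustration of the hyperplane idea, a minimal scikit-learn sketch, assuming latent codes and binary attribute labels are available (e.g., from a pre-trained attribute classifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

def semantic_direction(latents, labels):
    """Unit normal of a hyperplane separating latent codes with / without
    a binary attribute (labels in {0, 1})."""
    svm = LinearSVC().fit(latents, labels)
    n = svm.coef_[0]
    return n / np.linalg.norm(n)

# Editing: z_edit = z + alpha * n; increasing alpha gradually strengthens
# the corresponding attribute in the decoded image.
```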

A different, more systematic approach is that of [7], which uses a closed-form equation to find the editing direction \(n_i\) for each layer i of a generator; these per-layer directions are then composed into the overall direction n. Another approach of the same “arithmetic” flavor comes from [67], where a generative application of PCA with a nonlinear kernel is used to determine the hidden features of a small-scale dataset, without reliance on a particular generative model.

Much less exploration work has been devoted to VAEs. An example is given by [9], which, however, works on a conditional architecture in order to produce lower-dimensional subspaces that are easier to analyze.

4 Datasets, models, methodology

4.1 Datasets

As stated in the abstract, we confined our analysis to the familiar and largely investigated data manifold of human faces. Our dataset of reference is CelebA [68], including its higher-quality version CelebA-HQ [24]. Images taken from CelebA have been aligned as described in the original paper [68] and then cropped to size \(128\times 128\) with a y offset of 45 and an x offset of 25, in order to remove as much background information as possible. The crop is then downsampled to size \(64\times 64\) with bilinear interpolation.
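A minimal PIL sketch of this preprocessing (the offsets and sizes are those stated above; the function name is ours):

```python
from PIL import Image

def preprocess_celeba(path):
    """128x128 crop with offsets (x=25, y=45), then bilinear resize to 64x64."""
    img = Image.open(path)                        # aligned CelebA image
    img = img.crop((25, 45, 25 + 128, 45 + 128))  # (left, top, right, bottom)
    return img.resize((64, 64), Image.BILINEAR)
```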

CelebA-HQ is a dataset of 30K images at resolution \(1024\times 1024\), obtained from a subset of CelebA with a complex methodology explained in Appendix C of [24], comprising a sophisticated preprocessing phase, super-resolution techniques, and the selection of the best quality samples.

4.2 Generative models

For our experiments we took into consideration four different models, two GANs and two VAEs; in each class, we investigated a basic, average-quality “vanilla” version and a more sophisticated, state-of-the-art model. A summarizing Table 1 for these models is provided. In more detail, we investigated the following architectures:

  1. Vanilla VAE [28] using \(\gamma \) balancing [31] with a latent dimension Z = 64, trained on the cropped CelebA;

  2. Vanilla GAN [39] with a latent dimension Z = 64, trained on the cropped CelebA;

  3. SVAE [11] with a latent dimension Z = 150, trained on the cropped CelebA;

  4. StyleGAN [48] pre-trained on CelebA-HQ, which has a latent dimension Z of size 512 and a style-vector latent dimension W of the same size.

The structure of the StyleGAN has been already briefly discussed in Sect. 2.2. The in-depth architecture of the other models, not central to the topic of this article, is given in Appendix A.

The dimension of the latent space and the resolution of the different models is summarized in Table 1.

Table 1 Dimension of the Latent Space and Resolution for the different models

4.3 Methodology

For each of the previous models, apart from StyleGAN, for which we only had a single set of pre-trained parameters at our disposal, we trained and tested five different instances. Unless otherwise stated, reported values are to be understood as averages over the different trainings.

Mapping between different models (transformations of Type 2 and 3) raises several additional issues. Firstly, the two latent spaces may have considerably different dimensions, for instance 512 for StyleGAN versus 150 for the SVAE and 64 for the other models, and may work at different resolutions, for instance \(1024\times 1024\) for StyleGAN versus \(64\times 64\) for the other models. Furthermore, the two generative models may have been trained on two different datasets which, albeit similar, have different data and different crops. To address this, when passing from CelebA-HQ to CelebA we take a simplified crop of dimension \(880\times 880\) with a height offset of 20 and a width offset of 60, which is then downsampled to size \(64\times 64\) with bilinear interpolation.

Since we are interested in linear mappings, the transformations may be defined by a small set of “corresponding” points common to both spaces: this is what we call a Support Set. Our methodology to build it is described in Sect. 5. The Support Set is defined in the visible domain; we trace the encodings of its elements in the different latent spaces, and define the map by linear regression with a mean squared error loss. When we cannot use a Support Set, we may directly work with the whole visible domain (or the subset of the visible domain common to the two spaces), sampling minibatches from it.

5 Support set

In this section we explain the technique used to build a small support set of examples driving the linear transformation. It is based on the following steps, each detailed in a respective subsection:

features ordering:

we order latent variables according to their relevance for reconstruction, using a suitable metric discussed below;

features selection:

we select a small number n of particularly significant latent variables; \(2^n\) must be lower than the cardinality of the support set;

sample selection:

we select points in the space belonging to extremal regions with respect to the selected features.

5.1 Features ordering

Feature importance—the task of associating a score with input features based on how useful they are for solving a specific problem—is a major subfield of Machine Learning. In the case of generative modeling, the goal is to maximize the (log-)likelihood of the data, and it is natural to score features according to their contribution to this objective. It is worth observing that alternative techniques, such as PCA, would not be beneficial to this aim, due to the shape of the prior latent distribution, which is typically a spherical Gaussian.

Our feature importance technique requires an encoder in addition to a decoder: it fits particularly well with VAEs, but can be generalized to GANs by exploiting a re-coder network (see Sect. 2.2.1). Specifically, to evaluate the contribution of a variable to the loss function, we compute, over a large number of samples, the average difference between the reconstruction error when the latent variable is zeroed out and the error when it is normally taken into account. We call this quantity the reconstruction gain associated with the latent variable. It was introduced in [69], where it was used to compare the reconstruction error and the Kullback–Leibler divergence on a per-variable basis, in order to clarify the variable collapse phenomenon [27, 70, 71].
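A minimal numpy sketch of the reconstruction gain, assuming hypothetical `encode`/`decode` callables for the model under analysis:

```python
import numpy as np

def reconstruction_gain(encode, decode, images, var_idx):
    """Average increase of the reconstruction error when latent variable
    var_idx is zeroed out, compared to normal reconstruction."""
    z = encode(images)
    base = np.mean((images - decode(z)) ** 2)
    z_zeroed = z.copy()
    z_zeroed[:, var_idx] = 0.0        # suppress the latent variable
    zeroed = np.mean((images - decode(z_zeroed)) ** 2)
    return zeroed - base              # large for informative variables
```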

We ran the experiment on the SVAE, which in our setting has a latent space of 150 variables. In Fig. 5 we show the information gain relative to all its latent variables, ordered by relevance.

Fig. 5

Information gain for all variables, in decreasing order. Only a handful of variables are in charge of the macroscopic factors of variation

Eleven variables have a score higher than 10, although the distribution has a relatively long tail: the first 20 variables are responsible for about 75% of the information.

Fig. 6

Effect of the seven most informative latent variables in the visible domain. Each image is obtained by varying a specific variable in the range \([-2.25, +2.25]\). Considering these are the variables with the largest information gain, it may be argued that their impact is less pronounced than expected. Most of the variables are associated with a change in luminosity of all or part of the image, possibly combined with modifications in hair color, source of illumination, and tiny variations in the pose. In the case of variable 21, there seems to be a progressive Female-Male transition (and vice versa for variable 114)

5.2 Feature selection

We keep a small number of the most informative variables. Given the way we shall use them, this number n must be smaller than the base-2 logarithm of the cardinality of the support set (i.e., \(2^n\) must not exceed it). In our case, we aim at a support set of cardinality 150, so we focus on the seven most relevant variables (\(2^7 = 128 \le 150\)).

In Fig. 6 we show examples of the effect of some of these variables on generated images: we take a random point and progressively modify the given variable in the range between \(-2.25\) and \(+2.25\) (recall that the latent space standard deviation is 1).

5.3 Sample selection

Finally, we divide the latent space into sectors corresponding to extreme values of the previously selected variables, and pick samples from these sectors.

Fig. 7

Example of sectors in three dimensions (cropped to distance 2 from the origin). The distance between sectors is equal to twice a configurable threshold. We work with the seven most informative latent variables, obtaining a total of \(2^7=128\) sectors

More precisely, having defined a threshold th and a “direction” dir given by a \(+/-\) sign for each selected variable, the sector defined by the pair \((th, dir)\) is the set of points with signs compatible with dir and at a distance from the origin larger than th. Since we consider all possible directions, this gives a total of \(2^n\) sectors, where n is the number of selected variables (for a fixed th). In each sector, we pick a sample at random (as th grows, sectors become progressively less inhabited).
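A minimal numpy sketch of the selection, under one plausible reading of the sector definition (a per-coordinate threshold th along each selected variable; names are ours):

```python
import numpy as np

def sample_support_set(latents, selected_vars, th=1.0, rng=np.random):
    """One random sample per sector: a sector is a sign pattern over the
    selected variables, restricted to points beyond th along each of them."""
    sub = latents[:, selected_vars]            # restrict to selected variables
    n = len(selected_vars)
    chosen = []
    for code in range(2 ** n):                 # enumerate all 2^n directions
        dirs = np.array([1 if (code >> i) & 1 else -1 for i in range(n)])
        mask = np.all(dirs * sub > th, axis=1) # compatible signs, beyond th
        idx = np.flatnonzero(mask)
        if len(idx) > 0:                       # some sectors may be empty
            chosen.append(rng.choice(idx))
    return chosen                              # indices into the dataset
```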

It is interesting to observe that the number of latent points of the dataset falling within different sectors at a given threshold is far from uniform. This seems to confirm that the actual latent distribution is far from the desired Gaussian prior and, in a VAE, is a symptom of the potential mismatch between the generative prior and the aggregate inference distribution computed by the encoder, a well-known and problematic aspect of VAEs [72,73,74]. Attempts to solve this issue act either on the loss function [75] or exploit more complex priors [36, 76, 77]; the actual effect of these techniques on the latent space is an interesting direction for future investigations.

In Fig. 8 we show typical inhabitants of a few sectors. As expected, they share macroscopic features like background color, pose, hair, and illumination.

Fig. 8

Examples of data in different sectors. For each sector, images are different, but share macroscopic features: background color, pose, hair, illumination, etc.

Part of the 128 images resulting from our selection process are depicted in Fig. 9. The complete list of images in the support set is reported in the appendix. The samples in the support set occupy “extreme” positions in the latent space with respect to the most informative directions: for this reason, they are supposed to be representative of the principal factors of variation in the dataset.

Fig. 9

Part of the images in the support set resulting from our selection process. The samples are supposedly representative of the principal factors of variation in the dataset. Additional examples are given in the appendix

As a partial confirmation of this hypothesis, we expect the distance between elements of the support set to be considerably higher than the average distance between points in the full dataset. This is indeed the case: the mean squared error between random CelebA images is 0.116, versus 0.183 for samples in the support set.

6 Results

This section contains numerical results on the transformations between latent spaces. The discussion of StyleGAN, given its relevance and some interesting pathological issues, is postponed to the next section.

Here, we shall use the names VAE, GAN, and SVAE to refer to our specific implementations of these models, discussed in Sect. 4.2 and detailed in Appendix A.

We build a set of corresponding input-output pairs by encoding the Support Set (or the full set of visible data) into the two latent spaces. Then, we fit a linear map by linear regression, minimizing the mean squared error between the target and the computed latent vectors.
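A minimal numpy sketch of this construction, assuming hypothetical `encode_1`/`encode_2` functions for the two models; consistently with the mapping model of Sect. 7, the map has no bias term:

```python
import numpy as np

def fit_linear_map(support_images, encode_1, encode_2):
    """Least-squares fit of a matrix M such that Z1 @ M ~ Z2."""
    Z1 = encode_1(support_images)   # locate the Support Set in the first space
    Z2 = encode_2(support_images)   # ... and in the second one
    M, *_ = np.linalg.lstsq(Z1, Z2, rcond=None)
    return M

# Relocation of a new sample x: decode_2(encode_1(x) @ M).
```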

For each transformation, we provide three values:

L-MSE:

Latent Mean Squared Error. This is the loss of the model, namely the mean squared error between the target vectors and those computed by the model;

R-MSE:

Reconstruction Error. This is the mean squared error between the original image in the visible domain and its reconstruction via the source generative model;

M-MSE:

Mapped Error. This is the mean squared error, in the visible domain, between original images and images reconstructed by the target generative model after linear mapping.

The three errors are graphically described in Fig. 10.

Fig. 10

Relocation errors. An original point o in the visible domain is mapped into internal representations \(z_1\) and \(z_2\) in the latent spaces \(Z_1\) and \(Z_2\). The map M is trained to reconstruct \(z_2\) from \(z_1\): L-MSE is the mean squared error between \(z_2\) and \(M(z_1)\). R-MSE is the mean squared error, in the visible domain, between o and its reconstruction according to the first generative model. M-MSE is the mean squared error, in the visible domain, between o and \(D_2(M(z_1))\)

The latent error L-MSE is not easily interpreted; the comparison between R-MSE and M-MSE provides more intelligible information about the quality of the translation.
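A minimal numpy sketch of the three metrics, with hypothetical encoder/decoder callables and a linear map M as above:

```python
import numpy as np

def relocation_errors(x, encode_1, decode_1, encode_2, decode_2, M):
    """The three metrics of Fig. 10 for a batch of images x."""
    z1, z2 = encode_1(x), encode_2(x)
    z2_hat = z1 @ M                               # mapped latent vectors
    l_mse = np.mean((z2 - z2_hat) ** 2)           # L-MSE: latent error
    r_mse = np.mean((x - decode_1(z1)) ** 2)      # R-MSE: source reconstruction
    m_mse = np.mean((x - decode_2(z2_hat)) ** 2)  # M-MSE: mapped reconstruction
    return l_mse, r_mse, m_mse
```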

The results are given in Table 2.

Table 2 Mapping results for different model pairs: (L-MSE) MSE between the target and Mapped Latent vectors; (R-MSE) MSE between the original and Reconstructed (encoded-decoded) images; (M-MSE) MSE between the original and mapped images via the learned linear mapping. When source and target coincide, we mean different trainings of the same model (Type 1 transformations)

For the sake of comparison, it is worth recalling that the mean squared error between random CelebA images is 0.116; for all model pairs, the M-MSE is below 0.039.

7 The StyleGAN space

The “extreme” nature of the images in the Support Set makes them a very natural benchmark of the expressiveness of generative models: is it possible to reconstruct these images by passing them through an encoding-decoding process?

Fig. 11

StyleGAN inversion on images in the Support Set. The macro structure (background, pose, illumination, etc.) is preserved, but all other features are lost: images in the Support Set seem to lie outside of the generative range of StyleGAN. Note also the more “conventional” nature of the images obtained by the inversion

For StyleGAN trained on CelebA-HQ, the results are disappointing (see Fig. 11, and compare them with the inversion of generated images in Fig. 4). Although the macrostructure is preserved (background, pose, and illumination), details differ markedly. Numerically, while the average mean squared error on generated images is 0.026, the corresponding value for the Support Set is 0.251, almost ten times higher.

Our conjecture is that StyleGAN is simply unable to generate the data in the support set: these images lie outside its generative range, most likely because of its training dataset. To check this claim, we implemented a gradient ascent technique to generate latent representations corresponding to a desired output. Once again, the technique provides almost perfect results on generated images, but substantially fails on images of the CelebA support set, as shown in Fig. 12.
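A minimal PyTorch sketch of this kind of latent optimization (gradient descent on the reconstruction error, equivalently ascent on its negative; `G` stands for the differentiable synthesis network and is a placeholder):

```python
import torch
import torch.nn.functional as F

def invert_by_optimization(G, target, latent_dim, steps=1000, lr=0.01):
    """Optimize a latent vector so that G(w) matches a target image."""
    w = torch.zeros(1, latent_dim, requires_grad=True)  # trainable latent
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(G(w), target)  # reconstruction error in pixel space
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()
```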

Fig. 12

Gradient ascent technique for StyleGAN on data in the Support Set. The original is in the first row, and the image generated through gradient ascent, in the second. The technique confirms that these images cannot be generated by StyleGAN

We believe that the latent space of StyleGAN, trained on CelebA-HQ, faithfully reflects only a subspace of the latent spaces of our other models, trained on the full CelebA dataset. In particular, points in our extreme sectors seem to lie outside the generative range of StyleGAN, or to be severely underrepresented (Fig. 13). The problem is possibly also related to the well-known fact that faces generated by StyleGAN (and other generative networks) can be easily distinguished from real ones [78,79,80].

Fig. 13

CelebA Sectors seem to be external to the latent space of StyleGAN

Fig. 14

Mapping from the W space of StyleGAN to the latent space of the SVAE. In the first row we have sources generated by StyleGAN from \(w\in W\). In the second row we have the SVAE reconstructions, starting from suitably cropped and rescaled images (the SVAE works at resolution \(64\times 64\)): these images are the best possible approximations of the sources obtainable by the SVAE. In the third row we show the output produced by the SVAE decoder after mapping each w into its latent space: the results are very similar to those of the second row

Fig. 15

Mapping from the latent space of SVAE to the W space of StyleGAN. In the first row we have images generated by StyleGAN: StyleGAN(w), for \(w\in W\). In the second row we have their SVAE reconstructions, starting from suitably cropped and rescaled versions. Images in the third row are obtained by first encoding StyleGAN(w) in the latent space of the SVAE, obtaining a latent representation z. This z is then linearly transformed to a vector \(\hat{w} \in W\); the final image is \(StyleGAN(\hat{w})\)

7.1 Comparison with different spaces

Since exploiting the Support Set is not a viable option here, we define a direct mapping by regression over all data. As is customary in exploration studies, we work with the W space of StyleGAN; as a matter of fact, the Z space passes through a long series of fully connected layers (the Mapping network) which, by construction, we presume not to be linearly invertible.

Here we map the W space of StyleGAN, trained on CelebA-HQ, to the latent space of the SVAE, trained on CelebA. The input to the transformation is the vector w, obtained by ancestral sampling from the Z space. The expected output z is obtained by synthesizing with StyleGAN the image corresponding to w, cropping and resizing it to dimension \(64\times 64\), and encoding it in the SVAE latent space. The result of the linear map is called \(\hat{z}\); let \( SVAE (z)\) and \( SVAE (\hat{z})\) be the corresponding decodings into the visible domain. As usual, input vectors w may be generated ad libitum, with no risk of overfitting.
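A minimal sketch of the data generation for this regression (all callables are placeholders for the pipeline just described):

```python
def make_training_pairs(sample_w, stylegan, crop_resize, svae_encode, n=10_000):
    """Inputs are style vectors w; targets are SVAE encodings of the
    corresponding (cropped, rescaled) StyleGAN images. Fresh w vectors can
    be generated ad libitum, so there is no risk of overfitting."""
    W = sample_w(n)                   # w vectors via the Mapping network
    imgs = crop_resize(stylegan(W))   # synthesize, crop, resize to 64x64
    Z = svae_encode(imgs)             # expected outputs of the linear map
    return W, Z                       # fit Z_hat = W @ M by least squares
```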

After training, the mean squared error between z and \(\hat{z}\) is around 0.45, with a standard deviation of 0.05. The mean squared error between \( SVAE (z)\) and \( SVAE (\hat{z})\) is 0.014, with a standard deviation of 0.002. All experiments have been repeated over five different parameter configurations of the SVAE, relative to five different trainings (obviously, each experiment results in a different linear transformation).

The results are shown in Fig. 14. They are not perfect, but definitely interesting.

We also tested a few variants weighting the distance between latent variables according to their “information relevance”, but we did not observe significant improvements.

Let us now turn to the mapping from the latent space of the SVAE to that of StyleGAN. To train the transformation model (as usual, a single dense layer with no bias), we simply swap the input and output of the previous network. After training, the mean squared error between w and \({\hat{w}}\) is around 0.029, with a standard deviation of 0.004. The mean squared error between StyleGAN(w) and \(StyleGAN(\hat{w})\) is 0.076, with a standard deviation of 0.014. The results are really good, as can be visually checked in Fig. 15.

8 Conclusions

In this article, we addressed the problem of comparing the latent spaces of different generative models by defining transformations between them. Specifically, we showed that it is possible to pass from one latent space to another by means of a simple linear map preserving most of the information. Hence, the organization of the latent space seems to be largely independent of:

  • the training process;

  • the network architecture;

  • the learning objective: GANs and VAEs share the same space.

The result is original, surprising, and largely unexpected; apparently, the latent space, if not artificially constrained by different objectives, naturally organizes itself in a way that depends only on the data manifold. Of course, we expect that this “natural” structure can be altered in many ways, e.g., through conditioning, which strongly impacts the latent structure, or via transformations like normalizing flows, which explicitly aim at a strong regularization of the space. We also do not expect the two spaces Z and W of StyleGAN to be linearly related, since otherwise the long chain of eight dense layers between them would serve no purpose.

Our result has many implications from the point of view of representation learning and disentanglement. The fact that the latent space has a sort of implicit and native structure raises promising expectations about the possibility of learning features in a completely unsupervised way. Moreover, the recent observation [8, 66] that variation along a single semantic feature traces a quasi-linear manifold in the latent space of generative models fits well with our empirical findings, opening interesting perspectives on the possibility of “porting” disentanglement between different spaces and, more generally, on understanding the issue in a broader framework.

Since the transformation between spaces is linear, it can be defined by a small set of independent points with cardinality equal to the dimension of the latent space; this is what we call a Support Set. Locating these points in the two latent spaces is enough to define the map. In principle, any set of independent points could serve as a Support Set but, for robustness reasons, it seems preferable to choose points as far apart from each other as possible. We described a possible approach for defining such a set, based on “sectors” of the space. The set is of interest in its own right, as it is representative of the principal factors of variation in the dataset. For this reason, it also provides a natural benchmark to test the expressiveness of generative models.

This leads to an additional side contribution of our work: contrary to common belief, StyleGAN trained on CelebA-HQ seems to have serious generative deficiencies: many images, in particular most of the images in our CelebA Support Set, appear to lie outside its generative range. As is also evident from the inversion results, the StyleGAN generative process privileges standardization, strongly penalizing defects, oddities, and eccentricities: the StyleGAN space is not a space for minorities.

This could be a cause for concern about CelebA-HQ. Not only is it computationally demanding, but one could also wonder whether it has statistical relevance: an assortment of 30K images in a space of dimension \(3\times 2^{20}\) looks more like a collection of scattered points than a data manifold.

Our results also raise serious worries about the increasing use of generative techniques for data augmentation. All generative techniques seem to have serious biases, privileging likelihood over diversity: using them for data augmentation may have no statistical significance, and it is a practice that should be discouraged.

As for future developments, most of the work just lies ahead. Here is a short, not-exhaustive list of possible topics:

  • Test, and hopefully confirm, our mapping results on different datasets;

  • Deepen the relationship between disentanglement and suitable linear manipulations of the latent space;

  • Define and test a Support Set for StyleGAN and CelebA-HQ;

  • Investigate the possibility of improving the transformation with residual nonlinearities, and in that case study them;

  • Investigate further, and possibly remedy, the generative deficiencies of StyleGAN.