Data generation, which is the task of generating new realistic samples given a set of training data, is a fascinating problem of AI, with many relevant applications in different areas, spanning from computer vision, to natural language processing and medicine. Due to the curse of dimensionality, the problem was practically hopeless to solve, until Deep Neural Networks enabled the scalability of the required techniques via learned approximators. In recent years, deep generative models have gained a lot of attention in the deep learning community, not just for their amazing applications, but also for the fundamental insight they provide on the encoding mechanisms of Neural Networks, the extraction of deep features, and the latent representation of data.

In spite of these successes, deep generative modeling remains one of the most complex and expensive tasks in AI. Training a complex generative model typically requires a lot of time and computational resources. To give a couple of examples, the hyper-realistic Generative Adversarial Network for face generation in [36] required training on 8 Tesla V100 GPUs for 4 days; the training of BERT [18], a well-known generative model for NLP, takes about 96 h on 64 TPU2 chips.

As remarked in [51], this computational cost has huge implications, both from the ecological point of view and for the increasing difficulty that academics, students, and researchers, in particular those from emerging economies, face in doing competitive, state-of-the-art research. As a good practice in Deep Learning, one should give detailed reports about the financial cost of training and running models, so as to promote the investigation of increasingly efficient methods.

In this article, we offer a comparative evaluation of some recent generative models. To make the investigation more focused and exhaustive, we restricted the analysis to a single class of models: the so called Variational Autoencoders [38, 48] (VAEs).

Variational Autoencoders are becoming increasingly popular inside the scientific community [53, 60, 61], both due to their strong probabilistic foundation, which will be recalled in “Theoretical Background”, and the precious insight they provide on the latent representation of data. However, in spite of the remarkable achievements, the behaviour of Variational Autoencoders is still far from satisfactory; there are a number of well-known theoretical and practical challenges that still hinder this generative paradigm (see “The Vanilla VAE and Its Problems”), and whose solution drove the recent research on this topic. We try to give an exhaustive presentation of most of the VAE variants in the literature, relating them to the implementation and theoretical issues they were meant to address.

We then focus on a restricted subset of recent architectures that, in our opinion, deserve a deeper investigation, for their paradigmatic nature, the elegance of the underlying theory, or some key architectural insight. The three categories of models that we shall compare are the Two-stage model [16], the Regularized Autoencoder [39], and some versions of Hierarchical Autoencoders. In the latter class, we provide a detailed analysis of the recent Nouveau VAE [58]; however, its complexity exceeds our computing resources, so we investigate a much simpler model, and an interesting variant exploiting Feature-wise Linear Modulation [44] at high scales.

One of the metrics used to compare these models is their energetic efficiency, in the spirit of the emerging paradigm known as Green AI [51], aiming to assess performance/efficiency trade-offs. Specifically, for each architecture, we provide a precise mathematical formulation, a discussion of the main ideas underlying their design, a detailed model description, a running implementation in TensorFlow 2 freely available on our GitHub repository, and quantitative results.

Structure of the Article

The article is meant to offer a self-contained introduction to the topic of Variational Autoencoders, just assuming a basic knowledge of neural networks. In the next section, we start with the theoretical background, discussing the strong and appealing probabilistic foundation of this class of generative models. In the following section, we address the way theory is translated into a vanilla neural net implementation, and introduce the many issues arising from this operation: balancing problems in the loss function, posterior collapse, aggregate posterior vs. prior mismatch, blurriness and disentanglement.

In the next three sections, we give a detailed mathematical introduction to the three classes of models for which we provide a deeper investigation, namely the Two-Stage approach, the regularized VAE and hierarchical models. After these sections, our experimental setting is described: we discuss the metrics used for the comparison, and provide a detailed description of the neural network architectures. In the penultimate section, we present the results of our experiments, along with a critical discussion. In the concluding section, we summarize the content of the article and draw a few considerations on the future of this field, and the challenges ahead.

Theoretical Background

In this section, we give a formal, theoretical introduction to Variational Autoencoders (VAEs), deriving the so called Evidence Lower Bound (ELBO) adopted as a learning objective for this class of models.

To deal with the problem of generating realistic data points \(x \in {\mathbb {R}}^d\) given a dataset \({\mathbb {D}}= \{ x^{(1)}, \dots , x^{(N)} \}\), generative models usually make the assumption that there exists a ground-truth distribution \(\mu _{GT}\) supported on a low-dimensional manifold \(\chi \subseteq {\mathbb {R}}^d\) with dimension \(k < d\), absolutely continuous with respect to the Hausdorff measure on \(\chi\) and with density \(p_{gt}(x)\). With this assumption, one can rewrite

$$\begin{aligned} p_{gt}(x) = \int _{{\mathbb {R}}^k} p_{gt}(x, z) dz = \int _{{\mathbb {R}}^k} p_{gt}(x|z)p(z) dz = {\mathbb {E}}_{p(z)} [p_{gt}(x|z)], \end{aligned}$$

where \(z \in {\mathbb {R}}^k\) is the latent variable associated with x, distributed with a simple distribution p(z) named prior distribution.

The idea behind generative models is that if we can learn a good approximation of \(p_{gt}(x|z)\) from the data, then we can use that approximation to generate new samples with ancestral sampling, that is,

  • Sample \(z \sim p(z)\).

  • Generate \(x \sim p_{gt}(x|z)\).
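The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prior is taken to be a standard normal, and `toy_decoder_mean` and `noise_std` are hypothetical stand-ins for a learned approximation of \(p_{gt}(x|z)\).

```python
import random

def toy_decoder_mean(z):
    """Hypothetical stand-in for a learned decoder mean, mapping R^2 to R^3."""
    z1, z2 = z
    return [z1 + z2, z1 - z2, 0.5 * z1]

def ancestral_sample(rng, latent_dim=2, noise_std=0.1):
    # Step 1: sample z from the prior (assumed here to be a standard normal).
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Step 2: sample x from a Gaussian centred on the decoder output.
    mean = toy_decoder_mean(z)
    x = [rng.gauss(m, noise_std) for m in mean]
    return z, x

rng = random.Random(0)
z, x = ancestral_sample(rng)
```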

For this reason, it is common to define a parametric family of probability distributions \({\mathcal {P}}_\theta = \{ p_\theta (x|z) | \theta \in {\mathbb {R}}^s \}\) with a neural network, and to find \(\theta ^*\) such that

$$\begin{aligned} \theta ^* = \arg \max _\theta {\mathbb {E}}_{{\mathbb {D}}}[\log p_\theta (x)] = \arg \max _\theta {\mathbb {E}}_{{\mathbb {D}}} \left[ \log \int _{{\mathbb {R}}^k} p_\theta (x|z) p(z) dz \right] , \end{aligned}$$

i.e. the Maximum Likelihood Estimation (MLE).

Unfortunately, (2) is usually computationally infeasible. For this reason, VAEs define another probability distribution \(q_\phi (z|x)\), named encoder distribution, which describes the relationship between a data point \(x \in \chi\) and its latent variable \(z \in {\mathbb {R}}^k\), and optimize \(\phi\) and \(\theta\) such that:

$$\begin{aligned} \theta ^*, \phi ^* = \arg \min _{\theta , \phi } {\mathbb {E}}_{{\mathbb {D}}} [D_{KL}(q_\phi (z|x) || p_\theta (z|x))], \end{aligned}$$

where \(D_{KL}(q_\phi (z|x) || p_\theta (z|x)) = {\mathbb {E}}_{q_\phi (z|x)}[\log q_\phi (z|x) - \log p_\theta (z|x) ]\) is the Kullback–Leibler divergence between \(q_\phi (z|x)\) and \(p_\theta (z|x)\).


By Bayes' rule, \(p_\theta (z|x) = p_\theta (x|z) p(z) / p_\theta (x)\), so the divergence can be expanded as

$$\begin{aligned}&D_{KL}(q_\phi (z|x) || p_\theta (z|x)) \nonumber \\&\quad = {\mathbb {E}}_{q_\phi (z|x)}[\log q_\phi (z|x) - \log p_\theta (z|x) ] \nonumber \\&\quad = {\mathbb {E}}_{q_\phi (z|x)}[\log q_\phi (z|x) - \log p_\theta (x|z) - \log p(z) + \log p_\theta (x) ] \nonumber \\&\quad = D_{KL}(q_\phi (z|x) || p(z)) - {\mathbb {E}}_{q_{\phi }(z|x)} [ \log p_\theta (x|z) ] + \log p_\theta (x). \end{aligned}$$


Rearranging the terms, we obtain

$$\begin{aligned} {\mathbb {E}}_{q_{\phi }(z|x)} [ \log p_\theta (x|z) ] - D_{KL}(q_\phi (z|x) || p(z))&= \log p_\theta (x) - D_{KL}(q_\phi (z|x) || p_\theta (z|x)) \nonumber \\&\le \log p_\theta (x), \end{aligned}$$

since \(D_{KL}(q_\phi (z|x) || p_\theta (z|x)) \ge 0\), which implies that the left-hand side of the equation above is a lower bound for the log-likelihood \(\log p_\theta (x)\). For this reason, it is usually called Evidence Lower BOund (ELBO).

Since the ELBO is more tractable than the MLE objective, it is used as the objective function for training the neural networks, optimizing both \(\theta\) and \(\phi\):

$$\begin{aligned}&{\mathcal {L}}_{\theta , \phi } (x) := {\mathbb {E}}_{q_{\phi }(z|x)} [ \log p_\theta (x|z) ] - D_{KL}(q_\phi (z|x) || p(z)) \end{aligned}$$
$$\begin{aligned}&{\mathcal {L}}_{\theta , \phi } := {\mathbb {E}}_{{\mathbb {D}}}[ {\mathcal {L}}_{\theta , \phi } (x) ]. \end{aligned}$$
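The bound can be checked numerically on a toy model in which every quantity has a closed form. The sketch below assumes a one-dimensional model with \(p(z) = G(0,1)\) and \(p(x|z) = G(z,1)\), so that \(p(x) = G(0,2)\) and the true posterior is \(G(x/2, 1/2)\); this model is our own illustrative choice, not one used in the experiments.

```python
import math

def log_evidence(x):
    # Toy model: p(z) = G(0, 1), p(x|z) = G(z, 1), hence p(x) = G(0, 2).
    return -0.5 * math.log(4.0 * math.pi) - x * x / 4.0

def elbo(x, m, s2):
    # q(z|x) = G(m, s2); both ELBO terms are available in closed form.
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (m * m + s2 - math.log(s2) - 1.0)
    return expected_log_lik - kl_to_prior

x = 1.3
# With q equal to the true posterior G(x/2, 1/2) the bound is tight.
gap_at_posterior = log_evidence(x) - elbo(x, x / 2.0, 0.5)
# With a poorer q (here the prior itself) the gap is strictly positive.
gap_elsewhere = log_evidence(x) - elbo(x, 0.0, 1.0)
```

The gap between the log-evidence and the ELBO is exactly \(D_{KL}(q_\phi (z|x) || p_\theta (z|x))\), which is why it vanishes only when the encoder matches the true posterior.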

It is worth remarking that the ELBO has a form resembling an autoencoder, where the term \(q_\phi (z|x)\) maps the input x to its latent representation z, and \(p_\theta (x|z)\) decodes z back to x. Figure 1 shows a diagram representing the basic VAE structure.

For generative sampling, we forget the encoder and just exploit the decoder, sampling the latent variables according to the prior distribution p(z) (that must be known).

Fig. 1. A diagram representing the VAE architecture. The stochastic component \(\epsilon\) in the gray diamond is sampled from G(0,I)

The Vanilla VAE and Its Problems

In this section, we explain how the theoretical form of the ELBO (Eq. 6) can be translated into a numerical loss function usable for training neural networks. This will allow us to point out some of the typical problems that affect this architecture and whose solution drove the design of the variants discussed in the sequel.

In the vanilla VAE, we assume \(q_\phi (z|x)\) to be a Gaussian (spherical) distribution \(G(\mu _\phi (x),\sigma ^2_\phi (x))\), so that learning \(q_\phi (z|x)\) amounts to learning its first two moments.

Similarly, we assume \(p_\theta (x|z)\) has a Gaussian distribution around a decoder function \(\mu _\theta (z)\). The functions \(\mu _\phi (x)\), \(\sigma ^2_\phi (x)\) and \(\mu _\theta (z)\) are modelled by deep neural networks. We remark that knowing the variance of latent variables allows sampling during training.

If the model approximating the decoder function \(\mu _\theta (z)\) is sufficiently expressive (which is the case for deep neural networks), the shape of the prior distribution p(z) does not really matter, and for simplicity it is assumed to be a normal distribution \(p(z) = G(0,I)\). The term \(D_{KL}(q_\phi (z|x)||p(z))\) is hence the KL-divergence between two Gaussian distributions \(G(\mu _\phi (x),\sigma ^2_\phi (x))\) and G(0, I) and it can be computed in closed form as

$$\begin{aligned} \begin{array}{l} D_{KL}(G(\mu _\phi (x),\sigma ^2_\phi (x)),G(0,I)) = \\ \quad \frac{1}{2} \sum _{i=1}^k \left( \mu _\phi (x)^2_i + \sigma ^2_\phi (x)_i-\log (\sigma ^2_\phi (x)_i) -1 \right) , \end{array} \end{aligned}$$

where k is the dimension of the latent space. The previous equation has an intuitive explanation as a cost function. By minimizing \(\mu _\phi (x)^2\) as x varies over the whole dataset, we are centering the latent space around the origin (i.e. the mean of the prior). The other component prevents the variance \(\sigma ^2_\phi (x)\) from dropping to zero, implicitly forcing a better coverage of the latent space.
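As a quick illustration of the closed-form expression, the following sketch evaluates it for per-dimension means and variances given as plain Python lists. The function name is ours, and the inputs stand in for the outputs of an encoder.

```python
import math

def kl_to_standard_normal(mu, var):
    """KL divergence between G(mu, diag(var)) and G(0, I), summed over latent dimensions."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

kl_matching_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted_mean = kl_to_standard_normal([1.0], [1.0])
kl_small_variance = kl_to_standard_normal([0.0], [0.01])
```

The term vanishes when the encoder output coincides with the prior, grows quadratically with the mean, and grows without bound as the variance shrinks towards zero.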

Coming to the reconstruction loss \({\mathbb {E}}_{q_\phi (z|x)} [ \log p_\theta (x|z)]\), under the Gaussian assumption, the logarithm of \(p_\theta (x|z)\) is, up to additive and multiplicative constants, the negative squared distance between x and its reconstruction \(\mu _\theta (z)\); the variance of this Gaussian distribution can be understood as a parameter balancing the relative importance of reconstruction error and KL-divergence [20].

The problem of integrating sampling with backpropagation during training is solved by the well-known reparametrization trick proposed in [38, 48]: the sample is drawn from a standard normal distribution (outside of the backpropagation flow), and this value is then scaled by \(\sigma _\phi (x)\) and shifted by \(\mu _\phi (x)\).
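A minimal sketch of the reparametrization step follows, written with plain Python lists rather than tensors. In an actual implementation \(\mu _\phi (x)\) and \(\sigma _\phi (x)\) would be encoder outputs and gradients would flow through them; here they are fixed numbers chosen for illustration.

```python
import random

def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps with eps ~ G(0, I), one eps per latent dimension."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]  # noise drawn outside the gradient path
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
samples = [reparameterize([2.0], [0.5], rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance of the samples approach 2.0 and 0.25, i.e. the moments prescribed by the chosen \(\mu\) and \(\sigma\).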

The basic model of the Vanilla VAE that we just outlined is unfortunately hindered by several known theoretical and practical challenges. In the next Sections, we give a short list of important topics which have been investigated in the literature, along with a short discussion of the main works addressing them.

The Balancing Issue

The VAE loss function is the sum of two distinct components, with somewhat contrasting effects:

$$\begin{aligned} {\mathcal {L}}_{\theta , \phi } (x) := \underbrace{{\mathbb {E}}_{q_{\phi }(z|x)} [ \log p_\theta (x|z) ]}_{\text{ log-likelihood }} - \gamma \underbrace{D_{KL}(q_\phi (z|x) || p(z))}_{\text{ KL-divergence }}. \end{aligned}$$

The log-likelihood loss is just meant to improve the quality of reconstruction, while the Kullback–Leibler component is acting as a regularizer, pushing the aggregate inference distribution \(q_\phi (z) = {\mathbb {E}}_{{\mathbb {D}}} [q_\phi (z|x)]\) towards the desired prior p(z).

Log-likelihood and KL-divergence are frequently balanced by a suitable parameter, which allows their relative weight to be tuned. The parameter is called \(\gamma\), in this context, and it is considered as a normalizing factor for the reconstruction loss.

Privileging log-likelihood will improve the quality of reconstruction, neglecting the shape of the latent space (with harmful effects on generation). Privileging KL-divergence typically results in a smoother and normalized latent space, and more disentangled features [11, 29]; this usually comes at the cost of a more noisy encoding, finally resulting in more blurriness in generated images [1].

Discovering a good balance between these components is a crucial aspect for an effective training of VAEs.

Several techniques for the calibration of \(\gamma\) have been investigated in the literature, comprising an annealed optimization schedule [8] or a policy enforcing minimum KL contribution from subsets of latent units [37]. These schemes typically require hand-tuning and, as observed in [63], they risk interfering with the principled regularization scheme that is at the core of VAEs.

An alternative possibility, investigated in [16], consists in learning the correct value for the balancing parameter during training, which also allows its automatic calibration along the training process.

In [2] it is observed that considering the objective function used in [16] to learn \(\gamma\), the optimal \(\gamma\) parameter is in fact proportional to the current reconstruction error; so learning can be replaced by a mere computation, using, e.g. a running average. This has a simple and intuitive explanation: what matters is to try to maintain a fixed balance between the two components during training: if the reconstruction error decreases, we must proportionally decrease the KL component that could otherwise prevail, preventing further improvements. The technique in [2] is simple and effective: we shall implicitly adopt it in all our VAE models, unless explicitly stated differently.
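The idea of replacing learning with a running computation can be sketched as follows. This is not the exact procedure of [2]: the exponential moving average, the `scale` and `momentum` values, and the form of `balanced_loss` (reconstruction error plus \(\gamma\) times the KL term, to be minimized) are illustrative assumptions.

```python
class RunningGamma:
    """Keeps gamma proportional to an exponential moving average of the reconstruction error.

    The proportionality constant (scale) and the momentum are illustrative assumptions.
    """

    def __init__(self, scale=1.0, momentum=0.9, initial_error=1.0):
        self.scale = scale
        self.momentum = momentum
        self.avg_error = initial_error

    def update(self, reconstruction_error):
        self.avg_error = (self.momentum * self.avg_error
                          + (1.0 - self.momentum) * reconstruction_error)
        return self.scale * self.avg_error

def balanced_loss(reconstruction_error, kl, gamma):
    # Quantity to minimize: reconstruction error plus the gamma-weighted KL term.
    return reconstruction_error + gamma * kl

balancer = RunningGamma()
# As the reconstruction error decreases, gamma follows it downwards.
gammas = [balancer.update(err) for err in [1.0, 0.5, 0.25, 0.125]]
```

With this kind of rule, the weight of the KL component shrinks together with the reconstruction error, so the ratio between the two terms stays roughly fixed during training.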

A similar technique has been recently investigated in [52], where the KL-divergence is used as a feedback during model training for dynamically tuning the balance of the two components.

Variable Collapse Phenomenon

The KL-divergence component of the VAE loss function typically induces a parsimonious use of latent variables, some of which may be altogether neglected by the decoder, possibly resulting in an under-exploitation of the network capacity; whether this is a beneficial side effect of regularization [5, 16] or an issue to be solved [10, 46, 57, 63] is still debated.

The variable collapse phenomenon has a quite intuitive explanation. If, during training, a latent variable gives a modest contribution for the reconstruction of the input (in comparison with other variables), then the Kullback–Leibler divergence may prevail, pushing the mean towards 0 and the standard deviation towards 1. This will make the latent variable even more noisy, in a vicious cycle that will eventually induce the network to completely ignore the latent variable (see Fig. 2, Left).

Fig. 2. (Left) The vicious cycle leading to the variable collapse. (Right) An empirical demonstration of the phenomenon: we apply a progressive noise to a latent variable, reducing its contribution to reconstruction; at some point, KL-divergence prevails, enlarging the sampling variance of the variable and making it even more noisy; the phenomenon has catastrophic nature, leading to a complete collapse of the variable. If we remove the artificial noise, the variable gets reactivated. Pictures borrowed from [3]

As described in [3], one can easily obtain empirical evidence of the phenomenon by adding some artificial noise to a variable and monitoring its evolution during training (Fig. 2, Right). The contribution of a latent variable to reconstruction is computed as the difference between the reconstruction loss when the variable is masked and the loss when it is normally taken into account; we call this quantity the reconstruction gain.

When the reconstruction gain of a variable becomes smaller than its KL-divergence, the variable gets ignored by the network: its corresponding mean value collapses to 0 (independently of x) and its sampling variance is pushed to 1. Sampling has no impact on the network, precisely because the variable is ignored by the decoder.

The variable collapse phenomenon is, to some extent, reversible. However, reactivating a collapsed variable is not a completely trivial operation for a network, probably due to saturation effects and vanishing gradients.
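The reconstruction gain can be illustrated on a toy example. In the sketch below the decoder is a hypothetical linear map, masking is implemented by setting the latent variable to 0, and the loss is a plain squared error; all three choices are assumptions made for the sake of the example.

```python
def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def toy_decoder(z, weights):
    """Hypothetical linear decoder: x_hat[j] = sum_i weights[i][j] * z[i]."""
    out_dim = len(weights[0])
    return [sum(z[i] * weights[i][j] for i in range(len(z))) for j in range(out_dim)]

def reconstruction_gain(x, z, weights, index):
    """Loss with latent `index` masked (set to 0, an assumption) minus the unmasked loss."""
    masked = list(z)
    masked[index] = 0.0
    return (squared_error(x, toy_decoder(masked, weights))
            - squared_error(x, toy_decoder(z, weights)))

weights = [[1.0, 0.0], [0.0, 0.01]]  # the second latent barely influences the output
z = [0.8, -1.2]
x = toy_decoder(z, weights)          # a perfectly reconstructed input, for simplicity
gains = [reconstruction_gain(x, z, weights, i) for i in range(len(z))]
```

The second variable has a gain several orders of magnitude smaller than the first one; a variable in this situation is a natural candidate for collapse as soon as its KL cost exceeds that gain.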

Aggregate Posterior vs. Expected Prior Mismatch

The crucial point of VAEs is to learn an encoder producing an aggregate posterior distribution \(q_\phi (z) = {\mathbb {E}}_{{\mathbb {D}}} [q_\phi (z|x)]\) close to the prior p(z). If this objective is not achieved, generation is doomed to fail.

Before investigating ways to check the intended behavior, let us discuss how the Kullback–Leibler divergence term in (9) acts on the distance between \(q_\phi (z)\) and p(z). So, let us average over all x (we omit the \(\phi\) subscript):

$$\begin{aligned} \begin{array}{ll} {\mathbb {E}}_{p_{gt}(x)} [D_{KL}(q(z|x) || p(z))] &{} \\ = - {\mathbb {E}}_{p_{gt}(x)} [{\mathcal {H}}(q(z|x))] + {\mathbb {E}}_{p_{gt}(x)} [{\mathcal {H}}(q(z|x),p(z))] &{} \text{ by } \text{ def. } \text{ of } \text{ KL }\\ = - {\mathbb {E}}_{p_{gt}(x)} [{\mathcal {H}}(q(z|x))] - {\mathbb {E}}_{p_{gt}(x)} [{\mathbb {E}}_{q(z|x)} [\log p(z)]] &{} \text{ by } \text{ def. } \text{ of } \text{ cross-entropy }\\ = - {\mathbb {E}}_{p_{gt}(x)} [{\mathcal {H}}(q(z|x))] - {\mathbb {E}}_{q(z)} [\log p(z)] &{} \text{ by } \text{ marginalization }\\ = - \underbrace{{\mathbb {E}}_{p_{gt}(x)}[ {\mathcal {H}}(q(z|x))]}_{\begin{array}{c} \text{ Avg. } \text{ Entropy }\\ \text{ of } q(z|x) \end{array}} + \underbrace{{\mathcal {H}}(q(z),p(z))}_{\begin{array}{c} \text{ Cross-entropy } \text{ of } \\ q(z) \text{ vs } p(z) \end{array}} &{} \text{ by } \text{ def. } \text{ of } \text{ cross-entropy } \end{array}. \end{aligned}$$

By minimizing the cross-entropy between q(z) and p(z) we are pushing one towards the other. Jointly, we try to increase the entropy of q(z|x); under the assumption that q(z|x) is Gaussian, the entropy of each component is \(\frac{1}{2}\log (2\pi e \sigma ^2)\): we are thus enlarging the (mean) variance, further improving the coverage of the latent space, essential for generative sampling.

As a simple sanity check, one should always monitor the moments of the aggregate posterior distribution q(z) during training: the mean should be 0, and the variance 1. Since collapsed variables could invalidate this computation (both mean and variance are close to 0), it is better to use an alternative rule [4]: if we look at \(q(z) = {\mathbb {E}}_{p_{gt}(x)} [q(z|x)]\) as a Gaussian Mixture Model (GMM), its variance \(\sigma _{GMM}^2\) is given by the sum of the variance of the means \({\mathbb {E}}_{p_{gt}(x)} [\mu _{\phi }(x)^2]\) and the mean of the variances \({\mathbb {E}}_{p_{gt}(x)} [\sigma _{\phi }^2(x)]\) of the components (supposing that \({\mathbb {E}}_{p_{gt}(x)} [\mu _{\phi }(x)] = 0\)):

$$\begin{aligned} \sigma _{GMM}^2 = {{\,{{\mathbb {E}}}\,}}_{p_{gt}(x)} [\mu _{\phi }(x)^2] + {{\,{{\mathbb {E}}}\,}}_{p_{gt}(x)} [\sigma _{\phi }^2(x)] = 1, \end{aligned}$$

where in this case \(\mu _\phi (x)\) and \(\sigma _\phi ^2(x)\) are the values computed by the encoder.

This is called the variance law in [4], and can be used to verify that the regularization effect of the KL-divergence is properly working.

The big problem is that, even if the first two moments of q(z) are 0 and 1, this does not imply that q(z) looks like a Normal distribution: the KL-divergence may get stuck in some local minimum, contenting itself with adjusting the first moments of the distribution.
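In practice, the variance law can be monitored with a few lines of code. The sketch below assumes that the encoder outputs for a batch are available as lists of per-sample means and variances; the function name and the sample values are ours.

```python
def variance_law_residuals(mus, variances):
    """For each latent dimension, return E[mu^2] + E[sigma^2] - 1 over a batch.

    mus and variances are lists of per-sample lists, as an encoder would produce.
    """
    n = len(mus)
    k = len(mus[0])
    residuals = []
    for i in range(k):
        mean_sq_mu = sum(m[i] ** 2 for m in mus) / n
        mean_var = sum(v[i] for v in variances) / n
        residuals.append(mean_sq_mu + mean_var - 1.0)
    return residuals

# Illustrative encoder outputs for two samples and two latent dimensions.
# The second dimension mimics a collapsed variable (mean 0, variance 1).
mus = [[0.5, 0.0], [-0.5, 0.0]]
variances = [[0.75, 1.0], [0.75, 1.0]]
residuals = variance_law_residuals(mus, variances)
```

Residuals close to zero indicate that the law holds; note that an active variable and a collapsed one both satisfy it, which is what makes this check robust to collapse.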

The potential mismatch between q(z) and the expected prior p(z) is a problematic aspect of VAEs that, as observed by many authors [4, 30, 49], could seriously compromise the whole generative framework. Attempts to solve this issue have been made either by acting on the loss function [55] or by exploiting more complex priors [7, 37, 56].

An interesting possibility, recently deployed in the Hyperspherical VAE [17], consists in replacing the Gaussian distribution with the von Mises-Fisher (vMF) distribution [24], which is a continuous distribution on the N-dimensional sphere used in directional statistics.

An orthogonal, drastic alternative consists in giving up the comfortable setting of continuous latent variables and moving to the discrete domain. This approach is at the core of the Vector Quantized VAE [59] (VQ-VAE): each latent variable is forced to occupy a position in a finitely sampled space, so that we can treat each latent variable as a k-dimensional vector in a space of dimension d. This discrete encoding is exploited during sampling, where the prior is learnt via a suitable autoregressive technique.
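To give a concrete flavour of the discretization step, the sketch below snaps a continuous encoding onto the closest entry of a finite codebook, using the squared Euclidean distance. The codebook values are hypothetical, and the training of the codebook and of the autoregressive prior is not covered.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector closest to `vector`."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # hypothetical learned entries
index, entry = quantize([0.9, 1.2], codebook)
```

After quantization, each encoding is fully described by an integer index, and it is over these indices that a discrete prior can be learnt.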

Clustering, GMM and Two-Stage

In case input data are divided into subcategories (as in the case of MNIST and Cifar10), or have macroscopic attributes like, say, a different hair color in the case of CelebA, we could naturally expect to observe this information in the latent encoding of data [62]. In other words, we could imagine the latent space to be organized in clusters, (possibly) reflecting macroscopic features of data.

As an example, Fig. 3 shows the latent encoding of MNIST digits, with a different color for each class in the range 0–9.

Fig. 3. Latent encoding of MNIST digits in a latent space of dimension 2. Digits in different categories are represented with a different color. Observe (1) the overall (rough) Gaussian-like disposition of all digits and (2) the typical organization in clusters, in contrast with the uni-modal objective of KL-regularization

We can clearly observe that different digits naturally organize themselves in separate clusters. While the overall distribution still has a Gaussian-like shape, the presence of clusters may obviously contrast with the required smoothness of the internal encoding, introducing regions with higher/lower probability densities. Observe, e.g. the gaps between some of the clusters: sampling in such a region will eventually result in a poor generative output. In other words, clustering could be one of the main sources of the mismatch between the prior and the aggregate posterior.

While the phenomenon is evident in a low-dimensional setting, it is more difficult to observe and verify in higher dimensions. Remember that one of the VAE assumptions is that, as long as the decoder is sufficiently expressive, the prior does not really matter, since the decoder will be able to turn any distribution into the desired one [20].

Still, it makes sense to try to exploit clustering, and a natural approach consists in using a GMM. Several works have explored this direction. The simplest approach, followed in [39], is to superimpose a GMM of fixed dimension on the latent space via ex-post estimation using standard machine learning techniques (this is also the approach we shall follow in some of our tests). Alternatively, the GMM can be learned. In the Variational Deep Embedding approach [62] (VaDE), that essentially provides an unsupervised clustering model, the relevant statistics of the GMM are estimated via Maximum Likelihood Estimation, in a way similar to the Vanilla case (see also [19] for a similar, slightly more sophisticated approach).
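Once a mixture has been estimated on the latent encodings, generation samples z from the mixture instead of the standard normal prior. The sketch below covers only this sampling step, for a diagonal-covariance mixture; the mixture parameters are hypothetical placeholders for the result of an ex-post fit, which is not shown.

```python
import random

def sample_gmm(weights, means, stds, rng):
    """Draw one latent vector from a diagonal-covariance Gaussian mixture."""
    # Pick a component according to the mixture weights, then sample from it.
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    z = [rng.gauss(m, s) for m, s in zip(means[component], stds[component])]
    return component, z

# Hypothetical mixture parameters, standing in for an ex-post fit on encoded data.
weights = [0.5, 0.5]
means = [[-2.0, 0.0], [2.0, 0.0]]
stds = [[0.3, 0.3], [0.3, 0.3]]
rng = random.Random(0)
draws = [sample_gmm(weights, means, stds, rng) for _ in range(1000)]
```

Each sampled z would then be passed to the decoder; since samples concentrate around the cluster centres, the low-density gaps between clusters are rarely visited.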

In the so-called Two-Stage model [16] a second VAE is trained to learn an accurate approximation of q(z); samples from a Normal distribution are first used to generate samples of q(z), which are then passed to the actual generator of data points. We shall give an extensive discussion of the Two-Stage approach in “Two-Stage VAE”.

In [26], it is proposed to give an ex-post estimation of q(z), e.g. imposing a distribution of sufficient complexity (they consider a combination of 10 Gaussians, reflecting the ten categories of MNIST and Cifar10). A suitable regularization technique alternative to KL is used to induce the desired smoothness of the latent space. A deeper analysis of this approach is given in “Regularized VAE (RAE)”.

An additional and interesting issue of the Two-Stage model concerns the similarity measure to use as a loss function in the second stage. In [16], the traditional mean squared error and categorical cross entropy are considered. However, we discovered that cosine distance works remarkably better. We did not get to cosine distance by trial and error, but by a long and deep investigation on latent representations. These results will be the object of a forthcoming article.


Blurriness

Variational Autoencoders, in comparison with alternative generative techniques, usually produce images with a characteristic and annoying blurriness. The phenomenon can also be observed in terms of the mean variance of pixels in generated images, which is significantly lower than that for data in the training set [6].

The source of the problem is not easy to identify, but it is likely due to averaging, implicitly underlying the VAE framework (and, more generally, the whole autoencoder approach). In the presence of multimodal output, a log-likelihood objective typically results in averaging and hence blurriness [27].

Variational Autoencoders are intrinsically multimodal, both due to dimensionality reduction, and to the sampling process during training.

Several attempts have been made to solve the issue by acting on the reconstruction metric. Structural similarity (frequently used for deblurring purposes) does not seem to be effective [21]. Better results can be obtained by considering deep hidden features extracted from a pretrained image classification model, such as VGG19 [31]. In models of the VAE-GAN family [41, 50, 64], the reconstruction loss is altogether replaced by a discriminator trying to distinguish real images from generated ones. The use of a discriminator, assessing the quality of generated data and acting on the density of the prior, is also a basic component of the recent VAEPP model (VAEs with a pullback prior) [14].

The most promising results, however, come from iterative/hierarchical approaches [22, 28, 58]. In these architectures, following the idea of latent Gaussian models [35], the vector of latent variables z is split into L groups of latent variables \(z_l , l = 1,...,L\) and the density over the variable of interest is constructed sequentially, in terms of latent variables of lower indices. For instance, the prior p(z) would be written as an autoregressive density of the following kind:

$$\begin{aligned} p(z) = \prod _{l=1}^L p_l(z_l|z_{<l}). \end{aligned}$$

Similarly, the inference probability would be decomposed as

$$\begin{aligned} q_\phi (z|x) = \prod _{l=1}^L q_\phi ^{(l)}(z_l|x,z_{<l}), \end{aligned}$$

where \(q_\phi ^{(l)}(z_l|x,z_{<l})\) is the encoder density of the lth group. Suitable (iterative) neural network modules are used to sequentially compute the relevant statistics of these distributions, in terms of previous outputs.
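The sequential construction of the prior can be illustrated with a toy sampler that draws the groups in order, each one conditioned on the groups already sampled. The conditional used here (a unit-variance Gaussian whose mean is a scaled average of the earlier groups) is purely a placeholder assumption; in real hierarchical models these statistics are computed by neural network modules.

```python
import random

def sample_hierarchical_prior(num_groups, group_dim, rng, coupling=0.5):
    """Sample z_1, ..., z_L in order; group l is Gaussian around a function of z_{<l}.

    The conditional mean (coupling * average of earlier groups) is a toy assumption.
    """
    groups = []
    for _ in range(num_groups):
        if groups:
            mean = [coupling * sum(g[d] for g in groups) / len(groups)
                    for d in range(group_dim)]
        else:
            mean = [0.0] * group_dim  # the first group has an unconditional prior
        groups.append([rng.gauss(m, 1.0) for m in mean])
    return groups

rng = random.Random(0)
z_groups = sample_hierarchical_prior(num_groups=3, group_dim=2, rng=rng)
```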

As an example of these architectures, the structure of NVAE will be detailed in “NVAE”.

The advantage of this approach is that it usually allows working with a larger number of latent variables, responsible for small and progressive adjustments of generated samples.


Disentanglement

Besides the task of generating new images, it was noticed in [11, 29] that VAEs can also be used to learn an efficient representation of the data, with important applications in transfer learning and classification.

To understand this phenomenon, suppose that there exists a set of true generative factors \(v = (v_1, \dots , v_S) \in {\mathbb {R}}^S\) such that \(p_{gt}(v|x) = \prod _{i=1}^S p_{gt}(v_i | x)\) (i.e. v are conditionally independent given x) and that each \(v_i\) encodes a meaningful feature of the data point x generated by it. Under the assumption that \(k \ge S\), the latent variables \(z = (z_1, \dots , z_k)\) learnt during the training are a redundant representation of v in a basis where the features are not disentangled. To learn an optimal latent representation of the input image x, it is necessary to train the network in such a way that S coordinates of z are related to v, while the other \(k - S\) coordinates can be used to improve the reconstruction of x, recovering the high frequency components that are missing in v.

In \(\beta\)-VAE [11, 29], this constraint is imposed by noting that in the ELBO function the prior distribution \(p(z) = G(0, I)\) forces the encoder \(q_\phi (z|x)\) to learn a vector z whose components are independent of each other. To improve disentanglement, we should hence induce the \(D_{KL}\) term to be as small as possible, which can be achieved by increasing the decoder variance \(\gamma\) to a value greater than 1. Unfortunately, since

$$\begin{aligned} {\mathbb {E}}_{p_{gt}(x)} [ D_{KL} (q_\phi (z|x) || p(z)) ] = D_{KL} (q_\phi (z) || p(z)) + I_{q_\phi } (X; Z), \end{aligned}$$

where \(I_{q_\phi }(X;Z)\) is the mutual information between X and Z with respect to the joint distribution \(q_\phi (x, z) = q_\phi (z|x)p_{gt}(x)\), by pushing \(D_{KL}(q_\phi (z|x)||p(z))\) to zero, the mutual information between X and Z is also minimized, reducing the reconstruction efficiency of the network. This problem is addressed in [23, 43], where the ELBO is modified by adding more parameters with the intent of improving disentanglement without losing too much reconstruction performance.
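The decomposition of the averaged KL term can be verified numerically in a small discrete setting, where all sums are finite. The distributions below are arbitrary illustrative values; the mutual information is computed as \({\mathbb {E}}_{p_{gt}(x)}[D_{KL}(q(z|x) || q(z))]\).

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p_x = [0.3, 0.7]                        # data distribution over two inputs
q_z_given_x = [[0.9, 0.1], [0.2, 0.8]]  # encoder rows q(z|x)
prior = [0.5, 0.5]                      # p(z)

# Aggregate posterior q(z) = sum_x p(x) q(z|x).
q_z = [sum(p_x[i] * q_z_given_x[i][j] for i in range(2)) for j in range(2)]

avg_kl = sum(p_x[i] * kl(q_z_given_x[i], prior) for i in range(2))
# Mutual information I(X;Z) = E_x[ KL(q(z|x) || q(z)) ].
mutual_info = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))
lhs, rhs = avg_kl, kl(q_z, prior) + mutual_info
```

The two sides agree up to floating-point error, and since both terms on the right are non-negative, driving the averaged KL to zero necessarily drives the mutual information to zero as well.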

Two-Stage VAE

To address the mismatch of aggregate posterior versus the expected prior, Bin Dai and David Wipf in [16] introduced the Two-Stage VAEs.

The idea behind this model is to train two different VAEs sequentially. The first VAE is used to learn a good representation \(q_\phi (z|x)\) of the data in the latent space without guaranteeing exactly \(q(z) = p(z)\), whereas the second VAE should learn to sample from the true q(z) without using the prior distribution p(z). A scheme of the implementation follows (a detailed architectural description is given in “Architecture Overview”):

  • Given a data set \({\mathbb {D}}= \{ x^{(1)}, \dots , x^{(N)} \}\), train a VAE with a fixed latent dimension k, possibly small.

  • Generate latent samples \({\mathcal {Z}} = \{ z^{(1)}, \dots , z^{(N)} \}\) via \(z^{(i)} \sim q_\phi (z|x^{(i)}), \ i=1, \dots N\). By design, these samples are distributed as \(q_\phi (z) = {\mathbb {E}}_{p_{gt}(x)} [q_\phi (z|x)]\), but likely not as \(p(z) = G(0, I)\).

  • Train a second VAE with parameters \((\theta ', \phi ')\) and latent variable \(u \sim p(u) = G(0, I)\) of dimension k to learn the distribution \(q_\phi (z)\) with \({\mathcal {Z}}\) as the dataset.

  • Sample new images by ancestral sampling, i.e. by first sampling \(u \sim p(u)\), then generate a z value by \(p_{\theta '} (z|u)\) and finally \(x \sim p_\theta (x|z)\).
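The last step of the scheme can be sketched as follows. This is only an illustration of ancestral sampling: the two decoder means are replaced by fixed random maps (`W1`, `W2`), and the dimensions `k`, `d` and the noise levels are hypothetical choices of ours, not the trained networks of [16].

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 32  # hypothetical latent and data dimensions

# Stand-ins for the trained decoder means (assumptions for illustration only).
W2 = rng.normal(size=(k, k))  # second-stage decoder mean: u -> z
W1 = rng.normal(size=(k, d))  # first-stage decoder mean:  z -> x

def second_stage_decoder_mean(u):
    return u @ W2

def first_stage_decoder_mean(z):
    return np.tanh(z @ W1)

def two_stage_sample(n, noise_z=0.0, noise_x=0.0):
    """Ancestral sampling: u ~ p(u), then z ~ p_theta'(z|u), then x ~ p_theta(x|z)."""
    u = rng.standard_normal((n, k))
    z = second_stage_decoder_mean(u) + noise_z * rng.standard_normal((n, k))
    x = first_stage_decoder_mean(z) + noise_x * rng.standard_normal((n, d))
    return x

samples = two_stage_sample(5)
```

Note that the prior \(p(z)\) of the first VAE never appears: the latent codes z are produced by the second-stage decoder only.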

The theoretical foundation of the Two-Stage VAE algorithm is well presented in [16]. We summarize here the main results. The two VAEs aim at separating the components of the ELBO loss function (9), by suitably using the decoder variance \(\gamma\). Remarking that \(p_{gt}(x)\) is the unknown data distribution which we desire to learn and that \(p_\theta (x)={\mathbb {E}}_{q_\phi (z)} [p_\theta (x|z)]\) is the learnt distribution, we hope that \(p_\theta (x) \approx p_{gt}(x) \> \forall x\).

Unfortunately, this is not always possible. In fact, there is a critical distinction between the cases where the dimension of the data d and the latent space dimension k are equal, and the case where \(d > k\).

Indeed, in the first case, it is possible to prove that, under suitable assumptions, for the optimal choice of the parameters \((\theta ^*, \phi ^*)\) it holds that \(p_{\theta ^*} (x) = p_{gt}(x)\) almost everywhere (i.e. VAEs strongly converge to the true distribution \(p_{gt}(x)\)). In the second case, only weak convergence, in the sense that \(\int _A p_{\theta ^*} (x) dx = \int _A p_{gt}(x) dx\) where A is an open subset of \({\mathbb {R}}^d\), can be proved (see Theorems 1 and 2 in [16]).

In the first stage, since the ambient dimension is obviously greater than the latent space dimension (i.e. \(d>k\)), by the previous results only weak convergence is guaranteed; the parameter \(\gamma\) is chosen in this case to get a good reconstruction (Theorem 4 in [16]). In the second stage, by construction, the data variable z and its corresponding latent variable u have the same dimension, hence the unknown distribution \(q_\phi (z)\) is exactly identified by the VAE. As a consequence, it is possible to sample directly from \(q_\phi (z)\), without using the prior p(z), thus bypassing the problem of mismatch between the aggregate posterior and the prior distributions.

Regularized VAE (RAE)

One of the most interesting variations of vanilla VAE is the work of Partha Ghosh and Mehdi S. M. Sajjadi [26], where the authors tried to solve all the problems related to the classical VAE by completely changing the way of approaching the problem. They pointed out that, in their typical implementation, VAEs can be seen as regularized Autoencoders with additive Gaussian noise on the decoder input.

In their work, the authors argued that noise injection in the decoder's input can be seen as a form of regularization, since it implicitly helps to smooth the function learnt by the network.

To get a new insight into this problem, they took into consideration the distinct components of the ELBO already introduced in (9):

$$\begin{aligned} {\mathcal {L}}_{\theta , \phi } (x) := \underbrace{{\mathbb {E}}_{q_{\phi }(z|x)} [ \log p_\theta (x|z) ]}_{:= {\mathcal {L}}_{REC}(\theta , \phi )} - \gamma \underbrace{D_{KL}(q_\phi (z|x) || p(z))}_{:= {\mathcal {L}}_{KL}(\phi )}, \end{aligned}$$

where \({\mathcal {L}}_{REC}\) is a term that measures the distance between the input and the reconstruction, whereas \({\mathcal {L}}_{KL}\) is a regularization term that enforces the aggregate posterior to follow the prior distribution.

To show how \({\mathcal {L}}_{KL} ( \phi )\) regularizes the loss, Constant-Variance VAEs (CV-VAEs) have been investigated in [26], where the encoder variance \(\sigma ^2_\phi (x)\) is fixed for every \(x \in {\mathbb {D}}\) and thus treated as a hyperparameter \(\sigma ^2\). In this situation,

$$\begin{aligned}&{\mathcal {L}}_{REC}(\theta , \phi ) = - {\mathbb {E}}_{q_\phi (z|x)} \left[ \frac{1}{2} || x - \mu _\theta (z) ||_2^2 \right] \end{aligned}$$
$$\begin{aligned}&{\mathcal {L}}_{KL}(\phi ) = D_{KL}(q_\phi (z|x) || p(z)) = || \mu _\phi (x) ||_2^2 + C \end{aligned}$$
$$\begin{aligned}&{\mathcal {L}}_{\theta , \phi }(x) = - {\mathbb {E}}_{p_{gt}} \left[ {\mathbb {E}}_{q_\phi (z|x)} \left[ \frac{1}{2} || x - \mu _\theta (z) ||_2^2 \right] + \gamma || \mu _\phi (x) ||_2^2 \right] . \end{aligned}$$

We observe that the expression in (17) is a Mean Squared Error (MSE) with \(L_2\) regularization on \(\mu _\phi (x)\).

The authors proposed to replace noise injection in the decoder input with an explicit regularization scheme in a classical CV-VAE. This is done by modifying the cost function to \({\mathcal {L}}_{\theta , \phi } = {\mathbb {E}}_{p_{gt}(x)} [ {\mathcal {L}}_{REC}(\theta , \phi ) - \gamma {\mathcal {L}}_{KL} (\phi ) - \lambda {\mathcal {L}}_{REG}(\theta ) ]\), where \({\mathcal {L}}_{REG}(\theta )\) is a regularizer for the decoder weights, while \(\gamma , \lambda \ge 0\) are regularization parameters.

Whereas \({\mathcal {L}}_{REC} (\theta , \phi ) = - {\mathbb {E}}_{q_\phi (z|x)} [ \frac{1}{2} || x - \mu _\theta (z) ||_2^2 ]\) and \({\mathcal {L}}_{KL}(\phi ) = \frac{1}{2}|| z ||_2^2\) are fixed a priori by the CV-VAE architecture, \({\mathcal {L}}_{REG}(\theta )\) needs to be defined. The choice for \({\mathcal {L}}_{REG}(\theta )\) identifies the specific kind of network. Ghosh and Sajjadi proposed three possible choices for \({\mathcal {L}}_{REG}(\theta )\):

  • \(L_2\)-Regularization, where \({\mathcal {L}}_{REG}(\theta ) = || \theta ||_2^2\) is simply the weight decay on the decoder parameters.

  • Gradient penalty, where \({\mathcal {L}}_{REG}(\theta ) = || \nabla \mu _\theta (\mu _\phi (x))||_2^2\) bounds the gradient norm of the decoder with respect to its input, enforcing smoothness.

  • Spectral normalization, where each weight matrix \(\theta _l\) in the decoder is normalized by an estimate of its largest singular value: \(\theta _l^{SN} = \frac{\theta _l}{s(\theta _l)}\) (the estimate \(s(\theta _l)\) can be easily obtained with one iteration of the power method).
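As an illustration of the first choice, the quantity to be minimized (the negative of the cost function above) can be written in a few lines of NumPy. This is a sketch under our own assumptions: the function name, the argument layout and the default values of `gamma` and `lam` are hypothetical and are not taken from [26].

```python
import numpy as np

def rae_l2_loss(x, x_rec, z, decoder_weights, gamma=1e-4, lam=1e-7):
    """Negative RAE objective with L2 regularization (to be minimized).

    x, x_rec: (N, d) inputs and deterministic reconstructions mu_theta(z).
    z: (N, k) deterministic latent codes mu_phi(x).
    decoder_weights: list of arrays holding the decoder parameters theta.
    """
    rec = 0.5 * np.sum((x - x_rec) ** 2, axis=-1)        # -L_REC
    lat = 0.5 * np.sum(z ** 2, axis=-1)                  # L_KL reduced to ||z||^2 / 2
    reg = sum(np.sum(w ** 2) for w in decoder_weights)   # L_REG = ||theta||_2^2
    return np.mean(rec + gamma * lat) + lam * reg
```

The gradient penalty variant would only change the `reg` term, replacing the weight norm with the squared norm of the decoder gradient with respect to z, which requires automatic differentiation and is therefore not shown here.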

Moreover, they argued that removing noise injection from the decoder input prevents us from knowing the distribution of the latent variables, thus losing the generative ability of the network. They solved this problem by proposing an ex-post density estimation, where the distribution of the latent variables is learned a posteriori, by fitting \({\mathcal {Z}} = \{ z^{(i)}; z^{(i)} = \mu _\phi (x^{(i)}) \}\) with a GMM \(q_\delta (z)\) with a fixed number of components and then sampling z from \(q_\delta (z)\) to generate new samples from \(p_\theta (x|z)\). The generative model defined in this way is called Regularized Autoencoder (RAE).
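The ex-post density estimation step can be sketched with a small EM routine. For brevity we assume diagonal covariances, which is our own simplification; the function names are hypothetical, and in practice a library implementation of Gaussian mixtures would normally be preferred.

```python
import numpy as np

def fit_diag_gmm(Z, n_components=10, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to latent codes Z (N x k) with plain EM."""
    rng = np.random.default_rng(seed)
    n, k = Z.shape
    means = Z[rng.choice(n, n_components, replace=False)].copy()
    variances = np.tile(Z.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities, normalized in log space for stability.
        diff = Z[:, None, :] - means[None, :, :]
        log_p = -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)
        log_r = np.log(weights) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means and variances.
        nk = r.sum(axis=0) + 1e-12
        weights = nk / n
        means = (r.T @ Z) / nk[:, None]
        diff = Z[:, None, :] - means[None, :, :]
        variances = np.einsum("nc,nck->ck", r, diff ** 2) / nk[:, None] + 1e-6
    return weights, means, variances

def sample_diag_gmm(weights, means, variances, n_samples, seed=0):
    """Draw latent codes z ~ q_delta(z) from the fitted mixture."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n_samples, p=weights / weights.sum())
    noise = rng.standard_normal((n_samples, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise
```

The sampled codes would then be passed through the trained decoder to obtain new images; no retraining of the autoencoder is involved.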

Hierarchical Variational Autoencoder

To improve the quality of the generation in Variational Autoencoders, Kingma et al. [37] strengthened the inference network \(q_\phi (z|x)\) with the powerful Normalizing Flows introduced by Rezende and Mohamed [47]. The idea of Normalizing Flows is to begin with a latent variable \(z_0\) sampled from a simple distribution \(q_\phi (z_0 | x)\), and to iteratively construct more complex variables by applying transformations \(z_t = f_t(z_{t-1})\) for \(t = 1, \dots , T\). By observing that the \(D_{KL}\) expression is

$$\begin{aligned} D_{KL}(q_\phi (z_T|z_{<T}, x) || p(z_T)) = {\mathbb {E}}_{q_\phi (z_T|z_{<T}, x)} \left[ \log q_\phi (z_T|z_{<T}, x) - \log p(z_T) \right] , \end{aligned}$$

its implementation requires the computation of the logarithm of \(q_\phi (z_T| z_{<T}, x)\). If the functions \(f_t(\cdot )\) are simple enough, it is possible to efficiently use them to compute \(\log q_\phi (z_T| z_{<T}, x)\) as:

$$\begin{aligned} \log q_\phi (z_T| z_{<T}, x) = \log q_\phi (z_0 | x) - \sum _{t=1}^T \log \det \Bigl | \frac{\partial f_t}{\partial z_{t-1}} \Bigr |, \end{aligned}$$

where \(\frac{\partial f_t}{\partial z_{t-1}}\) is the Jacobian matrix of \(f_t(z_{t-1})\) computed by repeatedly applying the well-known change of variable theorem to the multi-variate random variable \(z_T\) defined as

$$\begin{aligned} z_T = f_T(f_{T-1}(\dots (f_1(z_0)) \dots )). \end{aligned}$$

An interesting aspect of Normalizing Flows is that, under suitable assumptions, they are provably universal, in the sense defined in [32]. As already mentioned, the first successful integration of Normalizing Flows in VAEs was by Kingma et al. [37], who introduced Inverse Autoregressive Flows (IAF). The idea was to define \(f_t(z_{t-1})\) as a simple affine function of the form:

$$\begin{aligned} z_t = f_t(z_{t-1}) = \mu _t + \sigma _t \odot z_{t-1} \quad \forall t = 1, \dots , T, \end{aligned}$$

where \(z_0 \sim q_\phi (z_0 | x) = G(\mu _\phi (x), \sigma _\phi ^2(x))\).

Figure 4 schematically represents the unrolling of Eq. (21).

Fig. 4 A scheme of Inverse Autoregressive Flow. Each white box represents one iteration of Eq. (21), where \(\mu _t, \sigma _t^2\) are generated by the encoder \(q_\phi (z_t | x)\)
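The bookkeeping of the log-density along an affine flow can be sketched as follows. As a simplifying assumption, \(\mu _t\) and \(\sigma _t\) are fixed arrays here, whereas in an actual IAF they are produced by autoregressive networks; the function names are ours.

```python
import numpy as np

def gaussian_log_density(z, mu, sigma):
    """Log-density of a diagonal Gaussian G(mu, sigma^2), summed over dimensions."""
    return np.sum(
        -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi),
        axis=-1,
    )

def affine_flow(z0, log_q0, mus, sigmas):
    """Apply z_t = mu_t + sigma_t * z_{t-1} and track log q(z_T | x).

    Each step has a diagonal Jacobian, so its log-determinant is sum(log sigma_t).
    """
    z, log_q = z0, log_q0
    for mu_t, sigma_t in zip(mus, sigmas):
        z = mu_t + sigma_t * z
        log_q = log_q - np.sum(np.log(sigma_t), axis=-1)
    return z, log_q
```

Since every step only subtracts \(\sum \log \sigma _t\), the cost of evaluating \(\log q_\phi (z_T | z_{<T}, x)\) grows linearly with the number of steps T.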

We highlight that the IAF introduces a natural order in the latent variables. For this reason, we will refer to models of this kind as Hierarchical Variational Autoencoders (HVAE). In this paradigm, we will refer to each \(z_t\) as a group of latent variables, and we will collect the set of all groups in a vector \(z = (z_0, \dots , z_T)\) where the variables are written in the order defined above.

If we distinguish between the encoder (inference) network \(q_\phi (z|x)\) and the decoder (generative) network, we need to choose if the ordering of latent variables is the same in the two parts of the network (bottom-up inference), or if it is reversed (bidirectional inference) as shown in Fig. 5.

Fig. 5 Diagrams that schematically represent the Hierarchical VAE in two different configurations: bottom-up inference (a) and bidirectional inference (b)

As is clear from Fig. 5, in bottom-up inference the image \(x \in {\mathbb {R}}^d\) is encoded to \(z = (z_1, \dots , z_T)\) independently of the prior \(p(z) = \prod _{t=1}^T p(z_t|z_{<t})\); in the generative phase the image is reconstructed by taking \(z_T\) as the final output of the encoder, and then sampling each \(z_t\), \(t = T-1, \dots , 0\) from the prior distribution independently of \(q_\phi (z_t|x)\) (i.e. the encoder and decoder are independent of each other). We underline that this fact makes the training of bottom-up inference unstable.

Conversely, in bidirectional inference, the process of generating latent variables is shared between the two parts of the network, which makes the training easier, although the design of the network is a bit more difficult.

Since the results of vanilla IAF are not competitive with the state of the art, we will not use them in our future analysis (see the original paper for more information), whereas we will focus our experimental results on two powerful variants of IAF, making use of bidirectional inference and residual blocks to generate high-quality images.

Experimental Setting

For each variant of Variational Autoencoder discussed in the previous sections, we provide an original implementation in TensorFlow 2, and a set of detailed benchmarks on traditional datasets, such as MNIST, Cifar10 and CelebA. The specific architectures which have been tested are described in the following. All models have been compared using standard metrics, assessing both their energy consumption through the number of floating point operations (see “Green AI and FLOPS”), and their performance via the so-called Fréchet Inception Distance [42], briefly discussed in “Fréchet Inception Distance”. Numerical results are given in “Numerical Results”, along with examples of reconstructed and generated images.

Green AI and FLOPS

The paradigm of Green AI [51] is meant to draw attention to the computational efficiency of neural models, encouraging a reduction in the amount of resources required for their training and deployment. This concept is not as trivial as it seems; in fact, most traditional AI research (referred to as Red AI in this context) targets accuracy rather than efficiency, exploiting massive computational power and resulting in rapidly escalating costs. This trend is not sustainable for various reasons: it is environmentally unfriendly [40], socially not inclusive, and inefficient.

The computation of floating point operations (FLOPS) was advocated in [51] as a measure of the efficiency of models; the main advantages of this measure are that it is hardware independent and has a direct (even if not precise) correlation with the running time of the model [13]. There are also known problems related to FLOPs, mostly related to the fact that memory access time can be a more dominant factor in real implementations (see the “Trap of FLOPs” discussion in [33]).

So, while we shall adopt FLOPs for our comparison, we shall also investigate performance through more traditional tools, like TensorBoard, also to gain confidence in the reliability of FLOPs-based assessments.
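To give an idea of what a FLOPs estimate looks like for the most expensive layers, the following sketch counts the operations of a convolutional and of a dense layer. It assumes the common convention of two operations per multiply-add and ignores biases and activations; it is an illustration of ours, not the library used for the measurements reported in this article.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, kernel):
    """FLOPs of a dense 2D convolution: one multiply-add per kernel tap,
    input channel, output channel and output position (counted as 2 FLOPs)."""
    return 2 * h_out * w_out * c_out * kernel * kernel * c_in

def dense_flops(n_in, n_out):
    """FLOPs of a fully connected layer (2 FLOPs per multiply-add)."""
    return 2 * n_in * n_out

# Hypothetical layer: 4x4 kernel, 3 input channels, 128 output channels, 16x16 output.
example = conv2d_flops(16, 16, 3, 128, kernel=4)
```

Such counts depend only on the layer shapes, which is why they are hardware independent, and also why they say nothing about memory access time.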

Fréchet Inception Distance

To test the quality of the generator, we should compare the probability distribution of generated vs. real images. Unfortunately, the dimension of the feature space is typically too large to allow a direct, significant comparison; moreover, in the case of images, adjacent pixels are highly correlated, reducing their statistical relevance. The main idea behind the so-called Fréchet Inception Distance (FID) [42] is to use, instead of raw data, their internal representations generated by some third-party, agnostic network. In the case of FID, the Inception v3 network [54] trained on Imagenet is used for this purpose; Inception is usually preferred over other models due to the limited amount of preprocessing performed on input images (images are rescaled to the interval [-1, 1], sample-wise). The activations that are traditionally used are those of the last pooling layer, with a dimension of 2048 features.

Given the activations \(a_1\) and \(a_2\), relative to real and generated images, and denoting by \(\mu _i\) and \(C_i\), \(i=1,2\), their empirical means and covariance matrices, respectively, the Fréchet Distance between \(a_1\) and \(a_2\) is defined as

$$\begin{aligned} FID(a_1,a_2) = ||\mu _1 - \mu _2||^2 + Tr(C_1 + C_2 - 2(C_1 C_2)^{\frac{1}{2}}), \end{aligned}$$

where we indicate with Tr the trace of a matrix.
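The definition above translates almost literally into NumPy. In this sketch of ours, the trace of the matrix square root is obtained from the eigenvalues of \(C_1 C_2\), which avoids an explicit matrix square root; the activations are assumed to be given as arrays with one row per image, and the extraction of Inception features is not shown.

```python
import numpy as np

def frechet_distance(a1, a2):
    """Frechet distance between two sets of activations, each of shape (N, features)."""
    mu1, mu2 = a1.mean(axis=0), a2.mean(axis=0)
    c1 = np.cov(a1, rowvar=False)
    c2 = np.cov(a2, rowvar=False)
    # Tr((C1 C2)^{1/2}) is the sum of the square roots of the eigenvalues of C1 C2,
    # which are real and non-negative for positive semi-definite C1 and C2.
    eigvals = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)
```

Clipping the eigenvalues at zero only guards against small negative values caused by round-off.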

A problem of FID is that it is extremely sensitive to a number of different factors, listed below.

  1. The weights of the Inception network. The checkpoint that is traditionally used is inception_v3_2016_08_28/inception_v3.ckpt, downloaded from TF-Slim’s pre-trained models and also available through TensorFlow Hub. These weights were originally obtained by training on the ILSVRC-2012-CLS dataset for image classification (“Imagenet”).

  2. The dimension of the datasets of real/generated images to be compared. Traditionally, sets of 10 K images are compared; typically, the FID score decreases as this dimension grows.

  3. The dimension of input images fed to Inception. Inception may work with images of arbitrary size (larger than \(75\times 75\)); however, the “canonical” input dimension is \(299 \times 299\). Again, varying the dimension may result in very different scores.

  4. The resizing algorithm. Images must be resized to bring them to the expected input dimension of \(299 \times 299\); as observed in [2], the FID score is quite sensitive to the algorithm used for this purpose, and in particular to the interpolation mode employed: nearest neighbour, bilinear interpolation, cubic interpolation, etc. The default is usually bilinear interpolation, being a good compromise between efficiency and quality.

Unfortunately, articles in the literature are not always fully transparent on the previous points, which may explain some discrepancies and the difficulty one frequently faces in replicating results.

All our experiments have been conducted with “default” values: the standard Inception checkpoint inception_v3_2016_08_28/inception_v3.ckpt, 10000 images of dimension \(299\times 299\), rescaled by means of bilinear interpolation.

Let us finally observe that, in the case of VAE, it is customary to measure both the FID score for reconstructed images (\(FID_{\text {rec}}\)) and the FID score for generated images (\(FID_{\text {gen}}\)). The former is usually regarded as a lower bound for the latter, no matter what help we may provide to the generator during the sampling phase.

Architecture Overview

In this section, we provide detailed descriptions of the different neural network architectures we have been dealing with, each one inspired by a different article. For each of them, different possible configurations have been investigated, varying the number and dimension of layers, as well as the learning objectives. Moreover, since some of the techniques considered do not depend on the encoder/decoder structure, we also tested a mix of different architectures, hyperparameter configurations, and optimization objectives.

Vanilla Convolutional VAE

In our first experiment, we followed the same structure as [26], which is a simple CNN architecture where we double the number of channels at each convolution and down-sample the spatial dimension by 2 (see Fig. 6).

The encoder is structured as follows. In the first layer, the input image of dimension (d, d, 3) (where \(d = 32\) for CIFAR10 and \(d = 64\) for CelebA) is passed through a convolutional layer with 128 channels and stride equal to 2, to obtain 128 feature maps of dimension (d/2, d/2). This operation is repeated for 256, 512 and 1024 channels. The result is flattened and passed through two Dense layers to obtain the mean and the variance of the latent variables.

The decoder has the same structure as the encoder, with Transposed convolutions and Upsample layers.

Each convolutional filter has kernel size 4 and ReLU activation function, except for the last layer of the decoder, where we used a sigmoid activation to ensure that the output is in the range [0, 1].
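The progression of feature map sizes in the encoder can be checked with a few lines of Python. The sketch assumes 'same' padding, so that each stride-2 convolution exactly halves the spatial size; the padding mode is our own assumption, since it is not stated above.

```python
def encoder_feature_shapes(d, channels=(128, 256, 512, 1024), stride=2):
    """Shape (height, width, channels) after each stride-2 convolution,
    assuming 'same' padding (output size = ceil(input size / stride))."""
    shapes = []
    size = d
    for c in channels:
        size = -(-size // stride)  # ceiling division
        shapes.append((size, size, c))
    return shapes

cifar_shapes = encoder_feature_shapes(32)   # CIFAR10, d = 32
celeba_shapes = encoder_feature_shapes(64)  # CelebA, d = 64

# Number of features fed to the two Dense layers after flattening.
flat_cifar = cifar_shapes[-1][0] * cifar_shapes[-1][1] * cifar_shapes[-1][2]
flat_celeba = celeba_shapes[-1][0] * celeba_shapes[-1][1] * celeba_shapes[-1][2]
```

Under this assumption, the last feature map is \(2 \times 2 \times 1024\) for CIFAR10 and \(4 \times 4 \times 1024\) for CelebA.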

Fig. 6 Graphical representation of the Vanilla VAE architecture. The yellow, orange and green boxes represent convolutional, downsampling and dense layers, respectively


The Resnet-like architecture was adopted in [16]. The main difference of this network with respect to the Vanilla VAE is that, before downsampling, the input is processed by a so-called Scale Block, which is just a sequence of Residual Blocks. In turn, a Residual Block is an alternating sequence of BatchNormalization and Convolutional layers (with unit stride), intertwined with residual connections. The number of Scale Blocks at each scale of the image pyramid, the number of Residual Blocks inside each Scale Block, and the number of convolutions inside each Residual Block are user-configurable hyperparameters.

Fig. 7 a Scale Block. The Scale Block is used to process features at a given scale; it is a sequence of Residual Blocks intertwined with residual connections. A Residual Block is an alternation of batch normalization layers, rectified linear units and convolutions. b The input is progressively downsampled via convolutions with stride 2, intermixed with Scale Blocks; at a given scale, a global average pooling layer extracts features that are further processed via dense layers to compute mean and variance for latent variables. The decoder is essentially symmetric

In the encoder, at the end of the last Scale Block, a global average pooling layer extracts spatially agnostic features. These are first passed through a so-called Dense Block (similar to a Residual Block but with dense layers instead of convolutions), and finally used to synthesize mean and variance for latent variables.

The decoder first maps the internal encoding z to a small map of dimension \(4\times 4 \times base\_dim\) via a dense layer suitably reshaped. This is then up-sampled to the final expected dimension, inserting Scale Blocks at each scale.

Two-Stage VAE

To check to what extent the Two-Stage VAE improves the generation ability of a Variational Autoencoder, we tried to fit a second stage to every model we tested, following the architecture described below and graphically represented in Fig. 7.

The encoder in the second stage model is composed of a couple of Dense layers of dimension 1536 with ReLU activation function, followed by a concatenation with the input of the model and then by another Dense layer to obtain the latent representation u with the same dimension as z, following what is described in “Two-Stage VAE”. The decoder has exactly the same structure as the encoder.

As already described, we used the cosine similarity as the reconstruction part of the ELBO objective function.

We observed that, to improve the quality of the generation, the second stage should be trained for a large number of epochs.

Convolutional RAE

In our implementation of RAE, we followed exactly the same structure as in Convolutional Vanilla VAE, with the sole difference that, in RAEs, the latent space is composed of just one fully connected layer representing the variable z (see Fig. 8).

In our tests, we only compared \(L_2\) and GP regularization, with parameter \(\lambda\) heuristically computed to achieve the best performance.

Fig. 8 Graphical representation of the RAE architecture. The yellow, orange and green boxes represent convolutional, downsampling and dense layers, respectively. The red circle underlines the sole architectural difference between our implementation of VanillaVAE and RAE, i.e. the fact that in the latter the latent space is composed of a single Dense layer that directly encodes to z, while in VanillaVAE the encoding is performed by a couple of Dense layers that represent the mean and the variance of a Gaussian distribution


The NVAE model [58] is organized in a bottom-up inference network and a top-down generative network (see Fig. 9). Each of the two networks is composed of a hierarchy of modules at different scales. Each scale is composed of groups of sequential (residual) blocks.

Fig. 9 The whole NVAE architecture (a) and a focus on its decoder (b)

During generation, each module computes from the current input \(h_l\) a prior \(p(z_l|h_l)\) (\(h_l\) depends on \(z_{<l}\)): after sampling from this prior, the result is combined with the current input \(h_l\), and the two pieces of information are processed together and passed to the next module.

During inference, we extract the latent representation at stage l by synthesizing a mean and a standard deviation for \(q(z_l|x,h_l)\): since this information depends on \(h_l\), we expect it to provide additional information, not already available from previous latent encodings. Moreover, the computation of \(h_l\) is done by the top-down network, which is hence a sub-network of the inference network. During training, both networks are trained together.

Each network has a hierarchical organization at different scales. Each scale is composed of groups of Blocks.

Both Encoder Blocks (EB) and Decoder Blocks (DB) have similar architectures, and are essentially composed of an alternating sequence of BatchNormalization and Convolutional layers, separated by non-linear activation layers, and intermixed with residual connections (so, very similar to the Scale Block discussed in the previous section). A few technical novelties are, however, introduced by the authors:

  • the recent Swish activation function \(f(u) = \frac{u}{1+e^{-u}}\) [45] is used instead of ReLU, ELU, or other more traditional choices;

  • a Squeeze-and-Excitation (SE) layer [34] is added at the end of each block;

  • a moderate use of depthwise separable convolutions [15] is deployed to reduce the number of parameters of the network.
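For reference, the Swish activation mentioned in the first item is a one-liner in NumPy. The sketch below simply evaluates the formula given above and is not taken from the NVAE code base.

```python
import numpy as np

def swish(u):
    """Swish activation f(u) = u / (1 + exp(-u)), applied element-wise."""
    u = np.asarray(u, dtype=float)
    return u / (1.0 + np.exp(-u))
```

Unlike ReLU, the function is smooth and takes small negative values for negative inputs, while it approaches the identity for large positive inputs.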

Table 1 gives a summary of hyperparameters used in training NVAE on the datasets addressed in this article, borrowed from [58]. \(D^2\) indicates a latent variable with the spatial dimensions of \(D \times D\). As an example, the MNIST model consists of two scales: in the first one, we have five groups of \(4 \times 4 \times 20\)-dimensional latent variables: in the second one, we have ten groups of \(8 \times 8 \times 20\)-dimensional variables.

Table 1 Summary of the hyperparameters used in the training of NVAE on the datasets used in this paper

The figures of merit in Table 1 help to understand the key novelty of NVAE, which lies in the massive usage of spatially located latent variables. Consider for instance the case of Cifar10. The original input of dimension \(32\times 32\times 3\) is first transformed to dimension \(16\times 16\times 128\) and then, without any further downscaling, processed through a long sequence of residual cells \((30\times 2)\). At each iteration, a huge number of latent variables \((16\times 16\times 20)\) is extracted and used for the internal representation, which hence has a dimension much larger than the input. For this reason, as also observed by the authors in the appendix, it is not surprising that most of the variables collapse during training.
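The disproportion between the input and the latent representation can be quantified with a short computation, assuming the Cifar10 configuration with 30 groups of \(16\times 16\times 20\) latent variables mentioned in this section.

```python
input_dim = 32 * 32 * 3            # Cifar10 image: 3072 values
latents_per_group = 16 * 16 * 20   # one group of spatially located latents: 5120
num_groups = 30                    # groups at scale 16 (assumed configuration)

total_latents = num_groups * latents_per_group  # 153600
ratio = total_latents / input_dim               # 50.0
```

Under this assumption, the internal representation is fifty times larger than the image it encodes.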

Working with such a huge number of latent variables introduces a lot of issues; in particular, it becomes crucial to balance the KL-component of variables belonging to different groups. To this aim, the authors introduce additional balancing coefficients \(\gamma _l\) to ensure that a similar amount of information is encoded in each group (see [58], Appendix A):

$$\begin{aligned}D_{KL}(q(z|x)||p(z)) = \sum _{l=1}^L \gamma _l {\mathbb {E}}_{q(z_{<l}|x)} [D_{KL}(q(z_l|x, z_{<l})||p(z_l|z_{<l}))]. \end{aligned}$$

The balancing coefficient \(\gamma _l\) is kept proportional to the KL term for that group, in such a way to encourage the model to revive the latent variables in that group when KL is low, and to clip them if KL is too high. Additionally, \(\gamma _l\) is also proportional to the size of each group, to encourage the use of variables at lower scales.

NVAE architectures have a relatively small number of parameters, due to the extensive use of convolutions and depthwise separable convolutions; however, they require a massive amount of memory and huge computational power: for the configuration used for Cifar10, composed of 30 groups at scale \(16\), we estimated a number of FLOPs for the inference phase larger than 100 G.

For these reasons, we experimented with a significantly lighter architecture, composed of just five groups, with a few additional convolutions to augment the receptive fields of the spatially located latent variables. The good news is that the network, even in this severely crippled form, still seems to learn; however, results are really modest and below the performance of different networks with comparable complexity.


As we already remarked, the main novelty of NVAE is in the massive exploitation of a huge number of spatially located latent variables. To test the relevance of this architectural decision, we also tested a different variant of the hereditary architecture of Fig. 9, where we drop the spatial dimension for latent variables, using instead a Feature-wise Linear Modulation (FiLM) layer [44] to modulate channels according to the internal representation. In addition, the first approximation \(h_1\) is directly produced from the latent variable set \(z_0\) through a dense transformation. The general idea is that at lower scales we decide the content of the resulting image, while stylistic details at different resolutions (usually captured in channel correlations [25]) are added at higher scales. We call this variant HFVAE (Hereditary Film VAE); a similar architecture has been investigated in [9].
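The modulation performed by a Feature-wise Linear Modulation layer can be sketched as follows: a flat latent vector is mapped to one scale and one shift per channel, which are then broadcast over the spatial dimensions. The two dense maps are passed explicitly as weight matrices; their names and the channels-last layout are assumptions made for this illustration.

```python
import numpy as np

def film(h, z, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate feature maps h (N, H, W, C) channel-wise from latents z (N, k).

    w_gamma, w_beta: (k, C) dense weights; b_gamma, b_beta: (C,) biases.
    """
    gamma = z @ w_gamma + b_gamma  # (N, C) per-channel scale
    beta = z @ w_beta + b_beta     # (N, C) per-channel shift
    return gamma[:, None, None, :] * h + beta[:, None, None, :]
```

Since the latent variables carry no spatial dimension, their number is independent of the resolution of the feature maps they modulate.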

Numerical Results

In this section, we provide quantitative evaluations for some configurations of the models previously discussed. The precise configurations (layers, channels, blocks, etc.) are discussed below.

The datasets used for the comparison are CIFAR10 and CelebA: in a GreenAI perspective, we are reluctant to address more complex datasets, at higher resolutions, that would require additional computational resources and additional costs. On CelebA, we just evaluated a subset of particularly interesting models.

For each model, we provide the following figures:

  • Params: the number of parameters;

  • FLOPs: an estimate of the number of FLOPs (see “Green AI and FLOPS” for more details);

  • MSE: the mean reconstruction error \(\times 10^3\);

  • REC: the FID value computed over reconstructed images;

  • GEN1: the FID value computed over images generated by a first VAE;

  • GEN2: the FID value computed taking advantage of a second VAE;

  • GMM: the FID value computed by superimposing a GMM of ten Gaussians on the latent space. In the case of hierarchical models, the GMM is computed on the innermost set of latent variables.

The following list provides a legend for the names of the models used in the following tables:

  • CNN-by-lz: Vanilla VAE with CNN architecture, basedim of y channels and latent space of dimension z.

  • L2-RAE-by-lz: \(L_2\)-RAE with CNN architecture, basedim of y channels and latent space of dimension z.

  • GP-RAE-by-lz: GP-RAE with CNN architecture, basedim of y channels and latent space of dimension z.

  • Resnet-sx-by-lz: Resnet-like model, with x ScaleBlocks, a basedim of y channels, and a latent space of dimension z.

  • HFVAE-sx-by-lz: HFVAE with x scales, ScaleBlocks, a basedim of y channels, and a latent space of dimension z at hereditary scales; the base latent dimension \(z_0\) is 64.

  • NVAE-zx-by-lz: NVAE with x latent variables channels, a basedim of y and z latent groups of the same scale.

Table 2 Summary of the results obtained with the networks in the first column on Cifar10
Table 3 Summary of the results obtained with the networks in the first column on CelebA

Quality Evaluation

Here we draw a few conclusions about the design of Variational Autoencoders deriving from the previous investigation (Tables 2 and 3) and our past experience with VAEs.

  • The decoder is more important than the encoder. For instance, in the ResNet architecture latent features are extracted via a GlobalAverage layer, obtaining robust features, less prone to overfitting.

  • Working with a larger number of latent variables improves reconstruction, but this does not necessarily imply better generation. This is evident, e.g., when comparing the two Resnet-like architectures with latent spaces of dimension 128 and 100.

  • Fitting a GMM over the latent space [26] is a cheap technique (it just takes a few minutes) that invariably improves generation, both in terms of perceptual quality and FID score. This fact also confirms the mismatch between the prior and the aggregated posterior discussed in “Aggregate Posterior vs. Expected Prior Mismatch”.

  • The second stage technique [16] typically requires some tuning to work properly, but when it does it usually outperforms the GMM approach. Tuning may involve the loss function (we used cosine similarity in this work), the architecture of the second VAE, and the learning rate (more generally, the optimizer’s parameters).

  • Hierarchical architectures are complex systems, difficult to understand and to work with (monitoring/calibrating training is a really complex task). We cannot express an opinion about NVAE, since its complexity exceeds our modest computational facilities, but simpler architectures like those described in [28] or [22], in our experience, do not significantly improve generation over a well-constructed traditional VAE.

  • The loss of variance for generated images [6] (see “Blurriness”) is confirmed in all models, and it almost coincides with the mean squared error for reconstruction.

A qualitative comparison between the different models in generating images can also be done by looking at the images in the Appendix.

Energetic Evaluation

Before comparing the energetic footprint of the different models, let us briefly discuss the notion of FLOPs as a measure of computational efficiency. FLOPs have been computed by a library for TensorFlow Keras under development at the University of Bologna, inspired by similar works for PyTorch. FLOPs only provide an abstract, machine-independent notion of complexity; typically, only the most expensive layers are taken into account (those with superlinear complexity with respect to the size of inputs). The way this quantity translates into actual execution time and energy consumption does, however, largely depend on the underlying hardware and on the available parallelism. As an example, in Table 4 we compare the execution time for a forward step over the test set of Cifar10 (10 k images) for a couple of hardware configurations. The first one is a laptop with an NVIDIA Quadro T2000 graphics card and a Core i7-9850H CPU; the second one is a workstation with an Asus GeForce DUAL-GTX1060-O6G graphics card and an Intel Core i7-7700K CPU. Observe the strong dependency on the batch size, which is not surprising but worth recalling (see [12] for a critical analysis of the performance of Neural Network architectures). Of course, as soon as we move the computation to the cloud, execution times become practically unpredictable.
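To make the notion concrete, the sketch below counts FLOPs for convolutional and dense layers by hand. It is not the library mentioned above: the function names, the toy layer list, and the convention of counting two FLOPs per multiply-accumulate (ignoring biases and activations) are all assumptions made for illustration; other tools report multiply-accumulates instead, which halves the figures.

```python
def conv2d_flops(out_h, out_w, kernel_h, kernel_w, in_ch, out_ch):
    """FLOPs of a standard 2D convolution, counted as 2 * multiply-accumulates."""
    macs = out_h * out_w * kernel_h * kernel_w * in_ch * out_ch
    return 2 * macs


def dense_flops(in_features, out_features):
    """FLOPs of a fully connected layer, counted as 2 * multiply-accumulates."""
    return 2 * in_features * out_features


# Hypothetical toy encoder for 32x32x3 inputs (not one of the surveyed models).
layers = [
    ("conv 3x3, 3->32, output 16x16", conv2d_flops(16, 16, 3, 3, 3, 32)),
    ("conv 3x3, 32->64, output 8x8", conv2d_flops(8, 8, 3, 3, 32, 64)),
    ("dense 4096->128", dense_flops(8 * 8 * 64, 128)),
]

total_flops = sum(flops for _, flops in layers)
for name, flops in layers:
    print(f"{name}: {flops} FLOPs")
print(f"total: {total_flops} FLOPs")
```

Note that the count depends only on layer shapes, which is precisely why it is machine-independent and says nothing about memory traffic or parallelism.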

Table 4 Average forward time (in seconds) over the Cifar10 Test Set (10 k images) for different networks, hardware and batch size (bs). The two time entries in each cell refer to different machines: the first is a laptop with an NVIDIA Quadro T2000 graphics card and a Core i7-9850H CPU; the second is a workstation with an Asus GeForce DUAL-GTX1060-O6G graphics card and an Intel Core i7-7700K CPU

Unfortunately, as we shall see, even for a given computational device, the relation between FLOPs and execution time is rather erratic.
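A measurement of forward time as a function of batch size can be organised along the following lines. This is only a sketch under stated assumptions: `time_forward` is a hypothetical helper, and `dummy_predict` is a placeholder standing in for the forward pass of a real model. With an actual network on a GPU, one would additionally need a warm-up run and would have to make sure that the computation has completed before stopping the clock.

```python
import time


def time_forward(predict, samples, batch_size, repeats=3):
    """Average wall-clock time (seconds) to run `predict` over all samples."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(samples), batch_size):
            predict(samples[i:i + batch_size])
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def dummy_predict(batch):
    """Placeholder for a model forward pass (assumption: any callable on a batch)."""
    return [sum(x) for x in batch]


# Synthetic inputs; a real experiment would use the 10 k Cifar10 test images.
samples = [[0.0] * 8 for _ in range(1000)]
timings = {bs: time_forward(dummy_predict, samples, bs) for bs in (1, 10, 100)}
for bs, seconds in timings.items():
    print(f"bs={bs}: {seconds:.6f} s")
```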

Following the traditional paradigm, we compare performance on the forward pass. This is already a questionable choice: on the one hand, it reflects the final usage of the network once deployed in practical applications; on the other hand, it is plausible that training still accounts for a prevalent part of the lifetime of any neural network. Restricting the investigation to forward time means ignoring some expensive components of the training of modern systems, such as regularization terms. For example, in Table 5, \(L_2\)-RAE and GP-RAE have exactly the same number of FLOPs, since they are identical in terms of forward execution. However, the training of GP-RAE is almost ten times slower than that of \(L_2\)-RAE. This is because the regularization term of GP-RAE involves computing the gradient of the decoder with respect to the latent variables, an expensive operation not required by \(L_2\)-RAE. Consequently, even if the two models have more or less the same generation quality, \(L_2\)-RAE should be preferred, since its training is cheaper. Moreover, considering only the FLOPs of the model neglects the actual convergence speed of the systems.
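The difference between the two regularizers can be made explicit. The notation below is an assumption introduced for illustration (it may differ from the notation used elsewhere in this article): \(E_\phi\) denotes the encoder, \(D_\theta\) the decoder with parameters \(\theta\), and \(z = E_\phi(x)\) the latent code of an input \(x\).

```latex
\begin{align}
\mathcal{L}_{\mathrm{REG}}^{L_2} &= \lVert \theta \rVert_2^2, \\
\mathcal{L}_{\mathrm{REG}}^{\mathrm{GP}} &= \lVert \nabla_{z} D_\theta(z) \rVert_2^2,
\qquad z = E_\phi(x).
\end{align}
```

The first term depends on the decoder weights only and adds a negligible cost per training step. The second requires differentiating the decoder output with respect to \(z\) and then differentiating that gradient again with respect to \(\theta\) during backpropagation. Neither term appears in the forward pass, which is why a forward-only FLOPs count cannot distinguish the two models.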

The results of the energetic evaluation on the forward pass are given in Table 5; inference times have been computed on a workstation with an Asus GeForce DUAL-GTX1060-O6G graphics card and an Intel Core i7-7700K CPU. The same results are also shown in graphical form in Fig. 10, for a batch size of 1. In the plot, we omit \(L_2\)-RAE and GP-RAE, since their architectures and figures are essentially analogous to those of the basic CNN; similarly for some ResNet architectures.

Table 5 Average forward time (in seconds) over the Cifar10 Test Set (10 k images) for different architectures and different batch size (bs); times refer to a workstation equipped with an Asus GeForce DUAL-GTX1060-O6G GPU and an Intel Core i7-7700K CPU
Fig. 10 FLOPs versus execution time. The plot shows little relation between the two quantities, except possibly at the order-of-magnitude level

As is clear from these results, there is no precise correlation between FLOPs and execution time.
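One way to quantify such a statement is a rank correlation between the two columns. The sketch below implements Spearman's coefficient in plain Python; it is an illustration only, the helper names are hypothetical, ties are not handled, and the numbers are made-up placeholders that are not taken from Table 5.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)


def ranks(values):
    """Rank of each value (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder values, NOT measurements: the model with the fewest FLOPs
# is given the longest time, mimicking the kind of inversion discussed here.
flops = [1.0, 2.0, 3.0, 4.0]
seconds = [4.0, 1.0, 2.0, 3.0]
print(spearman(flops, seconds))
```

A coefficient close to 1 would indicate that FLOPs at least order the models correctly by execution time; values near zero or below indicate that they do not.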

As an example, from Table 5, we see that HFVAE requires a computation time 4–6 times higher than the others, despite having the lowest number of FLOPs. One possible reason for this behaviour is, in our opinion, that the total computation time also includes memory access time, which FLOPs do not account for. As observed by several authors (see, e.g. [33]), memory access time is a crucial factor in real implementations, as densely packed data might be read faster than a few scattered values. For instance, while depthwise convolutions greatly reduce the number of parameters and FLOPs, they require a more fragmented memory access, which is harder to implement efficiently. In future work, we intend to investigate more deeply other causes of the absence of correlation between FLOPs and time.
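A rough way to see why fewer FLOPs need not mean less time is to compare FLOPs with the amount of data moved. The sketch below is a back-of-the-envelope estimate, not our analysis: `conv_cost` is a hypothetical helper, the layer shape is arbitrary, and the traffic model (input, weights and output each transferred exactly once, 4 bytes per value) is an idealised assumption that does not capture how fragmented the accesses are.

```python
def conv_cost(h, w, k, c_in, c_out, depthwise=False, bytes_per_value=4):
    """Return (FLOPs, bytes moved, FLOPs per byte) for one convolution layer.

    h, w: output spatial size (same-size input assumed); k: kernel size.
    For a depthwise convolution there is one k x k filter per input channel,
    so the number of output channels equals c_in and c_out is ignored.
    """
    if depthwise:
        macs = h * w * k * k * c_in
        weights = k * k * c_in
        out_values = h * w * c_in
    else:
        macs = h * w * k * k * c_in * c_out
        weights = k * k * c_in * c_out
        out_values = h * w * c_out
    flops = 2 * macs
    # Idealised traffic: read input once, read weights once, write output once.
    bytes_moved = bytes_per_value * (h * w * c_in + weights + out_values)
    return flops, bytes_moved, flops / bytes_moved


standard = conv_cost(16, 16, 3, 64, 64)
depthwise = conv_cost(16, 16, 3, 64, 64, depthwise=True)
print("standard :", standard)
print("depthwise:", depthwise)
```

For this shape, the depthwise layer performs 64 times fewer FLOPs but moves only about half as many bytes, so its FLOPs per byte drop from about 68 to about 2.2: under these assumptions, its running time is far more likely to be dominated by memory access than by arithmetic.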


In this article, we presented a critical survey of recent variants of Variational Autoencoders, relating them to the several problems that still hinder this generative paradigm. In view of the emerging Green AI paradigm [51], we also focused our attention on the computational cost of the different architectures. The main conclusions of our investigation are given in “Quality Evaluation”, and we shall not try to summarize them here; we just observe that, while we strongly support the Green AI vision, we must eventually find better metrics than FLOPs to compare the energetic performance of neural networks, or more realistic ways to compute them.

The constant improvement in generative sampling during the last few years is very promising for the future of this field, suggesting that state-of-the-art generative performance can be achieved or possibly even improved by carefully designed VAE architectures.

At the same time, the quest for scaling models to higher resolution and larger images, together with the introduction of additional, and usually computationally expensive, regularization techniques, is a worrying and dangerous prospect from the point of view of Green AI.

From this point of view, our experience with NVAE is instructive and quite frustrating. The architecture is interesting, and it deserves a deeper investigation; unfortunately, it seems to require computational facilities far beyond those at our disposal.