1 Introduction

Normalizing flows (NF) are known to be efficient generative models to approximate probability distributions in an unsupervised setting. For example, Glow (Kingma & Dhariwal, 2018) is able to generate very realistic human faces, competing with state-of-the-art algorithms of variational inference (Papamakarios et al., 2021). Despite some early theoretical results about their stability (Nalisnick et al., 2018) or their approximation and asymptotic properties (Behrmann et al., 2019), their training remains challenging in the most general cases. Their capacity is limited by intrinsic architectural constraints, resulting in a variational mismatch between the target distribution and the actually learnt distribution. In particular, Cornish et al. (2020) pointed out the critical issue of target distributions whose disconnected support features several components. Since NF provide a continuous differentiable change of variables, they are not able to deal with such distributions when using a unimodal (e.g., Gaussian) latent distribution. Even targeting multimodal distributions featuring high probability regions separated by very unlikely areas remains problematic. Since the trained NF is a continuous differentiable transformation, the transport of latent samples to the target space may overload low probability areas with undesired samples. These samples correspond to smooth transitions between different modes and are therefore out-of-distribution, as discussed by Cornish et al. (2020).

Fig. 1

a True latent measure \(p_Z\) given the explicit target measure \(p_X\) (chosen as a double-moon in this toy example); b Tempered distribution \(\tilde{q}_Z={q_Z}|J_f|^{-1}\) learnt by the NF when the instrumental latent measure \(q_Z\) is Gaussian; c Outputs \(\left\{ x^{(n)}\right\} _{n=1}^N\) drawn from the learnt target measure \(q_X\) with a naive sampling procedure, i.e., \(x^{(n)} = {f}(z^{(n)})\) and \(z^{(n)} \sim q_Z\). The reader is also invited to refer to Fig. 2 for an explicit description of the relationships between these measures

Figure 1 illustrates this behavior on an archetypal example considering a bimodal two-dimensional two-moon target measure and a latent Gaussian measure. It is worth noting that, for this toy example, the target measure is not only empirically described by data points but also admits an explicitly known distribution. The NF is first trained on a large set of data points drawn from the true target distribution. Figure 1a shows the latent distribution \(p_Z\) actually learnt by the NF and computed after applying the (inverse) pushforward operator to the explicitly known target distribution \(p_X\), i.e., \(p_Z =f_{\sharp }^{-1}p_X\). It appears that the NF splits the expected Gaussian latent space into two sub-regions separated by an area of minimal likelihood. This area corresponds, in the target domain, to the low probability area between the two modes of the target distribution, which is somewhat expected.

Figure 1c shows the result of a naive sampling from the Gaussian model latent distribution when the generated Gaussian samples are translated into the target domain thanks to the mapping learnt by the NF. The purple line represents the \(97.5\%\) level set. It appears that many samples generated by this naive sampling procedure are out-of-distribution in the low probability area between the two moons, see the top of the plot. They correspond to samples drawn from the latent Gaussian distribution in the low-likelihood area (depicted in dark blue in Fig. 1a) located between the two modes (represented in yellow and light green). Note that this illustrative example, and in particular Fig. 1b, will be discussed in more depth in the contributive Sects. 3 to 5 in light of the findings reported throughout the paper.

The observations made above illustrate a behavior that is structural. NF are diffeomorphisms that preserve the topological structure of the support of the latent distribution. If the information about the structure of the target distribution is ignored, many out-of-distribution samples will be generated. This effect is reinforced by the fact that the NF is trained on a finite data set, so that in practice nearly empty areas arise in low probability regions. In other words, since the latent distribution is usually a simple Gaussian unimodal distribution, there is a topological mismatch with the often much more complex target distribution (Cornish et al., 2020), in particular when it is multimodal.

A first contribution of this paper is a theoretical study of the impact of a topological mismatch between the latent and target distributions on the Jacobian of the NF transformation. We prove that the norm of the Jacobian of a sequence of differentiable mappings between a unimodal distribution and a distribution with disconnected support diverges to infinity (see Proposition 1). This observation suggests that one should consider the information brought by the Jacobian when sampling from the target distribution with a NF.

Capitalizing on this theoretical study, the second contribution of this paper is a new dedicated Markov chain Monte Carlo algorithm to sample efficiently according to the distribution targeted by a NF. The proposed sampling method builds on a Langevin dynamics formulated in the target domain and translated into the latent space, which is made possible thanks to the invertibility of the NF. Interestingly, the resulting Langevin diffusion is defined on a Riemannian manifold whose geometry is driven by the Jacobian of the NF. As a result, the proposed Markov chain Monte Carlo method is shown to avoid low probability regions and to produce significantly fewer out-of-distribution samples, even when the target distribution is multimodal. It is worth noting that the proposed method does not require a specific training procedure but can be implemented to sample from any pre-trained NF with any architecture.

The paper is organized as follows. Section 2 reports on related works. Section 3 recalls the main useful notions about normalizing flows. Section 4 studies the theoretical implications of a topological mismatch between the latent distribution and the target distribution. Section 5 introduces the proposed sampling method based on a Langevin dynamics in the latent space. In Sect. 6, numerical experiments illustrate the advantages of the proposed approach by reporting performance results both for 2D toy distributions and in high dimensions on the usual Cifar-10, CelebA and LSUN data sets.

2 Related works

Geometry in neural networks Geometry in neural networks as a tool to understand local generalization was first discussed by Bengio et al. (2013). As a key feature, the Jacobian matrix controls how smoothly a function interpolates a surface from some input data. As an extension, Rifai et al. (2011) showed that the norm of the Jacobian acts as a regularizer of the deterministic autoencoder. Later, Arvanitidis et al. (2018) were the first to establish the link between push-forward generative models and surface modeling. In particular, they showed that the latent space could reveal a distorted view of the input space that can be characterized by a stochastic Riemannian metric governed by the local Jacobian.

Distribution with disconnected support As highlighted by Cornish et al. (2020), when using ancestral sampling, the structure of the latent distribution should fit the unknown structure of the target distribution. To tackle this issue, several solutions have been proposed. These strategies include augmenting the space on which the model operates (Huang et al., 2020), continuously indexing the flow layers (Cornish et al., 2020), and including stochastic (Wu et al., 2020) or surjective layers (Nielsen et al., 2020). However, these approaches sacrifice the bijectivity of the flow transformation. In most cases, this sacrifice has dramatic impacts: the model is no longer tractable, memory savings during training are no longer possible (Gomez et al., 2017), and the model is no longer a perfect encoder-decoder pair. Other works have promoted the use of multimodal latent distributions (Izmailov et al., 2020; Ardizzone et al., 2020; Hagemann & Neumayer, 2021). Nevertheless, rather than capturing the inherent multimodal nature of the target distribution, their primary motivation is to perform a classification task or to solve inverse problems with flow-based models. Papamakarios et al. (2017) have shown that choosing a mixture of Gaussians as a latent distribution could lead to an improvement of the fidelity to multimodal distributions. Alternatively, Pires and Figueiredo (2020) have studied the learning of a mixture of generators. Using a mutual information term, they encourage each generator to focus on a different submanifold so that the mixture covers the whole support. More recently, Stimper et al. (2022) predicted latent importance weights and proposed a sub-sampling method to avoid the generation of the most irrelevant samples. However, all these methods require elaborate learning strategies which involve several sensitive hyperparameters or impose specific neural architectures. On the contrary, as emphasized earlier, the proposed approach does not require a specific training strategy, is computationally efficient, and can be applied to any pre-trained NF.

Sampling with normalizing flows Recently, NF have been used to facilitate the sampling from explicitly known distributions with non-trivial geometries. To solve the problem, samplers that combine Monte Carlo methods with NF have been proposed. On the one hand, flows have been used as reparametrization maps that improve the geometry of the target distribution before running local conventional samplers such as Hamiltonian Monte Carlo (HMC) (Hoffman et al., 2019; Noé et al., 2019). On the other hand, the push-forward of the NF base distribution through the map has also been used as an independent proposal in importance sampling (Müller et al., 2019) and Metropolis-Hastings steps (Gabrié et al., 2022; Samsonov et al., 2022). In this context, NF are trained using the reverse Kullback-Leibler divergence so that the push-forward distribution approximates the target distribution. These approaches are particularly appealing when a closed-form expression of the target distribution is available explicitly. In contrast, this paper does not assume an explicit knowledge of the target distribution. The proposed approach aims at improving the sampling from a distribution learnt by a given NF trained beforehand.

3 Normalizing flows: preliminaries and problem statement

3.1 Learning a change of variables

NF define a flexible class of deep generative models that seek to learn a change of variables between a reference Gaussian measure \(q_Z\) and a target measure \(p_{X}\) through an invertible transformation \(f: \mathcal {Z} \rightarrow \mathcal {X}\) with \(f \in \mathcal {F}\) where \(\mathcal {F}\) defines the class of NF. Figure 2 summarizes the usual training of NF, which minimizes a discrepancy measure between the target measure \(p_X\) and the push-forward measure \(q_X\) defined as

$$\begin{aligned} q_X = f_{\sharp }q_Z \end{aligned}$$
(1)

where \(f_{\sharp }\) stands for the associated push-forward operator. This discrepancy measure is generally chosen as the Kullback-Leibler (KL) divergence \(D_{\textrm{KL}}(p_X \Vert q_X)\). Explicitly writing the change of variables

$$\begin{aligned} q_X(x) = q_Z(f^{-1}(x)) \left| J_{f^{-1}}(x)\right| \end{aligned}$$
(2)

where \(J_{f^{-1}}\) is the Jacobian matrix of \(f^{-1}\), the training is thus formulated as the minimization problem

$$\begin{aligned} \min _{f \in \mathcal {F}} \quad \mathbb {E}_{p_X}[-\log q_Z(f^{-1}(x))-\log |J_{f^{-1}}(x)|] \end{aligned}$$
(3)

Note that the term \(\log p_X(x)\) does not appear in the objective function since it does not depend on f. In this work, the class \(\mathcal {F}\) of admissible transformations is chosen as the set of structures composed of coupling layers (Papamakarios et al., 2021; Dinh et al., 2016; Kingma & Dhariwal, 2018), ensuring that the Jacobian matrix of f is lower triangular with positive diagonal entries. Because of this triangular structure, the Jacobian \(J_f\) and the inverse map \(f^{-1}\) are available explicitly. In particular, the Jacobian determinant \(\left| J_f(z)\right|\) evaluated at \(z\in \mathcal {Z}\) measures the dilation induced by f, i.e., the change of volume of a small neighborhood around z, given by the ratio between the volumes of the corresponding neighborhoods of x and z.
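To make the role of the triangular Jacobian concrete, the following minimal sketch illustrates a RealNVP-style affine coupling layer in Python. The networks s_net and t_net, the shapes, and the function names are illustrative assumptions rather than the exact architecture used in the paper; the point is that the log-determinant of the Jacobian reduces to a sum of predicted log-scales, so that both \(f\) and \(f^{-1}\) are cheap to evaluate.

```python
import numpy as np

def affine_coupling_forward(z, s_net, t_net):
    """One affine coupling layer z -> x (RealNVP-style sketch).

    The first half of the coordinates is copied unchanged; the second half
    is scaled and shifted by networks that only see the first half, so the
    Jacobian is lower triangular and its log-determinant is sum(log_s).
    """
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    log_s = s_net(z1)                 # elementwise log-scales, shape (..., d)
    t = t_net(z1)                     # elementwise shifts, shape (..., d)
    x2 = z2 * np.exp(log_s) + t
    log_det_Jf = log_s.sum(axis=-1)   # log |J_f(z)|
    return np.concatenate([z1, x2], axis=-1), log_det_Jf

def affine_coupling_inverse(x, s_net, t_net):
    """Inverse map x -> z together with log |J_{f^{-1}}(x)|."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    log_s = s_net(x1)
    t = t_net(x1)
    z2 = (x2 - t) * np.exp(-log_s)
    log_det_Jfinv = -log_s.sum(axis=-1)  # log |J_{f^{-1}}(x)| = -log |J_f(z)|
    return np.concatenate([x1, z2], axis=-1), log_det_Jfinv
```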

In practice, the target measure \(p_X\) is available only through observed samples \(\left\{ x^{(1)}, x^{(2)}, \ldots , x^{(N)}\right\}\). Adopting a sample-average approximation, the objective function in (3) is replaced by its Monte Carlo estimate. For a fixed set of samples per data batch, the NF training is formulated as

$$\begin{aligned} \hat{f} \in \arg \min _{f \in \mathcal {F}} \frac{1}{N} \sum _{n=1}^N\left[ -\log q_Z(f^{-1}(x^{(n)}))-\log |J_{f^{-1}}(x^{(n)})| \right] . \end{aligned}$$
(4)

It is important to note that the obtained solution \(\hat{f}\) is only an approximation of the exact transport map for two main reasons. First, the feasible set \(\mathcal {F}\) (the class of admissible NF) is restricted to continuous, differentiable, and bijective functions. There is no guarantee that at least one transformation from this set will achieve \(D_{\textrm{KL}}(p_X \Vert q_X) = 0\). Second, even if such a transformation exists in \(\mathcal {F}\), the solution \(\hat{f}\) obtained by (4) only asymptotically matches the minimizer of (3) as \(N \rightarrow \infty\).
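For illustration, a minimal sketch of the Monte Carlo objective (4) is given below. It assumes a hypothetical flow_inverse callable returning both \(f^{-1}(x)\) and \(\log |J_{f^{-1}}(x)|\) (e.g., a composition of coupling layers such as the one sketched above) and a standard Gaussian latent measure; the names are illustrative, not the paper's implementation.

```python
import numpy as np

def nll_objective(flow_inverse, x_batch):
    """Monte Carlo estimate of the objective (4).

    flow_inverse: callable x -> (z, log_det_Jfinv), e.g. a stack of
                  affine_coupling_inverse layers composed together.
    x_batch:      array of shape (N, d) of observed samples.
    """
    z, log_det_Jfinv = flow_inverse(x_batch)
    d = z.shape[-1]
    # log q_Z(z) for a standard Gaussian latent measure
    log_qz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * d * np.log(2.0 * np.pi)
    # negative log-likelihood per sample, averaged over the batch
    return np.mean(-log_qz - log_det_Jfinv)
```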

The main issues inherent to the NF training and identified above would still hold for more refined training procedures (Coeurdoux et al., 2022), i.e., procedures that would go beyond the crude minimization problem (3). However, the work reported in this paper does not address the training of the NF. Instead, one will focus on the task which consists in generating samples from the learnt target measure. Thus one will assume that a NF has already been trained to learn a given change of variables. To make the sequel of this paper smoother to read, no distinction will be made between the sought transformation and its estimate, which will be denoted f in what follows.

Fig. 2

NF learns a mapping f from data points \(\left\{ x^{(n)}\right\} _{n=1}^N\) assumed to be drawn from \(p_X\) towards the latent Gaussian measure \(q_Z\). The training consists in minimizing the KL divergence between \(p_X\) and \(q_X=f_{\sharp }q_Z\). Once trained, the learnt map makes it possible to go from \(q_Z\) to \(q_X\), which is an approximation of the true target distribution \(p_X\) (Color figure online)

3.2 A Gaussian latent space?

As noticed by Marzouk et al. (2016), learning the transformation f by variational inference can be reformulated with respect to (w.r.t.) the corresponding inverse map \(f^{-1}\). Since the KL divergence is invariant to changes of variables, minimizing \(D_{\textrm{KL}}(p_X\Vert q_X)\) is equivalent to minimizing \(D_{\textrm{KL}}(p_Z \Vert q_Z)\) with \(p_Z=f_{\sharp }^{-1} p_X\). The training procedure is thus formulated in the latent space instead of the target space. In other words, the NF aims at fitting the target measure \(p_Z\) expressed in the latent space to the latent Gaussian measure \(q_Z\). However, due to inescapable shortcomings similar to those highlighted above, the target measure \(p_Z\) in the latent space is only an approximation of the latent Gaussian measure \(q_Z\). This mismatch can be easily observed in Fig. 1a where the depicted actual measure \(p_Z\) is clearly not Gaussian. This issue may be particularly critical when there is a topological mismatch between the respective supports of the target and latent distributions. This will be discussed in more detail in Sect. 4.

3.3 Beyond conventional NF sampling

Once the NF has been trained, the standard method to sample from the learnt target distribution is straightforward. It consists in drawing a sample \(z^{(n)}\) from the latent Gaussian distribution \(q_Z\) and then applying the learnt transformation f to obtain a sample \(x^{(n)}=f\left( z^{(n)}\right)\). This method will be referred to as “naive sampling” in the sequel of this paper.
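As a point of comparison for the method developed in Sect. 5, the naive sampling procedure can be sketched in a few lines; flow_forward is the hypothetical forward pass returning \(f(z)\) and \(\log |J_f(z)|\) assumed in the earlier sketches.

```python
import numpy as np

def naive_nf_sampling(flow_forward, n_samples, d):
    """Naive sampling: draw z ~ q_Z = N(0, I) and push it through f."""
    z = np.random.randn(n_samples, d)
    x, _ = flow_forward(z)   # f(z); the log-Jacobian is ignored here
    return x
```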

Unfortunately, as discussed in Sect. 3.2 (see also Fig. 1), the latent distribution \(q_Z\) is expected to be different from the actual target distribution \(p_Z\) expressed in the latent space. As suggested in the next section, this mismatch will be even more critical when it results from topological differences between the latent and target spaces. As a consequence the naive NF sampling is doomed to be suboptimal and to produce out-of-distribution samples, as illustrated in Fig. 1c. In contrast, the approach proposed in Sect. 5 aims at devising an alternative sampling strategy that explicitly overcomes these shortcomings.

4 Implications of a topological mismatch

The push-forward operator \(f_\sharp\) learnt by an NF transports the mass allocated by \(q_Z\) in \(\mathcal {Z}\) to \(\mathcal {X}\), thereby defining \(q_X\) by specifying where each elementary mass is transported. This imposes a global constraint on the operator f if the model distribution \(q_X\) is expected to match a given target measure \(p_X\) perfectly. Let \(\textrm{supp}(q_Z)=\{z \in \mathcal {Z}: q_Z(z)>0\}\) denote the support of \(q_Z\). Then the push-forward operator \(f_\sharp\) can yield \(q_X=p_X\) only if

$$\begin{aligned} {\text{supp}}(p_X)=\overline{f\left( {\text{supp}}(q_Z)\right) } \end{aligned}$$
(5)

where \(\overline{B}\) is the closure of set B. The constraint (5) is especially onerous for NF because of their bijectivity. The operators f and \(f^{-1}\) are continuous, and f is a homeomorphism. Consequently, for these models, the supports of \(q_Z\) and \(p_X\) are isomorphic, i.e., homeomorphic as topological spaces (Runde et al. 2005, Def. 3.3.10). This means that \({\text{supp}}(q_Z)\) and \({\text{supp}}(p_X)\) must share exactly the same topological properties, in particular the number of connected components. This constraint is unlikely to be satisfied when learning complex real-world distributions, leading to an insurmountable topological mismatch. In such cases, this finding has serious consequences on the operator f learnt and implemented by a NF. Indeed, the following proposition states that if the respective supports of the latent distribution \(q_Z\) and the target distribution \(p_X\) are not homeomorphic, then the norm of the Jacobian \(|J_f|\) of f may become arbitrarily large. Here \({\mathop {\rightarrow }\limits ^{\mathcal {D}}}\) denotes weak convergence.

Proposition 1

Let \(q_Z\) and \(p_X\) denote distributions defined on \(\mathbb {R}^d\). Assume that \(\text{supp}(q_Z) \ne \text{supp}(p_X)\). For any sequence of measurable, differentiable Lipschitz functions \(f_t: \mathbb {R}^{d} \rightarrow \mathbb {R}^{d}\), if \(f_{t\sharp } q_Z \xrightarrow {\mathcal {D}} p_X\) when \({t\rightarrow +\infty }\), then

$$\begin{aligned} \lim _{t \rightarrow \infty } \sup _{z \in \mathcal {Z}}( \left\| J_{f_t}(z) \right\| ) = +\infty . \end{aligned}$$

The proof is reported in Appendix A.

It is worth noting that training a generative model is generally conducted by minimizing a statistical divergence. For most commonly used divergence measures (e.g., KL and Jensen-Shannon divergences, Wasserstein distance), this minimization implies a weak convergence of the approximated distribution \(q_X\) towards the target distribution \(p_X\) (Arjovsky et al., 2017). As a consequence, Proposition 1 states that in practice, when training a NF to approximate \(p_X\) with an iterative (e.g., stochastic gradient descent) algorithm, the learnt mapping \(f_t\) along the iterations (denoted here by t) is characterized by a Jacobian supremum which tends to explode in some regions as the algorithm approaches convergence. This result is in line with the experimental findings discussed earlier and visually illustrated by Fig. 1. Indeed, Fig. 1b depicts the heatmap of the log-likelihood

$$\begin{aligned} \log q_X(f(z)) = \log q_Z(z) - \log \left| J_{f}(z) \right| \end{aligned}$$
(6)

given by (2) after training an NF. The impact of the term governed by the determinant of the Jacobian is clear. It highlights a boundary separating two distinct areas, each associated with a mode in the target distribution \(p_X\). This result still holds when \(q_Z\) and \(q_X\) are defined on \(\mathbb {R}^{d_Z}\) and \(\mathbb {R}^{d_X}\), respectively, with \(d_Z \ne d_X\). This shortcoming is thus also unavoidable when learning injective flow models (Kumar et al., 2017) and other push-forward models such as GANs (Goodfellow et al., 2020).
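The heatmap of Fig. 1b can be reproduced, for a two-dimensional latent space, by evaluating (6) on a regular grid. The sketch below assumes the hypothetical flow_forward interface used in the previous sketches and a standard Gaussian latent measure.

```python
import numpy as np

def latent_loglik_heatmap(flow_forward, grid_lim=3.0, n_grid=200):
    """Heatmap of log q_X(f(z)) = log q_Z(z) - log|J_f(z)| over a 2D latent grid, cf. eq. (6)."""
    u = np.linspace(-grid_lim, grid_lim, n_grid)
    zz1, zz2 = np.meshgrid(u, u)
    z = np.stack([zz1.ravel(), zz2.ravel()], axis=-1)        # (n_grid**2, 2)
    _, log_det_Jf = flow_forward(z)
    log_qz = -0.5 * (z ** 2).sum(axis=-1) - np.log(2.0 * np.pi)  # d = 2
    return (log_qz - log_det_Jf).reshape(n_grid, n_grid)
```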

In practice, models are trained on a data set of finite size. In other words, the underlying target measure \(p_X\) is available only through the empirical measure \(\frac{1}{N}\sum _{n=1}^{N} \delta _{x^{(n)}}\). During the training defined by (4), areas of low probability possibly characterizing a multi-modal target measure are likely interpreted as areas of null probability observed in the empirical measure. This directly results in the topological mismatch discussed above. Thus, even when targeting a distribution \(p_X\) defined over a connected support but containing regions of vanishingly small probability mass, the learnt mapping is expected to be characterized by a Jacobian with exploding norm in these regions, see Fig. 1.

This suggests that these regions correspond to the frontiers between cells defining a partition of the latent space. Specifically, when targeting a multi-modal distribution, the learned model implicitly partitions the Gaussian latent space into disjoint subsets associated with different modes. The boundaries of these subsets correspond to regions with a high Jacobian norm, which must be avoided during sampling to prevent out-of-distribution samples. The Gaussian multi-bubble conjecture was formulated when looking for the partition of the Gaussian space with the least Gaussian-weighted perimeter. This conjecture was proven recently by Milman and Neeman (2022). Recently, Issenhuth et al. (2022) leveraged this finding to assess the optimality of the precision of GANs. They show that the precision of the generator vanishes when the number of components of the target distribution tends towards infinity.

5 NF sampling in the latent space

5.1 Local exploration of the latent space

As explained in Sect. 3.3, naive NF sampling boils down to drawing a Gaussian variable before transformation by the learnt mapping f. This strategy is expected to produce out-of-distribution samples, due to the topological mismatch between \(q_X\) and \(p_X\) discussed in Sect. 4. The proposed alternative elaborates directly on the learnt target distribution \(q_X\).

The starting point of our rationale consists in expressing a Langevin diffusion in the target space. This Markov chain Monte Carlo (MCMC) algorithm would target the distribution \(q_X\) using only the gradient of its log-likelihood \(\nabla _x\log q_X({x})\). After initializing the chain by drawing from an arbitrary distribution \({x}_0 \sim \pi _0(x)\), the updating rule writes

$$\begin{aligned} x_{k+1} \leftarrow x_k+\frac{\epsilon ^2}{2} \nabla _{x} \log q_X(x_k)+ \epsilon \xi \end{aligned}$$
(7)

where \(\xi \sim \mathcal {N}(0, I)\) and \(\epsilon >0\) is a stepsize. When \(\epsilon \rightarrow 0\) and the number of samples \(K \rightarrow \infty\), the distribution of the samples generated by the iterative procedure (7) converges to \(q_X\) under some regularity conditions. In practice, the error is negligible when \(\epsilon\) is sufficiently small and K is sufficiently large. This algorithm, referred to as the unadjusted Langevin algorithm (ULA), always accepts the sample proposed by (7), neglecting the errors induced by the discretization scheme of the continuous diffusion. To correct this bias, the Metropolis-adjusted Langevin algorithm (MALA) applies a Metropolis-Hastings step to accept or reject a sample proposed by ULA (Grenander & Miller, 1994).
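A minimal sketch of one ULA update (7) is given below; grad_log_qx stands for a generic, user-supplied evaluation of \(\nabla _x \log q_X\) and is an assumption of this illustration. MALA would simply wrap this proposal in a Metropolis-Hastings accept/reject step.

```python
import numpy as np

def ula_step(x, grad_log_qx, eps):
    """One unadjusted Langevin update (7) targeting q_X in the data space."""
    noise = np.random.randn(*x.shape)
    return x + 0.5 * eps**2 * grad_log_qx(x) + eps * noise
```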

Again, sampling according to \(q_X\) thanks to the diffusion (7) is likely to be inefficient due to the expected complexity of the target distribution, possibly defined over a subspace of \(\mathbb {R}^d\). In particular, this strategy suffers from the lack of prior knowledge about the location of the mass. Conversely, the proposed approach explores the latent space by leveraging the closed-form change of variables (2) operated by the trained NF. After technical derivations reported in Appendix C.2, the counterpart of the diffusion (7) expressed in the latent space writes

$$\begin{aligned} z^{\prime } = z_k + \frac{\epsilon ^2}{2} G^{-1}(z_k) \nabla _{z} \log \tilde{q}_Z(z_k) + \epsilon \sqrt{G^{-1}(z_k)} \xi \end{aligned}$$
(8)

where

$$\begin{aligned} \tilde{q}_Z(z) = q_Z(z) \left| J_{f}(z)\right| ^{-1} \end{aligned}$$
(9)

and

$$\begin{aligned} G^{-1}(z) = \left[ J_f^{-1}(z)\right] ^2. \end{aligned}$$
(10)

Note that the distribution \(\tilde{q}_Z\) in (9) originates from the change of variables that defines \(q_X\) in (2) and has already been implicitly introduced by (6) in Sect. 4. Interestingly, the matrix \(G(\cdot )\) is positive definite (see Appendix B). Thus the diffusion (8) characterizes a Riemannian manifold Langevin dynamics where \(G(\cdot )\) is the Riemannian metric associated with the latent space (Girolami & Calderhead, 2011; Xifara et al., 2014). More precisely, it defines the conventional proposal move of the Riemann manifold Metropolis-adjusted Langevin algorithm (RMMALA), which targets the distribution \(\tilde{q}_Z\) defined by (9). This distribution is explicitly defined through the Jacobian \(J_{f}(\cdot )\) of the transformation, whose behavior has been discussed in depth in Sect. 4. It can be interpreted as the Gaussian latent distribution \(q_Z\) tempered by the (determinant of the) Jacobian of the transformation. It has also been evidenced by the heatmap of (6) depicted in Fig. 1b, which shows that it is a better approximation of \(p_Z\) than \(q_Z\). Since it governs the drift of the diffusion through the gradient of its logarithm, the diffusion is expected to escape from the areas where the determinant of the Jacobian explodes, see Sect. 4.
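Since \(\tilde{q}_Z\) and its score drive the diffusion (8), the following sketch shows how its log-density (9) and a finite-difference surrogate of its gradient can be evaluated from the forward pass only. In practice automatic differentiation would be used instead of finite differences, and flow_forward is the hypothetical interface assumed in the earlier sketches.

```python
import numpy as np

def log_qz_tilde(flow_forward, z):
    """log of the tempered latent density (9): log q_Z(z) - log|J_f(z)| (up to a constant)."""
    _, log_det_Jf = flow_forward(z)
    return -0.5 * (z ** 2).sum(axis=-1) - log_det_Jf

def grad_log_qz_tilde(flow_forward, z, h=1e-4):
    """Central finite-difference gradient of the tempered log-density (illustration only)."""
    g = np.zeros_like(z)
    for i in range(z.shape[-1]):
        e = np.zeros_like(z)
        e[..., i] = h
        g[..., i] = (log_qz_tilde(flow_forward, z + e)
                     - log_qz_tilde(flow_forward, z - e)) / (2.0 * h)
    return g
```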

The proposal kernel \(g(z^{\prime }|z)\) associated with the diffusion (8) is a Gaussian distribution whose probability density function (pdf) can be conveniently rewritten as (see Property 5 in Appendix C.2)

$$\begin{aligned} g\left( z^{\prime } \mid z_k\right) \propto |J_{f}(z_k)| \exp \left[ -\frac{1}{2 \epsilon ^{2}} \left\| J_{f}(z_k)(z^{\prime }-z_k) + \frac{\epsilon ^2}{2} \tilde{s}_Z(z_k) \right\| ^2\right] \end{aligned}$$
(11)

where \(\tilde{s}_Z(\cdot )\) denotes the so-called latent score

$$\begin{aligned} \tilde{s}_Z(z) = J_f^{-1}(z) \nabla _{z} \log \tilde{q}_Z(z). \end{aligned}$$
(12)

The sample proposed according to (8) is then accepted with probability

$$\begin{aligned} \alpha _{\textrm{RMMALA}}(z_k,z^{\prime }) =\min \left( 1, \frac{\tilde{q}_Z\left( z^{\prime }\right) g\left( z_k \mid z^{\prime }\right) }{\tilde{q}_Z\left( z_k\right) g\left( z^{\prime } \mid z_k\right) }\right) . \end{aligned}$$
(13)

It is worth noting that the formulation (11) of the proposal kernel leads to a significantly faster implementation than its canonical formulation. Indeed, it does not require computing the metric \(G^{-1}(\cdot )\) defined by (10), which depends twice on the inverse of the Jacobian matrix. Moreover, the evaluation of the latent score (12) can be achieved in an efficient manner, bypassing the need for evaluating the inverse of the Jacobian matrix, as elaborated in Appendix C.3.2. Finally, only the Jacobian associated with the forward transformation is required to compute (11). This approach enables a streamlined calculation of the acceptance ratio (13), ensuring overall computational efficiency.

Besides, the proposal scheme (8) requires generating high dimensional Gaussian variables with covariance matrix \(\epsilon ^2 G^{-1}(\cdot )\) (Vono et al., 2022). To lighten the corresponding computational burden, we take advantage of a first-order expansion of \(f^{-1}\) to approximate (8) by the diffusion (see Appendix C.3.1)

$$\begin{aligned} z^{\prime } = f^{-1}\left( f(z_k) +\epsilon \xi \right) + \frac{\epsilon ^2}{2} J_f^{-1}(z_k) \tilde{s}_Z(z_k). \end{aligned}$$
(14)

According to (14), this alternative proposal scheme requires generating high dimensional Gaussian variables with an identity covariance matrix, which is much cheaper. Moreover, it is worth noting that i) the latent score \(\tilde{s}_Z(\cdot )\) can be evaluated efficiently (see above) and ii) using \(J_{f}^{-1}(z)=J_{f^{-1}}(f(z))\) (see Property 2 in Appendix C.1), sampling \(z^{\prime }\) according to (14) only requires evaluating the Jacobian associated with the backward transformation. Proofs and implementation details are reported in Appendix C. The algorithmic procedure to sample according to this kernel, denoted \(\mathcal {K}_{\textrm{RMMALA}}(\cdot )\), is summarized in Algorithm 1.

Algorithm 1

Sampling kernel \(\mathcal {K}_{\textrm{RMMALA}}(\cdot )\).
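The exact kernel \(\mathcal {K}_{\textrm{RMMALA}}\) relies on the proposal (14) and the acceptance ratio (13) with the kernel (11). As a simplified, clearly labeled stand-in, the sketch below implements a plain (Euclidean) MALA step targeting \(\tilde{q}_Z\), dropping the Riemannian preconditioning \(G^{-1}\) and the efficient score evaluation of Appendix C; it reuses the hypothetical helpers log_qz_tilde and grad_log_qz_tilde sketched in Sect. 5.1.

```python
import numpy as np
# log_qz_tilde and grad_log_qz_tilde as sketched in Sect. 5.1

def mala_kernel_latent(z, flow_forward, eps):
    """One Euclidean MALA step targeting the tempered latent density (9).

    Simplified stand-in for Algorithm 1: the Riemannian preconditioning of (8)
    is dropped, but the target and the Metropolis correction follow the same logic.
    """
    grad = grad_log_qz_tilde(flow_forward, z[None, :])[0]
    z_prop = z + 0.5 * eps**2 * grad + eps * np.random.randn(*z.shape)
    grad_prop = grad_log_qz_tilde(flow_forward, z_prop[None, :])[0]

    def log_kernel(za, zb, grad_b):
        # log g(za | zb) for a proposal centered at zb + (eps^2 / 2) * grad_b
        diff = za - zb - 0.5 * eps**2 * grad_b
        return -0.5 * float((diff ** 2).sum()) / eps**2

    log_alpha = (float(log_qz_tilde(flow_forward, z_prop[None, :])[0]
                       - log_qz_tilde(flow_forward, z[None, :])[0])
                 + log_kernel(z, z_prop, grad_prop)
                 - log_kernel(z_prop, z, grad))
    return z_prop if np.log(np.random.rand()) < log_alpha else z
```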

5.2 Independent Metropolis-Hastings sampling

Handling distributions that exhibit several modes or are defined on a complex multi-component topology is another major issue raised by the problem addressed here. In practice, conventional sampling schemes such as those based on Langevin dynamics fail to explore the full distribution when modes are isolated since they may get stuck around one of these modes. Thus, the samples proposed according to (8) in areas with high values of \(\Vert J_f(\cdot ) \Vert\) are expected to be rejected. These areas have been identified in Sect. 4 as the low probability regions between modes when targeting a multimodal distribution. To alleviate this problem, one strategy consists in resorting to another kernel to propose moves from one high probability region to another, without requiring to cross the low probability regions. Following this strategy, this paper proposes to combine the diffusion (8) with an independent Metropolis-Hastings (I-MH) kernel using the distribution \(q_Z\) as a proposal. The corresponding acceptance ratio writes

$$\begin{aligned} \begin{aligned} \alpha _{\text{I-MH}}(z_k,z^{\prime })&= \min \left( 1, \frac{\tilde{q}_Z\left( z^{\prime }\right) q_Z(z_{k})}{\tilde{q}_Z\left( z_{k}\right) q_Z(z^{\prime })}\right) \\&= \min \left( 1, \frac{|J_{f}(z_k)|}{|J_{f}(z^{\prime })|}\right) . \end{aligned} \end{aligned}$$
(15)

It is worth noting that this probability of accepting the proposed move only depends on the ratio between the Jacobians evaluated at the current and the candidate states. In particular, candidates located in regions of the latent space characterized by exploding Jacobians in case of a topological mismatch (see Sect. 4) are expected to be rejected with high probability. Conversely, this kernel will favor moves towards other high probability regions not necessarily connected to the regions of the current state. The algorithmic procedure is sketched in Algorithm 2.

Algorithm 2

Sampling kernel \(\mathcal {K}_{\mathrm {I-MH}}(\cdot )\).
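A minimal sketch of the kernel \(\mathcal {K}_{\mathrm {I-MH}}\) follows; thanks to (15), the acceptance probability only involves the forward log-Jacobians at the current and candidate states. Here flow_forward is the hypothetical interface used throughout these sketches.

```python
import numpy as np

def imh_kernel_latent(z, flow_forward):
    """One independent Metropolis-Hastings move with q_Z = N(0, I) as proposal.

    The acceptance probability (15) reduces to the ratio of Jacobian
    determinants evaluated at the current and candidate states.
    """
    z_prop = np.random.randn(*z.shape)
    _, log_det_curr = flow_forward(z[None, :])
    _, log_det_prop = flow_forward(z_prop[None, :])
    log_alpha = float(log_det_curr[0] - log_det_prop[0])   # log(|J_f(z_k)| / |J_f(z')|)
    return z_prop if np.log(np.random.rand()) < log_alpha else z
```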

Finally, the overall proposed sampler, referred to as NF-SAILS for NF SAmpling In the Latent Space and summarized in Algorithm 3, combines the transition kernels \(\mathcal {K}_{\textrm{RMMALA}}\) and \(\mathcal {K}_{\mathrm {I-MH}}\), which permits an efficient exploration of the latent space both locally and globally. At each iteration k of the sampler, the RMMALA kernel \(\mathcal {K}_{\textrm{RMMALA}}\) associated with the acceptance ratio (13) is selected with probability p and the I-MH kernel \(\mathcal {K}_{\mathrm {I-MH}}\) associated with the acceptance ratio (15) is selected with probability \(1-p\). Again, one would like to emphasize that the proposed strategy does not depend on the NF architecture and can be adopted to sample from any pretrained NF model.

Algorithm 3

NF-SAILS: NF SAmpling In the Latent Space
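Putting the two kernels together, a sketch of the overall NF-SAILS loop follows. It reuses the hypothetical kernels sketched above (with the simplified MALA step standing in for \(\mathcal {K}_{\textrm{RMMALA}}\)), mixes them with probability p as in Algorithm 3, and maps the latent chain to the data space at the end.

```python
import numpy as np
# mala_kernel_latent and imh_kernel_latent as sketched above

def nf_sails_sampler(flow_forward, d, n_samples, eps=0.2, p=0.7, z0=None):
    """Sketch of NF-SAILS (Algorithm 3): mix a local latent-space kernel with
    the global I-MH kernel, then push the latent chain through f."""
    z = np.random.randn(d) if z0 is None else z0
    latent_chain = []
    for _ in range(n_samples):
        if np.random.rand() < p:
            z = mala_kernel_latent(z, flow_forward, eps)   # local exploration
        else:
            z = imh_kernel_latent(z, flow_forward)         # global jumps between modes
        latent_chain.append(z)
    x, _ = flow_forward(np.stack(latent_chain))
    return x
```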

6 Experiments

This section reports performance results illustrating the efficiency of NF-SAILS through experiments based on several models and synthetic data sets. It is compared to state-of-the-art generative models known for their ability to handle multimodal distributions. These results will show that the proposed sampling strategy achieves good performance, without requiring any adaptation of the NF training procedure or resorting to non-Gaussian latent distributions. We will also confirm the relevance of the method when working on popular image data sets, namely Cifar-10 (Krizhevsky et al., 2010), CelebA (Liu et al., 2015) and LSUN (Yu et al., 2015).

To illustrate the versatility of the proposed approach w.r.t. the NF architecture, two types of coupling layers are used to build the trained NF. For the experiments conducted on the synthetic data sets, the NF architecture is RealNVP (Dinh et al., 2016). Conversely, a Glow model is used for the experiments conducted on the image data sets (Kingma & Dhariwal, 2018). However, it is worth noting that the proposed method can be applied on top of any generative model fitting multimodal distributions. Additional details regarding the training procedure are reported in Appendix D.1.

6.1 Figures-of-merit

To evaluate the performance of the NF, several figures-of-merit have been considered. When addressing bidimensional problems, we perform a Kolmogorov-Smirnov test to assess the quality of the generated samples w.r.t. the underlying true target distribution (Justel et al., 1997). The goodness-of-fit is also monitored by evaluating the mean log-likelihood of the generated samples and the entropy estimator between samples, which approximates the Kullback–Leibler divergence between empirical samples (Kraskov et al., 2004).

For applications to higher dimensional problems, such as image generation, the performances of the compared algorithms are evaluated using the Fréchet inception distance (FID) (Heusel et al., 2017) using a classifier pre-trained specifically on each data set. Besides, for completeness, we report the bits per dimension (bpd) (Papamakarios et al., 2017), i.e., the log-likelihoods in the logit space, since this is the objective optimized by the trained models.

6.2 Results obtained on synthetic data set

As a first illustration of the performance of NF-SAILS, we consider learning a mixture of k bidimensional Gaussian distributions, with \(k\in \left\{ 2, 3, 4, 6, 9\right\}\). The NF model \(f(\cdot )\) is a RealNVP (Dinh et al., 2016) composed of \(M=4\) flows, each made of two three-layer neural networks (\(d \rightarrow 16 \rightarrow 16 \rightarrow d\)) with hyperbolic tangent activation functions. We use the Adam optimizer with learning rate \(10^{-4}\) and a batch size of 500 samples.

Table 1 Goodness-of-fit of the generated samples w.r.t. the number k of Gaussians

Table 1 reports the considered metrics when comparing the proposed NF-SAILS sampling method to a naive sampling (see Sect. 3.3) or to state-of-the-art sampling techniques from the literature, namely Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017) and denoising diffusion probabilistic models (DDPM) (Ho et al., 2020). These results show that NF-SAILS consistently competes favorably against the compared methods, in particular as the degree of multimodality of the distribution increases. Note that WGAN-GP exploits a GAN architecture. Thus, contrary to the proposed NF-based sampling method, it is unable to provide an explicit evaluation of the likelihood, which explains the N/A values in the table.

Fig. 3

Mixture of \(k=6\) Gaussian distributions (green), and 1000 generated samples (blue). The proposed NF-SAILS method in Fig. 3b does not generate samples in-between modes

Figure 3 illustrates this result for \(k=6\) and shows that our method considerably reduces the number of out-of-distribution generated samples. Additional results are reported in Appendix D.2.

Figure 4 depicts the samples generated when using a single kernel of the proposed NF-SAILS algorithm independently, i.e., when a single I-MH kernel \(\mathcal {K}_{\mathrm {I-MH}}\) (left panel) or a single RMMALA kernel \(\mathcal {K}_{\textrm{RMMALA}}\) (middle and right panels) is used. It also shows the impact of the stepsize on the local exploration performed by the RMMALA kernel. For various parameter settings, the effective sample size (ESS) and the rejection rate (\(p_{\textrm{reject}}\)) are reported in the associated table. Using the single I-MH kernel (\(p=0\)) leads to a good exploration and a good effective sample size (ESS); however it is not very efficient due to a high rejection rate (\(p_{\textrm{reject}}=0.5\)). On the other hand, using only the RMMALA kernel (\(p=1\)) leads to a higher efficiency (\(p_{\textrm{reject}}=0.1\)) and a lower ESS but fails to explore all the modes. Besides, the smaller the stepsize \(\epsilon\), the less efficient the sampling, as shown in the middle and right panels. In the experiments described in this paper, this stepsize has been adjusted following the heuristic of the order of magnitude of \(d^{1/3}\), where d is the dimension of the problem, as advocated in Pillai et al. (2012). Combining the two kernels in NF-SAILS (with \(p=0.7\) and \(\epsilon =0.2\), see the last line of the table) appears to be the most efficient strategy to explore all the modes of the targeted distribution, with the best ESS-\(p_{\textrm{reject}}\) trade-off.

Fig. 4

Mixture of \(k=3\) Gaussian distributions (green): impact of the kernels and the hyperparameters p and \(\epsilon\). Left: \(p=0\), i.e., using the single \(\mathcal {K}_{\mathrm {I-MH}}\) kernel. Middle and right panels: \(p=1\), i.e., using the single \(\mathcal {K}_{\textrm{RMMALA}}\) kernel for two values of the stepsize \(\epsilon\). The table reports the ESS and the rejection rate for various combinations of the hyperparameter values. The last line of the table corresponds to the implementation of NF-SAILS adopted for this toy example (Color figure online)

6.3 Results obtained on real image data sets

We further study the performance of NF-SAILS on three different real image data sets, namely Cifar-10 (Krizhevsky et al., 2010), CelebA (Liu et al., 2015) and LSUN (Yu et al., 2015). Following the same protocol as implemented by Kingma and Dhariwal (2018), we use a Glow architecture where each neural network is composed of three convolutional layers. The two hidden layers have ReLU activation functions and 512 channels. The first and last convolutions are \(3 \times 3\), while the center convolution is \(1 \times 1\), since its input and output have a large number of channels, in contrast with the first and last convolutions. Details regarding the implementation are reported in Appendix D.3.

We compare the FID score as well as the average negative log-likelihood (bpd), keeping all training conditions constant and averaging the results over 10 Monte Carlo runs. Figure 5 reports the results compared to those obtained by naive sampling or WGAN-GP (Gulrajani et al., 2017). As shown by the different panels of this figure, the proposed NF-SAILS method considerably improves the quality of the generated images, both quantitatively (in terms of FID) and semantically. Our methodology compares favourably with WGAN-GP on the CelebA and LSUN data sets.

Fig. 5

Tables report quantitative and perceptual metrics computed from the samples generated by the compared methods. The figures show some samples generated from Glow using the proposed NF-SAILS method

7 Conclusion

This paper discusses the sampling from the target distribution learnt by a normalizing flow. Architectural constraints prevent normalizing flows from properly learning measures with disconnected support, due to the topological mismatch between the latent and target spaces. Moreover, we theoretically prove that the Jacobian norm of the transformation must become arbitrarily large for the flow to closely represent such target measures. The conducted analysis exhibits the existence of pathological areas in the latent space corresponding to points with exploding Jacobian norms. Using a naive sampling strategy leads to out-of-distribution samples located in these areas. To overcome this issue, we propose a new sampling procedure based on a Langevin diffusion directly formulated in the latent space. This sampling is interpreted as a Riemannian manifold Metropolis-adjusted Langevin algorithm, whose metric is driven by the Jacobian of the learnt transformation. This local exploration of the latent space is complemented by an independent Metropolis-Hastings kernel which allows moves from one high probability region to another while avoiding crossing pathological areas. One particular advantage of the proposed method is that it can be applied to any pre-trained NF model. Indeed, it requires neither a particular training strategy of the NF nor an adaptation of the distribution assumed in the latent space. The performance of the proposed sampling strategy compares favorably to the state of the art, with very few out-of-distribution samples.