Polyak–Łojasiewicz inequality on the space of measures and convergence of mean-field birth-death processes

The Polyak–Łojasiewicz inequality (PŁI) in ℝ^d is a natural condition for proving convergence of gradient descent algorithms (Karimi et al. in: Frasconi et al. (eds) Machine learning and knowledge discovery in databases, Springer International Publishing, Cham, pp 795–811, 2016). In the present paper, we study an analogue of the PŁI on the space of probability measures P(ℝ^d) and show that it is a natural condition for showing exponential convergence of a class of birth-death processes related to certain mean-field optimization problems. We verify the PŁI for a broad class of such problems with energy functions regularised by the KL-divergence.


Introduction
Consider a classical optimization problem, where one is interested in finding a global minimum of a differentiable function f : ℝ^d → ℝ. A natural condition on f, under which the gradient descent algorithm has a geometric convergence rate to min_{y∈ℝ^d} f(y), is the Polyak–Łojasiewicz inequality (PŁI)
(1.1)    |∇f(x)|² ≥ κ ( f(x) − min_{y∈ℝ^d} f(y) ),
required to hold with a positive constant κ > 0 for all x ∈ ℝ^d (see [14] and the references therein, or [6,5] for other variants of Łojasiewicz inequalities). It is easy to see that when f is strongly convex, (1.1) holds, but the converse is not necessarily true.
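As a concrete illustration (our own, not taken from [14] verbatim), the sketch below runs plain gradient descent on the one-dimensional function f(x) = x² + 3 sin²(x), which is often used as an example of a non-convex function believed to satisfy (1.1) with some κ > 0; the roughly constant ratio between successive values of f(x_k) − min f illustrates the geometric rate that (1.1) guarantees. The step size and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np

def f(x):
    # non-convex function with global minimum f(0) = 0, commonly cited as satisfying a PL inequality
    return x**2 + 3.0 * np.sin(x)**2

def grad_f(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

x, step = 2.0, 0.125          # step = 1/L with L = sup |f''| = 8
vals = []
for _ in range(50):
    vals.append(f(x))
    x -= step * grad_f(x)

# Under (1.1) and L-smoothness, gradient descent with step 1/L decreases f(x_k) - min f geometrically.
print([round(vals[k + 1] / vals[k], 3) for k in range(10)])   # roughly constant ratios below 1
```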
In the present paper we are concerned with an optimization problem on the space of probability measures P(ℝ^d). We consider a function V : P(ℝ^d) → ℝ, and we want to find a minimizing measure m* ∈ P(ℝ^d). Such optimization problems have attracted considerable attention in recent years, see e.g. [10,17,13,20,8]. In this setting, there exist multiple different choices of flows of probability measures (m_t)_{t≥0} that can serve as analogues of the gradient descent algorithm in ℝ^d, as well as multiple different choices of conditions on V analogous to (1.1) that can be used to prove convergence of such flows.
The main example of V considered in this paper is an energy function regularised by the KL-divergence. Consider F : P(ℝ^d) → ℝ (which can be non-linear) and a probability measure π(dx) ∝ e^{−U(x)} dx with a potential U : ℝ^d → ℝ. For any σ ≥ 0, we put
(1.2)    V^σ(m) := F(m) + (σ²/2) KL(m|π),
where for any m ∈ P(ℝ^d),
KL(m|π) := ∫_{ℝ^d} log( m(x)/π(x) ) m(x) dx if m is absolutely continuous with respect to π, and KL(m|π) := +∞ otherwise.
It is known (see e.g. Proposition 2.5 in [13]) that V^σ is minimized by a measure m^{σ,*} ∈ P(ℝ^d) satisfying
m^{σ,*}(x) = (1/Z) exp( −(2/σ²) (δF/δm)(m^{σ,*}, x) ) π(x),
where Z is the normalising constant, and for any m ∈ P(ℝ^d) and x ∈ ℝ^d, by (δF/δm)(m, x) we denote the flat derivative of F with respect to m, in the direction of x ∈ ℝ^d, evaluated at m. For any m, m′ ∈ P(ℝ^d), the function δF/δm : P(ℝ^d) × ℝ^d → ℝ satisfies
F(m′) − F(m) = ∫₀¹ ∫_{ℝ^d} (δF/δm)( m + s(m′ − m), x ) (m′ − m)(dx) ds.
See Appendix 5 for more details on flat derivatives. This notion of derivative appears in the literature under several different names, including the linear functional derivative (see Section 5.4.1 in [7]) or the first variation [2]. It is important to note that δF/δm is defined only up to a constant, i.e., for any C ∈ ℝ, the function δF/δm + C is also a flat derivative of F. Everywhere in this paper we adopt the normalizing convention ∫_{ℝ^d} (δF/δm)(m, x) m(dx) = 0, which makes the choice of the constant unique. The objective of this work is to identify a flow of measures (m_t)_{t≥0} such that V^σ(m_t) → V^σ(m^{σ,*}) as t → ∞, as well as conditions that ensure that this convergence is exponential. To this end, we equip the space P(ℝ^d) with a suitable distance function d : P(ℝ^d) × P(ℝ^d) → ℝ and consider a corresponding gradient flow, where the form of the flow is dictated by the choice of d. Our main focus is on the Fisher–Rao metric.
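To fix ideas, here is a simple worked example of a flat derivative and of the normalizing convention (our own illustration; it is not one of the energy functions analysed in this paper). Take F(m) = ½ ∫∫_{ℝ^d×ℝ^d} K(x, y) m(dx) m(dy) for a bounded, symmetric kernel K. A direct computation along the segment m + s(m′ − m) shows that (δF/δm)(m, x) = ∫_{ℝ^d} K(x, y) m(dy) up to an additive constant, and the normalizing convention fixes that constant, so that
(δF/δm)(m, x) = ∫_{ℝ^d} K(x, y) m(dy) − ∫∫_{ℝ^d×ℝ^d} K(y, z) m(dy) m(dz),
which indeed satisfies ∫_{ℝ^d} (δF/δm)(m, x) m(dx) = 0.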
Fisher–Rao Gradient Flow. Let P_ac(ℝ^d) be the space of probability measures on ℝ^d that are absolutely continuous with respect to the Lebesgue measure. The Fisher–Rao distance FR(µ_0, µ_1) between µ_0, µ_1 ∈ P_ac(ℝ^d) admits the following dynamic representation (see e.g. Section 2.2 in [12] and the references therein): for any µ_0, µ_1 ∈ P_ac(ℝ^d),
FR(µ_0, µ_1)² = inf ∫₀¹ ∫_{ℝ^d} |α_t(x)|² m_t(dx) dt,
where the infimum is taken over all curves [0, 1] ∋ t ↦ (m_t, α_t), with α_t ∈ L²(ℝ^d; m_t), solving the reaction equation ∂_t m_t(x) = α_t(x) m_t(x) in the distributional sense, such that t ↦ m_t is weakly continuous with endpoints µ_0 and µ_1. This representation tells us that measures in the space (P_ac(ℝ^d), FR) are transported along curves prescribed by a birth-death (or reaction) equation. The main focus of this work is to identify a corresponding Polyak–Łojasiewicz inequality from which we can deduce the exponential convergence to m^{σ,*} of the flow (m_t)_{t≥0} described by the birth-death equation
(1.4)    ∂_t m_t(x) = −a(m_t, x) m_t(x),    where    a(m, x) := (δF/δm)(m, x) + (σ²/2) log( m(x)/π(x) ) − (σ²/2) KL(m|π).
Note that the map (m, x) ↦ a(m, x) formally corresponds to (δV^σ/δm)(m, x), which may not exist since the KL-divergence is only lower semi-continuous. Nevertheless, (m, x) ↦ a(m, x) is a well-defined function under the assumption of flat differentiability of F (note that the term KL(m|π) in (1.4) is exactly the normalizing constant needed for the normalizing convention mentioned above).
To see why the particular form of (m, x) ↦ a(m, x) in (1.4) is a good choice, one needs to show that t ↦ V^σ(m_t) is differentiable, so that the energy dissipation identity
(1.5)    d/dt V^σ(m_t) = −∫_{ℝ^d} |a(m_t, x)|² m_t(dx)
holds. The Polyak–Łojasiewicz condition that implies the exponential convergence of V^σ(m_t) to V^σ(m^{σ,*}) requires that there exists a constant κ > 0 such that for any m* ∈ arg min_m V^σ(m) and any m ∈ P(ℝ^d),
(1.6)    ∫_{ℝ^d} |a(m, x)|² m(dx) ≥ κ ( V^σ(m) − V^σ(m*) ).
We call (1.6) the flat Polyak–Łojasiewicz condition, since the function a(m, x) formally corresponds to the flat derivative of V^σ, as explained above. With such an inequality at hand, one immediately sees that
V^σ(m_t) − V^σ(m^{σ,*}) ≤ ( V^σ(m_0) − V^σ(m^{σ,*}) ) e^{−κt}
holds for any t ≥ 0.
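Spelled out, the last step is a one-line Grönwall argument: writing u(t) := V^σ(m_t) − V^σ(m^{σ,*}) and combining (1.5) with (1.6), we get
u′(t) = −∫_{ℝ^d} |a(m_t, x)|² m_t(dx) ≤ −κ u(t),
and hence u(t) ≤ u(0) e^{−κt} for all t ≥ 0.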
The main contributions of this work are:
• We establish the existence and uniqueness of the non-linear, infinite-dimensional birth-death flow (1.4).
• We demonstrate that t ↦ V^σ(m_t) is differentiable, which implies that the energy dissipation equality (1.5) holds.
• We show that for a large class of energy functions V^σ, the Polyak–Łojasiewicz condition (1.6) can be verified under relatively mild assumptions.
We remark that showing the existence of a solution to (1.4) is non-trivial, since the problem is non-linear and the coefficient a(m, x) contains two terms that are difficult to control: the flat derivative of F and the KL-divergence. Even if one assumes a priori that the former is bounded, it is still unclear how to control the latter. We deal with this problem by introducing a Picard iteration scheme approximating (1.4), and then analysing the symmetrised KL-divergence (rather than just the plain KL-divergence) along that auxiliary flow (see Lemmas 3.1 and 3.2). This approach also allows us to obtain bounds on the Radon–Nikodym derivative of m_t with respect to π (see Theorem 2.1) that are crucial for proving the differentiability of t ↦ V^σ(m_t) (Theorem 2.2), as well as for establishing the Polyak–Łojasiewicz condition (1.6) in Theorem 2.3.
It can be shown that the birth-death flow (1.4) is a limit of a minimising movement scheme (see e.g. [22]), defined for τ > 0 as
m^τ_{n+1} ∈ arg min_{m ∈ P(ℝ^d)} { V^σ(m) + (1/(2τ)) FR(m, m^τ_n)² }.
Indeed, recalling that (m_t, x) ↦ a(m_t, x) defined in (1.4) formally corresponds to (δV^σ/δm)(m_t, x) and using Proposition 2.5 in [13], one can check that the first-order optimality condition of this scheme is an implicit Euler discretisation of (1.4). Similarly, one can consider the mirror descent algorithm, recently studied in [3] for the problem of optimization over the space of measures; using Proposition 2.5 in [13] again, its update can be seen as an explicit Euler discretisation of (1.4) (a numerical sketch of such an explicit scheme is given below). Note that Theorem 4 in [3] shows convergence of their energy function evaluated at μ_n under certain strong convexity assumptions, whereas we work with the (measure-space version of the) Polyak–Łojasiewicz inequality. In this context our results provide a natural extension of convergence results for mirror descent algorithms on ℝ^d, which are known to converge under the classical PŁI (1.1), see [19].
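For illustration, the following toy sketch (our own; it is not the scheme analysed in this paper) implements such an explicit Euler step for (1.4) on a fixed grid in dimension one, in the linear case F = 0, where a(m, x) = (σ²/2)(log(m(x)/π(x)) − KL(m|π)). The grid, step size, number of iterations and the final re-normalisation (used only to control discretisation error) are arbitrary choices made for the example.

```python
import numpy as np

sigma, tau, n_steps = 1.0, 0.05, 400
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]

U = 0.5 * x**2                                          # quadratic potential
pi = np.exp(-2.0 / sigma**2 * U); pi /= pi.sum() * dx   # pi proportional to exp(-2U/sigma^2)

m = np.exp(-0.5 * (x - 2.0)**2); m /= m.sum() * dx      # initial density ("warm start" style)

for _ in range(n_steps):
    log_ratio = np.log(m / pi)
    kl = np.sum(m * log_ratio) * dx                      # KL(m|pi) on the grid
    a = 0.5 * sigma**2 * (log_ratio - kl)                # a(m, x) in the case F = 0
    m = m * (1.0 - tau * a)                              # explicit Euler birth-death step
    m = np.clip(m, 1e-300, None)
    m /= m.sum() * dx                                    # re-normalise (mass is only preserved approximately)

print("KL(m_T | pi) ~", np.sum(m * np.log(m / pi)) * dx)  # typically small: m_T is close to pi
```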
The remaining part of the paper is organised as follows. In Section 2 we formulate our main results and the assumptions we work with. In Section 2.2 we present a result on the verification of the flat Polyak–Łojasiewicz inequality (1.6) for general energy functions (not necessarily of the form (1.2)) under certain quadratic growth conditions. This section is of independent interest and can be seen as a counterpart of the results proved in ℝ^d in [14], or of the results proved on the space of measures in [5] for a quadratic growth condition with respect to the L²-Wasserstein distance (while we work with the KL-divergence and the χ²-divergence). In Section 2.3 we review the literature and present a more in-depth discussion of the motivation for studying the gradient flow (1.4). In Section 3 we prove our main results on the existence of the gradient flow and the differentiability of the energy function. Finally, the Appendix includes some general auxiliary results on comparing different f-divergences, adapted from [11], and a brief overview of the notion of the flat derivative.

Main results

2.1. Existence of the birth-death flow and its convergence under the flat Polyak–Łojasiewicz condition. We work with the energy function V^σ : P(ℝ^d) → ℝ given by (1.2), for some possibly non-linear F : P(ℝ^d) → ℝ and σ > 0. Recall that F is called convex if it has a flat derivative and for all m, m′ ∈ P(ℝ^d),
(2.1)    F(m′) − F(m) ≥ ∫_{ℝ^d} (δF/δm)(m, x) (m′ − m)(dx).
We have the following assumptions on F.

Assumption 1. Suppose F has first and second order flat derivatives satisfying the following conditions.
(1) δF/δm : P(ℝ^d) × ℝ^d → ℝ and δ²F/δm² : P(ℝ^d) × ℝ^d × ℝ^d → ℝ.
(2) There exists a constant C > 0 such that for all m ∈ P(ℝ^d) and for all x ∈ ℝ^d we have
(2.2)    |(δF/δm)(m, x)| ≤ C.
(3) There exists a constant C_2 > 0 such that for all m ∈ P(ℝ^d) and for all x, y ∈ ℝ^d we have
(2.3)    |(δ²F/δm²)(m, x, y)| ≤ C_2.

Furthermore, suppose we have absolutely continuous probability measures π, m_0 ∈ P(ℝ^d) such that π(dx) ∝ exp( −(2/σ²) U(x) ) dx for a potential U : ℝ^d → ℝ, and that the following conditions are satisfied.

Assumption 2. Suppose m_0 ∈ P(ℝ^d) is absolutely continuous and comparable with π in the following sense.
(1) There exists a constant r > 0 such that
(2.4)    inf_{x∈ℝ^d} m_0(x)/π(x) ≥ r.
(2) There exists a constant R > 1 such that
(2.5)    sup_{x∈ℝ^d} m_0(x)/π(x) ≤ R.
Note that here π is just a reference measure, and recall that the actual measure of interest (the minimizer of V^σ) is given implicitly by
m^{σ,*}(x) = (1/Z) exp( −(2/σ²) (δF/δm)(m^{σ,*}, x) ) π(x),
where Z is the normalizing constant. We immediately observe that, under condition (2.2), conditions (2.4) and (2.5) together are equivalent to assuming that there exist constants r′ > 0, R′ > 1 such that for all x ∈ ℝ^d,
(2.6)    r′ ≤ m_0(x)/m^{σ,*}(x) ≤ R′.
As we will explain in more detail in Subsection 2.3, Assumption 2 is a kind of "warm start" condition that says that once we fix the reference measure π in (1.2), the initial measure m_0 of our gradient flow should be comparable to π. We have the following result.

Theorem 2.1. Under Assumption 1 and condition (2.5) from Assumption 2, there exists a unique solution (m_t)_{t≥0} to (1.4). Moreover, for all t ≥ 0,
(2.7)    KL(m_t|π) ≤ 2 log R + 4C/σ²,
and there exists a constant R_1 > 1 such that for all t ≥ 0,
(2.8)    sup_{x∈ℝ^d} m_t(x)/π(x) ≤ R_1.
If we additionally assume that condition (2.4) from Assumption 2 holds, then there exists a constant r_1 > 0 such that for all t ≥ 0,
(2.9)    inf_{x∈ℝ^d} m_t(x)/π(x) ≥ r_1.

As we explained in the discussion in Section 1, the crucial property needed for showing the exponential convergence of (m_t)_{t≥0} is the differentiability of the energy function along the gradient flow.
Theorem 2.2. Under Assumption 1 and condition (2.5) from Assumption 2, for the unique solution (m_t)_{t≥0} to (1.4), the function t ↦ V^σ(m_t) is differentiable and
(2.10)    d/dt V^σ(m_t) = −∫_{ℝ^d} |a(m_t, x)|² m_t(dx).

Note that inequalities (2.8) and (2.9) obtained in Theorem 2.1 imply that there exist constants r′_1 > 0 and R′_1 > 1 such that for all t ≥ 0 and all x ∈ ℝ^d,
r′_1 ≤ m_t(x)/m^{σ,*}(x) ≤ R′_1
(similarly to how (2.4) and (2.5) imply (2.6)). This property will be crucial in the proof of the following Polyak–Łojasiewicz inequality.
Theorem 2.3. Under Assumptions 1 and 2, and assuming additionally that F is convex, there exists a constant κ > 0 such that the unique solution (m_t)_{t≥0} to (1.4) satisfies, for all t ≥ 0,
(2.11)    ∫_{ℝ^d} |a(m_t, x)|² m_t(dx) ≥ κ ( V^σ(m_t) − V^σ(m^{σ,*}) ).

Using Theorem 2.3, based on the discussion in Section 1, we have the following result.

Corollary 2.4. Under the assumptions of Theorem 2.3, for all t ≥ 0,
V^σ(m_t) − V^σ(m^{σ,*}) ≤ ( V^σ(m_0) − V^σ(m^{σ,*}) ) e^{−κt}.
The proofs of all the results formulated above are postponed to Section 3. In Subsection 2.2 we explain how to deduce the Polyak–Łojasiewicz inequality (2.11) for a general class of energy functions that satisfy a certain growth condition with respect to the KL-divergence. We now formulate a lemma in which we verify that growth condition for the energy function V^σ used in this subsection (given by (1.2)).
Lemma 2.5. For V^σ given by (1.2), if F is convex, then V^σ satisfies the quadratic growth condition
V^σ(m) − V^σ(m^{σ,*}) ≥ (σ²/2) KL(m|m^{σ,*})
for any m ∈ P(ℝ^d).
Proof. The proof is a straightforward extension of the proof of Proposition 1 in [18], where this was shown for V = F + H, with H the negative entropy. By convexity of F, for any probability measures m, m′ ∈ P(ℝ^d) we get
V^σ(m′) − V^σ(m) ≥ ∫_{ℝ^d} a(m, x) (m′ − m)(dx) + (σ²/2) KL(m′|m),
where we used the identity KL(m′|π) = KL(m|π) + ∫_{ℝ^d} log( m(x)/π(x) ) (m′ − m)(dx) + KL(m′|m). Taking m = m^{σ,*} in the above calculation finishes the proof, since a(m^{σ,*}, ·) is constant by Proposition 2.5 in [13], so that the first term on the right hand side vanishes.
Note that we call the growth condition in Lemma 2.5 quadratic, since the KL-divergence corresponds to the square of a distance on the space of measures (compare this to condition (2) for θ = 1/2 in [5], which considered a similar growth condition with the L²-Wasserstein distance, and see the discussion below Remark 2.9 for more details).
2.2. Verification of the flat Polyak–Łojasiewicz condition in a general setting. In this subsection we adapt the proof of Theorem 2 in [14] to the setting of the space of measures. In [14] it was shown how the classical Polyak–Łojasiewicz inequality (1.1) for functions on ℝ^d can be inferred from a certain type of quadratic growth condition. Here we work with functions on P(ℝ^d) and carry out a similar argument, based on certain quadratic growth conditions expressed in terms of either the KL-divergence or the χ²-divergence, where the latter is defined for any absolutely continuous m, m′ ∈ P(ℝ^d) by
χ²(m|m′) := ∫_{ℝ^d} ( m(x)/m′(x) − 1 )² m′(x) dx
if m is absolutely continuous with respect to m′, and χ²(m|m′) := +∞ otherwise. This result can be interpreted as an analogue of Theorem 1 in [5], which showed that a certain type of Łojasiewicz inequality can be inferred from a quadratic growth condition with respect to the L²-Wasserstein distance. We present our reasoning in a series of lemmas.
Lemma 2.6. Suppose that G : P(ℝ^d) → ℝ has a first order flat derivative and that G is convex (cf. (2.1)). Then for any absolutely continuous probability measures m, m′ ∈ P(ℝ^d),
G(m) − G(m′) ≤ ( ∫_{ℝ^d} |(δG/δm)(m, x)|² m(dx) )^{1/2} χ²(m′|m)^{1/2}.

Proof. Since ∫_{ℝ^d} (δG/δm)(m, x) m(x) dx = 0 by convention, from the convexity condition (2.1) we get
G(m) − G(m′) ≤ ∫_{ℝ^d} (δG/δm)(m, x) (m − m′)(dx) = ∫_{ℝ^d} (δG/δm)(m, x) ( 1 − m′(x)/m(x) ) m(dx).
A simple application of the Cauchy–Schwarz inequality in L²(m) proves the desired assertion.
Next we need a lemma that allows us to compare the χ²-divergence and the KL-divergence between two absolutely continuous measures whose ratio of densities is bounded from above and below.

Lemma 2.7. Suppose we have absolutely continuous m, m′ ∈ P(ℝ^d) and constants r, R > 0 such that for any x ∈ ℝ^d we have
r ≤ m(x)/m′(x) ≤ R.
Then we have
2r KL(m|m′) ≤ χ²(m|m′) ≤ 2R KL(m|m′)    and    (1/R) KL(m|m′) ≤ KL(m′|m) ≤ (1/r) KL(m|m′).

Proof. The proof can be adapted from the proofs of Proposition 1 and Proposition 2 in [11], which covered the case of discrete probability measures. For completeness, we include the proof in Section 4.
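As a quick numerical sanity check of the comparisons in Lemma 2.7 (our own check, not part of the paper's argument), one can test the stated inequalities on random discrete probability vectors, for which all the divergences reduce to finite sums and the ratio bounds r and R can be computed exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):                       # chi^2(p|q) = sum (p - q)^2 / q
    return float(np.sum((p - q) ** 2 / q))

for _ in range(1000):
    p = rng.uniform(0.2, 1.0, size=20); p /= p.sum()
    q = rng.uniform(0.2, 1.0, size=20); q /= q.sum()
    r, R = (p / q).min(), (p / q).max()
    eps = 1e-12
    assert 2 * r * kl(p, q) - eps <= chi2(p, q) <= 2 * R * kl(p, q) + eps
    assert kl(p, q) / R - eps <= kl(q, p) <= kl(p, q) / r + eps
print("Lemma 2.7-type comparisons hold on all random examples")
```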
Based on the above lemmas, we can show the following result.
Theorem 2.8. Suppose that G : P(ℝ^d) → ℝ has a first order flat derivative and that G is convex. Suppose further that G is minimized by an absolutely continuous measure m* and that there exists a constant λ > 0 such that for any m′ ∈ P(ℝ^d),
(2.13)    G(m′) − G(m*) ≥ λ KL(m′|m*).
Moreover, suppose that we have an absolutely continuous measure m ∈ P(ℝ^d) such that there exist constants r, R > 0 such that for any x ∈ ℝ^d we have
(2.14)    r ≤ m(x)/m*(x) ≤ R.
Then
(2.15)    G(m) − G(m*) ≤ (2R/(λr)) ∫_{ℝ^d} |(δG/δm)(m, x)|² m(dx).
Proof. We follow the argument from the proof of Theorem 1 in [5]. Since G is assumed to be convex, from Lemma 2.6 we get
(2.16)    G(m) − G(m*) ≤ ( ∫_{ℝ^d} |(δG/δm)(m, x)|² m(dx) )^{1/2} χ²(m*|m)^{1/2}.
However, since m(x) ≥ r m*(x) for all x ∈ ℝ^d, we have χ²(m*|m) ≤ (1/r) χ²(m|m*), and due to Lemma 2.7 we have χ²(m|m*) ≤ 2R KL(m|m*), which, together with G(m) − G(m*) ≥ λ KL(m|m*), leads to
(2.17)    χ²(m*|m) ≤ (2R/(λr)) ( G(m) − G(m*) ).
Plugging (2.17) into the right hand side of (2.16), we obtain
G(m) − G(m*) ≤ ( ∫_{ℝ^d} |(δG/δm)(m, x)|² m(dx) )^{1/2} ( (2R/(λr)) ( G(m) − G(m*) ) )^{1/2},
and dividing both sides by ( G(m) − G(m*) )^{1/2} and squaring yields (2.15).
Remark 2.9. Under the assumptions of Theorem 2.8, we obtain the flat Polyak–Łojasiewicz condition of the type (1.6) with the constant
(2.18)    κ = λr/(2R).
In what follows, we will prove that the flow (m_t)_{t≥0} given by (1.4) is such that r′_1 ≤ m_t(x)/m^{σ,*}(x) ≤ R′_1 with some constants r′_1 > 0, R′_1 > 1, for all t > 0 and x ∈ ℝ^d, which will allow us to show (2.15) with G on the left hand side replaced by V^σ, and (δG/δm)(m, x) on the right hand side replaced by a(m, x) given by (1.4). This will be the basis of the proof of our main results in Section 3 and will provide us with an exponential convergence rate of V^σ(m_t) to V^σ(m^{σ,*}). We can easily observe that the convergence rate κ given by (2.18) degenerates to zero when λ → 0 or r → 0 or R → ∞.
Condition (2.13) corresponds to the classical quadratic growth condition for functions f : ℝ^d → ℝ, which can be used (see Theorem 2 in [14]) to prove the classical Polyak–Łojasiewicz inequality (1.1) under the additional assumption of convexity of f (but not necessarily strong convexity). More precisely, the quadratic growth condition in ℝ^d states that there exists μ > 0 such that
f(x) − min_{y∈ℝ^d} f(y) ≥ (μ/2) |x − x_p|²    for all x ∈ ℝ^d,
where x_p ∈ arg min_{y∈ℝ^d} f(y) denotes the projection of x onto the set of minimizers. Specifying an analogous condition for functions on the space of measures is not straightforward, as there are multiple choices of the notion of distance. Blanchet and Bolte in [5] proved that a certain type of Łojasiewicz inequality can be implied by a condition such as (2.13), but with the L²-Wasserstein distance instead of the KL-divergence; see formula (2) and Theorem 1 in [5]. Based on the proof of our Theorem 2.8, it is clear that we can also consider a quadratic growth condition with respect to the χ²-divergence with reversed arguments, i.e., we have the following result.
Corollary 2.10. Suppose that G : P(ℝ^d) → ℝ has a first order flat derivative and that G is convex. Suppose further that G is minimized by an absolutely continuous measure m* and that there exists a constant λ > 0 such that for any m′ ∈ P(ℝ^d),
(2.19)    G(m′) − G(m*) ≥ λ χ²(m*|m′).
Then for any absolutely continuous m ∈ P(ℝ^d) we have the flat Polyak–Łojasiewicz condition
∫_{ℝ^d} |(δG/δm)(m, x)|² m(dx) ≥ λ ( G(m) − G(m*) ).
The quadratic growth condition with respect to the KL-divergence (2.13) seems more natural than the one with respect to the χ²-divergence (2.19) (note that the former is verified in Lemma 2.5 for a large class of energy functions given by (1.2)). It is clear, based on Lemma 2.7, that (2.13) implies (2.19) when the ratio of the densities of m′ and m* is bounded from above and below, but we are presently unaware of any examples of energy functions that would satisfy (2.19) but not (2.13).
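For comparison, let us recall the finite-dimensional argument behind Theorem 2 in [14] that Theorem 2.8 mimics (this is the standard reasoning, sketched here for the reader's convenience). If f : ℝ^d → ℝ is differentiable, convex and satisfies the quadratic growth condition f(x) − f(x_p) ≥ (μ/2)|x − x_p|² with μ > 0, then
f(x) − f(x_p) ≤ ⟨∇f(x), x − x_p⟩ ≤ |∇f(x)| |x − x_p| ≤ |∇f(x)| ( 2( f(x) − f(x_p) )/μ )^{1/2},
so that |∇f(x)|² ≥ (μ/2)( f(x) − f(x_p) ), i.e., (1.1) holds with κ = μ/2. In Theorem 2.8, Lemma 2.6 plays the role of the convexity and Cauchy–Schwarz steps, while Lemma 2.7 compensates for the fact that the growth condition (2.13) is expressed in terms of the KL-divergence, whereas the Cauchy–Schwarz step produces the χ²-divergence.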

2.3. Literature review, connection to the Wasserstein–Fisher–Rao gradient flow and further research. In order to present our results in a broader context, let us first discuss a different type of gradient flow and the associated Łojasiewicz-type inequalities. We will also provide two heuristic examples in order to build better intuition for our approach.
2.3.1. Wasserstein Gradient Flow. The dynamic representation of the L²-Wasserstein metric W_2 due to Benamou and Brenier [4,24] states that for any µ_0, µ_1 ∈ P_2(ℝ^d),
(2.21)    W_2(µ_0, µ_1)² = inf ∫₀¹ ∫_{ℝ^d} |ν_t(x)|² m_t(dx) dt,
where the infimum is taken over all curves [0, 1] ∋ t ↦ (m_t, ν_t) ∈ P_2(ℝ^d) × L²(ℝ^d; m_t) solving ∂_t m_t + div(ν_t m_t) = 0 in the distributional sense, such that t ↦ m_t is weakly continuous with endpoints µ_0 and µ_1. This result tells us that measures in the space (P_2(ℝ^d), W_2) of probability measures with finite second moments are transported along curves described by the forward Kolmogorov PDE. One can show [13] that V^σ(m_t) is decreasing along the gradient flow (m_t)_{t≥0} satisfying
(2.22)    ∂_t m_t(x) = div( m_t(x) ∇_x a(m_t, x) ),    where    ∇_x a(m, x) = ∇_x (δF/δm)(m, x) + (σ²/2) ∇_x log( m(x)/π(x) ).
Note that this flow corresponds to the mean-field Langevin equation (see e.g. (1.4) and (1.5) in [13]); in particular, for π(dx) ∝ exp( −(2/σ²) U(x) ) dx it reads ∂_t m_t = div( m_t ( ∇_x (δF/δm)(m_t, ·) + ∇U ) ) + (σ²/2) Δ m_t, and it becomes the classical overdamped Langevin equation when F = 0. Indeed, if we can show that t ↦ V^σ(m_t) is differentiable (see e.g. [13, Theorem 2.9]), we obtain
d/dt V^σ(m_t) = −∫_{ℝ^d} |∇_x a(m_t, x)|² m_t(dx).
In the case when F is convex, and hence V^σ is strictly convex, V^σ(m_t) → V^σ(m^{σ,*}); see [13]. More recently, [18] and [9] proved, under additional structural assumptions, that this convergence is exponential. In this setting, the Polyak–Łojasiewicz condition that implies the exponential convergence V^σ(m_t) → V^σ(m^{σ,*}) requires that there exists a constant κ > 0 such that for any m* ∈ arg min_{m∈P(ℝ^d)} V^σ(m) and any t ≥ 0,
(2.24)    ∫_{ℝ^d} |∇_x a(m_t, x)|² m_t(dx) ≥ κ ( V^σ(m_t) − V^σ(m*) ).
With such an inequality at hand, one immediately sees that
d/dt ( V^σ(m_t) − V^σ(m*) ) ≤ −κ ( V^σ(m_t) − V^σ(m*) ),
and the exponential convergence follows due to the Grönwall lemma.
Example 2.11. Let F = 0 in (1.2). In this case the minimizing probability measure is m^{σ,*} = arg min_m V^σ(m) = π. Then, assuming that we can show that t ↦ KL(m_t|π) is differentiable, we have
d/dt KL(m_t|π) = −(σ²/2) ∫_{ℝ^d} | ∇_x log( m_t(x)/π(x) ) |² m_t(dx).
In this case the Polyak–Łojasiewicz inequality is just the well-known log-Sobolev inequality for π, i.e., the requirement that there exists a constant κ > 0 such that for all m ∈ P_ac(ℝ^d),
∫_{ℝ^d} | ∇_x log( m(x)/π(x) ) |² m(dx) ≥ κ KL(m|π).

Example 2.12. Let us consider an example with a different type of energy function.
Here the Polyak–Łojasiewicz inequality becomes the Poincaré inequality for π.

2.3.2. Wasserstein–Fisher–Rao Gradient Flow. One can also combine the Langevin (Wasserstein) dynamics discussed above with the birth-death (Fisher–Rao) dynamics (1.4), which leads to the Wasserstein–Fisher–Rao gradient flow
(2.27)    ∂_t m_t(x) = div( m_t(x) ∇_x a(m_t, x) ) − a(m_t, x) m_t(x).
Flows of this type have been the subject of intensive research over the last few years [15,12,16,21]. If we can show the existence of such a flow, and the differentiability of t ↦ V^σ(m_t), one can then check that
d/dt V^σ(m_t) = −∫_{ℝ^d} ( |∇_x a(m_t, x)|² + |a(m_t, x)|² ) m_t(dx).
If the corresponding Polyak–Łojasiewicz conditions (2.24) and (2.11) are satisfied, then the right hand side is bounded by −( σ² r_1/(4R_1) + κ )( V^σ(m_t) − V^σ(m^{σ,*}) ) and we easily obtain the exponential convergence
V^σ(m_t) − V^σ(m^{σ,*}) ≤ ( V^σ(m_0) − V^σ(m^{σ,*}) ) e^{−κ_1 t},    where    κ_1 = σ² r_1/(4R_1) + κ.
This shows that both the Langevin part and the birth-death part can independently contribute to the convergence of V^σ(m_t), if the corresponding conditions (2.24) or (2.11) are satisfied. However, the issues of the existence of (m_t)_{t≥0}, the differentiability of t ↦ V^σ(m_t) and the verification of (2.24) in general settings are all non-trivial and will be studied in our future research, together with the issue of particle system approximation of (2.27); see also the last paragraph of this section.
We note that [21] studied convergence of flows similar to (2.27). However, they covered energy functions of a very specific form (see (11) in [21]) and without regularisation by the KL-divergence. Moreover, [21] obtained an asymptotic polynomial convergence rate in their main result (Theorem 4.6) and did not address some important technical issues, such as the existence of the gradient flow and the differentiability of t ↦ V^σ(m_t).
On the other hand, [16] studied (2.27) in the linear case (F = 0) of our Example 2.11 and obtained an exponential rate of convergence to π, measured in the KL-divergence (see Theorem 3.3 therein). Interestingly, even though the authors of [16] did not explicitly make a connection to Polyak–Łojasiewicz inequalities, their proof is in fact based on showing a special case of condition (1.6) as specified above (see their inequality (2 − 2δ)H_1(f) ≤ H_2(f) in the proof of Theorem 3.3; integrate it with respect to ρ_t and note that our m_t corresponds to their ρ_t). This Polyak–Łojasiewicz inequality is verified in [16] under a positive lower bound on the ratio of densities inf_{x∈ℝ^d} ρ_t(x)/π(x), required to hold for all sufficiently large t, see (B.3) in [16]. They then use an argument based on the maximum principle (which is possible due to the Langevin component of their dynamics) to show that this condition in fact only has to hold at an initial time t_0. As a consequence, they conclude that, compared to the classical result on the exponential convergence of the Langevin dynamics to π under the log-Sobolev inequality, by adding the birth-death component to the dynamics they can dispense with the log-Sobolev assumption and replace it by a "warm start" condition inf_{x∈ℝ^d} ρ_{t_0}(x)/π(x) ≥ c for some c > 0. However, in [16] the Langevin part of the dynamics is only used to make the application of the maximum principle possible, and does not directly contribute to the convergence rate. Moreover, similarly as in [21], the questions of the existence of the gradient flow and the differentiability of t ↦ V^σ(m_t) were not addressed in [16].
In this paper we study a more general setting than [16], including non-linear functions F in the energy function V^σ in (1.2), and we rigorously prove the existence of the corresponding birth-death gradient flow (m_t)_{t≥0}, as well as the differentiability of t ↦ V^σ(m_t). We also verify the flat Polyak–Łojasiewicz inequality (1.6) and thus establish the exponential rate of convergence of V^σ(m_t) to V^σ(m^{σ,*}). Our condition guaranteeing that (1.6) holds (Assumption 2) resembles the warm start condition from [16]; however, in order to show that it propagates from t = 0 to all t > 0, we do not use the Langevin component of the dynamics, and hence we work with a "pure" birth-death dynamics (the Fisher–Rao gradient flow).
Other recent papers studying the mean-field optimization problem specified by (1.2), such as [18] and [9], focused on the Wasserstein gradient flow (2.22). Both [18] and [9] proved an exponential convergence rate of V^σ(m_t) to V^σ(m^{σ,*}) under the assumption of a log-Sobolev inequality for a class of proximal Gibbs measures related to m^{σ,*}. Compared to [18,9], working with the Fisher–Rao gradient flow allows us to remove that assumption, at the cost of introducing the additional "warm start" conditions in Assumption 2.
With all that said, we would like to point out that from the point of view of practical algorithms (which will be the subject of our future work), combining the birth-death dynamics with the Langevin dynamics seems advisable. The Wasserstein–Fisher–Rao gradient flow (2.27) can be seen as the mean-field limit of an interacting particle system that can serve as a basis for practically implementable algorithms (as studied in Section 6 of [16] and in [21]). The support of the birth-death flow does not change in time; hence, intuitively, if we do not include the diffusion component in our dynamics and we initialize it with the empirical measure of a set of particles, the dynamics will just keep re-arranging the mass between the particles but will not change their positions. Hence the convergence of such dynamics should be expected to be worse than the convergence of a particle system utilizing both the Langevin and the birth-death components. This issue is not apparent in the analysis of the mean-field limit process in the present paper (as our results use a "warm start" assumption on the initial condition), but we will investigate it in detail in our future work on particle system approximations and the corresponding algorithms. From the practical point of view, the main message of this paper is that the birth-death component of such algorithms can be defined in terms of the function a given by (1.4), which corresponds to the flat derivative of the energy function V^σ (a toy illustration of such a combined scheme is sketched below); the focus here, however, is on the theoretical analysis of the gradient flow rather than on applications.
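To make the last point concrete, here is a toy particle sketch (our own illustration, not an algorithm proposed or analysed in this paper) that combines a Langevin move with a birth-death re-weighting driven by an estimate of a(m, x) from (1.4), in the linear case F = 0 and d = 1 with π ∝ exp(−2U/σ²). The helper log_density_kde, the kernel bandwidth, the double-well potential and all step sizes are arbitrary choices made for the example; a practical algorithm would require a careful treatment of the density estimation and of resampling.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, dt, n_particles, n_steps = 1.0, 0.02, 500, 300

def U(x):                      # double-well potential, so plain Langevin mixes slowly
    return (x**2 - 1.0)**2

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

def log_density_kde(x_eval, x, w, h=0.2):
    # weighted Gaussian kernel density estimate of the current empirical measure m
    d2 = (x_eval[:, None] - x[None, :])**2
    k = np.exp(-0.5 * d2 / h**2) / (h * np.sqrt(2.0 * np.pi))
    return np.log(np.maximum(k @ w, 1e-300))

x = rng.normal(loc=1.0, scale=0.3, size=n_particles)   # all particles start near one well
w = np.full(n_particles, 1.0 / n_particles)

for _ in range(n_steps):
    # (i) Langevin step (Wasserstein part of the dynamics)
    x = x - grad_U(x) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_particles)
    # (ii) birth-death step (Fisher-Rao part): a = (sigma^2/2) * (log(m/pi) - KL(m|pi))
    log_m = log_density_kde(x, x, w)
    log_ratio = log_m + 2.0 / sigma**2 * U(x)           # pi used unnormalised; its constant is removed below
    log_ratio -= np.sum(w * log_ratio)                  # centring plays the role of the KL(m|pi) term
    w = w * np.exp(-0.5 * sigma**2 * dt * log_ratio)    # multiplicative weight update for dm = -a m dt
    w /= w.sum()

print("mass in the left well:", float(np.sum(w[x < 0.0])))   # ideally close to 0.5 for the symmetric target
```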

Existence of the gradient flow and other proofs
In order to prove the existence of a solution (m_t)_{t≥0} to (1.4), observe first that any such solution can be written in the mild form
(3.1)    m_t(x) = m_0(x) exp( −∫₀^t a(m_s, x) ds ).
Based on this formula, we will define a Picard iteration scheme. To this end, let us first fix T > 0 and choose a flow of probability measures (m^{(0)}_t)_{t∈[0,T]}. For each n ≥ 1, we fix m^{(n)}_0 = m_0 (with m_0 satisfying condition (2.5) from Assumption 2) and define (m^{(n)}_t)_{t∈[0,T]} by
m^{(n)}_t(x) = m_0(x) exp( −∫₀^t a(m^{(n−1)}_s, x) ds ).
We have the following result.

Lemma 3.1. Under Assumption 1 and condition (2.5) from Assumption 2, for every n ≥ 1 the flow (m^{(n)}_t)_{t∈[0,T]} is well-defined.

Note that, due to (2.2), the only potential issue with the definition of (m^{(n)}_t)_{t∈[0,T]} is the KL-divergence term under the integral, since a priori we do not know whether it is integrable. We will now prove by induction how to bound that term. Suppose that ∫₀^T KL(m^{(n−1)}_s|π) ds is finite (cf. (3.5)). Using (2.2) and the definition of m^{(n)}_t, one can then bound ∫₀^T KL(m^{(n)}_s|π) ds in terms of ∫₀^T KL(m^{(n−1)}_s|π) ds, and iterating this estimate n times (which produces iterated time integrals of the form ∫ ⋯ ∫ dt_{n−1}) yields a bound whose right hand side contains a constant multiplying the corresponding quantity for the initial iterate (m^{(0)}_t)_{t∈[0,T]}.
For sufficiently large n, the constant on the right hand side becomes less than 1 and the proof is complete.
We can now finalize the proof of Theorem 2.1.
Proof of Theorem 2.1.
Step 1: Existence on [0, T] and the KL bound. Passing to the limit as n → ∞ in the Picard iterations constructed above, we obtain a solution (m_t)_{t∈[0,T]} of (1.4) on [0, T] which, by the bounds established in Lemma 3.1, satisfies
(3.9)    KL(m_t|π) ≤ 2 log R + 4C/σ²    for all t ∈ [0, T];
this is the bound (2.7). In order to ensure that the solution (m_t)_{t∈[0,T]} can be extended to all t ≥ 0, we first need to prove the bound on the ratio m_t/π in (2.8).
Step 2: Ratio condition (2.8). Following the discussion from the beginning of Section 3, we see that for any t ∈ [0, T] we have
m_t(x)/π(x) = ( m_0(x)/π(x) ) exp( −∫₀^t a(m_s, x) ds ).
Using (2.2), (2.5) and (3.9), we obtain an upper bound on the right hand side that is uniform in x and t, and hence we can choose R_1 accordingly. Note that we choose R_1 > 1 purely for convenience, to ensure that log R_1 > 0 in our subsequent calculations. Obtaining a lower bound on m_t(x)/π(x) follows similarly, by using (2.4) instead of (2.5).
Step 3: Existence of the gradient flow on [0, ∞). In order to complete the proof, note that the unique solution (m_t)_{t∈[0,T]} to (3.1) can also be expressed as
m_t(x) = m_0(x) − ∫₀^t a(m_s, x) m_s(x) ds.
From (2.2), (3.9) and (2.8), we obtain, for any t ∈ [0, T], a bound of the form ‖m_t‖_{TV} ≤ ‖m_0‖_{TV} + C_V ∫₀^t ‖m_s‖_{TV} ds. This gives ‖m_t‖_{TV} ≤ ‖m_0‖_{TV} e^{C_V t}, and shows that m_t does not explode in any finite time; hence we obtain a global solution (m_t)_{t∈[0,∞)}. In particular, the bounds in (3.9), (2.8) and (2.9) hold for all t ≥ 0.
Proof of Theorem 2.2. We have the differentiability of t ↦ F(m_t) as a consequence of Assumption 1. In order to show the differentiability of t ↦ KL(m_t|π), we need to justify differentiating under the integral sign, i.e., we need a bound of the form
| ∂_t ( ( m_t(x)/π(x) ) log( m_t(x)/π(x) ) ) | ≤ g(x)
for some function g integrable with respect to π, which is sufficient by a standard result in measure theory (see e.g. Theorem 11.5 in [23]). Indeed, such a bound follows from (2.2), (2.8) and (2.7). We can now write
d/dt V^σ(m_t) = ∫_{ℝ^d} ( (δF/δm)(m_t, x) + (σ²/2) log( m_t(x)/π(x) ) ) ∂_t m_t(x) dx = ∫_{ℝ^d} a(m_t, x) ∂_t m_t(x) dx,
where the last equality follows since ∫_{ℝ^d} ∂_t m_t(x) dx = 0. Combining this with (1.4) proves (2.10).

Proof of Theorem 2.3. By Lemma 2.5, the quadratic growth condition (2.13) required in Theorem 2.8 is satisfied for m = m_t for all t > 0, with λ = σ²/2. Moreover, due to (2.8) and (2.9), the ratio condition (2.14) required in Theorem 2.8 is satisfied for m = m_t for all t > 0. Indeed, recall that by the discussion below Assumption 2, the ratio condition for m_0/π with constants r and R is equivalent to the ratio condition for m_0/m^{σ,*} with corresponding constants r′ and R′. Similarly, since due to (2.8) and (2.9) we have a bound on m_t/π for all t > 0 with constants r_1 and R_1, we can apply the argument below Assumption 2 to obtain a bound on m_t/m^{σ,*} for all t > 0, with appropriately modified constants r′_1 and R′_1. Furthermore, note that by the proof of Lemma 2.5, in the case G = V^σ, the convexity condition needed in the proof of Lemma 2.6 (and thus in Theorem 2.8) can be applied with a(m, x) in place of (δG/δm)(m, x), i.e., for any m, m′ ∈ P(ℝ^d) we have
V^σ(m′) − V^σ(m) ≥ ∫_{ℝ^d} a(m, x) (m′ − m)(dx).
As a consequence, the argument from the proof of Theorem 2.8 applies to our setting, and the flat Polyak–Łojasiewicz condition (2.11) is satisfied for all t ≥ 0.

Appendix: Relations between different f-divergences
Suppose we have absolutely continuous probability measures m, m′ ∈ P(ℝ^d) and a convex function f : [0, ∞) → ℝ with f(1) = 0. Then the f-divergence of m with respect to m′ is defined by
I_f(m|m′) := ∫_{ℝ^d} f( m(x)/m′(x) ) m′(x) dx.
For instance, choosing f(t) = t log t leads to the KL-divergence and f(t) = (t − 1)² leads to the χ²-divergence. We have the following result, adapted from Theorem 6 in [11]: if µ, ν ∈ P(ℝ^d) are absolutely continuous, f is twice differentiable on an interval (r, R) with 0 < r ≤ 1 ≤ R that contains all values of the ratio µ(x)/ν(x), and the constants a := inf_{t∈(r,R)} t f″(t) and A := sup_{t∈(r,R)} t f″(t) are finite, then
a KL(µ|ν) ≤ I_f(µ|ν) ≤ A KL(µ|ν).
Proof. Let us define a mapping F_a : (0, ∞) → ℝ given by F_a(t) := f(t) − a t log t. Then F_a satisfies F_a(1) = 0, and is twice differentiable and convex on (r, R), since F_a″(t) ≥ 0 on (r, R). Note that the f-divergence associated with a convex function F_a satisfying F_a(1) = 0 is always non-negative due to Jensen's inequality, and hence we have 0 ≤ I_{F_a}(µ|ν) = I_f(µ|ν) − a KL(µ|ν).
We now define a mapping F_A : (0, ∞) → ℝ by setting F_A(t) := A t log t − f(t). Then F_A satisfies F_A(1) = 0, and is twice differentiable and convex on (r, R), since F_A″(t) ≥ 0 on (r, R). We again use the fact that the corresponding f-divergence is non-negative, and we obtain 0 ≤ I_{F_A}(µ|ν) = A KL(µ|ν) − I_f(µ|ν), which finishes the proof.
Proof of Lemma 2.7. We first consider the mapping f_1 : (0, ∞) → ℝ given by f_1(t) = −log(t). Note that the f-divergence corresponding to f_1 is the KL-divergence with swapped arguments, i.e., for any absolutely continuous probability measures µ and ν we have I_{f_1}(µ|ν) = KL(ν|µ). Since t f_1″(t) = 1/t takes values in (1/R, 1/r) for t ∈ (r, R), the result above yields (1/R) KL(m|m′) ≤ KL(m′|m) ≤ (1/r) KL(m|m′). Similarly, for f_2(t) = (t − 1)² we have I_{f_2}(m|m′) = χ²(m|m′) and t f_2″(t) = 2t ∈ (2r, 2R) for t ∈ (r, R), which yields 2r KL(m|m′) ≤ χ²(m|m′) ≤ 2R KL(m|m′). This finishes the proof.