1 Introduction

Consider a classical optimization problem, where one is interested in finding a global minimum of a differentiable function \(f{:} \mathbb {R}^d \rightarrow \mathbb {R}\). A natural condition on f, under which the gradient descent algorithm has a geometric convergence rate to \(\min _{y\in \mathbb R^d} f(y)\), is the Polyak–Łojasiewicz inequality (PŁI)

$$\begin{aligned} \frac{1}{\kappa } \Vert \nabla f(x)\Vert ^2 \ge f(x) - \min _{y\in \mathbb R^d} f(y) \,, \end{aligned}$$
(1.1)

required to hold with a constant \(\kappa > 0\), for all \(x \in \mathbb {R}^d\) (see [14] and the references therein, or [5, 6] for other variants of Łojasiewicz inequalities). It is easy to see that when f is strongly convex, (1.1) holds, but the converse is not necessarily true.
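As a concrete numerical illustration, consider \(f(x) = x^2 + 3\sin ^2(x)\), an example discussed in [14]: it satisfies (1.1) despite being non-convex. The following minimal sketch (in Python; the starting point and the step size \(1/L\), with \(L = \sup _x f''(x) = 8\), are illustrative choices of ours) checks that gradient descent closes the optimality gap at a geometric rate:

```python
import numpy as np

# Gradient descent on f(x) = x^2 + 3 sin^2(x), a non-convex function
# satisfying the PL inequality (1.1); cf. the example discussed in [14].
f  = lambda x: x**2 + 3 * np.sin(x)**2
df = lambda x: 2 * x + 3 * np.sin(2 * x)

x, step = 3.0, 1.0 / 8.0            # step = 1/L with L = sup f'' = 8
gaps = []
for _ in range(50):
    gaps.append(f(x))               # optimality gap, since min f = 0 at x = 0
    x -= step * df(x)

gaps = np.array(gaps)
print(gaps[1:] / gaps[:-1])         # successive gap ratios stay below 1: geometric decay
```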

In the present paper we are concerned with an optimization problem on the space of probability measures \(\mathcal {P}(\mathbb {R}^d)\). We consider a function \(V: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\), and we want to find a minimizing measure \(m^* \in \mathcal {P}(\mathbb {R}^d)\). Such optimization problems have attracted considerable attention in recent years, see e.g. [8, 10, 13, 17, 20]. In this setting, there are several choices of flows of probability measures \((m_t)_{t \ge 0}\) that can serve as analogues of the gradient descent algorithm in \(\mathbb {R}^d\), as well as several conditions on V analogous to (1.1) that can be used to prove convergence of such flows.

The main example of V considered in this paper is an energy function regularised by the KL-divergence. Consider \(F: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) (which can be non-linear), a constant \(\sigma > 0\), and a probability measure \(\pi (dx) \propto e^{-\frac{2}{\sigma ^2}U(x)}dx\) with a potential \(U: \mathbb {R}^d \rightarrow \mathbb {R}\) (this normalisation of \(\pi \) is consistent with (1.3) below and with the assumptions in Section 2). We put

$$\begin{aligned} V^{\sigma }(m) = F(m) + \frac{\sigma ^2}{2} {\text {KL}}(m|\pi ) \,, \quad m\in \mathcal {P}(\mathbb {R}^d)\,, \end{aligned}$$
(1.2)

where for any \(m \in \mathcal {P}(\mathbb {R}^d)\),

$$\begin{aligned} {\text {KL}}(m|\pi ) = {\left\{ \begin{array}{ll} \int _{\mathbb {R}^d} \log \left( \frac{m(x)}{\pi (x)}\right) m(x)\, dx & m \text { absolutely continuous with respect to } \pi , \\ \infty & \text {otherwise.} \end{array}\right. } \end{aligned}$$

It is known (see e.g. Proposition 2.5 in [13]) that \(V^{\sigma }\) is minimized by a measure \(m^{\sigma ,*} \in \mathcal {P}(\mathbb {R}^d)\) satisfying

$$\begin{aligned} m^{\sigma ,*}(x) = \frac{1}{Z} \exp {\left( -\frac{2}{\sigma ^2} \left( \frac{\delta F}{\delta m} (m^{\sigma ,*},x) + U(x) \right) \right) } \,, \end{aligned}$$
(1.3)

where Z is the normalising constant, and for any \(m \in \mathcal {P}(\mathbb {R}^d)\) and \(x \in \mathbb {R}^d\), by \(\frac{\delta F}{\delta m} (m,x)\) we denote the flat derivative of F with respect to m, evaluated at the measure m and the point x. For any m, \(m' \in \mathcal {P}(\mathbb {R}^d)\), the function \(\frac{\delta F}{\delta m} : \mathcal {P}(\mathbb {R}^d) \times \mathbb {R}^d \rightarrow \mathbb {R}\) satisfies

$$\begin{aligned} F(m') - F(m) = \int _0^1 \int _{\mathbb {R}^d} \frac{\delta F}{\delta m} \left( m + \lambda (m'-m) , x \right) (m'-m)(dx) d\lambda \,. \end{aligned}$$

See Appendix 5 for more details on flat derivatives. This notion of derivative appears in the literature under several different names, including the linear functional derivative (see Section 5.4.1 in [7]) or the first variation [2]. It is important to note that \(\frac{\delta F}{\delta m}\) is defined only up to a constant, i.e., for any \(C \in \mathbb {R}\), the function \(\frac{\delta F}{\delta m} + C\) is also a flat derivative of F. Everywhere in this paper we will adopt a normalizing convention requiring \(\int _{\mathbb {R}^d} \frac{\delta F}{\delta m} (m,x) m(dx) = 0\), which then makes the choice of the constant unique.
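As a simple illustration of this convention, for a linear energy \(F(m) = \int _{\mathbb {R}^d} \phi (x)\, m(dx)\) with a bounded function \(\phi \), the defining identity above is satisfied by \(\phi \) itself, and the normalizing convention yields

$$\begin{aligned} \frac{\delta F}{\delta m}(m,x) = \phi (x) - \int _{\mathbb {R}^d} \phi (y)\, m(dy) \,. \end{aligned}$$

Similarly, for the interaction energy \(F(m) = \frac{1}{2}\int _{\mathbb {R}^d}\int _{\mathbb {R}^d} W(x,y)\, m(dx)\, m(dy)\) with a bounded symmetric kernel W, one obtains \(\frac{\delta F}{\delta m}(m,x) = \int _{\mathbb {R}^d} W(x,y)\, m(dy) - \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} W(y,z)\, m(dy)\, m(dz)\).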

The objective of this work is to identify a flow of measures \((m_t)_{t\ge 0}\) such that \(V^{\sigma }(m_t) \rightarrow V^{\sigma }(m^{\sigma ,*})\) as \(t \rightarrow \infty \), as well as conditions that ensure that this convergence is exponential. To this end, we equip the space \(\mathcal {P}(\mathbb {R}^d)\) with a suitable distance function \(d:\mathcal {P}(\mathbb {R}^d)\times \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb R\) and consider a corresponding gradient flow, where the form of the flow is dictated by the choice of d. Our main focus is on the Fisher–Rao metric.

1.1 Fisher–Rao Gradient Flow

Let \(\mathcal {P}_{ac}(\mathbb {R}^d)\) be the space of probability measures on \(\mathbb {R}^d\) that are absolutely continuous with respect to the Lebesgue measure. Then the Fisher–Rao distance between \(\mu _0\), \(\mu _1 \in \mathcal {P}_{ac}(\mathbb {R}^d)\) is defined by

$$\begin{aligned} {\text {FR}}(\mu _0,\mu _1) = \int _{\mathbb {R}^d} \left| \sqrt{\mu _0(x)} - \sqrt{\mu _1(x)} \right| ^2 dx \,. \end{aligned}$$

One can also consider a dynamic representation of the Fisher–Rao metric (see e.g. Section 2.2 in [12] and the references therein), which, for any \(\mu _0\), \(\mu _1 \in \mathcal {P}_{ac}(\mathbb {R}^d)\), states that

$$\begin{aligned} {\text {FR}}(\mu _0,\mu _1) = \inf \left\{ \int _0^1 \int _{\mathbb {R}^d} |\nu _s|^2\, m_s(dx)\, ds \,:\, \partial _s m_s = \nu _s m_s\,, \quad m_{i}=\mu _i\,,\,\, i=0,1 \right\} \,, \end{aligned}$$

where the infimum is taken over all curves \([0,1] \ni t \mapsto (m_t,\nu _t) \in \mathcal {P}_{ac}(\mathbb {R}^d) \times L^2(\mathbb {R}^d;m_t)\) solving \(\partial _t m_t = \nu _t m_t\) in the distributional sense, such that \(t \mapsto m_t\) is weakly continuous with endpoints \(\mu _0\) and \(\mu _1\). This result tells us that measures in the space \((\mathcal P_{ac}(\mathbb R^d), {\text {FR}} )\) are transported along curves prescribed by a birth-death (or reaction) equation. The main focus of this work is to identify a corresponding Polyak–Łojasiewicz inequality from which we can deduce the exponential convergence to \(m^{\sigma ,*}\) of the flow \((m_t)_{t \ge 0}\) described by the birth-death equation

$$\begin{aligned} \partial _t m_t(x)&= - a(m_t,x)\, m_t(x), \nonumber \\ a(m,x)&:= \frac{\delta F}{\delta m} (m,x) + \frac{\sigma ^2}{2} \log \left( \frac{m(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2} {\text {KL}}(m|\pi ). \end{aligned}$$
(1.4)

Note that the map \((m,x)\mapsto a(m,x)\) formally corresponds to \(\frac{\delta V^{\sigma }}{\delta m}(m,x)\), which may not exist, since the KL-divergence is only lower semi-continuous. Nevertheless, the map \((m,x)\mapsto a(m,x)\) is a well-defined function under the assumption of flat-differentiability of F (note that the term \({\text {KL}}(m|\pi )\) in (1.4) corresponds to the normalizing constant required by the normalizing convention mentioned above, so that \(\int _{\mathbb {R}^d} a(m,x)\, m(dx) = 0\)).

To see why the particular form of \((m,x)\mapsto a(m,x)\) in (1.4) is a good choice, one needs to show that \(t \mapsto V^\sigma (m_t)\) is differentiable, so that

$$\begin{aligned} \begin{aligned} \partial _t&\left( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*})\right) \\&= \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m} (m_t,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2}{\text {KL}}(m_t|\pi ) \right) \partial _t m_t(x) dx \\&= - \int _{\mathbb {R}^d} \left| a(m_t,x) \right| ^2 m_t(x)dx\,. \end{aligned} \end{aligned}$$
(1.5)

The Polyak–Łojasiewicz condition that implies the exponential convergence of \(V^{\sigma }(m_t)\) to \(V^{\sigma }(m^{\sigma ,*})\) requires that there exists a constant \(\kappa >0\) such that for any \(m^{*}\in \arg \min _m V^{\sigma }(m)\) and any \(m\in \mathcal P(\mathbb R^d)\),

$$\begin{aligned} \frac{1}{ \kappa }\left\| a(m,\cdot )\right\| ^2_{L^2(m)} \ge V^{\sigma }(m) - V^{\sigma }(m^*) \,. \end{aligned}$$
(1.6)

We call (1.6) the flat Polyak–Łojasiewicz condition, since the function a(m, x) formally corresponds to the flat derivative of \(V^{\sigma }\), as explained above. With such an inequality at hand, one immediately sees that

$$\begin{aligned} \partial _t \left( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*})\right) \le - \kappa \left( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \right) \,, \end{aligned}$$

which implies that \(V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \le \left( V^{\sigma }(m_0) - V^{\sigma }(m^{\sigma ,*}) \right) e^{-\kappa t}\) holds for any \(t \ge 0\).

The main contributions of this work are:

  • We establish the existence and uniqueness of the non-linear, infinite-dimensional birth-death flow (1.4).

  • We demonstrate that \(t \mapsto V^\sigma (m_t)\) is differentiable, which implies that the energy dissipation equality (1.5) holds.

  • We show that for a large class of energy functions \(V^{\sigma }\), the Polyak–Łojasiewicz condition (1.6) can be verified under relatively mild assumptions.

We remark that showing the existence of a solution to (1.4) is non-trivial, since the problem is non-linear and the coefficient a(m, x) contains two terms that are difficult to control: the flat derivative of F and the KL-divergence. Even if one assumes a priori that the former is bounded, it is still unclear how to control the latter. We deal with this problem by introducing a Picard iteration gradient flow approximating (1.4), and then analysing the symmetrised KL-divergence (rather than just the plain KL-divergence) along that auxiliary gradient flow (see Lemmas 3.1 and 3.2). This approach also allows us to obtain bounds on the Radon–Nikodym derivative of \(m_t\) with respect to \(\pi \) (see Theorem 2.1) that are crucial for proving the differentiability of \(t \mapsto V^\sigma (m_t)\) (Theorem 2.2), as well as for establishing the Polyak–Łojasiewicz condition (1.6) in Theorem 2.3.

It can be shown that the birth-death flow (1.4) is a limit of a minimising movement scheme, see e.g. [22], defined for \(\tau > 0\) as

$$\begin{aligned} \mu _{n+1} = {\text {argmin}}_{\nu \in \mathcal P_{ac}(\mathbb R^d)} \left\{ V^{\sigma }(\nu ) + \frac{1}{\tau } {\text {KL}}(\nu |\mu _n) \right\} \,. \end{aligned}$$

Indeed, recalling that \((m,x)\mapsto a(m,x)\) defined in (1.4) formally corresponds to \(\frac{\delta V^{\sigma }}{\delta m}(m,x)\), and writing down the first order condition for the minimization problem above using Proposition 2.5 in [13], we can easily see that

$$\begin{aligned} \frac{\log \mu _{n+1}(x) - \log \mu _n(x)}{\tau } = - a(\mu _{n+1},x) \,. \end{aligned}$$
(1.7)

This is an implicit Euler discretisation of (1.4). Similarly, one can consider the mirror descent algorithm, recently studied in [3] for the problem of optimization over the space of measures, defined by

$$\begin{aligned} \bar{\mu }_{n+1} = {\text {argmin}}_{\nu \in \mathcal P_{ac}(\mathbb R^d)} \left\{ \int _{\mathbb R^d} a(\bar{\mu }_n,x) (\nu - \bar{\mu }_n)(dx) + \frac{1}{\tau } {\text {KL}}(\nu |\bar{\mu }_n) \right\} \,. \end{aligned}$$

As before, one can show, using Proposition 2.5 in [13], that

$$\begin{aligned} \frac{\log \bar{\mu }_{n+1}(x) - \log \bar{\mu }_n(x)}{\tau } = - a(\bar{\mu }_{n},x) \,. \end{aligned}$$
(1.8)

This is an explicit Euler discretisation of (1.4). Note that Theorem 4 in [3] shows convergence of their energy function evaluated at \(\bar{\mu }_n\) under certain strong convexity assumptions, whereas we work with the (measure-space version of the) Polyak–Łojasiewicz inequality. In this context our results provide a natural extension of convergence results for mirror descent algorithms on \(\mathbb {R}^d\), which are known to converge under the classical PŁI (1.1), see [19].
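To make these discretisations concrete, the following minimal Python sketch implements the explicit Euler/mirror descent step (1.8) on a one-dimensional grid, for the illustrative choice of a linear energy \(F(m) = \int _{\mathbb {R}^d} \phi \, dm\) (so that, under our normalizing convention, \(\frac{\delta F}{\delta m}(m,x) = \phi (x) - \int \phi \, dm\)). The grid, the potential U, the function \(\phi \) and the step size \(\tau \) are ad hoc choices of ours, and the constant term \(-\frac{\sigma ^2}{2}{\text {KL}}(m|\pi )\) in a is absorbed by the renormalisation of the density. The printed ratios of successive optimality gaps should stabilise below 1, in line with the exponential convergence discussed above.

```python
import numpy as np

# Explicit Euler / mirror descent step (1.8) on a grid, for the energy
# V^sigma(m) = F(m) + (sigma^2/2) KL(m|pi) with the illustrative linear
# choice F(m) = int phi dm; grid, U, phi and tau are ad hoc choices.
x = np.linspace(-5.0, 5.0, 2001); dx = x[1] - x[0]
sigma, tau = 1.0, 0.05

U = 0.5 * x**2                                   # pi(dx) ~ exp(-2U/sigma^2) dx
pi = np.exp(-2.0 * U / sigma**2); pi /= pi.sum() * dx
phi = np.cos(x)

kl = lambda m: np.sum(m * np.log(m / pi)) * dx

def a(m):
    # a(m,x) from (1.4); the flat derivative of linear F is phi - int phi dm
    dF = phi - np.sum(phi * m) * dx
    return dF + 0.5 * sigma**2 * np.log(m / pi) - 0.5 * sigma**2 * kl(m)

V = lambda m: np.sum(phi * m) * dx + 0.5 * sigma**2 * kl(m)

m = pi * np.exp(-0.1 * x); m /= m.sum() * dx     # warm start: m0/pi bounded
vals = []
for _ in range(400):
    vals.append(V(m))
    m = m * np.exp(-tau * a(m))                  # log-space explicit Euler (1.8)
    m /= m.sum() * dx                            # renormalise

gap = np.array(vals) - vals[-1]                  # approximate optimality gap
print(gap[1:9] / gap[:8])                        # roughly constant ratio < 1
```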

The remaining part of the paper is organised as follows. In Section 2 we formulate our main results and the assumptions we work with. In Section 2.2 we present a result on the verification of the flat Polyak–Łojasiewicz inequality (1.6) for general energy functions (not necessarily of the form (1.2)) under certain quadratic growth conditions. This section is of independent interest and can be seen as a counterpart of the results proved in \(\mathbb {R}^d\) in [14], or of the results proved on the space of measures in [5] for a quadratic growth condition with respect to the \(L^2\)-Wasserstein distance (while we work with the KL-divergence and the \(\chi ^2\)-divergence). In Section 2.3 we review the literature and present a more in-depth discussion of the motivation for studying the gradient flow (1.4). In Section 3 we prove our main results on the existence of the gradient flow and the differentiability of the energy function. Finally, the Appendix includes some general auxiliary results on comparing different f-divergences, adapted from [11], and a brief overview of the notion of the flat derivative.

2 Main Results

2.1 Existence of the Birth-Death Flow and Its Convergence Under the Flat Polyak–Łojasiewicz Condition

We work with the energy function \(V^{\sigma }: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) given by (1.2), for some possibly non-linear \(F: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) and \(\sigma > 0\). We have the following assumptions on F.

Assumption 1

Suppose F has the first and the second order flat derivatives (\(\frac{\delta F}{\delta m} : \mathcal {P}(\mathbb {R}^d) \times \mathbb {R}^d \rightarrow \mathbb {R}\) and \(\frac{\delta ^2 F}{\delta m^2}: \mathcal {P}(\mathbb {R}^d) \times \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\), respectively). Furthermore, suppose that

  (1)

    F is convex, i.e., for any m, \(m' \in \mathcal {P}(\mathbb {R}^d)\) we have

    $$\begin{aligned} F(m) - F(m') \le \int _{\mathbb {R}^d} \frac{\delta F}{\delta m}(m,x) \left( m - m' \right) (dx) \,. \end{aligned}$$
    (2.1)
  (2)

    There exists a constant \(C > 0\) such that for all \(m \in \mathcal {P}(\mathbb {R}^d)\) and for all \(x \in \mathbb {R}^d\) we have

    $$\begin{aligned} \left| \frac{\delta F}{\delta m} (m,x) \right| \le C \,. \end{aligned}$$
    (2.2)
  (3)

    There exists a constant \(C_2 > 0\) such that for all \(m \in \mathcal {P}(\mathbb {R}^d)\) and for all x, \(y \in \mathbb {R}^d\) we have

    $$\begin{aligned} \left| \frac{\delta ^2 F}{\delta m^2} (m,x,y) \right| \le C_2 \,. \end{aligned}$$
    (2.3)

Furthermore, suppose we have absolutely continuous probability measures \(\pi \), \(m_0 \in \mathcal {P}(\mathbb {R}^d)\) such that \(\pi (dx) \propto \exp \left( -\frac{2}{\sigma ^2}U(x) \right) dx\) for a potential \(U: \mathbb {R}^d \rightarrow \mathbb {R}\) and the following conditions are satisfied.

Assumption 2

Suppose \(m_0 \in \mathcal {P}(\mathbb {R}^d)\) is absolutely continuous and comparable with \(\pi \) in the following sense.

  (1)

    There exists a constant \(r>0\) such that

    $$\begin{aligned} \inf _{x \in \mathbb {R}^d} \frac{m_0(x)}{\pi (x)} \ge r \,. \end{aligned}$$
    (2.4)
  (2)

    There exists a constant \(R > 1\) such that

    $$\begin{aligned} \sup _{x \in \mathbb {R}^d} \frac{m_0(x)}{\pi (x)} \le R \,. \end{aligned}$$
    (2.5)

Note that here \(\pi \) is just a reference measure, and recall that the actual measure of interest (the minimizer of \(V^{\sigma }\)) is given implicitly by the following equation

$$\begin{aligned} m^{\sigma ,*}(x) = \frac{1}{Z} \exp {\left( -\frac{2}{\sigma ^2} \left( \frac{\delta F}{\delta m} (m^{\sigma ,*},x) + U(x) \right) \right) } \,, \end{aligned}$$

where Z is the normalizing constant. We immediately observe that, under condition (2.2), conditions (2.4) and (2.5) together are equivalent to assuming that there exist constants \(\bar{r}>0\), \(\bar{R} > 1\) (indeed, by the formula above and (2.2), the ratio \(m^{\sigma ,*}(x)/\pi (x)\) is itself bounded between \(e^{-4C/\sigma ^2}\) and \(e^{4C/\sigma ^2}\)) such that for all \(x \in \mathbb {R}^d\),

$$\begin{aligned} \bar{r} \le \frac{m_0(x)}{m^{\sigma ,*}(x)} \le \bar{R} \,. \end{aligned}$$
(2.6)

As we will explain in more detail in Subsection 2.3, Assumption 2 is a kind of “warm start” condition: once we fix the reference measure \(\pi \) in (1.2), the initial measure \(m_0\) of our gradient flow should be comparable to \(\pi \). We have the following result.

Theorem 2.1

Under Assumption 1 and condition (2.5) from Assumption 2, Eq. (1.4) has a unique solution \((m_t)_{t \ge 0}\). Moreover, for \(t\ge 0\),

$$\begin{aligned} {\text {KL}}(m_t|\pi ) \le 2 \log R + \frac{4C}{\sigma ^2} \end{aligned}$$
(2.7)

and there exists a constant \(R_1 > 1\) such that for all \(t \ge 0\),

$$\begin{aligned} \sup _{x \in \mathbb {R}^d} \frac{m_t(x)}{\pi (x)} \le R_1 \,. \end{aligned}$$
(2.8)

If we additionally assume that condition (2.4) from Assumption 2 holds, then there exists a constant \(r_1 > 0\) such that for all \(t \ge 0\),

$$\begin{aligned} \inf _{x \in \mathbb {R}^d} \frac{m_t(x)}{\pi (x)} \ge r_1 \,. \end{aligned}$$
(2.9)

As we explained in the discussion in Section 1, the crucial property needed for showing the exponential convergence of \((m_t)_{t \ge 0}\) is the differentiability of the energy function along the gradient flow.

Theorem 2.2

Under Assumption 1 and condition (2.5) from Assumption 2, for the unique solution \((m_t)_{t \ge 0}\) to (1.4), the function \(t \mapsto V^{\sigma }(m_t)\) is differentiable and

$$\begin{aligned} \partial _t V^{\sigma }(m_t) = - \int _{\mathbb {R}^d} \left| \frac{\delta F}{\delta m} (m_t,x) + \frac{\sigma ^2}{2} \log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2} {\text {KL}}(m_t|\pi ) \right| ^2 m_t(x)\, dx \,. \end{aligned}$$
(2.10)

Note that inequalities (2.8) and (2.9) obtained in Theorem 2.1 imply that there exist constants \(\bar{r}_1 > 0\) and \(\bar{R}_1 > 1\) such that for all \(t \ge 0\) and all \(x \in \mathbb {R}^d\),

$$\begin{aligned} \bar{r}_1 \le \frac{m_t(x)}{m^{\sigma ,*}(x)} \le \bar{R}_1 \end{aligned}$$

(similarly to how (2.4) and (2.5) imply (2.6)). This property will be crucial in the proof of the following Polyak–Łojasiewicz inequality.

Theorem 2.3

Under Assumptions 1 and 2, the flow \((m_t)_{t \ge 0}\) solving (1.4) satisfies

$$\begin{aligned} V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \le \frac{4 \bar{R}_1}{\sigma ^2 \bar{r}_1} \left\| a(m_t, \cdot ) \right\| ^2_{L^2(m_t)} \end{aligned}$$
(2.11)

for all \(t \ge 0\).

Combining Theorem 2.3 with the discussion in Section 1, we obtain the following result.

Corollary 2.4

Under Assumptions 1 and 2, the flow \((m_t)_{t \ge 0}\) solving (1.4) satisfies

$$\begin{aligned} V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \le \left( V^{\sigma }(m_0) - V^{\sigma }(m^{\sigma ,*}) \right) e^{-\kappa t} \,, \end{aligned}$$

for all \(t \ge 0\), where \(\kappa = \sigma ^2 \bar{r}_1 / (4 \bar{R}_1)\).

The proofs of all the results formulated above are postponed to Section 3.

In Subsection 2.2 we will explain how to deduce the Polyak–Łojasiewicz inequality (2.11) for a general class of energy functions that satisfy a certain growth condition with respect to the KL-divergence. We will now formulate a lemma in which we verify this growth condition for the energy function \(V^{\sigma }\) given by (1.2).

Lemma 2.5

For \(V^{\sigma }\) given by (1.2), if F is convex, then \(V^{\sigma }\) satisfies the quadratic growth condition

$$\begin{aligned} V^{\sigma }(m) - V^{\sigma }(m^{\sigma ,*}) \ge \frac{\sigma ^2}{2} {\text {KL}}(m | m^{\sigma ,*}) \end{aligned}$$

for any \(m \in \mathcal {P}(\mathbb {R}^d)\).

Proof

The proof is a straightforward extension of the proof of Proposition 1 in [18], which proved this for \(V = F + H\) with H the negative entropy. By convexity of F, for any probability measures m, \(m' \in \mathcal {P}(\mathbb {R}^d)\) we get

$$\begin{aligned} V^{\sigma }(m')&= F(m') + \frac{\sigma ^2}{2} {\text {KL}} (m' | \pi )\\&\ge F(m) + \int _{\mathbb {R}^d} \frac{\delta F}{\delta m} (m,x) (m'-m)(dx) + \frac{\sigma ^2}{2} {\text {KL}} (m' | \pi )\\&= F(m) + \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m} (m,x) + \frac{\sigma ^2}{2} \log \frac{m(x)}{\pi (x)} - \frac{\sigma ^2}{2} \log \frac{m(x)}{\pi (x)} \right) (m'-m)(dx) + \frac{\sigma ^2}{2} {\text {KL}} (m' | \pi )\\&= F(m) + \int _{\mathbb {R}^d} a(m,x) (m'-m)(dx) - \int _{\mathbb {R}^d} \frac{\sigma ^2}{2} \log \frac{m(x)}{\pi (x)} (m'-m)(dx) + \frac{\sigma ^2}{2} {\text {KL}} (m' | \pi )\\&= F(m) + \int _{\mathbb {R}^d} a(m,x) (m'-m)(dx) + \frac{\sigma ^2}{2} {\text {KL}}(m'|m) + \frac{\sigma ^2}{2} {\text {KL}}(m|\pi )\\&\ge V^{\sigma }(m) + \int _{\mathbb {R}^d} a(m,x) (m'-m)(dx) + \frac{\sigma ^2}{2} {\text {KL}}(m'|m) \,. \end{aligned}$$

Taking \(m=m^{\sigma ,*}\) in the above calculation finishes the proof, since \(a(m^{\sigma ,*},\cdot )\) is constant by Proposition 2.5 in [13]. \(\square \)

Note that we call the growth condition in Lemma 2.5 quadratic, since the KL-divergence corresponds to the square of a distance on the space of measures (compare this to condition (2) for \(\theta = 1/2\) in [5], which considered a similar growth condition with the \(L^2\)-Wasserstein distance, and see the discussion below our Remark 2.9 for more details).
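As a quick sanity check, consider the illustrative case of a linear energy \(F(m) = \int _{\mathbb {R}^d} \phi \, dm\) with bounded \(\phi \). Then (1.3) gives \(m^{\sigma ,*} = \frac{1}{Z}\pi e^{-\frac{2}{\sigma ^2}\phi }\) with \(Z = \int _{\mathbb {R}^d} e^{-\frac{2}{\sigma ^2}\phi (x)} \pi (x)\, dx\), so that \(V^{\sigma }(m^{\sigma ,*}) = -\frac{\sigma ^2}{2}\log Z\), and since \(\log \frac{m}{m^{\sigma ,*}} = \log \frac{m}{\pi } + \frac{2}{\sigma ^2}\phi + \log Z\), a direct computation shows that the inequality in Lemma 2.5 is in this case an equality:

$$\begin{aligned} V^{\sigma }(m) - V^{\sigma }(m^{\sigma ,*}) = \int _{\mathbb {R}^d} \phi \, dm + \frac{\sigma ^2}{2} {\text {KL}}(m|\pi ) + \frac{\sigma ^2}{2}\log Z = \frac{\sigma ^2}{2} {\text {KL}}(m | m^{\sigma ,*}) \,. \end{aligned}$$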

2.2 Verification of the Flat Polyak–Łojasiewicz Condition in a General Setting

In this subsection we adapt the proof of Theorem 2 in [14] to the setting of the space of measures. In [14] it was shown how the classical Polyak–Łojasiewicz inequality (1.1) for functions on \(\mathbb {R}^d\) can be inferred from a certain type of quadratic growth condition. Here we will work with functions on \(\mathcal {P}(\mathbb {R}^d)\) and carry out a similar argument, based on certain quadratic growth conditions expressed in terms of either the KL-divergence or the \(\chi ^2\)-divergence, where the latter is defined for any \(m \in \mathcal {P}(\mathbb {R}^d)\) by

$$\begin{aligned} \chi ^2(m|\pi ) = {\left\{ \begin{array}{ll} \int _{\mathbb {R}^d} \left( \frac{m(x)}{\pi (x)} - 1\right) ^2 \pi (x)\, dx & m \text { absolutely continuous with respect to } \pi , \\ \infty & \text {otherwise.} \end{array}\right. } \end{aligned}$$

This result can be interpreted as an analogue of Theorem 1 in [5], which showed that a certain type of the Łojasiewicz inequality can be inferred from a quadratic growth condition with respect to the \(L^2\)-Wasserstein distance. We will present our reasoning in a series of lemmas.

Lemma 2.6

Suppose that \(G: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) has the first order flat derivative and that G is convex (cf. (2.1)). Then for any absolutely continuous probability measures m, \(m' \in \mathcal {P}(\mathbb {R}^d)\),

$$\begin{aligned} G(m) - G(m')&\le \left( \int _{\mathbb {R}^d} \left| \frac{\delta G}{\delta m}(m,x) \right| ^2 m(x)\, dx \right) ^{1/2} \left( \int _{\mathbb {R}^d} \left( \frac{m'(x)}{m(x)} - 1 \right) ^2 m(x)\, dx \right) ^{1/2} \\&= \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)} \cdot \chi ^2(m'|m)^{1/2} \,. \end{aligned}$$

Proof

Since \(\int _{\mathbb {R}^d} \frac{\delta G}{\delta m}(m,x) m(x) dx = 0\) by convention, from the convexity condition (2.1) we get

$$\begin{aligned} G(m) - G(m') \le - \int _{\mathbb {R}^d} \frac{\delta G}{\delta m}(m,x) \left( \frac{m'(x)}{m(x)} - 1 \right) m(x)dx \,. \end{aligned}$$

A simple application of the Cauchy–Schwarz inequality in \(L^2(m)\) proves the desired assertion. \(\square \)

Next we need a lemma that allows us to compare the \(\chi ^2\)-divergence and the KL-divergence between two absolutely continuous measures whose density ratio is bounded from above and below.

Lemma 2.7

Suppose that m, \(m' \in \mathcal {P}(\mathbb {R}^d)\) are absolutely continuous and that there exist constants r, \(R > 0\) such that for any \(x \in \mathbb {R}^d\) we have

$$\begin{aligned} r \le \frac{m(x)}{m'(x)} \le R \,. \end{aligned}$$

Then we have

$$\begin{aligned} {\text {KL}}(m'|m) \le \frac{1}{r} {\text {KL}}(m|m') \qquad \text { and } \qquad \chi ^2(m|m') \le 2R {\text {KL}}(m|m') \,. \end{aligned}$$
(2.12)

Proof

The proof can be adapted from the proofs of Proposition 1 and Proposition 2 in [11], which covered the case of discrete probability measures. For completeness, we include the proof in Section 4. \(\square \)
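As an aside, both inequalities in (2.12) can be sanity-checked numerically on randomly generated discrete distributions (the setting covered by [11]); in the following minimal Python sketch the support size and the ratio range are arbitrary choices of ours.

```python
import numpy as np

# Crude numerical sanity check of (2.12) on random discrete distributions;
# the support size (50) and the ratio range [0.5, 2] are arbitrary choices.
rng = np.random.default_rng(0)

kl   = lambda p, q: np.sum(p * np.log(p / q))
chi2 = lambda p, q: np.sum((p / q - 1) ** 2 * q)

for _ in range(1000):
    mp = rng.dirichlet(np.ones(50))           # reference measure m'
    m = mp * rng.uniform(0.5, 2.0, size=50)   # bounded-ratio perturbation
    m /= m.sum()
    r, R = (m / mp).min(), (m / mp).max()     # r <= m/m' <= R
    assert kl(mp, m) <= kl(m, mp) / r + 1e-12         # first part of (2.12)
    assert chi2(m, mp) <= 2 * R * kl(m, mp) + 1e-12   # second part of (2.12)
```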

Based on the above lemmas, we can show the following result.

Theorem 2.8

Suppose that \(G: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) has the first order flat derivative and that G is convex. Suppose further that G is minimized by an absolutely continuous measure \(m^*\) and that there exists a constant \(\lambda > 0\) such that for any \(m' \in \mathcal {P}(\mathbb {R}^d)\),

$$\begin{aligned} G(m') - G(m^*) \ge \lambda {\text {KL}}(m'|m^*) \,. \end{aligned}$$
(2.13)

Moreover, suppose that \(m \in \mathcal {P}(\mathbb {R}^d)\) is absolutely continuous and that there exist constants r, \(R > 0\) such that for any \(x \in \mathbb {R}^d\) we have

$$\begin{aligned} r \le \frac{m(x)}{m^*(x)} \le R \,. \end{aligned}$$
(2.14)

Then

$$\begin{aligned} G(m) - G(m^*) \le \frac{2R}{\lambda r} \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| ^2_{L^2(m)} \,. \end{aligned}$$
(2.15)

Proof

We follow the argument from the proof of Theorem 1 in [5]. Since G is assumed to be convex, from Lemma 2.6 we get

$$\begin{aligned} G(m) - G(m^*) \le \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)} \cdot \chi ^2(m^*|m)^{1/2} \,. \end{aligned}$$
(2.16)

However, due to Lemma 2.7, we have

$$\begin{aligned} \chi ^2(m^*|m) \le 2R {\text {KL}}(m^*|m) \le \frac{2R}{r} {\text {KL}}(m|m^*) \,, \end{aligned}$$

which, together with (2.16) and \(G(m) - G(m^*) \ge \lambda {\text {KL}}(m|m^*)\) leads to

$$\begin{aligned} {\text {KL}}(m|m^*)^{1/2} \le \frac{1}{\lambda } \left( \frac{2R}{r} \right) ^{1/2} \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)} \,. \end{aligned}$$

In particular,

$$\begin{aligned} \chi ^2(m^*|m)^{1/2} \le \frac{2R}{\lambda r} \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)} \,. \end{aligned}$$
(2.17)

Plugging (2.17) into the right hand side of (2.16), we obtain

$$\begin{aligned} G(m) - G(m^*) \le \frac{2R}{\lambda r} \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)}^2 \,. \end{aligned}$$

\(\square \)

Remark 2.9

Under the assumptions of Theorem 2.8, we obtain the flat Polyak–Łojasiewicz condition of the type (1.6) with the constant

$$\begin{aligned} \kappa = \left( \frac{2R}{\lambda r} \right) ^{-1} \,. \end{aligned}$$
(2.18)

In what follows, we will prove that the flow \((m_t)_{t \ge 0}\) given by (1.4) is such that \(\bar{r}_1 \le \frac{m_t(x)}{m^{\sigma ,*}(x)} \le \bar{R}_1\) with some constants \(\bar{r}_1 > 0\), \(\bar{R}_1 > 1\), for all \(t > 0\) and \(x \in \mathbb {R}^d\), which will allow us to show (2.15) with G on the left hand side replaced by \(V^{\sigma }\), and \(\frac{\delta G}{\delta m}(m,x)\) on the right hand side replaced by a(m, x) given by (1.4). This will be the basis of the proof of our main results in Section 3 and will provide us with an exponential convergence rate of \(V^{\sigma }(m_t)\) to \(V^{\sigma }(m^{\sigma ,*})\). We can easily observe that the convergence rate \(\kappa \) given by (2.18) degenerates to zero when \(\lambda \rightarrow 0\) or \(r \rightarrow 0\) or \(R \rightarrow \infty \).

Condition (2.13) corresponds to the classical quadratic growth condition for functions \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) that can be used (see Theorem 2 in [14]) to prove the classical Polyak–Łojasiewicz inequality (1.1) under the additional assumption of convexity of f (but not necessarily strong convexity). More precisely, the quadratic growth condition in \(\mathbb {R}^d\) states that

$$\begin{aligned} f(x) - \min _{y\in \mathbb R^d} f(y) \ge \frac{\mu }{2} \Vert x - x_p \Vert ^2 \,, \end{aligned}$$

where \(x_p\) denotes the projection of x onto the set \(\arg \min _{y \in \mathbb {R}^d} f(y)\). Specifying an analogous condition for functions on the space of measures is not straightforward, as there are multiple choices for the notion of distance. Blanchet and Bolte in [5] proved that a certain type of Łojasiewicz inequality can be implied by a condition such as (2.13), but with the \(L^2\)-Wasserstein distance instead of the KL-divergence, see formula (2) and Theorem 1 in [5]. Based on the proof of our Theorem 2.8, it is clear that we can also consider a quadratic growth condition with respect to the \(\chi ^2\)-divergence with reversed arguments, i.e., we have the following result.

Corollary 2.10

Suppose that \(G: \mathcal {P}(\mathbb {R}^d) \rightarrow \mathbb {R}\) has the first order flat derivative and that G is convex. Suppose further that G is minimized by an absolutely continuous measure \(m^*\) and that there exists a constant \(\lambda > 0\) such that for any \(m' \in \mathcal {P}(\mathbb {R}^d)\),

$$\begin{aligned} G(m') - G(m^*) \ge \lambda \chi ^2(m^*|m') \,. \end{aligned}$$
(2.19)

Then for any \(m \in \mathcal {P}(\mathbb {R}^d)\) we have the flat Polyak–Łojasiewicz condition

$$\begin{aligned} G(m) - G(m^*) \le \frac{1}{\lambda } \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)}^2 \,. \end{aligned}$$
(2.20)

Proof

Using (2.16) and (2.19), one immediately obtains

$$\begin{aligned} \chi ^2(m^*|m)^{1/2} \le \frac{1}{\lambda } \left\| \frac{\delta G}{\delta m} (m,\cdot ) \right\| _{L^2(m)} \,, \end{aligned}$$

which can be plugged back into (2.16) to obtain (2.20). \(\square \)

The quadratic growth condition with respect to the KL-divergence (2.13) seems more natural than the one with respect to the \(\chi ^2\)-divergence (2.19) (note that the former is verified in Lemma 2.5 for a large class of energy functions given by (1.2)). It is clear based on Lemma 2.7 that (2.13) implies (2.19), but we are presently unaware of any examples of energy functions that would satisfy (2.19) but not (2.13).

2.3 Literature Review, Connection to the Wasserstein–Fisher–Rao Gradient Flow and Further Research

In order to present our results in a broader context, let us first discuss other types of gradient flows and associated Łojasiewicz-type inequalities. We will also provide two heuristic examples to build a better intuition for our approach.

2.3.1 Wasserstein Gradient Flow

The dynamic representation of the \(L^2\)-Wasserstein metric \(\mathcal W_2\) due to Benamou and Brenier [4, 25] states that for any \(\mu _0\), \(\mu _1 \in \mathcal P_2(\mathbb R^d)\),

$$\begin{aligned} \mathcal W_2(\mu _0,\mu _1) = \inf \left\{ \int _0^1 \int _{\mathbb {R}^d} |\nu _s|^2\, m_s(dx)\, ds \,:\, \partial _s m_s + \text {div}(\nu _s m_s) = 0\,, \quad m_{i}=\mu _i\,,\,\, i=0,1 \right\} , \end{aligned}$$
(2.21)

where the infimum is taken over all curves \([0,1] \ni t \mapsto (m_t,\nu _t) \in \mathcal P_2(\mathbb R^d) \times L^2(\mathbb {R}^d;m_t)\) solving \(\partial _t m_t + \text {div}(\nu _t m_t) =0\) in the distributional sense, such that \(t \mapsto m_t\) is weakly continuous with endpoints \(\mu _0\) and \(\mu _1\). This result tells us that measures in the space \((\mathcal P_2(\mathbb R^d), \mathcal W_2 )\) of probability measures with finite second moments are transported along curves described by the forward-Kolmogorov PDE.

One can show [13] that \(V^{\sigma }(m_t)\) is decreasing along the gradient flow \((m_t)_{t \ge 0}\) satisfying

$$\begin{aligned} \partial _t m_t&= \text {div}\left( \nabla a(m_t,\cdot )\, m_t\right) , \nonumber \\ a(m,x)&:= \frac{\delta F}{\delta m} (m,x) + \frac{\sigma ^2}{2} \log \left( \frac{m(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2} {\text {KL}}(m|\pi ). \end{aligned}$$
(2.22)

Note that this flow corresponds to the mean-field Langevin equation (see e.g. (1.4) and (1.5) in [13]), and in particular becomes the classical overdamped Langevin equation when \(F=0\). Indeed, if we can show that \(t \mapsto V^{\sigma }(m_t)\) is differentiable (see e.g. [13, Theorem 2.9]), we obtain

$$\begin{aligned} \begin{aligned} \partial _t V^{\sigma }(m_t)&= \int _{\mathbb {R}^d} a(m_t,x) \partial _t m_t(x) dx = \int _{\mathbb {R}^d} a(m_t,x) \text {div}\left( \nabla a(m_t,x)m_t(x)\right) dx \\&= - \int _{\mathbb {R}^d} \left| \nabla a(m_t,x)\right| ^2 m_t(dx) \,. \end{aligned} \end{aligned}$$
(2.23)

In the case when F is convex, and hence \(V^{\sigma }\) is strictly convex, \(V^{\sigma }(m_t) \rightarrow V^{\sigma }(m^{\sigma ,*})\), see [13]. More recently, [18] and [9] proved, under additional structural assumptions, that this convergence is exponential.

In this setting, the Polyak–Łojasiewicz condition that implies the exponential convergence \(V^{\sigma }(m_t) \rightarrow V^{\sigma }(m^{\sigma ,*})\) requires that there exists a constant \(\kappa >0\) such that for all \(t\ge 0\),

$$\begin{aligned} \frac{1}{ \kappa }\left\| \nabla a (m_t,\cdot ) \right\| ^2_{L^2(m_t)} \ge V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \,. \end{aligned}$$
(2.24)

With such an inequality at hand, one immediately sees that

$$\begin{aligned} \begin{aligned} \partial _t ( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}))&= - \int _{\mathbb {R}^d} \left| \nabla a(m_t,x)\right| ^2 m_t(dx) \\ {}&\le - \kappa (V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) )\,, \end{aligned} \end{aligned}$$

and the exponential convergence follows from the Grönwall lemma.

Example 2.11

Let \(F=0\) in (1.2). In this case the minimizing probability measure is \(m^{\sigma ,*} = \arg \min _m V^{\sigma }(m) = \pi \). Then, assuming that we can show that \(t \mapsto {\text {KL}}(m_t|\pi )\) is differentiable, we have

$$\begin{aligned} \partial _t ( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*})) = \frac{\sigma ^2}{2} \, \partial _t {\text {KL}}(m_t|\pi ) = - \frac{\sigma ^4}{4} \int _{\mathbb {R}^d} \left| \nabla \, \log \frac{m_t(x)}{\pi (x)} \right| ^2 m_t(dx)\,. \end{aligned}$$

In this case the Polyak–Łojasiewicz inequality is just the well-known log-Sobolev inequality

$$\begin{aligned} \frac{1}{ \kappa }\int _{\mathbb {R}^d} \left| \nabla \, \log \frac{m_t(x)}{\pi (x)} \right| ^2 m_t(dx) \ge {\text {KL}}(m_t|\pi ) \,. \end{aligned}$$
(2.25)

Example 2.12

Let us consider an example with a different type of energy function. Consider \(V^{\sigma }(m) := \chi ^2(m|\pi ) = \int _{\mathbb {R}^d} \left( \frac{m(x)}{\pi (x)} - 1 \right) ^2 \pi (x)dx\) for probability measures \(m \in \mathcal {P}(\mathbb {R}^d)\) absolutely continuous with respect to \(\pi \), and denote

$$\begin{aligned} \bar{a}(m,x) := 2 \left( \frac{m(x)}{\pi (x)} - 1 \right) - 2 \chi ^2(m|\pi ) \,, \end{aligned}$$

which formally corresponds to the flat derivative of the \(\chi ^2\)-divergence. Then \(V^{\sigma }(m_t)\) is decreasing along the gradient flow \((m_t)_{t \ge 0}\) satisfying

$$\begin{aligned} \partial _t m_t = \text {div}\left( \nabla \bar{a}(m_t,\cdot )\pi \right) \,, \end{aligned}$$
(2.26)

i.e., similarly as in (2.23), assuming \(t \mapsto \chi ^2(m_t|\pi )\) is differentiable, we have

$$\begin{aligned} \partial _t V^{\sigma }(m_t) = - \int _{\mathbb {R}^d} \left| \nabla \bar{a}(m_t,x)\right| ^2 \pi (dx) \,. \end{aligned}$$

Here the Polyak–Łojasiewicz inequality becomes the Poincaré inequality

$$\begin{aligned} \frac{1}{ \kappa }\int _{\mathbb {R}^d} \left| \nabla \left( \frac{m_t(x)}{\pi (x)} \right) \right| ^2 \pi (dx) \ge \chi ^2(m_t|\pi ) \,. \end{aligned}$$

Note that this corresponds to (2.24) with the \(L^2(\pi )\) norm instead of \(L^2(m_t)\), since we used a different gradient flow (compare (2.26) to (2.22)).

2.3.2 Wasserstein–Fisher–Rao Gradient Flow

A natural idea is to combine the Wasserstein (2.22) and the Fisher–Rao (1.4) gradient flows, which in our setting leads to

$$\begin{aligned} \partial _t m_t = \text {div}\left( \nabla a(m_t,\cdot )\, m_t\right) - a(m_t,\cdot )\, m_t \,. \end{aligned}$$
(2.27)

Flows of this type have been the subject of intensive research over the last few years [12, 15, 16, 21]. If we can show the existence of such a flow, and the differentiability of \(t \mapsto V^{\sigma }(m_t)\), one can then check that

$$\begin{aligned} \begin{aligned} \partial _t \left( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*})\right)&= - \left\| \nabla a(m_t,\cdot ) \right\| ^2_{L^2(m_t)} - \left\| a (m_t,\cdot ) \right\| ^2_{L^2(m_t)} \,. \end{aligned} \end{aligned}$$

If the corresponding Polyak–Łojasiewicz conditions (2.24) and (2.11) are satisfied, then the right hand side is bounded by \(- \left( \sigma ^2 \bar{r}_1/(4\bar{R}_1) + \kappa \right) \left( V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \right) \) and we easily obtain the exponential convergence \(V^{\sigma }(m_t) - V^{\sigma }(m^{\sigma ,*}) \le \left( V^{\sigma }(m_0) - V^{\sigma }(m^{\sigma ,*})\right) e^{-\kappa _1 t}\), where \(\kappa _1 = \sigma ^2 \bar{r}_1/(4\bar{R}_1) + \kappa \). This shows that both the Langevin part and the birth-death part can independently contribute to the convergence of \(V^{\sigma }(m_t)\), if the right corresponding conditions (2.24) or (2.11) are satisfied. However, the issues of the existence of \((m_t)_{t \ge 0}\), the differentiability of \(t \mapsto V^{\sigma }(m_t)\) and the verification of (2.24) in general settings are all non-trivial and will be studied in our future research, together with the issue of particle system approximation of (2.27), see also the last paragraph of this section.

We note that [21] studied convergence of flows similar to (2.27). However, they covered energy functions of a very specific form (see (11) in [21]), without regularisation by the KL-divergence. Moreover, [21] obtained only an asymptotic polynomial convergence rate in their main result (Theorem 4.6) and did not address some important technical issues, such as the existence of the gradient flow and the differentiability of \(t \mapsto V^{\sigma }(m_t)\).

On the other hand, [16] studied (2.27) corresponding to the linear case (\(F=0\)) of our Example 2.11 and obtained an exponential rate of convergence to \(\pi \), measured in the KL-divergence (see Theorem 3.3 therein). Interestingly, even though the authors of [16] did not explicitly make a connection to the Polyak–Łojasiewicz inequalities, their proof is in fact based on showing a special case of condition (1.6) as specified above (see their inequality \((2-2\delta )H_1(f) \le H_2(f)\) in the proof of Theorem 3.3, integrate it with respect to \(\rho _t\) and note that our \(m_t\) corresponds to their \(\rho _t\)). This Polyak–Łojasiewicz inequality is verified in [16] under a positive lower bound on the ratio of densities \(\inf _{x \in \mathbb {R}^d} \frac{\rho _t(x)}{\pi (x)}\) that is required to hold for all sufficiently large t, see (B.3) in [16]. Then they use an argument based on the maximum principle (which is possible due to the Langevin component of their dynamics) to show that this condition in fact only has to hold at an initial time \(t_0\). As a consequence, they conclude that compared to the classical result on the exponential convergence of the Langevin dynamics to \(\pi \) under the log-Sobolev inequality, by adding the birth-death component to the dynamics they can get rid of the log-Sobolev assumption and replace it by a “warm start” condition \(\inf _{x \in \mathbb {R}^d} \frac{\rho _{t_0}(x)}{\pi (x)} \ge c\) for some \(c > 0\). However, in [16] the Langevin part of the dynamics is only applied to make the use of the maximum principle possible, and does not directly contribute to the convergence rate. Moreover, similarly to [21], the questions of the existence of the gradient flow and of the differentiability of \(t \mapsto V^{\sigma }(m_t)\) were not addressed in [16].

In this paper we study a more general setting than [16], including non-linear functions F in the energy function \(V^{\sigma }\) in (1.2), and we rigorously prove the existence of the corresponding birth-death gradient flow \((m_t)_{t \ge 0}\), as well as the differentiability of \(t \mapsto V^{\sigma }(m_t)\). We also verify the flat Polyak–Łojasiewicz inequality (1.6) and thus establish the exponential rate of convergence of \(V^{\sigma }(m_t)\) to \(V^{\sigma }(m^{\sigma ,*})\). Our condition guaranteeing that (1.6) holds (Assumption 2) resembles the warm start condition from [16]; however, in order to show that it propagates from \(t=0\) to all \(t>0\), we do not use the Langevin component of the dynamics and hence we work with a “pure” birth-death dynamics (the Fisher–Rao gradient flow).

Other recent papers studying the mean-field optimization problem specified by (1.2), such as [18] and [9], focused on the Wasserstein gradient flow (2.22). Both [18] and [9] proved the exponential convergence rate of \(V^{\sigma }(m_t)\) to \(V^{\sigma }(m^{\sigma ,*})\) under the assumption of the log-Sobolev inequality for a class of proximal Gibbs measures related to \(m^{\sigma ,*}\). Compared to [9, 18], working with the Fisher–Rao gradient flow allows us to get rid of that assumption, at the cost of introducing the additional ”warm start” conditions in Assumption 2.

With all that said, we would like to point out that from the point of view of practical algorithms (which will be the subject of our future work), combining the birth-death dynamics with the Langevin dynamics seems advisable. The Wasserstein–Fisher–Rao gradient flow (2.27) can be seen as the mean-field limit of an interacting particle system that can be used as a basis of practically implementable algorithms (as studied in Section 6 of [16] and of [21]). The support of the birth-death flow does not change in time and hence, intuitively, if we do not include the diffusion component in our dynamics and we initialize it with the empirical measure of a set of particles, the dynamics will just keep re-arranging the mass between the particles but will not change their positions. Hence the convergence of such dynamics should be expected to be worse than the convergence of a particle system utilizing both the Langevin and the birth-death components. This issue is not apparent in the analysis of the mean-field limit process in the present paper (as our results use a “warm start” assumption on the initial condition), but we will investigate it in detail in our future work on the particle system approximations and the corresponding algorithms. From the practical point of view, the main message of this paper is that the birth-death component of such algorithms can be defined in terms of the function a given by (1.4), which corresponds to the flat derivative of the energy function \(V^{\sigma }\); the focus here, however, is on the theoretical analysis of the gradient flow rather than on applications.

3 Existence of the Gradient Flow and Other Proofs

In order to prove the existence of a solution \((m_t)_{t \ge 0}\) to

$$\begin{aligned} \partial _t m_t(x) = - \left( \frac{\delta F}{\delta m} (m_t,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2}{\text {KL}}(m_t|\pi ) \right) m_t(x) \,, \end{aligned}$$
(3.1)

we first notice that (3.1) is equivalent to

$$\begin{aligned} \partial _t \log m_t(x) = - \left( \frac{\delta F}{\delta m}(m_t,x) + \frac{\sigma ^2}{2} \log \left( \frac{m_t(x)}{\pi (x)} \right) - \frac{\sigma ^2}{2} {\text {KL}}(m_t|\pi )\right) \,. \end{aligned}$$
(3.2)

By Duhamel’s formula (treating (3.2) as a linear ODE for \(\log m_t(x)\) with constant rate \(\frac{\sigma ^2}{2}\)), Eq. (3.2) is equivalent to

$$\begin{aligned} \log m_t(x) = e^{-\frac{\sigma ^2}{2}t} \log m_0(x) - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left( \frac{2}{\sigma ^2} \frac{\delta F}{\delta m}(m_s,x) - \log \pi (x) - {\text {KL}}(m_s|\pi ) \right) ds \,. \end{aligned}$$

Based on this formula, we will define a Picard iteration scheme. To this end, let us first fix \(T > 0\) and choose a flow of probability measures \((m_t^{(0)})_{t \in [0,T]}\) such that

$$\begin{aligned} \int _0^T {\text {KL}}(m_s^{(0)}|\pi ) ds < \infty \,. \end{aligned}$$
(3.3)

For each \(n \ge 1\), we fix \(m_0^{(n)} = m_0^{(0)} = m_0\) (with \(m_0\) satisfying condition (2.5) from Assumption 2) and define \((m_t^{(n)})_{t \in [0,T]}\) by

$$\begin{aligned} \begin{aligned} \log m_t^{(n)}(x)&= e^{-\frac{\sigma ^2}{2}t} \log m_0(x) - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \\&\quad \times \left( \frac{2}{\sigma ^2} \frac{\delta F}{\delta m}(m_s^{(n-1)},x) - \log \pi (x) - {\text {KL}}(m_s^{(n-1)}|\pi ) \right) ds \,. \end{aligned} \end{aligned}$$
(3.4)
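Before analysing this scheme, let us illustrate it numerically. The following minimal Python sketch (not part of the proofs; the grid, the interaction energy \(F(m) = \frac{1}{2}\int \int W(x,y)\, m(dx)\, m(dy)\) and all constants are ad hoc choices of ours) implements the Picard map defined by (3.4) on a one-dimensional grid and prints the \(\mathcal{T}\mathcal{V}_T\) distance between consecutive iterates, which is expected to shrink rapidly, in line with Lemmas 3.1 and 3.2 below; the iterates are renormalised on the grid to keep them probability densities, a discretisation convenience.

```python
import numpy as np

# Grid sketch of the Picard map (3.4); grid, U and kernel W are ad hoc choices.
x = np.linspace(-5.0, 5.0, 401); dx = x[1] - x[0]
ts = np.linspace(0.0, 1.0, 51);  dt = ts[1] - ts[0]
sigma = 1.0

U = 0.5 * x**2
pi = np.exp(-2.0 * U / sigma**2); pi /= pi.sum() * dx
log_pi = np.log(pi)

W = np.cos(x[:, None] - x[None, :])       # interaction: F(m) = (1/2) iint W dm dm

def dF(m):                                # flat derivative, normalised
    g = W @ m * dx                        # int W(x,y) m(dy)
    return g - np.sum(g * m) * dx

kl = lambda m: np.sum(m * np.log(m / pi)) * dx

m0 = pi.copy()                            # warm start satisfying (2.4)-(2.5)
flow = np.tile(m0, (len(ts), 1))          # m^(0)_t := m_0 for all t

def picard(flow):
    """One application of Psi: (m^(n-1)_t)_t -> (m^(n)_t)_t, following (3.4)."""
    dFs = np.array([dF(m) for m in flow])
    kls = np.array([kl(m) for m in flow])
    new = np.empty_like(flow)
    for i, t in enumerate(ts):
        integ = np.zeros_like(x)
        for j in range(i + 1):            # Riemann sum over s in [0, t]
            w = 0.5 * sigma**2 * np.exp(-0.5 * sigma**2 * (t - ts[j])) * dt
            integ += w * (2.0 / sigma**2 * dFs[j] - log_pi - kls[j])
        m = np.exp(np.exp(-0.5 * sigma**2 * t) * np.log(m0) - integ)
        new[i] = m / (m.sum() * dx)       # renormalise on the grid
    return new

for n in range(1, 6):
    new = picard(flow)
    tvT = np.sum(0.5 * np.abs(new - flow).sum(axis=1) * dx) * dt  # TV_T distance
    print(n, tvT)                         # shrinks rapidly across iterations
    flow = new
```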

We have the following result.

Lemma 3.1

The sequence of flows \(\left( (m_t^{(n)})_{t \in [0,T]} \right) _{n=0}^{\infty }\) given by (3.4) is well-defined and such that for all \(n \ge 1\) and all \(t \in [0,T]\) we have

$$\begin{aligned} {\text {KL}}(m_t^{(n)}|\pi ) \le 2 \log R + \frac{4}{\sigma ^2}C \,. \end{aligned}$$

Proof

Consider \(n=1\). By (2.2) and (3.3), the integral on the right hand side of (3.4) is finite, and hence \((m_t^{(1)})_{t \in [0,T]}\) is well-defined. Note that due to (2.2), the only potential issue with the definition of \((m_t^{(n)})_{t \in [0,T]}\) is the KL-divergence term under the integral, since a priori we do not know whether it is integrable. We will now show by induction how to bound this term. Suppose that \(\int _0^T {\text {KL}}(m_s^{(n-1)}|\pi ) ds < \infty \) and, based on (3.4), write

$$\begin{aligned} \log \frac{m_t^{(n)}(x)}{\pi (x)}&= e^{-\frac{\sigma ^2}{2}t} \log \frac{m_0(x)}{\pi (x)} \\&\quad - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left( \frac{2}{\sigma ^2} \frac{\delta F}{\delta m}(m_s^{(n-1)},x) - {\text {KL}}(m_s^{(n-1)}|\pi ) \right) ds \,. \end{aligned}$$
(3.5)

We also have

$$\begin{aligned} \log \frac{\pi (x)}{m_t^{(n)}(x)}&= - e^{-\frac{\sigma ^2}{2}t} \log \frac{m_0(x)}{\pi (x)} \\&\quad - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left( - \frac{2}{\sigma ^2} \frac{\delta F}{\delta m}(m_s^{(n-1)},x) + {\text {KL}}(m_s^{(n-1)}|\pi ) \right) ds \,. \end{aligned}$$
(3.6)

Due to (2.2) and (2.5), we can multiply both sides of (3.5) by \(m_t^{(n)}(x)\) and integrate with respect to x in order to obtain

$$\begin{aligned} {\text {KL}}(m_t^{(n)}|\pi ) \le \log R + \frac{2}{\sigma ^2}C + \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} {\text {KL}}(m_s^{(n-1)}|\pi ) ds \,. \end{aligned}$$

Similarly, by multiplying both sides of (3.6) by \(\pi (x)\) and integrating with respect to x, we obtain

$$\begin{aligned} {\text {KL}}(\pi |m_t^{(n)}) \le \log R + \frac{2}{\sigma ^2}C - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} {\text {KL}}(m_s^{(n-1)}|\pi ) ds \,. \end{aligned}$$

Consequently, we obtain

$$\begin{aligned} {\text {KL}}(m_t^{(n)}|\pi ) \le {\text {KL}}(m_t^{(n)}|\pi ) + {\text {KL}}(\pi |m_t^{(n)}) \le 2 \log R + \frac{4}{\sigma ^2}C \,, \end{aligned}$$

which finishes the proof by induction. \(\square \)

We will now consider the sequence of flows \(\left( (m_t^{(n)})_{t \in [0,T]} \right) _{n=0}^{\infty }\) in \(\mathcal {P}(\mathbb {R}^d)^{[0,T]}\) equipped with the distance \(\mathcal{T}\mathcal{V}_T\), defined for any \((\mu _t)_{t \in [0,T]}\), \((\nu _t)_{t \in [0,T]} \in \mathcal {P}(\mathbb {R}^d)^{[0,T]}\) by

$$\begin{aligned} \mathcal{T}\mathcal{V}_T \left( (\mu _t)_{t \in [0,T]}, (\nu _t)_{t \in [0,T]} \right) := \int _0^T TV(\mu _t, \nu _t) dt \,. \end{aligned}$$

Since \(\mathcal {P}(\mathbb {R}^d)\) equipped with the total variation distance TV is complete, we can apply the argument from Lemma A.5 in [24] with \(p=1\) to conclude that \(\mathcal {P}(\mathbb {R}^d)^{[0,T]}\) equipped with \(\mathcal{T}\mathcal{V}_T\) is also complete. We will now consider the Picard iteration mapping \(\Psi \left( (m_t^{(n-1)})_{t \in [0,T]}\right) := (m_t^{(n)})_{t \in [0,T]}\) defined via (3.4), and show that \(\Psi \) is contractive in \((\mathcal {P}(\mathbb {R}^d)^{[0,T]},\mathcal{T}\mathcal{V}_T)\). Then the Banach fixed point theorem will give us the existence of a solution to (3.1).

Lemma 3.2

The mapping \(\Psi \left( (m_t^{(n-1)})_{t \in [0,T]}\right) := (m_t^{(n)})_{t \in [0,T]}\) defined via (3.4) is contractive in \((\mathcal {P}(\mathbb {R}^d)^{[0,T]},\mathcal{T}\mathcal{V}_T)\).

Proof

From (3.4) we have

$$\begin{aligned} \log m_t^{(n)}(x) - \log m_t^{(n-1)}(x)&= - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left[ \frac{2}{\sigma ^2} \left( \frac{\delta F}{\delta m}(m_s^{(n-1)},x) - \frac{\delta F}{\delta m}(m_s^{(n-2)},x) \right) \right. \\&\qquad \left. - {\text {KL}}(m_s^{(n-1)}|\pi ) + {\text {KL}}(m_s^{(n-2)}|\pi ) \right] ds \,. \end{aligned}$$

Multiplying both sides by \(m_t^{(n)}(x)\) and integrating with respect to x, we obtain

$$\begin{aligned} {\text {KL}}(m_t^{(n)}|m_t^{(n-1)})&= - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \Bigg [ \frac{2}{\sigma ^2} \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m}(m_s^{(n-1)},x) - \frac{\delta F}{\delta m}(m_s^{(n-2)},x) \right) m_t^{(n)}(dx) \\&\qquad - {\text {KL}}(m_s^{(n-1)}|\pi ) + {\text {KL}}(m_s^{(n-2)}|\pi ) \Bigg ] ds \,. \end{aligned}$$
(3.7)

Moreover, note that

$$\begin{aligned} \begin{aligned}&\int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m}(m_s^{(n-1)},x) - \frac{\delta F}{\delta m}(m_s^{(n-2)},x) \right) m_t^{(n)}(dx)\\&\quad = \int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \int _0^1 \frac{\delta ^2 F}{\delta m^2} \left( m_s^{(n-2)} + \lambda \left( m_s^{(n-1)} - m_s^{(n-2)} \right) ,x,y \right) d \lambda \\&\qquad \times \left( m_s^{(n-1)} - m_s^{(n-2)} \right) (dy) m_t^{(n)}(dx) \,. \end{aligned} \end{aligned}$$

Similarly, again from (3.4) we have

$$\begin{aligned} \log m_t^{(n-1)}(x) - \log m_t^{(n)}(x)&= - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left[ \frac{2}{\sigma ^2} \left( \frac{\delta F}{\delta m}(m_s^{(n-2)},x) - \frac{\delta F}{\delta m}(m_s^{(n-1)},x) \right) \right. \\&\qquad \left. - {\text {KL}}(m_s^{(n-2)}|\pi ) + {\text {KL}}(m_s^{(n-1)}|\pi ) \right] ds \,. \end{aligned}$$

Multiplying both sides by \(m_t^{(n-1)}(x)\) and integrating with respect to x, we obtain

$$\begin{aligned} {\text {KL}}(m_t^{(n-1)}|m_t^{(n)})&= - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \Bigg [ \frac{2}{\sigma ^2} \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m}(m_s^{(n-2)},x) - \frac{\delta F}{\delta m}(m_s^{(n-1)},x) \right) m_t^{(n-1)}(dx) \\&\qquad - {\text {KL}}(m_s^{(n-2)}|\pi ) + {\text {KL}}(m_s^{(n-1)}|\pi ) \Bigg ] ds \,. \end{aligned}$$
(3.8)

Similarly as before, we note that

$$\begin{aligned} \begin{aligned}&\int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m}(m_s^{(n-2)},x) - \frac{\delta F}{\delta m}(m_s^{(n-1)},x) \right) m_t^{(n-1)}(dx)\\&\quad = -\int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \int _0^1 \frac{\delta ^2 F}{\delta m^2} \left( m_s^{(n-2)} + \lambda \left( m_s^{(n-1)} - m_s^{(n-2)} \right) ,x,y \right) d \lambda \\&\qquad \times \left( m_s^{(n-1)} - m_s^{(n-2)} \right) (dy) m_t^{(n-1)}(dx) \,. \end{aligned} \end{aligned}$$

Combining (3.7) and (3.8), we obtain

$$\begin{aligned} \begin{aligned} {\text {KL}}(m_t^{(n)}|m_t^{(n-1)}) + {\text {KL}}(m_t^{(n-1)}|m_t^{(n)})&= - \int _0^t e^{-\frac{\sigma ^2}{2}(t-s)} \\&\quad \times \int _{\mathbb {R}^d} \int _{\mathbb {R}^d} \int _0^1 \frac{\delta ^2 F}{\delta m^2} \left( m_s^{(n-2)} + \lambda \left( m_s^{(n-1)} - m_s^{(n-2)} \right) ,x,y \right) d \lambda \\&\quad \times \left( m_s^{(n-1)} - m_s^{(n-2)} \right) (dy) \left( m_t^{(n)} - m_t^{(n-1)}\right) (dx) ds \,. \end{aligned} \end{aligned}$$

Hence, due to (2.3), we get

$$\begin{aligned} {\text {KL}}(m_t^{(n)}|m_t^{(n-1)}) + {\text {KL}}(m_t^{(n-1)}|m_t^{(n)}) \le \int _0^t e^{-\frac{\sigma ^2}{2}(t-s)} C_2\, TV(m_s^{(n-1)},m_s^{(n-2)})\, TV(m_t^{(n)},m_t^{(n-1)})\, ds \,. \end{aligned}$$

By the Pinsker–Csiszár inequality, \(TV^2(m_t^{(n)},m_t^{(n-1)}) \le \frac{1}{2} {\text {KL}}(m_t^{(n)}|m_t^{(n-1)})\), and hence

$$\begin{aligned} 4 TV^2(m_t^{(n)},m_t^{(n-1)}) \le C_2 TV(m_t^{(n)},m_t^{(n-1)}) \int _0^t e^{-\frac{\sigma ^2}{2}(t-s)} TV(m_s^{(n-1)},m_s^{(n-2)}) ds \,, \end{aligned}$$

which gives

$$\begin{aligned} \begin{aligned}&TV(m_t^{(n)},m_t^{(n-1)}) \le \frac{C_2}{4} \int _0^t e^{-\frac{\sigma ^2}{2}(t-s)} TV(m_s^{(n-1)},m_s^{(n-2)}) ds \\&\quad \le \left( \frac{C_2}{4} \right) ^{n-1} e^{-\frac{\sigma ^2}{2}t} \int _0^t \int _0^{t_1} \ldots \int _0^{t_{n-2}} e^{\frac{\sigma ^2}{2}t_{n-1}} TV(m_{t_{n-1}}^{(1)}, m_{t_{n-1}}^{(0)}) dt_{n-1} \ldots dt_2 dt_1 \\&\quad \le \left( \frac{C_2}{4} \right) ^{n-1} e^{-\frac{\sigma ^2}{2}t} \frac{t^{n-2}}{(n-2)!}\int _0^t e^{\frac{\sigma ^2}{2}t_{n-1}} TV(m_{t_{n-1}}^{(1)}, m_{t_{n-1}}^{(0)}) dt_{n-1} \\&\quad \le \left( \frac{C_2}{4} \right) ^{n-1} \frac{t^{n-2}}{(n-2)!}\int _0^t TV(m_{t_{n-1}}^{(1)}, m_{t_{n-1}}^{(0)}) dt_{n-1} \,, \end{aligned} \end{aligned}$$

where in the third inequality we bounded \(\int _0^{t_{n-2}} dt_{n-1} \le \int _0^{t} dt_{n-1}\) and in the fourth inequality we bounded \(e^{\frac{\sigma ^2}{2}t_{n-1}} \le e^{\frac{\sigma ^2}{2}t}\). Hence we obtain

$$\begin{aligned} \int _0^T TV(m_t^{(n)},m_t^{(n-1)})dt \le \left( \frac{C_2}{4} \right) ^{n-1} \frac{T^{n-1}}{(n-2)!}\int _0^T TV(m_{t_{n-1}}^{(1)}, m_{t_{n-1}}^{(0)}) dt_{n-1} \,. \end{aligned}$$

For sufficiently large n, the constant on the right hand side becomes smaller than 1, so a sufficiently high iterate of \(\Psi \) is a strict contraction; this suffices for the Banach fixed point argument, and the proof is complete. \(\square \)

We can now finalize the proof of Theorem 2.1.

Proof of Theorem 2.1

Step 1: Existence of the gradient flow and bound (2.7) on [0, T]. By Lemma 3.2, for any \(T > 0\) we obtain the existence of a flow \((m_t)_{t \in [0,T]}\) satisfying (3.1). Moreover, for Lebesgue-almost all \(t \in [0,T]\) we have

$$\begin{aligned} TV(m_t^{(n)},m_t) \rightarrow 0 \qquad \text { as } n \rightarrow \infty \,, \end{aligned}$$

which implies

$$\begin{aligned} m_t^{(n)} \rightarrow m_t \qquad \text { weakly, as } n \rightarrow \infty \,. \end{aligned}$$

Hence, using the lower semi-continuity of the KL-divergence (see e.g. Theorem 2.34 in [1]) we obtain

$$\begin{aligned} {\text {KL}}(m_t|\pi ) \le \liminf _{n \rightarrow \infty } {\text {KL}}(m_t^{(n)}|\pi ) \le 2 \log R + \frac{4C}{\sigma ^2} \,, \end{aligned}$$
(3.9)

where the second inequality follows from Lemma 3.1. In order to ensure that the solution \((m_t)_{t \in [0,T]}\) can be extended to all \(t \ge 0\), we first need to prove the bound on the ratio \(m_t/\pi \) in (2.8).

Step 2: Ratio condition (2.8). Following the discussion from the beginning of Section 3, we see that for any \(t \in [0,T]\) we have

$$\begin{aligned} \begin{aligned} \log \frac{m_t(x)}{\pi (x)}&= e^{-\frac{\sigma ^2}{2}t} \log \frac{m_0(x)}{\pi (x)} \\&\quad - \int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)} \left( \frac{2}{\sigma ^2} \frac{\delta F}{\delta m}(m_s,x) - {\text {KL}}(m_s|\pi ) \right) ds \,. \end{aligned} \end{aligned}$$

Using (2.2), (2.5) and (3.9) we obtain

$$\begin{aligned} \log \frac{m_t(x)}{\pi (x)} \le \log R + \frac{2C}{\sigma ^2} + \left( 2 \log R + \frac{4C}{\sigma ^2} \right) = 3 \log R + \frac{6C}{\sigma ^2} \,, \end{aligned}$$

where we used \(\int _0^t \frac{\sigma ^2}{2} e^{-\frac{\sigma ^2}{2}(t-s)}\, ds = 1 - e^{-\frac{\sigma ^2}{2}t} \le 1\).

Hence we can choose \(R_1 := 1 + \exp \left( 3 \log R + \frac{6C}{\sigma ^2} \right) \). Note that we choose \(R_1 > 1\) purely for convenience, to ensure that \(\log R_1 > 0\) in our subsequent calculations. Obtaining a lower bound on \(\frac{m_t(x)}{\pi (x)}\) follows similarly, by using (2.4) instead of (2.5).

Step 3: Existence of the gradient flow on \([0,\infty )\). In order to complete our proof, note that the unique solution \((m_t)_{t \in [0,T]}\) to (3.1) can also be expressed as

$$\begin{aligned} m_t(x) = m_0(x) \exp \left( - \int _0^t \left( \frac{\delta F}{\delta m} (m_s,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_s(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2}{\text {KL}}(m_s|\pi ) \right) ds \right) \,. \end{aligned}$$

From (2.2), (3.9) and (2.8), we obtain for any \(t \in [0,T]\)

$$\begin{aligned} \left| \frac{\delta F}{\delta m}(m_t,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_t(x)}{\pi (x)} \right) - \frac{\sigma ^2}{2}{\text {KL}}(m_t|\pi ) \right| \le 3C + \frac{\sigma ^2}{2}\left( \max \{ |\log r_1|,\log R_1 \} + 2 \log R \right) =: C_V \,. \end{aligned}$$

This gives \(\Vert m_t \Vert _{TV} \le \Vert m_0 \Vert _{TV} e^{C_V t}\), and shows that \(m_t\) does not explode in any finite time, hence we obtain a global solution \((m_t)_{t \in [0,\infty )}\). In particular, the bounds in (3.9), (2.8) and (2.9) hold for all \(t \ge 0\). \(\square \)

Proof of Theorem 2.2

We have the differentiability of \(F(m_t)\) as a consequence of Assumption 1. In order to show the differentiability of \({\text {KL}}(m_t|\pi ) = \int _{\mathbb {R}^d} \log \left( \frac{m_t(x)}{\pi (x)}\right) m_t(x) dx = \int _{\mathbb {R}^d} \log \left( \frac{m_t(x)}{\pi (x)}\right) \frac{m_t(x)}{\pi (x)} \pi (x)dx\), we need to prove that \(\left| \partial _t \left( \log \left( \frac{m_t(x)}{\pi (x)}\right) \frac{m_t(x)}{\pi (x)}\right) \right| \le g(x)\) for some function g integrable with respect to \(\pi \), which is sufficient by a standard result in measure theory (see e.g. Theorem 11.5 in [23]). Indeed, by (2.2), (2.8) and (2.7), we get

$$\begin{aligned} \left| \partial _t \left( \log \left( \frac{m_t(x)}{\pi (x)}\right) \frac{m_t(x)}{\pi (x)} \right) \right|&= \left| \frac{\pi (x)}{m_t(x)}\frac{\partial _t m_t(x)}{\pi (x)} \frac{m_t(x)}{\pi (x)} + \log \left( \frac{m_t(x)}{\pi (x)}\right) \frac{\partial _t m_t(x)}{\pi (x)} \right| \\&= \left| \left( 1+\log \left( \frac{m_t(x)}{\pi (x)}\right) \right) \frac{\partial _t m_t(x)}{\pi (x)} \right| \\&= \left| \left( 1 + \log \left( \frac{m_t(x)}{\pi (x)} \right) \right) \left( \frac{\delta F}{\delta m}(m_t,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_t(x)}{\pi (x)} \right) - \frac{\sigma ^2}{2}{\text {KL}}(m_t|\pi )\right) \frac{m_t(x)}{\pi (x)} \right| \\&\le \left( 1 + \max \{ |\log r_1|, \log R_1 \} \right) \left( 3C + \frac{\sigma ^2}{2}\left( \max \{ |\log r_1|, \log R_1 \} + 2 \log R \right) \right) R_1 \,. \end{aligned}$$

We can now write

$$\begin{aligned} \begin{aligned} \partial _t V^{\sigma }(m_t)&= \int _{\mathbb {R}^d} \frac{\delta F}{\delta m}(m_t,x) \partial _t m_t(x) dx + \frac{\sigma ^2}{2}\int _{\mathbb {R}^d} \partial _t \left( \log \left( \frac{m_t(x)}{\pi (x)} \right) m_t(x) \right) dx \\&= \int _{\mathbb {R}^d} \left[ \frac{\delta F}{\delta m}(m_t,x) + \frac{\sigma ^2}{2}\left( 1 + \log \left( \frac{m_t(x)}{\pi (x)}\right) \right) \right] \partial _t m_t(x) dx \\&= \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m} (m_t,x) + \frac{\sigma ^2}{2} \log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2} {\text {KL}}(m_t|\pi ) \right) \partial _t m_t(x) dx \,, \end{aligned} \end{aligned}$$

where the last equality follows from the fact that \(\int _{\mathbb {R}^d} \partial _t m_t(x) dx = 0\). Combining this with (1.4) proves (2.10). \(\square \)
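Spelled out: differentiating the exponential representation from Step 3 of the preceding proof gives \(\partial _t m_t(x) = -\left( \frac{\delta F}{\delta m}(m_t,x) + \frac{\sigma ^2}{2}\log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2}{\text {KL}}(m_t|\pi )\right) m_t(x)\), so the last display becomes

$$\begin{aligned} \partial _t V^{\sigma }(m_t) = - \int _{\mathbb {R}^d} \left( \frac{\delta F}{\delta m} (m_t,x) + \frac{\sigma ^2}{2} \log \left( \frac{m_t(x)}{\pi (x)}\right) - \frac{\sigma ^2}{2} {\text {KL}}(m_t|\pi ) \right) ^2 m_t(x)\, dx \le 0 \,, \end{aligned}$$

i.e., \(V^{\sigma }\) is non-increasing along the flow.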

Proof of Theorem 2.3

By Lemma 2.5, the quadratic growth condition (2.13) required in Theorem 2.8 is satisfied for \(m=m_t\) for all \(t >0\), with \(\lambda = \sigma ^2/2\). Moreover, due to (2.8) and (2.9), the ratio condition (2.14) required in Theorem 2.8 is satisfied for \(m=m_t\) for all \(t >0\). Indeed, recall that by the discussion below Assumption 2, the ratio condition for \(m_0/\pi \) with constants r and R is equivalent to the ratio condition for \(m_0/m^{\sigma ,*}\) with corresponding constants \(\bar{r}\) and \(\bar{R}\). Similarly, since due to (2.8) and (2.9) we have a bound on \(m_t/\pi \) for all \(t >0\) with constants \(r_1\) and \(R_1\), we can apply the argument below Assumption 2 to obtain a bound on \(m_t/m^{\sigma ,*}\) for all \(t >0\), with appropriately modified constants \(\bar{r}_1\) and \(\bar{R}_1\). Furthermore, note that by the proof of Lemma 2.5, in the case of \(G=V^{\sigma }\), the convexity condition needed in the proof of Lemma 2.6 (and thus in Theorem 2.8) can be applied with \(a(m,x)\) in place of \(\frac{\delta G}{\delta m}\), i.e., for any m, \(m' \in \mathcal {P}(\mathbb {R}^d)\) we have

$$\begin{aligned} V^{\sigma }(m) - V^{\sigma }(m') \le \int _{\mathbb {R}^d} a(m,x)(m-m')(dx) \,. \end{aligned}$$

As a consequence, the argument from the proof of Theorem 2.8 applies to our setting and the flat Polyak–Łojasiewicz condition (2.11) is satisfied for all \(t \ge 0\). \(\square \)

4 Appendix: Relations Between Different f-Divergences

Suppose we have absolutely continuous probability measures m, \(m' \in \mathcal {P}(\mathbb {R}^d)\) and a convex function \(f: [0,\infty ) \rightarrow \mathbb {R}\). Then the f-divergence of m with respect to \(m'\) is defined by

$$\begin{aligned} I_f(m|m') := \int _{\mathbb {R}^d} f\left( \frac{m(x)}{m'(x)} \right) m'(x) dx \,. \end{aligned}$$

For instance, choosing \(f(t) = t \log t\) leads to the KL-divergence and \(f(t) = (t-1)^2\) leads to the \(\chi ^2\)-divergence.
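These two specializations are easy to check numerically; the following sketch (our illustration, with arbitrary discrete distributions and a hypothetical helper `I_f`) evaluates the definition above directly:

```python
import numpy as np

# Illustrative check (ours): the f-divergence I_f(m|m') = sum_x f(m(x)/m'(x)) m'(x)
# reduces to KL for f(t) = t log t and to chi-square for f(t) = (t-1)^2.

rng = np.random.default_rng(0)
m  = rng.random(10); m  /= m.sum()    # discrete stand-ins for the densities m, m'
mp = rng.random(10); mp /= mp.sum()

def I_f(f, m, mp):
    return np.sum(f(m / mp) * mp)

kl   = I_f(lambda t: t * np.log(t), m, mp)
chi2 = I_f(lambda t: (t - 1.0)**2, m, mp)

assert np.isclose(kl,   np.sum(m * np.log(m / mp)))   # usual KL formula
assert np.isclose(chi2, np.sum((m - mp)**2 / mp))     # usual chi-square formula
```

We have the following result, adapted from Theorem 6 in [11].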

Lemma 4.1

Let \(f: [0,\infty ) \rightarrow \mathbb {R}\) be convex and such that \(f(1)=0\). Let us consider an interval \((r,R) \subset (0,\infty )\) such that

(i) f is twice differentiable on \((r,R)\);

(ii) there exist real constants a, A such that

$$\begin{aligned} a \le t f''(t) \le A \quad \text {for all } t \in (r,R) \,. \end{aligned}$$

Then for any absolutely continuous probability measures \(\mu \) and \(\nu \) satisfying \(\mu (x)/\nu (x) \in (r,R)\) for \(\nu \)-almost every \(x \in \mathbb {R}^d\), we have the inequality

$$\begin{aligned} a {\text {KL}}(\mu |\nu ) \le I_f(\mu |\nu ) \le A {\text {KL}}(\mu |\nu ) \,. \end{aligned}$$

Proof

Let us define a mapping \(F_a: (0,\infty ) \rightarrow \mathbb {R}\) given by \(F_a(t) := f(t)-a t\log t\). Then \(F_a(1)=0\), and \(F_a\) is twice differentiable and convex on \((r,R)\), since \(F_a''(t) = f''(t) - a/t = \left( t f''(t) - a \right) /t \ge 0\) for \(t \in (r,R)\). Since the ratio \(\mu /\nu \) takes values in \((r,R)\), the divergence associated to \(F_a\) (a convex function with \(F_a(1)=0\)) is non-negative by Jensen’s inequality, and hence we have

$$\begin{aligned} 0 \le I_{F_a}(\mu |\nu ) = I_f(\mu |\nu ) - a {\text {KL}} (\mu |\nu ) \,. \end{aligned}$$

We now define a mapping \(F_A: (0,\infty ) \rightarrow \mathbb {R}\) by setting \(F_A(t) := A t\log t -f(t)\). Then \(F_A(1)=0\), and \(F_A\) is twice differentiable and convex on \((r,R)\), since \(F_A''(t) = A/t - f''(t) = \left( A - t f''(t) \right) /t \ge 0\) for \(t \in (r,R)\). We again use the fact that the corresponding divergence is non-negative, and we obtain

$$\begin{aligned} 0 \le I_{F_A}(\mu |\nu ) = A {\text {KL}} (\mu |\nu )-I_f(\mu |\nu )\,, \end{aligned}$$

which finishes the proof. \(\square \)
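As a quick consistency check: for \(f(t) = t \log t\) we have \(t f''(t) = 1\) for all \(t > 0\), so Lemma 4.1 applies with \(a = A = 1\) and indeed recovers \(I_f(\mu |\nu ) = {\text {KL}}(\mu |\nu )\).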

Proof of Lemma 2.7

We consider the mapping \(f_1: (0,\infty ) \rightarrow \mathbb {R}\) given by \(f_1(t)= -\log (t)\). Note that the f-divergence corresponding to this \(f_1\) is the KL-divergence with swapped arguments, i.e., for any absolutely continuous probability measures \(\mu \) and \(\nu \), we have

$$\begin{aligned} I_{f_1}(\mu |\nu ) = -\int _{\mathbb {R}^d} \log \frac{\mu (x)}{\nu (x)} \nu (x) dx = \int _{\mathbb {R}^d} \log \frac{\nu (x)}{\mu (x)} \nu (x) dx = {\text {KL}}(\nu |\mu ) \,. \end{aligned}$$

We remark that \(f_1(1)=0\) and that \(f_1\) is twice differentiable on any interval \((r,R) \subset (0,\infty )\). We also have

$$\begin{aligned} \frac{1}{R} \le t f_1''(t) \le \frac{1}{r} \quad \text {for all } t \in (r,R) \end{aligned}$$

since \(f_1''(t)=1/t^2\). Applying Lemma 4.1 with \(a = 1/R\) and \(A = 1/r\), we have

$$\begin{aligned} \frac{1}{R} {\text {KL}}(\mu |\nu ) \le {\text {KL}} (\nu |\mu ) \le \frac{1}{r} {\text {KL}}(\mu |\nu ) \,. \end{aligned}$$

This shows the first inequality in (2.12). We now consider the mapping \(f_2: (0,\infty ) \rightarrow \mathbb {R}\) defined by \(f_2(t):= (t-1)^2\), i.e., \(I_{f_2}\) is the \(\chi ^2\)-divergence. Again, \(f_2(1)=0\) and \(f_2\) is twice differentiable on any interval \((r,R) \subset (0,\infty )\). Moreover, we have

$$\begin{aligned} 2r \le t f_2''(t) \le 2R \quad \text {for all } t \in (r,R) \end{aligned}$$

since \(f_2''(t)=2\). Applying Lemma 4.1 with \(a=2r\) and \(A=2R\), we have

$$\begin{aligned} 2r {\text {KL}}(\mu |\nu ) \le \chi ^2 (\mu |\nu ) \le 2 R {\text {KL}}(\mu |\nu ) \,. \end{aligned}$$

This shows the second inequality in (2.12) and concludes the proof. \(\square \)
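Both bounds in (2.12) can be tested numerically; the following sketch (our illustration, with arbitrary discrete distributions whose density ratio lies in an interval \((r,R)\)) verifies them:

```python
import numpy as np

# Numerical sanity check (ours) of the two-sided bounds of Lemma 2.7:
#   (1/R) KL(mu|nu) <= KL(nu|mu) <= (1/r) KL(mu|nu),
#   2r KL(mu|nu) <= chi^2(mu|nu) <= 2R KL(mu|nu),
# for discrete mu, nu whose density ratio mu/nu stays in (r, R).

rng = np.random.default_rng(1)
nu = rng.random(50); nu /= nu.sum()
mu = rng.uniform(0.8, 1.2, size=50) * nu   # perturb nu so the ratio stays in a band
mu /= mu.sum()

r, R = (mu / nu).min(), (mu / nu).max()
kl_mn = np.sum(mu * np.log(mu / nu))
kl_nm = np.sum(nu * np.log(nu / mu))
chi2  = np.sum((mu - nu)**2 / nu)

assert kl_mn / R <= kl_nm <= kl_mn / r
assert 2 * r * kl_mn <= chi2 <= 2 * R * kl_mn
```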

5 Appendix: Flat Derivative

Definition 5.1

Fix \(q\ge 0\) and let \(\mathcal P_q(\mathbb {R}^d)\) be the space of probability measures on \(\mathbb {R}^d\) with finite q-th moments. A functional \(F:\mathcal P_q(\mathbb {R}^d) \rightarrow \mathbb R\) is said to admit a first order linear derivative (or a flat derivative) if there exists a functional \(\frac{\delta F}{\delta m}: \mathcal P_q(\mathbb {R}^d) \times \mathbb R^d\rightarrow \mathbb R\) such that

(1) For all \(a\in \mathbb R^d\), \(\mathcal P_q(\mathbb {R}^d) \ni m \mapsto \frac{\delta F}{\delta m}(m,a)\) is continuous.

(2) For any \(\nu \in \mathcal P_q(\mathbb {R}^d)\), there exists \(C>0\) such that for all \(a\in \mathbb R^d\) we have

$$\begin{aligned} \left| \frac{\delta F}{\delta m}({\nu },a)\right| \le C(1+|a|^q)\,. \end{aligned}$$

(3) For all m, \(m'\in \mathcal P_q (\mathbb {R}^d)\),

$$\begin{aligned} F(m')-F(m)=\int _{0}^{1}\int _{\mathbb {R}^d}\frac{\delta F}{\delta m}(m + \lambda (m'-m),a)\left( m'- m\right) (da)\,d\lambda . \end{aligned}$$
(5.1)

The functional \(\frac{\delta F}{\delta m}\) is then called the linear (functional) derivative of F on \(\mathcal P_q(\mathbb {R}^d)\).
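As a standard illustration (a textbook example, not specific to the present setting): for \(F(m) = g\left( \int _{\mathbb {R}^d} \varphi \, dm \right) \) with \(g \in C^1(\mathbb {R})\) and \(\varphi \) continuous with \(|\varphi (a)| \le C(1+|a|^q)\), a flat derivative is

$$\begin{aligned} \frac{\delta F}{\delta m}(m,a) = g'\left( \int _{\mathbb {R}^d} \varphi \, dm \right) \varphi (a) \,. \end{aligned}$$

Indeed, writing \(u_{\lambda } := \int _{\mathbb {R}^d} \varphi \, d\left( m + \lambda (m'-m)\right) \), the right-hand side of (5.1) equals \(\int _0^1 g'(u_{\lambda })(u_1-u_0)\, d\lambda = g(u_1)-g(u_0) = F(m')-F(m)\) by the fundamental theorem of calculus. Under the normalising convention adopted in the introduction, one subtracts the constant \(g'\left( \int \varphi \, dm \right) \int \varphi \, dm\).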

Note that Definition 5.1 easily generalizes to higher order linear derivatives. More precisely, for a fixed \(a \in \mathbb {R}^d\) the functional \(\frac{\delta F}{\delta m}(\cdot , a): \mathcal P_q(\mathbb {R}^d) \rightarrow \mathbb R\) can admit a first order linear derivative \(\frac{\delta }{\delta m}\left( \frac{\delta F}{\delta m}(\cdot , a)\right) : \mathcal P_q(\mathbb {R}^d) \times \mathbb {R}^d \rightarrow \mathbb R\) whenever the conditions from Definition 5.1 are satisfied. If that derivative exists for any \(a \in \mathbb {R}^d\), we say that F admits a second order linear derivative \(\frac{\delta ^2 F}{\delta m^2}: \mathcal P_q(\mathbb {R}^d) \times \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb R\), which then satisfies, for all \(a \in \mathbb {R}^d\), and for all m, \(m'\in \mathcal P_q (\mathbb {R}^d)\)

$$\begin{aligned} \frac{\delta F}{\delta m}(m',a)-\frac{\delta F}{\delta m}(m,a)=\int _{0}^{1}\int _{\mathbb {R}^d}\frac{\delta ^2 F}{\delta m^2}(m + \lambda (m'-m),a',a)\left( m'- m\right) (da')\,d\lambda . \end{aligned}$$
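Continuing the example above: for \(F(m) = g\left( \int _{\mathbb {R}^d} \varphi \, dm \right) \), now with \(g \in C^2(\mathbb {R})\), applying the same fundamental-theorem-of-calculus computation to \(m \mapsto \frac{\delta F}{\delta m}(m,a) = g'\left( \int \varphi \, dm \right) \varphi (a)\) yields the second order linear derivative

$$\begin{aligned} \frac{\delta ^2 F}{\delta m^2}(m,a',a) = g''\left( \int _{\mathbb {R}^d} \varphi \, dm \right) \varphi (a')\varphi (a) \,. \end{aligned}$$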