Conformal mirror descent with logarithmic divergences

The logarithmic divergence is an extension of the Bregman divergence motivated by optimal transport and a generalized convex duality, and satisfies many remarkable properties. Using the geometry induced by the logarithmic divergence, we introduce a generalization of continuous time mirror descent that we term the conformal mirror descent. We derive its dynamics under a generalized mirror map, and show that it is a time change of a corresponding Hessian gradient flow. We also prove convergence results in continuous time. We apply the conformal mirror descent to online estimation of a generalized exponential family, and construct a family of gradient flows on the unit simplex via the Dirichlet optimal transport problem.


1. Introduction
Information geometry provides not only powerful tools for studying spaces of probability distributions, but also a wide range of geometric structures that are useful for various challenges in data science [5,3,6]. The Bregman divergence [9] plays a key role in the theory and application of information geometry. It is the canonical divergence of the dually flat geometry [30] which arises naturally in exponential families [7], and can serve as a loss function in statistical estimation and optimal control [18]. The Bregman divergence is especially tractable in applied settings, as it is closely connected to convex duality and satisfies a generalized Pythagorean theorem which greatly simplifies the analysis of Bregman projections. Among the many applications of Bregman divergences, we mention clustering [7], exponential family principal component analysis [12] as well as boosting and logistic regression [13,29].
We present a generalization of mirror descent [32,8] which is a popular first-order iterative optimization algorithm. Mirror descent is a gradient descent algorithm where a Bregman divergence serves as a proximal function. A suitable convex generating function may be chosen to exploit the geometry of the problem. The update step (2.6) of mirror descent involves a change of coordinates using the so-called mirror map which corresponds to the information-geometric dual parameter. In the continuous time limit, mirror descent can be represented as a Riemannian gradient flow with respect to the Hessian metric induced by the given Bregman divergence [1,37]. The basic ideas are reviewed in Sections 2.1 and 2.2.
Our generalization, termed the conformal mirror descent, is based on the theory of logarithmic divergences [34,35,45,46,49]. In many senses, the logarithmic divergence may be regarded as a canonical deformation of the Bregman divergence. Just as the Bregman divergence captures the dually flat geometry, the logarithmic divergence is a canonical divergence for a dually projectively flat statistical manifold with constant nonzero sectional curvature, and also satisfies a generalized Pythagorean theorem [45]. Moreover, the logarithmic divergence under divisive normalization leads to a deformed exponential family, which is closely related to the q-exponential family in statistical physics [31], while recovering natural analogues of intrinsic information-geometric properties of the exponential family in the deformed case [45,49]. For example, the Kullback-Leibler (KL) divergence (which is the Bregman divergence of the cumulant generating function) becomes the Rényi divergence, and the dual variable can be interpreted as an escort expectation. Another appealing property is that the logarithmic divergence is associated with a generalized convex duality motivated by optimal transport [41,42]. Following [49], we call it the λ-duality, where λ ≠ 0 is the curvature parameter. It was recently shown [48] that the dualistic geometry in information geometry can be naturally embedded in the pseudo-Riemannian geometry of optimal transport [23] using the framework of c-divergence, under which divergences are induced by optimal transport maps. Bregman and logarithmic divergences are special cases corresponding to particular cost functions [35,45]. In Section 2.3, we review properties of λ-duality and logarithmic divergences that are needed in this paper. Further results about λ-duality and its relation with convex duality can be found in [50].
In Section 3, we formulate the conformal mirror descent in continuous time as a Riemannian gradient flow, where the underlying metric is induced by a logarithmic divergence. We call it the conformal mirror descent because the metric can be shown to be a conformal transformation of a Hessian metric. This implies that the conformal mirror descent is, in continuous time, a time change of mirror descent. We also derive explicit dynamics of the gradient flow under the λ-mirror map corresponding to the logarithmic divergence and related convergence results. The λ-duality suggests many new generating functions that are potentially useful in various applications.
We give two applications to demonstrate the utility of our conformal mirror descent. In Section 4, we consider online estimation of the λ-exponential family introduced in [49], and derive an elegant online natural gradient update which generalizes the one for the exponential family [37]. Dirichlet optimal transport on the unit simplex [34,35,36] is one of the original motivations of the theory of logarithmic divergences (and corresponds to the case λ = −1). Expressing the (−1)-mirror map in terms of the Dirichlet optimal transport map, we derive in Section 5 an interesting family of gradient flows on the unit simplex.
Finally, in Section 6 we discuss our contributions in the context of related literature, and propose several directions for future research.
In computations, we regard θ as a column vector and write $\theta = (\theta_1, \ldots, \theta_d)^\top$, where $\top$ denotes transposition. The Euclidean gradient $\nabla f(\theta) = \nabla_\theta f(\theta)$ of a real-valued function f is also regarded as a column vector. Due to the difficulty of unifying notations in different settings, in this paper we do not adopt the Einstein summation convention.

2. From convex duality to λ-duality

2.1. Convex duality and Bregman divergence. We begin by reviewing convex duality and Bregman divergence, which are at the core of classical information geometry [5,3] (also see [4] for a recent overview). Let φ be a lower semicontinuous convex function on $\mathbb{R}^d$. Its convex conjugate is defined by $\varphi^*(y) = \sup_{x \in \mathbb{R}^d} \{\langle x, y \rangle - \varphi(x)\}$, where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. Then φ* is also lower semicontinuous and convex, and we have the Fenchel-Young inequality
$$\varphi(x) + \varphi^*(y) - \langle x, y \rangle \ge 0, \tag{2.1}$$
where equality holds if and only if y is a subgradient of φ at x.
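Let $\Theta \subset \mathbb{R}^d$ be an open convex set and let φ : Θ → R be a smooth convex function whose Hessian $\nabla^2 \varphi(\theta)$ is everywhere positive definite. We call such a φ a Bregman generator. The Bregman divergence of φ, regarded as a generalized distance, is defined for θ, θ′ ∈ Θ by
$$B_\varphi[\theta : \theta'] = \varphi(\theta) - \varphi(\theta') - \langle \nabla\varphi(\theta'), \theta - \theta' \rangle. \tag{2.2}$$
Under the stated conditions, ∇φ is a diffeomorphism from Θ onto its range. We call θ the primal variable and ζ = ∇φ(θ) the dual variable.¹ The inverse transformation is given by θ = ∇φ*(ζ). Then, we may express (2.2) in self-dual form by
$$B_\varphi[\theta : \theta'] = \varphi(\theta) + \varphi^*(\zeta') - \langle \theta, \zeta' \rangle, \qquad \zeta' = \nabla\varphi(\theta'), \tag{2.3}$$
which is closely related to the Fenchel-Young inequality (2.1).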
Conjugation, which characterizes convex duality, is defined in terms of the pairing function $c(x, y) = -\langle x, y \rangle$. It turns out that much of the above can be generalized. For a general c, called the cost function in the context of optimal transport [41,42], we can define the c-conjugate of a function ϕ by $\phi^{(c)}(y) = \sup_x \{-c(x, y) - \phi(x)\}$. A function ϕ is said to be c-convex if it is the c-conjugate of some function $\tilde\phi$, i.e., $\phi = \tilde\phi^{(c)}$. We have the following analogue of the Fenchel-Young inequality (2.1):
$$\phi(x) + \phi^{(c)}(y) + c(x, y) \ge 0. \tag{2.4}$$
If equality holds, we call y a c-subgradient of ϕ at x. If this y is unique, we call it the c-gradient. Under suitable conditions, a Monge-Kantorovich optimal transport problem can be solved by an optimal transport map, which can be expressed as the c-gradient of some c-convex potential ϕ. The inequality (2.4) can be used to define a c-divergence on the graph of the optimal transport map [48]. The λ-duality [49] is the generalized convex duality based on the logarithmic cost
$$c_\lambda(x, y) = -\frac{1}{\lambda} \log\bigl(1 + \lambda \langle x, y \rangle\bigr), \tag{2.5}$$
which is mathematically tractable and has remarkable properties.² We recover the usual convex duality in the limiting case $\lim_{\lambda \to 0} c_\lambda(x, y) = -\langle x, y \rangle$.

¹ We reserve the symbol η for the dual variable under the λ-duality; see (2.10).
² In our applications x and y only vary in respective domains such that $1 + \lambda \langle x, y \rangle > 0$, so the logarithm in (2.5) is well defined.

2.2. Mirror descent.
Consider the minimization problem $\min_{\theta \in \Theta} f(\theta)$ where f : Θ → R is assumed to be differentiable. Let φ : Θ → R be a Bregman generator as in Section 2.1. It induces the mirror map $\zeta = \nabla_\theta \varphi(\theta)$. For clarity, we use $\nabla_\theta$ to indicate that the gradient is taken with respect to θ. The mirror descent algorithm minimizes f by iterating the update
$$\zeta_{k+1} = \zeta_k - \delta \nabla_\theta f(\theta_k), \tag{2.6}$$
where $\delta = \delta_k > 0$ is the learning rate which may depend on k. We obtain $\theta_{k+1}$ by applying the inverse mirror map, i.e., $\theta_{k+1} = \nabla_\zeta \varphi^*(\zeta_{k+1})$. Thus, implementing the algorithm requires that both ∇φ and ∇φ* can be computed. Letting $\varphi(\theta) = \frac{1}{2}|\theta|^2 = \frac{1}{2}\langle \theta, \theta \rangle$ recovers Euclidean gradient descent since in this case $\zeta = \nabla_\theta \frac{1}{2}|\theta|^2 = \theta$. In general, (2.6) requires an extra projection step when the right-hand side falls outside the range of the mirror map. The (unconstrained) update (2.6) is equivalent to the update of a Bregman proximal method, namely
$$\theta_{k+1} = \operatorname*{argmin}_{\theta \in \Theta} \Bigl\{ \langle \nabla_\theta f(\theta_k), \theta \rangle + \frac{1}{\delta} B_\varphi[\theta : \theta_k] \Bigr\}. \tag{2.7}$$
It is easy to verify that the first-order condition of (2.7) can be expressed as (2.6). Geometrically, $\theta_{k+1}$ minimizes a linear approximation of f over a Bregman ball based at $\theta_k$.
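As a concrete illustration of the mirror descent update (2.6), the following minimal sketch runs the iteration on the positive orthant. The generator φ(θ) = Σ_i (θ_i log θ_i − θ_i), the quadratic objective, the step size and all function names are assumptions chosen for illustration and are not taken from the paper. For this generator the mirror map is ζ = log θ and its inverse is θ = exp(ζ), so the iterates remain in the domain without a projection step.

```python
import numpy as np

def mirror_descent(grad_f, theta0, step, n_steps):
    """Mirror descent with the assumed generator phi(theta) = sum(theta*log(theta) - theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        zeta = np.log(theta)                 # mirror map: zeta = grad phi(theta)
        zeta = zeta - step * grad_f(theta)   # update in dual coordinates, as in (2.6)
        theta = np.exp(zeta)                 # inverse mirror map: theta = grad phi*(zeta)
    return theta

# Illustrative objective f(theta) = 0.5 * |theta - target|^2 with a positive minimizer.
target = np.array([1.0, 2.0, 0.5])

def grad_f(theta):
    return theta - target

theta_final = mirror_descent(grad_f, theta0=np.ones(3), step=0.1, n_steps=500)
print(theta_final)
```

Replacing the pair `np.log` / `np.exp` by the identity map would give Euclidean gradient descent, in line with the remark on φ(θ) = ½|θ|² above.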
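Further insights can be obtained by studying the continuous time limit as done in [37] and [22]. The Bregman divergence admits the quadratic approximation
$$B_\varphi[\theta + \Delta\theta : \theta] \approx \frac{1}{2} (\Delta\theta)^\top G_0(\theta) (\Delta\theta), \tag{2.8}$$
where $G_0(\theta) = \nabla^2_\theta \varphi(\theta)$ is a Hessian Riemannian metric (when expressed under the primal θ-coordinates) and induces the Riemannian gradient $\operatorname{grad}_{G_0} f = G_0^{-1} \nabla_\theta f$. See [38] for an in-depth geometric study of Hessian manifolds. Letting δ → 0 in (2.6) or (2.7) and scaling time appropriately, one obtains a Hessian Riemannian gradient flow [1]:
$$\frac{d}{dt}\theta_t = -G_0^{-1}(\theta_t) \nabla_\theta f(\theta_t), \qquad \text{equivalently} \qquad \frac{d}{dt}\zeta_t = -\nabla_\theta f(\theta_t). \tag{2.9}$$
The two forms are equivalent by the chain rule, since $G_0$ is the Jacobian of the mirror map θ ↦ ζ. Naturally, one may consider other metrics to obtain generalizations of mirror descent (see [22] for a discussion). In this paper, we use the Riemannian metric induced by the logarithmic divergence, which is particularly tractable.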
2.3. λ-duality and logarithmic divergence. Following the treatment of [49], we introduce the λ-duality which utilizes the logarithmic cost function $c_\lambda$ defined by (2.5). In general, c-convex functions and c-gradients are difficult to characterize explicitly. Remarkably, for the logarithmic cost function $c_\lambda$, it is possible to relate $c_\lambda$-convexity to usual convexity and express the $c_\lambda$-gradient in terms of the usual gradient. The following definition summarizes the generalized convexity notion and the regularity conditions needed for our applications. Throughout, we let λ ≠ 0 be a fixed constant.
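Definition 1 (Regular $c_\lambda$-convex function and $c_\lambda$-gradient). Let $\Theta \subset \mathbb{R}^d$ be an open convex set. A smooth function ϕ : Θ → R is said to be regular $c_\lambda$-convex if $\Phi_\lambda = \frac{1}{\lambda}(e^{\lambda\phi} - 1)$ is convex, $\nabla^2_\theta \Phi_\lambda$ is positive definite and $1 - \lambda \langle \nabla_\theta \phi(\theta), \theta \rangle > 0$ on Θ. Given such a function ϕ, we define its $c_\lambda$-gradient by
$$\nabla^{(\lambda)}_\theta \phi(\theta) = \frac{1}{1 - \lambda \langle \nabla_\theta \phi(\theta), \theta \rangle} \nabla_\theta \phi(\theta). \tag{2.10}$$
We also call $\nabla^{(\lambda)}_\theta \phi$ the λ-mirror map. Under the stated conditions, it can be shown that $\nabla^{(\lambda)}_\theta \phi$ is a diffeomorphism from Θ onto its range, which we denote by H. We write $\eta = \nabla^{(\lambda)}_\theta \phi(\theta)$ for the dual variable.
In a nutshell, instead of convex functions, we use functions ϕ such that $\Phi_\lambda = \frac{1}{\lambda}(e^{\lambda\phi} - 1)$ are convex, and replace the usual gradient by the λ-mirror map. Some examples of regular $c_\lambda$-convex functions are given in Table 1.

Table 1. Examples of regular $c_\lambda$-convex functions on the real line and their corresponding λ-mirror maps.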
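Henceforth we let ϕ be a regular $c_\lambda$-convex function. Let ψ be the $c_\lambda$-conjugate defined by
$$\psi(\eta) = \sup_{\theta \in \Theta} \Bigl\{ \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \theta, \eta \rangle\bigr) - \phi(\theta) \Bigr\}, \qquad \eta \in H.$$
Then, for θ ∈ Θ we have
$$\phi(\theta) = \sup_{\eta \in H} \Bigl\{ \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \theta, \eta \rangle\bigr) - \psi(\eta) \Bigr\}.$$
Moreover, $1 + \lambda \langle \theta, \eta \rangle > 0$ for any θ ∈ Θ and η ∈ H, and for $\eta = \nabla^{(\lambda)} \phi(\theta)$ we have the identity
$$\phi(\theta) + \psi(\eta) = \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \theta, \eta \rangle\bigr), \qquad \text{i.e.,} \qquad \phi(\theta) + \psi(\eta) + c_\lambda(\theta, \eta) = 0.$$
Thus ϕ and ψ satisfy a generalized Legendre-like duality with respect to the cost function $c_\lambda$. The inverse λ-mirror map is given by $\theta = \nabla^{(\lambda)}_\eta \psi(\eta)$. We use ϕ to define a λ-logarithmic divergence which is different from the Bregman divergence. For completeness, we review the main idea.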
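Since $\Phi_\lambda$ is convex, for θ, θ′ ∈ Θ we have $\Phi_\lambda(\theta') + \langle \nabla \Phi_\lambda(\theta'), \theta - \theta' \rangle \le \Phi_\lambda(\theta)$. Expressing this inequality in terms of ϕ, we have, using the chain rule,
$$\frac{1}{\lambda} \bigl(1 + \lambda \langle \nabla\phi(\theta'), \theta - \theta' \rangle\bigr) \le \frac{1}{\lambda} e^{\lambda(\phi(\theta) - \phi(\theta'))}. \tag{2.12}$$
Now there are two cases depending on the sign of λ, but the resulting expression is the same. Here, we consider the case λ < 0 and the other case is similar. Multiplying (2.12) by λ < 0, taking logarithms and dividing by λ reverses the inequality twice, so from (2.12) we have
$$\phi(\theta') + \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \nabla\phi(\theta'), \theta - \theta' \rangle\bigr) \le \phi(\theta). \tag{2.13}$$
Taking the difference yields the λ-logarithmic divergence. When ϕ is convex, letting λ → 0 in (2.13) recovers the Bregman divergence (see Figure 1).

Figure 1. The λ-logarithmic divergence (2.14) is the error term of a logarithmic first order approximation; see (2.13). The red dashed curve shows the case λ = 1.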
Definition 2 (λ-logarithmic divergence). Let ϕ be a regular $c_\lambda$-convex function. We define its λ-logarithmic divergence for θ, θ′ ∈ Θ by
$$L_{\lambda,\phi}[\theta : \theta'] = \phi(\theta) - \phi(\theta') - \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \nabla\phi(\theta'), \theta - \theta' \rangle\bigr). \tag{2.14}$$
An important application of the logarithmic divergence is to certain generalized exponential families, where an appropriately defined potential function ϕ leads to the Rényi divergence. See Section 4, where we exploit this property in online parameter estimation.
Analogous to (2.3), the λ-logarithmic divergence admits the following self-dual representation which verifies that it is the c-divergence of the cost $c_\lambda$:
$$L_{\lambda,\phi}[\theta : \theta'] = \phi(\theta) + \psi(\eta') - \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \theta, \eta' \rangle\bigr), \qquad \eta' = \nabla^{(\lambda)} \phi(\theta'). \tag{2.15}$$
The logarithmic divergence can be justified by the remarkable properties satisfied by the induced dualistic geometry (g, ∇, ∇*) [35,45,49]:
• The primal and dual connections (∇, ∇*) are dually projectively flat. In particular, the primal (resp. dual) geodesics are time-reparameterized straight lines under the primal (resp. dual) coordinate system.
• The sectional curvatures of ∇ and ∇* with respect to g are everywhere constant and equal to λ.
• The generalized Pythagorean theorem holds for the λ-logarithmic divergence.
• Given a dualistic structure which is dually projectively flat with constant (nonzero) sectional curvature, one can define (locally) a λ-logarithmic divergence which induces the given structure. Thus, the λ-logarithmic divergence can be regarded as a canonical divergence.
Letting λ → 0 recovers well-known properties of the dually flat geometry induced by a Bregman divergence.
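To define the conformal mirror descent we will need the explicit form of the metric g. We give two equivalent representations, both of which are useful in our development. Under the primal coordinate system θ, the coefficients of g are given by the matrix
$$G_\lambda(\theta) = \nabla^2_\theta \phi(\theta) + \lambda \bigl(\nabla_\theta \phi(\theta)\bigr)\bigl(\nabla_\theta \phi(\theta)\bigr)^\top = e^{-\lambda\phi(\theta)} \nabla^2_\theta \Phi_\lambda(\theta). \tag{2.16}$$
The first representation states that $G_\lambda$ is a rank-one correction of the Hessian $\nabla^2_\theta \phi$. The second representation states that $G_\lambda$ is a conformal transformation of the Hessian metric $\nabla^2_\theta \Phi_\lambda$. That is, $G_\lambda$ is a conformal Hessian metric. Geometrically, a conformal transformation preserves the angles between tangent vectors but distorts their lengths. We note that the last expression in (2.16) may also be understood via the identity
$$L_{\lambda,\phi}[\theta : \theta'] = -\frac{1}{\lambda} \log\Bigl(1 - \lambda e^{-\lambda\phi(\theta)} B_{\Phi_\lambda}[\theta : \theta']\Bigr), \tag{2.17}$$
which can be verified by a direct computation. It states that the λ-logarithmic divergence is a monotone transformation of a left conformal Bregman divergence [33]. See [47] for more discussion in this direction.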
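The two representations of the metric, and the relation between the logarithmic divergence and a conformal Bregman divergence, can be checked numerically. The sketch below is illustrative only: the generator ϕ(θ) = ½|θ|², the value λ = 0.5, the test points and the helper names are assumptions, and the identity tested in the last lines, L_{λ,ϕ}[θ : θ′] = −(1/λ) log(1 − λ e^{−λϕ(θ)} B_{Φ_λ}[θ : θ′]), is written in the form obtained by expanding the definitions of L_{λ,ϕ} and Φ_λ.

```python
import numpy as np

lam = 0.5                                     # illustrative curvature parameter (assumption)

def phi(th):                                  # illustrative generator phi(theta) = |theta|^2 / 2
    return 0.5 * th @ th

def grad_phi(th):
    return th

def Phi(th):                                  # Phi_lambda = (exp(lam * phi) - 1) / lam
    return (np.exp(lam * phi(th)) - 1.0) / lam

def hessian_fd(fun, x, h=1e-4):
    """Central finite-difference Hessian of a scalar function."""
    d = x.size
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            ei, ej = I[i] * h, I[j] * h
            H[i, j] = (fun(x + ei + ej) - fun(x + ei - ej)
                       - fun(x - ei + ej) + fun(x - ei - ej)) / (4.0 * h * h)
    return H

theta = np.array([0.3, -0.2])
theta_p = np.array([0.1, 0.4])

# Rank-one form: Hessian of phi (the identity here) plus lam * grad grad^T.
G_rank_one = np.eye(2) + lam * np.outer(grad_phi(theta), grad_phi(theta))
# Conformal form: exp(-lam * phi) times the Hessian of Phi_lambda.
G_conformal = np.exp(-lam * phi(theta)) * hessian_fd(Phi, theta)

def log_div(t, tp):
    """lambda-logarithmic divergence L[t : tp]."""
    return phi(t) - phi(tp) - np.log1p(lam * grad_phi(tp) @ (t - tp)) / lam

def bregman_Phi(t, tp):
    """Bregman divergence of Phi_lambda."""
    grad_Phi_tp = np.exp(lam * phi(tp)) * grad_phi(tp)
    return Phi(t) - Phi(tp) - grad_Phi_tp @ (t - tp)

L_direct = log_div(theta, theta_p)
L_via_bregman = -np.log1p(-lam * np.exp(-lam * phi(theta)) * bregman_Phi(theta, theta_p)) / lam
print(G_rank_one, G_conformal, L_direct, L_via_bregman)
```

The Hessian of Φ_λ is computed by finite differences on purpose, so that agreement of the two matrices is a genuine check rather than an algebraic tautology.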

3. Conformal mirror descent
In this section, we present our first main contribution: a generalization of continuous time mirror descent as the Riemannian gradient flow with respect to a λ-logarithmic divergence. In Section 3.1, we define the flow and interpret it in two ways: (i) a mirror-like descent under the λ-mirror map (2.10), and (ii) a time change of a Hessian gradient flow. It reduces to the continuous time mirror descent (2.9) in the limit λ → 0. Convergence results are stated and proved in Section 3.2.
3.1. The flow and two representations. As described in Section 2.2, the usual (Bregman) mirror descent can be understood as (i) a Bregman proximal method (2.7); or (ii) a discretization of the Hessian gradient flow (2.9). This suggests two ways to generalize the method. Formally, we may replace the Bregman divergence in (2.7) by a λ-logarithmic divergence. While the resulting proximal method is well-defined, unfortunately the first-order condition cannot be solved explicitly to yield a simple update as in mirror descent (see (2.6)). We study instead the continuous time Riemannian gradient flow with respect to the metric $G_\lambda$, and it turns out that this is much more tractable. We fix λ ≠ 0 and let a regular $c_\lambda$-convex generator ϕ be given on the convex domain Θ.
Definition 3 (Conformal mirror descent in continuous time). Let f : Θ → R be a differentiable function to be minimized. Given an initial value $\theta_0 \in \Theta$, the continuous time conformal mirror descent is the Riemannian gradient flow given by
$$\frac{d}{dt}\theta_t = -\operatorname{grad}_{G_\lambda} f(\theta_t) = -G_\lambda^{-1}(\theta_t) \nabla_\theta f(\theta_t). \tag{3.1}$$
While any Riemannian metric G(θ) can be used to define a gradient flow, implementation of the flow generally requires computation of the Riemannian gradient $G^{-1} \nabla_\theta f$. In (2.9), the mirror map ∇φ eliminates the need to compute $G_0^{-1}$ because $G_0 = \nabla^2 \varphi$ is the Jacobian of the mirror map. Here, we show that a similar property holds under the λ-mirror map. We let $I_d$ denote the d × d identity matrix.
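The property alluded to here can be written out explicitly. The following derivation is a sketch reconstructed from the definition of the λ-mirror map, η = ∇ϕ(θ)/(1 − λ⟨∇ϕ(θ), θ⟩), and the rank-one form G_λ = ∇²ϕ + λ(∇ϕ)(∇ϕ)^⊤ of the metric. The shorthand u and D is introduced here for convenience and is not the paper's notation.

```latex
% Shorthand (assumed): u = \nabla_\theta\phi(\theta), \quad D = 1 - \lambda\langle u,\theta\rangle,
% so that \eta = u/D and 1 + \lambda\langle\theta,\eta\rangle = 1/D.
\begin{align*}
\frac{\partial\eta}{\partial\theta}
  &= \frac{1}{D}\,\nabla^2_\theta\phi
     + \frac{\lambda}{D^2}\,u\,\bigl(\nabla^2_\theta\phi\,\theta + u\bigr)^\top
   = \frac{1}{D}\Bigl(\nabla^2_\theta\phi
     + \lambda\,\eta\,\theta^\top\nabla^2_\theta\phi + \lambda\,\eta\,u^\top\Bigr), \\
(I_d + \lambda\,\eta\,\theta^\top)\,G_\lambda
  &= \nabla^2_\theta\phi + \lambda\,\eta\,\theta^\top\nabla^2_\theta\phi
     + \lambda\bigl(u + \lambda\langle\theta,u\rangle\,\eta\bigr)u^\top
   = \nabla^2_\theta\phi + \lambda\,\eta\,\theta^\top\nabla^2_\theta\phi + \lambda\,\eta\,u^\top, \\
\frac{\partial\eta}{\partial\theta}
  &= \bigl(1 + \lambda\langle\theta,\eta\rangle\bigr)\,(I_d + \lambda\,\eta\,\theta^\top)\,G_\lambda, \\
\frac{d}{dt}\eta_t
  &= \frac{\partial\eta}{\partial\theta}\,\frac{d}{dt}\theta_t
   = -\bigl(1 + \lambda\langle\theta_t,\eta_t\rangle\bigr)\,
      (I_d + \lambda\,\eta_t\theta_t^\top)\,\nabla_\theta f(\theta_t).
\end{align*}
```

The second line uses u + λ⟨θ, u⟩η = (D + λ⟨u, θ⟩)η = η. In the last line the inverse metric cancels, so the dual dynamics require only the Euclidean gradient of f, which is the analogue of the role played by the mirror map in (2.9).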
Proof. Under the primal coordinate system, the metric G λ is given by In matrix form, we have is the Jacobian of the transformation θ → η. Now we may invert (3.3) using the Sherman-Morrison formula to get By definition, the gradient flow (3.1) is given by Expressing the flow in terms of the dual variable, we have, by the chain rule again, Next, by using the fact that G λ is a conformal Hessian metric, we show that the conformal mirror descent gradient flow can be viewed as a time change of a Hessian gradient flow.
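Theorem 2 (Time-change of Hessian gradient flow). Let $(\tilde\theta_s)_{s \ge 0}$ be the Hessian gradient flow (2.9) with respect to the Bregman generator $\Phi_\lambda = \frac{1}{\lambda}(e^{\lambda\phi} - 1)$. Consider the time change $s = s_t$, where $\frac{d}{dt} s_t = \exp\bigl(\lambda\phi(\tilde\theta_{s_t})\bigr)$. Then $\theta_t = \tilde\theta_{s_t}$ is the conformal mirror descent (3.1) induced by ϕ. In particular, let $\zeta_t = \nabla\Phi_\lambda(\theta_t)$ be the dual variable with respect to the Bregman generator $\Phi_\lambda$. Then the flow can be expressed as
$$\frac{d}{dt}\zeta_t = -e^{\lambda\phi(\theta_t)} \nabla_\theta f(\theta_t).$$
Proof. By (3.1) and (2.16), we have
$$\frac{d}{dt}\theta_t = -e^{\lambda\phi(\theta_t)} \tilde G_0^{-1}(\theta_t) \nabla_\theta f(\theta_t), \tag{3.4}$$
where $\tilde G_0 = \nabla^2_\theta \Phi_\lambda$. Let $\tilde\theta_s$ be the Hessian gradient flow (2.9) induced by the metric $\tilde G_0$, and let $s = s_t$ be the given time change. Applying the chain rule in (2.9), we have
$$\frac{d}{dt}\tilde\theta_{s_t} = \frac{d s_t}{dt} \cdot \frac{d}{ds}\tilde\theta_s \Big|_{s = s_t} = -e^{\lambda\phi(\tilde\theta_{s_t})} \tilde G_0^{-1}(\tilde\theta_{s_t}) \nabla_\theta f(\tilde\theta_{s_t}).$$
Comparing this with (3.4), we see that $\tilde\theta_{s_t} = \theta_t$. The proof of the last statement is similar.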
By Theorem 2, the trajectory of a conformal mirror descent gradient flow is the same as that of a Hessian gradient flow; the conformal transformation of the metric introduces a time-varying learning rate depending on the value ϕ(θ t ). The main novelty of conformal mirror descent is that the λ-duality suggests novel choices of the generator ϕ; additionally, the λ-mirror map is more natural in certain problems. For example, in Section 4 we apply it to online natural gradient learning for some generalized exponential families.

Proposition 3 (Primal and dual flows).
(i) The trajectory of the primal flow follows a time-changed primal geodesic, i.e., along the straight line from θ 0 to θ * under the primal coordinate system.
(ii) The trajectory of the dual flow follows a time-changed dual geodesic, i.e., along the straight line from η 0 to η * under the dual coordinate system.
Proof. We first consider the dual flow (3.6). Using the self-dual representation (2.15), where the last equality can be verified using the definition (2.10) of the λ-mirror map.
By Theorem 1 we have, after some simplification, Thus, the dual flow evolves along a time-changed dual geodesic.
Since L λ,ϕ [θ * : θ] = L λ,ψ [η : η * ] and both L λ,ϕ and L λ,ψ induce the same Riemannian metric, the primal flow (3.5) for L λ,ϕ is equivalent to the dual flow for L λ,ψ . By the case proved above, we have that the trajectory follows a time-changed straight line under the θ-coordinates.
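To implement conformal mirror descent in practice, the flow (3.1) must be discretized. From Definition 3 and Theorems 1 and 2, we have the following three forward Euler discretizations, where $\delta_k > 0$ denotes the step size:
• Primal Euler discretization: $\theta_{k+1} = \theta_k - \delta_k G_\lambda^{-1}(\theta_k) \nabla_\theta f(\theta_k)$.
• Dual Euler discretization: $\eta_{k+1} = \eta_k - \delta_k \bigl(1 + \lambda\langle \theta_k, \eta_k \rangle\bigr)\bigl(I_d + \lambda \eta_k \theta_k^\top\bigr) \nabla_\theta f(\theta_k)$, where $\eta_k = \nabla^{(\lambda)}\phi(\theta_k)$.
• Mirror descent with adaptive learning rate: $\zeta_{k+1} = \zeta_k - \delta_k e^{\lambda\phi(\theta_k)} \nabla_\theta f(\theta_k)$, where $\zeta_k = \nabla\Phi_\lambda(\theta_k)$.
A detailed analysis of these (and possibly other) discretization schemes is beyond the scope of this paper. In the next subsection, we study the convergence of conformal mirror descent in continuous time.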
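To make the primal and dual Euler schemes concrete, here is a minimal one-dimensional sketch. Everything specific in it is an assumption made for illustration: the generator ϕ(θ) = θ²/2 on Θ = (−1, 1) with λ = −1 (for which the λ-mirror map is η = θ/(1 + θ²) and the metric is G_λ = 1 − θ²), the quadratic objective, the step size, and the function names. The dual scheme uses the dual dynamics in the form dη/dt = −(1 + λ⟨θ, η⟩)(I_d + ληθ^⊤)∇f(θ), which in one dimension reduces to a scalar factor (1 + λθη)².

```python
import math

lam = -1.0                      # illustrative curvature parameter (assumption)
target = 0.3                    # minimizer of the illustrative objective

def grad_f(theta):              # f(theta) = 0.5 * (theta - target)^2
    return theta - target

def mirror(theta):              # lambda-mirror map for phi = theta^2 / 2 and lam = -1
    return theta / (1.0 + theta * theta)

def mirror_inv(eta):            # inverse on |eta| < 1/2, written to be stable near eta = 0
    return 2.0 * eta / (1.0 + math.sqrt(1.0 - 4.0 * eta * eta))

def primal_euler(theta, step, n_steps):
    for _ in range(n_steps):
        G = 1.0 + lam * theta * theta           # G_lambda = phi'' + lam * (phi')^2
        theta -= step * grad_f(theta) / G
    return theta

def dual_euler(theta, step, n_steps):
    eta = mirror(theta)
    for _ in range(n_steps):
        theta = mirror_inv(eta)
        scale = (1.0 + lam * theta * eta) ** 2  # (1 + lam*theta*eta) * (1 + lam*eta*theta)
        eta -= step * scale * grad_f(theta)
    return mirror_inv(eta)

theta_primal = primal_euler(-0.5, step=0.05, n_steps=2000)
theta_dual = dual_euler(-0.5, step=0.05, n_steps=2000)
print(theta_primal, theta_dual)
```

The third scheme is omitted because, for this particular generator, inverting ζ = ∇Φ_λ(θ) has no elementary closed form. No projection step is included, since for this starting point and step size the iterates stay inside the domain.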

3.2. Convergence results.
In this subsection, we present continuous time convergence results for conformal mirror descent that are analogous to those of mirror descent. Our main tool is Lyapunov analysis following [44]. In what follows, we let (θ t ) t≥0 be the solution to the gradient flow (3.1) for a given continuously differentiable and convex function f : Θ → R.
We also let θ * be a minimizer of f over Θ.
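We first observe that the logarithmic divergence is a Lyapunov function of the gradient flow.
Lemma 3.1. Along the flow (3.1), with $\eta_t = \nabla^{(\lambda)}\phi(\theta_t)$, we have
$$\frac{d}{dt} L_{\lambda,\phi}[\theta^* : \theta_t] = -\frac{1 + \lambda\langle \theta_t, \eta_t \rangle}{1 + \lambda\langle \theta^*, \eta_t \rangle} \langle \nabla_\theta f(\theta_t), \theta_t - \theta^* \rangle \le 0. \tag{3.8}$$
Proof. Using the self-dual representation (2.15), we have
$$\frac{d}{dt} L_{\lambda,\phi}[\theta^* : \theta_t] = \Bigl\langle \nabla_\eta \psi(\eta_t) - \frac{\theta^*}{1 + \lambda\langle \theta^*, \eta_t \rangle}, \frac{d}{dt}\eta_t \Bigr\rangle, \qquad \nabla_\eta \psi(\eta_t) = \frac{\theta_t}{1 + \lambda\langle \theta_t, \eta_t \rangle}.$$
Using (3.2) and simplifying, we obtain the equality in (3.8). The inequality holds because f is convex and θ* is a minimizer, so that $\langle \nabla_\theta f(\theta_t), \theta_t - \theta^* \rangle \ge f(\theta_t) - f(\theta^*) \ge 0$.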

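Theorem 4. For t > 0 we have
$$f(\theta_t) - f(\theta^*) \le \frac{B_{\Phi_\lambda}[\theta^* : \theta_0]}{\int_0^t e^{\lambda\phi(\theta_s)}\, ds}, \tag{3.9}$$
where $\Phi_\lambda = \frac{1}{\lambda}(e^{\lambda\phi} - 1)$ is the Bregman generator. In particular, if f is strictly convex, then $f(\theta_t) - f(\theta^*) = O(1/t)$ as t → ∞.
Proof. This result can be derived using Theorem 2 and convergence results of Hessian gradient flow (see e.g. [45, Section 2.1.3]). For completeness, we give a self-contained proof. Write $\tau_t = \int_0^t e^{\lambda\phi(\theta_s)}\, ds$. Using a similar argument as in the proof of Lemma 3.1, we have that
$$\frac{d}{dt} B_{\Phi_\lambda}[\theta^* : \theta_t] = e^{\lambda\phi(\theta_t)} \langle \nabla_\theta f(\theta_t), \theta^* - \theta_t \rangle \le -e^{\lambda\phi(\theta_t)} \bigl(f(\theta_t) - f(\theta^*)\bigr), \tag{3.10}$$
and hence, since $t \mapsto f(\theta_t)$ is non-increasing along the gradient flow,
$$E_t = \tau_t \bigl(f(\theta_t) - f(\theta^*)\bigr) + B_{\Phi_\lambda}[\theta^* : \theta_t]$$
is another Lyapunov function. Since $E_t$ is non-increasing, we have
$$f(\theta_t) - f(\theta^*) \le \frac{E_t}{\tau_t} \le \frac{E_0}{\tau_t} = \frac{B_{\Phi_\lambda}[\theta^* : \theta_0]}{\tau_t}. \tag{3.11}$$
Note that by (2.17), the last expression in (3.11) is equal to $\frac{1}{\lambda \tau_t} e^{\lambda\phi(\theta^*)} \bigl(1 - e^{-\lambda L_{\lambda,\phi}[\theta^* : \theta_0]}\bigr)$. If f is strictly convex, from (3.8) we have that $\lim_{t \to \infty} \theta_t = \theta^*$. Since $e^{\lambda\phi(\theta_t)} \to e^{\lambda\phi(\theta^*)}$, the quantity $\tau_t = \int_0^t e^{\lambda\phi(\theta_s)}\, ds$ grows linearly as t → ∞. It follows from (3.9) that $f(\theta_t) - f(\theta^*) = O(1/t)$.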

4. Online estimation of generalized exponential family
Mirror descent is often used to estimate parameters of stochastic models, both offline and online. Using a duality between the exponential family and Bregman divergence [7], the authors of [37] considered online parameter estimation for exponential families, and showed that the mirror descent step is equivalent to the natural gradient step [2]. In this section we generalize this result to the λ-exponential family introduced in [49].
We begin with some preliminaries. Following [49], by a λ-exponential family we mean a parameterized probability density (with respect to a reference measure ν) of the form
$$p_\theta(x) = \bigl(1 + \lambda \langle \theta, F(x) \rangle\bigr)_+^{1/\lambda} e^{-\phi(\theta)}, \tag{4.1}$$
where $x_+ = \max\{x, 0\}$ and $F(x) = (F_1(x), \ldots, F_d(x))$ is a vector of statistics. For example, if ν is the Lebesgue measure on R, λ ∈ (−2, 0) and $F(x) = (x, x^2)$, then we obtain from (4.1) the Student's t distribution (as a location-scale family) with $-\frac{2}{\lambda} - 1 > 0$ degrees of freedom (see Example 1 below). The density (4.1) is a generalized or deformed exponential family and is a reparameterized version of the q-exponential family (where q = 1 − λ) in statistical physics (see [49, Section 3] for the precise relation). As λ → 0, we recover the usual exponential family. Under suitable regularity conditions (including λ < 1, i.e., q = 1 − λ > 0), it can be shown that the divisive normalization function ϕ in (4.1) is $c_\lambda$-convex on the parameter space Θ and hence defines a λ-logarithmic divergence. This divergence can be interpreted probabilistically as $L_{\lambda,\phi}[\theta : \theta'] = H^{\mathrm{R}}_q(p_{\theta'} \,\|\, p_\theta)$, where $H^{\mathrm{R}}_q$ is the Rényi divergence of order q. Consequently, the induced Riemannian metric is a constant multiple of the Fisher information metric I [40]. Moreover, the dual variable $\eta = \nabla^{(\lambda)}_\theta \phi(\theta)$ under the λ-mirror map can be interpreted as a generalized expectation parameter known as the escort expectation. In fact, the density (4.1) maximizes the Rényi entropy of order q subject to constraints on the escort expectation. These (and other) results nicely parallel those of the exponential family.
Consider now online estimation of (4.1) under i.i.d. sampling. By considering the distribution of Y = F(X), we consider a λ-exponential family on $\mathbb{R}^d$ as
$$p_\theta(y) = \bigl(1 + \lambda \langle \theta, y \rangle\bigr)_+^{1/\lambda} e^{-\phi(\theta)}. \tag{4.3}$$
Suppose we observe data points $y_k$, k = 1, 2, . . .. Let the current guess of the parameter be $\theta_k$. After observing $y_k$, we update $\theta_k$ to $\theta_{k+1}$ by a gradient step that decreases the log-loss
$$f_k(\theta) = -\log p_\theta(y_k) = \phi(\theta) - \frac{1}{\lambda} \log\bigl(1 + \lambda \langle \theta, y_k \rangle\bigr). \tag{4.4}$$
Note that the negative log-likelihood $f_k$ is typically not convex in θ. We do this by discretizing the conformal mirror descent (3.2), where the generating function ϕ is the potential function in (4.3). Since $G_\lambda$ is a multiple of the Fisher metric, the forward Euler step of (3.2) in dual coordinates leads to the (unconstrained) natural gradient update
$$\eta_{k+1} = \eta_k - \delta_k \bigl(1 + \lambda\langle \theta_k, \eta_k \rangle\bigr)\bigl(I_d + \lambda \eta_k \theta_k^\top\bigr) \nabla_\theta f_k(\theta_k), \tag{4.5}$$
where $\delta_k > 0$ is the learning rate. Simplifying (4.5), we obtain an explicit and clean update which is not obvious from the time change perspective.
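Theorem 5 (Online natural gradient step for λ-exponential family). The online natural gradient update (4.5) is given by
$$\eta_{k+1} = \eta_k + \delta_k \frac{1 + \lambda\langle \theta_k, \eta_k \rangle}{1 + \lambda\langle \theta_k, y_k \rangle} (y_k - \eta_k). \tag{4.6}$$
Proof. Differentiating $f_k(\theta)$ in (4.4), we have
$$\nabla_\theta f_k(\theta) = \nabla_\theta \phi(\theta) - \frac{y_k}{1 + \lambda\langle \theta, y_k \rangle},$$
which has the same form as in the dual gradient flow in Proposition 3(ii). (This is not a coincidence in view of the duality between λ-exponential family and λ-logarithmic divergence; see [49, Section VI].) Continuing the computation as in the proof of Proposition 3, we obtain (4.6) which is the discrete analogue of (3.7).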
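The simplification behind this update can be spelled out in a few lines. The derivation below is a sketch under two stated assumptions: that the log-loss is f_k(θ) = ϕ(θ) − (1/λ) log(1 + λ⟨θ, y_k⟩), and that the dual Euler step takes the form η_{k+1} = η_k − δ_k (1 + λ⟨θ_k, η_k⟩)(I_d + λ η_k θ_k^⊤) ∇f_k(θ_k). The symbol D is shorthand introduced here.

```latex
% Shorthand (assumed): D = 1 - \lambda\langle\nabla_\theta\phi(\theta),\theta\rangle,
% so that \nabla_\theta\phi(\theta) = D\,\eta and 1 + \lambda\langle\theta,\eta\rangle = 1/D.
\begin{align*}
\nabla_\theta f_k(\theta)
  &= \nabla_\theta\phi(\theta) - \frac{y_k}{1+\lambda\langle\theta,y_k\rangle}, \\
(I_d+\lambda\,\eta\,\theta^\top)\,\nabla_\theta\phi(\theta)
  &= \bigl(D + \lambda\langle\theta,\nabla_\theta\phi(\theta)\rangle\bigr)\,\eta = \eta, \\
(I_d+\lambda\,\eta\,\theta^\top)\,\frac{y_k}{1+\lambda\langle\theta,y_k\rangle}
  &= \frac{y_k + \lambda\langle\theta,y_k\rangle\,\eta}{1+\lambda\langle\theta,y_k\rangle}
   = \eta + \frac{y_k-\eta}{1+\lambda\langle\theta,y_k\rangle}, \\
\eta_{k+1}
  &= \eta_k - \delta_k\bigl(1+\lambda\langle\theta_k,\eta_k\rangle\bigr)
     (I_d+\lambda\,\eta_k\theta_k^\top)\,\nabla_\theta f_k(\theta_k)
   = \eta_k + \delta_k\,\frac{1+\lambda\langle\theta_k,\eta_k\rangle}
     {1+\lambda\langle\theta_k,y_k\rangle}\,(y_k-\eta_k).
\end{align*}
```

As λ → 0 the ratio tends to 1 and the step reduces to the linear update η_{k+1} = η_k + δ_k (y_k − η_k), consistent with the exponential family case.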
Since (4.5) is a natural gradient update, by [2, Theorem 2] the algorithm (when $\delta_k = \frac{1}{k}$) is Fisher efficient as k → ∞. When λ → 0, we recover the linear update for exponential families derived in [37]. In general, an extra projection step, which is also necessary for the exponential family (λ = 0), is needed to retain $\theta_{k+1}$ in the parameter set Θ (or $\eta_{k+1}$ in H). The adjustment is implemented in the experiments below by a reflection across the boundary.
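Example 1 (Student's t-distribution as a location-scale family). Let ν > 0 be a constant. The Student's t-distribution with ν degrees of freedom has density
$$p(x) = \frac{\Gamma\bigl(\frac{\nu+1}{2}\bigr)}{\sqrt{\nu\pi}\, \Gamma\bigl(\frac{\nu}{2}\bigr)\, \sigma} \Bigl(1 + \frac{1}{\nu} \Bigl(\frac{x - \mu}{\sigma}\Bigr)^2\Bigr)^{-\frac{\nu+1}{2}},$$
where µ and σ are the location and scale parameters, respectively, and Γ is the gamma function. In the following, we regard ν as known and consider online estimation of (µ, σ).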
The potential function ϕ is given on Θ by where C is a constant depending only on ν. By a straightforward computation, we obtain explicit expressions of the λ-mirror map and its inverse: In Figure 2 (left), we show ten trajectories (in terms of (µ k , σ k )) of the algorithm (4.6) with δ k = 1/k, where the true parameter is (µ * , σ * ) and the initial guess is (µ 0 , σ 0 ). As expected, the iterates converge to (µ * , σ * ) as k → ∞. The preceding computations can be generalized to the multivariate location-scale t-distribution where the degrees of freedom is also assumed to be known.
Example 2 (Dirichlet perturbation on the unit simplex). The Dirichlet perturbation model is a fundamental example of the λ-exponential family (see [49, Example 3.14]) and is closely related to the Dirichlet optimal transport problem studied in [34,35,36]; see also Section 5 below, where we use the Dirichlet transport to define gradient flows on the simplex. Fix d ≥ 1 and consider the open unit simplex
$$\Delta_{1+d} = \bigl\{ p = (p_0, \ldots, p_d) \in (0, 1)^{1+d} : p_0 + \cdots + p_d = 1 \bigr\}. \tag{4.8}$$
Given $p, q \in \Delta_{1+d}$, define the perturbation operation by
$$p \oplus q = \Bigl( \frac{p_0 q_0}{\sum_{j=0}^d p_j q_j}, \ldots, \frac{p_d q_d}{\sum_{j=0}^d p_j q_j} \Bigr). \tag{4.9}$$
This is the vector addition operation under the Aitchison geometry in compositional data analysis [17]. Let σ > 0 and let λ = −σ < 0. Fix $p \in \Delta_{1+d}$, which we regard as the unknown parameter, and let $D = (D_0, \ldots, D_d)$ be a random vector whose distribution is the Dirichlet distribution with parameters $(\sigma^{-1}/(1 + d), \ldots, \sigma^{-1}/(1 + d)) \in (0, \infty)^{1+d}$. As σ → 0, the distribution of D converges weakly to the point mass at the barycenter (1/(1 + d), . . . , 1/(1 + d)). Thus, we may regard σ as a noise parameter. The Dirichlet perturbation model is specified as
$$Q = p \oplus D.$$
It may be regarded as a multiplicative analogue of the Gaussian additive model $Y = X + \epsilon$ where $\epsilon \sim N(0, \sigma^2 I_d)$.

5. Gradient flows on the simplex via Dirichlet transport
By Brenier's theorem [10], the mirror map ζ = ∇φ(θ) in classical (Bregman) mirror descent can be interpreted as an optimal transport map for the quadratic cost $c(x, y) = \frac{1}{2}|x - y|^2$. Also, the Bregman divergence is the c-divergence of the quadratic cost. This suggests an interpretation of mirror descent in terms of optimal transport. In fact, our conformal mirror descent generalizes this set-up to the logarithmic cost $c_\lambda(x, y) = -\frac{1}{\lambda} \log(1 + \lambda \langle x, y \rangle)$ for λ ≠ 0. In this section, we specialize to the unit simplex and the case λ = −1. Using the Dirichlet optimal transport problem studied in [36], we define a family of gradient flows on the unit simplex and compare them with the entropic descent, which is an important example of mirror descent.

5.1. Dirichlet transport. Following [36], we define the Dirichlet cost function on $\Delta_n \times \Delta_n$ (where n = 1 + d ≥ 2) by
$$c(p, q) = \log\Bigl( \frac{1}{n} \sum_{i=0}^{n-1} \frac{q_i}{p_i} \Bigr) - \frac{1}{n} \sum_{i=0}^{n-1} \log\frac{q_i}{p_i}. \tag{5.1}$$
It is closely related to the Dirichlet perturbation model in Example 2, because the density of Q (with respect to a suitable reference measure) is proportional to $e^{\frac{1}{\lambda} c(p, q)}$ [36, Remark 6]. It is easy to verify that c(p, q) … Up to a change of variables and addition of linear terms (see [45, Remark 3]), the Dirichlet cost function is equivalent to the logarithmic cost $c_{-1}$. The (−1)-mirror map then corresponds to the optimal transport map of the Dirichlet transport.
We now adapt the logarithmic divergence and the (−1)-mirror map to the simplex following the notations of [36]. The role of the c −1 -convex generator is now played by an exponentially concave function.

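Definition 4 (Exponentially concave function and its divergence). A smooth function ϕ : $\Delta_n$ → R is said to be exponentially concave if $e^\phi$ is concave. Its logarithmic divergence is defined for $p, q \in \Delta_n$ by
$$L_\phi[q : p] = \log\bigl(1 + \langle \nabla\phi(p), q - p \rangle\bigr) - \bigl(\phi(q) - \phi(p)\bigr). \tag{5.2}$$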
It is easy to see that if ϕ is exponentially concave, then −ϕ is $c_{-1}$-convex and $L_\phi = L_{-1,-\phi}$. In order that the induced Riemannian metric is well-defined, we assume that $\nabla^2 e^\phi$ is strictly negative definite when restricted to the tangent space of $\Delta_n$. The (−1)-mirror map is now given in terms of the optimal transport map of the Dirichlet transport. We let $e_0, \ldots, e_{n-1}$ denote the standard Euclidean basis. Given a differentiable function f on $\Delta_n$, define the directional derivative
$$D_{e_i - p} f(p) = \frac{d}{dt} f\bigl(p + t(e_i - p)\bigr) \Big|_{t=0}.$$
Recall the perturbation operation defined by (4.9). We also define the powering operation for $p \in \Delta_n$ and α ∈ R by
$$\alpha \otimes p = \Bigl( \frac{p_0^\alpha}{\sum_{j=0}^{n-1} p_j^\alpha}, \ldots, \frac{p_{n-1}^\alpha}{\sum_{j=0}^{n-1} p_j^\alpha} \Bigr).$$
Note that $\Delta_n$ is an (n − 1)-dimensional real vector space under the operations ⊕ and ⊗. We define $\ominus p = (-1) \otimes p$ to be the additive inverse of p.
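The Aitchison operations used here are straightforward to implement. The sketch below assumes the standard forms of perturbation (componentwise product followed by renormalization) and powering (componentwise power followed by renormalization); the function names, the example vectors, the noise level and the random seed are illustrative choices. It also draws one sample from a Dirichlet perturbation model, assumed here to have the form Q = p ⊕ D with D Dirichlet distributed with all parameters equal to σ⁻¹/n.

```python
import numpy as np

def perturb(p, q):
    """Perturbation p (+) q: componentwise product, renormalized to sum to one."""
    r = np.asarray(p, dtype=float) * np.asarray(q, dtype=float)
    return r / r.sum()

def power(alpha, p):
    """Powering alpha (x) p: componentwise power, renormalized to sum to one."""
    r = np.asarray(p, dtype=float) ** alpha
    return r / r.sum()

def inverse(p):
    """Additive inverse (-)p = (-1) (x) p under the Aitchison vector space structure."""
    return power(-1.0, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
barycenter = np.full(3, 1.0 / 3.0)   # neutral element for the perturbation operation

# One draw from the assumed model Q = p (+) D with noise level sigma.
rng = np.random.default_rng(0)
sigma = 0.1
D = rng.dirichlet(np.full(3, 1.0 / (sigma * 3)))
Q = perturb(p, D)
print(Q)
```

With these helpers, the diversity-weighted transport map of Example 3(ii) would read `power(1 - alpha, p)`.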
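The optimal transport map $T_\phi : \Delta_n \to \Delta_n$ is defined by
$$T_\phi(p) = p \ominus \pi_\phi(p), \tag{5.4}$$
where $p \ominus q = p \oplus (\ominus q)$ and the portfolio map $\pi_\phi : \Delta_n \to \Delta_n$ has components $(\pi_\phi(p))_i = p_i \bigl(1 + D_{e_i - p}\phi(p)\bigr)$, with $D_{e_i - p}$ denoting the directional derivative in the direction $e_i - p$. That $T_\phi$ is an optimal transport map for the Dirichlet cost function (5.1) is proved in [36, Theorem 1], which is an analogue of Brenier's theorem. The terminology "portfolio map" for the mapping $\pi_\phi$ is motivated by its use in portfolio theory [19,34,46].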

Example 3 (Examples of portfolio and transport maps).
(i) Let $\phi(p) = \sum_{i=0}^{n-1} \frac{1}{n} \log p_i$. Then the associated portfolio map is the constant map $\pi_\phi(p) = \bigl(\frac{1}{n}, \ldots, \frac{1}{n}\bigr)$, called the equal-weighted portfolio. From (5.4), the transport map is the identity $T_\phi(p) = p$. This function corresponds to the quadratic function $\frac{1}{2}|x|^2$ whose Euclidean gradient is the identity.
(ii) Let $\phi(p) = \frac{1}{\alpha} \log\bigl( \sum_{i=0}^{n-1} p_i^\alpha \bigr)$, where α ∈ (0, 1) is a fixed parameter. Then $\pi_\phi(p) = \alpha \otimes p$ is called the diversity-weighted portfolio. The transport map is given by $T_\phi(p) = (1 - \alpha) \otimes p$, which can be interpreted as a dilation under the Aitchison geometry. As α → 0 we recover the identity transport.
Let f : $\Delta_n$ → R be a differentiable function. Using the Riemannian metric g induced by $L_\phi$, we can define the gradient flow
$$\frac{d}{dt} p_t = -\operatorname{grad}_g f(p_t), \tag{5.5}$$
which is a special case of (3.1) (up to reparameterization) when λ = −1. The following result derives explicitly the dynamics under the dual variable $q_t = T_\phi(p_t)$ defined in terms of the transport map. We omit the proof as it is a straightforward but tedious computation.
Theorem 6 (Conformal mirror descent on $\Delta_n$ under Dirichlet transport). Consider the gradient flow (5.5), and let $q_t = T_\phi(p_t)$. Then for i = 0, . . . , n − 1, we have … (5.6)
Example 4. Consider the equal-weighted portfolio in Example 3(i). Then $q_t = T_\phi(p_t) = p_t$ and the corresponding gradient flow (5.6) is given by …

Figure 3. Convergence rates (left) and final estimates $p^{(\mathrm{final})}$ (right) of $f(p_k) = c(p_k, p^*)$ for the entropic and conformal descents using step size $\delta_k = \frac{1}{d\sqrt{k}}$, for various targets p* that were randomly chosen and fixed. In Figure 3a, p* is the barycenter of the simplex, whereas in Figures 3b-3d, p* was sampled from a Dirichlet distribution with varying parameters. For both the entropic and conformal descents, we plot the average over 12 randomized initial points $p_0$ (for p* fixed).
This motivates the multiplicative discrete update . This is similar to but different from the entropic descent (Bregman mirror descent on n induced by the negative Shannon entropy) whose update is given by where ∇ i f is the i-th component of ∇f .
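For reference, the entropic descent baseline can be sketched in a few lines. The code below implements only the standard exponentiated-gradient form of entropic mirror descent (multiply each coordinate by exp(−δ ∇_i f) and renormalize); it does not implement the conformal update. The objective is an assumed form of the Dirichlet cost, c(p, q) = log((1/n) Σ_i q_i/p_i) − (1/n) Σ_i log(q_i/p_i), and the target, the constant step size and the function names are illustrative choices rather than the settings of the experiments.

```python
import numpy as np

def dirichlet_cost(p, q):
    """Assumed form of the Dirichlet cost: log(mean(q/p)) - mean(log(q/p))."""
    ratio = q / p
    return np.log(ratio.mean()) - np.log(ratio).mean()

def grad_dirichlet_cost(p, q):
    """Euclidean gradient of p -> dirichlet_cost(p, q)."""
    ratio = q / p
    return -(ratio / p) / ratio.sum() + 1.0 / (p.size * p)

def entropic_descent(p0, grad, step, n_steps):
    """Exponentiated-gradient (entropic mirror descent) iteration on the simplex."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        w = p * np.exp(-step * grad(p))   # multiplicative step
        p = w / w.sum()                   # renormalize onto the simplex
    return p

p_star = np.array([0.5, 0.3, 0.2])        # illustrative target
p0 = np.full(3, 1.0 / 3.0)                # start at the barycenter
p_final = entropic_descent(p0, lambda p: grad_dirichlet_cost(p, p_star),
                           step=0.05, n_steps=5000)
print(p_final, dirichlet_cost(p_final, p_star))
```

A constant step size is used for simplicity; a decaying schedule such as the one in the figure caption could be passed in by making `step` depend on the iteration index.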
Example 5. Consider minimization of the function f (p) = c(p, p * ) where c is the Dirichlet cost function defined by (5.1) and p * is fixed. In this experiment, we generate p * randomly according to various distributions on n . We implement (5.6) using the forward Euler discretization for the diversity-weighted portfolio (Example 3(ii)) where α ∈ {0, . . . , 0.9}, and compare the performance with that of the entropic descent (5.7). The results are shown in Figure 3. Values of α closer to 1 perform better than the entropic descent across all settings, and recover the minimizer p * considerably more accurately.

6. Discussion and future directions
Convex duality and Bregman divergence underlie much of the theory and applications of classical information geometry. In this paper, we use the λ-duality and the associated logarithmic divergence to propose a tractable gradient flow called the conformal mirror descent. We demonstrate its usefulness in online parameter estimation and gradient flows on the simplex. Here, we discuss other related work and some directions for future research.
In this paper, we generalize the Hessian gradient flow primarily from the information-geometric point of view. Being a fundamental first-order optimization method, mirror descent has been studied and generalized in many directions. For instance, convergence of many discrete and continuous time descent algorithms was studied using Lyapunov arguments in [44]. In [24], a family of accelerated mirror descent algorithms with quadratic convergence was proposed. Likewise, [43] presents a unifying analysis of accelerated descent using variational methods. A future avenue is to explore accelerated variants of the conformal mirror flow, and to interpret these using information-geometric frameworks; one such exploration is presented by [15].
Mirror descent provides a concrete framework to understand seemingly unrelated optimization algorithms. Recent lines of work [28,26,27] have analyzed and interpreted the popular Sinkhorn algorithm [39], an iterative scheme used for solving the entropic optimal transport problem [14], as a form of mirror descent. Our conformal mirror descent may be applied to develop new algorithms for regularized optimal transport problems and to analyze their convergence properties.
Statistical inference and machine learning involving generalized exponential families is the subject of a recent line of work, e.g. [16,21]. We expect that λ-duality and logarithmic divergences will be useful in this endeavor. Nevertheless, the current framework (as in [49]) assumes that the curvature parameter λ is given and known (except in special cases such as the Dirichlet perturbation model in Example 2). A natural direction is to develop data-driven methods to choose λ (and analogous quantities for other generalized exponential families).
The λ-duality is a special case of the c-duality in optimal transport when c = c λ is the logarithmic cost given by (2.5). While the λ-duality is particularly tractable, efficient algorithms related to general c-duality will likely open up many new applications. For example, the recent paper [11] used c-convexity to define normalizing flows on Riemannian manifolds. It is also natural to analyze similarly derived gradient flows with respect to other cost functions. We hope our results will motivate and inspire further work in applications of generalized c-convex duality.