Abstract
We study a natural Wasserstein gradient flow on manifolds of probability distributions with discrete sample spaces. We derive the Riemannian structure for the probability simplex from the dynamical formulation of the Wasserstein distance on a weighted graph. We pull back the geometric structure to the parameter space of any given probability model, which allows us to define a natural gradient flow there. In contrast to the natural Fisher–Rao gradient, the natural Wasserstein gradient incorporates a ground metric on sample space. We illustrate the analysis with elementary examples of exponential families and demonstrate an application of the Wasserstein natural gradient to maximum likelihood estimation.
Keywords
Optimal transport · Information geometry · Wasserstein statistical manifold · Displacement convexity · Machine learning

1 Introduction
The statistical distance between histograms plays a fundamental role in statistics and machine learning. It provides the geometric structure on statistical manifolds [3]. Learning problems usually correspond to minimizing a loss function over these manifolds. An important example is the Fisher–Rao metric on the probability simplex, which has been studied especially within the field of information geometry [3, 6]. A classic result due to Chentsov [11] characterizes this Riemannian metric as the only one, up to scaling, that is invariant with respect to natural statistical embeddings by Markov morphisms (see also [9, 21, 32]). Based on the Fisher–Rao metric, a natural Riemannian gradient descent method was introduced in [2]. This natural gradient has found numerous successful applications in machine learning (see, e.g., [1, 27, 35, 36, 40]).
Optimal transport provides another statistical distance, named Wasserstein or Earth Mover's distance. In recent years, this metric has attracted increasing attention within the machine learning community [5, 16, 31]. One distinct feature of optimal transport is that it provides a distance between histograms that incorporates a ground metric on sample space. The \(L^2\)-Wasserstein distance has a dynamical formulation, which exhibits a metric tensor structure. The set of probability densities with this metric forms an infinite-dimensional Riemannian manifold, named density manifold [20]. The gradient descent method in the density manifold, called Wasserstein gradient flow, has been widely studied in the literature; see [34, 38] and the references therein.
A question intersecting optimal transport and information geometry arises: What is the natural Wasserstein gradient descent method on the parameter space of a statistical model? In optimal transport, the Wasserstein gradient flow is studied on the full space of probability densities, and has been shown to have deep connections with ground metrics on sample space derived from physics [33], fluid mechanics [10], and differential geometry [25]. We expect that these relations also exist on parametrized probability models, and that the Wasserstein gradient flow can be useful in the optimization of objective functions that arise in machine learning problems. By incorporating a ground metric on sample space, this method can serve to implement useful priors in the learning algorithms.
We are interested in developing synergies between the information geometry and optimal transport communities. In this paper, we take a natural first step in this direction. We introduce the Wasserstein natural gradient flow on the parameter space of probability models with discrete sample spaces. The \(L^2\)-Wasserstein metric on discrete states was introduced in [12, 26, 29]. Following the settings from [13, 14, 17, 22], the probability simplex forms a Riemannian manifold, called the Wasserstein probability manifold. The Wasserstein metric on the probability simplex can be pulled back to the parameter space of a probability model. This metric allows us to define a natural Wasserstein gradient method on parameter space.
We note that one finds several formulations of optimal transport for continuous sample spaces. On the one hand, there is the static formulation, known as Kantorovich's linear programming [38]. Here, the linear program is to find the minimal value of a functional over the set of joint measures with given marginal histograms. The objective functional is given as the expectation value of the ground metric with respect to a joint probability density measure. On the other hand, there is the dynamical formulation, known as the Benamou–Brenier formula [8]. This dynamic formulation gives the metric tensor for measures by lifting the ground metric tensor of sample spaces. Both static and dynamic formulations are equivalent in the case of continuous state spaces. However, the two formulations lead to different metrics in the simplex of discrete probability distributions. The major reason for this difference is that the discrete sample space is not a length space.^{1} Thus the equivalence result in classical optimal transport is no longer true in the setting of discrete sample spaces. We note that for the static formulation, there is no Riemannian metric tensor for the discrete probability simplex. See [14, 26] for a detailed discussion.
In the literature, the exploration of connections between optimal transport and information geometry was initiated in [4, 18, 39]. These works focus on the distance function induced by linear programming on discrete sample spaces. As we pointed out above, this approach cannot cover the Riemannian and differential structures induced by optimal transport. In this paper, we use the dynamical formulation of optimal transport to define a Riemannian metric structure for general statistical manifolds. With this, we obtain a natural gradient operator, which can be applied to any optimization problem over a parameterized statistical model. In particular, it is applicable to maximum likelihood estimation. Other works have studied the Gaussian family of distributions with the \(L^2\)-Wasserstein metric [30, 37]. In that particular case, the constrained optimal transport metric tensor can be written explicitly and the corresponding density submanifold is a totally geodesic submanifold. In contrast to those works, our discussion is applicable to arbitrary parametric models.
This paper is organized as follows. In Sect. 2 we briefly review the Riemannian manifold structure in probability space introduced by optimal transport in the cases of continuous and discrete sample spaces. In Sect. 3 we introduce Wasserstein statistical manifolds by isometric embedding into the probability manifold, and in Sect. 4 we derive the corresponding gradient flows. In Sect. 5 we discuss a few examples.
2 Optimal transport on continuous and discrete sample spaces
In this section, we briefly review the results of optimal transport. We introduce the corresponding Riemannian structure for simplices of probability distributions with discrete support.
2.1 Optimal transport on continuous sample space
We start with a review of the optimal transport problem on continuous spaces. This will guide our discussion of the discrete state case. For related studies, we refer the reader to [20, 38] and the many references therein.
Denote the sample space by \((\Omega , g^\Omega )\). Here \(\Omega \) is a finite dimensional smooth Riemannian manifold, for example, \(\mathbb {R}^d\) or the open unit ball therein. Its inner product is denoted by \(g^\Omega \) and its volume form by dx. Denote the geodesic distance of \(\Omega \) by \(d_\Omega :\Omega \times \Omega \rightarrow \mathbb {R}_+\).
2.2 Dynamical optimal transport on discrete sample spaces
We translate the dynamical perspective from the previous section to discrete state spaces, i.e., we replace the continuous space \(\Omega \) by a discrete space \(I=\{1,\ldots , n\}\).
We are now ready to introduce the \(L^2\)Wasserstein metric on \(\mathcal {P}_+(I)\).
Definition 1
Remark 1
2.3 Wasserstein geometry and discrete probability simplex
In this section we introduce the primal coordinates of the discrete probability simplex with \(L^2\)Wasserstein Riemannian metric. Our discussion follows the recent work [22]. The probability simplex \(\mathcal {P}(I)\) is a manifold with boundary. To simplify the discussion, we focus on the interior \(\mathcal {P}_+(I)\). The geodesic properties on the boundary \(\partial \mathcal {P}(I)\) have been studied in [17].
We first present this in a dual formulation, which is known in the literature [25].
Definition 2
We shall now give the inner product in primal coordinates. The following matrix operator will be the key to the Riemannian metric tensor of \((\mathcal {P}_+(I), g^W)\).
Definition 3
 \(D \in \mathbb {R}^{E\times n}\) is the discrete gradient operator$$\begin{aligned} D_{(i,j)\in {E}, k\in V}={\left\{ \begin{array}{ll} \sqrt{\omega _{ij}}, &{} \text {if }i=k,\ i>j\\ -\sqrt{\omega _{ij}}, &{} \text {if }j=k,\ i>j\\ 0, &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$

\(D^{\mathsf {T}}\in \mathbb {R}^{n\times E}\) is the discrete divergence operator, also called oriented incidence matrix [15], and
 \(\Lambda (a)\in \mathbb {R}^{E\times E}\) is a weight matrix depending on a,$$\begin{aligned} \Lambda (a)_{(i,j)\in E, (k,l)\in E}={\left\{ \begin{array}{ll} \frac{a_i+a_j}{2} &{} \text {if }(i,j)=(k,l)\in E\\ 0 &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
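As a hedged illustration of Definition 3 (this is our own sketch, not code from the paper), the operators \(D\), \(D^{\mathsf {T}}\), and \(\Lambda (a)\) can be assembled numerically for a small weighted graph. The three-node path graph, the edge orientation with \(i>j\), and the unit weights are our own choices:

```python
import numpy as np

# Assumptions (ours, not from the paper): nodes V = {0, 1, 2} on a path graph,
# edges E stored as pairs (i, j) with i > j, unit weights omega_ij = 1.
nodes = [0, 1, 2]
edges = [(1, 0), (2, 1)]
omega = {(1, 0): 1.0, (2, 1): 1.0}
n, m = len(nodes), len(edges)

# Discrete gradient operator D in R^{E x n}: +sqrt(w_ij) at column i and
# -sqrt(w_ij) at column j, encoding the orientation i > j.
D = np.zeros((m, n))
for row, (i, j) in enumerate(edges):
    w = np.sqrt(omega[(i, j)])
    D[row, i] = w
    D[row, j] = -w

def Lam(a):
    # Weight matrix Lambda(a) in R^{E x E}: diagonal with entries (a_i + a_j)/2.
    return np.diag([(a[i] + a[j]) / 2.0 for (i, j) in edges])

# The weighted graph Laplacian L(a) = D^T Lambda(a) D underlies the
# Wasserstein metric tensor on the simplex.
a = np.array([0.5, 0.3, 0.2])   # a point in the interior of the simplex
L = D.T @ Lam(a) @ D
print(L @ np.ones(n))           # numerically zero: constants lie in the kernel
```

The product \(D^{\mathsf {T}}\Lambda (a)D\) is a symmetric weighted graph Laplacian; its kernel contains the constant vectors, which is why the associated metric tensor has rank \(n-1\).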
Definition 4
3 Wasserstein statistical manifold
In this section we study parametric probability models endowed with the \(L^2\)Wasserstein Riemannian metric. We define this in the natural way, by pulling back the Riemannian structure from the Wasserstein manifold that we discussed in the previous section. This allows us to introduce a natural gradient flow on the parameter space of a statistical model.
3.1 Wasserstein statistical manifold
Definition 5
This inner product is consistent with the restriction of the Wasserstein metric \(g^W\) to \( p(\Theta )\). For this reason, we call \( p(\Theta )\), or \((\Theta , I, p)\), together with the induced Riemannian metric g, a Wasserstein statistical manifold.
We need to make sure that the embedding procedure is valid, because the metric tensor \(L(p)^{\dagger }\) is only of rank \(n-1\). The next lemma shows that \((\Theta , g)\) is a well-defined \(d\)-dimensional Riemannian manifold.
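To make the rank issue concrete, here is a minimal numerical sketch (ours, not the paper's implementation) of pulling back the simplex metric to parameter space, assuming the pullback metric takes the form \(G(\theta )=(dp/d\theta )^{\mathsf {T}}L(p(\theta ))^{\dagger }(dp/d\theta )\), with the Moore–Penrose pseudoinverse handling the rank-\((n-1)\) kernel. The two-state model is a toy example of our own making:

```python
import numpy as np

def pullback_metric(J, L):
    # G(theta) = J^T L(p)^dagger J; the pseudoinverse is well defined on the
    # rank-(n-1) Laplacian because simplex tangent vectors sum to zero.
    return J.T @ np.linalg.pinv(L) @ J

# Toy model: p(theta) = (theta, 1 - theta) on two nodes joined by one edge of
# weight 1, so Lambda(p) = (p_0 + p_1)/2 = 1/2 and L(p) = 0.5 * [[1,-1],[-1,1]].
theta = 0.3
L = 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]])
J = np.array([[1.0], [-1.0]])   # Jacobian dp/dtheta
G = pullback_metric(J, L)
print(G)                        # [[2.]] for every theta in this toy model
```

In this toy model the pullback metric is constant in \(\theta \); for richer models \(G(\theta )\) varies over the parameter space.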
Lemma 6
Proof
3.2 Geometry calculations in statistical manifold
We next present the geometric formulas for a probability model. This approach connects the geometric formulas on the full probability set to those on the submanifold \((p(\Theta ), g)\) and on the parameter space \((\Theta , g)\).
We first study the orthogonal projection operator from \((\mathcal {P}_+(I), g^W)\) to \((p(\Theta ), g)\).
Theorem 7
Proof
Proposition 8
Proof
We next establish the parallel transport and geodesic equation in \((p(\Theta ), g)\).
Proposition 9
Proof
We last present the curvature tensor in \((p(\Theta ), g)\), denoted by \(R(\cdot , \cdot )\cdot :T_{p(\theta )} p(\Theta )\times T_{p(\theta )} p(\Theta )\times T_{p(\theta )} p(\Theta )\rightarrow T_{p(\theta )} p(\Theta )\).
Proposition 10
Proof
4 Gradient flow on Wasserstein statistical manifold
In this section we introduce the natural Riemannian gradient flow on Wasserstein statistical manifold \((\Theta , g)\).
4.1 Gradient flow on parameter space
Theorem 11
Proof
Remark 2
4.2 Displacement convexity on parameter space
The Wasserstein structure on the statistical manifold provides us not only with the gradient operator, but also with the Hessian operator on \((\Theta , g)\). The latter allows us to introduce displacement convexity on parameter space.
We first review some facts. One remarkable property of Wasserstein geometry is that it yields a correspondence between differential operators on sample space and differential operators on probability space. For example, the Hessian operator on the Wasserstein manifold equals the expectation of the Hessian operator on sample space.
In this section we would like to extend the displacement convexity to parameter space \(\Theta \). In other words, we relate the parameter to the differential structures of sample space via constrained Wasserstein geometry \((\Theta , g)\). If \(\Theta \) is the full probability manifold, our definition coincides with the classical Hessian operator in sample space.
Definition 12
Remark 3
In particular, the displacement convexity of KL divergence relates to the Ricci curvature lower bound on sample space. We elaborate this notion in [23].
We next derive the displacement convexity condition for stochastic relaxation.
Theorem 13
Proof
4.3 Numerical methods
Here we discuss the numerical computation of the Wasserstein metric and the Wasserstein gradient flow.
In practice, the forward Euler method is usually easier to implement than the backward Euler method, and we suggest implementing the natural Wasserstein gradient with it for minimization problems. As is known in optimization, the JKO scheme, a backward Euler discretization, can also be useful, in particular for non-smooth objective functions. Moreover, the backward Euler method is usually unconditionally stable, which means that one can choose a large step size h for computations.
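A minimal sketch of the forward Euler scheme suggested above, in our own notation: one step of natural Wasserstein gradient descent is \(\theta _{k+1}=\theta _k-h\,G(\theta _k)^{-1}\nabla _\theta F(\theta _k)\). The callables `G` and `grad_F` are placeholders to be supplied by the user:

```python
import numpy as np

def natural_gradient_step(theta, grad_F, G, h=0.1):
    # Forward Euler step: theta <- theta - h * G(theta)^{-1} grad F(theta).
    # Solving the linear system avoids forming the inverse explicitly.
    return theta - h * np.linalg.solve(G(theta), grad_F(theta))

# Sanity check on F(theta) = theta^2 with G = identity, where the scheme
# reduces to plain gradient descent toward the minimizer 0.
theta = np.array([1.0])
grad_F = lambda t: 2.0 * t
G_id = lambda t: np.eye(1)
for _ in range(100):
    theta = natural_gradient_step(theta, grad_F, G_id, h=0.1)
print(theta)   # close to 0
```

With the Wasserstein metric, `G` would be the pullback metric tensor evaluated at the current parameter; with the Fisher–Rao metric, the Fisher information matrix.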
5 Examples
Example 1
For comparison, we compute the exponential geodesic triangle between the same distributions \(q^1,q^2,q^3\). This is shown in Fig. 2. In this case, there is no distinction between the states 1, 2, 3 and the three paths are symmetric. The exponential geodesic between two distributions \(p^0\) and \(p^1\) is given by \((p^0)^{1-t}(p^1)^{t}/ \sum _x(p^0)^{1-t}(p^1)^{t}\), \(t\in [0,1]\).
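The exponential geodesic formula just quoted can be evaluated directly; this sketch and the endpoint distributions are ours, chosen for illustration:

```python
import numpy as np

def exponential_geodesic(p0, p1, t):
    # (p^0)^{1-t} (p^1)^t, renormalized so the result sums to one.
    q = p0 ** (1.0 - t) * p1 ** t
    return q / q.sum()

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
midpoint = exponential_geodesic(p0, p1, 0.5)
print(midpoint)   # a distribution between p0 and p1
```

Note that this e-geodesic depends only on the distributions themselves, not on any ground metric on the states, in contrast to the Wasserstein geodesics.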
Example 2
In Fig. 3 we plot the negative Wasserstein gradient vector field in the parameter space \(\Theta =[0,1]^2\). As can be seen, the Wasserstein gradient direction depends on the ground metric on sample space (encoded in the edge weights). If b and d are far away, there is a higher tendency to go to c rather than to d. This reflects the intuition that the larger the ground distance between b and d, the harder it is for the probability distribution to move from its concentration place b to d. We observe that the attraction regions of the two local minimizers change dramatically as the ground metric between b and d changes, i.e., as \(\omega _{bd}\) varies over 0.1, 1, and 10. This is different from the Fisher–Rao gradient flow, plotted in Fig. 4, which is independent of the ground metric on sample space.
The above result illustrates the displacement convexity shown in Theorem 13. Different ground metrics lead to different displacement convexity of f on the parameter space \((\Theta , g)\). These properties lead to different convergence regions of the Wasserstein gradient flow.
Example 3
 Our first choice is the basis of orthogonal characters$$\begin{aligned} \sigma _\lambda (x) = \prod _{i\in \lambda } (-1)^{x_i} = e^{ i \pi \langle 1_\lambda , x\rangle }, \quad x\in \{0,1\}^n, \end{aligned}$$which can be interpreted as a Fourier basis for the space of real valued functions over binary vectors.
 As an alternative choice we consider the basis of monomials$$\begin{aligned} \pi _\lambda (x) = \prod _{i\in \lambda } x_i, \quad x\in \{0,1\}^n, \end{aligned}$$which is not orthogonal, but is frequently used in practice.
The results are shown in Fig. 5. The convergence to the final value can be monitored in terms of the normalized area under the optimization curve, \(\sum _{t=1}^T (D_t-D_T)/(D_0-D_T)\), where \(D_t\) is the divergence value at iteration t, and T is the final time. All methods achieved similar values of the divergence, except for the Euclidean gradient with non-orthogonal parametrization, which did not always reach the minimum. For the Fisher and Wasserstein gradients, the learning paths were virtually identical under the two different model parametrizations, as we already expected from the fact that these are covariant gradients. On the other hand, for the Euclidean gradient, the paths (and the number of iterations) were heavily dependent on the model parametrization, with the orthogonal basis usually being a much better choice than the non-orthogonal basis. In terms of the number of iterations until the convergence criterion was satisfied, the comparison is difficult because different methods work best with different step sizes. With the simple adaptive method and a suitable initial step size, the Wasserstein gradient was faster than the Euclidean and Fisher gradients. On the other hand, using Adam to adapt the step size, the orthogonal Euclidean, Fisher, and Wasserstein gradients were comparable.
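The convergence statistic used above can be computed as follows; this sketch is ours, and the sample divergence sequence in the check is invented for illustration:

```python
import numpy as np

def normalized_area(D):
    # Normalized area under the optimization curve:
    # sum_{t=1}^{T} (D_t - D_T) / (D_0 - D_T) for divergence values D_0..D_T.
    D = np.asarray(D, dtype=float)
    return np.sum(D[1:] - D[-1]) / (D[0] - D[-1])

print(normalized_area([4.0, 2.0, 1.0, 1.0]))   # (1 + 0 + 0) / 3
```

Smaller values indicate that the divergence drops toward its final value early in the optimization run.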
6 Discussion
We introduced the Wasserstein statistical manifolds, which are submanifolds of the probability simplex with the \(L^2\)Wasserstein Riemannian metric tensor. With this, we defined an optimal transport natural gradient flow on parameter space.
The Wasserstein distance has already been discussed in connection with divergences in information geometry, and it has been shown to be useful in machine learning, for instance in training restricted Boltzmann machines and generative adversarial networks. In this work, we used the Wasserstein distance to define a geometry on the parameter space of a statistical model. Following this geometry, we established a corresponding natural gradient and a notion of displacement convexity on parameter space.
We presented an application of the Wasserstein natural gradient method to maximum likelihood estimation in hierarchical probability models. The experiments show that, in combination with a suitable step size, the Wasserstein gradient can be a competitive optimization method and even reduce the required number of parameter iterations compared both to Euclidean and Fisher gradient methods. It will be essential to conduct further experimental studies to better understand the effects of the learning rate, as well as the interplay of ground metric, model, and optimization problem. In our current implementation, the Wasserstein gradient involved heavier computational costs compared to the Euclidean and Fisher gradients. For applications, it will be important to explore efficient computation and approximation approaches.
Regarding the theory, we suggest that many studies from information geometry will have a natural analog or extension in the Wasserstein statistical manifold. Some questions to consider include the following. Is it possible to characterize the Wasserstein metric on probability manifolds through an invariance requirement of Chentsov type? For instance, the work [32] formulates extensions of Markov embeddings for polytopes and weighted point configurations. Is there a weighted graph structure for which the corresponding Wasserstein metric recovers the Fisher metric?
The critical innovation coming from the Wasserstein gradient in comparison to the Fisher gradient is that it incorporates a ground metric in sample space. We suggest that this could have a positive effect not only concerning optimization, as discussed above, but also regarding generalization performance, in interplay with the optimization. The reason is that the ground metric on sample space provides means to introduce preferences in the hypothesis space. The specific form of such a regularization still needs to be developed and investigated. In this regard, a natural question is how to define natural ground metric notions. These could be fixed in advance or trained.
We hope that this paper contributes to strengthening the emerging interactions between information geometry and optimal transport, in particular regarding machine learning problems, and to developing better natural gradient methods.
Footnotes
 1.
A length space is one in which the distance between points can be measured as the infimum length of continuous curves between them.
 2.
We use the direct method, which is a standard technique in optimal control. Here the time is discretized, and the sum replacing the integral is minimized by means of gradient descent with respect to \((p(t)_i)_{i=1,3, t\in \{t_1,\ldots , t_N\}} \in \mathbb {R}^{2\times N}\). A reference for these techniques is [24].
Notes
Acknowledgements
The authors would like to thank Prof. Luigi Malagò for his inspiring talk at UCLA in December 2017. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement n^{o} 757983).
References
 1.Amari, S.: Neural learning in structured parameter spaces - natural Riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems 9, pp. 127–133. MIT, London (1997)
 2.Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
 3.Amari, S.: Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, Tokyo (2016)
 4.Amari, S., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem (2017). arXiv:1709.10219 [cs, math]
 5.Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017). arXiv:1701.07875 [cs, stat]
 6.Ay, N., Jost, J., Lê, H., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer, Berlin (2017)
 7.Bakry, D., Émery, M.: Diffusions hypercontractives. In: Azéma, J., Yor, M. (eds.) Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer, Berlin (1985)
 8.Benamou, J.D., Brenier, Y.: A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)
 9.Campbell, L.: An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 98, 135–141 (1986)
 10.Carlen, E.A., Gangbo, W.: Constrained steepest descent in the 2-Wasserstein metric. Ann. Math. 157(3), 807–846 (2003)
 11.Čencov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, vol. 53. American Mathematical Society, Providence (1982). (Translation from the Russian edited by Lev J. Leifman)
 12.Chow, S.N., Huang, W., Li, Y., Zhou, H.: Fokker–Planck equations for a free energy functional or Markov process on a graph. Arch. Ration. Mech. Anal. 203(3), 969–1008 (2012)
 13.Chow, S.N., Li, W., Zhou, H.: A discrete Schrödinger equation via optimal transport on graphs (2017). arXiv:1705.07583 [math]
 14.Chow, S.N., Li, W., Zhou, H.: Entropy dissipation of Fokker–Planck equations on graphs. Discrete Contin. Dyn. Syst. A 38(10), 4929–4950 (2018)
 15.Chung, F.R.K.: Spectral Graph Theory. Regional Conference Series in Mathematics, no. 92. Published for the Conference Board of the Mathematical Sciences by the American Mathematical Society, Providence (1997)
 16.Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., Poggio, T.: Learning with a Wasserstein loss (2015). arXiv:1506.05439 [cs, stat]
 17.Gangbo, W., Li, W., Mou, C.: Geodesics of minimal length in the set of probability measures on graphs. Accepted in ESAIM: COCV (2018)
 18.Karakida, R., Amari, S.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information, pp. 119–126. Springer, Cham (2017)
 19.Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
 20.Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)
 21.Lebanon, G.: Axiomatic geometry of conditional models. IEEE Trans. Inf. Theory 51(4), 1283–1294 (2005)
 22.Li, W.: Geometry of probability simplex via optimal transport (2018). arXiv:1803.06360 [math]
 23.Li, W., Montufar, G.: Ricci curvature for parameter statistics via optimal transport (2018). arXiv:1807.07095
 24.Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. J. Sci. Comput. 75, 1581–1595 (2017)
 25.Lott, J.: Some geometric calculations on Wasserstein space. Commun. Math. Phys. 277(2), 423–437 (2007)
 26.Maas, J.: Gradient flows of the entropy for finite Markov chains. J. Funct. Anal. 261(8), 2250–2292 (2011)
 27.Malagò, L., Matteucci, M., Pistone, G.: Towards the geometry of estimation of distribution algorithms based on the exponential family. In: Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA '11, pp. 230–242. ACM, New York (2011)
 28.Malagò, L., Pistone, G.: Natural gradient flow in the mixture geometry of a discrete exponential family. Entropy 17(12), 4215–4254 (2015)
 29.Mielke, A.: A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems. Nonlinearity 24(4), 1329–1346 (2011)
 30.Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. J. Geom. Mech. 9(3), 335–390 (2017)
 31.Montavon, G., Müller, K.R., Cuturi, M.: Wasserstein training of restricted Boltzmann machines. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 3718–3726. Curran Associates Inc., Red Hook (2016)
 32.Montúfar, G., Rauh, J., Ay, N.: On the Fisher metric of conditional probability polytopes. Entropy 16(6), 3207–3233 (2014)
 33.Nelson, E.: Quantum Fluctuations. Princeton Series in Physics. Princeton University Press, Princeton (1985)
 34.Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Differ. Equ. 26(1–2), 101–174 (2001)
 35.Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. In: International Conference on Learning Representations 2014 (Conference Track) (2014)
 36.Peters, J., Vijayakumar, S., Schaal, S.: Natural actor-critic. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) Machine Learning: ECML 2005, pp. 280–291. Springer, Berlin (2005)
 37.Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
 38.Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, vol. 338. Springer, Berlin (2009)
 39.Wong, T.K.: Logarithmic divergences from optimal transport and Rényi geometry (2017). arXiv:1712.03610 [cs, math, stat]
 40.Yi, S., Wierstra, D., Schaul, T., Schmidhuber, J.: Stochastic search using the natural gradient. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 1161–1168. ACM, New York (2009)