On the locality of the natural gradient for learning in deep Bayesian networks

We study the natural gradient method for learning in deep Bayesian networks, including neural networks. There are two natural geometries associated with such learning systems consisting of visible and hidden units. One geometry is related to the full system, the other one to the visible sub-system. These two geometries imply different natural gradients. In a first step, we demonstrate a great simplification of the natural gradient with respect to the first geometry, due to locality properties of the Fisher information matrix. This simplification does not directly translate to a corresponding simplification with respect to the second geometry. We develop the theory for studying the relation between the two versions of the natural gradient and outline a method for the simplification of the natural gradient with respect to the second geometry based on the first one. This method suggests incorporating a recognition model as an auxiliary model for the efficient application of the natural gradient method in deep networks.


The natural gradient method
Within the last decade, deep artificial neural networks have led to unexpected successes of machine learning in a large number of applications [15]. One important direction of research within the field of deep learning is based on the natural gradient method from information geometry [3,4,8]. It has been proposed by Amari [2] as a gradient method that is invariant with respect to coordinate transformations. This method turns out to be extremely efficient within various fields of artificial intelligence and machine learning, including neural networks [2], reinforcement learning [7,19], and robotics [27]. It is known to overcome several problems of traditional gradient methods. Most importantly, the natural gradient method avoids the so-called plateau problem, and it is less sensitive to singularities (for a detailed discussion, see Section 12.2 of the book [3]; the subject of singularities is treated in [32]). On the other hand, there are significant challenges and limitations concerning the applicability of the natural gradient method [22]. Without further assumptions, this method becomes intractable in the context of deep neural networks with a large number of parameters. Various approximate methods have been proposed and studied as alternatives to the original method [20,23,26]. In this article, we highlight information-geometric structures of deep Bayesian and, in particular, neural networks that allow for a simplification of the natural gradient. The guiding scheme of this simplification is locality with respect to the underlying network structure [5]. There are several aspects of learning that can be addressed from this perspective:

1. Objective function: Typically, learning is based on the optimisation of some global objective function related to the overall performance of the network, which, in the most general context, is evaluated in some behaviour space. On the other hand, if we assume that individual units access information only from their local neighbourhood, then we are naturally led to the following problem: Is it possible to decompose the objective function into individual local objective functions that can be evaluated by the corresponding units?

2. Learning I: Assuming that learning is based on the gradient of a global objective function, does the above-mentioned decomposition into local functions imply a corresponding locality of the gradient with respect to the parametrisation? In that case, the individual units would adjust their parameter values, such as the synaptic connection strengths in the case of neural networks, based on local information. This is a typical implicit assumption within the field of neural networks, most prominently realised in terms of Hebbian learning [16], which implies that a connection between two neurons is modified based on their joint activity.

3. Learning II: When computing the natural gradient of an objective function, we have to evaluate (the inverse of) the Fisher information matrix. Even if locality of learning is guaranteed for the Euclidean gradient, this matrix might reintroduce non-locality. We will analyse to what extent the natural gradient preserves locality. One instance of this property corresponds to a block diagonal structure of the Fisher information matrix, which simplifies its inversion [5,28].
We are now going to introduce the required formalism and outline the problem setting in more detail.

Preliminaries and the main problem
We first introduce the notation used in this article. Let S be a non-empty finite set. We denote the canonical basis of the vector space R^S by e_s, s ∈ S. The corresponding dual vectors δ^s ∈ (R^S)*, s ∈ S, are defined by δ^s(e_{s'}) = 1 if s = s', and 0 otherwise. We write P(S) for the simplex of probability vectors on S and P_+(S) for its interior, consisting of the strictly positive probability vectors. On P_+(S), the Fisher-Rao metric is given by

$$\langle A, B \rangle_p \;=\; \sum_{s \in S} \frac{1}{p(s)}\, A(s)\, B(s), \qquad A, B \in T_p P_+(S). \tag{1}$$

Let us now consider a model M ⊆ P(S) which we assume to be a d-dimensional smooth manifold with local coordinates ξ = (ξ^1, ..., ξ^d) → p_ξ, where ξ is from an open domain Ξ in R^d. Below, we will treat more general models, but starting with manifolds allows us to outline more clearly the challenges we face in the context of the natural gradient method. With p(s; ξ) := p_ξ(s), we define the vectors ∂_i(ξ) := ∂p_ξ/∂ξ^i, i = 1, ..., d, which span the tangent space T_ξ M. (Throughout this article, we use the subscript ξ, as in T_ξ M, to denote the point p_ξ whenever this simplifies the notation.) From (1) we then obtain the Fisher information matrix G(ξ) = (g_{ij}(ξ))_{ij}, defined by

$$g_{ij}(\xi) \;=\; \langle \partial_i(\xi), \partial_j(\xi) \rangle_\xi \;=\; \sum_{s \in S} p(s; \xi)\, \frac{\partial \ln p(s; \xi)}{\partial \xi^i}\, \frac{\partial \ln p(s; \xi)}{\partial \xi^j}. \tag{2}$$

Given a smooth function L : M → R, its gradient grad_ξ^M L ∈ T_ξ M in p_ξ is the direction of steepest ascent. It has the following usual representation in the local coordinates ξ:

$$\mathrm{grad}_\xi^M L \;\;\hat{=}\;\; G^{-1}(\xi)\, \left( \frac{\partial (L \circ p)}{\partial \xi^1}(\xi), \ldots, \frac{\partial (L \circ p)}{\partial \xi^d}(\xi) \right)^{\!\top}. \tag{3}$$

As we can see, the gradient depends on the Fisher-Rao metric in p_ξ. It is this dependence that makes it natural and the reason for calling it the natural gradient. Let us clarify how to read equation (3). The LHS is a vector in T_ξ M, whereas the RHS is a vector in R^d, which appears somewhat inconsistent. The way to read this is the following: as a vector in the tangent space in p_ξ, the gradient has a representation grad_ξ^M L = Σ_i x^i ∂_i(ξ) with respect to the basis ∂_1(ξ), ..., ∂_d(ξ). The coordinates x^i are then given by the RHS of (3) (see also "Moore-Penrose inverse and gradients" of the Appendix).
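The definitions above can be sketched on the smallest possible example: a hypothetical one-parameter Bernoulli family on S = {0, 1} with a sigmoid parametrisation (our own illustrative choice, not taken from the text), trained by natural gradient descent on the KL-divergence from a target distribution.

```python
import numpy as np

# Hypothetical one-parameter Bernoulli family p_xi on S = {0, 1}.
def p(xi):
    q = 1.0 / (1.0 + np.exp(-xi))             # p(1; xi), sigmoid parametrisation
    return np.array([1.0 - q, q])

def fisher(xi, eps=1e-6):
    # g(xi) = sum_s p(s; xi) (d ln p(s; xi) / d xi)^2, cf. (2); a 1x1 matrix here
    dlogp = (np.log(p(xi + eps)) - np.log(p(xi - eps))) / (2 * eps)
    return np.sum(p(xi) * dlogp * dlogp)

p_star = np.array([0.25, 0.75])               # target distribution (our choice)
L = lambda xi: np.sum(p_star * np.log(p_star / p(xi)))   # KL-divergence

xi = 0.0
for _ in range(100):
    eps = 1e-6
    euclid = (L(xi + eps) - L(xi - eps)) / (2 * eps)     # Euclidean gradient
    xi -= 0.1 * euclid / fisher(xi)           # natural gradient step, cf. (3)
print(np.round(p(xi), 3))   # approaches the target (0.25, 0.75)
```

The natural gradient step divides the Euclidean gradient by the Fisher information, which here simply rescales the step; in higher dimensions this becomes the matrix inversion discussed below.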
In this article, the set S will typically be a Cartesian product of state sets of units, for instance binary neurons. More precisely, we consider a non-empty and finite set N of units, decomposed as N = V ∪ H into a set V of visible units and a set H of hidden units, with n := |V| and m := |H|. Each unit s ∈ N carries a finite state set X_s, and for a subset A ⊆ N we write X_A for the set of configurations on A. The joint distributions of the full system form the simplex P_{V,H} := P(X_V × X_H), and marginalisation over the hidden configurations defines a map π_V : P_{V,H} → P_V := P(X_V). Given a model M ⊆ P_{V,H}, we obtain the projected model M_V := π_V(M) ⊆ P_V. The definition (2) of the Fisher information matrix in ξ can now be applied to both models, M and M_V. In order to distinguish them from each other, we write g_{ij}(ξ) = ⟨∂_i(ξ), ∂_j(ξ)⟩_ξ for the components of the Fisher information matrix G(ξ) in p_ξ ∈ M, and correspondingly g^V_{ij}(ξ) := ⟨∂^V_i(ξ), ∂^V_j(ξ)⟩_ξ, with ∂^V_i(ξ) := ∂π_V(p_ξ)/∂ξ^i, for the components of the Fisher information matrix G_V(ξ) in π_V(p_ξ) ∈ M_V. Notice that we face a number of difficulties already at this point.
1. Even if we choose M to be a smooth manifold, its projection M_V is typically a much more complicated geometric object with various kinds of singularities (to be formally defined in Section 3.1). Therefore, we will allow for more general models without assuming M to be a smooth manifold in the first place. However, we will restrict attention to non-singular points only.

2. In addition to having a general model M, we also drop the assumption that the parametrisation ξ = (ξ^1, ..., ξ^d) → p_ξ is given by a (diffeomorphic) coordinate system. This has consequences for the definition of the Fisher-Rao metric in a non-singular point p_ξ:

(a) In order to interpret the Fisher-Rao metric as a Riemannian metric, the derivatives ∂_i(ξ) = ∂p_ξ/∂ξ^i, i = 1, ..., d, have to span the whole tangent space T_ξ M in p_ξ. (This is often implicitly assumed but not explicitly stated.) Otherwise, the Fisher-Rao metric defined by (2) will not be positive definite. We will refer to a parametrisation that satisfies this condition as a proper parametrisation. Note that for a proper parametrisation ξ → p_ξ of M, the composition ξ → π_V(p_ξ) is not necessarily a proper parametrisation of M_V.

(b) Another consequence of not having a coordinate system as a parametrisation is the fact that the number d of parameters may exceed the dimension of the model. Even if we assume M to be a smooth manifold and its parametrisation given by a coordinate system, such that d equals the dimension of M, the corresponding projected model M_V can have a much lower dimension. In that case, we say that the model is overparametrised. Such models play an important role within the field of deep learning. The Fisher-Rao metric for such models is well defined in non-singular points. However, the Fisher information matrix (2) will be degenerate, so that the representation of a gradient in terms of the parameters is no longer unique. Below, we will come back to this problem.
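The degeneracy discussed in 2(b) can be sketched numerically. The following hypothetical Bernoulli model has two redundant parameters that enter only through their sum, so the Fisher information matrix (2) is rank-deficient and the natural gradient has to be formed with a (truncated) Moore-Penrose inverse.

```python
import numpy as np

# Hypothetical overparametrised Bernoulli model: xi1, xi2 enter only via xi1 + xi2.
def p(xi):
    q = 1.0 / (1.0 + np.exp(-(xi[0] + xi[1])))
    return np.array([1.0 - q, q])

def fisher(xi, eps=1e-6):
    # Fisher information matrix (2) via numerical scores
    scores = []
    for i in range(2):
        d = np.zeros(2); d[i] = eps
        scores.append((np.log(p(xi + d)) - np.log(p(xi - d))) / (2 * eps))
    G = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            G[i, j] = np.sum(p(xi) * scores[i] * scores[j])
    return G

G = fisher(np.array([0.3, -0.1]))
print(round(float(np.linalg.det(G)), 8))      # ~ 0: G is degenerate
# rcond truncates the spurious tiny singular value of the numerically noisy matrix
nat = np.linalg.pinv(G, rcond=1e-8) @ np.array([1.0, 1.0])
print(np.round(nat, 4))                       # both components coincide
```

The pseudo-inverse distributes the update symmetrically over the redundant parameters, which is one concrete reading of the non-uniqueness mentioned above.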
We use the natural gradient method in order to minimise (or maximise) a function L : M_V → R which is usually obtained as a restriction of a smooth function defined on P_V. Therefore, it is natural to use the Fisher-Rao metric on M_V, which is inherited from P_V. Assuming that all required quantities are well defined, and writing L̄(ξ) := L(π_V(p_ξ)), we can express this natural gradient in terms of the parametrisation as

$$\mathrm{grad}_\xi^{M_V} L \;\;\hat{=}\;\; G_V^{+}(\xi)\, \left( \frac{\partial \bar L}{\partial \xi^1}(\xi), \ldots, \frac{\partial \bar L}{\partial \xi^d}(\xi) \right)^{\!\top}, \tag{4}$$

where G_V^+(ξ) is the Moore-Penrose inverse of the Fisher information matrix G_V(ξ) of the model M_V (for details on the Moore-Penrose inverse, see "Moore-Penrose inverse and gradients" of the Appendix). If the parametrisation is given by a coordinate system, then this reduces to the ordinary matrix inverse. The general difficulty that we face with Eq. (4) is the inversion of the Fisher information matrix, especially in deep networks with many parameters. On the other hand, the model M_V is obtained as the image of the model M, which can be easier to handle, despite the fact that it "lives" in the larger space P_{V,H}. Instead of optimising the function L on M_V, we can try to optimise the pullback of L, that is L ∘ π_V, which is defined on M. But this creates a conceptual problem related to the very nature of the natural gradient method. As M inherits the Fisher-Rao metric from P_{V,H}, we can express the corresponding gradient as

$$\mathrm{grad}_\xi^{M} (L \circ \pi_V) \;\;\hat{=}\;\; G^{+}(\xi)\, \left( \frac{\partial \bar L}{\partial \xi^1}(\xi), \ldots, \frac{\partial \bar L}{\partial \xi^d}(\xi) \right)^{\!\top}. \tag{5}$$

This can simplify the problem in various ways. As already outlined, M_V typically has singularities, even if M is a smooth manifold. In that case, the gradient (5) is well defined for all ξ, whereas the gradient (4) is not. A further simplification comes from the fact that M is typically associated with some network, which implies a block structure of the Fisher information matrix G(ξ) in p_ξ ∈ M. In Section 2, we will demonstrate this simplification for models that are associated with directed acyclic graphs, where the elements of M factorise accordingly.
With this simplification, the inversion of G(ξ) can become much easier than the inversion of the Fisher information matrix G_V(ξ) of the projected model (when the latter is defined). On the other hand, if we consider the model M_V to be the prime model, where the hidden units play the role of auxiliary units, then we have to use the information geometry of M_V for learning. Therefore, it is important to relate the corresponding natural gradients, that is (4) and (5), to each other. This is done in a second step, presented in Sect. 3. In particular, we will identify conditions for the equivalence of the two gradients, leading to a new interpretation of Chentsov's classical characterisation of the Fisher-Rao metric in terms of its invariance with respect to congruent Markov morphisms [12]. (A general version of this characterisation is provided in [8].) Based on the comparison of the gradients (4) and (5), we will investigate how to extend locality properties of learning that hold for M to the model M_V. This is closely related to the above-mentioned approximate methods as alternatives to the natural gradient method. Of particular relevance in this context is the replacement of the Fisher information matrix by the unitwise Fisher information matrices as studied in [20,26]. Note, however, that we are not aiming at approximating the natural gradient on M_V by the unitwise natural gradient. In this article, we aim at identifying conditions for their equivalence. Furthermore, in order to satisfy these conditions we propose an extension of M which corresponds to an interesting extension of the underlying network. This will lead us to a new interpretation of so-called recognition models, which are used in the context of Helmholtz machines and the wake-sleep algorithm [13,17,25]. Information-geometric works on the wake-sleep algorithm and its close relation to the em-algorithm are classical [1,14,18]. More recent contributions to the information geometry of the wake-sleep algorithm are provided by [10] and [31].
Directions of related research in view of this article are outlined in the conclusions, Sect. 4.

Locality of the Euclidean gradient
We now define a sub-manifold of P_{V,H} in terms of a directed acyclic graph G = (N, E), where E is the set of directed edges. For a node s, we define the set pa(s) := {r ∈ N : (r, s) ∈ E} of its parents and the set ch(s) := {t ∈ N : (s, t) ∈ E} of its children. The latter will only be used in "Gibbs sampling" of the Appendix.
With each node s we associate a local Markov kernel, that is, a map

$$k^s : X_{pa(s)} \to P(X_s), \qquad x_{pa(s)} \mapsto k^s(\cdot \,|\, x_{pa(s)}),$$

with Σ_{x_s} k^s(x_s | x_{pa(s)}) = 1 for all x_{pa(s)} ∈ X_{pa(s)}. Note that for pa(s) = ∅, the configuration set X_{pa(s)} consists of one element, the empty configuration. In this case, a Markov kernel reduces to a probability vector over X_s. (We will revisit Markov kernels from a geometric perspective in Section 3.2.) Given such a family of Markov kernels, we define the joint distribution

$$p(x) \;=\; \prod_{s \in N} k^s(x_s \,|\, x_{pa(s)}). \tag{6}$$

The distributions of the product structure (6) form a (statistical) model that plays an important role within the field of graphical models, in particular Bayesian networks [21]. A natural sub-model is given by the product distributions, that is, those distributions of the form p(x) = Π_{s ∈ N} p^s(x_s). In order to treat a more general sub-model, in particular one that is given by a neural network, a so-called neuro-manifold, we consider for each unit s ∈ N a parametrisation R^{d_s} ∋ ξ^s = (ξ^{(s;1)}, ..., ξ^{(s;d_s)}) → k^s_{ξ^s}. This defines a model M as the image of the map

$$\xi = (\xi^s)_{s \in N} \;\mapsto\; p_\xi, \qquad p(x; \xi) \;=\; \prod_{s \in N} k^s(x_s \,|\, x_{pa(s)}; \xi^s), \tag{7}$$

where k^s(x_s | x_{pa(s)}; ξ^s) := k^s_{ξ^s}(x_s | x_{pa(s)}). In order to use vector and matrix notation, we consider a numbering of the units, that is, N = {s_1, ..., s_{n+m}}, with i ≤ j whenever s_i ∈ pa(s_j). To simplify notation, we can alternatively assume, without loss of generality, N = {1, 2, ..., n+m} such that r ≤ s whenever r ∈ pa(s). This allows us to write the parametrisation (7) in the form ξ = (ξ^1, ..., ξ^{n+m}). Now we come to the main objective of learning as studied in this article. Given a target probability vector p* ∈ P_V on the state set of visible units, the aim is to represent, or at least approximate, it by an appropriate element p̂ of the model M_V = π_V(M) ⊆ P_V. Such an approximation requires a measure of proximity, a divergence, between probability vectors. Information geometry provides ways to identify a natural choice of such a divergence, referred to as canonical divergence [6].
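The product structure (6) can be sketched directly in code. The following minimal example, with a hypothetical three-node chain of binary units and sigmoid kernels of our own choosing, verifies that the product of normalised local kernels is itself a normalised joint distribution.

```python
import itertools
import numpy as np

# Hypothetical chain 1 -> 2 -> 3 with binary states {-1, +1}.
parents = {1: [], 2: [1], 3: [2]}
states = [-1, +1]

def kernel(s, x_s, x_pa, w=0.8):
    # Local sigmoid kernel; for a root node the parent configuration is empty.
    h = w * sum(x_pa)                        # local field from the parents
    return 1.0 / (1.0 + np.exp(-x_s * h))

def joint(x):
    # x is a dict node -> state; eq. (6): p(x) = prod_s k^s(x_s | x_pa(s))
    p = 1.0
    for s, pa in parents.items():
        p *= kernel(s, x[s], [x[r] for r in pa])
    return p

total = sum(joint(dict(zip(parents, xs)))
            for xs in itertools.product(states, repeat=3))
print(total)   # sums to 1 (up to rounding)
```

Because each kernel is normalised over its own state, the joint distribution needs no global normalisation constant, which is what makes ancestral sampling from such models easy.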
In the present context, the relative entropy or Kullback-Leibler divergence (abbreviated as KL-divergence) between two probability vectors p and q is the most commonly used divergence. This leads to the search for a probability vector p̂ ∈ M_V that satisfies

$$D(p^* \,\|\, \hat p) \;=\; \inf_{q \in M_V} D(p^* \,\|\, q).$$

For this search we use the parametrisation (7) of the elements of M and define the function

$$L(\xi) \;:=\; D\big(p^* \,\big\|\, \pi_V(p_\xi)\big). \tag{10}$$

Minimisation of L can be realised in terms of the gradient method. In this section we begin with the Euclidean gradient, which is determined by the partial derivatives of L. It is remarkable that, even though the network can be large, with many hidden units, the resulting derivatives are local in a very useful way (see a similar derivation in the context of sigmoid belief networks in [24]). With p*_ξ(x_V, x_H) := p*(x_V) p(x_H | x_V; ξ), we have

$$\frac{\partial L}{\partial \xi^{(r;i)}}(\xi) \;=\; -\sum_{x_{pa(r)}, x_r} p^*_\xi(x_{pa(r)}, x_r)\, \frac{\partial}{\partial \xi^{(r;i)}} \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r). \tag{11}$$

We have an expectation value of a function, ∂/∂ξ^{(r;i)} ln k^r(x_r | x_{pa(r)}; ξ^r), that is local in two ways: all arguments of this function, the states and the parameters, are local with respect to the node r. However, the distribution p*_ξ, used for the evaluation of the expectation value, depends on the full set of parameters ξ. On the other hand, due to the locality of ∂/∂ξ^{(r;i)} ln k^r(x_r | x_{pa(r)}; ξ^r) with respect to the states x_{pa(r)} and x_r, this expectation value depends only on the marginal p*_ξ(x_{pa(r)}, x_r). One natural way to approximate (11) is by sampling from this distribution. This is typically difficult, compared to sampling from p_ξ, which factorises according to the underlying directed acyclic graph G. "One-shot sampling" from p_ξ is possible by simply using p_ξ as a generative model, which here simply means recursive application of the local kernels k^r_{ξ^r} according to the underlying directed acyclic graph. This kind of sampling is also referred to as ancestral sampling [15]. As p*_ξ incorporates the target distribution p* on X_V and does not necessarily factorise according to G, sampling from it has to run much longer.
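Ancestral ("one-shot") sampling can be sketched as follows: each unit is drawn in topological order from its local kernel, given the already-sampled parent states. The chain, weight, and sigmoid kernels below are hypothetical illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain 1 -> 2 -> 3, binary states {-1, +1}, weight w.
parents = {1: [], 2: [1], 3: [2]}

def sample(w=0.8):
    x = {}
    for s in sorted(parents):                # topological order
        h = w * sum(x[r] for r in parents[s])
        p_plus = 1.0 / (1.0 + np.exp(-h))    # probability of state +1
        x[s] = +1 if rng.random() < p_plus else -1
    return x

samples = [sample() for _ in range(20000)]
# Node 1 is a root with field h = 0, so its marginal is uniform on {-1, +1};
# the positive weight induces a positive correlation along the chain.
mean1 = np.mean([x[1] for x in samples])
corr12 = np.mean([x[1] * x[2] for x in samples])
print(round(mean1, 2), round(corr12, 2))
```

Sampling from p*_ξ, in contrast, would require conditioning the hidden units on visible data, which breaks the factorisation and motivates the Gibbs sampling and recognition-model alternatives discussed below.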
For completeness, the Gibbs sampling method is outlined in more detail in "Gibbs sampling" of the Appendix.
We exemplify the derivative (11) in the context of binary neurons where it leads to a natural learning rule.

Example 1 (Neural networks (I))
We assume that the units r ∈ N, referred to as neurons in this context, are binary with state sets {−1, +1}. For each neuron r, we consider a vector w_r = (w_{ir})_{i ∈ pa(r)} of synaptic connection strengths and a threshold value ϑ_r. For a synaptic strength w_{ir}, i is referred to as the pre-synaptic and r the post-synaptic neuron, respectively. We set ξ^{(r;i)} := w_{ir}, i = 1, ..., d_r − 1, and ξ^{(r;d_r)} := ϑ_r, that is, ξ^r = (w_r, ϑ_r). In order to update its state, the neuron first evaluates the local function h_r(x_{pa(r)}) := Σ_{i ∈ pa(r)} w_{ir} x_i − ϑ_r and then generates a state x_r ∈ {−1, +1} with probability

$$k^r(x_r \,|\, x_{pa(r)}; w_r, \vartheta_r) \;=\; \frac{1}{1 + e^{-x_r\, h_r(x_{pa(r)})}}. \tag{12}$$

We calculate the derivatives

$$\frac{\partial}{\partial w_{ir}} \ln k^r(x_r \,|\, x_{pa(r)}; w_r, \vartheta_r) \;=\; \frac{x_i\, x_r}{1 + e^{x_r h_r(x_{pa(r)})}}, \tag{13}$$

$$\frac{\partial}{\partial \vartheta_r} \ln k^r(x_r \,|\, x_{pa(r)}; w_r, \vartheta_r) \;=\; -\frac{x_r}{1 + e^{x_r h_r(x_{pa(r)})}}, \tag{14}$$

and, with (11), we obtain

$$\frac{\partial L}{\partial w_{ir}}(\xi) \;=\; -\sum_{x_{pa(r)}, x_r} p^*_\xi(x_{pa(r)}, x_r)\, \frac{x_i\, x_r}{1 + e^{x_r h_r(x_{pa(r)})}}, \tag{15}$$

$$\frac{\partial L}{\partial \vartheta_r}(\xi) \;=\; \sum_{x_{pa(r)}, x_r} p^*_\xi(x_{pa(r)}, x_r)\, \frac{x_r}{1 + e^{x_r h_r(x_{pa(r)})}}. \tag{16}$$

Equation (15) is one instance of the Hebb rule, which is based on the learning paradigm phrased as "cells that fire together wire together" [16]. Note, however, that the causal interpretation of the underlying directed acyclic graph ensures that the pre-synaptic activity x_i is measured before the post-synaptic activity x_r. This causally consistent version of the Hebb rule has been experimentally studied in the context of spike-timing-dependent plasticity of real neurons (e.g., [9]). In order to use the derivatives (15) and (16) for learning, we have to sample from p*_ξ. An outline of Gibbs sampling in this context is provided in Example 7 of the Appendix.
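The derivatives (13) and (14) can be sketched for a single post-synaptic neuron with two pre-synaptic neurons; the weights, threshold, and states below are hypothetical values chosen for illustration.

```python
import numpy as np

# States in {-1, +1}; the kernel is the sigmoid neuron model (12).
def grad_log_kernel(x_pre, x_r, w, theta):
    # d/dw_ir   ln k^r =  x_i x_r / (1 + exp(x_r h_r)),  h_r = w . x_pre - theta
    # d/dtheta_r ln k^r = -x_r    / (1 + exp(x_r h_r))
    h = w @ x_pre - theta
    return x_pre * x_r / (1.0 + np.exp(x_r * h)), -x_r / (1.0 + np.exp(x_r * h))

w, theta = np.array([0.5, -0.3]), 0.1
x_pre = np.array([+1, -1])

# Consistency check: since the kernel is normalised over x_r, the expected
# score under k^r(. | x_pre) vanishes (this fact is used later in the proof
# of the block structure of the Fisher information matrix).
h = w @ x_pre - theta
k_plus, k_minus = 1 / (1 + np.exp(-h)), 1 / (1 + np.exp(h))
gp, _ = grad_log_kernel(x_pre, +1, w, theta)
gm, _ = grad_log_kernel(x_pre, -1, w, theta)
score = k_plus * gp + k_minus * gm
print(score)   # ~ [0, 0]
```

In a learning step, these per-neuron derivatives would be averaged over samples of (x_pa(r), x_r), giving the Hebb-like rule (15).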

The wake-sleep algorithm
We now highlight an important alternative to sampling from p*_ξ for the computation of the derivative (11). This alternative is based on the idea that we have, in addition to the generative model M of distributions p_ξ, a so-called recognition model Q_{H|V} of conditional distributions q(x_H | x_V; η) with which we can approximate p(x_H | x_V; ξ). As a consequence, such a recognition model allows us to approximate (11), where we replace p*_ξ by

$$q^*_\eta(x_V, x_H) \;:=\; p^*(x_V)\, q(x_H \,|\, x_V; \eta), \tag{17}$$

and correspondingly the marginals on pa(r) ∪ {r}. We obtain

$$\frac{\partial L}{\partial \xi^{(r;i)}}(\xi) \;\approx\; -\sum_{x_{pa(r)}, x_r} q^*_\eta(x_{pa(r)}, x_r)\, \frac{\partial}{\partial \xi^{(r;i)}} \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r). \tag{18}$$

For the evaluation of the gradient of L with respect to the ξ-parameters we can now use the recognition model instead of the generative model. This approximation will be the more accurate the smaller the following relative entropy is:

$$\sum_{x_V, x_H} p(x_V, x_H; \xi)\, \ln \frac{p(x_H \,|\, x_V; \xi)}{q(x_H \,|\, x_V; \eta)}. \tag{20}$$

Ideally, we would like the recognition model to be rich enough to represent the conditional distributions of the generative model. More precisely, we assume that for all ξ, there is an η such that q(x_H | x_V; η) = p(x_H | x_V; ξ). For the minimisation of (20) to be tractable, we assume that q(x_H | x_V; η) also factorises according to some directed acyclic graph G′, so that

$$q(x_H \,|\, x_V; \eta) \;=\; \prod_{r \in H} q^r(x_r \,|\, x_{pa'(r)}; \eta^r), \tag{21}$$

where pa′(r) denotes the parent set of the node r with respect to the graph G′. With the assumption (21), the expression (20) simplifies considerably, and for its partial derivatives with respect to the η-parameters we obtain

$$-\sum_{x_{pa'(r)}, x_r} p(x_{pa'(r)}, x_r; \xi)\, \frac{\partial}{\partial \eta^{(r;j)}} \ln q^r(x_r \,|\, x_{pa'(r)}; \eta^r) \tag{22}$$

$$\;=\; \sum_{x_{pa'(r)}} p(x_{pa'(r)}; \xi)\, \frac{\partial}{\partial \eta^{(r;j)}}\, D\big( p(\cdot \,|\, x_{pa'(r)}; \xi) \,\big\|\, q^r(\cdot \,|\, x_{pa'(r)}; \eta^r) \big). \tag{23}$$

Note that, while p_ξ factorises according to G, so that the conditional distribution p(x_r | x_{pa(r)}; ξ) coincides with the kernel k^r(x_r | x_{pa(r)}; ξ^r), the conditional distribution p(x_r | x_{pa′(r)}; ξ) with respect to G′ does not have a correspondingly simple structure.
On the other hand, we can easily sample from p_ξ, and thereby also from p(x_{pa′(r)}; ξ) and p(x_r | x_{pa′(r)}; ξ), using the product structure with respect to G.
Let us now come back to the original problem of minimising L with respect to ξ based on the gradient descent method. If the parameter η of the recognition model is such that the approximation (17) is exact, then we can evaluate the partial derivatives ∂/∂ξ^{(r;i)} by sampling from the target distribution p*(x_V) together with the recognition model q(x_H | x_V; η). This can then be used for updating the parameter ξ, say from ξ to ξ + Δξ, where Δξ is proportional to the Euclidean gradient. As this update is based on sampling from the target distribution p*(x_V) and the recognition model q(x_H | x_V; η), it is referred to as the wake phase. After this update, we typically have q(x_H | x_V; η) ≠ p(x_H | x_V; ξ + Δξ). In order to use (17) for the next update of ξ, we therefore have to readjust η, say from η to η + Δη, so that we recover the identity q(x_H | x_V; η + Δη) = p(x_H | x_V; ξ + Δξ). This can be achieved by choosing Δη to be proportional to the Euclidean gradient (22) with respect to η. The evaluation of the partial derivatives ∂/∂η^{(r;j)} requires sampling from the generative model p(x_V, x_H; ξ), with no involvement of the target distribution p*(x_V). This is the reason why the η-update is referred to as the sleep phase. Alternating application of the wake phase and the sleep phase yields the so-called wake-sleep algorithm, which has been introduced and studied in the context of neural networks in [13,17,25]. It has been pointed out that this algorithm cannot be interpreted as a gradient descent algorithm of a potential function in both variables ξ and η. On the other hand, here we derived the wake-sleep algorithm as a gradient descent algorithm for the optimisation of the objective function L, which only depends on the variable ξ. The auxiliary variable η is used for the approximation of the gradient of L with respect to ξ. In order to have a good approximation of this gradient, we have to apply the sleep phase update more often, until convergence of η. Only then can we update ξ within the next wake phase.
With this asymmetry of time scales for the two phases, the wake-sleep algorithm is a gradient descent algorithm for ξ, as has been pointed out in the context of the em-algorithm in [18].
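The alternating scheme can be sketched on a minimal example with one hidden unit h and one visible unit v (states in {-1, +1}). The parametrisations of the generative and the recognition model below are hypothetical choices; expectations are computed by exact enumeration and gradients numerically, to keep the sketch short, whereas in practice both phases would use sampling.

```python
import itertools
import numpy as np

sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def p_joint(v, h, xi):                       # generative model p(h) p(v|h)
    b_h, w, b_v = xi
    return sig(h * b_h) * sig(v * (w * h - b_v))

def q_cond(h, v, eta):                       # recognition model q(h|v)
    return sig(h * (eta[0] * v - eta[1]))

p_star = {+1: 0.8, -1: 0.2}                  # target distribution on v

def wake_grad(xi, eta, eps=1e-5):
    # wake phase: E_{p*(v) q(h|v;eta)} [ grad_xi ln p(v, h; xi) ]
    g = np.zeros(3)
    for v, h in itertools.product([-1, +1], repeat=2):
        weight = p_star[v] * q_cond(h, v, eta)
        for i in range(3):
            d = np.zeros(3); d[i] = eps
            g[i] += weight * (np.log(p_joint(v, h, xi + d))
                              - np.log(p_joint(v, h, xi - d))) / (2 * eps)
    return g

def sleep_grad(xi, eta, eps=1e-5):
    # sleep phase: E_{p(v,h;xi)} [ grad_eta ln q(h|v;eta) ]
    g = np.zeros(2)
    for v, h in itertools.product([-1, +1], repeat=2):
        weight = p_joint(v, h, xi)
        for j in range(2):
            d = np.zeros(2); d[j] = eps
            g[j] += weight * (np.log(q_cond(h, v, eta + d))
                              - np.log(q_cond(h, v, eta - d))) / (2 * eps)
    return g

xi, eta = np.zeros(3), np.zeros(2)
for _ in range(200):
    for _ in range(5):                       # several sleep steps per wake step
        eta = eta + 0.5 * sleep_grad(xi, eta)
    xi = xi + 0.5 * wake_grad(xi, eta)

p_v = sum(p_joint(+1, h, xi) for h in (-1, +1))
print(round(p_v, 3))   # the visible marginal approaches p*(+1) = 0.8
```

The inner loop reflects the time-scale asymmetry: the recognition parameters η are adjusted several times before each update of the generative parameters ξ.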
We have introduced the parameters η for sampling and thereby evaluating the derivative (11). However, there is another remarkable feature of the corresponding extended optimisation problem. While the original objective function L, defined by (10), does not appear to be local in any sense, the extended optimisation in terms of a generalised wake-sleep algorithm, which is equivalent to the original problem, is based on a set of local functions associated with the respective units. More precisely, the expressions (18) and (22) are derivatives of local cross entropies, whereas the expressions (19) and (23) are derivatives of local KL-divergences.
We conclude with the important note that a recognition model which, on the one hand, is rich enough to represent all distributions p(x_H | x_V; ξ) and, on the other hand, factorises according to (21) might require a large graph G′ and a correspondingly large number of parameters η^{(r;j)}, which constitute the vector η. In practice, the recognition model is typically chosen to be of the same dimensionality as the generative model and does not necessarily satisfy the above conditions. Figure 1a depicts a typical generative network G, which underlies the model p(x_V, x_H; ξ). It is a directed acyclic network, and we assume that the model is simply given by the set of all joint distributions on X_H × X_V that factorise according to G. In Fig. 1b, we have a typical recognition network G′ obtained from the generative network G of Fig. 1a by reverting the directions of all arrows. The corresponding recognition model is given by the set of all conditional distributions q(x_H | x_V) that factorise according to G′. However, this model is not large enough to ensure that for all ξ, there is an η such that q(x_H | x_V; η) = p(x_H | x_V; ξ).

Example 2
Adding further lateral connections, as shown in Fig. 1c, enlarges G′, and we obtain a correspondingly enlarged recognition model which now has that property.

Locality of the natural gradient
In the previous section, we computed the partial derivatives (11), which turn out to be local and allow us to apply the gradient method for learning. However, from the information-geometric point of view, we have to use the Fisher-Rao metric for the gradient, which leads us to the natural gradient method. In general, the natural gradient is difficult to evaluate because the Fisher information matrix has to be inverted (see Eqs. (4) and (5)). In our context of a model that is associated with a directed acyclic graph G, however, the Fisher information matrix simplifies considerably. More precisely, we consider a model M with the parametrisation (7). The tangent space of M in p_ξ is spanned by the vectors

$$\partial_{(r;i)}(\xi) \;:=\; \frac{\partial p_\xi}{\partial \xi^{(r;i)}}, \qquad r \in N, \; i = 1, \ldots, d_r. \tag{24}$$

The following theorem specifies the structure of the Fisher information matrix G(ξ) with the entries g_{(r;i)(s;j)}(ξ) = ⟨∂_{(r;i)}(ξ), ∂_{(s;j)}(ξ)⟩_ξ.

Theorem 1
Let M be a model with the parametrisation (7). Then the Fisher information matrix G(ξ) is block diagonal, with one block G_r(ξ) = (g_{(r;i,j)}(ξ))_{i,j} for each node r ∈ N, where

$$g_{(r;i,j)}(\xi) \;=\; \sum_{x_{pa(r)}} p(x_{pa(r)}; \xi)\; C_{ij}(x_{pa(r)}; \xi^r) \tag{25}$$

and

$$C_{ij}(x_{pa(r)}; \xi^r) \;=\; \sum_{x_r} k^r(x_r \,|\, x_{pa(r)}; \xi^r)\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;i)}}\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;j)}}. \tag{26}$$

With this, the entries of the Fisher information matrix G(ξ) are given by g_{(r;i)(s;j)}(ξ) = g_{(r;i,j)}(ξ) whenever r = s, and g_{(r;i)(s;j)}(ξ) = 0 otherwise. Using matrix notation, we have

$$G(\xi) \;=\; \begin{pmatrix} G_1(\xi) & & 0 \\ & \ddots & \\ 0 & & G_{n+m}(\xi) \end{pmatrix}.$$

Proof The parametrisation (7) yields

$$\ln p(x; \xi) \;=\; \sum_{s \in N} \ln k^s(x_s \,|\, x_{pa(s)}; \xi^s)$$

and therefore

$$\frac{\partial \ln p(x; \xi)}{\partial \xi^{(r;i)}} \;=\; \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;i)}}. \tag{27}$$

In what follows, we use the shorthand notation x_{<s} and x_{>s} for x_{{i ∈ N : i < s}} and x_{{i ∈ N : i > s}}, respectively. With (27) we obtain for r ≤ s:

$$g_{(r;i)(s;j)}(\xi) \;=\; \sum_{x} p(x; \xi)\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;i)}}\, \frac{\partial \ln k^s(x_s \,|\, x_{pa(s)}; \xi^s)}{\partial \xi^{(s;j)}}.$$

For r < s, all states appearing in the first score belong to x_{<s}, and summing first over x_{>s} and then over x_s gives

$$g_{(r;i)(s;j)}(\xi) \;=\; \sum_{x_{<s}} p(x_{<s}; \xi)\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;i)}}\, \sum_{x_s} k^s(x_s \,|\, x_{pa(s)}; \xi^s)\, \frac{\partial \ln k^s(x_s \,|\, x_{pa(s)}; \xi^s)}{\partial \xi^{(s;j)}} \;=\; 0,$$

since the inner sum equals the derivative of Σ_{x_s} k^s(x_s | x_{pa(s)}; ξ^s) = 1 and therefore vanishes. If r = s, this expression reduces to

$$g_{(r;i)(r;j)}(\xi) \;=\; \sum_{x_{pa(r)}, x_r} p(x_{pa(r)}, x_r; \xi)\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;i)}}\, \frac{\partial \ln k^r(x_r \,|\, x_{pa(r)}; \xi^r)}{\partial \xi^{(r;j)}} \;=\; g_{(r;i,j)}(\xi),$$

using p(x_{pa(r)}, x_r; ξ) = p(x_{pa(r)}; ξ) k^r(x_r | x_{pa(r)}; ξ^r). This concludes the proof.
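The block structure can be checked numerically on a minimal example: a hypothetical chain 1 → 2 of binary units with sigmoid kernels, where the parameter of node 1 and the two parameters of node 2 form the two blocks.

```python
import itertools
import numpy as np

sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def joint(x1, x2, xi):
    # xi = (theta1, w, theta2): p(x) = k^1(x1) k^2(x2 | x1), per (6)-(7)
    return sig(x1 * xi[0]) * sig(x2 * (xi[1] * x1 - xi[2]))

def fisher(xi, eps=1e-5):
    # Fisher information matrix via numerically computed scores
    d = len(xi)
    G = np.zeros((d, d))
    for x1, x2 in itertools.product([-1, +1], repeat=2):
        grad = np.zeros(d)
        for i in range(d):
            e = np.zeros(d); e[i] = eps
            grad[i] = (np.log(joint(x1, x2, xi + e))
                       - np.log(joint(x1, x2, xi - e))) / (2 * eps)
        G += joint(x1, x2, xi) * np.outer(grad, grad)
    return G

G = fisher(np.array([0.3, 0.7, -0.2]))
print(np.round(G, 6))
# The theta1-block is decoupled from the (w, theta2)-block:
# the cross entries G[0, 1] and G[0, 2] vanish, as the theorem states.
```

The vanishing cross entries are exactly the zeros imposed by the directed acyclic structure; only the within-node blocks carry information.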
Theorem 1 highlights a number of simplifications of the Fisher information matrix as a result of the particular parametrisation of the model in terms of a directed acyclic graph. The presented proof is adapted from [5] (see also the related work [28]):

1. The Fisher information matrix G(ξ) has a block structure, reflecting the structure of the underlying graph (see Fig. 3). Each block G_r(ξ) corresponds to a node r and has d_r × d_r components. Outside of these blocks the matrix is filled with zeros. The natural gradient method requires the inversion of G(ξ) (the usual inverse G^{−1}(ξ), if it exists, or, more generally, the Moore-Penrose inverse G^+(ξ)). With the block structure of G(ξ), this inversion reduces to the inversion of the individual matrices G_r(ξ). The corresponding simplification of the natural gradient is summarised in Corollary 1.

2. The terms g_{(r;i,j)}(ξ), defined by (25), are expectation values of the functions C(x_{pa(r)}; ξ^r). These functions are local in two ways: on the one hand, they depend only on local states x_{pa(r)} and, on the other hand, only local parameters ξ^r are involved (see the definition (26)). This kind of locality is very useful in applications of the natural gradient method. Especially in the context of neural networks, locality of learning is considered to be essential. Note, however, that the terms g_{(r;i,j)}(ξ) are not completely local. This is because the expectation value (25) is evaluated with respect to p_ξ, where ξ is the full parameter vector. (As only the distribution of X_{pa(r)} appears, parameters of non-ancestors of r do not play a role in the definition of g_{(r;i,j)}(ξ), which simplifies the situation to some extent.) In order to evaluate the Fisher information matrix in applications, we have to overcome this non-locality by sampling from p(x_{pa(r)}; ξ). As we are dealing with directed acyclic graphs, this can simply be done by recursive application of the local kernels k^r_{ξ^r}.
To highlight the relevance of Theorem 1, let us consider a few simple examples.

Example 3 (Exponential families)
Consider the model given by local kernels of the exponential form

$$k^r(x_r \,|\, x_{pa(r)}; \xi^r) \;=\; \exp\Big( \sum_{i=1}^{d_r} \xi^{(r;i)}\, F_{(r;i)}(x_{pa(r)}, x_r) \;-\; \psi_r(x_{pa(r)}; \xi^r) \Big), \tag{28}$$

where the functions F_{(r;i)} are the sufficient statistics of unit r and ψ_r denotes the local normalisation. In this case, the expression (25) yields

$$g_{(r;i,j)}(\xi) \;=\; \sum_{x_{pa(r)}} p(x_{pa(r)}; \xi)\; \mathrm{Cov}\big( F_{(r;i)}, F_{(r;j)} \,\big|\, x_{pa(r)} \big), \tag{29}$$

where the conditional covariance on the RHS of (29) is evaluated with respect to k^r(· | x_{pa(r)}; ξ^r).
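The underlying identity, that the Fisher information of an exponential family is the covariance matrix of its sufficient statistics, can be checked numerically. The state set, sufficient statistics F, and parameter values below are hypothetical choices for a single kernel without parents.

```python
import numpy as np

# Hypothetical exponential family k(x; xi) = exp(xi . F(x) - psi(xi))
states = [0, 1, 2]
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # F(x) for x in states
xi = np.array([0.2, -0.5])

weights = np.exp(F @ xi)
k = weights / weights.sum()                  # normalised kernel
mean = k @ F
cov = (F - mean).T @ (k[:, None] * (F - mean))       # Cov_k(F_i, F_j)

# Compare with the Fisher matrix computed from the log-likelihood scores.
def logk(t):
    wgt = np.exp(F @ t)
    return np.log(wgt / wgt.sum())

eps = 1e-6
G = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        di = np.zeros(2); di[i] = eps
        dj = np.zeros(2); dj[j] = eps
        si = (logk(xi + di) - logk(xi - di)) / (2 * eps)
        sj = (logk(xi + dj) - logk(xi - dj)) / (2 * eps)
        G[i, j] = np.sum(k * si * sj)
print(np.round(cov - G, 6))   # ~ zero matrix
```

In (29), this per-kernel covariance is additionally averaged over the parent configurations with weights p(x_pa(r); ξ).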

Example 4 (Neural networks (II))
Neural networks, which we introduced in Example 1, can be considered as a special case of the models of Example 3. This can be seen by rewriting the transition probability (12) as follows:

$$k^r(x_r \,|\, x_{pa(r)}; w_r, \vartheta_r) \;=\; \frac{ e^{\frac{x_r}{2} h_r(x_{pa(r)})} }{ e^{\frac{1}{2} h_r(x_{pa(r)})} + e^{-\frac{1}{2} h_r(x_{pa(r)})} } \;=\; \exp\Big( \sum_{i \in pa(r)} \frac{w_{ir}}{2}\, x_i x_r \;-\; \frac{\vartheta_r}{2}\, x_r \;-\; \psi_r(x_{pa(r)}; w_r, \vartheta_r) \Big).$$

This is a special case of (28) which only involves pairwise interactions. In order to evaluate the terms (26) we need the derivatives (13) and (14) of Example 1. According to Theorem 1, we can evaluate the Fisher information matrix in a local way. More explicitly, for the entries associated with the synaptic strengths w_{ir} and w_{jr} we have

$$g_{(r;i,j)}(\xi) \;=\; \sum_{x_{pa(r)}} p(x_{pa(r)}; \xi)\, \frac{x_i\, x_j}{\big(1 + e^{h_r(x_{pa(r)})}\big)\big(1 + e^{-h_r(x_{pa(r)})}\big)},$$

and the entries involving the threshold ϑ_r are obtained analogously, with x_i or x_j replaced by −1.
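The closed form above can be checked against the definition (25)-(26) for a sigmoid neuron with two parents (thresholds ignored); the weights are hypothetical, and the parent distribution is taken uniform for simplicity.

```python
import itertools
import numpy as np

w = np.array([0.6, -0.4])    # hypothetical synaptic strengths

def entry_closed_form(i, j):
    # Fisher block entry for (w_ir, w_jr) from the closed form above
    val = 0.0
    for x in itertools.product([-1, +1], repeat=2):
        h = w @ np.array(x)
        val += 0.25 * x[i] * x[j] / ((1 + np.exp(h)) * (1 + np.exp(-h)))
    return val

def entry_by_definition(i, j):
    # Direct evaluation of (25)-(26): expectation over parents of the
    # conditional second moment of the score
    val = 0.0
    for x in itertools.product([-1, +1], repeat=2):
        h = w @ np.array(x)
        for x_r in (-1, +1):
            k = 1.0 / (1.0 + np.exp(-x_r * h))            # kernel (12)
            score_i = x[i] * x_r / (1 + np.exp(x_r * h))  # derivative (13)
            score_j = x[j] * x_r / (1 + np.exp(x_r * h))
            val += 0.25 * k * score_i * score_j
    return val

print(round(entry_closed_form(0, 1), 6), round(entry_by_definition(0, 1), 6))
```

Both routes give the same number, which is the per-block quantity that a unit-wise natural gradient method would estimate by sampling the parent states.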

Example 5 (Shallow versus deep networks)
In this example, we demonstrate the difference in sparsity of the Fisher information matrix for architectures of varying depth. Fig. 2 shows two networks with three visible and nine hidden neurons each. The number of synaptic connections is 27 in both cases, thereby assuming the neuronal model of Examples 1 and 4 (for simplicity, we do not consider the threshold values). If we associate one parameter with each edge, the synaptic strength, then we have 27 parameters in the system, and the Fisher information matrices have 27 × 27 = 729 entries. Theorem 1 implies the block diagonal structure of the Fisher information matrices shown in Fig. 3. As we can see, depth is associated with higher sparsity of these matrices. We have at least 486 zeros in the shallow case and at least 648 zeros in the deep case.
This example can be generalised to a network with n visible and m = l·n hidden neurons. As in Fig. 2, in the one case we arrange all m hidden neurons in one layer of width l·n and, in the other case, we arrange the hidden neurons in l layers of width n. In both cases, we have n·m = n(l·n) = l·n² edges, corresponding to the number of parameters, and therefore the Fisher information matrix has l²n⁴ entries. With the shallow architecture, we have at most n(l·n)² = l²n³ non-zero entries, whereas in the deep architecture there are at most (l·n)·n² = l·n³ non-zero entries. The difference is l²n³ − l·n³ = l·n³(l − 1). For n = l = 3, we recover the difference of the above numbers, 648 − 486 = 162.
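The counting argument can be summarised in a few lines of code, reproducing the numbers of Example 5 for n = l = 3.

```python
# Counting sketch for Example 5: n visible units, m = l*n hidden units.
def zero_counts(n, l):
    params = l * n * n                  # number of edges = number of parameters
    total = params ** 2                 # entries of the Fisher information matrix
    shallow = n * (l * n) ** 2          # n blocks of size (l*n) x (l*n)
    deep = (l * n) * n ** 2             # l*n blocks of size n x n
    return total - shallow, total - deep

print(zero_counts(3, 3))   # (486, 648): zeros in the shallow vs. deep case
```

Depth increases the number of blocks while shrinking each block, which is exactly where the additional sparsity comes from.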
Example 6 (Restricted Boltzmann machine) If we deal with models that are associated with undirected graphs, we cannot expect the Fisher information matrix to have a block diagonal structure. Consider, for instance, a restricted Boltzmann machine with interaction weights w_{ij} coupling each hidden unit i to each visible unit j, that is, the model of distributions

$$p(x; w) \;\propto\; \exp\Big( \sum_{i,j} w_{ij}\, x_i\, x_j \Big).$$

Note that this deviates somewhat from the setting of a restricted Boltzmann machine as we ignore the threshold values for simplicity. The Fisher information matrix on M is given by

$$g_{(i,j)(k,l)}(w) \;=\; \mathrm{Cov}_{p_w}\big( X_i X_j,\, X_k X_l \big),$$

which has no zeros imposed by the architecture.
The simplification of the Fisher information matrix, stated in Theorem 1, has several important consequences. As an immediate consequence we obtain a corresponding simplification of the natural gradient of a smooth real-valued function L on M, mainly referring to the function (10), in terms of the parametrisation (7). In order to express this simplification we consider the vectors (24), which span the tangent space of M in p_ξ. In particular, they allow us to represent the gradient of L as a linear combination

$$\mathrm{grad}_\xi^{M} L \;=\; \sum_{r \in N} \sum_{i=1}^{d_r} x^{(r;i)}\, \partial_{(r;i)}(\xi). \tag{32}$$

Corollary 1 Consider the situation of Theorem 1 and a real-valued smooth function L on M. With

$$\nabla_r L(\xi) \;:=\; \left( \frac{\partial (L \circ p)}{\partial \xi^{(r;1)}}(\xi), \ldots, \frac{\partial (L \circ p)}{\partial \xi^{(r;d_r)}}(\xi) \right)^{\!\top},$$

we have the following coordinates of the natural gradient of L in the representation (32):

$$\big( x^{(r;1)}, \ldots, x^{(r;d_r)} \big)^{\!\top} \;=\; G_r^{+}(\xi)\; \nabla_r L(\xi), \qquad r \in N. \tag{33}$$

Here, G_r^+(ξ) denotes the Moore-Penrose inverse of the matrix G_r(ξ) defined by (25) and (26). (It reduces to the usual matrix inverse whenever G_r(ξ) has maximal rank.) Note that Theorem 1 as well as its Corollary 1 can equally be applied to the recognition model Q_{H|V} defined by (21). In Section 2.2 we have studied natural objective functions that involve both the generative and the recognition model, and highlighted their locality properties. Together with the locality of the corresponding Fisher information matrices, these properties allow us to evaluate a natural gradient version of the wake-sleep algorithm, referred to as the natural wake-sleep algorithm in [31].
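Corollary 1 can be sketched numerically: for a block-diagonal matrix, the Moore-Penrose inverse acts block by block, so the natural gradient coordinates can be computed unit-wise. The block sizes and matrices below are arbitrary illustrative choices, made well conditioned by adding the identity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unit-wise Fisher blocks G_r with sizes d_r = 2, 3, 2.
blocks = []
for d_r in (2, 3, 2):
    A = rng.standard_normal((d_r, d_r))
    blocks.append(A @ A.T + np.eye(d_r))     # symmetric positive definite

n = sum(b.shape[0] for b in blocks)
euclid = rng.standard_normal(n)              # a Euclidean gradient vector

# Blockwise natural gradient, cf. (33): x^r = G_r^+ (dL/dxi^r)
out, k = [], 0
for G_r in blocks:
    d_r = G_r.shape[0]
    out.append(np.linalg.pinv(G_r) @ euclid[k:k + d_r])
    k += d_r
blockwise = np.concatenate(out)

# Agrees with applying the pseudo-inverse of the full block-diagonal matrix.
full_G = np.zeros((n, n))
k = 0
for G_r in blocks:
    d_r = G_r.shape[0]
    full_G[k:k + d_r, k:k + d_r] = G_r
    k += d_r
full = np.linalg.pinv(full_G) @ euclid
print(np.allclose(blockwise, full))   # True
```

The cost of the blockwise route scales with the sizes of the individual blocks rather than with the full parameter dimension, which is the practical content of the locality result.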
The prime objective function to be optimised is typically defined on the projected model $M_V$. It naturally carries the Fisher-Rao metric of $P_V$, so that we can define the natural gradient of the given objective function directly on $M_V$. On the other hand, we have seen that the Fisher information matrix on the full model $M \subseteq P_{V,H}$ has a block structure associated with the underlying network. This implies useful locality properties of the natural gradient and thereby makes the method applicable within the context of deep learning. The main problem that we are now going to study is the following: Can we extend the locality of the natural gradient on the full model $M$, as stated in Corollary 1, to the natural gradient on the projected model $M_V$? In the following section, we first study this problem in the more general setting of Riemannian manifolds.

The general problem
We now develop a more general perspective, which we motivate by analogy with the context of the previous sections. Assume that we have two Riemannian manifolds $(Z, g^Z)$ and $(X, g^X)$ with dimensions $d_Z$ and $d_X$, respectively, and a differentiable map $\pi : Z \to X$ with differential $d\pi_p : T_p Z \to T_{\pi(p)} X$ in $p$. The manifold $Z$ corresponds to the manifold of (strictly positive) distributions on the full set of units, the visible and the hidden ones. The map $\pi$ plays the role of the marginalisation map, which marginalises out the hidden units and which we will interpret in Sect. 3.2 as one instance of a more general coarse graining procedure. Typically, we have a model $M \subseteq Z$ which corresponds to the joint distributions on the full system that can be represented by a network. It is obtained in terms of a parametrisation $\varphi : \Xi \to Z$, $\xi \mapsto p_\xi$, where $\Xi$ is a differentiable manifold, usually an open subset of $\mathbb{R}^d$. In general, $M$ will not be a sub-manifold of $Z$ and can contain various kinds of singularities. We restrict attention to the non-singular points of $M$.
Note that $k$ is the local dimension of $M$ in $p$, which is upper bounded by the dimension $d$ of $\Xi$. We denote the set of non-singular points of $M$ by $\mathrm{Smooth}(M)$. If a point $p \in M$ is not non-singular, it is called a singularity or a singular point of $M$ (for more details, see [32]). In a non-singular point $p$, the tangent space $T_p M$ is well defined. Throughout this article, we will assume that the parametrisation $\varphi$ of $M$ is a proper parametrisation in the sense that for all $p \in \mathrm{Smooth}(M)$ and all $\xi \in \Xi$ with $\varphi(\xi) = p$, the image of the differential $d\varphi_\xi$ coincides with the full tangent space $T_p M$. This assumption is required, but often not explicitly stated, when dealing with the natural gradient method for optimisation on parametrised models. More precisely, when we interpret the Fisher information matrix (2) as a "coordinate representation" of the Fisher-Rao metric, we implicitly assume that the vectors $\partial_i(\xi) = \partial p_\xi / \partial \xi^i$, $i = 1, \dots, d$, span the tangent space of the model in $p_\xi$. Note that linear independence, which would ensure the non-degeneracy of the Fisher information matrix, is not required; it would in fact be too restrictive, given that overparametrised models play an important role within the field of deep learning.
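The role of the Fisher information matrix as the Gram matrix of the (possibly linearly dependent) tangent vectors $\partial_i(\xi)$ can be illustrated numerically. The sketch below uses a hypothetical overparametrised toy family, an example chosen only for this illustration, in which the matrix is singular but still well defined; derivatives are taken by finite differences:

```python
import math

def fisher_matrix(p_of, xi, eps=1e-6):
    """Fisher information matrix G_ij(xi) = sum_z d_i p(z) d_j p(z) / p(z),
    with derivatives taken by central finite differences."""
    d, p = len(xi), p_of(xi)
    grads = []
    for i in range(d):
        up = list(xi); up[i] += eps
        dn = list(xi); dn[i] -= eps
        pu, pd = p_of(up), p_of(dn)
        grads.append([(pu[z] - pd[z]) / (2 * eps) for z in range(len(p))])
    return [[sum(grads[i][z] * grads[j][z] / p[z] for z in range(len(p)))
             for j in range(d)] for i in range(d)]

# Overparametrised Bernoulli model: p(1) depends only on the sum xi1 + xi2,
# so the model is one-dimensional but has two parameters.
def p_of(xi):
    q = 1.0 / (1.0 + math.exp(-(xi[0] + xi[1])))
    return [1.0 - q, q]

G = fisher_matrix(p_of, [0.3, -0.1])
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]  # vanishes: G is degenerate
```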
We now consider a smooth function $L : X \to \mathbb{R}$ and study its gradient on $X$ (with respect to $g^X$) in relation to the corresponding gradient of $L \circ \pi : Z \to \mathbb{R}$ on $Z$. We have the following proposition, where we use the somewhat simpler notation $\langle \cdot, \cdot \rangle$ for both metrics, $g^Z$ and $g^X$.

Proposition 1 Consider a model M in Z and a differentiable map π : Z → X and let p be a non-singular point of M. Assume that the following compatibility condition is satisfied:
Then, for all smooth functions $L : X \to \mathbb{R}$, Eq. (36) holds, where $\Pi$ denotes the projection of tangent vectors in $T_{\pi(p)} X$ onto $d\pi_p(T_p M)$.

Proof This follows from the defining property of the gradient together with the compatibility condition (35); the resulting computation proves Eq. (36).
As stated above, the parametrised model $M$ plays the role of the distributions on the full network, consisting of the visible and hidden units. We want to relate this model to the projected model $S := \pi(M)$. The composition of the parametrisation $\varphi$ and the projection $\pi$ serves as a parametrisation $\xi \mapsto \pi(p_\xi)$ of $S$, as shown in the following diagram.
The map $\pi \circ \varphi$ is a proper parametrisation of $S$ if for all $q \in \mathrm{Smooth}(S)$ and all $\xi$ with $\pi(p_\xi) = q$, the image of the differential $d(\pi \circ \varphi)_\xi$ coincides with the full tangent space $T_q S$. Obviously, this does not follow from the assumption that $\varphi$ is a proper parametrisation of $M$ and requires further assumptions. One necessary, but not sufficient, condition is (38), namely that $d\pi_p T_p M = T_{\pi(p)} S$; we call a point $p$ with this property admissible. We have the following implication of Proposition 1.

Theorem 2 Consider a model $M$ in $Z$ and a differentiable map $\pi : Z \to X$ with image $S = \pi(M)$. Furthermore, assume that the compatibility condition (35) is satisfied in an admissible point $p \in M$. Then Eq. (39) holds for all smooth functions $L : X \to \mathbb{R}$.
Proof This follows directly from (36). For an admissible point $p$ we have $d\pi_p T_p M = T_{\pi(p)} S$. Therefore, the RHS of (36) reduces to the orthogonal projection of $\mathrm{grad}^X_{\pi(p)} L$ onto $T_{\pi(p)} S$, which equals $\mathrm{grad}^S_{\pi(p)} L$.
Note that, if we do not assume (38), we have to replace the RHS of (39) by $\Pi\,\mathrm{grad}^S_{\pi(p)} L$, where $\Pi$ denotes the projection of tangent vectors in $T_{\pi(p)} S$ onto $d\pi_p(T_p M)$. Therefore, it can well be the case that the gradient on $M$ vanishes in a point $p$ while the corresponding gradient on $S$, that is $\mathrm{grad}^S_{\pi(p)} L$, does not. Such a point $p$ is referred to as a spurious critical point (see [30]). In addition to the problem of singularities of $M$ and $S = \pi(M)$, this represents another problem with gradient methods for the optimisation of smooth functions on parametrised models. However, the problem of spurious critical points does not appear if we are dealing with a proper parametrisation $\varphi$ of $M$ for which $\pi \circ \varphi$ is also a proper parametrisation of $S$.
We conclude this section by addressing the following problem: If we assume that the compatibility condition (35) is satisfied for a model $M$ in $Z$, what can we say about the corresponding compatibility for a sub-model $M'$ of $M$? In general, we cannot expect (35) to hold for $M'$ as well. The following theorem characterises those sub-models $M'$ of $M$ for which this is satisfied, so that Theorem 2 will also hold for them.

Theorem 3 Assume that (35) holds for a model $M$ in $Z$, and consider a sub-model $M'$ of $M$. Then (35) holds for $M'$ if and only if $M'$ satisfies the condition (40).
This theorem is a direct implication of Lemma 1 below, which reduces the problem to a simple setting of linear algebra. Let $(F, \langle\cdot,\cdot\rangle_F)$ and $(G, \langle\cdot,\cdot\rangle_G)$ be two finite-dimensional real Hilbert spaces, and let $T : F \to G$ be a linear map. We can decompose $F$ into a "vertical component" $F_V := \ker T$ and its orthogonal complement $F_H$ in $F$, the corresponding "horizontal component". Now let $E$ be a linear subspace of $F$, equipped with the induced inner product $\langle\cdot,\cdot\rangle_E$, and consider the restriction $T_E : E \to G$ of $T$ to $E$. Denoting by $\perp_E$ and $\perp_F$ the orthogonal complements in $E$ and $F$, respectively, we can decompose $E$ analogously into a vertical component $E_V := \ker T_E$ and a horizontal component $E_H$. Note that, while we always have $E_V \subseteq F_V$, the corresponding inclusion $E_H \subseteq F_H$ need not hold.

Lemma 1 Assume:
Then the following two statements, (42) and (43), about a subspace $E$ of $F$ are equivalent.

Proof Let us first assume that (43) holds true. This implies $E_H \subseteq F_H$. For all $A, B \in E_H \subseteq F_H$, (41) then takes the form (42). In order to prove the opposite implication, we assume that (43) does not hold for $E$. This means that the subspace $Q$ appearing in (43) is a proper subspace of $E$. We denote the orthogonal complement of $Q$ in $E$ by $R$ and choose a non-trivial vector $A$ in $R$. Such a vector can be uniquely decomposed as a sum of two non-trivial vectors $A_1 \in F_H$ and $A_2 \in F_V$. This implies that (42) does not hold for the subspace $E$.

A new interpretation of Chentsov's theorem
We now come back to the context of probability distributions, but take a slightly more general perspective than in Section 1.2. We interpret $X_V$ as a coarse graining of the set $X_V \times X_H$. Replacing the Cartesian product $X_V \times X_H$ by a general set $Z$, a coarse graining of $Z$ is an onto mapping $X : Z \to X$, which partitions $Z$ into the atoms $Z_x := X^{-1}(x)$. This defines the corresponding push-forward map $X_*$. The tangent space decomposes into the vertical space $V_p$ and its orthogonal complement $H_p$ with respect to the Fisher-Rao metric in $p$ (note that $V_p$ is independent of $p$). Given a vector $A = \sum_z A(z)\,\delta_z \in T(Z)$, we can decompose it uniquely as $A = A_H + A_V$ with $A_H \in H_p$ and $A_V \in V_p$. Examining the inner product of two vectors $A, B \in H_p$ shows that the compatibility condition (35) is satisfied. Theorem 2 then implies that for all smooth functions $L : P(X) \to \mathbb{R}$ and all $p \in P(Z)$, the equality of gradients (53) holds (note that all points $p \in P(Z)$ are admissible), where $P(Z)$ and $P(X)$ are equipped with the respective Fisher-Rao metrics. Even though this is a simple observation, it highlights an important point. A coarse graining is generally associated with a loss of information, which is expressed by the monotonicity of the Fisher-Rao metric. This information loss is maximal when we project from the full space $P(Z)$ onto $P(X)$. Nevertheless, the gradient of any function $L$ that is defined on $P(X)$ is not sensitive to this information loss. In order to study models $M$ in $P(Z)$ with the same invariance of gradients, we have to impose the condition (40), which here takes the form (54).

Definition 1 If a model $M \subseteq P(Z)$ satisfies the condition (54) in $p$, we say that it is cylindrical in $p$. If it is cylindrical in all non-singular points, we say that it is (pointwise) cylindrical.
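The gradient invariance (53) on the full simplex can be checked numerically. In ambient coordinates, the Fisher-Rao gradient of $L$ at $p$ is $\mathrm{grad}(x) = p(x)\big(\partial L/\partial p(x) - \sum_y p(y)\,\partial L/\partial p(y)\big)$; the sketch below (toy sets and a toy function $L$, all numbers illustrative) verifies that the push-forward of the gradient on $P(Z)$ equals the gradient computed directly on $P(X)$:

```python
def fr_gradient(p, dL):
    """Fisher-Rao gradient of L on the interior of the simplex, in the
    ambient coordinates: grad(x) = p(x) * (dL(x) - sum_y p(y) dL(y))."""
    mean = sum(pi * gi for pi, gi in zip(p, dL))
    return [pi * (gi - mean) for pi, gi in zip(p, dL)]

# Coarse graining X: {0,1,2,3} -> {0,1}, atoms Z_0 = {0,1}, Z_1 = {2,3}.
X = [0, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.4]
q = [p[0] + p[1], p[2] + p[3]]              # push-forward X_* p

# A toy function L on P(X), here L(q) = q0^2 + 3*q1, with partials dL.
dL = [2 * q[0], 3.0]
# Gradient of L∘X_* on P(Z): the partial w.r.t. p(z) is dL[X[z]].
grad_Z = fr_gradient(p, [dL[X[z]] for z in range(4)])
# Push-forward of the gradient equals the gradient computed on P(X).
pushed = [grad_Z[0] + grad_Z[1], grad_Z[2] + grad_Z[3]]
grad_X = fr_gradient(q, dL)
```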
Of particular interest are cylindrical models with a trivial vertical component. These are the models for which the coarse graining $X$ is a minimal sufficient statistic. They have been used by Chentsov [12] in order to characterise the Fisher-Rao metric (see Theorem 4). To be more precise, we need to revisit Markov kernels from a geometric perspective. We consider the space of linear maps from $\mathbb{R}^Z$ to $\mathbb{R}^X$, which is canonically isomorphic to $(\mathbb{R}^Z)^* \otimes \mathbb{R}^X$, and define the polytope of Markov kernels by the conditions $k(z|x) \geq 0$ for all $x, z$, and $\sum_z k(z|x) = 1$ for all $x$.
The set $P(Z)$ of probability vectors is a subset of this polytope, where each vector $p$ is identified with the kernel $k(z|x) := p(z)$. We now consider a Markov kernel $K$ that is coupled with the coarse graining $X : Z \to X$ in the sense that $k(z|x) > 0$ if and only if $z \in Z_x$. Such a Markov kernel is called $X$-congruent. It defines an embedding $K_* : P(X) \to P(Z)$, referred to as an $X$-congruent Markov morphism. We obviously have $X_* \circ K_* = \mathrm{id}_{P(X)}$. The image of $K_*$, which we denote by $M(K)$, is the relative interior of the simplex with the extreme points $K_*(\delta_x)$, $x \in X$. Comparing Eq. (55) with (53), we observe that the gradient on the LHS is now evaluated on $M(K)$, with respect to the induced Fisher-Rao metric; the gradient on the RHS remains as it is. A simple calculation with the differential of $K_*$ shows that $K_*$ is an isometric embedding (see Fig. 5). The invariance (55) of gradients also follows directly from the invariance (56) of inner products. In fact, $K_*$ being an isometric embedding is equivalent to (55) (for details, see the proof of Theorem 5). A fundamental result of Chentsov [12] characterises the Fisher-Rao metric as the only invariant metric (see also [8]).
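The isometry (56) can likewise be verified numerically for a small $X$-congruent kernel (the sets and numbers below are illustrative):

```python
def fr_inner(A, B, p):
    """Fisher-Rao inner product of tangent vectors A, B at p."""
    return sum(a * b / pi for a, b, pi in zip(A, B, p))

# Coarse graining X: {0,1,2} -> {0,1}, atoms Z_0 = {0}, Z_1 = {1,2},
# and an X-congruent kernel k(.|x) supported exactly on the atom Z_x.
X = [0, 1, 1]
k = {0: [1.0, 0.0, 0.0], 1: [0.0, 0.4, 0.6]}

def K_push(v):
    """Push-forward under K_*: the same formula acts on points and on
    tangent vectors, v(z) = k(z|X(z)) * v(X(z))."""
    return [k[X[z]][z] * v[X[z]] for z in range(3)]

q = [0.25, 0.75]
A, B = [1.0, -1.0], [0.5, -0.5]        # tangent vectors: entries sum to zero
lhs = fr_inner(K_push(A), K_push(B), K_push(q))   # inner product on M(K)
rhs = fr_inner(A, B, q)                            # inner product on P(X)
```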

Theorem 4 (Chentsov's theorem)
Assume that for any non-empty finite set $S$, $P(S)$ is equipped with a Riemannian metric $g^{(S)}$, such that the following is satisfied: whenever we have a coarse graining $X : Z \to X$ and an $X$-congruent Markov morphism $K_* : P(X) \to P(Z)$, the invariance (56) holds, interpreted as a condition for $g^{(X)}$ and $g^{(Z)}$. Then there exists a positive real number $\alpha$ such that for all $S$, the metric $g^{(S)}$ coincides with the Fisher-Rao metric multiplied by $\alpha$.
In order to compute the gradient of a function on an extended space that is equivalent to the actual gradient, we want to use Eq. (39) of Theorem 2. Instances of this equivalence are given by Eqs. (53) and (55), where we considered two extreme cases, the full model $P(Z)$ and the model $M(K)$, respectively, which both project onto $P(X)$. We know that Theorem 2 also holds for all cylindrical models $M$, including, but not restricted to, the intermediate cases $M(K) \subseteq M \subseteq P(Z)$. How flexible are we here with the choice of the metric? In fact, a reformulation of Chentsov's uniqueness result, Theorem 4, identifies the Fisher-Rao metric as the only metric for which Eq. (39) holds.

Theorem 5
Assume that for any non-empty finite set $S$, $P(S)$ is equipped with a Riemannian metric $g^{(S)}$. Then the following properties are equivalent: 1. For every coarse graining $X : Z \to X$ and every cylindrical model $M \subseteq P(Z)$, the equality of gradients (57) holds, where the gradient on the LHS is evaluated with respect to the restriction of $g^{(Z)}$ and the RHS is evaluated with respect to the restriction of $g^{(X)}$. 2. There exists a positive real number $\alpha$ such that for all $S$, the metric $g^{(S)}$ coincides with the Fisher-Rao metric multiplied by $\alpha$.
Proof "(1) ⇒ (2):" Let $X : Z \to X$ be a coarse graining, and let $K_* : P(X) \to P(Z)$ be an $X$-congruent Markov morphism. We consider the model $M(K)$, as a special instance of a cylindrical model $M$ in $P(Z)$. In that case, (57) is equivalent to the invariance (55). We choose $p \in P(X)$ and $A, B \in T(X)$, and represent $A$ as the gradient of a suitable function $L$; evaluating both sides of (55) against $B$ then proves the invariance (56). According to Chentsov's uniqueness result, Theorem 4, this invariance characterises the Fisher-Rao metric up to a constant $\alpha > 0$. "(2) ⇒ (1):" This follows from the compatibility (52), which holds for the Fisher-Rao metric, and from Theorems 2 and 3.

Cylindrical extensions of a model
Throughout this section, we consider a model $M$, together with a proper parametrisation, and an extended model $\tilde{M}$ that contains $M$, has the same $X_*$-projection as $M$, and is cylindrical (conditions (61) (a)-(c)). We refer to such a model as a cylindrical extension of $M$. Before we come to the explicit construction of cylindrical extensions, let us first demonstrate their direct use for relating the respective natural gradients to each other. Given an admissible point $p \in M$ that is also admissible in $\tilde{M}$, we can decompose the tangent space $T_p \tilde{M}$ into the sum $T_p M \oplus T_p^{\perp} M$, where the second summand is the orthogonal complement of the first one in $T_p \tilde{M}$. We can use this decomposition in order to relate the natural gradient of a smooth function $L$ defined on the projected model $M_X$ to the natural gradient of $L \circ X_*$, as expressed in Eq. (62). (Here "$\parallel$" stands for the projection onto $T_p M$ and "$\perp$" stands for the projection onto the corresponding orthogonal complement in $T_p \tilde{M}$.) The difference between the natural gradient on the full model $\tilde{M}$ and the natural gradient on the coarse grained model $M_X$ is given by $\mathrm{grad}^{\perp}_p (L \circ X_*)$, which vanishes when $M$ itself is already cylindrical. Thus, the equality (62) generalises (57).
The product extension I

Given a non-singular point $p_\xi = \sum_z p(z;\xi)\,\delta_z$ of $M$, the tangent space in $p_\xi$ is spanned by the vectors $\partial_i(\xi)$. Now, consider the projection of $p_\xi$ onto $P(X)$ in terms of $X_*$. Assuming that this projected point is a non-singular point of $M_X = X_*(M)$, the corresponding tangent space is spanned by the vectors $\hat\partial_i^H(\xi)$ of (64). In addition to the described projection of $p_\xi$ onto the "horizontal" space, leading to $M_X$, we can also project it onto the "vertical" space. In order to do so, we define a Markov kernel $K_\xi = \sum_{x,z} p(z|x;\xi)\,\delta_z \otimes e_x$. We denote the image of the map $\xi \mapsto K_\xi$ by $M_{Z|X} \subseteq K(Z|X)$, and assume that $K_\xi$ is a non-singular point of $M_{Z|X}$. The corresponding tangent vectors in $K_\xi$ are the vectors $\hat\partial_i^V(\xi)$ of (66). Note that for all three sets of vectors, $\partial_i(\xi)$, $\hat\partial_i^H(\xi)$, and $\hat\partial_i^V(\xi)$, $i = 1, \dots, d$, linear independence is not required. In fact, it is important to include overparametrised systems in the analysis, where linear independence is not given. Now we can define the product extension $M^I$ of $M$ as follows: For each pair $(\xi, \xi') \in \Xi \times \Xi$, we define $p_{\xi,\xi'} = p(\cdot\,;\xi,\xi')$ by combining the marginal determined by $\xi$ with the conditionals determined by $\xi'$, that is, $p(z;\xi,\xi') = p(X(z);\xi)\,p(z\,|\,X(z);\xi')$. The product extension is then simply the set of all points that can be obtained in this way. Obviously, $M$ consists of those points in $M^I$ that are given by identical parameters, that is $\xi = \xi'$, which proves (61) (a). Furthermore, $X_*(p_{\xi,\xi'}) = X_*(p_\xi)$, and therefore this extension has the same projection as the original model $M$, so that (61) (b) is satisfied. The last requirement for $M^I$ to be a cylindrical extension of $M$, (61) (c), will be proven below in Proposition 2. We obtain the tangent space of $M^I$ in $p_{\xi,\xi'}$ by taking the derivatives with respect to $\xi^1, \dots, \xi^d$ and $\xi'^1, \dots, \xi'^d$, respectively. A comparison with (64) shows that we have a natural isometric correspondence of the horizontal directions, obtained by mapping $\delta_x$ to $\sum_{z \in Z_x} p(z|x;\xi')\,\delta_z$ (this map is given by the $X$-congruent Markov morphism discussed above; see also Fig. 5).
Now we consider the vertical directions: a comparison with (66) shows that we also have a natural correspondence, obtained by mapping $\delta_z \otimes e_x$ to $p(x;\xi)\,\delta_z$, in addition to the above-mentioned correspondence (69). This proves that $(\xi,\xi') \mapsto p_{\xi,\xi'}$ is a proper parametrisation of $M^I$. The situation is illustrated in Fig. 6. Now we consider the natural Fisher-Rao metric on $M^I \subseteq P(Z)$ in $p_{\xi,\xi'}$, assuming that all points associated with $(\xi,\xi')$ are non-singular. It follows from Proposition 2 below that $\langle \partial_i^H(\xi,\xi'), \partial_j^V(\xi,\xi') \rangle_{\xi,\xi'} = 0$ for all $i, j$, where $\langle\cdot,\cdot\rangle_{\xi,\xi'}$ denotes the Fisher-Rao metric in $p_{\xi,\xi'}$. For the inner products of the horizontal vectors one obtains expressions that do not depend on $\xi'$. Together with the inner products of the vertical vectors, this defines two matrices, and the Fisher information matrix $G(\xi,\xi')$ with respect to the product coordinate system is a block diagonal matrix built from them.
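The orthogonality of horizontal and vertical directions can be checked numerically. The block below uses a hypothetical two-atom toy family of product form $p(z;\xi,\xi') = p(X(z);\xi)\,p(z\,|\,X(z);\xi')$, with logistic parametrisations chosen only for this illustration:

```python
import math

# Toy product-extension family on Z = {0,1,2,3} with atoms
# Z_0 = {0,1}, Z_1 = {2,3}: p(z; xi, xi2) = p(x; xi) * p(z|x; xi2).
X = [0, 0, 1, 1]

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_joint(xi, xi2):
    px = [sigm(xi), 1.0 - sigm(xi)]          # marginal: depends on xi only
    pzx = {0: [sigm(xi2), 1.0 - sigm(xi2)],  # conditionals: depend on xi2 only
           1: [sigm(2 * xi2), 1.0 - sigm(2 * xi2)]}
    return [px[0] * pzx[0][0], px[0] * pzx[0][1],
            px[1] * pzx[1][0], px[1] * pzx[1][1]]

def tangent(f, t, eps=1e-6):
    """Central finite-difference derivative of a curve of distributions."""
    up, dn = f(t + eps), f(t - eps)
    return [(u - d) / (2 * eps) for u, d in zip(up, dn)]

xi, xi2 = 0.4, -0.7
p = p_joint(xi, xi2)
dH = tangent(lambda t: p_joint(t, xi2), xi)    # "horizontal" direction
dV = tangent(lambda t: p_joint(xi, t), xi2)    # "vertical" direction
inner = sum(a * b / pi for a, b, pi in zip(dH, dV, p))  # Fisher-Rao product
```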
In order to compute the gradient of a function $L : M^I \to \mathbb{R}$, we have to consider the pseudoinverse of $G(\xi,\xi')$; with the Euclidean gradient $\nabla_{\xi,\xi'} L = (\nabla_\xi L, \nabla_{\xi'} L)$, we obtain the representation (75).

The product extension II

Consider a pair $(\xi, \eta) \in \Xi \times H$ such that all points associated with it are non-singular points of the respective models. For the horizontal and vertical vectors we obtain expressions analogous to (68) and (70). The correspondence (69) of horizontal vectors translates to this setting, this time by mapping $\delta_x$ to $\sum_{z \in Z_x} q(z|x;\eta)\,\delta_z$. Furthermore, we obtain the generalisation of (71) by mapping $\delta_z \otimes e_x$ to $p(x;\xi)\,\delta_z$. The situation is illustrated in Fig. 7. We now consider the Fisher-Rao metric on $M^{II} \subseteq P(Z)$ in $p_{\xi,\eta}$. It follows again from Proposition 2 below that $\langle \partial_i^H(\xi,\eta), \partial_j^V(\xi,\eta) \rangle_{\xi,\eta} = 0$ for all $i, j$, where $\langle\cdot,\cdot\rangle_{\xi,\eta}$ denotes the Fisher-Rao metric in $p_{\xi,\eta}$; the inner products of the horizontal and the vertical vectors can be computed accordingly. The gradient of a function $L$ on $M_X$ is given in terms of (75), the formula that we already obtained for the previous product extension, where we have to replace $\xi'$ by $\eta$. In order to be more explicit, consider the parametrisation of $M$ which is naturally embedded in $M^{II}$; the tangent vectors are then obtained by the chain rule. (b) Clearly, from (a) we obtain $X_*(M) \subseteq X_*(M^{II})$. To prove the opposite inclusion, we consider a point $p_{\xi,\eta} \in M^{II}$ and show that the point $p_\xi \in M$ has the same $X_*$-projection. (c) We first show that the horizontal vectors are orthogonal to the vertical ones.

The model $M$ consists of joint distributions in $P_{V,H}$, but the objective function $L$ only depends on the probability distribution of the visible states, giving rise to a projected model $M_V \subseteq P_V$. Both geometric objects, $M$ and $M_V$, carry a natural geometry inherited from the respective ambient space. In Sect. 2 we studied various locality properties of the natural gradient based on the first geometry, thereby assuming a factorisation of the elements of $M$ according to a directed acyclic graph.
These properties simplify the Fisher information matrix of $M$ and allow us to apply the natural gradient method to deep networks. The second geometry, the geometry of $M_V$, was studied in Sect. 3, where we took a somewhat more general perspective. In what follows, we restate the general problem of comparing the two geometries within that perspective and summarise the corresponding results.
Consider a model $S$ in the set $P(X)$ of probability distributions on a finite set $X$, and a smooth function $L : P(X) \to \mathbb{R}$. The task is to optimise $L$ on $S$ in terms of the natural gradient $\mathrm{grad}^S L$. With no further assumptions, this can be a very difficult problem. Typically, however, $S$ is obtained as the image of a simpler model $M$ of probability distributions on a larger set $Z$. More precisely, we consider a surjective map $X : Z \to X$ and the corresponding push-forward map $X_* : P(Z) \to P(X)$ of probability measures. The model $S$ is then nothing but the $X_*$-image of $M$, that is $S = M_X = X_*(M)$. Now, instead of optimising $L$ on $M_X$, we can optimise $L \circ X_*$ on $M$ and aim to simplify the problem by exploiting the structure of $M$. This works to some extent. Even though the two problems are closely related, the corresponding gradient fields, $dX_*\,\mathrm{grad}^M(L \circ X_*)$ and $\mathrm{grad}^{M_X} L$, typically differ from each other. Thus, the optimisation of $L$ on $M_X$, based on the Fisher-Rao metric on $P(X)$, and the optimisation of $L \circ X_*$ on $M$, based on the Fisher-Rao metric on $P(Z)$, are not equivalent. We can try to improve the situation by replacing the Fisher-Rao metric on $M$ and $M_X$, respectively, by different Riemannian metrics. While this might be a reasonable approach for simplifying the problem, from the information-geometric perspective the Fisher-Rao metric is the most natural one, which is the reason for referring to the Fisher-Rao gradient as the natural gradient. This is directly linked to the invariance of gradients, as we have highlighted in this article. If we request invariance of the gradients for all coarse grainings $X : Z \to X$, all models $M \subseteq P(Z)$ from a particular class, and all smooth functions $L : M_X \to \mathbb{R}$, then, by Chentsov's classical characterisation theorem, we have to impose the Fisher-Rao metric on the individual models (see Theorem 5). Even then, the invariance of gradients is satisfied only if the model is cylindrical in the sense of Definition 1.
Given a model $M$ that is not cylindrical, we have proposed cylindrical extensions $\tilde{M}$ which contain $M$. The natural gradient of $L$ on $M_X$ is then equivalent to the natural gradient of $L \circ X_*$ on such an extension $\tilde{M}$.
As an outlook, we want to touch upon the following two related problems: 1. Can we exploit the simplicity of the original model $M$ in order to simplify the optimisation on the extension $\tilde{M}$? 2. The original model $M$ is associated with some network. What kind of network can we associate with the extended model $\tilde{M}$?
We want to briefly address these problems within the context of Section 2, where $X = X_V$, $Z = X_V \times X_H$, and $X : (v,h) \mapsto v$. As the cylindrical extension $M^{II}$ suggests, it can be associated with the addition of a recognition model $Q_{H|V}$, assuming that $M$ is a generative model. If both models are parametrised by (7) and (21), respectively, then the corresponding Fisher information matrices simplify as stated in Theorem 1. They both have a block structure where each block corresponds to one unit; outside of these blocks, the matrices are filled with zeros. To be more precise, we consider all parameters that correspond to unit $r$: the parameters $\xi_r = (\xi_{(r;1)}, \dots, \xi_{(r;d_r)})$ of the generative model $M$, and the parameters $\eta_r = (\eta_{(r;1)}, \dots, \eta_{(r;d_r)})$ of the recognition model $Q_{H|V}$. With (85) we then obtain (89). We know that $g_{(r;i)(s;j)}(\xi) = 0$ if $r \neq s$, and $g^V_{(t;k)(u;l)}(\xi, \eta(\xi)) = 0$ if $t \neq u$. With the latter property, the sum on the RHS of (89) reduces to (90). If all partial derivatives $\partial \eta_{(t;k)}/\partial \xi_{(r;i)}(\xi)$ are local in the sense that they vanish whenever $t \neq r$, then the matrix $G^H(\xi)$ inherits the block structure of the matrices $G(\xi)$ and $G^V(\xi, \eta(\xi))$. However, this is typically not the case, which suggests an additional coupling between the generative model and the recognition model. Without such a coupling, the partial derivatives in (90) will "overwrite" the block structure of the matrix $G(\xi)$, leading to a non-local matrix $G^H(\xi)$ with $g^H_{(r;i)(s;j)}(\xi) \neq 0$ even if $r \neq s$. The degree of non-locality depends on the specific properties of the partial derivatives $\partial \eta_{(t;k)}/\partial \xi_{(r;i)}(\xi)$.
We conclude this article by revisiting the wake-sleep algorithm of Sect. 2.2. Let us assume that (89) and (90) imply a sufficient simplification so that a natural gradient step in $M^{II}$ can be made. This will update the generation parameters, say from $\xi$ to $\xi + \Delta\xi$, and leave the recognition parameters $\eta$ unchanged. (The situation is illustrated in Fig. 8.) Such an update corresponds to a natural gradient version of the wake step. The resulting point $(\xi + \Delta\xi, \eta)$ in $M^{II}$ will typically be outside of $M$. As the simplification through (89) and (90) only holds on $M$, we have to update the recognition parameters, say from $\eta$ to $\eta + \Delta\eta$, so that the resulting point $(\xi + \Delta\xi, \eta + \Delta\eta)$ is again in $M$. This sleep step ensures that the next update of the generation parameters benefits from the simplicity of the Fisher information matrix.
Note that it is irrelevant how we get back to $M$ within the sleep step, as long as we do not change the generation parameters. Also, several sleep steps might be required until we get back to $M$, which highlights the asymmetry of time scales of the two phases. This asymmetric version has been outlined and discussed in the context of the em-algorithm in [18]. The overall wake-sleep step will typically not follow the gradient of an objective function on $M^{II}$. However, this is not the aim here. The prime process is the process in $\xi$, which parametrises $M_V$. Effectively, the outlined version of the wake-sleep algorithm follows the natural gradient of the objective function with respect to the geometry of $M_V$. The natural wake-sleep algorithm with respect to the geometry of $M$ has recently been studied in [31].
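The alternating scheme just described can be sketched in a few lines of control flow. Everything below (function names, the toy objective, the coupling map) is hypothetical and only illustrates the wake/sleep alternation, not the paper's concrete update rules:

```python
def natural_wake_sleep(xi, eta, nat_grad, couple, lr=0.1, steps=50):
    """Sketch of the alternating scheme: a wake step follows the
    (simplified) natural gradient in the generation parameters xi, a
    sleep step re-couples the recognition parameters eta so that the
    next wake step again benefits from the simple Fisher matrix on M."""
    for _ in range(steps):
        xi = xi - lr * nat_grad(xi, eta)   # wake: update generation parameters
        eta = couple(xi)                   # sleep: return to M (possibly iterated)
    return xi, eta

# Toy instance: quadratic objective in xi with minimum at 1.0; the
# coupling eta = xi keeps the pair on the "diagonal" model M.
xi, eta = natural_wake_sleep(
    xi=5.0, eta=0.0,
    nat_grad=lambda xi, eta: 2.0 * (xi - 1.0),   # stands in for G^+ ∇L
    couple=lambda xi: xi,
)
```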
In Sect. 2.2 we introduced the recognition model as an auxiliary model for the evaluation of the gradient with respect to $\xi$. This work reveals another role of the recognition model in the context of the natural gradient method. It allows us to define an extension of the original model $M$ so that we can effectively apply the natural gradient method on the projected model $M_V$ within the context of deep learning. The presented results suggest criteria for the coupling between the generative model and the recognition model that would ensure the locality of the natural gradient on $M_V$. These criteria involve, on the one hand, the structure of the underlying networks (see Example 2) and, on the other hand, the parametrisation of the models. However, in this article they are formulated as sufficient conditions for the simplification of the natural gradient, based on theoretical results. The derivation of explicit design principles for correspondingly coupled models requires further studies.

We know that there is at least one solution $x(\xi)$ of (94) that represents the gradient. In the case where $G(\xi)$ has maximal rank, this solution is unique and we can simply apply the inverse of $G(\xi)$ to obtain the coefficients of the gradient as $x(\xi) = G^{-1}(\xi)\,\nabla_\xi L$. This is the usual case when we have a local (diffeomorphic) coordinate system around the point $p_\xi$.
Even though we interpret a parametrisation of a model as a coordinate system, the number of parameters often exceeds the dimension of the model. In these cases, the matrix $G(\xi)$ will not have maximal rank, so that we have a non-trivial kernel $\ker G(\xi)$. We can always add to a solution $x(\xi)$ of (94) a vector $y(\xi)$ from that kernel and obtain another solution $x(\xi) + y(\xi)$. The affine space $A(\xi) = x(\xi) + \ker G(\xi) \subseteq \mathbb{R}^d$ of solutions describes all possible representations of the gradient in terms of $\partial_1(\xi), \dots, \partial_d(\xi)$. They are all equally adequate for describing a learning process that takes place in $M$. However, from the perspective of linear algebra there is a natural choice: the element in the affine solution space $A(\xi)$ that is orthogonal to $\ker G(\xi)$ (with respect to the canonical inner product in $\mathbb{R}^d$). This defines the Moore-Penrose inverse $G^+(\xi)$, also called the pseudoinverse. In this paper, we were concerned with a number of simplifications of the natural gradient. One simplification was expressed in terms of a block diagonal structure of the Fisher information matrix. For the representation of the natural gradient, we evaluated the pseudoinverse of that block diagonal matrix based on the simple observation that the pseudoinverse of a block diagonal matrix is the block diagonal matrix of the pseudoinverses of its blocks (see, e.g., [11] for more general results related to the pseudoinverse of a block matrix). How natural is the Moore-Penrose inverse? There are two perspectives here. On the one hand, $G^+(\xi)\,\nabla_\xi L$ is natural in the sense that it represents an object, $\mathrm{grad}^M_\xi L$, that is independent of the parametrisation. On the other hand, the inner product used for the definition of $G^+(\xi)$ is the canonical inner product in $\mathbb{R}^d$, which does not have to be related to the metric $g_\xi$ at all. In this article, we have chosen the Moore-Penrose inverse as one possible extension of the usual inverse to overparametrised models, following previous proposals (see, e.g., [29]). Note, however, that there are also other possibilities for such an extension.
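The observation about the pseudoinverse of a block diagonal matrix is easy to verify numerically, also in the rank-deficient case (concrete matrices chosen only for illustration):

```python
import numpy as np

def blockdiag_pinv(blocks):
    """Moore-Penrose inverse of a block diagonal matrix, computed block by
    block: for diag(G_1, ..., G_k) it equals diag(G_1^+, ..., G_k^+)."""
    return [np.linalg.pinv(G) for G in blocks]

G1 = np.array([[2.0, 1.0], [1.0, 2.0]])          # full rank
G2 = np.array([[1.0, 2.0], [2.0, 4.0]])          # rank deficient
zero = np.zeros((2, 2))
full = np.block([[G1, zero], [zero, G2]])
parts = blockdiag_pinv([G1, G2])
assembled = np.block([[parts[0], zero], [zero, parts[1]]])
ok = np.allclose(np.linalg.pinv(full), assembled)
```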
We have some flexibility here, which might allow us to further simplify the representation of the natural gradient in terms of a particular choice of the parametrisation.

Gibbs sampling
In this section we outline the Gibbs sampling method for the approximation of the derivative (11). Holding the configuration $x_V$ constant, we can sample from $p(x_H | x_V; \xi)$ by randomly selecting a node $s \in H$ and updating the state of that node according to $p(x_s | x_{H \setminus s}, x_V; \xi)$. After this update, we repeat choosing a node and updating its state. After many repetitions, this generates $p^*_\xi$-typical patterns. The conditional distribution is simple because, due to the local Markov property, it satisfies $p(x_s | x_{H \setminus s}, x_V; \xi) = p(x_s | x_{\mathrm{bl}(s)}; \xi)$, where $\mathrm{bl}(s)$ denotes the Markov blanket of $s$.

Example 7 (Neural networks (III)) In order to evaluate the derivatives (15) and (16) in Example 1, we have to sample from $p^*_\xi$. We use Gibbs sampling based on the expression (96).
Comparing this with the update rule (12), we observe that the full Markov blanket is involved in terms of the modulation function $a_s$.
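A generic Gibbs sweep over the hidden units, with the visible units clamped, can be sketched as follows. The conditional `cond` stands in for $p(x_s | x_{\mathrm{bl}(s)}; \xi)$ and is purely illustrative, as are all names and numbers:

```python
import random

def gibbs_sample_hidden(x_v, x_h, cond, hidden, sweeps=100, rng=random):
    """Gibbs sampling of the hidden configuration with the visible units
    clamped: repeatedly pick a hidden node s and resample its binary state
    from cond(s, x_v, x_h), which by the local Markov property only needs
    to depend on the Markov blanket of s."""
    x_h = dict(x_h)                     # do not mutate the caller's state
    for _ in range(sweeps):
        s = rng.choice(hidden)
        x_h[s] = 1 if rng.random() < cond(s, x_v, x_h) else 0
    return x_h

# Toy conditional (hypothetical): each hidden unit is biased towards the
# single visible unit's state, independently of the other hidden units.
def cond(s, x_v, x_h):
    return 0.9 if x_v[0] == 1 else 0.1

rng = random.Random(0)
sample = gibbs_sample_hidden({0: 1}, {"h1": 0, "h2": 0}, cond, ["h1", "h2"],
                             sweeps=500, rng=rng)
```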
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.