1 Introduction

Within the field of deep learning, gradient methods have become ubiquitous tools for parameter optimisation. Standard gradient optimisation procedures use the vector of coordinate derivatives of the objective function as the update direction of the parameters. This implicitly assumes a Euclidean geometry on the space of parameters, which is arguably not always the most natural choice. Instead, one can choose a geometry better suited to the problem at hand and then determine the Riemannian gradient of the objective function for this geometry, resulting in the so-called natural gradient. The natural gradient method is the optimisation algorithm that performs discrete parameter updates in the direction of the natural gradient. This method was first proposed by Amari [1] using the geometry induced by the Fisher–Rao metric. It is an active field of study within information geometry [3, 6, 10] and has been shown to be extremely effective in many applications [4, 16, 17]. More recently, other geometries on the model, such as the Wasserstein geometry, have also been studied [9, 11]. The natural gradient is defined independently of a specific parametrisation. Although it remains an open problem, there is work supporting the idea that the learning efficiency of the method is due to this invariance [18].

In practice, the update direction of the parameters is given by the ordinary gradient multiplied by the inverse of the Gram matrix associated with the metric on the model. We will refer to this vector on the parameter space as the natural parameter gradient. In order to determine whether this direction is desirable we have to map this vector to the model, since it is the location on the model, not on the parameter space, that determines the performance of the model. In non-overparametrised systems it can be shown that at a non-singular point of the model, the pushforward of the natural parameter gradient is equal to the natural gradient. Furthermore, the natural parameter gradient can be called parametrisation invariant in this case [12]. In many practical applications of machine learning, and in particular deep learning, one deals, however, with overparametrised models, in which different directions on the parameter space correspond to a single direction on the model. In this case the Gram matrix is degenerate and we use a generalised inverse to calculate the natural parameter gradient. In this paper, we investigate whether the pushforward of the natural parameter gradient remains equal to the natural gradient in the overparametrised setting.

The Moore–Penrose (MP) inverse is the canonical choice of generalised inverse for the natural parameter gradient [5]. The definition of the MP inverse is based on the Euclidean inner product on the parameter space. Using the MP inverse is therefore thought to affect the parametrisation invariance of the natural parameter gradient [13], and thus potentially the performance of the natural gradient method. In this paper we propose two different notions of invariance. The first evaluates the invariance of the natural parameter gradient by examining the behaviour of its pushforward on the model; the second looks at the behaviour on the parameter space itself. Since it is the location and direction on the model that matter, we argue that the former is of greater importance.

2 The natural gradient

Let \((\mathcal {Z}, g)\) be a Riemannian manifold, \(\Xi \) be a parameter space that we assume to be an open subset of \(\mathbb {R}^d\), \(\phi :\Xi \rightarrow \mathcal {Z}\) a smooth map (taking the role of a parametrisation), \(\mathcal {M} :=\phi (\Xi ) \subset \mathcal {Z}\) a model, and \(\mathcal {L}:\mathcal {Z}\rightarrow \mathbb {R}\) a smooth (objective) function (see Fig. 1). We call \(p \in \mathcal {M}\) non-singular if \(\mathcal {M}\) is locally an embedded submanifold of \(\mathcal {Z}\) around p and we denote the set of non-singular points with \({{\,\textrm{Smooth}\,}}(\mathcal {M})\). A point p is called singular if it is not non-singular. The Riemannian gradient of \(\mathcal {L}\) on \(\mathcal {Z}\) is defined implicitly as follows:

$$\begin{aligned} g_p\left( {{\,\textrm{grad}\,}}_p^\mathcal {Z}\mathcal {L}, \cdot \right) = d\mathcal {L}_p(\cdot ). \end{aligned}$$
(1)

By the Riesz representation theorem, this defines the gradient uniquely.

Definition 1

(Natural gradient) For \(p \in {{\,\textrm{Smooth}\,}}(\mathcal {M})\) the Riemannian gradient of \(\mathcal {L}|_\mathcal {M}\) on the model \(\mathcal {M}\) is called the natural gradient and is denoted \({{\,\textrm{grad}\,}}^\mathcal {M}_p \mathcal {L}\).

It is easy to show that:

$$\begin{aligned} {{\,\textrm{grad}\,}}^\mathcal {M}_p \mathcal {L}= \Pi _p ({{\,\textrm{grad}\,}}^\mathcal {Z}_p \mathcal {L}), \end{aligned}$$
(2)

where \(\Pi _p\) is the projection onto \(T_p\mathcal {M}\). We define the pushforward of the tangent vector on the parameter space through the parametrisation as \(\partial _{i}(\xi ) :=d\phi _\xi \left( \left. \frac{\partial }{\partial \xi ^i}\right| _\xi \right) \), and the Gram matrix \(G(\xi )\) by \(G_{ij}(\xi ) :=g_{\phi (\xi )}\left( \partial _{i}(\xi ), \partial _{j}(\xi )\right) \). We denote the vector of coordinate derivatives with \(\nabla _\xi \mathcal {L}:=\left( \partial _1(\xi ) \mathcal {L}, ..., \partial _d(\xi )\mathcal {L}\right) = \left( \frac{\partial \mathcal {L}\circ \phi }{\partial \xi ^1}(\xi ),..., \frac{\partial \mathcal {L}\circ \phi }{\partial \xi ^d}(\xi )\right) \in \mathbb {R}^d\). Let \(\xi \) be such that \(\phi (\xi ) \in {{\,\textrm{Smooth}\,}}(\mathcal {M})\). We say that a parametrisation is proper in \(\xi \) when: \({{\,\textrm{span}\,}}\left( \left\{ \partial _{1}(\xi ), ..., \partial _{d}(\xi )\right\} \right) = T_{\phi (\xi )}\mathcal {M}\). Furthermore, following the Einstein summation convention, we write \(a^i b_i\) for the sum \(\sum _i a^i b_i\).
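To make the notation concrete, the following minimal numerical sketch (a toy example of our own, with \(\mathcal {Z}= \mathbb {R}^2\), the Euclidean metric, and the overparametrised map \(\phi (\xi ) = (\xi ^1 + \xi ^2, \xi ^3)\)) computes the pushforward vectors \(\partial _i(\xi )\), the Gram matrix \(G(\xi )\) and the coordinate gradient \(\nabla _\xi \mathcal {L}\) for \(\mathcal {L}(z) = \tfrac{1}{2}\Vert z\Vert ^2\):

```python
import numpy as np

# Toy setting (our own choice): Z = R^2 with the Euclidean metric,
# overparametrised map phi(xi) = (xi^1 + xi^2, xi^3), objective L(z) = ||z||^2 / 2.
def phi(xi):
    return np.array([xi[0] + xi[1], xi[2]])

def jacobian_phi(xi):
    # Column i is the pushforward vector partial_i(xi) = d phi_xi (d/d xi^i).
    # (phi is linear here, so the Jacobian does not depend on xi.)
    return np.array([[1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def grad_L_on_Z(z):
    # Riemannian gradient of L on Z; in the Euclidean case this is the ordinary gradient.
    return z

xi = np.array([0.3, -1.2, 0.7])
J = jacobian_phi(xi)                       # shape (2, 3)
G = J.T @ J                                # Gram matrix G_ij = g(partial_i, partial_j)
nabla_xi_L = J.T @ grad_L_on_Z(phi(xi))    # coordinate derivatives via the chain rule

print(G, nabla_xi_L, np.linalg.matrix_rank(G))   # G is degenerate: rank 2 < d = 3
```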

Fig. 1 Parametrisation and objective function

Definition 2

(Generalised inverse) A generalised inverse of an \(n \times m\) matrix A, denoted \(A^+\), is an \(m \times n\) matrix satisfying the following property:

$$\begin{aligned} A A^+ A = A. \end{aligned}$$
(3)

Note that this definition implies that for \(w \in \mathbb {R}^n\) in the image of A, i.e. \(w = Av\) for some \(v \in \mathbb {R}^m\), we have:

$$\begin{aligned} AA^+ w = AA^+ A v = Av = w. \end{aligned}$$
(4)

This shows that \(A A^+\) is the identity operator on the image of A.
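As a quick sanity check of Definition 2 and Eq. (4), the following sketch verifies numerically that the Moore–Penrose inverse provided by numpy is a generalised inverse and that \(AA^+\) acts as the identity on the image of A (the matrix A here is an arbitrary example of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))            # a generic rectangular matrix
A_plus = np.linalg.pinv(A)             # the MP inverse is one particular generalised inverse

# Defining property (3) of a generalised inverse: A A^+ A = A.
assert np.allclose(A @ A_plus @ A, A)

# Consequence (4): A A^+ is the identity on the image of A.
v = rng.normal(size=6)
w = A @ v
assert np.allclose(A @ A_plus @ w, w)
```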

Definition 3

(Natural parameter gradient) We define the natural parameter gradient to be the following vector on the parameter space:

$$\begin{aligned} \widetilde{\nabla }_\xi \mathcal {L}:=\left( G^+(\xi ) \nabla _\xi \mathcal {L}\right) ^i \left. \frac{\partial {}}{\partial {\xi ^i}}\right| _\xi . \end{aligned}$$
(5)

The pushforward of this vector, given by:

$$\begin{aligned} d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}= \left( G^+(\xi ) \nabla _\xi \mathcal {L}\right) ^i \partial _i, \end{aligned}$$
(6)

is called the natural parameter gradient on \(\mathcal {M}\).

Often, the natural parameter gradient is denoted with matrix notation as follows:

$$\begin{aligned} \widetilde{\nabla }_\xi \mathcal {L}:=G^+(\xi ) \nabla _\xi \mathcal {L}, \end{aligned}$$
(7)

where an identification between the canonical basis of \(\mathbb {R}^d\) and the vectors \(\left. \frac{\partial {}}{\partial {\xi ^i}}\right| _\xi \) is made implicitly.
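Continuing the toy setting sketched after Eq. (2), the following hedged example computes the natural parameter gradient of Definition 3 and its pushforward (6), using numpy's MP pseudo-inverse as the generalised inverse of the degenerate Gram matrix:

```python
import numpy as np

J = np.array([[1.0, 1.0, 0.0],    # columns: partial_i(xi) for phi(xi) = (xi^1 + xi^2, xi^3)
              [0.0, 0.0, 1.0]])
z = np.array([-0.9, 0.7])         # phi(xi) for xi = (0.3, -1.2, 0.7)
grad_Z = z                        # Euclidean gradient of L(z) = ||z||^2 / 2 at z

G = J.T @ J                       # degenerate Gram matrix (3 x 3, rank 2)
nabla_xi = J.T @ grad_Z           # coordinate gradient

nat_param_grad = np.linalg.pinv(G) @ nabla_xi   # Eq. (7), with the MP inverse as G^+
pushforward = J @ nat_param_grad                # Eq. (6): natural parameter gradient on M
print(nat_param_grad, pushforward)
```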

We are now in the position to state the main result of the paper:

Theorem 1

Let \(\xi \in \Xi \) and \(p = \phi (\xi ) \in \mathcal {M}\). We have:

$$\begin{aligned} d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}= \Pi _\xi \left( {{\,\textrm{grad}\,}}_p^\mathcal {Z}\mathcal {L}\right) , \end{aligned}$$
(8)

where \(\Pi _\xi \) is the projection onto \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i\). In particular, when \(\phi (\xi )\) is non-singular and \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i = T_p\mathcal {M}\) we have:

$$\begin{aligned} d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}= {{\,\textrm{grad}\,}}^\mathcal {M}_p \mathcal {L}. \end{aligned}$$
(9)

This theorem implies that under certain conditions the pushforward of the natural parameter gradient is equal to the natural gradient. Furthermore we see that in general the natural parameter gradient on \(\mathcal {M}\) is dependent on the choice of parametrisation through \(\Pi _\xi \), but becomes invariant when the coordinate vectors span the full tangent space of \(\mathcal {M}\). In the next section we will study in more detail the invariance properties of the natural parameter gradient.

The proof of Theorem 1 will be based on the following result from linear algebra:

Lemma 1

Let \((V, \langle \cdot , \cdot \rangle )\) be a finite-dimensional inner product space and \(V^*\) its dual space. Let \(\{e_i\}_{i \in \{1,...,d\}} \subset V\) (not necessarily linearly independent), G the matrix defined by \(G_{ij} = \langle e_i, e_j \rangle \), \(\omega \in V^*\), \(v \in V\) such that \(\langle v, \cdot \rangle = \omega (\cdot )\), and \(\Pi \) the projection onto the space \({{\,\textrm{span}\,}}\{e_i\}_i\). Then,

$$\begin{aligned} \Pi (v) = \left( G^+\right) ^{ij}\omega (e_j)e_i. \end{aligned}$$
(10)

Proof

Start by noting that \(\Pi (v)\) is uniquely defined by the fact that \(\langle \Pi (v), w \rangle = \omega (w)\) for \(w \in {{\,\textrm{span}\,}}\{e_i\}_i\) and \(\langle \Pi (v), w \rangle = 0\) for \(w \in \left( {{\,\textrm{span}\,}}\{e_i\}_i\right) ^\perp \). Since the RHS of (10) lies in the span of \(\{e_i\}_i\), it remains to show that for an arbitrary vector \(w = w^i e_i \in {{\,\textrm{span}\,}}\{e_i\}_i\) we have:

$$\begin{aligned} \langle \left( G^+\right) ^{ij}\omega (e_j)e_i, w^k e_k \rangle = \omega (w). \end{aligned}$$
(11)

Working out the LHS gives:

$$\begin{aligned} \langle \left( G^+\right) ^{ij}\omega (e_j)e_i, w^k e_k \rangle&= G_{ik} \left( G^+\right) ^{ij}\omega (e_j) w^k \end{aligned}$$
(12)
$$\begin{aligned}&= w^k G_{ki}\left( G^+\right) ^{ij}G_{jl} v^l\end{aligned}$$
(13)
$$\begin{aligned}&= w^k G_{kl} v^l \end{aligned}$$
(14)
$$\begin{aligned}&= w^k \omega (e_k) \end{aligned}$$
(15)
$$\begin{aligned}&= \omega (w), \end{aligned}$$
(16)

where we use the fact that \(\omega (e_i) = G_{ij}v^j\) and the symmetry of G in the second equality, and the defining property (3) of the generalised inverse in the third.

\(\square \)

Proof

(Proof of Theorem 1) We now let \(T_p\mathcal {Z}\) take the role of V, \(d\mathcal {L}_p\) the role of \(\omega \), \(\partial _{i}(\xi )\) the role of \(e_i\), and \({{\,\textrm{grad}\,}}_p^\mathcal {Z}\mathcal {L}\) the role of v. Equation (8) now follows immediately. When the tangent vectors \(\{\partial _i(\xi )\}_i\) span the whole tangent space of \(\mathcal {M}\) at p, \(\Pi _\xi \) becomes the identity on \(T_p\mathcal {M}\). This gives Eq. (9). \(\square \)
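The following numerical sketch (again a toy example of our own, with \(\mathcal {Z}= \mathbb {R}^3\) and the Euclidean metric) checks Eq. (8): the pushforward of the natural parameter gradient coincides with the orthogonal projection of \({{\,\textrm{grad}\,}}^\mathcal {Z}_p\mathcal {L}\) onto \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i\), here a 2-plane:

```python
import numpy as np

# Toy check of Eq. (8): Euclidean Z = R^3, overparametrised phi with a 2-dimensional span.
J = np.array([[1.0, 1.0, 0.0],      # columns: partial_i(xi)
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
grad_Z = np.array([1.0, 0.0, 1.0])  # ambient Riemannian gradient grad^Z L at p = phi(xi)

G = J.T @ J
pushforward = J @ (np.linalg.pinv(G) @ (J.T @ grad_Z))   # LHS of Eq. (8)

# RHS of Eq. (8): orthogonal projection onto the column span of J, computed independently.
Q, _ = np.linalg.qr(J[:, [0, 2]])   # orthonormal basis of span{partial_i(xi)}
projection = Q @ (Q.T @ grad_Z)

assert np.allclose(pushforward, projection)
```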

3 Invariance properties of the natural parameter gradient

In this section we study the invariance properties of the natural parameter gradient by using an alternative parametrisation of \(\mathcal {M}\) given by:

$$\begin{aligned} \psi :\Theta \ni \theta \mapsto \psi (\theta ) \in \mathcal {Z}. \end{aligned}$$
(17)

Note that \(G^+(\xi ), \nabla _{\xi } \mathcal {L}\) and \(\partial _{i}(\xi )\) in the definition of \(d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}\) all implicitly depend on the parametrisation \(\phi \). For an alternative parametrisation \(\psi \) we will therefore write: \({\partial }_i(\theta ) :=d\psi _{\theta } \left( \left. \frac{\partial }{\partial \theta ^i}\right| _{\theta }\right) \), \({G}_{ij}(\theta ) :=g_{\psi (\theta )} \left( {\partial }_i(\theta ), {\partial }_j(\theta )\right) \), and \(\nabla _{\theta } \mathcal {L}:=({\partial }_1(\theta ) \mathcal {L}, ..., {\partial }_d(\theta )\mathcal {L})\) (see Fig. 2).

Fig. 2 Two parametrisations of \(\mathcal {M}\)

The invariance properties can be studied from the perspective of the model and from the perspective of the parameter space itself. Since the former is of more importance, we will start with this one.

3.1 Parametrisation dependence and reparametrisation invariance on the model

A parametrisation can be used to represent tangent vectors on the model space by elements of \(\mathbb {R}^d\). A representation (of vectors on \(\mathcal {M}\)) can be interpreted as the map \(\mathcal {O}:(\phi , \xi ) \mapsto \mathcal {O}(\phi , \xi ) \in T_\xi \Xi \ (\cong \mathbb {R}^d)\) that takes a parametrisation-coordinate pair and assigns a tangent vector on the parameter space to it. The natural parameter gradient defined by \(\widetilde{\nabla }_\xi \mathcal {L}= \left( G^+(\xi ) \nabla _\xi \mathcal {L}\right) ^i \left. \frac{\partial {}}{\partial {\xi ^i}}\right| _\xi \) in Eq.  (5) is an example of a representation, where the dependence on \(\phi \) on the RHS is implicit. Naively, one could define invariance of a representation in the following way:

Definition 4

(Parametrisation independence) Let \(\mathcal {M}\) be a model. A representation \(\mathcal {O}(\cdot , \cdot )\) is called parametrisation independent if for any pair \(\phi , \psi \) of parametrisations of \(\mathcal {M}\), and coordinates \(\xi ,\theta \) such that \(\psi (\theta ) = \phi (\xi )\), the following holds:

$$\begin{aligned} d\psi _{\theta } \mathcal {O}(\psi ,\theta ) = d\phi _\xi \mathcal {O}(\phi ,\xi ). \end{aligned}$$
(18)

It turns out that this is not a very useful definition. As we will see, no non-trivial representation can be parametrisation independent in the sense of this definition. We will illustrate this in Examples 1 and 2 below for the natural parameter gradient on specific models. A formal proof can be found in Appendix A.1.

In order to overcome the limitation of Definition 4, we propose the following more suitable definition of invariance of a representation:

Definition 5

(Reparametrisation invariance) Let \(\mathcal {M}\) be a model. A representation \(\mathcal {O}(\cdot , \cdot )\) is called reparametrisation invariant if for any pair \(\phi , \psi \) of parametrisations of \(\mathcal {M}\), such that \(\psi = \phi \circ f\) for a diffeomorphism \(f:\Theta \rightarrow \Xi \), and coordinates \(\xi ,\theta \) such that \(\theta = f^{-1}(\xi )\), the equality (18) holds.

Due to the extra requirement of the existence of the reparametrisation function f in Definition 5, we get the following central result of this paper:

Theorem 2

The natural parameter gradient is reparametrisation invariant.

Proof

By Definition 5, we need to show that for \(\psi = \phi \circ f\) and \(\theta = f^{-1}(\xi )\) we have:

$$\begin{aligned} d\psi _{\theta }\widetilde{\nabla }_\theta \mathcal {L}= d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}. \end{aligned}$$
(19)

Since the differential \(df_{\theta }\) is surjective, we have that \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i = {{\,\textrm{span}\,}}\{{\partial }_j(\theta )\}_j\). Therefore, by using Eq. (8) of Theorem 1, we get:

$$\begin{aligned} d\psi _{\theta }\widetilde{\nabla }_\theta \mathcal {L}= \Pi _\theta \left( {{\,\textrm{grad}\,}}_{\psi (\theta )}^\mathcal {Z}\mathcal {L}\right) = \Pi _\xi \left( {{\,\textrm{grad}\,}}_{\phi (\xi )}^\mathcal {Z}\mathcal {L}\right) = d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}, \end{aligned}$$
(20)

which is what we wanted to show. \(\square \)
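A minimal numerical sketch of Theorem 2 (our own toy instance, assuming a Euclidean ambient metric and a linear reparametrisation \(f(\theta ) = A\theta \) with A invertible):

```python
import numpy as np

rng = np.random.default_rng(1)

def pushforward_of_nat_param_grad(J, grad_Z):
    # d phi (natural parameter gradient) for a Euclidean ambient metric,
    # where the columns of J are the pushforward vectors at the current point.
    G = J.T @ J
    return J @ (np.linalg.pinv(G) @ (J.T @ grad_Z))

# phi(xi) = (xi^1 + xi^2, xi^3) and psi = phi o f with a linear diffeomorphism f(theta) = A theta.
J_phi = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # invertible for this seed
J_psi = J_phi @ A                             # chain rule: d psi_theta = d phi_xi o d f_theta

grad_Z = rng.normal(size=2)                   # some ambient gradient at the common model point

v_phi = pushforward_of_nat_param_grad(J_phi, grad_Z)
v_psi = pushforward_of_nat_param_grad(J_psi, grad_Z)
assert np.allclose(v_phi, v_psi)              # Eq. (19): the same vector on the model
```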

Remark 1

Note that under the extra assumptions that \(\mathcal {M}\) is a smooth manifold and all parametrisations are required to be diffeomorphisms, Definitions 4 and 5 are equivalent, since the composition \(f = \phi ^{-1} \circ \psi \) is a diffeomorphism. These assumptions are often implicitly made when referring to the invariance of the natural gradient. However, as we will see below, this is no longer the case in our more general setting.

3.1.1 Example 1

This example will be of a graphical nature. Consider the parametrisation \(\phi \) that is the composition of the two maps in Fig. 3. Now let \(\psi :\Xi \rightarrow \mathcal {M}\) be the same parametrisation but with a 90-degree rotation around \(\phi (\xi )\) applied before projecting down to \(\mathcal {M}\).

Fig. 3 Parametrisation with a non-surjective span of the parameter vectors

Note that the spans of the parameter vectors have trivial intersection as depicted in Fig. 4a. This immediately implies that for any representation we have: \( d\psi _{\theta } \mathcal {O}(\psi ,\theta ) \ne d\phi _\xi \mathcal {O}(\phi ,\xi )\) except when both sides are equal to zero. In particular, we can let the natural gradient be as in Fig. 4b. We know from Theorem 1 that the natural parameter gradients on \(\mathcal {M}\) will be the projections of the natural gradient onto the respective spans of the parameter vectors as depicted in the figure. Note that the projection should be orthogonal with respect to the inner product \(g_p\), which we have chosen here to be Euclidean for ease of illustration.

Fig. 4 Spans and gradient vectors on the model of both parametrisations

This example shows that for non-singular points, we can construct two parametrisations that give different natural parameter gradients on the same point of the model. Note that this is not in violation of Theorem 2 since there does not exist a diffeomorphism f such that \(\psi = \phi \circ f\).

3.1.2 Example 2

Let us consider the case in which \(\phi \) is a smooth map from an interval on the real line to \(\mathbb {R}^2\) as depicted in Fig. 5. We have that \(\xi _1\) and \(\xi _2\) are both mapped to the same point p in \(\mathbb {R}^2\). Note that \(\mathcal {M}\) is in this case not locally an embedded submanifold around p and thus p is a singular point. Note that \(G(\xi _1)\) is a real number different from zero and therefore non-degenerate. Calculating the natural parameter gradient on \(\mathcal {M}\) for \(\xi = \xi _1\) gives:

$$\begin{aligned} d\phi _{\xi _1}\widetilde{\nabla }_{\xi _1}\mathcal {L}&= G^+(\xi _1) \nabla _{\xi _1} \mathcal {L}\ \partial (\xi _1)\end{aligned}$$
(21)
$$\begin{aligned}&= G^{-1}(\xi _1) \frac{\partial \mathcal {L}\circ \phi }{\partial \xi }(\xi _1)\ \partial (\xi _1). \end{aligned}$$
(22)

Since \(G^{-1}(\xi _1) \frac{\partial \mathcal {L}\circ \phi }{\partial \xi }(\xi _1)\) is a scalar, the resulting vector will lie in the span of \(\partial (\xi _1)\), illustrated by the blue arrows in the figure.

Fig. 5 Parametrisation that contains a singular point

Now let \(f:\Theta \rightarrow \Xi \) be a diffeomorphism such that \(f(\theta _1) = \xi _2\). An alternative parametrisation of \(\mathcal {M}\) is given by:

$$\begin{aligned} \psi = \phi \circ f. \end{aligned}$$
(23)

Calculating the natural parameter gradient at \(\theta _1\) for this parametrisation gives:

$$\begin{aligned} d\psi _{{\theta _1}}\widetilde{\nabla }_{{\theta _1}}\mathcal {L} = {G}^{-1}({\theta _1}) \frac{\partial \mathcal {L}\circ \psi }{\partial \theta }(\theta _1) \partial (\theta _1). \end{aligned}$$
(24)

Note that this vector is in the span of \(\partial (\theta _1)\), denoted by the red arrows in the figure, and is therefore in general different from (22). This shows that when \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i \ne {{\,\textrm{span}\,}}\{{\partial }_j(\theta )\}_j\) the outcome of \(\left( G^+(\xi ) \nabla _{\xi } \mathcal {L}\right) ^i \partial _{i}(\xi )\) can depend on the choice of parametrisation, and therefore the natural parameter gradient is not parametrisation independent. Note however that this result is not in contradiction with Theorem 2, since we do not have \(\theta _1 = f^{-1}(\xi _1)\). See Appendix A.2 for a worked-out example of this.
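For concreteness, a hedged numerical instance of this example (the curve \(\phi (t) = (\cos t, \sin 2t)\) on \((0, 2\pi )\) is our own choice; it passes through the origin at \(\xi _1 = \pi /2\) and \(\xi _2 = 3\pi /2\), and the shift \(f(\theta ) = \theta + \pi \) maps \(\theta _1 = \pi /2\) to \(\xi _2\)):

```python
import numpy as np

# Our own concrete instance of Example 2: a self-intersecting curve in R^2.
def dphi(t):
    # Tangent vector of phi(t) = (cos t, sin 2t).
    return np.array([-np.sin(t), 2.0 * np.cos(2.0 * t)])

def nat_param_grad_on_M(t, grad_Z):
    v = dphi(t)                      # single pushforward vector
    G = v @ v                        # 1 x 1 Gram matrix, non-degenerate here
    return (v @ grad_Z) / G * v      # Eqs. (21)-(22)

grad_Z = np.array([1.0, 0.0])        # some ambient (Euclidean) gradient at the origin

xi_1, xi_2 = np.pi / 2, 3 * np.pi / 2
# psi = phi o f with f(theta) = theta + pi, so the natural parameter gradient of psi at
# theta_1 = pi/2 equals that of phi at xi_2 (f has derivative 1).
g_phi = nat_param_grad_on_M(xi_1, grad_Z)   # lies in span{dphi(xi_1)} = span{(-1, -2)}
g_psi = nat_param_grad_on_M(xi_2, grad_Z)   # lies in span{dphi(xi_2)} = span{( 1, -2)}
print(g_phi, g_psi)                          # different vectors at the same model point
```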

3.2 Reparametrisation (in)variance on the parameter space

In the previous section we have looked at the invariance properties of the natural parameter gradient from the perspective of the model. One can also study the invariance properties from the perspective of the parameter space as is done for example in Section 12 of [12]. Translating the definition of invariance given there to our notation gives the following:

Definition 6

(Reparametrisation invariance on the parameter space) A representation \(\mathcal {O}(\cdot , \cdot )\) is called reparametrisation invariant on the parameter space if for any pair of parametrisations \(\phi , \psi \) such that \(\psi = \phi \circ f\) for a diffeomorphism \(f:\Theta \rightarrow \Xi \), and coordinates \(\xi ,\theta \) such that \(\theta = f^{-1}(\xi )\), we have:

$$\begin{aligned} df_{\theta }\mathcal {O}(\psi ,\theta ) = \mathcal {O}(\phi ,\xi ). \end{aligned}$$
(25)

Note that reparametrisation invariance on the parameter space implies reparametrisation invariance on the model as defined in Definition 5. Furthermore, it can be shown that when \(\mathcal {M}\) is a smooth manifold and all parametrisations are required to be diffeomorphisms, like in Remark 1, this definition is equivalent to Definitions 4 and 5. In that case, the natural parameter gradient satisfies Eq. (25). As we will see below, this is not true for general \(\phi \). We would like to argue, however, that this is not a suitable definition of invariance, since multiple vectors on the parameter space can be mapped to the same vector on the model. Therefore inequality on the parameter space does not have to imply inequality on the model.

Fig. 6 Two parametrisations of \(\mathcal {M}\) with different gradient vectors on the parameter space

We will now make the above explicit. Let us choose the MP inverse as generalised inverse for \(\widetilde{\nabla }_\xi \mathcal {L}\) and consider an alternative parametrisation \(\psi = \phi \circ f\) for a diffeomorphism \(f:\Theta \rightarrow \Xi \) (see Fig. 6). We denote the matrix of partial derivatives of f at \({\theta }\) with \(F_i^j ({\theta }) = \frac{\partial {f^j}}{\partial {{\theta }^i}}({\theta })\). For \(\xi = f({\theta })\) we get the following relations:

$$\begin{aligned} {\partial }_i({\theta })&= F_{i}^j({\theta }) \, \partial _j(\xi ) \end{aligned}$$
(26)
$$\begin{aligned} \nabla _{\theta } \mathcal {L}&= F({\theta }) \, \nabla _\xi \mathcal {L}\end{aligned}$$
(27)
$$\begin{aligned} {G}({\theta })&= F({\theta }) \, G(\xi ) \, F^T({\theta }). \end{aligned}$$
(28)

We map \(\widetilde{\nabla }_\theta \mathcal {L}\) to \(T_\xi \Xi \) through \(df_{\theta }\) and get:

$$\begin{aligned} df_{\theta } \widetilde{\nabla }_\theta \mathcal {L}&=df_{\theta }\left( \left( {G}^+({\theta })\nabla _{\theta } \mathcal {L}\right) ^i\left. \frac{\partial }{\partial {\theta }^i}\right| _{\theta }\right) \end{aligned}$$
(29)
$$\begin{aligned}&= \left( \left( F({\theta })G(\xi )F^T({\theta })\right) ^+ F({\theta }) \, \nabla _\xi \mathcal {L}\right) ^i F_{i}^j({\theta }) \left. \frac{\partial {}}{\partial {\xi ^j}}\right| _\xi \end{aligned}$$
(30)
$$\begin{aligned}&= \left( F^T({\theta }) \left( F({\theta })G(\xi )F^T({\theta })\right) ^+ F({\theta }) \, \nabla _\xi \mathcal {L}\right) ^j \left. \frac{\partial {}}{\partial {\xi ^j}}\right| _\xi . \end{aligned}$$
(31)

We will write \(y_\Xi , y_{\Theta }\) for the coefficients of \(\widetilde{\nabla }_\xi \mathcal {L}\) and \(df_{\theta } \widetilde{\nabla }_\theta \mathcal {L}\) respectively. From Theorem 1 and the fact that \(F(\theta )\) is of full rank we know that \(F(\theta ) \, \nabla _\xi \mathcal {L}\) lies in the image of \(F(\theta )G(\xi )F^T(\theta )\). Therefore, by the definition of the MP inverse, we have that \(\left( F(\theta )G(\xi )F^T(\theta )\right) ^+ F(\theta ) \, \nabla _\xi \mathcal {L}= {{\,\mathrm{arg\,min}\,}}_x \{ ||x|| : F(\theta )G(\xi )F^T(\theta ) x = F(\theta ) \, \nabla _\xi \mathcal {L}\}\), where \(||\cdot ||\) is the Euclidean norm on \(\mathbb {R}^d\). The coefficients in (31) become:

$$\begin{aligned} y_{\Theta }= & {} F^T({\theta }) \left( F({\theta })G(\xi )F^T({\theta })\right) ^+ F({\theta }) \, \nabla _\xi \mathcal {L}\end{aligned}$$
(32)
$$\begin{aligned}= & {} {{\,\mathrm{arg\,min}\,}}_y\left\{ ||\left( F^T({\theta })\right) ^{-1} y|| : G(\xi ) y = \nabla _\xi \mathcal {L}\right\} , \end{aligned}$$
(33)

where we substitute \(y=F^T(\theta ) x\) in the last line.

Remark 2

Note that \(||\left( F^T\right) ^{-1} (\cdot )||\) is the pushforward of the norm on \(\Theta \) through f. This shows nicely that, as far as the gradient is concerned, constructing a different parametrisation \((\psi )\) is equivalent to defining a different norm \(\left( ||\left( F^T\right) ^{-1} (\cdot )||\right) \) for the existing parametrisation \((\phi )\).

Comparing the result to the natural parameter gradient on \(\Xi \) gives:

$$\begin{aligned} \widetilde{\nabla }_\xi \mathcal {L}&= \left( G^+(\xi ) \nabla _\xi \mathcal {L}\right) ^i \left. \frac{\partial {}}{\partial {\xi ^i}}\right| _\xi \end{aligned}$$
(34)
$$\begin{aligned}&= (y_\Xi )^i \left. \frac{\partial {}}{\partial {\xi ^i}}\right| _\xi \end{aligned}$$
(35)
$$\begin{aligned} y_\Xi&= {{\,\mathrm{arg\,min}\,}}_y\{ ||y|| : G(\xi ) y = \nabla _\xi \mathcal {L}\} . \end{aligned}$$
(36)

Because the norms in (33) and (36) are different, generally \(y_{\Theta } \ne y_\Xi \). However, both satisfy \(G(\xi )y = \nabla _\xi \mathcal {L}\) and therefore \(G(\xi )(y_{\Theta } - y_\Xi ) = 0\). This implies:

$$\begin{aligned} d\phi _\xi \left( df_{\theta } \widetilde{\nabla }_\theta \mathcal {L}- \widetilde{\nabla }_\xi \mathcal {L}\right)&= \left( y_{\Theta } - y_\Xi \right) ^i \partial _i(\xi ) \end{aligned}$$
(37)
$$\begin{aligned}&= 0, \end{aligned}$$
(38)

where the last equality can be verified by taking the norm of the RHS of (37) and using that g is non-degenerate:

$$\begin{aligned} ||\left( y_{\Theta } - y_\Xi \right) ^i \partial _i(\xi )||^2_g = \left( y_{\Theta } - y_\Xi \right) ^T G(\xi ) \left( y_{\Theta } - y_\Xi \right) =0. \end{aligned}$$
(39)

This shows that for overparametrised systems the natural parameter gradient is not reparametrisation invariant on the parameter space. However, as implied by Theorem 1, the dependency on the parametrisation disappears when the gradient is mapped to the model. See Appendix A.3 for a worked-out example of the above discussion.
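The following numerical sketch (our own toy example with a Euclidean ambient metric, the MP inverse, and a linear reparametrisation with Jacobian F) illustrates the discussion: the coefficient vectors \(y_\Theta \) and \(y_\Xi \) generally differ, yet their difference is annihilated by \(d\phi _\xi \):

```python
import numpy as np

rng = np.random.default_rng(2)

J = np.array([[1.0, 1.0, 0.0],          # pushforward vectors of an overparametrised phi
              [0.0, 0.0, 1.0]])
grad_Z = rng.normal(size=2)
G = J.T @ J                             # degenerate Gram matrix
nabla_xi = J.T @ grad_Z                 # coordinate gradient; lies in the image of G

F = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # Jacobian of a linear diffeomorphism f

y_xi = np.linalg.pinv(G) @ nabla_xi                              # Eq. (36)
y_theta = F.T @ np.linalg.pinv(F @ G @ F.T) @ (F @ nabla_xi)     # Eqs. (31)-(33)

print(np.allclose(y_theta, y_xi))          # generally False: no invariance on the parameter space
print(np.allclose(J @ y_theta, J @ y_xi))  # True: the difference vanishes on the model, Eq. (38)
```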

4 Practical considerations for the natural gradient method

The natural gradient method is performed by updating the current parameter vector \(\xi \in \Xi \) in the direction of the vector \(G^+(\xi ) \nabla _\xi \mathcal {L}\in \mathbb {R}^d\). In the case of a constrained parameter space, i.e. when \(\Xi \) is not the full space \(\mathbb {R}^d\), such as the space of covariance matrices, one runs the risk of stepping outside the parameter space, see also [2]. This is called a constraint violation. One can use backprojection [8], the addition of a penalty, or weight clipping [7] to avoid these violations. Note that for a variety of neural network applications, including many supervised learning tasks, the parameter space is unconstrained. For these models, however, the generalised inverse is often hard to compute due to the high number of parameters. In this context, the Woodbury matrix identity with damping is often used instead [15]. Investigating these topics further falls outside the scope of this paper.
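To indicate what the last remark amounts to in practice, here is a minimal sketch, assuming (as is common for Fisher-type metrics) that the Gram matrix has the low-rank form \(G = J^T J\) with \(J \in \mathbb {R}^{n \times d}\) and \(n \ll d\); the damped update direction \((G + \lambda I)^{-1}\nabla _\xi \mathcal {L}\) can then be computed via the Woodbury identity at the cost of an \(n \times n\) solve:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 5, 200, 1e-2                  # few "data" directions, many parameters, damping lambda

J = rng.normal(size=(n, d))               # G = J^T J is d x d but only has rank n
nabla = J.T @ rng.normal(size=n)          # a coordinate gradient

# Damped natural parameter gradient, naive d x d solve:
naive = np.linalg.solve(J.T @ J + lam * np.eye(d), nabla)

# The same vector via the Woodbury identity: only an n x n system has to be solved.
woodbury = (nabla - J.T @ np.linalg.solve(lam * np.eye(n) + J @ J.T, J @ nabla)) / lam

assert np.allclose(naive, woodbury)
```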

4.1 Reparametrisation (in)variance of the natural gradient method trajectory

We have shown in Theorem 2 that from the perspective of the model, the natural parameter gradient is reparametrisation invariant. That is, for two parametrisations \(\phi , \psi \) for which \(\psi = \phi \circ f\) for a diffeomorphism f and \(\theta = f^{-1}(\xi )\), we have that,

$$\begin{aligned} d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}= \left( G^+(\xi ) \nabla _\xi \mathcal {L}\right) ^i \partial _i = \left( {G}^+(\theta ) \nabla _{\theta } \mathcal {L}\right) ^j {\partial }_j = d\psi _\theta \widetilde{\nabla }_\theta \mathcal {L}. \end{aligned}$$
(40)

Recall that the natural gradient method updates the current parameter vector \(\xi \in \mathbb {R}^d\) in the direction of the vector \(G^+(\xi ) \nabla _\xi \mathcal {L}\in \mathbb {R}^d\). Equation (40) implies that if we were to update the parameters for both parametrisations by an infinitesimal amount, this would give us the same result on the model. We would like to emphasise, however, that updating the parameters by a finite amount will in general result in different locations on the model. Therefore the natural gradient method trajectory is dependent on the choice of parametrisation. This is, however, not an issue specific to overparametrised models but one of the natural gradient method in general. See Section 12 of [12] for exact bounds on the invariance.
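A one-dimensional numerical sketch (our own toy example) of this finite-step dependence: with \(\mathcal {Z}= \mathbb {R}\), \(\mathcal {L}(z) = z^2/2\), \(\phi (\xi ) = \xi \) and \(\psi (\theta ) = \theta ^3 = (\phi \circ f)(\theta )\) for the diffeomorphism \(f(\theta ) = \theta ^3\) on \((0, \infty )\), the pushforwards of the natural parameter gradients coincide, yet a finite update lands on different model points:

```python
import numpy as np

eta = 0.5                                  # finite step size
xi = 2.0                                   # current point on the model (M = (0, inf) here)
theta = xi ** (1.0 / 3.0)                  # same model point in the psi-parametrisation

# Natural gradient step under phi: G(xi) = 1, nabla_xi L = xi.
xi_new = xi - eta * xi

# Natural gradient step under psi: G(theta) = 9*theta^4, nabla_theta L = 3*theta^2 * theta^3.
theta_new = theta - eta * (3 * theta**5) / (9 * theta**4)

print(xi_new, theta_new**3)                # 1.0 vs approx. 1.16: different points on the model
```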

4.2 Occurrence of non-proper points

We saw in the proof of Theorem 2 that when \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i = {{\,\textrm{span}\,}}\{{\partial }_j(\theta )\}_j\) for two parametrisations \(\phi \) and \(\psi \) with \(\phi (\xi ) = \psi (\theta )\), we have \(d\phi _\xi \widetilde{\nabla }_\xi \mathcal {L}= d\psi _\theta \widetilde{\nabla }_\theta \mathcal {L}\). For \(\phi (\xi ) \in {{\,\textrm{Smooth}\,}}(\mathcal {M})\) note that this equality holds in particular when \({{\,\textrm{span}\,}}\{\partial _i(\xi )\}_i = {{\,\textrm{span}\,}}\{{\partial }_j(\theta )\}_j = T_{\phi (\xi )}\mathcal {M}\), i.e. \(\phi \) is proper in \(\xi \). Therefore we will now study when this is the case. We start by recalling some basic facts from smooth manifold theory. Let M, N be smooth manifolds and \(F:M \rightarrow N\) a smooth map. We call a point \(p \in M\) a regular point if \(dF_p: T_p M \rightarrow T_{F(p)} N\) is surjective and a critical point otherwise. A point \(q \in N\) is called a regular value if all the elements in \(F^{-1}(q)\) are regular points, and a critical value otherwise. If M is n-dimensional, we say that a subset \(S \subset M\) has measure zero in M if for every smooth chart \((U, \psi )\) for M, the subset \(\psi (S \cap U) \subset \mathbb {R}^{n}\) has n-dimensional measure zero. That is: \(\forall \delta >0\), there exists a countable cover of \(\psi (S \cap U)\) consisting of open rectangles, the sum of whose volumes is less than \(\delta \). We have the following result based on Sard's theorem:

Proposition 1

If \({{\,\textrm{Smooth}\,}}(\mathcal {M})\) is a manifold, then the image of the set of points for which \(\phi \) is not proper has measure zero in \({{\,\textrm{Smooth}\,}}(\mathcal {M})\).

Proof

From the definition of \({{\,\textrm{Smooth}\,}}(\mathcal {M})\) we know that for every \(p \in {{\,\textrm{Smooth}\,}}(\mathcal {M})\) there exists a \(U_p\) open in \(\mathcal {Z}\) such that \(U_p \cap \mathcal {M}\) is an embedded submanifold of \(\mathcal {Z}\). Let \(U :=\bigcup _{p\in {{\,\textrm{Smooth}\,}}(\mathcal {M})} U_p\). Note that: \(U \cap \mathcal {M}= {{\,\textrm{Smooth}\,}}(\mathcal {M})\) and therefore \(\phi ^{-1}(U) = \phi ^{-1}({{\,\textrm{Smooth}\,}}(\mathcal {M}))\). Since U is open in \(\mathcal {Z}\), \(\phi ^{-1}({{\,\textrm{Smooth}\,}}(\mathcal {M}))\) is an open subset of \(\Xi \) and thus an embedded submanifold. Therefore we can consider the map:

$$\begin{aligned} \phi |_{\phi ^{-1}({{\,\textrm{Smooth}\,}}(\mathcal {M}))}: \phi ^{-1}({{\,\textrm{Smooth}\,}}(\mathcal {M})) \rightarrow {{\,\textrm{Smooth}\,}}(\mathcal {M}) \end{aligned}$$
(41)

and note that the image of the set of points for which \(\phi \) is not proper is equal to the set of critical values of \(\phi |_{\phi ^{-1}({{\,\textrm{Smooth}\,}}(\mathcal {M}))}\) in \({{\,\textrm{Smooth}\,}}(\mathcal {M})\). A simple application of Sard’s theorem gives the result. \(\square \)

This proposition implies that when \({{\,\textrm{Smooth}\,}}(\mathcal {M})\) is a manifold, the set of points for which the pushforward of the natural parameter gradient is unequal to the natural gradient has measure zero in \({{\,\textrm{Smooth}\,}}(\mathcal {M})\).

5 Conclusion

In this paper we have studied the natural parameter gradient, which was defined as the update direction of the natural gradient method, and its pushforward to the model in an overparametrised setting. We have seen that the latter is equal to the natural gradient under certain conditions. Furthermore we have proposed different notions of invariance and studied whether the natural parameter gradient satisfies these. From the perspective of the model, we have seen that the natural parameter gradient is reparametrisation invariant but that it is not parametrisation independent. Additionally, we saw that the natural parameter gradient is not reparametrisation invariant on the parameter space. We have argued, however, that this notion is less suitable in an overparametrised setting since multiple vectors on the parameter space can correspond to the same vector on the model. Finally we have given some practical considerations for the natural gradient method.