1 Introduction

The Fisher metric on a statistical manifold (a manifold consisting of probability distributions) is one of the most important notions in information geometry [1]. It is usually treated as a Riemannian metric, that is, a metric on the tangent bundle. The subject of the present paper is the metric on the cotangent bundle corresponding to the Fisher metric, which we call the Fisher co-metric. The Fisher metric and the Fisher co-metric are essentially a single geometric object, in the sense that each is induced from the other. Nevertheless, studying the Fisher co-metric has several implications, as mentioned below, which the present paper intends to demonstrate.

Firstly, as will be seen in Sect. 2, the Fisher co-metric is defined via the variance/covariance of random variables, based on a natural correspondence between cotangent vectors and random variables. This definition is natural and involves no arbitrary choices; there is no room for questions such as why \(\log p\) appears in the definition of the Fisher metric.

Secondly, the above relationship between cotangent vectors and random variables directly links the variance/covariance of an unbiased estimator with the Fisher co-metric, which trivializes the Cramér–Rao inequality. From this viewpoint, the Fisher metric appears to be a detour for the Cramér–Rao inequality, at least conceptually.

Thirdly, once we focus on the Fisher co-metric, we are motivated to reconsider known results for the Fisher metric as a source of analogous problems for the Fisher co-metric and the variance/covariance, which may lead to new insights. As an example, co-metric and variance/covariance versions of Čencov’s theorem on the characterization of the Fisher metric are investigated in this paper.

The paper is organized as follows. In Sect. 2, we introduce the Fisher co-metric on the manifold \(\mathcal {P}(\Omega )\), which is the totality of positive probability distributions on a finite set \(\Omega \), via the variance/covariance of random variables on \(\Omega \). In Sect. 3, the Fisher co-metric is shown to be equivalent to the Fisher metric by a natural correspondence. In Sect. 4, the Fisher metric and co-metric on an arbitrary submanifold of \(\mathcal {P}\) are discussed, where we see that the Cramér–Rao inequality is trivialized by considering the co-metric. Section 5 treats the e- and m-connections on \(\mathcal {P}(\Omega )\), where it is clarified that, in application to estimation theory, the role of the m-connection as a connection on the cotangent bundle and its relation to the Fisher co-metric are crucial. Sections 2–5 can be considered to constitute the first half of the paper, which is aimed at showing the naturalness and the usefulness of considering the Fisher co-metric.

The second half of the paper focuses on the monotonicity and the invariance of the Fisher metric and co-metric with respect to Markov maps. In Sect. 6, we investigate the monotonicity. We show there that the monotonicity of the Fisher metric, which is well known as a characteristic property of the metric, translates equivalently into the monotonicity of the Fisher co-metric and that of the variance. In Sect. 7, after reviewing the invariance of the Fisher metric and Čencov’s theorem, we consider their co-metric versions. It is shown that, unlike the monotonicity, the invariance of the metric and that of the co-metric are not logically equivalent. We present a theorem characterizing the Fisher co-metric in terms of the invariance, which corresponds to Čencov’s theorem but does not follow from it. The obtained theorem can also be expressed as a theorem characterizing the variance/covariance. In Sect. 8, we investigate a stronger version of the invariance, which can be regarded as the joint condition combining the invariance of the metric and that of the co-metric. The formulation used for expressing this condition is applied to affine connections in Sect. 9, whereby a kind of invariance condition for affine connections is obtained. The condition is shown to be equivalent to a known version of the invariance condition which is seemingly weaker than the original condition used by Čencov to characterize the \(\alpha \)-connections, but actually characterizes the \(\alpha \)-connections as well. Section 10 is devoted to concluding remarks.

Remark 1.1

Throughout this paper, we denote the tangent space and the cotangent space of a manifold M at a point \(p\in M\) by \(T_{p} (M)\) and \(T^*_{p} (M)\), respectively. We also denote the totality of smooth vector fields and that of smooth differential 1-forms on M by \(\mathfrak {X}(M)\) and \(\mathfrak {D}(M)\), respectively. We generally use capital letters \(X, Y, \ldots \) for vector fields in \(\mathfrak {X}(M)\), which are maps assigning tangent vectors \(X_p, Y_p, \ldots \) in \(T_{p} (M)\) to each point \(p\in M\). To save symbols, we also denote general tangent vectors in \(T_{p} (M)\) by \(X_p, Y_p, \ldots \), not only when they are the values of vector fields. Similarly, we use Greek letters \(\alpha , \beta , \ldots \) for 1-forms in \(\mathfrak {D}(M)\), which are maps assigning cotangent vectors \(\alpha _p, \beta _p, \ldots \) in \(T^*_{p} (M)\) to each point \(p\in M\), and also denote general cotangent vectors by \(\alpha _p, \beta _p, \ldots \), not only when they are the values of 1-forms. The pairing of \(X_p\in T_{p} (M)\) and \(\alpha _p\in T^*_{p}(M)\) is expressed as \(\alpha _p (X_p)\), considering a cotangent vector as a function on the tangent space. We reserve the first capital letters \(A, B, \ldots \) for random variables (\(\mathbb {R}\)-valued functions on sample spaces).

2 The Fisher co-metric

We introduce the Fisher co-metric in this section, while its equivalence to the Fisher metric will be shown in the next section.

Let \(\Omega \) be a finite set with cardinality \(|\Omega |\ge 2\), and let \(\mathcal {P}(\Omega )\) be the totality of strictly positive probability distributions on \(\Omega \):

$$\begin{aligned} \mathcal {P}=\mathcal {P}(\Omega ):= \Bigl \{\,p \,\Big \vert \, p: \Omega \rightarrow (0, 1), \; \sum _{\omega \in \Omega } p(\omega ) =1 \,\Bigr \}, \end{aligned}$$
(2.1)

which is regarded as a manifold with \(\dim \mathcal {P}(\Omega ) = |\Omega |-1\). Let the totality of \(\mathbb {R}\)-valued functions on \(\Omega \) be denoted by \(\mathbb {R}^\Omega \), and define \(\bigl (\mathbb {R}^\Omega \bigr )_c:= \{A\in \mathbb {R}^\Omega \,\vert \, \sum _{\omega \in \Omega } A(\omega ) =c\}\) for a constant \(c\in \mathbb {R}\). Since \(\mathcal {P}\) is an open subset of the affine space \(\bigl (\mathbb {R}^\Omega \bigr )_1\), its tangent space can be identified with the linear space \(\bigl (\mathbb {R}^\Omega \bigr )_0\). Following the terminology of [1], we denote this identification \(T_{p} (\mathcal {P}) \rightarrow \bigl (\mathbb {R}^\Omega \bigr )_0\) by \(X_p \mapsto X_p^{(\textrm{m})}\), and call \(X_p^{(\textrm{m})}\) the m-representation of \(X_p\).

For an arbitrary submanifold M of \(\mathcal {P}\) (including the case when \(M=\mathcal {P}\)) we define \(T^{(\textrm{m})}_{p}(M):= \{X_p^{(\textrm{m})} \,\vert \, X_p \in T_{p}(M)\}\), which is a linear subspace of \(T^{(\textrm{m})}_{p}(\mathcal {P}) = \bigl (\mathbb {R}^\Omega \bigr )_0\). When the elements of M are parametrized as \(p_\xi \) by a coordinate system \(\xi = (\xi ^i)\) of M, the m-representation of \((\partial _i)_p \in T_{p}(M)\), where \(\partial _i:= \frac{\partial }{\partial \xi ^i}\), with \(p=p_\xi \) is represented as

$$\begin{aligned} (\partial _i)_{p}^{(\textrm{m})} = \partial _i p_{\xi }, \end{aligned}$$
(2.2)

and \(\{ (\partial _i)_{p}^{(\textrm{m})} \}_{i=1}^n\) (\(n=\dim M\)) constitute a basis of \(T^{(\textrm{m})}_{p}(M)\).
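To make the m-representation concrete, the following is a small numerical sketch of our own (Python/NumPy; the particular chart below is an assumption for illustration, not taken from the paper): for \(\Omega = \{0,1,2\}\), parametrize \(\mathcal {P}(\Omega )\) by \(\xi = (\xi ^1, \xi ^2)\) with \(p_\xi = (\xi ^1, \xi ^2, 1-\xi ^1-\xi ^2)\); then \((\partial _i)_{p}^{(\textrm{m})} = \partial _i p_{\xi }\) as in (2.2), and each m-representation lies in \(\bigl (\mathbb {R}^\Omega \bigr )_0\).

```python
import numpy as np

# Illustrative sketch (not from the paper): Omega = {0, 1, 2}, so
# P(Omega) is the open 2-simplex, parametrized by xi = (xi1, xi2).
def p_of(xi):
    return np.array([xi[0], xi[1], 1.0 - xi[0] - xi[1]])

# m-representations of the coordinate basis vectors, cf. (2.2):
# (d_i)_p^(m) = d p_xi / d xi^i, computable exactly for this chart.
dp = [np.array([1.0, 0.0, -1.0]),   # i = 1
      np.array([0.0, 1.0, -1.0])]   # i = 2

# Each lies in (R^Omega)_0: the entries sum to zero.
for v in dp:
    assert abs(v.sum()) < 1e-12
```

The two vectors are linearly independent, matching \(\dim \mathcal {P}(\Omega ) = |\Omega | - 1 = 2\).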

We denote the expectation of a random variable \(A\in \mathbb {R}^\Omega \) w.r.t. a distribution \(p\in \mathcal {P}\) by

$$\begin{aligned} \langle A\rangle _p&:= \sum _{\omega \in \Omega } p(\omega ) A(\omega ), \end{aligned}$$
(2.3)

and define the function

$$\begin{aligned} \langle A\rangle : \mathcal {P}\rightarrow \mathbb {R}, \;\; p \mapsto \langle A\rangle _p. \end{aligned}$$
(2.4)

Since \(\langle A\rangle \) is a smooth function on the manifold \(\mathcal {P}\), its differential \((d\langle A\rangle )_p \in T^*_{p} (\mathcal {P})\) at each point \(p\in \mathcal {P}\) is defined. We introduce the following map:

$$\begin{aligned} \delta _p: \mathbb {R}^\Omega \rightarrow T^*_{p} (\mathcal {P}), \;\; A \mapsto \delta _p (A):=(d\langle A\rangle )_p {,} \end{aligned}$$
(2.5)

for which we have

$$\begin{aligned} \forall A\in \mathbb {R}^\Omega , \, \forall X_p\in T_{p} (\mathcal {P}), \;\; \delta _p (A) (X_p) = X_p \langle A\rangle = \sum _{\omega \in \Omega } X_p^{(\textrm{m})}(\omega ) A (\omega ). \end{aligned}$$
(2.6)

Proposition 2.1

For every \(p\in \mathcal {P}\), the linear map \(\delta _p: \mathbb {R}^\Omega \rightarrow T^*_{p} (\mathcal {P})\) is surjective with \(\textrm{Ker}\,\delta _p =\mathbb {R}\), where \(\mathbb {R}\) is regarded as a subspace of \(\mathbb {R}^\Omega \) by identifying a constant \(c\in \mathbb {R}\) with the constant function \(\omega \mapsto c\). Hence, \(\delta _p\) induces a linear isomorphism \(\mathbb {R}^\Omega / \mathbb {R}\rightarrow T^*_{p} (\mathcal {P})\).

Proof

Every cotangent vector \(\alpha _p\in T^*_{p} (\mathcal {P})\) is a linear functional on \(T_{p} (\mathcal {P})\), which is represented as \(\alpha _p: X_p \mapsto \sum _\omega X_p^{(\textrm{m})} (\omega ) A (\omega )\) by some \(A\in \mathbb {R}^{\Omega }\). This means that \(\alpha _p = \delta _p (A)\) due to (2.6). Hence, \(\delta _p\) is surjective. For any \(A\in \mathbb {R}^{\Omega }\), we have

$$\begin{aligned} A \in \textrm{Ker}\,\delta _p \;&\Leftrightarrow \; \forall X_p\in T_{p} (\mathcal {P}), \;\; \delta _p (A) (X_p) = \sum _\omega X_p^{(\textrm{m})} (\omega ) A (\omega ) = 0 \nonumber \\&\Leftrightarrow \; \forall B \in \bigl (\mathbb {R}^\Omega \bigr )_0, \;\; \sum _\omega A(\omega ) B(\omega ) = 0 \nonumber \\&\Leftrightarrow \; A \in \mathbb {R}, \nonumber \end{aligned}$$

which proves \(\textrm{Ker}\,\delta _p =\mathbb {R}\). \(\square \)
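Proposition 2.1 admits a quick numerical spot-check (our own sketch, in Python/NumPy): by (2.6), \(\delta _p (A)\) acts on \(X_p\) through \(\sum _\omega X_p^{(\textrm{m})}(\omega ) A(\omega )\), so adding a constant to A changes nothing, while a nonconstant A pairs nontrivially with some zero-sum vector.

```python
import numpy as np

# Sketch (our own): the pairing (2.6) between delta_p(A) and X_p.
def pairing(Xm, A):
    # delta_p(A)(X_p) = sum_w X_p^m(w) A(w); Xm must be zero-sum.
    return float(np.dot(Xm, A))

rng = np.random.default_rng(0)
A = rng.normal(size=4)
Xm = rng.normal(size=4)
Xm -= Xm.mean()                      # project into (R^Omega)_0

# Constants lie in the kernel of delta_p ...
assert abs(pairing(Xm, A + 7.0) - pairing(Xm, A)) < 1e-12
# ... and a nonconstant A is detected by the zero-sum vector A - mean(A).
assert pairing(A - A.mean(), A) > 0.0
```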

For each \(p\in \mathcal {P}\), denote the \(L^2\) inner product and the covariance of random variables \(A, B\in \mathbb {R}^\Omega \) by

$$\begin{aligned} \langle A, B\rangle _p&:= \langle AB\rangle _p \quad \text {and} \end{aligned}$$
(2.7)
$$\begin{aligned} \textrm{Cov}_p (A, B)&:= \langle A- \langle A\rangle _p, B- \langle B\rangle _p\rangle _p. \end{aligned}$$
(2.8)

Then \(\textrm{Cov}_p: (A, B) \mapsto \textrm{Cov}_p (A, B)\) is a positive-semidefinite symmetric bilinear form on \(\mathbb {R}^\Omega \) whose kernel is \(\mathbb {R}\), and hence it induces an inner product on \(\mathbb {R}^\Omega / \mathbb {R}\). Therefore, Proposition 2.1 implies that an inner product on \(T^*_{p} (\mathcal {P})\), which we denote by \(g_p\), can be defined by

$$\begin{aligned} g_p (\delta _p (A), \delta _p (B)) = \textrm{Cov}_p (A, B). \end{aligned}$$
(2.9)

Denoting the norm for \(g_p\) by \(\Vert \cdot \Vert _p\), we have

$$\begin{aligned} \Vert \delta _p(A) \Vert _p^2 = V_p (A), \end{aligned}$$
(2.10)

where the RHS is the variance \(V_p (A):= \langle (A- \langle A\rangle _p)^2 \rangle _p\).

We have thus defined the map g which maps each point \(p\in \mathcal {P}\) to the inner product \(g_p\) on \(T^*_{p} (\mathcal {P})\). We generally call such a map (a metric on the cotangent bundle) a co-metric. Although a co-metric is essentially equivalent to a usual (Riemannian) metric (a metric on the tangent bundle) by the correspondence explained in the next section, it is often useful to distinguish them conceptually. The co-metric defined by (2.9) is called the Fisher co-metric, since it corresponds to the Fisher metric as will be shown later.
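As a sanity check of the definition (2.9)–(2.10), the following sketch of ours computes the co-metric value directly as a covariance:

```python
import numpy as np

# Sketch (our own): the Fisher co-metric value (2.9) is a covariance,
# and the squared norm (2.10) is a variance.
rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()       # a point of P(Omega), |Omega| = 5
A, B = rng.normal(size=5), rng.normal(size=5)

def cov(p, A, B):
    # Cov_p(A, B) = < A - <A>_p, B - <B>_p >_p, cf. (2.8)
    Ea, Eb = np.dot(p, A), np.dot(p, B)
    return float(np.dot(p, (A - Ea) * (B - Eb)))

# ||delta_p(A)||_p^2 = V_p(A), and constants do not matter (Prop. 2.1):
assert abs(cov(p, A, A) - (np.dot(p, A**2) - np.dot(p, A)**2)) < 1e-12
assert abs(cov(p, A + 2.0, B) - cov(p, A, B)) < 1e-12
```

Since constants are invisible to \(\textrm{Cov}_p\), the value depends only on the class of A in \(\mathbb {R}^\Omega / \mathbb {R}\), i.e. only on \(\delta _p (A)\), as required for (2.9) to be well defined.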

Remark 2.2

Eq. (2.10) is found in Theorem 2.7 of the book [1], where the norm and the inner product on the cotangent space were considered to be induced from the Fisher metric.

3 The correspondence between a metric and a co-metric

By a standard fact of linear algebra, an inner product \(\langle \cdot , \cdot \rangle \) on an \(\mathbb {R}\)-linear space V establishes a natural linear isomorphism between V and its dual space \(V^*\), which we denote by \({\mathop {\longleftrightarrow }\limits ^{\langle \cdot , \cdot \rangle }}\). This gives a one-to-one correspondence between a metric on a manifold M and a co-metric on M as follows. Given a metric g on M, a tangent vector \(X_p\in T_{p} (M)\) and a cotangent vector \(\alpha _p\in T^*_{p} (M)\) at a point \(p\in M\) correspond to each other by

$$\begin{aligned} X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \alpha _p&\; \Leftrightarrow \; \forall Y_p\in T_{p} (M), \; \alpha _p (Y_p) = g_p (X_p, Y_p). \end{aligned}$$
(3.1)

The correspondence is extended to the correspondence between a vector field \(X\in \mathfrak {X}(M)\) and a 1-form \(\alpha \in \mathfrak {D}(M)\) by

$$\begin{aligned} X {\mathop {\longleftrightarrow }\limits ^{g}} \alpha \; \Leftrightarrow \; \forall p\in M, \;\; X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \alpha _p. \end{aligned}$$
(3.2)

(Note: some literature refers to this correspondence as the musical isomorphism with notation \(\alpha = X^\flat \) and \(X = \alpha ^\sharp \), while we will use the symbol \(\sharp \) for a different meaning later.) This correspondence determines a co-metric on M, which is denoted by the same symbol g, such that for every \(p\in M\)

$$\begin{aligned} X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \alpha _p \;\; \text {and} \;\; Y_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \beta _p \; \Rightarrow \; g_p (\alpha _p, \beta _p) = g_p (X_p, Y_p). \end{aligned}$$
(3.3)

Conversely, given a co-metric g on M, the correspondence \({\mathop {\longleftrightarrow }\limits ^{g_p}}\) is defined by

$$\begin{aligned} X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \alpha _p&\; \Leftrightarrow \; \forall \beta _p\in T^*_{p} (M), \; \beta _p (X_p) = g_p (\alpha _p, \beta _p), \end{aligned}$$
(3.4)

and a metric on M is defined by the same relation as (3.3). It should be noted that when a metric and a co-metric correspond in this way, the relations (3.1) and (3.4) are equivalent, so that there arises no confusion even if we use the same symbol g for the corresponding metric and co-metric in \({\mathop {\longleftrightarrow }\limits ^{g_p}}\) and \({\mathop {\longleftrightarrow }\limits ^{g}}\).

Note that for an arbitrary coordinate system \((\xi ^i)\) of M, \(g_{ij}:= g(\frac{\partial }{\partial \xi ^i}, \frac{\partial }{\partial \xi ^j})\) and \(g^{ij}:= g(d\xi ^i, d\xi ^j)\) form the inverse matrices of each other at every point of M. Note also that the norms for \((T_{p}(M), g_p)\) and \((T^*_{p}(M), g_p)\) are linked by

$$\begin{aligned} \Vert X_p \Vert _{p}&= \max _{\alpha _p \in T^*_{p} (M)\setminus \{0\}} \frac{|\alpha _p (X_p)|}{\Vert \alpha _p \Vert _{p}} \end{aligned}$$
(3.5)
$$\begin{aligned} \text {and}\qquad \Vert \alpha _p \Vert _{p}&= \max _{X_p\in T_{p} (M)\setminus \{0\}} \frac{|\alpha _p (X_p)|}{\Vert X_p \Vert _{p}}, \end{aligned}$$
(3.6)

where the \(\max \)’s in these equations are achieved by those \(X_p\) and \(\alpha _p\) which correspond to each other by \({\mathop {\longleftrightarrow }\limits ^{g_p}}\) up to a constant factor.
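The duality (3.5)–(3.6) can be checked in coordinates (our own linear-algebra sketch, with a randomly generated metric as an assumed example): for a metric with Gram matrix \(G\), the dual norm squared of a covector with coordinates \(a\) is \(a^{\textrm{t}} G^{-1} a\), and the \(\max \) in (3.6) is attained at \(x = G^{-1} a\), the vector corresponding to the covector under (3.4).

```python
import numpy as np

# Sketch (our own): dual norms (3.6) in coordinates.  For a metric with
# Gram matrix Gm, ||alpha||^2 = a^T Gm^{-1} a, and the maximizer of
# |a . x|^2 / (x^T Gm x) is x* = Gm^{-1} a.
rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
Gm = B @ B.T + np.eye(3)               # a positive-definite Gram matrix
a = rng.normal(size=3)                 # coordinates of a covector

dual_sq = float(a @ np.linalg.solve(Gm, a))
xs = rng.normal(size=(2000, 3))        # random nonzero tangent vectors
ratios = (xs @ a) ** 2 / np.einsum('ij,jk,ik->i', xs, Gm, xs)
assert ratios.max() <= dual_sq + 1e-9  # no x beats the dual norm ...

x_star = np.linalg.solve(Gm, a)        # the correspondence (3.4)
attained = (a @ x_star) ** 2 / (x_star @ Gm @ x_star)
assert abs(attained - dual_sq) < 1e-9  # ... and x* attains it
```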

For a tangent vector \(X_p\in T_{p} (\mathcal {P})\), define

$$\begin{aligned} L_{X_p}:= X_p^{(\textrm{m})} / p\; \in \{A\in \mathbb {R}^\Omega \,\vert \, \langle A\rangle _{p} =0\}, \end{aligned}$$
(3.7)

which is the derivative of the map \(\mathcal {P}\rightarrow \mathbb {R}^\Omega \), \(p\mapsto \log p\) w.r.t. \(X_p\). (In [1], \(L_{X_p}\) is called the e-representation of \(X_p\) and is denoted by \(X_p^{(\textrm{e})}\).) Note that \(L_{X_p}\) is characterized by (cf. (2.6))

$$\begin{aligned} \forall A\in \mathbb {R}^{\Omega }, \; \delta _p (A) (X_p) = X_p \langle A\rangle = \langle L_{X_p}, A\rangle _p. \end{aligned}$$
(3.8)

The following proposition shows that the metric induced from the Fisher co-metric g by the correspondence \({\mathop {\longleftrightarrow }\limits ^{g}}\) is the Fisher metric.

Proposition 3.1

For each point \(p\in \mathcal {P}\), we have:

1.

    \(\forall A\in \mathbb {R}^{\Omega }, \;\forall X_p\in T_{p}(\mathcal {P}), \;\; X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \delta _p (A) \; \Leftrightarrow \; L_{X_p} = A - \langle A\rangle _p\).

2.

    \(\forall X_p, Y_p\in T_{p}(\mathcal {P}), \;\; g_p (X_p, Y_p) = \langle L_{X_p}, L_{Y_p}\rangle _p\).

Proof

1: According to (3.4), the condition \(X_p {\mathop {\longleftrightarrow }\limits ^{g_p}} \delta _p (A)\) is equivalent to

$$\begin{aligned} \forall B\in \mathbb {R}^\Omega , \; g_p (\delta _p (A), \delta _p (B)) = \delta _p (B) (X_p). \end{aligned}$$
(3.9)

Here the LHS is equal to

$$\begin{aligned} \langle A-\langle A\rangle _p, B-\langle B\rangle _p\rangle _p = \langle A-\langle A\rangle _p, B\rangle _p, \end{aligned}$$

while the RHS is equal to \(\langle L_{X_p}, B\rangle _p\) by (3.8). Hence, (3.9) is equivalent to \(L_{X_p} = A-\langle A\rangle _p\).

2: Obvious from item 1 and (3.3). \(\square \)
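Proposition 3.1 can be spot-checked numerically (our own sketch): starting from a random A, set \(L_{X_p} = A - \langle A\rangle _p\), recover \(X_p^{(\textrm{m})} = p\, L_{X_p}\) from (3.7), and compare \(\langle L_{X_p}, L_{X_p}\rangle _p\) with \(V_p (A) = \Vert \delta _p(A) \Vert _p^2\) from (2.10).

```python
import numpy as np

# Sketch (our own) of Proposition 3.1: if X_p <-> delta_p(A), then
# L_{X_p} = A - <A>_p, so X_p^m = p * L_{X_p} by (3.7), and
# g_p(X_p, X_p) = <L, L>_p coincides with V_p(A) = ||delta_p(A)||_p^2.
rng = np.random.default_rng(3)
p = rng.random(5); p /= p.sum()
A = rng.normal(size=5)

L = A - np.dot(p, A)           # e-representation of X_p
Xm = p * L                     # m-representation; zero-sum as required
assert abs(Xm.sum()) < 1e-12

metric_sq = float(np.dot(p, L * L))                  # <L, L>_p
var_A = float(np.dot(p, A**2) - np.dot(p, A) ** 2)   # V_p(A)
assert abs(metric_sq - var_A) < 1e-12
```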

4 The Fisher co-metric on a submanifold and the Cramér–Rao inequality

Let M be an arbitrary submanifold of \(\mathcal {P}\). Then a metric on M is induced as the restriction of the Fisher metric g, which we denote by \(g_M: p\mapsto g_{M, p} = g_p\vert _{T_{p}(M)^2}\). When a coordinate system \(\xi = (\xi ^i)\) is given on M, corresponding to (2.2) it holds that

$$\begin{aligned} L_{(\partial _i)_p} = \partial _i \log p_\xi \quad \text {at} \quad p = p_\xi . \end{aligned}$$
(4.1)

We have

$$\begin{aligned} g_{M, ij} (p):= g_{M, p} ((\partial _i)_p, (\partial _j)_p) = \langle \partial _i \log p_\xi , \partial _j \log p_\xi \rangle _p, \end{aligned}$$
(4.2)

which defines the Fisher information matrix \(G_M (p) = [g_{M, ij} (p) ]\). The metric \(g_{M}\) induces a co-metric on M, which is denoted by the same symbol \(g_{M}\). Letting

$$\begin{aligned} g_{M}^{ij} (p):= g_{M, p} ((d\xi ^i)_p, (d\xi ^j)_p), \end{aligned}$$
(4.3)

we have \(G_M (p)^{-1} = [g_{M}^{ij} (p) ]\).
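A concrete instance of \(G_M (p)^{-1} = [g_{M}^{ij} (p) ]\) (our own example, not from the paper): take \(M = \mathcal {P}(\Omega )\) itself with coordinates \(\xi ^i = p(i)\), \(i = 1, \ldots , n\). Then \(\langle \mathbb {1}_i\rangle = \xi ^i\) for the indicator \(\mathbb {1}_i\), so \(d\xi ^i = \delta _p (\mathbb {1}_i)\) and, by (2.9), \(g^{ij} = \textrm{Cov}_p (\mathbb {1}_i, \mathbb {1}_j)\), which indeed inverts the Fisher information matrix:

```python
import numpy as np

# Example (our own): M = P(Omega) with coordinates xi^i = p(i).  Then
# d xi^i = delta_p(1_i) for the indicator 1_i, so by (2.9)
# g^{ij} = Cov_p(1_i, 1_j), which must invert the Fisher matrix (4.2).
p = np.array([0.2, 0.3, 0.1, 0.4])
n = len(p) - 1

# D[w, i] = d p(w) / d xi^i for the chart (p(1), ..., p(n)).
D = np.vstack([np.eye(n), -np.ones((1, n))])
G = D.T @ np.diag(1.0 / p) @ D                     # g_ij, cf. (4.2)
Ginv = np.diag(p[:n]) - np.outer(p[:n], p[:n])     # Cov_p(1_i, 1_j)
assert np.allclose(G @ Ginv, np.eye(n))
```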

Suppose that a cotangent vector \(\alpha _p \in T^*_{p}(M)\) on M is the restriction of a cotangent vector \(\tilde{\alpha }_p \in T^*_{p}(\mathcal {P})\) on \(\mathcal {P}\); i.e., \(\alpha _p = \tilde{\alpha }_p\vert _{T_{p} (M)}\). Then, it follows from (3.6) that

$$\begin{aligned} \Vert \alpha _p \Vert _{M, p}&= \max _{X_p\in T_{p} (M)\setminus \{0\}} \frac{|\alpha _p (X_p)|}{\Vert X_p \Vert _{M, p}} \nonumber \\&= \max _{X_p\in T_{p} (M)\setminus \{0\}} \frac{|\tilde{\alpha }_p (X_p)|}{\Vert X_p \Vert _{p}} \nonumber \\&\le \max _{X_p\in T_{p} (\mathcal {P})\setminus \{0\}} \frac{|\tilde{\alpha }_p (X_p)|}{\Vert X_p \Vert _{p}} = \Vert \tilde{\alpha }_p \Vert _{p} . \end{aligned}$$
(4.4)

(Note that \(\Vert X_p \Vert _p = \Vert X_p \Vert _{M, p}\) since the metric on M is the restriction of the metric on \(\mathcal {P}\).) Furthermore, for an arbitrary \(\alpha _p \in T^*_{p}(M)\), there always exists \(\tilde{\alpha }_p \in T^*_{p} (\mathcal {P})\) satisfying \(\alpha _p = \tilde{\alpha }_p\vert _{T_{p} (M)}\) and \(\Vert \alpha _p \Vert _{M, p} =\Vert \tilde{\alpha }_p \Vert _{p}\). Indeed, letting \(X_p\in T_{p}(M)\) be defined by \(X_p {\mathop {\longleftrightarrow }\limits ^{g_{M, p}}} \alpha _p\), such an \(\tilde{\alpha }_p\) is obtained by \(X_p {\mathop {\longleftrightarrow }\limits ^{g_{p}}} \tilde{\alpha }_p\).

The above observations lead to the following proposition.

Proposition 4.1

1.

    For any \(\alpha _p \in T^*_{p} (M)\), we have

    $$\begin{aligned} \Vert \alpha _p \Vert _{M, p}&= \min \{\Vert \tilde{\alpha }_p \Vert _p\, \,\vert \, \,\tilde{\alpha }_p\in T^*_{p}(\mathcal {P}) \;\;\text {and}\;\; \alpha _p = \tilde{\alpha }_p\vert _{T_{p}(M)}\} \nonumber \\&= \Vert {(\alpha _p)}^\sharp \Vert _p, \end{aligned}$$
    (4.5)

    where \({(\alpha _p)}^\sharp := \arg \min _{\tilde{\alpha }_p}\{\Vert \tilde{\alpha }_p \Vert _p \,\vert \, \cdots \}\).

2.

    For any \(\alpha _p, \beta _p \in T^*_{p} (M)\), we have

    $$\begin{aligned} g_{M, p} (\alpha _p, \beta _p) = g_p ({(\alpha _p)}^\sharp , {(\beta _p)}^\sharp ). \end{aligned}$$
    (4.6)

The above proposition shows that the Fisher co-metric on M can be defined from the Fisher co-metric on \(\mathcal {P}\) directly by (4.5) and (4.6), not by way of the Fisher metric.

Corollary 4.2

(The Cramér–Rao inequality) Suppose that an n-tuple of random variables \(\textbf{A}=(A^1, \dots , A^n )\in (\mathbb {R}^\Omega )^n\) satisfies

$$\begin{aligned} \forall i\in \{1, \ldots , n\}, \;\; (d\xi ^i)_p = \delta _p (A^i)\vert _{T_{p}(M)} \end{aligned}$$
(4.7)

for a coordinate system \(\xi = (\xi ^i)\) of an n-dimensional submanifold M of \(\mathcal {P}\) and for a point \(p\in M\). Letting \(V_p (\textbf{A}) = [v^{ij}] \in \mathbb {R}^{n\times n}\) be the variance-covariance matrix of \(\textbf{A}\) defined by

$$\begin{aligned} v^{ij}:= \textrm{Cov}_p (A^i, A^j) \end{aligned}$$
(4.8)

and letting \(G_M (p)\) be the Fisher information matrix, we have

$$\begin{aligned} V_p (\textbf{A}) \ge {G_M (p)}^{-1}. \end{aligned}$$
(4.9)

Proof

For an arbitrary column vector \(c= (c_i)\in \mathbb {R}^n\), let

$$\begin{aligned} \tilde{\alpha }_p&:= \sum _i c_i \delta _p (A^i) \in T^*_{p} (\mathcal {P}),\\ \alpha _p&:= \sum _i c_i (d\xi ^i)_p \in T^*_{p} (M). \end{aligned}$$

Since (4.7) implies that \(\tilde{\alpha }_p\vert _{T_{p}(M)} = \alpha _p\), it follows from Proposition 4.1 that \(\Vert \tilde{\alpha }_p \Vert _p \ge \Vert \alpha _p \Vert _{M, p}\). Noting that \(\Vert \tilde{\alpha }_p \Vert _p^2 = {}^{\textrm{t}}{c}\, V_p (\textbf{A})\, c\) and \(\Vert \alpha _p \Vert _{{M, p}}^2 = {}^{\textrm{t}}{c}\, {G_M (p)}^{-1} c\), where \({}^{\textrm{t}}{}\) denotes the transpose, we obtain (4.9). \(\square \)

5 On the e, m-connections

An affine connection is usually treated as a connection on the tangent bundle, while it corresponds to a connection on the cotangent bundle by the relation

$$\begin{aligned} \forall X, Y \in \mathfrak {X}(M), \; \forall \alpha \in \mathfrak {D}(M), \;\; X \alpha (Y) = \alpha (\nabla _X Y) +( \nabla _X \alpha ) (Y). \end{aligned}$$
(5.1)

This correspondence is one-to-one, so that we can define an affine connection by specifying a connection on the cotangent bundle. Therefore, the \(\alpha \)-connection in information geometry can also be introduced in this way. Although affine connections are outside the main subject of this paper, we will briefly discuss the significance of defining the m-connection (i.e. the \((\alpha = -1)\)-connection) in this way, since it is closely related to the role of the Fisher co-metric in the Cramér–Rao inequality.

We start by introducing the m-connection \(\nabla ^{(\textrm{m})}\) on \(\mathcal {P}= \mathcal {P}(\Omega )\) as a flat connection on the cotangent bundle for which the 1-form \(d\langle A\rangle \) is parallel for any \(A\in \mathbb {R}^\Omega \); i.e.,

$$\begin{aligned} \forall X\in \mathfrak {X}(\mathcal {P}), \forall A\in \mathbb {R}^\Omega , \; \nabla ^{(\textrm{m})}_X d\langle A\rangle =0. \end{aligned}$$
(5.2)

Since \(\dim \{d \langle A\rangle \,\vert \, A\in \mathbb {R}^\Omega \} = \vert \Omega \vert - 1 = \dim \mathcal {P}\), (5.2) implies that every parallel 1-form is represented as \(d\langle A\rangle \) by some \(A\in \mathbb {R}^\Omega \). Then the correspondence (5.1) determines a connection on the tangent bundle, which is denoted by the same symbol \(\nabla ^{(\textrm{m})}\). Letting \(\alpha = d\langle A\rangle \) in (5.1) and applying (5.2), we have

$$\begin{aligned} \forall X, Y\in \mathfrak {X}(\mathcal {P}), \forall A\in \mathbb {R}^\Omega , \; X Y \langle A\rangle = (\nabla ^{(\textrm{m})}_{X} Y) \langle A\rangle . \end{aligned}$$
(5.3)

This implies that, for any \(Y\in \mathfrak {X}(\mathcal {P})\),

$$\begin{aligned}&Y \text { is m-parallel} \nonumber \\&\; \Leftrightarrow \; \forall A\in \mathbb {R}^\Omega , \, \forall X\in \mathfrak {X}(\mathcal {P}), \; X Y \langle A\rangle =0 \nonumber \\&\; \Leftrightarrow \; \forall A\in \mathbb {R}^\Omega , \; Y_p \langle A\rangle = \sum _\omega Y_p^{(\textrm{m})} (\omega ) A(\omega ) \text { does not depend on } p\in \mathcal {P}\nonumber \\&\; \Leftrightarrow \; Y_p^{(\textrm{m})} \text { does not depend on } p\in \mathcal {P}, \end{aligned}$$
(5.4)

where “m-parallel” means “parallel w.r.t. \(\nabla ^{(\textrm{m})}\)”. Since this property characterizes the m-connection on \(\mathcal {P}\) (e.g. Equation (2.39) of [1]), our definition of the m-connection is equivalent to the usual definition in information geometry.

Next, we define the e-connection \(\nabla ^{(\textrm{e})}\) as the dual connection of \(\nabla ^{(\textrm{m})}\) w.r.t. the Fisher metric g ([1, 2]), which means that

$$\begin{aligned} \forall X, Y, Z\in \mathfrak {X}(\mathcal {P}), \; Z g(X, Y) = g(\nabla ^{(\textrm{e})}_{Z} X, Y) + g(X, \nabla ^{(\textrm{m})}_{Z} Y). \end{aligned}$$
(5.5)

Using (5.1), we can rewrite (5.5) into

$$\begin{aligned} \forall X, Y, Z \in&\mathfrak {X}(\mathcal {P}),\, \forall \alpha \in \mathfrak {D}(\mathcal {P}), \; \nonumber \\&\,X {\mathop {\longleftrightarrow }\limits ^{g}} \alpha \; \Rightarrow \; (\nabla ^{(\textrm{m})}_{Z} \alpha ) (Y) = g (\nabla ^{(\textrm{e})}_{Z} X, Y). \end{aligned}$$
(5.6)

This implies that, for any \(X\in \mathfrak {X}(\mathcal {P})\) and \(\alpha \in \mathfrak {D}(\mathcal {P})\),

$$\begin{aligned} X {\mathop {\longleftrightarrow }\limits ^{g}} \alpha \; \Rightarrow \; \Bigl [\, X \text { is e-parallel} \; \Leftrightarrow \; \alpha \text { is m-parallel} \, \Bigr ]. \end{aligned}$$
(5.7)

Now, let us recall the situation of Corollary 4.2. An estimator \(\textbf{A}=(A^1, \dots , A^n )\) is said to be efficient for the statistical model \((M, \xi )\) when it is unbiased (i.e. \(\forall i\), \(\xi ^i = \langle A^i\rangle \vert _{M}\)) and achieves the equality in the Cramér–Rao inequality (4.9) for every \(p\in M\). Noting that the achievability at each \(p\in M\) is represented by the condition \(\forall i\), \(\delta _p (A^i) = (d\langle A^i\rangle )_p = {((d\xi ^i)_p)}^\sharp \) and recalling (5.2), we can see that the condition for \((M, \xi )\) to have an efficient estimator is expressed as

$$\begin{aligned} \forall i, \; \exists \tilde{\alpha }^i \in \mathfrak {D}(\mathcal {P}), \;\; \tilde{\alpha }^i \text { is m-parallel and}\;\; \forall p\in M, \; \tilde{\alpha }^i_p = {((d\xi ^i)_p)}^\sharp . \end{aligned}$$
(5.8)

On the other hand, it is well known that the existence of an efficient estimator is equivalent to the condition that M is an exponential family and that \(\xi \) is an expectation coordinate system, which can be rephrased as (see Theorem 3.12 of [1])

$$\begin{aligned} M \text { is an e-autoparallel submanifold of } \mathcal {P}, \nonumber \\ \text {and } \xi \text { is an m-affine coordinate system.} \end{aligned}$$
(5.9)

Therefore, the two conditions (5.8) and (5.9) are necessarily equivalent. These are both purely geometrical conditions for a submanifold of the dually flat space \((\mathcal {P}, g, \nabla ^{(\textrm{e})}, \nabla ^{(\textrm{m})})\), and we can prove their equivalence within this geometrical framework, forgetting its statistical background. Indeed, the equivalence can be proved in a more general situation where M is a submanifold of a manifold S equipped with a Riemannian metric g and a pair of dual affine connections \(\nabla , \nabla ^*\), on the assumption that \(\nabla ^*\) is flat. Note that this assumption is weaker than the dual flatness of \((S, g, \nabla , \nabla ^*)\) in that \(\nabla \) is allowed to have non-vanishing torsion, which is essential in application to quantum estimation theory. See Sect. 7 of [4] for details.

6 Monotonicity

The monotonicity with respect to a Markov map is known to be an important and characteristic property of the Fisher metric. In this section we discuss the monotonicity of the Fisher co-metric and its relation to the variance of random variables.

Let \(\Omega _1\) and \(\Omega _2\) be arbitrary finite sets, and let \(\mathcal {P}_i:= \mathcal {P}(\Omega _i)\) for \(i=1, 2\). A map \(\Phi : \mathcal {P}_1 \rightarrow \mathcal {P}_2\) is called a Markov map when it is affine in the sense that \(\forall p, q\in \mathcal {P}_1 \), \(0\le \forall a \le 1\), \(\Phi (a p + (1-a) q ) = a\, \Phi (p) + (1-a)\, \Phi (q)\). Every Markov map \(\Phi \) is represented as

$$\begin{aligned} \forall p\in \mathcal {P}_1, \;\; \Phi (p) = \sum _{x\in \Omega _1} W(\cdot \,\vert \, x) p(x), \end{aligned}$$
(6.1)

where W is a surjective channel from \(\Omega _1\) to \(\Omega _2\); i.e.,

$$\begin{aligned} \forall (x, y)\in \Omega _1\times \Omega _2, \; W(y\,\vert \, x) \ge 0, \quad \forall x\in \Omega _1, \; \sum _{y\in \Omega _2} W(y\,\vert \, x) =1, \end{aligned}$$
(6.2)
$$\begin{aligned} \text {and}\quad \forall y\in \Omega _2, \; \exists x\in \Omega _1, \; W(y\,\vert \, x) >0. \end{aligned}$$
(6.3)

When \(\Phi \) is represented as (6.1), we write \(\Phi = \Phi _W\).
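In matrix form (our own sketch, Python/NumPy), (6.1) says that \(\Phi _W\) is left-multiplication by the column-stochastic matrix \(W[y, x] = W(y \,\vert \, x)\):

```python
import numpy as np

# Sketch (our own): a Markov map Phi_W (6.1) is multiplication by a
# column-stochastic matrix W[y, x] = W(y | x) satisfying (6.2)-(6.3).
rng = np.random.default_rng(5)
W = rng.random((3, 5))
W /= W.sum(axis=0, keepdims=True)      # columns sum to 1, cf. (6.2)

p = rng.random(5); p /= p.sum()
q = W @ p                              # Phi_W(p)
assert abs(q.sum() - 1.0) < 1e-12 and (q > 0).all()

# Affinity: Phi_W(a p + (1-a) r) = a Phi_W(p) + (1-a) Phi_W(r).
r = rng.random(5); r /= r.sum()
a = 0.3
assert np.allclose(W @ (a*p + (1 - a)*r), a*(W @ p) + (1 - a)*(W @ r))
```

Since the random W here is entrywise positive, the surjectivity condition (6.3) holds automatically.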

More generally, for a submanifold M of \(\mathcal {P}_1\) and a submanifold N of \(\mathcal {P}_2\), a map \(\varphi : M \rightarrow N\) is called a Markov map when there exists a Markov map \(\Phi : \mathcal {P}_1 \rightarrow \mathcal {P}_2\) such that \(\varphi = \Phi \vert _M\). Since a Markov map \(\varphi \) is smooth, it induces at each \(p\in M\) the differential

$$\begin{aligned} \varphi _* = \varphi _{*,p} = (d\varphi )_p: T_{p} (M) \rightarrow T_{\varphi (p)} (N) \end{aligned}$$
(6.4)

and its dual

$$\begin{aligned} \varphi ^* = \varphi ^*_p = {}^{\textrm{t}}{ (d\varphi )_p }: T^*_{\varphi (p)} (N) \rightarrow T^*_{p} (M), \end{aligned}$$
(6.5)

where \({}^{\textrm{t}}{}\) denotes the transpose of a linear map. See Remark 6.2 below for the notation \(\varphi ^* = \varphi ^*_p\).

As is well known, the Fisher metric satisfies the following monotonicity property for its norm:

$$\begin{aligned} \forall p\in M, \forall X_p\in T_{p} (M), \;\; \Vert \varphi _*(X_p) \Vert _{\varphi (p)} \le \Vert X_p \Vert _{p}. \end{aligned}$$
(6.6)

The cotangent version of the monotonicity is given below.

Proposition 6.1

We have

$$\begin{aligned} \forall p\in M, \forall \alpha _{\varphi (p)} \in T^*_{\varphi (p)} (N), \;\; \Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p} \le \Vert \alpha _{\varphi (p)} \Vert _{N, \varphi (p)}, \end{aligned}$$
(6.7)

where \(\Vert \cdot \Vert _{M, p}\) and \(\Vert \cdot \Vert _{N, \varphi (p)}\) denote the norms w.r.t. the Fisher co-metrics \(g_{M}\) and \(g_{N}\), respectively.

Proof

Since the inequality is trivial when \(\Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p} = 0\), we assume \(\Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p} > 0\). Then, invoking (3.6), we have

$$\begin{aligned} \Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p}&= \max _{X_p\in T_{p} (M) \setminus \{0\}} \frac{\vert \varphi ^* (\alpha _{\varphi (p)}) (X_p)\vert }{\Vert X_p \Vert _{M, p}} \\&= \max _{X_p\in T_{p} (M) \setminus \{0\}} \frac{\vert \alpha _{\varphi (p)} (\varphi _*(X_p))\vert }{\Vert X_p \Vert _p} \\&= \max _{X_p\in T_{p} (M) : \varphi _*(X_p) \ne 0} \frac{\vert \alpha _{\varphi (p)} (\varphi _*(X_p))\vert }{\Vert X_p \Vert _p} \\&\le \sup _{X_p\in T_{p} (M) : \varphi _*(X_p) \ne 0} \frac{\vert \alpha _{\varphi (p)} (\varphi _*(X_p))\vert }{\Vert \varphi _*(X_p) \Vert _{\varphi (p)}} \\&\le \max _{Y_{\varphi (p)}\in T_{\varphi (p)} (N) \setminus \{0\}} \frac{\vert \alpha _{\varphi (p)} (Y_{\varphi (p)})\vert }{\Vert Y_{\varphi (p)} \Vert _{\varphi (p)}} = \Vert \alpha _{\varphi (p)} \Vert _{N, \varphi (p)}, \end{aligned}$$

where the third equality follows since any \(X_p\) achieving \(\max _{X_p\in T_{p} (M) \setminus \{0\}}\) must satisfy \(\varphi _*(X_p) \ne 0\) due to the assumption \(\Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p} > 0\), and the first \(\le \) follows from (6.6). \(\square \)
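The underlying monotonicity (6.6) itself can be stress-tested numerically (our own sketch), using \(\Vert X_p \Vert _p^2 = \sum _\omega X_p^{(\textrm{m})}(\omega )^2 / p(\omega )\), which follows from Proposition 3.1 and (3.7), together with the push-forward formula (6.8):

```python
import numpy as np

# Sketch (our own): the Fisher norm contracts under a channel, cf. (6.6).
# By Prop. 3.1 and (3.7), ||X_p||_p^2 = sum_w (X^m(w))^2 / p(w), and by
# (6.8) the push-forward acts on m-representations as X^m -> W X^m.
rng = np.random.default_rng(6)
W = rng.random((3, 6)); W /= W.sum(axis=0, keepdims=True)

def fisher_sq(p, Xm):
    return float(np.sum(Xm**2 / p))

for _ in range(200):
    p = rng.random(6); p /= p.sum()
    Xm = rng.normal(size=6); Xm -= Xm.mean()
    assert fisher_sq(W @ p, W @ Xm) <= fisher_sq(p, Xm) + 1e-9
```

The contraction is the Cauchy–Schwarz estimate \((\sum _x W(y\vert x) X^{(\textrm{m})}(x))^2 \le (\sum _x W(y\vert x) p(x)) (\sum _x W(y\vert x) X^{(\textrm{m})}(x)^2/p(x))\) summed over y.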

Remark 6.2

We have written \(\varphi ^* (\alpha _{\varphi (p)})\) for \(\varphi ^*_p (\alpha _{\varphi (p)})\) above (and will use similar notations throughout the paper), considering that omitting p from \(\varphi ^*_p\) is harmless in the context and is better for the readability of expressions. Note that the notation \(\varphi ^* (\alpha _{\varphi (p)})\), if it appears alone, is mathematically ambiguous unless \(\varphi ^{-1} (\varphi (p))\) is the singleton \(\{p\}\) in that \(\varphi ^*_{p'} (\alpha _{\varphi (p)}) \in T^*_{p'}(M)\) depends on a choice of \(p' \in \varphi ^{-1} (\varphi (p))\). On the other hand, the notation \(\varphi _*(X_p)\) has no such ambiguity, since we know that the argument \(X_p\) belongs to \(T_{p}(M)\) and hence \(\varphi _*\) must be \(\varphi _{*, p}\).

Let us consider the case when \(M = \mathcal {P}_1\) and \(N = \mathcal {P}_2\), and let \(\Phi = \Phi _W: \mathcal {P}_1 \rightarrow \mathcal {P}_2\) be an arbitrary Markov map represented by a surjective channel W. Recalling (6.1) and the definition of m-representation of tangent vectors, we have

$$\begin{aligned} Y_{\Phi (p)} = \Phi _* (X_p ) \; \Leftrightarrow \; Y_{\Phi (p)}^{(\textrm{m})} = \sum _{x\in \Omega _1} W(\cdot \,\vert \, x) X_p^{(\textrm{m})} (x) \end{aligned}$$
(6.8)

for \(X_p \in T_{p}(\mathcal {P}_1)\) and \(Y_{\Phi (p)}\in T_{\Phi (p)} (\mathcal {P}_2)\). We claim that

$$\begin{aligned} \forall A\in \mathbb {R}^{\Omega _2}, \;\; \Phi ^* (\delta _{\Phi (p)} (A)) = \delta _p (E_W (A\,\vert \,\cdot )), \end{aligned}$$
(6.9)

where \(\Phi ^* = \Phi ^*_p\), and \(E_W (A\,\vert \,\cdot ) \in \mathbb {R}^{\Omega _1}\) denotes the conditional expectation of A defined by

$$\begin{aligned} \forall x\in \Omega _1, \;\; E_W (A\,\vert \,x) = \sum _{y\in \Omega _2} W(y \,\vert \, x) A(y). \end{aligned}$$
(6.10)

Eq. (6.9) is verified as follows: for every \(\beta _{p} = \delta _p (B) \in T^*_{p} (\mathcal {P}_1)\), where \(B \in \mathbb {R}^{\Omega _1}\), we have

$$\begin{aligned}&\beta _{p} = \Phi ^* (\delta _{\Phi (p)} (A)) \nonumber \\&\; \Leftrightarrow \; \forall X_p\in T_{p} (\mathcal {P}_1), \;\; \beta _{p} (X_p) =\delta _{\Phi (p)} (A)(\Phi _* (X_p)) \nonumber \\&\; \Leftrightarrow \; \forall X_p\in T_{p} (\mathcal {P}_1), \;\; \sum _{x\in \Omega _1} X^{(\textrm{m})}_p (x) B(x) = \sum _{(x, y) \in \Omega _1\times \Omega _2} W(y \,\vert \, x) X_p^{(\textrm{m})} (x) A(y) \nonumber \\&\; \Leftrightarrow \; B - E_W (A\,\vert \,\cdot ) \in \mathbb {R}\nonumber \\&\; \Leftrightarrow \; \beta _p = \delta _p ( E_W (A\,\vert \,\cdot )), \end{aligned}$$
(6.11)

where the second \(\Leftrightarrow \) follows from (2.6) and (6.8), the third \(\Leftrightarrow \) follows from \(T^{(\textrm{m})}_{p}(\mathcal {P}_1) = \bigl (\mathbb {R}^{\Omega _1}\bigr )_0\), and \(\mathbb {R}\) is identified with the set of constant functions on \(\Omega _1\).

Invoking (2.10) and (6.9), we see that the monotonicity (6.7) is equivalent to the following well-known inequality for the variance:

$$\begin{aligned} \forall A\in \mathbb {R}^{\Omega _2}, \;\; V_{p} (E_W (A\,\vert \,\cdot ) ) \le V_{\Phi (p)} (A), \end{aligned}$$
(6.12)

which we refer to as the monotonicity of the variance.
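As a sanity check, the monotonicity of the variance (6.12) is easy to verify numerically. The following Python sketch is illustrative only (the channel, distribution, and random variable are generated at random, not taken from the paper); it builds a channel W, the pushforward \(\Phi _W(p)\), and the conditional expectation (6.10), and checks the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 3                       # |Omega_1| and |Omega_2|

# Channel W(y|x): row x is a probability distribution on Omega_2
W = rng.random((n1, n2))
W /= W.sum(axis=1, keepdims=True)

p = rng.random(n1)
p /= p.sum()                        # p in P(Omega_1)
q = p @ W                           # pushforward Phi_W(p) in P(Omega_2)

A = rng.random(n2)                  # a random variable A on Omega_2
E_A = W @ A                         # conditional expectation E_W(A|x), Eq. (6.10)

def var(dist, f):
    """Variance of the random variable f under the distribution dist."""
    m = dist @ f
    return dist @ (f - m) ** 2

lhs = var(p, E_A)                   # V_p(E_W(A|.))
rhs = var(q, A)                     # V_{Phi(p)}(A)
assert lhs <= rhs + 1e-12           # monotonicity of the variance, Eq. (6.12)
```

The gap between the two sides is the expected conditional variance, so the check is an instance of the law of total variance.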

In the above proof of Proposition 6.1, we derived (6.7) from (6.6). Conversely, we can derive (6.6) from (6.7) by the use of (3.5) as follows: for any \(X_p\in T_{p}(M)\),

$$\begin{aligned} \Vert \varphi _*(X_p) \Vert _{\varphi (p)}&= \max _{\alpha _{\varphi (p)}\in T^*_{\varphi (p)}(N)\setminus \{0\}} \frac{\vert \alpha _{\varphi (p)} (\varphi _* (X_p))\vert }{\Vert \alpha _{\varphi (p)} \Vert _{N, \varphi (p)}} \nonumber \\&= \max _{\alpha _{\varphi (p)}\in T^*_{\varphi (p)}(N)\setminus \{0\}} \frac{\vert \varphi ^*(\alpha _{\varphi (p)}) (X_p)\vert }{\Vert \alpha _{\varphi (p)} \Vert _{N, \varphi (p)}} \nonumber \\&{= \max _{\alpha _{\varphi (p)}\in T^*_{\varphi (p)}(N): \varphi ^*(\alpha _{\varphi (p)}) \ne 0} \frac{\vert \varphi ^*(\alpha _{\varphi (p)}) (X_p)\vert }{\Vert \alpha _{\varphi (p)} \Vert _{N, \varphi (p)}} } \nonumber \\&\le { \sup _{\alpha _{\varphi (p)}\in T^*_{\varphi (p)}(N): \varphi ^*(\alpha _{\varphi (p)}) \ne 0} } \frac{\vert \varphi ^*(\alpha _{\varphi (p)}) (X_p)\vert }{\Vert \varphi ^*(\alpha _{\varphi (p)}) \Vert _{M, p}} \nonumber \\&\le \max _{\beta _{p}\in T^*_{p}(M)\setminus \{0\}} \frac{\vert \beta _{p} (X_p)\vert }{\Vert \beta _{p} \Vert _{M, p}} = \Vert X_p \Vert _p, \end{aligned}$$
(6.13)

where we have assumed \(\Vert \varphi _*(X_p) \Vert _{\varphi (p)} >0\) without loss of generality, which yields the third equality (cf. the proof of Proposition 6.1), and the first \(\le \) follows from (6.7). Thus, (6.6) and (6.7) are equivalent. Note that this equivalence is derived solely from a general argument on metrics and co-metrics, and does not rely on the special characteristics of the Fisher metric/co-metric. In this sense, we say that (6.6) and (6.7) are logically equivalent.

Recalling that the Fisher metric is characterized as the unique monotone metric up to a constant factor, we obtain the following propositions from the logical equivalence mentioned above.

Proposition 6.3

The monotonicity (6.7) characterizes the Fisher co-metric up to a constant factor.

Proposition 6.4

The variance is characterized up to a constant factor as the positive quadratic form for random variables satisfying the monotonicity (6.12).

Remark 6.5

We have described the above propositions in a rough form for the sake of readability. For the exact statement, we need a formulation similar to Theorems 7.1, 7.2 and 7.3 in the next section. See also Remark 8.4.

Remark 6.6

Since the monotonicity of the Fisher metric (6.6), that of the Fisher co-metric (6.7), and that of the variance (6.12) are all logically equivalent, we can derive (6.6) from the more popular (6.12).

7 Invariance

Čencov showed in [3] that the Fisher metric is characterized up to a constant factor as a covariant tensor field of degree 2 satisfying the invariance for Markov embeddings. Note that the invariance is weaker than the monotonicity and that the tensor field is assumed to be neither positive nor symmetric. In this section we review Čencov’s theorem and then investigate its co-metric version, which will be shown to be equivalent to a theorem characterizing the variance/covariance of random variables.

We begin by reviewing the invariance property of the Fisher metric. Suppose that M and N are arbitrary submanifolds of \(\mathcal {P}_1 = \mathcal {P}(\Omega _1)\) and \(\mathcal {P}_2 = \mathcal {P}(\Omega _2)\), respectively, and that a pair of Markov maps

$$\begin{aligned} \varphi : M \rightarrow N \quad \text {and}\quad \psi : N \rightarrow M \end{aligned}$$
(7.1)

satisfies

$$\begin{aligned} \psi \circ \varphi = id _M, \end{aligned}$$
(7.2)

where \(\circ \) denotes the composition of maps. Note that \(\varphi \) is injective while \(\psi \) is surjective. Given a pair of points \((p, q) \in M\times N\) satisfying

$$\begin{aligned} q = \varphi (p) \quad \text {and}\quad p = \psi (q), \end{aligned}$$
(7.3)

we have

$$\begin{aligned} \psi _{*, q} \circ \varphi _{*, p} = id _{T_{p}(M)}. \end{aligned}$$
(7.4)

It then follows from the monotonicity (6.6) that

$$\begin{aligned} \Vert X_p \Vert _{M, p} \ge \Vert \varphi _* (X_p) \Vert _{N, q} \ge \Vert \psi _* (\varphi _* (X_p)) \Vert _{M, p} = \Vert X_p \Vert _{M, p}, \end{aligned}$$
(7.5)

so that we have the invariance of the Fisher metric

$$\begin{aligned} \forall X_p\in T_{p} (M), \;\; \Vert X_p \Vert _{M, p} = \Vert \varphi _* (X_p) \Vert _{N, q}, \end{aligned}$$
(7.6)

which is equivalent to

$$\begin{aligned} \forall X_p, \forall Y_p\in T_{p} (M), \;\; g_{M, p} (X_p, Y_p) = g_{N, q} (\varphi _* (X_p), \varphi _* (Y_p)) {.} \end{aligned}$$
(7.7)

This means that \(\varphi _{*, p}: T_{p}(M) \rightarrow T_{q} (N)\) is an isometry, which is represented as

$$\begin{aligned} (\varphi _{*, p})^\dagger \circ \varphi _{*, p} = id _{T_{p}(M)}, \end{aligned}$$
(7.8)

where \((\varphi _{*, p})^\dagger : T_{q}(N) \rightarrow T_{p} (M)\) denotes the adjoint (Hermitian conjugate) of \(\varphi _{*, p} \) w.r.t. the inner products \(g_{M, p}\) and \(g_{N, q}\).

A Markov map \(\Phi : \mathcal {P}_1\rightarrow \mathcal {P}_2\) is called a Markov embedding when there exists a Markov map \(\Psi : \mathcal {P}_2\rightarrow \mathcal {P}_1\) such that

$$\begin{aligned} \Psi \circ \Phi = id _{\mathcal {P}_1}. \end{aligned}$$
(7.9)

Note that \(\vert \Omega _1\vert \le \vert \Omega _2\vert \) necessarily holds in this case. As a special case of the invariance (7.7), we have

$$\begin{aligned} \forall p\in \mathcal {P}_1,\, \forall X_p, \forall Y_p\in T_{p} (\mathcal {P}_1), \;\; g_p (X_p, Y_p) = g_{\Phi (p)} (\Phi _* (X_p), \Phi _* (Y_p)). \end{aligned}$$
(7.10)

According to Čencov, this property characterizes the Fisher metric up to a constant factor. The exact statement is presented below.

Theorem 7.1

(Čencov [3]) For \(n=2, 3, \ldots \), let \(\Omega _n:= \{1, 2, \ldots , n\}\) and \(\mathcal {P}_n:= \mathcal {P}(\Omega _n)\), and let \(g_n\) be the Fisher metric on \(\mathcal {P}_n\). Suppose that we are given a sequence \(\{h_n\}_{n=2}^\infty \), where \(h_n\) is a covariant tensor field of degree 2 on \(\mathcal {P}_n\) which continuously maps each point \(p\in \mathcal {P}_n\) to a bilinear form \(h_{n, p}: T_{p} (\mathcal {P}_n)^2 \rightarrow \mathbb {R}\). Then the following two conditions are equivalent.

  1. (i)

    \(\exists c\in \mathbb {R}, \, \forall n, \;\; h_n = c g_n\).

  2. (ii)

    For any \(m\le n\) and any Markov embedding \(\Phi : \mathcal {P}_m \rightarrow \mathcal {P}_n\), it holds that

    $$\begin{aligned} \forall p\in \mathcal {P}_m,\,&\forall X_{p}, \forall Y_p \in T_{p} (\mathcal {P}_m), \nonumber \\&h_{m, p} (X_{p}, Y_{p}) = h_{n, \Phi (p)} (\Phi _* (X_p), \Phi _* (Y_p)). \end{aligned}$$
    (7.11)

We now proceed to the invariance property of co-metrics. Let us consider the same situation as (7.1)–(7.3), which implies that

$$\begin{aligned} \varphi _p^*\circ \psi _q^* = id _{T^*_{p}(M)}. \end{aligned}$$
(7.12)

Then it follows from the monotonicity (6.7) that, for any \(\alpha _{p}\in T^*_{p}(M)\),

$$\begin{aligned} \Vert \alpha _p \Vert _{M, p} \ge \Vert \psi ^* (\alpha _p) \Vert _{N, q} \ge \Vert \varphi ^*(\psi ^* (\alpha _p)) \Vert _{M, p} = \Vert \alpha _p \Vert _{M, p}, \end{aligned}$$
(7.13)

so that we have the invariance of the Fisher co-metric

$$\begin{aligned} \forall \alpha _p \in T^*_{p}(M), \;\; \Vert \alpha _p \Vert _{M, p} = \Vert \psi ^* (\alpha _p) \Vert _{N, q}, \end{aligned}$$
(7.14)

which is equivalent to

$$\begin{aligned} \forall \alpha _p, \beta _p \in T^*_{p}(M), \;\; g_{M, p}(\alpha _p, \beta _p) = g_{N, q} (\psi ^* (\alpha _p), \psi ^* (\beta _p)){.} \end{aligned}$$
(7.15)

This means that \(\psi ^*_{q}: T^*_{p}(M) \rightarrow T^*_{q} (N)\) is an isometry, which is represented as

$$\begin{aligned} (\psi ^*_{q} )^\dagger \circ \psi ^*_{q} = id _{T^*_{p}(M)}, \end{aligned}$$
(7.16)

where \((\psi ^*_{q} )^\dagger : T^*_{q}(N) \rightarrow T^*_{p} (M)\) denotes the adjoint (Hermitian conjugate) of \(\psi ^*_{q}\) w.r.t. the inner products \(g_{M, p}\) and \(g_{N, q}\) on the cotangent spaces. Due to \(\psi ^*_{q} = {}^{\textrm{t}}{(\psi _{*, q})}\), (7.16) can be rewritten as

$$\begin{aligned} \psi _{*, q} \circ (\psi _{*, q})^\dagger = id _{T_{p}(M)}. \end{aligned}$$
(7.17)

We observed in Sect. 6 that the monotonicity of metrics (6.6) and that of co-metrics (6.7) are logically equivalent. On the other hand, such an equivalence does not hold for the invariance properties (7.6) and (7.14). Indeed, a linear-algebraic consideration shows that the implications (7.4) \(\wedge \) (7.8) \(\Rightarrow \) (7.17) and (7.4) \(\wedge \) (7.17) \(\Rightarrow \) (7.8) do not hold unless \(\varphi _{*, p}\) is a linear isomorphism (cf. Lemma 8.2). This means that we cannot expect Čencov’s theorem to yield a corollary which states that the Fisher co-metric is characterized by the invariance (7.14). Nevertheless, the statement itself is true as explained below.

Let us return to the situation of (7.9). We call a Markov map \(\Psi : \mathcal {P}_2 \rightarrow \mathcal {P}_1\) a Markov co-embedding when there exists a Markov embedding \(\Phi : \mathcal {P}_1 \rightarrow \mathcal {P}_2\) satisfying (7.9). As an example of (7.14), (7.9) implies the invariance

$$\begin{aligned} \forall p\in \mathcal {P}_1, \forall \alpha _p\in T^*_{p} (\mathcal {P}_1), \;\; \Vert \alpha _p \Vert _p = \Vert \Psi ^* (\alpha _p) \Vert _{\Phi (p)}, \end{aligned}$$
(7.18)

which can be rewritten as

$$\begin{aligned} \forall q\in \Phi (\mathcal {P}_1), \forall \alpha _{\Psi (q)} \in T^*_{\Psi (q)} (\mathcal {P}_1), \;\; \Vert \alpha _{\Psi (q)} \Vert _{\Psi (q)} = \Vert \Psi ^* (\alpha _{\Psi (q)}) \Vert _{q}. \end{aligned}$$
(7.19)

Actually, the range \(\forall q\in \Phi (\mathcal {P}_1)\) in the above equation can be extended to \(\forall q\in \mathcal {P}_2\) for the reason described below.

It is known (e.g. Lemma 9.5 of [3]) that every pair \((\Phi , \Psi )\) of Markov embedding and co-embedding satisfying (7.9) is represented in the following form:

$$\begin{aligned} \forall q \in \mathcal {P}_2, \;\; \Psi (q) = q^F \quad \text {with}\quad q^F (x):= \sum _{y\in F^{-1} (x)} q(y), \end{aligned}$$
(7.20)

and

$$\begin{aligned} \forall p \in \mathcal {P}_1, \;\; \Phi (p) = \sum _{x\in \Omega _1} p(x)\, r_x, \end{aligned}$$
(7.21)

where F is a surjection \(\Omega _2\rightarrow \Omega _1\) which yields the partition \(\Omega _2 = \bigsqcup _{x\in \Omega _1} F^{-1} (x)\), and \(\{r_x\}_{x\in \Omega _1}\) is a family of probability distributions on \(\Omega _2\) such that the support of \(r_x\) is \(F^{-1}(x)\) for every \(x\in \Omega _1\). We note that a Markov co-embedding \(\Psi \) is determined by F alone, while a Markov embedding \(\Phi \) is determined by F and \(\{r_x\}_{x\in \Omega _1}\) together. Consequently, \(\Psi \) is uniquely determined from \(\Phi \), while \(\Phi \) for a given \(\Psi \) has degrees of freedom corresponding to \(\{r_x\}_{x\in \Omega _1}\). Owing to this fact, when a Markov co-embedding \(\Psi \) and a distribution \(q\in \mathcal {P}_2\) are arbitrarily given, we can always choose a Markov embedding \(\Phi \) satisfying (7.9) and \(q \in \Phi (\mathcal {P}_1)\); indeed, defining \(r_x\) by

$$\begin{aligned} r_x (y):= \left\{ \begin{array}{ll} q(y) / q^F (x) &{} \text {if} \;\; F(y) = x \\ 0 &{} \text {otherwise}, \end{array}\right. \end{aligned}$$
(7.22)

the resulting \(\Phi \) satisfies \(q = \Phi (q^F) \in \Phi (\mathcal {P}_1)\). This is the reason why \(\forall q\in \Phi (\mathcal {P}_1)\) in (7.19) can be replaced with \(\forall q\in \mathcal {P}_2\). We thus have

$$\begin{aligned} \forall q\in \mathcal {P}_2, \forall \alpha _{\Psi (q)} \in T^*_{\Psi (q)} (\mathcal {P}_1), \;\; \Vert \alpha _{\Psi (q)} \Vert _{\Psi (q)} = \Vert \Psi ^* (\alpha _{\Psi (q)}) \Vert _{q}, \end{aligned}$$
(7.23)

or equivalently,

$$\begin{aligned} \forall q\in \mathcal {P}_2, \forall \alpha _{\Psi (q)},\;&\forall \beta _{\Psi (q)} \in T^*_{\Psi (q)} (\mathcal {P}_1), \;\; \nonumber \\&g_{\Psi (q)} (\alpha _{\Psi (q)}, \beta _{\Psi (q)}) = g_{q} ({\Psi ^*} (\alpha _{\Psi (q)}), {\Psi ^*}( \beta _{\Psi (q)})) \end{aligned}$$
(7.24)

for every Markov co-embedding \(\Psi \).
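The construction (7.20)–(7.22) is straightforward to check numerically. The sketch below is illustrative (the surjection F and the distribution q are arbitrary choices, not from the paper); it builds the co-embedding \(\Psi \) and the embedding \(\Phi \) determined by (7.22), and confirms both \(\Psi \circ \Phi = id \) and \(q = \Phi (q^F)\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical surjection F: Omega_2 -> Omega_1 with |Omega_2| = 5, |Omega_1| = 3
F = np.array([0, 0, 1, 2, 2])
n1, n2 = 3, 5

q = rng.random(n2)
q /= q.sum()                        # arbitrary q in P(Omega_2)

# Co-embedding Psi(q) = q^F, Eq. (7.20)
qF = np.array([q[F == x].sum() for x in range(n1)])

# r_x from Eq. (7.22): q conditioned on the fiber F^{-1}(x)
r = np.zeros((n1, n2))
for x in range(n1):
    r[x, F == x] = q[F == x] / qF[x]

def Phi(p):                         # Markov embedding, Eq. (7.21)
    return p @ r

def Psi(s):                         # Markov co-embedding, Eq. (7.20)
    return np.array([s[F == x].sum() for x in range(n1)])

assert np.allclose(Phi(qF), q)      # q lies in Phi(P_1), as claimed
p = rng.random(n1)
p /= p.sum()
assert np.allclose(Psi(Phi(p)), p)  # Psi o Phi = id, Eq. (7.9)
```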

The invariance (7.23) characterizes the Fisher co-metric up to a constant factor. Namely, we have the following theorem.

Theorem 7.2

For \(n=2, 3, \ldots \), let \(\Omega _n:= \{1, 2, \ldots , n\}\) and \(\mathcal {P}_n:= \mathcal {P}(\Omega _n)\), and let \(g_n\) be the Fisher co-metric on \(\mathcal {P}_n\). Suppose that we are given a sequence \(\{h_n\}_{n=2}^\infty \), where \(h_n\) is a contravariant tensor field of degree 2 on \(\mathcal {P}_n\) which continuously maps each point \(p\in \mathcal {P}_n\) to a bilinear form \(h_{n, p}: T^*_{p} (\mathcal {P}_n)^2 \rightarrow \mathbb {R}\). Then the following two conditions are equivalent.

  1. (i)

    \(\exists c\in \mathbb {R}, \, \forall n, \;\; h_n = c g_n\).

  2. (ii)

    For any \(m\le n\) and any Markov co-embedding \(\Psi : \mathcal {P}_n \rightarrow \mathcal {P}_m\), it holds that

    $$\begin{aligned} \forall q\in \mathcal {P}_n, \, \forall \alpha _{\Psi (q)},&\forall \beta _{\Psi (q)} \in T^*_{\Psi (q)} (\mathcal {P}_m),\nonumber \\&h_{m, \Psi (q)} (\alpha _{\Psi (q)}, \beta _{\Psi (q)}) = h_{n, q} (\Psi ^* (\alpha _{\Psi (q)}), \Psi ^* (\beta _{\Psi (q)})). \end{aligned}$$
    (7.25)

The proof will be given by rewriting the statement in terms of variance/covariance for random variables. Suppose that a Markov co-embedding \(\Psi : \mathcal {P}_2\rightarrow \mathcal {P}_1\) is represented as (7.20) by a surjection \(F: \Omega _2 \rightarrow \Omega _1\). Then \(\Psi \) is represented as \(\Psi = \Phi _W\) by the channel W from \(\Omega _2\) to \(\Omega _1\) defined by

$$\begin{aligned} W (x \,\vert \, y) = \left\{ \begin{array}{cl} 1 &{}\;\text {if}\;\; x = F(y),\\ 0 &{} \; \text {otherwise}. \end{array} \right. \end{aligned}$$
(7.26)

For an arbitrary \(A\in \mathbb {R}^{\Omega _1}\), its conditional expectation w.r.t. W is represented as

$$\begin{aligned} E_W (A\,\vert \, y) = \sum _{x\in \Omega _1} W(x\,\vert \, y) A(x) = A (F(y)), \end{aligned}$$
(7.27)

so that it follows from (6.9) that

$$\begin{aligned} \Psi ^* (\delta _{q^F} (A) ) = \delta _q (A\circ F), \end{aligned}$$
(7.28)

where we have invoked \(\Psi (q) = q^F\) from (7.20). Hence, (7.23) and (7.24) are rewritten as

$$\begin{aligned} V_{q^F} (A) = V_q (A\circ F) \end{aligned}$$
(7.29)

and

$$\begin{aligned} \textrm{Cov}_{q^F} (A, B) = \textrm{Cov}_q (A\circ F, B\circ F). \end{aligned}$$
(7.30)
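The identities (7.29) and (7.30) admit a one-line numerical check. In the illustrative sketch below (randomly chosen q and an arbitrary surjection F, not from the paper), the covariance under \(q^F\) coincides with the covariance of the pulled-back variables under q:

```python
import numpy as np

rng = np.random.default_rng(2)

F = np.array([0, 1, 1, 2, 0])       # hypothetical surjection Omega_2 -> Omega_1
q = rng.random(5)
q /= q.sum()                        # q in P(Omega_2)
qF = np.array([q[F == x].sum() for x in range(3)])  # q^F, Eq. (7.20)

def cov(dist, f, g):
    """Covariance of f and g under the distribution dist."""
    return dist @ ((f - dist @ f) * (g - dist @ g))

A, B = rng.random(3), rng.random(3)  # random variables on Omega_1
# A[F] is the composition A o F as a random variable on Omega_2
assert np.isclose(cov(qF, A, B), cov(q, A[F], B[F]))  # Eq. (7.30)
assert np.isclose(cov(qF, A, A), cov(q, A[F], A[F]))  # Eq. (7.29)
```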

These identities themselves are obvious, but what is important is that they characterize the variance/covariance up to a constant factor. Namely, we have the following theorem.

Theorem 7.3

In the same situation as Theorem 7.2, suppose that we are given a sequence \(\{\gamma _n\}_{n=2}^\infty \), where \(\gamma _n\) is a map which continuously maps each point \(p\in \mathcal {P}_n\) to a bilinear form \(\gamma _{n, p}\) on \(\mathbb {R}^{\Omega _n}\). Then the following conditions (i) and (ii) are equivalent.

  1. (i)

    \(\exists c\in \mathbb {R}, \, \forall n, \forall p\in \mathcal {P}_n, \;\; \gamma _{n, p} = c\, \textrm{Cov}_p\).

  2. (ii)

    (ii-1) \(\wedge \) (ii-2)

    1. (ii-1)

      \(\forall n, \forall p\in \mathcal {P}_n, \forall A\in \mathbb {R}^{\Omega _n}, \;\; \gamma _{n, p} (A, 1) =0\).

    2. (ii-2)

      For any \(m\le n\) and any surjection \(F: \Omega _n \rightarrow \Omega _m\), it holds that

      $$\begin{aligned} \forall p\in \mathcal {P}_n, \forall A, B \in \mathbb {R}^{\Omega _m}, \;\; \gamma _{m, p^F} (A, B) = \gamma _{n, p} (A\circ F, B\circ F). \end{aligned}$$
      (7.31)

Note that, if we assume that \(\{\gamma _n\}_n\) are all symmetric tensors, then (7.31) can be replaced with

$$\begin{aligned} \forall p\in \mathcal {P}_n, \forall A \in \mathbb {R}^{\Omega _m}, \;\; \gamma _{m, p^F} (A, A) = \gamma _{n, p} (A\circ F, A\circ F), \end{aligned}$$
(7.32)

which corresponds to (7.29).

See A1 in Appendix for the proof, where we use an argument similar to Čencov’s proof of Theorem 7.1. It is obvious that Theorem 7.2 immediately follows from this theorem.

Remark 7.4

If we delete (ii-1) from (ii) in Theorem 7.3, then we have (i)\({}'\) \(\Leftrightarrow \) (ii-2) by replacing (i) with

(i)\({}'\):

\( \exists c_1, \exists c_2\in \mathbb {R}, \, \forall n, \forall p\in \mathcal {P}_n,\)

\( \forall A, B\in \mathbb {R}^{\Omega _n}, \;\; \gamma _{n, p}(A, B) = c_1\, \langle A, B\rangle _p + c_2\, \langle A\rangle _p \langle B\rangle _p.\)

We give a proof for (i)\({}'\) \(\Leftrightarrow \) (ii-2) in A1, from which Theorem 7.3 is straightforward.

8 Strong invariance

In the preceding two sections, we have observed the following facts.

  • The monotonicity of metrics and that of co-metrics are logically equivalent.

  • The monotonicity logically implies the invariance of metrics and that of co-metrics.

  • The invariance of metrics and that of co-metrics are not logically equivalent.

In this section we introduce a new notion of invariance called the strong invariance, and show that:

  • The strong invariance of metrics and that of co-metrics are logically equivalent.

  • The monotonicity of metrics/co-metrics logically implies the strong invariance of metrics/co-metrics.

  • The strong invariance of metrics/co-metrics logically implies the invariance of metrics and that of co-metrics.

Recall the situation of (7.1), (7.2) and (7.3), where we are given submanifolds \(M \subset \mathcal {P}_1\) and \(N \subset \mathcal {P}_2\), Markov mappings \(\varphi : M \rightarrow N\) and \(\psi : N \rightarrow M\) satisfying \(\psi \circ \varphi = id _M\), and points \(p\in M\) and \(q\in N\) satisfying \(q = \varphi (p)\) and \(p = \psi (q)\). Then we have the following proposition.

Proposition 8.1

The Fisher metrics \(g_M\) and \(g_N\) on M and N satisfy

$$\begin{aligned} \forall X_p\in T_{p} (M), \forall Y_q\in T_{q} (N), \;\; g_{M, p} (X_p, \psi _* (Y_q)) = g_{N, q} (\varphi _* (X_p), Y_q), \end{aligned}$$
(8.1)

or equivalently,

$$\begin{aligned} \psi _{*, q} = (\varphi _{*, p})^\dagger . \end{aligned}$$
(8.2)

In addition, \((\psi _{*, q})^\dagger \circ \psi _{*, q} = \varphi _{*, p} \circ \psi _{*, q}\) is the orthogonal projector from \(T_{q} (N)\) onto \(\varphi _{*, p} (T_{p} (M)) = T_{q} (\varphi (M))\).

The property (8.1) is called the strong invariance of the Fisher metric. The proposition will be proved by using the following lemma.

Lemma 8.2

Let U and V be finite-dimensional metric linear spaces, and let \(A: U\rightarrow V\) and \(B: V\rightarrow U\) be linear maps satisfying \(B A = I\). Then the following two conditions are equivalent.

  1. (i)

    \(A^\dagger A = B B^\dagger = I\).

  2. (ii)

    \(B = A^\dagger \).

When these conditions hold, \(B^\dagger B = A B\) is the orthogonal projector from V onto the image \(\textrm{Im}\, A\) of A.

Proof

It is obvious that (ii) \(\Rightarrow \) (i) under the assumption \(B A = I\). Conversely, if we assume (i) with \(B A = I\), then we have

$$\begin{aligned} (B- A^\dagger ) (B - A^\dagger )^\dagger&= B B^\dagger - B A - A^\dagger B^\dagger + A^\dagger A\\&= I - I - I + I = 0, \end{aligned}$$

from which (ii) follows.

Assume (i) and (ii). Then \((B^\dagger B)^2 = B^\dagger B\) due to \(B B^\dagger =I\), which implies that \(B^\dagger B\) is the orthogonal projector onto \(\textrm{Im}\, B^\dagger = \textrm{Im}\, A\). \(\square \)
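Lemma 8.2 is a statement of plain linear algebra, and can be illustrated concretely with the Euclidean inner product. In the sketch below (an illustrative instance, not part of the proof), A is taken with orthonormal columns so that \(A^\dagger A = I\) holds with \(B = A^\dagger = A^{\mathrm {T}}\), and \(B^\dagger B = AB\) is checked to be the orthogonal projector onto \(\textrm{Im}\, A\):

```python
import numpy as np

rng = np.random.default_rng(4)

# A: U -> V with dim U = 3, dim V = 5 and orthonormal columns (A^T A = I).
A, _ = np.linalg.qr(rng.random((5, 3)))   # reduced QR gives a 5x3 isometry
B = A.T                                   # B = A^dagger, condition (ii)

assert np.allclose(B @ A, np.eye(3))      # standing assumption B A = I

P = B.T @ B                               # B^dagger B
assert np.allclose(P, A @ B)              # equals A B, as in Lemma 8.2
assert np.allclose(P @ P, P)              # idempotent
assert np.allclose(P.T, P)                # self-adjoint: orthogonal projector
assert np.allclose(P @ A, A)              # acts as the identity on Im A
```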

Proof of Prop 8.1.

Letting \(U:= T_{p} (M)\), \(V:=T_{q} ({N})\), \(A:= \varphi _{*, p}\) and \(B:= \psi _{*,q}\) in the previous lemma, we see from (7.4), (7.8) and (7.17) that the assumption \(BA = I\) and the condition (i) are satisfied, so that we have (ii), which means (8.2). \(\square \)

As can be seen from Lemma 8.2, the strong invariance (8.1) is equivalent to the condition that both the invariance for the metric (7.6)–(7.8) and the invariance for the co-metric (7.14)–(7.16) hold. Taking the transpose of both sides of (8.2), the strong invariance is also expressed as

$$\begin{aligned} \psi _{q}^* = (\varphi _{p}^*)^\dagger , \end{aligned}$$
(8.3)

which means that the Fisher co-metric satisfies

$$\begin{aligned} \forall \alpha _p\in T^*_{p} (M), \forall \beta _q\in T^*_{q} (N), \;\; g_{M, p} (\alpha _p, \varphi ^* (\beta _q)) = g_{N, q} (\psi ^* (\alpha _p), \beta _q). \end{aligned}$$
(8.4)

Since the strong invariance (8.4) logically implies the invariance (7.7) for the Fisher metric via (8.1), we see that the following proposition, which is stated in a rough form similar to Proposition 6.3, is obtained as a corollary of Čencov’s theorem (Theorem 7.1).

Proposition 8.3

The strong invariance (8.4) characterizes the Fisher co-metric up to a constant factor.

An exact formulation of this proposition will be given in A2 of Appendix with a proof based on Čencov’s theorem.

Remark 8.4

The above proposition is stronger than Proposition 6.3 and weaker than Theorem 7.2. To formulate Proposition 6.3 and Proposition 8.3 in exact forms similar to Theorem 7.2, it matters what assumptions should be imposed on the bilinear forms \(\{h_{n, p}\}\) on the cotangent spaces before imposing the monotonicity or the strong invariance. Here we should keep in mind that the significance of these propositions, which are weaker than Theorem 7.2, lies in the fact that they follow from Čencov’s theorem while Theorem 7.2 does not. For Proposition 6.3, we need to assume that \(\{h_{n, p}\}\) are inner products (positive symmetric forms) to ensure that their monotonicity is translated into the monotonicity of the corresponding inner products on the tangent spaces. For Proposition 8.3, on the other hand, we only need to assume \(\{h_{n, p}\}\) to be non-degenerate (nonsingular) and symmetric. See A2 for details.

Let us consider the strong invariance (8.4) for the case when \((\varphi , \psi )\) is a Markov embedding/co-embedding pair \((\Phi , \Psi )\) and rewrite it into an identity for the covariance of random variables. Let \(\Omega _1\) and \(\Omega _2\) be arbitrary finite sets satisfying \(\vert \Omega _1 \vert \le \vert \Omega _2 \vert \) and let \(\mathcal {P}_i:= \mathcal {P}(\Omega _i)\), \(i=1, 2\). Given a surjection \(F: \Omega _2 \rightarrow \Omega _1\) and a distribution \(q\in \mathcal {P}_2\), let \((\Phi , \Psi )\) be defined by (7.20) and (7.21) with (7.22). For arbitrary \(A \in \mathbb {R}^{\Omega _1}\) and \(B \in \mathbb {R}^{\Omega _2}\), let \(\alpha _{q^F}:= \delta _{q^F} (A) \in T^*_{{q^F}} (\mathcal {P}_1)\) and \(\beta _q:= \delta _q (B) \in T^*_{q} (\mathcal {P}_2)\), for which the strong invariance (8.4) is represented as

$$\begin{aligned} g_{q^F} (\alpha _{q^F}, \Phi ^* (\beta _q)) = g_q (\Psi ^* (\alpha _{q^F}), \beta _q). \end{aligned}$$
(8.5)

Recalling (7.28), we have

$$\begin{aligned} \Psi ^* (\alpha _{q^F}) = \delta _q (A\circ F). \end{aligned}$$
(8.6)

On the other hand, \(\Phi \) is represented as \(\Phi = \Phi _V\) by the channel V defined by

$$\begin{aligned} V (y\,\vert \, x) = \left\{ \begin{array}{ll} r_x (y) = \frac{q(y)}{q^F (x)}&{}\; \text {if}\;\; x = F(y) \\ 0 &{}\; \text {otherwise}, \end{array}\right. \end{aligned}$$
(8.7)

so that (6.9) yields

$$\begin{aligned} \Phi ^* (\beta _q) = \delta _{q^F} (E_V (B\, \vert \, \cdot )). \end{aligned}$$
(8.8)

Hence, the strong invariance (8.5) is rewritten as

$$\begin{aligned} \textrm{Cov}_{q^F} (A, E_V (B\,\vert \, \cdot )) = \textrm{Cov}_q (A\circ F, B). \end{aligned}$$
(8.9)

This identity can be verified directly as follows. Letting \(a:= \langle A\rangle _{q^F} = \langle A\circ F\rangle _q\) and \(b:= \langle B\rangle _q = \langle E_V (B\, \vert \, \cdot )\rangle _{q^F}\), we have

$$\begin{aligned} \text {RHS}&= \sum _y q(y) (A (F(y)) - a) (B(y) -b)\nonumber \\&=\sum _x \sum _{y\in F^{-1} (x)} q(y) (A (F(y)) - a) (B(y) -b)\nonumber \\&= \sum _x \, (A (x) - a) \sum _{y\in F^{-1} (x)} q(y) (B(y) -b)\nonumber \\&= \sum _x q^F (x) (A (x) - a) (E_V (B\,\vert \, x) - b)\nonumber \\&=\text {LHS} , \end{aligned}$$
(8.10)

where the fourth equality follows from

$$\begin{aligned} E_V (B\, \vert \, x) = \sum _y V (y\,\vert \, x) B(y)&= \frac{1}{q^F (x)} \sum _{y\in F^{-1} (x)} q (y) B(y). \end{aligned}$$
(8.11)

Note that (7.30) is obtained from (8.9) by substituting \(B\circ F\) for B.
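The identity (8.9), verified by the computation (8.10), can also be checked numerically. The sketch below is illustrative (a randomly chosen q and a hypothetical surjection F); it builds the channel V of (8.7), the conditional expectation (8.11), and compares both sides of (8.9):

```python
import numpy as np

rng = np.random.default_rng(3)

F = np.array([0, 0, 1, 2, 1])       # hypothetical surjection Omega_2 -> Omega_1
n1, n2 = 3, 5
q = rng.random(n2)
q /= q.sum()                        # q in P(Omega_2)
qF = np.array([q[F == x].sum() for x in range(n1)])  # q^F, Eq. (7.20)

# Channel V(y|x) from Eq. (8.7): q conditioned on the fiber F^{-1}(x)
V = np.zeros((n1, n2))
for x in range(n1):
    V[x, F == x] = q[F == x] / qF[x]

A, B = rng.random(n1), rng.random(n2)
E_B = V @ B                         # E_V(B|x), Eq. (8.11)

def cov(dist, f, g):
    """Covariance of f and g under the distribution dist."""
    return dist @ ((f - dist @ f) * (g - dist @ g))

lhs = cov(qF, A, E_B)               # Cov_{q^F}(A, E_V(B|.))
rhs = cov(q, A[F], B)               # Cov_q(A o F, B)
assert np.isclose(lhs, rhs)         # strong-invariance identity, Eq. (8.9)
```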

9 Weak invariance for affine connections

In addition to characterizing the Fisher metric by the invariance with respect to Markov embeddings, Čencov also gave a characterization of the \(\alpha \)-connections by the invariance condition. In this section we show that a similar notion to the strong invariance of metrics, which is described in terms of Markov embedding/co-embedding pairs, can be considered for affine connections.

Let \(\Omega _1, \Omega _2\) be arbitrary finite sets satisfying \(2\le \vert \Omega _1 \vert \le \vert \Omega _2 \vert \), and let \(\Phi : \mathcal {P}_1 \rightarrow \mathcal {P}_2\) be a Markov embedding, where \(\mathcal {P}_i:= \mathcal {P}(\Omega _i)\), \(i=1, 2\). Suppose that affine connections \(\nabla \) and \(\nabla '\) are given on \(\mathcal {P}_1\) and \(\mathcal {P}_2\), respectively. When these connections are the \(\alpha \)-connection on \(\mathcal {P}_1\) and that on \(\mathcal {P}_2\) for some common \(\alpha \in \mathbb {R}\), they satisfy

$$\begin{aligned} \forall X, Y \in \mathfrak {X}(\mathcal {P}_1), \; \; \Phi _* (\nabla _X Y) = \nabla '_{\Phi _* (X)} \Phi _* (Y). \end{aligned}$$
(9.1)

Some remarks on the meaning of the above equation are in order. First, we define \(\Phi _* (X)\) for an arbitrary vector field X on \(\mathcal {P}_1\) as a vector field on \(K:= \Phi (\mathcal {P}_1)\) that maps each point \(q=\Phi (p)\in K\), where \(p\in \mathcal {P}_1\), to

$$\begin{aligned} (\Phi _* (X))_q:= \Phi _{*, p} (X_p) \in T_{q} (K). \end{aligned}$$
(9.2)

Since K is a submanifold of \(\mathcal {P}_2\) on which the connection \(\nabla '\) is given, \(\nabla '_{\Phi _* (X)} \Phi _* (Y)\) in (9.1) is defined as a map which maps each point \(q\in K\) to a tangent vector in \(T_{q}(\mathcal {P}_2)\), although \(\nabla '_{\Phi _* (X)} \Phi _* (Y)\) does not belong to \(\mathfrak {X}(K)\) in general. The condition (9.1) means that K is autoparallel in \(\mathcal {P}_2\) with respect to \(\nabla '\) and that the restricted connection of \(\nabla '\) induced on the autoparallel K is obtained from \(\nabla \) by the diffeomorphism \(\Phi : \mathcal {P}_1 \rightarrow K\). Čencov [3] showed that the invariance condition characterizes the family \(\{\alpha \text {-connection}\}_{\alpha \in \mathbb {R}}\) by a formulation similar to Theorem 7.1.

Remark 9.1

As is mentioned above, the fact that the \(\alpha \)-connections satisfy the invariance (9.1) implies that \(K = \Phi (\mathcal {P}_1)\) is autoparallel in \(\mathcal {P}_2\) w.r.t. the \(\alpha \)-connection for every \(\alpha \in \mathbb {R}\). A kind of converse result is found in [5], which states that if a submanifold K of \(\mathcal {P}_2 = \mathcal {P}(\Omega _2)\) is autoparallel w.r.t. the \(\alpha \)-connection for every \(\alpha \in \mathbb {R}\) (or, for some two different values of \(\alpha \)), then K is represented as \(\Phi (\mathcal {P}_1)\) by some Markov embedding \(\Phi \) from some \(\mathcal {P}_1 = \mathcal {P}(\Omega _1)\) into \(\mathcal {P}_2\).

Let g and \(g'\) be the Fisher metrics on \(\mathcal {P}_1\) and \(\mathcal {P}_2\), respectively. Noting that \(\Phi _*\) is an isometry with respect to these metrics due to the invariance of the Fisher metric, (9.1) implies that

$$\begin{aligned} \forall X, Y, Z\in \mathfrak {X}(\mathcal {P}_1), \;\; g(\nabla _X Y, Z) = g' (\nabla '_{\Phi _* (X)} \Phi _* (Y), \, \Phi _* (Z)) \circ \Phi , \end{aligned}$$
(9.3)

where the RHS denotes the function on \(\mathcal {P}_1\) such that

$$\begin{aligned} p \mapsto g'_{\Phi (p)} \bigl ( \bigl ( \nabla '_{\Phi _* (X)} \Phi _* (Y)\bigr )_{\Phi (p)}, \, \Phi _{*, p} (Z_p) \bigr ). \end{aligned}$$
(9.4)

Since (9.3) is apparently weaker than (9.1), we call this property the weak invariance of the connections \(\nabla , \nabla '\). In actual fact, however, as is mentioned in [6] and proved in [7], the weak invariance (9.3) characterizes the family \(\{\alpha \text {-connection}\}_{\alpha \in \mathbb {R}}\) just as the stronger condition (9.1) does.

Now, recalling the strong invariance (8.1) of the Fisher metric, we see that the weak invariance (9.3) is equivalent to

$$\begin{aligned} \forall X, Y \in \mathfrak {X}(\mathcal {P}_1), \;\; \nabla _X Y = \Psi _* \bigl ( \nabla '_{\Phi _* (X)} \Phi _* (Y) \bigr ), \end{aligned}$$
(9.5)

where the RHS denotes the map

$$\begin{aligned} \mathcal {P}_1 \ni p \mapsto \Psi _{*, \Phi (p)} \bigl ( \bigl ( \nabla '_{\Phi _* (X)} \Phi _* (Y) \bigr )_{\Phi (p)} \bigr ). \end{aligned}$$
(9.6)

That the weak invariance is a condition on the connections alone is expressed more clearly by (9.5) than by (9.3), since (9.5) does not involve the Fisher metric.

10 Concluding remarks

In this paper we have focused on the Fisher metric viewed as a metric on the cotangent bundle, calling it the Fisher co-metric to distinguish it from the original Fisher metric on the tangent bundle. Our results are summarized below.

  1. Based on a correspondence between cotangent vectors and random variables, the Fisher co-metric is defined via the variance/covariance in a natural way (Sect. 2).

  2. The Cramér-Rao inequality is trivialized by considering the Fisher co-metric (Sect. 4).

  3. The role of the m-connection as a connection on the cotangent bundle is important in considering the achievability condition for the Cramér-Rao inequality (Sect. 5).

  4. The monotonicity of the Fisher metric is equivalently translated into the monotonicity of the Fisher co-metric and that of the variance (Sect. 6).

  5. The invariance of the Fisher metric and that of the Fisher co-metric are not logically equivalent, and a new Čencov-type theorem characterizing the Fisher co-metric by its invariance is established, which can also be regarded as a theorem characterizing the variance/covariance (Sect. 7).

  6. The notion of strong invariance is introduced, which combines the invariance of the Fisher metric and that of the Fisher co-metric (Sect. 8).

  7. The weak invariance of the \(\alpha \)-connections is expressed in a formulation similar to the strong invariance of the Fisher metric/co-metric (Sect. 9).

It should be noted that, although this paper emphasizes the importance of the Fisher co-metric, this does not diminish the importance of the Fisher metric at all. Apart from its importance as a metric on the tangent bundle itself, which is essential for the geometry of statistical manifolds, we should not forget that the Fisher information matrix (i.e. the components of the Fisher metric) is of primary importance as a practical tool for computing the Fisher co-metric. Even though \(g_{M}^{ij} (p) = g_{M, p} ((d\xi ^i)_p, (d\xi ^j)_p)\) in (4.3) can be defined by (4.5) and (4.6), and understanding \(g_{M}^{ij} (p)\) in this way is important for a conceptual grasp of the Cramér-Rao inequality, in general this does not provide a better method for computing \(g_{M}^{ij} (p)\) for a given statistical model \((M, \xi )\) than inverting the Fisher information matrix \(G_M (p):= [g_{M, ij} (p) ]\).
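The practical recipe described above can be sketched in a few lines. The model below is our own illustrative choice (a two-parameter family on a 3-point sample space; all names are hypothetical): one computes the Fisher information matrix \(G_M(p)\) from the model's Jacobian and obtains the co-metric components \(g_M^{ij}(p)\) as its inverse.

```python
import numpy as np

def fisher_information(p, J):
    # g_{M,ij}(p) = sum_w (d_i p_w)(d_j p_w) / p_w,
    # where J is the Jacobian of the map xi -> p(xi).
    return J.T @ np.diag(1.0 / p) @ J

# Two-parameter model (M, xi) on a 3-point sample space:
# p(xi) = (xi^1, xi^2, 1 - xi^1 - xi^2).
xi = np.array([0.3, 0.4])
p = np.array([xi[0], xi[1], 1 - xi[0] - xi[1]])
J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])   # dp/dxi

G = fisher_information(p, J)      # G_M(p) = [g_{M,ij}(p)]
G_inv = np.linalg.inv(G)          # co-metric components g_M^{ij}(p)
assert np.allclose(G @ G_inv, np.eye(2))
```

In estimation-theoretic terms, `G_inv` is exactly the Cramér-Rao lower bound on the covariance of an unbiased estimator of \(\xi \) (from a single observation).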

Finally, we note that some of the results obtained here can be extended to the quantum case in several directions, which will be discussed in a forthcoming paper.