Codivergences and information matrices

Derumigny, Alexis; Schmidt-Hieber, Johannes

doi:10.1007/s41884-024-00135-2

Research Paper
Open access
Published: 22 May 2024

Volume 7, pages 253–282, (2024)

Information Geometry

Abstract

We propose a new concept of codivergence, which quantifies the similarity between two probability measures $P_1, P_2$ relative to a reference probability measure $P_0$. In the neighborhood of the reference measure $P_0$, a codivergence behaves like an inner product between the measures $P_1-P_0$ and $P_2-P_0$. Codivergences of covariance-type and correlation-type are introduced and studied with a focus on two specific correlation-type codivergences, the $\chi ^2$-codivergence and the Hellinger codivergence. We derive explicit expressions for several common parametric families of probability distributions. For a codivergence, we introduce moreover the divergence matrix as an analogue of the Gram matrix. It is shown that the $\chi ^2$-divergence matrix satisfies a data-processing inequality.

1 Introduction

One of the objectives of information geometry is to measure distances or angles in statistical spaces, usually for parametric models. This is often done by the use of a divergence, generating a Riemannian manifold structure on the considered space of distributions, see [1, 3, 19]. Divergences between probability measures quantify a certain notion of difference between them. Divergences are in general not symmetric, as opposed to distances. Famous examples of divergences include the Kullback–Leibler divergence, the $\chi ^2$-divergence, and the Hellinger distance.

In this article, we are interested in defining a local notion of inner product between two probability measures in the neighborhood of a given reference probability measure $P_0$. This allows us to identify different directions relative to $P_0$, and to give some meaning to the “angle” between these directions. Contrary to most of the previous work on finite-dimensional Riemannian manifolds spanned by specific parametric statistical models, we do not require any parametric restrictions on the considered probability measures.

Our approach is motivated by, and finds application in, the recently introduced generic framework for deriving lower bounds on the trade-off between bias and variance in nonparametric statistical models [9]. The key ingredients in this lower bound strategy are so-called change of expectation inequalities, which relate the change of the expected value of a random variable with respect to different distributions to the variance and involve the divergence matrices examined in this work. Another possible area of application is lower bounds for statistical query algorithms, see e.g. [10, 11].

Regarding work on infinite-dimensional information geometry, [5, 20] studied the manifold generated by all probability densities connected to a given probability density. [21] reviews a more general theory on infinite-dimensional statistical manifolds given a reference density. Another line of work [12, 16, 17, 18] seeks to define infinite-dimensional manifolds, with applications to Bayesian estimation and the choice of priors. [24] consider different possible structures on the set of probability densities on [0, 1].

The article is structured as follows. In Sect. 2, we introduce a general concept of codivergence, study specific properties of codivergences on the space of probability measures, and discuss specific (classes of) codivergences. Section 3 considers the construction of divergence matrices from a given codivergence. Section 4 is devoted to the data-processing inequality that holds for the $\chi ^2$-divergence matrix introduced in Sect. 3, thereby generalizing the usual data-processing inequality for the $\chi ^2$-divergence. Section 5 provides derivations of explicit expressions for a class of codivergences applied to common parametric models. Elementary facts on ranks from linear algebra are collected in Sect. 6.

Notation: If P is a probability measure and $\textbf{X}$ a random vector, we write $E_P[\textbf{X}]$ and ${\text {Cov}}_P(\textbf{X})$ for the expectation vector and covariance matrix with respect to P, respectively.

2 Codivergences

2.1 Abstract framework and definition

We start by recalling the definition of a divergence [1, Definition 1.1]. This definition is situated within the framework of a d-dimensional differentiable manifold $\mathcal {X}$ with an atlas $(U_i, \varphi _i)$. Formally, this means that the $(U_i)$ are an open cover of the topological space $\mathcal {X}$, and $\varphi _i: U_i \rightarrow \mathbb {R}^d$ are isomorphisms such that $\varphi _j \circ \varphi _i^{-1}$ is $C^1$ on $\varphi _i(U_i \cap U_j) \rightarrow \varphi _j(U_i \cap U_j)$ [14].

Definition 2.1

A divergence D on a d-dimensional differentiable manifold $\mathcal {X}$ is a function $\mathcal {X}^2 \rightarrow \mathbb {R}_+$ satisfying

(i) $\forall P, Q \in \mathcal {X},$ $D(P|Q) = 0$ if and only if $P = Q$.
(ii) For all $P \in \mathcal {X}$, for any chart $(U, \varphi )$ with $P \in U$, there exists a matrix $G = G(P)$ such that for any $Q \in U$,
$$\begin{aligned} D(P | Q) = \frac{1}{2} \big ( \varphi (Q) - \varphi (P) \big )^T G \big ( \varphi (Q) - \varphi (P) \big ) + O\Big ( \Vert \varphi (Q) - \varphi (P) \Vert ^3 \Big ). \end{aligned}$$
(1)

The matrix $G=G(P)$ may depend on the choice of coordinates $\varphi $. For the most common divergences, G is symmetric, positive-definite and thus defines a scalar product on the tangent space at P. Whereas a divergence measures the similarity between two elements $P,Q\in \mathcal {X},$ we want to define codivergences measuring the angle $\sphericalangle P_1P_0P_2$ of $P_1,P_2\in \mathcal {X}$ relative to $P_0\in \mathcal {X}.$

Equation (1) states that the divergence D(P|Q) is a quadratic form in terms of the local coordinates $\varphi (Q) \in \mathbb {R}^d,$ whenever P and Q are close. Generalizing to the infinite-dimensional case requires working with bilinear forms instead. Moreover, in the infinite-dimensional setting, imposing an expansion of the form (1) in every possible direction around $P \in \mathcal {X}$ is restrictive. We therefore allow the quadratic expansion to hold in a possibly smaller bilinear expansion domain. Furthermore, we allow codivergences to attain the value $+ \infty .$ This is inspired by existing statistical divergences (such as $\chi ^2$- or Kullback–Leibler divergences) that can also take the value $+\infty $. Therefore, imposing an expansion of the form (1) globally may not be possible as the codivergence on the left-hand side of (1) may take the value $+\infty $ in some directions away from P, while the right-hand side of (1) is always finite.

We now provide the definition of a codivergence if $\mathcal {X}$ is a subset of a real vector space.

Definition 2.2

Let $\mathcal {X}$ and $(E_{u})_{u\in \mathcal {X}}$ be a subset and a family of subspaces of a real vector space E, respectively. A function $(u,v,w) \in \mathcal {X}^3 \mapsto D(u | v, w) \in \mathbb {R}\cup \{ + \infty \}$ defines a codivergence on $\mathcal {X}$ with bilinear expansion domain $E_{u}$ at u, if for any $u, v, w \in \mathcal {X},$

(i) $D(u | v, w) = D(u | w, v)$;
(ii) $D(u | v, v) \ge 0$, with equality if $u = v$;
(iii) there exists a bilinear map $\langle \cdot , \cdot \rangle _{u}$ defined on $E_{u}$, such that, for any $h, g \in E_{u}$ and for any scalars s, t in some sufficiently small open neighborhood of (0, 0) (that may depend on h and g) with respect to the Euclidean topology in $\mathbb {R}^2$, we have $(u + t h, \ u + sg) \in \mathcal {X}^2,$ $D \big (u \big | u + t h , u + s g\big ) < + \infty ,$ and $D \big (u \big | u + t h , u + s g\big ) = ts \langle h, g \rangle _{u} + o(t^2 + s^2)$ as $(s,t) \rightarrow (0,0)$.

The last part of the definition imposes that, locally around each u, the codivergence $(v, w) \mapsto D(u | v, w)$ is finite and behaves like a bilinear form in the centered variables $(v-u, w-u).$ As a consequence, for a given u, the mapping $(v, w) \mapsto D(u | v, w)$ is Gateaux-differentiable on $\mathcal {X}^2$ at (u, u) with Gateaux derivative 0 in every direction $(h,g) \in E_u^2$. Condition (iii) can moreover be understood as a second-order Taylor expansion at (u, u) in the direction (h, g). The mapping $(v, w) \mapsto D(u | v, w)$ need not, however, be twice Gateaux-differentiable at (u, u) for (iii) to hold. This is analogous to the usual counter-examples in analysis where a function admits a second-order Taylor expansion at a given point without being twice differentiable there. Nevertheless, if $D \big (u \big | u + t h , u + s g\big )$ is twice differentiable in (t, s) at (0, 0), then the partial derivative $\partial ^2 D \big (u \big | u + t h , u + s g\big )/ \partial t \partial s$ at (0, 0) must be equal to $\langle h, g \rangle _{u}$. We refer to [2] for a discussion on higher-order functional derivatives.

We also provide a definition of codivergence if $\mathcal {X}$ is a differentiable Banach manifold, see [4, 14] for an introduction to Banach manifolds. Let B be a Banach space and $\mathcal {X}$ be a Banach manifold modeled on B. This guarantees existence of a B-atlas $(U_i, \varphi _i)$ with $U_i$ an open cover of $\mathcal {X}$ and $\varphi _i: U_i \rightarrow B$ such that $\varphi _j \circ \varphi _i^{-1}$ is $C^1$ (with respect to the norm on B).

This generalization can be useful in the case where the space $\mathcal {X}$ is not flat. Indeed, part (iii) of Definition 2.2 imposes that for $h \in E_u$, we must have $u + t h \in \mathcal {X}$ for t small enough. On the contrary, in the following definition we consider a more subtle case where the point u may be approached on a smooth curve (not necessarily affine), under the assumption that $\mathcal {X}$ is a B-manifold.

We first recall the construction of the tangent space via curves following [4, Definition 2.21] and [15, Section 2.1.1]: for a fixed $u \in \mathcal {X}$, let i be such that $u \in U_i$ and let $\mathscr {C}_u$ be the set of smooth curves c such that $c: [-1,1] \rightarrow U_i$ and $c(0) = u$. We define an equivalence relation $\sim $ on $\mathscr {C}_u$ by $c_1 \sim c_2$ if for all smooth real-valued functions f on $U_i$, we have $(f \circ c_1)'(0) = (f \circ c_2)'(0)$. We define the tangent space at u as the quotient set $T_u:= \mathscr {C}_u /{\sim }$, which can be given a vector space structure isomorphic to B.

We give a short outline of the main ideas to obtain this property. Let D denote the Fréchet differential operator. For ${\overline{c}} \in T_u$ and c a representative of the equivalence class ${\overline{c}}$, note that $\varphi _i \circ c: [-1,1] \rightarrow B$ is differentiable (by assumption on c); the mapping $D(\varphi _i \circ c)(0)$ is linear from $\mathbb {R}$ to B and can therefore be identified with an element of B itself; this element $D(\varphi _i \circ c)(0)$ also does not depend on the representative c. This defines a mapping $\theta _u: T_u \rightarrow B$ by $\theta _u({\overline{c}}):= D(\varphi _i \circ c)(0)$. It can be shown that $\theta _u$ is bijective. Through its inverse $\theta _u^{-1}$ one can transport the vector space structure of B to $T_u$, making it a real vector space too.

Definition 2.3

Let $\mathcal {X}$ be a B-manifold. A function $(u,v,w) \in \mathcal {X}^3 \mapsto D(u | v, w) \in \mathbb {R}\cup \{ + \infty \}$ defines a codivergence on $\mathcal {X}$ with bilinear expansion domain $E_{u}$ at u, if for any $u, v, w \in \mathcal {X},$

(i) $D(u | v, w) = D(u | w, v)$;
(ii) $D(u | v, v) \ge 0$, with equality if $u = v$;
(iii) $E_u$ is a subspace of the tangent space $T_u$ of $\mathcal {X}$ at u;
(iv) there exists a bilinear map $\langle \cdot , \cdot \rangle _{u}$ defined on $E_{u}$, such that, for any ${\overline{g}}, {\overline{h}} \in E_{u}$, for any representatives g and h of the respective equivalence classes ${\overline{g}}$ and ${\overline{h}},$ and for any scalars s, t in some sufficiently small open neighborhood of (0, 0) with respect to the Euclidean topology in $\mathbb {R}^2$ (the neighborhood may depend on the choice of the representatives g and h), we have $D \big (u \big | h(t) , g(s) \big ) < + \infty ,$ and $D \big (u \big | h(t) , g(s) \big ) = ts \langle {{\overline{h}}, {\overline{g}}} \rangle _{u} + o(t^2 + s^2)$ as $(s,t) \rightarrow (0,0).$

From a codivergence D(u|v, w) that takes finite values on a finite-dimensional manifold and whose bilinear expansion domains are the tangent spaces, we can always construct a divergence by setting $v = w$. Then D(u|v, v) behaves like a quadratic form in v whenever v is close to u.

If $\mathcal {X}$ is a B-manifold and a closed subspace of a vector space E, then the notions of codivergences in Definitions 2.2 and 2.3 coincide. This is because differentiable curves are, in first order, linear functions in a small enough neighborhood of 0.

For both definitions, a given space $\mathcal {X}$ and a given family of bilinear expansion domains $(E_{u})_{u \in \mathcal {X}}$, the set of codivergences on $\mathcal {X}$ is a convex cone.

For an example covered by Definition 2.3 but not by Definition 2.2 assume that $\mathcal {X}$ is the unit circle. No codivergence can exist in the sense of Definition 2.2 with non-trivial bilinear expansion domains $(E_u)$. An example of a codivergence on $\mathcal {X}= \{ e^{i\theta }, \theta \in \mathbb {R}\}$ in the sense of Definition 2.3 is

$$\begin{aligned} D(u | v , w) = {\left\{ \begin{array}{ll} e^{(\theta _1 - \theta _0) (\theta _2 - \theta _0)} - 1, &{} \text {if } v, w \in U_u, \\ + \infty , &{} \text {else,} \end{array}\right. } \end{aligned}$$

where $(u, v, w) \in \mathcal {X}^3,$ $U_u:= \{u e^{i\theta }, \theta \in (-\pi /2, \pi /2)\},$ $u = e^{i\theta _0},$ $v = e^{i\theta _1}$ and $w = e^{i\theta _2}$ for some $\theta _0 \in \mathbb {R}$, $\theta _1, \theta _2 \in [\theta _0 - \pi , \theta _0 + \pi )$. Such a representation of v and w always exists and is unique since $[\theta _0 - \pi , \theta _0 + \pi )$ is a half-open interval of length $2 \pi $. In this case, the tangent space $T_u$ of the circle at any point $u = e^{i \theta _0}$ is diffeomorphic to $\mathbb {R}$ and we will use this identification (denoted by the symbol “$\simeq $”). Let $g,h \in T_u \simeq \mathbb {R}$ and assume $s,t \in \mathbb {R}$. Then $g(s) = u e^{igs} \in U_u$ for s small enough. Similarly, $h(t) = u e^{iht} \in U_u$ for t small enough. So, $D \big (u \big | h(t) , g(s) \big )$ is finite for all (s, t) in a small enough neighborhood of (0, 0), and, whenever this is the case, we have

$$\begin{aligned} D \big (u \big | h(t) , g(s) \big ) = D \big (u \big | u e^{iht} , u e^{igs} \big ) = e^{ht gs} - 1 = ts \langle h, g \rangle _{u} + o(t^2 + s^2), \end{aligned}$$

where $\langle h, g \rangle _{u} = gh$ is the local bilinear form (which in this example is independent of u) and the bilinear expansion domain can be taken to be $E_u = T_u \simeq \mathbb {R}$.
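The expansion in the circle example can be checked numerically. The following is a minimal sketch in Python; the function names `circle_codivergence` and `expansion_error` are illustrative choices and not part of the original text, and the angles are assumed to be already represented in $[\theta _0 - \pi , \theta _0 + \pi )$.

```python
import math

def circle_codivergence(theta0, theta1, theta2):
    """D(u | v, w) for u = e^{i theta0}, v = e^{i theta1}, w = e^{i theta2}.

    Assumption: theta1 and theta2 are already given in [theta0 - pi, theta0 + pi).
    """
    d1, d2 = theta1 - theta0, theta2 - theta0
    # v and w lie in U_u exactly when both angular offsets are in (-pi/2, pi/2).
    if abs(d1) < math.pi / 2 and abs(d2) < math.pi / 2:
        return math.exp(d1 * d2) - 1.0
    return math.inf

def expansion_error(h, g, t, s, theta0=0.3):
    """|D(u | h(t), g(s)) - t*s*h*g| for the curves h(t) = u e^{iht}, g(s) = u e^{igs}."""
    value = circle_codivergence(theta0, theta0 + h * t, theta0 + g * s)
    return abs(value - t * s * h * g)

# Remainder divided by t^2 + s^2 along t = s = eps; it should tend to 0.
errors = [expansion_error(h=2.0, g=-1.5, t=eps, s=eps) / (2 * eps**2)
          for eps in (1e-1, 1e-2, 1e-3)]
```

The normalized remainders in `errors` shrink as the step decreases, in line with the $o(t^2 + s^2)$ term and the bilinear form $\langle h, g \rangle _{u} = gh$.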

2.2 Codivergences on the space of probability measures

For the application to statistics, E is the space of all finite signed measures on a measurable space $(\mathcal {A}, \mathscr {B})$, and $\mathcal {X}$ is the space of all probability measures on $(\mathcal {A}, \mathscr {B})$. Probability measures form a convex subset of all signed measures E. Since E is a vector space, the natural definition of a codivergence on $\mathcal {X}$ is Definition 2.2. A visual representation of such a codivergence is provided in Fig. 1.

Next, we characterize the bilinear expansion domains of a codivergence for $\mathcal {X}$ the space of probability measures. Given a probability measure $P_0 \in \mathcal {X},$ we say that a function $h: \mathcal {A}\rightarrow \mathbb {R}$ is $P_0$-essentially bounded by a constant $C > 0$ if $P_0(\{x \in \mathcal {A}: |h(x)| \le C \}) = 1$ and define ${\text {ess sup}}_{P_0}|h|:= \inf \{C > 0: |h| \text { is } P_0\text {-essentially bounded by } C \}.$ We will show that

$$\begin{aligned} \mathcal {M}_{P_0} := \left\{ \mu \in E: \mu \ll P_0, \int d\mu = 0, {\text {ess sup}}_{P_0}\Big |\frac{d\mu }{dP_0}\Big | < + \infty \right\} \end{aligned}$$

is the largest bilinear expansion domain that any codivergence on $\mathcal {X}$ can have at $P_0$. The rationale is that $P_0+t\mu $ is otherwise not a probability measure for all sufficiently small $|t|$. Indeed, if $\mu $ has a density h with respect to $P_0$, then the $P_0$-density $1 + t h$ is non-negative for given $t>0$ if and only if $h \ge -1/t$. Similarly, the density $1 - t h$ is non-negative for given $t > 0$ if and only if $h \le 1/t$. This gives a link between a bound on $h = d\mu /dP_0$ and the non-negativity of the measures $P_0 \pm t \mu .$

For every measurable set A,

$$\begin{aligned} (P_0 + t \mu )(A) = \int _{x\in A} d(P_0 + t \mu )(x) = \int _{x\in A} \big (1 + th(x)\big ) dP_0(x). \end{aligned}$$

(2)

The value of an integral is unchanged if the function to be integrated is modified on a $P_0$-null set. Therefore we only need the function $1 + th$ to be non-negative $P_0$-almost everywhere for $P_0 + t \mu $ to be a positive measure.

Proposition 2.4

For any codivergence D on the space of probability measures $\mathcal {X}$, the bilinear expansion domain of D at any probability measure $P_0 \in \mathcal {X}$ must be included in $\mathcal {M}_{P_0}$. Furthermore, every $\mu \in \mathcal {M}_{P_0}$ has a density $d\mu / dP_0$ with respect to $P_0$ such that ${\text {ess sup}}_{P_0}|d\mu / dP_0| = 1 / a_*$ with $a_*:= \sup \{ a > 0: P_0 + t \mu \in \mathcal {X}\, \text {for all } t \in [-a, a]\} \in (0, +\infty ]$ and the convention $1/+ \infty = 0$.

Proof of Proposition 2.4

We begin by proving the first part. Let $\mu $ be a finite signed measure belonging to the bilinear expansion domain at $P_0$ of some codivergence D on the space of probability measures $\mathcal {X}$. For $a > 0$, we write $\mu \in R(a)$ if and only if $P_0 + t \mu \in \mathcal {X}$ for all $-a\le t\le a.$ Since $\mu $ belongs to the bilinear expansion domain of D at $P_0$, Definition 2.2(iii) implies existence of an open neighborhood T of 0 such that for any $t \in T$, $P_0 + t \mu \in \mathcal {X}$. Therefore, there exists $a > 0$ with $\mu \in R(a)$.

We now show that $\mu \in R(a),$ for some $a>0,$ implies $\mu \ll P_0$. The proof relies on the Jordan decomposition theorem for finite signed measures (e.g. Corollary 4.1.6 in [7]). It states that every finite signed measure $\mu $ on a measurable space $(\mathcal {A}, \mathscr {B})$ can be decomposed as

$$\begin{aligned} \mu = \alpha _+ \mu _+ - \alpha _- \mu _-, \quad \text {with} \ \alpha _+, \alpha _- \ge 0, \end{aligned}$$

(3)

and $\mu _-, \mu _+$ orthogonal probability measures on $(\mathcal {A}, \mathscr {B})$. By the Lebesgue decomposition theorem (see Theorem 4.3.2 in [7]), $\mu $ can always be decomposed as $\mu = \mu _A + \mu _S$, where $\mu _A$ is a signed measure that is absolutely continuous with respect to $P_0$, $\mu _S$ is a signed measure that is singular with respect to $P_0$, and $\mu _A$ and $\mu _S$ are orthogonal. By the Jordan decomposition (3), we decompose the signed measure $\mu _S = \alpha _+ \mu _{S,+} - \alpha _- \mu _{S,-}$ into its positive and negative part $\mu _{S,+}$ and $\mu _{S,-}$. These two measures are orthogonal and $\alpha _+, \alpha _- \ge 0.$ Then, $P_0 + a \mu = P_0 + a \mu _A + a \alpha _+ \mu _{S,+} - a \alpha _- \mu _{S,-}$ can be a probability measure only if $\alpha _- = 0$. This is because we can find a set U such that $P_0(U) = \mu _A(U) = \mu _{S,+}(U) = 0$ and $\mu _{S,-}(U) = 1$. Therefore $(P_0 + a \mu )(U) = - a \alpha _- \mu _{S,-}(U) = - a \alpha _- \le 0$. In the same way, $P_0 - a \mu $ can be a probability measure only if $\alpha _+ = 0$. Therefore, if $\mu \in R(a)$ for some $a>0$, then $\alpha _+ = \alpha _- = 0$, and $\mu = \mu _A$ is absolutely continuous with respect to $P_0$.

Let h be the density of $\mu $ with respect to $P_0$. Then

$$\begin{aligned} \frac{d(P_0 + t \mu )}{dP_0} = 1 + t \frac{d\mu }{dP_0} = 1 + t h. \end{aligned}$$

Note that $P_0 + t \mu $ is a signed measure integrating to 1 if and only if $\int d\mu = \int h dP_0 = 0$.

We now show that, for any $a > 0$, $\mu \in R(a)$ implies ${\text {ess sup}}_{P_0} |h| \le 1/a$. If $\mu \in R(a)$, then for any $A \in \mathscr {B},$ $(P_0 + a \mu )(A) \ge 0$ and $(P_0 - a \mu )(A) \ge 0$. Let us define the sets $A_+:= \{ x \in \mathcal {A}: 1 + a h(x) \ge 0\}$ and $A_-:= \{ x \in \mathcal {A}: 1 - a h(x) \ge 0\}$. Let $A^C$ denote the complement of a set A. We have $(P_0 + a \mu )(A_+^C) = \int _{A_+^C} \big (1 + a h(x)\big ) dP_0(x) \le 0$ since this is the integral of a negative function. As $(P_0 + a \mu )(A_+^C) \ge 0,$ the integral must vanish, which for a strictly negative integrand is only possible if $P_0(A_+^C) = 0$. Hence $P_0(A_+) = 1$. Similarly, $(P_0 - a \mu )(A_-^C) = \int _{A_-^C} \big (1 - a h(x)\big ) dP_0(x) \le 0.$ Hence, $P_0(A_-^C) = 0$ and $P_0(A_-) = 1$.

Therefore, $P_0(A_+ \cap A_-) = 1$. This means that for $P_0$-almost every $x \in \mathcal {A}$, $1 + a h(x) \ge 0$ and $1 - a h(x) \ge 0$. Therefore, for $P_0$-almost every $x \in \mathcal {A}$, $|h(x)| \le 1/a$. Therefore, h is $P_0$-essentially bounded by $C:= 1/a$. We have finally shown that $\mu \in R(a)$ implies ${\text {ess sup}}_{P_0} |h| \le 1/a$ and $\mu \in \mathcal {M}_{P_0}$, proving the first part of Proposition 2.4.

Conversely, note that $\mu \in \mathcal {M}_{P_0}$ is a sufficient condition for $P_0 + t \mu $ to be a probability measure for all t in a sufficiently small open neighborhood of 0.

We now show the second part of Proposition 2.4. Remember that $a_*:= \sup \{ a > 0: \mu \in R(a) \} \in (0, + \infty ]$. Let $(a_n)_{n \in \mathbb {N}}$ be an increasing sequence of real numbers strictly smaller than $a_*$ and converging to $a_*$. For every positive integer n, we have $\mu \in R(a_n)$. Therefore, by the previous reasoning, ${\text {ess sup}}_{P_0} |h| \le 1/a_n$, meaning that $P_0(\{ x \in \mathcal {A}: |h(x)| \le 1/a_n \}) = 1$. By a union bound, we obtain $P_0( \cap _{n \ge 0} \{ x \in \mathcal {A}: |h(x)| \le 1/a_n \}) = 1$. Therefore $P_0( \{ x \in \mathcal {A}: |h(x)| \le 1/a_* \}) = 1$, and by definition ${\text {ess sup}}_{P_0}|h| \le 1/ a_*$.

We now show the reverse version of this inequality. Let $C > {\text {ess sup}}_{P_0}|h|$. Then $P_0(\{x \in \mathcal {A}: |h(x)| \le C\}) = 1$. Hence, for any $t \in [-1/C, 1/C]$, and for $P_0$-almost every x, $-1 \le t h(x) \le 1$. Consequently, for any $t \in [-1/C, 1/C]$, and for $P_0$-almost every x, $1 + t h(x) \ge 0$ and $1 - t h(x) \ge 0$. For any $t \in [-1/C, 1/C]$, $P_0 + t \mu $ is a finite signed measure with a density that is non-negative $P_0$-almost everywhere and integrates to 1. These are sufficient conditions for $P_0 + t \mu $ to be a probability measure on $\mathcal {A}$, proving $\mu \in R(1/C)$. Therefore, $1/C \le a_*$ and thus $1/a_* \le C$. This holds for any $C > 0$ such that $C > {\text {ess sup}}_{P_0}|h|$, proving that $1/a_* \le {\text {ess sup}}_{P_0}|h|$. Together with the inequality ${\text {ess sup}}_{P_0}|h| \le 1/ a_*$, the claim $1/a_* = {\text {ess sup}}_{P_0}|h|$ follows. $\square $
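On a finite sample space, Proposition 2.4 can be illustrated directly, since the essential supremum reduces to a maximum over the atoms of $P_0$. The sketch below is only an illustration under that assumption; the helper names `a_star`, `perturb` and `is_probability` are hypothetical and the numerical tolerances are arbitrary choices.

```python
def a_star(p0, mu):
    """1 / ess sup_{P0} |d mu / d P0| on a finite space, reading 1/0 as +infinity."""
    if any(q == 0 and m != 0 for q, m in zip(p0, mu)):
        raise ValueError("mu is not absolutely continuous with respect to P0")
    # On a finite space the essential supremum is a maximum over the atoms of P0.
    ess_sup = max(abs(m / q) for q, m in zip(p0, mu) if q > 0)
    return float("inf") if ess_sup == 0 else 1.0 / ess_sup

def perturb(p0, mu, t):
    """Weights of the signed measure P0 + t * mu."""
    return [q + t * m for q, m in zip(p0, mu)]

def is_probability(weights, tol=1e-12):
    """Non-negative weights (up to rounding) that sum to one."""
    return all(w >= -tol for w in weights) and abs(sum(weights) - 1.0) <= 1e-9

p0 = [0.5, 0.3, 0.2]
mu = [0.1, -0.15, 0.05]   # total mass 0, density h = (0.2, -0.5, 0.25)
a = a_star(p0, mu)         # approximately 1 / 0.5 = 2

inside = is_probability(perturb(p0, mu, a)) and is_probability(perturb(p0, mu, -a))
outside = is_probability(perturb(p0, mu, 1.01 * a))
```

Here `inside` is true and `outside` is false: $P_0 + t\mu $ stays a probability measure on the whole interval $[-a_*, a_*]$ but leaves $\mathcal {X}$ just beyond $t = a_*$.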

2.3 Examples of codivergences

For any real $a\ge 0,$ set $a/0:=+\infty .$ For a non-negative function $\phi :[0,\infty )\rightarrow [0,\infty ),$ we can define two codivergences. The first one will be referred to as covariance-type codivergence between three probability measures $P_0, P_1, P_2$ and is defined by

$$\begin{aligned} {V_\phi }(P_0 | P_1, P_2) := \int \phi \Bigg (\frac{dP_1}{dP_0}\Bigg )\phi \Bigg (\frac{dP_2}{dP_0}\Bigg ) dP_0- \int \phi \Bigg (\frac{dP_1}{dP_0}\Bigg )dP_0 \int \phi \Bigg (\frac{dP_2}{dP_0}\Bigg ) dP_0, \end{aligned}$$

(4)

and the second one will be called correlation-type codivergence and is defined by

$$\begin{aligned} {R_\phi }(P_0 | P_1, P_2) := \frac{{V_\phi }(P_0 | P_1, P_2)}{\int \phi \big (\frac{dP_1}{dP_0}\big )dP_0 \int \phi \big (\frac{dP_2}{dP_0}\big ) dP_0} =\frac{ \int \phi \big (\frac{dP_1}{dP_0}\big )\phi \big (\frac{dP_2}{dP_0}\big ) dP_0}{\int \phi \big (\frac{dP_1}{dP_0}\big )dP_0 \int \phi \big (\frac{dP_2}{dP_0}\big ) dP_0} - 1, \end{aligned}$$

(5)

if $P_1, P_2 \ll P_0.$ Otherwise, we define both ${V_\phi }(P_0 | P_1, P_2)$ and ${R_\phi }(P_0 | P_1, P_2)$ to be equal to $+ \infty $.
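For probability vectors on a common finite set, definitions (4) and (5) translate into a few lines of code. The following Python sketch is an illustration under that finiteness assumption; the names `v_phi`, `r_phi` and the private helpers are hypothetical.

```python
import math

def _dominated(p, p0):
    """True if P << P0 for probability vectors on a common finite set."""
    return all(not (q == 0 and r > 0) for r, q in zip(p, p0))

def _moments(phi, p0, p1, p2):
    """(int phi1*phi2 dP0, int phi1 dP0, int phi2 dP0) with phi_j = phi(dP_j/dP0)."""
    cross = m1 = m2 = 0.0
    for q, a, b in zip(p0, p1, p2):
        if q > 0:  # integrals with respect to P0 only see the support of P0
            f1, f2 = phi(a / q), phi(b / q)
            cross += f1 * f2 * q
            m1 += f1 * q
            m2 += f2 * q
    return cross, m1, m2

def v_phi(phi, p0, p1, p2):
    """Covariance-type codivergence (4); +infinity unless P1, P2 << P0."""
    if not (_dominated(p1, p0) and _dominated(p2, p0)):
        return math.inf
    cross, m1, m2 = _moments(phi, p0, p1, p2)
    return cross - m1 * m2

def r_phi(phi, p0, p1, p2):
    """Correlation-type codivergence (5); +infinity unless P1, P2 << P0."""
    if not (_dominated(p1, p0) and _dominated(p2, p0)):
        return math.inf
    cross, m1, m2 = _moments(phi, p0, p1, p2)
    if m1 * m2 == 0:
        return math.inf  # convention a/0 := +infinity
    return cross / (m1 * m2) - 1.0

p0 = [0.5, 0.3, 0.2]
p1 = [0.4, 0.4, 0.2]
p2 = [0.6, 0.2, 0.2]
identity = lambda x: x
v_id = v_phi(identity, p0, p1, p2)
r_id = r_phi(identity, p0, p1, p2)
```

For $\phi (x) = x$ both integrals in the denominator of (5) equal one, so `v_id` and `r_id` agree up to rounding. In this example their common value is negative, which is possible because a codivergence behaves like an inner product rather than a squared norm.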

Obviously, both codivergences ${V_\phi }$ and ${R_\phi }$ are symmetric in $P_1$ and $P_2$. By Jensen’s inequality we see that ${V_\phi }(P_0 | P_1, P_1)\ge 0$ and ${R_\phi }(P_0 | P_1, P_1)\ge 0$. If $\phi (1)=0,$ then ${R_\phi }(P_0|P_0,P_0) = +\infty .$ For $\phi (1)>0,$ the functions $\phi $ and $t\phi $ with positive scalar t give the same codivergence ${R_\phi }$ and simply rescale ${V_\phi }$. Without loss of generality, we therefore can (and will) assume that $\phi (1)=1.$

We say that a function f admits a second order Taylor expansion around 1 if $f(1+y) = f(1) + y f'(1) +\frac{y^2}{2}f''(1) + o(y^2)$ for all y in an open neighborhood of zero. The following proposition is proved in Section A.

Proposition 2.5

Assume that $\phi (1) = 1$ and that $\phi $ admits a second order Taylor expansion around 1. Then the ${V_\phi }$ and the ${R_\phi }$ codivergences are codivergences in the sense of Definition 2.2 with bilinear expansion domains $\mathcal {M}_{P_0}$ and bilinear maps $\phi '(1)^2\langle \mu , \widetilde{\mu }\rangle _{P_0},$ where

$$\begin{aligned} \langle \mu , \widetilde{\mu }\rangle _{P_0} := \int \frac{d \mu }{dP_0} d\widetilde{\mu }= \int h g dP_0, \quad \text {with} \ h = \frac{d\mu }{dP_0} \ \text {and} \ g = \frac{d\widetilde{\mu }}{dP_0}. \end{aligned}$$

If $\nu $ is a measure dominating $P_0$, the bilinear map can be written as

$$\begin{aligned} \langle \mu , \widetilde{\mu }\rangle _{P_0} = \int \frac{h g}{p_0} d\nu , \quad \text {for the densities} \ h = \frac{d\mu }{d\nu }, \ g = \frac{d\widetilde{\mu }}{d\nu }, \ p_0 = \frac{dP_0}{d\nu }. \end{aligned}$$

(6)

A consequence of Proposition 2.5 is the following: locally, all ${V_\phi }$ and ${R_\phi }$ codivergences that satisfy the regularity conditions define, up to the constant factor $\phi '(1)^2$, the same scalar product $\langle \cdot , \cdot \rangle _{P_0}$. This scalar product is the nonparametric Fisher information metric. The name originates from the identity [12, Equation (8)]

$$\begin{aligned}{}[I(\theta )]_{ij} = \int \frac{p_i(x | \theta ) p_j(x | \theta )}{p(x | \theta )} d\nu (x), \end{aligned}$$

(7)

where $[I(\theta )]_{ij}$ is the (i, j)-th entry of the Fisher information matrix for a parametric model of $\nu $-densities $p( \cdot | \theta )$ indexed by a finite dimensional parameter $\theta $ and $p_i(x | \theta ):= \partial p(x | \theta ) / \partial \theta _i$. Eqs. (7) and (6) have the same structure. One of the earliest references to the nonparametric Fisher information metric is [8]. The concept has been applied in several frameworks, such as computer vision [24] or shape data analysis [25]. The geometry of the nonparametric Fisher information metric has been studied by [6, 12] in the context of Bayesian inference.

An interesting subclass of codivergences is obtained by choosing $\phi _\alpha (x) = x^\alpha .$ To ease the notation, we set

$$\begin{aligned} {V_\alpha }:= V_{\phi _\alpha } \quad \text {and} \ \ {R_\alpha }:= R_{\phi _\alpha }. \end{aligned}$$

(8)

Although the resulting codivergences seem related to the well-known Rényi divergence $(1-\alpha )^{-1} \log (\int p(x)^\alpha q(x)^{1-\alpha } d\nu (x)) $ between probability measures P and Q with densities p and q [23], the term $\int (p_1(x)p_2(x))^\alpha p_0(x)^{1-2\alpha } d\nu (x)$ occurring in the definitions of ${V_\alpha }$ and ${R_\alpha }$ is of a different nature.

In the case $\alpha = 1$, that is, $\phi (x) = x,$ both notions of codivergence agree. Denoting by $p_0, p_1, p_2$ the respective $\nu $-densities of $P_0, P_1, P_2,$ where $\nu $ is a measure dominating $P_0$, the corresponding codivergence

$$\begin{aligned} \chi ^2(P_0 | P_1, P_2) := {\left\{ \begin{array}{ll} \displaystyle \int \frac{dP_1}{dP_0} dP_2 - 1 = \int \frac{p_1 p_2}{p_0} d\nu - 1, &{} \text {if } P_1 \ll P_0 \text { and } P_2 \ll P_0, \\ + \infty , &{} \text {else,} \end{array}\right. } \end{aligned}$$

will be called $\chi ^2$-codivergence. The (usual) $\chi ^2$-divergence is defined as $\chi ^2(P,Q):= \int (dP/dQ-1)^2 dQ= \int (dP/dQ)^2 dQ-1$, if P is dominated by Q and $+\infty $ otherwise. Therefore, the $\chi ^2$-codivergence $\chi ^2(P_0 | P_1, P_1)$ coincides with the usual $\chi ^2$-divergence $\chi ^2(P_1, P_0)$ for any $P_0$ and $P_1$.
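The reduction of the $\chi ^2$-codivergence to the usual $\chi ^2$-divergence can be verified on a small discrete example. The sketch below assumes probability vectors on a common finite set; the function names are illustrative only.

```python
import math

def chi2_codivergence(p0, p1, p2):
    """chi^2(P0 | P1, P2) for probability vectors on a common finite set."""
    for p in (p1, p2):
        if any(q == 0 and r > 0 for r, q in zip(p, p0)):
            return math.inf  # P1 or P2 is not dominated by P0
    return sum(a * b / q for q, a, b in zip(p0, p1, p2) if q > 0) - 1.0

def chi2_divergence(p, q):
    """Usual chi^2(P, Q) = int (dP/dQ - 1)^2 dQ, and +infinity unless P << Q."""
    if any(b == 0 and a > 0 for a, b in zip(p, q)):
        return math.inf
    return sum((a / b - 1.0) ** 2 * b for a, b in zip(p, q) if b > 0)

p0 = [0.5, 0.3, 0.2]
p1 = [0.4, 0.4, 0.2]

diagonal = chi2_codivergence(p0, p1, p1)   # chi^2(P0 | P1, P1)
classical = chi2_divergence(p1, p0)        # chi^2(P1, P0)
```

Both quantities coincide up to floating-point rounding, as stated in the text. Note the order of the arguments: the reference measure $P_0$ plays the role of the second argument of the classical divergence.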

Another interesting codivergence is ${R_\alpha }$ with $\alpha = 1/2$. The resulting codivergence

$$\begin{aligned} \rho (P_0 | P_1, P_2) := \dfrac{\int \sqrt{p_1 p_2} d\nu }{ \int \sqrt{p_1 p_0} d\nu \int \sqrt{p_2 p_0} d\nu } - 1, \end{aligned}$$

(9)

is called Hellinger codivergence. We can (and will) define the Hellinger codivergence by (9) whenever the denominator $\int \sqrt{p_1 p_0} d\nu \int \sqrt{p_2 p_0} d\nu $ is positive. This is considerably weaker than $P_1,P_2 \ll P_0,$ as it only requires that the support of $p_0$ intersects the support of $p_1$ and the support of $p_2$ in sets of non-zero $\nu $-mass. Note that $\rho (P_0 | P_1, P_2)$ is independent of the choice of the dominating measure $\nu $ (and is set to $+\infty $ if the denominator is 0).

The name Hellinger codivergence is motivated by the representation

$$\begin{aligned} \rho (P_0 | P_1, P_2) = \frac{\alpha (P_1, P_2)}{\alpha (P_0, P_1) \alpha (P_0, P_2)} - 1, \end{aligned}$$

where $\alpha (P,Q):= \int \sqrt{pq} d\nu $ is the Hellinger affinity between two positive measures P, Q with densities p, q taken with respect to a common dominating measure.

The $\chi ^2$- and Hellinger codivergence are of interest as they can be used to control changes of expectation between probability measures, see Section 2.2 of [9].

We always have

$$\begin{aligned} \rho (P_0 | P_1, P_1) \le \chi ^2(P_0 | P_1, P_1). \end{aligned}$$

(10)

To see this, observe that Hölder’s inequality with $p=3/2$ and $q=3$ gives for any non-negative function f, $1=\int p_1\le (\int f^{3/2} p_1)^{2/3}(\int f^{-3}p_1)^{1/3}.$ The choice $f=(p_0/p_1)^{1/3}$ yields $1\le (\int \sqrt{p_1p_0})^{2/3} (\int p_1^2/p_0)^{1/3}$ and, after raising both sides to the third power, $1\le (\int \sqrt{p_1p_0})^2 \int p_1^2/p_0.$ Therefore $1 / (\int \sqrt{p_1p_0})^2 \le \int p_1^2/p_0.$ Subtracting one on each side of this expression yields (10).
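Inequality (10) can also be probed empirically on randomly drawn discrete distributions. The following sketch is a sanity check rather than a proof; the sampling scheme, the seed and the function names are arbitrary choices made for illustration.

```python
import math
import random

def hellinger_affinity(p, q):
    """alpha(P, Q) = sum of sqrt(p * q) on a finite set."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_codivergence(p0, p1, p2):
    denom = hellinger_affinity(p0, p1) * hellinger_affinity(p0, p2)
    if denom == 0:
        return math.inf
    return hellinger_affinity(p1, p2) / denom - 1.0

def chi2_codivergence(p0, p1, p2):
    """Valid here because all sampled vectors are strictly positive."""
    return sum(a * b / q for q, a, b in zip(p0, p1, p2)) - 1.0

def random_distribution(rng, k):
    """Strictly positive probability vector of length k."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
gaps = []
for _ in range(1000):
    p0 = random_distribution(rng, 5)
    p1 = random_distribution(rng, 5)
    # chi^2(P0 | P1, P1) - rho(P0 | P1, P1) should be non-negative by (10).
    gaps.append(chi2_codivergence(p0, p1, p1) - hellinger_codivergence(p0, p1, p1))
min_gap = min(gaps)
```

In all sampled cases the gap is non-negative up to rounding, consistent with (10).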

Proposition 2.5 implies that the $\chi ^2$-codivergence and the Hellinger codivergence are codivergences with respective bilinear maps $\langle \mu , \widetilde{\mu }\rangle _{P_0}$ for the $\chi ^2$-codivergence and $\langle \mu , \widetilde{\mu }\rangle _{P_0} / 4$ for the Hellinger codivergence.
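The two bilinear maps can be compared numerically by perturbing a discrete $P_0$ in two directions from $\mathcal {M}_{P_0}$ and dividing the codivergence by $ts$. The sketch below uses hypothetical names and a specific small example; it assumes strictly positive probability vectors.

```python
import math

def chi2_codivergence(p0, p1, p2):
    return sum(a * b / q for q, a, b in zip(p0, p1, p2)) - 1.0

def hellinger_codivergence(p0, p1, p2):
    aff = lambda p, q: sum(math.sqrt(a * b) for a, b in zip(p, q))
    return aff(p1, p2) / (aff(p0, p1) * aff(p0, p2)) - 1.0

def perturb(p0, mu, t):
    """Weights of P0 + t * mu."""
    return [q + t * m for q, m in zip(p0, mu)]

p0 = [0.5, 0.3, 0.2]
mu = [0.1, -0.15, 0.05]        # total mass 0
mu_tilde = [-0.2, 0.1, 0.1]    # total mass 0

# <mu, mu_tilde>_{P0} = sum of mu * mu_tilde / p0
inner = sum(m * n / q for q, m, n in zip(p0, mu, mu_tilde))

t = s = 1e-3
p1, p2 = perturb(p0, mu, t), perturb(p0, mu_tilde, s)
chi2_ratio = chi2_codivergence(p0, p1, p2) / (t * s)
hellinger_ratio = hellinger_codivergence(p0, p1, p2) / (t * s)
```

For the $\chi ^2$-codivergence the identity $\chi ^2(P_0 | P_0 + t\mu , P_0 + s\widetilde{\mu }) = ts \langle \mu , \widetilde{\mu }\rangle _{P_0}$ is exact, so `chi2_ratio` matches `inner` up to rounding, whereas `hellinger_ratio` is close to `inner / 4` for small steps.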

For the Hellinger codivergence, the expansion in Proposition 2.5 can be generalized. Assume that $P_0$ is dominated by some positive measure $\nu $. Define $\textrm{Supp}(\mu ):= \{ x \in \mathcal {A}: d\mu /d\nu (x) \ne 0\}$ for any signed measure $\mu $ dominated by $\nu .$ If $\mu _1$ and $\mu _2$ are signed measures dominated by $\nu $ such that (i) $\textrm{Supp}(\mu _i) \cap \textrm{Supp}(P_0)$ has a positive $\nu $-measure, and (ii) their densities $h_i$ are positive on $\textrm{Supp}(\mu _i) \backslash \textrm{Supp}(P_0)$, then

$$\begin{aligned} \rho (P_0 | P_0 + t \mu _1, P_0 + s \mu _2)&= \sqrt{t s} \int _{\textrm{Supp}(P_0)^C} \sqrt{h_1 h_2} d\nu \nonumber \\&\quad + t s \int _{\textrm{Supp}(P_0)} \frac{h_1 h_2}{2 p_0} d\nu + o(t^2 + s^2). \end{aligned}$$

(11)

Compared to Definition 2.2 (iii), there is thus an additional term for probability measures that have mass outside of the support of $P_0$. Consequently, this expansion cannot be linked to one local bilinear form and the mapping $(t, s) \in \mathbb {R}_+^2 \mapsto \rho (P_0 | P_0 + t \mu _1, P_0 + s \mu _2)$ is not differentiable at (0, 0). This is in line with Proposition 2.4: for perturbations $\mu $ that do not belong to $\mathcal {M}_{P_0}$, the measures $P_0 + t \mu $ cannot be probability measures for all t in any open neighborhood of 0.

The ${R_\alpha }$ codivergences admit convenient expressions for product measures and for exponential families. The first proposition is proved in Sect. A.2.

Proposition 2.6

Let $P_{j\ell }$ be probability measures for any $j=0, 1, 2$ and for any $\ell = 1, \dots , d$ satisfying $P_{1\ell }, P_{2\ell } \ll P_{0\ell }$. Then

$$\begin{aligned} {R_\alpha }\left( { \bigotimes _{\ell =1}^d P_{0\ell }}\bigg \vert { \bigotimes _{\ell =1}^d P_{1\ell }}, { \bigotimes _{\ell =1}^d P_{2\ell }}\right) = \prod _{\ell =1}^d \Bigg ( {R_\alpha }(P_{0\ell } | P_{1\ell }, P_{2\ell } ) + 1\Bigg ) - 1. \end{aligned}$$
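The product formula of Proposition 2.6 can be illustrated on finite sample spaces. The sketch below assumes discrete marginals with strictly positive point masses and computes ${R_\alpha }$ from its definition with $\phi (x) = x^\alpha $; the helper names and the marginals are hypothetical choices made for illustration.

```python
import itertools
import math

def r_alpha(p0, p1, p2, alpha):
    """R_alpha(P0 | P1, P2) for discrete distributions with p0 > 0 everywhere."""
    num = sum(c * (a / c) ** alpha * (b / c) ** alpha for a, b, c in zip(p1, p2, p0))
    m1 = sum(c * (a / c) ** alpha for a, c in zip(p1, p0))
    m2 = sum(c * (b / c) ** alpha for b, c in zip(p2, p0))
    return num / (m1 * m2) - 1.0

def product_pmf(marginals):
    """Point masses of the product measure, enumerated in a fixed order."""
    return [math.prod(masses) for masses in itertools.product(*marginals)]

# Hypothetical marginals: P[j][l] is the pmf of P_{j,l} for j = 0, 1, 2 and two coordinates.
P = [
    [[0.5, 0.5], [0.2, 0.3, 0.5]],
    [[0.4, 0.6], [0.3, 0.3, 0.4]],
    [[0.7, 0.3], [0.1, 0.6, 0.3]],
]
alpha = 0.5

# Left-hand side: codivergence of the product measures.
lhs = r_alpha(product_pmf(P[0]), product_pmf(P[1]), product_pmf(P[2]), alpha)
# Right-hand side: product over coordinates of (R_alpha + 1), minus 1.
rhs = math.prod(r_alpha(P[0][l], P[1][l], P[2][l], alpha) + 1.0 for l in range(2)) - 1.0
print(lhs, rhs)
```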

Proposition 2.7

Let $\Theta $ be a subset of a real vector space and let $(P_\theta :\theta \in \Theta )$ be an exponential family with $\nu $-densities $p_\theta (x)=h(x)\exp (\theta ^\top T(x)-A(\theta ))$ for some dominating measure $\nu $. Then, for any $\theta _0,\theta _1,\theta _2\in \Theta $ satisfying

$$\begin{aligned} \theta _0 + \alpha \big (\theta _1+\theta _2-2\theta _0\big ), \theta _0 + \alpha \big (\theta _1-\theta _0\big ), \theta _0 + \alpha \big (\theta _2-\theta _0\big ) \in \Theta , \end{aligned}$$

(12)

we have

$$\begin{aligned} {R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})&= \exp \Big (A\big (\theta _0+\alpha \big (\theta _1+\theta _2-2\theta _0\big )\big ) - A\big (\theta _0+\alpha \big (\theta _1-\theta _0\big )\big ) \\&\quad - A\big (\theta _0+\alpha \big (\theta _2-\theta _0\big )\big ) + A\big (\theta _0\big )\Big )-1. \end{aligned}$$

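This proposition is proved in Sect. A.3. Condition (12) is satisfied if $\Theta $ is a vector space, or if $0<\alpha \le 1/2$ and $\Theta $ is convex, since each of the three points in (12) is then a convex combination of $\theta _0, \theta _1, \theta _2$; for instance, $\theta _0 + \alpha (\theta _1+\theta _2-2\theta _0) = (1-2\alpha )\theta _0 + \alpha \theta _1 + \alpha \theta _2$. In the case of the Gamma distribution, the parameter space is $\Theta = (-1, +\infty ) \times (-\infty , 0)$, and the constraints in (12) are then necessary and sufficient for the statement of Proposition 2.7 to hold, see Sect. 5.4 for details.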
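As an illustration of Proposition 2.7 and of the role of condition (12), consider the exponential distribution with rate $\lambda > 0$. This family is chosen here only as a convenient example and is not treated in the text; it has natural parameter $\theta = -\lambda \in (-\infty , 0)$, $T(x) = x$ and $A(\theta ) = -\log (-\theta )$. The sketch below, with hypothetical function names, compares the formula of Proposition 2.7 with the definition of ${R_\alpha }$, where the integrals are evaluated through $\int _0^\infty e^{-cx} dx = 1/c$ for $c > 0$.

```python
import math

def r_alpha_expfam(A, theta0, theta1, theta2, alpha):
    """Right-hand side of Proposition 2.7 for a scalar natural parameter."""
    return math.exp(
        A(theta0 + alpha * (theta1 + theta2 - 2 * theta0))
        - A(theta0 + alpha * (theta1 - theta0))
        - A(theta0 + alpha * (theta2 - theta0))
        + A(theta0)
    ) - 1.0

def log_partition_exponential(theta):
    """A(theta) = -log(-theta) for the density exp(theta * x - A(theta)) on x > 0."""
    return -math.log(-theta)

def r_alpha_exponential_direct(lam0, lam1, lam2, alpha):
    """R_alpha from its definition, using int_0^inf exp(-c x) dx = 1/c for c > 0."""
    def integral(w1, w2):
        # int p1^w1 * p2^w2 * p0^(1 - w1 - w2) dx for exponential densities
        w0 = 1.0 - w1 - w2
        rate = w1 * lam1 + w2 * lam2 + w0 * lam0
        if rate <= 0:
            raise ValueError("integral diverges: condition (12) fails")
        return lam1 ** w1 * lam2 ** w2 * lam0 ** w0 / rate
    return integral(alpha, alpha) / (integral(alpha, 0.0) * integral(0.0, alpha)) - 1.0

lam0, lam1, lam2, alpha = 1.0, 1.5, 2.0, 0.5
via_prop = r_alpha_expfam(log_partition_exponential, -lam0, -lam1, -lam2, alpha)
direct = r_alpha_exponential_direct(lam0, lam1, lam2, alpha)
print(via_prop, direct)
```

For rates such as $\lambda _0 = 1$, $\lambda _1 = \lambda _2 = 0.2$ and $\alpha = 1$, the point $\theta _0 + \alpha (\theta _1 + \theta _2 - 2\theta _0)$ is positive and therefore outside the parameter space, and the direct computation raises an error.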

For the most common families of distributions, closed-form expressions for the ${R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})$ codivergences are reported in Table 1. Derivations for these expressions are given in Sect. 5. This section also contains expressions for the Gamma distribution. As mentioned before, these codivergences quantify to which extent the measures $P_1$ and $P_2$ represent different directions around $P_0.$ The explicit formulas show this in terms of the parameters and reveal significant similarity between the different families. For the multivariate normal distribution the ${R_\alpha }$ codivergence vanishes if and only if the vectors $\theta _1-\theta _0$ and $\theta _2-\theta _0$ are orthogonal.

Table 1 Closed-form expressions for the ${R_\alpha }$ codivergence for some parametric distributions


3 Divergence matrices

Definition 3.1

Let $M\ge 1.$ For a given codivergence $D( \cdot | \cdot , \cdot )$ on a space $\mathcal {X}\subset E$ and $u, v_1, \dots , v_M$ elements of $\mathcal {X}$, we define the divergence matrix $D(u | v_1, \dots , v_M)$ as the $M \times M$ matrix with (j, k)-th entry $D(u | v_1, \dots , v_M)_{j,k}:= D(u | v_j, v_k)$, for all $1 \le j,k \le M$.

If $v_1, \dots , v_M$ are all in a neighborhood of u, the divergence matrix D can be related to the Gram matrix of the bilinear form $\langle \cdot , \cdot \rangle _{u}$. Formally, for $\textbf{t}= (t_1, \dots , t_M) \in \mathbb {R}^M$ such that for any $i = 1, \dots , M, u + t_i h_i \in \mathcal {X}$, we have

$$\begin{aligned} D(u | u + t_1 h_1, \dots , u + t_M h_M) = \textbf{t}\mathbb {G}_{u} \textbf{t}^\top + o(\Vert {\textbf{t}}\Vert ^2), \end{aligned}$$

with Gram matrix $\mathbb {G}_{u}:= (\langle h_i, h_j \rangle _{u})_{1 \le i,j \le M}$.
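Read entrywise, the expansion says that the (j, k)-th entry of the divergence matrix behaves like $t_j t_k \langle h_j, h_k \rangle _u$ for small perturbations. The following sketch checks this on a finite sample space for the $\chi ^2$-codivergence (where the relation is exact) and for the Hellinger codivergence (where the local bilinear form carries the factor $1/4$). The distributions, the perturbations and the function names are hypothetical.

```python
import math

def chi2_codiv(p0, p1, p2):
    """chi^2(P0 | P1, P2) for discrete distributions with p0 > 0 everywhere."""
    return sum(a * b / c for a, b, c in zip(p1, p2, p0)) - 1.0

def hellinger_codiv(p0, p1, p2):
    """rho(P0 | P1, P2) via Hellinger affinities."""
    aff = lambda p, q: sum(math.sqrt(a * b) for a, b in zip(p, q))
    return aff(p1, p2) / (aff(p0, p1) * aff(p0, p2)) - 1.0

p0 = [0.5, 0.3, 0.2]
h1 = [0.1, -0.1, 0.0]      # signed perturbations with total mass zero
h2 = [0.05, 0.05, -0.1]
gram_12 = sum(a * b / c for a, b, c in zip(h1, h2, p0))  # <h1, h2>_{P0}

t, s = 1e-3, 1e-3
p1 = [c + t * a for c, a in zip(p0, h1)]
p2 = [c + s * b for c, b in zip(p0, h2)]

chi_12 = chi2_codiv(p0, p1, p2)        # equals t * s * gram_12 up to rounding
rho_12 = hellinger_codiv(p0, p1, p2)   # close to t * s * gram_12 / 4
print(chi_12 / (t * s), rho_12 / (t * s), gram_12)
```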

Based on the codivergences ${V_\phi }(P_0|P_1,P_2), {R_\phi }(P_0|P_1,P_2),$ one can now define corresponding $M\times M$ divergence matrices with (j, k)-th entry

$$\begin{aligned}&{V_\phi }(P_0|P_1,\ldots ,P_M)_{j,k} :={V_\phi }(P_0|P_j,P_k) \nonumber \\&\quad = \int \phi \Bigg (\frac{dP_j}{dP_0}\Bigg )\phi \Bigg (\frac{dP_k}{dP_0}\Bigg ) dP_0-\int \phi \Bigg (\frac{dP_j}{dP_0}\Bigg )dP_0 \int \phi \Bigg (\frac{dP_k}{dP_0}\Bigg ) dP_0, \end{aligned}$$

(13)

and

$$\begin{aligned} {R_\phi }(P_0|P_1,\ldots ,P_M)_{j,k} :={R_\phi }(P_0|P_j,P_k) = \dfrac{ \int \phi \bigg (\frac{dP_j}{dP_0}\bigg )\phi \bigg (\frac{dP_k}{dP_0}\bigg ) dP_0}{\int \phi \bigg (\frac{dP_j}{dP_0}\bigg )dP_0 \int \phi \bigg (\frac{dP_k}{dP_0}\bigg ) dP_0} - 1, \end{aligned}$$

(14)

provided that $P_1,\ldots ,P_M\ll P_0.$ The codivergence matrices are linked by the relationship

$$\begin{aligned} {R_\phi }(P_0|P_1,\ldots ,P_M) = D \cdot {V_\phi }(P_0|P_1,\ldots ,P_M) \cdot D, \end{aligned}$$

(15)

where D denotes the $M\times M$ diagonal matrix with j-th diagonal entry $1/\int \phi \big (\frac{dP_j}{dP_0}\big ) dP_0,$ $j=1,\ldots ,M.$
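Relation (15) is straightforward to verify on a finite sample space. The sketch below builds the matrices in (13) and (14) for discrete distributions with $p_0 > 0$ everywhere and compares ${R_\phi }$ with $D {V_\phi } D$; the function name, the variable `D_diag` and the example data are assumptions made for illustration.

```python
def divergence_matrices(p0, ps, phi):
    """Return (V, R, D_diag) for discrete P_1, ..., P_M << P_0 with p0 > 0 everywhere."""
    z = [[phi(q / c) for q, c in zip(pj, p0)] for pj in ps]       # phi(dP_j/dP_0)
    mean = [sum(c * zx for c, zx in zip(p0, zj)) for zj in z]     # int phi(dP_j/dP_0) dP_0
    M = len(ps)
    cross = [[sum(c * a * b for c, a, b in zip(p0, z[j], z[k])) for k in range(M)]
             for j in range(M)]
    V = [[cross[j][k] - mean[j] * mean[k] for k in range(M)] for j in range(M)]
    R = [[cross[j][k] / (mean[j] * mean[k]) - 1.0 for k in range(M)] for j in range(M)]
    D_diag = [1.0 / m for m in mean]  # diagonal entries of the matrix D in (15)
    return V, R, D_diag

p0 = [0.5, 0.3, 0.2]
ps = [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
V, R, D_diag = divergence_matrices(p0, ps, lambda x: x ** 0.5)

# Entry (j, k) of D * V * D for a diagonal matrix D.
DVD = [[D_diag[j] * V[j][k] * D_diag[k] for k in range(2)] for j in range(2)]
print(R, DVD)
```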

Just as ${\text {Cov}}(X_1, X_2)$ can denote either the covariance between the random variables $X_1$ and $X_2$ or the $2\times 2$ covariance matrix of the random vector $(X_1,X_2),$ the expressions ${V_\phi }(P_0|P_1,P_2)$ and ${R_\phi }(P_0|P_1,P_2)$ can denote either codivergences or $2\times 2$ divergence matrices. It is always clear from the context which of the two interpretations is meant.

The divergence matrices with function $\phi _\alpha (x):= x^\alpha $ are denoted by ${V_\alpha }(P_0 | P_1, \dots , P_M)$ and ${R_\alpha }(P_0 | P_1, \dots , P_M)$. Similarly, the $\chi ^2$-divergence matrix $\chi ^2(P_0 | P_1, \dots , P_M)$ and the Hellinger affinity matrix $\rho (P_0 | P_1, \dots , P_M)$ are the $M\times M$ divergence matrices of the $\chi ^2$-codivergence and the Hellinger codivergence with (j, k)-th entry

$$\begin{aligned} \chi ^2(P_0 | P_1, \dots , P_M)_{j,k}&:= \int \frac{dP_j}{dP_0} dP_k - 1 \text {, \ \ and,} \\ \rho (P_0 | P_1, \dots , P_M)_{j,k}&:= \frac{\int \sqrt{p_j p_k} d\nu }{\int \sqrt{p_j p_0 } d\nu \int \sqrt{p_k p_0} d\nu } -1, \end{aligned}$$

for all $1 \le j, k \le M$. As in the previous section, the condition for finiteness of the Hellinger codivergence matrix is weaker than for general ${R_\phi }$ and ${V_\phi }$ codivergences. Instead of domination $P_1, \dots , P_M \ll P_0$, it is only required that the integrals $\int \sqrt{p_j p_0} d\nu $ are positive, for some dominating measure $\nu $ and $p_j:= dP_j / d\nu $. By (6), the local Gram matrix of the $\chi ^2$-divergence matrix at a distribution $P_0$ is $\mathbb {G}_{P_0}:= \big [\int \frac{h_i h_j}{p_0} d\nu \big ]_{1 \le i,j \le M},$ and the local Gram matrix of the Hellinger divergence matrix is $\mathbb {G}_{P_0} / 4$.

Let $\Phi (X):= (\phi (dP_1/dP_0(X)), \dots , \phi (dP_M/dP_0(X)))^\top $ denote the random vector containing the likelihood ratios of the M measures. Since ${\text {Cov}}(U,V) = E[UV] - E[U]E[V],$ we have

$$\begin{aligned} {V_\phi }(P_0|P_1,\ldots ,P_M) = {\text {Cov}}_{P_0}\big (\Phi (X)\big ), \end{aligned}$$

(16)

where the covariance is computed with respect to the distribution $P_0$ as indicated by the subscript $P_0.$ Moreover, we have

$$\begin{aligned} \textbf{v}^\top {V_\phi }(P_0|P_1,\ldots ,P_M) \textbf{v}&= {\text {Var}}_{P_0} \big (\textbf{v}^\top \Phi (X)\big ). \end{aligned}$$

Applying Eq. (15) yields moreover

$$\begin{aligned} {R_\phi }(P_0|P_1,\ldots ,P_M) = D {\text {Cov}}_{P_0}\big (\Phi (X)\big ) D. \end{aligned}$$

(17)

This shows that ${V_\phi }(P_0|P_1,\ldots ,P_M)$ and ${R_\phi }(P_0|P_1,\ldots ,P_M)$ can be interpreted as covariance matrices and are therefore symmetric and positive semi-definite. Applying the Taylor expansion to the likelihood ratios in the previous identities provides a direct way of recovering the local Gram matrix associated to the nonparametric Fisher information metric.

In a next step, we state a more specific identity for the $\chi ^2$-divergence matrix. To do so, we first extend the usual notion of the $\chi ^2$-divergence to the case where the first argument is a signed measure. Let $\mu $ be a finite signed measure and P be a probability measure defined on the same measurable space $(\Omega , {\mathcal {A}})$. We define the $\chi ^2$-divergence of $\mu $ and P by

$$\begin{aligned} \chi ^2(\mu ,P) := {\left\{ \begin{array}{ll} \displaystyle \int \Bigg ( \frac{d\mu }{dP} - \mu (\Omega )\Bigg )^2 dP, &{} \text { if } \mu \ll P, \\ + \infty &{} \text { else.} \end{array}\right. } \end{aligned}$$

(18)

Here, $d\mu /dP$ denotes the Radon–Nikodym derivative of the signed measure $\mu $ with respect to P (defined e.g. in Theorem 4.2.4 in [7]). This definition of $\chi ^2(\mu ,P)$ generalizes the case where $\mu $ is a probability measure and allows us to rewrite the $\chi ^2$-divergence matrix as

$$\begin{aligned} \textbf{v}^\top \chi ^2(P_0 | P_1, \dots , P_M) \textbf{v}&= \int \left( \sum _{j=1}^M \left( \frac{dP_j}{dP_0}-1\right) v_j \right) ^2 dP_0 \hspace{-0.1em} = \chi ^2\left( \sum _{j=1}^M v_j P_j, P_0\right) , \end{aligned}$$

(19)

with $\sum _{j=1}^M v_j P_j$ the mixture (signed) measure of $P_1, \dots , P_M.$ Similarly, for the Hellinger divergence matrix it can be checked that

$$\begin{aligned} \textbf{v}^\top \rho (P_0 | P_1,\dots ,P_M) \textbf{v}= \int \left( \sum _{j=1}^M \left( \frac{\sqrt{p_j}}{\int \sqrt{p_j p_0} d\nu }-\sqrt{p_0} \right) v_j \right) ^2 d\nu . \end{aligned}$$

(20)
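Identities (19) and (20) can be checked numerically when the sample space is finite and $\nu $ is the counting measure. The following sketch uses hypothetical helper names and example data; the weight vector is allowed to have negative entries, so that $\sum _j v_j P_j$ is a genuine signed measure.

```python
import math

def chi2_matrix(p0, ps):
    """chi^2-divergence matrix for discrete distributions with p0 > 0 everywhere."""
    return [[sum(a * b / c for a, b, c in zip(pj, pk, p0)) - 1.0 for pk in ps] for pj in ps]

def hellinger_matrix(p0, ps):
    """Hellinger divergence matrix rho(P0 | P1, ..., PM)."""
    aff = lambda p, q: sum(math.sqrt(a * b) for a, b in zip(p, q))
    return [[aff(pj, pk) / (aff(pj, p0) * aff(pk, p0)) - 1.0 for pk in ps] for pj in ps]

def quad_form(A, v):
    n = len(v)
    return sum(v[j] * A[j][k] * v[k] for j in range(n) for k in range(n))

def chi2_signed(mu, p):
    """chi^2(mu, P) from (18) for a finite signed measure mu << P on a finite set."""
    total_mass = sum(mu)
    return sum((m / c - total_mass) ** 2 * c for m, c in zip(mu, p))

p0 = [0.5, 0.3, 0.2]
ps = [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
v = [1.0, -2.0]

# Identity (19): quadratic form versus chi^2 of the signed mixture.
mixture = [sum(vj * pj[x] for vj, pj in zip(v, ps)) for x in range(len(p0))]
lhs_19 = quad_form(chi2_matrix(p0, ps), v)
rhs_19 = chi2_signed(mixture, p0)

# Identity (20): quadratic form versus the integral of a square.
aff0 = [sum(math.sqrt(a * c) for a, c in zip(pj, p0)) for pj in ps]
lhs_20 = quad_form(hellinger_matrix(p0, ps), v)
rhs_20 = sum(
    sum(vj * (math.sqrt(pj[x]) / aj - math.sqrt(p0[x])) for vj, pj, aj in zip(v, ps, aff0)) ** 2
    for x in range(len(p0))
)
print(lhs_19, rhs_19, lhs_20, rhs_20)
```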

Writing ${\text {Rank}}(A)$ for the rank of a matrix A and ${\text {Rank}}(x_1, \dots , x_n)$ for the dimension of the linear span of n elements $x_1, \dots , x_n$ in a vector space E, we will now derive an identity for the rank of divergence matrices.

Proposition 3.2

Let $M \ge 1$, and let $P_0, P_1, \dots , P_M$ be $(M+1)$ probability distributions.

(i)
Assume that $P_1, \dots , P_M \ll P_0$. Then for any non-negative function $\phi :[0,\infty )\rightarrow [0,\infty )$ such that $\phi (1)=1$, we have
$$\begin{aligned} \vspace{-0.3em} {\text {Rank}}({R_\phi }(P_0 | P_1, \dots , P_M))&= {\text {Rank}}({V_\phi }(P_0 | P_1, \dots , P_M)) \\&= {\text {Rank}}\bigg (1, \phi \circ \frac{dP_{1}}{dP_0}, \dots , \phi \circ \frac{dP_M}{dP_0} \bigg ) - 1, \end{aligned}$$
where functions are considered as elements of the vector space $L^1(\mathcal {A}, \mathscr {B}, P_0)$, that is, linear independence is considered $P_0$-almost everywhere.
(ii)
Let $\nu $ be a common dominating measure of $P_0, \dots , P_M$. Assume that $\forall j = 1, \dots , M, \int \sqrt{p_j p_0} d\nu > 0$ with $p_j:= dP_j / d\nu $. Then we have
$$\begin{aligned} {\text {Rank}}(\rho (P_0 | P_1, \dots , P_M)) = {\text {Rank}}(\sqrt{p_0}, \sqrt{p_1}, \dots , \sqrt{p_M}) - 1, \end{aligned}$$
where functions are considered as elements of the vector space $L^1(\mathcal {A}, \mathscr {B}, \nu )$.

Statement (ii) is not a consequence of (i) with $\phi (x) = x^{1/2}.$ Indeed, (i) relies on likelihood ratios assuming that the measures $P_1, \dots , P_M$ are dominated by $P_0,$ while (ii) only requires that each of the probability measures $P_1, \dots , P_M$ has a common support with $P_0$ of positive $P_0$-measure. The proof of (ii) exploits the specific property (20) of the Hellinger divergence.

Proposition 3.2 applied to $\phi (x) = x$ shows that whenever $P_0$ is a linear combination of $P_1, \dots , P_M$, then ${\text {Rank}}(1, dP_{1}/dP_0, \dots , dP_{M}/dP_0) < M + 1$ and ${\text {Rank}}(\chi ^2(P_0|P_1, \dots , P_M)) < M,$ which means that the $\chi ^2(P_0|P_1, \dots , P_M)$ divergence matrix is singular. Similarly, whenever $\sqrt{p_0}$ is a linear combination of $\sqrt{p_1}, \dots , \sqrt{p_M}$, the Hellinger divergence matrix is singular.
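The singularity statement can be observed directly for $M = 2$ on a three-point sample space. In the sketch below (hypothetical names and data), the determinant of the $\chi ^2$-divergence matrix vanishes up to rounding when $P_0$ is a mixture of $P_1$ and $P_2$, and is strictly positive for a reference measure for which $1, dP_1/dP_0, dP_2/dP_0$ are linearly independent.

```python
def chi2_matrix(p0, ps):
    """chi^2-divergence matrix for discrete distributions with p0 > 0 everywhere."""
    return [[sum(a * b / c for a, b, c in zip(pj, pk, p0)) - 1.0 for pk in ps] for pj in ps]

def det2(A):
    """Determinant of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

p1 = [0.2, 0.5, 0.3]
p2 = [0.3, 0.3, 0.4]

# P0 is a linear combination of P1 and P2: the matrix has rank < 2.
p0_mixture = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]
det_mixture = det2(chi2_matrix(p0_mixture, [p1, p2]))

# A reference measure outside the span of P1 and P2: the matrix has full rank.
p0_generic = [0.5, 0.3, 0.2]
det_generic = det2(chi2_matrix(p0_generic, [p1, p2]))
print(det_mixture, det_generic)
```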

Proof of Proposition 3.2

We first prove (i). Since D is an invertible matrix, a direct consequence of Eq. (15) is ${\text {Rank}}({R_\phi }(P_0 | P_1, \dots , P_M)) = {\text {Rank}}({V_\phi }(P_0 | P_1, \dots , P_M)).$ Applying Eq. (16) and then Lemma 6.3, we obtain that $r:= {\text {Rank}}({V_\phi }(P_0 | P_1, \dots , P_M)) = {\text {Rank}}({\text {Cov}}_{P_0}(Z_1, \dots , Z_M)) = {\text {Rank}}(Z_1 - E_{P_0} Z_1, \dots , Z_M - E_{P_0} Z_M)$, where $Z_j:= \phi (dP_j/dP_0(X))$ for $j=1, \dots , M$ and $E_{P_0}$ denotes the expectation with respect to $P_0$. The random variables $Z_1 - E_{P_0} Z_1, \dots , Z_M - E_{P_0} Z_M$ are centered and therefore linearly independent of the (constant) random variable $Z_0:= 1 = \phi (dP_0 / dP_0(X))$. Therefore,

$$\begin{aligned} r&= {\text {Rank}}(Z_1 - E_{P_0} Z_1, \dots , Z_M - E_{P_0} Z_M) \\&= {\text {Rank}}(1, Z_1 - E_{P_0} Z_1, \dots , Z_M - E_{P_0} Z_M) - 1 \\&= {\text {Rank}}(Z_0, Z_1, \dots , Z_M) - 1. \end{aligned}$$

By Lemma 6.2, r is the highest integer such that there exist $i_1, \dots , i_r \in \{0, \dots , M\}$ with $(Z_{i_1}, \dots , Z_{i_r})$ linearly independent random variables $P_0$-almost surely.

Using the definition of the $Z_j$ and $X \sim P_0$, the random variables $\{Z_{i_1}, \dots , Z_{i_r}\}$ are linearly independent $P_0$-almost surely if and only if $P_0 \big (\sum _{j=1}^r a_j \phi (dP_{i_j}/dP_0(X)) = 0 \big ) = 1$ implies $a_1=\ldots =a_r=0$. This is the case if and only if the functions $\{\phi \circ dP_{i_1}/dP_0, \dots , \phi \circ dP_{i_r}/dP_0\}$ are linearly independent $P_0$-almost everywhere, proving ${\text {Rank}}(Z_{i_1}, \dots , Z_{i_r}) = {\text {Rank}}(\phi \circ dP_{i_1}/dP_0, \dots , \phi \circ dP_{i_r}/dP_0)$.

Before proving (ii) in full generality, we first show that ${\text {Rank}}(\rho (P_0 | P_1, \dots , P_M)) = M$ if and only if all the $M+1$ functions $\sqrt{p_0}, \dots , \sqrt{p_M}$ are linearly independent $\nu $-almost everywhere. The matrix is singular if and only if there exists a non-null vector v such that $\sum _{j=1}^M \frac{v_j \sqrt{p_j}}{\int \sqrt{p_j p_0} d\nu } = \sum _{j=1}^M v_j \sqrt{p_0}$ $\nu $-almost everywhere. This is the case if and only if there are numbers $w_0,\dots ,w_M,$ that are not all equal to zero, satisfying $\sum _{j=0}^M w_j\sqrt{p_j}=0,$ $\nu $-almost everywhere. To verify the more difficult reverse direction of this equivalence, it is enough to observe that $\sum _{j=0}^M w_j\sqrt{p_j}=0$ implies $w_0=- \sum _{j=1}^M w_j \int \sqrt{p_j p_0} d\nu $ and thus, taking $v_j=w_j \int \sqrt{p_j p_0} d\nu $ yields $\sum _{j=1}^M \frac{v_j \sqrt{p_j}}{\int \sqrt{p_j p_0} d\nu } = \sum _{j=1}^M v_j \sqrt{p_0}.$

We now show the general case of (ii). For an $n\times n$ matrix A and index sets $I,J \subset \{1, \dots , n\}$, $A_{I,J}$ denotes the submatrix consisting of the rows I and the columns J. If $I=J$, $A_{I,I}$ is called a principal submatrix of the matrix A. Let r be an integer in $\{1, \dots , M\}$. By Lemma 6.4,

$$\begin{aligned} r = {\text {Rank}}(\rho (P_0 | P_1, \dots , P_M)) \end{aligned}$$

if and only if

$$\begin{aligned} \begin{array}{c} \rho (P_0 | P_1, \dots , P_M) \text { has an invertible principal submatrix of size } r \\ \text { and all principal submatrices of size } r+1 \text { of } \rho (P_0 | P_1, \dots , P_M) \text { are singular} \end{array} \end{aligned}$$

if and only if (using the fact that the principal submatrices of $\rho (P_0 | P_1, \dots , P_M)$ of size k are exactly the matrices of the form $\rho (P_0 | P_{i_1}, \dots , P_{i_k})$ for some $i_1, \dots , i_k \in \{1, \dots M\}$)

$$\begin{aligned} r = \max _{} \{k=1, \dots , M: \exists i_1, \dots , i_k \in \{1, \dots , M\}, \rho (P_0 | P_{i_1}, \dots , P_{i_k}) \text { is invertible} \} \end{aligned}$$

if and only if (using the case of full rank that was proved before)

$$\begin{aligned} r = \max _{}&\bigg \{k=1, \dots , M: \exists i_1, \dots , i_k \in \{1, \dots , M\}, \\&\sqrt{p_0}, \sqrt{p_{i_1}}, \dots , \sqrt{p_{i_k}} \text { are linearly independent} \bigg \} \end{aligned}$$

if and only if $r = {\text {Rank}}(\sqrt{p_0}, \sqrt{p_1}, \dots , \sqrt{p_M}) - 1.$ $\square $

4 Data processing inequality for the $\chi ^2$-divergence matrix

In a parametric statistical model $(Q_\theta )_{\theta \in \Theta }$, it is assumed that the statistician observes a random variable X following one of the distributions $Q_\theta $ for some $\theta \in \Theta $. If we transform X to obtain a new variable Y, then Y follows the distribution $P_\theta := K Q_\theta $ for some Markov kernel K. When $\theta $ is unknown but the Markov kernel K is known and independent of $\theta $, this means that the new statistical model is $(P_\theta := K Q_\theta , \theta \in \Theta )$. As in the usual case for the $\chi ^2$-divergence, it is natural to think that such a transformation cannot increase the amount of information present in the model. In our more general framework, such an inequality still holds and is presented in the following data processing inequality.

Theorem 4.1

(Data processing/entropy contraction) If K is a Markov kernel and $Q_0,\dots ,Q_M$ are probability measures such that $Q_0$ dominates $Q_1,\dots ,Q_M,$ then,

$$\begin{aligned} \chi ^2\big (K Q_0 | KQ_1,\dots , KQ_M\big ) \le \chi ^2\big (Q_0 | Q_1,\dots , Q_M\big ), \end{aligned}$$

where $\le $ denotes the partial order on the set of positive semi-definite matrices.

In particular, the $\chi ^2$-divergence matrix is invariant under invertible transformations. The rest of this section is devoted to the proof of Theorem 4.1. First, we generalize the well-known data-processing inequality for the $\chi ^2$-divergence to the case (18), where one measure is a finite signed measure and use afterwards Eq. (19).
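On finite sample spaces a Markov kernel is a column-stochastic matrix, and Theorem 4.1 can be observed numerically. The sketch below is only an illustration under that assumption, with hypothetical names and data: it merges three states into two noisy outputs and checks, for $M = 2$, that the difference of the two $\chi ^2$-divergence matrices is positive semi-definite (non-negative diagonal and determinant).

```python
def chi2_matrix(p0, ps):
    """chi^2-divergence matrix for discrete distributions with p0 > 0 everywhere."""
    return [[sum(a * b / c for a, b, c in zip(pj, pk, p0)) - 1.0 for pk in ps] for pj in ps]

def push_forward(K, q):
    """(KQ)(y) = sum_x K[y][x] * Q(x) for a column-stochastic matrix K."""
    return [sum(kyx * qx for kyx, qx in zip(row, q)) for row in K]

# Hypothetical kernel from three input states to two output states; columns sum to one.
K = [[0.9, 0.5, 0.1],
     [0.1, 0.5, 0.9]]

q0 = [0.5, 0.3, 0.2]
qs = [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]

before = chi2_matrix(q0, qs)
after = chi2_matrix(push_forward(K, q0), [push_forward(K, q) for q in qs])

# gap = chi^2(Q_0 | Q_1, Q_2) - chi^2(KQ_0 | KQ_1, KQ_2)
gap = [[before[j][k] - after[j][k] for k in range(2)] for j in range(2)]
gap_det = gap[0][0] * gap[1][1] - gap[0][1] * gap[1][0]
print(gap, gap_det)
```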

The $\chi ^2(\mu ,P)$-divergence with a signed measure can be computed from the usual $\chi ^2$-divergence between probability measures by the following relationship.

Lemma 4.2

Assume that $\mu \ll P$. Let $\mu =\alpha _+\mu _+-\alpha _-\mu _-$ be the Jordan decomposition (3) of $\mu $ with $\alpha _+, \alpha _- \ge 0$ and $\mu _+,$ $\mu _-$ orthogonal probability measures. Then

$$\begin{aligned} \chi ^2(\mu ,P) = \alpha _+^2 \chi ^2\big (\mu _+,P\big ) + \alpha _-^2 \chi ^2 \big (\mu _-, P\big ) + 2\alpha _+ \alpha _-. \end{aligned}$$

Proof

Observe that $\displaystyle \alpha _+^2 \chi ^2\big (\mu _+,P\big ) +\alpha _-^2 \chi ^2\big (\mu _-,P\big ) +2\alpha _+\alpha _- =\int \Big (\alpha _+\Big (\frac{d\mu _+}{dP}-1\Big )-\alpha _-\Big (\frac{d\mu _-}{dP}-1\Big )\Big )^2 dP = \int \Big (\frac{d\mu }{dP}-\mu (\Omega )\Big )^2 dP =\chi ^2(\mu ,P).$ $\square $
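Lemma 4.2 can be verified numerically for a signed measure on a finite set, where the Jordan decomposition simply separates positive and negative point masses. The names and the example data below are hypothetical.

```python
def chi2_signed(mu, p):
    """chi^2(mu, P) from (18) for a finite signed measure mu << P on a finite set."""
    total_mass = sum(mu)
    return sum((m / c - total_mass) ** 2 * c for m, c in zip(mu, p))

mu = [0.3, -0.2, 0.4]   # signed measure with total mass 0.5
p = [0.5, 0.3, 0.2]

# Jordan decomposition mu = alpha_plus * mu_plus - alpha_minus * mu_minus.
alpha_plus = sum(m for m in mu if m > 0)
alpha_minus = -sum(m for m in mu if m < 0)
mu_plus = [max(m, 0.0) / alpha_plus for m in mu]
mu_minus = [max(-m, 0.0) / alpha_minus for m in mu]

lhs = chi2_signed(mu, p)
rhs = (alpha_plus ** 2 * chi2_signed(mu_plus, p)
       + alpha_minus ** 2 * chi2_signed(mu_minus, p)
       + 2 * alpha_plus * alpha_minus)
print(lhs, rhs)
```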

Lemma 4.3

If $\mu $ is a finite signed measure, P is a probability measure and both measures are defined on the same measurable space, then, for any Markov kernel K, the data-processing inequality

$$\begin{aligned} \chi ^2(K\mu ,KP) \le \chi ^2(\mu ,P) \end{aligned}$$

holds.

Proof

We can assume that $\mu \ll P,$ since otherwise the right-hand side of the inequality is $+\infty $ and the result holds. In particular, $\mu \ll \nu $ for a positive measure $\nu $ implies that $K\mu \ll K\nu .$ Indeed, if $K\nu (A)=0$ for a given measurable set A, then, $\int K(A,x) d\nu (x)=0,$ implying $K(A,\cdot )=0$ $\nu $-almost everywhere. Since $\mu \ll \nu ,$ the equality also holds $\mu $-almost everywhere and so $K\mu (A)=\int K(A,x)d\mu (x)=0,$ proving $K\mu \ll K\nu .$ By the Jordan decomposition (3), there exist orthogonal probability measures $\mu _+,$ $\mu _-$ and non-negative real numbers $\alpha _+,\alpha _-,$ such that $\mu =\alpha _+\mu _+-\alpha _-\mu _-$ and $\mu ( \Omega )=\alpha _+-\alpha _-.$ Thus, $K\mu =\alpha _+K\mu _+-\alpha _-K\mu _-.$ Observe that

$$\begin{aligned} \int \Bigg (\frac{dK\mu _+}{dKP}-1\Bigg )\Bigg (\frac{dK\mu _-}{dKP}-1\Bigg ) dKP&= \int \bigg ( \frac{dK\mu _+}{dKP} \frac{dK\mu _-}{dKP} - \frac{dK\mu _-}{dKP} - \frac{dK\mu _+}{dKP} + 1 \bigg ) dKP \\&= \int \frac{dK\mu _+}{dKP} \frac{dK\mu _-}{dKP} dKP-1 \ge -1. \end{aligned}$$

Because $\mu _+$ and $\mu _-$ are orthogonal, we similarly find that

$$\begin{aligned} \int \Bigg (\frac{d\mu _+}{dP}-1\Bigg )\Bigg (\frac{d\mu _-}{dP}-1\Bigg ) dP =-1. \end{aligned}$$

Using the data-processing inequality for the $\chi ^2$ divergence of probability measures twice, $K\mu =\alpha _+K\mu _+-\alpha _-K\mu _-$ and $\mu (\Omega )=\alpha _+-\alpha _-,$ we get

$$\begin{aligned} \chi ^2(K\mu , KP)= & {} \int \Bigg (\frac{dK\mu }{dKP}-\mu (\Omega )\Bigg )^2 dKP \\= & {} \int \Bigg (\alpha _+\Bigg (\frac{dK\mu _+}{dKP}-1\Bigg )-\alpha _-\Bigg (\frac{dK\mu _-}{dKP}-1\Bigg )\Bigg )^2 dKP \\= & {} \alpha _+^2 \chi ^2\bigg (K\mu _+,KP\bigg ) +\alpha _-^2 \chi ^2\bigg (K\mu _-,KP\bigg ) \\{} & {} - \,2\alpha _+\alpha _-\int \Bigg (\frac{dK\mu _+}{dKP}-1\Bigg )\Bigg (\frac{dK\mu _-}{dKP}-1\Bigg ) dKP \\\le & {} \alpha _+^2 \chi ^2\bigg (\mu _+,P\bigg ) +\alpha _-^2 \chi ^2\big (\mu _-,P\big ) +2\alpha _+\alpha _- =\chi ^2(\mu ,P), \end{aligned}$$

by Lemma 4.2. $\square $

We can now complete the proof of Theorem 4.1.

Proof of Theorem 4.1

Let $v=(v_1,\ldots ,v_M)^\top \in \mathbb {R}^M.$ Then, $\sum _{j=1}^M v_jQ_j$ is a finite signed measure dominated by $Q_0$. Using (19) and Lemma 4.3,

$$\begin{aligned} v^T \chi ^2(K Q_0 | KQ_1, \dots , K Q_M) v&= \int \left( \sum _{j=1}^M v_j\Bigg (\frac{dKQ_j}{dKQ_0}-1\Bigg )\right) ^2 dKQ_0 \\&= \chi ^2\left( K\left( \sum _{j=1}^M v_jQ_j\right) ,KQ_0\right) \\&\le \chi ^2\left( \sum _{j=1}^M v_jQ_j,Q_0\right) \\&= \int \left( \sum _{j=1}^M v_j\Bigg (\frac{dQ_j}{dQ_0}-1\Bigg )\right) ^2 dQ_0 \\&= v^T \chi ^2(Q_0 | Q_1, \dots , Q_M) v. \end{aligned}$$

Since v was arbitrary, this completes the proof. $\square $

By definition of a Markov kernel K, for every fixed x, the map $A\mapsto K(A,x)$ is a probability measure. We now provide a simpler and more direct proof of Theorem 4.1 that does not use Lemma 4.3, under the additional common domination assumption:

$$\begin{aligned} \text {There exists a measure } \mu \text { such that } \forall x \in \Omega , K(\cdot , x) \ll \mu . \end{aligned}$$

(21)

Simpler proof of Theorem 4.1 under the additional assumption (21)

Because of the identity $v^\top \chi ^2(Q_0 | Q_1, \dots , Q_M)v = \int (\sum _{j=1}^M v_j(dQ_j/dQ_0-1))^2 dQ_0,$ it is enough to prove that for any arbitrary vector $v=(v_1,\dots ,v_M)^\top $,

$$\begin{aligned} \int \left( \sum _{j=1}^M v_j \Bigg (\frac{dKQ_j}{dKQ_0}-1\Bigg ) \right) ^2 dKQ_0 \le \int \left( \sum _{j=1}^M v_j \Bigg (\frac{dQ_j}{dQ_0}-1\Bigg ) \right) ^2 dQ_0. \end{aligned}$$

(22)

Let $\nu $ be a dominating measure for $Q_0,\dots ,Q_M$ and recall that by the additional assumption (21), for any x, the measure $\mu $ is a dominating measure for the probability measure $A\mapsto K(A,x).$ Write $q_j$ for the $\nu $-density of $Q_j.$ Then, $dKQ_j(y)=\int _X k(y,x) q_j(x) d\nu (x) d\mu (y)$ for $j=1,\dots ,M$ and a suitable non-negative kernel function k satisfying $\int k(y,x) d\mu (y)=1$ for all x. Applying the Cauchy-Schwarz inequality, we obtain

$$\begin{aligned} \left( \sum _{j=1}^M v_j \left( \dfrac{dKQ_j}{dKQ_0}(y)-1\right) \right) ^2&= \bigg ( \dfrac{\int k(y,x) [\sum _{j=1}^M v_j (q_j(x)-q_0(x))] d\nu (x)}{\int k(y,x')q_0(x') d\nu (x')} \bigg )^2 \\&\le \dfrac{\int k(y,x)\big ( \sum _{j=1}^M v_j \frac{(q_j(x)-q_0(x))}{q_0(x)}\big )^2 q_0(x) d\nu (x)}{\int k(y,x')q_0(x') d\nu (x')}. \end{aligned}$$

Inserting this in (22), rewriting $dKQ_0(y)=\int _X k(y,x) q_0(x) d\nu (x) d\mu (y),$ interchanging the order of integration using Fubini’s theorem, and applying $\int k(y,x) d\mu (y)=1,$ yields

$$\begin{aligned}&\int \left( \sum _{j=1}^M v_j \Big (\frac{dKQ_j}{dKQ_0}-1\Big ) \right) ^2 dKQ_0\\&\quad \le \iint k(y,x) \left( \sum _{j=1}^M v_j \frac{(q_j(x)-q_0(x))}{q_0(x)}\right) ^2 q_0(x) d\nu (x) d\mu (y) \\&\quad = \int \left( \sum _{j=1}^M v_j \Big ( \frac{q_j(x)}{q_0(x)}-1\Big )\right) ^2 q_0(x) d\nu (x) \\&\quad =\int \left( \sum _{j=1}^M v_j \Big (\frac{dQ_j}{dQ_0}-1\Big ) \right) ^2 dQ_0. \end{aligned}$$

$\square $

5 Derivations for explicit expressions for the ${R_\alpha }$ codivergence

In this section we derive closed-form expressions for the ${R_\alpha }$ codivergences in Table 1. We also obtain a closed-form formula for the case of Gamma distributions and discuss a first order approximation of it.

5.1 Multivariate normal distribution

Suppose $P_j=\mathcal {N}(\theta _j, \sigma ^2 I_d)$ for $j=0, 1, 2.$ Here $\theta _j=(\theta _{j1},\dots ,\theta _{jd})^\top $ are vectors in $\mathbb {R}^d$ and $I_d$ denotes the $d\times d$ identity matrix. Then,

$$\begin{aligned} {R_\alpha }(P_0 | P_1, P_2) = \exp \Big (\alpha ^2\frac{\langle \theta _1 - \theta _0, \theta _2 - \theta _0 \rangle }{\sigma ^2}\Big ) -1. \end{aligned}$$

(23)

Proof

The Lebesgue density of $P_j$ is

$$\begin{aligned} \frac{1}{(2\pi \sigma ^2)^{d/2}}\exp \Bigg (-\frac{\Vert x-\theta _j\Vert ^2}{2\sigma ^2}\Bigg )=\frac{1}{(2\pi \sigma ^2)^{d/2}} \exp \Bigg (-\frac{\Vert x\Vert ^2}{2\sigma ^2}\Bigg )\exp \Bigg (\frac{1}{\sigma ^2} \theta _j^\top x -\frac{\Vert \theta _j\Vert ^2}{2\sigma ^2}\Bigg ), \end{aligned}$$

with $\Vert \cdot \Vert $ the Euclidean norm. This is an exponential family $h(x)\exp (\langle \theta , T(x)\rangle -A(\theta ))$ with $T(x) = \sigma ^{-2}x$ and $A(\theta )=\Vert \theta \Vert ^2/(2\sigma ^2).$

Applying Proposition 2.7 and the quadratic expansion $\Vert \theta _0+b\Vert ^2=\Vert \theta _0\Vert ^2+2\langle \theta _0, b\rangle +\Vert b\Vert ^2$ to all four terms yields

$$\begin{aligned} {R_\alpha }(P_0 | P_1, P_2)&= \exp \Bigg (\frac{\Vert \theta _0+\alpha (\theta _1+\theta _2-2\theta _0) \Vert ^2}{2\sigma ^2} - \frac{\Vert \theta _0+\alpha (\theta _1-\theta _0) \Vert ^2}{2\sigma ^2} \\&\qquad - \frac{\Vert \theta _0+\alpha (\theta _2-\theta _0) \Vert ^2}{2\sigma ^2} + \frac{\Vert \theta _0 \Vert ^2}{2\sigma ^2}\Bigg )-1 \\&= \exp \Bigg (\frac{\Vert \alpha (\theta _1+\theta _2-2\theta _0) \Vert ^2- \Vert \alpha (\theta _1-\theta _0) \Vert ^2 - \Vert \alpha (\theta _2-\theta _0) \Vert ^2 }{2\sigma ^2}\Bigg )-1 \\&=\exp \Bigg (\alpha ^2\frac{\langle \theta _1 - \theta _0, \theta _2 - \theta _0 \rangle }{\sigma ^2}\Bigg ) -1. \end{aligned}$$

$\square $
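Formula (23) can be cross-checked in dimension one by computing ${R_\alpha }$ from its definition with a simple midpoint rule. This is a numerical sketch under stated assumptions: the integration range $[-12, 12]$, the number of grid points and all function names are arbitrary choices made for illustration. The last line evaluates the closed form for two orthogonal directions in dimension two.

```python
import math

def r_alpha_normal(theta0, theta1, theta2, sigma2, alpha):
    """Closed form (23) for N(theta_j, sigma2 * I_d); the theta_j are lists of length d."""
    inner = sum((a - c) * (b - c) for a, b, c in zip(theta1, theta2, theta0))
    return math.exp(alpha ** 2 * inner / sigma2) - 1.0

def normal_pdf(x, theta, sigma2):
    return math.exp(-(x - theta) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def midpoint(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi] with n cells."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def r_alpha_normal_numeric(t0, t1, t2, sigma2, alpha, lo=-12.0, hi=12.0):
    """One-dimensional R_alpha computed from its definition by numerical integration."""
    p0 = lambda x: normal_pdf(x, t0, sigma2)
    p1 = lambda x: normal_pdf(x, t1, sigma2)
    p2 = lambda x: normal_pdf(x, t2, sigma2)
    num = midpoint(lambda x: p1(x) ** alpha * p2(x) ** alpha * p0(x) ** (1 - 2 * alpha), lo, hi)
    m1 = midpoint(lambda x: p1(x) ** alpha * p0(x) ** (1 - alpha), lo, hi)
    m2 = midpoint(lambda x: p2(x) ** alpha * p0(x) ** (1 - alpha), lo, hi)
    return num / (m1 * m2) - 1.0

closed = r_alpha_normal([0.0], [0.5], [-0.3], 1.0, 0.7)
numeric = r_alpha_normal_numeric(0.0, 0.5, -0.3, 1.0, 0.7)
orthogonal = r_alpha_normal([0.0, 0.0], [1.0, 0.0], [0.0, 2.0], 1.0, 0.7)
print(closed, numeric, orthogonal)
```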

5.2 Poisson distribution

If ${\text {Pois}}(\lambda )$ denotes the Poisson distribution with intensity $\lambda > 0,$ and $\lambda _0,\lambda _1,$ $\lambda _2 > 0$, then,

$$\begin{aligned} {R_\alpha \big (\hspace{-0.1em} {\text {Pois}}(\lambda _0) \, \big | \, {\text {Pois}}(\lambda _1) \, , \, {\text {Pois}}(\lambda _2) \big )} = \exp \Big (\lambda _0^{1 - 2\alpha } \big ( \lambda _1^\alpha - \lambda _0^{\alpha } \big ) \big ( \lambda _2^\alpha - \lambda _0^{\alpha } \big ) \Big ) - 1. \end{aligned}$$

Suppose $P_j = \otimes _{\ell =1}^d {\text {Pois}}(\lambda _{j\ell })$ for $j=0, 1, 2$ and $\lambda _{j\ell }>0$ for all $j,\ell .$ Then, as a consequence of Proposition 2.6,

$$\begin{aligned} {R_\alpha }(P_0|P_1,P_2) = \exp \left( \sum _{\ell =1}^d \lambda _{0\ell }^{1 - 2\alpha } \bigg ( \lambda _{1\ell }^\alpha - \lambda _{0\ell }^{\alpha } \bigg ) \bigg ( \lambda _{2\ell }^\alpha - \lambda _{0\ell }^{\alpha } \bigg ) \right) - 1, \end{aligned}$$

with particular cases

$$\begin{aligned} \chi ^2(P_0 | P_1, P_2) = \exp \left( \sum _{\ell =1}^d \frac{(\lambda _{1\ell }-\lambda _{0\ell })(\lambda _{2\ell }-\lambda _{0\ell })}{\lambda _{0\ell }}\right) -1, \end{aligned}$$

(24)

and

$$\begin{aligned} \rho (P_0 | P_1, P_2) = \exp \left( \sum _{\ell =1}^d \big (\sqrt{\lambda _{1\ell }}-\sqrt{\lambda _{0\ell }}\big )\big (\sqrt{\lambda _{2\ell }}-\sqrt{\lambda _{0\ell }}\big )\right) -1. \end{aligned}$$

(25)

Proof

The density of the Poisson distribution with respect to the counting measure is

$$\begin{aligned} p_\lambda (x) = e^{-\lambda } \frac{\lambda ^x}{x!} = \frac{1}{x!} e^{-\lambda + x \log (\lambda )} =h(x)\exp \big (\theta ^\top T(x)-A(\theta )\big ), \end{aligned}$$

with $\theta = \log (\lambda )$, $T(x) = x$ and $A(\theta ) = \exp (\theta )$. Applying Proposition 2.7 gives

$$\begin{aligned} {R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})&=\exp \Big (A\big (\theta _0+\alpha \big (\theta _1+\theta _2-2\theta _0\big )\big ) - A\big (\theta _0+\alpha \big (\theta _1-\theta _0\big )\big ) \\&\quad - A\big (\theta _0+\alpha \big (\theta _2-\theta _0\big )\big ) + A\big (\theta _0\big )\Big )-1 \\&= \exp \Big (\exp \big (\log (\lambda _0) + \alpha \big (\log (\lambda _1) + \log (\lambda _2) - 2 \log (\lambda _0) \big )\big ) \\&\quad - \exp \big (\log (\lambda _0) + \alpha \big (\log (\lambda _1) - \log (\lambda _0) \big )\big ) \\&\quad - \exp \big (\log (\lambda _0) + \alpha \big (\log (\lambda _2) - \log (\lambda _0) \big )\big ) + \lambda _0 \Big )-1 \\&= \exp \Big ( \lambda _0^{1 - 2\alpha } \lambda _1^\alpha \lambda _2^\alpha - \lambda _0^{1 - \alpha } \lambda _1^\alpha - \lambda _0^{1 - \alpha } \lambda _2^\alpha + \lambda _0 \Big ) - 1 \\&= \exp \Big (\lambda _0^{1 - 2\alpha } \big ( \lambda _1^\alpha - \lambda _0^{\alpha } \big ) \big ( \lambda _2^\alpha - \lambda _0^{\alpha } \big ) \Big ) - 1. \end{aligned}$$

$\square $
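The closed-form expression for the Poisson distribution can be compared with a truncated series evaluation of the definition of ${R_\alpha }$. The sketch below works with log-probabilities to avoid underflow; the truncation level `n_terms` and the function names are hypothetical choices, and the truncation is an approximation rather than an exact computation.

```python
import math

def r_alpha_poisson(lam0, lam1, lam2, alpha):
    """Closed-form R_alpha for three Poisson distributions."""
    return math.exp(
        lam0 ** (1 - 2 * alpha) * (lam1 ** alpha - lam0 ** alpha) * (lam2 ** alpha - lam0 ** alpha)
    ) - 1.0

def log_pmf(x, lam):
    """Logarithm of the Poisson probability mass function at the integer x."""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

def r_alpha_poisson_series(lam0, lam1, lam2, alpha, n_terms=150):
    """R_alpha from its definition, summing over x = 0, ..., n_terms - 1."""
    num = m1 = m2 = 0.0
    for x in range(n_terms):
        l0, l1, l2 = log_pmf(x, lam0), log_pmf(x, lam1), log_pmf(x, lam2)
        num += math.exp(alpha * (l1 + l2) + (1 - 2 * alpha) * l0)
        m1 += math.exp(alpha * l1 + (1 - alpha) * l0)
        m2 += math.exp(alpha * l2 + (1 - alpha) * l0)
    return num / (m1 * m2) - 1.0

lam0, lam1, lam2 = 2.0, 3.0, 4.0
chi2_closed = r_alpha_poisson(lam0, lam1, lam2, 1.0)          # chi^2-codivergence
chi2_series = r_alpha_poisson_series(lam0, lam1, lam2, 1.0)
rho_closed = r_alpha_poisson(lam0, lam1, lam2, 0.5)           # Hellinger codivergence
rho_series = r_alpha_poisson_series(lam0, lam1, lam2, 0.5)
print(chi2_closed, chi2_series, rho_closed, rho_series)
```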

5.3 Bernoulli distribution

If ${\text {Ber}}(\theta )$ denotes the Bernoulli distribution with parameter $\theta \in (0,1),$ and $\theta _0,$ $\theta _1,$ $\theta _2 \in (0,1),$ then,

$$\begin{aligned}&{R_\alpha \big (\hspace{-0.1em} {\text {Ber}}(\theta _0) \, \big | \, {\text {Ber}}(\theta _1) \, , \, {\text {Ber}}(\theta _2) \big )} \\&\quad = \frac{\theta _0^{1 - 2\alpha } \theta _1^\alpha \theta _2^\alpha + (1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha }{ \big ( \theta _0^{1 - \alpha } \theta _1^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha \big ) \big ( \theta _0^{1 - \alpha } \theta _2^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _2)^\alpha \big ) } - 1. \end{aligned}$$

Suppose $P_j = \otimes _{\ell =1}^d {\text {Ber}}(\theta _{j\ell })$ for $j=0, 1, 2$ and $\theta _{j\ell } \in (0,1)$ for all $j,\ell .$ Then, as a consequence of Proposition 2.6,

$$\begin{aligned} {R_\alpha }(P_0|P_1,P_2) = \prod _{\ell =1}^d \frac{\theta _{0\ell }^{1 - 2\alpha } \theta _{1\ell }^\alpha \theta _{2\ell }^\alpha + (1 - \theta _{0\ell })^{1 - 2 \alpha } (1 - \theta _{1\ell })^\alpha (1 - \theta _{2\ell })^\alpha }{ r(\theta _{0\ell }, \theta _{1\ell }) r(\theta _{0\ell }, \theta _{2\ell })} - 1, \end{aligned}$$

where $r(\theta _0,\theta _1):= \theta _0^{1 - \alpha } \theta _1^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha $. In particular,

$$\begin{aligned} \chi ^2(P_0 | P_1, P_2) = \prod _{\ell =1}^d \bigg (\frac{ (\theta _{1\ell } - \theta _{0\ell }) (\theta _{2\ell } - \theta _{0\ell }) }{\theta _{0\ell }(1-\theta _{0\ell })} + 1 \bigg ) - 1, \end{aligned}$$

(26)

and

$$\begin{aligned} \rho (P_0 | P_1, P_2) = \prod _{\ell =1}^d \frac{\widetilde{r}(\theta _{1\ell },\theta _{2\ell })}{\widetilde{r}(\theta _{1\ell },\theta _{0\ell }) \widetilde{r}(\theta _{2\ell },\theta _{0\ell })} - 1, \end{aligned}$$

(27)

with ${\widetilde{r}}(\theta ,\theta '):= \sqrt{\theta \theta '}+\sqrt{(1-\theta )(1-\theta ')}.$
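Formulas (26) and (27) can be compared with a direct evaluation of ${R_\alpha }$ on the product space $\{0,1\}^d$. The following sketch does this for $d = 2$; the parameter values and function names are hypothetical.

```python
import itertools
import math

def r_alpha(p0, p1, p2, alpha):
    """R_alpha(P0 | P1, P2) for discrete distributions with p0 > 0 everywhere."""
    num = sum(c * (a / c) ** alpha * (b / c) ** alpha for a, b, c in zip(p1, p2, p0))
    m1 = sum(c * (a / c) ** alpha for a, c in zip(p1, p0))
    m2 = sum(c * (b / c) ** alpha for b, c in zip(p2, p0))
    return num / (m1 * m2) - 1.0

def bernoulli_product_pmf(thetas):
    """Point masses of the product of Ber(theta_l) over all binary vectors."""
    return [math.prod(t if bit else 1.0 - t for t, bit in zip(thetas, bits))
            for bits in itertools.product((0, 1), repeat=len(thetas))]

def chi2_bernoulli(t0, t1, t2):
    """Formula (26)."""
    return math.prod((b - a) * (c - a) / (a * (1.0 - a)) + 1.0
                     for a, b, c in zip(t0, t1, t2)) - 1.0

def rho_bernoulli(t0, t1, t2):
    """Formula (27)."""
    r = lambda s, t: math.sqrt(s * t) + math.sqrt((1.0 - s) * (1.0 - t))
    return math.prod(r(b, c) / (r(b, a) * r(c, a)) for a, b, c in zip(t0, t1, t2)) - 1.0

t0, t1, t2 = [0.3, 0.6], [0.5, 0.4], [0.2, 0.7]
pmfs = [bernoulli_product_pmf(t) for t in (t0, t1, t2)]
chi_direct = r_alpha(*pmfs, 1.0)
rho_direct = r_alpha(*pmfs, 0.5)
print(chi_direct, chi2_bernoulli(t0, t1, t2), rho_direct, rho_bernoulli(t0, t1, t2))
```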

Proof

The Bernoulli distributions ${\text {Ber}}(\theta ), \theta \in (0,1)$ form an exponential family, dominated by the counting measure on $\{0, 1\}$ with density $P({\text {Ber}}(\theta ) = k) = \theta ^k (1 - \theta )^{1 - k} = \exp (k \log (\theta ) + (1-k) \log (1 - \theta )) = \exp (k \beta - \log (1 + e^\beta ) )$, where $\beta = \log (\theta /(1-\theta ))$ is the natural parameter and $A(\beta ) = \log (1 + e^\beta )$. Therefore, we can apply Proposition 2.7 and obtain

$$\begin{aligned} {R_\alpha }(P_{\beta _0}|P_{\beta _1},P_{\beta _2})&= \exp \Big (A\big (\beta _0+\alpha \big (\beta _1+\beta _2-2\beta _0\big )\big ) - A\big (\beta _0+\alpha \big (\beta _1-\beta _0\big )\big ) \\&\quad - A\big (\beta _0+\alpha \big (\beta _2-\beta _0\big )\big ) + A\big (\beta _0\big )\Big )-1. \end{aligned}$$

Note that

$$\begin{aligned} \beta _0 + \alpha \big (\beta _1 - \beta _0\big )&= \log \left( \frac{\theta _0}{1 - \theta _0} \right) + \alpha \bigg ( \log \left( \frac{\theta _1}{1 - \theta _1} \right) - \log \left( \frac{\theta _0}{1 - \theta _0} \right) \bigg ) \\&= \log \left( \frac{\theta _0^{1 - \alpha } \theta _1^\alpha }{(1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha } \right) , \end{aligned}$$

so that

$$\begin{aligned} A\big (\beta _0+\alpha \big (\beta _1-\beta _0\big )\big )&= \log \left( 1 + \frac{\theta _0^{1 - \alpha } \theta _1^\alpha }{(1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha } \right) \\&= \log \left( \frac{\theta _0^{1 - \alpha } \theta _1^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha }{(1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha } \right) . \end{aligned}$$

Similarly,

$$\begin{aligned}&\beta _0 + \alpha \big (\beta _1 + \beta _2 - 2\beta _0\big )\\&\quad = \log \left( \frac{\theta _0}{1 - \theta _0} \right) + \alpha \bigg ( \log \left( \frac{\theta _1}{1 - \theta _1} \right) + \log \left( \frac{\theta _2}{1 - \theta _2} \right) - 2 \log \left( \frac{\theta _0}{1 - \theta _0} \right) \bigg ) \\&\quad = \log \left( \frac{\theta _0^{1 - 2\alpha } \theta _1^\alpha \theta _2^\alpha }{(1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha } \right) , \end{aligned}$$

so that

$$\begin{aligned} A\big ( \beta _0 + \alpha \big (\beta _1 + \beta _2 - 2\beta _0\big ) \big ) = \log \left( \frac{\theta _0^{1 - 2\alpha } \theta _1^\alpha \theta _2^\alpha + (1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha }{(1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha } \right) . \end{aligned}$$

Combining these results with $A(\beta _0) = \log \big (1 + \theta _0/(1-\theta _0)\big ) = -\log (1 - \theta _0)$ yields

$$\begin{aligned}&{R_\alpha }(P_{\beta _0}|P_{\beta _1},P_{\beta _2}) = \exp \Bigg ( \log \left( \frac{\theta _0^{1 - 2\alpha } \theta _1^\alpha \theta _2^\alpha + (1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha }{(1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha } \right) \\&\qquad - \log \left( \frac{\theta _0^{1 - \alpha } \theta _1^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha }{(1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha } \right) \\&\qquad - \log \left( \frac{\theta _0^{1 - \alpha } \theta _2^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _2)^\alpha }{(1 - \theta _0)^{1 - \alpha } (1 - \theta _2)^\alpha } \right) - \log (1 - \theta _0) \Bigg ) - 1\\&\quad = \frac{\theta _0^{1 - 2\alpha } \theta _1^\alpha \theta _2^\alpha + (1 - \theta _0)^{1 - 2 \alpha } (1 - \theta _1)^\alpha (1 - \theta _2)^\alpha }{ \big ( \theta _0^{1 - \alpha } \theta _1^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _1)^\alpha \big ) \big ( \theta _0^{1 - \alpha } \theta _2^\alpha + (1 - \theta _0)^{1 - \alpha } (1 - \theta _2)^\alpha \big ) } - 1, \end{aligned}$$

which is the claimed formula for $d=1$. The case $d>1$ follows from Proposition 2.6, and (26) and (27) are the special cases $\alpha =1$ and $\alpha =1/2$.

$\square $
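
As a numerical sanity check, which is not part of the original derivation, the following Python sketch compares the closed-form Bernoulli expression with a direct summation over the two outcomes, and checks that $\alpha =1$ and $\alpha =1/2$ reproduce (26) and (27). All function names and the example parameters are hypothetical choices made for illustration.

```python
import math


def r_alpha_direct(t0, t1, t2, alpha):
    """R_alpha(P0|P1,P2) for Bernoulli laws, summing over the outcomes {0, 1}."""
    p0, p1, p2 = (1 - t0, t0), (1 - t1, t1), (1 - t2, t2)
    num = sum(a * (b / a) ** alpha * (c / a) ** alpha for a, b, c in zip(p0, p1, p2))
    den1 = sum(a * (b / a) ** alpha for a, b in zip(p0, p1))
    den2 = sum(a * (c / a) ** alpha for a, c in zip(p0, p2))
    return num / (den1 * den2) - 1.0


def r_alpha_closed(t0, t1, t2, alpha):
    """Closed-form expression from the text for a single coordinate (d = 1)."""
    def r(a, b):
        return a ** (1 - alpha) * b ** alpha + (1 - a) ** (1 - alpha) * (1 - b) ** alpha
    num = (t0 ** (1 - 2 * alpha) * t1 ** alpha * t2 ** alpha
           + (1 - t0) ** (1 - 2 * alpha) * (1 - t1) ** alpha * (1 - t2) ** alpha)
    return num / (r(t0, t1) * r(t0, t2)) - 1.0


def r_alpha_product(theta0, theta1, theta2, alpha):
    """Product over coordinates of (R_alpha + 1), minus 1, for product Bernoulli laws."""
    return math.prod(r_alpha_closed(a, b, c, alpha) + 1.0
                     for a, b, c in zip(theta0, theta1, theta2)) - 1.0


def chi2_closed(theta0, theta1, theta2):
    """Formula (26)."""
    return math.prod((b - a) * (c - a) / (a * (1 - a)) + 1.0
                     for a, b, c in zip(theta0, theta1, theta2)) - 1.0


def hellinger_closed(theta0, theta1, theta2):
    """Formula (27)."""
    def rt(a, b):
        return math.sqrt(a * b) + math.sqrt((1 - a) * (1 - b))
    return math.prod(rt(b, c) / (rt(b, a) * rt(c, a))
                     for a, b, c in zip(theta0, theta1, theta2)) - 1.0


# Hypothetical example parameters with d = 2.
theta0, theta1, theta2 = [0.3, 0.6], [0.5, 0.4], [0.2, 0.7]
```

Working with the factors $R_\alpha + 1$ keeps the product structure of Proposition 2.6 explicit, so the same helper covers every $d$.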

5.4 Gamma distribution

Let $P_\theta =\Gamma (\alpha , \beta )$ with $\theta =(\alpha -1,-\beta )$ denote the Gamma distribution with shape $\alpha > 0$ and inverse scale $\beta > 0$. Below, subscripted symbols $\alpha _j, \beta _j$ denote the parameters of $P_{\theta _j}$, while the unsubscripted $\alpha $ is the index of the codivergence ${R_\alpha }$. If $\alpha _0, \alpha _1, \alpha _2, \beta _0, \beta _1, \beta _2,$ $\alpha _0 + \alpha (\alpha _1 + \alpha _2 - 2 \alpha _0),$ $\alpha _0 + \alpha (\alpha _1 - \alpha _0),$ $\alpha _0 + \alpha (\alpha _2 - \alpha _0),$ $\beta _0 + \alpha (\beta _1 + \beta _2 - 2 \beta _0),$ $\beta _0 + \alpha (\beta _1 - \beta _0),$ and $\beta _0 + \alpha (\beta _2 - \beta _0)$ are all positive, then we have

$$\begin{aligned} {R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})&= \frac{\Gamma (\alpha _0) \Gamma \big (\alpha _0 + \alpha (\alpha _1 + \alpha _2 - 2 \alpha _0) \big ) }{ \Gamma \big (\alpha _0 + \alpha (\alpha _1 - \alpha _0) \big ) \Gamma \big (\alpha _0 + \alpha (\alpha _2 - \alpha _0) \big )} \\&\quad \times \frac{ \big (\beta _0 + \alpha (\beta _1 - \beta _0) \big )^{\alpha _0 + \alpha (\alpha _1 - \alpha _0)} \big (\beta _0 + \alpha (\beta _2 - \beta _0) \big )^{\alpha _0 + \alpha ( \alpha _2 - \alpha _0)} }{ \beta _0^{\alpha _0} \big (\beta _0 + \alpha (\beta _1 + \beta _2 - 2 \beta _0) \big )^{\alpha _0 + \alpha (\alpha _1 + \alpha _2 - 2 \alpha _0)} } - 1, \end{aligned}$$

otherwise ${R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2}) = +\infty $. This can be checked by writing the explicit expression of the integrals that appear in the definition of ${R_\alpha }$.

Proof

The Gamma distributions $\Gamma (\alpha , \beta ), \alpha> 0, \beta > 0$ form an exponential family, dominated by the Lebesgue measure with density

$$\begin{aligned} \beta ^{\alpha } x^{\alpha - 1} \exp (-\beta x) \Gamma (\alpha )^{-1} = \exp \big ( (\alpha - 1) \log (x) - \beta x + \alpha \log (\beta ) - \log (\Gamma (\alpha )) \big ), \end{aligned}$$

natural parameter

$$\begin{aligned} \theta = (\theta _{a}, \theta _{b}) = (\alpha - 1, - \beta ), \end{aligned}$$

and

$$\begin{aligned} A(\theta ) = \log \big (\Gamma (\theta _{a} + 1)\big ) - (\theta _{a} + 1) \log (- \theta _{b}). \end{aligned}$$

Therefore, we can apply Proposition 2.7 with $\Theta = (-1, +\infty ) \times (-\infty , 0)$. Combining the assumed constraints on the parameters with the fact that the mapping $(\alpha , \beta ) \mapsto \theta $ is affine ensures that $\theta _0+\alpha \big (\theta _1-\theta _0\big ),$ $\theta _0+\alpha \big (\theta _2-\theta _0\big ),$ and $\theta _0+\alpha \big (\theta _1+\theta _2-2\theta _0\big )$ all belong to $\Theta $. Therefore, we obtain

$$\begin{aligned} {R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})&=\exp \Big (A\big (\theta _0+\alpha \big (\theta _1+\theta _2-2\theta _0\big )\big ) - A\big (\theta _0+\alpha \big (\theta _1-\theta _0\big )\big ) \\&\quad - A\big (\theta _0+\alpha \big (\theta _2-\theta _0\big )\big ) + A\big (\theta _0\big )\Big )-1, \end{aligned}$$

where

$$\begin{aligned} A\big (\theta _0+\alpha \big (\theta _1-\theta _0\big )\big )&= \log \Big (\Gamma \Big (\theta _{0, a} + \alpha \big (\theta _{1,a} - \theta _{0,a} \big ) + 1 \Big ) \Big ) \\&\quad - \Big (\theta _{0, a} + \alpha \big (\theta _{1,a} - \theta _{0,a} \big ) + 1 \Big ) \log \Big (\beta _0 + \alpha \big (\beta _1 - \beta _0 \big ) \Big ) \\&= \log \Big (\Gamma \Big (\alpha _0 + \alpha \big ( \alpha _1 - \alpha _0 \big ) \Big ) \Big ) \\&\quad - \Big (\alpha _0 + \alpha \big ( \alpha _1 - \alpha _0 \big ) \Big ) \log \Big (\beta _0 + \alpha \big (\beta _1 - \beta _0 \big ) \Big ) \end{aligned}$$

and

$$\begin{aligned} A\big (\theta _0+\alpha \big (\theta _1+\theta _2-2\theta _0\big )\big )&= \log \Big (\Gamma \Big (\alpha _0 + \alpha \big ( \alpha _1 + \alpha _2 - 2 \alpha _0 \big ) \Big ) \Big ) \\&\quad - \Big (\alpha _0 + \alpha \big ( \alpha _1 + \alpha _2 - 2 \alpha _0 \big ) \Big ) \log \Big (\beta _0 + \alpha \big (\beta _1 + \beta _2 - 2 \beta _0 \big ) \Big ). \end{aligned}$$

Combining all these results, we obtain

$$\begin{aligned} {R_\alpha }(P_{\theta _0}|P_{\theta _1},P_{\theta _2})&= \frac{\Gamma (\alpha _0) \Gamma \Big (\alpha _0 + \alpha \big ( \alpha _1 + \alpha _2 - 2 \alpha _0 \big ) \Big ) }{ \Gamma \Big (\alpha _0 + \alpha \big ( \alpha _1 - \alpha _0 \big ) \Big ) \Gamma \Big (\alpha _0 + \alpha \big ( \alpha _2 - \alpha _0 \big ) \Big )} \\&\quad \times \frac{ \big (\beta _0 + \alpha (\beta _1 - \beta _0) \big )^{\alpha _0 + \alpha (\alpha _1 - \alpha _0)} \big (\beta _0 + \alpha (\beta _2 - \beta _0) \big )^{\alpha _0 + \alpha ( \alpha _2 - \alpha _0)} }{ \beta _0^{\alpha _0} \big (\beta _0 + \alpha (\beta _1 + \beta _2 - 2 \beta _0) \big )^{\alpha _0 + \alpha (\alpha _1 + \alpha _2 - 2 \alpha _0)} } - 1. \end{aligned}$$

$\square $
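
The Gamma formula can be cross-checked numerically. The sketch below is an illustration only, with hypothetical function names and parameter values: it evaluates the closed form through `math.lgamma` and compares it with a crude midpoint-rule approximation of the three integrals in the definition of ${R_\alpha }$. The truncation point and the grid size of the quadrature are assumptions chosen for these particular parameters.

```python
import math


def r_alpha_gamma(shapes, rates, alpha):
    """Closed-form R_alpha(P0|P1,P2); shapes = (a0, a1, a2), rates = (b0, b1, b2)."""
    (a0, a1, a2), (b0, b1, b2) = shapes, rates
    a1m, a2m = a0 + alpha * (a1 - a0), a0 + alpha * (a2 - a0)
    a12 = a0 + alpha * (a1 + a2 - 2 * a0)
    b1m, b2m = b0 + alpha * (b1 - b0), b0 + alpha * (b2 - b0)
    b12 = b0 + alpha * (b1 + b2 - 2 * b0)
    # The text sets R_alpha = +infinity when one of these quantities is not positive.
    if min(a0, a1, a2, b0, b1, b2, a1m, a2m, a12, b1m, b2m, b12) <= 0:
        return math.inf
    log_ratio = (math.lgamma(a0) + math.lgamma(a12) - math.lgamma(a1m) - math.lgamma(a2m)
                 + a1m * math.log(b1m) + a2m * math.log(b2m)
                 - a0 * math.log(b0) - a12 * math.log(b12))
    return math.expm1(log_ratio)


def gamma_logpdf(x, a, b):
    """Log-density of Gamma(shape a, inverse scale b) at x > 0."""
    return a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a)


def r_alpha_numeric(shapes, rates, alpha, upper=60.0, n=60000):
    """Midpoint-rule approximation of the three integrals defining R_alpha."""
    (a0, a1, a2), (b0, b1, b2) = shapes, rates
    h = upper / n
    num = den1 = den2 = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        l0 = gamma_logpdf(x, a0, b0)
        l1 = gamma_logpdf(x, a1, b1)
        l2 = gamma_logpdf(x, a2, b2)
        num += math.exp(alpha * (l1 + l2) + (1 - 2 * alpha) * l0)
        den1 += math.exp(alpha * l1 + (1 - alpha) * l0)
        den2 += math.exp(alpha * l2 + (1 - alpha) * l0)
    return (num * h) / ((den1 * h) * (den2 * h)) - 1.0


# Hypothetical example parameters.
example_shapes, example_rates = (2.0, 3.0, 2.5), (1.0, 1.5, 0.8)
```

For exponential distributions (all shapes equal to one) and $\alpha =1$, the closed form reduces to $\beta _1\beta _2/\big (\beta _0(\beta _1+\beta _2-\beta _0)\big ) - 1$, which gives a second, purely algebraic check.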

A formula for the product of exponential distributions can be obtained as a special case by setting $\alpha _{j\ell }=1$ for all $j,\ell $ and applying Proposition 2.6. For the families of distributions discussed above, the formulas for the correlation-type ${R_\alpha }$ codivergences encode an orthogonality relation on the parameter vectors. This is less visible in the expressions for the Gamma distribution but can be made more explicit using the first order approximation that we state next. It shows that even for the Gamma distribution these matrix entries can be written in leading order as a term involving a weighted inner product of $\beta _1-\beta _0$ and $\beta _2-\beta _0,$ where $\beta _r$ denotes the vector $(\beta _{r\ell })_{1 \le \ell \le d}.$

Lemma 5.1

Suppose $P_j = \otimes _{\ell =1}^d \Gamma (\alpha _{\ell },\beta _{j\ell })$ for every $j=0,1,2$ and for some $\alpha _{\ell },\beta _{j\ell } > 0$. Let $A:=\sum _{\ell =1}^d \alpha _\ell $ and $\Delta :=\max _{j=1,2}\max _{\ell =1,\ldots ,d} |\beta _{j\ell }-\beta _{0\ell }|/\beta _{0\ell }.$ Denote by $\Sigma $ the $d\times d$ diagonal matrix with entries $\beta _{0\ell }^2/\alpha _{\ell }.$ Then,

$$\begin{aligned} {R_\alpha }(P_0|P_1,P_2) = \exp \Big ( \alpha ^2 (\beta _1-\beta _0)^\top \Sigma ^{-1}(\beta _2-\beta _0) + o(A \Delta ^2)\Big )-1. \end{aligned}$$

Proof

Since $\alpha _\ell $ does not depend on $j$, the Gamma functions in the formula of Sect. 5.4 cancel. Together with Proposition 2.6 and a second order Taylor expansion of the logarithm (the sum of the first order terms vanishes), this yields

$$\begin{aligned}&{R_\alpha }(P_0|P_1,P_2) + 1 \\&\quad = \prod _{\ell =1}^d \frac{ \big (\beta _{0\ell } + \alpha (\beta _{1\ell } - \beta _{0\ell })\big )^{\alpha _\ell } \big (\beta _{0\ell } + \alpha (\beta _{2\ell } - \beta _{0\ell })\big )^{\alpha _\ell } }{\beta _{0\ell }^{\alpha _{\ell }} \big (\beta _{0\ell } + \alpha (\beta _{1\ell } + \beta _{2\ell } - 2\beta _{0\ell })\big )^{\alpha _\ell }} \\&\quad = \prod _{\ell =1}^d \exp \Bigg ( \alpha _\ell \bigg ( \log \bigg (1 + \alpha \frac{\beta _{1\ell } - \beta _{0\ell }}{ \beta _{0\ell }} \bigg ) + \log \bigg (1 + \alpha \frac{\beta _{2\ell } - \beta _{0\ell }}{\beta _{0\ell }} \bigg ) \\&\qquad - \log \bigg (1 + \alpha \frac{\beta _{1\ell } - \beta _{0\ell } + \beta _{2\ell } - \beta _{0\ell }}{\beta _{0\ell }} \bigg ) \bigg ) \Bigg ) \\&\quad = \exp \left( \sum _{\ell =1}^d \alpha _\ell \alpha ^2 \bigg ( - \frac{ (\beta _{1\ell }-\beta _{0\ell })^2}{2\beta _{0\ell }^2} - \frac{(\beta _{2\ell }-\beta _{0\ell })^2}{2\beta _{0\ell }^2} + \frac{(\beta _{1\ell }-\beta _{0\ell } + \beta _{2\ell }-\beta _{0\ell })^2}{2\beta _{0\ell }^2} + o(\Delta ^2)\bigg ) \right) \\&\quad = \exp \left( \alpha ^2 \sum _{\ell =1}^d \frac{\alpha _\ell (\beta _{1\ell }-\beta _{0\ell }) (\beta _{2\ell }-\beta _{0\ell })}{\beta _{0\ell }^2} + o\big (A \Delta ^2\big ) \right) . \end{aligned}$$

$\square $
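
To see how sharp the leading-order term in the final display of the proof is, one can compare it with the exact value of $\log ({R_\alpha }+1)$ for small perturbations of the inverse scales. The Python sketch below is a minimal illustration under assumed parameter values; the function names and the argument `order` (standing for the index $\alpha $ of the codivergence) are hypothetical.

```python
import math


def log_r_plus_one(shapes, beta0, beta1, beta2, order):
    """Exact log(R_order + 1) for product Gamma laws with common shapes[l]."""
    total = 0.0
    for a, b0, b1, b2 in zip(shapes, beta0, beta1, beta2):
        x, y = (b1 - b0) / b0, (b2 - b0) / b0  # relative perturbations of the rates
        total += a * (math.log1p(order * x) + math.log1p(order * y)
                      - math.log1p(order * (x + y)))
    return total


def leading_term(shapes, beta0, beta1, beta2, order):
    """order^2 * sum_l shapes[l] * (b1 - b0) * (b2 - b0) / b0^2."""
    return order ** 2 * sum(a * (b1 - b0) * (b2 - b0) / b0 ** 2
                            for a, b0, b1, b2 in zip(shapes, beta0, beta1, beta2))


# Hypothetical example: d = 3 and relative perturbations of size delta.
delta = 1e-2
shapes = [1.0, 2.0, 0.5]
beta0 = [1.0, 2.0, 3.0]
beta1 = [b * (1 + delta) for b in beta0]
beta2 = [b * (1 - delta / 2) for b in beta0]
exact = log_r_plus_one(shapes, beta0, beta1, beta2, 1.0)
approx = leading_term(shapes, beta0, beta1, beta2, 1.0)
```

In this example the gap between `exact` and `approx` is of third order in `delta`, in line with the $o(A\Delta ^2)$ remainder.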

6 Facts about ranks

Definition 6.1

Let $X_1, \dots , X_n$ be n random variables defined on the same probability space $(\Omega , \mathcal {A}, P)$. We define the rank of $\{X_1, \dots , X_n\}$, denoted by ${\text {Rank}}(X_1, \dots , X_n)$ as the dimension of the vector space ${\text {Vect}}(X_1, \dots , X_n)$ generated by linear combinations of $\{X_1, \dots , X_n\}$, where the equality is to be understood P-almost surely. Moreover, we say that $(X_1, \dots , X_n)$ are linearly independent P-almost surely if for any vector $(a_1, \dots , a_n),$

$$\begin{aligned} P \left( \sum _{i=1}^n a_i X_i = 0 \right) = 1 \quad \text {implies} \ \ a_1 = \cdots = a_n = 0. \end{aligned}$$

Lemma 6.2

Let $X_1, \dots , X_n$ be n random variables defined on the same probability space $(\Omega , \mathcal {A}, P)$. Then ${\text {Rank}}(X_1, \dots , X_n)$ is the largest integer r such that there exist $i_1, \dots , i_r \in \{1, \dots , n\}$ with $(X_{i_1}, \dots , X_{i_r})$ linearly independent P-almost surely.

Proof

Let r be the largest integer such that there exist $i_1, \dots , i_r \in \{1, \dots , n\}$ with $(X_{i_1}, \dots , X_{i_r})$ linearly independent P-almost surely. Then the space generated by $X_1, \dots , X_n$ is at least of dimension r, and therefore ${\text {Rank}}(X_1, \dots , X_n) \ge r$. If ${\text {Rank}}(X_1, \dots , X_n) > r$, then the generating family $\{X_1, \dots , X_n\}$ contains a basis of ${\text {Vect}}(X_1, \dots , X_n)$ with at least $r+1$ elements, and these are linearly independent P-almost surely, contradicting the definition of r. Therefore ${\text {Rank}}(X_1, \dots , X_n) \le r$, completing the proof. $\square $

Lemma 6.3

Let $\textbf{Z}= (Z_1,\ldots ,Z_M)^\top $ be an M-dimensional random vector with mean zero and finite second moments. Then ${\text {Rank}}({\text {Cov}}_P(\textbf{Z})) = {\text {Rank}}(Z_1, \dots , Z_M)$, where the covariance matrix is computed with respect to the distribution P and the rank of a set of random variables is to be understood in the sense of Definition 6.1.

Proof

Let $\lambda _1 \ge \lambda _2 \ge \dots \ge \lambda _M$ be the eigenvalues of ${\text {Cov}}_P(\textbf{Z})$, sorted in decreasing order, and let $\textbf{e}_1, \dots , \textbf{e}_M$ be a corresponding orthonormal basis of eigenvectors. Let r be the rank of ${\text {Cov}}_P(\textbf{Z})$. We have $\lambda _{r+1} = \lambda _{r+2} = \cdots = \lambda _M = 0$ and $\lambda _r > 0$. Let us define $Y_i = \textbf{e}_i^\top \textbf{Z}$ for $i=1, \dots , M$. By usual results on principal components, e.g. [13, Result 8.1], ${\text {Var}}[Y_i] = \lambda _i$ and ${\text {Cov}}(Y_i, Y_j) = \lambda _i 1_{\{i=j\}}$. Therefore,

$$\begin{aligned} {\text {Rank}}(Z_1, \dots , Z_M)&= \dim ({\text {Vect}}(Z_1, \dots , Z_M)) = \dim ({\text {Vect}}(Y_1, \dots , Y_M)) \\&= \dim ({\text {Vect}}(Y_1, \dots , Y_r)) = r, \end{aligned}$$

where the first equality is the definition of the rank, the second equality is a consequence of the fact that $(\textbf{e}_1, \dots , \textbf{e}_M)$ is a basis of $\mathbb {R}^M$, the third equality results from the fact that ${\text {Var}}[Y_i] = 0$ and $E[Y_i] = 0$ for any $i > r$ and the last equality is a consequence of the orthogonality of the $(Y_1, \dots , Y_r)$ as elements of the Hilbert space $L_2(\Omega , \mathcal {A}, P)$. The proof is completed since by definition $r = {\text {Rank}}({\text {Cov}}_P(\textbf{Z}))$. $\square $
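
Lemma 6.3 can be illustrated on a finite probability space, where both ranks are computable exactly. The Python sketch below is an illustration under the assumption that every outcome has positive probability, so that the rank of $(Z_1, \dots , Z_M)$ in the sense of Definition 6.1 is the rank of the matrix of values $Z_i(\omega _k)$. The helper names and the example are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix, computed exactly by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk


def covariance(values, probs):
    """Covariance matrix of Z under P, where values[k] is the vector Z(omega_k)."""
    dim = len(values[0])
    mean = [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]
    return [[sum(p * (v[i] - mean[i]) * (v[j] - mean[j]) for p, v in zip(probs, values))
             for j in range(dim)] for i in range(dim)]


# Four equally likely outcomes; Z_3 = Z_1 + Z_2, so the three variables have rank 2.
values = [(1, 1, 2), (-1, 1, 0), (1, -1, 0), (-1, -1, -2)]
probs = [Fraction(1, 4)] * 4
cov = covariance(values, probs)
```

Exact rational arithmetic is used so that the rank is not affected by floating-point round-off.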

Lemma 6.4

(see for example Exercise 3.3.11 in [22]) A symmetric and positive semi-definite $M\times M$ matrix A is of rank r if and only if A has an invertible principal submatrix of size r, and all principal submatrices of size $r+1$ of A are singular.
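
The characterization in Lemma 6.4 can be checked by brute force on small matrices. The following Python sketch is an illustration with hypothetical helper names: it searches for the largest invertible principal submatrix and compares its size with the rank. The second example matrix is neither symmetric nor positive semi-definite and shows that the statement can fail without these assumptions.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix, computed exactly by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk


def principal_submatrix(a, idx):
    """Rows and columns of a restricted to the index set idx."""
    return [[a[i][j] for j in idx] for i in idx]


def largest_invertible_principal_size(a):
    """Largest k such that some k x k principal submatrix of a is invertible."""
    n = len(a)
    for k in range(n, 0, -1):
        if any(rank(principal_submatrix(a, idx)) == k
               for idx in combinations(range(n), k)):
            return k
    return 0


psd = [[1, 0, 1], [0, 1, 1], [1, 1, 2]]  # symmetric, positive semi-definite, rank 2
nonsym = [[0, 1], [0, 0]]                # rank 1, but every principal submatrix is singular
```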

7 Conclusion

We introduced the concept of codivergence as a notion of “angle” between three probability distributions. Divergence matrices can be viewed as an analogue of the Gram matrix for a finite sequence of probability distributions that are compared relative to one distribution.

Locally around the reference probability measure $P_0$, codivergences are bilinear forms up to remainder terms. Two classes of codivergences emerge that resemble the structure of the covariance and the correlation.

Natural follow-up questions relate to the spectral behavior of a divergence matrix and the link between properties of the divergence matrix and properties of the underlying probability measures.

Data availability

Data sharing not applicable to this article as no datasets were generated or analysed during the current study.

References

[1] Amari, S.-I.: Information Geometry and Its Applications, vol. 194. Springer, Japan (2016)
[2] Averbukh, V.I., Smolyanov, O.G.: The theory of differentiation in linear topological spaces. Russ. Math. Surv. 22(6), 201 (1967)
[3] Ay, N., Jost, J., Vân Lê, H., Schwachhöfer, L.: Information Geometry, vol. 64. Springer, Switzerland (2017)
[4] Bourles, H.: Fundamentals of Advanced Mathematics 3—Differential Calculus, Tensor Calculus, Differential Geometry, Global Analysis. ISTE Press; Elsevier, Oxford (2019). https://doi.org/10.1016/C2017-0-00728-0
[5] Cena, A., Pistone, G.: Exponential statistical manifold. Ann. Inst. Stat. Math. 59, 27–56 (2007)
[6] Chen, T., Streets, J., Shahbaba, B.: A geometric view of posterior approximation (2015). arXiv preprint arXiv:1510.00861
[7] Cohn, D.L.: Measure Theory. Birkhäuser Advanced Texts: Basler Lehrbücher. [Birkhäuser Advanced Texts: Basel Textbooks], 2nd edn. Birkhäuser/Springer, New York, p. 457 (2013). https://doi.org/10.1007/978-1-4614-6956-8
[8] Dawid, A.P.: Further comments on some comments on a paper by Bradley Efron. Ann. Stat. 5(6), 1249 (1977)
[9] Derumigny, A., Schmidt-Hieber, J.: On lower bounds for the bias-variance trade-off. To appear in the Annals of Statistics (2023). arXiv:2006.00278
[10] Diakonikolas, I., Kane, D.M., Stewart, A.: Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 73–84 (2017)
[11] Feldman, V., Grigorescu, E., Reyzin, L., Vempala, S.S., Xiao, Y.: Statistical algorithms and a lower bound for detecting planted cliques. J. ACM 64(2), 1–37 (2017)
[12] Holbrook, A., Lan, S., Streets, J., Shahbaba, B.: The nonparametric Fisher geometry and the chi-square process density prior (2017). arXiv preprint arXiv:1707.03117
[13] Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 6th edn. Pearson Prentice Hall, New Jersey (2007)
[14] Lang, S.: Differential and Riemannian Manifolds, vol. 160. Springer, New York (2012)
[15] Lee, J.M.: Manifolds and Differential Geometry, vol. 107. American Mathematical Society, Providence (2022)
[16] Newton, N.J.: An infinite-dimensional statistical manifold modelled on Hilbert space. J. Funct. Anal. 263(6), 1661–1681 (2012)
[17] Newton, N.J.: Infinite-dimensional statistical manifolds based on a balanced chart. Bernoulli 22(2), 711–731 (2016)
[18] Newton, N.J.: A class of non-parametric statistical manifolds modelled on Sobolev space. Inf. Geom. 2(2), 283–312 (2019)
[19] Nielsen, F.: An elementary introduction to information geometry. Entropy 22(10), 1100 (2020)
[20] Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1543–1561 (1995)
[21] Pistone, G.: Nonparametric information geometry. In: Geometric Science of Information: First International Conference, GSI 2013, Paris, France, August 28–30, 2013. Proceedings, pp. 5–36 (2013). Springer
[22] Ramachandra Rao, A., Bhimasankaram, P.: Linear Algebra, vol. 19. Springer, New Delhi (2000). https://doi.org/10.1007/978-93-86279-01-9
[23] Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, vol. 4, pp. 547–562 (1961). University of California Press
[24] Srivastava, A., Jermyn, I., Joshi, S.: Riemannian analysis of probability density functions with applications in vision. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007). IEEE
[25] Srivastava, A., Klassen, E.P.: Functional and Shape Data Analysis, vol. 1. Springer, New York (2016)

Acknowledgements

We are grateful to the Associate Editor and two referees for valuable comments, for suggesting Proposition 2.7, and for an idea that led us to consider the two general classes of covariance-type and correlation-type codivergences.

Funding

The research has been supported by the NWO/STAR grant 613.009.034b and the NWO Vidi grant VI.Vidi.192.021.

Author information

Authors and Affiliations

Department of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Alexis Derumigny
University of Twente, Drienerlolaan 5, 7522 NB, Enschede, The Netherlands
Johannes Schmidt-Hieber

Contributions

Both authors contributed equally to this work.

Corresponding author

Correspondence to Alexis Derumigny.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Communicated by Frank Nielsen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Proofs

1.1 Proof of Proposition 2.5

Proof

As mentioned already, the first and second parts of Definition 2.2 are satisfied. To check the third part of Definition 2.2 for $\phi (P_0|P_1,P_2),$ let $\mu , \widetilde{\mu }\in \mathcal {M}_{P_0}$ and write $h:= d\mu /dP_0$ and $g:= d\widetilde{\mu }/dP_0$ for the corresponding $P_0$-densities. Then

$$\begin{aligned} \frac{d(P_0 + t \mu )}{dP_0} = 1 + t \frac{d\mu }{dP_0} = 1 + t h, \end{aligned}$$

is square-integrable with respect to $P_0$ for any real number t. Using $\phi (1)=1$, Taylor expansion $\phi (1+y)=1+y \phi '(1) +\tfrac{1}{2}y^2\phi ''(1) + o(y^2),$ that $\int h dP_0=0$ and that h is bounded $P_0$-a.e. by the definition of $\mathcal {M}_{P_0},$ we obtain that, for all sufficiently small t,

$$\begin{aligned} \int \phi \Big ( \frac{d(P_0 + t \mu )}{dP_0}\Big ) dP_0 = \int \phi (1+th) dP_0 = 1 + \frac{t^2}{2}\phi ''(1)\int h^2 dP_0+o(t^2). \end{aligned}$$

Similarly $\int \phi \big ( \frac{d(P_0 + s \widetilde{\mu })}{dP_0}\big ) dP_0= 1+\tfrac{1}{2}s^2\phi ''(1)\int g^2 dP_0+o(s^2)$ and

$$\begin{aligned}&\int \phi \Bigg ( \frac{d(P_0 + t \mu )}{dP_0}\Bigg )\phi \Bigg ( \frac{d(P_0 + s \widetilde{\mu })}{dP_0}\Bigg ) dP_0\\&\quad = \int \phi (1+th)\phi (1+sg) dP_0 \\&\quad = 1+\frac{\phi ''(1)}{2} \int (t^2 h^2+s^2g^2) dP_0 + st \phi '(1)^2\int gh dP_0 +o(t^2+s^2) \\&\quad = \Bigg (1+\frac{t^2}{2}\phi ''(1)\int h^2 dP_0\Bigg )\Bigg (1+\frac{s^2}{2}\phi ''(1)\int g^2 dP_0\Bigg )+ st \phi '(1)^2 \int gh dP_0 +o(t^2+s^2). \end{aligned}$$

Taylor expansion yields $1/(1-y) = 1 + O(y)$ for all $|y|\le 1/2$ and thus for $(s,t)\rightarrow (0,0),$

$$\begin{aligned}&{R_\phi }(P_0 | P_0 + t \mu , P_0 + s \widetilde{\mu }) \\&\quad = \frac{st \phi '(1)^2 \int gh dP_0 + o(t^2+s^2)}{(1+\frac{t^2}{2}\phi ''(1)\int h^2 dP_0)(1+\frac{s^2}{2}\phi ''(1)\int g^2 dP_0) +o(t^2+s^2)} \\&\quad = st \phi '(1)^2 \int gh dP_0 + o(t^2+s^2). \end{aligned}$$

By following the same arguments and replacing the definition of ${R_\phi }(P_0 | P_0 + t \mu , P_0 + s \widetilde{\mu })$ by ${V_\phi }(P_0 | P_0 + t \mu , P_0 + s \widetilde{\mu }) $ in the last step, we also obtain ${V_\phi }(P_0 | P_0 + t \mu , P_0 + s \widetilde{\mu })=st \phi '(1)^2 \int gh dP_0 + o(t^2+s^2).$ $\square $
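
The local bilinearity established above can be observed numerically on a finite sample space. The Python sketch below is an illustration under stated assumptions: it takes $\phi (x) = x^\alpha $, so that $\phi (1)=1$ and $\phi '(1)=\alpha $, and it assumes that ${R_\phi }$ is the ratio form appearing in the last display of the proof. The function names, the reference measure and the perturbation directions are hypothetical.

```python
def r_phi_perturbed(p0, h, g, t, s, alpha):
    """R_phi(P0 | P0 + t*mu, P0 + s*mu_tilde) with phi(x) = x**alpha on a finite space.

    h and g are the P0-densities of mu and mu_tilde; both must integrate to zero.
    """
    f1 = [(1 + t * hi) ** alpha for hi in h]
    f2 = [(1 + s * gi) ** alpha for gi in g]
    num = sum(p * a * b for p, a, b in zip(p0, f1, f2))
    den1 = sum(p * a for p, a in zip(p0, f1))
    den2 = sum(p * b for p, b in zip(p0, f2))
    return num / (den1 * den2) - 1.0


def bilinear_term(p0, h, g, t, s, alpha):
    """Leading term s * t * phi'(1)**2 * integral(g * h dP0), with phi'(1) = alpha."""
    return s * t * alpha ** 2 * sum(p * a * b for p, a, b in zip(p0, h, g))


# Hypothetical example on four points; h and g have P0-mean zero.
p0 = [0.1, 0.2, 0.3, 0.4]
h = [1.0, -2.0, 1.0, 0.0]
g = [2.0, 1.0, 0.0, -1.0]
```

Evaluating both functions at, say, $t=s=10^{-3}$ and $\alpha =1/2$ shows a relative difference of order $t$, consistent with the $o(t^2+s^2)$ remainder.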

1.2 Proof of Proposition 2.6

Proof

By Fubini’s theorem,

$$\begin{aligned}&{R_\alpha \Bigg ( \bigotimes _{\ell =1}^d P_{0\ell } \, \Bigg | \, \bigotimes _{\ell =1}^d P_{1\ell } \, , \, \bigotimes _{\ell =1}^d P_{2\ell } \Bigg )}+1 \\&\quad = \dfrac{\displaystyle \int \left( \frac{d\bigotimes _{\ell =1}^d P_{1\ell }}{d\bigotimes _{\ell =1}^d P_{0\ell }} \right) ^\alpha \left( \frac{d\bigotimes _{\ell =1}^d P_{2\ell }}{d\bigotimes _{\ell =1}^d P_{0\ell }} \right) ^\alpha d \left( \bigotimes _{\ell =1}^d P_{0\ell } \right) }{\displaystyle \int \left( \frac{d\bigotimes _{\ell =1}^d P_{1\ell }}{d\bigotimes _{\ell =1}^d P_{0\ell }} \right) ^\alpha d \left( \bigotimes _{\ell =1}^d P_{0\ell } \right) \int \left( \frac{d\bigotimes _{\ell =1}^d P_{2\ell }}{d\bigotimes _{\ell =1}^d P_{0\ell }} \right) ^\alpha d \left( \bigotimes _{\ell =1}^d P_{0\ell } \right) } \\&\quad = \prod _{\ell =1}^d \dfrac{\displaystyle \int \left( \frac{dP_{1\ell }}{dP_{0\ell }} \right) ^\alpha \left( \frac{dP_{2\ell }}{dP_{0\ell }} \right) ^\alpha dP_{0\ell } }{\displaystyle \int \left( \frac{d P_{1\ell }}{dP_{0\ell }} \right) ^\alpha dP_{0\ell } \int \left( \frac{dP_{2\ell }}{dP_{0\ell }} \right) ^\alpha dP_{0\ell } } \\&\quad = \prod _{\ell =1}^d \big ( {R_\alpha }( P_{0\ell } | P_{1\ell }, P_{2\ell })+1 \big ). \end{aligned}$$

$\square $
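
The factorization over product measures is easy to verify on finite sample spaces. The following Python sketch is an illustration with hypothetical names and marginals: it builds the product measures explicitly, computes ${R_\alpha }$ on the product space, and compares the result with the product of the coordinatewise factors ${R_\alpha }+1$.

```python
import math
from itertools import product


def r_alpha(p0, p1, p2, alpha):
    """R_alpha(P0|P1,P2) for probability vectors on a common finite set (P0 > 0)."""
    num = sum(a * (b / a) ** alpha * (c / a) ** alpha for a, b, c in zip(p0, p1, p2))
    den1 = sum(a * (b / a) ** alpha for a, b in zip(p0, p1))
    den2 = sum(a * (c / a) ** alpha for a, c in zip(p0, p2))
    return num / (den1 * den2) - 1.0


def tensor(marginals):
    """Probability vector of the product measure, outcomes in itertools.product order."""
    return [math.prod(ps) for ps in product(*marginals)]


# Hypothetical marginals: coordinate 1 lives on two points, coordinate 2 on three points.
m0 = [[0.5, 0.5], [0.2, 0.3, 0.5]]
m1 = [[0.4, 0.6], [0.3, 0.3, 0.4]]
m2 = [[0.7, 0.3], [0.1, 0.4, 0.5]]
alpha = 0.5

lhs = r_alpha(tensor(m0), tensor(m1), tensor(m2), alpha)
rhs = math.prod(r_alpha(a, b, c, alpha) + 1.0 for a, b, c in zip(m0, m1, m2)) - 1.0
```

Here `lhs` and `rhs` agree up to floating-point error, as the identity predicts.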

1.3 Proof of Proposition 2.7

Proof

Write $P_{i}:=P_{\theta _i}$ and $p_i$ for the corresponding $\nu $-densities. By assumption, we have $\overline{\theta }_\alpha :=\alpha (\theta _1+\theta _2)+(1-2\alpha )\theta _0 \in \Theta $ and

$$\begin{aligned} \int \Big (\frac{dP_1}{dP_0}\Big )^\alpha \Big (\frac{dP_2}{dP_0}\Big )^\alpha dP_0&= \int \big (p_1(x)p_2(x)\big )^\alpha p_0(x)^{1-2\alpha } d\nu (x) \\&= \int h(x) \exp \big ( {\overline{\theta }}_\alpha ^\top T(x)\big ) d\nu (x) \\&\quad \cdot \exp \big (-\alpha A(\theta _1)-\alpha A(\theta _2)-(1-2\alpha )A(\theta _0)\big ) \\&= \exp \big (A({\overline{\theta }}_\alpha )-\alpha A(\theta _1)-\alpha A(\theta _2)-(1-2\alpha )A(\theta _0)\big ). \end{aligned}$$

Setting $P_2=P_0$ (or equivalently, $\theta _2=\theta _0$) in the previous identity gives

$$\begin{aligned} \int \Big (\frac{dP_1}{dP_0}\Big )^\alpha dP_0&= \exp \big (A\big (\alpha \theta _1+(1-\alpha )\theta _0\big )-\alpha A(\theta _1)-(1-\alpha )A(\theta _0)\big ). \end{aligned}$$

Interchanging the role of $\theta _2$ and $\theta _1$ provides moreover a closed-form expression for $\int \big (\frac{dP_2}{dP_0} \big )^\alpha dP_0.$ Combining everything yields

$$\begin{aligned}&{R_\alpha }(P_0|P_1,P_2) = \dfrac{\int \big (\frac{dP_1}{dP_0}\big )^\alpha \big (\frac{dP_2}{dP_0}\big )^\alpha dP_0}{\int \big (\frac{dP_1}{dP_0}\big )^\alpha dP_0 \int \big (\frac{dP_2}{dP_0}\big )^\alpha dP_0} -1 \\&\quad = \exp \left( A\big ({\overline{\theta }}_\alpha \big ) - A\big (\theta _0+\alpha \big (\theta _1-\theta _0\big )\big ) - A\big (\theta _0+\alpha \big (\theta _2-\theta _0\big )\big ) + A\big (\theta _0\big )\right) -1. \end{aligned}$$

$\square $
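
As an illustration of the closed form, consider the Poisson family of Sect. 5.2, with natural parameter $\theta = \log \lambda $ and $A(\theta ) = e^\theta $. The Python sketch below uses hypothetical function names and parameter values, and a series truncation level that is an assumption suited to these values. It compares the exponential-family expression with a direct evaluation of the three sums in the definition of ${R_\alpha }$.

```python
import math


def r_alpha_expfam(log_partition, th0, th1, th2, alpha):
    """Closed form exp(A(.) - A(.) - A(.) + A(th0)) - 1 for a one-parameter family."""
    return math.expm1(log_partition(th0 + alpha * (th1 + th2 - 2 * th0))
                      - log_partition(th0 + alpha * (th1 - th0))
                      - log_partition(th0 + alpha * (th2 - th0))
                      + log_partition(th0))


def poisson_logpmf(k, lam):
    """Log-probability of k under the Poisson distribution with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def r_alpha_poisson_series(lam0, lam1, lam2, alpha, terms=200):
    """Direct evaluation of R_alpha by truncating the sums over k = 0, 1, ..., terms - 1."""
    num = den1 = den2 = 0.0
    for k in range(terms):
        l0 = poisson_logpmf(k, lam0)
        l1 = poisson_logpmf(k, lam1)
        l2 = poisson_logpmf(k, lam2)
        num += math.exp(alpha * (l1 + l2) + (1 - 2 * alpha) * l0)
        den1 += math.exp(alpha * l1 + (1 - alpha) * l0)
        den2 += math.exp(alpha * l2 + (1 - alpha) * l0)
    return num / (den1 * den2) - 1.0


# Hypothetical example: means 2, 3 and 1.5, and alpha = 0.7.
closed = r_alpha_expfam(math.exp, math.log(2.0), math.log(3.0), math.log(1.5), 0.7)
series = r_alpha_poisson_series(2.0, 3.0, 1.5, 0.7)
```

Since the natural parameter space of the Poisson family is the whole real line, no additional constraint on $\alpha $ is needed in this example.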

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Derumigny, A., Schmidt-Hieber, J. Codivergences and information matrices. Info. Geo. 7, 253–282 (2024). https://doi.org/10.1007/s41884-024-00135-2

Received: 22 March 2023
Revised: 24 April 2024
Accepted: 25 April 2024
Published: 22 May 2024
Issue Date: June 2024
DOI: https://doi.org/10.1007/s41884-024-00135-2
