Codivergences and information matrices

We propose a new concept of codivergence, which quantifies the similarity between two probability measures $P_1, P_2$ relative to a reference probability measure $P_0$. In a neighborhood of the reference measure $P_0$, a codivergence behaves like an inner product between the measures $P_1 - P_0$ and $P_2 - P_0$. Codivergences of covariance-type and correlation-type are introduced and studied, with a focus on two specific correlation-type codivergences, the $\chi^2$-codivergence and the Hellinger codivergence. We derive explicit expressions for several common parametric families of probability distributions. Moreover, for a given codivergence, we introduce the divergence matrix as an analogue of the Gram matrix. It is shown that the $\chi^2$-divergence matrix satisfies a data-processing inequality.


Introduction
One of the objectives of information geometry is to measure distances or angles in statistical spaces, usually for parametric models. This is often done by the use of a divergence, generating a Riemannian manifold structure on the considered space of distributions, see [1][2][3]. Divergences between probability measures quantify a certain notion of difference between them. Divergences are in general not symmetric, as opposed to distances. Famous examples of divergences include the Kullback-Leibler divergence, the $\chi^2$-divergence, and the Hellinger distance.
In this article, we are interested in defining a local notion of inner product between two probability measures in the neighborhood of a given reference probability measure $P_0$. This allows us to identify different directions relative to $P_0$, and to give some meaning to the "angle" between these directions. In contrast to most previous work on finite-dimensional Riemannian manifolds spanned by specific parametric statistical models, we do not require any parametric restrictions on the considered probability measures.
Our approach is motivated by the recently introduced generic framework for deriving lower bounds on the trade-off between bias and variance in nonparametric statistical models [4]. The key ingredients in this lower-bound strategy are so-called change-of-expectation inequalities, which relate the change of the expected value of a random variable under different distributions to the variance and also involve the divergence matrices examined in this work. Another possible area of application are lower bounds for statistical query algorithms, see e.g. [5, 6].
Regarding work on infinite-dimensional information geometry, [7, 8] studied the manifold generated by all probability densities connected to a given probability density. [9] reviews a more general theory of infinite-dimensional statistical manifolds given a reference density. Another line of work [10][11][12][13] seeks to define infinite-dimensional manifolds, with applications to Bayesian estimation and the choice of priors. [14] considers different possible structures on the set of probability densities on $[0, 1]$.
The article is structured as follows. In Section 2, we introduce a general concept of codivergence, study specific properties of codivergences on the space of probability measures, and discuss specific (classes of) codivergences. Section 3 considers the construction of divergence matrices from a given codivergence. Section 4 is devoted to the data-processing inequality that holds for the $\chi^2$-divergence matrix introduced in Section 3, thereby generalizing the usual data-processing inequality for the $\chi^2$-divergence. Section 5 provides derivations of explicit expressions for a class of codivergences applied to common parametric models. Elementary facts on ranks from linear algebra are collected in Section 6.
Notation: If $P$ is a probability measure and $X$ a random vector, we write $E_P[X]$ and $\mathrm{Cov}_P(X)$ for the expectation vector and the covariance matrix with respect to $P$, respectively.

Abstract framework and definition
We start by recalling the definition of a divergence [1, Definition 1.1]. This definition is situated within the framework of a $d$-dimensional differentiable manifold $\mathcal{X}$ with an atlas $(U_i, \phi_i)$. Formally, this means that the $(U_i)$ are an open cover of the topological space $\mathcal{X}$, and the $\phi_i: U_i \to \mathbb{R}^d$ are homeomorphisms onto their images, see [15].

Definition 2.1. A divergence $D$ on a $d$-dimensional differentiable manifold $\mathcal{X}$ is a function $\mathcal{X}^2 \to \mathbb{R}_+$ satisfying

(i) $\forall P, Q \in \mathcal{X}$, $D(P|Q) = 0$ if and only if $P = Q$.

(ii) For all $P \in \mathcal{X}$ and any chart $(U, \phi)$ with $P \in U$, there exists a matrix $G = G(P)$ such that for any $Q \in U$,
$$D(P|Q) = \big(\phi(Q) - \phi(P)\big)^\top G(P) \big(\phi(Q) - \phi(P)\big) + o\big(\|\phi(Q) - \phi(P)\|^2\big). \qquad (1)$$

The matrix $G = G(P)$ may depend on the choice of coordinates $\phi$. For the most common divergences, $G$ is symmetric and positive definite, and thus defines a scalar product on the tangent space at $P$. Whereas a divergence measures the similarity between two elements $P, Q \in \mathcal{X}$, we want to define codivergences measuring the angle $\angle P_1 P_0 P_2$ of $P_1, P_2 \in \mathcal{X}$ relative to $P_0 \in \mathcal{X}$.
Equation (1) states that the divergence $D(P|Q)$ is a quadratic form in the local coordinates $\phi(Q) \in \mathbb{R}^d$, whenever $P$ and $Q$ are close. Generalizing to the infinite-dimensional case requires working with bilinear forms instead. Moreover, for the infinite-dimensional setting, imposing an expansion of the form (1) in every possible direction around $P \in \mathcal{X}$ is restrictive. We therefore allow the quadratic expansion to hold on a possibly smaller bilinear expansion domain. Furthermore, we allow codivergences to attain the value $+\infty$. This is inspired by existing statistical divergences (such as the $\chi^2$- or Kullback-Leibler divergences) that can also take the value $+\infty$. Therefore, imposing an expansion of the form (1) globally may not be possible, as the codivergence on the left-hand side of (1) may take the value $+\infty$ in some directions away from $P$, while the right-hand side of (1) is always finite.
We now provide the definition of a codivergence if $\mathcal{X}$ is a subset of a real vector space.

Definition 2.2. Let $\mathcal{X}$ and $(E_u)_{u \in \mathcal{X}}$ be a subset and a family of subspaces of a real vector space $E$, respectively. A function $(u, v, w) \in \mathcal{X}^3 \mapsto D(u|v, w) \in \mathbb{R} \cup \{+\infty\}$ defines a codivergence on $\mathcal{X}$ with bilinear expansion domain $E_u$ at $u$ if, for any $u, v, w \in \mathcal{X}$,

(i) $D(u|v, w) = D(u|w, v)$;

(ii) $D(u|v, v) \geq 0$, with equality if $u = v$;

(iii) there exists a bilinear map $\langle \cdot, \cdot \rangle_u$ defined on $E_u$ such that, for any $h, g \in E_u$ and for any scalars $s, t$ in some sufficiently small open neighborhood of $(0, 0)$ (that may depend on $h$ and $g$) with respect to the Euclidean topology in $\mathbb{R}^2$, we have $(u + th, u + sg) \in \mathcal{X}^2$, $D(u \,|\, u + th, u + sg) < +\infty$, and
$$D(u \,|\, u + th, u + sg) = ts \langle h, g \rangle_u + o(t^2 + s^2) \quad \text{as } (s, t) \to (0, 0).$$

The last part of the definition imposes that, locally around each $u$, the codivergence $(v, w) \mapsto D(u|v, w)$ is finite and behaves like a bilinear form in the centered variables $(v - u, w - u)$. As a consequence, for a given $u$, the mapping $(v, w) \mapsto D(u|v, w)$ is Gateaux-differentiable on $\mathcal{X}^2$ at $(u, u)$ with Gateaux derivative $0$ in every direction $(h, g) \in E_u^2$. Condition (iii) can moreover be understood as a second-order Taylor expansion at $(u, u)$ in the direction $(h, g)$. The mapping $(v, w) \mapsto D(u|v, w)$ need not, however, be twice Gateaux-differentiable at $(u, u)$ for (iii) to hold. This is analogous to the usual counterexamples in analysis where a function may admit a second-order Taylor expansion at a given point without being twice differentiable. Nevertheless, if $D(u \,|\, u + th, u + sg)$ is twice differentiable in $(t, s)$ at $(0, 0)$, then the partial derivative $\partial^2 D(u \,|\, u + th, u + sg)/\partial t \partial s$ at $(0, 0)$ must be equal to $\langle h, g \rangle_u$. We refer to [16] for a discussion of higher-order functional derivatives.
We also provide a definition of a codivergence if $\mathcal{X}$ is a differentiable Banach manifold; see [15, 17] for an introduction to Banach manifolds. Let $B$ be a Banach space and $\mathcal{X}$ be a Banach manifold modeled on $B$. This guarantees the existence of a $B$-atlas $(U_i, \phi_i)$ with $(U_i)$ an open cover of $\mathcal{X}$ and $\phi_i: U_i \to B$ homeomorphisms onto open subsets of $B$. This generalization can be useful in the case where the space $\mathcal{X}$ is not flat. Indeed, part (iii) of Definition 2.2 imposes that for $h \in E_u$, we must have $u + th \in \mathcal{X}$ for $t$ small enough. By contrast, in the following definition we consider a more subtle case where the point $u$ may be approached along a smooth curve (not necessarily affine), under the assumption that $\mathcal{X}$ is a $B$-manifold.
We first recall the construction of the tangent space via curves, following [17, Definition 2.21] and [18, Section 2.1.1]: for a fixed $u \in \mathcal{X}$, let $i$ be such that $u \in U_i$ and let $C_u$ be the set of smooth curves $c$ such that $c: [-1, 1] \to U_i$ and $c(0) = u$. We define an equivalence relation $\sim$ on $C_u$ by $c_1 \sim c_2$ if for all smooth real-valued functions $f$ on $U_i$, we have $(f \circ c_1)'(0) = (f \circ c_2)'(0)$. We define the tangent space at $u$ as the quotient set $T_u := C_u/\sim$, which can be given a vector space structure isomorphic to $B$.
We give a short outline of the main ideas to obtain this property. Let $\mathrm{D}$ denote the Fréchet differential operator. For $\bar{c} \in T_u$ and $c$ a representative of the equivalence class $\bar{c}$, note that $\phi_i \circ c: [-1, 1] \to B$ is differentiable (by assumption on $c$); the mapping $\mathrm{D}(\phi_i \circ c)(0)$ is linear from $\mathbb{R}$ to $B$ and can therefore be identified with an element of $B$ itself; this element $\mathrm{D}(\phi_i \circ c)(0)$ moreover does not depend on the representative $c$. This defines a mapping $\theta_u: T_u \to B$ by $\theta_u(\bar{c}) := \mathrm{D}(\phi_i \circ c)(0)$. It can be shown that $\theta_u$ is bijective. Through its inverse $\theta_u^{-1}$, one can transport the vector space structure of $B$ onto $T_u$, making it a real vector space too.
Definition 2.3. Let $\mathcal{X}$ be a $B$-manifold. A function $(u, v, w) \in \mathcal{X}^3 \mapsto D(u|v, w) \in \mathbb{R} \cup \{+\infty\}$ defines a codivergence on $\mathcal{X}$ with bilinear expansion domain $E_u$ at $u$ if conditions (i) and (ii) of Definition 2.2 hold, and moreover:

(iii) $E_u$ is a subspace of the tangent space $T_u$ of $\mathcal{X}$ at $u$;

(iv) there exists a bilinear map $\langle \cdot, \cdot \rangle_u$ defined on $E_u$ such that, for any $\bar{g}, \bar{h} \in E_u$, for any representatives $g$ and $h$ of the respective equivalence classes $\bar{g}$ and $\bar{h}$, and for any scalars $s, t$ in some sufficiently small open neighborhood of $(0, 0)$ with respect to the Euclidean topology in $\mathbb{R}^2$ (the neighborhood may depend on the choice of the representatives $g$ and $h$), we have $D(u \,|\, h(t), g(s)) < +\infty$ and
$$D(u \,|\, h(t), g(s)) = ts \langle \bar{h}, \bar{g} \rangle_u + o(t^2 + s^2) \quad \text{as } (s, t) \to (0, 0).$$
From a codivergence $D(u|v, w)$ that takes finite values on a finite-dimensional manifold and whose bilinear expansion domains are the tangent spaces, we can always construct a divergence by setting $v = w$. Then $D(u|v, v)$ behaves like a quadratic form in $v$ whenever $v$ is close to $u$.
If $\mathcal{X}$ is a $B$-manifold and a closed subspace of a vector space $E$, then the notions of codivergences in Definition 2.2 and Definition 2.3 coincide. This is because differentiable curves are, to first order, linear functions in a small enough neighborhood of $0$.
For both definitions, a given space $\mathcal{X}$, and a given family of bilinear expansion domains $(E_u)_{u \in \mathcal{X}}$, the set of codivergences on $\mathcal{X}$ is a convex cone.
For an example covered by Definition 2.3 but not by Definition 2.2, assume that $\mathcal{X}$ is the unit circle. No codivergence can exist in the sense of Definition 2.2 with non-trivial bilinear expansion domains $(E_u)$, since for $u \in \mathcal{X}$ and $h \neq 0$, the point $u + th$ does not lie on the circle for small $t \neq 0$. An example of a codivergence on $\mathcal{X} = \{e^{i\theta}, \theta \in \mathbb{R}\}$ in the sense of Definition 2.3 is
$$D(u \,|\, v, w) := \begin{cases} (\theta_1 - \theta_0)(\theta_2 - \theta_0), & \text{if } v, w \in U_u, \\ +\infty, & \text{otherwise,} \end{cases}$$
where $(u, v, w) \in \mathcal{X}^3$, $U_u := \{u e^{i\theta}, \theta \in (-\pi/2, \pi/2)\}$, $u = e^{i\theta_0}$, $v = e^{i\theta_1}$ and $w = e^{i\theta_2}$ for some $\theta_1, \theta_2 \in [\theta_0 - \pi, \theta_0 + \pi)$. This choice of angles for $v$ and $w$ always exists and is unique since $[\theta_0 - \pi, \theta_0 + \pi)$ is a half-open interval of length $2\pi$. In this case, the tangent space $T_u$ of the circle at any point $u = e^{i\theta_0}$ is diffeomorphic to $\mathbb{R}$ and we will use this identification (denoted by the symbol $\simeq$). Let $g, h \in T_u \simeq \mathbb{R}$ and assume $s, t \in \mathbb{R}$. Then $g(s) = u e^{igs} \in U_u$ for $s$ small enough. Similarly, $h(t) = u e^{iht} \in U_u$ for $t$ small enough. So $D(u \,|\, h(t), g(s))$ is finite for all $(s, t)$ in a small enough neighborhood of $(0, 0)$, and, whenever this is the case, we have
$$D(u \,|\, h(t), g(s)) = ts\, hg = ts \langle h, g \rangle_u,$$
where $\langle h, g \rangle_u = gh$ is the local bilinear form (which in this example is independent of $u$), and the bilinear expansion domain can be taken to be $E_u = T_u \simeq \mathbb{R}$.

[Fig. 1: The codivergence between $P_1$ and $P_2$ at $P_0$ measures the position of $P_1$ and $P_2$ relative to $P_0$.]

Codivergences on the space of probability measures
For the application to statistics, $E$ is the space of all finite signed measures on a measurable space $(A, \mathcal{B})$, and $\mathcal{X}$ is the space of all probability measures on $(A, \mathcal{B})$. Probability measures form a convex subset of the space $E$ of all finite signed measures. Since $E$ is a vector space, the natural definition of a codivergence on $\mathcal{X}$ is Definition 2.2. A visual representation of such a codivergence is provided in Figure 1.
In a next step, we characterize the bilinear expansion domains of a codivergence for $\mathcal{X}$ the space of probability measures. Given a probability measure $P_0 \in \mathcal{X}$, we say that a function $h: A \to \mathbb{R}$ is $P_0$-essentially bounded by a constant $C > 0$ if $P_0(\{x \in A: |h(x)| \leq C\}) = 1$, and define $\operatorname{ess\,sup}_{P_0} |h| := \inf\{C > 0: |h| \text{ is } P_0\text{-essentially bounded by } C\}$. We will show that
$$M_{P_0} := \Big\{ \mu \text{ finite signed measure on } (A, \mathcal{B}): \ \mu \ll P_0, \ \mu(A) = 0, \ \operatorname{ess\,sup}_{P_0} \big| d\mu/dP_0 \big| < +\infty \Big\}$$
is the largest bilinear expansion domain that any codivergence on $\mathcal{X}$ can have at $P_0$. The rationale is that $P_0 + t\mu$ is otherwise not a probability measure.
Indeed, if $\mu \in M_{P_0}$ has density $h$ with respect to $P_0$, then, for given $t > 0$, the $P_0$-density $1 + th$ of $P_0 + t\mu$ is non-negative if and only if $h \geq -1/t$, $P_0$-almost everywhere. Conversely, for given $t < 0$, the density $1 + th$ is non-negative if and only if $h \leq 1/|t|$, $P_0$-almost everywhere. This gives a link between a bound on $h = d\mu/dP_0$ and the non-negativity of the measure $P_0 + t\mu$.
For every measurable set $B \in \mathcal{B}$,
$$(P_0 + t\mu)(B) = \int_B (1 + th)\, dP_0.$$
The value of an integral is unchanged if the integrand is modified on a $P_0$-null set. Therefore, we only need the function $1 + th$ to be non-negative $P_0$-almost everywhere for $P_0 + t\mu$ to be a positive measure.
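To make the role of the essential supremum concrete, here is a minimal numerical sketch on a two-point sample space (the measure $P_0$ and the perturbation density $h$ are our own toy choices, not from the original text):

```python
import numpy as np

# Two-point sample space; P0 uniform. The perturbation mu has P0-density h
# with integral zero, so that P0 + t*mu keeps total mass 1.
p0 = np.array([0.5, 0.5])
h = np.array([1.0, -1.0])           # ess sup |h| = 1, and sum(h * p0) = 0
mu = h * p0                          # signed measure with dmu/dP0 = h

def is_probability(t):
    q = p0 + t * mu
    return bool(np.all(q >= 0) and np.isclose(q.sum(), 1.0))

# P0 + t*mu is a probability measure exactly for |t| <= 1 = 1/ess sup |h|,
# in line with Proposition 2.4 below (a* = 1 here).
print([is_probability(t) for t in (0.5, 1.0, 1.0001)])   # [True, True, False]
```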
Proposition 2.4. For any codivergence $D$ on the space of probability measures $\mathcal{X}$, the bilinear expansion domain of $D$ at any probability measure $P_0 \in \mathcal{X}$ must be included in $M_{P_0}$. Furthermore, every $\mu \in M_{P_0}$ has a density $d\mu/dP_0$ with respect to $P_0$ such that $\operatorname{ess\,sup}_{P_0} |d\mu/dP_0| = 1/a^*$ with $a^* := \sup\{a > 0: P_0 + t\mu \in \mathcal{X} \text{ for all } t \in [-a, a]\} \in (0, +\infty]$ and the convention $1/{+\infty} = 0$.

Proof of Proposition 2.4. We begin by proving the first part. Let $\mu$ be a finite signed measure belonging to the bilinear expansion domain at $P_0$ of some codivergence $D$ on the space of probability measures $\mathcal{X}$. For $a > 0$, we write $\mu \in R(a)$ if and only if $P_0 + t\mu \in \mathcal{X}$ for all $-a \leq t \leq a$. Since $\mu$ belongs to the bilinear expansion domain of $D$ at $P_0$, Definition 2.2(iii) implies the existence of an open neighborhood $T$ of $0$ such that for any $t \in T$, $P_0 + t\mu \in \mathcal{X}$. Therefore, there exists $a > 0$ with $\mu \in R(a)$.
We now show that $\mu \in R(a)$, for some $a > 0$, implies $\mu \ll P_0$. The proof relies on the Jordan decomposition theorem for finite signed measures (e.g. Corollary 4.1.6 in [19]). It states that every finite signed measure $\mu$ on a measurable space $(A, \mathcal{B})$ can be decomposed as
$$\mu = \alpha_+ \mu_+ - \alpha_- \mu_-, \qquad (3)$$
with $\alpha_+, \alpha_- \geq 0$ and $\mu_+, \mu_-$ orthogonal probability measures on $(A, \mathcal{B})$. By the Lebesgue decomposition theorem (see Theorem 4.3.2 in [19]), $\mu$ can always be decomposed as $\mu = \mu_A + \mu_S$, where $\mu_A$ is a signed measure that is absolutely continuous with respect to $P_0$, $\mu_S$ is a signed measure that is singular with respect to $P_0$, and $\mu_A$ and $\mu_S$ are orthogonal. By the Jordan decomposition (3), we decompose the signed measure $\mu_S = \alpha_+ \mu_{S,+} - \alpha_- \mu_{S,-}$ into its positive part $\mu_{S,+}$ and its negative part $\mu_{S,-}$. These two measures are orthogonal and $\alpha_+, \alpha_- \geq 0$. Then, $P_0 + a\mu = P_0 + a\mu_A + a\alpha_+ \mu_{S,+} - a\alpha_- \mu_{S,-}$ can be a probability measure only if $\alpha_- = 0$. This is because we can find a set $U$ such that
$$P_0(U) = \mu_A(U) = \mu_{S,+}(U) = 0 \quad \text{and} \quad \mu_{S,-}(U) = 1,$$
so that $(P_0 + a\mu)(U) = -a\alpha_- < 0$ if $\alpha_- > 0$. In the same way, $P_0 - a\mu$ can be a probability measure only if $\alpha_+ = 0$. Therefore, if $\mu \in R(a)$ for some $a > 0$, then $\alpha_+ = \alpha_- = 0$, and $\mu = \mu_A$ is absolutely continuous with respect to $P_0$.
Let $h$ be the density of $\mu$ with respect to $P_0$. Then
$$(P_0 + t\mu)(A) = 1 + t \int h \, dP_0.$$
Note that $P_0 + t\mu$ is a signed measure integrating to one if and only if $\int d\mu = \int h \, dP_0 = 0$.
Conversely, note that $\mu \in M_{P_0}$ is a sufficient condition for $P_0 + t\mu$ to be a probability measure for all $t$ in a sufficiently small open neighborhood of $0$.

Examples of codivergences
For any real $a \geq 0$, set $a/0 := +\infty$. For a non-negative function $\phi: [0, \infty) \to [0, \infty)$, we can define two codivergences. The first one will be referred to as the covariance-type codivergence between three probability measures $P_0, P_1, P_2$ and is defined by
$$V_\phi(P_0 \,|\, P_1, P_2) := \int \phi\Big(\frac{dP_1}{dP_0}\Big) \phi\Big(\frac{dP_2}{dP_0}\Big)\, dP_0 - \int \phi\Big(\frac{dP_1}{dP_0}\Big)\, dP_0 \int \phi\Big(\frac{dP_2}{dP_0}\Big)\, dP_0,$$
and the second one will be called the correlation-type codivergence and is defined by
$$R_\phi(P_0 \,|\, P_1, P_2) := \frac{\int \phi\big(\frac{dP_1}{dP_0}\big) \phi\big(\frac{dP_2}{dP_0}\big)\, dP_0}{\int \phi\big(\frac{dP_1}{dP_0}\big)\, dP_0 \int \phi\big(\frac{dP_2}{dP_0}\big)\, dP_0} - 1,$$
if $P_1, P_2 \ll P_0$. Otherwise, we define both $V_\phi(P_0|P_1, P_2)$ and $R_\phi(P_0|P_1, P_2)$ to be equal to $+\infty$. For $\phi(x) = x^\alpha$ with $\alpha > 0$, we write $R_\alpha$ for $R_\phi$; the choice $\alpha = 1$ yields the $\chi^2$-codivergence $\chi^2(P_0|P_1, P_2) = \int \frac{dP_1}{dP_0} \frac{dP_2}{dP_0}\, dP_0 - 1$.
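As a sketch of how these quantities can be evaluated in practice, the following snippet computes $V_\phi$ and $R_\phi$ for discrete distributions (the three distributions are our own toy choices):

```python
import numpy as np

def codivergences(p0, p1, p2, phi):
    """Covariance- and correlation-type codivergences V_phi, R_phi for
    discrete distributions p1, p2 relative to p0 (all strictly positive)."""
    f1, f2 = phi(p1 / p0), phi(p2 / p0)          # phi of the likelihood ratios
    m1, m2 = np.sum(f1 * p0), np.sum(f2 * p0)    # integrals of phi(dPj/dP0) dP0
    cross = np.sum(f1 * f2 * p0)
    V = cross - m1 * m2                          # covariance-type
    R = cross / (m1 * m2) - 1                    # correlation-type
    return V, R

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.4, 0.4, 0.2])
p2 = np.array([0.6, 0.2, 0.2])

print(codivergences(p0, p1, p2, phi=lambda x: x))   # chi^2: here V = R since m1 = m2 = 1
print(codivergences(p0, p1, p2, phi=np.sqrt))       # Hellinger codivergence R_{1/2}
```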
We say that a function $f$ admits a second-order Taylor expansion around $x_0$ if $f(x_0 + u) = f(x_0) + u f'(x_0) + \tfrac{u^2}{2} f''(x_0) + o(u^2)$ as $u \to 0$. If $\nu$ is a measure dominating $P_0$, the bilinear map can be written as
$$\langle \mu_1, \mu_2 \rangle_{P_0} = \int \frac{(d\mu_1/d\nu)(d\mu_2/d\nu)}{dP_0/d\nu} \, d\nu = \int \frac{d\mu_1}{dP_0} \frac{d\mu_2}{dP_0} \, dP_0. \qquad (6)$$
A consequence of Proposition 2.5 is the following: locally, all $V_\phi$ and $R_\phi$ codivergences (that satisfy the regularity conditions) define the same structure. This scalar product is the nonparametric Fisher information metric. The name originates from the identity [10, Equation (8)]
$$[I(\theta)]_{ij} = \int \frac{p_i(x|\theta)\, p_j(x|\theta)}{p(x|\theta)} \, d\nu(x), \qquad (7)$$
where $[I(\theta)]_{ij}$ is the $(i, j)$-th entry of the Fisher information matrix for a parametric model of $\nu$-densities $p(\cdot|\theta)$ indexed by a finite-dimensional parameter $\theta$, and $p_i(x|\theta) := \partial p(x|\theta)/\partial \theta_i$. Equation (7) and Equation (6) have the same structure. One of the earliest references to the nonparametric Fisher information metric is [20]. The concept has been applied in several frameworks, such as computer vision [14] and shape data analysis [21]. The geometry of the nonparametric Fisher information metric has been studied by [10, 22] in the context of Bayesian inference.
Another interesting codivergence is $R_\alpha$ with $\alpha = 1/2$. The resulting codivergence is called the Hellinger codivergence. We can (and will) define the Hellinger codivergence as
$$\rho(P_0 \,|\, P_1, P_2) := \frac{\int \sqrt{p_1 p_2} \, d\nu}{\int \sqrt{p_1 p_0} \, d\nu \int \sqrt{p_2 p_0} \, d\nu} - 1$$
whenever the denominator is positive. This is considerably weaker than $P_1, P_2 \ll P_0$, as it is only required that the support of $p_0$ intersects the support of $p_1$ and the support of $p_2$ with non-zero $\nu$-mass. Note that $\rho(P_0|P_1, P_2)$ is independent of the choice of the dominating measure $\nu$ (and potentially $+\infty$ if the denominator is $0$).
The name Hellinger codivergence is motivated by the representation
$$\rho(P_0 \,|\, P_1, P_2) = \frac{\alpha(P_1, P_2)}{\alpha(P_1, P_0)\, \alpha(P_2, P_0)} - 1,$$
where $\alpha(P, Q) := \int \sqrt{pq} \, d\nu$ is the Hellinger affinity between two positive measures $P, Q$ with densities $p, q$ taken with respect to a common dominating measure.
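As a worked example (our own computation, not from the original text, using the standard Hellinger affinity $\alpha(N(\mu, 1), N(\mu', 1)) = \exp(-(\mu - \mu')^2/8)$ for unit-variance Gaussians $P_j = N(\mu_j, 1)$), the representation gives

```latex
\rho(P_0 \mid P_1, P_2)
  = \frac{\alpha(P_1, P_2)}{\alpha(P_1, P_0)\,\alpha(P_2, P_0)} - 1
  = \exp\Big( \frac{(\mu_1 - \mu_0)^2 + (\mu_2 - \mu_0)^2 - (\mu_1 - \mu_2)^2}{8} \Big) - 1
  = \exp\Big( \frac{(\mu_1 - \mu_0)(\mu_2 - \mu_0)}{4} \Big) - 1,
```

so the Hellinger codivergence vanishes exactly when one of the displacements $\mu_1 - \mu_0$, $\mu_2 - \mu_0$ is zero, the one-dimensional instance of the orthogonality behavior discussed below.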
The $\chi^2$- and Hellinger codivergences are of interest as they can be used to control changes of expectation between probability measures, see Section 2.2 of [4].
We always have
$$\rho(P_0 \,|\, P_1, P_1) \leq \chi^2(P_0 \,|\, P_1, P_1). \qquad (10)$$
To see this, observe that Hölder's inequality with $p = 3/2$ and $q = 3$ gives, for any non-negative function $f$,
$$\int f \, d\nu \leq \Big( \int \sqrt{f p_0} \, d\nu \Big)^{2/3} \Big( \int \frac{f^2}{p_0} \, d\nu \Big)^{1/3}.$$
Applied to $f = p_1$, this yields $1 + \rho(P_0|P_1, P_1) = (\int \sqrt{p_1 p_0} \, d\nu)^{-2} \leq \int p_1^2/p_0 \, d\nu = 1 + \chi^2(P_0|P_1, P_1)$. Subtracting one on each side of this expression yields (10). Proposition 2.5 implies that the $\chi^2$-codivergence and the Hellinger codivergence are codivergences with respective bilinear maps $\langle \mu_1, \mu_2 \rangle_{P_0}$ for the $\chi^2$-codivergence and $\langle \mu_1, \mu_2 \rangle_{P_0}/4$ for the Hellinger codivergence.
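The following numerical sketch illustrates these two bilinear maps on a discrete space (the perturbation densities $h, g$ are our own choices; for the $\chi^2$-codivergence the bilinear expansion is in fact exact):

```python
import numpy as np

p0 = np.array([0.4, 0.35, 0.25])
h = np.array([1.0, -2.0, 1.2])         # dmu1/dP0, satisfies sum(h * p0) = 0
g = np.array([-1.0, 0.4, 1.04])        # dmu2/dP0, satisfies sum(g * p0) = 0

def chi2_codiv(p0, p1, p2):
    return np.sum(p1 * p2 / p0) - 1.0

def hellinger_codiv(p0, p1, p2):
    num = np.sum(np.sqrt(p1 * p2))
    return num / (np.sum(np.sqrt(p1 * p0)) * np.sum(np.sqrt(p2 * p0))) - 1.0

inner = np.sum(h * g * p0)             # <mu1, mu2>_{P0}
for t, s in [(0.1, 0.1), (0.01, 0.02)]:
    p1, p2 = p0 * (1 + t * h), p0 * (1 + s * g)
    print(chi2_codiv(p0, p1, p2), t * s * inner)            # equal (exactly)
    print(hellinger_codiv(p0, p1, p2), t * s * inner / 4)   # equal up to o(t^2 + s^2)
```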
For the Hellinger codivergence, the expansion in Proposition 2.5 can be generalized. Assume that $P_0$ is dominated by some positive measure $\nu$. Define $\mathrm{Supp}(\mu) := \{x \in A: \frac{d\mu}{d\nu}(x) \neq 0\}$ for any signed measure $\mu$ dominated by $\nu$. If $\mu_1$ and $\mu_2$ are signed measures dominated by $\nu$ such that (i) $\mathrm{Supp}(\mu_i) \cap \mathrm{Supp}(P_0)$ has positive $\nu$-measure, and (ii) their densities $h_i$ are positive on $\mathrm{Supp}(\mu_i) \setminus \mathrm{Supp}(P_0)$, then an expansion of $\rho(P_0 \,|\, P_0 + t\mu_1, P_0 + s\mu_2)$ still holds. Compared to Definition 2.2(iii), there is thus an additional term for probability measures that have mass outside of the support of $P_0$. Consequently, this expansion cannot be linked to a single local bilinear form, and the mapping $(t, s) \in \mathbb{R}_+^2 \mapsto \rho(P_0|P_0 + t\mu_1, P_0 + s\mu_2)$ is not differentiable at $(0, 0)$. This is in line with Proposition 2.4: for perturbations $\mu$ that do not belong to $M_{P_0}$, the measures $P_0 + t\mu$ cannot be probability measures for all $t$ in any open neighborhood of $0$.
The $R_\alpha$ codivergences admit convenient expressions for product measures and for exponential families. The first proposition is proved in Section A.2.

Proposition 2.6. Let $P_{j\ell}$ be probability measures for any $j = 0, 1, 2$ and any $\ell = 1, \ldots, d$ satisfying $P_{1\ell}, P_{2\ell} \ll P_{0\ell}$. Then
$$1 + R_\alpha\Big( \bigotimes_{\ell=1}^d P_{0\ell} \,\Big|\, \bigotimes_{\ell=1}^d P_{1\ell}, \bigotimes_{\ell=1}^d P_{2\ell} \Big) = \prod_{\ell=1}^d \big( 1 + R_\alpha(P_{0\ell} \,|\, P_{1\ell}, P_{2\ell}) \big).$$

Proposition 2.7. Let $\Theta$ be a subset of a real vector space and let $(P_\theta: \theta \in \Theta)$ be an exponential family with $\nu$-densities $p_\theta(x) = h(x) \exp(\theta^\top T(x) - A(\theta))$ for some dominating measure $\nu$. Then, for any $\theta_0, \theta_1, \theta_2 \in \Theta$ satisfying
$$\theta_0 + \alpha(\theta_1 - \theta_0) \in \Theta, \quad \theta_0 + \alpha(\theta_2 - \theta_0) \in \Theta, \quad \theta_0 + \alpha(\theta_1 - \theta_0) + \alpha(\theta_2 - \theta_0) \in \Theta, \qquad (12)$$
we have
$$1 + R_\alpha(P_{\theta_0} \,|\, P_{\theta_1}, P_{\theta_2}) = \exp\Big( A\big(\theta_0 + \alpha(\theta_1 - \theta_0) + \alpha(\theta_2 - \theta_0)\big) + A(\theta_0) - A\big(\theta_0 + \alpha(\theta_1 - \theta_0)\big) - A\big(\theta_0 + \alpha(\theta_2 - \theta_0)\big) \Big).$$

This proposition is proved in Section A.3. Condition (12) is satisfied if $\Theta$ is a vector space or if $0 < \alpha \leq 1/2$ and $\Theta$ is convex. In the case of the Gamma distribution, the natural parameter space is $\Theta = (-1, +\infty) \times (-\infty, 0)$, and in this case the constraints in (12) are sufficient and necessary for the statement of Proposition 2.7 to hold, see Section 5.4 for details.
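To illustrate Proposition 2.7 (a worked example of ours, not from the original text): for the Gaussian location family $P_\theta = N(\theta, I_d)$, one has $A(\theta) = \|\theta\|^2/2$ and $\Theta = \mathbb{R}^d$, so that (12) holds for every $\alpha > 0$, and, with $\Delta_j := \theta_j - \theta_0$,

```latex
1 + R_\alpha(P_{\theta_0} \mid P_{\theta_1}, P_{\theta_2})
  = \exp\Big( \tfrac{1}{2}\|\theta_0 + \alpha\Delta_1 + \alpha\Delta_2\|^2
            + \tfrac{1}{2}\|\theta_0\|^2
            - \tfrac{1}{2}\|\theta_0 + \alpha\Delta_1\|^2
            - \tfrac{1}{2}\|\theta_0 + \alpha\Delta_2\|^2 \Big)
  = \exp\big( \alpha^2\, \Delta_1^\top \Delta_2 \big).
```

In particular, $R_\alpha(P_{\theta_0}|P_{\theta_1}, P_{\theta_2}) = 0$ if and only if $\Delta_1 \perp \Delta_2$, consistent with the discussion of Table 1 below.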
For the most common families of distributions, closed-form expressions for the codivergences $R_\alpha(P_{\theta_0} \,|\, P_{\theta_1}, P_{\theta_2})$ are reported in Table 1. Derivations for these expressions are given in Section 5. That section also contains expressions for the Gamma distribution. As mentioned before, these codivergences quantify to which extent the measures $P_1$ and $P_2$ represent different directions around $P_0$. The explicit formulas show this in terms of the parameters and reveal significant similarity between the different families. For the multivariate normal distribution, the $R_\alpha$ codivergence vanishes if and only if the vectors $\theta_1 - \theta_0$ and $\theta_2 - \theta_0$ are orthogonal.
[Table 1: Closed-form expressions for the $R_\alpha$ codivergence $R_\alpha(P_0|P_1, P_2)$ for some parametric distributions; the expressions are valid if all the involved quantities are positive, and equal $+\infty$ else. Proofs can be found in Section 5.]
Divergence matrices

Definition 3.1. Given a codivergence $D$ on $\mathcal{X}$ and elements $u, v_1, \ldots, v_M \in \mathcal{X}$, the divergence matrix $D(u|v_1, \ldots, v_M)$ is the $M \times M$ matrix with $(j, k)$-th entry $D(u|v_j, v_k)$.

If $v_1, \ldots, v_M$ are all in a neighborhood of $u$, the divergence matrix can be related to the Gram matrix of the bilinear form $\langle \cdot, \cdot \rangle_u$. Formally, for $t = (t_1, \ldots, t_M) \in \mathbb{R}^M$ such that $u + t_i h_i \in \mathcal{X}$ for any $i = 1, \ldots, M$, we have
$$D(u \,|\, u + t_j h_j, u + t_k h_k) = t_j t_k \langle h_j, h_k \rangle_u + o(t_j^2 + t_k^2), \quad j, k = 1, \ldots, M.$$
Based on the codivergences $V_\phi(P_0|P_1, P_2)$ and $R_\phi(P_0|P_1, P_2)$, one can now define corresponding $M \times M$ divergence matrices $V_\phi(P_0|P_1, \ldots, P_M)$ and $R_\phi(P_0|P_1, \ldots, P_M)$ with $(j, k)$-th entries $V_\phi(P_0|P_j, P_k)$ and $R_\phi(P_0|P_j, P_k)$, provided that $P_1, \ldots, P_M \ll P_0$. The two divergence matrices are linked by the relationship
$$R_\phi(P_0 \,|\, P_1, \ldots, P_M) = D\, V_\phi(P_0 \,|\, P_1, \ldots, P_M)\, D, \qquad (15)$$
where $D$ denotes the $M \times M$ diagonal matrix with $j$-th diagonal entry $1/\int \phi\big(\frac{dP_j}{dP_0}\big)\, dP_0$, $j = 1, \ldots, M$. Similarly as $\mathrm{Cov}(X_1, X_2)$ can denote either the covariance between the random variables $X_1$ and $X_2$ or the $2 \times 2$ covariance matrix of the random vector $(X_1, X_2)$, the expressions $V_\phi(P_0|P_1, P_2)$ and $R_\phi(P_0|P_1, P_2)$ can denote either codivergences or $2 \times 2$ divergence matrices. It will always be clear from the context which of the two interpretations is meant.
Let $\Phi(X) := (\phi(\tfrac{dP_1}{dP_0}(X)), \ldots, \phi(\tfrac{dP_M}{dP_0}(X)))^\top$ denote the random vector containing the likelihood ratios of the $M$ measures. Since $\mathrm{Cov}(U, V) = E[UV] - E[U]E[V]$, the $(j, k)$-th entry of $V_\phi(P_0|P_1, \ldots, P_M)$ equals $\mathrm{Cov}_{P_0}\big(\phi(\tfrac{dP_j}{dP_0}(X)), \phi(\tfrac{dP_k}{dP_0}(X))\big)$, where the covariance is computed with respect to the distribution $P_0$, as indicated by the subscript $P_0$. Moreover, we have
$$V_\phi(P_0 \,|\, P_1, \ldots, P_M) = \mathrm{Cov}_{P_0}(\Phi(X)).$$
Applying Equation (15) yields moreover
$$R_\phi(P_0 \,|\, P_1, \ldots, P_M) = D\, \mathrm{Cov}_{P_0}(\Phi(X))\, D = \mathrm{Cov}_{P_0}(D\, \Phi(X)).$$
This shows that $V_\phi(P_0|P_1, \ldots, P_M)$ and $R_\phi(P_0|P_1, \ldots, P_M)$ can be interpreted as covariance matrices and are therefore symmetric and positive semi-definite. Applying the Taylor expansion to the likelihood ratios in the previous identities provides a direct way of recovering the local Gram matrix associated with the nonparametric Fisher information metric.
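The covariance representation suggests a direct way of computing divergence matrices for discrete distributions; here is a minimal sketch (the distributions are our own toy choices):

```python
import numpy as np

# Divergence matrices as covariance matrices of Phi(X) under P0.
p0 = np.array([0.4, 0.35, 0.25])
P = np.array([[0.3, 0.5, 0.2],           # P_1, ..., P_M as rows
              [0.5, 0.25, 0.25],
              [0.45, 0.3, 0.25]])

def divergence_matrices(p0, P, phi):
    F = phi(P / p0)                       # Phi(X) evaluated at each atom
    m = F @ p0                            # E_{P0}[phi(dPj/dP0)]
    V = (F * p0) @ F.T - np.outer(m, m)   # V_phi = Cov_{P0}(Phi(X))
    R = ((F * p0) @ F.T) / np.outer(m, m) - 1
    return V, R

V, R = divergence_matrices(p0, P, phi=lambda x: x)   # chi^2-divergence matrix
print(np.linalg.eigvalsh(V) >= -1e-12)               # positive semi-definite
print(np.allclose(V, R))                             # for phi(x) = x, m = 1 and V = R
```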
In a next step, we state a more specific identity for the $\chi^2$-divergence matrix. To do so, we first extend the usual notion of the $\chi^2$-divergence to the case where the first argument is a signed measure. Let $\mu$ be a finite signed measure and $P$ be a probability measure defined on the same measurable space $(\Omega, \mathcal{A})$. We define the $\chi^2$-divergence of $\mu$ and $P$ by
$$\chi^2(\mu, P) := \begin{cases} \displaystyle \int \Big( \frac{d\mu}{dP} \Big)^2 \, dP - \mu(\Omega)^2, & \text{if } \mu \ll P, \\ +\infty, & \text{otherwise.} \end{cases} \qquad (18)$$
Here, $d\mu/dP$ denotes the Radon-Nikodym derivative of the signed measure $\mu$ with respect to $P$ (defined e.g. in Theorem 4.2.4 in [19]). This definition of $\chi^2(\mu, P)$ generalizes the case where $\mu$ is a probability measure and allows us to rewrite the $\chi^2$-divergence matrix as
$$v^\top \chi^2(P_0 \,|\, P_1, \ldots, P_M)\, v = \chi^2\Big( \sum_{j=1}^M v_j P_j, P_0 \Big) \quad \text{for all } v \in \mathbb{R}^M, \qquad (19)$$
with $\sum_{j=1}^M v_j P_j$ the mixture (signed) measure of $P_1, \ldots, P_M$. Similarly, for the Hellinger divergence matrix it can be checked that
$$v^\top \rho(P_0 \,|\, P_1, \ldots, P_M)\, v = \chi^2\Big( \sum_{j=1}^M v_j \widetilde{P}_j, P_0 \Big),$$
where $\widetilde{P}_j$ denotes the probability measure with $\nu$-density $\sqrt{p_j p_0}/\alpha(P_j, P_0)$.
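A quick numerical sanity check of the quadratic-form identity (19) on a discrete space (distributions and weight vector are our own choices):

```python
import numpy as np

# Check of (19): v^T chi2(P0|P1,...,PM) v = chi2(sum_j v_j P_j, P0), with
# chi2(mu, P) = int (dmu/dP)^2 dP - mu(Omega)^2 for a signed measure mu.
p0 = np.array([0.4, 0.35, 0.25])
P = np.array([[0.3, 0.5, 0.2],
              [0.5, 0.25, 0.25]])
L = P / p0 - 1                           # centered likelihood ratios
V = (L * p0) @ L.T                       # chi^2-divergence matrix
v = np.array([0.7, -1.3])
mu = v @ P                               # mixture signed measure
lhs = v @ V @ v
rhs = np.sum(mu**2 / p0) - mu.sum()**2
print(np.isclose(lhs, rhs))              # True
```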
Writing $\mathrm{Rank}(A)$ for the rank of a matrix $A$ and $\mathrm{Rank}(x_1, \ldots, x_n)$ for the dimension of the linear span of $n$ elements $x_1, \ldots, x_n$ of a vector space $E$, we will now derive an identity for the rank of divergence matrices.

Proposition 3.2. Let $M \geq 1$, and let $P_0, P_1, \ldots, P_M$ be $(M + 1)$ probability distributions.

(i) Assume that $P_1, \ldots, P_M \ll P_0$. Then, for any non-negative function $\phi$,
$$\mathrm{Rank}\big( V_\phi(P_0 \,|\, P_1, \ldots, P_M) \big) = \mathrm{Rank}\Big( \phi\Big(\frac{dP_1}{dP_0}\Big) - \int \phi\Big(\frac{dP_1}{dP_0}\Big) dP_0, \ \ldots, \ \phi\Big(\frac{dP_M}{dP_0}\Big) - \int \phi\Big(\frac{dP_M}{dP_0}\Big) dP_0 \Big),$$
where the functions are considered as elements of the vector space $L^1(A, \mathcal{B}, P_0)$, that is, linear independence is considered $P_0$-almost everywhere.
(ii) Let $\nu$ be a common dominating measure of $P_0, \ldots, P_M$. Assume that $\int \sqrt{p_j p_0} \, d\nu > 0$ for all $j = 1, \ldots, M$, with $p_j := dP_j/d\nu$. Then we have
$$\mathrm{Rank}\big( \rho(P_0 \,|\, P_1, \ldots, P_M) \big) = \mathrm{Rank}\big( \sqrt{p_0}, \sqrt{p_1}, \ldots, \sqrt{p_M} \big) - 1,$$
where the functions are considered as elements of the vector space $L^1(A, \mathcal{B}, \nu)$.
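A small numerical illustration of part (ii), under our reconstruction of the statement (distributions are our own choices):

```python
import numpy as np

p0 = np.array([0.4, 0.35, 0.25])
p1 = np.array([0.3, 0.5, 0.2])
p2 = np.array([0.2, 0.3, 0.5])

def affinity(p, q):
    return np.sum(np.sqrt(p * q))

def hellinger_matrix(p0, Ps):
    return np.array([[affinity(pj, pk) / (affinity(pj, p0) * affinity(pk, p0)) - 1
                      for pk in Ps] for pj in Ps])

# Generic case: sqrt(p0), sqrt(p1), sqrt(p2) linearly independent -> full rank.
rho = hellinger_matrix(p0, [p1, p2])
print(np.linalg.matrix_rank(rho),
      np.linalg.matrix_rank(np.sqrt(np.array([p0, p1, p2]))) - 1)   # 2 2

# Degenerate case: taking P2 = P1 makes the rank drop on both sides.
rho = hellinger_matrix(p0, [p1, p1])
print(np.linalg.matrix_rank(rho),
      np.linalg.matrix_rank(np.sqrt(np.array([p0, p1, p1]))) - 1)   # 1 1
```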

Before proving Proposition 3.2, let us examine when the Hellinger divergence matrix is singular: $\rho(P_0|P_1, \ldots, P_M)$ is singular if and only if the functions $\sqrt{p_0}, \sqrt{p_1}, \ldots, \sqrt{p_M}$ are linearly dependent $\nu$-almost everywhere. This is the case if and only if there are numbers $w_0, \ldots, w_M$, not all equal to zero, satisfying $\sum_{j=0}^M w_j \sqrt{p_j} = 0$, $\nu$-almost everywhere. To verify the more difficult reverse direction of this equivalence, it is enough to observe that $\sum_{j=0}^M w_j \sqrt{p_j} = 0$ implies $w_0 = -\sum_{j=1}^M w_j \int \sqrt{p_j p_0} \, d\nu$ and thus, taking $v_j := w_j \alpha(P_j, P_0)$ for $j = 1, \ldots, M$, one obtains $v^\top \rho(P_0|P_1, \ldots, P_M) v = 0$, so that $\rho(P_0|P_1, \ldots, P_M)$ is singular.

We now show the general case of (ii). For an $n \times n$ matrix $A$ and index sets $I, J \subset \{1, \ldots, n\}$, $A_{I,J}$ denotes the submatrix consisting of the rows in $I$ and the columns in $J$. If $I = J$, $A_{I,I}$ is called a principal submatrix of the matrix $A$. Let $r$ be an integer in $\{1, \ldots, M\}$. By Lemma 6.4, $r = \mathrm{Rank}(\rho(P_0|P_1, \ldots, P_M))$ if and only if $\rho(P_0|P_1, \ldots, P_M)$ has an invertible principal submatrix of size $r$ and all principal submatrices of size $r + 1$ of $\rho(P_0|P_1, \ldots, P_M)$ are singular; if and only if (using the fact that the principal submatrices of $\rho(P_0|P_1, \ldots, P_M)$ of size $k$ are exactly the matrices of the form $\rho(P_0|P_{i_1}, \ldots, P_{i_k})$ for some $i_1, \ldots, i_k \in \{1, \ldots, M\}$)
$$r = \max\big\{ k = 1, \ldots, M: \ \exists\, i_1, \ldots, i_k \in \{1, \ldots, M\}, \ \rho(P_0|P_{i_1}, \ldots, P_{i_k}) \text{ is invertible} \big\};$$
if and only if (using the case of full rank that was proved before) $r = \mathrm{Rank}(\sqrt{p_0}, \sqrt{p_1}, \ldots, \sqrt{p_M}) - 1$.

Data processing inequality for the $\chi^2$-divergence matrix

In a parametric statistical model $(Q_\theta)_{\theta \in \Theta}$, it is assumed that the statistician observes a random variable $X$ following one of the distributions $Q_\theta$ for some $\theta \in \Theta$. If we transform $X$ to obtain a new variable $Y$, then $Y$ follows the distribution $P_\theta := KQ_\theta$ for some Markov kernel $K$. When $\theta$ is unknown but the Markov kernel $K$ is known and independent of $\theta$, this means that the new statistical model is $(P_\theta := KQ_\theta, \theta \in \Theta)$. As in the usual case of the $\chi^2$-divergence, it is natural to expect that such a transformation cannot increase the amount of information present in the model. In our more general framework, such an inequality still holds and is presented in the following data-processing inequality.
Theorem 4.1 (Data processing / entropy contraction). If $K$ is a Markov kernel and $Q_0, \ldots, Q_M$ are probability measures such that $Q_0$ dominates $Q_1, \ldots, Q_M$, then
$$\chi^2(KQ_0 \,|\, KQ_1, \ldots, KQ_M) \leq \chi^2(Q_0 \,|\, Q_1, \ldots, Q_M),$$
where $\leq$ denotes the partial order on the set of positive semi-definite matrices, that is, $A \leq B$ if and only if $B - A$ is positive semi-definite.
In particular, the $\chi^2$-divergence matrix is invariant under invertible transformations. The rest of this section is devoted to the proof of Theorem 4.1. First, we generalize the well-known data-processing inequality for the $\chi^2$-divergence to the extended divergence (18), in which the first argument is a finite signed measure, and afterwards use Equation (19).
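To see the theorem in action, the following sketch checks the positive semi-definite ordering on a finite space with a randomly drawn Markov kernel (all distributions and the kernel are our own toys):

```python
import numpy as np

# Numerical check of Theorem 4.1: applying a Markov kernel K (here a
# column-stochastic matrix) can only shrink the chi^2-divergence matrix
# in the positive semi-definite order.
rng = np.random.default_rng(1)
n_in, n_out = 4, 3

K = rng.random((n_out, n_in)); K /= K.sum(axis=0)       # Markov kernel
q0 = rng.random(n_in); q0 /= q0.sum()
Qs = [np.full(n_in, 1.0 / n_in), rng.dirichlet(np.ones(n_in))]

def chi2_matrix(p0, Ps):
    L = np.array([p / p0 - 1 for p in Ps])              # centered likelihood ratios
    return (L * p0) @ L.T

before = chi2_matrix(q0, Qs)
after = chi2_matrix(K @ q0, [K @ q for q in Qs])
print(np.linalg.eigvalsh(before - after) >= -1e-12)     # all True: after <= before
```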
The $\chi^2(\mu, P)$-divergence with a signed measure can be computed from the usual $\chi^2$-divergence between probability measures by the following relationship.

Lemma 4.2. Assume that $\mu \ll P$. Let $\mu = \alpha_+ \mu_+ - \alpha_- \mu_-$ be the Jordan decomposition (3) of $\mu$ with $\alpha_+, \alpha_- \geq 0$ and $\mu_+, \mu_-$ orthogonal probability measures. Then
$$\chi^2(\mu, P) = \alpha_+^2\, \chi^2(\mu_+, P) + \alpha_-^2\, \chi^2(\mu_-, P) + 2\alpha_+ \alpha_-.$$

Lemma 4.3. If $\mu$ is a finite signed measure, $P$ is a probability measure and both measures are defined on the same measurable space, then, for any Markov kernel $K$, the data-processing inequality
$$\chi^2(K\mu, KP) \leq \chi^2(\mu, P)$$
holds.
Proof. We can assume that $\mu \ll P$, since otherwise the right-hand side of the inequality is $+\infty$ and the result holds. In particular, $\mu \ll \nu$ for a positive measure $\nu$ implies that $K\mu \ll K\nu$. Indeed, if $K\nu(A) = 0$ for a given measurable set $A$, then $\int K(A, x) \, d\nu(x) = 0$, implying $K(A, \cdot) = 0$ $\nu$-almost everywhere. Since $\mu \ll \nu$, the equality also holds $\mu$-almost everywhere and so $K\mu(A) = \int K(A, x) \, d\mu(x) = 0$, proving $K\mu \ll K\nu$. By the Jordan decomposition (3), there exist orthogonal probability measures $\mu_+, \mu_-$ and non-negative real numbers $\alpha_+, \alpha_-$ such that $\mu = \alpha_+ \mu_+ - \alpha_- \mu_-$, and hence $K\mu = \alpha_+ K\mu_+ - \alpha_- K\mu_-$. Because $\mu_+$ and $\mu_-$ are orthogonal, Lemma 4.2 gives $\chi^2(\mu, P) = \alpha_+^2 \chi^2(\mu_+, P) + \alpha_-^2 \chi^2(\mu_-, P) + 2\alpha_+ \alpha_-$; since $K\mu_+$ and $K\mu_-$ need not be orthogonal, we similarly find that
$$\chi^2(K\mu, KP) \leq \alpha_+^2\, \chi^2(K\mu_+, KP) + \alpha_-^2\, \chi^2(K\mu_-, KP) + 2\alpha_+ \alpha_-.$$
Using the data-processing inequality for the $\chi^2$-divergence of probability measures twice,
$$\chi^2(K\mu, KP) \leq \alpha_+^2\, \chi^2(\mu_+, P) + \alpha_-^2\, \chi^2(\mu_-, P) + 2\alpha_+ \alpha_- = \chi^2(\mu, P).$$

We can now complete the proof of Theorem 4.1. For any $v \in \mathbb{R}^M$, Equation (19) and the previous lemma yield
$$v^\top \chi^2(KQ_0 \,|\, KQ_1, \ldots, KQ_M)\, v = \chi^2\Big( K \sum_{j=1}^M v_j Q_j, KQ_0 \Big) \leq \chi^2\Big( \sum_{j=1}^M v_j Q_j, Q_0 \Big) = v^\top \chi^2(Q_0 \,|\, Q_1, \ldots, Q_M)\, v.$$
Since $v$ was arbitrary, this completes the proof.
By the definition of a Markov kernel $K$, for every fixed $x$, $A \mapsto K(A, x)$ is a probability measure. We now provide a simpler and more direct proof of Theorem 4.1, without using Lemma 4.3, under the additional common domination assumption: there exists a measure $\mu$ such that
$$\forall x \in \Omega, \quad K(x, \cdot\,) \ll \mu. \qquad (21)$$
Simpler proof of Theorem 4.1 under the additional assumption (21). Because of the identity (19), it is enough to show that, for any $v \in \mathbb{R}^M$,
$$\chi^2\Big( K \sum_{j=1}^M v_j Q_j, KQ_0 \Big) \leq \chi^2\Big( \sum_{j=1}^M v_j Q_j, Q_0 \Big). \qquad (22)$$
Let $\nu$ be a dominating measure for $Q_0, \ldots, Q_M$ and recall that, by the additional assumption (21), the measure $\mu$ dominates the probability measure $A \mapsto K(A, x)$ for any $x$. Write $q_j$ for the $\nu$-density of $Q_j$. Then $dKQ_j(y) = \big( \int_{\mathcal{X}} k(y, x) q_j(x) \, d\nu(x) \big) d\mu(y)$ for $j = 1, \ldots, M$ and a suitable non-negative kernel function $k$ satisfying $\int k(y, x) \, d\mu(y) = 1$ for all $x$. Applying the Cauchy-Schwarz inequality, we obtain, for every $y$,
$$\Big( \int k(y, x) \sum_{j=1}^M v_j \big( q_j(x) - q_0(x) \big) \, d\nu(x) \Big)^2 \leq \Big( \int k(y, x)\, q_0(x) \, d\nu(x) \Big) \Big( \int k(y, x)\, \frac{\big( \sum_{j=1}^M v_j (q_j(x) - q_0(x)) \big)^2}{q_0(x)} \, d\nu(x) \Big).$$
Inserting this in (22), rewriting $dKQ_0(y) = \big( \int_{\mathcal{X}} k(y, x) q_0(x) \, d\nu(x) \big) d\mu(y)$, interchanging the order of integration using Fubini's theorem, and applying $\int k(y, x) \, d\mu(y) = 1$, yields
$$\chi^2\Big( K \sum_{j=1}^M v_j Q_j, KQ_0 \Big) \leq \int \frac{\big( \sum_{j=1}^M v_j (q_j(x) - q_0(x)) \big)^2}{q_0(x)} \, d\nu(x) = v^\top \chi^2(Q_0 \,|\, Q_1, \ldots, Q_M)\, v.$$

Derivations for explicit expressions for the $R_\alpha$ codivergence

In this section we derive closed-form expressions for the $R_\alpha$ codivergences in Table 1. We also obtain a closed-form formula for the case of Gamma distributions and discuss a first-order approximation of it.
Proof. The Gamma distributions $\Gamma(a, \beta)$, $a > 0$, $\beta > 0$, form an exponential family dominated by the Lebesgue measure, with density
$$p_{a, \beta}(x) = \frac{\beta^a}{\Gamma(a)}\, x^{a-1} e^{-\beta x}\, \mathbf{1}(x > 0),$$
which is of the form $h(x) \exp(\theta^\top T(x) - A(\theta))$ with $\theta = (a - 1, -\beta)$, $T(x) = (\log x, x)$ and $A(\theta) = \log \Gamma(a) - a \log \beta$. Therefore, we can apply Proposition 2.7 with $\Theta = (-1, +\infty) \times (-\infty, 0)$. Combining the assumed constraints on the parameters and the linearity of the mapping $(a, \beta) \mapsto \theta$ ensures that the points appearing in (12) belong to $\Theta$. Combining all these results, we obtain the closed-form expression for the Gamma distribution reported in Table 1. A formula for the product of exponential distributions can be obtained as a special case by setting $a_{j\ell} = 1$ for all $j, \ell$ and applying Proposition 2.6. For the families of distributions discussed above, the formulas for the correlation-type $R_\alpha$ codivergences encode an orthogonality relation on the parameter vectors. This is less visible in the expressions for the Gamma distribution, but can be made more explicit using the first-order approximation that we state next. It shows that even for the Gamma distribution these matrix entries can be written in leading order as a term involving a weighted inner product of $\beta_1 - \beta_0$ and $\beta_2 - \beta_0$, where $\beta_r$ denotes the vector $(\beta_{r\ell})_{1 \leq \ell \leq d}$.
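To illustrate, here is a sketch of the resulting formula in the special case of exponential distributions $E(\beta) = \Gamma(1, \beta)$ in dimension $d = 1$ (our own computation via Proposition 2.7, with $\theta = -\beta$ and $A(\theta) = -\log(-\theta)$; compare the corresponding entry of Table 1):

```latex
1 + R_\alpha\big( E(\beta_0) \,\big|\, E(\beta_1), E(\beta_2) \big)
  = \frac{\big( (1-\alpha)\beta_0 + \alpha\beta_1 \big)\big( (1-\alpha)\beta_0 + \alpha\beta_2 \big)}
         {\beta_0 \big( (1-2\alpha)\beta_0 + \alpha\beta_1 + \alpha\beta_2 \big)},
```

valid whenever all the factors are positive (and $R_\alpha = +\infty$ otherwise); the $d$-dimensional product case then follows from Proposition 2.6.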
Proof. Using that $a_\ell$ does not depend on $j$, the expression simplifies, and a second-order Taylor expansion of the logarithm (the sum of the first-order terms vanishes) yields the result.

Facts about ranks

Definition 6.1. Let $X_1, \ldots, X_n$ be $n$ random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$. We define the rank of $\{X_1, \ldots, X_n\}$, denoted by $\mathrm{Rank}(X_1, \ldots, X_n)$, as the dimension of the vector space $\mathrm{Vect}(X_1, \ldots, X_n)$ generated by linear combinations of $\{X_1, \ldots, X_n\}$, where equality of random variables is to be understood $P$-almost surely. Moreover, we say that $(X_1, \ldots, X_n)$ are linearly independent $P$-almost surely if, for any non-zero vector $(a_1, \ldots, a_n)$, the linear combination $\sum_{i=1}^n a_i X_i$ is non-zero with positive $P$-probability.

Lemma 6.2. Let $X_1, \ldots, X_n$ be $n$ random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$. Then $\mathrm{Rank}(X_1, \ldots, X_n)$ is the largest integer $r$ such that there exist $i_1, \ldots, i_r \in \{1, \ldots, n\}$ with $(X_{i_1}, \ldots, X_{i_r})$ linearly independent $P$-almost surely.
Lemma 6.3. Let $Z = (Z_1, \ldots, Z_M)^\top$ be an $M$-dimensional random vector with mean zero and finite second moments. Then $\mathrm{Rank}(\mathrm{Cov}_P(Z)) = \mathrm{Rank}(Z_1, \ldots, Z_M)$, where the covariance matrix is computed with respect to the distribution $P$ and the rank of a set of random variables is to be understood in the sense of Definition 6.1.

Proof. Let $r := \mathrm{Rank}(\mathrm{Cov}_P(Z))$, write $\mathrm{Cov}_P(Z) = U \Lambda U^\top$ for a spectral decomposition with orthogonal $U$, and set $Y := U^\top Z$, with coordinates ordered such that $\mathrm{Var}[Y_i] > 0$ for $i \leq r$ and $\mathrm{Var}[Y_i] = 0$ for $i > r$. Then
$$\mathrm{Rank}(Z_1, \ldots, Z_M) = \dim \mathrm{Vect}(e_1^\top U Y, \ldots, e_M^\top U Y) = \dim \mathrm{Vect}(Y_1, \ldots, Y_M) = \dim \mathrm{Vect}(Y_1, \ldots, Y_r) = r,$$
where the first equality is the definition of the rank, the second equality is a consequence of the fact that $(e_1, \ldots, e_M)$ is a basis of $\mathbb{R}^M$ and $U$ is invertible, the third equality results from the fact that $\mathrm{Var}[Y_i] = 0$ and $E[Y_i] = 0$ for any $i > r$, and the last equality is a consequence of the orthogonality of $(Y_1, \ldots, Y_r)$ as elements of the Hilbert space $L^2(\Omega, \mathcal{A}, P)$. The proof is completed since by definition $r = \mathrm{Rank}(\mathrm{Cov}_P(Z))$.

Lemma 6.4 (see for example Exercise 3.3.11 in [25]). A symmetric and positive semi-definite $M \times M$ matrix $A$ has rank $r$ if and only if $A$ has an invertible principal submatrix of size $r$, and all principal submatrices of size $r + 1$ of $A$ are singular.

Conclusion
We introduced the concept of codivergence as a notion of "angle" between three probability distributions. Divergence matrices can be viewed as an analogue of the Gram matrix for a finite sequence of probability distributions that are compared relative to one distribution.
Locally around the reference probability measure $P_0$, codivergences are bilinear forms up to remainder terms. Two classes of codivergences emerge that resemble the structure of the covariance and the correlation.
Natural follow-up questions relate to the spectral behavior of a divergence matrix and the link between properties of the divergence matrix and properties of the underlying probability measures.

A.2 Proof of Proposition 2.6
Proof. By Fubini's theorem, the integrals defining $1 + R_\alpha$ factorize over the coordinates $\ell = 1, \ldots, d$, and the claim follows.