Sup-Sums Principles for F-Divergence and a New Definition for t-Entropy

The article presents new sup-sums principles for integral F-divergence for arbitrary convex functions F on the whole real axis and arbitrary (not necessarily positive and normalized) measures. Among applications of these results, we work out a new 'integral' definition for t-entropy explicitly establishing its relation to Kullback-Leibler divergence.


Introduction
The notion of F-divergence was introduced and originally studied in the analysis of probability distributions in [2,14,19]. It is defined in the following way. Let $Q$ and $P$ be two probability distributions over a space such that $Q$ is absolutely continuous with respect to $P$. Then, for a convex function $F : \mathbb{R}_+ \to \mathbb{R}$ such that $F(1) = 0$, the F-divergence $D_F(Q\,\|\,P)$ of $Q$ from $P$ is defined as
$$D_F(Q\,\|\,P) = \int F\!\left(\frac{dQ}{dP}\right) dP, \tag{1}$$
where $dQ/dP$ is the Radon-Nikodym derivative of $Q$ with respect to $P$. Since its introduction, the F-divergence has been intensively exploited and analysed, because appropriate choices of the function $F$ lead to numerous important divergences such as the Kullback-Leibler divergence, the Hellinger distance, the Pearson $\chi^2$-divergence, etc.
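On a finite sample space, definition (1) reduces to a finite sum, which makes it easy to experiment with. The following is a minimal sketch in Python, assuming strictly positive reference weights; the helper names `f_divergence` and `kl_generator` are illustrative and are not taken from the article.

```python
import math

def f_divergence(F, q, p):
    """Discrete form of (1): sum_i p_i * F(q_i / p_i).

    Assumes every p_i > 0, so that Q is absolutely continuous w.r.t. P.
    """
    return sum(pi * F(qi / pi) for qi, pi in zip(q, p))

def kl_generator(t):
    """F(t) = t ln t (with 0 ln 0 = 0); convex on [0, +inf) and F(1) = 0."""
    return t * math.log(t) if t > 0 else 0.0

# Illustrative probability vectors (assumed data, not from the article).
q = [0.5, 0.3, 0.2]
p = [0.25, 0.25, 0.5]

kl_via_F = f_divergence(kl_generator, q, p)
kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
print(kl_via_F, kl_direct)  # the two values agree up to rounding
```

With this choice of $F$ the sum coincides with the usual expression $\sum_i q_i \ln(q_i/p_i)$ for the Kullback-Leibler divergence of $Q$ from $P$.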
A comprehensive analysis of F-divergence was worked out by Liese and Vajda in [16] where a sup-sums principle for space partitions was established as well [16,Theorem 16].
In fact, formula (1) can be extended to arbitrary real-valued measures $Q$. Moreover, for such measures $Q$, the value $D_F(Q\,\|\,P)$ possesses a substantial statistical meaning. In [13,20,23], it was shown that the value $e^{-n D_F(Q\,\|\,P)}$ determines the asymptotics of conditional probabilities of large deviations for a certain family of weighted empirical measures that are close to $Q$, where $F$ is the rate function for large deviations of the sequence of random weights. In [12], F-divergences for real-valued measures $Q$ were applied to parametric estimation and testing.
The object of the present article is a general F-divergence associated with an arbitrary convex function $F$ that is defined on the whole real axis and can take infinite values, and with arbitrary real-valued (not necessarily positive, normalized, and absolutely continuous) measures. For this F-divergence, we derive a number of new sup-sums principles exploiting both measurable and continuous partitions of unity (Theorems 10, 12, and 14 of the article). In particular, they disclose the passage procedure from the F-divergence on a finite phase space to the F-divergence on an arbitrary measurable space. This passage involves additional components $F'(\pm\infty)$.
On the basis of the sup-sums principles obtained, we derive the corresponding sup-sums principle for Kullback-Leibler divergence (Theorem 15), which also leads naturally to its new definition for measures that are not probability ones. The initial variant of the sup-sums principle for a particular case of F-divergence (mutual information) for mutually absolutely continuous probability measures was established by Gelfand, Kolmogorov, and Yaglom in [15].
As one more substantial application of integral sup-sums principles, we obtain a new formula for t-entropy. The t-entropy plays a fundamental role in the spectral analysis of operators associated with dynamical systems (cf. Theorem 20) and is a key ingredient in the 'entropy statistic theorem'. In the spectral theory of weighted shift and transfer operators, the latter statement plays a role analogous to that of the Shannon-McMillan-Breiman theorem in information theory [1,18] and its important corollary known as the 'asymptotic equipartition property' [10, p. 135]. Up to now, the definition of t-entropy has been formulated in a rather sophisticated manner in terms of actions of transfer operators on continuous partitions of unity (for more details, see Sect. 5). In Theorem 21, we give a fundamentally new 'integral' definition of t-entropy explicitly establishing its relation to Kullback-Leibler divergence.

Sup-Sums F-Divergence
Consider an arbitrary convex function $F : \mathbb{R} \to (-\infty, +\infty]$ and put
$$F'(\pm\infty) := \lim_{t \to \pm\infty} \frac{F(t)}{t}.$$
Obviously, both limits do exist, and the value of $F'(+\infty)$ may be finite or equal to $+\infty$ while $F'(-\infty)$ may be finite or equal to $-\infty$. We adopt the following agreement. The product $0F(x/0)$ for $x \ne 0$ will be defined as the limit $\lim_{t \to +0} tF(x/t)$, and for $x = 0$, it will be assumed to be zero. In other words,
$$0F\!\left(\frac{x}{0}\right) = \begin{cases} xF'(+\infty), & x > 0,\\ 0, & x = 0,\\ xF'(-\infty), & x < 0. \end{cases} \tag{3}$$
Let $\mu$ be a finite nonnegative measure and let $\nu$ be a finite real-valued measure, both defined on the same measurable space $(X, \mathcal{A})$. For measurable functions $g$ on $(X, \mathcal{A})$, we write
$$\mu[g] = \int_X g\,d\mu, \qquad \nu[g] = \int_X g\,d\nu.$$
By a measurable partition of unity, we will understand a finite set $G = \{g_1, \dots, g_k\}$ of nonnegative measurable functions on $(X, \mathcal{A})$ such that $\sum_i g_i \equiv 1$. Now we introduce the main object of the paper. For any convex function $F : \mathbb{R} \to (-\infty, +\infty]$, the sup-sums F-divergence $\rho_F(\mu, \nu)$ is defined as
$$\rho_F(\mu, \nu) = \sup_G \sum_{g \in G} \mu[g]\,F\!\left(\frac{\nu[g]}{\mu[g]}\right), \tag{4}$$
where the supremum is taken over the set of all measurable partitions of unity $G$, and we assume that if $\mu[g] = 0$, then the corresponding summand in the right-hand part is defined according to convention (3). The relation of the sup-sums F-divergence to the usual (integral) F-divergence will be uncovered in the next section.
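To make convention (3) and the sum in (4) concrete, here is a small Python sketch on a three-point space. It is only an illustration under stated assumptions: the slopes $F'(\pm\infty)$ are passed in by hand rather than computed, and the names `perspective` and `partition_sum` as well as the sample measures are hypothetical.

```python
def perspective(F, s, x, slope_plus, slope_minus):
    """s * F(x / s) for s >= 0, using convention (3) when s == 0.

    slope_plus and slope_minus stand for F'(+inf) and F'(-inf); they are
    supplied by the caller and may be infinite.
    """
    if s > 0:
        return s * F(x / s)
    if x > 0:
        return x * slope_plus
    if x < 0:
        return x * slope_minus
    return 0.0

def partition_sum(F, mu, nu, G, slope_plus, slope_minus):
    """The sum in (4) for one partition of unity G on a finite set.

    mu and nu are lists of point masses; each g in G is the list of values g(i).
    """
    total = 0.0
    for g in G:
        mu_g = sum(gi * mi for gi, mi in zip(g, mu))
        nu_g = sum(gi * ni for gi, ni in zip(g, nu))
        total += perspective(F, mu_g, nu_g, slope_plus, slope_minus)
    return total

F = lambda t: abs(t - 1.0)   # for this F: F'(+inf) = 1 and F'(-inf) = -1
mu = [0.5, 0.5, 0.0]         # the third point is mu-null ...
nu = [0.2, 0.5, 0.3]         # ... but carries nu-mass (singular part)

trivial = [[1.0, 1.0, 1.0]]
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

coarse = partition_sum(F, mu, nu, trivial, 1.0, -1.0)
fine = partition_sum(F, mu, nu, points, 1.0, -1.0)
print(coarse, fine)  # the finer partition gives the larger sum
```

The trivial partition sees only the total masses, while the partition into point indicators also picks up the contribution $0F(0.3/0) = 0.3\,F'(+\infty)$ of the $\mu$-null point.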
The following two lemmas present important properties of the function s F(x/s) used in the definition (4).

Lemma 1
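For any convex function $F$ and all $s, t \ge 0$ and $x, y \in \mathbb{R}$,
$$(s+t)\,F\!\left(\frac{x+y}{s+t}\right) \le s\,F\!\left(\frac{x}{s}\right) + t\,F\!\left(\frac{y}{t}\right). \tag{5}$$
Each convex function $F$ on the real axis is superlinear, i.e.,
$$F(t) \ge At + B \tag{6}$$
for some constants $A, B \in \mathbb{R}$ and all $t \in \mathbb{R}$.
Lemma 2 If a convex function $F$ satisfies condition (6), then for all $s \ge 0$ and $x \in \mathbb{R}$,
$$s\,F\!\left(\frac{x}{s}\right) \ge Ax + Bs.$$
Lemma 3 The value of $\rho_F(\mu, \nu)$ does not change if countable measurable partitions of unity are used in definition (4) instead of finite ones.
Now we proceed to the description of the technical properties of $\rho_F(\mu, \nu)$.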
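For any measure $\nu$ and a bounded measurable function $f$ on a measurable space $(X, \mathcal{A})$, we define a real-valued measure $f\nu$ by the formula
$$(f\nu)(E) = \int_E f\,d\nu, \qquad E \in \mathcal{A}.$$
Proposition 5 Let $\mu, \nu$ be finite measures, where $\mu$ is nonnegative and $\nu$ is real-valued, and let $f_1, f_2$ be nonnegative bounded measurable functions on $(X, \mathcal{A})$. Then,
$$\rho_F\big((f_1+f_2)\mu, (f_1+f_2)\nu\big) = \rho_F(f_1\mu, f_1\nu) + \rho_F(f_2\mu, f_2\nu).$$

This means that the function $f \mapsto \rho_F(f\mu, f\nu)$ is additive on the set of nonnegative bounded measurable functions.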
Generally, a real-valued measure $\nu$ is decomposed into three components
$$\nu = \nu_a + \nu_s^+ + \nu_s^-, \tag{9}$$
where $\nu_a$ is absolutely continuous, $\nu_s^+$ is positive and singular, and $\nu_s^-$ is negative and singular (with respect to $\mu$).
The next result describes the corresponding decomposition of ρ F (μ, ν).

Theorem 6
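Let $\mu, \nu$ be finite measures on a measurable space $(X, \mathcal{A})$, where $\mu$ is nonnegative and $\nu$ is real-valued. Then,
$$\rho_F(\mu, \nu) = \rho_F(\mu, \nu_a) + \rho_F(0, \nu_s^+) + \rho_F(0, \nu_s^-), \tag{10}$$
where each term may be finite or equal to $+\infty$, and
$$\rho_F(0, \nu_s^+) = \nu_s^+(X)\,F'(+\infty), \tag{11}$$
$$\rho_F(0, \nu_s^-) = \nu_s^-(X)\,F'(-\infty). \tag{12}$$
Here, we assume that if $\nu_s^+ = 0$ or $\nu_s^- = 0$, then the corresponding product in the right-hand part of (11) or (12) is zero regardless of the value of $F'(\pm\infty)$.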
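The decomposition in Theorem 6 can be checked by hand on a finite set. The sketch below is an illustration only: it assumes that on a finite set the supremum in (4) is attained on the partition into indicators of points (which is what Lemma 1 suggests), and the function name `rho_finite` and the sample measures are hypothetical.

```python
def rho_finite(F, mu, nu, slope_plus, slope_minus):
    """Sum of mu_i * F(nu_i / mu_i) over the points of a finite set.

    Points with mu_i == 0 are handled by convention (3); slope_plus and
    slope_minus play the role of F'(+inf) and F'(-inf).
    """
    total = 0.0
    for m, n in zip(mu, nu):
        if m > 0:
            total += m * F(n / m)
        elif n > 0:
            total += n * slope_plus
        elif n < 0:
            total += n * slope_minus
    return total

F = lambda t: abs(t - 1.0)          # F'(+inf) = 1, F'(-inf) = -1
slope_plus, slope_minus = 1.0, -1.0

mu = [0.5, 0.5, 0.0, 0.0]
nu = [0.2, 0.4, 0.3, -0.1]

# Decomposition (9) with respect to mu on a finite set.
nu_a = [n if m > 0 else 0.0 for m, n in zip(mu, nu)]
nu_s_plus = [n if m == 0 and n > 0 else 0.0 for m, n in zip(mu, nu)]
nu_s_minus = [n if m == 0 and n < 0 else 0.0 for m, n in zip(mu, nu)]

lhs = rho_finite(F, mu, nu, slope_plus, slope_minus)
rhs = (rho_finite(F, mu, nu_a, slope_plus, slope_minus)
       + sum(nu_s_plus) * slope_plus
       + sum(nu_s_minus) * slope_minus)
print(lhs, rhs)  # both sides of the decomposition agree up to rounding
```

Here the absolutely continuous part contributes $0.4$, while the singular parts contribute $0.3\,F'(+\infty)$ and $(-0.1)\,F'(-\infty)$.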
There are quite a number of objects in analysis where one has to exploit not measurable partitions of unity but continuous ones (one of them will be considered in Sect. 5). To discuss this setting in our context, we need the next definition.
Let $X$ be a topological space and let $\mu, \nu$ be finite Borel measures, where $\mu$ is nonnegative and $\nu$ is real-valued. For any convex function $F : \mathbb{R} \to (-\infty, +\infty]$, put
$$\rho_{F,c}(\mu, \nu) = \sup_G \sum_{g \in G} \mu[g]\,F\!\left(\frac{\nu[g]}{\mu[g]}\right),$$
where the supremum is taken over the set of all (finite) continuous partitions of unity $G$, and we assume that if $\mu[g] = 0$, then the corresponding summand in the right-hand part is defined according to convention (3).

Theorem 7
Let $\mu, \nu$ be finite Borel measures on a metric space $X$, where $\mu$ is nonnegative and $\nu$ is real-valued. Then, for any convex lower semicontinuous function $F$,
$$\rho_{F,c}(\mu, \nu) = \rho_F(\mu, \nu).$$

Remark 8
In fact, instead of metrizability of $X$ in Theorem 7, it suffices to require that the set of continuous functions $C(X)$ be dense in the space $L^1(X, \mu + |\nu|)$ (which is always true for a metrizable space $X$ or, as a variant, for regular measures $\mu, \nu$).
Now, let us prove the above formulated results.
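Proof of Lemma 1 If $s, t > 0$, then by convexity of $F$,
$$(s+t)\,F\!\left(\frac{x+y}{s+t}\right) = (s+t)\,F\!\left(\frac{s}{s+t}\cdot\frac{x}{s} + \frac{t}{s+t}\cdot\frac{y}{t}\right) \le s\,F\!\left(\frac{x}{s}\right) + t\,F\!\left(\frac{y}{t}\right).$$
Consider the case when $s > 0$ and $t = 0$. If $y = 0$, then (5) turns into the equality $sF(x/s) = sF(x/s)$.
Suppose now that $y > 0$. If at least one summand in the right-hand part of (5) is infinite, then (5) holds true. If both summands $sF(x/s)$ and $0F(y/0) = yF'(+\infty)$ are finite, then the function $F$ must be finite and continuous on the whole interval $(x/s, +\infty)$. Hence, in (5), one can pass to the limit as $t \to +0$ and obtain the desired inequality
$$s\,F\!\left(\frac{x+y}{s}\right) \le s\,F\!\left(\frac{x}{s}\right) + y\,F'(+\infty).$$
The case $y < 0$ is treated similarly. It remains to analyse the case $s = t = 0$ and $x, y \ne 0$. If $x$ and $y$ have the same sign (say $x, y > 0$), then (5) turns into the equality $(x+y)F'(+\infty) = xF'(+\infty) + yF'(+\infty)$. If $x$ and $y$ have opposite signs, then (5) follows from the inequality $F'(-\infty) \le F'(+\infty)$, which holds by convexity of $F$ (see (14)). Therefore, in any case, inequality (5) is valid. Thus, Lemma 1 is proved in all cases.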

Proof of Lemma 3
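Consider a countable partition of unity $G = \{g_1, g_2, \dots\}$. First, we prove that in this case, the sum in (4) is well-defined, i.e., that the limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} \mu[g_i]\,F\!\left(\frac{\nu[g_i]}{\mu[g_i]}\right) \tag{15}$$
does exist, being either finite or equal to $+\infty$. Set $h_n = \sum_{i \ge n} g_i$. Then, by Levi's monotone convergence theorem,
$$\mu[h_n] \to 0, \qquad |\nu|[h_n] \to 0 \quad (n \to \infty), \tag{16}$$
where $|\nu|$ denotes the total variation of $\nu$. Lemma 2 implies that
$$\sum_{i=n}^{m} \mu[g_i]\,F\!\left(\frac{\nu[g_i]}{\mu[g_i]}\right) \ge \sum_{i=n}^{m} \big(A\nu[g_i] + B\mu[g_i]\big) \ge -|A|\,|\nu|[h_n] - |B|\,\mu[h_n]. \tag{17}$$
It follows from (16) and (17) that for any $\varepsilon > 0$, there exists $N$ such that for all $n > N$ and $m \ge n$,
$$\sum_{i=n}^{m} \mu[g_i]\,F\!\left(\frac{\nu[g_i]}{\mu[g_i]}\right) > -\varepsilon. \tag{18}$$
Now, we have two possibilities: if for any $\varepsilon > 0$ there exists $N$ such that for all $n > N$ and $m \ge n$,
$$\sum_{i=n}^{m} \mu[g_i]\,F\!\left(\frac{\nu[g_i]}{\mu[g_i]}\right) < \varepsilon, \tag{19}$$
then limit (15) exists and is finite by the Cauchy criterion; otherwise, (18) implies that limit (15) is equal to $+\infty$. Thus, countable partitions of unity can be used in (4). Each finite partition of unity $G$ in (4) may be transformed into a countable one by adding countably many zero elements, so the transition from finite to countable partitions cannot decrease the value of $\rho_F(\mu, \nu)$. Thus, it suffices to prove that it cannot increase this value either.
Let $\rho_F(\mu, \nu)$ be defined by (4) using countable partitions $G$. Then, for any $c < \rho_F(\mu, \nu)$, there exists a countable partition of unity $G = \{g_1, g_2, \dots\}$ such that
$$\sum_{i=1}^{\infty} \mu[g_i]\,F\!\left(\frac{\nu[g_i]}{\mu[g_i]}\right) > c. \tag{20}$$
Set $h_n = \sum_{i \ge n} g_i$. Combining Lemma 2 and (16), we obtain
$$\liminf_{n \to \infty} \mu[h_n]\,F\!\left(\frac{\nu[h_n]}{\mu[h_n]}\right) \ge \lim_{n \to \infty} \big(A\nu[h_n] + B\mu[h_n]\big) = 0. \tag{21}$$
Consider a finite partition of unity $G_n = \{g_1, \dots, g_{n-1}, h_n\}$. Now, (20) and (21) imply
$$\liminf_{n \to \infty} \sum_{g \in G_n} \mu[g]\,F\!\left(\frac{\nu[g]}{\mu[g]}\right) > c.$$
Then, (20) is valid for some $G_n$ instead of $G$, which along with arbitrariness of the constant $c < \rho_F(\mu, \nu)$ implies the statement of Lemma 3.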

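Proof of Proposition 4
If $g$ is an element of a measurable partition of unity $G$, then by Lemma 1, for the pairs of measures $(\mu_1, \nu_1)$ and $(\mu_2, \nu_2)$ under consideration,
$$(\mu_1+\mu_2)[g]\,F\!\left(\frac{(\nu_1+\nu_2)[g]}{(\mu_1+\mu_2)[g]}\right) \le \mu_1[g]\,F\!\left(\frac{\nu_1[g]}{\mu_1[g]}\right) + \mu_2[g]\,F\!\left(\frac{\nu_2[g]}{\mu_2[g]}\right).$$
Summing this up over $g \in G$ and passing to suprema gives (8).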

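Proof of Proposition 5 From Proposition 4, it follows that
$$\rho_F\big((f_1+f_2)\mu, (f_1+f_2)\nu\big) \le \rho_F(f_1\mu, f_1\nu) + \rho_F(f_2\mu, f_2\nu).$$
So, it suffices to prove the inverse inequality. By definition, for any $c_i < \rho_F(f_i\mu, f_i\nu)$, $i = 1, 2$, there exist measurable partitions of unity $G_1$, $G_2$ such that
$$\sum_{g \in G_i} (f_i\mu)[g]\,F\!\left(\frac{(f_i\nu)[g]}{(f_i\mu)[g]}\right) > c_i, \qquad i = 1, 2. \tag{22}$$
For each $g \in G_i$ let us define the function
$$h_g = \frac{f_i\,g}{f_1+f_2}$$
(on the set where $f_1 + f_2 = 0$, on which the measures $(f_1+f_2)\mu$ and $(f_1+f_2)\nu$ vanish, one can put $h_g = g/2$). Evidently, the collection $H = \{h_g \mid g \in G_1 \cup G_2\}$ forms a measurable partition of unity. Note that for each $g \in G_i$, we have the equalities
$$\big((f_1+f_2)\mu\big)[h_g] = (f_i\mu)[g], \qquad \big((f_1+f_2)\nu\big)[h_g] = (f_i\nu)[g]. \tag{23}$$
From (22), (23) it follows that
$$\rho_F\big((f_1+f_2)\mu, (f_1+f_2)\nu\big) > c_1 + c_2$$
and, by arbitrariness of $c_1$ and $c_2$, the inverse inequality is proved.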

Proof of Theorem 6
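The space $X$ can be decomposed into three disjoint measurable parts, say $X = X_a \sqcup X_s^+ \sqcup X_s^-$, such that the measures $\mu$ and $\nu_a$ are supported on $X_a$ while $\nu_s^+$, $\nu_s^-$ are, respectively, supported on $X_s^+$, $X_s^-$. Denote by $f_a$, $f_s^+$, $f_s^-$ the characteristic functions of these disjoint parts. Then,
$$f_a\mu = \mu, \quad f_a\nu = \nu_a, \qquad f_s^\pm\mu = 0, \quad f_s^\pm\nu = \nu_s^\pm,$$
and hence (10) follows from Proposition 5. Proofs of equalities (11) and (12) are similar. For example,
$$\rho_F(0, \nu_s^+) = \sup_G \sum_{g \in G} 0F\!\left(\frac{\nu_s^+[g]}{0}\right) = \sup_G \sum_{g \in G} \nu_s^+[g]\,F'(+\infty) = \nu_s^+(X)\,F'(+\infty).$$
To prove Theorem 7, we need the next
Lemma 9 Let $\mu$ be a positive finite Borel measure on a topological space $X$ such that $C(X)$ is dense in $L^1(X, \mu)$. Then, for any measurable partition of unity $G = \{g_1, \dots, g_n\}$ on $X$ and any $\varepsilon > 0$, there exists a continuous partition of unity $H = \{h_1, \dots, h_n\}$ on $X$ such that $\|h_i - g_i\| < \varepsilon$ in $L^1(X, \mu)$ for all $i = 1, \dots, n$.
Proof Choose a small $\delta > 0$ and approximate each $g_i$ by a continuous function $f_i$ satisfying $\|f_i - g_i\| < \delta$ in the space $L^1(X, \mu)$. Without loss of generality, we can assume that the functions $f_i$ are strictly positive (which can always be guaranteed by replacing each $f_i$ by $\max\{f_i, 0\} + \gamma$ with a small $\gamma > 0$). Now define a continuous partition of unity with elements
$$h_i = \frac{f_i}{\sum_j f_j}, \qquad i = 1, \dots, n.$$
Clearly,
$$\Big\|\sum_j f_j - 1\Big\| = \Big\|\sum_j (f_j - g_j)\Big\| < n\delta,$$
which implies the estimate
$$\|h_i - g_i\| \le \|h_i - f_i\| + \|f_i - g_i\| = \Big\|h_i\Big(1 - \sum_j f_j\Big)\Big\| + \|f_i - g_i\| \le \Big\|1 - \sum_j f_j\Big\| + \|f_i - g_i\| < (n+1)\delta.$$
Since $\delta$ is arbitrary, this finishes the proof.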
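The normalisation step in the proof of Lemma 9 is easy to try out numerically. The sketch below is an illustration under stated assumptions: $X = [0, 1]$ with Lebesgue measure, the measurable partition of unity consisting of the indicators of $[0, 1/2)$ and $[1/2, 1]$, and piecewise-linear ramps as the continuous approximations; all function names are hypothetical.

```python
def ramp(x, center, width):
    """Continuous piecewise-linear approximation of the indicator of [center, 1]."""
    return min(1.0, max(0.0, (x - center) / width + 0.5))

def continuous_partition(x, width, gamma):
    """Normalise strictly positive continuous f_1, f_2 into h_i = f_i / (f_1 + f_2)."""
    f2 = ramp(x, 0.5, width) + gamma
    f1 = (1.0 - ramp(x, 0.5, width)) + gamma
    total = f1 + f2
    return f1 / total, f2 / total

def l1_errors(width, gamma, n=10000):
    """Midpoint-rule estimate of ||h_i - g_i|| in L^1([0, 1]) for i = 1, 2."""
    err1 = err2 = 0.0
    for k in range(n):
        x = (k + 0.5) / n
        g2 = 1.0 if x >= 0.5 else 0.0   # indicator of [1/2, 1]
        g1 = 1.0 - g2                   # indicator of [0, 1/2)
        h1, h2 = continuous_partition(x, width, gamma)
        err1 += abs(h1 - g1) / n
        err2 += abs(h2 - g2) / n
    return err1, err2

wide = l1_errors(width=0.2, gamma=1e-3)
narrow = l1_errors(width=0.01, gamma=1e-3)
print(wide, narrow)  # narrowing the ramp reduces both L^1 errors
```

Shrinking the ramp width and the shift $\gamma$ plays the role of letting $\delta \to 0$ in the proof, while the division by $f_1 + f_2$ keeps the sum of the $h_i$ identically equal to one.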

Proof of Theorem 7
Since any continuous partition of unity is measurable, it follows that $\rho_{F,c}(\mu, \nu) \le \rho_F(\mu, \nu)$, and it is enough to prove the opposite inequality. As in the proof of Theorem 6, the space $X$ can be decomposed into three disjoint parts, $X = X_a \sqcup X_s^+ \sqcup X_s^-$, such that the measures $\mu$ and $\nu_a$ are supported on $X_a$ while $\nu_s^+$, $\nu_s^-$ are, respectively, supported on $X_s^+$, $X_s^-$. Denote by $f_a$, $f_s^+$, $f_s^-$ the characteristic functions of these disjoint parts.
Note that in (24), one can assume that $\mu[g] > 0$ for all $g$, since, on the one hand, the summands with $\mu[g] = 0$ are equal to 0 by definition and, on the other hand, once $\mu[g'] = 0$ and $\mu[g''] > 0$, the pair $g', g''$ can be replaced in the partition $G$ by the single element $g = g' + g''$, which does not change the sum in (24) due to absolute continuity of $\nu_a$ with respect to $\mu$. Now, recalling lower semicontinuity of $F$ and the definition of $F'(\pm\infty)$, the proof of the theorem is completed by applying Lemma 9 to partitions of unity $G$.

Sup-Sums Principles for Integral F-Divergence
In this section, we present a number of the principal results of the article uncovering interrelation between sup-sums F-divergences and integral F-divergence.
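Theorem 10 (sup-sums principle for partitions of unity) Let $\mu$ and $\nu$ be two finite measures on a measurable space $(X, \mathcal{A})$, where $\mu$ is nonnegative and $\nu$ is real-valued, and let $\nu = \nu_a + \nu_s^+ + \nu_s^-$ be the decomposition (9). Then,
$$\rho_F(\mu, \nu_a) = \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) d\mu \tag{25}$$
and
$$\rho_F(\mu, \nu) = \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) d\mu + \nu_s^+(X)\,F'(+\infty) + \nu_s^-(X)\,F'(-\infty). \tag{26}$$
Here, $d\nu_a/d\mu$ denotes the Radon-Nikodym derivative, and we assume that if $\nu_s^+ = 0$ or $\nu_s^- = 0$, then the corresponding product in the right-hand part of (26) is zero regardless of the value of $F'(\pm\infty)$.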

Corollary 11
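For any $f \in L^1(X, \mu)$ and any convex function $F : \mathbb{R} \to (-\infty, +\infty]$,
$$\int_X F(f)\,d\mu = \sup_G \sum_{g \in G} \mu[g]\,F\!\left(\frac{\mu[fg]}{\mu[g]}\right),$$
where the supremum is taken over all measurable partitions of unity $G$.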
Along with partitions of unity one can also use space partitions. Namely, by a measurable partition of the space $X$, we mean a finite family $\Delta = \{\Delta_1, \dots, \Delta_k\}$ of sets $\Delta_i \in \mathcal{A}$ such that $\Delta_1 \sqcup \dots \sqcup \Delta_k = X$. For any convex function $F : \mathbb{R} \to (-\infty, +\infty]$ put
$$\rho_{F,X}(\mu, \nu) = \sup_\Delta \sum_{\Delta_i \in \Delta} \mu(\Delta_i)\,F\!\left(\frac{\nu(\Delta_i)}{\mu(\Delta_i)}\right), \tag{27}$$
where the supremum is taken over the set of all measurable partitions of the space $X$, and we assume that if $\mu(\Delta_i) = 0$, then the corresponding summand in the right-hand part is defined according to convention (3). The argument of the proof of Lemma 3 shows that expression (27) preserves its value whether we use finite or countable measurable partitions of the space $X$.
The next statement is a 'space' variant of Theorem 10.

Remark 13
In the classical situation, i.e., for a convex function F : (0, +∞) → R and probability measures μ and ν the sup-sums principle (25), (26) with ρ F,X (μ, ν) was established by Vajda [24]. A different proof, based on generalized Taylor expansion of a convex function, is given in [16,Theorem 16]. The paper [16] is a good source of information on the classical F-divergence.

Theorem 14 (sup-sums principle for continuous partitions) Let $X$ be a topological space and let $\mu$ and $\nu$ be two finite Borel measures, where $\mu$ is nonnegative and $\nu$ is real-valued. If the set $C(X)$ of continuous functions is dense in $L^1(X, \mu + |\nu|)$ (which is always true for a metrizable space $X$ or, as a variant, for regular measures $\mu$ and $\nu$) and $F$ is a convex lower semicontinuous function, then
$$\rho_{F,c}(\mu, \nu) = \rho_F(\mu, \nu) = \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) d\mu + \nu_s^+(X)\,F'(+\infty) + \nu_s^-(X)\,F'(-\infty).$$
It is worth mentioning that there are at least two different ways to define the value of $\rho_F(\mu, \nu)$ for a measure $\nu$ that is not absolutely continuous with respect to $\mu$. Let us explain them in the case of a finite set $X = \{1, \dots, K\}$. In this case, the measures $\mu, \nu$ have the form $\mu = (\mu_1, \dots, \mu_K)$, $\nu = (\nu_1, \dots, \nu_K)$ and then
$$\rho_F(\mu, \nu) = \sum_{i=1}^{K} \mu_i\,F\!\left(\frac{\nu_i}{\mu_i}\right).$$
The question is how to define the product $\mu_i F(\nu_i/\mu_i)$ when $\mu_i = 0$ and $\nu_i \ne 0$. The first way (adopted in the present paper as well as in [16,24]) is to put
$$0F\!\left(\frac{\nu_i}{0}\right) = \lim_{t \to +0} t\,F\!\left(\frac{\nu_i}{t}\right).$$
Under this approach, the function $\rho_F(\mu, \nu)$ depends continuously on $\mu$. It is precisely this property that enables us to establish in Theorem 14 a link between sup-sums principles for measurable and continuous partitions of unity, which is indispensable for applications to the spectral objects in Sect. 5.
Under the second approach to the definition of F-divergence, the analogues of all the above-stated results (Propositions 4, 5 and Theorems 7, 10, 12) can be formulated and proved. However, in this setting, they will be meaningful only for absolutely continuous measures ν, while the singular case becomes trivial.
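Proof of Theorem 10 Let us check that for each (no matter finite or countable) measurable partition of unity $G$, we have
$$\sum_{g \in G} \mu[g]\,F\!\left(\frac{\nu_a[g]}{\mu[g]}\right) \le \sum_{g \in G} \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) g\,d\mu = \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) d\mu, \tag{28}$$
where we used Jensen's inequality for the probability measure $(g/\mu[g])\,d\mu$ and the fact that, by convention (3) and absolute continuity of $\nu_a$, all the summands with $\mu[g] = 0$ are zero. From (28), it follows that the left-hand part in (25) does not exceed the right-hand one, and to finish the proof of Theorem 10 we have to verify the inequality
$$\rho_F(\mu, \nu_a) \ge \int_X F\!\left(\frac{d\nu_a}{d\mu}\right) d\mu. \tag{29}$$
For the convex function $F$ under consideration, there exists a partition of the real axis by three points $-\infty \le a \le b \le c \le +\infty$ such that (i) $F(y) = +\infty$ for $y < a$ and $y > c$; (ii) $F(y)$ is nonincreasing, finite and continuous on $(a, b)$; (iii) $F(y)$ is nondecreasing, finite and continuous on $(b, c)$.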

Let us decompose $X$ into seven subsets
$$X_{<a},\quad X_a,\quad X_{(a,b)},\quad X_b,\quad X_{(b,c)},\quad X_c,\quad X_{>c} \tag{30}$$
defined, respectively, by the conditions $f < a$, $f = a$, $a < f < b$, $f = b$, $b < f < c$, $f = c$, $f > c$, where $f = d\nu_a/d\mu$. Some of these sets may be empty; for example, if the function $F$ decreases on $(a, c)$, then $b = c$ and $X_{(b,c)} = \emptyset$, and if $F$ is finite everywhere, then the sets $X_{<a}$, $X_a$, $X_c$, $X_{>c}$ will be empty. Evidently, it is enough to prove inequality (29) for each of the sets (30) separately and then sum the components. In doing so, partitions of unity $G$ on these sets should also be defined separately.
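For the sets $X_{<a}$, $X_a$, $X_b$, $X_c$, $X_{>c}$ (some of them may be empty) inequality (29) is verified easily: it is sufficient to take a trivial partition $G$ consisting of a single unit function on the set considered. Now consider the set $X_{(a,b)}$. Let us take an arbitrary number $\varepsilon > 0$ and set
$$Y_i = \{y \in (a, b) : i\varepsilon \le F(y) < (i+1)\varepsilon\}, \qquad X_i = \{x \in X_{(a,b)} : f(x) \in Y_i\}, \qquad i \in \mathbb{Z}, \tag{31}$$
where $f = d\nu_a/d\mu$. Clearly, the sets $X_i$ form a partition of $X_{(a,b)}$ and their characteristic functions (that we denote by $g_i$) form a measurable partition of unity on $X_{(a,b)}$. Note that by monotonicity of $F$ on $(a, b)$, the sets $Y_i$ are convex. Therefore, if $\mu(X_i) > 0$, then
$$\frac{\nu_a[g_i]}{\mu[g_i]} = \frac{1}{\mu(X_i)} \int_{X_i} f\,d\mu \in Y_i \tag{32}$$
and hence
$$F\!\left(\frac{\nu_a[g_i]}{\mu[g_i]}\right) \ge i\varepsilon > F(f(x)) - \varepsilon, \qquad x \in X_i. \tag{33}$$
Now, (31), (32), (33) imply that
$$\sum_i \mu[g_i]\,F\!\left(\frac{\nu_a[g_i]}{\mu[g_i]}\right) \ge \int_{X_{(a,b)}} F(f)\,d\mu - \varepsilon\,\mu\big(X_{(a,b)}\big).$$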
By arbitrariness of ε, this implies inequality (29) for the set X (a,b) . For the set X (b,c) , it is verified in the same way. Thus, Theorem 10 is proved.
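The mechanism behind Theorem 10 (Jensen's inequality bounds every partition sum from above, and fine partitions approach the integral) can be observed numerically. The following sketch is an illustration under stated assumptions: $X = [0, 1]$, $\mu$ is Lebesgue measure, $\nu$ has density $f(x) = 2x$, $F(t) = t \ln t$, and only partitions into $k$ equal intervals are used; the function names are hypothetical.

```python
import math

def F(t):
    """F(t) = t ln t with the usual agreement 0 ln 0 = 0."""
    return t * math.log(t) if t > 0 else 0.0

def partition_sum(k):
    """Sum of mu(cell) * F(nu(cell) / mu(cell)) over k equal subintervals of [0, 1]."""
    total = 0.0
    for i in range(k):
        a, b = i / k, (i + 1) / k
        mu_cell = b - a              # Lebesgue measure of [a, b]
        nu_cell = b * b - a * a      # integral of the density 2x over [a, b]
        total += mu_cell * F(nu_cell / mu_cell)
    return total

# Closed form of the integral of F(2x) over [0, 1]: ln 2 - 1/2.
exact = math.log(2) - 0.5

# Dyadic refinements k = 1, 2, 4, ..., 1024.
sums = [partition_sum(2 ** j) for j in range(11)]
print(sums[0], sums[-1], exact)  # the sums increase towards the integral
```

Each refinement can only increase the sum (this is Lemma 1 applied cell by cell), and no sum exceeds the integral, in agreement with (28).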

Proof of Theorem 12
Every space partition $\{\Delta_1, \dots, \Delta_k\}$ defines a partition of unity consisting of the corresponding characteristic functions. Thus, $\rho_{F,X}(\mu, \nu) \le \rho_F(\mu, \nu)$. The rest of the proof coincides with the ending part of the proof of Theorem 10 (starting from formula (29)), where only space partitions are used.

Proof of Theorem 14
Apply Theorem 10 along with Theorem 7 bearing in mind Remark 8.

Sup-Sums Principle for Kullback-Leibler Divergence, etc
If $\mu$ and $\nu$ are probability measures on $(X, \mathcal{A})$ and $\mu$ is absolutely continuous with respect to $\nu$, then the Kullback-Leibler divergence $D_{KL}$ is defined as
$$D_{KL}(\mu\,\|\,\nu) = \int_X \ln\frac{d\mu}{d\nu}\,d\mu.$$
The principal philosophy behind the results we are going to discuss is not new. Namely, an analogue of Theorem 15 for space partitions of $X$ (cf. (27)) goes back to Gelfand, Kolmogorov, and Yaglom [15]. However, the results obtained in the foregoing section give this field a new flavour, and among the basic novelties here is the use of continuous partitions of unity (see, in particular, Remarks 16 and 17), which serves as an indispensable tool in the analysis of the objects in Sect. 5.
The results of the foregoing section lead to the next
Recall that $\mu$ is absolutely continuous with respect to $\nu$ and hence with respect to $\nu_a$ as well. So if $\nu[g] = 0$, then $\mu[g] = 0$ and $\nu_a[g] = 0$ for any element $g$ of a measurable partition of unity on $X$. From this and definition (4) of $\rho_{-\ln}(\mu, \nu)$, it follows that
$$\rho_{-\ln}(\mu, \nu) = \sup_G \sum_{g \in G} \mu[g]\,\ln\frac{\mu[g]}{\nu[g]},$$
where all summands with $\mu[g] = 0$ are supposed to be zero. The analogous equality for $\rho_{-\ln}(\mu, \nu_a)$ can be obtained in the same way. Thus, Theorem 15 is proved.

Remark 16
The theorem just proved along with formula (38) naturally suggests an extension of the definition of Kullback-Leibler divergence to measures that are neither necessarily probability ones, nor mutually absolutely continuous. Namely, for any finite positive measures $\mu, \nu$ on a measurable space $(X, \mathcal{A})$ let us define the generalized Kullback-Leibler divergence $D_{KL}(\mu\,\|\,\nu)$ by formula (39). The reasoning from the proof of Theorem 15 shows that $D_{KL}(\mu\,\|\,\nu)$ defined in this way satisfies equalities (35) and (36) as well.

Remark 17
If X is a topological space and μ and ν are Borel measures such that the set C(X ) of continuous functions is dense in L 1 (X , μ) and L 1 (X , ν) (which is always true for a metrizable space X or, as a variant, for regular measures μ, ν) then recalling Theorems 12 and 14 one concludes that when applying (35) and (36) to definition (39), we can equally use continuous (finite or countable) partitions of unity.

Remark 18
As is known, apart from the Kullback-Leibler divergence, many common divergences are special cases of F-divergence corresponding to a suitable choice of $F$. For example, the Hellinger distance corresponds to the function $F(t) = 1 - \sqrt{t}$, the total variation distance corresponds to $F(t) = |t - 1|$, the Pearson $\chi^2$-divergence corresponds to $F(t) = (t - 1)^2$, and for the function $F(t) = (t^\alpha - t)/(\alpha^2 - \alpha)$, we obtain the so-called $\alpha$-divergence.
Thus, by choosing the corresponding convex functions F, one can write out the 'sup-sums principles' of Theorem 15 type for them where again one can exploit not only measurable but also continuous partitions of unity. Moreover, for example, for total variation distance, Pearson χ 2 -divergence and α-divergence one naturally arrives at consideration of real-valued (not necessarily nonnegative) measures.
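The generators listed in Remark 18 can be compared side by side on a finite space. The sketch below is illustrative only: it assumes two strictly positive probability vectors (sample data, not from the article), and the function names are hypothetical.

```python
import math

def divergence(F, q, p):
    """Finite-space F-divergence: sum_i p_i * F(q_i / p_i), assuming all p_i > 0."""
    return sum(pi * F(qi / pi) for qi, pi in zip(q, p))

def hellinger(t):
    return 1.0 - math.sqrt(t)

def total_variation(t):
    return abs(t - 1.0)

def pearson_chi2(t):
    return (t - 1.0) ** 2

def alpha_generator(alpha):
    """F(t) = (t^alpha - t) / (alpha^2 - alpha); alpha must differ from 0 and 1."""
    def F(t):
        return (t ** alpha - t) / (alpha ** 2 - alpha)
    return F

q = [0.5, 0.3, 0.2]
p = [0.25, 0.25, 0.5]

d_hellinger = divergence(hellinger, q, p)        # equals 1 - sum sqrt(p_i q_i)
d_tv = divergence(total_variation, q, p)         # equals sum |q_i - p_i|
d_chi2 = divergence(pearson_chi2, q, p)          # equals sum (q_i - p_i)^2 / p_i
d_alpha2 = divergence(alpha_generator(2.0), q, p)
print(d_hellinger, d_tv, d_chi2, d_alpha2)
```

For probability vectors the $\alpha$-divergence with $\alpha = 2$ is half of the Pearson $\chi^2$-divergence, since the two generators differ by the linear term $(1 - t)/2$ and a factor of $1/2$, and linear terms vanishing at $t = 1$ integrate to zero.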

Remark 19
In the paper [22], a result of the type of Theorem 15 was established for a sigma-finite measure $\nu$ and a measure $\mu$ which is absolutely continuous with respect to $\nu$.

New Definition for t-Entropy
In this section, as an application of Theorems 14 and 15, we obtain a new formula for t-entropy that clarifies its relationship with Kullback-Leibler divergence.
The t-entropy (we recall its definition below) is a principal object of the spectral analysis of operators associated with dynamical systems. In particular, in the series of articles [3-8], a relation between t-entropy and spectral radii of the corresponding operators has been established. Namely, it was shown that t-entropy is the Fenchel-Legendre dual to the spectral exponent of the operators in question.
For transparency of presentation, let us recall the mentioned objects and results.
Hereafter, X is a Hausdorff compact space, C(X ) is the algebra of continuous functions on X taking real values and equipped with the max-norm, and α : X → X is an arbitrary continuous mapping. The corresponding dynamical system will be denoted by (X , α).
Recall that a transfer operator $A : C(X) \to C(X)$, associated with a given dynamical system, is defined in the following way: (a) $A$ is a positive linear operator (i.e., it maps nonnegative functions to nonnegative ones); and (b) the following homological identity for $A$ is valid:
$$A\big((g \circ \alpha)\,f\big) = g\,Af, \qquad f, g \in C(X).$$
As an important and popular example of transfer operators one can take, say, the classical Perron-Frobenius operator, that is, the operator having the form
$$[Af](x) = \sum_{y \in \alpha^{-1}(x)} a(y)\,f(y),$$
where $a \in C(X)$ is fixed. This operator is well defined when $\alpha$ is a local homeomorphism.
Let $A$ be a certain transfer operator in $C(X)$. In what follows, we denote by $A_\varphi$ the family of transfer operators in $C(X)$ given by the formula
$$A_\varphi f = A(e^{\varphi} f), \qquad \varphi \in C(X).$$
Let us denote by $\lambda(\varphi)$ the spectral potential of $A_\varphi$, defined by the formula
$$\lambda(\varphi) = \ln r(A_\varphi),$$
where $r(A_\varphi)$ is the spectral radius of the operator $A_\varphi$. We denote by $M(X)$ the set of all probability Borel measures on $X$. Recall that a measure $\mu \in M(X)$ is called $\alpha$-invariant iff $\mu(g) = \mu(g \circ \alpha)$ for all $g \in C(X)$. The family of $\alpha$-invariant probability measures on $X$ is denoted by $M_\alpha(X)$.
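For orientation, the objects just introduced can be computed explicitly on a finite space. The sketch below is a toy illustration under several assumptions that are not taken from the article: $X = \{0, 1, 2\}$, $\alpha$ is the cyclic shift, the operator is of Perron-Frobenius type with an arbitrarily chosen positive weight $a$, the family is taken in the form $A_\varphi f = A(e^{\varphi} f)$ with $\lambda(\varphi) = \ln r(A_\varphi)$, and the spectral radius is estimated through $\|A_\varphi^n 1\|^{1/n}$, which is legitimate for positive operators.

```python
import math

alpha = [1, 2, 0]        # alpha(y) = y + 1 (mod 3) on X = {0, 1, 2}
a = [0.5, 2.0, 1.5]      # positive weight function (illustrative values)

def A(f):
    """Perron-Frobenius type operator: (Af)(x) = sum of a(y) f(y) over y with alpha(y) = x."""
    out = [0.0] * len(f)
    for y, x in enumerate(alpha):
        out[x] += a[y] * f[y]
    return out

def A_phi(phi, f):
    """Assumed form of the family of operators: A_phi f = A(exp(phi) * f)."""
    return A([math.exp(p) * v for p, v in zip(phi, f)])

def spectral_potential(phi, n=30):
    """Estimate of ln r(A_phi) via the sup-norm of A_phi^n applied to the unit function."""
    f = [1.0] * len(phi)
    for _ in range(n):
        f = A_phi(phi, f)
    return math.log(max(f)) / n

phi = [0.1, -0.3, 0.2]
lam = spectral_potential(phi)

# For a single 3-cycle, A_phi^3 is a multiple of the identity, so the spectral
# potential is the average of phi + ln a along the cycle.
cycle_average = sum(p + math.log(w) for p, w in zip(phi, a)) / 3
print(lam, cycle_average)  # the two values agree up to rounding
```

In this toy case the only $\alpha$-invariant probability measure is the uniform one, so the value of $\lambda(\varphi)$ is an average of $\varphi + \ln a$ with respect to that measure.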
The t-entropy is a certain functional on M(X ) denoted by τ (μ) (its detailed definition will be given below).
The substantial importance of t-entropy is demonstrated by the following variational principle.
Theorem 20 ([6], Theorem 5.6) Let $A : C(X) \to C(X)$ be a transfer operator for a continuous mapping $\alpha : X \to X$ of a compact Hausdorff space $X$. Then,
$$\lambda(\varphi) = \max_{\mu \in M_\alpha(X)} \left( \int_X \varphi\,d\mu + \tau(\mu) \right), \qquad \varphi \in C(X).$$
One readily notes the resemblance of this theorem to the Ruelle-Walters variational principle for the topological pressure [21,25,26] uncovering its relation with Kolmogorov-Sinai entropy.
Among the principal ingredients in the proofs of the results leading to Theorem 20 is the so-called 'entropy statistic theorem'. This theorem plays in the spectral theory of weighted shift and transfer operators the role analogous to Shannon-McMillan-Breiman theorem in information theory [1,18] and its important corollary known as 'asymptotic equipartition property' [10, p. 135]. The variational principles containing t-entropy and the objects therein serve as key ingredients of the thermodynamical formalism (see [4,7,17] and the sources quoted there).
Being so important, t-entropy is at the same time a rather sophisticated object to calculate. The description of t-entropy that does not lean on Fenchel-Legendre duality is not elementary, and it took substantial time and effort to obtain its 'accessible' definition.
Namely, originally t-entropy $\tau(\mu)$ was defined in three steps (see, for example, [6]).