Abstract
The article presents new \({\sup }\)-sums principles for integral F-divergence for arbitrary convex functions F on the whole real axis and arbitrary (not necessarily positive and normalized) measures. Among applications of these results, we work out a new ‘integral’ definition for t-entropy explicitly establishing its relation to Kullback–Leibler divergence.
1 Introduction
The notion of F-divergence was introduced and originally studied in analysis of probability distributions by [2, 14, 19]. It is defined in the following way. Let Q and P be two probability distributions over a space \(\Omega \) such that Q is absolutely continuous with respect to P. Then, for a convex function \(F\!: {\mathbb {R}}_+ \rightarrow {\mathbb {R}}\) such that \(F(1) = 0\), the F-divergence \(D_F (Q\Vert P)\) of Q from P is defined as
$$\begin{aligned} D_F (Q\Vert P) = \int _\Omega F\left( \frac{\mathrm{d}Q}{\mathrm{d}P}\right) \mathrm{d}P, \qquad (1) \end{aligned}$$
where \(\mathrm{d}Q/\mathrm{d}P\) is the Radon–Nikodym derivative of Q with respect to P.
Since its introduction, the F-divergence has been intensively exploited and analysed due to the fact that by taking appropriate functions F one arrives here at numerous important divergences such as Kullback–Leibler divergence, Hellinger distance, Pearson \(\chi ^2\)-divergence, etc.
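As a purely illustrative sketch (not part of the original exposition), formula (1) can be evaluated on a finite space, where the integral becomes a sum. The distributions `P` and `Q` and the helper name `f_divergence` are hypothetical; the functions \(F(t) =-\ln t\), \(F(t) =(t-1)^2\) and \(F(t) =|t-1|\) are the ones discussed in Sects. 3 and 4.

```python
import math

def f_divergence(q, p, F):
    """Finite-space version of formula (1): sum_i p_i * F(q_i / p_i).

    Assumes every p_i > 0, so that Q is absolutely continuous w.r.t. P.
    """
    return sum(pi * F(qi / pi) for qi, pi in zip(q, p))

# Hypothetical probability vectors on a three-point space.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# F(t) = -ln t yields the Kullback-Leibler divergence of P from Q.
kl_p_from_q = f_divergence(Q, P, lambda t: -math.log(t))

# F(t) = (t - 1)^2 yields the Pearson chi^2-divergence.
chi2 = f_divergence(Q, P, lambda t: (t - 1.0) ** 2)

# F(t) = |t - 1| yields the sum of |q_i - p_i|.
tv = f_divergence(Q, P, lambda t: abs(t - 1.0))
```

All three functions satisfy \(F(1) =0\), so each divergence vanishes when the two distributions coincide.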
A comprehensive analysis of F-divergence was worked out by Liese and Vajda in [16] where a sup-sums principle for space partitions was established as well [16, Theorem 16].
In fact, formula (1) can be extended to arbitrary real-valued measures Q. Moreover, for such measures Q, the value \(D_F (Q\Vert P)\) has a substantial statistical meaning. In [13, 20, 23], it was shown that the value \(e^{-nD_F (Q\Vert P)}\) determines the asymptotics of conditional probabilities of large deviations for a certain family of weighted empirical measures that are close to Q, where F is the rate function for large deviations of the sequence of random weights. In [12], F-divergences for real-valued measures Q were applied to parametric estimation and testing.
The object of the present article is a general F-divergence associated with an arbitrary convex function F that is defined on the whole real axis and can take infinite values, and with arbitrary real-valued (not necessarily positive, normalized, or absolutely continuous) measures. For this F-divergence, we derive a number of new sup-sums principles exploiting both measurable and continuous partitions of unity (Theorems 10, 12, and 14 of the article). In particular, they describe the passage from the F-divergence on a finite phase space to the F-divergence on an arbitrary measurable space. This passage involves additional components \(F^{\prime }(\pm \infty )\).
On the basis of the \({\sup }\)-sums principles obtained, we derive the corresponding \({\sup }\)-sums principle for Kullback–Leibler divergence (Theorem 15), which also leads naturally to its new definition for measures that are not probability measures. The initial variant of the sup-sums principle for a particular case of F-divergence (mutual information) for mutually absolutely continuous probability measures was established by Gelfand, Kolmogorov, and Yaglom in [15].
As one more substantial application of integral \({\sup }\)-sums principles, we obtain a new formula for t-entropy. The t-entropy plays a fundamental role in the spectral analysis of operators associated with dynamical systems (cf. Theorem 20) and is a key ingredient in the ‘entropy statistic theorem’. In the spectral theory of weighted shift and transfer operators, the latter statement plays a role analogous to that of the Shannon–McMillan–Breiman theorem in information theory [1, 18] and its important corollary known as the ‘asymptotic equipartition property’ [10, p. 135]. Up to now, the definition of t-entropy has been formulated in a rather sophisticated manner in terms of actions of transfer operators on continuous partitions of unity (for more details, see Sect. 5). In Theorem 21, we give a fundamentally new ‘integral’ definition of t-entropy explicitly establishing its relation to Kullback–Leibler divergence.
2 Sup-Sums F-Divergence
Consider an arbitrary convex function \(F\!:{\mathbb {R}}\rightarrow (-\infty , +\infty ]\). Let
$$\begin{aligned} F^{\prime }(+\infty ) = \lim _{t\rightarrow +\infty } \frac{F(t)}{t}, \qquad F^{\prime }(-\infty ) = \lim _{t\rightarrow -\infty } \frac{F(t)}{t}. \qquad (2) \end{aligned}$$
Obviously, both limits do exist, and the value of \(F^{\prime }(+\infty )\) may be finite or equal to \(+\infty \) while \(F^{\prime }(-\infty )\) may be finite or equal to \(-\infty \).
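We adopt the following convention. The product 0F(x/0) for \(x\ne 0\) will be defined as the limit \(\lim _{t\rightarrow +0} tF(x/t)\), and for \(x=0\), it will be assumed to be zero. In other words,
$$\begin{aligned} 0F(x/0) = \begin{cases} xF^{\prime }(+\infty ), &amp; x>0,\\ 0, &amp; x=0,\\ xF^{\prime }(-\infty ), &amp; x<0. \end{cases} \qquad (3) \end{aligned}$$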
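The convention for the product 0F(x/0) can be sketched in code as follows. This is only an illustration under stated assumptions: the helper name `perspective` is hypothetical, and the two slopes \(F^{\prime }(\pm \infty )\) are passed in by the caller because they depend on the chosen F.

```python
def perspective(s, x, F, slope_plus, slope_minus):
    """Return s*F(x/s) for s >= 0, using the limit convention when s == 0.

    slope_plus and slope_minus stand for F'(+inf) and F'(-inf).
    """
    if s > 0:
        return s * F(x / s)
    if x > 0:
        return x * slope_plus
    if x < 0:
        return x * slope_minus
    return 0.0

# Example: F(t) = |t - 1| has F'(+inf) = 1 and F'(-inf) = -1.
F = lambda t: abs(t - 1.0)
at_zero_pos = perspective(0.0, 2.0, F, 1.0, -1.0)    # 2 * F'(+inf)
at_zero_neg = perspective(0.0, -3.0, F, 1.0, -1.0)   # -3 * F'(-inf)
near_zero = perspective(1e-9, 2.0, F, 1.0, -1.0)     # t*F(x/t) for a small t > 0
```

For this particular F, the value at \(s=0\) agrees with the limit of tF(x/t) as \(t\rightarrow +0\), which is what the convention is designed to reproduce.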
Let \(\mu \) be a finite nonnegative measure and let \(\nu \) be a finite real-valued measure, both defined on the same measurable space \((X,{\mathfrak {A}})\). For measurable functions g on \((X,{\mathfrak {A}})\), we write
$$\begin{aligned} \mu [g] = \int _X g\,\mathrm{d}\mu , \qquad \nu [g] = \int _X g\,\mathrm{d}\nu \end{aligned}$$
(provided the integrals do exist).
By a measurable partition of unity, we will understand a finite set \(G = \{g_1,\dots ,g_k\}\) of nonnegative measurable functions on \((X,{\mathfrak {A}})\) such that \(\sum _i g_i \equiv 1\).
Now we introduce the main object of the paper. For any convex function \(F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]\), the sup-sums F-divergence \(\rho ^{}_F(\mu ,\nu )\) is defined as
$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \sup _G \sum _{g\in G} \mu [g] F\left( \frac{\nu [g]}{\mu [g]}\right) , \qquad (4) \end{aligned}$$
where the supremum is taken over the set of all measurable partitions of unity G, and we assume that if \(\mu [g]=0\), then the corresponding summand in the right-hand part is defined according to convention (3).
The relation of the sup-sums F-divergence to the usual (integral) F-divergence will be uncovered in the next section.
The following two lemmas present important properties of the function sF(x/s) used in the definition (4).
Lemma 1
For any convex function F and all \(s,t\ge 0\) and \(x,y\in {\mathbb {R}},\)
$$\begin{aligned} (s+t)F\left( \frac{x+y}{s+t}\right) \le sF\left( \frac{x}{s}\right) +tF\left( \frac{y}{t}\right) . \qquad (5) \end{aligned}$$
Each convex function F on the real axis is superlinear, i.e.,
$$\begin{aligned} F(t) \ge At+B \qquad (6) \end{aligned}$$
for some constants \(A,B\in {\mathbb {R}}\) and all \(t\in {\mathbb {R}}\).
Lemma 2
If a convex function F satisfies condition (6), then for all \(s\ge 0\) and \(x\in {\mathbb {R}},\)
$$\begin{aligned} sF\left( \frac{x}{s}\right) \ge Ax+Bs. \qquad (7) \end{aligned}$$
We now proceed to describe the technical properties of \(\rho ^{}_F(\mu ,\nu )\).
Lemma 3
The value of \(\rho _F(\mu ,\nu )\) does not change if we use countable partitions of unity G in (4) instead of finite ones.
Proposition 4
The function \(\rho ^{}_F(\mu ,\nu )\) is subadditive with respect to the pair \((\mu ,\nu )\). That is, for any finite nonnegative measures \(\mu _1,\,\mu _2\) and any finite real-valued measures \(\nu _1,\,\nu _2,\)
$$\begin{aligned} \rho ^{}_F(\mu _1+\mu _2,\nu _1+\nu _2) \le \rho ^{}_F(\mu _1,\nu _1) +\rho ^{}_F(\mu _2,\nu _2). \qquad (8) \end{aligned}$$
For any measure \(\nu \) and a bounded measurable function f on a measurable space \((X,{\mathfrak {A}})\), we define a real-valued measure \(f\nu \) by the formula
$$\begin{aligned} (f\nu )(\Delta ) = \int _\Delta f\,\mathrm{d}\nu , \qquad \Delta \in {\mathfrak {A}}. \end{aligned}$$
Proposition 5
Let \(\mu ,\,\nu \) be finite measures, where \(\mu \) is nonnegative and \(\nu \) is real-valued, and let \(f_1, f_2\) be nonnegative bounded measurable functions on \((X,{\mathfrak {A}})\). Then,
$$\begin{aligned} \rho ^{}_F\bigl ((f_1+f_2)\mu ,(f_1+f_2)\nu \bigr ) = \rho ^{}_F(f_1\mu ,f_1\nu ) +\rho ^{}_F(f_2\mu ,f_2\nu ). \end{aligned}$$
This means that the function \(\rho ^{}_F(f\mu ,f\nu )\) is additive with respect to f.
Generally, a real-valued measure \(\nu \) is decomposed into three components
$$\begin{aligned} \nu =\nu _a +\nu ^+_s +\nu ^-_s, \qquad (9) \end{aligned}$$
where \(\nu _a\) is absolutely continuous, \(\nu ^+_s\) is positive and singular, and \(\nu ^-_s\) is negative and singular (with respect to \(\mu \)).
The next result describes the corresponding decomposition of \(\rho ^{}_F(\mu ,\nu )\).
Theorem 6
Let \(\mu \), \(\nu \) be finite measures on a measurable space \((X,{\mathfrak {A}})\), where \(\mu \) is nonnegative and \(\nu \) is real-valued. Then,
$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \rho ^{}_F(\mu ,\nu _a) +\rho ^{}_F(0,\nu ^+_s) +\rho ^{}_F(0,\nu ^-_s), \qquad (10) \end{aligned}$$
where each term may be finite or equal to \(+\infty \), and
$$\begin{aligned} \rho ^{}_F(0,\nu ^+_s) = \nu ^+_s(X)F^{\prime }(+\infty ), \qquad (11) \end{aligned}$$
$$\begin{aligned} \rho ^{}_F(0,\nu ^-_s) = \nu ^-_s(X)F^{\prime }(-\infty ). \qquad (12) \end{aligned}$$
Here, we assume that if \(\nu ^+_s =0\) or \(\nu ^-_s =0\) then the corresponding product in the right-hand part of (11) or (12) is zero regardless of the value of \(F^{\prime }(\pm \infty )\).
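The decomposition in Theorem 6 can be illustrated on a finite space, where each singular component reduces to a product of a total mass with \(F^{\prime }(\pm \infty )\). The following is a minimal sketch under stated assumptions: the data `mu`, `nu` and the helper `term` are hypothetical, and \(F(t) =|t-1|\) is chosen because its slopes at infinity are simply \(\pm 1\).

```python
def term(m, n, F, slope_plus, slope_minus):
    """m*F(n/m), with the limit convention for the case m == 0."""
    if m > 0:
        return m * F(n / m)
    if n > 0:
        return n * slope_plus
    if n < 0:
        return n * slope_minus
    return 0.0

# Hypothetical data on X = {0, 1, 2, 3}; mu vanishes on the last two points,
# so nu has a positive and a negative singular part there.
mu = [0.5, 0.5, 0.0, 0.0]
nu = [0.2, -0.1, 0.3, -0.4]

F = lambda t: abs(t - 1.0)      # F'(+inf) = 1, F'(-inf) = -1
SP, SM = 1.0, -1.0

# Sum over the finest partition of X (one element per point).
finest = sum(term(m, n, F, SP, SM) for m, n in zip(mu, nu))

# The three components: absolutely continuous, positive singular, negative singular.
abs_cont = sum(m * F(n / m) for m, n in zip(mu, nu) if m > 0)
sing_pos = sum(n for m, n in zip(mu, nu) if m == 0 and n > 0) * SP
sing_neg = sum(n for m, n in zip(mu, nu) if m == 0 and n < 0) * SM
```

In this example the sum over the finest partition splits exactly into the three components, and both sides equal \(\sum _i |\nu _i -\mu _i|\).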
There are quite a number of objects in analysis where one has to exploit continuous rather than measurable partitions of unity (one of them will be considered in Sect. 5). To discuss this setting in our context, we need the next definition.
Let X be a topological space and let \(\mu ,\, \nu \) be finite Borel measures, where \(\mu \) is nonnegative and \(\nu \) is real-valued. For any convex function \(F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]\) set
$$\begin{aligned} \rho ^{}_{F,c}(\mu ,\nu ) = \sup _G \sum _{g\in G} \mu [g] F\left( \frac{\nu [g]}{\mu [g]}\right) , \qquad (13) \end{aligned}$$
where the supremum is taken over the set of all (finite) continuous partitions of unity G, and we assume that if \(\mu [g]=0\) then the corresponding summand in the right-hand part is defined according to convention (3).
Theorem 7
Let \(\mu \), \(\nu \) be finite Borel measures on a metric space X, where \(\mu \) is nonnegative and \(\nu \) is real-valued. Then, for any convex lower semicontinuous function F,
$$\begin{aligned} \rho ^{}_{F,c}(\mu ,\nu ) = \rho ^{}_F(\mu ,\nu ). \end{aligned}$$
Remark 8
In fact, instead of metrizability of X in Theorem 7, it suffices to require the density of the set of continuous functions C(X) in the space \(L^1(X,\mu + |\nu |)\) (which is always true for a metrizable space X or, as a variant, for regular measures \(\mu \), \(\nu \)).
Now, let us prove the above formulated results.
Proof of Lemma 2
If \(s>0\), then (7) follows immediately from (6). Note that (6) and (2) imply inequalities
$$\begin{aligned} F^{\prime }(-\infty ) \le A\le F^{\prime }(+\infty ). \qquad (14) \end{aligned}$$
In turn along with (3), they imply (7), provided \(s=0\) and \(x\ne 0\). Finally, in case \(s=0\) and \(x=0\), both sides in (7) become zero. \(\square \)
Proof of Lemma 1
If \(s,t>0\), then by convexity of F,
$$\begin{aligned} (s+t)F\left( \frac{x+y}{s+t}\right) = (s+t)F\left( \frac{s}{s+t}\cdot \frac{x}{s} +\frac{t}{s+t}\cdot \frac{y}{t}\right) \le sF\left( \frac{x}{s}\right) +tF\left( \frac{y}{t}\right) . \end{aligned}$$
Consider the case when \(s>0\) and \(t=0\).
If \(y=0\), then (5) turns into the equality \(sF(x/s) =sF(x/s)\).
Suppose now that \(y>0\). If at least one summand in the right-hand part of (5) is infinite, then (5) holds true. If both summands sF(x/s) and \(0F(y/0) =yF^{\prime }(+\infty )\) are finite then the function F must be finite and continuous on the whole interval \((x/s, +\infty )\). Hence, in (5), one can pass to a limit as \(t\rightarrow +0\) and obtain the desired inequality
The case \(y<0\) is treated similarly.
It remains to analyse the case \(s,t=0\) and \(x,y\ne 0\). If x and y have the same sign (say \(x,y>0\)), then (5) turns into equality:
Suppose \(x,\,y\) have different signs (say \(x<0\) and \(y>0\)). Recall that \(F^{\prime }(+\infty ) \ge F^{\prime }(-\infty )\) (see (14)). Therefore, in any case,
which means that
Thus, Lemma 1 is proved in all cases. \(\square \)
Proof of Lemma 3
Consider a countable partition of unity \(G = \{g_1,g_2,\dots \}\). First, we prove that in this case, the sum in (4) is well-defined, i.e., that the limit
does exist, being either finite or equal to \(+\infty \).
Set \(h_n =\sum _{i\ge n} g_i\). Then, by Levi’s monotone convergence theorem,
where \(|\nu |\) denotes the total variation of \(\nu \). Lemma 2 implies that
It follows from (16) and (17) that for any \(\varepsilon >0\), there exists N such that for all \(n>N\) and \(m\ge n\),
Now, we have two possibilities: if for any \(\varepsilon >0\) there exists N such that for all \(n>N\) and \(m\ge n\),
then limit (15) does exist (being finite when all the summands in (15) are finite and equal to \(+\infty \) when there is at least one infinite summand); otherwise, if assumption (19) fails, using its negation and (18) one can easily show that the limit still exists and equals \(+\infty \).
Now, let us check equivalence of finite and countable partitions for use in (4).
Each finite partition of unity G in (4) may be transformed into a countable one by adding countably many zero elements, so the transition from finite to countable partitions cannot decrease the value of \(\rho ^{}_F(\mu ,\nu )\). Thus, it suffices to prove that it cannot increase this value either.
Let \(\rho ^{}_F(\mu ,\nu )\) be defined by (4) using countable partitions G. Then, for any \(c<\rho ^{}_F(\mu ,\nu )\), there exists a countable partition of unity \(G=\{g_1,g_2,\dots \}\) such that
Set \(h_n =\sum _{i\ge n} g_i\). Combining Lemma 2 and (16), we obtain
Consider a finite partition of unity \(G_n =\{g_1,\dots ,g_{n-1},h_n\}\). Now, (20) and (21) imply
Then, (20) is valid for some \(G_n\) instead of G, which along with arbitrariness of the constant \(c<\rho ^{}_F(\mu ,\nu )\) implies the statement of Lemma 3. \(\square \)
Proof of Proposition 4
If g is an element of a measurable partition of unity G, then by Lemma 1,
Summing this up over \(g\in G\) and passing to suprema gives (8). \(\square \)
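Proof of Proposition 5
From Proposition 4, it follows that
$$\begin{aligned} \rho ^{}_F\bigl ((f_1+f_2)\mu ,(f_1+f_2)\nu \bigr ) \le \rho ^{}_F(f_1\mu ,f_1\nu ) +\rho ^{}_F(f_2\mu ,f_2\nu ). \end{aligned}$$
So, it suffices to prove the reverse inequality.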
By definition, for any \(c_i<\rho ^{}_F(f_i\mu ,f_i\nu )\), \(i=1,2\), there exist measurable partitions of unity \(G_i\), \(i=1,2\), such that
For each \(g\in G_i\) let us define the function
Evidently, the collection \(H =\{ h_g\mid g\in G_1\cup G_2\}\) forms a measurable partition of unity. Note that for each \(g\in G_i\), we have the equality \((f_1+f_2)h_g =f_ig\). Therefore,
From (22), (23) it follows that
and, by arbitrariness of \(c_i<\rho ^{}_F(f_i\mu ,f_i\nu )\),
\(\square \)
Proof of Theorem 6
The space X can be decomposed into three disjoint measurable parts, say \(X=X_a\sqcup X^+_s \sqcup X^-_s\), such that the measures \(\mu \) and \(\nu _a\) are supported on \(X_a\) while \(\nu ^+_s\), \(\nu ^-_s\) are, respectively, supported on \(X^+_s\), \(X^-_s\). Denote by \(f_a\), \(f^+_s\), \(f^-_s\) characteristic functions of these disjoint parts. Then,
and hence (10) follows from Proposition 5.
Proofs of equalities (11) and (12) are similar. For example,
\(\square \)
To prove Theorem 7, we need the next
Lemma 9
Let \(\mu \) be a positive finite Borel measure on a topological space X such that C(X) is dense in \(L^1(X,\mu )\). Then, for any measurable partition of unity \(G = \{g_1,\dots ,g_n\}\) on X and any \(\varepsilon >0\), there exists a continuous partition of unity \(H =\{h_1,\dots ,h_n\}\) on X such that \(\Vert h_i-g_i\Vert <\varepsilon \) in \(L^1(X,\mu )\) for all \(i\in \overline{1,n}\).
Proof
Choose a small \(\delta >0\) and approximate each \(g_i\) by a continuous function \(f_i\) satisfying \(\Vert f_i-g_i\Vert <\delta \) in the space \(L^1(X,\mu )\). Without loss of generality, we can assume that the functions \(f_i\) are strictly positive (which can always be guaranteed by replacing each \(f_i\) by \(\max \{f_i,0\} +\gamma \) with a small \(\gamma >0\)). Now define a continuous partition of unity with elements
$$\begin{aligned} h_i = \frac{f_i}{\sum _{j=1}^n f_j}, \qquad i\in \overline{1,n}. \end{aligned}$$
Clearly,
which implies the estimate
Since \(\delta \) is arbitrary, this finishes the proof.\(\square \)
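The construction used in the proof of Lemma 9 (make the approximations strictly positive, then divide by their sum) can be sketched numerically. This is only an illustration: functions are represented by hypothetical samples on a finite grid, and the names `normalize_to_partition` and `gamma` are assumptions introduced here.

```python
def normalize_to_partition(fs, gamma=1e-3):
    """Replace each f_i by max(f_i, 0) + gamma and divide by the pointwise
    sum, so that the resulting functions are positive and sum to one."""
    pos = [[max(v, 0.0) + gamma for v in f] for f in fs]
    totals = [sum(col) for col in zip(*pos)]
    return [[v / t for v, t in zip(f, totals)] for f in pos]

# Hypothetical samples of two approximating functions on a 5-point grid;
# they are close to, but not exactly, a partition of unity.
f1 = [1.02, 0.80, 0.49, 0.21, -0.01]
f2 = [-0.01, 0.22, 0.50, 0.78, 1.03]

h1, h2 = normalize_to_partition([f1, f2])
column_sums = [u + v for u, v in zip(h1, h2)]
```

The normalized functions sum to one at every grid point, while staying close to the original approximations wherever the latter nearly sum to one.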
Proof of Theorem 7
Since any continuous partition of unity is measurable it follows that \(\rho ^{}_{F,c}(\mu ,\nu )\le \rho ^{}_F(\mu ,\nu )\), and it is enough to prove the opposite inequality.
As in the proof of Theorem 6, the space X can be decomposed into three disjoint parts, \(X=X_a\sqcup X^+_s \sqcup X^-_s\), such that the measures \(\mu \) and \(\nu _a\) are supported on \(X_a\) while \(\nu ^+_s\), \(\nu ^-_s\) are, respectively, supported on \(X^+_s\), \(X^-_s\). Denote by \(f_a\), \(f^+_s\), \(f^-_s\) the characteristic functions of these disjoint parts.
Theorem 6 gives the following representation of \(\rho ^{}_F(\mu ,\nu )\):
Suppose that \(\nu ^+_s[f^+_s] >0\) and \(\nu ^-_s[f^-_s] <0\) (otherwise the corresponding summands in the right-hand side of (24) may be omitted).
Note that in (24), one can assume that \(\mu [g] >0\) for all g. Indeed, on the one hand, the summands with \(\mu [g] =0\) are equal to 0 by definition; on the other hand, if \(\mu [g^{\prime }]=0\) and \(\mu [g^{\prime \prime }]>0\), then the pair \(g^{\prime }\), \(g^{\prime \prime }\) can be replaced by the single element \(g = g^{\prime } + g^{\prime \prime }\) of the partition G, which does not change the sum in (24) due to absolute continuity of \(\nu _a\) with respect to \(\mu \).
Now, recalling the lower semicontinuity of F and the definition of \(F^{\prime }(\pm \infty )\), the proof of the theorem is completed by applying Lemma 9 to the partitions of unity \(G^{\prime } =\{ f_ag\mid g\in G\}\cup \{f^+_s,f^-_s\}\) in the space \(L^1(X,\mu +|\nu |)\). \(\square \)
3 Sup-Sums Principles for Integral F-Divergence
In this section, we present the principal results of the article, which reveal the relationship between sup-sums F-divergences and the integral F-divergence.
Theorem 10
(sup-sums principle for partitions of unity) Let \(\mu \) and \(\nu \) be two finite measures on a measurable space \((X,{\mathfrak {A}})\), where \(\mu \) is nonnegative and \(\nu \) is real-valued, and let \(\nu =\nu _a +\nu ^+_s +\nu ^-_s\) be the decomposition (9). Then,
$$\begin{aligned} \rho ^{}_F(\mu ,\nu _a) = \int _X F\left( \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\right) \mathrm{d}\mu \qquad (25) \end{aligned}$$
and
$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \int _X F\left( \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\right) \mathrm{d}\mu +\nu ^+_s(X)F^{\prime }(+\infty ) +\nu ^-_s(X)F^{\prime }(-\infty ). \qquad (26) \end{aligned}$$
Here, \(\mathrm{d}\nu _a/\mathrm{d}\mu \) denotes the Radon–Nikodym derivative, and we assume that if \(\nu ^+_s =0\) or \(\nu ^-_s =0\), then the corresponding product in the right-hand part of (26) is zero regardless of the value of \(F^{\prime }(\pm \infty )\).
Corollary 11
For any \(f\in L^1(X,\mu )\) and any convex function \(F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ],\)
$$\begin{aligned} \int _X F(f)\,\mathrm{d}\mu = \sup _G \sum _{g\in G} \mu [g] F\left( \frac{\mu [fg]}{\mu [g]}\right) , \end{aligned}$$
where the supremum is taken over all measurable partitions of unity G.
Along with partitions of unity one can also use space partitions. Namely, by a measurable partition of space X, we mean a finite family \(\Gamma = \{\Delta _1,\dots ,\Delta _k\}\) of sets \(\Delta _i \in {\mathfrak {A}}\) such that \(\Delta _1\sqcup \dots \sqcup \Delta _k =X\).
For any convex function \(F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]\) put
$$\begin{aligned} \rho ^{}_{F,X}(\mu ,\nu ) = \sup _\Gamma \sum _{\Delta \in \Gamma } \mu (\Delta ) F\left( \frac{\nu (\Delta )}{\mu (\Delta )}\right) , \qquad (27) \end{aligned}$$
where the supremum is taken over the set of all measurable partitions \(\Gamma \) of space X, and we assume that if \(\mu (\Delta )=0\), then the corresponding summand in the right-hand part is defined according to convention (3).
The argument of the proof of Lemma 3 shows that expression (27) preserves its value whether we use finite or countable measurable partitions of the space X.
The next statement is a ‘space’ variant of Theorem 10.
Theorem 12
(sup-sums principle for space partitions) Under the assumptions of Theorem 10, we have \(\rho ^{}_{F,X}(\mu ,\nu ) = \rho ^{}_F (\mu ,\nu )\). Thus, the equalities (25) and (26) are valid with \(\rho _{F,X}(\mu ,\nu )\) used instead of \(\rho _F(\mu ,\nu )\).
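The sup-sums principle for space partitions can be observed numerically in a simple case. The sketch below is an illustration under stated assumptions: \(\mu \) is Lebesgue measure on [0, 1], \(\nu \) has density 2x with respect to \(\mu \), and \(F(t) =t\ln t\) (with \(F(0)=0\)); these choices and the name `partition_sum` are hypothetical and serve only to make the integral available in closed form.

```python
import math

def F(t):
    # F(t) = t*ln(t), extended by F(0) = 0; a standard convex choice.
    return t * math.log(t) if t > 0 else 0.0

def partition_sum(k):
    """Sum of mu(D)*F(nu(D)/mu(D)) over k equal subintervals D of [0, 1],
    where mu is Lebesgue measure and nu has density 2x w.r.t. mu."""
    total = 0.0
    for i in range(k):
        a, b = i / k, (i + 1) / k
        mu_d = b - a
        nu_d = b * b - a * a      # integral of 2x over [a, b]
        total += mu_d * F(nu_d / mu_d)
    return total

# Closed form of the integral of F(2x) over [0, 1]: ln(2) - 1/2.
integral = math.log(2.0) - 0.5

# Nested dyadic partitions into 1, 2, 4, ..., 1024 intervals.
sums = [partition_sum(2 ** j) for j in range(0, 11)]
```

Along this nested sequence the sums do not decrease and stay below the integral, approaching it as the partition is refined, in agreement with the equality of the supremum and the integral.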
Remark 13
In the classical situation, i.e., for a convex function \(F\!:(0, +\infty ) \rightarrow {\mathbb {R}}\) and probability measures \(\mu \) and \(\nu \) the sup-sums principle (25), (26) with \(\rho ^{}_{F,X} (\mu ,\nu )\) was established by Vajda [24]. A different proof, based on generalized Taylor expansion of a convex function, is given in [16, Theorem 16]. The paper [16] is a good source of information on the classical F-divergence.
In many fields of analysis, one has to use continuous partitions of unity (see, in particular, Sect. 5). The next theorem presents the corresponding refinement of Theorem 10.
Theorem 14
(sup-sums principle for continuous partitions) Let X be a topological space and let \(\mu \) and \(\nu \) be two finite Borel measures, where \(\mu \) is nonnegative and \(\nu \) is real-valued. If the set C(X) of continuous functions is dense in \(L^1(X,\mu +|\nu |)\) (which is always true for a metrizable space X or, as a variant, for regular measures \(\mu \) and \(\nu \)) and F is a convex lower semicontinuous function, then (25) and (26) are valid with \(\rho _{F,c}(\,\cdot \,,\,\cdot \,)\) (13) substituted for \(\rho _{F}(\,\cdot \,,\,\cdot \,)\).
It is worth mentioning that there are at least two different ways to define the value of \(\rho _F(\mu ,\nu )\) for a measure \(\nu \) that is not absolutely continuous with respect to \(\mu \). Let us explain them in the case of finite set \(X =\{1,\dots ,K\}\). In this case, the measures \(\mu ,\,\nu \) have the form \(\mu =(\mu _1,\dots ,\mu _K)\), \(\nu =(\nu _1,\dots ,\nu _K)\) and then
$$\begin{aligned} \rho _F(\mu ,\nu ) = \sum _{i=1}^K \mu _iF\left( \frac{\nu _i}{\mu _i}\right) . \end{aligned}$$
The question is how to define the product \(\mu _iF(\nu _i/\mu _i)\) when \(\mu _i=0\) and \(\nu _i\ne 0\).
The first way (adopted in the present paper as well as in [16, 24]) is to put
$$\begin{aligned} 0F(\nu _i/0) = \begin{cases} \nu _iF^{\prime }(+\infty ), &amp; \nu _i>0,\\ \nu _iF^{\prime }(-\infty ), &amp; \nu _i<0. \end{cases} \end{aligned}$$
Under this approach, the function \(\rho _F(\mu ,\nu )\) depends continuously on \(\mu \). It is precisely this property that enables us to establish in Theorem 14 a link between sup-sums principles for measurable and continuous partitions of unity, which is indispensable for applications to the spectral objects in Sect. 5.
Alternatively, \(-\rho _F(\mu ,\nu )\) may be treated as the exponential rate for conditional probabilities of large deviations for a certain family of weighted empirical measures. Then, we have to put \(0F(\nu _i/0) =+\infty \) and, respectively, \(\rho _F(\mu ,\nu ) =+\infty \) whenever \(\nu \) is not absolutely continuous with respect to \(\mu \) (for details see [11, 13, 20, 23]).
Under the second approach to the definition of F-divergence, the analogues of all the above-stated results (Propositions 4, 5 and Theorems 7, 10, 12) can be formulated and proved. However, in this setting, they will be meaningful only for absolutely continuous measures \(\nu \), while the singular case becomes trivial.
Proof of Theorem 10
Note that (26) follows from (25) along with Theorem 6.
Let us check that for each (no matter finite or countable) measurable partition of unity G, we have
Indeed,
where we used Jensen’s inequality for the probability measure \((g/\mu [g]) \mathrm{d}\mu \) and that by convention (3) and absolute continuity of \(\nu _a\) all the summands with \(\mu [g] =0\) are zero.
From (28), it follows that the left-hand part in (25) does not exceed the right-hand one, and to finish the proof of Theorem 10 we have to verify the inequality
For the convex function F under consideration, there exists a partition of real axis by three points \(-\infty \le a\le b\le c\le +\infty \) such that
(i) \(F(y) =+\infty \) for \(y<a\) and \(y>c\);
(ii) F(y) is nonincreasing, finite and continuous on (a, b);
(iii) F(y) is nondecreasing, finite and continuous on (b, c).
Let us decompose X into seven subsets
defined, respectively, by the conditions
Some of these sets may be empty; for example, if the function F decreases on (a, c), then \(b=c\) and \(X_{(b,c)} =\emptyset \), and if F is finite everywhere, then the sets \(X_{<a}\), \(X_a\), \(X_c\), \(X_{>c}\) will be empty.
Evidently, it is enough to prove inequality (29) for each of the sets (30) separately and then sum the components. In doing so, partitions of unity G on these sets should also be defined separately.
For the sets \(X_{<a}\), \(X_a\), \(X_b\), \(X_c\), \(X_{>c}\) (some of them may be empty) inequality (29) is verified easily: it is sufficient to take a trivial partition G consisting of a single unit function on the set considered.
Now consider the set \(X_{(a,b)}\). Let us take an arbitrary number \(\varepsilon >0\) and set
Clearly, the sets \(X_i\) form a partition of \(X_{(a,b)}\) and their characteristic functions (that we denote by \(g_i\)) form a measurable partition of unity on \(X_{(a,b)}\).
Note that by monotonicity of F on (a, b), the sets \(Y_i\) are convex. Therefore, if \(\mu (X_i)>0\), then
and by definition of \(Y_i\), we have
Now, (31), (32), (33) imply that
By arbitrariness of \(\varepsilon \), this implies inequality (29) for the set \(X_{(a,b)}\).
For the set \(X_{(b,c)}\), it is verified in the same way. Thus, Theorem 10 is proved. \(\square \)
Proof of Corollary 11
Take \(\nu \) such that \(\mathrm{d}\nu /\mathrm{d}\mu =f\) and apply (25). \(\square \)
Proof of Theorem 12
Every space partition \(\Delta _1,\dots ,\Delta _k\) is defined by the partition of unity consisting of the corresponding characteristic functions. Thus, \(\rho _{F,X}(\mu ,\nu ) \le \rho _F(\mu ,\nu )\). The rest of the proof coincides with the ending part of the proof of Theorem 10 (starting from formula (29)), where only the space partitions are used. \(\square \)
Proof of Theorem 14
Apply Theorem 10 along with Theorem 7 bearing in mind Remark 8. \(\square \)
4 Sup-Sums Principle for Kullback–Leibler Divergence, etc
If \(\mu \) and \(\nu \) are probability measures on \((X,{\mathfrak {A}})\) and \(\mu \) is absolutely continuous with respect to \(\nu \), then Kullback–Leibler divergence \(D_{\mathrm {KL}}\) is defined as
$$\begin{aligned} D_{\mathrm {KL}}(\mu \Vert \nu ) = \int _X \ln \left( \frac{\mathrm{d}\mu }{\mathrm{d}\nu }\right) \mathrm{d}\mu . \qquad (34) \end{aligned}$$
The principal philosophy behind the results we are going to discuss is not new. Namely, an analogue of Theorem 15 for space partitions of X (cf. (27)) goes back to Gelfand, Kolmogorov, and Yaglom [15]. However, the results obtained in the foregoing section give this field a new flavour, and among the basic novelties here is the use of continuous partitions of unity (see, in particular, Remarks 16 and 17), which serves as an indispensable tool in the analysis of the objects in Sect. 5.
The results of the foregoing section lead to the next
Theorem 15
Under the above conditions on \(\mu \) and \(\nu \),
where \(\nu _a\) is the absolutely continuous component of \(\nu \) with respect to \(\mu \) and the supremum is taken over all (finite or countable) measurable partitions of unity G on X, and we assume that if \(\mu [g]=0\), then the corresponding summand in the sums vanishes regardless of the second multiplier \(\ln (\mu [g]/\nu [g])\) or \(\ln (\mu [g]/\nu _a[g])\).
Proof
According to (2), we have \(-\ln ^{\prime }(+\infty ) =0\). Hence, by Theorem 10,
It is easily seen that outside a set of zero measure \(\mu \),
Therefore,
From (37), (38), we obtain equalities (35).
Recall that \(\mu \) is absolutely continuous with respect to \(\nu \) and hence with respect to \(\nu _a\) as well. So if \(\mu [g]\ne 0\), then \(\nu [g]\ne 0\) and \(\nu _a[g]\ne 0\) for any element g of a measurable partition of unity on X. From this and definition (4) of \(\rho _{-\ln }(\mu ,\nu )\), it follows that
where all summands with \(\mu [g] =0\) are supposed to be zero. The analogous equality for \(\rho _{-\ln }(\mu ,\nu _a)\) may be got in the same way. Thus, Theorem 15 is proved. \(\square \)
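On a finite space, the statement of Theorem 15 can be checked numerically: every partition of unity gives a sum \(\sum _g \mu [g]\ln (\mu [g]/\nu [g])\) not exceeding the Kullback–Leibler divergence, and the partition into indicator functions of points attains it. The sketch below is an illustration only; the vectors `mu`, `nu` and all helper names are hypothetical, and partitions of unity are sampled at random.

```python
import math
import random

def kl(mu, nu):
    """Kullback-Leibler divergence of mu from nu on a finite set."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def partition_sum(mu, nu, G):
    """Sum over g in G of mu[g]*ln(mu[g]/nu[g]); terms with mu[g] = 0 vanish."""
    total = 0.0
    for g in G:
        mg = sum(gi * m for gi, m in zip(g, mu))
        ng = sum(gi * n for gi, n in zip(g, nu))
        if mg > 0:
            total += mg * math.log(mg / ng)
    return total

def random_partition_of_unity(size, parts, rng):
    """Rows g_1..g_parts of nonnegative weights whose columns sum to 1."""
    cols = []
    for _ in range(size):
        w = [rng.random() for _ in range(parts)]
        s = sum(w)
        cols.append([v / s for v in w])
    return [list(row) for row in zip(*cols)]

mu = [0.1, 0.2, 0.3, 0.4]
nu = [0.4, 0.3, 0.2, 0.1]
rng = random.Random(0)

exact = kl(mu, nu)
indicator_partition = [[1.0 if i == j else 0.0 for i in range(4)] for j in range(4)]
finest = partition_sum(mu, nu, indicator_partition)
random_values = [partition_sum(mu, nu, random_partition_of_unity(4, 3, rng))
                 for _ in range(200)]
```

The inequality for each sampled partition is an instance of the log-sum inequality, which is the finite counterpart of the Jensen-type estimate used in the proof of Theorem 10.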
Remark 16
The theorem just proved along with formula (38) naturally suggests an extension of the definition of Kullback–Leibler divergence onto measures that are neither necessarily probability ones, nor mutually absolutely continuous. Namely, for any finite positive measures \(\mu \), \(\nu \) on a measurable space \((X,\mathfrak A)\) let us define the generalized Kullback–Leibler divergence \(D_{\mathrm {KL}}(\mu \Vert \nu )\) by the formula
The reasoning from the proof of Theorem 15 shows that \(D_{\mathrm {KL}} (\mu \Vert \nu )\) defined in this way satisfies equalities (35) and (36) as well.
Remark 17
If X is a topological space and \(\mu \) and \(\nu \) are Borel measures such that the set C(X) of continuous functions is dense in \(L^1(X,\mu )\) and \(L^1(X,\nu )\) (which is always true for a metrizable space X or, as a variant, for regular measures \(\mu \), \(\nu \)) then recalling Theorems 12 and 14 one concludes that when applying (35) and (36) to definition (39), we can equally use continuous (finite or countable) partitions of unity.
Remark 18
As is well known, apart from Kullback–Leibler divergence, many common divergences are special cases of F-divergence corresponding to a suitable choice of F. For example, Hellinger distance corresponds to the function \(F(t) =1-\sqrt{t}\), total variation distance corresponds to \(F(t) =|t-1|\), Pearson \(\chi ^2\)-divergence corresponds to \(F(t) =(t-1)^2\), and for the function \(F(t) =(t^\alpha -t)/(\alpha ^2 -\alpha )\), we obtain the so-called \(\alpha \)-divergence.
Thus, by choosing the corresponding convex functions F, one can write out the ‘sup-sums principles’ of Theorem 15 type for them where again one can exploit not only measurable but also continuous partitions of unity. Moreover, for example, for total variation distance, Pearson \(\chi ^2\)-divergence and \(\alpha \)-divergence one naturally arrives at consideration of real-valued (not necessarily nonnegative) measures.
Remark 19
In the paper [22], a result of the Theorem 15 type was established for a sigma-finite measure \(\nu \) and a measure \(\mu \) which is absolutely continuous with respect to \(\nu \).
5 New Definition for t-Entropy
In this section, as an application of Theorems 14 and 15, we obtain a new formula for t-entropy that clarifies its relationship with Kullback–Leibler divergence.
The t-entropy (we recall its definition below) is a principal object of spectral analysis of operators associated with dynamical systems. In particular, in the series of articles [3,4,5,6,7,8], a relation between t-entropy and spectral radii of the corresponding operators has been established. Namely, it was shown that t-entropy is the Fenchel–Legendre dual to the spectral exponent of operators in question.
For transparency of presentation, let us recall the mentioned objects and results.
Hereafter, X is a Hausdorff compact space, C(X) is the algebra of continuous functions on X taking real values and equipped with the max-norm, and \(\alpha \!:X\rightarrow X\) is an arbitrary continuous mapping. The corresponding dynamical system will be denoted by \((X,\alpha )\).
Recall that a transfer operator \(A\!:C(X)\rightarrow C(X)\), associated with a given dynamical system, is defined in the following way:
(a) A is a positive linear operator (i.e., it maps nonnegative functions to nonnegative ones); and
(b) the following homological identity for A is valid:
$$\begin{aligned} A(g \circ \alpha \cdot f) = gAf, \quad g,f\in C(X). \end{aligned}$$
As an important and popular example of a transfer operator, one can take the classical Perron–Frobenius operator, that is, the operator of the form
$$\begin{aligned} Af(x) = \sum _{y\in \alpha ^{-1}(x)} a(y)f(y), \end{aligned}$$
where \(a\in C(X)\) is fixed. This operator is well defined when \(\alpha \) is a local homeomorphism.
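The two defining properties of a transfer operator can be verified directly for a Perron–Frobenius-type operator on a finite space. The following is a minimal sketch under stated assumptions: X is a hypothetical four-point space with the discrete topology (so every function is continuous and every mapping is a local homeomorphism), the operator is taken as a weighted sum over preimages, and the nonnegative weight `a` is an illustrative choice.

```python
# Hypothetical finite model: X = {0, 1, 2, 3} with the discrete topology;
# functions on X are stored as lists of values.
X = range(4)
alpha = [1, 2, 0, 0]            # alpha(0)=1, alpha(1)=2, alpha(2)=0, alpha(3)=0
a = [0.5, 2.0, 0.7, 0.3]        # fixed nonnegative weight function

def A(f):
    """Weighted sum of f over the preimages of each point under alpha."""
    return [sum(a[y] * f[y] for y in X if alpha[y] == x) for x in X]

f = [1.0, -2.0, 3.0, 0.5]
g = [2.0, 0.0, -1.0, 4.0]

g_after_alpha = [g[alpha[y]] for y in X]                  # g o alpha
lhs = A([g_after_alpha[y] * f[y] for y in X])             # A(g o alpha * f)
rhs = [g[x] * v for x, v in zip(X, A(f))]                 # g * Af
```

The homological identity holds here because \(g\circ \alpha \) is constant on each preimage \(\alpha ^{-1}(x)\) and can be taken out of the sum; the point 3 has no preimage, so both sides vanish there.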
Let A be a certain transfer operator in C(X). In what follows, we denote by \(A_\varphi \) the family of transfer operators in C(X) given by the formula
$$\begin{aligned} A_\varphi f = A(e^\varphi f), \qquad \varphi \in C(X). \end{aligned}$$
Let us denote by \(\lambda (\varphi )\) the spectral potential of \(A_\varphi \), defined by the formula
$$\begin{aligned} \lambda (\varphi ) = \ln r(A_\varphi ), \end{aligned}$$
where \(r(A_\varphi )\) is the spectral radius of operator \(A_\varphi \).
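For orientation, the spectral potential can be computed in a toy case. The sketch below rests on assumptions that should be read as such: the weighted family is taken in the form \(A_\varphi f =A(e^\varphi f)\), the spectral potential as \(\ln r(A_\varphi )\), X is a hypothetical three-point space with the cyclic shift \(\alpha \), and the spectral radius is estimated through Gelfand's formula, using the fact that the norm of a positive operator on C(X) is attained on the unit function.

```python
import math

# Hypothetical finite model: X = {0, 1, 2}, alpha(x) = x + 1 (mod 3), and
# A f = f o alpha^{-1} is a transfer operator for (X, alpha).
N = 3

def A(f):
    return [f[(x - 1) % N] for x in range(N)]

def A_phi(f, phi):
    # Assumed form of the weighted family: A_phi f = A(e^phi * f).
    return A([math.exp(p) * v for p, v in zip(phi, f)])

def spectral_potential(phi, n=30):
    """Estimate ln r(A_phi) as (1/n) * ln max(A_phi^n 1)."""
    f = [1.0] * N
    for _ in range(n):
        f = A_phi(f, phi)
    return math.log(max(f)) / n

phi = [0.3, -1.0, 2.2]
lam = spectral_potential(phi)
lam_shifted = spectral_potential([p + 0.5 for p in phi])
```

In this example the third power of \(A_\varphi \) is multiplication by \(e^{\varphi (0)+\varphi (1)+\varphi (2)}\), so the estimate with n divisible by 3 coincides with the mean value of \(\varphi \); in general a finite n gives only an approximation. Adding a constant c to \(\varphi \) shifts the result by c.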
We denote by M(X) the set of all Borel probability measures on X. Recall that a measure \(\mu \in M(X)\) is called \(\alpha \)-invariant iff \(\mu (g) =\mu (g\circ \alpha )\) for all \(g\in C(X)\). The family of \(\alpha \)-invariant probability measures on X is denoted by \(M_\alpha (X)\).
The t-entropy is a certain functional on M(X) denoted by \(\tau (\mu )\) (its detailed definition will be given below).
The substantial importance of t-entropy is demonstrated by the following variational principle.
Theorem 20
( [6], Theorem 5.6) Let \(A\!: C(X)\rightarrow C(X)\) be a transfer operator for a continuous mapping \(\alpha \!:X\rightarrow X\) of a compact Hausdorff space X. Then,
One readily notes the resemblance of this theorem to the Ruelle–Walters variational principle for the topological pressure [21, 25, 26], which reveals its relation to Kolmogorov–Sinai entropy.
Among the principal ingredients in the proofs of the results leading to Theorem 20 is the so-called ‘entropy statistic theorem’. This theorem plays in the spectral theory of weighted shift and transfer operators the role analogous to Shannon–McMillan–Breiman theorem in information theory [1, 18] and its important corollary known as ‘asymptotic equipartition property’ [10, p. 135]. The variational principles containing t-entropy and the objects therein serve as key ingredients of the thermodynamical formalism (see [4, 7, 17] and the sources quoted there).
Being so important, t-entropy is at the same time a rather sophisticated object to calculate. A description of t-entropy that does not rely on Fenchel–Legendre duality is not elementary, and it took substantial time and effort to obtain its ‘accessible’ definition.
Namely, originally t-entropy \(\tau (\mu )\) was defined in three steps (see, for example, [6]).
Definition 1
Firstly, for a given \(\mu \in M(X)\), any \(n\in {\mathbb {N}}\), and any continuous partition of unity \(G =\{g_1,\dots ,g_k\}\) we set
Here, if \(\mu [g_i] = 0\) for some \(g_i\in G\), then the corresponding summand in (40) is assumed to be zero regardless of the value of \(m[A^ng_i]\); if \(m[A^ng_i] = 0\) for some \(g_i\in G\) and at the same time \(\mu [g_i]>0\), then \(\tau _n(\mu ,G) = -\infty \).
Secondly, we put
here the infimum is taken over all continuous partitions of unity G in C(X).
And finally, the t-entropy \(\tau (\mu )\) is defined as
Recently, in [9], the authors proved that for \(\mu \in M_\alpha (X)\) (note that only such measures are essential for Theorem 20), the t-entropy could be defined in two steps.
Definition 2
First, we set
\[\tau _n(\mu ) := \inf _G \sum _{g\in G} \mu [g] \ln \frac{\mu [A^ng]}{\mu [g]}, \tag{42}\]
where the infimum is taken over the set of all continuous partitions of unity G, and we assume that if \(\mu [g]=0\), then the corresponding summand on the right-hand side of the equality is equal to 0 independently of the value of \(\mu [A^n g]\).
Now, \(\tau (\mu )\) is defined as
\[\tau (\mu ) := \inf _{n\in {\mathbb {N}}} \frac{\tau _n(\mu )}{n}.\]
In other words, in the original definition of t-entropy, one does not need to calculate the supremum in (40) but can simply put \(m=\mu \) there. In [9], it was proved that this leads to the same value of \(\tau _n(\mu )\) in (42) as in (41).
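On a finite set the partition sums discussed above can be evaluated directly, which may help to see why refining the partition drives the sum down to a Kullback–Leibler quantity. The following is only a sketch under assumptions that are not part of the text: X is a finite set, measures and functions are plain Python lists, the summand is taken in the form \(\mu [g]\ln (\nu [g]/\mu [g])\) with \(\nu [g]\) standing in for \(\mu [A^ng]\), and the names `pair` and `partition_sum` are hypothetical.

```python
import math

def pair(measure, g):
    """Integrate a function g against a measure on a finite set: measure[g]."""
    return sum(m * v for m, v in zip(measure, g))

def partition_sum(mu, nu, G):
    """Sum over g in G of mu[g] * ln(nu[g] / mu[g]).

    Conventions as in the text: a summand with mu[g] == 0 is zero,
    and the whole sum is -inf if nu[g] == 0 while mu[g] > 0.
    """
    total = 0.0
    for g in G:
        a, b = pair(mu, g), pair(nu, g)
        if a == 0.0:
            continue
        if b == 0.0:
            return -math.inf
        total += a * math.log(b / a)
    return total

# Toy data (assumed for illustration only).
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]          # plays the role of the functional g -> mu[A^n g]

trivial = [[1.0, 1.0, 1.0]]                      # G = {1}
soft = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]        # overlapping partition of unity
finest = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # point indicators

values = [partition_sum(mu, nu, G) for G in (trivial, soft, finest)]
minus_kl = -sum(p * math.log(p / q) for p, q in zip(mu, nu))

print(values)    # decreasing from the trivial to the finest partition
print(minus_kl)  # agrees with the value on the finest partition
```

In this toy setting the log-sum inequality shows that every partition of unity gives a sum no smaller than the one for point indicators, so the infimum equals \(-D_{KL}(\mu \Vert \nu )\).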
Of course, two steps are better than three, but even this two-step definition of t-entropy is still rather sophisticated.
Note parenthetically that, if one identifies a Borel measure \(\mu \) on X with a linear functional \(\mu \!: C(X) \rightarrow {\mathbb {R}}\) given by
\[\mu [f] = \int _X f\, \mathrm{d}\mu , \qquad f\in C(X),\]
then, by Riesz’s theorem, there exists a unique regular Borel measure on X defining the same functional. Thus, since the definition of t-entropy leans only on integration of continuous functions (forming partitions of unity), it suffices to determine the t-entropy only for regular measures \(\mu \) (which are the measures considered, for instance, in Theorem 14).
The next theorem in essence gives a new definition of t-entropy and transparently establishes its relation to Kullback–Leibler divergence.
Theorem 21
(t-entropy via Kullback–Leibler divergence) Let A be a transfer operator for a dynamical system \((X, \alpha )\). Then, for any regular measure \(\mu \in M_\alpha (X)\), we have
and
where \(A^*\!: C(X)^* \rightarrow C(X)^*\) is the operator adjoint to A.
Proof
Apply Theorem 15 and Remark 17 to (42). Namely, set \(\nu = A^{*n}\mu \) in this equality (so that \(\mu [A^ng] = \nu [g]\)) and apply formulae (34)–(36). \(\square \)
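The substitution \(\nu = A^{*n}\mu \) used in the proof can be traced numerically on a finite dynamical system. The sketch below rests on several assumptions that are ours rather than the article's: X is finite, the transfer operator has the weighted form \((Af)(x)=\sum _{\alpha (y)=x} w(y)f(y)\), the infimum over partitions is attained on point indicators (which holds on a finite set by the log-sum inequality), so that \(\tau _n(\mu )=\sum _x \mu (x)\ln (\nu (x)/\mu (x))\), and \(\tau (\mu )\) is obtained from the ratios \(\tau _n(\mu )/n\). All function names are hypothetical.

```python
import math

def transfer_matrix(alpha, w):
    """Matrix of (A f)(x) = sum of w[y] * f(y) over y with alpha[y] == x."""
    k = len(alpha)
    return [[w[y] if alpha[y] == x else 0.0 for y in range(k)] for x in range(k)]

def apply_operator(A, f):
    """(A f)(x) for a function f given as a list of values."""
    return [sum(a * v for a, v in zip(row, f)) for row in A]

def apply_adjoint(mu, A):
    """(A* mu)(y) = sum_x mu(x) * A[x][y], so that (A* mu)[f] = mu[A f]."""
    k = len(mu)
    return [sum(mu[x] * A[x][y] for x in range(k)) for y in range(k)]

def minus_kl(mu, nu):
    """sum_x mu(x) ln(nu(x) / mu(x)); nu need not be normalized."""
    total = 0.0
    for m, v in zip(mu, nu):
        if m == 0.0:
            continue          # zero-mass points contribute nothing
        if v == 0.0:
            return -math.inf  # mu is not absolutely continuous w.r.t. nu
        total += m * math.log(v / m)
    return total

def tau_n(mu, A, n):
    """tau_n(mu) computed as minus_kl(mu, A*^n mu) in the finite toy setting."""
    nu = list(mu)
    for _ in range(n):
        nu = apply_adjoint(nu, A)
    return minus_kl(mu, nu)

# Toy system (assumed): cyclic shift on three points with positive weights.
alpha = [1, 2, 0]
w = [1.0, 2.0, 4.0]
A = transfer_matrix(alpha, w)
mu = [1 / 3, 1 / 3, 1 / 3]    # invariant under the cyclic shift

ratios = [tau_n(mu, A, n) / n for n in range(1, 5)]
print(ratios)                 # each ratio is close to ln 2, the mu-average of ln w
```

For this particular example the ratios \(\tau _n(\mu )/n\) do not depend on n, so the toy value of \(\tau (\mu )\) is simply the average of \(\ln w\) with respect to \(\mu \); for non-invertible \(\alpha \) the ratios generally vary with n.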
References
1. Algoet, P.H., Cover, T.M.: A sandwich proof of the Shannon–McMillan–Breiman theorem. Ann. Probab. 16(2), 899–909 (1988)
2. Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28(1), 131–142 (1966)
3. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: Thermodynamics and spectral radius. Nonlinear Phenom. Complex Syst. 4(4), 318–321 (2001)
4. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V., Sarzhinsky, D.S.: Legendre analysis, thermodynamic formalism and spectra of Perron–Frobenius operators. Dokl. Math. 67(3), 343–345 (2003)
5. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: Spectra of operators associated with dynamical systems: from ergodicity to the duality principle. In: Twenty Years of Bialowieza: A Mathematical Anthology. World Scientific Monograph Series in Mathematics, Chapter 7, vol. 8, pp. 129–161 (2005)
6. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: On \(t\)-entropy and variational principle for the spectral radii of transfer and weighted shift operators. Ergod. Theory Dyn. Syst. 31, 995–1045 (2011)
7. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: A road to the spectral radius of transfer operators. Contemp. Math. 567, 17–51 (2012)
8. Bakhtin, V.I.: On \(t\)-entropy and variational principle for the spectral radius of weighted shift operators. Ergod. Theory Dyn. Syst. 30, 1331–1342 (2010)
9. Bakhtin, V.I., Lebedev, A.V.: A new definition of \(t\)-entropy for transfer operators. Entropy 19(573), 1–6 (2017)
10. Billingsley, P.: Ergodic Theory and Information. Wiley, New York (1965)
11. Broniatowski, M., Keziou, A.: Minimization of divergences on sets of signed measures. Stud. Sci. Math. Hung. 43(4), 403–442 (2006)
12. Broniatowski, M., Keziou, A.: Parametric estimation and tests through divergences and duality technique. J. Multivar. Anal. 100(1), 16–36 (2009)
13. Broniatowski, M.: A weighted bootstrap procedure for divergence minimization problems. In: Analytical Methods in Statistics. Springer Proceedings in Mathematics & Statistics, vol. 193, pp. 1–22 (2015)
14. Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoffschen Ketten. Magyar. Tud. Akad. Mat. Kutato Int. Kozl 8, 85–108 (1963)
15. Gelfand, I.M., Kolmogorov, A.N., Yaglom, A.M.: On the general definition of the amount of information. Dokl. Akad. Nauk SSSR 11, 745–748 (1956)
16. Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52(10), 4394–4412 (2006)
17. Lopes, A., Mengue, J., Mohr, J., Souza, R.: Entropy and variational principle for one-dimensional lattice systems with a general a priori probability: positive and zero temperature. Ergod. Theory Dyn. Syst. 35, 1925–1961 (2015)
18. McMillan, B.: The basic theorems of information theory. Ann. Math. Stat. 24, 196–219 (1953)
19. Morimoto, T.: Markov processes and the H-theorem. J. Phys. Soc. Jpn. 18(3), 328–331 (1963)
20. Najim, J.: A Cramér type theorem for weighted random variables. Electron. J. Probab. 7(4), 1–32 (2002)
21. Ruelle, D.: Statistical mechanics on a compact set with \(Z^\nu \) action satisfying expansiveness and specification. Trans. Am. Math. Soc. 185, 237–252 (1973)
22. Sokol, E.E.: Introduction of the Kullback–Leibler information function by means of partitions of the probability space. J. Belarus. State Univ. Math. Inform. 1, 59–67 (2018)
23. Trashorras, J., Wintenberger, O.: Large deviations for bootstrapped empirical measures. Bernoulli 20(4), 1845–1878 (2014)
24. Vajda, I.: On the \(f\)-divergence and singularity of probability measures. Period. Math. Hung. 1, 223–234 (1972)
25. Walters, P.: A variational principle for the pressure of continuous transformations. Am. J. Math. 97(4), 937–971 (1975)
26. Walters, P.: An Introduction to Ergodic Theory. Springer, Berlin (1982)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bakhtin, V.I., Lebedev, A.V. Sup-Sums Principles for F-Divergence and a New Definition for t-Entropy. J Theor Probab 35, 350–369 (2022). https://doi.org/10.1007/s10959-020-01046-5
DOI: https://doi.org/10.1007/s10959-020-01046-5