Sup-Sums Principles for F-Divergence and a New Definition for t-Entropy

Bakhtin, V. I.; Lebedev, A. V.

doi:10.1007/s10959-020-01046-5

Open access
Published: 04 November 2020

Volume 35, pages 350–369, (2022)

Journal of Theoretical Probability

Abstract

The article presents new ${\sup }$-sums principles for integral F-divergence for arbitrary convex functions F on the whole real axis and arbitrary (not necessarily positive and normalized) measures. Among applications of these results, we work out a new ‘integral’ definition for t-entropy explicitly establishing its relation to Kullback–Leibler divergence.

1 Introduction

The notion of F-divergence was introduced and originally studied in the analysis of probability distributions in [2, 14, 19]. It is defined as follows. Let Q and P be two probability distributions over a space $\Omega $ such that Q is absolutely continuous with respect to P. Then, for a convex function $F\!: {\mathbb {R}}_+ \rightarrow {\mathbb {R}}$ such that $F(1) = 0$, the F-divergence $D_F (Q\Vert P)$ of Q from P is defined as

$$\begin{aligned} D_F (Q\Vert P) := \int _\Omega F\bigg (\frac{\mathrm{d}Q}{\mathrm{d}P}\bigg )\, \mathrm{d}P, \end{aligned}$$

(1)

where $\mathrm{d}Q/\mathrm{d}P$ is the Radon–Nikodym derivative of Q with respect to P.

Since its introduction, the F-divergence has been intensively studied, because appropriate choices of the function F yield numerous important divergences, such as the Kullback–Leibler divergence, the Hellinger distance, and the Pearson $\chi ^2$-divergence.
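As a minimal illustration, assume a finite space $\Omega $, so that formula (1) reduces to a finite sum. The function names below are hypothetical, and the generators F are the standard textbook choices rather than anything specified in the article.

```python
import math

def f_divergence(F, q, p):
    """D_F(Q||P) = sum_i p_i * F(q_i / p_i) on a finite space.

    Assumes p_i > 0 for every i, so Q is absolutely continuous w.r.t. P.
    """
    return sum(pi * F(qi / pi) for qi, pi in zip(q, p))

# Convex generators with F(1) = 0 (standard choices, assumed here).
def kl(t):
    # Kullback-Leibler divergence: F(t) = t ln t, with 0 ln 0 := 0
    return t * math.log(t) if t > 0 else 0.0

def pearson(t):
    # Pearson chi^2-divergence: F(t) = (t - 1)^2
    return (t - 1.0) ** 2

def hellinger(t):
    # squared Hellinger distance in one common normalization: F(t) = (sqrt(t) - 1)^2
    return (math.sqrt(t) - 1.0) ** 2

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

d_kl = f_divergence(kl, q, p)        # equals sum_i q_i ln(q_i / p_i)
d_chi2 = f_divergence(pearson, q, p)  # equals sum_i (q_i - p_i)^2 / p_i
d_hel = f_divergence(hellinger, q, p)
```

In this toy example the Kullback–Leibler value equals $\tfrac{1}{4}\ln 2$ and the Pearson value equals $0.375$, which can be confirmed by hand from the direct formulas in the comments.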

A comprehensive analysis of F-divergence was worked out by Liese and Vajda in [16] where a sup-sums principle for space partitions was established as well [16, Theorem 16].

In fact, formula (1) can be extended to arbitrary real-valued measures Q. Moreover, for such measures Q, the value $D_F (Q\Vert P)$ has a substantial statistical meaning. In [13, 20, 23], it was shown that the value $e^{-nD_F (Q\Vert P)}$ determines the asymptotics of conditional large deviation probabilities for a certain family of weighted empirical measures that are close to Q, where F is the large deviation rate function for the sequence of random weights. In [12], F-divergences for real-valued measures Q were applied to parametric estimation and testing.

The object of the present article is a general F-divergence associated with an arbitrary convex function F that is defined on the whole real axis and may take infinite values, and with arbitrary real-valued (not necessarily positive, normalized, or absolutely continuous) measures. For this F-divergence, we derive a number of new sup-sums principles that exploit both measurable and continuous partitions of unity (Theorems 10, 12, and 14 of the article). In particular, they describe the passage from the F-divergence on a finite phase space to the F-divergence on an arbitrary measurable space. This passage involves the additional components $F^{\prime }(\pm \infty )$.

On the basis of the ${\sup }$-sums principles obtained, we derive the corresponding ${\sup }$-sums principle for the Kullback–Leibler divergence (Theorem 15), which also leads naturally to a new definition of this divergence for measures that are not probability measures. The initial variant of the sup-sums principle for a particular case of F-divergence (mutual information) for mutually absolutely continuous probability measures was established by Gelfand, Kolmogorov, and Yaglom in [15].

As one more substantial application of the integral ${\sup }$-sums principles, we obtain a new formula for t-entropy. The t-entropy plays a fundamental role in the spectral analysis of operators associated with dynamical systems (cf. Theorem 20) and is a key ingredient in the ‘entropy statistic theorem’. In the spectral theory of weighted shift and transfer operators, the latter statement plays a role analogous to that of the Shannon–McMillan–Breiman theorem in information theory [1, 18] and its important corollary known as the ‘asymptotic equipartition property’ [10, p. 135]. Up to now, the definition of t-entropy has been formulated in a rather sophisticated manner in terms of actions of transfer operators on continuous partitions of unity (for more details, see Sect. 5). In Theorem 21, we give a fundamentally new ‘integral’ definition of t-entropy that explicitly establishes its relation to the Kullback–Leibler divergence.

2 Sup-Sums F-Divergence

Consider an arbitrary convex function $F\!:{\mathbb {R}}\rightarrow (-\infty , +\infty ]$. Let

$$\begin{aligned} F^{\prime }(+\infty ) :=\lim _{t\rightarrow +\infty }\frac{F(t)}{t}, \quad F^{\prime }(-\infty ) :=\lim _{t\rightarrow -\infty } \frac{F(t)}{t}. \end{aligned}$$

(2)

Obviously, both limits do exist, and the value of $F^{\prime }(+\infty )$ may be finite or equal to $+\infty $ while $F^{\prime }(-\infty )$ may be finite or equal to $-\infty $.
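For orientation, the limits (2) can be computed by hand for a few standard convex functions. These examples are our own illustration and are not taken from the article; in the third one we assume the usual extension $F(0)=0$ and $F(t)=+\infty $ for $t<0$.

$$\begin{aligned} F(t)&=|t-1|:&\quad F^{\prime }(+\infty )&=1,&\quad F^{\prime }(-\infty )&=-1, \\ F(t)&=t^2:&\quad F^{\prime }(+\infty )&=+\infty ,&\quad F^{\prime }(-\infty )&=-\infty , \\ F(t)&=t\ln t \ \ (t>0):&\quad F^{\prime }(+\infty )&=+\infty ,&\quad F^{\prime }(-\infty )&=-\infty , \\ F(t)&=e^t:&\quad F^{\prime }(+\infty )&=+\infty ,&\quad F^{\prime }(-\infty )&=0. \end{aligned}$$

In the third line, $F(t)/t = -\infty $ for every $t<0$, which gives $F^{\prime }(-\infty ) = -\infty $.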

We adopt the following agreement. The product 0F(x/0) for $x\ne 0$ will be defined as the limit $\lim _{t\rightarrow +0} tF(x/t)$, and for $x=0$, it will be assumed to be zero. In other words,

$$\begin{aligned} 0F\bigg (\frac{x}{0}\bigg ) \,=\, {\left\{ \begin{array}{ll} xF^{\prime }(+\infty ), &{} \text {if}\ \, x>0, \\ xF^{\prime }(-\infty ), &{} \text {if}\ \, x<0, \\ 0, &{} \text {if}\ \, x=0. \end{array}\right. } \end{aligned}$$

(3)
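Convention (3) can be sketched in code as follows. The helper name `perspective` and its parameters are hypothetical; the values of $F^{\prime }(\pm \infty )$ are passed in by the caller because they cannot be read off numerically from F.

```python
def perspective(F, s, x, slope_plus, slope_minus):
    """Return s * F(x / s) for s >= 0, using convention (3) when s == 0.

    slope_plus and slope_minus stand for F'(+inf) and F'(-inf).
    """
    if s < 0:
        raise ValueError("s must be nonnegative")
    if s > 0:
        return s * F(x / s)
    if x > 0:
        return x * slope_plus
    if x < 0:
        return x * slope_minus
    return 0.0  # x == 0: the product is zero by convention

# Example (an assumption for illustration): F(t) = |t - 1|,
# for which F'(+inf) = 1 and F'(-inf) = -1.
def tv(t):
    return abs(t - 1.0)

at_zero = perspective(tv, 0.0, 3.0, 1.0, -1.0)     # x * F'(+inf)
near_zero = perspective(tv, 1e-9, 3.0, 1.0, -1.0)  # t * F(x / t) for small t > 0
negative = perspective(tv, 0.0, -2.0, 1.0, -1.0)   # x * F'(-inf)
```

For this particular F, the value at $s=0$ agrees with the value at a very small positive $s$ up to rounding, which is exactly what the limit in the convention expresses.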

Let $\mu $ be a finite nonnegative measure and let $\nu $ be a finite real-valued measure, both defined on the same measurable space $(X,{\mathfrak {A}})$. For measurable functions g on $(X,{\mathfrak {A}})$, we write

$$\begin{aligned} \mu [g] := \int _X g\,\mathrm{d}\mu , \quad \nu [g] := \int _X g\,\mathrm{d}\nu \end{aligned}$$

(provided the integrals do exist).

By a measurable partition of unity, we will understand a finite set $G = \{g_1,\dots ,g_k\}$ of nonnegative measurable functions on $(X,{\mathfrak {A}})$ such that $\sum _i g_i \equiv 1$.

Now we introduce the main object of the paper. For any convex function $F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]$, the sup-sums F-divergence $\rho ^{}_F(\mu ,\nu )$ is defined as

$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) :=\, \sup _G\sum _{g\in G} \mu [g] F\bigg (\frac{\nu [g]}{\mu [g]}\bigg ), \end{aligned}$$

(4)

where the supremum is taken over the set of all measurable partitions of unity G, and we assume that if $\mu [g]=0$, then the corresponding summand in the right-hand part is defined according to convention (3).
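To make definition (4) concrete, the following sketch evaluates the sum for individual partitions of unity on a three-point space. It is a simplified illustration under stated assumptions: the space is finite, every $\mu [g]$ is positive (so convention (3) is not needed), and the names `partition_sum` and `square` are hypothetical.

```python
def partition_sum(F, mu, nu, partition):
    """Evaluate the sum in (4) for one partition of unity on a finite space.

    mu, nu: lists of point masses; partition: list of functions g, each given
    as a list of values g(x). Assumes mu[g] > 0 for every g.
    """
    n = len(mu)
    for x in range(n):
        if abs(sum(g[x] for g in partition) - 1.0) > 1e-12:
            raise ValueError("the functions do not sum to 1")
    total = 0.0
    for g in partition:
        mu_g = sum(g[x] * mu[x] for x in range(n))
        nu_g = sum(g[x] * nu[x] for x in range(n))
        if mu_g <= 0:
            raise ValueError("this sketch assumes mu[g] > 0")
        total += mu_g * F(nu_g / mu_g)
    return total

def square(t):
    return t * t

mu = [0.5, 0.3, 0.2]
nu = [0.4, -0.1, 0.2]   # a real-valued (signed) measure

trivial = partition_sum(square, mu, nu, [[1, 1, 1]])
soft = partition_sum(square, mu, nu, [[1, 0.5, 0], [0, 0.5, 1]])
points = partition_sum(square, mu, nu, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
```

Here the trivial partition gives the smallest value and the partition into indicator functions of single points gives the largest of the three, which is consistent with the supremum in (4) being approached by finer partitions.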

The relation of the sup-sums F-divergence to the usual (integral) F-divergence will be uncovered in the next section.

The following two lemmas present important properties of the function sF(x/s) used in the definition (4).

Lemma 1

For any convex function F and all $s,t\ge 0$ and $x,y\in {\mathbb {R}},$

$$\begin{aligned} (s+t)F\bigg (\frac{x+y}{s+t}\bigg ) \le sF\bigg (\frac{x}{s}\bigg ) + tF\bigg (\frac{y}{t}\bigg ). \end{aligned}$$

(5)
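As a worked instance of (5), chosen by us for illustration, take $F(t) = t^2$ and $s,t>0$. The gap between the two sides can then be computed explicitly:

$$\begin{aligned} sF\bigg (\frac{x}{s}\bigg ) + tF\bigg (\frac{y}{t}\bigg ) - (s+t)F\bigg (\frac{x+y}{s+t}\bigg ) = \frac{x^2}{s} +\frac{y^2}{t} -\frac{(x+y)^2}{s+t} = \frac{(tx-sy)^2}{st(s+t)} \ge 0, \end{aligned}$$

with equality exactly when $x/s = y/t$.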

Each convex function F on the real axis is superlinear in the sense that it admits an affine minorant, i.e.,

$$\begin{aligned} F(t) \ge At +B \end{aligned}$$

(6)

for some constants $A,B\in {\mathbb {R}}$ and all $t\in {\mathbb {R}}$.

Lemma 2

If a convex function F satisfies condition (6), then for all $s\ge 0$ and $x\in {\mathbb {R}},$

$$\begin{aligned} sF\bigg (\frac{x}{s}\bigg ) \ge Ax +Bs. \end{aligned}$$

(7)

Now we proceed to a description of the technical properties of $\rho ^{}_F(\mu ,\nu )$.

Lemma 3

The value of $\rho _F(\mu ,\nu )$ does not change if we use countable partitions of unity G in (4) instead of finite ones.

Proposition 4

The function $\rho ^{}_F(\mu ,\nu )$ is subadditive with respect to the pair $(\mu ,\nu )$. That is, for any finite nonnegative measures $\mu _1,\,\mu _2$ and any finite real-valued measures $\nu _1,\,\nu _2,$

$$\begin{aligned} \rho ^{}_F(\mu _1+\mu _2,\nu _1+\nu _2) \le \rho ^{}_F(\mu _1,\nu _1) +\rho ^{}_F(\mu _2,\nu _2). \end{aligned}$$

(8)

For any measure $\nu $ and a bounded measurable function f on a measurable space $(X,{\mathfrak {A}})$, we define a real-valued measure $f\nu $ by the formula

$$\begin{aligned} f\nu [g] := \nu [f g] =\int _X gf\,\mathrm{d}\nu , \quad g\in L^1(X,\nu ). \end{aligned}$$

Proposition 5

Let $\mu ,\,\nu $ be finite measures, where $\mu $ is nonnegative and $\nu $ is real-valued, and let $f_1, f_2$ be nonnegative bounded measurable functions on $(X,{\mathfrak {A}})$. Then,

$$\begin{aligned} \rho ^{}_F\big ((f_1+f_2)\mu ,(f_1+f_2)\nu \big ) =\rho ^{}_F(f_1\mu ,f_1\nu ) +\rho ^{}_F(f_2\mu ,f_2\nu ). \end{aligned}$$

This means that the function $\rho ^{}_F(f\mu ,f\nu )$ is additive with respect to f.

Generally, a real-valued measure $\nu $ is decomposed into three components

$$\begin{aligned} \nu =\nu _a +\nu ^+_s +\nu ^-_s, \end{aligned}$$

(9)

where $\nu _a$ is absolutely continuous, $\nu ^+_s$ is positive and singular, and $\nu ^-_s$ is negative and singular (with respect to $\mu $).

The next result describes the corresponding decomposition of $\rho ^{}_F(\mu ,\nu )$.

Theorem 6

Let $\mu $, $\nu $ be finite measures on a measurable space $(X,{\mathfrak {A}})$, where $\mu $ is nonnegative and $\nu $ is real-valued. Then,

$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \rho ^{}_F(\mu ,\nu _a) +\rho ^{}_F(0,\nu ^+_s) +\rho ^{}_F(0,\nu ^-_s), \end{aligned}$$

(10)

where each term may be finite or equal to $+\infty $, and

$$\begin{aligned} \rho ^{}_F(0,\nu ^+_s)&=\nu ^+_s(X)F^{\prime }(+\infty ), \end{aligned}$$

(11)

$$\begin{aligned} \rho ^{}_F(0,\nu ^-_s)&= \nu ^-_s(X)F^{\prime }(-\infty ). \end{aligned}$$

(12)

Here, we assume that if $\nu ^+_s =0$ or $\nu ^-_s =0$ then the corresponding product in the right-hand part of (11) or (12) is zero regardless of the value of $F^{\prime }(\pm \infty )$.

There are quite a number of objects in analysis where one has to use continuous partitions of unity rather than measurable ones (one of them will be considered in Sect. 5). To discuss this setting in our context, we need the next definition.

Let X be a topological space and let $\mu ,\, \nu $ be finite Borel measures, where $\mu $ is nonnegative and $\nu $ is real-valued. For any convex function $F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]$ set

$$\begin{aligned} \rho ^{}_{F,c}(\mu ,\nu ) :=\, \sup _G\sum _{g\in G} \mu [g] F\bigg (\frac{\nu [g]}{\mu [g]}\bigg ), \end{aligned}$$

(13)

where the supremum is taken over the set of all (finite) continuous partitions of unity G, and we assume that if $\mu [g]=0$ then the corresponding summand in the right-hand part is defined according to convention (3).

Theorem 7

Let $\mu $, $\nu $ be finite Borel measures on a metric space X, where $\mu $ is nonnegative and $\nu $ is real-valued. Then, for any convex lower semicontinuous function F,

$$\begin{aligned} \rho ^{}_{F,c}(\mu ,\nu ) = \rho ^{}_{F}(\mu ,\nu ). \end{aligned}$$

Remark 8

In fact, instead of metrizability of X in Theorem 7, it suffices to require that the set of continuous functions C(X) be dense in the space $L^1(X,\mu + |\nu |)$ (which is always the case for a metrizable space X or, as a variant, for regular measures $\mu $, $\nu $).

Now, let us prove the above formulated results.

Proof of Lemma 2

If $s>0$, then (7) follows immediately from (6). Note that (6) and (2) imply inequalities

$$\begin{aligned} F^{\prime }(+\infty ) \ge A, \quad F^{\prime }(-\infty ) \le A. \end{aligned}$$

(14)

Together with (3), these inequalities imply (7) in the case $s=0$ and $x\ne 0$. Finally, if $s=0$ and $x=0$, both sides of (7) are zero. $\square $

Proof of Lemma 1

If $s,t>0$, then by convexity of F,

$$\begin{aligned} sF\bigg (\frac{x}{s}\bigg ) + tF\bigg (\frac{y}{t}\bigg ) = (s+t)\bigg (\frac{s}{s+t}F\bigg (\frac{x}{s}\bigg ) +\frac{t}{s+t}F\bigg (\frac{y}{t}\bigg )\!\bigg ) \ge (s+t)F\bigg (\frac{x+y}{s+t}\bigg ). \end{aligned}$$

Consider the case when $s>0$ and $t=0$.

If $y=0$, then (5) turns into the equality $sF(x/s) =sF(x/s)$.

Suppose now that $y>0$. If at least one summand in the right-hand part of (5) is infinite, then (5) holds true. If both summands sF(x/s) and $0F(y/0) =yF^{\prime }(+\infty )$ are finite then the function F must be finite and continuous on the whole interval $(x/s, +\infty )$. Hence, in (5), one can pass to a limit as $t\rightarrow +0$ and obtain the desired inequality

$$\begin{aligned} sF\bigg (\frac{x+y}{s}\bigg ) \le sF\bigg (\frac{x}{s}\bigg ) + 0F\bigg (\frac{y}{0}\bigg ). \end{aligned}$$

The case $y<0$ is treated similarly.

It remains to analyse the case $s=t=0$ and $x,y\ne 0$. If x and y have the same sign (say $x,y>0$), then (5) turns into equality:

$$\begin{aligned} (x+y)F^{\prime }(+\infty ) = xF^{\prime }(+\infty ) +yF^{\prime }(+\infty ). \end{aligned}$$

Suppose $x,\,y$ have different signs (say $x<0$ and $y>0$). Recall that $F^{\prime }(+\infty ) \ge F^{\prime }(-\infty )$ (see (14)). Therefore, in any case,

$$\begin{aligned} xF^{\prime }(-\infty ) +yF^{\prime }(+\infty ) \,\ge \, {\left\{ \begin{array}{ll} (x+y)F^{\prime }(-\infty ), &{} \hbox {if}\ \, x+y<0, \\ (x+y)F^{\prime }(+\infty ), &{} \hbox {if}\ \, x+y>0, \\ 0, &{} \hbox {if}\ \, x+y=0, \end{array}\right. } \end{aligned}$$

which means that

$$\begin{aligned} 0F\bigg (\frac{x}{0}\bigg ) + 0F\bigg (\frac{y}{0}\bigg ) \ge \, 0F\bigg (\frac{x+y}{0}\bigg ). \end{aligned}$$

Thus, Lemma 1 is proved in all cases. $\square $

Proof of Lemma 3

Consider a countable partition of unity $G = \{g_1,g_2,\dots \}$. First, we prove that in this case, the sum in (4) is well-defined, i.e., that the limit

$$\begin{aligned} \lim _{n\rightarrow \infty } \sum _{i=1}^n \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) \end{aligned}$$

(15)

does exist, being either finite or equal to $+\infty $.

Set $h_n =\sum _{i\ge n} g_i$. Then, by Levi’s monotone convergence theorem,

$$\begin{aligned} \lim _{n\rightarrow \infty } \mu [h_n] =0 \quad \text {and}\quad \lim _{n\rightarrow \infty } |\nu | [h_n] =0, \end{aligned}$$

(16)

where $|\nu |$ denotes the total variation of $\nu $. Lemma 2 implies that

$$\begin{aligned} \sum _{i=n}^m \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) \ge \sum _{i=n}^m \big (A\nu [g_i] +B\mu [g_i]\big ) \ge -|A||\nu |[h_n] -|B| \mu [h_n]. \end{aligned}$$

(17)

It follows from (16) and (17) that for any $\varepsilon >0$, there exists N such that for all $n>N$ and $m\ge n$,

$$\begin{aligned} \sum _{i=n}^m \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) > -\varepsilon . \end{aligned}$$

(18)

Now, we have two possibilities: if for any $\varepsilon >0$ there exists N such that for all $n>N$ and $m\ge n$,

$$\begin{aligned} \sum _{i=n}^m \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) < \varepsilon , \end{aligned}$$

(19)

then limit (15) does exist (being finite when all the summands in (15) are finite and equal to $+\infty $ when there is at least one infinite summand); otherwise, if assumption (19) fails, using its negation and (18) one can easily show that the limit still exists and equals $+\infty $.

Now, let us check equivalence of finite and countable partitions for use in (4).

Each finite partition of unity G in (4) may be transformed into a countable one by adding countably many zero elements, so the transition from finite to countable partitions cannot decrease the value of $\rho ^{}_F(\mu ,\nu )$. Thus, it suffices to prove that it cannot increase this value either.

Let $\rho ^{}_F(\mu ,\nu )$ be defined by (4) using countable partitions G. Then, for any $c<\rho ^{}_F(\mu ,\nu )$, there exists a countable partition of unity $G=\{g_1,g_2,\dots \}$ such that

$$\begin{aligned} \sum _{i=1}^\infty \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) \,>\, c. \end{aligned}$$

(20)

Set $h_n =\sum _{i\ge n} g_i$. Combining Lemma 2 and (16), we obtain

$$\begin{aligned} \liminf _{n\rightarrow \infty } \mu [h_n] F\bigg (\frac{\nu [h_n]}{\mu [h_n]}\bigg ) \ge \liminf _{n\rightarrow \infty } \big (A\nu [h_n] +B\mu [h_n]\big ) = 0. \end{aligned}$$

(21)

Consider a finite partition of unity $G_n =\{g_1,\dots ,g_{n-1},h_n\}$. Now, (20) and (21) imply

$$\begin{aligned} \sup _{n} \Bigg ( \sum _{i=1}^{n-1} \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) + \mu [h_n] F\bigg (\frac{\nu [h_n]}{\mu [h_n]}\bigg )\!\Bigg ) \,\ge \, \sum _{i=1}^\infty \mu [g_i] F\bigg (\frac{\nu [g_i]}{\mu [g_i]}\bigg ) > c. \end{aligned}$$

Then, (20) is valid for some $G_n$ instead of G, which along with arbitrariness of the constant $c<\rho ^{}_F(\mu ,\nu )$ implies the statement of Lemma 3. $\square $

Proof of Proposition 4

If g is an element of a measurable partition of unity G, then by Lemma 1,

$$\begin{aligned} \big (\mu _1[g] +\mu _2[g]\big ) F\bigg (\frac{\nu _1[g] +\nu _2[g]}{\mu _1[g] +\mu _2[g]}\bigg ) \le \mu _1[g] F\bigg (\frac{\nu _1[g]}{\mu _1[g]}\bigg ) +\mu _2[g] F\bigg (\frac{\nu _2[g]}{\mu _2[g]}\bigg ). \end{aligned}$$

Summing this up over $g\in G$ and passing to suprema gives (8). $\square $

Proof of Proposition 5

From Proposition 4, it follows that

$$\begin{aligned} \rho ^{}_F\big ((f_1+f_2)\mu ,(f_1+f_2)\nu \big ) \le \rho ^{}_F(f_1\mu ,f_1\nu ) + \rho ^{}_F(f_2\mu ,f_2\nu ). \end{aligned}$$

So, it suffices to prove the inverse inequality.

By definition, for any $c_i<\rho ^{}_F(f_i\mu ,f_i\nu )$, $i=1,2$, there exist measurable partitions of unity $G_i$, $i=1,2$, such that

$$\begin{aligned} \sum _{g\in G_i} \mu [f_ig] F\bigg (\frac{\nu [f_ig]}{\mu [f_ig]}\bigg ) > c_i, \quad i=1,2. \end{aligned}$$

(22)

For each $g\in G_i$ let us define the function

$$\begin{aligned} h_g = {\left\{ \begin{array}{ll} \displaystyle f_ig/(f_1+f_2), &{} \text {if}\ \, f_1+f_2>0, \\ \displaystyle g/2, &{} \text {if}\ \, f_1+f_2=0. \end{array}\right. } \end{aligned}$$

Evidently, the collection $H =\{ h_g\mid g\in G_1\cup G_2\}$ forms a measurable partition of unity. Note that for each $g\in G_i$, we have the equality $(f_1+f_2)h_g =f_ig$. Therefore,

$$\begin{aligned} \sum _{h_g\in H} \mu [(f_1+f_2)h_g] F\bigg (\frac{\nu [(f_1+f_2)h_g]}{\mu [(f_1+f_2)h_g]}\bigg ) = \sum _{i=1}^2 \sum _{g\in G_i} \mu [f_ig] F\bigg (\frac{\nu [f_ig]}{\mu [f_ig]}\bigg ). \end{aligned}$$

(23)

From (22), (23) it follows that

$$\begin{aligned} \rho ^{}_F\big ((f_1+f_2)\mu ,(f_1+f_2)\nu \big ) > c_1+c_2 \end{aligned}$$

and, by arbitrariness of $c_i<\rho ^{}_F(f_i\mu ,f_i\nu )$,

$$\begin{aligned} \rho ^{}_F\big ((f_1+f_2)\mu ,(f_1+f_2)\nu \big ) \ge \rho ^{}_F(f_1\mu ,f_1\nu ) + \rho ^{}_F(f_2\mu ,f_2\nu ). \end{aligned}$$

$\square $

Proof of Theorem 6

The space X can be decomposed into three disjoint measurable parts, say $X=X_a\sqcup X^+_s \sqcup X^-_s$, such that the measures $\mu $ and $\nu _a$ are supported on $X_a$ while $\nu ^+_s$, $\nu ^-_s$ are, respectively, supported on $X^+_s$, $X^-_s$. Denote by $f_a$, $f^+_s$, $f^-_s$ characteristic functions of these disjoint parts. Then,

$$\begin{aligned} f_a\mu =\mu , \quad f_a\nu =\nu _a, \quad f^+_s\mu =0, \quad f^+_s\nu =\nu ^+_s, \quad f^-_s\mu =0, \quad f^-_s\nu =\nu ^-_s, \end{aligned}$$

and hence (10) follows from Proposition 5.

Proofs of equalities (11) and (12) are similar. For example,

$$\begin{aligned} \rho ^{}_F(0,\nu ^+_s) = \sup _G \sum _{g\in G} 0 F\bigg (\frac{\nu ^+_s[g]}{0}\bigg ) = \sup _G\sum _{g\in G} \nu ^+_s[g]F^{\prime }(+\infty ) = \nu ^+_s(X)F^{\prime }(+\infty ). \end{aligned}$$

$\square $

To prove Theorem 7, we need the following lemma.

Lemma 9

Let $\mu $ be a positive finite Borel measure on a topological space X such that C(X) is dense in $L^1(X,\mu )$. Then, for any measurable partition of unity $G = \{g_1,\dots ,g_n\}$ on X and any $\varepsilon >0$, there exists a continuous partition of unity $H =\{h_1,\dots ,h_n\}$ on X such that $\Vert h_i-g_i\Vert <\varepsilon $ in $L^1(X,\mu )$ for all $i\in \overline{1,n}$.

Proof

Choose a small $\delta >0$ and approximate each $g_i$ by a continuous function $f_i$ satisfying $\Vert f_i-g_i\Vert <\delta $ in the space $L^1(X,\mu )$. Without loss of generality, we can assume that the functions $f_i$ are strictly positive (which can always be guaranteed by replacing each $f_i$ by $\max \{f_i,0\} +\gamma $ with a small $\gamma >0$). Now define a continuous partition of unity with elements

$$\begin{aligned} h_i :=\frac{f_i}{\sum _{j=1}^n f_j}, \quad i=1,\dots ,n. \end{aligned}$$

Clearly,

$$\begin{aligned} |h_i -f_i| = \bigg | \frac{1-\sum _{j=1}^n f_j}{\sum _{j=1}^n f_j} f_i\bigg | \le \bigg | 1-\sum _{j=1}^n f_j\bigg | \le \sum _{j=1}^n |g_j -f_j|, \end{aligned}$$

which implies the estimate

$$\begin{aligned} \Vert h_i -g_i\Vert \le \Vert h_i -f_i\Vert +\Vert f_i -g_i\Vert \le n\delta +\delta . \end{aligned}$$

Since $\delta $ is arbitrary, this finishes the proof. $\square $
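The normalization step of this proof can be checked numerically. The sketch below is only an illustration under assumptions of our own: X is replaced by a uniform grid on $[0,1]$ with equal weights, the partition consists of the indicators of $[0,1/2)$ and $[1/2,1]$, and the approximants are piecewise-linear ramps shifted by a small $\gamma >0$. All names are hypothetical.

```python
def ramp(x, center, width):
    """Continuous piecewise-linear step: 1 to the left of the ramp, 0 to the right."""
    lo, hi = center - width / 2, center + width / 2
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / width

N = 1000
xs = [(k + 0.5) / N for k in range(N)]   # grid midpoints; mu puts weight 1/N on each
gamma = 1e-3                             # small shift keeping the approximants positive

# Measurable partition of unity: indicators of [0, 1/2) and [1/2, 1].
g = [[1.0 if x < 0.5 else 0.0 for x in xs],
     [0.0 if x < 0.5 else 1.0 for x in xs]]
# Strictly positive continuous approximants f_i of g_i.
f = [[ramp(x, 0.5, 0.02) + gamma for x in xs],
     [1.0 - ramp(x, 0.5, 0.04) + gamma for x in xs]]

def l1(u, v):
    """Discrete L^1(mu) distance for the uniform weights 1/N."""
    return sum(abs(a - b) for a, b in zip(u, v)) / N

n = len(g)
total = [sum(fi[k] for fi in f) for k in range(N)]
h = [[fi[k] / total[k] for k in range(N)] for fi in f]   # h_i = f_i / sum_j f_j

delta = max(l1(f[i], g[i]) for i in range(n))
errors = [l1(h[i], g[i]) for i in range(n)]
```

In this example the functions $h_i$ sum to one at every grid point, and each error stays below $(n+1)\delta $ with $\delta = \max _i \Vert f_i-g_i\Vert $, in line with the estimate in the proof.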

Proof of Theorem 7

Since any continuous partition of unity is measurable it follows that $\rho ^{}_{F,c}(\mu ,\nu )\le \rho ^{}_F(\mu ,\nu )$, and it is enough to prove the opposite inequality.

As in the proof of Theorem 6, the space X can be decomposed into three disjoint parts, $X=X_a\sqcup X^+_s \sqcup X^-_s$, such that the measures $\mu $ and $\nu _a$ are supported on $X_a$ while $\nu ^+_s$, $\nu ^-_s$ are, respectively, supported on $X^+_s$, $X^-_s$. Denote by $f_a$, $f^+_s$, $f^-_s$ the characteristic functions of these disjoint parts.

Theorem 6 gives the following representation of $\rho ^{}_F(\mu ,\nu )$:

$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \sup _G\sum _{g\in G} \mu [g] F\bigg (\frac{\nu _a[g]}{\mu [g]}\bigg ) + 0F\bigg (\frac{\nu ^+_s[f^+_s]}{0}\bigg ) + 0F\bigg (\frac{\nu ^-_s[f^-_s]}{0}\bigg ). \end{aligned}$$

(24)

Suppose that $\nu ^+_s[f^+_s] >0$ and $\nu ^-_s[f^-_s] <0$ (otherwise the corresponding summands in the right-hand side of (24) may be omitted).

Note that in (24) one can assume that $\mu [g] >0$ for all g. Indeed, the summands with $\mu [g] =0$ are equal to 0 by convention (3), since then $\nu _a[g] =0$. Moreover, whenever $\mu [g^{\prime }]=0$ and $\mu [g^{\prime \prime }]>0$, the pair $g^{\prime }$, $g^{\prime \prime }$ can be replaced in the partition G by the single element $g = g^{\prime } + g^{\prime \prime }$, which does not change the sum in (24) because $\nu _a$ is absolutely continuous with respect to $\mu $.

Now, recalling the lower semicontinuity of F and the definition of $F^{\prime }(\pm \infty )$, we complete the proof of the theorem by applying Lemma 9 to the partitions of unity $G^{\prime } =\{ f_ag\mid g\in G\}\cup \{f^+_s,f^-_s\}$ in the space $L^1(X,\mu +|\nu |)$. $\square $

3 Sup-Sums Principles for Integral F-Divergence

In this section, we present the principal results of the article, which describe the relation between the sup-sums F-divergences and the integral F-divergence.

Theorem 10

(sup-sums principle for partitions of unity) Let $\mu $ and $\nu $ be two finite measures on a measurable space $(X,{\mathfrak {A}})$, where $\mu $ is nonnegative and $\nu $ is real-valued, and let $\nu =\nu _a +\nu ^+_s +\nu ^-_s$ be the decomposition (9). Then,

$$\begin{aligned} \rho ^{}_F(\mu ,\nu _a) \,= \int _X F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu , \end{aligned}$$

(25)

and

$$\begin{aligned} \rho ^{}_F(\mu ,\nu ) = \int _X F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu + \nu ^+_s(X)F^{\prime }(+\infty ) + \nu ^-_s(X)F^{\prime }(-\infty ). \end{aligned}$$

(26)

Here, $\mathrm{d}\nu _a/\mathrm{d}\mu $ denotes the Radon–Nikodym derivative, and we assume that if $\nu ^+_s =0$ or $\nu ^-_s =0$, then the corresponding product in the right-hand part of (26) is zero regardless of the value of $F^{\prime }(\pm \infty )$.
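Formula (26) can be sanity-checked on a finite space, where the singular part of $\nu $ lives on the points of zero $\mu $-mass. The sketch below is an illustration under assumptions of our own: a four-point space, $F(t)=|t-1|$ with $F^{\prime }(\pm \infty )=\pm 1$, and hypothetical helper names. Random partitions of unity only give lower bounds for the supremum, so the check is one-sided for them.

```python
import random

def perspective(F, s, x, slope_plus, slope_minus):
    """s * F(x / s) with convention (3) at s == 0."""
    if s > 0:
        return s * F(x / s)
    if x > 0:
        return x * slope_plus
    if x < 0:
        return x * slope_minus
    return 0.0

def partition_sum(F, mu, nu, partition, slope_plus, slope_minus):
    """Sum in (4) for one partition of unity, given as lists of values g(x)."""
    total = 0.0
    for g in partition:
        mu_g = sum(gx * m for gx, m in zip(g, mu))
        nu_g = sum(gx * v for gx, v in zip(g, nu))
        total += perspective(F, mu_g, nu_g, slope_plus, slope_minus)
    return total

def random_partition(rng, n_points, k):
    """k nonnegative functions on n_points points that sum to 1 pointwise."""
    cols = []
    for _ in range(n_points):
        w = [rng.random() for _ in range(k)]
        s = sum(w)
        cols.append([wi / s for wi in w])
    return [[cols[x][j] for x in range(n_points)] for j in range(k)]

def tv(t):
    return abs(t - 1.0)

mu = [0.5, 0.5, 0.0, 0.0]
nu = [0.2, 0.6, 0.3, -0.1]   # the last two points carry the singular part

# Right-hand side of (26): integral term plus the two singular terms.
rhs = (sum(m * tv(v / m) for m, v in zip(mu, nu) if m > 0)
       + sum(v for m, v in zip(mu, nu) if m == 0 and v > 0) * 1.0
       + sum(v for m, v in zip(mu, nu) if m == 0 and v < 0) * (-1.0))

indicators = [[1.0 if x == j else 0.0 for x in range(4)] for j in range(4)]
finest = partition_sum(tv, mu, nu, indicators, 1.0, -1.0)

rng = random.Random(1)
samples = [partition_sum(tv, mu, nu, random_partition(rng, 4, 3), 1.0, -1.0)
           for _ in range(200)]
```

In this example the partition into single points attains the right-hand side of (26), and none of the sampled partitions exceeds it.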

Corollary 11

For any $f\in L^1(X,\mu )$ and any convex function $F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ],$

$$\begin{aligned} \int _X F(f)\,\mathrm{d}\mu \,=\, \sup _G \sum _{g\in G} \mu [g] F\bigg (\frac{\mu [fg]}{\mu [g]}\bigg ), \end{aligned}$$

where the supremum is taken over all measurable partitions of unity G.
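A simple exact check of Corollary 11 can be made with rational arithmetic. We assume, for illustration only, $F(t)=t^2$, $f(x)=x$, and $\mu $ the Lebesgue measure on $[0,1]$, and we restrict attention to partitions into indicators of k equal subintervals, which yields lower bounds for the supremum. The function name is hypothetical.

```python
from fractions import Fraction

def partition_sum(k):
    """Sum from Corollary 11 for F(t) = t^2, f(x) = x, Lebesgue measure on
    [0, 1], and G = indicators of k equal subintervals (exact arithmetic)."""
    total = Fraction(0)
    for i in range(1, k + 1):
        left, right = Fraction(i - 1, k), Fraction(i, k)
        mu_g = right - left                      # mu[g]: length of the interval
        mu_fg = (right ** 2 - left ** 2) / 2     # mu[f g]: integral of x over it
        total += mu_g * (mu_fg / mu_g) ** 2      # mu[g] * F(mu[fg] / mu[g])
    return total

integral = Fraction(1, 3)                        # integral of x^2 over [0, 1]
values = [partition_sum(k) for k in (1, 2, 4, 8)]
```

For these partitions the sum equals $\tfrac{1}{3} - \tfrac{1}{12k^2}$, so the values increase towards the integral $\tfrac{1}{3}$ as k grows without ever exceeding it.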

Along with partitions of unity one can also use space partitions. Namely, by a measurable partition of space X, we mean a finite family $\Gamma = \{\Delta _1,\dots ,\Delta _k\}$ of sets $\Delta _i \in {\mathfrak {A}}$ such that $\Delta _1\sqcup \dots \sqcup \Delta _k =X$.

For any convex function $F\!:{\mathbb {R}} \rightarrow (-\infty ,+\infty ]$ put

$$\begin{aligned} \rho ^{}_{F,X}(\mu ,\nu ) :=\, \sup _\Gamma \sum _{\Delta \in \Gamma } \mu (\Delta ) F\bigg (\frac{\nu (\Delta )}{\mu (\Delta )}\bigg ), \end{aligned}$$

(27)

where the supremum is taken over the set of all measurable partitions $\Gamma $ of space X, and we assume that if $\mu (\Delta )=0$, then the corresponding summand in the right-hand part is defined according to convention (3).

The argument of the proof of Lemma 3 shows that expression (27) preserves its value whether we use finite or countable measurable partitions of the space X.

The next statement is a ‘space’ variant of Theorem 10.

Theorem 12

(sup-sums principle for space partitions) Under the assumptions of Theorem 10, we have $\rho ^{}_{F,X}(\mu ,\nu ) = \rho ^{}_F (\mu ,\nu )$. Thus, the equalities (25) and (26) are valid with $\rho _{F,X}(\mu ,\nu )$ used instead of $\rho _F(\mu ,\nu )$.

Remark 13

In the classical situation, i.e., for a convex function $F\!:(0, +\infty ) \rightarrow {\mathbb {R}}$ and probability measures $\mu $ and $\nu $ the sup-sums principle (25), (26) with $\rho ^{}_{F,X} (\mu ,\nu )$ was established by Vajda [24]. A different proof, based on generalized Taylor expansion of a convex function, is given in [16, Theorem 16]. The paper [16] is a good source of information on the classical F-divergence.

In many fields of analysis, one has to use continuous partitions of unity (see, in particular, Sect. 5). The next theorem presents the corresponding refinement of Theorem 10.

Theorem 14

(sup-sums principle for continuous partitions) Let X be a topological space and let $\mu $ and $\nu $ be two finite Borel measures, where $\mu $ is nonnegative and $\nu $ is real-valued. If the set C(X) of continuous functions is dense in $L^1(X,\mu +|\nu |)$ (which is always true for a metrizable space X or, as a variant, for regular measures $\mu $ and $\nu $) and F is a convex lower semicontinuous function, then (25) and (26) are valid with $\rho _{F,c}(\,\cdot \,,\,\cdot \,)$ (13) substituted for $\rho _{F}(\,\cdot \,,\,\cdot \,)$.

It is worth mentioning that there are at least two different ways to define the value of $\rho _F(\mu ,\nu )$ for a measure $\nu $ that is not absolutely continuous with respect to $\mu $. Let us explain them in the case of a finite set $X =\{1,\dots ,K\}$. In this case, the measures $\mu ,\,\nu $ have the form $\mu =(\mu _1,\dots ,\mu _K)$, $\nu =(\nu _1,\dots ,\nu _K)$ and then

$$\begin{aligned} \rho _F(\mu ,\nu ) =D_F(\nu \Vert \mu ) =\sum _{i=1}^K \mu _iF\bigg (\frac{\nu _i}{\mu _i}\bigg ). \end{aligned}$$

The question is how to define the product $\mu _iF(\nu _i/\mu _i)$ when $\mu _i=0$ and $\nu _i\ne 0$.

The first way (adopted in the present paper as well as in [16, 24]) is to put

$$\begin{aligned} 0F\bigg (\frac{\nu _i}{0}\bigg ) = \lim _{\mu _i\rightarrow +0} \mu _i F\bigg (\frac{\nu _i}{\mu _i}\bigg ). \end{aligned}$$

Under this approach, the function $\rho _F(\mu ,\nu )$ depends continuously on $\mu $. It is precisely this property that enables us to establish in Theorem 14 a link between the sup-sums principles for measurable and continuous partitions of unity, which is indispensable for the applications to the spectral objects in Sect. 5.

Alternatively, $-\rho _F(\mu ,\nu )$ may be treated as the exponential rate for conditional probabilities of large deviations for a certain family of weighted empirical measures. Then, we have to put $0F(\nu _i/0) =+\infty $ and, respectively, $\rho _F(\mu ,\nu ) =+\infty $ whenever $\nu $ is not absolutely continuous with respect to $\mu $ (for details see [11, 13, 20, 23]).

Under the second approach to the definition of F-divergence, the analogues of all the above-stated results (Propositions 4, 5 and Theorems 7, 10, 12) can be formulated and proved. However, in this setting, they will be meaningful only for absolutely continuous measures $\nu $, while the singular case becomes trivial.

Proof of Theorem 10

Note that (26) follows from (25) along with Theorem 6.

Let us check that for each (no matter finite or countable) measurable partition of unity G, we have

$$\begin{aligned} \sum _{g\in G} \mu [g] F\bigg (\frac{\nu _a[g]}{\mu [g]}\bigg ) \le \int _X F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu . \end{aligned}$$

(28)

Indeed,

$$\begin{aligned} \sum _{g\in G} \mu [g] F\bigg (\frac{\nu _a[g]}{\mu [g]}\bigg )= & {} \sum _{\mu [g]>0} \mu [g] F\bigg (\!\int _X \frac{g}{\mu [g]}\, \mathrm{d}\nu _a\bigg ) \\= & {} \sum _{\mu [g]>0} \mu [g] F\bigg (\!\int _X \frac{g}{\mu [g]} \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\, \mathrm{d}\mu \bigg ) \\\le & {} \sum _{\mu [g]>0}\mu [g]\int _X \frac{g}{\mu [g]} F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu \,\\= & {} \int _X \sum _{g\in G} g F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu \,= \int _X F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu , \end{aligned}$$

where we used Jensen’s inequality for the probability measure $(g/\mu [g]) \mathrm{d}\mu $ and that by convention (3) and absolute continuity of $\nu _a$ all the summands with $\mu [g] =0$ are zero.

From (28), it follows that the left-hand part in (25) does not exceed the right-hand one, and to finish the proof of Theorem 10 we have to verify the inequality

$$\begin{aligned} \sup _G\sum _{g\in G} \mu [g] F\bigg (\frac{\nu _a[g]}{\mu [g]}\bigg ) \ge \int _X F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu . \end{aligned}$$

(29)

For the convex function F under consideration, there exists a partition of the real axis by three points $-\infty \le a\le b\le c\le +\infty $ such that

(i) $F(y) =+\infty $ for $y<a$ and $y>c$;
(ii) F(y) is nonincreasing, finite and continuous on (a, b);
(iii) F(y) is nondecreasing, finite and continuous on (b, c).

Let us decompose X into seven subsets

$$\begin{aligned} X_{<a}, \quad X_a, \quad X_{(a,b)}, \quad X_b, \quad X_{(b,c)}, \quad X_c,\quad X_{>c}, \end{aligned}$$

(30)

defined, respectively, by the conditions

$$\begin{aligned} \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) {<} a, \quad \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) {=}a, \quad a{<} \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) {<}b, \quad \dots , \quad \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) {=}c, \quad \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) {>}c. \end{aligned}$$

Some of these sets may be empty; for example, if the function F decreases on (a, c), then $b=c$ and $X_{(b,c)} =\emptyset $, and if F is finite everywhere, then the sets $X_{<a}$, $X_a$, $X_c$, $X_{>c}$ will be empty.

Evidently, it is enough to prove inequality (29) for each of the sets (30) separately and then sum the components. In doing so, partitions of unity G on these sets should also be defined separately.

For the sets $X_{<a}$, $X_a$, $X_b$, $X_c$, $X_{>c}$ (some of them may be empty), inequality (29) is verified easily: it is sufficient to take a trivial partition G consisting of a single unit function on the set considered.

Now consider the set $X_{(a,b)}$. Let us take an arbitrary number $\varepsilon >0$ and set

$$\begin{aligned} Y_i= & {} \bigl \{ y\in (a,b)\bigm | i\varepsilon \le F(y) <i\varepsilon +\varepsilon \bigr \}, \quad i\in {\mathbb {Z}}, \end{aligned}$$

(31)

$$\begin{aligned} X_i= & {} \Big \{ x\in X \Bigm | \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }(x) \in Y_i \Big \}, \quad i\in {\mathbb {Z}}. \end{aligned}$$

(32)

Clearly, the sets $X_i$ form a partition of $X_{(a,b)}$ and their characteristic functions (that we denote by $g_i$) form a measurable partition of unity on $X_{(a,b)}$.

Note that by monotonicity of F on (a, b), the sets $Y_i$ are convex. Therefore, if $\mu (X_i)>0$, then

$$\begin{aligned} \frac{\nu _a(X_i)}{\mu (X_i)} \,=\, \frac{1}{\mu (X_i)}\int _{X_i} \frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\,\mathrm{d}\mu \,\in \, Y_i, \end{aligned}$$

and by definition of $Y_i$, we have

$$\begin{aligned} i\varepsilon \,\le \, F\bigg (\frac{\nu _a(X_i)}{\mu (X_i)}\bigg ) <\, i\varepsilon +\varepsilon . \end{aligned}$$

(33)

Now, (31), (32), (33) imply that

$$\begin{aligned} \int _{X_{(a,b)}} F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu \,= & {} \sum _{i\in {\mathbb {Z}}} \int _{X_i} F\bigg (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\bigg ) \mathrm{d}\mu \,\le \sum _{i\in {\mathbb {Z}}} (i\varepsilon +\varepsilon )\mu (X_i) \\\le & {} \sum _{i\in {\mathbb {Z}}} \mu (X_i) \bigg (F\bigg (\frac{\nu _a(X_i)}{\mu (X_i)}\bigg ) +\,\varepsilon \bigg )\\= & {} \sum _{i\in {\mathbb {Z}}} \mu [g_i] F\bigg (\frac{\nu _a[g_i]}{\mu [g_i]}\bigg ) +\, \varepsilon \mu \big (X_{(a,b)}\big ). \end{aligned}$$

By arbitrariness of $\varepsilon $, this implies inequality (29) for the set $X_{(a,b)}$.

For the set $X_{(b,c)}$, it is verified in the same way. Thus, Theorem 10 is proved. $\square $
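The level-set construction (31)–(33) can be tried out numerically. The following is a minimal sketch under simplifying assumptions that are not part of the paper: X is a finite set, $\mu $ is given by point masses, the density $\mathrm{d}\nu _a/\mathrm{d}\mu $ is given by a list of positive values, and $F=-\ln $ (so that $a=0$ and $b=c=+\infty $). The function names are hypothetical.

```python
import math
from collections import defaultdict

def integral(mu, density, F):
    """Integral of F(d nu_a / d mu) with respect to mu on a finite space."""
    return sum(m * F(f) for m, f in zip(mu, density))

def level_set_partition_sum(mu, density, F, eps):
    """Sum of mu(X_i) * F(nu_a(X_i) / mu(X_i)) over the level sets X_i of (31)-(32)."""
    mu_of = defaultdict(float)  # mu(X_i)
    nu_of = defaultdict(float)  # nu_a(X_i), the integral of the density over X_i
    for m, f in zip(mu, density):
        i = math.floor(F(f) / eps)  # index i with i*eps <= F(f) < (i+1)*eps
        mu_of[i] += m
        nu_of[i] += m * f
    return sum(mu_of[i] * F(nu_of[i] / mu_of[i]) for i in mu_of)

F = lambda y: -math.log(y)                  # nonincreasing on (0, +infinity)
mu = [0.1, 0.2, 0.05, 0.25, 0.3, 0.1]       # point masses of mu (illustrative data)
density = [0.3, 0.35, 1.2, 1.25, 2.0, 4.0]  # values of d nu_a / d mu (illustrative data)

exact = integral(mu, density, F)
for eps in (1.0, 0.3, 0.1, 0.01):
    approx = level_set_partition_sum(mu, density, F, eps)
    # The proof gives: approx <= exact <= approx + eps * mu(X).
    print(eps, approx, exact)
```

As $\varepsilon $ decreases, the partition sum approaches the integral from below, which is the content of the last display in the proof.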

Proof of Corollary 11

Take $\nu $ such that $\mathrm{d}\nu /\mathrm{d}\mu =f$ and apply (25). $\square $

Proof of Theorem 12

Every space partition $\Delta _1,\dots ,\Delta _k$ is defined by the partition of unity consisting of the corresponding characteristic functions. Thus, $\rho _{F,X}(\mu ,\nu ) \le \rho _F(\mu ,\nu )$. The rest of the proof coincides with the ending part of the proof of Theorem 10 (starting from formula (29)), where only the space partitions are used. $\square $

Proof of Theorem 14

Apply Theorem 10 along with Theorem 7 bearing in mind Remark 8. $\square $

4 Sup-Sums Principle for Kullback–Leibler Divergence, etc

If $\mu $ and $\nu $ are probability measures on $(X,{\mathfrak {A}})$ and $\mu $ is absolutely continuous with respect to $\nu $, then the Kullback–Leibler divergence $D_{\mathrm {KL}}$ is defined as

$$\begin{aligned} D_{\mathrm {KL}} (\mu \Vert \nu ) := \int _X \ln \biggl (\frac{\mathrm{d}\mu }{\mathrm{d}\nu }\biggr )\, \mathrm{d}\mu . \end{aligned}$$

(34)

The principal philosophy behind the results we are going to discuss is not new. Namely, an analogue of Theorem 15 for space partitions of X (cf. (27)) goes back to Gelfand, Kolmogorov, and Yaglom [15]. However, the results obtained in the foregoing section give this field a new flavour, and among the basic novelties here is the use of continuous partitions of unity (see, in particular, Remarks 16 and 17), which serves as an indispensable tool in the analysis of the objects in Sect. 5.

The results of the foregoing section lead to the following theorem.

Theorem 15

Under the above conditions on $\mu $ and $\nu $,

$$\begin{aligned} D_{\mathrm {KL}} (\mu \Vert \nu )= & {} D_{\mathrm {KL}} (\mu \Vert \nu _a) = \rho _{-\ln } (\mu , \nu ) = \rho _{-\ln } (\mu , \nu _a) \end{aligned}$$

(35)

$$\begin{aligned}= & {} \, \sup _G\sum _{g\in G} \mu [g] \ln \biggl (\frac{\mu [g]}{\nu [g]}\biggr ) =\,\sup _G\sum _{g\in G} \mu [g] \ln \biggl (\frac{\mu [g]}{\nu _a[g]}\biggr ), \end{aligned}$$

(36)

where $\nu _a$ is the absolutely continuous component of $\nu $ with respect to $\mu $ and the supremum is taken over all (finite or countable) measurable partitions of unity G on X, and we assume that if $\mu [g]=0$, then the corresponding summand in the sums vanishes regardless of the second multiplier $\ln (\mu [g]/\nu [g])$ or $\ln (\mu [g]/\nu _a[g])$.

Proof

According to (2), we have $-\ln ^{\prime }(+\infty ) =0$. Hence, by Theorem 10,

$$\begin{aligned} \rho _{-\ln }(\mu ,\nu ) = \rho _{-\ln }(\mu ,\nu _a) = \int _{X} -\ln \biggl (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\biggr )\,\mathrm{d}\mu . \end{aligned}$$

(37)

It is easily seen that outside a set of zero measure $\mu $,

$$\begin{aligned} 0< \frac{\mathrm{d}\mu }{\mathrm{d}\nu _a} = \frac{\mathrm{d}\mu }{\mathrm{d}\nu } < +\infty . \end{aligned}$$

Therefore,

$$\begin{aligned} \int _X \ln \biggl (\frac{\mathrm{d}\mu }{\mathrm{d}\nu }\biggr )\,\mathrm{d}\mu = \int _X \ln \biggl (\frac{\mathrm{d}\mu }{\mathrm{d}\nu _a}\biggr )\,\mathrm{d}\mu = \int _{X} -\ln \biggl (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\biggr )\,\mathrm{d}\mu . \end{aligned}$$

(38)

From (37), (38), we obtain equalities (35).
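The relation between $\mathrm{d}\mu /\mathrm{d}\nu _a$ and $\mathrm{d}\mu /\mathrm{d}\nu $ used before (38) can be spelled out as follows. This is only a sketch based on the Lebesgue decomposition; the symbols $\nu _s$, N and h are auxiliary notation introduced here. Write $\nu =\nu _a+\nu _s$, where $\nu _s$ is singular with respect to $\mu $, choose a measurable set N with $\mu (N)=0$ and $\nu _s(X\setminus N)=0$, and put $h:=\mathrm{d}\mu /\mathrm{d}\nu $. Since $\nu _a$ is absolutely continuous with respect to $\mu $, we have $\nu _a(N)=0$, and for every measurable set E

$$\begin{aligned} \mu (E) = \mu (E\setminus N) = \int _{E\setminus N} h\,\mathrm{d}\nu = \int _{E\setminus N} h\,\mathrm{d}\nu _a = \int _{E} h\,\mathrm{d}\nu _a. \end{aligned}$$

Hence, $\mu $ is absolutely continuous with respect to $\nu _a$ with density h. Moreover, $\mu (\{h=0\}) =\int _{\{h=0\}} h\,\mathrm{d}\nu =0$, and h is $\nu $-integrable, so it is finite outside a set of zero measure $\nu $ and therefore outside a set of zero measure $\mu $. Consequently, $0<h<+\infty $ and $\mathrm{d}\nu _a/\mathrm{d}\mu =1/h$ outside a set of zero measure $\mu $.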

Recall that $\mu $ is absolutely continuous with respect to $\nu $ and hence with respect to $\nu _a$ as well. So if $\mu [g]\ne 0$, then $\nu [g]\ne 0$ and $\nu _a[g]\ne 0$ for any element g of a measurable partition of unity on X. From this and definition (4) of $\rho _{-\ln }(\mu ,\nu )$, it follows that

$$\begin{aligned} \rho _{-\ln }(\mu ,\nu ) =\, \sup _G\sum _{g\in G} \mu [g] \biggl (-\ln \biggl (\frac{\nu [g]}{\mu [g]}\biggr )\!\biggr ) =\, \sup _G\sum _{g\in G} \mu [g] \ln \biggl (\frac{\mu [g]}{\nu [g]}\biggr ), \end{aligned}$$

where all summands with $\mu [g] =0$ are supposed to be zero. The analogous equality for $\rho _{-\ln }(\mu ,\nu _a)$ is obtained in the same way. Thus, Theorem 15 is proved. $\square $
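To see the sup-sums principle (36) at work, one can evaluate the sums for several partitions of unity on a finite set. The following sketch assumes a four-point space with strictly positive $\nu $; the helper names (`pairing`, `kl_sum`, `random_partition_of_unity`) and the data are hypothetical and serve only as an illustration.

```python
import math
import random

def pairing(measure, g):
    """mu[g]: the integral of the function g against a measure given by point masses."""
    return sum(m * w for m, w in zip(measure, g))

def kl_sum(mu, nu, G):
    """Sum over g in G of mu[g] * ln(mu[g] / nu[g]); summands with mu[g] = 0 vanish."""
    total = 0.0
    for g in G:
        mg, ng = pairing(mu, g), pairing(nu, g)
        if mg > 0:
            total += mg * math.log(mg / ng)
    return total

def random_partition_of_unity(n_points, k, rng):
    """k nonnegative functions on n_points points whose values sum to 1 at every point."""
    columns = []
    for _ in range(n_points):
        w = [rng.random() for _ in range(k)]
        s = sum(w)
        columns.append([v / s for v in w])
    return [[columns[x][j] for x in range(n_points)] for j in range(k)]

mu = [0.5, 0.25, 0.125, 0.125]
nu = [0.25, 0.25, 0.25, 0.25]
n = len(mu)

trivial = [[1.0] * n]                                                    # a single unit function
finest = [[1.0 if x == y else 0.0 for x in range(n)] for y in range(n)]  # indicators of points
kl = sum(m * math.log(m / v) for m, v in zip(mu, nu))                    # definition (34)

rng = random.Random(0)
soft = [kl_sum(mu, nu, random_partition_of_unity(n, 3, rng)) for _ in range(5)]
print(kl_sum(mu, nu, trivial), soft, kl_sum(mu, nu, finest), kl)
```

On this example, the trivial partition gives zero, the randomly generated partitions of unity give intermediate values, and the partition into points attains $D_{\mathrm {KL}}(\mu \Vert \nu ) = \frac{1}{4}\ln 2$.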

Remark 16

The theorem just proved, along with formula (38), naturally suggests an extension of the definition of Kullback–Leibler divergence to measures that need not be probability measures and need not be absolutely continuous with respect to one another. Namely, for any finite positive measures $\mu $, $\nu $ on a measurable space $(X,\mathfrak A)$, let us define the generalized Kullback–Leibler divergence $D_{\mathrm {KL}}(\mu \Vert \nu )$ by the formula

$$\begin{aligned} D_{\mathrm {KL}}(\mu \Vert \nu ) := \int _X -\ln \biggl (\frac{\mathrm{d}\nu _a}{\mathrm{d}\mu }\biggr )\, \mathrm{d}\mu . \end{aligned}$$

(39)

The reasoning from the proof of Theorem 15 shows that $D_{\mathrm {KL}} (\mu \Vert \nu )$ defined in this way satisfies equalities (35) and (36) as well.
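For measures on a finite set, definition (39) reduces to a finite sum, which the following sketch evaluates. It assumes that $\mu $ and $\nu $ are given by lists of nonnegative point masses, in which case $\nu _a$ is the restriction of $\nu $ to the points of positive $\mu $-mass; the function name `generalized_kl` is hypothetical.

```python
import math

def generalized_kl(mu, nu):
    """Definition (39) on a finite set: the integral of -ln(d nu_a / d mu) against mu."""
    total = 0.0
    for m, n in zip(mu, nu):
        if m == 0:
            continue          # points of zero mu-mass carry only the singular part of nu
        if n == 0:
            return math.inf   # d nu_a / d mu = 0 on a set of positive mu-measure
        total += m * (-math.log(n / m))
    return total

# nu has a singular part (the third point) and neither measure is normalized
print(generalized_kl([2.0, 1.0, 0.0], [1.0, 1.0, 5.0]))  # equals 2 * ln 2
# mu is not absolutely continuous with respect to nu
print(generalized_kl([1.0, 1.0], [1.0, 0.0]))            # +infinity
# for non-normalized measures the value may be negative
print(generalized_kl([1.0], [math.e]))                   # equals -1
```

The last call shows that, without normalization, the generalized divergence need not be nonnegative.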

Remark 17

If X is a topological space and $\mu $ and $\nu $ are Borel measures such that the set C(X) of continuous functions is dense in $L^1(X,\mu )$ and $L^1(X,\nu )$ (which is always true for a metrizable space X or, as a variant, for regular measures $\mu $, $\nu $), then, recalling Theorems 12 and 14, one concludes that when applying (35) and (36) to definition (39), we can equally use continuous (finite or countable) partitions of unity.

Remark 18

As is known, apart from Kullback–Leibler divergence, many common divergences are special cases of F-divergence corresponding to a suitable choice of F. For example, Hellinger distance corresponds to the function $F(t) =1-\sqrt{t}$, total variation distance corresponds to $F(t) =|t-1|$, Pearson $\chi ^2$-divergence corresponds to $F(t) =(t-1)^2$, and for the function $F(t) =(t^\alpha -t)/(\alpha ^2 -\alpha )$, we obtain the so-called $\alpha $-divergence.

Thus, by choosing the corresponding convex functions F, one can write out the ‘sup-sums principles’ of Theorem 15 type for them where again one can exploit not only measurable but also continuous partitions of unity. Moreover, for example, for total variation distance, Pearson $\chi ^2$-divergence and $\alpha $-divergence one naturally arrives at consideration of real-valued (not necessarily nonnegative) measures.
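The generators listed in Remark 18 can be plugged into definition (1) directly. The sketch below assumes probability vectors Q and P on a finite set with all entries of P positive; the function names are hypothetical, and the normalization of each divergence is the one induced by the stated F, which may differ from other conventions by a constant factor.

```python
import math

def f_divergence(q, p, F):
    """D_F(Q || P) = sum over x of p(x) * F(q(x) / p(x)), assuming every p(x) > 0."""
    return sum(pi * F(qi / pi) for qi, pi in zip(q, p))

def alpha_generator(alpha):
    """F(t) = (t^alpha - t) / (alpha^2 - alpha), defined for alpha not equal to 0 or 1."""
    return lambda t: (t ** alpha - t) / (alpha ** 2 - alpha)

hellinger = lambda t: 1.0 - math.sqrt(t)
total_variation = lambda t: abs(t - 1.0)
pearson = lambda t: (t - 1.0) ** 2

q = [0.5, 0.5]
p = [0.25, 0.75]

print(f_divergence(q, p, hellinger))
print(f_divergence(q, p, total_variation))     # 0.5, the sum of |q(x) - p(x)|
print(f_divergence(q, p, pearson))             # 1/3
print(f_divergence(q, p, alpha_generator(2)))  # 1/6, half of the Pearson value
```

Each generator satisfies $F(1)=0$, so every one of these divergences vanishes for $Q=P$.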

Remark 19

In the paper [22], the result of Theorem 15 type was established for a sigma-finite measure $\nu $ and a measure $\mu $ which is absolutely continuous with respect to $\nu $.

5 New Definition for t-Entropy

In this section, as an application of Theorems 14 and 15, we obtain a new formula for t-entropy that clarifies its relationship with Kullback–Leibler divergence.

The t-entropy (we recall its definition below) is a principal object of spectral analysis of operators associated with dynamical systems. In particular, in the series of articles [3,4,5,6,7,8], a relation between t-entropy and spectral radii of the corresponding operators has been established. Namely, it was shown that t-entropy is the Fenchel–Legendre dual to the spectral exponent of the operators in question.

For transparency of presentation, let us recall the mentioned objects and results.

Hereafter, X is a Hausdorff compact space, C(X) is the algebra of continuous functions on X taking real values and equipped with the max-norm, and $\alpha \!:X\rightarrow X$ is an arbitrary continuous mapping. The corresponding dynamical system will be denoted by $(X,\alpha )$.

Recall that a transfer operator $A\!:C(X)\rightarrow C(X)$, associated with a given dynamical system, is defined in the following way:

(a) A is a positive linear operator (i.e., it maps nonnegative functions to nonnegative ones); and
(b) the following homological identity for A is valid:
$$\begin{aligned} A(g \circ \alpha \cdot f) = gAf, \quad g,f\in C(X). \end{aligned}$$

As an important and popular example of transfer operators, one can take, say, the classical Perron–Frobenius operator, that is, the operator of the form

$$\begin{aligned} Af(x):= \sum _{y\in \alpha ^{-1}(x)}a(y)f(y), \end{aligned}$$

where $a\in C(X)$ is fixed. This operator is well defined when $\alpha $ is a local homeomorphism.
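On a finite set X with the discrete topology, every mapping $\alpha $ is continuous and the Perron–Frobenius operator becomes a finite sum, so the homological identity can be checked directly. The sketch below makes this assumption and also takes a nonnegative weight a so that the operator is positive; the function name `perron_frobenius` and the data are hypothetical.

```python
def perron_frobenius(alpha, a):
    """Return A with (A f)(x) = sum of a(y) * f(y) over all y such that alpha(y) = x."""
    n = len(alpha)

    def A(f):
        out = [0.0] * n
        for y in range(n):
            out[alpha[y]] += a[y] * f[y]
        return out

    return A

alpha = [1, 2, 0, 0, 2]          # alpha(0)=1, alpha(1)=2, alpha(2)=0, alpha(3)=0, alpha(4)=2
a = [0.5, 1.0, 2.0, 0.25, 1.5]   # nonnegative weight, so A maps f >= 0 to A f >= 0
A = perron_frobenius(alpha, a)

f = [1.0, -2.0, 0.5, 3.0, 4.0]
g = [2.0, 0.0, -1.0, 5.0, 1.5]

# Homological identity: A((g o alpha) * f) = g * (A f)
lhs = A([g[alpha[y]] * f[y] for y in range(len(f))])
rhs = [g[x] * v for x, v in enumerate(A(f))]
print(lhs)
print(rhs)
```

Points without preimages (here the points 3 and 4) receive the value zero on both sides of the identity.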

Let A be a certain transfer operator in C(X). In what follows, we denote by $A_\varphi $ the family of transfer operators in C(X) given by the formula

$$\begin{aligned} A_\varphi f :=A(e^\varphi f), \quad \varphi \in C(X). \end{aligned}$$

Let us denote by $\lambda (\varphi )$ the spectral potential of $A_\varphi $, defined by the formula

$$\begin{aligned} \lambda (\varphi ) := \lim _{n\rightarrow \infty }\frac{1}{n}\ln \Vert {A_\varphi ^n}\Vert = \ln (r(A_\varphi )), \end{aligned}$$

where $r(A_\varphi )$ is the spectral radius of operator $A_\varphi $.
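The limit defining $\lambda (\varphi )$ can be approximated numerically in a toy case. The sketch below assumes that X is the cyclic group of N points, $\alpha (y) = y+1 \bmod N$, and A is the Perron–Frobenius operator with a positive weight a. Under these assumptions $A_\varphi ^N$ is the multiplication by $\prod _y a(y)e^{\varphi (y)}$, so $\lambda (\varphi ) = \frac{1}{N}\sum _y \bigl (\ln a(y) +\varphi (y)\bigr )$, and this closed form serves as a check. All names are hypothetical.

```python
import math

def weighted_shift_matrix(a, phi):
    """Matrix of A_phi for alpha(y) = y + 1 mod N: entry [alpha(y)][y] = a(y) * exp(phi(y))."""
    n = len(a)
    M = [[0.0] * n for _ in range(n)]
    for y in range(n):
        M[(y + 1) % n][y] = a[y] * math.exp(phi[y])
    return M

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def sup_norm(M):
    """Operator norm on C(X) with the max-norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def spectral_potential(M, n_iter):
    """Approximate lambda by (1 / n) * ln ||M^n|| with n = n_iter."""
    P = M
    for _ in range(n_iter - 1):
        P = matmul(P, M)
    return math.log(sup_norm(P)) / n_iter

a = [0.5, 2.0, 3.0]
phi = [0.1, -0.2, 0.4]
approx = spectral_potential(weighted_shift_matrix(a, phi), 60)
closed_form = sum(math.log(w) + p for w, p in zip(a, phi)) / len(a)
print(approx, closed_form)
```

The number of iterations is chosen as a multiple of N, for which the approximation coincides with the closed form up to rounding; for other values it only converges to it.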

We denote by M(X) the set of all Borel probability measures on X. Recall that a measure $\mu \in M(X)$ is called $\alpha $-invariant iff $\mu [g] =\mu [g\circ \alpha ]$ for all $g\in C(X)$. The family of $\alpha $-invariant probability measures on X is denoted by $M_\alpha (X)$.

The t-entropy is a certain functional on M(X) denoted by $\tau (\mu )$ (its detailed definition will be given below).

The substantial importance of t-entropy is demonstrated by the following variational principle.

Theorem 20

([6], Theorem 5.6) Let $A\!: C(X)\rightarrow C(X)$ be a transfer operator for a continuous mapping $\alpha \!:X\rightarrow X$ of a compact Hausdorff space X. Then,

$$\begin{aligned} \lambda (\varphi ) = \max _{\mu \in M_\alpha (X)} \bigl (\mu [\varphi ] +\tau (\mu )\bigr ), \quad \varphi \in C(X). \end{aligned}$$

One readily notes the resemblance of this theorem to the Ruelle–Walters variational principle for the topological pressure [21, 25, 26], which uncovers the relation of the pressure to Kolmogorov–Sinai entropy.

Among the principal ingredients in the proofs of the results leading to Theorem 20 is the so-called ‘entropy statistic theorem’. In the spectral theory of weighted shift and transfer operators, this theorem plays a role analogous to that of the Shannon–McMillan–Breiman theorem in information theory [1, 18] and its important corollary known as the ‘asymptotic equipartition property’ [10, p. 135]. The variational principles containing t-entropy and the objects therein serve as key ingredients of the thermodynamical formalism (see [4, 7, 17] and the sources quoted there).

Being so important, t-entropy is at the same time a rather sophisticated object to calculate. A description of t-entropy that does not rely on Fenchel–Legendre duality is not elementary, and it took substantial time and effort to obtain its ‘accessible’ definition.

Namely, originally t-entropy $\tau (\mu )$ was defined in three steps (see, for example, [6]).

Definition 1

Firstly, for a given $\mu \in M(X)$, any $n\in {\mathbb {N}}$, and any continuous partition of unity $G =\{g_1,\dots ,g_k\}$ we set

$$\begin{aligned} \tau _n(\mu ,G) :=\sup _{m\in M(X)}\sum _{g_i\in G}\mu [g_i] \ln \frac{m[A^ng_i]}{\mu [g_i]}. \end{aligned}$$

(40)

Here, if $\mu [g_i] = 0$ for some $g_i\in G$, then the corresponding summand in (40) is assumed to be zero regardless of the value of $m[A^ng_i]$; if $m[A^ng_i] = 0$ for some $g_i\in G$ and at the same time $\mu [g_i]>0$, then $\tau _n(\mu ,G) = -\infty $.

Secondly, we put

$$\begin{aligned} \tau _n(\mu ) :=\, \inf _G\tau _n(\mu ,G); \end{aligned}$$

(41)

here the infimum is taken over all continuous partitions of unity G in C(X).

And finally, the t-entropy $\tau (\mu )$ is defined as

$$\begin{aligned} \tau (\mu ) := \inf _{n\in {\mathbb {N}}}\frac{\tau _n(\mu )}{n}. \end{aligned}$$

Recently, in [9], the authors proved that for $\mu \in M_\alpha (X)$ (note that only such measures are essential for Theorem 20), the t-entropy could be defined in two steps.

Definition 2

First, we set

$$\begin{aligned} \tau _n(\mu ):=\, \inf _G \sum _{g\in G} \mu [g]\ln \frac{\mu [A^n g]}{\mu [g]}, \end{aligned}$$

(42)

where the infimum is taken over the set of all continuous partitions of unity G, and we assume that if $\mu [g]=0$, then the corresponding summand in the right-hand part of the equality is equal to 0 independently of the value of $\mu [A^n g]$.

Now, $\tau (\mu )$ is defined as

$$\begin{aligned} \tau (\mu ) := \inf _{n\in {\mathbb {N}}} \frac{\tau _n(\mu )}{n} . \end{aligned}$$

In other words, in the original definition of t-entropy, one does not need to calculate the supremum in (40) but can simply put $m=\mu $ there. In [9], it was proved that this leads to the same value of $\tau _n(\mu )$ in (42) as in (41).

Of course, two steps are better than three, but even this two-step definition of t-entropy is rather sophisticated.

Note parenthetically that, if one identifies a Borel measure $\mu $ on X with a linear functional $\mu \!: C(X) \rightarrow {\mathbb {R}}$ given by

$$\begin{aligned} \mu [f] := \int _X f\, \mathrm{d}\mu \end{aligned}$$

then, by Riesz’s theorem, there exists a unique regular Borel measure on X defining the same functional. Thus, since the definition of t-entropy relies only on integration of continuous functions (forming partitions of unity), it suffices to determine the t-entropy only for regular measures $\mu $ (which are the measures considered, for instance, in Theorem 14).

The next theorem in essence gives a new definition of t-entropy and transparently establishes its relation to Kullback–Leibler divergence.

Theorem 21

(t-entropy via Kullback–Leibler divergence) Let A be a transfer operator for a dynamical system $(X, \alpha )$. Then, for any regular measure $\mu \in M_\alpha (X)$, we have

$$\begin{aligned} \tau _n (\mu ) = \int _X \ln \frac{\mathrm{d}(A^{*n}\mu )_a}{\mathrm{d}\mu } \,\mathrm{d}\mu = -D_{\mathrm {KL}}(\mu \Vert A^{*n}\mu ) \end{aligned}$$

and

$$\begin{aligned} \tau (\mu ) =\inf _{n\in {\mathbb {N}}}\,\frac{1}{n}\!\int _X \ln \frac{\mathrm{d}(A^{*n}\mu )_a}{\mathrm{d}\mu }\,\mathrm{d}\mu =-\sup _{n\in {\mathbb {N}}}\frac{D_{\mathrm {KL}}(\mu \Vert A^{*n}\mu )}{n}, \end{aligned}$$

where $A^*\!: C(X)^* \rightarrow C(X)^*$ is the operator adjoint to A.

Proof

Apply Theorem 15 and Remark 17 to (42). Namely, set $\nu = A^{*n}\mu $ in this equality (so that $\mu [A^ng] = \nu [g]$) and apply formulae (34)–(36). $\square $
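The formula of Theorem 21 can be compared with the sums in (42) on a toy system. The sketch below assumes a finite cyclic space with $\alpha (y) = y+1 \bmod N$, the Perron–Frobenius operator with a positive weight a, and the uniform measure $\mu $, which is $\alpha $-invariant. In this case a direct computation gives $\tau _n(\mu ) = \frac{n}{N}\sum _y \ln a(y)$, which is used as a check. All function names are hypothetical.

```python
import math

def apply_A(alpha, a, f):
    """(A f)(x) = sum of a(y) * f(y) over y with alpha(y) = x."""
    out = [0.0] * len(f)
    for y, x in enumerate(alpha):
        out[x] += a[y] * f[y]
    return out

def pairing(mu, f):
    return sum(m * v for m, v in zip(mu, f))

def power_A(alpha, a, f, n):
    for _ in range(n):
        f = apply_A(alpha, a, f)
    return f

def pushed_measure(alpha, a, mu, n):
    """Point masses of nu = A^{*n} mu, obtained from nu[g] = mu[A^n g] on indicators of points."""
    N = len(mu)
    indicators = [[1.0 if x == y else 0.0 for x in range(N)] for y in range(N)]
    return [pairing(mu, power_A(alpha, a, g, n)) for g in indicators]

def tau_n_integral(mu, nu):
    """Integral of ln(d nu_a / d mu) against mu; only points with mu > 0 contribute."""
    total = 0.0
    for m, v in zip(mu, nu):
        if m > 0:
            if v == 0:
                return -math.inf
            total += m * math.log(v / m)
    return total

def sum_42(alpha, a, mu, n, G):
    """The sum in (42) for one partition of unity G; summands with mu[g] = 0 vanish."""
    total = 0.0
    for g in G:
        mg = pairing(mu, g)
        if mg > 0:
            value = pairing(mu, power_A(alpha, a, g, n))
            if value == 0:
                return -math.inf
            total += mg * math.log(value / mg)
    return total

alpha = [1, 2, 3, 0]
a = [0.5, 2.0, 1.5, 3.0]
mu = [0.25] * 4
N = len(mu)
finest = [[1.0 if x == y else 0.0 for x in range(N)] for y in range(N)]
trivial = [[1.0] * N]

for n in (1, 2, 3):
    tau_n = tau_n_integral(mu, pushed_measure(alpha, a, mu, n))
    print(n, tau_n, sum_42(alpha, a, mu, n, finest), sum_42(alpha, a, mu, n, trivial))
```

On this example, the partition into points reproduces the integral, while the trivial partition gives a larger value, in agreement with the infimum in (42); moreover, $\tau _n(\mu )/n$ does not depend on n, so $\tau (\mu ) = \frac{1}{4}\ln 4.5$.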

References

1. Algoet, P.H., Cover, T.M.: A sandwich proof of the Shannon–McMillan–Breiman theorem. Ann. Probab. 16(2), 899–909 (1988)
2. Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28(1), 131–142 (1966)
3. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: Thermodynamics and spectral radius. Nonlinear Phenom. Complex Syst. 4(4), 318–321 (2001)
4. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V., Sarzhinsky, D.S.: Legendre analysis, thermodynamic formalism and spectra of Perron–Frobenius operators. Dokl. Math. 67(3), 343–345 (2003)
5. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: Spectra of operators associated with dynamical systems: from ergodicity to the duality principle. In: Twenty Years of Bialowieza: A Mathematical Anthology. World Scientific Monograph Series in Mathematics, Chapter 7, vol. 8, pp. 129–161 (2005)
6. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: On $t$-entropy and variational principle for the spectral radii of transfer and weighted shift operators. Ergod. Theory Dyn. Syst. 31, 995–1045 (2011)
7. Antonevich, A.B., Bakhtin, V.I., Lebedev, A.V.: A road to the spectral radius of transfer operators. Contemp. Math. 567, 17–51 (2012)
8. Bakhtin, V.I.: On $t$-entropy and variational principle for the spectral radius of weighted shift operators. Ergod. Theory Dyn. Syst. 30, 1331–1342 (2010)
9. Bakhtin, V.I., Lebedev, A.V.: A new definition of $t$-entropy for transfer operators. Entropy 19(573), 1–6 (2017)
10. Billingsley, P.: Ergodic Theory and Information. Wiley, New York (1965)
11. Broniatowski, M., Keziou, A.: Minimization of divergences on sets of signed measures. Stud. Sci. Math. Hung. 43(4), 403–442 (2006)
12. Broniatowski, M., Keziou, A.: Parametric estimation and tests through divergences and duality technique. J. Multivar. Anal. 100(1), 16–36 (2009)
13. Broniatowski, M.: A weighted bootstrap procedure for divergence minimization problems. In: Analytical Methods in Statistics. Springer Proceedings in Mathematics & Statistics, vol. 193, pp. 1–22 (2015)
14. Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar. Tud. Akad. Mat. Kutato Int. Kozl 8, 85–108 (1963)
15. Gelfand, I.M., Kolmogorov, A.N., Yaglom, A.M.: On the general definition of the amount of information. Dokl. Akad. Nauk SSSR 111, 745–748 (1956)
16. Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52(10), 4394–4412 (2006)
17. Lopes, A., Mengue, J., Mohr, J., Souza, R.: Entropy and variational principle for one-dimensional lattice systems with a general a priori probability: positive and zero temperature. Ergod. Theory Dyn. Syst. 35, 1925–1961 (2015)
18. McMillan, B.: The basic theorems of information theory. Ann. Math. Stat. 24, 196–219 (1953)
19. Morimoto, T.: Markov processes and the H-theorem. J. Phys. Soc. Jpn. 18(3), 328–331 (1963)
20. Najim, J.: A Cramér type theorem for weighted random variables. Electron. J. Probab. 7(4), 1–32 (2002)
21. Ruelle, D.: Statistical mechanics on a compact set with $Z^\nu $ action satisfying expansiveness and specification. Trans. Am. Math. Soc. 185, 237–252 (1973)
22. Sokol, E.E.: Introduction of the Kullback–Leibler information function by means of partitions of the probability space. J. Belarus. State Univ. Math. Inform. 1, 59–67 (2018)
23. Trashorras, J., Wintenberger, O.: Large deviations for bootstrapped empirical measures. Bernoulli 20(4), 1845–1878 (2014)
24. Vajda, I.: On the $f$-divergence and singularity of probability measures. Period. Math. Hung. 1, 223–234 (1972)
25. Walters, P.: A variational principle for the pressure of continuous transformations. Am. J. Math. 97(4), 937–971 (1975)
26. Walters, P.: An Introduction to Ergodic Theory. Springer, Berlin (1982)

Author information

Authors and Affiliations

John Paul II Catholic University of Lublin, Lublin, Poland
V. I. Bakhtin
Belarusian State University, Minsk, Belarus
V. I. Bakhtin & A. V. Lebedev
University of Bialystok, Białystok, Poland
A. V. Lebedev

Corresponding author

Correspondence to V. I. Bakhtin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bakhtin, V.I., Lebedev, A.V. Sup-Sums Principles for F-Divergence and a New Definition for t-Entropy. J Theor Probab 35, 350–369 (2022). https://doi.org/10.1007/s10959-020-01046-5

Received: 29 February 2020
Revised: 14 August 2020
Published: 04 November 2020
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10959-020-01046-5
