Concentration inequalities for bounded functionals via generalized log-Sobolev inequalities

In this paper we prove multilevel concentration inequalities for bounded functionals $f = f(X_1, \ldots, X_n)$ of random variables $X_1, \ldots, X_n$ that are either independent or satisfy certain logarithmic Sobolev inequalities. The constants in the tail estimates depend on the operator norms of $k$-tensors of higher order differences of $f$. We provide various applications for both dependent and independent random variables, including empirical processes $f(X) = \sup_{g \in \mathcal{F}} \lvert g(X) \rvert$ or suprema of homogeneous chaos in bounded random variables in the Banach space case given by $f(X) = \sup_{t} \lVert \sum_{i_1 \neq \ldots \neq i_d} t_{i_1 \ldots i_d} X_{i_1} \cdots X_{i_d} \rVert_{\mathcal{B}}$. The latter application generalizes earlier results of Talagrand and Boucheron-Bousquet-Lugosi-Massart. In the case of Rademacher random variables, these can be interpreted as results on the Boolean hypercube. Further examples are concentration inequalities for $U$-statistics with bounded kernels $h$ and for the number of triangles in an exponential random graph model.


Introduction
During the last forty years, the concentration of measure phenomenon has become an established part of probability theory with applications in numerous fields, see for example [MS86;Led01;BLM13;RS14;vH16]. One way to prove concentration of measure is by using functional inequalities, more specifically the entropy method. It has emerged as a way to prove several groundbreaking concentration inequalities in product spaces by Talagrand [Tal91;Tal96], mainly in the works of Ledoux [Led97] and Bobkov and Ledoux [BL97].
To convey the idea, let us recall that the logarithmic Sobolev inequality for the standard Gaussian measure $\mu$ in $\mathbb{R}^n$ (see [Gro75]) states that for any $f \in C_c^\infty(\mathbb{R}^n)$ we have
$$\mathrm{Ent}_\mu(f^2) \le 2 \int \lvert \nabla f \rvert^2 \, d\mu, \tag{1.1}$$
where $\mathrm{Ent}_\mu(f^2) = \int f^2 \log f^2 \, d\mu - \int f^2 \, d\mu \, \log \int f^2 \, d\mu$ is the entropy functional. Informally, it bounds the disorder of a function $f$ (under $\mu$) by its average local fluctuations, measured in terms of the length of the gradient. It is also known that if $\mu$ is a measure on a discrete set $\mathcal{X}$ (or a more abstract set not allowing for a replacement for $\lvert \nabla f \rvert$), then there are several ways to reformulate equation (1.1), see e.g. [DS96] or [BT06]. We will continue these thoughts and work in the framework of difference operators. Given any probability space with measure $\mu$, we call any operator $\Gamma : L^\infty(\mu) \to L^\infty(\mu)$ satisfying $\lvert \Gamma(af + b) \rvert = a \lvert \Gamma f \rvert$ for all $a > 0$, $b \in \mathbb{R}$ a difference operator. Accordingly, given $\Gamma$, we say that $\mu$ satisfies a $\Gamma$-$\mathrm{LSI}(\sigma^2)$ if for all bounded functionals $f$ we have
$$\mathrm{Ent}_\mu(f^2) \le 2 \sigma^2 \int \Gamma(f)^2 \, d\mu. \tag{1.2}$$
Apart from the domain of $\Gamma$, it is clear that (1.2) can be seen as a generalization of (1.1) by defining $\Gamma(f) = \lvert \nabla f \rvert$ on $\mathbb{R}^n$. The aim of this work is to show that the freedom of choosing a suitable difference operator $\Gamma$ leads to interesting results in the setting of both independent and dependent random variables. A specific choice of $\Gamma$ will lead to a universal inequality of the type (1.2). In certain cases, we shall use these logarithmic Sobolev inequalities to obtain bounds on the growth rate of the moments of a functional, and deduce concentration inequalities.
The setting of difference operators is general enough, and yet has the added benefit that the shift-invariance and homogeneity of difference operators imply that a $\Gamma$-$\mathrm{LSI}(\sigma^2)$ yields a Poincaré inequality
$$\mathrm{Var}_\mu(f) \le \sigma^2 \int \Gamma(f)^2 \, d\mu,$$
where $\mathrm{Var}_\mu(f) = \int f^2 \, d\mu - (\int f \, d\mu)^2$ is the variance functional. This can be shown by a Taylor expansion of $x \log x$ and is by now classical.
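As a numerical illustration of the Taylor expansion argument, the following Python sketch evaluates the entropy and variance functionals on a finite probability space and checks that $\mathrm{Ent}_\mu((1 + \varepsilon g)^2) \approx 2 \varepsilon^2 \mathrm{Var}_\mu(g)$ for small $\varepsilon$. The three-point space, the weights and the function $g$ are arbitrary choices made only for illustration; they do not come from the text.

```python
import math

def entropy(values, weights):
    """Ent(F) = E[F log F] - E[F] log E[F] for a positive F on a finite space."""
    mean = sum(w * v for v, w in zip(values, weights))
    e_f_log_f = sum(w * v * math.log(v) for v, w in zip(values, weights))
    return e_f_log_f - mean * math.log(mean)

def variance(values, weights):
    """Var(g) = E[g^2] - (E[g])^2 on a finite space."""
    mean = sum(w * v for v, w in zip(values, weights))
    return sum(w * v * v for v, w in zip(values, weights)) - mean ** 2

# Hypothetical example data: a three-point space with an arbitrary function g.
weights = [0.2, 0.3, 0.5]
g = [1.0, -2.0, 0.5]
eps = 1e-3

# Second-order expansion: Ent((1 + eps*g)^2) = 2 * eps^2 * Var(g) + o(eps^2).
f_squared = [(1.0 + eps * x) ** 2 for x in g]
ratio = entropy(f_squared, weights) / eps ** 2

print("Ent((1 + eps*g)^2) / eps^2 =", ratio)
print("2 * Var(g)                 =", 2.0 * variance(g, weights))
```

The two printed numbers agree up to a correction of order $\varepsilon$, which is the mechanism by which an entropy bound for $f^2$ turns into a variance bound for $g$.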
Throughout this note, $X = (X_1, \ldots, X_n)$ is a random vector taking values in some product space $\mathcal{Y} = \otimes_{i=1}^n \mathcal{X}_i$ (equipped with the product $\sigma$-algebra) defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We denote the law of $X$ by $\mu$. By abuse of language, we say that $X = (X_1, \ldots, X_n)$ satisfies a $\Gamma$-$\mathrm{LSI}(\sigma^2)$ if its distribution does so. In any finite-dimensional vector space, we let $\lvert \cdot \rvert$ be the Euclidean norm. For brevity, for any probability measure $\mathbb{P}$, any random $k$-tensor $A$ and any $p \in (0, \infty]$ we write
Here, $\lvert A \rvert_{\mathrm{op}}$ is the operator norm, defined as
1.1. Main results.
To present the concentration inequalities, we introduce the difference operator $h$, which is frequently used in the method of bounded differences. Let $X' = (X'_1, \ldots, X'_n)$ be an independent copy of $X$, defined on the same probability space. Given $f(X) \in L^\infty(\mathbb{P})$, define for each $i \in \{1, \ldots, n\}$
where $\lVert \cdot \rVert_{i,\infty}$ denotes the $L^\infty$-norm with respect to $(X_i, X'_i)$. The difference operator $\lvert hf \rvert$ is given as the Euclidean norm of the vector $hf$. Clearly, depending on the random vector $(X_j)_{j \neq i}$, $h_i f$ provides a uniform upper bound on the differences with respect to the $i$-th coordinate.
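To make the role of $h$ concrete, here is a small Python sketch that computes a bounded-differences vector by brute force when every coordinate takes values in the same finite set. It assumes that $h_i f$ is the largest change of $f$ when only the $i$-th coordinate is varied, with no additional normalizing factor (the normalization used in the text may differ by a constant), and that every point of the finite set has positive probability, so that the $L^\infty$-norm is a plain maximum. The function names and the test functions are hypothetical.

```python
import math

def h_i(f, x, i, coordinate_space):
    """Largest change of f when coordinate i ranges over its space.

    Assumption: no normalizing factor (such as 1/2) is applied, and all
    points of coordinate_space carry positive mass.
    """
    values = []
    for y in coordinate_space:
        z = list(x)
        z[i] = y
        values.append(f(tuple(z)))
    # sup over pairs (x_i, x_i') of |f(.., x_i, ..) - f(.., x_i', ..)|
    return max(values) - min(values)

def h_norm(f, x, coordinate_space):
    """Euclidean norm of the vector (h_1 f(x), ..., h_n f(x))."""
    return math.sqrt(sum(h_i(f, x, i, coordinate_space) ** 2 for i in range(len(x))))

def linear(x):
    return sum(x)

def product(x):
    return x[0] * x[1]

cube = (-1, 1)            # hypothetical example: the hypercube {-1, +1}^4
x = (1, -1, 1, 1)

print(h_norm(linear, x, cube))   # every coordinate can change the sum by 2
print(h_norm(product, x, cube))  # only the first two coordinates matter
```

Note that `h_i` does not depend on the value of the $i$-th coordinate of `x` itself, in line with the remark that $h_i f$ depends only on $(X_j)_{j \neq i}$.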
Our first main theorem provides concentration inequalities for general bounded functionals of an independent random vector $X = (X_1, \ldots, X_n)$. It significantly improves [BGS18, Theorem 1.1] by replacing the Hilbert-Schmidt norms appearing therein by operator norms. This leads to much sharper bounds and a wider range of applications.
Theorem 1.1. Let $X = (X_1, \ldots, X_n)$ be a random vector with independent components and $f : \mathcal{Y} \to \mathbb{R}$ a measurable function satisfying $f(X) \in L^\infty(\mathbb{P})$. There exists a constant $C > 0$ depending on $d$ such that for any $t > 0$

holds.
Since defining the (iterated) difference operators $h^{(j)}$ and the operator norms for $k$-tensors is rather lengthy, we postpone it to Section 2. They can be thought of as analogues of the $k$-tensors of all partial derivatives of order $k$ in the abstract setting.
Note that the case $d = 2$ can be considered as a generalized form of the Hanson-Wright inequality for any function $f$, not just quadratic forms. We shall also see later how this can be applied to suprema of polynomial chaos.
For a class of weakly dependent random variables $X_1, \ldots, X_n$, we can prove estimates similar to those in Theorem 1.1. To this end, we will introduce another difference operator, which is more familiar in the context of logarithmic Sobolev inequalities for Markov chains, as developed in [DS96]. Assume that $\mathcal{Y} = \mathcal{X}^n$ for some finite set $\mathcal{X}$, and let $X$ be a $\mathcal{Y}$-valued random vector. We define
(1.9)
where we denote by $\mu(\cdot \mid x_i)$ the conditional measure (interpreted as a measure on $\mathcal{X}$) and by $\mu_i$ the marginal on $\mathcal{X}^{n-1}$. It appears naturally in the Dirichlet form associated to the Glauber dynamic of $\mu$, which is given by
There is a constant $C > 0$ depending on $d$ such that for any $f(X) \in L^\infty(\mathbb{P})$ and $t > 0$
(1.11)
In [SS18], the authors show various examples of random variables satisfying a $d$-$\mathrm{LSI}(\sigma^2)$, see also the next subsection.
The tail estimates of Theorems 1.1 and 1.2 can also be sharpened by considering special classes of functionals, such as polynomials or suprema of chaos-type functionals. In particular, it is possible to replace (1.8) and (1.11) by estimates which depend on possibly sharper norms.
1.1.1. Suprema of polynomial chaos.
Let $X_1, \ldots, X_n$ be a sequence of independent $\{-1, +1\}$-valued random variables with $\mathbb{P}(X_1 = +1) = \mathbb{P}(X_1 = -1) = 1/2$, let $\mathcal{I}_{n,d}$ denote the family of subsets of $\{1, \ldots, n\}$ with $d$ elements, let $T$ be a compact set of vectors in $\mathbb{R}^{\mathcal{I}_{n,d}}$, and write $X_I := \prod_{i \in I} X_i$. In [BBLM05, Theorem 14, Corollary 4] the authors have proven one-sided deviation inequalities for the random variable
where $\kappa \approx 1.27$ is a numerical constant (cf. Theorem 3.1) and
(1.14)
For $k = d$, we use the convention $\mathcal{I}_{n,0} = \{\emptyset\}$ and $X_\emptyset := 1$.
We strengthen the result in several ways. Firstly, we prove concentration inequalities rather than deviation inequalities for the upper tail. Secondly, the estimate will be valid for any Banach space. Thirdly, we remove the requirement of X 1 , . . . , X n being independent. We will first formulate the general theorem, and then show that (1.13) is a corollary.
Fix a Banach space $(\mathcal{B}, \lVert \cdot \rVert)$ with its dual space $(\mathcal{B}^*, \lVert \cdot \rVert_*)$, a compact subset $T \subset \mathcal{B}^{\mathcal{I}_{n,d}}$ and let $\mathcal{B}^*_1$ be the unit ball in $\mathcal{B}^*$ with respect to $\lVert \cdot \rVert_*$. Let $X_1, \ldots, X_n$ be real-valued random variables and define
Since the theory applies to bounded functionals, we assume that $X_1, \ldots, X_n$ are $\mathbb{P}$-a.s. bounded by a common constant $K$, which due to the $d$-homogeneity of $f$ we assume to be 1. For any $k \in \{1, \ldots, n\}$ we define the generalization of (1.14) given by
(1.16)
One can interpret the quantities $W_k$ in the following way: if we denote by $f_t(x) = \sum_{I \in \mathcal{I}_{n,d}} x_I t_I$ the corresponding polynomial in $n$ variables, and by $\nabla^k f_t(x)$ the $k$-tensor of all partial derivatives of order $k$, then $W_k = \sup_{t \in T} \lVert \nabla^k f_t(X) \rVert_{\mathrm{op}}$.
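The interpretation of $W_k$ as a supremum of derivative norms can be made concrete in the simplest situation. The following Python sketch treats the real-valued case $\mathcal{B} = \mathbb{R}$ with a finite index set $T$ and computes $\sup_t \lvert f_t(x) \rvert$ together with $W_1 = \sup_t \lvert \nabla f_t(x) \rvert$ (for a vector the operator norm is the Euclidean norm). The data structure (each $t$ is a dictionary mapping index tuples $I$ to coefficients $t_I$), the function names and the example coefficients are assumptions made for illustration only.

```python
import math

def chaos_value(t, x):
    """f_t(x) = sum over index sets I of t_I * prod_{i in I} x_i."""
    return sum(coef * math.prod(x[i] for i in I) for I, coef in t.items())

def gradient(t, x):
    """First partial derivatives of the multilinear polynomial f_t at x."""
    grad = [0.0] * len(x)
    for I, coef in t.items():
        for i in I:
            # d/dx_i of x_I is the product over the remaining indices of I
            grad[i] += coef * math.prod(x[j] for j in I if j != i)
    return grad

def sup_chaos(T, x):
    """sup over t in T of |f_t(x)| (real-valued case, finite T)."""
    return max(abs(chaos_value(t, x)) for t in T)

def w_1(T, x):
    """sup over t in T of the Euclidean norm of the gradient of f_t at x."""
    return max(math.sqrt(sum(g * g for g in gradient(t, x))) for t in T)

# Hypothetical example with n = 3, d = 2 and two coefficient vectors.
T = [
    {(0, 1): 1.0, (1, 2): -0.5},
    {(0, 2): 2.0},
]
x = (1, -1, 1)  # one realization of Rademacher variables

print(sup_chaos(T, x))
print(w_1(T, x))
```

Higher order quantities $W_k$ would require the operator norm of a $k$-tensor, which is no longer a plain Euclidean norm; this sketch deliberately stops at $k = 1$.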
Furthermore, in the case of independent random variables, we need the quantities
Corollary 1.4. Let $X_1, \ldots, X_n$ be independent Rademacher random variables and $f = f(X)$ as in (1.12). We have

On the other hand, if X consists of independent components and has support in
Consequently, there is a constant $C > 0$ depending on $d$, such that for all $t > 0$
As a second corollary, Theorem 1.3 can be used to recover and strengthen a famous result by Talagrand [Tal96, Theorem 1.2] on concentration properties of quadratic forms in a Banach space. Considering the case $d = 2$, we can express the quantities $T_1 := \mathbb{E} W_1$ and $T_2 := \mathbb{E} W_2$ as
Corollary 1.5. Assume that $X = (X_1, \ldots, X_n)$ satisfies a $d$-$\mathrm{LSI}(\sigma^2)$ and is supported in $[a, b]^n$ and let $f_T$ be as in (1.15) with $d = 2$. We have for some constant $C > 0$ and all $t \ge 0$
Note that [Tal96, Theorem 1.2] can be retrieved by considering $T$ consisting of a single element (although Talagrand considered fluctuations around the median instead of the mean). Moreover, Theorem 1.3 provides extensions to $d \ge 3$.
Theorem 1.3 can be applied to random variables other than Rademacher, for example to the spins in an Ising model on $n$ sites in the Dobrushin uniqueness regime, given as follows. Let $\mathcal{Y} = \{-1, +1\}^n$ and $J = (J_{ij})_{1 \le i,j \le n}$ be a symmetric matrix with vanishing diagonal which satisfies
In [GSS18] it was shown that $\mu$ satisfies a $d$-$\mathrm{LSI}(\sigma^2)$ with a constant depending on $\rho$. Thus Theorem 1.3 can be applied to such Ising models. For example, the case $J_{ij} = \beta n^{-1} \delta_{i \neq j}$ (the Curie-Weiss model) satisfies the assumptions of Theorem 1.3 for $\beta < 1$.
1.1.2. Polynomials and subgraph counts in exponential random graph models.
Moreover, we can consider polynomial functions. The case of independent random variables has been treated in [AW15, Theorem 1.4] under more general conditions, so we omit it and concentrate on weakly dependent random variables.
Let $f_d : \mathbb{R}^n \to \mathbb{R}$ be a multilinear (also called tetrahedral) polynomial of degree $d$, i.e. of the form
for any permutation $\sigma \in S_k$, and the (generalized) diagonal is defined as $\Delta_k := \{(i_1, \ldots, i_k) : \lvert \{i_1, \ldots, i_k\} \rvert < k\}$. Denote by $\nabla^k f$ the $k$-tensor of all partial derivatives of order $k$ of $f$. We have the following result.
Theorem 1.6. Let $X$ be a random vector with law $\mu$ supported in $[-1, +1]^n$ and satisfying a $d$-$\mathrm{LSI}(\sigma^2)$, and let $f_d$ be as in (1.25). There exists a constant $C > 0$ depending on $d$ only such that for all $t > 0$
The family of norms $\lVert \cdot \rVert_{\mathcal{I}}$ arises by different embeddings of copies of $\mathbb{R}^n$ into the space of all tensors and will be defined in Section 4. It was first introduced in [Lat06].
The proof of Theorem 1.6 in the context of Ising models (in the Dobrushin uniqueness regime) is already present in [AKPS18], and can be extended without any further effort to any vector X satisfying a d−LSI(σ 2 ). For a short sketch see Section 4.
Theorem 1.6 can be used in the context of exponential random graph models (ERGM). Let us briefly introduce these. Given $s$ real numbers $\beta_1, \ldots, \beta_s$ and $s$ simple graphs $G_1, \ldots, G_s$ (with $G_1$ being a single edge by convention), the ERGM with parameter $\beta = (\beta_1, \ldots, \beta_s, G_1, \ldots, G_s)$ is a probability measure on the space of all graphs on $n \in \mathbb{N}$ vertices given by the weight function $\exp\big(\sum_{i=1}^s \beta_i n^{-v_i + 2} N_{G_i}(X)\big)$, where $v_i$ is the number of vertices of $G_i$ and $N_{G_i}(X)$ is the number of copies of $G_i$ in $X$. For details, see [CD13] or [SS18].
By way of example we show concentration properties of the number of triangles $T_3(X) = \sum_{\{e,f,g\} \in \mathcal{T}_3} X_e X_f X_g$ (where $\mathcal{T}_3$ denotes the set of all triples of edges forming a triangle).
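The triangle count is a degree-3 multilinear polynomial in the edge indicators. As a minimal sketch, the following Python code evaluates it for a fixed graph; the representation of a graph as a dictionary from unordered vertex pairs to 0/1 indicators, and the helper names, are choices made here for illustration.

```python
from itertools import combinations

def triangle_count(n, edge_indicator):
    """T_3(X) = sum over triangles {e, f, g} of X_e * X_f * X_g."""
    total = 0
    for i, j, k in combinations(range(n), 3):
        # the three edges of the triangle on vertices i, j, k
        total += (edge_indicator[frozenset((i, j))]
                  * edge_indicator[frozenset((j, k))]
                  * edge_indicator[frozenset((i, k))])
    return total

def complete_graph(n):
    """Edge indicators of the complete graph on n vertices."""
    return {frozenset(pair): 1 for pair in combinations(range(n), 2)}

graph = complete_graph(4)
print(triangle_count(4, graph))   # all four vertex triples form a triangle

graph[frozenset((0, 1))] = 0      # remove one edge
print(triangle_count(4, graph))   # the two triangles through that edge vanish
```

Flipping a single edge indicator changes the count by the number of triangles through that edge, which is the kind of first order difference that enters the tail estimates.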

Corollary 1.7. Let X be an exponential random graph model with parameter
1.1.3. A logarithmic Sobolev inequality.
In the above subsections, we have worked with a $d$-$\mathrm{LSI}(\sigma^2)$ to obtain concentration inequalities for weakly dependent random variables. In general, proving a logarithmic Sobolev inequality is a non-trivial task. On the other hand, the next result shows that an $h$-$\mathrm{LSI}(1)$ always holds in the case of product measures, which is a result of independent interest.
Theorem 1.8. Let $X = (X_1, \ldots, X_n)$ be a random vector with independent components and values in any
Theorem 1.8 can be generalized to allow for weak dependence. Since this requires more definitions and notations, we postpone the formulation and the proof to Section 5, see Theorem 5.1.
To the best of our knowledge, Theorem 1.8 is new. It might be compared to the Efron-Stein inequality (see e.g. [ES81;Ste86]) which is a counterpart of the tensorization property for the variance, and can be regarded as a universal Poincaré inequality for product measures (see e.g. [BGS18] for such an interpretation). A similar inequality with entropy replaced by variance can for example be found in [vH16, Lemma 2.1]. It can also be deduced from (1.28) in the usual way.
Remark. Theorem 1.8 does not imply the usual Efron-Stein inequality. In order to do so, one would have to prove an analogue of Theorem 1.8 with $d$ instead of $h$. Unfortunately the inequality (1.28) cannot hold in this generality with $d$. The property of satisfying a logarithmic Sobolev inequality with respect to $d$ is quite restrictive. If $(\mathcal{Y}, \mathcal{A}, \mu)$ is a probability space with a sequence $A_n \in \mathcal{A}$ with $\mu(A_n) \to 0$, then choosing the sequence of functions $f_n := \mathbb{1}_{A_n} \in L^\infty(\mu)$, we have $\mathrm{Ent}_\mu(f_n^2) = \mu(A_n) \log(1/\mu(A_n))$. On the other hand, we have $\int (d f_n)^2 \, d\mu = \mu(A_n)(1 - \mu(A_n))$. The ratio of the two quantities grows like $\log(1/\mu(A_n))$, so that a $d$-$\mathrm{LSI}(\sigma^2)$ cannot hold. Hence $d$ is a difference operator which essentially makes sense in finite situations only. Consequently, this shows that one cannot expect an "entropy" version of the Efron-Stein inequality of the form $\mathrm{Ent}_\mu(f^2) \le \mathbb{E} \lvert df \rvert^2$. It seems that for the entropy, it is essential to change the difference operator.
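The obstruction in this remark is easy to see numerically. The sketch below tabulates the two expressions stated above for indicator functions, $\mu(A_n) \log(1/\mu(A_n))$ and $\mu(A_n)(1 - \mu(A_n))$, and their ratio for a few values of $\mu(A_n)$; the chosen probabilities and the function names are arbitrary illustrations.

```python
import math

def entropy_of_indicator(p):
    """Ent(f^2) for f the indicator of a set of measure p: p * log(1/p)."""
    return p * math.log(1.0 / p)

def energy_of_indicator(p):
    """Integral of (d f)^2 for the same indicator: p * (1 - p)."""
    return p * (1.0 - p)

probabilities = [1e-1, 1e-2, 1e-4, 1e-8]
ratios = [entropy_of_indicator(p) / energy_of_indicator(p) for p in probabilities]

for p, r in zip(probabilities, ratios):
    print(f"mu(A) = {p:.0e}   Ent / energy = {r:.3f}")
```

Since the ratio equals $\log(1/p)/(1-p)$, it is unbounded as $p \to 0$, so no fixed constant can serve in an inequality of the form entropy $\le$ constant $\times$ energy for all indicators.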
Remark. Unfortunately, it does not seem to be possible to use Theorem 1.8 to estimate the growth of $L^p$ norms as in the setting of a $d$-$\mathrm{LSI}(\sigma^2)$. Indeed, it is impossible to prove the required moment inequalities under an $h$-$\mathrm{LSI}(\sigma^2)$. For example, the measure
However, a simple calculation shows that
In an earlier draft we have falsely claimed to have proven (1.29) and applied it to Erdős-Rényi graphs. However, this led to concentration results which could not hold in that generality. Apparently it is not possible to improve the results in [AW15] on the triangle count in Erdős-Rényi graphs with this method.
In [BBLM05] the authors proved inequalities for $\varphi$-entropies for power functions $\varphi(x) = x^\alpha$, $\alpha \in (1, 2]$, leading to moment inequalities for independent random variables. In [BGS18], the authors built upon that approach to prove the moment inequalities used in this work. As for suprema of polynomial chaos, estimates of these functionals have been derived in various situations. A classical example (without the supremum), usually referred to as Hanson-Wright inequalities, has been studied in [HW71; Wri73] for subgaussian random variables and a real quadratic matrix by Hanson-Wright and Wright. A modern proof was given by Rudelson-Vershynin in [RV13]. It has been extended to non-independent random variables in [HKZ12] (for positive semi-definite matrices and $X$ satisfying a uniform subgaussianity property), in [VW15] (with some logarithmic dependence on the dimension) and [Ada15] under the so-called convex concentration property. More recently, [CY18]
1.3. Outline.
Section 2 contains the definitions of the higher order difference operators that are required in Theorems 1.1 and 1.2, as well as a proposition providing a link between tail estimates in the formulations of Theorem 1.1 and $L^p$ norm estimates. The concentration inequalities of Theorems 1.1 and 1.2 will be proven in Section 3.
Section 4 provides the proof of Theorem 1.3 and its corollaries. In Section 5 we provide the general version of Theorem 1.8 and its proof. The last Section 6 contains auxiliary results used frequently throughout this work.

Higher order difference operators and $L^p$ norm estimates
Let us first define the $d$-tensors $h^{(d)} f$ for $d \ge 2$. The basis of the higher order differences will be the difference operator $h$. Secondly, since the tail estimates will depend on general $L^p$ norm inequalities of the form
we will give a proposition that translates such inequalities into tail estimates.
We define the $d$-tensor $h^{(d)} f$ by specifying it on its "coordinates". Given distinct indices $i_1, \ldots, i_d$ we let
by $X_{i_1}, \ldots, X_{i_s}$, and $\lVert \cdot \rVert_{i_1, \ldots, i_d, \infty}$ denotes the $L^\infty$-norm with respect to $X_{i_1}, \ldots, X_{i_d}$ and $X'_{i_1}, \ldots, X'_{i_d}$. For instance, for $i \neq j$,
Using the definition (2.1), we define tensors of $d$-th order differences as follows:
Whenever no confusion is possible, we omit writing the random vector $X$, i.e. we freely write $f$ instead of $f(X)$ and $h^{(d)} f$ instead of $h^{(d)} f(X)$. Note that $\lvert h^{(d)} f \rvert$ are again difference operators.
We will need another, closely related difference operator. For $i \in \{1, \ldots, n\}$ introduce
shall denote the $L^\infty$ norm with respect to $X'_i$. Lastly, we ignore any measurability issues that may arise. They may be dealt with by restricting oneself to appropriately well-behaved spaces (such as finite or more generally Polish spaces). Alternatively, we could define the difference operators in terms of a majorizing measurable function. Thus we assume that $h_{i_1 \ldots i_d} f(X)$ (and $h^+_i f(X)$) are measurable for any $d \in \mathbb{N}$ and $i_1, \ldots, i_d$.
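The vector $h^+ f$ and the tensors $h^{(d)} f$ can be regarded as elements of the Euclidean spaces $\mathbb{R}^n$ and $\mathbb{R}^{n^d}$ respectively. It is easily seen that for a 1-tensor (i.e. a vector) we have $\lvert A \rvert_{\mathrm{op}} = \lvert A \rvert$, and for any $d \ge 2$ and any $d$-tensor $A$ we have $\lvert A \rvert_{\mathrm{op}} \le \lvert A \rvert$.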
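For $d = 2$ the comparison between the two norms can be checked directly. The Python sketch below assumes that the operator norm of a 2-tensor is the usual spectral norm of the corresponding matrix (the supremum of $\lvert Av \rvert$ over unit vectors $v$) and approximates it by power iteration on $A^{\top} A$; the Euclidean norm is the square root of the sum of squared entries. The function names, the iteration count and the example matrix are hypothetical.

```python
import math

def euclidean_norm(A):
    """Euclidean (Hilbert-Schmidt) norm: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def operator_norm(A, iterations=500):
    """Largest singular value of the matrix A via power iteration on A^T A.

    Assumption: the start vector is not orthogonal to the top right-singular
    vector of A; otherwise another start vector has to be used.
    """
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        length = math.sqrt(sum(x * x for x in w))
        if length == 0.0:
            return 0.0
        v = [x / length for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Av))

A = [[3.0, 0.0], [0.0, 4.0]]   # hypothetical example matrix
print(operator_norm(A), euclidean_norm(A))
```

For a diagonal matrix the operator norm picks out the largest entry in absolute value while the Euclidean norm aggregates all of them, which illustrates why bounds in terms of operator norms can be considerably sharper.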
Moreover, note that the supremum is attained, and if $A$ is a nonnegative tensor (i.e. $A_{i_1 \ldots i_d} \ge 0$ for all $i_1, \ldots, i_d$), the maximizing vectors $v^1, \ldots, v^d$ can be chosen to have all nonnegative entries. Indeed, since $v^1$
we can define $\lvert v \rvert^j$ by taking the absolute value element-wise.
In the proofs of Theorems 1.1 and 1.2 we will establish a growth rate on the $L^p$ norms of $f(X) - \mathbb{E} f(X)$. The following proposition establishes the connection between such norm estimates and the concentration inequalities (1.8) and (1.11)
for some constants $C_1, \ldots, C_d \ge 0$, and let $L := \lvert \{l : C_l > 0\} \rvert$ and $r := \min\{l \in \{1, \ldots, d\} : C_l > 0\}$. There is a constant $c$ such that for all $t > 0$

Concentration inequalities under logarithmic Sobolev inequalities: Proofs
To prove Theorem 1.1 we recall the following $L^p$ norm inequalities. These results (with a different choice of normalization for $h^{\pm}$ leading to slightly different constants) can be found in [BGS18, Theorem 2.3, Corollary 2.6] (building upon the earlier results in [BBLM05]), and we skip the proofs.
Lemma 3.2. If $X_1, \ldots, X_n$ are independent random variables and $f(X) \in L^\infty(\mathbb{P})$, then for any $p \ge 2$ we have
$\sigma^2 > 0$
Moreover, we need the following auxiliary statements. The proofs are postponed to Section 6.

Lemma 3.3. For any
Proposition 3.4. Let $\mu$ be a measure on a product of Polish spaces satisfying a $d$-$\mathrm{LSI}(\sigma^2)$. Then, for any $f \in L^\infty(\mu)$ and any $p \ge 2$, we have
Proof of Theorem 1.1. Since $X_1, \ldots, X_n$ are independent, from Lemma 3.2 we obtain
where we have used that for any positive random variable $W$
The second term on the right hand side can now be estimated using Theorem 3.1, which in combination with Lemma 3.3 gives
This can be easily iterated to yield
Now it remains to apply Proposition 2.1.
Since $\lvert hf \rvert$ (and all higher order differences in the iteration) is nonnegative, (3.7) can be used to estimate the second term on the right hand side. From here on, it is the same iteration as in the proof of Theorem 1.1.
Remark. The proof of Theorem 1.2 suggests making use of an $h^+$-$\mathrm{LSI}(\sigma^2)$ instead of a $d$-$\mathrm{LSI}(\sigma^2)$. Indeed, it seems this is possible, but at the same time, our impression is that the behaviour of an $h^+$-$\mathrm{LSI}(\sigma^2)$ does not differ much from that of a $d$-$\mathrm{LSI}(\sigma^2)$; recall the discussion after Theorem 1.8.

Suprema of chaos and polynomials: Proofs
Proof of Theorem 1.3. Let us first consider the case that $X$ satisfies a $d$-$\mathrm{LSI}(\sigma^2)$. From (3.3) (and the Poincaré inequality to remove the $L^2$ norm) we obtain
We shall make use of the pointwise inequality
To prove (4.1), first note that
We have
which proves (4.1). Consequently,
As in [BBLM05], this can now be iterated, i.e. we have for any $k \in \{1, \ldots, d - 1\}$ the inequality $\lvert h^+ W_k \rvert \le (b - a) W_{k+1}$. Here we may argue as above, where the only difference is to choose $(t, v^*)$ and $\alpha^{(1)}, \ldots, \alpha^{(k)}$ which maximize $W_k$. Finally we obtain
The same arguments are also valid without a $d$-$\mathrm{LSI}(\sigma^2)$ property, if one considers $\lVert (f - \mathbb{E} f)_+ \rVert_p$ and applies Theorem 3.1 instead. Recalling equation (3.5), this leads to equation (1.20).
Lastly, to prove (1.21), let us first consider why we cannot argue as in the first two parts. Note that the argument heavily relies on the positive part of the difference operator $h^+$, which allows us to choose the maximizers independently of $i \in \{1, \ldots, n\}$. This is no longer possible in the case of independent random variables. Here, Lemma 3.2 and Theorem 3.1 yield
Thus this argument fails if we try to use (4.4). However, we can rewrite
where the sup is to be understood as an $L^\infty(\mu)$ norm. As a consequence, we have for each fixed $i \in \{1, \ldots, n\}$ (again choosing $t$ by maximizing the first summand in the brackets)
The proof is now completed using the same arguments as in the first part, however with $W_k$ replaced by its counterpart for independent random variables.
Proof of Corollary 1.4. Since the uniform distribution on $\{-1, +1\}^n$ satisfies a $d$-$\mathrm{LSI}(1)$, an inequality with 4 replaced by 8 in (1.22) follows immediately from Theorem 1.3. To obtain the constant 4, we need to use the (sharper) inequality $\lvert df(X) \rvert^2 \le \frac{1}{2} (b - a)^2 W_1^2$, valid for Rademacher random variables, which can be seen by analyzing (4.2) and the estimate thereafter.
To prove Theorem 1.6, let us now define the family of norms. Let $P_d$ denote the set of all partitions of $\{1, \ldots, d\}$. Any partition $\mathcal{I} = \{I_1, \ldots, I_k\} \in P_d$ induces a partition of the space of $d$-tensors as follows. Identify the space of all $d$-tensors with $\mathbb{R}^{n^d}$ and decompose
On this space, define a norm

With this identification, any $d$-tensor $A$ can be trivially identified with a linear functional on $\mathbb{R}^{n^d}$ via the standard scalar product, i.e.
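and we denote by $\lVert A \rVert_{\mathcal{I}}$ the operator norm with respect to $\lVert \cdot \rVert_{\mathcal{I}}$:
$$\lVert A \rVert_{\mathcal{I}} = \sup \{ \lvert Ax \rvert : \lVert x \rVert_{\mathcal{I}} \le 1 \}. \tag{4.7}$$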
Proof of Theorem 1.6. We will give a sketch of the proof only and refer to [AKPS18, Proof of Theorem 2.2] for details. Recall that by (3.8) we have the inequality
Using the arguments in [AKPS18, Proof of Theorem 2.2], this leads to
where $M$ is an absolute constant and $G_i$ is a sequence of independent standard Gaussian random variables, independent of $X$. Furthermore, a result by Latała [Lat06] yields
The rest now follows as in the previous proofs.
Proof of Corollary 1.7. In [Sin18] and [SS18] we have proven that the condition $\frac{1}{2} \Phi'_{\lvert \beta \rvert}(1) < 1$ implies a $d$-$\mathrm{LSI}(\sigma^2)$ for $\mu_\beta$ with a constant depending on the parameter $\beta$ only. Thus, it remains to bound the norms in (1.26). Note that due to the structure of the exponential random graph model, the expectations $\mathbb{E} X_G$ and $\mathbb{E} X_H$ are equal whenever $G$ and $H$ are isomorphic. Thus, we define $C_{S_2} := \mathbb{E} X_{S_2}$ (where $S_2$ is a 2-star) and $C_E = \mathbb{E} X_e$.
The Euclidean norms can be easily bounded:
and it remains to estimate the three remaining norms. However, in [AW15, Section 5.1], the authors give estimates for such norms in the Erdős-Rényi case, and it is easy to adapt these to any model with the property that $\mathbb{E} X_G$ depends only on the isomorphism class of $G$ (in the complete graph). In particular, due to the structure of the exponential random graph models, this is true in our case. This gives
Inserting the estimates into (1.26) finishes the proof.

Universal logarithmic Sobolev inequality: Proofs
As mentioned above, Theorem 1.8 admits a generalization to non-product measures. Indeed, a sufficient condition for the h−LSI(σ 2 ) property to hold is that the measure µ satisfies an approximate tensorization (AT) property.
To formulate the generalization, we will make use of the disintegration theorem on Polish spaces (see [DM78, Chapter III] and [AGS08, Theorem 5.3.1]): If $\mu$ is a measure on a product space $\otimes_{i=1}^n \mathcal{X}_i$, then for each $i \in \{1, \ldots, n\}$ we can decompose the measure using the marginal measure $\mu_i$ (the measure on $\otimes_{j \neq i} \mathcal{X}_j$) and a conditional measure on $\mathcal{X}_i$, which we denote by $\mu(\cdot \mid x_i)$. More precisely, for any Borel
Theorem 5.1. Assume that $\mathcal{Y} = \otimes_{i=1}^n \mathcal{X}_i$ is a product of Polish spaces and $X = (X_1, \ldots, X_n)$ is a $\mathcal{Y}$-valued random vector with law $\mu$. If $\mu$ satisfies an approximate tensorization property then $\mu$ also satisfies an $h$-$\mathrm{LSI}(C)$.
The approximate tensorization property in Theorem 5.1 is interesting in its own right, but it is not yet well-studied. For finite spaces [Mar15] gives sufficient conditions for a measure $\mu$ to satisfy an approximate tensorization property. Similar results have been derived in [CMT15], which can be applied in discrete and continuous settings. For example, if one considers a measure of the form
for some countable spaces $\Omega_i$, $x_i \in \Omega_i$, measures $\mu_{0,i}$ on $\Omega_i$ and bounded functions $w_{ij}$, under certain technical conditions $\mu$ satisfies an approximate tensorization property. This does not require any functional inequality for $\mu_{0,i}$. However, it requires a certain weak dependence assumption in general. For example, the push-forward of a random permutation to $\mathbb{N}^n$ cannot satisfy an approximate tensorization property. It is an interesting question to find necessary and sufficient conditions for the approximate tensorization property to hold.
Proof of Theorem 5.1. First we consider the case $n = 1$. By homogeneity of both sides, we may and will assume that $\int f^2(X) \, d\mathbb{P} = 1$. Since $f$ is bounded, we have $0 \le a \le \lvert f(X) \rvert \le b < \infty$ $\mathbb{P}$-a.s., where $b$ is the essential supremum of $\lvert f(X) \rvert$ and $a$ the essential infimum. Due to the constraints on the integral this leads to $a^2 \le 1 \le b^2$. (Actually the cases $b = 1$ or $a = 1$ are trivial, since then $f^2(X) = 1$ $\mathbb{P}$-a.s., but we will not make this distinction.) Let $F(u) := \mathbb{P}(f^2(X) \ge u)$. In particular
Using the partial integration formula (see e.g. [
Plugging in these two estimates yields
$$\mathrm{Ent}(f^2(X)) \le a^2 \log a^2 + (1 - a^2) + (1 - a^2) \log b^2 =: f(a, b).$$
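As a sanity check of the last bound, the following Python sketch draws random functions on small finite probability spaces, normalizes them so that $\int f^2 \, d\mathbb{P} = 1$, and compares $\mathrm{Ent}(f^2)$ with $a^2 \log a^2 + (1 - a^2) + (1 - a^2) \log b^2$, reading the last term of the displayed bound as the product $(1 - a^2) \log b^2$. The sampling ranges, the seed and the function names are arbitrary choices; the check only illustrates the inequality and does not replace the proof.

```python
import math
import random

def entropy(values, weights):
    """Ent(F) = E[F log F] - E[F] log E[F] for a positive F on a finite space."""
    mean = sum(w * v for v, w in zip(values, weights))
    e_f_log_f = sum(w * v * math.log(v) for v, w in zip(values, weights))
    return e_f_log_f - mean * math.log(mean)

def bound(a, b):
    """a^2 log a^2 + (1 - a^2) + (1 - a^2) log b^2."""
    return a * a * math.log(a * a) + (1 - a * a) + (1 - a * a) * math.log(b * b)

random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    m = random.randint(2, 6)
    raw = [random.random() + 1e-3 for _ in range(m)]
    weights = [w / sum(raw) for w in raw]          # strictly positive weights
    f = [random.uniform(0.1, 3.0) for _ in range(m)]
    second_moment = sum(w * v * v for v, w in zip(f, weights))
    f = [v / math.sqrt(second_moment) for v in f]  # normalize so that E[f^2] = 1
    a, b = min(f), max(f)                          # essential infimum and supremum
    gap = bound(a, b) - entropy([v * v for v in f], weights)
    worst_gap = min(worst_gap, gap)

print("smallest value of bound - entropy over all trials:", worst_gap)
```

A nonnegative smallest gap (up to rounding) is consistent with the inequality; in the degenerate case $a = b = 1$ both sides vanish.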

Auxiliary results: Proofs
This section contains the proofs of the auxiliary statements used in Section 3. Recall the (formal) operator T i (X) = (X i , X ′ i ), where X ′ is an independent copy of X.
Proof of Lemma 3.3. We have