Concentration Inequalities for Bounded Functionals via Log-Sobolev-Type Inequalities

In this paper, we prove multilevel concentration inequalities for bounded functionals f=f(X1,…,Xn)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f = f(X_1, \ldots , X_n)$$\end{document} of random variables X1,…,Xn\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$X_1, \ldots , X_n$$\end{document} that are either independent or satisfy certain logarithmic Sobolev inequalities. The constants in the tail estimates depend on the operator norms of k-tensors of higher order differences of f. We provide applications for both dependent and independent random variables. This includes deviation inequalities for empirical processes f(X)=supg∈F|g(X)|\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(X) = \sup _{g \in {\mathcal {F}}} {|g(X)|}$$\end{document} and suprema of homogeneous chaos in bounded random variables in the Banach space case f(X)=supt‖∑i1≠…≠idti1…idXi1⋯Xid‖B\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$f(X) = \sup _{t} {\Vert \sum _{i_1 \ne \ldots \ne i_d} t_{i_1 \ldots i_d} X_{i_1} \cdots X_{i_d}\Vert }_{{\mathcal {B}}}$$\end{document}. The latter application is comparable to earlier results of Boucheron, Bousquet, Lugosi, and Massart and provides the upper tail bounds of Talagrand. In the case of Rademacher random variables, we give an interpretation of the results in terms of quantities familiar in Boolean analysis. Further applications are concentration inequalities for U-statistics with bounded kernels h and for the number of triangles in an exponential random graph model.


Introduction
During the last forty years, the concentration of measure phenomenon has become an established part of probability theory with applications in numerous fields, as is witnessed by the monographs [18,38,42,45,54]. One way to prove concentration of measure is by using functional inequalities, more specifically the entropy method. It has emerged as a way to prove several groundbreaking concentration inequalities in product spaces by Talagrand [51,52], mainly in the works [11,37], and further developed in [41].
To convey the idea, let us recall that the logarithmic Sobolev inequality for the standard Gaussian measure ν in R n (see [29]) states that for any f ∈ C ∞ c (R n ) we have where Ent ν ( f 2 ) = f 2 log f 2 dν − f 2 dν log f 2 dν is the entropy functional. Informally, it bounds the disorder of a function f (under ν) by its average local fluctuations, measured in terms of the length of the gradient. It is by now standard that (1) implies subgaussian tail decay for Lipschitz functions (e. g. by means of the Herbst argument). In particular, if f : R n → R is a C 1 function such that |∇ f | ≤ L a.s., we have ν(| f − f dν| ≥ t) ≤ 2 exp(−t 2 /(2L 2 )) for any t ≥ 0.
If μ is a probability measure on a discrete set X (or a more abstract set not allowing for an immediate replacement for |∇ f |), then there are several ways to reformulate Eq. (1), see e. g. [12,26]. We continue these ideas by working in the framework of difference operators. Given a probability space (Y, A, μ), we call any operator : L ∞ (μ) → L ∞ (μ) satisfying | (a f + b)| = a | f | for all a > 0, b ∈ R a difference operator. Accordingly, we say that μ satisfies a -LSI(σ 2 ), if for all bounded measurable functions f we have Apart from the domain of , it is clear that (2) can be seen as generalization of (1) by defining ( f ) = |∇ f | on R n . Another route to obtain concentration inequalities is to modify the entropy method, which was done in the framework of so-called ϕ-entropies. The idea is to replace the function ϕ 0 (x) := x log x in the definition of the entropy Ent ϕ 0 μ ( f ) = E μ ϕ 0 ( f ) − ϕ 0 (E μ f ) by other functions ϕ. This has been studied in [17,22,36]. In the seminal work [16] the authors proved inequalities for ϕ-entropies for power functions ϕ(x) = |x| α , α ∈ (1, 2], leading to moment inequalities for independent random variables. Originally, the entropy method was primarily used to prove sub-Gaussian concentration inequalities for Lipschitz-type functions. However, there are many situations of interest in which the functions under consideration are not Lipschitz or have Lipschitz constants which grow as the dimension increases even after a renormalization which asymptotically stabilizes the variance. Among the simplest examples are polynomial-type functions. Here, the boundedness of the gradient typically has to be replaced by more elaborate conditions on higher order derivatives (up to some order d). Moreover, we cannot have subgaussian tail decay anymore. This is already obvious if we consider the product of two independent standard normal random variables, which leads to subexponential tails. We refer to this topic as higher order concentration.
The earliest higher order concentration results date back to the late 1960s. Already in [13,14,43], the growth of L p norms and hypercontractive estimates of polynomialtype functions in Rademacher or Gaussian random variables, respectively, have been studied. The question of estimating the growth of L p norms of multilinear polynomials in Gaussian random variables was considered in [8,15,35]. In the context of Erdös-Rényi graphs and the triangle problem, concentration inequalities for polynomials functions gained considerable attention, in papers such as [33].
More recently, multilevel concentration inequalities have been proven in [1,5,56] for many classes of functions. These included U -statistics in independent random variables, functions of random vectors satisfying Sobolev-type inequalities and polynomials in sub-Gaussian random variables, respectively. We refer to inequalities of the type as multilevel or higher order (d-th order) concentration inequalities. This means that the tails might have different decay properties in some regimes of [0, ∞). Usually, we have f k (t) = (t/C k ) 2/k for some constant C k which typically depends on the k-th order derivatives.
To convey the basic idea of multilevel concentration inequalities, let us once again consider the case d = 2, e. g. a quadratic form of independent, say, Gaussian random variables. As sketched above, in this case the tails decay subexponentially in general. By means of a multilevel concentration inequality (the so-called Hanson-Wright inequality, which we address in more detail at a later point), we can show that while for t large, subexponential tail decay holds, for small t we even get subgaussian decay. In this sense, multilevel concentration inequalities provide refined tail estimates which do not only cover the behavior for large t.
Our own work started with a second-order concentration inequality on the sphere in [9] and was continued in [10] for bounded functionals of various classes of random variables (e. g. independent random variables or in presence of a logarithmic Sobolev inequality (1)), and in [28] for weakly dependent random variables (e. g. the Ising model). In these papers, we studied higher order concentration, arriving at multi-level tail inequalities of type (3). If the underlying measure μ satisfies a logarithmic Sobolev inequality, [10,Corollary 1.11] denotes the operator norm of the respective tensors of k-th order partial derivatives. A downside in both [10,28] is that for functions of independent or weakly dependent random variables, comparable estimates involve Hilbert-Schmidt instead of operator norms, leading to weaker estimates in general.
A central aspect of the present article is to fix this drawback by a slightly more elaborate approach. Here, we consider both independent and dependent random variables. In either case, we prove multilevel concentration inequalities of the same type, and apply them to different forms of functionals. We provide improvements of earlier higher order concentration results like [10,Theorem 1.1] or [28,Theorem 1.5], replacing the Hilbert-Schmidt norms appearing therein by operator norms. This leads to sharper bounds and a wider range of applicability.
A special emphasis is placed on providing uniform versions of the higher order concentration inequalities. By this, we mean that we consider functionals of supremum type f (X ) = sup f ∈F | f (X )|, which includes suprema of polynomial chaoses, or empirical processes. Two more applications are given by U -statistics in independent and weakly dependent random variables as well as a triangle counting statistic in some models of random graphs, for which we prove concentration inequalities.
Notations Throughout this note, X = (X 1 , . . . , X n ) is a random vector taking values in some product space Y = ⊗ n i=1 X i (equipped with the product σ -algebra) with law μ, defined on a probability space ( , A, P). By abuse of language, we say that X satisfies a -LSI(σ 2 ), if its distribution does. In any finite-dimensional vector space, we let |·| be the Euclidean norm, and for brevity, we write [q] := {1, . . . , q} for any q ∈ N. Given a vector x = (x j ) j=1,...,n we write x i c = (x j ) j =i . To any d-tensor A we define the Hilbert-Schmidt norm |A| HS :

and the operator norm
For brevity, for any random ktensor A and any p ∈ (0, ∞] we abbreviate A HS, p = (E |A| p HS ) 1/ p as well as A op, p = (E |A| p op ) 1/ p . Lastly, we ignore any measurability issues that may arise. Thus, we assume that all the suprema used in this work are either countable or defined as sup t∈T = sup F⊂T :F finite sup t∈F .

Main Results
To formulate our main results, we introduce a difference operator labeled |h f | which is frequently used in the method of bounded differences. Let X = (X 1 , . . . , X n ) be an independent copy of X , defined on the same probability space. Given f (X ) ∈ L ∞ (P), define for each i ∈ [n] and where · i,∞ denotes the L ∞ -norm with respect to (X i , X i ). The difference operator |h f | is given as the Euclidean norm of the vector h f . We shall also need higher order versions of h, denoted by h (d) f . They can be thought of as analogues of the d-tensors of all partial derivatives of order d in an abstract setting. To define the d-tensor h (d) f , we specify it on its "coordinates". That is, given distinct indices i 1 , . . . , i d , we set Using the definition (4), we define tensors of d-th order differences as follows: Whenever no confusion is possible, we omit writing the random vector X , i.e. we freely write f instead of f (X ) and Our first main theorem is a concentration inequality for general, bounded functionals of independent random variables X 1 , . . . , X n . Theorem 1 Let X be a random vector with independent components, f : Y → R a measurable function satisfying f = f (X ) ∈ L ∞ (P), d ∈ N and define C := 217d 2 . We have for any t ≥ 0 For the sake of illustration, let us consider the case of d = 2. Assuming that X 1 , . . . , X n satisfy EX i = 0, EX 2 i = 1 and |X i | ≤ M a.s., let f (X ) be the quadratic form f (X ) = i< j a i j X i X j = X T AX. Here, a i j ∈ R for all i < j, and A is the symmetric matrix with zero diagonal and entries A i j = a i j /2 if i < j. In this case, it is easy to see that h f op,1 ≤ h f op,2 ≤ 4M|A| HS and h (2) where A abs is the matrix given by (A abs ) i j = |a i j |. As a result, This is a version of the famous Hanson-Wright inequality. For the various forms of the Hanson-Wright inequality we refer to [2,4,30,32,47,55,57]. Note that by a modification of our proofs (using arguments especially adapted to polynomials), it is possible to replace |A abs | op by |A| op , thus avoiding the drawback of switching to a matrix with a possibly larger operator norm. See Sects. 2.1 and 2.4 for details. On the other hand, Theorem 1 allows for any function f , not just quadratic forms, and the case of d = 2 can in this sense be considered as generalization of the Hanson-Wright inequality.
For a certain class of weakly dependent random variables X 1 , . . . , X n , we can prove similar estimates as in Theorem 1. To this end, we introduce another difference operator, which is more familiar in the context of logarithmic Sobolev inequalities for Markov chains as developed in [26]. Assume that Y = ⊗ n i=1 X i for some finite sets X 1 , . . . , X n , equipped with a probability measure μ and let μ(· | x i c ) denote the conditional measure (interpreted as a measure on X i ) and μ i c the marginal on ⊗ j =i X j . Finally, set This difference operator appears naturally in the Dirichlet form associated to the Glauber dynamic of μ, given by In the next theorem, we require a d-LSI for the underlying random variables X 1 , . . . , X n . A number of models which satisfy this assumption will be discussed below.
Theorem 2 Let X = (X 1 , . . . , X n ) be a random vector satisfying a d-LSI(σ 2 ) and Again, if d = 2, assuming that EX i = 0, EX 2 i = 1, |X i | ≤ M a.s. and EX i X j = 0 if i = j, we arrive at a Hanson-Wright type inequality, this time including dependent situations. Similar results still hold if we remove the uncorrelatedness condition.
Let us discuss the d-LSI condition in more detail. First, any collection of random independent variables X 1 , . . . , X n with finitely many values satisfies a d-LSI(σ 2 ) with σ 2 depending on the minimal nonzero probability of the X i (cf. Proposition 6). In this situation, Theorems 1 and 2 only differ by constants.
However, the d-LSI conditions also gives rise to numerous models of dependent random variables as in [ for a symmetric matrix J = (J i j ) with zero diagonal and some h ∈ R n . In [28, Proposition 1.1], we have shown that if max i=1,...,n n j=1 |J i j | ≤ 1 − α and max i∈ [n] |h i | ≤ α, the Ising model satisfies a d-LSI(σ 2 ) with σ 2 depending on α and α only. For the special case of h = 0 and J i j = β for all i = j, we obtain the Curie-Weiss model. Here, the two conditions required above reduce to β < 1.
Another simple model in which a d-LSI holds is the random coloring model. If G = (V , E) is a finite graph and C = {1, . . . , k} is a set of colors, we denote by 0 ⊂ C V the set of all proper coloring, i. e. the set of all ω ∈ C V such that {v, w} ∈ E ⇒ ω v = ω w . In [48, Theorem 3.1], we have shown that the uniform distribution on 0 satisfies a d-LSI if the maximum degree is uniformly bounded and k ≥ 2 +1 (strictly speaking, we consider sequences of graphs here). In [48, Theorem 3.1], we moreover prove d-LSIs for the (vertex-weighted) exponential random graph model and the hard-core model. We will further discuss the exponential random graph model in Sect. 2.4.
The common feature in all these models is that the dependencies which appear can be controlled (e. g. by means of a coupling matrix which measures the interactions between the particles of the system under consideration, cf. [28,Theorem 4.2]) in such a way that the model is not "too far" from a product measure. For instance, in the Curie-Weiss model, this just translates to β < 1.
As a final remark, we discuss the LSI property with respect to various difference operators in Sect. 5. In particular, we show that the restriction to finite spaces which is implicit in Theorem 2 is natural since the d-LSI property requires the underlying space to be finite. By contrast, we prove that any set of independent random variables X 1 , . . . , X n satisfies an h-LSI(1). However, it seems that it is not possible to use the entropy method based on h-LSIs.
The upper bound in Theorem 2 admits a "uniform version", i. e. we can prove deviation inequalities for suprema of functions, in the following sense. Let F be a family of uniformly bounded, real-valued, measurable functions and set For any d ∈ N and j = 1, Theorem 3 Assume that either X 1 , . . . , X n are independent or X satisfies a d-LSI(σ 2 ) and let g = g(X ) be as in (7). With the same constant C as in Theorems 1 or 2, respectively, we have for any t ≥ 0 the deviation inequality As mentioned before, Theorem 3 yields bounds for the upper tail only. The background is that the entropy method has certain limitations when it is applied to suprema of functions, cf. also Proposition 1 or Theorem 4 below. Roughly sketched, the reason is that when evaluating difference operators of suprema, if a positive part is involved we may typically choose a coordinate-independent maximizer of the terms involved. Without a positive part, this is no longer possible. See in particular the proof of Theorem 4, where we provide some further details.
Functionals of the form (7) Further research has been done in [34], [49,Sect. 3] and more recently [39,Proposition 5.4]. In these works, Bennett-type inequalities have been proven for general independent random variables. Furthermore, [16, Theorem 10] treats the case g(X ) = sup t∈T n i=1 t i X i for Rademacher random variables X i and a compact set of vectors T ⊂ R n . As a byproduct of our method, we prove a deviation inequality for g which can be regarded as a uniform bounded differences inequality.
Let us put Proposition 1 into context. In the above-mentioned works, the authors derive Bennett-type inequalities for independent random variables X 1 , . . . , X n , whereas in our case the concentration inequalities have sub-Gaussian tails. It might be compared to the sub-Gaussian tail estimates for Bernoulli processes, see e. g. [53,Theorem 5.3.2]. However, the d-LSI(σ 2 ) property is both more and less general. On the one hand, it is possible to include possibly dependent random vectors, but on the other hand for independent random variables it is only applicable if the X i take finitely many values.

Outline
In Sect. 2, we present a number of applications and refinements of our main results. Section 3 contains the proofs of our main theorems. The proofs of the results from Sect. 2 is deferred to Sect. 4. We close out the paper by discussing different forms of logarithmic Sobolev inequalities with respect to various difference operators in the last Sect. 5.

Applications
In the sequel, we consider various situations in which our results can be applied. Some of them can be regarded as sharpenings of our main theorems for functions which have a special structure.

Uniform Bounds
If the functions under consideration are of polynomial type, we may somewhat refine the results from the previous section. Here, we focus on uniform bounds as discussed in Theorem 3.
Let I n,d denote the family of subsets of [n] with d elements, fix a Banach space (B, · ) with its dual space (B * , · * ), a compact subset T ⊂ B I n,d and let B * 1 be the 1-ball in B * with respect to · * . Let X = (X 1 , . . . , X n ) be a random vector with support in [a, b] n for some real numbers a < b and define where X I := i∈I X i . For any k ∈ [d] we let where for k = d we use the convention I n,0 = {∅} and X ∅ := 1. One can interpret the quantities W k as follows: In this sense, we are considering the same quantities as in Theorem 3 but replace the difference operator h by formal derivatives of the polynomial under consideration.
Furthermore, the concentration inequalities are phrased with the help of the quantities Concentration properties for functionals as in (9) have been studied for independent Rademacher variables X 1 , . . . , X n (i. e. P(X i = +1) = P(X i = −1) = 1/2) and B = R in [16,Theorem 14] for all d ≥ 2, and under certain technical assumptions in [2]. We prove deviation inequalities in the weakly dependent setting, and afterwards discuss how these compare to the particular result in [16]. It is easily possible to derive a similar result for functions of independent random variables (in the spirit of Theorem 1). As the corresponding proof is easily done by generalizing the proof of [16,Theorem 14], we omit it. f Consequently, for any t ≥ 0 (13) and the same concentration inequalities hold with E W k replaced by E W k .
Note that independent Rademacher random variables satisfy a d-LSI(1) (see e. g. [26,Example 3.1] or [29,Theorem 3]). Therefore, we get back [16,Theorem 14] from Theorem 4 (with slightly different constants). However, Theorem 4 moreover includes many models with dependencies like those discussed in the introduction. Therefore, it may be considered as a extension of [16,Theorem 14] to dependent situations and moreover to coefficients from any Banach space B. For instance, we may consider an Ising chaos as a natural generalization of a Rademacher chaos to a dependent situation. In this case, Theorem 4 yields that that we still obtain basically the same concentration properties if the dependencies are sufficiently weak (which is guaranteed by the conditions outlined in the introduction).
To illustrate our results further, let us consider the case of d = 2 separately. Here we write The following corollary follows directly from Theorem 4.
For the case of independent Rademacher variables, this recovers the upper tail in a famous result by Talagrand [52, Theorem 1.2] on concentration properties of quadratic forms in Banach spaces, which has also been done in [16]. Note that for B = R, we have where T is the symmetric matrix with zero diagonal and entries T i j = t i j if i < j. If T consists of a single element only, we have T 1 ≤ |T | HS . Hence, Corollary 1 can be regarded as a generalized Hanson-Wright inequality.

The Boolean Hypercube
The case of independent Rademacher random variables above can be interpreted in terms of quantities from Boolean analysis.
The literature on Boolean functions is vast, and a modern overview is given in [44]. Particularly for concentration results we may highlight [5, Theorem 1.4] (which in particular holds for Boolean functions), which we discuss further and partially generalize to dependent models in Sect. 2.4. Proposition 2 may be of interest due to the direct use of quantities from Fourier analysis. Finally, we should add that while many concentration results for Boolean functions like [5,Theorem 1.4] or also Proposition 2 are valid for functions whose Fourier-Walsh decomposition stops at some order d, Theorem 1 or Theorem 2 work for functions with Fourier-Walsh decomposition possibly up to order n.

Concentration Properties of U-Statistics
Another application of Theorems 1 and 2 are concentration properties of so-called U -statistics which frequently arise in statistical theory. We refer to [24] for an excellent monograph. More recently, concentration inequalities for U -statistics have been considered in [1], [ Let Y = X n and assume that X 1 , . . . , X n are either independent random variables, or the vector X = (X 1 , . . . , X n ) satisfies a d-LSI(σ 2 ). Let h : X d → R be a measurable, symmetric function with h(X i 1 , . . . , X i d ) ∈ L ∞ (P) for any i 1 , . . . , i d , and define B := max i 1 =... =i d h(X i 1 , . . . , X i d ) L ∞ (P) . We are interested in the concentration properties of the U -statistic with kernel h, i. e. of Proposition 3 Let X = (X 1 , . . . , X n ) be as above and f = f (X ) be as in (14). There exists a constant C > 0 (the same as in Theorems 1 and 2) such that for any t ≥ 0 and for some C = C(d) The normalization n 1/2−d in (15) is of the right order for U -statistics generated by a non-degenerate kernel h, i.e. Var(E X 1 h(X 1 , . . . , X d )) > 0, see [24,Remarks 4.2.5]. In the case of i.i.d. random variables X 1 , . . . , X n it states that Actually, (15) shows that for t ≤ n 1/2 we have sub-Gaussian tails for any finite n ∈ N for bounded kernels h.
Proposition 3 improves upon our old result [10, Corollary 1.3] by providing multilevel tail bounds, thus yielding much finer estimates than the exponential moment bound given in the earlier paper. Moreover, it does not only address independent random variables but also weakly dependent models. As compared to the results from [1] and [5, Sect. 3.1.2], Proposition 3 covers different types of measures, since in [1] independent random variables were considered, while in [5] a Sobolev-type inequality was required, which does not include the various discrete models for which a d-LSI holds.

Polynomials and Subgraph Counts in Exponential Random Graph Models
Lastly, let us once again consider polynomial functions. The case of independent random variables has been treated in [5,Theorem 1.4] under more general conditions, so we omit it and concentrate on weakly dependent random variables.
Let f d : R n → R be a multilinear (also called tetrahedral) polynomial of degree d, i. e. of the form for symmetric k-tensors a k with vanishing diagonal. Here, a k-tensor a k is called symmetric, if a k i 1 ...i k = a k σ (i 1 )...σ (i k ) for any permutation σ ∈ S k , and the (generalized) diagonal is defined as k := {(i 1 , . . . , i k ) : |{i 1 , . . . , i k }| < k}. Denote by ∇ (k) f the k-tensor of all partial derivatives of order k of f .
For the next result, given some d ∈ N, we recall a family of norms · I on the space of d-tensors for each partition I = {I 1 , . . . , I k } of {1, . . . , d}. The family · I has been first introduced in [35], where it was used to prove two-sided estimates for L p norms of Gaussian chaos, and the definitions given below agree with the ones from [35] as well as [3] and [5]. We can regard the A I as a family of operator-type norms. In particular, it is easy to see that A {1,...,d} = |A| HS and A {{1},...,{d}} = |A| op .
The following result has been proven in the context of Ising models (in the Dobrushin uniqueness regime) in [3], and can easily be extended to any vector X satisfying a d-LSI(σ 2 ). By invoking the family of norms · I , it provides a refinement of our general result for the special case of multilinear polynomials.
Theorem 5 Let X be a random vector supported in [−1, +1] n and satisfying a d-LSI(σ 2 ), and f d = f d (X ) be as in (16). There exists a constant C > 0 depending on d only such that for all t ≥ 0 For illustration, let us once again consider the case of d = 2. In the notation of (16), we take a 1 = 0 and a 2 = A, i. e. f 2 (x) = x T Ax for a symmetric matrix A with vanishing diagonal. In this case, assuming the components of X to be centered (so the k = 1 term vanishes), Theorem 5 reads i. e. we obtain a Hanson-Wright inequality in this situation. For higher orders, we arrive at similar bounds. Altogether, for the class of multilinear polynomials, Theorem 5 yields finer bounds than Theorem 2 (by virtue of the large class of norms involved), though for d ≥ 3 explicit calculations of the norms involved can be difficult.
To point out one possible application, Theorem 5 can be used in the context of the exponential random graph model (ERGM). Let us briefly recall the definitions. Given s ∈ N real numbers β 1 , . . . , β s and simple graphs G 1 , . . . , G s (with G 1 being a single edge by convention), the ERGM with parameter fi = (β 1 , . . . , β s , G 1 , . . . , G s ) is a probability measure on the space of all graphs on n ∈ N vertices given by the weight function exp is the number of copies of G i in the graph x and |V i | is the number of vertices of G i = (V i , E i ). For details, see [23,48]. One can think of the ERGM as an extension of the famous Erdös-Rényi model (which corresponds to the choice s = 1) to account for dependencies between the edges.
By way of example we show concentration properties of the number of triangles T 3 (X ) = {e, f ,g}∈T 3 X e X f X g (where T 3 denotes the set of all three edges forming a triangle). To formulate our results, we need to recall the function β (x) = s i=1 β i |E i |x |E i |−1 which frequently appears in the discussion of the ERGM. Moreover, we set |β| := (|β 1 |, . . . , |β s |). In the following corollary, the condition 1 2 |β| (1) < 1 ensures weak dependence in the sense that a d-LSI holds. As outlined above, in comparison to earlier results like [48, Theorem 3.2], using Theorem 5 yields sharper tail estimates. (β 1 , . . . , β s , G 1 , . . . , G s ) such that 1 2 |β| (1) < 1. There is a constant C(β) such that for all t ≥ 0

Concentration Inequalities Under Logarithmic Sobolev Inequalities: Proofs
In this section, we give the proofs of our main results. All of them work by first establishing a growth rate on the L p norms of f − E f which will then be iterated. For technical reasons, we need to introduce some auxiliary difference operators which are closely related to h. For i ∈ [n] let where f X i ,∞ shall denote the L ∞ norm with respect to X i . The L p norm inequalities which form the core of our proofs can be found in [10, Theorem 2.3, Corollary 2.6] (building upon the earlier results in [16]). Note that as compared to [10], a different choice of normalization for h ± leads to slightly different constants.
Consequently, this leads to Furthermore, we need an auxiliary statement relating differences of consecutive order. In [10], we have proven that |h|h (d) f | HS | ≤ |h (d+1) f | HS . Moreover, we explained that a similar estimate with the Hilbert-Schmidt replaced by operator norms cannot be true. As we will see next, the key step in order to be able to invoke operator norms nevertheless is to work with h + .
Here, we need the following simple but crucial observation: if A is a d-tensor, the supremum in the definition of |A| op is attained, and if A is a non-negative tensor (i. e. A i 1 . ..i d ≥ 0 for all i 1 , . . . , i d ), the maximizing vectors v 1 , . . . , v d can be chosen to have all positive entries. Indeed, since v 1 we can define | v| j by taking the absolute value element-wise.

Lemma 1 For any d ≥ 2
Here, in the first inequality we insert the vectors v 1 , . . . , v d−1 maximizing the supremum and use the monotonicity of x → x + , and the second and third inequality follow from the triangle inequality. Taking the square root yields the claim.
As a final step, we need to establish a connection between L p norm estimates and multilevel concentration inequalities. This is given by the following proposition, which was proven in [1,Theorem 7] and [5,Theorem 3.3]. We state it in the form given in [48, Proof of Theorem 3.6] with slight modifications. 2), and let L := |{l : C l > 0}|. For any t ≥ 0 we have

Proposition 4 Assume that a random variable f satisfies for any p ≥ 2 and some
We will not give a proof of Proposition 4 and refer to the aforementioned works. However, the proof is almost identical to the proof of Proposition 2. The two important cases will be s = 0 (for independent random variables) as well as s = 3/2 (in the weakly dependent setting).
The proof of Theorem 1 is now easily completed.

Proof of Theorem 1
Since X 1 , . . . , X n are independent, Theorem 6 yields where we have used that for any positive random variable W The second term on the right hand side can now be estimated using Theorem 6 again, which in combination with Lemma 1 gives This can be easily iterated to obtain for any d ∈ N Now it remains to apply Proposition 4.
To prove Theorem 2, we shall require the following proposition, which is proven in [28,Proposition 2.4], building upon arguments established in [6]. (Note that the definition of h there differed by a factor of √ 2.) The estimate (20) does not appear therein, but is an easy modification of the proof.

Proposition 5 Let μ be a measure on a product of Polish spaces satisfying a d-LSI(σ 2 ).
Then, for any f ∈ L ∞ (μ) and any p ≥ 2 we have (19) and

Proof of Theorem 2
The proof is very similar to the proof of Theorem 1. In the first step, using (19) leads to Equation (20) can be used to estimate the second term on the right-hand side. So, for any d ∈ N we have by an iteration Again we can apply Proposition 4 to obtain the concentration inequality.
To prove Theorem 3 we shall need the following lemma.
Lemma 2 Let (B, · ) be a Banach space and F a family of uniformly norm-bounded, B-valued, measurable functions and set g(X ) = sup f ∈F f (X ) . We have Proof Fix an X ∈ Y and choose for any where the first inequality follows by monotonicity of x → x + and the second one is a consequence of (a + b − c) + ≤ (a − c) + + b for a, b, c ≥ 0. Thus, we have Taking the limit ε → 0 yields the claim.

Proof of Theorem 3
Note that in the real-valued case, the estimate h + i | f | ≤ h i f holds. For brevity, let s = 3/2. Using this in combination with Proposition 5 and Lemma 2 yields We can apply Proposition 5 again on the right hand side, which gives A combination of Lemmas 1 and 2 shows that |h + W j | ≤ W j+1 , and so by an iteration we obtain In the case of independent random variables, we replace the first step using Theorem 6.

Proof of Proposition 1
The proof shares some similarities with the proof of Lemma 2.
Since X satisfies a d-LSI(σ 2 ), we have for any p ≥ 2 Moreover, for any i ∈ [n] and x ∈ Y, if a maximizer f of sup f ∈F | n j=1 f (x j )| exists, we obtain If a maximizer f does not exist, these estimates remain valid by an approximation argument as in the proof of Lemma 2. Consequently, we have (g − E g) + p ≤ (2σ 2 ( p − 3/2)n sup f ∈F c( f ) 2 ) 1/2 . The claim now follows from Proposition 4.

Suprema of Chaos, U-statistics and Polynomials: Proofs
Proof of Theorem 4 Let us first consider the case that X satisfies a d-LSI(σ 2 ). Recall that we have by (20) ( We shall make use of the pointwise inequality |h + f | ≤ (b − a)W 1 . To see this, let ( t, v * ) be the tuple satisfying sup t∈T sup v * ∈B * 1 v * ( I ∈I n,d X I t I ) = v * ( I ∈I n,d X I t I ) . We have proving the first part. Consequently, As in [16], this can now be iterated, i. e. we have for any k ∈ {1, . . . , Here, we may argue as above, where the only difference is to choose ( t, v * ) and α (1) , . . . , α (k) which maximize W k . This finally leads to using that W d is constant. This proves (11). The same arguments are also valid without a d-LSI(σ 2 ) property, if one considers ( f − E f ) + p and applies Theorem 6 instead. Lastly, to prove (12), let us first consider why we cannot argue as before. Note that the argument heavily relies on the positive part of the difference operator h + , which allows us to choose the maximizers t 1 , . . . , t n independent of i ∈ [n]. This is no longer possible for the concentration inequality. Here, Theorem 6 yields Thus, this argument fails if we try to use these inequalities. However, we can rewrite where the sup is to be understood with respect to the support of X i . As a consequence, we have for each fixed i ∈ [n] (again choosing t by maximizing the first summand in the brackets) This implies The proof is now completed as using the same arguments as in the first part, with W k replaced by W k . The same argument is valid for X satisfying a d-LSI(σ 2 ).

Proof of Proposition 2
The proposition can be proven using a similar technique as before, since the Hilbert-Schmidt norms of higher order difference act as Fourier projections. We choose to take an alternate route as follows. The proof of [44,Theorem 9.21] shows that for any f with degree at most d and any p ≥ 2 First off, by Chebyshev's inequality we have for any p ≥ 1 We want to apply this to a t-dependent parameter p given by the function which combined with the trivial estimate P(·) ≤ 1 gives

Proof of Proposition 3
We apply Theorems 1 and 2 in the respective cases. To this end, we make use of the general bound h (k) f op,1 ≤ h (k) f HS,∞ for k ∈ [d]. For any distinct j 1 , . . . , j k write · = · j 1 ,..., j k ,∞ , so that Now it is easy to see that S i 1 ,...,i d (h, X ) = 0 unless { j 1 , . . . , j k } ⊂ {i 1 , . . . , i d } (for example, this follows if one writes the sum inside the norm as k i=1 (Id − T j i ) f ), and in these cases one can upper bound the supremum by 2 k B, from which we infer Consequently, this leads to Thus, an application of Theorems 1 or 2, respectively, yields for any t ≥ 0 and for C as given therein For the second part, choose t = Bn d−1/2 t for t > 0 to obtain A short calculation shows that the minimum is attained for k = 1 in the range t ≤ n 1/2 and for k = d otherwise, i. e.
Proof of Theorem 5 We give a sketch of the proof only and refer to [3, Proof of Theorem 2.2] for details. Recall that by (19) we have the inequality Using the arguments and notations from [3, Proof of Theorem 2.2] leads to where M is an absolute constant and G i is a sequence of independent standard Gaussian random variables, independent of X . Furthermore, a result by Latała [35] yields The rest now follows as in the previous proofs.

Proof of Corollary 2
In [48] the authors have proven that 1 2 |β| (1) < 1 implies a d-LSI(σ 2 ) for μ β with a constant depending on the parameter β only. Thus, it remains to bound the norms in (17). Note that due to the structure of the exponential random graph model, the expectations of E X G and E X H are equal whenever G and H are isomorphic. Thus, we define C S 2 := E X S 2 (where S 2 is a 2-star) and C E = E X e .
The Euclidean norms can be easily bounded: and it remains to estimate the three remaining norms. However, in [5, Sect. 5.1], the authors given estimates for such norms in the Erdös-Rényi case, and it is easy to adapt these to any model with the property that E X G depends only on the isomorphism class of G (in the complete graph). Particularly, due to the structure of the exponential random graph models, this is true in this setting as well. This gives Inserting these estimates into (17) finishes the proof.

Logarithmic Sobolev Inequalities and Difference Operators
To conclude this paper, we discuss the LSI property (2) for different choices of difference operators . Here, we always assume that the probability measure μ is defined on a product of Polish spaces Y = ⊗ n i=1 X i with product Borel σ -algebra A = B(⊗ n i=1 X i ).
In this situation, we can make use of the disintegration theorem on Polish spaces (see [7,Theorem 5.3.1] and [25, Chapter III]): If μ is a measure on Y, then for each i ∈ {1, . . . , n} we can decompose μ using the marginal measure μ i c (as a measure on ⊗ j =i X i ) and a conditional measure on X i , which we denote by μ(· | x i c ).

More precisely, for any
For finite spaces, μ(· | x i c ) is just the ordinary conditional measure as used in the definition of the difference operator d. Note that the definition of d can in principle be rewritten for products of arbitrary Polish spaces. However, our first result shows that the d-LSI property in fact requires the underlying space to be finite. More precisely, we say that μ has finite support if there is no sequence of sets A n ∈ A with μ(A n ) > 0 for any n and μ(A n ) → 0.

Proposition 6
Let Y = ⊗ n i=1 X i be a product of Polish spaces, and let μ be a probability measure on Y. If μ satisfies a d-LSI, then μ has finite support. Moreover, if μ is a product probability measure, then μ satisfies a d-LSI iff μ has finite support.
Proof First assume μ does not have finite support, i. e. there is a sequence A n ∈ A with μ(A n ) → 0. Choosing f n := 1 A n ∈ L ∞ (μ) and assuming a d-LSI(σ 2 ) holds, we obtain (23) This easily leads to a contradiction.
On the other hand, let μ be a product probability measure with finite support. By tensorization, it suffices to consider n = 1, and we may moreover assume Y to have finitely many elements only. Then, by [12, Remark 6.6], μ satisfies a d-LSI(σ 2 ) with σ 2 ≤ C log(1/ min y:μ(y)>0 μ(y)), which finishes the proof.
In fact, Proposition 6 can be adapted to the difference operator h + as well. To see this, note that that (23) can easily be rewritten for the difference operator h + (with only minor changes) and |d f | 2 dμ ≤ |h + f | 2 dμ. In particular, the d-and h + -LSI properties are not essentially different. The situation drastically changes if we consider h-LSIs instead. Here, a sufficient condition for the h-LSI property to hold is that the measure μ satisfies an approximate tensorization (AT) property. As a consequence, for product probability measures, satisfying an h-LSI is in fact a universal property.

Theorem 7
Let Y = ⊗ n i=1 X i be a product of Polish spaces, and let μ be a probability measure on Y. If μ satisfies an approximate tensorization property then μ also satisfies an h-LSI(C). In particular, any product probability measure satisfies an h-LSI(1).
To the best of our knowledge, Theorem 7 is new. For product measures, it might be compared to the Efron-Stein inequality (see e. g. [27,50]) which establishes the tensorization property for the variance, and can be regarded as a universal Poincaré inequality with respect to d (see e. g. [10] for such an interpretation). However, note that Theorem 7 (i. e. more precisely the h-LSI(1) for product measures) does not imply the Efron-Stein inequality, as the difference operator is h instead of d. Unfortunately, as Proposition 6 demonstrates, there is no "entropy version" of the Efron-Stein inequality of the form Ent μ ( f 2 ) ≤ C E μ |d f | 2 (for any product probability measure μ and some universal constant C).
As by Theorem 7, any set of independent random variables X 1 , . . . , X n satisfies an h-LSI(1), it might be tempting to regard Theorem 1 as an h-LSI analogue of Theorem 2. However, it seems that it is not possible to use the entropy method based on h-LSIs, so that this interpretation is not fully accurate. More precisely, Theorem 7 cannot be used to estimate the growth of L p norms as in the setting of a d-LSI(σ 2 ). Indeed, it is impossible to prove the required moment inequalities f − E f q ≤ (σ 2 q) 1/2 h f q (25) under an h-LSI(σ 2 ). For example, the measure μ p = pδ 1 + (1 − p)δ 0 satisfies h-LSI(σ 2 p ) with σ 2 p ∼ p(1 − p) log(1/ p) (for p → 0), so that (25) would imply for f (x) = x an upper bound on the Orlicz norm associated to 2 However, a simple calculation shows that E exp ( f −E f ) 2 16e 2 σ 2 p → ∞ as p → 0.
The approximate tensorization property in Theorem 7 is interesting in its own right, but it is not yet well-studied. For finite spaces [40] gives sufficient conditions for a measure μ to satisfy an approximate tensorization property. Similar results have been derived in [21], which can be applied in discrete and continuous settings. For example, if one considers a measure of the form for some countable spaces i , x i ∈ i , measures μ 0,i on i and bounded functions w i j , under certain technical conditions μ satisfies an approximate tensorization property. This does not require any functional inequality for μ 0,i . Very recently, in [3,Proposition 5.4] it has been shown that the AT(C) property implies dimension-free concentration inequalities for convex functions.
Note that the AT(C) property requires a certain weak dependence assumption in general. For example, the push-forward of a random permutation π of [n] to N n cannot satisfy an approximate tensorization property. It is an interesting question to find necessary and sufficient conditions for the approximate tensorization property to hold.
Proof of Theorem 7 Let X = (X 1 , . . . , X n ) be a Y-valued random vector with law μ. First we consider the case n = 1. By homogeneity of both sides, we may assume f 2 (X )d P = 1. Since f is bounded, we have 0 ≤ a ≤ | f (X )| ≤ b < ∞ P-a.s., where b is the essential supremum of | f (X )| and a the essential infimum. Due to the constraints on the integral this leads to a 2 ≤ 1 ≤ b 2 . (Actually the cases b = 1 or a = 1 are trivial, since then f 2 (X ) = 1 P-a.s., but we will not make this distinction.) Let F(u) := P( f 2 (X ) ≥ u). In particular Using the partial integration formula (see e. g. Plugging in these two estimates yields Ent f 2 (X ) ≤ a 2 log a 2 + 1 − a 2 + log b 2 1 − a 2 =: f (a, b).
Next, if we show that we can further estimate (as |h f | 2 is a deterministic quantity in the case n = 1) To prove (26), define g(a, b) := a 2 log a 2 + 1 − a 2 + log b 2 1 − a 2 − 2(b − a) 2 .
Now it is easy to see that g(a, 1) = a 2 log a 2 + (1 − a 2 ) − 2(1 − a) 2 ≤ 0, since ∂ a g(a, 1) ≥ 0 for a ∈ [0, 1] and g(1, 1) = 0. Moreover so that g is decreasing on every strip {a 0 } × [1, ∞), and thus g(a, b) ≤ 0 for all a, b ∈ G. This finishes the proof for n = 1. For arbitrary n, the proof is now easily completed. Assume that f ∈ L ∞ (μ), i. e. μ i c (x i c )-a.s. we have f (x i c , ·) ∈ L ∞ (μ(· | x i c )). For these x i c , by the n = 1 case we therefore obtain Plugging this into the assumption leads to As for the second part, it is a classical fact that independent random variables satisfy the tensorization property (i. e. AT(1)), see for example [38,Proposition 5.6], [16,Theorem 4.10] or [54,Theorem 3.14]. In the case of independent random variables, the assumption that Y is a product of Polish spaces can be dropped by simply defining μ(· | x i c ) := μ i = P •X i .