1 Introduction

During the last forty years, the concentration of measure phenomenon has become an established part of probability theory with applications in numerous fields, as is witnessed by the monographs [18, 38, 42, 45, 54]. One way to prove concentration of measure is by using functional inequalities, more specifically the entropy method. It emerged as a way to prove several of Talagrand's groundbreaking concentration inequalities in product spaces [51, 52], mainly in the works [11, 37], and was further developed in [41].

To convey the idea, let us recall that the logarithmic Sobolev inequality for the standard Gaussian measure \(\nu \) in \({{\,\mathrm{\mathbb {R}}\,}}^n\) (see [29]) states that for any \(f \in C_c^\infty ({{\,\mathrm{\mathbb {R}}\,}}^n)\) we have

$$\begin{aligned} {{\,\mathrm{Ent}\,}}_{\nu }\big (f^2\big ) \le 2 \int {|\nabla f|}^2 {\text {d}}\nu , \end{aligned}$$
(1)

where \({{\,\mathrm{Ent}\,}}_{\nu }(f^2) = \int f^2 \log f^2 {\text {d}}\nu - \int f^2 {\text {d}}\nu \log \int f^2 {\text {d}}\nu \) is the entropy functional. Informally, it bounds the disorder of a function f (under \(\nu \)) by its average local fluctuations, measured in terms of the length of the gradient. It is by now standard that (1) implies subgaussian tail decay for Lipschitz functions (e. g. by means of the Herbst argument). In particular, if \(f :\mathbb {R}^n \rightarrow {\mathbb {R}}\) is a \({\mathcal {C}}^1\) function such that \(|\nabla f| \le L\) a.s., we have \(\nu (|f - \int f {\text {d}}\nu | \ge t) \le 2 \exp (-t^2/(2L^2))\) for any \(t \ge 0\).
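To see this tail bound at work, the following minimal numerical sketch (not part of the mathematical development; it assumes numpy and estimates \(\int f {\text {d}}\nu \) by the empirical mean) compares the empirical tails of the 1-Lipschitz function \(f(x) = {|x|}\) of a standard Gaussian vector with the bound \(2\exp (-t^2/2)\).

```python
# Minimal sketch: empirical tails of a 1-Lipschitz function of a Gaussian vector
# versus the sub-Gaussian bound 2 exp(-t^2 / (2 L^2)) with L = 1.
import numpy as np

rng = np.random.default_rng(0)
n, samples = 50, 200_000
X = rng.standard_normal((samples, n))
f = np.linalg.norm(X, axis=1)            # f(X) = |X| is 1-Lipschitz
f_centered = f - f.mean()                # empirical mean as a proxy for the integral

for t in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(f_centered) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    print(f"t={t:.1f}  empirical tail={empirical:.2e}  bound={bound:.2e}")
```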

If \(\mu \) is a probability measure on a discrete set \({\mathcal {X}}\) (or a more abstract set not allowing for an immediate replacement for \({|\nabla f|}\)), then there are several ways to reformulate Eq. (1), see e. g. [12, 26]. We continue these ideas by working in the framework of difference operators. Given a probability space \(({\mathcal {Y}}, {\mathcal {A}}, \mu )\), we call any operator \(\Gamma : L^\infty (\mu ) \rightarrow L^\infty (\mu )\) satisfying \(|\Gamma (af + b)| = a\, |\Gamma f|\) for all \(a > 0\), \(b \in {\mathbb {R}}\) a difference operator. Accordingly, we say that \(\mu \) satisfies a \(\Gamma \text {-LSI}(\sigma ^2)\), if for all bounded measurable functions f we have

$$\begin{aligned} {{\,\mathrm{Ent}\,}}_{\mu }\big (f^2\big ) \le 2\sigma ^2 \int \Gamma (f)^2 {\text {d}}\mu . \end{aligned}$$
(2)

Apart from the domain of \(\Gamma \), it is clear that (2) can be seen as a generalization of (1) by defining \(\Gamma (f) = {|\nabla f|}\) on \({{\,\mathrm{\mathbb {R}}\,}}^n\).

Another route to obtain concentration inequalities is to modify the entropy method, which was done in the framework of so-called \(\varphi \)-entropies. The idea is to replace the function \(\varphi _0(x) :=x\log x\) in the definition of the entropy \({{\,\mathrm{Ent}\,}}^{\varphi _0}_{\mu }(f) = {{\,\mathrm{\mathbb {E}}\,}}_\mu \varphi _0(f) - \varphi _0({{\,\mathrm{\mathbb {E}}\,}}_\mu f)\) by other functions \(\varphi \). This has been studied in [17, 22, 36]. In the seminal work [16] the authors proved inequalities for \(\varphi \)-entropies for power functions \(\varphi (x) = {|x|}^\alpha , \alpha \in (1,2]\), leading to moment inequalities for independent random variables.

Originally, the entropy method was primarily used to prove sub-Gaussian concentration inequalities for Lipschitz-type functions. However, there are many situations of interest in which the functions under consideration are not Lipschitz or have Lipschitz constants which grow as the dimension increases even after a renormalization which asymptotically stabilizes the variance. Among the simplest examples are polynomial-type functions. Here, the boundedness of the gradient typically has to be replaced by more elaborate conditions on higher order derivatives (up to some order d). Moreover, we cannot have subgaussian tail decay anymore. This is already obvious if we consider the product of two independent standard normal random variables, which leads to subexponential tails. We refer to this topic as higher order concentration.

The earliest higher order concentration results date back to the late 1960s. Already in [13, 14, 43], the growth of \(L^p\) norms and hypercontractive estimates of polynomial-type functions in Rademacher or Gaussian random variables, respectively, were studied. The question of estimating the growth of \(L^p\) norms of multilinear polynomials in Gaussian random variables was considered in [8, 15, 35]. In the context of Erdös–Rényi graphs and the triangle problem, concentration inequalities for polynomial functions gained considerable attention, in papers such as [33].

More recently, multilevel concentration inequalities have been proven in [1, 5, 56] for many classes of functions. These include U-statistics in independent random variables, functions of random vectors satisfying Sobolev-type inequalities and polynomials in sub-Gaussian random variables, respectively. We refer to inequalities of the type

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\Big ( {|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge t \Big ) \le 2 \exp \Big ( - \frac{1}{C} \min _{k =1,\ldots ,d} f_k(t) \Big ) \end{aligned}$$
(3)

as multilevel or higher order (d-th order) concentration inequalities. This means that the tails might have different decay properties in different regimes of \([0,\infty )\). Usually, we have \(f_k(t) = (t/C_k)^{2/k}\) for some constant \(C_k\) which typically depends on the k-th order derivatives.

To convey the basic idea of multilevel concentration inequalities, let us once again consider the case \(d=2\), e. g. a quadratic form of independent, say, Gaussian random variables. As sketched above, in this case the tails decay subexponentially in general. By means of a multilevel concentration inequality (the so-called Hanson–Wright inequality, which we address in more detail at a later point), we can show that while subexponential tail decay holds for large t, we even get subgaussian decay for small t. In this sense, multilevel concentration inequalities provide refined tail estimates which cover not only the behavior for large t.

Our own work started with a second-order concentration inequality on the sphere in [9] and was continued in [10] for bounded functionals of various classes of random variables (e. g. independent random variables or in presence of a logarithmic Sobolev inequality (1)), and in [28] for weakly dependent random variables (e. g. the Ising model). In these papers, we studied higher order concentration, arriving at multi-level tail inequalities of type (3). If the underlying measure \(\mu \) satisfies a logarithmic Sobolev inequality, [10, Corollary 1.11] yields \(f_k(t) = (t/C_k)^{2/k}\) with \(C_k = (\int |f^{(k)}|_{\text {op}}^2 {\text {d}}\mu )^{1/2}\) for \(k = 1, \ldots , d-1\) and \(C_d = \sup |f^{(d)}|_{\text {op}}\), where \({|f^{(k)}|}_{\text {op}}\) denotes the operator norm of the respective tensors of k-th order partial derivatives. A downside in both [10, 28] is that for functions of independent or weakly dependent random variables, comparable estimates involve Hilbert–Schmidt instead of operator norms, leading to weaker estimates in general.

A central aspect of the present article is to fix this drawback by a slightly more elaborate approach. Here, we consider both independent and dependent random variables. In either case, we prove multilevel concentration inequalities of the same type, and apply them to different forms of functionals. We provide improvements of earlier higher order concentration results like [10, Theorem 1.1] or [28, Theorem 1.5], replacing the Hilbert–Schmidt norms appearing therein by operator norms. This leads to sharper bounds and a wider range of applicability.

A special emphasis is placed on providing uniform versions of the higher order concentration inequalities. By this, we mean that we consider functionals of supremum type \(g(X) = \sup _{f \in {\mathcal {F}}} {|f(X)|}\), which includes suprema of polynomial chaoses, or empirical processes. Two more applications are given by U-statistics in independent and weakly dependent random variables as well as a triangle counting statistic in some models of random graphs, for which we prove concentration inequalities.

Notations Throughout this note, \(X = (X_1,\ldots ,X_n)\) is a random vector taking values in some product space \({\mathcal {Y}} = \otimes _{i = 1}^n {\mathcal {X}}_i\) (equipped with the product \(\sigma \)-algebra) with law \(\mu \), defined on a probability space \((\Omega , {\mathcal {A}}, {\mathbb {P}})\). By abuse of language, we say that X satisfies a \(\Gamma \text {-LSI}(\sigma ^2)\), if its distribution does. In any finite-dimensional vector space, we let \({|\cdot |}\) be the Euclidean norm, and for brevity, we write \([q] :=\{1,\ldots , q\}\) for any \(q \in {{\,\mathrm{\mathbb {N}}\,}}\). Given a vector \(x = (x_j)_{j = 1,\ldots ,n}\) we write \(x_{i^c} = (x_j)_{j \ne i}\). For any d-tensor A we define the Hilbert–Schmidt norm \({|A|}_{\text {HS}} :=(\sum _{i_1, \ldots , i_d} A_{i_1 \ldots i_d}^2)^{1/2}\) and the operator norm

$$\begin{aligned} {|A|}_{\text {op}} :=\sup _{\begin{array}{c} v^1, \ldots , v^d \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|v^j|} \le 1 \end{array}} {\langle v^1 \cdots v^d, A\rangle } = \sup _{\begin{array}{c} v^1, \ldots , v^d \\ {|v^j|} \le 1 \end{array}} \sum _{i_1,\ldots ,i_d} v^1_{i_1} \cdots v^d_{i_d} A_{i_1 \ldots i_d}, \end{aligned}$$

using the outer product \((v^1 \cdots v^d)_{i_1 \ldots i_d} = \prod _{j = 1}^d v^j_{i_j}\). For brevity, for any random k-tensor A and any \(p \in (0,\infty ]\) we abbreviate \({\Vert A\Vert }_{\text {HS},p} = ({{\,\mathrm{\mathbb {E}}\,}}{|A|}_{\text {HS}}^p)^{1/p}\) as well as \({\Vert A\Vert }_{\text {op},p} = ({{\,\mathrm{\mathbb {E}}\,}}{|A|}_{\text {op}}^p)^{1/p}\). Lastly, we ignore any measurability issues that may arise. Thus, we assume that all the suprema used in this work are either countable or defined as \(\sup _{t \in T} = \sup _{F \subset T : F \text { finite}} \sup _{t \in F}\).
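As a computational aside (not part of the paper's argument; it assumes numpy, and the helper names are illustrative), the following sketch evaluates the two tensor norms just defined. For matrices the operator norm is the largest singular value; for \(d \ge 3\) only a simple alternating maximization heuristic is used, which in general yields a lower bound on \({|A|}_{\text {op}}\).

```python
import numpy as np

def hs_norm(A):
    """Hilbert-Schmidt norm |A|_HS of an arbitrary tensor."""
    return float(np.sqrt(np.sum(np.asarray(A, dtype=float)**2)))

def op_norm_matrix(A):
    """Operator norm |A|_op of a matrix, i.e. its largest singular value."""
    return float(np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)[0])

def op_norm_lower_bound(A, iters=200, seed=0):
    """Alternating maximization over unit vectors v^1, ..., v^d; for d >= 3 this
    is only a heuristic lower bound on |A|_op."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    vs = [rng.standard_normal(dim) for dim in A.shape]
    vs = [v / np.linalg.norm(v) for v in vs]
    for _ in range(iters):
        for j in range(A.ndim):
            B = A
            for k in range(A.ndim - 1, -1, -1):   # contract all directions except j
                if k != j:
                    B = np.tensordot(B, vs[k], axes=([k], [0]))
            vs[j] = B / np.linalg.norm(B)
    val = A
    for v in reversed(vs):                        # full contraction <v^1 ... v^d, A>
        val = np.tensordot(val, v, axes=([val.ndim - 1], [0]))
    return float(val)

A = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hs_norm(A), op_norm_matrix(A), op_norm_lower_bound(A))   # sqrt(2), 1.0, ~1.0

rng = np.random.default_rng(1)
A3 = rng.standard_normal((4, 4, 4))
print(hs_norm(A3), op_norm_lower_bound(A3))                    # HS dominates the operator norm
```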

1.1 Main Results

To formulate our main results, we introduce a difference operator labeled \({|{\mathfrak {h}} f|}\) which is frequently used in the method of bounded differences. Let \(X' = (X_1',\ldots , X'_n)\) be an independent copy of X, defined on the same probability space. Given \(f(X) \in L^\infty ({\mathbb {P}})\), define for each \(i \in [n]\)

$$\begin{aligned} T_i f :=T_if(X) :=f(X_{i^c}, X_i') = f(X_1, \ldots , X_{i-1}, X_i', X_{i+1},\ldots , X_n) \end{aligned}$$

and

$$\begin{aligned} {\mathfrak {h}}_i f(X) = \Vert f(X) - T_if(X) \Vert _{i, \infty }, \qquad {\mathfrak {h}} f(X) = ({\mathfrak {h}}_1 f(X), \ldots , {\mathfrak {h}}_n f(X)), \end{aligned}$$

where \(\Vert \cdot \Vert _{i, \infty }\) denotes the \(L^\infty \)-norm with respect to \((X_i,X_i')\). The difference operator \({|{\mathfrak {h}}f|}\) is given as the Euclidean norm of the vector \({\mathfrak {h}} f\).
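For a concrete feeling for \({\mathfrak {h}}\), the following minimal sketch (assuming numpy; not taken from the paper) evaluates \({\mathfrak {h}}_i f(x)\), i. e. the oscillation of f in the i-th coordinate with the remaining coordinates frozen, for a quadratic form on a finite alphabet.

```python
import numpy as np

def h_vector(f, x, values):
    """Return (h_1 f(x), ..., h_n f(x)) when each coordinate takes values in `values`."""
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(x))
    for i in range(len(x)):
        evals = []
        for y in values:
            z = x.copy()
            z[i] = y
            evals.append(f(z))
        out[i] = max(evals) - min(evals)   # oscillation of f in coordinate i
    return out

# Example: quadratic form f(x) = sum_{i<j} a_ij x_i x_j on {-1,+1}^n
rng = np.random.default_rng(1)
n = 5
A = np.triu(rng.standard_normal((n, n)), k=1)
f = lambda x: x @ A @ x
x = rng.choice([-1.0, 1.0], size=n)
hf = h_vector(f, x, values=[-1.0, 1.0])
print(hf, np.linalg.norm(hf))              # the vector h f(x) and |h f(x)|
```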

We shall also need higher order versions of \({\mathfrak {h}}\), denoted by \({\mathfrak {h}}^{(d)} f\). They can be thought of as analogues of the d-tensors of all partial derivatives of order d in an abstract setting. To define the d-tensor \({\mathfrak {h}}^{(d)} f\), we specify it on its “coordinates”. That is, given distinct indices \(i_1, \ldots , i_d\), we set

$$\begin{aligned} \begin{aligned} {\mathfrak {h}}_{i_1 \ldots i_d}f(X) = \;&\Big \Vert \, \prod _{s=1}^d\, (\text {Id} - T_{i_s})f(X) \Big \Vert _{i_1, \ldots , i_d, \infty }\\ = \;&\Big \Vert \, f(X) + \sum _{k=1}^d\, (-1)^k \sum _{1 \le s_1< \ldots < s_k \le d} T_{i_{s_1} \ldots i_{s_k}}f(X)\, \Big \Vert _{i_1, \ldots , i_d, \infty } \end{aligned} \end{aligned}$$
(4)

where \(T_{i_1 \ldots i_d} = T_{i_1} \circ \ldots \circ T_{i_d}\) replaces the random variables \(X_{i_1}, \ldots , X_{i_d}\) by \(X'_{i_1}, \ldots , X'_{i_d}\), and \(\Vert \cdot \Vert _{i_1, \ldots , i_d, \infty }\) denotes the \(L^\infty \)-norm with respect to the random variables \(X_{i_1}, \ldots , X_{i_d}\) and \(X_{i_1}', \ldots , X_{i_d}'\). For instance, for \(i \ne j\),

$$\begin{aligned} {\mathfrak {h}}_{ij} f(X) = \Vert f(X) - T_if(X) - T_jf(X) + T_{ij}f(X) \Vert _{i,j,\infty }. \end{aligned}$$

Using the definition (4), we define tensors of d-th order differences as follows:

$$\begin{aligned} \big ({\mathfrak {h}}^{(d)}f(X)\big )_{i_1 \ldots i_d} = {\left\{ \begin{array}{ll} {\mathfrak {h}}_{i_1 \ldots i_d}f(X), &{} \text {if }i_1, \ldots , i_d\text { are distinct}, \\ 0, &{} \text {else}. \end{array}\right. } \end{aligned}$$

Whenever no confusion is possible, we omit writing the random vector X, i. e. we freely write f instead of f(X) and \({\mathfrak {h}}^{(d)}f\) instead of \({\mathfrak {h}}^{(d)}f(X)\).
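Analogously, the higher order differences can be evaluated numerically. The following sketch (assuming numpy; the helper is illustrative and restricted to a finite alphabet) computes \({\mathfrak {h}}_{ij}f(x)\) by brute force and checks it against the explicit value \(4{|a_{ij}|}\) for a quadratic form on \(\{-1,+1\}^n\).

```python
import numpy as np
from itertools import product

def h2_entry(f, x, i, j, values):
    """h_{ij} f(x) = || f - T_i f - T_j f + T_{ij} f ||_{i,j,infty} on a finite alphabet."""
    x = np.asarray(x, dtype=float)
    best = 0.0
    for yi, yip, yj, yjp in product(values, repeat=4):
        def ev(vi, vj):
            z = x.copy()
            z[i] = vi
            z[j] = vj
            return f(z)
        # f - T_i f - T_j f + T_{ij} f with X_i in {yi, yip} and X_j in {yj, yjp}
        val = ev(yi, yj) - ev(yip, yj) - ev(yi, yjp) + ev(yip, yjp)
        best = max(best, abs(val))
    return best

# For f(x) = sum_{i<j} a_ij x_i x_j on {-1,+1}^n one gets h_{ij} f = 4 |a_ij|.
rng = np.random.default_rng(2)
n = 4
A = np.triu(rng.standard_normal((n, n)), k=1)
f = lambda x: x @ A @ x
x = rng.choice([-1.0, 1.0], size=n)
print(h2_entry(f, x, 0, 1, [-1.0, 1.0]), 4 * abs(A[0, 1]))
```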

Our first main theorem is a concentration inequality for general, bounded functionals of independent random variables \(X_1, \ldots , X_n\).

Theorem 1

Let X be a random vector with independent components, \(f: {\mathcal {Y}} \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) a measurable function satisfying \(f = f(X) \in L^\infty ({{\,\mathrm{\mathbb {P}}\,}})\), \(d \in {{\,\mathrm{\mathbb {N}}\,}}\) and define \(C :=217d^2\). We have for any \(t \ge 0\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge t \right) \le 2 \exp \left( -\frac{1}{C} \min _{k = 1,\ldots ,d-1} \left( \frac{t}{{\Vert {\mathfrak {h}}^{(k)} f\Vert }_{\text {op},1}} \right) ^{2/k} \wedge \left( \frac{t}{{\Vert {\mathfrak {h}}^{(d)} f\Vert }_{\text {op},\infty }}\right) ^{2/d} \right) . \end{aligned}$$
(5)

For the sake of illustration, let us consider the case of \(d=2\). Assuming that \(X_1, \ldots , X_n\) satisfy \({\mathbb {E}}X_i = 0\), \({\mathbb {E}} X_i^2 = 1\) and \({|X_i|} \le M\) a.s., let f(X) be the quadratic form \(f(X) = \sum _{i<j} a_{ij} X_i X_j = X^TAX\). Here, \(a_{ij} \in {\mathbb {R}}\) for all \(i<j\), and A is the symmetric matrix with zero diagonal and entries \(A_{ij} = a_{ij}/2\) if \(i < j\). In this case, it is easy to see that \({\Vert {\mathfrak {h}} f\Vert }_{\text {op},1} \le {\Vert {\mathfrak {h}} f\Vert }_{\text {op},2} \le 4M {|A|}_{\text {HS}}\) and \({\Vert {\mathfrak {h}}^{(2)} f\Vert }_{\text {op},\infty } \le 8M^2 {|A^\text {abs}|}_{\text {op}}\), where \(A^\text {abs}\) is the matrix given by \((A^\text {abs})_{ij} = {|a_{ij}|}\). As a result,

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge t \right) \le 2 \exp \left( -\frac{1}{CM^2} \min \left( \frac{t^2}{{|A|}_{\text {HS}}^2}, \frac{t}{{|A^\text {abs}|}_{\text {op}}} \right) \right) . \end{aligned}$$

This is a version of the famous Hanson–Wright inequality. For the various forms of the Hanson–Wright inequality we refer to [2, 4, 30, 32, 47, 55, 57].
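For orientation, the following sketch (assuming numpy; the constant C is only illustrative and not the sharp one resulting from the discussion above) evaluates the right-hand side of such a Hanson–Wright-type bound for a given coefficient matrix.

```python
import numpy as np

def hanson_wright_bound(A, M, t, C=868.0):
    """Evaluate 2 exp(-min(t^2/|A|_HS^2, t/|A^abs|_op) / (C M^2)); C is illustrative."""
    A = np.asarray(A, dtype=float)
    A_hs = np.sqrt(np.sum(A ** 2))
    A_abs_op = np.linalg.svd(np.abs(A), compute_uv=False)[0]
    return 2 * np.exp(-min(t ** 2 / A_hs ** 2, t / A_abs_op) / (C * M ** 2))

rng = np.random.default_rng(3)
n = 100
A = rng.standard_normal((n, n)) / n
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
for t in [1.0, 5.0, 20.0]:
    print(t, hanson_wright_bound(A, M=1.0, t=t))
```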

Note that by a modification of our proofs (using arguments especially adapted to polynomials), it is possible to replace \({|A^\text {abs}|}_{\text {op}}\) by \({|A|}_{\text {op}}\), thus avoiding the drawback of switching to a matrix with a possibly larger operator norm. See Sects. 2.1 and 2.4 for details. On the other hand, Theorem 1 allows for any function f, not just quadratic forms, and the case of \(d=2\) can in this sense be considered as a generalization of the Hanson–Wright inequality.

For a certain class of weakly dependent random variables \(X_1, \ldots , X_n\), we can prove similar estimates as in Theorem 1. To this end, we introduce another difference operator, which is more familiar in the context of logarithmic Sobolev inequalities for Markov chains as developed in [26]. Assume that \({\mathcal {Y}} = \otimes _{i = 1}^n {\mathcal {X}}_i\) for some finite sets \({\mathcal {X}}_1, \ldots , {\mathcal {X}}_n\), equipped with a probability measure \(\mu \) and let \(\mu (\cdot \mid x_{i^c})\) denote the conditional measure (interpreted as a measure on \({\mathcal {X}}_i\)) and \(\mu _{i^c}\) the marginal on \(\otimes _{j \ne i} {\mathcal {X}}_j\). Finally, set

$$\begin{aligned} {|\mathfrak {d}f|}^2(x)&:=\sum _{i = 1}^n \big (\mathfrak {d}_i f(x)\big )^2 :=\sum _{i = 1}^n \text {Var}_{\mu \big (\cdot \mid x_{i^c}\big )}\big (f\big (x_{i^c}, \cdot \big )\big ) \\&=\sum _{i = 1}^n \frac{1}{2} \iint \big (f\big (x_{i^c},y\big ) - f\big (x_{i^c},y'\big )\big )^2 {\text {d}}\mu \big (y \mid x_{i^c}\big ) {\text {d}}\mu \big (y' \mid x_{i^c}\big ). \end{aligned}$$

This difference operator appears naturally in the Dirichlet form associated with the Glauber dynamics of \(\mu \), given by

$$\begin{aligned} {\mathcal {E}}(f,f) :=\sum _{i = 1}^n \int {{\,\mathrm{Var}\,}}_{\mu (\cdot \mid x_{i^c})}(f(x_{i^c},\cdot )) {\text {d}}\mu _{i^c}(x_{i^c}) = \int {|\mathfrak {d}f|}^2 {\text {d}}\mu . \end{aligned}$$
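For a finite alphabet, \({|\mathfrak {d}f|}^2\) can be evaluated directly from an unnormalized weight function, since the conditional law \(\mu (\cdot \mid x_{i^c})\) only requires normalizing over the i-th coordinate. The following sketch (assuming numpy; the Ising-type weight is purely illustrative) does this for a measure on \(\{-1,+1\}^n\).

```python
import numpy as np

def d_squared(f, w, x, i, values=(-1.0, 1.0)):
    """d_i f(x)^2 = Var_{mu(.|x_{i^c})}(f(x_{i^c}, .)) for the measure with weight w."""
    x = np.asarray(x, dtype=float)
    fs, ws = [], []
    for y in values:
        z = x.copy()
        z[i] = y
        fs.append(f(z))
        ws.append(w(z))
    fs, ws = np.array(fs), np.array(ws)
    p = ws / ws.sum()                       # conditional law of X_i given x_{i^c}
    mean = p @ fs
    return float(p @ (fs - mean) ** 2)      # conditional variance in coordinate i

rng = np.random.default_rng(4)
n = 6
J = 0.1 * rng.standard_normal((n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
w = lambda x: np.exp(0.5 * x @ J @ x)       # Ising-type weight (illustrative)
f = lambda x: np.sum(x)
x = rng.choice([-1.0, 1.0], size=n)
print(sum(d_squared(f, w, x, i) for i in range(n)))   # |d f|^2(x)
```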

In the next theorem, we require a \(\mathfrak {d}\)–LSI for the underlying random variables \(X_1, \ldots , X_n\). A number of models which satisfy this assumption will be discussed below.

Theorem 2

Let \(X = (X_1, \ldots , X_n)\) be a random vector satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) and \(f: {\mathcal {Y}} \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) a measurable function with \(f = f(X) \in L^\infty ({{\,\mathrm{\mathbb {P}}\,}})\). With the constant \(C = 15 \sigma ^2 d^2 > 0\) we have for any \(t \ge 0\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge t \right) \le 2 \exp \left( -\frac{1}{C} \min _{k = 1,\ldots ,d-1} \left( \frac{t}{{\Vert {\mathfrak {h}}^{(k)} f\Vert }_{\text {op},1}} \right) ^{2/k} \wedge \left( \frac{t}{{\Vert {\mathfrak {h}}^{(d)} f\Vert }_{\text {op},\infty }}\right) ^{2/d} \right) . \end{aligned}$$
(6)

Again, if \(d=2\), assuming that \({\mathbb {E}}X_i = 0\), \({\mathbb {E}} X_i^2 = 1\), \({|X_i|} \le M\) a.s. and \({\mathbb {E}}X_iX_j = 0\) if \(i \ne j\), we arrive at a Hanson–Wright type inequality, this time including dependent situations. Similar results still hold if we remove the uncorrelatedness condition.

Let us discuss the \({\mathfrak {d}}\)–LSI condition in more detail. First, any collection of independent random variables \(X_1, \ldots , X_n\) with finitely many values satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) with \(\sigma ^2\) depending on the minimal nonzero probability of the \(X_i\) (cf. Proposition 6). In this situation, Theorems 1 and 2 only differ by constants.

However, the \(\mathfrak {d}\)–LSI condition also gives rise to numerous models of dependent random variables as in [28, Proposition 1.1] (the Ising model) or [48, Theorem 3.1] (various different models). Let us recall some of them. The Ising model is the probability measure on \(\{\pm 1\}^n\) defined by normalizing \(\pi (\sigma ) = \exp ( \frac{1}{2} \sum _{i,j} J_{ij} \sigma _i \sigma _j + \sum _{i = 1}^n h_i \sigma _i )\) for a symmetric matrix \(J = (J_{ij})\) with zero diagonal and some \(h \in {\mathbb {R}}^n\). In [28, Proposition 1.1], we have shown that if \(\max _{i = 1,\ldots ,n} \sum _{j = 1}^n {|J_{ij}|} \le 1-\alpha \) and \(\max _{i \in [n]} {|h_i|} \le {\widetilde{\alpha }}\), the Ising model satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) with \(\sigma ^2\) depending on \(\alpha \) and \({\widetilde{\alpha }}\) only. For the special case of \(h = 0\) and \(J_{ij} = \beta /n\) for all \(i \ne j\), we obtain the Curie–Weiss model. Here, the two conditions required above reduce to \(\beta < 1\).
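The weak-dependence condition above is easy to check numerically for a given coupling matrix. A minimal sketch (assuming numpy; the Curie–Weiss coupling below is the standard example) is the following.

```python
import numpy as np

def satisfies_ising_condition(J, alpha):
    """Check the coupling condition max_i sum_j |J_ij| <= 1 - alpha."""
    J = np.asarray(J, dtype=float)
    return np.max(np.sum(np.abs(J), axis=1)) <= 1 - alpha

# Curie-Weiss coupling J_ij = beta / n for i != j
n, beta = 100, 0.8
J = beta / n * (np.ones((n, n)) - np.eye(n))
print(satisfies_ising_condition(J, alpha=0.1))   # True, since beta = 0.8 < 1
```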

Another simple model in which a \(\mathfrak {d}\)–LSI holds is the random coloring model. If \(G = (V,E)\) is a finite graph and \(C = \{1, \ldots , k\}\) is a set of colors, we denote by \(\Omega _0 \subset C^V\) the set of all proper colorings, i. e. the set of all \(\omega \in C^V\) such that \(\{v,w\} \in E \Rightarrow \omega _v \ne \omega _w\). In [48, Theorem 3.1], we have shown that the uniform distribution on \(\Omega _0\) satisfies a \(\mathfrak {d}\)–LSI if the maximum degree \(\Delta \) is uniformly bounded and \(k \ge 2\Delta +1\) (strictly speaking, we consider sequences of graphs here). In [48, Theorem 3.1], we moreover prove \(\mathfrak {d}\)–LSIs for the (vertex-weighted) exponential random graph model and the hard-core model. We will further discuss the exponential random graph model in Sect. 2.4.

The common feature in all these models is that the dependencies which appear can be controlled (e. g. by means of a coupling matrix which measures the interactions between the particles of the system under consideration, cf. [28, Theorem 4.2]) in such a way that the model is not “too far” from a product measure. For instance, in the Curie–Weiss model, this just translates to \(\beta < 1\).

As a final remark, we discuss the LSI property with respect to various difference operators in Sect. 5. In particular, we show that the restriction to finite spaces which is implicit in Theorem 2 is natural since the \(\mathfrak {d}\text {-LSI}\) property requires the underlying space to be finite. By contrast, we prove that any set of independent random variables \(X_1, \ldots , X_n\) satisfies an \({\mathfrak {h}}\)–LSI(1). However, it seems that it is not possible to use the entropy method based on \({\mathfrak {h}}\)–LSIs.

The upper bounds in Theorems 1 and 2 admit a “uniform version”, i. e. we can prove deviation inequalities for suprema of functions, in the following sense. Let \({\mathcal {F}}\) be a family of uniformly bounded, real-valued, measurable functions and set

$$\begin{aligned} g(X) :=g_{{\mathcal {F}}}(X) :=\sup _{f \in {\mathcal {F}}} {|f(X)|}. \end{aligned}$$
(7)

For any \(d \in {{\,\mathrm{\mathbb {N}}\,}}\) and \(j = 1,\ldots , d\) let \(W_j = W_j(X) :=\sup _{f \in {\mathcal {F}}} {|{\mathfrak {h}}^{(j)} f(X)|}_{\text {op}}\).

Theorem 3

Assume that either \(X_1, \ldots , X_n\) are independent or X satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) and let \(g = g(X)\) be as in (7). With the same constant C as in Theorems 1 or 2, respectively, we have for any \(t \ge 0\) the deviation inequality

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}(g - {{\,\mathrm{\mathbb {E}}\,}}g \ge t) \le 2 \exp \left( - \frac{1}{C} \min \left( \min _{j = 1,\ldots , d-1} \left( \frac{t}{{{\,\mathrm{\mathbb {E}}\,}}W_j} \right) ^{2/j}, \left( \frac{t}{{\Vert W_d\Vert }_\infty }\right) ^{2/d} \right) \right) . \end{aligned}$$

As mentioned before, Theorem 3 yields bounds for the upper tail only. The background is that the entropy method has certain limitations when it is applied to suprema of functions, cf. also Proposition 1 or Theorem 4 below. Roughly sketched, the reason is that when evaluating difference operators of suprema, the positive part allows us to work with a maximizer of the terms involved which does not depend on the coordinate; without the positive part, this is no longer possible. See in particular the proof of Theorem 4, where we provide some further details.

Functionals of the form (7) have been considered in various works, starting from the first results in [52, Theorem 1.4], and continued in [41, Theorem 3], [46, Théorème 1.1] and [19, Theorem 2.3] in the special case of

$$\begin{aligned} g(X) :=\sup _{f \in {\mathcal {F}}} \Big |\sum _{j = 1}^n f(X_j) \Big |. \end{aligned}$$
(8)

Further research has been done in [34, 49, Sect. 3] and more recently [39, Proposition 5.4]. In these works, Bennett-type inequalities have been proven for general independent random variables. Furthermore, [16, Theorem 10] treats the case \(g(X) = \sup _{t \in {\mathcal {T}}} \sum _{i = 1}^n t_i X_i\) for Rademacher random variables \(X_i\) and a compact set of vectors \({\mathcal {T}} \subset {{\,\mathrm{\mathbb {R}}\,}}^n\). As a byproduct of our method, we prove a deviation inequality for g which can be regarded as a uniform bounded differences inequality.

Proposition 1

Assume that \(X = (X_1, \ldots , X_n)\) satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\), let \(g = g(X)\) be as in (8), and let c(f) be such that \({|f(x) - f(y)|} \le c(f)\) for all x, y. For any \(t \ge 0\) we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\Big (g \ge {{\,\mathrm{\mathbb {E}}\,}}g + t\Big ) \le 2 \exp \left( - \frac{t^2}{15 \sigma ^2 n \sup _{f \in {\mathcal {F}}} c(f)^2} \right) . \end{aligned}$$

Let us put Proposition 1 into context. In the above-mentioned works, the authors derive Bennett-type inequalities for independent random variables \(X_1, \ldots , X_n\), whereas in our case the concentration inequalities have sub-Gaussian tails. It might be compared to the sub-Gaussian tail estimates for Bernoulli processes, see e. g. [53, Theorem 5.3.2]. However, the \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) property is both more and less general. On the one hand, it is possible to include possibly dependent random vectors, but on the other hand for independent random variables it is only applicable if the \(X_i\) take finitely many values.

1.2 Outline

In Sect. 2, we present a number of applications and refinements of our main results. Section 3 contains the proofs of our main theorems. The proofs of the results from Sect. 2 are deferred to Sect. 4. We close the paper by discussing different forms of logarithmic Sobolev inequalities with respect to various difference operators in Sect. 5.

2 Applications

In the sequel, we consider various situations in which our results can be applied. Some of them can be regarded as sharpenings of our main theorems for functions which have a special structure.

2.1 Uniform Bounds

If the functions under consideration are of polynomial type, we may somewhat refine the results from the previous section. Here, we focus on uniform bounds as discussed in Theorem 3.

Let \({\mathcal {I}}_{n,d}\) denote the family of subsets of [n] with d elements, fix a Banach space \(({\mathcal {B}}, {\Vert \cdot \Vert })\) with its dual space \(({\mathcal {B}}^*, {\Vert \cdot \Vert }_*)\), a compact subset \({\mathcal {T}} \subset {\mathcal {B}}^{{\mathcal {I}}_{n,d}}\) and let \({\mathcal {B}}_1^*\) be the 1-ball in \({\mathcal {B}}^*\) with respect to \({\Vert \cdot \Vert }_*\). Let \(X = (X_1, \ldots , X_n)\) be a random vector with support in \([a,b]^n\) for some real numbers \(a < b\) and define

$$\begin{aligned} f(X) :=f_{{\mathcal {T}}}(X) :=\sup _{t \in {\mathcal {T}}} \Big \Vert \sum _{I \in {\mathcal {I}}_{n,d}} X_I t_I \Big \Vert , \end{aligned}$$
(9)

where \(X_I :=\prod _{i \in I} X_i\). For any \(k \in [d]\) we let

$$\begin{aligned} \begin{aligned} W_k&:=\sup _{t \in {\mathcal {T}}} \sup _{v^* \in {\mathcal {B}}_1^*} \sup _{\begin{array}{c} \alpha ^1, \ldots , \alpha ^k \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|\alpha ^i|} \le 1 \end{array}} v^*\left( \sum _{\begin{array}{c} i_1, \ldots , i_k \\ \text {distinct} \end{array}} \alpha _{i_1}^1 \cdots \alpha _{i_k}^k \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-k} \\ i_1, \ldots , i_k \notin I \end{array}} X_I t_{I \cup \{i_1, \ldots , i_k\} } \right) \\&= \sup _{t \in {\mathcal {T}}} \sup _{\begin{array}{c} \alpha ^1, \ldots , \alpha ^k \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|\alpha ^i|} \le 1 \end{array}} \left\Vert \sum _{\begin{array}{c} i_1, \ldots , i_{k} \\ \text {distinct} \end{array}} \alpha _{i_1}^{1} \cdots \alpha _{i_k}^k \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-k} \\ i_1, \ldots , i_k \notin I \end{array}} X_I t_{I \cup \{i_1,\ldots , i_k\}} \right\Vert , \end{aligned} \end{aligned}$$
(10)

where for \(k = d\) we use the convention \({\mathcal {I}}_{n,0} = \{ \emptyset \}\) and \(X_{\emptyset } :=1\).

One can interpret the quantities \(W_k\) as follows: If \(f_t(x) = \sum _{I \in {\mathcal {I}}_{n,d}} x_I t_I\) is the corresponding polynomial in n variables, and \(\nabla ^{(k)} f_t(x)\) is the k-tensor of all partial derivatives of order k, then \(W_k = \sup _{t \in {\mathcal {T}}} {|\nabla ^{(k)} f_t(X)|}_{\text {op}}\). In this sense, we are considering the same quantities as in Theorem 3 but replace the difference operator \({\mathfrak {h}}\) by formal derivatives of the polynomial under consideration.

Furthermore, the concentration inequalities are phrased with the help of the quantities

$$\begin{aligned} {\widetilde{W}}_k&:=\sup _{\begin{array}{c} \alpha ^1, \ldots , \alpha ^k \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|\alpha ^i|} \le 1 \end{array}} \sum _{\begin{array}{c} i_1, \ldots , i_{k} \\ \text {distinct} \end{array}} \alpha _{i_1}^1 \cdots \alpha _{i_k}^k \sup _{t \in {\mathcal {T}}} \sup _{v^* \in {\mathcal {B}}_1^*} v^*\left( \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-k} \\ i_1, \ldots , i_k \notin I \end{array}} X_I t_{I \cup \{i_1, \ldots , i_k\} } \right) \\&= \sup _{\begin{array}{c} \alpha ^1, \ldots , \alpha ^k \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|\alpha ^i|} \le 1 \end{array}} \sum _{\begin{array}{c} i_1, \ldots , i_{k} \\ \text {distinct} \end{array}} \alpha _{i_1}^1 \cdots \alpha _{i_k}^k \sup _{t \in {\mathcal {T}}} \left\Vert \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-k} \\ i_1, \ldots , i_k \notin I \end{array}} X_I t_{I \cup \{i_1, \ldots , i_k\}} \right\Vert . \end{aligned}$$

Clearly \({\widetilde{W}}_k \ge W_k\) holds for all \(k \in [d]\).

Concentration properties for functionals as in (9) have been studied for independent Rademacher variables \(X_1, \ldots , X_n\) (i. e. \({{\,\mathrm{\mathbb {P}}\,}}(X_i = +1) = {{\,\mathrm{\mathbb {P}}\,}}(X_i = -1) = 1/2\)) and \({\mathcal {B}} = {\mathbb {R}}\) in [16, Theorem 14] for all \(d \ge 2\), and under certain technical assumptions in [2]. We prove deviation inequalities in the weakly dependent setting, and afterwards discuss how these compare to the particular result in [16]. It is easily possible to derive a similar result for functions of independent random variables (in the spirit of Theorem 1). As the corresponding proof is easily done by generalizing the proof of [16, Theorem 14], we omit it.

Theorem 4

Let \(X = (X_1, \ldots , X_n)\) be a random vector in \({{\,\mathrm{\mathbb {R}}\,}}^n\) with support in \([a,b]^n\) satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). For \(f = f(X)\) as in (9) and all \(p \ge 2\) we have

$$\begin{aligned} {\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p&\le \sum _{j = 1}^d \big (2\sigma ^2 (b-a)^2 (p-3/2)\big )^{j/2} {{\,\mathrm{\mathbb {E}}\,}}W_j, \end{aligned}$$
(11)
$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p&\le \sum _{j = 1}^d \big (2(b-a)^2 p\big )^{j/2} {{\,\mathrm{\mathbb {E}}\,}}{\widetilde{W}}_j. \end{aligned}$$
(12)

Consequently, for any \(t \ge 0\)

$$\begin{aligned} \begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( f - {{\,\mathrm{\mathbb {E}}\,}}f \ge t \right)&\le 2 \exp \left( - \frac{1}{2\sigma ^2(b-a)^2} \min _{k=1,\ldots ,d} \left( \frac{t}{de {{\,\mathrm{\mathbb {E}}\,}}W_k} \right) ^{2/k} \right) \\&\le 2 \exp \left( -\frac{1}{2e^2\sigma ^2(b-a)^2d^2} \min _{k=1,\ldots ,d} \left( \frac{t}{{{\,\mathrm{\mathbb {E}}\,}}W_k} \right) ^{2/k} \right) \end{aligned}, \end{aligned}$$
(13)

and the same concentration inequalities hold with \({{\,\mathrm{\mathbb {E}}\,}}W_k\) replaced by \({{\,\mathrm{\mathbb {E}}\,}}{\widetilde{W}}_k\).

Note that independent Rademacher random variables satisfy a \(\mathfrak {d}\text {-LSI}(1)\) (see e. g. [26, Example 3.1] or [29, Theorem 3]). Therefore, we get back [16, Theorem 14] from Theorem 4 (with slightly different constants). However, Theorem 4 moreover includes many models with dependencies like those discussed in the introduction. Hence, it may be considered as an extension of [16, Theorem 14] to dependent situations and moreover to coefficients from any Banach space \({\mathcal {B}}\). For instance, we may consider an Ising chaos as a natural generalization of a Rademacher chaos to a dependent situation. In this case, Theorem 4 yields that we still obtain basically the same concentration properties if the dependencies are sufficiently weak (which is guaranteed by the conditions outlined in the introduction).

To illustrate our results further, let us consider the case of \(d=2\) separately. Here we write

$$\begin{aligned} T_1&:={{\,\mathrm{\mathbb {E}}\,}}W_1 = {{\,\mathrm{\mathbb {E}}\,}}\sup _{t \in {\mathcal {T}}} \sup _{v^* \in {\mathcal {B}}^*_1} \left( \sum _{i = 1}^n \left( \sum _{j = 1}^n X_j v^*(t_{ij}) \right) ^2 \right) ^{1/2} \\ T_2&:={{\,\mathrm{\mathbb {E}}\,}}W_2 = \sup _{t \in {\mathcal {T}}} \sup _{v^* \in {\mathcal {B}}_1^*} {\Vert (v^*(t_{ij}))_{i,j}\Vert }_{\text {op}}. \end{aligned}$$

The following corollary follows directly from Theorem 4.

Corollary 1

Assume that \(X = (X_1, \ldots , X_n)\) satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) and is supported in \([a,b]^n\) and let \(f_{{\mathcal {T}}} = f_{{\mathcal {T}}}(X)\) be as in (9) with \(d = 2\). We have for all \(t \ge 0\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( f_{{\mathcal {T}}}(X) - {{\,\mathrm{\mathbb {E}}\,}}f_{{\mathcal {T}}}(X) \ge t \right) \le 2 \exp \left( - \frac{1}{60(b-a)^2 \sigma ^2} \min \left( \frac{t^2}{T_1^2}, \frac{t}{T_2} \right) \right) . \end{aligned}$$

For the case of independent Rademacher variables, this recovers the upper tail in a famous result by Talagrand [52, Theorem 1.2] on concentration properties of quadratic forms in Banach spaces, which has also been done in [16]. Note that for \({\mathcal {B}} = {\mathbb {R}}\), we have

$$\begin{aligned} T_1 = {{\,\mathrm{\mathbb {E}}\,}}\sup _{t \in {\mathcal {T}}} \left( \sum _{i = 1}^n \left( \sum _{j = 1}^n t_{ij} X_j \right) ^2 \right) ^{1/2},\qquad T_2 = \sup _{t \in {\mathcal {T}}} {|T|}_{\text {op}}, \end{aligned}$$

where T is the symmetric matrix with zero diagonal and entries \(T_{ij} = t_{ij}\) if \(i<j\). If \({\mathcal {T}}\) consists of a single element only, we have \(T_1 \le {|T|}_{\text {HS}}\). Hence, Corollary 1 can be regarded as a generalized Hanson–Wright inequality.
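For a finite class \({\mathcal {T}}\) and \({\mathcal {B}} = {\mathbb {R}}\), the quantities \(T_1\) and \(T_2\) can be estimated directly; the following sketch (assuming numpy; \(T_1\) is approximated by Monte Carlo, and the matrices below are purely illustrative) does so for Rademacher variables.

```python
import numpy as np

def T1_T2(tensors, sample_X, n_mc=10_000, seed=5):
    """tensors: finite list of symmetric n x n matrices with zero diagonal (the class T)."""
    rng = np.random.default_rng(seed)
    T2 = max(np.linalg.svd(t, compute_uv=False)[0] for t in tensors)
    total = 0.0
    for _ in range(n_mc):
        X = sample_X(rng)
        total += max(np.linalg.norm(t @ X) for t in tensors)   # sup_t (sum_i (sum_j t_ij X_j)^2)^{1/2}
    return total / n_mc, T2

n = 20
rng = np.random.default_rng(6)
tensors = []
for _ in range(3):
    t = rng.standard_normal((n, n))
    t = (t + t.T) / 2
    np.fill_diagonal(t, 0.0)
    tensors.append(t / n)
T1, T2 = T1_T2(tensors, sample_X=lambda r: r.choice([-1.0, 1.0], size=n))
print(T1, T2)
```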

2.2 The Boolean Hypercube

The case of independent Rademacher random variables above can be interpreted in terms of quantities from Boolean analysis. Recall that any function \(f: \{-1,+1\}^n \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) can be decomposed using the orthonormal Fourier–Walsh basis given by \((x_S)_{S \subseteq [n]}\) for \(x_S :=\prod _{i \in S} x_i\). More precisely, we have

$$\begin{aligned} f(x) = \sum _{S \subset [n]} {\hat{f}}_S x_S = \sum _{j \in [n]} \sum _{S \subseteq [n] : {|S|} = j} {\hat{f}}_S x_S, \end{aligned}$$

where the \(({\hat{f}}_S)_{S \subset [n]}\) are given by \({\hat{f}}_S = \int x_S f {\text {d}}\mu \) and are called the Fourier coefficients of f. For any \(j \in [n]\) we define the Fourier weight of order j as \(W_j(f) :=\sum _{S \subseteq [n] : {|S|} = j} {\hat{f}}_S^2\). It is clear that \({\Vert f\Vert }_2^2 = \sum _{j = 0}^n W_j(f)\). The following multilevel concentration inequality can now be easily deduced.

Proposition 2

Let \(X_1, \ldots , X_n\) be independent Rademacher random variables and let \(f: \{-1,+1\}^n \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) be a function given in the Fourier–Walsh basis as \(f(x) = \sum _{j = 0}^d \sum _{S \subseteq [n] : {|S|} = j} {\hat{f}}_S x_S\) for some \(d \in {{\,\mathrm{\mathbb {N}}\,}}, d \le n\). For any \(t > 0\) we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge t) \le \exp \left( 1 - \min _{j = 1,\ldots ,d} \left( \frac{t}{de W_j(f)^{1/2}} \right) ^{2/j} \right) . \end{aligned}$$

In other words, the event \({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \le de \max _{j =1,\ldots , d} (W_j(f) t^j)^{1/2}\) holds with probability at least \(1 - \exp (1-t)\).

The literature on Boolean functions is vast, and a modern overview is given in [44]. Particularly for concentration results we may highlight [5, Theorem 1.4] (which in particular holds for Boolean functions), which we discuss further and partially generalize to dependent models in Sect. 2.4. Proposition 2 may be of interest due to the direct use of quantities from Fourier analysis. Finally, we should add that while many concentration results for Boolean functions like [5, Theorem 1.4] or also Proposition 2 are valid for functions whose Fourier–Walsh decomposition stops at some order d, Theorem 1 or Theorem 2 work for functions with Fourier–Walsh decomposition possibly up to order n.
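The Fourier weights entering Proposition 2 can be computed by brute force for small n. The following sketch (assuming numpy; feasible only for moderate n since it enumerates all \(2^n\) points and all subsets) illustrates this for a degree-2 function.

```python
import numpy as np
from itertools import combinations, product

def fourier_weights(f, n):
    """Fourier weights W_j(f) of f: {-1,+1}^n -> R via brute force (small n only)."""
    points = np.array(list(product([-1.0, 1.0], repeat=n)))
    fvals = np.array([f(x) for x in points])
    W = np.zeros(n + 1)
    for j in range(n + 1):
        for S in combinations(range(n), j):
            chi = np.prod(points[:, list(S)], axis=1) if S else np.ones(len(points))
            fhat = np.mean(fvals * chi)        # \hat f_S = \int x_S f d\mu under the uniform measure
            W[j] += fhat**2
    return W

f = lambda x: x[0] * x[1] + 0.5 * x[2]         # degree-2 example
W = fourier_weights(f, n=4)
print(W, W.sum())                              # W_1 = 0.25, W_2 = 1.0, sum = ||f||_2^2
```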

2.3 Concentration Properties of U-Statistics

Another application of Theorems 1 and 2 is given by concentration properties of so-called U-statistics which frequently arise in statistical theory. We refer to [24] for an excellent monograph. More recently, concentration inequalities for U-statistics have been considered in [1, 5, Sect. 3.1.2] and [10, Corollary 1.3].

Let \({\mathcal {Y}} = {\mathcal {X}}^n\) and assume that \(X_1, \ldots , X_n\) are either independent random variables, or the vector \(X = (X_1, \ldots , X_n)\) satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). Let \(h: {\mathcal {X}}^d \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) be a measurable, symmetric function with \(h(X_{i_1}, \ldots , X_{i_d}) \in L^\infty ({{\,\mathrm{\mathbb {P}}\,}})\) for any \(i_1, \ldots , i_d\), and define \(B :=\max _{i_1 \ne \ldots \ne i_d} {\Vert h(X_{i_1}, \ldots , X_{i_d})\Vert }_{L^\infty ({{\,\mathrm{\mathbb {P}}\,}})}\). We are interested in the concentration properties of the U-statistic with kernel h, i. e. of

$$\begin{aligned} f(X) = \sum _{i_1 \ne \ldots \ne i_d} h(X_{i_1}, \ldots , X_{i_d}). \end{aligned}$$
(14)
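As a small illustration (assuming numpy; a brute-force \(O(n^d)\) evaluation, not an efficient implementation), the following sketch evaluates the U-statistic (14) for a bounded symmetric kernel with \(d = 2\) and records the normalization factor \(n^{1/2-d}\) appearing in (15) below.

```python
import numpy as np
from itertools import permutations

def u_statistic(h, X, d=2):
    """f(X) = sum over d-tuples of distinct indices of h(X_{i_1}, ..., X_{i_d})."""
    return sum(h(*X[list(idx)]) for idx in permutations(range(len(X)), d))

rng = np.random.default_rng(7)
n = 200
X = rng.uniform(-1.0, 1.0, size=n)
h = lambda x, y: np.cos(x - y)                 # bounded symmetric kernel, so B <= 1
f = u_statistic(h, X, d=2)
print(f, n ** (0.5 - 2))                       # the statistic and the n^{1/2-d} factor from (15)
```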

Proposition 3

Let \(X = (X_1, \ldots , X_n)\) be as above and \(f = f(X)\) be as in (14). There exists a constant \(C > 0\) (the same as in Theorems 1 and 2) such that for any \(t \ge 0\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\Big ( {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge Bt \Big ) \le 2 \exp \left( -\frac{1}{C} \min _{k=1,\ldots , d} \left( \frac{t}{\left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k n^{d-k/2}} \right) ^{2/k} \right) \end{aligned}$$

and for some \(C = C(d)\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\big (n^{1/2-d} {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge B t\big ) \le 2 \exp \left( - \frac{1}{4C} \min \left( t^2, n^{1-1/d} t^{2/d} \right) \right) . \end{aligned}$$
(15)

The normalization \(n^{1/2 - d}\) in (15) is of the right order for U-statistics generated by a non-degenerate kernel h, i. e. \(\text {Var}({{\,\mathrm{\mathbb {E}}\,}}_{X_1} h(X_1, \ldots , X_d)) > 0\), see [24, Remarks 4.2.5]. In the case of i.i.d. random variables \(X_1, \ldots , X_n\), the central limit theorem for non-degenerate U-statistics states that

$$\begin{aligned} \frac{1}{n^{d-1/2}} \sum _{i_1 \ne \ldots \ne i_d} \big ( h(X_{i_1}, \ldots , X_{i_d}) - {{\,\mathrm{\mathbb {E}}\,}}h(X_1, \ldots , X_d) \big ) \Rightarrow {\mathcal {N}}(0, d^2 {{\,\mathrm{Var}\,}}({{\,\mathrm{\mathbb {E}}\,}}_{X_1} h(X_1, \ldots , X_d))) \end{aligned}$$

whenever \({{\,\mathrm{\mathbb {E}}\,}}h(X_1, \ldots , X_d)^2 < \infty \). Actually, (15) shows that for \(t \le n^{1/2}\) we have sub-Gaussian tails for any finite \(n \in {{\,\mathrm{\mathbb {N}}\,}}\) for bounded kernels h.

Proposition 3 improves upon our old result [10, Corollary 1.3] by providing multilevel tail bounds, thus yielding much finer estimates than the exponential moment bound given in the earlier paper. Moreover, it does not only address independent random variables but also weakly dependent models. As compared to the results from [1] and [5, Sect. 3.1.2], Proposition 3 covers different types of measures, since in [1] independent random variables were considered, while in [5] a Sobolev-type inequality was required, which does not include the various discrete models for which a \({\mathfrak {d}}\)–LSI holds.

2.4 Polynomials and Subgraph Counts in Exponential Random Graph Models

Lastly, let us once again consider polynomial functions. The case of independent random variables has been treated in [5, Theorem 1.4] under more general conditions, so we omit it and concentrate on weakly dependent random variables.

Let \(f_d: {{\,\mathrm{\mathbb {R}}\,}}^n \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) be a multilinear (also called tetrahedral) polynomial of degree d, i. e. of the form

$$\begin{aligned} f_d(x) :=\sum _{k = 1}^d \sum _{1 \le i_1 \ne \ldots \ne i_k \le n} a^k_{i_1 \ldots i_k} x_{i_1} \cdots x_{i_k} \end{aligned}$$
(16)

for symmetric k-tensors \(a^k\) with vanishing diagonal. Here, a k-tensor \(a^k\) is called symmetric, if \(a^k_{i_1 \ldots i_k} = a^k_{\sigma (i_1) \ldots \sigma (i_k)}\) for any permutation \(\sigma \in {\mathcal {S}}_k\), and the (generalized) diagonal is defined as \(\Delta _k :=\{ (i_1, \ldots , i_k) : {|\{i_1, \ldots , i_k\}|} <k \}\). Denote by \(\nabla ^{(k)} f\) the k-tensor of all partial derivatives of order k of f.

For the next result, given some \(d \in {{\,\mathrm{\mathbb {N}}\,}}\), we recall a family of norms \({\Vert \cdot \Vert }_{{\mathcal {I}}}\) on the space of d-tensors for each partition \({\mathcal {I}} = \{ I_1, \ldots , I_k \}\) of \(\{1,\ldots ,d\}\). The family \({\Vert \cdot \Vert }_{{\mathcal {I}}}\) was first introduced in [35], where it was used to prove two-sided estimates for \(L^p\) norms of Gaussian chaos, and the definitions given below agree with the ones from [35] as well as [3] and [5]. For brevity, write \(P_d\) for the set of all partitions of \(\{1,\ldots , d\}\). For each \(l = 1,\ldots , k\) we denote by \(x^{(l)}\) a vector in \({{\,\mathrm{\mathbb {R}}\,}}^{n^{I_l}}\), and for a d-tensor \(A = (a_{i_1, \ldots , i_d})\) set

$$\begin{aligned} {\Vert A\Vert }_{{\mathcal {I}}} :=\sup \left\{ \sum _{i_1 \ldots i_d} a_{i_1 \ldots i_d} \prod _{l = 1}^k x^{(l)}_{i_{I_l}} : \sum _{i_{I_l}} (x^{(l)}_{i_{I_l}})^2 \le 1 \text { for all } l = 1, \ldots , k\right\} . \end{aligned}$$

We can regard the \({\Vert A\Vert }_{{\mathcal {I}}}\) as a family of operator-type norms. In particular, it is easy to see that \({\Vert A\Vert }_{\{1, \ldots , d\}} = {|A|}_{\text {HS}}\) and \({\Vert A\Vert }_{\{\{1\}, \ldots , \{d\}\}} = {|A|}_{\text {op}}\).

The following result has been proven in the context of Ising models (in the Dobrushin uniqueness regime) in [3], and can easily be extended to any vector X satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). By invoking the family of norms \({\Vert \cdot \Vert }_{{\mathcal {I}}}\), it provides a refinement of our general result for the special case of multilinear polynomials.

Theorem 5

Let X be a random vector supported in \([-1,+1]^n\) and satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\), and \(f_d = f_d(X)\) be as in (16). There exists a constant \(C> 0\) depending on d only such that for all \(t \ge 0\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( {|f_d - {{\,\mathrm{\mathbb {E}}\,}}f_d|} \ge t\right) \le 2 \exp \left( - \frac{1}{C} \min _{k = 1,\ldots ,d} \min _{{\mathcal {I}} \in P_k} \left( \frac{t}{\sigma ^{k} {\Vert {{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(k)} f_d\Vert }_{{\mathcal {I}}}} \right) ^{2/{|{\mathcal {I}}|}} \right) . \end{aligned}$$
(17)

For illustration, let us once again consider the case of \(d=2\). In the notation of (16), we take \(a^1 = 0\) and \(a^2 = A\), i. e. \(f_2(x) = x^T A x\) for a symmetric matrix A with vanishing diagonal. In this case, assuming the components of X to be centered (so the \(k=1\) term vanishes), Theorem 5 reads

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( {|f_2 - {{\,\mathrm{\mathbb {E}}\,}}f_2|} \ge t\right) \le 2 \exp \left( -\frac{1}{C} \min \left( \frac{t^2}{\sigma ^4{|A|}_{\text {HS}}^2}, \frac{t}{\sigma ^2{|A|}_{\text {op}}} \right) \right) , \end{aligned}$$

i. e. we obtain a Hanson–Wright inequality in this situation. For higher orders, we arrive at similar bounds. Altogether, for the class of multilinear polynomials, Theorem 5 yields finer bounds than Theorem 2 (by virtue of the large class of norms involved), though for \(d \ge 3\) explicit calculations of the norms involved can be difficult.

To point out one possible application, Theorem 5 can be used in the context of the exponential random graph model (ERGM). Let us briefly recall the definitions. Given \(s \in {{\,\mathrm{\mathbb {N}}\,}}\), real numbers \(\beta _1, \ldots , \beta _s\) and simple graphs \(G_1, \ldots , G_s\) (with \(G_1\) being a single edge by convention), the ERGM with parameter \(\mathbf {\beta } = (\beta _1, \ldots , \beta _s, G_1, \ldots , G_s)\) is a probability measure on the space of all graphs on \(n \in {{\,\mathrm{\mathbb {N}}\,}}\) vertices given by the weight function \(\exp \left( \sum _{i = 1}^s \beta _i n^{-{|V_i|}+2} N_{G_i}(x) \right) \), where \(N_{G_i}(x)\) is the number of copies of \(G_i\) in the graph x and \({|V_i|}\) is the number of vertices of \(G_i = (V_i, E_i)\). For details, see [23, 48]. One can think of the ERGM as an extension of the famous Erdös–Rényi model (which corresponds to the choice \(s=1\)) to account for dependencies between the edges.

By way of example we show concentration properties of the number of triangles \(T_3(X) = \sum _{\{e,f,g\} \in {\mathcal {T}}_3} X_e X_f X_g\) (where \({\mathcal {T}}_3\) denotes the set of all triples of edges forming a triangle). To formulate our results, we need to recall the function \(\Phi _{{{\,\mathrm{\varvec{\beta }}\,}}}(x) = \sum _{i=1}^s \beta _i {|E_i|}x^{{|E_i|}-1}\) which frequently appears in the discussion of the ERGM. Moreover, we set \({|{{\,\mathrm{\varvec{\beta }}\,}}|} :=({|\beta _1|}, \ldots , {|\beta _s|})\). In the following corollary, the condition \(\frac{1}{2} \Phi _{{|{{\,\mathrm{\varvec{\beta }}\,}}|}}'(1) < 1\) ensures weak dependence in the sense that a \(\mathfrak {d}\)–LSI holds. As outlined above, in comparison to earlier results like [48, Theorem 3.2], using Theorem 5 yields sharper tail estimates.
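For a given adjacency matrix, the triangle statistic can be computed directly via \(T_3 = \text {trace}(A^3)/6\). The following sketch (assuming numpy; an Erdös–Rényi sample is used as a stand-in, since sampling an ERGM requires Markov chain methods beyond the scope of this illustration) computes it.

```python
import numpy as np

def triangle_count(adj):
    """T_3 = number of triangles of a simple undirected graph, i.e. trace(A^3)/6."""
    adj = np.asarray(adj, dtype=float)
    return int(round(np.trace(adj @ adj @ adj) / 6))

rng = np.random.default_rng(9)
n, p = 30, 0.2
upper = np.triu((rng.random((n, n)) < p).astype(float), k=1)
adj = upper + upper.T                          # Erdos-Renyi G(n, p) sample as a stand-in
print(triangle_count(adj))
```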

Corollary 2

Let X be an exponential random graph model with parameter \({{\,\mathrm{\varvec{\beta }}\,}}=(\beta _1, \ldots , \beta _s, G_1, \ldots , G_s)\) such that \(\frac{1}{2} \Phi _{{|{{\,\mathrm{\varvec{\beta }}\,}}|}}'(1) < 1\). There is a constant \(C({{\,\mathrm{\varvec{\beta }}\,}})\) such that for all \(t \ge 0\)

$$\begin{aligned}&{{\,\mathrm{\mathbb {P}}\,}}\left( {|T_3 - {{\,\mathrm{\mathbb {E}}\,}}T_3|} \ge t \right) \\&\quad \le 2 \exp \left( - \frac{1}{C({{\,\mathrm{\varvec{\beta }}\,}})} \min \left( \frac{t^2}{\max (C_{S_2} n^4,C_E n^{3}, n^{3})}, \frac{t}{\max (\sqrt{2n},2C_E n)}, \frac{t^{2/3}}{2} \right) \right) . \end{aligned}$$

3 Concentration Inequalities Under Logarithmic Sobolev Inequalities: Proofs

In this section, we give the proofs of our main results. All of them work by first establishing a growth rate on the \(L^p\) norms of \(f - {{\,\mathrm{\mathbb {E}}\,}}f\) which will then be iterated. For technical reasons, we need to introduce some auxiliary difference operators which are closely related to \({\mathfrak {h}}\). For \(i \in [n]\) let

$$\begin{aligned} {\mathfrak {h}}^+_if(X)&= \Vert (f(X) - T_if(X))_+ \Vert _{X_i', \infty }, \qquad {\mathfrak {h}}^+f = ({\mathfrak {h}}^+_1f, \ldots , {\mathfrak {h}}^+_nf),\\ {\mathfrak {h}}^-_if(X)&= \Vert (f(X) - T_if(X))_- \Vert _{X_i', \infty }, \qquad {\mathfrak {h}}^-f = ({\mathfrak {h}}^-_1f, \ldots , {\mathfrak {h}}^-_nf), \end{aligned}$$

where \({\Vert f\Vert }_{X_i',\infty }\) shall denote the \(L^\infty \) norm with respect to \(X_i'\).

The \(L^p\) norm inequalities which form the core of our proofs can be found in [10, Theorem 2.3, Corollary 2.6] (building upon the earlier results in [16]). Note that as compared to [10], a different choice of normalization for \({\mathfrak {h}}^\pm \) leads to slightly different constants.

Theorem 6

If \(X_1, \ldots , X_n\) are independent random variables and \(f = f(X) \in L^\infty ({{\,\mathrm{\mathbb {P}}\,}})\), with the constant \(\kappa = \frac{\sqrt{e}}{2\,(\sqrt{e} - 1)}\), we have for any \(p \ge 2\),

$$\begin{aligned} \Vert (f - {\mathbb {E}}f)_+ \Vert _p \le (2\kappa p)^{1/2}\, \Vert {\mathfrak {h}}^+f \Vert _p \quad \text {and} \quad \Vert (f - {\mathbb {E}}f)_- \Vert _p \le (2\kappa p)^{1/2}\, \Vert {\mathfrak {h}}^-f \Vert _p. \end{aligned}$$

Consequently, this leads to

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le (8\kappa p)^{1/2} {\Vert {\mathfrak {h}}f\Vert }_p. \end{aligned}$$

Furthermore, we need an auxiliary statement relating differences of consecutive order. In [10], we have proven that \({|{\mathfrak {h}}{|{\mathfrak {h}}^{(d)}f|}_{\text {HS}}|} \le {|{\mathfrak {h}}^{(d+1)}f|}_{\text {HS}}\). Moreover, we explained that a similar estimate with the Hilbert–Schmidt norm replaced by the operator norm cannot be true. As we will see next, the key step in order to be able to invoke operator norms nevertheless is to work with \({\mathfrak {h}}^+\).

Here, we need the following simple but crucial observation: if A is a d-tensor, the supremum in the definition of \({|A|}_{\text {op}}\) is attained, and if A is a non-negative tensor (i. e. \(A_{i_1 \ldots i_d} \ge 0\) for all \(i_1, \ldots , i_d\)), the maximizing vectors \({\widetilde{v}}^1, \ldots , {\widetilde{v}}^d\) can be chosen to have nonnegative entries. Indeed, since \({\widetilde{v}}^1_{i_1} \cdots {\widetilde{v}}^d_{i_d} \le {|{\widetilde{v}}^1_{i_1}|} \cdots {|{\widetilde{v}}^d_{i_d}|}\) and \(A_{i_1 \ldots i_d} \ge 0\), replacing each \({\widetilde{v}}^j\) by its element-wise absolute value does not decrease \({\langle v^1 \cdots v^d, A\rangle }\) and preserves the constraint \({|{\widetilde{v}}^j|} \le 1\).

Lemma 1

For any \(d \ge 2\)

$$\begin{aligned} {|{\mathfrak {h}}^+ {|{\mathfrak {h}}^{(d-1)} f(X)|}_{\text {op}}|} \le {|{\mathfrak {h}}^{(d)} f(X)|}_{\text {op}}. \end{aligned}$$

Proof

We have

$$\begin{aligned}&{|{\mathfrak {h}}^+ {|{\mathfrak {h}}^{(d-1)} f|}_{\text {op}}|}^2 = \sum _{i= 1}^n \bigg \Vert \left( {|{\mathfrak {h}}^{(d-1)} f|}_{\text {op}} - {|{\mathfrak {h}}^{(d-1)} T_i f|}_{\text {op}} \right) _+ \bigg \Vert _{i,\infty }^2 \\&\quad = \sum _{i=1}^n \bigg \Vert \left( \sup _{v^j} {\langle v^1 \cdots v^{d-1}, {\mathfrak {h}}^{(d-1)}f\rangle } - \sup _{v^j} {\langle v^1 \cdots v^{d-1}, {\mathfrak {h}}^{(d-1)}T_i f\rangle } \right) _+ \bigg \Vert _{i,\infty }^2 \\&\quad \le \sum _{i = 1}^n \bigg \Vert \bigg ({\langle {\widetilde{v}}^1\cdots {\widetilde{v}}^{d-1}, {\mathfrak {h}}^{(d-1)}f - {\mathfrak {h}}^{(d-1)} T_i f\rangle } \bigg )_+ \bigg \Vert _{i,\infty }^2 \\&\quad \le \sum _{i = 1}^n \bigg \Vert \sum _{i_1,\ldots ,i_{d-1}} {\widetilde{v}}^1_{i_1} \cdots {\widetilde{v}}^{d-1}_{i_{d-1}} \bigg \Vert ({{\,\mathrm{Id}\,}}- T_i) \prod _{s = 1}^{d-1} ({{\,\mathrm{Id}\,}}- T_{i_s})f \bigg \Vert _{i_1 \cdots i_{d-1},\infty } \bigg \Vert _{i,\infty }^2 \\&\quad \le \sum _{i =1}^n \bigg ( \sum _{i_1, \ldots , i_{d-1}} {\widetilde{v}}^1_{i_1} \cdots {\widetilde{v}}^{d-1}_{i_{d-1}} {\mathfrak {h}}_{i i_1 \cdots i_{d-1}}f \bigg )^2 \\&\quad = \bigg ( \sup _{v^d : {|v^d|} \le 1} \sum _{i_d=1}^n \sum _{i_1, \ldots , i_{d-1}} {\widetilde{v}}_{i_1}^1 \cdots {\widetilde{v}}^{d-1}_{i_{d-1}} v^{d}_{i_d} {\mathfrak {h}}_{i_1 \cdots i_d} f \bigg )^2 \\&\quad \le \bigg ( \sup _{v^1, \ldots , v^d : {|v^j|} \le 1} \sum _{i_1, \ldots , i_d} v_{i_1}^1 \cdots v_{i_d}^d {\mathfrak {h}}_{i_1 \cdots i_d}f \bigg )^2 \\&\quad = {|{\mathfrak {h}}^{(d)}f|}_{\text {op}}^2. \end{aligned}$$

Here, in the first inequality we insert the vectors \({\widetilde{v}}^1, \ldots , {\widetilde{v}}^{d-1}\) maximizing the supremum and use the monotonicity of \(x \mapsto x_+\), and the second and third inequality follow from the triangle inequality. Taking the square root yields the claim. \(\square \)

As a final step, we need to establish a connection between \(L^p\) norm estimates and multilevel concentration inequalities. This is given by the following proposition, which was proven in [1, Theorem 7] and [5, Theorem 3.3]. We state it in the form given in [48, Proof of Theorem 3.6] with slight modifications.

Proposition 4

Assume that a random variable f satisfies, for some \(s \in [0,2)\), some constants \(C_1, \ldots , C_d \ge 0\) and all \(p \ge 2\), the moment bound \({\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le \sum _{k = 1}^{d} C_k (p-s)^{k/2}\), and let \(L :={|\{ l : C_l > 0\}|}\). For any \(t \ge 0\) we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}({|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge t) \le 2 \exp \Big ( - \min \Big (\frac{\log (2)}{2-s},1\Big ) \min _{k = 1, \ldots , d} \Big ( \frac{t}{Le C_k} \Big )^{2/k} \Big ). \end{aligned}$$

We will not give a proof of Proposition 4 and refer to the aforementioned works. However, the proof is almost identical to the proof of Proposition 2. The two important cases will be \(s = 0\) (for independent random variables) as well as \(s = 3/2\) (in the weakly dependent setting).

The proof of Theorem 1 is now easily completed.

Proof of Theorem 1

Since \(X_1, \ldots , X_n\) are independent, Theorem 6 yields

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le (8\kappa p)^{1/2} {\Vert {\mathfrak {h}}f\Vert }_p \le (8\kappa p)^{1/2} {\Vert {\mathfrak {h}}f\Vert }_{\text {op},1} + (8\kappa p)^{1/2} {\Vert ({|{\mathfrak {h}} f|} - {{\,\mathrm{\mathbb {E}}\,}}{|{\mathfrak {h}} f|})_+\Vert }_p \end{aligned}$$

where we have used that for any positive random variable W

$$\begin{aligned} {\Vert W\Vert }_p \le {{\,\mathrm{\mathbb {E}}\,}}W + {\Vert (W - {{\,\mathrm{\mathbb {E}}\,}}W)_+\Vert }_p. \end{aligned}$$
(18)

The second term on the right hand side can now be estimated using Theorem 6 again, which in combination with Lemma 1 gives

$$\begin{aligned} {\Vert \left( {|{\mathfrak {h}} f|} - {{\,\mathrm{\mathbb {E}}\,}}{|{\mathfrak {h}}f|} \right) _+\Vert }_p \le \sqrt{2 \kappa p} {\Vert {\mathfrak {h}}^+ {|{\mathfrak {h}}f|}\Vert }_p \le \sqrt{2\kappa p} {\Vert {\mathfrak {h}}^{(2)}f\Vert }_{\text {op},p}. \end{aligned}$$

This can be easily iterated to obtain for any \(d \in {{\,\mathrm{\mathbb {N}}\,}}\)

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le \sum _{j = 1}^{d-1} (8\kappa p)^{j/2} {\Vert {\mathfrak {h}}^{(j)}f\Vert }_{\text {op},1} + (8\kappa p)^{d/2} {\Vert {\mathfrak {h}}^{(d)}f\Vert }_{\text {op},\infty }. \end{aligned}$$

Now it remains to apply Proposition 4. \(\square \)

To prove Theorem 2, we shall require the following proposition, which is proven in [28, Proposition 2.4], building upon arguments established in [6]. (Note that the definition of \({\mathfrak {h}}\) there differed by a factor of \(\sqrt{2}\).) The estimate (20) does not appear therein, but is an easy modification of the proof.

Proposition 5

Let \(\mu \) be a measure on a product of Polish spaces satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). Then, for any \(f \in L^\infty (\mu )\) and any \(p \ge 2\) we have

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le (2\sigma ^2 (p-3/2))^{1/2} {\Vert \mathfrak {d}f\Vert }_p \le (\sigma ^2(p-3/2)/2)^{1/2} {\Vert {\mathfrak {h}} f\Vert }_p \end{aligned}$$
(19)

and

$$\begin{aligned} {\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p \le (2\sigma ^2 (p-3/2))^{1/2} {\Vert {\mathfrak {h}}^+ f\Vert }_p. \end{aligned}$$
(20)

Proof of Theorem 2

The proof is very similar to the proof of Theorem 1. In the first step, using (19) leads to

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le (2\sigma ^2 (p-3/2))^{1/2} {\Vert {\mathfrak {h}}f\Vert }_{\text {op},1} + (2\sigma ^2 (p-3/2))^{1/2} {\Vert ({|{\mathfrak {h}}f|} - {{\,\mathrm{\mathbb {E}}\,}}{|{\mathfrak {h}} f|})_+\Vert }_p. \end{aligned}$$

Equation (20) can be used to estimate the second term on the right-hand side. So, for any \(d \in {{\,\mathrm{\mathbb {N}}\,}}\) we have by an iteration

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le \sum _{j = 1}^{d-1} (2\sigma ^2(p-3/2))^{j/2} {\Vert {\mathfrak {h}}^{(j)} f\Vert }_{\text {op},1} + (2\sigma ^2(p-3/2))^{d/2} {\Vert {\mathfrak {h}}^{(d)} f\Vert }_{\text {op},\infty } \end{aligned}$$

Again we can apply Proposition 4 to obtain the concentration inequality. \(\square \)

To prove Theorem 3 we shall need the following lemma.

Lemma 2

Let \(({\mathcal {B}}, {\Vert \cdot \Vert })\) be a Banach space and \({\mathcal {F}}\) a family of uniformly norm-bounded, \({\mathcal {B}}\)-valued, measurable functions and set \(g(X) = \sup _{f \in {\mathcal {F}}} {\Vert f(X)\Vert }\). We have

$$\begin{aligned} {|{\mathfrak {h}}^+ g(X)|} \le \sup _{f \in {\mathcal {F}}} {|{\mathfrak {h}}^+ {\Vert f\Vert }(X)|}. \end{aligned}$$

Proof

Fix an \(X \in {\mathcal {Y}}\) and choose for any \(\varepsilon > 0\) a function \(f_\varepsilon \) such that \({\Vert f_\varepsilon (X)\Vert } \ge \sup _{f \in {\mathcal {F}}} {\Vert f(X)\Vert } - \varepsilon \). This yields

$$\begin{aligned} {\mathfrak {h}}^+_i g(X)&= \sup _{x_i'} ( \sup _{f \in {\mathcal {F}}} {\Vert f(X)\Vert } - \sup _{f \in {\mathcal {F}}} {\Vert f(X_{i^c}, x_i')\Vert })_+ \le \sup _{x_i'} \left( {\Vert f_\varepsilon (X)\Vert } + \varepsilon - {\Vert f_\varepsilon (X_{i^c}, x_i')\Vert } \right) _+\\&\le \sup _{x_i'} \big ({\Vert f_\varepsilon (X)\Vert } - {\Vert f_\varepsilon (X_{i^c}, x_i') \Vert }\big )_+ + \varepsilon = {\mathfrak {h}}_i^+ {\Vert f_\varepsilon \Vert }(X) + \varepsilon , \end{aligned}$$

where the first inequality follows by monotonicity of \(x \mapsto x_+\) and the second one is a consequence of \((a+b-c)_+ \le (a-c)_+ + b\) for \(a,b,c \ge 0\). Thus, we have

$$\begin{aligned} {|{\mathfrak {h}}^+ g(X)|}&= \left( \sum _{i = 1}^n {\mathfrak {h}}_i^+ g(X)^2 \right) ^{1/2} \le \left( \sum _{i = 1}^n ({\mathfrak {h}}_i^+ {\Vert f_\varepsilon \Vert }(X) + \varepsilon )^2 \right) ^{1/2} = {|{\mathfrak {h}}^+ {\Vert f_\varepsilon \Vert }(X) + \varepsilon (1,\ldots ,1)|}\\&\le {|{\mathfrak {h}}^+ {\Vert f_\varepsilon \Vert }(X)|} + \sqrt{n} \varepsilon \le \sup _{f \in {\mathcal {F}}} {|{\mathfrak {h}}^+ {\Vert f\Vert }(X)|} + \sqrt{n} \varepsilon . \end{aligned}$$

Taking the limit \(\varepsilon \rightarrow 0\) yields the claim. \(\square \)

Proof of Theorem 3

Note that in the real-valued case, the estimate \({\mathfrak {h}}_i^+ {|f|} \le {\mathfrak {h}}_i f\) holds. For brevity, let \(s = 3/2\). Using this in combination with Proposition 5 and Lemma 2 yields

$$\begin{aligned} {\Vert (g - {{\,\mathrm{\mathbb {E}}\,}}g)_+\Vert }_p&\le \big (2\sigma ^2 (p-s)\big )^{1/2} {\Vert {\mathfrak {h}}^+ g\Vert }_p \le \big (2\sigma ^2 (p-s)\big )^{1/2} {\Vert \sup _{f \in {\mathcal {F}}} {|{\mathfrak {h}} f|}\Vert }_p \\&\le \big (2\sigma ^2 (p-s)\big )^{1/2} {{\,\mathrm{\mathbb {E}}\,}}W_1 + \big (2\sigma ^2 (p-s)\big )^{1/2} {\Vert (W_1 - {{\,\mathrm{\mathbb {E}}\,}}W_1)_+\Vert }_p. \end{aligned}$$

We can apply Proposition 5 again on the right hand side, which gives

$$\begin{aligned} {\Vert (g - {{\,\mathrm{\mathbb {E}}\,}}g)_+\Vert }_p \le \big (2\sigma ^2 (p-s)\big )^{1/2} {{\,\mathrm{\mathbb {E}}\,}}W_1 + \big (2\sigma ^2 (p-s)\big ) {\Vert {\mathfrak {h}}^+ W_1\Vert }_p. \end{aligned}$$

A combination of Lemmas 1 and 2 shows that \({|{\mathfrak {h}}^+ W_j|} \le W_{j+1}\), and so by an iteration we obtain

$$\begin{aligned} {\Vert (g - {{\,\mathrm{\mathbb {E}}\,}}g)_+\Vert }_p \le \sum _{j = 1}^{d-1} \big (2\sigma ^2 (p-s)\big )^{j/2} {{\,\mathrm{\mathbb {E}}\,}}W_j + \big (2\sigma ^2(p-s)\big )^{d/2} {\Vert W_d\Vert }_\infty . \end{aligned}$$

In the case of independent random variables, the first step is replaced by an application of Theorem 6; in that case, \(2\sigma ^2 = 2\kappa \) and \(s = 0\). \(\square \)

Proof of Proposition 1

The proof shares some similarities with the proof of Lemma 2. Since X satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\), we have for any \(p \ge 2\)

$$\begin{aligned} {\Vert (g - {{\,\mathrm{\mathbb {E}}\,}}g)_+\Vert }_p \le \big (2\sigma ^2 (p-3/2)\big )^{1/2} {\Vert {\mathfrak {h}}^+ g\Vert }_p. \end{aligned}$$

Moreover, for any \(i \in [n]\) and \(x \in {\mathcal {Y}}\), if a maximizer \({\widetilde{f}}\) of \(\sup _{f \in {\mathcal {F}}} {|\sum _{j = 1}^n f(x_j)|}\) exists, we obtain

$$\begin{aligned} {\mathfrak {h}}^+_i g(x)^2&= \sup _{x_i'} \Big (\sup _{f \in {\mathcal {F}}} {|f(X)|} - \sup _{f \in {\mathcal {F}}} {|f(X_{i^c}, x_i')|}\Big )^2_+ \\&\le \sup _{x_i'} \Big ( {|{\widetilde{f}}(X)|} - {|{\widetilde{f}}(X_{i^c}, x_i')|} \Big )_+^2 \le c({\widetilde{f}})^2 \le \sup _{f \in {\mathcal {F}}} c(f)^2. \end{aligned}$$

If a maximizer \({\widetilde{f}}\) does not exist, these estimates remain valid by an approximation argument as in the proof of Lemma 2. Consequently, we have \({\Vert (g - {{\,\mathrm{\mathbb {E}}\,}}g)_+\Vert }_p \le (2\sigma ^2 (p-3/2) n \sup _{f \in {\mathcal {F}}} c(f)^2 )^{1/2}.\) The claim now follows from Proposition 4. \(\square \)

4 Suprema of Chaos, U-statistics and Polynomials: Proofs

Proof of Theorem 4

Let us first consider the case that X satisfies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). Recall that we have by (20)

$$\begin{aligned} {\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p \le (2\sigma ^2 (p-3/2))^{1/2} {\Vert {\mathfrak {h}}^+ f\Vert }_p. \end{aligned}$$

We shall make use of the pointwise inequality \({|{\mathfrak {h}}^+ f|} \le (b-a) W_1.\) To see this, let \(({\widetilde{t}}, {\widetilde{v}}^*)\) be the tuple satisfying \(\sup _{t \in {\mathcal {T}}} \sup _{v^* \in {\mathcal {B}}_1^*} v^*(\sum _{I \in {\mathcal {I}}_{n,d}} X_I t_I) = {\widetilde{v}}^*(\sum _{I \in {\mathcal {I}}_{n,d}} X_I {\widetilde{t}}_I)\). We have

$$\begin{aligned} {|{\mathfrak {h}}^+ f(X)|}^2&= \sum _{i = 1}^n \sup _{x_i'} \Big ( {\sup _{t, v^*} v^*\Big (\sum _{I \in {\mathcal {I}}_{n,d}} X_I t_I \Big )} - {\sup _{t, v^*} v^*\Big (\sum _{I \in {\mathcal {I}}_{n,d}} (X_{i^c}, x_i')_I t_I \Big )} \Big )_+^2 \\&\le \sum _{i = 1}^n \sup _{x_i'} \Big ( (X_i - x_i') \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-1} \\ i \notin I \end{array}} {\widetilde{v}}^*(X_I {\widetilde{t}}_{I \cup \{i\}}) \Big )^2 \\&\le (b-a)^2 \sum _{i = 1}^n \Big ( {\widetilde{v}}^*\Big ( \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-1} \\ i \notin I \end{array}} X_I {\widetilde{t}}_{I \cup \{i\}} \Big ) \Big )^2 \\&= (b-a)^2 \sup _{\alpha ^{(1)} : {\Vert \alpha ^{(1)}\Vert } \le 1} \Big ({\widetilde{v}}^* \Big ( \sum _{i = 1}^n \alpha ^{(1)}_i \sum _{I \in {\mathcal {I}}_{n,d-1} : i \notin I} X_I {\widetilde{t}}_{I \cup \{i\}} \Big )\Big )^2 \\&\le (b-a)^2 \Big (\sup _{t, v^*} \sup _{\alpha ^{(1)} : {\Vert \alpha ^{(1)}\Vert } \le 1} v^*\Big ( \sum _{i = 1}^n \alpha _i^{(1)} \sum _{I \in {\mathcal {I}}_{n,d-1} : i \notin I} X_I t_{I \cup \{i\}} \Big )\Big )^2 \\&= (b-a)^2 W_1^2, \end{aligned}$$

proving the first part. Consequently,

$$\begin{aligned} {\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p \le (2\sigma ^2 (b-a)^2 (p-3/2))^{1/2}\left( {{\,\mathrm{\mathbb {E}}\,}}W_1 + {\Vert (W_1 - {{\,\mathrm{\mathbb {E}}\,}}W_1)_+\Vert }_p \right) . \end{aligned}$$

As in [16], this can now be iterated: for any \(k \in \{1,\ldots , d-1\}\) we have \({|{\mathfrak {h}}^+ W_k|} \le (b-a)W_{k+1}\). Here, we may argue as above, where the only difference is to choose \(({\widetilde{t}}, {\widetilde{v}}^*)\) and \({\widetilde{\alpha }}^{(1)},\ldots , {\widetilde{\alpha }}^{(k)}\) which maximize \(W_k\). This finally leads to

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p \le \sum _{j = 1}^d \big (2\sigma ^2 (b-a)^2 (p-3/2)\big )^{j/2} {{\,\mathrm{\mathbb {E}}\,}}W_j, \end{aligned}$$

using that \(W_d\) is constant. This proves (11). The same arguments are also valid without a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) property, if one considers \({\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p\) and applies Theorem 6 instead.

Lastly, to prove (12), let us first consider why we cannot argue as before. Note that the argument heavily relies on the positive part of the difference operator \({\mathfrak {h}}^+\), which allows us to choose the maximizing pair \(({\widetilde{t}}, {\widetilde{v}}^*)\) independently of \(i \in [n]\). This is no longer possible when proving the two-sided inequality (12). Here, Theorem 6 yields

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_p&\le \big (\sigma ^2 p\big )^{1/2} {\Vert {\mathfrak {h}}f\Vert }_{p} \\ {\Vert (f - {{\,\mathrm{\mathbb {E}}\,}}f)_+\Vert }_p&\le \big (\sigma ^2 p\big )^{1/2} {\Vert {\mathfrak {h}}^+ f\Vert }_{p}. \end{aligned}$$

Thus, this argument fails if we try to use these inequalities. However, we can rewrite \({\mathfrak {h}}_i f(x) = \sup _{x_i',x_i''} (f(x_{i^c}, x_i') - f(x_{i^c}, x_i''))_+= \sup _{x_i'} {\mathfrak {h}}_i^+ f(x_{i^c}, x_i')\), where the \(\sup \) is to be understood with respect to the support of \(X_i'\). As a consequence, we have for each fixed \(i \in [n]\) (again choosing \({\widetilde{t}}\) by maximizing the first summand in the brackets)

$$\begin{aligned} {\mathfrak {h}}_i f(x)^2&= \sup _{x_i'} \sup _{x_i''} \Big ( \sup _{t \in {\mathcal {T}}} \bigg \Vert \sum _{I \in {\mathcal {I}}_{n,d}} (X_{i^c}, x_i')_I t_I \bigg \Vert - \sup _{t \in {\mathcal {T}}} \bigg \Vert \sum _{I \in {\mathcal {I}}_{n,d}} (X_{i^c}, x_i'')_I t_I \bigg \Vert \Big )_+^2 \\&\le \sup _{x_i'} \sup _{x_i''} \bigg \Vert (x_i' - x_i'')\sum _{I \in {\mathcal {I}}_{n,d-1} : i \notin I} X_I {\widetilde{t}}_{I \cup \{i\}} \bigg \Vert ^2 \\&\le \sup _{x_i', x_i''} {|x_i' - x_i''|}^2 \sup _{t \in {\mathcal {T}}} \bigg \Vert \sum _{I \in {\mathcal {I}}_{n,d-1} : i \notin I} X_I t_{I \cup \{i\}} \bigg \Vert ^2 \\&\le (b-a)^2 \sup _{t \in {\mathcal {T}}} \bigg \Vert \sum _{I \in {\mathcal {I}}_{n,d-1}: i \notin I} X_I t_{I \cup \{i\}} \bigg \Vert ^2. \end{aligned}$$

This implies

$$\begin{aligned} {|{\mathfrak {h}}f|}^2(x) \le (b-a)^2 \sup _{\begin{array}{c} \alpha ^1 \in {{\,\mathrm{\mathbb {R}}\,}}^n \\ {|\alpha ^1|} \le 1 \end{array}} \sum _{i = 1}^n \alpha _i^1 \sup _{t \in {\mathcal {T}}} \bigg \Vert \sum _{\begin{array}{c} I \in {\mathcal {I}}_{n,d-1} \\ i \notin I \end{array}} X_I t_{I \cup \{i \}} \bigg \Vert = (b-a)^2 {\widetilde{W}}_1^2. \end{aligned}$$

The proof is now completed using the same arguments as in the first part, with \(W_k\) replaced by \({\widetilde{W}}_k\). The same argument is valid for X satisfying a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). \(\square \)

Proof of Proposition 2

The proposition can be proven using a technique similar to the one above, since the Hilbert–Schmidt norms of the higher order difference operators act as Fourier projections. We choose to take an alternative route as follows. The proof of [44, Theorem 9.21] shows that for any f of degree at most d and any \(p \ge 2\)

$$\begin{aligned} {\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_p \le \sum _{j = 1}^d (p-1)^{j/2} W_j(f)^{1/2}. \end{aligned}$$
(21)

First, by Chebyshev’s inequality we have for any \(p \ge 1\)

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge e{\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_p) \le \exp (-p). \end{aligned}$$
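Indeed, whenever \({\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_p > 0\), this is Markov’s inequality applied to \({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|}^p\):

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\big ({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge e{\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_p\big ) \le \frac{{{\,\mathrm{\mathbb {E}}\,}}{|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|}^p}{e^p {\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_p^p} = e^{-p}. \end{aligned}$$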

We want to apply this to a t-dependent parameter p given by the function

$$\begin{aligned} \eta _f(t) :=1 + \min _{j = 1,\ldots ,d} \Big ( \frac{t}{de W_j(f)^{1/2}} \Big )^{2/j}. \end{aligned}$$
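This choice is tailored to (21); indeed, for any \(j \in \{1,\ldots ,d\}\), choosing \(k = j\) in the minimum gives

$$\begin{aligned} \big (\eta _f(t) - 1\big )^{j/2} W_j(f)^{1/2} = \Big ( \min _{k = 1,\ldots ,d} \Big ( \frac{t}{de W_k(f)^{1/2}} \Big )^{2/k} \Big )^{j/2} W_j(f)^{1/2} \le \frac{t}{de}, \end{aligned}$$

so that, summing over \(j = 1, \ldots , d\), the right-hand side of (21) with \(p = \eta _f(t)\) is at most \(t/e\).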

If \(\eta _f(t) \ge 2\), (21) yields \(e{\Vert f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)\Vert }_{\eta _f(t)} \le t\), which combined with the trivial estimate \({{\,\mathrm{\mathbb {P}}\,}}(\cdot ) \le 1\) gives

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}({|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge t) \le e^2 \exp (-\eta _f(t)) = \exp \Big (1 - \min _{j = 1,\ldots ,d} \Big ( \frac{t}{de W_j(f)^{1/2}} \Big )^{2/j}\Big ) \end{aligned}$$

as claimed. \(\square \)

Proof of Proposition 3

We apply Theorems 1 and 2 in the respective cases. To this end, we make use of the general bound \({\Vert {\mathfrak {h}}^{(k)} f\Vert }_{\text {op},1} \le {\Vert {\mathfrak {h}}^{(k)}f\Vert }_{\text {HS},\infty }\) for \(k \in [d]\). For any distinct \(j_1, \ldots , j_k\) write \({\Vert \cdot \Vert } = {\Vert \cdot \Vert }_{j_1, \ldots , j_k, \infty }\), so that

$$\begin{aligned}&{\mathfrak {h}}_{j_1 \ldots , j_k} f = \Big \Vert \, f + \sum _{l=1}^k\, (-1)^l \sum _{1 \le s_1< \ldots< s_l \le k} T_{j_{s_1} \ldots j_{s_l}}f\, \Big \Vert \\&\quad = \Big \Vert \sum _{i_1 \ne \ldots \ne i_d} \big ( h(X_{i_1}, \ldots , X_{i_d}) + \sum _{l = 1}^k (-1)^l \sum _{s_1< \ldots < s_l} T_{j_{s_1} \ldots j_{s_l}}h(X_{i_1}, \ldots ,X_{i_d}) \big ) \Big \Vert \\&\quad =:\Big \Vert \sum _{i_1 \ne \ldots \ne i_d} S_{i_1, \ldots , i_d}(h,X) \Big \Vert . \end{aligned}$$

Now it is easy to see that \(S_{i_1, \ldots , i_d}(h,X) = 0\) unless \(\{j_1, \ldots , j_k \} \subset \{ i_1, \ldots , i_d \}\) (for example, this follows if one writes the sum inside the norm as \(\prod _{i = 1}^k (\text {Id} - T_{j_i}) f\)), and in these cases each such summand can be bounded by \(2^k B\), from which we infer

$$\begin{aligned} {\mathfrak {h}}_{j_1 \ldots , j_k} f&\le \left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k B (n-k) \cdots (n-d+1) \le \left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k B n^{d-k}. \end{aligned}$$

Consequently, since the Hilbert–Schmidt norm involves at most \(n^k\) entries, each bounded as above, this leads to

$$\begin{aligned} {\Vert {\mathfrak {h}}^{(k)} f\Vert }_{\text {HS},\infty } \le \left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k B n^{d-k} n^{k/2} = \left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k B n^{d-k/2}. \end{aligned}$$

Thus, an application of Theorem 1 or 2, respectively, yields for any \(t \ge 0\) and for C as given therein

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\Big ( {|f - {{\,\mathrm{\mathbb {E}}\,}}f|} \ge t \Big ) \le 2 \exp \Big ( - \frac{1}{C} \min _{k=1,\ldots , d} \Big ( \frac{t}{B \left( {\begin{array}{c}d\\ k\end{array}}\right) 2^k n^{d-k/2}} \Big )^{2/k} \Big ). \end{aligned}$$

For the second part, replace t by \(B n^{d-1/2} t\) (with \(t > 0\)) to obtain

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\Big ( n^{1/2 - d} B^{-1} {|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge t \Big ) \le 2 \exp \Big ( - \frac{1}{4C_d} \min _{k=1,\ldots , d} n^{\frac{k-1}{k}} t^{2/k} \Big ). \end{aligned}$$
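To evaluate the minimum, it is convenient to rewrite the terms as

$$\begin{aligned} n^{\frac{k-1}{k}} t^{2/k} = n \Big ( \frac{t^2}{n} \Big )^{1/k}, \qquad k = 1, \ldots , d. \end{aligned}$$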

Since \((t^2/n)^{1/k}\) is increasing in k for \(t \le n^{1/2}\) and decreasing in k for \(t \ge n^{1/2}\), the minimum is attained for \(k = 1\) in the range \(t \le n^{1/2}\) and for \(k = d\) otherwise, i. e.

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\big (n^{1/2 - d} B^{-1} {|f(X) - {{\,\mathrm{\mathbb {E}}\,}}f(X)|} \ge t\big ) \le 2 \exp \left( - \frac{1}{4C_d} \min \big (t^2, n^{1-1/d} t^{2/d}\big ) \right) . \end{aligned}$$
(22)

\(\square \)

Proof of Theorem 5

We give a sketch of the proof only and refer to [3, Proof of Theorem 2.2] for details. Recall that by (19) we have the inequality

$$\begin{aligned} {\Vert f_d(X) - {{\,\mathrm{\mathbb {E}}\,}}f_d(X)\Vert }_p \le \big (4\sigma ^2 p\big )^{1/2} {\Vert \nabla f_d(X)\Vert }_p. \end{aligned}$$

Using the arguments and notations from [3, Proof of Theorem 2.2] leads to

$$\begin{aligned} {\Vert f_d(X) - {{\,\mathrm{\mathbb {E}}\,}}f_d(X)\Vert }_p \le \sum _{k=1}^d \big (4\sigma ^2 M^2\big )^{k/2} {\Vert {\langle {{\,\mathrm{\mathbb {E}}\,}}_X \nabla ^{(k)} f_d(X), G_1 \otimes \ldots \otimes G_k\rangle }\Vert }_p, \end{aligned}$$

where M is an absolute constant and \((G_i)_{i \ge 1}\) is a sequence of independent standard Gaussian random vectors, independent of X. Furthermore, a result by Latała [35] yields

$$\begin{aligned} {\Vert f_d(X) - {{\,\mathrm{\mathbb {E}}\,}}f_d(X)\Vert }_p&\le \sum _{k = 1}^d \sum _{{\mathcal {I}} \in P_k} \big (4\sigma ^2 M^2 p\big )^{k/2} {\Vert {{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(k)} f(X)\Vert }_{{\mathcal {I}}} \\&\le \sum _{k = 1}^d \sum _{{\mathcal {I}} \in P_k} \big (C \sigma ^2 p\big )^{k/2} {\Vert {{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(k)} f(X)\Vert }_{{\mathcal {I}}}. \end{aligned}$$

The rest now follows as in the previous proofs. \(\square \)

Proof of Corollary 2

In [48] the authors have proven that \(\frac{1}{2} \Phi _{{|{{\,\mathrm{\varvec{\beta }}\,}}|}}'(1) < 1\) implies a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\) for \(\mu _{{{\,\mathrm{\varvec{\beta }}\,}}}\) with a constant depending on the parameter \({{\,\mathrm{\varvec{\beta }}\,}}\) only. Thus, it remains to bound the norms in (17). Note that due to the structure of the exponential random graph model, the expectations \({{\,\mathrm{\mathbb {E}}\,}}X_G\) and \({{\,\mathrm{\mathbb {E}}\,}}X_H\) are equal whenever G and H are isomorphic. Accordingly, we define \(C_{S_2} :={{\,\mathrm{\mathbb {E}}\,}}X_{S_2}\) (where \(S_2\) is a 2-star) and \(C_{E} :={{\,\mathrm{\mathbb {E}}\,}}X_e\).

The Euclidean norms can be easily bounded:

$$\begin{aligned} {|{{\,\mathrm{\mathbb {E}}\,}}\nabla f(X)|}&= \left( \sum _{e \in {\mathcal {I}}_n} ((n-2) C_{S_2})^2 \right) ^{1/2} \le C_{S_2} n^2 \\ {|{{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(2)} f(X)|}&= \left( \sum _{e, f: e \cap f \ne \emptyset } \left( C_{E} \right) ^2 \right) ^{1/2} \le C_E n^{3/2} \\ {|{{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(3)} f(X)|}&= \left( \sum _{\{e,f,g\} \in {\mathcal {T}}_3} 1 \right) ^{1/2} = n^{3/2}, \end{aligned}$$

and it remains to estimate the three remaining norms. However, in [5, Sect. 5.1], the authors give estimates for such norms in the Erdös–Rényi case, and it is easy to adapt these to any model with the property that \({{\,\mathrm{\mathbb {E}}\,}}X_G\) depends only on the isomorphism class of G (in the complete graph). In particular, due to the structure of the exponential random graph models, this is true in this setting as well. This gives

$$\begin{aligned} {|{{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(3)} f(X)|}_{\text {op}}&\le 2^{3/2}, \quad {|{{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(3)} f(X)|}_{\{1,2\}\{3\}} \le \sqrt{2n}, \quad {|{{\,\mathrm{\mathbb {E}}\,}}\nabla ^{(2)} f(X)|}_{\text {op}} \le 2C_E n. \end{aligned}$$

Inserting these estimates into (17) finishes the proof. \(\square \)

5 Logarithmic Sobolev Inequalities and Difference Operators

To conclude this paper, we discuss the LSI property (2) for different choices of difference operators \(\Gamma \). Here, we always assume that the probability measure \(\mu \) is defined on a product of Polish spaces \({\mathcal {Y}} = \otimes _{i=1}^n {\mathcal {X}}_i\) with product Borel \(\sigma \)-algebra \({\mathcal {A}} = {\mathcal {B}}(\otimes _{i = 1}^n {\mathcal {X}}_i)\).

In this situation, we can make use of the disintegration theorem on Polish spaces (see [7, Theorem 5.3.1] and [25, Chapter III]): If \(\mu \) is a measure on \({\mathcal {Y}}\), then for each \(i \in \{1,\ldots ,n\}\) we can decompose \(\mu \) using the marginal measure \(\mu _{i^c}\) (as a measure on \(\otimes _{j \ne i} {\mathcal {X}}_j\)) and a conditional measure on \({\mathcal {X}}_i\), which we denote by \(\mu (\cdot \mid x_{i^c})\). More precisely, for any \(A \in {\mathcal {A}}\) we have \(\mu (A) = \int _{\otimes _{j \ne i} {\mathcal {X}}_j} \int _{{\mathcal {X}}_i} {\mathbb {1}}_{A}(x_{i^c}, x_i) {\text {d}}\mu (x_i \mid x_{i^c}) {\text {d}}\mu _{i^c}(x_{i^c})\).

For finite spaces, \(\mu (\cdot \mid x_{i^c})\) is just the ordinary conditional measure as used in the definition of the difference operator \(\mathfrak {d}\). Note that the definition of \(\mathfrak {d}\) can in principle be rewritten for products of arbitrary Polish spaces. However, our first result shows that the \(\mathfrak {d}\)-LSI property in fact requires the support of the measure to be finite. More precisely, we say that \(\mu \) has finite support if there is no sequence of sets \(A_n \in {\mathcal {A}}\) with \(\mu (A_n) > 0\) for every n and \(\mu (A_n) \rightarrow 0\). For instance, any measure with infinitely many atoms fails to have finite support in this sense, while a measure concentrated on finitely many atoms clearly has it.

Proposition 6

Let \({\mathcal {Y}} = \otimes _{i=1}^n {\mathcal {X}}_i\) be a product of Polish spaces, and let \(\mu \) be a probability measure on \({\mathcal {Y}}\). If \(\mu \) satisfies a \(\mathfrak {d}\)-LSI, then \(\mu \) has finite support. Moreover, if \(\mu \) is a product probability measure, then \(\mu \) satisfies a \(\mathfrak {d}\)-LSI iff \(\mu \) has finite support.

Proof

First assume \(\mu \) does not have finite support, i. e. there is a sequence of sets \(A_n \in {\mathcal {A}}\) with \(\mu (A_n) > 0\) for every n and \(\mu (A_n) \rightarrow 0\). Choosing \(f_n :={\mathbb {1}}_{A_n} \in L^\infty (\mu )\) and assuming a \(\mathfrak {d}\)-LSI\((\sigma ^2)\) holds, we obtain

$$\begin{aligned} \mu (A_n) \log (1/\mu (A_n)) = {{\,\mathrm{Ent}\,}}_{\mu }(f_n^2) \le 2 \sigma ^2 \int (\mathfrak {d}f_n)^2 {\text {d}}\mu = 2 \sigma ^2 \mu (A_n)(1-\mu (A_n)). \end{aligned}$$
(23)

This easily leads to a contradiction.
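Indeed, dividing (23) by \(\mu (A_n) > 0\) gives

$$\begin{aligned} \log \big (1/\mu (A_n)\big ) \le 2 \sigma ^2 \big (1-\mu (A_n)\big ) \le 2\sigma ^2, \end{aligned}$$

and the left-hand side tends to infinity as \(\mu (A_n) \rightarrow 0\), while the right-hand side remains bounded.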

On the other hand, let \(\mu \) be a product probability measure with finite support. By tensorization, it suffices to consider \(n=1\), and we may moreover assume \({\mathcal {Y}}\) to have finitely many elements only. Then, by [12, Remark 6.6], \(\mu \) satisfies a \(\mathfrak {d}\)-LSI\((\sigma ^2)\) with \(\sigma ^2 \le C \log (1/\min _{y : \mu (y) > 0} \mu (y))\), which finishes the proof. \(\square \)

In fact, Proposition 6 can be adapted to the difference operator \({\mathfrak {h}}^+\) as well. To see this, note that (23) can easily be rewritten for the difference operator \({\mathfrak {h}}^+\) (with only minor changes) and that \(\int {|\mathfrak {d}f|}^2{\text {d}}\mu \le \int {|{\mathfrak {h}}^+ f|}^2{\text {d}}\mu \). In particular, the \(\mathfrak {d}\)- and \({\mathfrak {h}}^+\)-LSI properties are not essentially different.

The situation drastically changes if we consider \({\mathfrak {h}}\)-LSIs instead. Here, a sufficient condition for the \({\mathfrak {h}}\text {-LSI}\) property to hold is that the measure \(\mu \) satisfies an approximate tensorization (AT) property. As a consequence, for product probability measures, satisfying an \({\mathfrak {h}}\)-LSI is in fact a universal property.

Theorem 7

Let \({\mathcal {Y}} = \otimes _{i = 1}^n {\mathcal {X}}_i\) be a product of Polish spaces, and let \(\mu \) be a probability measure on \({\mathcal {Y}}\). If \(\mu \) satisfies an approximate tensorization property

$$\begin{aligned} {{\,\mathrm{Ent}\,}}_{\mu }(f^2) \le C \sum _{i = 1}^n \int {{\,\mathrm{Ent}\,}}_{\mu (\cdot \mid x_{i^c})}(f^2(x_{i^c}, \cdot )) {\text {d}}\mu _{i^c}(x_{i^c}), \end{aligned}$$
(24)

then \(\mu \) also satisfies an \({\mathfrak {h}}\text {-LSI}(C)\). In particular, any product probability measure satisfies an \({\mathfrak {h}}\text {-LSI}(1)\).

To the best of our knowledge, Theorem 7 is new. For product measures, it might be compared to the Efron–Stein inequality (see e. g. [27, 50]) which establishes the tensorization property for the variance, and can be regarded as a universal Poincaré inequality with respect to \(\mathfrak {d}\) (see e. g. [10] for such an interpretation). However, note that Theorem 7 (i. e. more precisely the \({\mathfrak {h}}\text {-LSI}(1)\) for product measures) does not imply the Efron–Stein inequality, as the difference operator is \({\mathfrak {h}}\) instead of \(\mathfrak {d}\). Unfortunately, as Proposition 6 demonstrates, there is no “entropy version” of the Efron–Stein inequality of the form \({{\,\mathrm{Ent}\,}}_{\mu }(f^2) \le C {{\,\mathrm{\mathbb {E}}\,}}_\mu {|\mathfrak {d}f|}^2\) (for any product probability measure \(\mu \) and some universal constant C).
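For comparison, in the notation of (24) the Efron–Stein inequality states that for any product probability measure \(\mu \) and any \(f \in L^2(\mu )\)

$$\begin{aligned} {{\,\mathrm{Var}\,}}_{\mu }\big (f\big ) \le \sum _{i = 1}^n \int {{\,\mathrm{Var}\,}}_{\mu (\cdot \mid x_{i^c})}\big (f(x_{i^c}, \cdot )\big ) {\text {d}}\mu _{i^c}(x_{i^c}), \end{aligned}$$

i. e. the variance always tensorizes with constant 1.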

Since, by Theorem 7, the joint law of any independent random variables \(X_1, \ldots , X_n\) satisfies an \({\mathfrak {h}}\)-LSI(1), it might be tempting to regard Theorem 1 as an \({\mathfrak {h}}\)-LSI analogue of Theorem 2. However, it seems that it is not possible to use the entropy method based on \({\mathfrak {h}}\)-LSIs, so that this interpretation is not fully accurate. More precisely, Theorem 7 cannot be used to estimate the growth of \(L^p\) norms as in the setting of a \(\mathfrak {d}\text {-LSI}(\sigma ^2)\). Indeed, it is impossible to prove the required moment inequalities

$$\begin{aligned} {\Vert f -{{\,\mathrm{\mathbb {E}}\,}}f\Vert }_q \le (\sigma ^2 q)^{1/2} {\Vert {\mathfrak {h}}f\Vert }_q \end{aligned}$$
(25)

under an \({\mathfrak {h}}\text {-LSI}(\sigma ^2)\). For example, the measure \(\mu _p = p\delta _1 + (1-p)\delta _0\) satisfies an \({\mathfrak {h}}\text {-LSI}(\sigma ^2_p)\) with \(\sigma _p^2 \sim p(1-p) \log (1/p)\) (for \(p \rightarrow 0\)), so that (25) would imply for \(f(x) = x\) an upper bound on the Orlicz norm associated with \(\Psi _2(x) = e^{x^2} - 1\):

$$\begin{aligned} {\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_{\Psi _2} \le 2e \sup _{q \ge 1} \frac{{\Vert f - {{\,\mathrm{\mathbb {E}}\,}}f\Vert }_q}{q^{1/2}} \le 4e\sigma _p. \end{aligned}$$

By the definition of the Orlicz norm, this bound would entail \({{\,\mathrm{\mathbb {E}}\,}}\exp \big ( \frac{(f-{{\,\mathrm{\mathbb {E}}\,}}f)^2}{16e^2\sigma _p^2} \big ) \le 2\). However, a simple calculation shows that \({{\,\mathrm{\mathbb {E}}\,}}\exp \big ( \frac{(f-{{\,\mathrm{\mathbb {E}}\,}}f)^2}{16e^2\sigma _p^2} \big ) \rightarrow \infty \) as \(p \rightarrow 0\), a contradiction.
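To sketch this calculation: since \(f(X) = X\) takes the value 1 with probability p, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\exp \Big ( \frac{(f-{{\,\mathrm{\mathbb {E}}\,}}f)^2}{16e^2\sigma _p^2} \Big ) \ge p \exp \Big ( \frac{(1-p)^2}{16e^2\sigma _p^2} \Big ), \end{aligned}$$

and with \(\sigma _p^2 \sim p(1-p)\log (1/p)\), the exponent is of order \((p\log (1/p))^{-1}\), which grows faster than \(\log (1/p)\), so that the right-hand side indeed diverges as \(p \rightarrow 0\).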

The approximate tensorization property in Theorem 7 is interesting in its own right, but it is not yet well studied. For finite spaces, [40] gives sufficient conditions for a measure \(\mu \) to satisfy an approximate tensorization property. Similar results have been derived in [21]; these can be applied in both discrete and continuous settings. For example, if one considers a measure of the form

$$\begin{aligned} \mu (x) = Z^{-1} \prod _{i =1}^n \mu _{0,i}(x_i) \exp \left( \sum _{i,j} J_{ij} w_{ij}(x_i,x_j) \right) \end{aligned}$$

for some countable spaces \(\Omega _i\), \(x_i \in \Omega _i\), measures \(\mu _{0,i}\) on \(\Omega _i\) and bounded functions \(w_{ij}\), then under certain technical conditions \(\mu \) satisfies an approximate tensorization property. This does not require any functional inequality for the \(\mu _{0,i}\). Very recently, it has been shown in [3, Proposition 5.4] that the \(\text {AT}(C)\) property implies dimension-free concentration inequalities for convex functions.

Note that the \(\text {AT}(C)\) property requires a certain weak dependence assumption in general. For example, the push-forward of a random permutation \(\pi \) of [n] to \({{\,\mathrm{\mathbb {N}}\,}}^{n}\) cannot satisfy an approximate tensorization property: conditionally on \(x_{i^c}\), the remaining coordinate \(x_i\) is almost surely determined, so that the right-hand side of (24) vanishes for every f, while the left-hand side does not in general. It is an interesting question to find necessary and sufficient conditions for the approximate tensorization property to hold.

Proof of Theorem 7

Let \(X = (X_1, \ldots , X_n)\) be a \({\mathcal {Y}}\)-valued random vector with law \(\mu \). First we consider the case \(n = 1\). By homogeneity of both sides, we may assume \(\int f^2(X) d{{\,\mathrm{\mathbb {P}}\,}}= 1\). Since f is bounded, we have \(0 \le a \le {|f(X)|} \le b < \infty \) \({{\,\mathrm{\mathbb {P}}\,}}\)-a.s., where b is the essential supremum of \({|f(X)|}\) and a the essential infimum. Due to the constraints on the integral this leads to \(a^2 \le 1 \le b^2\). (Actually the cases \(b = 1\) or \(a = 1\) are trivial, since then \(f^2(X) = 1\) \({{\,\mathrm{\mathbb {P}}\,}}\)-a.s., but we will not make this distinction.) Let \(F(u) :={{\,\mathrm{\mathbb {P}}\,}}(f^2(X) \ge u)\). In particular

$$\begin{aligned} F(u) = {\left\{ \begin{array}{ll} 1 &{} u \le a^2, \\ 0 &{} u > b^2. \end{array}\right. } \end{aligned}$$

Using the partial integration formula (see e. g. [31, Theorem 21.67 and Remark 21.68]) in connection with [20, Theorem 7.7.1] yields

$$\begin{aligned} {{\,\mathrm{Ent}\,}}(f^2(X))&= \int _0^\infty u \log u d(-F(u)) = \int _0^{b^2} (\log u + 1) F(u) {\text {d}}u \\&= \int _0^{a^2} (\log u + 1) F(u) {\text {d}}u + \int _{a^2}^{b^2} (\log u +1) F(u) {\text {d}}u \\&= \int _0^{a^2} (\log u + 1)F(u) {\text {d}}u + \int _{a^2}^{b^2} \log u F(u) {\text {d}}u + (1-a^2). \end{aligned}$$

The first integral can be calculated explicitly

$$\begin{aligned} \int _0^{a^2} (\log u + 1) F(u) {\text {d}}u = u \log u \mid _0^{a^2} = a^2 \log a^2, \end{aligned}$$

and moreover, using \(\log u \le \log b^2\) on \([a^2, b^2]\) as well as \(\int _{a^2}^{b^2} F(u) {\text {d}}u = {{\,\mathrm{\mathbb {E}}\,}}f^2(X) - a^2 = 1 - a^2\), we have

$$\begin{aligned} \int _{a^2}^{b^2} \log u F(u) {\text {d}}u \le \log \big (b^2\big )\big (1-a^2\big ). \end{aligned}$$

Plugging in these two estimates yields

$$\begin{aligned} {{\,\mathrm{Ent}\,}}\big (f^2(X)\big ) \le a^2 \log a^2 + \big (1-a^2\big ) + \log b^2 \big (1-a^2\big ) =:f(a,b). \end{aligned}$$

Next, if we show that

$$\begin{aligned} f(a,b) \le 2 (b-a)^2 \text { on } G :=\{ (a,b) \in {{\,\mathrm{\mathbb {R}}\,}}^2 : 0 \le a \le 1, 1 \le b < \infty \}, \end{aligned}$$
(26)

we can further estimate (as \({|{\mathfrak {h}}f|}\) is a deterministic quantity in the case \(n = 1\), bounded from below by \(b - a\))

$$\begin{aligned} {{\,\mathrm{Ent}\,}}\big (f^2(X)\big ) \le 2(b-a)^2 \le 2 {{\,\mathrm{\mathbb {E}}\,}}{|{\mathfrak {h}}f|}^2. \end{aligned}$$

To prove (26), define

$$\begin{aligned} g(a,b) :=a^2 \log a^2 + \big (1-a^2\big ) + \log b^2 \big (1-a^2\big ) - 2(b-a)^2. \end{aligned}$$

Now it is easy to see that \(g(a,1) = a^2 \log a^2 + (1-a^2) - 2(1-a)^2 \le 0,\) since \(\partial _ag(a,1) = 4(a \log a + 1 - a) \ge 0\) for \(a \in [0,1]\) (the function \(a \mapsto a \log a + 1 - a\) is decreasing on \((0,1]\) and vanishes at \(a = 1\)) and \(g(1,1) = 0\). Moreover

$$\begin{aligned} \partial _b g(a,b) = - \frac{2}{b}\left( b^2 - 1 + (a-b)^2 \right) \le 0, \end{aligned}$$

so that g is decreasing on every half-line \(\{ a_0 \} \times [1,\infty )\), and thus \(g(a,b) \le 0\) for all \((a,b) \in G\). This finishes the proof for \(n=1\).

For arbitrary n, the proof is now easily completed. Assume that \(f \in L^\infty (\mu )\); then for \(\mu _{i^c}\)-almost every \(x_{i^c}\) we have \(f(x_{i^c}, \cdot ) \in L^\infty (\mu (\cdot \mid x_{i^c}))\). For these \(x_{i^c}\), the case \(n=1\) therefore yields

$$\begin{aligned} {{\,\mathrm{Ent}\,}}_{\mu (\cdot \mid x_{i^c})}\big (f^2(x_{i^c}, \cdot )\big ) \le 2 \sup _{y_i', y_i''} {|f\big (x_{i^c}, y_i'\big ) - f\big (x_{i^c}, y_i''\big )|}^2. \end{aligned}$$

Plugging this into the assumption leads to

$$\begin{aligned} {{\,\mathrm{Ent}\,}}_{\mu }\big (f^2\big ) \le 2C \int \sum _{i = 1}^n \sup _{y_i', y_i''} {|f\big (x_{i^c}, y_i'\big ) - f\big (x_{i^c}, y_i''\big )|}^2 {\text {d}}\mu _{i^c}(x_{i^c}) = 2C \int {|{\mathfrak {h}}f|}^2 {\text {d}}\mu . \end{aligned}$$

As for the second part, it is a classical fact that independent random variables satisfy the tensorization property (i. e. \(\text {AT}(1)\)), see for example [38, Proposition 5.6], [16, Theorem 4.10] or [54, Theorem 3.14]. In the case of independent random variables, the assumption that \({\mathcal {Y}}\) is a product of Polish spaces can be dropped by simply defining \(\mu (\cdot \mid x_{i^c}) :=\mu _i = {{\,\mathrm{\mathbb {P}}\,}}\circ X_i^{-1}\). \(\square \)