General Bernstein-like inequality for additive functionals of Markov chains

Using the renewal approach we prove Bernstein-like inequalities for additive functionals of geometrically ergodic Markov chains, thus obtaining counterparts of inequalities for sums of independent random variables. The coefficient in the sub-Gaussian part of our estimate is the asymptotic variance of the additive functional, i.e. the variance of the limiting Gaussian variable in the Central Limit Theorem for Markov chains. This refines earlier results by R. Adamczak and W. Bednorz, which were obtained under the additional assumption of strong aperiodicity of the chain.


Introduction
Throughout this paper we assume that Υ = (Υ_n)_{n∈N} is a Markov chain defined on a probability space (Ω, F, P), taking values in a measurable (countably generated) space (X, B), with transition function P : X × B → [0, 1]. Moreover, we assume that Υ is ψ-irreducible, aperiodic and admits a unique invariant probability measure π. As usual, for any initial distribution µ on X we will write P_µ(Υ ∈ ·) for the distribution of the chain with Υ_0 distributed according to the measure µ. We will denote by δ_x the Dirac mass at x and, to shorten the notation, we will write P_x instead of P_{δ_x}.
We say that Υ is geometrically ergodic if there exist a constant ρ ∈ (0, 1) and a function G : X → R such that for every starting point x ∈ X and n ∈ N,

‖P^n(x, ·) − π(·)‖_{TV} ≤ G(x)ρ^n,   (1.1)

where ‖·‖_{TV} denotes the total variation norm of a measure and P^n(·, ·) is the n-step transition function of the chain. For equivalent conditions, we refer to Chapter 15 of [25]. We will be interested in tail inequalities for sums of random variables of the form ∑_{i=0}^{n−1} f(Υ_i), where f : X → R is a measurable real function and x ∈ X is a starting point. Although our main results, stated in Section 4, do not require f to be bounded, we give here a version in the bounded case for the sake of simplicity. This version will be easier to compare with the Bernstein inequality for bounded random variables stated in Section 2 (cf. Theorem 2.1). Below, for convenience, we set log(·) = ln(· ∨ e), where ln(·) is the natural logarithm.
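As a toy numerical illustration of condition (1.1), the sketch below (which assumes a hypothetical two-state chain, not an example taken from this paper) computes the total-variation distance to stationarity and checks that it decays geometrically:

```python
import numpy as np

# Hypothetical two-state chain used only to illustrate (1.1): transition
# matrix P, stationary distribution pi, and the geometric decay of
# ||P^n(x, .) - pi||_TV.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])          # solves pi P = pi for this P
assert np.allclose(pi @ P, pi)

rho = abs(1 - 0.1 - 0.4)           # second eigenvalue of P: here 0.5

def tv_from_pi(x, n):
    """||P^n(x, .) - pi||_TV for starting state x (TV taken as half the L1 norm)."""
    row = np.linalg.matrix_power(P, n)[x]
    return 0.5 * np.abs(row - pi).sum()

# For a two-state chain the decay is exactly geometric with rate rho.
for n in range(1, 8):
    assert np.isclose(tv_from_pi(0, n) / tv_from_pi(0, n - 1), rho)
```

For this chain one can even take G(x) constant, since P^n(x, ·) − π is proportional to ρ^n for both starting states.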

Theorem 1.1 (Bernstein-like inequality for Markov chains). Let Υ be a geometrically ergodic
Markov chain with state space X and let π be its unique stationary probability measure. Moreover, let f : X → R be a bounded measurable function such that E_π f = 0. Furthermore, let x ∈ X. Then we can find constants K, τ > 0 depending only on x and the transition probability P(·, ·) such that for all t > 0,

P_x( |∑_{i=0}^{n−1} f(Υ_i)| ≥ t ) ≤ K exp( − t² / ( τ ( nσ²_Mrv + t‖f‖_∞ log n ) ) ),   (1.2)

where σ²_Mrv = Var_π(f(Υ_0)) + 2 ∑_{i=1}^∞ Cov_π(f(Υ_0), f(Υ_i)) denotes the asymptotic variance of the process (f(Υ_i))_i.
Remark 1.2. We refer to Theorem 4.3 for a more general counterpart of Theorem 1.1 and to Theorem 4.4 for explicit formulas for K and τ .
Let us comment briefly on the method of proof. We rely on the by now classical regeneration technique of Athreya-Ney and Nummelin (see [5,26,25]), which allows one to split the sum in question into a random number of one-dependent blocks of random lengths. In the context of tail inequalities this approach has been successfully used, e.g., in [1,2,13,7,12,16] and provides Bernstein inequalities of optimal type under the additional assumption of strong aperiodicity of the chain (corresponding to m = 1 in (3.1) below), which ensures that the blocks are independent and allows for a reduction to inequalities for sums of i.i.d. random variables. In the general case, however, the implementations of this method available in the literature lead to a loss of correlation structure and, as a consequence, to a suboptimal sub-Gaussian coefficient in Bernstein's inequality (in place of σ²_Mrv). Our main technical contribution is a regeneration-based approach which preserves the correlation structure and recovers the correct asymptotic behaviour, corresponding to the CLT for Markov chains.
The organization of the article is as follows. After a brief discussion of our results (Section 2) we introduce the notation and provide a short description of the regeneration method (Section 3). Next we state our main theorems in their full strength (Section 4). At the end we present their proofs (Section 7). Along the way we develop auxiliary theorems for one-dependent random variables (Section 5) and bounds on the number of regenerations (Section 6). Some technical lemmas concerning exponential Orlicz norms are deferred to the Appendix.

Discussion of the main result
Let us start by recalling the Bernstein inequality in the i.i.d. bounded case.
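For the reader's convenience, a standard formulation reads as follows (the constants may differ from those in the precise statement of Theorem 2.1):

```latex
% Classical Bernstein inequality: X_1,\dots,X_n independent,
% \mathbb{E}X_i = 0, |X_i| \le M a.s., \sigma^2 = \operatorname{Var}(X_i).
\mathbb{P}\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \ge t\right)
  \le 2\exp\left(-\frac{t^2}{2\left(n\sigma^2 + Mt/3\right)}\right),
  \qquad t > 0.
```

The sub-Gaussian regime is governed by nσ², the sub-exponential regime by M; Theorem 1.1 is the Markovian analogue, with σ² replaced by σ²_Mrv and M inflated by a factor of order log n.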
Let us recall that the CLT for Markov chains (see, e.g., [11,26,25]) guarantees that under the assumptions and notation of Theorem 1.1 the sums n^{−1/2} ∑_{i=0}^{n−1} f(Υ_i) converge in distribution to a centered Gaussian variable with variance σ²_Mrv. Thus, the inequality obtained in Theorem 1.1 reflects (up to constants) the asymptotic normal behavior of the sums n^{−1/2} ∑_{i=0}^{n−1} f(Υ_i), just as the classical Bernstein inequality does in the i.i.d. context. Furthermore, the term log n which appears in our inequality is necessary. More precisely, one can show that if the inequality

P_x( |∑_{i=0}^{n−1} f(Υ_i)| ≥ t ) ≤ const(x) exp( − const · t² / (nσ² + t a_n ‖f‖_∞) )   (2.1)

holds for all t > 0, for some a_n = o(n) and σ ∈ R (const's stand for some absolute constants, whereas const(x) depends only on x and the Markov chain), then one must have σ² ≥ const · σ²_Mrv. Moreover, it is known that for some geometrically ergodic chains a_n must grow at least logarithmically with n (see [1], Section 3.3).
Concentration inequalities for Markov chains and processes have been thoroughly studied in the literature; a (non-comprehensive) list of works concerning this topic includes [1,2,13,7,12,15,16,17,18,20,21,22,23,27,29,34]. Some results are devoted to concentration for general functions of the chain (they are usually obtained under various Lipschitz or bounded difference type conditions), others specialize to additive functionals, which are the object of study in our case. Tail inequalities for additive functionals are usually counterparts of Hoeffding or Bernstein inequalities. The former do not take into account the variance of the additive functional and are expressed in terms of ‖f‖_∞ only. They can often be obtained as special cases of concentration inequalities for general functions (see, e.g., [15,27,29]). Bernstein type estimates of the form (2.1) are considered, e.g., in [1,2,13,7,12,16,17,20,21,22,23,27,34] and use various variance proxies σ², which do not necessarily coincide with the limiting variance σ²_Mrv. In the continuous time case, inequalities of Bernstein type for the natural counterpart of the additive functional, involving the asymptotic variance, have been obtained under certain spectral gap or Lyapunov type conditions in [17,20]. For discrete time Markov chains, the inequalities obtained in [1,2,7,12,16] by the regeneration method give (2.1) (under various types of ergodicity assumptions and with various parameters a_n) with a σ² which coincides with σ²_Mrv only under the additional assumption of strong aperiodicity of the chain. On the other hand, the articles [22,23,29,34] provide more general results, valid for not necessarily Markovian sequences of random variables satisfying various types of mixing conditions. The variance proxies σ² used in these references are close to the asymptotic variance, but in general do not coincide with it.
For instance, the inequality obtained in [22], which is valid in particular for geometrically ergodic chains, uses (in our notation) a certain larger variance proxy σ². Comparing with (1.2), one can see that σ²_Mrv ≤ σ². In fact one can construct examples in which the ratio between the two quantities is arbitrarily large, or even σ²_Mrv = 0 while σ² > 0. The reference [34] provides an inequality for uniformly geometrically ergodic processes, involving a certain implicitly defined variance proxy σ²_n, which may be bounded from above by the σ² from [22] or by Var_π(f(Υ_0)) + C‖f‖_∞ E_π|f(Υ_0)|, where C is a constant depending on the mixing properties of the process. For a fixed process, in the non-degenerate situation when the asymptotic variance is non-zero, it can be substituted for σ²_n at the cost of introducing additional multiplicative constants depending on the chain and the function f.
To the best of our knowledge Theorem 1.1 is therefore the first tail inequality available for general geometrically ergodic Markov chains (not necessarily strongly aperiodic), which (up to universal constants) reflects the correct limiting Gaussian behavior of additive functionals. The problem of obtaining an inequality of this type was posed in [2]. Let us remark that quantitative investigation of problems related to the Central Limit Theorems for general aperiodic Markov chains seems to be substantially more difficult than for chains which are strongly aperiodic. For instance optimal strong approximation results are still known only in the latter case [24].

Notation and basic properties
For any k, l ∈ Z, k ≤ l, we define the integer interval of consecutive integers [k, l] := {k, k + 1, . . . , l}. For any process X = (X_i)_{i∈N} and S ⊂ N we put X_S := (X_i)_{i∈S}. Moreover, for k ∈ N we define the corresponding vectorized process X^{(k)} = (X^{(k)}_n)_{n∈N} by X^{(k)}_n := (X_{nk}, . . . , X_{(n+1)k−1}).

Definition 3.1 (Stationarity). We say that a process (X_n)_{n∈N} is stationary if for any k ∈ N the shifted process (X_{n+k})_{n∈N} has the same distribution as (X_n)_{n∈N}.
Definition 3.2 (m-dependence). Fix m ∈ N. We say that (X n ) n∈N is m-dependent if for any k ∈ N the process (X n ) n≤k is independent of the process (X n ) n≥m+1+k .
Remark 3.3. Let us note that a process (X_n)_{n∈N} is 0-dependent iff the variables (X_n)_{n∈N} are independent. Finally, let us give a natural example of a 1-dependent process. Fix a sequence (ξ_n)_{n∈N} of independent random variables and a Borel function h : R² → R. Then (h(ξ_n, ξ_{n+1}))_{n∈N} is 1-dependent. Such processes are called two-block factors. It is worth noting that there are 1-dependent processes which are not two-block factors (see [10]).
Remark 3.4. Assume that a process (X n ) n∈N is m-dependent. Then for any n 0 ∈ N the process (X n 0 +k(m+1) ) k∈N is independent. Moreover, if the process (X n ) n∈N is stationary then for any n 0 ∈ N, (X n 0 +k(m+1) ) k∈N is a collection of i.i.d. random variables.
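A quick numerical sanity check of Remarks 3.3 and 3.4, using an assumed toy two-block factor h(u, v) = u + v (an illustrative choice, not an example from the paper): adjacent terms of the resulting 1-dependent process are correlated, while the subsequence sampled with gap m + 1 = 2 behaves like an i.i.d. sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-block factor (Remark 3.3): X_n = h(xi_n, xi_{n+1}) with i.i.d. xi,
# here with the toy choice h(u, v) = u + v.
n = 200_000
xi = rng.standard_normal(n + 1)
X = xi[:-1] + xi[1:]              # X_i = xi_i + xi_{i+1}, a 1-dependent process

def emp_cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

# Adjacent terms are correlated: Cov(X_i, X_{i+1}) = Var(xi_0) = 1 ...
assert abs(emp_cov(X[:-1], X[1:]) - 1.0) < 0.05
# ... but, as in Remark 3.4 with m = 1, terms at lag m + 1 = 2 are
# independent, so the empirical lag-2 covariance vanishes.
assert abs(emp_cov(X[:-2], X[2:])) < 0.05
```

This is exactly the blocking device used later in Section 5: splitting a 1-dependent sum into m + 1 = 2 subsums of independent terms.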

Split chain
As already mentioned in the introduction, our proofs will be based on the regeneration technique, which was invented independently by Nummelin and Athreya-Ney (see [5] and [26]) and was popularized by Meyn and Tweedie [25]. We will introduce the split chain and then the regeneration times of the split chain. The construction of the split chain is well known; as references we recommend [25] (Chaps. 5, 17) and [26]. We briefly recall this technique below. Let us stress that although this construction is based on the one presented in [25], our notation is slightly different. First, let us recall the minorization condition for Markov chains, which plays a central role in the splitting technique.
Definition 3.5. We say that a Markov chain Υ satisfies the Minorization Condition if there exist a set C ∈ B(X) (called a small set), a probability measure ν on X (a small measure), a constant δ > 0 and a positive integer m ∈ N such that π(C) > 0 and

P^m(x, B) ≥ δν(B)   (3.1)

holds for all x ∈ C and B ∈ B(X).
Remark 3.6. One can assume that ν(C) = 1 (possibly at the cost of increasing m).
Remark 3.7. One can check that under the assumptions of our theorems the Minorization Condition (3.1) holds for some C, ν, δ and m. We refer to [25], Section 5.2 for the proof of this fact.
Fix C, m, ν and δ > 0 as in (3.1). The minorization condition allows us to redefine the chain Υ together with an auxiliary regeneration structure. More precisely, we start with a splitting of the space X into two identical copies, on level 0 and level 1; namely, we consider the split space X × {0, 1}. Now we split Υ in the following way. We consider a process Φ = (Υ, Λ) = (Υ_i, Λ_i)_{i≥0} (usually called the split chain) defined on X × {0, 1} (we slightly abuse the notation by denoting the first coordinate of the split chain with the same letter as the initial Markov chain, but it will turn out that the first coordinate of the split chain has the same distribution as the original chain, so this notation is justified). The random variables Λ_k take values in {0, 1} (they indicate the level on which Φ_k is). For a fixed x ∈ C let r(x, y) = δ (dν/dP^m(x, ·))(y), and note that this Radon-Nikodym derivative is well-defined thanks to (3.1); moreover, r(x, y) ≤ 1. Now for any A_1, . . . , A_m ∈ B(X), k ∈ N and i ∈ {0, 1} the transition probabilities of the split chain are defined so that the level coordinate is updated only at times which are multiples of m; accordingly, for any k, i ∈ N such that km < i < (k + 1)m the level remains unchanged. Remark 3.8 (Initial distribution for the split chain). In order to set the initial distribution of the split chain for an arbitrary probability measure µ on X we define the split measure µ* on X × {0, 1}. This definition ensures that (Υ_0, Λ_0) ∼ µ* as soon as Υ_0 ∼ µ. For convenience, for any x ∈ X, we will write P_{x*}(·) = P_{δ_x*}(·). (3.7) Remark 3.9 (Markov-like properties of the split chain). In order to give some intuition behind the definition of the split chain, note that the distribution of the first coordinate of the split chain Φ with initial distribution µ* coincides with that of the original Markov chain Υ started from µ. From now on Υ always denotes this first coordinate of the split chain.
One can easily generalize (3.3) to show the following Markov-like property of the split chain: for any k ∈ N and product measurable bounded function F we have This, in turn, leads to the fact that the vectorized split chain Φ (m) is a Markov chain. Even more, for any product measurable bounded function F and k ∈ N we have Now we can introduce the aforementioned regeneration structure for Φ. Firstly we define certain stopping times. For convenience we put τ −1 = −m and then for i ≥ 0 we define τ i to be the i'th time when the second coordinate (level coordinate) hits 1, namely Now we are ready to introduce the random blocks and the random block process where we consider Ξ i as a random variable with values in the disjoint union j≥0 X j . For clarity of this presentation, here and later on, we omit the measurability details.
Remark 3.10. Let us now briefly discuss the behaviour of these random blocks. Firstly, by the strong Markov property of the vectorized split chain it is not hard to see that Ξ is a Markov chain. On closer inspection one can see that for any product measurable function F the process (F(Ξ_i))_{i≥1}, where pr_m : ⋃_{j≥m} X^j → X^m denotes the projection onto the last m coordinates, is stationary (see [11], Corollary 2.4). The stationarity follows from the fact that every time the split chain is on level 1 at a time k which is a multiple of m (note that this implies Υ_k ∈ C), the split chain regenerates and starts anew from ν. Furthermore, the lengths of the blocks Ξ_i are independent random variables for i ≥ 0 and form a stationary process for i ≥ 1. Let us add that if m = 1, one can show that the Ξ_i's are independent. This fact makes a crucial difference between strongly aperiodic and not strongly aperiodic Markov chains (see [6, Section 6]).
At last let us introduce the excursions χ_i(f) and the excursion process χ = (χ_i)_{i≥0}, which will play a crucial role in our future considerations. By the properties of the random blocks one concludes that χ is 1-dependent; moreover, (χ_i)_{i≥1} is stationary. By the Pitman occupation measure formula (see [25], Theorem 17.3.1, page 428), which for any measurable real function G expresses E_π G as a normalized expectation over a single excursion (see (3.17)), and the observation that the P_µ-distribution of the excursion χ_i(f) (i ≥ 1) is equal to the P_ν-distribution of χ_0, we get the corresponding identity for any initial distribution µ and any i ≥ 1. As a consequence, E_π f(Υ_i) = 0 implies that for every i ≥ 1, E_µ χ_i(f) = 0. Now we are ready to decompose our sums into random blocks: if m | n, then ∑_{i=0}^{n−1} f(Υ_i) splits into a sum over an initial segment, a sum of complete excursions χ_i(f) and a boundary term (see (3.19)). This decomposition will be of utmost importance in our proof.

Asymptotic variances
During the upcoming proofs we will meet two types of asymptotic variances: σ²_Mrv, associated with the process (f(Υ_i))_{i≥0}, and σ²_∞, associated with χ. The first one, defined as

σ²_Mrv = Var_π(f(Υ_0)) + 2 ∑_{i=1}^∞ Cov_π(f(Υ_0), f(Υ_i)),

is exactly the variance of the limiting normal distribution of the sequence n^{−1/2} ∑_{i=0}^{n−1} f(Υ_i), whereas the second one,

σ²_∞ = lim_{n→∞} (1/n) Var(χ_1 + · · · + χ_n) = Eχ_1² + 2Eχ_1χ_2,

is the variance of the limiting normal distribution of the sequence n^{−1/2} ∑_{i=1}^n χ_i. Both asymptotic variances are very closely linked via the formula

σ²_∞ = σ²_Mrv · E(τ_1 − τ_0).   (3.22)

For the proof of this formula we refer to [25] (see (17.32), page 434).
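To make the covariance-series definition of σ²_Mrv concrete, the sketch below sums the series numerically for an assumed toy two-state chain (an illustrative assumption, not an example from the paper); for such a chain the covariances decay exactly geometrically, so the series admits a closed form to compare against:

```python
import numpy as np

# sigma^2_Mrv = Var_pi(f(Y_0)) + 2 * sum_{i>=1} Cov_pi(f(Y_0), f(Y_i))
# evaluated on a hypothetical two-state chain.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])
f = np.array([1.0, -4.0])          # chosen so that E_pi f = 0
assert abs(pi @ f) < 1e-12

var0 = pi @ f**2                   # Var_pi(f(Y_0)) = 4.0
lam = 0.5                          # second eigenvalue of P

# Partial sums of the covariance series via matrix powers.
sigma2 = var0
Pk = np.eye(2)
for _ in range(60):
    Pk = Pk @ P
    cov_k = pi @ (f * (Pk @ f))    # Cov_pi(f(Y_0), f(Y_k)), since f is centered
    sigma2 += 2 * cov_k

# Here f is an eigenvector of P, so Cov_k = lam^k * Var_pi(f) and
# sigma^2_Mrv = Var_pi(f) * (1 + lam) / (1 - lam) = 12.
assert np.isclose(sigma2, var0 * (1 + lam) / (1 - lam))
```

Note that σ²_Mrv = 12 is three times Var_π(f(Υ_0)) = 4 here: positive correlations can make the asymptotic variance much larger than the one-dimensional variance, which is why a Bernstein inequality with the proxy Var_π(f(Υ_0)) in place of σ²_Mrv would be false in general, and one with a larger proxy would be lossy.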

Main results
In order to state our results in the general form we need to recall the definition of the exponential Orlicz norm. For any random variable X and α > 0 we define

‖X‖_{ψ_α} = inf{ c > 0 : E exp(|X|^α / c^α) ≤ 2 }.

If α < 1 then ‖·‖_{ψ_α} is just a quasi-norm (for basic properties of these quasi-norms we refer to Appendix A). In what follows we will deal with various underlying measures on the state space X. In order to stress the dependence of the Orlicz norm on the initial distribution µ of the chain Φ we will sometimes write ‖·‖_{ψ_α,µ} instead of ‖·‖_{ψ_α}. Before we formulate our main result let us introduce and explain the role of the parameters a, b, c, d defined in (4.2) (cf. the decomposition (3.19)). The parameter a (resp. b) will allow us to estimate the first (resp. third) term on the right-hand side of (3.19), whereas the parameters c and d will be used to control the middle term. We note that d quantifies the geometric ergodicity of Υ and is finite as soon as Υ is geometrically ergodic. Let us mention that all these parameters can be bounded, for example, by means of drift conditions widely used in the theory of Markov chains (see Remark 4.2). Finally, let us recall that σ²_Mrv = Var_π(f(Υ_0)) + 2 ∑_{i=1}^∞ Cov_π(f(Υ_0), f(Υ_i)) denotes the asymptotic variance of normalized partial sums of the process (f(Υ_i))_i.
We are now ready to formulate the first of our main results (recall the definitions of the small set C and the minorization condition (3.1)).
Theorem 4.1. Let Υ be a geometrically ergodic Markov chain and π its unique stationary probability measure. Let f : X → R be a measurable function such that E_π f = 0 and let α ∈ (0, 1]. Moreover, assume for simplicity that m | n. Then for all x ∈ X and t > 0 the tail bound (4.3) holds, where σ²_Mrv denotes the asymptotic variance of the process (f(Υ_i))_i given by (3.21), the parameters a, b, c, d are defined by (4.2) and M = c(24α^{−3} log n)^{1/α}.
Remark 4.2. For conditions under which a, b, c are finite we refer to [2], where the authors give bounds on a, b, c under classical drift conditions. If f is bounded then one easily bounds these parameters in terms of ‖f‖_∞ and D, where D = max( d, ‖τ_0‖_{ψ_1, P_{x*}}, ‖τ_0‖_{ψ_1, P_{π*}} ). For computable bounds on D we refer to [9].
Let us note that in Theorem 4.1 the right-hand side of the inequality does not converge to 0 as t tends to infinity (one of the terms depends on n but not on t). Usually in applications t is of order at most n and the other terms dominate on the right-hand side, so this does not pose a problem. Nevertheless, one can obtain another version of Theorem 4.1, namely

Theorem 4.3. Under the assumptions and notation of Theorem 4.1 we have

Bernstein inequalities for one-dependent sequences
In this section we will prove two versions (for suprema and for randomly stopped sums) of the Bernstein inequality for one-dependent random variables. They will later be used in the proofs of our main theorems. In what follows, for a one-dependent sequence of random variables (X_i)_{i≥0}, σ²_∞ denotes the asymptotic variance of its normalized partial sums.

Lemma 5.1 (Bernstein inequality for suprema of partial sums). Let (X_i)_{i≥0} be a 1-dependent sequence of centered random variables such that E exp(c^{−α}|X_i|^α) ≤ 2 for some α ∈ (0, 1] and c > 0. Assume that there exists a filtration (F_i)_{i≥0} satisfying conditions 0)-3) below, so that in particular (5.1) holds. Then, for any t > 0 and n ∈ N, the tail bound (5.2) holds with v_{n,m} = 5(m + 1)(n + m + 1), w_{n,m} = 2(m + 1)(24α^{−3} log n)^{1/α} c, K_m = 2(m + 1) exp(8) and L_m = 2(m + 1).
Define the "bounded" part of X_i, B_i = X_i 1_{|X_i| ≤ M_0}, and decompose X_i accordingly. Using the union bound (with p = 1/6) we split the tail probability. Consider first the unbounded part: using the subadditivity of x ↦ x^α, Markov's inequality and then (5.3) we obtain the required exponential bound. As for the "bounded" part, notice that Var(B_i) ≤ EB_i² ≤ EX_i² = σ²_∞. Therefore, using the classical Bernstein inequality, we obtain the corresponding sub-Gaussian estimate. Combining the three last estimates and substituting p = 1/6 allows us to finish the proof for independent random variables.
We will now use the independent case to prove the tail estimate (5.2), assuming (5.1), the proof of which we postpone. Note that (5.2) is trivial unless t ≥ w m log (2(m + 1)) (as the right-hand side exceeds 1). Therefore from now on we will consider only t satisfying this lower bound. In particular, setting p = 1/5, we have t ≥ 2 p (2/α) By another application of the union bound together with Lemma A.5 and stationarity of Notice that where the inequality is a consequence of the estimate t ≥ 4 1 α 2c In order to deal with P (| n i=1 Z i | > t(1 − p)) we start with splitting this sum into m + 1 parts and using the union bound, namely Now, to each summand on the right-hand side of the above inequality we will apply the estimate for the independent case obtained at the beginning of this proof. Setting M = (24α −3 log n) 1 α c and taking into account (5.1) we obtain . (5.6) Finally using (5.4), (5.5) and (5.6) we get To conclude (5.2) it is now enough to note that the second summand on the right-hand side above dominates the first one.
To finish the proof of the lemma it remains to show (5.1). Firstly, we address the variance of Z_i, which can easily be computed using the properties of conditional expectation (recall the notation E_i(·) = E(· | F_i)). The variance formula in (5.1) then follows by observing the cancellations due to assumption 3). Next we demonstrate the upper bound on ‖Z_i‖_{ψ_α} in (5.1): using the triangle inequality (cf. Lemma A.1) twice and then Lemma A.3 we obtain the desired estimate. This concludes the proof of the lemma.
Remark 5.2. If (X_i)_{i≥0} is a 1-dependent, centered and stationary Markov chain such that ‖X_i‖_∞ ≤ M < ∞, then the assumptions of the above lemma are satisfied with m = 2. If (ξ_i)_{i∈N} are independent random variables and f : R² → R is a bounded Borel function such that the variables X_i = f(ξ_i, ξ_{i+1}) are centered, then we can take F_i = σ{ξ_j | j ≤ i + 1} and notice that the assumptions of the above lemma are satisfied with m = 1.

Remark 5.3.
It is worth noticing that σ²_∞ may be equal to 0 for 1-dependent processes (X_i)_{i∈N}. Take for example X_i = ξ_{i+1} − ξ_i, where (ξ_i)_{i∈N} are i.i.d. random variables. It turns out (cf. [30]) that the converse is also true: if for a 1-dependent, bounded, stationary process (X_i)_{i∈N} we have σ²_∞ = 0, then there exists an i.i.d. process (ξ_i)_{i∈N} such that X_i = ξ_{i+1} − ξ_i.
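The degenerate example from Remark 5.3 can be checked numerically; in the sketch below (with standard normal ξ_i, an illustrative assumption) the partial sums telescope, so their variance stays bounded in n and σ²_∞ = 0:

```python
import numpy as np

# Degenerate 1-dependent process X_i = xi_{i+1} - xi_i: the partial sums
# telescope, sum_{i=0}^{n-1} X_i = xi_n - xi_0, so Var(S_n) = 2 Var(xi_0)
# does not grow with n and sigma^2_inf = lim Var(S_n)/n = 0.
rng = np.random.default_rng(1)
reps, n = 50_000, 100
xi = rng.standard_normal((reps, n + 1))
S = xi[:, -1] - xi[:, 0]          # the telescoped partial sum, for each replica

assert abs(S.var() - 2.0) < 0.1   # Var(S_n) = 2, independent of n
assert S.var() / n < 0.05         # hence the asymptotic variance vanishes
```

This is why the sub-Gaussian coefficient σ²_∞ in Lemma 5.1 can genuinely be 0, and why the sub-exponential correction term in the bound cannot be dropped.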

Lemma 5.4 (Bernstein inequality for random sums).
Let (X_i)_{i≥0} be a 1-dependent sequence of centered random variables such that E exp(c^{−α}|X_i|^α) ≤ 2 for some α ∈ (0, 1] and c ≥ 1. Moreover, let N be an N-valued random variable bounded by n ∈ N. Assume that we can find a filtration F = (F_i)_{i≥0} satisfying conditions analogous to those of Lemma 5.1. Then for any t > 0 and a > 0 the tail bound (5.8) holds. Proof. Observe that conditions 0) and 3) imply 2-dependence of the process (Z_i)_{i≥1}. Therefore the filtration F satisfies all the assumptions of Lemma 5.1 and thus (5.1) holds. Note also that without loss of generality we may assume that t ≥ w log 9 (otherwise the right-hand side of (5.8) is at least one). Fix s = (8√2 log 9)^{−1}. Using the union bound we split the tail probability into two parts, cf. (5.9). Using Lemma A.5, ts/2 ≥ c(2/α)^{1/α}, t ≥ w log 9 and n exp(−(st)^α / (4(2c)^α)) ≤ 1, we bound the first of them. Next we take care of the other term on the right-hand side of (5.9). Firstly we split the sum into three subsums, using the union bound. Now we consider the jth summand of the resulting sum. Let us take r = 3(8√2 log 9)^{−1} and notice that there is a function f_j : N → N such that for any n ∈ N, ⌊n/3⌋ ≤ f_j(n) ≤ ⌈n/3⌉. Due to ‖Z_i‖_{ψ_α} ≤ c(8/α)^{1/α} (cf. (5.1)) and Lemma A.4, along with t ≥ w log 9 and n ≥ 2 (for n = 1 the result of the lemma is trivial), we control the error terms. To handle the first summand on the right-hand side of (5.12), let us fix j and denote γ_i := Z_{3i+3−j}, G_i := F_{3i−j}, T := ⌈N/3 + 1⌉ ≤ ⌈n/3⌉ + 1. Using the assumptions on the filtration F and (5.1) it is straightforward to check that the following properties hold: T is a stopping time with respect to the filtration (G_i). This is precisely the setting of Proposition 4.4.
ii) from [2], which applied with ε := 1, p := and q := √2 gives that for any a > 0 the estimate (5.14) holds. Using (5.1), Lemma A.2 with Y = αZ^α/(8c^α) and β = 2/α, together with the gamma function estimate Γ(x) ≤ (x/2)^{x−1} for x ≥ 2 (see Theorem 1 in [19]), we get a bound which implies that σ_∞ ≤ (2/3)M and, as a consequence, (5.14) reduces to a simpler form. Combining the resulting inequality with (5.9)-(5.13) we obtain the desired bound. To conclude, it is now enough to recall that r = 3(8√2 log 9)^{−1}, s = (8√2 log 9)^{−1} and do some elementary calculations.

Bounds on the number of regenerations
We will now obtain a bound on the stopping time N introduced in (3.20). To this end we will use the ψ_1 version of the Bernstein inequality (Lemma 6.1), which follows easily from the classical moment version of this inequality (see, e.g., Lemma 2.2.11 in [35]) by observing that the ψ_1 condition controls all moments of order k ≥ 2.

Lemma 6.2. If ‖τ_1 − τ_0‖_{ψ_1} ≤ d then for any p > 0 the tail bound (6.1) holds, where K_p = L_p + 16/L_p and L_p = 16/p + 20. Moreover, the function p ↦ K_p is decreasing on R_+ (in particular K_p ≥ K_∞ = 104/5) and if p = 2/3 then (1/p)K_p ≤ 67.
Proof. For convenience, let T_i = τ_i − τ_{i−1} for i ≥ 1. Firstly, notice that without loss of generality we may assume that np ≥ L_p ET_1; indeed, otherwise, using ET_1 ≤ d, the claimed bound follows directly. Thus, from now on we consider n such that np ≥ L_p ET_1.
Now we have ‖T_{i+1} − ET_{i+1}‖_{ψ_1} ≤ 2d, so using Lemma 6.1, ET_1 ≤ d and np ≥ L_p ET_1 we obtain (6.1), which finishes the proof. The properties of K_p follow from easy computations.
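To illustrate the assumption ‖τ_1 − τ_0‖_{ψ_1} ≤ d of Lemma 6.2, the sketch below takes a hypothetical chain whose regeneration gap T is geometric with success probability 0.1 (an illustrative assumption, not a construction from the paper), evaluates E exp(T/c), and locates the ψ_1 norm numerically:

```python
import math

# For T ~ Geometric(a) on {1, 2, ...} we have P(T > k) = (1-a)^k, so
# E exp(T/c) is finite for every c > 1/log(1/(1-a)) and
# ||T||_{psi_1} = inf{c > 0 : E exp(T/c) <= 2}.
a = 0.1

def exp_moment(c, kmax=200):
    """Truncated E exp(T / c) for T ~ Geometric(a) on {1, 2, ...}."""
    return sum(math.exp(k / c) * a * (1 - a) ** (k - 1)
               for k in range(1, kmax + 1))

# Solving E exp(T/c) = 2 in closed form gives ||T||_{psi_1} = 1/log(20/19),
# roughly 19.5, so c = 25 is admissible while c = 5 is not.
assert exp_moment(25.0) <= 2.0
assert exp_moment(5.0) > 2.0
```

Geometric (hence exponential) tails of the regeneration gaps are exactly what geometric ergodicity buys, and the parameter d of Lemma 6.2 plays the role of the admissible c above.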
The following lemma is a standard consequence of the tail estimates given in Lemma 6.2. Its proof, based on integration by parts, is analogous to that of Lemma 5.4 in [2] and is therefore omitted.


Proofs of the main results

In this section we will prove our main results. The structure of the proofs of Theorems 4.1 and 4.3 is similar, and they share a common part, which we present first in Sections 7.1 and 7.2. The proof of Theorem 4.1 will be concluded in Section 7.3 and the proof of Theorem 4.3 in Section 7.4. Theorem 4.4 will be obtained as a corollary to Theorem 4.1 in Section 7.5.
Let us thus pass to the proofs of Theorems 4.1 and 4.3. Assume that m | n. The argument is based on the approach of [1] and [2] (see also [12] and [16]) and relies on the decomposition

∑_{i=0}^{n−1} f(Υ_i) = H_n + M_n + T_n,

where H_n is the sum over the initial segment preceding the first regeneration, M_n is the sum of the complete excursions χ_i(f) and T_n is the remaining boundary term. The proof is divided into three main steps. In the first two (common for both theorems) we obtain easy bounds on the tails of H_n and T_n. The main, third step is devoted to obtaining two different estimates on the tail of M_n. To this end we use Lemmas 5.1 and 6.2 (for the proof of Theorem 4.1) and Lemmas 5.4 and 6.3 (for Theorem 4.3).
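The decomposition into H_n, M_n and T_n can be sketched numerically in the simplest possible setting: a chain with a true atom, where every visit to the atom is a regeneration time (a hypothetical two-state chain with m = 1; this is an illustrative assumption, not the split-chain construction itself):

```python
import numpy as np

# Regeneration decomposition S_n = H_n + M_n + T_n on a toy two-state
# chain in which every visit to state 1 is a regeneration time:
#   H_n : sum up to and including the first regeneration tau_0,
#   M_n : sum of the complete excursions (tau_j, tau_{j+1}],
#   T_n : boundary term after the last regeneration.
rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
f = np.array([1.0, -4.0])
n = 10_000

Y = np.empty(n, dtype=int)
Y[0] = 0
for i in range(1, n):
    Y[i] = rng.choice(2, p=P[Y[i - 1]])

S_n = f[Y].sum()

taus = np.flatnonzero(Y == 1)             # regeneration times tau_0 < tau_1 < ...
H = f[Y[: taus[0] + 1]].sum()
M = sum(f[Y[taus[j] + 1 : taus[j + 1] + 1]].sum()
        for j in range(len(taus) - 1))    # complete excursions
T = f[Y[taus[-1] + 1 :]].sum()

assert np.isclose(S_n, H + M + T)         # the decomposition is exact
```

With a true atom (m = 1) the excursion sums entering M_n are i.i.d.; in the general case m > 1 treated in the paper they are only one-dependent, which is exactly why the lemmas of Section 5 are needed.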

Estimate on T_n
By repeating verbatim the easy argument presented in the proof of Theorem 5.1 in [2], we obtain the required estimate on the tail of T_n; we skip the details.

Proof of Theorem 4.1
Recall that M = c(24α −3 log n) 1 α and note that without loss of generality we can assume that t ≥ 8M log 6. Otherwise (4.3) is trivial as the right hand side is greater than or equal to 1. Fix p = 2/3. We have (A := (p + 1)n(E(τ 1 − τ 0 )) −1 ) To control the first summand on the right-hand side of the above inequality we will apply Lemma 5.1 with m = 2, X i := χ i = F (Ξ i+1 ) (cf. (3.16)), c := c and n := A. Assuming that the assumptions of the lemma are satisfied (we will verify them later on), we obtain (in the first line we use stationarity of (Ξ i ) i≥1 ), Recall that by (3.22), σ 2 ∞ = σ 2 M rv E(τ 1 − τ 0 ). We will now obtain a comparison between σ 2 ∞ and tM , which will allow us to reduce the above estimate to one in which the subgaussian coefficient is expressed only in terms of σ 2 M rv . Thanks to Lemma A.2 applied with Λ := (χ 1 /c) α and β := 2/α we have where the last inequality is a consequence of equation 4 in [19]. Moreover, recalling the definition of M and using the assumption t ≥ 8 log(6)M , we obtain tM ≥ 8 log(6)M 2 = 8 log(6)c 2 (24α −3 log(n)) 2 α ≥ 16 · 8 log(6)c 2 (2/α) 2 α +1 ≥ 76σ 2 ∞ . The last inequality in combination with (7.6) yields In order to justify the above inequality it remains to verify the assumptions of Lemma 5.1. To this end take We will now strongly rely on the properties of the stationary sequence of one-dependent blocks (Ξ i ) i≥1 stated in Remark 3.10 together with (3.15) and (3.16). Since χ i = F (Ξ i+1 ), the assumption 0) of Lemma 5.1 is trivially satisfied. To prove 2), observe that E(χ i+1 |F i ) = E(χ i+1 |Ξ i+1 ) = G(Ξ i+1 ) for some measurable function G, and so the sequence is stationary (as a function of the stationary sequence (Ξ i ) i≥1 ). The sequence (Ξ i )) i≥1 is 1-dependent, which clearly implies that (Z i ) i≥1 is 2-dependent, i.e., the assumption 2) of the lemma. 
The assumption 3), i.e., the stationarity of the sequence (E(χ i |F i−1 )) i≥1 = (G(Ξ i )) i≥1 follows again by stationarity of (Ξ i ) i≥1 . Finally, using once more the fact that (Ξ i ) i≥0 is one-dependent, we obtain that for any i ≥ 1 the random variable E(χ i |F i−1 ) = G(Ξ i ) is independent of χ i+1 = F (Ξ i+2 ), which ends the verification of the assumptions of Lemma 5.1 and proves (7.7).
Appendix A. Exponential Orlicz norms

Furthermore, if β ∈ N then one can replace the constant 2 with 1.
Proof. If β is a natural number then the claim follows from Taylor's expansion of exp(x). The general case is obtained by Markov's inequality. The next lemma allows one to pass from the ψ_α-norm of a random variable to the norm of its conditional expectation.
Now we give two concentration inequalities valid for random variables with finite Orlicz norm. The first one is an easy consequence of Markov's inequality, and therefore we omit the proof.
Lemma A.4. For any random variable X with ‖X‖_{ψ_α} < ∞ and t > 0 we have the standard tail bound. Lemma A.5 (Tail inequality for the conditional mean). Let 0 < α ≤ 1. Assume that a random variable X satisfies ‖X‖_{ψ_α} < ∞. Moreover, let F be a sigma-field. Then for any t ≥ (2/α)^{1/α}‖X‖_{ψ_α} we have the corresponding tail bound. Proof. Fix c > ‖X‖_{ψ_α} and t ≥ (2/α)^{1/α}c. Then in particular α(t/c)^α ≥ 2. Using the Markov and Jensen inequalities along with Γ(x) ≤ x^x/e^{x−1} ([19], Thm. 1) and Lemma A.2 with Y = (|X|/c)^α, β = t^α/c^α, we get the desired estimate, where in the last step we use the inequality xe^{−x} ≤ e^{−x/2}, valid for all x ∈ R. Now it is enough to take the limit c → ‖X‖_{ψ_α} and notice that 2e ≤ 6.