General Bernstein-Like Inequality for Additive Functionals of Markov Chains

Using the renewal approach, we prove Bernstein-like inequalities for additive functionals of geometrically ergodic Markov chains, thus obtaining counterparts of inequalities for sums of independent random variables. The coefficient in the sub-Gaussian part of our estimate is the asymptotic variance of the additive functional, i.e., the variance of the limiting Gaussian variable in the central limit theorem for Markov chains. This refines earlier results by Adamczak and Bednorz, obtained under the additional assumption of strong aperiodicity of the chain.


Introduction
Throughout this paper, we assume that ϒ = (ϒ_n)_{n∈N} is a Markov chain defined on a probability space (Ω, F, P), taking values in a measurable (countably generated) space (X, B), with a transition function P : X × B → [0, 1]. Moreover, we assume that ϒ is ψ-irreducible and aperiodic and admits a unique invariant probability measure π. As usual, for any initial distribution μ on X, we will write P_μ(ϒ ∈ ·) for the distribution of the chain with ϒ_0 distributed according to the measure μ. We will denote by δ_x the Dirac mass at x and, to shorten the notation, we will write P_x instead of P_{δ_x}. We say that ϒ is geometrically ergodic if there exist a positive number ρ < 1 and a real function G : X → R such that for every starting point x ∈ X and n ∈ N,

‖P^n(x, ·) − π(·)‖_TV ≤ G(x)ρ^n, (1.1)

where ‖·‖_TV denotes the total variation norm of a measure and P^n(·, ·) is the n-step transition function of the chain. For equivalent conditions, we refer to Chapter 15 of [22]. We will be interested in tail inequalities for sums of random variables of the form

Σ_{i=0}^{n−1} f(ϒ_i),

where f : X → R is a measurable real function and x ∈ X is a starting point. Although our main results, stated in Sect. 4, do not require f to be bounded, we give here a version in the bounded case for the sake of simplicity. This version will be easier to compare with the Bernstein inequality for bounded random variables stated in Sect. 2 (cf. Theorem 2.1). Below, for convenience, we set log(·) = ln(· ∨ e), where ln(·) is the natural logarithm.
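As a minimal finite-state sketch of (1.1) (a toy two-state kernel chosen for illustration, not taken from the paper), one can verify the geometric decay of the total variation distance directly: for a 2×2 transition matrix the rate ρ is the second eigenvalue 1 − a − b.

```python
import numpy as np

# Toy two-state chain: P^n(x, .) approaches pi at the geometric rate
# rho = 1 - a - b, illustrating the definition (1.1) of geometric ergodicity.
a, b = 0.3, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])
pi = np.array([b, a]) / (a + b)   # stationary distribution (0.4, 0.6)
rho = 1 - a - b                   # second eigenvalue = ergodicity rate 0.5

def tv_dist(mu, nu):
    """Total variation norm of the signed measure mu - nu."""
    return np.abs(mu - nu).sum()

Pn = np.eye(2)
dists = []
for n in range(1, 11):
    Pn = Pn @ P
    dists.append(tv_dist(Pn[0], pi))   # distance of P^n(0, .) from pi

ratios = [dists[i + 1] / dists[i] for i in range(len(dists) - 1)]
print(ratios)   # every ratio equals rho = 0.5 exactly for this chain
```

For this chain the decay is exactly geometric with G(x) constant; for general chains (1.1) is only an upper bound.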

Theorem 1.1 (Bernstein-like inequality for Markov chains) Let ϒ be a geometrically ergodic
Markov chain with state space X, and let π be its unique stationary probability measure. Moreover, let f : X → R be a bounded measurable function such that E_π f = 0. Furthermore, let x ∈ X. Then, we can find constants K, τ > 0 depending only on x and the transition probability P(·, ·) such that for all t > 0,

P_x( |Σ_{i=0}^{n−1} f(ϒ_i)| ≥ t ) ≤ K exp( − t² / ( K( nσ²_Mrv + t‖f‖_∞ τ log n ) ) ),

where σ²_Mrv = lim_{n→∞} (1/n) Var_π( Σ_{i=0}^{n−1} f(ϒ_i) ) denotes the asymptotic variance of the process (f(ϒ_i))_i.

Remark 1.2 We refer to Theorem 4.3 for a more general counterpart of Theorem 1.1 and to Theorem 4.4 for explicit formulas for K and τ.
Let us comment briefly on the method of proof. We rely on the by now classical regeneration technique of Athreya-Ney and Nummelin (see [3,22,23]), which allows one to split the sum in question into a random number of 1-dependent blocks of random lengths. In the context of tail inequalities, this approach has been successfully used, e.g., in [1,2,6,7,10,12] and provides Bernstein inequalities of optimal type under the additional assumption of strong aperiodicity of the chain (corresponding to m = 1 in (3.1)), which ensures that the blocks are independent and allows for a reduction to inequalities for sums of i.i.d. random variables. However, in the general case, the implementations of this method available in the literature lead to a loss of the correlation structure and, as a consequence, to a suboptimal sub-Gaussian coefficient in Bernstein's inequality (in place of σ²_Mrv). Our main technical contribution is a regeneration-based approach which preserves the correlation structure and recovers the correct asymptotic behavior, corresponding to the CLT for Markov chains.
The organization of the article is as follows. After a brief discussion of our results (Sect. 2), we introduce the notation and provide a short description of the regeneration method (Sect. 3). Next, we state our main theorems in their full strength (Sect. 4). At the end, we present their proofs (Sect. 7). Along the way, we develop auxiliary theorems for 1-dependent random variables (Sect. 5) and bounds on the number of regenerations (Sect. 6). Some technical lemmas concerning exponential Orlicz norms are deferred to the Appendix.

Discussion of the Main Result
Let us start by recalling the Bernstein inequality in the i.i.d. bounded case.
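For the reader's convenience, one standard form of this classical inequality (for independent, centered, bounded variables) is the following; this is the textbook statement, recalled here for comparison rather than quoted verbatim from Theorem 2.1:

```latex
% Classical Bernstein inequality: X_1, ..., X_n independent, centered,
% |X_i| <= M almost surely; then for all t > 0
\[
  \mathbb{P}\Bigl(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \ge t\Bigr)
  \le 2\exp\Bigl(-\frac{t^2}{2\sigma^2 + \tfrac{2}{3}Mt}\Bigr),
  \qquad \sigma^2 = \sum_{i=1}^{n}\operatorname{Var}(X_i).
\]
```

The sub-Gaussian regime is governed by the true variance σ², the sub-exponential one by M; Theorem 1.1 mirrors this shape with nσ²_Mrv and ‖f‖_∞ log n.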
Let us recall that the CLT for Markov chains (see, e.g., [9,22,23]) guarantees that under the assumptions and notation of Theorem 1.1, the sums (1/√n) Σ_{i=0}^{n−1} f(ϒ_i) converge in distribution to a centered normal variable with variance σ²_Mrv. Thus, the inequality obtained in Theorem 1.1 reflects (up to constants) the asymptotic normal behavior of the sums (1/√n) Σ_{i=0}^{n−1} f(ϒ_i), similarly as the classical Bernstein inequality does in the i.i.d. context. Furthermore, the term log n which appears in our inequality is necessary. More precisely, one can show that if the inequality

P_x( |Σ_{i=0}^{n−1} f(ϒ_i)| ≥ t ) ≤ const · exp( − t² / ( const(x)( nσ² + t a_n ‖f‖_∞ ) ) ) (2.1)

holds for all t > 0, for some a_n = o(n) and σ ∈ R (const's stand for some absolute constants, whereas const(x) depends only on x and the Markov chain), then one must have σ² ≥ const · σ²_Mrv. Moreover, it is known that for some geometrically ergodic chains a_n must grow at least logarithmically with n (see [1], Section 3.3).
Concentration inequalities for Markov chains and processes have been thoroughly studied in the literature; a (non-comprehensive) list of works concerning this topic includes [1,2,6,7,10-13,15-17,19,20,24,25,27]. Some results are devoted to concentration for general functions of the chain (they are usually obtained under various Lipschitz or bounded difference type conditions); others specialize to additive functionals, which are the object of study in our case. Tail inequalities for additive functionals are usually counterparts of Hoeffding or Bernstein inequalities. The former do not take into account the variance of the additive functional and are expressed in terms of ‖f‖_∞ only. They can often be obtained as special cases of concentration inequalities for general functions (see, e.g., [11,24,25]). Bernstein-type estimates of the form (2.1) are considered, e.g., in [1,2,6,7,10,12,13,16,17,19,20,24,27] and use various variance proxies σ², which do not necessarily coincide with the limiting variance σ²_Mrv. In the continuous time case, inequalities of Bernstein type for the natural counterpart of the additive functional, involving the asymptotic variance, have been obtained under certain spectral gap or Lyapunov-type conditions in [13,16]. For discrete time Markov chains, inequalities obtained in [1,2,7,10,12] by the regeneration method give (2.1) (under various types of ergodicity assumptions and with various parameters a_n) with a σ² which coincides with σ²_Mrv only under the additional assumption of strong aperiodicity of the chain. On the other hand, the articles [19,20,25,27] provide more general results, available for not necessarily Markovian sequences of random variables satisfying various types of mixing conditions. The variance proxies σ² used in these references are close to the asymptotic variance but in general do not coincide with it.
For instance, the inequality obtained in [19], which is valid in particular for geometrically ergodic chains, uses (in our notation) a variance proxy σ² satisfying σ²_Mrv ≤ σ². In fact, one can construct examples in which the ratio between the two quantities is arbitrarily large, or even σ²_Mrv = 0 while σ² > 0. Reference [27] provides an inequality for uniformly geometrically ergodic processes, involving a certain implicitly defined variance proxy σ²_n, which may be bounded from above by the σ² from [19] or by Var_π(f(ϒ_0)) + C‖f‖_∞ E_π|f(ϒ_0)|, where C is a constant depending on the mixing properties of the process. For a fixed process, in the non-degenerate situation when the asymptotic variance is nonzero, it can be substituted for σ²_n at the cost of introducing additional multiplicative constants, depending on the chain and the function f.
To the best of our knowledge, Theorem 1.1 is therefore the first tail inequality available for general geometrically ergodic Markov chains (not necessarily strongly aperiodic), which (up to universal constants) reflects the correct limiting Gaussian behavior of additive functionals. The problem of obtaining an inequality of this type was posed in [2]. Let us remark that quantitative investigation of problems related to the central limit theorems for general aperiodic Markov chains seems to be substantially more difficult than for chains which are strongly aperiodic. For instance, optimal strong approximation results are still known only in the latter case [21].
For any process X = (X_i)_{i∈N} and S ⊂ N, we put X_S = (X_i)_{i∈S}. Moreover, for k ∈ N we define the corresponding vectorized process X^{(k)} = ((X_{ik}, …, X_{(i+1)k−1}))_{i∈N}.

Definition 3.1 (Stationarity) We say that a process (X_n)_{n∈N} is stationary if for any k ∈ N the shifted process (X_{n+k})_{n∈N} has the same distribution as (X_n)_{n∈N}.

Definition 3.2 (m-dependence) We say that (X_n)_{n∈N} is m-dependent if for any k ∈ N the process (X_n)_{n≤k} is independent of the process (X_n)_{n≥m+1+k}.

Remark 3.3
Let us note that a process (X_n)_{n∈N} is 0-dependent iff the variables X_n, n ∈ N, are independent. Finally, let us give a natural example of a 1-dependent process (X_n)_{n∈N}. Fix a sequence (ξ_n)_{n∈N} of independent random variables and a Borel real function h : R² → R. Then, (h(ξ_n, ξ_{n+1}))_{n∈N} is 1-dependent. Such processes are called two-block factors.
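A hedged sanity check of this covariance structure (a toy example, not from the paper): for the two-block factor X_i = h(ξ_i, ξ_{i+1}) with h(u, v) = u + v and i.i.d. uniform signs ξ_i, exhaustive enumeration shows that neighbours X_i, X_{i+1} are correlated while X_i and X_{i+2} are uncorrelated, as 1-dependence permits.

```python
from itertools import product

# Two-block factor X_i = xi_i + xi_{i+1} built from i.i.d. signs xi_i in {-1,+1}.
# All X_i are centered; we compute Cov(X_0, X_1) and Cov(X_0, X_2) exactly
# by enumerating every outcome of (xi_0, ..., xi_3).
def covariances():
    cov01 = cov02 = 0.0
    outcomes = list(product([-1, 1], repeat=4))
    for xi in outcomes:
        x = [xi[i] + xi[i + 1] for i in range(3)]   # X_0, X_1, X_2
        cov01 += x[0] * x[1]
        cov02 += x[0] * x[2]
    n = len(outcomes)
    return cov01 / n, cov02 / n

cov01, cov02 = covariances()
print(cov01, cov02)   # 1.0 (adjacent terms dependent), 0.0 (gap 2)
```

Of course, vanishing covariance at gap 2 is weaker than the independence asserted by 1-dependence; the enumeration merely illustrates the expected structure.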
It is worth noting that there are 1-dependent processes which are not two-block factors (see [8]).

Remark 3.4
Assume that a process (X_n)_{n∈N} is m-dependent. Then for any n_0 ∈ N, the random variables (X_{n_0+k(m+1)})_{k∈N} are independent. Moreover, if the process (X_n)_{n∈N} is stationary, then for any n_0 ∈ N, (X_{n_0+k(m+1)})_{k∈N} is a collection of i.i.d. random variables.

Split Chain
As already mentioned in the Introduction, our proofs will be based on the regeneration technique, which was invented independently by Nummelin and Athreya-Ney (see [3] and [23]) and was popularized by Meyn and Tweedie [22]. We will introduce the split chain and then the regeneration times of the split chain. The construction of the split chain is well known, and as references we recommend [22] (Chapters 5 and 17) and [23]. We briefly recall this technique below. Let us stress that although this construction is based on the one presented in [22], our notation is slightly different. Firstly, let us recall the minorization condition for Markov chains, which plays a central role in the splitting technique.
Definition 3.5 We say that a Markov chain ϒ satisfies the minorization condition if there exist a set C ∈ B(X) (called a small set), a probability measure ν on X (a small measure), a constant δ > 0 and a positive integer m ∈ N such that π(C) > 0 and

P^m(x, B) ≥ δν(B) (3.1)

holds for all x ∈ C and B ∈ B(X).
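On a finite state space, one convenient way to exhibit a pair (δ, ν) satisfying (3.1) with m = 1 and C the whole space is to take the columnwise minima of the transition matrix; the sketch below (a hypothetical 3-state kernel, chosen only for illustration) verifies the minorization row by row.

```python
import numpy as np

# Minorization (3.1) with m = 1, C = whole space: delta = sum of column
# minima of P, nu proportional to those minima; then P(x, .) >= delta * nu(.)
# for every starting state x.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

col_min = P.min(axis=0)    # pointwise minimum over starting states
delta = col_min.sum()      # here 0.2 + 0.2 + 0.2 = 0.6
nu = col_min / delta       # the small measure, a probability vector

assert np.all(P >= delta * nu - 1e-12)   # minorization holds in every row
print(delta, nu)
```

This is the uniform (Doeblin) situation; for general state spaces the small set C is typically a proper subset and m may exceed 1.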

Remark 3.6
One can assume that ν(C) = 1 (possibly at the cost of increasing m).

Remark 3.7
One can check that under the assumptions of our theorems, the minorization condition (3.1) holds for some C, ν, δ and m. We refer to [22], Section 5.2 for the proof of this fact.
Fix C, m, ν and δ > 0 as in (3.1). The minorization condition allows us to redefine the chain ϒ together with an auxiliary regeneration structure. More precisely, we start by splitting the space X into two identical copies on levels 0 and 1; namely, we consider X × {0, 1}. Now, we split ϒ in the following way. We consider a process (ϒ, Δ) = (ϒ_i, Δ_i)_{i≥0} (usually called the split chain) defined on X × {0, 1}. (We slightly abuse the notation by denoting the first coordinate of the split chain by the same letter as the initial Markov chain, but it will turn out that the first coordinate of the split chain has the same distribution as the original Markov chain, so this notation is justified.) The random variables Δ_k take values in {0, 1} and indicate the level on which the split chain is at time k. For a fixed x ∈ C, consider the Radon-Nikodym derivative dν/dP^m(x, ·), and note that it is well defined thanks to (3.1).
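The mechanics of the splitting can be sketched on a finite chain (a hypothetical toy kernel, with m = 1 and C the whole space): one step of P from a point of C is realized as "with probability δ draw from ν, otherwise draw from the residual kernel R = (P − δν)/(1 − δ)". The check below verifies that this mixture reconstructs P exactly, which is the whole point of the construction.

```python
import numpy as np

# Split-step decomposition of a transition kernel: P = delta*nu + (1-delta)*R.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
delta = 0.5
nu = np.array([0.3, 0.4, 0.3])   # a valid small measure for this P

# Residual kernel: well defined (nonnegative) because P >= delta * nu here.
R = (P - delta * nu) / (1 - delta)

assert np.all(R >= -1e-12)                  # R is a genuine kernel ...
assert np.allclose(R.sum(axis=1), 1.0)      # ... with rows summing to 1
assert np.allclose(delta * nu + (1 - delta) * R, P)   # mixture recovers P
print("split-step mixture reconstructs P")
```

Flipping the δ-coin at each visit to C is exactly what the level variable records: level 1 means the next m-step transition is drawn from ν, which is why the chain "forgets" its past at such times.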
Remark 3.8 (Initial distribution for the split chain) In order to be able to specify the initial distribution of the split chain for an arbitrary probability measure μ on X, we define the split measure μ* on X × {0, 1} by

μ*(A × {1}) = δμ(A ∩ C), μ*(A × {0}) = μ(A) − δμ(A ∩ C), A ∈ B(X).

Such a definition ensures that (ϒ_0, Δ_0) ∼ μ* as soon as ϒ_0 ∼ μ. For convenience, for any x ∈ X, we will write P_{x*} instead of P_{δ_x*}.

Remark 3.9 (Markov-like properties of the split chain) In order to give some intuition behind the definition of the split chain, note that the distribution of the first coordinate of the split chain with initial distribution μ* coincides with that of the original Markov chain ϒ started from μ. From now on, ϒ always corresponds to this first coordinate of the split chain. One can easily generalize (3.3) to show the following Markov-like property of the split chain: for any k ∈ N and any product measurable bounded function F, the conditional expectation of F applied to the future of the split chain after time k depends only on (ϒ_k, Δ_k). This, in turn, leads to the fact that the vectorized split chain is a Markov chain. Now, we can introduce the aforementioned regeneration structure. Firstly, we define certain stopping times. For convenience, we put τ_{−1} = −m, and then, for i ≥ 0, we define τ_i to be the ith time at which the second (level) coordinate of the split chain hits 1. Now, we are ready to introduce the random blocks Ξ_i = ϒ_{(τ_{i−1}+m, …, τ_i+m−1)} and the random block process (Ξ_i)_{i≥0}, where we consider Ξ_i as a random variable with values in the disjoint union ⋃_{j≥0} X^j. For clarity of presentation, here and later on, we omit the measurability details.

Remark 3.10
Let us now briefly discuss the behavior of these random blocks. Firstly, by the strong Markov property of the vectorized split chain, it is not hard to see that the block process (Ξ_i)_{i≥0} is a Markov chain. On a closer look, one can see that the process (Ξ_i)_{i≥1} is stationary (see [9], Corollary 2.4). The stationarity follows from the fact that whenever m | k and the split chain is on level 1 at time k (note that this implies ϒ_k ∈ C), the split chain regenerates and starts anew from ν. Furthermore, the lengths of the blocks, τ_i − τ_{i−1}, are independent random variables for i ≥ 0 and form a stationary process for i ≥ 1.
Let us add that if m = 1, one can show that the blocks Ξ_i are independent. This fact makes a crucial difference between strongly aperiodic and general (not strongly aperiodic) Markov chains (see [5, Section 6]).

At last, let us introduce the excursions and the excursion process
which will play a crucial role in our future considerations. By the properties of the random blocks, one concludes that χ = (χ_i)_{i≥0} is 1-dependent and satisfies χ_i = F(Ξ_{i+1}), where F sums the values of f along a block. Moreover, (χ_i)_{i≥1} is stationary. Due to the Pitman occupation measure formula (see [22], Theorem 17.3.1, page 428), which yields for any measurable real function G

E χ_1(G) = E(τ_1 − τ_0) ∫_X G dπ, (3.17)

and the observation that the P_μ-distribution of the excursion χ_i(f) (i ≥ 1) is equal to the P_ν-distribution of χ_0(f), we get that for any initial distribution μ and any i ≥ 1, E_μ χ_i(f) = 0. We are now ready to decompose our sums into random blocks. If m|n, then Σ_{i=0}^{n−1} f(ϒ_i) splits into the initial segment of the trajectory (up to the first regeneration), the sum Σ_{i=1}^{N} χ_i(f) of the complete excursions before time n, and the remaining terms after the last complete excursion (3.19). This decomposition will be of utmost importance in our proof.
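The decomposition into blocks can be checked on a simulated trajectory. The sketch below is a hedged illustration (a hypothetical two-state kernel, m = 1, C the whole space, all names illustrative): the path is cut at the regeneration times, and the sum of f along the path equals the initial block plus the excursion sums plus the tail, exactly, by construction.

```python
import numpy as np

# Simulate a split chain (m = 1, Doeblin case): at each step, with probability
# delta the next state is drawn from nu (a regeneration), otherwise from the
# residual kernel R.  Then verify the block decomposition of sum f(Y_i).
rng = np.random.default_rng(0)
delta, n = 0.3, 200
nu = np.array([0.5, 0.5])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
R = (P - delta * nu) / (1 - delta)      # residual kernel (nonnegative here)

f = np.array([1.0, -1.0])
x, path, levels = 0, [], []
for _ in range(n):
    path.append(x)
    regen = rng.random() < delta        # level 1 <=> next step drawn from nu
    levels.append(regen)
    x = rng.choice(2, p=(nu if regen else R[x]))

taus = [i for i in range(n) if levels[i]]          # regeneration times
S = sum(f[path[i]] for i in range(n))              # full additive functional

initial = sum(f[path[i]] for i in range(taus[0] + 1))
excursions = [sum(f[path[i]] for i in range(taus[j] + 1, taus[j + 1] + 1))
              for j in range(len(taus) - 1)]
tail = sum(f[path[i]] for i in range(taus[-1] + 1, n))

assert np.isclose(S, initial + sum(excursions) + tail)
print(len(taus), "regenerations; block decomposition holds exactly")
```

With m = 1 the excursions here are actually independent; the point of the paper is precisely that for m > 1 they are only 1-dependent, which this toy case does not exhibit.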

Asymptotic Variances
During the upcoming proofs, we will meet two types of asymptotic variances: σ²_Mrv, associated with the process (f(ϒ_i))_{i≥0}, and σ²_∞, associated with χ. The first one, defined as

σ²_Mrv = lim_{n→∞} (1/n) Var_π( Σ_{i=0}^{n−1} f(ϒ_i) ),

is exactly the variance of the limiting normal distribution of the sequence (1/√n) Σ_{i=0}^{n−1} f(ϒ_i). The second one,

σ²_∞ = lim_{n→∞} (1/n) Var( Σ_{i=1}^{n} χ_i ),

is the variance of the limiting normal distribution of the sequence (1/√n) Σ_{i=1}^{n} χ_i. Both asymptotic variances are closely linked via the formula

σ²_∞ = E(τ_1 − τ_0) σ²_Mrv. (3.22)

For the proof of this formula, we refer to [22] (see (17.32), page 434).
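For a stationary 1-dependent sequence, the asymptotic variance has the closed form Var(X_0) + 2 Cov(X_0, X_1), since all longer-range covariances vanish. The following hedged toy computation (not from the paper) verifies this exactly for the two-block factor X_i = ξ_i + ξ_{i+1} with i.i.d. signs, where Var(X_0) = 2 and Cov(X_0, X_1) = 1, so Var(S_n)/n = 2 + 2(1 − 1/n) → 4.

```python
from itertools import product

# Exact Var(S_n)/n for the 1-dependent process X_i = xi_i + xi_{i+1},
# computed by enumerating all sign patterns (xi_0, ..., xi_n).
def var_Sn_over_n(n):
    total = 0.0
    outcomes = list(product([-1, 1], repeat=n + 1))
    for xi in outcomes:
        s = sum(xi[i] + xi[i + 1] for i in range(n))   # S_n, mean zero
        total += s * s
    return total / len(outcomes) / n

for n in (2, 4, 8):
    print(n, var_Sn_over_n(n))   # 3.0, 3.5, 3.75 -> approaches 4
```

The finite-n correction 2 Cov(X_0, X_1)/n is the familiar boundary effect; the limit 4 is the σ²_∞ of this toy sequence.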

Main Results
In order to state our results in their general form, we need to recall the definition of the exponential Orlicz norm. For any random variable X and α > 0, we define

‖X‖_{ψ_α} = inf{ c > 0 : E exp( |X|^α / c^α ) ≤ 2 }. (4.1)

If α < 1, then ‖·‖_{ψ_α} is just a quasi-norm. (For basic properties of these quasi-norms, we refer to Appendix A.) In what follows, we will deal with various underlying measures on the state space X. In order to stress the dependence of the Orlicz norm on the initial distribution μ of the chain, we will sometimes write ‖·‖_{ψ_α,μ} instead of ‖·‖_{ψ_α}. Before we formulate our main result, let us introduce and explain the role of the parameters a, b, c and d. The parameter a (resp. b) will allow us to estimate the first (resp. third) term on the right-hand side of (3.19), whereas the parameters c and d will be used to control the middle term. We note that d quantifies the geometric ergodicity of ϒ and is finite as soon as ϒ is geometrically ergodic. Let us mention that all these parameters can be bounded, for example, by means of drift conditions widely used in the theory of Markov chains (see Remark 4.2). Finally, let us recall that

σ²_Mrv = lim_{n→∞} (1/n) Var_π( Σ_{i=0}^{n−1} f(ϒ_i) )

denotes the asymptotic variance of normalized partial sums of the process (f(ϒ_i))_i.
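The infimum in (4.1) can be computed in closed form in simple cases. As a hedged numeric illustration (a toy example, not from the paper): for X ~ Exp(1) and α = 1 one has E exp(X/c) = c/(c − 1) for c > 1, so ‖X‖_{ψ_1} is the root of c/(c − 1) = 2, namely c = 2; bisection recovers this value.

```python
# psi_1 Orlicz norm of a standard exponential variable via its known MGF.
def exp_moment(c):
    return c / (c - 1.0)         # closed-form E exp(X/c) for X ~ Exp(1), c > 1

lo, hi = 1.0 + 1e-9, 100.0       # E exp(X/c) decreases from +inf to ~1 on (1, inf)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if exp_moment(mid) > 2.0:
        lo = mid                 # moment still above 2: need a larger c
    else:
        hi = mid
print(hi)                        # converges to 2.0 = ||X||_{psi_1}
```

For α < 1 the same recipe applies with E exp(X^α/c^α), though the moment then rarely has a closed form.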
We are now ready to formulate the first of our main results. (Recall the definitions of the small set C and the minorization condition (3.1).)

Theorem 4.1 Let ϒ be a geometrically ergodic Markov chain and let π be its unique stationary probability measure. Let f : X → R be a measurable function such that E_π f = 0, and let α ∈ (0, 1]. Moreover, assume for simplicity that m|n. Then the tail bound (4.3) holds for all x ∈ X and t > 0.

Remark 4.2
For conditions under which a, b, c are finite, we refer to [2], where the authors give bounds on a, b, c under classical drift conditions. If f is bounded, then one easily bounds a, b and c in terms of D‖f‖_∞, where D = max( d, ‖τ_0‖_{ψ_1, P_{x*}}, ‖τ_0‖_{ψ_1, P_{π*}} ). For computable bounds on D, we refer to [4].
Let us note that in Theorem 4.1, the right-hand side of the inequality does not converge to 0 as t tends to infinity. (One of the terms depends on n but not on t.) Usually, in applications, t is of order at most n and the other terms dominate the right-hand side, so this does not pose a problem. Nevertheless, one can obtain another version of Theorem 4.1, namely

Theorem 4.3 Under the assumptions and notation of Theorem 4.1, we have

Bernstein Inequalities for 1-Dependent Sequences
In this section, we will show two versions (for suprema and for randomly stopped sums) of the Bernstein inequality for 1-dependent random variables. They will later be used in the proofs of our main theorems. In what follows, for a 1-dependent sequence of random variables (X_i)_{i≥0}, σ²_∞ denotes the asymptotic variance of normalized partial sums, i.e., σ²_∞ = lim_{n→∞} (1/n) Var( Σ_{i=1}^{n} X_i ). Under the assumptions of Lemma 5.1 below (a filtration satisfying conditions 0)-3), with conditionally centered increments Z_i), we have the variance and ψ_α-norm bound (5.1). Moreover, for any t > 0 and n ∈ N, the tail bound (5.2) holds with v_{n,m} = 5(m + 1)(n + m + 1), w_{n,m} = 2(m + 1)(24α^{−3} log n)^{1/α} c, K_m = 2(m + 1) exp(8) and L_m = 2(m + 1).
Proof Firstly, we will show that if the X_i's are centered, independent random variables with common variance σ²_∞, then the conclusion holds with K_0 = 2 exp(8) and L_0 = 2 (allowing for a slight abuse of precision, we consider this the m = 0 case of the lemma). Indeed, by Lemma 4.1 in [2] applied with λ = (2^{1/α} c)^{−1}, where U_i = X_i 1_{|X_i|>M_0} stands for the "unbounded" part of X_i and M_0 = c(3α^{−2} log n)^{1/α}. Define the "bounded" part of X_i, B_i = X_i 1_{|X_i|≤M_0}, and notice that X_i = B_i + U_i. Using the union bound, we get for p = 1/6 the corresponding split of the tail. Consider first the unbounded part. Using the subadditivity of x ↦ x^α, Markov's inequality and then (5.3), we bound its contribution. As for the "bounded" part, notice that Var(B_i) ≤ E B_i² ≤ E X_i² = σ²_∞. Therefore, using the classical Bernstein inequality, we bound its contribution as well. Combining the three last estimates and substituting p = 1/6 allows us to finish the proof in the independent case.
We will now use the independent case to prove the tail estimate (5.2), assuming (5.1), whose proof we postpone. Note that (5.2) is trivial unless t ≥ w_m log(2(m + 1)) (as otherwise the right-hand side exceeds 1); therefore, from now on, we consider only t satisfying this lower bound. Setting p = 1/5, by another application of the union bound together with Lemma A.5 and stationarity, we bound the contribution of the conditional centering terms, using the estimate t ≥ 4^{1/α} · 2c along the way. In order to deal with P( Σ_{i=1}^{n} Z_i > t(1 − p) ), we start by splitting this sum into m + 1 parts and using the union bound. Now, to each summand on the right-hand side of the resulting inequality we apply the estimate for the independent case obtained at the beginning of this proof. Setting M = (24α^{−3} log n)^{1/α} c and taking into account (5.1), we obtain the corresponding bounds. Finally, using (5.4), (5.5) and (5.6), we get the claim; to conclude (5.2), it is now enough to note that the second summand on the right-hand side dominates the first one.
To finish the proof of the lemma, it remains to show (5.1). Firstly, we address the variance of Z_i, which can be easily computed using the properties of conditional expectation (recall the notation E_i(·) = E(· | F_i)). The variance formula in (5.1) follows by observing that, due to assumption 3), the relevant conditional terms are stationary. Now, we demonstrate the upper bound on ‖Z_i‖_{ψ_α} in (5.1). Using the triangle inequality (cf. Lemma A.1) twice and then Lemma A.3, we obtain the desired bound. This concludes the proof of the lemma.

Remark 5.2
If (X_i)_{i≥0} is a 1-dependent, centered and stationary Markov chain such that ‖X_i‖_∞ ≤ M < ∞, then the assumptions of the above lemma are satisfied with m = 2. If (ξ_i)_{i≥0} is a sequence of independent random variables and f : R² → R is a bounded Borel function such that the X_i = f(ξ_i, ξ_{i+1}) are centered, then we can take F_i = σ{ξ_j | j ≤ i + 1} and notice that the assumptions of the above lemma are satisfied with m = 1.

Remark 5.3
It is worth noticing that σ²_∞ may be equal to 0 for a 1-dependent process (X_i)_{i∈N}. Take, for example, X_i = ξ_{i+1} − ξ_i, where (ξ_i)_{i∈N} are i.i.d. random variables. It turns out (cf. [14]) that the converse is also true, that is, if for a 1-dependent, bounded, stationary process (X_i)_{i∈N} we have σ²_∞ = 0, then there exists an i.i.d. process (ξ_i)_{i∈N} such that X_i = ξ_{i+1} − ξ_i.

Lemma 5.4 (Bernstein inequality for random sums)
Let (X_i)_{i≥0} be a 1-dependent sequence of centered random variables such that E exp(c^{−α}|X_i|^α) ≤ 2 for some α ∈ (0, 1] and c ≥ 1. Moreover, let N ≤ n ∈ N be an N-valued bounded random variable. Assume that we can find a filtration F = (F_i)_{i≥0} satisfying the assumptions of Lemma 5.1. Then for any t > 0 and a > 0, the bound (5.8) holds.

Proof Observe that assumptions 0) and 3) imply 2-dependence of the process (Z_i)_{i≥1}. Therefore, the filtration F satisfies all the assumptions of Lemma 5.1, and thus (5.1) holds. Note also that, without loss of generality, we may assume that t ≥ w log 9. (Otherwise, the right-hand side of (5.8) is at least one.) Fix s = (8√2 log 9)^{−1}. Using the union bound, we obtain the split (5.9). Now, using Lemma A.5, ts/2 ≥ c(2/α)^{1/α}, t ≥ w log 9 and n exp( −(st)^α/(4(2c)^α) ) ≤ 1, we bound the first term. Next, we take care of the other term on the right-hand side of (5.9). Firstly, we split the sum, and then we consider its jth summand. Let us take r = 3(8√2 log 9)^{−1} and notice that there is a function f_j : N → N such that for any n ∈ N, ⌊n/3⌋ ≤ f_j(n) ≤ ⌈n/3⌉. Using (5.1) and Lemma A.4, along with t ≥ w log 9 and n ≥ 2 (for n = 1, the result of the lemma is trivial), we get the bound (5.12). To handle the first summand on the right-hand side of (5.12), let us fix j. Using the assumptions on the filtration F and (5.1), it is straightforward to check that T is a stopping time with respect to the filtration (G_i). This is precisely the setting of Proposition 4.4 ii) from [2], which, applied with p := √2/(√2 − 1) and q := √2, gives that for any a > 0, the bound (5.14) holds; this implies that σ_∞ ≤ (2/3)M and, as a consequence, (5.14) reduces to a simpler estimate. Combining the above inequality with (5.9)-(5.13), we obtain the claim. To conclude, it is now enough to recall that r = 3(8√2 log 9)^{−1} and s = (8√2 log 9)^{−1} and do some elementary calculations.

Bounds on the Number of Regenerations
We will now obtain a bound on the stopping time N introduced in (3.20). To this end, we will use the ψ_1 version of the Bernstein inequality, which follows easily from the classical moment version of this inequality (see, e.g., Lemma 2.2.11 in [26]), by observing that for k ≥ 2, E|X|^k ≤ k! ‖X‖_{ψ_1}^k. Here K_p = L_p + 16/L_p and L_p = 16/p + 20. Moreover, the function p ↦ K_p is decreasing on R_+ (in particular, K_p ≥ K_∞ = 104/5), and if p = 2/3, then (1/p)K_p ≤ 67.
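A quick numeric check of these constants, under the reading L_p = 16/p + 20 and K_p = L_p + 16/L_p (which is consistent with the stated numerical values):

```python
# Constants from Lemma 6.2: K_p = L_p + 16/L_p with L_p = 16/p + 20.
def L(p):
    return 16.0 / p + 20.0

def K(p):
    return L(p) + 16.0 / L(p)

assert K(0.5) > K(1.0) > K(10.0) > K(1e6)   # K_p decreases in p
assert abs(K(1e9) - 104.0 / 5.0) < 1e-6     # limit K_inf = 20 + 16/20 = 104/5
p = 2.0 / 3.0
print(K(p) / p)                              # about 66.5, indeed <= 67
```

As p → ∞, L_p → 20 and hence K_p → 20 + 16/20 = 104/5 = 20.8; for p = 2/3, L_p = 44, K_p = 44 + 16/44, and (3/2)K_p ≈ 66.5 ≤ 67, matching the statement.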
Proof For convenience, let T_i = τ_i − τ_{i−1} for i ≥ 1. Firstly, notice that without loss of generality we may assume that np ≥ L_p E T_1; indeed, otherwise the claim follows directly from E T_1 ≤ d. Thus, from now on, we consider n such that np ≥ L_p E T_1.
Now, we have ‖T_{i+1} − E T_{i+1}‖_{ψ_1} ≤ 2d, so using Lemma 6.1, E T_1 ≤ d and np ≥ L_p E T_1, we get the desired estimate, which finishes the proof of (6.1). The properties of K_p follow from easy computations.
The following lemma is a standard consequence of the tail estimates given in Lemma 6.2. Its proof, based on integration by parts, is analogous to that of Lemma 5.4 in [2] and is therefore omitted.

Lemma 6.3 Suppose that ‖τ_1 − τ_0‖_{ψ_1} ≤ d for some d > 0. Then for any p > 0, the corresponding moment bound holds with a = (1 + p)n [E(τ_1 − τ_0)]^{−1}, K_p = L_p + 16/L_p and L_p = 16/p + 20.

Proofs of Theorems 4.1, 4.3 and 4.4
In this section, we will prove our main results. The structure of the proofs of Theorems 4.1 and 4.3 is similar, and they contain a common part, which we present first in Sects. 7.1 and 7.2. The proof of Theorem 4.1 will be concluded in Sect. 7.3 and the proof of Theorem 4.3 in Sect. 7.4. Theorem 4.4 will be obtained as a corollary to Theorem 4.1 in Sect. 7.5. Let us thus pass to the proofs of Theorems 4.1 and 4.3. Assume that m|n. The argument will be based on the approach of [1] and [2] (see also [10] and [12]) and will rely on the decomposition

|Σ_{i=0}^{n−1} f(ϒ_i)| ≤ H_n + M_n + T_n, (7.1)

where H_n, M_n and T_n denote, respectively, the absolute values of the initial part of the sum (up to the first regeneration), of the sum of the complete excursions χ_1, …, χ_N, and of the terminal part of the trajectory (cf. (3.19)). The proof will be divided into three main steps. In the first two (common to both theorems), we will obtain easy bounds on the tails of H_n and T_n. The main, third step will be devoted to obtaining two different estimates on the tail of M_n. To this end, we will use Lemmas 5.1 and 6.2 (for the proof of Theorem 4.1) and Lemmas 5.4 and 6.3 (for Theorem 4.3).

Estimate on T n
By repeating verbatim the easy argument presented in the proof of Theorem 5.1 in [2], we obtain the required bound on the tail of T_n. We skip the details.

Proof of Theorem 4.1
Recall that M = c(24α^{−3} log n)^{1/α} and note that, without loss of generality, we can assume that t ≥ 8M log 6; otherwise, (4.3) is trivial, as its right-hand side is greater than or equal to 1. Fix p = 2/3 and set A := (p + 1)n (E(τ_1 − τ_0))^{−1}. To control the first summand on the right-hand side of the resulting inequality, we will apply Lemma 5.1 with m = 2, X_i := χ_i = F(Ξ_{i+1}) (cf. (3.16)), c := c and n := A. Assuming that the assumptions of the lemma are satisfied (we will verify them later on), we obtain the corresponding tail bound (in the first line, we use the stationarity of (Ξ_i)_{i≥1}). Recall that by (3.22), σ²_∞ = σ²_Mrv E(τ_1 − τ_0). We will now obtain a comparison between σ²_∞ and tM, which will allow us to reduce the above estimate to one in which the sub-Gaussian coefficient is expressed only in terms of σ²_Mrv. Thanks to Lemma A.2, applied with Y := (χ_1/c)^α and β := 2/α, we obtain an estimate whose last inequality is a consequence of equation (4) in [18]. Moreover, recalling the definition of M and using the assumption t ≥ 8M log 6, the last inequality in combination with (7.6) yields (7.7). In order to justify this inequality, it remains to verify the assumptions of Lemma 5.1. To this end, we strongly rely on the properties of the stationary sequence of 1-dependent blocks (Ξ_i)_{i≥1} stated in Remark 3.10, together with (3.15) and (3.16). Since χ_i = F(Ξ_{i+1}), assumption 0) of Lemma 5.1 is trivially satisfied. To prove 2), observe that (Z_i)_{i≥1} is stationary (as a function of the stationary sequence (Ξ_i)_{i≥1}). The sequence (Ξ_i)_{i≥1} is 1-dependent, which clearly implies that (Z_i)_{i≥1} is 2-dependent, i.e., assumption 2) of the lemma. Assumption 3), i.e., the stationarity of the sequence (E(χ_i|F_{i−1}))_{i≥1} = (G(Ξ_i))_{i≥1}, follows again from the stationarity of (Ξ_i)_{i≥1}.
Finally, using once more the fact that (Ξ_i)_{i≥0} is 1-dependent, we obtain that for any i ≥ 1, the random variable E(χ_i|F_{i−1}) = G(Ξ_i) is independent of χ_{i+1} = F(Ξ_{i+2}), which ends the verification of the assumptions of Lemma 5.1 and proves (7.7).
Thus, in order to get a bound on P(M_n > t), it suffices to estimate the second term on the right-hand side of (7.5). To this end, we use Lemma 6.2 with p = 2/3 and d := d, obtaining a tail bound which, in combination with (7.5) and (7.7), gives the desired estimate on P(M_n > t). Combining this with (7.3) and (7.4), we get the assertion. In order to finish the proof of Theorem 4.1, it is enough to substitute E(τ_1 − τ_0) = δ^{−1} π(C)^{−1} m.

Proof of Theorem 4.3
Recall that M = c(24α^{−3} log n)^{1/α}, and let p > 0 be a parameter to be fixed later. We are going to apply Lemma 5.4 with X_i := χ_i = F(Ξ_{i+1}) and c := c, where the last inequality follows from the definition of K_∞ in Lemma 6.2. Therefore, ‖max{2, (N/3 − a + 1)_+}‖_{ψ_1} ≤ √(4/3 + 7/50) · K_p · d, and we obtain the resulting bound for arbitrary p > 0. Using this inequality together with (7.3) and (7.4), we obtain the assertion, which concludes the proof of Theorem 4.3.

Proof of Theorem 4.4
Denote M = ‖f‖_∞ and notice that for t > nM, the left-hand side of (4.6) vanishes, so we may assume that t ≤ nM. Using (4.4), one can easily see that if m|n, then Theorem 4.1 applied with α = 1 implies the bound (7.8). The assumption t ≤ nM then yields an estimate which, plugged into (7.8), gives, after some elementary calculations (recall that K = exp(10)), the bound (7.9), proving the theorem in the special case m|n. Now, we consider the case m ∤ n. Define n_m to be the smallest integer greater than or equal to n which is divisible by m. Notice that, without loss of generality, we can assume that t > 4330 D² M δπ(C). (Otherwise, the assertion of the theorem is trivial, as the right-hand side of (4.6) exceeds one.) Since D² δπ(C) > m (recall that E(τ_1 − τ_0) = δ^{−1} π(C)^{−1} m), this implies that t ≥ 4330Mm. Moreover, as t ≤ nM, we also obtain that n ≥ 4330m.
Thus, for p = 1/4330 we have |Σ_{i=n}^{n_m−1} f(ϒ_i)| ≤ Mm ≤ pt, and as a consequence, (7.10) holds. Now, using (7.9) and the inequality n > 4330m, we obtain the assertion. This concludes the proof of Theorem 4.4.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Orlicz Exponential Norm
At the beginning, recall the definition of the exponential Orlicz quasi-norm (4.1) and note that if α ≥ 1, then ‖·‖_{ψ_α} is a norm, whereas for 0 < α < 1, ‖·‖_{ψ_α} is only a quasi-norm. More precisely, we have the following version of the triangle inequality (see Lemma 3.7 in [2]).
Lemma A.1 (Triangle inequality for α ≤ 1) Fix 0 < α ≤ 1. Then, for any random variables X, Y we have

‖X + Y‖^α_{ψ_α} ≤ ‖X‖^α_{ψ_α} + ‖Y‖^α_{ψ_α}.

Now, we present a moment estimate for random variables with a bounded exponential moment.
Lemma A.2 (Moment bound) Let Y be a nonnegative random variable such that E exp(Y) ≤ 2. Then for any β > 0,

E Y^β ≤ 2Γ(β + 1).

Furthermore, if β ∈ N, then one can replace the constant 2 with 1.
Proof If β is a natural number, then the claim follows from the Taylor expansion of exp(x). The general case is obtained by Markov's inequality. The next lemma allows us to pass from the ψ_α-norm of a random variable to the norm of its conditional expectation.
Lemma A.3 (Orlicz norm of a conditional expectation) Let 0 < α ≤ 1. Assume that a random variable X satisfies ‖X‖_{ψ_α} < ∞, and let F be a sigma-field. Then ‖E(X|F)‖_{ψ_α} ≤ c_α ‖X‖_{ψ_α}, where c_α ≥ 1 depends only on α.

Proof Set φ_α(x) = exp(x^α) for x ≥ 0 and notice that φ_α is concave on (0, x_α) and convex on (x_α, ∞), where x_α = ((1 − α)/α)^{1/α}. Define Φ_α to be the smallest convex function greater than or equal to φ_α which equals φ_α on (x_α, ∞). Clearly, Φ_α is a convex function on R_+, and it is easy to see that φ_α ≤ Φ_α ≤ α exp((1 − α)/α) φ_α. Using these properties, Jensen's inequality and the definition of the Orlicz norm, we obtain the claim with c_α = ( 1 + log( α exp((1 − α)/α) )/log 2 )^{1/α} ≥ 1 (by another application of Jensen's inequality), which completes the proof. Now, we give two concentration inequalities which are valid for random variables with finite Orlicz norm. The first one is an easy consequence of Markov's inequality; therefore, we omit the proof.
Lemma A.4 For any random variable X with ‖X‖_{ψ_α} < ∞ and any t > 0, we have

P( |X| ≥ t ) ≤ 2 exp( −t^α / ‖X‖^α_{ψ_α} ).

Lemma A.5 (Tail inequality for a conditional expectation) Let 0 < α ≤ 1. Assume that a random variable X satisfies ‖X‖_{ψ_α} < ∞, and let F be a sigma-field. Then for any t > 0,

P( |E(X|F)| ≥ t ) ≤ 6 exp( −t^α / (2‖X‖^α_{ψ_α}) ).

Proof Using the Markov and Jensen inequalities along with Γ(x) ≤ x^x/e^{x−1} ([18], Thm. 1) and Lemma A.2 with Y = (|X|/c)^α and β = t^α/c^α, we obtain the stated bound; in the last inequality, we used the estimate xe^{−x} ≤ e^{−x/2}, valid for all x ∈ R. Now, it is enough to take the limit c → ‖X‖_{ψ_α} and notice that 2e ≤ 6.