On the Ergodicity of Certain Markov Chains in Random Environments

We study the ergodic behaviour of a discrete-time process X which is a Markov chain in a stationary random environment. The laws of X_t are shown to converge to a limiting law in (weighted) total variation distance as t → ∞. Convergence speed is estimated, and an ergodic theorem is established for functionals of X. Our hypotheses on X combine the standard "drift" and "small set" conditions for geometrically ergodic Markov chains with conditions on the growth rate of a certain "maximal process" of the random environment. We are able to cover a wide range of models that have heretofore been intractable. In particular, our results are pertinent to difference equations modulated by a stationary (Gaussian) process. Such equations arise in applications such as discretized stochastic volatility models of mathematical finance.


Introduction
Markov chains in random environments (recursive chains in the terminology of [4]) were systematically studied on countable state spaces in e.g. [5, 6, 21]. However, papers on the ergodicity of such processes on a general state space are scarce and require rather strong, Doeblin-type conditions, see [17, 18, 22]. An exception is [23], where the system dynamics is instead assumed to be contracting (on average), but only weak convergence of the laws is established.
In this paper we deal with Markov chains in random environments that satisfy refinements of the usual hypotheses for the geometric ergodicity of Markov chains: minorization on "small sets", see Chapter 5 of [19], and Foster–Lyapunov-type "drift" conditions, see Chapter 15 of [19]. Assuming that a suitably defined maximal process of the random environment satisfies a tail estimate, we manage to establish stochastic stability.
We adapt ideas of [15] to obtain convergence to a limiting distribution in total variation norm with estimates on the convergence rate. We also present a powerful method to prove ergodic theorems, exploiting ideas of [1, 3, 14, 20]; see Section 2 for the statements of our results. A crucial technical ingredient is the notion of L-mixing from [12], see Section 5. We present examples of difference equations modulated by Gaussian processes in Section 3. These can be regarded as discretizations of diffusions in random environments which arise, for instance, in stochastic volatility models of mathematical finance, see [7] and [11]. Proofs appear in Sections 4, 6 and 7.

Main results
Let (𝒴, 𝒜) be a measurable space and let Y_t, t ∈ ℤ, be a (strongly) stationary 𝒴-valued process on some probability space (Ω, ℱ, P). A generic element of Ω will be denoted by ω.
Expectation of a real-valued random variable X with respect to P will be denoted by E[X] in the sequel. For 1 ≤ p < ∞ we write L^p to denote the Banach space of (a.s. equivalence classes of) ℝ-valued random variables X with E[|X|^p] < ∞, equipped with the usual norm.
We fix another measurable space (𝒳, ℬ) and denote by 𝒫(𝒳) the set of probability measures on ℬ. Let Q : 𝒴 × 𝒳 × ℬ → [0, 1] be a family of probabilistic kernels parametrized by y ∈ 𝒴, i.e. for all A ∈ ℬ, Q(·, ·, A) is 𝒜 ⊗ ℬ-measurable and, for all y ∈ 𝒴 and x ∈ 𝒳, A ↦ Q(y, x, A) is a probability on ℬ.
Let X_t, t ∈ ℕ, be an 𝒳-valued stochastic process such that X_0 is independent of Y_t, t ∈ ℤ, and

P(X_{t+1} ∈ A | ℱ_t) = Q(Y_t, X_t, A) P-a.s., t ≥ 0, (1)

where the filtration is defined by ℱ_t := σ(Y_j, j ≤ t; X_j, 0 ≤ j ≤ t), t ≥ 0.
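To make the setup concrete, the following Python sketch simulates a toy Markov chain in a random environment. Both the environment (a stationary Gaussian AR(1) process) and the kernel (the law of x/2 + |y|·N(0, 1)) are illustrative assumptions, not the paper's model.

```python
import numpy as np

def simulate_mcre(T, rho=0.9, seed=0):
    """Simulate a toy Markov chain in a random environment.

    Environment: Y_t is a stationary Gaussian AR(1) process.
    Chain: given Y_t = y and X_t = x, the next state is sampled from
    Q(y, x, .) = law of x/2 + |y| * N(0, 1)   (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    Y = np.empty(T)
    X = np.empty(T)
    # start the environment from its stationary law N(0, 1/(1 - rho^2))
    Y[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - rho ** 2))
    X[0] = 0.0
    for t in range(T - 1):
        # one transition of X, governed by the current environment state Y_t
        X[t + 1] = 0.5 * X[t] + abs(Y[t]) * rng.normal()
        Y[t + 1] = rho * Y[t] + rng.normal()
    return Y, X

Y, X = simulate_mcre(10_000)
```

Note that X alone is not a Markov chain here: its transition law at time t depends on the environment state Y_t; only the pair (X_t, Y_t) is Markovian.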
Remark 2.1. Obviously, the law of X_t, t ∈ ℕ (and also its joint law with Y_t, t ∈ ℤ) is uniquely determined by (1). Let us consider the particular case where 𝒳 is a Polish space with the corresponding family of Borel sets ℬ. Then, for every given Q, there exists a process X satisfying (1) (after possibly enlarging the probability space). See e.g. page 228 of [2] for a similar construction. We will establish a more precise result in Lemma 6.1 below, under additional assumptions.
The process Y will represent the random environment whose state Y_t at time t determines the transition law Q(Y_t, ·, ·) of the process X at the given instant t. Our purpose is to study the ergodic properties of X. We write µ_t := Law(X_t), t ∈ ℕ.
We will now introduce a number of assumptions of various kinds that will figure in the statements of the main results: Theorems 2.10, 2.12, 2.13, 2.15, 2.16 and 2.17 below.
The following assumption closely resembles the well-known drift conditions for geometrically ergodic Markov chains, see e.g. Chapter 15 of [19]. In our case, however, there is also dependence on the state of the random environment.

Assumption 2.2. (Drift condition) Let V : 𝒳 → [0, ∞) be a measurable function. Let A_n ∈ 𝒜, n ∈ ℕ, be a non-decreasing sequence of sets such that A_0 = ∅ and 𝒴 = ∪_{n∈ℕ} A_n. Define the ℕ-valued function ⟨y⟩ := min{n ∈ ℕ : y ∈ A_n}, y ∈ 𝒴. There exist functions λ : ℕ → (0, 1] and K : ℕ → (0, ∞) such that, for all y ∈ 𝒴 and x ∈ 𝒳,

∫_𝒳 V(z) Q(y, x, dz) ≤ (1 − λ(⟨y⟩)) V(x) + K(⟨y⟩). (2)
We try to provide some intuition about Assumption 2.2: we expect that the stochastic process X behaves in an increasingly arbitrary way as the random environment Y becomes more and more "extreme" (i.e. ⟨Y⟩ grows), so the drift condition (2) becomes less and less stringent on the increasing subsets A_n as n grows. Another standard choice would be 𝒴 := ℕ; 𝒜 the power set of ℕ; A_n := {i ∈ ℕ : i ≤ n}. In this case ⟨y⟩ = y, y ∈ ℕ.
One more possibility could be 𝒴 := (0, ∞) with its Borel sets 𝒜 and with a suitable choice of the sets A_n. The next assumption stipulates the existence of a whole family of suitable "small sets" C(R(n)) that fit well the sets A_n appearing in Assumption 2.2.

Assumption 2.4. (Minorization condition)
Set C(R) := {x ∈ 𝒳 : V(x) ≤ R} for R ≥ 0 and let R(n) := 4K(n)/λ(n), n ∈ ℕ. There is a non-increasing function α : ℕ → (0, 1] and, for each n ∈ ℕ, there exists a probability measure ν_n on ℬ such that, for all y ∈ 𝒴, x ∈ C(R(⟨y⟩)) and A ∈ ℬ,

Q(y, x, A) ≥ α(⟨y⟩) ν_{⟨y⟩}(A). (3)

We may and will assume α(·) ≤ 1/3.
We explain the meaning of this assumption: depending on the "size" ⟨y⟩ of the state y of the random environment, we work on the set C(4K(⟨y⟩)/λ(⟨y⟩)), on which we are able to benefit from a "coupling effect" of strength α(⟨y⟩).
For a fixed V as in Assumption 2.2, let us define a family of metrics on 𝒫(𝒳): for each 0 ≤ β ≤ 1, set

ρ_β(ν_1, ν_2) := ∫_𝒳 (1 + βV(x)) |ν_1 − ν_2|(dx).

Here |ν_1 − ν_2| is the total variation of the signed measure ν_1 − ν_2. Note that ρ_0 is just the total variation distance (and it can be defined for all ν_1, ν_2 ∈ 𝒫(𝒳)) while ρ_1 is the (1 + V)-weighted total variation distance.
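For discrete measures supported on a common finite grid, ρ_β can be computed directly from its definition; the following small helper (an illustration with vectors of point masses, not part of the paper) mirrors the formula.

```python
import numpy as np

def rho_beta(p, q, V, beta):
    """(1 + beta*V)-weighted total variation between two discrete
    probability vectors p, q supported on the same grid points;
    V is the vector of Lyapunov-function values on that grid.

    |p - q| is the total variation measure of the signed measure
    p - q, so rho_beta with beta = 0 is plain total variation."""
    return float(np.sum((1.0 + beta * V) * np.abs(p - q)))

# toy measures on a three-point grid with V-values 0, 1, 2
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
V = np.array([0.0, 1.0, 2.0])
```

For these vectors, ρ_0(p, q) weighs every point equally, while ρ_1(p, q) penalizes discrepancies at points where V is large, which is exactly why ρ_1-convergence is a stronger statement than plain total variation convergence.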
Let L : 𝒳 × ℬ → [0, 1] be a probabilistic kernel. For each µ ∈ 𝒫(𝒳), we define the probability

[Lµ](A) := ∫_𝒳 L(x, A) µ(dx), A ∈ ℬ. (4)

Consistently with these definitions, Q(Y_n)µ will refer to the action of the kernel Q(Y_n, ·, ·) on µ. For a bounded measurable function φ : 𝒳 → ℝ, we set

[Lφ](x) := ∫_𝒳 φ(z) L(x, dz), x ∈ 𝒳.

The latter definition makes sense for any non-negative measurable φ, too.
The following assumption is just an easily verifiable integrability condition on the initial values X_0 and X_1 of the process X.

Assumption 2.5. (Second moment condition on the initial values)
We now present a hypothesis controlling the maxima of Y over finite time intervals (i.e. the "degree of extremity" of the random environment).

Assumption 2.6. (Condition on the maximal process of the random environment)
There exist a non-decreasing function g : ℕ → ℕ and a non-increasing function ℓ : ℕ → [0, 1] such that

P(max_{0≤i≤t} ⟨Y_i⟩ ≥ g(t)) ≤ ℓ(t), t ∈ ℕ. (6)

Remark 2.7. It is clear that, for a given process Y, several choices for the pair of functions g, ℓ are possible. Each of these leads to different estimates; which choice is better depends on Y and X, and no general rule can be determined a priori.
Remark 2.8. For Gaussian processes Y in 𝒴 := ℝ^d, Assumption 2.6 holds, for instance, with g(t) ∼ t, ℓ(t) ∼ exp(−t); see Section 3 for more details.

Remark 2.9. Clearly, the maximum of an i.i.d. sequence Y_n ∈ ℝ, n ∈ ℕ, is an extensively studied object, see e.g. [10]. However, in this case X is an (ordinary) Markov chain and hence of little interest from the point of view of the present paper. Fortunately, one can derive estimates like (6) also for rather general processes Y. For instance, let Y_t, t ∈ ℕ, be an ℝ-valued so-called L^+-mixing process (see Section 5 for the definition of this concept). Theorem 6.1 of [13] then provides a moment bound for the maximal process, for each q ≥ 1, with some constants C(q) > 0. The Markov inequality implies a corresponding tail bound. Even more, for arbitrarily small χ > 0 and arbitrarily large r ≥ 1, we can set q := 2r/χ in (7).

We now define a number of quantities that will appear in various convergence rate estimates below. For each t ∈ ℕ, we define the quantities r_1(t), r_2(t), r_3(t) and r_4(t).
Now comes the first main result of the present paper: assuming our conditions on drift, minorization, initial values and control of the maxima, µ_t will tend to a limiting law as t → ∞, provided that the terms in r_1(·) and r_2(·) decrease fast enough.
Theorem 2.10. Let Assumptions 2.2, 2.4, 2.5 and 2.6 be in force. Assume r_1(0) + r_2(0) < ∞. Then there is a probability µ* on ℬ such that µ_n → µ* in (1 + V)-weighted total variation as n → ∞.

Theorem 2.12 below is just a variant of Theorem 2.10: under weaker assumptions it provides convergence in a weaker sense.
Remark 2.14. In Theorem 2.13 above, we require a slight strengthening of Assumption 2.4 by imposing (3) with a larger choice of R(n). Condition (12) is closely related to the condition r_1(0) < ∞, but neither of the two implies the other.
For bounded φ we have stronger results, under weaker assumptions.

Theorem 2.15. Let 𝒳 be a Polish space and let ℬ be its Borel field. Let Assumptions 2.2, 2.4, 2.6 and 2.11 be in force, but with R(n) := 8K(n)/λ(n), n ∈ ℕ, in Assumption 2.4. Assume r_3(0) + r_4(0) < ∞. Let φ : 𝒳 → ℝ be bounded and measurable. Then, for every p ≥ 1, L^p convergence in (13) holds.

If Assumptions 2.2 and 2.4 hold with constants λ, α, K then convergence to the limiting law takes place at a geometric rate. More precisely, the following is true.

Theorem 2.16. Let Assumption 2.2 be in force with constants λ, K, and let Assumption 2.4 hold with a constant α and with R := 4K/λ. Let Assumption 2.11 hold. Then there is a probability µ* on ℬ such that µ_n converges to µ* at a geometric rate, i.e.

ρ_1(µ_n, µ*) ≤ c_1 e^{−c_2 n}, n ∈ ℕ,

with some constants c_1, c_2 > 0.

Examples about difference equations in Gaussian environments
In this section we present examples of processes X that satisfy a difference equation, modulated by the process Y .We do not aim at a high degree of generality but prefer to illustrate the power of the results in Section 2 in some simple cases.
We fix 𝒴 := ℝ^d for some d ≥ 1 and 𝒳 := ℝ. We also fix an ℝ^d-valued zero-mean stationary Gaussian process Y_t, t ∈ ℤ. We set ⟨y⟩ := ⌈|y|⌉, y ∈ ℝ^d, as in Example 2.3 above. We will exclusively use V(x) := |x|, x ∈ ℝ, in the examples below.

Remark 3.1. Let ξ_t, t ∈ ℕ, be a zero-mean ℝ-valued stationary Gaussian process with unit variance. It is well-known that in this case

E[ζ_t] ≤ √(2 ln t) (17)

holds for ζ_t := max_{1≤i≤t} ξ_i. Furthermore, for all a > 0,

P(ζ_t − E[ζ_t] ≥ a) ≤ exp(−a²/2), (18)

see [24, 25]. Applying (18) with a := 2t and then proceeding analogously with the process −ξ, it follows from (17) that max_{1≤i≤t} |ξ_i| exceeds a linear function of t with exponentially small probability. Applying these observations to every coordinate of Y, it follows that Assumption 2.6 holds for the process Y with the choice g(k) := ⌈c_1 k⌉, ℓ(k) := exp(−c_2 k) for some c_1, c_2 > 0, and thus r_4(n) decreases at a geometric rate as n → ∞.
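The linear-g, exponential-ℓ behaviour asserted in Remark 3.1 is easy to probe numerically. The sketch below uses a stationary Gaussian AR(1) environment (an assumption for concreteness; any stationary Gaussian process would do) and Monte Carlo estimates of the maximal-process tail.

```python
import numpy as np

def tail_of_maximum(t_values, c=0.5, rho=0.5, n_paths=20_000, seed=1):
    """Monte Carlo estimate of P(max_{1<=i<=t} |Y_i| >= c*t) for a
    stationary Gaussian AR(1) environment Y.  With g(t) growing
    linearly, Assumption 2.6 predicts an exponentially small bound
    ell(t), so these estimates should decay rapidly in t."""
    rng = np.random.default_rng(seed)
    t_max = max(t_values)
    sd0 = 1.0 / np.sqrt(1.0 - rho ** 2)      # stationary std of the AR(1)
    Y = np.empty((n_paths, t_max))
    Y[:, 0] = rng.normal(0.0, sd0, size=n_paths)
    for t in range(t_max - 1):
        Y[:, t + 1] = rho * Y[:, t] + rng.normal(size=n_paths)
    running_max = np.maximum.accumulate(np.abs(Y), axis=1)
    return [float(np.mean(running_max[:, t - 1] >= c * t)) for t in t_values]

probs = tail_of_maximum([4, 8, 12])
```

Since the threshold c·t grows linearly while the Gaussian maximum only grows like √ln t, the estimated probabilities collapse quickly, in line with the choice ℓ(k) = exp(−c_2 k).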
More generally, choosing a := t^b with some b > 0, Assumption 2.6 holds for Y with suitable choices of g and ℓ.

We assume throughout this section that ǫ_t, t ∈ ℕ, is an ℝ-valued i.i.d. sequence, independent of Y_t, t ∈ ℤ; that E|ǫ_0| < ∞; and that the law of ǫ_0 has an everywhere positive density f with respect to the Lebesgue measure which is even and non-increasing on [0, ∞). All these hypotheses could clearly be weakened or modified; we just try to stay as simple as possible.
Example 3.2. First we investigate the effect of the "contraction coefficient" λ in (2). Let d := 1. Let 0 < σ_* ≤ σ* be constants and σ : ℝ × ℝ → [σ_*, σ*] a measurable function. Let furthermore ∆ : ℝ → (0, 1] be even and non-increasing on [0, ∞). We stipulate that the tail of f is not too thin: it is at least as thick as that of a Gaussian variable, that is, for some s > 0.
We assume that the dynamics of X is given by a mean-reverting recursion modulated by Y. We will find K(·), λ(·), α(·) such that Assumptions 2.2 and 2.4 hold, and give an estimate for the rate r_3(n) appearing in (10). (Note that we already have estimates for the rate r_4(n) from Remark 3.1.) The density of X_1 conditional on X_0 = x, Y_0 = y (with respect to the Lebesgue measure) is easily seen to admit a lower bound m(·), and m(·) does not depend on y. Define the corresponding probability measure ν_R. (Here and in the sequel we use the index set ℕ \ {0} instead of ℕ for convenience.) Let η := R̄(y) := 4K/∆(y) and R(n) := R̄(n), n ∈ ℕ. We note that R̄(y) is defined for every y ∈ ℝ while R(n) is defined for every n ∈ ℕ; this is why we keep different notations for these two functions here and also in the subsequent examples. We can conclude, using the tail bound (20), that for all A ∈ ℬ the minorization (3) in Assumption 2.4 holds with a suitable α(·), involving some c_3 > 0, and with ν_n := ν_{R(n)}. Now let the function ∆ be such that ∆(y) := 1 for 0 ≤ y < 3 and ∆(y) ≥ 1/(ln y)^δ with some δ > 0, for all y ≥ 3. We obtain from the previous estimations and from Remark 3.1 an estimate with some c_4 > 0. When δ < 1/2, this leads to estimates on the terms of r_3(n) which guarantee r_3(0) < ∞.
If instead of (20) we assume a fatter tail for f, then r_3(0) < ∞ follows whenever δ < 1. This nicely illustrates the interplay between the feasible fatness of the tail of f and the strength of the mean-reversion ∆(·).
Example 3.4.We now investigate a discrete-time model for financial time series, inspired by the "fractional stochastic volatility model" of [7,11].
Let w_t, t ∈ ℤ, and ǫ_t, t ∈ ℤ, be two sequences of i.i.d. random variables such that the two sequences are also independent of each other. Assume that the w_t are Gaussian. We define the (causal) infinite moving average process

ξ_t := Σ_{j=0}^∞ a_j w_{t−j}, t ∈ ℤ.

This series is almost surely convergent whenever Σ_{j=0}^∞ a_j² < ∞. We take d := 2 here and the random environment will be the 𝒴 = ℝ²-valued process Y_t := (w_t, ξ_t), t ∈ ℤ.
We imagine that ξ t describes the log-volatility of an asset in a financial market.It is reasonable to assume that ξ is a Gaussian linear process (see [11] where the related continuous-time models are discussed in detail).
Let us now consider the ℝ-valued process X which will describe the increment of the log-price of the given asset. Assume that X_0 := 0. The log-price is thus jointly driven by the noise sequences ǫ_t, w_t. The parameter ∆ is responsible for the autocorrelation of X (∆ is typically close to 1). The parameter ρ controls the correlation between the price and its volatility; this is found to be non-zero (actually, negative) in empirical studies, see [8], hence it is important to include w_t, t ∈ ℤ, both in the dynamics of X and in that of Y. We take V(x) := |x| as before. It follows that if |x| ≥ c_8 e^ξ (1 + |w|) for some suitably large c_8 > 0, then a contraction estimate holds, which implies, for all x ∈ ℝ and with some c_9 > 0, that Assumption 2.2 holds with λ(n) := λ := ∆/2 and K(n) := e^n (1 + n).
We now turn our attention to Assumption 2.4. Denote by h_{x,w,ξ}(z), z ∈ ℝ, the density of the law of X_1 conditional on X_0 = x, Y_0 = (w, ξ) with respect to the Lebesgue measure. For x, z ∈ [−η, η] we clearly have a corresponding lower bound for h_{x,w,ξ}(z). We assume from now on that f, the density of ǫ_0, satisfies a power-law lower bound with some s > 0, χ > 1; this is reasonable as X_t has fat tails according to empirical studies, see [8].
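The ingredients of Example 3.4 can be sketched numerically. The moving average ξ is truncated at J lags, and the recursion for X is an ASSUMED concrete realization of "jointly driven by ǫ_t, w_t with AR parameter ∆ and leverage ρ"; the paper's displayed equation may differ.

```python
import numpy as np

def simulate_sv(T=4000, Delta=0.95, rho=-0.3, J=200, seed=3):
    """Sketch of the discrete-time stochastic volatility model of
    Example 3.4.  xi_t = sum_j a_j w_{t-j} is a causal Gaussian moving
    average (truncated at J lags), and the log-price increment X
    follows the ASSUMED recursion
        X_{t+1} = Delta * X_t
                  + exp(xi_t) * (rho * w_{t+1} + sqrt(1-rho^2) * eps_{t+1}),
    one concrete way to let the same Gaussian innovations w drive both
    the volatility xi and the price."""
    rng = np.random.default_rng(seed)
    a = 0.5 ** np.arange(J)              # square-summable MA coefficients
    w = rng.normal(size=T + J)           # Gaussian innovations of xi
    eps = rng.normal(size=T)             # idiosyncratic price noise
    # xi[t] = sum_{j<J} a_j * w[(J-1+t) - j]: full history available
    xi = np.convolve(w, a)[J - 1:J - 1 + T]
    X = np.zeros(T)
    x = 0.0
    for t in range(T - 1):
        x = Delta * x + np.exp(xi[t]) * (rho * w[J + t] +
                                         np.sqrt(1.0 - rho ** 2) * eps[t])
        X[t + 1] = x
    return xi, X

xi, X = simulate_sv()
```

With ρ < 0, a positive volatility shock w tends to push the return X down, reproducing the negative price-volatility correlation reported in [8].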
Although the examples above are rather elementary and restricted in their scope, they point towards large classes of models, relevant in applications, where the results of Section 2 apply in a powerful way.

Proofs of stochastic stability
We first present some results of [15] (see also the related ideas in [16]) which are crucial for the developments of the present paper.
Let us assume that there is a probability ν on ℬ such that the minorization Q(x, ·) ≥ α ν(·) holds on C(R), for some α > 0. Then, for each α_0 ∈ (0, α) and for γ_0 := γ + 2K/R, a contraction estimate in a suitable metric ρ_β holds. For the proof, see Theorem 3.1 in [15]. Next comes an easy corollary.
Let (Θ, 𝒯) be some measurable space. When (x, A) ↦ L(x, A), x ∈ Θ, A ∈ ℬ, is a (not necessarily probabilistic) kernel and Z is a Θ-valued random variable, then we define a measure [L(Z)](·) on ℬ via

[L(Z)](A) := E[L(Z, A)], A ∈ ℬ. (28)

We will use the trivial inequalities (29) in the sequel.

Proof of Theorem 2.10. Fix y := (y_0, y_{−1}, y_{−2}, ...) ∈ 𝒴^{−ℕ} for the moment. Here Q(y) denotes the measure transformation operator described in (4) above where, instead of L(x, A), the kernel Q(y, x, A) is used.

By (29) and by the above definitions, we may proceed as follows. In the sequel we will need the definition (28) for the kernel (z, A) ↦ µ_n(z)(A), z ∈ 𝒴^n, A ∈ ℬ (and for similar kernels). Notice that a corresponding identity holds for any measurable function w: this is trivial for indicators and then follows for all measurable w in a standard way. By similar arguments, an analogous identity also holds. We thus arrive at (30), using the notation M_n := max_{−n+1≤i≤0} ⟨Y_i⟩.
We now estimate the expectation on the right-hand side of (30) separately on the events {M_n ≥ g(n)} and {M_n < g(n)}. Introduce the notation (31). Also, the law of ρ_1(µ_0, Q(Y_{−n})µ_0) equals that of ρ_1(µ_0, Q(Y_0)µ_0). By an application of the Cauchy inequality, we deduce from (30) and (31) the desired estimate for some finite C° > 0, recalling Assumption 2.5.

Now, by elementary properties of the function involved, and noting that ρ_1(µ_0, µ_1) < ∞ by Assumption 2.5, it follows from r_1(0) < ∞ that µ_n, n ≥ 0, is a Cauchy sequence for the complete metric ρ_1. Hence it converges to some probability µ* as n → ∞. The claimed convergence rate also follows by the above estimates.

Proof of Theorem 2.12. The estimates of Theorem 2.10 and (29) imply

This leads to the required bound for some C > 0, using (29) and Assumptions 2.6 and 2.11. The result now follows as in the proof of Theorem 2.10 above.
Proof of Theorem 2.16. Analogously to the proof of Theorem 2.10, we obtain the stated estimate. The result follows again as in the proof of Theorem 2.10 above.

L-mixing processes
Let ℱ_t, t ∈ ℕ, be an increasing sequence of sigma-algebras (i.e. a discrete-time filtration) and let ℱ^+_t, t ∈ ℕ, be a decreasing sequence of sigma-algebras such that, for each t ∈ ℕ, ℱ_t is independent of ℱ^+_t. Let W_t, t ∈ ℕ, be a real-valued stochastic process. For each r ≥ 1, introduce

M_r(W) := sup_{t∈ℕ} E^{1/r}[|W_t|^r].

For each process W such that M_1(W) < ∞ we also define, for each r ≥ 1, the quantities

γ_r(W, τ) := sup_{t≥τ} E^{1/r}[|W_t − E[W_t | ℱ^+_{t−τ}]|^r], τ ∈ ℕ, and Γ_r(W) := Σ_{τ=0}^∞ γ_r(W, τ).

For some r ≥ 1, the process W is called L-mixing of order r with respect to (ℱ_t, ℱ^+_t), t ∈ ℕ, if M_r(W) < ∞ and Γ_r(W) < ∞. We say that W is L-mixing if it is L-mixing of order r for all r ≥ 1. This notion of mixing was introduced in [12].
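For intuition about these quantities, consider a Gaussian moving average W_t = Σ_j a_j e_{t−j} with ℱ_t = σ(e_i, i ≤ t) and ℱ^+_t = σ(e_i, i ≥ t + 1). Then E[W_t | ℱ^+_{t−τ}] retains exactly the innovations with lag smaller than τ, so γ_2(W, τ) has the closed form sketched below (an illustration, not taken from the paper).

```python
import numpy as np

def gamma_2_ma(a, tau):
    """For the moving average W_t = sum_j a_j e_{t-j} with i.i.d.
    standard Gaussian innovations e and F^+_s = sigma(e_i, i > s),
    the conditional expectation E[W_t | F^+_{t-tau}] keeps the
    innovations e_{t-j} with j < tau, hence
        gamma_2(W, tau) = sqrt(sum_{j >= tau} a_j^2)."""
    a = np.asarray(a, dtype=float)
    return float(np.sqrt(np.sum(a[tau:] ** 2)))

a = 0.5 ** np.arange(30)              # square-summable coefficients
gammas = [gamma_2_ma(a, tau) for tau in range(10)]
# Gamma_2(W) = sum_tau gamma_2(W, tau) is finite here,
# so W is L-mixing of order 2 for this choice of a
Gamma_2 = sum(gammas)
```

With geometrically decaying coefficients, γ_2(W, τ) itself decays geometrically, which makes the summability defining Γ_2(W) immediate.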
Remark 5.1. It is easy to check that if W_t, t ∈ ℕ, is L-mixing of order r then so is a time-shifted version of the process. The next lemma (Lemma 2.1 of [12]) is useful when checking the L-mixing property for a given process.

Lemma 5.2. Let 𝒢 ⊂ ℱ be a sigma-algebra and let X, Y be random variables in L^r for some r ≥ 1 such that Y is 𝒢-measurable. Then E^{1/r}[|X − E[X | 𝒢]|^r] ≤ 2 E^{1/r}[|X − Y|^r].
L-mixing is, in many cases, easier to show than other, better-known mixing concepts and it leads to useful inequalities like Lemma 5.3 below.For further related results, see [12].

Lemma 5.3. For an L-mixing process W of order r ≥ 2 satisfying E[W_t] = 0, t ∈ ℕ, the inequality

E^{1/r}[ |Σ_{t=1}^N W_t|^r ] ≤ C_r N^{1/2} M_r^{1/2}(W) Γ_r^{1/2}(W)

holds for each N ≥ 1 with a constant C_r that depends neither on N nor on W.
Finally, we define a slight strengthening of the concept of L-mixing that was introduced in [13]. An L-mixing process W (of order r) is called L^+-mixing (of order r) if there is some θ > 0 such that, for each p ≥ 1 (resp. for p = r), the corresponding decay estimate with exponent θ holds for γ_p(W, ·) with a suitable c_p > 0.

Proofs of ergodicity I
Throughout this section, let the assumptions of Theorem 2.13 be valid: let 𝒳 be a Polish space with Borel field ℬ; let Assumptions 2.2 and 2.5 be in force; and let Assumption 2.4 hold with R(n) := 8K(n)/λ(n), n ∈ ℕ.

We now present a construction that is crucial for proving Theorem 2.13. The random mappings T_t in the lemma below serve to provide the coupling effects that are needed for establishing the L-mixing property (see Section 5 above) for an auxiliary process (Z below), which will, in turn, lead to Theorem 2.13. Such a representation with random mappings was used in [1, 3, 14, 20]. In our setting, however, there is also dependence on y ∈ 𝒴.
For R ≥ 0, denote by 𝒞(R) the set of mappings from 𝒳 into 𝒳 that are constant on C(R) = {x ∈ 𝒳 : V(x) ≤ R}.

Lemma 6.1. There exists a sequence of measurable functions T_t : 𝒴 × 𝒳 × Ω → 𝒳, t ≥ 1, such that

P(T_t(y, x, ·) ∈ A) = Q(y, x, A) (33)

for all t ≥ 1, y ∈ 𝒴, x ∈ 𝒳, A ∈ ℬ, and there are events J_t(y) ∈ ℱ, for all t ≥ 1, y ∈ 𝒴, with P(J_t(y)) ≥ α(⟨y⟩) such that, on J_t(y), the mapping T_t(y, ·, ω) belongs to 𝒞(R(⟨y⟩)). For each t ≥ 1, let ℋ_t denote the sigma-algebra generated by the random variables T_t(y, x, ·), x ∈ 𝒳, y ∈ 𝒴. These sigma-algebras are independent.
Proof. Let U_n, n ∈ ℕ, be an independent sequence of uniform random variables on [0, 1]. Let ǫ_n, n ∈ ℕ, be another such sequence, independent of (U_n)_{n∈ℕ}. By enlarging the probability space, if necessary, we can always construct such random variables and we may even assume that (U_n, ǫ_n), n ∈ ℕ, are independent of (X_0, (Y_t)_{t∈ℤ}).
We assume that 𝒳 is uncountable, the case of countable 𝒳 being analogous but simpler. As 𝒳 is Borel-isomorphic to ℝ, see page 159 of [9], we may and will assume that, actually, 𝒳 = ℝ (we omit the details). The main idea in the arguments below is to separate the "independent component" α(n)ν_n(·) from the rest of the kernel, Q(y, x, ·) − α(n)ν_n(·), for y ∈ A_n and x ∈ C(R(n)). This independent component will ensure the existence of the constant mappings in (34).
Define the mappings, for n ∈ ℕ and y ∈ B_n, accordingly. The claimed independence of the sequence of sigma-algebras clearly holds. It is easy to check (33), too.
Remark 6.2. Note that, in the above construction, (U_n, ǫ_n)_{n∈ℕ} was taken to be independent of (X_0, (Y_t)_{t∈ℤ}). This will be important later, in the proof of Theorem 2.13.
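The splitting idea behind Lemma 6.1 can be sketched in a few lines of Python. The kernel below is a toy example (not the paper's): it satisfies Q(y, x, ·) ≥ α ν(·) with ν = N(0, 1), and the random mapping drawn is constant in x with probability α, which is exactly the coupling mechanism.

```python
import numpy as np

def make_random_mapping(alpha, rng):
    """Draw one random mapping T(y, x, omega) realizing the toy kernel
    Q(y, x, .) = alpha * N(0, 1) + (1 - alpha) * N(x/2, 1),
    which satisfies the minorization Q(y, x, .) >= alpha * nu(.) with
    nu = N(0, 1).  With probability alpha, the mapping ignores x
    entirely (constant branch: the coupling event J); otherwise it is
    an additive-noise map drawn from the residual kernel."""
    if rng.uniform() <= alpha:
        z = rng.normal()               # one draw from nu, shared by all x
        return lambda y, x: z          # constant in x: coupling occurs
    eps = rng.normal()
    return lambda y, x: 0.5 * x + eps  # residual branch

rng = np.random.default_rng(0)
alpha = 0.3
# two chains driven by the SAME random mapping merge on the constant branch
n_coupled = 0
for _ in range(4000):
    T = make_random_mapping(alpha, rng)
    n_coupled += T(0.0, 1.0) == T(0.0, -1.0)
frac = n_coupled / 4000
```

Running two copies of the chain through the same sequence of mappings, the copies merge at every step where the constant branch is drawn; the empirical merging frequency is close to α, matching the "coupling effect of strength α" described after Assumption 2.4.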
We drop the dependence of the mappings T_t on ω in the notation from now on and will simply write T_t(y, x). We continue our preparations for the proof of Theorem 2.13. Let ℱ_t := σ(ǫ_i, U_i, i ≤ t) and ℱ^+_t := σ(ǫ_i, U_i, i ≥ t + 1), t ∈ ℕ. Take an arbitrary element x ∈ 𝒳; this will remain fixed throughout this section.
Our approach to the ergodic theorem for X does not rely on the Markovian structure; rather, it proceeds through establishing a convenient mixing property. The ensuing arguments will lead to Theorem 2.13 via the L-mixing property of certain auxiliary Markov chains. It turns out that L-mixing is particularly well-adapted to Markov chains, even when they are inhomogeneous (and for us this is the crucial point). The main ideas of the arguments below go back to [1], [3], [14] and [20].
In [14] and [20], Doeblin chains were treated.We need to extend those arguments substantially in the present, more complicated setting.
Let us fix y = (y_0, y_1, ...) ∈ 𝒴^ℕ till further notice such that, for some H ∈ ℕ, ⟨y_j⟩ ≤ H holds for all j ∈ ℕ. Define Z_0 := X_0 and Z_{t+1} := T_{t+1}(y_t, Z_t), t ∈ ℕ. Clearly, the process Z heavily depends on the choice of y; however, for a while we do not signal this dependence, for notational simplicity. Fix also m ∈ ℕ till further notice. Define Ẑ_m := x and Ẑ_{t+1} := T_{t+1}(y_t, Ẑ_t), t ≥ m. Notice that Ẑ_t, t ≥ m, are ℱ^+_m-measurable. Our purpose will be to prove that, with a large probability, Z_{m+τ} = Ẑ_{m+τ} for τ large enough. In other words, a coupling between the processes Z and Ẑ is realized.
Fix ε > 0, which will be specified later. Let τ ≥ 1 be an arbitrary integer. Denote Z̄_t := (Z_t, Ẑ_t), t ≥ m. Define the (ℱ_t)_{t∈ℕ}-stopping times σ_n, n ∈ ℕ.

Proof (of Lemma 6.3). Assumption 2.2 easily implies the stated estimate for k ≥ 1; a similar estimate holds for Ẑ. The counterpart of the above lemma for X instead of Z is the following.

Lemma 6.4.
see the proof of Theorem 2.10.
The results below serve to control the number of returns to D and the probability of coupling between the processes Z and Ẑ. Our estimation strategy in the proof of Theorem 2.13 will be the following. We will control P(Ẑ_{τ+m} ≠ Z_{τ+m}) for large τ: either there were only few returns of the process Z̄ to D (which happens with small probability), or there were many returns but coupling did not occur (which also has small probability). First let us present a lemma controlling the number of returns to D.

Lemma 6.5. There is C > 0 such that the stated bound holds, where ̺(H) := ln(1 + λ(H)/2). In particular, σ_n < ∞ a.s. for each n ∈ ℕ. Furthermore, C does not depend on y, m or H.
Proof. We can estimate, for k ≥ 1 and n ≥ 1, as follows. Notice that, on {Z̄_{σ_n+k−1} ∉ D}, either Z_{σ_n+k−1} or Ẑ_{σ_n+k−1} falls outside D. Let us assume that Z_{σ_n+k−1} does so (the other case can be treated analogously). Assumption 2.2 and the observation (35) imply the required estimate. This argument can clearly be iterated and, since Z̄_{σ_n} ∈ D, leads to the claimed bound by Assumption 2.2. In the case n = 0, we arrive at a similar bound instead, by Lemma 6.3. Now we turn from probabilities to expectations. Using e^{̺(H)} ≤ 2, we can estimate, for n ≥ 1.
When n = 0, we obtain a similar bound for some C ≥ 8. The statement follows.
Now we make the choice of ε(H).

Proof (of Lemma 6.6). Lemma 6.5 and the tower rule for conditional expectations easily imply the corresponding moment bound. Hence, by the Markov inequality, the tail estimate follows. The statement now follows by direct calculations: indeed, this choice of ε(H) and τ ≥ 1/ε(H) imply the required inequality. The next lemma controls the probability of coupling between Z and Ẑ.

Lemma 6.7.
Proof. For typographical reasons, we will write σ(n) instead of σ_n. Recall the proof of Lemma 6.1. As easily seen, U_{σ(ϑ−1)+1} is independent of ℱ_{σ(ϑ−1)}, so the conditional probability that coupling fails at the corresponding return is at most 1 − α(H). Iterating the above argument, we arrive at the statement of this lemma, using 1 − x ≤ e^{−x}, x ≥ 0.

Lemma 6.8. Let φ : 𝒳 → ℝ be measurable with

|φ(x)| ≤ C(1 + V(x))^δ, x ∈ 𝒳,

for some 0 < δ ≤ 1/2 and C > 0. Then the process φ(Z_t), t ∈ ℕ, is L-mixing of order p with respect to (ℱ_t, ℱ^+_t), t ∈ ℕ, for all 1 ≤ p < 1/δ. Furthermore, Γ_p(φ(Z)) and M_p(φ(Z)) have upper bounds that do not depend on y, only on H.
Since E[φ(Z^Y_t)] = E[φ(X_t)] converges to ∫_𝒳 φ(x) µ*(dx) at the rate given by Theorem 2.10, we can conclude that L^p convergence of the averages indeed takes place at the stated rate for N ≥ 1, with some F = F(p) > 0. This proves the statement of the theorem also for 1 ≤ p < 2.
Remark 6.9.It would be natural to have Theorem 2.13 also for each 1/2 < δ ≤ 1.However, this case eludes our present method of proof.

Proofs of ergodicity II
Proof of Theorem 2.15. This follows very closely the proof of Theorem 2.13; we only point out the differences. Denote by S an upper bound for |φ|. Take an arbitrary p ≥ 2. We may use the Hölder inequality with exponents 1 and ∞ in the estimates (39). This leads to the desired bound.

Proof of Theorem 2.17. The steps in the proofs of Theorems 2.13 and 2.15 can be repeated with λ, K, α, ̺ not depending on H. Hence π can also be chosen constant. Now we turn to the proof of almost sure convergence. Take p > 2 and q := (p − 2)/(4p). Apply the Markov inequality and (42) to obtain the tail bound. Almost sure convergence follows by the Borel–Cantelli lemma since (p/2) − qp > 1.
Remark 7.1.Under the conditions of Theorem 2.17 we get the L-mixing property of order p for φ(Z y ), with Γ p (φ(Z y )), M p (φ(Z y )) admitting an upper bound independent of y, for 1 ≤ p < 1/δ.When φ is bounded, the same holds for each p ≥ 1.
Remark 7.2.It is possible to prove the convergence of µ n to some µ * relying on the coupling techniques of Sections 6 and 7, without using the ideas in [15].These methods, however, would lead to somewhat weaker results than those of Section 4.
Hence we preferred to use the elegant approach of [15].
Remark 7.3. Let X_t, t ∈ ℕ, be an 𝒳-valued Markov chain with X_0 = x_0, where 𝒳 is a Polish space with Borel field ℬ. Denoting the transition kernel of X by Q(x, A), x ∈ 𝒳, A ∈ ℬ, we impose two standard assumptions (see [19, 15]) for geometric ergodicity:

[QV](x) ≤ (1 − λ)V(x) + K, x ∈ 𝒳, (44)

for some measurable function V : 𝒳 → [0, ∞), 0 < λ ≤ 1, K > 0, and a minorization (45) for some probability ν and constant α > 0. Under these assumptions, the process X fits our framework above (choosing 𝒴 to be a singleton) and the arguments of Lemma 6.8 show that, for 0 < δ ≤ 1/2 and for any measurable φ : 𝒳 → ℝ with |φ(x)| ≤ c(1 + V(x))^δ for some c > 0, the process φ(X_t) is L-mixing of order p for each 1 ≤ p < 1/δ. Furthermore, M_p(φ(X)) + Γ_p(φ(X)) ≤ c for some c > 0. When φ is bounded, these results hold for each p ≥ 1. As γ_p(φ(X), τ) decreases exponentially in τ, we actually get the L^+-mixing property, too, see Section 5. Even these results (which form a very particular case in our framework) are new and interesting: on one hand, they establish a useful mixing property for a wide class of Markov processes; on the other hand, they underline the versatility of the concept of L-mixing.

Example 2.3.
A typical case is where 𝒴 is a subset of a Banach space with norm ‖·‖; 𝒜 its Borel field; A_n := {y ∈ 𝒴 : ‖y‖ ≤ n}, n ∈ ℕ. In this setting ⟨y⟩ = ⌈‖y‖⌉, where ⌈·⌉ stands for the ceiling function. In the examples of the present paper we will always have 𝒴 = ℝ^d with some d ≥ 1, and |·| = ‖·‖ will denote the respective Euclidean norm.
Note that the law of Z^Y_t, t ∈ ℕ, equals that of X_t, t ∈ ℕ, by the construction of Z and by Remark 6.2. Fix p ≥ 2. Fix N ∈ ℕ for the moment and take any y ∈ 𝒴^ℕ satisfying |y_j| ≤ g(N), j ∈ ℕ. Define W_t(y) := φ(Z^y_t). Using elementary properties of the functions x ↦ 1 − e^{−x} and x ↦ ln(1 + x), the L-mixing property of order p follows. (Note, however, that c‴ depends on p and δ as well as on E[V(X_0)].)

Proof of Theorem 2.13. Now we start signalling the dependence of Z on y and hence write Z^y_t, t ∈ ℕ. Let Y̅ ∈ 𝒴^ℕ be defined by Y̅_j := Y_j, j ∈ ℕ.