1 Introduction

We consider a time-homogeneous Markov chain \((X_t)_{t\in {\mathbb N}_0}\) driven by a transition kernel which satisfies a certain monotonicity property: the conditional distribution of the random variable at time t becomes stochastically larger as the value of the variable at time \(t-1\) increases. This condition is satisfied by several popular time series models, such as autoregressive and integer-valued autoregressive processes as well as integer-valued ARCH processes, under natural assumptions on the involved parameters. To be specific, we assume that, for each fixed z, \(F_x(z):=P(X_t\le z\mid X_{t-1}=x)\) is antitonic (monotonically non-increasing) in x. This assumption allows us to employ a nonparametric antitonic estimator \(\widehat{F}_x(z)\) of the function \(x\mapsto F_x(z)\). Our estimator does not involve any tuning parameter controlling the degree of smoothing and is therefore easy to apply. Moreover, its consistency does not require smoothness properties of the function \(x\mapsto F_x(z)\); the postulated monotonicity suffices. Theorem 2.1 states that the estimator \(\widehat{F}_x(z)\) converges in the \(L^1\) norm, weighted by the stationary distribution of the Markov chain, at the rate \(n^{-1/3}\), which we believe to be optimal.

The estimator of \(F_x(z)\) serves as a basis for a new bootstrap method for Markov chains. Among several other methods, those proposed by Rajarshi (1990) and Paparoditis and Politis (2002) are the closest to our proposal. While Rajarshi’s bootstrap procedure is based on a nonparametric estimate of the one-step transition density, Paparoditis and Politis (2002) used in their so-called local bootstrap a local resampling of the original data set. In both papers, the proof of consistency of the respective bootstrap method relies on the assumption of a smooth transition density. In contrast, our approach does not require any smoothness assumption on the transition mechanism; it is merely based on the monotonicity assumption on the Markov kernel. We show its applicability for Markov chains with state space \({\mathbb N}_0=\{0,1,2,\ldots \}\). Consistency of the bootstrap can be shown in a transparent way by a so-called coupling of the original process and its bootstrap counterpart, i.e. we define versions \((\widetilde{X}_t)_{t\in {\mathbb N}_0}\) and \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) of these processes on a common probability space \((\widetilde{\Omega },\widetilde{\mathcal A},\widetilde{P})\) such that the corresponding random variables \(\widetilde{X}_t\) and \(\widetilde{X}_t^*\) are equal with high probability. Somewhat surprisingly, this natural approach has rarely been used in statistics. Using the Mallows metric to measure the distance between variables from the original and the bootstrap process, it was implicitly employed in the context of independent random variables by Bickel and Freedman (1981) and Freedman (1981). A more explicit use of coupling was made, in the context of U- and V-statistics, but again in the independent case, by Dehling and Mikosch (1994) and Leucht and Neumann (2009). For dependent data, this approach was adopted by Leucht and Neumann (2013), Leucht et al. (2015), and Neumann (2021). Our second main result, Theorem 3.1, describes the results of our coupling approach. The stationary distribution \(P^*_{X^*}\) of the bootstrap process converges in total variation norm and in probability to that of the original process. The coupled process is \(\phi \)-mixing with coefficients decaying at an exponential rate, and the corresponding values \(\widetilde{X}_t^0\) and \(\widetilde{X}_t^{*,0}\) of a stationary version of the coupled process coincide with a probability converging to 1. These general results can then be used to prove bootstrap consistency for specific statistics. The proofs of our main theorems and some auxiliary results are postponed to the final Sect. 4.

2 An estimator of a monotone family of distribution functions

Suppose that we observe random variables \(X_0,X_1,\ldots ,X_n\), where \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is a strictly stationary Markov chain with state space \(D\subseteq {\mathbb R}\), defined on a probability space \((\Omega ,{\mathcal A},P)\). We denote the stationary distribution by \(P_X\) and the corresponding distribution function by \(F_X\). Let \((F_x)_{x\in {\mathbb R}}\), defined by \(F_x(z)=P(X_t\le z\mid X_{t-1}=x)\), be the corresponding family of conditional distribution functions. We impose the following as our key assumption.

  1. (A1)

    For each \(z\in {\mathbb R}\), the function \(x\mapsto F_x(z)\) is monotonically non-increasing, i.e. if \(x_1< x_2\), then \(P(X_t\le z\mid X_{t-1}=x_1)\ge P(X_t\le z\mid X_{t-1}=x_2)\). In addition we suppose that

  2. (A2)

    \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is strong mixing with exponentially decaying coefficients \(\alpha _X(k)\), i.e.

    $$\begin{aligned} \alpha _X(k) \,=\, O\big ( \rho ^k \big ), \end{aligned}$$

    for some \(\rho \in [0,1)\).

Assumption (A1) may be paraphrased as follows. If \(x_1<x_2\) and if \(Y_1\) and \(Y_2\) are random variables following the respective conditional distributions \(P^{X_t\mid X_{t-1}=x_1}\) and \(P^{X_t\mid X_{t-1}=x_2}\), then \(Y_2\) is stochastically not smaller than \(Y_1\). It turns out that this assumption is actually satisfied by popular classes of Markov chain models under natural assumptions. Here is a list of models we have in mind:

  1. (1)

    Nonlinear autoregressive processes with non-decreasing link The process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is assumed to obey the model equation

    $$\begin{aligned} X_t \,=\, f(X_{t-1}) \,+\, \varepsilon _t \qquad \forall t\in {\mathbb N}, \end{aligned}$$

    where \((\varepsilon _t)_{t\in {\mathbb N}}\) is a sequence of i.i.d. random variables and \(\varepsilon _t\) is independent of \(X_{t-1},\ldots ,X_0\). If the function \(f:\,{\mathbb R}\rightarrow {\mathbb R}\) is monotonically non-decreasing, then, for \(x_1<x_2\),

    $$\begin{aligned} P\big (X_t\le z\mid X_{t-1}=x_1\big )= & {} P\big (\varepsilon _t\le z-f(x_1)\big ) \,\ge \, P\big (\varepsilon _t\le z-f(x_2)\big )\\= & {} P\big (X_t\le z\mid X_{t-1}=x_2\big ). \end{aligned}$$

    Furthermore, if \(\varepsilon _t\) has an everywhere positive density and if

    $$\begin{aligned} \big | f(x) \big | \,\le \, \gamma |x| \,-\, \epsilon \qquad \forall |x|\ge K, \end{aligned}$$

    for some \(\gamma <1\), \(\epsilon >0\), and \(K<\infty \), then the process \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see e.g. Doukhan (1994).

  2. (2)

    Branching processes with immigration Let \(X_0\), \((Z_{t,k})_{t,k\in {\mathbb N}}\) and \((\varepsilon _t)_{t\in {\mathbb N}}\) be mutually independent random variables taking values in \({\mathbb N}_0\). We assume that \((Z_{t,k})_{t,k\in {\mathbb N}}\) as well as \((\varepsilon _t)_{t\in {\mathbb N}}\) are sequences of identically distributed random variables. Then the process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) given by

    $$\begin{aligned} X_t \,=\, \sum _{k=1}^{X_{t-1}} Z_{t,k} \,+\, \varepsilon _t \qquad \forall t\in {\mathbb N}\end{aligned}$$

    is a branching process with immigration. In the special case of \(Z_{t,k}\sim \hbox {Bin}(1,\alpha )\) we obtain a so-called first-order integer-valued autoregressive (INAR(1)) process, which was proposed by McKenzie (1985) and Al-Osh and Alzaid (1987). Since the \(Z_{t,k}\) are non-negative random variables, it is obvious that (A1) is fulfilled. If in addition \(E\varepsilon _t<\infty \) and \(EZ_{t,k}<1\), then \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see Pakes (1971). A simulation sketch of the INAR(1) special case is given after this list.

  3. (3)

    Poisson-INARCH processes The process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is an integer-valued ARCH process of order 1 with Poisson innovations (Poisson-INARCH(1)) if

    $$\begin{aligned} X_t\mid {\mathcal F}_{t-1} \sim \hbox {Poisson}\big ( f(X_{t-1}) \big ), \end{aligned}$$

    where \({\mathcal F}_s\) denotes the \(\sigma \)-algebra generated by \(X_0,\ldots ,X_s\). If f is monotonically non-decreasing, then we obtain, for \(x_1<x_2\) and \(Y_1\sim \hbox {Poisson}(f(x_1))\), \(Y_2\sim \hbox {Poisson}(f(x_2))\),

    $$\begin{aligned} P( X_t\le z\mid X_{t-1}=x_1 ) \,=\, P( Y_1\le z) \,\ge \, P( Y_2\le z) \,=\, P( X_t\le z\mid X_{t-1}=x_2 ), \end{aligned}$$

    i.e., (A1) is fulfilled. Furthermore, if in addition

    $$\begin{aligned} f(x) \,\le \, \gamma x \,-\, \epsilon \qquad \forall x\ge K, \end{aligned}$$

    for some \(\gamma <1\), \(\epsilon >0\), and \(K<\infty \), then \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see e.g. Theorem 2 in Doukhan (1994, Sec. 2.4, p. 90).
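As a concrete illustration of model (2), the following minimal Python sketch simulates an INAR(1) process with \(Z_{t,k}\sim \hbox {Bin}(1,\alpha )\); the Poisson immigration law, the parameter values, and the starting value are arbitrary choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_inar1(n, alpha=0.5, immigration_mean=1.0, x0=0):
    """INAR(1) special case of model (2): X_t = sum_{k=1}^{X_{t-1}} Z_{t,k} + eps_t,
    with Z_{t,k} i.i.d. Bin(1, alpha) and, as an illustrative choice,
    eps_t i.i.d. Poisson(immigration_mean)."""
    X = np.empty(n + 1, dtype=int)
    X[0] = x0
    for t in range(1, n + 1):
        # binomial thinning: sum of X_{t-1} independent Bernoulli(alpha) variables
        survivors = rng.binomial(X[t - 1], alpha)
        X[t] = survivors + rng.poisson(immigration_mean)
    return X

X = simulate_inar1(1000)
```

Since \(EZ_{t,k}=\alpha =0.5<1\) and the immigration distribution has finite mean, this example falls under the conditions stated above.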

We consider an estimator of \(F_x(z)=P\big ( X_t\le z\mid X_{t-1}=x\big )\) which takes into account that the function \(x\mapsto F_x(z)\) is monotonically non-increasing under (A1). Nonparametric estimators of monotone functions have a long history and were proposed e.g. by Brunk (1955) and Ayer et al. (1955). Denote by \({\mathbb {1}}(\cdot )\) the indicator function. For \(z\in D\) and \(x\in \{X_0,\ldots ,X_{n-1}\}\), we define

$$\begin{aligned} \widehat{F}_x^{(\max -\min )}(z):=\, \max _{v:\, v\ge x} \; \min _{u:\, u\le x} \frac{\sum _{t=1}^n {\mathbb {1}}\big ( X_t\le z, X_{t-1}\in [u,v]\big )}{\#\{t\le n:\, X_{t-1}\in [u,v]\}} \end{aligned}$$
(2.1a)

and

$$\begin{aligned} \widehat{F}_x^{(\min -\max )}(z):=\, \min _{u:\, u\le x} \; \max _{v:\, v\ge x} \frac{\sum _{t=1}^n {\mathbb {1}}\big ( X_t\le z, X_{t-1}\in [u,v]\big )}{\#\{t\le n:\, X_{t-1}\in [u,v]\}}. \end{aligned}$$
(2.1b)

It is well-known that \(\widehat{F}_x^{(\max -\min )}(z)=\widehat{F}_x^{(\min -\max )}(z)\) for all \(x\in \{X_0,\ldots ,X_{n-1}\}\), see e.g. Theorem 1 in Brunk (1955) and Theorem 1.4.4 in Robertson, Wright, and Dykstra (1988, p. 23). As pointed out by Deng and Zhang (2020), (2.1a) and (2.1b) have to be modified for \(x\not \in \{X_0,\ldots ,X_{n-1}\}\). Since it could well happen that an interval with \(x\in [u,v]\) does not contain any point from the collection \(\{X_0,\ldots ,X_{n-1}\}\) we set \(n_{u,v}=\#\{t\le n:\, X_{t-1}\in [u,v]\}\), \(n_{u,*}=\#\{t\le n:\, u\le X_{t-1}\}\), \(n_{*,v}=\#\{t\le n:\, X_{t-1}\le v\}\), and define

$$\begin{aligned} \widehat{F}_x^{(\max -\min )}(z):=\, \max _{v:\, v\ge x,\,n_{*,v}>0} \; \min _{u:\, u\le x,\, n_{u,v}>0} \frac{\sum _{t=1}^n {\mathbb {1}}\big ( X_t\le z, X_{t-1}\in [u,v]\big )}{\#\{t\le n:\, X_{t-1}\in [u,v]\}} \nonumber \\ \end{aligned}$$
(2.2a)

and

$$\begin{aligned} \widehat{F}_x^{(\min -\max )}(z):=\, \min _{u:\, u\le x,\,n_{u,*}>0} \; \max _{v:\, v\ge x,\,n_{u,v}>0} \frac{\sum _{t=1}^n {\mathbb {1}}\big ( X_t\le z, X_{t-1}\in [u,v]\big )}{\#\{t\le n:\, X_{t-1}\in [u,v]\}}.\nonumber \\ \end{aligned}$$
(2.2b)

The estimators \(\widehat{F}_x^{(\max -\min )}(z)\) and \(\widehat{F}_x^{(\min -\max )}(z)\) are both non-increasing in x as the maxima are taken over non-increasing classes indexed by x and the minima over non-decreasing classes. Furthermore, for fixed \(x\in D\), the mappings \(z\mapsto \widehat{F}_x^{(\max -\min )}(z)\) and \(z\mapsto \widehat{F}_x^{(\min -\max )}(z)\) are non-decreasing which follows from the isotonicity of the functions \(z\mapsto {\mathbb {1}}(X_t\le z, X_{t-1}\in [u,v])\). Furthermore, if \(X_{[1]},\ldots ,X_{[n]}\) is an enumeration of the values in \(\{X_1,\ldots ,X_n\}\) in non-decreasing order, then it follows that, again for fixed \(x\in D\), the mappings \(z\mapsto \widehat{F}_x^{(\max -\min )}(z)\) and \(z\mapsto \widehat{F}_x^{(\min -\max )}(z)\) are constant on the half-open intervals \([X_{[k]},X_{[k+1]})\) (\(k=1,\ldots ,n-1\)), and attain the respective values 0 and 1 on \((-\infty ,X_{[1]})\) and \([X_{[n]},\infty )\). Hence, these estimators are genuine probability distribution functions.

We choose as our estimator of \(F_x(z)\)

$$\begin{aligned} \widehat{F}_x(z):=\, \big ( \widehat{F}_x^{(\max -\min )}(z) \,+\, \widehat{F}_x^{(\min -\max )}(z) \big )/2. \end{aligned}$$
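For concreteness, here is a minimal Python sketch of (2.2a), (2.2b), and their average. It evaluates the empirical ratios over candidate endpoints taken from the observed states \(X_0,\ldots ,X_{n-1}\) (together with x itself), which is sufficient because the ratios only change at these values; the implementation aims at transparency rather than efficiency.

```python
import numpy as np

def antitonic_cdf_estimate(X, x, z):
    """Estimate F_x(z) = P(X_t <= z | X_{t-1} = x) by the average of the
    max-min estimator (2.2a) and the min-max estimator (2.2b)."""
    X = np.asarray(X, dtype=float)
    pred, succ = X[:-1], X[1:]                 # pairs (X_{t-1}, X_t), t = 1,...,n

    def ratio(u, v):
        """Empirical conditional probability over [u, v]; None if n_{u,v} = 0."""
        in_uv = (pred >= u) & (pred <= v)
        n_uv = in_uv.sum()
        return None if n_uv == 0 else ((succ <= z) & in_uv).sum() / n_uv

    cand = np.unique(np.append(pred, x))       # candidate interval endpoints
    us, vs = cand[cand <= x], cand[cand >= x]

    # (2.2a): max over v >= x with n_{*,v} > 0 of the min over u <= x with n_{u,v} > 0
    outer_max = []
    for v in vs:
        if (pred <= v).sum() == 0:
            continue
        inner = [r for u in us for r in [ratio(u, v)] if r is not None]
        outer_max.append(min(inner))

    # (2.2b): min over u <= x with n_{u,*} > 0 of the max over v >= x with n_{u,v} > 0
    outer_min = []
    for u in us:
        if (pred >= u).sum() == 0:
            continue
        inner = [r for v in vs for r in [ratio(u, v)] if r is not None]
        outer_min.append(max(inner))

    return 0.5 * (max(outer_max) + min(outer_min))
```

For a single z, the whole function \(x\mapsto \widehat{F}_x(z)\) can alternatively be computed in one pass by isotonic-regression algorithms such as pool-adjacent-violators, but the direct formulas above mirror the definitions.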

It follows that all of the above properties of \(\widehat{F}_x^{(\max -\min )}(z)\) and \(\widehat{F}_x^{(\min -\max )}(z)\) are inherited by \(\widehat{F}_x(z)\). Its performance is characterized by the following theorem.

Theorem 2.1

Suppose that (A1) and (A2) are fulfilled. Then

$$\begin{aligned} \sup _z \left\{ E\left[ \int _D \big | \widehat{F}_x(z) \,-\, F_x(z) \big | \, dP_X(x) \right] \right\} \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$

The rate of convergence \(n^{-1/3}\) is known to be optimal in related problems of estimating a monotone function on the basis of independent random variables; see e.g. Durot (2002, Theorem 1) and Zhang (2002, Theorem 2.3). We believe that this rate cannot be improved in our more delicate case of time series data. Note that Mösching and Dümbgen (2020) considered a nonparametric antitonic estimator of \(F_x\) in a regression context where the dependent variables, conditional on the regressors, are independent. Under additional Hölder conditions, they derived rates of uniform and pointwise convergence for this estimator.

Our approach to prove this result can be most easily explained if the distribution function \(F_X\) is continuous. We split the domain D into \(k_n=\lfloor n^{1/3}\rfloor \) intervals \(I_k=[x_{k-1},x_k)\), where \(x_0=-\infty \) if \(D={\mathbb R}\), \(x_0=0\) if \(D={\mathbb N}_0\) and, in both cases, \(x_k=F_X^{-1}(k/k_n)=\inf \{x:\, F_X(x)\ge k/k_n\}\) for \(k=1,\ldots ,k_n-1\), \(x_{k_n}=\infty \). (As usual, \(\lfloor a\rfloor \) denotes the largest integer less than or equal to a.) We can expect a favorable behavior of \(\widehat{F}_x(z)\) if \(N_k(\omega ):=\#\{t\le n:\, X_{t-1}(\omega )\in I_k\}\) is sufficiently large for all k. Let

$$\begin{aligned} A_n \,=\, \big \{\omega :\, N_k(\omega )\ge n/(2k_n) \quad \hbox {for all }\; k=1,\ldots ,k_n\big \}. \end{aligned}$$

It follows from Lemma 4.2 that \(P(A_n^c)=O(n^{-1/3})\). Since \(\int _D \big | \widehat{F}_x(z) \,-\, F_x(z) \big | \, dP_X(x)\le 1\) holds with probability 1, we obtain that

$$\begin{aligned} E\left[ \int _D \big | \widehat{F}_x(z) \,-\, F_x(z) \big | \, dP_X(x) \; {\mathbb {1}}_{A_n^c}\right] \,\le \, P\big (A_n^c\big ) \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$
(2.3)

To estimate \(E\big [ \int _D \big (\widehat{F}_x(z) - F_x(z)\big )^+ \, dP_X(x) \; {\mathbb {1}}_{A_n}\big ]\) we proceed as follows. For \(x\in I_k\), \(k\in \{2,\ldots ,k_n\}\), we use the estimate

$$\begin{aligned}{} & {} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, {\mathbb {1}}_{A_n} \le \big ( \widehat{F}_{x_{k-1}}(z) \,-\, F_{x_k}(z) \big )^+ \, {\mathbb {1}}_{A_n} \\{} & {} \quad \le \max _{v:\, v\ge x_{k-1}} \left\{ \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z) \big ]\; {\mathbb {1}}(X_{t-1}\in [x_{k-2},v]) \big | }{ \#\{t\le n:\, X_{t-1}\in [x_{k-2},v]\} } \; {\mathbb {1}}_{A_n} \right\} \\{} & {} \qquad +\, \big ( F_{x_{k-2}}(z) \,-\, F_{x_k}(z) \big ). \end{aligned}$$

We obtain from Lemma 4.3 that

$$\begin{aligned}{} & {} E\left[ \max _{v:\, v\ge x_{k-1}} \left\{ \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z) \big ]\; {\mathbb {1}}(X_{t-1}\in [x_{k-2},v]) \big | }{ \#\{t\le n:\, X_{t-1}\in [x_{k-2},v]\} } \; {\mathbb {1}}_{A_n} \right\} \right] \nonumber \\{} & {} \quad =\, O\big ( n^{-1/3} \big ). \end{aligned}$$

Since

$$\begin{aligned} \sum _{k=2}^{k_n} \big ( F_{x_{k-2}}(z) \,-\, F_{x_k}(z) \big )= & {} \sum _{k=2}^{k_n} \big ( F_{x_{k-2}}(z) \,-\, F_{x_{k-1}}(z) \big ) \,+\, \sum _{k=2}^{k_n} \big ( F_{x_{k-1}}(z) \,-\, F_{x_k}(z) \big ) \\= & {} \big ( F_{x_0}(z) \,-\, F_{x_{k_n-1}}(z) \big ) \,+\, \big ( F_{x_1}(z) \,-\, F_{x_{k_n}}(z) \big ) \,\le \, 2, \end{aligned}$$

we conclude that

$$\begin{aligned} \sum _{k=2}^{k_n} E\left[ \int _{I_k} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x) \; {\mathbb {1}}_{A_n} \right] \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$

Furthermore, the rough estimate

$$\begin{aligned} E\left[ \int _{I_1} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x) \; {\mathbb {1}}_{A_n} \right] \,\le \, P_X( I_1 ) \,\le \, n^{-1/3} \end{aligned}$$

is obviously true, which leads to

$$\begin{aligned} E\left[ \int _{D} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x) \; {\mathbb {1}}_{A_n} \right] \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$
(2.4)

We can prove

$$\begin{aligned} E\left[ \int _{D} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^- \, dP_X(x) \; {\mathbb {1}}_{A_n} \right] \,=\, O\big ( n^{-1/3} \big ) \end{aligned}$$
(2.5)

in complete analogy to (2.4). The result stated in Theorem 2.1 then follows from (2.3)–(2.5). In the general case we have to take into account that the distribution function \(F_X\) is not necessarily continuous. This leads to a technically more involved proof which is presented in full detail in Sect. 4.

The following pictures give an impression of how the functions \(x\mapsto F_x(z)\) are approximated by \(\widehat{F}_x(z)\) for different values of z. We simulated a Poisson-INARCH process of order 1, where \(X_t\mid X_{t-1},X_{t-2},\ldots \sim \hbox {Poisson}\big ( f(X_{t-1}) \big )\) and \(f(x)=\min \big \{\alpha _0+\alpha _1 x, \beta \big \}\). The parameters \(\alpha _0\) and \(\alpha _1\) are chosen as 2.0 and 0.5, respectively, and the truncation constant \(\beta \) is set to 6.0. For a sample size \(n=1000\) and \(z=0,1,\ldots ,11\), the following pictures show \(F_x(z)\) (red lines) and a corresponding estimate \(\widehat{F}_x(z)\) (blue lines). These results are quite encouraging except for large values of x. We conjecture that this deficiency is caused by data sparsity in this region.

[Figure: the true functions \(x\mapsto F_x(z)\) (red) and the antitonic estimates \(x\mapsto \widehat{F}_x(z)\) (blue) for \(z=0,1,\ldots ,11\)]
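For reference, a minimal Python sketch of the simulation setting used for these pictures (the random seed and the starting value, which in practice would be followed by a burn-in period, are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_poisson_inarch(n, alpha0=2.0, alpha1=0.5, beta=6.0, x0=0):
    """Truncated Poisson-INARCH(1) process:
    X_t | X_{t-1} ~ Poisson(f(X_{t-1})) with f(x) = min(alpha0 + alpha1 * x, beta)."""
    X = np.empty(n + 1, dtype=int)
    X[0] = x0
    for t in range(1, n + 1):
        X[t] = rng.poisson(min(alpha0 + alpha1 * X[t - 1], beta))
    return X

X = simulate_poisson_inarch(1000)   # sample size n = 1000 as in the pictures
```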

3 A new bootstrap method for Markov chains

Our estimator \(\widehat{F}_x(z)\) can be used for bootstrapping Markov processes, and it is particularly suitable in the case of Markov chains with a finite or countably infinite state space. In what follows we assume that \((X_t)_{t\in {\mathbb N}_0}\) is a stationary Markov chain with state space \(D\subseteq {\mathbb N}_0\). Bootstrap variates \(X_t^*\) are generated successively according to a slightly modified variant of our estimator \(\widehat{F}_x(z)\). To prove consistency, we retain our monotonicity condition (A1); however, we replace (A2) by the following stronger condition, which ensures that both the original and the bootstrap process satisfy a useful mixing condition and possess respective stationary distributions.

(A3) There exist a finite set \(S=\big \{y\in D:\; y\le \bar{s} \hbox { and } P_X(\{y\})>0\big \}\), a probability measure Q on \(({\mathbb N}_0,2^{{\mathbb N}_0})\), and constants \(\delta >0\), \(\kappa >0\), \(\gamma >0\), and \(C<\infty \) such that

  1. (i)

    \(P(X_t\in S\mid X_{t-1}=x) \,\ge \, \delta \,>\, 0 \qquad \qquad \forall x\in {\mathbb N}_0\),

  2. (ii)

    \(P(X_t=y \mid X_{t-1}=x) \,\ge \, \kappa \cdot Q(\{y\}) \qquad \forall x\in S,\;\forall y\in {\mathbb N}_0\),

  3. (iii)

    \(P(X_t\ge x)\,\le \, C\, e^{-\gamma x}\qquad \forall x\in {\mathbb N}_0\).

(A3) (ii) means that the set S is a so-called small set and condition (A3) (i) ensures that this set can be reached from each point \(x\in {\mathbb N}_0\) with a probability not smaller than \(\delta \). It follows from these conditions that

$$\begin{aligned} \inf _x P\big ( X_{t+2}=y \mid X_t=x \big ) \,\ge \, \delta \cdot \kappa \cdot \, Q\big ( \{y\} \big ) \qquad \forall y\in {\mathbb N}_0. \end{aligned}$$

Hence, Doeblin’s minorization condition is satisfied and it follows that the process \((X_t)_{t\in {\mathbb N}_0}\) has a unique stationary distribution \(P_X\), is geometrically ergodic, and is uniformly (\(\phi \)-)mixing with exponentially decaying coefficients; see e.g. Theorem 1 in Doukhan (1994, Sec. 2.4, p. 88). In particular, a stationary version of the process satisfies (A2). Note that condition (A3) is satisfied e.g. by a Poisson-INARCH(1) process if the function f is bounded. While (i) and (ii) are obviously fulfilled, (iii) follows from the upper tail bound

$$\begin{aligned} P\big ( Y\ge \lambda +x \big ) \,\le \, e^{-\frac{x^2}{2(\lambda +x)}} \qquad \forall x\ge 0 \end{aligned}$$

which holds for \(Y\sim \hbox {Poisson}(\lambda )\); see Theorem 1 in Canonne (2017).
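To spell out the last step: if f is bounded by some \(\beta <\infty \), then \(\lambda :=f(X_{t-1})\le \beta \) and, for every \(x\ge 2\beta \), we have \(x-\lambda \ge x/2\), so that the tail bound yields

$$\begin{aligned} P\big ( X_t\ge x \mid X_{t-1} \big ) \,\le \, \exp \Big ( -\frac{(x-\lambda )^2}{2x} \Big ) \,\le \, e^{-x/8}, \end{aligned}$$

and (A3) (iii) follows with \(\gamma =1/8\) and a suitable constant C.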

Before we fix the definition of our bootstrap process we check to what extent a process with transition distribution functions \(\widehat{F}_x(\cdot )\) satisfies a suitable variant of condition (A3). It follows from Theorem 2.1 that

$$\begin{aligned} \widehat{F}_x(z) \,=\, F_x(z) \,+\, O_P\big ( n^{-1/3} \big ) \end{aligned}$$

if \(P_X(\{x\})>0\). This implies that

$$\begin{aligned} P\Big ( \inf \big \{\widehat{F}_x(y)-\widehat{F}_x(y-1):\; x\in S,\, P_X(\{x\})>0\big \} \ge \frac{\kappa }{2}\, Q^*(\{y\}) \quad \forall y\in {\mathbb N}_0\Big ) \mathop {\longrightarrow }\limits _{n\rightarrow \infty }1,\nonumber \\ \end{aligned}$$
(3.1)

where e.g.

$$\begin{aligned} Q^*\big ( \{y\} \big ) \,=\, \left\{ \begin{array}{ll} Q\big ( \{y\} \big ) &{} \quad \hbox { if }\quad y\le \bar{y}, \\ 0 &{} \quad \hbox { if }\quad y>\bar{y} \end{array} \right. \end{aligned}$$

and \(\bar{y}\) is chosen such that \(Q(\{0,1,\ldots ,\bar{y}\})\ge 1/2\). Hence, a bootstrap process based on \(\widehat{F}_x(\cdot )\) satisfies a variant of (A3) (ii) with a probability tending to 1.

For a variant of (A3) (i) to hold, it is important that \(\inf \{\widehat{F}_x(\bar{s}):\, x\in {\mathbb N}_0\}>0\) is also satisfied with a probability tending to 1. This is not guaranteed since the estimator \(\widehat{F}_x(\bar{s})\) may become unreliable for large x. Indeed, the natural lower bound for \(\widehat{F}_x(\bar{s})\) is given by

$$\begin{aligned} \widehat{F}_x\big (\bar{s}\big ) \,\ge \, \inf _{u:\, u\le x} \frac{ \sum _{t=1}^n {\mathbb {1}}(X_t\le \bar{s},\, X_{t-1}\in [u,\infty ) ) }{ \#\{t\le n:\; X_{t-1}\in [u,\infty )\} }. \end{aligned}$$

However, the right-hand side of this inequality can get arbitrarily close to 0 if x is large since then \(P_X( [x,\infty ))\) gets small. It is actually a well-known shortcoming of nonparametric isotonic/antitonic estimators that they get unreliable near the ends of the domain of the explanatory variable. In view of this problem, we modify \(\widehat{F}_x(z)\) for large x. Let

$$\begin{aligned} \widetilde{x}:=\, \sup \big \{ x:\; \#\{t\le n:\; X_{t-1}\ge x\} \,\ge \, n^{2/3} \big \}. \end{aligned}$$

Then \(\#\{t\le n:\, X_{t-1}\ge \widetilde{x}\}\ge n^{2/3}>\#\{t\le n:\, X_{t-1}>\widetilde{x}\}\). We define

$$\begin{aligned} \widehat{\widehat{F}}_x(z):=\, \left\{ \begin{array}{ll} \widehat{F}_x(z) &{} \qquad \hbox { if }\quad x\le \widetilde{x}, \\ \widehat{F}_{\widetilde{x}}(z) &{} \qquad \hbox { if } \quad x> \widetilde{x} \end{array} \right. . \end{aligned}$$
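A minimal Python sketch of the truncation: \(\widetilde{x}\) is the largest observed state x such that at least \(n^{2/3}\) of the predecessors \(X_0,\ldots ,X_{n-1}\) are \(\ge x\), and the estimate is frozen at \(\widetilde{x}\) beyond that point (antitonic_cdf_estimate refers to the sketch given in Sect. 2).

```python
import numpy as np

def truncation_point(X):
    """The truncation point tilde{x} from the display above."""
    pred = np.asarray(X[:-1])
    n = len(pred)
    for x in np.sort(np.unique(pred))[::-1]:     # observed states, in decreasing order
        if (pred >= x).sum() >= n ** (2 / 3):
            return x

def modified_cdf_estimate(X, x, z):
    """hat{hat{F}}_x(z): equal to hat{F}_x(z) for x <= tilde{x}, frozen at tilde{x} above."""
    return antitonic_cdf_estimate(X, min(x, truncation_point(X)), z)
```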

In what follows we show that the modified estimator \(\widehat{\widehat{F}}_x(\cdot )\) actually satisfies a suitable variant of (A3). To take advantage of Lemma 4.2 we embed the random truncation point \(\widetilde{x}\) between two nonrandom points, \(\widetilde{x}_l\) and \(\widetilde{x}_u\). Let \(\widetilde{x}_l:=\sup \left\{ x:\, P_X\big ( [x,\infty ) \big )\ge 2n^{-1/3} \right\} \) and \(\widetilde{x}_u:=\sup \big \{ x:\, P_X\big ( [x,\infty ) \big )\ge (1/2)n^{-1/3} \big \}\). Then \(P_X\big ( [\widetilde{x}_l,\infty ) \big )\ge 2n^{-1/3}\ge P_X\big ( (\widetilde{x}_l,\infty ) \big )\) and \(P_X\big ( [\widetilde{x}_u,\infty ) \big )\ge (1/2)n^{-1/3}\ge P_X\big ( (\widetilde{x}_u,\infty ) \big )\). Since \(\widetilde{x}>\widetilde{x}_u\) implies that \(\#\big \{t\le n:\, X_{t-1}>\widetilde{x}_u\big \}\ge n^{2/3}\) we obtain by Lemma 4.2 that

$$\begin{aligned} P\big ( \widetilde{x}> \widetilde{x}_u \big )\le & {} P\Big ( \# \big \{t\le n:\, X_{t-1}>\widetilde{x}_u \big \} \,\ge \, n^{2/3} \Big ) \\\le & {} P\Big ( \# \big \{t\le n:\, X_{t-1}>\widetilde{x}_u \big \} \,-\, nP_X\big ( (\widetilde{x}_u,\infty ) \big ) \,\ge \, n^{2/3}/2 \Big )\\&\,=\,&O\big ( n^{-1} \big ). \end{aligned}$$

On the other hand, if \(\widetilde{x}\le \widetilde{x}_u\), then

$$\begin{aligned} \inf _x \widehat{\widehat{F}}_x\big ( \bar{s} \big ) \,=\, \widehat{F}_{\widetilde{x}}\big ( \bar{s} \big ) \,\ge \, \widehat{F}_{\widetilde{x}_u}\big ( \bar{s} \big ). \end{aligned}$$

Therefore,

$$\begin{aligned} P\Big ( \inf _x \widehat{\widehat{F}}_x\big ( \bar{s} \big ) \,\ge \, F_{\widetilde{x}_u}\big ( \bar{s} \big )/2 \Big ) \mathop {\longrightarrow }\limits _{n\rightarrow \infty }1. \end{aligned}$$
(3.2)

Furthermore, since \(\widetilde{x}<\widetilde{x}_l\) implies that \(\#\big \{t\le n:\, X_{t-1}\ge \widetilde{x}_l\big \}< n^{2/3}\) we obtain by Lemma 4.2 that

$$\begin{aligned} P\big ( \widetilde{x}< \widetilde{x}_l \big )\le & {} P\Big ( \# \big \{t\le n:\, X_{t-1}\ge \widetilde{x}_l \big \} \,<\, n^{2/3} \Big ) \\\le & {} P\Big ( \# \big \{t\le n:\, X_{t-1}\ge \widetilde{x}_l \big \} \,-\, n P_X\big ( [\widetilde{x}_l,\infty ) \big ) \,<\, -(n/2) P_X\big ( [\widetilde{x}_l,\infty ) \big ) \Big )\\&\,=\,&O\big ( n^{-1} \big ). \end{aligned}$$

On the other hand, \(\widetilde{x}\ge \widetilde{x}_l\) yields that \(\widehat{\widehat{F}}_x(z)=\widehat{F}_x(z)\) for all \(x\le \widetilde{x}_l\). Hence, we obtain, for each \(z\in {\mathbb N}_0\),

$$\begin{aligned}{} & {} { E\left[ \int \big | \widehat{\widehat{F}}_x(z) \,-\, F_x(z) \big | \, dP_X(x) \right] } \nonumber \\{} & {} \quad = E\left[ \int _{\{x:\, x\le \widetilde{x}_l\}} \big | \widehat{F}_x(z) \,-\, F_x(z) \big | \, dP_X(x) \right] \,+\, P_X\big ( (\widetilde{x}_l,\infty ) \big ) \,+\, O\big ( n^{-1} \big ) \nonumber \\{} & {} \quad = O\big ( n^{-1/3} \big ). \end{aligned}$$
(3.3)

Now we are in a position to define our resampling algorithm generating the bootstrap variates:

  1. 1.

    Choose a starting value \(X_0^*\).

  2. 2.

    For each \(t\in {\mathbb N}_0\), suppose that \(X_0^*,\ldots ,X_t^*\) have been generated already. Then \(X_{t+1}^*\) is generated such that it has, conditioned on \(X_0^*,\ldots ,X_t^*\) and on the original sample \(X_0,\ldots ,X_n\), the probability distribution function \(\widehat{\widehat{F}}_{X_t^*}(\cdot )\); a minimal implementation sketch is given below.

In what follows, the symbol \(P^*\) refers to the distribution of the bootstrap variables conditioned on the original sample, e.g. \(P^*\big (X_t^*\in A\big )=P\big (X_t^*\in A\mid X_0,\ldots ,X_n\big )\).
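A minimal Python sketch of this resampling step by inverse-CDF sampling; cond_cdf is assumed to return the values of \(\widehat{\widehat{F}}_x\) on the sorted grid of states (for instance via the modified_cdf_estimate sketch above), and the starting value is left to the user.

```python
import numpy as np

def bootstrap_path(cond_cdf, states, n_boot, x0, rng):
    """Generate X_0^*,...,X_{n_boot}^*; X_{t+1}^* is drawn from the estimated
    conditional distribution function of the previous bootstrap value."""
    X_star = np.empty(n_boot + 1, dtype=int)
    X_star[0] = x0
    for t in range(1, n_boot + 1):
        F = np.asarray(cond_cdf(X_star[t - 1]))               # non-decreasing, ends at 1
        k = np.searchsorted(F, rng.uniform(), side="left")    # smallest k with F[k] >= u
        X_star[t] = states[min(k, len(states) - 1)]
    return X_star
```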

Let \(K_n=\log n/(3\gamma )\). Then (A3)(iii) implies that \(P(X_t> K_n)\le C\,e^{-\gamma K_n}=C\,n^{-1/3}\) and we obtain from (3.3) that

$$\begin{aligned}{} & {} { \sum _{y=0}^\infty \sum _{x=0}^\infty \big | P^*\big ( X^*_{t+1}=y\mid X^*_t=x \big ) \,-\, P\big ( X_{t+1}=y \mid X_t=x \big ) \big | \; P_X(\{x\}) } \nonumber \\{} & {} \quad = \sum _{y=0}^{K_n} \sum _{x=0}^\infty \big | P^*\big ( X^*_{t+1}=y\mid X^*_t=x \big ) \,-\, P\big ( X_{t+1}=y \mid X_t=x \big ) \big | \; P_X(\{x\}) \nonumber \\{} & {} \qquad +\, \sum _{x=0}^\infty \big ( P^*\big ( X^*_{t+1}>K_n\mid X^*_t=x \big ) \,+\, P\big ( X_{t+1}>K_n \mid X_t=x \big ) \big ) \; P_X(\{x\}) \nonumber \\{} & {} \quad \le O\big ( K_n \, n^{-1/3} \big ) +\, \sum _{x=0}^\infty \big | P^*\big ( X^*_{t+1}>K_n\mid X^*_t=x \big ) \,-\, P\big ( X_{t+1}>K_n \mid X_t=x \big ) \big |\nonumber \\{} & {} \qquad \; P_X(\{x\}) \,+\, 2\,P\big ( X_t>K_n \big ) \nonumber \\{} & {} \quad = O_P\big ( n^{-1/3} \log n \big ). \end{aligned}$$
(3.4)

Note that, for all \(t\in {\mathbb N}\), \(X_t^*\) takes values in \(\{X_1,\ldots ,X_n\}\), i.e. \(X_t^*\) lies in the collection of x with \(P_X(\{x\})>0\). Hence, it follows from (3.1) and (3.2) that a process with transition distribution functions \(\widehat{\widehat{F}}_x(\cdot )\) satisfies the following Doeblin-type condition with a probability tending to 1:

$$\begin{aligned} \inf _x P^*\big ( X_{t+2}^*=y \mid X_t^*=x \big )\ge & {} \inf _x \sum _{z\in S} P^*\big ( X_{t+2}^*=y \mid X_{t+1}^*=z \big ) \\{} & {} \; P^*\big ( X_{t+1}^*=z \mid X_t^*=x \big ) \\\ge & {} \frac{\kappa \, \delta }{4} \, Q^*\big ( \{y\} \big ) \qquad \forall y\in {\mathbb N}_0. \end{aligned}$$

This implies that, with a probability tending to 1, the bootstrap process is geometrically ergodic and has a unique stationary distribution \(P^*_{X^*}\).

For a successful application of the bootstrap approximation the following properties are vitally important: With a probability tending to 1, conditioned on \(X_0,\ldots ,X_n\),

  1. (a)

    the stationary distribution \(P^*_{X^*}\) converges to \(P_X\),

  2. (b)

    the finite-dimensional distributions of \((X_t^*)_{t}\) converge to those of \((X_t)_t\).

We show these two properties by a coupling of the original process and its bootstrap counterpart, i.e. we define versions \((\widetilde{X}_t)_{t\in {\mathbb N}_0}\) and \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) on a common probability space \((\widetilde{\Omega },\widetilde{\mathcal A},\widetilde{P})\) such that the corresponding random variables \(\widetilde{X}_t\) and \(\widetilde{X}_t^*\) are equal with a high probability. We use the technique of maximal coupling [see e.g. Theorem 5.2 in Chapter I in Lindvall (1992)] and define the transition probabilities \(\widetilde{\pi }\) driving the coupled process \(\big ( (\widetilde{X}_t,\widetilde{X}_t^*) \big )_{t\in {\mathbb N}_0}\) as follows. For \(x,y\in {\mathbb N}_0\), let \(\pi (x,y)=P(X_{t+1}=y\mid X_t=x)\) and \(\pi ^*(x,y)=P^*(X_{t+1}^*=y\mid X_t^*=x)\). Then

$$\begin{aligned} \delta _{x,x^*} \,=\, \frac{1}{2} \sum _{y\in {\mathbb N}_0} \big | \pi (x,y) \,-\, \pi ^*(x^*,y) \big | \end{aligned}$$

is the total variation distance between the distributions with respective probability mass functions \(\pi (x,\cdot )\) and \(\pi ^*(x^*,\cdot )\). Note that \(\delta _{x,x^*}=\sum _y [\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}] =\sum _y [\pi ^*(x^*,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}]\). The transition probabilities of the coupled process are defined as

$$\begin{aligned} \widetilde{\pi }\big ( (x,x^*), (y,y) \big ) \,=\, \min \big \{ \pi (x,y), \pi ^*(x^*,y) \big \} \qquad \forall x,x^*,y\in {\mathbb N}_0, \end{aligned}$$
(3.5a)

and, for \(x,x^*,y,y^*\in {\mathbb N}_0\) such that \(y\ne y^*\),

$$\begin{aligned}{} & {} { \widetilde{\pi }\big ( (x,x^*), (y,y^*) \big ) } \nonumber \\{} & {} \quad = \left\{ \begin{array}{ll} 0 &{} \quad \hbox { if }\quad \delta _{x,x^*}=0, \\ \frac{[\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ]\, [\pi ^*(x^*,y^*)-\min \{\pi (x,y^*),\pi ^*(x^*,y^*)\}] }{ \delta _{x,x^*} } &{} \quad \hbox { if }\quad \delta _{x,x^*}\ne 0. \end{array} \right. \nonumber \\ \end{aligned}$$
(3.5b)

The corresponding Markov kernel \(\widetilde{P}\) is defined as

$$\begin{aligned}{} & {} \widetilde{P}\big ( (\widetilde{X}_{t+1},\widetilde{X}_{t+1}^*)\in A\mid (\widetilde{X}_t,\widetilde{X}_t^*)=(x,x^*) \big )\nonumber \\{} & {} \quad \,=\, \sum _{(y,y^*)\in A} \widetilde{\pi }\big ( (x,x^*), (y,y^*) \big ) \qquad \forall A\subseteq {\mathbb N}_0\times {\mathbb N}_0. \end{aligned}$$

Note that \([\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ]\, [\pi ^*(x^*,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}]=0\) for all \(x,x^*,y\in {\mathbb N}_0\), which implies in case of \(\delta _{x,x^*}>0\) that

$$\begin{aligned}{} & {} { \sum _{y^*:\; y^*\ne y} \frac{[\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ]\, [\pi ^*(x^*,y^*)-\min \{\pi (x,y^*),\pi ^*(x^*,y^*)\}] }{ \delta _{x,x^*} } } \\{} & {} \quad = [\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ] \; \sum _{y^*\in {\mathbb N}_0} \frac{\pi ^*(x^*,y^*)-\min \{\pi (x,y^*),\pi ^*(x^*,y^*)\} }{ \delta _{x,x^*} } \\{} & {} \quad = \pi (x,y) \,-\, \min \{\pi (x,y),\pi ^*(x^*,y)\}. \end{aligned}$$

Therefore we obtain

$$\begin{aligned}{} & {} { \widetilde{P}\big ( \widetilde{X}_{t+1}=y \mid (\widetilde{X}_t,\widetilde{X}_t^*)=(x,x^*) \big ) } \\{} & {} \quad = \sum _{y^*\in {\mathbb N}_0} \widetilde{\pi }\big ( (x,x^*), (y,y^*) \big ) \\{} & {} \quad = \left\{ \begin{array}{ll} \pi (x,y) &{} \quad \hbox { if } \quad \delta _{x,x^*}=0, \\ &{} \\ \min \{ \pi (x,y), \pi ^*(x^*,y)\} \,+\, \sum _{y^*:\; y^*\ne y}&{}\\ \frac{[\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ]\, [\pi ^*(x^*,y^*)-\min \{\pi (x,y^*),\pi ^*(x^*,y^*)\}] }{ \delta _{x,x^*} } &{} \quad \hbox { if } \quad \delta _{x,x^*}\ne 0 \end{array} \right. \\{} & {} \quad = \pi (x,y) \,=\, P\big ( X_{t+1}=y \mid X_t=x \big ) \end{aligned}$$

and, likewise,

$$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_{t+1}^*=y^* \mid (\widetilde{X}_t,\widetilde{X}_t^*)=(x,x^*) \big ) \,=\, \pi ^*(x^*,y^*) \,=\, P^*\big ( X_{t+1}^*=y^* \mid X_t^*=x^* \big ). \end{aligned}$$

Moreover, we have that

$$\begin{aligned}{} & {} \widetilde{P}\big ( \widetilde{X}_{t+1}=\widetilde{X}_{t+1}^*\mid (\widetilde{X}_t,\widetilde{X}_t^*)=(x,x^*) \big )\,=\, \sum _{y\in {\mathbb N}_0} \min \big \{ \pi (x,y), \pi ^*(x^*,y) \big \}\nonumber \\{} & {} \quad \,=\, 1 \,-\, \delta _{x,x^*}. \end{aligned}$$
(3.6)

Hence, the conditional probability that the two random variables at time \(t+1\) are equal is maximized, which explains the use of the term maximal coupling.
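The following Python sketch draws one transition of the coupled pair according to (3.5a) and (3.5b); pmf and pmf_star stand for the mass functions \(\pi (x,\cdot )\) and \(\pi ^*(x^*,\cdot )\), assumed here to be given on a common finite grid of states.

```python
import numpy as np

def coupled_step(pmf, pmf_star, rng):
    """With probability 1 - delta_{x,x*}, both components jump to the same state,
    drawn from the normalized overlap min{pi, pi*}; otherwise they jump
    independently according to the normalized residuals. This realizes the
    transition probabilities (3.5a)/(3.5b)."""
    pmf = np.asarray(pmf, dtype=float)
    pmf_star = np.asarray(pmf_star, dtype=float)
    overlap = np.minimum(pmf, pmf_star)
    delta = 1.0 - overlap.sum()                  # total variation distance delta_{x,x*}
    if rng.uniform() < 1.0 - delta:
        y = rng.choice(len(overlap), p=overlap / overlap.sum())
        return y, y
    resid = (pmf - overlap) / delta              # residual part of pi(x, .)
    resid_star = (pmf_star - overlap) / delta    # residual part of pi*(x*, .)
    return rng.choice(len(resid), p=resid), rng.choice(len(resid_star), p=resid_star)
```

Multiplying out the two branches recovers exactly (3.5a) for \(y=y^*\) and (3.5b) for \(y\ne y^*\), since the residuals have disjoint supports.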

The following theorem summarizes the results of our coupling approach. The stationary distribution \(P^*_{X^*}\) of the bootstrap process converges in total variation norm and in probability to that of the original process. With a probability tending to 1, the coupled process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is geometrically \(\phi \)-mixing. And finally, the corresponding values \(\widetilde{X}_t^0\) and \(\widetilde{X}_t^{*,0}\) of a stationary version of the coupled process coincide with a probability converging to 1.

Theorem 3.1

Suppose that (A1) and (A3) are fulfilled. Then

  1. (i)

    \(d_{TV}\big ( P^*_{X^*}, P_X \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2 \big )\).

  2. (ii)

    With a probability tending to 1, the process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is \(\phi \)-mixing with coefficients \(\phi _{\widetilde{X},\widetilde{X}^*}(k)\) decaying at a geometric rate.

  3. (iii)

    If \(\big ((\widetilde{X}_t^0, \widetilde{X}_t^{*,0})\big )_{t\in {\mathbb N}_0}\) is a stationary version of the coupled process, then

    $$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_t^0 \ne \widetilde{X}_t^{*,0} \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} (\log n)^2 \big ). \end{aligned}$$

These general results can be used to prove bootstrap consistency for specific statistics. Suppose that \(X_0,\ldots ,X_n\) are observed and that (A1) and (A3) are fulfilled. To illustrate the proposed approach, we consider e.g. the parameter \(\theta :=P(X_{t-1}=x,X_t=y)\) for fixed \(x,y\in {\mathbb N}_0\), which is consistently estimated by \(\widehat{\theta }_n=n^{-1}\sum _{t=1}^n {\mathbb {1}}(X_{t-1}=x,X_t=y)\). It follows from a central limit theorem for \(\phi \)-mixing processes (see e.g. Theorem 15.12 in Bradley (2007b)) that

$$\begin{aligned} S_n:=\, \sqrt{n} \big ( \widehat{\theta }_n - \theta \big ) \,{\mathop {\longrightarrow }\limits ^{d}}\, Y \sim {\mathcal N}\left( 0, \sigma _\infty ^2\right) , \end{aligned}$$

where \(\sigma _\infty ^2=\sum _{k=-\infty }^\infty \mathop {\textrm{cov}}\nolimits \big ( {\mathbb {1}}(X_0=x,X_1=y), {\mathbb {1}}(X_{|k|}=x,X_{|k|+1}=y) \big )\). The distribution of \(S_n\) can be approximated by that of its bootstrap counterpart,

$$\begin{aligned} S_n^*:=\, \sqrt{n} \big ( \widehat{\theta }_n^* - E^*\widehat{\theta }_n^* \big ), \end{aligned}$$

where \(\widehat{\theta }_n^*=n^{-1}\sum _{t=1}^n {\mathbb {1}}(X_{t-1}^*=x,X_t^*=y)\) and \(E^*\widehat{\theta }_n^*=E\big (\widehat{\theta }_n^*\,\big |\,X_0,\ldots ,X_n)\). In order to prove that the distribution of \(S_n^*\) converges in probability to the same limit as that of \(S_n\), we could use a central limit theorem for triangular arrays of \(\phi \)-mixing random variables. Alternatively, we can use our coupling results and obtain bootstrap consistency almost for free. It follows from (iii) of Theorem 3.1 that

$$\begin{aligned} \widetilde{P}\Big ( (\widetilde{X}_{t-1},\widetilde{X}_t)\ne (\widetilde{X}_{t-1}^*,\widetilde{X}_t^*) \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2\big ). \end{aligned}$$

Using this and a covariance inequality for \(\phi \)-mixing random variables [see e.g. Theorem 1.2.2.3 in Doukhan (1994, p. 9)] we obtain

$$\begin{aligned}{} & {} { \big | \mathop {\textrm{cov}}\nolimits \big ( {\mathbb {1}}(\widetilde{X}_{s-1},\widetilde{X}_s) - {\mathbb {1}}(\widetilde{X}_{s-1}^*,\widetilde{X}_s^*), {\mathbb {1}}(\widetilde{X}_{t-1},\widetilde{X}_t) - {\mathbb {1}}(\widetilde{X}_{t-1}^*,\widetilde{X}_t^*) \big ) \big | } \\{} & {} \quad \le 2\, \phi \big ( |s-t|-1 \big ) \, \big \Vert {\mathbb {1}}(\widetilde{X}_{s-1},\widetilde{X}_s) - {\mathbb {1}}(\widetilde{X}_{s-1}^*,\widetilde{X}_s^*) \big \Vert _1 \, \big \Vert {\mathbb {1}}(\widetilde{X}_{t-1},\widetilde{X}_t) - {\mathbb {1}}(\widetilde{X}_{t-1}^*,\widetilde{X}_t^*) \big \Vert _\infty \\{} & {} \quad = O_{\widetilde{P}} \big ( \phi _{\widetilde{X},\widetilde{X}^*}(|s-t|-1 \big ) \, n^{-1/3} \, (\log n)^2 \big ), \end{aligned}$$

which, by the summability of the mixing coefficients \(\phi _{\widetilde{X},\widetilde{X}^*}(\cdot )\), implies that

$$\begin{aligned}{} & {} { \widetilde{E} \Big [ \big ( \widetilde{S}_n \,-\, \widetilde{S}_n^* \big )^2 \Big ] } \\{} & {} \quad = \frac{1}{n} \sum _{s,t=1}^n \mathop {\textrm{cov}}\nolimits \big ( {\mathbb {1}}(\widetilde{X}_{s-1},\widetilde{X}_s) - {\mathbb {1}}(\widetilde{X}_{s-1}^*,\widetilde{X}_s^*), {\mathbb {1}}(\widetilde{X}_{t-1},\widetilde{X}_t) - {\mathbb {1}}(\widetilde{X}_{t-1}^*,\widetilde{X}_t^*) \big ) \\{} & {} \quad = O_{\widetilde{P}} \big ( n^{-1/3} \, (\log n)^2 \big ). \end{aligned}$$

This implies that

$$\begin{aligned} S_n^* \,{\mathop {\longrightarrow }\limits ^{d}}\, Y \qquad \hbox {in probability}. \end{aligned}$$

If in addition \(\sigma _\infty ^2>0\), then we obtain by Lemma 2.11 of van der Vaart (1998) that

$$\begin{aligned} \sup _x \big | P\big ( S_n\le x \big ) \,-\, P\big ( S_n^*\le x \,\big |\, X_0,\ldots ,X_n \big ) \big | \,{\mathop {\longrightarrow }\limits ^{P}}\, 0. \end{aligned}$$

Hence, we can use bootstrap quantiles to construct confidence intervals for \(\theta \) such that their coverage probability converges to a prescribed level. Similar implications for other types of statistics are discussed in Leucht and Neumann (2013) and Neumann (2021).
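As a sketch of how the bootstrap quantiles can be used in practice, the following Python snippet forms a confidence interval for \(\theta \) from bootstrap paths generated as above; approximating \(E^*\widehat{\theta }_n^*\) by the average over the replicates is an additional simplification made here.

```python
import numpy as np

def bootstrap_ci_theta(X, boot_paths, x, y, level=0.95):
    """Confidence interval for theta = P(X_{t-1}=x, X_t=y) based on bootstrap
    quantiles of S_n^* = sqrt(n) (theta_hat^* - E^* theta_hat^*)."""
    X = np.asarray(X)
    n = len(X) - 1
    theta_hat = np.mean((X[:-1] == x) & (X[1:] == y))
    theta_stars = np.array([np.mean((np.asarray(P)[:-1] == x) & (np.asarray(P)[1:] == y))
                            for P in boot_paths])
    S_star = np.sqrt(n) * (theta_stars - theta_stars.mean())
    lo, hi = np.quantile(S_star, [(1 - level) / 2, (1 + level) / 2])
    # invert sqrt(n) (theta_hat - theta) between the bootstrap quantiles of S_n^*
    return theta_hat - hi / np.sqrt(n), theta_hat - lo / np.sqrt(n)
```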

Remark 1

In a similar context, Paparoditis and Politis (2002, Theorem 3.3) proved almost sure convergence of the bootstrap stationary distribution to the stationary distribution of the original process. Their method of proof is completely different from ours and employs classical tools from the theory of weak convergence such as Helly’s theorem and the “uniqueness trick” which uses the fact that each subsequence contains a further subsequence converging to the same probability measure. We use a more direct approach based on a coupling of the original and the bootstrap process. The additional benefit is that we obtain a rate of convergence rather than consistency only.

The following pictures give an impression of the effect of our coupling. As done for the pictures displayed in the previous section, we simulated a Poisson-INARCH process of order 1, where \(X_t\mid X_{t-1},X_{t-2},\ldots \sim \hbox {Poisson}\big ( f(X_{t-1}) \big )\) and \(f(x)=\min \big \{\alpha _0+\alpha _1 x, \beta \big \}\). The parameters \(\alpha _0\) and \(\alpha _1\) are chosen as 2.0 and 0.5, respectively, and the truncation constant \(\beta \) is set to 6.0. For respective sample sizes of \(n=200\) and \(n=1000\), Figs. 1 and 2 show one realization of independent and coupled versions of \(X_1,\ldots ,X_{50}\) and \(X_1^*,\ldots ,X_{50}^*\). While the pictures on the left of Figs. 1 and 2 let us at best hope for a similar behavior of the bootstrap and the original process, those on the right provide some evidence that the bootstrap process successfully mimics the behavior of the original process.

Fig. 1 Independent and coupled processes, n = 200

Fig. 2 Independent and coupled processes, n = 1000

4 Proofs

4.1 Proofs of the main results

Proof of Theorem 2.1

Our strategy to prove this result is already sketched in Sect. 2, in the special case where the distribution function \(F_X\) associated to \(P_X\) is continuous. In the general case with a possibly discontinuous function \(F_X\), we have to take great care since we cannot split the domain D into intervals \(I_k\) such that \(P_X(I_k)=1/k_n\), where \(k_n=\lfloor n^{1/3}\rfloor \). It could be the case that \(P_X\) has masses considerably larger than \(1/k_n\) at single points which requires a modification of our previous approach.

To obtain an appropriate collection of intervals \(I_k\), we again define suitable grid points \(x_0,x_1,\ldots ,x_{K_n}\). For technical reasons we choose them as a decreasing sequence. We set \(x_0:=\infty \) and define recursively \(x_k:=\inf \{x:\, P_X((x,x_{k-1}))\le n^{-1/3}\}\) for \(k\ge 1\). This procedure terminates, for some \(K_n\), when \(x_{K_n}=0\) in case \(D={\mathbb N}_0\), or when \(x_{K_n}=-\infty \) in case \(D={\mathbb R}\). In both cases we have that \(D=[x_1,x_0)\cup \cdots \cup [x_{K_n},x_{K_n-1})\). For \(k=1,\ldots ,K_n-1\), i.e. with the possible exception of \(k=K_n\), we have

$$\begin{aligned} P_X\big ( (x_k,x_{k-1}) \big ) \,\le \, n^{-1/3} \,\le \, \lim _{m\rightarrow \infty } P_X\big ( (x_k-1/m,x_{k-1}) \big ) \,=\, P_X\big ( [x_k,x_{k-1}) \big ), \end{aligned}$$

where the latter equality follows since the probability measure \(P_X\) is continuous from above. In the following we show that

$$\begin{aligned} E\left[ \int _D \big (\widehat{F}_x(z)-F_x(z)\big )^+ \,dP_X(x) \right] \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$
(4.1)

To this end, we consider the contributions by \(E\big [ \int _{[x_k,x_{k-1})} (\widehat{F}_x(z)-F_x(z))^+\,dP_X(x)\big ]\) separately. We distinguish between three possible cases.

Case 1    If \(P_X\big ([x_k,x_{k-1})\big )\le 2n^{-1/3}\) and \(k<K_n\), then we use, for all \(x\in [x_k,x_{k-1})\) and in case of \(N_{n,k}:=\#\big \{t\le n:\, X_{t-1}\in [x_k,x_{k-1})\big \}\ne 0\), the estimate

$$\begin{aligned} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+\le & {} \big ( \widehat{F}_{x_k}(z) \,-\, F_{x_{k-1}}(z) \big )^+ \\\le & {} \max _{v:\, v\ge x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)]\, {\mathbb {1}}(X_{t-1}\in [x_{k+1},v]) \big |}{ \#\{t\le n:X_{t-1}\in [x_{k+1},v]\} \vee 1 } \right\} \\{} & {} {} \,+\, \big [ F_{x_{k+1}}(z) \,-\, F_{x_{k-1}}(z) \big ], \end{aligned}$$

which leads to

$$\begin{aligned}{} & {} { E\left[ \int _{[x_k,x_{k-1})} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x)\, {\mathbb {1}}_{\{N_{n,k}\ne 0\}} \right] } \nonumber \\{} & {} \quad \le P_X\big ( [x_k,x_{k-1}) \big ) \;\nonumber \\{} & {} \qquad \left\{ E\left[ \max _{v:\, v\ge x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)] {\mathbb {1}}(X_{t-1}\in [x_{k+1},v]) \big |}{ \#\{t\le n:X_{t-1}\in [x_{k+1},v]\} \vee 1 } \right\} \right] \right. \nonumber \\{} & {} \qquad \left. +\, \big [ F_{x_{k+1}}(z) \,-\, F_{x_{k-1}}(z) \big ] \right\} \nonumber \\{} & {} \quad = O\Big ( P_X\big ( [x_k,x_{k-1}) \big ) \; \big \{ n^{-1/3} \,+\, (F_{x_{k+1}}(z) - F_{x_{k-1}}(z)) \big \} \Big ). \end{aligned}$$
(4.2a)

Case 2    If \(P_X\big ([x_k,x_{k-1})\big )> 2n^{-1/3}\), then \(P_X\) has at \(x_k\) a point mass greater than \(n^{-1/3}\) and we argue differently. In this case, we use, for all \(x\in (x_k,x_{k-1})\) and in case of \(N_{n,k}:=\#\big \{t\le n:\, X_{t-1}=x_k\big \}\ne 0\), the estimate

$$\begin{aligned} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+\le & {} \max _{v:\, v\ge x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)] \, {\mathbb {1}}(X_{t-1}\in [x_k,v]) \big |}{ \#\{t\le n:\, X_{t-1}\in [x_k,v]\} \vee 1 } \right\} \\{} & {} {} \,+\, \big [ F_{x_k}(z) \,-\, F_{x_{k-1}}(z) \big ], \end{aligned}$$

which implies

$$\begin{aligned}{} & {} { E\left[ \int _{(x_k,x_{k-1})} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x)\, {\mathbb {1}}_{\{N_{n,k}\ne 0\}} \right] } \nonumber \\{} & {} \quad \le P_X\big ( (x_k,x_{k-1}) \big ) \;\nonumber \\{} & {} \qquad \left\{ E\left[ \max _{v:\, v\ge x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)] \, {\mathbb {1}}(X_{t-1}\in [x_k,v]) \big |}{ \#\{t\le n:\, X_{t-1}\in [x_k,v]\} \vee 1 } \right\} \right] \right. \nonumber \\{} & {} \qquad \left. +\, \big [ F_{x_k}(z) \,-\, F_{x_{k-1}}(z) \big ] \right\} \nonumber \\{} & {} \quad = O\Big ( P_X\big ( (x_k,x_{k-1}) \big ) \; \big \{ n^{-1/3} \,+\, (F_{x_k}(z) - F_{x_{k-1}}(z)) \big \} \Big ). \end{aligned}$$
(4.2b)

For \(x=x_k\), we use the simpler estimate

$$\begin{aligned} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+\le & {} \max _{v:\, v>x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)] \, {\mathbb {1}}(X_{t-1}\in [x_k,v]) \big |}{ \#\{t\le n:\, X_{t-1}\in [x_k,v]\} \vee 1 } \right\} , \end{aligned}$$

and we obtain

$$\begin{aligned}{} & {} { E\left[ \int _{ \{x_k\} } \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x) \,{\mathbb {1}}_{\{N_{n,k}\ne 0\}} \right] } \nonumber \\{} & {} \quad \le P_X\big ( \{x_k\} \big ) \; E\left[ \max _{v:\, v>x_k} \left\{ \frac{\big |\sum _{t=1}^n [{\mathbb {1}}(X_t\le z) \,-\, F_{X_{t-1}}(z)] {\mathbb {1}}(X_{t-1}\in [x_k,v]) \big |}{ \#\{t\le n:\, X_{t-1}\in [x_k,v]\} \vee 1 } \right\} \; {\mathbb {1}}_{A_n} \right] \nonumber \\{} & {} \quad = O\Big ( P_X\big ( \{x_k\} \big ) \; n^{-1/3} \Big ). \end{aligned}$$
(4.2c)

Case 3    If \(P_X([x_{K_n},x_{K_n-1}))\le 2n^{-1/3}\), then we can simply use the estimate

$$\begin{aligned} E\left[ \int _{[x_{K_n},x_{K_n-1})} \big ( \widehat{F}_x(z) \,-\, F_x(z) \big )^+ \, dP_X(x) \right] \,\le \, 2\, n^{-1/3}. \end{aligned}$$
(4.2d)

Finally, it follows from Lemma 4.2 that \(P\big ( \bigcup _k \{\omega :N_{n,k}(\omega )=0 \} \big )=O(n^{-1/3})\), which implies that

$$\begin{aligned}{} & {} E\left[ \int _D \big (\widehat{F}_x(z)-F_x(z)\big )^+ \,dP_X(x) \; {\mathbb {1}}_{\bigcup _k \{N_{n,k}=0\}} \right] \nonumber \\{} & {} \qquad \,\le \, P\left( \bigcup _k \{\omega :N_{n,k}(\omega )=0 \} \right) \,=\, O\big ( n^{-1/3} \big ). \end{aligned}$$
(4.2e)

From (4.2a) to (4.2e) we obtain (4.1).

The term \(\int _D (\widehat{F}_x(z)-F_x(z))^-\,dP_X(x)\) can be estimated analogously, which completes the proof of the theorem. \(\square \)

Proof of Theorem 3.1

  1. (i)

    We construct a coupling of the original process and its bootstrap counterpart, where we use \(\widetilde{\pi }\big ((x,x^*),(y,y^*)\big )\) defined by (3.5a) and (3.5b) as transition probabilities and \(\widetilde{P}\) as transition kernel. The initial values are chosen such that \(\widetilde{X}_0=\widetilde{X}^*_0 \sim P_X\). Then, for each \(t\in {\mathbb N}_0\), conditioned on \((\widetilde{X}_t,\widetilde{X}^*_t)\), the next pair \((\widetilde{X}_{t+1},\widetilde{X}^*_{t+1})\) is generated according to \(\widetilde{P}\). It follows from (3.4) and (3.6) in particular that

    $$\begin{aligned}{} & {} \widetilde{P}\big ( \widetilde{X}_{t+1} \ne \widetilde{X}^*_{t+1}, \, \widetilde{X}_t=\widetilde{X}^*_t \big )\\{} & {} \quad = \sum _{x\in {\mathbb N}_0} \widetilde{P}\big ( \widetilde{X}_{t+1}\ne \widetilde{X}_{t+1}^*\mid \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \, \widetilde{P}\big ( \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \\{} & {} \quad = \sum _{x\in {\mathbb N}_0} \delta _{x,x} \, \widetilde{P}\big ( \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \\{} & {} \quad \le \frac{1}{2} \, \sum _x \sum _y \big | \pi (x,y) \,-\, \pi ^*(x,y) \big | \, P_X(\{x\})\\{} & {} \quad = O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ). \end{aligned}$$

    This implies first

    $$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_1\ne \widetilde{X}_1^* \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ), \end{aligned}$$

    then

    $$\begin{aligned}{} & {} \widetilde{P}\big ( \widetilde{X}_2\ne \widetilde{X}_2^* \big ) \,\le \, \widetilde{P}\big ( \widetilde{X}_2\ne \widetilde{X}_2^*,\, \widetilde{X}_1=\widetilde{X}_1^* \big ) \,+\, \widetilde{P}\big ( \widetilde{X}_1\ne \widetilde{X}_1^* \big )\\{} & {} \quad = O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ), \end{aligned}$$

    and after \(K_n\) such steps

    $$\begin{aligned} d_{TV}\big ( P_X, \widetilde{P}^{\widetilde{X}_{K_n}^*} \big ) \,\le \, \widetilde{P}\big ( \widetilde{X}_{K_n}\ne \widetilde{X}_{K_n}^* \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \, K_n\big ). \end{aligned}$$

    On the other hand, \((X_t^*)_{t\in {\mathbb N}_0}\), and therefore \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) as well, are geometrically ergodic. Hence, for \(K_n=K \log n\) and K sufficiently large,

    $$\begin{aligned} d_{TV}\left( \widetilde{P}^{\widetilde{X}_{K_n}^*}, P^*_{X^*} \right) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \big ), \end{aligned}$$

    which leads to

    $$\begin{aligned} d_{TV}\big ( P_X, P^*_{X^*} \big )\le & {} d_{TV}\left( P_X, \widetilde{P}^{\widetilde{X}_{K_n}^*} \right) \,+\, d_{TV}\left( \widetilde{P}^{\widetilde{X}_{K_n}^*}, P^*_{X^*} \right) \\= & {} O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2 \big ). \end{aligned}$$
  2. (ii)

    We couple the original and the bootstrap process according to (3.5a) and (3.5b) and show first that

    $$\begin{aligned}{} & {} \widetilde{P} \big ( (\widetilde{X}_t,\widetilde{X}_t^*)\in S\times S \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big )\nonumber \\{} & {} \quad \,\ge \, P\big ( X_t\in S \,\big |\, X_{t-1}=x \big ) \cdot P^*\big ( X_t^*\in S \,\big |\, X_{t-1}^*=x^* \big ) \end{aligned}$$
    (4.3)

    holds for all \(x,x^*\in {\mathbb N}_0\). Let \(x,x^*\in {\mathbb N}_0\) be arbitrary. To simplify notation we set, for a generic set \(B\subseteq {\mathbb N}_0\), \(\pi (B)=\sum _{y\in B}\pi (x,y)\), \(\pi ^*(B)=\sum _{y\in B}\pi ^*(x^*,y)\), and \(\pi \wedge \pi ^*(B)=\sum _{y\in B}\pi (x,y)\wedge \pi ^*(x^*,y)\). If \(\pi \wedge \pi ^*(S)\ge \pi (S)\cdot \pi ^*(S)\), then (4.3) follows immediately. Suppose now the opposite, \(\pi \wedge \pi ^*(S)<\pi (S)\cdot \pi ^*(S)\). Then \(\delta _{x,x^*}>0\), and it follows from (3.5a) and (3.5b) that

    $$\begin{aligned}{} & {} { \widetilde{P} \big ( (\widetilde{X}_t,\widetilde{X}_t^*)\in S\times S \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) } \\{} & {} \quad = \sum _{y\in S} \pi (x,y)\wedge \pi ^*(x^*,y) \\{} & {} \qquad +\, \sum _{y,y^*\in S} \frac{ \big ( \pi (x,y) - \pi (x,y)\wedge \pi ^*(x^*,y) \big ) \, \big ( \pi ^*(x^*,y^*) - \pi (x,y^*)\wedge \pi ^*(x^*,y^*) \big ) }{ \delta _{x,x^*} } \\{} & {} \quad = \pi \wedge \pi ^*(S) \,+\, \frac{ 1 }{ \delta _{x,x^*} } \big ( \pi (S) - \pi \wedge \pi ^*(S) \big ) \big ( \pi ^*(S) - \pi \wedge \pi ^*(S) \big ) \\{} & {} \quad = \pi (S) \cdot \pi ^*(S) \,+\, \frac{ 1 }{ \delta _{x,x^*} } \Big \{ \delta _{x,x^*} \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big )\\{} & {} \qquad +\, \big ( \pi (S) - \pi \wedge \pi ^*(S) \big ) \big ( \pi ^*(S) - \pi \wedge \pi ^*(S) \big ) \Big \}. \end{aligned}$$

    Since \(\delta _{x,x^*}\,=\,1-\pi \wedge \pi ^*({\mathbb N}_0)\,=\,\big (\pi (S)-\pi \wedge \pi ^*(S)\big )+\big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big )\) the term in curly braces is equal to

    $$\begin{aligned}{} & {} \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big )\\{} & {} \qquad \,+\, \big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big ) \\{} & {} \qquad {} \,+\, \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \big (\pi ^*(S)-\pi \wedge \pi ^*(S)\big ) \\{} & {} \quad = \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \pi ^*(S) \, \pi (S^c) \\{} & {} \qquad {} \,+\, \big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big ) \\{} & {} \quad = \pi (S^c) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi \wedge \pi ^*(S) \, \pi ^*(S) \big )\\{} & {} \qquad \,+\, \pi \wedge \pi ^*(S^c) \, \big ( \pi (S) \pi ^*(S) \,-\, \pi \wedge \pi ^*(S) \big ), \end{aligned}$$

    and is therefore non-negative. This proves (4.3). It follows from (3.4) that, for \(y,y^*\in S\) such that \(P_X(\{y^*\})>0\),

    $$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_{t+1}=\widetilde{X}_{t+1}^*=z \,\big |\, \widetilde{X}_t=y, \widetilde{X}_t^*=y^* \big )= & {} \pi (y,z)\wedge \pi ^*(y^*,z) \nonumber \\\ge & {} \kappa \, Q\big ( \{z\} \big ) \,+\, O_P\big ( n^{-1/3} \, \log n \big ). \end{aligned}$$
    (4.4)

    We obtain from (4.3) and (4.4) that there exists some \(\kappa ^*>0\) such that

    $$\begin{aligned}{} & {} { \widetilde{P}\big ( (\widetilde{X}_{t+1},\widetilde{X}_{t+1}^*)=(z,z) \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) } \nonumber \\{} & {} \quad \ge \sum _{y,y^*\in S} \widetilde{P}\big ( (\widetilde{X}_{t+1},\widetilde{X}_{t+1}^*)=(z,z) \,\big |\, \widetilde{X}_t=y, \widetilde{X}_t^*=y^* \big ) \,\nonumber \\{} & {} \qquad \widetilde{P}\big ( (\widetilde{X}_t,\widetilde{X}_t^*)=(y,y^*) \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) \nonumber \\{} & {} \quad \ge \kappa ^* \end{aligned}$$
    (4.5)

    holds with a probability tending to 1. Hence, with a probability tending to 1, the coupled process is \(\phi \)-mixing with geometrically decaying coefficients.

  3. (iii)

    According to (4.5), the coupled process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) satisfies Doeblin’s condition which implies in particular that this process has a unique stationary distribution. Let \(\big ((\widetilde{X}_t^0,\widetilde{X}_t^{*,0})\big )_{t\in {\mathbb N}_0}\) be a stationary version of the coupled process. Since \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is geometrically ergodic we obtain

$$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_t^0 \ne \widetilde{X}_t^{*,0} \big )\le & {} \widetilde{P}\big ( \widetilde{X}_{K_n} \ne \widetilde{X}_{K_n}^* \big ) \,+\, d_{TV}\big ( \widetilde{P}^{(\widetilde{X}_{K_n}^0,\widetilde{X}_{K_n}^{*,0})}, \widetilde{P}^{(\widetilde{X}_{K_n},\widetilde{X}_{K_n}^*)} \big ) \\= & {} O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2 \big ). \end{aligned}$$

\(\square \)

4.2 Some auxiliary lemmas

Lemma 4.1

Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) such that (A2) is fulfilled. For arbitrary \(I\subseteq D\), let

$$\begin{aligned} \eta _t:=\, \big [ {\mathbb {1}}(X_t\le z) \,-\, P(X_t\le z\mid X_{t-1}) \big ] \; {\mathbb {1}}(X_{t-1}\in I), \end{aligned}$$

Then, for arbitrary \(\gamma \in (0,1)\),

$$\begin{aligned} E\left[ \left( \sum _{t=1}^n \eta _t \right) ^4 \right] \,=\, O\big ( (n\, p_I)^2 \,+\, n\, p_I^\gamma \big ), \end{aligned}$$

where \(p_I:=P(X_0\in I)\).

Proof

In view of \(E\big [\big (\sum _{t=1}^n \eta _t\big )^4\big ]=\sum _{s,t,u,v=1}^n E[\eta _s \eta _t \eta _u \eta _v]\) we first consider the terms \(E[\eta _s \eta _t \eta _u \eta _v]\). Let the indices be chronologically ordered, i.e. \(1\le s\le t\le u\le v\le n\). Then it follows from the Markov property that

$$\begin{aligned} E\big [ \eta _s \eta _t \eta _u \eta _v \big ] \,=\, 0 \quad \hbox { if }\quad u<v. \end{aligned}$$

Considering the remaining cases of \(s\le t\le u=v\), we make use of the following equalities.

  1. (a)

    \(s=t=u=v\) Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, E\big [\eta _s^4\big ]\).

  2. (b)

    \(s=t<u=v\) Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s^2, \eta _u^2) \,+\, E\big [\eta _s^2\big ]\,E\big [\eta _u^2\big ]\).

  3. (c)

    \(s<t\le u=v\) Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s, \eta _t \eta _u^2) \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s \eta _t, \eta _u^2)\).

For \(s<u\), there exist \({4 \atopwithdelims ()2}=6\) quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=t_j=s\) for some \(i\ne j\), and \(t_k=t_l=u\) for some \(k\ne l\). For \(s<t<u\), there exist \(4\cdot 3=12\) quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=s\), \(t_j=t\) and \(t_k=t_l=u\) for some \(i,j,k,l\) with \(k\ne l\). Finally, for \(s<t=u\), there exist 4 quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=s\) for some i and \(t_j=u\) for \(j\ne i\). Therefore we obtain

$$\begin{aligned} E\left[ \left( \sum _{t=1}^n \eta _t \right) ^4 \right]\le & {} \sum _{t=1}^n E\big [ \eta _t^4 \big ] + 6\, \sum _{1\le s<u\le n} E\big [ \eta _s^2 \big ] \, E\big [ \eta _u^2 \big ] \nonumber \\{} & {} {} \,+\, 12\, \sum _{r=1}^{n-1} \sum _{(s,t,u)\in {\mathcal T}_{n,r}^{(1)}} \big | \mathop {\textrm{cov}}\nolimits ( \eta _s, \eta _t \eta _u^2) \big | \nonumber \\{} & {} {} \,+\, 12\, \sum _{r=1}^{n-1} \sum _{(s,t,u)\in {\mathcal T}_{n,r}^{(2)}} \big | \mathop {\textrm{cov}}\nolimits ( \eta _s \eta _t, \eta _u^2) \big |, \end{aligned}$$
(4.6)

where

$$\begin{aligned} {\mathcal T}_{n,r}^{(1)}= & {} \big \{ (s,t,u):\; 1\le s<t\le u\le n, \; r:=t-s\ge u-t\big \} \\ {\mathcal T}_{n,r}^{(2)}= & {} \big \{ (s,t,u):\; 1\le s\le t<u\le n, \; r:=u-t> t-s\big \}. \end{aligned}$$
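The multiplicities 1, 6, 12 and 4 used in (4.6) above are purely combinatorial and can be double-checked by brute force. The following short Python script is a minimal sanity check only (the horizon \(n=6\) and all names are chosen merely for illustration); it enumerates all ordered quadruples \((t_1,t_2,t_3,t_4)\) and tallies, for each chronologically ordered configuration \(s\le t\le u=v\), how many ordered quadruples produce it.

from itertools import product
from collections import Counter

n = 6  # small horizon; any n >= 4 exhibits all configurations

counts = Counter()
for quad in product(range(1, n + 1), repeat=4):
    s, t, u, v = sorted(quad)
    if u < v:
        continue  # these terms vanish since E[eta_s eta_t eta_u eta_v] = 0 for u < v
    if s == t == u:
        pattern = "s=t=u=v"
    elif s == t:
        pattern = "s=t<u=v"
    elif t == u:
        pattern = "s<t=u=v"
    else:
        pattern = "s<t<u=v"
    counts[(pattern, (s, t, u))] += 1

expected = {"s=t=u=v": 1, "s=t<u=v": 6, "s<t<u=v": 12, "s<t=u=v": 4}
assert all(m == expected[pattern] for (pattern, _), m in counts.items())
print("multiplicities per sorted configuration:", expected)

For \(n=6\) the script confirms the counts 1, 6, 12 and 4 entering the bound (4.6).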

To estimate the last two terms on the right-hand side of (4.6) we use a well-known covariance inequality for \(\alpha \)-mixing random variables,

$$\begin{aligned} \big | \mathop {\textrm{cov}}\nolimits (X,Y) \big | \,\le \, 4\, \big [ \alpha (\sigma (X),\sigma (Y)) \big ]^{1-1/\alpha -1/\beta } \,\Vert X\Vert _\alpha \,\Vert Y\Vert _\beta , \end{aligned}$$

where \(\alpha ,\beta \in (1,\infty )\) are such that \(1/\alpha +1/\beta <1\) and \(\Vert X\Vert _\alpha <\infty \), \(\Vert Y\Vert _\beta <\infty \); see e.g. Bradley (2007a, Corollary 10.16). Choosing \(\alpha =\beta =2/\gamma \) and taking into account that \(|\eta _s|\le 1\) and \(E|\eta _s|\le p_I\) we obtain that

$$\begin{aligned} \big | \mathop {\textrm{cov}}\nolimits ( \eta _s, \eta _t \eta _u^2 ) \big |\le & {} \big [ \alpha _X(t-s-1) \big ]^{1-\gamma /2-\gamma /2} \, \Vert \eta _s \Vert _{2/\gamma } \, \Vert \eta _t \eta _u^2 \Vert _{2/\gamma } \\\le & {} \big [ \alpha _X(t-s-1) \big ]^{1-\gamma } \, p_I^\gamma \end{aligned}$$

as well as

$$\begin{aligned} \big | \mathop {\textrm{cov}}\nolimits ( \eta _s \eta _t, \eta _u^2 ) \big | \,\le \, \big [ \alpha _X(u-t-1) \big ]^{1-\gamma } \, p_I^\gamma . \end{aligned}$$

Using \(\#{\mathcal T}_{n,r}^{(1)}\le n(r+1)\) and \(\#{\mathcal T}_{n,r}^{(2)}\le nr\) we obtain from (4.6)

$$\begin{aligned} E\left[ \left( \sum _{t=1}^n \eta _t \right) ^4 \right] \,\le \, & n\, p_I \,+\, 6\, (n\, p_I)^2 \,+\, 12\, n\, \sum _{r=1}^{n-1} (2r+1) \big [ \alpha _X(r-1) \big ]^{1-\gamma } \, p_I^\gamma \\ \,=\, & O\Big ( (n\, p_I)^2 \,+\, n\, p_I^\gamma \Big ), \end{aligned}$$

which completes the proof. \(\square \)

Lemma 4.2

Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) and stationary distribution \(P_X\) such that (A2) is fulfilled. For arbitrary \(I\subseteq D\), let \(N_n(I):=\#\{t\le n:\, X_{t-1}\in I\}\). Then, for arbitrary \(\delta >0\) and \(\kappa <\infty \), and for any \(I\) with \(P_X(I)\ge n^{\delta -1}\),

$$\begin{aligned} P\big ( |N_n(I) \,-\, n\,P_X(I)| \,>\, n\, P_X(I)/2 \big ) \,=\, O\big ( n^{-\kappa } \big ). \end{aligned}$$

Proof

Let \(q\in 2{\mathbb N}\) and \(\epsilon >0\). Since

$$\begin{aligned} \sum _{r=1}^\infty r^{q-2} \, [\alpha _X(r)]^{\epsilon /(q+\epsilon )} \,<\, \infty \end{aligned}$$

it follows from an extension of Rosenthal’s inequality (see e.g. Theorem 2 in Section 1.4.1 in Doukhan (1994)) that

$$\begin{aligned} E\Big [ \big | N_n(I) \,-\, n\, P_X(I) \big |^q \Big ]= & {} E\left[ \Big | \sum _{t=1}^n \big ({\mathbb {1}}(X_{t-1}\in I) \,-\, P_X(I)\big ) \Big |^q \right] \nonumber \\\le & {} C_q\, \big \{ n^{q/2}\, P_X(I)^{q/(2+\epsilon )} \,+\, n\, P_X(I)^{q/(q+\epsilon )} \big \}. \nonumber \\ \end{aligned}$$
(4.7)

Choosing \(\epsilon >0\) small enough we have that \(n\,P_X(I)^{2-2/(2+\epsilon )}\ge n^{\delta '}\) for some \(\delta '>0\). Therefore we obtain from Markov’s inequality that

$$\begin{aligned} P\big ( |N_n(I) \,-\, n\, P_X(I)| \,>\, n\, P_X(I)/2 \big ) \,\le \, & C_q \; \frac{ n^{q/2}\, P_X(I)^{q/(2+\epsilon )} \,+\, n\, P_X(I)^{q/(q+\epsilon )} }{ (n\, P_X(I)/2)^q } \\ \,=\, & O\Big ( \big (n\, P_X(I)^{2-2/(2+\epsilon )}\big )^{-q/2} \,+\, n\; (n \, P_X(I))^{-q} \Big ) \,=\, O\big ( n^{-\kappa } \big ), \end{aligned}$$

if q is chosen sufficiently large. \(\square \)
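Although not needed for the proofs, the concentration stated in Lemma 4.2 is easy to illustrate numerically. The sketch below is a minimal illustration only; it assumes a Poisson INAR(1) chain with thinning parameter a and Poisson(lam) innovations (so that the stationary law is Poisson(lam/(1-a))), none of which is required by the lemma, and all parameter values are arbitrary.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
a, lam, n = 0.4, 2.0, 20_000           # illustrative parameters only
mu = lam / (1.0 - a)                   # stationary mean of this INAR(1) chain

# simulate X_t = a o X_{t-1} + eps_t (binomial thinning, Poisson innovations)
x = np.empty(n + 1, dtype=np.int64)
x[0] = rng.poisson(mu)                 # start in the stationary distribution
for t in range(1, n + 1):
    x[t] = rng.binomial(x[t - 1], a) + rng.poisson(lam)

I_lo, I_hi = 2, 4                      # the set I = {2, 3, 4}
N_n = int(np.sum((x[:n] >= I_lo) & (x[:n] <= I_hi)))   # counts X_{t-1}, t = 1, ..., n
p_I = poisson.cdf(I_hi, mu) - poisson.cdf(I_lo - 1, mu)

print(f"N_n(I) = {N_n},  n * P_X(I) = {n * p_I:.1f}")
print(f"|N_n(I) - n P_X(I)| / (n P_X(I)) = {abs(N_n - n * p_I) / (n * p_I):.4f}")

For a sample size of this order the relative deviation should be far below the threshold 1/2 appearing in the lemma.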

Lemma 4.3

Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) and stationary distribution \(P_X\) such that (A2) is fulfilled. Then there exists some \(C<\infty \) such that, for arbitrary \(\underline{x}\le \overline{x}\) with \(P_X([\underline{x},\overline{x}])\ge n^{-1/3}\),

$$\begin{aligned} E\left[ \sup _{v:\, v\ge \overline{x}} \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}(X_{t-1}\in [\underline{x},v]) \big | }{ \#\{t\le n:\; X_{t-1}\in [\underline{x},v]\} \vee 1 } \right] \,\le \, C\, n^{-1/3} \nonumber \\ \end{aligned}$$
(4.8a)

and

$$\begin{aligned} E\bigg [ \sup _{u:\, u\le \underline{x}} \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}(X_{t-1}\in [u,\overline{x}]) \big | }{ \#\{t\le n:\; X_{t-1}\in [u,\overline{x}]\} \vee 1 } \bigg ] \,\le \, C\, n^{-1/3}. \nonumber \\ \end{aligned}$$
(4.8b)

Proof

We prove only (4.8a) since the proof of (4.8b) is completely analogous. The proof is carried out in two steps. First we consider the technically simpler case where the distribution function \(F_X\) is continuous. This allows us to define a suitable dyadic family of intervals which leads to a readily comprehensible proof. Afterwards we extend the result to the general case.

Step 1 Suppose that \(F_X\) is continuous. First we prove that, for arbitrary \(\delta >0\), there exists some \(C<\infty \) such that, for each \(v\ge \overline{x}\),

$$\begin{aligned} E\left[ \sup _{x:\, \underline{x}\le x\le v} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P( X_t\le z\mid X_{t-1} ) \big ] {\mathbb {1}}( X_{t-1} \in [\underline{x},x] ) \Big | \right] \,\le \, C\, \sqrt{n\, P_X( [\underline{x},v] )} \,+\, n^\delta . \end{aligned}$$
(4.9)

To deal with the supremum we define a suitable system of dyadic intervals. Let \(J_n\in {\mathbb N}\) be such that \(n^{\delta -1}/2< 2^{-J_n}P_X([\underline{x},v])\le n^{\delta -1}\). For \(j=1,2,\ldots ,J_n\) and \(k=1,2,\ldots ,2^j\), we set

$$\begin{aligned} x_{j,k} \,=\, F_X^{-1}\big ( F_X(\underline{x}) + k2^{-j} P_X([\underline{x},v]) \big ) \end{aligned}$$
(4.10a)

and, for \(j=1,\ldots ,J_n\),

$$\begin{aligned} B_{j,k} \,=\, \left\{ \begin{array}{ll} [ \underline{x}, x_{j,1} ] &{} \hbox { if }\quad k=1, \\ ( x_{j,k-1}, x_{j,k}] &{} \hbox { if }\quad k=2,\ldots ,2^j. \end{array} \right. \end{aligned}$$
(4.10b)

We have that

$$\begin{aligned} P_X\big ( B_{j,k} \big ) \,=\, 2^{-j} P_X([\underline{x},v]). \end{aligned}$$
(4.11)

We define the partial sums

$$\begin{aligned} S_{j,k} \,=\, \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1})\big ] \, {\mathbb {1}}(X_{t-1}\in B_{j,k}). \end{aligned}$$
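To make the dyadic construction (4.10a), (4.10b) and the partial sums \(S_{j,k}\) concrete, the following sketch computes them for a Gaussian AR(1) chain \(X_t=aX_{t-1}+\varepsilon _t\) with \(\varepsilon _t\sim N(0,1)\); this specific model, the parameter values and all names are chosen here only for illustration (then \(F_X\) is the \(N(0,1/(1-a^2))\) distribution function and \(P(X_t\le z\mid X_{t-1}=x)=\Phi (z-ax)\)).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
a, n, z = 0.5, 5_000, 0.0                      # illustrative parameters only
sd_X = 1.0 / np.sqrt(1.0 - a**2)               # stationary standard deviation
F_X, F_X_inv = norm(scale=sd_X).cdf, norm(scale=sd_X).ppf

# simulate the chain, started in the stationary distribution
x = np.empty(n + 1)
x[0] = rng.normal(scale=sd_X)
for t in range(1, n + 1):
    x[t] = a * x[t - 1] + rng.normal()

x_lo, v = -0.5, 1.5                             # the interval [x_lo, v]
p_lv = F_X(v) - F_X(x_lo)                       # P_X([x_lo, v])

# 1(X_t <= z) - P(X_t <= z | X_{t-1}),  t = 1, ..., n
resid = (x[1:] <= z).astype(float) - norm.cdf(z - a * x[:n])

def S(j, k):
    """Partial sum S_{j,k} over the interval B_{j,k} from (4.10a), (4.10b)."""
    left = x_lo if k == 1 else F_X_inv(F_X(x_lo) + (k - 1) * 2.0**-j * p_lv)
    right = F_X_inv(F_X(x_lo) + k * 2.0**-j * p_lv)
    lower_ok = x[:n] >= left if k == 1 else x[:n] > left
    return float(np.sum(resid[lower_ok & (x[:n] <= right)]))

print([round(S(3, k), 2) for k in range(1, 9)])  # the 2^3 partial sums at scale j = 3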

Choosing \(\gamma \) such that \((1-2\delta )/(1-\delta )\le \gamma <1\) we have that \(n(2^{-j}P_X([\underline{x},v]))^\gamma =O\big ( (n2^{-j}P_X([\underline{x},v]))^2 \big )\) for all \(j=1,\ldots ,J_n\). Hence, the first term in the bound given in Lemma 4.1 dominates the second and we obtain, for \(j=1,\ldots ,J_n,\; k=1,\ldots ,2^j\),

$$\begin{aligned} E\big [ S_{j,k}^4 \big ] \,=\, O\Big ( \big ( n2^{-j} P_X([\underline{x},v]) \big )^2 \Big ), \end{aligned}$$

which implies that

$$\begin{aligned} E\Big [ \big | S_{j,k} \big |\, {\mathbb {1}}\big ( |S_{j,k}|> \sqrt{n P_X([\underline{x},v])}\, 2^{-j/4} \big )\Big ] \,\le \, & \frac{ E\big [ S_{j,k}^4 \big ] }{ \big ( \sqrt{n P_X([\underline{x},v])}\, 2^{-j/4} \big )^3 } \\ \,=\, & O\Big ( \sqrt{n P_X([\underline{x},v])}\, 2^{-5j/4} \Big ). \end{aligned}$$

Therefore, we obtain that

$$\begin{aligned} E\Big [ \max _{1\le k\le 2^j} \big | S_{j,k} \big | \Big ] \,\le \, & \sqrt{n P_X([\underline{x},v])} \, 2^{-j/4} \,+\, \sum _{k=1}^{2^j} E\Big [ \big | S_{j,k} \big |\, {\mathbb {1}}\big ( |S_{j,k}|> \sqrt{n P_X([\underline{x},v])}\, 2^{-j/4} \big )\Big ] \nonumber \\ \,=\, & O\Big ( \sqrt{n P_X([\underline{x},v])}\, 2^{-j/4} \Big ). \end{aligned}$$
(4.12)

At the finest scale \(J_n\), we define, for \(k=1,\ldots ,2^{J_n}\),

$$\begin{aligned} N_{J_n,k} \,=\, \#\big \{ t\le n:\;\; X_{t-1}\in B_{J_n,k} \big \}. \end{aligned}$$

Note that \(EN_{J_n,k}=n2^{-J_n}P_X([\underline{x},v])\le n^\delta \). We obtain from (4.7) that

$$\begin{aligned} E\big [ N_{J_n,k} \, {\mathbb {1}}( N_{J_n,k}> 2n^\delta ) \big ] \,\le \, & 2\, E\Big [ \big | N_{J_n,k} - EN_{J_n,k} \big | \, {\mathbb {1}}\big ( |N_{J_n,k} - EN_{J_n,k}|>n^\delta \big ) \Big ] \\ \,\le \, & 2\, C_q\, \frac{ n^{q/2}\, (2^{-J_n}P_X([\underline{x},v]))^{q/(2+\epsilon )} \,+\, n\, (2^{-J_n}P_X([\underline{x},v]))^{q/(q+\epsilon )} }{ n^{\delta (q-1)} } \\ \,=\, & O\big ( n^{-\kappa } \big ) \end{aligned}$$

holds for arbitrary \(\kappa <\infty \) if q is chosen large enough. Since \(2^{J_n}< 2\, P_X([\underline{x},v])\, n^{1-\delta }\le 2\, n^{1-\delta }\) we obtain

$$\begin{aligned} E\left[ \max _{1\le k\le 2^{J_n}} \big \{ N_{J_n,k} \big \} \right] \,\le \, 2\, n^\delta \,+\, \sum _{k=1}^{2^{J_n}} E\big [ N_{J_n,k} \, {\mathbb {1}}( N_{J_n,k}> 2n^\delta ) \big ] \,=\, O\big ( n^\delta \big ). \end{aligned}$$
(4.13)

After these preparatory steps we are in a position to estimate the expected value of the supremum. For arbitrary \(x\in [\underline{x},v]\), there exist p and \((j_1,k_1),\ldots ,(j_p,k_p),k\), \(1\le j_1<\cdots <j_p\le J_n\), such that \(B_{j_1,k_1},\ldots ,B_{j_p,k_p},B_{J_n,k}\) are adjacent intervals and

$$\begin{aligned} B_{j_1,k_1} \cup \cdots \cup B_{j_p,k_p} \subseteq [\underline{x},x] \subseteq B_{j_1,k_1} \cup \cdots \cup B_{j_p,k_p} \cup B_{J_n,k}. \end{aligned}$$

This implies that

$$\begin{aligned} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P( X_t\le z\mid X_{t-1} ) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x}, x] \big ) \Big | \,\le \, \sum _{l=1}^p \big | S_{j_l,k_l} \big | \,+\, N_{J_n,k} \end{aligned}$$

and, therefore,

$$\begin{aligned} & E\left[ \sup _{x:\, \underline{x}\le x\le v} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P( X_t\le z\mid X_{t-1} ) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x}, x] \big ) \Big | \right] \\ & \quad \le \, \sum _{j=1}^{J_n} E\Big [ \max _{1\le k\le 2^j} \big \{ |S_{j,k}| \big \} \Big ] \,+\, E\Big [ \max _{1\le k\le 2^{J_n}} \big \{ N_{J_n,k} \big \} \Big ]. \end{aligned}$$

It follows from (4.12) and (4.13) that (4.9) is fulfilled.
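For the reader's convenience, the summation behind this conclusion can be spelled out as follows (this merely combines (4.12) and (4.13); no new ingredient is involved):

$$\begin{aligned} \sum _{j=1}^{J_n} E\Big [ \max _{1\le k\le 2^j} |S_{j,k}| \Big ] \,=\, O\Big ( \sqrt{n\, P_X([\underline{x},v])} \, \sum _{j=1}^{J_n} 2^{-j/4} \Big ) \,=\, O\Big ( \sqrt{n\, P_X([\underline{x},v])} \Big ), \end{aligned}$$

since \(\sum _{j\ge 1} 2^{-j/4}=1/(2^{1/4}-1)<\infty \), while the remainder term is of order \(O(n^\delta )\) by (4.13); together this gives the right-hand side of (4.9).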

Now we are in a position to prove (4.8a). We define a dyadic sequence of growing intervals, \(I_0=[\underline{x},\overline{x}]\) and, for \(j\ge 1\)

$$\begin{aligned} I_j \,=\, \big [ \underline{x}, F_X^{-1}\big ( F_X(\underline{x}) \,+\, 2^j P_X([\underline{x},\overline{x}]) \big ) \big ]. \end{aligned}$$

(There exists some \(K_n\ge 0\) such that \(P_X(I_j)=2^j P_X([\underline{x},\overline{x}])\) for \(j=0,\ldots ,K_n\) and \(P_X(I_{K_n+1})<2^{K_n+1}P_X([\underline{x},\overline{x}])\). Then \(I_{K_n+1}=I_{K_n+2}=\ldots \).) Define the event

$$\begin{aligned} A_n \,=\, \big \{ \omega :\; \#\{t\le n:\; X_{t-1}(\omega )\in I_j\} \ge n\, P_X(I_j)/2 \;\; \hbox { for all } j=0,\ldots ,K_n \big \}. \end{aligned}$$

For \(x\in I_{j+1}\setminus I_j\) we use the estimate

$$\begin{aligned} & \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x}, x] \big ) \big | }{ \#\big \{ t\le n:\; X_{t-1}\in [\underline{x},x] \big \} \vee 1 } \\ & \quad \le \, \frac{ \sup _{y\in I_{j+1}} \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x},y] \big ) \big | }{ \#\big \{ t\le n:\; X_{t-1}\in I_j \big \} \vee 1 }. \end{aligned}$$

It follows from Lemma 4.2, applied to \(I=I_j\) for \(j=0,\ldots ,K_n\), that \(P\big ( A_n^c \big )=O\big ( n^{-1/3} \big )\), which implies that

$$\begin{aligned} & E\left[ \sup _{x:\, x\ge \overline{x}} \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x}, x] \big ) \big | }{ \#\big \{ t\le n:\; X_{t-1}\in [\underline{x},x] \big \} \vee 1 } \right] \nonumber \\ & \quad \le \, E\left[ \sup _{x:\, x\ge \overline{x}} \frac{ \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x}, x] \big ) \big | }{ \#\big \{ t\le n:\; X_{t-1}\in [\underline{x},x] \big \} } \;\; {\mathbb {1}}_{A_n} \right] \,+\, P\big ( A_n^c \big ) \nonumber \\ & \quad \le \, \sum _{j=0}^{K_n} \frac{ E\Big [ \sup _{y\in I_{j+1}} \big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}\big ( X_{t-1}\in [\underline{x},y] \big ) \big | \Big ] }{ n\; P_X(I_j)/2 } \,+\, O\big ( n^{-1/3} \big ) \nonumber \\ & \quad = \, O\left( \sum _{j=0}^{K_n} 2^{-j/2} \Big / \sqrt{ n\, P_X([\underline{x},\overline{x}]) } \right) \,+\, O\big ( n^{-1/3} \big ) \,=\, O\big ( n^{-1/3} \big ), \end{aligned}$$
(4.14)

i.e. (4.8a) is fulfilled; the final step uses \(P_X([\underline{x},\overline{x}])\ge n^{-1/3}\), which gives \(1/\sqrt{n\, P_X([\underline{x},\overline{x}])}\le n^{-1/3}\).

Step 2 In case of a general stationary distribution \(P_X\), the definition according to (4.10a) and (4.10b) no longer guarantees that the convenient property \(P_X( B_{j,k} )=2^{-j}P_X([\underline{x},v])\) holds true. In order to draw on the calculations in Step 1 we proceed as follows. Let \((V_t)_{t\in {\mathbb N}_0}\) be a sequence of independent random variables following a uniform distribution on [0, 1], which is independent of the process \((X_t)_{t\in {\mathbb N}_0}\). For the latter process we define an accompanying sequence \((U_t)_{t\in {\mathbb N}_0}\) of uniformly distributed random variables, where \(U_t\) depends on the pair \((X_t,V_t)\) as follows. If \(F_X\) is continuous at the point \(X_t\), then we simply set

$$\begin{aligned} U_t:=\, F_X(X_t). \end{aligned}$$

Otherwise, if \(F_X\) is discontinuous at \(X_t\), then \(P_X(\{X_t\})=F_X(X_t)-F_X(X_t-0)>0\) and we set

$$\begin{aligned} U_t:=\, F_X(X_t) \,-\, V_t\, P_X(\{X_t\}). \end{aligned}$$

In both cases we have that

$$\begin{aligned} X_t \,=\, F_X^{-1}(U_t), \end{aligned}$$

where \(G^{-1}(t)=\inf \{x:G(x)\ge t\}\) denotes the generalized inverse of a generic distribution function G. Since \(F_X\) has at most countably many discontinuity points, it follows that the mapping \((X_t,V_t)\mapsto U_t\) is measurable. It also follows that \(U_t\) has a uniform distribution on [0, 1]. Furthermore the process \(\big ( (X_t,V_t) \big )_{t\in {\mathbb N}_0}\) has the same mixing properties as \(\big (X_t\big )_{t\in {\mathbb N}_0}\), i.e.

$$\begin{aligned} \alpha _{(X,V)}(r) \,=\, \alpha _X(r) \qquad \forall r\ge 1; \end{aligned}$$

see e.g. Lemma 8 in Bradley (1981). Now we obtain in complete analogy to the calculations leading to (4.9) in Step 1 that, for arbitrary \(0\le \underline{u}<\overline{u}\le 1\),

$$\begin{aligned} E\bigg [ \sup _{u:\, \underline{u}\le u\le \overline{u}} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}(U_{t-1}\in [\underline{u},u]) \Big | \bigg ] \,\le \, C\, \sqrt{n\,(\overline{u}-\underline{u})} \,+\, n^\delta . \end{aligned}$$
(4.15)
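The variables \(U_t\) above are obtained by a randomized probability integral transform. As a purely numerical illustration (assuming a Poisson marginal, which plays no role in the lemma, and with all parameter values chosen arbitrarily), the following sketch checks that \(U_t\) is uniformly distributed on [0, 1] and that \(X_t\) is recovered as \(F_X^{-1}(U_t)\) with the generalized inverse.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
mu, n = 3.0, 100_000                        # illustrative marginal: Poisson(3)

x = rng.poisson(mu, size=n)                 # stand-ins for the stationary values X_t
v_aux = rng.uniform(size=n)                 # the auxiliary uniforms V_t

# randomized PIT: U_t = F_X(X_t) - V_t * P_X({X_t})   (F_X jumps at every integer)
u = poisson.cdf(x, mu) - v_aux * poisson.pmf(x, mu)

# generalized inverse F_X^{-1}(u) = inf{k : F_X(k) >= u}, evaluated on a finite grid
k_grid = np.arange(0, 60)
cdf_grid = poisson.cdf(k_grid, mu)
x_back = k_grid[np.searchsorted(cdf_grid, u, side="left")]

grid = np.linspace(0.0, 1.0, 21)
ecdf = (u[:, None] <= grid[None, :]).mean(axis=0)
print("fraction of X_t recovered exactly :", np.mean(x_back == x))
print("max |empirical cdf of U - uniform|:", np.max(np.abs(ecdf - grid)))

The product form \(V_t\, P_X(\{X_t\})\) is what spreads \(U_t\) uniformly over the jump interval \((F_X(X_t-0),F_X(X_t)]\).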

It is easy to see that the following inclusions hold true for \(\underline{x}\le x\):

$$\begin{aligned} \big \{ F_X(\underline{x}-0) < U_{t-1} \le F_X(x) \big \}\subseteq & {} \big \{ \underline{x} \le X_{t-1} \le x \big \}\\\subseteq & {} \big \{ F_X(\underline{x}-0) \le U_{t-1} \le F_X(x) \big \}. \end{aligned}$$

Indeed, the second inclusion follows immediately from the construction of \(U_{t-1}\). Regarding the first one, note that it follows again from the construction of \(U_{t-1}\) that \(F_X(\underline{x}-0) < U_{t-1}\) implies \(\underline{x}\le X_{t-1}\). Furthermore, \(U_{t-1}\le F_X(x)\) implies \(X_{t-1}=F_X^{-1}(U_{t-1})\le F_X^{-1}(F_X(x))\le x\). Since \(P(U_{t-1}=F_X(\underline{x}-0))=0\) we conclude that

$$\begin{aligned} & \sup _{x:\, \underline{x}\le x\le \overline{x}} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}(X_{t-1}\in [\underline{x},x]) \Big | \\ & \quad \le \, \sup _{u:\, F_X(\underline{x}-0)\le u\le F_X(\overline{x})} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] \, {\mathbb {1}}(U_{t-1}\in [F_X(\underline{x}-0),u]) \Big | \end{aligned}$$

holds with probability one. Hence, we obtain from (4.15)

$$\begin{aligned} & E\bigg [ \sup _{x:\, \underline{x}\le x\le v} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] {\mathbb {1}}(X_{t-1}\in [\underline{x},x]) \Big | \bigg ] \\ & \quad \le \, E\bigg [ \sup _{u:\, F_X(\underline{x}-0)\le u\le F_X(v)} \Big | \sum _{t=1}^n \big [ {\mathbb {1}}(X_t\le z) - P(X_t\le z\mid X_{t-1}) \big ] \, {\mathbb {1}}(U_{t-1}\in [F_X(\underline{x}-0),u]) \Big | \bigg ] \\ & \quad \le \, C\, \sqrt{n \, \big (F_X(v)-F_X(\underline{x}-0)\big ) } \,+\, n^\delta \,=\, C\, \sqrt{n \, P_X\big ( [\underline{x},v] \big ) } \,+\, n^\delta , \end{aligned}$$

i.e. (4.9) holds true. \(\square \)