Abstract
The Markov property is shared by several popular models for time series such as autoregressive or integer-valued autoregressive processes as well as integer-valued ARCH processes. A natural assumption which is fulfilled by corresponding parametric versions of these models is that the random variable at time t gets stochastically greater conditioned on the past, as the value of the random variable at time \(t-1\) increases. Then the associated family of conditional distribution functions has a certain monotonicity property which allows us to employ a nonparametric antitonic estimator. This estimator does not involve any tuning parameter which controls the degree of smoothing and is therefore easy to apply. Nevertheless, it is shown that it attains a rate of convergence which is known to be optimal in similar cases. This estimator forms the basis for a new method of bootstrapping Markov chains which inherits the properties of simplicity and consistency from the underlying estimator of the conditional distribution function.
1 Introduction
We consider a time-homogeneous Markov chain \((X_t)_{t\in {\mathbb N}_0}\) driven by a transition kernel which satisfies a certain monotonicity property: the conditional distribution of the random variable at time t gets stochastically greater as the value of the variable at time \(t-1\) increases. Such a condition is actually satisfied by several popular models for time series such as autoregressive or integer-valued autoregressive as well as integer-valued ARCH processes under natural assumptions on the involved parameters. To be specific, we assume that, for each fixed z, \(F_x(z):=P(X_t\le z\mid X_{t-1}=x)\) is antitonic (monotonically non-increasing) in x. This assumption allows us to employ a nonparametric antitonic estimator \(\widehat{F}_x(z)\) of the function \(x\mapsto F_x(z)\). Our estimator does not involve any tuning parameter which controls the degree of smoothing and is therefore easy to apply. Moreover, its consistency does not require smoothness properties of the function \(x\mapsto F_x(z)\); the postulated monotonicity suffices. Theorem 2.1 states that the estimator \(\widehat{F}_x(z)\) converges in \(L^1\) norm, weighted by the stationary distribution of the Markov chain, with a rate of \(n^{-1/3}\) which is believed to be the optimal one.
The estimator of \(F_x(z)\) serves as a basis for a new bootstrap method for Markov chains. Among several other methods, those proposed by Rajarshi (1990) and Paparoditis and Politis (2002) are the closest ones to our proposal. While Rajarshi’s bootstrap procedure is based on a nonparametric estimate of the one-step transition density, Paparoditis and Politis (2002) used in their so-called local bootstrap a local resampling of the original data set. In both papers, the proof of consistency of the respective bootstrap method is based on the assumption of a smooth transition density. In contrast, our approach does not require any smoothness assumption on the transition mechanism; it is merely based on the monotonicity assumption on the Markov kernel. We show its applicability for Markov chains with state space \({\mathbb N}_0=\{0,1,2,\ldots \}\). Consistency of the bootstrap can be shown in a most transparent way by a so-called coupling of the original process and its bootstrap counterpart, i.e. we define versions \((\widetilde{X}_t)_{t\in {\mathbb N}_0}\) and \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) of these processes on a common probability space \((\widetilde{\Omega },\widetilde{\mathcal A},\widetilde{P})\) such that the corresponding random variables \(\widetilde{X}_t\) and \(\widetilde{X}_t^*\) are equal with a high probability. Somewhat surprisingly, this natural approach has rarely been used in statistics. Using the Mallows metric to measure the distance between variables from the original and the bootstrap process, it was implicitly employed in the context of independent random variables by Bickel and Freedman (1981) and Freedman (1981). A more explicit use of coupling was made, in the context of U- and V-statistics, but again in the independent case, by Dehling and Mikosch (1994) and Leucht and Neumann (2009). For dependent data, this approach was adopted by Leucht and Neumann (2013), Leucht et al. (2015), and Neumann (2021).
Our second main result, Theorem 3.1, describes the results of our coupling approach. The stationary distribution \(P^*_{X^*}\) of the bootstrap process converges in total variation norm and in probability to that of the original process. The coupled process is \(\phi \)-mixing with coefficients decaying at an exponential rate and the corresponding values \(\widetilde{X}_t^0\) and \(\widetilde{X}_t^{*,0}\) of a stationary version of the coupled process coincide with a probability converging to 1. These general results can then be used to prove bootstrap consistency for specific statistics. The proofs of our main theorems and some auxiliary results are postponed to the final Sect. 4.
2 An estimator of a monotone family of distribution functions
Suppose that we observe random variables \(X_0,X_1,\ldots ,X_n\), where \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is a strictly stationary Markov chain with state space \(D\subseteq {\mathbb R}\), defined on a probability space \((\Omega ,{\mathcal A},P)\). We denote the stationary distribution by \(P_X\) and the corresponding distribution function by \(F_X\). Let \((F_x)_{x\in {\mathbb R}}\), defined by \(F_x(z)=P(X_t\le z\mid X_{t-1}=x)\), be the corresponding family of conditional distribution functions. We impose the following as our key assumption.
-
(A1)
For each \(z\in {\mathbb R}\), the function \(x\mapsto F_x(z)\) is monotonically non-increasing, i.e. if \(x_1< x_2\), then \(P(X_t\le z\mid X_{t-1}=x_1)\ge P(X_t\le z\mid X_{t-1}=x_2)\). In addition we suppose that
-
(A2)
\({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is strong mixing with exponentially decaying coefficients \(\alpha _X(k)\), i.e.
$$\begin{aligned} \alpha _X(k) \,=\, O\big ( \rho ^k \big ), \end{aligned}$$for some \(\rho \in [0,1)\).
Assumption (A1) may be paraphrased as follows. If \(x_1<x_2\) and if \(Y_1\) and \(Y_2\) are random variables following the respective conditional distributions \(P^{X_t\mid X_{t-1}=x_1}\) and \(P^{X_t\mid X_{t-1}=x_2}\), then \(Y_2\) is stochastically not smaller than \(Y_1\). It turns out that this assumption is actually satisfied by popular classes of Markov chain models under natural assumptions. Here is a list of models we have in mind:
-
(1)
Nonlinear autoregressive processes with non-decreasing link The process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is assumed to obey the model equation
$$\begin{aligned} X_t \,=\, f(X_{t-1}) \,+\, \varepsilon _t \qquad \forall t\in {\mathbb N}, \end{aligned}$$where \((\varepsilon _t)_{t\in {\mathbb N}}\) is a sequence of i.i.d. random variables and \(\varepsilon _t\) is independent of \(X_{t-1},\ldots ,X_0\). If the function \(f:\,{\mathbb R}\rightarrow {\mathbb R}\) is monotonically non-decreasing, then, for \(x_1<x_2\),
$$\begin{aligned} P\big (X_t\le z\mid X_{t-1}=x_1\big )= & {} P\big (\varepsilon _t\le z-f(x_1)\big ) \,\ge \, P\big (\varepsilon _t\le z-f(x_2)\big )\\= & {} P\big (X_t\le z\mid X_{t-1}=x_2\big ). \end{aligned}$$Furthermore, if \(\varepsilon _t\) has an everywhere positive density and if
$$\begin{aligned} \big | f(x) \big | \,\le \, \gamma |x| \,-\, \epsilon \qquad \forall x\ge K, \end{aligned}$$for some \(\gamma <1\), \(\epsilon >0\), and \(K<\infty \), then the process \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see e.g. Doukhan (1994).
-
(2)
Branching processes with immigration Let \(X_0\), \((Z_{t,k})_{t,k\in {\mathbb N}}\) and \((\varepsilon _t)_{t\in {\mathbb N}}\) be mutually independent random variables taking values in \({\mathbb N}_0\). We assume that \((Z_{t,k})_{t,k\in {\mathbb N}}\) as well as \((\varepsilon _t)_{t\in {\mathbb N}}\) are sequences of identically distributed random variables. Then the process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) given by
$$\begin{aligned} X_t \,=\, \sum _{k=1}^{X_{t-1}} Z_{t,k} \,+\, \varepsilon _t \qquad \forall t\in {\mathbb N}\end{aligned}$$is a branching process with immigration. In the special case of \(Z_{t,k}\sim \hbox {Bin}(1,\alpha )\) we obtain a so-called first-order integer-valued autoregressive (INAR(1)) process which was proposed by McKenzie (1985) and Al-Osh and Alzaid (1987). Since the \(Z_{t,k}\) are non-negative random variables, it is obvious that (A1) is fulfilled. If in addition \(E\varepsilon _t<\infty \) and \(EZ_{t,k}<1\), then \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see Pakes (1971).
-
(3)
Poisson-INARCH processes The process \({\textbf{X}}=(X_t)_{t\in {\mathbb N}_0}\) is an integer-valued ARCH process of order 1 with Poisson innovations (Poisson-INARCH(1)) if
$$\begin{aligned} X_t\mid {\mathcal F}_{t-1} \sim \hbox {Poisson}\big ( f(X_{t-1}) \big ), \end{aligned}$$where \({\mathcal F}_s\) denotes the \(\sigma \)-algebra generated by \(X_0,\ldots ,X_s\). If f is monotonically non-decreasing, then we obtain, for \(x_1<x_2\) and \(Y_1\sim \hbox {Poisson}(f(x_1))\), \(Y_2\sim \hbox {Poisson}(f(x_2))\),
$$\begin{aligned} P( X_t\le z\mid X_{t-1}=x_1 ) \,=\, P( Y_1\le z) \,\ge \, P( Y_2\le z) \,=\, P( X_t\le z\mid X_{t-1}=x_2 ), \end{aligned}$$i.e., (A1) is fulfilled. Furthermore, if in addition
$$\begin{aligned} f(x) \,\le \, \gamma x \,-\, \epsilon \qquad \forall x\ge K, \end{aligned}$$for some \(\gamma <1\), \(\epsilon >0\), and \(K<\infty \), then \({\textbf{X}}\) has a unique stationary distribution and satisfies (A2); see e.g. Theorem 2 in Doukhan (1994, Sec. 2.4, p. 90).
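As a quick numerical illustration of (A1) for the Poisson-INARCH(1) model, the following Python sketch checks that the conditional distribution function is antitonic in the conditioning value. The bounded, non-decreasing link function and its parameter values are illustrative choices of ours, not prescribed by the model class:

```python
import math

def poisson_cdf(z, lam):
    """P(Y <= z) for Y ~ Poisson(lam), computed from the series definition."""
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(int(z) + 1))

def f(x, alpha0=2.0, alpha1=0.5, beta=6.0):
    """A bounded, non-decreasing link function (illustrative parameters)."""
    return min(alpha0 + alpha1 * x, beta)

# (A1): for x1 < x2 the conditional CDF of X_t given X_{t-1} = x1
# dominates the one given X_{t-1} = x2 pointwise in z.
for z in range(12):
    for x1 in range(10):
        for x2 in range(x1 + 1, 10):
            assert poisson_cdf(z, f(x1)) >= poisson_cdf(z, f(x2))
```

The assertions pass because the Poisson CDF at a fixed point is non-increasing in the mean, and f is non-decreasing.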
We consider an estimator of \(F_x(z)=P\big ( X_t\le z\mid X_{t-1}=x\big )\) which takes into account that the function \(x\mapsto F_x(z)\) is monotonically non-increasing under (A1). Nonparametric estimators of monotone functions have a long history and were proposed e.g. by Brunk (1955) and Ayer et al. (1955). Denote by \({\mathbb {1}}(\cdot )\) the indicator function. For \(z\in D\) and \(x\in \{X_0,\ldots ,X_{n-1}\}\), we define
and
It is well-known that \(\widehat{F}_x^{(\max -\min )}(z)=\widehat{F}_x^{(\min -\max )}(z)\) for all \(x\in \{X_0,\ldots ,X_{n-1}\}\), see e.g. Theorem 1 in Brunk (1955) and Theorem 1.4.4 in Robertson, Wright, and Dykstra (1988, p. 23). As pointed out by Deng and Zhang (2020), (2.1a) and (2.1b) have to be modified for \(x\not \in \{X_0,\ldots ,X_{n-1}\}\). Since it could well happen that an interval \([u,v]\) with \(x\in [u,v]\) does not contain any point from the collection \(\{X_0,\ldots ,X_{n-1}\}\), we set \(n_{u,v}=\#\{t\le n:\, X_{t-1}\in [u,v]\}\), \(n_{u,*}=\#\{t\le n:\, u\le X_{t-1}\}\), \(n_{*,v}=\#\{t\le n:\, X_{t-1}\le v\}\), and define
and
The estimators \(\widehat{F}_x^{(\max -\min )}(z)\) and \(\widehat{F}_x^{(\min -\max )}(z)\) are both non-increasing in x as the maxima are taken over non-increasing classes indexed by x and the minima over non-decreasing classes. Furthermore, for fixed \(x\in D\), the mappings \(z\mapsto \widehat{F}_x^{(\max -\min )}(z)\) and \(z\mapsto \widehat{F}_x^{(\min -\max )}(z)\) are non-decreasing, which follows from the isotonicity of the functions \(z\mapsto {\mathbb {1}}(X_t\le z, X_{t-1}\in [u,v])\). Finally, if \(X_{[1]},\ldots ,X_{[n]}\) is an enumeration of the values in \(\{X_1,\ldots ,X_n\}\) in non-decreasing order, then it follows that, again for fixed \(x\in D\), the mappings \(z\mapsto \widehat{F}_x^{(\max -\min )}(z)\) and \(z\mapsto \widehat{F}_x^{(\min -\max )}(z)\) are constant on the half-open intervals \([X_{[k]},X_{[k+1]})\) (\(k=1,\ldots ,n-1\)), and attain the respective values 0 and 1 on \((-\infty ,X_{[1]})\) and \([X_{[n]},\infty )\). Hence, these estimators are genuine probability distribution functions.
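For readers who wish to experiment, the following Python sketch evaluates a min–max antitonic estimate of \(F_x(z)\) by brute force. It is a naive \(O(n^2)\) implementation written from the min–max characterization discussed above, with \(u,v\) restricted to observed values of \(X_{t-1}\) and without the boundary modification of Deng and Zhang (2020); function and variable names are ours:

```python
def antitonic_cdf_estimate(X, x, z):
    """Min-max antitonic estimate of F_x(z) = P(X_t <= z | X_{t-1} = x).

    Naive evaluation of
        min over u <= x of  max over v >= x of
            #{t : X_t <= z, X_{t-1} in [u, v]} / #{t : X_{t-1} in [u, v]},
    with u, v ranging over the observed values of X_{t-1}.  Intended for
    x among the observed predecessor values; the Deng-Zhang boundary
    modification for other x is omitted in this sketch.
    """
    pred = sorted(set(X[:-1]))            # observed values of X_{t-1}
    pairs = list(zip(X[:-1], X[1:]))      # (X_{t-1}, X_t) transition pairs
    best = 1.0
    for u in (p for p in pred if p <= x):
        worst = 0.0
        for v in (p for p in pred if p >= x):
            hits = [xt for (xp, xt) in pairs if u <= xp <= v]
            worst = max(worst, sum(1 for xt in hits if xt <= z) / len(hits))
        best = min(best, worst)
    return best
```

Since the outer minimum runs over a growing index set and the inner maximum over a shrinking one as x increases, the returned values are automatically non-increasing in x.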
We choose as our estimator of \(F_x(z)\)
It follows that all of the above properties of \(\widehat{F}_x^{(\max -\min )}(z)\) and \(\widehat{F}_x^{(\min -\max )}(z)\) are inherited by \(\widehat{F}_x(z)\). Its performance is characterized by the following theorem.
Theorem 2.1
Suppose that (A1) and (A2) are fulfilled. Then
The rate of convergence \(n^{-1/3}\) is known to be optimal in related problems of estimating a monotone function on the basis of independent random variables; see e.g. Durot (2002, Theorem 1) and Zhang (2002, Theorem 2.3). We believe that this rate cannot be improved in our more delicate case of time series data. Note that Mösching and Dümbgen (2020) considered a nonparametric antitonic estimator of \(F_x\) in a regression context where the dependent variables, conditional on the regressors, are independent. They derived under additional Hölder conditions rates of uniform and pointwise convergence for this estimator.
Our approach to prove this result can be most easily explained if the distribution function \(F_X\) is continuous. We split the domain D into \(k_n=\lfloor n^{1/3}\rfloor \) intervals \(I_k=[x_{k-1},x_k)\), where \(x_0=-\infty \) if \(D={\mathbb R}\), \(x_0=0\) if \(D={\mathbb N}_0\) and, in both cases, \(x_k=F_X^{-1}(k/k_n)=\inf \{x:\, F_X(x)\ge k/k_n\}\) for \(k=1,\ldots ,k_n-1\), \(x_{k_n}=\infty \). (As usual, \(\lfloor a\rfloor \) denotes the largest integer less than or equal to a.) We can expect a favorable behavior of \(\widehat{F}_x(z)\) if \(N_k(\omega ):=\#\{t\le n:\, X_{t-1}(\omega )\in I_k\}\) is sufficiently large for all k. Let
It follows from Lemma 4.2 that \(P(A_n^c)=O(n^{-1/3})\). Since \(\int _D \big | \widehat{F}_x(z) \,-\, F_x(z) \big | \, dP_X(x)\le 1\) holds with probability 1, we obtain that
To estimate \(E\big [ \int _D \big (\widehat{F}_x(z) - F_x(z)\big )_+ \, dP_X(x) \; {\mathbb {1}}_{A_n}\big ]\) we proceed as follows. For \(x\in I_k\), \(k\in \{2,\ldots ,k_n\}\), we use the estimate
We obtain from Lemma 4.3 that
Since
we conclude that
Furthermore, the rough estimate
is obviously true, which leads to
We can prove
in complete analogy to (2.4). The result stated in Theorem 2.1 follows from (2.3)–(2.5). In the general case we have to take into account that the distribution function \(F_X\) is not necessarily continuous. This leads to a technically more involved proof which is presented in full detail in Sect. 4.
The following pictures give an impression of how the functions \(x\mapsto F_x(z)\) are approximated by \(\widehat{F}_x(z)\) for different values of z. We simulated a Poisson-INARCH process of order 1, where \(X_t\mid X_{t-1},X_{t-2},\ldots \sim \hbox {Poisson}\big ( f(X_{t-1}) \big )\) and \(f(x)=\min \big \{\alpha _0+\alpha _1 x, \beta \big \}\). The parameters \(\alpha _0\) and \(\alpha _1\) are chosen as 2.0 and 0.5, respectively, and the truncation constant \(\beta \) is set to 6.0. For a sample size \(n=1000\) and \(z=0,1,\ldots ,11\), the following pictures show \(F_x(z)\) (red lines) and a corresponding estimate \(\widehat{F}_x(z)\) (blue lines). These results are quite encouraging except for large values of x. We conjecture that this deficiency is caused by data sparsity in this region.
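The simulation behind these pictures can be sketched as follows; the random seed and the plain Knuth sampler are our own choices for reproducibility, not part of the original setup:

```python
import math
import random
from collections import Counter

random.seed(0)  # seed chosen for reproducibility of this sketch

def f(x, alpha0=2.0, alpha1=0.5, beta=6.0):
    """The truncated linear link used in the pictures."""
    return min(alpha0 + alpha1 * x, beta)

def poisson_sample(lam, rng=random):
    """Knuth's method; adequate for the small means occurring here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

# one path of the Poisson-INARCH(1) process of length n = 1000
n = 1000
X = [0]
for _ in range(n):
    X.append(poisson_sample(f(X[-1])))

# data sparsity at large x: how often was each predecessor value observed?
counts = Counter(X[:-1])
```

Inspecting `counts` shows that large predecessor values occur only rarely, which is consistent with the conjectured explanation for the poorer fit of \(\widehat{F}_x(z)\) at large x.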
3 A new bootstrap method for Markov chains
Our estimator \(\widehat{F}_x(z)\) can be used for bootstrapping Markov processes, and it is particularly suitable in case of Markov chains with a finite or countably infinite state space. In what follows we assume that \((X_t)_{t\in {\mathbb N}_0}\) is a stationary Markov chain which has a state space \(D\subseteq {\mathbb N}_0\). Bootstrap variates \(X_t^*\) are generated successively according to a slightly modified variant of our estimator \(\widehat{F}_x(z)\). To prove consistency, we retain our monotonicity condition (A1), however, we replace (A2) by the following stronger condition which ensures that both the original and the bootstrap process satisfy a useful mixing condition and possess respective stationary distributions.
(A3) There exist a finite set \(S=\big \{y\in D:\; y\le \bar{s} \hbox { and } P_X(\{y\})>0\big \}\), a probability measure Q on \(({\mathbb N}_0,2^{{\mathbb N}_0})\), and constants \(\delta >0\), \(\kappa >0\), \(\gamma >0\), and \(C<\infty \) such that
-
(i)
\(P(X_t\in S\mid X_{t-1}=x) \,\ge \, \delta \,>\, 0 \qquad \qquad \forall x\in {\mathbb N}_0\),
-
(ii)
\(P(X_t=y \mid X_{t-1}=x) \,\ge \, \kappa \cdot Q(\{y\}) \qquad \forall x\in S,\;\forall y\in {\mathbb N}_0\),
-
(iii)
\(P(X_t\ge x)\,\le \, C\, e^{-\gamma x}\qquad \forall x\in {\mathbb N}_0\).
(A3) (ii) means that the set S is a so-called small set and condition (A3) (i) ensures that this set can be reached from each point \(x\in {\mathbb N}_0\) with a probability not smaller than \(\delta \). It follows from these conditions that
Hence, Doeblin’s minorization condition is satisfied and it follows that the process \((X_t)_{t\in {\mathbb N}_0}\) has a unique stationary distribution \(P_X\), is geometrically ergodic, and is uniform (\(\phi \)-) mixing with exponentially decaying coefficients; see e.g. Theorem 1 in Doukhan (1994, Sec. 2.4, p. 88). In particular, a stationary version of the process satisfies (A2). Note that condition (A3) is satisfied e.g. by a Poisson-INARCH(1) process if the function f is bounded. While (i) and (ii) are obviously fulfilled, (iii) follows from the upper tail bound
which holds for \(Y\sim \hbox {Poisson}(\lambda )\); see Theorem 1 in Canonne (2017).
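Assuming the standard form of the bound from Canonne (2017), namely \(P(Y\ge \lambda +t)\le e^{-t^2/(2(\lambda +t))}\) for \(t>0\), the following sketch verifies it numerically for \(\lambda =6\), the bound \(\beta \) on f used in our example; this is the exponential tail decay required in (A3) (iii):

```python
import math

def poisson_sf(x, lam):
    """P(Y >= x) for Y ~ Poisson(lam), via the complementary series."""
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(x))

lam = 6.0  # the bound beta on f in the Poisson-INARCH example
for x in range(7, 40):
    t = x - lam
    # assumed Canonne (2017) form of the upper tail bound
    bound = math.exp(-t * t / (2.0 * (lam + t)))
    assert poisson_sf(x, lam) <= bound + 1e-12
```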
Before we fix the definition of our bootstrap process we check to what extent a process with transition distribution functions \(\widehat{F}_x(\cdot )\) satisfies a suitable variant of condition (A3). It follows from Theorem 2.1 that
if \(P_X(\{x\})>0\). This implies that
where e.g.
and \(\bar{y}\) such that \(Q(\{0,1,\ldots ,\bar{y}\})\ge 1/2\). Hence, a bootstrap process based on \(\widehat{F}_x(\cdot )\) satisfies a variant of (A3) (ii) with a probability tending to 1.
For a variant of (A3) (i) to hold, it is important that \(\inf \{\widehat{F}_x(\bar{s}):\, x\in {\mathbb N}_0\}>0\) is also satisfied with a probability tending to 1. This is not guaranteed to be true since the estimator \(\widehat{F}_x(\bar{s})\) may become unreliable if x gets large. Indeed, the natural lower estimate of \(\widehat{F}_x(\bar{s})\) is given by
However, the right-hand side of this inequality can get arbitrarily close to 0 if x is large since then \(P_X( [x,\infty ))\) gets small. It is actually a well-known shortcoming of nonparametric isotonic/antitonic estimators that they get unreliable near the ends of the domain of the explanatory variable. In view of this problem, we modify \(\widehat{F}_x(z)\) for large x. Let
Then \(\#\{t\le n:\, X_{t-1}\ge \widetilde{x}\}\ge n^{2/3}>\#\{t\le n:\, X_{t-1}>\widetilde{x}\}\). We define
In what follows we show that the modified estimator \(\widehat{\widehat{F}}_x(\cdot )\) actually satisfies a suitable variant of (A3). To take advantage of Lemma 4.2 we embed the random truncation point \(\widetilde{x}\) between two nonrandom points, \(\widetilde{x}_l\) and \(\widetilde{x}_u\). Let \(\widetilde{x}_l:=\sup \left\{ x:\, P_X\big ( [x,\infty ) \big )\ge 2n^{-1/3} \right\} \) and \(\widetilde{x}_u:=\sup \big \{ x:\, P_X\big ( [x,\infty ) \big )\ge (1/2)n^{-1/3} \big \}\). Then \(P_X\big ( [\widetilde{x}_l,\infty ) \big )\ge 2n^{-1/3}\ge P_X\big ( (\widetilde{x}_l,\infty ) \big )\) and \(P_X\big ( [\widetilde{x}_u,\infty ) \big )\ge (1/2)n^{-1/3}\ge P_X\big ( (\widetilde{x}_u,\infty ) \big )\). Since \(\widetilde{x}>\widetilde{x}_u\) implies that \(\#\big \{t\le n:\, X_{t-1}>\widetilde{x}_u\big \}\ge n^{2/3}\) we obtain by Lemma 4.2 that
On the other hand, if \(\widetilde{x}\le \widetilde{x}_u\), then
Therefore,
Furthermore, since \(\widetilde{x}<\widetilde{x}_l\) implies that \(\#\big \{t\le n:\, X_{t-1}\ge \widetilde{x}_l\big \}< n^{2/3}\) we obtain by Lemma 4.2 that
On the other hand, \(\widetilde{x}\ge \widetilde{x}_l\) yields that \(\widehat{\widehat{F}}_x(z)=\widehat{F}_x(z)\) for all \(x\le \widetilde{x}_l\). Hence, we obtain, for each \(z\in {\mathbb N}_0\),
Now we are in a position to define our resampling algorithm generating the bootstrap variates:
-
1.
Choose a starting value \(X_0^*\).
-
2.
For each \(t\in {\mathbb N}_0\), suppose that \(X_0^*,\ldots ,X_t^*\) have been generated already. Then \(X_{t+1}^*\) is generated such that it has, conditioned on \(X_0^*,\ldots ,X_t^*\) and conditioned on the original sample \(X_0,\ldots ,X_n\), a probability distribution function \(\widehat{\widehat{F}}_{X_t^*}(\cdot )\).
In what follows, the symbol \(P^*\) refers to the distribution of the bootstrap variables conditioned on the original sample, e.g. \(P^*\big (X_t^*\in A\big )=P\big (X_t^*\in A\mid X_0,\ldots ,X_n\big )\).
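The resampling step 2 amounts to inverse-CDF sampling on \({\mathbb N}_0\). A minimal Python sketch, with the estimated transition distribution function passed in as a user-supplied function `Fhat` (a placeholder for \(\widehat{\widehat{F}}\); the function names are ours):

```python
import random

def sample_from_cdf(F, u=None, rng=random):
    """Smallest y in N_0 with F(y) >= u (inverse-CDF sampling).

    F must be a non-decreasing function on N_0 with limit 1, e.g. an
    estimated transition distribution function with the conditioning
    value held fixed; otherwise the loop may not terminate.
    """
    if u is None:
        u = rng.random()
    y = 0
    while F(y) < u:
        y += 1
    return y

def bootstrap_path(Fhat, x0, length, rng=random):
    """Generate X_0^*, ..., X_length^* from transition CDFs Fhat(x, .)."""
    path = [x0]
    for _ in range(length):
        x = path[-1]
        path.append(sample_from_cdf(lambda y: Fhat(x, y), rng.random()))
    return path
```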
Let \(K_n=\log n/(3\gamma )\). Then (A3)(iii) implies that \(P(X_t> K_n)=O(n^{-1/3})\) and we obtain from (3.3) that
Note that, for all \(t\in {\mathbb N}\), \(X_t^*\) takes values in \(\{X_1,\ldots ,X_n\}\), i.e. \(X_t^*\) lies in the collection of x with \(P_X(\{x\})>0\). Hence, it follows from (3.1) and (3.2) that a process with transition distribution functions \(\widehat{\widehat{F}}_x(\cdot )\) satisfies the following Doeblin-type condition with a probability tending to 1:
This implies that, with a probability tending to 1, the bootstrap process is geometrically ergodic and has a unique stationary distribution \(P^*_{X^*}\).
For a successful application of the bootstrap approximation the following properties are vitally important: With a probability tending to 1, conditioned on \(X_0,\ldots ,X_n\),
-
(a)
the stationary distribution \(P^*_{X^*}\) converges to \(P_X\),
-
(b)
the finite-dimensional distributions of \((X_t^*)_{t}\) converge to those of \((X_t)_t\).
We show these two properties by a coupling of the original process and its bootstrap counterpart, i.e. we define versions \((\widetilde{X}_t)_{t\in {\mathbb N}_0}\) and \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) on a common probability space \((\widetilde{\Omega },\widetilde{\mathcal A},\widetilde{P})\) such that the corresponding random variables \(\widetilde{X}_t\) and \(\widetilde{X}_t^*\) are equal with a high probability. We use the technique of maximal coupling [see e.g. Theorem 5.2 in Chapter I in Lindvall (1992)] and define the transition probabilities \(\widetilde{\pi }\) driving the coupled process \(\big ( (\widetilde{X}_t,\widetilde{X}_t^*) \big )_{t\in {\mathbb N}_0}\) as follows. For \(x,y\in {\mathbb N}_0\), let \(\pi (x,y)=P(X_{t+1}=y\mid X_t=x)\) and \(\pi ^*(x,y)=P^*(X_{t+1}^*=y\mid X_t^*=x)\). Then
is the total variation distance between the distributions with respective probability mass functions \(\pi (x,\cdot )\) and \(\pi ^*(x^*,\cdot )\). Note that \(\delta _{x,x^*}=\sum _y [\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}] =\sum _y [\pi ^*(x^*,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}]\). The transition probabilities of the coupled process are defined as
and, for \(x,x^*,y,y^*\in {\mathbb N}_0\) such that \(y\ne y^*\),
The corresponding Markov kernel \(\widetilde{P}\) is defined as
Note that \([\pi (x,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\} ]\, [\pi ^*(x^*,y)-\min \{\pi (x,y),\pi ^*(x^*,y)\}]=0\) for all \(x,x^*,y\in {\mathbb N}_0\), which implies in case of \(\delta _{x,x^*}>0\) that
Therefore we obtain
and, likewise,
Moreover, we have that
Hence, the conditional probability that the two random variables at time \(t+1\) are equal is maximized which explains the usage of the term maximal coupling.
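A maximal coupling of two transition laws can be sampled directly from the decomposition above. The following Python sketch does this for probability mass functions on a finite support; the dicts `p` and `q` are stand-ins for \(\pi (x,\cdot )\) and \(\pi ^*(x^*,\cdot )\) restricted to finitely many states:

```python
import random

def draw(w, rng):
    """Sample from an unnormalized, non-negative weight dict."""
    u = rng.random() * sum(w.values())
    for y, wy in sorted(w.items()):
        if wy <= 0.0:
            continue
        u -= wy
        if u <= 0.0:
            return y
    return max(y for y, wy in w.items() if wy > 0.0)

def maximal_coupling(p, q, rng=random):
    """Draw (Y, Ystar) with Y ~ p, Ystar ~ q and P(Y = Ystar) = 1 - d_TV(p, q)."""
    support = set(p) | set(q)
    overlap = {y: min(p.get(y, 0.0), q.get(y, 0.0)) for y in support}
    delta = 1.0 - sum(overlap.values())      # total variation distance
    if rng.random() < 1.0 - delta:
        y = draw(overlap, rng)               # success: a common value
        return y, y
    # failure: independent draws from the normalized residuals
    rp = {y: p.get(y, 0.0) - overlap[y] for y in support}
    rq = {y: q.get(y, 0.0) - overlap[y] for y in support}
    return draw(rp, rng), draw(rq, rng)
```

Mixing the overlap part (probability \(1-\delta \)) with the residual part (probability \(\delta \)) reproduces the marginals p and q exactly, while the two coordinates agree with the maximal possible probability \(1-\delta \).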
The following theorem summarizes the results of our coupling approach. The stationary distribution \(P^*_{X^*}\) of the bootstrap process converges in total variation norm and in probability to that of the original process. With a probability tending to 1, the coupled process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is geometrically \(\phi \)-mixing. And finally, the corresponding values \(\widetilde{X}_t^0\) and \(\widetilde{X}_t^{*,0}\) of a stationary version of the coupled process coincide with a probability converging to 1.
Theorem 3.1
Suppose that (A1) and (A3) are fulfilled. Then
-
(i)
\(d_{TV}\big ( P^*_{X^*}, P_X \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2 \big )\).
-
(ii)
With a probability tending to 1, the process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is \(\phi \)-mixing with coefficients \(\phi _{\widetilde{X},\widetilde{X}^*}(k)\) decaying at a geometric rate.
-
(iii)
If \(\big ((\widetilde{X}_t^0, \widetilde{X}_t^{*,0})\big )_{t\in {\mathbb N}_0}\) is a stationary version of the coupled process, then
$$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_t^0 \ne \widetilde{X}_t^{*,0} \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} (\log n)^2 \big ). \end{aligned}$$
These general results can be used to prove bootstrap consistency for specific statistics. Suppose that \(X_0,\ldots ,X_n\) are observed and that (A1) and (A3) are fulfilled. To illustrate our advocated approach, we consider e.g. the parameter \(\theta :=P(X_{t-1}=x,X_t=y)\), which is consistently estimated by \(\widehat{\theta }_n=n^{-1}\sum _{t=1}^n {\mathbb {1}}(X_{t-1}=x,X_t=y)\). It follows from a central limit theorem for \(\phi \)-mixing processes (see e.g. Theorem 15.12 in Bradley (2007b)) that
where \(\sigma _\infty ^2=\sum _{k=-\infty }^\infty \mathop {\textrm{cov}}\nolimits \big ( {\mathbb {1}}(X_0=x,X_1=y), {\mathbb {1}}(X_{|k|}=x,X_{|k|+1}=y) \big )\). The distribution of \(S_n\) can be approximated by that of its bootstrap counterpart,
where \(\widehat{\theta }_n^*=n^{-1}\sum _{t=1}^n {\mathbb {1}}(X_{t-1}^*=x,X_t^*=y)\) and \(E^*\widehat{\theta }_n^*=E\big (\widehat{\theta }_n^*\,\big |\,X_0,\ldots ,X_n\big )\). In order to prove that the distribution of \(S_n^*\) converges in probability to the same limit as that of \(S_n\), we could use a central limit theorem for triangular arrays of \(\phi \)-mixing random variables. Alternatively, we can use our coupling results and obtain bootstrap consistency almost for free. It follows from (iii) of Theorem 3.1 that
Using this and a covariance inequality for \(\phi \)-mixing random variables [see e.g. Theorem 1.2.2.3 in Doukhan (1994, p. 9)] we obtain
which implies that
This implies that
If in addition \(\sigma _\infty ^2>0\), then we obtain by Lemma 2.11 of van der Vaart (1998) that
Hence, we can use bootstrap quantiles to construct confidence intervals for \(\theta \) such that their coverage probability converges to a prescribed level. Similar implications for other types of statistics are discussed in Leucht and Neumann (2013) and Neumann (2021).
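Given bootstrap replicates of \(\widehat{\theta }_n^*\), a confidence interval of the kind mentioned above can be formed from bootstrap quantiles. A minimal sketch of the basic (reflected) interval; the helper name and the simple quantile rule are our own choices:

```python
def percentile_ci(theta_hat, boot_stats, level=0.95):
    """Basic (reflected) bootstrap confidence interval for theta.

    boot_stats holds B bootstrap replicates of theta_hat computed from
    independently generated bootstrap paths; their quantiles approximate
    the sampling distribution of theta_hat, and reflecting them around
    theta_hat inverts that approximation.
    """
    s = sorted(boot_stats)
    alpha = 1.0 - level
    lo = s[round((alpha / 2.0) * (len(s) - 1))]
    hi = s[round((1.0 - alpha / 2.0) * (len(s) - 1))]
    return 2.0 * theta_hat - hi, 2.0 * theta_hat - lo
```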
Remark 1
In a similar context, Paparoditis and Politis (2002, Theorem 3.3) proved almost sure convergence of the bootstrap stationary distribution to the stationary distribution of the original process. Their method of proof is completely different from ours and employs classical tools from the theory of weak convergence such as Helly’s theorem and the “uniqueness trick” which uses the fact that each subsequence contains a further subsequence converging to the same probability measure. We use a more direct approach based on a coupling of the original and the bootstrap process. The additional benefit is that we obtain a rate of convergence rather than consistency only.
The following pictures give an impression of the effect of our coupling. As done for the pictures displayed in the previous section, we simulated a Poisson-INARCH process of order 1, where \(X_t\mid X_{t-1},X_{t-2},\ldots \sim \hbox {Poisson}\big ( f(X_{t-1}) \big )\) and \(f(x)=\min \big \{\alpha _0+\alpha _1 x, \beta \big \}\). The parameters \(\alpha _0\) and \(\alpha _1\) are chosen as 2.0 and 0.5, respectively, and the truncation constant \(\beta \) is set to 6.0. For respective sample sizes of \(n=200\) and \(n=1000\), Figs. 1 and 2 show one realization of independent and coupled versions of \(X_1,\ldots ,X_{50}\) and \(X_1^*,\ldots ,X_{50}^*\). While the pictures on the left of Figs. 1 and 2 let us at best hope for a similar behavior of the bootstrap and the original process, those on the right provide some evidence that the bootstrap process successfully mimics the behavior of the original process.
4 Proofs
4.1 Proofs of the main results
Proof of Theorem 2.1
Our strategy to prove this result is already sketched in Sect. 2, in the special case where the distribution function \(F_X\) associated to \(P_X\) is continuous. In the general case with a possibly discontinuous function \(F_X\), we have to take great care since we cannot split the domain D into intervals \(I_k\) such that \(P_X(I_k)=1/k_n\), where \(k_n=\lfloor n^{1/3}\rfloor \). It could be the case that \(P_X\) has masses considerably larger than \(1/k_n\) at single points which requires a modification of our previous approach.
To obtain an appropriate collection of intervals \(I_k\), we define again suitable grid points \(x_0,x_1,\ldots ,x_{K_n}\). For technical reasons we choose them as a decreasing sequence. We set \(x_0:=\infty \) and define recursively \(x_k:=\inf \{x:\, P_X((x,x_{k-1}))\le n^{-1/3}\}\) for \(k\ge 1\). This procedure terminates, for some \(K_n\), when \(x_{K_n}=0\) if \(D={\mathbb N}_0\) or when \(x_{K_n}=-\infty \) if \(D={\mathbb R}\). In both cases we have that \(D=[x_1,x_0)\cup \cdots \cup [x_{K_n},x_{K_n-1})\). For \(k=1,\ldots ,K_n-1\), i.e. with a possible exception of \(k=K_n\), we have
where the latter equality follows since the probability measure \(P_X\) is continuous from above. In the following we show that
To this end, we consider the contributions by \(E\big [ \int _{[x_k,x_{k-1})} (\widehat{F}_x(z)-F_x(z))^+\,dP_X(x)\big ]\) separately. We distinguish between three possible cases.
Case 1 If \(P_X\big ([x_k,x_{k-1})\big )\le 2n^{-1/3}\) and \(k<K_n\), then we use for all \(x\in [x_k,x_{k-1})\) in case of \(N_{n,k}:=\big \{t\le n:\, X_{t-1}\in [x_k,x_{k-1})\big \}\ne \emptyset \) the estimate
which leads to
Case 2 If \(P_X\big ([x_k,x_{k-1})\big )> 2n^{-1/3}\), then \(P_X\) has at \(x_k\) a point mass greater than \(n^{-1/3}\) and we argue differently. In this case, we use for all \(x\in (x_k,x_{k-1})\) in case of \(N_{n,k}:=\big \{t\le n:\, X_{t-1}=x_k\big \}\ne \emptyset \) the estimate
which implies
For \(x=x_k\), we use the simpler estimate
and we obtain
Case 3 If \(P_X([x_{K_n},x_{K_n-1}))\le 2n^{-1/3}\), then we can simply use the estimate
Finally, it follows from Lemma 4.2 that \(P\big ( \bigcup _k \{\omega :N_{n,k}(\omega )=0 \} \big )=O(n^{-1/3})\), which implies that
From (4.2a)–(4.2e) we obtain (4.1).
The term \(\int _D (\widehat{F}_x(z)-F_x(z))^-\,dP_X(x)\) can be estimated analogously, which completes the proof of the theorem. \(\square \)
Proof of Theorem 3.1
-
(i)
We construct a coupling of the original process and its bootstrap counterpart, where we use \(\widetilde{\pi }\big ((x,x^*),(y,y^*)\big )\) defined by (3.5a) and (3.5b) as transition probabilities and \(\widetilde{P}\) as transition kernel. The initial values are chosen such that \(\widetilde{X}_0=\widetilde{X}^*_0 \sim P_X\). Then, for each \(t\in {\mathbb N}_0\), conditioned on \((\widetilde{X}_t,\widetilde{X}^*_t)\), the next pair \((\widetilde{X}_{t+1},\widetilde{X}^*_{t+1})\) is generated according to \(\widetilde{P}\). It follows from (3.4) and (3.6) in particular that
$$\begin{aligned}{} & {} \widetilde{P}\big ( \widetilde{X}_{t+1} \ne \widetilde{X}^*_{t+1}, \, \widetilde{X}_t=\widetilde{X}^*_t \big )\\{} & {} \quad = \sum _{x\in {\mathbb N}_0} \widetilde{P}\big ( \widetilde{X}_{t+1}\ne \widetilde{X}_{t+1}^*\mid \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \, \widetilde{P}\big ( \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \\{} & {} \quad = \sum _{x\in {\mathbb N}_0} \delta _{x,x} \, \widetilde{P}\big ( \widetilde{X}_t=\widetilde{X}_t^*=x \big ) \\{} & {} \quad \le \frac{1}{2} \, \sum _x \sum _y \big | \pi (x,y) \,-\, \pi ^*(x,y) \big | \, P_X(\{x\})\\{} & {} \quad = O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ). \end{aligned}$$This implies first
$$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_1\ne \widetilde{X}_1^* \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ), \end{aligned}$$then
$$\begin{aligned}{} & {} \widetilde{P}\big ( \widetilde{X}_2\ne \widetilde{X}_2^* \big ) \,\le \, \widetilde{P}\big ( \widetilde{X}_2\ne \widetilde{X}_2^*,\, \widetilde{X}_1=\widetilde{X}_1^* \big ) \,+\, \widetilde{P}\big ( \widetilde{X}_1\ne \widetilde{X}_1^* \big )\\{} & {} \quad = O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \big ), \end{aligned}$$and after \(K_n\) such steps
$$\begin{aligned} d_{TV}\big ( P_X, \widetilde{P}^{\widetilde{X}_{K_n}^*} \big ) \,\le \, \widetilde{P}\big ( \widetilde{X}_{K_n}\ne \widetilde{X}_{K_n}^* \big ) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \, \log n \, K_n\big ). \end{aligned}$$On the other hand, \((X_t^*)_{t\in {\mathbb N}_0}\), and therefore \((\widetilde{X}_t^*)_{t\in {\mathbb N}_0}\) as well, are geometrically ergodic. Hence, for \(K_n=K \log n\) and K sufficiently large,
$$\begin{aligned} d_{TV}\left( \widetilde{P}^{\widetilde{X}_{K_n}^*}, P^*_{X^*} \right) \,=\, O_{\widetilde{P}}\big ( n^{-1/3} \big ), \end{aligned}$$which leads to
$$\begin{aligned} d_{TV}\big ( P_X, P^*_{X^*} \big )\le & {} d_{TV}\left( P_X, \widetilde{P}^{\widetilde{X}_{K_n}^*} \right) \,+\, d_{TV}\left( \widetilde{P}^{\widetilde{X}_{K_n}^*}, P^*_{X^*} \right) \\= & {} O_{\widetilde{P}}\big ( n^{-1/3} \, (\log n)^2 \big ). \end{aligned}$$
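The transition probabilities (3.5a) and (3.5b) realize the classical maximal coupling of \(\pi (x,\cdot )\) and \(\pi ^*(x^*,\cdot )\): with probability \(1-\delta _{x,x^*}\) both chains jump together according to the normalized overlap \(\pi \wedge \pi ^*\), and otherwise they jump independently according to the normalized residuals. A sketch for pmfs on a finite state space (our own minimal implementation, not the paper's notation):

```python
import random

def maximal_coupling_step(p, q, rng=None):
    """One coupled transition for two pmfs p, q on the states 0..m-1.
    The two draws disagree with probability equal to the total variation
    distance delta = 1 - sum_y min(p[y], q[y])."""
    rng = rng or random.Random()
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    delta = 1.0 - sum(overlap)
    if rng.random() < 1.0 - delta:
        # move together: draw from the normalized overlap
        y = rng.choices(range(len(p)), weights=overlap)[0]
        return y, y
    # move apart: draw independently from the normalized residuals
    y = rng.choices(range(len(p)), weights=[pi - oi for pi, oi in zip(p, overlap)])[0]
    ystar = rng.choices(range(len(q)), weights=[qi - oi for qi, oi in zip(q, overlap)])[0]
    return y, ystar
```

For identical pmfs \(\delta =0\) and the two chains never separate, which is exactly the mechanism behind the bound on \(\widetilde{P}\big ( \widetilde{X}_{t+1} \ne \widetilde{X}^*_{t+1}, \, \widetilde{X}_t=\widetilde{X}^*_t \big )\) above.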
(ii)
We couple the original and the bootstrap process according to (3.5a) and (3.5b) and show first that
$$\begin{aligned}{} & {} \widetilde{P} \big ( (\widetilde{X}_t,\widetilde{X}_t^*)\in S\times S \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big )\nonumber \\{} & {} \quad \,\ge \, P\big ( X_t\in S \,\big |\, X_{t-1}=x \big ) \cdot P^*\big ( X_t^*\in S \,\big |\, X_{t-1}^*=x^* \big ) \end{aligned}$$(4.3)holds for all \(x,x^*\in {\mathbb N}_0\). Let \(x,x^*\in {\mathbb N}_0\) be arbitrary. To simplify notation we set, for a generic set \(B\subseteq {\mathbb N}_0\), \(\pi (B)=\sum _{y\in B}\pi (x,y)\), \(\pi ^*(B)=\sum _{y\in B}\pi ^*(x^*,y)\), and \(\pi \wedge \pi ^*(B)=\sum _{y\in B}\pi (x,y)\wedge \pi ^*(x^*,y)\). If \(\pi \wedge \pi ^*(S)\ge \pi (S)\cdot \pi ^*(S)\), then (4.3) follows immediately. Suppose now the opposite, \(\pi \wedge \pi ^*(S)<\pi (S)\cdot \pi ^*(S)\). Then \(\delta _{x,x^*}>0\), and it follows from (3.5a) and (3.5b) that
$$\begin{aligned}{} & {} { \widetilde{P} \big ( (\widetilde{X}_t,\widetilde{X}_t^*)\in S\times S \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) } \\{} & {} \quad = \sum _{y\in S} \pi (x,y)\wedge \pi ^*(x^*,y) \\{} & {} \qquad +\, \sum _{y,y^*\in S} \frac{ \big ( \pi (x,y) - \pi (x,y)\wedge \pi ^*(x^*,y) \big ) \, \big ( \pi ^*(x^*,y^*) - \pi (x,y^*)\wedge \pi ^*(x^*,y^*) \big ) }{ \delta _{x,x^*} } \\{} & {} \quad = \pi \wedge \pi ^*(S) \,+\, \frac{ 1 }{ \delta _{x,x^*} } \big ( \pi (S) - \pi \wedge \pi ^*(S) \big ) \big ( \pi ^*(S) - \pi \wedge \pi ^*(S) \big ) \\{} & {} \quad = \pi (S) \cdot \pi ^*(S) \,+\, \frac{ 1 }{ \delta _{x,x^*} } \Big \{ \delta _{x,x^*} \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big )\\{} & {} \qquad +\, \big ( \pi (S) - \pi \wedge \pi ^*(S) \big ) \big ( \pi ^*(S) - \pi \wedge \pi ^*(S) \big ) \Big \}. \end{aligned}$$Since \(\delta _{x,x^*}\,=\,1-\pi \wedge \pi ^*({\mathbb N}_0)\,=\,\big (\pi (S)-\pi \wedge \pi ^*(S)\big )+\big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big )\), the term in curly braces is equal to
$$\begin{aligned}{} & {} \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big )\\{} & {} \qquad \,+\, \big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big ) \\{} & {} \qquad {} \,+\, \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \big (\pi ^*(S)-\pi \wedge \pi ^*(S)\big ) \\{} & {} \quad = \big (\pi (S)-\pi \wedge \pi ^*(S)\big ) \, \pi ^*(S) \, \pi (S^c) \\{} & {} \qquad {} \,+\, \big (\pi (S^c)-\pi \wedge \pi ^*(S^c)\big ) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi (S) \pi ^*(S) \big ) \\{} & {} \quad = \pi (S^c) \, \big ( \pi \wedge \pi ^*(S) \,-\, \pi \wedge \pi ^*(S) \, \pi ^*(S) \big )\\{} & {} \qquad \,+\, \pi \wedge \pi ^*(S^c) \, \big ( \pi (S) \pi ^*(S) \,-\, \pi \wedge \pi ^*(S) \big ), \end{aligned}$$and is therefore non-negative. This proves (4.3). It follows from (3.4) that, for \(y,y^*\in S\) such that \(P_X(\{y^*\})>0\),
$$\begin{aligned} \widetilde{P}\big ( \widetilde{X}_{t+1}= & {} \widetilde{X}_{t+1}^*=z \,\big |\, \widetilde{X}_t=y, \widetilde{X}_t^*=y^* \big ) = \pi (y,z)\wedge \pi ^*(y^*,z) \nonumber \\\ge & {} \kappa \, Q\big ( \{z\} \big ) \,+\, O_P\big ( n^{-1/3} \, \log n \big ). \end{aligned}$$(4.4)We obtain from (4.3) and (4.4) that there exists some \(\kappa ^*>0\) such that
$$\begin{aligned}{} & {} { \widetilde{P}\big ( (\widetilde{X}_{t+1},\widetilde{X}_{t+1}^*)=(z,z) \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) } \nonumber \\{} & {} \quad \ge \sum _{y,y^*\in S} \widetilde{P}\big ( (\widetilde{X}_{t+1},\widetilde{X}_{t+1}^*)=(z,z) \,\big |\, \widetilde{X}_t=y, \widetilde{X}_t^*=y^* \big ) \,\nonumber \\{} & {} \qquad \widetilde{P}\big ( (\widetilde{X}_t,\widetilde{X}_t^*)=(y,y^*) \,\big |\, \widetilde{X}_{t-1}=x, \widetilde{X}_{t-1}^*=x^* \big ) \nonumber \\{} & {} \quad \ge \kappa ^* \end{aligned}$$(4.5)holds with a probability tending to 1. Hence, with a probability tending to 1, the coupled process is \(\phi \)-mixing with geometrically decaying coefficients.
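The consequence of the uniform bound (4.5) can be illustrated numerically: a Doeblin-type minorization with constant \(\kappa ^*\) forces the laws of the chain started from any two initial distributions to approach each other in total variation at least at the geometric rate \(1-\kappa ^*\). A toy check with a 2-state transition matrix chosen purely for illustration:

```python
def tv(p, q):
    """Total variation distance between two pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def step(mu, P):
    """One transition: push the distribution mu through the kernel P."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

P = [[0.5, 0.5], [0.2, 0.8]]                          # toy transition matrix
kappa = sum(min(P[0][j], P[1][j]) for j in range(2))  # Doeblin constant: 0.7
mu, nu = [1.0, 0.0], [0.0, 1.0]
for t in range(1, 6):
    mu, nu = step(mu, P), step(nu, P)
    assert tv(mu, nu) <= (1 - kappa) ** t + 1e-12     # geometric contraction
```

This is the elementary mechanism behind the \(\phi \)-mixing claim: once the coupled process admits a uniform lower bound on its transition probabilities, mixing coefficients decay geometrically.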
(iii)
According to (4.5), the coupled process \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) satisfies Doeblin’s condition which implies in particular that this process has a unique stationary distribution. Let \(\big ((\widetilde{X}_t^0,\widetilde{X}_t^{*,0})\big )_{t\in {\mathbb N}_0}\) be a stationary version of the coupled process. Since \(\big ((\widetilde{X}_t,\widetilde{X}_t^*)\big )_{t\in {\mathbb N}_0}\) is geometrically ergodic we obtain
\(\square \)
4.2 Some auxiliary lemmas
Lemma 4.1
Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) such that (A2) is fulfilled. For arbitrary \(I\subseteq D\), let
$$\begin{aligned} \eta _t \,:=\, \mathbb {1}\big ( X_{t-1}\in I \big ) \,-\, p_I. \end{aligned}$$
Then, for arbitrary \(\gamma <1\), there exists some \(C_\gamma <\infty \) such that
$$\begin{aligned} E\Big [ \Big ( \sum _{t=1}^n \eta _t \Big )^4 \Big ] \,\le \, C_\gamma \, \big ( (n\, p_I)^2 \,+\, n\, p_I^\gamma \big ), \end{aligned}$$
where \(p_I:=P(X_0\in I)\).
Proof
In view of \(E\big [\big (\sum _{t=1}^n \eta _t\big )^4\big ]=\sum _{s,t,u,v=1}^n E[\eta _s \eta _t \eta _u \eta _v]\) we first consider the terms \(E[\eta _s \eta _t \eta _u \eta _v]\). Let the indices be chronologically ordered, i.e. \(1\le s\le t\le u\le v\le n\). Then it follows from the Markov property that
Considering the remaining cases of \(s\le t\le u=v\), we make use of the following equalities.
(a) \(s=t=u=v\): Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, E\big [\eta _s^4\big ]\).
(b) \(s=t<u=v\): Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s^2, \eta _u^2) \,+\, E\big [\eta _s^2\big ]\,E\big [\eta _u^2\big ]\).
(c) \(s<t\le u=v\): Then \(E[\eta _s \eta _t \eta _u \eta _v] \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s, \eta _t \eta _u^2) \,=\, \mathop {\textrm{cov}}\nolimits (\eta _s \eta _t, \eta _u^2)\).
For \(s<u\), there exist \({4 \atopwithdelims ()2}=6\) quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=t_j=s\) for some \(i\ne j\), and \(t_k=t_l=u\) for some \(k\ne l\). For \(s<t<u\), there exist \(4\cdot 3=12\) quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=s\), \(t_j=t\) and \(t_k=t_l=u\) for some \(i,j,k,l\) with \(k\ne l\). Finally, for \(s<t=u\), there exist 4 quadruples \((t_1,t_2,t_3,t_4)\) such that \(t_i=s\) for some \(i\) and \(t_j=u\) for all \(j\ne i\). Therefore we obtain
where
To estimate the last two terms on the right-hand side of (4.6) we use a well-known covariance inequality for \(\alpha \)-mixing random variables,
where \(\alpha ,\beta \in (1,\infty )\) are such that \(1/\alpha +1/\beta <1\) and \(\Vert X\Vert _\alpha <\infty \), \(\Vert Y\Vert _\beta <\infty \); see e.g. Bradley (2007a, Corollary 10.16). Choosing \(\alpha =\beta =2/\gamma \) and taking into account that \(|\eta _s|\le 1\) and \(E|\eta _s|\le p_I\) we obtain that
as well as
Using \(\#{\mathcal T}_{n,r}^{(1)}\le n(r+1)\) and \(\#{\mathcal T}_{n,r}^{(2)}\le nr\) we obtain from (4.6)
which completes the proof. \(\square \)
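The multiplicities 6, 12 and 4 in the counting argument of the proof are simply the numbers of distinct orderings of the index multisets \(\{s,s,u,u\}\), \(\{s,t,u,u\}\) and \(\{s,u,u,u\}\); they can be double-checked by brute force:

```python
from itertools import permutations

def count_orderings(values):
    """Number of distinct quadruples (t1, t2, t3, t4) realizing the given multiset."""
    return len(set(permutations(values)))

assert count_orderings(('s', 's', 'u', 'u')) == 6    # binomial(4, 2)
assert count_orderings(('s', 't', 'u', 'u')) == 12   # 4 * 3
assert count_orderings(('s', 'u', 'u', 'u')) == 4
```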
Lemma 4.2
Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) and stationary distribution \(P_X\) such that (A2) is fulfilled. For arbitrary \(I\subseteq D\), let \(N_n(I):=\#\{t\le n:\, X_{t-1}\in I\}\). Then, for arbitrary \(\delta >0\), \(\kappa <\infty \), and any \(I\) with \(P_X(I)\ge n^{\delta -1}\),
Proof
Let \(q\in 2{\mathbb N}\) and \(\epsilon >0\). Since
it follows from an extension of Rosenthal’s inequality (see e.g. Theorem 2 in Section 1.4.1 in Doukhan (1994)) that
Choosing \(\epsilon >0\) small enough we have that \(n\,P_X(I)^{2-2/(2+\epsilon )}\ge n^{\delta '}\) for some \(\delta '>0\). Therefore we obtain from Markov’s inequality that
if q is chosen sufficiently large. \(\square \)
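The final step of the proof, Markov's inequality with a high moment, is elementary but worth spelling out: a moment bound of order \(q\) turns into a polynomial tail bound whose exponent grows with \(q\), so any prescribed level \(n^{-\kappa }\) can be reached by enlarging \(q\). A generic numeric sketch (the constants are invented for illustration, not taken from the paper):

```python
import math

def markov_tail(moment_q, q, lam):
    """Markov's inequality: P(|Z| >= lam) <= E|Z|^q / lam^q."""
    return moment_q / lam ** q

# with a Rosenthal-type bound E|Z_n|^q <= (C n p)^(q/2), the tail at level
# lam = n p is at most (C / (n p))^(q/2): enlarging q beats any power n^-kappa
n, p, C = 10 ** 6, 10 ** -2, 2.0
for q in (2, 4, 8):
    bound = markov_tail((C * n * p) ** (q / 2), q, n * p)
    assert math.isclose(bound, (C / (n * p)) ** (q / 2), rel_tol=1e-9)
```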
Lemma 4.3
Suppose that \((X_t)_{t\in {\mathbb N}_0}\) is a Markov chain with state space \(D\subseteq {\mathbb R}\) and stationary distribution \(P_X\) such that (A2) is fulfilled. Then there exists some \(C<\infty \) such that, for arbitrary \(\underline{x}\le \overline{x}\) with \(P_X([\underline{x},\overline{x}])\ge n^{-1/3}\),
and
Proof
We prove only (4.8a) since the proof of (4.8b) is completely analogous. The proof is carried out in two steps. First we consider the technically simpler case where the distribution function \(F_X\) is continuous. This allows us to define a suitable dyadic family of intervals which leads to a readily comprehensible proof. Afterwards we extend the result to the general case.
Step 1 Suppose that \(F_X\) is continuous. First we prove that for arbitrary \(\delta >0\) and each \(v\ge \overline{x}\) there exists some \(C<\infty \) such that
To deal with the supremum we define a suitable system of dyadic intervals. Let \(J_n\in {\mathbb N}\) be such that \(n^{\delta -1}/2< 2^{-J_n}P_X([\underline{x},v])\le n^{\delta -1}\). For \(j=1,2,\ldots ,J_n\) and \(k=1,2,\ldots ,2^j\), we set
and, for \(j=1,\ldots ,J_n\),
We have that
We define partial sums as
Choosing \(\gamma \) such that \((1-2\delta )/(1-\delta )\le \gamma <1\) we have that \(n(2^{-j}P_X([\underline{x},v]))^\gamma =O\big ( (n2^{-j}P_X([\underline{x},v]))^2 \big )\) for all \(j=1,\ldots ,J_n\). Hence, the first term in the bound given in Lemma 4.1 dominates the second and we obtain, for \(j=1,\ldots ,J_n,\; k=1,\ldots ,2^j\),
which implies that
Therefore, we obtain that
At the finest scale \(J_n\), we define, for \(k=1,\ldots ,2^{J_n}\),
Note that \(EN_{J_n,k}=n2^{-J_n}P_X([\underline{x},v])\le n^\delta \). We obtain from (4.7) that
holds for arbitrary \(\kappa <\infty \) if q is chosen large enough. Since \(2^{J_n}<n^{1-\delta }/(2 P_X([\underline{x},v]))\le n^{4/3-\delta }/2\) we obtain
After these preparatory steps we are in a position to estimate the expected value of the supremum. For arbitrary \(x\in [\underline{x},v]\), there exist \(p\) and \((j_1,k_1),\ldots ,(j_p,k_p),k\), \(1\le j_1<\cdots <j_p\le J_n\), such that \(B_{j_1,k_1},\ldots ,B_{j_p,k_p},B_{J_n,k}\) are adjacent intervals and
This implies that
and, therefore,
It follows from (4.12) and (4.13) that (4.9) is fulfilled.
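For a continuous \(F_X\), the dyadic cells \(B_{j,k}\) of equal stationary mass used in Step 1 can be generated from the quantile function \(F_X^{-1}\). A small sketch (the arguments `qf` and `F` stand for a user-supplied quantile and distribution function; the example uses the uniform distribution, chosen by us for illustration):

```python
def dyadic_cells(qf, F, lo, v, j):
    """Endpoints of the 2**j cells of [lo, v], each of P_X-mass 2**-j * P_X([lo, v]).
    F: continuous distribution function of P_X; qf: its quantile function."""
    mass = F(v) - F(lo)
    cuts = [qf(F(lo) + k * mass / 2 ** j) for k in range(2 ** j + 1)]
    return list(zip(cuts[:-1], cuts[1:]))

# uniform distribution on [0, 1]: both F and its quantile function are the identity
cells = dyadic_cells(lambda u: u, lambda x: x, 0.25, 0.75, 2)
```

Each level-\(j\) cell then carries mass exactly \(2^{-j}P_X([\underline{x},v])\), which is the property that fails for general \(P_X\) and motivates Step 2.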
Now we are in a position to prove (4.8a). We define a dyadic sequence of growing intervals, \(I_0=[\underline{x},\overline{x}]\) and, for \(j\ge 1\)
(There exists some \(K_n\ge 0\) such that \(P_X(I_j)=2^j P_X([\underline{x},\overline{x}])\) for \(j=0,\ldots ,K_n\) and \(P_X(I_{K_n+1})<2^{K_n+1}P_X([\underline{x},\overline{x}])\). Then \(I_{K_n+1}=I_{K_n+2}=\ldots \).) Define the event
For \(x\in I_{j+1}\setminus I_j\) we use the estimate
It follows from Lemma 4.2 that \(P\big ( A_n^c \big )=O\big ( n^{-1/3} \big )\), which implies that
i.e. (4.8a) is fulfilled.
Step 2 In the case of a general stationary distribution \(P_X\), the definition according to (4.10a) and (4.10b) no longer guarantees that the convenient property \(P_X( B_{j,k} )=2^{-j}P_X([\underline{x},v])\) holds true. In order to draw on the calculations in Step 1, we proceed as follows. Let \((V_t)_{t\in {\mathbb N}_0}\) be a sequence of independent random variables following a uniform distribution on [0, 1], which is independent of the process \((X_t)_{t\in {\mathbb N}_0}\). For the latter process we define an accompanying sequence \((U_t)_{t\in {\mathbb N}_0}\) of uniformly distributed random variables, where \(U_t\) depends on the pair \((X_t,V_t)\) as follows. If \(F_X\) is continuous at the point \(X_t\), then we simply set
Otherwise, if \(F_X\) is discontinuous in \(X_t\), then \(P_X(\{X_t\})=F_X(X_t)-F_X(X_t-0)>0\) and we set
In both cases we have that
where \(G^{-1}(t)=\inf \{x:G(x)\ge t\}\) denotes the generalized inverse of a generic distribution function G. Since \(F_X\) has at most countably many discontinuity points, it follows that the mapping \((X_t,V_t)\mapsto U_t\) is measurable. It also follows that \(U_t\) has a uniform distribution on [0, 1]. Furthermore the process \(\big ( (X_t,V_t) \big )_{t\in {\mathbb N}_0}\) has the same mixing properties as \(\big (X_t\big )_{t\in {\mathbb N}_0}\), i.e.
see e.g. Lemma 8 in Bradley (1981). Now we obtain in complete analogy to the calculations leading to (4.9) in Step 1 that, for arbitrary \(0\le \underline{u}<\overline{u}\le 1\),
It is easy to see that the following inclusions hold true for \(\underline{x}\le x\):
Indeed, the second inclusion follows immediately from the construction of \(U_{t-1}\). Regarding the first one, note that it follows again from the construction of \(U_{t-1}\) that \(F_X(\underline{x}-0) < U_{t-1}\) implies \(\underline{x}\le X_{t-1}\). Furthermore, \(U_{t-1}\le F_X(x)\) implies \(X_{t-1}=F_X^{-1}(U_{t-1})\le F_X^{-1}(F_X(x))\le x\). Since \(P(U_{t-1}=F_X(\underline{x}))=0\) we conclude that
holds with probability one. Hence, we obtain from (4.13)
i.e. (4.9) holds true. \(\square \)
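The construction of the accompanying uniforms \(U_t\) in Step 2 is the classical randomized probability integral transform: at an atom \(x\) of \(P_X\) one sets \(U=F_X(x-0)+V\,P_X(\{x\})\), so that \(U\) is uniform on [0, 1] and the generalized inverse recovers \(X_t=F_X^{-1}(U_t)\) almost surely. A sketch for a discrete toy distribution (support and weights invented for illustration):

```python
import bisect

support = [0, 1, 2]        # toy atoms
pmf = [0.2, 0.5, 0.3]
cdf = [0.2, 0.7, 1.0]

def randomized_pit(x, v):
    """U = F_X(x - 0) + v * P_X({x}) with v uniform on [0, 1]."""
    i = support.index(x)
    return (cdf[i] - pmf[i]) + v * pmf[i]

def quantile(u):
    """Generalized inverse F_X^{-1}(u) = inf{x : F_X(x) >= u}."""
    return support[bisect.bisect_left(cdf, u)]
```

For every \(v\in (0,1)\) the value \(U\) falls strictly between \(F_X(x-0)\) and \(F_X(x)\), so `quantile(randomized_pit(x, v))` returns `x`, mirroring the almost-sure identity \(X_t=F_X^{-1}(U_t)\) used in the proof.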
References
Al-Osh M, Alzaid A (1987) First-order integer-valued autoregressive (INAR(1)) processes. J Time Ser Anal 8(3):261–275
Ayer M, Brunk HD, Ewing GM, Reid WT, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Stat 26(4):641–647
Bickel PJ, Freedman DA (1981) Some asymptotic theory for the bootstrap. Ann Stat 9(6):1196–1217
Bradley RC (1981) Central limit theorems under weak dependence. J Multivar Anal 11:1–16
Bradley RC (2007a) Introduction to strong mixing conditions, vol I. Kendrick Press, Heber City
Bradley RC (2007b) Introduction to strong mixing conditions, vol II. Kendrick Press, Heber City
Brunk HD (1955) Maximum likelihood estimates of monotone parameters. Ann Math Stat 26(4):607–616
Canonne CL (2017) A short note on Poisson tail bounds. http://www.cs.columbia.edu/ccanonne/files/misc/2017-poissonconcentration.pdf. Accessed 20 April 2022
Dehling H, Mikosch T (1994) Random quadratic forms and the bootstrap for \(U\)-statistics. J Multivar Anal 51:392–413
Deng H, Zhang C-H (2020) Isotonic regression in multi-dimensional spaces and graphs. Ann Stat 48(6):3672–3698
Doukhan P (1994) Mixing: properties and examples. Lecture notes in statistics, vol 85. Springer, Berlin
Durot C (2002) Sharp asymptotics for isotonic regression. Probab Theory Relat Fields 122:222–240
Freedman DA (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228
Leucht A, Neumann MH (2009) Consistency of general bootstrap methods for degenerate U-and V-type statistics. J Multivar Anal 100:1622–1633
Leucht A, Neumann MH (2013) Dependent wild bootstrap for degenerate \(U\)- and \(V\)-statistics. J Multivar Anal 117:257–280
Leucht A, Neumann MH, Kreiss J-P (2015) A model specification test for GARCH(1,1) processes. Scand J Stat 42:1167–1193
Lindvall T (1992) Lectures on the coupling method. Wiley, New York
McKenzie E (1985) Some simple models for discrete variate time series. Water Resour Bull 21(4):645–650
Mösching A, Dümbgen L (2020) Monotone least squares and isotonic quantiles. Electron J Stat 14:24–49
Neumann MH (2021) Bootstrap for integer-valued GARCH(\(p\),\(q\)) processes. Stat Neerl 75(3):343–363
Pakes AG (1971) Branching processes with immigration. J Appl Probab 8(1):32–42
Paparoditis E, Politis DN (2002) The local bootstrap for Markov processes. J Stat Plan Inference 108:301–328
Rajarshi MB (1990) Bootstrap in Markov-sequences based on estimates of transition density. Ann Inst Stat Math 42:253–268
Robertson T, Wright FT, Dykstra RL (1988) Order restricted statistical inference. Wiley, New York
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Zhang C-H (2002) Risk bounds in isotonic regression. Ann Stat 30(2):528–555
Funding
Open Access funding enabled and organized by Projekt DEAL.
Neumann, M.H. Estimation and bootstrap for stochastically monotone Markov processes. Metrika 87, 31–59 (2024). https://doi.org/10.1007/s00184-023-00903-7