1 Introduction

This paper studies Markov chains defined on the discrete state space \(\mathbb {S} = \{1,\ldots ,S\}\), with \( S \in \mathbb {N}\). For ease of presentation we will consider the case of aperiodic Markov chains. The extension of our results to the case of periodic chains is postponed to the “Appendix”. The steady-state behavior of an aperiodic Markov chain P is characterized through its ergodic projector \( \varPi _P\), where

$$\begin{aligned} \varPi _P = \lim _{ n \rightarrow \infty } P^n , \end{aligned}$$

see [17, 23]. Computing \( \varPi _P \) in the above way, i.e., by taking powers of P, is known as the power method (PM). Any finite aperiodic Markov chain P is geometrically ergodic, i.e., there exist finite numbers r, c and \(\beta \in (0,1)\) such that

$$\begin{aligned} \forall n \ge r : \quad \Vert P^n-\varPi _{P}\Vert \le c\beta ^n , \end{aligned}$$

where \( || \cdot || \) denotes the maximum absolute row sum norm (note that any matrix norm may be used), \( \beta \) is called the rate and r is called the transient phase, see, for example, [14, 17]. Geometric ergodicity implies that PM enjoys a geometric rate of convergence once the powers exceed r.

The main advantages of the power method are that it is easy to implement and that it requires no further information on P. In addition, PM can be efficiently implemented for large sparse matrices, which is the main reason why PM is used for the acclaimed Google PageRank algorithm introduced by Brin and Page [7]; for more detail see [4, 9, 20]. PM has two main versions. In the vector-updating version of PM, one computes \( \mu P^n \) for a given vector \( \mu \). Vector-updating applies in case a given Markov chain \({\hat{P}} \) with known stationary distribution \( \pi _{ {\hat{P}} } \) is updated (due to a change in the underlying hyperlink structure of the network) to a new Markov chain P. Then, \( \pi _{{\hat{P}} } P^n \) typically converges faster towards \( \pi _{P}\) than an iteration started from an arbitrary initial vector. An advantage of vector updating is that it only requires vector-matrix multiplications. The downside of this approach is that one cannot change the initial vector without a complete recalculation. The matrix-updating version directly computes \( P^n\) in order to approximate \( \varPi _P\). The advantage of the matrix-updating PM is that by squaring a matrix power \( P^n \), i.e., going from \( P^n \) to \( ( P^n )^2 = P^{2 n }\), high powers of P can be computed relatively easily. Indeed, computing \( P^n \) for n a power of 2 requires only \( \log _2 ( n ) \) matrix multiplications. Moreover, applying different initial vectors to \( \varPi _P\) allows one to model different initial distributions, which is of particular interest in the case of multi-chains; see the subsequent section for details. The downside is that even with the \( \log _2 ( n ) \) advantage, matrix updating may require a significant number of matrix multiplications and, as the power increases, these matrices are no longer sparse. In case P is periodic, neither the vector-updating nor the matrix-updating version converges unless a convex combination of the original P with the identity matrix is used, which comes at the expense of reduced convergence speed.
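For concreteness, the two PM versions may be sketched as follows (in Python/NumPy, although the paper's own experiments use MATLAB); the function names and the choices of m and n are illustrative only.

```python
import numpy as np

def pm_matrix_updating(P, m):
    """Matrix-updating PM via repeated squaring: returns P^(2^m)
    using only m matrix-matrix multiplications."""
    Q = P.copy()
    for _ in range(m):
        Q = Q @ Q  # each squaring doubles the power of P
    return Q

def pm_vector_updating(mu, P, n):
    """Vector-updating PM: returns mu P^n via n vector-matrix products."""
    v = np.asarray(mu, dtype=float)
    for _ in range(n):
        v = v @ P
    return v
```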

In this paper, we mainly focus on matrix-updating PM, from now on simply referred to as PM. Iterative methods, such as PM, converge slowly in case the subdominant eigenvalue of P is close to 1, see [13, 15]. This typically happens if either the P-chain only jumps with small probability from the transient states to (one of) the ergodic class(es) or if P is nearly decomposable. Roughly speaking, an irreducible chain P is called nearly decomposable if the state-space can be divided into classes so that interactions between states within a class are relatively frequent compared to interactions between the classes (a formal definition will be provided later in the text). It can be shown that an irreducible Markov chain without transient states is nearly decomposable if and only if the subdominant eigenvalue is close to 1, see [11]. A famous example of a nearly decomposable Markov chain is the so-called Courtois matrix, a \( 6 \times 6 \) transition matrix for which PM requires \( n \approx 69{,}000\) in order to provide an approximation of \( \varPi _P \) that is correct in the first 6 digits, see [25]. In case the ergodic classes and the transient states are known, one may compute the ergodic projector directly by first computing the equilibrium distribution for each ergodic class, and then the long-term behavior of the transient states, see [5, 17] and the detailed discussion in Sect. 2. For a comprehensive overview of numerical methods for computing the ergodic projector of a finite Markov chain, we refer to [25].

Our research on Markov chains is stimulated by the growing interest in the analysis of social networks (where the Markov chain is used to model relationships among social agents, see [22]) and by the analysis of the world wide web, where, based on the (bored) random-surfer concept, the Markov chain models the probability of randomly going from one page to another, see [9, 20]. A key feature of these networks is that they are large and that neither their structure (transient states, ergodic classes) nor their balancedness (nearly decomposable or not) are known a priori. Other examples of these types of complex networks include telecommunication networks, cognitive and semantic networks and biological networks.

In this paper, we develop a novel approach for approximating the ergodic projector of an aperiodic Markov chain. We first establish a new representation of the ergodic projector of P by constructing an alternative Markov chain and then make this result useful for numerical computation. The starting point of our analysis is the known analytical relation

$$\begin{aligned} \varPi _{P } = \lim _{ \alpha \downarrow 0 } \alpha (I-(1-\alpha )P)^{-1}, \end{aligned}$$

see, for example, Theorem 1.5 in [16], where the term \( \alpha (I-(1-\alpha )P)^{-1} \) is recognizable as the resolvent kernel of P. See also [19] for applications of the resolvent kernel in stability theory of Markov chains. We call the transformation

$$\begin{aligned} H_{ \alpha } ( P ) = \alpha P (I-(1-\alpha )P)^{-1} , \end{aligned}$$

for \( \alpha \in ( 0 ,1 ]\), the modified resolvent kernel of P. In particular, the resolvent kernel is modified so that it allows for efficient numerical evaluation. To see this, note that for large P one solves a system of linear equations in which P appears directly as coefficient matrix, rather than computing the inverse explicitly. Letting \(X_\alpha \) be a geometrically distributed random variable with parameter \(\alpha \in (0,1)\), the modified resolvent kernel can be written as

$$\begin{aligned} H_{ \alpha } ( P ) = \mathbb {E} [P^{X_\alpha +1}] , \end{aligned}$$

since \(\Vert (1-\alpha )P\Vert <1\), which suffices to show that \( H_{ \alpha } ( P ) \) is again a Markov transition matrix with the same ergodic projector as P for any \( \alpha \in (0, 1) \), and since at \( \alpha =1 \) it holds \( H_{ \alpha } ( P ) =P\), the statement holds for \( \alpha =1 \) as well. In formula,

$$\begin{aligned} \varPi _P = \varPi _{ H_{ \alpha } ( P ) } = \lim _{ k \rightarrow \infty } \left( H_\alpha ( P) \right) ^k , \quad \alpha \in ( 0 ,1 ] . \end{aligned}$$
(1)
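As a small numerical illustration of (1) (a Python/NumPy sketch; the two-state chain below is a hypothetical example rather than one of the paper's instances), \(H_\alpha (P)\) can be obtained by solving the linear system \(X(I-(1-\alpha )P)=\alpha P\) instead of forming the inverse explicitly:

```python
import numpy as np

def modified_resolvent(P, alpha):
    """H_alpha(P) = alpha * P * (I - (1-alpha) P)^{-1}, computed by
    solving X (I - (1-alpha) P) = alpha P instead of inverting."""
    A = np.eye(P.shape[0]) - (1.0 - alpha) * P
    return np.linalg.solve(A.T, (alpha * P).T).T

# Hypothetical two-state chain with stationary distribution (2/3, 1/3).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
H = modified_resolvent(P, alpha=1e-8)
print(H)                              # rows are close to (2/3, 1/3)
print(np.linalg.matrix_power(H, 2))   # taking powers reduces the error further
```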

The main contribution of this paper is that we take (1) as starting point for developing a new numerical algorithm for computing \( \varPi _P\). As main technical result we will show that

$$\begin{aligned} \Vert \varPi _{P } - \left( H_\alpha ( P ) \right) ^k \Vert \le ( \alpha \gamma (P) )^k , \end{aligned}$$
(2)

where \( \gamma (P) \) is a finite (possibly large) constant depending on P, to be defined later in the text.

Letting \( \alpha < 1 / \gamma ( P ) \), the result put forward in (2) implies that the Markov kernel \(H_\alpha ( P )\) is geometrically ergodic with transient phase \( r=1 \) and rate \( \alpha \gamma (P) \). Put differently, the transformation \( P \mapsto H_\alpha ( P ) \) provides a jump start for PM as the desired contraction property is immediately effective. Moreover, we will show that iterating the transformation yields a geometric reduction in the geometric rate, so that, for example, \( H_\alpha ( H_ \alpha ( P ) ) \) has a rate that is proportional to \( \alpha ^2 \). The above theoretical results lead to a new numerical approach for approximately computing \( \varPi _P \), called jump start power method.

The main contributions of this paper are as follows.

  • The error of approximating \( \varPi _P\) by powers of the modified resolvent \( ( H_\alpha ( P ) )^k \) is of order \( ( \alpha \gamma (P))^k \). We use this fact to introduce the jump start power method (JSPM) that enjoys the robustness of PM but overcomes the numerical deficiency of PM. JSPM works well for multi-chains, nearly decomposable chains, and chains that jump with small probability from the transient states to (one of) the ergodic class(es).

  • An adapted version of JSPM is developed for large-scale Markov chains which utilizes the structure of the Markov chain and takes only ‘one jump’ towards \(\varPi _P\), i.e., \(k=1\) together with a carefully chosen \(\alpha \in (0,1)\).

  • An extensive numerical study is provided that corroborates the form of the analytical bound for the decay of the error and illustrates the numerical advantages of JSPM.

The article is organized as follows. Section 2 formally introduces Markov multi-chains, nearly decomposable Markov chains, and defines concepts used throughout the article. Section 3 presents the main technical results of the paper. Specifically, the approximate formula in (2) is derived. JSPM is presented in Sect. 4 together with a numerical study on the performance of the algorithm. The article concludes with a discussion of potential further research. The extension of our results to the case of periodic chains is presented in the “Appendix”.

2 A Brief Review of Markov Chains

This paper studies Markov chains \(\{X_t\}_{t = 0,1,\ldots }\) defined on the discrete state space \(\mathbb {S} = \{1,\ldots ,S\}\); see [17] for definitions. For ease of presentation we will first consider the case of aperiodic Markov chains. The extension of our results to the case of periodic chains is postponed to the “Appendix”. For the (i, j)-th element of P it holds that \(P(i,j) = \Pr (X_{t+1} = j|X_t = i)\), which is independent of t and of the past states, i.e., the probability distribution of the next state only depends on the current state. This leads to

$$\begin{aligned} (P^n)(i,j) = \Pr (X_n=j|X_0=i) ,\qquad \text{ for } \text{ all } (i,j) \in \mathbb {S} \times \mathbb {S}, \end{aligned}$$

which is the n-step transition probability of the Markov chain, where the transition matrix \(P^n\) is simply obtained by taking the n-th matrix power of P. Taking n to infinity leads to the (i, j)-th element of the ergodic projector, denoted by \(\varPi _P\) and defined by

$$\begin{aligned} \varPi _P(i,j) = \lim _{n\rightarrow \infty } (P^n)(i,j) ,\qquad \text{ for } \text{ all } (i,j) \in \mathbb {S} \times \mathbb {S} . \end{aligned}$$

Entry \(\varPi _P(i,j)\) represents the probability of the chain being in state j in the long-run when starting in state i. For more details we refer to [17].

In case the Markov chain has only one closed irreducible set of states, also called ergodic class, and a (possibly empty) set of transient states, it is called a Markov uni-chain (in short: uni-chain). For uni-chains it holds that the chain will eventually be trapped in the (unique) ergodic class, independent of the initial state. The unique distribution to which a uni-chain converges is described by the stationary distribution of P, denoted by \(\pi ^\top _P\), which can be found by solving \(\pi ^\top _P P = \pi ^\top _P\) together with the normalization that the entries of \(\pi _P\) sum to one. Since the stationary distribution is independent of the initial state, all rows of \(\varPi _P\) equal \(\pi ^\top _P\) in case P describes a Markov uni-chain.

Markov multi-chains (in short: multi-chains) have multiple ergodic classes and a (possibly empty) set of transient states. In contrast to uni-chains, for multi-chains the initial state has an impact on the resulting limiting distribution, which stems from the fact that once the chain enters one of the several ergodic classes it remains there permanently. First, one has to uncover the ergodic classes and the transient states using, for example, the algorithm in [8]. After a possible relabelling of states, the transition matrix and the ergodic projector of a multi-chain can be written in the following canonical forms, respectively,

$$\begin{aligned} P = \left[ \begin{array}{ccccc} P_1 &{} 0 &{} 0 &{} \cdots &{} 0 \\ 0 &{} P_2 &{} 0 &{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \ddots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} P_I &{} 0 \\ P_{T1} &{} P_{T2} &{} \cdots &{} P_{TI} &{} P_{TT} \end{array} \right] \quad \text{ and } \quad \varPi _P = \left[ \begin{array}{ccccc} \varPi _1 &{} 0 &{} 0 &{} \cdots &{} 0 \\ 0 &{} \varPi _2 &{} 0 &{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \ddots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} \varPi _I &{} 0 \\ R_{1} &{} R_{2} &{} \cdots &{} R_{I} &{} 0 \end{array} \right] , \end{aligned}$$

where I is the number of ergodic classes. For the i-th ergodic class, \(P_i\) gives the one-step transition probabilities between ergodic states from the i-th ergodic class and \(\varPi _i\) is a square matrix of which all rows equal the unique stationary distribution of the chain inside the i-th ergodic class. Specifically, all rows in \(\varPi _{P_i}\) equal \(\pi ^\top _{P_i}\), which is the unique probability vector satisfying \( \pi ^\top _{P_i} P_i = \pi ^\top _{P_i}\). Note that all diagonal values of \(\varPi _P\) corresponding to ergodic states are non-zero, contrary to the diagonal values corresponding to transient states, which are zero; an insight that will be elaborated in Sect. 4.3. Hence, whether state i is ergodic or transient can be concluded from the value of entry (i, i) of \( \varPi _P\). We call this criterion for ergodicity of a state the diagonal criterion.

Moreover, \(R_i (j,k)\) gives the equilibrium probability of ending in ergodic state k (which is part of the i-th ergodic class) when starting in transient state j. In order to calculate \(R_i\), define J as the number of transient states, \(I_T\) as the identity matrix of size J and Z(j, i) as the probability of ending in the i-th ergodic class when starting in transient state j. Note that Z is a \(J \times I\) matrix. It then holds that

$$\begin{aligned} Z = (I_T - P_{TT} ) ^{-1} \left[ P_{T1}e_1 \;\;\; P_{T2}e_2 \;\;\; \ldots \;\;\; P_{TI}e_I \right] , \end{aligned}$$

where \(e_i\) is a column vector of ones of size equal to the number of states in ergodic class i; see, e.g., [5]. Denote the i-th column of Z with \(Z(\bullet ,i)\), then it holds that \(R_i = Z(\bullet ,i) \pi ^\top _{P_i}\).
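Provided the canonical block structure is known, \(\varPi _P\) can be assembled directly from these formulas. The following Python/NumPy sketch does so under that assumption; the helper used for the stationary distributions and the data layout of the arguments are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def stationary_distribution(P_i):
    """Stationary distribution of one ergodic class: solve pi P_i = pi, sum(pi) = 1."""
    m = P_i.shape[0]
    A = np.vstack([(P_i - np.eye(m)).T, np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def ergodic_projector_canonical(P_blocks, P_T_blocks, P_TT):
    """P_blocks: ergodic-class blocks P_1..P_I in canonical order;
    P_T_blocks: blocks P_T1..P_TI; P_TT: transient-to-transient block."""
    pis = [stationary_distribution(P_i) for P_i in P_blocks]
    J = P_TT.shape[0]
    # Z(j, i): probability of absorption in class i when starting in transient state j.
    Z = np.linalg.solve(np.eye(J) - P_TT,
                        np.column_stack([PT @ np.ones(PT.shape[1]) for PT in P_T_blocks]))
    sizes = [P_i.shape[0] for P_i in P_blocks]
    S = sum(sizes) + J
    Pi = np.zeros((S, S))
    offset = 0
    for i, (m, pi) in enumerate(zip(sizes, pis)):
        Pi[offset:offset + m, offset:offset + m] = np.tile(pi, (m, 1))  # block Pi_i
        Pi[S - J:, offset:offset + m] = np.outer(Z[:, i], pi)           # block R_i
        offset += m
    return Pi
```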

In case there are multiple ergodic classes the stationary distribution fails to be unique. Indeed, any row of \(\varPi _P\) is a stationary distribution of the Markov chain. More specifically, denote the i-th row of \(\varPi _P\) by \(\varPi _P(i,\bullet )\), then it holds that \(\varPi _P(i,\bullet )\) is a probability distribution which satisfies \(\varPi _P(i,\bullet ) P=\varPi _P(i,\bullet )\). This implies that any convex combination of the rows is also a stationary distribution of the Markov chain, i.e., for \((\gamma _i)_{i \in \mathbb {S}}: \sum _{i = 1}^S \gamma _i = 1\) and \(\gamma _i\ge 0 \), for all \( i \in \mathbb {S}\), it holds that \(\sum _{i = 1}^S \gamma _i \varPi _P(i,\bullet )\) is a probability distribution which is invariant with respect to P. When an initial distribution \(\mu ^\top \) is considered, this convex combination is fixed (and given by \( \mu ^\top \)) meaning that there exists a unique stationary distribution for the chain started in \( \mu ^\top \) (describing the long-run behavior of the chain started in \( \mu ^\top \)), or, more formally, \(\mu ^\top \varPi _{P}\) is the unique stationary distribution satisfying \((\mu ^\top \varPi _P) P = (\mu ^\top \varPi _P)\) when starting in \(\mu ^\top \). Literature concerning Markov multi-chains includes Markov decision processes from [23], series expansion of Markov chains [2, 5, 6] and singular perturbation analysis [1, 12] where the underlying multi-chain structure is often known beforehand.

A Markov chain P is called nearly decomposable if P is irreducible and, after a possible relabeling of states, can be written as

$$\begin{aligned} P = \left[ \begin{array}{cccc} P_{11} &{} P_{12} &{} \cdots &{} P_{1k} \\ P_{21} &{} P_{22} &{} \ddots &{} \vdots \\ \vdots &{} \ddots &{} \ddots &{} P_{(k-1)k} \\ P_{k1} &{} \cdots &{} P_{k(k-1)} &{} P_{kk} \end{array} \right] , \end{aligned}$$

where the diagonal blocks \(P_{ii}\), \(i=1,2,\ldots ,k\), are square and have rows that sum up to \(1-\varepsilon \), with \(\varepsilon > 0\) small.

A Markov chain may belong to all of the above types simultaneously. For example, a multi-chain with transient states may have an ergodic class that for itself constitutes a nearly decomposable chain. Below we illustrate this by means of a simple Markov chain.

Example 1

Let \(p,q,r_1,r_2, r_3 \in ( 0 , 1 ) \) and define transition matrix P on state-space \( \{ 1 ,2 , 3 ,4 \} \) as

$$\begin{aligned} P = \left( \begin{array}{cccc} 1-p &{} p &{} 0 &{} 0 \\ q &{} 1-q &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ r_1 &{} r_2 &{} r_3 &{} 1-\sum _{i=1}^3 r_i\\ \end{array} \right) , \end{aligned}$$

where \(0 < \sum _{i=1}^3 r_i \le 1\).

Markov chain P is clearly a multi-chain with ergodic classes \( \{ 1 ,2 \} \) and \( \{3 \} \). State 4 is transient. If p and q are small, then the submatrix describing the transitions within the ergodic class \( \{ 1 , 2 \} \) becomes nearly decomposable. Similarly, when \(\sum _{i=1}^3 r_i\) is small, state 4 is only weakly connected to states \(\{1,2,3\}\).

The ergodic projector of P can be computed to be

$$\begin{aligned} \varPi _P = \left[ \begin{array}{cccc} {\frac{q}{p+q}}&{}{\frac{p}{p+q}}&{}0&{}0 \\ {\frac{q}{p+q}}&{}{\frac{p}{p+q}}&{}0&{}0 \\ 0&{}0&{}1&{}0\\ {\frac{ \left( r_1+r_2 \right) q}{ \left( \sum _{i=1}^3r_i \right) \left( p+q \right) }}&{}{\frac{ \left( r_1+r_2 \right) p}{ \left( \sum _{i=1}^3r_i \right) \left( p+q \right) }}&{}{ \frac{r_3}{\sum _{i=1}^3r_i}}&{}0\end{array} \right] . \end{aligned}$$

Note that when \(p=q=\sum _{i=1}^3 r_i=0\), we obtain \(P=I\) and \(\varPi _P=I\).
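A quick numerical check of this example (a Python/NumPy sketch; the parameter values are arbitrary choices satisfying the stated conditions): the repeated-squaring power method reproduces the ergodic projector given above.

```python
import numpy as np

p, q, r1, r2, r3 = 0.3, 0.2, 0.1, 0.1, 0.2
r = r1 + r2 + r3
P = np.array([[1 - p, p,     0.0, 0.0],
              [q,     1 - q, 0.0, 0.0],
              [0.0,   0.0,   1.0, 0.0],
              [r1,    r2,    r3,  1 - r]])

Pi_exact = np.array([[q/(p+q),             p/(p+q),             0.0,  0.0],
                     [q/(p+q),             p/(p+q),             0.0,  0.0],
                     [0.0,                 0.0,                 1.0,  0.0],
                     [(r1+r2)*q/(r*(p+q)), (r1+r2)*p/(r*(p+q)), r3/r, 0.0]])

Q = P.copy()
for _ in range(20):     # P^(2^20) via repeated squaring
    Q = Q @ Q
print(np.max(np.abs(Q - Pi_exact)))   # close to machine precision
```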

Recall that throughout this paper we let \( || \cdot || \) denote the maximum absolute row sum norm. For any finite-state, aperiodic Markov chain P, there exists a finite number r such that

$$\begin{aligned} \forall n \ge r : \quad \Vert P^n-\varPi _{P}\Vert \le c\beta ^n , \end{aligned}$$

where \(c=\sup _{l=0,1,\ldots ,r-1}\Vert P^l-\varPi _{P}\Vert <\infty \) and \(\beta \in (0,1)\); see [14] for details. This property is called geometric ergodicity and we will call r the transient phase and \( \beta \) the rate. For ease of reference, we call \( r , c , \beta \) the ergodicity parameters of P.

In this paper we will also study the impact of starting with a power \( P^q \) for the evaluation of \( \varPi _P\) and we introduce the following additional ergodicity parameters

$$\begin{aligned} \gamma ( P, q ) =\kappa (P,q)\phi (P,q) + \frac{c\beta ^{q(\phi (P,q)+1)}}{1-\beta ^q}, \end{aligned}$$
(3)

where

$$\begin{aligned} \kappa (P,q)=\sup _{n=q,q+1,\ldots ,q \phi (P,q)} \Vert P^n-\varPi _P \Vert , \end{aligned}$$

and

$$\begin{aligned} \phi (P,q) = \left\lceil \frac{\max \{r-q,0\}}{q} \right\rceil . \end{aligned}$$

For simplicity, we wrote in the introduction \(\gamma (P)\) instead of \(\gamma (P,1)\).
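For a small chain whose ergodicity parameters r, c, \(\beta \) and projector \(\varPi _P\) are available, these quantities can be evaluated as follows (a diagnostic Python/NumPy sketch under that assumption; in practice \(\gamma (P,q)\) is of course not available, as noted later in the text).

```python
import numpy as np

def max_abs_row_sum(M):
    """Maximum absolute row sum norm used throughout the paper."""
    return np.max(np.sum(np.abs(M), axis=1))

def phi(r, q):
    """phi(P, q) = ceil(max(r - q, 0) / q)."""
    return int(np.ceil(max(r - q, 0) / q))

def gamma(P, Pi_P, q, r, c, beta):
    """gamma(P, q) = kappa(P, q) * phi(P, q) + c * beta^(q (phi + 1)) / (1 - beta^q)."""
    ph = phi(r, q)
    tail = c * beta ** (q * (ph + 1)) / (1 - beta ** q)
    if ph == 0:
        return tail                      # the kappa * phi term vanishes
    kappa = max(max_abs_row_sum(np.linalg.matrix_power(P, n) - Pi_P)
                for n in range(q, q * ph + 1))
    return kappa * ph + tail
```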

3 Bounding the Approximation Error

The starting point of our analysis is the equality

$$\begin{aligned} \varPi _P = (1-\alpha ) \varPi _P + \alpha \varPi _P, \quad \text{ for } \alpha \in [0,1] . \end{aligned}$$

Adding and subtracting \(\alpha P^q\) to the right hand side gives

$$\begin{aligned} \varPi _P = \alpha P^q + (1-\alpha ) \varPi _P P^q + \alpha (\varPi _P - P^q), \end{aligned}$$

where we used for the second term on the right hand side that \(\varPi _P P = \varPi _P\). Inserting this last expression for \(\varPi _P\) N times into the first \(\varPi _P\) on the right hand side leads to

$$\begin{aligned} \varPi _P = \alpha P^q \sum _{n=0}^N ((1-\alpha )P^q)^n + \varPi _P ((1-\alpha ) P^q)^{N+1} + \alpha (\varPi _P - P^q)\sum _{n=0}^N ((1-\alpha )P^q)^n. \end{aligned}$$
(4)

For \(N , q \in \mathbb {N}\) and \(\alpha \in (0,1)\) let

$$\begin{aligned} G_\alpha ( N , P^q ) = \frac{\sum _{n=0}^N ((1-\alpha )P^q)^n}{1-(1-\alpha )^{N+1}} . \end{aligned}$$

Note that for \(\alpha \in (0,1)\) and \( q \ge 1 \) it holds that

$$\begin{aligned} G_\alpha ( P^q ) := \lim _{N\rightarrow \infty } G_\alpha ( N , P^q ) = \sum _{n=0}^\infty ((1-\alpha )P^q)^n = (I-(1-\alpha )P^q)^{-1} , \end{aligned}$$

where existence of the Neumann series is guaranteed since \(\Vert (1-\alpha )P^q\Vert <1\), for \(\alpha \in (0,1)\). Equation (4) can be rewritten in succession via (i) simplifying the second term, (ii) bringing the second term of the right hand side to the other side, (iii) dividing by \(1-(1-\alpha )^{N+1}\) and (iv) using the \(G_\alpha ( N , P^q ) \)-notation:

$$\begin{aligned} \varPi _P = \alpha P^q G_\alpha ( N , P^q ) + \alpha (\varPi _P - P^q) G_\alpha ( N , P^q ) . \end{aligned}$$
(5)

Remark 1

The (i, j)-th element of \(G_\alpha ( N , P^q ) \) gives the scaled \((1-\alpha )\)-discounted expected number of visits of the Markov chain with transition matrix \(P^q\) to the j-th state in the first \(N + 1\) discrete time steps (including the state i at time zero) when starting in state i. Intuitively, the discounting ensures that the weights of visits after many discrete time steps of the Markov chain with transition matrix \(P^q\) become smaller and smaller, ensuring existence of \(H_\alpha ( P^q ) \) since \(\Vert (1-\alpha )P^q\Vert <1\), for \(\alpha \in (0,1)\).

Post-multiplying Eq. (5) with

$$\begin{aligned} \left( \alpha P^q G_\alpha ( N , P^q ) \right) ^{k-1}, \quad \text{ for } k \in \mathbb {N} \end{aligned}$$

i.e., the \((k-1)\)-th power of the first term on the right hand side of (5), gives

$$\begin{aligned} \varPi _P = \left( \alpha P^q G_\alpha ( N , P^q ) \right) ^k + \alpha ^k (\varPi _P - P^{qk})\left( G_\alpha ( N , P^q ) \right) ^k, \end{aligned}$$
(6)

where we used that

$$\begin{aligned} \left( \alpha P^q G_\alpha ( N , P^q ) \right) ^{k-1} \varPi _P = \left( \frac{\alpha \sum _{n=0}^N (1-\alpha )^n}{1-(1-\alpha )^{N+1}} \right) ^{k-1} \varPi _P = \varPi _P. \end{aligned}$$

Taking the limit \(N \rightarrow \infty \) in (6) leads to

$$\begin{aligned} \varPi _P&= ( H_\alpha (P^q ) )^k + \alpha ^k (\varPi _P - P^{qk})\left( G_\alpha ( P^q ) \right) ^k, \end{aligned}$$

where we use the notation

$$\begin{aligned} H_\alpha ( P^q ) := \alpha P^q G_\alpha ( P^q ), \end{aligned}$$

which is the modified resolvent kernel of \(P^q\). Analogously to \(G_\alpha (\cdot )\), let in the following

$$\begin{aligned} H_\alpha (N,P^q) := \alpha P^q G_\alpha (N,P^q). \end{aligned}$$

Lemma 1

For \(k,q\in \mathbb {N}\), \(N \ge \phi (P,q)\), and \(\alpha \in (0,1)\) it holds that

$$\begin{aligned} \left\| \varPi _{P} - ( H_\alpha ( N, P^q ) )^k \right\| \le \left( \frac{\alpha \gamma (P,q)}{1-(1-\alpha )^{N+1}} \right) ^k. \end{aligned}$$

Proof

From (6) it follows that

$$\begin{aligned} \varPi _P - \left( H_\alpha ( N, P^q ) \right) ^k = \alpha ^k (\varPi _P - P^q) P^{q(k-1)} ( G_\alpha ( N , P^q ) )^{k}. \end{aligned}$$
(7)

Since \((\varPi _P-P^q)\varPi _P = 0\), it holds that

$$\begin{aligned} (\varPi _P - P^q) P^{q(k-1)} = (-1)^{k-1} (\varPi _P - P^q)^k, \end{aligned}$$

so that (7) can be written as

$$\begin{aligned} \varPi _P - ( H_\alpha ( N, P^q ) )^k&= \alpha ^k (-1)^{k-1} (\varPi _P - P^q)^k \left( G_\alpha ( N , P^q ) \right) ^{k} \nonumber \\&= -\left[ - \alpha (\varPi _P - P^q) \frac{\sum _{n=0}^N ((1-\alpha )P^q)^n}{1-(1-\alpha )^{N+1}} \right] ^k, \end{aligned}$$
(8)

where the definition of \(G_\alpha ( N , P^q ) \) is inserted in the last equation. Taking norms in Eq. (8), we get

$$\begin{aligned}&\Vert \varPi _P - ( H_\alpha ( N, P^q ) )^k \Vert \nonumber \\&\quad \le \left\| \alpha (\varPi _P - P^q) \frac{\sum _{n=0}^N ((1-\alpha )P^q)^n}{1-(1-\alpha )^{N+1}} \right\| ^k\end{aligned}$$
(9)
$$\begin{aligned}&\quad \le \left( \frac{\alpha \sum _{n=0}^N (1-\alpha )^n \left\| \varPi _P - P^{q(n+1)}\right\| }{1-(1-\alpha )^{N+1}} \right) ^k \nonumber \\&\quad = \left( \frac{1}{1-(1-\alpha )^{N+1}} \right) ^k \left\{ \alpha \sum _{n_1=0}^{\min \{N,\phi (P,q)-1\}} (1-\alpha )^{n_1} \left\| \varPi _P - P^{q(n_1+1)}\right\| \right. \nonumber \\&\qquad \left. +\,\alpha \sum _{n_2=\phi (P,q)}^N (1-\alpha )^{n_2} \left\| \varPi _P - P^{q(n_2+1)}\right\| \right\} ^k, \end{aligned}$$
(10)

where the summation is split at \(\phi (P,q)\) into two summations, one where geometric ergodicity does not apply and one where it does, respectively. Continuing calculations from (10) shows

$$\begin{aligned}&\Vert \varPi _P - ( H_\alpha ( N, P^q ) )^k\Vert \nonumber \\&\quad \le \left( \frac{1}{1-(1-\alpha )^{N+1}} \right) ^k \left\{ \sup _{n=0,1,\ldots ,\min \{N,\phi (P,q)-1\}} \left\| \varPi _P - P^{q(n+1)} \right\| \left[ 1-(1-\alpha )^{\min \{N+1,\phi (P,q)\}}\right] \right. \nonumber \\&\qquad \left. + \,\alpha c \beta ^q ((1-\alpha )\beta ^q)^{\phi (P,q)} \frac{1-((1-\alpha )\beta ^q)^{\max \{N-\phi (P,q)+1,0\}}}{1-(1-\alpha )\beta ^q}\right\} ^k. \end{aligned}$$
(11)

So we may conclude from (11) that for

  1. \(N \le \phi (P,q) - 1\) (geometric ergodicity does not apply):

    $$\begin{aligned} \Vert \varPi _P - ( H_\alpha ( N, P^q ) )^k\Vert \le \left( \sup _{n=0,1,\ldots ,N} \left\| \varPi _P - P^{q(n+1)} \right\| \right) ^k \end{aligned}$$
    (12)
  2. \(N \ge \phi (P,q)\) (geometric ergodicity applies): since

    $$\begin{aligned} 1-(1-\alpha )^{\phi (P,q)} \le \alpha \phi (P,q),\quad \text{ for } \alpha \in (0,1) \text{ and } \phi (P,q) = 0,1,\ldots , \end{aligned}$$

    it holds that

    $$\begin{aligned} \Vert \varPi _P - ( H_\alpha ( N, P^q ) )^k\Vert \le \left( \frac{\alpha \gamma (P,q)}{1-(1-\alpha )^{N+1}}\right) ^k, \end{aligned}$$

    where \(\gamma (P,q)\) is a finite constant defined in (3).

Note that it is necessary for the bound to be meaningful that \(N \ge \phi (P,q)\) so that the geometric ergodicity applies. \(\square \)

Remark 2

For notational ease, define the bound found in Lemma 1 in the case \(N \ge \phi (P,q)\) as \(f(\alpha ) = \left( \frac{\alpha \gamma (P,q)}{1-(1-\alpha )^{N+1}} \right) ^k\). It holds that

$$\begin{aligned} \lim _{\alpha \downarrow 0} f(\alpha ) = \left( \frac{\gamma (P,q)}{N+1} \right) ^k, \end{aligned}$$

and

$$\begin{aligned} \lim _{\alpha \uparrow 1} f(\alpha ) = \left( \gamma (P,q) \right) ^k , \end{aligned}$$

so that \(\lim _{\alpha \downarrow 0} f(\alpha ) < \lim _{\alpha \uparrow 1} f(\alpha )\) for \(k,q \in \mathbb {N}\) and \(N \ge \phi (P,q)\). Furthermore, since

$$\begin{aligned} \frac{d}{d\alpha } f(\alpha ) = k\alpha ^{k-1}\gamma (P,q)^k \frac{1-(1+\alpha N)(1-\alpha )^N}{[1-(1-\alpha )^{N+1}]^{k+1}} > 0 ,\quad \text{ for } \alpha \in (0,1), \end{aligned}$$

it holds that for any choice of k, q and \(N \ge \phi (P,q)\) it is optimal to choose \(\alpha \in (0,1)\) as small as possible.

The following theorem summarizes some properties of \(H_\alpha (P^q)\).

Theorem 1

It holds for \(k, q \ge 1\) and \( \alpha \in ( 0 ,1 ) \) that

$$\begin{aligned} \left\| \varPi _P - \left( H_\alpha ( P^q ) \right) ^k \right\| \le (\alpha \gamma (P,q))^k \end{aligned}$$
(13)

and

$$\begin{aligned} \varPi _{P } = \lim _{ \alpha \downarrow 0 } ( H_\alpha ( P^q ) )^k = \lim _{ q \rightarrow \infty } ( H_\alpha ( P^q ) )^k = \lim _{ k \rightarrow \infty } ( H_\alpha ( P^q ) )^k. \end{aligned}$$
(14)

Proof

Inequality (13) follows directly from Lemma 1 by letting \(N\rightarrow \infty \). The first two equalities from (14) follow from Inequality (13) and the third equality from (1). \(\square \)

Remark 3

Theorem 3 in the “Appendix” shows that the results put forward in Lemma 1 and Theorem 1 apply to periodic Markov chains with period d for \( q=1 \) when \(\gamma (P,1)\) is replaced by \({{\bar{\gamma }}}(P,d)\) defined in the “Appendix”.

The result put forward in Theorem 1 shows that for \( \alpha < 1 / \gamma ( P , q )\) it holds that the modified resolvent \( H_\alpha ( P^q ) \) is geometrically ergodic with rate \( \alpha \gamma ( P , q ) \), transient phase \( r =1 \), and ergodic projector \( \varPi _P\).
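The bound (13) can be checked empirically on a small chain as follows (a Python/NumPy sketch; `modified_resolvent` solves the linear system as discussed in the introduction, and the two-state chain and the values of \(\alpha \) and k are arbitrary illustrative choices).

```python
import numpy as np

def modified_resolvent(P, alpha):
    """H_alpha(P): solve X (I - (1-alpha) P) = alpha P."""
    A = np.eye(P.shape[0]) - (1.0 - alpha) * P
    return np.linalg.solve(A.T, (alpha * P).T).T

def norm_error(P, Pi_P, alpha, q=1, k=1):
    """||Pi_P - (H_alpha(P^q))^k|| in the maximum absolute row sum norm."""
    H = modified_resolvent(np.linalg.matrix_power(P, q), alpha)
    E = Pi_P - np.linalg.matrix_power(H, k)
    return np.max(np.sum(np.abs(E), axis=1))

# Two-state chain whose ergodic projector has rows (2/3, 1/3):
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Pi_P = np.array([[2/3, 1/3], [2/3, 1/3]])
for k in (1, 2, 3):
    print(k, norm_error(P, Pi_P, alpha=1e-3, k=k))  # decays geometrically in k
```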

As our numerical study in the second part of the paper shows, the modified resolvent is potentially more efficient than PM, which makes it, apart from the fact that it directly applies to multi-chains, an attractive alternative to PM. In the following, \(H_\alpha (P)\) is illustrated for Example 1.

Example 2

We revisit Example 1 where we assume that \(p,q,r_1,r_2\) and \(r_3\) are non-zero probabilities. For this chain \(H_\alpha ( P )\) can be computed explicitly as

$$\begin{aligned}&H_\alpha ( P )=\\&\quad \left[ \begin{array}{cccc} \frac{\alpha (1-p) + (1-\alpha )q}{\alpha +(1-\alpha )(p+q)} &{} \frac{p}{\alpha +(1-\alpha )(p+q)}&{}0&{}0\\ \frac{q}{\alpha +(1-\alpha )(p+q)} &{} \frac{\alpha (1-q) + (1-\alpha )p}{\alpha +(1-\alpha )(p+q)}&{}0&{}0\\ 0&{}0&{}1&{}0\\ \frac{\alpha r_1+(1-\alpha )q(r_1 + r_2)}{\left( \alpha + (1-\alpha )(p+q)\right) \left( \alpha + (1-\alpha )\left( \sum _{i=1}^3r_i \right) \right) } &{} \frac{\alpha r_2 + (1-\alpha )p(r_1+r_2)}{\left( \alpha + (1-\alpha )(p+q)\right) \left( \alpha + (1-\alpha )\left( \sum _{i=1}^3r_i \right) \right) } &{} \frac{r_3}{\alpha + (1-\alpha )\left( \sum _{i=1}^3r_i \right) } &{} \frac{\alpha \left( 1 - \sum _{i=1}^3r_i \right) }{\alpha + (1-\alpha )\left( \sum _{i=1}^3r_i \right) } \end{array} \right] . \end{aligned}$$

Hence, letting \( \alpha \) tend to zero yields element-wise convergence of \(H_\alpha ( P )\) to \(\varPi _P\), which is in accordance with Theorem 1. For example, the absolute error of the (1, 1)-th element equals

$$\begin{aligned} \frac{\alpha p |p + q - 1|}{(p+q)(\alpha + (1-\alpha ) (p+q))} \end{aligned}$$

so that the corresponding relative error is

$$\begin{aligned} \frac{\left| H_\alpha ( P )(1,1) - \varPi _P(1,1) \right| }{\varPi _P(1,1)} = \frac{\alpha p |p + q - 1|}{q(\alpha + (1-\alpha ) (p+q))}, \end{aligned}$$

where \(H_\alpha ( P )(i,j)\) indicates the (ij)-th element of \(H_\alpha ( P )\). It shows that the relative error of \(H_\alpha ( P )(1,1)\) can be bounded by the linear function \(\alpha c_1(p,q)\), where \(c_1(p,q) = p|p+q-1|/\min \{q,q(p+q)\}\) is a (pq)-dependent constant.

Furthermore, the asymptotic probabilities of going from one ergodic class to another (or to a transient state) are zero. This shows that \(H_\alpha ( P )\) uncovers the structure of the ergodic classes. In addition, the approximation assigns in general a positive mass to jumps from a transient state to itself, e.g.,

$$\begin{aligned} H_\alpha ( P )(4,4) = \frac{\alpha \left( 1 - \sum _{i=1}^3r_i \right) }{\alpha + (1-\alpha )\left( \sum _{i=1}^3r_i \right) } , \end{aligned}$$

which is strictly larger than zero if \(\sum _{i=1}^3r_i < 1\), while clearly \(\varPi _P(4,4) = 0\). Note that, besides the case \(\sum _{i=1}^3r_i = 1\), the approximation may give the wrong impression that a transient state, say i, is ergodic. However, when \(\alpha \) is chosen sufficiently small (for example, of the order \(10^{-8}\)), this entry becomes negligible and, together with the fact that there are no transitions from ergodic states towards i, it becomes apparent that i is transient.

In the following, we analyze the effect that taking a power of the modified resolvent has on the convergence. Denote \((H_\alpha ( P ) )^2 (1,1)\) as the (1, 1)-th element from \((H_\alpha ( P ) )^2 \). It then holds that

$$\begin{aligned} (H_\alpha ( P ) )^2 (1,1) = \frac{(\alpha (1-p) + (1-\alpha )q)^2+pq}{(\alpha +(1-\alpha )(p+q))^2}, \end{aligned}$$

the relative error of which can be computed to be equal to

$$\begin{aligned} \frac{\left| (H_\alpha ( P ) )^2 (1,1) - \varPi _P(1,1) \right| }{\varPi _P(1,1)} = \frac{\alpha ^2 \left| (p+q)\left( (1-p)^2+pq\right) - q \right| }{q(\alpha +(1-\alpha )(p+q))^2}, \end{aligned}$$

which can be bounded by the quadratic function \(\alpha ^2 c_2(p,q)\) where

$$\begin{aligned} c_2(p,q) = \frac{\left| (p+q)\left( (1-p)^2+pq\right) - q \right| }{q(\min \{1,p+q\})^2}. \end{aligned}$$

Note that when \(p+q = 1\) or \(p=q=0\) the relative errors of approximation \(H_\alpha ( P ) (1,1)\) and \((H_\alpha ( P ) )^2 (1,1)\) are zero. For \(p+q \in (0,2]\setminus \{1\}\) and \(\alpha \in (0,1)\) the relative error of \((H_\alpha ( P ) )^2 (1,1)\) is strictly smaller than that of \(H_\alpha ( P ) (1,1)\). Furthermore, comparing the relative error bounds \(\alpha c_1(p,q)\) and \(\alpha ^2 c_2(p,q)\) shows the quadratic improvement of \((H_\alpha (P))^2(1,1)\), which is in accordance with Theorem 1. This illustrates the improvement that can be achieved through the power k in the generalization. The other entries of \( H_\alpha ( P ) \) can be analyzed along the same lines.

In the following example, we discuss the convergence of \(H_\alpha (P)\) in case of a nearly decomposable Markov chain.

Example 3

In the light of nearly decomposable Markov chains it is interesting to see what happens in case \(p+q \downarrow 0\) for the Markov chain in Example 1, i.e., when p and q are both close to 0. L’Hôpital’s rule shows that the relative error of the (1, 1)-th element for \(p+q \downarrow 0\) in the limit equals

$$\begin{aligned} \frac{\alpha }{1-\alpha } \frac{p}{q}, \end{aligned}$$

see also the relative errors from Example 1. Similarly, the relative error of \(H_\alpha (P)(1,2)\) converges for \(p+q \downarrow 0\) towards

$$\begin{aligned} \frac{\alpha }{1-\alpha }. \end{aligned}$$

Both relative errors show that arbitrary accuracy can be achieved by using the modified resolvent with \( \alpha \) small even in case of nearly decomposable Markov chains.

Now consider the case where \(\sum _{i=1}^3 r_i = \varepsilon \), for \(\varepsilon > 0 \) small. In that case the Markov chain almost breaks up into 3 ergodic classes. For the (4, 4)-th element it holds that

$$\begin{aligned} H_\alpha ( P )(4,4) = \frac{\alpha (1-\varepsilon )}{\alpha + (1-\alpha )\varepsilon }, \end{aligned}$$

which equals the absolute error, since \(\varPi _{P}(4,4) = 0\). Choosing \(\alpha \) such that

$$\begin{aligned} \alpha < \frac{\varepsilon \delta }{1-\delta -(1-\delta )\varepsilon } \end{aligned}$$

leads to an absolute error smaller than \(\delta \), showing that arbitrarily accurate precision can be achieved with \(H_\alpha (P)\) even in case the Markov chain almost breaks up into 3 ergodic classes. Similarly, for the (4, 3)-th element of \(H_\alpha ( P )\) it can be shown that the relative error equals

$$\begin{aligned} \frac{\left| H_\alpha ( P )(4,3) - \varPi _P(4,3) \right| }{\varPi _P(4,3)} = 1 - \frac{\varepsilon }{\alpha + (1-\alpha )\varepsilon }, \end{aligned}$$

showing that in order to obtain a relative error smaller than \(\eta \) one should choose \(\alpha \le \varepsilon \eta \), which again shows that any accuracy can be achieved in theory. Note that \(H_\alpha ( P )(4,1)=H_\alpha ( P )(4,2)=H_\alpha ( P )(4,3)=0\) in case \(\sum _{i=1}^3 r_i = 0\), i.e., in case the Markov chain consists of three ergodic classes this is correctly detected.

As an alternative to PM applied to \( H_\alpha ( P ) \), one may compute the modified resolvent of \( H_\alpha ( P ) \). More specifically, one may recursively construct a sequence \( \{ H_\alpha ( P ; n ) : n \in \mathbb {N} \} \) of nested modified resolvents with \( H_\alpha ( P ; 0 ) = P \) and, for \( n \ge 1\),

$$\begin{aligned} H_\alpha ( P ; n ) = H_\alpha ( H_\alpha ( P ; n-1 ) ) . \end{aligned}$$
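A minimal Python/NumPy sketch of this nested construction (with `modified_resolvent` again computed via a linear solve; the depth n is a free parameter):

```python
import numpy as np

def modified_resolvent(P, alpha):
    """H_alpha(P): solve X (I - (1-alpha) P) = alpha P."""
    A = np.eye(P.shape[0]) - (1.0 - alpha) * P
    return np.linalg.solve(A.T, (alpha * P).T).T

def nested_resolvent(P, alpha, n):
    """H_alpha(P; n): apply the modified resolvent n times, with H_alpha(P; 0) = P."""
    H = P
    for _ in range(n):
        H = modified_resolvent(H, alpha)
    return H
```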

As the following theorem shows, the norm error of \(H_\alpha (P^q;n)\) can be bounded by a geometric function with power n and rate \(\alpha \).

Theorem 2

For \(\alpha \in (0,1)\) such that \(\alpha \gamma (P,q) < 1\) it holds that

$$\begin{aligned} \Vert \varPi _P - H_\alpha ( P^q ; n ) \Vert \le \frac{\gamma (P,q) \alpha ^n}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-1})}, \quad n\in \mathbb {N} , \end{aligned}$$

and

$$\begin{aligned} \lim _{ n \rightarrow \infty } H_\alpha ( P^q ; n ) = \varPi _P . \end{aligned}$$

Proof

Proof via mathematical induction. Because \(\alpha \gamma (P, q) < 1\) it is clear that the bound holds true for \(n=1\) via Theorem 1. Now suppose it holds true for general \(n - 1 \ge 1\), then

$$\begin{aligned} \Vert \varPi _P - H_\alpha ( P^q ; n ) \Vert&= \Vert \varPi _P - H_\alpha ( H_\alpha ( P^q ; n-1 ) ) \Vert \end{aligned}$$

since \(\Vert (1-\alpha ) H_\alpha ( P^q ; n-1 ) \Vert <1\) we can write out the inverse and bring \(\varPi _P\) inside the summation

$$\begin{aligned} = \left\| \alpha \sum _{l=0}^\infty (1-\alpha )^l (\varPi _P - ( H_\alpha ( P^q ; n -1) )^{l+1}) \right\| \end{aligned}$$

straightforward bounding

$$\begin{aligned} \le \alpha \sum _{l=0}^\infty (1-\alpha )^l \left\| \varPi _P - ( H_\alpha ( P^q ; n -1 ))^{l+1}\right\| \end{aligned}$$

since \(\varPi _P H_\alpha ( P^q ; n-1 ) = \varPi _P\)

$$\begin{aligned} \le \alpha \sum _{l=0}^\infty (1-\alpha )^l \left\| \varPi _P - H_\alpha ( P^q ; n-1 ) \right\| ^{l+1} \end{aligned}$$

filling in the induction hypothesis

$$\begin{aligned} =\frac{\gamma (P,q) \alpha ^{n}}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-2})} \sum _{l=0}^\infty \left( \frac{(1-\alpha )\gamma (P,q) \alpha ^{n-1}}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-2})}\right) ^l \end{aligned}$$

for all \(n - 1 \ge 1\) when \(\alpha \gamma (P,q) < 1\) it holds that \(\frac{(1-\alpha )\gamma (P,q) \alpha ^{n-1}}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-2})} < 1\) and thus

$$\begin{aligned} = \frac{\gamma (P,q) \alpha ^{n}}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-2})-(1-\alpha )\gamma (P,q) \alpha ^{n-1}} \end{aligned}$$

taking out \(\alpha \gamma (P,q)\) in the denominator gives

$$\begin{aligned} = \frac{\gamma (P,q) \alpha ^{n}}{1 - \alpha \gamma (P,q) (1-\alpha ^{n-1})} \end{aligned}$$

thereby showing that the bound holds for n, which ends the proof. \(\square \)

Corollary 1

For \(\alpha \in (0,1)\) such that \(\alpha \gamma (P,q) < 1\) it directly follows from the above theorem that

$$\begin{aligned} \Vert \varPi _P - H_\alpha ( P ; n ) \Vert \le \frac{\varepsilon }{1-\varepsilon } \alpha ^{n-1}, \quad n \ge 1 , \end{aligned}$$
(15)

when we define \(\varepsilon = \alpha \gamma (P, q)\). Furthermore, since \(H_\alpha (P;n) \varPi _P = \varPi _P\) it holds for \(k \ge 1\) that

$$\begin{aligned} \Vert \varPi _P - \left( H_\alpha ( P ; n )\right) ^k \Vert&= \Vert \left( \varPi _P - H_\alpha ( P ; n ) \right) ^k \Vert \\&\le \left( \Vert \varPi _P - H_\alpha ( P ; n ) \Vert \right) ^k \\&= \left( \frac{\varepsilon }{1-\varepsilon } \alpha ^{n-1} \right) ^k, \end{aligned}$$

where (15) is used in the last equality.

Remark 4

The results put forward in Theorem 2 and Corollary 1 apply for the case \( q=1 \) also to periodic Markov chains with period d when \(\gamma (P, 1)\) is replaced by \({{\bar{\gamma }}}(P,d)\). For details see the “Appendix”.

Theorem 2 shows that repeated application of the modified resolvent yields a geometric improvement of the rate of geometric ergodicity. Example 4 illustrates Theorem 2.

Example 4

We revisit the instance from Examples 1 and 2 where we assume that \(p,q,r_1,r_2\) and \(r_3\) are non-zero probabilities. For this chain \(H_\alpha ( P;2 )\) can be computed explicitly as

$$\begin{aligned}&H_\alpha ( P ;2)\\&\quad =\left[ \begin{array}{cccc} \frac{q+\alpha ^2(1-p-q)}{p+q+\alpha ^2(1-p-q)} &{}\frac{p}{p+q+\alpha ^2(1-p-q)}&{}0&{}0\\ \frac{q}{p+q+\alpha ^2(1-p-q)} &{} \frac{p+\alpha ^2(1-p-q)}{p+q+\alpha ^2(1-p-q)} &{}0&{}0\\ 0&{}0&{}1&{}0\\ \frac{(r_1+r_2)q(1-\alpha ^2) + r_1\alpha ^2}{\left( (p+q)(1-\alpha ^2) + \alpha ^2 \right) \left( (1-\alpha ^2) \sum _{i=1}^3r_i + \alpha ^2 \right) } &{} \frac{(r_1+r_2)p(1-\alpha ^2) + r_2\alpha ^2}{\left( (p+q)(1-\alpha ^2) + \alpha ^2 \right) \left( (1-\alpha ^2) \sum _{i=1}^3r_i + \alpha ^2 \right) } &{} \frac{r_3}{(1-\alpha ^2) \sum _{i=1}^3 r_i + \alpha ^2} &{} \frac{(1-\sum _{i=1}^3r_i)\alpha ^2}{(1-\alpha ^2) \sum _{i=1}^3r_i + \alpha ^2} \end{array} \right] . \end{aligned}$$

This shows that \(H_\alpha ( P ;2)\) converges with quadratic rate in terms of \(\alpha \) towards \(\varPi _P\). Consequently, a faster rate of convergence is achieved for \(H_\alpha ( P ;2)\) than for \(H_\alpha (P)\); compare with Example 2. More specifically, the relative error of the (1, 1)-th element equals

$$\begin{aligned} \frac{\left| H_\alpha ( P;2 )(1,1) - \varPi _P(1,1) \right| }{\varPi _P(1,1)} = \left| \frac{p(1-p-q)}{q(p+q) + \alpha ^2q(1-p-q)} \right| \alpha ^2, \end{aligned}$$

where the absolute-value term converges to the constant \(\frac{p|1-p-q|}{q(p+q)}\) for \(\alpha \) small. Similarly, the relative error of \(H_\alpha (P;3)(1,1)\) equals

$$\begin{aligned} \frac{\left| H_\alpha ( P;3 )(1,1) - \varPi _P(1,1) \right| }{\varPi _P(1,1)} = \left| \frac{p(1-p-q)}{q(p+q) + \alpha ^3q(1-p-q)} \right| \alpha ^3, \end{aligned}$$

showing that the relative error is approximately a factor \(\alpha \) smaller than that of \(H_\alpha ( P;2 )(1,1)\), in accordance with what can be expected from Theorem 2.

Unfortunately, as \( \gamma ( P , q ) \) is not available, it is neither clear what a good initial choice for \( \alpha \) is, nor when to terminate \((H_\alpha (P))^k\) or the repeated application of \( H_\alpha ( P ; n ) \). In the following, we will address these two issues in more detail.

Remark 5

In [24] the resolvent of a Markov chain P is defined as

$$\begin{aligned} R^\rho = ( \rho I + ( I - P ) )^{-1} = ( ( 1 + \rho ) I - P )^{-1} , \end{aligned}$$

for \( \rho > 0 \). Let \( \rho = \alpha /(1 - \alpha ) \), then

$$\begin{aligned} R^\rho = (1 - \alpha ) ( I - ( 1- \alpha ) P )^{-1} , \end{aligned}$$

which yields

$$\begin{aligned} G_\alpha ( P ) = \frac{1}{1 - \alpha } R^\rho . \end{aligned}$$

In [24], Puterman expands the resolvent into a Laurent series, yielding

$$\begin{aligned} \rho R^\rho = \varPi _P - \sum _{n=0}^\infty (- \rho D_P )^{n+1} , \end{aligned}$$

or, in our notation with \( \rho = \alpha /(1 - \alpha ) \),

$$\begin{aligned} \varPi _P - \alpha G_\alpha ( P ) = \sum _{n=0}^\infty \left( - \frac{\alpha }{1 - \alpha } D_P \right) ^{ n+1 } . \end{aligned}$$

Note that the series on the above right hand side involves the deviation matrix \(D_P=(I-P+\varPi _P)^{-1}-\varPi _P\) and that deriving efficient bounds for the norm of the deviation matrix is itself a demanding task in case of multi-chains. As the norm of the deviation matrix typically takes large values, the above series only converges for sufficiently small values of \( \alpha \).

4 The Jump Start Power Method

In the previous section, we have shown that going from P to the modified resolvent \( H_\alpha ( P) \) potentially yields a geometrically ergodic Markov chain with no transient phase (i.e., \( r =1 \)). In this section we show how this result can be made fruitful for numerical computations. In particular, Sect. 4.1 illustrates the modified resolvent theory through numerical experiments, and Sect. 4.2 develops a practical method that exploits this theory by introducing the jump start power method (JSPM) and provides numerical results. Lastly, Sect. 4.3 discusses and numerically illustrates the use of JSPM in case of large (sparse) systems.

4.1 Motivating Numerical Experiments

As a first step we analyze the effect of mapping P to \( H_\alpha ( P ) \) by comparing numerically \( P^n \) with \( H_\alpha ( P ) \). The considered instances cover a wide range of Markov chains; for an overview, we refer to Table 1. Each row in Table 1 corresponds to an instance defined by its transition matrix (Tr. Matrix). The instances are based on random graph models that capture key properties of real-life networks. The instances vary in terms of size S (given in the ’S’ column), structure (given in the ’Ergodic Structure’ column), connectivity (as an indication, column ’\(p^\star \)’ gives the smallest non-zero element of P), and parameters used for the random graph models (given in the ’Description’ column). The ergodic structure is denoted by \(([v_1, v_2, \ldots , v_I],T)\), where I is the number of ergodic classes, \(v_i\) the number of states in the i-th ergodic class and \(T=S - \sum _{i=1}^I v_i\) the number of transient states. In case of transient states, the corresponding part in P is randomly filled such that transient states most likely point towards each other and towards multiple ergodic classes (if possible). The description column gives the relevant reference of the instance together with parameters given in the same order as they appeared in the original reference, where altered labels are used in case of conflicting notation (e.g., if \( \alpha \) is used as a parameter in the original reference, we refer to this parameter as \( \beta \)). For the implementation of the different instances, the code provided in the MATLAB toolbox CONTEST [26] was used. The CONTEST toolbox generates symmetric adjacency matrices which are in many cases periodic. In order to obtain the corresponding transition matrix P, the rows are first normalized, ensuring that each row sums up to one. Afterwards, the transition matrix P is mixed with the identity matrix to achieve aperiodicity, which does not affect the ergodic behavior of the chain.
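The normalization and mixing step just described may look as follows (a Python/NumPy sketch; the paper's experiments use the MATLAB CONTEST toolbox, and the mixing weight of 1/2 is an illustrative choice, not necessarily the one used in the paper).

```python
import numpy as np

def adjacency_to_aperiodic_chain(A, mix=0.5):
    """Row-normalize an adjacency matrix (assumed to have no zero rows) and mix
    the result with the identity to remove periodicity; the mixing leaves the
    ergodic projector unchanged."""
    P = A / A.sum(axis=1, keepdims=True)
    return (1.0 - mix) * P + mix * np.eye(A.shape[0])
```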

Table 1 Instances used for numerical experiments

Comparison of \( P^n \) and \( H_\alpha (P)\) In the first numerical experiment, we compute for a series of Markov chains \( P_i \), with \( 1 \le i \le 4 \) in Table 1, the power \( n_\alpha (P) \) such that

$$\begin{aligned} \left| \Vert \varPi _P - H_\alpha (P) \Vert - \Vert \varPi _P - P^{n_\alpha (P)} \Vert \right| < 10^{-12}. \end{aligned}$$

In words, \( n_\alpha (P) \) is the power of P that is substituted by \( H_\alpha (P) \). Note that a power \(n_\alpha (P)\) can be obtained via \(\log _2 n_\alpha (P)\) matrix multiplications. The numerical results depicted in Fig. 1 show that the modified resolvent can replace PM approximations for large powers.

Fig. 1: Combinations of \(\alpha \) and \(n_\alpha (P_i)\) such that \(\left| \Vert \varPi _{P_i} - H_\alpha (P_i) \Vert - \Vert \varPi _{P_i} - (P_i)^{n_\alpha (P_i)} \Vert \right| < 10^{-12}\) for \(i=1,\,2,\,3,\,4\)

In particular, for the Courtois matrix \(P_1\), in order to achieve a norm error of approximately \(7.92\cdot 10^{-7}\) a power of \(2^{16}\) is needed, while the same norm error is obtained via the modified resolvent with \(\alpha \approx 10^{-10}\). For \(P_4\) the modified resolvent with \(\alpha \approx 10^{-11.18}\) leads to the same norm error (of approximately \(1.63 \cdot 10^{-5}\)) as PM with power 20655175 (\(\approx 2^{24.3}\)). As for computation times, experiments showed that PM \((P_4)^k\) with power \(k=20655175\) takes on average 73.12 seconds in a sparse matrix setting, whereas the modified resolvent \(H_{\alpha =10^{-11.18}}(P_4)\) takes on average 2.68 seconds, i.e., a difference of a factor 27.28 on average (the experiments were performed in MATLAB R2011b on a 64-bit Windows desktop PC with an Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz processor).

Length of Transient Phase for \( H_\alpha (P)\) Figure 2 illustrates the effect that powers of \(H_\alpha (P)\) have on the norm error for the Courtois matrix. Different values for \(\alpha \) are considered and for each \(\alpha \) the exponential decay location is determined, thereby identifying the length of the transient phase. Note that \(H_{\alpha =1}(P)\) equals P. A heuristic approach is used to find the exponential decay location: for each \(\alpha \), an exponential function is repeatedly fitted to the data until the coefficient of determination \(R^2\) is close enough to 1 (where \( R^2 =1 \) represents a perfect fit). After each fit that leads to an insufficient coefficient of determination, the dataset is reduced by increasing the value of the first considered power n, and the fitting is repeated. The found exponential decay locations (i.e., the smallest power in the dataset that led to an \(R^2\) sufficiently close to 1) are denoted by the large dots and labeled, where the labels correspond to the fitted functions given under the graph together with the \(R^2\) in parentheses behind each function.

The main observation from Fig. 2 is that the exponential decay locations shift to the left (and thereby the transient phase becomes smaller) for decreasing values of \(\alpha \).

Fig. 2: Development of norm errors for powers of the modified resolvent in case of the Courtois matrix

This phenomenon has been theoretically shown in the previous section. It is worth noting that for \( \alpha \le 10^{-3}\) there is no transient phase, i.e., \( r =1 \) in these cases. Furthermore, from the fitted functions it follows that smaller \(\alpha \) values lead to stronger norm error reduction for increasing powers.

The Nested Modified Resolvent \(H_\alpha (P;n)\) Similar to Fig. 2, Fig. 3 illustrates the effect of the nested modified resolvent \(H_\alpha (P;n)\) for varying n in case \(P = P_1\).

Fig. 3: Development of norm errors for the nested modified resolvent in case of the Courtois matrix

It shows that relatively large values for \(\alpha \) already lead to small norm errors after a few iterations. In particular, the fitted relation between the norm error of \(H_{\alpha =0.01}(P;n)\) and n is approximately \(4901 e^{-4.6n}\) whereas that of \((H_{\alpha =0.01}(P))^n\) and n is approximately \(e^{-0.0198n}\) (see also Fig. 2), showing that the effect of an increase in the number of iterations in the nested modified resolvent is far more effective than an increase in the power of the modified resolvent for the same \(\alpha \). It therefore illustrates the sharper bound found for the nested modified resolvent in comparison with powers of the modified resolvent.

4.2 Jump Start Power Method (JSPM)

In this section, we will develop a power-method like algorithm based on the theory established in the previous section. To that end we discuss how to choose \(\alpha \) and we provide a stopping rule for the algorithm. Our recommendations are based on numerical experiments and balance avoiding numerical issues with achieving good numerical approximations.

Numerical experiments indicate that in order to achieve high accuracy it is best to choose a relatively large \(\alpha \) and take powers of the resulting modified resolvent. Unfortunately, the modified resolvent is not a sparse matrix and computing powers is rather costly. Alternatively, one may choose \(\alpha \) (very) small and calculate the modified resolvent only once (without taking powers). This, however, may lead to numerical issues when approaching the machine precision. In particular, the condition number of the matrix \((I-(1-\alpha )P)\) grows for decreasing \(\alpha \); therefore, choosing \(\alpha \) very small leads to an ill-conditioned matrix which is more difficult to invert accurately. For the purpose of illustration, consider the Courtois matrix \(P_1\). It then holds that

$$\begin{aligned} \Vert \varPi _{P_1} - H_{\alpha =10^{-10}}(P_1) \Vert \approx 5.09 \cdot 10^{-7} . \end{aligned}$$

According to the theoretical results decreasing \( \alpha \) to, say, \(\alpha =10^{-12}\) should improve the quality of the approximation (see Theorem 1). But the contrary is true as

$$\begin{aligned} \Vert \varPi _{P_1} - H_{\alpha =10^{-12}}(P_1) \Vert \approx 3.48 \cdot 10^{-5}. \end{aligned}$$

This shows that numerical issues come into play when computing the modified resolvent for \(\alpha =10^{-12}\) for \( P_1 \). A similar effect can be observed for the nested modified resolvent

$$\begin{aligned} \Vert \varPi _{P_1} - H_{\alpha =10^{-3}}(P_1;3) \Vert \approx 5.00 \cdot 10^{-6} \quad \text{ compared } \text{ to } \quad \Vert \varPi _{P_1} - H_{\alpha =10^{-4}}(P_1;3) \Vert \approx 3.62 \cdot 10^{-5}, \end{aligned}$$

and even for PM with \(P_1\) rounding errors can lead to numerical issues, i.e.,

$$\begin{aligned} \Vert \varPi _{P_1} - P_1^{10^5} \Vert \approx 2.06 \cdot 10^{-9} \quad \text{ compared } \text{ to } \quad \Vert \varPi _{P_1} - P_1^{10^{10}} \Vert \approx 1.22 \cdot 10^{-5}. \end{aligned}$$

In Fig. 4 the norm errors of \((H_\alpha (P_1))^k\), i.e., \( \Vert \varPi _{P_1} - (H_\alpha (P_1))^k \Vert \), are plotted for varying \(\alpha \in (0,1)\) and powers k. From Fig. 4 it follows for each k that choosing \(\alpha \) too small leads to numerical issues and consequently to an increase of the norm errors, contrary to what can be expected from theory. For example, for \(k=1\) numerical issues appear when choosing \(\alpha \) smaller than \(10^{-10}\), from which point on the norm errors start increasing in a zig-zag pattern. Furthermore, the figure shows that the smallest norm errors can be achieved by larger powers k and relatively larger \(\alpha \).

Fig. 4: Illustration of the norm errors for \((H_\alpha (P_1))^k\) with varying \(\alpha \) and k

Based on the above results, we advise using the modified resolvent in a PM framework with a carefully chosen \(\alpha \). When to terminate the power iterations is a delicate matter. A natural stopping rule is to terminate the algorithm when the improvement of an extra iteration becomes insignificant. More specifically, in order to find a power k such that \( \Vert \varPi _P -P^k \Vert \le \varepsilon \), one may terminate PM if \(\Vert P^k P - P^k \Vert < \varepsilon \), for \(\varepsilon > 0\) small. Unfortunately, this stopping rule may stop the algorithm too early, as illustrated in Example 5 below.

Example 5

For \(\delta \in (0,1)\) and \(k\in \mathbb {N}\) let

$$\begin{aligned} P = \left[ \begin{array}{cc} 1-\delta &{} \delta \\ 0 &{} 1 \end{array} \right] \quad \text{ so } \text{ that } \quad \varPi _P = \left[ \begin{array}{cc} 0 &{} 1 \\ 0 &{} 1 \end{array} \right] \quad \text{ and } \quad P^k = \left[ \begin{array}{cc} (1-\delta )^k &{} 1 - (1-\delta )^k \\ 0 &{} 1 \end{array} \right] . \end{aligned}$$

It can be shown that \(\Vert P^k P - P^k \Vert = 2\delta (1-\delta )^k\) and \(\Vert \varPi _P - P^{k+1} \Vert = 2(1-\delta )^k\), which implies that

$$\begin{aligned} \delta \, \Vert \varPi _P - P^{k+1} \Vert = \Vert P^k P - P^k \Vert . \end{aligned}$$

Hence, \( \Vert P^k P - P^k \Vert \le \varepsilon \) only implies

$$\begin{aligned} \Vert \varPi _P - P^{k+1} \Vert \le \frac{\varepsilon }{\delta } , \end{aligned}$$

which for small values of \( \delta \) (e.g., \( \delta < \varepsilon /2 \)) provides no insight, thereby showing that the stopping rule is insufficient.

To prevent a similar pitfall when using \(\Vert (H_\alpha (P))^k H_\alpha (P) - (H_\alpha (P))^k\Vert < \varepsilon \) for \(\alpha \in (0,1)\) as stopping rule, we recommend choosing \(\alpha \) such that

$$\begin{aligned} \alpha = \min \{ \alpha _{\max },\; p^\star /N \} , \end{aligned}$$
(16)

where \(\alpha _{\max }\) is a user-defined upper bound for \(\alpha \), \( p^\star \) denotes the minimal non-zero value of P, formally given by,

$$\begin{aligned} p^\star := \min _{ i , j } \{ P ( i , j ) : P ( i , j ) > 0 \} , \end{aligned}$$

and N is a user-defined scaling to ensure that \(\alpha \) is significantly smaller than \(p^\star \). The intuition is that by choosing \(\alpha< \! \! < p^\star \) the effect of the smallest transition is taken into account. As illustrated by our numerical examples, even for a nearly decomposable matrix, the minimal non-zero entry is typically not so small that \(\alpha \) given in (16) leads to numerical instabilities for \( H_\alpha ( P ) \). In addition, choosing \(\alpha \) as in (16) typically leads to \( H_\alpha ( P ) \) having no transient phase (i.e., \(r=1 \)).

The above considerations lead to the following jump start power method (JSPM).

  1. Choose \( \alpha = \min \{ \alpha _{\max },\; p^\star /N \} \) and select numerical precision \( \varepsilon \).

  2. Initialize \(k = 1\) and calculate \(H_\alpha (P)\).

  3. Set \(k=k+1\).

  4. If

    $$\begin{aligned} \Vert (H_\alpha (P))^{k-1} H_\alpha (P) - (H_\alpha (P))^{k-1} \Vert \ge \varepsilon \end{aligned}$$

    go to step 3. Otherwise go to step 5.

  5. Return \((H_\alpha (P))^{k}\).

It is worth noting that rather than computing the resolvent in Step 2 of the JSPM directly, which requires evaluating the inverse of \( I - ( 1 - \alpha ) P \), it is numerically more efficient to solve \( X(I-(1-\alpha )P ) = \alpha P \), which reduces the problem to that of solving systems of linear equations.
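A sketch of JSPM along the lines of steps 1–5 above (Python/NumPy; `alpha_max`, `N` and `eps` correspond to the user-defined quantities \(\alpha _{\max }\), N and \(\varepsilon \) from the text, their default values are illustrative only, and a safeguard `max_iter` has been added).

```python
import numpy as np

def jspm(P, alpha_max=1e-4, N=100.0, eps=1e-8, max_iter=1000):
    """Jump start power method: power iteration on the modified resolvent H_alpha(P)."""
    # Step 1: alpha = min{alpha_max, p_star / N}, with p_star the smallest non-zero entry of P.
    p_star = P[P > 0].min()
    alpha = min(alpha_max, p_star / N)
    # Step 2: H_alpha(P) via the linear system X (I - (1-alpha) P) = alpha P.
    A = np.eye(P.shape[0]) - (1.0 - alpha) * P
    H = np.linalg.solve(A.T, (alpha * P).T).T
    # Steps 3-5: increase the power until an extra multiplication changes little.
    Hk = H
    for _ in range(max_iter):
        Hk_next = Hk @ H
        if np.max(np.sum(np.abs(Hk_next - Hk), axis=1)) < eps:
            return Hk_next
        Hk = Hk_next
    return Hk
```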

Table 3 shows some numerical results for JSPM. For an overview of the instances see Table 1. Two parameter choices for \( \alpha \) and \( \varepsilon \) are considered, see Table 2. Parameter Setting 1 aims at a higher accuracy of the approximation (i.e., a small value for \(\varepsilon \)), which is numerically possible by choosing \(\alpha \) not too small. Parameter Setting 2 focuses more on quick convergence of the algorithm, i.e., a larger value for \( \varepsilon \) compared to Setting 1 and a small \(\alpha \).

From the results put forward in Table 3, it follows that significantly smaller norm errors can be achieved with Parameter Setting 1, but at the cost of more iterations (i.e., powers of \( H_\alpha (P) \)). Since the modified resolvent is typically not sparse, taking powers of \( H_\alpha (P) \) becomes impractical for large Markov chains. This issue is the topic of the next subsection.

Table 2 Parameter settings used for JSPM in numerical experiments
Table 3 Results of JSPM for two parameter settings given in Table 2

4.3 JSPM for Large Markov Chains

This final subsection discusses JSPM for large Markov chains. A common feature of large chains is that the transition matrix P is sparse but the ergodic projector \(\varPi _P\) is not, due to connectivity [22]. This leads to numerical issues in approximating \(\varPi _P\). In particular for JSPM: as the approximation \( (H_\alpha (P) )^k \) approaches \(\varPi _P\) with increasing k, iterations become computationally more expensive and a memory burden emerges due to the loss of sparsity.

Therefore, for large instances, our advice based on numerical experiments is to choose \(\alpha \) sufficiently small and return \(H_\alpha (P)\) as approximation, i.e., to apply JSPM with \( k =1 \). In addition, instead of calculating \(H_\alpha (P)\) as a whole, we recommend calculating a concentrated version of \(H_\alpha (P)\), denoted by \(H^c_\alpha (P)\), whose computation utilizes structural properties of \( \varPi _P\), such as the fact that all rows corresponding to ergodic states from the same ergodic class are identical. In particular, once row i of \(H_\alpha (P)\), denoted by \(H_\alpha (P)(i,\bullet )\), has been calculated, it can be decided from this approximation whether i is ergodic or transient by inspecting the value of \(H_\alpha (P)(i,i)\). Indeed, invoking the diagonal criterion, see Sect. 2, state i is ergodic if and only if \(H_\alpha (P)(i,i)\) is significantly larger than 0. In case i is identified as ergodic, all indexes corresponding to (significantly) positive entries of \(H_\alpha (P)(i,\bullet )\) are identified as belonging to the same ergodic class. The vector \(H_\alpha (P)(i,\bullet )\) is saved in \(H^c_\alpha (P)\) as approximation for the rows of this ergodic class, and all indexes from this class are marked as considered. In case i is identified as transient, \(H_\alpha (P)(i,\bullet )\) is saved in \(H^c_\alpha (P)\) as approximation for the i-th row of \(\varPi _P\). We will refer to this procedure as the adapted JSPM version for large instances.

In the following, we introduce the adapted JSPM algorithm, where \( E_j \) denotes the set of indexes identified as part of the j-th ergodic class, I denotes the number of identified ergodic classes, and C denotes the set of already considered/evaluated indexes. The user-defined threshold for deciding whether a state is ergodic is denoted by \(\iota \), with \( \iota > 0\). For large instances P, the adapted JSPM for computing \(H^c_\alpha (P)\) can be summarized as follows (recall that \(\mathbb {S}\) is the state space of the Markov chain under consideration); a code sketch is given after the list.

  1. Choose \(\iota > 0\).

  2. Initialize \(I=0\) and \(C = \emptyset \).

  3. If \(\mathbb {S} \setminus C \ne \emptyset \):

     3.1. Select \(i\in \mathbb {S} \setminus C\).

     3.2. Calculate \(H_\alpha (P)(i,\bullet )\).

     Otherwise go to step 6.

  4. If \(H_\alpha (P)(i,i) > \iota \):

     4.1. State \(i\) is identified as ergodic, set \(I = I + 1\).

     4.2. \(E_I = \{ j \; : \; H_\alpha (P)(i,j) > \iota \}\).

     4.3. \(C = C \cup E_I\).

     Otherwise \(i\) is identified as transient, set \(C = C \cup \{ i \}\).

  5. Save \(H_\alpha (P)(i,\bullet )\) in \(H_\alpha ^c(P)\) and go to step 3.

  6. Return \(H_\alpha ^c(P)\).
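A minimal SciPy sketch of the adapted JSPM is given below. It factorizes \(I-(1-\alpha )P\) once and then solves one transposed system per required row of \(H_\alpha (P)\); the function name, the returned data structures and the use of a sparse LU factorization are our own choices and are not prescribed by the algorithm.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def adapted_jspm(P, alpha, iota):
    """Adapted JSPM (steps 1-6) for a large sparse chain P (CSR format).

    Returns the concentrated approximation H^c_alpha(P) as a dict mapping a
    representative state to its row, the identified ergodic classes, and the
    states identified as transient.
    """
    S = P.shape[0]
    A = (sp.identity(S, format='csr') - (1.0 - alpha) * P).tocsc()
    lu = spla.splu(A)                       # factorize I - (1-alpha)P once

    H_rows, classes, transient = {}, [], []
    considered = set()                      # the set C of evaluated indexes
    for i in range(S):
        if i in considered:
            continue
        rhs = alpha * P[i].toarray().ravel()
        # Row i of H_alpha(P) = alpha P (I - (1-alpha)P)^{-1} via A^T x = rhs.
        row = lu.solve(rhs, trans='T')
        H_rows[i] = row
        if row[i] > iota:                   # diagonal criterion: i is ergodic
            E = set(np.flatnonzero(row > iota).tolist())
            classes.append(E)
            considered |= E
        else:                               # i is transient
            transient.append(i)
            considered.add(i)
    return H_rows, classes, transient
```

For very large instances the sparse LU factorization may itself suffer from fill-in; replacing the direct solve by an iterative solver (e.g., `scipy.sparse.linalg.gmres`) applied per row is a natural alternative under the same scheme.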

Table 4 Large instances used for numerical experiments

For instances \(P_{10}\), \(P_{11}\), \(P_{12}\) and \(P_{13}\) from Table 4, the adapted JSPM is applied with \(\alpha = \min \{ 10^{-10},\; (p^\star )^2 \}\) and \(\iota = (1/S)^2\). The philosophy behind the choice of \(\alpha \) is similar to Parameter Setting 2 in the previous section, i.e., \(\alpha \) is chosen small enough that one iteration is most likely sufficient. Our experience with real-life networks is that \(\iota = (1/S)^2\) ensures a correct distinction between transient and ergodic states. Table 5 reports the norm errors and computation times in seconds (sec.) of these experiments.
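Using the hypothetical `adapted_jspm` sketch from above, these parameter choices translate into a few lines; the large instances \(P_{10}\)–\(P_{13}\) are not reproduced here, so a small stand-in chain is used purely for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical driver mirroring the reported parameter choices on a small
# stand-in chain: the first two states form an ergodic class, the last is transient.
P = sp.csr_matrix(np.array([[0.5, 0.5, 0.0],
                            [0.4, 0.6, 0.0],
                            [0.1, 0.1, 0.8]]))
p_star = P.data[P.data > 0].min()           # minimal non-zero entry of P
alpha = min(1e-10, p_star ** 2)
iota = (1.0 / P.shape[0]) ** 2
H_rows, classes, transient = adapted_jspm(P, alpha, iota)  # sketch from Sect. 4.3
print(classes, transient)   # expect one ergodic class {0, 1} and transient state [2] (0-based)
```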

From the results it follows that high accuracy is achieved in a relatively small amount of time: a unique row of \(H_\alpha (P)\) for an instance with 25625 states is calculated with MATLAB R2011b in 1.37 seconds on a 64-bit Windows desktop PC with an Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz processor, with norm error \(3.524 \cdot 10^{-8}\). To put the results into context, for instance \( P_{12} \) it takes on average 2.49 seconds to compute \(\mu ^\top (P_{12})^4\) via matrix updating, i.e., two sparse matrix multiplications to obtain \((P_{12})^4\), with norm error \(\Vert \mu ^\top \varPi _{P_{12}} - \mu ^\top ( P_{12} )^4 \Vert = 0.1794\), where \(\mu ^\top \) equals the first row of an appropriately sized identity matrix. It becomes even more counterproductive if we calculate \(\mu ^\top ( P_{12}) ^8\) in this way, which takes on average 593.75 seconds and leads to norm error \(\Vert \mu ^\top \varPi _{P_{12}} - \mu ^\top ( P_{12} )^8 \Vert = 0.079\). The significant increase in computation time is due to the loss of sparsity. In this case, the vector-updating version of PM might be more efficient: performing 7 sparse vector-matrix multiplications to obtain \(\mu ^\top (P_{12})^8 = P_{12}(1,\bullet )(P_{12})^7\) requires only 0.0109 seconds. However, evaluating \(\mu ^\top (P_{12})^{10000}\) in this way, leading to norm error \(\Vert \mu ^\top \varPi _{P_{12}} - \mu ^\top ( P_{12} )^{10000} \Vert = 0.0048\), already requires 15.63 seconds. This clearly demonstrates the potential of the adapted JSPM for evaluating the ergodic projector. Similar observations can be expected for the other large instances.

A way to (most likely) improve the accuracy of the adapted JSPM without significantly increasing computation time is to calculate \(H^c_\alpha (P^q)\), for \(q > 1\). The intuition is that for relatively small q, \(P^q\) may not affect the sparsity too much (so the increase in computation time is limited) but may increase the accuracy (which is likely according to the theory). Note that, although it commonly does, a larger q need not increase the accuracy, since the theory only provides upper bounds for the norm error. Example 6 provides an instance for which a larger q does not increase the accuracy of \(H_\alpha (P)\).

Example 6

Take

$$\begin{aligned} P= & {} \left[ \begin{array}{ccc} 0.1 &{} 0.45 &{} 0.45 \\ 0.9 &{} 0.1 &{} 0 \\ 0.9 &{} 0 &{} 0.1 \end{array} \right] \;\; \text{ so } \text{ that } \;\; \varPi _P = \left[ \begin{array}{ccc} 1/2 &{} 1/4 &{} 1/4 \\ 1/2 &{} 1/4 &{} 1/4 \\ 1/2 &{} 1/4 &{} 1/4 \end{array} \right] \;\; \text{ and } \\ P^2= & {} \left[ \begin{array}{ccc} 0.82 &{} 0.09 &{} 0.09 \\ 0.18 &{} 0.415 &{} 0.405 \\ 0.18 &{} 0.405 &{} 0.415 \end{array} \right] . \end{aligned}$$

It holds that \(\Vert \varPi _P - H_{\alpha =10^{-6}}(P) \Vert \approx 4.44 \cdot 10^{-7}\), whereas \(\Vert \varPi _P - H_{\alpha =10^{-6}}(P^2) \Vert \approx 1.78 \cdot 10^{-6}\), which is due to the nearly periodic (oscillating) behavior of P (visible by comparing P and \(P^2\)).
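The norms reported in Example 6 can be reproduced (up to rounding) with a few lines of NumPy; the helper `H` below forms the modified resolvent via a linear solve, as before.

```python
import numpy as np

def H(P, alpha):
    """Modified resolvent H_alpha(P) = alpha * P * (I - (1 - alpha) * P)^{-1}."""
    A = np.eye(P.shape[0]) - (1.0 - alpha) * P
    return np.linalg.solve(A.T, (alpha * P).T).T

P = np.array([[0.1, 0.45, 0.45],
              [0.9, 0.1, 0.0],
              [0.9, 0.0, 0.1]])
Pi = np.array([[0.5, 0.25, 0.25]] * 3)

norm = lambda M: np.abs(M).sum(axis=1).max()        # max absolute row sum norm
print(norm(Pi - H(P, 1e-6)), norm(Pi - H(P @ P, 1e-6)))
# roughly 4.4e-7 and 1.8e-6, in line with the values reported above
```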

For the instances put forward in Table 4 we tested the effect of considering \((P_i)^2\) instead of \(P_i\), for \(i=10,11,12,13\). In most cases taking a power increased the accuracy, but it increased the computation time even more significantly, so that the practical usability of taking powers is questionable.

Table 5 Results for adapted JSPM, with \(\alpha = \min \{ 10^{-10},\; (p^\star )^2 \}\) and \(\iota = (1/S)^2\), in case of large instances

5 Conclusion

This paper introduces JSPM, a generalization of PM. JSPM is a highly accurate approximation method for the ergodic projector of a general finite Markov chain, including periodic Markov multi-chains. Convergence analysis and numerical experiments show that it provides a viable generalization of PM. JSPM works especially well for large-scale Markov chains and can deal with nearly decomposable chains without running into numerical instabilities.

Further research includes extending the techniques used for analyzing JSPM to the deviation matrix and achieving higher accuracy via numerical ingenuity.