The Jump Start Power Method: A New Approach for Computing the Ergodic Projector of a Finite Markov Chain

This article presents a new numerical method for approximately computing the ergodic projector of a finite Markov chain. Our approach requires neither structural information on the chain, such as, the identification of ergodic classes, transient states, or qualitative information, such as whether the chain is nearly decomposable or not. The theoretical deduction of the new method is corroborated by an extensive numerical study.


Introduction
This paper studies aperiodic Markov chains defined on discrete state space S = {1, . . . , S}, with S ∈ N. For ease of presentation we will consider the case of aperiodic Markov chains. The extension of our results to the case of periodic chains is postponed to the "Appendix". The steady-state behavior of an aperiodic Markov chain P is characterized through its ergodic projector Π P , where Π P = lim n→∞ P n , see [17,23]. Computing Π P in the above way, i.e., by taking powers of P, is known as the power method (PM). Any finite aperiodic Markov chain P is geometrically ergodic, i.e., there exist finite numbers r , c and β ∈ (0, 1) such that ∀n ≥ r : where || · || denotes the maximum absolute row sum norm (note that any matrix norm may be used), β is called the rate and r is called the transient phase, see, for example, [14,17]. Geometric ergodicity implies that PM enjoys a geometric rate of convergence once the powers exceed r . The main advantages of the power method are that PM is easy to implement and that it requires no further information on P. In addition, PM can be efficiently implemented for large sparse matrices, which is the main reason why PM is used for the acclaimed Google PageRank algorithm introduced by Brin and Page [7], and for more detail see [4,9,20]. PM has two main versions. In the vector-updating version of PM, one computes μP n for given vector μ. Vector-updating applies in case a given Markov chainP with known stationary distribution πP is updated (due to a change in the underlying hyperlink structure of the network) to a new Markov chain P. Then, computing πP P n converges faster towards π P . An advantage of vector updating is that it only requires vector-matrix multiplications. The downside of this approach is that one cannot change the initial vector without a complete recalculation. The matrix-updating version directly computes P n in order to approximate Π P . The advantage of the matrix-updating PM is that by squaring a matrix power P n , i.e., going from P n to (P n ) 2 = P 2n , high powers of P can be relatively easily computed. Indeed, computing P n only requires log 2 (n) matrix multiplications. Moreover, applying different initial vectors to Π P allows to model different initial distributions which is of particular interest in case of multi-chains, see the subsequent section for details. The downside is that even with the log 2 (n) advantage, matrix updating may require a significant number of matrix multiplications and as the power increase these matrices are not sparse. In case P is periodic, both the vector-updating and the matrix-updating do not converge unless a convex combination of the original P with the identity matrix is used which comes at the expense of reduced convergence speed.
In this paper, we mainly focus on matrix-updating PM, from now on simply referred to as PM. Iterative methods, such as PM converge slowly in case the subdominant eigenvalue of P is close to 1, see [13,15]. This typically happens if either the P-chain only jumps with small probability from the transient states to (one of) the ergodic class(es) or if P is nearly decomposable. Roughly speaking, an irreducible chain P is called nearly decomposable if the state-space can be divided into classes so that the interactions between states are relatively frequent compared to interactions between the classes (a formal definition will be provided later in the text). It can be shown that an irreducible Markov chain without transient states is nearly decomposable if and only if the subdominant eigenvalue is close to 1, see [11]. A famous example of a nearly decomposable Markov chain is the so-called Courtois matrix, which is a 6 × 6 transition matrix for which PM requires n ≈ 69.000 in order to provide an approximation of Π P that is correct in first 6 digits, [25]. In case the ergodic classes and the transient states are known, one may compute the ergodic projector directly by first computing the equilibrium distribution for each ergodic class, and then the long-term behavior of the transient states, see [5,17] and the detailed discussion in Sect. 2. For a comprehensive overview of numerical methods for computing the ergodic projector of a finite Markov chain, we refer to [25].
Our research on Markov chains is stimulated by the growing interest in the analysis of social networks (where the Markov chain is used to model relationships among social agents, see [22]) and by the analysis of the world wide web, were based on the (bored-) randomsurfer-concept, the Markov chain models the probability of randomly going from one page to another, [9,20]. A key feature of these networks is that they are large and that neither their structure (transient states, ergodic classes) nor their balancedness (nearly decomposable or not) are known a priori. Other examples of these type of complex networks include telecommunication networks, cognitive and semantic networks and biological networks.
In this paper, we develop a novel approach for approximating the ergodic projector of an aperiodic Markov chain. We firstly establish a new representation of the ergodic projector of P through constructing an alternative Markov chain and then make this result useful for numerical computation. Starting point of our analysis is the known analytical relation see, for example, Theorem 1.5 in [16], where the term α(I − (1 − α)P) −1 is recognizable as the resolvent kernel of P. See also [19] for applications of the resolvent kernel in stability theory of Markov chains. We call the transformation for α ∈ (0, 1], the modified resolvent kernel of P. In particular, the resolvent kernel is modified so that it allows for efficient numerical evaluation. To see this note that in case of large P one solves a system of linear equations, for which P can be used as a basis, rather than computing the inverse explicitly. Letting X α be a geometrically distributed random variable with parameter α ∈ (0, 1), the modified resolvent kernel can be written as since (1 − α)P < 1, which suffices to show that H α (P) is again a Markov transition matrix with the same ergodic projector as P for any α ∈ (0, 1), and since at α = 1 it holds H α (P) = P, the statement holds for α = 1 as well. In formula, The main contribution of this paper is that we take (1) as starting point for developing a new numerical algorithm for computing Π P . As main technical result we will show that where γ (P) a finite (possibly large) constant depending on P, to be defined later in the text. Letting α < 1/γ (P), the result put forward in (2) implies that the Markov kernel H α (P) is geometrically ergodic with transient phase r = 1 and rate αγ (P). Put differently, the transformation P → H α (P) provides a jump start for PM as the desired contraction property is immediately effective. Moreover, we will show that iterating the transformation yields a geometric reduction in the geometric rate, so that, for example, H α (H α (P)) has a rate that is proportional to α 2 . The above theoretical results lead to a new numerical approach for approximately computing Π P , called jump start power method.
The main contributions of this paper are as follows.
-The error of approximating Π P by powers of the modified resolvent (H α (P)) k is of order (αγ (P)) k . We use this fact to introduce the jump start power method (JSPM) that enjoys the robustness of PM but overcomes the numerical deficiency of PM. JSPM works well for multi-chains, nearly decomposable chains, and chains that jump with small probability from the transient states to (one of) the ergodic class(es).
-An adapted version of JSPM is developed for large-scale Markov chains which utilizes the structure of the Markov chain and takes only 'one jump' towards Π P , i.e., k = 1 together with a carefully chosen α ∈ (0, 1). -An extensive numerical study is provided that corroborates the form of the analytical bound for the decay of the error and illustrates the numerical advantages of JSPM.
The article is organized as follows. Section 2 formally introduces Markov multi-chains, nearly decomposable Markov chains, and defines concepts used throughout the article. Section 3 presents the main technical results of the paper. Specifically, the approximate formula in (2) is derived. JSPM is presented in Sect. 4 together with a numerical study on the performance of the algorithm. The article concludes with a discussion of potential further research. The extension of our results to the case of periodic chains is presented in the "Appendix".

A Brief Review of Markov Chains
This paper studies aperiodic Markov chains defined on discrete state space {X t } t=0,1,... , with S = {1, . . . , S}; see [17] for definitions. For ease of presentation we will first consider the case of aperiodic Markov chains. The extension of our results to the case of periodic chains is postponed to the "Appendix". For the (i, j)-th element of P it holds that P(i, j) = Pr(X t+1 = j|X t = i) is independent of t and the past states, i.e., the probability distribution of the next state only depends on the current state. This leads to which reads as the n-step transition probability of the Markov chain, where transition matrix P n is simply obtained from taking the n-th matrix power of P. Taking n to infinity leads to the (i, j)-th element of the ergodic projector, denoted by Π P , and defined by Entry Π P (i, j) represents the probability of the chain being in state j in the long-run when starting in state i. For more details we refer to [17]. In case the Markov chain has only one closed irreducible set of states, also called ergodic class, and a (possibly empty) set of transient states, it is called a Markov uni-chain (in short: uni-chain). For uni-chains it holds that the chain will eventually be trapped in the (unique) ergodic class, independent of the initial state. The unique distribution to which a uni-chain converges is described by the stationary distribution of P denoted as π P which can be found by solving π P P = π P . Since the stationary distribution is independent of the initial state, all rows of Π P equal π P in case P describes a Markov uni-chain.
Markov multi-chains (in short: multi-chains) have multiple ergodic classes and a (possibly empty) set of transient states. Other than for uni-chains, for multi-chains the initial state has an impact on the resulting limiting distribution, which stems from the fact that once the chain enters one of the several ergodic classes it remains there permanently. First of all, one has to uncover the ergodic classes and the transient states using, for example, the already mentioned algorithm in [8]. After possible relabelling of states, the transition matrix and the ergodic projector of a multi-chain can be written in the following canonical forms, respectively, where I is the number of ergodic classes. For the i-th ergodic class, P i gives the one step transition probabilities between ergodic states from the i-th ergodic class and Π i gives a square matrix of which all rows equal the unique stationary distribution of the chain inside the i-th ergodic class. Specifically, all rows in Π P i equal π P i , which is the unique probability vector satisfying π P i P i = π P i . Note that all diagonal values of Π P corresponding to ergodic states are non-zero contrary to the diagonal values of transient states which are zero, an insight that will be elaborated in Sect. 4.3. Hence, whether state i is ergodic or transient can be concluded from the value of entry (i, i) of Π P . We call this criterion for ergodicity of a state the diagonal criterion. Moreover, R i ( j, k) gives the equilibrium probability of ending in ergodic state k (which is part of the i-th ergodic class) when starting in transient state j. In order to calculate R i , define J as the number of transient states, I T as the unity matrix of size J and Z ( j, i) as the probability of ending in the i-th ergodic class when starting in transient state j. Note that Z is a J × I matrix Z . It then holds that where e i is a column vector of ones of size equal to the number of states in ergodic class i; see, e.g., [5]. Denote the i-th column of Z with Z (•, i), then it holds that R i = Z (•, i)π P i .
In case there are multiple ergodic classes the stationary distribution fails to be unique. Indeed, any row of Π P is a stationary distribution of the Markov chain. More specifically, denote the i-th row of Π P by Π P (i, •), then it holds that Π P (i, •) is a probability distribution which satisfies Π P (i, •)P = Π P (i, •). This implies that any convex combination of the rows is also a stationary distribution of the Markov chain, i.e., for (γ i ) i∈S : S i=1 γ i = 1 and γ i ≥ 0, for all i ∈ S, it holds that S i=1 γ i Π P (i, •) is a probability distribution which is invariant with respect to P. When an initial distribution μ is considered, this convex combination is fixed (and given by μ ) meaning that there exists a unique stationary distribution for the chain started in μ (describing the long-run behavior of the chain started in μ ), or, more formally, μ Π P is the unique stationary distribution satisfying (μ Π P )P = (μ Π P ) when starting in μ . Literature concerning Markov multi-chains includes Markov decision processes from [23], series expansion of Markov chains [2,5,6] and singular perturbation analysis [1,12] where the underlying multi-chain structure is often known beforehand.
A Markov chain P is called nearly decomposable if P is irreducible and after possible relabeling of states can be written where the diagonal blocks P ii , i = 1, 2, . . . , k, are square and have rows that sum up to 1 −ε, with ε > 0 small.
A Markov chain may belong to all of the above types simultaneously. For example, a multi-chain with transient states may have an ergodic class that for itself constitutes a nearly decomposable chain. Below we illustrate this by means of a simple Markov chain.

Example 1
Let p, q, r 1 , r 2 , r 3 ∈ (0, 1) and define transition matrix P on state-space {1, 2, 3, 4} as The ergodic projector of P can be computed to be Note that when p = q = 3 i=1 r i = 0, we obtain P = I and Π P = I .
Recall that throughout this paper we let ||·|| denote the maximum absolute row sum norm. For any finite-state, aperiodic Markov chain P, there exists a finite number r such that ∀n ≥ r : where c = sup l=0,1,...,r −1 P l −Π P < ∞ and β ∈ (0, 1); see for details [14]. This property is called geometric ergodicity and we will call r the transient phase and β the rate. For ease of references, we call r , c, β ergodicity parameters of P.
In this paper we will also study the impact of starting with a power P q for the evaluation of Π P and we introduce the following additional ergodicity parameters where κ(P, q) = sup n=q,q+1,...,qφ(P,q) For simplicity, we wrote in the introduction γ (P) instead of γ (P, 1).

Bounding the Approximation Error
Starting point of our analysis is the equality Adding and subtracting α P q to the right hand side gives where we used for the second term on the right hand side that Π P P = Π P . Inserting N times this last expression for Π P into the first Π P on the right hand side leads to For N , q ∈ N and α ∈ (0, 1) let Note that for α ∈ (0, 1) and q ≥ 1 it holds that where existence of the Neumann series is guaranteed since (1 − α)P q < 1, for α ∈ (0, 1). Equation (4) can be rewritten in succession via (i) simplifying the second term, (ii) bringing the second term of the right hand side to the other side, (iii) dividing by 1 − (1 − α) N +1 and (iv) using the G α (N , P q )-notation:

Remark 1
The (i, j)-th element of G α (N , P q ) gives the scaled (1 − α)-discounted expected number of visits of the Markov chain with transition matrix P q to the j-th state in the first N + 1 number of discrete time steps (including the state i at time zero) when starting in state i. Intuitively, the discounting ensures that the weights of the visits after many discrete time steps of the Markov chain with transition matrix P q becomes smaller and smaller, ensuring i.e., the k − 1 power of the first term of the right hand side of (5), gives where we used that Taking the limit N → ∞ in (6) leads to where we use the notation which is the modified resolvent kernel of P. Analogous to G α (·) let in the following .
Since (Π P − P q )Π P = 0, it holds that so that (7) can be written as where the definition of G α (N , P q ) is filled in in the last equation. Applying norms to Eq. (8) we get where the summation is split at φ(P, q) into two summations, one where geometric ergodicity does not apply and one where it does, respectively. Continuing calculations from (10) shows So we may conclude from (11) that for 1. N ≤ φ(P, q) − 1 (geometric ergodicity does not apply): it holds that Note that it is necessary for the bound to be meaningful that N ≥ φ(P, q) so that the geometric ergodicity applies.

Remark 2 For notational easiness define the bound found in Lemma 1 in case
it holds that for any choice of k, q and N ≥ φ(P, q) it is optimal to choose α ∈ (0, 1) as small as possible.
The following theorem summarizes some properties of H α (P q ).

Remark 3 Theorem 3 in the "
Appendix" shows that the results put forward in Lemma 1 and Theorem 1 apply to periodic Markov chains with period d for q = 1 when γ (P, 1) is replaced byγ (P, d) defined in the "Appendix".
The result put forward in Theorem 1 shows that for α < γ (P, q) it holds that the modified resolvent H α (P q ) is geometrically ergodic with rate αγ (P, q), transient phase r = 1, and ergodic projector Π P .
As our numerical study in the second part of the paper shows, the modified resolvent is potentially more efficient than PM, which makes it, apart from the fact that it directly applies to multi-chains, an attractive alternative to PM. In the following, H α (P) is illustrated for Example 1.

Example 2
We revisit Example 1 where we assume that p, q, r 1 , r 2 and r 3 are non-zero probabilities. For this chain H α (P) can be explicitly solved to be Hence, letting α tend to zero yields element-wise convergence of H α (P) to Π P , which is in accordance with Theorem 1. For example, the absolute error of the (1, 1)-th element equals so that the corresponding relative error is where H α (P)(i, j) indicates the (i, j)-th element of H α (P). It shows that the relative error of H α (P)(1, 1) can be bounded by the linear function αc 1 Furthermore, the asymptotic probabilities of going from one ergodic class to another (or to a transient state) are zero. This shows that H α (P) uncovers the structure of the ergodic classes. In addition, the approximation assigns in general a positive mass to jumps from a transient state to itself, e.g., which is strictly larger than zero if 3 i=1 r i < 1, while clearly Π P (4, 4) = 0. Note that, besides the case 3 i=1 r i = 1, the approximation may give the wrong impression that a transient state, say i, is ergodic. However, when α is chosen sufficiently small (for example of the order 10 −8 ), together with the fact that there are no transitions from ergodic states towards i, the fact that i is transient becomes apparent.
In the following, we analyze the effect that taking a power of the modified resolvent has on the convergence. Denote (H α (P)) 2 (1, 1) as the (1, 1)-th element from (H α (P)) 2 . It then holds that the relative error of which can be computed to be equal to which can be bounded by the quadratic function α 2 c 2 ( p, q) where Note that when p + q = 1 or p = q = 0 the relative errors of approximation H α (P) (1,1) and (H α (P)) 2 (1, 1) are zero. For p + q ∈ (0, 2] \ {1} and α ∈ (0, 1) the relative error of (H α (P)) 2 (1, 1) is strictly smaller than that of H α (P) (1,1). Furthermore, comparing the relative error bounds αc 1 ( p, q) and α 2 c 2 ( p, q) shows the quadratic improvement of (H α (P)) 2 (1, 1), which is in accordance with Theorem 1. This illustrates the improvement that can be achieved through the power k in the generalization. The other entries of H α (P) can be analyzed along the same lines.
In the following example, we discuss the convergence of H α (P) in case of a nearly decomposable Markov chain.

Example 3
In the light of nearly decomposable Markov chains it is interesting to see what happens in case p + q ↓ 0 for the Markov chain in Example 1, i.e., when p and q are both close to 0. L'Hôpital's rule shows that the relative error of the (1, 1)-th element for p + q ↓ 0 in the limit equals see also the relative errors from Example 1. Similar, the relative error of H α (P)(1, 2) converges for p + q ↓ 0 towards α 1 − α .
Both relative errors show that arbitrary accuracy can be achieved by using the modified resolvent with α small even in case of nearly decomposable Markov chains. Now consider the case where 3 i=1 r i = ε, for ε > 0 small. In that case the Markov chain breaks almost up into 3 ergodic classes. For the (4, 4)-th element it holds that which equals the absolute error, since Π P 1 (4, 4) = 0. Choosing α such that leads to an absolute error smaller than δ, showing that arbitrary accurate precision can be achieved with H α (P) even in case the Markov chain almost breaks up into 3 ergodic classes. Similar for the (4, 3)-th element of H α (P) it can be shown that the relative error equals showing that in order to obtain a relative error smaller than η one should choose α ≤ εη, again showing that any accuracy can be achieved in theory. Note that H α (P)(4, 1) = H α (P)(4, 2) = H α (P)(4, 3) = 0 in case 3 i=1 r i = 0, i.e., in case the Markov chain consists of three ergodic classes this is correctly detected.
As the following theorem shows, the norm error of H α (P q ; n) can be bounded by a geometric function with power n and rate α.

Corollary 1
For α ∈ (0, 1) such that αγ (P, q) < 1 it directly follows from the above theorem that when we define ε = αγ (P, q). Furthermore, since H α (P; n)Π P = Π P it holds for k ≥ 1 that where in the last Eq. (15) is used.

Remark 4
The results put forward in Theorem 2 and Corollary 1 apply for the case q = 1 also to periodic Markov chains with period d when γ (P, 1) is replaced byγ (P, d). For details see the "Appendix".
Theorem 2 shows that repeated application of the modified resolvent yields a geometric improvement of the rate of geometric ergodicity. Example 4 illustrates Theorem 2.

Example 4
We revisit the instance from Examples 1 and 2 where we assume that p, q, r 1 , r 2 and r 3 are non-zero probabilities. For this chain H α (P; 2) can be explicitly solved to be This shows that H α (P; 2) converges with quadratic rate in terms of α towards Π P . Consequently, a larger rate of convergence is achieved for H α (P; 2) than for H α (P); compare with Example 2. More specifically, the relative error of the (1, 1)-th element equals where the absolute term converges to constant p|1− p−q| q( p+q) for α small. Similar, the relative error of H α (P; 3)(1, 1) equals showing that the relative error error is approximately a factor α smaller than that of H α (P; 2)(1, 1) in accordance with what can be expected from Theorem 2.
Unfortunately, as γ (P, q) is not available, it is neither clear what a good initial choice for α is, nor when to terminate (H α (P)) k or the repeated application of H α (P; n). In the following, we will address these two issues in more detail.

Remark 5 In [24] the resolvent of a Markov chain P is defined as
Puterman develops in [24] the resolvent into a Laurent series yielding Note that the series on the above right hand side involves the deviation matrix D P = (I − P + Π P ) −1 − Π P and that deriving efficient bounds for the norm of the deviation is itself a demanding task in case of multi-chains. As the norm of the deviation matrix typically takes large values, the above series only applies for significantly small values of α.

The Jump Start Power Method
In the previous section, we have shown that going from P to the modified resolvent H α (P) potentially yields a geometrically ergodic Markov chain with no transient phase (i.e., r = 1). In this section we show how this result can be made fruitful for numerical computations. In particular, Sect. 4.1 illustrates the modified resolvent theory through numerical experiments, Sect. 4.2 develops a practical method that exploits the developed theory by introducing the jump start power method (JSPM) and provides numerical results. Lastly, Sect. 4.3 discusses and numerically illustrates the use of JSPM in case of large (sparse) systems.

Motivating Numerical Experiments
As a first step we analyze the effect of mapping P to H α (P) by comparing numerically P n with H α (P). The considered instances cover a wide range of Markov chains and for an overview, we refer to Table 1. Each row in Table 1 corresponds to an instance defined by its transition matrix (Tr. Matrix). The instances are based on random graph models that capture key properties of real-life networks. The instances vary in terms of size S (given in the 'S' column), structure (given in the 'Ergodic Structure' column), connectivity (as indication, column ' p ' gives the smallest non-zero element of P), and parameters used for random graph models (given in the 'Description' column). The ergodic structure is denoted by ([v 1 , v 2 , . . . , v I ], T ), where I is the number of ergodic classes, v i the number of states in the i-th ergodic class and T = S − I i=1 v i is the number of transient states. In case of transient states, the corresponding part in P is randomly filled such that transient states most likely point towards each other and multiple ergodic classes (if possible). The description column gives the relevant reference of the instance together with parameters given in the same order as they appeared in the original reference, where altered labels are used in case of conflicting notation (e.g., if α is used as parameter in the original reference, we refer to this parameter as β). For the implementation of the different instances, the code provided in MATLAB toolbox CONTEST [26] was used. The CONTEST toolbox generates symmetric adjacency matrices which are in many cases periodic. In order to obtain the corresponding transition matrix P, the rows are first normalized ensuring that each row sums up to one. Afterwards, the transition matrix P is mixed with the identity matrix to achieve aperiodicity, which does not affect the ergodic behavior of the chain.
Comparison of P n and H α (P) In the first numerical experiment, we compute for a series of Markov chains P i , with 1 ≤ i ≤ 4 in Table 1, the power n α (P) such that In words, n α (P) is the power of P that is substituted by H α (P). Note that a power n α (P) can be obtained via log 2 n α (P) matrix multiplications. The numerical results depicted in Fig. 1 show that the modified resolvent can replace PM approximations for large powers.
In particular for the Courtois matrix P 1 , in order to approximately achieve a norm error of 7.92 · 10 −7 a power is needed of 2 16 while the same norm error is obtained via the modified resolvent with α ≈ 10 −10 . For P 4 the modified resolvent with α ≈ 10 −11.18 leads to the same norm error (of approximately 1.63 · 10 −5 ) as PM with power 20655175 (≈ 2 24.3 ). As for computation times, experiments showed that on average PM (P 4 ) k with power k = 20655175 takes on average 73.12 seconds in a sparse matrix setting whereas the modified resolvent H α=10 −11.18 (P 4 ) takes on average 2.68 seconds, i.e., a difference of factor 27.28 on average (the experiments were performed in MATLAB R2011b on a 64-bit Windows desktop PC with Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz processor).
Length of Transient Phase for H α (P) Figure 2 illustrates the effect that powers of H α (P) have on the norm error for the Courtois matrix. Different values for α are considered and for each α the exponential decay location is determined and thereby the length of the transient phase is identified. Note that H α=1 (P) equals P. A heuristic approach is used to find the exponential decay location where for each α an exponential function is repeatedly fitted to the data until the coefficient of determination R 2 is close enough to 1 (where R 2 = 1 represents a perfect fit). After each fit which leads to an insufficient coefficient of determination, the dataset is reduced by increasing the value of the first considered power n and the fitting repeats. The found exponential decay locations (i.e., the smallest power in the dataset that led to R 2 sufficiently close to 1) are denoted with the large dots and labeled, where the labels correspond to the fitted functions given under the graph together with the R 2 in parenthesis behind the function.
The main observation from Fig. 2 is that the exponential decay locations shift to the left (and thereby the transient phase becomes smaller) for decreasing values of α. This phenomenon has been theoretically shown in the previous section. It is worth noting for α ≤ 10 −3 there is no transient phase, i.e., r = 1 in these cases. Furthermore, from the fitted functions it follows that smaller α values lead to stronger norm error reduction for increasing powers. For implementation the MATLAB code provided by CONTEST [26] was used Fig. 1 Combinations of α and n α (P i ) such that The Nested Modified Resolvent H α (P; n) Similar to Figs. 2, 3 illustrates the effect of the nested modified resolvent H α (P; n) for varying n in case P = P 1 .
It shows that relatively large values for α already lead to small norm errors after a few iterations. In particular, the fitted relation between the norm error of H α=0.01 (P; n) and n is approximately 4901e −4.6n whereas that of (H α=0.01 (P)) n and n is approximately e −0.0198n (see also Fig. 2), showing that the effect of an increase in the number of iterations in the nested modified resolvent is far more effective than an increase in the power of the modified resolvent for the same α. It therefore illustrates the sharper bound found for the nested modified resolvent in comparison with powers of the modified resolvent.

Jump Start Power Method (JSPM)
In this section, we will develop a power-method like algorithm based on the theory established in the previous section. To that end we discuss how to choose α and we provide a stopping rule Numerical experiments indicate that in order to achieve high accuracy it is best to choose relatively large α and take powers of the resulting modified resolvent. Unfortunately, the modified resolvent is no sparse matrix and computing powers is rather costly. Alternatively, one may choose α (very) small and calculate the modified resolvent only once (without taking powers). This, however, may lead to numerical issues when approaching the machine precision. In particular, the condition number of matrix (I − (1 − α)P) grows for decreasing α, therefore, choosing α significantly small leads to an ill-conditioned matrix which is more difficult to invert accurately. For purpose of illustration consider the Courtois matrix P 1 . It then holds that According to the theoretical results decreasing α to, say, α = 10 −12 should improve the quality of the approximation (see Theorem 1). But the contrary is true as This shows that numerical issues come into play when computing the modified resolvent for α = 10 −12 for P 1 . A similar effect can be observed for the nested modified resolvent Fig. 4 Illustration of the norm errors for (H α (P 1 )) k with varying α and k and even for PM with P 1 rounding errors can lead to numerical issues, i.e., Π P 1 − P 10 5 1 ≈ 2.06 · 10 −9 compared to Π P 1 − P 10 10 1 ≈ 1.22 · 10 −5 .
In Fig. 4 the norm errors of (H α (P 1 )) k , i.e., Π P 1 − (H α (P 1 )) k , are plotted for varying α ∈ (0, 1) and powers k. From Fig. 4 it follows for each k that choosing α too small leads to numerical issues and consequently leading to an increase of norm errors, contrary to what can be expected from theory. For example, for k = 1 numerical issues appear when choosing α smaller than 10 −10 from where the norm errors start increasing in a zig-zag pattern. Furthermore, the figure shows that the smallest norm errors can be achieved by larger powers k and relatively larger α. Based on the above results, we advise to use the modified resolvent in a PM framework with a carefully chosen α. When to terminate the power iterations is a delicate matter. A natural stopping rule is to terminate the algorithm when the improvement of an extra iteration becomes insignificant. More specifically, in order to find a power k such that Π P − P k ≤ ε, one may terminate PM if P k P − P k < ε, for ε > 0 small. Unfortunately, this stopping rule may stop the algorithm too early as is illustrated in Example 5 below.
Hence, P k P − P k ≤ ε only implies which for small values of δ (e.g., δ < ε/2) provides no insight and thereby showing that the stopping rule is insufficient.
To prevent a similar pitfall when using (H α (P)) k H α (P)−(H α (P)) k < ε for α ∈ (0, 1) as stopping rule, we recommend choosing α such that where α max is a user-defined upper bound for α, p denotes the minimal non-zero value of P, formally given by, and N is a user-defined scaling to ensure that α is significantly smaller than p . The intuition is that by choosing α < < p the effect of the smallest transition is taken into account. As illustrated by our numerical examples, even for a nearly decomposable matrix, the minimal non-zero entry is typically not so small that α given in (16) leads to numerical instabilities for H α (P). In addition, choosing α as in (16) typically leads to H α (P) having no transient phase (i.e., r = 1). The above considerations lead to the following jump start power method (JSPM).
It is worth noting that rather than computing the resolvent in Step 2 of the JSPM directly, which requires evaluating the inverse of I − (1 − α)P, it is numerically more efficient to solve X (I − (1 − α)P) = α P, which reduces the problem to that of solving systems of linear equations.
In Table 3 some numerical results for JSPM are shown. For an overview of the instances see Table 1. Two parameter choices for α and ε are considered, see Table 2. Parameter Setting 1 aims at achieving higher accuracy of the approximation (i.e., a small value for ε), which is numerically possible by choosing α not too small. Parameter Setting 2 focuses more on a quick convergence of the algorithm, i.e., a larger value for ε compared to setting 1 and small α.
From the results put forward in Table 3, it follows that significantly smaller norm errors can be achieved using Parameter Setting 1 but at the cost of more iterations (read, powers of H α (P)). Since the modified resolvent is typically not sparse taking powers of H α (P) becomes impractical for large Markov chains. This issue will be the topic of the next subsection.  Table 3 Results of JSPM for two parameter settings given in Table 2 Parameter

JSPM for Large Markov Chains
This final subsection discusses JSPM for large Markov chains. A common feature of large chains is that the transition matrix P is sparse but the ergodic projector Π P is not due to connectivity [22]. This leads to numerical issues in approximating Π P . In particular for JSPM: when the approximation (H α (P)) k approaches Π P as k is increasing, iterations become computational more expensive and a memory burden emerges due to the loss of sparsity. Therefore, in case of large instances, our advice based on numerical experiments is to choose α significantly small and return H α (P) as approximation, i.e., apply the JSPM for k = 1. In addition, instead of calculating H α (P) as a whole, we recommend to calculate a concentrated version of H α (P), denoted by H c α (P), where the computation of H c α (P) utilizes the structural properties of Π P such as the fact that all rows corresponding to ergodic states from the same ergodic class are identical. In particular, when row i of H α (P), denoted by H α (P)(i, •), is calculated, then based on this approximation it can be decided whether i is ergodic or transient by inspecting the value of H α (P)(i, i). Indeed evoking the diagonal criterion, see Sect. 2, state i is ergodic if and only if H α (P)(i, i) is significantly larger than 0. In case i is identified as ergodic, all indexes corresponding to (significantly) positive entries of H α (P)(i, •) are identified as belonging to the same ergodic class. Vector H α (P)(i, •) is saved in H c α (P) as approximation for the rows of the particular ergodic class and we are done considering all the indexes from this ergodic class. In case i is identified as transient, H α (P)(i, •) is saved in H c α (P) as approximation for the i-th row of Π P . We will refer to this procedure as the adapted JSPM version for large instances.
In the following, we introduce the adapted JSPM algorithm, where E j denotes the set of indexes identified as part of the j-th ergodic class, I denotes the number of identified ergodic classes, and C denotes the set of considered/evaluated indexes. The user-defined value to decide whether a state is ergodic or not is denoted by ι, with ι > 0. For large instances P, Block diagonal transition matrix of P 10 and P 11 with weak connection between P 10 and P 11 . I.e., two random nodes from P 10 and P 11 , resp., are connected with probability 1.5e−9. (1) Choose ι > 0.
(3) If S \ C = ∅: Otherwise go to step 6.  Table 4 the adapted JSPM is applied where we have chosen α = min{10 −10 , (p ) 2 } and ι = (1/S) 2 . The philosophy behind the choice of α is similar to Parameter Setting 2 in the previous section, i.e., small α is chosen such that one iteration is most likely sufficient. Our experience for real life networks is that ι = (1/S) 2 ensures correct distinctions between transient and ergodic states. In Table 5 the norm errors and computation times in seconds (sec.) of the experiments can be found.
From the results it follows that high accuracy is achieved in a relatively small amount of time, i.e., a unique row of H α (P) in case of 25625 states is calculated with MATLAB R2011b in 1.37 seconds on a 64-bit Windows desktop PC with Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz processor with norm error 3.524 · 10 −8 . To put the results into context, A way to (most likely) improve accuracy of the adapted JSPM without significantly increasing computation time is to calculate H c α (P q ), for q > 1. The intuition is that for relatively small q, P q may not affect the sparsity too much (increase in computation time is limited) but may increase the accuracy (which is likely according to the theory). Note that although it is common, it is not necessary that larger q increases accuracy, theory only provides upperbounds for the norm error. Example 6 provides an instance for which a larger q does not increase accuracy of H α (P). It holds that Π P −H α=10 −6 (P) ≈ 4.44·10 −7 , whereas Π P −H α=10 −6 (P 2 ) ≈ 1.78·10 −6 , which is due to the periodic behavior of P (visible by comparing P and P 2 ).

Example 6 Take
For the instances put forward in Table 4 we tested the effect of considering (P i ) 2 instead of P i , where i = 10, 11, 12, 13,. Most of the time taking a power increased the accuracy but more significantly increased computation time so that practical usability of taking powers is questionable.

Conclusion
This paper introduces JSPM which is a generalization of PM. JSPM is a highly accurate approximation method for the ergodic projector of a general finite Markov chain, including periodic Markov multi-chains. Convergence analysis and numerical experiments show that it can provide a viable generalization of PM. Especially in case of large-scale Markov chains, JSPM works well and can deal with nearly decomposable chains without running into numerical instabilities.
Further research includes extending the techniques used for analyzing JSPM to the deviation matrix and achieving higher accuracy via numerical ingenuity.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: Proof of Results for Periodic Markov Chains
The main results of this paper can be extended to periodic Markov chains, also known as cyclic Markov chains. As standard literature is in general not concerned with cyclic Markov multi-chains, required technical toolbox with respect to cyclic Markov multi-chains will be developed first.
The Cesaro limit (of order one) version of the ergodic projector is given by Property Π P P = Π P also holds for the Cesaro limit version. When P is aperiodic, lim N →∞ P n exists and equals Π P , see, e.g., the appendix in [23]. For i ∈ S, define Using this notion, let the period of state i ∈ S be described as where GCD(·) denotes the greatest common divisor (GCD) of a set. The period d of a Markov multi-chain with P is defined as where LCM(·) denotes the least common multiple (LCM) of a set. The communicating class of i ∈ S is denoted by C(i) and can be formalized as Note that A(i) is closed under addition, i.e., and it follows from number theory that A(i) contains all but a finite number of positive multiples of d(i): This allows us to prove the following connection between periodic and aperiodic Markov chains.

Lemma 2 Let P be a Markov multi-chain with period d. Then P d is aperiodic.
Proof Showing that suffices to prove that P d is aperiodic. To see this, choose i ∈ S and j ∈ C(i). Then it holds that ∃ m : P m (i, j) > 0 and furthermore due to (17) This means that Relying on Lemma 2, the following lemma establishes a geometric ergodicity result for periodic Markov chains.
Since P d is aperiodic (and finite), see Lemma 2, geometric ergodicity of aperiodic Markov chains implies that there exists a finite r d such that x ≥ r d : where finite r d , c d and β d ∈ (0, 1) are the ergodicity parameters of P d . Therefore, it holds for x ≥ r d ⇒ n ≥ r d d that where the equality follows because · represents the ∞-norm by assumption. Since y ≤ d − 1, and thus note that (β d ) 1 d ∈ (0, 1). This leads to In the following theorem, the above results will be applied to prove a sharp bound for the norm error also applicable to periodic Markov chains. As parameter q affects the periodicity of a Markov chain, only the case q = 1 will be considered for simplicity. The theorem uses the following generalized geometric ergodicity parameters, indicated with a bar, (Π P − P n 1 d+1+n 2 ) , The generalized geometric ergodicity parameters are 'generalized' in the sense that they reduce to γ (P, q = 1), κ(P, q = 1), and φ(P, q = 1), resp., in case of periodicity d = 1.
As a result, the same holds for the following theorem, i.e., Theorem 3 reduces to the results from Lemma 1 and Theorem 2 when periodicity d = 1 (and q = 1).
Observing that ends the proof.