Abstract
Given access to a single long trajectory generated by an unknown irreducible Markov chain M, we simulate an \(\alpha \)-lazy version of M which is ergodic. This enables us to generalize recent results on estimation and identity testing that were stated for ergodic Markov chains, in a way that allows fully empirical inference. In particular, our approach shows that the pseudo spectral gap introduced by Paulin (Electron J Probab 20:32, 2015) and defined for ergodic Markov chains may already be given a meaning in the case of irreducible but possibly periodic Markov chains.
Data availability statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
References
Batu T, Fischer E, Fortnow L, Kumar R, Rubinfeld R, White P (2001) Testing random variables for independence and identity. In: Proceedings 42nd IEEE symposium on foundations of computer science. IEEE, pp 442–451
Boyd S, Boyd SP, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Bui A, Sohier D (2007) How to compute times of random walks based distributed algorithms. Fund Inform 80(4):363–378
Chan SO, Ding Q, Li SH (2021) Learning and testing irreducible Markov chains via the \( k \)-cover time. In: Algorithmic learning theory. PMLR, pp 458–480
Cherapanamjeri Y, Bartlett PL (2019) Testing symmetric Markov chains without hitting. In: Conference on learning theory. PMLR, pp 758–785
Daskalakis C, Dikkala N, Gravin N (2018) Testing symmetric Markov chains from a single trajectory. In: Conference on learning theory. PMLR, pp 385–409
Ding J, Lee JR, Peres Y (2011) Cover times, blanket times, and majorizing measures. In: Proceedings of the forty-third annual ACM symposium on theory of computing, pp 61–70
Feige U, Rabinovich Y (2003) Deterministic approximation of the cover time. Random Struct Algorithms 23(1):1–22
Feige U, Zeitouni O (2009) Deterministic approximation for the cover time of trees. arXiv preprint arXiv:0909.2005
Fill JA (1991) Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. Ann Appl Probab 1(1):62–87
Fried S, Wolfer G (2022) Identity testing of reversible Markov chains. In: International conference on artificial intelligence and statistics. PMLR, pp 798–817
Han Y, Jiao J, Weissman T (2015) Minimax estimation of discrete distributions. In: 2015 IEEE international symposium on information theory (ISIT). IEEE, pp 2291–2295
Hao Y, Orlitsky A, Pichapati V (2018) On learning Markov chains. arXiv preprint arXiv:1810.11754
Hermon J (2016) Maximal inequalities and mixing times. PhD thesis, UC Berkeley
Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, Cambridge
Kamath S, Orlitsky A, Pichapati D, Suresh AT (2015) On learning distributions from their samples. In: Conference on learning theory. PMLR, pp 1066–1100
Lalley SP (2009) Convergence rates of Markov chains. Lecture notes, available online: http://galton.uchicago.edu/~lalley/Courses/313ConvergenceRates.pdf (2012)
Levin DA, Peres Y (2017) Markov chains and mixing times, vol 107. American Mathematical Society, Providence
Marshall AW, Olkin I, Arnold BC (1979) Inequalities: theory of majorization and its applications, vol 143. Springer, New York
Montenegro R, Tetali P (2006) Mathematical aspects of mixing times in Markov chains. Found Trends Theor Comput Sci 1(3):237–354
Orlitsky A, Suresh AT (2015) Competitive distribution estimation: Why is Good-Turing good. In: NIPS, pp 2143–2151
Paulin D (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron J Probab 20:32
Valiant G, Valiant P (2017) An automatic inequality prover and instance optimal identity testing. SIAM J Comput 46(1):429–455
Wolfer G, Kontorovich A (2019) Estimating the mixing time of ergodic Markov chains. In: Proceedings of the thirty-second conference on learning theory, volume 99 of Proceedings of Machine Learning Research. PMLR, pp 3120–3159
Wolfer G, Kontorovich A (2020) Minimax testing of identity to a reference ergodic Markov chain. In: Proceedings of the twenty-third international conference on artificial intelligence and statistics, volume 108 of Proceedings of Machine Learning Research. PMLR, pp 191–201
Wolfer G, Kontorovich A (2021) Statistical estimation of ergodic Markov chain kernel over discrete state space. Bernoulli 27(1):532–553
Wolfer G, Kontorovich A (2022) Improved estimation of relaxation time in non-reversible Markov chains. arXiv preprint arXiv:2209.00175
Acknowledgements
We thank the anonymous referee for the careful reading of the manuscript and for the insightful suggestions that helped us significantly improve this work. In particular, the referee showed us the proof of Theorem 4.6. We are also grateful to Geoffrey Wolfer for posing to us the problem of extending the results of Wolfer and Kontorovich (2019, 2020, 2021) to irreducible Markov chains and for many helpful discussions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The author declares that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Research Supported by the Israel Science Foundation (ISF) through grant No. 1456/18 and by the European Research Council grant No. 949707.
Appendix
1.1 \(\ell _1\)-projection on \(\Delta _d\)
In Example 3.2 (b) it was necessary to project a matrix on \({\mathcal {M}}_d\) with respect to \(||\cdot ||_\infty \). Since, in this norm, each row is considered separately, projecting on \({\mathcal {M}}_d\) is equivalent to projecting each of the rows of the matrix on \(\Delta _d\), with respect to the \(\ell _1\) norm. Thus, we need to solve (at most) d optimization problems of the following form: For \(x\in {\mathbb {R}}^d\), find an \(\ell _1\)-projection \(P_{\Delta _d}(x)\) of x on \(\Delta _d\) (cf. Boyd et al. 2004, p. 397):
$$\begin{aligned} \text {minimize}\;||y-x||_1\;\;\text {subject to}\;y\in \Delta _d, \end{aligned}$$(11)
where
$$\begin{aligned} ||y-x||_1=\sum _{i=1}^d|y_i-x_i|. \end{aligned}$$
Notice that, in contrast to \(||\cdot ||_p\) for \(p>1\), optimization problem (11) has, in general, infinitely many solutions and we will understand \(P_{\Delta _d}(x)\) as the set of all such solutions.
The following lemma seems to be well known, but we were not able to find a reference. Only parts (a) and (b), and the case \(|S|\le 1\), are relevant for our needs.
Lemma 5.1
Let \(x=(x_1,\ldots ,x_d)\in {\mathbb {R}}^d\). Denote
$$\begin{aligned} S=\{i\in [d]:x_i<0\}\quad \text {and}\quad s=\sum _{i\in [d]\setminus S}x_i. \end{aligned}$$
If \(S=[d]\) then choose any \(y\in \Delta _d\). Otherwise, define \(y=(y_1,\ldots ,y_d)\in \Delta _d\) as follows: Set \(y_i = 0\) for every \(i\in S\). Now, for every \(i\in [d]\setminus S\):
(a) If \(s=1\) then set \(y_i = x_i\).

(b) If \(s<1\) then choose any \(y_i \ge x_i\) such that \(\sum _{j\in [d]\setminus S} y_j= 1\).

(c) If \(s>1\) then choose any \(y_i \le x_i\) such that \(\sum _{j\in [d]\setminus S} y_j= 1\).
Then \(y\in P_{\Delta _d}(x)\).
Proof
First, assume that \(S=[d]\) and let \(y'=(y'_1,\ldots ,y'_d)\in \Delta _d\). For each \(i\in [d]\), denote \(\varepsilon _i = y'_i - y_i\). Notice that \(\sum _{i=1}^d\varepsilon _i = 0\). We have
$$\begin{aligned} \sum _{i=1}^d|y'_i-x_i|=\sum _{i=1}^d(y_i+\varepsilon _i-x_i)=\sum _{i=1}^d(y_i-x_i)+\sum _{i=1}^d\varepsilon _i=\sum _{i=1}^d|y_i-x_i|, \end{aligned}$$
where we used that \(x_i<0\) for every \(i\in [d]\), so that \(|z-x_i|=z-x_i\) for every \(z\ge 0\).
Now, assume that \(S\ne [d]\) and let \(z=(z_1,\ldots ,z_d)\in \Delta _d\). We consider each of the three possibilities for s separately:
(a) Without loss, there exists \(k\in [d]\) such that \(x_i < 0\) for every \(1\le i\le k\) and \(x_i\ge 0\) for every \(k+1\le i\le d\). We have
$$\begin{aligned} \sum _{i=1}^d|z_i-x_i|&=\sum _{i=1}^{k}|z_i-x_i|+\sum _{i=k+1}^{d}|z_i-x_i|\\&\ge \sum _{i=1}^{k}|0-x_i|+\sum _{i=k+1}^{d}|x_i-x_i|\\&=\sum _{i=1}^{k}|y_i-x_i|+\sum _{i=k+1}^{d}|y_i-x_i|\\&= \sum _{i=1}^d|y_i-x_i|. \end{aligned}$$
(b) Without loss, there exist \(1\le k\le l\le m\le d\) such that
$$\begin{aligned} x_i<0=y_i\le z_i,&\;\;\forall 1\le i\le k \\ 0\le z_i\le x_i\le y_i,&\;\;\forall k+1\le i\le l \\ 0\le x_i\le z_i\le y_i,&\;\;\forall l+1\le i\le m \\ 0\le x_i\le y_i\le z_i,&\;\;\forall m+1\le i\le d . \end{aligned}$$
Then,
$$\begin{aligned} \sum _{i=1}^{d}|z_i-x_i| =&\sum _{i=1}^{d}|y_i-x_i|+\sum _{i=1}^{k}(z_i-y_i) +\sum _{i=k+1}^{l}\left( (y_i-z_i)-2(y_i-x_i)\right) -\\&\sum _{i=l+1}^{m}(y_i-z_i)+\sum _{i=m+1}^{d}(z_i-y_i). \end{aligned}$$
Thus, it suffices to show that
$$\begin{aligned} \sum _{i=1}^{k}(z_i-y_i)+\sum _{i=k+1}^{l}\left( (y_i-z_i)-2(y_i-x_i)\right) -\sum _{i=l+1}^{m}(y_i-z_i)+\sum _{i=m+1}^{d}(z_i-y_i)\ge 0. \end{aligned}$$(12)
Indeed,
$$\begin{aligned} {}(12)&\iff \overbrace{\sum _{i=1}^{d}z_i}^{=1}+\sum _{i=k+1}^{l}\left( -2(-x_i+z_i)\right) -\overbrace{\sum _{i=k+1}^{d}y_i}^{=1}\ge 0 \iff \sum _{i=k+1}^{l}(x_i-z_i)\ge 0 \end{aligned}$$
and, by assumption, \(x_i\ge z_i\) for every \(k+1\le i\le l\).
(c) Similar to the previous case.
\(\square \)
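The construction of Lemma 5.1 is straightforward to implement. The following Python sketch is our own illustration (the function name is ours, not from the paper); it returns one element of \(P_{\Delta _d}(x)\), resolving the freedom in cases (b) and (c) by raising a single coordinate, respectively shrinking all coordinates proportionally:

```python
import numpy as np

def l1_project_to_simplex(x):
    """Return one l1-projection of x onto the simplex Delta_d,
    following the case analysis of Lemma 5.1."""
    x = np.asarray(x, dtype=float)
    neg = x < 0                        # the set S = {i : x_i < 0}
    if neg.all():                      # S = [d]: every point of Delta_d is a projection
        return np.full(x.size, 1.0 / x.size)
    y = np.where(neg, 0.0, x)          # set y_i = 0 for every i in S
    s = y.sum()                        # s = sum of x_i over [d] \ S
    if s < 1.0:                        # case (b): raise one coordinate, keeping y_i >= x_i
        y[int(np.argmax(~neg))] += 1.0 - s
    elif s > 1.0:                      # case (c): shrink proportionally, keeping 0 <= y_i <= x_i
        y /= s
    return y                           # case (a): s = 1 requires no change
```

For instance, \(x=(-0.2,\,0.5,\,0.3)\) is mapped to \((0,\,0.7,\,0.3)\), at \(\ell _1\)-distance 0.4 from x, which is optimal.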
An ordinary triangle-inequality argument would have introduced an additional factor of 2 in the sample complexity in Example 3.2 (b). The following lemma shows that this can be avoided:
Lemma 5.2
Let \(y=(y_1,\ldots ,y_d)\in \Delta _d\) and let \(x=(x_1,\ldots ,x_d)\in {\mathbb {R}}^d\) be such that \(||x||_1=1\). Let \(\varepsilon >0\) and assume that \(||x-y||_1<\varepsilon \). Then there exists \(x'=(x'_1,\ldots ,x'_d)\in P_{\Delta _d}(x)\) such that \(||x'-y||_1<\varepsilon \).
Proof
If \(x_1,\ldots ,x_d\ge 0\) then \(x\in \Delta _d\) and we may take \(x'=x\). Otherwise, assume without loss of generality that there is \(1\le k\le d\) such that \(x_i<0\) for \(1\le i\le k\) and \(x_i\ge 0\) for \(k+1\le i\le d\). Let \(k+1\le l\le d\) be minimal such that \(\sum _{i=1}^l x_i\ge 0\) (such an l must exist since \(||x||_1=1\)). Define \(x'=(x'_1,\ldots ,x'_d)\in \Delta _d\) as follows: For \(1\le i\le d\) let
We have \(||x'-y||_1\le ||x-y||_1<\varepsilon \).
\(\square \)
1.2 Proof of Lemma 2.1
Since \(\lambda _2\ge \lambda _d\) and, by assumption, \(\lambda _2\le -\lambda _d\), necessarily \(\lambda _d<0\). Now, we notice that the second largest and the smallest eigenvalues of \({\mathcal {L}}_\alpha (M)\) are given by \(\alpha +(1-\alpha )\lambda _{2}\) and \(\alpha +(1-\alpha )\lambda _{d}\), respectively. We have
It is easily seen that, since \(1>\lambda _2\ge \lambda _d>-1\), we have \(\frac{-\lambda _{d}-\lambda _{2}}{1-\lambda _{2}}\le \frac{-2\lambda _{d}}{1-\lambda _{d}}\) and that the third inequality in (13) holds trivially. Thus,
which proves the first assertion.
Turning to the second assertion, we have
Finally, for this \(\alpha \), we have
\(\square \)
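Since \({\mathcal {L}}_\alpha (M)=\alpha I+(1-\alpha )M\), every eigenvalue \(\lambda \) of M is mapped to \(\alpha +(1-\alpha )\lambda \), which is what the proof above exploits. A quick numerical sanity check (our own illustration, not part of the proof), on the two-state periodic chain whose spectral gap vanishes:

```python
import numpy as np

# The two-state periodic chain has eigenvalues 1 and -1, hence spectral gap 0.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
alpha = 0.5
L = alpha * np.eye(2) + (1 - alpha) * M   # the alpha-lazy version of M

print(sorted(np.linalg.eigvals(M)))   # [-1.0, 1.0]
print(sorted(np.linalg.eigvals(L)))   # [0.0, 1.0]: lambda -> alpha + (1 - alpha) * lambda
```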
1.3 Proof of Lemma 4.5
Let \(t_0=\max \left\{ \left\lceil \frac{1}{1-\alpha }\right\rceil \left\lceil \ln \frac{2}{\varepsilon }\right\rceil ,t_{\textsf{mix}}\left( M,\frac{\varepsilon }{2}\right) \right\} \) and let \(t=2\left\lceil \frac{1}{1-\alpha }\right\rceil t_0\). Denote by \(\pi \) the stationary distribution of M. Then, for every \(i\in [d]\), we have
where \(Y\sim \text {Binomial}(t, \alpha )\) and where, in the second inequality, we used that the total variation distance to stationarity is non-increasing in the number of steps (e.g., Lalley 2009, Proposition 7). Since \(t_0\ge t_{\textsf{mix}}\left( M,\frac{\varepsilon }{2}\right) \), we have
and, since \(t_0\ge \left\lceil \frac{1}{1-\alpha }\right\rceil \left\lceil \ln \frac{2}{\varepsilon }\right\rceil \), by Hoeffding’s inequality,
The assertion regarding \(t_{\textsf{mix}}({\mathcal {L}}_\alpha (M))\) follows from the general assertion regarding \(t_{\textsf{mix}}({\mathcal {L}}_\alpha (M), \varepsilon )\), together with the inequality
(cf. Levin and Peres 2017, (4.34)). \(\square \)
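Lemma 4.5 is what makes the \(\alpha \)-lazy device fully empirical: a trajectory of \({\mathcal {L}}_\alpha (M)\) can be produced from a single trajectory of M by flipping an \(\alpha \)-coin at every step. The sketch below shows one natural way to do this (the function name and interface are ours, for illustration only):

```python
import numpy as np

def alpha_lazy_trajectory(traj, alpha, seed=None):
    """Simulate a trajectory of the alpha-lazy chain L_alpha(M) from a
    single trajectory traj = (X_0, X_1, ...) of M: at every step, stay
    put with probability alpha, otherwise consume the next step of traj."""
    rng = np.random.default_rng(seed)
    lazy = [traj[0]]
    i = 0                              # current position in the original trajectory
    while i + 1 < len(traj):
        if rng.random() < alpha:       # lazy step: self-loop
            lazy.append(lazy[-1])
        else:                          # genuine transition of M
            i += 1
            lazy.append(traj[i])
    return lazy
```

Note that t steps of the lazy chain consume only \(t-Y\) transitions of the original trajectory, where \(Y\sim \text {Binomial}(t,\alpha )\) counts the self-loops; this is exactly the bookkeeping in the proof above.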
1.4 Proof of inequality (8)
First, we prove that \(||I-\Pi ||_\pi \le 1\). For every \(f\in {\mathbb {R}}^d\) we have
$$\begin{aligned} ||(I-\Pi )f||_\pi ^2=\sum _{i=1}^d\pi _i\left( f_i-\sum _{j=1}^d\pi _jf_j\right) ^2=\sum _{i=1}^d\pi _if_i^2-\left( \sum _{j=1}^d\pi _jf_j\right) ^2\le ||f||_\pi ^2. \end{aligned}$$
Now, let \(M\in {\mathcal {M}}_d^{\text {irr}}\) with stationary distribution \(\pi \). Since \(\Pi = M\Pi \) and due to the sub-multiplicativity of \(||\cdot ||_\pi \), we have
Thus, it remains to prove that \(||M||_\pi \le 1\). Using Jensen's inequality, for every \(f\in {\mathbb {R}}^d\) we have
$$\begin{aligned} ||Mf||_\pi ^2=\sum _{i=1}^d\pi _i\left( \sum _{j=1}^dM_{ij}f_j\right) ^2\le \sum _{i=1}^d\pi _i\sum _{j=1}^dM_{ij}f_j^2=\sum _{j=1}^d\pi _jf_j^2=||f||_\pi ^2, \end{aligned}$$
where the equality before last uses \(\pi M=\pi \).
Finally, \(||M^*||_\pi \le 1\) holds since the time reversal \(M^*\) of M also has \(\pi \) as stationary distribution (e.g., Levin and Peres 2017, Proposition 1.23). \(\square \)
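The bounds established above can also be checked numerically. Assuming, as in Paulin (2015), that \(||\cdot ||_\pi \) denotes the operator norm induced by \(\ell ^2(\pi )\), it coincides with the spectral norm of \(DAD^{-1}\) with \(D=\text {diag}(\sqrt{\pi })\). A small sketch of ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((4, 4))
M /= M.sum(axis=1, keepdims=True)            # a random row-stochastic matrix

# Stationary distribution: left Perron eigenvector of M, normalized to sum 1.
w, V = np.linalg.eig(M.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

d = np.sqrt(pi)
def pi_norm(A):
    """||A||_pi, computed as the spectral norm of D A D^{-1}, D = diag(sqrt(pi))."""
    return np.linalg.norm(d[:, None] * A / d[None, :], 2)

Pi = np.outer(np.ones(4), pi)                # every row of Pi equals pi
print(pi_norm(M), pi_norm(np.eye(4) - Pi))   # both are at most 1 (up to rounding)
```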
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fried, S. On the \(\alpha \)-lazy version of Markov chains in estimation and testing problems. Stat Inference Stoch Process 26, 413–435 (2023). https://doi.org/10.1007/s11203-022-09283-7