Informed sub-sampling MCMC: approximate Bayesian inference for large datasets

Maire, Florian; Friel, Nial; Alquier, Pierre

doi:10.1007/s11222-018-9817-3

Published: 09 June 2018

Statistics and Computing, Volume 29, pages 449–482 (2019)

Florian Maire^1,2,
Nial Friel^1,2 &
Pierre Alquier^3

Abstract

This paper introduces a framework for speeding up Bayesian inference conducted in the presence of large datasets. We design a Markov chain whose transition kernel uses an unknown fraction of fixed size of the available data that is randomly refreshed throughout the algorithm. Inspired by the Approximate Bayesian Computation literature, the subsampling process is guided by the fidelity to the observed data, as measured by summary statistics. The resulting algorithm, Informed Sub-Sampling MCMC, is a generic and flexible approach which, contrary to existing scalable methodologies, preserves the simplicity of the Metropolis–Hastings algorithm. Even though exactness is lost, i.e. the chain distribution only approximates the posterior, we study and quantify this bias theoretically and show on a diverse set of examples that the method yields excellent performance when the computational budget is limited. We also show that setting the summary statistics to the maximum likelihood estimator, when it is available and cheap to compute, is supported by theoretical arguments.

References

Allassonnière, S., Amit, Y., Trouvé, A.: Towards a coherent statistical framework for dense deformable template estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69(1), 3–29 (2007)
Alquier, P., Friel, N., Everitt, R., Boland, A.: Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Stat. Comput. 26(1–2), 29–47 (2016)
Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 697–725 (2009)
Andrieu, C., Vihola, M.: Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. Ann. Appl. Probab. 25(2), 1030–1077 (2015)
Banterle, M., Grazian, C., Lee, A., Robert, C.P.: Accelerating Metropolis–Hastings algorithms by delayed acceptance. arXiv preprint arXiv:1503.00996 (2015)
Bardenet, R., Doucet, A., Holmes, C.: Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: ICML, pp. 405–413 (2014)
Bardenet, R., Doucet, A., Holmes, C.: On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18, 1–43 (2017)
Bierkens, J., Fearnhead, P., Roberts, G.: The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat. (2018) (to appear)
Chib, S., Greenberg, E.: Understanding the Metropolis–Hastings algorithm. Am. Stat. 49(4), 327–335 (1995)
Csilléry, K., Blum, M.G., Gaggiotti, O.E., François, O.: Approximate Bayesian computation (ABC) in practice. Trends Ecol. Evolut. 25(7), 410–418 (2010)
Dalalyan, A.S.: Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752 (2017)
Douc, R., Moulines, E., Rosenthal, J.S.: Quantitative bounds on convergence of time-inhomogeneous Markov chains. Ann. Appl. Probab. 14, 1643–1665 (2004)
Fearnhead, P., Bierkens, J., Pollock, M., Roberts, G.O.: Piecewise deterministic Markov processes for continuous-time Monte Carlo. arXiv preprint arXiv:1611.07873 (2016)
Fearnhead, P., Prangle, D.: Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 74(3), 419–474 (2012)
Geyer, C.J., Thompson, E.A.: Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90(431), 909–920 (1995)
Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli 7, 223–242 (2001)
Hobert, J.P., Robert, C.P.: A mixture representation of $\pi $ with applications in Markov chain Monte Carlo and perfect sampling. Ann. Appl. Probab. 14, 1295–1305 (2004)
Huggins, J., Zou, J.: Quantifying the accuracy of approximate diffusions and Markov chains. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, vol. 54, pp. 382–391 (2016)
Jacob, P.E., Thiery, A.H., et al.: On nonnegative unbiased estimators. Ann. Stat. 43(2), 769–784 (2015)
Johndrow, J.E., Mattingly, J.C.: Error bounds for approximations of Markov chains. arXiv preprint arXiv:1711.05382 (2017)
Johndrow, J.E., Mattingly, J.C., Mukherjee, S., Dunson, D.: Approximations of Markov chains and Bayesian inference. arXiv preprint arXiv:1508.03387 (2015)
Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis–Hastings budget. In: Proceedings of the 31st International Conference on Machine Learning (2014)
Le Cam, L.: On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. Univ. Calif. Publ. Stat. 1, 277–330 (1953)
Le Cam, L.: Asymptotic Methods in Statistical Decision Theory. Springer, Berlin (1986)
Maclaurin, D., Adams, R.P.: Firefly Monte Carlo: exact MCMC with subsets of data. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
Marin, J.-M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational methods. Stat. Comput. 22(6), 1167–1180 (2012)
Medina-Aguayo, F.J., Lee, A., Roberts, G.O.: Stability of noisy Metropolis-Hastings. Stat. Comput. 26(6), 1187–1211 (2016)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge (2009)
Mitrophanov, A.Y.: Sensitivity and convergence of uniformly ergodic Markov chains. J. Appl. Probab. 42, 1003–1014 (2005)
Nunes, M.A., Balding, D.J.: On optimal selection of summary statistics for approximate Bayesian computation. Stat. Appl. Genet. Mol. Biol. 9(1) (2010)
Pollock, M., Fearnhead, P., Johansen, A.M., Roberts, G.O.: The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv preprint arXiv:1609.03436 (2016)
Pritchard, J.K., Seielstad, M.T., Perez-Lezaun, A., Feldman, M.W.: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16(12), 1791–1798 (1999)
Quiroz, M., Villani, M., Kohn, R.: Speeding up MCMC by efficient data subsampling. Riksbank Research Paper Series (121) (2015)
Quiroz, M., Villani, M., Kohn, R.: Exact subsampling MCMC. arXiv preprint arXiv:1603.08232 (2016)
Roberts, G.O., Rosenthal, J.S., et al.: Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
Rudolf, D., Schweizer, N.: Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24(4A), 2610–2639 (2018)
Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 681–688 (2011)
Wilkinson, R.D.: Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat. Appl. Genet. Mol. Biol. 12(2), 129–141 (2013)

Acknowledgements

The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289. Nial Friel’s research was also supported by a Science Foundation Ireland grant: 12/IP/1424. Pierre Alquier’s research was funded by Labex ECODEC (ANR - 11-LABEX-0047) and by the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque. We thank the Associate Editor and two anonymous Referees for their contribution to this work.

Author information

Authors and Affiliations

School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
Florian Maire & Nial Friel
The Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
Florian Maire & Nial Friel
CREST, ENSAE, Université Paris Saclay, Paris, France
Pierre Alquier

Corresponding author

Correspondence to Florian Maire.

Appendices

Proofs

1.1 Proof of Proposition 1

Proof

For notational simplicity and without loss of generality, we take here g as the identity on $\Theta $. Let $n<N$ and U be a subset of $\{1,\ldots ,N\}$ with cardinality n. Consider the power likelihood:

$$\begin{aligned} \tilde{f}_n(Y_U\,|\,\theta )= & {} f(Y_U\,|\,\theta )^{N/n}=\left\{ \prod _{k\in U}f(Y_k\,|\,\theta )\right\} ^{N/n}\\= & {} \frac{\exp \left\{ (N/n)\sum _{k\in U}S(Y_k)\right\} ^{T}\theta }{L(\theta )^N}\,, \end{aligned}$$

and the corresponding power posterior:

$$\begin{aligned} \tilde{\pi }_n(\theta \,|\,Y_U)=\frac{\exp \left\{ (N/n)\sum _{k\in U}S(Y_k)\right\} ^{T}\theta }{L(\theta )^N}p(\theta )\bigg / \tilde{Z}_n(Y_U)\,, \end{aligned}$$

where

$$\begin{aligned} \tilde{Z}_n(Y_U)=\int p(\text {d}\theta )\frac{\exp \left\{ (N/n)\sum _{k\in U}S(Y_k)\right\} ^{T}\theta }{L(\theta )^N}\,. \end{aligned}$$

For any $\theta $ such that $p(\theta )\ne 0$, write:

$$\begin{aligned} \log \frac{\pi (\theta \,|\,Y_{1:N})}{\tilde{\pi }_n(\theta \,|\,Y_{U})}= & {} \left\{ \sum _{k=1}^N S(Y_k)-(N/n)\sum _{k\in U}S(Y_k)\right\} ^{T}\theta \nonumber \\&+\log \frac{\tilde{Z}_n(Y_U)}{Z(Y_{1:N})}\,. \end{aligned}$$

(A.1)

and the KL divergence between $\pi (\,\cdot \,|\,Y_{1:N})$ and $\tilde{\pi }_n(\,\cdot \,|\,Y_U)$, denoted $\text {KL}_n(U)$, reads

$$\begin{aligned} \text {KL}_n(U)=\Delta _n(U)^{T}\mathbb {E}_\pi (\theta )+\log \frac{\tilde{Z}_n(Y_U)}{Z(Y_{1:N})}\,, \end{aligned}$$

(A.2)

where $\Delta _n(U)=\sum _{k=1}^N S(Y_k)-(N/n)\sum _{k\in U}S(Y_k)$. Now, note that

$$\begin{aligned} \tilde{Z}_n(Y_U)= & {} \int p(\text {d}\theta )\frac{\exp \left\{ (N/n)\sum _{k\in U}S(Y_k)\right\} ^{T}\theta }{L(\theta )^N}\nonumber \\= & {} \int p(\text {d}\theta )\frac{\exp \left\{ \sum _{k=1}^N S(Y_k)-\Delta _n(U)\right\} ^{T}\theta }{L(\theta )^N}\nonumber \\= & {} \int p(\text {d}\theta )f(Y_{1:N}\,|\,\theta ){\exp \left\{ -\Delta _n(U)^{T}\theta \right\} }\nonumber \\= & {} Z(Y_{1:N})\mathbb {E}_\pi \left\{ \exp \left( -\Delta _n(U)^{T}\theta \right) \right\} \,. \end{aligned}$$

(A.3)

Plugging (A.3) into (A.2) yields:

$$\begin{aligned} \text {KL}_n(U)= & {} \Delta _n(U)^{T}\mathbb {E}_\pi (\theta )+\log \mathbb {E}_\pi \left\{ \exp \left( -\Delta _n(U)^{T}\theta \right) \right\} \,,\nonumber \\= & {} \log \frac{\mathbb {E}_\pi \left\{ \exp \left( -\Delta _n(U)^{T}\theta \right) \right\} }{\exp (-\Delta _n(U)^{T}\mathbb {E}_\pi (\theta ))}\nonumber \\= & {} \log \mathbb {E}_\pi \exp \left[ \left\{ \mathbb {E}_\pi (\theta )-\theta \right\} ^{T}\Delta _n(U)\right] \,. \end{aligned}$$

(A.4)

Finally, the Cauchy–Schwarz inequality provides the following upper bound for $\text {KL}_n(U)$:

$$\begin{aligned} \text {KL}_n(U)\le \log \mathbb {E}_\pi \exp \left\{ \left\| \mathbb {E}_\pi (\theta )-\theta \right\| \Vert \Delta _n(U)\Vert \right\} \,. \end{aligned}$$

(A.5)

$\square $
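As an illustration of identity (A.4), the following minimal sketch compares both sides numerically in a toy conjugate model. Everything specific here is an assumption made for the example and is not taken from the paper: the data are Gaussian with unit variance, so that $S(y)=y$, the prior is $\mathcal {N}(0,\tau ^2)$, and the function name `kl_identity_check` is hypothetical. In this model both the posterior and the power posterior are Gaussian with the same precision $N+1/\tau ^2$, so the KL divergence is available in closed form.

```python
import numpy as np

def kl_identity_check(N=200, n=180, tau2=10.0, n_draws=200_000, seed=0):
    """Compare the closed-form KL with a Monte Carlo estimate of (A.4)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(1.0, 1.0, size=N)            # toy data, unit variance (assumed model)
    U = rng.choice(N, size=n, replace=False)    # an arbitrary subsample of size n
    delta = y.sum() - (N / n) * y[U].sum()      # Delta_n(U), a scalar since S(y) = y

    prec = N + 1.0 / tau2                       # precision shared by both posteriors
    post_mean = y.sum() / prec                  # E_pi(theta)
    power_mean = (N / n) * y[U].sum() / prec    # mean of the power posterior

    # KL divergence between two Gaussians with common variance 1 / prec.
    kl_exact = 0.5 * prec * (post_mean - power_mean) ** 2

    # Right-hand side of (A.4), estimated with draws from the full posterior.
    theta = rng.normal(post_mean, np.sqrt(1.0 / prec), size=n_draws)
    kl_mc = np.log(np.mean(np.exp((post_mean - theta) * delta)))
    return delta, kl_exact, kl_mc

if __name__ == "__main__":
    print(kl_identity_check())
```

The Monte Carlo estimate is only reliable when $\Delta _n(U)$ is moderate, which is why the default subsample is large relative to N in this sketch.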

1.2 Proof of Proposition 2

Proof

Under some weak assumptions, the Bernstein–von Mises theorem states that $\pi (\,\cdot \,|\,Y_{1:N})$ is asymptotically (in N) a Gaussian distribution with the maximum likelihood estimator $\theta ^{*}$ as mean and $\Gamma _N=I^{-1}(\theta ^{*})/N$ as covariance matrix, where $I(\theta )$ is the Fisher information matrix at $\theta $. Let us denote by $\Phi $ the pdf of $\mathcal {N}(\theta ^{*},\Gamma _N)$. Under this approximation, $\mathbb {E}_\pi (\theta )=\theta ^{*}$ and from (A.4), we write:

$$\begin{aligned} \exp \text {KL}_n(U)\approx & {} \int \Phi (\text {d}\theta )\exp \left[ \{\theta ^{*}-\theta \}^{T}\Delta _n(U)\right] \nonumber \\= & {} \int \Phi (\theta ^{*}-\theta )\exp \{\theta ^{T}\Delta _n(U)\}\text {d}\theta \nonumber \\= & {} \int \frac{1}{{(2\pi )}^{(d/2)}|\Gamma _N|^{(1/2)}}\nonumber \\&\exp \left\{ -(1/2)\theta ^{T}\Gamma _N^{-1}\theta +\theta ^{T}\Delta _n(U)\right\} \text {d}\theta \,,\nonumber \\= & {} \frac{1}{{(2\pi )}^{(d/2)}|\Gamma _N|^{(1/2)}}\int \exp \left[ -(1/2)\left\{ \theta ^{T}\Gamma _N^{-1}\theta \right. \right. \nonumber \\&\left. \left. -2\theta ^{T}\Gamma _N^{-1}\Gamma _N \Delta _n(U)\right\} \right] \text {d}\theta \,,\nonumber \\= & {} \exp \{(1/2)\Delta _n(U)^{T}\Gamma _N \Delta _n(U)\}\,, \end{aligned}$$

(A.6)

by integration of a multivariate Gaussian density function. Finally, (A.6) yields the following approximation:

$$\begin{aligned} \text {KL}_n(U)\approx \widehat{\text {KL}}_n(U)=(1/2)\Delta _n(U)^{T}\Gamma _N \Delta _n(U)\,. \end{aligned}$$

(A.7)

$\square $
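The quadratic form in (A.7) is cheap to evaluate. The sketch below is an illustration under stated assumptions, not the authors' code: `kl_bvm` and `gaussian_mean_example` are hypothetical names, and the comparison uses a toy Gaussian mean model with unit variance and a $\mathcal {N}(0,\tau ^2)$ prior, for which the Fisher information equals 1 and the exact KL divergence is $\Delta _n(U)^2/\{2(N+1/\tau ^2)\}$.

```python
import numpy as np

def kl_bvm(delta, fisher_info, N):
    """Evaluate (A.7): 0.5 * Delta^T Gamma_N Delta with Gamma_N = I^{-1} / N."""
    delta = np.atleast_1d(np.asarray(delta, dtype=float))
    fisher_info = np.atleast_2d(np.asarray(fisher_info, dtype=float))
    # Solve I x = Delta instead of forming the inverse explicitly.
    return 0.5 * delta @ np.linalg.solve(fisher_info, delta) / N

def gaussian_mean_example(N=1000, n=100, tau2=10.0, seed=1):
    """Exact KL versus its Bernstein-von Mises approximation in a toy model."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.5, 1.0, size=N)                # assumed toy data
    U = rng.choice(N, size=n, replace=False)
    delta = y.sum() - (N / n) * y[U].sum()          # Delta_n(U) with S(y) = y
    kl_exact = 0.5 * delta**2 / (N + 1.0 / tau2)    # closed form in this conjugate model
    kl_hat = kl_bvm(delta, 1.0, N)                  # Fisher information is 1 here
    return kl_exact, kl_hat

if __name__ == "__main__":
    print(gaussian_mean_example())
```

In this toy model the ratio of the exact value to the approximation is $N/(N+1/\tau ^2)$, which tends to 1 as N grows, in line with the asymptotic nature of the approximation.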

1.3 Proof of Proposition 3

Proof

Let $\textsf {U}_n\supset A_n(\theta ):=\left\{ U\in \textsf {U}_n,\;g(\theta )^{T}\Delta _n(U)\le 0\right\} $ and remark that, using the Cauchy–Schwarz inequality, we have:

$$\begin{aligned}&\mathbb {E}\left\{ \frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \le \nu _{n,\epsilon }\left\{ A_n(\theta )\right\} \nonumber \\&\quad +\sum _{U\in \textsf {U}_n\backslash A_n(\theta )}\nu _{n,\epsilon }(U)\exp \{ \Vert g(\theta )\Vert \Vert \Delta _n(U)\Vert \}\,. \end{aligned}$$

Now, define $\bar{\Delta }_n(U):=\bar{S}(Y)-\bar{S}(Y_U)$ where $\bar{S}$ is the normalized summary statistics vector, i.e. if $U\in \textsf {U}_n$, $\bar{S}(Y_U)=S(Y_U)/n$. Clearly, when $N\rightarrow \infty $, some terms

$$\begin{aligned} \exp \{\Vert g(\theta )\Vert \Vert \Delta _n(U)\Vert \}=\exp \{N\Vert g(\theta )\Vert \Vert \bar{\Delta }_n(U)\Vert \} \end{aligned}$$

will have a large contribution to the sum. More precisely, any mismatch between the summary statistics of some subsamples $\{Y_U,\,U\in \textsf {U}_n\backslash A_n(\theta )\}$ and those of the full dataset will be amplified by the factor N, thereby exponentially inflating the upper bound. However, assigning the distribution $\nu _{n,\epsilon }$ of Eq. (12) to the subsamples $\{Y_U,\,U\in \textsf {U}_n\}$ balances out this effect. Indeed, note that

$$\begin{aligned}&\mathbb {E}\left\{ \frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \le \nu _{n,\epsilon }\{A_n(\theta )\}\\&\quad +\sum _{U\in \textsf {U}_n\backslash A_n(\theta )}\exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\\&\quad +\,\Vert g(\theta )\Vert \Vert \Delta _n(U)\Vert \}/ Z(\epsilon )\,, \end{aligned}$$

where $Z(\epsilon )=\sum _{U\in \textsf {U}_n}\exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\}$ and we have, for a fixed n and when $N\rightarrow \infty $, that

$$\begin{aligned} \nu _{n,\epsilon }(U)\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\rightarrow _{\Vert \Delta _n(U)\Vert \rightarrow \infty } 0\,. \end{aligned}$$

Since g is bounded, then $\mathbb {E}\left\{ {f(Y\,|\,\theta )}\slash {f(Y_U\,|\,\theta )^{N/n}}\right\} $ is bounded too. $\square $
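To make the role of $\nu _{n,\epsilon }$ concrete, the sketch below enumerates all subsets of size n for a very small dataset and weights them by $\exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\}/Z(\epsilon )$, as in the expression of $Z(\epsilon )$ above. The choice $S(y)=y$, the data values in the usage note and the function names are assumptions made for illustration; exhaustive enumeration is only feasible for toy sizes.

```python
import itertools
import numpy as np

def informed_weights(y, n, eps):
    """Enumerate all size-n subsets U and return nu_{n,eps}(U) for each."""
    N = len(y)
    subsets = list(itertools.combinations(range(N), n))
    # Delta_n(U) = sum_k S(Y_k) - (N/n) sum_{k in U} S(Y_k), with S(y) = y assumed.
    deltas = np.array([y.sum() - (N / n) * y[list(U)].sum() for U in subsets])
    log_w = -eps * deltas**2
    w = np.exp(log_w - log_w.max())     # subtract the max for numerical stability
    return subsets, deltas, w / w.sum()

def expected_abs_delta(y, n, eps):
    """Expectation of |Delta_n(U)| under nu_{n,eps}."""
    _, deltas, nu = informed_weights(y, n, eps)
    return float(np.sum(nu * np.abs(deltas)))

if __name__ == "__main__":
    y = np.array([0.1, 1.3, -0.7, 2.2, 0.4, -1.5, 0.9, 3.0])
    for eps in (0.0, 0.1, 1.0):
        print(eps, expected_abs_delta(y, 3, eps))
```

Setting `eps=0.0` recovers uniform subsampling; larger values concentrate the mass on subsets whose summary statistics match the full dataset, which is what tames the exponential terms in the bound.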

1.4 Proof of Proposition 4

We preface the proof of Proposition 4 with five lemmas, some of which are inspired by Medina-Aguayo et al. (2016). For notational simplicity, the dependence on $(n,\epsilon )$ of any ISS-MCMC related quantity is implicit. For all $(\theta ,U)\in \Theta \times \textsf {U}_n$, we write $\phi _U(\theta )=f(y_U\,|\,\theta )^{N/n}/f(y\,|\,\theta )$ and recall that $a(\theta ,\theta ')$ is the (exact) MH acceptance ratio so that $\alpha (\theta ,\theta ')=1\wedge a(\theta ,\theta ')$. Unless stated otherwise, $\mathbb {E}$ is the expectation taken under $\nu _{n,\epsilon }$. For simplicity, $\tilde{K}_{n,\epsilon }$ is written as $\tilde{K}_n$.

Lemma 1

For any $(\theta ,\theta ')\in \Theta ^2$, we have

$$\begin{aligned} \tilde{\alpha }(\theta ,\theta ')\le \alpha (\theta ,\theta ')\left\{ 1\vee \mathbb {E}\frac{\phi _U(\theta ')}{\phi _U(\theta )}\right\} \,. \end{aligned}$$

Proof

This follows from a slight adaptation of Lemma 3.3 in Medina-Aguayo et al. (2016):

$$\begin{aligned} \tilde{\alpha }(\theta ,\theta ')= & {} \mathbb {E}\left\{ 1\wedge \frac{f(Y_U\,|\,\theta ')^{N/n}p(\theta ')Q(\theta ',\theta )}{f(Y_U\,|\,\theta )^{N/n}p(\theta )Q(\theta ,\theta ')}\frac{f(Y\,|\,\theta )f(Y\,|\,\theta ')}{f(Y\,|\,\theta )f(Y\,|\,\theta ')}\right\} \\\le & {} 1\wedge \left\{ a(\theta ,\theta ')\mathbb {E}\frac{\phi _U(\theta ')}{\phi _U(\theta )}\right\} \\\le & {} 1\wedge \left[ a(\theta ,\theta ')\left\{ \mathbb {E}\frac{\phi _U(\theta ')}{\phi _U(\theta )}\vee 1\right\} \right] \\\le & {} \alpha (\theta ,\theta ')\left\{ \mathbb {E}\frac{\phi _U(\theta ')}{\phi _U(\theta )}\vee 1\right\} \,, \end{aligned}$$

where we have used Jensen’s inequality and the fact that the inequality $1\wedge ab\le (1\wedge a)b$ holds for $a>0$ and $b\ge 1$. $\square $

Lemma 2

For any $\theta \in \Theta $ and all $\delta >0$, we have

$$\begin{aligned} \tilde{\rho }(\theta )-\rho (\theta )\le \delta +2\sup _{\theta \in \Theta }\mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \frac{\delta }{2}\right\} \,. \end{aligned}$$

Proof

The proof is identical to the proof of Lemma 3.2 in Medina-Aguayo et al. (2016), noting that Lemma 3.1 in the same reference holds for two random variables $\phi _U(\theta )$ and $\phi _{U}(\theta ')$ that are not independent, i.e. for all $(\theta ,\theta ')\in \Theta ^2$, any $U\in \textsf {U}_n$ and all $\delta \in (0,1)$,

$$\begin{aligned} \mathbb {P}\left\{ \frac{\phi _U(\theta )}{\phi _U(\theta ')}\le 1-\delta \right\} \le 2\sup _{\theta \in \Theta }\mathbb {P}\left\{ |\phi _U(\theta )-1|\ge \delta /2\right\} \,. \end{aligned}$$

Lemma 3

Assume that Assumption A.4 holds. Then we have

$$\begin{aligned} \sup _{(\theta ,\theta ')\in \Theta ^2}1\vee \mathbb {E}\bigg \{\frac{\phi _U(\theta )}{\phi _U(\theta ')}\bigg \}\le \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} \,. \end{aligned}$$

Proof

Using the Cauchy–Schwarz inequality, we write that for all $(\theta ,\theta ')\in \Theta ^2$,

$$\begin{aligned} \mathbb {E}\left\{ \frac{\phi _U(\theta ')}{\phi _U(\theta )}\right\}= & {} \mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta ')^{N/n}}{f(Y\,|\,\theta ')}\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \nonumber \\\le & {} \left[ \mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta ')^{N/n}}{f(Y\,|\,\theta ')}\right\} ^2\right] ^{1/2}\nonumber \\&\left[ \mathbb {E}\left\{ \frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} ^2\right] ^{1/2}\,. \end{aligned}$$

(A.8)

Now for all $\theta \in \Theta $, we define the event $\mathcal {E}_\theta :=\{U\in \textsf {U}_n\,,\;f(Y\,|\,\theta )\le f(Y_U\,|\,\theta )^{N/n}\}$ so that

$$\begin{aligned} \mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\right\} ^2= & {} \mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}{\mathbb {1}}_{\mathcal {E}_\theta }(U)\right\} ^2\\&+\mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}{\mathbb {1}}_{\overline{\mathcal {E}_\theta }}(U)\right\} ^2 \end{aligned}$$

and we note that for all $(\theta ,U)\in \Theta \times \textsf {U}_n$, Eq. (26) gives

$$\begin{aligned} \left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\right\} ^2{\mathbb {1}}_{\mathcal {E}_\theta }(U)\le e^{2\gamma \Vert \Delta _n(U)\Vert }{\mathbb {1}}_{\mathcal {E}_\theta }(U)\,, \end{aligned}$$

but also

$$\begin{aligned} \left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\right\} ^2{\mathbb {1}}_{\overline{\mathcal {E}_\theta }}(U)\le e^{2\gamma \Vert \Delta _n(U)\Vert }{\mathbb {1}}_{\overline{\mathcal {E}_\theta }}(U)\,, \end{aligned}$$

so that

$$\begin{aligned}&\mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}{\mathbb {1}}_{\mathcal {E}_\theta }(U)\right\} ^2+ \mathbb {E}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}{\mathbb {1}}_{\overline{\mathcal {E}_\theta }}(U)\right\} ^2\\&\quad \le \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }{\mathbb {1}}_{\mathcal {E}_\theta }(U)\right\} + \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }{\mathbb {1}}_{\overline{\mathcal {E}_\theta }}(U)\right\} \\&\quad =\mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} \,. \end{aligned}$$

A similar argument gives the same upper bound for $\mathbb {E}\{{f(Y\,|\theta )}\slash {f(Y_U\,|\,\theta )^{N/n}}\}^2$ so that Eq. (A.8) yields

$$\begin{aligned} \mathbb {E}\left\{ \frac{\phi _U(\theta ')}{\phi _U(\theta )}\right\} \le \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} \,. \end{aligned}$$

The proof is completed by noting that for three numbers a, b and c, $c>b\Rightarrow a\vee b\le a\vee c$ and $\gamma \Vert \Delta _n(U)\Vert >0$. $\square $

Lemma 4

Assume that Assumption A.4 holds. Then we have for all $\theta \in \Theta $ and $\delta >0$

$$\begin{aligned} \mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \delta /2\right\} \le \frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\{\Vert \Delta _n(U)\Vert \}\,. \end{aligned}$$

Proof

With the same notation as in the proof of Lemma 3 and essentially the same reasoning, we have for all $\theta \in \Theta $ and all $\delta >0$

$$\begin{aligned}&\mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \delta /2\right\} \\&\quad =\mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \delta /2\cap \mathcal {E}_\theta \right\} \\&\qquad +\,\mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \delta /2\cap \overline{\mathcal {E}_\theta }\right\} \\&\quad = \mathbb {P}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\ge 1+\delta /2\cap \mathcal {E}_\theta \right\} \\&\qquad +\,\mathbb {P}\left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\le 1-\delta /2\cap \overline{\mathcal {E}_\theta }\right\} \\&\quad \le \mathbb {P}\left\{ e^{\gamma \Vert \Delta _n(U)\Vert }\ge 1+\delta /2\cap \mathcal {E}_\theta \right\} \\&\qquad +\,\mathbb {P}\left\{ e^{-\gamma \Vert \Delta _n(U)\Vert }\le 1-\delta /2\cap \overline{\mathcal {E}_\theta }\right\} \\&\quad \le \mathbb {P}\left\{ \gamma \Vert \Delta _n(U)\Vert \ge \log (1+\delta /2)\right\} \\&\qquad + \,\mathbb {P}\left\{ \gamma \Vert \Delta _n(U)\Vert \ge -\log (1-\delta /2)\right\} \,, \end{aligned}$$

where the first inequality follows by inclusion (on $\mathcal {E}_\theta $) of

$$\begin{aligned} \left\{ \frac{f(Y_U\,|\,\theta )^{N/n}}{f(Y\,|\,\theta )}\ge 1+\delta /2\right\} \subset \left\{ e^{\gamma \Vert \Delta _n(U)\Vert }\ge 1+\delta /2\right\} \end{aligned}$$

and similarly for the second term. Now, note that for all $x>0$, $\log (1+x)<-\log (1-x)$ so that

$$\begin{aligned} \mathbb {P}\left\{ \left| \phi _U(\theta )-1\right| \ge \delta /2\right\}\le & {} 2\mathbb {P}\left\{ \gamma \Vert \Delta _n(U)\Vert \ge \log (1+\delta /2)\right\} \\\le & {} \frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\left\{ \Vert \Delta _n(U)\Vert \right\} \,, \end{aligned}$$

where the last inequality follows from Markov's inequality. $\square $

We study the limiting case where N is fixed and $n\rightarrow N$.

Lemma 5

Assume N is fixed and let $n\rightarrow N$. Then,

$$\begin{aligned} \mathbb {E}\{\Vert \Delta _n(U)\Vert \}\rightarrow 0\quad \text {and}\quad \mathbb {E}\left\{ \exp {2\gamma \Vert \Delta _n(U)\Vert }\right\} \rightarrow 1\,. \end{aligned}$$

Proof

It follows from the fact that when $n\rightarrow N$, $\nu _{n,\epsilon }$ converges to the Dirac mass at $U^\dag =\{1,\ldots ,N\}$ and therefore,

$$\begin{aligned}&\mathbb {E}\{\Vert \Delta _n(U)\Vert \}\rightarrow \Vert \Delta \bar{S}(U^\dag )\Vert =0\quad \text {and}\\&\quad \mathbb {E}\left\{ \exp {2\gamma \Vert \Delta _n(U)\Vert }\right\} \rightarrow \exp {2\gamma \Vert \Delta \bar{S}(U^\dag )\Vert }=1\,. \end{aligned}$$

$\square $
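The limiting behaviour in Lemma 5 can be observed on a toy dataset by exhaustive enumeration. The sketch below is illustrative only: it assumes $S(y)=y$, uses arbitrary data values and constants, and the name `delta_moments` is hypothetical. When $n=N$ the only subset is $\{1,\ldots ,N\}$, so $\Delta _n(U)=0$ and both moments take their limiting values.

```python
import itertools
import numpy as np

def delta_moments(y, n, eps, gamma):
    """Return E|Delta_n(U)| and E exp(2*gamma*|Delta_n(U)|) under nu_{n,eps}."""
    N = len(y)
    # |Delta_n(U)| for every subset U of size n, with S(y) = y assumed.
    deltas = np.array([abs(y.sum() - (N / n) * y[list(U)].sum())
                       for U in itertools.combinations(range(N), n)])
    log_w = -eps * deltas**2
    nu = np.exp(log_w - log_w.max())    # unnormalised nu_{n,eps}, stabilised
    nu /= nu.sum()
    return float(nu @ deltas), float(nu @ np.exp(2 * gamma * deltas))

if __name__ == "__main__":
    y = np.array([0.1, 1.3, -0.7, 2.2, 0.4, -1.5, 0.9, 3.0])
    for n in range(1, len(y) + 1):
        print(n, delta_moments(y, n, eps=0.5, gamma=0.1))
```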

We can now prove Proposition 4:

Proposition 4

Assume that A.3 and A.4 hold. If the marginal MH chain K is geometrically ergodic, i.e. A.1 holds, then there exists an $n_0\le N$ such that for all $n>n_0$, $\tilde{K}_n$ is also geometrically ergodic.

Proof

By (Meyn and Tweedie 2009, Theorems 14.0.1 & 15.0.1), there exists a function $V:\textsf {X}\rightarrow [1,\infty [$, two constants $\lambda \in (0,1)$ and $b<\infty $ and a small set $S\subset \textsf {X}$ such that K satisfies a drift condition:

$$\begin{aligned} KV\le \lambda V+b{\mathbb {1}}_S\,. \end{aligned}$$

(A.9)

We now show how to use the previous Lemmas to establish the geometric ergodicity of $\tilde{K}_n$ for some n sufficiently large. This reasoning is very similar to that presented in (Medina-Aguayo et al. 2016, Theorem 3.2).

$$\begin{aligned}&(\tilde{K}_n-K)V(\theta )\nonumber \\&\quad =\int Q(\theta ,\text {d}\theta ')\left( \tilde{\alpha }(\theta ,\theta ')-\alpha (\theta ,\theta ')\right) V(\theta ')\nonumber \\&\qquad +\,\left( \tilde{\rho }(\theta )-\rho (\theta )\right) V(\theta )\nonumber \\&\quad \le \left( \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} -1\right) \int Q(\theta ,\text {d}\theta ')\alpha (\theta ,\theta ')V(\theta ')\nonumber \\&\qquad +\,\left( \delta +\frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\left\{ \Vert \Delta _n(U)\Vert \right\} \right) V(\theta )\nonumber \\&\quad \le \left( \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} -1\right) \left( \lambda V(\theta )+b{\mathbb {1}}_S(\theta )-\rho (\theta )V(\theta )\right) \nonumber \\&\qquad +\,\left( \delta +\frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\left\{ \Vert \Delta _n(U)\Vert \right\} \right) V(\theta )\nonumber \\&\quad \le \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} b{\mathbb {1}}_S(\theta )\nonumber \\&\qquad +\,\left( \lambda \left( \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} -1\right) +\delta \right. \nonumber \\&\qquad \left. +\,\frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\left\{ \Vert \Delta _n(U)\Vert \right\} \right) V(\theta ) \end{aligned}$$

(A.10)

Combining Eq. (A.9) with Eq. (A.10), we have that

$$\begin{aligned} \tilde{K}_n V(\theta )\le & {} \left\{ 1+\mathbb {E}e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} b{\mathbb {1}}_S(\theta )\nonumber \\&+\,\left( \lambda \mathbb {E}\left\{ e^{2\gamma \Vert \Delta _n(U)\Vert }\right\} +\delta \right. \nonumber \\&\left. +\,\frac{2\gamma }{\log (1+\delta /2)}\mathbb {E}\left\{ \Vert \Delta _n(U)\Vert \right\} \right) V(\theta ) \end{aligned}$$

(A.11)

Fix $\epsilon >0$. From Lemma 5, there exists $(n_1,n_2)\in \mathbb {N}^2$ such that

$$\begin{aligned} n\ge & {} n_1\Rightarrow \mathbb {E}\exp \{2\gamma \Vert \Delta _n(U)\Vert \}-1\le \epsilon \,, \nonumber \\ n\ge & {} n_2\Rightarrow \mathbb {E}\Vert \Delta _n(U)\Vert \le \epsilon \log (1+\epsilon /4)/4\gamma \,. \end{aligned}$$

(A.12)

Combining Eqs. (A.11) and (A.12) yields that for all $n\ge n_0:=\max (n_1,n_2)$, we have

$$\begin{aligned} \tilde{K}_n V(\theta )\le & {} (\epsilon +1)b{\mathbb {1}}_S(\theta )\nonumber \\&+V(\theta )\left( {\lambda (\epsilon +1)}+\delta +\frac{\epsilon \log (1+\epsilon /4)}{2\log (1+\delta /2)}\right) \,.\nonumber \\ \end{aligned}$$

(A.13)

Taking $\delta =\epsilon /2$ in Eq. (A.13) gives

$$\begin{aligned} \tilde{K}_n V(\theta )\le (\epsilon +1)b{\mathbb {1}}_S(\theta )+V(\theta )\left\{ \epsilon \left( \lambda +1\right) +\lambda \right\} \,. \end{aligned}$$

To show that $\tilde{K}_n$ (for $n>n_0$) satisfies a geometric drift condition, it is sufficient to take $\epsilon <(1-\lambda )/(1+\lambda )$ and to check that S is also small for $\tilde{K}_n$. This is demonstrated exactly as in the proof of Medina-Aguayo et al. (2016, Theorem 3.2). $\square $
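For readability, the arithmetic behind the choice $\delta =\epsilon /2$ can be spelled out; this worked step uses only the quantities appearing in Eq. (A.13). Since $\log (1+\delta /2)=\log (1+\epsilon /4)$ when $\delta =\epsilon /2$, the coefficient of $V(\theta )$ in Eq. (A.13) becomes

$$\begin{aligned} \lambda (\epsilon +1)+\frac{\epsilon }{2}+\frac{\epsilon \log (1+\epsilon /4)}{2\log (1+\epsilon /4)}=\lambda +\epsilon (\lambda +1)\,, \end{aligned}$$

and $\lambda +\epsilon (\lambda +1)<1$ holds if and only if $\epsilon <(1-\lambda )/(1+\lambda )$, which is the stated condition for a geometric drift.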

1.5 Proof of Proposition 5

This proof borrows ideas from the perturbation analysis of uniformly ergodic Markov chains. First, note that by straightforward algebra we have that

$$\begin{aligned}&\Vert K(\theta ,\,\cdot \,)-\tilde{K}(\theta ,\,\cdot \,)\Vert \nonumber \\&\quad \le \int Q(\theta ,\text {d}\theta ')\mathbb {E}\left| \alpha (\theta ,\theta ')-\tilde{\alpha }(\theta ,\theta '\,|\,U)\right| \,,\nonumber \\&\quad \le \int Q(\theta ,\text {d}\theta ')\mathbb {E}\left| a(\theta ,\theta ')-\tilde{a}(\theta ,\theta '\,|\,U)\right| \,,\nonumber \\&\quad =\int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\mathbb {E}\left| 1-\frac{\phi _U(\theta ')}{\phi _U(\theta )}\right| \,,\nonumber \\&\quad =\mathbb {E}\left\{ \int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\left| \phi _U(\theta )\right. \right. \nonumber \\&\qquad \left. \left. -\,{\phi _U(\theta ')}\right| \frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \,,\nonumber \\&\quad \le \mathbb {E}\left\{ \sup _{\theta \in \Theta }\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\right. \nonumber \\&\qquad \left. \left| \phi _U(\theta )-{\phi _U(\theta ')}\right| \right\} \,,\nonumber \\&\quad \le \mathbb {E}\left\{ \sup _{\theta \in \Theta }\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \sup _{U\in \textsf {U}_n}\int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\nonumber \\&\qquad \left| \phi _U(\theta )-{\phi _U(\theta ')}\right| \,. \end{aligned}$$

(A.14)

Now, under Assumption A.2 and using Mitrophanov (2005, Corollary 3.1) we have that for any starting point $\theta _0\in \Theta $,

$$\begin{aligned}&\Vert K^i(\theta _0,\,\cdot \,)-\tilde{K}^i(\theta _0,\,\cdot \,)\Vert \nonumber \\&\quad \le \left( \lambda +\frac{C\rho ^\lambda }{1-\rho }\right) \sup _{\theta \in \Theta }\Vert K(\theta ,\,\cdot \,)-\tilde{K}(\theta ,\,\cdot \,)\Vert \,, \end{aligned}$$

(A.15)

where $\lambda =\lceil \log (1/C)/\log \rho \rceil $. Combining Eqs. (A.14) and (A.15) leads to Eq. (29) with $\kappa =\lambda +C\rho ^\lambda /(1-\rho )$. Moreover, note that using Eq. (29) we have

$$\begin{aligned} \sup _{\theta \in \Theta }\Vert \pi -\tilde{K}^i(\theta ,\,\cdot \,)\Vert\le & {} \sup _{\theta \in \Theta }\Vert \pi -K^i(\theta ,\,\cdot \,)\Vert \\&+\sup _{\theta \in \Theta }\Vert K^i(\theta ,\,\cdot \,)-\tilde{K}^i(\theta ,\,\cdot \,)\Vert \,,\\\le & {} C\rho ^i+\kappa A_n\sup _{(\theta ,U)\in \Theta \times \textsf {U}_n}B_n(\theta ,U) \end{aligned}$$

and taking the limit as $i\rightarrow \infty $ leads to Eq. (30). Finally, for a large enough n, we know from Proposition 4 that the marginal Markov chain $\{\tilde{\theta }_i\,,i\in \mathbb {N}\}$ produced by ISS-MCMC is geometrically ergodic and we denote by $\tilde{\pi }_n$ its stationary distribution. For such an n, we have for any $\theta _0\in \Theta $

$$\begin{aligned} \Vert \pi -\tilde{\pi }_n\Vert\le & {} \Vert K^i(\theta _0,\,\cdot \,)-\pi \Vert +\Vert \tilde{K}^i(\theta _0,\,\cdot \,)-\tilde{\pi }_n\Vert \\&+\Vert K^i(\theta _0,\,\cdot \,)-\tilde{K}^i(\theta _0,\,\cdot \,)\Vert \\\le & {} \Vert K^i(\theta _0,\,\cdot \,)-\pi \Vert +\Vert \tilde{K}^i(\theta _0,\,\cdot \,)-\tilde{\pi }_n\Vert \\&+\kappa A_n\sup _{(\theta ,U)\in \Theta \times \textsf {U}_n}B_n(\theta ,U) \end{aligned}$$

and taking the limit as $i\rightarrow \infty $ yields Eq. (31).

1.6 Extension of Proposition 5 beyond the time homogeneous case

We start with the two following remarks relative to the Informed Sub-Sampling Markov chain.

Remark 1

Assume $U_0\sim \nu _{n,\epsilon }$ and $\tilde{\theta }_0\sim \mu $ for some initial distribution $\mu $ on $(\Theta ,\vartheta )$. The distribution of $U_i$ given $\tilde{\theta }_i$ is, for any $u\in \textsf {U}_n$,

$$\begin{aligned} \mathbb {P}(U_i=u\,|\,\tilde{\theta }_i) \propto \sum _{U_0\in \textsf {U}_n}\int _{\tilde{\theta }_0\in \Theta } \nu _{n,\epsilon }(U_0)\mu (\text {d}\tilde{\theta }_0)\bar{K}^{i}(\tilde{\theta }_0,U_0;\tilde{\theta }_i,u)\,, \end{aligned}$$

where $\bar{K}(\theta ,U;\text {d}\theta ',U'):=K(\theta ,\text {d}\theta '\,|\,U)H(U,U')$ and H is the transition kernel of the Markov chain $\{U_i,\,i\in \mathbb {N}\}$. As a consequence $\mathbb {P}(U_i\in \,\cdot \,|\,\tilde{\theta })$ depends on $\tilde{\theta }$ and i.

Remark 2

The marginal Markov chain $\{\tilde{\theta }_i,\,i\in \mathbb {N}\}$ produced by the ISS-MCMC algorithm is time inhomogeneous since for all $A\in \mathcal {X}$,

$$\begin{aligned} \tilde{K}(\theta _{i-1},A):= & {} \mathbb {P}(\tilde{\theta }_i\in A\,|\,\tilde{\theta }_{i-1})\nonumber \\= & {} \sum _{u\in \textsf {U}_n}{K}(\tilde{\theta }_{i-1},\text {d}\tilde{\theta }_i\,|\,U_i)\mathbb {P}(U_i=u\,|\,\tilde{\theta }_i)\,, \end{aligned}$$

(A.16)

and $\mathbb {P}(U_i=u\,|\,\tilde{\theta }_i)$ depends on i (Remark 1). We thus denote by $\tilde{K}_i$ the marginal transition kernel $\tilde{\theta }_{i-1}\rightarrow \tilde{\theta }_i$. However, we observe that if the random variables $\{U_i,\,i\in \mathbb {N}\}$ are i.i.d. with distribution $\nu _{n,\epsilon }$, $\tilde{K}_i$ becomes time homogeneous as $\mathbb {P}(U_i=u\,|\,\theta _i)=\nu _{n,\epsilon }(u)$ for all i.

A consequence of Remark 2 is that Mitrophanov (2005, Theorem 3.1) does not hold when Assumption A.3 is not satisfied. Indeed, $\{\tilde{\theta }_i,\,i\in \mathbb {N}\}$ is not a time homogeneous Markov chain in this case and we first need to generalize the result from Mitrophanov in order to apply it to our context. This is presented in Lemma 6.

Lemma 6

Let K be the transition kernel of a uniformly ergodic Markov chain that admits $\pi $ as stationary distribution. Let $\tilde{K}_i$ be the i-th transition kernel of the ISS-MCMC Markov chain. In particular, let $p_i(\,\cdot \,|\,\theta ):=\mathbb {P}(U_i\in \,\cdot \,|\,\theta )$ be the distribution, given $\theta $, of the random variable $U_i$ used at iteration i of the noisy Markov chain. We have:

$$\begin{aligned} \lim _{i\rightarrow \infty }\Vert \pi -\tilde{\pi }_i\Vert \le \kappa \sup _{\theta \in \Theta }\sup _{i\in \mathbb {N}}\int \delta _i(\theta ,\theta ')Q(\theta ,\text {d}\theta ')\,, \end{aligned}$$

(A.17)

where $\delta _i:\Theta \times \Theta \rightarrow \mathbb {R}^+$ is a function that satisfies

$$\begin{aligned} \mathbb {E}_i\left\{ \left| a(\theta ,\theta ')-\tilde{a}(\theta ,\theta '\,|\,U)\right| \right\} \le \delta _i(\theta ,\theta ') \end{aligned}$$

and the expectation is under $p_i(\,\cdot \,|\,\theta )$.

Proof

In addition to the notation of Sect. 4, we define the following quantities for a Markov transition kernel regarded as an operator on $\mathcal {M}$, the space of signed measures on $(\Theta ,\mathcal {B}(\Theta ))$: $\tau (K):=\sup _{\pi \in \mathcal {M}_{0,1}}\Vert \pi K\Vert $ is the ergodicity coefficient of K, $\Vert K\Vert :=\sup _{\pi \in \mathcal {M}_{1}}\Vert \pi K\Vert $ is the operator norm of K and $\mathcal {M}_1:=\{\pi \in \mathcal {M},\,\Vert \pi \Vert =1\}$ and $\mathcal {M}_{0,1}:=\{\pi \in \mathcal {M}_1,\,\pi (\Theta )=0\}$.

Remarks 1 and 2 explain why, in general, $\{\tilde{\theta }_i,\,i\in \mathbb {N}\}$ is a time-inhomogeneous Markov chain with transition kernel $\{\tilde{K}_i,\,i\in \mathbb {N}\}$. For each $i\in \mathbb {N}$, define $\pi _i$ as the distribution of $\theta _i$ produced by the Metropolis–Hastings algorithm (Algorithm 1) with transition kernel K, referred to as the exact kernel hereafter. Our proof is based on the following identity:

$$\begin{aligned} K^i-\tilde{K}_1\tilde{K}_2\cdots \tilde{K}_i= & {} (K-\tilde{K}_1)K^{i-1}+\tilde{K}_1(K-\tilde{K}_2)K^{i-2}\nonumber \\&+\tilde{K}_1\tilde{K}_2(K-\tilde{K}_3)K^{i-3}+\cdots \nonumber \\&+\tilde{K}_1\cdots \tilde{K}_{i-1}(K-\tilde{K}_i)\,, \end{aligned}$$

(A.18)

for each $i\in \mathbb {N}$. Equation (A.18) helps translate the proof of Theorem 3.1 in Mitrophanov (2005) to the time-inhomogeneous setting; in particular, we have for each $i\in \mathbb {N}$:

$$\begin{aligned} \pi _i-\tilde{\pi }_i=(\pi _0-\tilde{\pi }_0)K^i+\sum _{j=0}^{i-1}\tilde{\pi }_j(K-\tilde{K}_{j+1})K^{i-j-1}\,. \end{aligned}$$

(A.19)
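The telescoping identity (A.18) is purely algebraic and can be checked numerically on small finite-state kernels. The sketch below is only an illustration: the 3-state random stochastic matrices and all helper names (`matmul`, `random_stochastic`, `telescoping_gap`, ...) are hypothetical stand-ins for $K$ and $\tilde{K}_1,\ldots ,\tilde{K}_i$, not objects from the paper.

```python
import random


def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]


def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]


def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def matpow(K, m):
    out = identity(len(K))
    for _ in range(m):
        out = matmul(out, K)
    return out


def random_stochastic(d, rng):
    """Random d x d matrix with positive entries and rows summing to one."""
    rows = [[rng.random() + 0.1 for _ in range(d)] for _ in range(d)]
    return [[x / sum(row) for x in row] for row in rows]


def telescoping_gap(K, K_tildes):
    """Largest entrywise gap between the two sides of (A.18)."""
    i, d = len(K_tildes), len(K)
    prefix = identity(d)                      # running product K~_1 ... K~_{j-1}
    rhs = [[0.0] * d for _ in range(d)]
    for j, K_t in enumerate(K_tildes, start=1):
        term = matmul(matmul(prefix, matsub(K, K_t)), matpow(K, i - j))
        rhs = matadd(rhs, term)
        prefix = matmul(prefix, K_t)
    lhs = matsub(matpow(K, i), prefix)        # K^i - K~_1 ... K~_i
    return max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))


rng = random.Random(0)
K = random_stochastic(3, rng)
K_tildes = [random_stochastic(3, rng) for _ in range(5)]
print(telescoping_gap(K, K_tildes))  # expected to be at floating-point rounding level
```

Each term of the sum isolates the one-step discrepancy $K-\tilde{K}_j$, which is why the final bound only involves $\sup _j\Vert K-\tilde{K}_j\Vert $.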

Following the proof of Theorem 3.1 in Mitrophanov (2005), we obtain

$$\begin{aligned}&\Vert \pi _i-\tilde{\pi }_i\Vert \le \Vert \pi _0-\tilde{\pi }_0\Vert \tau (K^i)+\sum _{j=0}^{i-1}\Vert K-\tilde{K}_{i-j}\Vert \tau (K^{j})\,,\nonumber \\&\le \left\{ \begin{array}{lc} \Vert \pi _0-\tilde{\pi }_0\Vert +i\sup _{j\le i}\Vert K-\tilde{K}_j\Vert &{}\text {if }i\le \lambda \\ \Vert \pi _0-\tilde{\pi }_0\Vert C\rho ^i+\sup _{j\le i}\Vert K-\tilde{K}_j\Vert \left\{ \lambda +C\frac{\rho ^\lambda -\rho ^{i}}{1-\rho }\right\} &{}\text {else}\\ \end{array} \right. \nonumber \\ \end{aligned}$$

(A.20)

where $\lambda =\left\lceil {\log _\rho (1/C)}\right\rceil $. Without loss of generality, we take $\pi _0=\tilde{\pi }_0$ and since $\Vert \pi -\tilde{\pi }_i\Vert \le \Vert \pi -\pi _i\Vert +\Vert \pi _i-\tilde{\pi }_i\Vert $ we have for all $i>\lambda $ that

$$\begin{aligned} \Vert \pi -\tilde{\pi }_i\Vert \le \left\{ \lambda +C\frac{\rho ^\lambda -\rho ^{i}}{1-\rho }\right\} \sup _{j\le i}\Vert K-\tilde{K}_j\Vert \,. \end{aligned}$$

(A.21)

Taking the limit as $i\rightarrow \infty $ leads to

$$\begin{aligned} \lim _{i\rightarrow \infty }\Vert \pi -\tilde{\pi }_i\Vert \le \left\{ \lambda +C\frac{\rho ^\lambda }{1-\rho }\right\} \sup _{i\in \mathbb {N}}\Vert K-\tilde{K}_i\Vert \,. \end{aligned}$$

(A.22)
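The case split in (A.20) comes from bounding each ergodicity coefficient by $\tau (K^j)\le \min (1,C\rho ^j)$: the first $\lambda $ terms are bounded by 1 and the remaining ones form a geometric sum. A minimal numerical sketch of this step follows, where the function name `tau_sum_and_bound` and the chosen values of $C$ and $\rho $ are assumptions for illustration only.

```python
import math


def tau_sum_and_bound(C, rho, i):
    """Compare sum_{j<i} min(1, C rho^j) with the closed-form bound used in (A.20).

    Hypothetical helper: assumes C >= 1 and 0 < rho < 1.
    """
    lam = math.ceil(math.log(1.0 / C) / math.log(rho))
    total = sum(min(1.0, C * rho ** j) for j in range(i))
    if i <= lam:
        bound = float(i)                                   # every term is at most 1
    else:
        bound = lam + C * (rho ** lam - rho ** i) / (1.0 - rho)  # lam ones + geometric tail
    return total, bound


for i in (1, 3, 5, 10, 50):
    total, bound = tau_sum_and_bound(C=3.0, rho=0.8, i=i)
    print(i, round(total, 4), round(bound, 4))
```

As $i$ grows, the bound increases towards $\lambda +C\rho ^\lambda /(1-\rho )$, the constant appearing in (A.22).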

Using a derivation similar to that in the proof of Corollary 2.3 in Alquier et al. (2016), we obtain

$$\begin{aligned} \Vert K-\tilde{K}_i\Vert \le \sup _{\theta \in \Theta }\int Q(\theta ,\text {d}\theta ')\mathbb {E}_i\left| a(\theta ,\theta ')-\tilde{a}(\theta ,\theta '\,|\,U_i)\right| \,, \end{aligned}$$

where the expectation is under $p_i(\,U\,|\,\theta )$ and which combined with (A.22) leads to

$$\begin{aligned} \lim _{i\rightarrow \infty }\Vert \pi _i-\tilde{\pi }_i\Vert\le & {} \left( \lambda +C\frac{\rho ^\lambda }{1-\rho }\right) \sup _{\theta \in \Theta }\sup _{i\in \mathbb {N}}\mathbb {E}_i\left| a(\theta ,\theta ')\right. \\&\left. -\tilde{a}(\theta ,\theta '\,|\,U_i)\right| \end{aligned}$$

where the expectation is under $Q(\theta ,\cdot )\otimes p_i(\,\cdot \,|\,\theta )$. Any upper bound $\delta _i(\theta ,\theta ')$ of the expectation on the right hand side yields (A.17). $\square $

By straightforward algebra, we have:

$$\begin{aligned}&\mathbb {E}_i\left| a(\theta ,\theta ')-\tilde{a}(\theta ,\theta '\,|\,U_i)\right| \nonumber \\&\quad =a(\theta ,\theta ')\mathbb {E}_i\left\{ \frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\left| \phi _U(\theta )-\phi _U(\theta ')\right| \right\} \end{aligned}$$

(A.23)

where we have defined $\phi _U(\theta )=f(Y_U\,|\,\theta )^{N/n}\slash f(Y\,|\,\theta )$. Using Lemma 6, we have that

$$\begin{aligned}&\lim _{i\rightarrow \infty }\Vert \pi -\tilde{\pi }_i\Vert \le \kappa \sup _{\theta \in \Theta }\sup _{i\in \mathbb {N}}\mathbb {E}_i\left\{ \sup _{\theta \in \Theta }\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}} \right. \nonumber \\&\qquad \int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\left. \left| \phi _U(\theta )-\phi _U(\theta ')\right| \right\} \,,\nonumber \\&\quad \le \kappa \sup _{\theta \in \Theta }\sup _{i\in \mathbb {N}}\mathbb {E}_i\left\{ \sup _{\theta \in \Theta }\frac{f(Y\,|\,\theta )}{f(Y_U\,|\,\theta )^{N/n}}\right\} \sup _{(\theta ,U)\in \Theta \times \textsf {U}_n}\nonumber \\&\qquad \int Q(\theta ,\text {d}\theta ')a(\theta ,\theta ')\left| \phi _U(\theta )-\phi _U(\theta ')\right| \,. \end{aligned}$$

(A.24)

which is the counterpart of (30) when Assumption A.3 does not hold. We note that the second supremum in Eq. (A.24) is in fact $B_n$ defined at Eq. (28) and, as such, can be controlled as described in Section 5.3.1. However, this is not clearly the case for the first supremum in Eq. (A.24) which differs from $A_n$ defined at Eq. (27):

$$\begin{aligned} \tilde{A}_n:=\sup _i\sup _\theta \mathbb {E}_i\{\sup _\theta 1/\phi _U(\theta )\}\ne \mathbb {E}\{\sup _\theta 1/\phi _U(\theta )\}=A_n\,. \end{aligned}$$

(A.25)
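The algebra behind (A.23) can be checked on a toy model for a single fixed subset $U$ (that is, before taking the expectation $\mathbb {E}_i$). The sketch below assumes i.i.d. $\mathcal {N}(\theta ,1)$ observations, a flat prior and a symmetric proposal, so that $a$ reduces to a likelihood ratio and $\tilde{a}$ to its subsampled, rescaled version; these modelling choices and all function names are assumptions made for the example.

```python
import math


def gaussian_loglik(ys, theta):
    """Log-likelihood of i.i.d. N(theta, 1) observations (toy model assumed here)."""
    return sum(-0.5 * (y - theta) ** 2 - 0.5 * math.log(2.0 * math.pi) for y in ys)


def check_a23(Y, U, theta, theta_new):
    """Return both sides of (A.23) for one fixed subset U (no expectation taken)."""
    N, n = len(Y), len(U)
    Y_U = [Y[k] for k in U]
    scale = N / n

    def phi(t):
        # phi_U(t) = f(Y_U | t)^{N/n} / f(Y | t)
        return math.exp(scale * gaussian_loglik(Y_U, t) - gaussian_loglik(Y, t))

    a = math.exp(gaussian_loglik(Y, theta_new) - gaussian_loglik(Y, theta))
    a_tilde = math.exp(scale * (gaussian_loglik(Y_U, theta_new) - gaussian_loglik(Y_U, theta)))
    lhs = abs(a - a_tilde)
    rhs = a * (1.0 / phi(theta)) * abs(phi(theta) - phi(theta_new))
    return lhs, rhs


Y = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
lhs, rhs = check_a23(Y, U=[0, 1], theta=0.0, theta_new=0.5)
print(lhs, rhs)
```

The subset used here is deliberately unrepresentative of the full data, which makes $\tilde{a}$ very different from $a$; a subset whose summary statistics match those of $Y$ would shrink $|\phi _U(\theta )-\phi _U(\theta ')|$.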

We now show that, under two additional Assumptions (A.5 and A.6), the control based on the summary statistics also applies to the time inhomogeneous case when Assumption A.3 does not hold.

A 5

One-step minorization. For all $i\in \mathbb {N}$ and all $A\in \vartheta $, there exists some $\eta >0$ such that $p_i(A)>\eta \lambda (A)$ where $\lambda $ is the Lebesgue measure.

This assumption typically holds if $\Theta $ is compact or if the chain $\{\tilde{\theta }_i,U_i\}_i$ admits a minorization condition. In this discussion, we assume that the exact MH Markov chain is uniformly ergodic and, as such, satisfies a minorization condition; see e.g. Meyn and Tweedie (2009, Thm 16.2.3) and Hobert and Robert (2004). One may study conditions under which $\{\tilde{\theta }_i\}_i$ inherits this property; we leave this for future work but already note that Assumption A.5 is not totally unrealistic.

A 6

The marginal Markov chain $\{U_i\}_i$ has initial distribution $U_0\sim \nu _{n,\epsilon }$.

Even though this assumption is difficult to meet in practice as $|\textsf {U}_n|$ may be very large, the discussion at the beginning of Section 6.1 indicates an approach to set the distribution of $U_0$ close to $\nu _{n,\epsilon }$.

Again, while Assumptions A.5 and A.6 are perhaps challenging to guarantee, Proposition 8 aims at giving some level of confidence to the user that the ISS-MCMC method is useful, even when Assumption A.3 does not hold. In addition, it reinforces the importance of choosing summary statistics that satisfy Assumption A.4.

Proposition 8

Assume that Assumptions A.1, A.4, A.5 and A.6 hold. Then there exists a positive number $M>0$ such that

$$\begin{aligned} \tilde{A}_n\le M A_n\,, \end{aligned}$$

(A.26)

where $A_n$ and $\tilde{A}_n$ have been defined at Eq. (A.25).

Corollary 2

Under the same Assumptions as Proposition 8, the control explained in Sect. 5.3.2 is also valid in the time inhomogeneous case.

Proof of Proposition 8

From Assumption A.4, there exists some $\gamma >0$ such that

$$\begin{aligned} \tilde{A}_n= & {} \sup _i\sup _\theta \mathbb {E}_{i}\{f(Y\,|\,\theta )/f(Y_U\,|\,\theta )^{N/n}\}\nonumber \\\le & {} \sup _i\sup _\theta \int \text {d}p_i(U\,|\,\theta )e^{\gamma \Vert \Delta _n(U)\Vert }\,, \end{aligned}$$

(A.27)

where $\text {d}p_i(U\,|\,\theta )=p_i(U\,|\,\theta )\text {d}U$ and $\text {d}U$ is the counting measure. Now, the conditional probability writes:

$$\begin{aligned} p_i(U\in \cdot \,|\,\theta ):=\mathbb {P}(U_i\in \cdot \,,\,\tilde{\theta }_i\in \text {d}\theta )/\mathbb {P}(\tilde{\theta }_i\in \text {d}\theta )\,. \end{aligned}$$

On the one hand, Lemma 1 shows that there exists a bounded function $f_i$ such that $\mathbb {P}(U_i\in \cdot \,,\,\tilde{\theta }_i\in \text {d}\theta )\le f_i(\tilde{\theta })\text {d}\theta \nu _{n,\epsilon }(\,\cdot \,)$. On the other hand, Assumption A.5 guarantees that there exists some $\eta >0$ such that for all $\tilde{\theta }\in \Theta $, $\mathbb {P}(\tilde{\theta }_i\in \text {d}\theta )>\eta \text {d}\theta $. Combining these two facts allows us to write

$$\begin{aligned} p_i(U\in \cdot \,|\,\theta )\le \frac{f_i(\theta )\text {d}\theta \nu _{n,\epsilon }(\,\cdot \,)}{\eta \text {d}\theta }=\frac{f_i(\theta )}{\eta }\nu _{n,\epsilon }(\,\cdot \,)\,. \end{aligned}$$

(A.28)

Plugging Eq. (A.28) into Eq. (A.27) yields

$$\begin{aligned} \tilde{A}_n\le \sup _i\sup _\theta \int \text {d}p_i(U\,|\,\theta )e^{\gamma \Vert \Delta _n(U)\Vert }\le \sup _\theta \sup _i \frac{f_i(\tilde{\theta })}{\eta } A_n\,, \end{aligned}$$

which completes the proof, setting $M:=\sup _\theta \sup _i f_i(\theta )/\eta $. $\square $

Lemma 1

Assume that Assumptions A.1, A.4, A.5 and A.6 hold. In addition, let us assume that $U_0\sim \nu _{n,\epsilon }$. Then $p_i(\theta ,U)$ is dominated by $\text {d}\theta \text {d}U$ where $\text {d}\theta $ and $\text {d}U$ implicitly refer to the Lebesgue and the counting measure, respectively. In other words there is a sequence of bounded functions $\{f_i:\Theta \rightarrow \mathbb {R}^+\}$ such that

$$\begin{aligned} \text {d}p_i(U,\tilde{\theta })\le f_i(\theta )\text {d}\tilde{\theta }\text {d}\nu _{n,\epsilon }(U)\,. \end{aligned}$$

(A.29)

Proof

We proceed by induction. Defining $\varrho (\tilde{\theta }\,|\,U)$ as the probability to reject an MH move for the parameter $\tilde{\theta }$ when the subset variable is U, we recall that $\varrho (\tilde{\theta }\,|\,U)<1$ and $\tilde{\alpha }(\tilde{\theta },\tilde{\theta }'\,|\,U)<1$. By assumption, the proposal kernel satisfies $Q(\tilde{\theta }, \,\text {d}\tilde{\theta }'\,)=Q(\tilde{\theta },\tilde{\theta }')\text {d}\tilde{\theta }'$, and we define the function $\overline{Q}:\theta \mapsto \sup _{\tilde{\theta }'\in \Theta }Q(\tilde{\theta }',\theta )$. Similarly, we define the function $\overline{\varrho }:\theta \mapsto \sup _{U\in \textsf {U}_n}\varrho (\theta \,|\,U)$. Carrying out the calculation separately for the continuous and the diagonal parts of the Metropolis–Hastings kernel $K(\theta ,\cdot \,|\,U)$ (see Eq. (25)), we have:

$$\begin{aligned}&\text {d}p_1(U,\tilde{\theta })\\&\quad =\int _{\tilde{\theta }_0\in \Theta }\sum _{U_0\in \textsf {U}_n} \mu (\text {d}\tilde{\theta }_0)\nu (U_0)H(U_0,U)K(\tilde{\theta }_0,\text {d}\tilde{\theta }\,|\,U)\,,\\&\quad \le \int \sum \mu (\text {d}\tilde{\theta }_0)\nu (U_0)H(U_0,U)Q(\tilde{\theta }_0,\text {d}\tilde{\theta })\tilde{\alpha }(\tilde{\theta }_0,\tilde{\theta }\,|\,U)\\&\qquad +\int \sum \mu (\text {d}\tilde{\theta }_0)\nu (U_0)H(U_0,U)\delta _{\tilde{\theta }_0}(\text {d}\tilde{\theta })\varrho (\tilde{\theta }_0\,|\,U)\,,\\&\quad \le \int \sum \mu (\text {d}\tilde{\theta }_0)\nu (U_0)H(U_0,U)\overline{Q}(\tilde{\theta })\text {d}\tilde{\theta }\\&\qquad +\int \sum \mu (\text {d}\tilde{\theta })\nu (U_0)H(U_0,U)\overline{\varrho }(\tilde{\theta })\,,\\&\quad \le \sum \nu (U_0)H(U_0,U)\overline{Q}(\tilde{\theta })\text {d}\tilde{\theta }\\&\qquad +\sum \nu (U_0)H(U_0,U)\mu (\tilde{\theta })\overline{\varrho }(\tilde{\theta }) \text {d}\tilde{\theta }\,,\\&\quad =\underbrace{\left\{ \overline{Q}(\tilde{\theta })+\mu (\tilde{\theta })\overline{\varrho }(\tilde{\theta })\right\} }_{:=f_1(\tilde{\theta })}\text {d}\tilde{\theta }\text {d}\nu (U)\,, \end{aligned}$$

where the last equality follows from the $\nu _{n,\epsilon }$-stationarity of H. In this derivation, we have defined $\mu $ as the initial distribution of the Markov chain $\{\tilde{\theta }_i\}_i$ and $\nu $ as a shorthand notation for $\nu _{n,\epsilon }$. Now, let us assume that there is a bounded function $f_{i-1}$ such that $\text {d}p_{i-1}(U,\tilde{\theta })\le f_{i-1}(\tilde{\theta })\text {d}\tilde{\theta }\text {d}\nu (U)$. Using the notation $\mu K:=\int \mu (\text {d}x)K(x,\cdot )$ for any Markov kernel K and a measure $\mu $ on some measurable space $(\textsf {X},\mathcal {X})$ and recalling that $\bar{K}$ is the transition kernel of ISS-MCMC on the extended space $\Theta \times \textsf {U}_n$, we have:

$$\begin{aligned}&\text {d}p_i(U,\tilde{\theta })\\&\quad =\sum _{U_{i-1}\in \textsf {U}_n}\int _{\tilde{\theta }_{i-1}\in \Theta }\bar{\mu } \bar{K}^{i-1}(U_{i-1},\text {d}\tilde{\theta }_{i-1})H(U_{i-1},U)\\&\qquad K(\tilde{\theta }_{i-1},\text {d}\tilde{\theta }\,|\,U)\,,\\&\quad \le \sum _{U_{i-1}\in \textsf {U}_n}\int _{\tilde{\theta }_{i-1}\in \Theta }\bar{\mu } \bar{K}^{i-1}(U_{i-1},\text {d}\tilde{\theta }_{i-1})H(U_{i-1},U)\overline{Q}(\tilde{\theta })\text {d}\tilde{\theta }\\&\qquad +\sum _{U_{i-1}\in \textsf {U}_n}\bar{\mu } \bar{K}^{i-1}(U_{i-1},\text {d}\tilde{\theta })H(U_{i-1},U)\varrho (\tilde{\theta }\,|\,U)\,,\\&\quad \le \sum _{U_{i-1}\in \textsf {U}_n}\bar{\mu } \bar{K}^{i-1}(U_{i-1})H(U_{i-1},U)\overline{Q}(\tilde{\theta })\text {d}\tilde{\theta }\\&\qquad +\sum _{U_{i-1}\in \textsf {U}_n} \text {d}p_{i-1}(U_{i-1},\tilde{\theta })H(U_{i-1},U)\varrho (\tilde{\theta }\,|\,U)\\&\quad \le \nu (U)\overline{Q}(\tilde{\theta })\text {d}\tilde{\theta }+f_{i-1}(\tilde{\theta })\sum _{U_{i-1}}H(U_{i-1},U)\varrho (\tilde{\theta }\,|\,U)\\&\quad \le \underbrace{\left\{ \overline{Q}(\tilde{\theta })+f_{i-1}(\tilde{\theta })\overline{\varrho }(\tilde{\theta })\right\} }_{:=f_i(\tilde{\theta })}\text {d}\tilde{\theta }\text {d}\nu (U) \end{aligned}$$

and $f_i$ is bounded. The first term in the third inequality follows from noting that

$$\begin{aligned}&\sum \bar{\mu }\bar{K}^{i-1}(U_{i-1})H(U_{i-1},U_i)\\&\quad =\sum \int \bar{\mu }\bar{K}^{i-2}(U_{i-2},\text {d}\tilde{\theta }_{i-2})\sum \int H(U_{i-2},U_{i-1})\\&\qquad K(\tilde{\theta }_{i-2},\text {d}\tilde{\theta }_{i-1}\,|\,U_{i-1})H(U_{i-1},U)\\&\quad =\sum \int \bar{\mu }\bar{K}^{i-2}(U_{i-2},\text {d}\tilde{\theta }_{i-2}) H^2(U_{i-2},U)=\cdots \\&\quad =\sum \int \mu (\text {d}\theta _0)\nu (U_0)H^{i}(U_0,U)=\nu (U)\,. \end{aligned}$$

$\square $
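The last display in the proof of Lemma 1 uses only the fact that $H$ leaves $\nu _{n,\epsilon }$ invariant: if $U_0\sim \nu $, then $U_i\sim \nu $ for every $i$. The sketch below illustrates this on a toy three-state space standing in for $\textsf {U}_n$. The Metropolis-type kernel built here (uniform proposal among the other states) is a hypothetical choice for the example and is not claimed to be the kernel $H$ used by ISS-MCMC.

```python
def metropolis_kernel(nu):
    """Toy kernel on {0, ..., m-1}: uniform proposal over the other states,
    Metropolis acceptance. It is reversible with respect to nu, hence nu-stationary."""
    m = len(nu)
    H = [[0.0] * m for _ in range(m)]
    for u in range(m):
        for v in range(m):
            if v != u:
                H[u][v] = (1.0 / (m - 1)) * min(1.0, nu[v] / nu[u])
        H[u][u] = 1.0 - sum(H[u])  # rejection mass stays on the diagonal
    return H


def push_forward(dist, H):
    """One step of the chain at the level of distributions: (dist H)(u)."""
    m = len(dist)
    return [sum(dist[u0] * H[u0][u] for u0 in range(m)) for u in range(m)]


nu = [0.5, 0.3, 0.2]
H = metropolis_kernel(nu)
dist = list(nu)            # U_0 ~ nu, as in Assumption A.6
for _ in range(10):
    dist = push_forward(dist, H)
print(dist)                # remains equal to nu up to rounding
```

Starting from any other initial distribution, the same loop would only approach $\nu $ asymptotically, which is why Assumption A.6 is needed for the exact equality used in the proof.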

1.7 Proof of Proposition 6

Proof

Note that for all $(\theta ,\zeta )\in \Theta \times \mathbb {R}^d$, a Taylor expansion of $\pi (\theta )$ and $\phi _U(\theta )$ at $\theta +\Sigma \zeta $ in (32) combined with the triangle inequality leads to:

$$\begin{aligned}&B(U,\theta )\\&\quad \le \frac{1}{\sqrt{N}}\mathbb {E}\left\{ \left| (M\zeta )^{T}\nabla _\theta \phi _U(\theta )\right| \left( 1+\frac{1}{\sqrt{N}}(M\zeta )^{T}\nabla _\theta \log \pi (\theta )\right) \right\} \\&\qquad +\frac{1}{2N}\mathbb {E}\left\{ |(M\zeta )^{T}\nabla _\theta ^2\phi _U(\theta )M\zeta |\right\} +\mathbb {E}\{R(\Vert M\zeta \Vert /\sqrt{N})\}\,, \end{aligned}$$

where the expectation is under $\Phi _d$ and $R(x)=o(x)$ at 0. Applying the Cauchy-Schwarz inequality gives:

$$\begin{aligned} B(U,\theta )\le & {} \frac{1}{\sqrt{N}}\mathbb {E}\{\Vert M\zeta \Vert \}\Vert \nabla _\theta \phi _U(\theta )\Vert \\&+\frac{1}{N}\mathbb {E}\{\Vert M\zeta \Vert ^2\}\Vert \nabla _\theta \phi _U(\theta )\Vert \Vert \nabla _\theta \log \pi (\theta )\Vert \\&+\frac{1}{2N}\mathbb {E}\{|\zeta ^{T}M^{T}\nabla _\theta ^2\phi _U(\theta )M\zeta |\}+\mathbb {E}\{R(\Vert M\zeta \Vert /\sqrt{N})\}\,. \end{aligned}$$

Now, we observe that:

$$\begin{aligned} \mathbb {E}\{\Vert M\zeta \Vert \}&=\mathbb {E}\left\{ \left( \sum _{i=1}^d\Big (\sum _{j=1}^d M_{i,j}\zeta _{j}\Big )^2\right) ^{1/2}\right\} \le \mathbb {E}\left\{ \sum _{i=1}^d\Big |\sum _{j=1}^d M_{i,j}\zeta _{j}\Big |\right\} \\&\le \mathbb {E}\left\{ \sum _{i=1}^d\sum _{j=1}^d|M_{i,j}||\zeta _{j}|\right\} =\sum _{i=1}^d\sum _{j=1}^d|M_{i,j}|\mathbb {E}\{|\zeta _j|\}=\sqrt{\frac{2}{\pi }}\Vert M\Vert _1\,, \end{aligned}$$

$$\begin{aligned} \mathbb {E}\{\Vert M\zeta \Vert ^2\}&=\mathbb {E}\left\{ \sum _{i=1}^d\Big (\sum _{j=1}^d M_{i,j} \zeta _j\Big )^2\right\} =\sum _{i=1}^d\mathbb {E}\left\{ \Big (\sum _{j=1}^dM_{i,j}\zeta _j\Big )^2\right\} \\&=\sum _{i=1}^{d}\text {var}\Big (\sum _{j=1}^d M_{i,j} \zeta _j\Big )=\sum _{i=1}^{d}\sum _{j=1}^d M_{i,j}^2\, \text {var}(\zeta _j)=\Vert M\Vert _2^2\,. \end{aligned}$$

Finally, consider the quadratic form associated with the operator $T(U,\theta )=M^{T}\nabla _{\theta }^2\phi _U(\theta ) M$. Since $T(U,\theta )$ is symmetric, its eigenvalues $\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _d$ are real and we have
$$\begin{aligned} \zeta ^{T}T(U,\theta ) \zeta \le \lambda _1\Vert \zeta \Vert ^2 \end{aligned}$$
so that:
$$\begin{aligned}&\mathbb {E}\left\{ |(M\zeta )^{T}\nabla _\theta ^2\phi _U(\theta )M\zeta |\right\} \nonumber \\&\quad \le d \sup _{i}|\lambda _i|\le d |\!|\!|M^{T}\nabla _{\theta }^2\phi _U(\theta ) M|\!|\!|\end{aligned}$$
where for any square matrix A, we have defined $|\!|\!|A|\!|\!|=\sup _{x\in \mathbb {R}^d,\Vert x\Vert =1}\Vert Ax\Vert $ as the operator norm.

$\square $
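The two Gaussian moment computations used in this proof can be sanity-checked by simulation. In the sketch below, $\Vert M\Vert _1$ is read as the sum of the absolute values of the entries of $M$ and $\Vert M\Vert _2$ as the Frobenius norm, which is what the derivation above produces; the matrix `M`, the sample size and the function name `moment_check` are arbitrary choices for this illustration.

```python
import math
import random


def moment_check(M, n_samples=100_000, seed=0):
    """Monte Carlo estimates of E||M zeta|| and E||M zeta||^2 for zeta ~ N(0, I_d),
    returned next to sqrt(2/pi) * sum|M_ij| and sum M_ij^2."""
    rng = random.Random(seed)
    d = len(M[0])
    sum_norm, sum_sq = 0.0, 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sq = sum(sum(row[j] * z[j] for j in range(d)) ** 2 for row in M)
        sum_sq += sq
        sum_norm += math.sqrt(sq)
    entrywise_l1 = sum(abs(x) for row in M for x in row)
    frobenius_sq = sum(x * x for row in M for x in row)
    return (sum_norm / n_samples, math.sqrt(2.0 / math.pi) * entrywise_l1,
            sum_sq / n_samples, frobenius_sq)


M = [[1.0, 2.0], [0.0, 1.0]]
mean_norm, l1_bound, mean_sq, frob_sq = moment_check(M)
print(mean_norm, "<=", l1_bound)
print(mean_sq, "vs", frob_sq)
```

The first relation is an inequality and is typically far from tight, while the second is an exact identity that the simulation reproduces up to Monte Carlo error.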

Proof of Proposition 7

In this section, we assume that there is an infinite stream of observations $(Y_1,Y_2,\ldots )$ and a parameter $\theta _0\in \Theta $ such that $Y_i\sim f(\,\cdot \,|\,\theta _0)$. Let $\rho >1$ be a constant defined as the ratio $N/n$, i.e. the size of the full dataset over the size of the subsamples of interest. The full dataset is thus $Y_{1:\rho n}$. We define the set

$$\begin{aligned} \textsf {U}_n^\rho =\left\{ U\subset \{1,\ldots ,\rho n\},\;|U|=n\right\} \end{aligned}$$

such that $Y_U$ ($U\in \textsf {U}_n^\rho $) is the set of subsamples of interest. We study the asymptotics when $n\rightarrow \infty $, i.e. we let the whole dataset and the size of subsamples of interest grow at the same rate.
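To give a sense of how quickly $\textsf {U}_n^\rho $ grows in this regime, the following sketch counts its elements, $|\textsf {U}_n^\rho |=\binom{\rho n}{n}$. The helper name `n_subsets` is hypothetical and $\rho $ is taken to be an integer for simplicity.

```python
import math


def n_subsets(n, rho):
    """Number of size-n subsets of {1, ..., rho * n} (rho assumed to be an integer)."""
    return math.comb(rho * n, n)


for n in (2, 10, 50):
    print(n, n_subsets(n, rho=3))
```

Even for moderate $n$ the set is far too large to enumerate, so statements "for all $U\in \textsf {U}_n^\rho $" are not something one could verify subset by subset.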

Proposition 7

Let $\theta ^{*}_{\rho n}$ be the MLE of $Y_1,\ldots ,Y_{\rho n}$ and $\theta ^{*}_U$ be the MLE of the subsample $Y_U$ ($U\in \textsf {U}_n^\rho $). Assume that there exists a compact set $\kappa _n\subset \Theta $ such that $(\theta ^{*}_{\rho n},\theta _0)\in \kappa _n^2$ and for all U, there exists a compact set $\kappa _U\subset \Theta $ such that $(\theta ^{*}_U,\theta _0)\in \kappa _U^2$. Then, there exist a constant $\beta $, a metric $\Vert \cdot \Vert _{\theta _0}$ on $\Theta $ and a non-decreasing subsequence $\{\sigma _n\}_{n\in \mathbb {N}}$, ($\sigma _n\in \mathbb {N}$) such that for all $U\in \textsf {U}_{\sigma _n}^\rho $, we have for p-almost all $\theta \in \kappa _n\cap \kappa _U$

$$\begin{aligned}&\log f(Y_{1:\rho \sigma _n}\,|\,\theta )-\rho \log f(Y_U\,|\,\theta )\le H_n(Y,\theta )+\beta \nonumber \\&\quad +\frac{\rho \sigma _n}{2}\Vert \theta ^{*}_U-\theta ^{*}\Vert _{\theta _0}\,, \end{aligned}$$

(B.1)

where

$$\begin{aligned} \underset{n\rightarrow \infty }{\text {plim}}\quad H_n(Y,\theta )\overset{\mathbb {P}_{\theta _{0}}}{=} 0\,. \end{aligned}$$

Proof

Fix $n\in \mathbb {N}$. Consider the case where the prior distribution p is uniform on $\kappa _n$. In this case, the posterior is

$$\begin{aligned} \pi _{n}(\theta \,|\,Y_{1:\rho n})= & {} f(Y_{1:\rho n}\,|\,\theta ) {\mathbb {1}}_{\kappa _n}(\theta )\big / Z_{\rho n}\,,\\ Z_{\rho n}= & {} \int _{\kappa _n} f(Y_{1:\rho n}\,|\,\theta )\text {d}\theta \end{aligned}$$

and from Corollary 3, we know that there exists a subsequence $\tau _n\subset \mathbb {N}$ such that for p-almost all $\theta \in \kappa _n$

$$\begin{aligned} \left| \log \frac{f(Y_{1:\rho \tau _n}\,|\,\theta )}{Z_{\rho \tau _n}}-\log \Phi _{\rho \tau _n}(\theta )\right| \overset{\mathbb {P}_{\theta _{0}}}{\rightarrow }0\,, \end{aligned}$$

(B.2)

where $\theta \mapsto \Phi _{\rho \tau _n}(\theta )$ is the pdf of $\mathcal {N}(\theta ^{*}_{\rho \tau _n},I(\theta _0)^{-1}/\rho \tau _n)$. Similarly, there exists another subsequence $\gamma _n\subset \mathbb {N}$ such that for all $U\in \textsf {U}_{\gamma _n}^\rho $ and for p-almost all $\theta \in \kappa _U$

$$\begin{aligned}&\left| \rho \log \frac{f(Y_U\,|\,\theta )}{Z_{\gamma _n}(U)}-\rho \log \Phi _{U}(\theta )\right| \overset{\mathbb {P}_{\theta _{0}}}{\rightarrow }0\,,\nonumber \\&\quad Z_{\gamma _n}(U)=\int _{\kappa _{U}} f(Y_U\,|\,\theta )\text {d}\theta \end{aligned}$$

(B.3)

where $\theta \mapsto \Phi _{U}(\theta )$ is the pdf of $\mathcal {N}(\theta ^{*}_{U},I(\theta _0)^{-1}/|U|)$. Let $\{\sigma _n\}_{n\in \mathbb {N}}$ be the sequence defined as $\sigma _n=\max \{\tau _n,\gamma _n\}$. We know from (B.2) and (B.3) that for all $\varepsilon >0$ and all $\eta >0$, there exists $n_1\in \mathbb {N}$ such that for all $U\in \textsf {U}_{\sigma _n}^{\rho }$ and for all $n\ge n_1$

$$\begin{aligned}&\mathbb {P}_{\theta _0}\left\{ \left| \log \frac{f(Y_{1:\rho \sigma _n}\,|\,\theta )}{Z_{\rho \sigma _n}}-\log \Phi _{\rho \sigma _n}(\theta )\right| \right. \nonumber \\&\quad \left. +\left| \rho \log \frac{f(Y_U\,|\,\theta )}{Z_{\sigma _n}(U)}-\rho \log \Phi _{U}(\theta )\right| \ge \varepsilon \right\} \le \eta \,. \end{aligned}$$

(B.4)

Now, by straightforward algebra, we have for any $U\in \textsf {U}_{\sigma _n}^{\rho }$

$$\begin{aligned}&\log f(Y_{1:\rho \sigma _n}\,|\,\theta )-\rho \log f(Y_U\,|\,\theta )\nonumber \\&\quad =\log \frac{f(Y_{1:\rho \sigma _n}\,|\,\theta )}{Z_{\rho \sigma _n}}-\log \Phi _{\rho \sigma _n}(\theta ) \nonumber \\&\qquad -\,\rho \log \frac{f(Y_U\,|\,\theta )}{Z_{\sigma _n}(U)}\nonumber \\&\qquad +\rho \log \Phi _{U}(\theta ) +\log \frac{Z_{\rho \sigma _n}}{Z_{\sigma _n}(U)^\rho }+\log \Phi _{\rho \sigma _n}(\theta )\nonumber \\&\qquad -\,\rho \log \Phi _U(\theta )\nonumber \\&\quad \le \left| \log \frac{f(Y_{1:\rho \sigma _n}\,|\,\theta )}{Z_{\rho \sigma _n}}-\log \Phi _{\rho \sigma _n}(\theta )\right. \nonumber \\&\qquad \left. - \rho \log \frac{f(Y_U\,|\,\theta )}{Z_{\sigma _n}(U)}+\rho \log \Phi _{U}(\theta )\right| \nonumber \\&\qquad +\,\log \frac{Z_{\rho \sigma _n}}{Z_{\sigma _n}(U)^\rho } + (\rho -1)\log (2\pi )^{d/2}\nonumber \\&\qquad +\,\frac{\rho \sigma _n}{2}\bigg |\Vert \theta -\theta ^*_U\Vert _{\theta _0}- \Vert \theta -\theta ^*\Vert _{\theta _0}\bigg |\nonumber \\&\quad \le \left| \log \frac{f(Y_{1:\rho \sigma _n}\,|\,\theta )}{Z_{\rho \sigma _n}}-\log \Phi _{\rho \sigma _n}(\theta )\right| \nonumber \\&\qquad +\, \left| \rho \log \frac{f(Y_U\,|\,\theta )}{Z_{\sigma _n}(U)}-\rho \log \Phi _{U}(\theta )\right| \nonumber \\&\qquad +\,\log \frac{Z_{\rho \sigma _n}}{Z_{\sigma _n}(U)^\rho } + (\rho -1)\log (2\pi )^{d/2}\nonumber \\&\qquad +\,\frac{\rho \sigma _n}{2}\Vert \theta ^*_U-\theta ^*\Vert _{\theta _0}\,, \end{aligned}$$

(B.5)

where we have used Lemma 3 for the first inequality and the triangle inequality for the second. Combining (B.5) with (B.4) yields (34). $\square $

Lemma 2

Consider a posterior distribution $\pi _n$ given n data $Y_{1:n}$ where p is the prior distribution and its Bernstein-von Mises approximation is $\Phi _n=\mathcal {N}(\theta ^{*}(Y_{1:n}),I(\theta _0)^{-1}/n)$. There exists a subsequence $\{\tau _n\}_n\subset \mathbb {N}$ such that

$$\begin{aligned} \underset{n\rightarrow \infty }{\text {plim}}\left| \pi _{\tau _n}(\theta )-\Phi _{\tau _n}(\theta )\right| \overset{\mathbb {P}_{\theta _0}}{=}0\,,\quad \text {for}\;p\text {-almost all}\; \theta \,. \end{aligned}$$

(B.6)

Proof

This follows from the fact that convergence in $L_1$ implies pointwise convergence almost everywhere of a subsequence, i.e. there exists a subsequence $\{\tau _n\}_{n\in \mathbb {N}}\subset \mathbb {N}$ such that

$$\begin{aligned} \Vert \pi _{n}-\Phi _n\Vert _1\rightarrow 0 \Rightarrow |\pi _{\tau _n}(\theta )-\Phi _{\tau _n}(\theta )|\rightarrow 0\quad p\text {-a.e.} \end{aligned}$$

(B.7)

Eq. (B.6) follows from combining the Bernstein-von Mises theorem and Eq. (B.7):

$$\begin{aligned}&\text {plim}_{n\rightarrow \infty } \Vert \pi _n-\Phi _n(\theta ^{*},I(\theta _0)^{-1}/n)\Vert _1\overset{\mathbb {P}_{\theta _0}}{=}0\\&\quad \Rightarrow \text {plim}_{n\rightarrow \infty }|\pi _{\tau _n}(\theta )-\Phi _{\tau _n}(\theta )|\overset{\mathbb {P}_{\theta _0}}{=}0\quad p\text {-a.e.} \end{aligned}$$

$\square $
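The Bernstein-von Mises approximation underlying Lemma 2 can be visualised on a simple conjugate model. The sketch below assumes Bernoulli observations with a uniform prior, so that the posterior is a Beta distribution, and compares it in $L_1$ with $\mathcal {N}(\theta ^{*},I(\theta _0)^{-1}/n)$ where $I(\theta _0)^{-1}=\theta _0(1-\theta _0)$. The model, the data summaries and the function name are assumptions chosen for illustration, not part of the paper.

```python
import math


def l1_posterior_vs_normal(n, successes, theta0, grid=20_000):
    """Midpoint-rule approximation of the L1 distance on (0, 1) between the
    Beta posterior (uniform prior, Bernoulli data) and N(MLE, theta0(1-theta0)/n)."""
    a, b = successes + 1, n - successes + 1
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    mle = successes / n
    var = theta0 * (1.0 - theta0) / n        # I(theta0)^{-1} / n for this model
    h = 1.0 / grid
    total = 0.0
    for k in range(grid):
        t = (k + 0.5) * h
        post = math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t) - log_beta)
        norm = math.exp(-0.5 * (t - mle) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += abs(post - norm) * h
    return total


for n in (50, 200, 1000):
    print(n, l1_posterior_vs_normal(n, successes=int(0.3 * n), theta0=0.3))
```

With the fraction of successes held at 0.3, the distance shrinks as $n$ grows, in line with the $L_1$ convergence on which (B.7) relies; the lemma itself only needs that such convergence implies pointwise convergence along a subsequence.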

Corollary 3

There exists a subsequence $\{\tau _n\}_{n\in \mathbb {N}}\subset \mathbb {N}$ such that

$$\begin{aligned} \underset{n\rightarrow \infty }{\text {plim}}\left| \log \pi _{\tau _n}(\theta )-\log \Phi _{\tau _n}(\theta )\right| \overset{\mathbb {P}_{\theta _0}}{=}0\,,\quad \text {for}\;p \text {-almost all}\;\theta \,. \end{aligned}$$

(B.8)

Proof

Follows from Lemma 2, by continuity of the logarithm.

Lemma 3

For any $U\in \textsf {U}_n$, let $\theta \mapsto \Phi _U(\theta )$ be the pdf of $\mathcal {N}(\theta _U^*,I(\theta _0)^{-1}/n)$ and $\Phi _{\rho n}$ be the pdf of $\mathcal {N}(\theta _{\rho n}^*,I(\theta _0)^{-1}/\rho n)$, i.e. the Bernstein-von Mises approximations of $\pi (\,\cdot \,|\,Y_U)$ and $\pi (\cdot \,|\,Y_{1:\rho n})$, respectively, where $U\subset \textsf {U}_n(Y_{1:\rho n})$. Then we have for all $\theta \in \Theta $

$$\begin{aligned}&\log \Phi _{\rho n}(\theta )-\rho \log \Phi _U(\theta )\le (\rho -1)\log (2\pi )^{d/2}\\&\quad +\frac{\rho n}{2}\left\{ \Vert \theta -\theta ^*_U\Vert _{\theta _0}- \Vert \theta -\theta ^*\Vert _{\theta _0}\right\} \,, \end{aligned}$$

where, for any $d\times d$ symmetric matrix M, we denote by $\Vert \cdot \Vert _M$ the norm associated with the scalar product $\left\langle u , v\right\rangle _M=u^{T}M v$.

Proof

This follows from straightforward algebra and noting that

$$\begin{aligned} \log \rho n |I(\theta _0)|-\rho \log n|I(\theta _0)|\le 0\,. \end{aligned}$$

$\square $

About this article

Cite this article

Maire, F., Friel, N. & Alquier, P. Informed sub-sampling MCMC: approximate Bayesian inference for large datasets. Stat Comput 29, 449–482 (2019). https://doi.org/10.1007/s11222-018-9817-3

Received: 26 June 2017
Accepted: 04 June 2018
Published: 09 June 2018
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s11222-018-9817-3
