Abstract
This paper introduces a framework for speeding up Bayesian inference in the presence of large datasets. We design a Markov chain whose transition kernel uses an unknown fraction of fixed size of the available data that is randomly refreshed throughout the algorithm. Inspired by the Approximate Bayesian Computation literature, the subsampling process is guided by the fidelity to the observed data, as measured by summary statistics. The resulting algorithm, Informed Sub-Sampling MCMC, is a generic and flexible approach which, contrary to existing scalable methodologies, preserves the simplicity of the Metropolis–Hastings algorithm. Even though exactness is lost, i.e., the chain distribution only approximates the posterior, we study and quantify this bias theoretically and show on a diverse set of examples that the method yields excellent performance when the computational budget is limited. When the maximum likelihood estimator is available and cheap to compute, we show that using it as the summary statistic is supported by theoretical arguments.
References
Allassonnière, S., Amit, Y., Trouvé, A.: Towards a coherent statistical framework for dense deformable template estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69(1), 3–29 (2007)
Alquier, P., Friel, N., Everitt, R., Boland, A.: Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Stat. Comput. 26(1–2), 29–47 (2016)
Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 697–725 (2009)
Andrieu, C., Vihola, M.: Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. Ann. Appl. Probab. 25(2), 1030–1077 (2015)
Banterle, M., Grazian, C., Lee, A., Robert, C.P.: Accelerating Metropolis–Hastings algorithms by delayed acceptance. arXiv preprint arXiv:1503.00996 (2015)
Bardenet, R., Doucet, A., Holmes, C.: Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In: ICML, pp. 405–413 (2014)
Bardenet, R., Doucet, A., Holmes, C.: On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18, 1–43 (2017)
Bierkens, J., Fearnhead, P., Roberts, G.: The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat. (2018) (to appear)
Chib, S., Greenberg, E.: Understanding the Metropolis–Hastings algorithm. Am. Stat. 49(4), 327–335 (1995)
Csilléry, K., Blum, M.G., Gaggiotti, O.E., François, O.: Approximate Bayesian computation (ABC) in practice. Trends Ecol. Evolut. 25(7), 410–418 (2010)
Dalalyan, A.S.: Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752 (2017)
Douc, R., Moulines, E., Rosenthal, J.S.: Quantitative bounds on convergence of time-inhomogeneous Markov chains. Ann. Appl. Probab. 14, 1643–1665 (2004)
Fearnhead, P., Bierkens, J., Pollock, M., Roberts, G.O.: Piecewise deterministic Markov processes for continuous-time Monte Carlo. arXiv preprint arXiv:1611.07873 (2016)
Fearnhead, P., Prangle, D.: Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 74(3), 419–474 (2012)
Geyer, C.J., Thompson, E.A.: Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90(431), 909–920 (1995)
Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli 7, 223–242 (2001)
Hobert, J.P., Robert, C.P.: A mixture representation of \(\pi \) with applications in Markov chain Monte Carlo and perfect sampling. Ann. Appl. Probab. 14, 1295–1305 (2004)
Huggins, J., Zou, J.: Quantifying the accuracy of approximate diffusions and Markov chains. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, vol. 54, pp. 382–391 (2016)
Jacob, P.E., Thiery, A.H., et al.: On nonnegative unbiased estimators. Ann. Stat. 43(2), 769–784 (2015)
Johndrow, J.E., Mattingly, J.C.: Error bounds for approximations of Markov chains. arXiv preprint arXiv:1711.05382 (2017)
Johndrow, J.E., Mattingly, J.C., Mukherjee, S., Dunson, D.: Approximations of Markov chains and Bayesian inference. arXiv preprint arXiv:1508.03387 (2015)
Korattikara, A., Chen, Y., Welling, M.: Austerity in MCMC land: cutting the Metropolis–Hastings budget. In: Proceedings of the 31st International Conference on Machine Learning (2014)
Le Cam, L.: On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. Univ. Calif. Publ. Stat. 1, 277–330 (1953)
Le Cam, L.: Asymptotic Methods in Statistical Decision Theory. Springer, Berlin (1986)
Maclaurin, D., Adams, R.P.: Firefly Monte Carlo: exact MCMC with subsets of data. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
Marin, J.-M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational methods. Stat. Comput. 22(6), 1167–1180 (2012)
Medina-Aguayo, F.J., Lee, A., Roberts, G.O.: Stability of noisy Metropolis-Hastings. Stat. Comput. 26(6), 1187–1211 (2016)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge (2009)
Mitrophanov, A.Y.: Sensitivity and convergence of uniformly ergodic Markov chains. J. Appl. Probab. 42(4), 1003–1014 (2005)
Nunes, M.A., Balding, D.J.: On optimal selection of summary statistics for approximate Bayesian computation. Stat. Appl. Genet. Mol. Biol. 9(1) (2010)
Pollock, M., Fearnhead, P., Johansen, A.M., Roberts, G.O.: The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv preprint arXiv:1609.03436 (2016)
Pritchard, J.K., Seielstad, M.T., Perez-Lezaun, A., Feldman, M.W.: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16(12), 1791–1798 (1999)
Quiroz, M., Villani, M., Kohn, R.: Speeding up MCMC by efficient data subsampling. Riksbank Research Paper Series (121) (2015)
Quiroz, M., Villani, M., Kohn, R.: Exact subsampling MCMC. arXiv preprint arXiv:1603.08232 (2016)
Roberts, G.O., Rosenthal, J.S., et al.: Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16(4), 351–367 (2001)
Rudolf, D., Schweizer, N.: Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24(4A), 2610–2639 (2018)
Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 681–688 (2011)
Wilkinson, R.D.: Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat. Appl. Genet. Mol. Biol. 12(2), 129–141 (2013)
Acknowledgements
The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289. Nial Friel’s research was also supported by a Science Foundation Ireland grant: 12/IP/1424. Pierre Alquier’s research was funded by Labex ECODEC (ANR - 11-LABEX-0047) and by the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque. We thank the Associate Editor and two anonymous Referees for their contribution to this work.
Appendices
Proofs
1.1 Proof of Proposition 1
Proof
For notational simplicity and without loss of generality, we take here g as the identity on \(\Theta \). Let \(n<N\) and U be a subset of \(\{1,\ldots ,N\}\) with cardinality n. Consider the power likelihood:
and the corresponding power posterior:
where
For any \(\theta \) such that \(p(\theta )\ne 0\), write:
and the KL divergence between \(\pi (\,\cdot \,|\,Y_{1:N})\) and \(\tilde{\pi }(\,\cdot \,|\,Y_U)\), denoted \(\text {KL}_n(U)\), is given by
where \(\Delta _n(U)=\sum _{k=1}^N S(Y_k)-(N/n)\sum _{k\in U}S(Y_k)\). Now, note that
Plugging (A.3) into (A.2) yields:
Finally, the Cauchy–Schwarz inequality provides the following upper bound for \(\text {KL}_n(U)\):
\(\square \)
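As an illustration of the quantity \(\Delta _n(U)\) used in the proof above, the following minimal sketch evaluates it for a toy dataset. The function name `delta_n`, the choice \(S(y)=(y,y^2)\) and the data are assumptions made for the example only; they are not taken from the paper.

```python
import numpy as np

def delta_n(summaries, subset):
    """Delta_n(U) = sum_k S(Y_k) - (N/n) * sum_{k in U} S(Y_k).

    summaries: array of shape (N, d), row k holding S(Y_k).
    subset: iterable of n distinct row indices (0-based in this sketch).
    """
    summaries = np.asarray(summaries, dtype=float)
    subset = np.asarray(list(subset), dtype=int)
    N, n = summaries.shape[0], subset.size
    return summaries.sum(axis=0) - (N / n) * summaries[subset].sum(axis=0)

# Hypothetical toy data: N = 6 scalar observations with S(y) = (y, y^2).
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
S = np.column_stack([y, y**2])

delta_pair = delta_n(S, [0, 5])     # subsample {Y_1, Y_6}
delta_full = delta_n(S, range(6))   # U = {1, ..., N}
print(delta_pair, delta_full)
```

In this toy case the subsample matches the first summary statistic of the full data exactly but not the second, and taking U as the full index set gives \(\Delta _n(U)=0\).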
1.2 Proof of Proposition 2
Proof
Under some weak assumptions, the Bernstein–von Mises theorem states that \(\pi (\,\cdot \,|\,Y_{1:N})\) is asymptotically (in N) a Gaussian distribution with the maximum likelihood estimator \(\theta ^{*}\) as mean and \(\Gamma _N=I^{-1}(\theta ^{*})/N\) as covariance matrix, where \(I(\theta )\) is the Fisher information matrix at \(\theta \). Let us denote by \(\Phi \) the pdf of \(\mathcal {N}(\theta ^{*},\Gamma _N)\). Under this approximation, \(\mathbb {E}_\pi (\theta )=\theta ^{*}\) and from (A.3), we write:
by integration of a multivariate Gaussian density function. Eventually, (A.6) yields the following approximation:
\(\square \)
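To make the Gaussian approximation \(\mathcal {N}(\theta ^{*},\Gamma _N)\) with \(\Gamma _N=I^{-1}(\theta ^{*})/N\) concrete, here is a minimal sketch for a one-dimensional model. The exponential model, the function name and the data are assumptions chosen for illustration; for i.i.d. Exponential data with rate \(\theta \), the MLE is the inverse sample mean and the Fisher information is \(1/\theta ^2\).

```python
import numpy as np

def bvm_gaussian_exponential(y):
    """Return (theta_star, gamma_N) for i.i.d. Exponential(rate=theta) data.

    theta_star is the MLE and gamma_N = I(theta_star)^{-1} / N is the
    variance of the Bernstein-von Mises Gaussian approximation.
    """
    y = np.asarray(y, dtype=float)
    N = y.size
    theta_star = 1.0 / y.mean()       # MLE of the rate
    fisher = 1.0 / theta_star**2      # Fisher information I(theta) = 1 / theta^2
    gamma_N = 1.0 / (fisher * N)      # I^{-1}(theta*) / N
    return theta_star, gamma_N

# Hypothetical data with sample mean 1.
theta_star, gamma_N = bvm_gaussian_exponential([0.5, 1.5, 1.0, 1.0])
print(theta_star, gamma_N)
```

The variance shrinks at rate 1/N, which is the scaling used for \(\Gamma _N\) in the proof.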
1.3 Proof of Proposition 3
Proof
Let \(\textsf {U}_n\supset A_n(\theta ):=\left\{ U\in \textsf {U}_n,\;g(\theta )^{T}\Delta _n(U)\le 0\right\} \) and remark that, using the Cauchy–Schwarz inequality, we have:
Now, define \(\bar{\Delta }_n(U):=\bar{S}(Y)-\bar{S}(Y_U)\) where \(\bar{S}\) is the normalized summary statistics vector, i.e., if \(U\in \textsf {U}_n\), \(\bar{S}(Y_U)=S(Y_U)/n\). Clearly, when \(N\rightarrow \infty \), some terms
will have a large contribution to the sum. More precisely, any mismatch between summary statistics of some subsamples \(\{Y_U,\,U\in \textsf {U}_n\backslash A_n(\theta )\}\) with respect to the full dataset will be amplified by the factor N, thereby exponentially inflating the upper bound. However, assigning the distribution \(\nu _{n,\epsilon }\) (12) to the subsamples \(\{Y_U,\,U\in \textsf {U}_n\}\) balances out this effect. Indeed, note that
where \(Z(\epsilon )=\sum _{U\in \textsf {U}_n}\exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\}\) and we have, for a fixed n and when \(N\rightarrow \infty \), that
Since g is bounded, then \(\mathbb {E}\left\{ {f(Y\,|\,\theta )}\slash {f(Y_U\,|\,\theta )^{N/n}}\right\} \) is bounded too. \(\square \)
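The weighting of subsamples in the proof above can be sketched by brute-force enumeration when N is tiny. The sketch assumes that \(\nu _{n,\epsilon }(U)\) is the normalized weight \(\exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\}/Z(\epsilon )\), consistent with the normalizing constant \(Z(\epsilon )\) given above; the function name `nu_weights` and the toy data are hypothetical. Enumeration is infeasible for realistic N and is shown only to illustrate which subsets receive mass.

```python
import itertools
import numpy as np

def nu_weights(summaries, n, eps):
    """Enumerate all size-n subsets and return (subsets, probabilities)."""
    summaries = np.asarray(summaries, dtype=float)
    N = summaries.shape[0]
    total = summaries.sum(axis=0)
    subsets = list(itertools.combinations(range(N), n))
    # Squared norm of Delta_n(U) for every subset U.
    sq_norms = np.array([
        np.sum((total - (N / n) * summaries[list(U)].sum(axis=0)) ** 2)
        for U in subsets
    ])
    # Subtracting the minimum avoids underflow; it cancels after normalization.
    w = np.exp(-eps * (sq_norms - sq_norms.min()))
    return subsets, w / w.sum()

# Hypothetical toy data with S(y) = y.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
subsets, probs = nu_weights(y[:, None], n=2, eps=0.1)
best = subsets[int(np.argmax(probs))]
print(best, probs.max())
```

Subsets whose rescaled summary statistic matches the full-data one receive the largest probability, and setting \(\epsilon =0\) recovers uniform subsampling.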
1.4 Proof of Proposition 4
We preface the proof of Proposition 4 with five Lemmas, some of which are inspired by Medina-Aguayo et al. (2016). For notational simplicity, the dependence on \((n,\epsilon )\) of any ISS-MCMC related quantities is implicit. For all \((\theta ,U)\in \Theta \times \textsf {U}_n\), we denote by \(\phi _U(\theta )=f(y_U\,|\,\theta )^{N/n}/f(y\,|\,\theta )\) and recall that \(a(\theta ,\theta ')\) is the (exact) MH acceptance ratio so that \(\alpha (\theta ,\theta ')=1\wedge a(\theta ,\theta ')\). Unless stated otherwise, \(\mathbb {E}\) is the expectation taken under \(\nu _{n,\epsilon }\). For simplicity, \(\tilde{K}_{n,\epsilon }\) is written as \(\tilde{K}_n\).
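The ratio \(\phi _U(\theta )\) is easiest to handle on the log scale. The sketch below evaluates \(\log \phi _U(\theta )\) for an assumed i.i.d. model \(Y_k\sim \mathcal {N}(\theta ,1)\); the model, the function name and the data are illustrative choices, not part of the paper.

```python
import numpy as np

def log_phi(theta, y, subset):
    """log phi_U(theta) = (N/n) * log f(y_U | theta) - log f(y | theta),
    for the hypothetical model Y_k ~ N(theta, 1), i.i.d."""
    y = np.asarray(y, dtype=float)
    idx = np.asarray(list(subset), dtype=int)
    N, n = y.size, idx.size

    def loglik(x):
        # Gaussian log-likelihood with unit variance.
        return -0.5 * np.sum((x - theta) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

    return (N / n) * loglik(y[idx]) - loglik(y)

# Hypothetical data: two observations, subsample made of the first one.
value = log_phi(0.0, [0.0, 2.0], [0])
full = log_phi(0.3, [0.0, 2.0], [0, 1])
print(value, full)
```

When U is the full index set, \(\phi _U\equiv 1\), so the log ratio vanishes for every \(\theta \).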
Lemma 1
For any \((\theta ,\theta ')\in \Theta ^2\), we have
Proof
This follows from a slight adaptation of Lemma 3.3 in Medina-Aguayo et al. (2016):
where we have used Jensen’s inequality and the fact that the inequality \(1\wedge ab\le (1\wedge a)b\) holds for \(a>0\) and \(b\ge 1\). \(\square \)
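The elementary inequality \(1\wedge ab\le (1\wedge a)b\) used above can be checked numerically on a grid; the grid values are arbitrary choices for this sketch. The last line shows that the restriction \(b\ge 1\) cannot be dropped.

```python
import itertools

def lhs(a, b):
    return min(1.0, a * b)

def rhs(a, b):
    return min(1.0, a) * b

# Grid over a in (0, 5] and b in [1, 6].
grid_a = [0.01 * k for k in range(1, 501)]
grid_b = [1.0 + 0.05 * k for k in range(0, 101)]

violations = [
    (a, b)
    for a, b in itertools.product(grid_a, grid_b)
    if lhs(a, b) > rhs(a, b) + 1e-12
]
print(len(violations))

# With b < 1 the inequality can fail, e.g. a = 2 and b = 0.5.
print(lhs(2.0, 0.5), rhs(2.0, 0.5))
```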
Lemma 2
For any \(\theta \in \Theta \) and all \(\delta >0\), we have
Proof
The proof is identical to the proof of Lemma 3.2 in Medina-Aguayo et al. (2016) by noting that Lemma 3.1 in the same reference holds for two random variables \(\phi _U(\theta )\) and \(\phi _{U}(\theta ')\) that are not independent, i.e., for all \((\theta ,\theta ')\in \Theta ^2\), any \(U\in \textsf {U}_n\) and all \(\delta \in (0,1)\)
Lemma 3
Assume that Assumption A.4 holds. Then we have
Proof
Using the Cauchy–Schwarz inequality, we write that for all \((\theta ,\theta ')\in \Theta ^2\),
Now for all \(\theta \in \Theta \), we define the event \(\mathcal {E}_\theta :=\{U\in \textsf {U}_n\,,\;f(Y\,|\,\theta )\le f(Y_U\,|\,\theta )^{N/n}\}\) so that
and we note that for all \((\theta ,U)\in \Theta \times \textsf {U}_n\), Eq. (26) writes
but also
so that
A similar argument gives the same upper bound for \(\mathbb {E}\{{f(Y\,|\theta )}\slash {f(Y_U\,|\,\theta )^{N/n}}\}^2\) so that Eq. (A.8) yields
The proof is completed by noting that for three numbers a, b and c, \(c>b\Rightarrow a\vee b\le a\vee c\) and \(\gamma \Vert \Delta _n(U)\Vert >0\). \(\square \)
Lemma 4
Assume that Assumption A.4 holds. Then we have for all \(\theta \in \Theta \) and \(\delta >0\)
Proof
With the same notation as in the proof of Lemma 3 and roughly the same reasoning, we have for all \(\theta \in \Theta \) and all \(\delta >0\)
where the first inequality follows by inclusion (on \(\mathcal {E}_\theta \)) of
and similarly for the second term. Now, note that for all \(x\in (0,1)\), \(\log (1+x)<-\log (1-x)\) so that
where the last inequality follows from Markov inequality. \(\square \)
We study the limiting case where N is fixed and \(n\rightarrow N\).
Lemma 5
Assume N is fixed and let \(n\rightarrow N\). Then,
Proof
It follows from the fact that when \(n\rightarrow N\), \(\nu _{n,\epsilon }\) converges to the dirac on \(U^\dag =\{1,\ldots ,N\}\) and therefore,
\(\square \)
We can now prove Proposition 4:
Proposition. Assume that A.3 and A.4 hold. If the marginal MH chain K is geometrically ergodic, i.e., A.1 holds, then there exists an \(n_0\le N\) such that for all \(n>n_0\), \(\tilde{K}_n\) is also geometrically ergodic.
Proof
By (Meyn and Tweedie 2009, Theorems 14.0.1 & 15.0.1), there exists a function \(V:\textsf {X}\rightarrow [1,\infty [\), two constants \(\lambda \in (0,1)\) and \(b<\infty \) and a small set \(S\subset \textsf {X}\) such that K satisfies a drift condition:
We now show how to use the previous Lemmas to establish the geometric ergodicity of \(\tilde{K}_n\) for some n sufficiently large. This reasoning is very similar to that presented in (Medina-Aguayo et al. 2016, Theorem 3.2).
Combining Eq. (A.9) with Eq. (A.10), we have that
Fix \(\epsilon >0\). From Lemma 5, there exists \((n_1,n_2)\in \mathbb {N}^2\) such that
Combining Eqs. (A.10) and (A.12) yields that for all \(n\ge n_0:=\max (n_1,n_2)\), we have
Taking \(\delta =\epsilon /2\) in Eq. (A.13) gives
To show that \(\tilde{K}_n\) (for \(n>n_0\)) satisfies a geometric drift condition, it is sufficient to take \(\epsilon <(1-\lambda )/(1+\lambda )\) and to check that S is also small for \(\tilde{K}_n\). This is demonstrated exactly as in the proof of Medina-Aguayo et al. (2016, Theorem 3.2). \(\square \)
1.5 Proof of Proposition 5
This proof borrows ideas from the perturbation analysis of uniformly ergodic Markov chains. First, note that by straightforward algebra we have that
Now, under Assumption A.2 and using Mitrophanov (2005, Corollary 3.1) we have that for any starting point \(\theta _0\in \Theta \),
where \(\lambda =\lceil \log (1/C)/\log \rho \rceil \). Combining Eqs. (A.14) and (A.15) leads to Eq. (29) with \(\kappa =\lambda +C\rho ^\lambda /(1-\rho )\). Moreover, note that using Eq. (29) we have
and taking the limit when \(i\rightarrow \infty \) leads to Eq. (30). Finally, for a large enough n, we know from Proposition 4 that the marginal Markov chain \(\{\tilde{\theta }_i\,,i\in \mathbb {N}\}\) produced by ISS-MCMC is geometrically ergodic and we denote by \(\tilde{\pi }_n\) its stationary distribution. For such an n, we have for any \(\theta _0\in \Theta \)
and taking the limit as \(i\rightarrow \infty \) yields Eq. (31).
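The constants appearing in this proof are straightforward to evaluate once the uniform ergodicity constants are known. The sketch below computes \(\lambda =\lceil \log (1/C)/\log \rho \rceil \) and \(\kappa =\lambda +C\rho ^\lambda /(1-\rho )\); the function name and the numerical values of C and \(\rho \) are hypothetical, and the sketch assumes \(C\ge 1\) and \(\rho \in (0,1)\).

```python
import math

def perturbation_constants(C, rho):
    """Return (lam, kappa) with lam = ceil(log(1/C) / log(rho)) and
    kappa = lam + C * rho**lam / (1 - rho)."""
    if not (0.0 < rho < 1.0) or C < 1.0:
        raise ValueError("this sketch assumes C >= 1 and 0 < rho < 1")
    lam = math.ceil(math.log(1.0 / C) / math.log(rho))
    kappa = lam + C * rho**lam / (1.0 - rho)
    return lam, kappa

# Hypothetical constants of a uniformly ergodic chain.
lam, kappa = perturbation_constants(C=3.0, rho=0.9)
print(lam, kappa)
```

A slowly mixing exact chain (\(\rho \) close to 1) inflates \(\kappa \), so the same kernel perturbation translates into a looser bound on the distance to the posterior.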
1.6 Extension of Proposition 5 beyond the time homogeneous case
We start with the two following remarks relative to the Informed Sub-Sampling Markov chain.
Remark 1
Assume \(U_0\sim \nu _{n,\epsilon }\) and \(\tilde{\theta }_0\sim \mu \) for some initial distribution \(\mu \) on \((\Theta ,\vartheta )\). The distribution of \(U_i\) given \(\tilde{\theta }_i\) is for some \(u\in \textsf {U}_n\),
where \(\bar{K}(\theta ,U;\text {d}\theta ',U'):=K(\theta ,\text {d}\theta '\,|\,U)H(U,U')\) and H is the transition kernel of the Markov chain \(\{U_i,\,i\in \mathbb {N}\}\). As a consequence \(\mathbb {P}(U_i\in \,\cdot \,|\,\tilde{\theta })\) depends on \(\tilde{\theta }\) and i.
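One possible way to build a transition kernel H on subsets that leaves \(\nu _{n,\epsilon }\) invariant is a Metropolis step that swaps a single index. The sketch below is an assumption made for illustration and is not claimed to be the kernel used in the paper; it also assumes \(\nu _{n,\epsilon }(U)\propto \exp \{-\epsilon \Vert \Delta _n(U)\Vert ^2\}\). The names `refresh_subset` and `sq_norm` and the toy data are hypothetical.

```python
import numpy as np

def refresh_subset(U, sq_norm, N, eps, rng):
    """One Metropolis step on size-n subsets of {0, ..., N-1} (requires n < N).

    Proposal: replace one uniformly chosen member of U by one uniformly
    chosen non-member (a symmetric move). Accept with probability
    1 ^ exp{-eps * (||Delta(U')||^2 - ||Delta(U)||^2)}.
    """
    U = list(U)
    members = set(U)
    outside = [k for k in range(N) if k not in members]
    i = rng.integers(len(U))
    j = rng.integers(len(outside))
    proposal = U.copy()
    proposal[i] = outside[j]
    log_ratio = -eps * (sq_norm(proposal) - sq_norm(U))
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return U

# Hypothetical toy data with S(y) = y, N = 10 and n = 3.
y = np.arange(1.0, 11.0)
N, n = y.size, 3

def sq_norm(U):
    # Squared norm of Delta_n(U) for a scalar summary statistic.
    return (y.sum() - (N / n) * y[list(U)].sum()) ** 2

rng = np.random.default_rng(1)
U = [0, 1, 2]
for _ in range(500):
    U = refresh_subset(U, sq_norm, N, eps=0.05, rng=rng)
print(sorted(U))
```

Because the swap proposal is symmetric, the acceptance ratio reduces to the ratio of target weights, which makes the step reversible with respect to the assumed \(\nu _{n,\epsilon }\).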
Remark 2
The marginal Markov chain \(\{\tilde{\theta }_i,\,i\in \mathbb {N}\}\) produced by ISS-MCMC algorithm is time inhomogeneous since for all \(A\in \mathcal {X}\),
and \(\mathbb {P}(U_i=u\,|\,\tilde{\theta }_i)\) depends on i (Remark 1). We thus denote by \(\tilde{K}_i\) the marginal transition kernel \(\tilde{\theta }_{i-1}\rightarrow \tilde{\theta }_i\). However, we observe that if the random variables \(\{U_i,\,i\in \mathbb {N}\}\) are i.i.d. with distribution \(\nu _{n,\epsilon }\), then \(\tilde{K}_i\) becomes time homogeneous, since \(\mathbb {P}(U_i=u\,|\,\tilde{\theta }_i)=\nu _{n,\epsilon }(u)\) for all i.
A consequence of Remark 2 is that Mitrophanov (2005, Theorem 3.1) does not hold when Assumption A.3 is not satisfied. Indeed, \(\{\tilde{\theta }_i,\,i\in \mathbb {N}\}\) is not a time homogeneous Markov chain in this case and we first need to generalize the result from Mitrophanov in order to apply it to our context. This is presented in Lemma 6.
Lemma 6
Let K be the transition kernel of a uniformly ergodic Markov chain that admits \(\pi \) as stationary distribution. Let \(\tilde{K}_i\) be the i-th transition kernel of the ISS-MCMC Markov chain. In particular, let \(p_i(\,\cdot \,|\,\theta ):=\mathbb {P}(U_i\in \,\cdot \,|\,\theta )\) be the distribution of the random variable \(U_i\), used at iteration i of the noisy Markov chain given \(\theta \). We have:
where \(\delta _i:\Theta \times \Theta \rightarrow \mathbb {R}^+\) is a function that satisfies
and the expectation is under \(p_i(\,\cdot \,|\,\theta )\).
Proof
In addition to the notation of Sect. 4, we define the following quantities for a Markov transition kernel regarded as an operator on \(\mathcal {M}\), the space of signed measures on \((\Theta ,\mathcal {B}(\Theta ))\): \(\tau (K):=\sup _{\pi \in \mathcal {M}_{0,1}}\Vert \pi K\Vert \) is the ergodicity coefficient of K, \(\Vert K\Vert :=\sup _{\pi \in \mathcal {M}_{1}}\Vert \pi K\Vert \) is the operator norm of K and \(\mathcal {M}_1:=\{\pi \in \mathcal {M},\,\Vert \pi \Vert =1\}\) and \(\mathcal {M}_{0,1}:=\{\pi \in \mathcal {M}_1,\,\pi (\Theta )=0\}\).
Remarks 1 and 2 explain why, in general, \(\{\tilde{\theta }_i,\,i\in \mathbb {N}\}\) is a time-inhomogeneous Markov chain with transition kernel \(\{\tilde{K}_i,\,i\in \mathbb {N}\}\). For each \(i\in \mathbb {N}\), define \(\pi _i\) as the distribution of \(\theta _i\) produced by the Metropolis–Hastings algorithm (Algorithm 1) with transition kernel K, referred to as the exact kernel hereafter. Our proof is based on the following identity:
for each \(i\in \mathbb {N}\). Equation (A.18) will help translate the proof of Theorem 3.1 in Mitrophanov (2005) to the time-inhomogeneous setting and in particular, we have for each \(i\in \mathbb {N}\):
Following the proof of Theorem 3.1 in Mitrophanov (2005), we obtain
where \(\lambda =\left\lceil {\log _\rho (1/C)}\right\rceil \). Without loss of generality, we take \(\pi _0=\tilde{\pi }_0\) and since \(\Vert \pi -\tilde{\pi }_i\Vert \le \Vert \pi -\pi _i\Vert +\Vert \pi _i-\tilde{\pi }_i\Vert \) we have for all \(i>\lambda \) that
Taking the limit as \(i\rightarrow \infty \) leads to
Using a derivation similar to that in the proof of Corollary 2.3 in Alquier et al. (2016), we obtain
where the expectation is under \(p_i(\,U\,|\,\theta )\) and which combined with (A.22) leads to
where the expectation is under \(Q(\theta ,\cdot )\otimes p_i(\,\cdot \,|\,\theta )\). Any upper bound \(\delta _i(\theta ,\theta ')\) of the expectation on the right hand side yields (A.17). \(\square \)
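For intuition about the ergodicity coefficient \(\tau (K)\) used in the proof above, consider a finite state space, where K is a row-stochastic matrix. Assuming \(\Vert \cdot \Vert \) is the total variation norm of a signed measure, the supremum over zero-mass measures of unit norm is attained at normalized differences of point masses, which gives \(\tau (K)=\tfrac{1}{2}\max _{i,j}\sum _k|K_{ik}-K_{jk}|\). The sketch below implements this finite-state formula; the function name and the example matrices are hypothetical.

```python
import numpy as np

def ergodicity_coefficient(K):
    """tau(K) = 0.5 * max_{i,j} sum_k |K[i, k] - K[j, k]| for a
    row-stochastic matrix K on a finite state space."""
    K = np.asarray(K, dtype=float)
    # L1 distance between every pair of rows.
    diffs = np.abs(K[:, None, :] - K[None, :, :]).sum(axis=2)
    return 0.5 * diffs.max()

tau_mixing = ergodicity_coefficient([[0.9, 0.1], [0.2, 0.8]])
tau_identity = ergodicity_coefficient(np.eye(3))
tau_iid = ergodicity_coefficient([[0.3, 0.7], [0.3, 0.7]])
print(tau_mixing, tau_identity, tau_iid)
```

A value strictly below 1 means that one step of K contracts differences between distributions, while the identity kernel gives 1 and a kernel with identical rows gives 0.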
By straightforward algebra, we have:
where we have defined \(\phi _U(\theta )=f(Y_U\,|\,\theta )^{N/n}\slash f(Y\,|\,\theta )\). Using Lemma 6, we have that
which is the counterpart of (30) when Assumption A.3 does not hold. We note that the second supremum in Eq. (A.24) is in fact \(B_n\) defined at Eq. (28) and, as such, can be controlled as described in Section 5.3.1. However, it is not clear that this is the case for the first supremum in Eq. (A.24), which differs from \(A_n\) defined at Eq. (27):
We now show that, under two additional Assumptions (A.5 and A.6), the control based on the summary statistics also applies to the time inhomogeneous case when Assumption A.3 does not hold.
A.5
(One-step minorization) For all \(i\in \mathbb {N}\) and all \(A\in \vartheta \), there exists some \(\eta >0\) such that \(p_i(A)>\eta \lambda (A)\) where \(\lambda \) is the Lebesgue measure.
This assumption typically holds if \(\Theta \) is compact or if the chain \(\{\tilde{\theta }_i,U_i\}_i\) admits a minorization condition. In this discussion, we assume that the exact MH Markov chain is uniformly ergodic and, as such, satisfies a minorization condition; see e.g. Meyn and Tweedie (2009, Thm 16.2.3) and Hobert and Robert (2004). One may study conditions under which \(\{\tilde{\theta }_i\}_i\) inherits this property; we leave this for future work but already note that Assumption A.5 is not totally unrealistic.
A 6
The marginal Markov chain \(\{U_i\}_i\) has initial distribution \(U_0\sim \nu _{n,\epsilon }\).
Even though this assumption is difficult to meet in practice as \(|\textsf {U}_n|\) may be very large, the discussion at the beginning of Section 6.1 indicates an approach to set the distribution of \(U_0\) close to \(\nu _{n,\epsilon }\).
Again, while Assumptions A.5 and A.6 are perhaps challenging to guarantee, Proposition 8 aims at giving some level of confidence to the user that the ISS-MCMC method is useful, even when Assumption A.3 does not hold. In addition, it reinforces the importance of choosing summary statistics that satisfy Assumption A.4.
Proposition 8
Assume that Assumptions A.1, A.4, A.5 and A.6 hold. Then there exists a positive number \(M>0\) such that
where \(A_n\) and \(\tilde{A}_n\) have been defined at Eq. (A.25).
Corollary 2
Under the same Assumptions as Proposition 8, the control explained in Sect. 5.3.2 is also valid in the time inhomogeneous case.
Proof of Proposition 8
From Assumption A.4, there exists some \(\gamma >0\) such that
where \(\text {d}p_i(U\,|\,\theta )=p_i(U\,|\,\theta )\text {d}U\) and \(\text {d}U\) is the counting measure. Now, the conditional probability writes:
On the one hand, Lemma 1 shows that there exists a bounded function \(f_i\) such that \(\mathbb {P}(U_i\in \cdot \,,\,\tilde{\theta }_i\in \text {d}\theta )\le f_i(\tilde{\theta })\text {d}\theta \nu _{n,\epsilon }(\,\cdot \,)\). On the other hand, Assumption A.5 guarantees that there exists some \(\eta >0\) such that for all \(\tilde{\theta }\in \Theta \), \(\mathbb {P}(\tilde{\theta }_i\in \text {d}\theta )>\eta \text {d}\theta \). Combining those two facts allows us to write that
Plugging Eq. (A.28) into Eq. (A.27) yields
which completes the proof, setting \(M:=\sup _\theta \sup _i f_i(\theta )/\eta \). \(\square \)
Lemma 1
Assume that Assumptions A.1, A.4, A.5 and A.6 hold. In addition, let us assume that \(U_0\sim \nu _{n,\epsilon }\). Then \(p_i(\theta ,U)\) is dominated by \(\text {d}\theta \text {d}U\) where \(\text {d}\theta \) and \(\text {d}U\) implicitly refer to the Lebesgue and the counting measure, respectively. In other words there is a sequence of bounded functions \(\{f_i:\Theta \rightarrow \mathbb {R}^+\}\) such that
Proof
We proceed by induction. Defining \(\varrho (\tilde{\theta }\,|\,U)\) as the probability to reject a MH move for the parameter \(\tilde{\theta }\) when the subset variable is U, we recall that \(\varrho (\tilde{\theta }\,|\,U)<1\) and \(\tilde{\alpha }(\tilde{\theta },\tilde{\theta }'\,|\,U)<1\). By assumption on the proposal kernel, it satisfies \(Q(\tilde{\theta }, \,\text {d}\tilde{\theta }'\,)=Q(\tilde{\theta },\tilde{\theta }')\text {d}\tilde{\theta }'\) and define the function \(\overline{Q}:\theta \mapsto \sup _{\tilde{\theta }'\in \Theta }Q(\tilde{\theta }',\theta )\). Similarly, we define the function \(\overline{\varrho }:\theta \mapsto \sup _{U\in \textsf {U}_n}\varrho (\theta \,|\,U)\). Deriving the calculation separately for the continuous and the diagonal parts of the Metropolis–Hastings kernel \(K(\theta ,\cdot \,|\,U)\) (see Eq. (25)), we have:
where the last equality follows from the \(\nu _{n,\epsilon }\)-stationarity of H. In this derivation, we have defined \(\mu \) as the initial distribution of the Markov chain \(\{\tilde{\theta }_i\}_i\) and \(\nu \) as a shorthand notation for \(\nu _{n,\epsilon }\). Now, let us assume that there is a bounded function \(f_{i-1}\) such that \(\text {d}p_{i-1}(U,\tilde{\theta })\le f_{i-1}(\tilde{\theta })\text {d}\tilde{\theta }\text {d}\nu (U)\). Using the notation \(\mu K:=\int \mu (\text {d}x)K(x,\cdot )\) for any Markov kernel K and a measure \(\mu \) on some measurable space \((\textsf {X},\mathcal {X})\) and recalling that \(\bar{K}\) is the transition kernel of ISS-MCMC on the extended space \(\Theta \times \textsf {U}_n\), we have:
and \(f_i\) is bounded. The first term in the third inequality follows from noting that
\(\square \)
1.7 Proof of Proposition 6
Proof
Note that for all \((\theta ,\zeta )\in \Theta \times \mathbb {R}^d\), a Taylor expansion of \(\pi (\theta )\) and \(\phi _U(\theta )\) at \(\theta +\Sigma \zeta \) in (32) combined with the triangle inequality leads to:
where the expectation is under \(\Phi _d\) and \(R(x)=o(x)\) at 0. Applying the Cauchy–Schwarz inequality gives:
Now, we observe that:
- \(\mathbb {E}\{\Vert M\zeta \Vert \}=\mathbb {E}\{(\sum _{i=1}^d(\sum _{j=1}^d M_{i,j}\zeta _{j})^2)^{1/2}\}\le \mathbb {E}\{\sum _{i=1}^d|\sum _{j=1}^d M_{i,j}\zeta _{j}|\}\le \mathbb {E}\{\sum _{i=1}^d\sum _{j=1}^d|M_{i,j}||\zeta _{j}|\}=\sum _{i=1}^d\sum _{j=1}^d|M_{i,j}|\mathbb {E}\{|\zeta _j|\}=\sqrt{\frac{2}{\pi }}\Vert M\Vert _1 \)
- \(\mathbb {E}\{\Vert M\zeta \Vert ^2\}=\mathbb {E}\{\sum _{i=1}^d(\sum _{j=1}^d M_{i,j} \zeta _j)^2\}=\sum _{i=1}^d\mathbb {E}\{(\sum _{j=1}^dM_{i,j}\zeta _j)^2\} =\sum _{i=1}^{d}\text {var}(\sum _{j=1}^d M_{i,j} \zeta _j)=\sum _{i=1}^{d}\sum _{j=1}^d M_{i,j}^2 \text {var}(\zeta _j)=\Vert M\Vert _2^2 \)
- Considering the quadratic form associated with the operator \(T(U,\theta )=M^{T}\nabla _{\theta }^2\phi _U(\theta ) M\) and noting that \(T(U,\theta )\) is symmetric, its eigenvalues \(\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _d\) are real and we have
$$\begin{aligned} \zeta ^{T}T(U,\theta ) \zeta \le \lambda _1\Vert \zeta \Vert ^2 \end{aligned}$$
so that, since \(\mathbb {E}\{\Vert \zeta \Vert ^2\}=d\):
$$\begin{aligned}&\mathbb {E}\left\{ |(M\zeta )^{T}\nabla _\theta ^2\phi _U(\theta )M\zeta |\right\} \nonumber \\&\quad \le d \sup _{i}|\lambda _i|\le d |\!|\!|M^{T}\nabla _{\theta }^2\phi _U(\theta ) M|\!|\!|\end{aligned}$$
where for any square matrix A, we have defined \(|\!|\!|A|\!|\!|=\sup _{x\in \mathbb {R}^d,\Vert x\Vert =1}\Vert Ax\Vert \) as the operator norm.
\(\square \)
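The Gaussian moment identities listed in the proof above can be checked by simulation. In the sketch below, the matrix M, the dimension d, the seed and the sample size are arbitrary choices for illustration; \(\Vert M\Vert _1\) is taken as the sum of absolute entries and \(\Vert M\Vert _2^2\) as the sum of squared entries, matching the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.normal(size=(d, d))                    # arbitrary test matrix
zeta = rng.standard_normal(size=(200_000, d))  # draws of zeta ~ N(0, I_d)

# E|zeta_j| for a standard normal coordinate.
mc_abs = np.abs(zeta[:, 0]).mean()
exact_abs = np.sqrt(2.0 / np.pi)

# E||M zeta||^2 against the sum of squared entries of M.
Mz = zeta @ M.T
mc_sq = np.sum(Mz**2, axis=1).mean()
exact_sq = np.sum(M**2)

# E||M zeta|| against the bound sqrt(2/pi) * sum_ij |M_ij|.
mc_norm = np.linalg.norm(Mz, axis=1).mean()
bound = exact_abs * np.abs(M).sum()

print(mc_abs, exact_abs)
print(mc_sq, exact_sq)
print(mc_norm, bound)
```

The first two Monte Carlo estimates should agree with their closed forms up to simulation error, while the third quantity only needs to stay below its bound.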
Proof of Proposition 7
In this section, we assume that there is an infinite stream of observations \((Y_1,Y_2,\ldots )\) and a parameter \(\theta _0\in \Theta \) such that \(Y_i\sim f(\,\cdot \,|\,\theta _0)\). Let \(\rho >1\) be a constant defined as the ratio N/n, i.e., the size of the full dataset over the size of the subsamples of interest. The full dataset is thus \(Y_{1:\rho n}\). We define the set
such that \(Y_U\) (\(U\in \textsf {U}_n^\rho \)) is the set of subsamples of interest. We study the asymptotics when \(n\rightarrow \infty \), i.e., we let the whole dataset and the size of subsamples of interest grow at the same rate.
Proposition 7
Let \(\theta ^{*}_{\rho n}\) be the MLE of \(Y_1,\ldots ,Y_{\rho n}\) and \(\theta ^{*}_U\) be the MLE of the subsample \(Y_U\) (\(U\in \textsf {U}_n^\rho \)). Assume that there exists a compact set \(\kappa _n\subset \Theta \) such that \((\theta ^{*}_{\rho n},\theta _0)\in \kappa _n^2\) and for all U, there exists a compact set \(\kappa _U\subset \Theta \) such that \((\theta ^{*}_U,\theta _0)\in \kappa _U^2\). Then, there exist a constant \(\beta \), a metric \(\Vert \cdot \Vert _{\theta _0}\) on \(\Theta \) and a non-decreasing subsequence \(\{\sigma _n\}_{n\in \mathbb {N}}\), (\(\sigma _n\in \mathbb {N}\)) such that for all \(U\in \textsf {U}_{\sigma _n}^\rho \), we have for p-almost all \(\theta \in \kappa _n\cap \kappa _U\)
where
Proof
Fix \(n\in \mathbb {N}\). Consider the case where the prior distribution p is uniform on \(\kappa _n\). In this case, the posterior is
and from Corollary 3, we know that there exists a subsequence \(\tau _n\subset \mathbb {N}\) such that for p-almost all \(\theta \in \kappa _n\)
where \(\theta \mapsto \Phi _{\rho \tau _n}(\theta )\) is the pdf of \(\mathcal {N}(\theta ^{*}_{\rho \tau _n},I(\theta _0)^{-1}/\rho \tau _n)\). Similarly, there exists another subsequence \(\gamma _n\subset \mathbb {N}\) such that for all \(U\in \textsf {U}_{\gamma _n}^\rho \) and for p-almost all \(\theta \in \kappa _U\)
where \(\theta \mapsto \Phi _{U}(\theta )\) is the pdf of \(\mathcal {N}(\theta ^{*}_{U},I(\theta _0)^{-1}/|U|)\). Let \(\{\sigma _n\}_{n\in \mathbb {N}}\) be the sequence defined as \(\sigma _n=\max \{\tau _n,\gamma _n\}\). We know from (B.2) and (B.3) that for all \(\varepsilon >0\) and all \(\eta >0\), there exists \(n_1\in \mathbb {N}\) such that for all \(U\in \textsf {U}_{\sigma _n}^{\rho }\) and for all \(n\ge n_1\)
Now, by straightforward algebra, we have for any \(U\in \textsf {U}_{\sigma _n}^{\rho }\)
where we have used Lemma 3 for the first inequality and the triangle inequalities for the second. Combining (B.5) with (B.4) yields (34). \(\square \)
Lemma 2
Consider a posterior distribution \(\pi _n\) given n data \(Y_{1:n}\) where p is the prior distribution and its Bernstein-von Mises approximation is \(\Phi _n=\mathcal {N}(\theta ^{*}(Y_{1:n}),I(\theta _0)^{-1}/n)\). There exists a subsequence \(\{\tau _n\}_n\subset \mathbb {N}\) such that
Proof
This follows from the fact that convergence in \(L_1\) implies pointwise convergence almost everywhere of a subsequence, i.e., there exists a subsequence \(\{\tau _n\}_{n\in \mathbb {N}}\subset \mathbb {N}\) such that
Eq. (B.6) follows from combining the Bernstein–von Mises theorem and Eq. (B.7):
\(\square \)
Corollary 3
There exists a subsequence \(\{\tau _n\}_{n\in \mathbb {N}}\subset \mathbb {N}\) such that
Proof
Follows from Lemma 2, by continuity of the logarithm.
Lemma 3
For any \(U\in \textsf {U}_n\), let \(\theta \mapsto \Phi _U(\theta )\) be the pdf of \(\mathcal {N}(\theta _U^*,I(\theta _0)^{-1}/n)\) and \(\Phi _{\rho n}\) be the pdf of \(\mathcal {N}(\theta _{\rho n}^*,I(\theta _0)^{-1}/\rho n)\), the Bernstein–von Mises approximations of \(\pi (\,\cdot \,|\,Y_U)\) and \(\pi (\cdot \,|\,Y_{1:\rho n})\) respectively, where \(U\subset \textsf {U}_n(Y_{1:\rho n})\). Then we have for all \(\theta \in \Theta \)
where for any \(d\times d\) symmetric matrix M, we denote by \(\Vert \cdot \Vert _M\) the norm associated with the scalar product \(\left\langle u , v\right\rangle _M=u^{T}M v\).
Proof
This follows from straightforward algebra and noting that
\(\square \)
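The norm \(\Vert \cdot \Vert _M\) induced by the scalar product \(\left\langle u,v\right\rangle _M=u^{T}Mv\) can be evaluated as in the sketch below. The function name and the example matrices are hypothetical, and the sketch assumes M is symmetric positive definite (as a Fisher information matrix would be) so that the quantity is a genuine norm.

```python
import numpy as np

def m_norm(u, M):
    """||u||_M = sqrt(u^T M u) for a symmetric positive definite matrix M."""
    u = np.asarray(u, dtype=float)
    M = np.asarray(M, dtype=float)
    return float(np.sqrt(u @ M @ u))

# With M = identity this is the Euclidean norm.
euclid = m_norm([3.0, 4.0], np.eye(2))
# Scaling M by 2 scales the norm by sqrt(2).
scaled = m_norm([1.0, 1.0], [[2.0, 0.0], [0.0, 2.0]])
print(euclid, scaled)
```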
Maire, F., Friel, N. & Alquier, P. Informed sub-sampling MCMC: approximate Bayesian inference for large datasets. Stat Comput 29, 449–482 (2019). https://doi.org/10.1007/s11222-018-9817-3