
Subsampling MCMC - an Introduction for the Survey Statistician

Abstract

The rapid development of computing power and efficient Markov Chain Monte Carlo (MCMC) simulation algorithms has revolutionized Bayesian statistics, making it a highly practical inference method in applied work. However, MCMC algorithms tend to be computationally demanding, and are particularly slow for large datasets. Data subsampling has recently been suggested as a way to make MCMC methods scale to massively large data, utilizing efficient sampling schemes and estimators from the survey sampling literature. These developments tend to be unknown to many survey statisticians, who traditionally work with non-Bayesian methods and rarely use MCMC. Our article explains the idea of data subsampling in MCMC by reviewing one strand of work, Subsampling MCMC, a so-called pseudo-marginal MCMC approach to speeding up MCMC through data subsampling. The review is written for a survey statistician without previous knowledge of MCMC methods, since our aim is to motivate survey sampling experts to contribute to the growing Subsampling MCMC literature.


References

  • Alquier, P., Friel, N., Everitt, R. and Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Stat. Comput. 26, 1-2, 29–47.


  • Andrieu, C. and Roberts, G.O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37, 2, 697–725.


  • Bardenet, R., Doucet, A. and Holmes, C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning, pp. 405–413.

  • Bardenet, R., Doucet, A. and Holmes, C. (2017). On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18, 47, 1–43.


  • Beaumont, M.A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164, 3, 1139–1160.


  • Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In International Conference on Machine Learning, pp. 533–540.

  • Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.

  • Bierkens, J., Fearnhead, P. and Roberts, G.O. (2018). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Annals of Statistics, forthcoming.

  • Blei, D.M., Kucukelbir, A. and McAuliffe, J.D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association 112, 518, 859–877.


  • Bouchard-Côté, A., Vollmer, S.J. and Doucet, A. (2018). The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113, 522, 855–867.


  • Brooks, S., Gelman, A., Jones, G. and Meng, X.-L (2011). Handbook of Markov chain Monte Carlo. CRC Press, Boca Raton.


  • Ceperley, D. and Dewing, M. (1999). The penalty method for random walks with uncertain energies. J. Chem. Phys. 110, 20, 9812–9820.


  • Chen, C.-F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. J. R. Stat. Soc. Ser. B Stat. Methodol. 47, 3, 540–546.


  • Chen, T., Fox, E. and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691.

  • Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N. and Villani, M. (2017). Hamiltonian Monte Carlo with energy conserving subsampling. arXiv:1708.00955.

  • Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1, 107–113.


  • Del Moral, P. (2004). Feynman-Kac formulae: genealogical and interacting particle systems with applications. Springer, Berlin.


  • Deligiannidis, G., Doucet, A. and Pitt, M.K. (2018). The correlated pseudo-marginal method. Journal of the Royal Statistical Society B, forthcoming.

  • Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87, 418, 376–382.


  • Doucet, A., De Freitas, N. and Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Springer.

  • Doucet, A., Pitt, M., Deligiannidis, G. and Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika 102, 2, 295–313.


  • Duane, S., Kennedy, A.D., Pendleton, B.J. and Roweth, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195, 2, 216–222.


  • Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood: particle filter analysis of dynamic economic models. Econometric Theory 27, 5, 933–956.


  • Gelman, A., Vehtari, A., Jylänki, P., Sivula, T., Tran, D., Sahai, S., Blomstedt, P., Cunningham, J.P., Schiminovich, D. and Robert, C. (2017). Expectation Propagation as a way of life: A framework for Bayesian inference on partitioned data. arXiv:1412.4869.

  • Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 6, 721–741.


  • Gunawan, D., Kohn, R., Quiroz, M., Dang, K.-D. and Tran, M.-N. (2018). Subsampling sequential Monte Carlo for static Bayesian models. arXiv:1805.03317.

  • Hammersley, J.M. and Handscomb, D.C. (1964). Monte Carlo methods. Chapman and Hall, London.


  • Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97–109.


  • Hoffman, M.D. and Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1, 1593–1623.


  • Joe, H. (2014). Dependence modeling with copulas. CRC Press, Boca Raton.


  • Jordan, M.I., Ghahramani, Z., Jaakkola, T.S. and Saul, L.K. (1999). An introduction to variational methods for graphical models. Mach. Learn. 37, 2, 183–233.


  • Korattikara, A., Chen, Y. and Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, pp. 181–189.

  • Lin, L., Liu, K. and Sloan, J. (2000). A noisy Monte Carlo algorithm. Phys. Rev. D 61, 7, 074505.


  • Lyne, A.-M., Girolami, M., Atchade, Y., Strathmann, H. and Simpson, D. (2015). On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Stat. Sci. 30, 4, 443–467.


  • Maclaurin, D. and Adams, R.P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014).

  • Maire, F., Friel, N. and Alquier, P. (2018). Informed sub-sampling MCMC: Approximate Bayesian inference for large datasets. Statistics and Computing, forthcoming.

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 6, 1087–1092.


  • Minka, T.P. (2001). Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc.

  • Minsker, S., Srivastava, S., Lin, L. and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pp. 1656–1664.

  • Neal, R.M. (2011). MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G. and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton.


  • Neiswanger, W., Wang, C. and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780.

  • Nemeth, C. and Sherlock, C. (2018). Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Anal. 13, 2, 507–530.


  • Nicholls, G.K., Fox, C. and Watt, A.M. (2012). Coupled MCMC with a randomized acceptance probability. arXiv:1205.6857.

  • Papaspiliopoulos, O. (2009). A methodological framework for Monte Carlo probabilistic inference for diffusion processes. Manuscript. Available at http://wrap.warwick.ac.uk/35220/1/WRAP_Papaspiliopoulos_09-31w.pdf.

  • Pitt, M.K., dos Santos Silva, R., Giordani, P. and Kohn, R. (2012). On some properties of Markov Chain Monte Carlo simulation methods based on the particle filter. J. Econ. 171, 2, 134–151.


  • Plummer, M., Best, N., Cowles, K. and Vines, K. (2006). Coda: Convergence diagnosis and output analysis for MCMC. R News 6, 1, 7–11.


  • Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2018a). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, forthcoming.

  • Quiroz, M., Tran, M.-N., Villani, M. and Kohn, R. (2018b). Speeding up MCMC by delayed acceptance and data subsampling. J. Comput. Graph. Stat. 27, 12–22.


  • Quiroz, M., Tran, M.-N., Villani, M., Kohn, R. and Dang, K.-D. (2018c). The block-Poisson estimator for optimally tuned exact Subsampling MCMC. arXiv:1603.08232.

  • Quiroz, M., Villani, M. and Kohn, R. (2014). Speeding up MCMC by efficient data subsampling. arXiv:1603.08232v1.

  • Rhee, C. and Glynn, P.W. (2015). Unbiased estimation with square root convergence for SDE models. Oper. Res. 63, 5, 1026–1043.


  • Roberts, G.O., Gelman, A. and Gilks, W.R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7, 1, 110–120.


  • Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B Stat. Methodol. 71, 2, 319–392.


  • Särndal, C.-E., Swensson, B. and Wretman, J. (2003). Model assisted survey sampling. Springer Science & Business Media, Berlin.


  • Scott, S.L., Blocker, A.W., Bonassi, F.V., Chipman, H.A., George, E.I. and McCulloch, R.E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. Int. J. Manag. Sci. Eng. Manag. 11, 2, 78–88.


  • Sherlock, C., Thiery, A.H., Roberts, G.O. and Rosenthal, J.S. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. Ann. Stat. 43, 1, 238–275.


  • Steel, D. and McLaren, C. (2009). Design and analysis of surveys repeated over time. Handbook of Statist. 29, 289–313.


  • Van der Vaart, A.W. (1998). Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge.


  • Wagner, W. (1988). Unbiased multi-step estimators for the Monte Carlo evaluation of certain functional integrals. J. Comput. Phys. 79, 2, 336–352.


  • Wang, X. and Dunson, D.B. (2014). Parallel MCMC via Weierstrass sampler. arXiv:1312.4605v2.


Acknowledgements

Matias Quiroz and Robert Kohn were partially supported by Australian Research Council Centre of Excellence grant CE140100049.

Author information

Correspondence to Matias Quiroz.

Appendices

Appendix A: Algorithms

This appendix contains the main sampling algorithms discussed in the paper.

[Algorithm listings not reproduced here.]
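To make the pseudo-marginal idea concrete without the algorithm listings, here is a minimal sketch. Everything in it is ours, not the paper's: the toy normal model, the names `loglik_hat` and `subsampling_mh`, and all tuning constants are illustrative assumptions. It plugs a simple-random-sampling estimate of the log-likelihood into a Metropolis-Hastings acceptance ratio, refreshing the estimate only for the proposal; note that an unbiased estimator of the log-likelihood does not yield an unbiased estimator of the likelihood itself, which is exactly the complication the Subsampling MCMC literature addresses, and which this sketch ignores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n observations from N(theta_true, 1); m is the subsample size.
n, m = 10_000, 100
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=n)

def loglik_i(theta, yi):
    # Per-observation log-density of N(theta, 1), additive constants dropped.
    return -0.5 * (yi - theta) ** 2

def loglik_hat(theta):
    # Simple-random-sampling (expansion) estimator of the full-data
    # log-likelihood: (n/m) times the subsample sum.
    idx = rng.choice(n, size=m, replace=False)
    return (n / m) * loglik_i(theta, y[idx]).sum()

def subsampling_mh(n_iter=2_000, step=0.05):
    # Random-walk Metropolis-Hastings with the log-likelihood replaced by
    # its subsample estimate. As in pseudo-marginal MCMC, the estimate at
    # the current state is held fixed; only the proposal's is refreshed.
    theta = 0.0
    ll = loglik_hat(theta)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        ll_prop = loglik_hat(prop)
        # Flat prior: accept with probability min(1, exp(ll_prop - ll)).
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        draws[t] = theta
    return draws

draws = subsampling_mh()
```

Holding the current state's estimate fixed, rather than re-estimating it each iteration, is the defining trait of pseudo-marginal samplers.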

Appendix B: Details for the Poisson Regression Example

This appendix gives the details for the control variates in our illustrative Poisson regression example. Quiroz et al. (2018a) give general expressions for the gradients and Hessians in the GLM class, and provide a general compact expression that reduces the computational complexity of the control variates.

The Poisson Regression Model

The Poisson regression model is of the form

$$y_{i} | \mathbf{x}_{i}, \boldsymbol{\theta} \overset{indep.}{\sim} \text{Pois}(\lambda_{i}), \hspace{0.5cm} \lambda_{i} = \exp(\alpha + \mathbf{x}_{i}^{T} \beta), $$

where \(\boldsymbol{\theta} = (\alpha, \beta^{T})^{T}\).
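The model above can be simulated and its full-data log-likelihood evaluated in a few lines. This is our own illustrative sketch; the sample size, dimension, and parameter values are arbitrary assumptions, and it serves as the exact baseline against which control variates can be checked.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)

# Simulated data from the model: y_i | x_i ~ Pois(exp(alpha + x_i' beta)).
n, d = 1_000, 2
alpha, beta = 0.5, np.array([0.3, -0.2])
X = rng.normal(size=(n, d))
y = rng.poisson(np.exp(alpha + X @ beta))

def loglik(theta, X, y):
    # Full-data Poisson log-likelihood at theta = (alpha, beta').
    eta = theta[0] + X @ theta[1:]                      # log lambda_i
    log_fact = np.array([lgamma(v + 1.0) for v in y])   # log(y_i!)
    return float(np.sum(y * eta - np.exp(eta) - log_fact))

theta_true = np.concatenate(([alpha], beta))
```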


Parameter-Expanded Control Variates

Let \(\mathbf{w}_{i} = (1, \mathbf{x}_{i}^{T})^{T}\). The log-likelihood contribution from the \(i\)th observation is

$$\ell_{i}(\boldsymbol{\theta}) = y_{i} \mathbf{w}_{i}^{T} \boldsymbol{\theta} - \exp(\mathbf{w}_{i}^{T} \boldsymbol{\theta}) - \log(y_{i}!) $$

with gradient and Hessian

$$\nabla_{\boldsymbol{\theta}} \ell_{i}(\boldsymbol{\theta}) = (y_{i} - \exp(\mathbf{w}_{i}^{T} \boldsymbol{\theta}))\mathbf{w}_{i} $$
$$\nabla_{\boldsymbol{\theta} \boldsymbol{\theta}^{T}}^{2} \ell_{i}(\boldsymbol{\theta}) = - \exp(\mathbf{w}_{i}^{T} \boldsymbol{\theta})\mathbf{w}_{i}\mathbf{w}_{i}^{T}. $$

Let \(\mu(\boldsymbol{\theta}, \mathbf{x}) = \alpha + \mathbf{x}^{T}\beta = \mathbf{w}^{T}\boldsymbol{\theta}\). The parameter-expanded control variate in (3.4) is then

$$\begin{array}{@{}rcl@{}} \ell_{i}(\boldsymbol{\theta}) &\approx& y_{i} \mu(\boldsymbol{\hat \theta},\mathbf{x}_{i}) - \exp(\mu(\boldsymbol{\hat \theta},\mathbf{x}_{i})) - \log(y_{i}!) \\ &&+ [y_{i} - \exp(\mu(\boldsymbol{\hat \theta},\mathbf{x}_{i}))](\mu(\boldsymbol{\theta},\mathbf{x}_{i})-\mu(\boldsymbol{\hat \theta},\mathbf{x}_{i})) \\ &&- \frac{1}{2}\exp(\mu(\boldsymbol{\hat \theta},\mathbf{x}_{i}))(\mu(\boldsymbol{\theta},\mathbf{x}_{i})-\mu(\boldsymbol{\hat \theta}, \mathbf{x}_{i}))^{2}. \end{array} $$
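This approximation can be transcribed directly and checked against the exact \(\ell_{i}\). The sketch below is ours: the toy data, the expansion point, and the name `q_param` are illustrative assumptions, not code from the paper.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)

# Toy Poisson-regression data, simulated only to exercise the formulas.
n, d = 500, 2
theta_hat = np.array([0.4, 0.3, -0.2])   # expansion point (alpha, beta')
X = rng.normal(size=(n, d))
W = np.column_stack([np.ones(n), X])     # rows w_i = (1, x_i')'
mu_hat = W @ theta_hat                   # mu(theta_hat, x_i)
y = rng.poisson(np.exp(mu_hat))
log_fact = np.array([lgamma(v + 1.0) for v in y])

def ell(theta):
    # Exact per-observation log-likelihood contributions ell_i(theta).
    mu = W @ theta
    return y * mu - np.exp(mu) - log_fact

def q_param(theta):
    # Parameter-expanded control variate: second-order Taylor expansion of
    # ell_i around theta_hat, written in terms of mu_i = w_i' theta.
    mu = W @ theta
    diff = mu - mu_hat
    return (y * mu_hat - np.exp(mu_hat) - log_fact
            + (y - np.exp(mu_hat)) * diff
            - 0.5 * np.exp(mu_hat) * diff ** 2)
```

At \(\boldsymbol{\theta} = \boldsymbol{\hat\theta}\) the control variate reproduces \(\ell_{i}\) exactly, and its error shrinks cubically as \(\boldsymbol{\theta}\) approaches \(\boldsymbol{\hat\theta}\).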

Data-Expanded Control Variates

The log-likelihood contribution from the \(i\)th observation is

$$\ell_{i}(\boldsymbol{\theta}) = y_{i} (\alpha + \mathbf{x}_{i}^{T} \beta) - \exp(\alpha + \mathbf{x}_{i}^{T} \beta) - \log(y_{i}!) $$

with gradient and Hessian

$$\nabla_{y_{i}} \ell_{i}(\boldsymbol{\theta}) = \alpha + \mathbf{x}_{i}^{T} \beta - \psi_{0}(y_{i}+1), $$

where \(\psi_{k}(z) = \nabla_{z}^{k+1} \log {\Gamma}(z)\) is the polygamma function of order \(k\),

$$\begin{array}{*{20}l} \nabla_{\mathbf{x}_{i}} \ell_{i}(\boldsymbol{\theta}) &= (y_{i} - \exp(\alpha + \mathbf{x}_{i}^{T} \beta)) \beta, \hspace{0.1cm} \nabla_{y_{i} y_{i}}^{2} \ell_{i}(\boldsymbol{\theta}) = - \psi_{1}(y_{i}+1), \\ \nabla_{\mathbf{x}_{i} \mathbf{x}_{i}^{T}}^{2} \ell_{i}(\boldsymbol{\theta}) &= - \exp(\alpha + \mathbf{x}_{i}^{T} \beta) \beta \beta^{T}, \text{ and } \nabla_{y_{i} \mathbf{x}_{i}^{T}}^{2} \ell_{i}(\boldsymbol{\theta}) = \beta^{T}. \end{array} $$

We can write the gradient and Hessian compactly by defining \(\mathbf{z}_{i}=(y_{i},\mathbf{x}_{i}^{T})^{T}\),

$$\nabla_{\mathbf{z}_{i}} \ell_{i}(\boldsymbol{\theta}) = \left[\begin{array}{l} \alpha + \mathbf{x}_{i}^{T} \beta - \psi_{0}(y_{i}+1) \\ (y_{i} - \exp(\alpha + \mathbf{x}_{i}^{T} \beta)) \beta \end{array}\right] $$
$$\nabla_{\mathbf{z}_{i} \mathbf{z}_{i}^{T}}^{2} \ell_{i}({\boldsymbol{\theta}}) = \left[\begin{array}{ll} - \psi_{1}(y_{i}+1) & \beta^{T} \\ \beta & - \exp(\alpha + \mathbf{x}_{i}^{T} \beta) \beta \beta^{T} \end{array}\right]. $$

Let \(\mu(\boldsymbol{\theta}, \mathbf{x}) = \alpha + \mathbf{x}^{T}\beta\). The data-expanded control variate in Eq. 3.5 can, after some simplifications, be expressed as

$$\begin{array}{@{}rcl@{}} \ell_{i}(\boldsymbol{\theta}) &\approx& y_{c_{i}} \mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}) - \exp(\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}})) - \log(y_{c_{i}}!) \\ &&+ (y_{i} - y_{c_{i}})(\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}) - \psi_{0}(y_{c_{i}}+ 1)) -\frac{1}{2} (y_{i} - y_{c_{i}})^{2}\psi_{1}(y_{c_{i}}+ 1) \\ &&+[y_{i}-\exp(\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}))] (\mu(\boldsymbol{\theta}, \mathbf{x}_{i})-\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}) ) \\ &&-\frac{1}{2} \exp(\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}))(\mu(\boldsymbol{\theta}, \mathbf{x}_{i})-\mu(\boldsymbol{\theta}, \mathbf{x}_{c_{i}}) )^{2} . \end{array} $$
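Because the \(y_{i}\) are counts, the digamma and trigamma values needed here can be computed from the integer-argument identities \(\psi_{0}(n) = -\gamma + \sum_{k=1}^{n-1} 1/k\) and \(\psi_{1}(n) = \pi^{2}/6 - \sum_{k=1}^{n-1} 1/k^{2}\). The sketch below is our own: the centroid \((y_{c}, \mathbf{x}_{c})\), the parameter values, and the name `q_data` are illustrative assumptions. It transcribes the displayed expansion and checks that it is exact at the centroid.

```python
import numpy as np
from math import lgamma, pi

EULER = 0.5772156649015329  # Euler-Mascheroni constant gamma

def psi0(n):
    # Digamma at a positive integer: psi_0(n) = -gamma + H_{n-1}.
    return -EULER + sum(1.0 / k for k in range(1, n))

def psi1(n):
    # Trigamma at a positive integer: psi_1(n) = pi^2/6 - sum_{k<n} 1/k^2.
    return pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, n))

def ell(theta, x, yv):
    # Exact log-likelihood contribution for one observation (x, yv).
    mu = theta[0] + x @ theta[1:]
    return yv * mu - np.exp(mu) - lgamma(yv + 1.0)

def q_data(theta, x, yv, xc, yc):
    # Data-expanded control variate: second-order Taylor expansion of
    # ell_i in the data around the centroid observation (yc, xc).
    mu_i = theta[0] + x @ theta[1:]
    mu_c = theta[0] + xc @ theta[1:]
    return (yc * mu_c - np.exp(mu_c) - lgamma(yc + 1.0)
            + (yv - yc) * (mu_c - psi0(yc + 1))
            - 0.5 * (yv - yc) ** 2 * psi1(yc + 1)
            + (yv - np.exp(mu_c)) * (mu_i - mu_c)
            - 0.5 * np.exp(mu_c) * (mu_i - mu_c) ** 2)

theta = np.array([0.4, 0.3, -0.2])
xc, yc = np.array([0.5, -1.0]), 2
```

At the centroid the expansion agrees with the exact \(\ell_{i}\), and the approximation error is third order in the distance of \((y_{i}, \mathbf{x}_{i})\) from \((y_{c}, \mathbf{x}_{c})\).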


About this article


Cite this article

Quiroz, M., Villani, M., Kohn, R. et al. Subsampling MCMC - an Introduction for the Survey Statistician. Sankhya A 80 (Suppl 1), 33–69 (2018). https://doi.org/10.1007/s13171-018-0153-7


Keywords and phrases.

  • Pseudo-Marginal MCMC
  • Difference estimator
  • Hamiltonian Monte Carlo (HMC).

AMS (2000) subject classification.

  • Primary 62-02
  • Secondary 62D05