1 Introduction

One may reasonably balk at the terms “computational statistics” and “Bayesian computation” since, from its very start, statistics has always involved some computational step to extract information, something manageable like an estimator or a prediction, from raw data. This necessarily incomplete and unavoidably biased review of the recent past, current state, and immediate future of algorithms for Bayesian inference thus first requires us to explain what we mean by computation in a statistical context, before turning to what we perceive as medium-term solutions and possible dead ends.

Computations are an issue in statistics whenever processing a dataset becomes a difficulty, a liability, or even an impossibility. Obviously, the computational challenge varies according to the time when it is faced: what was an issue in the nineteenth century is most likely not so any longer (take for instance the derivation of the moment estimates of a mixture of two normal distributions so painstakingly set out by Pearson 1894 for estimating the ratio of “forehead” breadth to body length on a dataset of 1000 crabs, or the intense algebraic derivations found in the analysis of variance of the 1950s and 1960s, Searle et al. 1992).

The introduction of simulation tools in the 1940s followed hard on the heels of the invention of the computer and certainly contributed an impetus towards faster and better computers, at least in the first decade of this revolution, showing that these tools were both needed and unavailable without electronic calculators. The introduction of Markov chain Monte Carlo (MCMC) is harder to pin down, as some partial versions can be traced all the way back to 1944–1945 and the Manhattan project at Los Alamos (Metropolis 1987). Surprisingly, it was only by the early 1990s that such methods became part of the Bayesian toolbox, that is, some time after the devising of other computer-dependent tools like the bootstrap or the EM algorithm, and despite the availability of personal computers that considerably eased programming and experimenting (Robert and Casella 2011). It is presumably pointless to try to attribute this delay to a definite cause, but a certain lack of probabilistic culture within the statistics community is probably partly to blame.

What makes this time-lag in the assimilation of MCMC methods into the statistics community even more surprising is the fact that a significant role for Bayesian inference in statistical practice was really on hold pending the discovery of flexible computational tools that (implicitly or explicitly) delivered values for the medium- to high-dimensional integrals that underpin the calculation of posterior distributions, in all but toy problems where conjugacy provided explicit answers. In fact, until Bayesians discovered MCMC, the only computational methodology that seemed to offer much chance of making Bayesian statistics practical was the portfolio of quadrature methods developed under Adrian Smith’s leadership at Nottingham (Naylor and Smith 1982; Smith et al. 1985, 1987).

The very first article in the first issue of Statistics and Computing, whose quarter-century we celebrate in this special issue, was (to the journal’s credit!) on Bayesian analysis, and was precisely in this direction of using clever quadrature methods to approach moderately high-dimensional posterior analysis (Dellaportas and Wright 1991). By the next (second) issue, sampling-based methods had started to appear, with three papers out of five in the issue on or related to Gibbs sampling (Verdinelli and Wasserman 1991; Carlin and Gelfand 1991; Wakefield et al. 1991).

Reflecting now upon the evolution of MCMC methods over the 25 or so years they have been at the forefront of Bayesian inference, their focus has travelled a long way: from the hierarchical models extending linear, mixed and generalised linear models (Albert 1988; Carlin et al. 1992; Bennett et al. 1996), and the graphical models stemming from image analysis (Geman and Geman 1984) and artificial intelligence, which were the initial targets, to dynamical models driven by ODEs (Wilkinson 2011b) and diffusions (Roberts and Stramer 2001; Dellaportas et al. 2004; Beskos et al. 2006), hidden trees (Larget and Simon 1999; Huelsenbeck and Ronquist 2001; Chipman et al. 2008; Aldous et al. 2008) and graphs, along with decision making in highly complex graphical models. While research on MCMC theory and methodology is still active and continually branching (Papaspiliopoulos et al. 2007; Andrieu and Roberts 2009; Łatuszyński et al. 2011; Douc and Robert 2011), progressively incorporating the capacities of parallel processors and GPUs (Lee et al. 2009; Jacob et al. 2011; Strid 2010; Suchard et al. 2010; Scott et al. 2013; Calderhead 2014), we wonder whether we are not currently facing a new era where those methods are no longer appropriate for the analysis of new models, and of new formulations where models are no longer completely defined. We indeed believe that imprecise models, incomplete information and summarised data will become, if they are not already, a central aspect of statistical analysis, owing to the massive influx of data and the need to provide non-statisticians with efficient tools. This is why we also cover in this survey the notion of approximate Bayesian computation (ABC) and comment on the use of optimisation tools.

The plan of the paper is that in Sects. 2 and 3 we discuss recent progress and current issues in Markov chain Monte Carlo and ABC, respectively. In Sect. 4, we highlight some areas of modern optimisation that, through lack of familiarity, are making less impact in the mainstream of Bayesian computation than we think justified. Our Discussion in Sect. 5 raises issues about data science and relevance to applications, and looks to the future.

2 MCMC, targeting the posterior

When MCMC techniques were introduced to the mainstream statistical (Bayesian) community in 1990, they were received with scepticism that they could one day become the central tool of Bayesian inference. For instance, despite the assurance provided by the ergodic theorem, many researchers thought at first that the convergence of those algorithms was a mere theoretical anticipation rather than a practical reality, in contrast to traditional Monte Carlo methods, and hence that they could not be trusted to provide “exact” answers. This perspective is obviously obsolete by now, when MCMC output is considered as “exact” as regular Monte Carlo output, if possibly less efficient in some settings. Nowadays, MCMC is again attracting more attention (than in the past decade, say, where developments were more about alternatives, some of which are described in the following sections), both because of methodological developments linked to better theoretical tools, for instance in the handling of stochastic processes, and because of new advances in accelerated computing via parallel and cloud computing.

2.1 Basics of MCMC

The introduction of Markov chain based methods within Monte Carlo thus took a certain amount of argument to reach the mainstream statistical community, when compared with other groups who had been using MCMC methods 10–30 years earlier. It may sound unlikely at the current stage of our knowledge, but methods that (a) generated correlated output, (b) required some burn-in time to remove the impact of the initial distribution and (c) did not lead to a closed-form expression for asymptotic variances were indeed met with resistance at first. As is often the case, the immense computing advantages offered by this versatile tool soon overcame the reluctance to accept those methods as just as “exact” as other Monte Carlo techniques, applications driving the move from the early 1990s. We reproduce below the generic version of the “all variables at once” Metropolis–Hastings algorithm (Metropolis et al. 1953; Hastings 1970; Besag et al. 1995; Robert and Casella 2011) as it (still) constitutes in our opinion a fundamental advance in computational statistics: given a computable density \(\pi \) (up to a normalising constant) on \(\varTheta ,\) and a proposal Markov kernel \(q(\cdot |\cdot ),\) there exists a universal machine that returns a Markov chain with the proper stationary distribution, hence an associated operational MCMC algorithm.

[Algorithm 1: the generic (“all variables at once”) Metropolis–Hastings algorithm]
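
Since the displayed algorithm box does not reproduce in this version, the following minimal Python sketch conveys the same universal machine; log_pi, q_sample and q_logpdf are placeholder names of our own for the user-supplied unnormalised log-target and proposal kernel.

import numpy as np

def metropolis_hastings(log_pi, q_sample, q_logpdf, theta0, n_iter, rng=None):
    # log_pi(theta): log of the target density, known up to an additive constant
    # q_sample(theta): draws a proposal from q(. | theta)
    # q_logpdf(x, theta): evaluates log q(x | theta)
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    for n in range(n_iter):
        prop = q_sample(theta)
        log_alpha = (log_pi(prop) - log_pi(theta)
                     + q_logpdf(theta, prop) - q_logpdf(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop              # accept the move ...
        chain[n] = theta              # ... otherwise stay at the current value
    return chain

# Example: random walk proposal (symmetric, so the q terms cancel)
# on a standard normal target in two dimensions.
rng = np.random.default_rng(0)
chain = metropolis_hastings(
    log_pi=lambda t: -0.5 * np.sum(t ** 2),
    q_sample=lambda t: t + 0.5 * rng.standard_normal(t.shape),
    q_logpdf=lambda x, t: 0.0,
    theta0=np.zeros(2), n_iter=5000, rng=rng)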

The first observation about the Metropolis–Hastings algorithm is that the flexibility in choosing q is a blessing, but also a curse, since this choice determines the performance of the algorithm. Hence a large part of the research on MCMC over the past 30 years (if we arbitrarily set the starting date at Geman and Geman 1984) has been on the choice of the proposal q to improve the efficiency of the algorithm, and on characterising its convergence properties. This typically requires gathering or computing additional information about \(\pi \) and we discuss some of the fundamental strategies in subsequent sections. Algorithm 1, and its variants in which variables are updated singly or in blocks according to some schedule, remains a keystone of standard MCMC methodology, even though the newer Hamiltonian Monte Carlo (HMC) approach (see Sect. 2.3) may sooner or later come to replace it. While there is nothing intrinsically unique to the nature of this algorithm, or optimal in its convergence properties (other than the result of Peskun 1973 on the optimality of the acceptance ratio), attempts to bypass Metropolis–Hastings are few and limited. For instance, the birth-and-death process developed by Stephens (2000) used a continuous-time jump process to explore a set of models, only to be later shown (Cappé et al. 2002) to be equivalent to the (Metropolis–Hastings) reversible jump approach of Green (1995).

Another aspect of the generic Metropolis–Hastings algorithm that has become central more recently is that, while the accept–reject step does overcome the need to know the normalising constant, it still requires evaluating \(\pi ,\) if unnormalised, and this may be too expensive to compute, or even intractable, for complicated models and large datasets. Much recent research effort has been devoted to the design and understanding of appropriate modifications that use estimators or approximations of \(\pi \) instead, and we will take the opportunity to summarise some of the progress in this direction.

2.2 MALA and manifold MALA

Stochastic differential equations (SDEs) have been and still are informing Monte Carlo development in a number of seminal ways. A key insight is that the Langevin diffusion on \(\varTheta \) solving

$$\begin{aligned} d \theta _{t} = {1 \over 2 } \nabla \log \pi \left( \theta _{t}\right) d t + dB_{t}, \end{aligned}$$
(1)

has \(\pi \) as its stationary and limiting distribution. Here \(B_{t}\) is the standard Brownian motion and \(\nabla \) denotes gradient. The crude approach of sampling an Euler discretisation (Kloeden and Platen 1992) of (1) and using it as an approximate sample from \(\pi \) was introduced in the applied literature (Ermak 1975; Doll and Dion 1976). The method results in a Markov chain evolving according to the dynamics

$$\begin{aligned}&\theta ^{(n)}|\theta ^{(n-1)}\sim Q\left( \theta ^{(n-1)},\,\cdot \right) \nonumber \\&\quad := \theta ^{(n-1)} + {h \over 2 } \nabla \log \pi \left( \theta ^{(n-1)}\right) + h^{1/2}N\left( 0,\,I_{d\times d}\right) , \end{aligned}$$
(2)

for a chosen discretisation step h. There is a delicate tradeoff between accuracy of the approximation, improving as \(h \rightarrow 0,\) and sampling efficiency [as measured, e.g., by the effective sample size (ESS)], improving when h increases. This solution was soon followed by its Metropolised version (Rossky et al. 1978) that uses the Euler dynamics (2) to produce a proposal in the Metropolis–Hastings Algorithm 1, by letting \(q (\cdot |\theta ^{(n-1)}) := \theta ^{(n-1)} + {h \over 2 } \nabla \log \pi (\theta ^{(n-1)}) + h^{1/2}N(0,\,I_{d\times d}).\) While in the probability community Langevin diffusions and their equilibrium distributions had also been around for some time (Kent 1978), it was the Roberts and Tweedie (1996a) paper (motivated by the comment of Besag 1994 on Grenander and Miller 1994) that brought the approach to the centre of interest of the computational statistics community and sparked systematic study, development and application of Metropolis adjusted Langevin algorithms (hence MALA) and their cousins.
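
A minimal sketch of a single MALA transition built from this Metropolised Euler proposal, assuming the gradient of the log-target is available as grad_log_pi (all names here are illustrative, not part of the original papers):

import numpy as np

def mala_step(theta, log_pi, grad_log_pi, h, rng):
    # Langevin drift: one Euler step of (1), used as the proposal mean
    def mean(t):
        return t + 0.5 * h * grad_log_pi(t)
    prop = mean(theta) + np.sqrt(h) * rng.standard_normal(theta.shape)
    # log density of N(mean(t), h I), up to the common constant
    def log_q(x, t):
        return -np.sum((x - mean(t)) ** 2) / (2.0 * h)
    log_alpha = (log_pi(prop) - log_pi(theta)
                 + log_q(theta, prop) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_alpha else theta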

There is a large body of empirical evidence that, at the extra price of computing the gradient, MALA algorithms typically provide a substantial speed-up in convergence for certain types of problems. However, for very light-tailed distributions the drift term may grow to infinity and cause additional instability. More precisely, for distributions with sufficiently smooth contours, MALA is geometrically ergodic (cf. Roberts and Rosenthal 2004) if the tails of \(\pi \) decay as \(\exp \{-|\theta |^{\beta }\}\) with \(\beta \in [1,\,2],\) while the random walk Metropolis algorithm is geometrically ergodic for all \(\beta \ge 1\) (Roberts and Tweedie 1996a; Mengersen and Tweedie 1996). The lack of geometric ergodicity has been precisely quantified by Bou-Rabee and Hairer (2012).

Various refinements and extensions have been proposed. These include optimal scaling and choice of the discretisation step h, adaptive versions (both discussed in Sect. 2.4), combinations with proximal operators (Pereyra 2015; Schreck et al. 2013), and applications and algorithm development for the infinite-dimensional context (Pillai et al. 2012; Cotter et al. 2013). One particular direction of active research is considering a more general version of Eq. (1) with state-dependent drift and diffusion coefficient

$$\begin{aligned}&d \theta _{t} = \left( {\sigma (\theta _{t}) \over 2 }\nabla \log \pi \left( \theta _{t}\right) +{\gamma (\theta _{t}) \over 2}\right) d t + \sqrt{\sigma \left( \theta _{t}\right) }\,dB_{t},\nonumber \\&\gamma _{i}\left( \theta _{t}\right) = \sum \limits _{j} \frac{\partial \sigma _{ij}(\theta _{t})}{\partial \theta _{j}}, \end{aligned}$$
(3)

which also has \(\pi \) as invariant distribution (Xifara et al. 2014, cf. Kent 1978). The resulting proposals are

$$\begin{aligned} q\left( \cdot |\theta ^{(n-1)}\right) := \theta ^{(n-1)} + {h \over 2 } \left( \sigma \left( \theta ^{(n-1)}\right) \nabla \log \pi \left( \theta ^{(n-1)}\right) + \gamma \left( \theta ^{(n-1)}\right) \right) + h^{1/2}N\left( 0,\,\sigma \left( \theta ^{(n-1)}\right) \right) . \end{aligned}$$

Choosing an appropriate \(\sigma \) for improved ergodicity is however nontrivial. The idea has been explored in Stramer and Tweedie (1999a, b) and Roberts and Stramer (2002), and more recently Girolami and Calderhead (2011) introduced a mathematically-coherent approach relating \(\sigma \) to a metric tensor on a Riemannian manifold of probability distributions. The resulting algorithms are termed Manifold MALA (MMALA), Simplified MMALA (Girolami and Calderhead 2011), and position-dependent MALA (PMALA) (Xifara et al. 2014), and differ in implementation cost, depending on how precisely they exploit versions of Eq. (3). The approach still leaves the specification of the metric on the space of probability distributions to the user; there are, however, some natural choices. One can, for example, take the Hessian of \(\pi \) and replace its eigenvalues by their absolute values \(\lambda _{i} \rightarrow |\lambda _{i}|.\) Building the metric tensor from this spectrally-positive version of the Hessian of \(\pi \) and randomising the discretisation step size h results in an algorithm that is as robust as random walk Metropolis, in the sense that it is geometrically ergodic for targets with tail decay of \(\exp \{- |\theta |^{\beta }\}\) for \(\beta >1\) (see Taylor 2014). A robustified version of such a metric has been introduced in Betancourt (2013) and termed SoftAbs. Here one approximates the absolute value of the eigenspectrum of the Hessian of \(\pi \) with a smooth, strictly positive function \(\lambda _{i} \rightarrow \lambda _{i} \frac{\exp {\{\alpha \lambda _{i}\}} + \exp {\{-\alpha \lambda _{i}\}}}{\exp {\{\alpha \lambda _{i}\}} - \exp {\{-\alpha \lambda _{i}\}}},\) where \(\alpha \) is a smoothing parameter. The metric stabilises the behaviour of both MMALA and HMC algorithms (discussed in the sequel) in the neighbourhoods where the signature of the Hessian changes.
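
As an illustration of these two eigenvalue maps (our sketch; hessian is assumed to hold the symmetric Hessian at the current point), note that the SoftAbs function above equals \(\lambda _{i}\coth (\alpha \lambda _{i}),\) which tends to \(|\lambda _{i}|\) as \(\alpha \rightarrow \infty {\text {:}}\)

import numpy as np

def softabs_metric(hessian, alpha=1e6):
    # eigendecomposition of the (symmetric) Hessian; the cruder alternative
    # in the text simply replaces lam by np.abs(lam)
    lam, vecs = np.linalg.eigh(hessian)
    with np.errstate(invalid="ignore", divide="ignore"):
        raw = lam / np.tanh(alpha * lam)     # equals lam * coth(alpha * lam)
    # the map tends to 1/alpha (not 0) as lam -> 0, keeping the metric positive
    soft = np.where(np.abs(alpha * lam) < 1e-8, 1.0 / alpha, raw)
    return vecs @ np.diag(soft) @ vecs.T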

2.3 Hamiltonian Monte Carlo

As with many improvements in the literature, starting with the very notion of MCMC, Hamiltonian (or hybrid) Monte Carlo (HMC) stems from Physics (Duane et al. 1987). After a slow emergence into the statistical community (Neal 1999), it is now central in statistical software like STAN (Stan Development Team 2014). For a complete account of this important flavour of MCMC, the reader is referred to Neal (2013), which inspired the description below; see also Betancourt et al. (2014) for a highly mathematical differential-geometric approach to HMC.

This method can be seen as a particular and efficient instance of auxiliary variables (see, e.g., Besag and Green 1993; Rubinstein 1981), in which we apply a deterministic-proposal Metropolis method to the augmented target. In physical terms, the idea behind HMC is to add a “kinetic energy” term to the “potential energy” (negative log-target), leading to the Hamiltonian

$$\begin{aligned} H(\theta ,\,p) = - \log \pi (\theta )+p^\text {T} M^{-1} p/2, \end{aligned}$$

where \(\theta \) denotes the object to be simulated (i.e., the parameter), p its speed or momentum, and M a positive-definite mass matrix. In more statistical language, HMC creates an auxiliary variable p such that moving according to Hamilton’s equations

$$\begin{aligned}&\dfrac{d\theta }{dt} = \dfrac{\partial H}{\partial p} = M^{-1} p,\\&\dfrac{dp}{dt} = -\dfrac{\partial H}{\partial \theta } = \dfrac{\partial \log \pi }{\partial \theta }, \end{aligned}$$

preserves the joint distribution with density \(\exp \{-H(\theta ,\,p)\},\) hence the marginal distribution of \(\theta ,\) that is, \(\pi (\theta ).\) Hence, if we could simulate exactly from this joint distribution of \((\theta ,\,p),\) a sample from \(\pi (\theta )\) would be a by-product. In practice, however, the equations are only solved approximately and the scheme hence requires a Metropolis correction. As discussed in, e.g., Neal (2013), the dynamics induced by Hamilton’s equations is reversible and volume-preserving in the \((\theta ,\,p)\) space, which means in practice that there is no need for a Jacobian in Metropolis updates. The practical implementation relies on a k-th order symplectic integrator (Hairer et al. 2006), most commonly the second-order leapfrog approximation: based on a small step size \(\epsilon ,\) it updates p and \(\theta \) via a modified Euler scheme that is reversible and, being symplectic, preserves volume as well. This discretised update can be repeated for an arbitrary number of steps.

When considering the implementation via a Metropolis algorithm, a new value of the momentum p is drawn from the pseudo-prior \(\propto \exp \{ -p^\text {T} M^{-1} p / 2 \}\) and is followed by a Metropolis step, whose proposal is driven by the leapfrog approximation to the Hamiltonian dynamics on \((\theta ,\,p)\) and whose acceptance is governed by the usual Metropolis acceptance probability. What makes this augmentation (or disintegration) scheme potentially powerful is that the value of \(H(\theta ,\,p)\) hardly changes during the Metropolis move, which means that the move is most likely to be accepted and that it may produce a very different value of \(\pi (\theta )\) without modifying the overall acceptance probability. In other words, moving along level sets is almost energy-free, but if the move proceeds for “long enough”, the chain can reach far-away regions of the parameter space, thus avoiding the myopia of standard MCMC algorithms. As explained in Neal (2013), this means that HMC avoids the inefficient random walk behaviour of most Metropolis–Hastings algorithms. What drives the exploration of the different values of \(H(\theta ,\,p)\) is therefore the simulation of the momentum, which makes its calibration both quite influential and delicate (Betancourt et al. 2014), as it depends on the unknown normalising constant of the target. (By calibration, we mean primarily the choice of the time discretisation step \(\epsilon \) in the leapfrog approximation and of the number L of leapfrog steps, but also the choice of the precision matrix M.)
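
The following sketch of a single HMC transition makes the above concrete, under the simplifying assumption (ours) of a diagonal mass matrix m_diag; eps and L are the leapfrog step size and number of leaps, and all names are illustrative:

import numpy as np

def hmc_step(theta, log_pi, grad_log_pi, eps, L, m_diag, rng):
    # refresh the momentum from its pseudo-prior N(0, M), with M diagonal here
    p = np.sqrt(m_diag) * rng.standard_normal(theta.shape)
    th, mom = theta.copy(), p.copy()
    # leapfrog: half step for p, alternating full steps, final half step for p
    mom = mom + 0.5 * eps * grad_log_pi(th)
    for i in range(L):
        th = th + eps * mom / m_diag
        if i < L - 1:
            mom = mom + eps * grad_log_pi(th)
    mom = mom + 0.5 * eps * grad_log_pi(th)
    # total energy H(theta, p) = -log pi(theta) + p' M^{-1} p / 2
    def H(t, m):
        return -log_pi(t) + 0.5 * np.sum(m ** 2 / m_diag)
    # Metropolis correction for the discretisation error
    if np.log(rng.uniform()) < H(theta, p) - H(th, mom):
        return th
    return theta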

2.4 Optimal scaling and adaptive MCMC

The convergence of the Metropolis–Hastings Algorithm 1 depends crucially on the choice of the proposal distribution q, as does the performance of more complex MCMC and SMC algorithms, which are often hybrids using Metropolis–Hastings as simulation substeps.

Optimising over all implementable q appears to be a “disaster problem”, due to its infinite-dimensional character, the lack of clarity about what is implementable and what is not, and the fact that the optimal q must depend in a complex way on the target \(\pi ,\) to which we have only limited access. In particular, MALA provides a specific approach to constructing \(\pi \)-tailored proposals, and HMC can be viewed as a combination of Gibbs and special Metropolis moves for an extended target.

In this optimisation context, it is thus reasonable to restrict ourselves to some parametric family of proposals \(q_{\xi },\) or more generally of Markov transition kernels \(P_{\xi },\) where \(\xi \in \varXi \) is a tuning parameter, possibly high-dimensional.

The aim of adaptive Markov chain Monte Carlo is conceptually very simple. One expects that there is a set \(\varXi _{\pi } \subset \varXi \) of good parameters \(\xi \) for which the kernel \(P_{\xi }\) converges quickly to \(\pi ,\) and one allows the algorithm to search for \(\varXi _{\pi }\) “on the fly” and redesign the transition kernel during the simulation as more and more information about \(\pi \) becomes available. Thus an adaptive MCMC algorithm would apply the kernel \(P_{\xi ^{(n)}}\) to sample \(\theta ^{(n)}\) given \(\theta ^{(n-1)},\) where the tuning parameter \(\xi ^{(n)}\) is itself a random variable which may depend on the whole history \(\theta ^{(0)},\ldots ,\theta ^{(n-1)}\) and on \(\xi ^{(n-1)}.\) Adaptive MCMC rests on the hope that the adaptive parameter \(\xi ^{(n)}\) will find \(\varXi _{\pi },\) stay there essentially forever and inherit good convergence properties.

There are at least two fundamental difficulties in executing this strategy in practice. First, standard measures of efficiency of Markovian kernels, like the total variation convergence rate (cf. Meyn and Tweedie 2009; Roberts and Rosenthal 2004), the \(L^{2}(\pi )\) spectral gap (Diaconis and Stroock 1991; Roberts 1996; Saloff-Coste 1997; Levin et al. 2009) or the asymptotic variance (Peskun 1973; Geyer 1992; Tierney 1998) in the Markov chain central limit theorem, will not be available explicitly, and their estimation from a Markov chain trajectory is often an even more challenging task than the underlying MCMC estimation problem itself.

Secondly, when executing an adaptive strategy and trying to improve the transition kernel on the fly, the Markov property of the process is violated, therefore standard theoretical tools do not apply, and establishing validity of the approach becomes significantly more difficult. While the approach has been successfully applied in some very challenging practical problems (Solonen et al. 2012; Richardson et al. 2010; Griffin et al. 2014), there are examples of seemingly reasonable adaptive algorithms that fail to converge to the intended target distribution (Bai et al. 2011; Łatuszyński et al. 2013), indicating that compared to standard MCMC even more care must be taken to ensure validity of inferential conclusions.

While heuristics-based adaptive algorithms had already been considered in Gilks et al. (1994), a remarkable result providing a tool for optimising Markovian kernels coherently is the Roberts et al. (1997) paper on scaling the proposal variance. It considers settings of increasing dimensionality and investigates the efficiency of the random walk Metropolis algorithm as a function of its average acceptance rate. More specifically, given a sequence of targets \(\pi _{d}\) on the product state space \(\varTheta ^{d}\) with iid components constructed from a conveniently smooth marginal f,

$$\begin{aligned} \pi _{d}(\theta ) := \prod _{i=1}^{d} f\left( \theta _{i}\right) , \quad \mathrm{for}\quad d=1,\,2,\ldots , \end{aligned}$$
(4)

consider a sequence of Markov chains \(\theta _{d},\,d=1,\,2,\ldots ,\) where the chain \(\theta _{d} = (\theta _d^{(n)})_{n=0,1,\ldots }\) is a random walk Metropolis targeting \(\pi _{d}\) with proposal increments distributed as \(N(0,\,\sigma ^{2}_{d} I_{d\times d}).\)

It then turns out that the only sensible scaling of the proposal as dimensionality increases is to take \(\sigma _{d}^{2} = l^{2} d^{-1}.\) In this regime the sequence of time-rescaled first coordinate processes

$$\begin{aligned} Z_{d}^{(t)} := \theta _{d, 1}^{(\lfloor td \rfloor )}, \quad \mathrm{for} \quad d=1,\,2,\ldots , \end{aligned}$$

converges in a suitable sense to the solution Z of the SDE

$$\begin{aligned} dZ_{t} = h(l)^{1/2}dB_{t} + \frac{1}{2}h(l) \nabla \log f\left( Z_{t}\right) dt. \end{aligned}$$

Hence maximising the speed h(l) of the above diffusion is equivalent to maximising the efficiency of the algorithm as the dimension goes to infinity. Surprisingly, there is a one-to-one correspondence between the optimal value \(l_{opt} = \mathrm{argmax}\, h(l)\) and a mean acceptance probability of 0.234.

The magic number 0.234 does not depend on f and gives a universal tuning recipe to be used for example in adaptive algorithms: choose the scale of the increment so that approximately 23 % of the proposals are accepted.
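
As one hedged illustration (our sketch, not a specific published algorithm), this recipe can be implemented by a Robbins–Monro recursion on the log-scale of a random walk proposal, with a diminishing step size that anticipates the Diminishing Adaptation condition formalised later in this section:

import numpy as np

def tuned_rwm(log_pi, theta0, n_iter, target=0.234, rng=None):
    # adapt the random walk scale towards a 23% average acceptance rate
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    log_scale = 0.0
    for n in range(1, n_iter + 1):
        prop = theta + np.exp(log_scale) * rng.standard_normal(theta.shape)
        acc = np.exp(min(0.0, log_pi(prop) - log_pi(theta)))
        if rng.uniform() < acc:
            theta = prop
        # Robbins-Monro step: grow the scale when accepting too often,
        # shrink it otherwise; n^{-0.6} makes the adaptation diminishing
        log_scale += n ** (-0.6) * (acc - target)
    return theta, np.exp(log_scale)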

The result, established under restrictive assumptions, has been empirically verified to hold much more generally, for non-iid targets and also in medium- and even low-dimensional examples with d as small as 5. It has also been combined with the relative efficiency loss due to a mismatch between the proposal and target covariance matrices (see Roberts and Rosenthal 2001).

The simplicity of the result and the easy access to the average acceptance rate make optimal scaling the main theoretical driver in the development of adaptive MCMC algorithms, and adaptive MCMC is the main application and motivation for researching optimal scaling.

A large body of theoretical work extends optimal scaling formally to different and more general scenarios. For example, Metropolis for smooth non-iid targets has been addressed, e.g., by Bédard (2007), and in infinite-dimensional settings by Beskos et al. (2009). Discrete and other discontinuous targets have been considered in Roberts (1998) and Neal et al. (2012). For MALA algorithms, an optimal acceptance rate of 0.574 has been established in Roberts and Rosenthal (1998) and confirmed in infinite-dimensional settings in Pillai et al. (2012), along with the stepsize \(\sigma ^{2}_{d} = l^{2} d^{-1/3}.\) Hamiltonian Monte Carlo (see Sect. 2.3) has been analysed in a similar spirit by Beskos et al. (2013) and Betancourt et al. (2014), concluding that any acceptance rate in \([0.6,\,0.9]\) will be close to optimal, with the leapfrog step size taken as \(h = l \times d^{-1/4}.\) These results not only inform about optimal tuning, but also provide an efficiency ordering of the algorithms in d dimensions: Metropolis algorithms need \(\mathcal {O}(d)\) steps to explore the state space, while MALA and HMC need respectively \(\mathcal {O}(d^{1/3})\) and \(\mathcal {O}(d^{1/4}).\)

Further extensions include studying the transient phase before reaching stationarity (Christensen et al. 2005; Jourdain et al. 2012, 2014), the scaling of multiple-try MCMC (Bédard et al. 2012) and delayed rejection MCMC (Bédard et al. 2014), and the temperature scale of parallel tempering type algorithms (Atchadé et al. 2011b; Roberts and Rosenthal 2014). Interestingly, the optimal scaling of the pseudo-marginal algorithms discussed in Sect. 2.5, as obtained in Sherlock et al. (2014) and extended to more general settings in Doucet et al. (2012) and Sherlock (2014), suggests an acceptance rate of just 0.07.

While each of these numerous optimal scaling results gives rise, at least in principle, to an adaptive MCMC design, the pioneering and most successful algorithm is the Adaptive Metropolis of Haario et al. (2001), whose increasing popularity in applications has fuelled the development of the field.

Here one considers a normal increment proposal that estimates the target covariance matrix from past samples and applies appropriate dimension-dependent scaling and covariance shrinkage. Precisely, the proposal takes the form

$$\begin{aligned} q\left( \cdot |\theta ^{(n-1)}\right) = N\left( \theta ^{(n-1)},\,C^{(n)}\right) , \end{aligned}$$
(5)

with the covariance matrix

$$\begin{aligned} C^{(n)} = \frac{(2.38)^{2}}{d} \left( \hat{\mathrm{cov}}\left( \theta ^{(0)}, \ldots , \theta ^{(n-1)}\right) + \varepsilon I_{d\times d}\right) , \end{aligned}$$
(6)

which is efficiently computed using a recursive formula.
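
The recursive formula alluded to can be sketched as follows (a Welford-type rank-one update; our illustration, not necessarily the exact recursion of Haario et al. 2001):

import numpy as np

def welford_update(mean, scatter, theta, n):
    # update running mean and scatter matrix after the n-th sample (n >= 1);
    # the empirical covariance in (6) is then scatter / (n - 1)
    delta = theta - mean
    mean = mean + delta / n
    scatter = scatter + np.outer(delta, theta - mean)  # pre- and post-update residuals
    return mean, scatter

def am_covariance(scatter, n, d, eps=1e-6):
    # proposal covariance (6): scaled empirical covariance plus shrinkage
    return (2.38 ** 2 / d) * (scatter / max(n - 1, 1) + eps * np.eye(d))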

Versions and refinements of the adaptive Metropolis algorithm (Roberts and Rosenthal 2009; Andrieu and Thoms 2008) have served well in applications and motivated much of the theoretical development. These include, among many other contributions, delayed rejection adaptive Metropolis (Haario et al. 2006), regional adaptation and parallel chains (Craiu et al. 2009), and the robust version of Vihola (2012), which estimates the shape of the distribution rather than its covariance matrix and is hence suitable for heavy-tailed targets.

The analogous development of adaptive MALA algorithms in Atchadé (2006) and Marshall and Roberts (2012), and of adaptive HMC and Riemannian Manifold Monte Carlo in Wang et al. (2013), building on the adaptive scaling theory, resulted in mixing improvements as drastic as those of the original Adaptive Metropolis.

Another substantial, and still far from fully explored, area where adaptive algorithms are applied to very high-dimensional and multimodal problems is model and variable selection (Nott and Kohn 2005; Richardson et al. 2010; Lamnisos et al. 2013; Ji and Schmidler 2013; Griffin et al. 2014). These algorithms can incorporate reversible jump moves (Green 1995) and are guided by scaling limits for discrete distributions, as well as by the temperature spacing of parallel tempering, to address multimodality. Successful implementations allow for fully Bayesian variable selection in models with over 20,000 variables, for which otherwise only ad hoc heuristic approaches have been used in the literature.

To address the second difficulty with adaptive algorithms, several approaches have been developed to establish their theoretical underpinning. While for standard MCMC convergence in total variation and laws of large numbers are obtained almost trivially, and the effort concentrates on stronger results, like CLTs, geometric convergence, nonasymptotic analysis, and, maybe most importantly, comparison and ordering of algorithms, adaptive samplers are intrinsically difficult to analyse. The most elegant and theoretically-valid strategy is to change the underlying Markovian kernel at regeneration times only (Gilks et al. 1998). Unfortunately, this is not very appealing for practitioners, since regenerations are difficult to identify in more complex settings and are essentially impractically rare in high dimensions. The original Adaptive Metropolis of Haario et al. (2001) has been validated (under some restrictive additional conditions) by controlling the dependencies introduced by the adaptation and using convergence results for mixingales. The approach has been further developed in Atchadé and Rosenthal (2005) and Atchadé (2006) to verify ergodicity under weaker assumptions and to apply the mixingale approach to adaptive MALA. Another successful approach (Andrieu and Moulines 2006, refined in Saksman and Vihola 2010) rests on martingale difference approximations and martingale limit theorems to obtain, under suitable technical assumptions, versions of LLNs and CLTs. There are close links between analysing adaptive MCMC and stochastic approximation algorithms, and in particular the adaptation step can often be written as the mean field of a stochastic approximation procedure; Andrieu and Robert (2001), Atchadé et al. (2011a) and Andrieu et al. (2015) contribute to this direction of analysis. Fort et al. (2011) develop an approach where both adaptive and interacting MCMC algorithms can be treated within the same framework. This allows addressing “external adaptation” algorithms such as the interacting tempering algorithm (a simplified version of the celebrated equi-energy sampler of Kou et al. 2006) or the adaptive parallel tempering of Miasojedow et al. (2013).

We present here the rather general but fairly simple coupling approach of Roberts and Rosenthal (2007) to establishing convergence. Successfully applied to a variety of adaptive Metropolis samplers under weak regularity conditions (Bai et al. 2011), as well as to adaptive Gibbs and adaptive Metropolis-within-adaptive-Gibbs samplers (Łatuszyński et al. 2013), it shows that two properties, Diminishing Adaptation and Containment, are sufficient to guarantee that an adaptive MCMC algorithm converges asymptotically to the correct target distribution. To this end, recall the total variation distance between two measures, defined as \(\Vert \nu (\cdot ) - \mu (\cdot )\Vert := \sup _{A\in \mathcal {F}}|\nu (A) - \mu (A)|,\) and for every Markov transition kernel \(P_{\xi },\,\xi \in \varXi \) and every starting point \(\theta \in \varTheta \) define the \(\varepsilon \)-convergence function \(M_{\varepsilon }{\text {:}}\,\varTheta \times \varXi \rightarrow \mathbb {N}\) as

$$\begin{aligned} M_{\varepsilon }(\theta ,\,\xi ) := \inf \left\{ n \ge 1{\text {:}}\,\left\| P^{(n)}_{\xi }(\theta ,\, \cdot ) - \pi (\cdot ) \right\| \le \varepsilon \right\} . \end{aligned}$$

Let \(\{(\theta ^{(n)},\,\xi ^{(n)})\}_{n=0}^{\infty }\) be the corresponding adaptive MCMC algorithm and denote by \(A^{(n)}((\theta ,\,\xi ),\,\cdot )\) its marginal distribution at time n, i.e.,

$$\begin{aligned} A^{(n)}((\theta ,\,\xi ),\,B) := \mathbb {P}\left( \theta ^{(n)} \in B \,|\, \theta ^{(0)}= \theta ,\,\xi ^{(0)} = \xi \right) . \end{aligned}$$

The adaptive algorithm is ergodic, for all starting values of \(\theta \) and \(\xi ,\) if \(\lim \nolimits _{n\rightarrow \infty }\Vert A^{(n)}((\theta ,\,\xi ),\,\cdot ) - \pi (\cdot )\Vert = 0.\) The two conditions guaranteeing ergodicity are as follows.

Definition 1

(Diminishing Adaptation) The adaptive algorithm with starting values \(\theta ^{(0)}=\theta \) and \(\xi ^{(0)} = \xi \) satisfies Diminishing Adaptation, if

$$\begin{aligned}&\lim \limits _{n \rightarrow \infty } D^{(n)} = 0 \quad \mathrm{in\,probability},\,\mathrm{where} \\&D^{(n)} := \sup \limits _{\theta \in \varTheta } \left\| P_{\xi ^{(n+1)}}(\theta ,\, \cdot ) - P_{\xi ^{(n)}}(\theta ,\,\cdot ) \right\| . \end{aligned}$$

Definition 2

(Containment) The adaptive algorithm with starting values \(\theta ^{(0)}=\theta \) and \(\xi ^{(0)} = \xi \) satisfies Containment, if for all \(\varepsilon > 0\) the sequence \(\{M_{\varepsilon }(\theta ^{(n)},\,\xi ^{(n)})\}_{n=0}^{\infty }\) is bounded in probability.

While Diminishing Adaptation is a standard requirement, Containment is subject to some discussion. On the one hand, it may seem difficult to verify in practice; on the other, it may appear restrictive when compared with ergodicity results established under weaker conditions (cf. Fort et al. 2011). However, it turns out (Łatuszyński and Rosenthal 2014) that if Containment is not satisfied, then the algorithm may still converge, but with positive probability it will be asymptotically less efficient than any nonadaptive ergodic MCMC scheme. Hence algorithms that do not satisfy Containment are termed AdapFail and are best avoided. Containment has been further studied in Bai et al. (2011) and is in particular implied by simultaneous geometric or polynomial drift conditions on the adaptive kernels.

Given that adaptive algorithms may be incorporated in essentially any sampling scheme, their introduction seems to be one of the most important innovations of the last two decades. However, despite substantial effort and many ingenious contributions, the theory of adaptive MCMC lags behind practice even more than may be the case in other computational areas. While theory always matters, the numerous unexpected and counterintuitive examples of transient adaptive algorithms suggest that in this area theory matters even more for healthy development.

For adaptive MCMC to become a routine tool, a clear-cut result is needed saying that under some easily verifiable conditions these algorithms are valid and perform not much worse than their nonadaptive counterpart with fixed parameters. Such a result is yet to be established and may require deeper understanding of how to construct stable adaptive MCMC, rather than aiming heavy technical artillery at algorithms currently in use without modifying them.

2.5 Estimated likelihoods and pseudo-marginals

There are numerous settings of interest where the target density \(\pi (\cdot |y)\) is not available in closed form. For instance, in latent variable models, the likelihood function \(\ell (\theta |y)\) is often only available as an intractable integral

$$\begin{aligned} \ell (\theta |y) = \int \nolimits _\mathcal {Z} g(z,\,y|\theta )\mathrm{d}z, \end{aligned}$$

which leads to

$$\begin{aligned} \pi (\theta |y) \propto \pi (\theta ) \int \nolimits _\mathcal {Z} g(z,\,y|\theta )\mathrm{d}z, \end{aligned}$$

being equally intractable. A solution proposed in the early days of MCMC (Tanner and Wong 1987) is to consider z as an auxiliary variable and to simulate the joint distribution \(\pi (\theta ,\,z|y)\) on \(\varTheta \times \mathcal {Z}\) by a standard method, producing simulations from the marginal density \(\pi (\cdot |y)\) as a by-product. However, when the dimension of the auxiliary variable z grows with the sample size, this technique may run into difficulties, as the induced MCMC algorithms are increasingly likely to have convergence issues. An illustration of this case is provided by hidden Markov models, which eventually have to resort to particle filters as Markov chain algorithms become ineffective (Chopin 2007; Fearnhead and Clifford 2003). Another situation where the target density \(\pi (\cdot |y)\) cannot be directly computed is the case of the “doubly intractable” likelihood (Murray et al. 2006a), when the likelihood function \(\ell (\theta |y)\propto g(y|\theta )\) itself contains a term that makes the normalising constant

$$\begin{aligned} \mathfrak {Z}(\theta ) = \int \nolimits _\mathcal {Y} g(y|\theta )\mathrm{d}y, \end{aligned}$$

impossible to compute. The resulting posterior is then

$$\begin{aligned}&\pi (\theta |y) = \frac{ \pi (\theta ) g(y|\theta )}{\mathfrak {Z}(\theta )p(y) }, \quad \mathrm{where} \\&\quad p(y) = \int \nolimits _{\varTheta } \frac{ \pi (\theta ) g(y|\theta ) }{ \mathfrak {Z}(\theta )}\mathrm{d}\theta , \end{aligned}$$

and consequently the Metropolis–Hastings acceptance rate becomes

$$\begin{aligned} \alpha (\theta ,\,\theta ^{\prime }) = \min \left\{ 1,\,\frac{ \pi (\theta ^{\prime }) g(y|\theta ^{\prime }) q(\theta |\theta ^{\prime }) }{ \pi (\theta ) g(y|\theta ) q(\theta ^{\prime }|\theta ) } \times \frac{ \mathfrak {Z}(\theta ) }{ \mathfrak {Z}(\theta ^{\prime })} \right\} , \end{aligned}$$

and cannot be evaluated exactly for algorithmic purposes.

Examples of this kind abound in Markov random field models, for instance the Ising model (Murray et al. 2006b; Møller et al. 2006).

Both the approaches of Murray et al. (2006a) and Møller et al. (2006) require sampling data from the likelihood \(\ell (\theta |y),\) which limits their applicability. The latter uses in addition an importance sampling function and may suffer from poor acceptance rates.

Andrieu and Roberts (2009) propose a more general resolution of such problems by designing a Metropolis–Hastings algorithm that replaces the intractable target density \(\pi (\cdot |y)\) with an unbiased estimator, following an idea of Beaumont (2003). The approach is termed pseudo-marginal. Rather than evaluating the posterior exactly, a positive unbiased estimate \(S_{\theta }\) of \(\pi (\theta |y)\) is utilised. More formally, a new Markov chain is constructed that evolves on the extended state space \(\varTheta \times \mathbb {R}_{+},\) where at iteration n, given the pair \((\theta ^{(n-1)},\,S_{\theta ^{(n-1)}}^{(n-1)})\) of parameter value and density estimate at this value, the proposal \((\theta ^{\prime },\,S_{\theta ^{\prime }})\) is obtained by sampling \(\theta ^{\prime } \sim q(\cdot |\theta ^{(n-1)})\) and obtaining \(S_{\theta ^{\prime }},\) the estimate of \(\pi (\theta ^{\prime }|y).\) Analogously to the standard Metropolis–Hastings step, the pair \((\theta ^{\prime },\,S_{\theta ^{\prime }})\) is accepted as \((\theta ^{(n)},\,S_{\theta ^{(n)}}^{(n)})\) with probability

$$\begin{aligned} \min \left\{ 1,\,\frac{ S_{\theta ^{\prime }} q(\theta ^{(n-1)}|\theta ^{\prime }) }{ S_{\theta ^{(n-1)}}^{(n-1)} q(\theta ^{\prime }|\theta ^{(n-1)}) } \right\} , \end{aligned}$$

otherwise the proposal is rejected and the new value is set as \((\theta ^{(n)},\,S_{\theta ^{(n)}}^{(n)}) :=(\theta ^{(n-1)},\,S_{\theta ^{(n-1)}}^{(n-1)}).\)

It is not difficult to verify that the bivariate chain on the extended state space \(\varTheta \times \mathbb {R}_{+}\) enjoys the correct \(\pi (\theta |y)\) marginal on \(\varTheta ,\) so that the approach is valid; see Andrieu and Roberts (2009) for details (and also Andrieu and Vihola 2015 for an abstracted account).
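
A compact sketch of this pseudo-marginal kernel follows; here estimate is a placeholder for any routine returning a positive unbiased estimate \(S_{\theta }\) of the (unnormalised) \(\pi (\theta |y),\) for instance an importance sampling average over z, and the remaining names are illustrative:

import numpy as np

def pseudo_marginal_mh(estimate, q_sample, q_logpdf, theta0, n_iter, rng=None):
    # estimate(theta): positive unbiased estimate S_theta of pi(theta | y)
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    s = estimate(theta)
    chain = np.empty((n_iter, theta.size))
    for n in range(n_iter):
        prop = q_sample(theta)
        s_prop = estimate(prop)          # fresh estimate at the proposed value only
        log_alpha = (np.log(s_prop) - np.log(s)
                     + q_logpdf(theta, prop) - q_logpdf(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, s = prop, s_prop      # the estimate is recycled with the state
        chain[n] = theta
    return chain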

One specific instance of constructing unbiased estimators of the posterior is presented in Girolami et al. (2013) and based on random truncations of infinite series expansions. The paper also offers an excellent overview of inference methods for intractable likelihoods.

The performance of the pseudo-marginal approach depends on the quality of the estimators \(S_{\theta },\) and hence stabilising them, as well as understanding this relationship, is an active area of current development. Often \(S_{\theta }\) is constructed as an importance sampler based on an importance sample z. In particular, the improvements from using multiple samples of z to estimate \(\pi \) are of interest and can be assessed from Andrieu and Vihola (2015), where the efficiency of the algorithm is studied in terms of its spectral gap and CLT asymptotic variance. Sherlock et al. (2014), Doucet et al. (2012) and Sherlock (2014), on the other hand, investigate the efficiency as a function of the acceptance rate and of the variance of the noise, deriving the optimal scaling discussed in Sect. 2.4.

As an alternative to the above procedure of using estimates of the intractable likelihood to design a new Markov chain, on an extended state space, with the correct marginal, one could naively use these estimates to approximate the Metropolis–Hastings accept–reject ratio and let the Markov chain evolve on the original state space. This would amount to dropping the current realisation of \(S_{\theta }\) and obtaining a new one in each accept–reject attempt. Such a procedure is termed Monte Carlo within Metropolis (Andrieu and Roberts 2009). Unfortunately, this approach does not preserve the stationary distribution, and the resulting Markov chain may not even be ergodic (Medina-Aguayo et al. 2015). If it is ergodic, the difference between its stationary distribution and the intended target, resulting from the noisy acceptance, must be quantified, which is a highly nontrivial task, and the bounds will rarely be tight (see also Alquier et al. 2014; Pillai and Smith 2014; Rudolf and Schweizer 2015 for related methodology and theory). The approach is however an interesting avenue since, at the price of being biased, it overcomes the mixing difficulties of the exact pseudo-marginal version.

Design and understanding of pseudo-marginal algorithms is a direction of dynamic methodological development that in the coming years will be further fuelled not only by complex models with intractable likelihoods, but also by the need for MCMC algorithms able to handle Big Data. In this context the likelihood function cannot be evaluated over the whole dataset, even in the iid case, simply because computing the long product of individual likelihoods is infeasible. Several Big Data MCMC approaches have already been considered in Welling and Teh (2011), Korattikara et al. (2013), Teh et al. (2014), Bardenet et al. (2014), Maclaurin and Adams (2014), Minsker et al. (2014), Quiroz et al. (2014) and Strathmann et al. (2015).

2.6 Particle MCMC

While we refrain from covering particle filters here, since others in this volume (Beskos et al. 2015) focus on this technique, a recent advance at the interface between MCMC, pseudo-marginals, and particle filtering is the notion of particle MCMC (or pMCMC), developed by Andrieu et al. (2011). This innovation is indeed rather similar to the pseudo-marginal approach, taking advantage of the state-space models and auxiliary variables used by particle filters. It differs from standard particle filters in that it (mostly) targets the marginal posterior distribution of the parameters.

The simplest setting in which pMCMC applies is one of a state-space model where a latent sequence \(x_{0:T}\) is a Markov chain with joint density

$$\begin{aligned} p_{0}\left( x_{0}|\,\theta \right) p_{1}\left( x_{1}|x_{0},\,\theta \right) \cdots p_{T}\left( x_{T}|x_{T-1},\,\theta \right) , \end{aligned}$$

and is associated with an observed sequence \(y_{1:T}\) such that

$$\begin{aligned} y_{1:T}|x_{1:T},\,\theta \sim \prod \limits _{i=1}^{T} q_{i}\left( y_{i}|x_{i},\,\theta \right) . \end{aligned}$$

The iterations of pMCMC are MCMC-like in that, at iteration t, a new value \(\theta ^{\prime }\) of \(\theta \) is proposed from an arbitrary transition kernel \(\mathfrak {h}(\cdot |\theta ^{(t)})\) and then a new value of the latent series \(x_{0:T}^{\prime }\) is generated from a particle filter approximation of \(p(x_{0:T}|\theta ^{\prime },\,y_{1:T}).\) Since the particle filter returns as a by-product (Del Moral et al. 2006) an unbiased estimator \(\hat{q}(y_{1:T}|\theta ^{\prime })\) of the marginal likelihood of \(y_{1:T},\) this estimator can be used as such in the Metropolis–Hastings ratio

$$\begin{aligned} \dfrac{\hat{q}(y_{1:T}|\theta ^{\prime })\pi (\theta ^{\prime })\mathfrak {h}(\theta ^{(t)}|\theta ^{\prime })}{\hat{q}(y_{1:T}|\theta ^{(t)})\pi (\theta ^{(t)})\mathfrak {h}(\theta ^{\prime }|\theta ^{(t)})}\wedge 1. \end{aligned}$$

Its validity follows from the general argument of Andrieu and Roberts (2009), although some additional (notational) effort is needed to demonstrate that all random variables used therein are correctly assessed (see Andrieu et al. 2011; Wilkinson 2011a, the latter providing a very progressive introduction to the notions of pMCMC and particle Gibbs, which helped greatly in composing this section). Note however that the general validation of pMCMC as targeting the joint posterior of the states and parameters, and of the parallel particle Gibbs sampler, does not follow from pseudo-marginal arguments.
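
For concreteness, here is a sketch of the bootstrap filter estimate of the marginal likelihood that would feed the above ratio; sample_x0, propagate and log_obs are hypothetical model-specific routines matching \(p_{0},\,p_{t}\) and \(q_{t}\) above:

import numpy as np

def bootstrap_loglik(theta, y, sample_x0, propagate, log_obs, N, rng):
    # returns the log of the particle filter estimate of q(y_{1:T} | theta);
    # the estimate itself (not its log) is unbiased (Del Moral et al. 2006)
    x = sample_x0(theta, N, rng)                  # N particles from p_0(.|theta)
    loglik = 0.0
    for t in range(len(y)):
        x = propagate(x, theta, rng)              # move via p_t(.|x_{t-1}, theta)
        logw = log_obs(y[t], x, theta)            # log q_t(y_t | x_t, theta)
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())            # incremental evidence factor
        x = x[rng.choice(len(x), size=len(x), p=w / w.sum())]  # resampling
    return loglik

Plugging the exponential of this value into the pseudo-marginal scheme of Sect. 2.5 yields the particle marginal Metropolis–Hastings variant of pMCMC.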

This approach is being used increasingly in complex dynamic models like those found in signal processing (Whiteley et al. 2010), dynamical systems like the PDEs in biochemical kinetics (Wilkinson 2011b) and probabilistic graphical models (Lindsten et al. 2014). An extension to approximating the sequential filtering distribution is found in Chopin et al. (2013).

2.7 Parallel MCMC

Since MCMC relies on local updating based on the current value of a Markov chain, opportunities for exploiting parallel resources, either CPU or GPU, would seem quite limited. In fact, the possibilities reach far beyond the basic notion of running independent or coupled MCMC chains on several processors. For instance, Craiu and Meng (2005) construct parallel antithetic coupling to create negatively correlated MCMC chains (see also Frigessi et al. 2000), while Craiu et al. (2009) use parallel exploration of the sample space to tune an adaptive MCMC algorithm. Jacob et al. (2011) exploit GPU facilities to improve, by Rao-Blackwellisation, the Monte Carlo approximations produced by a Markov chain, even though the parallelisation does not improve the convergence of the chain. See also Lee et al. (2009) and Suchard et al. (2010) for more detailed contributions on the appeal of using GPUs towards massive parallelisation, and Wilkinson (2005) for a general survey of the topic.

Another recently-explored direction is “prefetching”. Based on Brockwell (2006), this approach computes the \(2^{2},\,2^{3},\ldots , 2^{k}\) values of the posterior that will be needed \(2,\,3,\ldots ,k\) sweeps ahead, by simulating, in parallel, the possible “futures” of the Markov chain according to whether the next k proposals are accepted or not. Running a regular Metropolis–Hastings algorithm then means building a decision tree back to the current iteration and drawing \(2,\,3,\ldots ,k\) uniform variates to go down the tree to the appropriate branch. As noted by Brockwell (2006), “in the case where one can guess whether or not acceptance probabilities will be ‘high’ or ‘low’, the tree could be made deeper down ‘high’ probability paths and shallower in the ‘low’ probability paths.” This idea is exploited in Angelino et al. (2014), by creating “speculative moves” that consider the reject branch of the prefetching tree more often than not, based on some preliminary or dynamic evaluation of the acceptance rate. Using a fast but close-enough approximation to the true target (and a fixed sequence of uniforms) may also produce a “single most likely path” on which prefetched simulations can be run. The basic idea is thus to run simulations and costly likelihood computations on many parallel processors along a path that has been prefetched for its high approximate likelihood. There are obviously instances where this speculative simulation is not helpful, because the actual chain with the genuine target ends up following another path. Angelino et al. (2014) actually go further by constructing sequences of approximations for the precomputations, proposing to subsample the original data and use a normal approximation to the difference of the log-(sub-)likelihoods. See Strid (2010) for related ideas.

A different use of parallel capabilities is found in Calderhead (2014). At each iteration of Calderhead’s algorithm, N replicas are generated, rather than one as in traditional Metropolis–Hastings. The Markov chain actually consists of N components, from which one component is selected at random as a seed for the next proposal. This approach can be seen as a special type of data augmentation (Tanner and Wong 1987), where the index of the selected component is an auxiliary variable. The neat trick in the proposal (and the reason for its efficiency gain) is that the stationary distribution of the auxiliary variable can be determined and hence used N times in updating the vector of N components. An interesting feature of this approach appears when the original Metropolis–Hastings algorithm is expressed as a finite state space Markov chain on the set of indices \(\{1,\ldots ,N\}.\) Conditional on the values of the N-dimensional vector, the stationary distribution of that sub-chain is no longer uniform, and hence picking N indices from the stationary distribution helps in selecting the most appropriate images, which explains why the rejection rate decreases. The paper indeed evaluates the impact of increasing the number of proposals in terms of ESS, acceptance rate, and mean squared jump distance. Since this proposal is an almost free bonus resulting from using N processors, it sounds worth investigating and comparing with more complex parallel schemes.

Neiswanger et al. (2013) introduced the notion of embarrassingly parallel MCMC, where “embarrassing” refers to the “embarrassingly simple” solution proposed therein, namely to address the difficulty of handling very large datasets by running completely independent parallel MCMC samplers on parallel threads or computers and using the outcomes of those samplers as density estimates, pulled together as a product towards an approximation of the true posterior density. In other words, the idea is to break the posterior down as

$$\begin{aligned} p(\theta |y) \propto \prod \limits _{i=1}^{m} p_{i}(\theta |y), \end{aligned}$$
(7)

and to use the estimate

$$\begin{aligned} \hat{p}(\theta |y) \propto \prod \limits _{i=1}^{m} \hat{p}_{i}(\theta |y), \end{aligned}$$

where the individual estimates are obtained, say, nonparametrically. The method is then “asymptotically exact” in the weak (and unsurprising) sense of converging in the number of MCMC iterations. Still, there is a theoretical justification here that was not found in previous parallel methods, which mixed all resulting samples without accounting for the subsampling. And the point is made that, in many cases, running MCMC samplers with subsamples produces faster convergence. The decomposition of \(p(\cdot )\) into its components is done by partitioning the iid data into m subsets and taking a power 1/m of the prior in each case. (This may induce issues about impropriety.) However, the subdivision is arbitrary and can thus be implemented in cases other than the fairly restrictive iid setting. Because each (subsample) nonparametric estimate involves T terms, the resulting overall estimate contains \(T^{m}\) terms, and the authors suggest using an independent Metropolis sampler to handle this complexity. This is in fact necessary for producing a final sample from the (approximate) true posterior distribution.
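
A toy sketch of the recombination step (our illustration, with scipy’s gaussian_kde standing in for the nonparametric estimates and an independence Metropolis sampler exploring the product, as the authors suggest; each entry of sub_samples is assumed to be a (T, d) array of subposterior draws):

import numpy as np
from scipy.stats import gaussian_kde

def combine_subposteriors(sub_samples, n_iter, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    kdes = [gaussian_kde(s.T) for s in sub_samples]    # one estimate per batch
    g = gaussian_kde(np.vstack(sub_samples).T)         # proposal: pooled KDE
    def log_target(x):                                 # log of the product (7)
        return sum(float(k.logpdf(x)[0]) for k in kdes)
    theta = g.resample(1)[:, 0]
    out = np.empty((n_iter, theta.size))
    for n in range(n_iter):
        prop = g.resample(1)[:, 0]
        # independence Metropolis ratio, correcting for the proposal density
        log_alpha = (log_target(prop) - log_target(theta)
                     + float(g.logpdf(theta)[0]) - float(g.logpdf(prop)[0]))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        out[n] = theta
    return out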

In a closely related way, Wang and Dunson (2013) start from the same product representation (7) of the target (posterior). However, they criticise the choice made by Neiswanger et al. (2013) to use MCMC approximations for each component of the product, for the following reasons:

(1) Curse of dimensionality in the number of parameters d;
(2) Curse of dimensionality in the number of subsets m;
(3) Tail degeneration;
(4) Support inconsistency and mode misspecification.

Point 1 is relevant, but there may be ways other than kernel estimation to mix samples from the terms in the product. Point 2 is less of a clear-cut drawback: while the \(T^{m}\) terms corresponding to a product of m sums of T terms sound self-defeating, Neiswanger et al. (2013) use a clever device to avoid the combinatorial explosion, namely operating on one component at a time, and having non-manageable targets is not such an issue in the post-MCMC era. Point 3 is formally correct, in that the kernel tail behaviour induces the tail behaviour of the kernel estimate, most likely disconnected from the true target tail behaviour; but this feature is true for any non-parametric estimate, even for the Weierstrass transform defined below, and hence may not be so relevant in practice. In fact, by lifting the tails up, the simulation from the subposteriors should help in visiting the tails of the true target. Finally, point 4 does not seem to be life-threatening: assuming that the true target can be computed up to a normalising constant, the value of the target at every simulated parameter could be computed, eliminating those outside the support of the product and highlighting modal regions.

The Weierstrass transform of a density f is a convolution of f with an arbitrary kernel K. Wang and Dunson (2013) propose to simulate from the product of the Weierstrass transforms, using a multi-tiered Gibbs sampler. Hence, the parameter is only simulated once, and from a controlled kernel, while the random effects from the convolution are related to each subposterior. While the method requires coordination between the parallel threads, the components of the target are computed separately, each on a single thread. The clearest perspective on the Weierstrass transform may possibly be the rejection sampling version, where simulations from the subpriors are merged together into a normal proposal on \(\theta ,\) to be accepted with a probability depending on the subprior simulations.

VanDerwerken and Schmidler (2013) stay in the spirit of parallel MCMC papers such as consensus Bayes (Scott et al. 2013), embarrassingly parallel MCMC (Neiswanger et al. 2013) and Weierstrass MCMC (Wang and Dunson 2013), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or via parallel chains acting independently, which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. The motivation is that “small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel with the others. More specifically, given a partition of the space into sets \(A_{i}\) with posterior weights \(w_{i},\) parallel chains are associated with targets equal to the original target restricted to those \(A_{i}\)’s. This is therefore an MCMC version of partitioned sampling. With regard to these shortcomings, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. It is somewhat unclear (a) whether or not this impacts ergodicity (it all depends on the way the chain is constructed, i.e., against which target), as it could lead to an over-representation of some boundary regions, and (b) whether or not it improves the overall convergence properties of the chain(s). A more delicate issue with the partitioned MCMC approach lies with the partitioning itself. Indeed, in a complex and high-dimensional model, the construction of an appropriate partition is a challenge in itself, as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossings between modes, provided all modes are represented and the chain for each partition set \(A_{i}\) has enough energy to explore this set. It actually sounds unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets \(A_{i}\) are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. A last comment about this innovative paper is that the adaptive construction of the partition has much in common with Wang–Landau schemes (Wang and Landau 2001; Lee et al. 2005; Atchadé and Liu 2010; Jacob and Ryder 2014).

3 ABC and others, exactly delivering an approximation

Motivated by highly complex models where MCMC algorithms and other Monte Carlo methods were far too inefficient, approximate methods have emerged where the output can no longer be considered as simulations from the genuine posterior, even under idealised situations of infinite computing power. These methods include the ABC techniques described in more detail below, but also variational Bayes (Jaakkola and Jordan 2000), empirical likelihood (Owen 2001), integrated nested Laplace approximation (INLA) (Rue et al. 2009) and other solutions that rely on pseudo-models, or on summarised versions of the data, or both. It is quite important to signal this evolution, as we think that it may be a central feature of computational Bayesian statistics in the coming years. From a statistical perspective, it also induces a somewhat paradoxical situation where loss of information is balanced by improvement in precision, for a given computational budget. This perspective is not only interesting at the computational level but forces us (as statisticians) to re-evaluate in depth the nature of a statistical model, and could produce a paradigm shift in the near future by giving a brand new meaning to George Box’s motto that “all models are wrong”.

3.1 ABC per se

It seems important to discuss ABC methods in this partial tour of Bayesian computational techniques since (a) they provide the only approach to their model for some Bayesians, (b) they deliver samples in the parameter space that are exact simulations from a posterior of some kind (Wilkinson 2013), \(\pi ^{\mathrm{ABC}}(\theta |\mathbf {y}_{0}),\) if not the original posterior \(\pi (\theta |\mathbf {y}_{0}),\) where \(\mathbf {y}_{0}\) denotes the data in this section, (c) they may be more intuitive to some researchers outside statistics, as they entail simulating from the inferred model, i.e., going forward from parameter to data, rather than backward, from data to parameter, as in traditional Bayesian inference, (d) they can be merged with MCMC algorithms, and (e) they allow inference to be drawn directly from summaries of the data rather than from the data itself.

ABC techniques play in the 2000s the role that MCMC methods played in the 1990s, in that they handle new models for which earlier (e.g., MCMC) algorithms were at a loss, in the same way the latter were able to handle models beyond the reach of regular Monte Carlo approaches, such as latent variable models (Tanner and Wong 1987; Diebolt and Robert 1994; Richardson and Green 1997). New models for which ABC unlocked the gate include Markov random fields, Kingman’s coalescent for phylogeographical data, likelihood models with an intractable normalising constant, and models defined by their quantile function or their characteristic function. While the ABC approach first appeared as a “quick-and-dirty” solution, to be considered only until more elaborate representations could be found, those algorithms have been progressively incorporated into the statistician’s toolbox as a novel form of generic nonparametric inference handling partly-defined statistical models. They are therefore attractive as much for this reason as for being handy computational solutions when everything else fails.

A statistically intriguing feature of those methods is that they customarily require—for greater efficiency—replacing the data with (much) smaller-dimension summaries, or summary statistics, because of the complexity of the former. In almost every case calling for ABC, those summaries are not sufficient statistics and the method thus implies from the start a loss of statistical information, at least at a formal level, since relying on the raw data is out of the question and the additional information it provides is therefore moot. This imposed reduction of the statistical information raises many relevant questions, from the choice of summary statistics (Blum et al. 2013) to the consistency of the ensuing inference (Robert et al. 2011).

Although it has now diffused into a wide range of applications, the technique of ABC was first introduced by and for population genetics (Tavaré et al. 1997; Pritchard et al. 1999) to handle ancestry models driven by Kingman’s coalescent and with strictly intractable likelihoods (Beaumont 2010). The likelihood function of such genetic models is indeed “intractable” in the sense that, while derived from a fully defined and parameterised probability model, it cannot be computed (at all, or within a manageable time) for a single value of the parameter and for the given data. Bypassing the original example to avoid getting mired in the details of population genetics, examples of intractable likelihoods include densities with intractable normalising constants, i.e., \(f (\mathbf {y}|\theta ) = g(\mathbf {y}|\theta )/Z(\theta )\) as in Potts (1952) and auto-exponential (Besag 1972) models, and pseudo-likelihood models (Cucala et al. 2009).

Example 1

A very simple illustration of an intractable likelihood is provided by Bayesian inference based on the median and median absolute deviation (mad) statistics of a sample from an arbitrary location-scale family, \(y_{1},\ldots ,y_{n}\mathop {\sim }\limits ^{\mathrm{iid}} \sigma ^{-1} g(\sigma ^{-1}\{y-\mu \}),\) as the joint distribution of this pair of statistics is not available in closed form.

The concept at the core of ABC methods can be seen as both very naïve and intrinsically related to the foundations of Bayesian statistics as inverse probability (Rubin 1984). This concept is that data \(\mathbf {y}\) simulated conditional on values of the parameter close to the “true” value of the parameter should look more similar to the actual data \(\mathbf {y}_{0}\) than data \(\mathbf {y}\) simulated conditional on values of the parameter far from the “true” value. ABC actually involves an acceptance/rejection step in that parameters simulated from the prior are accepted only when

$$\begin{aligned} \rho \left( \mathbf {y},\,\mathbf {y}_{0}\right) < \epsilon , \end{aligned}$$

where \(\rho (\cdot ,\,\cdot )\) is a distance and \(\epsilon >0\) is called the tolerance. It can be shown that the algorithm exactly samples the posterior when \(\epsilon =0,\) but this is very rarely achievable in practice (Grelaud et al. 2009). An algorithmic representation is as follows:

[Algorithm 2: ABC accept/reject sampler]

Calibration of the ABC method in Algorithm 2 involves selecting the distance \(\rho (\cdot ,\,\cdot )\) and deducing the tolerance from computational cost constraints. However, in realistic settings, ABC is never implemented as such because comparing raw data to simulated raw data is rarely efficient, noise dominating signal (see, e.g., Marin et al. 2012 for toy examples). It is therefore natural that one first considers dimension-reduction techniques to bypass this curse of dimensionality. For instance, if rudimentary estimates S(y) of the parameter \(\theta \) are available, they are good candidates. In the ABC literature, they are called summary statistics, a term that does not impose any constraint on their form and hence leaves open the question of performance, as discussed in Marin et al. (2012) and Blum et al. (2013). A more practical version of the ABC algorithm is shown in Algorithm 3 below, with a different output for each choice of the summary statistic. We stress in this version of the algorithm the construction of the tolerance \(\epsilon \) as a quantile of the simulated distances \(\rho (S(\mathbf {y}^{0}),\,S(\mathbf {y}^{(t)})),\) rather than an additional parameter of the method.

[Algorithm 3: ABC sampler with summary statistics and quantile-based tolerance]
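A bare-bones rendering of Algorithm 3 may help fix ideas. The sketch below (ours, with illustrative names throughout) builds a reference table from the prior and the model, and then sets the tolerance as an empirical quantile of the simulated distances; the Euclidean distance between summaries is only one possible choice of \(\rho (\cdot ,\,\cdot ).\)

```python
import numpy as np

def abc_reference_table(prior_sample, simulate, summary, y0, n_sims,
                        alpha=0.005, rng=None):
    """ABC in the spirit of Algorithm 3: the tolerance eps is the
    alpha-quantile of the distances in the reference table."""
    rng = np.random.default_rng() if rng is None else rng
    s0 = summary(y0)
    thetas, dists = [], []
    for _ in range(n_sims):
        theta = prior_sample(rng)          # theta^(t) from the prior
        y = simulate(theta, rng)           # y^(t) from f(.|theta^(t))
        thetas.append(theta)
        dists.append(np.linalg.norm(summary(y) - s0))   # rho(S(y0), S(y))
    dists = np.asarray(dists)
    eps = np.quantile(dists, alpha)        # tolerance as a quantile
    keep = dists <= eps
    return np.asarray(thetas)[keep], eps
```

For instance, in the setting of Example 1, `summary` would return the pair (median, mad) of the sample and `simulate` would draw from the location-scale family.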

An immediate question about this approximate algorithm is how closely it remains connected with the original posterior distribution and, where the connection weakens, from where it draws its legitimacy. A first remark in this direction is that it constitutes at best a convergent approximation to the posterior distribution \(\pi (\theta |S(y_{0})).\) It can easily be seen that ABC generates outcomes from a genuine posterior distribution when the data is randomised with scale \(\epsilon \) (Wilkinson 2013; Fearnhead and Prangle 2012). This interpretation indicates a decrease in the precision of the inference but it does not provide a universal validation of the method. A second perspective on the ABC output is that it is based on a nonparametric approximation of the sampling distribution (Blum 2010; Blum and François 2010), connected with both indirect inference (Drovandi et al. 2011) and k-nearest-neighbour estimation (Biau et al. 2014). While a purely Bayesian nonparametric analysis of this aspect has not yet emerged, this brings additional, if cautious, support for the method.

Example 2

Continuing from the previous example of a location-scale sample only monitored through the median and mad statistics, we consider the special case of a normal sample \(y_{1},\ldots ,y_{n}\sim \mathcal {N}(\mu ,\,\sigma ^{2}),\) with \(n=100.\) Using a conjugate prior \(\mu \sim \mathcal {N}(0,\,10),\,\sigma ^{-2}\sim \mathcal {G}a(2,\,5),\) we generated \(10^{6}\) parameter values, along with the corresponding pairs of summary statistics. When creating the distance \(\rho (\cdot ,\,\cdot ),\) we used both of the following versions, a scaled \(L_{1}\) and a scaled \(L_{2}\) distance,

$$\begin{aligned} \rho _{1}\left( S(\mathbf {y}_{0}),\,S(\mathbf {y})\right)&= \frac{\left| \mathrm{med}(\mathbf {y}_{0})-\mathrm{med}(\mathbf {y})\right| }{\delta _{\mathrm{med}}} + \frac{\left| \mathrm{mad}(\mathbf {y}_{0})-\mathrm{mad}(\mathbf {y})\right| }{\delta _{\mathrm{mad}}},\\ \rho _{2}\left( S(\mathbf {y}_{0}),\,S(\mathbf {y})\right)&= \frac{\left( \mathrm{med}(\mathbf {y}_{0})-\mathrm{med}(\mathbf {y})\right) ^{2}}{\delta _{\mathrm{med}}^{2}} + \frac{\left( \mathrm{mad}(\mathbf {y}_{0})-\mathrm{mad}(\mathbf {y})\right) ^{2}}{\delta _{\mathrm{mad}}^{2}}, \end{aligned}$$

where the denominators \(\delta _{\mathrm{med}}\) and \(\delta _{\mathrm{mad}}\) are computed from the reference table in order to scale the components properly. Figure 1 shows the impact of the choice of this distance, but even more clearly the discrepancy between ABC-based inference and the true inference on \((\mu ,\,\sigma ^{2}).\)

The discrepancy can however be completely eliminated by post-processing: Fig. 2 reproduces Fig. 1 by comparing the histograms of an ABC sample with the version corrected by Beaumont et al.’s (2002) local regression, as the latter is essentially equivalent to a regular Gibbs output.
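This post-processing step is itself straightforward to implement. The sketch below is our transcription of the local linear correction of Beaumont et al. (2002): a weighted regression of the accepted parameters on the accepted summaries, with Epanechnikov weights, whose fitted trend is then removed.

```python
import numpy as np

def local_linear_adjust(thetas, summaries, s0, dists, eps):
    """Beaumont et al. (2002) regression adjustment of an ABC output.

    thetas: (N,) or (N, p) accepted parameters; summaries: (N, q)
    accepted summary statistics; s0: (q,) observed summary."""
    keep = dists <= eps
    th, S, d = thetas[keep], summaries[keep], dists[keep]
    w = 1.0 - (d / eps) ** 2               # Epanechnikov kernel weights
    X = np.column_stack([np.ones(len(S)), S - s0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ th)  # weighted LS fit
    # remove the local linear trend in the summaries
    return th - (S - s0) @ beta[1:]
```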

Fig. 1 Comparison of the posterior distributions on \(\mu \) (left) and \(\sigma \) (right) when using the ABC Algorithm 3 with distance \(\rho _{1}\) (top) and \(\rho _{2}\) (centre), and when using a standard Gibbs sampler (bottom). All three samples are based on the same number of subsampled parameters. The dataset is a \(\mathcal {N}(3,\,2^{2})\) sample and the tolerance value \(\epsilon \) corresponds to \(\alpha =0.5\,\%\) of the reference table

Fig. 2 Comparison of the posterior distributions on \(\mu \) (left) and \(\sigma \) (right) when using the ABC Algorithm 3 with distance \(\rho _{1}\) (top), a version post-processed by Beaumont et al.’s (2002) local regression (centre), and when using a standard Gibbs sampler (bottom). The simulation setting is the same as in Fig. 1

Barber et al. (2015) study the rate of convergence for ABC algorithms through the mean square error when approximating a posterior moment. They show the convergence rate is of order \(O(N^{-4/(q+4)}),\) where q is the dimension of the ABC summary statistic and N the number of simulations, associated with an optimal tolerance of order \(O(N^{-1/(q+4)}).\) Those rates are connected with the nonparametric nature of ABC, as already suggested in the earlier literature: for instance, Blum (2010), who links ABC with standard kernel density non-parametric estimation, finds a tolerance (re-expressed as a bandwidth) of order \(N^{-1/(q+4)}\) and an rmse of order \(N^{-2/(q+4)}\) as well, while Fearnhead and Prangle (2012) obtain similar rates, with a tolerance of order \(N^{-1/(q+4)}\) for noisy ABC. See also Calvet and Czellar (2014). Similarly, Biau et al. (2014) obtain precise convergence rates for ABC interpreted as a k-nearest-neighbour estimator.

Lee and Łatuszyński (2014) have also produced precise characterisations of the geometric ergodicity, or lack thereof, of four ABC–MCMC algorithms:

1. the standard ABC–MCMC (with N replicates of the simulated pseudo-data for each simulated parameter value),

2. versions involving simulations of the replicates repeated at the subsequent step,

3. use of a stopping rule in the generation of the pseudo-data, and

4. a “gold-standard algorithm” based on the (unavailable) measure of an \(\epsilon \) ball around the data.

Based on a result by Roberts and Tweedie (1996b), also used in Mengersen and Tweedie (1996), namely that an MCMC chain cannot be geometrically ergodic when there exist almost-absorbing states, they derive that (under some technical assumptions) the first two versions above cannot be variance-bounding (i.e., their spectral gap is zero), while the last two versions can be both variance-bounding and geometrically ergodic under appropriate conditions on the prior and the above ball measure. This result is rather striking in that simulating a random number of auxiliary variables is sufficient to produce geometric ergodicity. We note that it does not contradict the parallel result of Bornn et al. (2014), who establish that there is no efficiency gain in simulating \(N>1\) replicates of the pseudo-data, since no randomness in the number of replicates is involved in that approach. However, the latter result only applies to functions with finite variances.

When testing hypotheses and selecting models, the Bayesian approach relies on modelling hypotheses and model indices as part of the parameter, and hence ABC naturally operates at this level as well, as demonstrated in Algorithm 4 following Cornuet et al. (2008), Grelaud et al. (2009) and Toni et al. (2009). In fields like population genetics, model choice and hypothesis validation are presumably the primary motivation for using ABC methods, as exemplified in Belle et al. (2008), Cornuet et al. (2010), Excoffier et al. (2009), Ghirotto et al. (2010), Guillemaud et al. (2009), Leuenberger and Wegmann (2010), Patin et al. (2009), Ramakrishnan and Hadly (2009), Verdu et al. (2009) and Wegmann and Excoffier (2010). It is also the area that attracts most of the criticisms addressed against ABC: while some are easily dismissed (see, e.g., Templeton 2008, 2010; Beaumont et al. 2010; Berger et al. 2010), the impact of the choice of the summary statistics on the value of the posterior probability remains a delicate issue that prompted Pudlo et al. (2014) to advocate the alternative use of a posterior predictive error.

[Algorithm 4: ABC model choice]

Indeed, Robert et al. (2011) pointed out the potential irrelevance of ABC-based posterior probabilities, due to the possible ancillarity (for model choice) of summary statistics, as also explained in Didelot et al. (2011). Marin et al. (2014) consider for instance the comparison of normal and Laplace fits on both normal and Laplace samples and show that using the sample mean and sample variance as summary statistics produces Bayes factors converging to values near 1, instead of the consistent 0 and \(+\infty .\)

Marin et al. (2014) analyse this phenomenon with the aim of producing a necessary and sufficient consistency condition on summary statistics. Quite naturally, the summaries that are acceptable must display different behaviour under the two models, in the guise of ranges of means \(\mathbb {E}_{\theta }[S(\mathbf {y}_{0})]\) that do not intersect across models. (In the counter-example of the normal–Laplace test, the expectations of the sample mean and variance can be recovered under both models.) This characterisation then leads to a practical asymptotic test for validating summary statistics and to the realisation that a larger number of summaries helps in achieving this goal (while degrading the estimated tolerance). More importantly, it shows that the reduction of information represented by an ABC approach may prevent discrimination between models, at least when trying to recover the Bayes factor. In the end, this is a natural consequence of simplifying the description of both the data and the model, and can be found in most limited-information settings.

3.2 More fish in the alphabet soup

Besides ABC, approximation techniques have spread far and wide, towards analysing more complex or less completely defined models. Rather than a source of confusion, this multiplicity of available approximations is beneficial both to the understanding of the underlying model and to the calibration of those different methods.

Variational Bayes methods have been proposed for at least two decades to substitute exponential families \(q(\theta |\lambda )\) for complex posterior distributions \(\pi (\theta )\) (Jordan et al. 1999; MacKay 2002). The central notion in those methods is that the exponential family structure and a so-called mean-field representation of the approximation

$$\begin{aligned} q(\theta |\lambda )=\prod \limits _{i=1}^{k} q_{i}\left( \theta _{i}|\lambda _{i}\right) , \end{aligned}$$

allow for a sometimes closed-form minimisation of the Kullback–Leibler divergence \(\mathrm{KL}(q(\theta |\lambda ),\,\pi (\theta ))\) between the true target and its approximation. If not, the setting is quite congenial to the use of EM algorithms (Paisley et al. 2012). See Salimans and Knowles (2013) for a contemporary view on this approach, which offers considerable gains in terms of computing time, while being difficult to assess in terms of discrepancy with the “truth”, i.e., the outcome that would result from using the genuine posterior.
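As a concrete instance, the textbook mean-field approximation for a normal model with unknown mean and precision admits closed-form coordinate updates (the sketch below is a standard example of ours, not tied to the references above): with \(y_{i}\sim \mathcal {N}(\mu ,\,\tau ^{-1}),\,\mu |\tau \sim \mathcal {N}(\mu _{0},\,(\lambda _{0}\tau )^{-1})\) and \(\tau \sim \mathcal {G}a(a_{0},\,b_{0}),\) the factors \(q(\mu )\) and \(q(\tau )\) remain normal and gamma, respectively, and are iterated until stabilisation.

```python
import numpy as np

def cavi_normal(y, mu0=0.0, lam0=1.0, a0=2.0, b0=5.0, n_iter=50):
    """Coordinate-ascent mean-field VB for the conjugate normal-gamma
    model; returns the parameters of q(mu) = N(m, s2), q(tau) = Ga(a, b)."""
    n, ybar = len(y), np.mean(y)
    a = a0 + 0.5 * (n + 1)                      # shape of q(tau), fixed
    E_tau = a0 / b0                             # initial E_q[tau]
    m = (lam0 * mu0 + n * ybar) / (lam0 + n)    # mean of q(mu), also fixed
    for _ in range(n_iter):
        s2 = 1.0 / ((lam0 + n) * E_tau)         # variance of q(mu)
        # rate of q(tau), from E_q[(y_i - mu)^2] and E_q[(mu - mu0)^2]
        b = b0 + 0.5 * (np.sum((y - m) ** 2) + n * s2
                        + lam0 * (s2 + (m - mu0) ** 2))
        E_tau = a / b
    return (m, s2), (a, b)
```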

Another approach that has met with considerable interest in the past 5 years is INLA (Rue et al. 2009). The method operates on latent Gaussian random fields, with likelihoods of the form

$$\begin{aligned} \prod \limits _{i=1}^{n} f\left( x_{i}|\eta _{i},\,\theta \right) , \end{aligned}$$

where the \(x_{i}\)’s are the observables and the \(\eta _{i}\)’s are latent variables. Using Laplace approximations to the marginal distributions \(\pi (\theta |\mathbf {x}_{0})\) and to \(f(\mathbf {\eta }|\mathbf {x}_{0}),\) INLA produces fast and accurate approximations of the true posterior distribution as well as of the marginal likelihood value. Thanks to the availability of a well-constructed package called R-INLA, this approach has gathered a large group of followers.

A somewhat exotic example of variational approximation is expectation-propagation (EP) (Minka 2001), which starts from an arbitrary decomposition of the target distribution

$$\begin{aligned} \pi (\theta ) = \prod \limits _{j=1}^{k} \pi _{j}(\theta ), \end{aligned}$$

(often inspired by a likelihood decomposition into groups of observations) and iteratively approximates each term \(\pi _{j}\) in the product by a density from an exponential family, \(\nu (\cdot |\lambda ),\) using the other approximations as a marginal. Given the current approximation of \(\pi (\theta )\) at iteration t,

$$\begin{aligned} \nu \left( \theta |\lambda _{t}\right) = \prod \limits _{j=1}^{k} \nu _{j}\left( \theta |\lambda _{t}\right) , \end{aligned}$$

where \(\lambda _{t}\) is the current value of the hyperparameter, the tth step in the EP algorithm goes as follows:

1. Select \(1\le j\le k\) at random.

2. Define the marginal
$$\begin{aligned} \nu _{-j}\left( \theta |\lambda _{t}\right) \propto \frac{\nu (\theta |\lambda _{t})}{\nu _{j}(\theta |\lambda _{t})}. \end{aligned}$$

3. Update the hyperparameter \(\lambda _{t}\) by solving
$$\begin{aligned} \lambda _{t+1} = \mathop {\hbox {argmin}}\limits _{\lambda } \mathrm{KL}\left\{ \pi _{j}(\theta )\nu _{-j}\left( \theta |\lambda _{t}\right) ,\,\nu (\theta |\lambda )\right\} . \end{aligned}$$

4. Update \(\nu _{j}(\theta |\lambda _{t})\) as
$$\begin{aligned} \nu _{j}\left( \theta |\lambda _{t+1}\right) \propto \frac{ \nu (\theta |\lambda _{t+1}) }{ \nu _{-j}(\theta |\lambda _{t}) }. \end{aligned}$$

(In the above, KL denotes the Kullback–Leibler divergence.) The algorithm stops at stationarity. The convergence of this approach is not yet fully understood, but Barthelmé and Chopin (2014) consider EP as a practical substitute for ABC, avoiding the selection of summary statistics by using a local constraint

$$\begin{aligned} \left\| x_{i}-x^\mathrm{obs}\right\| \le \epsilon , \end{aligned}$$

on each element of the simulated pseudo-data vector, \(x^\mathrm{obs}\) being the actual data. In addition, EP provides an approximation of the evidence. In the ABC setting, when using a Normal distribution as the exponential family default, implementing EP means computing empirical mean and empirical variance, one observation at a time, under the above tolerance constraint. Obviously, using a Normal candidate means that the final approximation will also look much like a Normal distribution, which both links with other Normal approximations like INLA and variational methods, and signals a difficulty with EP in less smooth cases, such as ridge-like or multimodal posteriors.
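In its simplest form, a single EP-ABC site update then amounts to the moment-matching step sketched below (a schematic scalar version with names of our choosing; see Barthelmé and Chopin 2014 for the actual algorithm): draw from the Gaussian cavity, keep the parameters whose simulated i-th observation lands within \(\epsilon \) of the observed one, and match the empirical mean and variance of the survivors.

```python
import numpy as np

def ep_abc_site_update(cavity_mean, cavity_var, simulate_one, x_obs_i,
                       eps, n_draws=10_000, rng=None):
    """One EP-ABC site update for a scalar parameter: the tilted
    distribution is the cavity restricted by the local ABC constraint
    |x_i - x_obs_i| <= eps, and is summarised by moment matching."""
    rng = np.random.default_rng() if rng is None else rng
    theta = cavity_mean + np.sqrt(cavity_var) * rng.standard_normal(n_draws)
    x = np.array([simulate_one(t, rng) for t in theta])
    keep = np.abs(x - x_obs_i) <= eps
    if keep.sum() < 2:         # too few survivors: leave the site unchanged
        return cavity_mean, cavity_var
    return theta[keep].mean(), theta[keep].var()
```

In a complete EP pass, the site term \(\nu _{j}\) is then recovered by dividing the matched Gaussian by the cavity, i.e., by subtracting natural parameters.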

While different approximations keep being developed and tested, with arguments ranging from efficient programming, to avoiding simulations, to having an ability to deal with more complex structures, their drawback is the overall incapacity to assess the amount of approximation involved. Bootstrap evaluations can be attempted in the simplest cases but cannot be extended to more realistic situations.

4 Optimisation in modern Bayesian computation

Optimisation methodology for high-dimensional maximum-a-posteriori (MAP) estimation is another area of Bayesian computation that has received a lot of attention in recent years, particularly for problems related to machine learning, signal processing and computer vision. One reason for this is that, for many Bayesian models, optimisation is significantly more computationally tractable than integration. This has generated a lot of interest in MAP estimators, especially for applications involving very high-dimensional parameter spaces or tight computing time constraints, for which calculating other summaries of the posterior distribution is not feasible. Here we review some of the major breakthroughs in this topic, which originated mainly outside the statistics community. We focus on developments related to high-dimensional convex optimisation, though many of the techniques discussed below are also useful for non-convex optimisation. In particular, in Sect. 4.1 we concentrate on proximal optimisation algorithms, a powerful class of iterative methods that exploit tools from convex analysis, monotone operator theory and the theory of non-expansive mappings to construct carefully designed fixed-point schemes. We refer the reader to the excellent book by Bauschke and Combettes (2011) for the mathematics underpinning proximal optimisation algorithms, and to the recent tutorial papers by Combettes and Pesquet (2011), Cevher et al. (2014) and Parikh and Boyd (2014) for an overview of the field and applications to signal processing and machine learning.

However, we do think it is vital to insist that, at the same time as asserting that modern optimisation methodology represents a much-underused opportunity in Bayesian inference, in its raw form it inevitably fails to deliver essential elements of the Bayesian paradigm. The vision is not to deliver a point estimate of an unknown structure, but the full richness of Bayesian inference, with its coherence, its proper treatment of uncertainty, its intrinsic treatment of model uncertainty, and so on. Bayesian statistics does not boil down to optimisation with penalisation (Lange et al. 2014). We need to express the uncertainty associated with decisions and estimation, stemming from the stochastic nature of the data and from our lack of knowledge about relevant mechanisms.

The challenge is to use the awesome capacity of fast optimisation in a high-dimensional parameter space to focus on local regions of that space where a combination of analytic and numerical investigation can deliver at least approximations to full posterior distributions and derived quantities. The community has barely risen to this challenge, with only isolated examples such as the discussion in Green (2015) of a problem in unlabelled shape analysis. However, the growing community of INLA (Rue et al. 2009) users may bring a heightened awareness of such possibilities, along with its efficient code (Schrödle and Held 2011; Muff et al. 2013). Another promising research area is the use of mathematical and algorithmic tools from convex optimisation to design more efficient high-dimensional MCMC algorithms (Pereyra 2015).

4.1 Proximal algorithms

Similarly to many other computational methodologies that are widely used nowadays, proximal algorithms were first proposed several decades ago by Moreau (1962), Martinet (1970) and Rockafellar (1976), and regained attention recently in the context of large-scale inverse problems and “big data”.

We consider the computation of maximisers of posterior densities \(\pi ({\varvec{\theta }}) = \exp {\{-g({\varvec{\theta }})\}}/\kappa \) that are high-dimensional and log-concave, which we formulate as

$$\begin{aligned} \hat{{\varvec{\theta }}}_{MAP} = \mathop {\hbox {argmin}}\limits _{{\varvec{\theta }}\in \mathbb {R}^{n}} g({\varvec{\theta }}), \end{aligned}$$
(8)

where g belongs to the class \(\varGamma _{0}(\mathbb {R}^{n})\) of lower semicontinuous convex functions from \(\mathbb {R}^{n}\) to \((-\infty ,\,+\infty ].\) Notice that g may be non-differentiable and may take the value \(g({\varvec{\theta }}) = +\infty ,\) reflecting constraints in the parameter space. In order to introduce proximal algorithms we first recall the following standard definitions and results from convex analysis: we say that \({\varvec{\varphi }}\in \mathbb {R}^{n}\) is a subgradient of g at \({\varvec{\theta }}\in \mathbb {R}^{n}\) if it satisfies \((\mathbf {u}-{\varvec{\theta }})^{T}{\varvec{\varphi }}+ g({\varvec{\theta }}) \le g(\mathbf {u}),\,\forall \mathbf {u}\in \mathbb {R}^{n}.\) The set of all such subgradients defines the subdifferential set \(\partial g({\varvec{\theta }}),\) and \(\hat{{\varvec{\theta }}}_{MAP}\) is a minimiser of g if and only if \(\varvec{0} \in \partial g(\hat{{\varvec{\theta }}}_{MAP}).\) The (convex) conjugate of \(g \in \varGamma _{0}(\mathbb {R}^{n})\) is the function \(g^{*} \in \varGamma _{0}(\mathbb {R}^{n})\) defined as \(g^{*}({\varvec{\varphi }}) = \sup _{\mathbf {u}\in \mathbb {R}^{n}} \mathbf {u}^{T}{\varvec{\varphi }}- g(\mathbf {u}).\) The subgradients of g and \(g^{*}\) satisfy the property \({\varvec{\varphi }}\in \partial g({\varvec{\theta }}) \Leftrightarrow {\varvec{\theta }}\in \partial g^{*}({\varvec{\varphi }}).\)

Proximal algorithms take their name from the proximity mapping, defined for \(g \in \varGamma _{0}(\mathbb {R}^{n})\) and \(\lambda > 0\) as (Moreau 1962)

$$\begin{aligned} {\text {prox}}^{\lambda }_{g}({\varvec{\theta }}) = \mathop {\hbox {argmin}}\limits _{\mathbf {u}\in \mathbb {R}^{n}} g(\mathbf {u}) + \Vert \mathbf {u}- {\varvec{\theta }}\Vert ^{2}/2\lambda . \end{aligned}$$
(9)

In order to gain intuition about this mapping it is useful to analyse its behaviour when \(\lambda \in \mathbb {R}^{+}\) is either very small or very large. In the limit \(\lambda \rightarrow \infty ,\) the quadratic penalty term vanishes and (9) maps all points to \(\hat{{\varvec{\theta }}}_{MAP}.\) In the opposite limit \(\lambda \rightarrow 0,\) (9) becomes the identity operator and maps \({\varvec{\theta }}\) to itself. For finite values of \(\lambda ,\,{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\) behaves similarly to a gradient mapping and moves points in the direction of \(\hat{{\varvec{\theta }}}_{MAP}.\) Like gradients, proximity mappings have several properties that are useful for devising fixed-point methods (Bauschke and Combettes 2011).

Property 1

The proximity mapping of g is related to its subdifferential by the inclusion \(\{{\varvec{\theta }}-{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\}/\lambda \in \partial g\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\},\) which collapses to \(\nabla g\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\}\) when \(g \in \mathcal {C}^{1}.\) As a result, for any \(\lambda > 0,\) the minimiser of g verifies the fixed-point equation \({\varvec{\theta }}= {\text {prox}}^{\lambda }_{g}({\varvec{\theta }}).\)

Property 2

Proximity mappings are firmly non-expansive; that is, \(\Vert {\text {prox}}^{\lambda }_{g}({\varvec{\theta }})-{\text {prox}}^{\lambda }_{g}(\mathbf {u})\Vert ^{2} \le ({\varvec{\theta }}-\mathbf {u})^{T}\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})-{\text {prox}}^{\lambda }_{g}(\mathbf {u})\},\,\forall {\varvec{\theta }},\,\mathbf {u}\in \mathbb {R}^{n}.\)

Property 3

The proximity mappings of g and its conjugate \(g^{*}\) are related by Moreau’s decomposition formula: \({\varvec{\theta }}= {\text {prox}}_{g}^{\lambda }({\varvec{\theta }}) + \lambda {\text {prox}}^{1/\lambda }_{g^{*}}({\varvec{\theta }}/\lambda ).\)
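These properties are easy to experiment with numerically. For \(g = \Vert \cdot \Vert _{1},\) the proximity mapping is the classical soft-thresholding operator and the conjugate \(g^{*}\) is the indicator of the unit sup-norm ball, whose proximity mapping is a projection; the snippet below (a standard example, independent of this paper) checks Property 3 on these closed forms.

```python
import numpy as np

def prox_l1(theta, lam):
    """prox of g = ||.||_1 with parameter lam: soft-thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def project_linf_ball(u):
    """prox of g* = indicator of {||.||_inf <= 1}; the prox of an
    indicator is the projection, whatever the step parameter."""
    return np.clip(u, -1.0, 1.0)

theta = np.array([-3.0, -0.2, 0.5, 2.0])
lam = 0.7
# Moreau decomposition: theta = prox_g^lam(theta) + lam prox_{g*}^{1/lam}(theta/lam)
recon = prox_l1(theta, lam) + lam * project_linf_ball(theta / lam)
assert np.allclose(theta, recon)
```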

The simplest proximal method to solve (8) is the proximal point algorithm given by the iteration

$$\begin{aligned} {\varvec{\theta }}^{k+1} = {\text {prox}}_{g}^{\lambda }\left( {\varvec{\theta }}^{k}\right) . \end{aligned}$$
(10)

Every sequence \(\{{\varvec{\theta }}^{k}\}_{k\in \mathbb {N}}\) produced by this algorithm converges to \(\hat{{\varvec{\theta }}}_{MAP},\) even if proximity mappings are evaluated inexactly, as long as the errors are of certain types (e.g., summable). A more general proximal point algorithm includes relaxation, i.e.,

$$\begin{aligned} {\varvec{\theta }}^{k+1} = \left( 1-\alpha _{k}\right) {\varvec{\theta }}^{k} + \alpha _{k}{\text {prox}}_{g}^{\lambda }\left( {\varvec{\theta }}^{k}\right) , \quad \alpha _{k} \in (0,\,2), \end{aligned}$$

and with over-relaxation (i.e., \(\alpha _{k} \in (1,\,2)\)) often converges faster than (10). Notice from Property 1 that (10) can be interpreted as an implicit (backward) subgradient steepest descent to minimise g, i.e., \({\varvec{\theta }}^{k+1} = {\varvec{\theta }}^{k} - \lambda {\varvec{\varphi }},\) with \({\varvec{\varphi }}\in \partial g({\varvec{\theta }}^{k+1}).\) Alternatively, proximal point algorithms can also be interpreted as explicit (forward) gradient steepest descent to minimise the Moreau envelope of \(g,\, e_{\lambda } ({\varvec{\theta }}) = \inf _{\mathbf {u}\in \mathbb {R}^{n}}\, g(\mathbf {u}) + \Vert \mathbf {u}- {\varvec{\theta }}\Vert ^{2}/2\lambda ,\) a convex lower bound on g that by construction is continuously differentiable and has the same minimiser as g.

Proximal point algorithms may appear of little relevance because evaluating \({\text {prox}}^{\lambda }_{g}\) can be as difficult as solving (8) in the first place (notice that (9) is a convex minimisation problem similar to (8)). Surprisingly, many advanced proximal optimisation methods can in fact be shown to be either applications of this simple algorithm, or closely related to it.

Most proximal methods operate by splitting g, e.g.,

$$\begin{aligned} \hat{{\varvec{\theta }}}_{MAP}=\mathop {\hbox {argmin}}\limits _{{\varvec{\theta }}\in \mathbb {R}^{n}} \left\{ g_{1}({\varvec{\theta }}) + g_{2}({\varvec{\theta }})\right\} , \end{aligned}$$
(11)

such that \(g_{1} \in \varGamma _{0}(\mathbb {R}^{n})\) and \(g_{2} \in \varGamma _{0}(\mathbb {R}^{n})\) have gradients or proximity mappings that are easy to compute or approximate. For example, for many Bayesian models it is possible to find a decomposition \(g({\varvec{\theta }}) = g_{1}({\varvec{\theta }}) + g_{2}({\varvec{\theta }})\) such that \(g_{1}\) is \(\beta \)-Lipschitz differentiable (i.e., its gradient is Lipschitz continuous with constant \(\beta \)) and \(g_{2} \in \varGamma _{0}(\mathbb {R}^{n}),\) possibly non-differentiable, has a proximity mapping that can be computed efficiently with a specialised algorithm. This decomposition is useful for instance in linear inverse problems, where \(g_{1}\) is often related to a Gaussian observation model involving linear operators and \(g_{2}\) to a log-prior promoting a parsimonious representation (e.g., sparsity on some appropriate dictionary, low-rankness) or enforcing convex constraints (e.g., positivity, positive definiteness). For models that admit this decomposition, it is possible to compute \(\hat{{\varvec{\theta }}}_{MAP}\) efficiently with a forward–backward algorithm, also known as the proximal gradient algorithm

$$\begin{aligned} {\varvec{\theta }}^{k+1} = {\text {prox}}_{g_{2}}^{\lambda _{n}}\left( {\varvec{\theta }}^{k} - \lambda _{n}\nabla g_{1}\left( {\varvec{\theta }}^{k}\right) \right) . \end{aligned}$$
(12)

For \(\lambda _{n} = \lambda \in (0,\,1/\beta )\) the objective function \(g({\varvec{\theta }}^{k})\) converges to \(g(\hat{{\varvec{\theta }}}_{MAP})\) with rate O(1/k). If the value of the Lipschitz constant \(\beta \) is unknown, \(\lambda _{n}\) can be found by line-search.

A remarkable property of (12) is that it can be accelerated to converge with rate \(O(1/k^{2}),\) which is optimal for this class of problems (Nesterov 2004). This can be achieved for instance by introducing an extrapolation step

$$\begin{aligned}&{\varvec{\theta }}^{+} = {\varvec{\theta }}^{k} + \omega _{k} \left( {\varvec{\theta }}^{k} - {\varvec{\theta }}^{k-1}\right) ,\nonumber \\&{\varvec{\theta }}^{k+1} = {\text {prox}}_{g_{2}}^{\beta ^{-1}}\left( {\varvec{\theta }}^{+} - \beta ^{-1}\nabla g_{1}\left( {\varvec{\theta }}^{+}\right) \right) , \end{aligned}$$
(13)

where \(\{\omega _{k}\}_{k \in \mathbb {N}}\) is an appropriate sequence of extrapolation parameters. It was noticed by Combettes and Pesquet (2011) that several important convex optimisation algorithms can be derived as applications of the forward–backward algorithm, for example the projected gradient algorithm for minimising a Lipschitz differentiable function subject to a convex constraint (in this case the proximity mapping reduces to a projection onto the convex set). Notice that (12) can be interpreted as an implementation of the proximal point iteration (10) where \({\text {prox}}_{g}^{\lambda }({\varvec{\theta }}^{k})\) is approximated by replacing \(g_{1}\) with its first-order Taylor series approximation around the point \({\varvec{\theta }}^{k}.\)
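For illustration, the accelerated iteration (13) can be written in a few lines for the prototypical split \(g_{1}({\varvec{\theta }}) = \Vert \mathbf {y}- A{\varvec{\theta }}\Vert ^{2}/2\) and \(g_{2} = \mu \Vert \cdot \Vert _{1}\) (a generic sketch with a standard Nesterov extrapolation sequence, not code from the cited papers):

```python
import numpy as np

def accelerated_forward_backward(A, y, mu, n_iter=200):
    """FISTA-type iteration (13) for argmin ||y - A x||^2/2 + mu ||x||_1:
    forward (gradient) step on the smooth term, backward (prox,
    soft-thresholding) step on the l1 term, plus extrapolation."""
    beta = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad g1
    x = x_prev = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)    # extrapolation step
        u = z - A.T @ (A @ z - y) / beta               # forward step on g1
        x_prev, x = x, np.sign(u) * np.maximum(np.abs(u) - mu / beta, 0.0)
        t = t_next
    return x
```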

Moreover, in some cases it may be more efficient to compute \(\hat{{\varvec{\theta }}}_{MAP}\) by solving the dual of (11), for instance if g admits a decomposition \(g({\varvec{\theta }}) = g_{1}({\varvec{\theta }}) + g_{2}(L{\varvec{\theta }})\) for some linear operator \(L \in \mathbb {R}^{p\times n},\,g_{1} \in \varGamma _{0}(\mathbb {R}^{n})\) strongly convex and \(g_{2} \in \varGamma _{0}(\mathbb {R}^{p})\) with an efficient proximity mapping. In this case, the Fenchel–Rockafellar theorem states that \(\hat{{\varvec{\theta }}}_{MAP}\) can be computed by solving the dual problem (Bauschke and Combettes 2011, Chap. 19)

$$\begin{aligned} {\varvec{\psi }}^{*} = \mathop {\hbox {argmin}}\limits _{{\varvec{\psi }}\in \mathbb {R}^{p}} g_{1}^{*}\left( -L^{T}{\varvec{\psi }}\right) +g_{2}^{*}({\varvec{\psi }}), \end{aligned}$$
(14)

and setting \(\hat{{\varvec{\theta }}}_{MAP} = \nabla g_{1}^{*}(-L^{T}{\varvec{\psi }}^{*}).\) This p-dimensional problem can be solved iteratively with a forward–backward algorithm \({\varvec{\psi }}^{k+1} ={\text {prox}}^{\lambda _{n}}_{g^{*}_{2}}({\varvec{\psi }}^{k} - \lambda _{n} \nabla g^{*}_{1}(-L^{T}{\varvec{\psi }}^{k}))\) that can also be accelerated to converge with rate \(O(1/k^{2}),\) and where we note that the proximity mapping of \(g_{2}^{*}\) is typically evaluated by using Property 3, and that the strong convexity of \(g_{1}\) implies Lipschitz differentiability of \(g_{1}^{*}.\) Computing \(\hat{{\varvec{\theta }}}_{MAP}\) via (14) can lead to important computational savings, in particular if \(p \ll n\) or if \(g_{2}\) is separable and has a proximity mapping that can be computed in parallel for each element of \({\varvec{\theta }}\) (this is generally not possible for \(g_{2} \circ L\)). We refer the reader to Komodakis and Pesquet (2014) for an overview of recent dual and primal–dual algorithms and guidelines for parallel implementations.

Another important proximal optimisation method is the Douglas–Rachford splitting algorithm given by

$$\begin{aligned}&{\varvec{\theta }}^{k+\frac{1}{2}} = {\text {prox}}_{g_{1}}^{\lambda }\left( {\varvec{\theta }}^{k}\right) ,\nonumber \\&{\varvec{\theta }}^{k+1} = {\varvec{\theta }}^{k} - {\varvec{\theta }}^{k+\frac{1}{2}} + {\text {prox}}_{g_{2}}^{\lambda }\left( 2{\varvec{\theta }}^{k+\frac{1}{2}} - {\varvec{\theta }}^{k}\right) . \end{aligned}$$
(15)

From a theoretical viewpoint this algorithm is more general than the forward–backward algorithm because it does not require \(g_{1}\) or \(g_{2}\) to be continuously differentiable. However, its practical application is limited to problems for which both \(g_{1}\) and \(g_{2}\) have efficient proximity mappings. Similarly to the forward–backward algorithm, (15) includes many proximal algorithms that have been proposed in the literature for specific models, and can also be interpreted as an application of the proximal point algorithm.
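A direct transcription of (15) only requires the two proximity mappings as black boxes (our generic sketch; in practice the minimiser of \(g_{1}+g_{2}\) is read off as the \(g_{1}\)-prox of the limiting point):

```python
import numpy as np

def douglas_rachford(prox_g1, prox_g2, theta0, lam=1.0, n_iter=500):
    """Iteration (15); prox_g1(theta, lam) and prox_g2(theta, lam)
    evaluate the corresponding proximity mappings."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iter):
        half = prox_g1(theta, lam)
        theta = theta - half + prox_g2(2.0 * half - theta, lam)
    return prox_g1(theta, lam)    # minimiser of g1 + g2 at convergence
```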

The proximal method that is arguably most widely used in Bayesian inference is the alternating direction method of multipliers (ADMM), which operates by formulating (11) as a constrained optimisation problem

$$\begin{aligned} \begin{aligned}&\mathop {\hbox {argmin}}\limits _{{\varvec{\theta }}\in \mathbb {R}^{n},\,\mathbf {z}\in \mathbb {R}^{n}} g_{1}({\varvec{\theta }}) + g_{2}(\mathbf {z})\\&\mathrm{subject\, to}\quad {\varvec{\theta }}= \mathbf {z}, \end{aligned} \end{aligned}$$
(16)

and then using augmented Lagrangian techniques to express (16) as an unconstrained saddle point problem with saddle function \(g_{1}({\varvec{\theta }}) + g_{2}(\mathbf {z}) + \lambda {\varvec{\varphi }}^{T}({\varvec{\theta }}-\mathbf {z}) + \Vert {\varvec{\theta }}-\mathbf {z}\Vert ^{2}/{2\lambda }\) (Boyd et al. 2011). ADMM solves this problem with the iteration

$$\begin{aligned}&{\varvec{\theta }}^{k+1} = {\text {prox}}_{g_{1}}^{\lambda }\left( \mathbf {z}^{k} - {\varvec{\varphi }}^{k}\right) ,\nonumber \\&\mathbf {z}^{k+1} = {\text {prox}}_{g_{2}}^{\lambda }\left( {\varvec{\theta }}^{k+1} + {\varvec{\varphi }}^{k}\right) ,\\&{\varvec{\varphi }}^{k+1} = {\varvec{\varphi }}^{k} + {\varvec{\theta }}^{k+1} - \mathbf {z}^{k+1},\nonumber \end{aligned}$$
(17)

that also involves the proximity mappings of \(g_{1}\) and \(g_{2}.\) This basic ADMM iteration can be tailored to specific models in many ways (e.g., to exploit decompositions of the form \(g_{1} = \tilde{g}_{1} \circ L_{1}\) and \(g_{2} = \tilde{g}_{2} \circ L_{2}\) so that proximal updates can be performed in parallel for all components of \({\varvec{\theta }},\, \mathbf {z}\) and \({\varvec{\varphi }}\)). Interestingly, ADMM can be interpreted as an application of the Douglas–Rachford algorithm to the dual of (16), and is therefore also a special case of the proximal point algorithm. For more details about the ADMM algorithm, see the recent tutorial by Boyd et al. (2011).
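The basic iteration (17) is similarly compact. For instance, with \(g_{1}({\varvec{\theta }}) = \Vert \mathbf {y}- A{\varvec{\theta }}\Vert ^{2}/2\) and \(g_{2} = \mu \Vert \cdot \Vert _{1},\) the \(g_{1}\)-prox is a ridge-type linear solve and the \(g_{2}\)-prox a soft-thresholding, as in the generic sketch below (again ours, not the authors’ code):

```python
import numpy as np

def admm_l1(A, y, mu, lam=1.0, n_iter=300):
    """ADMM iteration (17) for argmin ||y - A x||^2/2 + mu ||x||_1
    under the consensus constraint x = z."""
    n = A.shape[1]
    # prox_{g1}^lam(v) solves the linear system (lam A^T A + I) x = lam A^T y + v
    M = np.linalg.inv(lam * (A.T @ A) + np.eye(n))
    lam_Aty = lam * (A.T @ y)
    x = z = phi = np.zeros(n)
    for _ in range(n_iter):
        x = M @ (lam_Aty + z - phi)                             # prox_{g1}
        u = x + phi
        z = np.sign(u) * np.maximum(np.abs(u) - lam * mu, 0.0)  # prox_{g2}
        phi = phi + x - z                                       # dual update
    return z
```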

Furthermore, an important characteristic of proximal optimisation algorithms is that they can be massively parallelised to take advantage of parallel computer architectures. Suppose for instance that g admits the decomposition \(g({\varvec{\theta }}) = \sum \nolimits _{m = 1}^{M}g_{m}(L_{m} {\varvec{\theta }})\) with \(g_{m} \in \varGamma _{0}(\mathbb {R}^{p_{m}})\) and \(L_{m} \in \mathbb {R}^{p_{m} \times n}\) such that the proximity mappings of \(g_{m}\) are easy to compute and \(Q = \sum \nolimits _{m = 1}^{M} L_{m}^{T} L_{m}\) is invertible. Then, in a manner akin to (16), we express (8) as

$$\begin{aligned} \begin{aligned}&\mathop {\hbox {argmin}}\limits _{\mathbf {z}_{1} \in \mathbb {R}^{p_{1}}, \ldots ,\mathbf {z}_{M} \in \mathbb {R}^{p_{M}}} \sum \limits _{m = 1}^{M} g_{m}\left( \mathbf {z}_{m}\right) \\&\mathrm{subject\,to}\quad \mathbf {z}_{m} = L_{m}{\varvec{\theta }}, \quad \forall m = 1,\ldots ,M, \end{aligned} \end{aligned}$$
(18)

and compute \(\hat{{\varvec{\theta }}}_{MAP}\) with the following iteration

$$\begin{aligned}&{\varvec{\theta }}^{k+1} = Q^{-1} \sum \limits _{m = 1}^{M} L_{m}^{T}\left( \mathbf {z}_{m}^{k} -{\varvec{\varphi }}_{m}^{k}\right) ,\nonumber \\&\mathbf {z}_{m}^{k+1} = {\text {prox}}^{\lambda }_{g_{m}}\left( L_{m}{\varvec{\theta }}^{k+1} - {\varvec{\varphi }}_{m}^{k}\right) , \quad \forall m = 1,\ldots ,M,\nonumber \\&{\varvec{\varphi }}_{m}^{k+1} = {\varvec{\varphi }}_{m}^{k} + L_{m}{\varvec{\theta }}^{k+1} - \mathbf {z}_{m}^{k+1}, \quad \forall m = 1,\ldots ,M, \end{aligned}$$
(19)

that can be parallelised with factor M at a coarse level (e.g., on a multi-processor system). Further parallelisation may be possible at a finer scale (e.g., on a vectorial processor such as GPU or FPGA) by taking advantage of the structure of \({\text {prox}}^{\lambda }_{g_{m}}\) or by using specialised algorithms. This algorithm, known as the simultaneous direction method of multipliers, is also closely related to the ADMM, Douglas–Rachford and proximal point algorithms. Notice that splitting g not only allows the exploitation of parallel computer architectures, but may also significantly simplify the computation of proximity mappings; often \({\text {prox}}^{\lambda }_{g_{m}}\) has a closed-form expression. Lastly, it is worth mentioning that there are other modern proximal optimisation algorithms that can be massively parallelised, for example the generalised forward–backward algorithm (Raguet et al. 2013), the parallel proximal algorithms (Combettes and Pesquet 2008; Pesquet and Pustelnik 2012), and the parallel primal–dual algorithm (Combettes and Pesquet 2012).

Finally, main current topics of research in proximal optimisation include theory and methodology for: (1) randomised and stochastic algorithms that operate with estimators of gradients and proximity mappings to reduce computational complexity and allow for errors in the update rules, (2) adaptive and variable metric algorithms (e.g., Riemannian and Newton-type) that exploit the model’s geometry to improve convergence speed, and (3) proximal methods for non-convex problems. We anticipate that in the future new and stronger connections will emerge between proximal optimisation and stochastic simulation, in particular through developments in stochastic optimisation and high-dimensional MCMC sampling. For example, one connection is through the integration of modern stochastic convex optimisation and Markovian stochastic approximation (Combettes and Pesquet 2014; Andrieu et al. 2015), and of proximal optimisation and high-dimensional MCMC sampling (Pereyra 2015).

4.2 Convex relaxations

Modern proximal optimisation was greatly motivated by important theoretical results on the recovery of partially-observed sparse vectors and low-rank matrices through convex minimisation (Candès et al. 2006; Candès and Tao 2009) and on compressive sensing (Candès and Wakin 2008). A key idea underlying these works is that of approximating a combinatorial optimisation problem, whose solution is NP-hard, with a “relaxed” convex problem that is computationally tractable, and whose solution is in some sense close to the solution of the original problem. Reciprocally, the development of modern convex optimisation has in turn generated much interest in log-concave models, convex regularisers, and “convexifications” (i.e., convex relaxations for intractable or poorly tractable models) for statistical inference problems involving high-dimensionality, large datasets and computing time constraints (Chandrasekaran et al. 2012; Chandrasekaran and Jordan 2013).

4.3 Illustrative example

For illustration, we show an application of proximal optimisation to Bayesian image resolution enhancement. The goal is to recover a high-resolution image \({\varvec{\theta }}\in \mathbb {R}^{n}\) from a blurred and noisy observed image \(\mathbf {y}\sim \mathcal {N}(H{\varvec{\theta }},\,\sigma ^{2}\varvec{I}_{n}),\) where \(H \in \mathbb {R}^{n\times n}\) is a linear operator representing the blur point spread function of the low resolution acquisition system and \(\sigma ^{2}\) is the system’s noise power. This inverse problem is ill-posed, a difficulty that Bayesian image processing methods address by exploiting prior knowledge about \({\varvec{\theta }}.\) Here we use the following hierarchical Bayesian model (Oliveira et al. 2009)

$$\begin{aligned}&f(\mathbf {y}|{\varvec{\theta }}) = \left( 2\pi \sigma ^{2}\right) ^{-n/2}\exp \left\{ -\Vert \mathbf {y}- H{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2}\right\} ,\nonumber \\&\pi ({\varvec{\theta }}|\alpha ) \propto \alpha ^{-n}\exp {\left( -\alpha \left\| \nabla _{d} {\varvec{\theta }}\right\| _{1{-}2}\right) },\nonumber \\&\pi (\alpha ) = e^{-\alpha }\varvec{1}_{\mathbb {R}^{+}}(\alpha ), \end{aligned}$$
(20)

where \(\pi ({\varvec{\theta }}|\alpha )\) is the (improper) total-variation Markov random field, \(\Vert \cdot \Vert _{1{-}2}\) denotes the composite \(\ell _{1} - \ell _{2}\) norm and \(\nabla _{d}\) is the discrete gradient operator that computes the vertical and horizontal differences between neighbour image pixels. This prior is log-concave and models the fact that differences between neighbouring image pixels are usually very small but occasionally take large values; it is arguably the most widely used prior in modern statistical image processing. The values of H and \(\sigma ^{2}\) are typically determined during the system’s calibration process and are here assumed known.

We compute the MAP estimator of \({\varvec{\theta }}\) associated with the marginal posterior \(\pi ({\varvec{\theta }}|\mathbf {y}) = \int \nolimits _{0}^{\infty } \pi ({\varvec{\theta }},\,\alpha | \mathbf {y}) \mathrm{d}\alpha ,\) which is unimodal but not log-concave,

$$\begin{aligned} \hat{{\varvec{\theta }}}_{MAP}= & {} \mathop {\hbox {argmin}}\limits _{{\varvec{\theta }}\in \mathbb {R}^{n}} \Vert \mathbf {y}- H{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2} \nonumber \\&+\, (n+1)\log \left( \left\| \nabla _{d} {\varvec{\theta }}\right\| _{1{-}2} + 1\right) . \end{aligned}$$
(21)

Problem (21) is not convex, but can nevertheless be solved efficiently with proximal algorithms by using a majorisation–minimisation strategy. To be precise, starting from some initial condition \({\varvec{\theta }}^{(0)},\) e.g., \({\varvec{\theta }}^{(0)} = \mathbf {y},\) we iteratively minimise the following sequence of strictly convex majorants (Oliveira et al. 2009)

$$\begin{aligned}&{\varvec{\theta }}^{(t+1)} =\mathop {\hbox {argmin}}\limits _{{\varvec{\theta }}\in \mathbb {R}^{n}} \Vert \mathbf {y}- H{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2} + \alpha _{\mathrm{eff}}^{(t)}\left\| \nabla _{d} {\varvec{\theta }}\right\| _{1{-}2},\nonumber \\&\mathrm{with}\quad \alpha _{\mathrm{eff}}^{(t)} = (n+1)\Big / \left( \left\| \nabla _{d} {\varvec{\theta }}^{(t)}\right\| _{1{-}2} + 1\right) . \end{aligned}$$
(22)

Iteration (22) involves a convex subproblem that can easily be solved using most modern proximal optimisation techniques. For example, here we use the state-of-the-art ADMM algorithm SALSA (Afonso et al. 2011) implemented with \(g_{1} ({\varvec{\theta }}) = \Vert \mathbf {y}- H{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2},\,g_{2}(\mathbf {u}) = \alpha _{\mathrm{eff}}^{(t)}\Vert \nabla _{d} \mathbf {u}\Vert _{1{-}2},\) and the constraint \({\varvec{\theta }}= \mathbf {u}\) (though other modern algorithms could equally have been used, e.g., Pesquet and Pustelnik 2012; Combettes and Pesquet 2012; Raguet et al. 2013). To compute the proximity mapping of \(g_{1}\) we use the fact that H is block-circulant to compute matrix products and pseudo-inverses with the FFT algorithm. We compute the proximity mapping of \(g_{2}\) with a highly parallelised implementation of the specialised algorithm of Chambolle (2004).
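For completeness, the outer majorisation–minimisation loop (22) is sketched below; `tv_deblur` stands for any convex TV-regularised solver such as SALSA and is a placeholder rather than an actual API, while `tv_norm` computes \(\Vert \nabla _{d}{\varvec{\theta }}\Vert _{1{-}2}\) by forward differences.

```python
import numpy as np

def tv_norm(img):
    """Composite l1-l2 norm of the discrete gradient: sum over pixels
    of the Euclidean norm of (horizontal, vertical) differences."""
    dh = np.diff(img, axis=1, append=img[:, -1:])
    dv = np.diff(img, axis=0, append=img[-1:, :])
    return np.sum(np.sqrt(dh ** 2 + dv ** 2))

def mm_map_estimate(y, tv_deblur, n_mm=4):
    """Majorisation-minimisation loop (22) for the non-convex MAP
    problem (21): each pass solves a convex TV subproblem with the
    effective regularisation parameter given by the current majorant."""
    theta = y.copy()                       # initialisation theta^(0) = y
    n = theta.size
    for _ in range(n_mm):
        alpha_eff = (n + 1) / (tv_norm(theta) + 1.0)
        theta = tv_deblur(y, alpha_eff)    # convex subproblem, e.g. SALSA
    return theta
```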

Fig. 3 Observed blurred noisy image \(\mathbf {y}\)

Fig. 4 Resolution enhanced image \(\hat{{\varvec{\theta }}}_{MAP}\) obtained by solving (21) with the majorisation–minimisation strategy (22)

Fig. 5 Widths of pixel-wise \(90\,\%\) marginal credibility intervals estimated with the proximal MCMC algorithm of Pereyra (2015)

Figure 3 presents a blurred and noisy observation \(\mathbf {y}\) of the popular “boats” image of size \(512 \times 512\) pixels, generated with a uniform \(9\times 9\) blur and a noise power of \(\sigma ^{2} = 0.5^{2}\) (blurred-signal-to-noise ratio \(\mathrm{BSNR} = 10\log _{10}\{\Vert H{{\varvec{\theta }}}_{0}\Vert ^{2}_{2}/\sigma ^{2}\} = 40\) dB). Figure 4 below shows the MAP estimate \(\hat{{\varvec{\theta }}}_{MAP}\) obtained by solving (21) using four iterations of (22) and a total of 51 ADMM iterations. We observe that this resolution enhancement process has produced a remarkably sharp image with very noticeable fine detail. Moreover, Fig. 5 shows the magnitude of the marginal \(90\,\%\) credibility regions for each pixel, as measured by the distance between the 5 and \(95\,\%\) quantile estimates. These estimates were computed using the proximal Metropolis-adjusted Langevin algorithm (Pereyra 2015), which is appropriate for high-dimensional densities that are not continuously differentiable. We observe in Fig. 5 that the uncertainty is mainly concentrated at the contours and object boundaries, revealing that the model is able to accurately detect the presence of sharp edges in the image, but with some uncertainty about their exact location. Finally, Fig. 6 shows the convergence of the estimates \({\varvec{\theta }}^{(t,k)}\) produced by each ADMM iteration to \(\hat{{\varvec{\theta }}}_{MAP}\) (as measured by the mean squared error \(\Vert {\varvec{\theta }}^{(t,k)} - \hat{{\varvec{\theta }}}_{MAP} \Vert _{2}^{2}\)). Notice that computing \(\hat{{\varvec{\theta }}}_{MAP}\) only required 10 s (experiment conducted on an Apple Macbook Pro computer running Matlab 2013; a C++ implementation would certainly be even faster). This is remarkably fast given the high dimensionality of the problem (\(n = 262{,}144\)). The computation of the credibility regions by MCMC sampling (20,000 samples with a thinning factor of 1000 to reduce the algorithm’s memory footprint) required 75 h.

5 Discussion

5.1 Bayesian computation in the era of data science

Is there a revolution taking place right now and have we missed the train, standing on the platform, only concerned with small-print on the train schedules—apart, that is, from the obvious but not-so-new requirement to handle massive datasets (and the mistakes that come with them)?!

Fig. 6 Convergence of the estimates \({\varvec{\theta }}^{(t,k)}\) to \(\hat{{\varvec{\theta }}}_{MAP}\) versus computing time (s)

As with other areas of statistical science, the Bayesian computation community has to decide whether data science is an opportunity or a threat. Inevitably, if we do not treat it as an opportunity, it will become a threat. Thanks to the ubiquity of “big data” (a phrase that is over-hyped and mostly useful for attracting research funding, but that also reflects reality to at least some extent), a new, potentially multi-disciplinary field of data science is rapidly opening up. This field is attracting huge material resources, and will absorb much human talent. Statistical science has to be a part of this, for its own survival, but also for the sake of society. As Tim Harford has cogently argued (Harford 2014):

Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives [...]. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing [...] or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naïve in data sets where spurious patterns vastly outnumber genuine discoveries.

“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.

It is a mistake to think that Bayes has no part to play in these developments, but more of us need to get more involved, and learn new tools, as in the way the Consensus Monte Carlo algorithm (Scott et al. 2013) exploits the Hadoop environment (White 2012) and the MapReduce programming model (Dean and Ghemawat 2008). Another direction that can prevent a potential schism between Bayesian modelling and highly complex models is to aim for modularity and local learning, that is, to abandon the goal of modelling big universes in favour of analysing a series of small worlds, in spite of the loss of coherence, and hence the compromise to the Bayesian paradigm, that this entails. The curious case of the cut models presented in Plummer (2015) is an illustration of the potential for developing partial-information Bayesian inference tools where “small is beautiful” because this is the only viable solution.

5.2 Do we care enough about applications?

Bayesian computation began in order to answer rather practical problems—how can we perform a Bayesian analysis of these data using this model?—or the corresponding meta-problems—how can Bayesian analysis be performed generally and reliably for this class of models? The focus was applied methodology (although since the methods were new, they tended to be published in premier theory/methodology journals). Because the research community wanted to understand (the advantages, performance and limitations of) the methods they were advocating, more theoretical work started to be conducted, and, for example, many probabilists were attracted to study the Markov chains that MCMC methodologists created. The centre of mass of research activity drifted away from the original motivations, just as has happened in other areas of mathematically-rigorous computation.

At the same time, those working with data became more ambitious with regard to the scale of data, the complexity of modelling and the sophistication of analysis, all factors that have in principle (and often in fact) stimulated new developments in Bayesian computation. But to a large extent this is a rich, self-stimulating and self-supporting area of research; new applications may or may not need new computational techniques, but new techniques don’t seem to need applications to justify themselves. It is apposite to ask to what extent cutting-edge computational methodology research is really delivering answers to the questions that application domains are posing, and to what extent it is successfully answering those real questions.

We may not be unanimous about answers to these questions, except we can probably all agree they are “not entirely”. We will also disagree about how much this matters, but again there may be something to agree about: we have failed if methodological innovations disconnect completely from applications. Legitimate differences in research goals partially explain the trend in this direction, but it is fair to say that there is a big communication problem between the computational statistics community and many of the communities where Bayesian computational methods are applied. Unfortunately, people in these communities do not always keep up with the state of the art in computational statistics. At the same time, statisticians are often not aware of important developments arising in other fields. (ABC is a good illustration: it took more than 5 years of development within the population genetics community before statisticians became aware that the technique existed, and a few more years before they realised this was proper Bayesian inference applied to approximate models.) We can perhaps blame the fact that there are not enough people working at the interface of the different communities, but life at the interface is not easy, because multidisciplinary and interdisciplinary research is often seen as “marginal” by both communities and is thus difficult to publish, communicate, etc. Then there are of course problems in dissemination, related to the different writing styles, journals, computing languages, software, etc. of each community.

We strongly encourage those developing new techniques always to find a way to disseminate them in such a way that at least somebody else could use them, preferably someone without the ability to have invented the technique for themselves!—and advocate, of course, that successful dissemination be properly rewarded in our career structures.

In a somewhat parallel path, we have seen over the past decades the emergence of new languages and meta-languages intended to handle complexity both of problems and of solutions for a wider audience of users. BUGS (Lunn et al. 2010) is the archetypal example of such languages and it has been successful to the extent that a large proportion of its users have a fairly limited statistical background and often even less of a computational background. However, the population of BUGS users and sympathisers is tiny compared to that of SAS or other corporate statistical systems. In this respect, we have failed to disseminate concepts like Bayesian analysis and wonderful tools like MCMC algorithms, because most people are unable to turn them into code by themselves. (Perusing one of the numerous statistics and machine-learning on-line forums like Cross Validated quickly exposes the methodological gap between academics and the masses!) It is unclear how novel programming developments like STAN (Stan Development Team 2014) are going to modify this picture, in that they still assume a decent understanding of both modelling and simulation issues. In that respect, network-based approaches such as those covered by BUGS sound more promising towards “modelling locally to learn globally”. Similarly, ABC software is either too specific, like DIYABC (Cornuet et al. 2008) which addresses only population genetics questions, or too dependent on the ability of the modeller to program simulated outcomes from the model under study.

5.3 Anticipating the future

In which of the areas we discuss do we expect a particular emphasis of effort, or significant progress, or do we see particular needs for new efforts or new directions?

One expectation is that in the future computational methodologies will be more flexible and malleable. Over the past 25 years Bayesian modelling and inference techniques have been applied successfully to thousands of problems across a wide range of application domains. Each application brings its own constraints in terms of model dimensionality and complexity, data, inferences, accuracy and computing times. These constraints also vary significantly within specific applications. For example, in hyperspectral remote sensing, when a new Bayesian model is introduced it is often first explored and validated by MCMC sampling, then approximated with a variational Bayes method, and then approximated again so that it can be applied to gigabyte-large datasets by using optimisation techniques. Similarly, an interesting result revealed by a fast inference technique can be analysed more deeply with more reliable and accurate methods. Therefore we expect that in the future the main computational methodologies will become more adaptable and that the boundaries between them will be less well defined, with many algorithms developed that combine simulation, variational approximations and optimisation. These will be able to handle a wide spectrum of models, degrees of accuracy and computing times, as well as models that have some parts that are simple but high-dimensional and others that are more complex but only involve low-dimensional components. This can be achieved by using approximations and optimisation to improve stochastic sampling, by using simulation within deterministic algorithms to handle specific parts of the model that are difficult to compute analytically, or in completely new and original ways.

We also anticipate that computational methodologies will continue to be challenged by larger and larger datasets. There is of course a threat that the whole field turns into a library of machine-learning techniques, with limited validation on reference learning sets and a quick turnover of methods, which would both impoverish the field and fail to reach a general audience of practitioners. We must retain a sense of the stochastic elements in data collection, data analysis, and inference, recognising uncertainty in data and models, to preserve the inductive strength of data science—seeing beyond the data we have to what it might have been, what it might be next time, and where it came from.