# Bayesian computation: a summary of the current state, and samples backwards and forwards

DOI: 10.1007/s11222-015-9574-5

- Cite this article as:
- Green, P.J., Łatuszyński, K., Pereyra, M. et al. Stat Comput (2015) 25: 835. doi:10.1007/s11222-015-9574-5


## Abstract

Recent decades have seen enormous improvements in computational inference for statistical models; there have been competitive continual enhancements in a wide range of computational tools. In Bayesian inference, first and foremost, MCMC techniques have continued to evolve, moving from random walk proposals to Langevin drift, to Hamiltonian Monte Carlo, and so on, with both theoretical and algorithmic innovations opening new opportunities to practitioners. However, this impressive evolution in capacity is confronted by an even steeper increase in the complexity of the datasets to be addressed. The difficulties of modelling and then handling ever more complex datasets most likely call for a new type of tool for computational inference that dramatically reduces the dimension and size of the raw data while capturing its essential aspects. Approximate models and algorithms may thus be at the core of the next computational revolution.

### Keywords

Bayesian analysis, MCMC algorithms, ABC techniques, Optimisation

## 1 Introduction

One may reasonably balk at the terms “computational statistics” and “Bayesian computation” since, from its very start, statistics has always involved some computational step to extract information, something manageable like an estimator or a prediction, from raw data. This necessarily incomplete and unavoidably biased review of the recent past, current state, and immediate future of algorithms for Bayesian inference thus first requires us to explain what we mean by computation in a statistical context, before turning to what we perceive as medium term solutions and possible dead ends.

Computations are an issue in statistics whenever processing a dataset becomes a difficulty, a liability, or even an impossibility. Obviously, the computational challenge varies according to the time when it is faced: what was an issue in the nineteenth century is most likely not so any longer (take for instance the derivation of the moment estimates of a mixture of two normal distributions so painstakingly set out by Pearson 1894 for estimating the ratio of “forehead” breadth–body length on a dataset of 1000 crabs, or the intense algebraic derivations found in the analysis of variance of the 1950s and 1960s, Searle et al. 1992).

The introduction of simulation tools in the 1940s followed hard on the heels of the invention of the computer and certainly contributed an impetus towards faster and better computers, at least in the first decade of this revolution. This shows that these tools were both needed, and unavailable without electronic calculators. The introduction of Markov chain Monte Carlo is harder to pin down as some partial versions can be traced all the way back to 1944–1945 and the Manhattan project at Los Alamos (Metropolis 1987). It is surprisingly much later, i.e., only by the early 1990s, that such methods became part of the Bayesian toolbox, that is, some time after the devising of other computer-dependent tools like the bootstrap or the EM algorithm, and despite the availability of personal computers that considerably eased programming and experimenting (Robert and Casella 2011). It is presumably pointless to try to attribute this delay to a definite cause but a certain lack of probabilistic culture within the statistics community is probably partly to blame.

What makes this time-lag in MCMC methods becoming assimilated into the statistics community even more surprising is the fact that Bayesian inference having a significant role in statistical practice was really on hold pending the discovery of flexible computational tools that (implicitly or explicitly) delivered values for the medium–high-dimensional integrals that underpin the calculation of posterior distributions, in all but toy problems where conjugacy provided explicit answers. In fact, until Bayesians discovered MCMC, the only computational methodology that seemed to offer much chance of making Bayesian statistics practical was the portfolio of quadrature methods developed under Adrian Smith’s leadership at Nottingham (Naylor and Smith 1982; Smith et al. 1985, 1987).

The very first article in the first issue of *Statistics and Computing*, whose quarter-century we celebrate in this special issue, was (to the journal’s credit!) on Bayesian analysis, and was precisely in this direction of using clever quadrature methods to approach moderately high-dimensional posterior analysis (Dellaportas and Wright 1991). By the next (second) issue, sampling-based methods had started to appear, with three papers out of five in the issue on or related to Gibbs sampling (Verdinelli and Wasserman 1991; Carlin and Gelfand 1991; Wakefield et al. 1991).

Now, reflecting upon the evolution of MCMC methods over the 25 or so years they have been at the forefront of Bayesian inference, the focus has evolved a long way, from hierarchical models that extended the linear, mixed and generalised linear models (Albert 1988; Carlin et al. 1992; Bennett et al. 1996) which were initially the focus, and graphical models that stemmed from image analysis (Geman and Geman 1984) and artificial intelligence, to dynamical models driven by ODEs (Wilkinson 2011b) and diffusions (Roberts and Stramer 2001; Dellaportas et al. 2004; Beskos et al. 2006), hidden trees (Larget and Simon 1999; Huelsenbeck and Ronquist 2001; Chipman et al. 2008; Aldous et al. 2008) and graphs, along with decision making in highly complex graphical models. While research on MCMC theory and methodology is still active and continually branching (Papaspiliopoulos et al. 2007; Andrieu and Roberts 2009; Łatuszyński et al. 2011; Douc and Robert 2011), progressively incorporating the capacities of parallel processors and GPUs (Lee et al. 2009; Jacob et al. 2011; Strid 2010; Suchard et al. 2010; Scott et al. 2013; Calderhead 2014), we wonder if we are not currently facing a new era where those methods are no longer appropriate to undertake the analysis of new models, and of new formulations where models are no longer completely defined. We indeed believe that imprecise models, incomplete information and summarised data will become, if they are not already, a central aspect of statistical analysis, due to the massive influx of data and the need to provide non-statisticians with efficient tools. This is why we also cover in this survey the notions of approximate Bayesian computation (ABC) and comment on the use of optimisation tools.

The plan of the paper is that in Sects. 2 and 3 we discuss recent progress and current issues in Markov chain Monte Carlo and ABC, respectively. In Sect. 4, we highlight some areas of modern optimisation that, through lack of familiarity, are making less impact in the mainstream of Bayesian computation than we think justified. Our Discussion in Sect. 5 raises issues about data science and relevance to applications, and looks to the future.

## 2 MCMC, targeting the posterior

When MCMC techniques were introduced to the mainstream statistical (Bayesian) community in 1990, they were received with skepticism that they could one day become the central tool of Bayesian inference. For instance, despite the assurance provided by the ergodic theorem, many researchers thought at first that the convergence of those algorithms was a mere theoretical anticipation rather than a practical reality, in contrast to traditional Monte Carlo methods, and hence that they could not be trusted to provide “exact” answers. This perspective is obviously obsolete by now, when MCMC output is considered as “exact” as regular Monte Carlo, if possibly less efficient in some settings. Nowadays, MCMC is again attracting more attention (than in the past decade, say, where developments were more about alternatives, some of which are described in the following sections), both because of methodological developments linked to better theoretical tools, for instance in the handling of stochastic processes, and because of new advances in accelerated computing via parallel and cloud computing.

### 2.1 Basics of MCMC

The first observation about the Metropolis–Hastings algorithm is that the flexibility in choosing *q* is a blessing, but also a curse, since the choice determines the performance of the algorithm. Hence a large part of the research on MCMC over the past 30 years (if we arbitrarily set the starting date at Geman and Geman 1984) has been on the choice of the proposal *q* to improve the efficiency of the algorithm, and on characterising its convergence properties. This typically requires gathering or computing additional information about \(\pi \) and we discuss some of the fundamental strategies in subsequent sections. Algorithm 1, and its variants in which variables are updated singly or in blocks according to some schedule, remains a keystone in standard use of MCMC methodology, even though the newer Hamiltonian Monte Carlo (HMC) approach (see Sect. 2.3) may sooner or later come to replace it. While there is nothing intrinsically unique to the nature of this algorithm, or optimal in its convergence properties (other than the result of Peskun 1973 on the optimality of the acceptance ratio), attempts to bypass Metropolis–Hastings are few and limited. For instance, the birth-and-death process developed by Stephens (2000) used a continuous time jump process to explore a set of models, only to be later shown (Cappé et al. 2002) to be equivalent to the (Metropolis–Hastings) reversible jump approach of Green (1995).

Another aspect of the generic Metropolis–Hastings algorithm that became central more recently is that while the accept–reject step does overcome the need to know the normalising constant, it still requires evaluating \(\pi ,\) even if unnormalised, and this may be too expensive to compute or even intractable for complicated models and large datasets. Much recent research effort has been devoted to the design and understanding of appropriate modifications that use estimators or approximations of \(\pi \) instead and we will take the opportunity to summarise some of the progress in this direction.
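To make the roles of *q* and of the unnormalised \(\pi \) concrete, the following is a minimal sketch of one possible instance of Metropolis–Hastings, assuming a one-dimensional target and a symmetric Gaussian random walk proposal. The function name `metropolis_hastings` and its parameters (`step`, `seed`) are illustrative choices, not notation from the paper.

```python
import math
import random


def metropolis_hastings(log_target, theta0, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings on the real line (illustrative sketch).

    log_target: log of the target density, known only up to an additive constant.
    step: standard deviation of the Gaussian proposal (an assumed tuning value).
    """
    rng = random.Random(seed)
    theta = theta0
    log_p = log_target(theta)
    chain, n_accept = [], 0
    for _ in range(n_iter):
        # Symmetric proposal q(.|theta) = N(theta, step^2), so the q-ratio cancels.
        prop = theta + step * rng.gauss(0.0, 1.0)
        log_p_prop = log_target(prop)
        # Accept with probability min(1, pi(prop) / pi(theta)); only the ratio
        # of unnormalised densities is needed.
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            theta, log_p = prop, log_p_prop
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iter
```

For instance, `metropolis_hastings(lambda t: -0.5 * t * t, 0.0, 20000, step=2.4)` targets a standard normal through its unnormalised log-density. Changing `step` changes the acceptance rate and the mixing, which is the tuning issue taken up in Sect. 2.4.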

### 2.2 MALA and manifold MALA

One natural way to exploit gradient information about \(\pi \) is to simulate a Langevin diffusion that has \(\pi \) as its stationary distribution, approximated in practice by an Euler discretisation with step size *h*. There is a delicate tradeoff between accuracy of the approximation improving as \(h \rightarrow 0\) and sampling efficiency [as measured, e.g., by the effective sample size (ESS)] improving when *h* increases. This solution was soon followed by its Metropolised version (Rossky et al. 1978) that uses the Euler approximation of (2) to produce a proposal in the Metropolis–Hastings algorithm 1, by letting \(q (\cdot |\theta ^{(n-1)}) := \theta ^{(n-1)} + {h \over 2 } \nabla \log \pi (\theta ^{(n-1)}) + h^{1/2}N(0,\,I_{d\times d}).\) While in the probability community Langevin diffusions and their equilibrium distributions had also been around for some time (Kent 1978), it was the Roberts and Tweedie (1996a) paper (motivated by the Besag 1994 comment on Grenander and Miller 1994) that brought the approach to the centre of interest of the computational statistics community and sparked systematic study, development and applications of Metropolis adjusted Langevin algorithms (hence MALA) and their cousins.

There is a large body of empirical evidence that at the extra price of computing the gradient, MALA algorithms typically provide a substantial speed-up in convergence on certain types of problems. However, for very light-tailed distributions the drift term may grow to infinity and cause additional instability. More precisely, for distributions with sufficiently smooth contours, MALA is geometrically ergodic (cf. Roberts and Rosenthal 2004) if the tails of \(\pi \) decay as \(\exp \{-|\theta |^{\beta }\}\) with \(\beta \in [1,\,2],\) while the random walk Metropolis algorithm is geometrically ergodic for all \(\beta \ge 1\) (Roberts and Tweedie 1996a; Mengersen and Tweedie 1996). The lack of geometric ergodicity has been precisely quantified by Bou-Rabee and Hairer (2012).
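The MALA proposal \(\theta + {h \over 2}\nabla \log \pi (\theta ) + h^{1/2}N(0,\,I)\) is not symmetric, so the proposal densities do not cancel in the acceptance ratio. The sketch below illustrates this in one dimension, assuming the user supplies both \(\log \pi \) (up to a constant) and its gradient. The name `mala` and the default `h` are hypothetical.

```python
import math
import random


def mala(log_target, grad_log_target, theta0, n_iter, h=1.0, seed=0):
    """One-dimensional Metropolis adjusted Langevin algorithm (illustrative sketch)."""
    rng = random.Random(seed)

    def log_q(to, frm):
        # Log density, up to a constant, of N(frm + h/2 * grad(frm), h) at `to`.
        mean = frm + 0.5 * h * grad_log_target(frm)
        return -((to - mean) ** 2) / (2.0 * h)

    theta = theta0
    chain, n_accept = [], 0
    for _ in range(n_iter):
        # Euler step of the Langevin diffusion, used as a proposal.
        mean = theta + 0.5 * h * grad_log_target(theta)
        prop = mean + math.sqrt(h) * rng.gauss(0.0, 1.0)
        # Metropolis-Hastings correction with the asymmetric proposal ratio.
        log_ratio = (log_target(prop) + log_q(theta, prop)
                     - log_target(theta) - log_q(prop, theta))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = prop
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iter
```

With a standard normal target, `mala(lambda t: -0.5 * t * t, lambda t: -t, 0.0, 20000)` can be compared with a random walk sampler at the same cost per iteration plus one gradient evaluation.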

Extensions of MALA include the optimal scaling of the step size *h*, adaptive versions (both discussed in Sect. 2.4), combinations with proximal operators (Pereyra 2015; Schreck et al. 2013), and applications and algorithm development for the infinite-dimensional context (Pillai et al. 2012; Cotter et al. 2013). One particular direction of active research considers a more general version of Eq. (1) with state-dependent drift and diffusion coefficient.

*h* results in an algorithm that is as robust as random walk Metropolis, in the sense that it is geometrically ergodic for targets with tail decay of \(\exp \{- |\theta |^{\beta }\}\) for \(\beta >1\) (see Taylor 2014). A robustified version of such a metric has been introduced in Betancourt (2013) and termed SoftAbs. Here one approximates the absolute value of the eigenspectrum of the Hessian of \(\pi \) with a smooth strictly positive function \(\lambda _{i} \mapsto \lambda _{i} \frac{\exp {\{\alpha \lambda _{i}\}} + \exp {\{-\alpha \lambda _{i}\}}}{\exp {\{\alpha \lambda _{i}\}} - \exp {\{-\alpha \lambda _{i}\}}},\) where \(\alpha \) is a smoothing parameter. The metric stabilises the behaviour of both MMALA and HMC algorithms (discussed in the sequel) in the neighbourhoods where the signature of the Hessian changes.
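The SoftAbs map can be written as \(\lambda \coth (\alpha \lambda ),\) which makes its behaviour easy to check numerically: it stays strictly positive, tends to \(1/\alpha \) as \(\lambda \rightarrow 0,\) and approaches \(|\lambda |\) when \(\alpha |\lambda |\) is large. The helper below is a small sketch of the scalar map only, under the assumption that it is applied to each eigenvalue separately. The name `softabs` and the small-argument cutoff are illustrative.

```python
import math


def softabs(lam, alpha):
    """Smooth, strictly positive surrogate for |lam|: lam * coth(alpha * lam)."""
    x = alpha * lam
    if abs(x) < 1e-8:
        # coth(x) ~ 1/x near zero, so the limit of lam * coth(alpha * lam) is 1/alpha.
        return 1.0 / alpha
    return lam / math.tanh(x)
```

A larger `alpha` gives a closer approximation to the absolute value, at the price of a sharper transition near zero.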

### 2.3 Hamiltonian Monte Carlo

As with many improvements in the literature, starting with the very notion of MCMC, Hamiltonian (or hybrid) Monte Carlo (HMC) stems from Physics (Duane et al. 1987). After a slow emergence into the statistical community (Neal 1999), it is now central in statistical software like STAN (Stan Development Team 2014). For a complete account of this important flavour of MCMC, the reader is referred to Neal (2013), which inspired the description below; see also Betancourt et al. (2014) for a highly mathematical differential-geometric approach to HMC.

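Given a target \(\pi (\theta ),\) where \(\theta \) is interpreted as a position, one considers the Hamiltonian \(H(\theta ,\,p) = -\log \pi (\theta ) + p^\text {T} M^{-1} p / 2,\) with *p* its speed or momentum and *M* the Hamiltonian matrix of \(\pi .\) In more statistical language, HMC creates an auxiliary variable *p* such that moving according to Hamilton’s equations

\[\frac{\mathrm {d}\theta }{\mathrm {d}t} = \frac{\partial H}{\partial p} = M^{-1} p,\qquad \frac{\mathrm {d}p}{\mathrm {d}t} = -\frac{\partial H}{\partial \theta } = \nabla \log \pi (\theta )\]

preserves the joint distribution with density \(\exp \{-H(\theta ,\,p)\}.\) In practice, these equations are solved approximately by a *k*-th order symplectic integrator (Hairer et al. 2006), most commonly the second order *leapfrog approximation* that relies on a small step level \(\varepsilon ,\) updating *p* and \(\theta \) via a modified Euler’s method called the leapfrog that is reversible and, being symplectic, preserves volume as well. This discretised update can be repeated for an arbitrary number of steps.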

When considering the implementation via a Metropolis algorithm, a new value of the momentum *p* is drawn from the pseudo-prior \(\propto \exp \{ -p^\text {T} M^{-1} p / 2 \}\) and it is followed by a Metropolis step, whose proposal is driven by the leapfrog approximation to the Hamiltonian dynamics on \((\theta ,\,p)\) and whose acceptance is governed by the Metropolis acceptance probability. The potential strength of this augmentation (or disintegration) scheme is that the value of \(H(\theta ,\,p)\) hardly changes during the Metropolis move, which means that it is most likely to be accepted and that it may produce a very different value of \(\pi (\theta )\) without modifying the overall acceptance probability. In other words, moving along level sets is almost energy-free, but if the move proceeds for “long enough”, the chain can reach far-away regions of the parameter space, thus avoiding the myopia of standard MCMC algorithms. As explained in Neal (2013), this means that HMC avoids the inefficient random walk behaviour of most Metropolis–Hastings algorithms. What drives the exploration of the different values of \(H(\theta ,\,p)\) is therefore the simulation of the momentum, which makes its calibration both quite influential and delicate (Betancourt et al. 2014) as it depends on the unknown normalising constant of the target. (By calibration, we mean primarily the choice of the time discretisation step \(\varepsilon \) in the leapfrog approximation and of the number *L* of leapfrog leaps, but also the choice of the precision matrix *M*).
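The momentum refreshment, leapfrog trajectory and Metropolis correction described above can be sketched as follows. This is a one-dimensional illustration that assumes the Hamiltonian \(H(\theta ,\,p) = -\log \pi (\theta ) + p^{2}/(2M)\) with a scalar *M*. The function names and the default values of `eps` and `n_steps` (standing for \(\varepsilon \) and *L*) are assumptions made for the example, not recommendations.

```python
import math
import random


def leapfrog(theta, p, grad_log_target, eps, n_steps, mass=1.0):
    """Leapfrog integration of Hamilton's equations (n_steps >= 1)."""
    p = p + 0.5 * eps * grad_log_target(theta)       # initial half step for p
    for i in range(n_steps):
        theta = theta + eps * p / mass               # full step for theta
        if i < n_steps - 1:
            p = p + eps * grad_log_target(theta)     # full step for p
    p = p + 0.5 * eps * grad_log_target(theta)       # final half step for p
    return theta, p


def hmc(log_target, grad_log_target, theta0, n_iter, eps=0.2, n_steps=10,
        mass=1.0, seed=0):
    """One-dimensional HMC with a Metropolis accept-reject step (sketch)."""
    rng = random.Random(seed)

    def hamiltonian(theta, p):
        return -log_target(theta) + 0.5 * p * p / mass

    theta = theta0
    chain, n_accept = [], 0
    for _ in range(n_iter):
        # Refresh the momentum from its pseudo-prior N(0, mass).
        p = rng.gauss(0.0, math.sqrt(mass))
        prop, p_new = leapfrog(theta, p, grad_log_target, eps, n_steps, mass)
        # Accept with probability min(1, exp(H(current) - H(proposed))).
        log_ratio = hamiltonian(theta, p) - hamiltonian(prop, p_new)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = prop
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iter
```

Because the leapfrog scheme nearly conserves *H*, the acceptance rate stays high even though each proposal can land far from the current value. Running a trajectory, flipping the sign of the momentum and running it again returns to the starting point, which is the reversibility property mentioned above.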

### 2.4 Optimal scaling and adaptive MCMC

The convergence of the Metropolis–Hastings algorithm 1 depends crucially on the choice of the proposal distribution *q*, as does the performance of both more complex MCMC and SMC algorithms, that often are hybrids using Metropolis–Hastings as simulation substeps.

Optimising over all implementable *q* appears to be a “disaster problem” due to its infinite-dimensional character, lack of clarity about what is implementable and what is not, and the fact that this optimal *q* must depend in a complex way on the target \(\pi \) to which we have only limited access. In particular MALA provides a specific approach to constructing \(\pi \)-tailored proposals and HMC can be viewed as a combination of Gibbs and special Metropolis moves for an extended target.

In this optimisation context, it is thus reasonable to restrict ourselves to some parametric family of proposals \(q_{\xi },\) or more generally of Markov transition kernels \(P_{\xi },\) where \(\xi \in \varXi \) is a tuning parameter, possibly high-dimensional.

The aim of adaptive Markov chain Monte Carlo is conceptually very simple. One expects that there is a set \(\varXi _{\pi } \subset \varXi \) of good parameters \(\xi \) for which the kernel \(P_{\xi }\) converges quickly to \(\pi ,\) and one allows the algorithm to search for \(\varXi _{\pi }\) “on the fly” and redesign the transition kernel during the simulation as more and more information about \(\pi \) becomes available. Thus an adaptive MCMC algorithm would apply the kernel \(P_{\xi ^{(n)}}\) to sample \(\theta ^{(n)}\) given \(\theta ^{(n-1)},\) where the tuning parameter \(\xi ^{(n)}\) is itself a random variable which may depend on the whole history \(\theta ^{(0)},\ldots ,\theta ^{(n-1)}\) and on \(\xi ^{(n-1)}.\) Adaptive MCMC rests on the hope that the adaptive parameter \(\xi ^{(n)}\) will find \(\varXi _{\pi },\) stay there essentially forever and inherit good convergence properties.

There are at least two fundamental difficulties in executing this strategy in practice. First, standard measures of efficiency of Markovian kernels, like the total variation convergence rate (cf. Meyn and Tweedie 2009; Roberts and Rosenthal 2004), \(L^{2}(\pi )\) spectral gap (Diaconis and Stroock 1991; Roberts 1996; Saloff-Coste 1997; Levin et al. 2009) or asymptotic variance (Peskun 1973; Geyer 1992; Tierney 1998) in the Markov chain central limit theorem will not be available explicitly, and their estimation from a Markov chain trajectory is often an even more challenging task than the underlying MCMC estimation problem itself.

Secondly, when executing an adaptive strategy and trying to improve the transition kernel on the fly, the Markov property of the process is violated, therefore standard theoretical tools do not apply, and establishing validity of the approach becomes significantly more difficult. While the approach has been successfully applied in some very challenging practical problems (Solonen et al. 2012; Richardson et al. 2010; Griffin et al. 2014), there are examples of seemingly reasonable adaptive algorithms that fail to converge to the intended target distribution (Bai et al. 2011; Łatuszyński et al. 2013), indicating that compared to standard MCMC even more care must be taken to ensure validity of inferential conclusions.

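In the setting where optimal scaling was first established, the target in dimension *d* is a product of iid components with common density *f*, the random walk Metropolis increments have variance of order \(l^{2}/d,\) and, as \(d \rightarrow \infty ,\) a suitably time-rescaled coordinate of the chain converges to the solution *Z* of a SDE whose speed is a function *h*(*l*) of the scale parameter *l*. Maximising *h*(*l*) is equivalent to maximising the efficiency of the algorithm as the dimension goes to infinity. Surprisingly, there is a one-to-one correspondence between the value \(l_{opt} = \mathrm{argmax}\, h(l)\) and the mean acceptance probability of 0.234.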

The magic number 0.234 does not depend on *f* and gives a universal tuning recipe to be used for example in adaptive algorithms: choose the scale of the increment so that approximately 23% of the proposals are accepted.
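One way to turn this recipe into an algorithm is to let the log of the proposal scale follow a stochastic approximation recursion that pushes the acceptance probability towards 0.234, with a step that vanishes over time. The sketch below assumes a random walk Metropolis sampler with an isotropic Gaussian proposal. The function name `adaptive_scale_rwm` and the decay exponent 0.6 are illustrative assumptions rather than choices taken from the paper.

```python
import math
import random


def adaptive_scale_rwm(log_target, theta0, n_iter, target_rate=0.234, seed=0):
    """Random walk Metropolis whose scale is tuned towards a target acceptance rate."""
    rng = random.Random(seed)
    theta = list(theta0)
    log_p = log_target(theta)
    log_scale = 0.0
    accepts = []
    for n in range(1, n_iter + 1):
        scale = math.exp(log_scale)
        prop = [t + scale * rng.gauss(0.0, 1.0) for t in theta]
        log_p_prop = log_target(prop)
        acc_prob = math.exp(min(0.0, log_p_prop - log_p))
        if rng.random() < acc_prob:
            theta, log_p = prop, log_p_prop
            accepts.append(1)
        else:
            accepts.append(0)
        # Increase the scale when accepting too often, decrease it otherwise;
        # the step n^(-0.6) vanishes, so the adaptation diminishes over time.
        log_scale += n ** -0.6 * (acc_prob - target_rate)
    return math.exp(log_scale), accepts
```

For a ten-dimensional standard normal target, for example `adaptive_scale_rwm(lambda th: -0.5 * sum(t * t for t in th), [0.0] * 10, 20000)`, the returned scale settles where roughly a quarter of the proposals are accepted.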

The result, established under restrictive assumptions, has been empirically verified to hold much more generally, for non iid targets and also in medium- and even low-dimensional examples with *d* as small as 5. It has also been combined with the relative efficiency loss due to mismatch between the proposal and target covariance matrices (see Roberts and Rosenthal 2001).

The simplicity of the result and easy access to the average acceptance rate makes optimal scaling the main theoretical driver in development of adaptive MCMC algorithms, and adaptive MCMC is the main application and motivation for researching optimal scaling.

A large body of theoretical work extends optimal scaling formally to different and more general scenarios. For example, Metropolis for smooth non iid targets has been addressed by Bédard (2007), and in infinite dimensional settings by Beskos et al. (2009). Discrete and other discontinuous targets have been considered in Roberts (1998) and Neal et al. (2012). For MALA algorithms an optimal acceptance rate of 0.574 has been established in Roberts and Rosenthal (1998) and confirmed in infinite-dimensional settings in Pillai et al. (2012) along with the stepsize \(\sigma ^{2}_{d} = l^{2} d^{-1/3}.\) Hybrid Monte Carlo (see Sect. 2.3) has been analysed in a similar spirit by Beskos et al. (2013) and Betancourt et al. (2014), concluding that any acceptance rate in \([0.6,\,0.9]\) will be close to optimal and that the leapfrog step size should be taken as \(h = l \times d^{-1/4}.\) These results not only inform about optimal tuning, but also provide an efficiency ordering on the algorithms in *d* dimensions. Metropolis algorithms need \(\mathcal {O}(d)\) steps to explore the state space, while MALA and HMC need respectively \(\mathcal {O}(d^{1/3})\) and \(\mathcal {O}(d^{1/4}).\)

Further extensions include studying the transient phase before reaching stationarity (Christensen et al. 2005; Jourdain et al. 2012, 2014), the scaling of multiple-try MCMC (Bédard et al. 2012) and delayed rejection MCMC (Bédard et al. 2014), and the temperature scale of parallel tempering type algorithms (Atchadé et al. 2011b; Roberts and Rosenthal 2014). Interestingly, the optimal scaling of the pseudo-marginal algorithms discussed in Sect. 2.5, as obtained in Sherlock et al. (2014) and extended to more general settings in Doucet et al. (2012) and Sherlock (2014), suggests an acceptance rate of just 0.07.

While each of these numerous optimal scaling results gives rise, at least in principle, to an adaptive MCMC design, the pioneering and most successful algorithm is the Adaptive Metropolis of Haario et al. (2001). With its increasing popularity in applications, this has fuelled the development of the field.
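The idea behind Adaptive Metropolis can be illustrated in one dimension: the proposal variance is set to a multiple of the empirical variance of the chain so far, plus a small regularisation term. The sketch below is a simplified scalar version written for illustration, not a reproduction of the algorithm of Haario et al. (2001). The scaling constant \(2.38^{2},\) the regularisation `eps`, the initial non-adaptive period `n_init` and all names are assumptions of the example.

```python
import math
import random


def adaptive_metropolis(log_target, theta0, n_iter, n_init=100, init_sd=1.0,
                        s_d=2.38 ** 2, eps=1e-6, seed=0):
    """Scalar random walk Metropolis with a proposal variance learnt from the past chain."""
    rng = random.Random(seed)
    theta = theta0
    log_p = log_target(theta)
    # Running count, mean and sum of squared deviations (Welford recursion).
    count, mean, m2 = 1, float(theta0), 0.0
    prop_sd = init_sd
    chain = []
    for _ in range(n_iter):
        if count > n_init:
            # Proposal variance: s_d * (empirical variance of the history + eps).
            prop_sd = math.sqrt(s_d * (m2 / (count - 1) + eps))
        prop = theta + prop_sd * rng.gauss(0.0, 1.0)
        log_p_prop = log_target(prop)
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            theta, log_p = prop, log_p_prop
        chain.append(theta)
        # Update the running moments with the new state of the chain.
        count += 1
        delta = theta - mean
        mean += delta / count
        m2 += delta * (theta - mean)
    return chain, prop_sd
```

Because each new state changes the empirical variance by a term of order \(1/n,\) successive proposal kernels differ less and less as the simulation proceeds.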

Versions and refinements of the adaptive Metropolis algorithm (Roberts and Rosenthal 2009; Andrieu and Thoms 2008) have served well in applications and motivated much of the theoretical development. These include, among many other contributions, adaptive Metropolis, delayed rejection adaptive Metropolis (Haario et al. 2006), regional adaptation and parallel chains (Craiu et al. 2009), and the robust version of Vihola (2012) estimating the shape of the distribution rather than its covariance matrix and hence suitable for heavy tailed targets.

Analogous development of adaptive MALA algorithms in Atchadé (2006) and Marshall and Roberts (2012), and of adaptive HMC and Riemannian Manifold Monte Carlo in Wang et al. (2013), building on the adaptive scaling theory, resulted in similarly drastic mixing improvements to those of the original Adaptive Metropolis.

Another substantial and still unexplored area where adaptive algorithms are applied for very high dimensional and multimodal problems is model and variable selection (Nott and Kohn 2005; Richardson et al. 2010; Lamnisos et al. 2013; Ji and Schmidler 2013; Griffin et al. 2014). These algorithms can incorporate reversible jump moves (Green 1995) and are guided by scaling limits for discrete distributions as well as temperature spacing of parallel tempering to address multimodality. Successful implementations allow for fully Bayesian variable selection in models with over 20,000 variables for which otherwise only ad hoc heuristic approaches have been used in the literature.

To address the second difficulty with adaptive algorithms, several approaches have been developed to establish their theoretical underpinning. While for standard MCMC, convergence in total variation and law of large numbers are obtained almost trivially, and the effort concentrates on stronger results, like CLTs, geometric convergence, nonasymptotic analysis, and, maybe most importantly, comparison and ordering of algorithms, adaptive samplers are intrinsically difficult. The most elegant and theoretically-valid strategy is to change the underlying Markovian kernel at regeneration times only (Gilks et al. 1998). Unfortunately, this is not very appealing for practitioners since regenerations are difficult to identify in more complex settings and are essentially impractically rare in high dimensions. The original Adaptive Metropolis of Haario et al. (2001) has been validated (under some restrictive additional conditions) by controlling the dependencies introduced by the adaptation and using convergence results for mixingales. The approach has been further developed in Atchadé and Rosenthal (2005) and Atchadé (2006) to verify its ergodicity under weaker assumptions and apply the mixingale approach to adaptive MALA. Another successful approach (Andrieu and Moulines 2006 refined in Saksman and Vihola 2010) rests on martingale difference approximations and martingale limit theorems to obtain, under suitable technical assumptions, versions of LLN and CLTs. There are close links between analysing adaptive MCMC and stochastic approximation algorithms and in particular the adaptation step can be often written as a mean field of the stochastic approximation procedure; Andrieu and Robert (2001), Atchadé et al. (2011a) and Andrieu et al. (2015) contribute to this direction of analysis. Fort et al. (2011) develop an approach where both adaptive and interacting MCMC algorithms can be treated in the same framework. 
This allows addressing “external adaptation” algorithms such as the interacting tempering algorithm (a simplified version of the celebrated equi-energy sampler of Kou et al. 2006) or adaptive parallel tempering in Miasojedow et al. (2013).

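Two conditions, *Diminishing Adaptation* and *Containment*, are sufficient to guarantee that an adaptive MCMC algorithm will converge asymptotically to the correct target distribution. To this end recall the total variation distance between two measures defined as \(\Vert \nu (\cdot ) - \mu (\cdot )\Vert := \sup _{A\in \mathcal {F}}|\nu (A) - \mu (A)|,\) and for every Markov transition kernel \(P_{\xi },\,\xi \in \varXi \) and every starting point \(\theta \in \varTheta \) define the \(\varepsilon \) convergence function \(M_{\varepsilon }{\text {:}}\,\varTheta \times \varXi \rightarrow \mathbb {N}\) as

\[M_{\varepsilon }(\theta ,\,\xi ) := \inf \{t \ge 1:\,\Vert P_{\xi }^{t}(\theta ,\,\cdot ) - \pi (\cdot )\Vert \le \varepsilon \},\]

that is, the number of iterations *t* the kernel \(P_{\xi }\) needs, when started at \(\theta ,\) to come within \(\varepsilon \) of \(\pi \) in total variation.

**Definition 1**

(*Diminishing Adaptation*) The adaptive algorithm with starting values \(\theta ^{(0)}=\theta \) and \(\xi ^{(0)} = \xi \) satisfies Diminishing Adaptation, if

\[\lim _{n \rightarrow \infty } \sup _{\theta \in \varTheta } \Vert P_{\xi ^{(n+1)}}(\theta ,\,\cdot ) - P_{\xi ^{(n)}}(\theta ,\,\cdot )\Vert = 0 \quad \text {in probability.}\]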

**Definition 2**

(*Containment*) The adaptive algorithm with starting values \(\theta ^{(0)}=\theta \) and \(\xi ^{(0)} = \xi \) satisfies Containment, if for all \(\varepsilon > 0\) the sequence \(\{M_{\varepsilon }(\theta ^{(n)},\,\xi ^{(n)})\}_{n=0}^{\infty }\) is bounded in probability.

While Diminishing Adaptation is a standard requirement, Containment is subject to some discussion. On one hand, it may seem difficult to verify in practice; on the other, it may appear restrictive in the context of ergodicity results under some weaker conditions (cf. Fort et al. 2011). However, it turns out (Łatuszyński and Rosenthal 2014) that if Containment is not satisfied, then the algorithm may still converge, but with positive probability it will be asymptotically less efficient than any nonadaptive ergodic MCMC scheme. Hence algorithms that do not satisfy Containment are termed AdapFail and are best avoided. Containment has been further studied in Bai et al. (2011) and is in particular implied by simultaneous geometric or polynomial drift conditions of the adaptive kernels.

Given that adaptive algorithms may be incorporated in essentially any sampling scheme, their introduction seems to be one of the most important innovations of the last two decades. However, despite substantial effort and many ingenious contributions, the theory of adaptive MCMC lags behind practice even more than may be the case in other computational areas. While theory always matters, the numerous unexpected and counterintuitive examples of transient adaptive algorithms suggest that in this area theory matters even more for healthy development.

For adaptive MCMC to become a routine tool, a clear-cut result is needed saying that under some easily verifiable conditions these algorithms are valid and perform not much worse than their nonadaptive counterpart with fixed parameters. Such a result is yet to be established and may require deeper understanding of how to construct stable adaptive MCMC, rather than aiming heavy technical artillery at algorithms currently in use without modifying them.

### 2.5 Estimated likelihoods and pseudo-marginals

A first situation where the target density \(\pi (\cdot |y)\) cannot be directly computed is that of latent variable models, in which the posterior is only available as the marginal \(\pi (\theta |y) = \int \pi (\theta ,\,z|y)\,\mathrm {d}z.\) A standard solution is to treat *z* as an auxiliary variable and to simulate the joint distribution \(\pi (\theta ,\,z|y)\) on \(\varTheta \times \mathcal {Z}\) by a standard method, leading to simulating the marginal density \(\pi (\cdot |y)\) as a by-product. However, when the dimension of the auxiliary variable *z* grows with the sample size, this technique may run into difficulties as induced MCMC algorithms are more and more likely to have convergence issues. An illustration of this case is provided by hidden Markov models, which have eventually to resort to particle filters as Markov chain algorithms become ineffective (Chopin 2007; Fearnhead and Clifford 2003). Another situation where the target density \(\pi (\cdot |y)\) cannot be directly computed is the case of the “doubly intractable” likelihood (Murray et al. 2006a), when the likelihood function \(\ell (\theta |y)\propto g(y|\theta )\) itself contains a term that is intractable, in that it makes the normalising constant \(\int g(y|\theta )\,\mathrm {d}y\) impossible to compute.

Examples of this kind abound in Markov random field models, as for instance for the Ising model (Murray et al. 2006b; Møller et al. 2006).

Both the approaches of Murray et al. (2006a) and Møller et al. (2006) require sampling data from the likelihood \(\ell (\theta |y),\) which limits their applicability. The latter uses in addition an importance sampling function and may suffer from poor acceptance rates.

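The pseudo-marginal approach replaces the intractable \(\pi (\theta |y)\) with a non-negative unbiased estimate \(S_{\theta }.\) At iteration *n*, given the pair \((\theta ^{(n-1)},\,S_{\theta ^{(n-1)}}^{(n-1)})\) of parameter value and density estimate at this value, the proposal \((\theta ^{\prime },\,S_{\theta ^{\prime }})\) is obtained by sampling \(\theta ^{\prime } \sim q(\cdot |\theta ^{(n-1)})\) and obtaining \(S_{\theta ^{\prime }},\) the estimate of \(\pi (\theta ^{\prime }|y).\) Analogously to the standard Metropolis–Hastings step, the pair \((\theta ^{\prime },\,S_{\theta ^{\prime }})\) is accepted as \((\theta ^{(n)},\,S_{\theta ^{(n)}}^{(n)})\) with probability

\[\min \left\{ 1,\,\frac{S_{\theta ^{\prime }}\,q(\theta ^{(n-1)}|\theta ^{\prime })}{S_{\theta ^{(n-1)}}^{(n-1)}\,q(\theta ^{\prime }|\theta ^{(n-1)})}\right\} ;\]

otherwise the chain stays at \((\theta ^{(n-1)},\,S_{\theta ^{(n-1)}}^{(n-1)}).\)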

It is not difficult to verify that the bivariate chain on the extended state space \(\varTheta \times \mathbb {R}_{+}\) enjoys the correct \(\pi (\theta |y)\) marginal on \(\varTheta ,\) so the approach is valid; see Andrieu and Roberts (2009) for details (and also Andrieu and Vihola 2015 for an abstracted account).
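The key implementation detail of a pseudo-marginal sampler is that the estimate attached to the current state is stored and reused, and only the proposed state receives a fresh estimate. The sketch below illustrates this on a toy latent variable model chosen for the example (it does not come from the paper): \(y|z \sim N(z,\,1),\) \(z|\theta \sim N(\theta ,\,1)\) and \(\theta \sim N(0,\,100),\) with the likelihood estimated by averaging over simulated values of *z*. All names and default values are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_target(theta, y, n_particles, rng, prior_var=100.0):
    """Unbiased estimate of prior(theta) * p(y | theta), the unnormalised posterior."""
    # Average N(y; z_i, 1) over z_i ~ N(theta, 1) to estimate p(y | theta).
    lik_hat = sum(normal_pdf(y, theta + rng.gauss(0.0, 1.0), 1.0)
                  for _ in range(n_particles)) / n_particles
    return normal_pdf(theta, 0.0, prior_var) * lik_hat


def pseudo_marginal_mh(y, theta0, n_iter, n_particles=20, step=2.0, seed=0):
    """Pseudo-marginal random walk Metropolis on the pair (theta, estimate)."""
    rng = random.Random(seed)
    theta = theta0
    s = estimate_target(theta, y, n_particles, rng)
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        s_prop = estimate_target(prop, y, n_particles, rng)
        # Symmetric proposal: accept with probability min(1, s_prop / s).
        # The current estimate s is carried along, never recomputed.
        if rng.random() * s < s_prop:
            theta, s = prop, s_prop
        chain.append(theta)
    return chain
```

In this toy model the exact posterior is available (normal with mean \(100y/102\) and variance \(200/102\)), which makes it easy to check that the chain targets the right distribution despite the noisy evaluations. Recomputing `s` at every iteration instead of storing it would give a different algorithm that no longer has this guarantee.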

One specific instance of constructing unbiased estimators of the posterior is presented in Girolami et al. (2013) and based on random truncations of infinite series expansions. The paper also offers an excellent overview of inference methods for intractable likelihoods.

The performance of the pseudo-marginal approach will depend on the quality of the estimators \(S_{\theta }\) and hence stabilising them as well as understanding this relationship is an active area of current development. Often \(S_{\theta }\) is constructed as an importance sampler based on an importance sample *z*. Thus in particular, the improvements from using multiple samples of *z* to estimate \(\pi \) are of interest and can be assessed from Andrieu and Vihola (2015) where the efficiency of the algorithm is studied in terms of its spectral gap and CLT asymptotic variance. Sherlock et al. (2014), Doucet et al. (2012) and Sherlock (2014), on the other hand, investigate the efficiency as a function of the acceptance rate and variance of the noise, deriving the optimal scaling, as discussed in Sect. 2.4.

As an alternative to the above procedure of using estimates of the intractable likelihood to design a new Markov chain on an extended state space with correct marginal, one could naively use these estimates to approximate the Metropolis–Hastings accept–reject ratio and let the Markov chain evolve in the original state space. This would amount to dropping the current realisation of \(S_{\theta }\) and obtaining a new one in each accept–reject attempt. Such a procedure is termed Monte Carlo within Metropolis (Andrieu and Roberts 2009). Unfortunately this approach does not preserve the stationary distribution, and the resulting Markov chain may not even be ergodic (Medina-Aguayo et al. 2015). If it is ergodic, the difference between its stationary distribution, resulting from the noisy acceptance, and the target posterior must be quantified, which is a highly nontrivial task, and the bounds will rarely be tight (see also Alquier et al. 2014; Pillai and Smith 2014; Rudolf and Schweizer 2015 for related methodology and theory). The approach is however an interesting avenue since, at the price of being biased, it overcomes the mixing difficulties of the exact pseudo-marginal version.

Design and understanding of pseudo-marginal algorithms is a direction of dynamic methodological development that in the coming years will be further fuelled not only by complex models with intractable likelihoods, but also by the need for MCMC algorithms for Big Data. In this context the likelihood function cannot be evaluated for the whole dataset even in the *iid* case just because computing the long product of individual likelihoods is infeasible. Several Big Data MCMC approaches have already been considered in Welling and Teh (2011), Korattikara et al. (2013), Teh et al. (2014), Bardenet et al. (2014), Maclaurin and Adams (2014), Minsker et al. (2014), Quiroz et al. (2014) and Strathmann et al. (2015).

### 2.6 Particle MCMC

While we refrain from covering particle filters here, since others (Beskos et al. 2015) in this volume are focussing on this technique, a recent advance at the interface between MCMC, pseudo-marginals, and particle filtering is the notion of particle MCMC (or *pMCMC*), developed by Andrieu et al. (2011). This innovation is indeed rather similar to the pseudo-marginal algorithm approach, taking advantage of the state-space models and auxiliary variables used by particle filters. It differs from standard particle filters in that it targets (mostly) the marginal posterior distribution of the parameters.

At iteration *t*, a new value \(\theta ^{\prime }\) of \(\theta \) is proposed from an arbitrary transition kernel \(\mathfrak {h}(\cdot |\theta ^{(t)})\) and then a new value of the latent series \(x_{0:T}^{\prime }\) is generated from a particle filter approximation of \(p(x_{0:T}|\theta ^{\prime },\,y_{1:T}).\) Since the particle filter returns as a by-product (Del Moral et al. 2006) an unbiased estimator of the marginal likelihood of \(y_{1:T},\,\hat{q}(y_{1:T}|\theta ^{\prime }),\) this estimator can be used as such in the Metropolis–Hastings ratio
$$\begin{aligned} \frac{\hat{q}(y_{1:T}|\theta ^{\prime })\,\pi (\theta ^{\prime })\,\mathfrak {h}(\theta ^{(t)}|\theta ^{\prime })}{\hat{q}(y_{1:T}|\theta ^{(t)})\,\pi (\theta ^{(t)})\,\mathfrak {h}(\theta ^{\prime }|\theta ^{(t)})}. \end{aligned}$$

This approach is being used increasingly in complex dynamic models like those found in signal processing (Whiteley et al. 2010), dynamical systems like the PDEs in biochemical kinetics (Wilkinson 2011b) and probabilistic graphical models (Lindsten et al. 2014). An extension to approximating the sequential filtering distribution is found in Chopin et al. (2013).
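As a rough illustration of how a particle filter estimate can drive a Metropolis–Hastings chain on the parameter, the sketch below uses a bootstrap filter with multinomial resampling. The state-space model (a Gaussian AR(1) state observed with unit-variance noise), the uniform prior on \((-1,1)\), the random-walk proposal and all function names are assumptions chosen for illustration; this is not the implementation of Andrieu et al. (2011).

```python
import numpy as np

def bootstrap_filter_loglik(theta, y, n_particles, rng):
    """Log of the bootstrap particle filter estimate of p(y_{1:T} | theta).

    Assumed toy model: x_0 ~ N(0, 1), x_t = theta * x_{t-1} + N(0, 1),
    y_t = x_t + N(0, 1). The product of average weights is unbiased for
    the likelihood; its logarithm is returned for numerical stability.
    """
    x = rng.normal(0.0, 1.0, size=n_particles)
    loglik = 0.0
    for y_t in y:
        x = theta * x + rng.normal(0.0, 1.0, size=n_particles)      # propagate
        logw = -0.5 * (y_t - x) ** 2 - 0.5 * np.log(2.0 * np.pi)    # weight
        shift = logw.max()
        w = np.exp(logw - shift)
        loglik += shift + np.log(w.mean())
        x = rng.choice(x, size=n_particles, p=w / w.sum())          # resample
    return loglik

def particle_mh(y, n_iter=2000, n_particles=200, step=0.2, seed=0):
    """Random-walk Metropolis-Hastings on theta using the filter estimate."""
    rng = np.random.default_rng(seed)
    theta = 0.5
    loglik = bootstrap_filter_loglik(theta, y, n_particles, rng)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        if abs(prop) < 1.0:  # assumed uniform prior on (-1, 1)
            loglik_prop = bootstrap_filter_loglik(prop, y, n_particles, rng)
            # Symmetric proposal and flat prior: ratio of likelihood estimates
            if np.log(rng.uniform()) < loglik_prop - loglik:
                theta, loglik = prop, loglik_prop
        chain[t] = theta
    return chain
```

As in the pseudo-marginal case, the estimate attached to the current parameter value is retained until the next acceptance.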

### 2.7 Parallel MCMC

Since MCMC relies on local updating based on the current value of a Markov chain, opportunities for exploiting parallel resources, either CPU or GPU, would seem quite limited. In fact, the possibilities reach far beyond the basic notion of running independent or coupled MCMC chains on several processors. For instance, Craiu and Meng (2005) construct parallel antithetic coupling to create negatively correlated MCMC chains (see also Frigessi et al. 2000), while Craiu et al. (2009) use parallel exploration of the sample space to tune an adaptive MCMC algorithm. Jacob et al. (2011) exploit GPU facilities to improve by Rao-Blackwellisation the Monte Carlo approximations produced by a Markov chain, even though the parallelisation does not improve the convergence of the chain. See also Lee et al. (2009) and Suchard et al. (2010) for more detailed contributions on the appeal of using GPUs towards massive parallelisation, and Wilkinson (2005) for a general survey on the topic.

Another recently-explored direction is “prefetching”. Based on Brockwell (2006) this approach computes the \(2^{2},\,2^{3},\ldots , 2^{k}\) values of the posterior that will be needed \(2,\,3,\ldots ,k\) sweeps ahead by simulating the possible “futures” of the Markov chain, according to whether the next *k* proposals are accepted or not, in parallel. Running a regular Metropolis–Hastings algorithm then means building a decision tree back to the current iteration and drawing \(2,\,3,\ldots ,k\) uniform variates to go down the tree to the appropriate branch. As noted by Brockwell (2006), “in the case where one can guess whether or not acceptance probabilities will be ‘high’ or ‘low’, the tree could be made deeper down ‘high’ probability paths and shallower in the ‘low’ probability paths.” This idea is exploited in Angelino et al. (2014), by creating “speculative moves” that consider the reject branch of the prefetching tree more often than not, based on some preliminary or dynamic evaluation of the acceptance rate. Using a fast but close-enough approximation to the true target (and a fixed sequence of uniforms) may also produce a “single most likely path” on which prefetched simulations can be run. The basic idea is thus to run simulations and costly likelihood computations on many parallel processors along a prefetched path, a path that has been prefetched for its high approximate likelihood. There are obviously instances where this speculative simulation is not helpful because the actual chain with the genuine target ends up following another path. Angelino et al. (2014) actually go further by constructing sequences of approximations for the precomputations. The proposition for the sequence found therein is to subsample the original data and use a normal approximation to the difference of the log (sub-)likelihoods. See Strid (2010) for related ideas.

A different use of parallel capabilities is found in Calderhead (2014). At each iteration of Calderhead’s algorithm, *N* replicas are generated, rather than 1 in traditional Metropolis–Hastings. The Markov chain actually consists of *N* components, from which one component is selected at random as a seed for the next proposal. This approach can be seen as a special type of data augmentation (Tanner and Wong 1987), where the index of the selected component is an auxiliary variable. The neat trick in the proposal (and the reason for its efficiency gain) is that the stationary distribution of the auxiliary variable can be determined and hence used *N* times in updating the vector of *N* components. An interesting feature of this approach is when the original Metropolis–Hastings algorithm is expressed as a finite state space Markov chain on the set of indices \(\{1,\ldots ,N\}.\) Conditional on the values of the *N* dimensional vector, the stationary distribution of that sub-chain is no longer uniform. Hence, picking *N* indices from the stationary helps in selecting the most appropriate images, which explains why the rejection rate decreases. The paper indeed evaluates the impact of increasing the number of proposals in terms of ESS, acceptance rate, and mean squared jump distance. Since this proposal is an almost free bonus resulting from using *N* processors, it sounds worth investigating and comparing with more complex parallel schemes.

*m* subsets and taking a power 1/*m* of the prior in each case. (This may induce issues about impropriety.) However, the subdivision is arbitrary and can thus be implemented in cases other than the fairly restrictive iid setting. Because each (subsample) nonparametric estimate involves *T* terms, the resulting overall estimate contains \(T^{m}\) terms and the authors suggest using an independent Metropolis sampler to handle this complexity. This is in fact necessary for producing a final sample from the (approximate) true posterior distribution.
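The factorisation behind this construction, namely that the product of *m* subposteriors, each built with the prior raised to the power 1/*m*, recovers the full posterior, can be checked in closed form in a conjugate case. The sketch below is not the kernel-based estimator of the authors; it assumes a Gaussian mean model with known variance and a zero-mean Gaussian prior purely so that every quantity is available exactly, and the function names are hypothetical.

```python
import numpy as np

def gaussian_posterior(y, prior_var, noise_var):
    """Posterior (mean, variance) of theta for y_i ~ N(theta, noise_var), theta ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + len(y) / noise_var
    return np.sum(y) / noise_var / precision, 1.0 / precision

def combine_subposteriors(y, m, prior_var, noise_var):
    """Multiply m Gaussian subposteriors, each using the prior to the power 1/m."""
    # A N(0, prior_var) density raised to the power 1/m is proportional
    # to a N(0, m * prior_var) density.
    parts = [gaussian_posterior(batch, m * prior_var, noise_var)
             for batch in np.array_split(y, m)]
    precisions = np.array([1.0 / var for _, var in parts])
    means = np.array([mean for mean, _ in parts])
    total = precisions.sum()
    # Product of Gaussian densities: precisions add, means are precision-weighted
    return np.sum(precisions * means) / total, 1.0 / total
```

Outside such conjugate settings the subposteriors are only available through simulation, which is where the nonparametric estimates and the \(T^{m}\) terms enter.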

- (1) Curse of dimensionality in the number of parameters *d*;
- (2) Curse of dimensionality in the number of subsets *m*;
- (3) Tail degeneration;
- (4) Support inconsistency and mode misspecification.

\(T^{m}\) terms corresponding to a product of *m* sums of *T* terms sounds self-defeating, Neiswanger et al. (2013) use a clever device to avoid the combinatorial explosion, namely operating on one component at a time. Having non-manageable targets is not such an issue in the post-MCMC era. Point 3 is formally correct, in that the kernel tail behaviour induces the kernel estimate tail behaviour, most likely disconnected from the true target tail behaviour, but this feature is true for any non-parametric estimate, even for the Weierstrass transform defined below, and hence maybe not so relevant in practice. In fact, by lifting the tails up, the simulation from the subposteriors should help in visiting the tails of the true target. Finally, point 4 does not seem to be life-threatening. Assuming that the true target can be computed up to a normalising constant, the value of the target for every simulated parameter could be computed, eliminating those outside the support of the product and highlighting modal regions.

The Weierstrass transform of a density *f* is a convolution of *f* and of an arbitrary kernel *K*. Wang and Dunson (2013) propose to simulate from the product of the Weierstrass transform, using a multi-tiered Gibbs sampler. Hence, the parameter is only simulated once and from a controlled kernel, while the random effects from the convolution are related with each subposterior. While the method requires coordination between the parallel threads, the components of the target are separately computed on a single thread. The clearest perspective on the Weierstrass transform may possibly be the rejection sampling version where simulations from the subpriors are merged together into a normal proposal on \(\theta ,\) to be accepted with a probability depending on the subprior simulations.

VanDerwerken and Schmidler (2013) keep with the spirit of parallel MCMC papers like consensus Bayes (Scott et al. 2013), embarrassingly parallel MCMC (Neiswanger et al. 2013) and Weierstrass MCMC (Wang and Dunson 2013), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. The motivation is that “Small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel with the others. More specifically, given a partition of the space, into sets \(A_{i}\) with posterior weights \(w_{i},\) parallel chains are associated with targets equal to the original target restricted to those \(A_{i}\)s. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. It is somewhat unclear (a) whether or not this impacts ergodicity (it all depends on the way the chain is constructed, i.e., against which target) as it could lead to an over-representation of some boundary regions and (b) whether or not it improves the overall convergence properties of the chain(s). A more delicate issue with the partitioned MCMC approach stands with the partitioning. Indeed, in a complex and high-dimension model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. 
Waiting for a correct exploration of the modes is indeed faster than waiting for crossing between modes, provided all modes are represented and the chain for each partition set \(A_{i}\) has enough energy to explore this set. It actually sounds unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets \(A_{i}\) are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. A last comment about this innovative paper is that the adaptive construction of the partition has much in common with Wang–Landau schemes (Wang and Landau 2001; Lee et al. 2005; Atchadé and Liu 2010; Jacob and Ryder 2014).

## 3 ABC and others, exactly delivering an approximation

Motivated by highly complex models where MCMC algorithms and other Monte Carlo methods were too inefficient by far, approximate methods have emerged where the output cannot be considered as simulations from the genuine posterior, even under idealised situations of infinite computing power. These methods include ABC techniques, described in more detail below, but also variational Bayes (Jaakkola and Jordan 2000), empirical likelihood (Owen 2001), integrated nested Laplace approximation (INLA) (Rue et al. 2009) and other solutions that rely on pseudo-models, or on summarised versions of the data, or both. It is quite important to signal this evolution as we think that it may be a central feature of computational Bayesian statistics in the coming years. From a statistical perspective, it also induces a somewhat paradoxical situation where loss of information is balanced by improvement in precision, for a given computational budget. This perspective is not only interesting at the computational level but forces us (as statisticians) to re-evaluate in depth the nature of a statistical model and could produce a paradigm shift in the near future by giving a brand new meaning to George Box’s motto that “all models are wrong”.

### 3.1 ABC per se

It seems important to discuss ABC in this partial tour of Bayesian computational techniques as (a) they provide the only approach to their model for some Bayesians, (b) they deliver samples in the parameter space that are exact simulations from a posterior of some kind (Wilkinson 2013), \(\pi ^{\mathrm{ABC}}(\theta |\mathbf {y}_{0})\) if not the original posterior \(\pi (\theta |\mathbf {y}_{0}),\) where \(\mathbf {y}_{0}\) denotes the data in this section (c) they may be more intuitive to some researchers outside statistics, as they entail simulating from the inferred model, i.e., going forward from parameter to data, rather than backward, from data to parameter, as in traditional Bayesian inference, (d) they can be merged with MCMC algorithms, and (e) they allow drawing inference directly from summaries of the data rather than the data itself.

ABC techniques play a role in the 2000s that MCMC methods did in the 1990s, in that they handle new models for which earlier (e.g., MCMC) algorithms were at a loss, in the same way the latter (MCMC) were able to handle models that regular Monte Carlo approaches could not reach, such as latent variable models (Tanner and Wong 1987; Diebolt and Robert 1994; Richardson and Green 1997). New models for which ABC unlocked the gate include Markov random fields, Kingman’s coalescent for phylogeographical data, likelihood models with an intractable normalising constant, and models defined by their quantile function or their characteristic function. While the ABC approach first appeared as a “quick-and-dirty” solution, to be considered only until more elaborate representations could be found, those algorithms have been progressively incorporated into the statistician’s toolbox as a novel form of generic nonparametric inference handling partly-defined statistical models. They are therefore attractive as much for this reason as for being handy computational solutions when everything else fails.

A statistically intriguing feature of those methods is that they customarily require—for greater efficiency—replacing the data with (much) smaller-dimension summaries^{1} or summary statistics, because of the complexity of the former. In almost every case calling for ABC, those summaries are not sufficient statistics and the method thus implies from the start a loss of statistical information, at least at a formal level, since relying on the raw data is out of the question and therefore the additional information it provides is moot. This imposed reduction of the statistical information raises many relevant questions, from the choice of summary statistics (Blum et al. 2013) to the consistency of the ensuing inference (Robert et al. 2011).

Although it has now diffused into a wide range of applications, the technique of ABC was first introduced by and for population genetics (Tavaré et al. 1997; Pritchard et al. 1999) to handle ancestry models driven by Kingman’s coalescent and with strictly intractable likelihoods (Beaumont 2010). The likelihood function of such genetic models is indeed “intractable” in the sense that, while derived from a fully defined and parameterised probability model, this function cannot be computed (at all or within a manageable time) for a single value of the parameter and for the given data. Bypassing the original example to avoid getting mired into the details of population genetics, examples of intractable likelihoods include densities with intractable normalising constants, i.e., \(f (\mathbf {y}|\theta ) = g(\mathbf {y}|\theta )/Z(\theta )\) such as in Potts (1952) and auto-exponential (Besag 1972) models, and pseudo-likelihood models (Cucala et al. 2009).

*Example 1*

A very simple illustration of an intractable likelihood is provided by Bayesian inference based on the median and median absolute deviation statistics of a sample from an arbitrary location-scale family, \(y_{1},\ldots ,y_{n}\mathop {\sim }\limits ^{\mathrm{iid}} \sigma ^{-1} g(\sigma ^{-1}\{y-\mu \}),\) as the joint distribution of this statistic is not available in closed form.

*inverse probability* (Rubin 1984). This concept is that data \(\mathbf {y}\) simulated conditional on values of the parameter close to the “true” value of the parameter should look more similar to the actual data \(\mathbf {y}_{0}\) than data \(\mathbf {y}\) simulated conditional on values of the parameter far from the “true” value. ABC actually involves an acceptance/rejection step in that parameters simulated from the prior are accepted only when

*S*(*y*) of the parameter \(\theta \) are available, they are good candidates. In the ABC literature, they are called *summary statistics*, a term that does not impose any constraint on their form and hence leaves open the question of performance, as discussed in Marin et al. (2012) and Blum et al. (2013). A more practical version of the ABC algorithm is shown in Algorithm 3 below, with a different output for each choice of the summary statistic. We stress in this version of the algorithm the construction of the tolerance \(\epsilon \) as a quantile of the simulated distances \(\rho (S(\mathbf {y}^{0}),\,S(\mathbf {y}^{(t)})),\) rather than an additional parameter of the method.
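A rejection version of ABC with the tolerance set as a quantile of the simulated distances can be sketched as follows, using the median and median absolute deviation of Example 1 as summary statistics. The normal location-scale sampling model, the priors on \(\mu \) and \(\sigma ,\) the Euclidean distance and the function names are all assumptions made for this illustration; the sketch is not a transcription of the algorithm referred to in the text.

```python
import numpy as np

def summaries(y):
    """Median and median absolute deviation (MAD) of a sample."""
    med = np.median(y)
    return np.array([med, np.median(np.abs(y - med))])

def abc_rejection(y_obs, n_sims=20000, quantile=0.01, seed=0):
    """ABC rejection sampler keeping the simulations closest to the observed summaries."""
    rng = np.random.default_rng(seed)
    n = len(y_obs)
    s_obs = summaries(y_obs)
    # Assumed priors, for illustration only: mu ~ N(0, 5^2), sigma ~ Exp(1)
    mu = rng.normal(0.0, 5.0, size=n_sims)
    sigma = rng.exponential(1.0, size=n_sims)
    # Forward simulation of pseudo-data from the assumed normal model
    y_sim = mu[:, None] + sigma[:, None] * rng.normal(size=(n_sims, n))
    med = np.median(y_sim, axis=1)
    mad = np.median(np.abs(y_sim - med[:, None]), axis=1)
    # Euclidean distance between simulated and observed summaries
    dist = np.hypot(med - s_obs[0], mad - s_obs[1])
    # Tolerance defined as a quantile of the simulated distances
    eps = np.quantile(dist, quantile)
    keep = dist <= eps
    return mu[keep], sigma[keep], eps
```

Changing `quantile` trades the size of the accepted sample against the closeness of the accepted pseudo-data to the observed summaries.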

An immediate question about this approximate algorithm is how much it remains connected with the original posterior distribution and in case it does not, where does it draw its legitimacy. A first remark in this connection is that it constitutes at best a convergent approximation to the posterior distribution \(\pi (\theta |S(y_{0})).\) It can easily be seen that ABC generates outcomes from a genuine posterior distribution when the data is randomised with scale \(\epsilon \) (Wilkinson 2013; Fearnhead and Prangle 2012). This interpretation indicates a decrease in the precision of the inference but it does not provide a universal validation of the method. A second perspective on the ABC output is that it is based on a nonparametric approximation of the sampling distribution (Blum 2010; Blum and François 2010), connected with both indirect inference (Drovandi et al. 2011) and *k*-nearest neighbour estimation (Biau et al. 2014). While a purely Bayesian nonparametric analysis of this aspect has not yet emerged, this brings an additional if cautious support for the method.

*Example 2*

The discrepancy can however be completely eliminated by post-processing: Fig. 2 reproduces Fig. 1 by comparing the histograms of an ABC sample with the version corrected by Beaumont et al.’s (2002) local regression, as the latter is essentially equivalent to a regular Gibbs output.

Barber et al. (2015) study the rate of convergence for ABC algorithms through the mean square error when approximating a posterior moment. They show that the convergence rate depends on the dimension *q* of the ABC summary statistic and is associated with an optimal choice of the tolerance. Those rates are connected with the nonparametric nature of ABC, as already suggested in the earlier literature: for instance, Blum (2010) links ABC with standard kernel density non-parametric estimation and finds a tolerance (re-expressed as a bandwidth) and an rmse of matching order, while Fearnhead and Prangle (2012) obtain similar rates for noisy ABC. See also Calvet and Czellar (2014). Similarly, Biau et al. (2014) obtain precise convergence rates for ABC interpreted as a *k*-nearest-neighbour estimator.

- (1) the standard ABC–MCMC (with *N* replicates of the simulated pseudo-data to each simulated parameter value),
- (2) versions involving simulations of the replicates repeated at the subsequent step,
- (3) use of a stopping rule in the generation of the pseudo data, and
- (4) a “gold-standard algorithm” based on the (unavailable) measure of an \(\epsilon \) ball around the data.

*random* number of auxiliary variables is sufficient to produce geometric ergodicity. We note that this result does not contradict the parallel result of Bornn et al. (2014), who establish that there is no efficiency gain in simulating \(N>1\) replicates of the pseudo-data, since there is no randomness involved in that approach. However, the latter result only applies to functions with finite variances.

Indeed, Robert et al. (2011) pointed out the potential irrelevance of ABC-based posterior probabilities, due to the possible ancillarity (for model choice) of summary statistics, as also explained in Didelot et al. (2011). Marin et al. (2014) consider for instance the comparison of normal and Laplace fits on both normal and Laplace samples and show that using sample mean and sample variance as summary statistics produces Bayes factors converging to values near 1, instead of the consistent 0 and \(+\infty .\)

Marin et al. (2014) analyse this phenomenon with the aim of producing a necessary and sufficient consistency condition on summary statistics. Quite naturally, the summaries that are acceptable must display different behaviour under both models, in the guise of ranges of means \(\mathbb {E}_{\theta }[S(\mathbf {y}_{0})]\) that do not intersect for the two models. (In the counter-example of the normal–Laplace test, the expectations of the sample mean and variance can be recovered under both models.) This characterisation then leads to a practical asymptotic test validating summary statistics and to the realisation that a larger number of summaries helps in achieving this goal (while degrading the estimated tolerance). More importantly, it shows that the reduction of information represented by an ABC approach may prevent discriminating between models, at least when trying to recover the Bayes factor. In the end, this is a natural consequence of simplifying the description of both the data and the model, and can be found in most limited information settings.

### 3.2 More fish in the alphabet soup

Besides ABC, approximation techniques have spread wide and far towards analysing more complex or less completely defined models. Rather than a confusion, this multiplicity of available approximations is beneficial both to the understanding of the underlying model and to the calibration of those different methods.

The *t*th step in the EP algorithm goes as follows:

- (1) Select \(1\le j\le k\) at random;
- (2) Define the marginal $$\begin{aligned} \nu _{-j}\left( \theta |\lambda _{t}\right) \propto \frac{\nu (\theta |\lambda _{t})}{\nu _{j}(\theta |\lambda _{t})}; \end{aligned}$$
- (3) Update the hyperparameter \(\lambda _{t}\) by solving $$\begin{aligned} \lambda _{t+1} = \mathop {\hbox {argmin}}\limits _{\lambda } \mathrm{KL}\left\{ \pi _{j}(\theta )\nu _{-j}\left( \theta |\lambda _{t}\right) ,\,\nu (\theta |\lambda )\right\} \end{aligned}$$
- (4) Update \(\nu _{j}(\theta |\lambda _{t})\) as $$\begin{aligned} \nu _{j}\left( \theta |\lambda _{t+1}\right) \propto \frac{ \nu (\theta |\lambda _{t+1}) }{ \nu _{-j}(\theta |\lambda _{t}) }. \end{aligned}$$

While different approximations keep being developed and tested, with arguments ranging from efficient programming, to avoiding simulations, to having an ability to deal with more complex structures, their drawback is the overall incapacity to assess the amount of approximation involved. Bootstrap evaluations can be attempted in the simplest cases but cannot be extended to more realistic situations.

## 4 Optimisation in modern Bayesian computation

Optimisation methodology for high-dimensional maximum-a-posteriori (MAP) estimation is another area of Bayesian computation that has received a lot of attention over the last years, particularly for problems related to machine learning, signal processing and computer vision. One reason for this is that for many Bayesian models optimisation is significantly more computationally tractable than integration. This has generated a lot of interest in MAP estimators, especially for applications involving very high-dimensional parameter spaces or tight computing time constraints, for which calculating other summaries of the posterior distribution is not feasible. Here we review some of the major breakthroughs in this topic, which originated mainly outside the statistics community. We focus on developments related to high-dimensional convex optimisation, though many of the techniques discussed below are also useful for non-convex optimisation. In particular, in Sect. 4.1 we concentrate on *proximal optimisation algorithms*, a powerful class of iterative methods that exploit tools from convex analysis, monotone operator theory and theory of non-expansive mappings to construct carefully designed fixed-point schemes. We refer the reader to the excellent book by Bauschke and Combettes (2011) for the mathematics underpinning proximal optimisation algorithms, and to the recent tutorial papers by Combettes and Pesquet (2011), Cevher et al. (2014) and Parikh and Boyd (2014) for an overview of the field and applications to signal processing and machine learning.

However, we do think it is vital to insist that, at the same time as asserting that modern optimisation methodology represents a much-underused opportunity in Bayesian inference, in its raw form it inevitably fails to deliver essential elements of the Bayesian paradigm. The vision is not to deliver a point estimate of an unknown structure, but the full richness of Bayesian inference in its coherence, its proper treatment of uncertainty, its intrinsic treatment of model uncertainty, and so on. Bayesian statistics does not boil down to optimisation with penalisation (Lange et al. 2014). We need to express the uncertainty associated with decisions and estimation, stemming from the stochastic nature of the data, and our lack of knowledge about relevant mechanisms.

The challenge is to use the awesome capacity of fast optimisation in a high-dimensional parameter space to focus on local regions of that space where a combination of analytic and numerical investigation can deliver at least approximations to full posterior distributions and derived quantities. The community has barely risen to this challenge, with only isolated examples such as the discussion in Green (2015) of a problem in unlabelled shape analysis. However, the growing community of INLA (Rue et al. 2009) users may bring a heightened awareness of such possibilities, along with its efficient code (Schrödle and Held 2011; Muff et al. 2013). Another promising research area is to use mathematical and algorithmic tools from convex optimisation to design more efficient high-dimensional MCMC algorithms (Pereyra 2015).

### 4.1 Proximal algorithms

Similarly to many other computational methodologies that are widely used nowadays, proximal algorithms were first proposed several decades ago by Moreau (1962), Martinet (1970) and Rockafellar (1976), and regained attention recently in the context of large-scale inverse problems and “big data”.

*g* belongs to the class \(\varGamma _{0}(\mathbb {R}^{n})\) of lower semicontinuous convex functions from \(\mathbb {R}^{n} \rightarrow (-\infty ,\,+\infty ].\) Notice that *g* may be non-differentiable and take value \(g({\varvec{\theta }}) = +\infty ,\) reflecting constraints in the parameter space. In order to introduce proximal algorithms we first recall the following standard definitions and results from convex analysis: we say that \({\varvec{\varphi }}\in \mathbb {R}^{n}\) is a subgradient of *g* at \({\varvec{\theta }}\in \mathbb {R}^{n}\) if it satisfies \((\mathbf {u}-{\varvec{\theta }})^{T}{\varvec{\varphi }}+ g({\varvec{\theta }}) \le g(\mathbf {u}),\,\forall \mathbf {u}\in \mathbb {R}^{n}.\) The set of all such subgradients defines the subdifferential set \(\partial g({\varvec{\theta }}),\) and \(\hat{{\varvec{\theta }}}_{MAP}\) is a minimiser of *g* if and only if \(\varvec{0} \in \partial g(\hat{{\varvec{\theta }}}_{MAP}).\) The (convex) conjugate of \(g \in \varGamma _{0}(\mathbb {R}^{n})\) is the function \(g^{*} \in \varGamma _{0}(\mathbb {R}^{n})\) defined as \(g^{*}({\varvec{\varphi }}) = \sup _{\mathbf {u}\in \mathbb {R}^{n}} \mathbf {u}^{T}{\varvec{\varphi }}- g(\mathbf {u}).\) The subgradients of *g* and \(g^{*}\) satisfy the property \({\varvec{\varphi }}\in \partial g({\varvec{\theta }}) \Leftrightarrow {\varvec{\theta }}\in \partial g^{*}({\varvec{\varphi }}).\)

**Property 1**

The proximity mapping of *g* is related to its subdifferential by the inclusion \(\{{\varvec{\theta }}-{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\}/\lambda \in \partial g\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\},\) which collapses to \(\nabla g\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\}\) when \(g \in \mathcal {C}^{1}.\) As a result, for any \(\lambda > 0,\) the minimiser of *g* verifies the fixed-point equation \({\varvec{\theta }}= {\text {prox}}^{\lambda }_{g}({\varvec{\theta }}).\)

**Property 2**

Proximity mappings are firmly non-expansive; that is, \(\Vert {\text {prox}}^{\lambda }_{g}({\varvec{\theta }})-{\text {prox}}^{\lambda }_{g}(\mathbf {u})\Vert ^{2} \le ({\varvec{\theta }}-\mathbf {u})^{T}\{{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})-{\text {prox}}^{\lambda }_{g}(\mathbf {u})\},\,\forall {\varvec{\theta }},\,\mathbf {u}\in \mathbb {R}^{n}.\)

**Property 3**

The proximity mappings of *g* and its conjugate \(g^{*}\) are related by Moreau’s decomposition formula: \({\varvec{\theta }}= {\text {prox}}_{g}^{\lambda }({\varvec{\theta }}) + \lambda {\text {prox}}^{1/\lambda }_{g^{*}}({\varvec{\theta }}/\lambda ).\)
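The three properties can be checked numerically on a case where the proximity mapping is known in closed form. The sketch below assumes the usual definition \({\text {prox}}^{\lambda }_{g}({\varvec{\theta }}) = \mathop {\hbox {argmin}}\limits _{\mathbf {u}}\, g(\mathbf {u}) + \Vert \mathbf {u}-{\varvec{\theta }}\Vert ^{2}/2\lambda \) and takes \(g = \Vert \cdot \Vert _{1},\) whose proximity mapping is soft-thresholding and whose conjugate is the indicator of the unit sup-norm ball; the choice of *g*, the random test points and the function names are illustrative assumptions.

```python
import numpy as np

def prox_l1(theta, lam):
    """Proximity mapping of g(u) = ||u||_1: soft-thresholding at level lam."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def prox_l1_conjugate(phi):
    """Proximity mapping of g*, the indicator of the unit sup-norm ball.

    The prox of an indicator function is the Euclidean projection onto
    the set, whatever the value of the step parameter.
    """
    return np.clip(phi, -1.0, 1.0)

rng = np.random.default_rng(0)
theta, u, lam = rng.normal(size=5), rng.normal(size=5), 0.7
p_theta, p_u = prox_l1(theta, lam), prox_l1(u, lam)

# Property 1: (theta - prox) / lam is a subgradient of ||.||_1 at prox,
# i.e. equal to the sign on nonzero coordinates and in [-1, 1] elsewhere
phi = (theta - p_theta) / lam
on_support = p_theta != 0
property_1 = (np.allclose(phi[on_support], np.sign(p_theta[on_support]))
              and np.all(np.abs(phi[~on_support]) <= 1.0 + 1e-12))

# Property 2: firm non-expansiveness
diff = p_theta - p_u
property_2 = diff @ diff <= (theta - u) @ diff + 1e-12

# Property 3: Moreau's decomposition
property_3 = np.allclose(theta, p_theta + lam * prox_l1_conjugate(theta / lam))

print(property_1, property_2, property_3)
```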

The simplest proximal method is the *proximal point algorithm*, given by the iteration

\[{\varvec{\theta }}^{k+1} = {\text {prox}}^{\lambda }_{g}({\varvec{\theta }}^{k}),\]

which can be interpreted as an implicit (backward) subgradient steepest descent to minimise *g*, i.e., \({\varvec{\theta }}^{k+1} = {\varvec{\theta }}^{k} - \lambda {\varvec{\varphi }},\) with \({\varvec{\varphi }}\in \partial g({\varvec{\theta }}^{k+1}).\) Alternatively, proximal point algorithms can also be interpreted as explicit (forward) gradient steepest descent to minimise the *Moreau envelope* of \(g,\, e_{\lambda } ({\varvec{\theta }}) = \inf _{\mathbf {u}\in \mathbb {R}^{n}}\, g(\mathbf {u}) + \Vert \mathbf {u}- {\varvec{\theta }}\Vert ^{2}/2\lambda ,\) a convex lower bound on *g* that by construction is continuously differentiable and has the same minimiser as *g*.
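The two interpretations of the proximal point algorithm can be compared numerically. The sketch below assumes the simple choice \(g = |\cdot |\) applied elementwise, for which the Moreau envelope is the Huber function and its gradient is \(\{{\varvec{\theta }}-{\text {prox}}^{\lambda }_{g}({\varvec{\theta }})\}/\lambda \); the function names and starting values are hypothetical.

```python
import numpy as np

def prox_abs(theta, lam):
    """Proximity mapping of g = |.| (elementwise soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def moreau_envelope_abs(theta, lam):
    """Moreau envelope of |.|, evaluated at the minimiser u = prox(theta)."""
    p = prox_abs(theta, lam)
    return np.abs(p) + (p - theta) ** 2 / (2.0 * lam)

def grad_moreau_envelope_abs(theta, lam):
    """Gradient of the envelope: (theta - prox(theta)) / lam."""
    return (theta - prox_abs(theta, lam)) / lam

lam = 0.5
theta_pp = np.array([3.2, -1.1, 0.3])   # proximal point iterates
theta_gd = theta_pp.copy()              # gradient descent on the envelope

for _ in range(10):
    theta_pp = prox_abs(theta_pp, lam)
    # Explicit gradient step of size lam on the Moreau envelope.
    theta_gd = theta_gd - lam * grad_moreau_envelope_abs(theta_gd, lam)

print(theta_pp, theta_gd)
```

Both sequences coincide and reach the minimiser of *g* (the origin), illustrating that the envelope shares the minimiser of *g* while being differentiable everywhere.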

Proximal point algorithms may appear of little relevance because evaluating \({\text {prox}}^{\lambda }_{g}\) can be as difficult as solving (8) in the first place (notice that (9) is a convex minimisation problem similar to (8)). Surprisingly, many advanced proximal optimisation methods can in fact be shown to be either applications of this simple algorithm, or closely related to it.

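Many proximal algorithms operate by splitting *g*, e.g.,

\[\hat{{\varvec{\theta }}}_{MAP} = \mathop {\mathrm {argmin}}_{{\varvec{\theta }}\in \mathbb {R}^{n}}\, g_{1}({\varvec{\theta }}) + g_{2}({\varvec{\theta }}), \tag{11}\]

where \(g_{1} \in \varGamma _{0}(\mathbb {R}^{n})\) is \(\beta \)-Lipschitz^{2} differentiable and \(g_{2} \in \varGamma _{0}(\mathbb {R}^{n}),\) possibly non-differentiable, has a proximity mapping that can be computed efficiently with a specialised algorithm. This decomposition is useful for instance in linear inverse problems, where \(g_{1}\) is often related to a Gaussian observation model involving linear operators and \(g_{2}\) to a log-prior promoting a parsimonious representation (e.g., sparsity on some appropriate dictionary, low-rankness) or enforcing convex constraints (e.g., positivity, positive definiteness). For models that admit this decomposition, it is possible to compute \(\hat{{\varvec{\theta }}}_{MAP}\) efficiently with a *forward*–*backward* algorithm, also known as the proximal gradient algorithm,

\[{\varvec{\theta }}^{k+1} = {\text {prox}}^{\lambda _{n}}_{g_{2}}\{{\varvec{\theta }}^{k} - \lambda _{n} \nabla g_{1}({\varvec{\theta }}^{k})\},\]

which, for step sizes \(\lambda _{n} \le 1/\beta ,\) converges with rate *O*(1/*k*). If the value of the Lipschitz constant \(\beta \) is unknown, \(\lambda _{n}\) can be found by line-search.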
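A minimal sketch of the forward–backward (proximal gradient) algorithm follows, assuming the illustrative choices \(g_{1}({\varvec{\theta }}) = \Vert \mathbf {y}- A{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2}\) and \(g_{2}({\varvec{\theta }}) = \alpha \Vert {\varvec{\theta }}\Vert _{1}.\) The matrix `A`, the simulated data and the fixed step size \(1/\beta \) are assumptions made for the example; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity mapping of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(y, A, alpha, sigma2=1.0, n_iter=500):
    """Proximal gradient iterations for g1 + g2 with a fixed step 1 / beta."""
    beta = np.linalg.norm(A, 2) ** 2 / sigma2   # Lipschitz constant of grad g1
    lam = 1.0 / beta
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - y) / sigma2                    # forward (gradient) step on g1
        theta = soft_threshold(theta - lam * grad, lam * alpha)  # backward (proximal) step on g2
    return theta

def objective(theta, y, A, alpha, sigma2=1.0):
    """g(theta) = g1(theta) + g2(theta)."""
    return np.sum((y - A @ theta) ** 2) / (2.0 * sigma2) + alpha * np.sum(np.abs(theta))

# Hypothetical sparse regression problem used only to exercise the code.
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 100))
theta_true = np.zeros(100)
theta_true[[5, 37, 80]] = [2.0, -3.0, 1.5]
y = A @ theta_true + 0.1 * rng.normal(size=40)

theta_hat = forward_backward(y, A, alpha=2.0)
print(objective(theta_hat, y, A, alpha=2.0))
```

With the step fixed at \(1/\beta \) the objective decreases monotonically; when \(\beta \) is unknown, the fixed step would be replaced by a backtracking line-search.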

Suppose instead that *g* admits a decomposition \(g({\varvec{\theta }}) = g_{1}({\varvec{\theta }}) + g_{2}(L{\varvec{\theta }})\) for some linear operator \(L \in \mathbb {R}^{p\times n},\,g_{1} \in \varGamma _{0}(\mathbb {R}^{n})\) strongly convex and \(g_{2} \in \varGamma _{0}(\mathbb {R}^{p})\) with efficient proximity mapping. In this case, the Fenchel–Rockafellar theorem states that \(\hat{{\varvec{\theta }}}_{MAP}\) can be computed by solving the dual problem (Bauschke and Combettes 2011, Chap. 19)

\[\hat{{\varvec{\theta }}}_{MAP} = \nabla g^{*}_{1}(-L^{T}\hat{{\varvec{\psi }}}), \qquad \hat{{\varvec{\psi }}} = \mathop {\mathrm {argmin}}_{{\varvec{\psi }}\in \mathbb {R}^{p}}\, g^{*}_{1}(-L^{T}{\varvec{\psi }}) + g^{*}_{2}({\varvec{\psi }}). \tag{14}\]

This *p*-dimensional problem can be solved iteratively with a forward–backward algorithm \({\varvec{\psi }}^{k+1} ={\text {prox}}^{\lambda _{n}}_{g^{*}_{2}}({\varvec{\psi }}^{k} - \lambda _{n} \nabla g^{*}_{1}(-L^{T}{\varvec{\psi }}^{k}))\) that can also be accelerated to converge with rate \(O(1/k^{2}),\) and where we note that the proximity mapping of \(g_{2}^{*}\) is typically evaluated by using Property 3, and that the strong convexity of \(g_{1}\) implies Lipschitz differentiability of \(g_{1}^{*}.\) Computing \(\hat{{\varvec{\theta }}}_{MAP}\) via (14) can lead to important computational savings, in particular if \(p \ll n\) or if \(g_{2}\) is separable and has a proximity mapping that can be computed in parallel for each element of \({\varvec{\theta }}\) (this is generally not possible for \(g_{2} \circ L\)). We refer the reader to Komodakis and Pesquet (2014) for an overview of recent dual and primal–dual algorithms and guidelines for parallel implementations.
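The dual forward–backward iteration can be sketched on a small one-dimensional total-variation denoising problem. The example assumes \(g_{1}({\varvec{\theta }}) = \Vert {\varvec{\theta }}- \mathbf {y}\Vert ^{2}_{2}/2,\) \(g_{2} = \alpha \Vert \cdot \Vert _{1}\) and *L* equal to the first-difference operator (called `D` in the code); these choices, the step size and the test signal are illustrative and not taken from the paper. Under these assumptions \(\nabla g^{*}_{1}({\varvec{\varphi }}) = {\varvec{\varphi }}+ \mathbf {y}\) and the proximity mapping of \(g^{*}_{2}\) is the projection onto \([-\alpha ,\alpha ]^{p}.\)

```python
import numpy as np

def tv1d_dual_forward_backward(y, alpha, n_iter=5000):
    """Dual forward-backward for min_theta 0.5*||theta - y||^2 + alpha*||D theta||_1."""
    n = y.size
    D = np.diff(np.eye(n), axis=0)   # (n-1, n) first differences, plays the role of L
    lam = 0.25                       # step 1/4, valid because ||D||^2 <= 4
    psi = np.zeros(n - 1)            # dual variable, one entry per difference
    for _ in range(n_iter):
        theta = y - D.T @ psi        # grad g1*(-D^T psi): current primal estimate
        # The gradient of psi -> g1*(-D^T psi) is -D theta, so the forward step
        # adds lam * D theta; the backward step projects onto the box.
        psi = np.clip(psi + lam * (D @ theta), -alpha, alpha)
    return y - D.T @ psi, psi

# Hypothetical noisy step signal.
rng = np.random.default_rng(4)
y = np.concatenate([np.zeros(10), np.ones(10)]) + 0.1 * rng.normal(size=20)

theta_hat, psi_hat = tv1d_dual_forward_backward(y, alpha=0.5)
theta_flat, _ = tv1d_dual_forward_backward(y, alpha=100.0)   # very strong penalty
print(np.round(theta_hat, 3))
```

The primal estimate is recovered from the dual variable through \(\nabla g^{*}_{1},\) so no proximity mapping of \(g_{2} \circ L\) is ever needed. With a very large \(\alpha \) the estimate collapses to the sample mean of \(\mathbf {y},\) as expected for this penalty.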

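Another widely used proximal algorithm is the *alternating direction method of multipliers* (ADMM), which operates by formulating (11) as a constrained optimisation problem

\[\mathop {\mathrm {argmin}}_{{\varvec{\theta }},\,\mathbf {u}\in \mathbb {R}^{n}}\, g_{1}({\varvec{\theta }}) + g_{2}(\mathbf {u}) \quad \text {subject to} \quad {\varvec{\theta }}= \mathbf {u}. \tag{16}\]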
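As an illustration of ADMM applied to the constrained form \(\min g_{1}({\varvec{\theta }}) + g_{2}(\mathbf {u})\) subject to \({\varvec{\theta }}= \mathbf {u},\) the sketch below uses the standard scaled-form updates with the assumed choices \(g_{1}({\varvec{\theta }}) = \Vert \mathbf {y}- A{\varvec{\theta }}\Vert ^{2}_{2}/2\) and \(g_{2}(\mathbf {u}) = \alpha \Vert \mathbf {u}\Vert _{1}.\) The penalty parameter `rho`, the matrix `A` and the simulated data are hypothetical, and this is not the SALSA implementation used later in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity mapping of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1_least_squares(y, A, alpha, rho=1.0, n_iter=500):
    """Scaled-form ADMM for min 0.5*||y - A theta||^2 + alpha*||u||_1 s.t. theta = u."""
    n = A.shape[1]
    theta = np.zeros(n)
    u = np.zeros(n)
    d = np.zeros(n)                       # scaled dual variable (multiplier / rho)
    lhs = A.T @ A + rho * np.eye(n)
    Aty = A.T @ y
    for _ in range(n_iter):
        # theta-update: proximity mapping of g1 (a linear solve).
        theta = np.linalg.solve(lhs, Aty + rho * (u - d))
        # u-update: proximity mapping of g2 (soft-thresholding).
        u = soft_threshold(theta + d, alpha / rho)
        # Dual update driven by the constraint violation theta - u.
        d = d + theta - u
    return theta, u

# Hypothetical well-conditioned regression problem.
rng = np.random.default_rng(3)
A = rng.normal(size=(60, 10)) / np.sqrt(60)
theta_true = np.zeros(10)
theta_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
y = A @ theta_true + 0.05 * rng.normal(size=60)
alpha = 0.1

theta_hat, u_hat = admm_l1_least_squares(y, A, alpha)
print(np.round(u_hat, 3))
```

Each iteration only requires the two proximity mappings separately, which is the point of the splitting: neither step needs to handle \(g_{1} + g_{2}\) jointly.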

More generally, suppose that *g* admits the decomposition \(g({\varvec{\theta }}) = \sum \nolimits _{m = 1}^{M}g_{m}(L_{m} {\varvec{\theta }})\) with \(g_{m} \in \varGamma _{0}(\mathbb {R}^{p_{m}})\) and \(L_{m} \in \mathbb {R}^{{p_{m}} \times n}\) such that the proximity mappings of \(g_{m}\) are easy to compute and \(Q = \sum \nolimits _{m = 1}^{M} L_{m}^{T} L_{m}\) is invertible. Then, in a manner akin to (16), we express (8) as

\[\mathop {\mathrm {argmin}}_{{\varvec{\theta }},\,\mathbf {u}_{1},\ldots ,\mathbf {u}_{M}}\, \sum \nolimits _{m = 1}^{M} g_{m}(\mathbf {u}_{m}) \quad \text {subject to} \quad \mathbf {u}_{m} = L_{m}{\varvec{\theta }},\quad m = 1,\ldots ,M,\]

which can be solved with iterations that can be parallelised with factor *M* at a coarse level (e.g., on a multi-processor system). Further parallelisation may be possible at a finer scale (e.g., on a vectorial processor such as a GPU or FPGA) by taking advantage of the structure of \({\text {prox}}^{\lambda }_{g_{m}}\) or by using specialised algorithms. This algorithm, known as the *simultaneous direction method of multipliers*, is also closely related to the ADMM, Douglas–Rachford and proximal point algorithms. Notice that splitting *g* not only allows the exploitation of parallel computer architectures, but may also significantly simplify the computation of proximity mappings; often \({\text {prox}}^{\lambda }_{g_{m}}\) has a closed-form expression. Lastly, it is worth mentioning that there are other modern proximal optimisation algorithms that can be massively parallelised, for example the *generalised forward backward* algorithm (Raguet et al. 2013), the *parallel proximal* algorithms (Combettes and Pesquet 2008; Pesquet and Pustelnik 2012), and the parallel primal–dual algorithm (Combettes and Pesquet 2012).

Finally, the main current topics of research in proximal optimisation include theory and methodology for: (1) randomised and stochastic algorithms that operate with estimators of gradients and proximity mappings to reduce computational complexity and allow for errors in the update rules, (2) adaptive and variable metric algorithms (e.g., Riemannian and Newton-type) that exploit the model’s geometry to improve convergence speed, and (3) proximal methods for non-convex problems. We anticipate that in the future new and stronger connections will emerge between proximal optimisation and stochastic simulation, in particular through developments in stochastic optimisation and high-dimensional MCMC sampling. For example, one connection is through the integration of modern stochastic convex optimisation and Markovian stochastic approximation (Combettes and Pesquet 2014; Andrieu et al. 2015), and of proximal optimisation and high-dimensional MCMC sampling (Pereyra 2015).

### 4.2 Convex relaxations

Modern proximal optimisation was greatly motivated by important theoretical results on the recovery of partially-observed sparse vectors and low-rank matrices through convex minimisation (Candès et al. 2006; Candès and Tao 2009) and on *compressive sensing* (Candès and Wakin 2008). A key idea underlying these works is that of approximating a combinatorial optimisation problem, whose solution is NP-hard, with a “relaxed” convex problem that is computationally tractable, and whose solution is in some sense close to the solution of the original problem. Reciprocally, the development of modern convex optimisation has in turn generated much interest in log-concave models, convex regularisers, and “convexifications” (i.e., convex relaxations for intractable or poorly tractable models) for statistical inference problems involving high-dimensionality, large datasets and computing time constraints (Chandrasekaran et al. 2012; Chandrasekaran and Jordan 2013).
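A standard instance of this idea, stated here for illustration with a hypothetical measurement matrix \(A\) and observation \(\mathbf {y},\) is sparse recovery. The combinatorial problem of finding the sparsest vector consistent with the data is replaced by its \(\ell _{1}\) relaxation,

\[\min _{{\varvec{\theta }}}\, \Vert {\varvec{\theta }}\Vert _{0} \ \ \text {subject to} \ \ \mathbf {y}= A{\varvec{\theta }}\qquad \longrightarrow \qquad \min _{{\varvec{\theta }}}\, \Vert {\varvec{\theta }}\Vert _{1} \ \ \text {subject to} \ \ \mathbf {y}= A{\varvec{\theta }},\]

where \(\Vert {\varvec{\theta }}\Vert _{0}\) counts the non-zero entries of \({\varvec{\theta }}.\) The relaxed problem is convex (it can be written as a linear programme) and, under suitable conditions on \(A,\) has the same solution as the original problem. The analogous relaxation for low-rank matrix recovery replaces the rank of a matrix by its nuclear norm.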

### 4.3 Illustrative example

*total-variation* Markov random field, \(\Vert \cdot \Vert _{1{-}2}\) denotes the composite \(\ell _{1} - \ell _{2}\) norm and \(\nabla _{d}\) is the discrete gradient operator that computes the vertical and horizontal differences between neighbour image pixels. This prior is log-concave and models the fact that differences between neighbouring image pixels are usually very small but occasionally take large values; it is arguably the most widely used prior in modern statistical image processing. The values of *H* and \(\sigma ^{2}\) are typically determined during the system’s calibration process and are here assumed known.

*majorisation*–*minimisation* strategy. To be precise, starting from some initial condition \({\varvec{\theta }}^{(0)},\) e.g., \({\varvec{\theta }}^{(0)} = \mathbf {y},\) we iteratively minimise the following sequence of strictly convex majorants (Oliveira et al. 2009)

*SALSA* (Afonso et al. 2011) implemented with \(g_{1} ({\varvec{\theta }}) = \Vert \mathbf {y}- H{\varvec{\theta }}\Vert ^{2}_{2}/2\sigma ^{2},\,g_{2}(\mathbf {u}) = \alpha _{eff}^{(t)}\Vert \nabla _{d} \mathbf {u}\Vert _{1{-}2},\) and the constraint \({\varvec{\theta }}= \mathbf {u}\) (though we could have also used other modern algorithms Pesquet and Pustelnik 2012; Combettes and Pesquet 2012; Raguet et al. 2013). To compute the proximity mapping of \(g_{1}\) we use the fact that *H* is block-circulant to compute matrix products and pseudo-inverses with the FFT algorithm. We compute the proximity mapping of \(g_{2}\) with a highly parallelised implementation of the specialised algorithm of Chambolle (2004).

Figure 3 presents a blurred and noisy observation \(\mathbf {y}\) of the popular “boats” image^{3} of size \(512 \times 512\) pixels, generated with a uniform \(9\times 9\) blur and a noise power of \(\sigma ^{2} = 0.5^{2}\) (blurred-signal-to-noise ratio \(BSNR = 10\log _{10}\{\Vert H{{\varvec{\theta }}}_{0}\Vert ^{2}_{2}/\sigma ^{2}\} = 40\) dB). Figure 4 shows the MAP estimate \(\hat{{\varvec{\theta }}}_{MAP}\) obtained by solving (21) using four iterations of (22) and a total of 51 ADMM iterations. We observe that this resolution enhancement process has produced a remarkably sharp image with very noticeable fine detail. Moreover, Fig. 5 shows the magnitude of the marginal \(90\,\%\) credibility regions for each pixel, as measured by the distance between the 5 and \(95\,\%\) quantile estimates. These estimates were computed using the proximal Metropolis-adjusted Langevin algorithm (Pereyra 2015), which is appropriate for high-dimensional densities that are not continuously differentiable. We observe in Fig. 5 that the uncertainty is mainly concentrated at the contours and object boundaries, revealing that the model is able to accurately detect the presence of sharp edges in the image but with some uncertainty about their exact location. Finally, Fig. 6 shows the convergence of the estimates \({\varvec{\theta }}^{(t,k)}\) produced by each ADMM iteration to \(\hat{{\varvec{\theta }}}_{MAP}\) (as measured by the mean squared error \(\Vert {\varvec{\theta }}^{(t,k)} - \hat{{\varvec{\theta }}}_{MAP} \Vert _{2}^{2}\)). Notice that computing \(\hat{{\varvec{\theta }}}_{MAP}\) only required 10 s (experiment conducted on an Apple MacBook Pro computer running Matlab 2013; a C++ implementation would certainly produce even faster results). This is remarkably fast given the high dimensionality of the problem (\(n = 262,144\)). By contrast, the computation of the credibility regions by MCMC sampling (20,000 samples with a thinning factor of 1000 to reduce the algorithm’s memory footprint) required 75 h.
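The per-pixel uncertainty summary described above (the distance between the 5 and \(95\,\%\) quantile estimates) can be computed from stored MCMC samples as sketched below. The array of samples is a synthetic stand-in generated from independent Gaussians, purely as an assumption for illustration; it is not output from the proximal Metropolis-adjusted Langevin algorithm, and the function name is hypothetical.

```python
import numpy as np

def credible_interval_width(samples, lower=0.05, upper=0.95):
    """Width of the marginal credible interval for each pixel.

    `samples` has shape (n_samples, height, width); the quantiles are taken
    over the first axis, i.e. independently for every pixel.
    """
    q_lo, q_hi = np.quantile(samples, [lower, upper], axis=0)
    return q_hi - q_lo

# Synthetic stand-in for thinned MCMC output: 2000 draws of a 4 x 4 "image"
# whose pixelwise standard deviation increases from 0.5 to 2.0.
rng = np.random.default_rng(2)
scale = np.linspace(0.5, 2.0, 16).reshape(4, 4)
samples = rng.normal(loc=0.0, scale=scale, size=(2000, 4, 4))

width = credible_interval_width(samples)
print(np.round(width, 2))
```

Because only marginal quantiles are needed, the samples can be thinned before storage, as done in the experiment above, at the cost of noisier quantile estimates.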

## 5 Discussion

### 5.1 Bayesian computation in the era of data science

As with other areas of statistical science, the Bayesian computation community has to decide whether data science is an opportunity or a threat. Inevitably if we do not treat it as an opportunity, it will become a threat. Thanks to the ubiquity of “big data” (as an over-hyped phrase mostly useful for attracting research funding, but also to at least some extent in reality), a new potentially multi-disciplinary field of data science is rapidly opening up. This field is attracting huge material resources, and will absorb much human talent. Statistical science has to be a part of this, for its own survival, but also for the sake of society. As Tim Harford has cogently argued (Harford 2014):

> Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives [...]. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing [...] or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naïve in data sets where spurious patterns vastly outnumber genuine discoveries.
>
> “Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.

It is a mistake to think that Bayes has no part to play in these developments, but more of us need to get more involved, and learn new tools, as in the way the Consensus Monte Carlo algorithm (Scott et al. 2013) exploits the Hadoop environment (White 2012) and the MapReduce programming model (Dean and Ghemawat 2008). Another direction that can prevent a potential schism between Bayesian modelling and highly complex models is to aim for modularity and local learning, that is, to abandon the goal of modelling big universes in favour of analysing a series of small worlds, in spite of the loss of coherence, and hence the compromise to the Bayesian paradigm, that this entails. The curious case of the cut models presented in Plummer (2015) is an illustration of the potential for developing partial-information Bayesian inference tools where “small is beautiful” because this is the only viable solution.

### 5.2 Do we care enough about applications?

Bayesian computation began in order to answer rather practical problems—how can we perform a Bayesian analysis of these data using this model?—or the corresponding meta-problems—how can Bayesian analysis be performed generally and reliably for this class of models? The focus was applied methodology (although since the methods were new, they tended to be published in premier theory/methodology journals). Because the research community wanted to understand (the advantages, performance and limitations of) the methods they were advocating, more theoretical work started to be conducted, and, for example, many probabilists were attracted to study the Markov chains that MCMC methodologists created. The centre of mass of research activity drifted away from the original motivations, just as has happened in other areas of mathematically-rigorous computation.

At the same time, those working with data became more ambitious with regard to the scale of data, the complexity of modelling and the sophistication of analysis, all factors that have in principle (and often in fact) stimulated new developments in Bayesian computation. But to a large extent this is a rich, self-stimulating and self-supporting area of research; new applications may or may not need new computational techniques, but new techniques don’t seem to need applications to justify themselves. It is apposite to ask to what extent cutting-edge computational methodology research is really delivering answers to the questions that application domains are posing, and to what extent it is successfully answering real questions.

We may not be unanimous about answers to these questions, except we can probably all agree they are “not entirely”. We will also disagree about how much this matters, but again there may be something to agree about, that we have failed if methodological innovations disconnect completely from applications. Legitimate differences in research goals partially explain the trend in this direction, but it is fair to say that there is a big communication problem between the computational statistics community and many of the communities where Bayesian computational methods are applied. Unfortunately people in these communities do not always keep up with the state of the art in computational statistics. At the same time, statisticians are often not aware of important developments arising in other fields. (ABC is a good illustration: it took more than 5 years of development within the population genetics community before statisticians became aware the technique existed and a few more years before they realised this was proper Bayesian inference applied on approximate models.) We can perhaps blame the fact that there are not enough people working at the interface of the different communities, but life at the interface is not easy because multidisciplinary and interdisciplinary research is often seen as “marginal” by both communities and is thus difficult to publish, communicate, etc. Then there are of course problems in dissemination, related to the different writing styles, journals, computing languages, software, etc. of each community.

We strongly encourage those developing new techniques always to find a way to disseminate them in such a way that at least *somebody* else could use them, preferably someone without the ability to have invented the technique for themselves!—and advocate, of course, that successful dissemination be properly rewarded in our career structures.

In a somewhat parallel path, we have seen over the past decades the emergence of new languages and meta-languages intended to bring the complexity both of problems and of solutions within reach of a wider audience of users. BUGS (Lunn et al. 2010) is the archetypal example of such languages and it has been successful to the extent that a large proportion of its users has a fairly limited statistical background and often even less of a computational background. However, the population of BUGS users and sympathisers is tiny compared to that of SAS or other corporate statistical systems. In this respect, we have failed to disseminate concepts like Bayesian analysis and wonderful tools like MCMC algorithms, because most people are unable to turn them into code by themselves. (Perusing one of the numerous statistics and machine-learning on-line forums like Cross Validated quickly exposes the methodological gap between academics and the masses!) It is unclear how novel programming developments like Stan (Stan Development Team 2014) are going to modify this picture, in that they still assume a decent understanding of both modelling and simulation issues. In that respect, network-based approaches such as those covered by BUGS seem more promising for “modelling locally to learn globally”. Similarly, ABC software is either too specific, like DIYABC (Cornuet et al. 2008) which addresses only population genetic questions, or too dependent on the ability of the modeller to program simulated outcomes from the model under study.

### 5.3 Anticipating the future

In which of the areas we discuss do we expect a particular emphasis of effort, or significant progress, or do we see particular needs for new efforts or new directions?

One expectation is that in the future computational methodologies will be more flexible and malleable. Over the past 25 years Bayesian modelling and inference techniques have been applied successfully to thousands of problems across a wide range of application domains. Each application brings its own constraints in terms of model dimensionality and complexity, data, inferences, accuracy and computing times. These constraints also vary significantly within specific applications. For example, in hyperspectral remote sensing, when a new Bayesian model is introduced it is often first explored and validated by MCMC sampling, then approximated with a variational Bayes method, and then approximated again so that it can be applied to gigabyte-large datasets by using optimisation techniques. Similarly, an interesting result revealed by a fast inference technique can be analysed more deeply with more reliable and accurate methods. Therefore we expect that in the future the different main computational methodologies will become more adaptable and that the boundaries between them will be less well defined, with many algorithms developed that combine simulation, variational approximations and optimisation. These will be able to handle a wide spectrum of models, degrees of accuracy and computing times, as well as models that have some parts that are simple but high-dimensional and others that are more complex but that only involve low-dimensional components. This can be achieved by using approximations and optimisation to improve stochastic sampling, by using simulation within deterministic algorithms to handle specific parts of the model that are difficult to compute analytically, or in completely new and original ways.

We also anticipate that computational methodologies will continue to be challenged by larger and larger datasets. There is of course a threat that the whole field turns into a library of machine-learning techniques, with limited validation on reference learning sets and a quick turnover of methods, which would both impoverish the field and fail to reach a general audience of practitioners. We must retain a sense of the stochastic elements in data collection, data analysis, and inference, recognising uncertainty in data and models, to preserve the inductive strength of data science: seeing beyond the data we have to what it might have been, what it might be next time, and where it came from.

^{1} Perhaps because ABC algorithms were first introduced in population genetics, the oxymoron ‘summary statistics’ is now prevalent in descriptions of these algorithms, including in the statistical literature, where the (linguistically sufficient) term ‘statistic’ would suffice.

^{2} \(g_{1} \in \mathcal {C}^{1}\) has \(\beta \)-Lipschitz continuous gradient if \(\Vert \nabla g_{1}({\varvec{\theta }}) -\nabla g_{1}(\mathbf {u})\Vert \le \beta \Vert {\varvec{\theta }}-\mathbf {u}\Vert , \,\forall ({\varvec{\theta }},\,\mathbf {u}) \in \mathbb {R}^{n} \times \mathbb {R}^{n}.\)

^{3} The boat image is available for download from the SIPI image database at http://sipi.usc.edu/database/database.php?volume=misc&image=38#top

## Acknowledgments

Supported in part by “SuSTaIn”, EPSRC Grant EP/D063485/1, at the University of Bristol, and “i-like”, EPSRC Grant EP/K014463/1, at the University of Warwick. Krzysztof Łatuszyński holds a Royal Society University Research Fellowship, and Marcelo Pereyra a Marie Curie Intra-European Fellowship for Career Development. Peter Green also holds a Distinguished Professorship at UTS, Sydney, and Christian Robert an Institut Universitaire de France chair at CEREMADE, Université Paris-Dauphine.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.