1 Introduction

High-dimensional inverse problems are ubiquitous in the data and imaging sciences, as well as in the physical and engineering sciences more generally. Due to limitations of the data observation process and measurement noise, or even just due to the nature of the problem at hand, most inverse problems encountered are seriously ill-conditioned or ill-posed (canonical examples include medical and radio interferometric imaging; Durmus et al. 2018; Cai et al. 2019; Zhou et al. 2020; Lunz et al. 2021). Developing better methodology for solving challenging inverse problems is a significant focus of the community. The Bayesian statistical framework is currently one of the predominant frameworks to perform inference in inverse problems (Robert and Casella 2004; Pereyra et al. 2016). The choice of the Bayesian model used has a profound impact on the solutions delivered, as alternative models can lead to significantly different point estimates and uncertainty quantification results.

In this article we develop methodology to objectively compare alternative Bayesian models in performing inference in the regime of high-dimensional inverse problems, directly from the observed data and in the absence of ground truth. Motivated by applications in computational imaging, we focus on the comparison of models with posterior distributions that are log-concave and potentially not smooth. In this context, model selection has been traditionally addressed through benchmark experiments involving ground truth data and expert supervision. However, for many applications it is difficult and expensive to produce reliable ground truth data. Moreover, for many problems it is simply impossible. Bayesian model selection provides a framework for selecting the most appropriate model directly from the observed data in an objective manner and without reference to ground truth data.

Bayesian model selection requires the computation of the marginal likelihood of the data – the average likelihood of a model over its prior probability space – which is also called the Bayesian evidence. This quantity is a key ingredient of model selection statistics such as Bayes factors and likelihood ratio tests (Robert 2007). The computation of the marginal likelihood for high-dimensional models is highly non-trivial because it requires the computation of integrals over the (high-dimensional) solution space. For example, in the context of Bayesian imaging problems, the dimension is given by the number of parameters (e.g. pixels) of interest, which frequently reaches sizes of \({\mathcal {O}}(10^5)\) to \({\mathcal {O}}(10^6)\) and beyond. For such settings, the evaluation of the marginal likelihood has been previously considered to be computationally intractable.

Broadly speaking, general purpose Monte Carlo methods can only handle model selection tasks for problems of dimension \({\mathcal {O}}(10)\) to \({\mathcal {O}}(10^2)\) (for reviews see Clyde et al. 2007; Friel and Wyse 2012; Llorente et al. 2020). Nested sampling (Skilling 2006), a state-of-the-art Monte Carlo strategy designed specifically for model selection, has enabled model selection for moderate dimensional problems of size \({\mathcal {O}}(10^2)\) to \({\mathcal {O}}(10^3)\) (Mukherjee et al. 2006; Feroz and Hobson 2008; Feroz et al. 2009; Brewer et al. 2011; Feroz and Skilling 2013; Handley et al. 2015). To the best of our knowledge, model selection for larger problems is currently possible only for models with very specific structures (e.g., conditionally Gaussian models; Harroue 2020).

In this work, we address the difficult computation of the marginal likelihood by proposing a new methodology that carefully integrates nested sampling (Skilling 2006) with proximal Markov chain Monte Carlo (MCMC) (Pereyra 2016; Durmus et al. 2018). This leads to a proximal nested sampling methodology specialised for comparing high-dimensional posterior distributions that are log-concave but potentially not smooth. The proposed approach can be applied computationally to log-concave models of dimension \({\mathcal {O}}(10^6)\) and beyond, making it suitable for model comparison in Bayesian imaging problems. We demonstrate the approach with a range of scientific imaging applications.

The remainder of the article is organised as follows. In Sect. 2 we recall the Bayesian model selection approach, highlight the associated computational challenges, and discuss proximal MCMC methodology for Bayesian computation for inverse problems with an underlying convex geometry. Section 3 recalls the standard nested sampling method. Our proposed proximal nested sampling framework is presented in general form in Sect. 4. In Sect. 5 explicit forms of proximal nested sampling are presented for common forms of the likelihood and prior that arise in imaging sciences. Experimental results validating the proposed method and showcasing its use in scientific imaging applications are reported in Sect. 6. Finally, we conclude in Sect. 7.

2 Bayesian inference for high-dimensional inverse problems

In this section we briefly recall the Bayesian decision-theoretic approach to model comparison, introduce some elements of convex analysis which are essential for our method, and review proximal MCMC methods, which are an important component of the proximal nested sampling methodology proposed in Sect. 4. We conclude the section by briefly explaining the computational difficulties encountered in high-dimensional Bayesian model selection and why it is necessary to develop new methodology for this task. Readers familiar with Bayesian model selection and with proximal MCMC methodology may prefer to skip this section and continue reading from Sect. 3.

2.1 Bayesian estimation and model selection

Let \(\Omega \subseteq {\mathbb {R}}^d\). We consider the estimation of a quantity of interest \(x \in \Omega \) from observed data y. Bayesian methods address such problems by postulating a statistical model \({{\mathcal {M}}}\) relating x and y, from which estimators of x and other inferences can be derived. More precisely, \({{\mathcal {M}}}\) defines a joint probability distribution \(p(x, y \vert {{\mathcal {M}}})\) specified via the decomposition \(p(x, y \vert {{\mathcal {M}}}) = p(y \vert x, \mathcal{M}) p(x\vert {{\mathcal {M}}})\), where \(p(y \vert x, {{\mathcal {M}}})\) denotes the likelihood of x for the observed data y, and the marginal \(p(x\vert {{\mathcal {M}}})\) is the so-called prior of x. Following Bayes’ theorem, inferences on \(x\vert y\) are then based on the posterior distribution

$$\begin{aligned} {p}(x \vert y, {{\mathcal {M}}}) = \frac{{p}(y \vert x, {{\mathcal {M}}}) {p}(x \vert {{\mathcal {M}}})}{{p}(y \vert {{\mathcal {M}}})}, \end{aligned}$$
(1)

which models our beliefs about x after observing y. With applications in Bayesian imaging sciences in mind, we focus on posterior distributions that are log-concave and assume that the potential function \(x \mapsto -\log {p}(x \vert y, {{\mathcal {M}}})\) is convex lower semicontinuous (l.s.c.) on \(\Omega \), but possibly not smooth. This is an important class of models in modern Bayesian imaging sciences because it leads to point estimators that are by construction well-posed and that can be efficiently estimated by using scalable proximal convex optimisation and stochastic sampling methods (Kaipio and Somersalo 2005; Robert and Casella 2004; Pereyra et al. 2016).

We condition on \({{\mathcal {M}}}\) explicitly in (1) because our focus is model selection, where one entertains several alternative posterior distributions for \(x \vert y\) stemming from different underlying modelling assumptions. As a result, rather than the posterior \({p}(x \vert y, {{\mathcal {M}}})\), our main object of interest is the marginal likelihood or model evidence

$$\begin{aligned} {p}(y \vert {{\mathcal {M}}}) = \int _{\Omega } p(y,x \vert {{\mathcal {M}}}) \text {d}x = \int _{\Omega } p(y \vert x, {{\mathcal {M}}}) p(x \vert {{\mathcal {M}}}) \text {d}x,\nonumber \\ \end{aligned}$$
(2)

which measures the likelihood of the observed data under model \({{\mathcal {M}}}\), and which we use to objectively compare different models relating x and y (Robert 2007). Notice that the likelihood of the observed data y under the model \({{\mathcal {M}}}\) is essentially the expectation (or average value) of the likelihood function \(p(y \vert x,{{\mathcal {M}}})\) with respect to (w.r.t.) the prior \(p(x \vert \mathcal{M})\). Therefore, a model that allocates its prior mass to solutions that agree with the observed data achieves a large marginal likelihood value. Conversely, a low marginal likelihood value indicates that only a small proportion of the solutions favoured by the prior agree with the observed data. In other words, the marginal likelihood (2) measures the degree to which the observed data is in agreement with the assumptions of the model, and in doing so it provides a goodness-of-fit summary. Moreover, because all priors have the same total probability mass (i.e., \(\int _\Omega p(x) \text {d}x = 1\)), the marginal likelihood (2) naturally incorporates Occam’s razor, trading off model simplicity and accuracy and penalising over-fitting (Robert 2007).
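The interpretation of (2) as a prior average of the likelihood can be illustrated on a toy problem. The sketch below is not part of the methodology of this article: it assumes a hypothetical scalar model with Gaussian likelihood \(y \vert x \sim {\mathcal {N}}(x, \sigma ^2)\) and Gaussian prior \(x \sim {\mathcal {N}}(0, \tau ^2)\), for which the marginal likelihood is available in closed form as \({\mathcal {N}}(y; 0, \sigma ^2 + \tau ^2)\). All function and variable names are illustrative.

```python
import numpy as np

def log_gaussian(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def prior_average_evidence(y, sigma2, tau2, n_samples, rng):
    """Estimate p(y | M) as the average likelihood over draws from the prior."""
    x = rng.normal(0.0, np.sqrt(tau2), size=n_samples)   # x ~ N(0, tau2)
    return np.mean(np.exp(log_gaussian(y, x, sigma2)))   # average of p(y | x)

rng = np.random.default_rng(0)
y_obs, sigma2 = 1.0, 0.5 ** 2

# Hypothetical model 1: prior mass concentrated near solutions that fit the data.
z1_mc = prior_average_evidence(y_obs, sigma2, tau2=1.0, n_samples=200_000, rng=rng)
z1_exact = np.exp(log_gaussian(y_obs, 0.0, sigma2 + 1.0))

# Hypothetical model 2: very diffuse prior, most of its mass disagrees with the data.
z2_mc = prior_average_evidence(y_obs, sigma2, tau2=100.0, n_samples=200_000, rng=rng)
z2_exact = np.exp(log_gaussian(y_obs, 0.0, sigma2 + 100.0))

print(f"model 1: Monte Carlo {z1_mc:.4f}, closed form {z1_exact:.4f}")
print(f"model 2: Monte Carlo {z2_mc:.4f}, closed form {z2_exact:.4f}")
```

In this one-dimensional setting the diffuse prior is penalised with a smaller marginal likelihood, in line with the Occam's razor argument above. Averaging over prior draws in this way is only viable in very low dimension, which is precisely the difficulty discussed in the remainder of this section.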

Bayesian model selection arises from the common and natural inquiry of which model is the most suitable to analyse \(x \vert y\) from a set of models \({{\mathcal {M}}}_1,\ldots ,{{\mathcal {M}}}_K\) available. For simplicity and without loss of generality, we suppose two alternative models \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_2\) (the generalisation to additional models is straightforward). From Bayesian decision theory, to objectively compare the two models in settings without ground truth available, one should calculate the Bayes factor (Robert 2007)

$$\begin{aligned} \rho _{12} = \frac{{p}({{\mathcal {M}}}_1 \vert y)}{{p}({{\mathcal {M}}}_2 \vert y)}\frac{{p}({{\mathcal {M}}}_2)}{{p}({{\mathcal {M}}}_1)} \end{aligned}$$
(3)

where \({p}({{\mathcal {M}}}_1)\) and \({p}({{\mathcal {M}}}_2)\) denote the prior probabilities assigned to the two competing models, and where, from Bayes’ theorem, we have that for any \(i \in \{1,2\}\)

$$\begin{aligned} {p}({{\mathcal {M}}}_i \vert y) = \frac{p(y \vert {{\mathcal {M}}}_i) {p}(\mathcal{M}_i)}{p(y \vert {{\mathcal {M}}}_1) {p}({{\mathcal {M}}}_1) + p(y \vert {{\mathcal {M}}}_2) {p}({{\mathcal {M}}}_2)}\,. \end{aligned}$$
(4)

By developing (3) we can easily express the Bayes factor as the likelihood ratio

$$\begin{aligned} \rho _{12} = \frac{{p}(y \vert {{\mathcal {M}}}_1)}{{p}(y \vert {{\mathcal {M}}}_2)}, \end{aligned}$$
(5)

highlighting that \(\rho _{12}\) is invariant to the choice of the prior probabilities \({p}({{\mathcal {M}}}_1)\) and \({p}({{\mathcal {M}}}_2)\). If one assumes \({p}({{\mathcal {M}}}_1)={p}({{\mathcal {M}}}_2) = 1/2\) to reflect the absence of prior information, then the factor also coincides with the posterior probability ratio \(p({{\mathcal {M}}}_1 \vert y)/p({{\mathcal {M}}}_2 \vert y)\).
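For completeness, the step from (3) to (5) can be written out explicitly. The derivation below only substitutes (4) into (3) and uses no additional notation: the denominators of (4) are identical for both models and cancel, after which the prior model probabilities cancel as well.

```latex
$$\begin{aligned}
\rho _{12}
&= \frac{p({\mathcal {M}}_1 \vert y)}{p({\mathcal {M}}_2 \vert y)} \, \frac{p({\mathcal {M}}_2)}{p({\mathcal {M}}_1)} \\
&= \frac{p(y \vert {\mathcal {M}}_1) \, p({\mathcal {M}}_1)}{p(y \vert {\mathcal {M}}_2) \, p({\mathcal {M}}_2)} \, \frac{p({\mathcal {M}}_2)}{p({\mathcal {M}}_1)} \\
&= \frac{p(y \vert {\mathcal {M}}_1)}{p(y \vert {\mathcal {M}}_2)} .
\end{aligned}$$
```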

Being a likelihood ratio, the factor \(\rho _{12}\) is straightforward to read: if \(\rho _{12} \gg 1\), we prefer model \({{\mathcal {M}}}_1\) over the alternative \({{\mathcal {M}}}_2\); conversely, if \(\rho _{12} \ll 1\), we prefer model \({{\mathcal {M}}}_2\); and if \(\rho _{12} \approx 1\), we do not prefer either, inasmuch as the data y are insufficient for us to make an informed judgement. The fact that \(\rho _{12}\) is a likelihood ratio is also appealing from a frequentist viewpoint, as it is associated with the most powerful test for these two model hypotheses (Casella and Berger 2002).
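In practice, marginal likelihoods of high-dimensional models are extremely small numbers, so it is common to work with their logarithms. The following sketch, with hypothetical log-evidence values chosen purely for illustration, shows how the Bayes factor and the posterior model probabilities (4) could be evaluated in the log domain. The function names are assumptions and not part of any library.

```python
import numpy as np

def log_bayes_factor(log_z1, log_z2):
    """Logarithm of rho_12: difference of the two log marginal likelihoods."""
    return log_z1 - log_z2

def posterior_model_probabilities(log_z, log_prior):
    """Posterior model probabilities, computed stably from log-evidences."""
    log_joint = np.asarray(log_z, dtype=float) + np.asarray(log_prior, dtype=float)
    log_joint = log_joint - np.max(log_joint)   # shift to avoid underflow in exp
    weights = np.exp(log_joint)
    return weights / np.sum(weights)

# Hypothetical log marginal likelihoods for two competing models.
log_z = [-1050.2, -1053.7]
log_rho12 = log_bayes_factor(log_z[0], log_z[1])
probs = posterior_model_probabilities(log_z, np.log([0.5, 0.5]))

print(f"log Bayes factor: {log_rho12:.2f}")
print(f"posterior model probabilities: {probs}")
```

With equal prior model probabilities, the ratio of the two posterior probabilities equals the Bayes factor, as noted above. Exponentiating the log-evidences directly would underflow to zero in double precision, which is why the maximum is subtracted first.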

Unfortunately, calculating \(\rho _{12}\) is generally not possible in large-scale settings because the dimensionality of x renders the marginal likelihoods \({p}(y \vert {{\mathcal {M}}}_1)\) and \({p}(y \vert {{\mathcal {M}}}_2)\) computationally intractable. More precisely, the marginal likelihoods are doubly-intractable because they require computing two intractable integrals over the space of solutions \(\Omega \): the marginalisation of x denoted explicitly in (2); and the normalising constant of the priors \(p(x \vert {\mathcal {M}}_i)\) when these are not available analytically, which otherwise implicitly also requires integrating over \(\Omega \).

It is worth emphasising at this point that this major difficulty related to model selection is not encountered when performing inferences with the posteriors \(p(x \vert y,{\mathcal {M}}_1)\) and \(p(x \vert y,{\mathcal {M}}_2)\) individually, as one can use MCMC methods to sample from \(p(x \vert y,{\mathcal {M}})\) without ever having to evaluate the marginal likelihood \(p(y\vert {\mathcal {M}})\). As a result, efficient Bayesian model selection remains an open problem in many areas of science and engineering that have widely adopted Bayesian inference techniques for point estimation and uncertainty quantification.

In the following we briefly recall MCMC sampling methods derived from the overdamped Langevin diffusion process, particularly proximal MCMC techniques specialised for large models that are log-concave, and explain why it is necessary to modify them to enable efficient model comparison.

2.2 Bayesian computation and proximal MCMC methods

2.2.1 Convex analysis

Let \(f : {\mathbb {R}}^d \rightarrow \left[ -\infty , +\infty \right] \). The function f is said to be proper if there exists \(x_0 \in {\mathbb {R}}^d\) such that \(f(x_0) < +\infty \). Denote for all \(M \in {\mathbb {R}}\), \(\{ f \le M \} = \{ z \in {\mathbb {R}}^d \ \vert \ f(z) \le M \}\). The function f is l.s.c. if for all \(M \in {\mathbb {R}}\), \(\{ f \le M \}\) is a closed subset of \({\mathbb {R}}^d\). For \(k \ge 0\), denote by \({{\mathcal {C}}}^k({\mathbb {R}}^d)\) the set of k-times continuously differentiable functions. For \(f \in \mathcal{C}^1({\mathbb {R}}^d)\), denote by \(\nabla f\) the gradient of f. We say that \(f \in {{\mathcal {C}}}^1({\mathbb {R}}^d)\) is a Lipschitz continuously differentiable function if there exists \(C \ge 0\) such that for all \(x,y \in {\mathbb {R}}^d\), \(\Vert \nabla f(x) - \nabla f(y)\Vert \le C \Vert x-y\Vert \).

Given a convex, proper, l.s.c. function \(h: {\mathbb {R}}^d \rightarrow (-\infty , +\infty ]\) and \(\lambda > 0\), the proximal operator (Bauschke and Combettes 2011) associated with function h at \(x \in {\mathbb {R}}^d\) is defined as

$$\begin{aligned} \text {prox}^{\lambda }_h(x) = \mathop {\mathrm{argmin}}_{u\in {\mathbb {R}}^d} \big \{h(u) + \Vert u - x\Vert _2^2/2\lambda \big \}. \end{aligned}$$
(6)

When \(\lambda =1\), we denote \(\text {prox}^{1}_h(x)\) by \(\text {prox}_h(x)\) for simplicity.

Let \({{\mathcal {K}}}\) be a closed convex set in \({\mathbb {R}}^d\) and let \(\chi _{{\mathcal {K}}}\) be the characteristic function for \({\mathcal {K}}\), defined by \(\chi _{{\mathcal {K}}}(x) = 0\) if \(x \in {\mathcal {K}}\) and \(+\infty \) otherwise. The proximal operator of \(\chi _{{\mathcal {K}}}\) is the projection onto \({{\mathcal {K}}}\), given by

$$\begin{aligned} \text {proj}_{{{\mathcal {K}}}}(x) = \mathop {\mathrm{argmin}}_{u\in {\mathbb {R}}^d} \big \{\chi _{{\mathcal {K}}}(u) + \Vert u - x\Vert _2^2/2 \big \}\, . \end{aligned}$$
(7)
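Two proximal operators with well-known closed forms may help fix ideas: for \(h = \theta \Vert \cdot \Vert _1\), definition (6) reduces to componentwise soft-thresholding with threshold \(\lambda \theta \), and for the characteristic function of a Euclidean ball, (7) reduces to a radial rescaling. The sketch below is illustrative only; the function names and test vectors are assumptions. It also checks numerically that the soft-thresholded point attains a lower value of the objective in (6) than nearby perturbed points.

```python
import numpy as np

def prox_l1(x, lam=1.0, theta=1.0):
    """prox^lam_h for h(u) = theta * ||u||_1: soft-thresholding at lam * theta."""
    return np.sign(x) * np.maximum(np.abs(x) - lam * theta, 0.0)

def proj_l2_ball(x, radius=1.0):
    """Projection onto the closed Euclidean ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def prox_objective(h, u, x, lam):
    """Objective minimised in (6): h(u) + ||u - x||^2 / (2 lam)."""
    return h(u) + np.sum((u - x) ** 2) / (2.0 * lam)

x = np.array([3.0, -0.2, 0.7, -1.5])
lam = 0.5
p = prox_l1(x, lam)

# Sanity check: p should not be beaten by randomly perturbed candidates.
l1_norm = lambda u: np.sum(np.abs(u))
rng = np.random.default_rng(0)
best = prox_objective(l1_norm, p, x, lam)
others = [prox_objective(l1_norm, p + 0.1 * rng.standard_normal(x.size), x, lam)
          for _ in range(100)]
is_minimiser = all(best <= value for value in others)

projected = proj_l2_ball(np.array([3.0, 4.0]))   # point outside the unit ball
print(p, is_minimiser, projected)
```

Both operators cost a single pass over the vector, which is what makes proximal methods attractive at the problem sizes considered in this article.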

The convex conjugate of function h, denoted by \(h^*\), is defined as

$$\begin{aligned} h^*(x) = \sup _{u\in {\mathbb {R}}^d} \big \{x^\top u - h(u) \big \}. \end{aligned}$$
(8)

Its proximal operator can be related to the proximal operator of h by

$$\begin{aligned} \text {prox}_{h^*}(x) = x - \text {prox}_{h}(x). \end{aligned}$$
(9)
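Identity (9) can be checked numerically on a standard example. For \(h = \Vert \cdot \Vert _1\), the convex conjugate \(h^*\) is the characteristic function of the unit \(\ell _\infty \) ball, so its proximal operator is a componentwise clipping to \([-1, 1]\), while the proximal operator of h is soft-thresholding at 1. The following sketch, with illustrative function names, compares both sides of (9) on a random vector.

```python
import numpy as np

def prox_l1(x):
    """prox_h for h = ||.||_1 with lambda = 1: soft-thresholding at 1."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def prox_l1_conjugate(x):
    """prox_{h*}: projection onto the unit l_inf ball, i.e. clipping to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(1)
x = 3.0 * rng.standard_normal(8)

lhs = prox_l1_conjugate(x)      # left-hand side of (9)
rhs = x - prox_l1(x)            # right-hand side of (9)
identity_gap = np.max(np.abs(lhs - rhs))
print(f"largest discrepancy between both sides of (9): {identity_gap:.2e}")
```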

The \(\lambda \)-Moreau-Yosida envelope of h (Bauschke and Combettes 2011) is given for any \(x \in {\mathbb {R}}^d\) and \(\lambda > 0\) by

$$\begin{aligned} h^{\lambda }(x) = \min _{u\in {\mathbb {R}}^d} \big \{h(u) + \Vert u - x\Vert _2^2/2\lambda \big \} . \end{aligned}$$
(10)

The envelope \(h^{\lambda }\) is continuously differentiable with Lipschitz gradient. In particular, using the proximal operator, the gradient of \(h^{\lambda }\) can be written

$$\begin{aligned} \nabla h^{\lambda }(x) = \big (x - \text {prox}^{\lambda }_h(x)\big ) / \lambda , \end{aligned}$$
(11)

with \(\lambda \) simultaneously controlling the Lipschitz constant of \(\nabla h^{\lambda }\) as well as the error between h and its smooth approximation \(h^{\lambda }\). This approximation error can be made arbitrarily small by reducing \(\lambda \), at the expense of deteriorating the regularity of \(\nabla h^{\lambda }\), and consequently the speed of convergence of proximal Bayesian computation algorithms that rely on \(h^{\lambda }\).
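The trade-off controlled by \(\lambda \) can be made concrete for the scalar function \(h(u) = \vert u \vert \), whose Moreau-Yosida envelope is the Huber function. The sketch below is illustrative only, with assumed function names. It evaluates (10) at the minimiser given by the proximal operator, measures the largest gap between h and \(h^{\lambda }\) on a grid (which equals \(\lambda /2\) for this example), and compares the gradient formula (11) against a central finite difference.

```python
import numpy as np

def prox_abs(x, lam):
    """prox^lam_h for h(u) = |u|: soft-thresholding at lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def envelope_abs(x, lam):
    """Moreau-Yosida envelope (10) of |.|, evaluated at its minimiser."""
    u = prox_abs(x, lam)
    return np.abs(u) + (u - x) ** 2 / (2.0 * lam)

def grad_envelope_abs(x, lam):
    """Gradient of the envelope computed through identity (11)."""
    return (x - prox_abs(x, lam)) / lam

x = np.linspace(-2.0, 2.0, 401)
eps = 1e-6
results = {}
for lam in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(x) - envelope_abs(x, lam))       # approximation error
    fd = (envelope_abs(x + eps, lam) - envelope_abs(x - eps, lam)) / (2.0 * eps)
    grad_err = np.max(np.abs(fd - grad_envelope_abs(x, lam)))
    results[lam] = (gap, grad_err)
    print(f"lambda={lam}: max gap {gap:.4f}, gradient Lipschitz constant {1.0 / lam:.0f}")
```

Reducing \(\lambda \) from 1 to 0.01 shrinks the approximation error by a factor of 100 while increasing the Lipschitz constant of the gradient by the same factor.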

2.2.2 Proximal Langevin MCMC sampling

Consider the problem of calculating probabilities or expectations with respect to (w.r.t.) some distribution \(\pi (\text {d}x)\) which admits a density \(\pi (x)\) w.r.t. the usual d-dimensional Lebesgue measure. In the context of Bayesian inference, this is typically the posterior \(p(x \vert y,{\mathcal {M}})\). Evaluating expectations and probabilities w.r.t. \(\pi \) is non-trivial in problems of moderate and high dimension because of the integrals involved, which are usually beyond the scope of analytical and deterministic numerical integration schemes. These calculations are further complicated when the normalising constant of \(\pi \) is not known, as this requires evaluating an additional d-dimensional integral. Monte Carlo sampling methods address these difficulties by simulating a set of samples from \(\pi \) followed by Monte Carlo stochastic integration to compute probabilities and expectations w.r.t. \(\pi \). While there are different ways of simulating samples from \(\pi \), we focus on MCMC strategies where one proceeds by constructing a Markov chain that has \(\pi \) as its invariant stationary distribution. Again, there are different methods for constructing such Markov chains (see Robert and Casella 2004 for an excellent introduction to MCMC methodology and Green et al. 2015 for a survey of recent developments in the Bayesian computation literature).

The fastest provably convergent MCMC methods for Bayesian inference models can be derived from the Langevin diffusion process, which we recall below. For simplicity, rather than presenting the approach in full generality, we focus our presentation on proximal overdamped Langevin sampling for non-smooth models, which we later use in the proximal nested sampling method proposed in Sect. 4. For a more exhaustive introduction to the topic please see Vargas et al. (2020, Section 2) and references therein.

Assume that \(\pi \) admits a decomposition \(\pi (x) \propto \exp \{-f(x) - g(x)\}\) for all \(x \in {\mathbb {R}}^d\), where \(f \in {\mathcal {C}}^1({\mathbb {R}}^d)\) with \(\nabla f\) Lipschitz continuous with constant \(L_f\), and where g is a proper l.s.c. function that is convex on \({\mathbb {R}}^d\) but potentially non-smooth (e.g., g could encode constraints on the solution space and involve non-smooth regularisers such as the \(\ell _1\) norm). To simulate from \(\pi \), we construct the overdamped Langevin stochastic differential equation (SDE) on \({\mathbb {R}}^d\) given by Durmus et al. (2018)

$$\begin{aligned} \text {d}X_t = -[\nabla f(X_t) +\nabla g^\lambda (X_t)] \text {d}t + \sqrt{2} \text {d}W_t, \quad X_0 = x_0,\nonumber \\ \end{aligned}$$
(12)

where \((W_t)_{t\ge 0}\) is a d-dimensional Brownian motion, \(g^\lambda \) is the Moreau-Yosida envelope of g given by (10), \(\lambda > 0\) is a smoothing parameter that we will discuss later, and \(x_0 \in {\mathbb {R}}^d\). When \(x \mapsto f(x)+g^\lambda (x)\) is convex, the SDE has a unique strong solution and \(X_t\) converges exponentially fast (as \(t \rightarrow \infty \)) to an invariant measure that is in the neighbourhood of \(\pi \).

To use (12) for Bayesian computation, we use a numerical solver to compute a discrete-time approximation of \(X_t\) over some time period \(t \in [0,T]\); the resulting discrete sample path constitutes our set of Monte Carlo samples. In particular, in this article we use the conventional Euler-Maruyama approximation

$$\begin{aligned} X_{n+1} = X_{n} - \frac{\delta }{2}\nabla f(X_{n}) - \frac{\delta }{2}\nabla g^\lambda (X_{n}) + \sqrt{\delta } Z_{n+1}, \end{aligned}$$
(13)

where \(\delta \in (0,1/(L_f + 1/\lambda )]\) is a given stepsize and \((Z_{n})_{n\ge 1}\) is a sequence of i.i.d. d-dimensional standard Gaussian random variables. This MCMC method is known as the Moreau-Yosida unadjusted Langevin algorithm (MYULA) (Durmus et al. 2018). The Markov chain (13) is usually implemented by using (11) and reads

$$\begin{aligned} X_{n+1} = X_{n} - \frac{\delta }{2} \nabla f (X_n)-\frac{\delta }{2\lambda }\left( X_{n}-\text {prox}_{g}^{\lambda }(X_{n})\right) + \sqrt{\delta } Z_{n+1}.\nonumber \\ \end{aligned}$$
(14)

The smoothing parameter \(\lambda \) and the stepsize \(\delta \) jointly control a bias-variance trade-off between the asymptotic estimation errors and non-asymptotic errors associated with using a finite number of iterations. In this article, we use \(\lambda = 1/L_f\) and \(\delta = 0.8/(L_f + 1/\lambda )\) as recommended in Durmus et al. (2018), where \(L_f\) is the Lipschitz constant of \(\nabla f\) (see Durmus et al. 2018; Vargas et al. 2020 for further details).

The samples generated by (14) can be directly used for biased Monte Carlo estimation (Durmus et al. 2018). Alternatively, at the expense of additional computation, one can supplement each iteration of MYULA with an MH (Metropolis-Hastings) correction step to asymptotically remove the approximation errors related to the discretisation of the SDE and the use of \(g^\lambda \) instead of g, leading to a type of Metropolis-adjusted Langevin algorithm (MALA) (see Pereyra 2016 for details).
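As a minimal sketch of recursion (14), consider a hypothetical toy denoising model with \(f(x) = \Vert y - x\Vert _2^2/(2\sigma ^2)\) and \(g(x) = \theta \Vert x\Vert _1\), so that \(L_f = 1/\sigma ^2\) and \(\text {prox}^{\lambda }_g\) is soft-thresholding at \(\theta \lambda \). The model, the function names, and the burn-in length are assumptions made for illustration; the choices of \(\lambda \) and \(\delta \) follow the values quoted above.

```python
import numpy as np

def prox_l1(x, threshold):
    """Soft-thresholding: proximal operator of threshold * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def myula(y, sigma, theta, n_samples, rng):
    """MYULA recursion (14) for a toy model with
    f(x) = ||y - x||^2 / (2 sigma^2) and g(x) = theta * ||x||_1."""
    lip_f = 1.0 / sigma ** 2              # Lipschitz constant L_f of grad f
    lam = 1.0 / lip_f                     # smoothing parameter lambda = 1 / L_f
    delta = 0.8 / (lip_f + 1.0 / lam)     # stepsize delta = 0.8 / (L_f + 1 / lambda)
    x = np.zeros_like(y)
    samples = np.empty((n_samples, y.size))
    for n in range(n_samples):
        grad_f = (x - y) / sigma ** 2
        grad_g_lam = (x - prox_l1(x, theta * lam)) / lam    # identity (11)
        noise = rng.standard_normal(y.size)
        x = x - 0.5 * delta * grad_f - 0.5 * delta * grad_g_lam + np.sqrt(delta) * noise
        samples[n] = x
    return samples

rng = np.random.default_rng(0)
y = np.array([2.0, -2.0, 0.0])
samples = myula(y, sigma=1.0, theta=1.0, n_samples=20_000, rng=rng)
posterior_mean = samples[1000:].mean(axis=0)   # discard a short burn-in (assumed length)
print("approximate posterior mean:", posterior_mean)
```

The estimated mean is shrunk towards zero relative to y, as expected under an \(\ell _1\) prior. Each iteration requires one gradient of f and one proximal operator of g, with no accept/reject step, so the samples carry the bias discussed above.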

2.3 Estimation of marginal likelihoods and Bayes factors

Let \(\{X_{n}\}_{n=1}^N\) be a set of samples from \(\pi \) (or an approximation of \(\pi \)), generated by using a proximal MCMC method or otherwise. Following a Monte Carlo integration approach, the expectation of any function \(\phi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) w.r.t. \(\pi \) is approximated by

$$\begin{aligned} \hat{\text {E}}_\pi (\phi ) = \frac{1}{N}\sum _{n=1}^N \phi (X_{n})\, , \end{aligned}$$
(15)

which, under suitable assumptions, converges to the truth \(\text {E}_\pi (\phi ) = \int _\Omega \phi (x)\pi (x)\text {d}x\) as N increases (or to a biased estimate if the samples are not exactly from \(\pi \)). The accuracy of Monte Carlo estimates depends of course on the number of samples N and on the properties of the MCMC method used, but it also depends crucially on the variance \(\text {Var}_\pi (\phi )\). Unfortunately, \(\text {Var}_\pi (\phi )\) is often very large for the kinds of functions \(\phi \) required for estimating the marginal likelihood (2) (and in some cases \(\text {Var}_\pi (\phi )\) is not even defined), leading to Monte Carlo estimators of the marginal likelihood that behave poorly (Newton and Raftery 1994). As a result, it is difficult to use the samples \(\{X_{n}\}_{n=1}^N\) to perform model selection. Several strategies have been proposed to address the aforementioned difficulty and derive well-posed estimators for the marginal likelihood (2) and the Bayes factor (3) (for reviews of classical methods see Friel and Wyse 2012; Clyde et al. 2007).

One avenue is to generate samples from a sequence of distributions bridging \(\pi \) to some tractable reference \(\pi _0\) such as the prior distribution or a Gaussian approximation of \(\pi \), e.g., thermodynamic integration (O’Ruanaidh and Fitzgerald 1996) and annealed importance sampling (Neal 2001). Such strategies struggle with large problems because the number of intermediate distributions grows quickly as d increases.

Another promising approach to derive computationally efficient estimators is to construct Rao-Blackwellized estimators by carefully introducing auxiliary variables, as proposed in the seminal papers Chib (1995) and Chib and Jeliazkov (2001). This strategy has been successfully applied recently to signal and image processing models that are conditionally Gaussian given conjugate model hyper-parameters (Harroue 2020). Some generalisations are possible, but constructing efficient Rao-Blackwellized estimators for more general classes of models, e.g., of the form (1), is highly non-trivial.

An alternative natural strategy for stable Monte Carlo estimators for (2) and (3) is to construct a truncated estimator by first using the samples \(\{X_{n}\}_{n=1}^N\) to identify a suitable truncating set \({\mathcal {A}}\), followed by a sample average (15) only with the samples verifying \(X_{n} \in {\mathcal {A}}\) (Brosse et al. 2017). Although by construction well-posed, truncated estimators need to be de-biased by using the volume of \({\mathcal {A}}\), which is usually very expensive to compute when the dimension d is large. From the results of Brosse et al. (2017), we believe that this strategy is unlikely to produce scalable methods suitable for large problems. One can circumvent or simplify the calculation of the volume of \({\mathcal {A}}\) (e.g., see Durmus et al. 2018), but in our experience the resulting estimators become unstable and are difficult to use.

Another alternative approach, which is agnostic to the sampling method, is the harmonic mean estimator (Newton and Raftery 1994). In its original form, however, the variance of the estimator can be very poorly behaved, so that the estimator can be highly inaccurate in practice. Strategies to resolve this issue have been developed in the recently proposed learnt harmonic mean estimator (McEwen et al. 2022), which has been shown to be highly effective and can scale to dimension \({\mathcal {O}}(10^3)\) and beyond. Nevertheless, it may be challenging to scale this approach to the high-dimensional settings considered in this paper.

One can also consider the widely used Laplace’s method (Tierney and Kadane 1986), which relies on the assumption that the posterior distribution can be adequately approximated by a Gaussian distribution. Unfortunately, this is a strong assumption that often leads to inaccurate estimates in inverse problems that are ill-conditioned or ill-posed, particularly if \(d \ge \text {dim}(y)\). Many other alternatives are described in the literature, e.g., the Savage-Dickey density ratio (Trotta 2007) and Reversible Jump MCMC (Green 1995), which are mainly useful for nested or small models. It is worth mentioning that there are also some model selection strategies that do not rely on the computation of the marginal likelihood (see, e.g., Kamary et al. 2018; Pereyra and McLaughlin 2016); however these are usually very computationally intensive.

Finally, nested sampling provides a distinctively different approach for efficiently estimating (2) and (3) (Skilling 2006). The key idea underpinning nested sampling is the re-parameterisation of the marginal likelihood (2) as a one-dimensional integral of the likelihood with respect to the enclosed prior volume. This greatly reduces the computation costs involved, provided that one can efficiently sample from the prior distribution subject to a hard constraint on the likelihood value. Nested sampling therefore shifts the computational challenge from the direct evaluation of a high-dimensional integral to sampling of the prior subject to a hard likelihood constraint. The generation of samples is challenging and previous works have considered a range of sampling strategies, for example conventional MCMC sampling (Skilling 2006), rejection sampling (e.g. Mukherjee et al. 2006; Feroz and Hobson 2008; Feroz et al. 2009), slice sampling (e.g. Handley et al. 2015), and more advanced MCMC samplers such as Galilean Monte Carlo (Feroz and Skilling 2013) and diffusive nested sampling (Brewer et al. 2011). Following over a decade of active research, nested sampling is now a well-established technique for computing the marginal likelihood that has found widespread application, particularly in astronomy (e.g. Feroz and Hobson 2008; Feroz et al. 2009; Trotta 2007). Nevertheless, broadly speaking, current nested sampling techniques remain restricted to moderate dimensional problems of size \({\mathcal {O}}(10^2)\) to \({\mathcal {O}}(10^3)\).

With imaging problems in mind, this article presents an efficient nested sampling methodology specifically designed for high-dimensional log-concave models of the form (1). A significant novelty of the proposed approach is that we address the difficult generation of samples by using a proximal MCMC technique that is naturally suited for dealing with high-dimensional log-concave distributions subject to hard convex constraints. Moreover, the proximal nature of the method straightforwardly allows the use of the non-smooth priors that are frequently encountered in imaging (e.g., involving the \(\ell _1\) and total-variation regularisers), which would not be easily addressed by using alternative gradient-based samplers. Section 3 below reviews the nested sampling approach. The proposed proximal nested sampling methodology is presented in Sect. 4.

3 Nested sampling

For ease of notation, given a model \({\mathcal {M}}\), let \({{\mathcal {L}}}(x) = p(y\vert x,{\mathcal {M}})\) denote the likelihood function, \(\pi (x) = p(x\vert {\mathcal {M}})\) the prior, and

$$\begin{aligned} \begin{aligned} {{\mathcal {Z}}} = p(y\vert {\mathcal {M}}) = \int _{\Omega } {{\mathcal {L}}}(x) \pi (x) \text {d} x, \end{aligned} \end{aligned}$$
(16)

the marginal likelihood or evidence associated with a given model \({\mathcal {M}}\) (to simplify notation, we henceforth omit the dependence of \({{\mathcal {Z}}}\) and \({{\mathcal {L}}}\) on y).

Nested sampling (Skilling 2006) was proposed specifically to facilitate the efficient evaluation of \({{\mathcal {Z}}}\) for Bayesian model selection, while also supporting posterior inferences. As mentioned previously, the calculation of the multidimensional marginal likelihood integral (16) is generally computationally intractable. Nested sampling addresses this difficulty by cleverly converting (16) to a one-dimensional integral by re-parameterising the likelihood in terms of the enclosed prior volume. In addition, nested sampling involves the prior via simulation and hence does not require knowledge of the prior normalising constant. As a result, it also circumvents the second level of intractability of \({{\mathcal {Z}}}\) that arises in imaging problems.

Let \(\Omega _{L^*} = \{x \vert {{\mathcal {L}}} (x) > L^* \}\), which groups the parameter space \(\Omega \) into a series of nested subspaces according to the level-set or iso-likelihood contour \({{\mathcal {L}}}(x) = L^* \ge 0\). Note that \(\Omega _{L^* = 0} = \Omega \), since the likelihood values cannot be negative. Define the prior volume \(\xi \) by

$$\begin{aligned} \xi (L^*) = \int _{\Omega _{L^*}} \pi (x) \text {d} x. \end{aligned}$$
(17)

Note that \(\xi (0) = 1\) and \(\xi (L_\text {max}) = 0\), where \(L_\text {max}\) is the maximum of the likelihood in \(\Omega \). Let \({{\mathcal {L}}}^\dagger (\xi )\) be the inverse of the prior volume \(\xi (L^*)\) such that \({{\mathcal {L}}}^\dagger (\xi (L^*)) = L^*\) (see Footnote 1), and assume it is a monotonically decreasing function of \(\xi \) (which, when \({\mathcal {L}}\) is continuous and \(\pi \) has connected support, is satisfied theoretically and up to practical numerical considerations that can be trivially overcome; Sivia and Skilling 2006). The marginal likelihood integral (16) can then be rewritten as

$$\begin{aligned} {{\mathcal {Z}}}&= \int _0^1 {{\mathcal {L}}}^\dagger (\xi ) \text {d} \xi , \end{aligned}$$
(18)

which is a one-dimensional integral over the prior volume \(\xi \).
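One way to see why (18) holds is the following short argument, given here as a sketch under the assumptions that \({\mathcal {L}}\) is bounded by \(L_\text {max}\) and that the order of integration can be exchanged. It writes \({\mathcal {L}}(x)\) as an integral of an indicator function over likelihood levels, and then uses definition (17).

```latex
$$\begin{aligned}
{\mathcal {Z}} = \int _{\Omega } {\mathcal {L}}(x) \pi (x) \text {d}x
&= \int _{\Omega } \left( \int _0^{L_\text {max}} \mathbb {1}\{ {\mathcal {L}}(x) > L^* \} \, \text {d}L^* \right) \pi (x) \text {d}x \\
&= \int _0^{L_\text {max}} \left( \int _{\Omega _{L^*}} \pi (x) \text {d}x \right) \text {d}L^*
= \int _0^{L_\text {max}} \xi (L^*) \, \text {d}L^* \\
&= \int _0^1 {\mathcal {L}}^\dagger (\xi ) \, \text {d}\xi .
\end{aligned}$$
```

The last equality holds because both integrals measure the area under the same monotonically decreasing curve relating \(L^*\) and \(\xi \), once integrated along the likelihood axis and once along the prior volume axis.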

To evaluate (18) in practice it is necessary to compute likelihood level-sets (iso-contours) \(L_i\), which correspond to prior volumes \(0< \xi _i \le 1\) satisfying (17). A strategy to generate the likelihoods \(L_i\) and associated prior volumes \(\xi _i\) is discussed in Sect. 3.2. Once the likelihoods \(L_i = {{\mathcal {L}}}^\dagger (\xi _i)\) are obtained, (18) can be used to evaluate the marginal likelihood, where \(\{\xi _i\}_{i = 0}^{N}\) is a sequence of decreasing prior volumes, i.e.,

$$\begin{aligned} 0< \xi _{N}< \cdots< \xi _1 < \xi _0 = 1. \end{aligned}$$
(19)

After discretising the integral (18) and associating with each likelihood \(L_i\) a quadrature weight \(w_i\), the marginal likelihood can be computed numerically using standard quadrature methods to give

$$\begin{aligned} {{\mathcal {Z}}} \approx \sum _{i=1}^{N} L_i w_i. \end{aligned}$$
(20)

The simplest assignment of the quadrature weights is \(w_i = \xi _{i-1} - \xi _i\). The trapezium rule can also be used, i.e., \(w_i = (\xi _{i-1} - \xi _{i+1})/2\). The approximation error related to the discretisation of (18) can be made arbitrarily small by increasing N.
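As a rough numerical illustration of (20), the following Python sketch compares the two weight assignments on a toy problem. The likelihood profile \({{\mathcal {L}}}^\dagger (\xi ) = 1 - \xi \) (so that \({{\mathcal {Z}}} = 1/2\)), the deterministic volumes \(\xi _i = \exp (-i/N_\text {live})\), the boundary convention \(\xi _{N+1} = 0\) and all function names are assumptions made for this example only.

```python
import math

def toy_likelihood(xi):
    # Hypothetical profile L^dagger(xi) = 1 - xi, decreasing in xi; exact Z = 0.5.
    return 1.0 - xi

def evidence_estimates(n_live=50, n_iter=1000):
    # Decreasing prior volumes xi_0 = 1 > xi_1 > ... > xi_N > 0, as in (19).
    xi = [math.exp(-i / n_live) for i in range(n_iter + 1)]
    xi.append(0.0)  # assumed boundary value xi_{N+1} = 0 for the trapezium rule
    z_simple = 0.0
    z_trapezium = 0.0
    for i in range(1, n_iter + 1):
        l_i = toy_likelihood(xi[i])
        z_simple += l_i * (xi[i - 1] - xi[i])               # w_i = xi_{i-1} - xi_i
        z_trapezium += l_i * 0.5 * (xi[i - 1] - xi[i + 1])  # w_i = (xi_{i-1} - xi_{i+1}) / 2
    return z_simple, z_trapezium

z_simple, z_trapezium = evidence_estimates()
print(z_simple, z_trapezium)
```

For this linear profile the trapezium weights recover the integral almost exactly, whereas the simple weights slightly overestimate it because each \(L_i\) is the largest likelihood value on its interval.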

3.1 Posterior inferences

Posterior inferences can be easily computed once \({{\mathcal {Z}}}\) is found. Any sample taken randomly in the prior volume interval \((\xi _{i-1}, \xi _i)\) is simply assigned an importance weight

$$\begin{aligned} p_i = \frac{L_i w_i}{{\mathcal {Z}}}. \end{aligned}$$
(21)

Samples with the assigned weights \(\{p_i\}\) can then be used to calculate posterior inferences such as the posterior moments, probabilities, and credible regions.
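A minimal sketch of how the weights (21) might be used to form a posterior mean is given below. The three samples, likelihood values and quadrature weights are made-up numbers chosen for illustration, and the function name is hypothetical.

```python
def posterior_weights_and_mean(samples, likelihoods, weights):
    # Evidence estimate (20), followed by the importance weights p_i = L_i w_i / Z of (21).
    z = sum(l * w for l, w in zip(likelihoods, weights))
    p = [l * w / z for l, w in zip(likelihoods, weights)]
    # Weighted average of the samples approximates the posterior mean.
    mean = sum(p_i * x_i for p_i, x_i in zip(p, samples))
    return p, mean

# Hypothetical scalar samples with their likelihoods and quadrature weights.
samples = [0.0, 1.0, 2.0]
likelihoods = [0.1, 0.5, 1.0]
weights = [0.5, 0.3, 0.2]
p, mean = posterior_weights_and_mean(samples, likelihoods, weights)
print(p, mean)
```

Other posterior summaries (higher moments, probabilities of events) follow the same pattern, replacing the sample value by the corresponding function of the sample.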

3.2 Marginal likelihood evaluation

We now recall the basic procedure of the standard nested sampling framework for evaluating the marginal likelihood, i.e. to compute the summation (20). In particular, it is necessary to generate samples of the likelihoods \(L_i\) and to estimate the corresponding enclosed prior volume \(\xi _i\).

Firstly, set the iteration number \(i = 0\), the prior volume \(\xi _0 = 1\), and draw \(N_\text {live}\) live samples of the unknown image x from the prior distribution \(\pi (x)\). Secondly, remove the sample with the smallest likelihood, say \(L_{i+1}\), from the live set and replace it with a new sample. This new sample is again drawn from the prior, but constrained to a higher likelihood than \(L_{i+1}\).

It is then necessary to determine the prior volume \(\xi _{i+1}\) enclosed by the likelihood level-set (iso-contour) defined by \(L_{i+1}\). This is estimated in a stochastic manner. The enclosed prior volume for each step i can be estimated by a shrinkage ratio (random variable) \(t_{i+1}\), i.e. by \(\xi _{i+1} = t_{i+1} \xi _i\), where \(t_{i+1}\) follows the distribution (see Footnote 2)

$$\begin{aligned} {p} (t) = N_\text {live} t^{N_\text {live}-1}. \end{aligned}$$
(22)

Repeat the above step (removing the sample with the smallest likelihood and estimating the updated prior volume) until the entire prior volume (and the nested shells of likelihood) has been traversed. We finally obtain \(\{L_i\}\) and \(\{\xi _i\}\) which can then be used to compute the marginal likelihood by (20). Moreover, we also simultaneously obtain a set of samples of the parameter x comprising all the discarded (dead) samples and the \(N_\text {live}\) final live samples, which can be used for posterior parameter inferences (refer to Sect. 3.1 for further detail).

The prior volume at step i of the nested sampling algorithm is \(\xi _{i} = \prod _{k=1}^i t_{k}\); recall that each \(t_k\) is a shrinkage ratio, independently distributed according to the probability density function given in (22). Since the mean and standard deviation of \(\log t\) are respectively

$$\begin{aligned} E(\log t) = - 1/N_\text {live} \quad \text {and} \quad \sigma (\log t) = 1/N_\text {live}, \end{aligned}$$
(23)

we have

$$\begin{aligned} \log \xi _i \approx - i/N_\text {live} \pm \sqrt{i}/N_\text {live}. \end{aligned}$$
(24)

Ignoring uncertainty, one thus takes

$$\begin{aligned} \xi _i = \exp (- i/N_\text {live}). \end{aligned}$$
(25)
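The statistics in (23) and (24) can be checked empirically with a short simulation. The sketch below draws shrinkage ratios with density (22) via \(t = u^{1/N_\text {live}}\) for uniform u (a standard inverse-CDF construction) and accumulates \(\log \xi _i\); the seed and the values of \(N_\text {live}\) and i are arbitrary choices for illustration.

```python
import math
import random

def simulate_log_volume(n_live, n_iter, rng):
    # Accumulate log xi_i = sum_k log t_k, with t_k drawn from p(t) = n_live * t**(n_live - 1).
    log_xi = 0.0
    for _ in range(n_iter):
        u = 1.0 - rng.random()          # uniform on (0, 1], avoids log(0)
        log_xi += math.log(u) / n_live  # log t for t = u**(1 / n_live)
    return log_xi

rng = random.Random(0)
n_live, n_iter = 100, 1000
log_xi = simulate_log_volume(n_live, n_iter, rng)
expected = -n_iter / n_live           # mean in (24)
spread = math.sqrt(n_iter) / n_live   # standard deviation in (24)
print(log_xi, expected, spread)
```

The simulated \(\log \xi _i\) should typically lie within a few multiples of the spread around \(-i/N_\text {live}\), which is the justification for the deterministic assignment (25).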

A convergence criterion for the nested sampling algorithm should be adopted. Terminating the algorithm too early or too late should be avoided, to ensure the marginal likelihood is estimated accurately without unnecessary additional computational cost. One stopping criterion is that the difference in marginal likelihood estimates between two iterations falls below a predefined threshold, while another is to ensure a sufficient number of dead samples is used.

The pseudo code for the nested sampling algorithm is given in Algorithm 1. Observe that the most challenging task in the nested sampling algorithm is drawing samples from the prior with the hard constraint that samples lie within \(\Omega _{L_i}\), i.e. within the space defined by the likelihood level-set (see lines 8–10 in Algorithm 1). This constrained sampling step is relatively easy in small problems but can become very computationally challenging as problem dimension increases. As a result, nested sampling is usually restricted to problems of moderate size.

figure a
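To make the procedure concrete, the following Python sketch runs standard nested sampling on a one-dimensional toy problem. It is not the authors' implementation: the uniform prior on \([-5, 5]\), the unnormalised Gaussian likelihood, the fixed number of iterations, and the final live-sample correction are all assumptions for illustration. The toy is chosen so that the constrained prior draw can be done exactly, since the level set is an interval; in realistic problems this is precisely the difficult step.

```python
import math
import random

def log_likelihood(x):
    # Toy unnormalised Gaussian likelihood, L(x) = exp(-x^2 / 2) (an assumption).
    return -0.5 * x * x

def nested_sampling(n_live=200, n_iter=2000, half_width=5.0, seed=0):
    rng = random.Random(seed)
    # Initialisation: live samples from the assumed uniform prior on [-half_width, half_width].
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    z = 0.0
    xi_prev = 1.0
    for i in range(1, n_iter + 1):
        # Locate the live sample with the smallest likelihood.
        worst = min(range(n_live), key=lambda n: log_likelihood(live[n]))
        l_i = math.exp(log_likelihood(live[worst]))
        xi_i = math.exp(-i / n_live)   # deterministic prior volume, as in (25)
        z += l_i * (xi_prev - xi_i)    # simplest quadrature weight in (20)
        xi_prev = xi_i
        # Constrained prior draw: here L(x) > L_i is the interval |x| < |x_worst|,
        # so it can be sampled exactly; in general this is the hard step.
        bound = abs(live[worst])
        live[worst] = rng.uniform(-bound, bound)
    # Assumed convention: add the contribution of the final live samples.
    z += xi_prev * sum(math.exp(log_likelihood(x)) for x in live) / n_live
    return z

z_estimate = nested_sampling()
z_exact = math.sqrt(2.0 * math.pi) * math.erf(5.0 / math.sqrt(2.0)) / 10.0
print(z_estimate, z_exact)
```

The estimate should agree with the analytic value up to the stochastic error discussed in Sect. 3.3.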

3.3 Error estimation

If the prior volumes \(\{\xi _i\}\) considered in the discretised integral (20) used to evaluate the marginal likelihood could be assigned exactly, then the only error in the estimate of the marginal likelihood would be due to the discretisation of the integral, which is \({{\mathcal {O}}} (1/{N}^2)\) and negligible when N is sufficiently large. However, since the shrinkage ratios \(t_i\) are random, each prior volume \(\xi _i\) is only assigned approximately. The resulting error tends to overwhelm the discretisation error and is therefore the dominant source of uncertainty in the final computed evidence \({{\mathcal {Z}}}\). Fortunately, this uncertainty can be estimated easily. We recall below the error estimation scheme presented in Skilling (2006) using the entropy of the prior volumes. This approach is highly efficient since it does not require any additional sampling.

Let \({{\mathcal {P}}}(\xi ) = {{\mathcal {L}}}^\dagger (\xi )/{{\mathcal {Z}}}\) be the posterior density with respect to the prior volume \(\xi \). Then the negative relative entropy H can be defined as

$$\begin{aligned} H = \int {{\mathcal {P}}}(\xi ) \log [{{\mathcal {P}}}(\xi )] \text {d} \xi \approx \sum _{i=1}^{N} \frac{L_i w_i}{{\mathcal {Z}}} \log \left( \frac{L_i}{\mathcal {Z}}\right) , \end{aligned}$$
(26)

which can be computed directly from the obtained likelihoods \(\{L_i\}\), weights \(\{w_i\}\) and the evidence \({{\mathcal {Z}}}\). Following Skilling (2006), the standard deviation of the uncertainty of \(\log {{\mathcal {Z}}}\) using the nested sampling algorithm reads \(\sqrt{H/N_\text {live}}\), i.e.,

$$\begin{aligned} \log {{\mathcal {Z}}} = \log \left( \sum _{i=1}^{N} {L_i w_i} \right) \pm \sqrt{\frac{H}{N_\text {live}}}. \end{aligned}$$
(27)
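The estimates (26) and (27) only require the stored likelihoods and weights. The sketch below evaluates them on a synthetic run; the profile \({{\mathcal {L}}}^\dagger (\xi ) = \exp (-\xi /s)/s\) with \(s = 0.01\), for which \({{\mathcal {Z}}} \approx 1\) and \(H \approx -\log s - 1\), and the deterministic volumes of (25) are assumptions chosen so that the output can be compared with known values.

```python
import math

def log_evidence_with_error(likelihoods, weights, n_live):
    # Evidence (20), entropy term H of (26), and the error bar sqrt(H / n_live) of (27).
    z = sum(l * w for l, w in zip(likelihoods, weights))
    h = sum((l * w / z) * math.log(l / z)
            for l, w in zip(likelihoods, weights) if l > 0.0)
    h = max(h, 0.0)  # guard against small negative values caused by round-off
    return math.log(z), h, math.sqrt(h / n_live)

# Synthetic run with an assumed likelihood profile and deterministic prior volumes.
n_live, n_iter, s = 100, 2000, 0.01
xi = [math.exp(-i / n_live) for i in range(n_iter + 1)]
likelihoods = [math.exp(-x / s) / s for x in xi[1:]]
weights = [xi[i - 1] - xi[i] for i in range(1, n_iter + 1)]
log_z, h, err = log_evidence_with_error(likelihoods, weights, n_live)
print(log_z, h, err)
```

Because H grows as the posterior concentrates in a smaller fraction of the prior volume, the error bar increases for more informative data unless \(N_\text {live}\) is increased accordingly.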

In Chopin and Robert (2010), it is established that, under some regularity conditions, the approximation error is asymptotically Gaussian in the limit \(N_\text {live} \rightarrow \infty \) and vanishes at the usual Monte Carlo rate \({\mathcal {O}}(N_\text {live}^{-1/2})\), consistent with (27). Moreover, the error scales approximately linearly with the model dimension d.

4 Proximal nested sampling framework

The main difficulty in applying nested sampling to large inverse problems is to efficiently simulate from the prior distribution subject to a hard likelihood constraint. More precisely, at iteration i, the samples from the prior are constrained to the region \(\Omega _{L_i}\) defined by the likelihood level-set corresponding to \(L_i\) (i.e. where a new sample must have a likelihood value greater than \(L_i\) at iteration i).

In this section we present our proposed proximal nested sampling method to address this challenging constrained sampling problem. Moreover, the proximal nature of the sampling method ensures that non-differentiable distributions, such as popular sparsity-promoting priors involving the \(\ell _1\) norm, are supported. We first present the methodology of proximal nested sampling for arbitrary log-concave distributions of the form (1). Explicit forms of proximal nested sampling for common choices of priors and likelihoods in imaging sciences are presented in Sect. 5.

4.1 General constrained sampling problem

Following (1) and adopting the notation of Sect. 3, assume that the prior and the likelihood are of the form \(\pi (x) = \text {exp}(-f(x))\) and \({{\mathcal {L}}}(x) = \text {exp}(-g(x))\), where f and g are convex l.s.c. (lower semicontinuous) functions on \(\Omega \).

We consider sampling from the prior \(\pi ({x})\), such that \(\mathcal{L}(x) > L^*\) for some generic likelihood value \(L^* > 0\). Let \(\iota _{{L}^*}(x)\) and \(\chi _{{L}^*}(x)\) be the indicator function and characteristic function, respectively, defined as

$$\begin{aligned} \iota _{{L}^*}(x)= & {} {\left\{ \begin{array}{ll} 1, &{} {{\mathcal {L}}} (x)> {L}^*, \\ 0, &{} \text {otherwise}, \end{array}\right. } \quad \text {and}\nonumber \\ \chi _{{L}^*}(x)= & {} {\left\{ \begin{array}{ll} 0, &{} {{\mathcal {L}}} (x) > {L}^*, \\ +\infty , &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(28)

Since \(\log \) is monotonic, \({{\mathcal {L}}} (x) > {L}^*\) is equivalent to \(g(x) < \tau \), where

$$\begin{aligned} \tau = -\log {L}^*. \end{aligned}$$
(29)

Let \({{\mathcal {B}}}_{\tau } := \{x \ \vert \ g(x) < \tau \}\). Then it is apparent that \(\chi _{{L}^*}(x)\), as a constraint for x, is equivalent to \(\chi _{{{\mathcal {B}}}_{\tau }}(x)\), where

$$\begin{aligned} \chi _{{{\mathcal {B}}}_{\tau }}(x) = {\left\{ \begin{array}{ll} 0, &{} x \in {{\mathcal {B}}}_{\tau }, \\ +\infty , &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(30)

Let \({\pi }_{{L}^*} (x) = \pi (x) \iota _{{L}^*}(x)\) represent the prior distribution with the hard likelihood constraint \({{\mathcal {L}}} (x) > {L}^*\). Since \(\iota _{{L}^*}(x) = \text {exp}(- \chi _{{L}^*}(x))\), then we have

$$\begin{aligned} \begin{aligned} \pi _{{L}^*} (x)&= \pi (x) \iota _{{L}^*}(x) \\&= \text {exp}(-f(x)) \text {exp}(- \chi _{{L}^*}(x)) \\&= \text {exp}(- [f(x) + \chi _{{L}^*}(x)] ) \\&= \text {exp}(- [f(x) + \chi _{{{\mathcal {B}}}_{\tau }}(x)] ). \end{aligned} \end{aligned}$$
(31)

Note that taking the negative logarithm of \(\pi _{{L}^*} (x)\) gives

$$\begin{aligned} -\log \pi _{{L}^*} (x) = f(x) + \chi _{{{\mathcal {B}}}_{\tau }}(x). \end{aligned}$$
(32)
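The construction in (29) to (32) can be written down directly. The following sketch evaluates \(-\log \pi _{L^*}(x)\) for a scalar x; the particular choices \(f(x) = |x|\) and \(g(x) = (x - 1)^2/2\) and the function names are hypothetical and serve only to illustrate the role of \(\tau \).

```python
import math

def neg_log_constrained_prior(x, f, g, l_star):
    # tau = -log L* as in (29); the constraint L(x) > L* is then g(x) < tau.
    tau = -math.log(l_star)
    # f(x) + chi_{B_tau}(x) as in (32): finite inside B_tau, +infinity outside.
    return f(x) if g(x) < tau else math.inf

def f(x):
    return abs(x)                # hypothetical Laplace-type negative log-prior

def g(x):
    return 0.5 * (x - 1.0) ** 2  # hypothetical Gaussian negative log-likelihood

l_star = math.exp(-0.5)          # so that tau = 0.5 and B_tau is the interval (0, 2)
inside = neg_log_constrained_prior(1.5, f, g, l_star)
outside = neg_log_constrained_prior(3.0, f, g, l_star)
print(inside, outside)
```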

In the following section we introduce our proximal nested sampling algorithm for parameter x to sample from the constrained prior distribution \(\text {exp}(- [f(x) + \chi _{{{\mathcal {B}}}_{\tau }}(x)] )\).

4.2 Drawing a sample from the constrained prior

Sampling from distributions over \(\Omega \) is usually challenging because of the dimensionality involved. Sampling from the constrained prior (32) is particularly difficult because of the hard constraint that \(x \in {{\mathcal {B}}}_{\tau }\), encoded in the characteristic function \(\chi _{{{\mathcal {B}}}_{\tau }}(x)\). Sampling is further complicated if the log-prior f(x) is not Lipschitz differentiable over \(\Omega \) (e.g. for non-differentiable sparsity-promoting priors), since high-dimensional sampling methods rely heavily on gradient information. To circumvent these issues we adopt a proximal MCMC approach, which is particularly suitable for high-dimensional distributions that are log-concave but not smooth. More precisely, in a manner akin to Durmus et al. (2018), we use the unadjusted Langevin algorithm (ULA) MCMC sampling strategy combined with Moreau-Yosida approximations of non-differentiable terms, followed by a Metropolis-Hastings correction step to control the approximations made, as described in Pereyra (2016).

Using the ULA iterative formula, for each given \(\tau \) (recall that \(\tau \) corresponds to a likelihood value \({L}^*\) by \(\tau = -\log {L}^*\); see (29)), we can generate the following Markov chain

$$\begin{aligned} x^{(k+1)} = x^{(k)} - \frac{\delta }{2} \nabla \bigl [ f(x^{(k)}) + \chi _{{{\mathcal {B}}}_{\tau }}(x^{(k)}) \bigr ] + \sqrt{\delta } { w}^{(k+1)},\nonumber \\ \end{aligned}$$
(33)

where \(\delta > 0\) is the step size and \({ w}^{(k+1)} \sim {{\mathcal {N}}}(0, {\mathbf {1}}_K)\) (a K-sequence of standard Gaussian random variables).

The non-differentiable characteristic function \(\chi _{\mathcal{B}_{\tau }}(x)\) can be approximated by its Moreau-Yosida envelope \(\chi ^{\lambda }_{{{\mathcal {B}}}_{\tau }}(x)\), with approximation controlled by \(\lambda > 0\). It is straightforward to show that

$$\begin{aligned} \chi ^{\lambda }_{{{\mathcal {B}}}_{\tau }}(x) = \frac{1}{2\lambda } \Vert x - x^*\Vert _2^2 , \end{aligned}$$
(34)

where \(x^* \) is the closest point in \({{\mathcal {B}}}_{\tau }\) to x, given by the projection of x onto \({{\mathcal {B}}}_\tau \), i.e. \(x^* = \text {proj}_{{{\mathcal {B}}}_\tau }(x) = \text {prox}_{\chi _{\mathcal{B}_\tau }}(x)\). Critically, the \(\lambda \)-Moreau-Yosida envelope is \(\tfrac{1}{\lambda }\)-Lipschitz differentiable. Its gradient can be calculated directly from (34) or by noting (11), yielding

$$\begin{aligned} \nabla \chi ^{\lambda }_{{{\mathcal {B}}}_{\tau }}(x) = (x - x^*)/\lambda = (x - \text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x) ) / \lambda . \end{aligned}$$
(35)
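As an illustration of (34) and (35), the sketch below evaluates the Moreau-Yosida envelope of the characteristic function and its gradient when the constraint set is an \(\ell _2\) ball. Taking \({{\mathcal {B}}}_\tau \) to be a ball with a given centre and radius is an assumption for this example, and the helper names are hypothetical.

```python
import math

def project_ball(x, centre, radius):
    # Euclidean projection onto the closed l2 ball with the given centre and radius.
    diff = [a - c for a, c in zip(x, centre)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    return [c + radius * d / norm for c, d in zip(centre, diff)]

def my_envelope_and_gradient(x, centre, radius, lam):
    # Envelope (34): squared distance to the set divided by 2 * lam.
    # Gradient (35): (x - proj(x)) / lam.
    x_star = project_ball(x, centre, radius)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_star))
    envelope = sq_dist / (2.0 * lam)
    gradient = [(a - b) / lam for a, b in zip(x, x_star)]
    return envelope, gradient

env_out, grad_out = my_envelope_and_gradient([3.0, 4.0], [0.0, 0.0], 1.0, lam=0.5)
env_in, grad_in = my_envelope_and_gradient([0.1, 0.2], [0.0, 0.0], 1.0, lam=0.5)
print(env_out, grad_out, env_in, grad_in)
```

Inside the set both the envelope and its gradient vanish, while outside the gradient points away from the set with magnitude proportional to the distance, so a smaller \(\lambda \) enforces the constraint more strongly.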

Replacing the characteristic function by its Moreau-Yosida approximation in (33) , and noting the gradient (35), yields

$$\begin{aligned} x^{(k+1)}= & {} x^{(k)} - \frac{\delta }{2}\nabla f(x^{(k)}) - \frac{\delta }{2\lambda } \bigl [ x^{(k)} - \text {prox}_{\chi _{\mathcal{B}_{\tau }}}(x^{(k)}) \bigr ] \nonumber \\&+ \sqrt{\delta } { w}^{(k+1)}. \end{aligned}$$
(36)

When f(x) is differentiable its gradient can be computed directly (we consider the case where f(x) is non-differentiable shortly). For differentiable log-priors f(x), (36) provides the general strategy for sampling from the prior subject to the hard likelihood constraint (with a subsequent Metropolis-Hastings step as discussed below).

If the sample \(x^{(k)}\) is already in \({{\mathcal {B}}}_\tau \), i.e. \(x^{(k)} \in {{\mathcal {B}}}_\tau \), the term \(\bigl [x^{(k)} - \text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x^{(k)}) \bigr ]\) vanishes and the Markov chain iteration reduces to a noisy gradient descent step on f. In contrast, if \(x^{(k)}\) is not in \({{\mathcal {B}}}_\tau \), i.e. \(x^{(k)} \notin {{\mathcal {B}}}_\tau \), then a step is taken in the direction \(- \bigl [x^{(k)} - \text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x^{(k)}) \bigr ]\), which acts to move the next iteration in the Markov chain in the direction of the projection of \(x^{(k)}\) onto the convex set \(\mathcal{B}_\tau \). This term therefore acts to push the Markov chain back into the constraint set \({{\mathcal {B}}}_\tau \) if it wanders outside of it, although due to the Moreau-Yosida approximation of \(\chi _{\mathcal{B}_\tau }\) it does not guarantee the constraint is satisfied (the subsequent Metropolis-Hastings step does guarantee the hard likelihood constraint is satisfied, as discussed below).

When f(x) is non-differentiable, it may be approximated by its differentiable Moreau-Yosida envelope \(f^\lambda (x)\). By noting (11), the gradient of the term involving the sum of the two Moreau-Yosida approximations then reads

$$\begin{aligned}&\nabla ( f^{\lambda }(x) + \chi ^{\lambda }_{{{\mathcal {B}}}_{\tau }}(x)) = (x - \text {prox}^{\lambda }_f(x) ) / \lambda \nonumber \\&+ (x - \text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x) ) / \lambda .\nonumber \\ \end{aligned}$$
(37)

Here we have used the same regularisation parameter \(\lambda > 0\) for both approximations for notational brevity, although clearly different parameters can be considered for \(f^\lambda (x)\) and \(\chi ^{\lambda }_{{{\mathcal {B}}}_{\tau }}(x)\) if desired.

Replacing in (33) both f(x) and \(\chi _{\mathcal{B}_{\tau }}(x)\) by their Moreau-Yosida approximations, and noting the gradient (37), yields

$$\begin{aligned} x^{(k+1)}= & {} (1- \frac{\delta }{\lambda }) x^{(k)} + \frac{\delta }{2\lambda } \text {prox}^{\lambda }_f(x^{(k)}) \nonumber \\&+ \frac{\delta }{2\lambda } \text {prox}_{\chi _{\mathcal{B}_{\tau }}}(x^{(k)}) + \sqrt{\delta } { w}^{(k+1)}. \end{aligned}$$
(38)

For non-differentiable log-concave priors, (38) provides the general strategy for sampling from the prior subject to the hard likelihood constraint.
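A single iteration of (38) might be sketched as follows. The example assumes \(f(x) = \mu \Vert x\Vert _1\) (i.e. an identity sparsifying transform) and an \(\ell _2\)-ball constraint set, so that both proximal operators are available in closed form; these choices, the parameter values and the function names are illustrative assumptions rather than part of the method's specification.

```python
import math
import random

def soft(v, threshold):
    # Component-wise soft-thresholding, the proximal operator of threshold * ||.||_1.
    return [math.copysign(max(abs(a) - threshold, 0.0), a) for a in v]

def project_ball(x, centre, radius):
    # Projection onto the closed l2 ball, standing in for prox of chi_{B_tau}.
    diff = [a - c for a, c in zip(x, centre)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    return [c + radius * d / norm for c, d in zip(centre, diff)]

def proximal_ula_step(x, centre, radius, mu, lam, delta, noise):
    # One iteration of (38); `noise` holds the standard Gaussian draws w^(k+1).
    prox_f = soft(x, lam * mu)
    prox_c = project_ball(x, centre, radius)
    return [(1.0 - delta / lam) * a + delta / (2.0 * lam) * (pf + pc)
            + math.sqrt(delta) * w
            for a, pf, pc, w in zip(x, prox_f, prox_c, noise)]

rng = random.Random(0)
x = [3.0, 4.0]
noise = [rng.gauss(0.0, 1.0) for _ in x]
x_next = proximal_ula_step(x, [0.0, 0.0], 1.0, mu=1.0, lam=0.5, delta=0.1, noise=noise)
# Setting the noise to zero isolates the deterministic drift of the update.
x_drift = proximal_ula_step(x, [0.0, 0.0], 1.0, mu=1.0, lam=0.5, delta=0.1, noise=[0.0, 0.0])
print(x_next, x_drift)
```

Passing the noise in explicitly keeps the update deterministic given its inputs, which makes the drift term easy to inspect separately from the random perturbation.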

To summarise, given a proper initial sample, say \(x^{(0)}\), we generate a Markov chain by iteratively applying the Markov kernel (36) if f is Lipschitz differentiable or the regularised surrogate (38) if it is not, which allows drawing samples from the prior that are likely to be within the likelihood iso-contour \(L^*\). Drawing such constrained samples is the main challenge in nested sampling.

The Markov chains generated by ULA-type kernels exhibit some bias resulting from the discretisation of the Langevin stochastic differential equation and from the use of Moreau-Yosida regularisations. This bias can be asymptotically removed by introducing a Metropolis-Hastings correction step to ensure convergence to the required target density. In detail, at each iteration, a new candidate \(x^{\prime }\) generated using formula (36) or (38) is accepted with probability

$$\begin{aligned} \text {min}\left\{ 1, \frac{q(x^{(k)} \vert x^\prime ) \pi _{L^*}(x^\prime )}{q(x^\prime \vert x^{(k)}) \pi _{L^*}(x^{(k)})} \right\} , \end{aligned}$$
(39)

where \(q(\cdot \vert \cdot )\) is a transition kernel, which we define by a Gaussian related to the ULA random component (following Pereyra 2016), i.e.,

$$\begin{aligned} q(x^\prime \vert x^{(k)}) \sim \exp \Big (- \frac{\big (x^\prime - x^{(k)} - \frac{\delta }{2} \nabla \log \pi _{L^*}(x^{(k)}) \big )^2}{2\delta } \Big ).\nonumber \\ \end{aligned}$$
(40)

If the candidate sample \(x^\prime \) is outside of \({{\mathcal {B}}}_\tau \), i.e. \(x^\prime \notin {{\mathcal {B}}}_\tau \), then \(\pi _{L^*}(x^\prime )=0\) and according to the Metropolis-Hastings update the candidate will not be accepted, ensuring the hard likelihood constraint is satisfied.

We summarise our proximal technique to draw an individual sample from the prior under the hard likelihood constraint in Algorithm 2.

figure b
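The following Python sketch combines the proposal (36) with the acceptance rule (39) and the Gaussian kernel (40) for a small example. It assumes a Gaussian prior \(f(x) = \Vert x\Vert _2^2/2\), an \(\ell _2\)-ball constraint set, and a proposal drift taken from the Moreau-Yosida regularised update; all names and parameter values are hypothetical, and the sketch is not a reproduction of Algorithm 2.

```python
import math
import random

def project_ball(x, centre, radius):
    diff = [a - c for a, c in zip(x, centre)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    return [c + radius * d / norm for c, d in zip(centre, diff)]

def log_target(x, centre, radius):
    # log pi_{L*}(x) = -f(x) inside B_tau and -inf outside; f(x) = ||x||^2 / 2 is assumed.
    dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, centre)))
    if dist >= radius:
        return -math.inf
    return -0.5 * sum(a * a for a in x)

def drift(x, centre, radius, lam, delta):
    # Deterministic part of (36) with grad f(x) = x.
    proj = project_ball(x, centre, radius)
    return [a - 0.5 * delta * a - 0.5 * delta / lam * (a - p) for a, p in zip(x, proj)]

def log_q(x_to, mean, delta):
    # Log of the Gaussian proposal density (40), up to an additive constant.
    return -sum((a - m) ** 2 for a, m in zip(x_to, mean)) / (2.0 * delta)

def mh_step(x, centre, radius, lam, delta, rng):
    mean_fwd = drift(x, centre, radius, lam, delta)
    cand = [m + math.sqrt(delta) * rng.gauss(0.0, 1.0) for m in mean_fwd]
    log_pi_cand = log_target(cand, centre, radius)
    if log_pi_cand == -math.inf:
        return x, False  # candidate violates the hard likelihood constraint
    mean_bwd = drift(cand, centre, radius, lam, delta)
    log_ratio = (log_pi_cand + log_q(x, mean_bwd, delta)
                 - log_target(x, centre, radius) - log_q(cand, mean_fwd, delta))
    if math.log(1.0 - rng.random()) < min(0.0, log_ratio):
        return cand, True
    return x, False

rng = random.Random(1)
centre, radius = [1.0, 1.0], 1.0
state, accepted, chain = list(centre), 0, []
for _ in range(500):
    state, ok = mh_step(state, centre, radius, lam=0.05, delta=0.01, rng=rng)
    accepted += ok
    chain.append(state)
print(accepted / 500.0)
```

Working with log densities avoids underflow, and the early return for candidates outside the set mirrors the fact that \(\pi _{L^*}(x^\prime ) = 0\) there. The chain must be started from a point inside the constraint set for the acceptance ratio to be well defined.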

4.3 Initialisation from the unconstrained prior

The initialisation of the nested sampling method is to draw \(N_\text {live}\) samples \(\{x_n\}_{n=1}^{N_\text {live}}\) from the prior distribution \(\pi (x)\) in the prior space \(\Omega \). If the log-prior f(x) is differentiable, these samples can be generated directly with the ULA iterative formula. Otherwise f(x) may again be approximated by its Moreau-Yosida envelope and samples from the prior can be generated by the iterative formula

$$\begin{aligned} x^{(k+1)} = (1- \frac{\delta }{2\lambda }) x^{(k)} + \frac{\delta }{2\lambda } \text {prox}^{\lambda }_f(x^{(k)}) + \sqrt{\delta } w^{(k+1)}.\nonumber \\ \end{aligned}$$
(41)

To draw \(N_\text {live}\) samples from the prior, the initial samples generated before the chain converges to the target prior distribution must first be discarded; these correspond to a number of burn-in iterations, say \(K_\text {burn}\). To reduce correlations between samples and the algorithm’s memory footprint, the chain is also thinned by discarding \((K_\text {gap} - 1)\) intermediate iterations between stored samples, where \(K_\text {gap}\) is the chain’s thinning factor. That is, a sample is stored only when \(k > K_\text {burn}\) and \(\mathtt{mod}(k - K_\text {burn}, K_\text {gap}) = 0\), where \(\mathtt{mod} (\cdot , \cdot )\) denotes the remainder after division. A Metropolis-Hastings step can also be introduced here to remove the estimation bias. We summarise the technique for drawing \(N_\text {live}\) live samples from the prior in Algorithm 3.

figure c
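The burn-in and thinning logic can be sketched as follows, using the iterative formula (41) without a Metropolis-Hastings correction. The prior \(f(x) = \mu \Vert x\Vert _1\) with an identity sparsifying transform, the starting point at the origin, and all parameter values are assumptions for illustration; the sketch is not a transcription of Algorithm 3.

```python
import math
import random

def soft(v, threshold):
    # Component-wise soft-thresholding, the proximal operator of threshold * ||.||_1.
    return [math.copysign(max(abs(a) - threshold, 0.0), a) for a in v]

def draw_live_samples(n_live, k_burn, k_gap, dim, mu, lam, delta, rng):
    x = [0.0] * dim  # assumed starting point
    samples = []
    k = 0
    while len(samples) < n_live:
        k += 1
        prox_f = soft(x, lam * mu)
        # Iterative formula (41): Moreau-Yosida regularised ULA step for the prior.
        x = [(1.0 - delta / (2.0 * lam)) * a + delta / (2.0 * lam) * p
             + math.sqrt(delta) * rng.gauss(0.0, 1.0)
             for a, p in zip(x, prox_f)]
        # Keep one sample every k_gap iterations once the burn-in has finished.
        if k > k_burn and (k - k_burn) % k_gap == 0:
            samples.append(list(x))
    return samples, k

rng = random.Random(0)
samples, n_steps = draw_live_samples(n_live=20, k_burn=100, k_gap=10, dim=3,
                                     mu=1.0, lam=0.1, delta=0.05, rng=rng)
print(len(samples), n_steps)
```

The total number of iterations is \(K_\text {burn} + N_\text {live} K_\text {gap}\), which makes the cost of initialisation explicit when choosing the burn-in length and thinning factor.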

4.4 Proximal nested sampling algorithm

After embedding Algorithms 2 and 3 into Algorithm 1 (i.e., the standard nested sampling algorithm), we obtain our proposed proximal nested sampling algorithm, which is summarised in Algorithm 4. Recall that Algorithm 2 generates a new single sample from the prior subject to the hard likelihood constraint, which is used to replace the sample with the lowest likelihood value in the live sample set. We suggest using a sample randomly selected from the live sample set as a starting point for Algorithm 2.

So far we have presented the proximal nested sampling framework in its most general form for arbitrary log-concave distributions, which is based on the iterative formula (36) or (38) to sample from the constrained prior. These iterative formulae involve computing proximal operators related to the log-prior and likelihood constraint, which we have not yet considered in further detail. In principle computing proximal operators involves solving a minimisation problem, although in many scenarios this can be solved analytically or otherwise efficient iterative algorithms can be used. In the following section we consider explicit forms of proximal nested sampling for common forms of the prior and likelihood, outlining explicitly how the required proximal operators can be computed.

figure d

Before concluding this section, we note that the proposed proximal nested sampling method summarised in Algorithm 4 seeks to provide a Bayesian model selection strategy that is computationally efficient, simple, robust, and easy to deploy, as opposed to a strategy that seeks to deliver optimal performance by using adaptive methods or by leveraging model-specific properties. For example, for some models with favourable factorisation properties, better results would be obtained by replacing ULA by a Gibbs sampler (see e.g. Lucka 2016). Similarly, for models that are close to isotropic, one could replace ULA with a proximal Markov kernel derived from the underdamped Langevin SDE, which includes a Hamiltonian term (see e.g. Melidonis et al. 2022; see also Footnote 3).

Such methods scale more efficiently to large models than the overdamped Langevin method used in this paper, but they are less robust to anisotropy, which is a common feature in Bayesian inverse problems. Moreover, one could also consider using an adaptive MALA kernel with a matrix-valued step-size taking into account second-order properties of the posterior distribution (Pereyra et al. 2016). Lastly, because the proposed proximal nested sampling method has been specifically designed for large models that are log-concave, it is not equipped with mechanisms to handle multi-modality. For problems involving multi-modality, we would recommend modifying the Markov kernel either by using some form of annealing (Neal 2001), or by using an adaptive importance sampling scheme (Martino et al. 2017). However, as mentioned previously, performing model selection for models that are both large and multi-modal is very difficult and remains an important perspective for future work.

5 Explicit forms of proximal nested sampling

In the general proximal nested sampling framework presented in Sect. 4 we considered arbitrary log-concave terms for the prior and likelihood and did not consider further how to compute the proximal operators related to those terms. We now exemplify our proposed proximal nested sampling framework with explicit forms for common priors and likelihoods used in high-dimensional signal and image processing problems. In particular, we outline explicitly how to compute the required proximal operators.

For illustration, we focus on sparsity-promoting priors corresponding to \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\), where \(\varvec{\Psi }^\dagger \in {\mathbb {C}}^{p \times d}\) represents a sparsifying transform, and Gaussian likelihoods corresponding to \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\), where \(y \in {\mathbb {C}}^m\) denotes measured data, \(x \in {\mathbb {R}}^d\) the underlying parameters, and \(\varvec{\Phi } \in {\mathbb {C}}^{m\times d}\) the measurement operator (model), although other common priors are also considered. For simplicity, although not essential, we assume \(\varvec{\Psi }\) is an orthonormal transformation, i.e., \(\varvec{\Psi }^{\dagger }\varvec{\Psi } =\varvec{\Psi }\varvec{\Psi }^{\dagger } = I\).

From the iterative forms given in (36), (38) and (41), on which our proximal nested sampling framework is based, it is necessary to compute two proximal operators: \(\text {prox}^{\lambda }_f(x)\) and \(\text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x)\), related to the prior and likelihood, respectively (recall that the definition of \(\chi _{\mathcal{B}_{\tau }}\) is related to likelihood function g; see (30)). In the following we calculate these two proximal operators for explicit expressions of f(x) and g(x) and show the corresponding explicit forms of the iterative formulas of (36), (38) and (41).

5.1 Proximal operator for the prior

When f(x) represents a flat prior or \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _2^2\) (Gaussian prior) it is differentiable with gradient

$$\begin{aligned} \nabla f(x) = 0 \quad \text {or} \quad \nabla f(x) = 2 \mu \varvec{\Psi }\varvec{\Psi }^\dagger x = 2 \mu x, \end{aligned}$$
(42)

respectively (here we use \(\varvec{\Psi }\varvec{\Psi }^\dagger = I\)). Clearly, there is no need to use the gradient of the Moreau-Yosida envelope, \(\nabla f^{\lambda }(x)\), to approximate \(\nabla f(x)\) when f(x) is differentiable.

When f(x) represents a sparsity-promoting Laplacian-type prior \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\), \(\forall x^\prime \in {\mathbb {R}}^d\), we have

$$\begin{aligned} \begin{aligned} \text {prox}^{\lambda }_f(x^\prime )&= \mathop {\mathrm{argmin}}_{x\in {\mathbb {R}}^d} \big \{ \mu \Vert \varvec{\Psi }^\dagger x\Vert _1 + \Vert x - x^\prime \Vert _2^2/2\lambda \big \} \\&= x^\prime +\varvec{\Psi } \left( \text {prox}^{\lambda \mu }_{\Vert \cdot \Vert _1} (\varvec{\Psi }^{\dagger } x^\prime ) -\varvec{\Psi }^{\dagger } x^\prime \right) \\&= x^\prime +\varvec{\Psi } \left( \text {soft}_{\lambda \mu }(\varvec{\Psi }^{\dagger } x^\prime ) -\varvec{\Psi }^{\dagger } x^\prime \right) , \end{aligned} \end{aligned}$$
(43)

where the second line follows by standard properties of the proximal operator (Combettes and Pesquet 2011) and where \(\text {soft}_{\lambda }(x) = (\text {soft}_{\lambda }(x_1), \text {soft}_{\lambda }(x_2), \cdots )\) is the soft-thresholding operator defined by

$$\begin{aligned} \text {soft}_{\lambda }(x_i) = {\left\{ \begin{array}{ll} 0, &{} |x_i| < \lambda , \\ x_i (|x_i| - \lambda ) / |x_i|, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(44)
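The soft-thresholding operator (44) and the proximal operator (43) can be sketched in a few lines. The example below uses a real \(2 \times 2\) orthonormal matrix as a stand-in for \(\varvec{\Psi }\) (so that the adjoint is the transpose); this matrix, the helper names and the test vector are assumptions for illustration.

```python
import math

def soft(v, threshold):
    # Soft-thresholding (44), applied component-wise.
    out = []
    for a in v:
        if abs(a) <= threshold:
            out.append(0.0)
        else:
            out.append(a * (abs(a) - threshold) / abs(a))
    return out

def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def prox_l1_analysis(x, psi, mu, lam):
    # Proximal operator (43) for f(x) = mu * ||Psi^dagger x||_1 with orthonormal Psi.
    coeffs = matvec(transpose(psi), x)  # Psi^dagger x (real Psi: adjoint = transpose)
    shrunk = soft(coeffs, lam * mu)
    correction = matvec(psi, [s - c for s, c in zip(shrunk, coeffs)])
    return [a + b for a, b in zip(x, correction)]

r = 1.0 / math.sqrt(2.0)
psi = [[r, r], [r, -r]]              # hypothetical orthonormal transform
x_prime = matvec(psi, [3.0, 0.5])    # signal whose coefficients are (3, 0.5)
result = prox_l1_analysis(x_prime, psi, mu=2.0, lam=0.5)
thresholded = soft([3.0, -0.5, -2.0], 1.0)
print(result, thresholded)
```

With threshold \(\lambda \mu = 1\) the coefficients \((3, 0.5)\) shrink to \((2, 0)\), so the output equals \(\varvec{\Psi }\) applied to \((2, 0)\), i.e. \((\sqrt{2}, \sqrt{2})\).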

5.2 Proximal operator for the likelihood

Consider the Gaussian likelihood corresponding to \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\). Recall that \(\chi _{{{\mathcal {B}}}_{\tau }} (x) = 0\) if \(x \in \{x \ \vert \ g(x) < \tau \}\) and otherwise \(\chi _{{{\mathcal {B}}}_{\tau }} (x) = +\infty \). We are to solve

$$\begin{aligned} \begin{aligned} \text {prox}^{\lambda }_{\chi _{{{\mathcal {B}}}_{\tau }}}(x^\prime )&= \mathop {\mathrm{argmin}}_{x \in {\mathbb {R}}^d} \big \{ \chi _{{{\mathcal {B}}}_{\tau }} (x) + \Vert x - x^\prime \Vert _2^2/2\lambda \big \} \\&= \mathop {\mathrm{argmin}}_{x\in {\mathbb {R}}^d} \big \{ \chi _{{{\mathcal {B}}}_{\tau }} (x) + \Vert x - x^\prime \Vert _2^2 \big \} \\&= \text {proj}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x^\prime ), \end{aligned} \end{aligned}$$
(45)

which is a projection onto set \({{\mathcal {B}}}_{\tau }\).

When the measurement operator is the identity, \(\varvec{\Phi } = I\) (e.g. in denoising problems), problem (45) is the projection onto the \(\ell _2\) ball centred at y with radius \(\sqrt{2 \tau \sigma ^2}\). In this case the proximal (projection) operator has the closed-form solution

$$\begin{aligned} \text {proj}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x) = {\left\{ \begin{array}{ll} x, &{} \text {if} \ x \in {{\mathcal {B}}}_{\tau }, \\ \frac{x - y }{\Vert x - y \Vert _2} \sqrt{2 \tau \sigma ^2} + y, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(46)
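A direct transcription of the closed form (46) for real-valued vectors is sketched below; the function name and the test values are hypothetical.

```python
import math

def project_likelihood_ball(x, y, tau, sigma):
    # Projection (46) onto {x : ||y - x||_2^2 / (2 sigma^2) <= tau}, i.e. the l2 ball
    # centred at y with radius sqrt(2 * tau * sigma^2).
    radius = math.sqrt(2.0 * tau * sigma ** 2)
    diff = [a - b for a, b in zip(x, y)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    return [b + radius * d / norm for b, d in zip(y, diff)]

y = [1.0, 1.0]
projected = project_likelihood_ball([4.0, 5.0], y, tau=0.5, sigma=1.0)
unchanged = project_likelihood_ball([1.2, 0.9], y, tau=0.5, sigma=1.0)
print(projected, unchanged)
```

Note that a point outside the set is mapped to the boundary, where \(g(x) = \tau \), so the strict inequality defining \({{\mathcal {B}}}_\tau \) is only attained in the limit.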

For the case where the measurement operator is not the identity, \(\Phi \ne I\), problem (45) is equivalent to finding an \(x\in {\mathbb {R}}^d\) satisfying

$$\begin{aligned} \min _{x\in {\mathbb {R}}^d} \big \{ \chi _{ {{\mathcal {B}}}^\prime _{\tau ^\prime } } (u) + \Vert x - x^\prime \Vert _2^2/2 \big \}, \quad \text {s.t.} \ \ u =\varvec{\Phi } x, \end{aligned}$$
(47)

where \({{\mathcal {B}}}^\prime _{\tau } := \{z \ \vert \ \Vert y - z\Vert _2^2 < \tau \}\) and \(\tau ^\prime = 2 \tau \sigma ^2\). Minimisation problem (47) can be solved by a variety of different optimisation methods, e.g. by the alternating direction method of multipliers (ADMM) and primal-dual algorithms (see, e.g., Parikh and Boyd 2013 and references therein for further details). In the following we present detailed procedures for using the ADMM and primal-dual algorithms to solve problem (47).

5.2.1 Computation using ADMM method

Firstly, the augmented Lagrangian of the minimisation problem (47) can be represented as

$$\begin{aligned} \Lambda (x, u, z):= & {} \chi _{ {{\mathcal {B}}}^\prime _{\tau ^\prime } } (u) + \frac{1}{2} \Vert x - x^\prime \Vert _2^2 + \beta z^\dagger (u -\varvec{\Phi } x)\nonumber \\&+ \frac{\beta }{2} \Vert u -\varvec{\Phi } x \Vert _2^2, \end{aligned}$$
(48)

for dual variable z and penalty parameter \(\beta > 0\). Starting from an initialisation \(x^{(0)}, z^{(0)}\), the augmented Lagrangian (48) can be minimised with respect to the variables u and x alternately, while updating the dual variable z using the dual ascent method to ensure the constraint \(u=\varvec{\Phi } x\) is satisfied for the final solution, i.e.

$$\begin{aligned} u^{(i)}&= \mathop {\mathrm{argmin}}_{u \in {\mathbb {C}}^m} \Lambda (x^{(i)},u,z^{(i)}) , \end{aligned}$$
(49)
$$\begin{aligned} x^{(i+1)}&= \mathop {\mathrm{argmin}}_{x \in {\mathbb {R}}^d} \Lambda (x,u^{(i)},z^{(i)}) , \end{aligned}$$
(50)
$$\begin{aligned} z^{(i+1)}&= z^{(i)} + u^{(i)} -\varvec{\Phi } x^{(i+1)}, \end{aligned}$$
(51)

which can be rewritten as the following explicit iterative scheme

$$\begin{aligned} u^{(i)}&= \mathop {\mathrm{argmin}}_{u \in {\mathbb {C}}^m} \Big \{ \chi _{ \mathcal{B}^\prime _{\tau ^\prime } } (u) + \frac{\beta }{2} \Vert u -\varvec{\Phi } x^{(i)} + z^{(i)} \Vert _2^2 \Big \} , \end{aligned}$$
(52)
$$\begin{aligned} x^{(i+1)}&= \mathop {\mathrm{argmin}}_{x\in {\mathbb {R}}^d} \Big \{ \frac{1}{2} \Vert x - x^\prime \Vert _2^2 + \frac{\beta }{2} \Vert u^{(i)} -\varvec{\Phi } x + z^{(i)} \Vert _2^2 \Big \}, \end{aligned}$$
(53)
$$\begin{aligned} z^{(i+1)}&= z^{(i)} + u^{(i)} -\varvec{\Phi } x^{(i+1)}. \end{aligned}$$
(54)

The solution to problem (52) has a closed-form expression since it is the projection onto a scaled and shifted \(\ell _2\) ball, i.e.,

$$\begin{aligned} u^{(i)} = {\left\{ \begin{array}{ll} \varvec{\Phi } x^{(i)} - z^{(i)}, &{} \text {if} \ \varvec{\Phi } x^{(i)} - z^{(i)} \in \mathcal{B}^\prime _{\tau ^\prime }, \\ \frac{\varvec{\Phi } x^{(i)} - z^{(i)} - y }{\Vert \varvec{\Phi } x^{(i)} - z^{(i)} - y \Vert _2} \sqrt{2 \tau \sigma ^2} + y, &{} \text {otherwise}. \end{array}\right. }\nonumber \\ \end{aligned}$$
(55)

Problem (53) is differentiable and so can be solved by gradient descent. It is straightforward to show that this problem is equivalent to solving the linear system w.r.t. x

$$\begin{aligned} (\beta \varvec{\Phi }^{\dagger }\varvec{\Phi } + I) x = x^\prime + \beta \varvec{\Phi }^{\dagger } (u^{(i)} + z^{(i)}), \end{aligned}$$
(56)

which can be solved by using iterative methods, with \((\beta \varvec{\Phi }^{\dagger }\varvec{\Phi } + I)\) positive definite.

The pseudo code to compute the proximal operator, \(\text {prox}_{\chi _{{{\mathcal {B}}}_{\tau }}}(x) \), using ADMM is summarised in Algorithm 5. Various stopping criteria can be considered, such as a maximum iteration number or the relative error of solutions at two consecutive iterations, i.e., \(\Vert x^{(i+1)} - x^{(i)}\Vert _2 / \Vert x^{(i)}\Vert _2\).

figure e
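The iterations (55), (56) and (54) can be sketched in Python for the special case of a real diagonal measurement operator, for which the linear system (56) decouples and can be solved component-wise. The diagonal structure, the fixed iteration count, the zero initial dual variable and the test values are assumptions made to keep the example self-contained; for a general \(\varvec{\Phi }\) an iterative linear solver would be needed instead.

```python
import math

def project_ball(v, centre, radius):
    diff = [a - c for a, c in zip(v, centre)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(v)
    return [c + radius * d / norm for c, d in zip(centre, diff)]

def admm_projection(x_prime, phi_diag, y, tau, sigma, beta=1.0, n_iter=1000):
    # Approximates prox of chi_{B_tau} at x_prime when Phi = diag(phi_diag) (assumed).
    radius = math.sqrt(2.0 * tau * sigma ** 2)
    x = list(x_prime)
    z = [0.0] * len(y)
    for _ in range(n_iter):
        # u-update (55): project Phi x - z onto the ball centred at y.
        u = project_ball([p * a - c for p, a, c in zip(phi_diag, x, z)], y, radius)
        # x-update (56): (beta Phi^T Phi + I) x = x' + beta Phi^T (u + z), solved entry-wise.
        x = [(xp + beta * p * (ui + zi)) / (beta * p * p + 1.0)
             for xp, p, ui, zi in zip(x_prime, phi_diag, u, z)]
        # Dual update (54).
        z = [zi + ui - p * a for zi, ui, p, a in zip(z, u, phi_diag, x)]
    return x

phi_diag, y = [2.0, 0.5], [2.0, 1.0]
x_proj = admm_projection([3.0, 4.0], phi_diag, y, tau=0.5, sigma=1.0)
x_same = admm_projection([1.0, 2.0], phi_diag, y, tau=0.5, sigma=1.0)
residual = math.sqrt(sum((p * a - b) ** 2 for p, a, b in zip(phi_diag, x_proj, y)))
print(x_proj, residual, x_same)
```

In practice the fixed iteration count would be replaced by one of the stopping criteria mentioned above, such as the relative change between consecutive iterates.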

5.2.2 Computation using primal-dual method

Alternatively, problem (45) can be solved using a primal-dual method. Note that the problem can be rewritten as

$$\begin{aligned} \min _{x\in {\mathbb {R}}^d} \big \{ \chi _{ {{\mathcal {B}}}^\prime _{\tau ^\prime } } (\varvec{\Phi } x) + \Vert x - x^\prime \Vert _2^2/2 \big \}, \end{aligned}$$
(57)

which is equivalent to the saddle-point problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^d} \max _{z \in {\mathbb {C}}^K} \big \{ z^\dagger \varvec{\Phi } x - \chi ^*_{ \mathcal{B}^\prime _{\tau ^\prime } } (z) + \Vert x - x^\prime \Vert _2^2/2 \big \}, \end{aligned}$$
(58)

where \(\chi ^*_{{{\mathcal {B}}}^\prime _{\tau ^\prime }}\) is the convex conjugate of \( \chi _{{{\mathcal {B}}}^\prime _{\tau ^\prime }}\). The saddle-point problem (58) can be solved by alternately optimising with respect to the primal variable x and the dual variable z. Applying a proximal forward-backward step to each of these optimisations, first for the dual variable z and then for the primal variable x, leads to the following iterative scheme

$$\begin{aligned} z^{(i+1)}&= \text {prox}_{\chi ^*_{\mathcal{B}^\prime _{\tau ^\prime }}} (z^{(i)} + \delta _1\varvec{\Phi } {\bar{x}}^{(i)}), \end{aligned}$$
(59)
$$\begin{aligned} x^{(i+1)}&= \text {prox}_h (x^{(i)} - \delta _2\varvec{\Phi }^\dagger z^{(i+1)}), \end{aligned}$$
(60)
$$\begin{aligned} {\bar{x}}^{(i+1)}&= x^{(i+1)} + \delta _3 (x^{(i+1)} - x^{(i)}), \end{aligned}$$
(61)

where \(h(x) = \Vert x - x^\prime \Vert _2^2/2\), and \(\delta _k\), for \(k=1, 2, 3\), are algorithm step size parameters. We next consider how to solve problems (59) and (60) explicitly.

Problem (59) can be solved by

$$\begin{aligned} z^{(i+1)}&= \text {prox}_{\chi ^*_{{{\mathcal {B}}}^\prime _{\tau ^\prime }}} (z^{(i)} + \delta _1\varvec{\Phi } {\bar{x}}^{(i)}) \nonumber \\&= z^{(i)} + \delta _1\varvec{\Phi } {\bar{x}}^{(i)} - \text {prox}_{\chi _{{{\mathcal {B}}}^\prime _{\tau ^\prime }}} (z^{(i)} + \delta _1\varvec{\Phi } {\bar{x}}^{(i)}), \end{aligned}$$
(62)

where we have used the relationship (9) between the proximal operator of a function and that of its convex conjugate. Since \({{\mathcal {B}}}^\prime _{\tau ^\prime }\) is an \(\ell _2\) ball, the proximal operator in (62) has the closed-form expression

$$\begin{aligned} \text {prox}_{\chi _{{{\mathcal {B}}}^\prime _{\tau ^\prime }}} (z) = \text {proj}_{{{{\mathcal {B}}}^\prime _{\tau ^\prime }}} (z) = {\left\{ \begin{array}{ll} z, &{} \text {if} \ z \in {{\mathcal {B}}}^\prime _{\tau ^\prime }, \\ \frac{z - y }{\Vert z - y \Vert _2} \sqrt{2 \tau \sigma ^2} + y, &{} \text {otherwise}. \end{array}\right. }\nonumber \\ \end{aligned}$$
(63)
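The projection (63) and the dual update (62) can be sketched in a few lines. The function names are illustrative, and the argument `radius` stands for \(\sqrt{2 \tau \sigma ^2}\), which the caller is assumed to compute beforehand.

```python
import math


def project_l2_ball(z, y, radius):
    """Projection of z onto the ball {v : ||v - y||_2 <= radius}, as in (63)."""
    diff = [zi - yi for zi, yi in zip(z, y)]
    norm = math.sqrt(sum(abs(t) ** 2 for t in diff))
    if norm <= radius:
        return list(z)  # z already lies in the ball
    scale = radius / norm
    return [yi + scale * t for yi, t in zip(y, diff)]


def prox_conjugate_indicator(v, y, radius):
    """Dual update (62): v minus its projection onto the ball."""
    proj = project_l2_ball(v, y, radius)
    return [vi - pi for vi, pi in zip(v, proj)]
```

For instance, with \(y = 0\) and unit radius the point (3, 4) is mapped to (0.6, 0.8) by the projection, while points inside the ball are returned unchanged.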

Problem (60) is to solve

$$\begin{aligned} x^{(i+1)} = \mathop {\mathrm{argmin}}_{x\in {\mathbb {R}}^d} \big \{ \Vert x - x^\prime \Vert _2^2 + \Vert x - (x^{(i)} - \delta _2\varvec{\Phi }^\dagger z^{(i+1)})\Vert _2^2 \big \},\nonumber \\ \end{aligned}$$
(64)

whose objective is quadratic and can therefore be minimised analytically by setting its gradient to zero, yielding the closed-form solution

$$\begin{aligned} x^{(i+1)} = (x^\prime + x^{(i)} - \delta _2\varvec{\Phi }^\dagger z^{(i+1)})/2. \end{aligned}$$
(65)
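For completeness, the step from (64) to (65) can be written out as follows, where the shorthand \(v\) is introduced here only for readability. Setting the gradient of the objective in (64) to zero gives

$$\begin{aligned} 2 (x - x^\prime ) + 2 (x - v) = 0 \quad \Longrightarrow \quad x = (x^\prime + v)/2, \qquad v = x^{(i)} - \delta _2\varvec{\Phi }^\dagger z^{(i+1)}, \end{aligned}$$

and since the objective is strictly convex this stationary point is its unique minimiser.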

The pseudo code to compute the proximal operator, \(\text {prox}^{\lambda }_{\chi _{{{\mathcal {B}}}_{\tau }}}(x) \), using the primal-dual method is summarised in Algorithm 6. The same stopping criterion as for ADMM in Algorithm 5 can also be used for Algorithm 6.

Note that the main difference between the primal-dual method and ADMM is that the primal-dual method does not need to solve the linear system in (56). Therefore, the primal-dual method is typically more efficient computationally and is the approach used in the numerical experiments that follow. However, there are specific problems for which the linear system in (56) admits a computationally efficient solution and where the ADMM method might be more appropriate.

figure f
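The iterations (59)–(61), with the closed forms (62), (63) and (65), can be sketched as below. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 6: all three step sizes are set to one, which is our choice for an operator with norm at most one, a fixed iteration count replaces the stopping criterion, and `phi` and `phi_adj` are hypothetical callables applying \(\varvec{\Phi }\) and \(\varvec{\Phi }^\dagger \).

```python
import math


def project_l2_ball(v, y, radius):
    """Projection onto {u : ||u - y||_2 <= radius}, as in (63)."""
    diff = [vi - yi for vi, yi in zip(v, y)]
    norm = math.sqrt(sum(abs(t) ** 2 for t in diff))
    if norm <= radius:
        return list(v)
    return [yi + radius * t / norm for yi, t in zip(y, diff)]


def primal_dual_prox(x_prime, y, radius, phi, phi_adj, n_iter=200):
    """Iterate (59)-(61) with delta_1 = delta_2 = delta_3 = 1 (assumed values)."""
    x = list(x_prime)
    x_bar = list(x_prime)
    z = [0.0] * len(y)
    for _ in range(n_iter):
        # Dual update (62): v - proj(v), with v = z + Phi x_bar.
        v = [zi + pi for zi, pi in zip(z, phi(x_bar))]
        z = [vi - pi for vi, pi in zip(v, project_l2_ball(v, y, radius))]
        # Primal update (65).
        back = phi_adj(z)
        x_new = [(xp + xi - bi) / 2.0 for xp, xi, bi in zip(x_prime, x, back)]
        # Extrapolation (61).
        x_bar = [2.0 * xn - xi for xn, xi in zip(x_new, x)]
        x = x_new
    return x
```

A simple sanity check uses a selection operator that keeps the first two of three coordinates. The constrained coordinates should then converge to their projection onto the ball, while the unobserved coordinate should remain at its value in \(x^\prime \).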

5.3 Explicit iterative formula for drawing samples

We are now in a position to outline the explicit iterative formulas to draw samples for a variety of common priors using our proximal nested sampling method.

The explicit representations of the iterative equations (36) (for differentiable f(x)) and (38) (for non-differentiable f(x)), which are used in Algorithm 2 to draw an individual sample from the prior under the hard likelihood constraint, for uniform, Gaussian and Laplacian priors, i.e. f(x) constant, \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _2^2\) and \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\), respectively, are

$$\begin{aligned} x^{(k+1)}&= (1- \frac{\delta }{2\lambda }) x^{(k)} + \frac{\delta }{2\lambda } {x^{*}}^{(k)} + \sqrt{\delta } w^{(k+1)}, \end{aligned}$$
(66)
$$\begin{aligned} x^{(k+1)}&= (1- \frac{\delta }{2\lambda } - {\delta \mu } ) x^{(k)} + \frac{\delta }{2\lambda } {x^{*}}^{(k)} + \sqrt{\delta } w^{(k+1)}, \end{aligned}$$
(67)
$$\begin{aligned} x^{(k+1)}&= (1- \frac{\delta }{2\lambda }) x^{(k)} + \frac{\delta }{2\lambda }\varvec{\Psi } \bigl (\text {soft}_{\lambda \mu }(\varvec{\Psi }^{\dagger } x^{(k)}) \nonumber \\&\quad -\varvec{\Psi }^{\dagger } x^{(k)} \bigr ) + \frac{\delta }{2\lambda } {x^{*}}^{(k)} + \sqrt{\delta } w^{(k+1)}, \end{aligned}$$
(68)

where \({x^{*}}^{(k)} = \text {prox}_{\chi _{{{\mathcal {B}}}_\tau }}(x^{(k)})\) is obtained using Algorithm 5 or 6.
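The three updates (66)–(68) share most of their terms, as the following sketch makes explicit. It assumes \(\varvec{\Psi } = I\), real-valued vectors stored as lists, and a precomputed `x_star`. The function names and the optional `noise` argument, which makes the step reproducible, are illustrative.

```python
import math
import random


def soft(v, threshold):
    """Element-wise soft-thresholding of a real vector."""
    return [math.copysign(max(abs(t) - threshold, 0.0), t) for t in v]


def sample_step(x, x_star, delta, lam, mu=0.0, prior="uniform", noise=None):
    """One update of (66), (67) or (68), assuming Psi = I."""
    if prior not in ("uniform", "gaussian", "laplace"):
        raise ValueError("prior must be 'uniform', 'gaussian' or 'laplace'")
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x]
    c = delta / (2.0 * lam)
    shrunk = soft(x, lam * mu) if prior == "laplace" else x
    new_x = []
    for xi, si, ti, wi in zip(x, x_star, shrunk, noise):
        value = (1.0 - c) * xi + c * si  # terms common to (66)-(68)
        if prior == "gaussian":
            value -= delta * mu * xi  # additional term in (67)
        elif prior == "laplace":
            value += c * (ti - xi)  # additional term in (68)
        new_x.append(value + math.sqrt(delta) * wi)
    return new_x
```

Passing `noise=[0.0] * len(x)` isolates the deterministic drift, which is convenient when checking an implementation against the formulas.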

Correspondingly, the explicit representations of equation (41), which is used in Algorithm 3 to draw \(N_\text {live}\) initial live samples from the prior distribution \(\pi (x)\) in the prior space \(\Omega \), are, respectively,

$$\begin{aligned} x^{(k+1)}&= x^{(k)} + \sqrt{\delta } w^{(k+1)}, \end{aligned}$$
(69)
$$\begin{aligned} x^{(k+1)}&= (1 - {\delta \mu } ) x^{(k)} + \sqrt{\delta } w^{(k+1)}, \end{aligned}$$
(70)
$$\begin{aligned} x^{(k+1)}&= x^{(k)} + \frac{\delta }{2\lambda }\varvec{\Psi } \bigl (\text {soft}_{\lambda \mu }(\varvec{\Psi }^{\dagger } x^{(k)}) -\varvec{\Psi }^{\dagger } x^{(k)} \bigr ) + \sqrt{\delta } w^{(k+1)}. \end{aligned}$$
(71)
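To give a feel for the role of the step size \(\delta \) in (70), the sketch below tracks the per-coordinate variance of the iterates. It assumes that the entries of \(w^{(k+1)}\) are independent standard normal variables and that \(0< \delta \mu < 2\). Under these assumptions the variance obeys a simple linear recursion whose fixed point is \(1/(\mu (2 - \delta \mu ))\), slightly above the value \(1/(2\mu )\) of the Gaussian prior. The function names are illustrative.

```python
def stationary_variance(delta, mu, n_iter=10_000):
    """Iterate the per-coordinate variance recursion implied by (70)."""
    var = 0.0
    for _ in range(n_iter):
        var = (1.0 - delta * mu) ** 2 * var + delta
    return var


def closed_form_variance(delta, mu):
    """Fixed point of the recursion, valid for 0 < delta * mu < 2."""
    return 1.0 / (mu * (2.0 - delta * mu))
```

The gap between the two variances shrinks as \(\delta \rightarrow 0\), which illustrates the usual trade-off between discretisation bias and mixing speed for unadjusted Langevin-type schemes.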

We conclude this section with a brief discussion of the types of priors that the proposed proximal nested sampling method supports. While any prior that is log-concave could be addressed by using proximal nested sampling, we only recommend using the method for priors with proximal operators that are easy to evaluate or to approximate numerically. This is the case for many models used in applied high-dimensional statistics, where inference is often conducted by using convex optimisation algorithms that also require computing proximal operators. For more details about how to evaluate proximal operators, their properties, and lists of functions with known mappings, please see Bauschke and Combettes (2011), Combettes and Pesquet (2011) and Parikh and Boyd (2013, Ch. 6). A library with MATLAB implementations of frequently used proximity mappings is also available online (see Footnote 4).

Moreover, since the proposed proximal nested sampling approach was specifically designed for models that are log-concave and with Bayesian imaging applications in mind, we anticipate that it will be mostly used with informative priors designed to regularise and stabilise high-dimensional estimation problems. As explained in Llorente et al. (2022), the marginal likelihood can be very sensitive to the choice of the prior. Therefore, it is important that the parameters of the prior are chosen carefully. In particular, we expect that proximal nested sampling will be used in combination with empirical Bayesian strategies that automatically adjust the parameters of the prior by maximum marginal likelihood estimation (see e.g. Vidal et al. 2020).

Furthermore, high-dimensional Bayesian models that are log-concave often result from a careful trade-off between modelling accuracy and computational tractability, and thus they are inherently misspecified (e.g., in the case of Bayesian imaging applications, one would not expect the prior to define a realistic generative model). Consequently, when using proximal nested sampling in this context one is inherently operating in an \({\mathcal {M}}\)-open Bayesian modelling paradigm, where none of the models under consideration are formally “true”. We refer the reader to Llorente et al. (2022) for more details about performing model selection in this context, as well as for details about prior sensitivity, objectivity, and the use of data-driven priors in Bayesian model selection.

6 Numerical experiments

In this section we validate our proposed proximal nested sampling method and demonstrate its utility on a range of illustrative problems.

We first validate our method on a problem with a Gaussian likelihood and Gaussian prior where the value of the marginal likelihood (Bayesian evidence) can be computed analytically. The dimensions of the problem considered range from low to very high, i.e. 2 to \(10^6\) dimensions.

Following on from this, we demonstrate the effectiveness of the proximal nested sampling method by applying it to two canonical imaging inverse problems, namely image denoising and image reconstruction. In particular, we demonstrate the use of proximal nested sampling for the principled Bayesian model selection of the sparsifying dictionary, the regularisation parameter (i.e. the \(\mu \) parameter of the prior) and the appropriate measurement operator when it may be misspecified. Furthermore, as already mentioned, the samples obtained by nested sampling approaches can also be used, as a by-product, to perform posterior inferences. This is critical in imaging problems in order to recover point estimates, e.g. restored images. Moreover, alternative forms of uncertainty quantification can also be obtained from other posterior inferences, e.g. variance estimates and posterior credible regions (see, e.g., Cai et al. 2018).

6.1 Implementation and computational resources

To perform the numerical experiments presented subsequently, the proximal nested sampling algorithms developed in this article were implemented in MATLAB (see Footnote 5). The numerical experiments performed to compute the marginal likelihood for low-dimensional problems (i.e., dimensions less than 200) were run on a MacBook laptop with an Intel i7 CPU and 16 GB of memory. A high-performance workstation, with 24 CPU cores, x86_64 architecture and 256 GB of memory, was used for high-dimensional problems.

6.2 Validation in high dimensions

We first consider the validation of the proximal nested sampling method. For ease of validation, we consider the prior potential \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _2^2\), with \(\mu = 1/2\), \(\varvec{\Psi } = I\), and the likelihood potential \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\), with \(\sigma = 1\), \(\varvec{\Phi } = I\). For this setting, we have a closed-form solution of the marginal likelihood value (see the Appendix for further details). Test data \(y\in {\mathbb {R}}^d\) are generated by

$$\begin{aligned} y = x + w, \end{aligned}$$
(72)

where x is a d-dimensional vector of uniformly distributed random numbers in \([0, 1]^d\), and w is a d-dimensional vector of normally distributed random numbers. Note that the underlying model used to generate the mock data does not match the prior \(\pi \) used here, but this does not affect the validation of the marginal likelihood calculation. Moreover, in imaging settings the prior is never perfectly specified. In the following, we consider increasing dimensions from \(d=2\) to \(d=10^6\). We separate the test into three parts: i) small models of dimension from \(d=2\) to \(d=200\), ii) moderately large models of dimension from \(d=2\) to \(d=10^5\), and iii) high dimensional models with \(d=10^6\).
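The closed-form value used for validation can be sketched as follows. The Appendix remains the authoritative derivation. The code below assumes that the quantity of interest is \(\int \exp \{-f(x) - g(x)\}\, \mathrm {d}x\) with \(\varvec{\Psi } = \varvec{\Phi } = I\), i.e. the likelihood is taken without its normalising constant, and that the noise w has unit variance. Completing the square in each coordinate then gives \((d/2)\log (\pi /a) - \mu \Vert y\Vert _2^2/(1 + 2\mu \sigma ^2)\) with \(a = \mu + 1/(2\sigma ^2)\). All function names are illustrative.

```python
import math
import random


def log_vz_closed_form(y, mu=0.5, sigma=1.0):
    """log of the integral of exp(-mu ||x||^2 - ||y - x||^2 / (2 sigma^2))."""
    a = mu + 1.0 / (2.0 * sigma ** 2)
    sq_norm = sum(t * t for t in y)
    return 0.5 * len(y) * math.log(math.pi / a) - mu * sq_norm / (1.0 + 2.0 * mu * sigma ** 2)


def log_vz_quadrature_1d(y0, mu=0.5, sigma=1.0, half_width=20.0, n=40001):
    """Trapezoidal check of the same integral in one dimension."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for k in range(n):
        x = -half_width + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * math.exp(-((y0 - x) ** 2) / (2.0 * sigma ** 2) - mu * x * x)
    return math.log(total * h)


def make_test_data(d, seed=0):
    """Mock data y = x + w as in (72), with unit-variance noise (assumed)."""
    rng = random.Random(seed)
    return [rng.random() + rng.gauss(0.0, 1.0) for _ in range(d)]
```

Because the integrand factorises over coordinates, the one-dimensional quadrature check is sufficient to test the formula, and the d-dimensional value is simply the sum of the per-coordinate terms.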

We first test our method for low-dimensional models (i.e., \(d < 200\)). For our proximal nested sampling method, we use \(N_\text {live} = 2\times 10^{2}\) live samples and \(N = 3\times 10^{3}\) dead samples, with a thinning factor of 10. We also compare our result with vanilla Monte Carlo (MC) integration where a uniform prior with integrand \(f\cdot g\) is utilised, with the number of samples set to \(10^{5}\). Figure 1 presents the results. Our proximal nested sampling method agrees well with the ground truth, whereas simple MC integration can only achieve acceptable results when the dimension is small, say \(d < 20\). The computation time for the problem with dimension 200 is approximately one minute.

Fig. 1

Validation of our proximal nested sampling technique (for dimensions 2–200) to compute the marginal likelihood (Bayesian evidence) for a scenario where a closed-form solution is accessible. The logarithm of the unnormalised prior volume (V) times the marginal likelihood value (\({{\mathcal {Z}}}\)) is plotted against the dimensions of the problem considered. The blue-circle line, red-asterisk line and the black-solid line show the results of MC integration, proximal nested sampling and the ground truth, respectively. We can clearly see that the results computed by proximal nested sampling agree with the ground truth well, whereas the result computed by MC integration with \(10^{5}\) samples can only achieve acceptable results when the dimension is below \(\sim 20\). The computation time for the problem with dimension 200 is approximately one minute.

Fig. 2

Validation of our proximal nested sampling technique (for dimensions up to \(10^5\)) to compute the marginal likelihood (Bayesian evidence) for a scenario where a closed-form solution is accessible. The logarithm of the unnormalised prior volume (V) times the marginal likelihood value (\({{\mathcal {Z}}}\)) is plotted against the dimensions of the problem considered. The red-asterisk line and the black-solid line show the results of proximal nested sampling and the ground truth, respectively. We can clearly see that the results computed by proximal nested sampling agree well with the ground truth. The computation time for the problem with dimension \(10^5\) is approximately 10 minutes.

Fig. 3

Images used to showcase the use of proximal nested sampling for Bayesian model selection in high-dimensional image processing problems. Panel (a): Cameraman grey-scale image; Panels (b)–(c): W28 and M31 radio galaxies normalised to [0, 1] and then shown in log10 scale (i.e. the numeric labels on the colour bar are the logarithms of the image intensity), respectively

We now test our proximal nested sampling method for high-dimensional cases. Results for dimensions of y up to \(10^{5}\) are given in Fig. 2, where we set the number of the live samples \(N_\text {live}=10^{3}\) and the number of dead samples \(N= 10^4\), with thinning factor 10 (we do not consider direct MC integration any further since it is already shown to fail for dimensions above \(\sim 20\)). These results again show that our proximal nested sampling method can achieve results in close agreement with the ground truth. The computation time for the problem with dimension \(10^5\) is approximately 10 minutes.

Fig. 4

Dictionary selection for an image denoising problem solved by proximal nested sampling (test image is cameraman). First row shows the clean image and noisy image. Second row shows the posterior mean images recovered by proximal nested sampling for priors with (sparsifying) transforms \(\varvec{\Psi } = I, \text {DB2}\) and \(\text {DB8}\), respectively, where the log-prior reads \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\). By eye, both DB2 and DB8 wavelets provide superior reconstruction fidelity compared to \(\varvec{\Psi }=I\). The model \(\varvec{\Psi } = \text {DB2}\) may also be judged to provide slightly superior performance to \(\varvec{\Psi }=\text {DB8}\)

Finally, we consider dimension \(10^6\) as an example to show that our proximal nested sampling method can be pushed to dimensions much higher than \(10^5\). With the same parameters as those used for dimension \(10^5\), ten runs were performed for a \(10^6\) dimensional setting of the same problem. The logarithm of the ground truth value was calculated to be \(2.3850 \times 10^5\). The mean of ten runs of proximal nested sampling was computed to be \(2.3851 \times 10^5\), with standard deviation \(0.0002\times 10^5\). The result computed by proximal nested sampling is in excellent agreement with the ground truth. The computation time for each run of the problem with dimension \(10^6\) is approximately 30 minutes.

6.3 Model selection in image processing

We now illustrate the application of proximal nested sampling for Bayesian model selection in imaging problems. In particular, we focus on two canonical problems, image denoising and image reconstruction, with different likelihoods and priors. We emphasise that Bayesian model selection for these imaging problems is not well addressed by existing techniques due to the high dimensions considered (i.e., higher than \(10^5\)) and the use of general log-concave priors (e.g., the sparsity-promoting Laplace-type priors that include \(\ell _1\) terms).

The three images in Fig. 3 are used in the experiments that follow: the Cameraman image, the W28 supernova remnant, and the HI region of the M31 galaxy, all with a size of \(256 \times 256\) pixels and with intensities in the range [0, 255]. Sparsity-promoting priors (which are not smooth) and Gaussian likelihoods are considered in the following experiments, of the form \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\) and \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\), respectively, where \(\mu \), \(\varvec{\Psi }\) and \(\varvec{\Phi }\) are set to different forms for model selection purposes.

Table 1 Marginal likelihood (Bayesian evidence) values computed by proximal nested sampling for Bayesian model selection of the sparsifying dictionary for an image denoising problem (see Fig. 4 for corresponding reconstructed images)

6.3.1 Prior model selection in image denoising: dictionary selection

For a standard denoising problem we apply proximal nested sampling to select the dictionary \(\varvec{\Psi }\) used for the sparsifying transform. The noisy image y is generated by \(y = x + w\), where x is the ground truth clean image and w is Gaussian noise with zero mean and standard deviation \(\sigma = \Vert x\Vert _{\infty }10^{-\text {SNR}/20}\), where \(\Vert \cdot \Vert _{\infty }\) is the infinity norm, and the input signal-to-noise ratio (SNR) is set to 20. We set \(\varvec{\Phi } = I\) in the likelihood \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\) (i.e., \(g(x) = \Vert y - x\Vert _2^2/{2\sigma ^2}\)). We then investigate the influence of different choices for \(\varvec{\Psi }\) in the prior term \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\), with \(\mu = 10^{5}\). Specifically, three forms of \(\varvec{\Psi }\) are considered, namely the identity (I), Daubechies 2 wavelets (DB2), and Daubechies 8 wavelets (DB8). For the proximal nested sampling method, the numbers of live samples \(N_\text {live}\) and dead samples N are set to \(2\times 10^{3}\) and \(4\times 10^{4}\), respectively, with thinning factor \(10^2\), which is sufficient to ensure convergence.
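The noise model above translates directly into code. The following sketch assumes the image has been flattened into a list of intensities; the function names and the `seed` argument are illustrative.

```python
import random


def noise_std(x, snr):
    """sigma = ||x||_inf * 10^(-SNR / 20)."""
    return max(abs(t) for t in x) * 10.0 ** (-snr / 20.0)


def add_noise(x, snr, seed=0):
    """Return y = x + w with zero-mean Gaussian w, together with sigma."""
    rng = random.Random(seed)
    sigma = noise_std(x, snr)
    return [t + rng.gauss(0.0, sigma) for t in x], sigma
```

With an SNR of 20 the noise standard deviation is one tenth of the peak intensity, so an image with peak value 255 gives \(\sigma = 25.5\).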

Figure 4 presents the posterior means recovered (i.e. the reconstructed images) for the three dictionaries considered, i.e. for \(\varvec{\Psi } = \{ I, \text {DB2}, \text {DB8} \}\). It is clear that the reconstructed images corresponding to \(\varvec{\Psi } = \text {DB2}\) and \(\text {DB8}\) are significantly better than that for \(\varvec{\Psi } = I\). Moreover, while the difference between the reconstructed images of the models for \(\varvec{\Psi } = \text {DB2}\) and \(\text {DB8}\) is small, by eye the image recovered with DB2 may be judged slightly superior.

Table 1 presents the calculated marginal likelihood values (see Footnote 6) for the different sparsifying transforms \(\varvec{\Psi }\) selected for the prior. The root mean square error (RMSE) is also given, where the RMSE gauges the difference between the posterior mean image and the ground truth image. Note that the RMSE cannot normally be computed in practical problems since the ground truth is not known. Since we know the ground truth for these experiments, the RMSE is a useful measure for comparison purposes.

Fig. 5

Regularisation parameter selection for an image reconstruction problem solved by proximal nested sampling (test image is W28 radio galaxy). Images from left to right are the posterior mean images recovered by proximal nested sampling for \(\mu \) in the prior definition set to \(10^6, 10^7\) and \(10^8\), respectively. The data y are generated by measuring 30% of noisy Fourier coefficients of the test image. On close inspection it may be noticed that reconstruction for model with \(\mu = 10^6\) is superior to the one with \(\mu = 10^7\), which is superior to the one with \(\mu = 10^8\)

Table 1 shows that the model with \(\varvec{\Psi } = I\) possesses the smallest marginal likelihood value. This implies that for this denoising problem the model with \(\varvec{\Psi } = I\) is inferior to models where \(\varvec{\Psi }\) is set to DB2 and DB8. Moreover, although the marginal likelihood difference between the models where \(\varvec{\Psi }\) is set to DB2 or DB8 is not dramatic, it nevertheless indicates that DB2 is preferred. These findings inferred by Bayesian model selection agree with the RMSE values computed for each model, where the model with \(\varvec{\Psi }=\text {DB2}\) is slightly preferred over DB8, and both models with DB2 and DB8 are highly preferred over the model with \(\varvec{\Psi }=I\) (recall that in practice it is not possible to compute the RMSE since it requires knowledge of the underlying ground truth). Furthermore, the model preferences inferred by proximal nested sampling also agree with the by-eye assessment of reconstructed image quality discussed above. The results obtained are consistent with the common knowledge that it is typically more effective to denoise a natural image using a prior that promotes sparsity in some (sparsifying) transform domain (e.g. a wavelet domain) rather than in the image domain itself. The computation time for the problem with \(\varvec{\Psi } = I\) is approximately 10 minutes, and for the problem with \(\varvec{\Psi } = \text {DB2}\) or \(\text {DB8}\) it is approximately 60 minutes.

Note that in high-dimensional settings Bayes factors can be very large due to the concentration of probability in high dimensions, hence it is not meaningful to consider traditional scales for assessing model comparisons such as the Jeffreys scale (Nesseris and García-Bellido 2013). Instead, we recommend comparing marginal likelihood values directly.
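The practical consequence can be illustrated with a short sketch. The log-evidence values below are placeholders chosen for illustration only and are not taken from Table 1; the model labels and function name are likewise hypothetical.

```python
import math


def log_bayes_factor(log_z_a, log_z_b):
    """Difference of log marginal likelihoods, i.e. the log Bayes factor."""
    return log_z_a - log_z_b


# Placeholder log-evidence values (NOT results from this article).
log_z = {"model_a": -1.2345e5, "model_b": -1.2600e5}

diff = log_bayes_factor(log_z["model_a"], log_z["model_b"])

# Exponentiating such a difference overflows double precision,
# so comparisons are best carried out in the log domain.
try:
    bayes_factor = math.exp(diff)
except OverflowError:
    bayes_factor = math.inf
```

Working with differences of log marginal likelihoods avoids the overflow and keeps the ranking of the models unchanged.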

6.3.2 Prior model selection in image reconstruction: regularisation parameter selection

We now apply proximal nested sampling to a standard reconstruction problem and, firstly, consider the selection of the regularisation parameter \(\mu \) defining the width of the prior. It is typically very challenging to optimally set the regularisation parameter \(\mu \), which controls the strength of prior knowledge and plays a key role in reconstruction quality. Consider noisy observations (noisy measurements)

$$\begin{aligned} y =\varvec{\Phi } x + w, \end{aligned}$$
(73)

where w again denotes Gaussian noise with zero mean and standard deviation \(\sigma = \Vert x\Vert _{\infty }10^{-\text {SNR}/20}\), with SNR set to 30, and m and d are respectively the dimensions of the data y and the image x. Consider the prior \(f(x) = \mu \Vert \varvec{\Psi }^\dagger x\Vert _1\), with \(\varvec{\Psi } = \text {DB8}\), and likelihood \(g(x) = \Vert y -\varvec{\Phi } x\Vert _2^2/{2\sigma ^2}\). For the reconstruction scenario, \(\varvec{\Phi }\) represents the sensing (measurement) operator. In particular, we consider a measurement model comprising incomplete Fourier measurements (common in radio interferometric and magnetic resonance imaging) defined by the sensing operator \({\varvec{\Phi }} = {\varvec{M}}{\varvec{F}}\), constructed from the Fourier transform \({\varvec{F}}\) followed by a selection mask \({\varvec{M}}\), which is generated randomly using a variable density sampling profile (Puy et al. 2011). We consider the scenario where only 30% of Fourier coefficients are measured, i.e. \(m = 0.3 d\). Note that different forms of the mask \({\varvec{M}}\) result in different sensing operators \(\varvec{\Phi }\).
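The structure \({\varvec{\Phi }} = {\varvec{M}}{\varvec{F}}\) and its adjoint can be sketched in one dimension as follows. This is a simplified illustration: it uses a direct (slow) unitary discrete Fourier transform and a fixed index set `kept` in place of the two-dimensional transform and the random variable density mask used in the experiments. All names are illustrative.

```python
import cmath
import math


def dft(x):
    """Unitary discrete Fourier transform of a 1-D sequence."""
    d = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / d) for n in range(d)) / math.sqrt(d)
        for k in range(d)
    ]


def idft(c):
    """Inverse (and adjoint) of the unitary transform above."""
    d = len(c)
    return [
        sum(c[k] * cmath.exp(2j * math.pi * k * n / d) for k in range(d)) / math.sqrt(d)
        for n in range(d)
    ]


def phi(x, kept):
    """Phi = M F: keep only the Fourier coefficients with indices in `kept`."""
    coeffs = dft(x)
    return [coeffs[k] for k in kept]


def phi_adj(z, kept, d):
    """Adjoint of Phi: zero-fill the unmeasured coefficients, then invert."""
    full = [0j] * d
    for k, value in zip(kept, z):
        full[k] = value
    return idft(full)
```

A useful check for any such pair is the adjoint identity \(\langle \varvec{\Phi } x, z\rangle = \langle x, \varvec{\Phi }^\dagger z\rangle \), which should hold to numerical precision for arbitrary x and z.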

Table 2 Marginal likelihood (Bayesian evidence) values computed by proximal nested sampling for Bayesian model selection of the regularisation parameter \(\mu \) for an image reconstruction problem (see Fig. 5 for corresponding reconstructed images)

Figure 5 presents the posterior means recovered by proximal nested sampling (i.e. the reconstructed images) for models with \(\mu \) set to \(10^6, 10^7\) and \(10^8\). It is difficult to assess the effectiveness of different regularisation parameters by eye, but on close inspection it may be noticed that the model with \(\mu = 10^6\) is superior to the one with \(\mu = 10^7\), which is superior to the one with \(\mu = 10^8\). The computation time for each problem is approximately 150 minutes.

Table 2 presents the marginal likelihood and RMSE values computed for the models with different regularisation parameters \(\mu \). The computed marginal likelihood for the model with \(\mu = 10^6\) is larger than the value for the model with \(\mu = 10^7\), which in turn is larger than the value for the model with \(\mu = 10^8\), suggesting that the model with \(\mu = 10^6\) is preferred. The computed marginal likelihoods are consistent with the model preferences obtained by comparing the RMSE of each model and by visual inspection. Recall that both RMSE and visual comparisons can only be performed here because the ground truth is available, and cannot be used for model comparison in practice. In summary, this example demonstrates that our proximal nested sampling method is capable of selecting superior regularisation parameters for models stemming from high-dimensional inverse problems.

6.3.3 Measurement model selection in image reconstruction

We now apply proximal nested sampling to the same reconstruction problem considered above (i.e. image reconstruction from noisy and incomplete Fourier measurements) but focus on the problem of misspecification of the measurement model \(\varvec{\Phi }\). Noisy observations y are generated by (73), measuring 10% of Fourier coefficients, i.e. with \(m = 0.1 d\).

We use the ground truth model \({\varvec{M}}_\text {truth}\) to simulate observation data y. However, when solving the resulting inverse problem we consider a number of different measurement models, not only the ground truth model \({\varvec{M}}_\text {truth}\) but also misspecified models \({\varvec{M}}_\gamma \), where \(\gamma > 0\) encodes the level of misspecification.

The method by which the model is misspecified is motivated by radio interferometric imaging. In radio interferometry, the coordinates of the Fourier coefficients acquired by the telescope are measured in units of (radio) wavelength. If the wavelength at which observations are made is misspecified, the coordinates of the Fourier coefficients acquired will be scaled. We model precisely this type of misspecification here to represent the case where the instrument wavelength is not calibrated accurately.

An incorrectly specified wavelength then simply acts to modify the mask of the ground truth measurement model \({\varvec{M}}_\text {truth}\). The misspecified model corresponding to mask \({\varvec{M}}_{\gamma }\), for misspecification parameter \(\gamma \), is generated by extending every measured position in \({\varvec{M}}_\text {truth}\) radially. Specifically, every measured position is extended radially along the line connecting it to the origin by a length of \(\gamma d_j\), \(j \in \Omega _\text {mask}\), where \(\gamma \) is the misspecification scaling factor, \(d_j\) is the distance from the original measured position j to the origin in \({\varvec{M}}_\text {truth}\), and \(\Omega _\text {mask}\) is the set which contains all the measured positions. It is worth mentioning that the larger the scaling factor \(\gamma \), the larger the distortion of \({\varvec{M}}_\gamma \) from the ground truth \({\varvec{M}}_\text {truth}\). Note also that \(\gamma =0\) corresponds to a correctly specified model, i.e. \({\varvec{M}}_{\gamma =0}={\varvec{M}}_\text {truth}\).
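The radial scaling can be sketched as follows, under our reading that each measured position moves outward so that its distance to the origin becomes \((1+\gamma ) d_j\), which is consistent with \(\gamma = 0\) recovering the ground truth mask. Positions are represented as hypothetical \((u, v)\) coordinate pairs relative to the origin. How the scaled positions are mapped back onto the pixel grid is not specified here and is therefore left out.

```python
import math


def misspecify_positions(positions, gamma):
    """Move each (u, v) position radially outward by gamma times its distance to the origin."""
    return [((1.0 + gamma) * u, (1.0 + gamma) * v) for u, v in positions]
```

For example, with \(\gamma = 0.03\) a sample at distance 5 from the origin moves to distance 5.15 along the same direction.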

For proximal nested sampling, the numbers of live samples \(N_\text {live}\) and dead samples N are set to \(2\times 10^{3}\) and \(3\times 10^{4}\), respectively, with thinning factor \(10^2\), which is sufficient to ensure convergence. The regularisation parameter \(\mu = 10^8\) is used for these experiments.

Fig. 6

Measurement model misspecification for an image reconstruction problem solved by proximal nested sampling (test image is M31 radio galaxy). Panel (a): dirty (back-projected) image \(\varvec{\Phi }^\dagger y\); Panels (b)–(f): posterior mean images recovered by proximal nested sampling for misspecified models \({\varvec{M}}_\gamma \), where increasing \(\gamma > 0\) corresponds to increasing levels of misspecification (and \(\gamma =0\) corresponds to the ground truth model). It is apparent by eye that the posterior mean image recovered with the ground truth model is the best and that the quality of the recovered posterior mean image degrades as the size of the misspecification scale parameter \(\gamma \) increases

Figure 6 presents the posterior means recovered (i.e. the reconstructed images) for models with \(\varvec{\Phi }_\gamma = {\varvec{M}}_\gamma {\varvec{F}}\) and \(\varvec{\Phi } = {\varvec{M}}_\text {truth}{\varvec{F}}\). Here misspecified models \({\varvec{M}}_{0.12}\), \({\varvec{M}}_{0.09}\), \({\varvec{M}}_{0.06}\) and \({\varvec{M}}_{0.03}\) are generated for misspecification scaling factors \(\gamma \) with values of 0.12, 0.09, 0.06 and 0.03, respectively. It is apparent by eye that the posterior mean image recovered with the ground truth model is the best and that the quality of the recovered posterior mean image degrades as the size of the misspecification scale parameter \(\gamma \) increases. The computation time for each problem is approximately 150 minutes.

Table 3 presents the marginal likelihood and RMSE values computed for the different models considered. The computed marginal likelihood is largest when the correct ground truth model is adopted in the likelihood. As the misspecification parameter \(\gamma \) is increased (corresponding to greater misspecification and less accurate models), the corresponding computed marginal likelihood values monotonically decrease (become more negative). For Bayesian model comparison, the model with the lowest misspecification parameter \(\gamma \) is always preferred. The computed marginal likelihoods are consistent with the model preferences obtained by comparing the RMSE of each model and by visual inspection (although recall that such tests cannot be used for model comparison in practice when the ground truth is not known).

7 Conclusions

Nested sampling provides an efficient computational framework to estimate the marginal likelihood (Bayesian evidence) for Bayesian model selection. It effectively re-parameterises the marginal likelihood into a one-dimensional integral of the likelihood with respect to the enclosed prior volume. The challenge of nested sampling is to sample from the prior distribution subject to a hard likelihood constraint. A variety of successful techniques have been developed to perform such sampling in low and moderate dimensional problems. However, existing approaches are not directly useful for imaging applications because they scale poorly to large problems and struggle to support models that are not smooth.

Table 3 Marginal likelihood (Bayesian evidence) values computed by proximal nested sampling for Bayesian model selection for measurement model misspecification for an image reconstruction problem (see Fig. 6 for corresponding reconstructed images)

In this article we presented the proximal nested sampling method that is specifically designed for Bayesian models that are log-concave, potentially very high-dimensional (\(d = 10^6\) and beyond), and potentially not smooth. This is achieved by exploiting tools from proximal calculus and Moreau-Yosida regularisation to efficiently sample from the prior subject to the hard likelihood constraint through a proximal MCMC approach. The resulting Markov chain iterations combine a gradient step that approximates a Langevin SDE that scales efficiently to large problems, with a projection term that acts to push the Markov chain back into the likelihood constraint set if it wanders outside of it, and a Metropolis-Hastings correction step to ensure the hard likelihood constraint is satisfied.

The proposed proximal nested sampling framework was implemented and validated on a Gaussian model for which the marginal likelihood can be calculated in closed form, showing excellent agreement between the values computed analytically and by proximal nested sampling, even in very high dimensions. The use of proximal nested sampling for principled Bayesian model selection was then showcased on a variety of imaging problems with non-smooth sparsity-promoting prior distributions. In particular, the model selection problems considered were dictionary selection and selection of the appropriate measurement model when it may be misspecified.
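As an illustration of why a Gaussian model is a convenient validation case, the sketch below evaluates a closed-form log-evidence and checks it against brute-force quadrature in one dimension. The specific model is an assumption made for this example and need not match the one used in the experiments: data \(y = x + n\) with noise \(n \sim {\mathcal {N}}(0, \sigma ^2 I)\) and prior \(x \sim {\mathcal {N}}(0, \tau ^2 I)\), so that marginally \(y \sim {\mathcal {N}}(0, (\sigma ^2 + \tau ^2) I)\). Function names are hypothetical.

```python
import numpy as np


def gaussian_log_evidence(y, sigma, tau):
    """Closed-form log Z for y = x + n, n ~ N(0, sigma^2 I), x ~ N(0, tau^2 I)."""
    d = y.size
    var = sigma**2 + tau**2  # marginal variance of each component of y
    return -0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * np.dot(y, y) / var


def quadrature_log_evidence_1d(y, sigma, tau, n_grid=200001, half_width=50.0):
    """Brute-force check in 1-D: integrate likelihood * prior on a fine grid."""
    x = np.linspace(-half_width, half_width, n_grid)
    dx = x[1] - x[0]
    log_integrand = (
        -0.5 * (y - x) ** 2 / sigma**2 - 0.5 * np.log(2.0 * np.pi * sigma**2)
        - 0.5 * x**2 / tau**2 - 0.5 * np.log(2.0 * np.pi * tau**2)
    )
    return np.log(np.sum(np.exp(log_integrand)) * dx)
```

The closed-form expression costs \({\mathcal {O}}(d)\) to evaluate, so a reference value remains available at dimensions where grid-based integration is impossible, which is what makes such a model useful for checking a sampler's evidence estimates.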

Proximal nested sampling allows Bayesian model selection to be performed in much higher dimensions than was previously possible, while also supporting the non-smooth priors that are widely used in imaging. It is our hope that proximal nested sampling will thus find widespread use for high-dimensional Bayesian model selection, particularly in the imaging sciences.

Important perspectives for future work include: a detailed theoretical analysis of the convergence properties of proximal nested sampling; an extension to (biased) accelerated proximal methods (Vargas et al. 2020); and an analysis of the properties of marginal maximum likelihood estimation for the class of models considered in this paper, such as estimator consistency for model selection in an \({\mathcal {M}}\)-closed setting and concentration in an \({\mathcal {M}}\)-open setting (Llorente et al. 2022). Moreover, it would be interesting to apply proximal nested sampling to other types of models, such as models with likelihood-based priors (Llorente et al. 2022), which can be handled straightforwardly by proximal nested sampling when the likelihood is log-concave. It would also be interesting to modify proximal nested sampling to tackle high-dimensional models that are multi-modal, particularly models with data-driven priors encoded by neural networks (see e.g. Mukherjee et al. 2022, Section 5).