Proximal nested sampling for high-dimensional Bayesian model selection

Bayesian model selection provides a powerful framework for objectively comparing models directly from observed data, without reference to ground truth data. However, Bayesian model selection requires the computation of the marginal likelihood (model evidence), which is computationally challenging, prohibiting its use in many high-dimensional Bayesian inverse problems. With Bayesian imaging applications in mind, in this work we present the proximal nested sampling methodology to objectively compare alternative Bayesian imaging models for applications that use images to inform decisions under uncertainty. The methodology is based on nested sampling, a Monte Carlo approach specialised for model comparison, and exploits proximal Markov chain Monte Carlo techniques to scale efficiently to large problems and to tackle models that are log-concave and not necessarily smooth (e.g., involving ℓ1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell _1$$\end{document} or total-variation priors). The proposed approach can be applied computationally to problems of dimension O(106)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {O}}(10^6)$$\end{document} and beyond, making it suitable for high-dimensional inverse imaging problems. It is validated on large Gaussian models, for which the likelihood is available analytically, and subsequently illustrated on a range of imaging problems where it is used to analyse different choices of dictionary and measurement model.


Introduction
High-dimensional inverse problems are ubiquitous in the data and imaging sciences, as well as in the physical and engineering sciences more generally.Due to limitations of the data observation process and measurement noise, or even just due to the nature of the problem at hand, most inverse problems encountered are seriously ill-conditioned or ill-posed (canonical examples include, e.g., medical and radio interferometric imaging; Durmus et al 2018; Cai et al 2019;Zhou et al 2020;Lunz et al 2021).Developing better methodology for solving challenging inverse problems is a significant focus of the community.The Bayesian statistical framework is currently one of the predominant frameworks to perform inference in inverse problems (Robert and Casella, 2004;Pereyra et al, 2016).The choice of the Bayesian model used has a profound impact on the solutions delivered, as alternative models can lead to significantly different point estimations and uncertainty quantification results.
In this article we develop methodology to objectively compare alternative Bayesian models in performing inference in the regime of high-dimensional inverse problems, directly form the observed data and in the absence of ground truth.Motivated by applications in computational imaging, we focus on the comparison of models with posterior distributions that are log-concave and potentially not smooth.In this context, model selection has been traditionally addressed through benchmark experiments involving ground truth data and expert supervision.However, for many applications it is difficult and expensive to produce reliable ground truth data.Moreover, for many problems it is simply impossible.Bayesian model selection provides a framework for selecting the most appropriate model directly from the observed data in an objective manner and without reference to ground truth data.
Bayesian model selection requires the computation of the marginal likelihood of the data -the average likelihood of a model over its prior probability space -which is also called the Bayesian evidence.This quantity is a key ingredient of model selection statistics such as Bayes factors and likelihood ratio tests (Robert, 2007).The computation of the marginal likelihood for highdimensional models is highly non-trivial because it requires the computation of integrals over the (high-dimensional) solution space.For example, in the context of Bayesian imaging problems, the dimension is given by the number of parameters (e.g.pixels) of interest, which frequently reach sizes of O(10 5 ) to O(10 6 ) and beyond.For such settings, the evaluation of the marginal likelihood has been previously considered to be computationally intractable.
Broadly speaking, general purpose Monte Carlo methods can only handle model selection tasks for problems of dimension O(10) to O(10 2 ) (for reviews see Clyde et al 2007;Friel and Wyse 2012;Llorente et al 2020).Nested sampling (Skilling, 2006), a state-of-the-art Monte Carlo strategy designed specifically for model selection, has enabled model selection for moderate dimensional problems of size O(10 2 ) to O(10 3 ) (Mukherjee et al, 2006;Feroz and Hobson, 2008;Feroz et al, 2009;Brewer et al, 2011;Feroz and Skilling, 2013;Handley et al, 2015).To the best of our knowledge, model selection for larger problems is currently possible only for models with very specific structures (e.g., conditionally Gaussian models; Harroue 2020).
In this work, we address the difficult computation of the marginal likelihood by proposing a new methodology that carefully integrates nested sampling (Skilling, 2006) with proximal Markov chain Monte Carlo (MCMC) (Pereyra, 2016;Durmus et al, 2018).This leads to a proximal nested sampling methodology specialised for comparing high-dimensional posterior distributions that are log-concave but potentially not smooth.The proposed approach can be applied computationally to log-concave models of dimension O(10 6 ) and beyond, making it suitable for model comparison in Bayesian imaging problems.We demonstrate the approach with a range of scientific imaging applications.
The remainder of the article is organised as follows.In Section 2 we recall the Bayesian model selection approach, highlight the associated computational challenges, and discuss proximal MCMC methodology for Bayesian computation for inverse problems with an underlying convex geometry.Section 3 recalls the standard nested sampling method.Our proposed proximal nested sampling framework is presented in general form in Section 4. In Section 5 explicit forms of proximal nested sampling are presented for common forms of the likelihood and prior that arise in imaging sciences.Experimental results validating the proposed method and showcasing its use in scientific imaging applications are reported in Section 6.Finally, we conclude in Section 7.

Bayesian inference for high-dimensional inverse problems
In this section we briefly recall the Bayesian decision-theoretic approach to model comparison, introduce some elements of convex analysis which are essential for our method, and review proximal MCMC methods, which are an important component of the proximal nested sampling methodology proposed in Section 4. We conclude the section by briefly explaining the computational difficulties encountered in high-dimensional Bayesian model selection and why it is necessary to develop new methodology for this task.Readers familiar with Bayesian model selection and with proximal MCMC methodology may prefer to skip this section and continue reading from Section 3.

Bayesian estimation and model selection
Let Ω ⊆ R d .We consider the estimation of a quantity of interest x ∈ Ω from observed data y.Bayesian methods address such problems by postulating a statistical model M relating x and y, from which estimators of x and other inferences can be derived.More precisely, M defines a joint probability distribution p(x, y|M) specified via the decomposition p(x, y|M) = p(y|x, M)p(x|M), where p(y|x, M) denotes the likelihood of x for the observed data y, and the marginal p(x|M) is the so-called prior of x.Following Bayes' theorem, inferences on x|y are then based on the posterior distribution p(x|y, M) = p(y|x, M)p(x|M) p(y|M) , which models our beliefs about x after observing y.With applications in Bayesian imaging sciences in mind, we focus on posterior distributions that are log-concave and assume that the potential function x → − log p(x|y, M) is convex lower semicontinuous (l.s.c.) on Ω, but possibly not smooth.This is an important class of models in modern Bayesian imaging sciences because it leads to point estimators that are by construction well-posed and that can be efficiently estimated by using scalable proximal convex optimisation and stochastic sampling methods (Kaipio and Somersalo, 2005;Robert and Casella, 2004;Pereyra et al, 2016).
We condition on M explicitly in (1) because our focus is model selection, where one entertains several alternative posterior distributions for x|y stemming from different underlying modelling assumptions.As a result, rather than the posterior p(x|y, M), our main object of interest is the marginal likelihood or model evidence which measures the likelihood of the observed data under model M, and which we use to objectively compare different models relating x and y (Robert, 2007).
Notice that the likelihood of the observed data y under the model M is essentially the expectation (or average value) of the likelihood function p(y|x, M) with respect to (w.r.t.) the prior p(x|M).Therefore, a model that allocates its prior mass to solutions that agree with the observed data achieves a large marginal likelihood value.Conversely, a low marginal likelihood value indicates that only a small proportion of the solutions favoured by the prior agree with the observed data.In other words, the marginal likelihood (2) measures the degree to which the observed data is in agreement with the assumptions of the model, and in doing so it provides a goodness-of-fit summary.Moreover, because all priors have the same total probability mass (i.e., Ω p(x)dx = 1), the likelihood (2) naturally incorporates Occams's razor, trading off model simplicity and accuracy and penalising over-fitting (Robert, 2007).Bayesian model selection arises from the common and natural inquiry of which model is the most suitable to analyse x|y from a set of models M 1 , . . ., M K available.For simplicity and without loss of generality, we suppose two alternative models M 1 and M 2 (the generalisation to additional models is straightforward).From Bayesian decision theory, to objectively compare the two models in settings without ground truth available, one should calculate the Bayes factor (Robert, 2007) where p(M 1 ) and p(M 2 ) denote the prior probabilities assigned to the two competing models, and where, from Bayes' theorem, we have that for any i ∈ {1, 2} By developing (3) we can easily express the Bayes factor as the likelihood ratio highlighting that ρ 12 is invariant to choice of the prior probabilities p(M 1 ) and p(M 2 ).If one assumes p(M 1 ) = p(M 2 ) = 1/2 to reflect the absence of prior information, then the factor also coincides with the posterior probability ratio p(M 1 |y)/p(M 2 |y).
Being a likelihood ratio, the factor ρ 12 is straightforward to read: if ρ 12 1, we prefer model M 1 over the alternative M 2 ; conversely, if ρ 12 1, we prefer model M 2 ; and if ρ 12 ≈ 1, we do not prefer either, inasmuch as the data y are insufficient for us to make an informed judgement.The fact that ρ 12 is a likelihood ratio is also appealing from a frequentist viewpoint, as it is associated with the most powerful test for these two model hypotheses (Casella and Berger, 2002).
Unfortunately, calculating ρ 12 is generally not possible in large-scale settings because the dimensionality of x renders the marginal likelihoods p(y|M 1 ) and p(y|M 2 ) computationally intractable.More precisely, the marginal likelihoods are doubly-intractable because they require computing two intractable integrals over the space of solutions Ω: the marginalisation of x denoted explicitly in (2); and the normalising constant of the priors p(x|M i ) when these are not available analytically, which otherwise implicitly also requires integrating over Ω.
It is worth emphasising at this point that this major difficulty related to model selection is not encountered when performing inferences with the posteriors p(x|y, M 1 ) and p(x|y, M 2 ) individually, as one can use MCMC methods to sample from p(x|y, M) without ever having to evaluate the marginal likelihood p(y|M).As a result, efficient Bayesian model selection remains an open problem in many areas of science and engineering that have widerly adopted Bayesian inference techniques for point estimation and uncertainty quantification.
In the following we briefly recall MCMC sampling methods derived from the overdamped Langevin diffusion process, particularly proximal MCMC techniques specialised for large models that are log-concave, and explain why it is necesary to modify them to enable efficient model comparison.

Convex analysis
Given a convex, proper, l.s.c.function h : R d → (−∞, +∞] and λ > 0, the proximal operator (Bauschke and Combettes, 2011) When λ = 1, we denote prox 1 h (x) by prox h (x) for simplicity.Let K be a closed convex set in R d and let χ K be the characteristic function for K, defined by χ K (x) = 0 if x ∈ K and +∞ otherwise.The proximal operator of χ K is the projection onto K, given by proj K (x) = argmin The convex conjugate of function h, denoted by h * , is defined as Its proximal operator can be related to the proximal operator of h by The λ-Moreau-Yosida envelope of h (Bauschke and Combettes, 2011) is given for any x ∈ R d and λ > 0 by The envelope h λ is continuously differentiable with Lipschitz gradient.In particular, using the proximal operator, the gradient of h λ can be written with λ simultaneously controlling the Lipschitz constant of ∇h λ as well as the error between h and its smooth approximation h λ .This approximation error can be made arbitrarily small by reducing λ, at the expense of deteriorating the regularity of ∇h λ , and consequently the speed of convergence of proximal Bayesian computation algorithms rely on h λ .

Proximal Langevin MCMC sampling
Consider the problem of calculating probabilities or expectations with respect to (w.r.t.) some distribution π(dx) which admits a density π(x) w.r.t. the usual d-dimensional Lebesgue measure.In the context of Bayesian inference, this is typically the posterior p(x|y, M).Evaluating expectations and probabilities w.r.t.π is non-trivial in problems of moderate and high dimension because of the integrals involved, which are usually beyond the scope of analytical and deterministic numerical integration schemes.These calculations are further complicated when the normalising constant of π is not known, as this requires evaluating an additional d-dimensional integral.Monte Carlo sampling methods address these difficulties by simulating a set of samples from π followed by Monte Carlo stochastic integration to compute probabilities and expectations w.r.t.π.While there are different ways of simulating samples from π, we focus on MCMC strategies where one proceeds by constructing a Markov chain that has π as its invariant stationary distribution.Again, there are different methods for constructing such Markov chains (see Robert and Casella 2004 for an excellent introduction to MCMC methodology and Green et al 2015 for a survey of recent developments in the Bayesian computation literature).
The fastest provably convergent MCMC methods for Bayesian inference models can be derived from the Langevin diffusion process, which we recall below.For simplicity, rather than presenting the approach in full generality, we focus our presentation on proximal overdamped Langevin sampling for nonsmooth models, which we later use in the proximal nested sampling method proposed in Section 4. For a more exhaustive introduction to the topic please see Vargas et al (2020, Section 2) and references therein.
Assume that π admits a decomposition π(x) ∝ exp{−f (x) − g(x)} for all x ∈ R d , where f ∈ C 1 (R d ) with ∇f Lipschitz continuous with constant L f , and where g is a proper l.s.c.function that is convex on R d but potentially non-smooth (e.g., g could encode constraints on the solution space and involve non-smooth regularisers such as the 1 norm).To simulate from π, we construct the overdamped Langevin stochastic differential equation (SDE) on R d given by Durmus et al ( 2018) where (W t ) t≥0 is a d-dimensional Brownian motion, g λ is the Moreau-Yosida envelop of g given by (10), λ > 0 is a smoothing parameter that we will discuss later, and x 0 ∈ R d .When x → f (x) + g λ (x) is convex, the SDE has a unique strong solution and X t converges exponentially fast (as t → ∞) to an invariant measure that is in the neighbourhood of π.
To use (12) for Bayesian computation, we use a numerical solver to compute a discrete-time approximation of X t over some time period t ∈ [0, T ]; the resulting discrete sample path constitutes our set of Monte Carlo samples.In particular, in this article we use the conventional Euler-Maruyama approximation where δ ∈ [0, 1/(L f + 1/λ)] is a given stepsize and (Z n ) n≥1 is a sequence of i.i.d.d -dimensional standard Gaussian random variables.This MCMC method is known as the Moreau-Yosida unadjusted Langevin algorithm (MYULA) (Durmus et al, 2018).The Markov chain ( 13) is usually implemented by using (11) and reads The smoothing parameter λ and the stepsize δ jointly control a bias-variance trade-off between the asymptotic estimation errors and non-asymptotic errors associated with using a finite number of iterations.The samples generated by ( 14) can be directly used for biased Monte Carlo estimation (Durmus et al, 2018).Alternatively, at the expense of additional computation, one can supplement each iteration of MYULA with an MH (Metropolis-Hastings) correction step to asymptotically remove the approximation errors related to the discretisation of the SDE and the use of g λ instead of g, leading to a type of Metropolis-adjusted Langevin algorithm (MALA) (see Pereyra 2016 for details).

Estimation of marginal likelihoods and Bayes factors
Let {X n } N n=1 be a set of samples from π (or an approximation of π), generated by using a proximal MCMC method or otherwise.Following a Monte Carlo integration approach, the expectation of any function φ : R d → R w.r.t.π is approximated by which, under assumptions, converges to the truth E π (φ) = Ω φ(x)π(x)dx as N increases (or to a biased estimate if the samples are not exactly from π).The accuracy of Monte Carlo estimates depends of course on the number of samples N and on the properties of the MCMC method used, but it also depends crucially on the variance Var π (φ).Unfortunately, Var π (φ) is often very large for the kinds of functions φ required for estimating the marginal likelihood (2) (and in some cases Var π (φ) is not even defined), leading to Monte Carlo estimators of the marginal that behave poorly (Newton and Raftery, 1994).As a result, it is difficult to use the samples {X n } N n=1 to perform model selection.Several strategies have been proposed to address the aforementioned difficulty and derive well-posed estimators for the marginal likelihood (2) and the Bayes factor (3) (for reviews of classical methods see Friel and Wyse 2012;Clyde et al 2007).
One avenue is to generate samples from a sequence of distributions bridging π to some tractable reference π 0 such as the prior distribution or a Gaussian approximation of π, e.g., thermodynamic integration (O' Ruanaidh and Fitzgerald, 1996) and annealed important sampling (Neal, 2001).Such strategies struggle with large problems because the number of intermediate distributions grows quickly as d increases.
Another promising approach to derive computationally efficient estimators is to construct Rao-Blackwellized estimators by carefully introducing auxiliary variables, as proposed in the seminal papers Chib (1995) and Chib and Jeliazkov (2001).This strategy has been successfully applied recently to signal and image processing models that are conditionally Gaussian given conjugate model hyper-parameters (Harroue, 2020).Some generalisations are possible, but constructing efficient Rao-Blackwellized estimators for more general classes of models, e.g., of the form (1), is highly non-trivial.
An alternative natural strategy for stable Monte Carlo estimators for (2) and ( 3) is to construct a truncated estimator by first using the samples {X n } N n=1 to identify a suitable truncating set A, followed by a sample average (15) only with the samples verifying X n ∈ A (Brosse et al, 2017).Although by construction well-posed, truncated estimators need to be de-biased by using the volume of A, which is usually very expensive to compute when the dimension d is large.From the results of Brosse et al (2017), we believe that this strategy is unlikely to produce scalable methods suitable for large problems.One can circumvent or simplify the calculation of the volume of A (e.g., see Durmus et al 2018), but in our experience the resulting estimators become unstable and are difficult to use.
Another alternative approach, which is agnostic to the sampling method, is the harmonic mean estimator (Newton and Raftery, 1994); although, in its original form the variance of the estimator can be very poorly behaved such that the estimator can be highly inaccurate in practice.Strategies to resolve this issue have been developed in the recently proposed learnt harmonic mean estimator (McEwen et al, 2022), which has been shown to be highly effective and can scale to dimension O(10 3 ) and beyond.Nevertheless, it may be challenging to scale this approach to the high-dimensional settings considered in this paper.
One can also consider the widely used Laplace's method (Tierney and Kadane, 1986), which relies on the assumption that the posterior distribution can be adequately approximated by a Gaussian distribution.Unfortunately, this is a strong assumption that often leads to inaccurate estimates in inverse problems that are ill-conditioned or ill-posed, particularly if d ≥ dim(y).Many other alternatives are described in the literature, e.g., the Savage-Dickey density ratio (Trotta, 2007) and Reversible Jump MCMC (Green, 1995), which are mainly useful for nested or small models.It is worth mentioning that there are also some model selection strategies that do not rely on the computation of the marginal likelihood (see, e.g., Kamary et al 2018;Pereyra and McLaughlin 2016); however these are usually very computationally intensive.
Finally, nested sampling provides a distinctively different approach for efficiently estimating ( 2) and (3) (Skilling, 2006).The key idea underpinning nested sampling is the re-parameterisation of the marginal likelihood (2) as a one-dimensional integral of the likelihood with respected to the enclosed prior volume.This greatly reduces the computation costs involved, provided that one can efficiently sample from the prior distribution subject to a hard constraint on the likelihood value.Nested sampling therefore shifts the computational challenge from the direct evaluation of a high-dimensional integral to sampling of the prior subject to a hard likelihood constraint.The generation of samples is challenging and previous works have considered a range of sampling strategies.For example, conventional MCMC sampling (Skilling, 2006), rejection sampling (e.g.Mukherjee et al 2006;Feroz and Hobson 2008;Feroz et al 2009), slice sampling (e.g.Handley et al 2015), and more advanced MCMC samplers such as Galilean Monte Carlo (Feroz and Skilling, 2013) and diffusive nested sampling (Brewer et al, 2011).Following over a decade of active research, nested sampling is now a well-established technique for computing the marginal likelihood that has found widespread application, particularly in astronomy (e.g.Feroz and Hobson, 2008;Feroz et al, 2009;Trotta, 2007).Nevertheless, broadly speaking, current nested sampling techniques remain restricted to moderate dimensional problems of size O(10 2 ) to O(10 3 ).
With imaging problems in mind, this article presents an efficient nested sampling methodology specifically designed for high-dimensional log-concave models of the form (1). A significant novelty of the proposed approach is that we address the difficult generation of samples by using a proximal MCMC technique that is naturally suited for dealing with high-dimensional log-concave distributions subject to hard convex constraints.Moreover, the proximal nature of the method straightforwardly allows the use of the nonsmooth priors that are frequently encountered in imaging (e.g., involving the sampling approach.The proposed proximal nested sampling methodology is presented in Section 4.

Nested sampling
For ease of notation, given a model M, let L(x) = p(y|x, M) denote the likelihood function, π(x) = p(x|M) the prior, and the marginal likelihood or evidence associated with a given model M (to simplify notation, we henceforth omit the dependence of Z and L on y).
Nested sampling (Skilling, 2006) was proposed specifically to facilitate the efficient evaluation of Z for Bayesian model selection, while also supporting posterior inferences.As mentioned previously, the calculation of the multidimensional marginal likelihood integral ( 16) is generally computationally intractable.Nested sampling addresses this difficulty by cleverly converting ( 16) to a one-dimensional integral by re-parameterising the likelihood in terms of the enclosed prior volume.In addition, nested sampling involves the prior via simulation and hence does not require knowledge of the prior normalising constant.As a result, it also circumvents the second level of intractability of Z that arises in imaging problems.
Let Ω L * = {x|L(x) > L * }, which groups the parameter space Ω into a series of nested subspaces according to the level-set or iso-likelihood contour L(x) = L * ≥ 0. Note that Ω L * =0 = Ω, since the likelihood values cannot be negative.Define the prior volume ξ by Note that ξ(0) = 1 and ξ(L max ) = 0, where L max is the maximum of the likelihood in Ω.Let L † (ξ) be the inverse of the prior volume ξ(L * ) such that L † (ξ(L * )) = L *1 , and assume it is a monotonically decreasing function of ξ (which, when L is continuous and π has connected support, is satisfied theoretically and up to practical numerical considerations that can be trivially overcome; Sivia and Skilling 2006).The marginal likelihood integral ( 16) can then be rewritten as which is a one-dimensional integral over the prior volume ξ.
To evaluate (18) in practice it is necessary to compute likelihood level-sets (iso-contours) L i , which correspond to prior volumes 0 < ξ i ≤ 1 satisfying (17).A strategy to generate the likelihoods L i and associated prior volumes ξ i is discussed in Section 3.2.Once the likelihoods L i = L † (ξ i ) are obtained, ( 18) can be used to evaluate the marginal likelihood, where {ξ i } N i=0 is a sequence of decreasing prior volumes, i.e., After discretising the integral (18) and associating each likelihood L i a quadrature weight w i , the marginal likelihood can be computed numerically using standard quadrature methods to give The simplest assignment of the quadrature weights is The trapezium rule can also be used, i.e., w i = (ξ i−1 + ξ i+1 )/2.The approximation error related to the discretisation of ( 18) can be made arbitrarily small by increasing N .

Posterior inferences
Posterior inferences can be easily computed once Z is found.Any sample taken randomly in the prior volume interval (ξ i−1 , ξ i ) is simply assigned an importance weight Samples with the assigned weights {p i } can then be used to calculate posterior inferences such as the posterior moments, probabilities, and credible regions.

Marginal likelihood evaluation
We now recall the basic procedure of the standard nested sampling framework for evaluating the marginal likelihood, i.e. to compute the summation (20).
In particular, it is necessary to generate samples of the likelihoods L i and to estimate the corresponding enclosed prior volume ξ i .Firstly, set the iteration number i = 0, the prior volume ξ 0 = 1, and draw N live live samples of the unknown image x from the prior distribution π(x).Secondly, remove the sample with the smallest likelihood, say L i+1 , from the live set and replace it with a new sample.This new sample is again drawn from the prior, but constrained to a higher likelihood than L i+1 .
It is necessary to then determine the prior volume ξ i+1 enclosed by the likelihood level-set (iso-contour) defined by L i+1 .This is estimated in a stochastic manner.The enclosed prior volume for each step i can be estimated by a shrinkage ratio (random variable) t i+1 , i.e. by ξ i+1 = t i+1 ξ i , where t i+1 follows the distribution2 Repeat the above step (removing the sample with the smallest likelihood and estimating the updated prior volume) until the entire prior volume (and the nested shells of likelihood) has been traversed.We finally obtain {L i } and {ξ i } which can then be used to compute the marginal likelihood by (20).Moreover, we also simultaneously obtain a set of samples of the parameter x comprising all the discarded (dead) samples and the N live final live samples, which can be used for posterior parameter inferences (refer to Section 3.1 for further detail).
The volume prior at step i of the nested sampling algorithm, is ξ i = i k=1 t k ; recall that t k is the shrinkage ratio and is independently distributed following the probability density function given in ( 22).Since the mean and standard deviation of log t are respectively we have log Ignoring uncertainty, one thus takes A convergence criteria for the nested sampling algorithm should be adopted.Terminating the algorithm too early or late should be avoided to ensure the marginal likelihood is estimated accurately without unnecessary additional computational cost.One stopping criterion is that the difference in marginal likelihood estimates between two iterations falls below a predefined threshold, while another is to ensure a sufficient number of dead samples is used.
The pseudo code for the nested sampling algorithm is given in Algorithm 1. Observe that the most challenging task in the nested sampling algorithm is drawing samples from the prior with the hard constraint that samples lie within Ω Li , i.e. within the space defined by the likelihood level-set (see lines 8-10 in Algorithm 1).This constrained sampling step is relatively easy in small problems but can become very computationally challenging as problem dimension increases.As a result, nested sampling is usually restricted to problems of moderate size.
Algorithm 1 Nested sampling algorithm Initialization: Data Y .Set Z = 0, ξ 0 = 1 and i = 0. Draw N live samples {x n } N live n=1 from the prior distribution π(x) in the prior space Ω.Output: Evidence Z and posterior probabilities {p i }.
for i = 1, . . ., until the stopping criterion reached -Find the lowest likelihood, say L i , in the set of live samples.
-Update evidence by Z = Z + L i w i .
-Draw a new sample from the prior distribution π(x) in the restricted parameter space Ω Li , and replace the individual sample associated with the lowest likelihood L i in the set of live samples.end for Update the evidence by Compute the posterior probability for each individual sample p i = L i w i /Z.

Error estimation
If the prior volumes {ξ i } considered in the discretised integral (20) used to evaluate the marginal likelihood could be assigned exactly, then the only error in the estimate of the marginal likelihood would be due to the discretisation of the integral, which is trivially O(1/N 2 ) and negligible when N is sufficiently large.However, since the shrinkage ratio t i is generated randomly, each prior volume ξ i is then assigned approximately, which tends to overwhelm the error brought by the discretisation of the integral and will therefore cause the dominant source of uncertainty in the final computed evidence Z.This uncertainty, fortunately, can be estimated easily.We recall below the error estimation scheme presented in Skilling (2006) using the entropy of the prior volumes.This approach is highly efficient since it does not require any additional sampling.
Let P(ξ) = L(ξ)/Z be the posterior distribution regarding the prior volume ξ.Then the negative relative entropy H can be defined as which can be computed directly from the obtained likelihoods {L i }, weights {w i } and the evidence Z.Following Skilling (2006), the standard deviation of the uncertainty of log Z using the nested sampling algorithm reads H/N live , i.e., log Z = log In Chopin and Robert (2010), it is established that, under some regularity conditions, the approximation error is asymptotically Gaussian in the limit N → ∞ and vanishes at the usual Monte Carlo rate O(N −1/2 ).Moreover, the error scales approximately linearly with the model dimension d.

Proximal nested sampling framework
The main difficulty in applying nested sampling to large inverse problems is to efficiently simulate from the prior distribution subject to a hard likelihood constraint.More precisely, at iteration i, the samples from the prior are constrained to the region Ω Li defined by the likelihood level-set corresponding to L i (i.e.where a new sample must have a likelihood value greater than L i at iteration i).
In this section we present our proposed proximal nested sampling method to address this challenging constrained sampling problem.Moreover, the proximal nature of the sampling method ensures that non-differentiable distributions, such as popular sparsity-promoting priors involving the 1 norm, are supported.We first present the methodology of proximal nested sampling for arbitrary log-concave distributions of the form (1). Explicit forms of proximal nested sampling for common choices of priors and likelihoods in imaging sciences are presented in Section 5.

General constrained sampling problem
Following (1) and adopting the notation of Section 3, assume that the prior and the likelihood are of the form π(x) = exp(−f (x)) and L(x) = exp(−g(x)), where f and g are convex l.s.c.(lower semicontinuous) functions on Ω.
We consider sampling from the prior π(x), such that L(x) > L * for some generic likelihood value L * > 0. Let ι L * (x) and χ L * (x) be the indicator function and characteristic function, respectively, defined as Since log is monotonic, L(x) > L * is equivalent to g(x) < τ , where Then it is apparent that χ L * (x), as a constraint for x, is equivalent to χ Bτ (x), where Note that taking logarithm of π L * (x) reads (32) In the following section we introduce our proximal nested sampling algorithm for parameter x to sample from the constrained prior distribution exp(−[f (x) + χ Bτ (x)]).

Drawing a sample from the constrained prior
Sampling distributions over Ω is usually challenging because of the dimensionality involved.Sampling from the constrained prior (32) is particularly difficult because of the hard constraint that x ∈ B τ , encoded in the characteristic function χ Bτ (x).Sampling is further complicated if the log-prior f (x) is not Lipschitz differentiable over Ω (e.g. for non-differentiable sparsitypromoting priors), since high-dimensional sampling methods rely heavily on gradient information.To circumvent these issues we adopt a proximal MCMC approach, which is particularly suitable for high-dimensional distributions that are log-concave but not smooth.More precisely, in a manner akin to Durmus et al (2018), we use the unadjusted Langevin algorithm (ULA) MCMC sampling strategy combined with Moreau-Yosida approximations of nondifferential terms, followed by Metropolis Hastings correction step to control the approximations made, as described in Pereyra (2016).
Using the ULA iterative formula, for each given τ (recall that τ corresponds to a likelihood value L * by τ = − log L * ; see ( 29)), we can generate the following Markov chain where δ > 0 is the step size and w k+1 ∼ N (0, 1 K ) (a K-sequence of standard Gaussian random variables).
The non-differentiable characteristic function χ Bτ (x) can be approximated by its Moreau-Yosida envelope χ λ Bτ (x), with approximation controlled by λ > 0. It is straightforward to show that where x * is the closest point in B τ to x, given by the projection of x onto B τ , i.e. x * = proj Bτ (x) = prox χ Bτ (x).Critically, the λ-Moreau-Yosida envelope is 1 λ -Lipschitz differentiable.Its gradient can be calculated directly from (34) or by noting (11), yielding Replacing the characteristic function by its Moreau-Yosida approximation in (33) , and noting the gradient (35), yields k+1) .(36) When f (x) is differentiable its gradient can be computed directly (we consider the case where f (x) is non-differentiable shortly).For differential log-priors f (x), (36) provides the general strategy for sampling from the prior subject to the hard likelihood constraint (with a subsequent Metropolis-Hasting step as discussed below).
If the sample x (k) is already in B τ , i.e. x ∈ B τ , the term ) disappears and the Markov chain iteration simply involves taking a noisy step to descent the gradient.In contrast, if ) , which acts to move the next iteration in the Markov chain in the direction of the projection of x (k) onto the convex set B τ .This term therefore acts to push the Markov chain back into the constraint set B τ if it wanders outside of it, although due to the Moreau-Yosida approximation of χ Bτ it does not guarantee the constraint is satisfied (the subsequent Metropolis-Hasting step does guarantee the hard likelihood constraint is satisfied as discussed below).
When f (x) is non-differentiable, it may be approximated by its differentiable Moreau-Yosida envelope f λ (x).By noting (11), the gradient of the term involving the sum of the two Moreau-Yosida approximations then reads Here we have used the same regularisation parameter λ > 0 for both approximations for notational brevity, although clearly different parameters can be considered for f λ (x) and χ λ Bτ (x) if desired.Replacing in (33) both f (x) and χ Bτ (x) by their Moreau-Yosida approximations, and noting the gradient (37), yields For non-differentiable log-concave priors, (38) provides the general strategy for sampling from the prior subject to the hard likelihood constraint.
To summarise, given a proper initial sample, say x (0) , we generate a Markov chain by iteratively applying the Markov kernel (36) if f is Lipschitz differentiable or the regularised surrogate (38) if it is not, which allows drawing samples from the prior that are likely to be within the likelihood iso-contour L * .This is the main challenge in nested sampling.
The Markov chains generated by ULA-type kernels exhibit some bias resulting from the discretisation of the Langevin stochastic differential equation and from the use of Moreau-Yosida regularisations.This bias can be asymptotically removed by introducing a Metropolis-Hasting correction step to ensure convergence to the required target density.In detail, at each iteration, a new candidate x generated using formula (36) or ( 38) is then accepted with probability min 1, where q(•|•) is a transition kernel, which we define by a Gaussian related to the ULA random component (following Pereyra 2016), i.e., If the candidate sample x is outside of B τ , i.e. x / ∈ B τ , then π L * (x ) = 0 and according to the Metropolis-Hasting update the candidate will not be accepted, ensuring the hard likelihood constraint is satisfied.
We summarise our proximal technique to draw an individual sample from the prior under the hard likelihood constraint in Algorithm 2.

Initialisation from the unconstrained prior
The initialisation of the nested sampling method is to draw N live samples {x n } N live n=1 from the prior distribution π(x) in the prior space Ω.If the logprior f (x) is differentiable this may be applied trivially with the ULA iterative formula.Otherwise f (x) may again be approximated by its Moreau-Yosida envelope and samples from the prior can be generated by the iterative formula To draw N live samples from the prior, it is necessary to first discard initial samples generated before converging on the target prior distribution.Initial samples corresponding to a number of burn-in iterations, say K burn , are discarded.Due to correlations between samples and the algorithm's memory footprint, the chain is thinned by discarding a number of intermediate iterations between samples (the chain's thinning factor), say (K gap − 1).That is, only the K gap -th sample generated by the iterative formula is kept.Only 1in-K gap samples are stored when k > K burn and mod(k − K burn , K gap ) = 0, where mod(•, •) represents modulus after division.A Metropolis-Hasting step can also be introduced here to remove the estimation bias.We summarise the technique for drawing N live live samples from the prior in Algorithm 3.
-Metropolis-Hasting step to remove the estimation bias.

Proximal nested sampling algorithm
After embedding Algorithms 2 and 3 into Algorithm 1 (i.e., the standard nested sampling algorithm), we obtain our proposed proximal nested sampling algorithm, which is summarised in Algorithm 4. Recall that Algorithm 2 generates a new single sample from the prior subject to the hard likelihood constraint, which is used to replace the sample with the lowest likelihood value in the live sample set.We suggest using a sample randomly selected from the live sample set as a starting point for Algorithm 2.
So far we have presented the proximal nested sampling framework in its most general form for arbitrary log-concave distributions, which is based on the iterative formula (36) or (38) to sample from the constrained prior.These iterative formula involve computing proximal operators related to the logprior and likelihood constraint, which we have not yet considered in further detail.In principle computing proximal operators involves solving a minimisation problem, although in many scenarios this can be solved analytically or otherwise efficient iterative algorithms can be used.In the following section we consider explicit forms of proximal nested sampling for common forms of the prior and likelihood, outlining explicitly how the required proximal operators can be computed.
Algorithm 4 Proximal nested sampling algorithm Initialization: Data Y .Set Z = 0, ξ 0 = 1 and i = 0. Using Algorithm 3 to draw N live samples {x n } N live n=1 from the prior distribution π(x) in the prior space Ω.Output: Evidence Z and posterior probabilities {p i }.
for i = 1, . . ., until the stopping criterion reached -Find the lowest likelihood, say L i , in the set of live samples.
-Update evidence by Z = Z + L i w i .
-Randomly select a sample, say x (0) , from the set of live samples.
-Use Algorithm 2 to draw a new sample x new = ProxSampleDraw(x (0) , L i ) from the prior distribution π(x) in the restricted parameter space Ω Li , and replace the individual sample x i,low by the newly drawn sample x new .end for Update the evidence by Compute the posterior probability for each individual sample p i = L i w i /Z.
Before concluding this section, we note that the proposed proximal nested sampling method summarised in Algorithm 4 seeks to provide a Bayesian model selection strategy that is computationally efficient, simple, robust, and easy to deploy, as opposed to a strategy that seeks to deliver optimal performance by using adaptive methods or by leveraging model-specific properties.For example, for some models with favourable factorisation properties, better results would be obtained by replacing ULA by a Gibbs sampler (see e.g.Lucka 2016).Similarly, for models that are close to isotropic, one could replace ULA with a proximal Markov kernel derived from the underdamped Langevin SDE, which includes a Hamiltonian term (see e.g.Melidonis et al 2022 3 ).Such methods scale more efficiently to large models than the overdamped Langevin method used in this paper, but they are less robust to anisotropy, which is a common feature in Bayesian inverse problems.Moreover, one could also consider using an adaptive MALA kernel with a matrix-valued step-size taking into account second-order properties of the posterior distribution (Pereyra et al, 2016).Lastly, because the proposed proximal nested sampling method has been specifically designed for large models that are log-concave, it is not equipped with mechanisms to handle multi-modality.For problems involving multi-modality, we would recommend modifying the Markov kernel either by using some form of annealing (Neal, 2001), or by using an adaptive importance sampling scheme (Martino et al, 2017).However, as mentioned previously, performing model selection for models that are both large and multi-modal is very difficult and remains an important perspective for future work.

Explicit forms of proximal nested sampling
In the general proximal nested sampling framework presented in Section 4 we considered arbitrary log-concave terms for the prior and likelihood and did not consider further how to compute the proximal operators related to those terms.We now exemplify our proposed proximal nested sampling framework with explicit forms for common priors and likelihoods used in high-dimensional signal and image processing problems.In particular, we outline explicitly how to compute the required proximal operators.
For illustration, we focus on sparsity-promoting priors corresponding to f (x) = µ Ψ † x 1 , where Ψ † ∈ C p×d represents a sparsifying transform, and Gaussian likelihoods corresponding to g(x) = y − Φx 2 2 /2σ 2 , where y ∈ C m denotes measured data, x ∈ R d the underlying parameters, and Φ ∈ C m×d the measurement operator (model), although other common priors are also considered.For simplicity, although not essential, we assume Ψ is an orthonormal transformation, i.e., Ψ † Ψ = ΨΨ † = I.
From the iterative forms given in ( 36), ( 38) and ( 41), on which our proximal nested sampling framework is based, it is necessary to compute two proximal operators: prox λ f (x) and prox χ Bτ (x), related to the prior and likelihood, respectively (recall that the definition of χ Bτ is related to likelihood function g; see (30)).In the following we calculate these two proximal operators for explicit expressions of f (x) and g(x) and show the corresponding explicit forms of the iterative formulas of ( 36), ( 38) and (41).models considered in this paper.They might fail to be geometrically ergodic, in which case the nested sampling scheme would also behave poorly (see Betancourt 2011 for an example of a nested sampling method based on Hamiltonian dynamics).

Proximal operator for the prior
When f (x) represents a flat prior or f respectively (here we use ΨΨ † = I).Obviously, there is no need to use the Moreau-Yosida envelope ∇f λ (x) to approximate ∇f (x) when f (x) is differentiable.
When f (x) represents a sparsity-promoting Laplacian-type prior where the second line follows by standard properties of the proximal operator (Combettes and Pesquet, 2011) and where soft λ (44)
For the case where the measurement operator is the identity, Φ = I, (e.g.denoising problems) then problem (45) is the projection onto the 2 ball with radius √ 2τ σ 2 .In this case the proximal (projection) operator has closed-form solution For the case where the measurement operator is not the identity, Φ = I, problem ( 45) is equivalent to finding an x ∈ R d satisfying min where B τ := {z | y − z 2 2 < τ } and τ = 2τ σ 2 .Minimisation problem (47) can be solved by a variety of different optimisation methods, e.g. by the alternating direction method of multipliers (ADMM) and primal-dual algorithms (see, e.g., Parikh and Boyd 2013 and references therein for further details).In the following we present detailed procedures for using the ADMM and primal-dual algorithms to solve problem (47).

Computation using ADMM method
Firstly, the augmented Lagrangian of the minimisation problem (47) can be represented as for dual variable z and penalty parameter β > 0. Starting from an initialisation x (0) , z (0) , the augmented Lagrangian of ( 48) can be minimised with respect to variables u and x alternatively, while updating the dual value z using the dual ascent method to ensure the constraint u = Φx is satisfied for the final solution, i.e.
), ( 49) which can be rewritten as the following explicit iterative scheme The solution to problem (52) has a closed-form expression since it is the projection onto a scaled and shifted 2 ball, i.e., Problem ( 53) is differentiable and so can be solved by gradient descent.It is straightforward to show that this problem is equivalent to solving the linear system w.r.t.x (βΦ † Φ + I)x = x + βΦ † (u (i) + z (i) ), ( 56) which can be solved by using iterative methods, with (βΦ † Φ + I) positive definite.
The pseudo code to compute the proximal operator, prox χ Bτ (x), using ADMM is summarised in Algorithm 5. Various stopping criteria can be considered, such as a maximum iteration number or the relative error of solutions at two consecutive iterations, i.e., x (i+1) − x (i)  2 / x (i) 2 .

Computation using primal-dual method
Alternatively, problem (45) can be solved using a primal-dual method.Note that the problem can be rewritten as min which is equivalent to the saddle-point problem min where χ * B τ is the convex conjugate of χ B τ .The saddle-point problem (58) can be solved by alternatively optimising with respect to the primal variable x and the dual variable z.Considering a proximal forward-background step for each alternate optimisation, first for the dual variable z followed by the primal variable x, leads to the following iterative scheme ), ( 59) where h(x) = x − x 2 2 /2, and δ k , for k = 1, 2, 3, are algorithm step size parameters.We next consider how to solve problem ( 59) and ( 60) explicitly.
Problem ( 59) can be solved by where we have noted the relationship between the proximal operator of the convex conjugate of a function given by ( 9).Since B τ is an 2 ball, the proximal operator in ( 62) has the closed-form expression Problem ( 60) is to solve which involves a differentiable objective function and so can be solved analytically, yielding the closed-form solution The pseudo code to compute the proximal operator, prox λ χ Bτ (x), using the primal-dual method is summarised in Algorithm 6.The same stopping criterion as for ADMM in Algorithm 5 can also be used for Algorithm 6.
Note that the main difference between the primal-dual method and ADMM is that the primal-dual method does not need to solve the linear system in (56).Therefore, the primal-dual method is typically more efficient computationally and is the approach used in the numerical experiments that follow.However, there are specific problems for which the linear system in (56) admits a computationally efficient solution and where the ADMM method might be more appropriate.

Explicit iterative formula for drawing samples
We are now in a position to outline the explicit iterative formulas to draw samples for a variety of common priors using our proximal nested sampling method.
Algorithm 2 to draw an individual sample from the prior under the hard likelihood constraint, for uniform, Gaussian and Laplacian priors, i.e. f (x) constant, where x * (k) = prox χ Bτ (x (k) ) is obtained using Algorithm 5 or 6.Correspondingly, the explicit representations of equation ( 41), which is used in Algorithm 3 to draw N live initial live samples from the prior distribution π(x) in the prior space Ω, are, respectively, We conclude this section with a brief discussion of the types of priors that the proposed proximal nested sampling method supports.While any prior that is log-concave could be addressed by using proximal nested sampling, we only recommend using the method for priors with proximal operators that are easy to evaluate or to approximate numerically.This is the case for many models used in applied high-dimensional statistics, where inference is often conducted by using convex optimisation algorithms that also require computing proximal operators.For more details about how to evaluate proximal operators, their properties, and lists of functions with known mappings please see Bauschke and Combettes (2011), Combettes and Pesquet (2011) and Parikh and Boyd (2013, Ch. 6).A library with MATLAB implementations of frequently used proximity mappings is also available online4 .
Moreover, since the proposed proximal nested sampling approach was specifically designed for models that are log-concave and with Bayesian imaging applications in mind, we anticipate that it will be mostly used with informative priors designed to regularise and stabilise high-dimensional estimation problems.As explained in Llorente et al (2022), the marginal likelihood can be very sensitive to the choice of the prior.Therefore, it is important that the parameters of the prior are chosen carefully.In particular, we expect that proximal nested sampling will be used in combination with empirical Bayesian strategies that automatically adjust the parameters of the prior by maximum marginal likelihood estimation (see e.g.Vidal et al 2020).
Furthermore, high-dimensional Bayesian models that are log-concave often result from a careful trade-off between modelling accuracy and computational tractability, and thus they are inherently misspecified (e.g., in the case of Bayesian imaging applications, one would not expect the prior to define a realistic generative model).Consequently, when using proximal nested sampling in this context one is inherently operating in an M-open Bayesian modelling paradigm, where none of the models under consideration are formally "true".We refer the reader to Llorente et al (2022) for more details about performing model selection in this context, as well as for details about prior sensitivity, objectivity, and the use of data-driven priors in Bayesian model selection.

Numerical experiments
In this section we validate our proposed proximal nested sampling method and demonstrate its utility on a range of illustrative problems.
We first validate our method on a problem with a Gaussian likelihood and Gaussian prior where the value of the marginal likelihood (Bayesian evidence) can be computed analytically.The dimensions of the problem considered range from low to very high, i.e. 2 to 10 6 dimensions.
Following on from this, we demonstrate the effectiveness of the proximal nested sampling method by applying it to two canonical imaging inverse problems, namely image denoising and image reconstruction.In particular, we demonstrate the use of proximal nested sampling for the principled Bayesian model selection of the sparsifying dictionary, the regularisation parameter (i.e. the µ parameter of the prior) and the appropriate measurement operator when it may be misspecified.Furthermore, as mentioned already, as a by-product the samples obtained by nested sampling approaches can also be used to perform posterior inferences.This is critical in imaging problems in order to recover point estimates, e.g.restored images.Moreover, alternative forms of uncertainty quantification can also be considered from other posterior inferences, e.g.variance estimates and posterior credible regions (see, e.g., Cai et al 2018).

Implementation and computational resources
To perform the numerical experiments presented subsequently, the proximal nested sampling algorithms developed in this article were implemented in MATLAB. 5The numerical experiments performed to compute the marginal likelihood for low-dimensional problems (i.e., dimensions less than 200) were run on a Macbook laptop with an i7 Intel CPU and memory of 16 GB.A highperformance workstation, with 24 CPU cores, x86 64 architecture and 256 GB memory, was used for high-dimensional problems.

Validation in high dimensions
We first consider the validation of the proximal nested sampling method.For ease of validation, we consider the prior potential f (x) = µ Ψ † x 2 2 , with µ = 1/2, Ψ = I, and the likelihood potential g(x) = y−Φx 2 2 /2σ 2 , with σ = 1, Φ = I.For this setting, we have a closed-form solution of the marginal likelihood value (see Appendix for further details).Test data y ∈ R d are generated by where x is an d-dimensional vector of uniformly distributed random numbers in [0, 1] d , and w is an d-dimensional vector of normally distributed random numbers.Note that the underlying model used to generate the mock data does not match the prior π used here, but that is fine for validation of the calculation of the marginal likelihood.Also, in imaging setting the prior is never perfectly specified.In the following, we consider increasing dimensions from d = 2 to d = 10 6 .We separate the test into three parts: i) small models of dimension from d = 2 to d = 200, ii) moderately large models of dimension from d = 2 to d = 10 5 , and iii) high dimensional models with d = 10 6 .We first test our method for low-dimensional models (i.e., d < 200).For our proximal nested sampling method, we use N live = 2 × 10 2 live samples and N = 3 × 10 3 dead samples, with a thinning factor of 10.We also compare our result with vanilla Monte Carlo (MC) integration where a uniform prior with integrand f • g is utilised, with the number of samples set to 10 5 .Fig. 1 presents the results.Our proximal nested sampling method agrees well with the ground truth, whereas simple MC integration can only achieve acceptable results when the dimension is small, say d < 20.The computation time for the problem with dimension 200 is approximately one minute.
We now test our proximal nested sampling method for high-dimensional cases.Results for dimensions of y up to 10 5 are given in Figure 2, where we set log(V × Z) Dimension Fig. 1 Validation of our proximal nested sampling technique (for dimensions 2 to 200) to compute the marginal likelihood (Bayesian evidence) for a scenario where a closed-form solution is accessible.The logarithm of the unnormalised prior volume (V ) times the marginal likelihood value (Z) is plotted against the dimensions of the problem considered.The bluecircle line, red-asterisk line and the black-solid line show the results of MC integration, proximal nested sampling and the ground truth, respectively.We can clearly see that the results computed by proximal nested sampling agree with the ground truth well, whereas the result computed by MC integration with 10 5 samples can only achieve acceptable results when the dimension is below ∼ 20.The computation time for the problem with dimension 200 is approximately one minute.
the number of the live samples N live = 10 3 and the number of dead samples N = 10 4 , with thinning factor 10 (we do not consider direct MC integration any further since it is already shown to fail for dimensions above ∼ 20).These results again show that our proximal nested sampling method can achieve results in close agreement with the ground truth.The computation time for the problem with dimension 10 5 is approximately 10 minutes.
Finally, we consider dimension 10 6 as an example to show that our proximal nested sampling method can be pushed to dimensions much higher than 10 5 .With the same parameters as that used for dimension 10 5 , ten runs were performed for a 10 6 dimensional setting of the same problem.The logarithm of the ground truth value was calculated to be 2.3850×10 5 .The mean of ten runs of proximal nested sampling was computed be to 2.3851 × 10 5 , with standard deviation 0.0002 × 10 5 .The result computed by proximal nested sampling is in excellent agreement with the ground truth.The computation time for each run of the problem with dimension 10 6 is approximately 30 minutes.

Model selection in image processing
We now illustrate the application of proximal nested sampling for Bayesian model selection in imaging problems.In particular, we focus on two canonical problems, image denoising and image reconstruction, with different likelihoods and priors.We emphasise that Bayesian model selection for these imaging log(V × Z) Dimension Fig. 2 Validation of our proximal nested sampling technique (for dimensions up to 10 5 ) to compute the marginal likelihood (Bayesian evidence) for a scenario where a closed-form solution is accessible.The logarithm of the unnormalised prior volume (V ) times the marginal likelihood value (Z) is plotted against the dimensions of the problem considered.The redasterisk line and the black-solid line show the results of proximal nested sampling and the ground truth, respectively.We can clearly see that the results computed by proximal nested sampling agrees with the ground truth well.The computation time for the problem with dimension 10 5 is approximately 10 minutes.
problems is not well addressed by existing techniques due to the high dimensions considered (i.e., higher than 10 5 ) and the use of general log-concave priors (e.g., like the sparsity promoting Laplace-type priors that include 1 terms).

Prior model selection in image denoising: dictionary selection
For a standard denoising problem we apply proximal nested sampling to select the dictionary Ψ used for the sparsifying transform.The noisy image y is generated by y = x + w, where x is the ground truth clean image and w is Gaussian noise with zero mean and standard deviation σ = x ∞ 10 −SNR/20 , where • ∞ is the infinity norm, and the input signal-to-noise ratio (SNR) is set to 20.Set Φ = I in the likelihood g(x) = y − Φx 2 2 /2σ 2 (i.e., g(x) = y − x 2 2 /2σ 2 ).We then investigate the influence of different choices for Ψ in the prior term f (x) = µ Ψ † x 1 , with µ = 10 5 .Specifically, three forms of Ψ are considered, namely the identity (I), Daubechies 2 wavelets (DB2), and Daubechies 8 wavelets (DB8).For the proximal nested sampling method, the number of the live samples N live and dead samples N is respectively set to 2 × 10 3 , and 4 × 10 4 with thinning factor 10 2 , which is sufficient to ensure convergence.Fig. 4 presents the posterior means recovered (i.e. the reconstructed images) for the three dictionaries considered, i.e. for Ψ = {I, DB2, DB8}.It is clear that the reconstructed images corresponding to Ψ = DB2 and DB8 are significantly better than that for Ψ = I.Moreover, while the difference between the reconstructed images of the models for Ψ = DB2 and DB8 is small, by eye the image recovered with DB2 may be judged slightly superior.
Table 1 Marginal likelihood (Bayesian evidence) values computed by proximal nested sampling for Bayesian model selection of the sparsifying dictionary for an image denoising problem (see Fig. 4 for corresponding reconstructed images).Sparsity-promoting (non-differentiable) priors are considered with (sparsifying) transforms Ψ = I, DB2 and DB8.Comparing models, Bayesian model selection afforded by proximal nested sampling suggests the model with the DB2 dictionary is superior, followed by DB8, both of which are far superior to the case where Ψ = I, which agrees with the RMSE (root mean square error) values and assessment performed by eye, which require the ground truth to be known.Table 1 presents the calculated marginal likelihood values6 for the different sparsifying transforms Ψ selected for the prior.The root mean square error (RMSE) is also given, where the RMSE gauges the difference between the posterior mean image and the ground truth image.Note that the RMSE cannot normally be computed in practical problems since the ground truth is not known.Since for these experiments we know the ground truth the RMSE is a useful measure for comparison purposes.
Table 1 shows that the model with Ψ = I possesses the smallest marginal likelihood value.This implies that for this denoising problem the model with Ψ = I is inferior to models where Ψ is set to DB2 and DB8.Moreover, the marginal likelihood difference between models where Ψ is set to DB2 or DB8 is not dramatic, nevertheless this implies that DB2 is preferred.These finding inferred by Bayesian model selection agree with the RSME values computed for each model, where the model with Φ = DB2 is slightly preferred over DB8, and both models with DB2 and DB8 are highly preferred over the model with Φ = I (recall that in practice it is not possible to compute the RMSE since it requires knowledge of the underlying ground truth).Furthermore, the model preferences inferred by proximal nested sampling also agree with the assessment of reconstructed image quality by-eye discussed above.The results obtained are consistent with common knowledge that it is typically more effective to denoise a natural image using a prior that promotes sparsity in some (sparsifying) transform domain (e.g. a wavelet domain) rather than in the image domain itself.The computation time for the problem with Ψ = I is approximately 10 minutes, and for the problem with Ψ = DB2 or DB8 is approximately 60 minutes.
In high-dimensional settings note that Bayes factors can be very large due to the concentration of probability in high-dimensions, hence it is not meaningful to consider traditional scales for assessing model comparisons such as the Jeffery's scale (Nesseris and García-Bellido, 2013).Instead, we recommend comparing marginal likelihood values directly.

Prior model selection in image reconstruction: regularisation parameter selection
We now apply proximal nested sampling to a standard reconstruction problem and, firstly, consider the selection of the regularisation parameter µ defining the width of the prior.It is typically very challenging to optimally set the regularisation parameter µ, which controls the strength of prior knowledge and plays a key role in reconstruction quality.Consider noisy observations (noisy measurements) y = Φx + w, (73) where w again denotes Gaussian noise with zero mean and σ = x ∞ 10 −SNR/20 (standard deviation), with SNR set to 30, and m and d are respectively the dimension of y and image x.Consider the prior f (x) = µ Ψ † x 1 , with Ψ = DB8, and likelihood g(x) = y − Φx 2 2 /2σ 2 .For the reconstruction scenario, Φ represents the sensing (measurement) operator.In particular, we consider a measurement model comprising incomplete Fourier measurements (common in radio interferometric and magnetic resonance imaging) defined by the sensing operator Φ = MF, constructed from the Fourier transform F followed by a selection mask M which is generated randomly through the variable density sampling profile (Puy et al, 2011).We consider the scenario where only 30% of Fourier coefficients are measured, i.e. m = 0.3d.Note that different forms of the mask M result in different sensing operators Φ.
Table 2 Marginal likelihood (Bayesian evidence) values computed by proximal nested sampling for Bayesian model selection of the regularisation parameter µ for an image reconstruction problem (see Fig. 5 for corresponding reconstructed images).Prior definition with µ set to 10 6 , 10 7 and 10 8 , respectively, are considered.Comparing models, Bayesian model selection afforded by proximal nested sampling suggests the model with µ = 10 6 is superior to the one with µ = 10 7 , which is superior to the one with µ = 10 8 , which agrees with the RMSE (root mean square error) values and assessment performed by eye, which require the ground truth to be known.is difficult to assess the effectiveness of different regularisation parameters by eye, but on close inspection it may be noticed that the model with µ = 10 6 is superior to the one with µ = 10 7 , which is superior to the one with µ = 10 8 .The computation time for each problem is approximately 150 minutes.Table 2 presents the marginal likelihood and RMSE values computed for the models with different regularisation parameters µ.The computed marginal likelihood for the model with µ = 10 6 is larger that the value for the model with µ = 10 7 , which is larger than the model with µ = 10 8 , suggesting the model with µ = 10 6 is preferred.The computed marginal likelihoods are consistent with the model preferences obtained by comparing the RMSE of each model and by visual inspection.Recall that both RMSE and visual comparisons can only be performed here where the ground truth is available and cannot be used for model comparison in practice.In summary, this example demonstrates that our proximal nested sampling method is capable of selecting superior regularisation parameters for models stemming from high-dimensional inverse problems.

Measurement model selection in image reconstruction
We now apply proximal nested sampling to the same reconstruction problem considered above (i.e.image reconstruction from noisy and incomplete Fourier measurements) but focus on the problem of misspecification of the measurement model Φ.Noisy observations Y are generated by (73), measuring 10% of Fourier coefficients, i.e. with m = 0.1d.
We use the ground truth model M truth to simulate observation data y.However, when solving the resulting inverse problem we consider a number of different measurement models, not only the ground truth model M truth but also misspecified models M γ , where γ > 0 encodes the level of misspecification.
The method by which the model is misspecified in motivated by radio interferometric imaging.In radio interferometry, the coordinates of the Fourier coefficients acquired by the telescope are measured in units of (radio) wavelength.If the wavelength at which observations are made is misspecified, the coordinates of the Fourier coefficients acquired will be scaled.We model precisely this type of misspecified model here to represent the case where the instrument wavelength is not calibrated accurately.
An incorrectly specified wavelength then simply acts to modify the mask of the ground truth measurement model M truth .The misspecified model corresponding to mask M γ , for misspecification parameter γ, is generated by extending every measured position in M truth radially.Specifically, every measured position is extended radially along the line connecting it to the origin to a length of γd j , j ∈ Ω mask , where γ is the misspecification scaling factor, d j is the distance from the original measured position j to the origin in M truth , and Ω mask is the set which contains all the measured positions.It is worth mentioning that the larger the scaling factor γ, the larger the distortion of M γ from the ground truth M truth .Note also that γ = 0 corresponds to a correctly specified model, i.e.M γ=0 = M truth .
For proximal nested sampling, the number of the live samples N live and dead samples N is respectively set to 2 × 10 3 and 3 × 10 4 with thinning factor 10 2 , which is sufficient to ensure convergence.Regularisation parameter µ = 10 8 is used for these experiments.
Fig. 6 presents the posterior means recovered (i.e. the reconstruction images) for models with Φ γ = M γ F and Φ = M truth F,.Here misspecified models M 0.12 , M 0.09 , M 0.06 and M 0.03 are generated for misspecification scaling factors γ with values of 0.12, 0.09, 0.06 and 0.03, respectively.It is apparent by eye that the posterior mean image recovered with the ground truth model is the best and that the quality of the recovered posterior mean image degrades as the size of the misspecification scale parameter γ increases.The computation time for each problem is approximately 150 minutes.6 for corresponding reconstructed images).Misspecified models are denoted Mγ , where increasing γ > 0 corresponds to increasing levels of misspecification (and γ = 0 corresponds to the ground truth model).Comparing models, Bayesian model selection afforded by proximal nested sampling suggests the model with the lowest misspecification parameter γ is always preferred, which also agrees with the RMSE (root mean square error) values and assessment performed by eye, which require the ground truth to be known.Table 3 presents the marginal likelihood and RMSE values computed for the different models considered.The computed marginal likelihood is largest when the correct ground truth model is adopted in the likelihood.As the misspecification parameter γ is increased (corresponding to greater misspecification and less accurate models), the corresponding computed marginal likelihood values monotonically decrease (become more negative).For Bayesian model comparison, the model with the lowest misspecification parameter γ is always preferred.The computed marginal likelihoods are consistent with the model preferences obtained by comparing the RMSE of each model and by visual inspection (although recall that such tests cannot be used for model comparison in practice when the ground truth is not known).

Conclusions
Nested sampling provides an efficient computational framework to estimate the marginal likelihood (Bayesian evidence) for Bayesian model selection.It effectively re-parameterises the marginal likelihood into a one-dimensional integral of the likelihood with respect to the enclosed prior volume.The challenge of nested sampling is to sample from the prior distribution subject to a hard likelihood constraint.A variety of successful techniques have been developed to perform such sampling in low and moderate dimensional problems.However, existing approaches are not directly useful for imaging applications because they scale poorly to large problems and struggle to support models that are not smooth.
In this article we presented the proximal nested sampling method that is specifically designed for Bayesian models that are log-concave, potentially very high-dimensional (d = 10 6 and beyond), and potentially not smooth.This is achieved by exploiting tools from proximal calculus and Moreau-Yosida regularisation to efficiently sample from the prior subject to the hard likelihood constraint through a proximal MCMC approach.The resulting Markov chain iterations combine a gradient step that approximates a Langevin SDE that scales efficiently to large problems, with a projection term that acts to push the Markov chain back into the likelihood constraint set if it wanders outside of it, and a Metropolis-Hastings correction step to ensure the hard likelihood constraint is satisfied.
The proposed proximal nested sampling framework was implemented and validated on a Gaussian model for which the marginal likelihood could be calculated in closed-form, showing excellent agreement between values computed analytical and by proximal nested sampling, even in very high dimensions.The use of proximal nested sampling for principled Bayesian model selection was then showcased on a variety of imaging problems with non-smooth sparsitypromoting prior distributions.In particular, model selection problems were considered related to dictionary selection, and selection of the appropriate measurement model when it may be misspecified.
Proximal nested sampling allows Bayesian model selection to be performed at a much higher dimension than that was previously possible, while also supporting non-smooth priors that are widely used in imaging.It is our hope that proximal nested sampling will thus find widespread use for high-dimensional Bayesian model selection, particularly in the imaging sciences.
Important perspectives for future work include: a detailed theoretical analysis of the convergence properties of proximal nested sampling; an extension to (biased) accelerated proximal methods (Vargas et al, 2020); and an analysis of the properties of marginal maximum likelihood estimation for the class of models considered in this paper, such as estimator consistency for model selection in an M-closed setting and concentration in an M-open setting (Llorente et al, 2022).Moreover, it would be interesting to apply proximal nested sampling to other types of models, such as models with likelihood-based priors (Llorente et al, 2022), which can be handled straightforwardly by proximal nested sampling when the likelihood is log-concave.It would also be interesting to modify proximal nested sampling to tackle high-dimensional models that are multi-modal, particularly models with data-driven priors encoded by neural networks (see e.g.Mukherjee et al 2022, Section 5).

Fig. 3
Fig. 3 Images used to showcase the use of proximal nested sampling for Bayesian model selection in high-dimensional image processing problems.Panel (a): Cameraman grey-scale image; Panels (b)-(c): W28 and M31 radio galaxies normalised to [0, 1] and then shown in log10 scale (i.e. the numeric labels on the colour bar are the logarithms of the image intensity), respectively.

Fig. 4
Fig. 4 Dictionary selection for an image denoising problem solved by proximal nested sampling (test image is cameraman).First row shows the clean image and noisy image.Second row shows the posterior mean images recovered by proximal nested sampling for priors with (sparsifying) transforms Ψ = I, DB2 and DB8, respectively, where the log-prior reads f (x) = µ Ψ † x 1 .By eye, both DB2 and DB8 wavelets provide superior reconstruction fidelity compared to Ψ = I.The model Ψ = DB2 may also be judged to provide slightly superior performance to Ψ = DB8.

Fig. 6
Fig. 6 Measurement model misspecification for an image reconstruction problem solved by proximal nested sampling (test image is M31 radio galaxy).Panel (a): dirty (back-projected) image Φ † Y ; Panels (b)-(f): posterior mean images recovered by proximal nested sampling for misspecified models Mγ , where increasing γ > 0 corresponds to increasing levels of misspecification (and γ = 0 corresponds to the ground truth model).It is apparent by eye that the posterior mean image recovered with the ground truth model is the best and that the quality of the recovered posterior mean image degrades as the size of the misspecification scale parameter γ increases.

Table 3
Marginal likelihood (Bayesian evidence values computed by proximal nested sampling for Bayesian model selection for measurement model misspecification for an image reconstruction problem (see Fig.