1 Introduction

Bayesian inverse problems arise in many applications in science and engineering. When performing Bayesian inversion, one tries to characterize a probability distribution on the unknown parameters of a model given some observed data. These data are typically subject to noise, and are often modeled as

$$\begin{aligned} y = A(x) + \epsilon , \end{aligned}$$
(1)

where \(A: \mathbb {R}^d \rightarrow \mathbb {R}^m\) is the forward operator, \(x\in \mathbb {R}^d\) is the unknown parameter, and \(\epsilon \in \mathbb {R}^m\) is the noise. In this work, we are concerned with high-dimensional Bayesian inverse problems with the posterior density given by

$$\begin{aligned} \pi (x) \propto \mathcal {L}(x)\pi _0(x). \end{aligned}$$
(2)

Here, \(\mathcal {L}:\mathbb {R}^d\rightarrow \mathbb {R}\) denotes the likelihood function of observing the data y given x, and \(\pi _0(x)\) is the prior density. The likelihood function models the relationship between the forward operator, the error model, and the data, as in (1). We note that our framework applies to any likelihood function.

This work is motivated by, but not limited to, Bayesian inference for image reconstruction. In these applications, heavy-tailed priors are especially popular to preserve sharp edges. To this end, heavy-tailed priors can be imposed directly Hosseini (2017); Markkanen et al. (2019), or via a hierarchical framework Uribe et al. (2022), on the differences between elements of the unknown parameter, i.e., pixels. Classical choices for heavy-tailed priors include \(\alpha \)-stable distributions, such as the Cauchy distribution Suuronen et al. (2023).

Instead of applying a prior formulation on the differences between pixels, we use the fact that natural signals, and therefore images, can be effectively represented in a sparse manner using adapted bases, such as point source bases and wavelet bases Cai et al. (2018). Then, one can use the so-called synthesis formulation \(s=W x\) to expand a signal s in a suitable basis W for which x is sparse Elad et al. (2007). In this case, the heavy-tailed Laplace distribution is a typical prior choice to enforce sparsity in x. Indeed, in Simoncelli (1999) it was found that the marginals of wavelet coefficients of photographic images are well approximated by the Laplace distribution.

Following the above arguments, we use a product-form Laplace prior with density equal to

$$\begin{aligned} \pi _0(x) \propto \exp \left( - \sum _{i=1}^d \delta _i |x_i|\right) , \end{aligned}$$

where \(\delta _i >0\) for all \(i\in \{1,2,\dots ,d\}\) are the rate parameters. Consequently, we express the posterior density as

$$\begin{aligned} \pi (x) \propto \mathcal {L}(x) \exp {\left( - \sum _{i=1}^d \delta _i |x_i|\right) }. \end{aligned}$$
(3)

We note that for the case where the forward operator is linear and the likelihood function is Gaussian (linear-Gaussian likelihood), the posterior density (3) can be characterized via the Bayesian LASSO Park and Casella (2008).
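
For concreteness, the unnormalized log-posterior corresponding to (3) can be evaluated as in the following minimal sketch, assuming a user-supplied log-likelihood function; the names are illustrative.

```python
import numpy as np

def log_posterior(x, log_likelihood, delta):
    """Unnormalized log-density of (3): log L(x) - sum_i delta_i |x_i|.

    log_likelihood -- user-supplied callable returning log L(x)
    delta          -- vector of rate parameters delta_i > 0
    """
    return log_likelihood(x) - np.sum(delta * np.abs(x))
```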

In real-world applications, Bayesian inference is often performed in a high-dimensional parameter space. For instance, in imaging science, the number of pixels is very large, resulting in parameter spaces with dimensions of order \(d=\mathcal {O}(10^4)\) or higher. When sampling from non-smooth densities such as (3), the proximal unadjusted Langevin algorithm (p-ULA) or the proximal Metropolis-adjusted Langevin algorithm (p-MALA) Pereyra (2016); Lau et al. (2023) can be used, but their performance deteriorates significantly with the dimensionality of the problem.

Inspired by the certified dimension reduction (CDR) methodology Zahm et al. (2022); Cui and Tong (2021); Li et al. (2023), we propose a new method, the certified coordinate selection, to select the components in x that contribute most to the update from the prior to the posterior. Hence, the efficiency of the aforementioned sampling methods can be improved substantially by restricting them to perform inference on the selected components only.

In principle, the CDR method consists in replacing the likelihood function with a ridge approximation \(x\mapsto \widetilde{\mathcal {L}}(U_r^\intercal x)\) for some matrix \(U_r\in \mathbb {R}^{d\times r}\) with \(r\ll d\) orthogonal columns, and some function \(\widetilde{\mathcal {L}}:\mathbb {R}^r\rightarrow \mathbb {R}\). The matrix \(U_r\) is determined by minimizing an error bound on the Kullback-Leibler (KL) divergence, which can be obtained via logarithmic Sobolev inequalities.

The CDR method has been successfully applied within a number of Bayesian updating strategies, such as the cross-entropy method Ehre et al. (2023), Stein-variational gradient descent Chen and Ghattas (xxxx), transport maps Brennan et al. (xxxx), and in Bayesian inference applied to rare event estimation Uribe et al. (2020). However, it has only been employed in cases where the prior is either Gaussian, or it is normalized by computing a map that pushes forward the original random variable to a standard Gaussian random variable Cui et al. (2022).

Applying the CDR method to a posterior as in (3) is not straightforward, because the Laplace prior does not satisfy the logarithmic Sobolev inequality (see Herbst’s argument in section 5.4.1 in Bakry et al. (2014)). However, the Laplace prior satisfies the Poincaré inequality Bakry et al. (2014), which can be used to bound the Hellinger distance instead of the KL divergence of the low-dimensional posterior approximation Cui and Tong (2021); Li et al. (2023).

In order to fully control the Poincaré constant, we restrict our analysis to coordinate selection matrices \(U_r\), so that \(U_r^\intercal x = (x_{\mathcal {I}_1},\ldots ,x_{\mathcal {I}_r})\), where \({\mathcal {I}}=\{\mathcal {I}_1,\ldots ,\mathcal {I}_r\}\subset \{1,\ldots ,d\}\) is the set of selected coordinates. That way, we also preserve the interpretability of the dimension reduction, since the reduced variables \(U_r^\intercal x\) are coordinates of the original vector x.

Our main contributions are as follows:

  • We propose to select the relevant coordinates based on the following diagnostic

    $$\begin{aligned} h_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \pi (x) \textrm{d}x, \end{aligned}$$

    and we show that the Hellinger distance between the exact and the approximated posterior can be explicitly bounded using \(h_i\).

  • We prove that in the case of a linear-Gaussian likelihood, we only need to estimate the posterior mean and the posterior covariance to compute the diagnostic.

  • We show for the above case how a smoothing approximation to the Laplace prior can be used to compute a diagnostic and define an efficient proposal for the preconditioned Metropolis-adjusted Langevin algorithm (MALA).

  • We test our methods on a 1D signal deblurring task, which is given in the synthesis formulation, and on a high-dimensional 2D super-resolution example.

The remainder of this paper is structured as follows. In Sect. 2, we present the theoretical part of our method, which comprises the posterior approximation and its certification. In Sect. 3, we outline the general approach to sample the approximate posterior and recall the pseudo-marginal Markov chain Monte Carlo (MCMC) algorithm, which can be used to sample the exact posterior. In Sect. 4, we present detailed methodology for the case of a linear-Gaussian likelihood. In Sect. 5, we test our methods on two numerical examples: a 1D deblurring example and a 2D super-resolution example. We draw the conclusions in Sect. 6.

2 Certified coordinate selection

In this section, we first show how we approximate the posterior and how this approximation can be controlled by an upper bound on the Hellinger distance. The result can then be used to compute a diagnostic \(h\in \mathbb {R}^d\), which ranks the coordinates based on their contribution to the update from the prior to the posterior.

2.1 Posterior approximation

We aim at identifying the set of components in the parameter vector \(x\in \mathbb {R}^d\) that are most informed by the data relative to prior information. To this end, we define the coordinate splitting

$$\begin{aligned} x = (x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}), \end{aligned}$$
(4)

where the set \({\mathcal {I}}\subset \{1,\ldots ,d\}\) contains the indices of the informed coordinates, and \({\mathcal {I}^c}=\{1,\dots , d\}{\setminus } \mathcal {I}\) includes the complementary indices. We refer to \(x_{_{\mathcal {I}}}\) as the selected coordinates. Notice that if the likelihood is almost constant in \(x_{_{\mathcal {I}^c}}\), the update from the prior to the posterior happens mainly on \(x_{_{\mathcal {I}}}\).

To formalize this idea, let us introduce the posterior approximation

$$\begin{aligned} \widetilde{\pi }(x) = \pi (x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}}), \end{aligned}$$
(5)

where \(\pi (x_{_{\mathcal {I}}})\) is the posterior marginal and \(\pi _0(x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\) is the conditional prior. Compared to the exact posterior, which can be factorized as \(\pi (x) = \pi (x_{_{\mathcal {I}}}) \pi (x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\), the approximation \(\widetilde{\pi }(x)\) essentially consists in replacing the conditional posterior \(\pi (x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\) with the conditional prior \(\pi _0(x_{_{\mathcal {I}^c}}|x_{_{\mathcal {I}}})\). Combining (5) and (2), we can define the following quasi-optimal approximate posterior

$$\begin{aligned} \widetilde{\pi }^\dagger (x) \propto \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}) \pi _0(x), \end{aligned}$$
(6)

where we call

$$\begin{aligned} \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}) = {\int _{\mathbb {R}^{\mathcal {I}^c}}} \mathcal {L}(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}}) \pi _0(x_{_{\mathcal {I}^c}}) \textrm{d}x_{_{\mathcal {I}^c}}\end{aligned}$$
(7)

the quasi-optimal reduced likelihood. As we show in the following proposition, the posterior approximation (6) is a quasi-optimal choice when certifying the approximation using the Hellinger distance

$$\begin{aligned} H\!\left( {\pi }, {\widetilde{\pi }} \right) ^2 = \frac{1}{2} \int _{\mathbb {R}^d} \left( \sqrt{\pi (x)} - \sqrt{\widetilde{\pi }(x)} \right) ^2 \textrm{d}x. \end{aligned}$$
(8)

Proposition 1

Let \(\pi (x) \propto \mathcal {L}(x) \pi _0(x)\) be a probability density on \(\mathbb {R}^d\) where \(\pi _0(x)\) is a product-form density, and let \(x=(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})\) be any coordinate splitting. Then, the reduced likelihood \(\widetilde{\mathcal {L}}:\mathbb {R}^{|\mathcal {I}|} \rightarrow \mathbb {R}_{+}\) which minimizes \(H\!\left( {\pi }, {\widetilde{\pi }} \right) \), where \(\widetilde{\pi }(x) \propto \widetilde{\mathcal {L}}(x_{_{\mathcal {I}}}) \pi _0(x)\), is given by

$$\begin{aligned} \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) = \left( {\int _{\mathbb {R}^{\mathcal {I}^c}}} \sqrt{ \mathcal {L}( x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}) } \pi _0(x_{_{\mathcal {I}^c}}) \textrm{d}x_{_{\mathcal {I}^c}} \right) ^2. \end{aligned}$$
(9)

We call \(\widetilde{\mathcal {L}}^*\) in (9) the optimal reduced likelihood. Correspondingly, we call

$$\begin{aligned} \widetilde{\pi }^*(x) \propto \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \pi _0(x) \end{aligned}$$
(10)

the optimal approximate posterior.

Moreover, \(\widetilde{\mathcal {L}}^\dagger \), as defined in (7), yields a quasi-optimal posterior approximation \(\widetilde{\pi }^\dagger \) with respect to the Hellinger distance in the sense that

$$\begin{aligned} H\!\left( {\pi }, { \widetilde{\pi }^\dagger } \right) ^2 \le 2 H\!\left( {\pi }, {\widetilde{\pi }^*} \right) ^2. \end{aligned}$$
(11)

Proof

See section A.1. \(\square \)
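
As a side remark, both reduced likelihoods can be estimated by Monte Carlo over the (product-form) prior of the complementary coordinates; the sketch below, with illustrative names, estimates (9) and (7) at a fixed \(x_{_{\mathcal {I}}}\).

```python
import numpy as np

def reduced_likelihoods_mc(x_sel, likelihood, sample_prior_comp, M):
    """Monte Carlo estimates of the optimal (9) and quasi-optimal (7) reduced likelihoods.

    likelihood(x_sel, x_comp) -- evaluates L(x) for the two coordinate blocks
    sample_prior_comp(M)      -- M i.i.d. draws from pi_0(x_{I^c})
    """
    z_comp = sample_prior_comp(M)
    L_vals = np.array([likelihood(x_sel, z) for z in z_comp])
    L_opt = np.mean(np.sqrt(L_vals)) ** 2   # eq. (9): (E[sqrt(L)])^2
    L_quasi = np.mean(L_vals)               # eq. (7): E[L]
    return L_opt, L_quasi
```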

2.2 Certifying the approximation

We now provide an upper bound on the Hellinger distance \(H\!\left( {\pi }, { \widetilde{\pi }^\dagger } \right) \) between the posterior defined in (3) and its quasi-optimal approximation (6).

Proposition 2

Consider the probability density defined in (3). Given a coordinate splitting \(x=(x_{_{\mathcal {I}}},x_{_{\mathcal {I}^c}})\), the quasi-optimal approximate posterior \(\widetilde{\pi }^\dagger (x)\) given in (6) satisfies

$$\begin{aligned} H\!\left( {\pi }, {\widetilde{\pi }^\dagger } \right) ^2 \le {2} \sum _{i\in \mathcal {I}^c} h_i, \end{aligned}$$
(12)

where the diagnostic \(h \in \mathbb {R}^d\) is given by

$$\begin{aligned} h_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \pi (x) \textrm{d}x. \end{aligned}$$
(13)

Proof

See section A.2. \(\square \)

Remark 1

In the case of non-negativity constraints on x, the analogue of the Laplace prior is the exponential prior. The upper bound on the Hellinger distance is the same in this case, since both distributions share the same Poincaré constant.

With the diagnostic h, the coordinate splitting can be performed by finding \(\mathcal {I}^c\) such that

$$\begin{aligned} {2} \sum _{i\in \mathcal {I}^c} h_i \le \tau , \end{aligned}$$
(14)

where \(\tau \) is a given desired precision on the Hellinger distance. Accordingly, the set \(\mathcal {I}\) contains the indices i associated with the \(r(\tau )\) largest components of h.

Notice that the number of selected coordinates \(r(\tau )\) can be undesirably large, especially if the bound (12) is loose. In this case, we set \(r=\min (r(\tau ), r_{\max })\) for a prescribed \(r_{\max }\) and let \(\mathcal {I}\) contain the indices of the r largest components of h.
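
A minimal sketch of this selection rule, combining (14) with the cap \(r_{\max }\), could look as follows (NumPy-based, with illustrative names).

```python
import numpy as np

def select_coordinates(h, tau, r_max=None):
    """Coordinate splitting from the diagnostic h, cf. (14).

    Discards as many of the smallest h_i as possible while keeping
    2 * sum_{i in I^c} h_i <= tau; optionally caps |I| at r_max.
    """
    order = np.argsort(h)                    # ascending: least informed first
    bound = 2.0 * np.cumsum(h[order])        # bound when discarding the g smallest
    g = int(np.searchsorted(bound, tau, side="right"))
    r = h.size - g                           # r(tau) selected coordinates
    if r_max is not None:
        r = min(r, r_max)
    return np.sort(order[h.size - r:])       # indices of the r largest h_i
```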

2.3 Estimating the diagnostic

The definition of the diagnostic \(h_i\) in (13) involves an expectation over the posterior, which is expensive to compute. In practice, an approximate posterior expectation can still yield a useful estimate with which efficient dimension reduction via coordinate selection can be achieved. In particular, for posteriors with a linear-Gaussian likelihood, we present a detailed methodology for estimating h in section 4. To motivate the estimation of the diagnostic (13) also in the general case, we provide the following straightforward result.

Proposition 3

Let \(\widetilde{\pi }\) be any approximation to \(\pi \) for which there exist \(0<\alpha \le \beta <\infty \) such that

$$\begin{aligned} \alpha \widetilde{\pi }(x) \le \pi (x) \le \beta \tilde{\pi }(x), \end{aligned}$$
(15)

for all x. Then, the approximate diagnostic

$$\begin{aligned} \tilde{h}_i = \frac{1}{\delta _i^2} \int _{\mathbb {R}^d} (\partial _i \log \mathcal {L}(x))^2 \tilde{\pi }(x) \textrm{d}x \end{aligned}$$
(16)

satisfies \(\alpha \tilde{h}_i \le h_i \le \beta \tilde{h}_i\). In particular, we have \( h_i = 0 ~\Leftrightarrow ~ \tilde{h}_i = 0. \)

In the general case of a nonlinear non-Gaussian likelihood, importance sampling may be a feasible way to estimate h. Accordingly, we point to Cui and Tong (2021), theorem 4.2, for statistical estimates of the error in the upper bound (12), and to Cui and Tong (2021), proposition 4.5, for bounds on the sampling variance of importance sampling estimates. Finally, Cui and Tong (2021) also discusses adaptive sampling schemes where h is constructed iteratively.

We conclude this section with algorithm 1, which outlines the general procedure of our certified coordinate selection method.

Algorithm 1: General procedure of the certified coordinate selection method

3 Sampling algorithms

In this section, we propose two algorithms for drawing samples from the optimal approximate and the exact posterior, respectively.

3.1 Sampling the approximate posterior

The product-form of the prior allows us to write the optimal approximate posterior as

$$\begin{aligned} \widetilde{\pi }^*(x) \propto \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}^c}}) \propto \widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}^c}}), \end{aligned}$$

and thus naturally suggests a simple sampling scheme where the main sampling effort is concentrated on the selected coordinates \(x_{_{\mathcal {I}}}\) Cui and Zahm (2021). The sampling method consists in first drawing samples \(\{x_{_{\mathcal {I}}}^{(i)}\}_{i=1}^{N}\) from the low-dimensional density \(\widetilde{\pi }^*(x_{_{\mathcal {I}}})\) using an MCMC method. Then, for each sample \(x_{_{\mathcal {I}}}^{(i)}\), we draw a sample \(x_{_{\mathcal {I}^c}}^{(i)}\) from the marginal prior \(\pi _0(x_{_{\mathcal {I}^c}})\). In the end, reassembling \(x^{(i)}=(x_{_{\mathcal {I}}}^{(i)}, x_{_{\mathcal {I}^c}}^{(i)})\) yields samples from the optimal approximate posterior \(\widetilde{\pi }^*(x)\). We summarize this procedure in algorithm 2.

Algorithm 2: Sampling scheme for the approximate posterior

In practice, the optimal reduced likelihood \(\widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}})\) (9) must be approximated to enable sampling from \(\widetilde{\pi }^*(x_{_{\mathcal {I}}})\). Since we expect the likelihood to be mostly flat in the directions of the non-selected coordinates, a natural approach is to fix \(x_{_{\mathcal {I}^c}}\) in (9) to the prior mean \(\mu _0=0\). Then, we obtain the approximation \(\widetilde{\mathcal {L}}^*(x_{_{\mathcal {I}}}) \approx \mathcal {L}(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}=0)\), which is computationally cheap while giving satisfactory results, as shown in the numerical examples of Zahm et al. (2022).
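
The following sketch illustrates algorithm 2 for the product-form Laplace prior; the MCMC sampler for the reduced density is assumed to be supplied by the user, and all names are illustrative.

```python
import numpy as np

def sample_approximate_posterior(reduced_mcmc, idx_sel, delta, n_samples, d, seed=0):
    """Sketch of algorithm 2: MCMC on the selected block, prior draws on the rest.

    reduced_mcmc(n) -- user-supplied sampler returning n samples of x_I from
                       the (approximated) reduced density  L~*(x_I) pi_0(x_I).
    delta           -- Laplace rate parameters, so the prior scale is 1/delta_i.
    """
    rng = np.random.default_rng(seed)
    idx_comp = np.setdiff1d(np.arange(d), idx_sel)
    x = np.empty((n_samples, d))
    x[:, idx_sel] = reduced_mcmc(n_samples)
    # complementary coordinates: independent draws from the marginal Laplace prior
    x[:, idx_comp] = rng.laplace(scale=1.0 / delta[idx_comp],
                                 size=(n_samples, idx_comp.size))
    return x
```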

3.2 Sampling the exact posterior via the pseudo-marginal MCMC algorithm

The marginal of the quasi-optimal posterior \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}}) = \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}) \pi _0(x_{_{\mathcal {I}}})\) satisfies \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})=\pi (x_{_{\mathcal {I}}})\). Using this fact, a pseudo-marginal MCMC algorithm Cui and Zahm (2021); Andrieu and Roberts (2009) can be employed, which, in combination with a so-called recycling step, samples the exact posterior. Note that the bound in (12) allows us to control the error of this quasi-optimal posterior approximation and therefore provides theoretical justification for using this sampling algorithm.

Algorithm 3: Pseudo-marginal MCMC

We outline the pseudo-marginal MCMC algorithm for the i-th iteration in the following. Given the state \(x^{(i-1)} = (x_{_{\mathcal {I}}}^{(i-1)}, x_{_{\mathcal {I}^c}}^{(i-1)})\), a candidate \(z_{_{\mathcal {I}}}^{(i)}\) is drawn from a proposal distribution \(q(\cdot |x_{_{\mathcal {I}}}^{(i-1)})\), which targets \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})\). Then, the quasi-optimal reduced likelihood \(\widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)})\) is approximated with M freshly drawn samples \(\{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M \sim \pi _0(z_{_{\mathcal {I}^c}})\) as

$$\begin{aligned} \widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)}) \approx \frac{1}{M} \sum _{j=1}^M \mathcal {L}(z_{_{\mathcal {I}}}^{(i)},z_{_{\mathcal {I}^c}}^{(i,j)}). \end{aligned}$$
(17)

Thus, we obtain a set of candidate samples \(\{z_{_{\mathcal {I}}}^{(i)}, \{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M \}\), which is accepted with probability

$$\begin{aligned} \alpha = \min \left\{ 1, \frac{ \pi _0(z_{_{\mathcal {I}}}^{(i)}) \widetilde{\mathcal {L}}^\dagger (z_{_{\mathcal {I}}}^{(i)}) q(x_{_{\mathcal {I}}}^{(i-1)}|z_{_{\mathcal {I}}}^{(i)}) }{\pi _0(x_{_{\mathcal {I}}}^{(i-1)}) \widetilde{\mathcal {L}}^\dagger (x_{_{\mathcal {I}}}^{(i-1)}) q(z_{_{\mathcal {I}}}^{(i)}|x_{_{\mathcal {I}}}^{(i-1)})} \right\} . \end{aligned}$$
(18)

At this point, the samples \(x_{_{\mathcal {I}}}^{(i)}\) follow \(\widetilde{\pi }^\dagger (x_{_{\mathcal {I}}})=\pi (x_{_{\mathcal {I}}})\). Now, to obtain samples \(x_{_{\mathcal {I}^c}}^{(i)}\) from \(\pi (x_{_{\mathcal {I}^c}})\), we can use the following recycling step. We select \(x_{_{\mathcal {I}^c}}^{(i)}\) from \(\{ z_{_{\mathcal {I}^c}}^{(i,j)} \}_{j=1}^M\) according to the discrete probability

$$\begin{aligned} \mathbb {P}\left( X_{_{\mathcal {I}^c}}^{(i)}=z_{_{\mathcal {I}^c}}^{(i,j)} \,\Big |\, x_{_{\mathcal {I}}}^{(i)}, \{z_{_{\mathcal {I}^c}}^{(i,j)}\}_{j=1}^M \right) = \frac{ \mathcal {L}\left( x_{_{\mathcal {I}}}^{(i)}, z_{_{\mathcal {I}^c}}^{(i,j)} \right) }{ \sum _{k=1}^M \mathcal {L}\left( x_{_{\mathcal {I}}}^{(i)}, z_{_{\mathcal {I}^c}}^{(i,k)} \right) }. \end{aligned}$$
(19)

We summarize the pseudo-marginal MCMC algorithm in algorithm 3.
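
A single iteration of this scheme, including the Monte Carlo estimate (17), the acceptance ratio (18), and the recycling step (19), could be sketched as follows; all function names are placeholders, and in practice one would work with log-likelihoods to avoid underflow.

```python
import numpy as np

def pm_mcmc_step(x_sel, x_comp, L_hat, likelihood, log_prior_sel,
                 propose, log_q, sample_prior_comp, M, rng):
    """One pseudo-marginal iteration with recycling (sketch of algorithm 3).

    L_hat       -- Monte Carlo estimate of L~dagger at the current x_sel
    log_q(a, b) -- log proposal density q(a | b)
    """
    z_sel = propose(x_sel)                               # candidate for the selected block
    z_comp = sample_prior_comp(M)                        # M fresh prior samples, eq. (17)
    L_vals = np.array([likelihood(z_sel, zc) for zc in z_comp])
    L_hat_new = L_vals.mean()                            # estimate of L~dagger(z_sel)
    log_alpha = (log_prior_sel(z_sel) + np.log(L_hat_new) + log_q(x_sel, z_sel)
                 - log_prior_sel(x_sel) - np.log(L_hat) - log_q(z_sel, x_sel))
    if np.log(rng.uniform()) < log_alpha:                # acceptance step, eq. (18)
        j = rng.choice(M, p=L_vals / L_vals.sum())       # recycling step, eq. (19)
        return z_sel, z_comp[j], L_hat_new
    return x_sel, x_comp, L_hat
```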

4 Methodology for linear-Gaussian likelihood

In this section we describe a detailed application of the certified coordinate selection method for a posterior density in the form

$$\begin{aligned} \pi (x) \propto \exp {\left( -\frac{1}{2} \Vert y - Ax \Vert _{\Sigma _{\textrm{obs}}^{-1}}^2 - \sum _{i=1}^d \delta _i |x_i| \right) }, \end{aligned}$$
(20)

where the noise follows the Gaussian distribution \(\mathcal {N}(0,\Sigma _{\textrm{obs}})\).

Note that to obtain h in (13) we need to compute an expectation over the posterior density. The next lemma shows that, for a linear-Gaussian likelihood, the diagnostic h admits a closed-form expression involving only the posterior mean and the posterior covariance.

Lemma 4

Let \(\mathcal {L}(x) \propto \exp (-\frac{1}{2} \Vert y - A x \Vert _{\Sigma _{\textrm{obs}}^{-1}}^2 )\) where \(\Sigma _{\textrm{obs}}\in \mathbb {R}^{m\times m}\) is positive definite and assume the mean \(\mu \) and the covariance \(\Sigma \) of the probability density \(\pi (x) \propto \mathcal {L}(x) \pi _0(x)\) exist. Then we can compute the diagnostic h as

$$\begin{aligned} h =&~ \Lambda ( {\text {diag}}\left( A^\intercal \Sigma _{\textrm{obs}}^{-1}A \Sigma A^\intercal \Sigma _{\textrm{obs}}^{-1}A\right) \nonumber \\&+ ( A^\intercal \Sigma _{\textrm{obs}}^{-1}( y - A \mu ) )^{\circ 2} ), \end{aligned}$$
(21)

where \((\cdot )^{\circ 2}\) denotes entry-wise square and \(\Lambda ={\text {diag}}\left( 1/\delta _1^2,\dots ,1/\delta _d^2\right) \).

Proof

See section A.2.2.

If the posterior mean and the posterior covariance are unknown, a first and intuitive choice is to replace them respectively by the prior mean \(\mu _0=0\) and the prior covariance \(\Sigma _0 = 2\Lambda \). This yields

$$\begin{aligned} \tilde{h}_{\textrm{prior}} =&~ 2 {\text {diag}}\left( (\Lambda ^{1/2} A^\intercal \Sigma _{\textrm{obs}}^{-1}A \Lambda ^{1/2})^2 \right) \nonumber \\&+\Lambda ( A^\intercal \Sigma _{\textrm{obs}}^{-1}y )^{\circ 2}. \end{aligned}$$
(22)
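
A direct NumPy implementation of (21) could look as follows; plugging in the prior mean and covariance recovers the prior-informed estimate (22). The function name and argument layout are illustrative.

```python
import numpy as np

def diagnostic_linear_gaussian(A, Sigma_obs_inv, y, delta, mu, Sigma):
    """Diagnostic h of eq. (21) for a linear-Gaussian likelihood.

    mu, Sigma -- (approximations of) the posterior mean and covariance.
    """
    Lam = 1.0 / delta**2                                 # diagonal of Lambda
    B = A.T @ Sigma_obs_inv @ A                          # A^T Sigma_obs^{-1} A
    quad = np.einsum("ij,ji->i", B @ Sigma, B)           # diag(B Sigma B)
    residual = A.T @ Sigma_obs_inv @ (y - A @ mu)
    return Lam * (quad + residual**2)

# prior-informed estimate (22): mu_0 = 0 and Sigma_0 = 2 * Lambda
# h_prior = diagnostic_linear_gaussian(A, Sigma_obs_inv, y, delta,
#                                      np.zeros(A.shape[1]), np.diag(2.0 / delta**2))
```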

In the following, we show how to obtain a more precise estimate of the diagnostic than the prior-informed estimate (22). To this end, we employ a Gaussian approximation at the maximum-a-posteriori (MAP) estimate. Note that the negative logarithm of (20) is strictly convex and that its minimizer, i.e., the MAP-estimate, can be computed efficiently via convex optimization toolboxes even for high-dimensional problems.

4.1 MAP-approximated diagnostic

The density in (20) is unimodal and differs from a Gaussian only in that the norm in the prior is \(l^1\) instead of \(l^2\). This motivates us to employ a common strategy in Bayesian inversion where the posterior density is approximated by a Gaussian centered at the MAP-estimate \(x_\textrm{MAP}\) (e.g., Murphy (2012)). That is, we estimate the mean as \(\mu \approx x_\textrm{MAP}\), and the covariance matrix as

$$\begin{aligned} \Sigma ^{-1} \approx H :=-\nabla ^2 \log \pi (x_\textrm{MAP}). \end{aligned}$$
(23)

Plugging in \(x_\textrm{MAP}\) for \(\mu \) and \(H^{-1}\) for \(\Sigma \) in (21) we obtain

$$\begin{aligned} \tilde{h}_\textrm{MAP} =&~ \Lambda ( {\text {diag}}\left( A^\intercal \Sigma _{\textrm{obs}}^{-1}A H^{-1} A^\intercal \Sigma _{\textrm{obs}}^{-1}A \right) \nonumber \\&+ (A^\intercal \Sigma _{\textrm{obs}}^{-1}( y - A x_\textrm{MAP}) )^{\circ 2} ). \end{aligned}$$
(24)

The non-differentiability of \(|\cdot |\) poses an obstacle in computing (23). Inspired by Vogel (2002), we use the approximation

$$\begin{aligned} |x| \approx \sqrt{x^2+\varepsilon }, \end{aligned}$$
(25)

where we can control the amount of smoothing around 0 with \(0<\varepsilon \ll 1\). With this, we obtain

$$\begin{aligned} H = A^\intercal \Sigma _{\textrm{obs}}^{-1}A + {\text {diag}}\left( \delta _i \varepsilon \left( \sqrt{{x_\textrm{MAP,i}}^2 +\varepsilon } \right) ^{-3}\right) , \end{aligned}$$
(26)

where we now use \({\text {diag}}\left( \cdot \right) \) to describe a diagonal matrix with diagonal given by the vector argument.

The question remains how to choose \(\varepsilon \). It appears natural to choose \(\varepsilon \) very small to obtain a good approximation to the absolute value. However, we have

$$\begin{aligned} \delta _i \varepsilon \left( \sqrt{{x_\textrm{MAP,i}}^2 +\varepsilon } \right) ^{-3} \overset{x_\textrm{MAP,i}\rightarrow 0}{\rightarrow } \frac{\delta _i}{\sqrt{\varepsilon }} \overset{\varepsilon \rightarrow 0}{\rightarrow } \infty . \end{aligned}$$

Since we expect \(x_\textrm{MAP,i}\approx 0\) for many coordinates, \(\varepsilon \) must not be chosen too small; otherwise, rapid, nearly non-smooth changes among the elements of H lead to numerical instabilities when computing its inverse, which is needed for (24). Hence, we set \(\varepsilon \) according to the following heuristic.

Observe that (26) resembles the inverse of the covariance matrix of a Gaussian posterior density constructed from a linear-Gaussian likelihood and a Gaussian prior. In this light, \(\delta _i^{-1} \varepsilon ^{-1} (\sqrt{{x_\textrm{MAP,i}}^2 +\varepsilon } )^{3}\) represents the variance of the i-th component. Our heuristic rule is that these variances should be at least as large as the smallest variance of the chosen Laplace prior. Therefore, we require

$$\begin{aligned} \min _{x_\textrm{MAP,i}} \Vert \delta \Vert _\infty ^{-1} \varepsilon ^{-1} \left( \sqrt{{x_\textrm{MAP,i}}^2 +\varepsilon } \right) ^{3} \ge \frac{2}{\Vert \delta \Vert _\infty ^2}. \end{aligned}$$

We can assume that \(\min _{i=1,\dots ,d} x_\textrm{MAP,i}= 0\) such that we obtain

$$\begin{aligned} \varepsilon \ge 4/\Vert \delta \Vert _\infty ^2. \end{aligned}$$
(27)
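
Putting (24), (26) and the heuristic (27) together, the MAP-approximated diagnostic can be computed roughly as in the following sketch (dense linear algebra for readability; names are illustrative).

```python
import numpy as np

def map_diagnostic(A, Sigma_obs_inv, y, delta, x_map):
    """MAP-approximated diagnostic h_MAP of eq. (24)."""
    eps = 4.0 / np.max(delta)**2                             # heuristic rule (27), with equality
    B = A.T @ Sigma_obs_inv @ A
    prior_curv = delta * eps / np.sqrt(x_map**2 + eps)**3    # smoothed prior Hessian, eq. (26)
    H = B + np.diag(prior_curv)
    H_inv = np.linalg.inv(H)                                 # also reusable as MALA preconditioner
    Lam = 1.0 / delta**2
    quad = np.einsum("ij,ji->i", B @ H_inv, B)               # diag(B H^{-1} B)
    residual = A.T @ Sigma_obs_inv @ (y - A @ x_map)
    return Lam * (quad + residual**2), H_inv
```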

4.2 Preconditioned MALA

The approximation in (25) enables employing MALA, since it allows for the computation of approximate gradients. MALA is derived by discretizing a Langevin diffusion equation and steers the sampling process by using gradient information of the log-target density Robert and Casella (2004). While different algorithms have been developed to sample non-smooth log-densities (see, e.g., Lau et al. (2023)), the proposed smoothing is simple to implement and computationally cheap. Moreover, combined with the Metropolis step, we obtain convergence to the target density.

We note that for the approximation of the prior gradient in the MALA implementation,

$$\begin{aligned} \nabla \log \pi _0\approx -\delta \frac{x}{\sqrt{x^2+\varepsilon }}, \end{aligned}$$
(28)

the smoothing parameter \(\varepsilon \) can be different from the one in the heuristic rule (27). In fact, for the MALA-proposal, \(\varepsilon \) should satisfy \(0<\varepsilon \ll 1\) such that proposals computed from a state near 0 are close to the exact discretized Langevin diffusion. We note in our experiments that even for relatively large choices of \(\varepsilon \), e.g., using (27), MCMC chains appear to have converged. However, the larger \(\varepsilon \) in (28), the more concentrated the samples are around 0. Hence, we choose \(\varepsilon =10^{-8}\) in our numerical experiments.

Finally, we remark that if H (26) is invertible, the inverse \(H^{-1}\) can be used as a preconditioner to make the MALA-proposal more efficient in high dimensions Martin et al. (2012); Petra et al. (2014). Furthermore, in the pseudo-marginal MCMC algorithm 3, a preconditioner for the local MALA-proposal for the update of \(x_{_{\mathcal {I}}}\) (line 2) can be obtained by projecting \(H^{-1}\) onto the selected coordinates.
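
To make the proposal concrete, the sketch below combines the smoothed prior gradient (28) with a diagonally preconditioned MALA step; the accept/reject step (not shown) uses the corresponding Gaussian proposal densities. Names are illustrative.

```python
import numpy as np

def smoothed_grad_log_prior(x, delta, eps=1e-8):
    """Approximate gradient (28) of the log Laplace prior."""
    return -delta * x / np.sqrt(x**2 + eps)

def preconditioned_mala_proposal(x, grad_log_post, step, precond_diag, rng):
    """One preconditioned MALA proposal; precond_diag is e.g. diag(H^{-1})."""
    drift = 0.5 * step * precond_diag * grad_log_post(x)
    noise = np.sqrt(step * precond_diag) * rng.standard_normal(x.size)
    return x + drift + noise
```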

5 Numerical experiments

In this section, we illustrate the performance of our methods in two different applications: a 1D deblurring problem and a 2D super-resolution problem. Algorithm 2 and algorithm 3 are used to obtain samples from the optimal approximate and exact posterior, respectively.

We compute 10 independent MCMC chains in each sampling experiment. This allows us to assess the variability of the sample statistics and to check the convergence of the MCMC chains by means of the estimated potential scale reduction (EPSR) statistic Gelman and Rubin (1992), denoted by \(\hat{R}\). In the following, we obtain \(\hat{R}<1.1\) in each sampling experiment, and thus assume that all our MCMC chains have converged.

We use the Python package arviz Kumar et al. (2019) to compute \(\hat{R}\). The same package is also used to obtain the effective sample size (ESS) and the credible interval (CI) (see, e.g., Murphy (2012) for definitions).

Following the arguments outlined in section 4.1 and section 4.2, we use the heuristic rule (27) to choose \(\varepsilon \) whenever computing \(\tilde{h}_\textrm{MAP}\), and set \(\varepsilon =10^{-8}\) in the approximated gradient of our MALA-proposals.

5.1 1D signal deblurring

The main purpose of this experiment is to demonstrate the applicability of our diagnostic when performing the coordinate splitting. Additionally, we show results when using the pseudo-marginal MCMC algorithm 3 to sample the exact posterior, and when sampling from the optimal approximate posterior (algorithm 2).

5.1.1 Problem description

The data y is obtained artificially via

$$\begin{aligned} y=G s_\textrm{true}+e, \end{aligned}$$

where \(s_{\textrm{true}} \in \mathbb {R}^{1024}\) denotes the piece-wise constant ground truth, G is a Gaussian blur operator with kernel width 27 and standard deviation 3, and \(e \in \mathbb {R}^{1024}\) is a realization from \(\mathcal {N}(0, \sigma _\textrm{obs}^2)\) with \(\sigma _\textrm{obs}=0.03\). The true signal and the data are shown in Fig. 1.

Fig. 1: True signal and data of the 1D example

We employ a 10-level Haar wavelet transform with periodic boundary conditions and formulate the Bayesian inverse problem in the coefficient domain. Let W denote the wavelet synthesis operator (inverse discrete wavelet transform) and \(W^\dagger \) the wavelet analysis operator (discrete wavelet transform). The true coefficients \(x_\textrm{true}=W^\dagger s_\textrm{true}\) are sparse with \(\Vert x_\textrm{true}\Vert _0=60\), see Fig. 2. The posterior density formulated with respect to the coefficients reads

$$\begin{aligned} \pi (x) \propto \exp {\left( -\frac{1}{2 \sigma _\textrm{obs}^2} \Vert y - G W x \Vert _2^2 - \sum _{i=1}^{1024} \delta _i |x_i | \right) }. \end{aligned}$$
(29)

Thus, following our previous notation for the forward operator, we have \(A=GW\). We use the Python package pywt Lee et al. (2019) to compute the discrete wavelet transforms.

To take the different scales of the wavelet coefficients into account, we choose a different rate parameter for each level of the wavelet basis. We define

$$\begin{aligned} \delta _i=c 2^{ \tfrac{\ell (i)}{2}}, \end{aligned}$$
(30)

for some \(c>0\), where \(\ell (i) \in \{1,\dots ,10\}\) denotes the level of the i-th wavelet coefficient. We plot \(\delta _i\) for \(c=1\) in Fig. 2. Note that for \(c=1\), the prior in (29) on x corresponds to a Besov-\(\mathcal {B}_{11}^1\) prior Kolehmainen et al. (2012); Lassas and Siltanen (2009) on the signal s.
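
For illustration, the level-dependent rates (30) can be assembled from the coefficient blocks returned by pywt; the assignment of levels (coarsest detail block gets \(\ell =1\), finest details \(\ell =10\), with the approximation block treated as the coarsest level) is our assumption for this sketch.

```python
import numpy as np
import pywt

def level_rates(signal_length=1024, wavelet="haar", n_levels=10, c=1.0):
    """Rate parameters delta_i = c * 2^{l(i)/2}, eq. (30), per wavelet coefficient."""
    dummy = np.zeros(signal_length)
    coeffs = pywt.wavedec(dummy, wavelet, level=n_levels, mode="periodization")
    levels = [1] * len(coeffs[0])                       # approximation block (assumed l = 1)
    for lev, cD in enumerate(coeffs[1:], start=1):      # detail blocks, coarse -> fine
        levels += [lev] * len(cD)
    return c * 2.0 ** (np.array(levels) / 2.0)
```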

In the following, we investigate the estimation of the diagnostic and the performance of the pseudo-marginal MCMC algorithm 3 while varying the global parameter c. Note that the prior becomes tighter with larger c. We consider the cases \(c\in \{1,5,25\}\).

Fig. 2: Scatters (left y-axis): Haar wavelet coefficients of the true signal. Solid line (right y-axis): rate parameters computed via (30) with \(c=1\)

5.1.2 Bound on the Hellinger distance

In this section, we proceed as follows for each \(c\in \{1,5,25\}\). We compute a MAP-approximation \(\tilde{h}_\textrm{MAP}\) via (24) and a prior-approximation \(\tilde{h}_\textrm{prior}\) via (22). Then, we compare the bound (12) obtained via \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) to a reference bound, which we compute via a reference diagnostic \(\tilde{h}_\textrm{ref}\). Moreover, we compare the order of coordinates suggested by \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) to the reference order given by \(\tilde{h}_\textrm{ref}\).

We obtain \(\tilde{h}_\textrm{ref}\) via the Monte Carlo approximation

$$\begin{aligned} \tilde{h}_\textrm{ref}=\frac{1}{N} \sum _{i=1}^N \nabla \log \mathcal {L}(x^{(i)})^{\circ 2}, \end{aligned}$$

where \(\{x^{(i)}\}_{i=1}^{N}\) are posterior samples computed with MALA on the full dimension. The samples are taken from all 10 MCMC chains, each comprising \(2\times 10^6\) samples, which do not include the burn-in period of \(10^5\) samples. Further, we only keep every 100-th sample to reduce correlation, resulting in a mean ESS of 500 over all chains and coordinates.

To compute \(\tilde{h}_\textrm{MAP}\), we require \(x_\textrm{MAP}\), which we obtain by using the convex optimization Python package cvxpy Diamond and Boyd (2016); Agrawal et al. (2018). We show \(W x_\textrm{MAP}\) in Fig. 3 and see that all estimates are close to \(s_\textrm{true}\).

Fig. 3: Black dotted line: true signal. Colored lines: \(W x_\textrm{MAP}\) for \(c\in \{1,5,25\}\)

The bounds on the Hellinger distance computed via \(\tilde{h}_\textrm{ref}\), \(\tilde{h}_\textrm{MAP}\), and \(\tilde{h}_\textrm{prior}\) for \(c\in \{1,5,25\}\) are shown in the left column of Fig. 4. The curves are generated by first sorting the diagnostics in ascending order and then plotting their cumulative sums. The vertical lines indicate the indices \(\{i|x_{\textrm{true},i}\ne 0\}\).

In the right column of Fig. 4, we illustrate the differences in ordering of coordinates suggested by \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\) in comparison to \(\tilde{h}_\textrm{ref}\) as follows. For \(1\le g \le d\), consider \(\mathcal {S}_a(g) = (s_1, s_2, \dots , s_g)\), \(s_i \in \{1,2,\dots ,d\}\), to be a tuple defining the ascending ordering of the g most important coordinates implied by some diagnostic a. That is, the least and most important coordinates according to \(\mathcal {S}_a(g)\) are the coordinates with indices \(s_1\) and \(s_g\), respectively. Then, given an approximate diagnostic \(\tilde{h}_a\), we can plot for each number of not selected coordinates, i.e., \(g=|\mathcal {I}^c|\), the fraction of correctly chosen coordinates with respect to the reference ordering, i.e., \(|\mathcal {S}_\textrm{ref}(g) \cap \mathcal {S}_a(g)|\, / g\).

Fig. 4: Left column: upper bounds computed via \(\tilde{h}_\textrm{MAP}\), \(\tilde{h}_\textrm{prior}\) and \(\tilde{h}_\textrm{ref}\). The bounds are computed by sorting the diagnostics in ascending order followed by a cumulative sum. The vertical lines indicate the indices \(\{i|x_{\textrm{true},i}\ne 0\} \). Right column: fractions of correctly chosen coordinates with respect to the reference ordering implied by \(\tilde{h}_\textrm{ref}\) over the number of not selected coordinates, for \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\)

The left column of Fig. 4 indicates that as the prior becomes tighter with larger c, the same bound can be retained while more coordinates are included in \(\mathcal {I}^c\), and consequently, a more efficient dimension reduction is possible. Moreover, the upper bounds show that \(\tilde{h}_\textrm{MAP}\) provides in general a similar bound over \(\mathcal {I}^c\) as \(\tilde{h}_\textrm{ref}\). However, for very tight priors, e.g., \(c=25\), all bounds are visually close to each other. The concentration of vertical lines on the right in all rows suggests that most of the indices \(\{i|x_{\textrm{true},i}\ne 0\} \) tend to be included in \(\mathcal {I}\) no matter which approximation to h is chosen. Recall that the diagnostic reveals the coordinates where the update from the prior to the posterior is most evident. Consequently, the remaining vertical lines scattered across the graph correspond to coordinates where the data cannot easily be distinguished from prior information.

From the right column of Fig. 4, it can be seen that about \(70\%\) of coordinates are always correctly selected for all \(c\in \{1,5,25\}\) and both approximate diagnostics. In line with the good visual agreement of the bounds obtained via \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{ref}\), their orderings are also similar: for all \(c\in \{1,5,25\}\), roughly \(95{-}100\%\) of coordinates are always selected correctly. The oscillatory behaviour of the graphs on the right is attributed to the large influence of individual correctly or incorrectly selected coordinates when the subsets of selected coordinates are relatively small.

5.1.3 Sampling the exact posterior

We sample the exact posterior (29) for \(c\in \{1,5,25\}\) with MALA on the full dimensional space, as well as with the pseudo-marginal MCMC algorithm 3 with MALA-proposals on the selected coordinates. In the following, we refer to these methods as ‘MALA’ and ‘PM-MALA’, respectively.

To compute \(\nabla \log \pi \) for the MALA-proposals, we use the approximated prior gradient (28). Furthermore, we require the adjoints of W and G for the likelihood gradient. We compute \(W^*\) with the technique from Folberth and Becker (2016), which involves handling the padding of the boundary conditions manually. Regarding the blurring operator, we have \(G^*=G\).

We obtain \(H^{-1}\) via (26) and use only its diagonal as a preconditioner for the MALA-proposals to save computational cost. For PM-MALA, where the MALA-proposals are employed to update \(x_{_{\mathcal {I}}}\), we project \(H^{-1}\) onto the selected coordinates and also use only its diagonal. We sample \(M=5\) vectors of \(x_{_{\mathcal {I}^c}}\) in each iteration, inspired by the numerical experiments in Cui and Zahm (2021).

We sample 10 independent chains for each configuration as follows. In total, we compute \(10^6\) samples for each chain and save every 50-th sample to decrease correlation. We use a burn-in period of \(10^5\) samples during which we adapt the step size to achieve a fixed acceptance rate. Following Roberts and Rosenthal (2001), we target an acceptance rate of 0.574 for x in the MALA runs and for \(x_{_{\mathcal {I}}}\) in the PM-MALA runs.

Note that we need to select enough coordinates in order to achieve a stable acceptance rate during the PM-MALA iterations. Based on some pilot runs and the MAP-approximated bounds in Fig. 4, we select \(n_\mathcal {I}\in \{726, 311, 127 \}\) for \(c\in \{1,5,25\}\), respectively. With these choices of \(n_\mathcal {I}\), we expect \(H^2\!\left( {\pi }, {\widetilde{\pi }^\dagger } \right) < 0.2\) for each \(c\in \{1,5,25\}\). In section 5.1.4, we sample the approximate posterior and check the expected bound numerically.

In Fig. 5 we show the \(99\%\) CI for \(c\in \{1,5,25\}\) in the signal space and observe that the CI becomes narrower with increasing c. At the same time, the reduced uncertainty due to large c enables more efficient pseudo-marginal MCMC sampling, since \(\mathcal {I}\) can be chosen smaller with increasing c whilst still obtaining good mixing. This can be seen in table 1, where the mixing in terms of ESS is significantly better for PM-MALA than for MALA on the full dimension. In particular, the ESS for coordinate indices in \(\mathcal {I}^c\) is close to the total number of samples, since the proposals for \(x_{_{\mathcal {I}^c}}\) are drawn independently. Table 1 also shows that running MALA on the reduced dimensional space tends to allow for larger step sizes, which contributes to improved mixing in \(x_{_{\mathcal {I}}}\).

Fig. 5: True signal and \(99\%\) CI for \(c\in \{1,5,25\}\) in signal space

Table 1 Comparison between MALA on the full dimensional space and MALA within the pseudo-marginal MCMC Algorithm 3 for the 1D example.

We note that depending on the complexity of the forward operator and the choice of M, the pseudo-marginal algorithm 3 may require significantly more runtime than a Metropolis-Hastings algorithm with the same proposal density in full dimensions. Recall that M corresponds to the number of forward model evaluations in each sampling step (algorithm 3, line 4). In our experiments, we set \(M=5\), and PM-MALA requires approximately the same runtime as MALA.

5.1.4 Sampling the approximate posterior

In this section, we use algorithm 2 to sample the optimal approximate posterior (10) for \(c\in \{1,5,25\}\). To this end, we employ a MALA-proposal to sample \(x_{_{\mathcal {I}}}\) in line 1. As outlined in section 3.1, we approximate the optimal reduced likelihood (9) by fixing the not selected coordinates to the prior mean, such that the optimal approximate posterior reads

$$\begin{aligned} \widetilde{\pi }(x) \propto \mathcal {L}(x_{_{\mathcal {I}}}, x_{_{\mathcal {I}^c}}=0) \pi _0(x). \end{aligned}$$
(31)

We select \(n_\mathcal {I}\in \{726, 311, 127 \}\) for \(c\in \{1,5,25\}\), respectively. With these choices of \(n_\mathcal {I}\), we expect \(H\!\left( {\pi }, {\widetilde{\pi }} \right) ^2 \le 0.2\) according to \(\tilde{h}_{\textrm{MAP}}\) in Fig. 4. An estimation of the Hellinger distance based on samples from \(\widetilde{\pi }\) would allow for assessing the quality of the optimal approximate posterior and for checking the tightness of our bound (12). However, computing a numerical estimate of the Hellinger distance based on samples is hard since it tends to be unstable due to the unknown normalizing constants of \(\pi \) and \(\widetilde{\pi }\).

Instead, based on samples from any approximation \(\widetilde{\pi }\), we can obtain a numerical estimate of another bound on the Hellinger distance, which is independent of the normalizing constants:

$$\begin{aligned} H\!\left( {\pi }, {\widetilde{\pi }} \right) ^2&\le 2 \int \left( \sqrt{ \frac{\rho (x)}{\tilde{\rho }(x)} } - 1 \right) ^2 \widetilde{\pi }(x) \textrm{d}x \end{aligned}$$
(32)
$$\begin{aligned}&\approx \frac{2}{N} \sum _{i=1}^N \left( \sqrt{ \frac{\rho ( x^{(i)} )}{\tilde{\rho }(x^{(i)})} } - 1 \right) ^2, \end{aligned}$$
(33)

where \(x^{(i)} \sim \widetilde{\pi }\), and \(\rho \) and \(\tilde{\rho }\) are the exact and the approximate unnormalized posterior densities, respectively. See section A.3 for the derivation of (32). While we can use this bound to assess the quality of the approximate posterior (31), it does not allow for any conclusions on the tightness of our bound (12), which we estimate through the MAP-approximated diagnostic.
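
A sample-based evaluation of (33) is straightforward once the unnormalized log-densities are available; the sketch below uses log-densities for numerical stability (names are illustrative).

```python
import numpy as np

def hellinger_bound_estimate(log_rho, log_rho_tilde, samples):
    """Monte Carlo estimate (33) of the bound (32) from samples of the approximation.

    log_rho, log_rho_tilde -- unnormalized log-densities of pi and pi~
    samples                -- draws x^(i) ~ pi~
    """
    log_ratio = np.array([log_rho(x) - log_rho_tilde(x) for x in samples])
    return 2.0 * np.mean((np.exp(0.5 * log_ratio) - 1.0) ** 2)
```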

As in the previous section, we compute a preconditioner for the MALA-proposals by projecting the diagonal of \(H^{-1}\) onto the selected coordinates. We again sample 10 independent chains of \(2 \times 10^6\) samples, with an additional burn-in period of \(10^5\) samples during which the step size is adapted to target an acceptance rate of 0.574. We thin the chains to decrease auto-correlation by keeping only every 100-th sample.

We show the sampling results in table 2. For each \(c\in \{1,5,25\}\), the sample-approximated upper bound is about half of the expected upper bound of 0.2, which we obtain via \(\tilde{h}_\textrm{MAP}\). This suggests that the optimal approximate posterior is relatively close to the exact posterior. Moreover, the standard deviation across the chains of the sample-approximated upper bound is small, indicating that the estimate (33) is indeed stable.

Table 2 Results of sampling the approximate posterior (31).

5.2 2D super-resolution microscopy

The purpose of this experiment is to show that our coordinate selection method works well in high dimensions and that the optimal approximate posterior can be used to perform efficient inference. The test problem is inspired by the application of stochastic optical reconstruction microscopy (STORM) from Zhu et al. (2012). A similar example was considered in the Bayesian context in Durmus et al. (2018). STORM is a super-resolution microscopy method based on single-molecule stochastic switching, where the goal is to detect molecule positions in live cell imaging. The images are obtained by a microscope detecting the photon count of the photoactivated fluorescent molecules.

5.2.1 Problem description

We consider a microscopic image \(y\in \mathbb {R}^m\), which is obtained from a 2D pixel-array by concatenation in the usual column-wise fashion. Here, we set \(m=32^2=1024\). In STORM, we want to estimate precise molecule positions by computing a super-resolution image \(x\in \mathbb {R}^{d}\). In this example, we set the oversampling ratio \(k=4\), which leads to \(d=mk^2=16 384\). Based on the kernel from the optical measurement instrument given in Zhu et al. (2012), we generate the forward operator \(A\in \mathbb {R}^{m\times d}\). The data y is obtained via

$$\begin{aligned} y= A x_\textrm{true} + e, \end{aligned}$$
(34)

where \(e \in \mathbb {R}^m\) is simulated from \(\mathcal {N}(0, \sigma _\textrm{obs}^2)\).

Similarly to Zhu et al. (2012), we generate the ground-truth image \(x_\textrm{true}\) for the high photon count case with 50 uniformly distributed molecules on a field of size \(4\,\upmu \textrm{m}\times 4\,\upmu \textrm{m}\). The intensity of each molecule is simulated from a lognormal distribution with mode 3000 and standard deviation 1700. In Fig. 6 we show the ground-truth image and the data, which is obtained according to (34) with \(\sigma _\textrm{obs}=30\).

We use a Laplace prior due to the sparsity of \(x_\textrm{true}\), which leads to the posterior density

$$\begin{aligned} \pi (x) \propto \exp {\left( -\frac{1}{2 \sigma _\textrm{obs}^2 } \Vert y - A x \Vert _2^2 - \delta \Vert x\Vert _1 \right) }. \end{aligned}$$
(35)

In Fig. 6 we also show the MAP-estimate with \(\delta =1.275\), where \(\delta \) is chosen based on the visual quality after some pilot runs.

Fig. 6: Left: data computed via (34). Center: truth in super-resolution. Right: MAP-estimate

5.2.2 Bound on the Hellinger distance

We use the MAP-approximation (24) to estimate the diagnostic and to compute the bound on the Hellinger distance, which we show in the left panel of Fig. 7. The MAP-approximation detects the coordinates of interest very accurately, which may be due to the good quality of the MAP-estimate. Although we obtain very large bounds on the Hellinger distance for this example, we can still employ the diagnostic to detect the most relevant coordinates to perform uncertainty quantification on the molecule positions.

Fig. 7: Left: upper bounds computed via \(\tilde{h}_\textrm{MAP}\) and \(\tilde{h}_\textrm{prior}\). The bounds are computed by sorting the diagnostics in ascending order followed by a cumulative sum. The vertical lines indicate the indices of the molecule positions. Center: true molecule positions (white scatters) and 1000 selected indices (red). Right: 99% CI in \(\log _{10}\)-scale for better visibility and true molecule positions (red scatters)

5.2.3 Sampling the approximate posterior and uncertainty quantification

We use the MAP-approximated diagnostic to select 1000 coordinates, which we show in the center of Fig. 7. The posterior density is again approximated as in (31). We sample the posterior by using the No-U-Turn sampler (NUTS) Hoffman and Gelman (xxxx) implemented in the Python package pyro Bingham et al. (2018). After sampling 10 independent chains with 30000 burn-in samples and 10000 posterior samples for each chain, we obtain converged chains with \(\max _i \hat{R}_i = 1.01\), and an averaged ESS (over all chains and components) equal to 168.

In this example, we cannot estimate the bound on the Hellinger distance via (32), since the ratio \(\frac{\rho ( x^{(i)} )}{\tilde{\rho }(x^{(i)})}\) computed with our samples \(\{x^{(i)}\}_{i=1}^N \sim \widetilde{\pi }(x)\) is unstable. However, as can be observed from the center plot in Fig. 7, our diagnostic is able to select the correct molecule coordinates and the relevant neighbourhoods around them. Therefore, we can still use the samples from \(\widetilde{\pi }(x)\) to perform uncertainty quantification on the photon intensities and on the true molecule positions as follows.

To illustrate the uncertainty in the intensity, we plot the \(99\%\) CI for the selected molecules in the right panel of Fig. 7. The large ranges in the CI can be attributed to the large ranges in photon intensity. Further, we observe that the approximate posterior tends to have larger CI at the true molecule positions, which are marked in red.

Now we estimate the uncertainty in the true molecule positions on the super-resolution grid by applying the following procedure. We select the pixels with the 50 largest posterior means as the detected molecule positions. Then, we move each of these posterior means pixel-wise vertically and horizontally until they leave the \(99\%\) CI of their neighbouring pixels. The corresponding distances reflect the uncertainty in the vertical and horizontal directions. The average, taken over both directions, amounts to 3.82 pixels, or \(118.4\,\textrm{nm}\). We note that this result is in agreement with the results in Zhu et al. (2012).
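
One plausible reading of this procedure, assuming the posterior mean image and the pixel-wise 99% CI bounds are available as 2D arrays, is sketched below; the helper is purely illustrative.

```python
import numpy as np

def position_uncertainty(mean_img, ci_low, ci_high, positions):
    """Average positional uncertainty (in pixels) for detected molecules.

    For each detected pixel, count how far its posterior mean can be moved
    horizontally/vertically while staying inside the neighbours' 99% CI.
    """
    nrow, ncol = mean_img.shape
    steps_all = []
    for r, c in positions:
        val = mean_img[r, c]
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            steps, rr, cc = 0, r + dr, c + dc
            while 0 <= rr < nrow and 0 <= cc < ncol and ci_low[rr, cc] <= val <= ci_high[rr, cc]:
                steps += 1
                rr, cc = rr + dr, cc + dc
            steps_all.append(steps)
    return float(np.mean(steps_all))
```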

6 Conclusions

We outlined a coordinate selection method for high-dimensional Bayesian inverse problems with a product-form Laplace prior. Inspired by the CDR methodology, we defined an approximate posterior density by replacing the likelihood with a ridge approximation. The ridge approximation is constructed such that it varies mainly on the coordinates which contribute most to the update from the prior to the posterior. Based on a bound on the Hellinger distance between the exact and a quasi-optimal approximate posterior, we then derived a diagnostic vector \(h \in \mathbb {R}^d\), which can be used to select the important coordinates.

After performing the coordinate selection, it is relatively easy to sample the approximate posterior. An additional advantage of our coordinate splitting is that advanced MCMC algorithms, such as delayed acceptance MCMC or pseudo-marginal MCMC, can be employed to sample the exact posterior.

The computation of h involves, however, integrating over the posterior density. For the case of a linear forward operator with additive Gaussian error, we presented a tractable methodology for estimating the diagnostic h before performing Bayesian inference via, e.g., MCMC methods.

The numerical results indicate that our methodology, which estimates the diagnostic based on a MAP-estimate, succeeds in revealing the most important coordinates. This enabled us to sample the approximate posterior very efficiently. Furthermore, the coordinate splitting allowed us to employ the pseudo-marginal MCMC algorithm to sample the exact posterior. Here our results show that the pseudo-marginal MCMC algorithm with MALA-proposals on the selected coordinates performs significantly better in terms of mixing of the sample chains when compared to MALA on the full-dimensional space.

Our methodology for estimating the diagnostic based on a MAP-estimate hinges on a smoothing approximation of the prior. This introduces an additional parameter \(\varepsilon \), which we fix following a heuristic rule. However, other ways of estimating the diagnostic should be investigated, not only for the case of a linear-Gaussian likelihood, but also for more general problems with a non-linear forward operator and/or a non-Gaussian likelihood.

Moreover, we approximate the optimal reduced likelihood by setting the non-selected coordinates to zero (the prior mean). While this approximation yields good results in the first example, it deteriorates in the second, high-dimensional example. Therefore, better approximations to the optimal reduced likelihood should be explored as well.